{"title": "On the Optimization Landscape of Tensor Decompositions", "book": "Advances in Neural Information Processing Systems", "page_first": 3653, "page_last": 3663, "abstract": "Non-convex optimization with local search heuristics has been widely used in machine learning, achieving many state-of-art results. It becomes increasingly important to understand why they can work for these NP-hard problems on typical data. The landscape of many objective functions in learning has been conjectured to have the geometric property that ``all local optima are (approximately) global optima'', and thus they can be solved efficiently by local search algorithms. However, establishing such property can be very difficult.   In this paper, we analyze the optimization landscape of the random over-complete  tensor decomposition problem, which has many applications in unsupervised leaning, especially in learning latent variable models. In practice, it can be efficiently solved by gradient ascent on a non-convex objective. We show that for any small constant $\\epsilon > 0$, among the set of points with function values $(1+\\epsilon)$-factor larger than the expectation of the function, all the local maxima are approximate global maxima. Previously, the best-known result only characterizes the geometry in small neighborhoods around the true components. Our result implies that even with an initialization that is barely better than the random guess, the gradient ascent algorithm is guaranteed to solve this problem.   Our main technique uses Kac-Rice formula and random matrix theory. 
To the best of our knowledge, this is the first time the Kac-Rice formula has been successfully applied to counting the number of local minima of a highly structured random polynomial with dependent coefficients.", "full_text": "On the Optimization Landscape of Tensor Decompositions\n\nRong Ge\n\nDuke University\n\nrongge@cs.duke.edu\n\nTengyu Ma\n\nFacebook AI Research\n\ntengyuma@cs.stanford.edu\n\nAbstract\n\nNon-convex optimization with local search heuristics has been widely used in machine learning, achieving many state-of-the-art results. It becomes increasingly important to understand why these heuristics can work for NP-hard problems on typical data. The landscape of many objective functions in learning has been conjectured to have the geometric property that \u201call local optima are (approximately) global optima\u201d, and thus such objectives can be solved efficiently by local search algorithms. However, establishing such a property can be very difficult.\n\nIn this paper, we analyze the optimization landscape of the random over-complete tensor decomposition problem, which has many applications in unsupervised learning, especially in learning latent variable models. In practice, it can be efficiently solved by gradient ascent on a non-convex objective. We show that for any small constant \epsilon > 0, among the set of points with function values a (1 + \epsilon)-factor larger than the expectation of the function, all the local maxima are approximate global maxima. Previously, the best-known result only characterized the geometry in small neighborhoods around the true components. Our result implies that even with an initialization that is barely better than a random guess, the gradient ascent algorithm is guaranteed to solve this problem.\n\nOur main technique uses the Kac-Rice formula and random matrix theory. 
To the best of our knowledge, this is the first time the Kac-Rice formula has been successfully applied to counting the number of local optima of a highly structured random polynomial with dependent coefficients.\n\n1 Introduction\n\nNon-convex optimization is the dominant algorithmic technique behind many state-of-the-art results in machine learning, computer vision, natural language processing and reinforcement learning. Local search algorithms through stochastic gradient methods are simple, scalable and easy to implement. Surprisingly, they also return high-quality solutions for practical problems like training deep neural networks, which are NP-hard in the worst case. It has been conjectured [DPG+14, CHM+15] that on typical data, the landscape of the training objectives has the nice geometric property that all local minima are (approximate) global minima. Such a property ensures that local search algorithms converge to a global minimum [GHJY15, LSJR16, NP06, SQW15]. However, establishing it for concrete problems can be challenging.\n\nDespite recent progress on understanding the optimization landscape of various machine learning problems (see [GHJY15, BBV16, BNS16, Kaw16, GLM16, HM16, HMR16] and references therein), a comprehensive answer remains elusive. Moreover, all previous techniques fundamentally rely on the spectral structure of the problems. For example, the analysis in [GLM16] allows us to pin down the set of the critical points (points with vanishing gradients) as approximate eigenvectors of some matrix. Among these eigenvectors we can further identify all the local minima. The heavy dependency on linear algebraic structure limits the generalization to problems with non-linearity (like neural networks).\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nTowards developing techniques beyond linear algebra, in this work, we investigate the optimization landscape of tensor decomposition problems. 
This is a clean non-convex optimization problem whose optimization landscape cannot be analyzed by the previous approach. It also connects to the training of neural networks, with which it shares many properties [NPOV15]. For example, in comparison with the matrix case, where all the global optima reside on a (connected) Grassmannian manifold, for both tensors and neural networks all the global optima are isolated from each other.\n\nBesides the technical motivations above, tensor decomposition itself is also the key algorithmic tool for learning many latent variable models, such as mixtures of Gaussians, hidden Markov models and dictionary learning [Cha96, MR06, HKZ12, AHK12, AFH+12, HK13], just to name a few. In practice, local search heuristics such as alternating least squares [CLA09], gradient descent and the power method [KM11] are popular and successful.\n\nTensor decomposition also connects to the learning of neural networks [GLM17, JSA15, CS16]. For example, the work [GLM17] shows that the objective of learning a one-hidden-layer network is implicitly decomposing a sequence of tensors with shared components, and uses the intuition from tensor decomposition to design better objective functions that provably recover the parameters under Gaussian inputs.\n\nConcretely, we consider decomposing a random 4-th order tensor T of rank n of the following form,\n\nT = \sum_{i=1}^{n} a_i \otimes a_i \otimes a_i \otimes a_i .\n\nWe are mainly interested in the over-complete regime where n \gg d. This setting is particularly challenging, but it is crucial for unsupervised learning applications where the hidden representations have higher dimension than the data [AGMM15, DLCC07]. Previous algorithmic results either require access to high order tensors [BCMV14, GVX13], or use complicated techniques such as FOOBI [DLCC07] or sum-of-squares relaxation [BKS15, GM15, HSSS16, MSS16]. In the worst case, most tensor problems are NP-hard [H\u00e5s90, HL13]. 
Therefore we work in the average case, where the vectors a_i \in R^d are assumed to be drawn i.i.d. from the Gaussian distribution N(0, I). We call the a_i's the components of the tensor. We are given the entries of the tensor T and our goal is to recover the components a_1, . . . , a_n.\n\nWe will analyze the following popular non-convex objective,\n\nmax f(x) = \sum_{(i,j,k,l) \in [d]^4} T_{i,j,k,l} x_i x_j x_k x_l = \sum_{i=1}^{n} \langle a_i, x \rangle^4    (1.1)\n\ns.t. \|x\| = 1.\n\nIt is known that for n \ll d^2, the global maxima of f are close to one of \pm\frac{1}{\sqrt{d}} a_1, . . . , \pm\frac{1}{\sqrt{d}} a_n. Previously, Ge et al. [GHJY15] show that for the orthogonal case, where n \le d and all the a_i's are orthogonal, the objective function f(\cdot) has only 2n local maxima, which are approximately \pm\frac{1}{\sqrt{d}} a_1, . . . , \pm\frac{1}{\sqrt{d}} a_n. However, the technique heavily uses the orthogonality of the components and does not generalize to the over-complete case.\n\nEmpirically, projected gradient ascent and power methods find one of the components a_i even if n is significantly larger than d. The local geometry for the over-complete case around the true components is known: in a small neighborhood of each of the \pm\frac{1}{\sqrt{d}} a_i's, there is a unique local maximum [AGJ15]. Algebraic geometry techniques [CS13, ASS15] can show that f(\cdot) has an exponential number of other critical points, while these techniques seem difficult to extend to the characterization of local maxima. It remains a major open question whether there are any other spurious local maxima that gradient ascent can potentially converge to.\n\nMain results. We show that there are no spurious local maxima in a large superlevel set that contains all the points with function values slightly larger than that of the random initialization.\n\nTheorem 1.1. 
Let \epsilon, \zeta \in (0, 1/3) be two arbitrary constants and d be sufficiently large. Suppose d^{1+\epsilon} < n < d^{2-\epsilon}. Then, with high probability over the randomness of the a_i's, we have that in the superlevel set\n\nL = \{ x \in S^{d-1} : f(x) \ge 3(1+\zeta)n \} ,    (1.2)\n\nthere are exactly 2n local maxima with function values (1 \pm o(1))d^2, each of which is \tilde{O}(\sqrt{n/d^3})-close to one of \pm\frac{1}{\sqrt{d}} a_1, . . . , \pm\frac{1}{\sqrt{d}} a_n.\n\nPreviously, the best known result [AGJ15] only characterizes the geometry in small neighborhoods around the true components, that is, there exists one local maximum in each of the small constant neighborhoods around each of the true components a_i. (It turns out in such neighborhoods, the objective function is actually convex.) We significantly enlarge this region to the superlevel set L, on which the function f is not convex and has an exponential number of saddle points, but still doesn't have any spurious local maximum.\n\nNote that a random initialization z on the unit sphere has expected function value E[f(z)] = 3n. Therefore the superlevel set L contains all points that have function values barely larger than that of the random guess. Hence, Theorem 1.1 implies that with a slightly better initialization than the random guess, gradient ascent and power method1 are guaranteed to find one of the components in polynomial time. (It is known that after finding one component, it can be peeled off from the tensor and the same algorithm can be repeated to find all other components.)\n\nCorollary 1.2. 
In the setting of Theorem 1.1, with high probability over the choice of the a_i's, we have that given any starting point x_0 that satisfies f(x_0) \ge 3(1+\zeta)n, stochastic projected gradient descent2 will find one of the \pm\frac{1}{\sqrt{d}} a_i's up to \tilde{O}(\sqrt{n/d^3}) Euclidean error in polynomial time.\n\nWe also strengthen Theorem 1.1 and Corollary 1.2 slightly (see Theorem 3.1): the same conclusion still holds with \zeta = O(\sqrt{d/n}), which is smaller than a constant. Note that the expected value of a random initialization is 3n, and we only require an initialization that is slightly better than a random guess in function value. We remark that a uniformly random point x on the unit sphere is not in the set L with high probability. It's an intriguing open question to characterize the landscape in the complement of the set L.\n\nWe also conjecture that from a random initialization, it suffices to use a constant number of projected gradient steps (with optimal step size) to achieve the function value 3(1+\zeta)n with \zeta = O(\sqrt{d/n}). This conjecture \u2014 an interesting question for future work \u2014 is based on the hypothesis that the first constant number of steps of gradient ascent can make similar improvements as the first step does (which is equal to c\sqrt{dn} for a universal constant c).\n\nAs a comparison, previous works such as [AGJ15] require an initialization with function value \Theta(d^2) \gg n. Anandkumar et al. [AGJ16] analyze the dynamics of the tensor power method with a delicate initialization that is independent of the randomness of the tensor. Thus it is not suitable for the situation where the initialization comes from the result of another algorithm, and it does not have a direct implication on the landscape of f(\cdot).\n\nWe note that the local maxima of f(\cdot) correspond to the robust eigenvectors of the tensor. 
Using this language, our theorem says that a robust eigenvector of an over-complete tensor with random components is either one of the true components or has a small correlation with the tensor, in the sense that \langle T, x^{\otimes 4} \rangle is small. This improves significantly upon the understanding of robust eigenvectors [ASS15] under an interesting random model.\n\nThe condition n > d^{1+\epsilon} is likely an artifact of our analysis. The under-complete case (n < d) can be proved by re-using the proof of [GHJY15], together with the observation that local optima are preserved by linear transformations. The intermediate regime d < n < d^{1+\epsilon} should be analyzable by the Kac-Rice formula using similar techniques, but our current proof cannot capture it directly. Since the proof in this paper is already involved, we leave this case to future work. The condition n < d^{2-\epsilon} matches the best over-completeness level that existing polynomial-time algorithms can handle [DLCC07, MSS16].\n\n1Power method is exactly equivalent to gradient ascent with a properly chosen finite learning rate.\n\n2We note that by stochastic gradient descent we mean the algorithm that is analyzed in [GHJY15]. To get a global maximum in polynomial time (polynomial in log(1/\epsilon) to get \epsilon precision), one also needs to slightly modify stochastic gradient descent in the following way: run SGD until 1/d accuracy and then switch to gradient descent. Since the problem is locally strongly convex, the local convergence is linear.\n\nOur techniques. The proof of Theorem 1.1 uses the Kac-Rice formula (see, e.g., [AT09]), which is based on a counting argument. To build up the intuition, we tentatively view the unit sphere as a collection of discrete points; then for each point x one can compute the probability (with respect to the randomness of the function) that x is a local maximum. Adding up all these probabilities gives us the expected number of local maxima. 
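The discrete counting intuition above can be sketched in one dimension (a toy illustration written for this note, not the construction used in the paper): view the circle as a fine grid and count the grid points that beat both neighbors, with a random trigonometric polynomial playing the role of the random objective.

```python
import numpy as np

rng = np.random.default_rng(0)

# A random degree-5 trigonometric polynomial on the circle [0, 2*pi),
# standing in for the random objective restricted to a 1-D sphere.
K = 5
c, s = rng.standard_normal(K + 1), rng.standard_normal(K + 1)

def f(t):
    k = np.arange(K + 1)
    return np.cos(np.outer(t, k)) @ c + np.sin(np.outer(t, k)) @ s

# "Discretized counting": a grid point is a local maximum if its value
# exceeds both neighbors (with wrap-around, since the domain is a circle).
t = np.linspace(0.0, 2.0 * np.pi, 20000, endpoint=False)
v = f(t)
is_max = (v > np.roll(v, 1)) & (v > np.roll(v, -1))
print("local maxima found:", int(is_max.sum()))
```

On a circle, the maxima and minima of a generic smooth function alternate, so the two counts agree; the Kac-Rice formula replaces this grid count by an integral that weights each point by the probability of being a critical point with negative curvature.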
In continuous space, such a counting argument has to be more delicate, since the local geometry needs to be taken into account. This is formalized by the Kac-Rice formula (see Lemma 2.2).\n\nHowever, the Kac-Rice formula only gives a closed-form expression that involves the integration of the expectation of some complicated random variable. It's often very challenging to simplify the expression to obtain interpretable results. Before our work, Auffinger et al. [AA\u02c7C13, AA+13] successfully applied the Kac-Rice formula to characterize the landscape of polynomials with random Gaussian coefficients. The exact expectation of the number of local minima can be computed there, because the Hessian of such a random polynomial is a Gaussian orthogonal ensemble, whose eigenvalue distribution is well understood, with a closed-form expression.\n\nOur technical contribution here is successfully applying the Kac-Rice formula to structured random non-convex functions where the formula cannot be exactly evaluated. The Hessian and gradient of f(\cdot) have much more complicated distributions compared to the Gaussian orthogonal ensemble. As a result, the Kac-Rice formula is difficult to evaluate exactly. We instead cut the space R^d into regions and use different techniques to estimate the number of local maxima in each. See the proof overview in Section 3. We believe our techniques can be extended to 3rd order tensors and can shed light on the analysis of other non-convex problems with structured randomness.\n\nOrganization. In Section 2 we introduce preliminaries regarding manifold optimization and the Kac-Rice formula. We give a detailed explanation of our proof strategy in Section 3. The technical details are deferred to the supplementary material. We also note that the supplementary material contains an extended version of the preliminary and proof overview sections below.\n\n2 Notations and Preliminaries\n\nWe use Id_d to denote the identity matrix of dimension d \times d. 
Let \|\cdot\| denote the spectral norm of a matrix or the Euclidean norm of a vector. Let \|\cdot\|_F denote the Frobenius norm of a matrix or a tensor.\n\nGradient, Hessian, and local maxima on a manifold. We have a constrained optimization problem over the unit sphere S^{d-1}, which is a smooth manifold. Thus we define local maxima with respect to the manifold. It's known that projected gradient descent for S^{d-1} behaves pretty much the same on the manifold as in the usual unconstrained setting [BAC16]. In the supplementary material we give a brief introduction to manifold optimization, and the definitions of gradient and Hessian. We refer the readers to the book [AMS07] for more background.\n\nHere we use grad f and Hess f to denote the gradient and the Hessian of f on the manifold S^{d-1}. We compute them in the following claim.\n\nClaim 2.1. Let f : S^{d-1} \to R be f(x) := \frac{1}{4} \sum_{i=1}^{n} \langle a_i, x \rangle^4. Let P_x = Id_d - xx^\top. Then the gradient and Hessian of f on the sphere can be written as,\n\ngrad f(x) = P_x \left( \sum_{i=1}^{n} \langle a_i, x \rangle^3 a_i \right) , \quad Hess f(x) = 3 \sum_{i=1}^{n} \langle a_i, x \rangle^2 P_x a_i a_i^\top P_x - \left( \sum_{i=1}^{n} \langle a_i, x \rangle^4 \right) P_x .\n\nA local maximum of a function f on the manifold S^{d-1} satisfies grad f(x) = 0 and Hess f(x) \preceq 0. Let M_f be the set of all local maxima, i.e., M_f = \{ x \in S^{d-1} : grad f(x) = 0, Hess f(x) \preceq 0 \}.\n\nKac-Rice formula. The Kac-Rice formula is a general tool for computing the expected number of special points on a manifold. Suppose there are two random functions P(\cdot) : R^d \to R^d and Q(\cdot) : R^d \to R^k, and an open set B in R^k. 
The formula counts the expected number of points x \in R^d that satisfy both P(x) = 0 and Q(x) \in B.\n\nSuppose we take P = \nabla f and Q = \nabla^2 f, and let B be the set of negative semidefinite matrices; then the set of points that satisfy P(x) = 0 and Q(x) \in B is the set of all local maxima M_f. Moreover, for any set Z \subset S^{d-1}, we can also augment Q by Q = [\nabla^2 f, x] and choose B = \{A : A \preceq 0\} \otimes Z. With this choice of P, Q, the Kac-Rice formula can count the number of local maxima inside the region Z. For simplicity, we will only introduce the Kac-Rice formula for this setting. We refer the readers to [AT09, Chapters 11 and 12] for more background.\n\nLemma 2.2 (informally stated). Let f be a random function defined on the unit sphere S^{d-1} and let Z \subset S^{d-1}. Under certain regularity conditions3 on f and Z, we have\n\nE[|M_f \cap Z|] = \int_x E[ |\det(Hess f)| \cdot 1(Hess f \preceq 0) 1(x \in Z) \mid grad f(x) = 0 ] \, p_{grad f(x)}(0) \, dx ,    (2.1)\n\nwhere dx is the usual surface measure on S^{d-1} and p_{grad f(x)}(0) is the density of grad f(x) at 0.\n\nFormula for the number of local maxima. In this subsection, we give a concrete formula for the number of local maxima of our objective function (1.1) inside the superlevel set L (defined in equation (1.2)). Taking Z = L in Lemma 2.2, it boils down to estimating the quantity on the right hand side of (2.1). We remark that for the particular function f defined in (1.1) and Z = L, the integrand in (2.1) doesn't depend on the choice of x. This is because for any x \in S^{d-1}, (Hess f, grad f, 1(x \in L)) has the same joint distribution, as characterized below:\n\nLemma 2.3. Let f be the random function defined in (1.1). Let \alpha_1, . . . , \alpha_n \sim N(0, 1) and b_1, . . . , b_n \sim N(0, Id_{d-1}) be independent Gaussian random variables. 
Let\n\nM = \|\alpha\|_4^4 \cdot Id_{d-1} - 3 \sum_{i=1}^{n} \alpha_i^2 b_i b_i^\top \quad and \quad g = \sum_{i=1}^{n} \alpha_i^3 b_i .    (2.2)\n\nThen we have that for any x \in S^{d-1}, (Hess f, grad f, f) has the same joint distribution as (-M, g, \|\alpha\|_4^4).\n\nUsing Lemma 2.2 (with Z = L) and Lemma 2.3, we derive the following formula for the expectation of the random variable |M_f \cap L|. Later we will use Lemma 2.2 slightly differently, with another choice of Z.\n\nLemma 2.4. Using the notation of Lemma 2.3, let p_g(\cdot) denote the density of g. Then,\n\nE[|M_f \cap L|] = Vol(S^{d-1}) \cdot E[ |\det(M)| 1(M \succeq 0) 1(\|\alpha\|_4^4 \ge 3(1+\zeta)n) \mid g = 0 ] \, p_g(0) .    (2.3)\n\n3 Proof Overview\n\nIn this section, we give a high-level overview of the proof of the main theorem. We will prove a slightly stronger version of Theorem 1.1. Let \gamma be a universal constant that is to be determined later. Define the set L_1 \subset S^{d-1} as in equation (3.1). Indeed, we see that L (defined in (1.2)) is a subset of L_1 when n \gg d. We prove that in L_1 there are exactly 2n local maxima.\n\nTheorem 3.1 (main). There exist universal constants \gamma, \beta such that the following holds: suppose d^2 / \log^{O(1)} d \ge n \ge \beta d \log^2 d and let L_1 be defined as in (3.1); then with high probability over the choice of a_1, . . . , a_n, the number of local maxima in L_1 is exactly 2n:\n\n|M_f \cap L_1| = 2n .    (3.2)\n\nMoreover, each local maximum in L_1 is \tilde{O}(\sqrt{n/d^3})-close to one of \pm\frac{1}{\sqrt{d}} a_1, . . . , \pm\frac{1}{\sqrt{d}} a_n.\n\nIn order to count the number of local maxima in L_1, we use the Kac-Rice formula (Lemma 2.4). Recall that the Kac-Rice formula gives an expression that involves the complicated expectation 
3See more details in [AT09, Theorem 12.1.1].\n\nL_1 := \{ x \in S^{d-1} : \sum_{i=1}^{n} \langle a_i, x \rangle^4 \ge 3n + \gamma \sqrt{nd} \} .    (3.1)\n\nE[ |\det(M)| 1(M \succeq 0) 1(\|\alpha\|_4^4 \ge 3(1+\zeta)n) \mid g = 0 ]. Here the difficulty is to deal with the determinant of a random matrix M (defined in Lemma 2.3), whose eigenvalue distribution does not admit an analytical form. Moreover, due to the presence of the conditioning and the indicator functions, it's almost impossible to compute the RHS of the Kac-Rice formula (equation (2.3)) exactly.\n\nLocal vs. global analysis: The key idea to proceed is to divide the superlevel set L_1 into two subsets,\n\nL_1 = (L_1 \cap L_2) \cup L_2^c , \quad where \quad L_2 := \{ x \in S^{d-1} : \forall i, \|P_x a_i\|^2 \ge (1-\delta)d \;\; and \;\; |\langle a_i, x \rangle|^2 \le \delta d \} .    (3.3)\n\nHere \delta is a sufficiently small universal constant that is to be chosen later. We also note that L_2^c \subset L_1 and hence L_1 = (L_1 \cap L_2) \cup L_2^c.\n\nIntuitively, the set L_1 \cap L_2 contains those points that do not have large correlation with any of the a_i's; the complement L_2^c is the union of the neighborhoods around each of the desired vectors \pm\frac{1}{\sqrt{d}} a_1, . . . , \pm\frac{1}{\sqrt{d}} a_n. We will refer to the first subset L_1 \cap L_2 as the global region, and refer to L_2^c as the local region.\n\nWe will compute the number of local maxima in the sets L_1 \cap L_2 and L_2^c separately, using different techniques. We will show that with high probability L_1 \cap L_2 contains no local maxima, using the Kac-Rice formula (see Theorem 3.2). Then, we show that L_2^c contains exactly 2n local maxima (see Theorem 3.3) using a different and more direct approach.\n\nGlobal analysis. 
The key benefit of having such a division into local and global regions is that for the global region, we can avoid evaluating the exact value of the RHS of the Kac-Rice formula. Instead, we only need an estimate: note that the number of local maxima in L_1 \cap L_2, namely |M_f \cap L_1 \cap L_2|, is a nonnegative integer random variable. Thus, if we can show that its expectation E[|M_f \cap L_1 \cap L_2|] is much smaller than 1, then Markov's inequality implies that with high probability, the number of local maxima will be exactly zero. Concretely, we will use Lemma 2.2 with Z = L_1 \cap L_2, and then estimate the resulting integral using various techniques from random matrix theory. This remains quite challenging even though we are only shooting for an estimate. Concretely, we obtain the following theorem.\n\nTheorem 3.2. Let the sets L_1, L_2 be defined as in equation (3.3) and n \ge \beta d \log^2 d. There exist a universal small constant \delta \in (0, 1), universal constants \gamma, \beta, and a high probability event G_0, such that the expected number of local maxima in L_1 \cap L_2 conditioned on G_0 is exponentially small:\n\nE[ |M_f \cap L_1 \cap L_2| \mid G_0 ] \le 2^{-d/2} .\n\nSee Section 3.1 for an overview of the analysis. The purpose and definition of G_0 are more technical and can be found in Section 3 of the supplementary material, around equations (3.3), (3.4) and (3.5). We also prove that G_0 is indeed a high probability event in the supplementary material.4\n\nLocal analysis. In the local region L_2^c, that is, the neighborhoods of a_1, . . . , a_n, we will show that there are exactly 2n local maxima. As argued above, it's almost impossible to get exact counts out of the Kac-Rice formula, since it's often hard to compute the complicated integral. Moreover, the Kac-Rice formula only gives the expected number but not high probability bounds. 
However, here the observation is that the local maxima (and critical points) in the local region are well structured. Thus, instead, we show that in these local regions, the gradient and Hessian of a point x are dominated by the terms corresponding to the components a_i that are highly correlated with x. The number of such terms cannot be very large (by a restricted isometry property; see Section B.5 of the supplementary material). As a result, we can characterize the possible local maxima explicitly, and eventually show that there is exactly one local maximum in each of the local neighborhoods around the \pm\frac{1}{\sqrt{d}} a_i's. A similar (but weaker) analysis was done before in [AGJ15]. We formalize the guarantee for the local regions in the following theorem, which is proved in Section 5 of the supplementary material. In Section 3.2 of the supplementary material, we also discuss the key ideas of the proof of this theorem.\n\nTheorem 3.3. Suppose (1/\delta^2) \cdot d \log d \le n \le d^2 / \log^{O(1)} d. Then, with high probability over the choice of a_1, . . . , a_n, we have\n\n|M_f \cap L_1 \cap L_2^c| = 2n .    (3.4)\n\nMoreover, each of the local maxima in L_1 \cap L_2^c is \tilde{O}(\sqrt{n/d^3})-close to one of \pm\frac{1}{\sqrt{d}} a_1, . . . , \pm\frac{1}{\sqrt{d}} a_n.\n\n4We note again that the supplementary material contains more details in each section, even for sections in the main text.\n\nThe main Theorem 3.1 is a direct consequence of Theorem 3.2 and Theorem 3.3. The formal proof can be found in Section 3 of the supplementary material. In the next subsections we sketch the basic ideas behind the proofs of Theorem 3.2 and Theorem 3.3. Theorem 3.2 is the crux of the technical part of the paper.\n\n3.1 Estimating the Kac-Rice formula for the global region\n\nThe general plan to prove Theorem 3.2 is to use random matrix theory to estimate the RHS of the Kac-Rice formula. We begin by applying the Kac-Rice formula to our situation. 
We note that we drop the effect of G_0 in all of the following discussion, since G_0 only affects some technicalities that appear in the details of the proof in the supplementary material.\n\nApplying the Kac-Rice formula. The first step in applying the Kac-Rice formula is to characterize the joint distribution of the gradient and the Hessian. We use the notation of Lemma 2.3 for expressing the joint distribution of (Hess f, grad f, 1(x \in L_1 \cap L_2)). For any fixed x \in S^{d-1}, let \alpha_i = \langle a_i, x \rangle and b_i = P_x a_i (where P_x = Id - xx^\top), and let M = \|\alpha\|_4^4 \cdot Id_{d-1} - 3 \sum_{i=1}^{n} \alpha_i^2 b_i b_i^\top and g = \sum_{i=1}^{n} \alpha_i^3 b_i, as defined in (2.2). Then (Hess f, grad f, 1(x \in L_1 \cap L_2)) has the same distribution as (-M, g, 1(E_1 \cap E_2 \cap E_2')), where E_1 corresponds to the event that x \in L_1,\n\nE_1 = \{ \|\alpha\|_4^4 \ge 3n + \gamma \sqrt{nd} \} ,\n\nand the events E_2 and E_2' correspond to the event that x \in L_2:\n\nE_2 = \{ \|\alpha\|_\infty^2 \le \delta d \} , \quad and \quad E_2' = \{ \forall i \in [n], \|b_i\|^2 \ge (1-\delta)d \} .\n\nWe separate them out to reflect that E_2 and E_2' depend on the randomness of the \alpha_i's and the b_i's respectively.\n\nUsing the Kac-Rice formula (Lemma 2.2 with Z = L_1 \cap L_2), we conclude that\n\nE[|M_f \cap L_1 \cap L_2|] = Vol(S^{d-1}) \cdot E[ |\det(M)| 1(M \succeq 0) 1(E_1 \cap E_2 \cap E_2') \mid g = 0 ] \, p_g(0) .    (3.5)\n\nNext, towards proving Theorem 3.2, we will estimate the RHS of (3.5) using various techniques.\n\nConditioning on \alpha. 
We observe that the distributions of the gradient g and the Hessian M on the RHS of equation (3.5) are fairly complicated. In particular, we need to deal with the interactions of the \alpha_i's (the components along x) and the b_i's (the components in the orthogonal subspace of x). Therefore, we use the law of total expectation to first condition on \alpha and take the expectation over the randomness of the b_i's, and then take the expectation over the \alpha_i's. Letting p_{g|\alpha} denote the density of g \mid \alpha, by the law of total expectation we have,\n\nE[ |\det(M)| 1(M \succeq 0) 1(E_1 \cap E_2 \cap E_2') \mid g = 0 ] \, p_g(0) = E[ \, E[ |\det(M)| 1(M \succeq 0) 1(E_2') \mid g = 0, \alpha ] \, 1(E_1) 1(E_2) \, p_{g|\alpha}(0) \, ] .    (3.6)\n\nNote that the inner expectation on the RHS of (3.6) is with respect to the randomness of the b_i's and the outer one is with respect to the \alpha_i's.\n\nFor notational convenience we define h(\cdot) : R^n \to R as\n\nh(\alpha) := Vol(S^{d-1}) \, E[ \det(M) 1(M \succeq 0) 1(E_2') \mid g = 0, \alpha ] \, 1(E_1) 1(E_2) \, p_{g|\alpha}(0) .\n\nThen, using the Kac-Rice formula (equation (2.3))5 and equation (3.5), we obtain the following explicit formula for the number of local maxima in L_1 \cap L_2:\n\nE[|M_f \cap L_1 \cap L_2|] = E[h(\alpha)] .    (3.7)\n\nWe note that p_{g|\alpha}(0) has an explicit expression since g \mid \alpha is Gaussian. For ease of exposition, we separate out the hard-to-estimate part of h(\alpha), which we call W(\alpha):\n\nW(\alpha) := E[ \det(M) 1(M \succeq 0) 1(E_2') \mid g = 0, \alpha ] \, 1(E_1) 1(E_2) .    (3.8)\n\n5In Section C of the supplementary material, we rigorously verify the regularity conditions of the Kac-Rice formula.\n\nTherefore, by definition, we have that h(\alpha) = Vol(S^{d-1}) W(\alpha) p_{g|\alpha}(0). 
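To see why p_{g|\alpha}(0) is explicit: conditioned on \alpha, the gradient g = \sum_i \alpha_i^3 b_i is a linear combination of independent Gaussians, hence Gaussian with covariance \|\alpha\|_6^6 \cdot Id_{d-1}. A small Monte Carlo sanity check of that covariance (an illustration written for this note, not part of the paper's proof):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, trials = 6, 10, 50000

alpha = rng.standard_normal(n)      # the conditioned-on alpha_i's
scale6 = float(np.sum(alpha ** 6))  # ||alpha||_6^6

# g | alpha = sum_i alpha_i^3 b_i with b_i ~ N(0, Id_{d-1}); a linear
# combination of independent Gaussians, hence Gaussian with covariance
# (sum_i alpha_i^6) * Id_{d-1} = ||alpha||_6^6 * Id_{d-1}.
b = rng.standard_normal((trials, n, d - 1))
g = np.einsum("i,tij->tj", alpha ** 3, b)

# Empirical covariance; its relative deviation from scale6 * Id is small.
emp_cov = g.T @ g / trials
print(np.max(np.abs(emp_cov - scale6 * np.eye(d - 1))) / scale6)
```

With this covariance, the Gaussian density at the origin is p_{g|\alpha}(0) = (2\pi \|\alpha\|_6^6)^{-(d-1)/2}, which is the closed form used in the estimates below.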
Now that we have conditioned on $\alpha$, the distribution of the Hessian, namely $M \mid \alpha$, is a generalized Wishart matrix, which is slightly easier to handle than before. However, there are still several challenges that we need to address in order to estimate $W(\alpha)$.

How to control $\det(M)\, 1(M \succeq 0)$? Recall that $M = \|\alpha\|_4^4 \cdot \mathrm{Id}_{d-1} - 3\sum_i \alpha_i^2 b_i b_i^\top$ is a generalized Wishart matrix whose eigenvalue distribution has no (known) analytical expression. The determinant is by definition a high-degree polynomial in the entries, and in our case a complicated polynomial in the random variables $\alpha_i$ and the random vectors $b_i$. We also need to properly exploit the presence of the indicator $1(M \succeq 0)$: without it, the desired statement would not be true, since the function $f$ has an exponential number of critical points.

Fortunately, in most cases we can use the following simple claim, which bounds the determinant from above in terms of the trace. The inequality is close to tight when all the eigenvalues of $M$ are similar to each other. More importantly, it makes natural use of the indicator $1(M \succeq 0)$. Later we will see how to strengthen it when it is far from tight.

Claim 3.4. We have that

$$\det(M)\, 1(M \succeq 0) \leq \left(\frac{|\mathrm{tr}(M)|}{d-1}\right)^{d-1} 1(M \succeq 0).$$

The claim is a direct consequence of the AM-GM inequality applied to the eigenvalues of $M$. (Note that $M$ has dimension $(d-1)\times(d-1)$; we give a formal proof in Section 3.1 of the supplementary material.) It follows that

$$W(\alpha) \leq \mathbb{E}\left[\frac{|\mathrm{tr}(M)|^{d-1}}{(d-1)^{d-1}} \,\Big|\, g = 0, \alpha\right] 1(E_1). \quad (3.9)$$

Here we dropped the indicators for the events $E_2$ and $E_2'$, since they are not important for the discussion below.
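Claim 3.4 is the AM-GM inequality applied to the nonnegative eigenvalues of $M$: the product of $k$ nonnegative numbers is at most the $k$-th power of their average. A quick numerical sanity check on a random PSD matrix (a hypothetical stand-in for $M$; the size $k$ plays the role of $d-1$, and all constants below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)

k = 8                                # plays the role of d - 1
G = rng.standard_normal((k, 3 * k))
Mk = G @ G.T / (3 * k)               # a random PSD matrix, so 1(Mk >= 0) holds

eigs = np.linalg.eigvalsh(Mk)
det_Mk = float(np.prod(eigs))            # det = product of eigenvalues
amgm = float((np.sum(eigs) / k) ** k)    # (tr/k)^k, the AM-GM upper bound

print(det_Mk <= amgm)  # True; equality only if all eigenvalues coincide
```

The bound is tight exactly when the spectrum is flat, which is why the claim is most useful when the eigenvalues of $M$ are similar, as noted above.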
It turns out that $|\mathrm{tr}(M)|$ is a random variable that concentrates very well, and thus we have $\mathbb{E}\left[|\mathrm{tr}(M)|^{d-1}\right] \approx \left|\mathbb{E}\left[\mathrm{tr}(M)\right]\right|^{d-1}$. It can be shown that (see Proposition 4.3 in the supplementary material for the detailed calculation)

$$\mathbb{E}\left[\mathrm{tr}(M) \mid g = 0, \alpha\right] = (d-1)\left(\|\alpha\|_4^4 - 3\|\alpha\|^2 + 3\|\alpha\|_8^8/\|\alpha\|_6^6\right).$$

Therefore, using equation (3.9) and the equation above, we have that

$$W(\alpha) \leq \left(\|\alpha\|_4^4 - 3\|\alpha\|^2 + 3\|\alpha\|_8^8/\|\alpha\|_6^6\right)^{d-1} 1(E_0)\, 1(E_1).$$

Note that since $g \mid \alpha$ has a Gaussian distribution, we have $p_{g\mid\alpha}(0) = (2\pi)^{-d/2}\left(\|\alpha\|_6^6\right)^{-d/2}$. Thus, using the two equations above, we can bound $\mathbb{E}[h(\alpha)]$ by

$$\mathbb{E}\left[h(\alpha)\right] \leq \mathrm{Vol}(S^{d-1})\, \mathbb{E}\left[\left(\|\alpha\|_4^4 - 3\|\alpha\|^2 + 3\|\alpha\|_8^8/\|\alpha\|_6^6\right)^{d-1} (2\pi)^{-d/2}\left(\|\alpha\|_6^6\right)^{-d/2} 1(E_0)\, 1(E_1)\right]. \quad (3.10)$$

Therefore, it suffices to control the RHS of (3.10), which is much easier to handle than the original Kac-Rice formula. However, it turns out that the RHS of (3.10) is roughly $c^d$ for some constant $c > 1$! Roughly speaking, this is because high powers of a random variable are very sensitive to its tail.

Two sub-cases according to $\max_i |\alpha_i|$. We aim to find a tighter bound on $\mathbb{E}[h(\alpha)]$ by reusing the idea behind equation (3.10). Intuitively, we consider two separate events: the event $F_0$ on which none of the $\alpha_i$'s is unusually large, and the complementary event $F_0^c$. Formally, let $\tau = Kn/d$, where $K$ is a universal constant that will be determined later.
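The $c^d$ blow-up of the RHS of (3.10) reflects a general phenomenon: a high power of a random variable is dominated by its tail, so $\mathbb{E}[X^m]$ can vastly exceed $\mathbb{E}[X]^m$ even when $X$ concentrates around its mean, while truncating the tail (as the event $F_0$ does for the $\alpha_i$'s) tames the bound. A toy Monte Carlo illustration, with $X \sim \mathrm{Exp}(1)$ as an arbitrary stand-in (here $\mathbb{E}[X] = 1$ but $\mathbb{E}[X^m] = m!$); the power $m$, truncation level, and sample size are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)

m = 10
X = rng.exponential(size=1_000_000)     # E[X] = 1, E[X^m] = m! = 3628800

raw = np.mean(X ** m)                   # Monte Carlo estimate of E[X^m]; tail-driven
powered_mean = np.mean(X) ** m          # estimate of E[X]^m, roughly 1
truncated = np.mean(np.where(X <= 3.0, X, 0.0) ** m)  # same power, tail removed

# raw dwarfs powered_mean; truncating the tail shrinks it by orders of magnitude.
print(raw / powered_mean, raw / truncated)
```

This mirrors the two-case strategy below: on $F_0$ the truncated moment is controlled directly, while on $F_0^c$ a different argument handles the rare large coordinate.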
Let $F_0$ be the event $F_0 = \left\{\|\alpha\|_\infty^4 \leq \tau\right\}$. Then we control $\mathbb{E}\left[h(\alpha)\, 1(F_0)\right]$ and $\mathbb{E}\left[h(\alpha)\, 1(F_0^c)\right]$ separately. For the former, we essentially reuse equation (3.10) with an indicator function inserted inside the expectation. For the latter, we exploit the large coordinate, which contributes to the $-3\alpha_i^2 b_i b_i^\top$ term in $M$ and makes the probability of $1(M \succeq 0)$ extremely small; as a result, $\det(M)\, 1(M \succeq 0)$ is almost always $0$. We formalize the two cases below:

Proposition 3.5. Let $K \geq 2 \cdot 10^3$ be a universal constant. Let $\tau = Kn/d$ and let $\gamma, \beta$ be sufficiently large constants (depending on $K$). Then for any $n \geq \beta d \log^2 d$, we have that

$$\mathbb{E}\left[h(\alpha)\, 1(F_0)\right] \leq (0.3)^{d/2}.$$

Proposition 3.6. In the setting of Proposition 3.5, we have

$$\mathbb{E}\left[h(\alpha)\, 1(F_0^c)\right] \leq n \cdot (0.3)^{d/2}.$$

Theorem 3.2 then follows as a direct consequence of Proposition 3.5, Proposition 3.6, and equation (3.7).

Due to the space limit, we refer readers to the supplementary material for an extended proof overview and the full proofs.

4 Conclusion

We analyze the optimization landscape of the random over-complete tensor decomposition problem using the Kac-Rice formula and random matrix theory. We show that in the superlevel set $L$ that contains all the points with function values barely larger than that of a random guess, there are exactly $2n$ local maxima, which correspond to the true components. This implies that with an initialization slightly better than a random guess, local search algorithms converge to the desired solutions. We believe our techniques can be extended to 3rd-order tensors, or to other non-convex problems with structured randomness.

The immediate open question is whether there is any spurious local maximum outside this superlevel set.
Answering it seems to involve solving difficult questions in random matrix theory. Another potential approach to unraveling the mystery behind the success of non-convex methods is to analyze the early stage of local search algorithms and show that they enter the superlevel set $L$ quickly from a good initialization.