{"title": "Matrix Completion has No Spurious Local Minimum", "book": "Advances in Neural Information Processing Systems", "page_first": 2973, "page_last": 2981, "abstract": "Matrix completion is a basic machine learning problem that has wide applications, especially in collaborative filtering and recommender systems. Simple non-convex optimization algorithms are popular and effective in practice. Despite recent progress in proving various non-convex algorithms converge from a good initial point, it remains unclear why random or arbitrary initialization suffices in practice. We prove that the commonly used non-convex objective function for matrix completion has no spurious local minima \\--- all local minima must also be global. Therefore, many popular optimization algorithms such as (stochastic) gradient descent can provably solve matrix completion with \\textit{arbitrary} initialization in polynomial time.", "full_text": "Matrix Completion has No Spurious Local Minimum\n\nRong Ge\n\nDuke University\n\n308 Research Drive, NC 27708\nrongge@cs.duke.edu.\n\nJason D. Lee\n\nUniversity of Southern California\n3670 Trousdale Pkwy, CA 90089\n\njasonlee@marshall.usc.edu.\n\nTengyu Ma\n\nPrinceton University\n\n35 Olden Street, NJ 08540\n\ntengyu@cs.princeton.edu.\n\nAbstract\n\nMatrix completion is a basic machine learning problem that has wide applica-\ntions, especially in collaborative \ufb01ltering and recommender systems. Simple\nnon-convex optimization algorithms are popular and effective in practice. Despite\nrecent progress in proving various non-convex algorithms converge from a good\ninitial point, it remains unclear why random or arbitrary initialization suf\ufb01ces in\npractice. We prove that the commonly used non-convex objective function for\npositive semide\ufb01nite matrix completion has no spurious local minima \u2013 all local\nminima must also be global. 
Therefore, many popular optimization algorithms such as (stochastic) gradient descent can provably solve positive semidefinite matrix completion with arbitrary initialization in polynomial time. The result can be generalized to the setting when the observed entries contain noise. We believe that our main proof strategy can be useful for understanding geometric properties of other statistical problems involving partial or noisy observations.

1 Introduction

Matrix completion is the problem of recovering a low rank matrix from partially observed entries. It has been widely used in collaborative filtering and recommender systems [Kor09, RS05], dimension reduction [CLMW11] and multi-class learning [AFSU07]. There has been extensive work on designing efficient algorithms for matrix completion with guarantees. One earlier line of results (see [Rec11, CT10, CR09] and the references therein) relies on convex relaxations. These algorithms achieve strong statistical guarantees, but are quite computationally expensive in practice.
More recently, there has been growing interest in analyzing non-convex algorithms for matrix completion [KMO10, JNS13, Har14, HW14, SL15, ZWL15, CW15]. Let M ∈ ℝ^{d×d} be the target matrix with rank r ≪ d that we aim to recover, and let Ω = {(i,j) : M_{i,j} is observed} be the set of observed entries.
These methods are instantiations of optimization algorithms applied to the objective¹,

f(X) = (1/2) Σ_{(i,j)∈Ω} [M_{i,j} − (XX^⊤)_{i,j}]² .  (1.1)

These algorithms are much faster than the convex relaxation algorithms, which is crucial for their empirical success in large-scale collaborative filtering applications [Kor09].

¹In this paper, we focus on the symmetric case when the true M has a symmetric decomposition M = ZZ^⊤. Some previous papers work on the asymmetric case when M = ZW^⊤, which is harder than the symmetric case.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

All of the theoretical analyses for the non-convex procedures require careful initialization schemes: the initial point should already be close to the optimum. In fact, Sun and Luo [SL15] showed that after this initialization the problem is effectively strongly convex, hence many different optimization procedures can be analyzed by standard techniques from convex optimization.
However, in practice people typically use a random initialization, which still leads to robust and fast convergence. Why can these practical algorithms find the optimal solution in spite of the non-convexity? In this work we investigate this question and show that the matrix completion objective has no spurious local minima. More precisely, we show that any local minimum X of the objective function f(·) is also a global minimum with f(X) = 0, and recovers the correct low rank matrix M.
Our characterization of the structure of the objective function implies that (stochastic) gradient descent from an arbitrary starting point converges to a global minimum. This is because gradient descent converges to a local minimum [GHJY15, LSJR16], and every local minimum is also a global minimum.

1.1 Main results

Assume the target matrix M is symmetric and each entry of M is observed with probability p independently².
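To illustrate the phenomenon studied here, the following minimal sketch (our own code, not the authors'; the dimensions, step size, and iteration count are arbitrary illustrative choices) runs plain gradient descent on objective (1.1) from a random starting point:

```python
import numpy as np

# Sketch: gradient descent on f(X) = 1/2 * sum_{(i,j) in Omega} (M_ij - (X X^T)_ij)^2
# from a random initialization; all hyperparameters below are illustrative choices.
rng = np.random.default_rng(0)
d, r, p = 30, 2, 0.9                       # dimension, rank, observation probability

Z = rng.normal(size=(d, r))
M = Z @ Z.T                                # ground truth M = Z Z^T (PSD, rank r)
upper = np.triu(rng.random((d, d)) < p)
mask = upper | upper.T                     # (i,j) and (j,i) are observed together

def grad(X):
    # gradient of f: 2 * P_Omega(X X^T - M) X, where P_Omega zeroes unobserved entries
    return 2 * ((X @ X.T - M) * mask) @ X

X = 0.1 * rng.normal(size=(d, r))          # arbitrary (random) starting point
err0 = np.linalg.norm(X @ X.T - M) / np.linalg.norm(M)
for _ in range(4000):
    X -= 0.002 * grad(X)
err = np.linalg.norm(X @ X.T - M) / np.linalg.norm(M)
```

In runs like this the relative error typically drops to near zero despite the random start, consistent with the claim that every local minimum is global (the formal statement also involves the incoherence regularizer introduced below, omitted here for brevity).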
We assume M = ZZ^⊤ for some matrix Z ∈ ℝ^{d×r}.
There are two known issues with matrix completion. First, the choice of Z is not unique, since M = (ZR)(ZR)^⊤ for any orthonormal matrix R. Our goal is to find one of these equivalent solutions.
Another issue is that matrix completion is impossible when M is "aligned" with the standard basis. For example, when M is the identity matrix in its first r × r block, we will very likely observe only 0 entries. To address this issue, we make the following standard assumption:
Assumption 1. For any row Z_i of Z, we have ‖Z_i‖ ≤ (μ/√d) · ‖Z‖_F. Moreover, Z has a bounded condition number σ_max(Z)/σ_min(Z) = κ.
Throughout this paper we think of μ and κ as small constants, and the sample complexity depends polynomially on these two parameters. Also note that this assumption is independent of the choice of Z: all Z such that ZZ^⊤ = M have the same row norms and Frobenius norm.
This assumption is similar to the "incoherence" assumption [CR09]. Our assumption is the same as the one used in analyzing non-convex algorithms [KMO10, SL15].
We enforce X to also satisfy this assumption via a regularizer,

f(X) = (1/2) Σ_{(i,j)∈Ω} [M_{i,j} − (XX^⊤)_{i,j}]² + R(X) ,  (1.2)

where R(X) is a function that penalizes X when one of its rows is too large. See Section 4 and Section A for the precise definition. Our main result shows that in this setting, the regularized objective function has no spurious local minimum:
Theorem 1.1 (Informal). All local minima of the regularized objective (1.2) satisfy XX^⊤ = ZZ^⊤ = M when p ≥ poly(κ, r, μ, log d)/d.
Combined with the results in [GHJY15, LSJR16] (see more discussions in Section 1.2), we have,
Theorem 1.2 (Informal). With high probability, stochastic gradient descent on the regularized objective (1.2) will converge to a solution X such that XX^⊤ = ZZ^⊤ = M in polynomial time from any starting point.
Gradient descent will converge to such a point with probability 1 from a random starting point.

Our results are also robust to noise. Even if each entry is corrupted with Gaussian noise of standard deviation μ²‖Z‖²_F/d (comparable to the magnitude of the entry itself!), we can still guarantee that all the local minima satisfy ‖XX^⊤ − ZZ^⊤‖_F ≤ ε when p is large enough. See the discussion in Appendix B for results on noisy matrix completion.

²The entries (i, j) and (j, i) are the same. With probability p we observe both entries and otherwise we observe neither.

Our main technique is to show that every point that satisfies the first and second order necessary conditions for optimality must be a desired solution. To achieve this we use new ideas to analyze the effect of the regularizer and show how it is useful in modifying the first and second order conditions to exclude any spurious local minimum.

1.2 Related Work

Matrix Completion. The earlier theoretical works on matrix completion analyzed the nuclear norm heuristic [Rec11, CT10, CR09]. This line of work has the cleanest and strongest theoretical guarantees; [CT10, Rec11] showed that if |Ω| ≳ drμ² log² d, the nuclear norm convex relaxation recovers the exact underlying low rank matrix. The solution can be computed by solving a convex program in polynomial time. However the primary disadvantage of nuclear norm methods is their computational and memory requirements. The fastest known algorithms have running time O(d³) and require O(d²) memory, which are both prohibitive for moderate to large values of d. These concerns led to the development of the low-rank factorization paradigm of [BM03]; Burer and Monteiro proposed factorizing the optimization variable M̂ = XX^⊤, and optimizing over X ∈ ℝ^{d×r} instead of M̂ ∈ ℝ^{d×d}.
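The efficiency of the Burer-Monteiro factorization can be sketched concretely (our own illustration; the index arrays, dimensions, and step size are arbitrary choices): one gradient step needs only the observed entries, so the work is proportional to r|Ω| and the memory to dr.

```python
import numpy as np

# One gradient step on f(X) = 1/2 * sum_{(i,j) in Omega} (M_ij - (X X^T)_ij)^2,
# touching only the observed entries: O(r*|Omega|) time, O(d*r) memory.
rng = np.random.default_rng(2)
d, r, m = 1000, 5, 20000                   # dimension, rank, number of observations
rows = rng.integers(0, d, size=m)          # observed coordinates (i, j)
cols = rng.integers(0, d, size=m)
Z = rng.normal(size=(d, r))
vals = np.einsum("kr,kr->k", Z[rows], Z[cols])   # observed entries of M = Z Z^T

X = rng.normal(size=(d, r))                # current iterate

def objective(X):
    resid = np.einsum("kr,kr->k", X[rows], X[cols]) - vals
    return 0.5 * np.sum(resid ** 2)

def gradient_step(X, lr=1e-4):
    # residuals (X X^T - M)_ij computed entrywise on Omega only
    resid = np.einsum("kr,kr->k", X[rows], X[cols]) - vals
    G = np.zeros_like(X)
    # each observed (i, j) contributes resid * X_j to row i, and symmetrically
    np.add.at(G, rows, resid[:, None] * X[cols])
    np.add.at(G, cols, resid[:, None] * X[rows])
    return X - lr * G

f0 = objective(X)
X1 = gradient_step(X)
f1 = objective(X1)
```

With a small enough step size, a single such step already decreases the objective, and no d × d matrix is ever formed.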
This approach only requires O(dr) memory, and a single gradient iteration takes time O(r|Ω|), so it has much lower memory requirements and computational complexity than the nuclear norm relaxation. On the other hand, the factorization causes the optimization problem to be non-convex in X, which leads to theoretical difficulties in analyzing algorithms. Under incoherence and sufficient sample size assumptions, [KMO10] showed that well-initialized gradient descent recovers M. Similarly, [HW14, Har14, JNS13] showed that well-initialized alternating least squares or block coordinate descent converges to M, and [CW15] showed that well-initialized gradient descent converges to M. [SL15, ZWL15] provided a more unified analysis by showing that with careful initialization many algorithms, including gradient descent and alternating least squares, succeed. [SL15] accomplished this by showing an analog of strong convexity in the neighborhood of the solution M.

Non-convex Optimization. Recently, a line of work analyzes non-convex optimization by separating the problem into two aspects: the geometric aspect, which shows the function has no spurious local minimum, and the algorithmic aspect, which designs efficient algorithms that converge to local minima satisfying first order and (relaxed versions of) second order necessary conditions.
Our result is the first that explains the geometry of the matrix completion objective. Similar geometric results are only known for a few problems: phase retrieval/synchronization, orthogonal tensor decomposition, and dictionary learning [GHJY15, SQW15, BBV16]. The matrix completion objective requires different tools due to the sampling of the observed entries, as well as careful management of the regularizer to restrict the geometry. Parallel to our work, Bhojanapalli et al. [BNS16] showed similar results for matrix sensing, which is closely related to matrix completion.
Loh and Wainwright [LW15] showed that for many statistical settings that involve missing/noisy data and non-convex regularizers, any stationary point of the non-convex objective is close to the global optimum; furthermore, there is a unique stationary point that is the global minimum under stronger assumptions [LW14].
On the algorithmic side, it is known that second order algorithms like cubic regularization [NP06] and trust-region [SQW15] algorithms converge to local minima that approximately satisfy first and second order conditions. Gradient descent is also known to converge to local minima [LSJR16] from a random starting point. Stochastic gradient descent can converge to a local minimum in polynomial time from any starting point [Pem90, GHJY15]. All of these results can be applied to our setting, implying that various heuristics used in practice are guaranteed to solve matrix completion.

2 Preliminaries

Notations: For Ω ⊆ [d] × [d], let P_Ω be the linear operator that maps a matrix A to P_Ω(A), where P_Ω(A) has the same values as A on Ω, and 0 outside of Ω. We will use the following matrix norms: ‖·‖_F the Frobenius norm, ‖·‖ the spectral norm, |A|_∞ the elementwise infinity norm, and |A|_{p→q} = max_{‖x‖_p = 1} ‖Ax‖_q. We use the shorthand ‖A‖_Ω = ‖P_Ω(A)‖_F. The trace inner product of two matrices is ⟨A, B⟩ = tr(A^⊤B), and σ_min(X), σ_max(X) are the smallest and largest singular values of X. We also use X_i to denote the i-th row of a matrix X.

2.1 Necessary conditions for Optimality

Given an objective function f(x) : ℝⁿ → ℝ, we use ∇f(x) to denote the gradient of the function, and ∇²f(x) to denote the Hessian of the function (∇²f(x) is an n × n matrix where [∇²f(x)]_{i,j} = ∂²f(x)/∂x_i∂x_j). It is well known that local minima of the function f(x) must satisfy some necessary conditions:
Definition 2.1.
A point x satisfies the first order necessary condition for optimality (later abbreviated as first order optimality condition) if ∇f(x) = 0. A point x satisfies the second order necessary condition for optimality (later abbreviated as second order optimality condition) if ∇²f(x) ⪰ 0.
These conditions are necessary for a local minimum, because otherwise it is easy to find a direction in which the function value decreases. We will also consider a relaxed second order necessary condition, where we only require the smallest eigenvalue of the Hessian ∇²f(x) to be not very negative:
Definition 2.2. For τ ≥ 0, a point x satisfies the τ-relaxed second order optimality condition if ∇²f(x) ⪰ −τ·I.
This relaxation of the second order condition makes the conditions more robust, and allows for efficient algorithms.
Theorem 2.3 ([NP06, SQW15, GHJY15]). If every point x that satisfies the first order and τ-relaxed second order necessary conditions is a global minimum, then many optimization algorithms (cubic regularization, trust-region, stochastic gradient descent) can find the global minimum up to ε error in function value in time poly(1/ε, 1/τ, d).

3 Proof Strategy: "simple" proofs are more generalizable

In this section, we demonstrate the key ideas behind our analysis using the rank r = 1 case. In particular, we first give a "simple" proof for the fully observed case. Then we show this simple proof can be easily generalized to the random observation case. We believe that this proof strategy is applicable to other statistical problems involving partial/noisy observations. The proof sketches in this section are only meant to be illustrative and may not be fully rigorous in various places. We refer the readers to Section 4 and Section A for the complete proofs.
In the rank r = 1 case, we assume M = zz^⊤, where ‖z‖ = 1 and ‖z‖_∞ ≤ μ/√d.
Let ε ≪ 1 be the target accuracy that we aim to achieve in this section, and let p = poly(μ, log d)/(dε).
For simplicity, we focus on the following domain B of incoherent vectors, on which the regularizer R(x) vanishes:

B = { x : ‖x‖_∞ < 2μ/√d } .  (3.1)

Inside this domain B, we can restrict our attention to the objective function without the regularizer, defined as

g̃(x) = (1/2)·‖P_Ω(M − xx^⊤)‖²_F .  (3.2)

The global minima of g̃(·) are z and −z, with function value 0. Our goal in this section is to (informally) prove that all the local minima of g̃(·) are O(√ε)-close to ±z. In a later section we will formally prove that the only local minima are ±z.
Lemma 3.1 (Partial observation case, informally stated). Under the setting of this section, in the domain B, all local minima of the function g̃(·) are O(√ε)-close to ±z.
It turns out to be insightful to consider the full observation case when Ω = [d] × [d]. The corresponding objective is

g(x) = (1/2)·‖M − xx^⊤‖²_F .  (3.3)

Observe that g̃(x) is a sampled version of g(x), and therefore we expect that they share the same geometric properties. In particular, if g(x) does not have spurious local minima then neither does g̃(x).
Lemma 3.2 (Full observation case, informally stated). Under the setting of this section, in the domain B, the function g(·) has only two local minima {±z}.
Before introducing the "simple" proof, let us first look at a delicate proof that does not generalize well.

Difficult-to-generalize proof of Lemma 3.2. We compute the gradient and Hessian of g(x):

∇g(x) = 2(xx^⊤ − M)x ,  ∇²g(x) = 2(2xx^⊤ + ‖x‖²·I − M) .

Therefore, a critical point x satisfies ∇g(x) = 2(xx^⊤ − M)x = 0, that is, Mx = ‖x‖²x, and thus it must be an eigenvector of M and ‖x‖² is the corresponding eigenvalue.
Next, we prove that the Hessian is positive definite only at the top eigenvector. Let x be an eigenvector with eigenvalue λ = ‖x‖² that is strictly less than the top eigenvalue λ*, and let z be the top eigenvector. Since x ⊥ z, we have ⟨z, ∇²g(x)z⟩ = 2(‖x‖² − ⟨z, Mz⟩) = 2(λ − λ*) < 0, which shows that x is not a local minimum. Thus only ±z can be local minimizers, and it is easily verified that ∇²g(z) is indeed positive definite.
The difficulty of generalizing the proof above to the partial observation case is that it uses the properties of eigenvectors heavily. Suppose we want to imitate the proof above for the partial observation case. The first difficulty is how to solve the equation ∇g̃(x) = 2P_Ω(xx^⊤ − M)x = 0. Moreover, even if we could have a reasonable approximation for the critical points (the solutions of ∇g̃(x) = 0), it would be difficult to examine the Hessian at these critical points without the orthogonality of the eigenvectors.

"Simple" and generalizable proof. The lessons from the subsection above suggest that we find an alternative proof for the full observation case which is generalizable. The alternative proof will be simple in the sense that it doesn't use the notion of eigenvectors and eigenvalues. Concretely, the key observation behind most of the analysis in this paper is the following:

Proofs that consist of inequalities that are linear in 1_Ω are often easily generalizable to the partial observation case.

Here statements that are linear in 1_Ω mean statements of the form Σ_{i,j} 1_{(i,j)∈Ω} T_{ij} ≤ a. We will call these kinds of proofs "simple" proofs in this section.
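The full-observation landscape claims above are easy to sanity-check numerically. The sketch below (our own illustration, not the paper's code) evaluates the gradient and Hessian of g(x) = ½‖M − xx^⊤‖²_F for a rank-1 M = zz^⊤, keeping the constant factors of the exact derivatives, at the two critical points x = z and x = 0:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
z = rng.normal(size=d)
z /= np.linalg.norm(z)            # unit norm, so M = z z^T has top eigenvalue 1
M = np.outer(z, z)

def grad(x):
    # gradient of g(x) = 1/2 ||M - x x^T||_F^2:  2(||x||^2 x - M x)
    return 2.0 * ((x @ x) * x - M @ x)

def hess(x):
    # Hessian of g:  2(2 x x^T + ||x||^2 I - M)
    return 2.0 * (2.0 * np.outer(x, x) + (x @ x) * np.eye(d) - M)

grad_at_z = np.linalg.norm(grad(z))                         # ~0: z is a critical point
min_eig_at_z = np.linalg.eigvalsh(hess(z)).min()            # 2: Hessian PD at z
min_eig_at_0 = np.linalg.eigvalsh(hess(np.zeros(d))).min()  # -2: x = 0 is a strict saddle
```

At x = z the Hessian is 2(I + zz^⊤), so its smallest eigenvalue is 2; at the other critical point x = 0 the Hessian is −2M, whose eigenvalue −2 along z certifies a descent direction, matching the argument that no spurious local minimum survives.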
Roughly speaking, the observation follows from the law of large numbers: suppose T_{ij}, (i,j) ∈ [d] × [d], is a sequence of bounded real numbers. Then the sampled sum Σ_{(i,j)∈Ω} T_{ij} = Σ_{i,j} 1_{(i,j)∈Ω} T_{ij} is an accurate estimate of the scaled sum p·Σ_{i,j} T_{ij} when the sampling probability p is relatively large. Then, the mathematical implications of p·Σ T_{ij} ≤ a are expected to be similar to the implications of Σ_{(i,j)∈Ω} T_{ij} ≤ a, up to some small error introduced by the approximation. To make this concrete, we give below informal proofs for Lemma 3.2 and Lemma 3.1 that consist only of statements that are linear in 1_Ω. Readers will see that, due to the linearity, each step of the proof for the partial observation case is a direct generalization of the corresponding step of the proof for the full observation case via concentration inequalities (which will be discussed more at the end of the section).

A "simple" proof for Lemma 3.2 (full observation).
Claim 1f. Suppose x ∈ B satisfies ∇g(x) = 0. Then ⟨x, z⟩² = ‖x‖⁴.
Proof. We have

∇g(x) = 2(xx^⊤ − zz^⊤)x = 0
⇒ ⟨x, ∇g(x)⟩ = 2⟨x, (xx^⊤ − zz^⊤)x⟩ = 0
⇒ ⟨x, z⟩² = ‖x‖⁴ .  (3.4)

Intuitively, this proof says that the norm of a critical point x is controlled by its correlation with z.

Generalization to Lemma 3.1 (partial observation).
Claim 1p. Suppose x ∈ B satisfies ∇g̃(x) = 0. Then ⟨x, z⟩² ≥ ‖x‖⁴ − ε.
Proof. Imitating the proof of Claim 1f, we have

∇g̃(x) = 2P_Ω(xx^⊤ − zz^⊤)x = 0
⇒ ⟨x, ∇g̃(x)⟩ = 2⟨x, P_Ω(xx^⊤ − zz^⊤)x⟩ = 0
⇒ ⟨x, z⟩² ≥ ‖x‖⁴ − ε .  (3.5)

The last step uses the fact that equations (3.4) and (3.5) are approximately equal up to a scaling factor p for any x ∈ B, since (3.5) is a sampled version of (3.4).

Claim 2f. If x ∈ B has positive semidefinite Hessian ∇²g(x) ⪰ 0, then ‖x‖² ≥ 1/3.
Proof. By the assumption on x, we have that ⟨z, ∇²g(x)z⟩ ≥ 0.
Calculating the quadratic form of the Hessian (see Proposition 4.1 for details),

⟨z, ∇²g(x)z⟩ = ‖zx^⊤ + xz^⊤‖²_F − 2z^⊤(zz^⊤ − xx^⊤)z ≥ 0
⇒ ‖x‖² + 2⟨z, x⟩² ≥ 1
⇒ ‖x‖² ≥ 1/3  (since ⟨z, x⟩² ≤ ‖x‖²) .  (3.6)

Claim 2p. If x ∈ B has positive semidefinite Hessian ∇²g̃(x) ⪰ 0, then ‖x‖² ≥ 1/3 − ε.
Proof. Imitating the proof of Claim 2f, and calculating the quadratic form of the Hessian at z (see Proposition 4.1), we have

⟨z, ∇²g̃(x)z⟩ = ‖P_Ω(zx^⊤ + xz^⊤)‖²_F − 2z^⊤P_Ω(zz^⊤ − xx^⊤)z ≥ 0
⇒ ‖x‖² ≥ 1/3 − ε  (same steps as in (3.6)) .  (3.7)

Here we use the fact that ⟨z, ∇²g̃(x)z⟩ ≈ p·⟨z, ∇²g(x)z⟩ for any x ∈ B.

With these two claims, we are ready to prove Lemmas 3.2 and 3.1 by using another step that is linear in 1_Ω.

Proof of Lemma 3.2. By Claims 1f and 2f, x satisfies ⟨x, z⟩² ≥ ‖x‖⁴ ≥ 1/9. Moreover, ∇g(x) = 0 implies

⟨z, ∇g(x)⟩ = 2⟨z, (xx^⊤ − zz^⊤)x⟩ = 0
⇒ ⟨x, z⟩(‖x‖² − 1) = 0
⇒ ‖x‖² = 1  (by ⟨x, z⟩² ≥ 1/9) .  (3.8)

Then by Claim 1f again we obtain ⟨x, z⟩² = 1, and therefore x = ±z.

Proof of Lemma 3.1. By Claims 1p and 2p, x satisfies ⟨x, z⟩² ≥ ‖x‖⁴ ≥ 1/9 − O(ε). Moreover, ∇g̃(x) = 0 implies

⟨z, ∇g̃(x)⟩ = 2⟨z, P_Ω(xx^⊤ − zz^⊤)x⟩ = 0
⇒ ‖x‖² = 1 ± O(ε)  (same steps as in (3.8)) .  (3.9)

Since (3.9) is the sampled version of equation (3.8), we expect them to lead to the same conclusion up to some approximation.
Then by Claim 1p again we obtain ⟨x, z⟩² = 1 ± O(ε), and therefore x is O(√ε)-close to either of ±z.

Subtleties regarding uniform convergence. In the proof sketches above, our key idea is to use concentration inequalities to link the full observation objective g(x) with its partial observation counterpart g̃(x). However, we require a uniform convergence result. For example, we need a statement like "with high probability over the choice of Ω, equations (3.4) and (3.5) are similar to each other up to scaling". This type of statement is often only true for x inside the incoherent ball B. The fix for this is the regularizer. For non-incoherent x, we will use a different argument that relies on the properties of the regularizer. This is beside the main proof strategy of this section and will be discussed in subsequent sections.

4 Warm-up: Rank-1 Case

In this section, using the general proof strategy described in the previous section, we provide a formal proof for the rank-1 case. In subsection 4.1, we formally work out the proof sketches of Section 3 inside the incoherent ball. The rest of the proofs are deferred to the supplementary material.
In the rank-1 case, the objective function simplifies to

f(x) = (1/2)‖P_Ω(M − xx^⊤)‖²_F + R(x) .  (4.1)

Here we use the regularization R(x),

R(x) = λ Σ_{i=1}^{d} h(x_i) ,  h(t) = (|t| − α)⁴ · 1_{|t| > α} .

The parameters λ and α will be chosen later as in Theorem 4.2. We will choose α ≥ 10μ/√d so that R(x) = 0 for incoherent x, and thus it only penalizes coherent x. Moreover, we note that R(x) has a Lipschitz second order derivative.³
We first state the optimality conditions, whose proof is deferred to Appendix A.
Proposition 4.1.
The first order optimality condition of objective (4.1) is

2P_Ω(M − xx^⊤)x = ∇R(x) ,  (4.2)

and the second order optimality condition requires:

∀v ∈ ℝ^d, ‖P_Ω(vx^⊤ + xv^⊤)‖²_F + v^⊤∇²R(x)v ≥ 2v^⊤P_Ω(M − xx^⊤)v .  (4.3)

Moreover, the τ-relaxed second order optimality condition requires

∀v ∈ ℝ^d, ‖P_Ω(vx^⊤ + xv^⊤)‖²_F + v^⊤∇²R(x)v ≥ 2v^⊤P_Ω(M − xx^⊤)v − τ‖v‖² .  (4.4)

We give the precise version of Theorem 1.1 for the rank-1 case.
Theorem 4.2. Let c be a large enough absolute constant. For p ≥ c·μ⁶ log^{1.5} d / d, set α = 10μ√(1/d) and λ ≥ μ²p/α². Then, with high probability over the randomness of Ω, the only points in ℝ^d that satisfy both first and second order optimality conditions (or τ-relaxed optimality conditions with τ < 0.1p) are z and −z.
In the rest of this section, we will first prove that when x is constrained to be incoherent (and hence the regularizer is 0 and concentration is straightforward) and satisfies the optimality conditions, then x has to be z or −z. Then we go on to explain how the regularizer helps us change the geometry of those points that are far away from z, so that we can rule them out from being local minima. For simplicity, we will focus on the part that shows a local minimum x must be close enough to z.
Lemma 4.3. In the setting of Theorem 4.2, suppose x satisfies the first-order and second-order optimality conditions (4.2) and (4.3). Then when p is as defined in Theorem 4.2,

‖xx^⊤ − zz^⊤‖²_F ≤ O(ε) ,

where ε = μ³(pd)^{−1/2}.
This turns out to be the main challenge.
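As a concrete illustration, the regularizer R(x) used in objective (4.1) can be sketched as follows (our own code; λ = 1 is an arbitrary choice, while α follows Theorem 4.2's choice α = 10μ√(1/d)). It vanishes on incoherent vectors and activates only on coordinates exceeding the threshold α:

```python
import numpy as np

# Sketch of the rank-1 regularizer: R(x) = lam * sum_i h(x_i),
# with h(t) = (|t| - alpha)^4 * 1{|t| > alpha}. lam = 1.0 is illustrative.
def regularizer(x, alpha, lam=1.0):
    t = np.abs(x)
    return lam * np.sum(np.where(t > alpha, (t - alpha) ** 4, 0.0))

d, mu = 100, 2.0
alpha = 10 * mu * np.sqrt(1.0 / d)           # = 2.0 for these values

x_incoherent = np.full(d, mu / np.sqrt(d))   # all entries well below alpha
x_spiky = np.zeros(d)
x_spiky[0] = 3.0                             # one coordinate above alpha

r_inc = regularizer(x_incoherent, alpha)     # 0.0: R vanishes on incoherent x
r_spk = regularizer(x_spiky, alpha)          # (3 - 2)^4 = 1.0: coherent x is penalized
```

The fourth power (rather than the second) keeps the second derivative of R Lipschitz, which is what the footnote above refers to.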
Once we have proved x is close, we can apply the result of Sun and Luo [SL15] (see Lemma C.1) and obtain Theorem 4.2.

4.1 Handling incoherent x

To demonstrate the key idea, in this section we restrict our attention to the subset of ℝ^d which contains incoherent x with ℓ₂ norm bounded by 1; that is, we consider

B = { x : ‖x‖_∞ ≤ 2μ/√d, ‖x‖ ≤ 1 } .  (4.5)

Note that the desired solution z is in B, and the regularization R(x) vanishes inside B.
The following lemmas assume x satisfies the first and second order optimality conditions, and deduce a sequence of properties that x must satisfy.
Lemma 4.4. Under the setting of Theorem 4.2, with high probability over the choice of Ω, for any x ∈ B that satisfies the second-order optimality condition (4.3), we have

‖x‖² ≥ 1/4 .

The same is true if x ∈ B only satisfies the τ-relaxed second order optimality condition for τ ≤ 0.1p.
Proof. We plug v = z into the second-order optimality condition (4.3), and obtain

‖P_Ω(zx^⊤ + xz^⊤)‖²_F ≥ 2z^⊤P_Ω(M − xx^⊤)z .  (4.6)

³This is the main reason for us to choose the 4-th power instead of the 2-nd power.

Intuitively, when restricted to Ω, the squared Frobenius norm on the LHS and the quadratic form on the RHS should both be approximately a p fraction of their unrestricted counterparts. In fact, both the LHS and the RHS can be written as sums of terms of the form ⟨P_Ω(uv^⊤), P_Ω(st^⊤)⟩, because

‖P_Ω(zx^⊤ + xz^⊤)‖²_F = 2⟨P_Ω(zx^⊤), P_Ω(zx^⊤)⟩ + 2⟨P_Ω(zx^⊤), P_Ω(xz^⊤)⟩ ,
2z^⊤P_Ω(M − xx^⊤)z = 2⟨P_Ω(zz^⊤), P_Ω(zz^⊤)⟩ − 2⟨P_Ω(xx^⊤), P_Ω(zz^⊤)⟩ .

Therefore we can use concentration inequalities (Theorem D.1) and simplify:

LHS of (4.6) = p‖zx^⊤ + xz^⊤‖²_F ± O(pε)
            = 2p‖x‖²‖z‖² + 2p⟨x, z⟩² ± O(pε)  (since x, z ∈ B) ,
where ε = O(μ²√(log d/(pd))). Similarly, by Theorem D.1 again, we have

RHS of (4.6) = 2⟨P_Ω(zz^⊤), P_Ω(zz^⊤)⟩ − 2⟨P_Ω(xx^⊤), P_Ω(zz^⊤)⟩  (since M = zz^⊤)
            = 2p‖z‖⁴ − 2p⟨x, z⟩² ± O(pε)  (by Theorem D.1 and x, z ∈ B) .

(Note that even if we use the τ-relaxed second order optimality condition, the RHS only becomes 1.99p‖z‖⁴ − 2p⟨x, z⟩² ± O(pε), which does not affect the later proofs.)
Therefore, plugging the estimates above back into equation (4.6), we have

2p‖x‖²‖z‖² + 2p⟨x, z⟩² ± O(pε) ≥ 2p‖z‖⁴ − 2p⟨x, z⟩² ± O(pε) ,

which implies that 6p‖x‖²‖z‖² ≥ 2p‖x‖²‖z‖² + 4p⟨x, z⟩² ≥ 2p‖z‖⁴ − O(pε). Using ‖z‖² = 1 and ε being sufficiently small, we complete the proof.
Next we use the first order optimality condition to pin down another property of x: it has to be close to z after scaling. Note that this does not directly mean that x has to be close to z, since x = 0 also satisfies the first order optimality condition (and therefore the conclusion (4.7) below).
Lemma 4.5. With high probability over the randomness of Ω, for any x ∈ B that satisfies the first-order optimality condition (4.2), we have that x also satisfies

‖⟨z, x⟩z − ‖x‖²x‖ ≤ O(ε) ,  (4.7)

where ε = Õ(μ³(pd)^{−1/2}).
Finally we combine the two optimality conditions and show that equation (4.7) implies xx^⊤ must be close to zz^⊤.
Lemma 4.6. Suppose the vector x satisfies ‖x‖² ≥ 1/4 and ‖⟨z, x⟩z − ‖x‖²x‖ ≤ δ. Then for δ ∈ (0, 0.1),

‖xx^⊤ − zz^⊤‖²_F ≤ O(δ) .

5 Conclusions

Although the matrix completion objective is non-convex, we showed that the objective function has very nice properties ensuring that the local minima are also global. This property gives guarantees for many basic optimization algorithms. An important open problem is the robustness of this property under different model assumptions: Can we extend the result to handle asymmetric matrix completion?
Is\nit possible to add weights to different entries (similar to the settings studied in [LLR16])? Can we\nreplace the objective function with a different distance measure rather than Frobenius norm (which is\nrelated to works on 1-bit matrix sensing [DPvdBW14])? We hope this framework of analyzing the\ngeometry of objective function can be applied to other problems.\n\n8\n\n\fReferences\n\n[AFSU07] Yonatan Amit, Michael Fink, Nathan Srebro, and Shimon Ullman. Uncovering shared structures in multiclass classi\ufb01cation. In\n\nProceedings of the 24th international conference on Machine learning, pages 17\u201324. ACM, 2007.\n\n[BBV16] Afonso S Bandeira, Nicolas Boumal, and Vladislav Voroninski. On the low-rank approach for semide\ufb01nite programs arising\n\nin synchronization and community detection. arXiv preprint arXiv:1602.04426, 2016.\n\n[BM03] Samuel Burer and Renato DC Monteiro. A nonlinear programming algorithm for solving semide\ufb01nite programs via low-rank\n\nfactorization. Mathematical Programming, 95(2):329\u2013357, 2003.\n\n[BNS16] S. Bhojanapalli, B. Neyshabur, and N. Srebro. Global Optimality of Local Search for Low Rank Matrix Recovery. ArXiv\n\ne-prints, May 2016.\n\n[CLMW11] Emmanuel J Cand`es, Xiaodong Li, Yi Ma, and John Wright. Robust principal component analysis?\n\n(JACM), 58(3):11, 2011.\n\nJournal of the ACM\n\n[CR09] Emmanuel J Cand`es and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational\n\nmathematics, 9(6):717\u2013772, 2009.\n\n[CT10] Emmanuel J Cand`es and Terence Tao. The power of convex relaxation: Near-optimal matrix completion. Information Theory,\n\nIEEE Transactions on, 56(5):2053\u20132080, 2010.\n\n[CW15] Yudong Chen and Martin J Wainwright. Fast low-rank estimation by projected gradient descent: General statistical and algo-\n\nrithmic guarantees. arXiv preprint arXiv:1509.03025, 2015.\n\n[DPvdBW14] Mark A Davenport, Yaniv Plan, Ewout van den Berg, and Mary Wootters. 
1-bit matrix completion. Information and Inference,\n\n3(3):189\u2013223, 2014.\n\n[GHJY15] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points\u2014online stochastic gradient for tensor decom-\n\nposition. arXiv:1503.02101, 2015.\n\n[Har14] Moritz Hardt. Understanding alternating minimization for matrix completion. In FOCS 2014. IEEE, 2014.\n\n[HKZ12] Daniel Hsu, Sham M Kakade, and Tong Zhang. A tail inequality for quadratic forms of subgaussian random vectors. Electron.\n\nCommun. Probab, 17(52):1\u20136, 2012.\n\n[HW14] Moritz Hardt and Mary Wootters. Fast matrix completion without the condition number. In COLT 2014, pages 638\u2013678, 2014.\n\n[Imb10] R. Imbuzeiro Oliveira. Sums of random Hermitian matrices and an inequality by Rudelson. ArXiv e-prints, April 2010.\n\n[JNS13] Prateek Jain, Praneeth Netrapalli, and Sujay Sanghavi. Low-rank matrix completion using alternating minimization. In Pro-\n\nceedings of the forty-\ufb01fth annual ACM symposium on Theory of computing, pages 665\u2013674. ACM, 2013.\n\n[KMO10] Raghunandan H Keshavan, Andrea Montanari, and Sewoong Oh. Matrix completion from a few entries. Information Theory,\n\nIEEE Transactions on, 56(6):2980\u20132998, 2010.\n\n[Kor09] Yehuda Koren. The bellkor solution to the net\ufb02ix grand prize. Net\ufb02ix prize documentation, 81, 2009.\n\n[LLR16] Yuanzhi Li, Yingyu Liang, and Andrej Risteski. Recovery guarantee of weighted low-rank approximation via alternating\n\nminimization. arXiv preprint arXiv:1602.02262, 2016.\n\n[LSJR16] Jason D Lee, Max Simchowitz, Michael I Jordan, and Benjamin Recht. Gradient descent converges to minimizers. University\n\nof California, Berkeley, 1050:16, 2016.\n\n[LW14] Po-Ling Loh and Martin J Wainwright. Support recovery without incoherence: A case for nonconvex regularization. arXiv\n\npreprint arXiv:1412.5632, 2014.\n\n[LW15] Po-Ling Loh and Martin J. Wainwright. 
Regularized m-estimators with nonconvexity: statistical and algorithmic theory for\n\nlocal optima. Journal of Machine Learning Research, 16:559\u2013616, 2015.\n\n[NP06] Yurii Nesterov and Boris T Polyak. Cubic regularization of Newton method and its global performance. Mathematical Pro-\n\ngramming, 108(1):177\u2013205, 2006.\n\n[Pem90] Robin Pemantle. Nonconvergence to unstable points in urn models and stochastic approximations. The Annals of Probability,\n\npages 698\u2013712, 1990.\n\n[Rec11] Benjamin Recht. A simpler approach to matrix completion. The Journal of Machine Learning Research, 12:3413\u20133430, 2011.\n\n[RS05] Jasson DM Rennie and Nathan Srebro. Fast maximum margin matrix factorization for collaborative prediction. In Proceedings\n\nof the 22nd international conference on Machine learning, pages 713\u2013719. ACM, 2005.\n\n[SL15] Ruoyu Sun and Zhi-Quan Luo. Guaranteed matrix completion via nonconvex factorization.\n\nScience (FOCS), 2015 IEEE 56th Annual Symposium on, pages 270\u2013289. IEEE, 2015.\n\nIn Foundations of Computer\n\n[SQW15] Ju Sun, Qing Qu, and John Wright. When are nonconvex problems not scary? arXiv preprint arXiv:1510.06096, 2015.\n\n[ZWL15] Tuo Zhao, Zhaoran Wang, and Han Liu. A nonconvex optimization framework for low rank matrix estimation. In Advances in\n\nNeural Information Processing Systems, pages 559\u2013567, 2015.\n\n9\n\n\f", "award": [], "sourceid": 1485, "authors": [{"given_name": "Rong", "family_name": "Ge", "institution": "Princeton University"}, {"given_name": "Jason", "family_name": "Lee", "institution": "UC Berkeley"}, {"given_name": "Tengyu", "family_name": "Ma", "institution": "Princeton University"}]}