{"title": "Global Optimality of Local Search for Low Rank Matrix Recovery", "book": "Advances in Neural Information Processing Systems", "page_first": 3873, "page_last": 3881, "abstract": "We show that there are no spurious local minima in the non-convex factorized parametrization of low-rank matrix recovery from incoherent linear measurements. With noisy measurements we show all local minima are very close to a global optimum. Together with a curvature bound at saddle points, this yields a polynomial time global convergence guarantee for stochastic gradient descent {\\em from random initialization}.", "full_text": "Global Optimality of Local Search\nfor Low Rank Matrix Recovery\n\nSrinadh Bhojanapalli\nsrinadh@ttic.edu\n\nBehnam Neyshabur\n\nbneyshabur@ttic.edu\n\nNathan Srebro\nnati@ttic.edu\n\nToyota Technological Institute at Chicago\n\nAbstract\n\nWe show that there are no spurious local minima in the non-convex factorized\nparametrization of low-rank matrix recovery from incoherent linear measurements.\nWith noisy measurements we show all local minima are very close to a global\noptimum. Together with a curvature bound at saddle points, this yields a polynomial\ntime global convergence guarantee for stochastic gradient descent from random\ninitialization.\n\n1\n\nIntroduction\n\nLow rank matrix recovery problem is heavily studied and has numerous applications in collaborative\n\ufb01ltering, quantum state tomography, clustering, community detection, metric learning and multi-task\nlearning [21, 12, 9, 27].\nWe consider the \u201cmatrix sensing\u201d problem of recovering a low-rank (or approximately low rank)\np.s.d. matrix1 X\u21e4 2 Rn\u21e5n, given a linear measurement operator A : Rn\u21e5n ! Rm and noisy\nmeasurements y = A(X\u21e4) + w, where w is an i.i.d. noise vector. 
An estimator for X* is given by the rank-constrained, non-convex problem

minimize_{X : rank(X) ≤ r} ||A(X) − y||².    (1)

This matrix sensing problem has received considerable attention recently [30, 29, 26]. This and other rank-constrained problems are common in machine learning and related fields, and have been used for the applications discussed above. A typical theoretical approach to low-rank problems, including (1), is to relax the low-rank constraint to a convex constraint, such as a bound on the trace-norm of X. Indeed, for matrix sensing, Recht et al. [20] showed that if the measurements are noiseless and the measurement operator A satisfies a restricted isometry property, then a low-rank X* can be recovered as the unique solution to a convex relaxation of (1). Subsequent work established similar guarantees also for the noisy and approximate case [14, 6].
However, convex relaxations of the rank constraint are not the approach commonly employed in practice. In this and other low-rank problems, the method of choice is typically unconstrained local optimization (via e.g. gradient descent, SGD or alternating minimization) on the factorized parametrization

minimize_{U ∈ R^{n×r}} f(U) = ||A(UU^T) − y||²,    (2)

where the rank constraint is enforced by limiting the dimensionality of U.

¹We study the case where X* is PSD. We believe the techniques developed here can be used to extend the results to the general case.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Problem (2) is a non-convex optimization problem that could have many bad local minima (as we show in Section 5), as well as saddle points. Nevertheless, local optimization seems to work very well in practice. 
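To make the factorized formulation concrete, here is a minimal sketch (our own illustration, not the authors' code) of the objective in (2) and its gradient, which works out to ∇f(U) = Σ_i 2(⟨A_i, UU^T⟩ − y_i)(A_i + A_i^T)U; the array layout of A and y follows the simulation sketch above and is an assumption:

```python
import numpy as np

def f(U, A, y):
    # Factorized objective of problem (2): f(U) = ||A(U U^T) - y||^2,
    # where A is a stack of m sensing matrices of shape (m, n, n).
    resid = np.einsum('ijk,jk->i', A, U @ U.T) - y
    return resid @ resid

def grad_f(U, A, y):
    # Gradient: sum_i 2 (<A_i, U U^T> - y_i) (A_i + A_i^T) U.
    resid = np.einsum('ijk,jk->i', A, U @ U.T) - y
    S = np.einsum('i,ijk->jk', resid, A)   # S = sum_i resid_i * A_i
    return 2.0 * (S + S.T) @ U
```

The gradient expression follows from d⟨A_i, UU^T⟩ = ⟨(A_i + A_i^T)U, dU⟩ and can be checked against a finite difference.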
Working with (2) is much cheaper computationally and allows scaling to large problems: the number of optimization variables is only O(nr) rather than O(n²), and the updates are usually very cheap, especially compared to typical methods for solving the SDP resulting from the convex relaxation. There is therefore a significant disconnect between the theoretically studied and analyzed methods (based on convex relaxations) and the methods actually used in practice.
Recent attempts at bridging this gap showed that some form of global “initialization”, typically relying on a singular value decomposition, yields a solution that is already close enough to X* that local optimization started from this initializer reaches the global optimum (or a good enough solution). Jain et al. [15] and Keshavan [17] proved convergence of alternating minimization provided the starting point is close to the optimum, while Zheng and Lafferty [30], Zhao et al. [29], Tu et al. [26], Chen and Wainwright [8], and Bhojanapalli et al. [2] considered gradient descent methods on the factor space and proved local convergence. But all these studies rely on global initialization followed by local convergence, and do not tackle the question of the existence of spurious local minima or deal with optimization starting from random initialization. There is therefore still a disconnect between this theory and the empirical practice of starting from random initialization and relying only on local search to find the global optimum.
In this paper we show that, under a suitable incoherence condition on the measurement operator A (defined in Section 2), with noiseless measurements and with rank(X*) ≤ r, problem (2) has no spurious local minima (i.e. all local minima are global and satisfy X* = UU^T). 
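As a small numerical illustration of this claim (our own experiment with arbitrary sizes, not an experiment from the paper), plain gradient descent on f(U) from a random starting point, with no SVD initialization, recovers a planted X* from Gaussian measurements:

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 12, 2
m = 8 * n * r                          # comfortably above the empirical ~2nr threshold

# Planted rank-r PSD matrix with unit nonzero eigenvalues
U_star, _ = np.linalg.qr(rng.standard_normal((n, r)))
X_star = U_star @ U_star.T

A = rng.standard_normal((m, n, n))     # Gaussian sensing matrices A_i
y = np.einsum('ijk,jk->i', A, X_star)  # noiseless measurements y_i = <A_i, X*>

U = 0.3 * rng.standard_normal((n, r))  # random initialization, no SVD step
eta = 0.02
for _ in range(5000):
    resid = np.einsum('ijk,jk->i', A, U @ U.T) - y
    S = np.einsum('i,ijk->jk', resid, A) / m
    U -= eta * 2.0 * (S + S.T) @ U     # gradient step on (1/m) f(U)

rel_err = np.linalg.norm(U @ U.T - X_star) / np.linalg.norm(X_star)
```

With these (assumed) settings the relative error ||UU^T − X*||_F / ||X*||_F should drop far below 10⁻², consistent with the absence of spurious local minima in this regime.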
Furthermore, under the same conditions, all saddle points have a direction with significant negative curvature, and so using a recent result of Ge et al. [10] we can establish that stochastic gradient descent from random initialization converges to X* in a polynomial number of iterations. We extend the results also to the noisy and approximately-low-rank settings, where we can guarantee that every local minimum is close to a global minimum. The incoherence condition we require is weaker than conditions previously used to establish recovery through local search, and so our results also ensure recovery in polynomial time under milder conditions than what was previously known. In particular, with i.i.d. Gaussian measurements, we ensure no spurious local minima and recovery through local search with the optimal number O(nr) of measurements.

Related Work Our work is heavily inspired by Bandeira et al. [1], who recently showed similar behavior for the problem of community detection; this corresponds to a specific rank-1 problem with a linear objective, elliptope constraints and a binary solution. Here we take their ideas, extend them and apply them to matrix sensing with general rank-r matrices. In the past several months, similar results were also obtained for other non-convex problems (where the source of non-convexity is not a rank constraint), specifically complete dictionary learning [24] and phase recovery [25]. 
A related recent result of a somewhat different nature pertains to rank-unconstrained linear optimization on the elliptope, showing that local minima of the rank-constrained problem approximate well the global optimum of the rank-unconstrained convex problem, even though they might not be global minima (in fact, the approximation guarantee for the actual global optimum is better) [18].
Another non-convex low-rank problem long known not to possess spurious local minima is the PCA problem, which can also be phrased as matrix approximation with full observations, namely min_{rank(X) ≤ r} ||A − X||_F (e.g. [23]). Indeed, local search methods such as the power method are routinely used for this problem. Recently, local optimization methods for the PCA problem working more directly on the optimization formulation have also been studied, including SGD [22] and Grassmannian optimization [28]. These results are somewhat orthogonal to ours, as they study a setting in which it is well known there are never any spurious local minima, and the challenge is obtaining satisfactory convergence rates.
The seminal work of Burer and Monteiro [3] proposed low-rank factorized optimization for SDPs, and showed that for rank as high as r > √m (where m is the number of constraints), an Augmented Lagrangian method converges asymptotically to the optimum. It was also shown that (under mild conditions) any rank-deficient local minimum is a global minimum [4, 16], providing a post-hoc verifiable sufficient condition for global optimality. However, this does not establish any a priori condition, based on problem structure, implying the lack of spurious local minima.

While preparing this manuscript, we also became aware of parallel work [11] studying the same question for the related but different problem of matrix completion. 
For this problem they obtain a similar guarantee, though with suboptimal dependence on the incoherence parameters, and hence suboptimal sample complexity, and they require adding a specific non-standard regularizer to the objective; this is not needed for our matrix sensing results.
We believe our work and the parallel work of [11] are the first to establish the lack of spurious local minima and the global convergence of local search from random initialization for a non-trivial rank-constrained problem (beyond PCA with full observations) with rank r > 1.

Notation. For matrices X, Y ∈ R^{n×n}, their inner product is ⟨X, Y⟩ = trace(X^T Y). We use ||X||_F, ||X||_2 and ||X||_* for the Frobenius, spectral and nuclear norms of a matrix, respectively. Given a matrix X, we use σ_i(X) to denote the singular values of X in decreasing order. X_r = argmin_{rank(Y) ≤ r} ||X − Y||_F denotes the rank-r approximation of X, as obtained via its truncated singular value decomposition. We use plain capitals R and Q to denote orthonormal matrices.

2 Formulation and Assumptions

We write the linear measurement operator A : R^{n×n} → R^m as A(X)_i = ⟨A_i, X⟩ where A_i ∈ R^{n×n}, yielding y_i = ⟨A_i, X*⟩ + w_i, i = 1, ..., m. We assume w_i ~ N(0, σ_w²) is i.i.d. Gaussian noise. We are generally interested in the high dimensional regime where the number of measurements m is usually much smaller than the dimension n².
Even if we know that rank(X*) ≤ r, having many measurements might not be sufficient for recovery if they are not “spread out” enough. E.g., if all measurements only involve the first n/2 rows and columns, we would never have any information on the bottom-right block. A sufficient condition for identifiability of a low-rank X* from linear measurements, due to Recht et al. [20], is the restricted isometry property defined below.
Definition 2.1 (Restricted Isometry Property). 
A measurement operator A : R^{n×n} → R^m (with matrices A_i, i = 1, ..., m) satisfies (r, δ_r)-RIP if for every n × n matrix X with rank ≤ r,

(1 − δ_r) ||X||_F² ≤ (1/m) Σ_{i=1}^m ⟨A_i, X⟩² ≤ (1 + δ_r) ||X||_F².    (3)

In particular, X* of rank r is identifiable if δ_{2r} < 1 [see 20, Theorem 3.2]. One situation in which RIP is obtained is for random measurement operators. For example, matrices with i.i.d. N(0, 1) entries satisfy (r, δ_r)-RIP when m = O(nr/δ_r²) [see 6, Theorem 2.3]. This implies identifiability based on i.i.d. Gaussian measurements with m = O(nr) measurements (coincidentally, the number of degrees of freedom in X*, optimal up to a constant factor).

3 Main Results

We are now ready to present our main results about local minima of the matrix sensing problem (2). We first present the results for noisy sensing of exactly low-rank matrices, and then generalize the results to approximately low-rank matrices.
We begin with our result characterizing the local minima of f(U) for low-rank X*. Recall that the measurements are y = A(X*) + w, where the entries of w are i.i.d. Gaussian, w_i ~ N(0, σ_w²).
Theorem 3.1. Consider the optimization problem (2) where y = A(X*) + w, w is i.i.d. N(0, σ_w²), A satisfies (4r, δ_{4r})-RIP with δ_{4r} < 1/10, and rank(X*) ≤ r. Then, with probability 1 − 10/n² (over the noise), for any local minimum U of f(U):

||UU^T − X*||_F ≤ 20 sqrt(r log(n)/m) σ_w.

In particular, in the noiseless case (σ_w = 0) we have UU^T = X*, so f(U) = 0 and every local minimum is global. In the noiseless case we can also relax the RIP requirement to δ_{4r} < 1/5 (see Theorem 4.1 in Section 4). In the noisy case we cannot expect to always reach an exact global minimum, since the noise might cause tiny fluctuations very close to the global minima, possibly creating multiple very close local minima. 
But we show that all local minima are indeed very close to some factorization U*U*^T = X* of the true signal, and hence to a global optimum, and this “radius” of local minima decreases as we have more observations.
The proof of the theorem for the noiseless case is presented in Section 4. The proof for the general setting follows along the same lines and can be found in the Appendix.
So far we have discussed how all local minima are global, or at least very close to a global minimum. Using a recent result by Ge et al. [10] on the convergence of SGD for non-convex functions, we can further obtain a polynomial bound on the number of SGD iterations required to reach a global minimum. The main condition that needs to be established in order to ensure this is that all saddle points of (2) satisfy the “strict saddle” condition, i.e. have a direction with significant negative curvature:
Theorem 3.2 (Strict saddle). Consider the optimization problem (2) in the noiseless case, where y = A(X*), A satisfies (4r, δ_{4r})-RIP with δ_{4r} < 1/10, and rank(X*) ≤ r. Let U be a first order critical point of f(U) with UU^T ≠ X*. Then the smallest eigenvalue of the Hessian satisfies

λ_min( (1/m) ∇²f(U) ) ≤ −(2/5) σ_r(X*).

Now consider the stochastic gradient descent updates

U⁺ = Proj_b( U − η ( Σ_{i=1}^m (⟨A_i, UU^T⟩ − y_i) A_i U + ξ ) ),    (4)

where ξ is uniformly distributed on the unit sphere and Proj_b is the projection onto {||U||_F ≤ b}. Using Theorem 3.2 and the result of Ge et al. [10] we can establish:
Theorem 3.3 (Convergence from random initialization). Consider the optimization problem (2) under the same noiseless conditions as in Theorem 3.2. 
Using b ≥ ||U*||_F for some global optimum U* of f(U), for any ε, c > 0, after T = poly( 1/σ_r(X*), σ_1(X*), b, 1/ε, log(1/c) ) iterations of (4) with an appropriate stepsize η, starting from a random point uniformly distributed on {||U||_F = b}, with probability at least 1 − c we reach an iterate U_T satisfying

||U_T − U*||_F ≤ ε.

The above result guarantees convergence of noisy gradient descent to a global optimum. Alternatively, second order methods such as cubic regularization (Nesterov and Polyak [19]) and trust region methods (Cartis et al. [7]), which have guarantees based on the strict saddle property, can also be used here.
RIP Requirement: Our results require (4r, 1/10)-RIP for the noisy case and (4r, 1/5)-RIP for the noiseless case. Requiring (2r, δ_{2r})-RIP with δ_{2r} < 1 is sufficient to ensure uniqueness of the global optimum of (1), and thus recovery in the noiseless setting [20], but all known efficient recovery methods require stricter conditions. The best guarantees we are aware of require (5r, 1/10)-RIP [20] or (4r, 0.414)-RIP [6] using a convex relaxation. Alternatively, (6r, 1/10)-RIP is required for global initialization followed by non-convex optimization [26]. In terms of requirements on (2r, δ_{2r})-RIP for non-convex methods, the best we are aware of require δ_{2r} = O(1/r) [15, 29, 30]; this is a much stronger condition than ours, and it yields a suboptimal required number of spherical Gaussian measurements of Ω(nr³). So, compared to prior work our requirement is very mild: it ensures efficient recovery, and requires the optimal number of spherical Gaussian measurements (up to a constant factor) of O(nr).

Extension to Approximate Low Rank We can also obtain similar results that deteriorate gracefully if X* is not exactly low rank, but is close to being low rank (see proof in the Appendix):
Theorem 3.4. 
Consider the optimization problem (2) where y = A(X*) and A satisfies (4r, δ_{4r})-RIP with δ_{4r} < 1/100. Then, for any local minimum U of f(U):

||UU^T − X*||_F ≤ 4 ( ||X* − X*_r||_F + δ_{2r} ||X* − X*_r||_* ),

where X*_r is the best rank-r approximation of X*.

This theorem guarantees that any local optimum of f(U) is close to X*, up to an error depending on ||X* − X*_r||. For the low-rank noiseless case we have X* = X*_r and the right hand side vanishes. When X* is not exactly low rank, the best recovery error we can hope for is ||X* − X*_r||_F, since UU^T has rank at most r. The right hand side of Theorem 3.4 also has a nuclear norm term, which might be larger, but it gets scaled down by δ_{2r}, and hence by the number of measurements.

Figure 1: The plots compare the success probability of gradient descent with (left) random and (center) SVD initialization (suggested in [15]) for problem (2), with an increasing number of samples m and various values of the rank r; the right-most plot shows, for each rank r, the first m at which the probability of success reaches 0.5. A run is considered a success if ||UU^T − X*||_F / ||X*||_F ≤ 10⁻². White cells denote success and black cells denote failure of recovery. We set n to 100. Each measurement y_i is the inner product of an entrywise i.i.d. Gaussian matrix with a rank-r p.s.d. matrix with a random column space. We observe no significant difference between the two initialization methods, suggesting the absence of spurious local minima, as shown. 
Both methods show a phase transition around m = 2·n·r.

4 Proof for the Noiseless Case

In this section we present the proof characterizing the local minima of problem (2). For ease of exposition we present the results for the noiseless case (w = 0); the proof for the general case can be found in the Appendix.
Theorem 4.1. Consider the optimization problem (2) where y = A(X*), A satisfies (4r, δ_{4r})-RIP with δ_{4r} < 1/5, and rank(X*) ≤ r. Then, for any local minimum U of f(U):

UU^T = X*.

For the proof of this theorem we first discuss the implications of the first and second order optimality conditions, and then show how to combine them to yield the result.
The invariance of f(U) under r × r orthonormal matrices introduces additional challenges in comparing a given stationary point to a global optimum. We have to find the best orthonormal matrix R aligning a given stationary point U with a global optimum U*, where U*U*^T = X*, so as to combine the first and second order conditions without degrading the isometry constants.
Consider a local optimum U, which satisfies the first and second order optimality conditions of problem (2). In particular, U satisfies ∇f(U) = 0 and z^T ∇²f(U) z ≥ 0 for every z ∈ R^{n·r}. We will now see how these two conditions constrain the error UU^T − U*U*^T.
First we present the following consequence of the RIP assumption [see 5, Lemma 2.1].
Lemma 4.1. Given two n × n rank-r matrices X and Y, and a (4r, δ)-RIP measurement operator A, the following holds:

| (1/m) Σ_{i=1}^m ⟨A_i, X⟩⟨A_i, Y⟩ − ⟨X, Y⟩ | ≤ δ ||X||_F ||Y||_F.    (5)

4.1 First order optimality

First we consider the first order condition, ∇f(U) = 0. For any stationary point U this implies

Σ_{i=1}^m ⟨A_i, UU^T − U*U*^T⟩ A_i U = 0.    (6)

Now using the isometry property of the A_i gives the following result.
Lemma 4.2. 
[First order condition] For any first order stationary point U of f(U), and A satisfying (4r, δ)-RIP (3), the following holds:

||(UU^T − U*U*^T) QQ^T||_F ≤ δ ||UU^T − U*U*^T||_F,

where Q is an orthonormal matrix that spans the column space of U.

This lemma states that any stationary point of f(U) is close to a global optimum U* within the subspace spanned by the columns of U. Notice that the error along the orthogonal subspace Q_⊥, namely ||X* Q_⊥ Q_⊥^T||_F, can still be large, making the distance between X and X* arbitrarily large.

Proof of Lemma 4.2. Let U = QR for some orthonormal Q. Consider any matrix of the form Z Q R^{†T}, where R^† is the pseudo-inverse of R. The first order optimality condition then implies

Σ_{i=1}^m ⟨A_i, UU^T − U*U*^T⟩ ⟨A_i, U R^† Q^T Z^T⟩ = 0.

This equation together with the restricted isometry property (equation (5)) gives the inequality

⟨UU^T − U*U*^T, QQ^T Z^T⟩ ≤ δ ||UU^T − U*U*^T||_F ||QQ^T Z^T||_F.

Note that for any matrix A we have ⟨A, QQ^T Z⟩ = ⟨QQ^T A, Z⟩, and that for any matrix A, sup_{Z : ||Z||_F ≤ 1} ⟨A, Z⟩ = ||A||_F. Hence the above inequality implies the lemma statement.

4.2 Second order optimality

We now consider the second order condition to show that the error along Q_⊥ Q_⊥^T is indeed well bounded. Let ∇²f(U) be the Hessian of the objective function; note that this is an (n·r) × (n·r) matrix. Fortunately, for our result we only need to evaluate the Hessian along vec(U − U*R) for some orthonormal matrix R. Here vec(·) denotes writing a matrix in vector form.
Lemma 4.3. [Hessian computation] Let U be a first order critical point of f(U). Then for any r × r orthonormal matrix R, with Δ = U − U*R and Δ_j = Δ e_j e_j^T,

Σ_{j=1}^r vec(Δ_j)^T [∇²f(U)] vec(Δ_j) = Σ_{j=1}^r ( Σ_{i=1}^m 4⟨A_i, U Δ_j^T⟩² − 2⟨A_i, UU^T − U*U*^T⟩² ).

Hence from second order optimality of U we get:
Corollary 4.1. [Second order optimality] Let U be a local minimum of f(U). 
For any r × r orthonormal matrix R, with Δ_j = (U − U*R) e_j e_j^T,

Σ_{j=1}^r Σ_{i=1}^m 4⟨A_i, U Δ_j^T⟩² ≥ 2 Σ_{i=1}^m ⟨A_i, UU^T − U*U*^T⟩².    (7)

Further, for A satisfying (2r, δ)-RIP (equation (3)), we have

Σ_{j=1}^r ||U e_j e_j^T (U − U*R)^T||_F² ≥ ((1 − δ)/(2(1 + δ))) ||UU^T − U*U*^T||_F².    (8)

The proof of this result follows simply by applying Lemma 4.3. The corollary gives a bound involving the distance in the factor (U) space, Σ_j ||U e_j e_j^T (U − U*R)^T||_F². To be able to compare the second order condition to the first order condition, we need a relation between this factor-space quantity and ||UU^T − U*U*^T||_F². Towards this we show the following result.
Lemma 4.4. Let U and U* be two n × r matrices, and let Q be an orthonormal matrix that spans the column space of U. Then there exists an r × r orthonormal matrix R such that, for any first order stationary point U of f(U), the following holds:

Σ_{j=1}^r ||U e_j e_j^T (U − U*R)^T||_F² ≤ (1/8) ||UU^T − U*U*^T||_F² + (34/8) ||(UU^T − U*U*^T) QQ^T||_F².

This lemma bounds the distance in the factor space by ||UU^T − U*U*^T||_F² and ||(UU^T − U*U*^T) QQ^T||_F². Combining it with the second order optimality result (Corollary 4.1) shows that ||UU^T − U*U*^T||_F² is bounded by a constant factor times ||(UU^T − U*U*^T) QQ^T||_F². This bounds ||X* Q_⊥ Q_⊥^T||_F, complementing what the first order condition implied (Lemma 4.2). The proof of the above lemma is in Section B. Combining the above optimality conditions yields the proof of Theorem 4.1.

Proof of Theorem 4.1. 
Assuming UU^T ≠ U*U*^T, from Lemmas 4.2 and 4.4 and Corollary 4.1 we get

( (1 − δ)/(2(1 + δ)) − 1/8 ) ||UU^T − U*U*^T||_F² ≤ (34/8) δ² ||UU^T − U*U*^T||_F².

If δ ≤ 1/5, the above inequality holds only if UU^T = U*U*^T.

5 Necessity of RIP

We showed that there are no spurious local minima only under a restricted isometry assumption. A natural question is whether this is necessary, or whether perhaps problem (2) never has any spurious local minima, perhaps similarly to the non-convex PCA problem min_U ||A − UU^T||_F.
A good indication that this is not the case is that (2) is NP-hard, even in the noiseless case when y = A(X*) for rank(X*) ≤ r [20] (if we do not require RIP, we can have each A_i be non-zero on a single entry, in which case (2) becomes a matrix completion problem, for which hardness has been shown even under fairly favorable conditions [13])³. That is, we are unlikely to have a polynomial-time algorithm that succeeds for every linear measurement operator. Although this does not formally preclude the possibility that there are no spurious local minima and it merely takes a very long time to find a local minimum, this scenario seems somewhat unlikely.
To resolve the question, we present an explicit example of a measurement operator A and y = A(X*) (i.e. f(X*) = 0), with rank(X*) = r, for which (1), and hence also (2), has a non-global local minimum.

Example 1: Let f(X) = (X₁₁ + X₂₂ − 1)² + (X₁₁ − 1)² + X₁₂² + X₂₁² and consider (1) with r = 1 (i.e. a rank-1 constraint). For X* = [1 0; 0 0] we have f(X*) = 0 and rank(X*) = 1. But X = [0 0; 0 1] is a rank-1 local minimum with f(X) = 1.
The construction can be extended to any rank r by simply adding Σ_{i=3}^{r+2} (X_ii − 1)² to the objective, and padding both the global and the local minimum with a diagonal beneath the leading 2 × 2 block.
In Example 1 we had a rank-r problem, with a rank-r exact solution, and a rank-r local minimum. Another question we can ask is what happens if we allow a larger rank than the rank of the optimal solution. That is, if we have f(X*) = 0 with low rank(X*), even rank(X*) = 1, but consider (1) or (2) with a larger r, could we still have non-global local minima? The answer is yes.

Example 2: Let f(X) = (X₁₁ + X₂₂ + X₃₃ − 1)² + (X₁₁ − 1)² + (X₂₂ − X₃₃)² + Σ_{i,j : i≠j} X_ij², and consider problem (1) with a rank r = 2 constraint. We can verify that X* = [1 0 0; 0 0 0; 0 0 0] is a rank-1 global minimum with f(X*) = 0, but X = [0 0 0; 0 1/2 0; 0 0 1/2] is a local minimum with f(X) = 1. Also for an arbitrarily large rank constraint r > 1 (taking r to be odd for simplicity), extend the objective to f(X) = (X₁₁ − 1)² + Σ_{i=1}^{(r−1)/2} [ (X₁₁ + X_{2i,2i} + X_{2i+1,2i+1} − 1)² + (X_{2i,2i} − X_{2i+1,2i+1})² ]. We still have a rank-1 global minimum X* with a single non-zero entry X*₁₁ = 1, while X = (I − X*)/2 is a local minimum with f(X) = 1.

³Note that matrix completion is tractable under incoherence assumptions similar to RIP.

6 Conclusion

We established that, under conditions similar to those required for convex relaxation recovery guarantees, the non-convex formulation of matrix sensing (2) does not exhibit any spurious local minima (or, in the noisy and approximate settings, at least none outside a small radius around a global minimum), and we obtained theoretical guarantees on the success of optimizing it using SGD from random initialization. 
This matches the methods frequently used in practice, and can explain their success. This guarantee is very different in nature from other recent work on non-convex optimization for low-rank problems, which relied heavily on initialization to get close to the global optimum, and on local search only for the final local convergence to the global optimum. We believe this is the first result, together with the parallel work of Ge et al. [11], on the global convergence of local search for common rank-constrained problems that are worst-case hard.
Our result suggests that SVD initialization is not necessary for global convergence, and that random initialization succeeds under similar conditions (in fact, our conditions are even weaker than in previous work that used SVD initialization). To investigate empirically whether SVD initialization is indeed helpful for ensuring global convergence, in Figure 1 we compare the recovery probability of random rank-r matrices under random and SVD initialization; there is no significant difference between the two.
Beyond the implications for matrix sensing, we hope these types of results can be a first step and serve as a model for understanding local search in deep networks. Matrix factorization, as in (2), is a depth-two neural network with linear transfer: an extremely simple network, but already non-convex, and arguably the most complicated network we have a good theoretical understanding of. Deep networks are also hard to optimize in the worst case, but local search seems to do very well in practice. Our ultimate goal is to use the study of matrix recovery as a guide in understanding the conditions that enable efficient training of deep networks.

Acknowledgements
The authors would like to thank Afonso Bandeira for discussions, and Jason Lee and Tengyu Ma for sharing and discussing their work. This research was supported in part by NSF RI/AF grant 1302662.

References
[1] A. S. 
Bandeira, N. Boumal, and V. Voroninski. On the low-rank approach for semidefinite programs arising in synchronization and community detection. arXiv preprint arXiv:1602.04426, 2016.
[2] S. Bhojanapalli, A. Kyrillidis, and S. Sanghavi. Dropping convexity for faster semi-definite optimization. arXiv preprint arXiv:1509.03917, 2015.
[3] S. Burer and R. D. Monteiro. A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization. Mathematical Programming, 95(2):329–357, 2003.
[4] S. Burer and R. D. Monteiro. Local minima and convergence in low-rank semidefinite programming. Mathematical Programming, 103(3):427–444, 2005.
[5] E. J. Candès. The restricted isometry property and its implications for compressed sensing. Comptes Rendus Mathematique, 346(9):589–592, 2008.
[6] E. J. Candès and Y. Plan. Tight oracle inequalities for low-rank matrix recovery from a minimal number of noisy random measurements. Information Theory, IEEE Transactions on, 57(4):2342–2359, 2011.
[7] C. Cartis, N. I. Gould, and P. L. Toint. Complexity bounds for second-order optimality in unconstrained optimization. Journal of Complexity, 28(1):93–108, 2012.
[8] Y. Chen and M. J. Wainwright. Fast low-rank estimation by projected gradient descent: General statistical and algorithmic guarantees. arXiv preprint arXiv:1509.03025, 2015.
[9] S. Flammia, D. Gross, Y.-K. Liu, and J. Eisert. Quantum tomography via compressed sensing: Error bounds, sample complexity and efficient estimators. New Journal of Physics, 14(9):095022, 2012.
[10] R. Ge, F. Huang, C. Jin, and Y. Yuan. Escaping from saddle points – online stochastic gradient for tensor decomposition. In Proceedings of The 28th Conference on Learning Theory, pages 797–842, 2015.
[11] R. Ge, J. Lee, and T. Ma. 
Matrix completion has no spurious local minimum. arXiv preprint arXiv:1605.07272, 2016.
[12] D. Gross, Y.-K. Liu, S. T. Flammia, S. Becker, and J. Eisert. Quantum state tomography via compressed sensing. Physical Review Letters, 105(15):150401, 2010.
[13] M. Hardt, R. Meka, P. Raghavendra, and B. Weitz. Computational limits for matrix completion. In Proceedings of The 27th Conference on Learning Theory, pages 703–725, 2014.
[14] P. Jain, R. Meka, and I. S. Dhillon. Guaranteed rank minimization via singular value projection. In Advances in Neural Information Processing Systems, pages 937–945, 2010.
[15] P. Jain, P. Netrapalli, and S. Sanghavi. Low-rank matrix completion using alternating minimization. In Proceedings of the 45th Annual ACM Symposium on Theory of Computing, pages 665–674. ACM, 2013.
[16] M. Journée, F. Bach, P.-A. Absil, and R. Sepulchre. Low-rank optimization on the cone of positive semidefinite matrices. SIAM Journal on Optimization, 20(5):2327–2351, 2010.
[17] R. H. Keshavan. Efficient algorithms for collaborative filtering. PhD thesis, Stanford University, 2012.
[18] A. Montanari. A Grothendieck-type inequality for local maxima. arXiv preprint arXiv:1603.04064, 2016.
[19] Y. Nesterov and B. T. Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.
[20] B. Recht, M. Fazel, and P. A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471–501, 2010.
[21] J. D. Rennie and N. Srebro. Fast maximum margin matrix factorization for collaborative prediction. In Proceedings of the 22nd International Conference on Machine Learning, pages 713–719. ACM, 2005.
[22] C. D. Sa, C. Re, and K. Olukotun. Global convergence of stochastic gradient descent for some non-convex matrix problems. 
In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 2332–2341, 2015.
[23] N. Srebro and T. Jaakkola. Weighted low-rank approximations. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 720–727, 2003.
[24] J. Sun, Q. Qu, and J. Wright. Complete dictionary recovery using nonconvex optimization. In Proceedings of The 32nd International Conference on Machine Learning, pages 2351–2360, 2015.
[25] J. Sun, Q. Qu, and J. Wright. A geometric analysis of phase retrieval. arXiv preprint arXiv:1602.06664, 2016.
[26] S. Tu, R. Boczar, M. Soltanolkotabi, and B. Recht. Low-rank solutions of linear matrix equations via Procrustes flow. arXiv preprint arXiv:1507.03566, 2015.
[27] H.-F. Yu, P. Jain, P. Kar, and I. Dhillon. Large-scale multi-label learning with missing labels. In Proceedings of The 31st International Conference on Machine Learning, pages 593–601, 2014.
[28] D. Zhang and L. Balzano. Global convergence of a Grassmannian gradient descent algorithm for subspace estimation. arXiv preprint arXiv:1506.07405, 2015.
[29] T. Zhao, Z. Wang, and H. Liu. A nonconvex optimization framework for low rank matrix estimation. In Advances in Neural Information Processing Systems, pages 559–567, 2015.
[30] Q. Zheng and J. Lafferty. A convergent gradient descent algorithm for rank minimization and semidefinite programming from random linear measurements. In Advances in Neural Information Processing Systems, pages 109–117, 2015.
", "award": [], "sourceid": 1924, "authors": [{"given_name": "Srinadh", "family_name": "Bhojanapalli", "institution": "TTI Chicago"}, {"given_name": "Behnam", "family_name": "Neyshabur", "institution": "TTI-Chicago"}, {"given_name": "Nati", "family_name": "Srebro", "institution": "TTI-Chicago"}]}