{"title": "How Much Restricted Isometry is Needed In Nonconvex Matrix Recovery?", "book": "Advances in Neural Information Processing Systems", "page_first": 5586, "page_last": 5597, "abstract": "When the linear measurements of an instance of low-rank matrix recovery\nsatisfy a restricted isometry property (RIP) --- i.e. they\nare approximately norm-preserving --- the problem is known\nto contain no spurious local minima, so exact recovery is guaranteed.\nIn this paper, we show that moderate RIP is not enough to eliminate\nspurious local minima, so existing results can only hold for near-perfect\nRIP. In fact, counterexamples are ubiquitous: every $x$ is the spurious\nlocal minimum of a rank-1 instance of matrix recovery that satisfies\nRIP. One specific counterexample has RIP constant $\\delta=1/2$, but\ncauses randomly initialized stochastic gradient descent (SGD) to fail\n12\\% of the time. SGD is frequently able to avoid and escape spurious\nlocal minima, but this empirical result shows that it can occasionally\nbe defeated by their existence. Hence, while exact recovery guarantees\nwill likely require a proof of no spurious local minima, arguments\nbased solely on norm preservation will only be applicable to a narrow\nset of nearly-isotropic instances.", "full_text": "How Much Restricted Isometry is Needed In\n\nNonconvex Matrix Recovery?\n\nRichard Y. Zhang\n\nUniversity of California, Berkeley\n\nC\u00e9dric Josz\n\nUniversity of California, Berkeley\n\nryz@alum.mit.edu\n\ncedric.josz@gmail.com\n\nSomayeh Sojoudi\n\nUniversity of California, Berkeley\n\nsojoudi@berkeley.edu\n\nJavad Lavaei\n\nUniversity of California, Berkeley\n\nlavaei@berkeley.edu\n\nAbstract\n\nWhen the linear measurements of an instance of low-rank matrix recovery sat-\nisfy a restricted isometry property (RIP)\u2014i.e.\nthey are approximately norm-\npreserving\u2014the problem is known to contain no spurious local minima, so exact\nrecovery is guaranteed. 
In this paper, we show that moderate RIP is not enough to eliminate spurious local minima, so existing results can only hold for near-perfect RIP. In fact, counterexamples are ubiquitous: we prove that every x is the spurious local minimum of a rank-1 instance of matrix recovery that satisfies RIP. One specific counterexample has RIP constant δ = 1/2, but causes randomly initialized stochastic gradient descent (SGD) to fail 12% of the time. SGD is frequently able to avoid and escape spurious local minima, but this empirical result shows that it can occasionally be defeated by their existence. Hence, while exact recovery guarantees will likely require a proof of no spurious local minima, arguments based solely on norm preservation will only be applicable to a narrow set of nearly-isotropic instances.\n\n1 Introduction\n\nRecently, several important nonconvex problems in machine learning have been shown to contain no spurious local minima [19, 4, 21, 8, 20, 34, 30]. These problems are easily solved using local search algorithms despite their nonconvexity, because every local minimum is also a global minimum, and every saddle point has sufficiently negative curvature to allow escape. Formally, the usual first- and second-order necessary conditions for local optimality (i.e. zero gradient and a positive semidefinite Hessian) are also sufficient for global optimality; satisfying them to ε-accuracy will yield a point within an ε-neighborhood of a globally optimal solution.\nMany of the best-understood nonconvex problems with no spurious local minima are variants of the low-rank matrix recovery problem. The simplest version (known as matrix sensing) seeks to recover an n×n positive semidefinite matrix Z of low rank r ≪ n, given measurement matrices A1, . . . , Am and noiseless data bi = ⟨Ai, Z⟩. 
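As a concrete illustration of the matrix sensing setup, the sketch below builds a small synthetic instance and minimizes the least-squares objective ‖A(xxᵀ) − b‖² by plain gradient descent, a deterministic stand-in for SGD. All dimensions, the Gaussian measurement ensemble, the step size, and the variable names are our own illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, m = 5, 1, 30  # small illustrative dimensions (our choice)

# Symmetrized Gaussian measurement matrices and a random rank-r ground truth.
z = rng.standard_normal((n, r))
Z = z @ z.T
A_mats = [(G + G.T) / 2 for G in rng.standard_normal((m, n, n))]
b = np.array([np.trace(Ai @ Z) for Ai in A_mats])  # noiseless data b_i = <A_i, Z>

def f(x):
    """Nonconvex least-squares objective ||A(x x^T) - b||^2."""
    resid = np.array([np.trace(Ai @ (x @ x.T)) for Ai in A_mats]) - b
    return float(resid @ resid)

def grad(x):
    """Gradient of f: sum_i 2 r_i (A_i + A_i^T) x, with r_i the i-th residual."""
    resid = np.array([np.trace(Ai @ (x @ x.T)) for Ai in A_mats]) - b
    return sum(2.0 * ri * (Ai + Ai.T) @ x for ri, Ai in zip(resid, A_mats))

# Plain gradient descent from a random initial point (a stand-in for SGD).
x0 = rng.standard_normal((n, r))
x = x0.copy()
for _ in range(4000):
    x = x - 3e-5 * grad(x)
# On nearly-isotropic instances like this one, the error ||x x^T - Z||_F
# typically shrinks toward zero.
```

Note that the ground truth z is, by construction, a global minimizer with zero residual and zero gradient.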
The usual, nonconvex approach is to solve the following\n\nminimize_{x ∈ R^{n×r}} ‖A(xxᵀ) − b‖², where A(X) = [⟨A1, X⟩ · · · ⟨Am, X⟩]ᵀ, (1)\n\nto second-order optimality, using a local search algorithm like (stochastic) gradient descent [19, 24] and trust-region Newton's method [16, 7], starting from a random initial point.\nExact recovery of the ground truth Z is guaranteed under the assumption that A satisfies the restricted isometry property [14, 13, 31, 11] with a sufficiently small constant. The original result is due to Bhojanapalli et al. [4], though we adapt the statement below from a later result by Ge et al. [20, Theorem 8]. (Zhu et al. [43] give an equivalent statement for nonsymmetric matrices.)\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\nDefinition 1 (Restricted Isometry Property). The linear map A : R^{n×n} → R^m is said to satisfy (r, δr)-RIP with constant 0 ≤ δr < 1 if there exists a fixed scaling γ > 0 such that for all rank-r matrices X:\n\n(1 − δr)‖X‖²_F ≤ γ · ‖A(X)‖² ≤ (1 + δr)‖X‖²_F. (2)\n\nWe say that A satisfies r-RIP if A satisfies (r, δr)-RIP with some δr < 1.\nTheorem 2 (No spurious local minima). Let A satisfy (2r, δ2r)-RIP with δ2r < 1/5. Then, (1) has no spurious local minima: every local minimum x satisfies xxᵀ = Z, and every saddle point has an escape (the Hessian has a negative eigenvalue). 
Hence, any algorithm that converges to a second-order critical point is guaranteed to recover Z exactly.\nStandard proofs of Theorem 2 use a norm-preserving argument: if A satisfies (2r, δ2r)-RIP with a small constant δ2r, then we can view the least-squares residual A(xxᵀ) − b as a dimension-reduced embedding of the displacement vector xxᵀ − Z, as in\n\n‖A(xxᵀ) − b‖² = ‖A(xxᵀ − Z)‖² ≈ ‖xxᵀ − Z‖²_F up to scaling. (3)\n\nThe high-dimensional problem of minimizing ‖xxᵀ − Z‖²_F over x contains no spurious local minima, so its dimension-reduced embedding (1) should satisfy a similar statement. Indeed, this same argument can be repeated for noisy measurements and nonsymmetric matrices to yield similar guarantees [4, 20].\nThe norm-preserving argument also extends to "harder" choices of A that do not satisfy RIP over its entire domain. In the matrix completion problem, the RIP-like condition ‖A(X)‖² ≈ ‖X‖²_F holds only when X is both low-rank and sufficiently dense [12]. Nevertheless, Ge et al. [21] proved a result similar to Theorem 2 for this problem, by adding a regularizing term to the objective. For a detailed introduction to the norm-preserving argument and its extension with regularizers, we refer the interested reader to [21, 20].\n\n1.1 How much restricted isometry?\n\nThe RIP threshold δ2r < 1/5 in Theorem 2 is highly conservative: it is only applicable to nearly-isotropic measurements like Gaussian measurements. Let us put this point into perspective by measuring distortion using the condition number¹ κ2r ∈ [1, ∞). Deterministic linear maps from real-life applications usually have condition numbers κ2r between 10² and 10⁴, and these translate to RIP constants δ2r = (κ2r − 1)/(κ2r + 1) between 0.99 and 0.9999. 
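The conversion between condition number and RIP constant used above can be written down directly; the two helper functions below are ours, for illustration only.

```python
def rip_from_cond(kappa: float) -> float:
    """Condition number kappa = L/l -> equivalent RIP constant (kappa-1)/(kappa+1)."""
    return (kappa - 1.0) / (kappa + 1.0)

def cond_from_rip(delta: float) -> float:
    """RIP constant delta -> equivalent condition number (1+delta)/(1-delta)."""
    return (1.0 + delta) / (1.0 - delta)
```

For instance, `cond_from_rip(0.2)` maps the threshold δ2r = 1/5 back to κ2r = 3/2 (up to rounding), while κ2r = 10⁴ gives δ2r ≈ 0.9998.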
By contrast, the RIP threshold δ2r < 1/5 requires an equivalent condition number of κ2r = (1 + δ2r)/(1 − δ2r) < 3/2, which would be considered near-perfect in linear algebra.\nIn practice, nonconvex matrix completion works for a much wider class of problems than those suggested by Theorem 2 [6, 5, 32, 1]. Indeed, assuming only that A satisfies 2r-RIP, solving (1) to global optimality is enough to guarantee exact recovery [31, Theorem 3.2]. In turn, stochastic algorithms like stochastic gradient descent (SGD) are often able to attain global optimality. This disconnect between theory and practice motivates the following question.\nCan Theorem 2 be substantially improved? Is it possible to guarantee the inexistence of spurious local minima with (2r, δ2r)-RIP and any value of δ2r < 1?\nAt a basic level, the question gauges the generality and usefulness of RIP as a base assumption for nonconvex recovery. Every family of measurement operators A, even correlated and "bad" measurement ensembles, will eventually come to satisfy 2r-RIP as the number of measurements m grows large. Indeed, given m ≥ n(n + 1)/2 linearly independent measurements, the operator A becomes invertible, and hence trivially 2r-RIP. In this limit, recovering the ground truth Z from noiseless measurements is as easy as solving a system of linear equations. Yet, it remains unclear whether nonconvex recovery is guaranteed to succeed.\nAt a higher level, the question also gauges the wisdom of exact recovery guarantees through "no spurious local minima". It may be sufficient but not necessary; exact recovery may actually hinge\n\n¹Given a linear map, the condition number measures the ratio in size between the largest and smallest images, given a unit-sized input. 
Within our specific context, the 2r-restricted condition number is the smallest κ2r = L/ℓ such that ℓ‖X‖²_F ≤ ‖A(X)‖² ≤ L‖X‖²_F holds for all rank-2r matrices X.\n\nFigure 1: Solving Example 3 using stochastic gradient descent randomly initialized with the standard Gaussian. (Left) Histogram over 100,000 trials of final error ‖xxᵀ − Z‖_F after 10³ steps with learning rate α = 10⁻³ and momentum β = 0.9. (Right) Two typical stochastic gradient descent trajectories, showing convergence to the spurious local minimum at (0, 1/√2), and to the ground truth at (1, 0).\n\non SGD's ability to avoid and escape spurious local minima when they do exist. Indeed, there is growing empirical evidence that SGD outmaneuvers the "optimization landscape" of nonconvex functions [6, 5, 27, 32, 1], and achieves some global properties [22, 40, 39]. It remains unclear whether the success of SGD for matrix recovery should be attributed to the inexistence of spurious local minima, or to some global property of SGD.\n\n1.2 Our results\n\nIn this paper, we give a strong negative answer to the question above. Consider the counterexample below, which satisfies (2r, δ2r)-RIP with δ2r = 1/2, but nevertheless contains a spurious local minimum that causes SGD to fail in 12% of trials.\nExample 3. Consider the following (2, 1/2)-RIP instance of (1) with matrices\n\nZ = [1 0; 0 0], A1 = [√2 0; 0 1/√2], A2 = [0 √(3/2); √(3/2) 0], A3 = [0 0; 0 √(3/2)].\n\nNote that the associated operator A is invertible and satisfies ‖X‖²_F ≤ ‖A(X)‖² ≤ 3‖X‖²_F for all X. 
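The claimed norm bounds for Example 3 are easy to confirm numerically; the random sampling check below is our own sanity check, with variable names of our choosing.

```python
import numpy as np

s = np.sqrt(1.5)  # sqrt(3/2)
A1 = np.array([[np.sqrt(2.0), 0.0], [0.0, 1.0 / np.sqrt(2.0)]])
A2 = np.array([[0.0, s], [s, 0.0]])
A3 = np.array([[0.0, 0.0], [0.0, s]])

def A_op(X):
    """A(X) = (<A1, X>, <A2, X>, <A3, X>)."""
    return np.array([np.trace(M @ X) for M in (A1, A2, A3)])

# Empirically confirm ||X||_F^2 <= ||A(X)||^2 <= 3 ||X||_F^2 on random
# symmetric X (the measurement matrices only sense the symmetric part of X).
rng = np.random.default_rng(1)
for _ in range(1000):
    G = rng.standard_normal((2, 2))
    X = (G + G.T) / 2
    ratio = (A_op(X) @ A_op(X)) / (X * X).sum()
    assert 1.0 - 1e-9 <= ratio <= 3.0 + 1e-9
```

Both bounds are attained: X = I gives ‖A(X)‖² = 3‖X‖²_F, while X = diag(1, −1) gives ‖A(X)‖² = ‖X‖²_F.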
Nevertheless, the point x = (0, 1/√2) satisfies second-order optimality,\n\nf(x) ≡ ‖A(xxᵀ − Z)‖² = 3/2, ∇f(x) = [0; 0], ∇²f(x) = [0 0; 0 8] ⪰ 0,\n\nand randomly initialized SGD can indeed become stranded around this point, as shown in Figure 1. Repeating these trials 100,000 times yields 87,947 successful trials, for a failure rate of 12.1 ± 0.3% to three standard deviations.\n\nAccordingly, RIP-based exact recovery guarantees like Theorem 2 cannot be improved beyond δ2r < 1/2. Otherwise, spurious local minima can exist, and SGD may become trapped. Using a local search algorithm with a random initialization, "no spurious local minima" is not only sufficient for exact recovery, but also necessary.\nIn fact, there exists an infinite number of counterexamples like Example 3. In Section 3, we prove that, in the rank-1 case, almost every choice of x, Z generates an instance of (1) with a strict spurious local minimum.\nTheorem 4 (Informal). Let x, z ∈ Rⁿ be nonzero and not colinear. Then, there exists an instance of (1) satisfying (n, δn)-RIP with δn < 1 that has Z = zzᵀ as the ground truth and x as a strict spurious local minimum, i.e. with zero gradient and a positive definite Hessian. Moreover, δn is bounded in terms of the length ratio ρ = ‖x‖/‖z‖ and the incidence angle φ satisfying xᵀz = ‖x‖‖z‖ cos φ as\n\nδn ≤ (τ + √(1 − ζ²)) / (τ + 1), where ζ = √( sin²φ / ((ρ² − 1)² + 2ρ² sin²φ) ), τ = 2√(ρ² + ρ⁻²) / ζ².\n\nIt is therefore impossible to establish "no spurious local minima" guarantees unless the RIP constant δ is small. 
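Theorem 4's bound is straightforward to evaluate; the function below (our naming) computes ζ, τ, and the resulting bound on δn from ρ and φ.

```python
import math

def delta_bound(rho: float, phi: float) -> float:
    """Theorem 4's bound on delta_n from the length ratio rho = ||x||/||z||
    and the incidence angle phi (assumes phi != 0 and rho > 0 finite)."""
    zeta2 = math.sin(phi) ** 2 / ((rho ** 2 - 1.0) ** 2
                                  + 2.0 * rho ** 2 * math.sin(phi) ** 2)
    tau = 2.0 * math.sqrt(rho ** 2 + rho ** -2) / zeta2
    return (tau + math.sqrt(1.0 - zeta2)) / (tau + 1.0)
```

The bound is strictly below 1 for every φ ≠ 0 and finite ρ > 0, and it tends to 1 as φ → 0 or as ρ becomes extreme, consistent with the statement of Theorem 4.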
This is a strong negative result on the generality and usefulness of RIP as a base assumption, and also on the wider norm-preserving argument described earlier in the introduction.\nIn Section 4, we provide strong empirical evidence for the following sharp version of Theorem 2.\nConjecture 5. Let A satisfy (2r, δ2r)-RIP with δ2r < 1/2. Then, (1) has no spurious local minima. Moreover, the figure of 1/2 is sharp due to the existence of Example 3.\n\nHow is the practical performance of SGD affected by spurious local minima? In Section 5, we apply randomly initialized SGD to instances of (1) engineered to contain spurious local minima. In one case, SGD recovers the ground truth with a 100% success rate, as if the spurious local minima did not exist. But in another case, SGD fails in 59 of 1,000 trials, for a positive failure rate of 5.90 ± 2.24% to three standard deviations. Examining the failure cases, we observe that SGD indeed becomes trapped around a spurious local minimum, similar to Figure 1 in Example 3.\n\n1.3 Related work\n\nThere has been considerable recent interest in understanding the empirical "hardness" of nonconvex optimization, in view of its well-established theoretical difficulties. Nonconvex functions contain saddle points and spurious local minima, and local search algorithms may become trapped in them. Recent work has generally found the matrix sensing problem to be "easy", particularly under an RIP-like incoherence assumption. Our results in this paper counter this intuition, showing, perhaps surprisingly, that the problem is generically "hard" even under RIP.\nComparison to convex recovery. Classical theory for the low-rank matrix recovery problem is based on convex relaxation: replacing xxᵀ in (1) by a convex term X ⪰ 0, and augmenting the objective with a trace penalty λ · tr(X) to induce a low-rank solution [12, 31, 15, 11]. 
The convex approach enjoys RIP-based exact recovery guarantees [11], but these are also fundamentally restricted to small RIP constants [10, 38], in direct analogy with our results for nonconvex recovery. In practice, convex recovery is usually much more expensive than nonconvex recovery, because it requires optimizing over an n × n matrix variable instead of an n × r vector-like variable. On the other hand, it is statistically consistent [3], and guaranteed to succeed with m ≥ n(n + 1)/2 noiseless, linearly independent measurements. By comparison, our results show that nonconvex recovery can still fail in this regime.\nConvergence to spurious local minima. Recent results on "no spurious local minima" are often established using a norm-preserving argument: the problem at hand is the low-dimension embedding of a canonical problem known to contain no spurious local minima [19, 34, 35, 4, 21, 20, 30, 43]. While the approach is widely applicable in its scope, our results in this paper find it to be restrictive in the problem data. More specifically, the measurement matrices A1, . . . , Am must come from a nearly-isotropic ensemble like the Gaussian and the sparse binary.\nSpecial initialization schemes. An alternative way to guarantee exact recovery is to place the initial point sufficiently close to the global optimum [25, 26, 23, 42, 41, 36]. This approach is more general because it does not require a global "no spurious local minima" guarantee. On the other hand, good initializations are highly problem-specific and difficult to generalize. Our results show that spurious local minima can exist arbitrarily close to the solution. Hence, exact recovery guarantees must give proof of local attraction, beyond simply starting close to the ground truth.\nAbility of SGD to escape spurious local minima. 
Practitioners have long known that stochastic gradient descent (SGD) enjoys properties inherently suitable for the sort of nonconvex optimization problems that appear in machine learning [27, 6], and that it is well-suited for generalizing to unseen data [22, 40, 39]. Its specific behavior is not yet well understood, but it is commonly conjectured that SGD outperforms classically "better" algorithms like BFGS because it is able to avoid and escape spurious local minima. Our empirical findings in Section 5 partially confirm this suspicion, showing that randomly initialized SGD is sometimes able to avoid and escape spurious local minima as if they did not exist. In other cases, however, SGD can indeed become stuck at a local minimum, thereby resulting in a positive failure rate.\n\nNotation\n\nWe use x to refer to any candidate point, and Z = zzᵀ to refer to a rank-r factorization of the ground truth Z. For clarity, we use lower-case x, z even when these are n × r matrices.\nThe sets R^{n×n} ⊃ Sⁿ are the space of n × n real matrices and real symmetric matrices, and ⟨X, Y⟩ ≡ tr(XᵀY) and ‖X‖²_F ≡ ⟨X, X⟩ are the Frobenius inner product and norm. We write X ⪰ 0 (resp. X ≻ 0) if X is positive semidefinite (resp. positive definite). Given a matrix M, its spectral norm is ‖M‖, and its eigenvalues are λ1(M), . . . , λn(M). If M = Mᵀ, then λ1(M) ≥ · · · ≥ λn(M) and λmax(M) ≡ λ1(M), λmin(M) ≡ λn(M). If M is invertible, then its condition number is cond(M) = ‖M‖‖M⁻¹‖; if not, then cond(M) = ∞.\nThe vectorization operator vec : R^{n×n} → R^{n²} preserves inner products ⟨X, Y⟩ = vec(X)ᵀvec(Y) and Euclidean norms ‖X‖_F = ‖vec(X)‖. 
In each case, the matricization operator mat(·) is the inverse of vec(·).\n\n2 Key idea: Spurious local minima via convex optimization\n\nGiven arbitrary x ∈ R^{n×r} and a rank-r positive semidefinite matrix Z ∈ Sⁿ, consider the problem of finding an instance of (1) with Z as the ground truth and x as a spurious local minimum. While not entirely obvious, this problem is actually convex, because the first- and second-order optimality conditions associated with (1) are linear matrix inequality (LMI) constraints [9] with respect to the kernel operator H ≡ AᵀA. The problem of finding an instance of (1) that also satisfies RIP is indeed nonconvex. However, we can use the condition number of H as a surrogate for the RIP constant δ of A: if the former is finite, then the latter is guaranteed to be less than 1. The resulting optimization is convex, and can be numerically solved to high accuracy using an interior-point method, like those implemented in SeDuMi [33], SDPT3 [37], and MOSEK [2].\nWe begin by fixing some definitions. Given a choice of A : Sⁿ → R^m and the ground truth Z = zzᵀ, we define the nonconvex objective\n\nf : R^{n×r} → R such that f(x) = ‖A(xxᵀ − zzᵀ)‖², (4)\n\nwhose value is always nonnegative by construction. If the point x attains f(x) = 0, then we call it a global minimum; otherwise, we call it a spurious point. Under RIP, x is a global minimum if and only if xxᵀ = zzᵀ [31, Theorem 3.2]. The point x is said to be a local minimum if f(x) ≤ f(x′) holds for all x′ within a local neighborhood of x. 
If x is a local minimum, then it must satisfy the first- and second-order necessary optimality conditions (with some fixed μ ≥ 0):\n\n⟨∇f(x), u⟩ = 2⟨A(xxᵀ − zzᵀ), A(xuᵀ + uxᵀ)⟩ = 0 for all u ∈ R^{n×r}, (5)\n⟨∇²f(x)u, u⟩ = 4⟨A(xxᵀ − zzᵀ), A(uuᵀ)⟩ + 2‖A(xuᵀ + uxᵀ)‖² ≥ μ‖u‖²_F for all u ∈ R^{n×r}. (6)\n\nConversely, if x satisfies the second-order sufficient optimality conditions, that is (5)-(6) with μ > 0, then it is a local minimum. Local search algorithms are only guaranteed to converge to a first-order critical point x satisfying (5), or a second-order critical point x satisfying (5)-(6) with μ ≥ 0. The latter class of algorithms includes stochastic gradient descent [19], randomized and noisy gradient descent [19, 28, 24, 18], and various trust-region methods [17, 29, 16, 7].\nGiven arbitrary choices of x, z ∈ R^{n×r}, we formulate the problem of picking an A satisfying (5) and (6) as an LMI feasibility problem. First, we define A = [vec(A1), . . . , vec(Am)]ᵀ satisfying A · vec(X) = A(X) for all X as the matrix representation of the operator A. Then, we rewrite (5) and (6) as 2 · L(AᵀA) = 0 and 2 · M(AᵀA) ⪰ μI, where the linear operators L and M are defined\n\nL : S^{n²} → R^{n×r} such that L(H) ≡ 2 · XᵀHe, (7)\nM : S^{n²} → S^{nr×nr} such that M(H) ≡ 2 · [Ir ⊗ mat(He)ᵀ] + XᵀHX, (8)\n\nwith respect to the error vector e = vec(xxᵀ − zzᵀ) and the n² × nr matrix X that implements the symmetric product operator X · vec(u) = vec(xuᵀ + uxᵀ). 
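The objects above can be assembled explicitly and sanity-checked against finite differences of f. The construction below (variable names ours) takes H = I, i.e. a perfectly isometric kernel AᵀA, purely as an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(2)
n, r = 4, 1
x = rng.standard_normal((n, r))
z = rng.standard_normal((n, r))

vec = lambda M: np.asarray(M).reshape(-1, order="F")      # column-major vec(.)
mat = lambda v: np.asarray(v).reshape((n, n), order="F")  # its inverse

e = vec(x @ x.T - z @ z.T)  # error vector e = vec(xx^T - zz^T)

# X implements the symmetric product: X @ vec(u) = vec(x u^T + u x^T).
X = np.zeros((n * n, n * r))
for j in range(n * r):
    U = np.zeros((n, r))
    U[np.unravel_index(j, (n, r), order="F")] = 1.0
    X[:, j] = vec(x @ U.T + U @ x.T)

def L_op(H):
    """L(H) = 2 X^T H e: the vectorized gradient of f at x."""
    return 2.0 * X.T @ H @ e

def M_op(H):
    """M(H) = 2 [I_r kron mat(He)^T] + X^T H X: half the Hessian form of f at x."""
    return 2.0 * np.kron(np.eye(r), mat(H @ e).T) + X.T @ H @ X

H = np.eye(n * n)  # illustrative choice: perfectly isometric A^T A = I

def f(xv):
    xm = xv.reshape((n, r), order="F")
    d = vec(xm @ xm.T - z @ z.T)
    return float(d @ H @ d)

# Finite differences confirm: L_op(H) is the gradient, 2 M_op(H) the Hessian form.
xv, eps = vec(x), 1e-3
I = np.eye(n * r)
num_grad = np.array([(f(xv + eps * I[j]) - f(xv - eps * I[j])) / (2 * eps)
                     for j in range(n * r)])
u = rng.standard_normal(n * r)
num_curv = (f(xv + eps * u) - 2.0 * f(xv) + f(xv - eps * u)) / eps ** 2
assert np.allclose(num_grad, L_op(H), atol=1e-4)
assert abs(num_curv - u @ (2.0 * M_op(H)) @ u) < 1e-3 * max(1.0, abs(num_curv))
```

With a general H ⪰ 0 in place of the identity, the same code evaluates the LMI data appearing in the feasibility problem that follows.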
To compute a choice of A satisfying L(AᵀA) = 0 and M(AᵀA) ⪰ 0, we solve the following LMI feasibility problem\n\nmaximize_H 0 subject to L(H) = 0, M(H) ⪰ μI, H ⪰ 0, (9)\n\nand factor a feasible H back into AᵀA, e.g. using Cholesky factorization or an eigendecomposition. Once a matrix representation A is found, we recover the matrices A1, . . . , Am implementing the operator A by matricizing each row of A.\nNow, the problem of picking A with the smallest condition number may be formulated as the following LMI optimization\n\nmaximize_{H,η} η subject to ηI ⪯ H ⪯ I, L(H) = 0, M(H) ⪰ μI, H ⪰ 0, (10)\n\nwith solution H⋆, η⋆. Then, 1/η⋆ is the best condition number achievable, and any A recovered from H⋆ will satisfy\n\n(1 − (1 − η⋆)/(1 + η⋆)) ‖X‖²_F ≤ (2/(1 + η⋆)) ‖A(X)‖² ≤ (1 + (1 − η⋆)/(1 + η⋆)) ‖X‖²_F\n\nfor all X, that is, with any rank. As such, A is (n, δn)-RIP with δn = (1 − η⋆)/(1 + η⋆), and hence also (p, δp)-RIP with δp ≤ δn for all p ∈ {1, . . . , n}; see e.g. [31, 11]. If the optimal value η⋆ is strictly positive, then the recovered A yields an RIP instance of (1) with zzᵀ as the ground truth and x as a spurious local minimum, as desired.\nIt is worth emphasizing that a small condition number, i.e. a large η⋆ in (10), will always yield a small RIP constant δn, which then bounds all other RIP constants via δn ≥ δp for all p ∈ {1, . . .
, n}. However, the converse direction is far less useful, as the value of δn = 1 does not preclude δp with p < n from being small.\n\n3 Closed-form solutions\n\nIt turns out that the LMI problem (10) in the rank-1 case is sufficiently simple that it can be solved in closed form. (All proofs are given in the Appendix.) Let x, z ∈ Rⁿ be arbitrary nonzero vectors, and define\n\nρ ≡ ‖x‖/‖z‖, φ ≡ arccos( xᵀz / (‖x‖‖z‖) ), (11)\n\nas their associated length ratio and incidence angle. We begin by examining the prevalence of spurious critical points.\nTheorem 6 (First-order optimality). The best-conditioned H⋆ ⪰ 0 such that L(H⋆) = 0 satisfies\n\ncond(H⋆) = (1 + √(1 − ζ²)) / (1 − √(1 − ζ²)), where ζ ≡ sin φ / √((ρ² − 1)² + 2ρ² sin²φ). (12)\n\nHence, if φ ≠ 0, then x is a first-order critical point for an instance of (1) satisfying (2, δ)-RIP with δ = √(1 − ζ²) < 1 given in (12).\nThe point x = 0 is always a local maximum for f, and hence a spurious first-order critical point. With a perfect RIP constant δ = 0, Theorem 6 says that x = 0 is also the only spurious first-order critical point. Otherwise, spurious first-order critical points may exist elsewhere, even when the RIP constant δ is arbitrarily close to zero. This result highlights the importance of converging to second-order optimality, in order to avoid getting stuck at a spurious first-order critical point.\nNext, we examine the prevalence of spurious local minima.\nTheorem 7 (Second-order optimality). 
There exists H satisfying L(H) = 0, M(H) ⪰ μI, and ηI ⪯ H ⪯ I where\n\nη ≥ (1 − √(1 − ζ²)) / (1 + τ), μ = ζ²‖z‖² / (1 + τ), τ ≡ 2√(ρ² + ρ⁻²) / ζ²,\n\nand ζ is defined in (12). Hence, if φ ≠ 0 and ρ > 0 is finite, then x is a strict local minimum for an instance of (1) satisfying (2, δ)-RIP with δ = (τ + √(1 − ζ²)) / (1 + τ) < 1.\nIf φ ≠ 0 and ρ > 0, then x is guaranteed to be a strict local minimum for a problem instance satisfying 2-RIP. Hence, we must conclude that spurious local minima are ubiquitous. The associated RIP constant δ < 1 is not too much worse than the figure quoted in Theorem 6. On the other hand, spurious local minima must cease to exist once δ < 1/5, according to Theorem 2.\n\n4 Experiment 1: Minimum δ with spurious local minima\n\nWhat is the smallest RIP constant δ2r that still admits an instance of (1) with spurious local minima? Let us define the threshold value as the following\n\nδ⋆ = min_{x,Z,A} {δ : ∇f(x) = 0, ∇²f(x) ⪰ 0, A satisfies (2r, δ)-RIP}. (13)\n\nHere, we write f(x) = ‖A(xxᵀ − Z)‖², and optimize over the spurious local minimum x ∈ R^{n×r}, the rank-r ground truth Z ⪰ 0, and the linear operator A : R^{n×n} → R^m. Note that δ⋆ gives a "no spurious local minima" guarantee, due to the inexistence of counterexamples.\nProposition 8. Let A satisfy (2r, δ2r)-RIP. If δ2r < δ⋆, then (1) has no spurious local minimum.\n\nProof. Suppose that (1) contained a spurious local minimum x for ground truth Z. 
Then, substituting this choice of x, Z, A into (13) would contradict the definition of δ⋆ as the minimum.\n\nOur convex formulation in Section 2 bounds δ⋆ from above. Specifically, our LMI problem (10) with optimal value η⋆ is equivalent to the following variant of (13)\n\nδub(x, Z) = min_A {δ : ∇f(x) = 0, ∇²f(x) ⪰ 0, A satisfies (n, δ)-RIP}, (14)\n\nwith optimal value δub(x, Z) = (1 − η⋆)/(1 + η⋆). Now, (14) gives an upper bound on (13) because (n, δ)-RIP is a sufficient condition for (2r, δ)-RIP. Hence, we have δub(x, Z) ≥ δ⋆ for every valid choice of x and Z.\nThe same convex formulation can be modified to bound δ⋆ from below². Specifically, a necessary condition for A to satisfy (2r, δ2r)-RIP is the following\n\n(1 − δ2r)‖UYUᵀ‖²_F ≤ ‖A(UYUᵀ)‖² ≤ (1 + δ2r)‖UYUᵀ‖²_F for all Y ∈ R^{2r×2r}, (15)\n\nwhere U is a fixed n × 2r matrix. This is a convex linear matrix inequality; substituting (15) into (13) in lieu of (2r, δ)-RIP yields a convex optimization problem\n\nδlb(x, Z, U) = min_A {δ : ∇f(x) = 0, ∇²f(x) ⪰ 0, (15)}, (16)\n\nthat generates lower bounds δ⋆ ≥ δlb(x, Z, U).\nOur best upper bound is likely δ⋆ ≤ 1/2. The existence of Example 3 gives the upper bound of δ⋆ ≤ 1/2. To improve upon this bound, we randomly sample x, z ∈ R^{n×r} i.i.d. from the standard Gaussian, and evaluate δub(x, zzᵀ) using MOSEK [2]. We perform the experiment for 3 hours on each tuple (n, r) ∈ {1, 2, . . . , 10} × {1, 2} but obtain δub(x, zzᵀ) ≥ 1/2 for every x and z considered.\nThe threshold is likely δ⋆ = 1/2. 
Now, we randomly sample x, z ∈ R^{n×r} i.i.d. from the standard Gaussian. For each fixed {x, z}, we set U = [x, z] and evaluate δlb(x, Z, U) using MOSEK [2]. We perform the same experiment as above, but find that δlb(x, zzᵀ, U) ≥ 1/2 for every x and z considered. Combined with the existence of the upper bound δ⋆ ≤ 1/2, these experiments strongly suggest that δ⋆ = 1/2.\n\n5 Experiment 2: SGD escapes spurious local minima\n\nHow is the performance of SGD affected by the presence of spurious local minima? Given that spurious local minima cease to exist with δ < 1/5, we might conjecture that the performance of SGD is a decreasing function of δ. Indeed, this conjecture is generally supported by evidence from\n\n²We thank an anonymous reviewer for this key insight.\n\nFigure 2: "Bad" instance (n = 12, r = 2) with RIP constant δ = 0.973 and spurious local min at xloc satisfying ‖xxᵀ‖_F/‖zzᵀ‖_F ≈ 4. Here, γ controls the initial SGD point x = γw + (1 − γ)xloc where w is random Gaussian. (Left) Error distribution after 10,000 SGD steps (rate 10⁻⁴, momentum 0.9) over 1,000 trials. Line: median. Inner bands: 5%-95% quantile. Outer bands: min/max. (Right top) Random initialization with γ = 1; (Right bottom) Initialization at local min with γ = 0.\n\nFigure 3: "Good" instance (n = 12, r = 1) with RIP constant δ = 1/2 and spurious local min at xloc satisfying ‖xxᵀ‖_F/‖zzᵀ‖_F = 1/2 and xᵀz = 0. Here, γ controls the initial SGD point x = γw + (1 − γ)xloc where w is random Gaussian. (Left) Error distribution after 10,000 SGD steps (rate 10⁻³, momentum 0.9) over 1,000 trials. Line: median. Inner bands: 5%-95% quantile. Outer bands: min/max. 
(Right top) Random initialization γ = 1 with success; (Right bottom) Random initialization γ = 1 with failure.

the nearly-isotropic measurement ensembles [6, 5, 32, 1], all of which show improving performance with an increasing number of measurements m.

This section empirically measures SGD (with momentum, fixed learning rates, and batch sizes of one) on two instances of (1) with different values of δ, both engineered to contain spurious local minima by numerically solving (10). We consider a "bad" instance, with δ = 0.975 and rank r = 2, and a "good" instance, with δ = 1/2 and rank r = 1. The condition number of the "bad" instance is 25 times higher than that of the "good" instance, so classical theory suggests the former to be a factor of 5-25 times harder to solve than the latter. Moreover, the "good" instance is locally strongly convex at its isolated global minima while the "bad" instance is only locally weakly convex, so first-order methods like SGD should locally converge at a linear rate for the former, and sublinearly for the latter.

SGD consistently succeeds on the "bad" instance with δ = 0.975 and r = 2. We generate the "bad" instance by fixing n = 12, r = 2, selecting x, z ∈ ℝ^{n×r} i.i.d. from the standard Gaussian, rescaling z so that ‖zz^T‖_F = 1, rescaling x so that ‖xx^T‖_F/‖zz^T‖_F ≈ 4, and solving (10); the results are shown in Figure 2. The results at γ ≈ 0 validate xloc as a true local minimum: if initialized here, then SGD remains stuck here with > 100% error.
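The SGD variant used in these experiments (momentum, fixed learning rate, batch size one) is standard, and can be sketched on a generic matrix-sensing instance as follows. The Gaussian ensemble, dimensions, learning rate, step count, and seed below are illustrative assumptions, not the engineered instances obtained from (10).

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, m = 6, 1, 40  # small generic instance for illustration

# Measurement matrices A_1, ..., A_m and noiseless data b_k = <A_k, z z^T>.
A = rng.standard_normal((m, n, n)) / np.sqrt(m)
A = (A + A.transpose(0, 2, 1)) / 2           # symmetrize for convenience
z = rng.standard_normal((n, r))
Z = z @ z.T
b = np.einsum('kij,ij->k', A, Z)

def stochastic_grad(x, k):
    """Gradient of the k-th summand of f(x) = sum_k (<A_k, x x^T> - b_k)^2."""
    resid = np.einsum('ij,ij->', A[k], x @ x.T) - b[k]
    return 4.0 * resid * (A[k] @ x)          # valid since A_k is symmetric

# SGD with momentum 0.9, fixed learning rate, batch size one.
x = rng.standard_normal((n, r))              # gamma = 1: fully random start
v = np.zeros_like(x)
lr, beta = 1e-3, 0.9
for _ in range(30000):
    k = int(rng.integers(m))
    v = beta * v - lr * stochastic_grad(x, k)
    x = x + v

rel_err = np.linalg.norm(x @ x.T - Z) / np.linalg.norm(Z)
print(f"relative error ||xx^T - zz^T||_F / ||zz^T||_F = {rel_err:.1e}")
```

Comparing xx^T against zz^T (rather than x against z) sidesteps the sign ambiguity of the factorization; this is the same error metric reported in the figures.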
The results at γ ≈ 1 show randomly initialized SGD either escaping our engineered spurious local minimum, or avoiding it altogether. All 1,000 trials at γ = 1 recover the ground truth to < 1% accuracy, with the 95% quantile at ≈ 0.6%.

SGD consistently fails on the "good" instance with δ = 1/2 and r = 1. We generate the "good" instance with n = 12 and r = 1 using the procedure in the previous section; the results are shown in Figure 3. As expected, the results at γ ≈ 0 validate xloc as a true local minimum. However, even with γ = 1 yielding a random initialization, 59 of the 1,000 trials still result in an error of > 50%, thereby yielding a failure rate of 5.90 ± 2.24% up to three standard deviations. Examining the failed trials more closely, we do indeed find SGD hovering around our engineered spurious local minimum.

Repeating the experiment over other instances of (1) obtained by solving (10) with randomly selected x, z, we generally obtain graphs that look like Figure 2. In other words, SGD usually escapes spurious local minima even when they are engineered to exist. These observations continue to hold true even with massive condition numbers on the order of 10⁴, with corresponding RIP constant δ = 1 − 10⁻⁴. On the other hand, we do occasionally sample well-conditioned instances that behave closer to the "good" instance described above, causing SGD to consistently fail.

6 Conclusions

The nonconvex formulation of low-rank matrix recovery is highly effective, despite the apparent risk of getting stuck at a spurious local minimum.
Recent results have shown that if the linear measurements of the low-rank matrix satisfy a restricted isometry property (RIP), then the problem contains no spurious local minima, so exact recovery is guaranteed. Most of these existing results are based on a norm-preserving argument: relating ‖A(xx^T − Z)‖ ≈ ‖xx^T − Z‖_F and arguing that a lack of spurious local minima in the latter implies a similar statement in the former.

Our key message in this paper is that moderate RIP is not enough to eliminate spurious local minima. To prove this, we formulate a convex optimization problem in Section 2 that generates counterexamples that satisfy RIP but contain spurious local minima. Solving this convex formulation in closed form in Section 3 shows that counterexamples are ubiquitous: almost any rank-1 Z ⪰ 0 and any x ∈ ℝⁿ can respectively be the ground truth and spurious local minimum of an instance of matrix recovery satisfying RIP. We gave one specific counterexample with RIP constant δ = 1/2 in the introduction that causes randomly initialized stochastic gradient descent (SGD) to fail 12% of the time.

Moreover, stochastic gradient descent (SGD) is often but not always able to avoid and escape spurious local minima. In Section 5, randomly initialized SGD solved one example with a 100% success rate over 1,000 trials, despite the presence of spurious local minima. However, it failed with a consistent rate of ≈ 6% on another example with an RIP constant of just 1/2. Hence, as long as spurious local minima exist, we cannot expect to guarantee exact recovery with SGD (without a much deeper understanding of the algorithm).

Overall, exact recovery guarantees will generally require a proof of no spurious local minima. However, arguments based solely on norm preservation are conservative, because most measurements are not isotropic enough to eliminate spurious local minima.

Acknowledgements

We thank our three NIPS reviewers for helpful comments and suggestions. In particular, we thank reviewer #2 for a key insight that allowed us to lower-bound δ⋆ in Section 4. This work was supported by the ONR Awards N00014-17-1-2933 and N00014-18-1-2526, NSF Award 1808859, DARPA Award D16AP00002, and AFOSR Award FA9550-17-1-0163.

References

[1] Alekh Agarwal, Olivier Chapelle, Miroslav Dudík, and John Langford. A reliable effective terascale linear learning system. The Journal of Machine Learning Research, 15(1):1111–1133, 2014.

[2] Erling D Andersen and Knud D Andersen. The MOSEK interior point optimizer for linear programming: an implementation of the homogeneous algorithm. In High Performance Optimization, pages 197–232. Springer, 2000.

[3] Francis R Bach. Consistency of trace norm minimization. Journal of Machine Learning Research, 9(Jun):1019–1048, 2008.

[4] Srinadh Bhojanapalli, Behnam Neyshabur, and Nati Srebro. Global optimality of local search for low rank matrix recovery. In Advances in Neural Information Processing Systems, pages 3873–3881, 2016.

[5] Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010, pages 177–186. Springer, 2010.

[6] Léon Bottou and Olivier Bousquet. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems, pages 161–168, 2008.

[7] Nicolas Boumal, P-A Absil, and Coralia Cartis. Global rates of convergence for nonconvex optimization on manifolds. IMA Journal of Numerical Analysis, 2018.

[8] Nicolas Boumal, Vlad Voroninski, and Afonso Bandeira. The non-convex Burer-Monteiro approach works on smooth semidefinite programs.
In Advances in Neural Information Processing Systems, pages 2757–2765, 2016.

[9] Stephen Boyd, Laurent El Ghaoui, Eric Feron, and Venkataramanan Balakrishnan. Linear Matrix Inequalities in System and Control Theory, volume 15. SIAM, 1994.

[10] T Tony Cai and Anru Zhang. Sharp RIP bound for sparse signal and low-rank matrix recovery. Applied and Computational Harmonic Analysis, 35(1):74–93, 2013.

[11] Emmanuel J Candes and Yaniv Plan. Tight oracle inequalities for low-rank matrix recovery from a minimal number of noisy random measurements. IEEE Transactions on Information Theory, 57(4):2342–2359, 2011.

[12] Emmanuel J Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717, 2009.

[13] Emmanuel J Candes, Justin K Romberg, and Terence Tao. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics, 59(8):1207–1223, 2006.

[14] Emmanuel J Candes and Terence Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203–4215, 2005.

[15] Emmanuel J Candès and Terence Tao. The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5):2053–2080, 2010.

[16] Coralia Cartis, Nicholas IM Gould, and Ph L Toint. Complexity bounds for second-order optimality in unconstrained optimization. Journal of Complexity, 28(1):93–108, 2012.

[17] Andrew R Conn, Nicholas IM Gould, and Ph L Toint. Trust Region Methods, volume 1. SIAM, 2000.

[18] Simon S Du, Chi Jin, Jason D Lee, Michael I Jordan, Aarti Singh, and Barnabas Poczos. Gradient descent can take exponential time to escape saddle points. In Advances in Neural Information Processing Systems, pages 1067–1077, 2017.

[19] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan.
Escaping from saddle points–online stochastic gradient for tensor decomposition. In Conference on Learning Theory, pages 797–842, 2015.

[20] Rong Ge, Chi Jin, and Yi Zheng. No spurious local minima in nonconvex low rank problems: A unified geometric analysis. In International Conference on Machine Learning, pages 1233–1242, 2017.

[21] Rong Ge, Jason D Lee, and Tengyu Ma. Matrix completion has no spurious local minimum. In Advances in Neural Information Processing Systems, pages 2973–2981, 2016.

[22] Moritz Hardt, Ben Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. In International Conference on Machine Learning, pages 1225–1234, 2016.

[23] Prateek Jain, Praneeth Netrapalli, and Sujay Sanghavi. Low-rank matrix completion using alternating minimization. In Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing, pages 665–674. ACM, 2013.

[24] Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M Kakade, and Michael I Jordan. How to escape saddle points efficiently. In International Conference on Machine Learning, pages 1724–1732, 2017.

[25] Raghunandan H Keshavan, Andrea Montanari, and Sewoong Oh. Matrix completion from a few entries. IEEE Transactions on Information Theory, 56(6):2980–2998, 2010.

[26] Raghunandan H Keshavan, Andrea Montanari, and Sewoong Oh. Matrix completion from noisy entries. Journal of Machine Learning Research, 11(Jul):2057–2078, 2010.

[27] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[28] Jason D Lee, Max Simchowitz, Michael I Jordan, and Benjamin Recht. Gradient descent only converges to minimizers.
In Conference on Learning Theory, pages 1246–1257, 2016.

[29] Yurii Nesterov and Boris T Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.

[30] Dohyung Park, Anastasios Kyrillidis, Constantine Caramanis, and Sujay Sanghavi. Non-square matrix sensing without spurious local minima via the Burer-Monteiro approach. In Artificial Intelligence and Statistics, pages 65–74, 2017.

[31] Benjamin Recht, Maryam Fazel, and Pablo A Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471–501, 2010.

[32] Benjamin Recht and Christopher Ré. Parallel stochastic gradient algorithms for large-scale matrix completion. Mathematical Programming Computation, 5(2):201–226, 2013.

[33] Jos F Sturm. Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones. Optimization Methods and Software, 11(1-4):625–653, 1999.

[34] Ju Sun, Qing Qu, and John Wright. Complete dictionary recovery using nonconvex optimization. In International Conference on Machine Learning, pages 2351–2360, 2015.

[35] Ju Sun, Qing Qu, and John Wright. A geometric analysis of phase retrieval. In Information Theory (ISIT), 2016 IEEE International Symposium on, pages 2379–2383. IEEE, 2016.

[36] Ruoyu Sun and Zhi-Quan Luo. Guaranteed matrix completion via non-convex factorization. IEEE Transactions on Information Theory, 62(11):6535–6579, 2016.

[37] Kim-Chuan Toh, Michael J Todd, and Reha H Tütüncü. SDPT3: a MATLAB software package for semidefinite programming, version 1.3. Optimization Methods and Software, 11(1-4):545–581, 1999.

[38] HuiMin Wang and Song Li. The bounds of restricted isometry constants for low rank matrices recovery.
Science China Mathematics, 56(6):1117–1127, 2013.

[39] Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nati Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. In Advances in Neural Information Processing Systems, pages 4151–4161, 2017.

[40] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.

[41] Tuo Zhao, Zhaoran Wang, and Han Liu. A nonconvex optimization framework for low rank matrix estimation. In Advances in Neural Information Processing Systems, pages 559–567, 2015.

[42] Qinqing Zheng and John Lafferty. A convergent gradient descent algorithm for rank minimization and semidefinite programming from random linear measurements. In Advances in Neural Information Processing Systems, pages 109–117, 2015.

[43] Zhihui Zhu, Qiuwei Li, Gongguo Tang, and Michael B Wakin. Global optimality in low-rank matrix optimization. IEEE Transactions on Signal Processing, 66(13):3614–3628, 2018.