{"title": "The non-convex Burer-Monteiro approach works on smooth semidefinite programs", "book": "Advances in Neural Information Processing Systems", "page_first": 2757, "page_last": 2765, "abstract": "Semidefinite programs (SDP's) can be solved in polynomial time by interior point methods, but scalability can be an issue. To address this shortcoming, over a decade ago, Burer and Monteiro proposed to solve SDP's with few equality constraints via rank-restricted, non-convex surrogates. Remarkably, for some applications, local optimization methods seem to converge to global optima of these non-convex surrogates reliably. Although some theory supports this empirical success, a complete explanation of it remains an open question. In this paper, we consider a class of SDP's which includes applications such as max-cut, community detection in the stochastic block model, robust PCA, phase retrieval and synchronization of rotations. We show that the low-rank Burer-Monteiro formulation of SDP's in that class almost never has any spurious local optima.", "full_text": "The non-convex Burer–Monteiro approach works on smooth semidefinite programs\n\nNicolas Boumal*\nDepartment of Mathematics, Princeton University\nnboumal@math.princeton.edu\n\nVladislav Voroninski*\nDepartment of Mathematics, Massachusetts Institute of Technology\nvvlad@math.mit.edu\n\nAfonso S. Bandeira\nDepartment of Mathematics and Center for Data Science, Courant Institute of Mathematical Sciences, New York University\nbandeira@cims.nyu.edu\n\nAbstract\n\nSemidefinite programs (SDP's) can be solved in polynomial time by interior point methods, but scalability can be an issue. To address this shortcoming, over a decade ago, Burer and Monteiro proposed to solve SDP's with few equality constraints via rank-restricted, non-convex surrogates. 
Remarkably, for some applications, local optimization methods seem to converge to global optima of these non-convex surrogates reliably. Although some theory supports this empirical success, a complete explanation of it remains an open question. In this paper, we consider a class of SDP's which includes applications such as max-cut, community detection in the stochastic block model, robust PCA, phase retrieval and synchronization of rotations. We show that the low-rank Burer–Monteiro formulation of SDP's in that class almost never has any spurious local optima.\n\n1 Introduction\n\nWe consider semidefinite programs (SDP's) of the form\n\nf* = min_{X ∈ S^{n×n}} ⟨C, X⟩ subject to A(X) = b, X ⪰ 0, (SDP)\n\nwhere ⟨C, X⟩ = Tr(C^⊤ X), C ∈ S^{n×n} is the symmetric cost matrix, A : S^{n×n} → R^m is a linear operator capturing m equality constraints with right hand side b ∈ R^m and the variable X is symmetric, positive semidefinite. Interior point methods solve (SDP) in polynomial time [Nesterov, 2004]. In practice however, for n beyond a few thousands, such algorithms run out of memory (and time), prompting research for alternative solvers.\n\nIf (SDP) has a compact search space, then it admits a global optimum of rank at most r, where r(r+1)/2 ≤ m [Pataki, 1998, Barvinok, 1995]. 
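A quick numerical illustration of the mechanism this rank bound enables (a sketch with hypothetical sizes; n = 8, p = 3 and the seed are arbitrary choices, not from the paper): any matrix of the form X = Y Y^⊤ with Y of size n × p is automatically symmetric, positive semidefinite and of rank at most p, so optimizing over such Y searches exactly the rank-at-most-p slice of the feasible set, with no explicit cone constraint.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 8, 3  # hypothetical sizes, for illustration only

Y = rng.standard_normal((n, p))
X = Y @ Y.T  # any matrix of this form is symmetric and PSD

eigs = np.linalg.eigvalsh(X)
assert np.allclose(X, X.T)            # symmetric
assert eigs.min() >= -1e-10           # positive semidefinite
assert np.linalg.matrix_rank(X) <= p  # rank at most p
print("min eigenvalue:", eigs.min(), "| rank:", np.linalg.matrix_rank(X))
```

Conversely, every PSD matrix of rank at most p factors this way, which is why the restriction below loses nothing when p is large enough.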
Thus, if one restricts the search space of (SDP) to matrices of rank at most p with p(p+1)/2 ≥ m, then the globally optimal value remains unchanged. This restriction is easily enforced by factorizing X = Y Y^⊤ where Y has size n × p, yielding an equivalent quadratically constrained quadratic program:\n\nq* = min_{Y ∈ R^{n×p}} ⟨CY, Y⟩ subject to A(Y Y^⊤) = b. (P)\n\n* The first two authors contributed equally.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nIn general, (P) is non-convex, making it a priori unclear how to solve it globally. Still, the benefits are that it is lower dimensional than (SDP) and has no conic constraint. This has motivated Burer and Monteiro [2003, 2005] to try and solve (P) using local optimization methods, with surprisingly good results. They developed theory in support of this observation (details below). About their results, Burer and Monteiro [2005, §3] write (mutatis mutandis):\n\n"How large must we take p so that the local minima of (P) are guaranteed to map to global minima of (SDP)? 
Our theorem asserts that we need only¹ p(p+1)/2 > m (with the important caveat that positive-dimensional faces of (SDP) which are 'flat' with respect to the objective function can harbor non-global local minima)."\n\nThe caveat—the existence or non-existence of non-global local optima, or their potentially adverse effect for local optimization algorithms—was not further discussed.\n\nIn this paper, assuming p(p+1)/2 > m, we show that if the search space of (SDP) is compact and if the search space of (P) is a smooth manifold, then, for almost all cost matrices C, if Y satisfies first- and second-order necessary optimality conditions for (P), then Y is a global optimum of (P) and, since p(p+1)/2 ≥ m, X = Y Y^⊤ is a global optimum of (SDP); in other words, first- and second-order necessary optimality conditions for (P) are also sufficient for global optimality—an unusual theoretical guarantee in non-convex optimization.\n\nNotice that this is a statement about the optimization problem itself, not about specific algorithms. Interestingly, known algorithms for optimization on manifolds converge to second-order critical points,² regardless of initialization [Boumal et al., 2016].\n\nFor the specified class of SDP's, our result improves on those of [Burer and Monteiro, 2005] in two important ways. Firstly, for almost all C, we formally exclude the existence of spurious local optima.³ Secondly, we only require the computation of second-order critical points of (P) rather than local optima (which is hard in general [Vavasis, 1991]). 
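As a minimal numerical sanity check of this kind of guarantee (a sketch, not the paper's method: a hypothetical random instance of the simplest SDP in the class considered later, min ⟨C, X⟩ s.t. Tr(X) = 1, X ⪰ 0, whose optimal value is λ_min(C); here m = 1 and p = 2, so p(p+1)/2 > m holds), plain projected gradient descent on the factorized problem generically reaches the global optimum:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 6, 2          # hypothetical small instance; p(p+1)/2 = 3 > m = 1
C = rng.standard_normal((n, n))
C = (C + C.T) / 2    # generic symmetric cost matrix

# Projected gradient descent on the sphere {Y : ||Y||_F = 1}, the
# factorized form of: min <C, X> s.t. Tr(X) = 1, X >= 0.
Y = rng.standard_normal((n, p))
Y /= np.linalg.norm(Y)
eta = 0.25 / np.linalg.norm(C, 2)     # conservative step size
for _ in range(5000):
    Y = Y - eta * 2 * (C @ Y)         # Euclidean gradient step of f(Y) = <CY, Y>
    Y /= np.linalg.norm(Y)            # retract back to the sphere

f_val = np.trace(Y.T @ C @ Y)
print(f_val, np.linalg.eigvalsh(C)[0])  # both approach lambda_min(C)
```

The step size, iteration count and seed are arbitrary; the point is only that a second-order critical point of the non-convex surrogate matches the SDP optimum on this generic instance.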
Below, we make a statement about computational complexity, and we illustrate the practical efficiency of the proposed methods through numerical experiments.\n\nSDP's which satisfy the compactness and smoothness assumptions occur in a number of applications including Max-Cut, robust PCA, Z2-synchronization, community detection, cut-norm approximation, phase synchronization, phase retrieval, synchronization of rotations and the trust-region subproblem—see Section 4 for references.\n\nA simple example: the Max-Cut problem\n\nGiven an undirected graph, Max-Cut is the NP-hard problem of clustering the n nodes of this graph in two classes, +1 and −1, such that as many edges as possible join nodes of different signs. If C is the adjacency matrix of the graph, Max-Cut is expressed as\n\nmax_{x ∈ R^n} (1/4) Σ_{i,j=1}^n C_{ij} (1 − x_i x_j) s.t. x_1^2 = ··· = x_n^2 = 1. (Max-Cut)\n\nIntroducing the positive semidefinite matrix X = xx^⊤, both the cost and the constraints may be expressed linearly in terms of X. Ignoring that X has rank 1 yields the well-known convex relaxation in the form of a semidefinite program (up to an affine transformation of the cost):\n\nmin_{X ∈ S^{n×n}} ⟨C, X⟩ s.t. diag(X) = 1, X ⪰ 0. (Max-Cut SDP)\n\nIf a solution X of this SDP has rank 1, then X = xx^⊤ for some x which is then an optimal cut. In the general case of higher rank X, Goemans and Williamson [1995] exhibited the celebrated rounding scheme to produce approximately optimal cuts (within a ratio of .878) from X.\n\n¹ The condition on p and m is slightly, but inconsequentially, different in [Burer and Monteiro, 2005].\n² Second-order critical points satisfy first- and second-order necessary optimality conditions.\n³ Before Prop. 
2.3 in [Burer and Monteiro, 2005], the authors write: "The change of variables X = Y Y^⊤ does not introduce any extraneous local minima." This is sometimes misunderstood to mean (P) does not have spurious local optima, when it actually means that the local optima of (P) are in exact correspondence with the local optima of "(SDP) with the extra constraint rank(X) ≤ p," which is also non-convex and thus also liable to having local optima. Unfortunately, this misinterpretation has led to some confusion in the literature.\n\nThe corresponding Burer–Monteiro non-convex problem with rank bounded by p is:\n\nmin_{Y ∈ R^{n×p}} ⟨CY, Y⟩ s.t. diag(Y Y^⊤) = 1. (Max-Cut BM)\n\nThe constraint diag(Y Y^⊤) = 1 requires each row of Y to have unit norm; that is: Y is a point on the Cartesian product of n unit spheres in R^p, which is a smooth manifold. Furthermore, all X feasible for the SDP have identical trace equal to n, so that the search space of the SDP is compact. Thus, our results stated below apply:\n\nFor p = ⌈√(2n)⌉, for almost all C, even though (Max-Cut BM) is non-convex, any local optimum Y is a global optimum (and so is X = Y Y^⊤), and all saddle points have an escape (the Hessian has a negative eigenvalue).\n\nWe note that, for p > n/2, the same holds for all C [Boumal, 2015].\n\nNotation\n\nS^{n×n} is the set of real, symmetric matrices of size n. A symmetric matrix X is positive semidefinite (X ⪰ 0) if and only if u^⊤ X u ≥ 0 for all u ∈ R^n. For matrices A, B, the standard Euclidean inner product is ⟨A, B⟩ = Tr(A^⊤ B). The associated (Frobenius) norm is ∥A∥ = √⟨A, A⟩. 
Id is the identity operator and I_n is the identity matrix of size n.\n\n2 Main results\n\nOur main result establishes conditions under which first- and second-order necessary optimality conditions for (P) are sufficient for global optimality. Under those conditions, it is a fortiori true that global optima of (P) map to global optima of (SDP), so that local optimization methods on (P) can be used to solve the higher-dimensional, cone-constrained (SDP).\n\nWe now specify the necessary optimality conditions of (P). Under the assumptions of our main result below (Theorem 2), the search space\n\nM = M_p = {Y ∈ R^{n×p} : A(Y Y^⊤) = b} (1)\n\nis a smooth and compact manifold. As such, it can be linearized at each point Y ∈ M by a tangent space, simply by differentiating the constraints [Absil et al., 2008, eq. (3.19)]:\n\nT_Y M = {Ẏ ∈ R^{n×p} : A(Ẏ Y^⊤ + Y Ẏ^⊤) = 0}. (2)\n\nEndowing the tangent spaces of M with the (restricted) Euclidean metric ⟨A, B⟩ = Tr(A^⊤ B) turns M into a Riemannian submanifold of R^{n×p}. In general, second-order optimality conditions can be intricate to handle [Ruszczyński, 2006]. Fortunately, here, the smoothness of both the search space (1) and the cost function\n\nf(Y) = ⟨CY, Y⟩ (3)\n\nmake for straightforward conditions. In spirit, they coincide with the well-known conditions for unconstrained optimization. As further detailed in Appendix A, the Riemannian gradient gradf(Y) is the orthogonal projection of the classical gradient of f to the tangent space T_Y M. The Riemannian Hessian of f at Y is a similarly restricted version of the classical Hessian of f to the tangent space.\n\nDefinition 1. A (first-order) critical point for (P) is a point Y ∈ M such that\n\ngradf(Y) = 0, (1st order nec. opt. cond.)\n\nwhere gradf(Y) ∈ T_Y M is the Riemannian gradient at Y of f restricted to M. 
A second-order critical point for (P) is a critical point Y (that is, gradf(Y) = 0) such that\n\nHessf(Y) ⪰ 0, (2nd order nec. opt. cond.)\n\nwhere Hessf(Y) : T_Y M → T_Y M is the Riemannian Hessian at Y of f restricted to M (a symmetric linear operator).\n\nProposition 1. All local (and global) optima of (P) are second-order critical points.\n\nProof. See [Yang et al., 2014, Rem. 4.2 and Cor. 4.2].\n\nWe can now state our main result. In the theorem statement below, "for almost all C" means potentially troublesome cost matrices form at most a (Lebesgue) zero-measure subset of S^{n×n}, in the same way that almost all square matrices are invertible. In particular, given any matrix C ∈ S^{n×n}, perturbing C to C + σW where W is a Wigner random matrix results in an acceptable cost matrix with probability 1, for arbitrarily small σ > 0.\n\nTheorem 2. Given constraints A : S^{n×n} → R^m, b ∈ R^m and p satisfying p(p+1)/2 > m, if\n\n(i) the search space of (SDP) is compact; and\n\n(ii) the search space of (P) is a smooth manifold,\n\nthen for almost all cost matrices C ∈ S^{n×n}, any second-order critical point of (P) is globally optimal. Under these conditions, if Y is globally optimal for (P), then the matrix X = Y Y^⊤ is globally optimal for (SDP).\n\nThe assumptions are discussed in the next section. The proof—see Appendix A—follows directly from the combination of two intermediate results:\n\n1. If Y is rank deficient and second-order critical for (P), then it is globally optimal and X = Y Y^⊤ is optimal for (SDP); and\n\n2. If p(p+1)/2 > m, then, for almost all C, every first-order critical Y is rank-deficient.\n\nThe first step holds in a more general context, as previously established by Burer and Monteiro [2003, 2005]. 
The second step is new and crucial, as it allows us to formally exclude the existence of spurious local optima, generically in C, thus resolving the caveat mentioned in the introduction.\n\nThe smooth structure of (P) naturally suggests using Riemannian optimization to solve it [Absil et al., 2008], which was already proposed by Journée et al. [2010] in the same context. Importantly, known algorithms converge to second-order critical points regardless of initialization. We state here a recent computational result to that effect.\n\nProposition 3. Under the numbered assumptions of Theorem 2, the Riemannian trust-region method (RTR) [Absil et al., 2007] initialized with any Y_0 ∈ M returns in O(1/ε_g^2 ε_H + 1/ε_H^3) iterations a point Y ∈ M such that\n\nf(Y) ≤ f(Y_0), ∥gradf(Y)∥ ≤ ε_g, and Hessf(Y) ⪰ −ε_H Id.\n\nProof. Apply the main results of [Boumal et al., 2016] using that f has locally Lipschitz continuous gradient and Hessian in R^{n×p} and M is a compact submanifold of R^{n×p}.\n\nEssentially, each iteration of RTR requires evaluation of one cost and one gradient, a bounded number of Hessian-vector applications, and one projection from R^{n×p} to M. In many important cases, this projection amounts to Gram–Schmidt orthogonalization of small blocks of Y—see Section 4.\n\nProposition 3 bounds worst-case iteration counts for arbitrary initialization. In practice, a good initialization point may be available, making the local convergence rate of RTR more informative. For RTR, one may expect superlinear or even quadratic local convergence rates near isolated local minimizers [Absil et al., 2007]. 
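The pipeline just described can be sketched end to end on (Max-Cut BM). The following is a minimal stand-in, not the RTR method of Proposition 3: plain Riemannian gradient descent on the product of spheres, on a hypothetical random instance (sizes, step size, iteration count and seed are arbitrary choices). Global optimality of the computed point is certified a posteriori via the dual matrix S = C − Diag(λ) with λ_i = (C Y Y^⊤)_{ii}, mentioned in footnote 5: at a critical point SY = 0, and if moreover S ⪰ 0, then Y is globally optimal.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20
p = int(np.ceil(np.sqrt(2 * n)))  # p(p+1)/2 > m = n, as in Theorem 2

C = rng.standard_normal((n, n))
C = (C + C.T) / 2                 # generic symmetric cost ("almost all C")

# Riemannian gradient descent on the product of n unit spheres in R^p.
Y = rng.standard_normal((n, p))
Y /= np.linalg.norm(Y, axis=1, keepdims=True)
eta = 0.2 / np.linalg.norm(C, 2)
for _ in range(20000):
    G = 2 * C @ Y                                     # Euclidean gradient of <CY, Y>
    G -= np.sum(G * Y, axis=1, keepdims=True) * Y     # project rows onto tangent spaces
    Y -= eta * G
    Y /= np.linalg.norm(Y, axis=1, keepdims=True)     # retract rows back to spheres

# Certificate: S = C - Diag(lambda), lambda_i = (C Y Y^T)_{ii}; SY is half
# the Riemannian gradient, and S >= 0 certifies global optimality of Y.
lam = np.sum((C @ Y) * Y, axis=1)
S = C - np.diag(lam)
print("residual ||S Y||:", np.linalg.norm(S @ Y))
print("lambda_min(S):", np.linalg.eigvalsh(S)[0])  # near zero at a global optimum
```

On generic instances like this one, λ_min(S) comes out negligibly negative, matching the theorem's prediction that second-order critical points are global optima.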
While minimizers are not isolated in our case [Journée et al., 2010], experiments show a characteristically superlinear local convergence rate in practice [Boumal, 2015]. This means high accuracy solutions can be achieved, as demonstrated in Appendix B.\n\nThus, under the conditions of Theorem 2, generically in C, RTR converges to global optima. In practice, the algorithm returns after a finite number of steps, and only approximate second-order criticality is guaranteed. Hence, it is interesting to bound the optimality gap in terms of the approximation quality. Unfortunately, we do not establish such a result for small p. Instead, we give an a posteriori computable optimality gap bound which holds for all p and for all C. In the following statement, the dependence of M on p is explicit, as M_p. The proof is in Appendix A.\n\nTheorem 4. Let R < ∞ be the maximal trace of any X feasible for (SDP). For any p such that M_p and M_{p+1} are smooth manifolds (even if p(p+1)/2 ≤ m) and for any Y ∈ M_p, form Ỹ = [Y | 0_{n×1}] in M_{p+1}. The optimality gap at Y is bounded as\n\n0 ≤ 2(f(Y) − f*) ≤ √R ∥gradf(Y)∥ − R λ_min(Hessf(Ỹ)). (4)\n\nIf all feasible X have the same trace R and there exists a positive definite feasible X, then the bound simplifies to\n\n0 ≤ 2(f(Y) − f*) ≤ −R λ_min(Hessf(Ỹ)), (5)\n\nso that ∥gradf(Y)∥ need not be controlled explicitly. If p > n, the bounds hold with Ỹ = Y.\n\nIn particular, for p = n + 1, the bound can be controlled a priori: approximate second-order critical points are approximately optimal, for any C.⁴\n\nCorollary 5. 
Under the assumptions of Theorem 4, if p = n + 1 and Y ∈ M satisfies both ∥gradf(Y)∥ ≤ ε_g and Hessf(Y) ⪰ −ε_H Id, then Y is approximately optimal in the sense that\n\n0 ≤ 2(f(Y) − f*) ≤ √R ε_g + R ε_H.\n\nUnder the same condition as in Theorem 4, the bound can be simplified to R ε_H.\n\nThis works well with Proposition 3. For any p, equation (4) also implies the following:\n\nλ_min(Hessf(Ỹ)) ≤ −(2(f(Y) − f*) − √R ∥gradf(Y)∥) / R.\n\nThat is, for any p and any C, an approximate critical point Y in M_p which is far from optimal maps to a comfortably-escapable approximate saddle point Ỹ in M_{p+1}.\n\nThis suggests an algorithm as follows. For a starting value of p such that M_p is a manifold, use RTR to compute an approximate second-order critical point Y. Then, form Ỹ in M_{p+1} and test the left-most eigenvalue of Hessf(Ỹ).⁵ If it is close enough to zero, this provides a good bound on the optimality gap. If not, use an (approximate) eigenvector associated to λ_min(Hessf(Ỹ)) to escape the approximate saddle point and apply RTR from that new point in M_{p+1}; iterate. In the worst-case scenario, p grows to n + 1, at which point all approximate second-order critical points are approximate optima. Theorem 2 suggests p = ⌈√(2m)⌉ should suffice for C bounded away from a zero-measure set. Such an algorithm already features with less theory in [Journée et al., 2010] and [Boumal, 2015]; in the latter, it is called the Riemannian staircase, for it lifts (P) floor by floor.\n\nRelated work\n\nLow-rank approaches to solve SDP's have featured in a number of recent research papers. We highlight just two which illustrate different classes of SDP's of interest.\n\nShah et al. 
[2016] tackle SDP's with linear cost and linear constraints (both equalities and inequalities) via low-rank factorizations, assuming the matrices appearing in the cost and constraints are positive semidefinite. They propose a non-trivial initial guess to partially overcome non-convexity, with great empirical results, but do not provide optimality guarantees.\n\nBhojanapalli et al. [2016a] on the other hand consider the minimization of a convex cost function over positive semidefinite matrices, without constraints. Such problems could be obtained from generic SDP's by penalizing the constraints in a Lagrangian way. Here too, non-convexity is partially overcome via non-trivial initialization, with global optimality guarantees under some conditions.\n\nAlso of interest are recent results about the harmlessness of non-convexity in low-rank matrix completion [Ge et al., 2016, Bhojanapalli et al., 2016b]. Similarly to the present work, the authors there show there is no need for special initialization despite non-convexity.\n\n⁴ With p = n + 1, problem (P) is no longer lower dimensional than (SDP), but retains the advantage of not involving a positive semidefiniteness constraint.\n\n⁵ It may be more practical to test λ_min(S) (14) rather than λ_min(Hessf). Lemma 7 relates the two. See [Journée et al., 2010, §3.3] to construct escape tangent vectors from S.\n\n3 Discussion of the assumptions\n\nOur main result, Theorem 2, comes with geometric assumptions on the search spaces of both (SDP) and (P) which we now discuss. Examples of SDP's which fit the assumptions of Theorem 2 are featured in the next section.\n\nThe assumption that the search space of (SDP),\n\nC = {X ∈ S^{n×n} : A(X) = b, X ⪰ 0}, (6)\n\nis compact works in pair with the assumption p(p+1)/2 > m as follows. For (P) to reveal the global optima of (SDP), it is necessary that (SDP) admits a solution of rank at most p. 
One way to ensure this is via the Pataki–Barvinok theorems [Pataki, 1998, Barvinok, 1995], which state that all extreme points of C have rank r bounded as r(r+1)/2 ≤ m. Extreme points are faces of dimension zero (such as vertices for a cube). When optimizing a linear cost function ⟨C, X⟩ over a compact convex set C, at least one extreme point is a global optimum [Rockafellar, 1970, Cor. 32.3.2]—this is not true in general if C is not compact. Thus, under the assumptions of Theorem 2, there is a point Y ∈ M such that X = Y Y^⊤ is an optimal extreme point of (SDP); then, of course, Y itself is optimal for (P).\n\nIn general, the Pataki–Barvinok bound is tight, in that there exist extreme points of rank up to that upper bound (rounded down)—see for example [Laurent and Poljak, 1996] for the Max-Cut SDP and [Boumal, 2015] for the Orthogonal-Cut SDP. Let C (the cost matrix) be the negative of such an extreme point. Then, the unique optimum of (SDP) is that extreme point, showing that p(p+1)/2 ≥ m is necessary for (SDP) and (P) to be equivalent for all C. We further require a strict inequality because our proof relies on properties of rank-deficient Y's in M.\n\nThe assumption that M (eq. (1)) is a smooth manifold works in pair with the ambition that the result should hold for (almost) all cost matrices C. The starting point is that, for a given non-convex smooth optimization problem—even a quadratically constrained quadratic program—computing local optima is hard in general [Vavasis, 1991]. Thus, we wish to restrict our attention to efficiently computable points, such as points which satisfy first- and second-order KKT conditions for (P)—see [Burer and Monteiro, 2003, §2.2] and [Ruszczyński, 2006, §3]. This only makes sense if global optima satisfy the latter, that is, if KKT conditions are necessary for optimality. 
A global optimum Y necessarily satisfies KKT conditions if constraint qualifications (CQ's) hold at Y [Ruszczyński, 2006]. The standard CQ's for equality constrained programs are Robinson's conditions or metric regularity (they are here equivalent). They read as follows, assuming A(Y Y^⊤)_i = ⟨A_i, Y Y^⊤⟩ for some matrices A_1, ..., A_m ∈ S^{n×n}:\n\nCQ's hold at Y if A_1 Y, ..., A_m Y are linearly independent in R^{n×p}. (7)\n\nConsidering almost all C, global optima could, a priori, be almost anywhere in M. To simplify, we require CQ's to hold at all Y's in M rather than only at the (unknown) global optima. This turns out to be a sufficient condition for M to be a smooth manifold of codimension m [Absil et al., 2008, Prop. 3.3.3]. Indeed, tangent vectors Ẏ ∈ T_Y M (2) are exactly those vectors that satisfy ⟨A_i Y, Ẏ⟩ = 0: under CQ's, the A_i Y's form a basis of the normal space to the manifold at Y.\n\nOnce it is decided that M must be a manifold, we can step away from the specific representation of it via the matrices A_1, ..., A_m and reason about optimality conditions on the manifold directly. Adding redundant constraints (for example, duplicating A_1) would break the CQ's, but not the manifold structure. Hence, stating Theorem 2 in terms of manifolds better captures the role of M than stating it in terms of CQ's. See also [Andreani et al., 2010, Thm. 3.3] for a proof that requiring M to be a manifold around Y is a type of CQ.\n\nFinally, we note that Theorem 2 only applies for almost all C, rather than all C. To justify this restriction, if indeed it is justified, one should exhibit a matrix C that leads to suboptimal second-order critical points while other assumptions are satisfied. We do not have such an example. 
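Returning to the constraint qualifications above: condition (7) can be checked numerically by stacking the vectorized matrices A_i Y and computing a rank. A small sketch for the Max-Cut constraints, on a hypothetical instance (here A_i = e_i e_i^⊤, so A_i Y = e_i y_i^⊤, which is never zero on M since every row of Y has unit norm):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 5, 3  # hypothetical sizes

# A feasible point of (Max-Cut BM): each row of Y has unit norm.
Y = rng.standard_normal((n, p))
Y /= np.linalg.norm(Y, axis=1, keepdims=True)

# Stack vec(A_i Y) as the rows of an m-by-(n*p) matrix; for Max-Cut,
# A_i = e_i e_i^T, hence A_i Y = e_i y_i^T.
rows = []
for i in range(n):
    AiY = np.zeros((n, p))
    AiY[i] = Y[i]          # the i-th row is y_i^T, all other rows are zero
    rows.append(AiY.ravel())
M = np.array(rows)

print("rank:", np.linalg.matrix_rank(M), "| m:", n)  # full rank m: CQ's hold at Y
```

Because the matrices e_i y_i^⊤ have disjoint supports and nonzero rows, the rank is always m here, confirming that the Max-Cut search space is a smooth manifold at every feasible point.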
We do observe that (Max-Cut SDP) on cycles of certain even lengths has a unique solution of rank 1, while the corresponding (Max-Cut BM) with p = 2 has suboptimal local optima (strictly, if we quotient out symmetries). This at least suggests it is not enough, for generic C, to set p just larger than the rank of the solutions of the SDP. (For those same examples, at p = 3, we consistently observe convergence to global optima.)\n\n4 Examples of smooth SDP's\n\nThe canonical examples of SDP's which satisfy the assumptions in Theorem 2 are those where the diagonal blocks of X or their traces are fixed. We note that the algorithms and the theory continue to hold for complex matrices, where the set of Hermitian matrices of size n is treated as a real vector space of dimension n^2 (instead of n(n+1)/2 in the real case) with inner product ⟨H_1, H_2⟩ = ℜ{Tr(H_1^* H_2)}, so that occurrences of p(p+1)/2 are replaced by p^2.\n\nCertain concrete examples of SDP's include:\n\nmin_X ⟨C, X⟩ s.t. Tr(X) = 1, X ⪰ 0; (fixed trace)\nmin_X ⟨C, X⟩ s.t. diag(X) = 1, X ⪰ 0; (fixed diagonal)\nmin_X ⟨C, X⟩ s.t. X_ii = I_d, X ⪰ 0. (fixed diagonal blocks)\n\nTheir rank-constrained counterparts read as follows (matrix norms are Frobenius norms):\n\nmin_{Y : n×p} ⟨CY, Y⟩ s.t. ∥Y∥ = 1; (sphere)\nmin_{Y : n×p} ⟨CY, Y⟩ s.t. Y^⊤ = [y_1 ··· y_n] and ∥y_i∥ = 1 for all i; (product of spheres)\nmin_{Y : qd×p} ⟨CY, Y⟩ s.t. Y^⊤ = [Y_1 ··· Y_q] and Y_i^⊤ Y_i = I_d for all i. (product of Stiefel)\n\nThe first example has only one constraint: the SDP always admits an optimal rank 1 solution, corresponding to an eigenvector associated to the left-most eigenvalue of C. 
This generalizes to the trust-region subproblem as well.\n\nFor the second example, in the real case, p = 1 forces y_i = ±1, allowing to capture combinatorial problems such as Max-Cut [Goemans and Williamson, 1995], Z2-synchronization [Javanmard et al., 2015] and community detection in the stochastic block model [Abbe et al., 2016, Bandeira et al., 2016b]. The same SDP is central in a formulation of robust PCA [McCoy and Tropp, 2011] and is used to approximate the cut-norm of a matrix [Alon and Naor, 2006]. Theorem 2 states that for almost all C, p = ⌈√(2n)⌉ is sufficient. In the complex case, p = 1 forces |y_i| = 1, allowing to capture problems where phases must be recovered; in particular, phase synchronization [Bandeira et al., 2016a, Singer, 2011] and phase retrieval via Phase-Cut [Waldspurger et al., 2015]. For almost all C, it is then sufficient to set p = ⌈√(n + 1)⌉.\n\nIn the third example, Y of size n × p is divided in q slices of size d × p, with p ≥ d. Each slice has orthonormal rows. For p = d, the slices are orthogonal (or unitary) matrices, allowing to capture Orthogonal-Cut [Bandeira et al., 2016c] and the related problems of synchronization of rotations [Wang and Singer, 2013] and permutations. Synchronization of rotations is an important step in simultaneous localization and mapping, for example. Here, it is sufficient for almost all C to let p = ⌈√(d(d + 1)q)⌉.\n\nSDP's with constraints that are combinations of the above examples can also have the smoothness property; the right-hand sides 1 and I_d can be replaced by any positive definite right-hand sides by a change of variables. Another simple rule to check is if the constraint matrices A_1, . . . 
, A_m ∈ S^{n×n} such that A(X)_i = ⟨A_i, X⟩ satisfy A_i A_j = 0 for all i ≠ j (note that this is stronger than requiring ⟨A_i, A_j⟩ = 0); see [Journée et al., 2010].\n\n5 Conclusions\n\nThe Burer–Monteiro approach consists in replacing optimization of a linear function ⟨C, X⟩ over the convex set {X ⪰ 0 : A(X) = b} with optimization of the quadratic function ⟨CY, Y⟩ over the non-convex set {Y ∈ R^{n×p} : A(Y Y^⊤) = b}. It was previously known that, if the convex set is compact and p satisfies p(p+1)/2 ≥ m where m is the number of constraints, then these two problems have the same global optimum. It was also known from [Burer and Monteiro, 2005] that spurious local optima Y, if they exist, must map to special faces of the compact convex set, but without a statement as to the prevalence of such faces or the risk they pose for local optimization methods. In this paper we showed that, if the set of X's is compact and the set of Y's is a smooth manifold, and if p(p+1)/2 > m, then for almost all C, the non-convexity of the problem in Y is benign, in that all Y's which satisfy second-order necessary optimality conditions are in fact globally optimal.\n\nWe further reference the Riemannian trust-region method [Absil et al., 2007] to solve the problem in Y, as it was recently guaranteed to converge from any starting point to a point which satisfies second-order optimality conditions, with global convergence rates [Boumal et al., 2016]. In addition, for p = n + 1, we guarantee that approximate satisfaction of second-order conditions implies approximate global optimality. We note that the 1/ε^3 convergence rate in our results may be pessimistic. 
Indeed, the numerical experiments clearly show that high accuracy solutions can be computed fast using optimization on manifolds, at least for certain applications.\n\nAddressing a broader class of SDP's, such as those with inequality constraints or equality constraints that may violate our smoothness assumptions, could perhaps be handled by penalizing those constraints in the objective in an augmented Lagrangian fashion. We also note that, algorithmically, the Riemannian trust-region method we use applies just as well to nonlinear costs in the SDP. We believe that extending the theory presented here to broader classes of problems is a good direction for future work.\n\nAcknowledgment\n\nVV was partially supported by the Office of Naval Research. ASB was supported by NSF Grant DMS-1317308. Part of this work was done while ASB was with the Department of Mathematics at the Massachusetts Institute of Technology. We thank Wotao Yin and Michel Goemans for helpful discussions.\n\nReferences\n\nE. Abbe, A.S. Bandeira, and G. Hall. Exact recovery in the stochastic block model. Information Theory, IEEE Transactions on, 62(1):471–487, 2016.\n\nP.-A. Absil, C. G. Baker, and K. A. Gallivan. Trust-region methods on Riemannian manifolds. Foundations of Computational Mathematics, 7(3):303–330, 2007. doi:10.1007/s10208-005-0179-9.\n\nP.-A. Absil, R. Mahony, and R. Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton University Press, Princeton, NJ, 2008. ISBN 978-0-691-13298-3.\n\nN. Alon and A. Naor. Approximating the cut-norm via Grothendieck's inequality. SIAM Journal on Computing, 35(4):787–803, 2006. doi:10.1137/S0097539704441629.\n\nR. Andreani, C. E. Echagüe, and M. L. Schuverdt. Constant-rank condition and second-order constraint qualification. Journal of Optimization Theory and Applications, 146(2):255–266, 2010. doi:10.1007/s10957-010-9671-8.\n\nA.S. Bandeira, N. Boumal, and A. Singer. 
Tightness of the maximum likelihood semidefinite relaxation for angular synchronization. Mathematical Programming, pages 1–23, 2016a. doi:10.1007/s10107-016-1059-6.

A.S. Bandeira, N. Boumal, and V. Voroninski. On the low-rank approach for semidefinite programs arising in synchronization and community detection. In Proceedings of The 29th Conference on Learning Theory, COLT 2016, New York, NY, June 23–26, 2016b.

A.S. Bandeira, C. Kennedy, and A. Singer. Approximating the little Grothendieck problem over the orthogonal and unitary groups. Mathematical Programming, pages 1–43, 2016c. doi:10.1007/s10107-016-0993-7.

A.I. Barvinok. Problems of distance geometry and convex properties of quadratic maps. Discrete & Computational Geometry, 13(1):189–202, 1995. doi:10.1007/BF02574037.

S. Bhojanapalli, A. Kyrillidis, and S. Sanghavi. Dropping convexity for faster semi-definite optimization. Conference on Learning Theory (COLT), 2016a.

S. Bhojanapalli, B. Neyshabur, and N. Srebro. Global optimality of local search for low rank matrix recovery. arXiv preprint arXiv:1605.07221, 2016b.

N. Boumal. A Riemannian low-rank method for optimization over semidefinite matrices with block-diagonal constraints. arXiv preprint arXiv:1506.00575, 2015.

N. Boumal, B. Mishra, P.-A. Absil, and R. Sepulchre. Manopt, a Matlab toolbox for optimization on manifolds. Journal of Machine Learning Research, 15:1455–1459, 2014. URL http://www.manopt.org.

N. Boumal, P.-A. Absil, and C. Cartis. Global rates of convergence for nonconvex optimization on manifolds. arXiv preprint arXiv:1605.08101, 2016.

S. Burer and R.D.C. Monteiro. A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization. Mathematical Programming, 95(2):329–357, 2003. doi:10.1007/s10107-002-0352-8.

S. Burer and R.D.C. Monteiro.
Local minima and convergence in low-rank semidefinite programming. Mathematical Programming, 103(3):427–444, 2005.

CVX. CVX: Matlab software for disciplined convex programming. http://cvxr.com/cvx, August 2012.

R. Ge, J.D. Lee, and T. Ma. Matrix completion has no spurious local minimum. arXiv preprint arXiv:1605.07272, 2016.

M.X. Goemans and D.P. Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of the ACM (JACM), 42(6):1115–1145, 1995. doi:10.1145/227683.227684.

C. Helmberg, F. Rendl, R.J. Vanderbei, and H. Wolkowicz. An interior-point method for semidefinite programming. SIAM Journal on Optimization, 6(2):342–361, 1996. doi:10.1137/0806020.

A. Javanmard, A. Montanari, and F. Ricci-Tersenghi. Phase transitions in semidefinite relaxations. arXiv preprint arXiv:1511.08769, 2015.

M. Journée, F. Bach, P.-A. Absil, and R. Sepulchre. Low-rank optimization on the cone of positive semidefinite matrices. SIAM Journal on Optimization, 20(5):2327–2351, 2010. doi:10.1137/080731359.

M. Laurent and S. Poljak. On the facial structure of the set of correlation matrices. SIAM Journal on Matrix Analysis and Applications, 17(3):530–547, 1996. doi:10.1137/0617031.

M. McCoy and J.A. Tropp. Two proposals for robust PCA using semidefinite programming. Electronic Journal of Statistics, 5:1123–1160, 2011. doi:10.1214/11-EJS636.

Y. Nesterov. Introductory lectures on convex optimization: A basic course, volume 87 of Applied Optimization. Springer, 2004. ISBN 978-1-4020-7553-7.

G. Pataki. On the rank of extreme matrices in semidefinite programs and the multiplicity of optimal eigenvalues. Mathematics of Operations Research, 23(2):339–358, 1998. doi:10.1287/moor.23.2.339.

R.T. Rockafellar. Convex analysis. Princeton University Press, Princeton, NJ, 1970.

A.P. Ruszczyński. Nonlinear optimization. Princeton University Press, Princeton, NJ, 2006.

S. Shah, A. Kumar, D. Jacobs, C. Studer, and T. Goldstein. Biconvex relaxation for semidefinite programming in computer vision. arXiv preprint arXiv:1605.09527, 2016.

A. Singer. Angular synchronization by eigenvectors and semidefinite programming. Applied and Computational Harmonic Analysis, 30(1):20–36, 2011. doi:10.1016/j.acha.2010.02.001.

K.C. Toh, M.J. Todd, and R.H. Tütüncü. SDPT3 – a MATLAB software package for semidefinite programming. Optimization Methods and Software, 11(1–4):545–581, 1999. doi:10.1080/10556789908805762.

S.A. Vavasis. Nonlinear optimization: complexity issues. Oxford University Press, Inc., 1991.

I. Waldspurger, A. d'Aspremont, and S. Mallat. Phase recovery, MaxCut and complex semidefinite programming. Mathematical Programming, 149(1–2):47–81, 2015. doi:10.1007/s10107-013-0738-9.

L. Wang and A. Singer. Exact and stable recovery of rotations for robust synchronization. Information and Inference, 2(2):145–193, 2013. doi:10.1093/imaiai/iat005.

Z. Wen and W. Yin. A feasible method for optimization with orthogonality constraints. Mathematical Programming, 142(1–2):397–434, 2013. doi:10.1007/s10107-012-0584-1.

W.H. Yang, L.-H. Zhang, and R. Song. Optimality conditions for the nonlinear programming problems on Riemannian manifolds. Pacific Journal of Optimization, 10(2):415–434, 2014.