{"title": "An Inexact Augmented Lagrangian Framework for Nonconvex Optimization with Nonlinear Constraints", "book": "Advances in Neural Information Processing Systems", "page_first": 13965, "page_last": 13977, "abstract": "We propose a practical inexact augmented Lagrangian method (iALM) for nonconvex problems with nonlinear constraints. We characterize the total computational complexity of our method subject to a verifiable geometric condition, which is closely related to the Polyak-Lojasiewicz and Mangasarian-Fromovitz conditions. In particular, when a first-order solver is used for the inner iterates, we prove that iALM finds a first-order stationary point with $\\tilde{\\mathcal{O}}(1/\\epsilon^3)$ calls to the first-order oracle. If, in addition, the problem is smooth and a second-order solver is used for the inner iterates, iALM finds a second-order stationary point with $\\tilde{\\mathcal{O}}(1/\\epsilon^5)$ calls to the second-order oracle.\nThese complexity results match the known theoretical results in the literature. We also provide strong numerical evidence on large-scale machine learning problems, including the Burer-Monteiro factorization of semidefinite programs, and a novel nonconvex relaxation of the standard basis pursuit template. For these examples, we also show how to verify our geometric condition.", "full_text": "An Inexact Augmented Lagrangian Framework for\nNonconvex Optimization with Nonlinear Constraints\n\nMehmet Fatih Sahin\n\nmehmet.sahin@epfl.ch\n\nArmin Eftekhari\n\narmin.eftekhari@epfl.ch\n\nAhmet Alacaoglu\n\nahmet.alacaoglu@epfl.ch\n\nFabian Latorre\n\nfabian.latorre@epfl.ch\n\nVolkan Cevher\n\nvolkan.cevher@epfl.ch\n\nLIONS, Ecole Polytechnique F\u00e9d\u00e9rale de Lausanne, Switzerland\n\nAbstract\n\nWe propose a practical inexact augmented Lagrangian method (iALM) for nonconvex\nproblems with nonlinear constraints. 
We characterize the total computational\ncomplexity of our method subject to a verifiable geometric condition, which is\nclosely related to the Polyak-Lojasiewicz and Mangasarian-Fromovitz conditions.\nIn particular, when a first-order solver is used for the inner iterates, we prove that\niALM finds a first-order stationary point with $\\tilde{\\mathcal{O}}(1/\\epsilon^3)$ calls to the first-order oracle.\nIf, in addition, the problem is smooth and a second-order solver is used for the\ninner iterates, iALM finds a second-order stationary point with $\\tilde{\\mathcal{O}}(1/\\epsilon^5)$ calls to\nthe second-order oracle. These complexity results match the known theoretical\nresults in the literature.\nWe also provide strong numerical evidence on large-scale machine learning problems,\nincluding the Burer-Monteiro factorization of semidefinite programs, and\na novel nonconvex relaxation of the standard basis pursuit template. For these\nexamples, we also show how to verify our geometric condition.\n\n1 Introduction\n\nWe study the nonconvex optimization problem\n\n$$\\min_{x \\in \\mathbb{R}^d} \\; f(x) + g(x) \\quad \\text{s.t.} \\quad A(x) = 0, \\tag{1}$$\n\nwhere $f : \\mathbb{R}^d \\to \\mathbb{R}$ is a continuously-differentiable nonconvex function and $A : \\mathbb{R}^d \\to \\mathbb{R}^m$ is a\nnonlinear operator. We assume that $g : \\mathbb{R}^d \\to \\mathbb{R} \\cup \\{\\infty\\}$ is a proximal-friendly convex function [47].\nA host of problems in computer science [33, 37, 70], machine learning [40, 59], and signal processing\n[57, 58] naturally fall under the template (1), including max-cut, clustering, generalized\neigenvalue decomposition, as well as the quadratic assignment problem (QAP) [70].\nTo solve (1), we propose an intuitive and easy-to-implement augmented Lagrangian algorithm, and\nprovide its total iteration complexity under an interpretable geometric condition. Before we elaborate\non the results, let us first motivate (1) with an application to semidefinite programming (SDP):\n\nVignette: Burer-Monteiro splitting. 
A powerful convex relaxation for max-cut, clustering, and\nmany others is provided by the trace-constrained SDP\n\n$$\\min_{X \\in \\mathbb{S}^{d \\times d}} \\; \\langle C, X \\rangle \\quad \\text{s.t.} \\quad B(X) = b, \\;\\; \\mathrm{tr}(X) \\le \\alpha, \\;\\; X \\succeq 0, \\tag{2}$$\n\nwhere $C \\in \\mathbb{R}^{d \\times d}$, $X$ is a positive semidefinite $d \\times d$ matrix, and $B : \\mathbb{S}^{d \\times d} \\to \\mathbb{R}^m$ is a linear operator.\nIf the unique-games conjecture is true, the SDP (2) obtains the best possible approximation for the\nunderlying discrete problem [53].\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nSince $d$ is often large, many first- and second-order methods for solving such SDPs are immediately\nruled out, not only due to their high computational complexity, but also due to their storage\nrequirements, which are $O(d^2)$.\nA contemporary challenge in optimization is therefore to solve SDPs using little space and in a\nscalable fashion. The recent homotopy conditional gradient method, which is based on linear\nminimization oracles (LMOs), can solve (2) in a small space via sketching [69]. However, such\nLMO-based methods are extremely slow in obtaining accurate solutions.\nA different approach for solving (2), dating back to [14, 15], is the so-called Burer-Monteiro (BM)\nfactorization $X = UU^\\top$, where $U \\in \\mathbb{R}^{d \\times r}$ and $r$ is selected according to the guidelines in [49, 1],\nwhich is tight [63]. The BM factorization leads to the following nonconvex problem in the template (1):\n\n$$\\min_{U \\in \\mathbb{R}^{d \\times r}} \\; \\langle C, UU^\\top \\rangle \\quad \\text{s.t.} \\quad B(UU^\\top) = b, \\;\\; \\|U\\|_F^2 \\le \\alpha. \\tag{3}$$\n\nThe BM factorization does not introduce any extraneous local minima [15]. Moreover, [13] establishes\nthe connection between the local minimizers of the factorized problem (3) and the global minimizers\nfor (2). 
To solve (3), the inexact Augmented Lagrangian method (iALM) is widely used [14, 15, 35],\ndue to its cheap per iteration cost and its empirical success.\nEvery (outer) iteration of iALM calls a solver to solve an intermediate augmented Lagrangian\nsubproblem to near stationarity. The choices include first-order methods, such as the proximal\ngradient descent [47], or second-order methods, such as the trust region method and BFGS [44].\u00b9\nUnlike its convex counterpart [41, 36, 65], the convergence rate and the complexity of iALM for (3)\nare not well-understood, see Section 5 for a review of the related literature. Indeed, addressing this\nimportant theoretical gap is one of the contributions of our work. In addition,\n\u25b7 We derive the convergence rate of iALM to first-order optimality for solving (1) or second-order\noptimality for solving (1) with $g = 0$, and find the total iteration complexity of iALM using different\nsolvers for the augmented Lagrangian subproblems. Our complexity bounds match the best theoretical\nresults in optimization, see Section 5.\n\u25b7 Our iALM framework is future-proof in the sense that different subsolvers can be substituted.\n\u25b7 We propose a geometric condition that simplifies the algorithmic analysis for iALM, and clarify its\nconnection to well-known Polyak-Lojasiewicz [32] and Mangasarian-Fromovitz [3] conditions. We\nalso verify this condition for key problems in Appendices D and E.\n\n2 Preliminaries\n\nNotation. We use the notation $\\langle \\cdot, \\cdot \\rangle$ and $\\|\\cdot\\|$ for the standard inner product and the norm on $\\mathbb{R}^d$. For\nmatrices, $\\|\\cdot\\|$ and $\\|\\cdot\\|_F$ denote the spectral and the Frobenius norms, respectively. For the convex\nfunction $g : \\mathbb{R}^d \\to \\mathbb{R}$, the subdifferential set at $x \\in \\mathbb{R}^d$ is denoted by $\\partial g(x)$ and we will occasionally\nuse the notation $\\partial g(x)/\\beta = \\{z/\\beta : z \\in \\partial g(x)\\}$. 
When presenting iteration complexity results, we\noften use $\\tilde{O}(\\cdot)$ which suppresses the logarithmic dependencies.\n\nWe denote $\\delta_{\\mathcal{X}} : \\mathbb{R}^d \\to \\mathbb{R}$ as the indicator function of a set $\\mathcal{X} \\subset \\mathbb{R}^d$. The distance function from\na point $x$ to $\\mathcal{X}$ is denoted by $\\mathrm{dist}(x, \\mathcal{X}) = \\min_{z \\in \\mathcal{X}} \\|x - z\\|$. For integers $k_0 \\le k_1$, we use the\nnotation $[k_0 : k_1] = \\{k_0, \\ldots, k_1\\}$. For an operator $A : \\mathbb{R}^d \\to \\mathbb{R}^m$ with components $\\{A_i\\}_{i=1}^m$,\n$DA(x) \\in \\mathbb{R}^{m \\times d}$ denotes the Jacobian of $A$, where the $i$th row of $DA(x)$ is the vector $\\nabla A_i(x) \\in \\mathbb{R}^d$.\n\nSmoothness. We assume smooth $f : \\mathbb{R}^d \\to \\mathbb{R}$ and $A : \\mathbb{R}^d \\to \\mathbb{R}^m$; i.e., there exist $\\lambda_f, \\lambda_A \\ge 0$ s.t.\n\n$$\\|\\nabla f(x) - \\nabla f(x')\\| \\le \\lambda_f \\|x - x'\\|, \\qquad \\|DA(x) - DA(x')\\| \\le \\lambda_A \\|x - x'\\|, \\qquad \\forall x, x' \\in \\mathbb{R}^d. \\tag{4}$$\n\n\u00b9 BFGS is in fact a quasi-Newton method that emulates second-order information.\n\nAugmented Lagrangian method (ALM). ALM is a classical algorithm, which first appeared\nin [29, 51] and was extensively studied afterwards in [3, 8]. For solving (1), ALM suggests solving the\nproblem\n\n$$\\min_x \\max_y \\; \\mathcal{L}_\\beta(x, y) + g(x), \\tag{5}$$\n\nwhere, for penalty weight $\\beta > 0$, $\\mathcal{L}_\\beta$ is the corresponding augmented Lagrangian, defined as\n\n$$\\mathcal{L}_\\beta(x, y) := f(x) + \\langle A(x), y \\rangle + \\frac{\\beta}{2} \\|A(x)\\|^2. \\tag{6}$$\n\nThe minimax formulation in (5) naturally suggests the following algorithm for solving (1):\n\n$$x_{k+1} \\in \\operatorname*{argmin}_x \\; \\mathcal{L}_\\beta(x, y_k) + g(x), \\tag{7}$$\n\n$$y_{k+1} = y_k + \\sigma_k A(x_{k+1}),$$\n\nwhere the dual step sizes are denoted as $\\{\\sigma_k\\}_k$. However, computing $x_{k+1}$ above requires solving\nthe nonconvex problem (7) to optimality, which is typically intractable. 
Instead, it is often easier to\nfind an approximate first- or second-order stationary point of (7).\nHence, we argue that by gradually improving the stationarity precision and increasing the penalty\nweight $\\beta$ above, we can reach a stationary point of the main problem in (5), as detailed in Section 3.\n\nOptimality conditions. First-order necessary optimality conditions for (1) are well-studied. Indeed,\n$x \\in \\mathbb{R}^d$ is a first-order stationary point of (1) if there exists $y \\in \\mathbb{R}^m$ such that\n\n$$-\\nabla_x \\mathcal{L}_\\beta(x, y) \\in \\partial g(x), \\qquad A(x) = 0, \\tag{8}$$\n\nwhich is in turn the necessary optimality condition for (5). Inspired by this, we say that $x$ is an $(\\epsilon_f, \\beta)$\nfirst-order stationary point of (5) if there exists a $y \\in \\mathbb{R}^m$ such that\n\n$$\\mathrm{dist}(-\\nabla_x \\mathcal{L}_\\beta(x, y), \\partial g(x)) \\le \\epsilon_f, \\qquad \\|A(x)\\| \\le \\epsilon_f, \\tag{9}$$\n\nfor $\\epsilon_f \\ge 0$. In light of (9), a metric for evaluating the stationarity of a pair $(x, y) \\in \\mathbb{R}^d \\times \\mathbb{R}^m$ is\n\n$$\\mathrm{dist}(-\\nabla_x \\mathcal{L}_\\beta(x, y), \\partial g(x)) + \\|A(x)\\|, \\tag{10}$$\n\nwhich we use as the first-order stopping criterion. As an example, for a convex set $\\mathcal{X} \\subset \\mathbb{R}^d$, suppose\nthat $g = \\delta_{\\mathcal{X}}$ is the indicator function on $\\mathcal{X}$. Let also $T_{\\mathcal{X}}(x) \\subseteq \\mathbb{R}^d$ denote the tangent cone to $\\mathcal{X}$ at $x$,\nand with $P_{T_{\\mathcal{X}}(x)} : \\mathbb{R}^d \\to \\mathbb{R}^d$ we denote the orthogonal projection onto this tangent cone. Then, for\n$u \\in \\mathbb{R}^d$, it is not difficult to verify that\n\n$$\\mathrm{dist}(u, \\partial g(x)) = \\|P_{T_{\\mathcal{X}}(x)}(u)\\|. \\tag{11}$$\n\nWhen $g = 0$, a first-order stationary point $x \\in \\mathbb{R}^d$ of (1) is also second-order stationary if\n\n$$\\lambda_{\\min}(\\nabla_{xx} \\mathcal{L}_\\beta(x, y)) \\ge 0, \\tag{12}$$\n\nwhere $\\nabla_{xx} \\mathcal{L}_\\beta$ is the Hessian of $\\mathcal{L}_\\beta$ with respect to $x$, and $\\lambda_{\\min}(\\cdot)$ returns the smallest eigenvalue of\nits argument. Analogously, $x$ is an $(\\epsilon_f, \\epsilon_s, \\beta)$ second-order stationary point if, in addition to (9), it\nholds that\n\n$$\\lambda_{\\min}(\\nabla_{xx} \\mathcal{L}_\\beta(x, y)) \\ge -\\epsilon_s, \\tag{13}$$\n\nfor $\\epsilon_s \\ge 0$. Naturally, for second-order stationarity, we use $\\lambda_{\\min}(\\nabla_{xx} \\mathcal{L}_\\beta(x, y))$ as the stopping\ncriterion.\n\nSmoothness lemma. This next result controls the smoothness of $\\mathcal{L}_\\beta(\\cdot, y)$ for a fixed $y$. The proof\nis standard but nevertheless is included in Appendix C for completeness.\nLemma 2.1 (smoothness). For fixed $y \\in \\mathbb{R}^m$ and $\\rho, \\rho' \\ge 0$, it holds that\n\n$$\\|\\nabla_x \\mathcal{L}_\\beta(x, y) - \\nabla_x \\mathcal{L}_\\beta(x', y)\\| \\le \\lambda_\\beta \\|x - x'\\|, \\tag{14}$$\n\nfor every $x, x' \\in \\{x'' : \\|x''\\| \\le \\rho, \\|A(x'')\\| \\le \\rho'\\}$, where\n\n$$\\lambda_\\beta \\le \\lambda_f + \\sqrt{m} \\lambda_A \\|y\\| + (\\sqrt{m} \\lambda_A \\rho' + d (\\lambda_A')^2) \\beta =: \\lambda_f + \\sqrt{m} \\lambda_A \\|y\\| + \\lambda''(A, \\rho, \\rho') \\beta. \\tag{15}$$\n\nAbove, $\\lambda_f, \\lambda_A$ were defined in (4) and\n\n$$\\lambda_A' := \\max_{\\|x\\| \\le \\rho} \\|DA(x)\\|. \\tag{16}$$\n\n3 Algorithm\n\nTo solve the equivalent formulation of (1) presented in (5), we propose the inexact ALM (iALM),\ndetailed in Algorithm 1. At the $k$th iteration, Step 2 of Algorithm 1 calls a solver that finds an\napproximate stationary point of the augmented Lagrangian $\\mathcal{L}_{\\beta_k}(\\cdot, y_k)$ with the accuracy of $\\epsilon_{k+1}$, and\nthis accuracy gradually increases in a controlled fashion. The increasing sequence of penalty weights\n$\\{\\beta_k\\}_k$ and the dual update (Steps 3 and 4) are responsible for continuously enforcing the constraints\nin (1). 
The appropriate choice for $\\{\\beta_k\\}_k$ will be specified in Corollary Sections A.1 and A.2.\nThe particular choice of the dual step sizes $\\{\\sigma_k\\}_k$ in Algorithm 1 ensures that the dual variable $y_k$\nremains bounded.\n\nAlgorithm 1 Inexact ALM\n\nInput: Non-decreasing, positive, unbounded sequence $\\{\\beta_k\\}_{k \\ge 1}$, stopping thresholds $\\tau_f, \\tau_s > 0$.\nInitialization: Primal variable $x_1 \\in \\mathbb{R}^d$, dual variable $y_0 \\in \\mathbb{R}^m$, dual step size $\\sigma_1 > 0$.\nfor $k = 1, 2, \\ldots$ do\n  1. (Update tolerance) $\\epsilon_{k+1} = 1/\\beta_k$.\n  2. (Inexact primal solution) Obtain $x_{k+1} \\in \\mathbb{R}^d$ such that\n       $\\mathrm{dist}(-\\nabla_x \\mathcal{L}_{\\beta_k}(x_{k+1}, y_k), \\partial g(x_{k+1})) \\le \\epsilon_{k+1}$\n     for first-order stationarity, and\n       $\\lambda_{\\min}(\\nabla_{xx} \\mathcal{L}_{\\beta_k}(x_{k+1}, y_k)) \\ge -\\epsilon_{k+1}$\n     for second-order stationarity, if $g = 0$ in (1).\n  3. (Update dual step size)\n       $\\sigma_{k+1} = \\sigma_1 \\min\\left( \\frac{\\|A(x_1)\\| \\log^2 2}{\\|A(x_{k+1})\\| (k+1) \\log^2(k+2)}, 1 \\right)$.\n  4. (Dual ascent) $y_{k+1} = y_k + \\sigma_{k+1} A(x_{k+1})$.\n  5. (Stopping criterion) If\n       $\\mathrm{dist}(-\\nabla_x \\mathcal{L}_{\\beta_k}(x_{k+1}, y_k), \\partial g(x_{k+1})) + \\|A(x_{k+1})\\| \\le \\tau_f$\n     for first-order stationarity and if also $\\lambda_{\\min}(\\nabla_{xx} \\mathcal{L}_{\\beta_k}(x_{k+1}, y_k)) \\ge -\\tau_s$ for second-order\n     stationarity, then quit and return $x_{k+1}$ as an (approximate) stationary point of (5).\nend for\n\n4 Convergence Rate\n\nThis section presents the total iteration complexity of Algorithm 1 for finding first- and second-order\nstationary points of problem (5). All the proofs are deferred to Appendix B. Theorem 4.1 characterizes\nthe convergence rate of Algorithm 1 for finding stationary points in the number of outer iterations.\nTheorem 4.1. 
(convergence rate) For integers $2 \\le k_0 \\le k_1$, consider the interval $K = [k_0 : k_1]$,\nand let $\\{x_k\\}_{k \\in K}$ be the output sequence of Algorithm 1 on the interval $K$.\u00b2 Let also\n$\\rho := \\sup_{k \\in K} \\|x_k\\|$.\u00b3 Suppose that $f$ and $A$ satisfy (4) and let\n\n$$\\lambda_f' = \\max_{\\|x\\| \\le \\rho} \\|\\nabla f(x)\\|, \\qquad \\lambda_A' = \\max_{\\|x\\| \\le \\rho} \\|DA(x)\\|, \\tag{17}$$\n\nbe the (restricted) Lipschitz constants of $f$ and $A$, respectively. With $\\nu > 0$, assume that\n\n$$\\nu \\|A(x_k)\\| \\le \\mathrm{dist}\\left( -DA(x_k)^\\top A(x_k), \\frac{\\partial g(x_k)}{\\beta_{k-1}} \\right), \\tag{18}$$\n\nfor every $k \\in K$. We consider two cases:\n\n\u2022 If a first-order solver is used in Step 2, then $x_k$ is an $(\\epsilon_{k,f}, \\beta_k)$ first-order stationary point of (5)\nwith\n\n$$\\epsilon_{k,f} = \\frac{1}{\\beta_{k-1}} \\left( \\frac{2(\\lambda_f' + \\lambda_A' y_{\\max})(1 + \\lambda_A' \\sigma_k)}{\\nu} + 1 \\right) =: \\frac{Q(f, g, A, \\sigma_1)}{\\beta_{k-1}}, \\tag{19}$$\n\nfor every $k \\in K$, where $y_{\\max}(x_1, y_0, \\sigma_1) := \\|y_0\\| + c \\|A(x_1)\\|$.\n\n\u2022 If a second-order solver is used in Step 2, then $x_k$ is an $(\\epsilon_{k,f}, \\epsilon_{k,s}, \\beta_k)$ second-order stationary\npoint of (5) with $\\epsilon_{k,f}$ specified above and with\n\n$$\\epsilon_{k,s} = \\epsilon_{k-1} + \\sigma_k \\sqrt{m} \\lambda_A \\frac{2\\lambda_f' + 2\\lambda_A' y_{\\max}}{\\nu \\beta_{k-1}} = \\frac{\\nu + \\sigma_k \\sqrt{m} \\lambda_A (2\\lambda_f' + 2\\lambda_A' y_{\\max})}{\\nu \\beta_{k-1}} =: \\frac{Q'(f, g, A, \\sigma_1)}{\\beta_{k-1}}. \\tag{20}$$\n\n\u00b2 The choice of $k_1 = \\infty$ is valid here too.\n\u00b3 If necessary, to ensure that $\\rho < \\infty$, one can add a small factor of $\\|x\\|^2$ to $\\mathcal{L}_\\beta$ in (6). Then it is easy to verify\nthat the iterates of Algorithm 1 remain bounded, provided that the initial penalty weight $\\beta_0$ is large enough,\n$\\sup_x \\|\\nabla f(x)\\|/\\|x\\| < \\infty$, $\\sup_x \\|A(x)\\| < \\infty$, and $\\sup_x \\|DA(x)\\| < \\infty$.\n\nTheorem 4.1 states that Algorithm 1 converges to a (first- or second-order) stationary point of (5)\nat the rate of $1/\\beta_k$, further specified in Corollary 4.2 and Corollary 4.3. A few remarks are in order\nabout Theorem 4.1.\n\nRegularity. The key geometric condition in Theorem 4.1 is (18) which, broadly speaking, ensures\nthat the primal updates of Algorithm 1 reduce the feasibility gap as the penalty weight $\\beta_k$ grows. We\nwill verify this condition for several examples in Appendices D and E.\nThis condition in (18) is closely related to those in the existing literature. In the special case where\n$g = 0$ in (1), (18) reduces to\n\n$$\\|DA(x)^\\top A(x)\\| \\ge \\nu \\|A(x)\\|. \\tag{21}$$\n\nPolyak-Lojasiewicz (PL) condition [32]. Consider the problem with $\\lambda_{\\tilde{f}}$-smooth objective,\n\n$$\\min_{x \\in \\mathbb{R}^d} \\tilde{f}(x).$$\n\n$\\tilde{f}(x)$ satisfies the PL inequality if the following holds for some $\\mu > 0$,\n\n$$\\frac{1}{2} \\|\\nabla \\tilde{f}(x)\\|^2 \\ge \\mu (\\tilde{f}(x) - \\tilde{f}^*), \\qquad \\forall x. \\tag{PL inequality}$$\n\nThis inequality implies that the gradient grows faster than a quadratic as we move away from the\noptimum. Assuming that the feasible set $\\{x : A(x) = 0\\}$ is non-empty, it is easy to verify that (21) is\nequivalent to the PL condition for minimizing $\\tilde{f}(x) = \\frac{1}{2} \\|A(x)\\|^2$ with $\\nu = \\sqrt{2\\mu}$ [32].\nThe PL condition itself is a special case of Kurdyka-Lojasiewicz with $\\theta = 1/2$, see [66, Definition 1.1].\nWhen $g = 0$, it is also easy to see that (18) is weaker than the Mangasarian-Fromovitz (MF) condition\nin nonlinear optimization [10, Assumption 1]. Moreover, when $g$ is the indicator on a convex set,\n(18) is a consequence of the basic constraint qualification in [55], which itself generalizes the MF\ncondition to the case when $g$ is an indicator function of a convex set.\nWe may think of (18) as a local condition, which should hold within a neighborhood of the constraint\nset $\\{x : A(x) = 0\\}$ rather than everywhere in $\\mathbb{R}^d$. Indeed, the iteration count $k$ appears in (18) to\nreflect this local nature of the condition. Similar arguments on the regularity condition also\nappear in [10]. There is also a constant complexity algorithm in [10] to reach the so-called \u201cinformation\nzone\u201d, which supplements Theorem 4.1.\n\nPenalty method. A classical algorithm to solve (1) is the penalty method, which is characterized by\nthe absence of the dual variable ($y = 0$) in (6). Indeed, ALM can be interpreted as an adaptive penalty\nor smoothing method with a variable center determined by the dual variable. It is worth noting that,\nwith the same proof technique, one can establish the same convergence rate of Theorem 4.1 for the\npenalty method. However, while both methods have the same convergence rate in theory, we ignore\nthe uncompetitive penalty method since it is significantly outperformed by iALM in practice.\n\nComputational complexity. Theorem 4.1 specifies the number of (outer) iterations that Algorithm 1\nrequires to reach a near-stationary point of problem (6) with a prescribed precision and, in\nparticular, specifies the number of calls made to the solver in Step 2. In this sense, Theorem 4.1 does\nnot fully capture the computational complexity of Algorithm 1, as it does not take into account the\ncomputational cost of the solver in Step 2.\nTo better understand the total iteration complexity of Algorithm 1, we consider two scenarios in the\nfollowing. 
In the first scenario, we take the solver in Step 2 to be the Accelerated Proximal Gradient\nMethod (APGM), a well-known first-order algorithm [27]. In the second scenario, we will use the\nsecond-order trust region method developed in [17]. We have the following two corollaries showing\nthe total complexity of our algorithm to reach first- and second-order stationary points. Appendix A\ncontains the proofs and more detailed discussion for the complexity results.\nCorollary 4.2 (First-order optimality). For $b > 1$, let $\\beta_k = b^k$ for every $k$. If we use APGM from [27]\nfor Step 2 of Algorithm 1, the algorithm finds an $(\\epsilon_f, \\beta_k)$ first-order stationary point of (5), after $T$\ncalls to the first-order oracle, where\n\n$$T = O\\left( \\frac{Q^3 \\rho^2}{\\epsilon^3} \\log_b\\left( \\frac{Q}{\\epsilon} \\right) \\right) = \\tilde{O}\\left( \\frac{Q^3 \\rho^2}{\\epsilon^3} \\right). \\tag{22}$$\n\nFor Algorithm 1 to reach a near-stationary point with an accuracy of $\\epsilon_f$ in the sense of (9) and with\nthe lowest computational cost, we therefore need to perform only one iteration of Algorithm 1, with\n$\\beta_1$ specified as a function of $\\epsilon_f$ by (19) in Theorem 4.1. In general, however, the constants in (19) are\nunknown and this approach is thus not feasible. Instead, the homotopy approach taken by Algorithm 1\nensures achieving the desired accuracy by gradually increasing the penalty weight. This homotopy\napproach increases the computational cost of Algorithm 1 only by a factor logarithmic in $\\epsilon_f$, as\ndetailed in the proof of Corollary 4.2.\nCorollary 4.3 (Second-order optimality). For $b > 1$, let $\\beta_k = b^k$ for every $k$. 
We assume that\n\n$$\\mathcal{L}_\\beta(x_1, y) - \\min_x \\mathcal{L}_\\beta(x, y) \\le L_u, \\qquad \\forall \\beta. \\tag{23}$$\n\nIf we use the trust region method from [17] for Step 2 of Algorithm 1, the algorithm finds an\n$\\epsilon$-second-order stationary point of (5) in $T$ calls to the second-order oracle where\n\n$$T = O\\left( \\frac{L_u Q'^5}{\\epsilon^5} \\log_b\\left( \\frac{Q'}{\\epsilon} \\right) \\right) = \\tilde{O}\\left( \\frac{L_u Q'^5}{\\epsilon^5} \\right). \\tag{24}$$\n\nRemark. These complexity results for first- and second-order stationarity are with respect to (6). We\nnote that these complexities match [18] and [7]. However, the stationarity criteria and the definition\nof dual variable in these papers differ from ours. We include more discussion on this in the Appendix.\n\nEffect of $\\beta_k$ in (18). We consider two cases. When $g$ is the indicator of a convex set (or 0), the\nsubdifferential set will be a cone (or 0), thus $\\beta_k$ will not have an effect. On the other hand, when $g$ is\na convex and Lipschitz continuous function defined on the whole space, the subdifferential set will be\nbounded [54, Theorem 23.4]. This will introduce an error term in (18) that is of the order $1/\\beta_k$. One\ncan see that the choice $\\beta_k = b^k$ causes a linear decrease in this error term. In fact, all the examples in\nthis paper fall into the first case.\n\n5 Related Work\n\nALM has a long history in the optimization literature, dating back to [29, 51]. In the special case\nof (1) with a convex function $f$ and a linear operator $A$, standard, inexact, and linearized versions of\nALM have been extensively studied [36, 41, 61, 65].\nClassical works on ALM focused on the general template of (1) with nonconvex $f$ and nonlinear $A$,\nwith arguably stronger assumptions and required exact solutions to the subproblems of the form (7),\nwhich appear in Step 2 of Algorithm 1, see for instance [4].\nA similar analysis was conducted in [22] for the general template of (1). 
The authors considered\ninexact ALM and proved convergence rates for the outer iterates, under specific assumptions on the\ninitialization of the dual variable. However, the authors did not analyze how to solve the\nsubproblems inexactly and did not provide total complexity results with verifiable conditions.\nProblem (1) with assumptions similar to ours is also studied in [7] and [18] for first-order and\nsecond-order stationarity, respectively, with explicit iteration complexity analysis. As we have mentioned\nin Section 4, our iteration complexity results match these theoretical algorithms with a simpler\nalgorithm and a simpler analysis. In addition, these algorithms require setting final accuracies since\nthey utilize this information in the algorithm while our Algorithm 1 does not set accuracies a priori.\n[16] also considers the same template (1) for first-order stationarity with a penalty-type method\ninstead of ALM. Even though the authors show $O(1/\\epsilon^2)$ complexity, this result is obtained by\nassuming that the penalty parameter remains bounded. We note that such an assumption can also be\nused to improve our complexity results to match theirs.\n[10] studies the general template (1) with specific assumptions involving local error bound conditions\nfor (1). These conditions are studied in detail in [9], but their validity for general SDPs (2) has\nnever been established. This work also lacks the total iteration complexity analysis presented here.\nAnother work [20] focused on solving (1) by adapting the primal-dual method of Chambolle and\nPock [19]. 
The authors proved the convergence of the method and provided a convergence rate by\nimposing error bound conditions on the objective function that do not hold for standard SDPs.\n[14, 15] are the first works that propose the splitting $X = UU^\\top$ for solving SDPs of the form (2).\nFollowing these works, the literature on Burer-Monteiro (BM) splitting for the most part focused on\nusing ALM for solving the reformulated problem (3).\nHowever, this proposal has a few drawbacks: First, it requires exact solutions in Step 2 of Algorithm 1\nin theory, which in practice are replaced with inexact solutions. Second, their results only establish\nconvergence without providing the rates. In this sense, our work provides a theoretical understanding of\nthe BM splitting with inexact solutions to Step 2 of Algorithm 1 and complete iteration complexities.\n[6, 48] are among the earliest efforts to show convergence rates for BM splitting, focusing on\nthe special case of SDPs without any linear constraints. For these specific problems, they prove\nthe convergence of gradient descent to global optima with convergence rates, assuming favorable\ninitialization. These results, however, do not apply to general SDPs of the form (2) where the difficulty\narises due to the linear constraints.\nAnother popular method for solving SDPs is due to [12, 11, 13], focusing on the case where the\nconstraints in (1) can be written as a Riemannian manifold after BM splitting. In this case, the authors\napply the Riemannian gradient descent and Riemannian trust region methods for obtaining first- and\nsecond-order stationary points, respectively. They obtain $O(1/\\epsilon^2)$ complexity for finding first-order\nstationary points and $O(1/\\epsilon^3)$ complexity for finding second-order stationary points.\nWhile these complexities appear better than ours, the smooth manifold requirement in these works\nis indeed restrictive. 
In particular, this requirement holds for max-cut and generalized eigenvalue\nproblems, but it is not satisfied for other important SDPs such as the quadratic assignment problem (QAP),\noptimal power flow and clustering with general affine constraints. In addition, as noted in [11], the\nper-iteration cost of their method for the max-cut problem is an astronomical $O(d^6)$.\nLastly, there also exists a line of work for solving SDPs in their original convex formulation, in a\nstorage-efficient way [42, 68, 69]. These works have global optimality guarantees by virtue of\ndirectly solving the convex formulation. On the downside, these works require the use of eigenvalue\nroutines and exhibit significantly slower convergence as compared to nonconvex approaches [31].\n\n6 Numerical Evidence\n\nWe begin with a caveat: It is known that quasi-Newton methods, such as BFGS and lBFGS,\nmight not converge for nonconvex problems [21, 38]. For this reason, we have used the trust region\nmethod as the second-order solver in our analysis in Section 4, which is well-studied for nonconvex\nproblems [17]. Empirically, however, BFGS and lBFGS are extremely successful and we have\ntherefore opted for those solvers in this section since the subroutine does not affect Theorem 4.1 as\nlong as the subsolver performs well in practice.\n\n6.1 Clustering\n\nGiven data points $\\{z_i\\}_{i=1}^n$, the entries of the corresponding Euclidean distance matrix $D \\in \\mathbb{R}^{n \\times n}$\nare $D_{i,j} = \\|z_i - z_j\\|^2$. Clustering is then the problem of finding a co-association matrix $Y \\in \\mathbb{R}^{n \\times n}$\nsuch that $Y_{ij} = 1$ if points $z_i$ and $z_j$ are within the same cluster and $Y_{ij} = 0$ otherwise. 
In [50], the\nauthors provide an SDP relaxation of the clustering problem, specified as\n\n$$\\min_{Y \\in \\mathbb{R}^{n \\times n}} \\; \\mathrm{tr}(DY) \\quad \\text{s.t.} \\quad Y \\mathbf{1} = \\mathbf{1}, \\;\\; \\mathrm{tr}(Y) = s, \\;\\; Y \\succeq 0, \\;\\; Y \\ge 0, \\tag{25}$$\n\nwhere $s$ is the number of clusters and $Y$ is both positive semidefinite and has nonnegative entries.\nStandard SDP solvers do not scale well with the number of data points $n$, since they often require\nprojection onto the semidefinite cone with the complexity of $O(n^3)$. We instead use the BM\nfactorization to solve (25), sacrificing convexity to reduce the computational complexity. More\nspecifically, we solve the program\n\n$$\\min_{V \\in \\mathbb{R}^{n \\times r}} \\; \\mathrm{tr}(DVV^\\top) \\quad \\text{s.t.} \\quad VV^\\top \\mathbf{1} = \\mathbf{1}, \\;\\; \\|V\\|_F^2 \\le s, \\;\\; V \\ge 0, \\tag{26}$$\n\nwhere $\\mathbf{1} \\in \\mathbb{R}^n$ is the vector of all ones. Note that $Y \\ge 0$ in (25) is replaced above by the much\nstronger but easier-to-enforce constraint $V \\ge 0$ in (26), see [35] for the reasoning behind this\nrelaxation. Now, we can cast (26) as an instance of (1). Indeed, for every $i \\le n$, let $x_i \\in \\mathbb{R}^r$ denote\nthe $i$th row of $V$. We next form $x \\in \\mathbb{R}^d$ with $d = nr$ by expanding the factorized variable $V$, namely,\n$x := [x_1^\\top, \\cdots, x_n^\\top]^\\top \\in \\mathbb{R}^d$, and then set\n\n$$f(x) = \\sum_{i,j=1}^n D_{i,j} \\langle x_i, x_j \\rangle, \\qquad g = \\delta_C, \\qquad A(x) = \\left[ x_1^\\top \\sum_{j=1}^n x_j - 1, \\cdots, x_n^\\top \\sum_{j=1}^n x_j - 1 \\right]^\\top,$$\n\nwhere $C$ is the intersection of the positive orthant in $\\mathbb{R}^d$ with the Euclidean ball of radius $\\sqrt{s}$. In\nAppendix D, we verify that Theorem 4.1 applies to (1) with $f, g, A$ specified above.\nIn our simulations, we use two different solvers for Step 2 of Algorithm 1, namely, APGM and\nlBFGS. APGM is a solver for nonconvex problems of the form (7) with convergence guarantees\nto first-order stationarity, as discussed in Section 4. lBFGS is a limited-memory version of the BFGS\nalgorithm in [24] that approximately leverages the second-order information of the problem. We\ncompare our approach against the following convex methods:\n\n\u2022 HCGM: Homotopy-based Conditional Gradient Method in [69] which directly solves (25).\n\u2022 SDPNAL+: A second-order augmented Lagrangian method for solving SDPs with nonnegativity\nconstraints [67].\n\nFigure 1: Clustering running time comparison.\n\nAs for the dataset, our experimental setup is similar to that described by [39]. We use the publicly\navailable Fashion-MNIST data in [64], which is released as a possible replacement for the MNIST\nhandwritten digits. Each data point is a $28 \\times 28$ gray-scale image, associated with a label from ten\nclasses, labeled from 0 to 9. First, we extract the meaningful features from this dataset using a simple\ntwo-layer neural network with a sigmoid activation function. Then, we apply this neural network to\n1000 test samples from the same dataset, which gives us a vector of length 10 for each data point,\nwhere each entry represents the posterior probability for each class. Then, we form the $\\ell_2$ distance\nmatrix $D$ from these probability vectors. The solution rank for the template (25) is known and it is\nequal to the number of clusters $k$ [35, Theorem 1]. As discussed in [60], setting rank $r > k$ leads to more\naccurate reconstruction at the expense of speed. Therefore, we set the rank to 20. For iALM with lBFGS, we\nused $\\beta_1 = 1$ and $\\sigma_1 = 10$ as the initial penalty weight and dual step size, respectively. For HCGM,\nwe used $\\beta_0 = 1$ as the initial smoothness parameter. We have run the SDPNAL+ solver with $10^{-12}$\ntolerance. The results are depicted in Figure 1. We implemented 3 algorithms in MATLAB and used\nthe software package for SDPNAL+ which contains mex files. 
We expect the performance\nof our nonconvex approach to improve further with mex files.\n\n6.2 Additional demonstrations\n\nWe provide several additional experiments in Appendix E. Section E.1 discusses a novel nonconvex\nrelaxation of the standard basis pursuit template, which performs comparably to state-of-the-art\nconvex solvers. In Section E.2, we provide fast numerical solutions to the generalized eigenvalue\nproblem. In Section E.3, we give a contemporary application to which our template applies,\nnamely, denoising with generative adversarial networks. Finally, we provide improved bounds for\nsparse quadratic assignment problem instances in Section E.4.\n\n7 Conclusions\n\nIn this work, we have proposed and analyzed an inexact augmented Lagrangian method for solving\nnonconvex optimization problems with nonlinear constraints. We prove convergence to first- and\nsecond-order stationary points of the augmented Lagrangian function, with explicit complexity\nestimates. Even though the relation between stationary points and global optima is not well understood in\nthe literature, we find that the algorithm converges quickly to either global minima or\nlocal minima in a wide variety of numerical experiments.\n\nAcknowledgements\nThe authors would like to thank Nicolas Boumal and Nadav Hallak for their helpful suggestions.\n\nThis project has received funding from the European Research Council (ERC) under the\nEuropean Union\u2019s Horizon 2020 research and innovation programme (grant agreement no. 725594 -\ntime-data) and was supported by the Swiss National Science Foundation (SNSF) under grant number\n200021_178865/1. This project was also sponsored by the Department of the Navy, Office of Naval\nResearch (ONR) under grant number N62909-17-1-2111 and was supported by the Hasler Foundation\nProgram: Cyber Human Systems (project number 16066). 
This research was supported by the PhD\nfellowship program of the Swiss Data Science Center (SDSC) under grant ID number P18-07.\n\nReferences\n[1] A. I. Barvinok. Problems of distance geometry and convex properties of quadratic maps.\nDiscrete & Computational Geometry, 13(2):189\u2013202, 1995.\n\n[2] D. P. Bertsekas. On penalty and multiplier methods for constrained minimization. SIAM Journal\non Control and Optimization, 14(2):216\u2013235, 1976.\n\n[3] D. P. Bertsekas. Constrained optimization and lagrange multiplier methods. Computer Science\nand Applied Mathematics, Boston: Academic Press, 1982.\n\n[4] D. P. Bertsekas. Constrained optimization and Lagrange multiplier methods. Academic press,\n2014.\n\n[5] S. Bhojanapalli, N. Boumal, P. Jain, and P. Netrapalli. Smoothed analysis for low-rank solutions\nto semidefinite programs in quadratic penalty form. arXiv preprint arXiv:1803.00186, 2018.\n\n[6] S. Bhojanapalli, A. Kyrillidis, and S. Sanghavi. Dropping convexity for faster semi-definite\noptimization. In Conference on Learning Theory, pages 530\u2013582, 2016.\n\n[7] E. G. Birgin, J. Gardenghi, J. M. Martinez, S. Santos, and P. L. Toint. Evaluation complexity\nfor nonlinear constrained optimization using unscaled kkt conditions and high-order models.\nSIAM Journal on Optimization, 26(2):951\u2013967, 2016.\n\n[8] E. G. Birgin and J. M. Mart\u00ednez. Practical augmented Lagrangian methods for constrained\noptimization, volume 10. SIAM, 2014.\n\n[9] J. Bolte, T. P. Nguyen, J. Peypouquet, and B. W. Suter. From error bounds to the complexity of\nfirst-order descent methods for convex functions. Mathematical Programming, 165(2):471\u2013507,\n2017.\n\n[10] J. Bolte, S. Sabach, and M. Teboulle. Nonconvex lagrangian-based optimization: monitoring\nschemes and global convergence. Mathematics of Operations Research, 2018.\n\n[11] N. Boumal, P.-A. Absil, and C. Cartis. 
Global rates of convergence for nonconvex optimization\non manifolds. arXiv preprint arXiv:1605.08101, 2016.\n\n[12] N. Boumal, B. Mishra, P.-A. Absil, and R. Sepulchre. Manopt, a matlab toolbox for optimization\non manifolds. The Journal of Machine Learning Research, 15(1):1455\u20131459, 2014.\n\n[13] N. Boumal, V. Voroninski, and A. Bandeira. The non-convex burer-monteiro approach works on\nsmooth semidefinite programs. In Advances in Neural Information Processing Systems, pages\n2757\u20132765, 2016.\n\n[14] S. Burer and R. D. Monteiro. A nonlinear programming algorithm for solving semidefinite\nprograms via low-rank factorization. Mathematical Programming, 95(2):329\u2013357, 2003.\n\n[15] S. Burer and R. D. Monteiro. Local minima and convergence in low-rank semidefinite programming.\nMathematical Programming, 103(3):427\u2013444, 2005.\n\n[16] C. Cartis, N. I. Gould, and P. L. Toint. On the evaluation complexity of composite function\nminimization with applications to nonconvex nonlinear programming. SIAM Journal on\nOptimization, 21(4):1721\u20131739, 2011.\n\n[17] C. Cartis, N. I. Gould, and P. L. Toint. Complexity bounds for second-order optimality in\nunconstrained optimization. Journal of Complexity, 28(1):93\u2013108, 2012.\n\n[18] C. Cartis, N. I. Gould, and P. L. Toint. Optimality of orders one to three and beyond: characterization\nand evaluation complexity in constrained nonconvex optimization. Journal of Complexity,\n2018.\n\n[19] A. Chambolle and T. Pock. A first-order primal-dual algorithm for convex problems with\napplications to imaging. Journal of mathematical imaging and vision, 40(1):120\u2013145, 2011.\n\n[20] C. Clason, S. Mazurenko, and T. Valkonen. Acceleration and global convergence of a first-order\nprimal\u2013dual method for nonconvex problems. arXiv preprint arXiv:1802.03347, 2018.\n\n[21] Y.-H. Dai. Convergence properties of the bfgs algoritm. 
SIAM Journal on Optimization,\n13(3):693\u2013701, 2002.\n\n[22] D. Fernandez and M. V. Solodov. Local convergence of exact and inexact augmented lagrangian\nmethods under the second-order sufficient optimality condition. SIAM Journal on Optimization,\n22(2):384\u2013407, 2012.\n\n[23] J. F. B. Ferreira, Y. Khoo, and A. Singer. Semidefinite programming approach for the quadratic\nassignment problem with a sparse graph. Computational Optimization and Applications,\n69(3):677\u2013712, 2018.\n\n[24] R. Fletcher. Practical methods of optimization. John Wiley & Sons, 2013.\n\n[25] F. Flores-Baz\u00e1n, F. Flores-Baz\u00e1n, and C. Vera. A complete characterization of strong duality in\nnonconvex optimization with a single constraint. Journal of Global Optimization, 53(2):185\u2013201,\n2012.\n\n[26] R. Ge, C. Jin, P. Netrapalli, A. Sidford, et al. Efficient algorithms for large-scale generalized\neigenvector computation and canonical correlation analysis. In International Conference on\nMachine Learning, pages 2741\u20132750, 2016.\n\n[27] S. Ghadimi and G. Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic\nprogramming. Mathematical Programming, 156(1-2):59\u201399, 2016.\n\n[28] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville,\nand Y. Bengio. Generative Adversarial Networks. ArXiv e-prints, June 2014.\n\n[29] M. R. Hestenes. Multiplier and gradient methods. Journal of optimization theory and applications,\n4(5):303\u2013320, 1969.\n\n[30] A. Ilyas, A. Jalal, E. Asteri, C. Daskalakis, and A. G. Dimakis. The Robust Manifold Defense:\nAdversarial Training using Generative Models. arXiv e-prints, page arXiv:1712.09196, Dec.\n2017.\n\n[31] M. Jaggi. Revisiting frank-wolfe: Projection-free sparse convex optimization. In ICML (1),\npages 427\u2013435, 2013.\n\n[32] H. Karimi, J. Nutini, and M. Schmidt. 
Linear convergence of gradient and proximal-gradient\nmethods under the polyak-\u0142ojasiewicz condition. In Joint European Conference on Machine\nLearning and Knowledge Discovery in Databases, pages 795\u2013811. Springer, 2016.\n\n[33] S. Khot and A. Naor. Grothendieck-type inequalities in combinatorial optimization. arXiv\npreprint arXiv:1108.2464, 2011.\n\n[34] D. P. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. arXiv e-prints, page\narXiv:1412.6980, Dec. 2014.\n\n[35] B. Kulis, A. C. Surendran, and J. C. Platt. Fast low-rank semidefinite programming for\nembedding and clustering. In Artificial Intelligence and Statistics, pages 235\u2013242, 2007.\n\n[36] G. Lan and R. D. Monteiro. Iteration-complexity of first-order augmented lagrangian methods\nfor convex programming. Mathematical Programming, 155(1-2):511\u2013547, 2016.\n\n[37] L. Lov\u00e1sz. Semidefinite programs and combinatorial optimization. In Recent advances in\nalgorithms and combinatorics, pages 137\u2013194. Springer, 2003.\n\n[38] W. F. Mascarenhas. The bfgs method with exact line searches fails for non-convex objective\nfunctions. Mathematical Programming, 99(1):49\u201361, 2004.\n\n[39] D. G. Mixon, S. Villar, and R. Ward. Clustering subgaussian mixtures by semidefinite programming.\narXiv preprint arXiv:1602.06612, 2016.\n\n[40] E. Mossel, J. Neeman, and A. Sly. Consistency thresholds for the planted bisection model. In\nProceedings of the forty-seventh annual ACM symposium on Theory of computing, pages 69\u201375.\nACM, 2015.\n\n[41] V. Nedelcu, I. Necoara, and Q. Tran-Dinh. Computational complexity of inexact gradient\naugmented lagrangian methods: application to constrained mpc. SIAM Journal on Control and\nOptimization, 52(5):3109\u20133134, 2014.\n\n[42] Y. Nesterov. Primal-dual subgradient methods for convex problems. Mathematical programming,\n120(1):221\u2013259, 2009.\n\n[43] Y. E. Nesterov. 
A method for solving the convex programming problem with convergence rate\nO(1/k^2). In Dokl. Akad. Nauk SSSR, volume 269, pages 543\u2013547, 1983.\n\n[44] J. Nocedal and S. Wright. Numerical Optimization. Springer Series in Operations Research and\nFinancial Engineering. Springer New York, 2006.\n\n[45] M. Nouiehed, J. D. Lee, and M. Razaviyayn. Convergence to second-order stationarity for\nconstrained non-convex optimization. arXiv preprint arXiv:1810.02024, 2018.\n\n[46] G. Obozinski, L. Jacob, and J.-P. Vert. Group lasso with overlaps: the latent group lasso\napproach. arXiv preprint arXiv:1110.0413, 2011.\n\n[47] N. Parikh, S. Boyd, et al. Proximal algorithms. Foundations and Trends in Optimization,\n1(3):127\u2013239, 2014.\n\n[48] D. Park, A. Kyrillidis, S. Bhojanapalli, C. Caramanis, and S. Sanghavi. Provable burer-monteiro\nfactorization for a class of norm-constrained matrix problems. arXiv preprint arXiv:1606.01316,\n2016.\n\n[49] G. Pataki. On the rank of extreme matrices in semidefinite programs and the multiplicity of\noptimal eigenvalues. Mathematics of operations research, 23(2):339\u2013358, 1998.\n\n[50] J. Peng and Y. Wei. Approximating K\u2013means\u2013type clustering via semidefinite programming.\nSIAM J. Optim., 18(1):186\u2013205, 2007.\n\n[51] M. J. Powell. A method for nonlinear constraints in minimization problems. Optimization,\npages 283\u2013298, 1969.\n\n[52] A. Radford, L. Metz, and S. Chintala. Unsupervised Representation Learning with Deep\nConvolutional Generative Adversarial Networks. ArXiv e-prints, Nov. 2015.\n\n[53] P. Raghavendra. Optimal algorithms and inapproximability results for every csp? In Proceedings\nof the fortieth annual ACM symposium on Theory of computing, pages 245\u2013254. ACM, 2008.\n\n[54] R. T. Rockafellar. Convex analysis, volume 28. Princeton university press, 1970.\n\n[55] R. T. Rockafellar. Lagrange multipliers and optimality. 
SIAM review, 35(2):183\u2013238, 1993.\n\n[56] P. Samangouei, M. Kabkab, and R. Chellappa. Defense-GAN: Protecting classifiers against\nadversarial attacks using generative models. In International Conference on Learning Representations,\n2018.\n\n[57] A. Singer. Angular synchronization by eigenvectors and semidefinite programming. Applied\nand computational harmonic analysis, 30(1):20, 2011.\n\n[58] A. Singer and Y. Shkolnisky. Three-dimensional structure determination from common lines in\ncryo-em by eigenvectors and semidefinite programming. SIAM journal on imaging sciences,\n4(2):543\u2013572, 2011.\n\n[59] L. Song, A. Smola, A. Gretton, and K. M. Borgwardt. A dependence maximization view of\nclustering. In Proceedings of the 24th international conference on Machine learning, pages\n815\u2013822. ACM, 2007.\n\n[60] M. Tepper, A. M. Sengupta, and D. Chklovskii. Clustering is semidefinitely not that hard:\nNonnegative sdp for manifold disentangling. Journal of Machine Learning Research, 19(82),\n2018.\n\n[61] Q. Tran-Dinh, A. Alacaoglu, O. Fercoq, and V. Cevher. An adaptive primal-dual framework for\nnonsmooth convex minimization. arXiv preprint arXiv:1808.04648, 2018.\n\n[62] Q. Tran-Dinh, O. Fercoq, and V. Cevher. A smooth primal-dual optimization framework for\nnonsmooth composite convex minimization. SIAM Journal on Optimization, 28(1):96\u2013134,\n2018.\n\n[63] I. Waldspurger and A. Waters. Rank optimality for the burer-monteiro factorization. arXiv\npreprint arXiv:1812.03046, 2018.\n\n[64] H. Xiao, K. Rasul, and R. Vollgraf. Fashion-mnist: a novel image dataset for benchmarking\nmachine learning algorithms, 2017.\n\n[65] Y. Xu. Iteration complexity of inexact augmented lagrangian methods for constrained convex\nprogramming. arXiv preprint arXiv:1711.05812v2, 2017.\n\n[66] Y. Xu and W. Yin. A globally convergent algorithm for nonconvex optimization based on block\ncoordinate update. 
Journal of Scientific Computing, 72(2):700\u2013734, 2017.\n\n[67] L. Yang, D. Sun, and K.-C. Toh. Sdpnal+: a majorized semismooth newton-cg augmented\nlagrangian method for semidefinite programming with nonnegative constraints. Mathematical\nProgramming Computation, 7(3):331\u2013366, 2015.\n\n[68] A. Yurtsever, Q. T. Dinh, and V. Cevher. A universal primal-dual convex optimization framework.\nIn Advances in Neural Information Processing Systems, pages 3150\u20133158, 2015.\n\n[69] A. Yurtsever, O. Fercoq, F. Locatello, and V. Cevher. A conditional gradient framework for\ncomposite convex minimization with applications to semidefinite programming. arXiv preprint\narXiv:1804.08544, 2018.\n\n[70] Q. Zhao, S. E. Karisch, F. Rendl, and H. Wolkowicz. Semidefinite programming relaxations for\nthe quadratic assignment problem. Journal of Combinatorial Optimization, 2(1):71\u2013109, 1998.\n", "award": [], "sourceid": 7788, "authors": [{"given_name": "Mehmet Fatih", "family_name": "Sahin", "institution": "\u00c9cole Polytechnique F\u00e9d\u00e9rale de Lausanne"}, {"given_name": "Armin", "family_name": "eftekhari", "institution": "EPFL"}, {"given_name": "Ahmet", "family_name": "Alacaoglu", "institution": "EPFL"}, {"given_name": "Fabian", "family_name": "Latorre", "institution": "EPFL"}, {"given_name": "Volkan", "family_name": "Cevher", "institution": "EPFL"}]}