{"title": "NESTT: A Nonconvex Primal-Dual Splitting Method for Distributed and Stochastic Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 3215, "page_last": 3223, "abstract": "We study a stochastic and distributed algorithm for nonconvex problems whose objective consists a sum $N$ nonconvex $L_i/N$-smooth functions, plus a nonsmooth regularizer. The proposed NonconvEx primal-dual SpliTTing (NESTT) algorithm splits the problem into $N$ subproblems, and utilizes an augmented Lagrangian based primal-dual scheme to solve it in a distributed and stochastic manner. With a special non-uniform sampling, a version of NESTT achieves $\\epsilon$-stationary solution using $\\mathcal{O}((\\sum_{i=1}^N\\sqrt{L_i/N})^2/\\epsilon)$ gradient evaluations, which can be up to $\\mathcal{O}(N)$ times better than the (proximal) gradient descent methods. It also achieves Q-linear convergence rate for nonconvex $\\ell_1$ penalized quadratic problems with polyhedral constraints. Further, we reveal a fundamental connection between {\\it primal-dual} based methods and a few {\\it primal only} methods such as IAG/SAG/SAGA.", "full_text": "NESTT: A Nonconvex Primal-Dual Splitting Method\n\nfor Distributed and Stochastic Optimization\n\nDavood Hajinezhad, Mingyi Hong \u2217\n\nTuo Zhao\u2020\n\nZhaoran Wang\u2021\n\nAbstract\n\n\u0001-stationary solution using O(((cid:80)N\n\nWe study a stochastic and distributed algorithm for nonconvex problems whose\nobjective consists of a sum of N nonconvex Li/N-smooth functions, plus a non-\nsmooth regularizer. The proposed NonconvEx primal-dual SpliTTing (NESTT)\nalgorithm splits the problem into N subproblems, and utilizes an augmented\nLagrangian based primal-dual scheme to solve it in a distributed and stochastic\nmanner. 
With a special non-uniform sampling, a version of NESTT achieves an $\epsilon$-stationary solution using $O((\sum_{i=1}^N \sqrt{L_i/N})^2/\epsilon)$ gradient evaluations, which can be up to $O(N)$ times better than the (proximal) gradient descent methods. It also achieves a Q-linear convergence rate for nonconvex $\ell_1$ penalized quadratic problems with polyhedral constraints. Further, we reveal a fundamental connection between primal-dual based methods and a few primal only methods such as IAG/SAG/SAGA.

1 Introduction

Consider the following nonconvex and nonsmooth constrained optimization problem

    $\min_{z \in Z} f(z) := \frac{1}{N}\sum_{i=1}^N g_i(z) + g_0(z) + p(z)$,    (1.1)

where $Z \subseteq R^d$; for each $i \in \{0, \cdots, N\}$, $g_i : R^d \to R$ is a smooth, possibly nonconvex function which has an $L_i$-Lipschitz continuous gradient; $p(z) : R^d \to R$ is a lower semi-continuous convex but possibly nonsmooth function. Define $g(z) := \frac{1}{N}\sum_{i=1}^N g_i(z)$ for notational simplicity.

Problem (1.1) is quite general. It arises frequently in applications such as machine learning and signal processing; see a recent survey [7]. In particular, each smooth function $g_i$ can represent: 1) a mini-batch of loss functions modeling data fidelity, such as the $\ell_2$ loss, the logistic loss, etc.; 2) nonconvex activation functions for neural networks, such as the logit or the tanh functions; 3) nonconvex utility functions used in signal processing and resource allocation, see [4]. The smooth function $g_0$ can represent smooth nonconvex regularizers such as the non-quadratic penalties [2], or the smooth part of the SCAD or MCP regularizers (which is a concave function) [26]. 
The convex function $p$ can take the following forms: 1) nonsmooth convex regularizers such as the $\ell_1$ and $\ell_2$ functions; 2) an indicator function for a convex and closed feasible set $Z$, denoted as $\iota_Z(\cdot)$; 3) convex functions without a globally Lipschitz continuous gradient, such as $p(z) = z^4$ or $p(z) = 1/z + \iota_{z \ge 0}(z)$.

In this work we solve (1.1) in a stochastic and distributed manner. We consider the setting in which $N$ distributed agents each have knowledge of one smooth function $g_i$, and they are connected to a cluster center which handles $g_0$ and $p$. At any given time, a randomly selected agent is activated and performs computation to optimize its local objective. Such a distributed computation model has been popular in large-scale machine learning and signal processing [6]. Such a model is also closely related to the (centralized) stochastic finite-sum optimization problem [1, 9, 14, 15, 21, 22], in which each time the iterate is updated based on the gradient information of a random component function. One of the key differences between these two problem types is that in the distributed setting there can be disagreement between local copies of the optimization variable $z$, while in the centralized setting only one copy of $z$ is maintained.

* Department of Industrial & Manufacturing Systems Engineering and Department of Electrical & Computer Engineering, Iowa State University, Ames, IA, {dhaji,mingyi}@iastate.edu
† School of Industrial and Systems Engineering, Georgia Institute of Technology, tourzhao@gatech.edu
‡ Department of Operations Research, Princeton University, zhaoran@princeton.edu

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Our Contributions. We propose a class of NonconvEx primal-dual SpliTTing (NESTT) algorithms for problem (1.1). 
We split $z \in R^d$ into local copies $x_i \in R^d$, while enforcing the equality constraints $x_i = z$ for all $i$. That is, we consider the following reformulation of (1.1)

    $\min_{x, z \in R^d} \ell(x, z) := \frac{1}{N}\sum_{i=1}^N g_i(x_i) + g_0(z) + h(z)$, s.t. $x_i = z$, $i = 1, \cdots, N$,    (1.2)

where $h(z) := \iota_Z(z) + p(z)$ and $x := [x_1; \cdots; x_N]$. Our algorithm uses the Lagrangian relaxation of the equality constraints, and at each iteration a (possibly non-uniformly) randomly selected primal variable is optimized, followed by an approximate dual ascent step. Note that such a splitting scheme has been popular in the convex setting [6], but not so when the problem becomes nonconvex.

The NESTT is one of the first stochastic algorithms for distributed nonconvex nonsmooth optimization with provable and nontrivial convergence rates. Our main contributions are given below. First, in terms of some primal and dual optimality gaps, NESTT converges sublinearly to a point belonging to the stationary solution set of (1.2). Second, NESTT converges Q-linearly for certain nonconvex $\ell_1$ penalized quadratic problems. To the best of our knowledge, this is the first time that linear convergence is established for stochastic and distributed optimization of such type of problems. Third, we show that a gradient-based NESTT with non-uniform sampling achieves an $\epsilon$-stationary solution of (1.1) using $O((\sum_{i=1}^N \sqrt{L_i/N})^2/\epsilon)$ gradient evaluations. Compared with the classical gradient descent, which in the worst case requires $O(\sum_{i=1}^N L_i/\epsilon)$ gradient evaluations to achieve $\epsilon$-stationarity, our obtained rate can be up to $O(N)$ times better in the case where the $L_i$'s are not equal.

Our work also reveals a fundamental connection between primal-dual based algorithms and the primal only average-gradient based algorithms such as SAGA/SAG/IAG [5, 9, 22]. 
With the key observation that the dual variables in NESTT serve as the "memory" of the past gradients, one can specialize NESTT to SAGA/SAG/IAG. Therefore, NESTT naturally generalizes these algorithms to the nonconvex nonsmooth setting. It is our hope that by bridging the primal-dual splitting algorithms and the primal-only algorithms (in both the convex and nonconvex settings), there can be significant further research developments benefiting both algorithm classes.

Related Work. Many stochastic algorithms have been designed for (1.2) when it is convex. In these algorithms the component functions $g_i$ are randomly sampled and optimized. Popular algorithms include SAG/SAGA [9, 22], SDCA [23], SVRG [14], RPDG [15] and so on. When the problem becomes nonconvex, the well-known incremental based algorithms can be used [3, 24], but these methods generally lack convergence rate guarantees. The SGD based method has been studied in [10], with an $O(1/\epsilon^2)$ convergence rate. Recent works [1] and [21] develop algorithms based on SVRG and SAGA for a special case of (1.1) where the entire problem is smooth and unconstrained. To the best of our knowledge, there have been no stochastic algorithms with provable, non-trivial convergence rate guarantees for solving problem (1.1). On the other hand, a distributed stochastic algorithm for solving problem (1.1) in the nonconvex setting has been proposed in [13], in which each time a randomly picked subset of agents update their local variables. However, there has been no convergence rate analysis for such a distributed stochastic scheme. There have also been some recent distributed algorithms designed for (1.1) [17], but again without global convergence rate guarantees.

Preliminaries. 
The augmented Lagrangian function for problem (1.2) is given by:

    $L(x, z; \lambda) = \sum_{i=1}^N \left( \frac{1}{N} g_i(x_i) + \langle \lambda_i, x_i - z \rangle + \frac{\eta_i}{2}\|x_i - z\|^2 \right) + g_0(z) + h(z)$,    (1.3)

where $\lambda := \{\lambda_i\}_{i=1}^N$ is the set of dual variables, and $\eta := \{\eta_i > 0\}_{i=1}^N$ are penalty parameters. We make the following assumptions about problem (1.1) and the function (1.3).

A-(a) The function $f(z)$ is bounded from below over $Z \cap \mathrm{int}(\mathrm{dom} f)$: $\underline{f} := \min_{z \in Z} f(z) > -\infty$. $p(z)$ is a convex lower semi-continuous function; $Z$ is a closed convex set.

A-(b) The $g_i$'s and $g$ have Lipschitz continuous gradients, i.e.,

    $\|\nabla g(y) - \nabla g(z)\| \le L\|y - z\|$, and $\|\nabla g_i(y) - \nabla g_i(z)\| \le L_i\|y - z\|$, $\forall y, z$.

Algorithm 1 NESTT-G Algorithm
1: for r = 1 to R do
2:   Pick $i_r \in \{1, 2, \cdots, N\}$ with probability $p_{i_r}$ and update $(x, \lambda)$:
       $x^{r+1}_{i_r} = \arg\min_{x_{i_r}} V_{i_r}(x_{i_r}, z^r, \lambda^r_{i_r})$;    (2.4)
       $\lambda^{r+1}_{i_r} = \lambda^r_{i_r} + \alpha_{i_r}\eta_{i_r}(x^{r+1}_{i_r} - z^r)$;    (2.5)
       $\lambda^{r+1}_j = \lambda^r_j$, $x^{r+1}_j = z^r$, $\forall j \ne i_r$;    (2.6)
     Update $z$: $z^{r+1} = \arg\min_{z \in Z} L(\{x^{r+1}_i\}, z; \lambda^r)$.    (2.7)
3: end for
4: Output: $(z^m, x^m, \lambda^m)$ where $m$ is randomly picked from $\{1, 2, \cdots, R\}$.

Clearly $L \le \frac{1}{N}\sum_{i=1}^N L_i$, and the equality can be achieved in the worst case. For simplicity of analysis we will further assume that $L_0 \le \frac{1}{N}\sum_{i=1}^N L_i$.

A-(c) Each $\eta_i$ in (1.3) satisfies $\eta_i > L_i/N$; if $g_0$ is nonconvex, then $\sum_{i=1}^N \eta_i > 3L_0$.

Assumption A-(c) implies that $L(x, z; \lambda)$ is strongly convex w.r.t. 
each $x_i$ and $z$, with moduli $\gamma_i := \eta_i - L_i/N$ and $\gamma_z := \sum_{i=1}^N \eta_i - L_0$, respectively [27, Theorem 2.1].

We then define the prox-gradient (pGRAD) for (1.1), which will serve as a measure of stationarity. It can be checked that the pGRAD vanishes at the set of stationary solutions of (1.1) [20].

Definition 1.1. The proximal gradient of problem (1.1) is given by (for any $\gamma > 0$)

    $\tilde\nabla f_\gamma(z) := \gamma\left(z - \mathrm{prox}^\gamma_{p+\iota_Z}\left[z - \frac{1}{\gamma}\nabla(g(z) + g_0(z))\right]\right)$, with $\mathrm{prox}^\gamma_{p+\iota_Z}[u] := \arg\min_{v \in Z}\ p(v) + \frac{\gamma}{2}\|u - v\|^2$.

2 The NESTT-G Algorithm

Algorithm Description. We present a primal-dual splitting scheme for the reformulated problem (1.2). The algorithm is referred to as the NESTT with Gradient step (NESTT-G) since each agent only requires the gradient of its component function. To proceed, let us define the following function (for some constants $\{\alpha_i > 0\}_{i=1}^N$):

    $V_i(x_i, z; \lambda_i) = \frac{1}{N} g_i(z) + \frac{1}{N}\langle \nabla g_i(z), x_i - z \rangle + \langle \lambda_i, x_i - z \rangle + \frac{\alpha_i \eta_i}{2}\|x_i - z\|^2$.

Note that $V_i(\cdot)$ is related to $L(\cdot)$ in the following way: it is a quadratic approximation (approximated at the point $z$) of $L(x, z; \lambda)$ w.r.t. $x_i$. The parameters $\alpha := \{\alpha_i\}_{i=1}^N$ give some freedom to the algorithm design, and they are critical in improving convergence rates as well as in establishing connections between NESTT-G and a few primal only stochastic optimization schemes.

The algorithm proceeds as follows. Before each iteration begins, the cluster center broadcasts $z$ to everyone. At iteration $r + 1$, a randomly selected agent $i_r \in \{1, 2, \cdots, N\}$ is picked, who minimizes $V_{i_r}(\cdot)$ w.r.t. its local variable $x_{i_r}$, followed by a dual ascent step for $\lambda_{i_r}$. 
The rest of the agents update their local variables by simply setting them to $z$. The cluster center then minimizes $L(x, z; \lambda)$ with respect to $z$. See Algorithm 1 for details. We remark that NESTT-G is related to the popular ADMM method for convex optimization [6]. However, our particular update schedule (randomly picking $(x_i, \lambda_i)$ plus deterministically updating $z$), combined with the special x-step (minimizing an approximation of $L(\cdot)$ evaluated at a different block variable $z$), is not known before. These features are critical in our following rate analysis.

Convergence Analysis. To proceed, let us define $r(j)$ as the last iteration in which the $j$th block is picked before iteration $r + 1$, i.e., $r(j) := \max\{t \mid t < r + 1, j = i(t)\}$. Define $y^r_j := z^{r(j)}$ if $j \ne i_r$, and $y^r_{i_r} := z^r$. Define the filtration $F^r$ as the $\sigma$-field generated by $\{i(t)\}_{t=1}^{r-1}$.

A few important observations are in order. Combining the $(x, z)$ updates (2.4)-(2.7), we have

    $x^{r+1}_q = z^r - \frac{1}{\alpha_q\eta_q}\left(\lambda^r_q + \frac{1}{N}\nabla g_q(z^r)\right)$, i.e. $\lambda^r_q + \frac{1}{N}\nabla g_q(z^r) + \alpha_q\eta_q(x^{r+1}_q - z^r) = 0$, with $q = i_r$;    (2.8a)
    $\lambda^{r+1}_{i_r} = -\frac{1}{N}\nabla g_{i_r}(z^r)$, $\lambda^{r+1}_j = -\frac{1}{N}\nabla g_j(z^{r(j)})$, $\forall j \ne i_r$, $\Rightarrow$ $\lambda^{r+1}_i = -\frac{1}{N}\nabla g_i(y^r_i)$, $\forall i$;    (2.8b)
    $x^{r+1}_j \overset{(2.6)}{=} z^r = z^r - \frac{1}{\alpha_j\eta_j}\left(\lambda^r_j + \frac{1}{N}\nabla g_j(z^{r(j)})\right)$, $\forall j \ne i_r$.    (2.8c)

The key here is that the dual variables serve as the "memory" for the past gradients of the $g_i$'s. To proceed, we first construct a potential function using an upper bound of $L(x, z; \lambda)$. 
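For concreteness, the updates (2.4)-(2.8) admit a short implementation in the simplified case $h \equiv 0$ and $g_0 \equiv 0$, where both the x-step and the z-step have closed forms. The sketch below is ours (function and variable names are illustrative, not from the paper), and the parameters follow the choices later given in Theorem 2.1:

```python
import numpy as np

def nestt_g(grads, lip, d, iters=3000, seed=0):
    """Sketch of NESTT-G (Algorithm 1) for the smooth unconstrained case
    h = g0 = 0.  grads[i](z) returns the gradient of g_i at z (not divided
    by N); lip[i] is the Lipschitz constant L_i of that gradient."""
    rng = np.random.default_rng(seed)
    N = len(grads)
    s = np.sqrt(np.asarray(lip, dtype=float) / N)   # sqrt(L_i / N)
    p = s / s.sum()                 # non-uniform sampling probabilities (2.13)
    eta = 3.0 * s.sum() * s         # eta_i = 3 (sum_j sqrt(L_j/N)) sqrt(L_i/N)
    alpha = p                       # alpha_i = p_i = beta * eta_i
    z = np.zeros(d)
    x = np.zeros((N, d))
    # Duals store -grad g_i / N, the "gradient memory" of (2.8b).
    lam = np.array([-grads[i](z) / N for i in range(N)])
    for _ in range(iters):
        i = rng.choice(N, p=p)
        x[:] = z                    # inactive agents copy z, cf. (2.6)
        # x-step (2.4) in closed form, cf. (2.8a)
        x[i] = z - (lam[i] + grads[i](z) / N) / (alpha[i] * eta[i])
        # z-step (2.7) with h = g0 = 0, using the old duals lambda^r
        z_next = (eta[:, None] * x + lam).sum(axis=0) / eta.sum()
        # dual ascent (2.5); by (2.8b) this equals -grad g_i(z^r) / N
        lam[i] += alpha[i] * eta[i] * (x[i] - z)
        z = z_next
    return z
```

On a toy finite sum of strongly convex quadratics, this loop drives $z$ to the minimizer of $\frac{1}{N}\sum_i g_i$; with $\alpha_i = 1$ and $p_i = 1/N$ the same loop reduces to a SAG-type iteration, matching the discussion in Section 4.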
Note that

    $\frac{1}{N} g_j(x^{r+1}_j) + \langle \lambda^r_j, x^{r+1}_j - z^r \rangle + \frac{\eta_j}{2}\|x^{r+1}_j - z^r\|^2 = \frac{1}{N} g_j(z^r)$, $\forall j \ne i_r$,    (2.9)

    $\frac{1}{N} g_{i_r}(x^{r+1}_{i_r}) + \langle \lambda^r_{i_r}, x^{r+1}_{i_r} - z^r \rangle + \frac{\eta_{i_r}}{2}\|x^{r+1}_{i_r} - z^r\|^2 \overset{(i)}{\le} \frac{1}{N} g_{i_r}(z^r) + \frac{\eta_{i_r} + L_{i_r}/N}{2}\|x^{r+1}_{i_r} - z^r\|^2 \overset{(ii)}{=} \frac{1}{N} g_{i_r}(z^r) + \frac{\eta_{i_r} + L_{i_r}/N}{2(\alpha_{i_r}\eta_{i_r})^2}\left\|\frac{1}{N}(\nabla g_{i_r}(y^{r-1}_{i_r}) - \nabla g_{i_r}(z^r))\right\|^2$,    (2.10)

where (i) uses (2.8b) and applies the descent lemma on the function $\frac{1}{N} g_i(\cdot)$; in (ii) we have used (2.5) and (2.8b). Since each $i$ is picked with probability $p_i$, we have

    $E_{i_r}[L(x^{r+1}, z^r; \lambda^r) \mid F^r] \le \sum_{i=1}^N \frac{1}{N} g_i(z^r) + \sum_{i=1}^N \frac{p_i(\eta_i + L_i/N)}{2(\alpha_i\eta_i)^2}\left\|\frac{1}{N}(\nabla g_i(y^{r-1}_i) - \nabla g_i(z^r))\right\|^2 + g_0(z^r) + h(z^r)$
    $\le \sum_{i=1}^N \frac{1}{N} g_i(z^r) + \sum_{i=1}^N \frac{3 p_i \eta_i}{2(\alpha_i\eta_i)^2}\left\|\frac{1}{N}(\nabla g_i(y^{r-1}_i) - \nabla g_i(z^r))\right\|^2 + g_0(z^r) + h(z^r) := Q^r$,

where in the last inequality we have used Assumption A-(c). In the following, we will use $E_{F^r}[Q^r]$ as the potential function, and show that it decreases at each iteration.

Lemma 2.1. Suppose Assumption A holds, and pick

    $\alpha_i = p_i = \beta\eta_i$, where $\beta := \frac{1}{\sum_{i=1}^N \eta_i}$, and $\eta_i \ge \frac{9 L_i}{N p_i}$, $i = 1, \cdots, N$.    (2.11)

Then the following descent estimate holds true for NESTT-G:

    $E[Q^r - Q^{r-1} \mid F^{r-1}] \le -\frac{\sum_{i=1}^N \eta_i}{8} E\|z^r - z^{r-1}\|^2 - \sum_{i=1}^N \frac{1}{2\eta_i}\left\|\frac{1}{N}(\nabla g_i(z^{r-1}) - \nabla g_i(y^{r-2}_i))\right\|^2$.    (2.12)

Sublinear Convergence. 
Define the optimality gap as the following:

    $E[G^r] := E[\|\tilde\nabla_{1/\beta} f(z^r)\|^2] = \frac{1}{\beta^2} E\left[\left\|z^r - \mathrm{prox}^{1/\beta}_h[z^r - \beta\nabla(g(z^r) + g_0(z^r))]\right\|^2\right]$.

Note that when $h, g_0 \equiv 0$, $E[G^r]$ reduces to $E[\|\nabla g(z^r)\|^2]$. We have the following result.

Theorem 2.1. Suppose Assumption A holds, and pick (for $i = 1, \cdots, N$)

    $\alpha_i = p_i = \frac{\sqrt{L_i/N}}{\sum_{j=1}^N \sqrt{L_j/N}}$, $\eta_i = 3\Big(\sum_{j=1}^N \sqrt{L_j/N}\Big)\sqrt{L_i/N}$, $\beta = \frac{1}{3(\sum_{i=1}^N \sqrt{L_i/N})^2}$.    (2.13)

Then every limit point generated by NESTT-G is a stationary solution of problem (1.2). Further,

    1) $E[G^m] \le \frac{80}{3}\Big(\sum_{i=1}^N \sqrt{L_i/N}\Big)^2 \frac{E[Q^1 - Q^{R+1}]}{R}$;
    2) $E[G^m] + E\Big[\sum_{i=1}^N 3\eta_i^2 \|x^m_i - z^{m-1}\|^2\Big] \le \frac{80}{3}\Big(\sum_{i=1}^N \sqrt{L_i/N}\Big)^2 \frac{E[Q^1 - Q^{R+1}]}{R}$.    (2.14)

Note that Part (1) is useful in the centralized finite-sum minimization setting, as it shows the sublinear convergence of NESTT-G measured only by the primal optimality gap evaluated at $z^r$. Meanwhile, part (2) is useful in the distributed setting, as it also shows that the expected constraint violation, which measures the consensus among the agents, shrinks in the same order. 
We also comment that the above result suggests that to achieve an $\epsilon$-stationary solution, NESTT-G requires about $O\big((\sum_{i=1}^N \sqrt{L_i/N})^2/\epsilon\big)$ gradient evaluations (for simplicity we have ignored an additive $N$ factor for evaluating the gradient of the entire function at the initial step of the algorithm).

Algorithm 2 NESTT-E Algorithm
1: for r = 1 to R do
2:   Update $z$ by minimizing the augmented Lagrangian:
       $z^{r+1} = \arg\min_z L(x^r, z; \lambda^r)$.    (3.15)
3:   Randomly pick $i_r \in \{1, 2, \cdots, N\}$ with probability $p_{i_r}$:
       $x^{r+1}_{i_r} = \arg\min_{x_{i_r}} U_{i_r}(x_{i_r}, z^{r+1}; \lambda^r_{i_r})$;    (3.16)
       $\lambda^{r+1}_{i_r} = \lambda^r_{i_r} + \alpha_{i_r}\eta_{i_r}(x^{r+1}_{i_r} - z^{r+1})$;    (3.17)
       $\lambda^{r+1}_j = \lambda^r_j$, $x^{r+1}_j = x^r_j$, $\forall j \ne i_r$.    (3.18)
4: end for
5: Output: $(z^m, x^m, \lambda^m)$ where $m$ is randomly picked from $\{1, 2, \cdots, R\}$.

It is interesting to observe that our choice of $p_i$ is proportional to the square root of the Lipschitz constant of each component function, rather than to $L_i$. Because of such a choice of the sampling probability, the derived convergence rate has a mild dependency on $N$ and the $L_i$'s. Compared with the conventional gradient-based methods, our scaling can be up to $N$ times better. Detailed discussion and comparison will be given in Section 4.

Note that similar sublinear convergence rates can be obtained for the case $\alpha_i = 1$ for all $i$ (with different scaling constants). However, due to space limitation, we will not present those results here.

Linear Convergence. In this section we show that NESTT-G is capable of linear convergence for a family of nonconvex quadratic problems, which has important applications, for example in high-dimensional statistical learning [16]. 
To proceed, we will assume the following.

B-(a) Each function $g_i(z)$ is a quadratic function of the form $g_i(z) = \frac{1}{2} z^T A_i z + \langle b, z \rangle$, where $A_i$ is a symmetric matrix but not necessarily positive semidefinite;

B-(b) The feasible set $Z$ is a closed compact polyhedral set;

B-(c) The nonsmooth function is $p(z) = \mu\|z\|_1$, for some $\mu \ge 0$.

Our linear convergence result is based upon a certain error bound condition around the stationary solution set, which has been shown in [18] for smooth quadratic problems and has been extended to include the $\ell_1$ penalty in [25, Theorem 4]. Due to space limitation, the statement of the condition is given in the supplemental material, along with the proof of the following result.

Theorem 2.2. Suppose that Assumptions A and B are satisfied. Then the sequence $\{E[Q^{r+1}]\}_{r=1}^\infty$ converges Q-linearly⁴ to some $Q^* = f(z^*)$, where $z^*$ is a stationary solution for problem (1.1). That is, there exist a finite $\bar r > 0$ and $\rho \in (0, 1)$ such that for all $r \ge \bar r$, $E[Q^{r+1} - Q^*] \le \rho E[Q^r - Q^*]$.

Linear convergence of this type for problems satisfying Assumption B has been shown for (deterministic) proximal gradient based methods [25, Theorems 2, 3]. To the best of our knowledge, this is the first result that shows the same linear convergence for a stochastic and distributed algorithm.

3 The NESTT-E Algorithm

Algorithm Description. In this section, we present a variant of NESTT-G, named NESTT with Exact minimization (NESTT-E). Our motivation is the following. First, in NESTT-G every agent should update its local variable at every iteration [cf. (2.4) or (2.6)]. In practice this may not be possible; for example, at any given time a few agents can be in sleeping mode, so they cannot perform (2.6). 
Second, in the distributed setting it has been generally observed (e.g., see [8, Section V]) that performing exact minimization (whenever possible) instead of taking gradient steps for the local problems can significantly speed up the algorithm. The NESTT-E algorithm to be presented in this section is designed to address these issues. To proceed, let us define a new function as follows:

    $U(x, z; \lambda) := \sum_{i=1}^N U_i(x_i, z; \lambda_i) := \sum_{i=1}^N \left( \frac{1}{N} g_i(x_i) + \langle \lambda_i, x_i - z \rangle + \frac{\alpha_i\eta_i}{2}\|x_i - z\|^2 \right)$.

Note that if $\alpha_i = 1$ for all $i$, then $L(x, z; \lambda) = U(x, z; \lambda) + g_0(z) + h(z)$. The algorithm details are presented in Algorithm 2.

⁴ A sequence $\{x^r\}$ is said to converge Q-linearly to some $\bar x$ if $\limsup_r \|x^{r+1} - \bar x\|/\|x^r - \bar x\| \le \rho$, where $\rho \in (0, 1)$ is some constant; cf. [25] and references therein.

Convergence Analysis. We begin analyzing NESTT-E. The proof technique is quite different from that for NESTT-G; it is based upon using the expected value of the augmented Lagrangian function as the potential function; see [11, 12, 13]. 
For ease of description we define the following quantities:

    $w := (x, z, \lambda)$, $\beta := \frac{1}{\sum_{i=1}^N \eta_i}$, $c_i := \frac{L_i^2}{\alpha_i\eta_i N^2} - \frac{\gamma_i}{2} + \frac{1 - \alpha_i}{\alpha_i}\frac{L_i}{N}$, $\alpha := \{\alpha_i\}_{i=1}^N$.

To measure the optimality of NESTT-E, define the prox-gradient of $L(x, z; \lambda)$ as:

    $\tilde\nabla L(w) := \left[ (z - \mathrm{prox}_h[z - \nabla_z(L(w) - h(z))]); \nabla_{x_1}L(w); \cdots; \nabla_{x_N}L(w) \right] \in R^{(N+1)d}$.    (3.19)

We define the optimality gap by adding to $\|\tilde\nabla L(w)\|^2$ the size of the constraint violation [13]:

    $H(w^r) := \|\tilde\nabla L(w^r)\|^2 + \sum_{i=1}^N \frac{L_i^2}{N^2}\|x^r_i - z^r\|^2$.

It can be verified that $H(w^r) \to 0$ implies that $w^r$ reaches a stationary solution for problem (1.2). We have the following theorem regarding the convergence properties of NESTT-E.

Theorem 3.1. Suppose Assumption A holds, and that $(\eta_i, \alpha_i)$ are chosen such that $c_i < 0$. Then for some constant $\underline{f}$, we have

    $E[L(w^r)] \ge E[L(w^{r+1})] \ge \underline{f} > -\infty$, $\forall r \ge 0$.

Further, almost surely every limit point of $\{w^r\}$ is a stationary solution of problem (1.2). 
Finally, for some function of $\alpha$ denoted as $C(\alpha) = \sigma_1(\alpha)/\sigma_2(\alpha)$, we have the following:

    $E[H(w^m)] \le C(\alpha)\frac{E[L(w^1) - L(w^{R+1})]}{R}$,    (3.20)

where $\sigma_1 := \max(\hat\sigma_1(\alpha), \tilde\sigma_1)$ and $\sigma_2 := \max(\hat\sigma_2(\alpha), \tilde\sigma_2)$; these constants are explicit functions of $\{L_i, \eta_i, \alpha_i, p_i\}$ and $L_0$ (for instance, $\tilde\sigma_2 = \frac{1}{2}(\sum_{i=1}^N \eta_i - L_0)$).

We remark that the above result shows the sublinear convergence of NESTT-E to the set of stationary solutions. Note that $\gamma_i = \eta_i - L_i/N$; to satisfy $c_i < 0$, a simple derivation yields

    $\eta_i > \frac{L_i}{2N\alpha_i}\Big((2 - \alpha_i) + \sqrt{(\alpha_i - 2)^2 + 8\alpha_i}\Big)$.

Further, the above result characterizes the dependency of the rates on various parameters of the algorithm. For example, to see the effect of $\alpha$ on the convergence rate, let us set $p_i = \frac{L_i}{\sum_{j=1}^N L_j}$ and $\eta_i = 3L_i/N$, and assume $L_0 = 0$; then consider two different choices of $\alpha$: $\hat\alpha_i = 1, \forall i$ and $\tilde\alpha_i = 4, \forall i$. One can easily check that applying these different choices leads to the following results:

    $C(\hat\alpha) = 49\sum_{i=1}^N L_i/N$, $C(\tilde\alpha) = 28\sum_{i=1}^N L_i/N$.

The key observation is that increasing the $\alpha_i$'s reduces the constant in front of the rate. 
Hence, we expect that in practice larger $\alpha_i$'s will yield faster convergence.

4 Connections and Comparisons with Existing Works

In this section we compare NESTT-G/E with a few existing algorithms in the literature. First, we present a somewhat surprising observation: NESTT-G takes the same form as some well-known algorithms for convex finite-sum problems. To formally state such a relation, we show in the following result that NESTT-G in fact admits a compact primal-only characterization.

Table 1: Comparison of # of gradient evaluations for NESTT-G and GD in the worst case

# of Gradient Evaluations | NESTT-G: $O((\sum_{i=1}^N\sqrt{L_i/N})^2/\epsilon)$ | GD: $O(\sum_{i=1}^N L_i/\epsilon)$
Case I: $L_i = 1, \forall i$ | $O(N/\epsilon)$ | $O(N/\epsilon)$
Case II: $O(\sqrt{N})$ terms with $L_i = N$, the rest with $L_i = 1$ | $O(N/\epsilon)$ | $O(N^{3/2}/\epsilon)$
Case III: $O(1)$ terms with $L_i = N^2$, the rest with $L_i = 1$ | $O(N/\epsilon)$ | $O(N^2/\epsilon)$

Proposition 4.1. NESTT-G can be written in the following compact form:

    $z^{r+1} = \arg\min_z\ h(z) + g_0(z) + \frac{1}{2\beta}\|z - u^{r+1}\|^2$,    (4.21a)
    with $u^{r+1} := z^r - \beta\left(\frac{1}{N\alpha_{i_r}}(\nabla g_{i_r}(z^r) - \nabla g_{i_r}(y^{r-1}_{i_r})) + \frac{1}{N}\sum_{i=1}^N \nabla g_i(y^{r-1}_i)\right)$.    (4.21b)

Based on this observation, the following comments are in order.

(1) Suppose $h \equiv 0$, $g_0 \equiv 0$ and $\alpha_i = 1$, $p_i = 1/N$ for all $i$. Then (4.21) takes the same form as the SAG presented in [22]. Further, when the component functions $g_i$ are picked cyclically in a Gauss-Seidel manner, the iteration (4.21) takes the same form as the IAG algorithm [5].

(2) Suppose $h \ne 0$ and $g_0 \ne 0$, and $\alpha_i = p_i = 1/N$ for all $i$. 
Then (4.21) is the same as the SAGA algorithm [9], which is designed for optimizing convex nonsmooth finite-sum problems.

Note that SAG/SAGA/IAG are all designed for convex problems. Through the lens of primal-dual splitting, our work shows that they can be generalized to nonconvex nonsmooth problems as well.

Secondly, NESTT-E is related to the proximal version of the nonconvex ADMM [13, Algorithm 2]. However, the introduction of the $\alpha_i$'s is new; it can significantly improve the practical performance but complicates the analysis. Further, there has been no counterpart of the sublinear and linear convergence rate analysis for the stochastic version of [13, Algorithm 2].

Thirdly, we note that a recent paper [21] has shown that SAGA works for smooth and unconstrained nonconvex problems. Suppose that $h \equiv 0$, $g_0 \ne 0$, $L_i = L_j, \forall i, j$, and $\alpha_i = p_i = 1/N$; the authors show that SAGA achieves $\epsilon$-stationarity using $O(N^{2/3}(\sum_{i=1}^N L_i/N)/\epsilon)$ gradient evaluations. Compared with GD, which achieves $\epsilon$-stationarity using $O(\sum_{i=1}^N L_i/\epsilon)$ gradient evaluations in the worst case (in the sense that $\sum_{i=1}^N L_i/N = L$), the rate in [21] is $O(N^{1/3})$ times better. However, the algorithm in [21] differs from NESTT-G in two aspects: 1) it does not generalize to the nonsmooth constrained problem (1.1); 2) it samples two component functions at each iteration, while NESTT-G only samples once. Further, the analysis and the scaling are derived for the case of uniform $L_i$'s, therefore it is not clear how the algorithm and the rates can be adapted to the non-uniform case. 
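The worst-case counts compared in Table 1 are easy to check numerically. The sketch below (ours, purely illustrative) evaluates the leading factors $(\sum_i \sqrt{L_i/N})^2$ for NESTT-G and $\sum_i L_i$ for GD, dropping the common $1/\epsilon$ factor, for the three cases of $L_i$ in Table 1:

```python
import numpy as np

def complexity(L):
    """Leading constants of the gradient-evaluation counts from Table 1,
    ignoring the common 1/eps factor and O() constants."""
    L = np.asarray(L, dtype=float)
    N = len(L)
    nestt = np.sqrt(L / N).sum() ** 2   # NESTT-G: (sum_i sqrt(L_i/N))^2
    gd = L.sum()                        # GD: sum_i L_i
    return nestt, gd

N = 10_000
cases = {
    "I: all L_i = 1": np.ones(N),
    "II: sqrt(N) terms with L_i = N": np.r_[np.full(int(np.sqrt(N)), float(N)),
                                            np.ones(N - int(np.sqrt(N)))],
    "III: one term with L_i = N^2": np.r_[np.full(1, float(N) ** 2),
                                          np.ones(N - 1)],
}
for name, L in cases.items():
    nestt, gd = complexity(L)
    print(f"Case {name}: NESTT-G ~ {nestt:.3g}, GD ~ {gd:.3g}, "
          f"ratio GD/NESTT-G ~ {gd / nestt:.3g}")
```

In Case I the two counts coincide at $N$; in Case III the ratio grows linearly in $N$, illustrating the "up to $O(N)$ times better" claim for non-uniform $L_i$'s.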
On the other hand, our NESTT works for the general nonsmooth constrained setting. The non-uniform sampling used in NESTT-G is well-suited for problems with non-uniform $L_i$'s, and our scaling can be up to $N$ times better than GD (or its proximal version) in the worst case. Note that problems with non-uniform $L_i$'s for the component functions are common in applications such as sparse optimization and signal processing. For example, in the LASSO problem the data matrix is often normalized by feature (or "column-normalized" [19]), therefore the $\ell_2$ norm of each row of the data matrix (which corresponds to the Lipschitz constant for each component function) can be dramatically different.

In Table 1 we list the comparison of the number of gradient evaluations for NESTT-G and GD, in the worst case (in the sense that $\sum_{i=1}^N L_i/N = L$). For simplicity, we omitted an additive constant of $O(N)$ for computing the initial gradients.

5 Numerical Results

In this section we evaluate the performance of NESTT. Consider the high-dimensional regression problem with noisy observations [16], where $M$ observations are generated by $y = X\nu + \epsilon$. Here $y \in R^M$ is the observed data sample; $X \in R^{M\times P}$ is the covariate matrix; $\nu \in R^P$ is the ground truth, and $\epsilon \in R^M$ is the noise. Suppose that the covariate matrix is not perfectly known, i.e., we observe $A = X + W$ where $W \in R^{M\times P}$ is a noise matrix with known covariance matrix $\Sigma_W$. Let us define $\hat\Gamma := \frac{1}{M}(A^\top A) - \Sigma_W$ and $\hat\gamma := \frac{1}{M}(A^\top y)$. To estimate the ground truth $\nu$, let 
Left: Uniform sampling (pi = 1/N); Right: Non-uniform sampling (pi = √(Li/N)/∑_{i=1}^N √(Li/N)).

Table 2: Optimality gap ‖∇̃_{1/β} f(z^r)‖² for different algorithms, with 100 passes over the datasets.

        SAGA                  SGD                   NESTT-E (α = 10)      NESTT-G
N       Uniform   Non-Uni     Uniform   Non-Uni     Uniform   Non-Uni     Uniform   Non-Uni
10      6.16E-19  3.4054      2.8022    0.2265      2.6E-16   2.3E-21     2.7E-17   6.1E-24
20      5.9E-9    0.6370      11.3435   6.9087      2.4E-9    1.2E-10     7.7E-7    2.9E-11
30      2.7E-6    0.2260      0.1253    0.1639      3.2E-6    4.5E-7      2.5E-5    1.4E-7
40      8.1E-5    0.0574      0.7385    0.3193      5.8E-4    1.8E-5      4.1E-5    3.1E-5
50      7.1E-4    0.0154      3.3187    0.0409      8.3E-4    2.7E-4      2.5E-4    2.7E-4

us consider the following (nonconvex) optimization problem posed in [16, problem (2.4)] (where R > 0 controls sparsity):

    min_z  z⊤Γ̂z − γ̂⊤z   s.t. ‖z‖1 ≤ R.        (5.22)

Due to the existence of noise, Γ̂ is not positive semidefinite, hence the problem is not convex. Note that this problem satisfies Assumptions A–B; therefore, by Theorem 2.2, NESTT-G converges Q-linearly. To test the performance of the proposed algorithm, we generate the problem following a setup similar to [16]. Let X = (X1; · · · ; XN) ∈ R^{M×P} with ∑_i Ni = M, where each Xi ∈ R^{Ni×P} corresponds to Ni data points and is generated with i.i.d. Gaussian entries. Here Ni represents the size of each mini-batch of samples. Generate the observations yi = Xiν* + εi ∈ R^{Ni}, where ν* is a K-sparse vector to be estimated and εi ∈ R^{Ni} is random noise. Let W = [W1; · · · ; WN], with Wi ∈ R^{Ni×P} generated with i.i.d. Gaussian entries.
Therefore we have z⊤Γ̂z = (1/M)∑_{i=1}^N z⊤(Xi⊤Xi − Wi⊤Wi)z.

We set M = 100,000, P = 5000, N = 50, K = 22 ≈ √P, and R = ‖ν*‖1. We implement NESTT-G/E, SGD, and the nonconvex SAGA proposed in [21] with stepsize β = 1/(3LmaxN^{2/3}) (where Lmax := maxi Li). Note that the SAGA of [21] only works for unconstrained problems with uniform Li; therefore, when applied to (5.22) it is not guaranteed to converge, and we include it for comparison purposes only.

In Fig. 1 we compare the different algorithms in terms of the gap ‖∇̃_{1/β} f(z^r)‖². In the left figure we consider a problem with Ni = Nj for all i, j, and show the performance of the proposed algorithms with uniform sampling (i.e., the probability of picking the ith block is pi = 1/N). In the right figure we consider problems in which approximately half of the component functions have twice the Li's of the rest, and use non-uniform sampling (pi = √(Li/N)/∑_{i=1}^N √(Li/N)). Clearly, in both cases the proposed algorithms perform quite well. Furthermore, NESTT-E performs well with large α := {αi}_{i=1}^N, which confirms our theoretical rate analysis. It is also worth mentioning that when the Ni's are non-uniform, the proposed algorithms [NESTT-G and NESTT-E (with α = 10)] significantly outperform SAGA and SGD. In Table 2 we further compare the algorithms as the number of component functions (i.e., the number of mini-batches N) varies, with the rest of the setup as above. We run each algorithm for 100 passes over the dataset.
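The corrupted-covariate setup above can be sketched as follows (a scaled-down illustration with hypothetical toy dimensions, not the paper's actual code). We deliberately pick P > M so that A⊤A/M is rank-deficient, which makes the indefiniteness of Γ̂, and hence the nonconvexity of (5.22), provable rather than merely empirical:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy sizes (hypothetical; the paper uses M = 100,000 and P = 5000).
# With P > M, the matrix A.T @ A / M is rank-deficient, so the bias
# correction below provably yields an indefinite Gamma_hat.
M, P, K, sigma_w = 100, 200, 14, 1.0

nu_star = np.zeros(P)                        # K-sparse ground truth
nu_star[rng.choice(P, size=K, replace=False)] = rng.normal(size=K)

X = rng.normal(size=(M, P))                  # true covariate matrix
y = X @ nu_star + 0.1 * rng.normal(size=M)   # noisy observations
W = sigma_w * rng.normal(size=(M, P))        # covariate noise
A = X + W                                    # observed (corrupted) covariates
Sigma_W = sigma_w ** 2 * np.eye(P)           # known noise covariance

# Bias-corrected statistics entering problem (5.22)
Gamma_hat = (A.T @ A) / M - Sigma_W
gamma_hat = (A.T @ y) / M

# Any unit vector z in the null space of A certifies nonconvexity:
# z.T @ Gamma_hat @ z = -z.T @ Sigma_W @ z = -sigma_w**2 < 0.
min_eig = np.linalg.eigvalsh(Gamma_hat).min()
```

Here `min_eig` comes out strictly negative, confirming that the quadratic form in (5.22) is indefinite and the problem is nonconvex.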
Similarly as before, our algorithms perform well, while SAGA appears to be sensitive to the uniformity of the mini-batch sizes [note that there is no convergence guarantee for SAGA applied to the nonconvex constrained problem (5.22)].

References

[1] Z. Allen-Zhu and E. Hazan. Variance reduction for faster non-convex optimization. 2016. Preprint, arXiv:1603.05643.

[2] A. Antoniadis, I. Gijbels, and M. Nikolova. Penalized likelihood regression for generalized linear models with non-quadratic penalties. Annals of the Institute of Statistical Mathematics, 63(3):585–615, 2009.

[3] D. Bertsekas. Incremental gradient, subgradient, and proximal methods for convex optimization: A survey. 2000. LIDS Report 2848.

[4] E. Bjornson and E. Jorswieck. Optimal resource allocation in coordinated multi-cell systems. Foundations and Trends in Communications and Information Theory, 9, 2013.

[5] D. Blatt, A. O. Hero, and H. Gauchman. A convergent incremental gradient method with a constant step size. SIAM Journal on Optimization, 18(1):29–51, 2007.

[6] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.

[7] V. Cevher, S. Becker, and M. Schmidt. Convex optimization for big data: Scalable, randomized, and parallel algorithms for big data analytics. IEEE Signal Processing Magazine, 31(5):32–43, 2014.

[8] T.-H. Chang, M. Hong, and X. Wang. Multi-agent distributed optimization via inexact consensus ADMM.
IEEE Transactions on Signal Processing, 63(2):482–497, 2015.

[9] A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Proceedings of NIPS, 2014.

[10] S. Ghadimi and G. Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.

[11] D. Hajinezhad, T.-H. Chang, X. Wang, Q. Shi, and M. Hong. Nonnegative matrix factorization using ADMM: Algorithm and convergence analysis. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4742–4746, 2016.

[12] D. Hajinezhad and M. Hong. Nonconvex alternating direction method of multipliers for distributed sparse principal component analysis. In Proceedings of GlobalSIP, 2015.

[13] M. Hong, Z.-Q. Luo, and M. Razaviyayn. Convergence analysis of alternating direction method of multipliers for a family of nonconvex problems. SIAM Journal on Optimization, 26(1):337–364, 2016.

[14] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Proceedings of NIPS, 2013.

[15] G. Lan. An optimal randomized incremental gradient method. 2015. Preprint.

[16] P.-L. Loh and M. Wainwright. High-dimensional regression with noisy and missing data: Provable guarantees with nonconvexity. The Annals of Statistics, 40(3):1637–1664, 2012.

[17] P. Di Lorenzo and G. Scutari. NEXT: In-network nonconvex optimization. 2016. Preprint.

[18] Z.-Q. Luo and P. Tseng. On the linear convergence of descent methods for convex essentially smooth minimization. SIAM Journal on Control and Optimization, 30(2):408–425, 1992.

[19] S. N. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu.
A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, 27(4):538–557, 2012.

[20] M. Razaviyayn, M. Hong, Z.-Q. Luo, and J.-S. Pang. Parallel successive convex approximation for nonsmooth nonconvex optimization. In Proceedings of NIPS, 2014.

[21] S. J. Reddi, S. Sra, B. Poczos, and A. Smola. Fast incremental method for nonconvex optimization. 2016. Preprint, arXiv:1603.06159.

[22] M. Schmidt, N. Le Roux, and F. Bach. Minimizing finite sums with the stochastic average gradient. 2013. Technical report, INRIA.

[23] S. Shalev-Shwartz and T. Zhang. Proximal stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14:567–599, 2013.

[24] S. Sra. Scalable nonconvex inexact proximal splitting. In Advances in Neural Information Processing Systems (NIPS), 2012.

[25] P. Tseng and S. Yun. A coordinate gradient descent method for nonsmooth separable minimization. Mathematical Programming, 117:387–423, 2009.

[26] Z. Wang, H. Liu, and T. Zhang. Optimal computational and statistical rates of convergence for sparse nonconvex learning problems. Annals of Statistics, 42(6):2164–2201, 2014.

[27] S. Zlobec. On the Liu-Floudas convexification of smooth programs. Journal of Global Optimization, 32:401–407, 2005.