{"title": "Proximal Stochastic Methods for Nonsmooth Nonconvex Finite-Sum Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 1145, "page_last": 1153, "abstract": "We analyze stochastic algorithms for optimizing nonconvex, nonsmooth finite-sum problems, where the nonsmooth part is convex. Surprisingly, unlike the smooth case, our knowledge of this fundamental problem is very limited. For example, it is not known whether the proximal stochastic gradient method with constant minibatch converges to a stationary point. To tackle this issue, we develop fast stochastic algorithms that provably converge to a stationary point for constant minibatches. Furthermore, using a variant of these algorithms, we obtain provably faster convergence than batch proximal gradient descent. Our results are based on the recent variance reduction techniques for convex optimization but with a novel analysis for handling nonconvex and nonsmooth functions. We also prove global linear convergence rate for an interesting subclass of nonsmooth nonconvex functions, which subsumes several recent works.", "full_text": "Proximal Stochastic Methods for Nonsmooth\n\nNonconvex Finite-Sum Optimization\n\nSashank J. Reddi\n\nCarnegie Mellon University\nsjakkamr@cs.cmu.edu\n\nBarnab\u00e1s P\u00f3czos\n\nCarnegie Mellon University\nbapoczos@cs.cmu.edu\n\nSuvrit Sra\n\nMassachusetts Institute of Technology\n\nsuvrit@mit.edu\n\nAlexander J. Smola\n\nCarnegie Mellon University\n\nalex@smola.org\n\nAbstract\n\nWe analyze stochastic algorithms for optimizing nonconvex, nonsmooth \ufb01nite-sum\nproblems, where the nonsmooth part is convex. Surprisingly, unlike the smooth\ncase, our knowledge of this fundamental problem is very limited. For example,\nit is not known whether the proximal stochastic gradient method with constant\nminibatch converges to a stationary point. 
To tackle this issue, we develop fast stochastic algorithms that provably converge to a stationary point for constant minibatches. Furthermore, using a variant of these algorithms, we obtain provably faster convergence than batch proximal gradient descent. Our results are based on the recent variance reduction techniques for convex optimization but with a novel analysis for handling nonconvex and nonsmooth functions. We also prove global linear convergence rate for an interesting subclass of nonsmooth nonconvex functions, which subsumes several recent works.

1 Introduction

We study nonconvex, nonsmooth, finite-sum optimization problems of the form

min_{x ∈ R^d} F(x) := f(x) + h(x),  where  f(x) := (1/n) Σ_{i=1}^n f_i(x),   (1)

and each f_i : R^d → R is smooth (possibly nonconvex) for all i ∈ {1, . . . , n} =: [n], while h : R^d → R is nonsmooth but convex and relatively simple.
Such finite-sum optimization problems are fundamental to machine learning when performing regularized empirical risk minimization. While there has been extensive research in solving nonsmooth convex finite-sum problems (i.e., each f_i is convex for i ∈ [n]) [4, 16, 31], our understanding of their nonsmooth nonconvex counterparts is surprisingly limited.
We hope to amend this situation (at least partially), given the widespread importance of nonconvexity throughout machine learning.
A popular approach to handling nonsmoothness in convex problems is via proximal operators [14, 25], but as we will soon see, this approach does not work so easily for the nonconvex problem (1). Nevertheless, recall that for a proper closed convex function h, the proximal operator is defined as

prox_{ηh}(x) := argmin_{y ∈ R^d} { h(y) + (1/2η)‖y − x‖² },  for η > 0.   (2)

The power of proximal operators lies in how they generalize projections: e.g., if h is the indicator function I_C(x) of a closed convex set C, then prox_{I_C}(x) ≡ proj_C(x) ≡ argmin_{y ∈ C} ‖y − x‖.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Throughout this paper, we assume that the proximal operator of h is easy to compute. This is true for many applications in machine learning and statistics, including ℓ1 regularization, box constraints, and simplex constraints, among others [2, 18].
Similar to other algorithms, we also assume access to a proximal oracle (PO) that takes a point x ∈ R^d and returns the output of (2). In addition to the number of PO calls, to describe our complexity results we use the incremental first-order oracle (IFO) model.¹ For a function f = (1/n) Σ_i f_i, an IFO takes an index i ∈ [n] and a point x ∈ R^d, and returns the pair (f_i(x), ∇f_i(x)).
A standard (batch) method for solving (1) is the proximal gradient method (PROXGD) [13], first studied for (batch) nonconvex problems in [5]. This method performs the iteration

x_{t+1} = prox_{ηh}(x_t − η∇f(x_t)),   t = 0, 1, . . . ,   (3)

where η > 0 is a step size. The following convergence rate for PROXGD was proved recently.
Theorem (Informal). [7]: The number of IFO and PO calls made by the proximal gradient method (3) to reach an ϵ-accurate stationary point is O(n/ϵ) and O(1/ϵ), respectively.

We refer the reader to [7] for details. The key point to note here is that the IFO complexity of (3) is O(n/ϵ). This is due to the fact that the full gradient ∇f must be computed at each iteration of (3), which requires n IFO calls. When n is large, this high cost per iteration is prohibitive. A more practical approach is offered by proximal stochastic gradient (PROXSGD), which performs the iteration

x_{t+1} = prox_{η_t h}( x_t − (η_t/|I_t|) Σ_{i ∈ I_t} ∇f_i(x_t) ),   t = 0, 1, . . . ,   (4)

where I_t (referred to as the minibatch) is a randomly chosen set (with replacement) from [n] and η_t is a step size. Non-asymptotic convergence of PROXSGD was also shown recently, as noted below.
Theorem (Informal). [7]: The number of IFO and PO calls made by PROXSGD, i.e., iteration (4), to reach an ϵ-accurate stationary point is O(1/ϵ²) and O(1/ϵ), respectively. For achieving this convergence, we impose batch sizes |I_t| that increase and step sizes η_t that decrease with 1/ϵ.

Notice that the PO complexity of PROXSGD is similar to that of PROXGD, but its IFO complexity is independent of n; this benefit, however, comes at the cost of an extra 1/ϵ factor. Furthermore, the step size must decrease with 1/ϵ (or, alternatively, decay with the number of iterations of the algorithm). The same two aspects are also seen for convex stochastic gradient, in both the smooth and proximal versions. However, in the nonconvex setting there is a key third, and more important, aspect: the minibatch size |I_t| increases with 1/ϵ.
To understand this aspect, consider the case where |I_t| is a constant (independent of both n and ϵ), which is typically the choice used in practice. In this case, the above convergence result no longer holds, and it is not clear if PROXSGD even converges to a stationary point at all!
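For concreteness, iteration (4) can be sketched in a few lines of Python. The quadratic components f_i, the ℓ1 regularizer h, and all parameter values below are illustrative stand-ins, not the paper's setting:

```python
import numpy as np

def prox_l1(x, t):
    """Proximal operator of t * ||.||_1 (soft-thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def prox_sgd(grad_fi, n, x0, prox, eta=0.02, batch=4, iters=500, seed=0):
    """One rendering of iteration (4): a minibatch stochastic gradient
    step followed by a proximal step (fixed step size for simplicity)."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for _ in range(iters):
        idx = rng.integers(0, n, size=batch)               # sample I_t with replacement
        g = np.mean([grad_fi(i, x) for i in idx], axis=0)  # minibatch gradient
        x = prox(x - eta * g, eta)                         # x_{t+1} = prox_{eta h}(x_t - eta g)
    return x

# Toy components f_i(x) = 0.5 * (a_i^T x - y_i)^2 with h = ||.||_1 (illustrative only)
rng = np.random.default_rng(1)
A, y = rng.standard_normal((50, 10)), rng.standard_normal(50)
grad_fi = lambda i, x: (A[i] @ x - y[i]) * A[i]
x = prox_sgd(grad_fi, n=50, x0=np.zeros(10), prox=prox_l1)
```

Note that this sketch uses a fixed step size and constant minibatch, precisely the practical regime for which, as discussed above, no convergence guarantee is known.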
To clarify, a decreasing step size η_t trivially ensures convergence as t → ∞, but the limiting point is not necessarily stationary. On the other hand, increasing |I_t| with 1/ϵ can easily lead to |I_t| ≥ n for reasonably small ϵ, which effectively reduces the algorithm to (batch) PROXGD.
This dismal news does not apply to the convex setting, where PROXSGD is known to converge (in expectation) to an optimal solution using constant minibatch sizes |I_t|. Furthermore, this problem does not afflict smooth nonconvex problems (h ≡ 0), where convergence with constant minibatches is known [6, 21, 22]. Thus, there is a fundamental gap in our understanding of stochastic methods for nonsmooth nonconvex problems. Given the ubiquity of nonconvex models in machine learning, bridging this gap is important. We do so by analyzing stochastic proximal methods with guaranteed convergence for constant minibatches, and faster convergence with minibatches independent of 1/ϵ.

Main Contributions

We state our main contributions below and list the key complexity results in Table 1.
• We analyze nonconvex proximal versions of the recently proposed stochastic algorithms SVRG and SAGA [4, 8, 31], hereafter referred to as PROXSVRG and PROXSAGA, respectively. We show convergence of these algorithms with constant minibatches. To the best of our knowledge, this is the first work to present non-asymptotic convergence rates for stochastic methods that apply to nonsmooth nonconvex problems with constant (hence more realistic) minibatches.

¹ Introduced in [1] to study lower bounds of deterministic algorithms for convex finite-sum problems.

• We show that by carefully choosing the minibatch size (to be sublinearly dependent on n but still independent of 1/ϵ), we can achieve provably faster convergence than both proximal gradient and proximal stochastic gradient.
We are not aware of any earlier results on stochastic methods for the general nonsmooth nonconvex problem that have faster convergence than proximal gradient.
• We study a nonconvex subclass of (1) based on the proximal extension of the Polyak-Łojasiewicz inequality [9]. We show linear convergence of PROXSVRG and PROXSAGA to the optimal solution for this subclass. This includes the recent results proved in [27, 32] as special cases. Ours is the first stochastic method with provable global linear convergence for this subclass of problems.

1.1 Related Work

The literature on finite-sum problems is vast, so we summarize only a few closely related works. Convex instances of (1) have long been studied [3, 15] and are fairly well understood. A remarkable recent advance for smooth convex instances of (1) is the creation of variance reduced (VR) stochastic methods [4, 8, 26, 28]. Nonsmooth proximal VR stochastic algorithms are studied in [4, 31], where faster convergence rates for both strongly convex and non-strongly convex cases are proved. Asynchronous VR frameworks are developed in [20]; lower bounds are studied in [1, 10].
In contrast, nonconvex instances of (1) are much less understood. Stochastic gradient for smooth nonconvex problems is analyzed in [6], and only very recently, convergence results for VR stochastic methods for smooth nonconvex problems were obtained in [21, 22]. In [11], the authors consider a VR nonconvex setting different from ours, namely, where the loss is (essentially strongly) convex but hard thresholding is used. We build upon [21, 22] and focus on handling nonsmooth convex regularizers (h ≢ 0 in (1)).² Incremental proximal gradient methods for this class were also considered in [30], but only asymptotic convergence was shown. The first analysis of a projection version of nonconvex SVRG is due to [29], who considers the special problem of PCA.
Perhaps closest to our work is [7], where convergence of the minibatch nonconvex PROXSGD method is studied. However, as is typical of the stochastic gradient method, the convergence is slow; moreover, no convergence for constant minibatches is provided.

2 Preliminaries

We assume that the function h(x) in (1) is lower semi-continuous (lsc) and convex. Furthermore, we also assume that its domain dom(h) = {x ∈ R^d | h(x) < +∞} is closed. We say f is L-smooth if there is a constant L such that

‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖,  for all x, y ∈ R^d.

Throughout, we assume that the functions f_i in (1) are L-smooth, so that ‖∇f_i(x) − ∇f_i(y)‖ ≤ L‖x − y‖ for all i ∈ [n]. Such an assumption is typical in the analysis of first-order methods.
One crucial aspect of the analysis for nonsmooth nonconvex problems is the convergence criterion. For convex problems, typically the optimality gap F(x) − F(x*) is used as a criterion. It is unreasonable to use such a criterion for general nonconvex problems due to their intractability. For smooth nonconvex problems (i.e., h ≡ 0), it is typical to measure stationarity, e.g., using ‖∇F‖. This cannot be used for nonsmooth problems, but a fitting alternative is the gradient mapping³ [17]:

G_η(x) := (1/η)[x − prox_{ηh}(x − η∇f(x))].   (5)

When h ≡ 0, this mapping reduces to G_η(x) = ∇f(x) = ∇F(x), the gradient of the function F at x. We analyze our algorithms using the gradient mapping (5), as described more precisely below.
Definition 1. 
A point x output by a stochastic iterative algorithm for solving (1) is called an ϵ-accurate solution if E[‖G_η(x)‖²] ≤ ϵ for some η > 0.
Our goal is to obtain efficient algorithms for achieving an ϵ-accurate solution, where efficiency is measured using IFO and PO complexity as functions of 1/ϵ and n.

² More recently, the authors have also developed VR Frank-Wolfe methods for handling constrained problems that do not admit easy projection operators [24].
³ This mapping has also been used in the analysis of nonconvex proximal methods in [6, 7, 30].

Algorithm  | IFO               | PO     | IFO (PL)                   | PO (PL)        | Constant minibatch?
PROXSGD    | O(1/ϵ²)           | O(1/ϵ) | O(1/ϵ²)                    | O(1/ϵ)         | ?
PROXGD     | O(n/ϵ)            | O(1/ϵ) | O(nκ log(1/ϵ))             | O(κ log(1/ϵ))  | ✓
PROXSVRG   | O(n + n^{2/3}/ϵ)  | O(1/ϵ) | O((n + κn^{2/3}) log(1/ϵ)) | O(κ log(1/ϵ))  | ✓
PROXSAGA   | O(n + n^{2/3}/ϵ)  | O(1/ϵ) | O((n + κn^{2/3}) log(1/ϵ)) | O(κ log(1/ϵ))  | ✓

Table 1: Comparison of the best IFO and PO complexity of the different algorithms discussed in the paper. The complexity is measured in terms of the number of oracle calls required to achieve an ϵ-accurate solution. The IFO (PL) and PO (PL) columns give the IFO and PO complexity for PL functions (see Section 4 for a formal definition). The PROXSVRG and PROXSAGA results are the contributions of this paper. The "constant minibatch" column indicates whether the stochastic algorithm converges using a constant minibatch size.
To the best of our knowledge, it is not known whether PROXSGD converges when using constant minibatches for nonconvex nonsmooth optimization. Also, we are not aware of any specific convergence results for PROXSGD in the context of PL functions.

3 Algorithms

We focus on two algorithms: (a) proximal SVRG (PROXSVRG) and (b) proximal SAGA (PROXSAGA).

3.1 Nonconvex Proximal SVRG

We first consider a variant of PROXSVRG [31]; pseudocode of this variant is stated in Algorithm 1. When F is strongly convex, SVRG attains a linear convergence rate, as opposed to the sublinear convergence of SGD [8]. Note that, while SVRG is typically stated with b = 1, we use its minibatch variant with batch size b. The specific reasons for using such a variant will become clear during the analysis.
While some other algorithms have been proposed for reducing the variance in the stochastic gradients, SVRG is particularly attractive because of its low memory requirement; it requires just O(d) extra memory in comparison to SGD for storing the average gradient (g^s in Algorithm 1), while algorithms like SAG and SAGA incur O(nd) storage cost. In addition to its strong theoretical results, SVRG is known to outperform SGD empirically while being more robust to the selection of step size. For convex problems, PROXSVRG is known to inherit these advantages of SVRG [31].
We now present our analysis of nonconvex PROXSVRG, starting with a result for batch size b = 1.
Theorem 1. Let b = 1 in Algorithm 1. Let η = 1/(3Ln), m = n, and let T be a multiple of m. Then the output x_a of Algorithm 1 satisfies the following bound:

E[‖G_η(x_a)‖²] ≤ (18Ln² / (3n − 2)) · (F(x⁰) − F(x*)) / T,

where x* is an optimal solution of (1).

Theorem 1 shows that PROXSVRG converges for constant minibatches of size b = 1.
This result is in strong contrast to PROXSGD, whose convergence with constant minibatches is still unknown. However, the result delivered by Theorem 1 is not stronger than that of PROXGD. The following corollary to Theorem 1 highlights this point.
Corollary 1. To obtain an ϵ-accurate solution, with b = 1 and parameters from Theorem 1, the IFO and PO complexities of Algorithm 1 are O(n/ϵ) and O(n/ϵ), respectively.

Corollary 1 follows upon noting that each inner iteration (Step 7) of Algorithm 1 has an effective IFO complexity of O(1) since m = n. This IFO complexity includes the IFO calls for calculating the average gradient at the end of each epoch. Furthermore, each inner iteration also invokes the proximal oracle, whereby the PO complexity is also O(n/ϵ). While the IFO complexity of constant-minibatch PROXSVRG is the same as that of PROXGD, we see that its PO complexity is much worse. This is due to the fact that n IFO calls correspond to one PO call in PROXGD, while one IFO call in PROXSVRG corresponds to one PO call. Consequently, we do not gain any theoretical advantage by using constant-minibatch PROXSVRG over PROXGD.

Algorithm 1: Nonconvex PROXSVRG(x⁰, T, m, b, η)
1: Input: x̃⁰ = x⁰_m = x⁰ ∈ R^d, epoch length m, step size η > 0, S = ⌈T/m⌉, minibatch size b
2: for s = 0 to S − 1 do
3:   x^{s+1}_0 = x^s_m
4:   g^{s+1} = (1/n) Σ_{i=1}^n ∇f_i(x̃^s)
5:   for t = 0 to m − 1 do
6:     Uniformly randomly pick I_t ⊂ {1, . . . , n} (with replacement) such that |I_t| = b
7:     v^{s+1}_t = (1/b) Σ_{i_t ∈ I_t} (∇f_{i_t}(x^{s+1}_t) − ∇f_{i_t}(x̃^s)) + g^{s+1}
8:     x^{s+1}_{t+1} = prox_{ηh}(x^{s+1}_t − η v^{s+1}_t)
9:   end for
10:  x̃^{s+1} = x^{s+1}_m
11: end for
12: Output: Iterate x_a chosen uniformly at random from {{x^{s+1}_t}_{t=0}^{m−1}}_{s=0}^{S−1}.

The key question is therefore: can we modify the algorithm to obtain better theoretical guarantees? To answer this question, we prove the following main convergence result.
For ease of theoretical exposition, we assume n^{2/3} to be an integer. This is only for convenience in stating our theoretical results, and all the results in the paper hold for the general case.
Theorem 2. Suppose b = n^{2/3} in Algorithm 1. Let η = 1/(3L), m = ⌊n^{1/3}⌋, and let T be a multiple of m. Then for the output x_a of Algorithm 1, we have:

E[‖G_η(x_a)‖²] ≤ 18L(F(x⁰) − F(x*)) / T,

where x* is an optimal solution to (1).

Rewriting Theorem 2 in terms of the IFO and PO complexity, we obtain the following corollary.
Corollary 2. Let b = n^{2/3} and set parameters as in Theorem 2. Then, to obtain an ϵ-accurate solution, the IFO and PO complexities of Algorithm 1 are O(n + n^{2/3}/ϵ) and O(1/ϵ), respectively.

The above corollary is due to the following observations. From Theorem 2, it can be seen that the total number of inner iterations (across all epochs) of Algorithm 1 to obtain an ϵ-accurate solution is O(1/ϵ). Since each inner iteration of Algorithm 1 involves a call to the PO, we obtain a PO complexity of O(1/ϵ). Further, since b = n^{2/3} IFO calls are made at each inner iteration, we obtain a net IFO complexity of O(n^{2/3}/ϵ). Adding the IFO calls for the calculation of the average gradient (and noting that T is a multiple of m), as well as noting that S ≥ 1, we obtain a total cost of O(n + n^{2/3}/ϵ). A noteworthy aspect of Corollary 2 is that its PO complexity matches that of PROXGD, while its IFO complexity is significantly decreased to O(n + n^{2/3}/ϵ), as opposed to O(n/ϵ) for PROXGD.

3.2 Nonconvex Proximal SAGA

In the previous section, we investigated PROXSVRG for solving (1). Note that PROXSVRG is not a fully "incremental" algorithm, since it requires calculation of the full gradient once per epoch. An alternative to PROXSVRG is the algorithm proposed in [4] (popularly referred to as SAGA).
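Before turning to PROXSAGA, it may help to see the Algorithm 1 update in code. The sketch below transcribes the SVRG-style variance-reduced estimator and prox step; the quadratic components, ℓ1 prox, and parameter values are illustrative only, not the theoretically prescribed choices:

```python
import numpy as np

def prox_l1(x, t):
    """Soft-thresholding: prox of t * ||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def prox_svrg(grad_fi, n, x0, prox, eta, m, S, b, seed=0):
    """Sketch of Algorithm 1: each epoch snapshots x~ and its full gradient,
    then takes m prox steps with the variance-reduced estimator v."""
    rng = np.random.default_rng(seed)
    x_tilde, x, iterates = x0.copy(), x0.copy(), []
    for _ in range(S):                                   # outer loop over epochs
        g_full = np.mean([grad_fi(i, x_tilde) for i in range(n)], axis=0)
        for _ in range(m):                               # inner loop
            idx = rng.integers(0, n, size=b)             # I_t, sampled with replacement
            v = np.mean([grad_fi(i, x) - grad_fi(i, x_tilde) for i in idx],
                        axis=0) + g_full                 # variance-reduced gradient
            x = prox(x - eta * v, eta)                   # prox step
            iterates.append(x)
        x_tilde = x.copy()                               # new snapshot
    return iterates[rng.integers(len(iterates))]         # uniformly random iterate

# Illustrative quadratic components f_i(x) = 0.5 * (a_i^T x - y_i)^2
rng = np.random.default_rng(2)
A, y = rng.standard_normal((30, 8)), rng.standard_normal(30)
grad_fi = lambda i, x: (A[i] @ x - y[i]) * A[i]
x_out = prox_svrg(grad_fi, n=30, x0=np.zeros(8), prox=prox_l1,
                  eta=0.02, m=10, S=5, b=3)
```

The sketch makes the cost accounting above concrete: each epoch spends n IFO calls on `g_full` plus b per inner step, while every inner step issues exactly one PO call.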
We build upon the work of [4] to develop PROXSAGA, a nonconvex proximal variant of SAGA.
The pseudocode for PROXSAGA is presented in Algorithm 2. The key difference between Algorithms 1 and 2 is that PROXSAGA, unlike PROXSVRG, avoids computation of the full gradient. Instead, it maintains an average gradient vector g^t, which changes at each iteration (refer to [20]). However, such a strategy entails additional storage costs. In particular, for implementing Algorithm 2, we must store the gradients {∇f_i(α^t_i)}_{i=1}^n, which in general can cost O(nd) in storage. Nevertheless, in some scenarios common to machine learning (see [4]), one can reduce the storage requirements to O(n). Whenever such an implementation of PROXSAGA is possible, it can perform similar to or even better than PROXSVRG [4]; hence, in addition to theoretical interest, it is of significant practical value.
We remark that PROXSAGA in Algorithm 2 differs slightly from [4]. In particular, it uses minibatches, where two sets I_t, J_t are sampled at each iteration, as opposed to one in [4]. This is mainly for ease of theoretical analysis.

Algorithm 2: Nonconvex PROXSAGA(x⁰, T, b, η)
1: Input: x⁰ ∈ R^d, α⁰_i = x⁰ for i ∈ [n], step size η > 0, minibatch size b
2: g⁰ = (1/n) Σ_{i=1}^n ∇f_i(α⁰_i)
3: for t = 0 to T − 1 do
4:   Uniformly randomly pick sets I_t, J_t from [n] (with replacement) such that |I_t| = |J_t| = b
5:   v^t = (1/b) Σ_{i_t ∈ I_t} (∇f_{i_t}(x^t) − ∇f_{i_t}(α^t_{i_t})) + g^t
6:   x^{t+1} = prox_{ηh}(x^t − η v^t)
7:   α^{t+1}_j = x^t for j ∈ J_t, and α^{t+1}_j = α^t_j for j ∉ J_t
8:   g^{t+1} = g^t − (1/n) Σ_{j_t ∈ J_t} (∇f_{j_t}(α^t_{j_t}) − ∇f_{j_t}(α^{t+1}_{j_t}))
9: end for
10: Output: Iterate x_a chosen uniformly at random from {x^t}_{t=0}^{T−1}.

We prove that, as in the convex case, nonconvex PROXSVRG and PROXSAGA share similar theoretical guarantees. In particular, our first result for PROXSAGA is a counterpart to Theorem 1 for PROXSVRG.
Theorem 3. 
Suppose b = 1 in Algorithm 2. Let η = 1/(5Ln). Then for the output x_a of Algorithm 2 after T iterations, the following stationarity bound holds:

E[‖G_η(x_a)‖²] ≤ (50Ln² / (5n − 2)) · (F(x⁰) − F(x*)) / T,

where x* is an optimal solution of (1).

Theorem 3 immediately leads to the following corollary.
Corollary 3. The IFO and PO complexities of Algorithm 2 with b = 1 and the parameters specified in Theorem 3, to obtain an ϵ-accurate solution, are O(n/ϵ) and O(n/ϵ), respectively.

Similar to Theorem 2 for PROXSVRG, we obtain the following main result for PROXSAGA.
Theorem 4. Suppose b = n^{2/3} in Algorithm 2. Let η = 1/(5L). Then for the output x_a of Algorithm 2 after T iterations, the following holds:

E[‖G_η(x_a)‖²] ≤ 50L(F(x⁰) − F(x*)) / (3T),

where x* is an optimal solution of Problem (1).

Rewriting this result in terms of IFO and PO access, we obtain the following important corollary.
Corollary 4. Let b = n^{2/3} and set parameters as in Theorem 4. Then, to obtain an ϵ-accurate solution, the IFO and PO complexities of Algorithm 2 are O(n + n^{2/3}/ϵ) and O(1/ϵ), respectively.

The above result is due to Theorem 4 and the fact that each iteration of PROXSAGA requires O(n^{2/3}) IFO calls. The number of PO calls is only O(1/ϵ), since we make one PO call for every n^{2/3} IFO calls.
Discussion: It is important to note the role of minibatches in Corollaries 2 and 4. Minibatches are typically used for reducing variance and promoting parallelism in stochastic methods. But unlike previous works, we use minibatches as a theoretical tool to improve the convergence rates of both nonconvex PROXSVRG and PROXSAGA. In particular, by carefully selecting the minibatch size, we can improve the IFO complexity of the algorithms described in the paper from O(n/ϵ) (similar to PROXGD) to O(n^{2/3}/ϵ) (matching the smooth nonconvex case).
Furthermore, the PO complexity is also improved in a similar manner by using the minibatch size mentioned in Theorems 2 and 4.⁴

4 Extensions

We discuss some extensions of our approach in this section. Our first extension provides a convergence analysis for a subclass of nonconvex functions that satisfy a specific growth condition, popularly known as the Polyak-Łojasiewicz (PL) inequality.

⁴ We refer the readers to the full version [23] for a more general convergence analysis of the algorithms.

PL-SVRG(x⁰, K, T, m, b, η):
for k = 1 to K do
    x^k = ProxSVRG(x^{k−1}, T, m, b, η)
end
Output: x^K

PL-SAGA(x⁰, K, T, b, η):
for k = 1 to K do
    x^k = ProxSAGA(x^{k−1}, T, b, η)
end
Output: x^K

Figure 1: PROXSVRG and PROXSAGA variants for PL functions.

In the context of gradient descent, this inequality was proposed by Polyak in 1963 [19], who showed global linear convergence of gradient descent for functions that satisfy the PL inequality. Recently, in [9], the PL inequality was generalized to nonsmooth functions and used for proving linear convergence of proximal gradient. The generalization presented in [9] considers functions F(x) = f(x) + h(x) that satisfy the following:

μ(F(x) − F(x*)) ≤ (1/2) D_h(x, μ),  where μ > 0,   (6)

and D_h(x, μ) := −2μ min_y [ ⟨∇f(x), y − x⟩ + (μ/2)‖y − x‖² + h(y) − h(x) ].

An F that satisfies (6) is called a μ-PL function.
When h ≡ 0, condition (6) reduces to the usual PL inequality. The class of μ-PL functions includes several other classes as special cases. It subsumes strongly convex functions, covers f_i(x) = g(a_i^⊤ x) with only g being strongly convex, and includes functions that satisfy an optimal strong convexity property [12]. Note that the μ-PL functions also subsume the recently studied special case where the f_i's are nonconvex but their sum f is strongly convex.
Hence, it encapsulates the problems of [27, 32].
The algorithms in Figure 1 provide variants of PROXSVRG and PROXSAGA adapted to optimize μ-PL functions. We show the following global linear convergence result for PL-SVRG and PL-SAGA (in Figure 1) on PL functions. For simplicity, we assume κ := L/μ > n^{1/3}. When f is strongly convex, κ is referred to as the condition number, in which case κ > n^{1/3} corresponds to the high-condition-number regime.
Theorem 5. Suppose F is a μ-PL function. Let b = n^{2/3}, η = 1/(5L), m = ⌊n^{1/3}⌋, and T = ⌈30κ⌉. Then for the output x_K of PL-SVRG and PL-SAGA (in Figure 1), the following holds:

E[F(x_K) − F(x*)] ≤ (F(x⁰) − F(x*)) / 2^K,

where x* is an optimal solution of (1).
The following corollary on the IFO and PO complexity of PL-SVRG and PL-SAGA is immediate.
Corollary 5. When F is a μ-PL function, the IFO and PO complexities of PL-SVRG and PL-SAGA with the parameters specified in Theorem 5, to obtain an ϵ-accurate solution, are O((n + κn^{2/3}) log(1/ϵ)) and O(κ log(1/ϵ)), respectively.
Note that proximal gradient also has global linear convergence for PL functions, as recently shown in [9]. However, its IFO complexity is O(κn log(1/ϵ)), which is much worse than that of PL-SVRG and PL-SAGA (Corollary 5).
Other extensions: While we state our results for specific minibatch sizes, a more general convergence analysis is provided for any minibatch size b ≤ n^{2/3} (Theorems 6 and 7 in the Appendix). Moreover, our results can be easily generalized to the case where non-uniform sampling is used in Algorithms 1 and 2. This is useful when the functions f_i have different Lipschitz constants.

5 Experiments

We present our empirical results in this section. For our experiments, we study the problem of non-negative principal component analysis (NN-PCA).
More specifically, for a given set of samples {z_i}_{i=1}^n, we solve the following optimization problem:

min_{‖x‖ ≤ 1, x ≥ 0}  −(1/2) x^⊤ ( Σ_{i=1}^n z_i z_i^⊤ ) x.   (7)

Figure 2: Non-negative principal component analysis. Performance of PROXSGD, PROXSVRG and PROXSAGA on 'rcv1' (left), 'a9a' (left-center), 'mnist' (right-center) and 'aloi' (right) datasets. Here, the y-axis is the function suboptimality, i.e., f(x) − f(x̂), where x̂ represents the best solution obtained by running gradient descent for a long time and with multiple restarts.

The problem of NN-PCA is, in general, NP-hard. This variant of the standard PCA problem can be written in the form (1) with f_i(x) = −(x^⊤ z_i)² for all i ∈ [n] and h(x) = I_C(x), where C is the convex set {x ∈ R^d | ‖x‖ ≤ 1, x ≥ 0}. In our experiments, we compare PROXSGD with nonconvex PROXSVRG and PROXSAGA. The choice of step size is important for PROXSGD; we set it using the popular t-inverse schedule η_t = η₀(1 + η₀′⌊t/n⌋)^{−1}, where η₀, η₀′ > 0. For PROXSVRG and PROXSAGA, motivated by the theoretical analysis, we use a fixed step size. The parameters of the step size in each of these methods are chosen so that the method gives the best performance on the objective value. In our experiments, we include the value η₀′ = 0, which corresponds to PROXSGD with a fixed step size.
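For this choice of h = I_C, the proximal oracle is simply Euclidean projection onto C. Since C is the intersection of the nonnegative orthant (a closed convex cone) with the unit ball centered at the origin, the projection can be computed by clipping negative entries and then rescaling. A minimal sketch (the sample point is illustrative):

```python
import numpy as np

def proj_C(x):
    """Euclidean projection onto C = {x : x >= 0, ||x|| <= 1}.
    Clip-then-rescale is valid here because C is the intersection of a
    closed convex cone (the nonnegative orthant) with a ball centered at 0."""
    y = np.maximum(x, 0.0)                 # project onto the orthant
    nrm = np.linalg.norm(y)
    return y / nrm if nrm > 1.0 else y     # then project onto the unit ball

p = proj_C(np.array([0.9, -0.3, 1.2]))     # -> [0.6, 0.0, 0.8]
```

Each such PO call is O(d), so the projection cost is negligible next to the IFO cost in these experiments.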
For PROXSVRG, we use the epoch length m = n.
We use standard machine learning datasets from LIBSVM for all our experiments.⁵ The samples in each of these datasets are normalized, i.e., ‖z_i‖ = 1 for all i ∈ [n]. Each of the methods is initialized by running PROXSGD for n iterations. Such an initialization serves two purposes: (a) it provides a reasonably good initial point, typically beneficial for variance reduction techniques [4, 26]; and (b) it provides a heuristic for calculating the initial average gradient g⁰ [26]. In our experiments, we use b = 1 in order to demonstrate the performance of the algorithms with constant minibatches.
We report the objective function value for the datasets. In particular, we report the suboptimality in the objective function, i.e., f(x^{s+1}_t) − f(x̂) (for PROXSVRG) and f(x^t) − f(x̂) (for PROXSAGA). Here x̂ refers to the solution obtained by running proximal gradient descent for a large number of iterations and multiple random initializations. For all the algorithms, we plot this criterion against the number of effective passes through the dataset, i.e., the IFO complexity divided by n. For PROXSVRG, this includes the cost of calculating the full gradient at the end of each epoch.
Figure 2 shows the performance of the algorithms on the NN-PCA problem (see Section D of the Appendix for more experiments). It can be seen that the objective value for PROXSVRG and PROXSAGA is much lower than for PROXSGD, suggesting faster convergence for these algorithms. We observed a significant gain consistently across all the datasets. Moreover, the selection of the step size was much simpler for PROXSVRG and PROXSAGA than for PROXSGD. We did not observe any significant difference in the performance of PROXSVRG and PROXSAGA for this particular task.

6 Final Discussion

In this paper, we presented fast stochastic methods for nonsmooth nonconvex optimization.
In\nparticular, by employing variance reduction techniques, we show that one can design methods that\ncan provably perform better than PROXSGD and proximal gradient descent. Furthermore, in contrast\nto PROXSGD, the resulting approaches have provable convergence to a stationary point with constant\nminibatches; thus, bridging a fundamental gap in our knowledge of nonsmooth nonconvex problems.\nWe proved that with a careful selection of minibatch size, it is possible to theoretically show superior\nperformance to proximal gradient descent. Our empirical results provide evidence for a similar\nconclusion even with constant minibatches. Thus, we conclude with an important open problem of\ndeveloping stochastic methods with provably better performance than proximal gradient descent with\nconstant minibatch size.\nAcknowledgment: SS acknowledges support of NSF grant: IIS-1409802.\n\n5The\n\ncan\n\nbe\nlibsvmtools/datasets.\n\ndatasets\n\ndownloaded\n\nfrom https://www.csie.ntu.edu.tw/~cjlin/\n\n8\n\n\fReferences\n[1] A. Agarwal and L. Bottou. A lower bound for the optimization of \ufb01nite sums. arXiv:1410.0723, 2014.\n[2] F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Convex optimization with sparsity-inducing norms. In\n\nS. Sra, S. Nowozin, and S. J. Wright, editors, Optimization for Machine Learning. MIT Press, 2011.\n\n[3] L\u00e9on Bottou. Stochastic gradient learning in neural networks. Proceedings of Neuro-N\u0131mes, 91(8), 1991.\n[4] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with\n\nsupport for non-strongly convex composite objectives. In NIPS 27, pages 1646\u20131654. 2014.\n\n[5] Masao Fukushima and Hisashi Mine. A generalized proximal point algorithm for certain non-convex\n\nminimization problems. International Journal of Systems Science, 12(8):989\u20131000, 1981.\n\n[6] Saeed Ghadimi and Guanghui Lan. Stochastic \ufb01rst- and zeroth-order methods for nonconvex stochastic\n\nprogramming. 
SIAM Journal on Optimization, 23(4):2341–2368, 2013.
[7] Saeed Ghadimi, Guanghui Lan, and Hongchao Zhang. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming, 155(1-2):267–305, 2014.
[8] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS 26, pages 315–323, 2013.
[9] Hamed Karimi, Julie Nutini, and Mark W. Schmidt. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2016, pages 795–811, 2016.
[10] G. Lan and Y. Zhou. An optimal randomized incremental gradient method. arXiv:1507.02000, 2015.
[11] Xingguo Li, Tuo Zhao, Raman Arora, Han Liu, and Jarvis Haupt. Stochastic variance reduced optimization for nonconvex sparse learning. In ICML, 2016. arXiv:1605.02711.
[12] Ji Liu and Stephen J. Wright. Asynchronous stochastic coordinate descent: Parallelism and convergence properties. SIAM Journal on Optimization, 25(1):351–376, January 2015.
[13] Hisashi Mine and Masao Fukushima. A minimization method for the sum of a convex function and a continuously differentiable function. Journal of Optimization Theory and Applications, 33(1):9–23, 1981.
[14] J. J. Moreau. Fonctions convexes duales et points proximaux dans un espace hilbertien. C. R. Acad. Sci. Paris Sér. A Math., 255:2897–2899, 1962.
[15] Arkadi Nemirovski and D. Yudin. Problem Complexity and Method Efficiency in Optimization. John Wiley and Sons, 1983.
[16] Yu Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.
[17] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Springer, 2003.
[18] N. Parikh and S. Boyd. Proximal algorithms. Foundations and Trends in Optimization, 1(3):127–239, 2014.
[19] B. T. Polyak. Gradient methods for the minimisation of functionals. USSR Computational Mathematics and Mathematical Physics, 3(4):864–878, January 1963.
[20] Sashank Reddi, Ahmed Hefny, Suvrit Sra, Barnabás Póczos, and Alex J. Smola. On variance reduction in stochastic gradient descent and its asynchronous variants. In NIPS 28, pages 2629–2637, 2015.
[21] Sashank J. Reddi, Ahmed Hefny, Suvrit Sra, Barnabás Póczos, and Alexander J. Smola. Stochastic variance reduction for nonconvex optimization. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pages 314–323, 2016.
[22] Sashank J. Reddi, Suvrit Sra, Barnabás Póczos, and Alexander J. Smola. Fast incremental method for nonconvex optimization. CoRR, abs/1603.06159, 2016.
[23] Sashank J. Reddi, Suvrit Sra, Barnabás Póczos, and Alexander J. Smola. Fast stochastic methods for nonsmooth nonconvex optimization. CoRR, abs/1605.06900, 2016.
[24] Sashank J. Reddi, Suvrit Sra, Barnabás Póczos, and Alexander J. Smola. Stochastic Frank-Wolfe methods for nonconvex optimization. In 54th Annual Allerton Conference on Communication, Control, and Computing, Allerton 2016, 2016.
[25] R. Tyrrell Rockafellar. Monotone operators and the proximal point algorithm. SIAM Journal on Control and Optimization, 14(5):877–898, 1976.
[26] Mark W. Schmidt, Nicolas Le Roux, and Francis R. Bach. Minimizing finite sums with the stochastic average gradient. arXiv:1309.2388, 2013.
[27] Shai Shalev-Shwartz. SDCA without duality. CoRR, abs/1502.06177, 2015.
[28] Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss.
The Journal of Machine Learning Research, 14(1):567–599, 2013.
[29] Ohad Shamir. A stochastic PCA and SVD algorithm with an exponential convergence rate. arXiv:1409.2848, 2014.
[30] Suvrit Sra. Scalable nonconvex inexact proximal splitting. In NIPS, pages 530–538, 2012.
[31] Lin Xiao and Tong Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.
[32] Zeyuan Allen-Zhu and Yang Yuan. Improved SVRG for non-strongly-convex or sum-of-non-convex objectives. CoRR, abs/1506.01972, 2015.