{"title": "How To Make the Gradients Small Stochastically: Even Faster Convex and Nonconvex SGD", "book": "Advances in Neural Information Processing Systems", "page_first": 1157, "page_last": 1167, "abstract": "Stochastic gradient descent (SGD) gives an optimal convergence rate when minimizing convex stochastic objectives $f(x)$. However, in terms of making the gradients small, the original SGD does not give an optimal rate, even when $f(x)$ is convex.\n\nIf $f(x)$ is convex, to find a point with gradient norm $\\varepsilon$, we design an algorithm SGD3 with a near-optimal rate $\\tilde{O}(\\varepsilon^{-2})$, improving the best known rate $O(\\varepsilon^{-8/3})$. If $f(x)$ is nonconvex, to find its $\\varepsilon$-approximate local minimum, we design an algorithm SGD5 with rate $\\tilde{O}(\\varepsilon^{-3.5})$, where previously SGD variants only achieve $\\tilde{O}(\\varepsilon^{-4})$. This is no slower than the best known stochastic version of Newton's method in all parameter regimes.", "full_text": "How To Make the Gradients Small Stochastically:\n\nEven Faster Convex and Nonconvex SGD\u2217\n\nZeyuan Allen-Zhu\nMicrosoft Research AI\nRedmond, WA 98052\n\nzeyuan@csail.mit.edu\n\nAbstract\n\nStochastic gradient descent (SGD) gives an optimal convergence rate when mini-\nmizing convex stochastic objectives f (x). However, in terms of making the gra-\ndients small, the original SGD does not give an optimal rate, even when f (x) is\nconvex.\nIf f (x) is convex, to \ufb01nd a point with gradient norm \u03b5, we design an algorithm\n\nSGD3 with a near-optimal rate (cid:101)O(\u03b5\u22122), improving the best known rate O(\u03b5\u22128/3)\nsign an algorithm SGD5 with rate (cid:101)O(\u03b5\u22123.5), where previously SGD variants only\nachieve (cid:101)O(\u03b5\u22124) [6, 14, 30]. This is no slower than the best known stochastic\n\nof [17]. 
If f (x) is nonconvex, to \ufb01nd its \u03b5-approximate local minimum, we de-\n\nversion of Newton\u2019s method in all parameter regimes [27].\n\n1\n\nIntroduction\n\nIn convex optimization and machine learning, the classical goal is to design algorithms to decrease\nobjective values, that is, to \ufb01nd points x with f (x)\u2212 f (x\u2217) \u2264 \u03b5. In contrast, the rate of convergence\nfor the gradients, that is,\n\nthe number of iterations T needed to \ufb01nd a point x with (cid:107)\u2207f (x)(cid:107) \u2264 \u03b5,\n\nis a harder problem and sometimes needs new algorithmic ideas [25]. For instance, in the full-\ngradient setting, accelerated gradient descent alone is suboptimal for this new goal, and one needs\nadditional tricks to get the fastest rate [25]. We review these tricks in Section 1.1.\nIn the convex (online) stochastic optimization, to the best of our knowledge, tight bounds are not\nyet known for \ufb01nding points with small gradients. The best recorded rate was T \u221d \u03b5\u22128/3 [17], and\nit was raised as an open question [1] regarding how to improve it.\nIn this paper, we design two new algorithms, SGD2 which gives rate T \u221d \u03b5\u22125/2 using Nesterov\u2019s\ntricks, and SGD3 which gives an even better rate T \u221d \u03b5\u22122 log3 1\n\u03b5 which is optimal up to log factors.\nMotivation. Studying the rate of convergence for the minimizing gradients can be important at\nleast for the following two reasons.\n\u2022 In many situations, points with small gradients \ufb01t better our \ufb01nal goals.\n\n\u2217The full version of this paper can be found on https://arxiv.org/abs/1801.02982. When this\npaper was submitted to NeurIPS 2018, the \u201cnon-convex SGD\u201d results were not included. 
We encourage the readers to go to our full version to find out these "non-convex SGD" results.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Nesterov [25] considers the dual approach for solving constrained minimization problems. He argued that "the gradient value ‖∇f(x)‖ serves as the measure of feasibility and optimality of the primal solution," and thus it is the better goal for minimization purposes.²

In matrix scaling [7, 10], given a non-negative matrix, one wants to re-scale its rows and columns to make it doubly stochastic. This problem has been applied in image reconstruction, operations research, decision and control, and other scientific disciplines (see the survey [20]). The goal in matrix scaling is to find points with small gradients, not small objectives.

• Designing algorithms to find points with small gradients can help us understand non-convex optimization better and design faster non-convex machine learning algorithms.

Without strong assumptions, non-convex optimization theory is always stated in terms of finding points with small gradients (i.e., approximate stationary points or local minima). Therefore, to understand non-convex stochastic optimization better, perhaps we should first figure out the best rate for convex stochastic optimization. In addition, if new algorithmic ideas are needed, can we also apply them to the non-convex world? We find positive answers to this question, and also obtain better rates for standard non-convex optimization tasks.

1.1 Review: Prior Work on Deterministic Convex Optimization

Suppose f(x) is a Lipschitz smooth convex function with smoothness parameter L. Then it is well known that accelerated gradient descent (AGD) [23, 24] finds a point x satisfying f(x) − f(x*) ≤ δ using T = O(√L/√δ) gradient computations of ∇f(x). To turn this into a gradient guarantee, we can apply the smoothness property of f(x), which gives ‖∇f(x)‖² ≤ L(f(x) − f(x*)). This means

    AGD converges in rate T ∝ L/ε.

Nesterov [25] proposed two different tricks to improve upon this rate.

Nesterov's First Trick: GD After AGD. Recall that, starting from a point x₀, if we perform T steps of gradient descent (GD) x_{t+1} = x_t − (1/L)∇f(x_t), then it satisfies ∑_{t=0}^{T−1} ‖∇f(x_t)‖² ≤ L(f(x₀) − f(x*)). In addition, if this x₀ is already the output of AGD run for another T iterations, then it satisfies f(x₀) − f(x*) ≤ O(L/T²). Putting the two inequalities together, we have min_{t=0,...,T−1} {‖∇f(x_t)‖²} ≤ O(L²/T³). We call this method "GD after AGD," and

    "GD after AGD" converges in rate T ∝ L^{2/3}/ε^{2/3}.

Nesterov's Second Trick: AGD After Regularization. Alternatively, we can also regularize f(x) by defining g(x) = f(x) + (σ/2)‖x − x₀‖². This new function g(x) is σ-strongly convex, so AGD converges linearly, meaning that using T ∝ √(L/σ)·log(L/ε) gradients we can find a point x satisfying ‖∇g(x)‖² ≤ L(g(x) − g(x*)) ≤ ε². If we choose σ ∝ ε, then this implies ‖∇f(x)‖ ≤ ‖∇g(x)‖ + ε ≤ 2ε. 
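As a concrete illustration of Nesterov's first trick, here is a minimal sketch of "GD after AGD" on a synthetic quadratic; the test problem, dimensions, and iteration counts below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Synthetic quadratic test problem (an illustrative assumption, not from the
# paper): f(x) = 0.5 x^T A x - b^T x, with smoothness L = lambda_max(A).
rng = np.random.default_rng(0)
d = 20
M = rng.standard_normal((d, d))
A = M.T @ M / d + np.eye(d)          # positive definite, so f is convex
b = rng.standard_normal(d)
grad = lambda x: A @ x - b
L = np.linalg.eigvalsh(A)[-1]

def agd(x0, T):
    # Nesterov's accelerated gradient descent (FISTA-style momentum).
    x, y, lam = x0.copy(), x0.copy(), 1.0
    for _ in range(T):
        lam_next = (1 + np.sqrt(1 + 4 * lam * lam)) / 2
        y_next = x - grad(x) / L
        x = y_next + ((lam - 1) / lam_next) * (y_next - y)
        y, lam = y_next, lam_next
    return y

def gd_after_agd(x0, T):
    # Phase 1: T steps of AGD make f(x) - f(x*) <= O(L / T^2).
    x = agd(x0, T)
    # Phase 2: T steps of GD; the best iterate then satisfies
    # min_t ||grad f(x_t)||^2 <= O(L^2 / T^3).
    best, best_norm = x, np.linalg.norm(grad(x))
    for _ in range(T):
        x = x - grad(x) / L
        gn = np.linalg.norm(grad(x))
        if gn < best_norm:
            best, best_norm = x, gn
    return best

x0 = np.ones(d)
x_out = gd_after_agd(x0, T=200)
```

On a well-conditioned quadratic like this one, both phases converge quickly, so the returned point has a very small gradient norm.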
We call this method "AGD after regularization," and

    "AGD after regularization" converges in rate T ∝ (L^{1/2}/ε^{1/2})·log(L/ε).

This is optimal up to a log factor, because first-order methods need T = Ω(√(L/δ)) gradient computations to find f(x) − f(x*) ≤ δ [23], but f(x) − f(x*) ≤ ‖∇f(x)‖·‖x − x*‖ ≤ O(‖∇f(x)‖).

1.2 Our Results: Stochastic Convex Optimization

Consider the stochastic setting, where the convex objective is f(x) := E_i[f_i(x)] and the algorithm can only compute stochastic gradients ∇f_i(x) at any point x for a random i. Let T be the number of stochastic gradient computations. It is well known that stochastic gradient descent (SGD) finds a point x with f(x) − f(x*) ≤ δ in (see for instance the textbooks [8, 18, 26])

    T = O(V/δ²) iterations,   or   T = O(V/(σδ)) if f(x) is σ-strongly convex.

²Nesterov [25] studied min_{y∈Q}{g(y) : Ay = b} with convex Q and strongly convex g(y). The dual problem is min_x{f(x)}, where f(x) := min_{y∈Q}{g(y) + ⟨x, b − Ay⟩}. 
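The regularization trick from Section 1.1 translates directly into code. The sketch below, on an assumed synthetic convex quadratic, builds g(x) = f(x) + (σ/2)‖x − x₀‖² with σ ∝ ε and minimizes it; for brevity the strongly convex subproblem is solved by plain gradient descent rather than AGD, which changes only the iteration count, not the structure of the argument.

```python
import numpy as np

# Sketch of the regularization trick on a synthetic convex quadratic (the
# data, eps, and the `radius` bound on ||x* - x0|| are illustrative
# assumptions).  Plain GD stands in for AGD on the strongly convex g.
rng = np.random.default_rng(1)
M = rng.standard_normal((30, 10))
A = M.T @ M / 30
b = A @ rng.standard_normal(10)
grad_f = lambda x: A @ x - b
L = np.linalg.eigvalsh(A)[-1]

def small_gradient_via_regularization(x0, eps, radius):
    sigma = eps / radius                              # sigma ~ eps keeps the bias <= eps
    grad_g = lambda z: grad_f(z) + sigma * (z - x0)   # g = f + sigma/2 ||.-x0||^2
    x = x0.copy()
    # g is sigma-strongly convex and (L + sigma)-smooth => linear convergence.
    for _ in range(int(10 * (L + sigma) / sigma)):
        x = x - grad_g(x) / (L + sigma)
        if np.linalg.norm(grad_g(x)) <= eps:
            break
    # ||grad f(x)|| <= ||grad g(x)|| + sigma ||x - x0|| <= 2 eps.
    return x

x_star = np.linalg.solve(A, b)
x = small_gradient_via_regularization(np.zeros(10), eps=1e-2,
                                      radius=2 * np.linalg.norm(x_star))
```

The final comment is exactly the triangle-inequality step of the text: the regularizer biases the gradient by at most σ‖x − x₀‖ ≤ ε.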
Let y*(x) ∈ Q be the (unique) minimizer of the internal problem; then g(y*(x)) − f(x) = ⟨x, ∇f(x)⟩ ≤ ‖x‖·‖∇f(x)‖.

    online convex:
      SGD     (naive)                            T = O(ε^{-4})                      (folklore, see Theorem 4.2)
      SGD1    (SGD after SGD)                    T = O(ε^{-8/3})                    (see [17] or Theorem 1)
      SGD2    (SGD after regularization)         T = O(ε^{-5/2})                    (see Theorem 2)
      SGD3    (SGD + recursive regularization)   T = O(ε^{-2}·log^3(1/ε))           (see Theorem 3)
    online strongly convex:
      SGDsc   (naive)                            T = O(ε^{-2}·κ)                    (see Theorem 4.2)
      SGD1sc  (SGD after SGD)                    T = O(ε^{-2}·κ^{1/2})              (see Theorem 1)
      SGD3sc  (SGD + recursive regularization)   T = O(ε^{-2}·log^3 κ)              (see Theorem 3)
    online nonconvex (σ-nonconvex), no 2nd-order smoothness needed:
      SGD                                        T = O(ε^{-4})                      (see [16])
      SCSG                                       T = O(ε^{-10/3})                   (see [21])
      SGD4                                       T = O(ε^{-2} + σε^{-4})            (see Theorem 4)
      Natasha1.5                                 T = O(ε^{-3} + σ^{1/3}ε^{-10/3})   (see [3])
    online nonconvex (σ-nonconvex), 2nd-order smoothness needed:
      SGD variants                               T = Õ(ε^{-4})                      (see [6, 14, 30])
      SGD5                                       T = Õ(ε^{-3.5})                    (see Theorem 5)
      cubic Newton                               T = Õ(ε^{-3.5})                    (see [27])
      Natasha2                                   T = O(ε^{-3.25})                   (see [3])

Table 1: Comparison of first-order online stochastic methods for finding ‖∇f(x)‖ ≤ ε. Following tradition, in these bounds we hide the variance and smoothness parameters in the big-O notation and only show the dependency on ε, on the condition number κ = L/σ ≥ 1 (if the objective is σ-strongly convex), or on the nonconvexity parameter σ.

Both rates are asymptotically optimal in terms of decreasing the objective, and V is an absolute bound on the variance of the stochastic gradients. Using the same argument ‖∇f(x)‖² ≤ L(f(x) − f(x*)) as before, SGD finds a point x with ‖∇f(x)‖ ≤ ε in

    T = O(L²V/ε⁴) iterations,   or   T = O(LV/(σε²)) if f(x) is σ-strongly convex.   (SGD)

These rates are not optimal. We investigate three approaches to improve them.

New Approach 1: SGD after SGD. Recall that in Nesterov's first trick, he replaced the use of the inequality ‖∇f(x)‖² ≤ L(f(x) − f(x*)) by T steps of gradient descent. In the stochastic setting, can we replace this inequality with T steps of SGD? We call this algorithm SGD1 and prove:

Theorem 1 (informal). For convex stochastic optimization, SGD1 finds x with ‖∇f(x)‖ ≤ ε in

    T = O(L^{2/3}V/ε^{8/3}) iterations,   or   T = O(L^{1/2}V/(σ^{1/2}ε²)) if f(x) is σ-strongly convex.   (SGD1)

The rate T ∝ ε^{-8/3}, in the special case of unconstrained minimization, was first discovered by Ghadimi and Lan [17] using a more complicated algorithm. The rate T ∝ 1/(σ^{1/2}ε²) does not seem to have been known before.

New Approach 2: SGD after regularization. Recall that in Nesterov's second trick, he defined g(x) = f(x) + (σ/2)‖x − x₀‖² as a regularized version of f(x), and applied the strongly-convex version of AGD to minimize g(x). 
Can we apply this trick to the stochastic setting? Note that the parameter σ has to be on the magnitude of ε, because ∇g(x) = ∇f(x) + σ(x − x₀) and we wish to make sure ‖∇f(x)‖ = ‖∇g(x)‖ ± ε. Therefore, if we apply SGD1 to minimize g(x) and find a point with ‖∇g(x)‖ ≤ ε, the convergence rate is T ∝ 1/(σ^{1/2}ε²) = 1/ε^{2.5}. We call this algorithm SGD2.

Theorem 2 (informal). For convex stochastic optimization, SGD2 finds x with ‖∇f(x)‖ ≤ ε in

    T = O(L^{1/2}V/ε^{5/2}) iterations.   (SGD2)

Again, this rate T ∝ 1/ε^{5/2} does not seem to have been known before.

New Approach 3: SGD and recursive regularization. In the second approach above, the ε^{0.5} sub-optimality gap is due to the choice σ ∝ ε, which ensures ‖σ(x − x₀)‖ ≤ ε.

Intuitively, if x₀ were sufficiently close to x* (and thus were also close to the approximate minimizer x), then we could choose σ ≫ ε so that ‖σ(x − x₀)‖ ≤ ε still holds. In other words, an appropriate warm start x₀ could help us break the ε^{-2.5} barrier and get a better convergence rate. But how do we find such an x₀? We find it by constructing a "less warm" starting point, and so on. This process is summarized by the following algorithm, which recursively finds the warm starts.

Starting from f^{(0)}(x) := f(x), we define f^{(s)}(x) := f^{(s−1)}(x) + (σ_s/2)‖x − x̂_s‖², where σ_s = 2σ_{s−1} and x̂_s is an approximate minimizer of f^{(s−1)}(x), calculated simply by the naive SGD. We call this method SGD3, and prove:

Theorem 3 (informal). For convex stochastic optimization, SGD3 finds x with ‖∇f(x)‖ ≤ ε in

    T = O(log³(L/ε)·V/ε²) iterations,   or   T = O(log³(L/σ)·V/ε²) if f(x) is σ-strongly convex.   (SGD3)

Our new rates in Theorem 3 not only improve the best known result of T ∝ ε^{-8/3}, but are also near-optimal, because Ω(V/ε²) is clearly a lower bound: even to decide whether a point x has ‖∇f(x)‖ ≤ ε or ‖∇f(x)‖ > 2ε requires Ω(V/ε²) samples of the stochastic gradient. Perhaps interestingly, our dependence on the smoothness parameter L (or the condition number κ := L/σ if strongly convex) is only polylogarithmic, as opposed to polynomial in all previous results.

1.3 Roadmap

We introduce notation in Section 2 and formalize the convex problem in Section 3. We review classical (convex) SGD theorems with objective-value convergence in Section 4. We give an auxiliary lemma in Section 5 and show our SGD3 results in Section 6. We apply our techniques to non-convex optimization and give algorithms SGD4 and SGD5 in Section 7. We discuss more related work in Appendix A, and show our results on SGD1 and SGD2 in Appendix B and Appendix C, respectively.

2 Preliminaries

Throughout this paper, we denote by ‖·‖ the Euclidean norm. We use i ∈_R [n] to denote that i is generated from [n] = {1, 2, ..., n} uniformly at random. We denote by ∇f(x) the gradient of a function f if it is differentiable, and by ∂f(x) any subgradient if f is only Lipschitz continuous. We denote by I[event] the indicator function of probabilistic events.

We denote by ‖A‖₂ the spectral norm of a matrix A. For symmetric matrices A and B, we write A ⪰ B to indicate that A − B is positive semidefinite (PSD). Therefore, A ⪰ −σI if and only if all eigenvalues of A are no less than −σ. 
We denote by λ_min(A) and λ_max(A) the minimum and maximum eigenvalues of a symmetric matrix A.

We recall some definitions on strong convexity and smoothness (they have other equivalent definitions; see the textbook [23]).

Definition 2.1. For a function f: R^d → R,
• f is σ-strongly convex if ∀x, y ∈ R^d, it satisfies f(y) ≥ f(x) + ⟨∂f(x), y − x⟩ + (σ/2)‖x − y‖².
• f is of σ-bounded nonconvexity (or σ-nonconvex for short) if ∀x, y ∈ R^d, it satisfies f(y) ≥ f(x) + ⟨∂f(x), y − x⟩ − (σ/2)‖x − y‖².³
• f is L-Lipschitz smooth (or L-smooth for short) if ∀x, y ∈ R^d, ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖.
• f is L₂-second-order smooth if ∀x, y ∈ R^d, it satisfies ‖∇²f(x) − ∇²f(y)‖₂ ≤ L₂‖x − y‖.

Definition 2.2. For a composite function F(x) = ψ(x) + f(x) where ψ(x) is proper convex, given a parameter η > 0, the gradient mapping of F(·) at a point x is

    G_{F,η}(x) := (1/η)(x − x⁺),   where   x⁺ = arg min_y { ψ(y) + ⟨∇f(x), y⟩ + (1/(2η))‖y − x‖² }.

In particular, if ψ(·) ≡ 0, then G_{F,η}(x) ≡ ∇f(x).

Recall the following property of the gradient mapping (see for instance [29, Lemma 3.7]):

³Previous authors also refer to this notion as "approximately convex," "almost convex," "hypo-convex," "semi-convex," or "weakly convex." We call it σ-nonconvex to stress the point that σ can be as large as L (any L-smooth function is automatically L-nonconvex).

Lemma 2.3. Let F(x) = ψ(x) + f(x), where ψ(x) is proper convex and f(x) is σ-strongly convex and L-smooth. For every x, y ∈ {x ∈ R^d : ψ(x) < +∞}, letting x⁺ = x − η·G_{F,η}(x), we have

    ∀η ∈ (0, 1/L]:   F(y) ≥ F(x⁺) + ⟨G_{F,η}(x), y − x⟩ + (η/2)‖G_{F,η}(x)‖² + (σ/2)‖y − x‖².

The following definition and properties of the Fenchel dual of a convex function are classical, and can be found for instance in the textbook [26].

Definition 2.4. Given a proper convex function h(y), its Fenchel dual is h*(β) := max_y{y^⊤β − h(y)}.
Proposition 2.5. ∇h*(β) = arg max_y{y^⊤β − h(y)}.
Proposition 2.6. If h(·) is σ-strongly convex, then h*(·) is (1/σ)-smooth.

3 Problem Formalization

Throughout this paper (except in our nonconvex application, Section 7), we minimize the convex stochastic composite objective

    min_{x∈R^d} { F(x) = ψ(x) + f(x) := ψ(x) + (1/n)∑_{i∈[n]} f_i(x) },   (3.1)

where
1. ψ(x) is proper convex (a.k.a. the proximal term),
2. f_i(x) is differentiable for every i ∈ [n],
3. f(x) is L-smooth and σ-strongly convex for some σ ∈ [0, L] that could be zero,
4. n can be very large or even infinite (so f(x) = E_i[f_i(x)]),⁴ and
5. the stochastic gradients ∇f_i(x) have bounded variance (over the domain of ψ(·)); that is,

    ∀x ∈ {y ∈ R^d | ψ(y) < +∞}:   E_{i∈_R[n]}‖∇f(x) − ∇f_i(x)‖² ≤ V.

We emphasize that the above assumptions are all classical. In the rest of the paper, we define T, the gradient complexity, as the number of computations of ∇f_i(x). 
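The gradient mapping of Definition 2.2 is straightforward to compute whenever the proximal term has a closed-form prox. The sketch below uses ψ(x) = λ‖x‖₁ as an assumed example (the quadratic f, the point x, and λ are all illustrative choices), and checks that with ψ ≡ 0 the mapping reduces to ∇f(x).

```python
import numpy as np

# Gradient mapping G_{F,eta}(x) of Definition 2.2 for F = psi + f.  The l1
# proximal term psi(x) = lam ||x||_1 is an assumed example; any proper
# convex psi with a computable prox works the same way.
def grad_mapping(grad_f, prox_psi, x, eta):
    # x_plus = argmin_y { psi(y) + <grad f(x), y> + ||y - x||^2 / (2 eta) }
    x_plus = prox_psi(x - eta * grad_f(x), eta)
    return (x - x_plus) / eta

def prox_l1(z, eta, lam=0.1):
    # prox of z |-> eta * lam * ||z||_1 is soft-thresholding
    return np.sign(z) * np.maximum(np.abs(z) - eta * lam, 0.0)

def prox_zero(z, eta):
    # psi == 0: the prox is the identity, so G_{F,eta}(x) == grad f(x)
    return z

A = np.array([[2.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -1.0])
grad_f = lambda x: A @ x - b

x, eta = np.array([0.3, -0.7]), 0.25
g_smooth = grad_mapping(grad_f, prox_zero, x, eta)   # equals grad_f(x)
g_l1 = grad_mapping(grad_f, prox_l1, x, eta)         # differs once psi != 0
```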
We search for points x whose gradient mapping satisfies ‖G_{F,η}(x)‖ ≤ ε for any η ≈ 1/L. Recall from Definition 2.2 that if there is no proximal term (i.e., ψ(x) ≡ 0), then G_{F,η}(x) = ∇f(x) for any η > 0. We want to study the best tradeoff between the gradient complexity T and the error ε.

We say an algorithm is online if its gradient complexity T is independent of n. This tackles the big-data scenario where n is extremely large or even infinite (i.e., f(x) = E_i[f_i(x)] for some random variable i). The stochastic gradient descent (SGD) method and all of its variants studied in this paper are online. In contrast, GD, AGD [23, 24], and Katyusha [2] are offline methods, because their gradient complexity depends on n (see Table 2 in the appendix).

4 Review: SGD with Objective Value Convergence

Recall that stochastic gradient descent (SGD) repeatedly performs proximal updates of the form

    x_{t+1} = arg min_{y∈R^d} { ψ(y) + (1/(2α))‖y − x_t‖² + ⟨∇f_i(x_t), y⟩ },

where α > 0 is some learning rate, and i is chosen from {1, 2, ..., n} uniformly at random in each iteration. Note that if ψ(y) ≡ 0, then x_{t+1} = x_t − α∇f_i(x_t). For completeness' sake, we summarize it as Algorithm 1. If f(x) is also known to be strongly convex, to get the tightest convergence rate, one can repeatedly apply SGD with decreasing learning rates α [19]. We summarize this algorithm as SGDsc in Algorithm 2.

The following theorem describes the rates of convergence in objective value for SGD and SGDsc, respectively. 
Their proofs are classical (and included in Appendix D); however, for our exact statements, we cannot find them recorded anywhere.⁵

⁴All of the results in this paper apply to the case when n is infinite, because we focus on online methods. However, we still introduce n to simplify notations.

⁵In the special case ψ(x) ≡ 0, Theorems 4.1(a) and 4.1(b) are folklore (see for instance [26]). If ψ(x) ≢ 0, Theorem 4.1(a) is recorded when ψ(x) is Lipschitz or smooth [13], but we would not like to impose such assumptions.

Algorithm 1 SGD(F, x₀, α, T)
Input: function F(x) = ψ(x) + (1/n)∑_{i=1}^n f_i(x); initial vector x₀; learning rate α > 0; T ≥ 1.
  ▷ if f(x) = (1/n)∑_{i=1}^n f_i(x) is L-smooth, the optimal choice is α = Θ(min{‖x₀ − x*‖/√(VT), 1/L})
1: for t = 0 to T − 1 do
2:   i ← a random index in [n];
3:   x_{t+1} ← arg min_{y∈R^d} { ψ(y) + (1/(2α))‖y − x_t‖² + ⟨∇f_i(x_t), y⟩ };
4: return x = (x₁ + ··· + x_T)/T.

Algorithm 2 SGDsc(F, x₀, σ, L, T)
Input: function F(x) = ψ(x) + (1/n)∑_{i=1}^n f_i(x); initial vector x₀; parameters 0 < σ ≤ L; T ≥ L/σ.
  ▷ f(x) is σ-strongly convex and f(x) = (1/n)∑_{i=1}^n f_i(x) is L-smooth
1: for t = 1 to N = ⌊T/(8L/σ)⌋ do
2:   x_t ← SGD(F, x_{t−1}, 1/(2L), 4L/σ);
3: for k = 1 to K = ⌊log₂(σT/(16L))⌋ do
4:   x_{N+k} ← SGD(F, x_{N+k−1}, 1/(2^k L), 2^{k+2}L/σ);
5: return x = x_{N+K}.

Theorem 4.1. Let x* ∈ arg min_x{F(x)}. To solve Problem (3.1) given a starting vector x₀ ∈ R^d:
(a) SGD(F, x₀, α, T) outputs x satisfying E[F(x)] − F(x*) ≤ αV/(2(1 − αL)) + ‖x₀ − x*‖²/(2αT), as long as α < 1/L. In particular, if α is tuned optimally, it satisfies

    E[F(x)] − F(x*) ≤ O( L‖x₀ − x*‖²/T + √V·‖x₀ − x*‖/√T ).

(b) If f(x) is σ-strongly convex and T ≥ L/σ, then SGDsc(F, x₀, σ, L, T) outputs x satisfying

    E[F(x)] − F(x*) ≤ O(V/(σT)) + (1 − σ/L)^{Ω(T)}·σ‖x₀ − x*‖².

As a sanity check, if V = 0, the convergence rate of SGD matches that of GD. (However, if V = 0, one can apply the accelerated gradient descent of Nesterov [22, 23] instead for a faster rate.)

To turn Theorem 4.1 into a rate of convergence for the gradients, we can simply apply Lemma 2.3, which implies

    ∀η ∈ (0, 1/L]:   (η/2)‖G_{F,η}(x)‖² ≤ F(x) − F(x⁺) ≤ F(x) − F(x*).   (4.1)

Theorem 4.2. Let x* ∈ arg min_x{F(x)}. To solve Problem (3.1) given a starting vector x₀ ∈ R^d and any η = C/L where C ∈ (0, 1] is some absolute constant:
(a) SGD outputs x satisfying E[‖G_{F,η}(x)‖²] ≤ O( L²‖x₀ − x*‖²/T + L√V·‖x₀ − x*‖/√T );
(b) if T ≥ L/σ, then SGDsc outputs x satisfying E[‖G_{F,η}(x)‖²] ≤ O(LV/(σT)) + (1 − σ/L)^{Ω(T)}·σL‖x₀ − x*‖².

Corollary 4.3. 
Hiding V, L, and ‖x₀ − x*‖ in the big-O notation, classical SGD finds x with

    F(x) − F(x*) ≤ O(T^{−1/2}) and ‖G_{F,η}(x)‖ ≤ O(T^{−1/4}) for Problem (3.1), or
    F(x) − F(x*) ≤ O((σT)^{−1}) and ‖G_{F,η}(x)‖ ≤ O((σT)^{−1/2}) if f(·) is σ-strongly convex for σ > 0.

5 An Auxiliary Lemma on Regularization

Consider a regularized objective

    G(x) := ψ(x) + g(x) := ψ(x) + ( f(x) + ∑_{s=1}^S (σ_s/2)‖x − x̂_s‖² ),   (5.1)

(footnote 5, continued) A variant of Theorem 4.1(b) is recorded for the accelerated version of SGD [15], but with a slightly worse rate T = O(V/(σT) + L‖x₀ − x*‖²/T²). If the readers find either statement explicitly stated somewhere, please let us know and we would love to include appropriate citations.

where x̂₁, ..., x̂_S are fixed vectors in R^d. The following lemma says that, if we find an approximate stationary point x of G(x), then it is also an approximate stationary point of F(x), up to some additive error.

Lemma 5.1. Suppose ψ(x) is proper convex and f(x) is convex and L-smooth. By definition, g(x) is σ̃-strongly convex with σ̃ := ∑_{s=1}^S σ_s. Let x* be the unique minimizer of G(y) in (5.1), and let x be an arbitrary vector in the domain {x ∈ R^d : ψ(x) < +∞}. Then, for every η ∈ (0, 1/(L + σ̃)], we have

    ‖G_{F,η}(x)‖ ≤ ∑_{s=1}^S σ_s‖x* − x̂_s‖ + 3‖G_{G,η}(x)‖.

Proof of Lemma 5.1. See the full version. □

Remark 5.2. Lemma 5.1 should be easy to prove in the special case of ψ(x) ≡ 0. Indeed,

    ‖∇f(x)‖ = ‖∇g(x) − ∑_s σ_s(x − x̂_s)‖ ≤① ‖∇g(x)‖ + ∑_s σ_s‖x − x̂_s‖
             ≤② ‖∇g(x)‖ + ∑_s σ_s‖x* − x̂_s‖ + σ̃‖x* − x‖ ≤③ 2‖∇g(x)‖ + ∑_s σ_s‖x* − x̂_s‖.

Above, inequalities ① and ② both use the triangle inequality, and inequality ③ is due to the σ̃-strong convexity of g(x) (see for instance [23, Sec. 2.1.3]). □

6 Approach 3: SGD and Recursive Regularization

In this section, we add a logarithmic number of regularizers to the objective, each centered at a different but carefully chosen point. Specifically, given parameters σ₁, ..., σ_S > 0, we define functions

    F^{(0)}(x) := F(x)   and   F^{(s)}(x) := F^{(s−1)}(x) + (σ_s/2)‖x − x̂_s‖²   for s = 1, 2, ..., S,

where each x̂_s (for s ≥ 1) is an approximate minimizer of F^{(s−1)}(x).

If f(x) is σ-strongly convex, then we choose S ≈ log₂(L/σ) and let σ₀ = σ and σ_s = 2σ_{s−1}. To calculate each x̂_s, we apply SGDsc for T/S iterations. This totals to a gradient complexity of T. 
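The recursion just described can be sketched in a few lines. Below, a plain SGD loop on an assumed least-squares finite sum stands in for the SGDsc inner calls, and the data, step sizes, and iteration budget are all illustrative assumptions; the point is only the structure of adding one regularizer per round and doubling σ_s.

```python
import numpy as np

# Skeleton of the recursive regularization behind SGD3sc: round s adds a
# regularizer (sigma_s/2)||x - x_hat_s||^2 centered at the previous round's
# output and doubles sigma_s.  Plain SGD on a least-squares finite sum
# stands in for the SGDsc calls; all data and step sizes are illustrative.
rng = np.random.default_rng(4)
n, d = 400, 8
Adata = rng.standard_normal((n, d))
bdata = Adata @ rng.standard_normal(d) + 0.05 * rng.standard_normal(n)
L = np.linalg.eigvalsh(Adata.T @ Adata / n)[-1]
full_grad = lambda x: Adata.T @ (Adata @ x - bdata) / n

def sgd3sc_sketch(x0, sigma, T):
    S = max(1, int(np.log2(L / sigma)))
    centers, sigmas = [], []          # the x_hat_s and sigma_s added so far
    x_hat, sigma_s = x0.copy(), sigma
    for _ in range(S):
        def stoch_grad(x, i):
            # gradient of f_i plus all regularizers added in earlier rounds
            g = Adata[i] * (Adata[i] @ x - bdata[i])
            for c, sg in zip(centers, sigmas):
                g = g + sg * (x - c)
            return g
        alpha = 0.05 / (L + sum(sigmas))
        x = x_hat.copy()
        for _ in range(T // S):       # T/S stochastic gradients per round
            x = x - alpha * stoch_grad(x, rng.integers(n))
        x_hat = x
        centers.append(x_hat.copy())
        sigmas.append(sigma_s)
        sigma_s *= 2.0
    return x_hat   # Lemma 5.1 transfers its small gradient back to F

x_out = sgd3sc_sketch(np.zeros(d), sigma=0.05 * L, T=30000)
```

Each round's regularizer is centered at a progressively "warmer" point, which is what lets σ_s grow well beyond ε without the bias term ∑_s σ_s‖x* − x̂_s‖ blowing up.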
We summarize this method as SGD3sc in Algorithm 3.

If f(x) is not strongly convex, then we regularize it by G(x) = F(x) + (σ/2)‖x − x₀‖² for some small parameter σ > 0, and then apply SGD3sc. We summarize this final method as SGD3 in Algorithm 4. We prove the following main theorem:

Theorem 3 (SGD3). Let x* ∈ arg min_x{F(x)}. To solve Problem (3.1) given a starting vector x₀ ∈ R^d and any η = C/L for some absolute constant C ∈ (0, 1]:
(a) If f(x) is σ-strongly convex for σ ∈ (0, L] and T ≥ (L/σ)·log(L/σ), then SGD3sc(F, x₀, σ, L, T) outputs x satisfying

    E[‖G_{F,η}(x)‖] ≤ O( √V·log^{3/2}(L/σ)/√T ) + (1 − σ/L)^{Ω(T/log(L/σ))}·σ‖x₀ − x*‖.

(b) If σ ∈ (0, L] and T ≥ (L/σ)·log(L/σ), then SGD3(F, x₀, σ, L, T) outputs x satisfying

    E[‖G_{F,η}(x)‖] ≤ O( √V·log^{3/2}(L/σ)/√T ) + σ‖x₀ − x*‖ + (1 − σ/L)^{Ω(T/log(L/σ))}·L‖x₀ − x*‖.

If σ is appropriately chosen, then we find x with E[‖G_{F,η}(x)‖] ≤ ε in gradient complexity

    T ≤ O( (V/ε²)·log³(L‖x₀ − x*‖/ε) + (L‖x₀ − x*‖/ε)·log(L‖x₀ − x*‖/ε) ).

Remark 6.1. All expected guarantees of the form E[‖G_{F,η}(x)‖²] ≤ ε² or E[‖G_{F,η}(x)‖] ≤ ε throughout this paper can be made into high-confidence bounds by repeating the algorithm multiple times, each time estimating the value of ‖G_{F,η}(x)‖ using roughly O(V/ε²) stochastic gradient computations, and finally outputting the point x that leads to the smallest value of ‖G_{F,η}(x)‖.

Algorithm 3 SGD3sc(F, x₀, σ, L, T)
Input: function F(x) = ψ(x) + (1/n)∑_{i=1}^n f_i(x); initial vector x₀; parameters 0 < σ ≤ L; number of iterations T ≥ Ω((L/σ)·log(L/σ)).
  ▷ f(x) = (1/n)∑_{i=1}^n f_i(x) is σ-strongly convex and L-smooth
1: F^{(0)}(x) := F(x);  x̂₀ ← x₀;  σ₀ ← σ;
2: for s = 1 to S = ⌊log₂(L/σ)⌋ do
3:   x̂_s ← SGDsc(F^{(s−1)}, x̂_{s−1}, σ_{s−1}, 3L, T/S);
4:   σ_s ← 2σ_{s−1};
5:   F^{(s)}(x) := F^{(s−1)}(x) + (σ_s/2)‖x − x̂_s‖²;
6: return x = x̂_S.

Algorithm 4 SGD3(F, x₀, σ, L, T)
Input: function F(x) = ψ(x) + (1/n)∑_{i=1}^n f_i(x); initial vector x₀; parameters L ≥ σ > 0; T ≥ 1.
  ▷ f(x) = (1/n)∑_{i=1}^n f_i(x) is convex and L-smooth
1: G(x) := F(x) + (σ/2)‖x − x₀‖²;
2: return x ← SGD3sc(G, x₀, σ, L + σ, T).

6.1 Proof of Theorem 3

Before proving Theorem 3, we state a few properties regarding the relationship between the objective-optimality of x̂_s and point distances.

Claim 6.2. Suppose for every s = 1, . . .
, S$ the vector $\hat{x}_s$ satisfies
$$\mathbb{E}\big[F^{(s-1)}(\hat{x}_s) - F^{(s-1)}(x^*_{s-1})\big] \leq \delta_s , \qquad (6.1)$$
where $x^*_{s-1} \in \arg\min_x\{F^{(s-1)}(x)\}$. Then,
(a) for every $s \geq 1$, $\mathbb{E}[\|\hat{x}_s - x^*_{s-1}\|]^2 \leq \mathbb{E}[\|\hat{x}_s - x^*_{s-1}\|^2] \leq \frac{2\delta_s}{\sigma_{s-1}}$;
(b) for every $s \geq 1$, $\mathbb{E}[\|x^*_s - \hat{x}_s\|]^2 \leq \mathbb{E}[\|x^*_s - \hat{x}_s\|^2] \leq \frac{\delta_s}{\sigma_s}$; and
(c) if $\sigma_s = 2\sigma_{s-1}$ for all $s \geq 1$, then $\mathbb{E}\big[\sum_{s=1}^S \sigma_s\|x^*_S - \hat{x}_s\|\big] \leq 4\sum_{s=1}^S\sqrt{\delta_s\sigma_s}$.

Proof of Claim 6.2.
(a) We have $\mathbb{E}[\|\hat{x}_s - x^*_{s-1}\|]^2 \overset{(i)}{\leq} \mathbb{E}[\|\hat{x}_s - x^*_{s-1}\|^2] \overset{(ii)}{\leq} \frac{2}{\sigma_{s-1}}\,\mathbb{E}\big[F^{(s-1)}(\hat{x}_s) - F^{(s-1)}(x^*_{s-1})\big] \leq \frac{2\delta_s}{\sigma_{s-1}}$. Here, inequality (i) is because $\mathbb{E}[X]^2 \leq \mathbb{E}[X^2]$, and inequality (ii) is due to the strong convexity of $F^{(s-1)}(x)$.
(b) We derive that
$$\sigma_s\|x^*_s - \hat{x}_s\|^2 \overset{(i)}{\leq} \frac{\sigma_s}{2}\|x^*_s - \hat{x}_s\|^2 + F^{(s)}(\hat{x}_s) - F^{(s)}(x^*_s) = F^{(s-1)}(\hat{x}_s) - F^{(s-1)}(x^*_s) \overset{(ii)}{\leq} F^{(s-1)}(\hat{x}_s) - F^{(s-1)}(x^*_{s-1}) .$$
Here, inequality (i) is due to the strong convexity of $F^{(s)}(x)$, and inequality (ii) is because of the minimality of $x^*_{s-1}$. Taking expectation, we have $\mathbb{E}[\|x^*_s - \hat{x}_s\|^2] \leq \frac{\delta_s}{\sigma_s}$, and thus $\mathbb{E}[\|x^*_s - \hat{x}_s\|]^2 \leq \mathbb{E}[\|x^*_s - \hat{x}_s\|^2] \leq \frac{\delta_s}{\sigma_s}$.
(c) Define $P_t := \sum_{s=1}^t \sigma_s\|x^*_t - \hat{x}_s\|$ for each $t = 0, 1, \ldots, S$ (so that $P_0 = 0$).
Then by the triangle inequality we have
$$P_s - P_{s-1} \leq \sigma_s\|x^*_s - \hat{x}_s\| + \Big(\sum_{t=1}^{s-1}\sigma_t\Big) \cdot \|x^*_s - x^*_{s-1}\| \leq \sigma_s\|x^*_s - \hat{x}_s\| + \sigma_s \cdot \big(\|x^*_s - \hat{x}_s\| + \|x^*_{s-1} - \hat{x}_s\|\big) ,$$
where the second inequality uses $\sum_{t=1}^{s-1}\sigma_t \leq \sigma_s$ (a consequence of the parameter choice $\sigma_s = 2\sigma_{s-1}$) and $\|x^*_s - x^*_{s-1}\| \leq \|x^*_s - \hat{x}_s\| + \|x^*_{s-1} - \hat{x}_s\|$. Plugging in Claim 6.2(a) and Claim 6.2(b), we have
$$\mathbb{E}[P_s - P_{s-1}] \leq \sqrt{\delta_s\sigma_s} + \sigma_s \cdot \mathbb{E}\big[\|x^*_s - \hat{x}_s\| + \|x^*_{s-1} - \hat{x}_s\|\big] \leq 4\sqrt{\delta_s\sigma_s} .$$
Summing over $s = 1, \ldots, S$ and using $P_0 = 0$ completes the proof. □

Proof of Theorem 3(a). We first note that, when writing $f^{(s-1)}(x) = F^{(s-1)}(x) - \psi(x)$, each $f^{(s-1)}$ is at least $\sigma_{s-1}$-strongly convex and $\big(L + \sum_{t=1}^{s-1}\sigma_t\big) \leq 3L$ Lipschitz smooth.
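The smoothness bookkeeping in the last sentence can be confirmed directly: with the doubling schedule $\sigma_t = 2^t\sigma_0$ and $S = \lfloor\log_2\frac{L}{\sigma_0}\rfloor$ stages, the added quadratics contribute at most $\sum_{t=1}^{s-1}\sigma_t \leq L$ extra smoothness. A small numeric confirmation (the concrete values of $L$ and $\sigma_0$ below are illustrative only):

```python
import numpy as np

# With sigma_t = 2^t * sigma_0 and S = floor(log2(L/sigma_0)),
# the total added regularization sum_{t=1}^{s-1} sigma_t stays <= L,
# so each f^{(s-1)} is at most (L + L) <= 3L smooth, while its strong
# convexity is at least sigma_{s-1}.
L, sigma0 = 1.0, 1e-3
S = int(np.floor(np.log2(L / sigma0)))
sigmas = [sigma0 * 2 ** t for t in range(1, S + 1)]   # sigma_1..sigma_S
for s in range(1, S + 1):
    added = sum(sigmas[: s - 1])            # sum_{t=1}^{s-1} sigma_t
    assert added <= L                       # added smoothness <= L
    assert L + added <= 3 * L               # overall smoothness <= 3L
    assert sigmas[s - 1] == sigma0 * 2 ** s # sigma_s doubles each stage
```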
Therefore, applying Theorem 4.1(b), we have
$$\mathbb{E}\big[F^{(s-1)}(\hat{x}_s) - F^{(s-1)}(x^*_{s-1})\big] \leq O\Big(\frac{S\mathcal{V}}{\sigma_{s-1}T}\Big) + \Big(1 - \frac{\sigma_{s-1}}{3L}\Big)^{\Omega(T/S)} \mathbb{E}\big[\sigma_{s-1}\|\hat{x}_{s-1} - x^*_{s-1}\|^2\big] .$$
If $s > 1$, this means
$$\mathbb{E}\big[F^{(s-1)}(\hat{x}_s) - F^{(s-1)}(x^*_{s-1})\big] \leq O\Big(\frac{S\mathcal{V}}{\sigma_{s-1}T}\Big) + \Big(1 - \frac{\sigma_{s-1}}{3L}\Big)^{\Omega(T/S)} \mathbb{E}\big[F^{(s-2)}(\hat{x}_{s-1}) - F^{(s-2)}(x^*_{s-2})\big] ,$$
because $\sigma_{s-1}\|\hat{x}_{s-1} - x^*_{s-1}\|^2 \leq 2\big(F^{(s-1)}(\hat{x}_{s-1}) - F^{(s-1)}(x^*_{s-1})\big) \leq 2\big(F^{(s-2)}(\hat{x}_{s-1}) - F^{(s-2)}(x^*_{s-2})\big)$, using $F^{(s-1)}(\hat{x}_{s-1}) = F^{(s-2)}(\hat{x}_{s-1})$, $F^{(s-1)} \geq F^{(s-2)}$ pointwise, and the minimality of $x^*_{s-2}$.
If $s = 1$, this means (recalling $\hat{x}_0 = x_0$ and $x^*_0 = x^*$)
$$\mathbb{E}\big[F^{(0)}(\hat{x}_1) - F^{(0)}(x^*)\big] \leq O\Big(\frac{S\mathcal{V}}{\sigma_0 T}\Big) + \Big(1 - \frac{\sigma_0}{3L}\Big)^{\Omega(T/S)} \sigma_0\|x_0 - x^*\|^2 .$$
Together, this means that to satisfy (6.1), it suffices to choose $\delta_s$ so that
$$\delta_s = O\Big(\frac{S\mathcal{V}}{\sigma_{s-1}T}\Big) + \Big(1 - \frac{\sigma_0}{3L}\Big)^{\Omega(sT/S)} \sigma_0\|x_0 - x^*\|^2 .$$
Using Lemma 2.3 with $F^{(S-1)}$ and $y = x = \hat{x}_S$, we have
$$\frac{\eta}{2}\|\mathcal{G}_{F^{(S-1)},\eta}(\hat{x}_S)\|^2 \leq F^{(S-1)}(\hat{x}_S) - F^{(S-1)}(\hat{x}_S^+) \leq F^{(S-1)}(\hat{x}_S) - F^{(S-1)}(x^*_{S-1})$$
and therefore
$$\mathbb{E}\big[\|\mathcal{G}_{F^{(S-1)},\eta}(\hat{x}_S)\|\big]^2 \leq \mathbb{E}\big[\|\mathcal{G}_{F^{(S-1)},\eta}(\hat{x}_S)\|^2\big] \leq \frac{2\delta_S}{\eta} = O(L\delta_S) .$$
Plugging this into Lemma 5.1 (with $G(x) = F^{(S-1)}(x)$) and Claim 6.2(c), we have
$$\mathbb{E}\big[\|\mathcal{G}_{F,\eta}(\hat{x}_S)\|\big] \leq \mathbb{E}\Big[\sum_{s=1}^{S-1}\sigma_s\|x^*_{S-1} - \hat{x}_s\| + 3\|\mathcal{G}_{F^{(S-1)},\eta}(\hat{x}_S)\|\Big] \leq O\Big(\sum_{s=1}^{S-1}\sqrt{\delta_s\sigma_s} + \sqrt{L\delta_S}\Big) = O\Big(\sum_{s=1}^{S}\sqrt{\delta_s\sigma_s}\Big) ,$$
where the last equality uses $\sigma_S = \Theta(L)$. Plugging in the above choice of $\delta_s$, we conclude that
$$\mathbb{E}\big[\|\mathcal{G}_{F,\eta}(\hat{x}_S)\|\big] \leq O\Big(\frac{S^{3/2}\mathcal{V}^{1/2}}{T^{1/2}} + \Big(1 - \frac{\sigma_0}{3L}\Big)^{\Omega(T/S)}\sqrt{L\sigma_0}\,\|x_0 - x^*\|\Big) ,$$
which, recalling $S = O(\log\frac{L}{\sigma})$, proves Theorem 3(a). □

Proof of Theorem 3(b). Define $G(x) := F(x) + \frac{\sigma}{2}\|x - x_0\|^2$ and let $x^*_G$ be the (unique) minimizer of $G(\cdot)$.
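The key geometric fact about this construction, that the minimizer of the regularized objective $G$ is never farther from $x_0$ than the minimizer of $F$, is established in general at the end of this proof; it can also be checked numerically on a toy quadratic (an illustration only, with arbitrary values, not part of the argument):

```python
import numpy as np

# Toy check on a quadratic F(x) = 0.5 * x^T A x - b^T x:
# the minimizer x_G of G = F + (sigma/2)||x - x0||^2 satisfies
# ||x_G - x0|| <= ||x* - x0|| for every sigma > 0.
rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
A = M @ M.T + 0.1 * np.eye(4)          # positive definite Hessian
b = rng.standard_normal(4)
x0 = rng.standard_normal(4)

x_star = np.linalg.solve(A, b)         # minimizer of F
for sigma in [1e-3, 1e-1, 1.0, 10.0]:
    # grad G = A x - b + sigma (x - x0) = 0
    x_G = np.linalg.solve(A + sigma * np.eye(4), b + sigma * x0)
    assert np.linalg.norm(x_G - x0) <= np.linalg.norm(x_star - x0) + 1e-9
```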
Note that $x^*_G$ may be different from $x^*$, which is a minimizer of $F(\cdot)$. Applying Theorem 3(a) on $G(x)$, and Lemma 5.1 with $S = 1$ and $\hat{x}_1 = x_0$, we have
$$\mathbb{E}[\|\mathcal{G}_{F,\eta}(x)\|] \leq O\Big(\frac{\sqrt{\mathcal{V}} \cdot \log^{3/2}\frac{L}{\sigma}}{\sqrt{T}} + \Big(1 - \frac{\sigma}{L}\Big)^{\Omega(T/\log(L/\sigma))}\sqrt{L\sigma}\,\|x_0 - x^*_G\| + \sigma\|x_0 - x^*_G\|\Big) .$$
Now, by definition,
$$\frac{\sigma}{2}\|x^* - x_0\|^2 - \frac{\sigma}{2}\|x^*_G - x_0\|^2 = \big(G(x^*) - F(x^*)\big) + \big(F(x^*_G) - G(x^*_G)\big) = \big(G(x^*) - G(x^*_G)\big) + \big(F(x^*_G) - F(x^*)\big) \geq 0 ,$$
where the last inequality uses the minimality of $x^*_G$ for $G$ and of $x^*$ for $F$. So we have $\|x^*_G - x_0\| \leq \|x^* - x_0\|$. This completes the proof. □

Acknowledgements

We would like to thank Lin Xiao for suggesting reference [29, Lemma 3.7], an anonymous researcher from the Simons Institute for suggesting reference [25], Yurii Nesterov for helpful discussions, Xinyu Weng for discussing the motivations, Sébastien Bubeck, Yuval Peres, and Lin Xiao for discussing notations, Chi Jin for discussing reference [27], and Dmitriy Drusvyatskiy for discussing the notion of the Moreau envelope.

References

[1] Open problem session of the "Fast Iterative Methods in Optimization" workshop. Simons Institute for the Theory of Computing, UC Berkeley, October 2017.

[2] Zeyuan Allen-Zhu. Katyusha: The First Direct Acceleration of Stochastic Gradient Methods. In STOC, 2017.

[3] Zeyuan Allen-Zhu. Natasha 2: Faster Non-Convex Optimization Than SGD. In NeurIPS, 2018.

[4] Zeyuan Allen-Zhu and Elad Hazan. Optimal Black-Box Reductions Between Optimization Objectives. In NeurIPS, 2016.

[5] Zeyuan Allen-Zhu and Yuanzhi Li. Follow the Compressed Leader: Faster Online Learning of Eigenvectors and Faster MMWU.
In ICML, 2017.

[6] Zeyuan Allen-Zhu and Yuanzhi Li. Neon2: Finding Local Minima via First-Order Oracles. In NeurIPS, 2018.

[7] Zeyuan Allen-Zhu, Yuanzhi Li, Rafael Oliveira, and Avi Wigderson. Much Faster Algorithms for Matrix Scaling. In FOCS, 2017.

[8] Sébastien Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231–357, 2015.

[9] Yair Carmon, John C. Duchi, Oliver Hinder, and Aaron Sidford. Accelerated Methods for Non-Convex Optimization. ArXiv e-prints, abs/1611.00756, November 2016.

[10] M. B. Cohen, A. Madry, D. Tsipras, and A. Vladu. Matrix Scaling and Balancing via Box Constrained Newton's Method and Interior Point Methods. In FOCS, pages 902–913, October 2017.

[11] Damek Davis and Dmitriy Drusvyatskiy. Complexity of finding near-stationary points of convex functions stochastically. ArXiv e-prints, abs/1802.08556, 2018.

[12] Damek Davis and Dmitriy Drusvyatskiy. Stochastic subgradient method converges at the rate $O(k^{-1/4})$ on weakly convex functions. ArXiv e-prints, abs/1802.02988, 2018.

[13] John Duchi and Yoram Singer. Efficient Online and Batch Learning Using Forward Backward Splitting. Journal of Machine Learning Research, 10:2899–2934, 2009.

[14] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points—online stochastic gradient for tensor decomposition. In Proceedings of the 28th Annual Conference on Learning Theory, COLT 2015, 2015.

[15] Saeed Ghadimi and Guanghui Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization I: A generic algorithmic framework. SIAM Journal on Optimization, 22(4):1469–1492, 2012.

[16] Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming.
SIAM Journal on Optimization, 23(4):2341–2368, 2013.

[17] Saeed Ghadimi and Guanghui Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming, pages 1–26, February 2015.

[18] Elad Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157–325, 2016.

[19] Elad Hazan and Satyen Kale. Beyond the regret minimization barrier: Optimal algorithms for stochastic strongly-convex optimization. The Journal of Machine Learning Research, 15(1):2489–2512, 2014.

[20] Martin Idel. A review of matrix scaling and Sinkhorn's normal form for matrices and positive maps. ArXiv e-prints, abs/1609.06349, 2016.

[21] Lihua Lei, Cheng Ju, Jianbo Chen, and Michael I. Jordan. Nonconvex Finite-Sum Optimization Via SCSG Methods. In NeurIPS, 2017.

[22] Yurii Nesterov. A method of solving a convex programming problem with convergence rate $O(1/k^2)$. In Doklady AN SSSR (translated as Soviet Mathematics Doklady), volume 269, pages 543–547, 1983.

[23] Yurii Nesterov. Introductory Lectures on Convex Programming, Volume I: A Basic Course. Kluwer Academic Publishers, 2004.

[24] Yurii Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152, December 2005.

[25] Yurii Nesterov. How to make the gradients small. Optima, 88:10–11, 2012.

[26] Shai Shalev-Shwartz. Online Learning and Online Convex Optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2012.

[27] Nilesh Tripuraneni, Mitchell Stern, Chi Jin, Jeffrey Regier, and Michael I. Jordan. Stochastic Cubic Regularization for Fast Nonconvex Optimization. ArXiv e-prints, abs/1711.02838, November 2017.

[28] Blake Woodworth and Nati Srebro.
Tight Complexity Bounds for Optimizing Composite Objectives. In NeurIPS, 2016.

[29] Lin Xiao and Tong Zhang. A Proximal Stochastic Gradient Method with Progressive Variance Reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.

[30] Yi Xu and Tianbao Yang. First-order Stochastic Algorithms for Escaping From Saddle Points in Almost Linear Time. ArXiv e-prints, abs/1711.01944, November 2017.