{"title": "A Universally Optimal Multistage Accelerated Stochastic Gradient Method", "book": "Advances in Neural Information Processing Systems", "page_first": 8525, "page_last": 8536, "abstract": "We study the problem of minimizing a strongly convex, smooth function when we have noisy estimates of its gradient. We propose a novel multistage accelerated algorithm that is universally optimal in the sense that it achieves the optimal rate both in the deterministic and stochastic case and operates without knowledge of noise characteristics. The algorithm consists of stages that use a stochastic version of Nesterov's method with a specific restart and parameters selected to achieve the fastest reduction in the bias-variance terms in the convergence rate bounds.", "full_text": "A Universally Optimal Multistage Accelerated\n\nStochastic Gradient Method\n\nNecdet Serhat Aybat\u2217\n\nPennsylvania State University\n\nUniversity Park, PA, USA\n\nnsa10@psu.edu\n\nMert G\u00fcrb\u00fczbalaban\u2217\n\nRutgers University\nPiscataway, NJ, USA\nmg1366@rutgers.edu\n\nAlireza Fallah\u2217\n\nMassachusetts Institute of Technology\n\nCambridge, MA, USA\nafallah@mit.edu\n\nAsuman Ozdaglar\u2217\n\nMassachusetts Institute of Technology\n\nCambridge, MA, USA\n\nasuman@mit.edu\n\nAbstract\n\nWe study the problem of minimizing a strongly convex, smooth function when we\nhave noisy estimates of its gradient. We propose a novel multistage accelerated\nalgorithm that is universally optimal in the sense that it achieves the optimal rate\nboth in the deterministic and stochastic case and operates without knowledge of\nnoise characteristics. 
The algorithm consists of stages that use a stochastic version of Nesterov's method with a specific restart and parameters selected to achieve the fastest reduction in the bias-variance terms in the convergence rate bounds.

1 Introduction

First-order optimization methods play a key role in solving large-scale machine learning problems due to their low iteration complexity and scalability to large data sets. In several cases, these methods operate with noisy first-order information, either because the gradient is estimated from draws or from a subset of components of the underlying objective function [3, 8, 13, 16, 17, 21, 36, 9, 11], or because noise is injected intentionally due to privacy or algorithmic considerations [4, 25, 30, 14, 15]. A fundamental question in this setting is to design fast algorithms with optimal convergence rate, matching the lower bounds on the oracle complexity in terms of target accuracy and other important parameters, both for the deterministic and the stochastic case (i.e., with or without gradient errors).

In this paper, we design an optimal first-order method to solve the problem

    f* := min_{x ∈ R^d} f(x)   such that f ∈ S_{µ,L}(R^d),    (1)

where, for scalars 0 < µ ≤ L, S_{µ,L}(R^d) is the set of continuously differentiable functions f : R^d → R that are strongly convex with modulus µ and have Lipschitz-continuous gradients with constant L, which imply that for every x, y ∈ R^d, f satisfies (see e.g. [27])

    (µ/2)‖x − y‖² ≤ f(x) − f(y) − ∇f(y)ᵀ(x − y) ≤ (L/2)‖x − y‖².    (2)

For f ∈ S_{µ,L}(R^d), the ratio κ := L/µ is called the condition number of f.
Throughout the paper, we denote the optimal value of problem (1) by f*, which is achieved at the unique optimal point x*.

*The authors are in alphabetical order.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

We assume that the gradient information is available through a stochastic oracle, which at each iteration n, given the current iterate x_n ∈ R^d, provides the noisy gradient ∇̃f(x_n, w_n), where {w_n}_n is a sequence of independent random variables such that for all n ≥ 0,

    E[∇̃f(x_n, w_n) | x_n] = ∇f(x_n),   E[‖∇̃f(x_n, w_n) − ∇f(x_n)‖² | x_n] ≤ σ².    (3)

This oracle model is commonly considered in the literature (see e.g. [16, 17, 6]). In Appendix K, we show how our analysis can be extended to the following more general noise setting, the same as the one studied in [3], where the variance of the noise is allowed to grow linearly with the squared distance to the optimal solution:

    E[∇̃f(x_n, w_n) | x_n] = ∇f(x_n),   E[‖∇̃f(x_n, w_n) − ∇f(x_n)‖² | x_n] ≤ σ² + η²‖x_n − x*‖²    (4)
for some constant η ≥ 0.

Under the noise setting in (3), the performance of many algorithms is characterized by the expected error of the iterates (in terms of the suboptimality in function values), which admits a bound given by the sum of two terms: a bias term that shows the decay of the initialization error f(x_0) − f* and is independent of the noise parameter σ², and a variance term that depends on σ² and is independent of the initial point x_0. A lower bound on the bias term follows from the seminal work of Nemirovsky and Yudin [26], which showed that without noise (σ = 0) and after n iterations, E[f(x_n)] − f* cannot be smaller than²

    (L‖x_0 − x*‖²/2) exp(−O(1) n/√κ).    (5)

With noise, Raginsky and Rakhlin [31] provided the following (much larger) lower bound³ on function suboptimality, which also provides a lower bound on the variance term:

    Ω(σ²/(µn))   for n sufficiently large.    (6)

Several algorithms have been proposed in the recent literature attempting to achieve these lower bounds.⁴ Xiao [38] obtains O(log(n)/n) performance guarantees in expected suboptimality for an accelerated version of the dual averaging method. Dieuleveut et al. [12] consider a quadratic objective function and develop an algorithm with averaging that achieves the error bound O(σ²/n + ‖x_0 − x*‖²/n²). Hu et al. [20] consider general strongly convex and smooth functions and achieve an error bound with similar dependence under the assumption of bounded noise. Ghadimi and Lan [16] and Chen et al. [7] extend this result to the noise model in (3) by introducing the accelerated stochastic approximation algorithm (AC-SA) and the optimal regularized dual averaging algorithm (ORDA), respectively.
Both AC-SA and ORDA have multistage versions, presented in [17] and [7], where the authors improve the bias term of their single-stage methods to the optimal exp(−O(1) n/√κ) by exploiting knowledge of σ and the optimality gap ∆, i.e., an upper bound on f(x_0) − f*, in the operation of the algorithm. Another closely related paper is [8], which proposed µAGD+ and showed, under an additive noise model, that it admits the error bound O(σ²/n + ‖x_0 − x*‖²/n^p) for any p ≥ 1, where the constants grow with p; in particular, for p = log n they achieve the bound O(σ² log n/n + ‖x_0 − x*‖²/n^{log n}).

In this paper, we introduce the class of Multistage Accelerated Stochastic Gradient (M-ASG) methods that are universally optimal, achieving the lower bound both in the noiseless deterministic case and the noisy stochastic case, up to constants independent of µ and L. M-ASG proceeds in stages that use a stochastic version of Nesterov's accelerated method [27] with a specific restart and parameterization. Given an arbitrary length and a constant stepsize for the first stage, together with geometrically growing lengths and shrinking stepsizes for the following stages, we first provide a general convergence rate result for M-ASG (see Theorem 3.4). Given the computational budget n, a specific choice for the length of the first stage is shown to achieve the optimal error bound without requiring knowledge of the noise bound σ² and the initial optimality gap (see Corollary 3.8).

²This lower bound is shown with the additional assumption n ≤ d.
³The authors show this result for µ = 1.
Nonetheless, it can be generalized to any µ > 0 by scaling the problem parameters properly.
⁴Here we review their error bounds after n iterations, highlighting the dependence on σ², n, and the initial point x_0, suppressing the µ and L dependence.

Table 1: Comparison of algorithms

Algorithm                                  | Requires σ | Requires ∆ | Requires n or ε | Opt. Bias | Opt. Var.
AC-SA                                      | ✗          | ✗          | ✗               | ✗         | ✓
Multi. AC-SA                               | ✓          | ✓          | ✗               | ✓         | ✓
ORDA                                       | ✗          | ✗          | ✗               | ✗         | ✓
Multi. ORDA                                | ✓          | ✓          | ✗               | ✓         | ✓
Cohen et al.                               | ✗          | ✗          | ✗               | ✗         | ✓
M-ASG (with parameters in Corollary 3.7)   | ✗          | ✗          | ✗               | ✗         | ✓
M-ASG (with parameters in Corollary 3.8)   | ✗          | ✗          | ✓ [n]           | ✓         | ✓
M-ASG (with parameters in Corollary 3.9)   | ✗          | ✓          | ✓ [ε]           | ✓         | ✓

To the best of our knowledge, this is the first algorithm that achieves such a lower bound under such informational assumptions. In Table 1, we provide a comparison of our algorithm with other algorithms in terms of the required assumptions and the optimality of their results in both the bias and variance terms. In particular, we consider AC-SA [16], Multistage AC-SA [17], ORDA and Multistage ORDA [7], and the algorithm proposed in [8].

Our paper builds on an analysis of Nesterov's accelerated stochastic method with a specific momentum parameter, presented in Section 2, which may be of independent interest. This analysis follows from a dynamical-system representation and study of first-order methods, which has gained attention in the literature recently [24, 19, 2]. In Section 3, we present the M-ASG algorithm and characterize its behavior under different assumptions, as summarized in Table 1. In particular, we show that it achieves the optimal convergence rate with a given budget of n iterations.
In Section 4, we show how additional information such as σ and ∆ can be leveraged in our framework to improve practical performance. Finally, in Section 5, we provide numerical results comparing our algorithm with some of the most recent methods in the literature.

Preliminaries and notation: Let I_d and 0_d represent the d × d identity and zero matrices. For a matrix A ∈ R^{d×d}, Tr(A) and det(A) denote the trace and determinant of A, respectively. Also, for scalars 1 ≤ i ≤ j ≤ d and 1 ≤ k ≤ l ≤ d, we use A_{[i:j],[k:l]} to denote the submatrix formed by rows i to j and columns k to l. We use the superscript ᵀ to denote the transpose of a vector or a matrix, depending on the context. Throughout this paper, all vectors are represented as column vectors. Let S^m_+ denote the set of all symmetric positive semidefinite m × m matrices. For two matrices A ∈ R^{m×n} and B ∈ R^{p×q}, their Kronecker product is denoted by A ⊗ B. For scalars 0 < µ ≤ L, S_{µ,L}(R^d) is the set of continuously differentiable functions f : R^d → R that are strongly convex with modulus µ and have Lipschitz-continuous gradients with constant L. All logarithms throughout the paper are in natural base.

2 Modeling the Accelerated Gradient method as a dynamical system

In this section, we study Nesterov's Accelerated Stochastic Gradient method (ASG) [27] with the stochastic first-order oracle in (3):

    y_k = (1 + β)x_k − βx_{k−1},   x_{k+1} = y_k − α∇̃f(y_k, w_k),    (7)

where α ∈ (0, 1/L] is the stepsize and β = (1 − √(αµ))/(1 + √(αµ)) is the momentum parameter. This choice of momentum parameter has already been studied in the literature, e.g., [28, 37, 33].
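As a concrete illustration, the update (7) can be sketched in a few lines of Python; this is our own sketch, where `noisy_grad` stands for the stochastic oracle ∇̃f(·, w) and the function name is ours:

```python
import numpy as np

def asg(noisy_grad, x0, alpha, mu, n_iters):
    """Stochastic Nesterov iterations (7) with the momentum parameter
    beta = (1 - sqrt(alpha*mu)) / (1 + sqrt(alpha*mu))."""
    beta = (1 - np.sqrt(alpha * mu)) / (1 + np.sqrt(alpha * mu))
    x_prev = x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        y = (1 + beta) * x - beta * x_prev          # extrapolation step
        x_prev, x = x, y - alpha * noisy_grad(y)    # (noisy) gradient step
    return x
```

When `noisy_grad` returns the exact gradient (σ = 0), this reduces to Nesterov's deterministic accelerated method with the momentum choice analyzed below.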
In the next lemma, we provide a new motivation for this choice by showing that, for quadratic functions and in the noiseless setting, this momentum parameter achieves the fastest asymptotic convergence rate for a given fixed stepsize α ∈ (0, 1/L]. The proof of this lemma is provided in Appendix A.

Lemma 2.1. Let f ∈ S_{µ,L}(R^d) be a strongly convex quadratic function of the form f(x) = (1/2)xᵀQx − pᵀx + r, where Q is a d × d symmetric positive definite matrix with all its eigenvalues in the interval [µ, L]. Consider the deterministic ASG iterations, i.e., σ = 0, as shown in (7), with constant stepsize α ∈ (0, 1/L]. Then the fastest asymptotic convergence rate, i.e., the smallest ρ ∈ (0, 1) that satisfies the inequality

    ‖x_k − x*‖² ≤ (ρ + ε_k)^{2k} ‖x_0 − x*‖²,   for all x_0 ∈ R^d,

for some non-negative sequence {ε_k}_k that goes to zero, is ρ = 1 − √(αµ),⁵ and it is achieved by β = (1 − √(αµ))/(1 + √(αµ)). As a consequence, for this choice of β, there exists {ε_k} such that lim_{k→∞} ε_k = 0 and

    f(x_k) − f* ≤ L(1 − √(αµ) + ε_k)^{2k} ‖x_0 − x*‖².

Our analysis builds on the reformulation of a first-order optimization algorithm as a linear dynamical system.
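Lemma 2.1 is easy to sanity-check numerically: on a small diagonal quadratic, the empirical per-iteration contraction of the deterministic iterations approaches ρ = 1 − √(αµ). The following sketch is our own construction, not from the paper:

```python
import numpy as np

# Deterministic ASG (sigma = 0) on f(x) = 0.5 * x^T Q x with Q = diag(mu, L)
mu, L = 1.0, 100.0
alpha = 1.0 / L
beta = (1 - np.sqrt(alpha * mu)) / (1 + np.sqrt(alpha * mu))
Q = np.diag([mu, L])

x_prev = x = np.array([1.0, 1.0])
errs = []
for k in range(200):
    y = (1 + beta) * x - beta * x_prev
    x_prev, x = x, y - alpha * (Q @ y)
    errs.append(np.linalg.norm(x))          # distance to the minimizer x* = 0

# Geometric-mean contraction factor over iterations 100..200
rho_hat = (errs[199] / errs[99]) ** (1.0 / 100)
rho = 1 - np.sqrt(alpha * mu)               # Lemma 2.1 rate; equals 0.9 here
```

With these parameters `rho_hat` comes out within about one percent of ρ = 0.9, and the gap shrinks as the averaging window moves to larger k, consistent with the rate in Lemma 2.1 being asymptotic.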
Following [24, 19], we write the ASG iterations as

    ξ_{k+1} = A ξ_k + B ∇̃f(y_k, w_k),   y_k = C ξ_k,    (8)

where ξ_k := [x_kᵀ, x_{k−1}ᵀ]ᵀ ∈ R^{2d} is the state vector and A, B and C are matrices of appropriate dimensions, defined as the Kronecker products A = Ã ⊗ I_d, B = B̃ ⊗ I_d and C = C̃ ⊗ I_d with

    Ã = [[1 + β, −β], [1, 0]],   B̃ = [[−α], [0]],   C̃ = [1 + β, −β].    (9)

We can also relate the state ξ_k to the iterate x_k in a linear fashion through the identity x_k = Tξ_k, with T := [I_d  0_d]. We study the evolution of the ASG method through the following Lyapunov function, which also arises in the study of deterministic accelerated gradient methods:

    V_P(ξ) = (ξ − ξ*)ᵀ P (ξ − ξ*) + f(Tξ) − f*,    (10)

where P is a symmetric positive semidefinite matrix. We first state the following lemma, which can be derived by adapting the proof of Proposition 4.6 in [2] to our setting with a less restrictive noise assumption compared to the additive noise model of [2]. Its proof can be found in Appendix B.

Lemma 2.2. Let f ∈ S_{µ,L}(R^d). Consider the ASG iterations given by (7).
Assume there exist ρ ∈ (0, 1) and P̃ ∈ S²_+, possibly depending on ρ, such that

    ρ² X̃₁ + (1 − ρ²) X̃₂ ⪰ [[ÃᵀP̃Ã − ρ²P̃, ÃᵀP̃B̃], [B̃ᵀP̃Ã, B̃ᵀP̃B̃]],    (11)

where

    X̃₁ = (1/2) [[β²µ, −β²µ, −β], [−β²µ, β²µ, β], [−β, β, α(2 − Lα)]],
    X̃₂ = (1/2) [[(1 + β)²µ, −β(1 + β)µ, −(1 + β)], [−β(1 + β)µ, β²µ, β], [−(1 + β), β, α(2 − Lα)]].

Let P = P̃ ⊗ I_d. Then, for every k ≥ 0,

    E[V_P(ξ_{k+1})] ≤ ρ² E[V_P(ξ_k)] + σ²α²(P̃_{1,1} + L/2).    (12)

We use this lemma to derive the following theorem, which characterizes the behavior of the ASG method when α ∈ (0, 1/L] and β = (1 − √(αµ))/(1 + √(αµ)) (see the proof in Appendix C).

Theorem 2.3. Let f ∈ S_{µ,L}(R^d). Consider the ASG iterations given in (7) with α ∈ (0, 1/L] and β = (1 − √(αµ))/(1 + √(αµ)).
Then,

    E[V_{P_α}(ξ_{k+1})] ≤ (1 − √(αµ)) E[V_{P_α}(ξ_k)] + (σ²α/2)(1 + αL)    (13)

for every k ≥ 0, where P_α = P̃_α ⊗ I_d with P̃_α = v_α v_αᵀ and v_α := [√(1/(2α)), √(µ/2) − √(1/(2α))]ᵀ.

This result relies on the special structure of P_α, which will also be key for our analysis in Section 3.

3 A class of multistage ASG algorithms

In this section, we introduce a class of multistage ASG algorithms, presented in Algorithm 1, which we denote by M-ASG. The main idea is to run ASG with properly chosen parameters (α_k, β_k) at each stage k ∈ {1, …, K}, for K ≥ 2 stages.

Algorithm 1: Multistage Accelerated Stochastic Gradient Algorithm (M-ASG)
1: Set n_0 = −1;
2: for k = 1; k ≤ K; k = k + 1 do
3:   Set x^k_0 = x^k_1 = x^{k−1}_{n_{k−1}+1};
4:   Set β_k = (1 − √(µα_k))/(1 + √(µα_k));
5:   for m = 1; m ≤ n_k; m = m + 1 do
6:     Set y^k_m = (1 + β_k) x^k_m − β_k x^k_{m−1};
7:     Set x^k_{m+1} = y^k_m − α_k ∇̃f(y^k_m, w^k_m);
8:   end
9: end

In addition, each new stage depends on the previous stage, as the first two initial iterates of the new stage are set to the last iterate of the previous stage. To analyze Algorithm 1, we first characterize the evolution of the iterates within one specific stage through the Lyapunov function in (10). The details of the proof are provided in Appendix D.

Theorem 3.1. Let f ∈ S_{µ,L}(R^d). Consider running the ASG method given in (7) for n iterations with α = c²/L and β = (1 − √(αµ))/(1 + √(αµ)) for some 0 < c ≤ 1.
Then, for P_α given in Theorem 2.3,

    E[V_{P_α}(ξ_{n+1})] ≤ exp(−nc/√κ) E[V_{P_α}(ξ_1)] + σ²c√κ/L.    (14)

Given a computational budget of n iterations, we use this result to choose a stepsize that achieves an approximately optimal decay in the variance term, which yields the following corollary for the M-ASG algorithm with K = 1 stage; its proof can be found in Appendix E.

Corollary 3.2. Let f ∈ S_{µ,L}(R^d). Consider running M-ASG, i.e., Algorithm 1, for only one stage with n_1 = n iterations and stepsize α_1 = (p√κ log n / n)² (1/L) for some scalar p ≥ 1. Then,

    E[f(x^1_{n+1})] − f* ≤ (2/n^p)(f(x^0_0) − f*) + pσ² log n/(nµ),    (15)

provided that n ≥ p√κ max{2 log(p√κ), e}.

For the subsequent analysis, given K ≥ 1, for all 1 ≤ k ≤ K we define the state vector ξ^k_i = [x^k_iᵀ, x^k_{i−1}ᵀ]ᵀ for 1 ≤ i ≤ n_k + 1 (recall that x^k_0 = x^k_1 = x^{k−1}_{n_{k−1}+1}), where K is the number of stages. We analyze the performance of each stage with respect to a stage-dependent Lyapunov function V_{P_{α_k}}. The following lemma relates the performance bounds with respect to consecutive choices of Lyapunov functions, building on our specific restarting mechanism (the proof can be found in Appendix F).

Lemma 3.3. Let f ∈ S_{µ,L}(R^d). Consider M-ASG, i.e., Algorithm 1.
Then, for every 1 ≤ k ≤ K − 1,

    E[V_{P_{α_{k+1}}}(ξ^{k+1}_1)] ≤ 2 E[V_{P_{α_k}}(ξ^k_{n_k+1})].    (16)

Now we are ready to state and prove the main result of the paper (see the proof in Appendix G):

Theorem 3.4. Let f ∈ S_{µ,L}(R^d). Consider running M-ASG, i.e., Algorithm 1, with some n_1 ≥ 1 and α_1 = 1/L, and fixing n_k = 2^k ⌈√κ log(2^{p+2})⌉ and α_k = 1/(2^{2k}L) for any k ≥ 2 and p ≥ 1. The last iterate of each stage, i.e., x^k_{n_k+1}, satisfies the following bound for all k ≥ 1:

    E[f(x^k_{n_k+1})] − f* ≤ (2/2^{(p+1)(k−1)}) exp(−n_1/√κ)(f(x^0_0) − f*) + σ²√κ/(L 2^{k−1}).    (17)

⁵Note that although this rate is asymptotic, it is smaller than the non-asymptotic rate that we provide for general strongly convex functions in Theorem 2.3, as there ρ = √(1 − √(αµ)).

We next define N_K(p, n_1) as the number of iterations needed to run M-ASG for K ≥ 1 stages, i.e., N_K(p, n_1) := Σ_{k=1}^K n_k. Note that for K ≥ 2 and with the parameters given in Theorem 3.4,

    N_K(p, n_1) = n_1 + (2^{K+1} − 4)⌈√κ log(2^{p+2})⌉.    (18)

We define the sequence {x_n}_{n ∈ Z_+} such that x_n is the iterate generated by the M-ASG algorithm at the end of n gradient steps for n ≥ 0, i.e., x_0 = x^0_0, x_n = x^1_{n+1} for 1 ≤ n ≤ n_1, and for n > n_1 we set x_n = x^k_m, where k = ⌈log_2((n − n_1)/⌈√κ log(2^{p+2})⌉ + 4) − 1⌉ and m = n − N_{k−1}(p, n_1).

Remark 3.5. In the absence of noise, i.e., σ = 0, the result of Theorem 3.4 recovers the linear convergence rate of deterministic gradient methods as its special case.
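For concreteness, the stage structure of Theorem 3.4 (α_1 = 1/L for n_1 iterations, then n_k = 2^k⌈√κ log 2^{p+2}⌉ with α_k = 1/(2^{2k}L) for k ≥ 2, each stage restarted from the last iterate of the previous one) can be sketched as follows; this is our own illustrative implementation, and the function and variable names are ours:

```python
import math
import numpy as np

def asg_stage(noisy_grad, x_init, alpha, mu, n_iters):
    # One stage of Algorithm 1: both initial iterates equal x_init (the restart)
    beta = (1 - math.sqrt(alpha * mu)) / (1 + math.sqrt(alpha * mu))
    x_prev = x = np.asarray(x_init, dtype=float)
    for _ in range(n_iters):
        y = (1 + beta) * x - beta * x_prev
        x_prev, x = x, y - alpha * noisy_grad(y)
    return x

def m_asg(noisy_grad, x0, mu, L, n1, p=1, K=5):
    kappa = L / mu
    base = math.ceil(math.sqrt(kappa) * math.log(2 ** (p + 2)))
    x = np.asarray(x0, dtype=float)
    for k in range(1, K + 1):
        alpha = 1.0 / L if k == 1 else 1.0 / (2 ** (2 * k) * L)   # shrinking stepsizes
        nk = n1 if k == 1 else 2 ** k * base                      # growing stage lengths
        x = asg_stage(noisy_grad, x, alpha, mu, nk)
    return x
```

With σ = 0 the first stage already contracts at the exponential rate of Remark 3.5; with noise, the shrinking stepsizes α_k drive the decay of the variance term in (17).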
Indeed, running M-ASG for only one stage with n iterations, i.e., K = 1 and n_1 = n, guarantees that E[f(x_n)] − f* ≤ 2 exp(−n/√κ)(f(x^0_0) − f*) for all n ≥ 1.

The next theorem characterizes the behavior of M-ASG after running it for n iterations with the parameters in the preceding theorem; its proof is provided in Appendix H.

Theorem 3.6. Let f ∈ S_{µ,L}(R^d). Consider running Algorithm 1 for n iterations with the parameters given in Theorem 3.4 and n_1 < n. Then the error is bounded by

    E[f(x_n)] − f* ≤ O(1) ( ((8(p + 1)√κ log 2)^{p+1}/(n − n_1)^{p+1}) exp(−n_1/√κ)(f(x^0_0) − f*) + (p + 1)σ²/((n − n_1)µ) ).    (19)

Corollary 3.7. Under the premise of Theorem 3.6, choosing n_1 = ⌈(p + 1)√κ log(12(p + 1)κ)⌉, the suboptimality error of M-ASG after n ≥ 2n_1 iterations admits

    E[f(x_n)] − f* ≤ O(1) ( (1/n^{p+1})(f(x^0_0) − f*) + (p + 1)σ²/(nµ) ).

Theorem 3.6 immediately yields the result in Corollary 3.7 (suboptimal with respect to the dependence on the initial optimality gap); see Appendix I for the proof. Similar rate results have also been obtained by the AC-SA [16] and ORDA [7] algorithms.

We continue this section by pointing out some important special cases of our result. We first show in the next corollary how our algorithm is universally optimal and capable of achieving the lower bounds (5) and (6) simultaneously. The proof follows from (19) and n − n_1 ≥ n/2 ≥ √κ.

Corollary 3.8. Under the premise of Theorem 3.6, consider a computational budget of n ≥ 2√κ iterations.
By setting n_1 = n/C for some positive constant C ≥ 2, we obtain a bound matching the lower bounds in (5) and (6), i.e.,

    E[f(x_n)] − f* ≤ O(1) ( exp(−n/(C√κ))(f(x^0_0) − f*) + σ²/(nµ) ).

We note that achieving the lower bound through the M-ASG algorithm requires knowledge or estimation of the strong convexity constant µ. In some applications, µ may not be known a priori. However, for regularized risk minimization problems, the regularization parameter is known and it determines the strong convexity constant. It is also worth noting that, even in the deterministic case, [1] has shown that for a wide class of algorithms including ASG, it is not possible to obtain the lower bound (5) without knowing the strong convexity parameter. In addition, in Appendix L, we show how our framework can be extended to obtain nearly optimal results in the merely convex setting, i.e., when µ = 0. Finally, note that the Lipschitz constant L can be estimated from data using standard line-search techniques in practice; see [5] and [32, Alg. 2].

The lower bound can also be stated as the minimum number of iterations needed to find an ε-solution, i.e., to find x_ε such that E[f(x_ε)] − f* ≤ ε, for any given ε > 0. In the following corollary, with the additional assumption of knowing a bound ∆ on the initial optimality gap f(x^0_0) − f*, we state this version of the lower bound. The proof is provided in Appendix J.

Corollary 3.9. Let f ∈ S_{µ,L}(R^d).
Given ∆ ≥ f(x^0_0) − f*, for any ε ∈ (0, ∆), running M-ASG, Algorithm 1, with the parameters given in Theorem 3.4, p = 1, and n_1 = ⌈√κ log(4∆/ε)⌉, one can compute an ε-solution within n_ε iterations in total, where

    n_ε = ⌈√κ log(4∆/ε)⌉ + ⌈16(1 + log(8)) σ²/(µε)⌉.    (20)

Recall that we presented a comparison with other state-of-the-art algorithms in Table 1. In particular, this table shows that Multistage AC-SA [17] and Multistage ORDA [7] also achieve the lower bounds provided that the noise parameters are known; note that we do not make this extra assumption for M-ASG. It is also worth noting that the idea of restarting, which plays a key role in achieving the lower bounds, has been studied before in the context of deterministic accelerated methods [29, 39]. However, a naive extension of these restart methods to the stochastic setting leads to a two-stage algorithm that switches from a constant stepsize to a diminishing stepsize when the variance term dominates the bias term. Implementing this technique requires knowledge of σ² and the optimality gap to tune the algorithm to achieve the optimal rates in both the bias and variance terms. M-ASG, on the other hand, achieves the optimal rates using a specific multistage scheme that does not require knowledge of the parameter σ². In the supplementary material, we also discuss how M-ASG is related to the AC-SA and Multistage AC-SA algorithms proposed in [16, 17].

4 M-ASG*: An improved bias-variance trade-off

In Section 3, we described a universal algorithm that requires knowledge of neither the initial suboptimality gap ∆ nor the noise magnitude σ² to operate.
However, as we will argue in this section, our framework is flexible in the sense that additional information about the magnitude of ∆ or σ² can be leveraged to improve practical performance. We first note that several algorithms in the literature assume that an upper bound on ∆ is known or can be estimated, as summarized in Table 1. This assumption is reasonable in a variety of applications where there is a natural lower bound on f. For example, in supervised learning scenarios such as support vector machines, regression, or logistic regression, the loss function f takes non-negative values [35]. Similarly, the noise level σ² may be known or estimated; e.g., in private risk minimization [4], the noise is added by the user to ensure privacy and is therefore a known quantity.

There is a natural, well-known trade-off between constant and decaying stepsizes (decaying with the number of iterations n) in stochastic gradient algorithms. Since the noise is multiplied by the stepsize, a stepsize that decays with the number of iterations n leads to a decay in the variance term; however, it also slows down the decay of the bias term, which is controlled essentially by the behavior of the underlying deterministic accelerated gradient (AG) algorithm, and the bias term decays fastest with a constant stepsize (note that when σ = 0, the bias term gives the known performance bounds for the AG algorithm). The main idea behind the M-ASG algorithm, which allows it to achieve the lower bounds, is to exploit this trade-off by deciding on the right time, n_1, to switch to decaying stepsizes, i.e., the point at which the bias term is sufficiently small that the variance term dominates and should be handled with decaying stepsizes. This insight is visible from the results of Theorem 3.4, which gives further insight into the choice of the stepsize at every stage needed to achieve the lower bounds.
Theorem 3.4 shows that if M-ASG is run with a constant stepsize α_1 = 1/L in the first stage, then the variance term admits the bound σ²√κ/L, which does not decay with the number of iterations n_1 in the first stage. However, in later stages, when n > n_1, the stepsize α_k decreases as the number of iterations grows, and this results in a decay of the variance term. Overall, the choice of the length of the first stage, n_1, has a major impact in practice, which we will highlight in our numerical experiments.

If an estimate of ∆ or σ² is known, it is desirable to choose n_1 as small as possible while ensuring that the bias term becomes smaller than the variance term at the end of the first stage. More specifically, applying Theorem 3.1 with c = 1, one can choose n_1 to balance the variance term σ²√κ/L and the bias term exp(−n_1/√κ) E[V_{P_{α_1}}(ξ^1_1)]. The term E[V_{P_{α_1}}(ξ^1_1)], as shown in the proof of Lemma 3.3, can be bounded by E[V_{P_{α_1}}(ξ^1_1)] = µ‖x^0_0 − x*‖²/2 + f(x^0_0) − f* ≤ 2(f(x^0_0) − f*). Therefore, given an estimated upper bound ∆, n_1 can be set to the smallest number such that 2∆ exp(−n_1/√κ) ≤ σ²√κ/L, i.e.,

    n_1 = ⌈√κ log(2L∆/(σ²√κ))⌉.    (21)

This result allows one to fine-tune, within our framework, the switching point for starting to use decaying stepsizes as a function of σ² and ∆. In scenarios where the noise level σ is small or the initial gap ∆ is large, n_1 is chosen large enough to guarantee a fast decay in the bias term.
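The switching point (21) is straightforward to compute once estimates of σ² and ∆ are available; the following small helper is our own sketch, with names of our choosing:

```python
import math

def first_stage_length(mu, L, sigma2, Delta):
    """Smallest n1 (up to rounding) with 2*Delta*exp(-n1/sqrt(kappa)) <= sigma2*sqrt(kappa)/L,
    i.e., n1 = ceil(sqrt(kappa) * log(2*L*Delta / (sigma2*sqrt(kappa)))), as in (21)."""
    kappa = L / mu
    arg = 2 * L * Delta / (sigma2 * math.sqrt(kappa))
    return max(1, math.ceil(math.sqrt(kappa) * math.log(arg)))
```

For example, with µ = 1, L = 100, σ² = 10⁻², and ∆ = 10 this gives n_1 = 100; a smaller σ² or a larger ∆ lengthens the first, constant-stepsize stage, exactly as discussed above.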
We would like to emphasize that this modified M-ASG algorithm only requires knowledge of σ and ∆ for selecting n_1; the rest of the parameters can be chosen as in Theorem 3.4 and are independent of both σ and ∆. Finally, the following theorem provides the theoretical guarantee of our framework for this choice of n_1. The proof is omitted, as it is similar to the proofs of Theorems 3.4 and 3.6.

Theorem 4.1. Let f ∈ S_{µ,L}(R^d). Consider running Algorithm 1 for n iterations with the parameters given in Theorem 3.4, p = 1, and n_1 set as in (21). Then the expected suboptimality in function values admits the bound E[f(x_n)] − f* ≤ 36(1 + log(8)) σ²/((n − n_1)µ) for all n ≥ n_1.

Figure 1: Comparison on a quadratic function for n = 1000 iterations with different levels of noise (fig1: σ²_n = 10⁻²; fig2: σ²_n = 10⁻⁴; fig3: σ²_n = 10⁻⁶).

Figure 2: Comparison on a quadratic function for n = 10000 iterations with different levels of noise (fig1: σ²_n = 10⁻²; fig2: σ²_n = 10⁻⁴; fig3: σ²_n = 10⁻⁶).

5 Numerical experiments

In this section, we demonstrate the numerical performance of Algorithm 1 with the parameters specified by Corollary 3.7 (M-ASG) and Theorem 4.1 (M-ASG*) and compare it with other methods from the literature. In our first experiment, we consider the strongly convex quadratic objective f(x) = (1/2)xᵀQx − bᵀx + λ‖x‖², where Q is the Laplacian of a cycle graph⁶, b is a random vector, and λ = 0.01 is a regularization parameter. We assume the gradients ∇f(x) are corrupted by additive Gaussian noise N(0, σ²_n) with σ²_n ∈ {10⁻⁶, 10⁻⁴, 10⁻²}.
We note that this example has been previously considered in the literature as a problem instance where Standard ASG (ASG iterations with the standard choice of parameters α = 1/L and β = (√κ − 1)/(√κ + 1)) performs badly compared to Standard GD (gradient descent with the standard choice of stepsize α = 1/L) [18]. In Figures 1 and 2, we compare M-ASG and M-ASG* with Standard GD, Standard AG, µAGD+ [8], and Multistage AC-SA [17]. We consider dimension d = 100 and initialize all the methods from x_0^0 = 0. We run the algorithms Multistage AC-SA and M-ASG* having access to the same estimate of Δ. Figures 1-2 show the average performance of all the algorithms along with the 95% confidence interval over 50 sample runs, for a total number of iterations n = 1000 and n = 10000, respectively, as the noise level σ² is varied. The simulation results reveal that both M-ASG and M-ASG* typically exhibit a faster decay of the error in the beginning and outperform the other algorithms in general when the number of iterations is small to moderate. In this case, the speed-up obtained by M-ASG and M-ASG* is more prominent if the noise level σ² is smaller. However, as the number of iterations grows, the performance of the algorithms becomes similar as the variance term dominates.

⁶All diagonal entries of Q are 2, Q_{i,j} = −1 if |i − j| ≡ 1 (mod d), and the remaining entries are zero.

Figure 3: Comparison on logistic regression with n = 10000 iterations and with different batch sizes (panels: b = 50, 100, 500).
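For reference, the Standard ASG baseline mentioned above (α = 1/L, β = (√κ − 1)/(√κ + 1)) can be sketched as follows. This is our reconstruction of the textbook Nesterov iteration with a generic (possibly noisy) gradient oracle, not the authors' implementation:

```python
import numpy as np

def standard_asg(grad, x0, L, kappa, n):
    """Standard ASG baseline: y_k = x_k + beta*(x_k - x_{k-1}),
    x_{k+1} = y_k - alpha*grad(y_k), with alpha = 1/L and
    beta = (sqrt(kappa) - 1)/(sqrt(kappa) + 1).
    `grad` may return noisy gradient estimates."""
    alpha = 1.0 / L
    beta = (np.sqrt(kappa) - 1.0) / (np.sqrt(kappa) + 1.0)
    x_prev = x = np.asarray(x0, dtype=float)
    for _ in range(n):
        y = x + beta * (x - x_prev)      # momentum extrapolation
        x_prev, x = x, y - alpha * grad(y)  # gradient step at the extrapolated point
    return x
```

With exact gradients this iteration converges linearly on strongly convex quadratics; the observation in [18] is that its error curve can behave poorly relative to gradient descent on the cycle-graph example once noise enters.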
In addition, we would like to highlight that when the noise is small, using n₁ as suggested in (21), M-ASG* runs stage one longer than M-ASG; hence, it enjoys the linear rate of decay for more iterations before the variance term becomes dominant.

For the second set of experiments, we consider a regularized logistic regression problem for binary classification. In particular, we read 10000 images from the MNIST [23] data-set, and our goal is to distinguish the image of digit zero from that of digit eight.⁷ The number of samples is N = 1945, and the size of each image is 20 by 20 after removing the margins (hence d = 400 after vectorizing the images). At each iteration, we randomly choose a batch of b images to compute an estimate of the gradient.⁸ We choose the regularization parameter equal to 1/√N, following the standard practice (see e.g. [34]). In Figure 3, we compare M-ASG with Standard GD, Standard AG, µAGD+ [8], and AC-SA [17] for b ∈ {50, 100, 500}. The batch size controls the noise level, with larger batches leading to smaller σ. We run each of these algorithms 50 times, and plot their average performance and 95% confidence intervals. It can be seen that M-ASG usually starts faster, and achieves the asymptotic rate of the other algorithms for all batch sizes.

6 Conclusion

In this work, we considered strongly convex smooth optimization problems where we have access to noisy estimates of the gradients. We proposed a multistage method that adapts the parameters of Nesterov's accelerated gradient method at each stage to achieve the optimal rate. Our method is universal in the sense that it does not require knowledge of the noise characteristics to operate, and it achieves the optimal rate both in the deterministic and stochastic settings.
We provided numerical experiments that compare our method with existing approaches in the literature, illustrating that our method performs well in practice.

Acknowledgements

The work of Necdet Serhat Aybat is partially supported by NSF Grant CMMI-1635106. Alireza Fallah is partially supported by a Siebel Scholarship. Mert Gürbüzbalaban acknowledges support from the grants NSF DMS-1723085 and NSF CCF-1814888.

⁷We provide an experiment with synthetic data for logistic loss in Appendix N.
⁸This is an unbiased estimate of the gradient with finite but unknown variance, and therefore we do not use M-ASG* or other algorithms that need the knowledge of the variance.

References

[1] Yossi Arjevani and Ohad Shamir. On the iteration complexity of oblivious first-order optimization algorithms. In Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 908–916, New York, NY, USA, 2016. PMLR.

[2] Necdet Serhat Aybat, Alireza Fallah, Mert Gurbuzbalaban, and Asuman Ozdaglar. Robust accelerated gradient methods for smooth strongly convex functions. arXiv preprint arXiv:1805.10579, 2018.

[3] Francis Bach and Eric Moulines. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Neural Information Processing Systems (NIPS), Spain, 2011.

[4] R. Bassily, A. Smith, and A. Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In 2014 IEEE 55th Annual Symposium on Foundations of Computer Science (FOCS), pages 464–473. IEEE, 2014.

[5] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

[6] Sébastien Bubeck et al. Convex optimization: Algorithms and complexity.
Foundations and Trends in Machine Learning, 8(3-4):231–357, 2015.

[7] Xi Chen, Qihang Lin, and Javier Pena. Optimal regularized dual averaging methods for stochastic optimization. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 395–403. Curran Associates, Inc., 2012.

[8] Michael Cohen, Jelena Diakonikolas, and Lorenzo Orecchia. On acceleration with noise-corrupted gradients. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1019–1028, Stockholm, Sweden, 2018. PMLR.

[9] A. d'Aspremont. Smooth optimization with approximate gradient. SIAM Journal on Optimization, 19(3):1171–1183, 2008.

[10] E. de Klerk. Aspects of Semidefinite Programming: Interior Point Algorithms and Selected Applications, volume 65. Springer Science & Business Media, 2002.

[11] O. Devolder, F. Glineur, and Y. Nesterov. First-order methods of smooth convex optimization with inexact oracle. Mathematical Programming, 146(1-2):37–75, 2014.

[12] Aymeric Dieuleveut, Nicolas Flammarion, and Francis Bach. Harder, better, faster, stronger convergence rates for least-squares regression. The Journal of Machine Learning Research, 18(1):3520–3570, 2017.

[13] N. Flammarion and F. Bach. From averaging to acceleration, there is only a step-size. In Conference on Learning Theory, pages 658–695, 2015.

[14] X. Gao, M. Gürbüzbalaban, and L. Zhu. Global convergence of stochastic gradient Hamiltonian Monte Carlo for non-convex stochastic optimization: Non-asymptotic performance bounds and momentum-based acceleration. arXiv e-prints, September 2018.

[15] Xuefeng Gao, Mert Gurbuzbalaban, and Lingjiong Zhu.
Breaking reversibility accelerates Langevin dynamics for global non-convex optimization. arXiv preprint arXiv:1812.07725, 2018.

[16] Saeed Ghadimi and Guanghui Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization I: A generic algorithmic framework. SIAM Journal on Optimization, 22(4):1469–1492, 2012.

[17] Saeed Ghadimi and Guanghui Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, II: Shrinking procedures and optimal algorithms. SIAM Journal on Optimization, 23(4):2061–2089, 2013.

[18] M. Hardt. Robustness versus acceleration. http://blog.mrtz.org/2014/08/18/robustness-versus-acceleration.html, August 2014.

[19] Bin Hu and Laurent Lessard. Dissipativity theory for Nesterov's accelerated method. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1549–1557, Sydney, Australia, 2017. PMLR.

[20] Chonghai Hu, Weike Pan, and James T. Kwok. Accelerated gradient methods for stochastic optimization and online learning. In Advances in Neural Information Processing Systems 22, pages 781–789. Curran Associates, Inc., 2009.

[21] Prateek Jain, Sham M. Kakade, Rahul Kidambi, Praneeth Netrapalli, and Aaron Sidford. Accelerating stochastic gradient descent for least squares regression. In Proceedings of the 31st Conference On Learning Theory, volume 75 of Proceedings of Machine Learning Research, pages 545–604. PMLR, 2018.

[22] Guanghui Lan. An optimal method for stochastic composite optimization. Mathematical Programming, 133(1):365–397, 2012.

[23] Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.

[24] Laurent Lessard, Benjamin Recht, and Andrew Packard.
Analysis and design of optimization algorithms via integral quadratic constraints. SIAM Journal on Optimization, 26(1):57–95, 2016.

[25] Arvind Neelakantan, Luke Vilnis, Quoc V. Le, Ilya Sutskever, Lukasz Kaiser, Karol Kurach, and James Martens. Adding gradient noise improves learning for very deep networks. arXiv preprint arXiv:1511.06807, 2015.

[26] Arkadii Semenovich Nemirovsky and David Borisovich Yudin. Problem complexity and method efficiency in optimization. Wiley, 1983.

[27] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer, 2004.

[28] Atsushi Nitanda. Stochastic proximal gradient descent with acceleration techniques. In Advances in Neural Information Processing Systems, pages 1574–1582, 2014.

[29] B. O'Donoghue and E. Candès. Adaptive restart for accelerated gradient schemes. Foundations of Computational Mathematics, 15(3):715–732, 2015.

[30] M. Raginsky, A. Rakhlin, and M. Telgarsky. Non-convex learning via stochastic gradient Langevin dynamics: a nonasymptotic analysis. arXiv preprint arXiv:1702.03849, 2017.

[31] Maxim Raginsky and Alexander Rakhlin. Information-based complexity, feedback and dynamics in convex programming. IEEE Transactions on Information Theory, 57(10):7036–7056, 2011.

[32] Mark Schmidt, Reza Babanezhad, Mohamed Ahmed, Aaron Defazio, Ann Clifton, and Anoop Sarkar. Non-uniform stochastic average gradient method for training conditional random fields. In Artificial Intelligence and Statistics, pages 819–828, 2015.

[33] Bin Shi, Simon S. Du, Michael I. Jordan, and Weijie J. Su. Understanding the acceleration phenomenon via high-resolution differential equations. arXiv preprint arXiv:1810.08907, 2018.

[34] Karthik Sridharan, Shai Shalev-Shwartz, and Nathan Srebro.
Fast rates for regularized objectives. In Advances in Neural Information Processing Systems, pages 1545–1552, 2009.

[35] Vladimir Vapnik. The Nature of Statistical Learning Theory. Springer Science & Business Media, 2013.

[36] Sharan Vaswani, Francis Bach, and Mark Schmidt. Fast and faster convergence of SGD for over-parameterized models and an accelerated perceptron. arXiv preprint arXiv:1810.07288, 2018.

[37] Hoi-To Wai, Wei Shi, Cesar A. Uribe, Angelia Nedich, and Anna Scaglione. On curvature-aided incremental aggregated gradient methods. arXiv preprint arXiv:1806.00125, 2018.

[38] Lin Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11(Oct):2543–2596, 2010.

[39] Zeyuan Allen Zhu and Lorenzo Orecchia. Linear coupling: An ultimate unification of gradient and mirror descent. In ITCS, 2017.