{"title": "Tight Dimension Independent Lower Bound on the Expected Convergence Rate for Diminishing Step Sizes in SGD", "book": "Advances in Neural Information Processing Systems", "page_first": 3665, "page_last": 3674, "abstract": "We study the convergence of Stochastic Gradient Descent (SGD) for strongly convex objective functions. We prove for all $t$ a lower bound on the expected convergence rate after the $t$-th SGD iteration; the lower bound is over all possible sequences of diminishing step sizes. It implies that recently proposed sequences of step sizes at ICML 2018 and ICML 2019 are {\\em universally} close to optimal in that the expected convergence rate after {\\em each} iteration is within a factor $32$ of our lower bound. This factor is independent of dimension $d$. We offer a framework for comparing with lower bounds in state-of-the-art literature and when applied to SGD for strongly convex objective functions our lower bound is a significant factor $775\\cdot d$ larger compared to existing work.", "full_text": "Tight Dimension Independent Lower Bound on the Expected Convergence Rate for Diminishing Step Sizes in SGD\n\nPhuong Ha Nguyen\nElectrical and Computer Engineering\nUniversity of Connecticut, USA\nphuongha.ntu@gmail.com\n\nLam M. Nguyen\nIBM Research, Thomas J. Watson Research Center\nYorktown Heights, USA\nLamNguyen.MLTD@ibm.com\n\nMarten van Dijk\nElectrical and Computer Engineering\nUniversity of Connecticut, USA\nmarten.van_dijk@uconn.edu\n\nAbstract\n\nWe study the convergence of Stochastic Gradient Descent (SGD) for strongly\nconvex objective functions. We prove for all t a lower bound on the expected\nconvergence rate after the t-th SGD iteration; the lower bound is over all possible\nsequences of diminishing step sizes.\n
It implies that recently proposed sequences\nof step sizes at ICML 2018 and ICML 2019 are universally close to optimal in\nthat the expected convergence rate after each iteration is within a factor 32 of our\nlower bound. This factor is independent of dimension d. We offer a framework\nfor comparing with lower bounds in state-of-the-art literature and when applied to\nSGD for strongly convex objective functions our lower bound is a significant factor\n775 \u00b7 d larger compared to existing work.\n\n1 Introduction\n\nWe are interested in solving the following stochastic optimization problem\n\n$$\\min_{w \\in \\mathbb{R}^d} \\{ F(w) = \\mathbb{E}[f(w; \\xi)] \\}, \\tag{1}$$\n\nwhere \u03be is a random variable obeying some distribution g(\u03be). In the case of empirical risk minimization\nwith a training set $\\{(x_i, y_i)\\}_{i=1}^n$, \u03bei is a random variable that is defined by a single random\nsample (x, y) pulled uniformly from the training set. Then, by defining fi(w) := f (w; \u03bei), empirical\nrisk minimization reduces to\n\n$$\\min_{w \\in \\mathbb{R}^d} \\left\\{ F(w) = \\frac{1}{n} \\sum_{i=1}^n f_i(w) \\right\\}. \\tag{2}$$\n\nProblems of this type arise frequently in supervised learning applications [8]. The classic first-order\nmethods to solve problem (2) are gradient descent (GD) [19] and stochastic gradient descent (SGD)\u00b9\n[21] algorithms.\n
GD is a standard deterministic gradient method, which updates iterates along the\nnegative full gradient with learning rate \u03b7t as follows\n\n$$w_{t+1} = w_t - \\eta_t \\nabla F(w_t) = w_t - \\frac{\\eta_t}{n} \\sum_{i=1}^n \\nabla f_i(w_t), \\quad t \\geq 0.$$\n\n1 We notice that even though stochastic gradient is referred to as SG in the literature, the term stochastic gradient\ndescent (SGD) has been widely used in many important works on large-scale learning.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nWe can choose \u03b7t = \u03b7 = O(1/L) and achieve a linear convergence rate for the strongly convex case\n[15]. Upper bounds on the convergence rate of GD and SGD have been studied in [2, 4, 15, 22, 17,\n16, 7].\nThe disadvantage of GD is that it requires evaluation of n derivatives at each step, which is very\nexpensive and therefore avoided in large-scale optimization. To reduce the computational cost for\nsolving (2), a class of variance reduction methods [11, 5, 9, 18] has been proposed. The difference\nbetween GD and variance reduction methods is that GD needs to compute the full gradient at each\nstep, while the variance reduction methods compute the full gradient only after a certain number of\nsteps. In this way, variance reduction methods have less computational cost compared to GD. To\navoid evaluating the full gradient at all, SGD generates a random variable \u03bet for which the stochastic\ngradient is unbiased, i.e.,\n\n$$\\mathbb{E}_{\\xi_t}[\\nabla f(w_t; \\xi_t)] = \\nabla F(w_t),$$\n\nand then evaluates gradient \u2207f (wt; \u03bet) for \u03bet drawn from a distribution g(\u03be). After this, wt is updated\nas follows\n\n$$w_{t+1} = w_t - \\eta_t \\nabla f(w_t; \\xi_t). \\tag{3}$$\n\nWe focus on the general problem (1) where F is strongly convex.\n
Since F is strongly convex, a\nunique optimal solution of (1) exists and throughout the paper we denote this optimal solution by w\u2217\nand are interested in studying the expected convergence rate\n\n$$Y_t = \\mathbb{E}[\\|w_t - w_*\\|^2].$$\n\nAlgorithm 1 provides a detailed description of SGD. Obviously, the computational cost of a single\niteration in SGD is n times cheaper than that of a single iteration in GD. However, as has been shown\nin the literature, we need to choose \u03b7t = O(1/t) and the expected convergence rate of SGD is slowed\ndown to O(1/t) [3], which is a sublinear convergence rate.\n\nAlgorithm 1 Stochastic Gradient Descent (SGD) Method\n\nInitialize: w0\nIterate:\nfor t = 0, 1, . . . do\n\nChoose a step size (i.e., learning rate) \u03b7t > 0.\nGenerate a random variable \u03bet with probability density g(\u03bet).\nCompute a stochastic gradient \u2207f (wt; \u03bet).\nUpdate the new iterate wt+1 = wt \u2212 \u03b7t\u2207f (wt; \u03bet).\n\nend for\n\nProblem Statement and Contributions: We seek to find a tight lower bound on the expected\nconvergence rate Yt with the purpose of showing that the stepsize sequences of [17] and [7] for\nclassical SGD are optimal for \u00b5-strongly convex and L-smooth, respectively expected L-smooth,\nobjective functions within a small dimension independent constant factor. This is important for\nthe following reasons:\n\n1. The lower bound tells us that a sequence of stepsizes as a function of only \u00b5 and L cannot\nbeat an expected convergence rate of O(1/t) \u2013 this is general knowledge and was\nalready proven in [1], where a dimension dependent lower bound for a larger class of\nalgorithms that includes SGD was proven. For the class of SGD with diminishing stepsizes\nas a function of only global parameters \u00b5 and L we show a dimension independent lower\nbound which is a factor 775 \u00b7 d larger.\n\n2. We now understand to what extent the stepsize sequences of [17] and [7] are optimal\nin that they lead to minimal expected convergence rates Yt for all t: For each t we will show\na dimension independent lower bound on Yt over all possible stepsize sequences. This\nincludes the best possible stepsize sequence which minimizes Yt for a given t. Our lower\nbound achieves the upper bound on Yt for the stepsize sequences of [17] and [7] within a\nfactor 32 for all t. This implies that these stepsize sequences universally minimize each Yt\nwithin a factor 32.\n\n3. As a consequence, in order to attain a better expected convergence rate, we need to either\nassume more specific knowledge about the objective function F so that we can construct a\nbetter stepsize sequence for SGD based on this additional knowledge or we need to step\naway from SGD and use a different kind of algorithm. For example, the larger class of\nalgorithms in [1] may contain a non-SGD algorithm which may get close to the lower bound\nproved in [1] which is a factor 775 \u00b7 d smaller. Since the larger class of algorithms in [1]\ncontains algorithms such as Adam [10], AdaGrad [6], SGD-Momentum [23], RMSProp\n[24] we now know that these practical algorithms will improve by at most a factor 32 \u00b7 775 \u00b7 d\nover SGD for strongly convex optimization \u2013 this can be significant as this can lead to orders\nof magnitude fewer gradient computations. We are the first to make such a quantification.\n\nOutline: Section 2 discusses background: First, we discuss the recurrence on Yt used in [17] for\nproving their upper bound on Yt \u2013 this recurrence plays a central role in proving our lower bound.\nWe discuss the upper bounds of both [17] and [7] \u2013 the latter holding for a larger class of algorithms.\nSecond, we explain the lower bound of [1] in detail in order to be able to properly compare with our\nlower bound.\n
Section 3 introduces a framework for comparing bounds and explains the consequences\nof our lower bound in detail. Section 4 describes a class of strongly convex and smooth objective\nfunctions which is used to derive our lower bound. We also verify our theory by experiments in the\nsupplementary material. Section 5 concludes the paper.\n\n2 Background\n\nWe explain the upper bounds of [17, 7] and the lower bound of [1], respectively.\n\n2.1 Upper Bound for Strongly Convex and Smooth Objective Functions\n\nThe starting point for analysis is the recurrence first introduced in [17, 12]\n\n$$\\mathbb{E}[\\|w_{t+1} - w_*\\|^2] \\leq (1 - \\mu\\eta_t) \\mathbb{E}[\\|w_t - w_*\\|^2] + \\eta_t^2 N, \\tag{4}$$\n\nwhere\n\n$$N = 2 \\mathbb{E}[\\|\\nabla f(w_*; \\xi)\\|^2]$$\n\nand \u03b7t is upper bounded by 1/(2L); the recurrence has been shown to hold, see [17, 12], if we assume\n\n1. F (.) is \u00b5-strongly convex,\n2. f (w; \u03be) is L-smooth,\n3. f (w; \u03be) is convex, and\n4. N is finite;\n\nwe detail these assumptions below:\nAssumption 1 (\u00b5-strongly convex). The objective function F : Rd \u2192 R is \u00b5-strongly convex, i.e.,\nthere exists a constant \u00b5 > 0 such that \u2200w, w\u2032 \u2208 Rd,\n\n$$F(w) - F(w') \\geq \\langle \\nabla F(w'), (w - w') \\rangle + \\frac{\\mu}{2} \\|w - w'\\|^2. \\tag{5}$$\n\nAssumption 2 (L-smooth). f (w; \u03be) is L-smooth for every realization of \u03be, i.e., there exists a constant\nL > 0 such that, \u2200w, w\u2032 \u2208 Rd,\n\n$$\\|\\nabla f(w; \\xi) - \\nabla f(w'; \\xi)\\| \\leq L \\|w - w'\\|. \\tag{6}$$\n\nAssumption 2 implies that F is also L-smooth.\nAssumption 3. f (w; \u03be) is convex for every realization of \u03be, i.e., \u2200w, w\u2032 \u2208 Rd,\n\n$$f(w; \\xi) - f(w'; \\xi) \\geq \\langle \\nabla f(w'; \\xi), (w - w') \\rangle.$$\n\nAssumption 4. N = 2E[\u2016\u2207f (w\u2217; \u03be)\u2016\u00b2] is finite.\n\nWe denote the set of strongly convex objective functions by Fstr and denote the subset of Fstr\nsatisfying Assumptions 1, 2, 3, and 4 by Fsm.\nWe notice that the earlier established recurrence in [13] under the same set of assumptions\n\n$$\\mathbb{E}[\\|w_{t+1} - w_*\\|^2] \\leq (1 - 2\\mu\\eta_t + 2L^2\\eta_t^2) \\mathbb{E}[\\|w_t - w_*\\|^2] + \\eta_t^2 N$$\n\nis similar, but worse than (4) as it only holds for $\\eta_t < \\mu / L^2$ where (4) holds for $\\eta_t \\leq 1/(2L)$. Only for step\nsizes $\\eta_t < \\mu / (2L^2)$ the above recurrence provides a better bound than (4), i.e., $1 - 2\\mu\\eta_t + 2L^2\\eta_t^2 \\leq 1 - \\mu\\eta_t$.\nIn practical settings such as logistic regression \u00b5 = O(1/n), L = O(1), and t = O(n) (i.e., t is\nat most a relatively small constant number of epochs, where a single epoch represents n iterations\nresembling the complexity of a single GD computation). For this parameter setting\nthe optimally chosen step sizes, see (8) below, are $\\gg \\mu / L^2$.\n
This is the reason we focus in this paper on analyzing\nrecurrence (4) in order to prove our lower bound: For $\\eta_t \\leq 1/(2L)$,\n\n$$Y_{t+1} \\leq (1 - \\mu\\eta_t) Y_t + \\eta_t^2 N, \\tag{7}$$\n\nwhere Yt = E[\u2016wt \u2212 w\u2217\u2016\u00b2].\nBased on the above assumptions (without the so-called bounded gradient assumption) and knowledge\nof only \u00b5 and L a sequence of step sizes \u03b7t can be constructed such that Yt is O(1/t) [17];\nmore explicitly, for the sequence of step sizes\n\n$$\\eta_t = \\frac{2}{\\mu t + 4L} \\tag{8}$$\n\nwe have for all objective functions in Fsm the upper bound\n\n$$Y_t \\leq \\frac{16N}{\\mu} \\cdot \\frac{1}{\\mu(t - T') + 4L} = \\frac{16N}{\\mu^2 t} (1 + O(1/t)), \\tag{9}$$\n\nwhere\n\n$$t \\geq T' = \\frac{4L}{\\mu} \\max\\left\\{ \\frac{L \\mu Y_0}{N}, 1 \\right\\} - \\frac{4L}{\\mu}.$$\n\nWe notice that [7] studies the larger class, which we denote Fesm, which is defined as Fsm where\nexpected smoothness is assumed instead of smoothness and convexity of component functions. We\nrephrase their assumption for classical SGD as studied in this paper.\u00b2\nAssumption 5. (L-smooth in expectation) The objective function F : Rd \u2192 R is L-smooth in\nexpectation if there exists a constant L > 0 such that, \u2200w \u2208 Rd,\n\n$$\\mathbb{E}[\\|\\nabla f(w; \\xi) - \\nabla f(w_*; \\xi)\\|^2] \\leq 2L \\|F(w) - F(w_*)\\|. \\tag{10}$$\n\nThe results in [7] assume the above assumption for empirical risk minimization (2). L-smoothness,\nsee [15], implies Lipschitz continuity (i.e., \u2200w, w\u2032 \u2208 Rd,\n\n$$f(w, \\xi) \\leq f(w', \\xi) + \\langle \\nabla f(w', \\xi), (w - w') \\rangle + \\frac{L}{2} \\|w - w'\\|^2$$\n\n) and together with Proposition A.1 in [7] this implies L-smoothness in expectation.\n
This shows that\nFesm defined by Assumptions 1, 4, and 5 is indeed a superset of Fsm.\nThe step sizes (8) from [17] for Fsm \u2286 Fesm and\n\n$$\\eta_t = \\frac{1}{2L} \\text{ for } t \\leq \\frac{4L}{\\mu} \\quad \\text{and} \\quad \\eta_t = \\frac{2t + 1}{(t + 1)^2 \\mu} \\text{ for } t > \\frac{4L}{\\mu} \\tag{11}$$\n\ndeveloped for Fesm in [7] and [17] are equivalent in that they are both \u2248 2/(\u00b5t) for t large enough. Both\nstep size sequences give exactly the same asymptotic upper bound (9) on Yt (in our notation).\nIn [21], the authors proved the convergence of SGD for the step size sequence {\u03b7t} satisfying\nconditions $\\sum_{t=0}^{\\infty} \\eta_t = \\infty$ and $\\sum_{t=0}^{\\infty} \\eta_t^2 < \\infty$. In [13], the authors studied the expected convergence\nrates for another class of step sizes of O(1/t^p) where 0 < p \u2264 1. However, the authors of both [21]\nand [13] do not discuss the optimal step sizes among all proposed step sizes, which is what is\ndone in this paper.\n\n2 This means that distribution D in [7] must be over unit vectors v \u2208 [0,\u221e)^n, where n is the number\nof component functions, i.e., n possible values for \u03be. Arbitrary distributions D correspond to SGD with\nmini-batches where each component function indexed by \u03be is weighted with v\u03be.\n\n2.2 Lower Bound for First Order Stochastic Oracles\n\nThe authors of [14] proposed the first formal study on lower bounding the expected convergence\nrate for a large class of algorithms which includes SGD. The authors of [1] and [20] independently\nstudied this lower bound using information theory and were able to improve it.\nThe derivation in [1] is for algorithms including SGD where the sequence of stepsizes is a-priori\nfixed based on global information regarding assumed stochastic parameters concerning the objective\nfunction F . Their proof uses the following set of assumptions: First, the assumption of a strongly\nconvex objective function, i.e., Assumption 1 (see Definition 3 in [1]).\n
Second, the objective function\nis convex Lipschitz:\nAssumption 6. (convex Lipschitz) The objective function F is a convex Lipschitz function, i.e., there\nexists a bounded convex set S \u2282 Rd and a positive number K such that \u2200w, w\u2032 \u2208 S \u2282 Rd\n\n$$\\|F(w) - F(w')\\| \\leq K \\|w - w'\\|.$$\n\nWe notice that this assumption implies the assumption on bounded gradients as stated here (and\nexplicitly mentioned in Definition 1 in [1]): There exists a bounded convex set S \u2282 Rd and a positive\nnumber \u03c3 such that\n\n$$\\mathbb{E}[\\|\\nabla f(w; \\xi)\\|^2] \\leq \\sigma^2 \\tag{12}$$\n\nfor all w \u2208 S \u2282 Rd. This is not the same as the bounded gradient assumption where S = Rd is\nunbounded.\u00b3 Clearly, for w\u2217, (12) implies a finite N \u2264 2\u03c3\u00b2.\nWe define Flip as the set of strongly convex objective functions that satisfy Assumption 6. Classes\nFesm and Flip are both subsets of Fstr and differ (are not subclasses of each other) in that they\nassume expected smoothness and convex Lipschitz respectively.\nTo prove a lower bound of Yt for Flip, the authors constructed a class of objective functions \u2286 Flip\nand showed a lower bound of Yt for this class; in terms of the notation used in this paper,\n\n$$\\frac{\\log(2/\\sqrt{e})}{432 \\cdot d} \\cdot \\frac{N}{\\mu^2 t}. \\tag{13}$$\n\nThe authors of [1] prove lower bound (13) for the class Astoch of stochastic first order algorithms\nthat can be understood as operating based on information provided by a stochastic first-order oracle,\ni.e., any algorithm which bases its computation in the t-th iteration on \u00b5, K or L, d, and access to\nan oracle that provides f (wt; \u03bet) and \u2207f (wt; \u03bet). This class includes ASGD defined as SGD with\nsome sequence of diminishing step sizes as a function of global parameters such as \u00b5 and L or \u00b5 and\nK, see Algorithm 1.\n
We notice that Astoch also includes practical algorithms such as Adam [10],\netc. We revisit their derivation in the supplementary material where we show\u2074 how their lower bound\ntransforms into (13). Notice that their lower bound depends on dimension d.\n\n3 Framework for Upper and Lower Bounds\n\nLet par(F ) denote the concrete values of the global parameters of an objective function F such\nas the values for \u00b5 and L corresponding to objective functions F in Fsm and Fesm or \u00b5 and K\ncorresponding to objective functions F in Flip. When defining a class F of objective functions,\nwe also need to explain how F defines a corresponding par(.) function. We will use the notation\nF[p] to stand for the subclass {F \u2208 F : p = par(F )} \u2286 F, i.e., the subclass of objective\nfunctions of F with the same parameters p. We assume that parameters of a class are included in\nthe parameters of a smaller subclass: For example, Fsm is a subset of the class of strongly convex\nobjective functions Fstr with only global parameter \u00b5. This means that for concrete values \u00b5 and L\nwe have Fsm[\u00b5, L] \u2286 Fstr[\u00b5].\nFor a given objective function F , we are interested in the best possible expected convergence rate\nafter the t-th iteration among all possible algorithms A in a larger class of algorithms A.\n
\n\n3 The bounded gradient assumption, where S is unbounded, is in conflict with assuming strong convexity as\nexplained in [17].\n\n4 We also discuss the underlying assumption of convex Lipschitz and show that in order for the analysis in [1]\nto follow through one \u2013 likely tedious but believable \u2013 statement still needs a formal proof.\n\nHere, we assume that A is a subclass of the larger class Astoch,U of stochastic first order algorithms where\nthe computation in the t-th iteration not only has access to par(F ) and access to an oracle that\nprovides f (wt; \u03bet) and \u2207f (wt; \u03bet) but also access to possibly another oracle U providing even more\ninformation. Notice that A \u2286 Astoch \u2286 Astoch,U for any oracle U. With respect to the expected\nconvergence rate, we want to know which algorithm A in A minimizes Yt the most. Notice that for\ndifferent t this may be a different algorithm A. We define for F \u2208 F (with associated par(.))\n\n$$\\gamma_t^F(\\mathcal{A}) = \\inf_{A \\in \\mathcal{A}} Y_t(F, A),$$\n\nwhere Yt is explicitly shown as a function of the objective function F and choice of algorithm A.\nAmong the objective functions F \u2208 F with same global parameters p = par(F ) (i.e., F \u2208 F[p]), we\nconsider the objective function F which has the worst expected convergence rate at the t-th iteration.\nThis is of interest to us because algorithms A only have access to p = par(F ) as the sole information\nabout objective function F , hence, if we prove an upper bound on the expected convergence rate for\nalgorithm A, then this upper bound must hold for all F \u2208 F with the same parameters p = par(F ).\nIn other words such an upper bound must be at least\n\n$$\\gamma_t(\\mathcal{F}[p], \\mathcal{A}) = \\sup_{F \\in \\mathcal{F}[p]} \\gamma_t^F(\\mathcal{A}) = \\sup_{F \\in \\mathcal{F}[p]} \\inf_{A \\in \\mathcal{A}} Y_t(F, A).$$\n\nSo, any lower bound on \u03b3t(F[p],A) gives us a lower bound on the best possible upper bound on Yt\nthat can be achieved.\n
Such a lower bound tells us to what extent the expected convergence rate Yt\ncannot be improved.\nThe lower bound (13) and upper bound (9) are not only a function of \u00b5 in p = par(F ) but also a\nfunction of N which is outside p = par(F ) for F \u2208 Flip or F \u2208 Fesm. We are really interested\nin such more fine-grained bounds that are a function of N. For this reason we need to consider the\nsubclass of objective functions F in F[p] that all have the same N. We implicitly understand that N\nis an auxiliary parameter of an objective function F and we denote this as a function of F as N (F ).\nWe define $\\mathcal{F}^a[p] = \\{F \\in \\mathcal{F}[p] : a = \\mathrm{aux}(F)\\}$ where aux(.) represents for example N (.). This\nleads to notation like $\\mathcal{F}^N_{lip}[\\mu, K, d]$. Notice that p = par(F ) can be used by an algorithm A \u2208 A\nwhile a = aux(F ) is not available to A through p = par(F ) (but may be available through access to\nan oracle).\nIf we find a lower bound that matches an upper bound up to a constant factor, as in this paper, then we know\nthat the algorithm that achieves the upper bound is close to optimal in that the expected convergence\nrate cannot be further minimized/improved in a significant way. In practice we are only interested\nin upper bounds on Yt that can be realized by the same algorithm A (if not, then we need to know\na-priori the exact number of iterations t we want to run an algorithm and then choose the best one\nfor that t). In this paper we consider the algorithm A for F in Fsm resp. Fesm defined as SGD with\ndiminishing step sizes (8) resp. (11) as a function of par(F ) = (\u00b5, L) giving upper bound (9) on\nexpected convergence rate Yt(F, A).\n
We show that A is close to optimal.\nGiven the above definitions we have\n\n$$\\gamma_t(\\mathcal{F}[p], \\mathcal{A}) \\leq \\gamma_t(\\mathcal{F}'[p'], \\mathcal{A}') \\tag{14}$$\n\nfor F[p] \u2286 F\u2032[p\u2032] and A\u2032 \u2286 A, i.e., the worst objective function in a larger class of objective\nfunctions is worse than the worst objective function in a smaller class of objective functions (see\nthe supremum used in defining \u03b3t) and the best algorithm from a larger class of algorithms is better\nthan the best algorithm from a smaller class of algorithms (see the infimum used in defining \u03b3t). This\nimplies\n\n$$\\gamma_t(\\mathcal{F}^N_{lip}[\\mu, K, d], \\mathcal{A}_{stoch}) \\leq \\gamma_t(\\mathcal{F}^N_{str}[\\mu], \\mathcal{A}_{SGD}), \\tag{15}$$\n\n$$\\gamma_t(\\mathcal{F}^N_{sm}[\\mu, L], \\mathcal{A}_{ExtSGD}) \\leq \\gamma_t(\\mathcal{F}^N_{esm}[\\mu, L], \\mathcal{A}_{SGD}) \\leq \\gamma_t(\\mathcal{F}^N_{str}[\\mu], \\mathcal{A}_{SGD}), \\tag{16}$$\n\nwhere ASGD \u2286 AExtSGD is defined as follows:\nIn our framework we introduce extended SGD as the class AExtSGD of SGD algorithms where the\nstepsize in the t-th iteration can be computed based on global parameters \u00b5, L, and access to an\noracle U that provides additional information N, \u2207F (wt), and Yt. This class also includes SGD with\ndiminishing stepsizes as defined in Algorithm 1, i.e., ASGD \u2286 AExtSGD. The reason for introducing\nthe larger class AExtSGD is not because it contains practical algorithms different than SGD, on the\ncontrary. The only reason is that it allows us to define one single algorithm A \u2208 AExtSGD which\nrealizes $\\gamma_t^F(\\mathcal{A}_{ExtSGD})$ for all t for all F in a to-be-constructed subclass F \u2286 Fsm \u2013 the topic of the\nnext section. This property allows a rather straightforward calculus based proof without needing to\nuse more advanced concepts from information and probability theory as required in the proof of [1].\nLooking ahead, we will prove in Theorem 1\n\n$$\\frac{1}{2} \\frac{N}{\\mu^2 t} (1 - O((\\ln t)/t)) \\leq \\gamma_t(\\mathcal{F}^N_{sm}[\\mu, L], \\mathcal{A}_{ExtSGD}). \\tag{17}$$\n\nNotice that the construction of \u03b7t for algorithms in AExtSGD does not depend on knowledge of the\nstochastic gradient \u2207f (wt; \u03bet). So, we do not consider step sizes that are adaptively computed based\non \u2207f (wt; \u03bet).\nAs a disclaimer we notice that for some objective functions $F \\in \\mathcal{F}^N_{sm}[\\mu, L]$ the expected convergence\nrate can be much better than what is stated in (17); this is because \u03b3t({F},AExtSGD) can be much\nsmaller than $\\gamma_t(\\mathcal{F}^N_{sm}[\\mu, L], \\mathcal{A}_{ExtSGD})$, see (14). This is due to the specific nature of the objective\nfunction F itself. However, without knowledge about this nature, one can only prove a general upper\nbound on the expected convergence rate Yt and any such upper bound must be at least the lower\nbound (17).\nResults (13) and (9) of the previous section combined with (15), (16), and (17) yield\n\n$$\\frac{\\log(2/\\sqrt{e})}{432 \\cdot d} \\frac{N}{\\mu^2 t} \\leq \\gamma_t(\\mathcal{F}^N_{lip}[\\mu, K, d], \\mathcal{A}_{stoch}) \\leq \\gamma_t(\\mathcal{F}^N_{str}[\\mu], \\mathcal{A}_{SGD}), \\tag{18}$$\n\n$$\\frac{1}{2} \\frac{N}{\\mu^2 t} (1 - O((\\ln t)/t)) \\leq \\gamma_t(\\mathcal{F}^N_{esm}[\\mu, L], \\mathcal{A}_{ExtSGD}) \\leq \\gamma_t(\\mathcal{F}^N_{str}[\\mu], \\mathcal{A}_{SGD}), \\tag{19}$$\n\n$$\\frac{1}{2} \\frac{N}{\\mu^2 t} (1 - O((\\ln t)/t)) \\leq \\gamma_t(\\mathcal{F}^N_{sm}[\\mu, L], \\mathcal{A}_{ExtSGD}) \\leq \\gamma_t(\\mathcal{F}^N_{esm}[\\mu, L], \\mathcal{A}_{SGD}) \\leq \\frac{16N}{\\mu^2 t} (1 + O(1/t)). \\tag{20}$$\n\nWe conclude the following observations (our contributions):\n\n1. The first inequality (18) is from [1].\n
Comparing (19) to (18) shows that as a lower bound\nfor $\\gamma_t(\\mathcal{F}^N_{str}[\\mu], \\mathcal{A}_{SGD})$ (SGD for the class of strongly convex objective functions) our lower\nbound (17) is dimension independent and improves the lower bound (13) of [1] by a factor\n775 \u00b7 d. This is a significant improvement.\n2. However, our lower bound does not hold for the larger class Astoch. This teaches us that if\nwe wish to reach smaller (better) expected convergence rates, then one approach is to step\nbeyond SGD where our lower bound does not hold, implying that within Astoch there may be\nan opportunity to find an algorithm leading to at most a factor 32 \u00b7 775 \u00b7 d smaller expected\nconvergence rate compared to upper bound (20). This is the first exact quantification of the\nextent to which a better (practical) algorithm when compared to classical SGD can be found.\nE.g., Adam [10], AdaGrad [6], SGD-Momentum [23], RMSProp [24] are all in Astoch and\ncan beat classical SGD by at most a factor 32 \u00b7 775 \u00b7 d.\n3. When searching for a better algorithm in Astoch which significantly improves over SGD,\nit does not help to take an SGD-like algorithm which uses step sizes that are a function of\niteratively computed estimates of \u2207F (wt) and Yt as this would keep such an algorithm in\nAExtSGD for which our lower bound is tight.\n4. Another approach to reach smaller expected convergence rates is to stick with SGD but\nconsider a smaller restricted class of objective functions for which more/other information\nin the form of extra global parameters is available for adaptively computing \u03b7t.\n5. For strongly convex and smooth, respectively expected smooth, objective functions the\nalgorithm A \u2208 ASGD with stepsizes $\\eta_t = \\frac{2}{\\mu t + 4L}$, respectively $\\eta_t = \\frac{2t+1}{(t+1)^2 \\mu}$ for $t > \\frac{4L}{\\mu}$\nand $\\eta_t = \\frac{1}{2L}$ for $t \\leq \\frac{4L}{\\mu}$, realizes the upper bound in (20) for all t. Inequalities (20) show\nthat this algorithm is close to optimal: For each t, the best sequence of diminishing step\nsizes which minimizes Yt can at most achieve a constant (dimension independent) factor 32\nsmaller expected convergence rate.\n\n4 Lower Bound for Extended SGD\n\nIn order to prove a lower bound we propose a specific subclass of strongly convex and smooth\nobjective functions F and we show in the extended SGD setting how, based on recurrence (7), to\ncompute the optimal step size \u03b7t as a function of \u00b5 and L and an oracle U with access to N, \u2207F (wt),\nand Yt, i.e., this step size achieves the smallest Yt+1 at the t-th iteration.\nWe consider the following class of objective functions F : We consider a multivariate normal\ndistribution of a d-dimensional random vector \u03be, i.e., \u03be \u223c N (m, \u03a3), where m = E[\u03be] and \u03a3 =\nE[(\u03be\u2212m)(\u03be\u2212m)^T] is the (symmetric positive semi-definite) covariance matrix. The density function\nof \u03be is chosen as\n\n$$g(\\xi) = \\frac{\\exp\\left(-\\frac{(\\xi - m)^T \\Sigma^{-1} (\\xi - m)}{2}\\right)}{\\sqrt{(2\\pi)^d |\\Sigma|}}.$$\n\nWe select component functions $f(w; \\xi) = s(\\xi) \\frac{\\|w - \\xi\\|^2}{2}$, where function s(\u03be) is constructed a-priori\naccording to the following random process:\n\n\u2022 With probability 1\u2212\u00b5/L, we draw s(\u03be) from the uniform distribution over interval [0, \u00b5/(1\u2212\u00b5/L)].\n\u2022 With probability \u00b5/L, we draw s(\u03be) from the uniform distribution over interval [0, L].\n\nThe following theorem analyses the sequence of optimal step sizes for our class of objective functions\nand gives a lower bound on the corresponding expected convergence rates. The theorem states that we\ncannot find a better sequence of step sizes.\n
In other words, without any additional information\nabout the objective function (beyond \u00b5, L, N, Y0, . . . , Yt for computing \u03b7t), we can at best prove a\ngeneral upper bound which is at least the lower bound as stated in the theorem. The proof of the\nlower bound is presented in the supplementary material:\nTheorem 1. We assume that component functions f (w; \u03be) are constructed according to the recipe\ndescribed above with \u00b5 < L/18. Then, the corresponding objective function is \u00b5-strongly convex\nand the component functions are L-smooth and convex.\nIf we run Algorithm 1 and assume that access to an oracle U with access to N, \u2207F (wt), and Yt\nis given at the t-th iteration (our extended SGD problem setting), then an exact expression for the\noptimal sequence of stepsizes \u03b7t based on \u00b5, L, N, Y0, . . . , Yt can be given, i.e., this sequence of\nstepsizes achieves the smallest possible Yt+1 at the t-th iteration for all t. For this sequence of\nstepsizes,\n\n$$Y_t \\geq \\frac{N}{2\\mu} \\cdot \\frac{1}{\\mu t + 2\\mu \\ln(t + 1) + W}, \\tag{21}$$\n\nwhere\n\n$$W = \\frac{L^2}{12(L - \\mu)}.$$\n\nIn the supplementary material we show numerical experiments in agreement with the presented\ntheorem.\n\n5 Conclusion\n\nWe have studied the convergence of SGD by introducing a framework for comparing upper bounds\nand lower bounds and by proving a new lower bound based on straightforward calculus. The new\nlower bound is dimension independent and improves by a factor 775 \u00b7 d over previous work [1] applied to\nSGD, shows the optimality of step sizes in [17, 7], and shows that practical algorithms like Adam [10],\nAdaGrad [6], SGD-Momentum [23], RMSProp [24] for strongly convex objective functions can at\nmost achieve a factor 32 \u00b7 775 \u00b7 d smaller expected convergence rate compared to classical SGD.\n\nAcknowledgement\n\nWe thank the reviewers for useful suggestions to improve the paper.\n
Phuong Ha Nguyen and Marten\nvan Dijk were supported in part by AFOSR MURI under award number FA9550-14-1-0351.\n\nReferences\n\n[1] Alekh Agarwal, Peter L Bartlett, Pradeep Ravikumar, and Martin J Wainwright. Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization. 2010.\n\n[2] D.P. Bertsekas. Nonlinear Programming. Athena Scientific, 1999.\n\n[3] L\u00e9on Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. arXiv:1606.04838, 2016.\n\n[4] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.\n\n[5] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In NIPS, pages 1646\u20131654, 2014.\n\n[6] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121\u20132159, 2011.\n\n[7] Robert Mansel Gower, Nicolas Loizou, Xun Qian, Alibek Sailanbayev, Egor Shulgin, and Peter Richtarik. SGD: General analysis and improved rates. arXiv preprint arXiv:1901.09401, 2019.\n\n[8] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics, 2nd edition, 2009.\n\n[9] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, pages 315\u2013323, 2013.\n\n[10] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.\n\n[11] Nicolas Le Roux, Mark Schmidt, and Francis Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In NIPS, pages 2663\u20132671, 2012.\n\n[12] R\u00e9mi Leblond, Fabian Pedregosa, and Simon Lacoste-Julien.\n
Improved asynchronous parallel\noptimization analysis for stochastic incremental methods. arXiv preprint arXiv:1801.03749,\n2018.\n\n[13] Eric Moulines and Francis R Bach. Non-asymptotic analysis of stochastic approximation\nalgorithms for machine learning. In Advances in Neural Information Processing Systems, pages\n451\u2013459, 2011.\n\n[14] Arkadii Semenovich Nemirovsky and David Borisovich Yudin. Problem complexity and method\nefficiency in optimization. 1983.\n\n[15] Yurii Nesterov. Introductory lectures on convex optimization: a basic course. Applied\noptimization. Kluwer Academic Publ., Boston, Dordrecht, London, 2004.\n\n[16] Lam Nguyen, Phuong Ha Nguyen, Peter Richtarik, Katya Scheinberg, Martin Takac, and\nMarten van Dijk. New convergence aspects of stochastic gradient algorithms. arXiv preprint\narXiv:1811.12403, 2018.\n\n[17] Lam Nguyen, Phuong Ha Nguyen, Marten van Dijk, Peter Richtarik, Katya Scheinberg, and\nMartin Takac. SGD and Hogwild! Convergence without the bounded gradients assumption. In\nICML, 2018.\n\n[18] Lam M. Nguyen, Jie Liu, Katya Scheinberg, and Martin Tak\u00e1\u010d. SARAH: A novel method for\nmachine learning problems using stochastic recursive gradient. In ICML, 2017.\n\n[19] Jorge Nocedal and Stephen J. Wright. Numerical Optimization. Springer, New York, 2nd\nedition, 2006.\n\n[20] Maxim Raginsky and Alexander Rakhlin. Information-Based Complexity, Feedback and\nDynamics in Convex Programming. IEEE Trans. Information Theory, 57(10):7036\u20137056, 2011.\n\n[21] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of\nMathematical Statistics, 22(3):400\u2013407, 1951.\n\n[22] Mark Schmidt and Nicolas Le Roux. Fast convergence of stochastic gradient descent under a\nstrong growth condition. arXiv preprint arXiv:1308.6370, 2013.\n\n[23] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of\ninitialization and momentum in deep learning. 
In International Conference on Machine Learning,\npages 1139\u20131147, 2013.\n\n[24] Fangyu Zou, Li Shen, Zequn Jie, Weizhong Zhang, and Wei Liu. A Sufficient Condition for\nConvergences of Adam and RMSProp. arXiv preprint arXiv:1811.09358, 2018.\n", "award": [], "sourceid": 1973, "authors": [{"given_name": "PHUONG_HA", "family_name": "NGUYEN", "institution": "University of Connecticut (UCONN)"}, {"given_name": "Lam", "family_name": "Nguyen", "institution": "IBM Research, Thomas J. Watson Research Center"}, {"given_name": "Marten", "family_name": "van Dijk", "institution": "University of Connecticut"}]}