{"title": "Painless Stochastic Gradient: Interpolation, Line-Search, and Convergence Rates", "book": "Advances in Neural Information Processing Systems", "page_first": 3732, "page_last": 3745, "abstract": "Recent works have shown that stochastic gradient descent (SGD) achieves the fast convergence rates of full-batch gradient descent for over-parameterized models satisfying certain interpolation conditions. However, the step-size used in these works depends on unknown quantities and SGD's practical performance heavily relies on the choice of this step-size. We propose to use line-search techniques to automatically set the step-size when training models that can interpolate the data. In the interpolation setting, we prove that SGD with a stochastic variant of the classic Armijo line-search attains the deterministic convergence rates for both convex and strongly-convex functions. Under additional assumptions, SGD with Armijo line-search is shown to achieve fast convergence for non-convex functions. Furthermore, we show that stochastic extra-gradient with a Lipschitz line-search attains linear convergence for an important class of non-convex functions and saddle-point problems satisfying interpolation. To improve the proposed methods' practical performance, we give heuristics to use larger step-sizes and acceleration. We compare the proposed algorithms against numerous optimization methods on standard classification tasks using both kernel methods and deep networks. The proposed methods result in competitive performance across all models and datasets, while being robust to the precise choices of hyper-parameters. 
For multi-class classification using deep networks, SGD with Armijo line-search results in both faster convergence and better generalization.", "full_text": "Painless Stochastic Gradient:\n\nInterpolation, Line-Search, and Convergence Rates\n\nSharan Vaswani\n\nMila, Universit\u00e9 de Montr\u00e9al\n\nAaron Mishkin\n\nUniversity of British Columbia\n\nIssam Laradji\n\nUniversity of British Columbia\n\nElement AI\n\nGauthier Gidel\n\nMila, Universit\u00e9 de Montr\u00e9al\n\nElement AI\n\nMark Schmidt\n\nUniversity of British Columbia, 1QBit\n\nCCAI Af\ufb01liate Chair (Amii)\n\nSimon Lacoste-Julien\u2020\n\nMila, Universit\u00e9 de Montr\u00e9al\n\nAbstract\n\nRecent works have shown that stochastic gradient descent (SGD) achieves the fast\nconvergence rates of full-batch gradient descent for over-parameterized models\nsatisfying certain interpolation conditions. However, the step-size used in these\nworks depends on unknown quantities and SGD\u2019s practical performance heavily\nrelies on the choice of this step-size. We propose to use line-search techniques\nto automatically set the step-size when training models that can interpolate the\ndata. In the interpolation setting, we prove that SGD with a stochastic variant of\nthe classic Armijo line-search attains the deterministic convergence rates for both\nconvex and strongly-convex functions. Under additional assumptions, SGD with\nArmijo line-search is shown to achieve fast convergence for non-convex functions.\nFurthermore, we show that stochastic extra-gradient with a Lipschitz line-search\nattains linear convergence for an important class of non-convex functions and\nsaddle-point problems satisfying interpolation. To improve the proposed methods\u2019\npractical performance, we give heuristics to use larger step-sizes and acceleration.\nWe compare the proposed algorithms against numerous optimization methods on\nstandard classi\ufb01cation tasks using both kernel methods and deep networks. 
The proposed methods result in competitive performance across all models and datasets, while being robust to the precise choices of hyper-parameters. For multi-class classification using deep networks, SGD with Armijo line-search results in both faster convergence and better generalization.

1 Introduction
Stochastic gradient descent (SGD) and its variants [18, 21, 35, 39, 72, 82, 87] are the preferred optimization methods in modern machine learning. They only require the gradient for one training example (or a small \u201cmini-batch\u201d of examples) in each iteration and thus can be used with large datasets. These first-order methods have been particularly successful for training highly-expressive, over-parameterized models such as non-parametric regression [7, 45] and deep neural networks [9, 88]. However, the practical efficiency of stochastic gradient methods is adversely affected by two challenges: (i) their performance heavily relies on the choice of the step-size (\u201clearning rate\u201d) [9, 70] and (ii) their slow convergence compared to methods that compute the full gradient (over all training examples) in each iteration [58].
33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada. \u2020 Canada CIFAR AI Chair.
Variance-reduction (VR) methods [18, 35, 72] are relatively new variants of SGD that improve its slow convergence rate. These methods exploit the finite-sum structure of typical loss functions arising in machine learning, achieving both the low iteration cost of SGD and the fast convergence rate of deterministic methods that compute the full-gradient in each iteration. Moreover, VR makes setting the learning rate easier and there has been work exploring the use of line-search techniques for automatically setting the step-size for these methods [71, 72, 76, 81]. These methods have resulted in impressive performance on a variety of problems. 
However, the improved performance comes at the cost of additional memory [72] or computational [18, 35] overheads, making these methods less appealing when training high-dimensional models on large datasets. Moreover, in practice VR methods do not tend to converge faster than SGD on over-parameterized models [19].
Indeed, recent works [5, 13, 33, 47, 52, 73, 83] have shown that when training over-parameterized models, classic SGD with a constant step-size and without VR can achieve the convergence rates of full-batch gradient descent. These works assume that the model is expressive enough to interpolate the data. The interpolation condition is satisfied for models such as non-parametric regression [7, 45], over-parametrized deep neural networks [88], boosting [69], and for linear classifiers on separable data. However, the good performance of SGD in this setting relies on using the proposed constant step-size, which depends on problem-specific quantities not known in practice. On the other hand, there has been a long line of research on techniques to automatically set the step-size for classic SGD. These techniques include using meta-learning procedures to modify the main stochastic algorithm [2, 6, 63, 75, 77, 86], heuristics to adjust the learning rate on the fly [20, 43, 70, 74], and recent adaptive methods inspired by online learning [21, 39, 51, 60, 67, 68, 87]. However, none of these techniques have been proved to achieve the fast convergence rates that we now know are possible in the over-parametrized setting.
In this work, we use classical line-search methods [59] to automatically set the step-size for SGD when training over-parametrized models. Line-search is a standard technique to adaptively set the step-size for deterministic methods that evaluate the full gradient in each iteration. 
These methods make use of additional function/gradient evaluations to characterize the function around the current iterate and adjust the magnitude of the descent step. The additional noise in SGD complicates the use of line-searches in the general stochastic setting and there have only been a few attempts to address this. Mahsereci et al. [53] define a Gaussian process model over probabilistic Wolfe conditions and use it to derive a termination criterion for the line-search. The convergence rate of this procedure is not known, and experimentally we found that our proposed line-search technique is simpler to implement and more robust. Other authors [12, 17, 22, 42, 62] use a line-search termination criterion that requires function/gradient evaluations averaged over multiple samples. However, in order to achieve convergence, the number of samples required per iteration (the \u201cbatch-size\u201d) increases progressively, losing the low per-iteration cost of SGD. Other works [11, 26] exploring trust-region methods assume that the model is sufficiently accurate, which is not guaranteed in the general stochastic setting. In contrast to these works, our line-search procedure does not consider the general stochastic setting and is designed for models that satisfy interpolation; it achieves fast rates in the over-parameterized regime without the need to manually choose a step-size or increase the batch size.
We make the following contributions: in Section 3 we prove that, under interpolation, SGD with a stochastic variant of the Armijo line-search attains the convergence rates of full-batch gradient descent in both the convex and strongly-convex settings. We achieve these rates under weaker assumptions than the prior work [83] and without explicit knowledge of problem-specific constants. We then consider minimizing non-convex functions satisfying interpolation [5, 83]. 
Previous work [5] proves that constant step-size SGD achieves a linear rate for non-convex functions satisfying the PL inequality [37, 65]. SGD is further known to achieve deterministic rates for general non-convex functions under a stronger assumption on the growth of the stochastic gradients [73, 83]. Under this assumption and an upper bound (that requires knowledge of the \u201cLipschitz\u201d constant) on the maximum step-size, we prove that SGD with Armijo line-search can achieve the deterministic rate for general non-convex functions (Section 4). Note that these are the first convergence rates for SGD with line-search in the interpolation setting for both convex and non-convex functions.
Moving beyond SGD, in Section 5 we consider the stochastic extra-gradient (SEG) method [24, 31, 36, 41, 55] used to solve general variational inequalities [27]. These problems encompass both convex minimization and saddle point problems arising in robust supervised learning [8, 84] and learning with non-separable losses or regularizers [4, 34]. In the interpolation setting, we show that a variant of SEG [24] with a \u201cLipschitz\u201d line-search converges linearly when minimizing an important class of non-convex functions [16, 40, 44, 78, 79] satisfying the restricted secant inequality (RSI). Moreover, in Appendix E, we prove that the same algorithm results in linear convergence for both strongly convex-concave and bilinear saddle point problems satisfying interpolation.
In Section 6, we give heuristics to use large step-sizes and integrate acceleration with our line-search techniques, which improves the practical performance of the proposed methods. 
We compare our algorithms against numerous optimizers [21, 39, 51, 53, 60, 68] on a synthetic matrix factorization problem (Section 7.2), convex binary-classification problems using radial basis function (RBF) kernels (Section 7.3), and non-convex multi-class classification problems with deep neural networks (Section 7.4). We observe that when interpolation is (approximately) satisfied, the proposed methods are robust and have competitive performance across models and datasets. Moreover, SGD with Armijo line-search results in both faster convergence and better generalization performance for classification using deep networks. Finally, in Appendix G.2, we evaluate SEG with line-search for synthetic bilinear saddle point problems. The code to reproduce our results can be found at https://github.com/IssamLaradji/sls.
We note that in concurrent work to ours, Berrada et al. [10] propose adaptive step-sizes for SGD on convex, finite-sum loss functions under an ε-interpolation condition. Unlike our approach, ε-interpolation requires knowledge of a lower bound on the global minimum and only guarantees approximate convergence to a stationary point. Moreover, in order to obtain linear convergence rates, they assume μ-strong-convexity of each individual function. This assumption with ε-interpolation reduces the finite-sum optimization to minimization of any single function in the finite sum.
2 Assumptions
We aim to minimize a differentiable function f assuming access to noisy stochastic gradients of the function. We focus on the common machine learning setting where the function f has a finite-sum structure, meaning that f(w) = (1/n) Σ_{i=1}^n f_i(w). Here n is equal to the number of points in the training set and the function f_i is the loss function for the training point i. Depending on the model, f can either be strongly-convex, convex, or non-convex. 
We assume that f is lower-bounded by some value f* and that f is L-smooth [56], implying that the gradient ∇f is L-Lipschitz continuous.
We assume that the model is able to interpolate the data and use this property to derive convergence rates. Formally, interpolation requires that the gradient with respect to each point converges to zero at the optimum, implying that if the function f is minimized at w* and thus ∇f(w*) = 0, then for all functions f_i we have that ∇f_i(w*) = 0. For example, interpolation is exactly satisfied when using a linear model with the squared hinge loss for binary classification on linearly separable data.
3 Stochastic Gradient Descent for Convex Functions
Stochastic gradient descent (SGD) computes the gradient of the loss function corresponding to one or a mini-batch of randomly (typically uniformly) chosen training examples i_k in iteration k. It then performs a descent step as w_{k+1} = w_k − η_k ∇f_{i_k}(w_k), where w_{k+1} and w_k are the SGD iterates, η_k is the step-size and ∇f_{i_k}(·) is the (average) gradient of the loss function(s) chosen at iteration k. Each stochastic gradient ∇f_{i_k}(w) is assumed to be unbiased, implying that E_i[∇f_i(w)] = ∇f(w) for all w. We now describe the Armijo line-search method to set the step-size in each iteration.
3.1 Armijo line-search
Armijo line-search [3] is a standard method for setting the step-size for gradient descent in the deterministic setting [59]. We adapt it to the stochastic case as follows: at iteration k, the Armijo line-search selects a step-size satisfying the following condition:

f_{i_k}(w_k − η_k ∇f_{i_k}(w_k)) ≤ f_{i_k}(w_k) − c · η_k ‖∇f_{i_k}(w_k)‖².   (1)

Here, c > 0 is a hyper-parameter. Note that the above line-search condition uses the function and gradient values of the mini-batch at the current iterate w_k. 
Thus, compared to SGD, checking this condition only makes use of additional mini-batch function (and not gradient) evaluations. In the context of deep neural networks, this corresponds to extra forward passes on the mini-batch.
In our theoretical results, we assume that there is a maximum step-size η_max from which the line-search starts in each iteration k, and that we choose the largest step-size η_k (less than or equal to η_max) satisfying (1). In practice, backtracking line-search is a common way to ensure that Equation 1 is satisfied. Starting from η_max, backtracking iteratively decreases the step-size by a constant factor β until the line-search succeeds (see Algorithm 1). Suitable strategies for resetting the step-size can avoid backtracking in the majority of iterations and make the step-size selection procedure efficient. We describe such strategies in Section 6. With resetting, we required (on average) only one additional forward pass on the mini-batch per iteration when training a standard deep network model (Section 7.4). Empirically, we observe that the algorithm is robust to the choice of both c and η_max; setting c to a small constant and η_max to a large value consistently results in good performance.
We bound the chosen step-size in terms of the properties of the function(s) selected in iteration k.
Lemma 1. The step-size η_k returned by the Armijo line-search and constrained to lie in the (0, η_max] range satisfies the following inequality,

η_k ≥ min{2(1 − c)/L_{i_k}, η_max},   (2)

where L_{i_k} is the Lipschitz constant of ∇f_{i_k}.
The proof is in Appendix A and follows the deterministic case [59]. Note that Equation (1) holds for all smooth functions (for small-enough η_k), does not require convexity, and guarantees backtracking line-search will terminate at a non-zero step-size. 
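As a concrete illustration, the backtracking Armijo loop just described can be sketched in a few lines of Python. This is a minimal sketch on a toy interpolating least-squares problem, not the paper's reference implementation; the problem, the constants (η_max = 10, c = 0.5, backtracking factor β = 0.7) and all function names are our own illustrative choices.

```python
import numpy as np

# Toy over-parameterized least-squares problem (d > n), so the model can
# interpolate: there exists w with f_i(w) = 0 for every example i.
rng = np.random.default_rng(0)
n, d = 10, 30
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d)          # realizable targets

def f_i(w, i):                          # per-example squared loss
    r = X[i] @ w - y[i]
    return 0.5 * r * r

def grad_i(w, i):
    return (X[i] @ w - y[i]) * X[i]

def sgd_armijo(w, steps=10000, eta_max=10.0, c=0.5, beta=0.7):
    for _ in range(steps):
        i = rng.integers(n)
        g = grad_i(w, i)
        g2 = g @ g
        if g2 == 0.0:                   # this example is already interpolated
            continue
        eta = eta_max                   # restart from the maximum step-size
        # Backtrack until the stochastic Armijo condition (1) holds;
        # Lemma 1 guarantees termination at eta >= min{2(1-c)/L_i, eta_max}.
        while f_i(w - eta * g, i) > f_i(w, i) - c * eta * g2:
            eta *= beta
        w = w - eta * g
    return w

w_out = sgd_armijo(np.zeros(d))
final_loss = 0.5 * np.mean((X @ w_out - y) ** 2)
```

Because the problem interpolates, every per-example loss can be driven to zero and the training loss decreases to machine precision, in line with the linear rates discussed above.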
The parameter c controls the \u201caggressiveness\u201d of the algorithm; small c values encourage a larger step-size. For a sufficiently large η_max and c ≤ 1/2, the step-size is at least as large as 1/L_{i_k}, which is the constant step-size used in the interpolation setting [73, 83]. In practice, we expect these larger step-sizes to result in improved performance. In Appendix A, we also give upper bounds on η_k if the function f_{i_k} satisfies the Polyak-Lojasiewicz (PL) inequality [37, 65] with constant μ_{i_k}. PL is a weaker condition than strong-convexity and does not require convexity. In this case, η_k is upper-bounded by the minimum of η_max and 1/(2c · μ_{i_k}). If we use a backtracking line-search that multiplies the step-size by β until (1) holds, the step-size will be smaller by at most a factor of β (we do not include this dependence in our results).
3.2 Convergence rates
In this section, we characterize the convergence rate of SGD with Armijo line-search in the strongly-convex and convex cases. The theorems below are proved in Appendix B and Appendix C respectively.
Theorem 1 (Strongly-Convex). Assuming (a) interpolation, (b) L_i-smoothness, (c) convexity of f_i's, and (d) μ strong-convexity of f, SGD with Armijo line-search with c = 1/2 in Eq. 1 achieves the rate:

E[‖w_T − w*‖²] ≤ max{(1 − μ̄/L_max), (1 − μ̄ η_max)}^T ‖w_0 − w*‖².

Here μ̄ = Σ_{i=1}^n μ_i/n is the average strong-convexity of the finite sum and L_max = max_i L_i is the maximum smoothness constant in the f_i's.
In contrast to the previous results [52, 73, 83] that depend on μ, the above linear rate depends on μ̄ ≤ μ. Note that unlike Berrada et al. [10], we do not require that each f_i is strongly convex, but for μ̄ to be non-zero we still require that at least one of the f_i's is strongly-convex.
Theorem 2 (Convex). 
Assuming (a) interpolation, (b) L_i-smoothness and (c) convexity of f_i's, SGD with Armijo line-search for all c > 1/2 in Equation 1 and iterate averaging achieves the rate:

E[f(w̄_T) − f(w*)] ≤ (c · max{L_max/(2(1 − c)), 1/η_max})/((2c − 1) T) · ‖w_0 − w*‖².

Here, w̄_T = (Σ_{i=1}^T w_i)/T is the averaged iterate after T iterations and L_max = max_i L_i. In particular, setting c = 2/3 implies that E[f(w̄_T) − f(w*)] ≤ (max{3 L_max, 2/η_max}/T) · ‖w_0 − w*‖².
These are the first rates for SGD with line-search in the interpolation setting and match the corresponding rates for full-batch gradient descent on strongly-convex and convex functions. This shows SGD attains fast convergence under interpolation without explicit knowledge of the Lipschitz constant. Next, we use the above line-search to derive convergence rates of SGD for non-convex functions.
4 Stochastic Gradient Descent for Non-convex Functions
To prove convergence results in the non-convex case, we additionally require the strong growth condition (SGC) [73, 83] to hold. The function f satisfies the SGC with constant ρ, if E_i[‖∇f_i(w)‖²] ≤ ρ ‖∇f(w)‖² holds for any point w. This implies that if ∇f(w) = 0, then ∇f_i(w) = 0 for all i. Thus, functions satisfying the SGC necessarily satisfy the interpolation property. The SGC holds for all smooth functions satisfying a PL condition [83]. Under the SGC, we show that by upper-bounding the maximum step-size η_max, SGD with Armijo line-search achieves an O(1/T) convergence rate.
Theorem 3 (Non-convex). Assuming (a) the SGC with constant ρ and (b) L_i-smoothness of f_i's, SGD with Armijo line-search in Equation 1 with c = 1/2 and setting η_max = 3/(2ρ L_max) achieves the rate:

min_{k=0,...,T−1} E[‖∇f(w_k)‖²] ≤ (4 L_max/T) · (2ρ/3 + 1) · (f(w_0) − f(w*)).

We prove Theorem 3 in Appendix D. 
The result requires knowledge of ρ L_max to bound the maximum step-size, which is less practically appealing. It is not immediately clear how to relax this condition and we leave it for future work. However, in the next section, we show that if the non-convex function satisfies a specific curvature condition, a modified stochastic extra-gradient algorithm can achieve a linear rate under interpolation without additional assumptions or knowledge of the Lipschitz constant.
5 Stochastic Extra-Gradient Method
In this section, we use a modified stochastic extra-gradient (SEG) method for convex and non-convex minimization. For finite-sum minimization, stochastic extra-gradient (SEG) has the following update:

w′_k = w_k − η_k ∇f_{i_k}(w_k),  w_{k+1} = w_k − η_k ∇f_{i_k}(w′_k).   (3)

It computes the gradient at an extrapolated point w′_k and uses it in the update from the current iterate w_k. Note that using the same sample i_k and step-size η_k for both steps [24] is important for the subsequent theoretical results. We now describe a \u201cLipschitz\u201d line-search strategy [31, 32, 38] in order to automatically set the step-size for SEG.
5.1 Lipschitz line-search
The \u201cLipschitz\u201d line-search has been used by previous work in the deterministic [32, 38] and the variance-reduced settings [30]. It selects a step-size η_k that satisfies the following condition:

‖∇f_{i_k}(w_k − η_k ∇f_{i_k}(w_k)) − ∇f_{i_k}(w_k)‖ ≤ c ‖∇f_{i_k}(w_k)‖.   (4)

As before, we use backtracking line-search starting from the maximum value of η_max to ensure that the chosen step-size satisfies the above condition. If the function f_{i_k} is L_{i_k}-smooth, the step-size returned by the Lipschitz line-search satisfies η_k ≥ min{c/L_{i_k}, η_max}. Like the Armijo line-search in Section 3, the Lipschitz line-search does not require knowledge of the Lipschitz constant. 
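To make the extrapolation step concrete, here is a minimal sketch of SEG with the Lipschitz line-search (update (3) combined with condition (4)) on a toy interpolating least-squares problem. The problem and the constants (η_max = 10, c = 1/4, backtracking factor 0.7) are our own illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Toy over-parameterized least-squares problem (d > n) that interpolates.
rng = np.random.default_rng(1)
n, d = 10, 30
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d)

def grad_i(w, i):                       # per-example squared-loss gradient
    return (X[i] @ w - y[i]) * X[i]

def seg_lipschitz(w, steps=10000, eta_max=10.0, c=0.25, beta=0.7):
    for _ in range(steps):
        i = rng.integers(n)
        g = grad_i(w, i)
        gnorm = np.sqrt(g @ g)
        if gnorm == 0.0:                # this example is already interpolated
            continue
        eta = eta_max
        # Backtrack until condition (4) holds: the gradient at the
        # extrapolated point stays within c * ||g|| of the current gradient.
        while np.linalg.norm(grad_i(w - eta * g, i) - g) > c * gnorm:
            eta *= beta
        # Extra-gradient step (3): same sample i and step-size for both steps.
        w_ex = w - eta * g
        w = w - eta * grad_i(w_ex, i)
    return w

w_out = seg_lipschitz(np.zeros(d))
final_loss = 0.5 * np.mean((X @ w_out - y) ** 2)
```

Note that each accepted step-size requires extra gradient evaluations at trial extrapolation points, which is the cost difference from the Armijo search discussed next.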
Unlike the line-search strategy in the previous sections, checking condition (4) requires computing the gradient at a prospective extrapolation point. We now prove convergence rates for SEG with Lipschitz line-search for both convex and a special class of non-convex problems.
5.2 Convergence rates for minimization
For the next result, we assume that each function f_i(·) satisfies the restricted secant inequality (RSI) with constant μ_i, implying that for all w, ⟨∇f_i(w), w − w*⟩ ≥ μ_i ‖w − w*‖². RSI is a weaker condition than strong-convexity. With additional assumptions, RSI is satisfied by important non-convex models such as single hidden-layer neural networks [40, 44, 78], matrix completion [79] and phase retrieval [16]. Under interpolation, we show SEG results in linear convergence for functions satisfying RSI. In particular, we obtain the following guarantee:
Theorem 4 (Non-convex + RSI). Assuming (a) interpolation, (b) L_i-smoothness, and (c) μ_i-RSI of f_i's, SEG with Lipschitz line-search in Eq. 4 with c = 1/4 and η_max ≤ min_i 1/(4μ_i) achieves the rate:

E[‖w_T − P_{X*}[w_T]‖²] ≤ max{(1 − μ̄/(4 L_max)), (1 − η_max μ̄)}^T ‖w_0 − P_{X*}[w_0]‖²,

where μ̄ = Σ_{i=1}^n μ_i/n is the average RSI constant of the finite sum and X* is the non-empty set of optimal solutions. The operation P_{X*}[w] denotes the projection of w onto X*.
See Appendix E.2 for the proof. Similar to the result of Theorem 1, the rate depends on the average RSI constant. Note that we do not require explicit knowledge of the Lipschitz constant to achieve the above rate. The constraint on the maximum step-size is mild since the minimum μ_i is typically small, thus allowing for large step-sizes. 
Moreover, Theorem 4 improves upon the (1 − μ²/L²) rate obtained using constant step-size SGD [5, 83]. In Appendix E.2, we show that the same rate can be attained by SEG with a constant step-size. In Appendix E.3, we show that under interpolation, SEG with Lipschitz line-search also achieves the desired O(1/T) rate for convex functions.

Algorithm 1 SGD+Armijo(f, w_0, η_max, b, c, β, γ, opt)
1: for k = 0, . . . , T do
2:   i_k ← sample mini-batch of size b
3:   η ← reset(η, η_max, γ, b, k, opt)/β
4:   repeat
5:     η ← β · η
6:     w̃_k ← w_k − η ∇f_{i_k}(w_k)
7:   until f_{i_k}(w̃_k) ≤ f_{i_k}(w_k) − c · η ‖∇f_{i_k}(w_k)‖²
8:   w_{k+1} ← w̃_k
9: end for
10: return w_{k+1}

Algorithm 2 reset(η, η_max, γ, b, k, opt)
1: if k = 1 then
2:   return η_max
3: else if opt = 0 then
4:   η ← η
5: else if opt = 1 then
6:   η ← η_max
7: else if opt = 2 then
8:   η ← η · γ^{b/n}
9: end if
10: return η

Figure 1: Algorithm 1 gives pseudo-code for SGD with Armijo line-search. Algorithm 2 implements several heuristics (by setting opt) for resetting the step-size at each iteration.
5.3 Convergence rates for saddle point problems
In Appendix E.4, we use SEG with Lipschitz line-search for a class of saddle point problems of the form min_{u∈U} max_{v∈V} φ(u, v). Here U and V are the constraint sets for the variables u and v respectively. In Theorem 6 in Appendix E.4, we show that under interpolation, SEG with Lipschitz line-search results in linear convergence for functions φ(u, v) that are strongly-convex in u and strongly-concave in v. The required conditions are satisfied for robust optimization [84] with expressive models capable of interpolating the data. Furthermore, the interpolation property can be used to improve the convergence for a bilinear saddle-point problem [24, 25, 54, 85]. 
In Theorem 7 in Appendix E.5, we show that SEG with Lipschitz line-search results in linear convergence under interpolation. We empirically validate this claim with simple synthetic experiments in Appendix G.2.
6 Practical Considerations
In this section, we give heuristics to use larger step-sizes across iterations and discuss ways to use common acceleration schemes with our line-search techniques.
6.1 Using larger step-sizes
Recall that our theoretical analysis assumes that the line-search in each iteration starts from a global maximum step-size η_max. However, in practice, this strategy increases the amount of backtracking and consequently the algorithm's runtime. A simple alternative is to initialize the line-search in each iteration to the step-size selected in the previous iteration, η_{k−1}. With this strategy, the step-size cannot increase and convergence is slowed in practice (it takes smaller steps than necessary). To alleviate these problems, we consider increasing the step-size across iterations by initializing the backtracking at iteration k with η_{k−1} · γ^{b/n} [71, 72], where b is the size of the mini-batch and γ > 1 is a tunable parameter. These heuristics correspond to the options used in Algorithm 2.
We also consider the Goldstein line-search that uses additional function evaluations to check the curvature condition f_{i_k}(w_k − η_k ∇f_{i_k}(w_k)) ≥ f_{i_k}(w_k) − (1 − c) · η_k ‖∇f_{i_k}(w_k)‖² and increases the step-size if it is not satisfied. Here, c is the constant in Equation 1. The resulting method decreases the step-size if the Armijo condition is not satisfied and increases it if the curvature condition does not hold. Algorithm 3 in Appendix H gives pseudo-code for SGD with the Goldstein line-search.
6.2 Acceleration
In practice, augmenting stochastic methods with some form of momentum or acceleration [57, 64] often results in faster convergence [80]. 
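One such combination, Polyak's heavy-ball update with an Armijo-chosen step-size, can be sketched as follows. The line-search picks η_k and the momentum term is then added on top; the toy interpolating problem and the momentum factor α = 0.5 are our own illustrative choices, not values from the paper.

```python
import numpy as np

# Sketch of SGD with Armijo line-search plus Polyak (heavy-ball) momentum:
#   w_{k+1} = w_k - eta_k * grad + alpha * (w_k - w_{k-1}),
# where eta_k is chosen by backtracking on the Armijo condition (1).
rng = np.random.default_rng(2)
n, d = 10, 30
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d)          # interpolating least-squares problem

def f_i(w, i):
    r = X[i] @ w - y[i]
    return 0.5 * r * r

def grad_i(w, i):
    return (X[i] @ w - y[i]) * X[i]

def sgd_armijo_polyak(w, steps=10000, eta_max=10.0, c=0.5, beta=0.7, alpha=0.5):
    w_prev = w.copy()
    for _ in range(steps):
        i = rng.integers(n)
        g = grad_i(w, i)
        g2 = g @ g
        if g2 == 0.0:
            continue
        eta = eta_max
        while f_i(w - eta * g, i) > f_i(w, i) - c * eta * g2:
            eta *= beta
        # Heavy-ball step using the line-search step-size.
        w, w_prev = w - eta * g + alpha * (w - w_prev), w
    return w

w_out = sgd_armijo_polyak(np.zeros(d))
final_loss = 0.5 * np.mean((X @ w_out - y) ** 2)
```

The design choice here mirrors the text: the line-search only controls the gradient step, and the momentum term is applied unchanged afterwards.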
Related work in this context includes algorithms specifically designed to achieve an accelerated rate of convergence in the stochastic setting [1, 23, 46]. Unlike these works, we propose simple ways of using either Polyak [64] or Nesterov [57] acceleration with the proposed line-search techniques. In both cases, similar to adaptive methods using momentum [80], we use SGD with Armijo line-search to determine η_k and then use it directly within the acceleration scheme. When using Polyak momentum, the effective update can be given as: w_{k+1} = w_k − η_k ∇f_{i_k}(w_k) + α(w_k − w_{k−1}), where α is the momentum factor. This update rule has been used with a constant step-size and proven to obtain linear convergence rates on the generalization error for quadratic functions under an interpolation condition [48, 49]. For Nesterov acceleration, we use the variant for the convex case [57] (which has no additional hyper-parameters) with our line-search. The pseudo-code for using these methods with the Armijo line-search is given in Appendix H.
7 Experiments

Figure 2: Matrix factorization using the true model and rank 1, 4, 10 factorizations. Rank 1 factorization is under-parametrized, while ranks 4 and 10 are over-parametrized. Rank 10 and the true model satisfy interpolation.

Figure 3: Binary classification using a softmax loss and RBF kernels for the mushrooms and ijcnn datasets. Mushrooms is linearly separable in kernel-space with the selected kernel bandwidths while ijcnn is not. Overall, we observe fast convergence of SGD + Armijo, Nesterov + Armijo, and SEG + Lipschitz for both datasets.

We describe the experimental setup in Section 7.1. In Section 7.2, we present synthetic experiments to show the benefits of over-parametrization. 
In Sections 7.3 and 7.4, we showcase the convergence and generalization performance of our methods for kernel experiments and deep networks, respectively.
7.1 Experimental setup
We benchmark five configurations of the proposed line-search methods: SGD with (1) Armijo line-search with resetting the initial step-size (Algorithm 1 using option 2 in Algorithm 2), (2) Goldstein line-search (Algorithm 3), (3) Polyak momentum (Algorithm 5), (4) Nesterov acceleration (Algorithm 6), and (5) SEG with Lipschitz line-search (Algorithm 4) with option 2 to reset the step-size. Appendix F gives additional details on our experimental setup and the default hyper-parameters used for the proposed line-search methods. We compare our methods against Adam [39], which is the most common adaptive method, and other methods that report better performance than Adam: coin-betting [60], L4 [68], and Adabound [51]. We use the default learning rates for the competing methods. Unless stated otherwise, our results are averaged across 5 independent runs. (L4 applied to momentum SGD (L4 Mom) in https://github.com/iovdin/l4-pytorch was unstable in our experiments and we omit it from the main paper.)

Figure 4: Multi-class classification using softmax loss and (top) an MLP model for MNIST; ResNet model for CIFAR-10 and CIFAR-100 (bottom) DenseNet model for CIFAR-10 and CIFAR-100.

Figure 5: (Left) Variation in step-sizes for SGD+Armijo for the matrix factorization problem and classification with deep neural networks. (Right) Average time per iteration.
7.2 Synthetic experiment
We examine the effect of over-parametrization on convergence rates for the non-convex regression problem: min_{W_1,W_2} E_{x∼N(0,I)} ‖W_2 W_1 x − Ax‖². This is equivalent to a matrix factorization problem satisfying RSI [79] and has been proposed as a challenging benchmark for gradient descent methods [66]. Following Rolínek et al. 
[68], we choose A ∈ R^{10×6} with condition number κ(A) = 10^10 and generate a fixed dataset of 1000 samples. Unlike the previous work, we consider stochastic optimization and control the model's expressivity via the rank k of the matrix factors W1 ∈ R^{k×6} and W2 ∈ R^{10×k}. Figure 2 shows plots of the training loss (averaged across 20 runs) for the true data-generating model and for factorizations with rank k ∈ {1, 4, 10}.

We make the following observations: (i) for k = 4 (where interpolation does not hold), the proposed methods converge more quickly than the other optimizers, but all methods reach an artificial optimization floor; (ii) k = 10 yields an over-parametrized model for which SGD with both the Armijo and Goldstein line-searches converges linearly to machine precision; (iii) SEG with Lipschitz line-search obtains fast convergence in accordance with Theorem 4; and (iv) adaptive-gradient methods stagnate in all cases. These observations validate our theoretical results and show that over-parametrization and line-search allow for fast, "painless" optimization using SGD and SEG.

7.3 Binary classification with kernels

We consider convex binary classification using RBF kernels without regularization. We experiment with four standard datasets: mushrooms, rcv1, ijcnn, and w8a from LIBSVM [14]. The mushrooms dataset satisfies the interpolation condition with the selected kernel bandwidths, while ijcnn, rcv1, and w8a do not. For these experiments, we also compare against a standard VR method (SVRG) [35] and probabilistic line-search (PLS) [53].² Figure 3 shows the training loss and test accuracy on mushrooms and ijcnn for the different optimizers with the softmax loss. Results for rcv1 and w8a are given in Appendix G.3. We make the following observations: (i) SGD + Armijo, Nesterov + Armijo, and SEG + Lipschitz perform the best and are comparable to hand-tuned SVRG.
(ii) The proposed line-search methods perform well on ijcnn even though it is not separable in kernel space, demonstrating some robustness to violations of the interpolation condition.

7.4 Multi-class classification using deep networks

We benchmark the convergence rate and generalization performance of our line-search methods on standard deep learning experiments. We consider non-convex minimization for multi-class classification using deep network models on the MNIST, CIFAR-10, and CIFAR-100 datasets. Our experimental choices follow the setup in Luo et al. [51]. For MNIST, we use a one-hidden-layer multi-layer perceptron (MLP) of width 1000. For CIFAR-10 and CIFAR-100, we experiment with the standard image-classification architectures ResNet-34 [28] and DenseNet-121 [29]. We also compare to the best-performing constant step-size SGD, with the step-size selected by grid search.

From Figure 4, we observe that: (i) SGD with Armijo line-search consistently leads to the best performance in terms of both the training loss and test accuracy, and it converges to a good solution much faster than the other methods; (ii) SGD with line-search and Polyak momentum is always better than "tuned" constant step-size SGD and Adam, while SGD with Goldstein line-search is competitive across datasets. We omit Nesterov + Armijo as it is unstable and diverges, and we omit SEG since it resulted in slower convergence and worse performance.

We also verify that our line-search methods do not lead to excessive backtracking and function evaluations. Figure 5 (right) shows the cost per iteration for the above experiments. Our line-search methods are only marginally slower than Adam and converge much faster. In practice, we observed that SGD+Armijo uses only one additional function evaluation per iteration on average. Figure 5 (left) shows the evolution of step-sizes for SGD+Armijo in our experiments.
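The low backtracking cost above (roughly one extra function evaluation per iteration) depends on how the initial step-size is reset between iterations. As a sketch only: the multiplicative rule below, which slowly grows the previously accepted step-size and caps it at η_max, is an illustrative assumption in the spirit of option 2 in Algorithm 2, not the paper's exact rule; the factor γ^(b/n) (b = batch size, n = dataset size) is chosen so the step-size grows by a small amount per iteration:

```python
def reset_step_size(eta_prev, gamma=2.0, batch_size=128, n=50_000, eta_max=10.0):
    """Illustrative option-2-style reset: grow the previously accepted step-size
    by gamma^(batch_size / n) so the line-search only needs to backtrack by
    roughly one step once the step-size has settled; cap the growth at eta_max."""
    return min(eta_prev * gamma ** (batch_size / n), eta_max)
```

Starting the backtracking slightly above the last accepted step-size lets the schedule track the local curvature (rising when larger steps are accepted, falling when backtracking kicks in), which is one way to read the annealing-like schedules in Figure 5 (left).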
For deep neural networks, SGD+Armijo automatically finds a step-size schedule resembling cosine-annealing [50]. In Appendix G.1, we evaluate and compare the hyper-parameter sensitivity of Adam, constant step-size SGD, and SGD with Armijo line-search on CIFAR-10 with ResNet-34. While SGD is sensitive to the choice of the step-size, the performance of SGD with Armijo line-search is robust to the value of c in the [0.1, 0.5] range. The value of η_max has virtually no effect, since the correct range of step-sizes is found in the early iterations.

8 Conclusion

We showed that, under the interpolation condition satisfied by modern over-parametrized models, simple line-search techniques for classic SGD and SEG lead to fast convergence in both theory and practice. For future work, we hope to strengthen our results for non-convex minimization using SGD with line-search and to study stochastic momentum techniques under interpolation. More generally, we hope to utilize the rich literature on line-search and trust-region methods to improve stochastic optimization for machine learning.

²PLS is impractical for deep networks since it requires the second moment of the mini-batch gradients and needs GP model inference for every line-search evaluation.

Acknowledgments

We would like to thank Yifan Sun and Nicolas Le Roux for insightful discussions. AM is supported by the NSERC CGS M award. IL is funded by the UBC Four-Year Doctoral Fellowship (4YF). This research was also partially supported by the Canada CIFAR AI Chair Program, the CIFAR LMB Program, a Google Focused Research award, an IVADO postdoctoral scholarship (for SV), a Borealis AI fellowship (for GG), the Canada Excellence Research Chair in "Data Science for Realtime Decision-making", and NSERC Discovery Grants RGPIN-2017-06936 and 2015-06068.

References

[1] Zeyuan Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods.
In ACM SIGACT Symposium on Theory of Computing, 2017.

[2] Luís Almeida. Parameter adaptation in stochastic optimization. On-line Learning in Neural Networks, 1998.

[3] Larry Armijo. Minimization of functions having Lipschitz continuous first partial derivatives. Pacific Journal of Mathematics, 1966.

[4] Francis Bach, Rodolphe Jenatton, Julien Mairal, Guillaume Obozinski, et al. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 2012.

[5] Raef Bassily, Mikhail Belkin, and Siyuan Ma. On exponential convergence of SGD in non-convex over-parametrized learning. arXiv preprint arXiv:1811.02564, 2018.

[6] Atilim Gunes Baydin, Robert Cornish, David Martinez Rubio, Mark Schmidt, and Frank Wood. Online learning rate adaptation with hypergradient descent. In ICLR, 2017.

[7] Mikhail Belkin, Alexander Rakhlin, and Alexandre B. Tsybakov. Does data interpolation contradict statistical optimality? In AISTATS, 2019.

[8] Aharon Ben-Tal, Laurent El Ghaoui, and Arkadi Nemirovski. Robust Optimization. Princeton University Press, 2009.

[9] Yoshua Bengio. Practical recommendations for gradient-based training of deep architectures. In Neural Networks: Tricks of the Trade. Springer, 2012.

[10] Leonard Berrada, Andrew Zisserman, and M Pawan Kumar. Training neural networks for and by interpolation. arXiv preprint arXiv:1906.05661, 2019.

[11] Jose Blanchet, Coralia Cartis, Matt Menickelly, and Katya Scheinberg. Convergence rate analysis of a stochastic trust region method via supermartingales. INFORMS Journal on Optimization, 2019.

[12] Richard H Byrd, Gillian M Chin, Jorge Nocedal, and Yuchen Wu. Sample size selection in optimization methods for machine learning. Mathematical Programming, 2012.

[13] Volkan Cevher and Bang Công Vũ. On the linear convergence of the stochastic gradient method with constant step-size.
Optimization Letters, 2018.

[14] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[15] Tatjana Chavdarova, Gauthier Gidel, François Fleuret, and Simon Lacoste-Julien. Reducing noise in GAN training with variance reduced extragradient. In NeurIPS, 2019.

[16] Yuxin Chen and Emmanuel Candes. Solving random quadratic systems of equations is nearly as easy as solving linear systems. In NeurIPS, 2015.

[17] Soham De, Abhay Yadav, David Jacobs, and Tom Goldstein. Big batch SGD: Automated inference using adaptive batch sizes. arXiv preprint arXiv:1610.05792, 2016.

[18] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In NeurIPS, 2014.

[19] Aaron Defazio and Léon Bottou. On the ineffectiveness of variance reduced optimization for deep learning. In NeurIPS, 2019.

[20] Bernard Delyon and Anatoli Juditsky. Accelerated stochastic approximation. SIAM Journal on Optimization, 1993.

[21] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. JMLR, 2011.

[22] Michael P Friedlander and Mark Schmidt. Hybrid deterministic-stochastic methods for data fitting. SIAM Journal on Scientific Computing, 2012.

[23] Roy Frostig, Rong Ge, Sham Kakade, and Aaron Sidford. Un-regularizing: approximate proximal point and faster stochastic algorithms for empirical risk minimization. In ICML, 2015.

[24] Gauthier Gidel, Hugo Berard, Gaëtan Vignoud, Pascal Vincent, and Simon Lacoste-Julien. A variational inequality perspective on generative adversarial networks. In ICLR, 2019.

[25] Ian Goodfellow.
NIPS 2016 tutorial: Generative adversarial networks. arXiv preprint arXiv:1701.00160, 2016.

[26] Serge Gratton, Clément W Royer, Luís N Vicente, and Zaikun Zhang. Complexity and global rates of trust-region methods based on probabilistic models. IMA Journal of Numerical Analysis, 2017.

[27] Patrick T Harker and Jong-Shi Pang. Finite-dimensional variational inequality and nonlinear complementarity problems: a survey of theory, algorithms and applications. Mathematical Programming, 1990.

[28] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.

[29] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In CVPR, 2017.

[30] Alfredo N Iusem, Alejandro Jofré, Roberto I Oliveira, and Philip Thompson. Variance-based extragradient methods with line search for stochastic variational inequalities. SIAM Journal on Optimization, 2019.

[31] Alfredo N Iusem, Alejandro Jofré, Roberto I Oliveira, and Philip Thompson. Extragradient method with variance reduction for stochastic variational inequalities. SIAM Journal on Optimization, 2017.

[32] Alfredo N Iusem and BF Svaiter. A variant of Korpelevich's method for variational inequalities with a new search strategy. Optimization, 1997.

[33] Prateek Jain, Sham Kakade, Rahul Kidambi, Praneeth Netrapalli, and Aaron Sidford. Accelerating stochastic gradient descent for least squares regression. In COLT, 2018.

[34] Thorsten Joachims. A support vector method for multivariate performance measures. In ICML, 2005.

[35] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In NeurIPS, 2013.

[36] Anatoli Juditsky, Arkadi Nemirovski, and Claire Tauvel. Solving variational inequalities with stochastic mirror-prox algorithm.
Stochastic Systems, 2011.

[37] Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 2016.

[38] Evgenii Nikolaevich Khobotov. Modification of the extra-gradient method for solving variational inequalities and certain optimization problems. USSR Computational Mathematics and Mathematical Physics, 1987.

[39] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.

[40] Robert Kleinberg, Yuanzhi Li, and Yang Yuan. An alternative view: When does SGD escape local minima? In ICML, 2018.

[41] GM Korpelevich. The extragradient method for finding saddle points and other problems. Matecon, 1976.

[42] Nataša Krejić and Nataša Krklec. Line search methods with variable sample size for unconstrained optimization. Journal of Computational and Applied Mathematics, 2013.

[43] Harold J Kushner and Jichuan Yang. Stochastic approximation with averaging and feedback: Rapidly convergent "on-line" algorithms. IEEE Transactions on Automatic Control, 1995.

[44] Yuanzhi Li and Yang Yuan. Convergence analysis of two-layer neural networks with ReLU activation. In NeurIPS, 2017.

[45] Tengyuan Liang and Alexander Rakhlin. Just interpolate: Kernel "ridgeless" regression can generalize. arXiv preprint arXiv:1808.00387, 2018.

[46] Hongzhou Lin, Julien Mairal, and Zaid Harchaoui. A universal catalyst for first-order optimization. In NeurIPS, 2015.

[47] Chaoyue Liu and Mikhail Belkin. Accelerating stochastic training for over-parametrized learning. arXiv preprint arXiv:1810.13395, 2019.

[48] Nicolas Loizou and Peter Richtárik. Linearly convergent stochastic heavy ball method for minimizing generalization error.
arXiv preprint arXiv:1710.10737, 2017.

[49] Nicolas Loizou and Peter Richtárik. Momentum and stochastic momentum for stochastic gradient, Newton, proximal point and subspace descent methods. arXiv preprint arXiv:1712.09677, 2017.

[50] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.

[51] Liangchen Luo, Yuanhao Xiong, Yan Liu, and Xu Sun. Adaptive gradient methods with dynamic bound of learning rate. In ICLR, 2019.

[52] Siyuan Ma, Raef Bassily, and Mikhail Belkin. The power of interpolation: Understanding the effectiveness of SGD in modern over-parametrized learning. In ICML, 2018.

[53] Maren Mahsereci and Philipp Hennig. Probabilistic line searches for stochastic optimization. JMLR, 2017.

[54] Lars Mescheder, Sebastian Nowozin, and Andreas Geiger. The numerics of GANs. In NeurIPS, 2017.

[55] Arkadi Nemirovski. Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 2004.

[56] Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 2009.

[57] Yu Nesterov. Gradient methods for minimizing composite functions. Mathematical Programming, 2013.

[58] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Springer Science & Business Media, 2004.

[59] Jorge Nocedal and Stephen Wright. Numerical Optimization. Springer Science & Business Media, 2006.

[60] Francesco Orabona and Tatiana Tommasi. Training deep networks without learning rates through coin betting. In NeurIPS, 2017.

[61] Balamurugan Palaniappan and Francis Bach. Stochastic variance reduction methods for saddle-point problems.
In NeurIPS, 2016.

[62] Courtney Paquette and Katya Scheinberg. A stochastic line search method with convergence rate analysis. arXiv preprint arXiv:1807.07994, 2018.

[63] VP Plagianakos, GD Magoulas, and MN Vrahatis. Learning rate adaptation in stochastic gradient descent. In Advances in Convex Analysis and Global Optimization. Springer, 2001.

[64] Boris T Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 1964.

[65] Boris Teodorovich Polyak. Gradient methods for minimizing functionals. Zhurnal Vychislitel'noi Matematiki i Matematicheskoi Fiziki, 1963.

[66] Ali Rahimi and Ben Recht. Reflections on random kitchen sinks, 2017.

[67] Sashank J Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of Adam and beyond. In ICLR, 2019.

[68] Michal Rolínek and Georg Martius. L4: Practical loss-based stepsize adaptation for deep learning. In NeurIPS, 2018.

[69] Robert E Schapire, Yoav Freund, Peter Bartlett, Wee Sun Lee, et al. Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics, 1998.

[70] Tom Schaul, Sixin Zhang, and Yann LeCun. No more pesky learning rates. In ICML, 2013.

[71] Mark Schmidt, Reza Babanezhad, Mohamed Ahmed, Aaron Defazio, Ann Clifton, and Anoop Sarkar. Non-uniform stochastic average gradient method for training conditional random fields. In AISTATS, 2015.

[72] Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 2017.

[73] Mark Schmidt and Nicolas Le Roux. Fast convergence of stochastic gradient descent under a strong growth condition. arXiv preprint arXiv:1308.6370, 2013.

[74] Alice Schoenauer-Sebag, Marc Schoenauer, and Michèle Sebag. Stochastic gradient descent: Going as fast as possible but not faster.
arXiv preprint arXiv:1709.01427, 2017.

[75] Nicol Schraudolph. Local gain adaptation in stochastic gradient descent. 1999.

[76] Fanhua Shang, Yuanyuan Liu, Kaiwen Zhou, James Cheng, Kelvin Ng, and Yuichi Yoshida. Guaranteed sufficient decrease for stochastic variance reduced gradient optimization. In AISTATS, 2018.

[77] S Shao and Percy Yip. Rates of convergence of adaptive step-size of stochastic approximation algorithms. Journal of Mathematical Analysis and Applications, 2000.

[78] Mahdi Soltanolkotabi, Adel Javanmard, and Jason Lee. Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. IEEE Transactions on Information Theory, 2018.

[79] Ruoyu Sun and Zhi-Quan Luo. Guaranteed matrix completion via non-convex factorization. IEEE Transactions on Information Theory, 2016.

[80] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In ICML, 2013.

[81] Conghui Tan, Shiqian Ma, Yu-Hong Dai, and Yuqiu Qian. Barzilai-Borwein step size for stochastic gradient descent. In NeurIPS, 2016.

[82] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-RMSprop: Divide the gradient by a running average of its recent magnitude. Coursera: Neural Networks for Machine Learning, 2012.

[83] Sharan Vaswani, Francis Bach, and Mark Schmidt. Fast and faster convergence of SGD for over-parameterized models and an accelerated perceptron. In AISTATS, 2019.

[84] Junfeng Wen, Chun-Nam Yu, and Russell Greiner. Robust learning under uncertain test distributions: Relating covariate shift to model misspecification. In ICML, 2014.

[85] Abhay Yadav, Sohil Shah, Zheng Xu, David Jacobs, and Tom Goldstein. Stabilizing adversarial nets with prediction methods. arXiv preprint arXiv:1705.07364, 2017.

[86] Jin Yu, Douglas Aberdeen, and Nicol Schraudolph.
Fast online policy gradient learning with SMD gain vector adaptation. In NeurIPS, 2006.

[87] Matthew Zeiler. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.

[88] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In ICLR, 2017.

[89] Jian Zhang and Ioannis Mitliagkas. YellowFin and the art of momentum tuning. arXiv preprint arXiv:1706.03471, 2017.