{"title": "A Stochastic Gradient Method with an Exponential Convergence _Rate for Finite Training Sets", "book": "Advances in Neural Information Processing Systems", "page_first": 2663, "page_last": 2671, "abstract": "We propose a new stochastic gradient method for optimizing the sum of\u2029 a finite set of smooth functions, where the sum is strongly convex.\u2029 While standard stochastic gradient methods\u2029 converge at sublinear rates for this problem, the proposed method incorporates a memory of previous gradient values in order to achieve a linear convergence \u2029rate. In a machine learning context, numerical experiments indicate that the new algorithm can dramatically outperform standard\u2029 algorithms, both in terms of optimizing the training error and reducing the test error quickly.", "full_text": "A Stochastic Gradient Method with an Exponential\n\nConvergence Rate for Finite Training Sets\n\nNicolas Le Roux\n\nSIERRA Project-Team\n\nINRIA - ENS\nParis, France\n\nMark Schmidt\n\nSIERRA Project-Team\n\nINRIA - ENS\nParis, France\n\nFrancis Bach\n\nSIERRA Project-Team\n\nINRIA - ENS\nParis, France\n\nnicolas@le-roux.name\n\nmark.schmidt@inria.fr\n\nfrancis.bach@ens.fr\n\nAbstract\n\nWe propose a new stochastic gradient method for optimizing the sum of a \ufb01nite set\nof smooth functions, where the sum is strongly convex. While standard stochas-\ntic gradient methods converge at sublinear rates for this problem, the proposed\nmethod incorporates a memory of previous gradient values in order to achieve a\nlinear convergence rate. 
In a machine learning context, numerical experiments indicate that the new algorithm can dramatically outperform standard algorithms, both in terms of optimizing the training error and reducing the test error quickly.

1 Introduction

A plethora of problems arising in machine learning involve computing an approximate minimizer of the sum of a loss function over a large number of training examples, where there is a large amount of redundancy between examples. The most wildly successful class of algorithms for taking advantage of this type of problem structure is the class of stochastic gradient (SG) methods [1, 2]. Although the theory behind SG methods allows them to be applied more generally, in the context of machine learning SG methods are typically used to solve the problem of optimizing a sample average over a finite training set, i.e.,

minimize_{x ∈ R^p}  g(x) := (1/n) Σ_{i=1}^n f_i(x).   (1)

In this work, we focus on such finite training data problems where each f_i is smooth and the average function g is strongly-convex.

As an example, in the case of ℓ2-regularized logistic regression we have f_i(x) := (λ/2)‖x‖² + log(1 + exp(−b_i a_i^T x)), where a_i ∈ R^p and b_i ∈ {−1, 1} are the training examples associated with a binary classification problem and λ is a regularization parameter. More generally, any ℓ2-regularized empirical risk minimization problem of the form

minimize_{x ∈ R^p}  (λ/2)‖x‖² + (1/n) Σ_{i=1}^n l_i(x),   (2)

falls in the framework of (1) provided that the loss functions l_i are convex and smooth.
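For concreteness, the finite-sum objective (1) with the ℓ2-regularized logistic loss above can be written out directly. This is a toy sketch with made-up data; the function and variable names are our own illustrative choices, not code from the paper:

```python
import numpy as np

def make_logistic_finite_sum(A, b, lam):
    """f_i(x) = lam/2 * ||x||^2 + log(1 + exp(-b_i * a_i^T x)); g is their average."""
    def f(i, x):
        return 0.5 * lam * x.dot(x) + np.log1p(np.exp(-b[i] * A[i].dot(x)))
    def g(x):
        return np.mean([f(i, x) for i in range(len(b))])
    return f, g

A = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])  # toy examples a_i
b = np.array([1.0, 1.0, -1.0])                         # labels b_i in {-1, 1}
f, g = make_logistic_finite_sum(A, b, lam=0.1)
print(g(np.zeros(2)))  # log(2) at x = 0, since every margin is zero there
```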
An extensive list of convex loss functions used in machine learning is given by [3], and we can even include non-smooth loss functions (or regularizers) by using smooth approximations.

The standard full gradient (FG) method, which dates back to [4], uses iterations of the form

x^{k+1} = x^k − α_k g'(x^k) = x^k − (α_k/n) Σ_{i=1}^n f_i'(x^k).   (3)

Using x* to denote the unique minimizer of g, the FG method with a constant step size achieves a linear convergence rate:

g(x^k) − g(x*) = O(ρ^k),

for some ρ < 1 which depends on the condition number of g [5, Theorem 2.1.15]. Linear convergence is also known as geometric or exponential convergence, because the cost is cut by a fixed fraction on each iteration. Despite the fast convergence rate of the FG method, it can be unappealing when n is large because its iteration cost scales linearly in n. SG methods, on the other hand, have an iteration cost which is independent of n, making them suited for that setting. The basic SG method for optimizing (1) uses iterations of the form

x^{k+1} = x^k − α_k f_{i_k}'(x^k),   (4)

where α_k is a step size and a training example i_k is selected uniformly among the set {1, ..., n}. The randomly chosen gradient f_{i_k}'(x^k) yields an unbiased estimate of the true gradient g'(x^k), and one can show under standard assumptions that, for a suitably chosen decreasing step-size sequence {α_k}, the SG iterations achieve the sublinear convergence rate

E[g(x^k)] − g(x*) = O(1/k),

where the expectation is taken with respect to the selection of the i_k variables. Under certain assumptions this convergence rate is optimal for strongly-convex optimization in a model of computation where the algorithm only accesses the function through unbiased measurements of its objective and gradient (see [6, 7, 8]).
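A minimal sketch of the basic SG iteration (4): the toy quadratic functions, the classical step-size schedule α_k = 1/(µ(k+1)), and all names below are our own illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def sg(f_grads, x0, mu, iters, seed=0):
    """Basic stochastic gradient method (iteration (4)): at step k, pick i_k
    uniformly and take x <- x - alpha_k * f'_{i_k}(x), with the decreasing
    step sizes alpha_k = 1 / (mu * (k + 1)) that give the O(1/k) rate."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    n = len(f_grads)
    for k in range(iters):
        i = rng.integers(n)           # uniform sampling of a training example
        alpha = 1.0 / (mu * (k + 1))  # decreasing step-size sequence
        x -= alpha * f_grads[i](x)
    return x

# Toy strongly-convex problem: f_i(x) = 0.5 * (x - b_i)^2, so g is minimized at mean(b).
b = np.array([1.0, 2.0, 3.0, 4.0])
grads = [lambda x, bi=bi: x - bi for bi in b]
x = sg(grads, np.array([0.0]), mu=1.0, iters=2000)
print(x)  # close to mean(b) = 2.5
```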
Thus, we cannot hope to obtain a better convergence rate if the algorithm only relies on unbiased gradient measurements. Nevertheless, by using the stronger assumption that the functions are sampled from a finite dataset, in this paper we show that we can achieve an exponential convergence rate while preserving the iteration cost of SG methods.

The primary contribution of this work is the analysis of a new algorithm that we call the stochastic average gradient (SAG) method, a randomized variant of the incremental aggregated gradient (IAG) method of [9], which combines the low iteration cost of SG methods with a linear convergence rate as in FG methods. The SAG method uses iterations of the form

x^{k+1} = x^k − (α_k/n) Σ_{i=1}^n y_i^k,   (5)

where at each iteration a random training example i_k is selected and we set

y_i^k = f_i'(x^k) if i = i_k, and y_i^k = y_i^{k−1} otherwise.

That is, like the FG method, the step incorporates a gradient with respect to each training example. But, like the SG method, each iteration only computes the gradient with respect to a single training example and the cost of the iterations is independent of n. Despite the low cost of the SAG iterations, in this paper we show that the SAG iterations have a linear convergence rate, like the FG method. That is, by having access to i_k and by keeping a memory of the most recent gradient value computed for each training example i, this iteration achieves a faster convergence rate than is possible for standard SG methods.
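Iteration (5) can be sketched with the gradient memory stored explicitly and the sum of the y_i maintained incrementally. This is our own toy illustration, with a constant step size picked for the toy problem rather than a tuned or theoretically prescribed value:

```python
import numpy as np

def sag(f_grads, x0, alpha, iters, seed=0):
    """Stochastic average gradient (iteration (5)): keep the most recent
    gradient y_i for every example and step along their average."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    n = len(f_grads)
    y = np.zeros((n,) + x.shape)  # memory of the last gradient seen per example
    g_sum = np.zeros_like(x)      # running sum of the y_i, so each step is O(1) in n
    for _ in range(iters):
        i = rng.integers(n)       # random selection (vs. IAG's cyclic order)
        g_new = f_grads[i](x)
        g_sum += g_new - y[i]     # replace example i's old gradient in the sum
        y[i] = g_new
        x -= alpha * g_sum / n    # constant step size; no decreasing schedule
    return x

b = np.array([1.0, 2.0, 3.0, 4.0])            # f_i(x) = 0.5 * (x - b_i)^2
grads = [lambda x, bi=bi: x - bi for bi in b]
x = sag(grads, np.array([0.0]), alpha=0.25, iters=600)
print(x)  # converges to the minimizer mean(b) = 2.5
```

Note that, at the minimizer, the stored gradients sum to zero, so unlike plain SG there is no residual noise floor with a constant step size.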
Further, in terms of effective passes through the data, we also show that for certain problems the convergence rate of SAG is faster than is possible for the standard FG method.

In a machine learning context where g(x) is a training cost associated with a predictor parameterized by x, we are often ultimately interested in the testing cost, the expected loss on unseen data points. Note that a linear convergence rate for the training cost does not translate into a similar rate for the testing cost, and an appealing property of SG methods is that they achieve the optimal O(1/k) rate for the testing cost as long as every datapoint is seen only once. However, as is common in machine learning, we assume that we are only given a finite training data set and thus that datapoints are revisited multiple times. In this context, the analysis of SG methods only applies to the training cost and, although our analysis also focuses on the training cost, in our experiments the SAG method typically reached the optimal testing cost faster than both FG and SG methods.

The next section reviews closely-related algorithms from the literature, including previous attempts to combine the appealing aspects of FG and SG methods. However, despite 60 years of extensive research on SG methods, with most of the applications focusing on finite datasets, we are not aware of any other SG method that achieves a linear convergence rate while preserving the iteration cost of standard SG methods. Section 3 states the (standard) assumptions underlying our analysis and gives the main technical results; we first give a slow linear convergence rate that applies for any problem, and then give a very fast linear convergence rate that applies when n is sufficiently large. Section 4 discusses practical implementation issues, including how to reduce the storage cost from O(np) to O(n) when each f_i only depends on a linear combination of x.
Section 5 presents a numerical comparison of an implementation based on SAG to SG and FG methods, indicating that the method may be very useful for problems where we can only afford to do a few passes through a data set.

2 Related Work

There is a large variety of approaches available to accelerate the convergence of SG methods, and a full review of this immense literature would be outside the scope of this work. Below, we comment on the relationships between the new method and several of the most closely-related ideas.

Momentum: SG methods that incorporate a momentum term use iterations of the form

x^{k+1} = x^k − α_k f_{i_k}'(x^k) + β_k(x^k − x^{k−1}),

see [10]. It is common to set all β_k = β for some constant β, and in this case we can rewrite the SG with momentum method as

x^{k+1} = x^k − Σ_{j=1}^k α_j β^{k−j} f_{i_j}'(x^j).

We can re-write the SAG updates (5) in a similar form as

x^{k+1} = x^k − Σ_{j=1}^k α_k S(j, i_{1:k}) f_{i_j}'(x^j),   (6)

where the selection function S(j, i_{1:k}) is equal to 1/n if j corresponds to the last iteration where j = i_k and is set to 0 otherwise. Thus, momentum uses a geometric weighting of previous gradients while the SAG iterations select and average the most recent evaluation of each previous gradient. While momentum can lead to improved practical performance, it still requires the use of a decreasing sequence of step sizes and is not known to lead to a faster convergence rate.

Gradient Averaging: Closely related to momentum is using the sample average of all previous gradients,

x^{k+1} = x^k − (α_k/k) Σ_{j=1}^k f_{i_j}'(x^j),

which is similar to the SAG iteration in the form (5) but where all previous gradients are used.
This approach is used in the dual averaging method [11], and while this averaging procedure leads to convergence for a constant step size and can improve the constants in the convergence rate [12], it does not improve on the O(1/k) rate.

Iterate Averaging: Rather than averaging the gradients, some authors use the basic SG iteration but take an average over x^k values. With a suitable choice of step sizes, this gives the same asymptotic efficiency as Newton-like second-order SG methods and also leads to increased robustness of the convergence rate to the exact sequence of step sizes [13]. Baher's method [14, §1.3.4] combines gradient averaging with online iterate averaging, and also displays appealing asymptotic properties. The epoch SG method uses averaging to obtain the O(1/k) rate even for non-smooth objectives [15]. However, the convergence rates of these averaging methods remain sublinear.

Stochastic versions of FG methods: Various options are available to accelerate the convergence of the FG method for smooth functions, such as the accelerated full gradient (AFG) method of Nesterov [16], as well as classical techniques based on quadratic approximations such as non-linear conjugate gradient, quasi-Newton, and Hessian-free Newton methods. Several authors have analyzed stochastic variants of these algorithms [17, 18, 19, 20, 12]. Under certain conditions these variants are convergent with an O(1/k) rate [18]. Alternately, if we split the convergence rate into a deterministic and stochastic part, these methods can improve the dependency on the deterministic part [19, 12].
However, as with all other methods we have discussed thus far in this section, we are not aware of any existing method of this flavor that improves on the O(1/k) rate.

Constant step size: If the SG iterations are used with a constant step size (rather than a decreasing sequence), then the convergence rate of the method can be split into two parts [21, Proposition 2.4], where the first part depends on k and converges linearly to 0 and the second part is independent of k but does not converge to 0. Thus, with a constant step size the SG iterations have a linear convergence rate up to some tolerance, and in general after this point the iterations do not make further progress. Indeed, convergence of the basic SG method with a constant step size has only been shown under extremely strong assumptions about the relationship between the functions f_i [22]. This contrasts with the method we present in this work, which converges to the optimal solution using a constant step size and does so with a linear rate (without additional assumptions).

Accelerated methods: Accelerated SG methods, which despite their name are not related to the aforementioned AFG method, take advantage of the fast convergence rate of SG methods with a constant step size. In particular, accelerated SG methods use a constant step size by default, and only decrease the step size on iterations where the inner-product between successive gradient estimates is negative [23, 24]. This leads to convergence of the method and allows it to potentially achieve periods of linear convergence where the step size stays constant. However, the overall convergence rate of the method remains sublinear.

Hybrid Methods: Some authors have proposed variants of the SG method for problems of the form (1) that seek to gradually transform the iterates into the FG method in order to achieve a linear convergence rate.
Bertsekas proposes to go through the data cyclically with a specialized weighting that allows the method to achieve a linear convergence rate for strongly-convex quadratic functions [25]. However, the weighting is numerically unstable and the linear convergence rate treats full passes through the data as iterations. A related strategy is to group the f_i functions into 'batches' of increasing size and perform SG iterations on the batches [26]. In both cases, the iterations that achieve the linear rate have a cost that is not independent of n, as opposed to SAG.

Incremental Aggregated Gradient: Finally, Blatt et al. present the most closely-related algorithm, the IAG method [9]. This method is identical to the SAG iteration (5), but uses a cyclic choice of i_k rather than sampling the i_k values. This distinction has several important consequences. In particular, Blatt et al. are only able to show that the convergence rate is linear for strongly-convex quadratic functions (without deriving an explicit rate), and their analysis treats full passes through the data as iterations. Using a non-trivial extension of their analysis and a proof technique involving bounding the gradients and iterates simultaneously by a Lyapunov potential function, in this work we give an explicit linear convergence rate for general strongly-convex functions using SAG iterations that only examine a single training example. Further, as our analysis and experiments show, when the number of training examples is sufficiently large, the SAG iterations achieve a linear convergence rate under a much larger set of step sizes than the IAG method. This leads to more robustness to the selection of the step size and also, if suitably chosen, leads to a faster convergence rate and improved practical performance.
We also emphasize that in our experiments IAG and the basic FG method perform similarly, while SAG performs much better, showing that the simple change (random selection vs. cycling) can dramatically improve optimization performance.

3 Convergence Analysis

In our analysis we assume that each function f_i in (1) is differentiable and that each gradient f_i' is Lipschitz-continuous with constant L, meaning that for all x and y in R^p we have

‖f_i'(x) − f_i'(y)‖ ≤ L ‖x − y‖.

This is a fairly weak assumption on the f_i functions, and in cases where the f_i are twice-differentiable it is equivalent to saying that the eigenvalues of the Hessians of each f_i are bounded above by L. In addition, we also assume that the average function g = (1/n) Σ_{i=1}^n f_i is strongly-convex with constant µ > 0, meaning that the function x ↦ g(x) − (µ/2)‖x‖² is convex. This is a stronger assumption and is not satisfied by all machine learning models. However, note that in machine learning we are typically free to choose the regularizer, and we can always add an ℓ2-regularization term as in Eq. (2) to transform any convex problem into a strongly-convex problem (in this case we have µ ≥ λ). Note that strong-convexity implies that the problem is solvable, meaning that there exists some unique x* that achieves the optimal function value. Our convergence results assume that we initialize y_i^0 to a zero vector for all i, and our results depend on the variance of the gradient norms at the optimum x*, denoted by σ² = (1/n) Σ_i ‖f_i'(x*)‖². Finally, all our convergence results consider expectations with respect to the internal randomization of the algorithm, and not with respect to the data (which are assumed to be deterministic and fixed).

We first consider the convergence rate of the method when using a constant step size of α_k = 1/(2nL), which is similar to the step size needed for convergence of the IAG method in practice.

Proposition 1  With a constant step size of α_k = 1/(2nL), the SAG iterations satisfy for k ≥ 1:

E[‖x^k − x*‖²] ≤ (1 − µ/(8Ln))^k [3‖x^0 − x*‖² + (9σ²)/(4L²)].

The proof is given in the supplementary material. Note that the SAG iterations also trivially obtain the O(1/k) rate achieved by SG methods, since

(1 − µ/(8Ln))^k ≤ exp(−kµ/(8Ln)) ≤ 8Ln/(kµ) = O(n/k),

albeit with a constant which is proportional to n. Despite this constant, they are advantageous over SG methods in later iterations because they obtain an exponential convergence rate as in FG methods. We also note that an exponential convergence rate is obtained for any constant step size smaller than 1/(2nL).

In terms of passes through the data, the rate in Proposition 1 is similar to that achieved by the basic FG method.
However, our next result shows that, if the number of training examples is slightly larger than L/µ (which will often be the case, as discussed in Section 6), then the SAG iterations can use a larger step size and obtain a better convergence rate that is independent of µ and L (see proof in the supplementary material).

Proposition 2  If n ≥ 8L/µ, with a step size of α_k = 1/(2nµ) the SAG iterations satisfy for k ≥ n:

E[g(x^k) − g(x*)] ≤ C (1 − 1/(8n))^k,  with  C = (16L/(3n))‖x^0 − x*‖² + (4σ²/(3nµ))(8 log(1 + µn/(4L)) + 1).

We state this result for k ≥ n because we assume that the first n iterations of the algorithm use an SG method and that we initialize the subsequent SAG iterations with the average of the iterates, which leads to an O((log n)/k) rate. In contrast, using the SAG iterations from the beginning gives the same rate but with a constant proportional to n. Note that this bound is obtained when initializing all y_i to zero after the SG phase.1 However, in our experiments we do not use the SG initialization but rather use a minor variant of SAG (discussed in the next section), which appears more difficult to analyze but which gives better performance.

It is interesting to compare this convergence rate with the known convergence rates of first-order methods [5, see §2]. For example, if we take n = 100000, L = 100, and µ = 0.01 then the basic FG method has a rate of ((L − µ)/(L + µ))² = 0.9996 and the 'optimal' AFG method has a faster rate of (1 − √(µ/L)) = 0.9900. In contrast, running n iterations of SAG has a much faster rate of (1 − 1/(8n))^n = 0.8825 using the same number of evaluations of f_i'. Further, the lower bound for a black-box first-order method is ((√L − √µ)/(√L + √µ))² = 0.9608, indicating that SAG can be substantially faster than any FG method that does not use the structure of the problem.2 In the supplementary material, we compare Propositions 1 and 2 to the rates of primal and dual FG and coordinate-wise methods for the special case of ℓ2-regularized least squares.

Even though n appears in the convergence rate, if we perform n iterations of SAG (i.e., one effective pass through the data), the error is multiplied by (1 − 1/(8n))^n ≤ exp(−1/8), which is independent of n. Thus, each pass through the data reduces the excess cost by a constant multiplicative factor that is independent of the problem, as long as n ≥ 8L/µ. Further, while the step size in Proposition 2 depends on µ and n, we can obtain the same convergence rate by using a step size as large as α_k = 1/(16L). This is because the proposition is true for all values of µ satisfying µ/L ≥ 8/n, so we can choose the smallest possible value of µ = 8L/n. We have observed in practice that the IAG method with a step size of α_k = 1/(2nµ) may diverge, even under these assumptions. Thus, for certain problems the SAG iterations can tolerate a much larger step size, which leads to increased robustness to the selection of the step size.
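The per-pass rates in the comparison above can be checked with plain arithmetic (numbers as in the example in the text):

```python
import math

n, L, mu = 100000, 100.0, 0.01

fg  = ((L - mu) / (L + mu)) ** 2          # basic FG method
afg = 1 - math.sqrt(mu / L)               # 'optimal' AFG method
sag = (1 - 1 / (8 * n)) ** n              # n SAG iterations = one effective pass
lower = ((math.sqrt(L) - math.sqrt(mu)) /
         (math.sqrt(L) + math.sqrt(mu))) ** 2  # black-box first-order lower bound

print(round(fg, 4), round(afg, 4), round(sag, 4), round(lower, 4))
# 0.9996 0.99 0.8825 0.9608
```

Note also that (1 − 1/(8n))^n is already close to its limit exp(−1/8) ≈ 0.8825, which is why the per-pass factor is essentially independent of n.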
Further, as our analysis and experiments indicate, the ability to use a large step size leads to improved performance of the SAG iterations.

While we have stated Proposition 1 in terms of the iterates and Proposition 2 in terms of the function values, the rates obtained on iterates and function values are equivalent because, by the Lipschitz and strong-convexity assumptions, we have (µ/2)‖x^k − x*‖² ≤ g(x^k) − g(x*) ≤ (L/2)‖x^k − x*‖².

1 While it may appear suboptimal to not use the gradients computed during the n iterations of stochastic gradient descent, using them only improves the bound by a constant.
2 Note that L in the SAG rates is based on the f_i' functions, while in the FG methods it is based on g', which can be much smaller.

4 Implementation Details

In this section we describe modifications that substantially reduce the SAG iteration's memory requirements, as well as modifications that lead to better practical performance.

Structured gradients: For many problems the storage cost of O(np) for the y_i^k vectors is prohibitive, but we can often use structure in the f_i' to reduce this cost. For example, many loss functions f_i take the form f_i(a_i^T x) for a vector a_i. Since a_i is constant, for these problems we only need to store the scalar f_i'(u_i^k) for u_i^k = a_i^T x^k rather than the full gradient a_i f_i'(u_i^k), reducing the storage cost to O(n). Further, because of the simple form of the SAG updates, if a_i is sparse we can use 'lazy updates' in order to reduce the iteration cost from O(p) down to the sparsity level of a_i.

Mini-batches: To employ vectorization and parallelism, practical SG implementations often group training examples into 'mini-batches' and perform SG iterations on the mini-batches. We can also use mini-batches within the SAG iterations, and for problems with dense gradients this decreases the storage requirements of the algorithm since we only need a y_i^k for each mini-batch. Thus, for example, using mini-batches of size 100 leads to a 100-fold reduction in the storage cost.

Step-size re-weighting: On early iterations of the SAG algorithm, when most y_i^k are set to the uninformative zero vector, rather than dividing α_k in (5) by n we found it was more effective to divide by m, the number of unique i_k values that we have sampled so far (which converges to n). This modification appears more difficult to analyze, but with this modification we found that the SAG algorithm outperformed the SG/SAG hybrid algorithm analyzed in Proposition 2.

Exact regularization: For regularized objectives like (2) we can use the exact gradient of the regularizer, rather than approximating it.
For example, our experiments on ℓ2-regularized optimization problems used the recursion

d ← d − y_i,  y_i ← l_i'(x^k),  d ← d + y_i,  x ← (1 − αλ)x − (α/m)d.   (7)

This can be implemented efficiently for sparse data sets by using the representation x = κz, where κ is a scalar and z is a vector, since the update based on the regularizer simply updates κ.

Large step sizes: Proposition 1 requires α_k ≤ 1/(2Ln) while under an additional assumption Proposition 2 allows α_k ≤ 1/(16L). In practice we observed better performance using step sizes of α_k = 1/L and α_k = 2/(L + nµ). These step sizes seem to work even when the additional assumption of Proposition 2 is not satisfied, and we conjecture that the convergence rates under these step sizes are much faster than the rate obtained in Proposition 1 for the general case.

Line search: Since L is generally not known, we experimented with a basic line search, where we start with an initial estimate L^0, and we double this estimate whenever we do not satisfy the instantiated Lipschitz inequality

f_{i_k}(x^k − (1/L^k) f_{i_k}'(x^k)) ≤ f_{i_k}(x^k) − (1/(2L^k))‖f_{i_k}'(x^k)‖².

To avoid instability caused by comparing very small numbers, we only do this test when ‖f_{i_k}'(x^k)‖² > 10^{−8}. To allow the algorithm to potentially achieve a faster rate due to a higher degree of local smoothness, we multiply L^k by 2^{(−1/n)} after each iteration.

5 Experimental Results

Our experiments compared an extensive variety of competitive FG and SG methods. In the supplementary material we compare to the IAG method and an extensive variety of SG methods, and we allow these competing methods to choose the best step size in hindsight.
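The line search of Section 4 (double L^k when the instantiated Lipschitz inequality is violated, then let it decay by a factor 2^(−1/n)) can be sketched as follows. The toy function and all names are our own illustrative assumptions:

```python
import numpy as np

def update_lipschitz(f_i, g_i, x, Lk, n):
    """One step of the Lipschitz line search: double the estimate L_k until
    f_i(x - g/L_k) <= f_i(x) - ||g||^2 / (2 L_k), then decay it slightly."""
    g = g_i(x)
    gnorm2 = g.dot(g)
    if gnorm2 > 1e-8:                         # skip the test for tiny gradients
        while f_i(x - g / Lk) > f_i(x) - gnorm2 / (2 * Lk):
            Lk *= 2.0                         # inequality violated: estimate too small
    return Lk * 2 ** (-1.0 / n)               # allow L_k to shrink again over time

# Example: f(x) = 2 * x^2 has a Lipschitz-continuous gradient with L = 4.
f = lambda x: 2 * x.dot(x)
grad = lambda x: 4 * x
Lk = 1.0
for _ in range(5):
    Lk = update_lipschitz(f, grad, np.array([1.0]), Lk, n=100)
print(Lk)  # stays between the true constant L = 4 and 2L = 8
```

Because the estimate is doubled on violations, it can overshoot the true constant by up to a factor of two; the 2^(−1/n) decay then slowly pulls it back down.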
However, our experiments in the main paper focus on the following methods, which we chose because they have no dataset-dependent tuning parameters:

– Steepest: The full gradient method described by iteration (3), with a line search that uses cubic Hermite polynomial interpolation to find a step size satisfying the strong Wolfe conditions, and where the parameters of the line search were tuned for the problems at hand.
– AFG: Nesterov's accelerated full gradient method [16], where iterations of (3) with a fixed step size are interleaved with an extrapolation step, and we use an adaptive line search based on [27].
– L-BFGS: A publicly-available limited-memory quasi-Newton method that has been tuned for log-linear models.3 This method is by far the most complicated method we considered.
– Pegasos: The state-of-the-art SG method described by iteration (4) with a step size of α_k = 1/µk and a projection step onto a norm-ball known to contain the optimal solution [28].
– RDA: The regularized dual averaging method of [12], another recent state-of-the-art SG method.
– ESG: The epoch SG method of [15], which runs SG with a constant step size and averaging in a series of epochs, and is optimal for non-smooth stochastic strongly-convex optimization.
– NOSG: The nearly-optimal SG method of [19], which combines ideas from SG and AFG methods to obtain a nearly-optimal dependency on a variety of problem-dependent constants.
– SAG: The proposed stochastic average gradient method described by iteration (5) using the modifications discussed in the previous section. We used a step size of α_k = 2/(L^k + nλ) where L^k is either set constant to the global Lipschitz constant (SAG-C) or set by adaptively estimating the constant with respect to the logistic loss function using the line search described in the previous section (SAG-LS). The SAG-LS method was initialized with L^0 = 1.

3 http://www.di.ens.fr/~mschmidt/Software/minFunc.html

In all the experiments, we measure the training and testing costs as a function of the number of effective passes through the data, measured as the number of f_i' evaluations divided by n. These results are thus independent of the practical implementation of the algorithms.

The theoretical convergence rates suggest the following strategies for deciding on whether to use an FG or an SG method:

1. If we can only afford one pass through the data, then an SG method should be used.
2. If we can afford to do many passes through the data (say, several hundred), then an FG method should be used.

We expect that the SAG iterations will be most useful between these two extremes, where we can afford to do more than one pass through the data but cannot afford to do enough passes to warrant using FG algorithms like L-BFGS. To test whether this is indeed the case on real data sets, we performed experiments on a set of freely available benchmark binary classification data sets.

Figure 1: Comparison of optimization strategies for ℓ2-regularized logistic regression. Top: training excess cost. Bottom: testing cost. From left to right are the results on the protein, rcv1 and covertype data sets. This figure is best viewed in colour.
The protein (n = 145751, p = 74) data set was obtained from the KDD Cup 2004 website,4 while the rcv1 (n = 20242, p = 47236) and covertype (n = 581012, p = 54) data sets were obtained from the LIBSVM data website.5 Although our method can be applied to any differentiable function, on these data sets we focus on an ℓ2-regularized logistic regression problem, with λ = 1/n. We split each dataset in two, training on one half and testing on the other half. We added a (regularized) bias term to all data sets, and for dense features we standardized so that they would have a mean of zero and a variance of one. We plot the training and testing costs of the different methods for 30 effective passes through the data in Figure 1. In the supplementary material, we present additional experimental results including the test classification accuracy and results on different data sets.

We can observe several trends across the experiments from both the main paper and the supplementary material.

4 http://osmot.cs.cornell.edu/kddcup
5 http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets
SteepestAFGL\u2212BFGSpegasosRDAESGNOSGSAG\u2212CSAG\u2212LS\f\u2013 FG vs. SG: Although the performance of SG methods can be catastrophic if the step size is not\nchosen carefully (e.g., the covertype data), with a carefully-chosen step-size the SG methods\ndo substantially better than FG methods on the \ufb01rst few passes through the data (e.g., the rcv1\ndata). In contrast, FG methods are not sensitive to the step size and because of their steady\nprogress we also see that FG methods slowly catch up to the SG methods and eventually (or\nwill eventually) pass them (e.g., the protein data).\n\n\u2013 (FG and SG) vs. SAG: The SAG iterations seem to achieve the best of both worlds. They\nstart out substantially better than FG methods, but continue to make steady (linear) progress\nwhich leads to better performance than SG methods. In some cases (protein and covertype), the\nsigni\ufb01cant speed-up observed for SAG in reaching low training costs also translates to reaching\nthe optimal testing cost more quickly than the other methods.\n\n(cid:62) 8\n\nn is satis\ufb01ed when \u03bb (cid:62) 8\u03beR2\n\nn\u03b2 with \u03b2 < 1 in a non-parametric setting.\n\n\u2013 IAG vs. SAG: Our experiments (in the supplementary material) show that the IAG method\nperforms similar to the regular FG method, and they also show the surprising result that the\nrandomized SAG method outperforms the closely-related deterministic IAG method by a very\nlarge margin. 
This is due to the larger step sizes used by the SAG iterations, which would cause\nIAG to diverge.\n6 Discussion\nOptimal regularization strength: One might wonder if the additional hypothesis in Proposition 2\nis satis\ufb01ed in practice.\nIn a learning context, where each function fi is the loss associated to a\nsingle data point, L is equal to the largest value of the loss second derivative \u03be (1 for the square\nloss, 1/4 for the logistic loss) times R2, where R is a the uniform bound on the norm of each data\npoint. Thus, the constraint \u00b5\nn . In low-dimensional settings, the\nL\noptimal regularization parameter is of the form C/n [29] where C is a scalar constant, and may thus\n\u221a\nviolate the constraint. However, the improvement with respect to regularization parameters of the\nn is known to be asymptotically negligible, and in any case in such low-dimensional\nform \u03bb = C/\nsettings, regular stochastic or batch gradient descent may be ef\ufb01cient enough in practice. In the\nmore interesting high-dimensional settings where the dimension p of our covariates is not small\ncompared to the sample size n, then all theoretical analyses we are aware of advocate settings of \u03bb\nwhich satisfy this constraint. For example, [30] considers parameters of the form \u03bb = C\u221a\nn in the\nparametric setting, while [31] considers \u03bb = C\nTraining cost vs. testing cost: The theoretical contribution of this work is limited to the convergence\nrate of the training cost. Though there are several settings where this is the metric of interest (e.g.,\nvariational inference in graphical models), in many cases one will be interested in the convergence\nspeed of the testing cost. 
Since the O(1/k) convergence rate of the testing cost, achieved by SG methods with decreasing step sizes (and a single pass through the data), is provably optimal when the algorithm only accesses the function through unbiased measurements of the objective and its gradient, it is unlikely that one can obtain a linear convergence rate for the testing cost with the SAG iterations. However, as shown in our experiments, the testing cost of the SAG iterates often reaches its minimum more quickly than existing SG methods, and we could expect to improve the constant in the O(1/k) convergence rate, as is the case with online second-order methods [32].

Step-size selection and termination criteria: The three major disadvantages of SG methods are: (i) the slow convergence rate, (ii) deciding when to terminate the algorithm, and (iii) choosing the step size while running the algorithm. This paper showed that the SAG iterations achieve a much faster convergence rate, but the SAG iterations may also be advantageous in terms of tuning step sizes and designing termination criteria. In particular, the SAG iterations suggest a natural termination criterion; since the average of the yᵢᵏ variables converges to g′(xᵏ) as ‖xᵏ − xᵏ⁻¹‖ converges to zero, we can use (1/n)‖Σᵢ yᵢᵏ‖ as an approximation of the optimality of xᵏ. Further, while SG methods require specifying a sequence of step sizes, and misspecifying this sequence can have a disastrous effect on the convergence rate [7, §2.1], our theory shows that the SAG iterations achieve a linear convergence rate for any sufficiently small constant step size, and our experiments indicate that a simple line-search gives strong performance.

Acknowledgements
Nicolas Le Roux, Mark Schmidt, and Francis Bach are supported by the European Research Council (SIERRA-ERC-239993). Mark Schmidt is also supported by a postdoctoral fellowship from the Natural Sciences and Engineering Research Council of Canada (NSERC).

References
[1] H. Robbins and S. Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22(3):400–407, 1951.
[2] L. Bottou and Y. LeCun. Large scale online learning. NIPS, 2003.
[3] C. H. Teo, Q. Le, A. J. Smola, and S. V. N. Vishwanathan. A scalable modular convex solver for regularized risk minimization. KDD, 2007.
[4] M. A. Cauchy. Méthode générale pour la résolution des systèmes d'équations simultanées. Comptes rendus des séances de l'Académie des sciences de Paris, 25:536–538, 1847.
[5] Y. Nesterov. Introductory lectures on convex optimization: A basic course. Springer, 2004.
[6] A. Nemirovski and D. B. Yudin. Problem complexity and method efficiency in optimization. Wiley, 1983.
[7] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
[8] A. Agarwal, P. L. Bartlett, P. Ravikumar, and M. J. Wainwright. Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization. IEEE Transactions on Information Theory, 58(5), 2012.
[9] D. Blatt, A. O. Hero, and H. Gauchman. A convergent incremental gradient method with a constant step size. SIAM Journal on Optimization, 18(1):29–51, 2007.
[10] P. Tseng. An incremental gradient(-projection) method with momentum term and adaptive stepsize rule. SIAM Journal on Optimization, 8(2):506–531, 1998.
[11] Y. Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming, 120(1):221–259, 2009.
[12] L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11:2543–2596, 2010.
[13] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.
[14] H. J. Kushner and G. Yin. Stochastic approximation and recursive algorithms and applications. Springer-Verlag, second edition, 2003.
[15] E. Hazan and S. Kale. Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization. COLT, 2011.
[16] Y. Nesterov. A method for unconstrained convex minimization problem with the rate of convergence O(1/k²). Doklady AN SSSR, 269(3):543–547, 1983.
[17] N. N. Schraudolph. Local gain adaptation in stochastic gradient descent. ICANN, 1999.
[18] P. Sunehag, J. Trumpf, S. V. N. Vishwanathan, and N. Schraudolph. Variable metric stochastic approximation theory. International Conference on Artificial Intelligence and Statistics, 2009.
[19] S. Ghadimi and G. Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization. Optimization Online, July 2010.
[20] J. Martens. Deep learning via Hessian-free optimization. ICML, 2010.
[21] A. Nedic and D. Bertsekas. Convergence rate of incremental subgradient algorithms. In Stochastic Optimization: Algorithms and Applications, pages 263–304. Kluwer Academic, 2000.
[22] M. V. Solodov. Incremental gradient algorithms with stepsizes bounded away from zero. Computational Optimization and Applications, 11(1):23–35, 1998.
[23] H. Kesten. Accelerated stochastic approximation. Annals of Mathematical Statistics, 29(1):41–59, 1958.
[24] B. Delyon and A. Juditsky. Accelerated stochastic approximation. SIAM Journal on Optimization, 3(4):868–881, 1993.
[25] D. P. Bertsekas. A new class of incremental gradient methods for least squares problems. SIAM Journal on Optimization, 7(4):913–926, 1997.
[26] M. P. Friedlander and M. Schmidt. Hybrid deterministic-stochastic methods for data fitting. SIAM Journal on Scientific Computing, 34(3):A1351–A1379, 2012.
[27] J. Liu, J. Chen, and J. Ye. Large-scale sparse logistic regression. KDD, 2009.
[28] S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. ICML, 2007.
[29] P. Liang, F. Bach, and M. I. Jordan. Asymptotically optimal regularization in smooth parametric models. NIPS, 2009.
[30] K. Sridharan, S. Shalev-Shwartz, and N. Srebro. Fast rates for regularized objectives. NIPS, 2008.
[31] M. Eberts and I. Steinwart. Optimal learning rates for least squares SVMs using Gaussian kernels. NIPS, 2011.
[32] L. Bottou and O. Bousquet. The tradeoffs of large scale learning. NIPS, 2007.