{"title": "StopWasting My Gradients: Practical SVRG", "book": "Advances in Neural Information Processing Systems", "page_first": 2251, "page_last": 2259, "abstract": "We present and analyze several strategies for improving the performance ofstochastic variance-reduced gradient (SVRG) methods. We first show that theconvergence rate of these methods can be preserved under a decreasing sequenceof errors in the control variate, and use this to derive variants of SVRG that usegrowing-batch strategies to reduce the number of gradient calculations requiredin the early iterations. We further (i) show how to exploit support vectors to reducethe number of gradient computations in the later iterations, (ii) prove that thecommonly\u2013used regularized SVRG iteration is justified and improves the convergencerate, (iii) consider alternate mini-batch selection strategies, and (iv) considerthe generalization error of the method.", "full_text": "Stop Wasting My Gradients: Practical SVRG\n\nReza Babanezhad1, Mohamed Osama Ahmed1, Alim Virani2, Mark Schmidt1\n\nDepartment of Computer Science\nUniversity of British Columbia\n\n1{rezababa, moahmed, schmidtm}@cs.ubc.ca,2alim.virani@gmail.com\n\nJakub Kone\u02c7cn\u00b4y\n\nSchool of Mathematics\nUniversity of Edinburgh\nkubo.konecny@gmail.com\n\nDepartment of Electrical and Computer Engineering\n\nScott Sallinen\n\nUniversity of British Columbia\n\nscotts@ece.ubc.ca\n\nAbstract\n\nWe present and analyze several strategies for improving the performance of\nstochastic variance-reduced gradient (SVRG) methods. We \ufb01rst show that the\nconvergence rate of these methods can be preserved under a decreasing sequence\nof errors in the control variate, and use this to derive variants of SVRG that use\ngrowing-batch strategies to reduce the number of gradient calculations required\nin the early iterations. 
We further (i) show how to reduce the number of gradient computations in the later iterations by exploiting support vectors, (ii) prove that the commonly-used regularized SVRG iteration is justified and improves the convergence rate, (iii) consider alternate mini-batch selection strategies, and (iv) consider the generalization error of the method.

1 Introduction

We consider the problem of optimizing the average of a finite but large sum of smooth functions,

min over x ∈ Rd of f(x) = (1/n) ∑i=1..n fi(x).   (1)

A huge proportion of the model-fitting procedures in machine learning can be mapped to this problem. This includes classic models like least squares and logistic regression, but also more advanced methods like conditional random fields and deep neural network models. In the high-dimensional setting (large d), the traditional approaches for solving (1) are: full gradient (FG) methods, which have linear convergence rates but need to evaluate the gradient f′i for all n examples on every iteration, and stochastic gradient (SG) methods, which make rapid initial progress as they only use a single gradient on each iteration but ultimately have slower sublinear convergence rates.

Le Roux et al. [1] proposed the first general method, stochastic average gradient (SAG), that only considers one training example on each iteration but still achieves a linear convergence rate. Other methods have subsequently been shown to have this property [2, 3, 4], but these all require storing a previous evaluation of the gradient f′i or the dual variables for each i. For many objectives this only requires O(n) space, but for general problems it requires O(nd) space, making them impractical. Recently, several methods have been proposed with similar convergence rates to SAG but without the memory requirements [5, 6, 7, 8].
They are known as mixed gradient, stochastic variance-reduced gradient (SVRG), and semi-stochastic gradient methods (we will use SVRG). We give a canonical SVRG algorithm in the next section, but the salient features of these methods are that they evaluate two gradients on each iteration and occasionally must compute the gradient on all examples. SVRG methods often dramatically outperform classic FG and SG methods, but these extra evaluations mean that SVRG is slower than SG methods in the important early iterations. They also mean that SVRG methods are typically slower than memory-based methods like SAG.

In this work we first show that SVRG is robust to inexact calculation of the full gradients it requires (§3), provided the accuracy increases over time. We use this to explore growing-batch strategies that require fewer gradient evaluations when far from the solution, and we propose a mixed SG/SVRG method that may improve performance in the early iterations (§4). We next explore using support vectors to reduce the number of gradients required when close to the solution (§5), give a justification for the regularized SVRG update that is commonly used in practice (§6), consider alternative mini-batch strategies (§7), and finally consider the generalization error of the method (§8).

2 Notation and SVRG Algorithm

SVRG assumes f is µ-strongly convex, each fi is convex, and each gradient f′i is Lipschitz-continuous with constant L. The method begins with an initial estimate x0, sets x0 = x0, and then generates a sequence of iterates xt using

xt = xt−1 − η(f′it(xt−1) − f′it(xs) + µs),   (2)

where η is the positive step size, we set µs = f′(xs), and the index it is chosen uniformly from {1, 2, . . . , n}. After every m steps, we set xs+1 = xt for a random t ∈ {1, . . .
, m}, and we reset t = 0 with x0 = xs+1.

To analyze the convergence rate of SVRG, we will find it convenient to define the function

ρ(a, b) = (1/(1 − 2ηa)) (1/(mµη) + 2bη),

as it appears repeatedly in our results. We will use ρ(a) to indicate the value of ρ(a, b) when a = b, and we will simply use ρ for the special case when a = b = L. Johnson & Zhang [6] show that if η and m are chosen such that 0 < ρ < 1, the algorithm achieves a linear convergence rate of the form

E[f(xs+1) − f(x∗)] ≤ ρ E[f(xs) − f(x∗)],

where x∗ is the optimal solution. This convergence rate is very fast for appropriate η and m. While this result relies on constants we may not know in general, practical choices with good empirical performance include setting m = n, η = 1/L, and using xs+1 = xm rather than a random iterate.

Unfortunately, the SVRG algorithm requires 2m + n gradient evaluations for every m iterations of (2), since updating xt requires two gradient evaluations and computing µs requires n gradient evaluations. We can reduce this to m + n if we store the gradients f′i(xs), but this is not practical in most applications. Thus, SVRG requires many more gradient evaluations than the classic SG iterations of memory-based methods like SAG.

3 SVRG with Error

We first give a result for the SVRG method where we assume that µs is equal to f′(xs) up to some error es. This is in the spirit of the analysis of [9], who analyze FG methods under similar assumptions. We assume that ‖xt − x∗‖ ≤ Z for all t, which has been used in related work [10] and is reasonable because of the coercivity implied by strong convexity.

Proposition 1.
If µs = f′(xs) + es and we set η and m so that ρ < 1, then the SVRG algorithm (2) with xs+1 chosen randomly from {x1, x2, . . . , xm} satisfies

E[f(xs+1) − f(x∗)] ≤ ρ E[f(xs) − f(x∗)] + (Z E‖es‖ + η E‖es‖²)/(1 − 2ηL).

We give the proof in Appendix A. This result implies that SVRG does not need a very accurate approximation of f′(xs) in the crucial early iterations, since the first term in the bound will dominate. Further, this result implies that we can maintain the exact convergence rate of SVRG as long as the errors es decrease at an appropriate rate. For example, we obtain the same convergence rate provided that max{E‖es‖, E‖es‖²} ≤ γρ̃s for any γ ≥ 0 and some ρ̃ < ρ. Further, we still obtain a linear convergence rate as long as ‖es‖ converges to zero with a linear convergence rate.

Algorithm 1 Batching SVRG
  Input: initial vector x0, update frequency m, learning rate η.
  for s = 0, 1, 2, . . . do
    Choose batch size |Bs|
    Bs = |Bs| elements sampled without replacement from {1, 2, . . . , n}.
    µs = (1/|Bs|) ∑i∈Bs f′i(xs)
    x0 = xs
    for t = 1, 2, . . . , m do
      Randomly pick it ∈ {1, . . . , n}
      xt = xt−1 − η(f′it(xt−1) − f′it(xs) + µs)   (∗)
    end for
    option I: set xs+1 = xm
    option II: set xs+1 = xt for random t ∈ {1, . . . , m}
  end for

3.1 Non-Uniform Sampling

Xiao & Zhang [11] show that non-uniform sampling (NUS) improves the performance of SVRG. They assume each f′i is Li-Lipschitz continuous, and sample it = i with probability Li/(nL̄), where L̄ = (1/n) ∑i=1..n Li.
The iteration is then changed to

xt = xt−1 − η((L̄/Lit)[f′it(xt−1) − f′it(xs)] + µs),

which maintains that the search direction is unbiased. In Appendix A, we show that if µs is computed with error for this algorithm and if we set η and m so that 0 < ρ(L̄) < 1, then we have a convergence rate of

E[f(xs+1) − f(x∗)] ≤ ρ(L̄) E[f(xs) − f(x∗)] + (Z E‖es‖ + η E‖es‖²)/(1 − 2ηL̄),

which can be faster since the average L̄ may be much smaller than the maximum value L.

3.2 SVRG with Batching

There are many ways we could allow an error in the calculation of µs to speed up the algorithm. For example, if evaluating each f′i involves solving an optimization problem, then we could solve this optimization problem inexactly; if we are fitting a graphical model with an iterative approximate inference method, we can terminate the iterations early to save time. When the fi are simple but n is large, a natural way to approximate µs is with a subset (or ‘batch’) of training examples Bs (chosen without replacement),

µs = (1/|Bs|) ∑i∈Bs f′i(xs).

The batch size |Bs| controls the error in the approximation, and we can drive the error to zero by increasing it to n.
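To make the batching strategy concrete, here is a minimal NumPy sketch of Algorithm 1 under the definitions above. The function name and the simple gradient-oracle interface (`grad_i(i, x)` returning f′i(x)) are our own illustrative choices, not code from the paper, and option I (take the last inner iterate) is used for simplicity.

```python
import numpy as np

def batching_svrg(grad_i, n, x0, eta, m, batch_sizes, rng=None):
    """Sketch of Algorithm 1: SVRG where the full gradient mu^s is
    approximated by an average over a batch sampled without replacement.

    grad_i(i, x): gradient of the i-th term f_i at x.
    batch_sizes[s]: the batch size |B^s| used at outer iteration s.
    """
    rng = rng or np.random.default_rng(0)
    x_snap = np.asarray(x0, dtype=float).copy()
    for b in batch_sizes:
        # mu^s: average gradient over the batch B^s
        batch = rng.choice(n, size=b, replace=False)
        mu = np.mean([grad_i(i, x_snap) for i in batch], axis=0)
        x = x_snap.copy()
        for _ in range(m):
            i = int(rng.integers(n))  # uniform inner sampling
            x = x - eta * (grad_i(i, x) - grad_i(i, x_snap) + mu)
        x_snap = x  # option I: take the last inner iterate
    return x_snap
```

Passing `batch_sizes = [n] * epochs` recovers the standard SVRG method, while a doubling schedule gives the growing-batch variant studied here.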
Existing SVRG methods correspond to the special case where |Bs| = n for all s. Algorithm 1 gives pseudo-code for an SVRG implementation that uses this sub-sampling strategy.

If we assume that the sample variance of the norms of the gradients is bounded by S² for all xs,

(1/(n − 1)) ∑i=1..n [‖f′i(xs)‖² − ‖f′(xs)‖²] ≤ S²,

then we have that [12, Chapter 2]

E‖es‖² ≤ ((n − |Bs|)/(n|Bs|)) S².

So if we want E‖es‖² ≤ γρ̃2s, where γ ≥ 0 is a constant and ρ̃ < 1, we need

|Bs| ≥ nS²/(S² + nγρ̃2s).   (3)

Algorithm 2 Mixed SVRG and SG Method
  Replace (∗) in Algorithm 1 with the following lines:
  if it ∈ Bs then
    xt = xt−1 − η(f′it(xt−1) − f′it(xs) + µs)
  else
    xt = xt−1 − η f′it(xt−1)
  end if

If the batch size satisfies the above condition then

Z E‖es−1‖ + η E‖es−1‖² ≤ Z√γ ρ̃s + ηγρ̃2s ≤ 2 max{Z√γ, ηγρ̃} ρ̃s,

and the convergence rate of SVRG is unchanged compared to using the full batch on all iterations. The condition (3) guarantees a linear convergence rate under any exponentially-increasing sequence of batch sizes, the strategy suggested by [13] for classic SG methods. However, a tedious calculation shows that (3) has an inflection point at s = log(S²/γn)/(2 log(1/ρ̃)), corresponding to |Bs| = n/2. This was previously observed empirically [14, Figure 3], and occurs because we are sampling without replacement.
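The batch-size condition (3) is easy to turn into a schedule. The short sketch below does this directly; the function name and parameter choices are our own illustration, not the paper's code.

```python
import math

def batch_size_schedule(n, S2, gamma, rho_tilde, num_epochs):
    """Batch sizes satisfying condition (3):
    |B^s| >= n*S2 / (S2 + n*gamma*rho_tilde^(2s)),
    which keeps E||e^s||^2 <= gamma * rho_tilde^(2s)."""
    sizes = []
    for s in range(num_epochs):
        required = n * S2 / (S2 + n * gamma * rho_tilde ** (2 * s))
        sizes.append(min(n, math.ceil(required)))
    return sizes
```

The schedule grows roughly geometrically at first, passes n/2 near the inflection point s = log(S²/γn)/(2 log(1/ρ̃)), and then saturates at n, matching the observation that the batch size need not grow exponentially forever.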
This transition means we don’t need to increase the batch size exponentially.

4 Mixed SG and SVRG Method

An approximate µs can drastically reduce the computational cost of the SVRG algorithm, but does not affect the 2 in the 2m + n gradients required for m SVRG iterations. This factor of 2 is significant in the early iterations, since this is when stochastic methods make the most progress and when we typically see the largest reduction in the test error.

To reduce this factor, we can consider a mixed strategy: if it is in the batch Bs then perform an SVRG iteration, but if it is not in the current batch then use a classic SG iteration. We illustrate this modification in Algorithm 2. This modification allows the algorithm to take advantage of the rapid initial progress of SG, since it predominantly uses SG iterations when far from the solution. Below we give a convergence rate for this mixed strategy.

Proposition 2. Let µs = f′(xs) + es, and suppose we set η and m so that 0 < ρ(L, αL) < 1 with α = |Bs|/n. If we assume E‖f′i(x)‖² ≤ σ², then Algorithm 2 has

E[f(xs+1) − f(x∗)] ≤ ρ(L, αL) E[f(xs) − f(x∗)] + (Z E‖es‖ + η E‖es‖² + (ησ²/2)(1 − α))/(1 − 2ηL).

We give the proof in Appendix B. The extra term depending on the variance σ² is typically the bottleneck for SG methods. Classic SG methods require the step size η to converge to zero because of this term. However, the mixed SG/SVRG method can keep the fast progress from using a constant η, since the term depending on σ² converges to zero as α converges to one.
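The mixed strategy of Algorithm 2 changes only the inner loop. A minimal sketch, under the same illustrative gradient-oracle interface as before (the names are ours, not the paper's):

```python
import numpy as np

def mixed_sg_svrg(grad_i, n, x0, eta, m, batch_sizes, rng=None):
    """Sketch of Algorithm 2: take an SVRG step when i_t lies in the batch
    B^s used to form mu^s, and a plain SG step (no correction) otherwise."""
    rng = rng or np.random.default_rng(0)
    x_snap = np.asarray(x0, dtype=float).copy()
    for b in batch_sizes:
        batch = set(rng.choice(n, size=b, replace=False).tolist())
        mu = np.mean([grad_i(i, x_snap) for i in batch], axis=0)
        x = x_snap.copy()
        for _ in range(m):
            i = int(rng.integers(n))
            if i in batch:
                # variance-reduced step: f'_i(x^s) was computed for mu^s
                x = x - eta * (grad_i(i, x) - grad_i(i, x_snap) + mu)
            else:
                # classic SG step for examples outside the batch
                x = x - eta * grad_i(i, x)
        x_snap = x
    return x_snap
```

As the batch grows to n, the SG branch fires less and less often and the method reverts to plain SVRG, which is how the σ² term in Proposition 2 vanishes as α → 1.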
Since α < 1 implies that ρ(L, αL) < ρ, this result implies that when [f(xs) − f(x∗)] is large compared to es and σ², the mixed SG/SVRG method actually converges faster.

Sharing a single step size η between the SG and SVRG iterations in Proposition 2 is sub-optimal. For example, if x is close to x∗ and |Bs| ≈ n, then the SG iteration might actually take us far away from the minimizer. Thus, we may want to use a decreasing sequence of step sizes for the SG iterations. In Appendix B, we show that using η = O∗(√((n − |B|)/(n|B|))) for the SG iterations can improve the dependence on the error es and variance σ².

5 Using Support Vectors

Using a batch Bs decreases the number of gradient evaluations required when SVRG is far from the solution, but its benefit diminishes over time. However, for certain objectives we can further reduce the number of gradient evaluations by identifying support vectors.

Algorithm 3 Heuristic for skipping evaluations of fi at x
  if ski = 0 then
    compute f′i(x).
    if f′i(x) = 0 then
      psi = psi + 1.   {Update the number of consecutive times f′i(x) was zero.}
      ski = 2^max{0, psi − 2}.   {Skip an exponential number of future evaluations if it remains zero.}
    else
      psi = 0.   {This could be a support vector, do not skip it next time.}
    end if
    return f′i(x).
  else
    ski = ski − 1.   {In this case, we skip the evaluation.}
    return 0.
  end if

For example, consider minimizing the Huberized hinge loss (HSVM) with threshold ε [15],

min over x ∈ Rd of (1/n) ∑i=1..n f(bi aiT x),  where
f(τ) = 0 if τ > 1 + ε,
f(τ) = 1 − τ if τ < 1 − ε,
f(τ) = (1 + ε − τ)²/(4ε) if |1 − τ| ≤ ε.

In terms of (1), we have fi(x) = f(bi aiT x).
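The piecewise definition above can be written directly in NumPy. This is our own sketch of the HSVM loss and its derivative (the function names and the default threshold are illustrative), showing the key property that the derivative is exactly zero on the flat region, so non-support vectors contribute zero gradient.

```python
import numpy as np

def huberized_hinge(tau, eps=0.5):
    """Huberized hinge loss f(tau): 0 for tau > 1+eps, 1-tau for
    tau < 1-eps, and the quadratic (1+eps-tau)^2/(4*eps) in between."""
    tau = np.asarray(tau, dtype=float)
    return np.where(tau > 1 + eps, 0.0,
           np.where(tau < 1 - eps, 1.0 - tau,
                    (1 + eps - tau) ** 2 / (4 * eps)))

def huberized_hinge_grad(tau, eps=0.5):
    """Derivative of f; identically zero for tau > 1+eps, so examples
    with b_i a_i^T x > 1+eps are non-support vectors."""
    tau = np.asarray(tau, dtype=float)
    return np.where(tau > 1 + eps, 0.0,
           np.where(tau < 1 - eps, -1.0,
                    -(1 + eps - tau) / (2 * eps)))
```

The quadratic piece meets the linear piece with matching value and slope at τ = 1 − ε, and meets zero with zero slope at τ = 1 + ε, which is what makes the loss differentiable while still having support vectors.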
The performance of this loss function is similar to logistic regression and the hinge loss, but it has the appealing properties of both: it is differentiable like logistic regression, meaning we can apply methods like SVRG, but it has support vectors like the hinge loss, meaning that many examples will have fi(x∗) = 0 and f′i(x∗) = 0. We can also construct Huberized variants of many non-smooth losses for regression and multi-class classification.

If we knew the support vectors where fi(x∗) > 0, we could solve the problem faster by ignoring the non-support vectors. For example, if there are 100,000 training examples but only 100 support vectors in the optimal solution, we could solve the problem 1000 times faster. While we typically don’t know the support vectors, in this section we outline a heuristic that gives large practical improvements by trying to identify them as the algorithm runs.

Our heuristic has two components. The first component is maintaining a list of the non-support vectors at xs. Specifically, we maintain a list of examples i where f′i(xs) = 0. When SVRG picks an example it that is part of this list, we know that f′it(xs) = 0 and thus the iteration only needs one gradient evaluation. This modification is not a heuristic, in that it still applies the exact SVRG algorithm. However, at best it can only cut the runtime in half.

The heuristic part of our strategy is to skip the evaluation of f′i(xs) or f′i(xt) if our evaluation of f′i has been zero more than two consecutive times (and to skip it an exponentially larger number of times each time it remains zero). Specifically, for each example i we maintain two variables, ski (for ‘skip’) and psi (for ‘pass’). Whenever we need to evaluate f′i for some xs or xt, we run Algorithm 3, which may skip the evaluation.
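The skipping heuristic can be sketched as a small stateful wrapper around a per-example gradient oracle. This is our own illustrative translation of Algorithm 3 (the closure-based interface is an assumption, not the paper's implementation):

```python
def make_skipper():
    """Sketch of Algorithm 3: wrap per-example gradient evaluations so
    that examples whose gradient has been zero on several consecutive
    checks are skipped (returning 0) an exponentially growing number
    of times."""
    state = {}  # i -> [sk_i, ps_i]

    def maybe_grad(i, grad_fn):
        sk, ps = state.get(i, [0, 0])
        if sk == 0:
            g = grad_fn(i)                 # actually evaluate f'_i
            if g == 0:
                ps += 1                    # consecutive zero evaluations
                sk = 2 ** max(0, ps - 2)   # schedule exponential skipping
            else:
                ps = 0                     # possible support vector
            state[i] = [sk, ps]
            return g
        state[i] = [sk - 1, ps]            # skip and assume the gradient is zero
        return 0
    return maybe_grad
```

An example that keeps returning a zero gradient is evaluated less and less often, while any nonzero evaluation resets the counter so a potential support vector is never skipped on the next check.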
This strategy can lead to huge computational savings in later iterations if there are few support vectors, since many iterations will require no gradient evaluations.

Identifying support vectors to speed up computation has long been an important part of SVM solvers, and is related to the classic shrinking heuristic [16]. While it has previously been explored in the context of dual coordinate ascent methods [17], this is the first work exploring it for linearly-convergent stochastic gradient methods.

6 Regularized SVRG

We are often interested in the special case where problem (1) has the decomposition

min over x ∈ Rd of f(x) ≡ h(x) + (1/n) ∑i=1..n gi(x).   (4)

A common choice of h is a scaled 1-norm of the parameter vector, h(x) = λ‖x‖₁. This non-smooth regularizer encourages sparsity in the parameter vector, and can be addressed with the proximal-SVRG method of Xiao & Zhang [11]. Alternatively, if we want an explicit Z we could set h to the indicator function for a 2-norm ball containing x∗. In Appendix C, we give a variant of Proposition 1 that allows errors in the proximal-SVRG method for non-smooth/constrained settings like this.

Another common choice is the ℓ₂-regularizer, h(x) = (λ/2)‖x‖². With this regularizer, the SVRG updates can be equivalently written in the form

xt+1 = xt − η(h′(xt) + g′it(xt) − g′it(xs) + µs),   (5)

where µs = (1/n) ∑i=1..n g′i(xs). That is, they take an exact gradient step with respect to the regularizer and an SVRG step with respect to the gi functions. When the g′i are sparse, this form of the update allows us to implement the iteration without needing full-vector operations. A related update is used by Le Roux et al. to avoid full-vector operations in the SAG algorithm [1, §4].
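One epoch of the regularized update (5) can be sketched as follows; the function name and oracle interface are our own illustrative choices, with `h_grad` the regularizer gradient and `g_grad_i(i, x)` the gradient of the i-th data term.

```python
import numpy as np

def regularized_svrg_epoch(h_grad, g_grad_i, n, x_snap, eta, m, rng=None):
    """Sketch of one epoch of update (5): an exact gradient step on the
    regularizer h plus an SVRG step on the data terms g_i."""
    rng = rng or np.random.default_rng(0)
    # full mu^s over the g_i terms only (the regularizer is handled exactly)
    mu = np.mean([g_grad_i(i, x_snap) for i in range(n)], axis=0)
    x = x_snap.copy()
    for _ in range(m):
        i = int(rng.integers(n))
        x = x - eta * (h_grad(x) + g_grad_i(i, x) - g_grad_i(i, x_snap) + mu)
    return x
```

Because the correction terms g′i(xt) − g′i(xs) inherit the sparsity of the g′i, only the h′(xt) + µs part touches the full vector, which is the practical appeal of this form.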
In Appendix C, we prove the below convergence rate for this update.

Proposition 3. Consider instances of problem (1) that can be written in the form (4), where h′ is Lh-Lipschitz continuous and each g′i is Lg-Lipschitz continuous, and assume that we set η and m so that 0 < ρ(Lm) < 1 with Lm = max{Lg, Lh}. Then the regularized SVRG iteration (5) has

E[f(xs+1) − f(x∗)] ≤ ρ(Lm) E[f(xs) − f(x∗)].

Since Lm ≤ L, and strictly so in the case of ℓ₂-regularization, this result shows that for ℓ₂-regularized problems SVRG actually converges faster than the standard analysis would indicate (a similar result appears in Konečný et al. [18]). Further, this result gives a theoretical justification for using the update (5) for other h functions where it is not equivalent to the original SVRG method.

7 Mini-Batching Strategies

Konečný et al. [18] have also recently considered using batches of data within SVRG. They consider using ‘mini-batches’ in the inner iteration (the update of xt) to decrease the variance of the method, but still use full passes through the data to compute µs. This prior work is thus complementary to the current work (in practice, both strategies can be used to improve performance). In Appendix D we show that sampling the inner mini-batch proportional to Li achieves a convergence rate of

E[f(xs+1) − f(x∗)] ≤ ρM E[f(xs) − f(x∗)],

where M is the size of the mini-batch, while

ρM = (1/(M − 2ηL̄)) (M/(mµη) + 2L̄η),

and we assume 0 < ρM < 1. This generalizes the standard rate of SVRG and improves on the result of Konečný et al. [18] in the smooth case.
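A single Lipschitz-sampled inner step can be sketched as follows; the (L̄/Li) weights keep the search direction unbiased. The function name and interface are our own illustration, not the paper's code.

```python
import numpy as np

def lipschitz_minibatch_step(grad_i, L, x, x_snap, mu, eta, M, rng):
    """One inner SVRG step with a mini-batch of size M sampled with
    probabilities proportional to the Lipschitz constants L_i; the
    (L_bar/L_i) importance weights make the correction unbiased."""
    n = len(L)
    p = L / L.sum()
    L_bar = L.mean()
    idx = rng.choice(n, size=M, p=p)  # sample proportional to L_i
    corr = np.mean([(L_bar / L[i]) * (grad_i(i, x) - grad_i(i, x_snap))
                    for i in idx], axis=0)
    return x - eta * (corr + mu)
```

Unbiasedness follows since E[(L̄/Li)(f′i(x) − f′i(xs))] under pi = Li/(nL̄) equals (1/n) ∑i (f′i(x) − f′i(xs)).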
This rate can be faster than the rate of the standard SVRG method at the cost of a more expensive iteration, and may be clearly advantageous in settings where parallel computation allows us to compute several gradients simultaneously.

The regularized SVRG form (5) suggests an alternate mini-batch strategy for problem (1): consider a mini-batch that contains a ‘fixed’ set Bf and a ‘random’ set Bt. Without loss of generality, assume that we sort the fi based on their Li values so that L1 ≥ L2 ≥ · · · ≥ Ln. For the fixed set Bf we will always choose the Mf values with the largest Li, Bf = {f1, f2, . . . , fMf}. In contrast, we choose the Mr = M − Mf members of the random set Bt by sampling from Br = {fMf+1, . . . , fn} proportional to their Lipschitz constants, pi = Li/(Mr L̄r) with L̄r = (1/Mr) ∑i=Mf+1..n Li. In Appendix D, we show the following convergence rate for this strategy:

Proposition 4. Let g(x) = (1/n) ∑i∉Bf fi(x) and h(x) = (1/n) ∑i∈Bf fi(x). If we replace the SVRG update with

xt+1 = xt − η(h′(xt) + (1/Mr) ∑i∈Bt (L̄r/Li)(f′i(xt) − f′i(xs)) + g′(xs)),

then the convergence rate is

E[f(xs+1) − f(x∗)] ≤ ρ(κ, ζ) E[f(xs) − f(x∗)],

where ζ = (n − Mf)L̄r/((M − Mf)n) and κ = max{L1/n, ζ}.

If L1 ≤ nL̄/M and Mf < (α − 1)nM/(αn − M) with α = L̄/L̄r, then we get a faster convergence rate than SVRG with a mini-batch of size M. The scenario where this rate is slower than the existing mini-batch SVRG strategy is when L1 > nL̄/M.
But we could relax this assumption by dividing each element of the fixed set Bf into two functions, βfi and (1 − β)fi with β = 1/M, then replacing each function fi in Bf with βfi and adding (1 − β)fi to the random set Br. This result may be relevant if we have access to a field-programmable gate array (FPGA) or graphics processing unit (GPU) that can compute the gradient for a fixed subset of the examples very efficiently. However, our experiments (Appendix F) indicate this strategy only gives marginal gains.

In Appendix F, we also consider constructing mini-batches by sampling proportional to fi(xs) or ‖f′i(xs)‖. These seemed to work as well as Lipschitz sampling on all but one of the datasets in our experiments, and this strategy is appealing because we have access to these values while we may not know the Li values. However, these strategies diverged on one of the datasets.

8 Learning Efficiency

In this section we compare the performance of SVRG as a large-scale learning algorithm to that of FG and SG methods. Following Bottou & Bousquet [19], we can formulate the generalization error E of a learning algorithm as the sum of three terms,

E = Eapp + Eest + Eopt,

where the approximation error Eapp measures the effect of using a limited class of models, the estimation error Eest measures the effect of using a finite training set, and the optimization error Eopt measures the effect of inexactly solving problem (1). Bottou & Bousquet [19] study the asymptotic performance of various algorithms for a fixed approximation error and under certain conditions on the distribution of the data depending on parameters α or ν. In Appendix E, we discuss how SVRG can be analyzed in their framework.
The table below includes SVRG among their results.

Algorithm | Time to reach Eopt ≤ ε | Time to reach E = O(Eapp + ε) | Previous with κ ∼ n
FG   | O(nκd log(1/ε)) | O((d²κ/ε^(1/α)) log²(1/ε)) | O((d³/ε^(2/α)) log³(1/ε))
SG   | O(dνκ²/ε) | O(dνκ²/ε) | O((d³ν/ε^(1+2/α)) log²(1/ε))
SVRG | O((n + κ)d log(1/ε)) | O((d²/ε^(1/α)) log²(1/ε) + κd log(1/ε)) | O((d²/ε^(1/α)) log²(1/ε))

In this table, the condition number is κ = L/µ. In this setting, linearly-convergent stochastic gradient methods can obtain better bounds for ill-conditioned problems, with a better dependence on the dimension and without depending on the noise variance ν.

9 Experimental Results

In this section, we present experimental results that evaluate our proposed variations on the SVRG method. We focus on logistic regression classification: given a set of training data (a1, b1) . . . (an, bn) where ai ∈ Rd and bi ∈ {−1, +1}, the goal is to find the x ∈ Rd solving

argmin over x ∈ Rd of (λ/2)‖x‖² + (1/n) ∑i=1..n log(1 + exp(−bi aiT x)).

We consider the datasets used by [1], whose properties are listed in the supplementary material. As in their work we add a bias variable, normalize dense features, and set the regularization parameter λ to 1/n. We used a step size of η = 1/L, and we used m = |Bs|, which gave good performance across methods and datasets.
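The experimental objective above is straightforward to implement; here is a minimal sketch of the ℓ₂-regularized logistic loss and its per-example gradient (function names are our own, and the per-example gradient includes the regularizer so it can serve as the f′i oracle for the SVRG variants in this paper).

```python
import numpy as np

def logreg_objective(x, A, b, lam):
    """l2-regularized logistic regression objective from Section 9."""
    margins = b * (A @ x)                       # b_i * a_i^T x
    return 0.5 * lam * (x @ x) + np.mean(np.logaddexp(0.0, -margins))

def logreg_grad_i(i, x, A, b, lam):
    """Gradient of the i-th term (regularizer included)."""
    margin = b[i] * (A[i] @ x)
    return lam * x - b[i] * A[i] / (1.0 + np.exp(margin))
```

Averaging `logreg_grad_i` over all i recovers the full gradient of `logreg_objective`, which is exactly the µs computation that the Full, Grow, and Mixed variants approximate to different degrees.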
In our first experiment, we compared three variants of SVRG: the original strategy that uses all n examples to form µs (Full), a growing-batch strategy that sets |Bs| = 2s (Grow), and the mixed SG/SVRG described by Algorithm 2 under this same choice (Mixed). While a variety of practical batching methods have been proposed in the literature [13, 20, 21], we did not find that any of these strategies consistently outperformed the doubling used by the simple Grow strategy.

Figure 1: Comparison of training objective (left) and test error (right) on the spam dataset for the logistic regression (top) and the HSVM (bottom) losses under different batch strategies for choosing µs (Full, Grow, and Mixed) and whether we attempt to identify support vectors (SV).

Our second experiment focused on the ℓ₂-regularized HSVM on the same datasets, and we compared the original SVRG algorithm with variants that try to identify the support vectors (SV). We plot the experimental results for one run of the algorithms on one dataset in Figure 1, while Appendix F reports results on the other 8 datasets over 10 different runs. In our results, the growing-batch strategy (Grow) always had better test error performance than using the full batch, while for large datasets it also performed substantially better in terms of the training objective. In contrast, the Mixed strategy sometimes helped performance and sometimes hurt performance. Utilizing support vectors often improved the training objective, often by large margins, but its effect on the test objective was smaller.

10 Discussion

As SVRG is the only memory-free method among the new stochastic linearly-convergent methods, it represents the natural method to use for a huge variety of machine learning problems. In this work we show that the convergence rate of the SVRG algorithm can be preserved even under an inexact approximation to the full gradient.
We also showed that using mini-batches to approximate µs gives a natural way to do this, explored the use of support vectors to further reduce the number of gradient evaluations, gave an analysis of the regularized SVRG update, and considered several new mini-batch strategies. Our theoretical and experimental results indicate that many of these simple modifications should be considered in any practical implementation of SVRG.

Acknowledgements

We would like to thank the reviewers for their helpful comments. This research was supported by the Natural Sciences and Engineering Research Council of Canada (RGPIN 312176-2010, RGPIN 311661-08, RGPIN-06068-2015). Jakub Konečný is supported by a Google European Doctoral Fellowship.

References
[1] N. Le Roux, M. Schmidt, and F. Bach, “A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets,” Advances in Neural Information Processing Systems (NIPS), 2012.
[2] S. Shalev-Shwartz and T. Zhang, “Stochastic dual coordinate ascent methods for regularized loss minimization,” Journal of Machine Learning Research, vol. 14, pp. 567–599, 2013.
[3] J. Mairal, “Optimization with first-order surrogate functions,” International Conference on Machine Learning (ICML), 2013.
[4] A. Defazio, F. Bach, and S. Lacoste-Julien, “SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives,” Advances in Neural Information Processing Systems (NIPS), 2014.
[5] M. Mahdavi, L. Zhang, and R.
Jin, “Mixed optimization for smooth functions,” Advances in Neural Information Processing Systems (NIPS), 2013.
[6] R. Johnson and T. Zhang, “Accelerating stochastic gradient descent using predictive variance reduction,” Advances in Neural Information Processing Systems (NIPS), 2013.
[7] L. Zhang, M. Mahdavi, and R. Jin, “Linear convergence with condition number independent access of full gradients,” Advances in Neural Information Processing Systems (NIPS), 2013.
[8] J. Konečný and P. Richtárik, “Semi-stochastic gradient descent methods,” arXiv preprint, 2013.
[9] M. Schmidt, N. Le Roux, and F. Bach, “Convergence rates of inexact proximal-gradient methods for convex optimization,” Advances in Neural Information Processing Systems (NIPS), 2011.
[10] C. Hu, J. Kwok, and W. Pan, “Accelerated gradient methods for stochastic optimization and online learning,” Advances in Neural Information Processing Systems (NIPS), 2009.
[11] L. Xiao and T. Zhang, “A proximal stochastic gradient method with progressive variance reduction,” SIAM Journal on Optimization, vol. 24, no. 2, pp. 2057–2075, 2014.
[12] S. Lohr, Sampling: Design and Analysis. Cengage Learning, 2009.
[13] M. P. Friedlander and M. Schmidt, “Hybrid deterministic-stochastic methods for data fitting,” SIAM Journal on Scientific Computing, vol. 34, no. 3, pp. A1351–A1379, 2012.
[14] A. Aravkin, M. P. Friedlander, F. J. Herrmann, and T. van Leeuwen, “Robust inversion, dimensionality reduction, and randomized sampling,” Mathematical Programming, vol. 134, no. 1, pp. 101–125, 2012.
[15] S. Rosset and J. Zhu, “Piecewise linear regularized solution paths,” The Annals of Statistics, vol. 35, no. 3, pp. 1012–1030, 2007.
[16] T.
Joachims, “Making large-scale SVM learning practical,” in Advances in Kernel Methods - Support Vector Learning (B. Schölkopf, C. Burges, and A. Smola, eds.), ch. 11, pp. 169–184, Cambridge, MA: MIT Press, 1999.
[17] N. Usunier, A. Bordes, and L. Bottou, “Guarantees for approximate incremental SVMs,” International Conference on Artificial Intelligence and Statistics (AISTATS), 2010.
[18] J. Konečný, J. Liu, P. Richtárik, and M. Takáč, “mS2GD: Mini-batch semi-stochastic gradient descent in the proximal setting,” arXiv preprint, 2014.
[19] L. Bottou and O. Bousquet, “The tradeoffs of large scale learning,” Advances in Neural Information Processing Systems (NIPS), 2007.
[20] R. H. Byrd, G. M. Chin, J. Nocedal, and Y. Wu, “Sample size selection in optimization methods for machine learning,” Mathematical Programming, vol. 134, no. 1, pp. 127–155, 2012.
[21] K. van den Doel and U. Ascher, “Adaptive and stochastic algorithms for EIT and DC resistivity problems with piecewise constant solutions and many measurements,” SIAM Journal on Scientific Computing, vol. 34, 2012.
", "award": [], "sourceid": 1344, "authors": [{"given_name": "Reza", "family_name": "Babanezhad Harikandeh", "institution": "UBC"}, {"given_name": "Mohamed Osama", "family_name": "Ahmed", "institution": null}, {"given_name": "Alim", "family_name": "Virani", "institution": null}, {"given_name": "Mark", "family_name": "Schmidt", "institution": "University of British Columbia"}, {"given_name": "Jakub", "family_name": "Konečný", "institution": null}, {"given_name": "Scott", "family_name": "Sallinen", "institution": "UBC"}]}