{"title": "Variance Reduced Stochastic Gradient Descent with Neighbors", "book": "Advances in Neural Information Processing Systems", "page_first": 2305, "page_last": 2313, "abstract": "Stochastic Gradient Descent (SGD) is a workhorse in machine learning, yet it is also known to be slow relative to steepest descent. Recently, variance reduction techniques such as SVRG and SAGA have been proposed to overcome this weakness. With asymptotically vanishing variance, a constant step size can be maintained, resulting in geometric convergence rates. However, these methods are either based on occasional computations of full gradients at pivot points (SVRG), or on keeping per data point corrections in memory (SAGA). This has the disadvantage that one cannot employ these methods in a streaming setting and that speed-ups relative to SGD may need a certain number of epochs in order to materialize. This paper investigates a new class of algorithms that can exploit neighborhood structure in the training data to share and re-use information about past stochastic gradients across data points. While not meant to be offering advantages in an asymptotic setting, there are significant benefits in the transient optimization phase, in particular in a streaming or single-epoch setting. We investigate this family of algorithms in a thorough analysis and show supporting experimental results. 
As a side-product we provide a simple and unified proof technique for a broad class of variance reduction algorithms.", "full_text": "Variance Reduced Stochastic Gradient Descent with Neighbors\n\nThomas Hofmann\n\nDepartment of Computer Science\n\nETH Zurich, Switzerland\n\nAurelien Lucchi\n\nDepartment of Computer Science\n\nETH Zurich, Switzerland\n\nSimon Lacoste-Julien\n\nINRIA - Sierra Project-Team\n\n\u00c9cole Normale Sup\u00e9rieure, Paris, France\n\nBrian McWilliams\n\nDepartment of Computer Science\n\nETH Zurich, Switzerland\n\nAbstract\n\nStochastic Gradient Descent (SGD) is a workhorse in machine learning, yet its slow convergence can be a computational bottleneck. Variance reduction techniques such as SAG, SVRG and SAGA have been proposed to overcome this weakness, achieving linear convergence. However, these methods are either based on computations of full gradients at pivot points, or on keeping per data point corrections in memory. Therefore speed-ups relative to SGD may need a minimal number of epochs in order to materialize. This paper investigates algorithms that can exploit neighborhood structure in the training data to share and re-use information about past stochastic gradients across data points, which offers advantages in the transient optimization phase. As a side-product we provide a unified convergence analysis for a family of variance reduction algorithms, which we call memorization algorithms. We provide experimental results supporting our theory.\n\n1 Introduction\n\nWe consider a general problem that is pervasive in machine learning, namely optimization of an empirical or regularized convex risk function. 
Given a convex loss l and a \u00b5-strongly convex regularizer \u2126, one aims at finding a parameter vector w which minimizes the (empirical) expectation:\n\nw* = argmin_w f(w),   f(w) = (1/n) \u2211_{i=1}^{n} f_i(w),   f_i(w) := l(w, (x_i, y_i)) + \u2126(w).   (1)\n\nWe assume throughout that each f_i has L-Lipschitz-continuous gradients. Steepest descent can find the minimizer w*, but requires repeated computations of full gradients f'(w), which becomes prohibitive for massive data sets. Stochastic gradient descent (SGD) is a popular alternative, in particular in the context of large-scale learning [2, 10]. SGD updates only involve f'_i(w) for an index i chosen uniformly at random, providing an unbiased gradient estimate, since E f'_i(w) = f'(w).\n\nIt is a surprising recent finding [11, 5, 9, 6] that the finite sum structure of f allows for significantly faster convergence in expectation. Instead of the standard O(1/t) rate of SGD for strongly-convex functions, it is possible to obtain linear convergence with geometric rates. While SGD requires asymptotically vanishing learning rates, often chosen to be O(1/t) [7], these more recent methods introduce corrections that ensure convergence for constant learning rates.\n\nBased on the work mentioned above, the contributions of our paper are as follows: First, we define a family of variance reducing SGD algorithms, called memorization algorithms, which includes SAGA and SVRG as special cases, and develop a unifying analysis technique for it. Second, we show geometric rates for all step sizes \u03b3 < 1/(4L), including a universal (\u00b5-independent) step size choice, providing the first \u00b5-adaptive convergence proof for SVRG. Third, based on the above analysis, we present new insights into the trade-offs between freshness and biasedness of the corrections computed from previous stochastic gradients. 
Fourth, we propose a new class of algorithms that resolves this trade-off by computing corrections based on stochastic gradients at neighboring points. We experimentally show its benefits in the regime of learning with a small number of epochs.\n\n2 Memorization Algorithms\n\n2.1 Algorithms\n\nVariance Reduced SGD. Given an optimization problem as in (1), we investigate a class of stochastic gradient descent algorithms that generates an iterate sequence w^t (t \u2265 0) with updates taking the form:\n\nw^+ = w \u2212 \u03b3 g_i(w),   g_i(w) = f'_i(w) \u2212 \u03b1\u0304_i   with   \u03b1\u0304_i := \u03b1_i \u2212 \u03b1\u0304,   (2)\n\nwhere \u03b1\u0304 := (1/n) \u2211_{j=1}^{n} \u03b1_j. Here w is the current and w^+ the new parameter vector, \u03b3 is the step size, and i is an index selected uniformly at random. The \u03b1\u0304_i are variance correction terms such that E \u03b1\u0304_i = 0, which guarantees unbiasedness E g_i(w) = f'(w). The aim is to define updates of asymptotically vanishing variance, i.e. g_i(w) \u2192 0 as w \u2192 w*, which requires \u03b1\u0304_i \u2192 f'_i(w*). This implies that corrections need to be designed in a way to exactly cancel out the stochasticity of f'_i(w*) at the optimum. How the memory \u03b1_j is updated distinguishes the different algorithms that we consider.\n\nSAGA. The SAGA algorithm [4] maintains variance corrections \u03b1_i by memorizing stochastic gradients. The update rule is \u03b1_i^+ = f'_i(w) for the selected i, and \u03b1_j^+ = \u03b1_j for j \u2260 i. Note that these corrections will be used the next time the same index i gets sampled. Setting \u03b1\u0304_i := \u03b1_i \u2212 \u03b1\u0304 guarantees unbiasedness. Obviously, \u03b1\u0304 can be updated incrementally. SAGA reuses the stochastic gradient f'_i(w) computed at step t to update w as well as \u03b1\u0304_i.\n\nq-SAGA. We also consider q-SAGA, a method that updates q \u2265 1 randomly chosen \u03b1_j variables at each iteration. 
This is a convenient reference point to investigate the advantages of \u201cfresher\u201d corrections. Note that in SAGA the corrections will be on average n iterations \u201cold\u201d. In q-SAGA this can be controlled to be n/q at the expense of additional gradient computations.\n\nSVRG. We reformulate a variant of SVRG [5] in our framework using a randomization argument similar to (but simpler than) the one suggested in [6]. Fix q > 0 and draw in each iteration r \u223c Uniform[0; 1). If r < q/n, a complete update, \u03b1_j^+ = f'_j(w) (\u2200j), is performed, otherwise they are left unchanged. While q-SAGA updates exactly q variables in each iteration, SVRG occasionally updates all \u03b1 variables by triggering an additional sweep through the data. There is an option to not maintain \u03b1 variables explicitly and to save on space by storing only \u03b1\u0304 = f'(w) and w.\n\nUniform Memorization Algorithms. Motivated by SAGA and SVRG, we define a class of algorithms, which we call uniform memorization algorithms.\n\nDefinition 1. A uniform q-memorization algorithm evolves iterates w according to Eq. (2) and selects in each iteration a random index set J of memory locations to update according to\n\n\u03b1_j^+ := f'_j(w) if j \u2208 J, and \u03b1_j^+ := \u03b1_j otherwise,   (3)\n\nsuch that any j has the same probability of q/n of being updated, i.e. \u2200j, \u2211_{J \u220b j} P{J} = q/n.\n\nNote that q-SAGA and the above SVRG are special cases. For q-SAGA: P{J} = 1/C(n, q) if |J| = q, P{J} = 0 otherwise, where C(n, q) denotes the binomial coefficient. For SVRG: P{\u2205} = 1 \u2212 q/n, P{[1 : n]} = q/n, P{J} = 0 otherwise.\n\nN-SAGA. Because we need it in Section 3, we will also define an algorithm, which we call N-SAGA, which makes use of a neighborhood system N_i \u2286 {1, . . . , n} and which selects neighborhoods uniformly, i.e. P{N_i} = 1/n. 
Note that Definition 1 requires |{i : j \u2208 N_i}| = q (\u2200j).\n\nFinally, note that for generalized linear models where f_i depends on x_i only through \u27e8w, x_i\u27e9, we get f'_i(w) = \u03be'_i(w) x_i, i.e. the update direction is determined by x_i, whereas the effective step length depends on the derivative of a scalar function \u03be_i(w). As used in [9], this leads to significant memory savings as one only needs to store the scalars \u03be'_i(w), as x_i is always given when performing an update.\n\n2.2 Analysis\n\nRecurrence of Iterates. The evolution equation (2) in expectation implies the recurrence (by crucially using the unbiasedness condition E g_i(w) = f'(w)):\n\nE\u2016w^+ \u2212 w*\u2016^2 = \u2016w \u2212 w*\u2016^2 \u2212 2\u03b3 \u27e8f'(w), w \u2212 w*\u27e9 + \u03b3^2 E\u2016g_i(w)\u2016^2.   (4)\n\nHere and in the rest of this paper, expectations are always taken only with respect to i (conditioned on the past). 
We utilize a number of bounds (see [4]), which exploit strong convexity of f (wherever \u00b5 appears) as well as Lipschitz continuity of the f_i-gradients (wherever L appears):\n\n\u27e8f'(w), w \u2212 w*\u27e9 \u2265 f(w) \u2212 f(w*) + (\u00b5/2) \u2016w \u2212 w*\u2016^2,   (5)\nE\u2016g_i(w)\u2016^2 \u2264 2 E\u2016f'_i(w) \u2212 f'_i(w*)\u2016^2 + 2 E\u2016\u03b1\u0304_i \u2212 f'_i(w*)\u2016^2,   (6)\n\u2016f'_i(w) \u2212 f'_i(w*)\u2016^2 \u2264 2L h_i(w),   h_i(w) := f_i(w) \u2212 f_i(w*) \u2212 \u27e8w \u2212 w*, f'_i(w*)\u27e9,   (7)\nE\u2016f'_i(w) \u2212 f'_i(w*)\u2016^2 \u2264 2L f^\u03b4(w),   f^\u03b4(w) := f(w) \u2212 f(w*),   (8)\nE\u2016\u03b1\u0304_i \u2212 f'_i(w*)\u2016^2 = E\u2016\u03b1_i \u2212 f'_i(w*)\u2016^2 \u2212 \u2016\u03b1\u0304\u2016^2 \u2264 E\u2016\u03b1_i \u2212 f'_i(w*)\u2016^2.   (9)\n\nEq. (6) can be generalized [4] using \u2016x \u00b1 y\u2016^2 \u2264 (1 + \u03b2)\u2016x\u2016^2 + (1 + \u03b2^{\u22121})\u2016y\u2016^2 with \u03b2 > 0. However, for the sake of simplicity, we sacrifice tightness and choose \u03b2 = 1. Applying all of the above yields:\n\nLemma 1. For the iterate sequence of any algorithm that evolves solutions according to Eq. 
(2), the following holds for a single update step, in expectation over the choice of i:\n\n\u2016w \u2212 w*\u2016^2 \u2212 E\u2016w^+ \u2212 w*\u2016^2 \u2265 \u03b3\u00b5 \u2016w \u2212 w*\u2016^2 \u2212 2\u03b3^2 E\u2016\u03b1_i \u2212 f'_i(w*)\u2016^2 + (2\u03b3 \u2212 4\u03b3^2 L) f^\u03b4(w).\n\nAll proofs are deferred to the Appendix.\n\nIdeal and Approximate Variance Correction. Note that in the ideal case of \u03b1_i = f'_i(w*), we would immediately get a condition for a contraction by choosing \u03b3 = 1/(2L), yielding a rate of 1 \u2212 \u03c1 with \u03c1 = \u03b3\u00b5 = \u00b5/(2L), which is half the inverse of the condition number \u03ba := L/\u00b5.\n\nHow can we further bound E\u2016\u03b1_i \u2212 f'_i(w*)\u2016^2 in the case of \u201cnon-ideal\u201d variance-reducing SGD? A key insight is that for memorization algorithms, we can apply the smoothness bound in Eq. (7):\n\n\u2016\u03b1_i \u2212 f'_i(w*)\u2016^2 = \u2016f'_i(w_{\u03c4_i}) \u2212 f'_i(w*)\u2016^2 \u2264 2L h_i(w_{\u03c4_i})   (where w_{\u03c4_i} is the old iterate w at which \u03b1_i was last computed).   (10)\n\nNote that if we only had approximations \u03b2_i in the sense that \u2016\u03b2_i \u2212 \u03b1_i\u2016^2 \u2264 \u03b5_i (see Section 3), then we can use \u2016x \u2212 y\u2016^2 \u2264 2\u2016x\u2016^2 + 2\u2016y\u2016^2 to get the somewhat worse bound:\n\n\u2016\u03b2_i \u2212 f'_i(w*)\u2016^2 \u2264 2\u2016\u03b1_i \u2212 f'_i(w*)\u2016^2 + 2\u2016\u03b2_i \u2212 \u03b1_i\u2016^2 \u2264 4L h_i(w_{\u03c4_i}) + 2\u03b5_i.   (11)\n\nLyapunov Function. Ideally, we would like to show that for a suitable choice of \u03b3, each iteration results in a contraction E\u2016w^+ \u2212 w*\u2016^2 \u2264 (1 \u2212 \u03c1)\u2016w \u2212 w*\u2016^2, where 0 < \u03c1 \u2264 1. 
However, the main challenge arises from the fact that the quantities \u03b1_i represent stochastic gradients from previous iterations. This requires a somewhat more complex proof technique. Adapting the Lyapunov function method from [4], we define upper bounds H_i \u2265 \u2016\u03b1_i \u2212 f'_i(w*)\u2016^2 such that H_i \u2192 0 as w \u2192 w*. We start with \u03b1_i^0 = 0 and (conceptually) initialize H_i = \u2016f'_i(w*)\u2016^2, and then update H_i in sync with \u03b1_i,\n\nH_i^+ := 2L h_i(w) if \u03b1_i is updated, and H_i^+ := H_i otherwise,   (12)\n\nso that we always maintain valid bounds \u2016\u03b1_i \u2212 f'_i(w*)\u2016^2 \u2264 H_i and E\u2016\u03b1_i \u2212 f'_i(w*)\u2016^2 \u2264 H\u0304 with H\u0304 := (1/n) \u2211_{i=1}^{n} H_i. The H_i are quantities showing up in the analysis, but need not be computed. We now define a \u03c3-parameterized family of Lyapunov functions (see footnote 1):\n\n\u2112_\u03c3(w, H) := \u2016w \u2212 w*\u2016^2 + S\u03c3 H\u0304,   with S := \u03b3n/(Lq) and 0 \u2264 \u03c3 \u2264 1.   (13)\n\nFootnote 1: This is a simplified version of the one appearing in [4], as we assume f'(w*) = 0 (unconstrained regime).\n\nIn expectation under a random update, the Lyapunov function \u2112_\u03c3 changes as E \u2112_\u03c3(w^+, H^+) = E\u2016w^+ \u2212 w*\u2016^2 + S\u03c3 E H\u0304^+. We can readily apply Lemma 1 to bound the first part. The second part is due to (12), which mirrors the update of the \u03b1 variables. By crucially using the property that any \u03b1_j has the same probability of being updated in (3), we get the following result:\n\nLemma 2. 
For a uniform q-memorization algorithm, it holds that\n\nE H\u0304^+ = ((n \u2212 q)/n) H\u0304 + (2Lq/n) f^\u03b4(w).   (14)\n\nNote that in expectation the shrinkage does not depend on the location of previous iterates w_\u03c4 and the new increment is proportional to the sub-optimality of the current iterate w. Technically, this is how the possibly complicated dependency on previous iterates is dealt with in an effective manner.\n\nConvergence Analysis. We first state our main Lemma about Lyapunov function contractions:\n\nLemma 3. Fix c \u2208 (0; 1] and \u03c3 \u2208 [0; 1] arbitrarily. For any uniform q-memorization algorithm with sufficiently small step size \u03b3 such that\n\n\u03b3 \u2264 (1/(2L)) min{ K\u03c3/(K + 2c\u03c3), 1 \u2212 \u03c3 },   and K := 4qL/(n\u00b5),   (15)\n\nwe have that\n\nE \u2112_\u03c3(w^+, H^+) \u2264 (1 \u2212 \u03c1) \u2112_\u03c3(w, H),   with \u03c1 := c\u00b5\u03b3.   (16)\n\nNote that \u03b3 < (1/(2L)) max_{\u03c3 \u2208 [0,1]} min{\u03c3, 1 \u2212 \u03c3} = 1/(4L) (in the c \u2192 0 limit).\n\nBy maximizing the bounds in Lemma 3 over the choices of c and \u03c3, we obtain our main result that provides guaranteed geometric rates for all step sizes up to 1/(4L).\n\nTheorem 1. Consider a uniform q-memorization algorithm. For any step size \u03b3 = a/(4L) with a < 1, the algorithm converges at a geometric rate of at least (1 \u2212 \u03c1(\u03b3)) with\n\n\u03c1(\u03b3) = (q/n) \u00b7 (1 \u2212 a)/(1 \u2212 a/2) = (\u00b5/(4L)) \u00b7 K(1 \u2212 a)/(1 \u2212 a/2)   if \u03b3 \u2265 \u03b3*(K), otherwise \u03c1(\u03b3) = \u00b5\u03b3,   (17)\n\nwhere\n\n\u03b3*(K) := a*(K)/(4L),   a*(K) := 2K/(1 + K + \u221a(1 + K^2)),   K := 4qL/(n\u00b5) = (4q/n) \u03ba.   (18)\n\nWe would like to provide more insights into this result.\n\nCorollary 1. 
In Theorem 1, \u03c1 is maximized for \u03b3 = \u03b3*(K). We can write \u03c1*(K) = \u03c1(\u03b3*) as\n\n\u03c1*(K) = (\u00b5/(4L)) a*(K) = (q/n) \u00b7 a*(K)/K = (q/n) [ 2/(1 + K + \u221a(1 + K^2)) ].   (19)\n\nIn the big data regime \u03c1* = (q/n)(1 \u2212 (1/2)K + O(K^3)), whereas in the ill-conditioned case \u03c1* = (\u00b5/(4L))(1 \u2212 (1/2)K^{\u22121} + O(K^{\u22123})).\n\nThe guaranteed rate is bounded by \u00b5/(4L) in the regime where the condition number dominates n (large K) and by q/n in the opposite regime of large data (small K). Note that if K \u2264 1, we have \u03c1* = \u03b6 q/n with \u03b6 \u2208 [2/(2 + \u221a2); 1] \u2248 [0.585; 1]. So for q \u2264 n\u00b5/(4L), it pays off to increase freshness as it affects the rate proportionally. In the ill-conditioned regime (\u03ba > n), the influence of q vanishes.\n\nNote that for \u03b3 \u2265 \u03b3*(K), \u03b3 \u2192 1/(4L), the rate decreases monotonically, yet the decrease is only minor. With the exception of a small neighborhood around 1/(4L), the entire range of \u03b3 \u2208 [\u03b3*; 1/(4L)) results in very similar rates. Underestimating \u03b3*, however, leads to a (significant) slow-down by a factor \u03b3/\u03b3*.\n\nAs the optimal choice of \u03b3 depends on K, i.e. \u00b5, we would prefer step sizes that are \u00b5-independent, thus giving rates that adapt to the local curvature (see [9]). It turns out that by choosing a step size that maximizes min_K \u03c1(\u03b3)/\u03c1*(K), we obtain a K-agnostic step size with rate off by at most 1/2:\n\nCorollary 2. Choosing \u03b3 = (2 \u2212 \u221a2)/(4L) leads to \u03c1(\u03b3) \u2265 (2 \u2212 \u221a2) \u03c1*(K) > (1/2) \u03c1*(K) for all K.\n\nTo gain more insights into the trade-offs for these fixed large universal step sizes, the following corollary details the range of rates obtained:\n\nCorollary 3. 
Choosing \u03b3 = a/(4L) with a < 1 yields \u03c1 = min{ ((1 \u2212 a)/(1 \u2212 a/2)) (q/n), (a/4)(\u00b5/L) }. In particular, we have for the choice \u03b3 = 1/(5L) that \u03c1 = min{ (1/3)(q/n), (1/5)(\u00b5/L) } (roughly matching the rate given in [4] for q = 1).\n\n3 Sharing Gradient Memory\n\n3.1 \u03b5-Approximation Analysis\n\nAs we have seen, fresher gradient memory, i.e. a larger choice for q, affects the guaranteed convergence rate as \u03c1 \u223c q/n. However, as long as one step of a q-memorization algorithm is as expensive as q steps of a 1-memorization algorithm, this insight does not lead to practical improvements per se. Yet, it raises the question of whether we can accelerate these methods, in particular N-SAGA, by approximating gradients stored in the \u03b1_i variables. Note that we are always using the correct stochastic gradients in the current update and by assuring \u2211_i \u03b1\u0304_i = 0, we will not introduce any bias in the update direction. Rather, we lose the guarantee of asymptotically vanishing variance at w*. However, as we will show, it is possible to retain geometric rates up to a \u03b4-ball around w*.\n\nWe will focus on SAGA-style updates for concreteness and investigate an algorithm that mirrors N-SAGA with the only difference that it maintains approximations \u03b2_i to the true \u03b1_i variables. We aim to guarantee E\u2016\u03b1_i \u2212 \u03b2_i\u2016^2 \u2264 \u03b5 and will use Eq. (11) to modify the right-hand-side of Lemma 1. We see that approximation errors \u03b5_i are multiplied with \u03b3^2, which implies that we should aim for small learning rates, ideally without compromising the N-SAGA rate. 
From Theorem 1 and Corollary 1 we can see that we can choose \u03b3 \u2272 q/(\u00b5n) for n sufficiently large, which indicates that there is hope to dampen the effects of the approximations. We now make this argument more precise.\n\nTheorem 2. Consider a uniform q-memorization algorithm with \u03b1-updates that are on average \u03b5-accurate (i.e. E\u2016\u03b1_i \u2212 \u03b2_i\u2016^2 \u2264 \u03b5). For any step size \u03b3 \u2264 \u03b3\u0303(K), where \u03b3\u0303 is given by Corollary 5 in the appendix (note that \u03b3\u0303(K) \u2265 (2/3) \u03b3*(K) and \u03b3\u0303(K) \u2192 \u03b3*(K) as K \u2192 0), we get\n\n\ud835\udd3c \u2112(w^t, H^t) \u2264 (1 \u2212 \u00b5\u03b3)^t \u2112_0 + 4\u03b3\u03b5/\u00b5,   with \u2112_0 := \u2016w^0 \u2212 w*\u2016^2 + s(\u03b3) E\u2016f'_i(w*)\u2016^2,   (20)\n\nwhere \ud835\udd3c denotes the (unconditional) expectation over histories (in contrast to E, which is conditional), and s(\u03b3) := (4\u03b3/(K\u00b5))(1 \u2212 2L\u03b3).\n\nCorollary 4. With \u03b3 = min{\u00b5, \u03b3\u0303(K)} we have\n\n4\u03b3\u03b5/\u00b5 \u2264 4\u03b5,   with a rate \u03c1 = min{\u00b5^2, \u00b5\u03b3\u0303}.   (21)\n\nIn the relevant case of \u00b5 \u223c 1/\u221an, we thus converge towards some \u221a\u03b5-ball around w* at a similar rate as for the exact method. For \u00b5 \u223c n^{\u22121}, we have to reduce the step size significantly to compensate for the extra variance and to still converge to a \u221a\u03b5-ball, resulting in the slower rate \u03c1 \u223c n^{\u22122}, instead of \u03c1 \u223c n^{\u22121}.\n\nWe also note that the geometric convergence of SGD with a constant step size to a neighborhood of the solution (also proven in [8]) can arise as a special case in our analysis. By setting \u03b1_i = 0 in Lemma 1, we can take \u03b5 = E\u2016f'_i(w*)\u2016^2 for SGD. 
An approximate q-memorization algorithm can thus be interpreted as making \u03b5 an algorithmic parameter, rather than a fixed value as in SGD.\n\n3.2 Algorithms\n\nSharing Gradient Memory. We now discuss our proposal of using neighborhoods for sharing gradient information between close-by data points. Thereby we avoid an increase in gradient computations relative to q- or N-SAGA at the expense of suffering an approximation bias. This leads to a new tradeoff between freshness and approximation quality, which can be resolved in non-trivial ways, depending on the desired final optimization accuracy.\n\nWe distinguish two types of quantities. First, the gradient memory \u03b1_i as defined by the reference algorithm N-SAGA. Second, the shared gradient memory state \u03b2_i, which is used in a modified update rule in Eq. (2), i.e. w^+ = w \u2212 \u03b3(f'_i(w) \u2212 \u03b2_i + \u03b2\u0304). Assume that we select an index i for the weight update, then we generalize Eq. (3) as follows:\n\n\u03b2_j^+ := f'_i(w) if j \u2208 N_i, and \u03b2_j^+ := \u03b2_j otherwise,   \u03b2\u0304 := (1/n) \u2211_{i=1}^{n} \u03b2_i,   \u03b2\u0304_i := \u03b2_i \u2212 \u03b2\u0304.   (22)\n\nIn the important case of generalized linear models, where one has f'_i(w) = \u03be'_i(w) x_i, we can modify the relevant case in Eq. (22) by \u03b2_j^+ := \u03be'_i(w) x_j. This has the advantage of using the correct direction, while reducing storage requirements.\n\nApproximation Bounds. For our analysis, we need to control the error \u2016\u03b1_i \u2212 \u03b2_i\u2016^2 \u2264 \u03b5_i. This obviously requires problem-specific investigations.\n\nLet us first look at the case of ridge regression: f_i(w) := (1/2)(\u27e8x_i, w\u27e9 \u2212 y_i)^2 + (\u03bb/2)\u2016w\u2016^2 and thus f'_i(w) = \u03be'_i(w) x_i + \u03bbw with \u03be'_i(w) := \u27e8x_i, w\u27e9 \u2212 y_i. 
Considering j \u2208 N_i being updated, we have\n\n\u2016\u03b1_j^+ \u2212 \u03b2_j^+\u2016 = |\u03be'_j(w) \u2212 \u03be'_i(w)| \u2016x_j\u2016 \u2264 (\u03b4_ij \u2016w\u2016 + |y_j \u2212 y_i|) \u2016x_j\u2016 =: \u03b5_ij(w),   (23)\n\nwhere \u03b4_ij := \u2016x_i \u2212 x_j\u2016. Note that this can be pre-computed with the exception of the norm \u2016w\u2016 that we only know at the time of an update.\n\nSimilarly, for regularized logistic regression with y \u2208 {\u22121, 1}, we have \u03be'_i(w) = y_i/(1 + e^{y_i \u27e8x_i, w\u27e9}). With the requirement on neighbors that y_i = y_j we get\n\n\u2016\u03b1_j^+ \u2212 \u03b2_j^+\u2016 \u2264 ((e^{\u03b4_ij \u2016w\u2016} \u2212 1)/(1 + e^{\u2212\u27e8x_i, w\u27e9})) \u2016x_j\u2016 =: \u03b5_ij(w).   (24)\n\nAgain, we can pre-compute \u03b4_ij and \u2016x_j\u2016. In addition to \u03be'_i(w) we can also store \u27e8x_i, w\u27e9.\n\n\u03b5N-SAGA. We can use these bounds in two ways. First, assuming that the iterates stay within a norm-ball (e.g. L2-ball), we can derive upper bounds\n\n\u03b5_j(r) \u2265 max{\u03b5_ij(w) : j \u2208 N_i, \u2016w\u2016 \u2264 r},   \u03b5(r) = (1/n) \u2211_j \u03b5_j(r).   (25)\n\nObviously, the more compact the neighborhoods are, the smaller \u03b5(r). This is most useful for the analysis. Second, we can specify a target accuracy \u03b5 and then prune neighborhoods dynamically. This approach is more practically relevant as it allows us to directly control \u03b5. However, a dynamically varying neighborhood violates Definition 1. 
We fix this in a sound manner by modifying the memory updates as follows:\n\n\u03b2_j^+ := f'_i(w) if j \u2208 N_i and \u03b5_ij(w) \u2264 \u03b5;   \u03b2_j^+ := f'_j(w) if j \u2208 N_i and \u03b5_ij(w) > \u03b5;   \u03b2_j^+ := \u03b2_j otherwise.   (26)\n\nThis allows us to interpolate between sharing more aggressively (saving computation) and performing more computations in an exact manner. In the limit of \u03b5 \u2192 0, we recover N-SAGA; as \u03b5 \u2192 \u03b5_max we recover the first variant mentioned.\n\nComputing Neighborhoods. Note that the pairwise Euclidean distances show up in the bounds in Eq. (23) and (24). In the classification case we also require y_i = y_j, whereas in the ridge regression case, we also want |y_i \u2212 y_j| to be small. Thus, modulo filtering, this suggests the use of Euclidean distances as the metric for defining neighborhoods. Standard approximation techniques for finding near(est) neighbors can be used. This comes with a computational overhead, yet the additional costs will amortize over multiple runs or multiple data analysis tasks.\n\n4 Experimental Results\n\nAlgorithms. We present experimental results on the performance of the different variants of memorization algorithms for variance reduced SGD as discussed in this paper. SAGA has been uniformly superior to SVRG in our experiments, so we compare SAGA and \u03b5N-SAGA (from Eq. (26)), alongside SGD as a straw man and q-SAGA as a point of reference for speed-ups. We have chosen q = 20 for q-SAGA and \u03b5N-SAGA. The same setting was used across all data sets and experiments.\n\nData Sets. As special cases for the choice of the loss function and regularizer in Eq. (1), we consider two commonly occurring problems in machine learning, namely least-square regression and \u2113_2-regularized logistic regression. We apply least-square regression on the million song year regression from the UCI repository. 
This dataset contains n = 515,345 data points, each described by d = 90 input features. We apply logistic regression on the cov and ijcnn1 datasets obtained from the libsvm website (see footnote 2). The cov dataset contains n = 581,012 data points, each described by d = 54 input features. The ijcnn1 dataset contains n = 49,990 data points, each described by d = 22 input features. We added an \u2113_2-regularizer \u2126(w) = \u00b5\u2016w\u2016_2^2 to ensure the objective is strongly convex.\n\nFootnote 2: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets\n\n[Figure 1 plots: a grid of panels with columns (a) Cov, (b) Ijcnn1, (c) Year and rows \u00b5 = 10^{\u22121}, gradient evaluation; \u00b5 = 10^{\u22123}, gradient evaluation; \u00b5 = 10^{\u22121}, datapoint evaluation; \u00b5 = 10^{\u22123}, datapoint evaluation. Each panel plots suboptimality (log scale) against epochs for SGD cst, SGD, SAGA, q-SAGA and \u03b5N-SAGA at several values of \u03b5.]\n\nFigure 1: Comparison of \u03b5N-SAGA, q-SAGA, SAGA and SGD (with decreasing and constant step size) on three datasets. The top two rows show the suboptimality as a function of the number of gradient evaluations for two different values of \u00b5 = 10^{\u22121}, 10^{\u22123}. The bottom two rows show the suboptimality as a function of the number of datapoint evaluations (i.e. number of stochastic updates) for two different values of \u00b5 = 10^{\u22121}, 10^{\u22123}.\n\n
Experimental Protocol. We have run the algorithms in question in an i.i.d. sampling setting and averaged the results over 5 runs. Figure 1 shows the evolution of the suboptimality f^\u03b4 of the objective as a function of two different metrics: (1) in terms of the number of update steps performed (\u201cdatapoint evaluation\u201d), and (2) in terms of the number of gradient computations (\u201cgradient evaluation\u201d). Note that SGD and SAGA compute one stochastic gradient per update step, unlike q-SAGA, which is included here not as a practically relevant algorithm, but as an indication of potential improvements that could be achieved by fresher corrections. A step size \u03b3 = q/(\u00b5n) was used everywhere, except for \u201cplain SGD\u201d. Note that as K \u226a 1 in all cases, this is close to the optimal value suggested by our analysis; moreover, using a step size of \u223c 1/L for SAGA as suggested in previous work [9] did not appear to give better results. For plain SGD, we used a schedule of the form \u03b3_t = \u03b3_0/t with constants optimized coarsely via cross-validation. 
The x-axis is expressed in units of n (suggestively called \u201cepochs\u201d).\n\nSAGA vs. SGD cst. As we can see, if we run SGD with the same constant step size as SAGA, it takes several epochs until SAGA really shows a significant gain. The constant step-size variant of SGD is faster in the early stages until it converges to a neighborhood of the optimum, where individual runs start showing a very noisy behavior.\n\nSAGA vs. q-SAGA. q-SAGA outperforms plain SAGA quite consistently when counting stochastic update steps. This establishes optimistic reference curves of what we can expect to achieve with \u03b5N-SAGA. The actual speed-up is somewhat data set dependent.\n\n\u03b5N-SAGA vs. SAGA and q-SAGA. \u03b5N-SAGA with sufficiently small \u03b5 can realize much of the possible freshness gains of q-SAGA and performs very similarly for a few (2-10) epochs, where it traces nicely between the SAGA and q-SAGA curves. We see solid speed-ups on all three datasets for both \u00b5 = 0.1 and \u00b5 = 0.001.\n\nAsymptotics. It should be clearly stated that running \u03b5N-SAGA at a fixed \u03b5 for longer will not result in good asymptotics on the empirical risk. This is because, as theory predicts, \u03b5N-SAGA cannot drive the suboptimality to zero, but rather levels off at a point determined by \u03b5. In our experiments, the cross-over point with SAGA was typically after 5\u201315 epochs. Note that the gains in the first epochs can be significant, though. In practice, one will either define a desired accuracy level and choose \u03b5 accordingly or one will switch to SAGA for accurate convergence.\n\n5 Conclusion\n\nWe have generalized variance reduced SGD methods under the name of memorization algorithms and presented a corresponding analysis, which commonly applies to all such methods. 
We have investigated in detail the range of safe step sizes with their corresponding geometric rates as guaranteed by our theory. This has delivered a number of new insights, for instance about the trade-offs between small (\u223c 1/n) and large (\u223c 1/(4L)) step sizes in different regimes as well as about the role of the freshness of stochastic gradients evaluated at past iterates.\n\nWe have also investigated and quantified the effect of additional errors in the variance correction terms on the convergence behavior. Depending on how \u00b5 scales with n, we have shown that such errors can be tolerated, yet, for small \u00b5, may have a negative effect on the convergence rate as much smaller step sizes are needed to still guarantee convergence to a small region. We believe this result to be relevant for a number of approximation techniques in the context of variance reduced SGD.\n\nMotivated by these insights and results of our analysis, we have proposed \u03b5N-SAGA, a modification of SAGA that exploits similarities between training data points by defining a neighborhood system. Approximate versions of per-data point gradients are then computed by sharing information among neighbors. This opens up the possibility of variance-reduction in a streaming data setting, where each data point is only seen once. We believe this to be a promising direction for future work.\n\nEmpirically, we have been able to achieve consistent speed-ups for the initial phase of regularized risk minimization. This shows that approximate computations of variance correction terms constitute a promising approach of trading off computation with solution accuracy.\n\nAcknowledgments. We would like to thank Yannic Kilcher, Martin Jaggi, R\u00e9mi Leblond and the anonymous reviewers for helpful suggestions and corrections.\n\nReferences\n\n[1] A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Commun. 
ACM, 51(1):117\u2013122, 2008.\n\n[2] L. Bottou. Large-scale machine learning with stochastic gradient descent. In COMPSTAT,\npages 177\u2013186. Springer, 2010.\n\n[3] S. Dasgupta and K. Sinha. Randomized partition trees for nearest neighbor search. Algorithmica,\n72(1):237\u2013263, 2015.\n\n[4] A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with\nsupport for non-strongly convex composite objectives. In Advances in Neural Information\nProcessing Systems, pages 1646\u20131654, 2014.\n\n[5] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance\nreduction. In Advances in Neural Information Processing Systems, pages 315\u2013323, 2013.\n\n[6] J. Kone\u010dn\u00fd and P. Richt\u00e1rik. Semi-stochastic gradient descent methods. arXiv preprint\narXiv:1312.1666, 2013.\n\n[7] H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical\nStatistics, pages 400\u2013407, 1951.\n\n[8] M. Schmidt. Convergence rate of stochastic gradient with constant step size. UBC Technical\nReport, 2014.\n\n[9] M. Schmidt, N. Le Roux, and F. Bach. Minimizing finite sums with the stochastic average\ngradient. arXiv preprint arXiv:1309.2388, 2013.\n\n[10] S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter. Pegasos: Primal estimated sub-gradient\nsolver for SVM. Mathematical Programming, 127(1):3\u201330, 2011.\n\n[11] S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized\nloss. The Journal of Machine Learning Research, 14(1):567\u2013599, 2013.\n", "award": [], "sourceid": 1364, "authors": [{"given_name": "Thomas", "family_name": "Hofmann", "institution": "ETH Zurich"}, {"given_name": "Aurelien", "family_name": "Lucchi", "institution": "ETH Zurich"}, {"given_name": "Simon", "family_name": "Lacoste-Julien", "institution": "INRIA"}, {"given_name": "Brian", "family_name": "McWilliams", "institution": "ETH Zurich"}]}