{"title": "Stochastic Gradient Descent, Weighted Sampling, and the Randomized Kaczmarz algorithm", "book": "Advances in Neural Information Processing Systems", "page_first": 1017, "page_last": 1025, "abstract": "We improve a recent guarantee of Bach and Moulines on the linear convergence of SGD for smooth and strongly convex objectives, reducing a quadratic dependence on the strong convexity to a linear dependence. Furthermore, we show how reweighting the sampling distribution (i.e. importance sampling) is necessary in order to further improve convergence, and obtain a linear dependence on average smoothness, dominating previous results, and more broadly discuss how importance sampling for SGD can improve convergence also in other scenarios. Our results are based on a connection we make between SGD and the randomized Kaczmarz algorithm, which allows us to transfer ideas between the separate bodies of literature studying each of the two methods.", "full_text": "Stochastic Gradient Descent, Weighted Sampling, and the Randomized Kaczmarz algorithm\n\nDeanna Needell\nDepartment of Mathematical Sciences\nClaremont McKenna College\nClaremont CA 91711\ndneedell@cmc.edu\n\nNathan Srebro\nToyota Technological Institute at Chicago\nand Dept. of Computer Science, Technion\nnati@ttic.edu\n\nRachel Ward\nDepartment of Mathematics\nUniv. of Texas, Austin\nrward@math.utexas.edu\n\nAbstract\n\nWe improve a recent guarantee of Bach and Moulines on the linear convergence\nof SGD for smooth and strongly convex objectives, reducing a quadratic dependence\non the strong convexity to a linear dependence. Furthermore, we show how\nreweighting the sampling distribution (i.e. importance sampling) is necessary in\norder to further improve convergence, and obtain a linear dependence on average\nsmoothness, dominating previous results, and more broadly discuss how importance\nsampling for SGD can improve convergence also in other scenarios. 
Our\nresults are based on a connection between SGD and the randomized Kaczmarz\nalgorithm, which allows us to transfer ideas between the separate bodies of literature\nstudying each of the two methods.\n\n1 Introduction\n\nThis paper concerns two algorithms which until now have remained somewhat disjoint in the\nliterature: the randomized Kaczmarz algorithm for solving linear systems and the stochastic gradient\ndescent (SGD) method for optimizing a convex objective using unbiased gradient estimates. The\nconnection enables us to make contributions by borrowing from each body of literature to the other.\nIn particular, it helps us highlight the role of weighted sampling for SGD and obtain a tighter\nguarantee on the linear convergence regime of SGD.\nOur starting point is a recent analysis on convergence of the SGD iterates. Considering a stochastic\nobjective F(x) = Ei[fi(x)], classical analyses of SGD show a polynomial rate on the suboptimality\nof the objective value F(xk) \u2212 F(x\u22c6). Bach and Moulines [1] showed that if F(x) is \u00b5-strongly\nconvex, fi(x) are Li-smooth (i.e. their gradients are Li-Lipschitz), and x\u22c6 is a minimizer of (almost)\nall fi(x) (i.e. Pi(\u2207fi(x\u22c6) = 0) = 1), then E\u2016xk \u2212 x\u22c6\u2016 goes to zero exponentially, rather than\npolynomially, in k. That is, reaching a desired accuracy of E\u2016xk \u2212 x\u22c6\u2016\u00b2 \u2264 \u03b5 requires a number of\nsteps that scales only logarithmically in 1/\u03b5. Bach and Moulines\u2019s bound on the required number of\niterations further depends on the average squared conditioning number E[(Li/\u00b5)\u00b2].\nIn a seemingly independent line of research, the Kaczmarz method was proposed as an iterative\nmethod for solving overdetermined systems of linear equations [7]. The simplicity of the method\nmakes it popular in applications ranging from computer tomography to digital signal processing [5,\n9, 6]. 
Recently, Strohmer and Vershynin [19] proposed a variant of the Kaczmarz method which\nselects rows with probability proportional to their squared norm, and showed that using this selection\nstrategy, a desired accuracy of \u03b5 can be reached in the noiseless setting in a number of steps that\nscales with log(1/\u03b5) and only linearly in the condition number. As we discuss in Section 5, the\nrandomized Kaczmarz algorithm is in fact a special case of stochastic gradient descent.\nInspired by the above analysis, we prove improved convergence results for generic SGD, as well as\nfor SGD with gradient estimates chosen based on a weighted sampling distribution, highlighting the\nrole of importance sampling in SGD:\nWe first show that without perturbing the sampling distribution, we can obtain a linear dependence\non the uniform conditioning (sup Li/\u00b5), but it is not possible to obtain a linear dependence on\nthe average conditioning E[Li]/\u00b5. This is a quadratic improvement over [1] in regimes where the\ncomponents have similar Lipschitz constants (Theorem 2.1 in Section 2).\nWe then show that with weighted sampling we can obtain a linear dependence on the average\nconditioning E[Li]/\u00b5, dominating the quadratic dependence of [1] (Corollary 3.1 in Section 3).\nIn Section 4, we show how also for smooth but not-strongly-convex objectives, importance sampling\ncan improve a dependence on a uniform bound over smoothness, (sup Li), to a dependence on the\naverage smoothness E[Li]; such an improvement is not possible without importance sampling.\nFor non-smooth objectives, we show that importance sampling can eliminate a dependence on the\nvariance in the Lipschitz constants of the components.\nFinally, in Section 5, we turn to the Kaczmarz algorithm, and show we can improve known\nguarantees in this context as well.\n\n2 SGD for Strongly Convex Smooth Optimization\nWe consider the problem of minimizing a strongly convex function of the form F(x) 
= Ei\u223cD fi(x)\nwhere fi : H \u2192 R are smooth functionals over H = R^d endowed with the standard Euclidean\nnorm \u2016\u00b7\u2016\u2082, or over a Hilbert space H with the norm \u2016\u00b7\u2016\u2082. Here i is drawn from some source\ndistribution D over an arbitrary probability space. Throughout this manuscript, unless explicitly\nspecified otherwise, expectations will be with respect to indices drawn from the source distribution\nD. We denote the unique minimum x\u22c6 = arg min F(x) and denote by \u03c3\u00b2 the \u201cresidual\u201d quantity at\nthe minimum, \u03c3\u00b2 = E\u2016\u2207fi(x\u22c6)\u2016\u2082\u00b2.\nAssumptions Our bounds will be based on the following assumptions and quantities: First, F has\nstrong convexity parameter \u00b5; that is, \u27e8x \u2212 y, \u2207F(x) \u2212 \u2207F(y)\u27e9 \u2265 \u00b5\u2016x \u2212 y\u2016\u2082\u00b2 for all vectors x\nand y. Second, each fi is continuously differentiable and the gradient function \u2207fi has Lipschitz\nconstant Li; that is, \u2016\u2207fi(x) \u2212 \u2207fi(y)\u2016\u2082 \u2264 Li\u2016x \u2212 y\u2016\u2082 for all vectors x and y. We denote sup L\nthe supremum of the support of Li, i.e. the smallest L such that Li \u2264 L a.s., and similarly denote\ninf L the infimum. We denote the average Lipschitz constant as L = E Li.\nAn unbiased gradient estimate for F(x) can be obtained by drawing i \u223c D and using \u2207fi(x) as the\nestimate. The SGD updates with (fixed) step size \u03b3 based on these gradient estimates are given by:\n\n$x_{k+1} \\leftarrow x_k - \\gamma \\nabla f_{i_k}(x_k)$ (2.1)\n\nwhere {ik} are drawn i.i.d. from D. 
We are interested in the distance \u2016xk \u2212 x\u22c6\u2016\u2082\u00b2 of the iterates\nfrom the unique minimum, and denote the initial distance by \u03b50 = \u2016x0 \u2212 x\u22c6\u2016\u2082\u00b2.\nBach and Moulines [1, Theorem 1] considered this setting\u00b9 and established that\n\n$k = 2 \\log(\\varepsilon_0/\\varepsilon) \\left( \\frac{\\mathbb{E} L_i^2}{\\mu^2} + \\frac{\\sigma^2}{\\mu^2 \\varepsilon} \\right)$ (2.2)\n\nSGD iterations of the form (2.1), with an appropriate step-size, are sufficient to ensure\nE\u2016xk \u2212 x\u22c6\u2016\u2082\u00b2 \u2264 \u03b5, where the expectation is over the random sampling. As long as \u03c3\u00b2 = 0, i.e. the\nsame minimizer x\u22c6 minimizes all components fi(x) (though of course it need not be a unique\nminimizer of any of them), this yields linear convergence to x\u22c6, with a graceful degradation as \u03c3\u00b2 > 0.\nHowever, in the linear convergence regime, the number of required iterations scales with the\nexpected squared conditioning E[Li\u00b2]/\u00b5\u00b2. In this paper, we reduce this quadratic dependence to a linear\ndependence.\n\n\u00b9 Bach and Moulines\u2019s results are somewhat more general. Their Lipschitz requirement is a bit weaker and\nmore complicated, but in terms of Li yields (2.2). They also study the use of polynomial decaying step-sizes,\nbut these do not lead to improved runtime if the target accuracy is known ahead of time.\n\n
We begin with a guarantee ensuring linear dependence on sup L/\u00b5:\nTheorem 2.1 Let each fi be convex where \u2207fi has Lipschitz constant Li, with Li \u2264 sup L a.s.,\nand let F(x) = E fi(x) be \u00b5-strongly convex. Set \u03c3\u00b2 = E\u2016\u2207fi(x\u22c6)\u2016\u2082\u00b2, where x\u22c6 = argminx F(x).\nSuppose that \u03b3 \u2264 1/\u00b5. Then the SGD iterates given by (2.1) satisfy:\n\n$\\mathbb{E}\\|x_k - x_\\star\\|_2^2 \\le \\left[ 1 - 2\\gamma\\mu(1 - \\gamma \\sup L) \\right]^k \\|x_0 - x_\\star\\|_2^2 + \\frac{\\gamma\\sigma^2}{\\mu(1 - \\gamma \\sup L)}.$ (2.3)\n\nThat is, for any desired \u03b5, using a step-size of\n\n$\\gamma = \\frac{\\mu\\varepsilon}{2\\varepsilon\\mu \\sup L + 2\\sigma^2}$\n\nensures that after\n\n$k = 2 \\log(\\varepsilon_0/\\varepsilon) \\left( \\frac{\\sup L}{\\mu} + \\frac{\\sigma^2}{\\mu^2 \\varepsilon} \\right)$ (2.4)\n\nSGD iterations, E\u2016xk \u2212 x\u22c6\u2016\u2082\u00b2 \u2264 \u03b5, where \u03b50 = \u2016x0 \u2212 x\u22c6\u2016\u2082\u00b2 and where both expectations are with\nrespect to the sampling of {ik}.\nProof sketch: The crux of the improvement over [1] is a tighter recursive equation. Instead of:\n\n$\\|x_{k+1} - x_\\star\\|_2^2 \\le \\left( 1 - 2\\gamma\\mu + 2\\gamma^2 L_{i_k}^2 \\right) \\|x_k - x_\\star\\|_2^2 + 2\\gamma^2\\sigma^2,$\n\nwe use the co-coercivity Lemma (Lemma A.1 in the supplemental material) to obtain:\n\n$\\|x_{k+1} - x_\\star\\|_2^2 \\le \\left( 1 - 2\\gamma\\mu + 2\\gamma^2 \\mu L_{i_k} \\right) \\|x_k - x_\\star\\|_2^2 + 2\\gamma^2\\sigma^2.$\n\nThe significant difference is that one of the factors of Lik, an upper bound on the second derivative\n(where ik is the random index selected in the kth iteration) in the third term inside the parenthesis,\nis replaced by \u00b5, a lower bound on the second derivative of F. A complete proof can be found in the\nsupplemental material.\n\nComparison to [1] Our bound (2.4) improves the quadratic dependence on 1/\u00b5 to a linear\ndependence and replaces the dependence on the average squared smoothness E[Li\u00b2] with a linear dependence\non the smoothness bound sup L. When all Lipschitz constants Li are of similar magnitude, this is a\nquadratic improvement in the number of required iterations. 
However, when different components\nfi have widely different scaling, i.e. Li are highly variable, the supremum might be significantly\nlarger than the average square conditioning.\n\nTightness Considering the above, one might hope to obtain a linear dependence on the average\nsmoothness L. However, as the following example shows, this is not possible. Consider a uniform\nsource distribution over N + 1 quadratics, with the first quadratic f1 being N(x[1] \u2212 b)\u00b2 and all\nothers being x[2]\u00b2, and b = \u00b11. Any method must examine f1 in order to recover x to within\nerror less than one, but by uniformly sampling indices i, this takes N iterations in expectation.\nWe can calculate sup L = L1 = 2N, L = 2(2N\u22121)/N, E[Li\u00b2] = 4(N\u00b2+N\u22121)/N, and \u00b5 = 1. Both\nsup L/\u00b5 and E[Li\u00b2]/\u00b5\u00b2 are O(N) and scale correctly with the expected number of iterations, while error\nreduction in O(L/\u00b5) = O(1) iterations is not possible for this example.\nWe therefore see that the choice between E[Li\u00b2] and sup L is unavoidable. In the next Section, we\nwill show how we can obtain a linear dependence on the average smoothness L, using importance\nsampling, i.e. by sampling from a modified distribution.\n\n3 Importance Sampling\nFor a weight function w(i) which assigns a non-negative weight w(i) \u2265 0 to each index i, the\nweighted distribution D(w) is defined as the distribution such that\n\n$P_{D^{(w)}}(I) \\propto \\mathbb{E}_{i \\sim D}\\left[ \\mathbf{1}_I(i)\\, w(i) \\right],$\n\nwhere I is an event (subset of indices) and 1I(\u00b7) its indicator function. For a discrete distribution\nD with probability mass function p(i) this corresponds to weighting the probabilities to obtain a\nnew probability mass function, which we write as p(w)(i) \u221d w(i)p(i). Similarly, for a continuous\ndistribution, this corresponds to multiplying the density by w(i) and renormalizing. 
Importance\nsampling has appeared in both the Kaczmarz method [19] and in coordinate-descent methods [14,\n15], where the weights are proportional to some power of the Lipschitz constants (of the gradient\ncoordinates). Here we analyze this type of sampling in the context of SGD.\nOne way to construct D(w) is through rejection sampling: sample i \u223c D, and accept with probability\nw(i)/W, for some W \u2265 supi w(i). Otherwise, reject and continue to re-sample until a suggestion\ni is accepted. The accepted samples are then distributed according to D(w).\nWe use E(w)[\u00b7] = Ei\u223cD(w)[\u00b7] to denote expectation where indices are sampled from the weighted\ndistribution D(w). An important property of such an expectation is that for any quantity X(i):\n\n$\\mathbb{E}^{(w)}\\left[ \\frac{1}{w(i)} X(i) \\right] = \\frac{\\mathbb{E}[X(i)]}{\\mathbb{E}[w(i)]},$ (3.1)\n\nwhere recall that the expectations on the r.h.s. are with respect to i \u223c D. In particular, when\nE[w(i)] = 1, we have that E(w)[X(i)/w(i)] = E X(i). In fact, we will consider only weights\ns.t. E[w(i)] = 1, and refer to such weights as normalized.\n\nReweighted SGD For any normalized weight function w(i), we can write:\n\n$f_i^{(w)}(x) = \\frac{1}{w(i)} f_i(x)$ and $F(x) = \\mathbb{E}^{(w)}\\left[ f_i^{(w)}(x) \\right].$ (3.2)\n\nThis is an equivalent, and equally valid, stochastic representation of the objective F(x), and we can\njust as well base SGD on this representation. In this case, at each iteration we sample i \u223c D(w)\nand then use $\\nabla f_i^{(w)}(x) = \\frac{1}{w(i)} \\nabla f_i(x)$ as an unbiased gradient estimate. SGD iterates based on the\nrepresentation (3.2), which we will refer to as w-weighted SGD, are then given by\n\n$x_{k+1} \\leftarrow x_k - \\frac{\\gamma}{w(i_k)} \\nabla f_{i_k}(x_k)$ (3.3)\n\nwhere {ik} are drawn i.i.d. 
from D(w).\nThe important observation here is that all SGD guarantees are equally valid for the w-weighted\nupdates (3.3): the objective is the same objective F(x), the sub-optimality is the same, and the\nminimizer x\u22c6 is the same. We do need, however, to calculate the relevant quantities controlling SGD\nconvergence with respect to the modified components $f_i^{(w)}$ and the weighted distribution D(w).\n\nStrongly Convex Smooth Optimization using Weighted SGD We now return to the analysis of\nstrongly convex smooth optimization and investigate how re-weighting can yield a better guarantee.\nThe Lipschitz constant $L_i^{(w)}$ of each component $f_i^{(w)}$ is now scaled, and we have $L_i^{(w)} = \\frac{1}{w(i)} L_i$.\nThe supremum is then given by:\n\n$\\sup L_{(w)} = \\sup_i L_i^{(w)} = \\sup_i \\frac{L_i}{w(i)}.$ (3.4)\n\nIt is easy to verify that (3.4) is minimized by the weights\n\n$w(i) = \\frac{L_i}{L}$, so that $\\sup L_{(w)} = \\sup_i \\frac{L_i}{L_i/L} = L.$ (3.5)\n\nBefore applying Theorem 2.1, we must also calculate:\n\n$\\sigma_{(w)}^2 = \\mathbb{E}^{(w)}\\left[ \\|\\nabla f_i^{(w)}(x_\\star)\\|_2^2 \\right] = \\mathbb{E}\\left[ \\frac{1}{w(i)} \\|\\nabla f_i(x_\\star)\\|_2^2 \\right] = \\mathbb{E}\\left[ \\frac{L}{L_i} \\|\\nabla f_i(x_\\star)\\|_2^2 \\right] \\le \\frac{L}{\\inf L}\\, \\sigma^2.$ (3.6)\n\nNow, applying Theorem 2.1 to the w-weighted SGD iterates (3.3) with weights (3.5), we have that,\nwith an appropriate stepsize,\n\n$k = 2 \\log(\\varepsilon_0/\\varepsilon) \\left( \\frac{\\sup L_{(w)}}{\\mu} + \\frac{\\sigma_{(w)}^2}{\\mu^2 \\varepsilon} \\right) = 2 \\log(\\varepsilon_0/\\varepsilon) \\left( \\frac{L}{\\mu} + \\frac{L}{\\inf L} \\cdot \\frac{\\sigma^2}{\\mu^2 \\varepsilon} \\right)$ (3.7)\n\niterations are sufficient for E(w)\u2016xk \u2212 x\u22c6\u2016\u2082\u00b2 \u2264 \u03b5, where x\u22c6, \u00b5 and \u03b50 are exactly as in Theorem 2.1.\nIf \u03c3\u00b2 = 0, i.e. we are in the \u201crealizable\u201d situation, with true linear convergence, then we also have\n$\\sigma_{(w)}^2 = 0$. 
In this case, we already obtain the desired guarantee: linear convergence with a linear\ndependence on the average conditioning L/\u00b5, strictly improving over the best known results [1].\nHowever, when \u03c3\u00b2 > 0 we get a dissatisfying scaling of the second term, by a factor of L/inf L.\nFortunately, we can easily overcome this factor. To do so, consider sampling from a distribution\nwhich is a mixture of the original source distribution and its re-weighting:\n\n$w(i) = \\frac{1}{2} + \\frac{1}{2} \\cdot \\frac{L_i}{L}.$ (3.8)\n\nWe refer to this as partially biased sampling. Instead of an even mixture as in (3.8), we could also\nuse a mixture with any other constant proportion, i.e. w(i) = \u03bb + (1 \u2212 \u03bb)Li/L for 0 < \u03bb < 1.\nUsing these weights, we have\n\n$\\sup L_{(w)} = \\sup_i \\frac{1}{\\frac{1}{2} + \\frac{1}{2} \\cdot \\frac{L_i}{L}} L_i \\le 2L$ and $\\sigma_{(w)}^2 = \\mathbb{E}\\left[ \\frac{1}{\\frac{1}{2} + \\frac{1}{2} \\cdot \\frac{L_i}{L}} \\|\\nabla f_i(x_\\star)\\|_2^2 \\right] \\le 2\\sigma^2.$ (3.9)\n\nCorollary 3.1 Let each fi be convex where \u2207fi has Lipschitz constant Li and let F(x) =\nEi\u223cD[fi(x)], where F(x) is \u00b5-strongly convex. Set \u03c3\u00b2 = E\u2016\u2207fi(x\u22c6)\u2016\u2082\u00b2, where x\u22c6 =\nargminx F(x). For any desired \u03b5, using a stepsize of\n\n$\\gamma = \\frac{\\mu\\varepsilon}{4(\\varepsilon\\mu L + \\sigma^2)}$\n\nensures that after\n\n$k = 4 \\log(\\varepsilon_0/\\varepsilon) \\left( \\frac{L}{\\mu} + \\frac{\\sigma^2}{\\mu^2 \\varepsilon} \\right)$ (3.10)\n\niterations of w-weighted SGD (3.3) with weights specified by (3.8), E(w)\u2016xk \u2212 x\u22c6\u2016\u2082\u00b2 \u2264 \u03b5, where\n\u03b50 = \u2016x0 \u2212 x\u22c6\u2016\u2082\u00b2 and L = E Li.\nThis result follows by substituting (3.9) into Theorem 2.1. 
We now obtain the desired linear scaling\non L/\u00b5, without introducing any additional factor to the residual term, except for a constant factor.\nWe thus obtain a result which dominates Bach and Moulines (up to a factor of 2) and substantially\nimproves upon it (with a linear rather than quadratic dependence on the conditioning). Such\n\u201cpartially biased weights\u201d are not only an analysis trick, but might indeed improve actual performance\nover either no weighting or the \u201cfully biased\u201d weights (3.5), as demonstrated in Figure 1.\n\nImplementing Importance Sampling In settings where linear systems need to be solved\nrepeatedly, or when the Lipschitz constants are easily computed from the data, it is straightforward to\nsample by the weighted distribution. However, when we only have sampling access to the source\ndistribution D (or the implied distribution over gradient estimates), importance sampling might be\ndifficult. In light of the above results, one could use rejection sampling to simulate sampling from\nD(w). For the weights (3.5), this can be done by accepting samples with probability proportional\nto Li/sup L. The overall probability of accepting a sample is then L/sup L, introducing an\nadditional factor of sup L/L. This yields a sample complexity with a linear dependence on sup L, as\nin Theorem 2.1, but a reduction in the number of actual gradient calculations and updates. In even\nless favorable situations, if Lipschitz constants cannot be bounded for individual components, even\nimportance sampling might not be possible.\n\n4 Importance Sampling for SGD in Other Scenarios\n\nIn the previous Section, we considered SGD for smooth and strongly convex objectives, and were\nparticularly interested in the regime where the residual \u03c3\u00b2 is low, and the linear convergence term is\ndominant. 
Weighted SGD is useful also in other scenarios, and we now briefly survey them, as well\nas relate them to our main scenario of interest.\n\nFigure 1: Performance of SGD with weights w(i) = \u03bb + (1 \u2212 \u03bb) Li/L on synthetic overdetermined least\nsquares problems of the form (5.1) (\u03bb = 1 is unweighted, \u03bb = 0 is fully weighted). Left: ai are standard\nspherical Gaussian, bi = \u27e8ai, x0\u27e9 + N(0, 0.1\u00b2). Center: ai is spherical Gaussian with variance i, bi =\n\u27e8ai, x0\u27e9 + N(0, 20\u00b2). Right: ai is spherical Gaussian with variance i, bi = \u27e8ai, x0\u27e9 + N(0, 0.1\u00b2). In all\ncases, matrix A with rows ai is 1000 \u00d7 100 and the corresponding least squares problem is strongly convex;\nthe stepsize was chosen as in (3.10).\n\nFigure 2: Performance of SGD with weights w(i) = \u03bb + (1 \u2212 \u03bb) Li/L on synthetic underdetermined least\nsquares problems of the form (5.1) (\u03bb = 1 is unweighted, \u03bb = 0 is fully weighted). We consider 3 cases. Left:\nai are standard spherical Gaussian, bi = \u27e8ai, x0\u27e9 + N(0, 0.1\u00b2). Center: ai is spherical Gaussian with variance\ni, bi = \u27e8ai, x0\u27e9 + N(0, 20\u00b2). Right: ai is spherical Gaussian with variance i, bi = \u27e8ai, x0\u27e9 + N(0, 0.1\u00b2).\nIn all cases, matrix A with rows ai is 50 \u00d7 100 and so the corresponding least squares problem is not strongly\nconvex; the step-size was chosen as in (3.10).\n\nSmooth, Not Strongly Convex When each component fi is convex, non-negative, and has an\nLi-Lipschitz gradient, but the objective F(x) is not necessarily strongly convex, then after\n\n$k = O\\left( \\frac{(\\sup L) \\|x_\\star\\|_2^2}{\\varepsilon} \\cdot \\frac{F(x_\\star) + \\varepsilon}{\\varepsilon} \\right)$ (4.1)\n\niterations of SGD with an appropriately chosen step-size we will have F(xk) \u2264 F(x\u22c6) + \u03b5, where\nxk is an appropriate averaging of the k iterates [18]. 
The relevant quantity here determining the\niteration complexity is again sup L. Furthermore, the dependence on the supremum is unavoidable and\ncannot be replaced with the average Lipschitz constant L [3, 18]: if we sample gradients according\nto the source distribution D, we must have a linear dependence on sup L.\nThe only quantity in the bound (4.1) that changes with a re-weighting is sup L; all other quantities\n(\u2016x\u22c6\u2016\u2082\u00b2, F(x\u22c6), and the sub-optimality \u03b5) are invariant to re-weightings. We can therefore replace\nthe dependence on sup L with a dependence on sup L(w) by using a weighted SGD as in (3.3). As we\nalready calculated, the optimal weights are given by (3.5), and using them we have sup L(w) = L.\nIn this case, there is no need for partially biased sampling, and we obtain that\n\n$k = O\\left( \\frac{L \\|x_\\star\\|_2^2}{\\varepsilon} \\cdot \\frac{F(x_\\star) + \\varepsilon}{\\varepsilon} \\right)$ (4.2)\n\niterations of weighted SGD updates (3.3) using the weights (3.5) suffice. Empirical evidence suggests\nthat this is not a theoretical artifact; full weighted sampling indeed exhibits better convergence rates\ncompared to partially biased sampling in the non-strongly convex setting (see Figure 2), in contrast\nto the strongly convex regime (see Figure 1).\n[Figure 1 panels: error \u2016xk \u2212 x\u22c6\u2016\u2082 vs. iteration k, for \u03bb = 0, 0.2, 1. Figure 2 panels: error F(xk) \u2212 F(x\u22c6) vs. iteration k, for \u03bb = 0, 0.4, 1.]\n
We again see that using importance sampling allows us\nto reduce the dependence on sup L, which is unavoidable without biased sampling, to a dependence\non L. An interesting question for further consideration is to what extent importance sampling can\nalso help with stochastic optimization procedures such as SAG [8] and SDCA [17] which achieve\nfaster convergence on finite data sets. Indeed, weighted sampling was shown empirically to achieve\nfaster convergence rates for SAG [16], but theoretical guarantees remain open.\n\nNon-Smooth Objectives We now turn to non-smooth objectives, where the components fi might\nnot be smooth, but each component is Gi-Lipschitz. Roughly speaking, Gi is a bound on the first\nderivative (the subgradients) of fi, while Li is a bound on the second derivatives of fi. Here,\nthe performance of SGD (actually stochastic subgradient descent) depends on the second moment\n$\\overline{G^2} = \\mathbb{E}[G_i^2]$ [12]. The precise iteration complexity depends on whether the objective is strongly\nconvex or whether x\u22c6 is bounded, but in either case depends linearly on $\\overline{G^2}$.\nUsing weighted SGD, we get linear dependence on\n\n$\\overline{G^2}_{(w)} = \\mathbb{E}^{(w)}\\left[ (G_i^{(w)})^2 \\right] = \\mathbb{E}^{(w)}\\left[ \\frac{G_i^2}{w(i)^2} \\right] = \\mathbb{E}\\left[ \\frac{G_i^2}{w(i)} \\right]$ (4.3)\n\nwhere $G_i^{(w)} = G_i/w(i)$ is the Lipschitz constant of the scaled $f_i^{(w)}$. This is minimized by the\nweights $w(i) = G_i/\\overline{G}$, where $\\overline{G} = \\mathbb{E} G_i$, yielding $\\overline{G^2}_{(w)} = \\overline{G}^2$. Using importance sampling, we\ntherefore reduce the dependence on $\\overline{G^2}$ to a dependence on $\\overline{G}^2$. It is helpful to recall that $\\overline{G^2} = \\overline{G}^2 + \\operatorname{Var}[G_i]$. 
What we save is thus exactly the variance of the Lipschitz constants Gi.\nParallel work we recently became aware of [22] shows a similar improvement for a non-smooth\ncomposite objective. Rather than relying on a specialized analysis as in [22], here we show this\nfollows from SGD analysis applied to different gradient estimates.\n\nNon-Realizable Regime Returning to the smooth and strongly convex setting of Sections 2 and 3,\nlet us consider more carefully the residual term \u03c3\u00b2 = E\u2016\u2207fi(x\u22c6)\u2016\u2082\u00b2. This quantity depends on the\nweighting, and in Section 3, we avoided increasing it, introducing partial biasing for this purpose.\nHowever, if this is the dominant term, we might want to choose weights to minimize this term. The\noptimal weights here would be proportional to \u2016\u2207fi(x\u22c6)\u2016\u2082, which is not known in general.\nAn alternative approach is to bound \u2016\u2207fi(x\u22c6)\u2016\u2082 \u2264 Gi and so $\\sigma^2 \\le \\overline{G^2}$. Taking this bound, we are\nback to the same quantity as in the non-smooth case, and the optimal weights are proportional to Gi.\nNote that this differs from using weights proportional to Li, which optimize the linear-convergence\nterm as studied in Section 3.\nTo understand how weighting according to Gi and Li are different, consider a generalized linear\nobjective fi(x) = \u03c6i(\u27e8zi, x\u27e9), where \u03c6i is a scalar function with bounded |\u03c6i\u2032|, |\u03c6i\u2033|. We have\nthat Gi \u221d \u2016zi\u2016\u2082 while Li \u221d \u2016zi\u2016\u2082\u00b2. Weighting according to (3.5), versus weighting with $w(i) = G_i/\\overline{G}$,\nthus corresponds to weighting according to \u2016zi\u2016\u2082\u00b2 versus \u2016zi\u2016\u2082, and are rather different. E.g.,\nweighting by Li \u221d \u2016zi\u2016\u2082\u00b2 yields $\\overline{G^2}_{(w)} = \\overline{G^2}$: the same sub-optimal dependence as if no weighting\nat all were used. 
A good solution could be to weight by a mixture of Gi and Li, as in the partial\nweighting scheme of Section 3.\n\n5 The least squares case and the Randomized Kaczmarz Method\n\nA special case of interest is the least squares problem, where\n\n$F(x) = \\frac{1}{2} \\sum_{i=1}^{n} (\\langle a_i, x \\rangle - b_i)^2 = \\frac{1}{2} \\|Ax - b\\|_2^2$ (5.1)\n\nwith b \u2208 C^n, A an n \u00d7 d matrix with rows ai, and x\u22c6 = argminx (1/2)\u2016Ax \u2212 b\u2016\u2082\u00b2 is the least-squares\nsolution. We can also write (5.1) as a stochastic objective, where the source distribution D is uniform\nover {1, 2, . . . , n} and fi = (n/2)(\u27e8ai, x\u27e9 \u2212 bi)\u00b2. In this setting, \u03c3\u00b2 = \u2016Ax\u22c6 \u2212 b\u2016\u2082\u00b2 is the residual error\nat the least squares solution x\u22c6, which can also be interpreted as noise variance in a linear regression\nmodel.\nThe randomized Kaczmarz method, introduced for solving the least squares problem (5.1) in the\ncase where A is an overdetermined full-rank matrix, begins with an arbitrary estimate x0, and in the\nkth iteration selects a row i at random from the matrix A and iterates by:\n\n$x_{k+1} = x_k + c \\cdot \\frac{b_i - \\langle a_i, x_k \\rangle}{\\|a_i\\|_2^2}\\, a_i,$ (5.2)\n\nwhere c = 1 in the standard method. This is almost an SGD update with step-size \u03b3 = c/n, except\nfor the scaling by \u2016ai\u2016\u2082\u00b2.\nStrohmer and Vershynin [19] provided the first non-asymptotic convergence rates, showing that\ndrawing rows proportionally to \u2016ai\u2016\u2082\u00b2 leads to provable exponential convergence in expectation [19].\nWith such a weighting, (5.2) is exactly weighted SGD, as in (3.3), with the fully biased weights\n(3.5).\nThe reduction of the quadratic dependence on the conditioning to a linear dependence in Theorem\n2.1, and the use of biased sampling, was inspired by the analysis of [19]. 
Indeed, applying Theorem\n2.1 to the weighted SGD iterates with weights as in (3.5) and a stepsize of \u03b3 = 1 yields precisely the\nguarantee of [19]. Furthermore, understanding the randomized Kaczmarz method as SGD allows\nus to obtain the following improvements:\nPartially Biased Sampling. Using partially biased sampling weights (3.8) yields a better\ndependence on the residual over the fully biased sampling weights (3.5) considered by [19].\nUsing Step-sizes. The randomized Kaczmarz method with weighted sampling exhibits exponential\nconvergence, but only to within a radius, or convergence horizon, of the least-squares solution [19,\n10]. This is because a step-size of \u03b3 = 1 is used, and so the second term in (2.3) does not vanish.\nIt has been shown [21, 2, 20, 4, 11] that changing the step size can allow for convergence inside of\nthis convergence horizon, but only asymptotically. Our results allow for finite-iteration guarantees\nwith arbitrary step-sizes and can be immediately applied to this setting.\nUniform Row Selection. Strohmer and Vershynin\u2019s variant of the randomized Kaczmarz method\ncalls for weighted row sampling, and thus requires pre-computing all the row norms. Although\ncertainly possible in some applications, in other cases this might be better avoided. Understanding\nthe randomized Kaczmarz as SGD allows us to apply Theorem 2.1 also with uniform weights (i.e. to\nthe unbiased SGD), and obtain a randomized Kaczmarz using uniform sampling, which converges\nto the least-squares solution and enjoys finite-iteration guarantees.\n\n6 Conclusion\n\nWe consider this paper as making three main contributions. First, we improve the dependence on\nthe conditioning for smooth and strongly convex SGD from quadratic to linear. Second, we\ninvestigate SGD and importance sampling and show how it can yield improvements not possible without\nreweighting. 
Lastly, we make connections between SGD and the randomized Kaczmarz method.\nThis connection along with our new results shows that the choice of step-size of the Kaczmarz method\noffers a tradeoff between convergence rate and horizon and also allows for a convergence bound\nwhen the rows are sampled uniformly.\nFor simplicity, we only considered SGD with fixed step-size \u03b3, which is appropriate when the target\naccuracy is known in advance. Our analysis can be adapted also to decaying step-sizes.\nOur discussion of importance sampling is limited to a static reweighting of the sampling distribution.\nA more sophisticated approach would be to update the sampling distribution dynamically as the\nmethod progresses, and as we gain more information about the relative importance of components\n(e.g. about \u2016\u2207fi(x\u22c6)\u2016). Such dynamic sampling is sometimes attempted heuristically, and obtaining\na rigorous framework for this would be desirable.\n\nReferences\n[1] F. Bach and E. Moulines. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. Advances in Neural Information Processing Systems (NIPS), 2011.\n\n[2] Y. Censor, P. P. B. Eggermont, and D. Gordon. Strong underrelaxation in Kaczmarz\u2019s method for inconsistent systems. Numerische Mathematik, 41(1):83\u201392, 1983.\n\n[3] R. Foygel and N. Srebro. Concentration-based guarantees for low-rank matrix reconstruction. 24th Ann. Conf. Learning Theory (COLT), 2011.\n\n[4] M. Hanke and W. Niethammer. On the acceleration of Kaczmarz\u2019s method for inconsistent linear systems. Linear Algebra and its Applications, 130:83\u201398, 1990.\n\n[5] G. T. Herman. Fundamentals of computerized tomography: image reconstruction from projections. Springer, 2009.\n\n[6] G. N. Hounsfield. Computerized transverse axial scanning (tomography): Part 1. description of system. British Journal of Radiology, 46(552):1016\u20131022, 1973.\n\n[7] S. 
Kaczmarz. Angen\u00e4herte Aufl\u00f6sung von Systemen linearer Gleichungen. Bull. Int. Acad. Polon. Sci. Lett. Ser. A, pages 335\u2013357, 1937.\n\n[8] N. Le Roux, M. W. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. Advances in Neural Information Processing Systems (NIPS), pages 2672\u20132680, 2012.\n\n[9] F. Natterer. The mathematics of computerized tomography, volume 32 of Classics in Applied Mathematics. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, 2001. ISBN 0-89871-493-1. doi: 10.1137/1.9780898719284. URL http://dx.doi.org/10.1137/1.9780898719284. Reprint of the 1986 original.\n\n[10] D. Needell. Randomized Kaczmarz solver for noisy linear systems. BIT, 50(2):395\u2013403, 2010. ISSN 0006-3835. doi: 10.1007/s10543-010-0265-5. URL http://dx.doi.org/10.1007/s10543-010-0265-5.\n\n[11] D. Needell and R. Ward. Two-subspace projection method for coherent overdetermined linear systems. Journal of Fourier Analysis and Applications, 19(2):256\u2013269, 2013.\n\n[12] A. Nemirovski. Efficient methods in convex programming. 2005.\n\n[13] Y. Nesterov. Introductory Lectures on Convex Optimization. Kluwer, 2004.\n\n[14] Y. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optimiz., 22(2):341\u2013362, 2012.\n\n[15] P. Richt\u00e1rik and M. Tak\u00e1\u010d. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Math. Program., pages 1\u201338, 2012.\n\n[16] M. Schmidt, N. Le Roux, and F. Bach. Minimizing finite sums with the stochastic average gradient. arXiv preprint arXiv:1309.2388, 2013.\n\n[17] S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss. J. Mach. Learn. Res., 14(1):567\u2013599, 2013.\n\n[18] N. Srebro, K. Sridharan, and A. Tewari. 
Smoothness, low noise and fast rates. In Advances in Neural Information Processing Systems, 2010.\n\n[19] T. Strohmer and R. Vershynin. A randomized Kaczmarz algorithm with exponential convergence. J. Fourier Anal. Appl., 15(2):262\u2013278, 2009. ISSN 1069-5869. doi: 10.1007/s00041-008-9030-4. URL http://dx.doi.org/10.1007/s00041-008-9030-4.\n\n[20] K. Tanabe. Projection method for solving a singular system of linear equations and its applications. Numerische Mathematik, 17(3):203\u2013214, 1971.\n\n[21] T. M. Whitney and R. K. Meany. Two algorithms related to the method of steepest descent. SIAM Journal on Numerical Analysis, 4(1):109\u2013118, 1967.\n\n[22] P. Zhao and T. Zhang. Stochastic optimization with importance sampling. Submitted, 2014.", "award": [], "sourceid": 613, "authors": [{"given_name": "Deanna", "family_name": "Needell", "institution": "Claremont McKenna College"}, {"given_name": "Rachel", "family_name": "Ward", "institution": "Univ. of Texas Austin"}, {"given_name": "Nati", "family_name": "Srebro", "institution": "TTI Chicago"}]}