{"title": "Communication trade-offs for Local-SGD with large step size", "book": "Advances in Neural Information Processing Systems", "page_first": 13601, "page_last": 13612, "abstract": "Synchronous mini-batch SGD is state-of-the-art for large-scale distributed machine learning. However, in practice, its convergence is bottlenecked by slow communication rounds between worker nodes. A natural solution to reduce communication is to use the \\emph{``local-SGD''} model in which the workers train their model independently and synchronize every once in a while. This algorithm improves the computation-communication trade-off but its convergence is not understood very well. We propose a non-asymptotic error analysis, which enables comparison to \\emph{one-shot averaging} i.e., a single communication round among independent workers, and \\emph{mini-batch averaging} i.e., communicating at every step. We also provide adaptive lower bounds on the communication frequency for large step-sizes ($ t^{-\\alpha} $, $ \\alpha\\in (1/2 , 1 ) $) and show that \\emph{Local-SGD} reduces communication by a factor of $O\\Big(\\frac{\\sqrt{T}}{P^{3/2}}\\Big)$, with $T$ the total number of gradients and $P$ machines.", "full_text": "Communication trade-offs for Local-SGD with large step size

Kumar Kshitij PATEL
MLO, EPFL, Lausanne, Switzerland
TTIC-Toyota Technological Institute Chicago
kkpatel@ttic.edu

Aymeric DIEULEVEUT
MLO, EPFL, Lausanne, Switzerland
CMAP, Ecole Polytechnique, Palaiseau, France
aymeric.dieuleveut@polytechnique.edu

Abstract

Synchronous mini-batch SGD is state-of-the-art for large-scale distributed machine learning. However, in practice, its convergence is bottlenecked by slow communication rounds between worker nodes. A natural solution to reduce communication is to use the "local-SGD" model in which the workers train their model independently and synchronize every once in a while. This algorithm improves the computation-communication trade-off but its convergence is not understood very well. We propose a non-asymptotic error analysis, which enables comparison to one-shot averaging, i.e., a single communication round among independent workers, and mini-batch averaging, i.e., communicating at every step. We also provide adaptive lower bounds on the communication frequency for large step sizes ($t^{-\alpha}$, $\alpha \in (1/2, 1)$) and show that local-SGD reduces communication by a factor of $O\big(\sqrt{T}/P^{3/2}\big)$, with $T$ the total number of gradients and $P$ machines.

1 Introduction

We consider the minimization of an objective function which is accessible through unbiased independent and identically distributed estimates of its gradients. This problem has received attention from various communities over the last fifty years in optimization, stochastic approximation, and machine learning [1–7]. The most widely used algorithms are stochastic gradient descent (SGD), a.k.a. the Robbins-Monro algorithm [8], and some of its modifications based on averaging of the iterates [1, 2, 9]. For a convex differentiable function $F : \mathbb{R}^d \to \mathbb{R}$, SGD iteratively updates an estimator $(v_t)_{t \ge 0}$: for any $t \ge 1$,
$$v_t = v_{t-1} - \eta_t g_t(v_{t-1}), \qquad (1)$$
where $(\eta_t)_{t \ge 0}$ is a deterministic sequence of positive scalars, referred to as the learning rate, and $g_t(v_{t-1})$ is an oracle on the gradient of the function $F$ at $v_{t-1}$. We focus on objective functions that are both smooth and strongly convex [10]. While these assumptions might be restrictive in practice, they enable a tight analysis of the error of SGD. In such a setting, two types of proofs have been used traditionally. On the one hand, Lyapunov-type proofs rely on controlling the expected squared distance to the optimal point [11]. 
Such an analysis suggests using small decaying steps, inversely proportional to the number of iterations ($t^{-1}$). On the other hand, studying the recursion as a stochastic process [1] better captures the reduction of the noise through averaging. It results in optimal convergence rates for larger steps, typically scaling as $t^{-\alpha}$, $\alpha \in (1/2, 1)$ [10].
Over the past decade, the amount of available data has steadily increased: to adapt SGD to such situations, it has become necessary to distribute the workload between several machines, also referred to as workers [12–14]. For SGD, two extreme approaches have received attention: 1) workers run SGD independently and aggregate their results at the end, called one-shot averaging (OSA) [13, 15] or parameter mixing, and 2) mini-batch averaging (MBA) [16–20], where workers communicate after every iteration: all gradients are thus computed at the same support point (iterate), and the algorithm is equivalent to using mini-batches of size P, with P the number of workers. While OSA

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: Schematic representation of one-shot averaging (left), mini-batch averaging (middle) and local-SGD (right). Vertical threads correspond to machines and orange boxes to communication rounds.

requires only a single communication step, it typically does not perform very well in practice [21]. At the other extreme, MBA performs better in practice, but the number of communications equals the number of steps, which is a major burden, as communication is highly time consuming [22]. To optimize this computation-communication-convergence trade-off, we consider the local-SGD framework: P workers run SGD iterations in parallel and communicate periodically. This framework encompasses one-shot averaging and mini-batch averaging as special cases (see Figure 1). 
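To make the serial baseline concrete, the recursion of Equation (1) with a large step size $\eta_t \propto t^{-\alpha}$, $\alpha \in (1/2, 1)$, and Polyak-Ruppert averaging can be sketched in a few lines. The toy quadratic objective, the noise level and all constants below are illustrative choices, not taken from the paper.

```python
import numpy as np

# Toy instance of recursion (1): F(v) = 0.5 * ||v - w_star||^2, so
# F'(v) = v - w_star, and the oracle returns the gradient plus centred
# Gaussian noise.  w_star, alpha, T and the noise level are illustrative.
rng = np.random.default_rng(0)
d, T, alpha = 5, 20_000, 0.75
w_star = rng.normal(size=d)

v = np.zeros(d)           # v_0
v_bar = np.zeros(d)       # Polyak-Ruppert running average of (v_t)
for t in range(1, T + 1):
    eta_t = 0.5 * t ** (-alpha)                  # "large" step size t^{-alpha}
    g = (v - w_star) + 0.1 * rng.normal(size=d)  # unbiased gradient oracle
    v = v - eta_t * g                            # recursion (1)
    v_bar += (v - v_bar) / t                     # online average of iterates

final_error = np.linalg.norm(v_bar - w_star)     # averaging damps the noise
```

With decaying steps of this form, the running average typically ends up far closer to $w_\star$ than the starting point, which is the noise-damping effect of averaging that the analysis below quantifies.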
We make the following contributions:
1) We provide the first non-asymptotic analysis for local-SGD with large step sizes (typically scaling as $t^{-\alpha}$, for $\alpha \in (1/2, 1)$), in both on-line and finite-horizon settings. Our assumptions encompass the ubiquitous least-squares regression and logistic regression.
2) Our comparison of the two extreme cases, OSA and MBA, underlines the communication trade-offs. While both of these algorithms are asymptotically equivalent for a fixed number of machines, mini-batch theoretically outperforms one-shot averaging when we consider the precise bias-variance split. In the regime where both the number of machines and the number of gradients grow simultaneously, we show that mini-batch SGD outperforms one-shot averaging.
3) Under three different sets of assumptions, we quantify the frequency of communication necessary for local-SGD to be optimal (i.e., as good as mini-batch). Precisely, we show that the communication frequency can be reduced by as much as $O\big(\sqrt{T}/P^{3/2}\big)$, with T gradients and P workers. Moreover, our bounds suggest an adaptive communication frequency for logistic regression, which depends on the expected distance to the optimal point (a phenomenon observed by Zhang et al. [21]).
4) We support our analysis by experiments illustrating the behavior of the algorithms.
The paper is organized as follows: in Section 2.1, we introduce the general setting, notations and algorithms; in Section 2.2, we describe the related literature; in Section 2.3, we state the assumptions made on the objective function. In Section 3, we provide our main results, their interpretation, consequences, and a comparison with other results. Results in the on-line setting and experiments are presented in Appendix A.2 and Appendix B.

2 Algorithms and setting

We first introduce a couple of notations. 
We consider the finite-dimensional Euclidean space $\mathbb{R}^d$ endowed with its canonical inner product $\langle \cdot, \cdot \rangle$. For any integer $\ell \in \mathbb{N}^*$, we denote by $[\ell]$ the set $\{1, \ldots, \ell\}$. We consider a strongly-convex differentiable function $F : \mathbb{R}^d \to \mathbb{R}$. We denote $w_\star := \arg\min_w F(w)$. With only one machine, Serial-SGD performs a sequence of updates according to Equation (1). In the next section, we describe local-SGD, the object of this study.

2.1 Local-SGD algorithm

We consider P machines, each of them running SGD. Periodically, workers aggregate (i.e., average) their models and restart from the resulting model. We denote by C the number of communication steps. We define a phase as the time between two communication rounds. At phase $t \in [C]$, every worker $p \in [P]$ performs $N^t$ local steps of SGD. Iterations are thus naturally indexed by $(t, k) \in [C] \times [N^t]$. We consider the lexicographic order $\preceq$ on such pairs, which matches the order in which iterations are processed. Note that we assume the number of local steps to be the same over all machines p. While this assumption can be relaxed in practice, it facilitates our proof technique and notation. At any $k \in [N^t]$, we denote by $w^t_{p,k}$ the model proposed by worker p, at phase t, after k local iterations. All machines initially start from the same point $w_0$, that is, for any $p \in [P]$, $w^1_{p,0} = w_0$. The update rule is thus the following, for any $p \in [P]$, $t \in [C]$, $k \in [N^t]$:
$$w^t_{p,k} = w^t_{p,k-1} - \eta^t_k\, g^t_{p,k}(w^t_{p,k-1}). \qquad (2)$$
Aggregation steps consist in averaging the final local iterates of a phase: for any $t \in [C]$, $\hat{w}^t = \frac{1}{P}\sum_{p=1}^P w^t_{p,N^t}$. At phase t+1, every worker $p \in [P]$ restarts from the averaged model: $w^{t+1}_{p,0} := \hat{w}^t$. Eventually, we want to control the excess risk of the Polyak-Ruppert (PR) averaged iterate:
$$\bar{w}_C = \frac{1}{\sum_{t=1}^C N^t} \sum_{t=1}^C N^t\, \bar{w}^t = \frac{1}{P \sum_{t=1}^C N^t} \sum_{t=1}^C \sum_{p=1}^P \sum_{k=1}^{N^t} w^t_{p,k},$$
with $\bar{w}^t = \frac{1}{P N^t} \sum_{p=1}^P \sum_{k=1}^{N^t} w^t_{p,k}$. We use the notation $\bar{w}^t$ to underline the fact that iterates are averaged over one phase, and $\bar{w}_C$ when averaging is made over all iterations. All averaged iterates can be computed on-line.
The algorithm, called local-SGD, is thus parameterized by the number of machines P, the number of communication steps C, the local iteration counts $(N^t)_{t \in [C]}$, the starting point $w_0$, the learning rate $(\eta^t_k)_{(t,k) \in [C] \times [N^t]}$, and the first-order oracle on the gradient. Pseudo-code of the algorithm is given in the Appendix, in Fig. S5.
Link with classical algorithms. Special cases of local-SGD correspond to one-shot averaging or mini-batch averaging. More precisely, for a total number of gradients T, with P workers, C = T/P communication rounds, and $(N^t)_{t \in [C]} = (1, \ldots, 1)$, we realize an instance of P-mini-batch averaging (P-MBA). On the other hand, with P workers, C = 1 communication, and $N^1 = T/P$, we realize an instance of one-shot averaging. Our goal is to get general convergence bounds for local-SGD that recover classical bounds for both these settings when we choose the corresponding parameters. While comparing to Serial-SGD (which is also a particular case of the algorithm) would also be interesting, we focus here on the comparison between local-SGD, one-shot averaging and mini-batch averaging. 
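The scheme just described can be sketched directly from the update rule (2) and the aggregation step. The function `local_sgd` below, the quadratic oracle and all parameter values are illustrative names and choices, not the paper's code; choosing C = T/P phases of one local step recovers P-MBA, while a single phase of T/P steps recovers one-shot averaging.

```python
import numpy as np

def local_sgd(P, phases, eta, w0, grad_oracle, rng):
    """Sketch of local-SGD: P workers, phases = (N^1, ..., N^C) local
    steps per phase, constant step size eta (finite-horizon setting)."""
    w_hat = np.array(w0, dtype=float)       # shared restart point
    iterates = []                           # for the Polyak-Ruppert average
    for n_t in phases:                      # phase t = 1, ..., C
        workers = [w_hat.copy() for _ in range(P)]
        for _ in range(n_t):                # N^t local SGD steps, Eq. (2)
            for p in range(P):
                workers[p] -= eta * grad_oracle(workers[p], rng)
                iterates.append(workers[p].copy())
        w_hat = np.mean(workers, axis=0)    # communication: average models
    return w_hat, np.mean(iterates, axis=0)  # final point and PR average

# Illustrative oracle: F(w) = 0.5*||w||^2 (w* = 0) with additive noise.
g = lambda w, rng: w + 0.05 * rng.normal(size=w.shape)
rng = np.random.default_rng(1)
T, P = 512, 4
mba = local_sgd(P, [1] * (T // P), 0.1, np.ones(3), g, rng)  # C = T/P, N^t = 1
osa = local_sgd(P, [T // P], 0.1, np.ones(3), g, rng)        # C = 1, N^1 = T/P
```

Any intermediate communication budget, e.g. `phases = [8] * (T // (8 * P))`, interpolates between the two extremes while processing the same total number of gradients.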
Indeed, the step size is generally increased for mini-batch with respect to Serial-SGD, and the running efficiency of the algorithms is then harder to compare: we only compare algorithms that use the same number of machines.

2.2 Related Work

On Stochastic Gradient Descent. Bounds on the excess risk of SGD for convex functions have been widely studied: most proofs rely on controlling the decay of the mean squared distance $E[\|v_t - w_\star\|^2]$, which results in an upper bound on the mean excess risk $E[F(\bar{v}_t) - F(w_\star)]$ [23, 24]. This upper bound is composed of a "bias" term that depends on the initial condition, and a "variance" term that involves either an upper bound on the norm of the noisy gradient (in the non-smooth case), or an upper bound on the variance of the noisy gradient in the smooth case [5, 11]. In the strongly convex case, such an approach advocates the use of small step sizes, scaling as $(\mu t)^{-1}$. However, in practice, this is not a very satisfying result, as the constant µ is typically unknown, and convergence is very sensitive to ill-conditioning. On the other hand, in the smooth and strongly-convex case, the classical analysis by Polyak and Juditsky [1] relies on an explicit decomposition of the stochastic process $(\bar{v}_t - w_\star)_{t \ge 1}$: the effect of averaging on the noise term is better taken into account, and this analysis thus suggests using larger steps; it results in the optimal rate for $\eta_t \propto t^{-\alpha}$, with $\alpha \in (0, 1)$. This type of analysis has been successfully used recently [10, 15, 25, 26].
For quadratic functions, even larger steps can be used, as pointed out by Bach and Moulines [27]. Indeed, even with a non-decaying step size, the averaged process converges to the optimal point. 
Several studies focus on understanding the properties of SGD for quadratic functions: a detailed non-asymptotic analysis is provided by Défossez and Bach [28], acceleration under the additive noise oracle (see Assumption A4 below) is studied by Dieuleveut et al. [29] (and without this assumption by Jain et al. [30]), and Jain et al. [20] analyze the effects of mini-batch and tail averaging.
One-shot averaging. In this approach, the P independent workers compute several steps of stochastic gradient descent, and a single communication step is used to average the different models [13, 31, 32]. Zinkevich et al. [13] show a reduction of the variance when multiple workers are used, but neither consider the Polyak-Ruppert averaged iterate as the final output, nor provide non-asymptotic rates.

Zhang et al. [33] provide the first non-asymptotic results for OSA, but their dependence on constants (like the strong convexity constant µ, moment bounds, etc.) is worse, and their single-machine convergence bound [34] is not truly non-asymptotic (unlike, e.g., Bach and Moulines [10]). More importantly, their results hold only for small learning rates scaling as $c/(\mu t)$. Rosenblatt and Nadler [35] have also discussed the asymptotic equivalence of OSA with vanilla-SGD by providing an analysis up to second-order terms. Further, Jain et al. [20] have provided non-asymptotic results for least-squares regression using a similar Polyak-Juditsky analysis of the stochastic process, while our results apply to more general problems. Their approach encompasses one-shot averaging and the effect of tail averaging, which we do not consider here. Recently, Godichon and Saadane [15] proposed an approach similar to ours (but only for one-shot averaging). However, their result relies on an asymptotic bound, namely $E[\|w_t - w_\star\|^2] \le C_1 \eta_t$ (as in Rakhlin et al. [34]), while our analysis is purely non-asymptotic, and we also improve the upper bound on the noise term resulting from the analysis.
Mini-batch averaging. Mini-batch averaging has been studied by Dekel et al. [16] and Takáč et al. [17]. These papers show an improvement in the variance of the process and make comparisons to SGD. It has been found that increasing the mini-batch size often increases the generalization error, which limits distributivity [36]. Jain et al. [20] have provided upper bounds on the learning rate and mini-batch size for optimal performance. Recently, large mini-batches have been leveraged successfully in deep learning, as in [37–39], by properly tuning learning rates, etc.
Local-SGD. Zhang et al. [21] empirically show that local-SGD performs well. They also provide a theoretical guarantee on the variance of the process; however, they assume the variance of the estimated gradients to be uniformly upper bounded (Assumption A4 below). Such an assumption is restrictive in practice: for example, it is not satisfied for least-squares regression. In a simultaneous work, Stich [40] has provided an analysis for local-SGD. The limitation of their analysis is that they also assume bounded gradients and use a small step size scaling as $c/(\mu t)$. More importantly, their analysis does not extend to the extreme case of one-shot averaging, as ours does. Lin et al. [41] have experimentally shown that local-SGD is better than synchronous mini-batch techniques at overcoming the communication bottleneck. Recently, Yu et al. [42] have given convergence rates for the non-convex synchronous and stale-synchronous settings.
We have summarized the major limitations of some of these analyses in Table S3, given in Appendix I. Our motivation is to do away with some of these restrictive assumptions, and to provide tight upper bounds for the above three averaging schemes. 
In the following section, we present the set of assumptions under which our analysis is conducted.

2.3 Assumptions

We first make the following classical assumptions on the objective function $F : \mathbb{R}^d \to \mathbb{R}$. In the following, we use different subsets of these assumptions:
A1 (Strong convexity) The function F is strongly convex with convexity constant $\mu > 0$.
A2 (Smoothness and regularity) The function F is three times continuously differentiable with uniformly bounded second and third derivatives: $\sup_{w \in \mathbb{R}^d} \|F^{(2)}(w)\| < L$ and $\sup_{w \in \mathbb{R}^d} \|F^{(3)}(w)\| < M$. In particular, F is L-smooth.
Q1 (Quadratic function) There exists a positive definite matrix $\Sigma \in \mathbb{R}^{d \times d}$ such that F is the quadratic function $w \mapsto \|\Sigma^{1/2}(w - w_\star)\|^2 / 2$.
If Q1 is satisfied, then Assumptions A1 and A2 are satisfied, and L and µ are respectively the largest and smallest eigenvalues of Σ. At any iteration $(t, k) \in [C] \times [N^t]$, any machine can query an unbiased estimator $g^t_{p,k+1}(w)$ of the gradient at a point w. Formally, we make the following assumption:
A3 (Oracle on the gradient) We observe unbiased estimators of the gradient: for any $(t, k) \in [C] \times [N^t]$ and $w \in \mathbb{R}^d$, $E[g^t_{p,k+1}(w^t_{p,k}) \mid w^t_{p,k}] = F'(w^t_{p,k})$. Moreover, for any fixed w, the functions $(g^t_{p,k})_{(t,k)}(w)$ are i.i.d. (See Appendix A.1 for a more formal statement.)
In Proposition 3, we make the additional, stronger assumption that the variance of the gradient estimates is uniformly upper bounded, a standard assumption in the SGD literature, see e.g. Zhang et al. [21]:
A4 (Uniformly bounded variance) The variance of the error, $E[\|g^t_{p,k+1}(w^t_{p,k}) - F'(w^t_{p,k})\|^2]$, is uniformly upper bounded by $\sigma_\infty^2$, a constant which does not depend on the iteration.
Assumption A4 is for example true if the sequence of random vectors $(g^t_{p,k+1}(w^t_{p,k}) - F'(w^t_{p,k}))_{t \in [C], k \in [N^t], p \in [P]}$ is i.i.d. This setting is referred to as the semi-stochastic setting [29].
We also consider the following conditions on the regularity of the gradients:
A5 (Cocoercivity of the random gradients) For any $p \in [P]$, $t \in [C]$, $k \in [N^t]$, $g^t_{p,k}$ is almost surely L-cocoercive (with the same constant as in A2), that is, for any $w_1, w_2 \in \mathbb{R}^d$, $L \langle g^t_{p,k}(w_1) - g^t_{p,k}(w_2),\, w_1 - w_2 \rangle \ge \|g^t_{p,k}(w_1) - g^t_{p,k}(w_2)\|^2$.
Almost sure L-cocoercivity [43] is for example satisfied if, for any (p, k, t), there exists a random function $f^t_{p,k}$ which is a.s. convex and L-smooth and such that $g^t_{p,k} = (f^t_{p,k})'$. Finally, we assume that the fourth-order moment of the random gradients at $w_\star$ is well defined:
A6 (Finite variance at $w_\star$) There exists $\sigma \ge 0$ such that for any $(t, k, p) \in [C] \times [N^t] \times [P]$, $E[\|g^t_{p,k}(w_\star)\|^4] \le \sigma^4$.
It must be noted that A6 is a much weaker assumption than A4: for example, least-squares regression satisfies the former but not the latter. Most of these assumptions are classical in machine learning. SGD for least-squares regression satisfies Q1, A3, A5 and A6. On the other hand, SGD for logistic regression satisfies A1, A2, A3 and A4. Our main result, Theorem 6 (lower bounding the frequency of communications), applies to both these sets of assumptions. In Appendix C.3 we further detail how these assumptions apply in machine learning.
Learning rate. 
We always assume that for any $t \in [C]$, $k \in [N^t]$, the learning rate satisfies $2 \eta^t_k L \le 1$. We consider two different types of learning rates: 1) in the finite-horizon (FH) case, the step size $(\eta^t_k)_{(t,k) \in [C] \times [N^t]}$ is a constant η, which can depend on the number of iterations eventually performed by the algorithm; 2) in the on-line case, the sequence of step sizes is a subsequence of a universal sequence $(\tilde{\eta}_\ell)_{\ell \ge 0}$. Moreover, in our analysis, when using a decaying learning rate, the step size only depends on the number of iterations processed in the past: $\eta^t_k = \tilde{\eta}_{\{\sum_{t'=1}^{t-1} N^{t'} + k\}}$. In particular, the step size at iteration (t, k) does not depend on the machine.

Though both of these approaches are often considered to be nearly equivalent [44, 45], fundamental differences exist in their convergence properties. The on-line case is harder to analyze, but ultimately provides a better convergence rate. However, as the behavior is easier to interpret in the finite-horizon case, we postpone the results for the on-line setting to Appendix A.2. In the following section, we present our main results.

3 Main Results

Sketch of the proof. We follow the approach by Polyak and Juditsky, which relies on the following decomposition: for any $p \in [P]$, $t \in [C]$, $k \in [N^t]$, Equation (2) is trivially equivalent to
$$\eta^t_k F''(w_\star)(w^t_{p,k-1} - w_\star) = w^t_{p,k-1} - w^t_{p,k} - \eta^t_k \big[g^t_{p,k}(w^t_{p,k-1}) - F'(w^t_{p,k-1})\big] - \eta^t_k \big[F'(w^t_{p,k-1}) - F''(w_\star)(w^t_{p,k-1} - w_\star)\big].$$
We have added and subtracted a first-order Taylor expansion of the gradient around the optimal value $w_\star$. Thus, using the definition of $\bar{w}_C$:
$$F''(w_\star)\big(\bar{w}_C - w_\star\big) = \frac{1}{P \sum_{t=1}^C N^t} \sum_{t=1}^C \sum_{p=1}^P \sum_{k=1}^{N^t} \bigg( \frac{w^t_{p,k-1} - w^t_{p,k}}{\eta^t_k} - \big[g^t_{p,k}(w^t_{p,k-1}) - F'(w^t_{p,k-1})\big] - \big[F'(w^t_{p,k-1}) - F''(w_\star)(w^t_{p,k-1} - w_\star)\big] \bigg). \qquad (3)$$
In other words, the error can be decomposed into three terms: the first one mainly depends on the initial condition, the second one is a noise term: it is the mean of centered random variables (as $E[g^t_{p,k}(w^t_{p,k-1}) - F'(w^t_{p,k-1})] = 0$), and the third is a residual term that accounts for the fact that the function is not quadratic (if F is quadratic, then $F'(w^t_{p,k-1}) - F''(w_\star)(w^t_{p,k-1} - w_\star) = 0$).
Controlling the different terms in Equation (3). The variance of the noise $g^t_{p,k}(w^t_{p,k-1}) - F'(w^t_{p,k-1})$ and the residual term both directly depend on the distance $\|w^t_{p,k-1} - w_\star\|^2$. 
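This decomposition is an exact rewriting of each SGD step, which can be checked numerically; the one-dimensional objective $F(w) = \log\cosh(w)$ (so $F'(w) = \tanh(w)$, $w_\star = 0$, $F''(w_\star) = 1$) and the noise level are illustrative choices.

```python
import numpy as np

# Per-step identity behind the decomposition: for each iteration,
#   F''(w*) (w_{k-1} - w*) = (w_{k-1} - w_k)/eta_k
#                            - [g_k(w_{k-1}) - F'(w_{k-1})]          (noise)
#                            - [F'(w_{k-1}) - F''(w*)(w_{k-1} - w*)] (residual)
# Toy 1-D objective: F(w) = log cosh(w), F'(w) = tanh(w), w* = 0, F''(0) = 1.
rng = np.random.default_rng(2)
w, w_star, f2_star = 1.5, 0.0, 1.0
gaps = []
for k in range(1, 200):
    eta = 0.5 * k ** (-0.75)
    g = np.tanh(w) + 0.1 * rng.normal()   # noisy first-order oracle
    w_next = w - eta * g                  # recursion (2)
    lhs = f2_star * (w - w_star)
    noise = g - np.tanh(w)
    residual = np.tanh(w) - f2_star * (w - w_star)
    rhs = (w - w_next) / eta - noise - residual
    gaps.append(abs(lhs - rhs))
    w = w_next

max_gap = max(gaps)   # identical up to floating-point rounding
```

The noise term is the only stochastic piece, and the residual vanishes identically when F is quadratic, which is exactly why the analysis treats the three terms separately.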
The proof is thus composed of two aspects: (1) we first provide a tight control of this quantity, with or without communication: in the following propositions, this corresponds to an upper bound on $E[\|w^t_{p,k} - w_\star\|^2]$¹; (2) we then derive the subsequent upper bound on $E[\|F''(w_\star)(\bar{w}_C - w_\star)\|^2]$.
We first compare the convergence in the two extreme situations, i.e., for mini-batch averaging (MBA) and one-shot averaging (OSA) in the finite-horizon setting, and then provide these results for local-SGD.

¹More precisely, on $E[\|\hat{w}^t - w_\star\|^2]$ and $E[\|w^1_{p,k} - w_\star\|^2]$ for MBA and OSA respectively.

3.1 Results for MBA and OSA, Finite Horizon setting

First, we assume the step size $\eta^t_k$ to be a constant η at every iteration, for any $t \in [C]$, $k \in [N^t]$. Our first contribution is to provide non-asymptotic convergence rates for MBA and OSA that allow a simple comparison. For the benefit of presentation, we define the following quantities:
$$Q_{\mathrm{bias}} = 1 + \frac{M^2 \eta}{\mu} \|w_0 - w_\star\|^2 + \frac{L^2 \eta}{\mu P}, \qquad Q_{1,\mathrm{var}}(X) = \frac{L^2 \eta}{\mu} + \frac{P}{X \eta \mu}, \qquad Q_{2,\mathrm{var}}(X) = \frac{M^2 X P \eta^2 \sigma^2}{\mu^2}.$$
In the following, we use the notation $\precsim$ to denote inequality up to an absolute constant. Recall that for MBA, the total number of gradients processed is T = PC, while it is T = PN for OSA. We have the following results, respectively for MBA and OSA:

Proposition 1 (Mini-batch Averaging) Under Assumptions A1, A2, A3, A5, A6, we have the following bounds for mini-batch SGD: for any $t \in [C]$,
$$E\big[\|\hat{w}^t - w_\star\|^2\big] \le (1 - \eta\mu)^t \|w_0 - w_\star\|^2 + \frac{2 \sigma^2 \eta}{P}\, \frac{1 - (1 - \eta\mu)^t}{\mu}, \qquad (4)$$
$$E\big[\|F''(w_\star)(\bar{w}_C - w_\star)\|^2\big] \precsim \frac{\|w_0 - w_\star\|^2}{\eta^2 C^2}\, Q_{\mathrm{bias}} + \frac{\sigma^2}{T} \Big( 1 + \frac{Q_{1,\mathrm{var}}(C)}{P} + \frac{Q_{2,\mathrm{var}}(C)}{P^2} \Big). \qquad (5)$$

Proposition 2 (One-shot Averaging) Under Assumptions A1, A2, A3, A5, A6, we have the following bounds for one-shot averaging: for $p \in [P]$, $t = 1$, $k \in [N]$,
$$E\big[\|w^1_{p,k} - w_\star\|^2\big] \le (1 - \eta\mu)^k \|w_0 - w_\star\|^2 + 2 \sigma^2 \eta\, \frac{1 - (1 - \eta\mu)^k}{\mu}, \qquad (6)$$
$$E\big[\|F''(w_\star)(\bar{w}_C - w_\star)\|^2\big] \precsim \frac{\|w_0 - w_\star\|^2}{\eta^2 N^2}\, Q_{\mathrm{bias}} + \frac{\sigma^2}{T} \big( 1 + Q_{1,\mathrm{var}}(N) + Q_{2,\mathrm{var}}(N) \big). \qquad (7)$$

Interpretation, fixed P. Using mini-batch naturally reduces the variance of the process
Equations (4) and (6) show that the speed at which the initial condition is\n(wt\nforgotten remains the same, but that the variance of the local process is reduced by a factor P .\nEquations (5) and (7) show that the convergence depends on an initial condition term and a variance\nterm. For a \ufb01xed number of machines P , and a step size scaling as \u03b7 = X\u2212\u03b1, 0.5 < \u03b1 < 1,\nX \u2208 {N, C}, the speed at which the initial condition is forgotten is asymptotically dictated by\nQbias/(\u03b7X)2 where X \u2208 {N, C}, for both algorithms (if we use the same number of gradients for\nboth algorithms, naturally, N = C.) As for the variance term, it scales as \u03c32T \u22121 as T \u2192 \u221e , as\nthe remaining terms Qvar(X) asymptotically vanish for \u03b7 = X\u2212\u03b1. It reduces with the total number\nT of gradients used in the process. Interestingly, this term is the same for the two extreme cases\n(MBA and OSA): it does not depend on the number of communication rounds. This phenomenon\nis often described as \u201cthe noise is the noise and SGD doesn\u2019t care\u201d (for asynchronous SGD, [46]).\nThough we recover this asymptotic equivalence here, our belief is that this asymptotic point of view\nis typically misleading as the asymptotic regime is not always reached, and the residual terms do then\nmatter.\nIndeed, the lower order terms do have a dependence on the number of communication rounds:\nwhen the number of communications increases, the overall effect of the noise is reduced. More\nprecisely, since Qvar(N ) = Qvar(C) the remaining terms are respectively P or P 2 times smaller\nfor mini-batch. This provides a theoretical explanation of why mini-batch SGD outperforms one\nshot averaging in practice. It also highlights the weakness of an asymptotic analysis: the dominant\nterm might be equivalent, without re\ufb02ecting the actual behavior of the algorithm. 
Disregarding communication aspects, mini-batch SGD is in that sense optimal.
Note that for quadratic functions, $Q_{2,\mathrm{var}} = 0$ as M = 0. The conditions on the step size can thus be relaxed, and the asymptotic rates described above remain valid under a relaxed step-size condition [20]. An extension to the on-line setting, eventually leading to a better convergence rate, is given in Proposition S7 in Appendix A.2.
Interpretation, $P, T \to \infty$. When both the total number of gradients T and the number of machines P are allowed to grow simultaneously, the asymptotic regime is not necessarily the same for MBA and OSA, as the remaining terms are not always negligible. For example, fixing $\eta = X^{-2/3}$, $X \in \{N, C\}$ (we choose $\alpha = 2/3$ to balance $Q_{1,\mathrm{var}}$ and $Q_{2,\mathrm{var}}$), the variance term for OSA would be controlled by $\sigma^2 T^{-1}\big(1 + \frac{P}{\mu C^{1/3}}\big)$. Thus, unless $P \le \mu C^{1/3}$, MBA could outperform OSA by a factor as large as P.
Novelty and proofs. Both Propositions 1 and 2 are proved in Appendix G. Importantly, Equations (4) and (6) respectively imply Equations (5) and (7) under the stated conditions: this is the reason why we only focus on proving equations similar to Equations (4) and (6) for local-SGD. Proposition 1 is similar to the analysis of Serial-SGD for large step sizes, but with a reduction in the variance proportional to the number of machines. Such a result is derived from the analysis by Dieuleveut et al. [25], combining the approach of Bach and Moulines [27] with the correct upper bound for smooth strongly-convex SGD [47], and similarly controlling higher-order moments. While this result is expected, we have not found it in such a simple form in the literature. Proposition 2 follows a similar approach: we combine the proof for mini-batch with a control of the iterates of each of the machines. This is closely related to Godichon and Saadane [15], but we preserve a non-asymptotic approach.
Remark: link with convergence in function values. As we use Equation (3) as a starting point, we provide convergence results on the Mahalanobis distance $\|F''(w_\star)(\bar{w}_C - w_\star)\|^2$: it is the natural quantity in such a setting [10, 15, 27]. These results could be translated into function-value convergence $F(\bar{w}_C) - F(w_\star)$, using the inequality $F(\bar{w}_C) - F(w_\star) \le L \mu^{-2} \|F''(w_\star)(\bar{w}_C - w_\star)\|^2$, but the dependence on µ would be pessimistic and sub-optimal. However, a similar approach has been used by Bach [44], under a slightly different set of assumptions (including self-concordance, e.g., for logistic regression), recovering optimal rates. An extension to such a set of assumptions, which relies on tracking other quantities, is an important direction.
While the "classical proof", which provides rates for function values directly (with smoothness, or with uniformly bounded gradients), has a better dependence on µ, it cannot easily yield a noise reduction when averaging between machines. Similarly, there is no proof relying only on function values showing that one-shot averaging is asymptotically optimal. In other words, these proofs do not adequately capture the noise reduction due to averaging. Moreover, such proof techniques relying on function values typically involve a small step size $1/(\mu t)$ (because the noise reduction is captured inefficiently). 
Such a step size performs poorly in practice (the initial condition is forgotten slowly), and µ is unknown.

In conclusion, though they do not directly result in an optimal dependence on µ for function values, we believe our approach correctly captures the effect of the noise, and is thus suitable for capturing the effect of local-SGD.

Comparing upper bounds. Our analysis relies on upper bounds: one should handle comparisons with caution. Nevertheless, we think our analysis is tight enough to provide good insights, especially because the bound for OSA nearly matches the bound for MBA (contrary to Stich [40]). Moreover, the bounds given above are tight in the following senses (see Appendix A.3 for details):
(i) the bias term in Equations (5) and (7) is clearly exact in the simple case of a quadratic one-dimensional function in the absence of noise: it is normal that in such a situation MBA and OSA converge similarly, since each of the P independent machines computes the same recursion!
(ii) the bound for the variance, scaling as $(PN)^{-1}$ for any $\eta \propto N^{-\alpha}$, 0.5 < α < 1, matches the statistical minimax rate [48] for least-squares regression: from the statistical point of view, if we are only given NP independent observations, then no estimator can have an error uniformly lower than $\sigma^2 (PN)^{-1}$.

Optimizing over the step size in Equations (5) and (7) results in a somewhat disappointing observation: the rate for $\eta \propto N^{-\alpha}$, 0.5 < α < 1,² is dictated by the bias and scales as $O((\eta N)^{-2})$, which is slow (but tight, see point (i) above). This is unfortunately unavoidable with constant step sizes: the convergence rate with decaying steps is much faster in the on-line setting,³ but the bounds are much harder to read, see Sec. A.2.
In other words, the bounds in Propositions 1 and 2 are tight, but slower than in the on-line setting. As all the trade-offs regarding communications are preserved (our main focus), we chose to highlight the finite-horizon results in the main text.

²A good step size is unlikely to be larger than $1/\sqrt{N}$: such a "very large" learning rate (which is rarely used in practice) does not perform well for non-quadratic functions (note that for quadratic functions, the $NP\eta^2$ term vanishes, and a constant η would give a rate $1/N^2 + 1/(PN)$).

³The bias decreases as $1/N^2$ instead of $1/(\eta N)^2$ (see Prop. S7).

Conclusion: for a fixed or limited number of machines, asymptotically, the convergence rate is similar for OSA and MBA. However, non-asymptotically, or when the number of machines also increases, the dominant terms can be as much as $P^2$ times smaller for MBA. In the following we provide conditions for local-SGD to perform as well as MBA, while requiring many fewer communication rounds.

3.2 Convergence of Local-SGD, Finite Horizon setting

For local-SGD we first consider the case of a quadratic function, under the assumption that the noise has a uniformly upper-bounded variance. While this set of assumptions is not realistic, it allows an intuitive presentation of the results. Similar results for settings encompassing LSR and LR follow. We provide a bound on the moment of an iterate after the communication step $\hat w^t$ (i.e., the restart point of the next phase), and on the second order moment of any iterate.
For t ∈ [C], we denote $N_1^t := \sum_{t'=1}^{t} N^{t'}$.

Proposition 3 (Local-SGD: Quadratic Functions with Bounded Noise) Under Assumptions Q1, A3, A4, we have the following bound for local-SGD: for any p ∈ [P], t ∈ [C], k ∈ [N^t],

$\mathbb{E}\big[\| \hat w^{t-1} - w_\star \|^2\big] \le (1-\eta\mu)^{N_1^{t-1}} \|w_0 - w_\star\|^2 + \frac{\sigma_\infty^2 \eta}{P} \, \frac{1 - (1-\eta\mu)^{N_1^{t-1}}}{\mu},$

$\mathbb{E}\big[\| w_{p,k}^{t} - w_\star \|^2\big] \le (1-\eta\mu)^{N_1^{t-1}+k} \|w_0 - w_\star\|^2 + \sigma_\infty^2 \eta \Bigg( \underbrace{\frac{1 - (1-\eta\mu)^{N_1^{t-1}}}{P\mu}}_{\text{long term reduced variance}} + \underbrace{\frac{1 - (1-\eta\mu)^{k}}{\mu}}_{\text{local iteration variance}} \Bigg).$

To prove such a result, we use the classical technique and introduce a ghost sequence $\breve w_k^t := \frac{1}{P} \sum_{p=1}^{P} w_{p,k}^t$, and recursively control $\| \breve w_k^t - w_\star \|^2$. We conclude by remarking that $\breve w_{N^t}^t = \hat w^t$. This proof is given in Appendix D.2.

Interpretation. The variance bound for the iterates "just after" communication, $\hat w^t$, behaves exactly as in the mini-batch case: the initialization term decays linearly with the number of local steps, and the variance is reduced proportionally to the number of workers P.
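To make the two variance terms of Proposition 3 concrete, the following toy simulation (our own sketch; the constants and variable names are ours, not from the paper) runs local-SGD on a one-dimensional quadratic and compares the second moment of a local iterate at the end of a phase with that of the synchronized iterate just after communication:

```python
import numpy as np

rng = np.random.default_rng(1)
P, C, N = 16, 200, 20            # workers, communication rounds (phases), local steps
eta, mu, sigma = 0.05, 1.0, 1.0
w = np.full(P, 0.0)              # start at the optimum w* = 0 to isolate the variance

local_sq, sync_sq = [], []
for t in range(C):
    for k in range(N):
        # one local SGD step per worker on F(w) = (mu/2) w^2, independent noise
        w -= eta * (mu * w + sigma * rng.standard_normal(P))
    local_sq.append(np.mean(w ** 2))  # second moment of a local iterate, end of phase
    w[:] = w.mean()                   # communication: every worker restarts from the average
    sync_sq.append(w[0] ** 2)         # second moment of the synchronized iterate

# local iterates accumulate the "local iteration variance" within a phase; the
# synchronized iterates keep a variance reduced by (roughly) the number of workers P
print(np.mean(local_sq), np.mean(sync_sq))
```

With these constants the averaged, post-communication iterates show a second moment an order of magnitude below that of the end-of-phase local iterates, mirroring the restart behavior described next.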
On the other hand, the bound on the iterates $w_{p,k}^t$ shows that the variance of this process is composed of a "long term" reduced variance, which accumulates through the phases and increasingly converges to $\frac{\sigma_\infty^2 \eta}{P\mu}$, and of an extra variance $\eta \sigma_\infty^2 \frac{1-(1-\eta\mu)^k}{\mu}$, which increases within the phase and is upper bounded by $\sigma_\infty^2 \eta^2 k$.

In the case of constant step size, the iterates of serial SGD converge to a limit distribution $\pi_\eta$ that depends on the step size [25]. Here, the iterates after communication (or the mini-batch iterates) converge to a distribution with reduced variance $\pi_{\eta/P}$; thus local iterates periodically restart from a distribution with reduced variance, then slowly "diverge" to the distribution with large variance. If the number of local iterations is small enough, the iterates keep a reduced variance. More precisely, we have the following result.

Corollary 4 If for all t ∈ [C], $N^t \le (\mu\eta P)^{-1}$, then the second order moment of $w_{p,k}^t$ admits the same upper bound as the mini-batch iterate $\hat w_{MB}^{N_1^{t-1}+k}$ (Equation (4)), up to a constant factor of 2. As a consequence, Equation (5) is still valid, and local-SGD performs "optimally".

Interpretation. This result shows that if the algorithm communicates often enough, the convergence of the Polyak-Ruppert iterate $\bar w_C$ is as good as in the mini-batch case, thus it is "optimal". Moreover, the minimal number of communication rounds is easy to define: the maximal number of local steps $N^t$ decays as the number of workers and the step size increase. This bound implies that more communication steps are necessary when more machines are used.
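The condition of Corollary 4 translates into a back-of-the-envelope communication budget. The helper below is illustrative only (the function names and constants are ours): it counts how many rounds the bound $N^t \le (\mu\eta P)^{-1}$ implies for a total budget of T gradients:

```python
def max_local_steps(mu, eta, P):
    # Corollary 4: local steps allowed between two communications, N_t <= 1/(mu*eta*P)
    return int(1.0 / (mu * eta * P))

def comm_rounds(T, mu, eta, P):
    # T total gradients are split as P workers x (rounds x local steps per round)
    n_local = max(1, max_local_steps(mu, eta, P))
    steps_per_worker = T // P
    return -(-steps_per_worker // n_local)  # ceiling division

# a step size proportional to the inverse square root of the per-worker horizon
T, P, mu = 10 ** 6, 10, 0.1
steps_per_worker = T // P            # = N * C
eta = steps_per_worker ** -0.5
print(max_local_steps(mu, eta, P))   # local steps allowed between two rounds
print(comm_rounds(T, mu, eta, P))    # rounds needed, vs. one round per step for MBA
```

With these (arbitrary) numbers, a few hundred rounds suffice where mini-batch averaging would communicate at every one of the 100,000 per-worker steps.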
Note that $(\eta P)^{-1}$ is a large number, as a typical value for η is inversely proportional to (a power of) the number of local steps, e.g., $\big(\sum_{t'=1}^{t} N^{t'}\big)^{-\alpha}$, α ∈ (1/2, 1).

Example 5 With a constant number of local steps $N^t = N$ and a learning rate $\eta = c (NC)^{-1/2}$ chosen in order to obtain an optimal $O(\sigma^2 T^{-1})$ parallel variance⁴ rate, local-SGD communicates $O(\sqrt{NC}/(P\mu))$ times less as compared to mini-batch averaging.

⁴In the online setting, the same example would hold, resulting in a $O(\frac{\sigma^2}{T})$ convergence rate (not only variance).

We believe that this is the first result (with Stich [40]) that shows a communication reduction proportional to a power of the number of local steps of a local solver (i.e., $O(\sqrt{NC})$), compared to mini-batch averaging. In the following, we alternatively relax the bounded variance assumption A4 and the quadratic assumption Q1, and show similar results for local-SGD. This allows us to successively cover the cases of least squares regression (LSR) and logistic regression (LR).

Theorem 6 Under either of the following sets of assumptions, the convergence of the Polyak-Ruppert iterate $\bar w_C$ is as good as in the mini-batch case, up to a constant:
(i) Assume Q1, A3, A5, A6, and for any t ∈ [C], $N^t \le (\mu\eta P)^{-1}$ and $\mu\eta^2 N_1^t = O(1)$.
(ii) Assume A1, A2, A3, A4, and for any t ∈ [C], $N^t \le \inf\big( (\eta P M \, \mathbb{E}[\| \hat w^t - w_\star \|])^{-1}, (\mu\eta P)^{-1} \big)$.

These results are derived from Proposition S16 and Proposition S20, which generalize Proposition 3. They are proved in Appendices D and E and constitute the main technical challenge of the paper.

Interpretation.
We note that in both of these situations, the optimal rates can be achieved if the communications happen often enough, and beyond such a number of communication rounds there is no substantial improvement in the convergence. This corresponds to the effect observed in practice [21]. The first set of assumptions is valid for LSR, the second for LR. In the first case, the maximal number of local steps before communication is upper bounded by the same ratio as in Corollary 4, but the "constant" that appears is $\exp(\mu\eta^2 N_1^t)$, so we need this quantity to be small (which is typically always satisfied in practice) in order to be optimal w.r.t. mini-batch averaging. A result similar to Example 5 can be provided, reducing the communication by a factor of $O(\sqrt{NC}/(P\mu))$. In the second case, the maximal number of local steps is smaller than before, by a factor $\mu^{-1}$, but the allowed maximal number of local steps can increase along with the epochs, as $\mathbb{E}[\| \hat w^t - w_\star \|]$ is typically decaying. This adaptive communication frequency has been observed to work well in practice [21] and has also been explored in [49], in a setting without PR averaging. Assuming, for instance, optimization on a compact space with radius R, one can obtain a $O(\sqrt{NC}/P^2)$ improvement in communication, similar to Example 5.

Though they may reflect the actual behavior of the algorithm, such results might be difficult to use directly in practice, as µ is unknown.
However, as it is not the limiting factor in Theorem 6(ii), an estimation of $\mathbb{E}[\| \hat w^t - w_\star \|]$ could allow us to use adaptive phase lengths to minimize communications.

4 Conclusion

Stochastic approximation and distributed optimization are both very densely studied research areas. However, in practice most distributed applications stick to bulk synchronous mini-batch SGD. While the algorithm has desirable convergence properties, it suffers from a huge communication bottleneck. In this paper we have analyzed a natural generalization of mini-batch averaging, local-SGD. Our analysis is non-asymptotic, which helps us better understand the exact communication trade-offs. We give feasible lower bounds on the communication frequency which significantly reduce the need for communication, while providing a non-asymptotic convergence similar to mini-batch averaging. Our results apply to common loss functions, and use large step sizes. Further, our analysis unifies and extends the scattered results for one-shot averaging, mini-batch averaging and local-SGD, providing an intuitive understanding of their behavior.

While they provide some intuition and are believed to be tight, our comparisons are based on upper bounds. Proving corresponding lower bounds is an interesting and important open direction. It would also be interesting to study observable quantities to predict an adaptive communication frequency, and to relax some of the technical assumptions required by the analysis. The on-line case, experiments, proofs, additional materials and a review of distributed optimization follow in the appendix.

Acknowledgements

We would like to acknowledge Sai Praneeth Reddy, Sebastian Stich, Martin Jaggi and Nathan Srebro for helpful comments and discussions at various stages of this project.

References

[1] B. T. Polyak and A. B. Juditsky.
Acceleration of stochastic approximation by averaging. SIAM J. Control Optim., 30(4):838-855, 1992.
[2] D. Ruppert. Efficient estimations from a slowly convergent Robbins-Monro process. Technical report, Cornell University Operations Research and Industrial Engineering, 1988.
[3] V. Fabian. On asymptotic normality in stochastic approximation. The Annals of Mathematical Statistics, pages 1327-1332, 1968.
[4] Y. Nesterov and J. P. Vial. Confidence level solutions for stochastic programming. Automatica, 44(6):1559-1568, 2008.
[5] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM J. on Optimization, 19(4):1574-1609, 2009.
[6] S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan. Stochastic convex optimization. In Proceedings of the International Conference on Learning Theory (COLT), 2009.
[7] T. Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the International Conference on Machine Learning (ICML), 2004.
[8] H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400-407, 1951.
[9] O. Shamir and T. Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In Proceedings of the 30th International Conference on Machine Learning, 2013.
[10] F. Bach and E. Moulines. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Proceedings of the 24th International Conference on Neural Information Processing Systems, NIPS'11, pages 451-459, 2011.
[11] P. Zhao and T. Zhang. Stochastic optimization with importance sampling for regularized loss minimization. In International Conference on Machine Learning (ICML), pages 1-9, 2015.
[12] O. Delalleau and Y. Bengio. Parallel stochastic gradient descent. 2007.
[13] M. Zinkevich, M. Weimer, L. Li, and A. J. Smola. Parallelized stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 2595-2603, 2010.
[14] B. Recht, C. Re, S. Wright, and F. Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 693-701, 2011.
[15] A. B. Godichon and S. Saadane. On the rates of convergence of parallelized averaged stochastic gradient algorithms. ArXiv e-prints, 2017.
[16] O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao. Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research, 13(Jan):165-202, 2012.
[17] M. Takáč, A. Bijral, P. Richtárik, and N. Srebro. Mini-batch primal and dual methods for SVMs. In Proceedings of the 30th International Conference on Machine Learning, pages III-1022. JMLR.org, 2013.
[18] M. Li, T. Zhang, Y. Chen, and A. J. Smola. Efficient mini-batch training for stochastic optimization. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 661-670. ACM, 2014.
[19] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
[20] P. Jain, S. M. Kakade, R. Kidambi, P. Netrapalli, and A. Sidford. Parallelizing stochastic approximation through mini-batching and tail-averaging. ArXiv e-prints, 2016.
[21] J. Zhang, C.
De Sa, I. Mitliagkas, and C. Ré. Parallel SGD: When does averaging help? ArXiv e-prints, 2016.
[22] H. Zhang, J. Li, K. Kara, D. Alistarh, J. Liu, and C. Zhang. The ZipML framework for training models with end-to-end low precision: The cans, the cannots, and a little bit of deep learning. arXiv preprint arXiv:1611.05402, 2016.
[23] S. Lacoste-Julien, M. Schmidt, and F. Bach. A simpler approach to obtaining an O(1/t) rate for the stochastic projected subgradient method. ArXiv e-prints 1212.2002, 2012.
[24] A. Rakhlin, O. Shamir, and K. Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. ArXiv e-prints, 2011.
[25] A. Dieuleveut, A. Durmus, and F. Bach. Bridging the gap between constant step size stochastic gradient descent and Markov chains. Annals of Statistics, 2018.
[26] S. Gadat and F. Panloup. Optimal non-asymptotic bound of the Ruppert-Polyak averaging without strong convexity. ArXiv e-prints, 2017.
[27] F. Bach and E. Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In Advances in Neural Information Processing Systems (NIPS), 2013.
[28] A. Défossez and F. Bach. Averaged least-mean-squares: bias-variance trade-offs and optimal sampling distributions. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2015.
[29] A. Dieuleveut, N. Flammarion, and F. Bach. Harder, better, faster, stronger convergence rates for least-squares regression. Journal of Machine Learning Research, 2016.
[30] P. Jain, S. M. Kakade, R. Kidambi, P. Netrapalli, and A. Sidford. Accelerating stochastic gradient descent. arXiv preprint arXiv:1704.08227, 2017.
[31] R. McDonald, M. Mohri, N. Silberman, D. Walker, and G. S. Mann. Efficient large-scale distributed training of conditional maximum entropy models.
In Advances in Neural Information Processing Systems, pages 1231-1239, 2009.
[32] R. McDonald, K. Hall, and G. Mann. Distributed training strategies for the structured perceptron. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 456-464, 2010.
[33] Y. Zhang, M. J. Wainwright, and J. C. Duchi. Communication-efficient algorithms for statistical optimization. In Advances in Neural Information Processing Systems, pages 1502-1510, 2012.
[34] A. Rakhlin, O. Shamir, K. Sridharan, et al. Making gradient descent optimal for strongly convex stochastic optimization. In ICML, 2012.
[35] J. D. Rosenblatt and B. Nadler. On the optimality of averaging in distributed statistical learning. Information and Inference: A Journal of the IMA, 5(4):379-404, 2016.
[36] M. Li, T. Zhang, Y. Chen, and A. J. Smola. Efficient mini-batch training for stochastic optimization. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 661-670. ACM, 2014.
[37] N. Shirish Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang. On large-batch training for deep learning: Generalization gap and sharp minima. ArXiv e-prints, 2016.
[38] Y. You, I. Gitman, and B. Ginsburg. Large batch training of convolutional networks. ArXiv e-prints, 2017.
[39] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. ArXiv e-prints, 2017.
[40] S. U. Stich. Local SGD converges fast and communicates little. ICLR 2019, 2019.
[41] T. Lin, S. U. Stich, and M. Jaggi. Don't use large mini-batches, use local SGD. ArXiv e-prints, 2018.
[42] H. Yu, S.
Yang, and S. Zhu. Parallel restarted SGD for non-convex optimization with faster convergence and less communication. ArXiv e-prints, 2018.
[43] D. L. Zhu and P. Marcotte. Co-coercivity and its role in the convergence of iterative schemes for solving variational inequalities. SIAM Journal on Optimization, 6(3):714-726, 1996.
[44] F. Bach. Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression. J. Mach. Learn. Res., 15(1):595-627, 2014.
[45] A. Dieuleveut and F. Bach. Nonparametric stochastic approximation with large step-sizes. Ann. Statist., 44(4):1363-1399, 2016.
[46] J. C. Duchi, S. Chaturapruek, and C. Ré. Asynchronous stochastic convex optimization. ArXiv e-prints, 2015.
[47] D. Needell, R. Ward, and N. Srebro. Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm. In Advances in Neural Information Processing Systems 27, pages 1017-1025. Curran Associates, Inc., 2014.
[48] A. B. Tsybakov. Optimal rates of aggregation. In Proceedings of the Annual Conference on Computational Learning Theory, 2003.
[49] M. Kamp, M. Boley, D. Keren, A. Schuster, and I. Sharfman. Communication-efficient distributed online prediction by dynamic model synchronization. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 623-639. Springer, 2014.