{"title": "The Convergence of Sparsified Gradient Methods", "book": "Advances in Neural Information Processing Systems", "page_first": 5973, "page_last": 5983, "abstract": "Distributed training of massive machine learning models, in particular deep neural networks, via Stochastic Gradient Descent (SGD) is becoming commonplace. Several families of communication-reduction methods, such as quantization, large-batch methods, and gradient sparsification, have been proposed. To date, gradient sparsification methods--where each node sorts gradients by magnitude, and only communicates a subset of the components, accumulating the rest locally--are known to yield some of the largest practical gains. Such methods can reduce the amount of communication per step by up to \\emph{three orders of magnitude}, while preserving model accuracy. Yet, this family of methods currently has no theoretical justification. \n\nThis is the question we address in this paper. We prove that, under analytic assumptions, sparsifying gradients by magnitude with local error correction provides convergence guarantees, for both convex and non-convex smooth objectives, for data-parallel SGD. The main insight is that sparsification methods implicitly maintain bounds on the maximum impact of stale updates, thanks to selection by magnitude. 
Our analysis and empirical validation also reveal that these methods do require analytical conditions to converge well, justifying existing heuristics.", "full_text": "The Convergence of Sparsified Gradient Methods\n\nDan Alistarh*\nIST Austria\ndan.alistarh@ist.ac.at\n\nTorsten Hoefler\nETH Zurich\nhtor@inf.ethz.ch\n\nMikael Johansson\nKTH\nmikaelj@kth.se\n\nSarit Khirirat\nKTH\nsarit@kth.se\n\nNikola Konstantinov\nIST Austria\nnikola.konstantinov@ist.ac.at\n\nCédric Renggli\nETH Zurich\ncedric.renggli@inf.ethz.ch\n\nAbstract\n\nStochastic Gradient Descent (SGD) has become the standard tool for distributed training of massive machine learning models, in particular deep neural networks. Several families of communication-reduction methods, such as quantization, large-batch methods, and gradient sparsification, have been proposed to reduce the overheads of distribution. To date, gradient sparsification methods–where each node sorts gradients by magnitude, and only communicates a subset of the components, accumulating the rest locally–are known to yield some of the largest practical gains. Such methods can reduce the amount of communication per step by up to three orders of magnitude, while preserving model accuracy. Yet, this family of methods currently has no theoretical justification.\nThis is the question we address in this paper. We prove that, under analytic assumptions, sparsifying gradients by magnitude with local error correction provides convergence guarantees, for both convex and non-convex smooth objectives, for data-parallel SGD. The main insight is that sparsification methods implicitly maintain bounds on the maximum impact of stale updates, thanks to selection by magnitude. 
Our analysis also reveals that these methods do require analytical conditions to converge well, justifying and complementing existing heuristics.\n\n1 Introduction\n\nThe proliferation of massive datasets has led to renewed focus on distributed machine learning computation. In this context, tremendous effort has been dedicated to scaling the classic stochastic gradient descent (SGD) algorithm, the tool of choice for training a wide variety of machine learning models. In a nutshell, SGD works as follows. Given a function f : ℝ^n → ℝ to minimize and given access to stochastic gradients G̃ of this function, we apply the iteration\n\nx_{t+1} = x_t − α G̃(x_t),   (1)\n\nwhere x_t is our current set of parameters, and α is the step size.\nThe standard way to scale SGD to multiple nodes is via data-parallelism: given a set of P nodes, we split the dataset into P partitions. Nodes process samples in parallel, but each node maintains a globally consistent copy of the parameter vector x_t. In each iteration, each node computes a new stochastic gradient with respect to this parameter vector, based on its local data. Nodes then aggregate all of these gradients locally, and update their iterate to x_{t+1}. Ideally, this procedure would enable us to process P times more samples per unit of time, equating to linear scalability.\n\n*Authors ordered alphabetically. The full version can be found at https://arxiv.org/abs/1809.10505.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\nHowever, in practice scaling is limited by the fact that nodes have to exchange full gradients upon every iteration. To illustrate, when training a deep neural network such as AlexNet, each iteration takes a few milliseconds, upon which nodes need to communicate gradients in the order of 200 MB each, in an all-to-all fashion. 
This communication step can easily become the system bottleneck [4].\nA tremendous amount of work has been dedicated to addressing this scalability problem, largely\nfocusing on the data-parallel training of neural networks. One can classify proposed solutions into\na) lossless, either based on factorization [31, 7] or on executing SGD with extremely large batches,\ne.g., [11], b) quantization-based, which reduce the precision of the gradients before communication,\ne.g., [22, 8, 4, 29], and c) sparsi\ufb01cation-based, which reduce communication by only selecting an\n\u201cimportant\u201d sparse subset of the gradient components to broadcast at each step, and accumulating the\nrest locally, e.g., [24, 9, 2, 26, 17, 25].\nWhile methods from the \ufb01rst two categories are ef\ufb01cient and provide theoretical guarantees, e.g., [31,\n4], some of the largest bene\ufb01ts in practical settings are provided by sparsi\ufb01cation methods. Recent\nwork [2, 17] shows empirically that the amount of communication per node can be reduced by up to\n600\u21e5 through sparsi\ufb01cation without loss of accuracy in the context of large-scale neural networks.\n(We note however that these methods do require signi\ufb01cant additional hyperparameter optimization.)\nContribution. We prove that, under analytic assumptions, gradient sparsi\ufb01cation methods in fact\nprovide convergence guarantees for SGD. We formally show this claim for both convex and non-\nconvex smooth objectives, and derive non-trivial upper bounds on the convergence rate of these\ntechniques in both settings. From the technical perspective, our analysis highlights connections\nbetween gradient sparsi\ufb01cation methods and asynchronous gradient descent, and suggests that some\nof the heuristics developed to ensure good practical performance for these methods, such as learning\nrate tuning and gradient clipping, might in fact be necessary for convergence.\nSparsi\ufb01cation methods generally work as follows. 
Given standard data-parallel SGD, in each iteration\nt, each node computes a local gradient \u02dcG, based on its current view of the model. The node then\ntruncates this gradient to its top K components, sorted in decreasing order of magnitude, and\naccumulates the error resulting from this truncation locally in a vector \u270f. This error is added to the\ncurrent gradient before truncation. The top K components selected by each node in this iteration are\nthen exchanged among all nodes, and applied to generate the next version of the model.\nSparsi\ufb01cation methods are reminiscent of asynchronous SGD algorithms, e.g., [20, 10, 8], as updates\nare not discarded, but delayed. A critical difference is that sparsi\ufb01cation does not ensure that every\nupdate is eventually applied: a \u201csmall\u201d update may in theory be delayed forever, since it is never\nselected due to its magnitude. Critically, this precludes the direct application of existing techniques\nfor the analysis of asynchronous SGD, as they require bounds on the maximum delay, which may\nnow be in\ufb01nite. At the same time, sparsi\ufb01cation could intuitively make better progress than an\narbitrarily-delayed asynchronous method, since it applies K \u201clarge\u201d updates in every iteration, as\nopposed to an arbitrary subset in the case of asynchronous methods.\nWe resolve these con\ufb02icting intuitions, and show that in fact sparsi\ufb01cation methods converge relatively\nfast. 
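The per-node selection-and-error-accumulation step just described can be sketched in a few lines of NumPy. This is our own minimal illustration (function names, shapes, and the scaling convention are ours, not the paper's implementation):

```python
import numpy as np

def top_k(v, k):
    # Keep the k largest-magnitude components of v, zero out the rest.
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]   # indices of the k largest |v_i|
    out[idx] = v[idx]
    return out

def local_step(grad, error, lr, k):
    # Add the locally accumulated error to the fresh scaled gradient,
    # transmit only the top-k components, and retain the truncated
    # remainder as the new local error.
    acc = lr * grad + error
    update = top_k(acc, k)        # the only part that is communicated
    new_error = acc - update      # truncated mass, kept locally
    return update, new_error
```

By construction the sent update and the retained error sum back to the accumulated gradient, so no gradient mass is ever discarded, only delayed.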
Our analysis yields new insight into this popular communication-reduction method, giving it a\nsolid theoretical foundation, and suggests that prioritizing updates by magnitude might be a useful\ntactic in other forms of delayed SGD as well.\nOur key \ufb01nding is that this algorithm, which we call TopK SGD, behaves similarly to a variant of\nasynchronous SGD with \u201cimplicit\u201d bounds on staleness, maintained seamlessly by the magnitude\nselection process: a gradient update is either salient, in which case it will be applied quickly, or\nis eventually rendered insigni\ufb01cant by the error accumulation process, in which case it need not\nhave been applied in the \ufb01rst place. This intuition holds for both convex and non-convex objectives,\nalthough the technical details are different.\nRelated Work. There has been a recent surge of interest in distributed machine learning, e.g., [1, 33,\n6]; due to space limits, we focus on communication-reduction techniques that are closely related.\nLossless Methods. One way of doing lossless communication-reduction is through factorization [7,\n31], which is effective in deep neural networks with large fully-connected layers, whose gradients\ncan be decomposed as outer vector products. This method is not generally applicable, and in\nparticular may not be ef\ufb01cient in networks with large convolutional layers, e.g., [13, 27]. A second\nlossless method is executing extremely large batches, hiding communication cost behind increased\ncomputation [11, 32]. Although promising, these methods currently require careful per-instance\n\n2\n\n\fparameter tuning, and do not eliminate communication costs. Asynchronous methods, e.g., [20] can\nalso be seen as a way of performing communication-reduction, by overlapping communication and\ncomputation, but are also known to require careful parameter tuning [34].\nQuantization. Seide et al. 
[23] and Strom [25] were among the first to propose quantization to reduce the bandwidth costs of training deep networks. Their techniques employ a variant of error-accumulation. Alistarh et al. [4, 12] introduced a theoretically-justified stochastic quantization technique called Quantized SGD (QSGD), which trades off compression and convergence rate. This technique was significantly refined for the case of two-bit precision by [30]. Recent work [28] studies the problem of selecting a sparse, low-variance unbiased gradient estimator as a linear planning problem. This approach differs from the algorithms we analyze, as it ensures unbiasedness of the estimators in every iteration. By contrast, error accumulation inherently biases the applied updates.\nSparsification. Strom [25], Dryden et al. [9] and Aji and Heafield [2] considered sparsifying the gradient updates by only applying the top K components, taken at every node, in every iteration, for K corresponding to < 1% of the dimension, and accumulating the error. Shokri [24] and Sun et al. [26] independently considered similar algorithms, but for privacy and regularization purposes, respectively. Lin et al. [17] performed an in-depth empirical exploration of this space in the context of training neural networks, showing that extremely high gradient sparsity can be supported by convolutional and recurrent networks, without loss of accuracy, under careful hyperparameter tuning.\nAnalytic Techniques. The first reference to approach the analysis of quantization techniques is Buckwild! [8], in the context of asynchronous training of generalized linear models. 
Our analysis in the case of convex SGD uses similar notions of convergence, and a similar general approach. The distinctions are: 1) the algorithm we analyze is different; 2) we do not assume the existence of a bound τ on the delay with which a component may be applied; 3) we do not make sparsity assumptions on the original stochastic gradients. In the non-convex case, we use a different approach.\n\n2 Preliminaries\n\nBackground and Assumptions. Please recall our modeling of the basic SGD process in Equation (1). Fix n to be the dimension of the problems we consider; unless otherwise stated ‖·‖ will denote the 2-norm. We begin by considering a general setting where SGD is used to minimize a function f : ℝ^n → ℝ, which can be either convex or non-convex, using unbiased stochastic gradient samples G̃(·), i.e., E[G̃(x_t)] = ∇f(x_t).\nWe assume throughout the paper that the second moment of the average of P stochastic gradients with respect to any choice of parameter values is bounded, i.e.:\n\nE[‖(1/P) ∑_{p=1}^P G̃^p(x)‖²] ≤ M²,  ∀x ∈ ℝ^n,   (2)\n\nwhere G̃^1(x), . . . , G̃^P(x) are P independent stochastic gradients (at each node). We also give the following definitions:\nDefinition 1. For any differentiable function f : ℝ^d → ℝ,\n• f is c-strongly convex if ∀x, y ∈ ℝ^d, it satisfies f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (c/2)‖x − y‖².\n• f is L-Lipschitz smooth (or L-smooth for short) if ∀x, y ∈ ℝ^d, ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖.\nWe consider both c-strongly convex and L-Lipschitz smooth (non-convex) objectives. Let x* be the optimum parameter set minimizing Equation (1). For ε > 0, the “success region” to which we want to converge is the set of parameters S = {x : ‖x − x*‖² ≤ ε}.\nRate Supermartingales. In the convex case, we phrase convergence of SGD in terms of rate supermartingales; we will follow the presentation of De Sa et al. [8] for background. 
A supermartingale is a stochastic process W_t with the property that E[W_{t+1} | W_t] ≤ W_t. A martingale-based proof of convergence will construct a supermartingale W_t(x_t, x_{t−1}, . . . , x_0) that is a function of time and the current and previous iterates; it intuitively represents how far the algorithm is from convergence.\nDefinition 2. Given a stochastic algorithm such as the iteration in Equation (1), a non-negative process W_t : ℝ^{n×t} → ℝ is a rate supermartingale with horizon B if the following conditions are true. First, it must be a supermartingale: for any sequence x_t, . . . , x_0 and any t ≤ B,\n\nE[W_{t+1}(x_t − α G̃_t(x_t), x_t, . . . , x_0)] ≤ W_t(x_t, x_{t−1}, . . . , x_0).   (3)\n\nSecond, for all times T ≤ B and for any sequence x_T , . . . , x_0, if the algorithm has not succeeded in entering the success region S by time T, it must hold that\n\nW_T(x_T, x_{T−1}, . . . , x_0) ≥ T.   (4)\n\nConvergence. Assuming the existence of a rate supermartingale, one can bound the convergence rate of the corresponding stochastic process.\nStatement 1. Assume that we run a stochastic algorithm, for which W is a rate supermartingale. For T ≤ B, the probability that the algorithm does not complete by time T is\n\nPr(F_T) ≤ E[W_0(x_0)] / T.\n\nThe proof of this general fact is given by De Sa et al. [8], among others.\n\nAlgorithm 1 Parallel TopK SGD at a node p.\n\nInput: stochastic gradient oracle G̃^p(·) at node p\nInput: value K, learning rate α\nInitialize v_0 = ε^p_0 = 0\nfor each step t ≥ 1 do\n  acc^p_t ← ε^p_{t−1} + α G̃^p_t(v_{t−1})  {accumulate error into a locally generated gradient}\n  ε^p_t ← acc^p_t − TopK(acc^p_t)  {update the error}\n  Broadcast(TopK(acc^p_t), SUM)  {broadcast to all nodes and receive from all nodes}\n  g_t ← (1/P) ∑_{q=1}^P TopK(acc^q_t)  {average the received (sparse) gradients}\n  v_t ← v_{t−1} − g_t  {apply the update}\nend for
A rate supermartingale for sequential SGD is:\nStatement 2 ([8]). There exists a W_t where, if the algorithm has not succeeded by timestep t,\n\nW_t(x_t, . . . , x_0) = ε / (2αcε − α²M̃²) · log(e ‖x_t − x*‖² ε^{−1}) + t,\n\nwhere M̃ is a bound on the second moment of the stochastic gradients for the sequential SGD process. Further, W_t is a rate supermartingale for sequential SGD with horizon B = ∞. It is also H-Lipschitz in the first coordinate, with H = 2√ε (2αcε − α²M̃²)^{−1}; that is, for any t, u, v and any sequence x_{t−1}, . . . , x_0: ‖W_t(u, x_{t−1}, . . . , x_0) − W_t(v, x_{t−1}, . . . , x_0)‖ ≤ H‖u − v‖.\n\n3 The TopK SGD Algorithm\n\nAlgorithm Description. In the following, we will consider a variant of distributed SGD where, in each iteration t, each node computes a local gradient based on its current view of the model, which we denote by v_t, and which is consistent across nodes (see Algorithm 1 for pseudocode). The node adds its local error vector from the previous iteration (defined below) into the gradient, and then truncates this sum to its top K components, sorted in decreasing order of (absolute) magnitude. Each node accumulates the components which were not selected locally into the error vector ε_t, which is added to the current gradient before the truncation procedure. The selected top K components are then broadcast to all other nodes. (We assume that broadcast happens point-to-point, but in practice it could be intermediated by a parameter server, or via a more complex reduction procedure.) Each node collects all messages from its peers, and applies their average to the local model. This update is the same across all nodes, and therefore v_t is consistent across nodes at every iteration.\nVariants of this pattern are implemented in [2, 9, 17, 25, 26]. 
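To make this data-parallel pattern concrete, the following simulation (our own sketch; the quadratic objective, dimensions, noise level, and seed are illustrative assumptions) runs the per-node top-K/error-feedback loop and checks the invariant from Section 3 that the shared view v differs from the untruncated auxiliary iterate x exactly by the average accumulated error:

```python
import numpy as np

rng = np.random.default_rng(0)
n, P, K, lr, T = 16, 4, 3, 0.1, 300
x_star = rng.normal(size=n)          # optimum of f(x) = 0.5 * ||x - x_star||^2

def top_k(v, k):
    # Keep the k largest-magnitude components, zero the rest.
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

v = np.zeros(n)          # shared model view: only truncated updates applied
x = np.zeros(n)          # auxiliary iterate: all generated gradients applied
err = np.zeros((P, n))   # per-node accumulated errors

for t in range(T):
    # Each node draws a noisy gradient of f at the shared view v.
    grads = (v - x_star) + 0.01 * rng.normal(size=(P, n))
    acc = lr * grads + err                        # error feedback
    sent = np.stack([top_k(a, K) for a in acc])   # sparse messages
    err = acc - sent                              # retain the remainder locally
    v = v - sent.mean(axis=0)                     # same averaged update at every node
    x = x - lr * grads.mean(axis=0)               # untruncated counterpart

# The two iterates differ exactly by the mean error, and the shared
# view still reaches a neighborhood of the optimum.
print(np.allclose(v - x, err.mean(axis=0)), float(np.linalg.norm(v - x_star)))
```

The invariant holds by a telescoping argument (each step, the mass a node withholds is exactly what it adds to its error), independently of the objective chosen here.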
When training networks, this pattern is used in conjunction with heuristics such as momentum tuning and gradient clipping [17].\nAnalysis Preliminaries. Define G̃_t(v_t) = (1/P) ∑_{p=1}^P G̃^p_t(v_t). In the following, it will be useful to track the following auxiliary random variable at each global step t:\n\nx_{t+1} = x_t − (1/P) ∑_{p=1}^P α G̃^p_t(v_t) = x_t − α G̃_t(v_t),   (5)\n\nwhere x_0 = 0^n. Intuitively, x_t tracks all the gradients generated so far, without truncation. One of our first objectives will be to bound the difference between x_t and v_t at each time step t. Define:\n\nε_t = (1/P) ∑_{p=1}^P ε^p_t.   (6)\n\nThe variable x_t is set up such that, by induction on t, one can prove that, for any time t ≥ 0,\n\nv_t − x_t = ε_t.   (7)\n\nConvergence. A reasonable question is whether we wish to show convergence with respect to the auxiliary variable x_t, which aggregates gradients, or with respect to the variable v_t, which measures convergence in the view which only accumulates truncated gradients. Our analysis will in fact show that the TopK algorithm converges in both these measures, albeit at slightly different rates. So, in particular, nodes will be able to observe convergence by directly observing the “shared” parameter v_t.\n\n3.1 An Analytic Assumption\n\nThe update to the parameter v_{t+1} at each step is\n\n(1/P) ∑_{p=1}^P TopK(α G̃^p_t(v_t) + ε^p_t).\n\nThe intention is to apply the top K components of the sum of updates across all nodes, that is,\n\nTopK((1/P) ∑_{p=1}^P (α G̃^p_t(v_t) + ε^p_t)).\n\nHowever, it may well happen that these two terms are different: one could have a fixed component j of α G̃^p_t + ε^p_t with large absolute values, but opposite signs, at two distinct nodes, and value 0 at all other nodes. 
This component would be selected at these two nodes (since it has high absolute value locally), whereas it would not be part of the top K taken over the total sum, since its contribution to the sum would be close to 0. Obviously, if this were to happen on all components, the algorithm would make very little progress in such a step.\nIn the following, we will assume that such overlaps can only cause the algorithm to lose a small amount of information at each step, with respect to the norm of the “true” gradient G̃_t. Specifically:\nAssumption 1. There exists a (small) constant ξ such that, for every iteration t ≥ 0, we have:\n\n‖TopK((1/P) ∑_{p=1}^P (α G̃^p_t(v_t) + ε^p_t)) − (1/P) ∑_{p=1}^P TopK(α G̃^p_t(v_t) + ε^p_t)‖ ≤ ξ ‖α G̃_t(v_t)‖.   (8)\n\nDiscussion. We validate Assumption 1 experimentally on a number of different learning tasks in Section 6 (see also Figure 1). In addition, we emphasize the following points:\n• As per our later analysis, in both the convex and non-convex cases, the influence of ξ on convergence is dampened linearly by the number of nodes P. Unless ξ grows linearly with P, which appears unlikely, its value will become irrelevant as parallelism is increased.\n• Assumption 1 is necessary for a general, worst-case analysis. Its role is to bound the gap between the top-K of the gradient sum (which would be applied at each step in a “sequential” version of the process), and the sum of top-Ks (which is applied in the distributed version). If the number of nodes P is 1, the assumption trivially holds.\nTo illustrate necessity, consider a dummy instance with two nodes, dimension 2, and K = 1. Assume that at a step node 1 has gradient vector (1001, 500), and node 2 has gradient vector (−1001, 500). Selecting the top-1 (max abs) of the sum of the two gradients would result in the gradient (0, 1000). 
Applying the sum of top-1’s taken locally results in the gradient (0, 0), since we select (1001, 0) and (−1001, 0), respectively. This is clearly not desirable, but in theory possible. The assumption states that this worst-case scenario is unlikely, by bounding the norm difference between the two terms.\n• The intuitive cause for the example above is the high variability of the local gradients at the nodes. One can therefore view Assumption 1 as a bound on the variance of the local gradients (at the nodes) with respect to the global variance (aggregated over all nodes). We further expand on this observation in Section 6.\n\n4 Analysis in the Convex Case\n\nWe now focus on the convergence of Algorithm 1 with respect to the parameter v_t. We assume that the function f is c-strongly convex and that the bound (2) holds. Due to space constraints, the complete proofs are deferred to the full version of our paper [3].\nTechnical Preliminaries. We begin by noting that for any vector x ∈ ℝ^n, it holds that\n\n‖x − TopK(x)‖₁ ≤ ((n − K)/n) ‖x‖₁, and ‖x − TopK(x)‖₂ ≤ √((n − K)/n) ‖x‖₂.\n\nThus, if γ = √((n − K)/n), we have that ‖x − TopK(x)‖ ≤ γ‖x‖. In practice, the last inequality may be satisfied by a much smaller value of γ, since the gradient values are very unlikely to be uniform.\nWe now bound the difference between v_t and x_t using Assumption 1. We have the following:\nLemma 1. With the processes x_t and v_t defined as above:\n\n‖v_t − x_t‖ = ‖(1/P) ∑_{p=1}^P (α G̃^p_{t−1}(v_{t−1}) + ε^p_{t−1}) − (1/P) ∑_{p=1}^P TopK(α G̃^p_{t−1}(v_{t−1}) + ε^p_{t−1})‖ ≤ (γ + ξ/P) ∑_{k=1}^t γ^{k−1} ‖x_{t−k+1} − x_{t−k}‖.   (9)\n\nWe now use the previous result to bound a quantity that represents the difference between the updates based on the TopK procedure and those based on full gradients.\nLemma 2. Under the assumptions above, taking expectation with respect to gradients at time t:\n\nE[‖(1/P) ∑_{p=1}^P TopK(α G̃^p_t(v_t) + ε^p_t) − (1/P) ∑_{p=1}^P α G̃^p_t(v_t)‖] ≤ (γ + 1)(γ + ξ/P) ∑_{k=1}^t γ^{k−1} ‖x_{t−k+1} − x_{t−k}‖ + (γ + ξ/P) α M.   (10)\n\nThe Convergence Bound. 
Before stating our main result, we introduce some notation. Using ∑_{k=1}^∞ γ^{k−1} = 1/(1 − γ), set constants\n\nC = (γ + 1)(γ + ξ/P) · 1/(1 − γ), and C0 = C + (γ + ξ/P) = (γ + ξ/P) · 2/(1 − γ).\n\nOur main result in this section is the following:\nTheorem 1. Assume that W is a rate supermartingale with horizon B for the sequential SGD algorithm and that W is H-Lipschitz in the first coordinate. Assume further that αHMC0 < 1. Then for any T ≤ B, the probability that v_s ∉ S for all s ≤ T is:\n\nPr[F_T] ≤ E[W_0(v_0)] / ((1 − αHMC0) T).   (11)\n\nThe proof proceeds by defining a carefully-designed random process with respect to the iterate v_t, and proving that it is a rate supermartingale assuming the existence of W. We now apply this result with the martingale W_t for the sequential SGD process that uses the average of P stochastic gradients as an update (so that M̃ = M in Statement 2). We obtain:\nCorollary 1. Assume that we run Algorithm 1 for minimizing a convex function f satisfying the listed assumptions. Suppose that the learning rate is set to α, with:\n\nα < min{ 2cε/M², 2(cε − √ε M C0)/M² }.\n\nThen for any T > 0 the probability that v_i ∉ S for all i ≤ T is:\n\nPr(F_T) ≤ ε log(e ‖v_0 − x*‖²/ε) / ((2αcε − α²M² − 2α√ε M C0) T).\n\nNote that the learning rate is chosen so that the denominator on the right-hand side is positive. This is discussed in further detail in Section 6. Compared to the sequential case (Statement 2), the convergence rate for the TopK algorithm features a slowdown of 2α√ε M C0. 
Assuming that P is constant with respect to n/K,\n\nC0 = (√((n − K)/n) + ξ/P) · 2/(1 − √((n − K)/n)) = 2(n/K)(√((n − K)/n) + ξ/P)(1 + √((n − K)/n)) = O(n/K).   (12)\n\nHence, the slowdown is linear in n/K and ξ/P. In particular, the effect of ξ is dampened by the number of nodes.\n\n5 Analysis for the Non-Convex Case\n\nWe now consider the more general case when SGD is minimizing a (not necessarily convex) function f, using SGD with (decreasing) step sizes α_t. Again, we assume that the bound (2) holds. We also assume that f is L-Lipschitz smooth.\nAs is standard in non-convex settings [18], we settle for a weaker notion of convergence, namely:\n\nmin_{t∈{1,...,T}} E[‖∇f(v_t)‖²] → 0 as T → ∞,\n\nthat is, the algorithm converges ergodically to a point where gradients are 0. Our strategy will be to leverage the bound on the difference between the “real” model x_t and the view v_t observed at iteration t to bound the expected value of f(v_t), which in turn will allow us to bound\n\n(1/∑_{t=1}^T α_t) ∑_{t=1}^T α_t E[‖∇f(v_t)‖²],\n\nwhere the parameters α_t are appropriately chosen decreasing learning rate parameters. We start from:\nLemma 3. For any time t ≥ 1: ‖v_t − x_t‖² ≤ (1 + ξ/P)² ∑_{k=1}^t 2^k γ^{2k} ‖x_{t−k+1} − x_{t−k}‖².\nWe will leverage this bound on the gap to prove the following general bound:\nTheorem 2. Consider the TopK algorithm for minimizing a function f that satisfies the assumptions in this section. Suppose that the learning rate sequence and K are chosen so that for any time t > 0:\n\n∑_{k=1}^t 2^k γ^{2k} α²_{t−k} / α_t ≤ D   (13)\n\nfor some constant D > 0. Then, after running Algorithm 1 for T steps:\n\n(1/∑_{t=1}^T α_t) ∑_{t=1}^T α_t E[‖∇f(v_t)‖²] ≤ 4(f(x_0) − f(x*)) / ∑_{t=1}^T α_t + (2LM² + 4L²M²(1 + ξ/P)² D) ∑_{t=1}^T α²_t / ∑_{t=1}^T α_t.   (14)\n\nNotice again that the effect of ξ in the bound is dampened by P. One can show that inequality (13) holds whenever K = cn for some constant c > 1/2 and the step sizes are chosen so that α_t = t^{−θ} for a constant θ > 0. When K = cn with c > 1/2, a constant learning rate depending on the number of iterations T can also be used to ensure ergodic convergence. We refer the reader to the full version of our paper for a complete derivation [3].\n\n(a) Empirical ξ logistic/RCV1. (b) Empirical ξ synthetic. (c) Empirical ξ ResNet110.\n\nFigure 1: Validating Assumption 1 on various models and datasets.\n\n6 Discussion and Experimental Validation\n\nThe Analytic Assumption. We start by empirically validating Assumption 1 in Figure 1 on two regression tasks (a synthetic linear regression task of dimension 1,024, and logistic regression for text categorization on RCV1 [15]), as well as ResNet110 [13] on CIFAR-10 [14]. Exact descriptions of the experimental setup are given in the full version of the paper [5]. Specifically, we sample gradients at different epochs during the training process, and bound the constant ξ by comparing the left and right-hand sides of Equation (8). The assumption appears to hold with relatively low, stable values of the constant ξ. We note that RCV1 is relatively sparse (average density ≈ 10%), while gradients in the other two settings are fully dense.\nAdditionally, we present an intuitive justification why Assumption 1 can be seen as a bound on the variance of the local gradients with respect to the global variance. 
Through a series of elementary operations, one can obtain:\n\n‖ε_t‖ ≤ (1/P) ∑_{p=1}^P ‖ε^p_t‖ ≤ (1/P) ∑_{p=1}^P ∑_{k=1}^t γ^k ‖α G̃^p_{t−k+1}‖,   (15)\n\nwhich in turn implies that:\n\n‖TopK((1/P) ∑_{p=1}^P (α G̃^p_t(v_t) + ε^p_t)) − (1/P) ∑_{p=1}^P TopK(α G̃^p_t(v_t) + ε^p_t)‖ ≤ γα‖G̃_t‖ + γ‖ε_t‖ + (γ/P) ∑_{p=1}^P (‖α G̃^p_t(v_t)‖ + ‖ε^p_t‖).   (16)\n\nThe left-hand side of (16) is the quantity we wanted to control via Assumption 1. The first term on the right-hand side is proportional to the global (averaged) gradient at time t, while the remaining terms are all bounded by a dampened sum of local gradients, as per equation (15). Therefore, assuming a bound on the variance of the local gradients with respect to the global variance is equivalent to saying that the left-hand side of (16) is bounded by the norm of the global gradient, at least in expectation. This is exactly the intention behind Assumption 1.\nNote also that equation (15) provides a bound on the norm of the error term at time t, which is similar to the one in Lemma 1, but expressed in terms of the norms of the local gradients. One can build on this argument and our techniques in Section 4 to show convergence of the TopK algorithm directly. However, such analysis will rely on a bound on the variance of the local gradients (as opposed to the bound in equation (2)), which is a strong assumption that ignores the effect of averaging over the P nodes. In contrast, Assumption 1 allows for a more elegant analysis that provides better convergence rates, which are due to the averaging of the local gradients at every step of the TopK algorithm. We refer to the full version of our paper for further details.\nLearning Rate and Variance. 
In the convex case, the choice of learning rate must ensure both 2αcε − α²M² > 0 and αHMC0 < 1, implying\n\nα < min{ 2cε/M², 2(cε − √ε M C0)/M² }.   (17)\n\nNote that this requires the second term to be positive, that is ε > (M C0 / c)². Hence, if we aim for convergence within a small region around the optimum, we may need to ensure that gradient variance is bounded, either by minibatching or, empirically, by gradient clipping [17].\nThe Impact of the Parameter K and Gradient “Shape.” In the convex case, the dependence of the convergence with respect to K and n is encapsulated by the parameter C0 = O(n/K), assuming P is constant. Throughout the analysis, we only used worst-case bounds on the norm gap between the gradient and its top K components. These bounds are tight in the (unlikely) case where the gradient values are uniformly distributed; however, there is empirical evidence showing that this is not the case in practice [19], suggesting that this gap should be smaller. The algorithm may implicitly exploit this narrower gap for improved convergence. Please see Figure 2 for empirical validation of this claim, confirming that the gradient norm is concentrated towards the top elements.\n\n(a) TopK norm RCV1. (b) TopK norm synthetic. (c) TopK norm ResNet110.\n\nFigure 2: Examining the value of ‖G̃ − TopK(G̃)‖/‖G̃‖ versus K on various datasets/tasks. Every line represents a randomly chosen gradient per epoch during training with standard hyperparameters.\n\nIn the non-convex case, the condition K = cn with c > 1/2 is quite restrictive. 
Again, the condition is required since we are assuming the worst-case configuration (uniform values) for the gradients, in which case the bound in Lemma 4 is tight. However, we argue that in practice gradients are unlikely to be uniformly distributed; in fact, empirical studies [19] have observed that gradient components are usually normally distributed, which should enable us to improve this lower bound on c.

Comparison with SGD Variants. In the convex case, we note that, when K is a constant fraction of n, the convergence of the TopK algorithm is essentially dictated by the Lipschitz constant of the supermartingale W and by the second-moment bound M, and will be similar to that of sequential SGD. Please see Figure 3 for an empirical validation of this fact.

(a) RCV1 convergence. (b) Linear regression. (c) ResNet110 on CIFAR10.

Figure 3: Examining convergence versus the value of K on various datasets and tasks.

Compared to asynchronous SGD, the convergence rate of the TopK algorithm is basically that of an asynchronous algorithm with maximum delay $\tau = O(\sqrt{n}/K)$. That is because an asynchronous algorithm with dense updates and maximum delay $\tau$ has a convergence slowdown of $\Theta(\tau \sqrt{n})$ [8, 16, 3]. We note that, for high sparsity (0.1%–1%), there is a noticeable convergence slowdown, as predicted. The worst-case convergence of TopK is similar to that of SGD with stochastic quantization, e.g., [4, 28]: for instance, for $K = \sqrt{n}$, the worst-case convergence slowdown is $O(\sqrt{n})$, the same as QSGD [4]. The TopK procedure is arguably simpler to implement than the parametrized quantization and encoding techniques required to make stochastic quantization behave well [4].
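The asynchronous-SGD analogy above reduces to one line of arithmetic; the sketch below simply combines the two stated bounds, ignoring the constants hidden by the O- and Θ-notation:

```python
import math

def worst_case_slowdown(n, K):
    """TopK behaves like asynchronous SGD with maximum delay tau = sqrt(n)/K,
    and dense asynchronous SGD with delay tau slows down by Theta(tau * sqrt(n))."""
    tau = math.sqrt(n) / K
    return tau * math.sqrt(n)  # = n / K

# For K = sqrt(n), the slowdown is sqrt(n), matching the QSGD comparison.
print(worst_case_slowdown(1_000_000, 1_000))  # -> 1000.0
```

In particular, at $K = \sqrt{n}$ the predicted slowdown matches QSGD's $O(\sqrt{n})$, while at very high sparsity (small K) the n/K slowdown becomes noticeable, as observed empirically.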
Here, TopK had a superior convergence rate compared to stochastic quantization/sparsification [4, 28] given the same communication budget per node.

7 Conclusions

We provided the first theoretical analysis of the “TopK” sparsification communication-reduction technique. Our approach should extend to methods combining sparsification with quantization by reduced precision [2, 25] and to methods using approximate quantiles [2, 17]. We provide a theoretical foundation for empirical results shown with large-scale experiments on recurrent neural networks for production-scale speech, neural machine translation, as well as image classification tasks [9, 17, 25, 2].

Acknowledgement

This project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie Grant Agreement No. 665385.

References

[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.

[2] Alham Fikri Aji and Kenneth Heafield. Sparse communication for distributed gradient descent. arXiv preprint arXiv:1704.05021, 2017.

[3] Dan Alistarh, Christopher De Sa, and Nikola Konstantinov. The convergence of stochastic gradient descent in asynchronous shared memory.
arXiv preprint arXiv:1803.08841, 2018.

[4] Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. QSGD: Randomized quantization for communication-efficient stochastic gradient descent. In Proceedings of NIPS 2017, 2017.

[5] Dan Alistarh, Torsten Hoefler, Mikael Johansson, Sarit Khirirat, Nikola Konstantinov, and Cédric Renggli. The convergence of sparsified gradient methods. arXiv preprint arXiv:1809.10505, 2018.

[6] Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274, 2015.

[7] Trishul M Chilimbi, Yutaka Suzue, Johnson Apacible, and Karthik Kalyanaraman. Project Adam: Building an efficient and scalable deep learning training system. In OSDI, volume 14, pages 571–582, 2014.

[8] Christopher De Sa, Ce Zhang, Kunle Olukotun, and Christopher Ré. Taming the wild: A unified analysis of Hogwild!-style algorithms. In NIPS, 2015.

[9] Nikoli Dryden, Sam Ade Jacobs, Tim Moon, and Brian Van Essen. Communication quantization for data-parallel training of deep neural networks. In Proceedings of the Workshop on Machine Learning in High Performance Computing Environments, pages 1–8. IEEE Press, 2016.

[10] John C Duchi, Sorathan Chaturapruek, and Christopher Ré. Asynchronous stochastic convex optimization. arXiv preprint arXiv:1508.00882, 2015.

[11] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.

[12] Demjan Grubic, Leo Tam, Dan Alistarh, and Ce Zhang. Synchronous multi-GPU training for deep learning with low-precision communications: An empirical study.
In EDBT, pages 145–156, 2018.

[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[14] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.

[15] David D Lewis, Yiming Yang, Tony G Rose, and Fan Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5(Apr):361–397, 2004.

[16] Xiangru Lian, Yijun Huang, Yuncheng Li, and Ji Liu. Asynchronous parallel stochastic gradient for nonconvex optimization. In Advances in Neural Information Processing Systems, pages 2737–2745, 2015.

[17] Yujun Lin, Song Han, Huizi Mao, Yu Wang, and William J Dally. Deep gradient compression: Reducing the communication bandwidth for distributed training. arXiv preprint arXiv:1712.01887, 2017.

[18] Ji Liu and Stephen J Wright. Asynchronous stochastic coordinate descent: Parallelism and convergence properties. SIAM Journal on Optimization, 25(1):351–376, 2015.

[19] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, 2016.

[20] Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 693–701, 2011.

[21] Cédric Renggli, Dan Alistarh, and Torsten Hoefler. SparCML: High-performance sparse communication for machine learning. arXiv preprint arXiv:1802.08021, 2018.

[22] F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu. 1-bit stochastic gradient descent and application to data-parallel distributed training of speech DNNs.
Interspeech, 2014.

[23] Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Fifteenth Annual Conference of the International Speech Communication Association, 2014.

[24] Reza Shokri and Vitaly Shmatikov. Privacy-preserving deep learning. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pages 1310–1321. ACM, 2015.

[25] Nikko Strom. Scalable distributed DNN training using commodity GPU cloud computing. In Sixteenth Annual Conference of the International Speech Communication Association, 2015.

[26] Xu Sun, Xuancheng Ren, Shuming Ma, and Houfeng Wang. meProp: Sparsified back propagation for accelerated deep learning with reduced overfitting. arXiv preprint arXiv:1706.06197, 2017.

[27] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI, pages 4278–4284, 2017.

[28] Jianqiao Wangni, Jialei Wang, Ji Liu, and Tong Zhang. Gradient sparsification for communication-efficient distributed optimization. arXiv preprint arXiv:1710.09854, 2017.

[29] Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. TernGrad: Ternary gradients to reduce communication in distributed deep learning. In Advances in Neural Information Processing Systems, pages 1508–1518, 2017.

[31] Eric P Xing, Qirong Ho, Wei Dai, Jin Kyu Kim, Jinliang Wei, Seunghak Lee, Xun Zheng, Pengtao Xie, Abhimanu Kumar, and Yaoliang Yu. Petuum: A new platform for distributed machine learning on big data.
IEEE Transactions on Big Data, 1(2):49–67, 2015.

[32] Yang You, Igor Gitman, and Boris Ginsburg. Scaling SGD batch size to 32K for ImageNet training. arXiv preprint arXiv:1708.03888, 2017.

[33] Dong Yu, Adam Eversole, Mike Seltzer, Kaisheng Yao, Zhiheng Huang, Brian Guenter, Oleksii Kuchaiev, Yu Zhang, Frank Seide, Huaming Wang, et al. An introduction to computational networks and the Computational Network Toolkit. Microsoft Technical Report MSR-TR-2014-112, 2014.

[34] Jian Zhang, Ioannis Mitliagkas, and Christopher Ré. YellowFin and the art of momentum tuning. arXiv preprint arXiv:1706.03471, 2017.