{"title": "Qsparse-local-SGD: Distributed SGD with Quantization, Sparsification and Local Computations", "book": "Advances in Neural Information Processing Systems", "page_first": 14695, "page_last": 14706, "abstract": "Communication bottleneck has been identified as a significant issue in distributed optimization of large-scale learning models. Recently, several approaches to mitigate this problem have been proposed, including different forms of gradient compression or computing local models and mixing them iteratively. In this paper we propose Qsparse-local-SGD algorithm, which combines aggressive sparsification with quantization and local computation along with error compensation, by keeping track of the difference between the true and compressed gradients. We propose both synchronous and asynchronous implementations of Qsparse-local-SGD. We analyze convergence for Qsparse-local-SGD in the distributed case, for smooth non-convex and convex objective functions. We demonstrate that Qsparse-local-SGD converges at the same rate as vanilla distributed SGD for many important classes of sparsifiers and quantizers. We use Qsparse-local-SGD to train ResNet-50 on ImageNet, and show that it results in significant savings over the state-of-the-art, in the number of bits transmitted to reach target accuracy.", "full_text": "Qsparse-local-SGD: Distributed SGD with\n\nQuantization, Sparsi\ufb01cation, and Local Computations\n\nDebraj Basu \u21e4\nAdobe Inc.\n\ndbasu@adobe.com\n\nCan Karakus \u21e4\nAmazon Inc.\n\ncakarak@amazon.com\n\nDeepesh Data\n\nUCLA\n\ndeepeshdata@ucla.edu\n\nSuhas Diggavi\n\nUCLA\n\nsuhasdiggavi@ucla.edu\n\nAbstract\n\nCommunication bottleneck has been identi\ufb01ed as a signi\ufb01cant issue in distributed\noptimization of large-scale learning models. Recently, several approaches to\nmitigate this problem have been proposed, including different forms of gradient\ncompression or computing local models and mixing them iteratively. 
In this paper we propose the Qsparse-local-SGD algorithm, which combines aggressive sparsification with quantization and local computation along with error compensation, by keeping track of the difference between the true and compressed gradients. We propose both synchronous and asynchronous implementations of Qsparse-local-SGD. We analyze convergence for Qsparse-local-SGD in the distributed case, for smooth non-convex and convex objective functions. We demonstrate that Qsparse-local-SGD converges at the same rate as vanilla distributed SGD for many important classes of sparsifiers and quantizers. We use Qsparse-local-SGD to train ResNet-50 on ImageNet, and show that it results in significant savings over the state-of-the-art, in the number of bits transmitted to reach target accuracy.\n\n1 Introduction\n\nStochastic Gradient Descent (SGD) [14] and its many variants have become the workhorse for modern large-scale optimization as applied to machine learning [5, 8]. We consider the setup where SGD is applied to the distributed setting, where R different nodes compute local SGD on their own datasets Dr. Co-ordination between them is done by aggregating these local computations to update the overall parameter x_t as x_{t+1} = x_t − η_t (1/R) Σ_{r=1}^R g_t^(r), where {g_t^(r)}_{r=1}^R are the local stochastic gradients at the R machines for a local loss function f^(r)(x) of the parameters, where f^(r) : R^d → R. It is well understood by now that sending full-precision gradients causes communication to be the bottleneck for many large-scale models [4, 7, 33, 39]. 
The communication bottleneck could be\nsigni\ufb01cant in emerging edge computation architectures suggested by federated learning [1, 17, 22].\nTo address this, many methods have been proposed recently, and these methods are broadly based\non three major approaches: (i) Quantization of gradients, where nodes locally quantize the gradient\n(perhaps with randomization) to a small number of bits [3,7,33,39,40]. (ii) Sparsi\ufb01cation of gradients,\ne.g., where nodes locally select Topk values of the gradient in absolute value and transmit these at\nfull precision [2, 4, 20, 30, 32, 40], while maintaining errors in local nodes for later compensation.\n(iii) Skipping communication rounds whereby nodes average their models after locally updating their\nmodels for several steps [9, 10, 31, 34, 37, 43, 45].\n\n\u21e4Work done while Debraj Basu and Can Karakus were at UCLA.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fIn this paper we propose Qsparse-local-SGD algorithm, which combines aggressive sparsi\ufb01cation\nwith quantization and local computation along with error compensation, by keeping track of the\ndifference between the true and compressed gradients. We propose both synchronous and asyn-\nchronous2 implementations of Qsparse-local-SGD. We analyze convergence for Qsparse-local-SGD\nin the distributed case, for smooth non-convex and convex objective functions. We demonstrate\nthat, Qsparse-local-SGD converges at the same rate as vanilla distributed SGD for many important\nclasses of sparsi\ufb01ers and quantizers. We implement Qsparse-local-SGD for ResNet-50 using the\nImageNet dataset, and show that we achieve target accuracies with a small penalty in \ufb01nal accuracy\n(approximately 1 %), with about a factor of 15-20 savings over the state-of-the-art [4, 30, 31], in the\ntotal number of bits transmitted. 
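The error-compensation mechanism used throughout the paper — keep the difference between the true and compressed update in a local memory, and add it back before the next compression — can be sketched in a few lines (an illustrative sketch with our own function names, not the paper's implementation):

```python
import numpy as np

def top_k(v, k):
    """Keep the k largest-magnitude entries of v; zero out the rest."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

def compress_with_error_feedback(update, memory, k):
    """Compress (memory + update); remember what the compressor dropped."""
    corrected = memory + update      # add back previously dropped mass
    sent = top_k(corrected, k)       # sparse update actually transmitted
    new_memory = corrected - sent    # true-minus-compressed difference
    return sent, new_memory

mem = np.zeros(5)
g = np.array([0.1, -2.0, 0.3, 0.05, 1.5])
sent, mem = compress_with_error_feedback(g, mem, k=2)  # transmits -2.0 and 1.5
```

Because the dropped mass stays in the memory, every coordinate is eventually transmitted once its accumulated value grows large enough, which is the property the error-compensated schemes above rely on.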
While the downlink communication is not our focus in this paper (as is the case in [4, 20, 39], for example), it can be inexpensive when the broadcast routine is implemented in a tree-structured manner as in many MPI implementations, or if the parameter server aggregates the sparse quantized updates and broadcasts the result.\n\nRelated work. The use of quantization for communication-efficient gradient methods has a rich, decades-long history [11], and its recent use in training deep neural networks [27, 32] has re-ignited interest. Theoretically justified gradient compression using unbiased stochastic quantizers has been proposed and analyzed in [3, 33, 39]. Though the methods in [36, 38] use induced sparsity in the quantized gradients, explicitly sparsifying the gradients more aggressively by retaining the Topk components, e.g., k < 1%, has been proposed [2, 4, 20, 30, 32], combined with error compensation to ensure that all co-ordinates do get eventually updated as needed. [40] analyzed error compensation for QSGD, without Topk sparsification and with a focus on quadratic functions. Another approach for mitigating the communication bottleneck is to communicate infrequently, which has been popularly referred to in the literature as iterative parameter mixing and model averaging; see [31, 43] and references therein. Our work is most closely related to and builds on the recent theoretical results in [4, 30, 31, 43]. [30] considered the analysis for the centralized Topk (among other sparsifiers), and [4] analyzed a distributed version under the assumption that the aggregated Topk gradients are close to the centralized Topk case; see Assumption 1 in [4]. [31, 43] studied local-SGD, where several local iterations are done before sending the full gradients, and did not do any gradient compression beyond local iterations. Our work generalizes these works in several ways. 
We prove convergence for the distributed sparsification and error compensation algorithm, without the assumption of [4], by using the perturbed iterate methods [21, 30]. We analyze non-convex (smooth) objectives as well as strongly convex objectives for the distributed case with local computations. [30] gave a proof only for convex objective functions and for the centralized case, and therefore without local computations.^3 Our techniques compose a (stochastic or deterministic 1-bit sign) quantizer with sparsification and local computations using error compensation; in fact, this technique works for any compression operator satisfying a regularity condition (see Definition 3).\n\nContributions. We study a distributed set of R worker nodes, each of which performs computations on locally stored data denoted by Dr. Consider the empirical-risk minimization of the loss function f(x) = (1/R) Σ_{r=1}^R f^(r)(x), where f^(r)(x) = E_{i∼Dr}[f_i(x)], and E_{i∼Dr}[·] denotes expectation^4 over a random sample chosen from the local data set Dr. For f : R^d → R, we denote x* := argmin_{x∈R^d} f(x) and f* := f(x*). The distributed nodes perform computations and provide updates to the master node that is responsible for aggregation and model update. We develop Qsparse-local-SGD, a distributed SGD composing gradient quantization and explicit sparsification (e.g., Topk components), along with local iterations. We develop the algorithms and analysis for both synchronous as well as asynchronous operations, in which workers can communicate with the master at arbitrary time intervals. 
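As a concrete illustration of the synchronous scheme just described, the toy simulation below runs R workers for H local steps between synchronizations, each uplinking a Top-k-compressed, error-compensated net update. The quadratic objectives and all constants are our own stand-ins, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(0)
R, d, H, T, eta, k = 4, 10, 5, 100, 0.05, 2

# Each worker r holds a toy quadratic objective f_r(x) = x^T A_r x / 2.
A = [np.diag(rng.uniform(0.5, 2.0, size=d)) for _ in range(R)]

def top_k(v, k):
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

x = np.ones(d)                        # global model at the master
x_hat = [x.copy() for _ in range(R)]  # local models
mem = [np.zeros(d) for _ in range(R)] # per-worker compression-error memory

for t in range(T):
    for r in range(R):                # one local gradient step per worker
        x_hat[r] = x_hat[r] - eta * (A[r] @ x_hat[r])
    if (t + 1) % H == 0:              # synchronization round: t + 1 is in I_T
        updates = []
        for r in range(R):
            net = mem[r] + (x - x_hat[r])   # error memory + net local progress
            g = top_k(net, k)               # compressed update sent uplink
            mem[r] = net - g                # dropped part, kept for later rounds
            updates.append(g)
        x = x - np.mean(updates, axis=0)    # master aggregates and broadcasts
        x_hat = [x.copy() for _ in range(R)]
```

Even though each worker transmits only k of d coordinates per round, the model still makes progress toward the minimizer, because the error memories recycle the untransmitted mass.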
To the best of our knowledge, these are the \ufb01rst algorithms which combine\nquantization, aggressive sparsi\ufb01cation, and local computations for distributed optimization.\n\nr=1 f (r)(x), where f (r)(x) = E\ni\u21e0Dr\n\n[fi(x)], where E\ni\u21e0Dr\n\nRPR\n\n2In our asynchronous model, the distributed nodes\u2019 iterates evolve at the same rate, but update the gradients\n\nat arbitrary times; see Section 4 for more details.\n\n3At the completion of our work, we recently found that in parallel to our work [15] examined use of sign-\nSGD quantization, without sparsi\ufb01cation for the centralized model. Another recent work in [16] studies the\ndecentralized case with sparsi\ufb01cation for strongly convex function. Our work, developed independent of these\nworks, uses quantization, sparsi\ufb01cation and local computations for the distributed case with local computations\nfor both non-convex and strongly convex objectives.\n4Our setup can also handle different local functional forms, beyond dependence on the local data set Dr,\n\nwhich is not explicitly written for notational simplicity.\n\n2\n\n\fOur main theoretical results are the convergence analysis of Qsparse-local-SGD for both (smooth)\nnon-convex objectives as well as for the strongly convex case. See Theorem 1, 2 for the synchronous\ncase, as well as Theorem 3, 4, for the asynchronous operation. 
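These results are all stated in terms of per-worker synchronization-index sets and a uniform bound H on the number of local steps between synchronizations; the bookkeeping is simple enough to sketch directly (the index sets below are made up for illustration):

```python
def gap(sync_indices):
    """Maximum spacing between consecutive synchronization indices
    (Definition-4-style bookkeeping; assumes at least two indices)."""
    idx = sorted(sync_indices)
    return max(b - a for a, b in zip(idx, idx[1:]))

# Hypothetical per-worker synchronization sets I_T^(r) for T = 12;
# every worker synchronizes at the final step T.
I = {0: [3, 6, 9, 12], 1: [2, 4, 8, 12], 2: [5, 10, 12]}

H = max(gap(times) for times in I.values())  # uniform delay bound over workers
```

In the synchronous setting all the sets coincide; in the asynchronous setting they may differ per worker, and the analysis only requires that every per-worker gap stays below H.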
Our analysis also demonstrates natural\ngains in convergence that distributed, mini-batch operation affords, and has convergence similar to\nvanilla SGD with local iterations (see Corollary 1, 2), for both the non-convex case (with convergence\nrate \u21e0 1/pT for \ufb01xed learning rate) as well as the strongly convex case (with convergence rate\n\u21e0 1/T, for diminishing learning rate), demonstrating that quantizing and sparsifying the gradient,\neven after local iterations asymptotically yields an almost \u201cfree\u201d communication ef\ufb01ciency gain (also\nobserved numerically in Section 5 non-asymptotically). The numerical results on ImageNet dataset\nimplemented for a ResNet-50 architecture demonstrates that one can get signi\ufb01cant communication\nsavings, while retaining equivalent state-of-the art performance with a small penalty in \ufb01nal accuracy.\nUnlike previous works, Qsparse-local-SGD stores the compression error of the net local update,\nwhich is a sum of at most H gradient steps and the historical error, in the local memory. From\nliterature [4, 30], we know that methods with error compensation work only when the evolution of\nthe error is controlled. The combination of quantization, sparsi\ufb01cation, and local computations poses\nseveral challenges for theoretical analysis, including (i) the analysis of impact of local iterations on\nthe evolution of the error due to quantization and sparsi\ufb01cation, as well as the deviation of local\niterates (see Lemma 3, 4, 8, 9) (ii) asynchronous updates together with distribution compression using\noperators which satisfy De\ufb01nition 3, including our composed (Qsparse) operators. (see Lemma 11-14\nin appendix). Another useful technical observation is that the composition of a quantizer and a\nsparsi\ufb01er results in a compression operator (Lemma 1, 2); see Appendix A for proofs on the same.\nWe provide additional results in the appendices as part of the supplementary material. 
These include\nresults on the asymptotic analysis for non-convex objectives in Theorem 5, 8 along with precise\nstatements of the convergence guarantees for the asynchronous operation Theorem 6, 7 and numerics\nfor the convex case for multi-class logistic classi\ufb01cation on MNIST [19] dataset in Appendix D, for\nboth synchronous and asynchronous operations.\nWe believe that our approach for combining different forms of compression and local computations\ncan be extended to the decentralized case, where nodes are connected over an arbitrary graph, building\non the ideas from [15, 35]. Our numerics also incorporate momentum acceleration, whose analysis is\na topic for future research, for example incorporating ideas from [42].\nOrganization. In Section 2, we demonstrate that composing certain classes of quantization with\nsparsi\ufb01cation satis\ufb01es a certain regularity condition that is needed for several convergence proofs for\nour algorithms. We describe the synchronous implementation of Qsparse-local-SGD in Section 3,\nand outline the main convergence results for it in Section 3.1, brie\ufb02y giving the proof ideas in Section\n3.2. We describe our asynchronous implementation of Qsparse-local-SGD and provide the theoretical\nconvergence results in Section 4. The experimental results are given in Section 5. 
Many of the proof details and additional results are given in the appendices provided with the supplementary material.\n\n2 Composition of Quantization and Sparsification\n\nIn this section, we consider the composition of two different techniques used in the literature for mitigating the communication bottleneck in distributed optimization, namely, quantization and sparsification. In quantization, we reduce the precision of the gradient vector by mapping each of its components by a deterministic [7, 15] or randomized [3, 33, 39, 44] map to a finite number of quantization levels. In sparsification, we sparsify the gradient vector before using it to update the parameter vector, by taking its Topk components or choosing k components uniformly at random, denoted by Randk [30].\n\nDefinition 1 (Randomized Quantizer [3, 33, 39, 44]). We say that Q_s : R^d → R^d is a randomized quantizer with s quantization levels, if the following holds for every x ∈ R^d: (i) E_Q[Q_s(x)] = x; (ii) E_Q[||Q_s(x)||^2] ≤ (1 + δ_{d,s}) ||x||^2, where δ_{d,s} > 0 could be a function of d and s. Here the expectation is taken over the randomness of Q_s.\n\nExamples of randomized quantizers include (i) QSGD [3, 39], which independently quantizes components of x ∈ R^d into s levels, with δ_{d,s} = min(d/s^2, √d/s); (ii) Stochastic s-level Quantization [33, 44], which independently quantizes every component of x ∈ R^d into s levels between max_i x_i and min_i x_i, with δ_{d,s} = d/(2s^2); and (iii) Stochastic Rotated Quantization [33], which is a stochastic quantization, preprocessed by a random rotation, with δ_{d,s} = 2 log_2(2d)/s^2.\n\nInstead of quantizing randomly into s levels, we can take a deterministic approach and round off to the nearest level. In particular, we can just take the sign, which has shown promise in [7, 27, 32].\n\nDefinition 2 (Deterministic Sign Quantizer [7, 15]). A deterministic quantizer Sign : R^d → {+1, −1}^d is defined as follows: for every vector x ∈ R^d and i ∈ [d], the i'th component of Sign(x) is defined as 1{x_i ≥ 0} − 1{x_i < 0}.\n\nAs mentioned above, we consider two important examples of sparsification operators: Topk and Randk. For any x ∈ R^d, Topk(x) is equal to a d-length vector, which has at most k non-zero components whose indices correspond to the indices of the largest k components (in absolute value) of x. Similarly, Randk(x) is a d-length (random) vector, which is obtained by selecting k components of x uniformly at random. Both of these satisfy a so-called “compression” property as defined below, with γ = k/d [30]. A few other examples of such operators can be found in [30].\n\nDefinition 3 (Sparsification [30]). A (randomized) function Compk : R^d → R^d is called a compression operator, if there exists a constant γ ∈ (0, 1] (that may depend on k and d), such that for every x ∈ R^d, we have E_C[||x − Compk(x)||^2] ≤ (1 − γ) ||x||^2, where the expectation is taken over Compk.\n\nWe can apply different compression operators to different coordinates of a vector, and the resulting operator is also a compression operator; see Corollary 3 in Appendix A. As an application, in the case of training neural networks, we can apply different compression operators to different layers.\n\nComposition of Quantization and Sparsification. Now we show that we can compose deterministic/randomized quantizers with sparsifiers, and the resulting operator is a compression operator. Proofs are given in Appendix A.\n\nLemma 1 (Composing sparsification with stochastic quantization). Let Compk ∈ {Topk, Randk}. Let Q_s : R^d → R^d be a stochastic quantizer with parameter s that satisfies Definition 1. Let Q_sCompk : R^d → R^d be defined as Q_sCompk(x) := Q_s(Compk(x)) for every x ∈ R^d. Then Q_sCompk(x)/(1 + δ_{k,s}) is a compression operator with compression coefficient γ = k/(d(1 + δ_{k,s})).\n\nLemma 2 (Composing sparsification with deterministic quantization). Let Compk ∈ {Topk, Randk}. 
Let SignCompk : R^d → R^d be defined as follows: for every x ∈ R^d, the i'th component of SignCompk(x) is equal to 1{x_i ≥ 0} − 1{x_i < 0} if the i'th component is chosen in defining Compk(x), and otherwise it is equal to 0. Then (||Compk(x)||_1 / k) · SignCompk(x) is a compression operator^5 with compression coefficient γ = max{ 1/d, (||Compk(x)||_1 / (√d ||Compk(x)||_2))^2 }.\n\n3 Qsparse-local-SGD\n\nLet I_T^(r) ⊆ [T] := {1, . . . , T}, with T ∈ I_T^(r), denote the set of indices at which worker r ∈ [R] synchronizes with the master. In the synchronous setting, I_T^(r) is the same for all the workers; let I_T := I_T^(r) for any r ∈ [R]. Every worker r ∈ [R] maintains a local parameter x̂_t^(r), which is updated in each iteration t using the stochastic gradient ∇f_{i_t^(r)}(x̂_t^(r)), where i_t^(r) is a mini-batch of size b sampled uniformly from Dr. If t ∈ I_T, the sparsified error-compensated update g_t^(r), computed on the net progress made since the last synchronization, is sent to the master node, and the worker updates its local memory m_t^(r). Upon receiving the g_t^(r)'s from every worker, the master aggregates them, updates the global parameter vector, and sends the new model x_{t+1} to all the workers; upon receiving it, they set their local parameter vector x̂_{t+1}^(r) to be equal to the global parameter vector x_{t+1}. Our algorithm is summarized in Algorithm 1.\n\n3.1 Main Results for Synchronous Operation\n\nAll results in this paper use the following two standard assumptions. (i) Smoothness: The local function f^(r) : R^d → R at each worker r ∈ [R] is L-smooth, i.e., for every x, y ∈ R^d, we have f^(r)(y) ≤ f^(r)(x) + ⟨∇f^(r)(x), y − x⟩ + (L/2) ||y − x||^2. (ii) Bounded second moment: stated below, after Algorithm 1.\n\n^5 The analysis for the general p-norm, i.e., for (||Compk(x)||_p / k) · SignCompk(x) for any p ∈ Z+, is provided in Appendix A. 
Algorithm 1 (Qsparse-local-SGD):\n1: Initialize x_0 = x̂_0^(r), m_0^(r) = 0, for all r ∈ [R]. Suppose η_t follows a certain learning rate schedule.\n2: for t = 0 to T − 1 do\n3:   On Workers:\n4:   for r = 1 to R do\n5:     x̂_{t+1/2}^(r) ← x̂_t^(r) − η_t ∇f_{i_t^(r)}(x̂_t^(r)), where i_t^(r) is a mini-batch of size b sampled uniformly from Dr\n6:     if t + 1 ∉ I_T then\n7:       x̂_{t+1}^(r) ← x̂_{t+1/2}^(r) and m_{t+1}^(r) ← m_t^(r)\n8:     else\n9:       g_t^(r) ← QCompk(m_t^(r) + x_t − x̂_{t+1/2}^(r)); send g_t^(r) to the master\n10:      m_{t+1}^(r) ← m_t^(r) + x_t − x̂_{t+1/2}^(r) − g_t^(r)\n11:      Receive x_{t+1} from the master and set x̂_{t+1}^(r) ← x_{t+1}\n12:    end if\n13:  end for\n14:  At Master:\n15:  if t + 1 ∉ I_T then\n16:    x_{t+1} ← x_t\n17:  else\n18:    Receive g_t^(r) from the R workers and compute x_{t+1} = x_t − (1/R) Σ_{r=1}^R g_t^(r)\n19:    Broadcast x_{t+1} to all workers\n20:  end if\n21: end for\n22: Comment: x̂_{t+1/2}^(r) denotes an intermediate variable between iterations t and t + 1.\n\n(ii) Bounded second moment: For every x̂_t^(r) ∈ R^d, r ∈ [R], t ∈ [T], we have E_{i∼Dr}[||∇f_i(x̂_t^(r))||^2] ≤ G^2, for some constant G < ∞. This is a standard assumption in [4, 12, 16, 23, 25, 26, 29–31, 43]. Relaxing the uniform boundedness of the gradient to allow arbitrarily different gradients of local functions in heterogeneous settings, as done for SGD in [24, 37], is left as future work. This also imposes a bound on the variance: E_{i∼Dr}[||∇f_i(x̂_t^(r)) − ∇f^(r)(x̂_t^(r))||^2] ≤ σ_r^2, where σ_r^2 ≤ G^2 for every r ∈ [R]. To state our results, we need the following definition from [31].\n\nDefinition 4 (Gap [31]). Let I_T = {t_0, t_1, . . . , t_k}, where t_i < t_{i+1} for i = 0, 1, . . . , k − 1. 
The gap of I_T is defined as gap(I_T) := max_{i∈[k]} (t_i − t_{i−1}), which is equal to the maximum difference between any two consecutive synchronization indices.\n\nWe leverage the perturbed iterate analysis, as in [21, 30], to provide convergence guarantees for Qsparse-local-SGD. Under assumptions (i) and (ii), the following theorems hold when Algorithm 1 is run with any compression operator (including our composed operators).\n\nTheorem 1 (Convergence in the smooth (non-convex) case with fixed learning rate). Let f^(r)(x) be L-smooth for every r ∈ [R]. Let QCompk : R^d → R^d be a compression operator whose compression coefficient is equal to γ ∈ (0, 1]. Let {x̂_t^(r)}_{t=0}^{T−1} be generated according to Algorithm 1 with QCompk, for step size η = Ĉ/√T (where Ĉ is a constant such that Ĉ/√T ≤ 1/(2L)) and gap(I_T) ≤ H. Then we have\n\nE||∇f(z_T)||^2 ≤ ( (E[f(x_0)] − f*)/Ĉ + Ĉ L Σ_{r=1}^R σ_r^2 / (bR^2) ) · (4/√T) + 8 ( 4(1 − γ)/γ^2 + 1 ) Ĉ^2 L^2 G^2 H^2 / T.   (1)\n\nHere z_T is a random variable which samples a previous parameter x̂_t^(r) with probability 1/RT.\n\nCorollary 1. 
Let E[f (x0)] f\u21e4 \uf8ff J 2, where J < 1 is a constant,6 max = maxr2[R] r, and\nbC2 = bR(E[f (x0)]f\u21e4)\nEkrf (zT )k2 \uf8ffO \u21e3 JmaxpbRT \u2318 + O\u21e3 J 2bRG2H2\nmax2T \u2318 .\n\n(where bC is a constant such that bCpT \uf8ff 1\n+ bCL\u21e3PR\n\n2 + 1\u2318 bC2L2G2H2\n\nt with probability 1/RT .\n\n, we have\n\nr=1 2\nr\n\nmaxL\n\nbC\n\n(1)\n\n(2)\n\n2\n\n2\n\nT\n\n.\n\n6Even classical SGD requires knowing an upper bound on kx0 x\u21e4k in order to choose the learning rate.\n\nSmoothness of f translates this to the difference of the function values.\n\n5\n\n\fIn order to ensure that the compression does not affect the dominating terms while converging at a\n\nrate of O\u21e31/pbRT\u2318, we would require7 H = OT 1/4/(bR)3/4.\n\n1\n\nlog T ), is provided in Theorem 5 in Appendix B.\n\nTheorem 1 is proved in Appendix B and provides non-asymptotic guarantees, where we observe that\ncompression does not affect the \ufb01rst order term. The corresponding asymptotic result (with decaying\nlearning rate), with a convergence rate of O(\nTheorem 2 (Convergence in the smooth and strongly convex case with a decaying learning rate). Let\nf (r) (x) be L-smooth and \u00b5-strongly convex. Let QCompk : Rd ! Rd be a compression operator\nt=0 be generated according to\nAlgorithm 1 with QCompk, for step sizes \u2318t = 8/\u00b5(a+t) with gap(IT ) \uf8ff H, where a > 1 is such\nthat we have a max{4H/, 32\uf8ff, H}, \uf8ff = L/\u00b5. Then the following holds\nA + 128LT\n\u00b53ST\n\nwhose compression coef\ufb01cient is equal to 2 (0, 1]. Let {bx(r)\nHere (i) A = PR\nST PT1\n\n, B = 4\u21e3 3\u00b5\n2 + 3L2G2H 2\u2318, where C 4a(12)\nt=0 hwt\u21e3 1\nRPR\nt=o wt T 3\nr=1bx(r)\n3 .\nxT := 1\n , 32\uf8ff, H}, max = maxr2[R] r, and using Ekx0 x\u21e4k2 \uf8ff 4G2\nCorollary 2. 
For a > max{ 4H\nfrom Lemma 2 in [25], we have\nE[f (xT )] f\u21e4 \uf8ffO \u21e3 G2H3\n\n2 + 3L CG2H 2\nt \u2318i, where wt = (a + t)2; and (iii) ST =PT1\n\u00b52bRT 2\u2318 + O\u21e3 G2H2\n\u00b523T 3\u2318 + O\u21e3 2\n\u00b532T 2\u2318 .\n\nE[f (xT )] f\u21e4 \uf8ff La3\nr=1 2\nr\nbR2\n\nIn order to ensure that the compression does not affect the dominating terms while converging at a\n\n4ST kx0 x\u21e4k2 + 8LT (T +2a)\n\na4H ; (ii)\n\n\u00b52bRT + H 2\n\nt }T1\n\n\u00b52ST\n\n(4)\n\n(3)\n\nB.\n\nmax\n\nmax\n\n\u00b52\n\nrate of O (1/(bRT )), we would require H = O\u21e3pT /(bR)\u2318.\n\nTheorem 2 has been proved in Appendix B. For no compression and only local computations, i.e., for\n = 1, and under the same assumptions, we recover/generalize a few recent results from literature\nwith similar convergence rates: (i) We recover [43, Theorem 1], which is for non-convex case; (ii) We\ngeneralize [31, Theorem 2.2], which is for a strongly convex case and requires that each worker has\nidentical datasets, to the distributed case. We emphasize that unlike [31, 43], which only consider\nlocal computation, we combine quantization and sparsi\ufb01cation with local computation, which poses\nseveral technical challenges (e.g., see proofs of Lemma 3, 4,7 in Appendix B).\n\n3.2 Proof Outlines\nMaintain virtual sequences for every worker\n\n0\n\n0\n\ni(r)\n\n.\n\nt\n\ni(r)\n\n(5)\n\n\u2318t\n4R\n\n\u23182\nt L\n\nand\n\nand\n\nr=1 rf\n\nt \u2318trf\n\nDe\ufb01ne (i) pt := 1\n\n2 kptk2. With some algebraic manipulations provided in Appendix B, for \u2318t \uf8ff 1/2L, we arrive at\n\nProof outline of Theorem 1. 
Since f is L-smooth, we have f (ext+1) f (ext) \uf8ff \u2318thrf (ext), pti +\n\nt \u21e3bx(r)\nt \u2318\nex(r)\n:=bx(r)\nex(r)\nt+1 :=ex(r)\nt \u2318, pt := Eit [pt] = 1\nt \u21e3bx(r)\nr=1 rf (r)\u21e3bx(r)\nt \u2318;\nRPR\nRPR\nRPR\nRPR\nr=1ex(r)\nr=1bx(r)\n(ii)ext+1 := 1\nt+1 =ext \u2318tpt,\nbxt := 1\nRXr=1\nt LEkpt ptk2 + 2\u2318tL2Ekext bxtk2\nt )k2 \uf8ff E[f (ext)] E[f (ext+1)] + \u23182\nRXr=1\nEkbxt bx(r)\nUnder Assumptions 1 and 2, we have Ekpt ptk2 \uf8ff PR\nRPR\n\ufb01rst show (in Lemma 7 in Appendix B) thatbxt ext = 1\n, i.e., the difference of the\ntrue and the virtual parameter vectors is equal to the average memory, and then we bound the local\nmemory at each worker r 2 [R] below.\nthe same rate of convergence after T =\u2326 (bR)3/4. Analogous statements hold for Theorem 2-4.\n\n. To bound Ekext bxtk2 in (6), we\n\n7Here we characterize the reduction in communication that can be afforded, however for a constant H we get\n\nEkrf (bx(r)\n\nr=1 m(r)\n\n+2\u2318tL2 1\nR\n\nr=1 2\nr\nbR2\n\nt k2.\n\n(6)\n\nt\n\n6\n\n\f1\nRT\n\n\u2318T\n\n+ 4\u2318L\nbR2\n\n2\n\n(7)\n\nLemma 3 (Bounded Memory). For \u2318t = \u2318, gap(IT ) \uf8ff H, we have for every t 2 Z+ that\n\nRXr=1\n\nRXr=1\n\nT1Xt=0\n\nEkm(r)\n\n2 H 2G2.\n\n2 H 2G2. We can bound the\nlast term of (6) as 1\nt k2 \uf8ff \u23182G2H 2 in Lemma 9 in Appendix B. Putting them back\nin (6), performing a telescopic sum from t = 0 to T 1, and then taking an average over time, we get\n\nt k2 \uf8ff 4 \u23182(12)\n\nt k2 \uf8ff 4 \u23182(12)\nRPR\nr=1 Ekm(r)\nUsing Lemma 3, we get Ekext bxtk2 \uf8ff 1\nRPR\nr=1 Ekbxtbx(r)\nt )k2 \uf8ff 4(E[f (ex0)]f\u21e4)\nEkrf (bx(r)\n\nL2G2H 2 + 8\u23182L2G2H 2.\n\nr + 32 \u23182(12)\n2\n\n2L, we arrive at Theorem 1.\n\nBy letting \u2318 = bC/pT , where bC is a constant such that bCpT \uf8ff 1\nProof outline of Theorem 2. Using the de\ufb01nition of virtual sequences (5), we have kext+1 x\u21e4k2 =\nt kpt ptk2 2\u2318t hext x\u21e4 \u2318tpt, pt pti. 
With some algebraic manipu-\nkext x\u21e4 \u2318tptk2 + \u23182\nlations provided in Appendix B, for \u2318t \uf8ff 1/4L and letting et = E[f (bxt)] f\u21e4, we get\n2 + 3L Ekbxt extk2\nRXr=1\nt PR\nRPR\nTo bound the 3rd term on the RHS of (63), \ufb01rst we note thatbxt ext = 1\nbound the local memory at each worker r 2 [R] below.\nLemma 4 (Memory Contraction). For a > 4H/, \u2318t = \u21e0/a+t, gap(IT ) \uf8ff H, there exists a\nC 4a(12)\n\n2L et + \u2318t 3\u00b5\nEkbxt bx(r)\nt k2 + \u23182\n\nEkext+1 x\u21e4k2 \uf8ff1 \u00b5\u2318t\n\n2 Ekext x\u21e4k2 \u2318t\u00b5\n\na4H such that the following holds for every t 2 Z+\n2 CH 2G2.\n\n(9)\nA proof of Lemma 4 is provided in Appendix B and is technically more involved than the proof\nof Lemma 3. This complication arises because of the decaying learning rate, combined with\ncompression and local computation. We can bound the penultimate term on the RHS of (63) as\nt G2H 2. This can be shown along the lines of the proof of [31, Lemma\n1\n3.3] and we show it in Lemma 8 in Appendix B. Substituting all these in (63) gives\n\nt k2 \uf8ff 4\u23182\n\nt k2 \uf8ff 4 \u23182\n\n, and then we\n\nr=1 m(r)\n\nEkm(r)\n\nr=1 2\nr\nbR2\n\n.\n\n+ 3\u2318tL\nR\n\n(8)\n\nt\n\nt\n\nRPR\nr=1 Ekbxt bx(r)\n\nEkext+1 x\u21e4k2 \uf8ff1 \u00b5\u2318t\n\n2 Ekext x\u21e4k2 \u00b5\u2318t\nt PR\n\nt LG2H 2 + \u23182\n\nr=1 2\nr\nbR2\n\n2L et + \u2318t 3\u00b5\n\n2 + 3L C 4\u23182\n\nt\n\n(10)\nSince (10) is a contracting recurrence relation, with some calculation done in Appendix B, we\ncomplete the proof of Theorem 2.\n\n+ (3\u2318tL)4\u23182\n\n.\n\n2 G2H 2\n\n4 Asynchronous Qsparse-local-SGD\n\nWe propose and analyze a particular form of asynchronous operation where the workers synchronize\nwith the master at arbitrary times decided locally or by master picking a subset of nodes as in federated\nlearning [17, 22]. However, the local iterates evolve at the same rate, i.e. 
each worker takes the same\nnumber of steps per unit time according to a global clock. The asynchrony is therefore that updates\noccur after different number of local iterations but the local iterations are synchronous with respect to\nthe global clock.8\nIn this asynchronous setting, I(r)\nT \u2019s may be different for different workers. However, we assume that\ngap(I (r)\nT ) \uf8ff H holds for every r 2 [R], which means that there is a uniform bound on the maximum\ndelay in each worker\u2019s update times. The algorithmic difference from Algorithm 1 is that, in this\ncase, a subset of workers (including a single worker) can send their updates to the master at their\nsynchronization time steps; master aggregates them, updates the global parameter vector, and sends\nthat only to those workers. Our algorithm is summarized in Algorithm 2 in Appendix C. We give the\nsimpli\ufb01ed expressions of our main results below; more precise results are in Appendix C.\n\n8This is different from asynchronous algorithms studied for stragglers [26, 41], where only one gradient step\n\nis taken but occurs at different times due to delays.\n\n7\n\n\ft }T1\n\n(11)\n\n(12)\n\nt\n\n\u00b52bRT + H 2\n\nmax\n\nmax\n\n2\n\n1\n\nmax.\n\nTheorem 3 (Convergence in the smooth non-convex case with \ufb01xed learning rate). Under the\nT ) \uf8ff H, if {bx(r)\nsame conditions as in Theorem 1 with gap(I (r)\nt=0 is generated according to\nAlgorithm 2, the following holds, where E[f (x0)] f\u21e4 \uf8ff J 2, max = maxr2[R] r, and bC2 =\nbR(E[f (x0)]f\u21e4)/2\nEkrf (zT )k2 \uf8ffO \u21e3 JmaxpbRT \u2318 + O\u21e3 J 2bRG2\nmax2T (H 2 + H 4)\u2318 .\nwhere zT is a random variable which samples a previous parameterbx(r)\nof O\u21e31/pbRT\u2318, we would require H = OpT 1/8/(bR)3/8.\n\nt with probability 1/RT . 
In\norder to ensure that the compression does not affect the dominating terms while converging at a rate\n\nt }T1\n\nt=0 is generated according to Algorithm 2, the following holds:\n\nE[f (xT )] f\u21e4 \uf8ffO \u21e3 G2H3\n\n\u00b523T 3\u2318 + O\u21e3 2\n\nwhere xT , ST are as de\ufb01ned in Theorem 2. To ensure that the compression does not affect the dominat-\n\nWe give a precise result in Theorem 6 in Appendix C. Note that Theorem 3 provides non-asymptotic\nguarantees, where compression is almost for \u201cfree\u201d. The corresponding asymptotic result with\nlog T ), is provided in Theorem 8 in Appendix C.\ndecaying learning rate, with a convergence rate of O(\nTheorem 4 (Convergence in the smooth and strongly convex case with decaying learning rate).\nUnder the same conditions as in Theorem 2 with gap(I(r)\nT ) \uf8ff H, a > max{4H/, 32\uf8ff, H}, max =\nmaxr2[R] r, if {bx(r)\n\u00b52bRT 2\u2318 + O\u21e3 G2\ning terms while converging at a rate of O (1/(bRT )), we would require H = Op(T /(bR))1/4.\nWe give a more precise result in Theorem 7 in Appendix C. If I(r)\nT \u2019s are the same for all the workers,\nthen one would ideally require that the bounds on H in the asynchronous setting reduce to the bounds\non H in the synchronous setting. This is not happening, as our bounds in the asynchronous setting\nare for the worst case scenario \u2013 they hold as long as gap(I (r)\n4.1 Proof Outlines\nOur proofs of these results follow the same outlines of the corresponding proofs in the synchronous\nsetting, but some technical details change signi\ufb01cantly. This is because, in our asynchronous setting,\nworkers are allowed to update the global parameter vector in between two consecutive synchronization\nr=1 m(r)\nand an\n\ntime steps of other workers. 
For example, unlike the synchronous setting, the identity $\hat{x}_t - \tilde{x}_t = \frac{1}{R}\sum_{r=1}^{R} m_t^{(r)}$ does not hold here; however, we can show that $\hat{x}_t - \tilde{x}_t$ is equal to the sum of $\frac{1}{R}\sum_{r=1}^{R} m_t^{(r)}$ and an additional term, which leads to a potentially weaker bound $\mathbb{E}\|\hat{x}_t - \tilde{x}_t\|^2 \leq O\left(\eta_{t/2}^2 G^2 (H^2 + H^4)\right)$ (vs. $O\left(\eta_{t/2}^2 G^2 H^2\right)$ for the synchronous setting), proved in Lemmas 13-14 in Appendix C. Similarly, the proof that the average true sequence stays close to the virtual sequence requires carefully chosen reference points on the global parameter sequence, lying within a bounded number of steps of the local parameters. We show the bound $\frac{1}{R}\sum_{r=1}^{R} \mathbb{E}\|\hat{x}_t - \hat{x}_t^{(r)}\|^2 \leq O\left(\eta_t^2 G^2 (H^2 + H^4/\gamma^2)\right)$, which is weaker than the corresponding bound $O\left(\eta_t^2 G^2 H^2\right)$ for the synchronous setting, in Lemmas 11-12 in Appendix C.

5 Experiments

Experiment setup: We train ResNet-50 [13] (which has d = 25,610,216 parameters) on the ImageNet dataset, using 8 NVIDIA Tesla V100 GPUs. We use a learning rate schedule consisting of 5 epochs of linear warmup, followed by a piecewise decay of 0.1 at epochs 30, 60, and 80, with a batch size of 256 per GPU. For the experiments, we focus on SGD with a momentum of 0.9, applied on the local iterations of the workers. We build our compression scheme into the Horovod framework [28].9 We use SignTopK (as in Lemma 2) as our composed operator. In TopK, we update only $k_t = \min(d_t, 1000)$ elements per step for each tensor $t$, where $d_t$ is the number of elements in the tensor. For the ResNet-50 architecture, this amounts to updating a total of k = 99,400 elements per step.
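As an illustrative sketch of how the composed operator interacts with error compensation, the following NumPy snippet implements a SignTopK-style compressor and one error-compensated local step. The function names (`sign_topk`, `local_step`) and the choice of scale (the mean magnitude of the selected coordinates, one natural scaling for a sign quantizer) are our own for illustration; the paper's exact operator is the composition given in Lemma 2.

```python
import numpy as np

def sign_topk(x, k):
    """Sketch of a Sign-TopK compressor: keep the k largest-magnitude
    coordinates and transmit only their indices and signs, scaled by the
    mean magnitude of the selected coordinates (an illustrative scaling,
    not necessarily the paper's exact one)."""
    d = x.size
    k = min(k, d)
    idx = np.argpartition(np.abs(x), d - k)[d - k:]  # top-k by magnitude
    scale = np.abs(x[idx]).mean()                    # data-dependent scale
    out = np.zeros_like(x)
    out[idx] = scale * np.sign(x[idx])
    return out

def local_step(x, grad, memory, lr, k):
    """One error-compensated update: compress (lr * grad + memory), apply
    the compressed update, and carry the residual forward in memory."""
    corrected = lr * grad + memory
    update = sign_topk(corrected, k)
    memory = corrected - update  # residual carried to future steps
    return x - update, memory
```

Over many steps, the memory term re-injects coordinates (and quantization residuals) that the compressor dropped earlier, which is precisely the mechanism the error-compensation analysis tracks.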
We also perform analogous experiments on the MNIST [19] handwritten digits dataset for softmax regression with a standard $\ell_2$ regularizer, using the synchronous operation of Qsparse-local-SGD with 15 workers and a decaying learning rate as proposed in Theorem 2; the details are provided in Appendix D.10

9 Our implementation is available at https://github.com/karakusc/horovod/tree/qsparselocal.

Results: Figure 1 compares the performance of SignTopK-SGD (which employs the 1-bit sign quantizer and the TopK sparsifier) with error compensation (SignTopK) against (i) TopK SGD with error compensation (TopK-SGD), (ii) SignSGD with error compensation (EF-SIGNSGD), and (iii) vanilla SGD (SGD). All of these are specializations of Qsparse-local-SGD. Furthermore, SignTopK_hL uses a synchronization period of h; the same applies to the other schemes. From Figure 1a, we observe that quantization and sparsification, both individually and combined, with error compensation, incur almost no penalty in convergence rate with respect to vanilla SGD. We observe that SignTopK demonstrates superior performance over EF-SIGNSGD, TopK-SGD, as well as vanilla SGD, both in terms of the number of communicated bits required to achieve a certain target loss and in terms of test accuracy. This is because in SignTopK, we send only 1 bit for the sign of each TopK coordinate, along with its location.
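To see where the bit savings come from, a rough per-step payload estimate can be computed as below. This is our own back-of-the-envelope accounting with a simple 1-sign-bit-plus-index encoding, not the exact encoding used in the implementation:

```python
import math

def dense_bits(d, bits_per_coord=32):
    """Bits to send a full-precision float32 gradient of dimension d."""
    return d * bits_per_coord

def sign_topk_bits(d, k):
    """Bits to send k coordinates as 1 sign bit plus a ceil(log2(d))-bit
    index each (a simple encoding; real implementations may pack indices
    per-tensor)."""
    return k * (1 + math.ceil(math.log2(d)))

d = 25_610_216  # ResNet-50 parameter count from the text
k = 99_400      # elements updated per step from the text
ratio = dense_bits(d) / sign_topk_bits(d, k)
print(f"{ratio:.0f}x fewer bits per synchronized step")  # prints: 317x fewer bits per synchronized step
```

Under this simple encoding the reduction is roughly 300x per synchronized step; combining it with a synchronization period of h multiplies the savings further, which is consistent in order of magnitude with the over-1000x figures reported for Qsparse-local-SGD.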
Observe that the incorporation of local iterations in Figure 1a has very little impact on the convergence rates, as compared to vanilla SGD with the same number of local iterations. Furthermore, local iterations provide an added advantage over SignTopK alone, in terms of savings (by a factor of 6 to 8 times on average) in communication bits for achieving a certain target loss; see Figure 1b.

Figure 1: (a) Training loss vs. epochs; (b) training loss vs. $\log_2$ of communication budget; (c) top-1 accuracy [18] for the schemes in Figure 1a; (d) top-5 accuracy [18] for the schemes in Figure 1a. Figures 1a-1d demonstrate the performance gains of our scheme in comparison with local SGD [31], EF-SIGNSGD [15], and TopK-SGD [4, 30] in a non-convex setting for synchronous updates.

Figures 1c and 1d show the top-1 and top-5 convergence rates,11 respectively, with respect to the total number of bits of communication used. We observe that Qsparse-local-SGD combines the bit savings of the deterministic sign-based operator and the aggressive sparsifier with infrequent communication, thereby outperforming the cases where these techniques are used individually. In particular, the number of bits required to achieve the same loss or accuracy with Qsparse-local-SGD is around 1/16 of that of TopK-SGD and over 1000x less than that of vanilla SGD.

Figure 2: (a) Training loss vs. epochs; (b) training loss vs. $\log_2$ of communication budget; (c) top-1 accuracy [18] for the schemes in Figure 2a. Figures 2a-2c demonstrate the performance gains of our scheme in a convex setting.

Figures 2b and 2c make similar comparisons in the convex setting, and show that for a test error of approximately 0.1, Qsparse-local-SGD combines the benefits of the composed operator SignTopK with local computations, and needs 10-15 times fewer bits than TopK-SGD and 1000x fewer bits than vanilla SGD.
Also in Figure 2a, we observe that both TopK-SGD and SignTopK_8L (SignTopK with 8 local iterations) converge at rates almost identical to those of their corresponding local SGD counterparts. Our experiments in both the non-convex and convex settings verify that error compensation through memory can be used to mitigate not only the components missing from updates in previous synchronization rounds, but also explicit quantization error.

10 Further numerics demonstrating the performance of Qsparse-local-SGD for the composition of a stochastic quantizer with a sparsifier, as compared to SignTopK and other standard baselines, can be found in [6].

11 top-i refers to the accuracy of the top i predictions by the model from the list of possible classes; see [18].

Acknowledgments
The authors gratefully thank Navjot Singh for his help with the experiments in the early stages of this work. This work was partially supported by NSF grant #1514531, by UC-NL grant LFR-18-548554, and by the Army Research Laboratory under Cooperative Agreement W911NF-17-2-0196. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation here on.

References
[1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. A. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: A system for large-scale machine learning. In OSDI, pages 265–283, 2016.

[2] Alham Fikri Aji and Kenneth Heafield. Sparse communication for distributed gradient descent. In EMNLP, pages 440–445, 2017.

[3] D. Alistarh, D. Grubic, J.
Li, R. Tomioka, and M. Vojnovic. QSGD: communication-efficient SGD via gradient quantization and encoding. In NIPS, pages 1707–1718, 2017.

[4] D. Alistarh, T. Hoefler, M. Johansson, N. Konstantinov, S. Khirirat, and C. Renggli. The convergence of sparsified gradient methods. In NeurIPS, pages 5977–5987, 2018.

[5] Francis R. Bach and Eric Moulines. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In NIPS, pages 451–459, 2011.

[6] Debraj Basu, Deepesh Data, Can Karakus, and Suhas N. Diggavi. Qsparse-local-SGD: Distributed SGD with quantization, sparsification, and local computations. CoRR, abs/1906.02367, 2019.

[7] J. Bernstein, Y. Wang, K. Azizzadenesheli, and A. Anandkumar. SignSGD: compressed optimisation for non-convex problems. In ICML, pages 559–568, 2018.

[8] L. Bottou. Large-scale machine learning with stochastic gradient descent. In COMPSTAT, pages 177–186, 2010.

[9] Kai Chen and Qiang Huo. Scalable training of deep learning machines by incremental block training with intra-block parallel optimization and blockwise model-update filtering. In ICASSP, pages 5880–5884, 2016.

[10] Gregory F. Coppola. Iterative parameter mixing for distributed large-margin training of structured predictors for natural language processing. PhD thesis, University of Edinburgh, UK, 2015.

[11] R. Gitlin, J. Mazo, and M. Taylor. On the design of gradient algorithms for digitally implemented adaptive filters. IEEE Transactions on Circuit Theory, 20(2):125–136, March 1973.

[12] Elad Hazan and Satyen Kale. Beyond the regret minimization barrier: optimal algorithms for stochastic strongly-convex optimization. Journal of Machine Learning Research, 15(1):2489–2512, 2014.

[13] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.

[14] Herbert Robbins and Sutton Monro.
A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.

[15] Sai Praneeth Karimireddy, Quentin Rebjock, Sebastian U. Stich, and Martin Jaggi. Error feedback fixes SignSGD and other gradient compression schemes. In ICML, pages 3252–3261, 2019.

[16] Anastasia Koloskova, Sebastian U. Stich, and Martin Jaggi. Decentralized stochastic optimization and gossip algorithms with compressed communication. In ICML, pages 3478–3487, 2019.

[17] Jakub Konecný. Stochastic, distributed and federated optimization for machine learning. CoRR, abs/1707.01155, 2017.

[18] Maksim Lapin, Matthias Hein, and Bernt Schiele. Top-k multiclass SVM. In NIPS, pages 325–333, 2015.

[19] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[20] Y. Lin, S. Han, H. Mao, Y. Wang, and W. J. Dally. Deep gradient compression: Reducing the communication bandwidth for distributed training. In ICLR, 2018.

[21] H. Mania, X. Pan, D. S. Papailiopoulos, B. Recht, K. Ramchandran, and M. I. Jordan. Perturbed iterate analysis for asynchronous stochastic optimization. SIAM Journal on Optimization, 27(4):2202–2229, 2017.

[22] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas. Communication-efficient learning of deep networks from decentralized data. In AISTATS, pages 1273–1282, 2017.

[23] Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.

[24] Lam M. Nguyen, Phuong Ha Nguyen, Marten van Dijk, Peter Richtárik, Katya Scheinberg, and Martin Takác. SGD and Hogwild! convergence without the bounded gradients assumption. In ICML, pages 3747–3755, 2018.

[25] A. Rakhlin, O. Shamir, and K.
Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. In ICML, 2012.

[26] Benjamin Recht, Christopher Ré, Stephen J. Wright, and Feng Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In NIPS, pages 693–701, 2011.

[27] F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In INTERSPEECH, pages 1058–1062, 2014.

[28] A. Sergeev and M. D. Balso. Horovod: fast and easy distributed deep learning in TensorFlow. CoRR, abs/1802.05799, 2018.

[29] Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. In ICML, pages 807–814, 2007.

[30] S. U. Stich, J. B. Cordonnier, and M. Jaggi. Sparsified SGD with memory. In NeurIPS, pages 4452–4463, 2018.

[31] Sebastian U. Stich. Local SGD converges fast and communicates little. In ICLR, 2019.

[32] Nikko Strom. Scalable distributed DNN training using commodity GPU cloud computing. In INTERSPEECH, pages 1488–1492, 2015.

[33] A. Theertha Suresh, F. X. Yu, S. Kumar, and H. B. McMahan. Distributed mean estimation with limited communication. In ICML, pages 3329–3337, 2017.

[34] H. Tang, S. Gan, C. Zhang, T. Zhang, and Ji Liu. Communication compression for decentralized training. In NeurIPS, pages 7663–7673, 2018.

[35] Hanlin Tang, Shaoduo Gan, Ce Zhang, Tong Zhang, and Ji Liu. Communication compression for decentralized training. In NeurIPS, pages 7663–7673, 2018.

[36] H. Wang, S. Sievert, S. Liu, Z. B. Charles, D. S. Papailiopoulos, and S. Wright. ATOMO: communication-efficient learning via atomic sparsification. In NeurIPS, pages 9872–9883, 2018.

[37] Jianyu Wang and Gauri Joshi. Cooperative SGD: A unified framework for the design and analysis of communication-efficient SGD algorithms.
CoRR, abs/1808.07576, 2018.

[38] J. Wangni, J. Wang, J. Liu, and T. Zhang. Gradient sparsification for communication-efficient distributed optimization. In NeurIPS, pages 1306–1316, 2018.

[39] W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, and H. Li. TernGrad: Ternary gradients to reduce communication in distributed deep learning. In NIPS, pages 1508–1518, 2017.

[40] J. Wu, W. Huang, J. Huang, and T. Zhang. Error compensated quantized SGD and its applications to large-scale distributed optimization. In ICML, pages 5321–5329, 2018.

[41] Tianyu Wu, Kun Yuan, Qing Ling, Wotao Yin, and Ali H. Sayed. Decentralized consensus optimization with asynchrony and delays. IEEE Transactions on Signal and Information Processing over Networks, 4(2):293–307, 2018.

[42] Hao Yu, Rong Jin, and Sen Yang. On the linear speedup analysis of communication efficient momentum SGD for distributed non-convex optimization. In ICML, pages 7184–7193, 2019.

[43] Hao Yu, Sen Yang, and Shenghuo Zhu. Parallel restarted SGD with faster convergence and less communication: Demystifying why model averaging works for deep learning. In AAAI, pages 5693–5700, 2019.

[44] Y. Zhang, J. C. Duchi, M. I. Jordan, and M. J. Wainwright. Information-theoretic lower bounds for distributed statistical estimation with communication constraints. In NIPS, pages 2328–2336, 2013.

[45] Y. Zhang, J. C. Duchi, and M. J. Wainwright. Communication-efficient algorithms for statistical optimization. Journal of Machine Learning Research, 14(1):3321–3363, 2013.