{"title": "ATOMO: Communication-efficient Learning via Atomic Sparsification", "book": "Advances in Neural Information Processing Systems", "page_first": 9850, "page_last": 9861, "abstract": "Distributed model training suffers from communication overheads due to frequent gradient updates transmitted between compute nodes. To mitigate these overheads, several studies propose the use of sparsified stochastic gradients. We argue that these are facets of a general sparsification method that can operate on any possible atomic decomposition. Notable examples include element-wise, singular value, and Fourier decompositions. We present ATOMO, a general framework for atomic sparsification of stochastic gradients. Given a gradient, an atomic decomposition, and a sparsity budget, ATOMO gives a random unbiased sparsification of the atoms minimizing variance. We show that recent methods such as QSGD and TernGrad are special cases of ATOMO, and that sparsifiying the singular value decomposition of neural networks gradients, rather than their coordinates, can lead to significantly faster distributed training.", "full_text": "ATOMO: Communication-ef\ufb01cient Learning via\n\nAtomic Sparsi\ufb01cation\n\nHongyi Wang1\u21e4, Scott Sievert2\u21e4, Zachary Charles2, Shengchao Liu1,\n\nStephen Wright1, Dimitris Papailiopoulos2\n\n1Department of Computer Sciences,\n\n2Department of Electrical and Computer Engineering\n\nUniversity of Wisconsin-Madison\n\nAbstract\n\nDistributed model training suffers from communication overheads due to frequent\ngradient updates transmitted between compute nodes. To mitigate these overheads,\nseveral studies propose the use of sparsi\ufb01ed stochastic gradients. We argue that\nthese are facets of a general sparsi\ufb01cation method that can operate on any possible\natomic decomposition. Notable examples include element-wise, singular value,\nand Fourier decompositions. 
We present ATOMO, a general framework for atomic sparsification of stochastic gradients. Given a gradient, an atomic decomposition, and a sparsity budget, ATOMO gives a random unbiased sparsification of the atoms minimizing variance. We show that recent methods such as QSGD and TernGrad are special cases of ATOMO and that sparsifying the singular value decomposition of neural network gradients, rather than their coordinates, can lead to significantly faster distributed training.\n\n1 Introduction\n\nSeveral machine learning frameworks, such as TensorFlow [1], MXNet [2], and Caffe2 [3], come with distributed implementations of popular training algorithms, such as mini-batch SGD. However, the empirical speed-up gains offered by distributed training often fall short of the optimal linear scaling one would hope for. It is now widely acknowledged that communication overheads are the main source of this speedup saturation phenomenon [4, 5, 6, 7, 8].\nCommunication bottlenecks are largely attributed to frequent gradient updates transmitted between compute nodes. As the number of parameters in state-of-the-art models scales to hundreds of millions [9, 10], the size of gradients scales proportionally. These bottlenecks become even more pronounced in the context of federated learning [11, 12], where edge devices (e.g., mobile phones, sensors, etc.) perform decentralized training, but suffer from low bandwidth during uplink.\nTo reduce the cost of communication during distributed model training, a series of recent studies propose communicating low-precision or sparsified versions of the computed gradients during model updates. Partially initiated by a 1-bit implementation of SGD by Microsoft in [5], a large number of recent studies revisited the idea of low-precision training as a means to reduce communication [13, 14, 15, 16, 17, 18, 19, 20, 21]. 
Other approaches for low-communication training focus on sparsification of gradients, either by thresholding small entries or by random sampling [6, 22, 23, 24, 25, 26, 27, 28]. Several approaches, including QSGD and TernGrad, implicitly combine quantization and sparsification to maximize performance gains [14, 16, 12, 29, 30], while providing provable guarantees for convergence and performance. We note that quantization methods in the context of gradient-based updates have a rich history, dating back to at least as early as the 1970s [31, 32, 33].\n\n*These authors contributed equally\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\nOur Contributions An atomic decomposition represents a vector as a linear combination of simple building blocks in an inner product space. In this work, we show that stochastic gradient sparsification and quantization are facets of a general approach that sparsifies a gradient in any possible atomic decomposition, including its entry-wise or singular value decomposition, its Fourier decomposition, and more. With this in mind, we develop ATOMO, a general framework for atomic sparsification of stochastic gradients. ATOMO sets up and optimally solves a meta-optimization that minimizes the variance of the sparsified gradient, subject to the constraints that it is sparse in the atomic basis and is an unbiased estimator of the input.\nWe show that 1-bit QSGD and TernGrad are in fact special cases of ATOMO, and each is optimal (in terms of variance and sparsity) in different parameter regimes. Then, we argue that for some neural network applications, viewing the gradient as a concatenation of matrices (each corresponding to a layer), and applying atomic sparsification to their SVD, is meaningful and well-motivated by the fact that these matrices are approximately low-rank (see Fig. 1). 
We show that ATOMO on the SVD of each layer's gradient can lead to less variance, and faster training, for the same communication budget as that of QSGD or TernGrad. We present extensive experiments showing that using ATOMO with SVD sparsification can lead to up to 2×/3× faster training time (including the time to compute the SVD) compared to QSGD/TernGrad. This holds using VGG and ResNet-18 on SVHN and CIFAR-10.\nRelation to Prior Work ATOMO is closely related to work on communication-efficient distributed mean estimation in [29] and [30]. These works both note, as we do, that variance (or equivalently the mean squared error) controls important quantities such as convergence, and they seek to find a low-communication vector averaging scheme that minimizes it. Our work differs in two key aspects. First, we derive a closed-form solution to the variance minimization problem for all input gradients. Second, ATOMO applies to any atomic decomposition, which allows us to compare entry-wise against singular value sparsification for matrices. Using this, we derive explicit conditions under which SVD sparsification leads to lower variance for the same sparsity budget.\nThe idea of viewing gradient sparsification through a meta-optimization lens was also used in [34]. Our work differs in two key ways. First, [34] considers the problem of minimizing the sparsity of a gradient for a fixed variance, while we consider the reverse problem, that is, minimizing the variance subject to a sparsity budget. The second, more important, difference is that while [34] focuses on entry-wise sparsification, we consider a general problem where we sparsify according to any atomic decomposition. 
For instance, our approach directly applies to sparsifying the singular values of a matrix, which gives rise to faster training algorithms.\nFinally, low-rank factorizations and sketches of the gradients, when viewed as matrices, were proposed in [35, 36, 37, 38, 12]; arguably most of these methods (with the exception of [12]) aimed to address the high flops required when training low-rank models. Though they did not directly aim to reduce communication, this arises as a useful side effect.\n\nFigure 1: The singular values of a convolutional layer's gradient, for ResNet-18 while training on CIFAR-10. The gradient of a layer can be seen as a matrix, once we vectorize and appropriately stack the conv-filters. For all presented data passes, there is a sharp decay in singular values, with the top 3 standing out.\n\n2 Problem Setup\nIn machine learning, we often wish to find a model w minimizing the empirical risk\n\nf(w) = (1/n) Σ_{i=1}^n ℓ(w; x_i)   (1)\n\nwhere x_i ∈ R^d is the i-th data point. One way to approximately minimize f(w) is by using stochastic gradient methods that operate as follows:\n\nw_{k+1} = w_k − γ ĝ(w_k)\n\nwhere w_0 is some initial model, γ is the stepsize, and ĝ(w) is a stochastic gradient of f(w), i.e., it is an unbiased estimate of the true gradient g(w) = ∇f(w). Mini-batch SGD, one of the most common algorithms for distributed training, computes ĝ as an average of B gradients, each evaluated on randomly sampled data from the training set. Mini-batch SGD is easily parallelized in the parameter server (PS) setup, where a PS stores the global model, and P compute nodes split the effort of computing the B gradients. 
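Not part of the paper's code; a minimal single-process simulation of the mini-batch parameter-server pattern just described, under our own assumptions (a noiseless toy least-squares objective, P workers splitting a mini-batch of B gradients; all names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((256, 10))   # toy dataset: 256 points in R^10
w_true = rng.standard_normal(10)
y = X @ w_true                       # noiseless least-squares targets

def stochastic_grad(w, idx):
    # Gradient of the empirical risk (1) restricted to a sampled index set.
    return X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)

w = np.zeros(10)                     # initial model w_0
P, B, gamma = 4, 32, 0.05            # compute nodes, mini-batch size, stepsize
for k in range(200):
    idx = rng.choice(len(X), size=B, replace=False)
    shards = np.array_split(idx, P)            # each node handles B/P gradients
    node_grads = [stochastic_grad(w, sh) for sh in shards]
    g_hat = np.mean(node_grads, axis=0)        # PS averages the received gradients
    w = w - gamma * g_hat                      # ... updates and "broadcasts" the model
```

Averaging equal-sized shards reproduces the full mini-batch gradient exactly, which is why the update matches serial mini-batch SGD.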
Once the PS receives these gradients, it applies them to the model, and sends it back to the compute nodes.\nTo prove convergence bounds for stochastic-gradient based methods, we usually require ĝ(w) to be an unbiased estimator of the full-batch gradient, and to have small variance E||ĝ(w)||^2, as this controls the speed of convergence. To see this, suppose w* is a critical point of f; then we have\n\nE[||w_{k+1} − w*||_2^2] = E[||w_k − w*||_2^2] − ( 2γ⟨∇f(w_k), w_k − w*⟩ − γ^2 E[||ĝ(w_k)||_2^2] ),\n\nwhere the parenthesized quantity is the progress at step k. In particular, the progress made by the algorithm at a single step is, in expectation, controlled by the term E[||ĝ(w_k)||_2^2]; the smaller it is, the bigger the progress. This is a well-known fact in optimization, and most convergence bounds for stochastic-gradient based methods, including mini-batch, involve upper bounds on E[||ĝ(w_k)||_2^2], in a multiplicative form, for both convex and nonconvex setups [39, 40, 41, 42, 43, 44, 45, 46, 47]. Hence, recent results on low-communication variants of SGD design unbiased quantized or sparse gradients, and try to minimize their variance [14, 29, 34].\nSince variance is a proxy for speed of convergence, in the context of communication-efficient stochastic gradient methods, one can ask: what is the smallest possible variance of a stochastic gradient that is represented with k bits? This can be cast as the following meta-optimization:\n\nmin_{ĝ} E||ĝ(w)||^2\ns.t. E[ĝ(w)] = g(w),\n     ĝ(w) can be expressed with k bits.\n\nHere, the expectation is taken over the randomness of ĝ. We are interested in designing a stochastic approximation ĝ that "solves" this optimization. However, it seems difficult to design a formal, tractable version of the last constraint. In the next section, we replace this with a simpler constraint that instead requires that ĝ(w) be sparse with respect to a given atomic decomposition.\n\n3 ATOMO: Atomic Decomposition and Sparsification\n\nLet (V, ⟨·,·⟩) be an inner product space over R and let ||·|| denote the induced norm on V. In what follows, you may think of g as a stochastic gradient of the function we wish to optimize. An atomic decomposition of g is any decomposition of the form g = Σ_{a∈A} λ_a a for some set of atoms A ⊆ V. Intuitively, A consists of simple building blocks. We will assume that ||a|| = 1 for all a ∈ A, as this can be achieved by a positive rescaling of the λ_a.\nAn example of an atomic decomposition is the entry-wise decomposition g = Σ_i g_i e_i, where {e_i}_{i=1}^n is the standard basis. More generally, any orthonormal basis of V gives rise to a unique atomic decomposition of g. While we focus on finite-dimensional vectors, one could use Fourier and wavelet decompositions in this framework. When considering matrices, the singular value decomposition gives an atomic decomposition in the set of rank-1 matrices. More general atomic decompositions have found uses in a variety of situations, including solving linear inverse problems [48].\nWe are interested in finding an approximation to g with fewer atoms. Our primary motivation is that this reduces communication costs, as we only need to send atoms with non-zero weights. We can use whichever decomposition is most amenable to sparsification. For instance, if X is a low-rank matrix, then its singular value decomposition is naturally sparse, so we can save communication costs by sparsifying its singular value decomposition instead of its entries.\nSuppose A = {a_i}_{i=1}^n and we have an atomic decomposition g = Σ_{i=1}^n λ_i a_i. We wish to find an unbiased estimator ĝ of g that is sparse in these atoms, and with small variance. 
Since ĝ is unbiased, minimizing its variance is equivalent to minimizing E[||ĝ||^2]. We use the following estimator:\n\nĝ = Σ_{i=1}^n (λ_i t_i / p_i) a_i,   (2)\n\nwhere t_i ~ Bernoulli(p_i), for 0 < p_i ≤ 1. We refer to this sparsification scheme as atomic sparsification. Note that the t_i are independent. Recall that we assumed above that ||a_i|| = 1 for all a_i. We have the following lemma about ĝ.\nLemma 1. E[ĝ] = g and E[||ĝ||^2] = Σ_{i=1}^n λ_i^2 p_i^{-1} + Σ_{i≠j} λ_i λ_j ⟨a_i, a_j⟩.\n\nLet λ = [λ_1, ..., λ_n]^T and p = [p_1, ..., p_n]^T. In order to ensure that this estimator is sparse, we fix some sparsity budget s; that is, we require Σ_i p_i = s. This is a sparsity-on-average constraint. We wish to minimize E[||ĝ||^2] subject to this constraint. By Lemma 1, this is equivalent to\n\nmin_p Σ_{i=1}^n λ_i^2 / p_i   s.t. ∀i, 0 < p_i ≤ 1, Σ_{i=1}^n p_i = s.   (3)\n\nAn equivalent form of this problem was presented in [29] (Section 6.1). The authors considered this problem for entry-wise sparsification and found a closed-form solution for s ≤ ||λ||_1/||λ||_∞. We give a version of their result but extend it to all s. A similar optimization problem was given in [34], which instead minimizes sparsity subject to a variance constraint.\n\nAlgorithm 1: ATOMO probabilities\nInput: λ ∈ R^n with |λ_1| ≥ ... ≥ |λ_n|; sparsity budget s such that 0 < s ≤ n.\nOutput: p ∈ R^n solving (3).\ni = 1;\nwhile i ≤ n do\n    if |λ_i| s ≤ Σ_{j=i}^n |λ_j| then\n        for k = i, ..., n do\n            p_k = |λ_k| s (Σ_{j=i}^n |λ_j|)^{-1};\n        end\n        i = n + 1;\n    else\n        p_i = 1; s = s − 1; i = i + 1;\n    end\nend\n\nWe will show that Algorithm 1 produces p ∈ R^n solving (3). While we show in Appendix B that this can be derived via the KKT conditions, we focus on an alternative method that relaxes (3) to better understand its structure. This approach also analyzes the variance achieved by solving (3) more directly.\nNote that (3) has a non-empty feasible set only for 0 < s ≤ n. Define f(p) := Σ_{i=1}^n λ_i^2 / p_i. To understand how to solve (3), we first consider the following relaxation:\n\nmin_p Σ_{i=1}^n λ_i^2 / p_i   s.t. ∀i, 0 < p_i, Σ_{i=1}^n p_i = s.   (4)\n\nWe have the following lemma about solutions to (4), first shown in [29].\nLemma 2 ([29]). Any feasible vector p for (4) satisfies f(p) ≥ (1/s)||λ||_1^2. This is achieved iff p_i = |λ_i| s / ||λ||_1.\nLemma 2 implies that if we ignore the constraint that p_i ≤ 1, then the optimum is achieved by setting p_i = |λ_i| s / ||λ||_1. If the quantity on the right-hand side is greater than 1, this does not give us an actual probability. This leads to the following definition.\nDefinition 1. An atomic decomposition g = Σ_{i=1}^n λ_i a_i is s-unbalanced at entry i if |λ_i| s > ||λ||_1. We say that g is s-balanced otherwise.\nClearly, an atomic decomposition is s-balanced iff s ≤ ||λ||_1/||λ||_∞. Lemma 2 gives us the optimal way to sparsify s-balanced vectors, since the optimal p for (4) is then feasible for (3). If g is s-unbalanced at entry j, we cannot assign this p_j as it is larger than 1. In the following lemma, we show that p_j = 1 is optimal in this setting.\nLemma 3. Suppose that g is s-unbalanced at entry j and that q is feasible in (3). Then there exists a feasible p such that f(p) ≤ f(q) and p_j = 1.\nLet Δ(g) = Σ_{i≠j} λ_i λ_j ⟨a_i, a_j⟩. Lemmas 2 and 3 imply the following theorem about solutions to (3).\nTheorem 4. If g is s-balanced, then E[||ĝ||^2] ≥ s^{-1}||λ||_1^2 + Δ(g), with equality if and only if p_i = |λ_i| s / ||λ||_1. If g is s-unbalanced, then E[||ĝ||^2] > s^{-1}||λ||_1^2 + Δ(g), and it is minimized by p with p_j = 1 where j = argmax_{i=1,...,n} |λ_i|.\nDue to the sorting requirement in the input, Algorithm 1 requires O(n log n) operations. In Appendix B we describe a variant that uses only O(sn) operations. 
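Algorithm 1 translates directly into code. The following NumPy sketch (ours, not the authors' released implementation) computes the probabilities and applies the estimator in (2); it assumes at least one coefficient is non-zero:

```python
import numpy as np

def atomo_probabilities(coords, s):
    """Algorithm 1 (ATOMO probabilities): given atom coefficients lambda and an
    average sparsity budget 0 < s <= n, return probabilities p solving (3)."""
    coords = np.asarray(coords, dtype=float)
    n = len(coords)
    order = np.argsort(-np.abs(coords))   # positions sorted by decreasing |lambda_i|
    mags = np.abs(coords)[order]
    p_sorted = np.empty(n)
    budget = float(s)
    i = 0
    while i < n:
        tail = mags[i:].sum()
        if mags[i] * budget <= tail:
            # Remaining entries are balanced: p_k proportional to |lambda_k|.
            p_sorted[i:] = mags[i:] * budget / tail
            break
        # Unbalanced entry: keep it with probability 1 and shrink the budget.
        p_sorted[i] = 1.0
        budget -= 1.0
        i += 1
    p = np.empty(n)
    p[order] = p_sorted                   # undo the sort
    return p

def atomic_sparsify(coords, p, rng):
    """The estimator in (2): keep coefficient i with probability p_i and
    rescale by 1/p_i, so the expectation equals the input (Lemma 1)."""
    coords = np.asarray(coords, dtype=float)
    keep = rng.random(len(coords)) < p
    return np.where(keep, coords / p, 0.0)
```

For a balanced input such as lambda = (3, 1, 1, 1) with s = 2 this returns p = (1, 1/3, 1/3, 1/3); for the unbalanced lambda = (10, 1, 1) with s = 2 it pins p_1 = 1 and splits the remaining unit of budget as (0.5, 0.5).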
Thus, we can solve (3) in O(min{n log n, sn}) operations.\n\n4 Relation to QSGD and TernGrad\n\nIn this section, we discuss how ATOMO is related to two recent quantization schemes, 1-bit QSGD [14] and TernGrad [16]. We will show that in certain cases, these schemes are versions of ATOMO for a specific sparsity budget s. Both schemes use the entry-wise atomic decomposition.\nQSGD takes as input g ∈ R^n and b ≥ 1. This b governs the number of quantization buckets. When b = 1, QSGD produces a random vector Q(g) defined by\n\nQ(g)_i = ||g||_2 sign(g_i) ζ_i.\n\nHere, the ζ_i ~ Bernoulli(|g_i|/||g||_2) are independent random variables. One can show this is equivalent to (2) with p_i = |g_i|/||g||_2 and sparsity budget s = ||g||_1/||g||_2. Note that by definition, any g is s-balanced for this s. Therefore, Theorem 4 implies that the optimal way to assign p_i with this given s is p_i = |g_i|/||g||_2, which agrees with 1-bit QSGD.\nTernGrad takes g ∈ R^n and produces a sparsified version T(g) given by\n\nT(g)_i = ||g||_∞ sign(g_i) ζ_i,\n\nwhere ζ_i ~ Bernoulli(|g_i|/||g||_∞). This is equivalent to (2) with p_i = |g_i|/||g||_∞ and sparsity budget s = ||g||_1/||g||_∞. Once again, any g is s-balanced for this s by definition. Therefore, Theorem 4 implies that the optimal assignment of the p_i for this s is p_i = |g_i|/||g||_∞, which agrees with TernGrad.\nWe can generalize both of these with the following quantization method. Fix q ∈ (0, ∞]. Given g ∈ R^n, we define the ℓ_q-quantization of g, denoted L_q(g), by\n\nL_q(g)_i = ||g||_q sign(g_i) ζ_i,\n\nwhere ζ_i ~ Bernoulli(|g_i|/||g||_q). By the reasoning above, we derive the following theorem.\nTheorem 5. ℓ_q-quantization performs atomic sparsification in the standard basis with p_i = |g_i|/||g||_q. This solves (3) for s = ||g||_1/||g||_q and satisfies E[||L_q(g)||_2^2] = ||g||_1 ||g||_q. In particular, for q = 2 we get 1-bit QSGD, while for q = ∞ we get TernGrad.\n\n5 Spectral-ATOMO: Sparsifying the Singular Value Decomposition\n\nFor a rank-r matrix X, denote its singular value decomposition (SVD) by X = Σ_{i=1}^r σ_i u_i v_i^T. Let σ = [σ_1, ..., σ_r]^T. We define the ℓ_{p,q} norm of a matrix X by ||X||_{p,q} = (Σ_{j=1}^m (Σ_{i=1}^n |X_{i,j}|^p)^{q/p})^{1/q}. When p = q = ∞, we define this to be ||X||_max, where ||X||_max := max_{i,j} |X_{i,j}|.\nLet V be the space of real n × m matrices. Given X ∈ V, there are two standard atomic decompositions of X. The first is the entry-wise decomposition X = Σ_{i,j} X_{i,j} e_i e_j^T. The second is its SVD, X = Σ_{i=1}^r σ_i u_i v_i^T. If r is small, it may be more efficient to communicate the r(n + m) entries of the SVD, rather than the nm entries of the matrix. Let X̂_ew and X̂_svd denote the random variables in (2) corresponding to the entry-wise decomposition and the singular value decomposition of X, respectively. We wish to compare these two sparsifications.\nIn Table 1, we compare the communication cost and variance of these two methods. The communication cost is the expected number of non-zero elements (real numbers) that need to be communicated.\n\nTable 1: Communication cost and variance of ATOMO for matrices.\nDecomposition | Comm. | Var.\nEntry-wise | s | (1/s)||X||_{1,1}^2\nSVD | s(n + m) | (1/s)||X||_*^2\n\nFor X̂_ew, a sparsity budget of s corresponds to s non-zero entries we need to communicate. For X̂_svd, a sparsity budget of s gives a communication cost of s(n + m) due to the singular vectors. We compare the optimal variance from Theorem 4.\nTo compare the variance of these two methods under the same communication cost, we want X to be s-balanced in its entry-wise decomposition. This holds iff s ≤ ||X||_{1,1}/||X||_max. 
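The ℓ_q-quantization family of Section 4 is only a few lines of code; the sketch below (ours) recovers 1-bit QSGD at q = 2 and TernGrad at q = np.inf:

```python
import numpy as np

def lq_quantize(g, q, rng):
    """l_q-quantization L_q(g): coordinate i survives with probability
    zeta_i ~ Bernoulli(|g_i| / ||g||_q) and is replaced by ||g||_q sign(g_i),
    so E[L_q(g)_i] = g_i (the scheme is unbiased)."""
    g = np.asarray(g, dtype=float)
    norm = np.abs(g).max() if np.isinf(q) else (np.abs(g) ** q).sum() ** (1.0 / q)
    zeta = rng.random(g.shape) < np.abs(g) / norm   # independent Bernoulli draws
    return norm * np.sign(g) * zeta
```

Every output coordinate lies in {−||g||_q, 0, +||g||_q}, so a message needs only one real number plus a sign/zero code per coordinate, matching the variance E[||L_q(g)||_2^2] = ||g||_1 ||g||_q of Theorem 5.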
By Theorem 4, this gives E[||X̂_ew||_F^2] = s^{-1}||X||_{1,1}^2 for the entry-wise sparsification X̂_ew. To achieve the same communication cost with the SVD sparsification X̂_svd, we take a sparsity budget of s' = s/(n + m). The SVD of X is s'-balanced iff s' ≤ ||X||_*/||X||_2. By Theorem 4, E[||X̂_svd||_F^2] = (n + m) s^{-1} ||X||_*^2. This leads to the following theorem.\nTheorem 6. Suppose X ∈ R^{n×m} and\n\ns ≤ min{ ||X||_{1,1}/||X||_max, (n + m)||X||_*/||X||_2 }.\n\nThen X̂_svd with sparsity budget s' = s/(n + m) incurs the same communication cost as X̂_ew with sparsity budget s, and E[||X̂_svd||^2] ≤ E[||X̂_ew||^2] if and only if (n + m)||X||_*^2 ≤ ||X||_{1,1}^2.\nTo better understand this condition, we will make use of the following well-known fact.\nLemma 7. For any n × m matrix X over R, (1/√(nm))||X||_{1,1} ≤ ||X||_* ≤ ||X||_{1,1}.\nFor expository purposes, we give a proof of this in Appendix C and show that these bounds are the best possible. As a result, if the first inequality is tight, then E[||X̂_svd||^2] ≤ E[||X̂_ew||^2], while if the second is tight then E[||X̂_svd||^2] ≥ E[||X̂_ew||^2]. As we show in the next section, using singular value sparsification can translate into significantly reduced distributed training time.\n\n6 Experiments\n\nWe present an empirical study of spectral-ATOMO and compare it to the recently proposed QSGD [14] and TernGrad [16], on different neural network models and data sets, in real distributed environments. Our main findings are as follows:\n• We observe that spectral-ATOMO provides a useful alternative to entry-wise sparsification methods; it reduces communication compared to vanilla mini-batch SGD, and can reduce training time compared to QSGD and TernGrad by up to a factor of 2× and 3×, respectively. For instance, on VGG11-BN trained on CIFAR-10, spectral-ATOMO with sparsity budget 3 achieves a 3.96× speedup over vanilla SGD, while 4-bit QSGD achieves 1.68×, on a cluster of 16 g2.2xlarge instances. 
Both ATOMO and QSGD greatly outperform TernGrad as well.\n• We observe that spectral-ATOMO in distributed settings leads to models with negligible accuracy loss when combined with parameter tuning.\n\nImplementation and setup We compare spectral-ATOMO2 with different sparsity budgets to b-bit QSGD across a distributed cluster with a parameter server (PS), implemented in mpi4py [49] and PyTorch [50] and deployed on multiple types of instances in Amazon EC2 (e.g., m5.4xlarge, m5.2xlarge, and g2.2xlarge); both the PS and the compute nodes run on the same instance type. The PS implementation is standard, with a few important modifications. At the most basic level, it receives gradients from the compute nodes and broadcasts the updated model once a batch has been received.\nIn our experiments, we use data augmentation (random crops and flips), and tuned the step size for every different setup, as shown in Table 5 in Appendix D. Momentum and regularization terms are switched off to make the hyperparameter search tractable and the results more legible. Tuning the step sizes for this distributed network for three different datasets and eight different coding schemes can be computationally intensive. As such, we only used small networks so that multiple networks could fit into GPU memory. To emulate the effect of larger networks, we use synchronous message communication instead of asynchronous.\nEach compute node evaluates gradients sampled from its partition of the data. Gradients are then sparsified through QSGD or spectral-ATOMO, and sent back to the PS. Note that spectral-ATOMO transmits the weighted singular vectors sampled from the true gradient of a layer. The PS then combines these, and updates the model with the average gradient. Our entire experimental pipeline is implemented in PyTorch [50] with mpi4py [49], and deployed on g2.2xlarge, m5.2xlarge, or m5.4xlarge instances in Amazon AWS EC2. 
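The mpi4py/PyTorch pipeline itself is not reproduced here; as a single-machine sketch under our own naming, each worker step above (sparsify a layer's gradient matrix spectrally, send the weighted singular vectors, let the PS rebuild the estimate) might look as follows. For simplicity the balanced-case probabilities p_i = σ_i s/||σ||_1, clipped at 1, stand in for full Algorithm 1:

```python
import numpy as np

def sparsify_singular_values(grad_mat, s, rng):
    """Per-layer spectral sparsification sketch: sample singular values with
    probability proportional to their magnitude, rescaling kept components by
    1/p_i so the resulting matrix estimate is unbiased."""
    u, sigma, vt = np.linalg.svd(grad_mat, full_matrices=False)
    # Balanced-case probabilities; np.minimum clips any p_i above 1
    # (full Algorithm 1 would redistribute that excess budget instead).
    p = np.minimum(1.0, sigma * s / sigma.sum())
    keep = rng.random(len(sigma)) < p
    # Only the kept triples (sigma_i / p_i, u_i, v_i) would be sent to the PS.
    return [(sigma[i] / p[i], u[:, i], vt[i]) for i in np.flatnonzero(keep)]

def reconstruct(triples, shape):
    """What the PS would do with one worker's message: rebuild the estimate."""
    out = np.zeros(shape)
    for w, u, v in triples:
        out += w * np.outer(u, v)
    return out
```

On average the message carries about s weighted rank-1 terms, i.e., roughly s(n + m) real numbers per layer, matching the communication accounting of Table 1.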
We conducted our experiments on the models, datasets, and learning tasks detailed in Table 2.\n\nTable 2: The datasets used and their associated learning models and hyper-parameters.\nDataset | # Data points | # Classes | Model | # Parameters\nCIFAR-10 | 60,000 | 10 | ResNet-18 / VGG-11-BN | 11,173k / 9,756k\nCIFAR-100 | 60,000 | 100 | ResNet-18 | 11,173k\nSVHN | 600,000 | 10 | ResNet-18 | 11,173k\n\n2code available at: https://github.com/hwang595/ATOMO\n\nFigure 2: The timing of the gradient coding methods (QSGD and spectral-ATOMO) for different quantization levels, b bits and s sparsity budget respectively for each worker, when using a ResNet-34 model on CIFAR-10. For brevity, we use SVD to denote spectral-ATOMO. The bars represent the total iteration time and are divided into computation time (bottom, solid), encoding time (middle, dotted), and communication time (top, faded).\n\nScalability We study the scalability of these sparsification methods on clusters of different sizes. We used clusters with one PS and n = 2, 4, 8, 16 compute nodes. We ran ResNet-34 on CIFAR-10 using mini-batch SGD with batch size 512 split among the compute nodes. The experiment was run on m5.4xlarge instances of AWS EC2, and the results are shown in Figure 2.\nWhile increasing the size of the cluster decreases the computational cost per worker, it causes the communication overhead to grow. By computational cost we mean the time required by each worker for gradient computations, while the communication overhead is the time the PS waits to receive the gradients from the slowest worker. This increase in communication cost is non-negligible, even for moderately-sized networks with sparsified gradients. 
We observed a trade-off in both sparsification approaches between the information retained in the messages after sparsification and the communication overhead.\nEnd-to-end convergence performance We evaluate the end-to-end convergence performance on different datasets and neural networks, training with spectral-ATOMO (with sparsity budget s = 1, 2, 3, 4), QSGD (with b = 1, 2, 4, 8 bits), and ordinary mini-batch SGD. The datasets and models are summarized in Table 2. We use ResNet-18 [9] and VGG11-BN [51] for CIFAR-10 [52] and SVHN [53]. Again, for each of these methods we tune the step size. The experiments were run on a cluster of 16 compute nodes instantiated on g2.2xlarge instances.\nThe gradients of convolutional layers are 4-dimensional tensors of shape [x, y, k, k], where x, y are two spatial dimensions and k is the size of the convolutional kernel. However, matrices are required to compute the SVD for spectral-ATOMO, and we choose to reshape each layer into a matrix of size [xy/2, 2k^2]. This provides more flexibility in the sparsity budget for the SVD sparsification. For QSGD, we use the bucketing and Elias recursive coding methods proposed in [14], with bucket size equal to the number of parameters in each layer of the neural network.\n\n(a) CIFAR-10, ResNet-18, best of QSGD and SVD. (b) SVHN, ResNet-18, best of QSGD and SVD. (c) CIFAR-10, VGG11, best of QSGD and SVD.\nFigure 3: Convergence rates for the best performance of QSGD and spectral-ATOMO, alongside TernGrad and vanilla SGD. (a) uses ResNet-18 on CIFAR-10, (b) uses ResNet-18 on SVHN, and (c) uses VGG-11-BN on CIFAR-10. 
For brevity, we use SVD to denote spectral-ATOMO.\n\nFigure 3 shows how the testing accuracy varies with wall-clock time. Tables 3 and 4 give a detailed account of the speedups of singular value sparsification compared to QSGD. In these tables, each method is run until a specified accuracy.\n\nTable 3: Speedups of spectral-ATOMO with sparsity budget s, b-bit QSGD, and TernGrad using ResNet-18 on CIFAR-10 over vanilla SGD. N/A means the method fails to reach the given test accuracy in a fixed number of iterations.\n\nTable 4: Speedups of spectral-ATOMO with sparsity budget s, b-bit QSGD, and TernGrad using ResNet-18 on SVHN over vanilla SGD. N/A means the method fails to reach the given test accuracy in a fixed number of iterations.\n\nWe observe that QSGD and ATOMO speed up model training significantly and achieve accuracy similar to vanilla mini-batch SGD. We also observe that the best performance is not achieved by the most sparsified or quantized method; rather, the optimal method lies somewhere in the middle, where enough information is preserved during sparsification. For instance, 8-bit QSGD converges faster than 4-bit QSGD, and spectral-ATOMO with sparsity budget 3 or 4 seems to be the fastest. Higher sparsity can lead to a faster running time, but extreme sparsification can adversely affect convergence. 
For example, for a fixed number of iterations, 1-bit QSGD has the smallest time cost, but may converge much more slowly to an accurate model.\n\n7 Conclusion\n\nIn this paper, we present and analyze ATOMO, a general sparsification method for distributed stochastic-gradient-based methods. ATOMO applies to any atomic decomposition, including the entry-wise decomposition and the SVD of a matrix. ATOMO generalizes 1-bit QSGD and TernGrad, and provably minimizes the variance of the sparsified gradient subject to a sparsity constraint on the atomic decomposition. We focus on the use of ATOMO for sparsifying matrices, especially the gradients in neural network training. We show that applying ATOMO to the singular values of these matrices can lead to faster training than both vanilla SGD and QSGD, for the same communication budget. We present extensive experiments showing that ATOMO can lead to up to a 2× speed-up in training time over QSGD and up to a 3× speed-up in training time over TernGrad.\nIn the future, we plan to explore the use of ATOMO with Fourier decompositions, due to their utility and prevalence in signal processing. More generally, we wish to investigate which atomic sets lead to reduced communication costs. We also plan to examine how we can sparsify and compress gradients in a joint fashion to further reduce communication costs. Finally, when sparsifying the SVD of a matrix, we only sparsify the singular values. 
We also note that it would be interesting to explore joint sparsification of the SVD and its singular vectors, which we leave for future work.\n\nTable 3 (speedup over vanilla SGD, ResNet-18 on CIFAR-10):\nTest accuracy | SVD s=1 | SVD s=2 | QSGD b=1 | QSGD b=2 | TernGrad\n60% | 3.06x | 3.51x | 2.19x | 2.31x | 1.45x\n63% | 3.67x | 3.6x | 1.88x | 2.22x | 1.65x\n65% | 3.01x | 3.6x | 1.46x | 2.21x | 2.19x\n68% | 2.36x | 2.78x | 1.15x | 2.01x | 1.77x\nTest accuracy | SVD s=3 | SVD s=4 | QSGD b=4 | QSGD b=8 | TernGrad\n65% | 2.63x | 1.84x | 2.62x | 1.79x | 2.19x\n71% | 2.81x | 2.04x | 1.81x | 2.62x | 1.22x\n75% | 2.01x | 1.79x | 1.41x | 1.78x | 1.18x\n78% | 1.81x | 1.8x | 1.67x | 1.73x | N/A\n\nTable 4 (speedup over vanilla SGD, ResNet-18 on SVHN):\nTest accuracy | SVD s=3 | SVD s=4 | QSGD b=4 | QSGD b=8 | TernGrad\n75% | 3.55x | 2.75x | 3.22x | 2.36x | 1.33x\n78% | 2.84x | 2.75x | 2.68x | 1.89x | 1.23x\n82% | 2.95x | 2.28x | 2.23x | 2.35x | 1.18x\n84% | 3.11x | 2.39x | 2.34x | 2.35x | 1.34x\n85% | 3.15x | 2.43x | 2.67x | 2.35x | 1.21x\n86% | 2.58x | 2.19x | 2.29x | 2.1x | N/A\n88% | 2.58x | 2.19x | 1.69x | 2.09x | N/A\n89% | 2.72x | 2.27x | 2.11x | 2.14x | N/A\n\nReferences\n[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.\n[2] Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274, 2015.\n[3] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.\n[4] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, et al. Large scale distributed deep networks. In Advances in Neural Information Processing Systems, pages 1223–1231, 2012.\n[5] Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu. 
1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Fifteenth Annual Conference of the International Speech Communication Association, 2014.

[6] Nikko Strom. Scalable distributed DNN training using commodity GPU cloud computing. In Sixteenth Annual Conference of the International Speech Communication Association, 2015.

[7] Hang Qi, Evan R. Sparks, and Ameet Talwalkar. Paleo: A performance model for deep neural networks. In Proceedings of the International Conference on Learning Representations, 2017.

[8] Demjan Grubic, Leo Tam, Dan Alistarh, and Ce Zhang. Synchronous multi-GPU deep learning with low-precision communication: An experimental study. 2018.

[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[10] Gao Huang, Zhuang Liu, Kilian Q Weinberger, and Laurens van der Maaten. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 1, page 3, 2017.

[11] H Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, et al. Communication-efficient learning of deep networks from decentralized data. arXiv preprint arXiv:1602.05629, 2016.

[12] Jakub Konečný, H Brendan McMahan, Felix X Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016.

[13] Christopher M De Sa, Ce Zhang, Kunle Olukotun, and Christopher Ré. Taming the wild: A unified analysis of Hogwild-style algorithms. In Advances in Neural Information Processing Systems, pages 2674–2682, 2015.

[14] Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic.
QSGD: Communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems, pages 1707–1718, 2017.

[15] Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.

[16] Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. TernGrad: Ternary gradients to reduce communication in distributed deep learning. In Advances in Neural Information Processing Systems, pages 1508–1518, 2017.

[17] Christopher De Sa, Matthew Feldman, Christopher Ré, and Kunle Olukotun. Understanding and optimizing asynchronous low-precision stochastic gradient descent. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pages 561–574. ACM, 2017.

[18] Hantian Zhang, Jerry Li, Kaan Kara, Dan Alistarh, Ji Liu, and Ce Zhang. ZipML: Training linear models with end-to-end low precision, and a little bit of deep learning. In International Conference on Machine Learning, pages 4035–4043, 2017.

[19] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525–542. Springer, 2016.

[20] Christopher De Sa, Megan Leszczynski, Jian Zhang, Alana Marzoev, Christopher R Aberger, Kunle Olukotun, and Christopher Ré. High-accuracy low-precision training. arXiv preprint arXiv:1803.03383, 2018.

[21] Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, and Anima Anandkumar. signSGD: Compressed optimisation for non-convex problems. arXiv preprint arXiv:1802.04434, 2018.

[22] Horia Mania, Xinghao Pan, Dimitris Papailiopoulos, Benjamin Recht, Kannan Ramchandran, and Michael I Jordan.
Perturbed iterate analysis for asynchronous stochastic optimization. arXiv preprint arXiv:1507.06970, 2015.

[23] Rémi Leblond, Fabian Pedregosa, and Simon Lacoste-Julien. ASAGA: Asynchronous parallel SAGA. arXiv preprint arXiv:1606.04809, 2016.

[24] Alham Fikri Aji and Kenneth Heafield. Sparse communication for distributed gradient descent. arXiv preprint arXiv:1704.05021, 2017.

[25] Yujun Lin, Song Han, Huizi Mao, Yu Wang, and William J Dally. Deep gradient compression: Reducing the communication bandwidth for distributed training. arXiv preprint arXiv:1712.01887, 2017.

[26] Chia-Yu Chen, Jungwook Choi, Daniel Brand, Ankur Agrawal, Wei Zhang, and Kailash Gopalakrishnan. AdaComp: Adaptive residual gradient compression for data-parallel distributed training. arXiv preprint arXiv:1712.02679, 2017.

[27] Cèdric Renggli, Dan Alistarh, and Torsten Hoefler. SparCML: High-performance sparse communication for machine learning. arXiv preprint arXiv:1802.08021, 2018.

[28] Yusuke Tsuzuku, Hiroto Imachi, and Takuya Akiba. Variance-based gradient compression for efficient distributed deep learning. arXiv preprint arXiv:1802.06058, 2018.

[29] Jakub Konečný and Peter Richtárik. Randomized distributed mean estimation: Accuracy vs communication. arXiv preprint arXiv:1611.07555, 2016.

[30] Ananda Theertha Suresh, Felix X Yu, Sanjiv Kumar, and H Brendan McMahan. Distributed mean estimation with limited communication. arXiv preprint arXiv:1611.00429, 2016.

[31] R Gitlin, J Mazo, and M Taylor. On the design of gradient algorithms for digitally implemented adaptive filters. IEEE Transactions on Circuit Theory, 20(2):125–136, 1973.

[32] S Alexander. Transient weight misadjustment properties for the finite precision LMS algorithm. IEEE Transactions on Acoustics, Speech, and Signal Processing, 35(9):1250–1258, 1987.

[33] José Carlos M Bermudez and Neil J Bershad.
A nonlinear analytical model for the quantized LMS algorithm-the arbitrary step size case. IEEE Transactions on Signal Processing, 44(5):1175–1183, 1996.

[34] Jianqiao Wangni, Jialei Wang, Ji Liu, and Tong Zhang. Gradient sparsification for communication-efficient distributed optimization. arXiv preprint arXiv:1710.09854, 2017.

[35] Jian Xue, Jinyu Li, and Yifan Gong. Restructuring of deep neural network acoustic models with singular value decomposition. In Interspeech, pages 2365–2369, 2013.

[36] Tara N Sainath, Brian Kingsbury, Vikas Sindhwani, Ebru Arisoy, and Bhuvana Ramabhadran. Low-rank matrix factorization for deep neural network training with high-dimensional output targets. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 6655–6659. IEEE, 2013.

[37] Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866, 2014.

[38] Simon Wiesler, Alexander Richard, Ralf Schlüter, and Hermann Ney. Mean-normalized stochastic gradient for large-scale deep learning. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pages 180–184. IEEE, 2014.

[39] Andrew Cotter, Ohad Shamir, Nati Srebro, and Karthik Sridharan. Better mini-batch algorithms via accelerated gradient methods. In Advances in Neural Information Processing Systems, pages 1647–1655, 2011.

[40] Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.

[41] Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 693–701, 2011.

[42] Sébastien Bubeck et al.
Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3–4):231–357, 2015.

[43] Christopher De Sa, Christopher Re, and Kunle Olukotun. Global convergence of stochastic gradient descent for some non-convex matrix problems. In International Conference on Machine Learning, pages 2332–2341, 2015.

[44] Sashank J Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poczos, and Alex Smola. Stochastic variance reduction for nonconvex optimization. In International Conference on Machine Learning, pages 314–323, 2016.

[45] Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 795–811. Springer, 2016.

[46] Soham De, Abhay Yadav, David Jacobs, and Tom Goldstein. Big Batch SGD: Automated inference using adaptive batch sizes. arXiv preprint arXiv:1610.05792, 2016.

[47] Dong Yin, Ashwin Pananjady, Max Lam, Dimitris Papailiopoulos, Kannan Ramchandran, and Peter Bartlett. Gradient diversity: A key ingredient for scalable distributed learning. In International Conference on Artificial Intelligence and Statistics, pages 1998–2007, 2018.

[48] Venkat Chandrasekaran, Benjamin Recht, Pablo A Parrilo, and Alan S Willsky. The convex geometry of linear inverse problems. Foundations of Computational Mathematics, 12(6):805–849, 2012.

[49] Lisandro D Dalcin, Rodrigo R Paz, Pablo A Kler, and Alejandro Cosimo. Parallel distributed computing using Python. Advances in Water Resources, 34(9):1124–1139, 2011.

[50] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.

[51] Karen Simonyan and Andrew Zisserman.
Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[52] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.

[53] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, volume 2011, page 5, 2011.