{"title": "PowerSGD: Practical Low-Rank Gradient Compression for Distributed Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 14259, "page_last": 14268, "abstract": "We study gradient compression methods to alleviate the communication bottleneck in data-parallel distributed optimization. Despite the significant attention received, current compression schemes either do not scale well, or fail to achieve the target test accuracy. We propose a low-rank gradient compressor that can i) compress gradients rapidly, ii) efficiently aggregate the compressed gradients using all-reduce, and iii) achieve test performance on par with SGD. The proposed algorithm is the only method evaluated that achieves consistent wall-clock speedups when benchmarked against regular SGD with an optimized communication backend. We demonstrate reduced training times for convolutional networks as well as LSTMs on common datasets.", "full_text": "PowerSGD: Practical Low-Rank\n\nGradient Compression for Distributed Optimization\n\nThijs Vogels\n\nEPFL\n\nLausanne, Switzerland\n\nthijs.vogels@epfl.ch\n\nSai Praneeth Karimireddy\n\nEPFL\n\nLausanne, Switzerland\n\nsai.karimrieddy@epfl.ch\n\nMartin Jaggi\n\nEPFL\n\nLausanne, Switzerland\n\nmartin.jaggi@epfl.ch\n\nAbstract\n\nWe study lossy gradient compression methods to alleviate the communication bot-\ntleneck in data-parallel distributed optimization. Despite the signi\ufb01cant attention\nreceived, current compression schemes either do not scale well, or fail to achieve\nthe target test accuracy. We propose a new low-rank gradient compressor based\non power iteration that can i) compress gradients rapidly, ii) ef\ufb01ciently aggregate\nthe compressed gradients using all-reduce, and iii) achieve test performance on par\nwith SGD. 
The proposed algorithm is the only method evaluated that achieves con-\nsistent wall-clock speedups when benchmarked against regular SGD using highly\noptimized off-the-shelf tools for distributed communication. We demonstrate re-\nduced training times for convolutional networks as well as LSTMs on common\ndatasets. Our code is available at https://github.com/epfml/powersgd.\n\n1\n\nIntroduction\n\nSynchronous data-parallel SGD is the most common method for accelerating training of deep learning\nmodels (Dean et al., 2012; Iandola et al., 2015; Goyal et al., 2017). Because the gradient vectors\nof such models can be large, the time required to share those gradients across workers limits the\nscalability of deep learning training (Seide et al., 2014; Iandola et al., 2015; Lin et al., 2018).\nPrevious work proposes lossy gradient compression as a solution to this issue. Notable examples\ninclude replacing the coordinates of the gradient with only their sign (Seide et al., 2014; Carlson et al.,\n2015; Bernstein et al., 2018, 2019; Karimireddy et al., 2019), quantizing the individual coordinates\n(Alistarh et al., 2017; Wen et al., 2017), and low-rank approximation of the gradient (Wang et al.,\n2018). While these works demonstrate speedups over full-precision SGD in some settings, we \ufb01nd\nthat their speedups vanish with a fast network and highly optimized communication backend, even on\ncommodity hardware. Some prior work also suffers from degraded test accuracy compared to SGD.\nWe combine three observations to \ufb01x these issues: i) Linear compressor operators achieve scalability\nby enabling aggregation using all-reduce. ii) Error feedback ensures convergence with general biased\ncompressors. iii) Low-rank updates enable aggressive compression without sacri\ufb01cing quality.\nFirst, we explore the properties of various gradient compression schemes for SGD and identify\nwhich ones are crucial for high scalability. 
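Observation (i), that linear compressors allow all-reduce aggregation, can be made concrete with a small numpy sketch (our illustration, not from the paper's released code): a sign compressor does not commute with addition, while a fixed linear map does, and commuting with addition is exactly what lets partial sums be combined inside the network.

```python
import numpy as np

rng = np.random.default_rng(0)
g1, g2 = rng.standard_normal(6), rng.standard_normal(6)

# Sign compression is not linear: the sum of two compressed messages takes
# values in {-2, 0, 2}, so it is not itself a sign vector. Workers therefore
# cannot pre-aggregate messages hierarchically.
assert not np.array_equal(np.sign(g1) + np.sign(g2), np.sign(g1 + g2))

# A fixed linear map (here: multiplication by a shared random matrix U)
# commutes with addition, so partial sums can be reduced hierarchically,
# which is what all-reduce exploits.
U = rng.standard_normal((6, 2))
assert np.allclose(g1 @ U + g2 @ U, (g1 + g2) @ U)
```

The names `g1`, `g2`, and `U` are illustrative placeholders, not part of any published implementation.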
In particular, we note that currently proposed gradient compressors are not linear: their compressed messages cannot be added up hierarchically, unlike raw gradients. This prevents current compressed SGD algorithms from aggregating gradients with an efficient reduce operation and instead requires a gather operation. Current deep learning frameworks rely either solely or predominantly on all-reduce, which is key to why regular SGD scales well with fast communication hardware (cf. Awan et al., 2018; Panda et al., 2019).

Secondly, it was recently shown that using error feedback (i.e. storing the difference between the computed and compressed gradient, and reinserting it at the next iteration) improves both convergence and generalization for compression schemes (Karimireddy et al., 2019). This can enable general biased gradient compression schemes to reach the target test accuracy.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: Compression schemes compared in this paper. Left: Interpretation of a layer's gradient as a matrix (input neurons × output neurons). Right: The compressed gradients output by the various schemes: Random Block, Random K, Sign, Sign + Norm, Top K, and Low-rank (ours). Implementation details in Appendix G.

Thirdly, there is growing evidence that the generalization ability of modern over-parameterized deep learning models is related to low-rankedness (Arora et al., 2018; Martin & Mahoney, 2018; Collins et al., 2018). Using a low-rank update (as we do) can be viewed as implicitly performing spectral regularization (Gunasekar et al., 2018) and hence can be expected to have good generalization properties (Yoshida & Miyato, 2017). Further, Wang et al.
(2018) show that the eigenspectrum of the stochastic gradients for deep learning models decays, suggesting that rank-based schemes can get away with aggressive compression without sacrificing convergence.

In this work, we design POWERSGD with the above observations in mind. POWERSGD computes a low-rank approximation of the gradient using a generalized power iteration (known as subspace iteration (Stewart & Miller, 1975)). The approximation is computationally light-weight, avoiding any prohibitively expensive Singular Value Decomposition. To improve the quality of the efficient approximation, we warm-start the power iteration by reusing the approximation from the previous optimization step. Using all-reduce gradient aggregation, we empirically demonstrate that POWERSGD achieves wall-clock speedups over regular SGD in a 16-GPU setting, even with the optimized NCCL communication backend on a fast network (and is the only algorithm to do so). By compressing gradients more than 120×, we reduce communication time (including coding and decoding) by 54% for RESNET18 on CIFAR10 and by 90% for an LSTM on WIKITEXT-2. End-to-end wall-clock training time to full test quality is reduced by 24% for RESNET18 and by 55% for the LSTM.

2 Related work

Gradient compression A variety of compression schemes (Figure 1) have been proposed: Alistarh et al. (2017) and Wen et al. (2017) quantize each gradient coordinate; Seide et al. (2014); Carlson et al. (2015); Bernstein et al. (2018, 2019) and Karimireddy et al. (2019) replace each coordinate of the gradient with its sign; Lin et al. (2018); Stich et al. (2018) and Wangni et al. (2018) use the largest few coordinates; and Konečný et al. (2016) and Wang et al. (2018) use a low-rank approximation. Spectral Atomo by Wang et al. (2018) is perhaps the closest to our work. It performs importance sampling of the gradient's singular vectors and is an unbiased compression scheme.
It requires, however, a full Singular Value Decomposition every iteration and is hence computationally impractical.

Commutative compression and addition Yu et al. (2018) stress that commutability of compression with gradient addition enables efficient aggregation with ring all-reduce. Most compressors, however, lack this property. Yu et al. utilize temporally-consistent correlations between gradient coordinates to compress them linearly. POWERSGD has a similar property that we call 'linearity'.

Error feedback First introduced in (Seide et al., 2014) and analyzed in (Stich et al., 2018) for the convex case, error feedback involves computing the difference between a worker's gradient and the compressed gradient (i.e. the error) and adding it back to the next gradient (the feedback). Karimireddy et al. (2019) and Stich & Karimireddy (2019) further develop and generalize the framework of error feedback with improved rates. In the non-convex setting, Karimireddy et al. (2019) show that error feedback is crucial both for convergence and generalization when using biased compressors (e.g. sign or top-K). In general, biased compression schemes equipped with error feedback tend to out-perform their unbiased counterparts. The practical algorithm by Lin et al. (2018) can also be seen as an approximate top-K compressor with error feedback.

Low-rank methods Recent works argue that in modern over-parameterized deep networks, the final model learnt has a 'low stable rank' (Martin & Mahoney, 2018; Li et al., 2018). This can partially explain their impressive generalization properties despite being substantially overparameterized (Arora et al., 2018). Adding explicit spectral regularization has been shown to further improve the performance of such models (Mazumder et al., 2010; Yoshida & Miyato, 2017). Using a low-rank update (as we do) can be viewed as implicitly performing a similar regularization (Gunasekar et al., 2018).
If the target matrices are known to be exactly low-ranked (instead of just low stable rank), Yurtsever et al. (2017) show that it is sometimes possible to converge to the optima using low-rank approximations of the gradients without the need for error feedback.

3 Method

In data-parallel optimization of machine learning models, a number of W workers share the same model parameters x ∈ R^d. They iteratively update x by computing independent stochastic gradients, aggregating these gradients by averaging¹, and updating the model parameters based on this aggregate.

Algorithm 1 Rank-r POWERSGD compression
1: The update vector Δw is treated as a list of tensors corresponding to individual model parameters. Vector-shaped parameters (biases) are aggregated uncompressed. Other parameters are reshaped into matrices. The functions below operate on such matrices independently. For each matrix M ∈ R^{n×m}, a corresponding Q ∈ R^{m×r} is initialized from an i.i.d. standard normal distribution.
2: function COMPRESS+AGGREGATE(update matrix M ∈ R^{n×m}, previous Q ∈ R^{m×r})
3:   P ← MQ
4:   P ← ALL_REDUCE_MEAN(P)        ▷ Now, P = 1/W (M1 + ... + MW) Q
5:   P̂ ← ORTHOGONALIZE(P)         ▷ Orthonormal columns
6:   Q ← M⊤P̂
7:   Q ← ALL_REDUCE_MEAN(Q)        ▷ Now, Q = 1/W (M1 + ... + MW)⊤ P̂
8:   return the compressed representation (P̂, Q).
9: end function
10: function DECOMPRESS(P̂ ∈ R^{n×r}, Q ∈ R^{m×r})
11:   return P̂Q⊤
12: end function

POWERSGD compression We approximate each layer in the model independently. The parameters of fully-connected layers (dense matrix multiplication) and their gradients have an inherent matrix structure. The parameters of convolutional layers can be naturally interpreted as fully-connected layers applied repeatedly over a 2D grid of inputs.
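Algorithm 1 can be sketched for a single worker in a few lines of numpy (our illustration; it substitutes numpy's QR factorization for the paper's Gram-Schmidt orthogonalization and omits the all-reduce averaging steps):

```python
import numpy as np

def orthogonalize(p):
    # QR yields orthonormal columns, matching the effect of Gram-Schmidt
    # on these thin (few-column) matrices.
    q_factor, _ = np.linalg.qr(p)
    return q_factor

def compress(m, q):
    """One step of rank-r subspace iteration on a single worker.
    In the distributed algorithm, P and Q are all-reduce-averaged."""
    p_hat = orthogonalize(m @ q)   # P <- M Q, then orthonormalize
    q = m.T @ p_hat                # Q <- M^T P_hat
    return p_hat, q                # compressed representation

def decompress(p_hat, q):
    return p_hat @ q.T

rng = np.random.default_rng(1)
m = rng.standard_normal((256, 128))   # a reshaped layer gradient (shape is hypothetical)
q = rng.standard_normal((128, 4))     # rank 4; warm-started across steps in PowerSGD
p_hat, q = compress(m, q)
approx = decompress(p_hat, q)

# The factors hold r*(n+m) numbers instead of n*m, and the projection
# strictly reduces the residual norm.
assert np.linalg.norm(m - approx) < np.linalg.norm(m)
```

Note that `decompress(compress(...))` equals the orthogonal projection of M onto the span of P̂, which is why a single step already gives a usable approximation.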
Practically, this amounts to flattening input and kernel dimensions in the 4D gradient tensors. Neural networks also contain bias vectors, but these typically constitute a tiny fraction of the parameter space and can be aggregated uncompressed.

For each parameter's gradient M ∈ R^{n×m}, the aim of rank-r matrix approximation is to find matrices P ∈ R^{n×r} and Q ∈ R^{m×r} such that PQ⊤ approximates M well. POWERSGD uses a single step of subspace iteration (power iteration generalized to r > 1) to compute such an approximation. This involves performing one right multiplication, one left multiplication, and an orthogonalization. We use the Gram-Schmidt procedure to orthogonalize our matrices since they have very few columns (1–4); this orthogonalization is the most expensive part of the compression procedure. Further, we 'warm-start' the subspace iteration by reusing the approximation computed at the previous step. With the inclusion of warm-start, a single step of subspace iteration yields a factorization M ≈ PQ⊤ with the same performance as the best rank-r approximation from an expensive Singular Value Decomposition.

Efficient aggregation between workers In data-parallel optimization, we want to approximate the average of the workers' gradients. Suppose POWERSGD operates on a list of corresponding gradients [M1 . . . MW] from W workers. Both occurrences of M in the algorithm are a (linear) matrix multiplication followed by a (linear) mean reduction over workers. This introduces a practical invariance: execution on 1 worker with batch size B × W is equivalent to execution on W workers with batch size B each. We call this property 'linearity'. Refer to Appendix A.3 for more details.

¹Bernstein et al.
(2019) propose Signum, which aggregates 1-bit gradients by majority voting instead of averaging.

An important benefit of POWERSGD's linearity is that it can be implemented using the all-reduce protocol as opposed to needing a gather operation. To illustrate the difference, suppose that we want to compute the sum M1 + ... + MW of W matrices for W = 4. The all-reduce method can use associativity of addition to rewrite the computation as (M1 + M2) + (M3 + M4). This enables a divide-and-conquer approach and allows the summation task to be split over multiple workers, as illustrated in the accompanying diagram contrasting (a) Gather with (b) Reduce. With W workers, both the computation and the communication time scale as O(log W) for all-reduce, compared to O(W) for all-gather.

In addition to improved scaling, all-reduce communication is preferred over a parameter-server setting because it avoids double compression. With a parameter server, both the 'clients → server' and 'server → clients' communication have to be compressed (Caldas et al., 2018; Bernstein et al., 2019; Seide et al., 2014). We avoid this by merging compression and aggregation into one step.

Error-feedback SGD Since the POWERSGD scheme is biased (i.e. compressing and decompressing a random gradient does not yield the original in expectation), we use error feedback (Seide et al., 2014; Karimireddy et al., 2019). Our version of error feedback (Algorithm 2) extends the original by introducing post-compression momentum. This simple extension allows us to reuse the same learning rate and hyper-parameters as those tuned for SGD with momentum.

Algorithm 2 Distributed Error-feedback SGD with Momentum
1: hyperparameters: learning rate γ, momentum parameter λ
2: initialize model parameters x ∈ R^d, momentum m ← 0 ∈ R^d, replicated across workers
3: at each worker w = 1, . . . , W do
4:   initialize memory ew ← 0 ∈ R^d
5:   for each iterate t = 0, . . . do
6:     Compute a stochastic gradient gw ∈ R^d.
7:     Δw ← gw + ew                          ▷ Incorporate error-feedback into update
8:     C(Δw) ← COMPRESS(Δw)
9:     ew ← Δw − DECOMPRESS(C(Δw))          ▷ Memorize local errors
10:    C(Δ) ← AGGREGATE(C(Δ1), . . . , C(ΔW))  ▷ Exchange gradients
11:    Δ′ ← DECOMPRESS(C(Δ))                ▷ Reconstruct an update ∈ R^d
12:    m ← λm + Δ′
13:    x ← x − γ (Δ′ + m)
14:  end for
15: end at

4 Analysis of POWERSGD

In this section, we consider different aspects of POWERSGD in isolation and hope to empirically understand: i) the effect of using error feedback, ii) the effect of 'warm-start', and iii) the trade-off between test accuracy and compression rate with varying approximation rank.

4.1 Effect of error feedback

Using error-feedback SGD as a base algorithm for POWERSGD has two advantages. First, it enables our use of a biased compressor. Secondly, EF-SGD improves convergence and obtains better test accuracy (Karimireddy et al., 2019).

To illustrate the improved test accuracy, we compare POWERSGD, a biased compressor with error feedback, against an unbiased low-rank approximation. To approximate a matrix M ∈ R^{n×m}, the unbiased rank-r approximator samples a random matrix U ∈ R^{m×r} such that E[UU⊤] = I_m and outputs (MU, U) as the low-rank approximation. This scheme is unbiased since

E[(MU)U⊤] = M E[UU⊤] = M I = M.

POWERSGD is the natural biased counterpart of this unbiased scheme.
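The role of the error memory in Algorithm 2 can be seen in a toy sketch (ours; the top-1 compressor below is a hypothetical stand-in for any biased compressor, not PowerSGD itself): whatever the compressor drops in one step is re-inserted later, so nothing is lost over time.

```python
import numpy as np

def compress_top1(delta):
    """Toy biased compressor: transmit only the largest-magnitude coordinate."""
    out = np.zeros_like(delta)
    i = np.argmax(np.abs(delta))
    out[i] = delta[i]
    return out

rng = np.random.default_rng(2)
error = np.zeros(8)      # the memory e_w of Algorithm 2
sent = np.zeros(8)       # running sum of transmitted updates
true_sum = np.zeros(8)   # running sum of the raw gradients

for _ in range(200):
    g = rng.standard_normal(8)
    delta = g + error                 # feedback: re-insert the stored residual
    c = compress_top1(delta)
    error = delta - c                 # memorize what the compressor dropped
    sent += c
    true_sum += g

# Telescoping identity: transmitted total + current residual equals the
# sum of the uncompressed gradients.
assert np.allclose(sent + error, true_sum)
```

The identity holds for any compressor plugged into this loop, which is what makes error feedback a generic fix for biased compression.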
Table 1 demonstrates that our biased approximator with error feedback outperforms the unbiased operator on image classification.

Table 1: Rank-based compression with and without error feedback. The biased POWERSGD outperforms an unbiased linear rank-r compressor on test accuracy.

Algorithm        | Data/epoch | Test accuracy
SGD              | 1023 MB    | 94.3%
Rank-1 POWERSGD  | 4 MB       | 93.6%
Rank-2 POWERSGD  | 8 MB       | 94.4%
Unbiased Rank 1  | 3 MB       | 71.2%
Unbiased Rank 2  | 4 MB       | 75.9%

Table 2: Best rank-2 approximation vs. POWERSGD. Warm-start improves test accuracy, even matching the performance of the best rank-2 approximation.

Algorithm            | Test accuracy
Best approximation   | 94.4%
Warm start (default) | 94.4%
Without warm start   | 94.0%

Table 3: POWERSGD with varying rank. With sufficient rank, POWERSGD accelerates training of a RESNET18 and an LSTM by reducing communication, achieving test quality on par with regular SGD in the same number of iterations. The time per batch includes the forward/backward pass (constant). See Section 5 for the experimental setup.

Image classification: RESNET18 on CIFAR10
Algorithm | Test accuracy | Data sent per epoch | Time per batch
SGD       | 94.3%         | 1023 MB (1×)        | 312 ms   +0%
Rank 1    | 93.6%         | 4 MB (243×)         | 229 ms  −26%
Rank 2    | 94.4%         | 8 MB (136×)         | 239 ms  −23%
Rank 4    | 94.5%         | 14 MB (72×)         | 260 ms  −16%

Language modeling: LSTM on WIKITEXT-2
Algorithm | Test perplexity | Data sent per epoch | Time per batch
SGD       | 91              | 7730 MB (1×)        | 300 ms   +0%
Rank 1    | 102             | 25 MB (310×)        | 131 ms  −56%
Rank 2    | 93              | 38 MB (203×)        | 141 ms  −53%
Rank 4    | 91              | 64 MB (120×)        | 134 ms  −55%

4.2 Effect of warm-start

POWERSGD does not compute the best rank-r approximation of a gradient matrix, but uses a cheaper, low-fidelity approximation based on power iteration.
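As a numerical aside (our sketch, not the paper's code): on a matrix that stays fixed, repeatedly reusing the previous factors, as warm-starting does, drives the cheap single power step to the optimal rank-r approximation given by a truncated SVD (the Eckart-Young optimum), since subspace iteration converges whenever the r-th singular value exceeds the (r+1)-th.

```python
import numpy as np

rng = np.random.default_rng(3)
# A fixed matrix with a decaying spectrum, loosely mimicking gradient matrices.
u, _ = np.linalg.qr(rng.standard_normal((64, 32)))
v, _ = np.linalg.qr(rng.standard_normal((32, 32)))
s = 2.0 ** -np.arange(32)          # singular values 1, 1/2, 1/4, ...
m = (u * s) @ v.T

q = rng.standard_normal((32, 2))   # rank 2
for _ in range(30):                # warm start: reuse Q from the last step
    p, _ = np.linalg.qr(m @ q)     # orthogonalize, as in Algorithm 1
    q = m.T @ p

approx = p @ q.T
# Optimal rank-2 approximation via truncated SVD.
uu, ss, vt = np.linalg.svd(m, full_matrices=False)
best = (uu[:, :2] * ss[:2]) @ vt[:2]
assert np.linalg.norm(approx - best) < 1e-6
```

With the spectral gap of 2 used here, the subspace error shrinks by roughly that factor per step, so 30 warm-started steps are far more than enough; a gradient matrix that drifts slowly between steps behaves similarly in expectation.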
Comparing the time per batch of POWERSGD and Spectral Atomo in Table 6, we see the importance of avoiding a Singular Value Decomposition. With gradients shaped as in POWERSGD, computing the SVD of a stochastic gradient takes 673 ms, the equivalent of computing 6 mini-batch gradients. In contrast, one full step of rank-2 POWERSGD, including communication between 16 workers, takes only 105 ms.

Given that we only use a single step of power iteration, the quality of the approximation suffers; compare the test accuracy of 'without warm start' and 'best approximation' in Table 2. A key feature of POWERSGD is the warm-start strategy, which reuses previously computed matrix approximations to initialize the power iteration algorithm. If the matrix on which we perform power iteration remains constant, then this recovers the best rank-r approximation (see Theorem I in the Appendix). We argue that this strategy sometimes makes sense even if the underlying matrices are varying.

Suppose we approximate the sequence of gradient matrices {Mt} at timesteps t. At timestep t, we leverage the previous factorization Mt−1 ≈ Pt−1 Qt−1⊤. If Mt ≈ Mt−1, then we would benefit from reusing Pt−1 and Qt−1 as our starting point. While this is unlikely to be true exactly, if Mt and Mt−1 are stochastic approximations of the full gradient, we can expect that E[Mt] ≈ E[Mt−1] since the function is smooth and we only take small update steps. The result is akin to Oja's algorithm for stochastic power iteration (Oja, 1982), and hence could result in an improved approximation quality.

As we show empirically in Table 2, this 'warm starting' strategy is sufficient to close the gap in test accuracy between POWERSGD and the much more expensive best rank-r approximation.

4.3 Effect of varying the rank

POWERSGD allows users to choose the rank of its gradient approximations.
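The communication saving behind this rank trade-off is simple arithmetic: a rank-r factorization sends r(n + m) numbers instead of the n·m entries of the full gradient matrix. A sketch with hypothetical layer shapes:

```python
def compression_ratio(n, m, r):
    """Uncompressed entries (n*m) vs. entries in the rank-r factors,
    r*(n+m) for P-hat and Q together."""
    return (n * m) / (r * (n + m))

# A hypothetical 512 x 512 fully-connected gradient at rank 2:
assert compression_ratio(512, 512, 2) == 128.0
# Doubling the rank halves the ratio, trading bandwidth for fidelity.
assert compression_ratio(512, 512, 4) == 64.0
```

The per-layer ratios reported in the experiments also fold in the uncompressed bias vectors, so whole-model figures differ somewhat from this idealized formula.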
The trade-off between approximation quality and compression, decompression and transfer cost is explored in Table 3. In both the image classification and language modeling tasks we explore, the test quality achieved by POWERSGD grows with increasing rank. In both cases, it reaches a quality that is as good as, or even slightly better than, regular SGD.

Table 4: Comparing different compression operators for Error-feedback SGD in a unified setting; running 300 epochs of Error-feedback SGD with Momentum (Algorithm 2) with a learning rate tuned for full-precision SGD on 16 GPUs for CIFAR10. Note that the variations of POWERSGD with ranks 2 and 7 strike the best balance between the achieved test accuracy and time per batch (total time for forward, backward, compression, decompression, and gradient aggregation).

Compression | Operator       | Test accuracy | Sent/epoch | All-reduce | Time/batch
None        | No compression | 94.3%         | 1023 MB    | yes        | 312 ms
Medium      | Rank 7         | 94.6%         | 24 MB      | yes        | 285 ms
Medium      | Random Block   | 93.3%         | 24 MB      | yes        | 243 ms
Medium      | Random K       | 94.0%         | 24 MB      | yes        | 540 ms
Medium      | Sign+Norm      | 93.9%         | 32 MB      | no         | 429 ms
Medium      | Top K          | 94.4%         | 32 MB      | no         | 444 ms
High        | Rank 2         | 94.4%         | 8 MB       | yes        | 239 ms
High        | Random Block   | 87.8%         | 8 MB       | yes        | 240 ms
High        | Random K       | 92.6%         | 8 MB       | yes        | 534 ms
High        | Top K          | 93.6%         | 8 MB       | no         | 411 ms

5 Results

This section demonstrates the practicality of POWERSGD for distributed optimization of deep neural networks. We show that the compression scheme of POWERSGD i) is fast and matches test performance of SGD, ii) scales well with increasing workers even with a sub-optimal communication backend, and iii) significantly reduces training time for larger models. Most of the analysis is performed on CIFAR10, in the setting described in the table below. We verify the generality of POWERSGD by an additional evaluation of an LSTM for language modeling on WIKITEXT-2. We use 16 GPUs on 8 machines, connected through a fast (10 Gbit/s) network. To obtain meaningful timings, we have aimed to optimize all compared optimizers to a similar level. We provide a list of our performance optimizations in Appendix H. Throughout these results, we tune the learning rate for full-precision SGD, and use the same parameters for POWERSGD and other compression algorithms that use error feedback with momentum. Learning rates for the compared-to Spectral Atomo (Wang et al., 2018) and Signum (Bernstein et al., 2019) were tuned separately, cf. Appendix I.

Default experimental setting
Dataset           | CIFAR10
Architecture      | RESNET18
Number of workers | 16
Backend           | NCCL (fastest in PYTORCH)
Batch size        | 128 × number of workers
Momentum          | 0.9
Learning rate     | Tuned for 16 workers (0.1 × 16 for SGD). Scaled linearly by the number of workers
LR decay          | /10 at epoch 150 and 250
LR warmup         | Linearly within 5 epochs, starting from the single-worker LR
# Epochs          | 300
Weight decay      | 10⁻⁴, 0 for BatchNorm parameters
Repetitions       | 3, with varying seeds
Error bars        | min–max

5.1 Comparison with other compressors

Error feedback in compressed optimization enables the use of a multitude of compression schemes, including biased ones. The potential compression operators illustrated in Figure 1 are compared in Table 4. We evaluate compressors based on the test accuracy achieved and the total time taken to process one mini-batch. The former is a holistic measure of the accuracy of the compression operator, and the latter is the net time required for a forward pass, backward pass, gradient compression and decompression, and gradient communication. We study two compression regimes: medium and high. At around 32× compression, achieved by sign-based methods, all compression schemes (other than Random Block) achieve test accuracy close to full-precision SGD.
This implies that all schemes in this regime (other than Random Block) obtain a good-enough compression quality. At high compression (128×), POWERSGD particularly stands out as the only method to achieve the target test accuracy.

In both the medium and high compression settings, the only schemes to be faster than full-precision SGD are POWERSGD and Random Block. Note that both are simple linear schemes and hence support all-reduce. While Random K also supports all-reduce, the overhead for random memory access during both the compression and decompression stages is substantial, making it slower overall than SGD. Thus, on modern GPU-enabled infrastructure, POWERSGD, which relies on matrix multiplication, is faster and much more accurate than the other compression schemes.

Table 5: Breakdown of time spent (in seconds) in one iteration of RESNET18 training into forward pass, backward pass, gradient exchange, and encoding and decoding, for Rank 2, SGD, and Signum with 2, 4, 8, and 16 workers. Because POWERSGD (Rank 2) uses all-reduce, time spent encoding/decoding gradients is constant. (Per-phase bar chart; individual timings not reproduced in this extraction.)

Figure 3: Scaling of POWERSGD on CIFAR10 compared to full-precision SGD and Signum (Bernstein et al., 2019) on two communication backends, plotting speedup over 1-worker SGD against the number of workers. The batch size increases linearly with the number of workers. We compare training time for one epoch to 1-worker SGD. Note that the faster NCCL backend used throughout benefits the baselines more than our method.

5.2 Scalability of POWERSGD

Here we investigate how POWERSGD scales with an increasing number of workers, shedding light on what we can expect if we use a significantly larger number of workers. Additionally, we investigate how these results depend on the choice of communication backend. We benchmark POWERSGD against SGD and Signum (signSGD with majority vote) from Bernstein et al.
(2019), which we believe is the current state-of-the-art among distributed compressed-gradient algorithms.

Table 5 provides a detailed breakdown of the time spent for each mini-batch (i.e. one step) into the forward pass, backward pass, gradient exchange (communication), and compression/decompression. The time spent in the forward and backward pass is constant across all algorithms and numbers of workers. Since both SGD and POWERSGD use all-reduce, the gradient communication time (solid green in Table 5) scales gracefully with increasing number of workers. Signum, which uses all-gather instead of all-reduce, has a steeper increase. It has comparable time to POWERSGD for 4 workers but becomes more expensive for 16 workers.

There is another, more subtle, consequence of all-reduce vs. all-gather on the decoding times. In all-reduce, the aggregation step and the communication step happen simultaneously. Each worker receives a pre-aggregated gradient, making the cost of decompression independent of the number of workers. On the other hand, in all-gather, a worker receives W compressed gradients that need to be individually decompressed and aggregated (either using majority vote or averaging). The time for decompression with all-gather therefore scales linearly with the number of workers. This is visible when comparing the hatched regions in Table 5. This observation speaks to the importance of the reduce operation for scalability.

We next study two different backends: the more optimized NCCL and the slower GLOO. All three methods scale reasonably well with the optimized NCCL backend, although Signum has a slope less than 1 in the log-log plot, indicating sub-linear scaling.
On the slower GLOO backend, POWERSGD is notably the only method that retains excellent scaling due to its high compression rate.

Table 6: Results on CIFAR10. Contrary to rank-2 Spectral Atomo (Wang et al., 2018) and Signum (Bernstein et al., 2019), POWERSGD achieves the same test accuracy as full-precision SGD within the default epoch budget.

Algorithm | Test accuracy | Data/epoch | Time per batch
SGD       | 94.3%         | 1023 MB    | 312 ms   +0%
Atomo     | 92.6%         | 113 MB     | 948 ms +204%
Signum    | 93.6%         | 32 MB      | 301 ms   −3%
Rank 2    | 94.4%         | 8 MB       | 239 ms  −23%

Table 7: In language modeling, rank-4 POWERSGD achieves the target test accuracy and provides a significant speedup over SGD.

Algorithm | Test perplexity | Data/epoch | Time per batch
SGD       | 91              | 7730 MB    | 300 ms   +0%
Signum    | 142             | 242 MB     | 424 ms  +41%
Rank 4    | 91              | 64 MB      | 134 ms  −55%

5.3 Other tasks and methods

In Table 6, we compare POWERSGD against the state-of-the-art compressed optimization algorithms Signum and Spectral Atomo. The cost of performing a full SVD at each step renders Spectral Atomo impractical in a high-performance setting, especially considering that it fails to match the test accuracies of the other methods. Signum performs much better, providing a minor speedup over SGD. POWERSGD is the fastest and most accurate of the compared methods.

The advantage of POWERSGD truly shows when using really large models, i.e. where the communication actually becomes a bottleneck. To verify this, we run Signum, full-precision SGD, and POWERSGD to train an LSTM on a language modeling task which has a substantially larger model size than RESNET18 (see Appendix F). To match the test score of full-precision SGD, we needed to use a rank-4 approximation (see Section 4.3).
POWERSGD reduces communication by 90% and\nthe overall running time by 55%, while Signum becomes slower than full-precision SGD and also\nobtains a worse test score.\nConvergence curves on test accuracy corresponding to Tables 3, 6 and 7 are provided in Appendix C.\nIn those \ufb01gures, you can read our improvements in time-to-accuracy for any target accuracy. We also\nprovide a case study on using PowerSGD for a novel task (language modeling with transformers on\nWIKITEXT-2) and more workers (32) on the public cloud in Appendix D.\n\n6 Conclusion\n\nGradient compression is a promising approach to tackling the communication bottleneck in syn-\nchronous distributed optimization. Thus far, however, it has not found widespread adoption because\nexisting compression schemes either run slower than SGD with optimized all-reduce gradient aggre-\ngation, or more importantly do not reach the same test performance. We see POWERSGD as the \ufb01rst\npractical gradient compression method, and believe it is ready for adaptation in practice.\nThe key to the practicality of POWERSGD is its linear compression scheme that is cheap to compute\nand allows for all-reduce gradient aggregation, while simultaneously matching the test performance of\nfull-precision SGD. This speedup gained over SGD actually increases for larger models such as those\ncommonly found in NLP. Further, as a result of our modi\ufb01cations to the error feedback algorithm,\nPOWERSGD is a plug-in replacement for SGD with momentum, avoiding the need for additional\nhyper-parameter tuning. We expect that these properties of POWERSGD will enable training of even\nlarger models with even more workers than what is possible with full-precision SGD.\nWhile POWERSGD enables faster training with larger batch sizes, increasing batch sizes are known\nto eventually suffer from a \u2018generalization gap\u2019 (Shallue et al., 2018). 
This is an orthogonal issue that we see as the next step towards solving
large-scale training. In our experiments, we have observed that POWERSGD can
achieve higher test accuracy than SGD. Combined with the intriguing links
between low-rankedness and generalization, this indicates that POWERSGD may
also be helpful for closing the generalization gap in large-batch training.

Acknowledgements

We thank Alp Yurtsever and Tao Lin for valuable discussions and the reviewers
for their feedback. This project was supported by SNSF grant 200021_175796,
as well as a Google Focused Research Award.

References

Alistarh, D., Grubic, D., Li, J., Tomioka, R., and Vojnovic, M. QSGD:
Communication-efficient SGD via gradient quantization and encoding. In
Advances in Neural Information Processing Systems (NIPS), 2017.

Arbenz, P. Lecture notes on solving large scale eigenvalue problems. D-MATH,
ETH Zürich, 2, 2016.

Arora, S., Ge, R., Neyshabur, B., and Zhang, Y. Stronger generalization
bounds for deep nets via a compression approach. In International Conference
on Machine Learning (ICML), 2018.

Awan, A. A., Chu, C.-H., Subramoni, H., and Panda, D. K. Optimized broadcast
for deep learning workloads on dense-GPU InfiniBand clusters: MPI or NCCL?
In European MPI Users' Group Meeting (EuroMPI), 2018.

Baevski, A. and Auli, M. Adaptive input representations for neural language
modeling. In International Conference on Learning Representations (ICLR),
2019.

Bernstein, J., Wang, Y.-X., Azizzadenesheli, K., and Anandkumar, A. signSGD:
Compressed optimisation for non-convex problems. In International Conference
on Machine Learning (ICML), 2018.

Bernstein, J., Zhao, J., Azizzadenesheli, K., and Anandkumar, A. signSGD with
majority vote is communication efficient and fault tolerant. In International
Conference on Learning Representations (ICLR), 2019.

Caldas, S., Konečný, J., McMahan, H. B., and Talwalkar, A.
Expanding the reach of federated learning by reducing client resource
requirements. arXiv, abs/1812.07210, 2018.

Carlson, D., Cevher, V., and Carin, L. Stochastic Spectral Descent for
Restricted Boltzmann Machines. In International Conference on Artificial
Intelligence and Statistics (AISTATS), 2015.

Collins, E., Bigdeli, S. A., and Süsstrunk, S. Detecting memorization in
ReLU networks. arXiv, abs/1810.03372, 2018.

Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., Senior, A.,
Tucker, P., Yang, K., Le, Q. V., et al. Large scale distributed deep
networks. In Advances in Neural Information Processing Systems (NIPS), 2012.

Ghadimi, S. and Lan, G. Accelerated gradient methods for nonconvex nonlinear
and stochastic programming. Mathematical Programming, 156(1-2):59–99, 2016.

Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola,
A., Tulloch, A., Jia, Y., and He, K. Accurate, large minibatch SGD: Training
ImageNet in 1 hour. arXiv, abs/1706.02677, 2017.

Gunasekar, S., Lee, J., Soudry, D., and Srebro, N. Characterizing implicit
bias in terms of optimization geometry. In International Conference on
Machine Learning (ICML), 2018.

Iandola, F. N., Ashraf, K., Moskewicz, M. W., and Keutzer, K. FireCaffe:
Near-linear acceleration of deep neural network training on compute
clusters. arXiv, abs/1511.00175, 2015.

Karimireddy, S. P., Rebjock, Q., Stich, S. U., and Jaggi, M. Error feedback
fixes SignSGD and other gradient compression schemes. In International
Conference on Machine Learning (ICML), 2019.

Konečný, J., McMahan, H. B., Yu, F. X., Richtárik, P., Suresh, A. T., and
Bacon, D. Federated learning: Strategies for improving communication
efficiency. arXiv, abs/1610.05492, 2016.

Li, Y., Ma, T., and Zhang, H. Algorithmic regularization in
over-parameterized matrix sensing and neural networks with quadratic
activations.
In Conference on Learning Theory (COLT), 2018.

Lin, Y., Han, S., Mao, H., Wang, Y., and Dally, W. J. Deep gradient
compression: Reducing the communication bandwidth for distributed training.
In International Conference on Learning Representations (ICLR), 2018.

Martin, C. H. and Mahoney, M. W. Implicit self-regularization in deep neural
networks: Evidence from random matrix theory and implications for learning.
arXiv, abs/1810.01075, 2018.

Mazumder, R., Hastie, T., and Tibshirani, R. Spectral regularization
algorithms for learning large incomplete matrices. Journal of Machine
Learning Research, 11(Aug):2287–2322, 2010.

Oja, E. Simplified neuron model as a principal component analyzer. Journal
of Mathematical Biology, 15(3):267–273, 1982.

Ott, M., Edunov, S., Baevski, A., Fan, A., Gross, S., Ng, N., Grangier, D.,
and Auli, M. fairseq: A fast, extensible toolkit for sequence modeling. In
Proceedings of NAACL-HLT 2019: Demonstrations, 2019.

Panda, D. K. D., Subramoni, H., and Awan, A. A. High performance distributed
deep learning: A beginner's guide. In Symposium on Principles and Practice
of Parallel Programming (PPoPP), 2019.

Robbins, H. and Monro, S. A Stochastic Approximation Method. The Annals of
Mathematical Statistics, 22(3):400–407, September 1951.

Seide, F., Fu, H., Droppo, J., Li, G., and Yu, D. 1-bit stochastic gradient
descent and its application to data-parallel distributed training of speech
DNNs. In Annual Conference of the International Speech Communication
Association (INTERSPEECH), 2014.

Shallue, C. J., Lee, J., Antognini, J., Sohl-Dickstein, J., Frostig, R., and
Dahl, G. E. Measuring the effects of data parallelism on neural network
training. arXiv, abs/1811.03600, 2018.

Stewart, G. Simultaneous iteration for computing invariant subspaces of
non-Hermitian matrices. Numerische Mathematik, 25(2):123–136, 1976.

Stewart, G. and Miller, J.
Methods of simultaneous iteration for calculating eigenvectors of matrices.
Topics in Numerical Analysis II, pp. 169–185, 1975.

Stich, S. U. and Karimireddy, S. P. The error-feedback framework: Better
rates for SGD with delayed gradients and compressed communication. arXiv,
abs/1909.05350, 2019.

Stich, S. U., Cordonnier, J.-B., and Jaggi, M. Sparsified SGD with memory.
In Advances in Neural Information Processing Systems (NeurIPS), 2018.

Wang, H., Sievert, S., Liu, S., Charles, Z., Papailiopoulos, D., and Wright,
S. ATOMO: Communication-efficient learning via atomic sparsification. In
Advances in Neural Information Processing Systems (NeurIPS), 2018.

Wangni, J., Wang, J., Liu, J., and Zhang, T. Gradient sparsification for
communication-efficient distributed optimization. In Advances in Neural
Information Processing Systems (NeurIPS), 2018.

Wen, W., Xu, C., Yan, F., Wu, C., Wang, Y., Chen, Y., and Li, H. TernGrad:
Ternary gradients to reduce communication in distributed deep learning. In
Advances in Neural Information Processing Systems (NIPS), pp. 1509–1519,
2017.

Yoshida, Y. and Miyato, T. Spectral norm regularization for improving the
generalizability of deep learning. arXiv, abs/1705.10941, 2017.

Yu, M., Lin, Z., Narra, K., Li, S., Li, Y., Kim, N. S., Schwing, A. G.,
Annavaram, M., and Avestimehr, S. GradiVeQ: Vector quantization for
bandwidth-efficient gradient aggregation in distributed CNN training. In
Advances in Neural Information Processing Systems (NeurIPS), 2018.

Yurtsever, A., Udell, M., Tropp, J. A., and Cevher, V. Sketchy decisions:
Convex low-rank matrix optimization with optimal storage. In International
Conference on Artificial Intelligence and Statistics (AISTATS), 2017.

Zhao, J. signSGD with majority vote.
github.com/PermiJW/signSGD-with-Majority-Vote, 2019.
[Online; accessed 12-May-2019].