{"title": "Communication Compression for Decentralized Training", "book": "Advances in Neural Information Processing Systems", "page_first": 7652, "page_last": 7662, "abstract": "Optimizing distributed learning systems is an art of balancing between computation and communication. There have been two lines of research that try to deal with slower networks: {\em communication compression} for low bandwidth networks, and {\em decentralization} for high latency networks. In this paper, we explore a natural question: {\em can the combination of both techniques lead to a system that is robust to both bandwidth and latency?}

Although the system implication of such a combination is trivial, the underlying theoretical principle and algorithm design is challenging: unlike centralized algorithms, simply compressing exchanged information within the decentralized network, even in an unbiased stochastic way, would accumulate the error and cause divergence. In this paper, we develop a framework of quantized, decentralized training and propose two different strategies, which we call {\em extrapolation compression} and {\em difference compression}. We analyze both algorithms and prove that both converge at the rate of $O(1/\sqrt{nT})$, where $n$ is the number of workers and $T$ is the number of iterations, matching the convergence rate for full precision, centralized training. 
We validate our algorithms and find that our proposed algorithms significantly outperform the best of the merely decentralized and merely quantized algorithms for networks with {\em both} high latency and low bandwidth.", "full_text": "Communication Compression for Decentralized Training

Hanlin Tang1, Shaoduo Gan2, Ce Zhang2, Tong Zhang3, and Ji Liu3,1

1Department of Computer Science, University of Rochester
2Department of Computer Science, ETH Zurich
3Tencent AI Lab

htang14@ur.rochester.edu, sgan@inf.ethz.ch, ce.zhang@inf.ethz.ch, tongzhang@tongzhang-ml.org, ji.liu.uwisc@gmail.com

Abstract

Optimizing distributed learning systems is an art of balancing between computation and communication. There have been two lines of research that try to deal with slower networks: communication compression for low bandwidth networks, and decentralization for high latency networks. In this paper, we explore a natural question: can the combination of both techniques lead to a system that is robust to both bandwidth and latency?
Although the system implication of such a combination is trivial, the underlying theoretical principle and algorithm design is challenging: unlike centralized algorithms, simply compressing the exchanged information within the decentralized network, even in an unbiased stochastic way, would accumulate the error and fail to converge. In this paper, we develop a framework of compressed, decentralized training and propose two different strategies, which we call extrapolation compression and difference compression. We analyze both algorithms and prove that both converge at the rate of $O(1/\sqrt{nT})$, where $n$ is the number of workers and $T$ is the number of iterations, matching the convergence rate for full precision, centralized training. 
We validate our algorithms and find that our proposed algorithms significantly outperform the best of the merely decentralized and merely quantized algorithms for networks with both high latency and low bandwidth.

1 Introduction
When training machine learning models in a distributed fashion, the underlying constraints of how workers (or nodes) communicate have a significant impact on the training algorithm. When workers cannot form a fully connected communication topology or the communication latency is high (e.g., in sensor networks or mobile networks), decentralizing the communication comes to the rescue. On the other hand, when the amount of data sent through the network is an optimization objective (maybe to lower the cost or energy consumption), or the network bandwidth is low, compressing the traffic, either via sparsification [Wangni et al., 2017, Konečný and Richtárik, 2016] or quantization [Zhang et al., 2017a, Suresh et al., 2017], is a popular strategy. In this paper, our goal is to develop a novel framework that works robustly in environments where both decentralization and communication compression can be beneficial. We focus on quantization, the process of lowering the precision of data representation, often in a stochastically unbiased way, but the same techniques apply to other unbiased compression schemes such as sparsification.
Both decentralized training and quantized (or, more generally, compressed) training have attracted intense interest recently [Yuan et al., 2016, Zhao and Song, 2016, Lian et al., 2017a, Konečný and Richtárik, 2016, Alistarh et al., 2017]. Decentralized algorithms usually exchange local models among nodes, which consumes the main communication budget; on the other hand, quantized algorithms usually exchange quantized gradients and update an un-quantized model. 
A straightforward idea to combine the two is to directly quantize the models sent through the network during decentralized training. However, this simple strategy does not converge to the right solution, as the quantization error accumulates during training. The technical contribution of this paper is to develop novel algorithms that combine decentralized training and quantized training.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Problem Formulation. We consider the following decentralized optimization problem:
$$\min_{x\in\mathbb{R}^N} \; f(x) = \frac{1}{n}\sum_{i=1}^n \underbrace{\mathbb{E}_{\xi\sim\mathcal{D}_i} F_i(x;\xi)}_{=:f_i(x)}, \qquad (1)$$
where $n$ is the number of nodes and $\mathcal{D}_i$ is the local data distribution for node $i$. The $n$ nodes form a connected graph, and each node can only communicate with its neighbors. Here we only assume that the $f_i(x)$'s have $L$-Lipschitzian gradients.

Summary of Technical Contributions. In this paper, we propose two decentralized parallel stochastic gradient descent (D-PSGD) algorithms: extrapolation compression D-PSGD (ECD-PSGD) and difference compression D-PSGD (DCD-PSGD). Both algorithms can be proven to converge at a rate of roughly $O(1/\sqrt{nT})$, where $T$ is the number of iterations. The convergence rates are consistent with two special cases: centralized parallel stochastic gradient descent (C-PSGD) and D-PSGD. To the best of our knowledge, this is the first work to combine quantization algorithms and decentralized algorithms for generic optimization.
The key difference between ECD-PSGD and DCD-PSGD is that DCD-PSGD quantizes the difference between the last two local models, while ECD-PSGD quantizes the extrapolation of the last two local models. DCD-PSGD admits a slightly better convergence rate than ECD-PSGD when the data variation among nodes is very large. 
On the other hand, ECD-PSGD is more robust to aggressive quantization, as extremely low precision quantization can cause DCD-PSGD to diverge, since DCD-PSGD imposes a strict constraint on the quantization. In this paper, we analyze both algorithms and empirically validate our theory. We also show that when the underlying network has both high latency and low bandwidth, both algorithms significantly outperform state-of-the-art approaches. We present both algorithms because we believe both of them are theoretically interesting. In practice, ECD-PSGD could potentially be the more robust choice.

Definitions and notations Throughout this paper, we use the following notations and definitions:
• $\nabla f(\cdot)$ denotes the gradient of a function $f$.
• $f^*$ denotes the optimal value of (1).
• $\lambda_i(\cdot)$ denotes the $i$-th largest eigenvalue of a matrix.
• $\mathbf{1} = [1, 1, \cdots, 1]^\top \in \mathbb{R}^n$ denotes the all-ones vector.
• $\|\cdot\|$ denotes the $\ell_2$ norm of a vector.
• $\|\cdot\|_F$ denotes the Frobenius norm of a matrix.
• $C(\cdot)$ denotes the compression operator.
• $f_i(x) := \mathbb{E}_{\xi\sim\mathcal{D}_i} F_i(x;\xi)$.

2 Related work
Stochastic gradient descent Stochastic Gradient Descent (SGD) [Ghadimi and Lan, 2013, Moulines and Bach, 2011, Nemirovski et al., 2009], a stochastic variant of the gradient descent method, has been widely used for solving large scale machine learning problems [Bottou, 2010]. It admits the optimal convergence rate $O(1/\sqrt{T})$ for non-convex functions.
Centralized algorithms The centralized architecture is a widely used scheme for parallel computation, adopted by, for example, TensorFlow [Abadi et al., 2016], MXNet [Chen et al., 2015], and CNTK [Seide and Agarwal, 2016]. It uses a central node to control all leaf nodes. 
For Centralized Parallel Stochastic Gradient Descent (C-PSGD), the central node performs parameter updates and the leaf nodes compute stochastic gradients based on local information in parallel. In Agarwal and Duchi [2011], Zinkevich et al. [2010], the effectiveness of C-PSGD is studied with latency taken into consideration. Distributed mini-batch SGD, which requires each leaf node to compute the stochastic gradient more than once before the parameter update, is studied in Dekel et al. [2012]. Recht et al. [2011] proposed a variant of C-PSGD, HOGWILD!, and proved that it still works even if the memory is shared and a worker's private model can be overwritten by others. Asynchronous non-convex C-PSGD optimization is studied in Lian et al. [2015]. Zheng et al. [2016] proposed an algorithm to improve the performance of asynchronous C-PSGD. In Alistarh et al. [2017], De Sa et al. [2017], quantized SGD is proposed to save communication cost for both convex and non-convex objective functions. The convergence rate for C-PSGD is $O(1/\sqrt{Tn})$. The tradeoff between the number of mini-batches and the number of local SGD steps is studied in Lin et al. [2018], Stich [2018].
Decentralized algorithms Recently, decentralized training algorithms have attracted a significant amount of attention. Decentralized algorithms are mostly applied to solve consensus problems [Zhang et al., 2017b, Lian et al., 2017a, Sirb and Ye, 2016], where the network topology is decentralized. A recent work shows that decentralized algorithms can outperform their centralized counterparts for distributed training [Lian et al., 2017a]. The main advantage of decentralized algorithms over centralized algorithms lies in avoiding the communication traffic at the central node. In particular, decentralized algorithms can be much more efficient than centralized algorithms when the network bandwidth is small and the latency is large. 
The decentralized algorithm (also named the gossip algorithm in some literature under certain scenarios [Colin et al., 2016]) only assumes a connected computational network, without using a central node to collect information from all nodes. Each node owns its local data and can only exchange information with its neighbors. The goal is still to learn a model over all of the distributed data. The decentralized structure can be applied to multi-task multi-agent reinforcement learning [Omidshafiei et al., 2017, Mhamdi et al., 2017]. Boyd et al. [2006] use a randomized weight matrix and study its effectiveness in different situations. Two methods [Li et al., 2017, Shi et al., 2015] were proposed to reduce the steady-state error of decentralized gradient descent for convex optimization. Dobbe et al. [2017] applied an information-theoretic framework to decentralized analysis. The performance of decentralized algorithms depends on the second largest eigenvalue of the weight matrix.
Decentralized parallel stochastic gradient descent The Decentralized Parallel Stochastic Gradient Descent (D-PSGD) algorithm [Nedic and Ozdaglar, 2009, Yuan et al., 2016] requires each node to exchange its own stochastic gradient and update the parameters using the information it receives. In Nedic and Ozdaglar [2009], the convergence rate for a time-varying topology was proved under the assumption that the maximum of the subgradient is bounded. In Lan et al. [2017], a decentralized primal-dual type method is proposed with complexity $O(\sqrt{n/T})$ for general convex objectives. The linear speedup of D-PSGD is proved in Lian et al. [2017a], where the computational complexity is $O(1/\sqrt{nT})$. The asynchronous variant of D-PSGD is studied in Lian et al. [2017b]. In He et al. 
[2018], the gradient descent based algorithm CoLA was proposed for decentralized learning of linear classification and regression models, with proved convergence rates for the strongly convex and general convex cases.
Compression To guarantee convergence and correctness, this paper only considers unbiased stochastic compression techniques. Existing methods include randomized quantization [Zhang et al., 2017a, Suresh et al., 2017] and randomized sparsification [Wangni et al., 2017, Konečný and Richtárik, 2016]. Other compression methods can be found in Kashyap et al. [2007], Lavaei and Murray [2012], Nedic et al. [2009]. In Drumond et al. [2018], a compressed DNN training algorithm is proposed. In Stich et al. [2018], a centralized biased sparsified parallel SGD with memory is studied and proved to admit a factor of acceleration.

3 Preliminary: decentralized parallel stochastic gradient descent (D-PSGD)
Unlike traditional (centralized) parallel stochastic gradient descent (C-PSGD), which requires a central node to compute the average value of all leaf nodes, the decentralized parallel stochastic gradient descent (D-PSGD) algorithm does not need such a central node. Each node (say node $i$) only exchanges its local model $x^{(i)}$ with its neighbors to take a weighted average, specifically $x^{(i)} = \sum_{j=1}^n W_{ij} x^{(j)}$, where $W_{ij} \geq 0$ in general and $W_{ij} = 0$ means that node $i$ and node $j$ are not connected. At the $t$-th iteration, D-PSGD consists of three steps ($i$ is the node index):
1. Each node computes the stochastic gradient $\nabla F_i(x^{(i)}_t; \xi^{(i)}_t)$, where $\xi^{(i)}_t$ is a sample from its local data set and $x^{(i)}_t$ is the local model on node $i$.
2. Each node queries its neighbors' variables and updates its local model using $x^{(i)}_t \leftarrow \sum_{j=1}^n W_{ij} x^{(j)}_t$.
3. Each node updates its local model using the stochastic gradient, $x^{(i)}_{t+1} \leftarrow x^{(i)}_t - \gamma_t \nabla F_i(x^{(i)}_t; \xi^{(i)}_t)$, where $\gamma_t$ is the learning rate.

Figure 1: D-PSGD vs. D-PSGD with naive compression (training loss over epochs; the naively compressed variant diverges).

To look at the D-PSGD algorithm from a global view, define
$$X := [x^{(1)}, x^{(2)}, \cdots, x^{(n)}] \in \mathbb{R}^{N\times n}, \qquad G(X;\xi) := [\nabla F_1(x^{(1)};\xi^{(1)}), \cdots, \nabla F_n(x^{(n)};\xi^{(n)})],$$
$$\nabla f(X) := \frac{1}{n}\sum_{i=1}^n \nabla f_i(x^{(i)}) = \frac{1}{n}\,\mathbb{E}_\xi\, G(X;\xi)\mathbf{1}, \qquad \nabla f(\overline{X}) := \frac{1}{n}\sum_{i=1}^n \nabla f_i\Big(\frac{1}{n}\sum_{i=1}^n x^{(i)}\Big);$$
then D-PSGD can be summarized in the compact form $X_{t+1} = X_t W - \gamma_t G(X_t; \xi_t)$.
The convergence rate of D-PSGD can be shown to be $O\big(\frac{\sigma}{\sqrt{nT}} + \frac{n^{1/3}\zeta^{2/3}}{T^{2/3}}\big)$ (without assuming convexity), where both $\sigma$ and $\zeta$ measure stochastic variance (please refer to Assumption 1 for detailed definitions), if the learning rate is chosen appropriately.

4 Quantized, Decentralized Algorithms
We introduce two quantized decentralized algorithms that compress the information exchanged between nodes. For decentralized algorithms, all communication consists of exchanging the local models $x^{(i)}$.
To reduce the communication cost, a straightforward idea is to compress the information exchanged within the decentralized network, just as centralized algorithms send compressed stochastic gradients [Alistarh et al., 2017]. Unfortunately, such a naive combination does not work, even with unbiased stochastic compression and a diminishing learning rate, as shown in Figure 1. 
The reason can be seen from the detailed derivation (please find it in the Supplement). Before proposing our solutions to this issue, let us first make some common optimization assumptions for analyzing decentralized stochastic algorithms [Lian et al., 2017b].
Assumption 1. Throughout this paper, we make the following commonly used assumptions:
1. Lipschitzian gradient: All functions $f_i(\cdot)$ have $L$-Lipschitzian gradients.
2. Symmetric doubly stochastic matrix: The weight matrix $W$ is a real symmetric doubly stochastic matrix, that is, $W = W^\top$ and $W\mathbf{1} = \mathbf{1}$.
3. Spectral gap: Given the symmetric doubly stochastic matrix $W$, we define $\rho := \max\{|\lambda_2(W)|, |\lambda_n(W)|\}$ and assume $\rho < 1$.
4. Bounded variance: Assume the variance of the stochastic gradient is bounded:
$$\mathbb{E}_{\xi\sim\mathcal{D}_i}\|\nabla F_i(x;\xi) - \nabla f_i(x)\|^2 \leq \sigma^2, \qquad \frac{1}{n}\sum_{i=1}^n\|\nabla f_i(x) - \nabla f(x)\|^2 \leq \zeta^2, \qquad \forall i, \forall x.$$
5. Start from 0: We assume $X_1 = 0$. This assumption simplifies the proof w.l.o.g.
6. Independent and unbiased stochastic compression: The stochastic compression operation $C(\cdot)$ is unbiased, that is, $\mathbb{E}(C(Z)) = Z$ for any $Z$, and the stochastic compressions are independent across different workers and different time points.
The last assumption essentially restricts the compression to be lossy but unbiased. Biased stochastic compression generally makes it hard to ensure convergence, while lossless compression can be combined with any algorithm; both are beyond the scope of this paper. 
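Assumption 1.6 can be made concrete with a short sketch. The two standard unbiased compression operators discussed below, random quantization and random sparsification, both satisfy $\mathbb{E}[C(z)] = z$; the thresholds and the keep-probability below are illustrative choices, not values prescribed by the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def random_quantize(z, levels):
    """Unbiased stochastic quantization: round each entry of z (assumed
    normalized into [0, 1]) to one of its two nearest thresholds in
    `levels`, with probabilities chosen so that E[C(z)] = z."""
    levels = np.asarray(levels)
    idx = np.searchsorted(levels, z, side="right")   # upper threshold index
    hi = levels[np.clip(idx, 0, len(levels) - 1)]
    lo = levels[np.clip(idx - 1, 0, len(levels) - 1)]
    span = np.where(hi > lo, hi - lo, 1.0)           # avoid 0/0 at the boundary
    p_up = (z - lo) / span                           # P(round up) keeps the mean
    return np.where(rng.random(z.shape) < p_up, hi, lo)

def random_sparsify(z, p):
    """Unbiased sparsification: keep z_i / p with probability p, else set 0."""
    mask = rng.random(z.shape) < p
    return np.where(mask, z / p, 0.0)

z = np.full(100_000, 0.5)
q = random_quantize(z, [0.0, 0.3, 0.8, 1.0])
s = random_sparsify(z, p=0.25)
print(q.mean(), s.mean())   # both sample means are close to 0.5
```

For the value 0.5 with thresholds {0, 0.3, 0.8, 1}, the quantizer outputs 0.3 with probability 60% and 0.8 with probability 40%, so the expectation is exactly 0.5.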
The commonly used stochastic unbiased compressions include random quantization¹ [Zhang et al., 2017a] and sparsification² [Wangni et al., 2017, Konečný and Richtárik, 2016].

4.1 Difference compression approach
In this section, we introduce a difference based approach, namely difference compression D-PSGD (DCD-PSGD), to ensure efficient convergence.
DCD-PSGD basically follows the framework of D-PSGD, except that nodes exchange the compressed difference of the local models between two successive iterations, instead of exchanging the local models themselves. More specifically, each node needs to store its neighbors' models from the last iteration, $\{\hat{x}^{(j)}_t : j \text{ is node } i\text{'s neighbor}\}$, and follows these steps:
1. Take the weighted average and apply a stochastic gradient descent step: $x^{(i)}_{t+\frac12} = \sum_{j=1}^n W_{ij}\hat{x}^{(j)}_t - \gamma\nabla F_i(x^{(i)}_t; \xi^{(i)}_t)$, where $\hat{x}^{(j)}_t$ is the replica of $x^{(j)}_t$ stored on node $i$³;
2. Compress the difference between $x^{(i)}_t$ and $x^{(i)}_{t+\frac12}$, namely $z^{(i)}_t = x^{(i)}_{t+\frac12} - x^{(i)}_t$, and update the local model: $x^{(i)}_{t+1} = x^{(i)}_t + C(z^{(i)}_t)$;
3. Send $C(z^{(i)}_t)$ to the neighbors and query the neighbors' $C(z^{(j)}_t)$ to update the local replicas: $\hat{x}^{(j)}_{t+1} = \hat{x}^{(j)}_t + C(z^{(j)}_t)$, for every neighbor $j$ of node $i$.

¹A real number is randomly quantized to one of its two closest thresholds; for example, given the thresholds {0, 0.3, 0.8, 1}, the number 0.5 is quantized to 0.3 with probability 60% and to 0.8 with probability 40%, so that the expectation is preserved. Here, we assume that all numbers have been normalized into the range [0, 1].
²A real number $z$ is set to 0 with probability $1-p$ and to $z/p$ with probability $p$.
³Actually, each neighbor of node $j$ maintains a replica of $x^{(j)}_t$.

The full DCD-PSGD algorithm is described in Algorithm 1.

Algorithm 1 DCD-PSGD
1: Input: initial point $x^{(i)}_1 = x_1$, initial replicas $\hat{x}^{(i)}_1 = x_1$, step length $\gamma$, weight matrix $W$, and number of total iterations $T$
2: for $t = 1, 2, \ldots, T$ do
3:   Randomly sample $\xi^{(i)}_t$ from the local data of the $i$-th node
4:   Compute the local stochastic gradient $\nabla F_i(x^{(i)}_t; \xi^{(i)}_t)$ using $\xi^{(i)}_t$ and the current optimization variable $x^{(i)}_t$
5:   Update the local model using the local stochastic gradient and the weighted average of the connected neighbors' replicas $\hat{x}^{(j)}_t$: $x^{(i)}_{t+\frac12} = \sum_{j=1}^n W_{ij}\hat{x}^{(j)}_t - \gamma\nabla F_i(x^{(i)}_t; \xi^{(i)}_t)$
6:   Compute $z^{(i)}_t = x^{(i)}_{t+\frac12} - x^{(i)}_t$ and compress it into $C(z^{(i)}_t)$
7:   Update the local optimization variable: $x^{(i)}_{t+1} \leftarrow x^{(i)}_t + C(z^{(i)}_t)$
8:   Send $C(z^{(i)}_t)$ to the connected neighbors, and update the replicas of the connected neighbors' values: $\hat{x}^{(j)}_{t+1} = \hat{x}^{(j)}_t + C(z^{(j)}_t)$
9: end for
10: Output: $\frac{1}{n}\sum_{i=1}^n x^{(i)}_T$

Algorithm 2 ECD-PSGD
1: Input: initial point $x^{(i)}_1 = x_1$, initial estimate $\tilde{x}^{(i)}_1 = x_1$, step length $\gamma$, weight matrix $W$, and number of total iterations $T$
2: for $t = 1, 2, \ldots, T$ do
3:   Randomly sample $\xi^{(i)}_t$ from the local data of the $i$-th node
4:   Compute the local stochastic gradient $\nabla F_i(x^{(i)}_t; \xi^{(i)}_t)$ using $\xi^{(i)}_t$ and the current optimization variable $x^{(i)}_t$
5:   Compute the neighborhood weighted average using the estimates of the connected neighbors: $x^{(i)}_{t+\frac12} = \sum_{j=1}^n W_{ij}\tilde{x}^{(j)}_t$
6:   Update the local model: $x^{(i)}_{t+1} \leftarrow x^{(i)}_{t+\frac12} - \gamma\nabla F_i(x^{(i)}_t; \xi^{(i)}_t)$
7:   Each node computes its z-value, $z^{(i)}_{t+1} = (1 - 0.5t)\,x^{(i)}_t + 0.5t\,x^{(i)}_{t+1}$, and compresses it into $C(z^{(i)}_{t+1})$
8:   Each node updates the estimates of its connected neighbors: $\tilde{x}^{(j)}_{t+1} = (1 - 2t^{-1})\,\tilde{x}^{(j)}_t + 2t^{-1}\,C(z^{(j)}_{t+1})$
9: end for
10: Output: $\frac{1}{n}\sum_{i=1}^n x^{(i)}_T$

To ensure convergence, we need to place some restriction on the compression operator $C(\cdot)$; again, this compression operator could be random quantization, random sparsification, or any other operator. We introduce the signal-to-noise related parameter
$$\alpha := \sup_{Z\neq 0}\sqrt{\|Q\|_F^2/\|Z\|_F^2}, \qquad \text{where } Q = Z - C(Z).$$
We have the following theorem.

Theorem 1. Let $\mu := \max_{i\in\{2,\cdots,n\}} |\lambda_i - 1|$. If $\alpha$ satisfies $(1-\rho)^2 - 4\mu^2\alpha^2 > 0$ and $\gamma$ satisfies $1 - 3D_1L^2\gamma^2 > 0$, then under Assumption 1 we have the following convergence rate for Algorithm 1:
$$\sum_{t=1}^T\Big((1-D_3)\,\mathbb{E}\|\nabla f(\overline{X}_t)\|^2 + D_4\,\mathbb{E}\|\nabla f(X_t)\|^2\Big) \leq \frac{2(f(0)-f^*)}{\gamma} + \frac{L\gamma T\sigma^2}{n} + \bigg(\frac{\big(4L^2+3L^3D_2\gamma^2\big)D_1T\gamma^2}{1-3D_1L^2\gamma^2} + T\gamma^2LD_2\bigg)\sigma^2 + \frac{\big(4L^2+3L^3D_2\gamma^2\big)\,3D_1T\gamma^2}{1-3D_1L^2\gamma^2}\,\zeta^2, \qquad (2)$$
where
$$D_1 := \bigg(\frac{2\mu^2(1+2\alpha^2)}{(1-\rho)^2-4\mu^2\alpha^2}+1\bigg)\frac{1}{(1-\rho)^2}, \qquad D_2 := 2\alpha^2\bigg(\frac{2\mu^2(1+2\alpha^2)}{(1-\rho)^2-4\mu^2\alpha^2}+1\bigg) + \frac{2\alpha^2}{1-\rho^2},$$
$$D_3 := \frac{\big(4L^2+3L^3D_2\gamma^2\big)\,3D_1\gamma^2}{1-3D_1L^2\gamma^2} + \frac{3LD_2\gamma^2}{2}, \qquad D_4 := 1 - L\gamma.$$

To make the result clearer, we appropriately choose the step length in the following.
Corollary 2. Let $D_1$, $D_2$, $\mu$ follow the same definitions as in Theorem 1, and choose $\gamma = \big(6\sqrt{D_1}L + 6\sqrt{D_2}L + \sigma\sqrt{T/n} + \zeta^{2/3}T^{1/3}\big)^{-1}$ in Algorithm 1. If $\alpha$ is small enough that $(1-\rho)^2 - 4\mu^2\alpha^2 > 0$, then we have
$$\frac{1}{T}\sum_{t=1}^T \mathbb{E}\|\nabla f(\overline{X}_t)\|^2 \lesssim \frac{\sigma}{\sqrt{nT}} + \frac{\zeta^{2/3}}{T^{2/3}} + \frac{1}{T},$$
if we treat $f(0)-f^*$, $L$, and $\rho$ as constants.

The leading term of the convergence rate is $O(1/\sqrt{Tn})$, and we also proved the convergence rate for $\mathbb{E}\big[\sum_{i=1}^n\|\overline{X}_t - x^{(i)}_t\|^2\big]$ (see (27) in the Supplementary). 
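The per-node steps of DCD-PSGD can be sketched end-to-end on a toy problem (a hypothetical quadratic setup with a mild, fixed-grid unbiased compressor; all parameters here are illustrative assumptions, and one shared replica matrix stands in for the per-neighbor copies, which in DCD-PSGD stay exactly in sync because every node applies the same compressed differences):

```python
import numpy as np

rng = np.random.default_rng(2)

def compress(z, delta=0.01):
    """Unbiased stochastic rounding to a grid of spacing delta (a mild
    compressor, keeping the signal-to-noise parameter alpha small, in the
    spirit of Theorem 1's restriction)."""
    low = np.floor(z / delta) * delta
    return low + delta * (rng.random(z.shape) < (z - low) / delta)

n, N, T, gamma = 4, 3, 3000, 0.05
b = rng.normal(size=(N, n))          # node i holds f_i(x) = 0.5 ||x - b_i||^2
x_star = b.mean(axis=1)

W = np.zeros((n, n))                  # doubly stochastic ring mixing matrix
for i in range(n):
    W[i, i], W[i, (i + 1) % n], W[i, (i - 1) % n] = 0.5, 0.25, 0.25

X = np.zeros((N, n))                  # local models x^(i)
Xhat = np.zeros((N, n))               # replicas \hat{x}^(j) held by neighbors
for t in range(T):
    G = X - b                                  # (noise-free) local gradients
    X_half = Xhat @ W - gamma * G              # step 1: average replicas + SGD step
    Cz = compress(X_half - X)                  # step 2: compress the difference
    X = X + Cz                                 #         update local models
    Xhat = Xhat + Cz                           # step 3: neighbors patch replicas

print(np.linalg.norm(X.mean(axis=1) - x_star))  # small: converges near x_star
```

Note that only the compressed differences ever cross the (simulated) network, and the replicas never drift from the true models, which is precisely what lets DCD-PSGD avoid the error accumulation of the naive scheme.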
We shall see the tightness of our result in the following discussion.
Linear speedup Since the leading term of the convergence rate is $O(1/\sqrt{Tn})$ when $T$ is large, which is consistent with the convergence rate of C-PSGD, this indicates that we achieve a linear speedup with respect to the number of nodes.
Consistence with D-PSGD Setting $\alpha = 0$ to match the scenario of D-PSGD, DCD-PSGD admits the rate $O\big(\frac{\sigma}{\sqrt{nT}} + \frac{\zeta^{2/3}}{T^{2/3}}\big)$, which is slightly better than the rate of D-PSGD proved in Lian et al. [2017b], $O\big(\frac{\sigma}{\sqrt{nT}} + \frac{n^{2/3}\zeta^{2/3}}{T^{2/3}}\big)$. The non-leading terms' dependence on the spectral gap $(1-\rho)$ is also consistent with the result for D-PSGD.

4.2 Extrapolation compression approach
From Theorem 1, we can see that there is an upper bound on the compression level $\alpha$ in DCD-PSGD. Moreover, since the spectral gap $(1-\rho)$ decreases as the number of workers grows, DCD-PSGD will fail to work under very aggressive compression. In this section, we therefore propose another approach, namely ECD-PSGD, to remove the restriction on the compression level, with a small sacrifice in computational efficiency.
For ECD-PSGD, we make the following assumption that the noise introduced by compression is bounded.
Assumption 2 (Bounded compression noise). We assume the noise due to compression is unbiased and its variance is bounded, that is, $\mathbb{E}\|C(z) - z\|^2 \leq \tilde{\sigma}^2/2$ for all $z$.
Instead of sending the local model $x^{(i)}_t$ directly to its neighbors, each node sends a z-value that is extrapolated from $x^{(i)}_{t-1}$ and $x^{(i)}_t$ at each iteration. Each node (say node $i$) estimates its neighbor's value $x^{(j)}_t$ by $\tilde{x}^{(j)}_t$, computed from the compressed z-values, at the $t$-th iteration. 
This procedure ensures a diminishing estimation error; in particular, $\mathbb{E}\|\tilde{x}^{(j)}_t - x^{(j)}_t\|^2 \leq O(t^{-1})$. To satisfy Assumption 2, one may need to use a quantization strategy that is sensitive to the magnitude of $z$, but our experiments show that a fixed precision randomized quantization strategy, a magnitude independent method, still works very well.
At the $t$-th iteration, node $i$ performs the following steps to estimate $x^{(j)}_t$:
• Node $j$ computes the z-value obtained through extrapolation,
$$z^{(j)}_t = (1 - 0.5t)\,x^{(j)}_{t-1} + 0.5t\,x^{(j)}_t, \qquad (3)$$
• Node $j$ compresses $z^{(j)}_t$ and sends it to its neighbors, say node $i$. Node $i$ computes $\tilde{x}^{(j)}_t$ using
$$\tilde{x}^{(j)}_t = \big(1 - 2t^{-1}\big)\,\tilde{x}^{(j)}_{t-1} + 2t^{-1}\,C(z^{(j)}_t). \qquad (4)$$
Using Lemma 12 (see the Supplementary Material), if the compression noise $q^{(j)}_t := z^{(j)}_t - C(z^{(j)}_t)$ has variance globally bounded by $\tilde{\sigma}^2/2$, we have
$$\mathbb{E}\big(\|\tilde{x}^{(j)}_t - x^{(j)}_t\|^2\big) \leq \tilde{\sigma}^2/t.$$
Using this way to estimate the neighbors' local models leads to the following equivalent updating form:
$$X_{t+1} = \tilde{X}_t W - \gamma_t G(X_t; \xi_t) = X_t W + \underbrace{Q_t W}_{\text{diminishing estimation error}} - \;\gamma_t G(X_t; \xi_t),$$
where $Q_t := \tilde{X}_t - X_t$.
The full extrapolation compression D-PSGD (ECD-PSGD) algorithm is summarized in Algorithm 2. Below we show that the ECD-PSGD algorithm admits the same convergence rate and the same computational complexity as D-PSGD.
Theorem 3 (Convergence of Algorithm 2). 
Under Assumptions 1 and 2, choosing $\gamma_t$ in Algorithm 2 to be a constant $\gamma$ satisfying $1 - 6C_1L^2\gamma^2 > 0$, we have the following convergence rate for Algorithm 2:
$$\sum_{t=1}^T\Big((1-C_3)\,\mathbb{E}\|\nabla f(\overline{X}_t)\|^2 + C_4\,\mathbb{E}\|\nabla f(X_t)\|^2\Big) \leq \frac{2(f(0)-f^*)}{\gamma} + \frac{L\log T}{n\gamma}\tilde{\sigma}^2 + \frac{LT\gamma}{n}\sigma^2 + \frac{4C_2\tilde{\sigma}^2L^2}{1-\rho^2}\log T + 4L^2C_2\big(\sigma^2 + 3\zeta^2\big)C_1T\gamma^2, \qquad (5)$$
where $C_1 := \frac{1}{(1-\rho)^2}$, $C_2 := \frac{1}{1-6\rho^{-2}C_1L^2\gamma^2}$, $C_3 := 12L^2C_2C_1\gamma^2$, and $C_4 := 1 - L\gamma$.
To make the result clearer, we choose the step length in the following.
Corollary 4. If we choose the step length $\gamma = \big(12\sqrt{C_1}L + \sigma\sqrt{T/n} + \zeta^{2/3}T^{1/3}\big)^{-1}$, then Algorithm 2 admits the following convergence rate (with $f(0)-f^*$, $L$, and $\rho$ treated as constants):
$$\frac{1}{T}\sum_{t=1}^T \mathbb{E}\|\nabla f(\overline{X}_t)\|^2 \lesssim \frac{\sigma\big(1+\tilde{\sigma}^2\frac{\log T}{n}\big)}{\sqrt{nT}} + \frac{\zeta^{2/3}\big(1+\tilde{\sigma}^2\frac{\log T}{n}\big)}{T^{2/3}} + \frac{1}{T} + \frac{\tilde{\sigma}^2\log T}{T}. \qquad (6)$$

Figure 2: Performance comparison between decentralized and AllReduce implementations.
Figure 3: Performance comparison in diverse network conditions.

This result suggests that the algorithm converges roughly at the rate $O(1/\sqrt{nT})$, and we also proved the convergence rate for $\mathbb{E}\big[\sum_{i=1}^n\|\overline{X}_t - x^{(i)}_t\|^2\big]$ (see (36) in the Supplementary). 
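The diminishing estimation error that drives this result can be checked numerically. The following hypothetical sketch runs the ECD-PSGD extrapolation and estimate updates for one neighbor's trajectory (an assumed geometrically converging sequence, with additive bounded uniform noise standing in for an unbiased compressor satisfying Assumption 2) and observes the mean squared estimation error decaying roughly like $1/t$:

```python
import numpy as np

rng = np.random.default_rng(3)

dim, T, delta = 1000, 2000, 0.1
x_star = rng.normal(size=dim)
v = rng.normal(size=dim)

def compress(z):
    # Any unbiased compression with bounded variance fits Assumption 2;
    # here: additive uniform noise, so E[C(z)] = z.
    return z + rng.uniform(-delta, delta, size=z.shape)

def x_at(t):
    # A hypothetical converging local-model trajectory on node j.
    return x_star + 0.9 ** t * v

x_tilde = x_at(1)                     # initial estimate equals x_1
mse = {}
for t in range(2, T + 1):
    z = (1 - 0.5 * t) * x_at(t - 1) + 0.5 * t * x_at(t)      # extrapolation
    x_tilde = (1 - 2 / t) * x_tilde + (2 / t) * compress(z)  # estimate update
    mse[t] = np.mean((x_tilde - x_at(t)) ** 2)

print(mse[200], mse[2000])   # the error dies out roughly like 1/t
```

Note that the extrapolated z-value grows with $t$ when the trajectory still moves, which is exactly why a magnitude-sensitive quantizer may be needed to keep the compression noise bounded, as remarked in Section 4.2.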
The following analysis brings a more detailed interpretation to show the tightness of our result.
Linear speedup Since the leading term of the convergence rate is $O(1/\sqrt{nT})$ when $T$ is large, which is consistent with the convergence rate of C-PSGD, this indicates that we achieve a linear speedup with respect to the number of nodes.
Consistence with D-PSGD Setting $\tilde{\sigma} = 0$ to match the scenario of D-PSGD, ECD-PSGD admits the rate $O\big(\frac{\sigma}{\sqrt{nT}} + \frac{\zeta^{2/3}}{T^{2/3}}\big)$, which is slightly better than the rate of D-PSGD proved in Lian et al. [2017b], $O\big(\frac{\sigma}{\sqrt{nT}} + \frac{n^{2/3}\zeta^{2/3}}{T^{2/3}}\big)$. The non-leading terms' dependence on the spectral gap $(1-\rho)$ is also consistent with the result for D-PSGD.
Comparison between DCD-PSGD and ECD-PSGD On the one hand, in terms of the convergence rate, ECD-PSGD is slightly worse than DCD-PSGD due to the additional terms $O\big(\frac{\sigma\tilde{\sigma}^2\log T}{n\sqrt{nT}} + \frac{\zeta^{2/3}\tilde{\sigma}^2\log T}{nT^{2/3}} + \frac{\tilde{\sigma}^2\log T}{T}\big)$, which suggests that if $\tilde{\sigma}$ is relatively large compared to $\sigma$, the additional terms dominate the convergence rate. On the other hand, DCD-PSGD does not allow too aggressive compression or quantization and may diverge, since it requires $\alpha < \frac{1-\rho}{2\mu}$, while ECD-PSGD is quite robust to aggressive compression or quantization.

5 Experiments
In this section, we evaluate the two decentralized algorithms by comparing them with an allreduce implementation of centralized SGD. 
We run experiments under diverse network conditions and show that decentralized algorithms with low precision can speed up training without hurting convergence.
5.1 Experimental Setup
We choose the image classification task as a benchmark to evaluate our theory. We train ResNet-20 [He et al., 2016] on the CIFAR-10 dataset, which has 50,000 images for training and 10,000 images for testing. The two proposed algorithms are implemented in Microsoft CNTK and compared with CNTK's original implementation of distributed SGD:
• Centralized: This implementation is based on the MPI allreduce primitive with full precision (32 bits). It is the standard training method for multiple nodes in CNTK.
• Decentralized_32bits/8bits: The implementation of the proposed decentralized approach with OpenMPI. The full precision is 32 bits, and the compressed precision is 8 bits.
• In this paper, we omit the comparison with quantized centralized training because the difference between Decentralized 8bits and Centralized 8bits would be similar to the original decentralized training paper, Lian et al. 
[2017a] – when the network latency is high, the decentralized algorithm outperforms the centralized algorithm in terms of the time for each epoch.

Figure 2: Training loss under different network conditions: (a) # epochs vs training loss; (b) time vs training loss (bandwidth = 1.4 Gbps, latency = 0.13 ms); (c) time vs training loss (bandwidth = 1.4 Gbps, latency = 20 ms); (d) time vs training loss (bandwidth = 5 Mbps, latency = 20 ms).

Figure 3: Epoch time in diverse network conditions: (a) impact of network bandwidth (latency = 0.13 ms); (b) impact of network bandwidth (latency = 20 ms); (c) impact of network latency (bandwidth = 1.4 Gbps); (d) impact of network latency (bandwidth = 5 Mbps).

We run all experiments on 8 Amazon p2.xlarge EC2 instances, each of which has one Nvidia K80 GPU. We use each GPU as a node. In the decentralized cases, the 8 nodes are connected in a ring topology, which means each node communicates only with its two neighbors. The batch size for each node is the same as the default configuration in CNTK. We also tune the learning rate for each variant.

5.2 Convergence and Run Time Performance
We first study the convergence of our algorithms. Figure 2(a) shows the convergence w.r.t. the number of epochs for the centralized and decentralized cases.
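The Decentralized_8bits implementation compresses exchanged data to 8 bits; the compression operator our analysis assumes is unbiased, i.e. E[Q(x)] = x. A minimal numpy sketch of such an unbiased stochastic quantizer (illustrative only, not CNTK's exact kernel):

```python
import numpy as np

def stochastic_quantize(x, bits=8, rng=np.random.default_rng(0)):
    """Unbiased stochastic quantization: map x onto a uniform grid over
    [x.min(), x.max()] with 2**bits levels, rounding up with probability
    equal to the fractional part so that E[dequantize(Q(x))] == x."""
    levels = 2 ** bits - 1
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / levels if hi > lo else 1.0
    normalized = (x - lo) / scale              # values in [0, levels]
    floor = np.floor(normalized)
    prob_up = normalized - floor               # P(round up) = fractional part
    q = floor + (rng.random(x.shape) < prob_up)
    return q.astype(np.uint8), lo, scale       # send 8-bit codes + two floats

def dequantize(q, lo, scale):
    return q.astype(np.float64) * scale + lo

x = np.random.default_rng(1).normal(size=10000)
q, lo, scale = stochastic_quantize(x)
print("mean abs error:", np.abs(dequantize(q, lo, scale) - x).mean())
```

Each element is off by at most one grid step (scale), and the rounding noise is zero-mean, which is exactly the σ̃-bounded, unbiased noise model the analysis relies on.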
We only show ECD-PSGD in the figure (and call it Decentralized) because DCD-PSGD has almost identical convergence behavior in this experiment. We can see that with our algorithms, decentralization and compression do not hurt the convergence rate.
We then compare the runtime performance. Figure 2(b, c, d) demonstrates how the training loss decreases with run time under different network conditions. We use the tc command to change the bandwidth and latency of the underlying network. By default, 1.4 Gbps bandwidth and 0.13 ms latency is the best network condition we can get in this cluster. In this setting, all implementations have very similar runtime performance because communication is not the bottleneck of the system. When the latency is high, however, the decentralized algorithms outperform allreduce because they require fewer rounds of communication. Compared with the decentralized full precision cases, the low precision methods exchange much less data and thus outperform the full precision cases in the low bandwidth situation, as shown in Figure 2(d).

5.3 Speedup in Diverse Network Conditions
To better understand the influence of bandwidth and latency on speedup, we compare the time of one epoch under various network conditions. Figure 3(a, b) shows the trend of epoch time as bandwidth decreases from 1.4 Gbps to 5 Mbps. When the latency is low (Figure 3(a)), the low precision algorithm is faster than its full precision counterpart because it only needs to exchange around one fourth of the data exchanged by the full precision method. Note that, although decentralized, the full precision case has no advantage over allreduce in this situation, because they exchange exactly the same amount of data. For the high latency case shown in Figure 3(b), both the full and low precision cases are much better than allreduce in the beginning.
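The bandwidth/latency effects above can be summarized with a toy per-epoch communication cost model. All constants below (rounds per iteration, volume factors, iteration count) are hypothetical illustrations, not measurements, and real systems overlap computation with communication:

```python
def comm_time_per_iter(latency_s, bandwidth_bps, message_bits, rounds, volume_factor):
    """Toy model: each iteration pays `rounds` latency hits plus
    (volume_factor * message_bits) / bandwidth seconds of transfer time."""
    return rounds * latency_s + volume_factor * message_bits / bandwidth_bps

MODEL_BITS = 270_000 * 32  # ~0.27M parameters (ResNet-20) at 32-bit precision

def epoch_time(latency_s, bandwidth_bps, iters=390):
    # Hypothetical constants for 8 nodes: allreduce pays several
    # latency-bound rounds; a ring neighbor exchange pays one round;
    # 8-bit compression cuts the transferred volume by 4x.
    allreduce = iters * comm_time_per_iter(latency_s, bandwidth_bps,
                                           MODEL_BITS, rounds=6, volume_factor=2.0)
    decen_32  = iters * comm_time_per_iter(latency_s, bandwidth_bps,
                                           MODEL_BITS, rounds=1, volume_factor=2.0)
    decen_8   = iters * comm_time_per_iter(latency_s, bandwidth_bps,
                                           MODEL_BITS // 4, rounds=1, volume_factor=2.0)
    return allreduce, decen_32, decen_8

# High latency and low bandwidth: only decentralized 8-bit stays cheap.
print(epoch_time(latency_s=20e-3, bandwidth_bps=5e6))
```

Even this crude model reproduces the qualitative picture in Figure 3: decentralization removes the latency penalty, and compression removes the bandwidth penalty, so only the combination helps when both are bad.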
However, the full precision method degrades dramatically as the bandwidth declines.
Figure 3(c, d) shows how latency influences the epoch time under good and bad bandwidth conditions. When bandwidth is not the bottleneck (Figure 3(c)), the decentralized approaches with full and low precision have similar epoch times because they perform the same number of communication rounds. As expected, allreduce is slower in this case. When bandwidth is very low (Figure 3(d)), only the decentralized algorithm with low precision achieves the best performance among all implementations.

5.4 Discussion
Our previous experiments validate the efficiency of the decentralized algorithms on 8 nodes with 8 bits. However, we wonder whether we can scale to more nodes or compress the exchanged data even more aggressively. We first conducted experiments on 16 nodes with 8 bits as before. According to Figure 4(a), Alg. 1 and Alg. 2 on 16 nodes still achieve basically the same convergence rate as allreduce, which shows the scalability of our algorithms. However, they are not comparable to allreduce with 4 bits, as shown in Figure 4(b). What is noteworthy is that the two compression approaches behave quite differently at 4 bits. Alg. 1, although it converges much more slowly than allreduce, keeps reducing its training loss, whereas Alg. 2 diverges at the beginning of training. This observation is consistent with our theoretical analysis.

6 Conclusion
In this paper, we studied the problem of combining two techniques for training distributed stochastic gradient descent under imperfect network conditions: quantization and decentralization. We developed two novel algorithms for quantized, decentralized training, analyzed the theoretical properties of both algorithms, and empirically studied their performance in various settings of network conditions.
We found that when the underlying communication network has both high latency and low bandwidth, the quantized, decentralized algorithm significantly outperforms the other strategies.

Figure 4: Comparison of Alg. 1 and Alg. 2: (a) # epochs vs training loss on 16 nodes; (b) # epochs vs training loss (Algorithm 2 in 4 bits diverges).

Acknowledgments
Hanlin Tang and Ji Liu are in part supported by NSF CCF1718513, an IBM faculty award, and a NEC fellowship. CZ and the DS3Lab gratefully acknowledge the support from Mercedes-Benz Research & Development North America, Oracle Labs, Swisscom, Zurich Insurance, Chinese Scholarship Council, and the Department of Computer Science at ETH Zurich.

References
M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. TensorFlow: A system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.

A. Agarwal and J. C. Duchi. Distributed delayed stochastic optimization. In Advances in Neural Information Processing Systems, pages 873–881, 2011.

D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic. QSGD: Communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems, pages 1707–1718, 2017.

L. Bottou. Large-scale machine learning with stochastic gradient descent. Proc. of the International Conference on Computational Statistics (COMPSTAT), 2010.

S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah. Randomized gossip algorithms. IEEE/ACM Trans. Netw., 14(SI):2508–2530, June 2006. ISSN 1063-6692. doi: 10.1109/TIT.2006.874516.

T. Chen, M.
Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274, 2015.

I. Colin, A. Bellet, J. Salmon, and S. Clémençon. Gossip dual averaging for decentralized optimization of pairwise functions. In International Conference on Machine Learning, pages 1388–1396, 2016.

C. De Sa, M. Feldman, C. Ré, and K. Olukotun. Understanding and optimizing asynchronous low-precision stochastic gradient descent. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pages 561–574. ACM, 2017.

O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao. Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research, 13(Jan):165–202, 2012.

R. Dobbe, D. Fridovich-Keil, and C. Tomlin. Fully decentralized policies for multi-agent systems: An information theoretic approach. In Advances in Neural Information Processing Systems, pages 2945–2954, 2017.

M. Drumond, T. Lin, M. Jaggi, and B. Falsafi. Training DNNs with hybrid block floating point. In S. Bengio, H. M. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, NeurIPS 2018, Montréal, Canada, pages 451–461, 2018. URL http://papers.nips.cc/paper/7327-training-dnns-with-hybrid-block-floating-point.

S. Ghadimi and G. Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013. doi: 10.1137/120880811.

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

L.
He, A. Bian, and M. Jaggi. COLA: Decentralized linear learning. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 4541–4551. Curran Associates, Inc., 2018. URL http://papers.nips.cc/paper/7705-cola-decentralized-linear-learning.pdf.

A. Kashyap, T. Başar, and R. Srikant. Quantized consensus. Automatica, 43(7):1192–1203, 2007.

J. Konečný and P. Richtárik. Randomized distributed mean estimation: Accuracy vs communication. arXiv preprint arXiv:1611.07555, 2016.

G. Lan, S. Lee, and Y. Zhou. Communication-efficient algorithms for decentralized and stochastic optimization. 2017.

J. Lavaei and R. M. Murray. Quantized consensus by means of gossip algorithm. IEEE Transactions on Automatic Control, 57(1):19–32, 2012.

Z. Li, W. Shi, and M. Yan. A decentralized proximal-gradient method with network independent step-sizes and separated convergence rates. arXiv preprint arXiv:1704.07807, 2017.

X. Lian, Y. Huang, Y. Li, and J. Liu. Asynchronous parallel stochastic gradient for nonconvex optimization. In Advances in Neural Information Processing Systems, pages 2737–2745, 2015.

X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, and J. Liu. Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. 2017a.

X. Lian, W. Zhang, C. Zhang, and J. Liu. Asynchronous decentralized parallel stochastic gradient descent. 2017b.

T. Lin, S. U. Stich, and M. Jaggi. Don't use large mini-batches, use local SGD. CoRR, abs/1808.07217, 2018. URL http://arxiv.org/abs/1808.07217.

E. Mhamdi, E. Mahdi, H. Hendrikx, R. Guerraoui, and A. D. O. Maurer. Dynamic safe interruptibility for decentralized multi-agent reinforcement learning. Technical report, EPFL, 2017.

E. Moulines and F. R. Bach.
Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 451–459. Curran Associates, Inc., 2011.

A. Nedic and A. Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.

A. Nedic, A. Olshevsky, A. Ozdaglar, and J. N. Tsitsiklis. On distributed averaging algorithms and quantization effects. IEEE Transactions on Automatic Control, 54(11):2506–2517, 2009.

A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009. doi: 10.1137/070704277.

S. Omidshafiei, J. Pazis, C. Amato, J. P. How, and J. Vian. Deep decentralized multi-task multi-agent RL under partial observability. arXiv preprint arXiv:1703.06182, 2017.

B. Recht, C. Re, S. Wright, and F. Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 693–701, 2011.

F. Seide and A. Agarwal. CNTK: Microsoft's open-source deep-learning toolkit. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, pages 2135–2135, New York, NY, USA, 2016. ACM. ISBN 978-1-4503-4232-2. doi: 10.1145/2939672.2945397.

W. Shi, Q. Ling, G. Wu, and W. Yin. A proximal gradient algorithm for decentralized composite optimization. IEEE Transactions on Signal Processing, 63(22):6013–6023, 2015.

B. Sirb and X. Ye. Consensus optimization with delayed and stochastic gradients on decentralized networks. In 2016 IEEE International Conference on Big Data (Big Data), pages 76–85, 2016.

S. U. Stich.
Local SGD converges fast and communicates little. arXiv preprint arXiv:1805.00982, 2018.

S. U. Stich, J.-B. Cordonnier, and M. Jaggi. Sparsified SGD with memory. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 4452–4463. Curran Associates, Inc., 2018. URL http://papers.nips.cc/paper/7697-sparsified-sgd-with-memory.pdf.

A. T. Suresh, F. X. Yu, S. Kumar, and H. B. McMahan. Distributed mean estimation with limited communication. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3329–3337, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR. URL http://proceedings.mlr.press/v70/suresh17a.html.

J. Wangni, J. Wang, J. Liu, and T. Zhang. Gradient sparsification for communication-efficient distributed optimization. arXiv preprint arXiv:1710.09854, 2017.

K. Yuan, Q. Ling, and W. Yin. On the convergence of decentralized gradient descent. SIAM Journal on Optimization, 26(3):1835–1854, 2016. doi: 10.1137/130943170.

H. Zhang, J. Li, K. Kara, D. Alistarh, J. Liu, and C. Zhang. ZipML: Training linear models with end-to-end low precision, and a little bit of deep learning. In International Conference on Machine Learning, pages 4035–4043, 2017a.

W. Zhang, P. Zhao, W. Zhu, S. C. Hoi, and T. Zhang. Projection-free distributed online learning in networks. In International Conference on Machine Learning, pages 4054–4062, 2017b.

L. Zhao and W. Song. Decentralized consensus in distributed networks. International Journal of Parallel, Emergent and Distributed Systems, pages 1–20, 2016.

S. Zheng, Q. Meng, T. Wang, W. Chen, N. Yu, Z.-M. Ma, and T.-Y. Liu.
Asynchronous stochastic gradient descent with delay compensation for distributed deep learning. arXiv preprint arXiv:1609.08326, 2016.

M. Zinkevich, M. Weimer, L. Li, and A. J. Smola. Parallelized stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 2595–2603, 2010.