{"title": "Communication-Efficient Distributed Learning via Lazily Aggregated Quantized Gradients", "book": "Advances in Neural Information Processing Systems", "page_first": 3370, "page_last": 3380, "abstract": "The present paper develops a novel aggregated gradient approach for distributed machine learning that adaptively compresses the gradient communication. The key idea is to first quantize the computed gradients, and then skip less informative quantized gradient communications by reusing outdated gradients. Quantizing and skipping result in 'lazy' worker-server communications, which justifies the term Lazily Aggregated Quantized gradient that is henceforth abbreviated as LAQ. Our LAQ can provably attain the same linear convergence rate as the gradient descent in the strongly convex case, while effecting major savings in the communication overhead both in transmitted bits as well as in communication rounds. Empirically, experiments with real data corroborate a significant communication reduction compared to existing gradient- and stochastic gradient-based algorithms.", "full_text": "Communication-Ef\ufb01cient Distributed Learning via\n\nLazily Aggregated Quantized Gradients\n\nJun Sun\u2020\n\nZhejiang University\n\nHangzhou, China 310027\nsunjun16sj@gmail.com\n\nGeorgios B. Giannakis\n\nUniversity of Minnesota, Twin Cities\n\nMinneapolis, MN 55455\n\ngeorgios@umn.edu\n\nTianyi Chen\u2020\n\nRensselaer Polytechnic Institute\n\nTroy, New York 12180\n\nchent18@rpi.edu\n\nZaiyue Yang\n\nSouthern U. of Science and Technology\n\nShenzhen, China 518055\nyangzy3@sustc.edu.cn\n\nAbstract\n\nThe present paper develops a novel aggregated gradient approach for distributed\nmachine learning that adaptively compresses the gradient communication. The\nkey idea is to \ufb01rst quantize the computed gradients, and then skip less informative\nquantized gradient communications by reusing outdated gradients. 
Quantizing and skipping result in 'lazy' worker-server communications, which justifies the term Lazily Aggregated Quantized gradient, henceforth abbreviated as LAQ. Our LAQ can provably attain the same linear convergence rate as gradient descent in the strongly convex case, while effecting major savings in communication overhead, both in transmitted bits and in communication rounds. Empirically, experiments with real data corroborate a significant communication reduction compared to existing gradient- and stochastic gradient-based algorithms.

1 Introduction

Considering the massive number of mobile devices, centralized machine learning via cloud computing incurs considerable communication overhead, and raises serious privacy concerns. Today, the widespread consensus is that besides in the cloud centers, future machine learning tasks have to be performed starting from the network edge, namely devices [17, 19]. Typically, distributed learning tasks can be formulated as an optimization problem of the form

$$\min_{\theta} \sum_{m \in \mathcal{M}} f_m(\theta) \quad \text{with} \quad f_m(\theta) := \sum_{n=1}^{N_m} \ell(x_{m,n}; \theta) \qquad (1)$$

where $\theta \in \mathbb{R}^p$ denotes the parameter to be learned, $\mathcal{M}$ with $|\mathcal{M}| = M$ denotes the set of workers, $x_{m,n}$ represents the $n$-th data vector at worker $m$ (e.g., feature and label), and $N_m$ is the number of data samples at worker $m$. In (1), $\ell(x; \theta)$ denotes the loss associated with $\theta$ and $x$, and $f_m(\theta)$ denotes the aggregated loss corresponding to $\theta$ and all data at worker $m$. 
For the ease of exposition, we also define $f(\theta) := \sum_{m \in \mathcal{M}} f_m(\theta)$ as the overall loss function.

In the commonly employed worker-server setup, the server collects local gradients from the workers and updates the parameter using a gradient descent (GD) iteration given by

$$\theta^{k+1} = \theta^k - \alpha \sum_{m \in \mathcal{M}} \nabla f_m(\theta^k) \qquad \text{(GD iteration)} \qquad (2)$$

where $\theta^k$ denotes the parameter value at iteration $k$, $\alpha$ is the stepsize, and $\nabla f(\theta^k) = \sum_{m \in \mathcal{M}} \nabla f_m(\theta^k)$ is the aggregated gradient. When the data samples are distributed across workers, each worker computes the corresponding local gradient $\nabla f_m(\theta^k)$ and uploads it to the server; only after all the local gradients are collected can the server obtain the full gradient and update the parameter. To implement (2), however, the server has to communicate with all workers to obtain fresh gradients $\{\nabla f_m(\theta^k)\}_{m=1}^M$. In several settings though, communication is much slower than computation [16]. Thus, as the number of workers grows, worker-server communications become the bottleneck [10]. This becomes more challenging for popular deep learning models with high-dimensional parameters, and correspondingly large-scale gradients.

† Jun Sun and Tianyi Chen contributed equally to this work.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

1.1 Prior art

Communication-efficient distributed learning methods have gained popularity recently [10, 22]. Most popular methods build on simple gradient updates, and are centered around the key idea of gradient compression to save communication, including gradient quantization and sparsification.

Quantization. Quantization compresses gradients by limiting the number of bits used to represent floating point numbers during communication, and has been successfully applied to several engineering tasks employing wireless sensor networks [21]. 
In the context of distributed machine learning, a 1-bit binary quantization method has been developed in [5, 24]. Multi-bit quantization schemes have been studied in [2, 18], where an adjustable quantization level can endow additional flexibility to control the tradeoff between the per-iteration communication cost and the convergence rate. Other variants of quantized gradient schemes include error compensation [32], variance-reduced quantization [34], quantization to a ternary vector [31], and quantization of the gradient difference [20].

Sparsification. Sparsification amounts to transmitting only gradient coordinates with large enough magnitudes exceeding a certain threshold [27]. Empirically, the desired accuracy can be attained even after dropping 99% of the gradients [1]. To avoid losing information, small gradient components are accumulated and then applied when they are large enough. The accumulated gradient offers variance reduction of the sparsified stochastic (S)GD iterates [12, 26]. Despite their impressive empirical performance, deterministic sparsification schemes lack performance analysis guarantees, recent efforts [3] notwithstanding. Randomized counterparts with so-termed unbiased sparsification, however, have been developed to offer convergence guarantees [28, 30].

Quantization and sparsification have also been employed simultaneously [9, 13, 14]. Nevertheless, both introduce noise to (S)GD updates, and thus deteriorate convergence in general. For problems with strongly convex losses, gradient compression algorithms either converge to a neighborhood of the optimal solution, or converge at a sublinear rate. The exception is [18], where the first linear convergence rate has been established for quantized gradient-based approaches. 
However, [18] only focuses on reducing the required bits per communication round, not the total number of rounds. Yet for exchanging messages, e.g., the $p$-dimensional $\theta$ or its gradient, other latencies (initiating communication links, queueing, and propagating the message) are at least comparable to the message-size-dependent transmission latency [23]. This motivates reducing the number of communication rounds, sometimes even more so than the bits per round.

Distinct from the aforementioned gradient compression schemes, communication-efficient schemes that aim to reduce the number of communication rounds have been developed by leveraging higher-order information [25, 36], periodic aggregation [19, 33, 35], and recently adaptive aggregation [6, 7, 11, 29]; see also [4] for a lower bound on communication rounds. However, whether we can save communication bits and rounds simultaneously without sacrificing the desired convergence properties remains unresolved. This paper aims to address this issue.

1.2 Our contributions

Before introducing our approach, we revisit the canonical form of popular quantized (Q) GD methods [24]-[20] in the simple setup of (1) with one server and $M$ workers:

$$\theta^{k+1} = \theta^k - \alpha \sum_{m \in \mathcal{M}} Q_m(\theta^k) \qquad \text{(QGD iteration)} \qquad (3)$$

where $Q_m(\theta^k)$ is the quantized gradient that coarsely approximates the local gradient $\nabla f_m(\theta^k)$. While the exact quantization scheme differs across algorithms, transmitting $Q_m(\theta^k)$ generally requires fewer bits than transmitting $\nabla f_m(\theta^k)$. Similar to GD, however, the server can update the parameter $\theta$ only after all the local quantized gradients $\{Q_m(\theta^k)\}$ are collected.

In this context, the present paper puts forth a quantized gradient innovation method (as simple as QGD) that can skip communication in certain rounds. 
Specifically, in contrast to the server-to-worker downlink communication that can be performed simultaneously (e.g., by broadcasting $\theta^k$), the server has to receive the workers' gradients sequentially to avoid interference from other workers, which leads to extra latency. For this reason, our focus here is on reducing the number of worker-to-server uplink communications, which we will also refer to as uploads. Our algorithm, Lazily Aggregated Quantized gradient descent (LAQ), resembles (3), and is given by

$$\theta^{k+1} = \theta^k - \alpha \nabla^k \quad \text{with} \quad \nabla^k = \nabla^{k-1} + \sum_{m \in \mathcal{M}^k} \delta Q_m^k \qquad \text{(LAQ iteration)} \qquad (4)$$

where $\nabla^k$ is an approximate aggregated gradient that summarizes the parameter change at iteration $k$, and $\delta Q_m^k := Q_m(\theta^k) - Q_m(\hat{\theta}_m^{k-1})$ is the difference between two quantized gradients of $f_m$ at the current iterate $\theta^k$ and the old copy $\hat{\theta}_m^{k-1}$. With a judicious selection criterion that will be introduced later, $\mathcal{M}^k$ denotes the subset of workers whose local $\delta Q_m^k$ is uploaded in iteration $k$, while parameter iterates are given by $\hat{\theta}_m^k := \theta^k$, $\forall m \in \mathcal{M}^k$, and $\hat{\theta}_m^k := \hat{\theta}_m^{k-1}$, $\forall m \notin \mathcal{M}^k$. Instead of requesting a fresh quantized gradient from every worker as in (3), the trick is to obtain $\nabla^k$ by refining the previous aggregated gradient $\nabla^{k-1}$; that is, using only the new gradients from the selected workers in $\mathcal{M}^k$, while reusing the outdated gradients from the rest of the workers. If $\nabla^{k-1}$ is stored at the server, this simple modification scales down the per-iteration communication rounds from QGD's $M$ to LAQ's $|\mathcal{M}^k|$. 
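As a concrete illustration, the server-side bookkeeping implied by (4) fits in a few lines. The sketch below is a minimal illustration under our own naming conventions (not the authors' implementation): `innovations` maps each selected worker in $\mathcal{M}^k$ to its received $\delta Q_m^k$; absent workers implicitly contribute zero, i.e., their old quantized gradient is reused.

```python
import numpy as np

def laq_server_step(theta, nabla_prev, innovations, alpha):
    """One LAQ server iteration, cf. (4).

    theta       : current parameter vector theta^k
    nabla_prev  : stored aggregate gradient nabla^{k-1}
    innovations : dict {m: delta_Q_m^k} received from selected workers
    alpha       : stepsize
    """
    # Refine the previous aggregate instead of recomputing it from scratch.
    nabla = nabla_prev + sum(innovations.values())
    theta_next = theta - alpha * nabla
    return theta_next, nabla
```

When no worker is selected, `innovations` is empty and the previous aggregate is reused unchanged, which is exactly the "lazy" behavior of (4).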
Throughout the paper, one round of communication means one worker's upload.

Compared to existing quantization schemes, LAQ first quantizes the gradient innovation, i.e., the difference between the current gradient and the previous quantized gradient, and then skips the gradient communication: if the gradient innovation of a worker is not large enough, that worker's upload is skipped. We will rigorously establish that LAQ achieves the same linear convergence as GD under the strongly convex assumption on the loss function. Numerical tests will demonstrate that our approach outperforms existing methods in terms of both communication bits and rounds.

Notation. Bold lowercase letters denote column vectors; $\|x\|_2$ and $\|x\|_\infty$ denote the $\ell_2$-norm and $\ell_\infty$-norm of $x$, respectively; $[x]_i$ represents the $i$-th entry of $x$; $\lfloor a \rfloor$ denotes downward rounding of $a$; and $|\cdot|$ denotes the cardinality of a set or vector.

2 LAQ: Lazily aggregated quantized gradient

To reduce the communication overhead, two complementary stages are integrated in our algorithm design: 1) gradient innovation-based quantization; and 2) gradient innovation-based uploading or aggregation, giving the name Lazily Aggregated Quantized gradient (LAQ). The former reduces the number of bits per upload, while the latter cuts down the number of uploads, which together guarantee parsimonious communication. This section explains the principles of our two-stage design.

2.1 Gradient innovation-based quantization

Quantization limits the number of bits used to represent a gradient vector during communication. 
Suppose we use $b$ bits to quantize each coordinate of the gradient vector, in contrast to the 32 bits used in most computers. With $Q$ denoting the quantization operator, the quantized gradient for worker $m$ at iteration $k$ is $Q_m(\theta^k) = Q(\nabla f_m(\theta^k), Q_m(\hat{\theta}_m^{k-1}))$, which depends on the gradient $\nabla f_m(\theta^k)$ and the previous quantization $Q_m(\hat{\theta}_m^{k-1})$. The gradient is element-wise quantized by projecting to the closest point in a uniformly discretized grid. The grid is a $p$-dimensional hypercube centered at $Q_m(\hat{\theta}_m^{k-1})$ with radius $R_m^k = \|\nabla f_m(\theta^k) - Q_m(\hat{\theta}_m^{k-1})\|_\infty$. With $\tau := 1/(2^b - 1)$ defining the quantization granularity, the gradient innovation $\nabla f_m(\theta^k) - Q_m(\hat{\theta}_m^{k-1})$ can be quantized with $b$ bits per coordinate at worker $m$ as:

$$[q_m(\theta^k)]_i = \left\lfloor \frac{[\nabla f_m(\theta^k)]_i - [Q_m(\hat{\theta}_m^{k-1})]_i + R_m^k}{2\tau R_m^k} + \frac{1}{2} \right\rfloor, \quad i = 1, \cdots, p \qquad (5)$$

which is an integer within $[0, 2^b - 1]$, and thus can be encoded with $b$ bits. Note that adding $R_m^k$ in the numerator ensures the non-negativity of $[q_m(\theta^k)]_i$, and adding $1/2$ in (5) guarantees rounding to the closest point. Hence, the quantized gradient innovation at worker $m$ is (with $\mathbf{1} := [1, \cdots, 1]^\top$)

$$\delta Q_m^k = Q_m(\theta^k) - Q_m(\hat{\theta}_m^{k-1}) = 2\tau R_m^k\, q_m(\theta^k) - R_m^k \mathbf{1}: \text{ transmit } R_m^k \text{ and } q_m(\theta^k) \qquad (6)$$

which can be transmitted with $32 + bp$ bits (32 bits for $R_m^k$ and $bp$ bits for $q_m(\theta^k)$) instead of the original $32p$ bits. With the outdated gradient $Q_m(\hat{\theta}_m^{k-1})$ stored in memory and $\tau$ known a priori, after receiving $\delta Q_m^k$ the server can recover the quantized gradient as $Q_m(\theta^k) = Q_m(\hat{\theta}_m^{k-1}) + \delta Q_m^k$.

Figure 1: Quantization example ($b = 3$)

Figure 1 gives an example of quantizing one coordinate of the gradient with $b = 3$ bits. 
The original value is quantized with 3 bits into $2^3 = 8$ values, each of which covers a range of length $2\tau R_m^k$ centered at itself. With $\varepsilon_m^k := \nabla f_m(\theta^k) - Q_m(\theta^k)$ denoting the local quantization error, it is clear that the quantization error is at most half the length of the range that each value covers, namely, $\|\varepsilon_m^k\|_\infty \le \tau R_m^k$. The aggregated quantized gradient is $Q(\theta^k) = \sum_{m \in \mathcal{M}} Q_m(\theta^k)$, and the aggregated quantization error is $\varepsilon^k := \nabla f(\theta^k) - Q(\theta^k) = \sum_{m=1}^M \varepsilon_m^k$; that is, $Q(\theta^k) = \nabla f(\theta^k) - \varepsilon^k$.

2.2 Gradient innovation-based aggregation

The idea of lazy gradient aggregation is that if the difference between two consecutive locally quantized gradients is small, it is safe to skip the redundant gradient upload and reuse the previous one at the server. In addition, we ensure the server has a relatively "fresh" gradient for each worker by enforcing communication if a worker has not uploaded during the last $\bar{t}$ rounds. We set a clock $t_m$, $m \in \mathcal{M}$, for worker $m$, counting the number of iterations since the last time it uploaded information. Equipped with the quantization and selection, our LAQ update takes the form of (4).

Figure 2: Distributed learning via LAQ

Now it only remains to design the selection criterion that decides which workers upload their quantized gradient or its innovation. 
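Before specifying that criterion, the quantization stage of Section 2.1, i.e., (5)-(6), can be sketched as follows. This is a minimal illustration under assumed names, not the authors' implementation; a real deployment would additionally pack the integers into $b$-bit fields.

```python
import numpy as np

def quantize_innovation(grad, q_prev, b):
    """Quantize the innovation grad - q_prev with b bits per coordinate,
    following (5). Returns (R, q): one float plus p integers in
    [0, 2**b - 1], i.e. 32 + b*p bits on the wire."""
    tau = 1.0 / (2 ** b - 1)
    R = np.max(np.abs(grad - q_prev))          # radius R_m^k (infinity norm)
    if R == 0:                                 # innovation is exactly zero
        return 0.0, np.zeros_like(grad, dtype=int)
    q = np.floor((grad - q_prev + R) / (2 * tau * R) + 0.5).astype(int)
    return R, q

def dequantize(q_prev, R, q, b):
    """Server-side recovery via (6): Q_new = q_prev + 2*tau*R*q - R*1."""
    tau = 1.0 / (2 ** b - 1)
    return q_prev + 2 * tau * R * q - R
```

By construction the recovered value lies on the grid centered at `q_prev` with radius `R`, and the per-coordinate error is at most $\tau R_m^k$, matching the bound $\|\varepsilon_m^k\|_\infty \le \tau R_m^k$ above.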
We propose the following communication criterion: worker $m \in \mathcal{M}$ skips the upload at iteration $k$ if it satisfies

$$\|Q_m(\hat{\theta}_m^{k-1}) - Q_m(\theta^k)\|_2^2 \le \frac{1}{\alpha^2 M^2} \sum_{d=1}^{D} \xi_d \|\theta^{k+1-d} - \theta^{k-d}\|_2^2 + 3\left(\|\varepsilon_m^k\|_2^2 + \|\hat{\varepsilon}_m^{k-1}\|_2^2\right); \qquad (7a)$$

$$t_m \le \bar{t} \qquad (7b)$$

where $D \le \bar{t}$ and $\{\xi_d\}_{d=1}^D$ are predetermined constants, $\varepsilon_m^k$ is the current quantization error, and $\hat{\varepsilon}_m^{k-1} = \nabla f_m(\hat{\theta}_m^{k-1}) - Q_m(\hat{\theta}_m^{k-1})$ is the error of the last uploaded quantized gradient. In the next section we will prove the convergence and communication properties of LAQ under criterion (7).

2.3 LAQ algorithm development

In summary, as illustrated in Figure 2, LAQ can be implemented as follows. At iteration $k$, the server broadcasts the learning parameter to all workers. Each worker calculates its gradient, and then quantizes it to judge whether it needs to upload the quantized gradient innovation $\delta Q_m^k$. The server then updates the learning parameter after it receives the gradient innovations from the selected workers. The algorithm is summarized in Algorithm 2.

To make the difference between LAQ and GD clear, we re-write (4) as:

$$\theta^{k+1} = \theta^k - \alpha \Big[ Q(\theta^k) + \sum_{m \in \mathcal{M}_c^k} \big(Q_m(\hat{\theta}_m^{k-1}) - Q_m(\theta^k)\big) \Big] \qquad (8a)$$

$$= \theta^k - \alpha \Big[ \nabla f(\theta^k) - \varepsilon^k + \sum_{m \in \mathcal{M}_c^k} \big(Q_m(\hat{\theta}_m^{k-1}) - Q_m(\theta^k)\big) \Big] \qquad (8b)$$

where $\mathcal{M}_c^k := \mathcal{M} \backslash \mathcal{M}^k$ is the subset of workers that skip communication with the server at iteration $k$. Compared with the GD iteration in (2), the gradient employed here degrades due to the quantization error $\varepsilon^k$ and the missed gradient innovation $\sum_{m \in \mathcal{M}_c^k} (Q_m(\hat{\theta}_m^{k-1}) - Q_m(\theta^k))$. It is clear that if large
It is clear that if large\n\n(Qm(\u02c6\u2713\n\nk1\n\nc\n\n4\n\n\fAlgorithm 1 QGD\n1: Input: stepsize \u21b5> 0, quantization bit b.\n2: Initialize: \u2713k.\n3: for k = 1, 2,\u00b7\u00b7\u00b7 , K do\n4:\n5:\n6:\n7:\n8:\n9:\n10: end for\n\nServer broadcasts \u2713k to all workers.\nfor m = 1, 2,\u00b7\u00b7\u00b7 , M do\nWorker m computes rfm(\u2713k) and Qm(\u2713k).\nWorker m uploads Qk\n\nend for\nServer updates \u2713 following (4) with Mk = M.\n\nm via (6).\n\n0\n\nd=1 and \u00aft.\n\nm), tm}m2M.\n\nServer broadcasts \u2713k to all workers.\nfor m = 1, 2,\u00b7\u00b7\u00b7 , M do\n\nAlgorithm 2 LAQ\n1: Input: stepsize \u21b5> 0, b, D, {\u21e0d}D\n2: Initialize: \u2713k, and {Qm(\u02c6\u2713\n3: for k = 1, 2,\u00b7\u00b7\u00b7 , K do\n4:\n5:\n6:\n7:\n8:\n9:\n10:\n11:\n12:\n13:\n14:\n15:\n16: end for\n\nend if\nend for\nServer updates \u2713 according to (4).\n\nWorker m uploads Qk\nSet \u02c6\u2713\n\nk\nm = \u2713k, and tm = 0.\n\nm = \u02c6\u2713\n\nWorker m computes rfm(\u2713k) and Qm(\u2713k).\nif (7) holds for worker m then\nWorker m uploads nothing.\nSet \u02c6\u2713\n\nk1\nm and tm tm + 1.\n\nm via (6).\n\nelse\n\nk\n\nTable 1: A comparison of QGD and LAQ.\n\nenough number of bits are used to quantize the gradient, and all {\u21e0d}D\nd=1 are set 0 thus Mk := M,\nthen LAQ reduces to GD. Thus, adjusting b and {\u21e0d}D\nd=1 directly in\ufb02uences the performance of LAQ.\nThe rationale behind selection criterion (7) lies in the judicious comparison between the descent\namount of GD and that of LAQ. To compare the descent amount, we \ufb01rst establish the one step\ndescent amount of both algorithms. For all the results in this paper, the following assumption holds.\nAssumption 1. 
The local gradient $\nabla f_m(\cdot)$ is $L_m$-Lipschitz continuous and the global gradient $\nabla f(\cdot)$ is $L$-Lipschitz continuous; i.e., there exist constants $L_m$ and $L$ such that

$$\|\nabla f_m(\theta_1) - \nabla f_m(\theta_2)\|_2 \le L_m \|\theta_1 - \theta_2\|_2, \quad \forall \theta_1, \theta_2; \qquad (9a)$$
$$\|\nabla f(\theta_1) - \nabla f(\theta_2)\|_2 \le L \|\theta_1 - \theta_2\|_2, \quad \forall \theta_1, \theta_2. \qquad (9b)$$

Building upon Assumption 1, the next lemma describes the descent in the objective achieved by GD.

Lemma 1. The gradient descent update yields the following descent:

$$f(\theta^{k+1}) - f(\theta^k) \le \Delta_{\mathrm{GD}}^k, \quad \text{where } \Delta_{\mathrm{GD}}^k := -\Big(1 - \frac{\alpha L}{2}\Big) \alpha \|\nabla f(\theta^k)\|_2^2. \qquad (10)$$

The descent of LAQ differs from that of GD due to the quantization and selection, as specified in the following lemma.

Lemma 2. The LAQ update yields the following descent:

$$f(\theta^{k+1}) - f(\theta^k) \le \Delta_{\mathrm{LAQ}}^k + \alpha \|\varepsilon^k\|_2^2 + \alpha \Big\| \sum_{m \in \mathcal{M}_c^k} \big(Q_m(\hat{\theta}_m^{k-1}) - Q_m(\theta^k)\big) \Big\|_2^2 \qquad (11)$$

where $\Delta_{\mathrm{LAQ}}^k := -\frac{\alpha}{2} \|\nabla f(\theta^k)\|_2^2 + \big(\frac{L}{2} - \frac{1}{2\alpha}\big) \|\theta^{k+1} - \theta^k\|_2^2$.

In lazy aggregation, we consider only $\Delta_{\mathrm{LAQ}}^k$, with the quantization error in (11) ignored. A rigorous theorem characterizing LAQ with the quantization error taken into account will be established in the next section.

The following discussion conveys the intuition behind criterion (7a); it is not mathematically rigorous, but it explains the design. The lazy aggregation mechanism selects the quantized gradient innovation by judging its contribution to decreasing the loss function. 
LAQ is expected to be more communication-efficient than GD; that is, each upload should result in more descent, which translates to:

$$\frac{\Delta_{\mathrm{LAQ}}^k}{|\mathcal{M}^k|} \le \frac{\Delta_{\mathrm{GD}}^k}{M} \qquad (12)$$

which is tantamount to (see the derivations in the supplementary materials)

$$\|Q_m(\hat{\theta}_m^{k-1}) - Q_m(\theta^k)\|_2^2 \le \|\nabla f(\theta^k)\|_2^2 / (2M^2), \quad \forall m \in \mathcal{M}_c^k. \qquad (13)$$

However, checking (13) locally at each worker is impossible, because it requires the fully aggregated gradient $\nabla f(\theta^k)$, which is exactly what we want to avoid; moreover, reducing uploads would be pointless if the fully aggregated gradient had already been obtained. Therefore, we bypass directly calculating $\|\nabla f(\theta^k)\|_2^2$ using the approximation

$$\|\nabla f(\theta^k)\|_2^2 \approx \frac{2}{\alpha^2} \sum_{d=1}^{D} \xi_d \|\theta^{k+1-d} - \theta^{k-d}\|_2^2 \qquad (14)$$

where $\{\xi_d\}_{d=1}^D$ are constants. The fundamental reason why (14) holds is that $\nabla f(\theta^k)$ can be approximated by weighted previous gradients or parameter differences, since $f(\cdot)$ is $L$-smooth. Combining (13) and (14) leads to our communication criterion (7a), with the quantization error ignored.

We conclude this section with a comparison between LAQ and error-feedback (quantized) schemes.

Comparison with error-feedback schemes. Our LAQ approach is related to the error-feedback schemes of, e.g., [3, 12, 24, 26, 27, 32]. Both lines of approaches accumulate either errors or delayed innovations incurred by communication reduction (e.g., quantization, sparsification, or skipping), and upload them in the next communication round. However, the error-feedback schemes skip communicating certain entries of the gradient, yet communicate with all workers; LAQ skips communicating with certain workers, but communicates all (quantized) entries. The two methods are
The two methods are\nnot mutually exclusive, and can be used jointly.\n\n3 Convergence and communication analysis\n\nOur subsequent convergence analysis of LAQ relies on the following assumption on f (\u2713):\nAssumption 2. The function f (\u00b7) is \u00b5-strongly convex, e.g., there exists a constant \u00b5 > 0 such that\n(15)\n\nf (\u27131) f (\u27132) hrf (\u27132), \u27131 \u27132i +\n\n\u00b5\n2 k\u27131 \u27132k2\n\n2, 8\u27131, \u27132.\n\nWith \u2713\u21e4 denoting the optimal solution of (1), we de\ufb01ne Lyapunov function of LAQ as:\n\nV(\u2713k) = f (\u2713k) f (\u2713\u21e4) +\n\n\u21e0j\n\u21b5 k\u2713k+1d \u2713kdk2\n\n2\n\n(16)\n\nDXd=1\n\nDXj=d\n\nThe design of Lyapunov function V(\u2713) is coupled with the communication rule (7a) that contains\nparameter difference term. Intuitively, if no communication is being skipped at current iteration, LAQ\nbehaves like GD that decreases the objective residual in V(\u2713); if certain uploads are skipped, LAQ\u2019s\nrule (7a) guarantees the error of using stale gradients comparable to the parameter difference in V(\u2713)\nto ensure its descending. The following lemma captures the progress of the Lyapunov function.\nLemma 3. Under Assumptions 1 and 2, if the stepsize \u21b5 and the parameters {\u21e0d}D\n(with any 0 <\u21e2 1 < 1 and \u21e22 > 0)\n\nd=1 are selected as\n\nDXd=1\n\n,\n\n1\n\n4(1 + \u21e22)\n\n2(1 + \u21e21\n\nL 1 \u21e21\n\n2 )\n\u21e0d \uf8ff min\u21e2 1 \u21e21\n\u21e0d! ,\n\u21b5 \uf8ff min( 2\nDXd=1\nV(\u2713k+1) \uf8ff 1V(\u2713k) + Bhk\"kk2\n2 + Xm2Mk\n\n4(1 + \u21e22) \n\n2\n\n1\n\n2(1 + \u21e21\n\nL \nc\u21e3k\"k\nmk2\n\n2 ) \n\n\u21e0d!)\nDXd=1\n2\u2318i\n2 + k\u02c6\"k1\nm k2\n\nthen the Lyapunov function follows\n\n(17a)\n\n(17b)\n\n(18)\n\nwhere constants 0 < 1 < 1 and B > 0 depend on \u21b5 and {\u21e0d}; see details in supplementary materials.\nFor the tight analysis, (17) appear to be involved, but it admits simple choices. 
For example, when we choose $\rho_1 = 1/2$ and $\rho_2 = 1$, respectively, then $\xi_1 = \xi_2 = \cdots = \xi_D = \frac{1}{16D}$ and $\alpha = \frac{1}{8L}$ satisfy (17).

If the quantization error in (18) were null, Lemma 3 would readily imply that the Lyapunov function enjoys a linear convergence rate. In the following, we will demonstrate that under certain conditions, the LAQ algorithm still guarantees linear convergence even when the quantization error is taken into account.

Figure 4: Convergence of the loss function (logistic regression); panels: (a) loss vs. iteration, (b) loss vs. communication, (c) loss vs. bits.

Figure 5: Convergence of gradient norm (neural network); panels: (a) gradient vs. iteration, (b) gradient vs. communication, (c) gradient vs. bits.

Theorem 1. Under the same assumptions and parameters as in Lemma 3, the Lyapunov function and the quantization error converge at a linear rate; that is, there exists a constant $\sigma_2 \in (0, 1)$ such that

$$\mathbb{V}(\theta^k) \le \sigma_2^k P; \qquad (19a)$$
$$\|\varepsilon_m^k\|_\infty^2 \le \tau^2 \sigma_2^k P, \quad \forall m \in \mathcal{M} \qquad (19b)$$

where $P$ is a constant depending on the parameters in (17); see details in the supplementary materials.

From the definition of the Lyapunov function, it is clear that $f(\theta^k) - f(\theta^*) \le \mathbb{V}(\theta^k) \le \sigma_2^k \mathbb{V}^0$, so the risk error $f(\theta^k) - f(\theta^*)$ converges linearly. The $L$-smoothness yields $\|\nabla f(\theta^k)\|_2^2 \le 2L[f(\theta^k) - f(\theta^*)] \le 2L \sigma_2^k \mathbb{V}^0$, so the gradient norm $\|\nabla f(\theta^k)\|_2^2$ also converges linearly. Similarly, the $\mu$-strong convexity implies $\|\theta^k - \theta^*\|_2^2 \le \frac{2}{\mu}[f(\theta^k) - f(\theta^*)] \le \frac{2}{\mu} \sigma_2^k \mathbb{V}^0$, so $\|\theta^k - \theta^*\|_2^2$ converges linearly as well.

Compared to the previous analysis for LAG [6], the analysis for LAQ is more involved, since it needs to deal with not only outdated but also quantized (inexact) gradients. This modification deteriorates the monotonic property of the Lyapunov function in (18), which is the building block of the analysis in [6]. We tackle this issue by i) considering the outdated gradient in the quantization (6); and ii) incorporating the quantization error in the new selection criterion (7). 
As a result, Theorem 1 demonstrates that LAQ maintains the linear convergence rate even in the presence of quantization error. This is because the properly controlled quantization error itself converges at a linear rate; see the illustration in Figure 3.

Proposition 1. Under Assumption 1, if we choose the constants $\{\xi_d\}_{d=1}^D$ satisfying $\xi_1 \ge \xi_2 \ge \cdots \ge \xi_D$ and define $d_m$, $m \in \mathcal{M}$, as:

$$d_m := \max \left\{ d \,\middle|\, L_m^2 \le \xi_d / (3\alpha^2 M^2 D), \; d \in \{1, 2, \cdots, D\} \right\} \qquad (20)$$

then worker $m$ has at most $k/(d_m + 1)$ communications with the server up to the $k$-th iteration.

This proposition implies that the smoothness of the local loss function determines the communication intensity of the local worker.

Figure 3: Gradient norm 
decay.

Figure 6: Test accuracies on three different datasets; panels: (a) MNIST, (b) ijcnn1, (c) covtype.

4 Numerical tests and conclusions

To validate our performance analysis and verify the communication savings in practical machine learning problems, we evaluate the algorithm on regularized logistic regression, which is strongly convex, and on a neural network, which is nonconvex. The dataset we use is MNIST [15], uniformly distributed across $M = 10$ workers. 
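In the implementation, the worker-side check of the skip rule (7a)-(7b) amounts to only a few lines. The sketch below is our own illustration with assumed names (not the authors' released code); it takes the last $D$ parameter differences, the two quantization errors, and the worker's clock as inputs:

```python
import numpy as np

def should_skip(q_new, q_old, theta_diffs, xi, alpha, M,
                eps_new, eps_old, t_m, t_bar):
    """Worker-side check of rule (7): skip the upload iff the quantized
    gradient innovation is small relative to recent parameter progress
    (7a), and the staleness clock has not exceeded t_bar (7b).

    theta_diffs : last D differences theta^{k+1-d} - theta^{k-d}
    xi          : the constants {xi_d}
    eps_new/old : quantization errors eps_m^k and eps_m^{k-1} (hat)
    """
    rhs = sum(x * np.sum(d ** 2) for x, d in zip(xi, theta_diffs))
    rhs /= alpha ** 2 * M ** 2
    rhs += 3.0 * (np.sum(eps_new ** 2) + np.sum(eps_old ** 2))
    lhs = np.sum((q_old - q_new) ** 2)          # innovation magnitude
    return lhs <= rhs and t_m <= t_bar
```

The rule is fully local: each worker only needs its own quantized gradients and errors, the broadcast parameter history, and its clock $t_m$.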
In the experiments, we set $D = 10$, $\xi_1 = \xi_2 = \cdots = \xi_D = 0.8/D$, and $\bar{t} = 100$; see the detailed setup in the supplementary materials. To benchmark LAQ, we compare it with two classes of algorithms, gradient-based algorithms and minibatch stochastic gradient-based algorithms, corresponding to the following two tests.

Figure 7: Convergence of loss function (logistic regression); panels: (a) loss vs. iteration, (b) loss vs. communication, (c) loss vs. bits.

Gradient-based tests. We consider GD, QGD [18] and lazily aggregated gradient (LAG) [6]. The number of bits per coordinate is set as $b = 3$ for logistic regression and $b = 8$ for the neural network. The stepsize is $\alpha = 0.02$ for all algorithms. Figure 4 shows the objective convergence for the logistic regression task. Clearly, Figure 4(a) verifies Theorem 1, e.g., the linear convergence rate under a strongly convex loss function. As shown in Figure 4(b), LAQ requires fewer communication rounds than GD and QGD thanks to our selection rule, but more rounds than LAG due to the gradient quantization. Nevertheless, the total number of transmitted bits of LAQ is significantly smaller than that of LAG, as demonstrated in Figure 4(c). For the neural network model, Figure 5 reports the convergence of the gradient norm, where LAQ also shows competitive performance for the nonconvex problem.

| Algorithm | Model          | Iteration # | Communication # | Bit #      | Accuracy |
|-----------|----------------|-------------|-----------------|------------|----------|
| LAQ       | logistic       | 2673        | 620             | 1.95×10^7  | 0.9082   |
| LAQ       | neural network | 8000        | 31845           | 4.05×10^10 | 0.9433   |
| GD        | logistic       | 2820        | 28200           | 7.08×10^9  | 0.9082   |
| GD        | neural network | 8000        | 80000           | 4.07×10^11 | 0.9433   |
| QGD       | logistic       | 2805        | 28050           | 8.81×10^8  | 0.9082   |
| QGD       | neural network | 8000        | 80000           | 1.02×10^11 | 0.9433   |
| LAG       | logistic       | 2659        | 2382            | 5.98×10^8  | 0.9082   |
| LAG       | neural network | 8000        | 29916           | 1.52×10^11 | 0.9433   |

Table 2: Comparison of gradient-based algorithms. 
For logistic regression, all algorithms terminate when the loss residual reaches $10^{-6}$; for the neural network, all algorithms run a fixed number of iterations.

Figure 8: Convergence of loss function (neural network); panels: (a) loss vs. iteration, (b) loss vs. communication, (c) loss vs. bits.

Similar to the results for the logistic model, LAQ requires the fewest bits. Table 2 summarizes the number of iterations, uploads, and bits needed to reach a given accuracy. Figure 6 exhibits the test accuracy of the compared algorithms on three commonly used datasets, MNIST, ijcnn1, and covtype. Applied to all these datasets, LAQ saves transmitted bits while maintaining the same accuracy.

Stochastic gradient-based tests. We test stochastic gradient descent (SGD), quantized stochastic gradient descent (QSGD) [2], sparsified stochastic gradient descent (SSGD) [30], and the stochastic version of LAQ, abbreviated as SLAQ. 
The mini-batch size is 500, α = 0.008, and the number of bits per coordinate is set as b = 3 for logistic regression and b = 8 for the neural network. As shown in Figures 7 and 8, SLAQ requires the fewest communication rounds and bits. In this stochastic gradient test, the communication reduction of SLAQ over its baselines is not as pronounced as that of LAQ over the gradient-based algorithms, yet SLAQ still outperforms the state-of-the-art algorithms, e.g., QSGD and SSGD. The results are summarized in Table 3. More results under different numbers of bits and levels of heterogeneity are reported in the supplementary materials.

Algorithm | Model          | Iteration # | Communication # | Bit #        | Accuracy
SLAQ      | logistic       | 1000        | 8255            | 1.94 × 10^8  | 0.9018
SLAQ      | neural network | 1500        | 11192           | 1.42 × 10^10 | 0.9107
SGD       | logistic       | 1000        | 10000           | 2.51 × 10^9  | 0.9021
SGD       | neural network | 1500        | 15000           | 7.63 × 10^10 | 0.9100
QSGD      | logistic       | 1000        | 10000           | 7.51 × 10^8  | 0.9021
QSGD      | neural network | 1500        | 15000           | 2.03 × 10^10 | 0.9100
SSGD      | logistic       | 1000        | 10000           | 1.26 × 10^9  | 0.9013
SSGD      | neural network | 1500        | 15000           | 3.82 × 10^10 | 0.9104

Table 3: Performance comparison of mini-batch stochastic gradient-based algorithms.

This paper studied the communication-efficient distributed learning problem and proposed LAQ, which simultaneously quantizes gradients and skips communication based on the gradient innovation. Compared to the original GD method, LAQ maintains the linear convergence rate for strongly convex loss functions, which is remarkable since LAQ significantly reduces both communication bits and rounds. Numerical tests using (strongly convex) regularized logistic regression and (nonconvex) neural network models demonstrate the advantages of LAQ over existing popular approaches.

Acknowledgments

This work by J. Sun and Z.
Yang is supported in part by the Shenzhen Committee on Science and Innovations under Grant GJHZ20180411143603361, in part by the Department of Science and Technology of Guangdong Province under Grant 2018A050506003, and in part by the Natural Science Foundation of China under Grant 61873118. The work by J. Sun is also supported by the China Scholarship Council. The work by G. Giannakis is supported in part by NSF grants 1500713 and 1711471.

References

[1] Alham Fikri Aji and Kenneth Heafield. Sparse communication for distributed gradient descent. In Proc. Conf. Empi. Meth. Natural Language Process., Copenhagen, Denmark, Sep 2017.

[2] Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. QSGD: Communication-efficient SGD via gradient quantization and encoding. In Proc. Advances in Neural Info. Process. Syst., pages 1709–1720, Long Beach, CA, Dec 2017.

[3] Dan Alistarh, Torsten Hoefler, Mikael Johansson, Nikola Konstantinov, Sarit Khirirat, and Cédric Renggli. The convergence of sparsified gradient methods. In Proc. Advances in Neural Info. Process. Syst., pages 5973–5983, Montreal, Canada, Dec 2018.

[4] Yossi Arjevani and Ohad Shamir. Communication complexity of distributed convex learning and optimization. In Proc. Advances in Neural Info. Process.
Syst., pages 1756–1764, Montreal, Canada, Dec 2015.

[5] Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, and Animashree Anandkumar. SignSGD: Compressed optimisation for non-convex problems. In Proc. Intl. Conf. Machine Learn., pages 559–568, Stockholm, Sweden, Jul 2018.

[6] Tianyi Chen, Georgios Giannakis, Tao Sun, and Wotao Yin. LAG: Lazily aggregated gradient for communication-efficient distributed learning. In Proc. Advances in Neural Info. Process. Syst., pages 5050–5060, Montreal, Canada, Dec 2018.

[7] Tianyi Chen, Kaiqing Zhang, Georgios Giannakis, and Tamer Başar. Communication-efficient distributed reinforcement learning. IEEE Trans. on Automatic Control, submitted April 2019. arXiv preprint:1812.03239.

[8] Mert Gurbuzbalaban, Asuman Ozdaglar, and Pablo A Parrilo. On the convergence rate of incremental aggregated gradient algorithms. SIAM Journal on Optimization, 27(2):1035–1048, 2017.

[9] Peng Jiang and Gagan Agrawal. A linear speedup analysis of distributed deep learning with sparse and quantized communication. In Proc. Advances in Neural Info. Process. Syst., pages 2525–2536, Montreal, Canada, Dec 2018.

[10] Michael I Jordan, Jason D Lee, and Yun Yang. Communication-efficient distributed statistical inference. J. American Statistical Association, to appear, 2018.

[11] Michael Kamp, Linara Adilova, Joachim Sicking, Fabian Hüger, Peter Schlicht, Tim Wirtz, and Stefan Wrobel. Efficient decentralized deep learning by dynamic model averaging. In Euro. Conf. Machine Learn. Knowledge Disc. Data., pages 393–409, Dublin, Ireland, 2018.

[12] Sai Praneeth Karimireddy, Quentin Rebjock, Sebastian Stich, and Martin Jaggi. Error feedback fixes SignSGD and other gradient compression schemes. In Proc. Intl. Conf.
Machine Learn., pages 3252–3261, Long Beach, CA, Jun 2019.

[13] Jakub Konečný, H Brendan McMahan, Felix X Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: Strategies for improving communication efficiency. arXiv preprint:1610.05492, Oct 2016.

[14] Jakub Konečný and Peter Richtárik. Randomized distributed mean estimation: Accuracy vs communication. Frontiers in Applied Mathematics and Statistics, 4:62, Dec 2018.

[15] Yann LeCun, Corinna Cortes, and CJ Burges. MNIST handwritten digit database. AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2:18, 2010.

[16] Mu Li, David G Andersen, Alexander J Smola, and Kai Yu. Communication efficient distributed machine learning with the parameter server. In Proc. Advances in Neural Info. Process. Syst., pages 19–27, Montreal, Canada, Dec 2014.

[17] Xiangru Lian, Ce Zhang, Huan Zhang, Cho-Jui Hsieh, Wei Zhang, and Ji Liu. Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. In Proc. Advances in Neural Info. Process. Syst., pages 5330–5340, Long Beach, CA, Dec 2017.

[18] Sindri Magnússon, Hossein Shokri-Ghadikolaei, and Na Li. On maintaining linear convergence of distributed learning and optimization under limited communication. arXiv preprint:1902.11163, 2019.

[19] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Proc. Intl. Conf. Artificial Intell. and Stat., pages 1273–1282, Fort Lauderdale, FL, April 2017.

[20] Konstantin Mishchenko, Eduard Gorbunov, Martin Takáč, and Peter Richtárik. Distributed learning with compressed gradient differences. arXiv preprint:1901.09269, Jan 2019.

[21] Eric J Msechu and Georgios B Giannakis.
Sensor-centric data reduction for estimation with WSNs via censoring and quantization. IEEE Trans. Sig. Proc., 60(1):400–414, Jan 2011.

[22] Angelia Nedić, Alex Olshevsky, and Michael Rabbat. Network topology and communication-computation tradeoffs in decentralized optimization. Proceedings of the IEEE, 106(5):953–976, May 2018.

[23] Larry L Peterson and Bruce S Davie. Computer Networks: A Systems Approach. Morgan Kaufman, Burlington, MA, 2007.

[24] Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Proc. Conf. Intl. Speech Comm. Assoc., Singapore, Sept 2014.

[25] Ohad Shamir, Nati Srebro, and Tong Zhang. Communication-efficient distributed optimization using an approximate Newton-type method. In Proc. Intl. Conf. Machine Learn., pages 1000–1008, Beijing, China, Jun 2014.

[26] Sebastian U. Stich, Jean-Baptiste Cordonnier, and Martin Jaggi. Sparsified SGD with memory. In Proc. Advances in Neural Info. Process. Syst., pages 4447–4458, Montreal, Canada, Dec 2018.

[27] Nikko Strom. Scalable distributed DNN training using commodity GPU cloud computing. In Proc. Conf. Intl. Speech Comm. Assoc., Dresden, Germany, Sept 2015.

[28] Hongyi Wang, Scott Sievert, Shengchao Liu, Zachary Charles, Dimitris Papailiopoulos, and Stephen Wright. ATOMO: Communication-efficient learning via atomic sparsification. In Proc. Advances in Neural Info. Process. Syst., pages 9850–9861, Montreal, Canada, Dec 2018.

[29] Jianyu Wang and Gauri Joshi. Cooperative SGD: A unified framework for the design and analysis of communication-efficient SGD algorithms. arXiv preprint:1808.07576, August 2018.

[30] Jianqiao Wangni, Jialei Wang, Ji Liu, and Tong Zhang. Gradient sparsification for communication-efficient distributed optimization. In Proc. Advances in Neural Info.
Process. Syst., pages 1299–1309, Montreal, Canada, Dec 2018.

[31] Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. TernGrad: Ternary gradients to reduce communication in distributed deep learning. In Proc. Advances in Neural Info. Process. Syst., pages 1509–1519, Long Beach, CA, Dec 2017.

[32] Jiaxiang Wu, Weidong Huang, Junzhou Huang, and Tong Zhang. Error compensated quantized SGD and its applications to large-scale distributed optimization. arXiv preprint:1806.08054, 2018.

[33] Hao Yu and Rong Jin. On the computation and communication complexity of parallel SGD with dynamic batch sizes for stochastic non-convex optimization. In Proc. Intl. Conf. Machine Learn., Long Beach, CA, Jun 2019.

[34] Hantian Zhang, Jerry Li, Kaan Kara, Dan Alistarh, Ji Liu, and Ce Zhang. ZipML: Training linear models with end-to-end low precision, and a little bit of deep learning. In Proc. Intl. Conf. Machine Learn., pages 4035–4043, Sydney, Australia, Aug 2017.

[35] Sixin Zhang, Anna E Choromanska, and Yann LeCun. Deep learning with elastic averaging SGD. In Proc. Advances in Neural Info. Process. Syst., pages 685–693, Montreal, Canada, Dec 2015.

[36] Yuchen Zhang and Xiao Lin. DiSCO: Distributed optimization for self-concordant empirical loss. In Proc. Intl. Conf. Machine Learn., pages 362–370, Lille, France, June 2015.