{"title": "Robust and Communication-Efficient Collaborative Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 8388, "page_last": 8399, "abstract": "We consider a decentralized learning problem, where a set of computing nodes aim at solving a non-convex optimization problem collaboratively. It is well-known that decentralized optimization schemes face two major system bottlenecks: stragglers' delay and communication overhead. In this paper, we tackle these bottlenecks by proposing a novel decentralized and gradient-based optimization algorithm named as QuanTimed-DSGD. Our algorithm stands on two main ideas: (i) we impose a deadline on the local gradient computations of each node at each iteration of the algorithm, and (ii) the nodes exchange quantized versions of their local models. The first idea robustifies to straggling nodes and the second alleviates communication efficiency. The key technical contribution of our work is to prove that with non-vanishing noises for quantization and stochastic gradients, the proposed method exactly converges to the global optimal for convex loss functions, and finds a first-order stationary point in non-convex scenarios. 
Our numerical evaluations of QuanTimed-DSGD on training benchmark datasets, MNIST and CIFAR-10, demonstrate speedups of up to 3x in run-time, compared to state-of-the-art decentralized optimization methods.", "full_text": "Robust and Communication-Efficient\nCollaborative Learning\n\nAmirhossein Reisizadeh\nECE Department\nUniversity of California, Santa Barbara\nreisizadeh@ucsb.edu\n\nHossein Taheri\nECE Department\nUniversity of California, Santa Barbara\nhossein@ucsb.edu\n\nAryan Mokhtari\nECE Department\nThe University of Texas at Austin\nmokhtari@austin.utexas.edu\n\nHamed Hassani\nESE Department\nUniversity of Pennsylvania\nhassani@seas.upenn.edu\n\nRamtin Pedarsani\nECE Department\nUniversity of California, Santa Barbara\nramtin@ece.ucsb.edu\n\nAbstract\n\nWe consider a decentralized learning problem, where a set of computing nodes aim at solving a non-convex optimization problem collaboratively. It is well-known that decentralized optimization schemes face two major system bottlenecks: stragglers' delay and communication overhead. In this paper, we tackle these bottlenecks by proposing a novel decentralized and gradient-based optimization algorithm named QuanTimed-DSGD. Our algorithm builds on two main ideas: (i) we impose a deadline on the local gradient computations of each node at each iteration of the algorithm, and (ii) the nodes exchange quantized versions of their local models. The first idea makes the algorithm robust to straggling nodes, and the second reduces the communication overhead. The key technical contribution of our work is to prove that with non-vanishing quantization and stochastic gradient noise, the proposed method exactly converges to the global optimum for convex loss functions, and finds a first-order stationary point in non-convex scenarios. 
Our numerical evaluations of QuanTimed-DSGD on training benchmark datasets, MNIST and CIFAR-10, demonstrate speedups of up to 3x in run-time compared to state-of-the-art decentralized optimization methods.\n\n1 Introduction\n\nCollaborative learning refers to the task of learning a common objective among multiple computing agents without any central node, using on-device computation and local communication among neighboring agents. Such tasks have recently gained considerable attention in the context of machine learning and optimization, as they are foundational to several computing paradigms: scalability to larger datasets and systems, data locality, ownership, and privacy. As such, collaborative learning naturally arises in various applications such as distributed deep learning (LeCun et al., 2015; Dean et al., 2012), multi-agent robotics and path planning (Choi and How, 2010; Jha et al., 2016), and distributed resource allocation in wireless networks (Ribeiro, 2010), to name a few.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fWhile collaborative learning has recently drawn significant attention due to its decentralized implementation, it faces major challenges at the system level as well as in algorithm design. The decentralized implementation of collaborative learning faces two major systems challenges: (i) significant slow-down due to straggling nodes, where a subset of nodes can be largely delayed in their local computation, which slows down the wall-clock convergence of the decentralized algorithm; and (ii) large communication overhead of the message-passing algorithm as the dimension of the parameter vector increases, which can further slow down the algorithm's convergence time. 
Moreover, in the presence of these system bottlenecks, the efficacy of classical consensus optimization methods is unclear and needs to be revisited.\nIn this work we consider the general data-parallel setting where the data is distributed across different computing nodes, and we develop decentralized optimization methods that do not rely on a central coordinator but instead require only local computation and communication among neighboring nodes. As the main contribution of this paper, we propose a straggler-robust and communication-efficient algorithm for collaborative learning called QuanTimed-DSGD, which is a quantized and deadline-based decentralized stochastic gradient descent method. We show that the proposed scheme provably improves upon the convergence time of vanilla synchronous decentralized optimization methods. The key theoretical contribution of the paper is to develop the first quantized decentralized non-convex optimization algorithm with provable and exact convergence to a first-order optimal solution.\nThere are two key ideas in our proposed algorithm. To provide robustness against stragglers, we impose a deadline time Td on the computation of each node. In a synchronous implementation of the proposed algorithm, at every iteration all the nodes simultaneously start computing stochastic gradients by randomly picking data points from their local batches and evaluating the gradient function on the picked data points. By time Td, each node has computed a random number of stochastic gradients, which it aggregates into a stochastic gradient estimate for its local objective. By doing so, each iteration takes a constant computation time, as opposed to deadline-free methods in which each node has to wait for all of its neighbours to complete their gradient computation tasks. 
To tackle the communication bottleneck in collaborative learning, we only allow the decentralized nodes to share with their neighbours a quantized version of their local models. Quantizing the exchanged models reduces the communication load, which is critical for large and dense networks.\nWe analyze the convergence of the proposed QuanTimed-DSGD for strongly convex and non-convex loss functions under standard assumptions on the network, the quantizer, and the stochastic gradients. In the strongly convex case, we show that QuanTimed-DSGD exactly finds the global optimum for every node at a rate arbitrarily close to O(1/√T). In the non-convex setting, QuanTimed-DSGD provably finds first-order optimal solutions as fast as O(T^{-1/3}). Moreover, the consensus error decays at the same rate, which guarantees exact convergence by choosing T large enough. Furthermore, we numerically evaluate QuanTimed-DSGD on the benchmark datasets CIFAR-10 and MNIST, where it demonstrates speedups of up to 3x in run-time compared to state-of-the-art baselines.\nRelated Work. Decentralized consensus optimization has been studied extensively. 
The most popular\n\ufb01rst-order choices for the convex setting are distributed gradient descent-type methods (Nedic and\nOzdaglar, 2009; Jakovetic et al., 2014; Yuan et al., 2016; Qu and Li, 2017), augmented Lagrangian\nalgorithms (Shi et al., 2015a,b; Mokhtari and Ribeiro, 2016), distributed variants of the alternating\ndirection method of multipliers (ADMM) (Schizas et al., 2008; Boyd et al., 2011; Shi et al., 2014;\nChang et al., 2015; Mokhtari et al., 2016), dual averaging (Duchi et al., 2012; Tsianos et al., 2012),\nand several dual based strategies (Seaman et al., 2017; Scaman et al., 2018; Uribe et al., 2018).\nRecently, there have been some works which study non-convex decentralized consensus optimization\nand establish convergence to a stationary point (Zeng and Yin, 2018; Hong et al., 2017, 2018; Sun\nand Hong, 2018; Scutari et al., 2017; Scutari and Sun, 2018; Jiang et al., 2017; Lian et al., 2017a).\nThe idea of improving communication-ef\ufb01ciency of distributed optimization procedures via message-\ncompression schemes goes a few decades back (Tsitsiklis and Luo, 1987), however, it has recently\ngained considerable attention due to the growing importance of distributed applications. In particular,\nef\ufb01cient gradient-compression methods are provided in (Alistarh et al., 2017; Seide et al., 2014;\nBernstein et al., 2018) and deployed in the distributed master-worker setting. In the decentralized\nsetting, quantization methods were proposed in different convex optimization contexts with non-\nvanishing errors (Yuksel and Basar, 2003; Rabbat and Nowak, 2005; Kashyap et al., 2006; El Chamie\net al., 2016; Aysal et al., 2007; Nedic et al., 2008). 
The first exact decentralized optimization method with quantized messages was given in (Reisizadeh et al., 2018; Zhang et al., 2018), and more recently, new techniques have been developed in this context for convex problems (Doan et al., 2018; Koloskova et al., 2019; Berahas et al., 2019; Lee et al., 2018a,b).\nThe straggler problem has been widely observed in distributed computing clusters (Dean and Barroso, 2013; Ananthanarayanan et al., 2010). A common approach to mitigating stragglers is to replicate the computing task of the slow nodes on other computing nodes (Ananthanarayanan et al., 2013; Wang et al., 2014), but this is clearly not feasible in collaborative learning. Another line of work proposed using coding-theoretic ideas for speeding up distributed machine learning (Lee et al., 2018c; Tandon et al., 2016; Yu et al., 2017; Reisizadeh et al., 2019b,c), but these works mostly apply to the master-worker setup and to particular computation types such as linear computations or full gradient aggregation. The closest work to ours is (Ferdinand et al., 2019), which considers decentralized optimization for convex functions with a deadline for local computations, but without considering communication bottlenecks and quantization, or non-convex functions. Another line of work proposes asynchronous decentralized SGD, where the workers update their models based on the last iterates received from their neighbors (Recht et al., 2011; Lian et al., 2017b; Lan and Zhou, 2018; Peng et al., 2016; Wu et al., 2017; Dutta et al., 2018). While asynchronous methods are inherently robust to stragglers, they can suffer from slow convergence due to using stale models.\n\n2 Problem Setup\n\nIn this paper, we focus on a stochastic learning model in which we aim to solve the problem\n\nmin_x L(x) := min_x E_{θ∼P}[ℓ(x, θ)],    (1)\n\nwhere ℓ : R^p × R^q → 
R is a stochastic loss function, x ∈ R^p is our optimization variable, θ ∈ R^q is a random variable with probability distribution P, and L : R^p → R is the expected loss function, also called the population risk. We assume that the underlying distribution P of the random variable θ is unknown and that we have access only to N = mn realizations of it. Our goal is to minimize the empirical loss associated with these N = mn realizations, a task known as empirical risk minimization. To be more precise, we aim to solve the empirical risk minimization (ERM) problem\n\nmin_x L_N(x) := min_x (1/N) Σ_{k=1}^N ℓ(x, θ_k),    (2)\n\nwhere L_N is the empirical loss associated with the sample of random variables D = {θ_1, ..., θ_N}.\nCollaborative Learning Perspective. Our goal is to solve the ERM problem in (2) in a decentralized manner over n nodes. This setting arises in a plethora of applications where either the total number of samples N is massive and the data cannot be stored or processed over a single node, or the samples are available in parts at different nodes and, due to privacy or communication constraints, exchanging raw data points among the nodes is not possible. Hence, we assume that each node i has access to m samples and that its local objective is\n\nf_i(x) = (1/m) Σ_{j=1}^m ℓ(x, θ_i^j),    (3)\n\nwhere D_i = {θ_i^1, ···, θ_i^m} is the set of samples available at node i. Nodes aim to collaboratively minimize the average of all local objective functions, denoted by f, which is given by\n\nmin_x f(x) = min_x (1/n) Σ_{i=1}^n f_i(x) = min_x (1/(mn)) Σ_{i=1}^n Σ_{j=1}^m ℓ(x, θ_i^j).    (4)\n\nIndeed, the objective functions f and L_N are equivalent if D := D_1 ∪ ··· ∪ D_n. 
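Since every node holds the same number m of samples, averaging the local objectives in (3) recovers the empirical risk in (2); a minimal numerical sketch of this equivalence (our own toy data and squared loss, not the paper's setup):

```python
import numpy as np

# Minimal check (our own toy setup) that the average of the n local
# objectives f_i in (3) equals the empirical risk L_N in (2), here with a
# squared loss l(x, theta) = 0.5 * (x - theta)^2 and synthetic samples.
rng = np.random.default_rng(0)
n, m = 4, 5                       # n nodes with m local samples each, N = n*m
theta = rng.normal(size=(n, m))   # row i is the local dataset D_i
x = 1.3                           # an arbitrary model parameter

def loss(x, th):
    return 0.5 * (x - th) ** 2

L_N = loss(x, theta).mean()        # empirical risk over all N samples, eq. (2)
f_i = loss(x, theta).mean(axis=1)  # local objectives f_i(x), eq. (3)
assert abs(L_N - f_i.mean()) < 1e-12   # f(x) = (1/n) sum_i f_i(x) = L_N(x), eq. (4)
```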
Therefore, by minimizing the global objective function f we also obtain the solution of the ERM problem in (2). We can rewrite the optimization problem in (4) as a classical decentralized optimization problem as follows. Let x_i be the decision variable of node i. Then, (4) is equivalent to\n\nmin_{x_1,...,x_n} (1/n) Σ_{i=1}^n f_i(x_i), subject to x_1 = ··· = x_n,    (5)\n\nas the objective function values of (4) and (5) are the same when the iterates of all nodes are the same and we have consensus. The challenge in distributed learning is to minimize the global loss only by exchanging information with neighboring nodes and ensuring that the nodes' variables stay close to each other. We consider a network of computing nodes characterized by an undirected connected graph G = (V, E), with nodes V = [n] = {1, ···, n} and edges E ⊆ V × V, where each node i is allowed to exchange information only with its neighboring nodes in G, which we denote by N_i.\nIn a stochastic optimization setting, where the true objective is defined as an expectation, there is a limit to the accuracy with which we can minimize L(x) given only N = nm samples, even if we have access to the optimal solution of the empirical risk L_N. In particular, it has been shown that when the loss function ℓ is convex, the difference between the population risk L and the empirical risk L_N corresponding to N = mn samples is, with high probability, uniformly bounded by sup_x |L(x) − L_N(x)| ≤ O(1/√N) = O(1/√(nm)); see (Bottou and Bousquet, 2008). Thus, without collaboration, each node can minimize its local cost f_i to reach an estimate of the optimal solution with an error of O(1/√m). 
By minimizing the aggregate loss collaboratively, nodes reach an approximate solution of the expected risk problem with a smaller error of O(1/√(nm)). Based on this formulation, our goal in the convex setting is to find a point x_i for each node i that attains the statistical accuracy, i.e., E[L_N(x_i) − L_N(x̂*)] ≤ O(1/√(mn)), which further implies E[L(x_i) − L(x*)] ≤ O(1/√(mn)).\nFor a non-convex loss function ℓ, however, L_N is also non-convex and solving the problem in (4) is hard in general. Therefore, we only focus on finding a point that satisfies the first-order optimality condition for (4) up to some accuracy ρ, i.e., finding a point x̃ such that ‖∇L_N(x̃)‖ = ‖∇f(x̃)‖ ≤ ρ. Under the assumption that the gradient of the loss is sub-Gaussian, it has been shown that with high probability the gap between the gradients of the expected risk and the empirical risk is bounded by sup_x ‖∇L(x) − ∇L_N(x)‖ ≤ O(1/√(nm)); see (Mei et al., 2018). As in the convex setting, by solving the aggregate loss instead of the local loss, each node finds a better approximation of a first-order stationary point of the expected risk L. Therefore, our goal in the non-convex setting is to find a point that satisfies ‖∇L_N(x)‖ ≤ O(1/√(mn)), which also implies ‖∇L(x)‖ ≤ O(1/√(mn)).\n\n3 Proposed QuanTimed-DSGD Method\n\nIn this section, we present our proposed QuanTimed-DSGD algorithm, which takes into account robustness to stragglers and communication efficiency in decentralized optimization. To ensure robustness to stragglers' delay, we introduce a deadline-based protocol for updating the iterates, in which nodes compute their local gradient estimates only for a specific amount of time and then use these estimates to update their iterates. 
This is in contrast to the mini-batch setting, in which nodes have to wait for the slowest machine to finish its local gradient computation. To reduce the communication load, we assume that nodes only exchange a quantized version of their local iterates. However, using quantized messages induces extra noise in the decision-making process, which makes the analysis of our algorithm more challenging. A detailed description of the proposed algorithm is as follows.\nDeadline-Based Gradient Computation. Consider the current model x_{i,t} available at node i at iteration t. Recall the definition of the local objective function f_i at node i given in (3). The cost of computing the local gradient ∇f_i scales linearly with the number of samples m assigned to the i-th node. A common solution to reduce the computation cost at each node when m is large is to use a mini-batch approximation of the gradient, i.e., each node i picks a subset B_{i,t} ⊆ D_i of its local samples to compute the stochastic gradient (1/|B_{i,t}|) Σ_{θ∈B_{i,t}} ∇ℓ(x_{i,t}, θ). A major challenge for this procedure is the presence of stragglers in the network: given mini-batch size b, all nodes have to compute the average of exactly b stochastic gradients. Thus, all the nodes have to wait for the slowest machine to finish its computation and exchange its new model with the neighbors.\nTo resolve this issue, we propose a deadline-based approach in which we set a fixed deadline Td for the time that each node can spend computing its local stochastic gradient estimate. Once the deadline is reached, nodes form their gradient estimates using whatever computation (mini-batch size) they could perform. Thus, with this deadline-based procedure, nodes do not need to wait for the slowest machine to update their iterates. However, their mini-batch sizes, and consequently the noise of their gradient approximations, will differ. 
To be more specific, let S_{i,t} ⊆ D_i denote the set of random samples chosen at time t by node i. Define ∇̃f_i(x_{i,t}) as the stochastic gradient of node i at time t, given by\n\n∇̃f_i(x_{i,t}) = (1/|S_{i,t}|) Σ_{θ∈S_{i,t}} ∇ℓ(x_{i,t}; θ),    (6)\n\nfor 1 ≤ |S_{i,t}| ≤ m. If no gradients have been computed by Td, i.e., |S_{i,t}| = 0, we set ∇̃f_i(x_{i,t}) = 0.\n\nAlgorithm 1 QuanTimed-DSGD at node i\nRequire: Weights {w_ij}_{j=1}^n, total iterations T, deadline Td\n1: Set x_{i,0} = 0 and compute z_{i,0} = Q(x_{i,0})\n2: for t = 0, ···, T − 1 do\n3: Send z_{i,t} = Q(x_{i,t}) to j ∈ N_i and receive z_{j,t}\n4: Pick and evaluate stochastic gradients {∇ℓ(x_{i,t}; θ) : θ ∈ S_{i,t}} until reaching the deadline Td and generate ∇̃f_i(x_{i,t}) according to (6)\n5: Update x_{i,t+1} as follows: x_{i,t+1} = (1 − ε + ε w_ii) x_{i,t} + ε Σ_{j∈N_i} w_ij z_{j,t} − αε ∇̃f_i(x_{i,t})\n6: end for\n\nComputation Model. To illustrate the advantage of our deadline-based scheme over the fixed mini-batch scheme, we formally state the model that we use for the processing time of nodes in the network. We remark that our algorithms are oblivious to the choice of the computation model, which is used merely for analysis. We define the processing speed of each machine as the number of stochastic gradients ∇ℓ(x, θ) that it computes per second. We assume that the processing speed of machine i at iteration t is a random variable V_{i,t}, and that the V_{i,t} are i.i.d. with probability distribution F_V(v). We further assume that the domain of the random variable V is bounded and its realizations lie in [v, v̄]. 
Since V_{i,t} is the number of stochastic gradients that can be computed per second, the size of the mini-batch S_{i,t} is a random variable given by |S_{i,t}| = V_{i,t} Td.\nIn the fixed mini-batch scheme and for any iteration t, all the nodes have to wait for the machine with the slowest processing time before updating their iterates, and thus the overall computation time will be b/Vmin, where Vmin is defined as Vmin = min{V_{1,t}, ..., V_{n,t}}. In our deadline-based scheme there is a fixed deadline Td which limits the computation time of the nodes, chosen such that Td = E[b/V] = b E[1/V], while the mini-batch scheme requires an expected time of E[b/Vmin] = b E[1/Vmin]. The gap between E[1/V] and E[1/Vmin] depends on the distribution of V and can be unbounded in general, growing with n.\nQuantized Message-Passing. To reduce the communication overhead of exchanging variables between nodes, we use quantization schemes that significantly reduce the required number of bits. More precisely, instead of sending x_{i,t}, the i-th node sends z_{i,t} = Q(x_{i,t}), a quantized version of its local variable x_{i,t}, to its neighbors j ∈ N_i. As an example, consider the low-precision quantizer specified by scale factor η and s bits, with the representable range {−η · 2^{s−1}, ···, −η, 0, η, ···, η · (2^{s−1} − 1)}. For any kη ≤ x < (k + 1)η, the quantizer outputs\n\nQ_{(η,s)}(x) = kη w.p. 1 − (x − kη)/η, and (k + 1)η w.p. (x − kη)/η.    (7)\n\nAlgorithm Update. Once the local variables are exchanged between neighboring nodes, each node i uses its local stochastic gradient ∇̃f_i(x_{i,t}), its local decision variable x_{i,t}, and the information received from its neighbors {z_{j,t} = Q(x_{j,t}) : j ∈ N_i} to update its local decision variable. 
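The stochastic quantizer in (7) rounds each coordinate to one of the two nearest grid points of spacing η, with probabilities chosen so the output is unbiased. A minimal sketch (function name and test values are ours; clipping to the finite s-bit representable range is omitted for brevity):

```python
import numpy as np

# Unbiased stochastic quantization to a grid of spacing eta (our own naming).
# For k*eta <= x < (k+1)*eta the output is k*eta with probability
# 1 - (x - k*eta)/eta and (k+1)*eta otherwise, so E[Q(x) | x] = x.
def quantize(x, eta, rng):
    x = np.asarray(x, dtype=float)
    k = np.floor(x / eta)            # index of the lower grid point
    p = x / eta - k                  # probability of rounding up
    up = rng.random(x.shape) < p
    return (k + up) * eta

rng = np.random.default_rng(0)
x = np.array([0.37, -1.24, 2.0])
samples = np.stack([quantize(x, 0.5, rng) for _ in range(20000)])
print(samples.mean(axis=0))          # close to x, by unbiasedness
```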
Before formally stating the update of QuanTimed-DSGD, let us define w_ij as the weight that node i assigns to the information that it receives from node j. If i and j are not neighbors, w_ij = 0. These weights are used for averaging over the local decision variable x_{i,t} and the quantized variables z_{j,t} received from neighbors, to enforce consensus among neighboring nodes. Specifically, at time t, node i updates its decision variable according to the update\n\nx_{i,t+1} = (1 − ε + ε w_ii) x_{i,t} + ε Σ_{j∈N_i} w_ij z_{j,t} − αε ∇̃f_i(x_{i,t}),    (8)\n\nwhere α and ε are positive scalars that behave as stepsizes. Note that the update in (8) shows that the updated iterate is a linear combination of the weighted average of node i's neighbors' decision variables, i.e., ε Σ_{j∈N_i} w_ij z_{j,t}, its local variable x_{i,t}, and the stochastic gradient ∇̃f_i(x_{i,t}). The parameter α behaves as the stepsize of the gradient descent step with respect to the local objective function, and the parameter ε behaves as an averaging parameter between performing the distributed gradient update ε(w_ii x_{i,t} + Σ_{j∈N_i} w_ij z_{j,t} − α ∇̃f_i(x_{i,t})) and keeping the previous decision variable (1 − ε) x_{i,t}. By choosing a diminishing stepsize α we control the noise of the stochastic gradient evaluation, and by averaging with the parameter ε we control the randomness induced by exchanging quantized variables. The description of QuanTimed-DSGD is summarized in Algorithm 1.\n\n4 Convergence Analysis\n\nIn this section, we provide the main theoretical results for the proposed QuanTimed-DSGD algorithm. We first consider strongly convex loss functions and characterize the convergence rate of QuanTimed-DSGD for achieving the global optimal solution of problem (4). 
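The update (8) can be sketched for all nodes at once; the ring topology, weights, and the stand-ins for quantized models and deadline-limited gradients below are our own illustration, not the paper's experimental setup:

```python
import numpy as np

# Toy sketch of update (8): x_{i,t+1} = (1 - eps + eps*w_ii) x_{i,t}
#   + eps * sum_{j in N_i} w_ij z_{j,t} - alpha*eps * g_i.
rng = np.random.default_rng(1)
n, p = 4, 3
W = np.zeros((n, n))                 # symmetric doubly stochastic weights (ring)
for i in range(n):
    W[i, i] = 0.5
    W[i, (i - 1) % n] = W[i, (i + 1) % n] = 0.25
X = rng.normal(size=(n, p))          # current models x_{i,t}
Z = np.round(X, 1)                   # stand-in for quantized models z_{j,t} = Q(x_{j,t})
G = rng.normal(size=(n, p))          # stand-in deadline-limited stochastic gradients
alpha, eps = 0.1, 0.5

off = W - np.diag(np.diag(W))        # neighbor weights w_ij, j != i
X_next = ((1 - eps + eps * np.diag(W))[:, None] * X
          + eps * off @ Z
          - alpha * eps * G)         # x_{i,t+1} for all nodes, as in (8)
```

Equivalently, the same step mixes a fraction 1 − ε of the old model with a fraction ε of a distributed gradient step, which matches the interpretation of ε as an averaging parameter.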
Then, we focus on the non-convex setting and show that the iterates generated by QuanTimed-DSGD find a stationary point of the cost in (4) while the local models stay close to each other and the consensus constraint is asymptotically satisfied. All the proofs are provided in the supplementary material (Section 6). We make the following assumptions on the weight matrix, the quantizer, and the local objective functions.\nAssumption 1. The weight matrix W ∈ R^{n×n} with entries w_ij ≥ 0 satisfies the following conditions: W = W^T, W1 = 1, and null(I − W) = span(1).\nAssumption 2. The random quantizer Q(·) is unbiased and variance-bounded, i.e., E[Q(x) | x] = x and E[‖Q(x) − x‖² | x] ≤ σ̂² for any x ∈ R^p; and quantizations are carried out independently.\nAssumption 1 implies that W is symmetric and doubly stochastic. Moreover, all the eigenvalues of W are in (−1, 1], i.e., 1 = λ_1(W) ≥ λ_2(W) ≥ ··· ≥ λ_n(W) > −1 (e.g., (Yuan et al., 2016)). We also denote by 1 − λ the spectral gap associated with the stochastic matrix W, where λ = max{|λ_2(W)|, |λ_n(W)|}.\nAssumption 3. The function ℓ is K-smooth with respect to x, i.e., for any x, x̂ ∈ R^p and any θ ∈ D, ‖∇ℓ(x, θ) − ∇ℓ(x̂, θ)‖ ≤ K ‖x − x̂‖.\nAssumption 4. Stochastic gradients ∇ℓ(x, θ) are unbiased and variance-bounded, i.e., E_θ[∇ℓ(x, θ)] = ∇L(x) and E_θ[‖∇ℓ(x, θ) − ∇L(x)‖²] ≤ σ².\nNote that the condition in Assumption 4 implies that the local gradients ∇f_i(x) of each node are also unbiased estimators of the expected risk gradient ∇L(x) and that their variance is bounded above by σ²/m, as each f_i is defined as an average over m realizations.\n\n4.1 Strongly Convex Setting\n\nThis section presents the convergence guarantees of the proposed QuanTimed-DSGD method for smooth and strongly convex functions. The following assumption formally defines strong convexity.\nAssumption 5. 
The function ` is \u00b5-strongly convex, i.e., for any x, \u02c6x 2 Rp and \u2713 2D we have that\nhr`(x,\u2713 ) r`(\u02c6x,\u2713 ), x \u02c6xi \u00b5kx \u02c6xk2 .\nNext, we characterize the convergence rate of QuanTimed-DSGD for strongly convex objectives.\nTheorem 1 (Strongly Convex Losses). If the conditions in Assumptions 1\u20135 are satis\ufb01ed and step-\nsizes are picked as \u21b5 = T /2 and \" = T 3/2 for arbitrary 2 (0, 1/2), then for large enough\nnumber of iterations T T c\nnXi=1\n1\nn\nwhere D2 = 2KPn\nMoreover, such convergence rate is as close as desired to O(1/pT ) by picking the tuning parameter\n\n\u00b5! 1\nT + O 2\ni=1(fi(0) f\u21e4i ), and f\u21e4i = minx2Rp fi(x).\n\nTheorem 1 guarantees the exact convergence of each local model to the global optimal even though\nthe noises induced by random quantizations and stochastic gradients are non-vanishing with iterations.\n\nEhxi,T x\u21e42i\uf8ffO D2(K/\u00b5)2\n\nmin the iterates generated by the QuanTimed-DSGD algorithm satisfy\n\n arbitrarily close to 1/2. We would like to highlight that by choosing a parameter closer to 1/2,\n\nmax\u21e2E[1/V ]\n\nTd\n\n(1 )2 +\n\n(9)\n\n1\n\nm! 1\n\nT 2 ,\n\n,\n\n\u00b5\n\n2\n\n6\n\n\fmin becomes larger. More details are available\n\nthe lower bound on the number of required iterations T c\nin the proof of Theorem 1 provided in the supplementary material.\nNote that the coef\ufb01cient of 1/T in (9) characterizes the dependency of our upper bound on the\nobjective function condition number K/\u00b5, graph connectivity parameter 1/(1 ), and variance 2\nof error induced by quantizing our signals. Moreover, the coef\ufb01cient of 1/T 2 shows the effect of\nstochastic gradients variance 2 as well as our deadline-based scheme parameters Td/(E[1/V ]).\nRemark 1. The expression 1/beff = max{E[1/V ]/Td, 1/m} represents the inverse of the effective\nbatch size beff used in our QuanTimed-DSGD method. 
To be more specific, if the deadline Td is large enough that in expectation all local gradients are computed before the deadline, i.e., Td/E[1/V] > m, then our effective batch size is b_eff = m and the term 1/m is the dominant term in the maximization. Conversely, if Td is small and the number of computed gradients Td/E[1/V] is smaller than the total number of local samples m, the effective batch size is b_eff = Td/E[1/V]. In this case, E[1/V]/Td is the dominant term in the maximization. This observation shows that σ² max{E[1/V]/Td, 1/m} = σ²/b_eff in (9) is the variance of the mini-batch gradient in QuanTimed-DSGD.\nRemark 2. Using strong convexity of the objective function, one can easily verify that the last iterates x_{i,T} of QuanTimed-DSGD satisfy the sub-optimality f(x_{i,T}) − f(x̂*) = L_N(x_{i,T}) − L_N(x̂*) ≤ O(1/√T) with respect to the empirical risk, where x̂* is the minimizer of the empirical risk L_N. As the gap between the expected risk L and the empirical risk L_N is of O(1/√(mn)), the overall error of QuanTimed-DSGD with respect to the expected risk L is O(1/√T + 1/√(mn)).\n\n4.2 Non-convex Setting\n\nIn this section, we characterize the convergence rate of QuanTimed-DSGD for non-convex and smooth objectives. As discussed in Section 2, we are interested in finding a set of local models which satisfy the first-order optimality condition approximately, while the models are close to each other and satisfy the consensus condition up to a small error. To be more precise, we are interested in finding a set of local models {x_1*, ..., x_n*} whose average x̄* := (1/n) Σ_{i=1}^n x_i* (approximately) satisfies the first-order optimality condition, i.e., E‖∇f(x̄*)‖² ≤ ν, while the iterates are close to their average, i.e., E‖x̄* − x_i*‖² ≤ ρ. If a set of local iterates satisfies these conditions, we call them (ν, ρ)-approximate solutions. 
Next theorem characterizes both \ufb01rst-order optimality and consensus convergence rates\nand the overall complexity for achieving an (\u232b, \u21e2)-approximate solutions.\n\norder optimality condition, i.e., Erf (x\u21e4)2 \uf8ff \u232b, while the iterates are close to their average, i.e.,\n\nTheorem 2 (Non-convex Losses). Under Assumptions 1\u20134, and for step-sizes \u21b5 = T 1/6 and\n\" = T 1/2, QuanTimed-DSGD guarantees the following convergence and consensus rates:\n\nT 1/3 +O K 2\n\nn\n\nmax\u21e2E[1/V ]\n\nTd\n\n,\n\n1\n\nm! 1\n\nT 2/3 ,\n(10)\n\n(11)\n\n1\nT\n\nT1Xt=0\n\nand\n\n(1 )2\n\nErf (xt)2 \uf8ffO K2\nnXi=1\n\nT1Xt=0\n\n1\nT\n\n1\nn\n\n+\n\nK 2\n\n2\nm\n\nn ! 1\nExt xi,t2 \uf8ffO \n\nT 1/3 ,\n\n2\n\nm(1 )2! 1\nnPn\n\nmin. Here xt = 1\n\ni=1 xi,t denotes the average models\n\nfor large enough number of iterations T T nc\nat iteration t.\nThe convergence rate in (10) indicates the proposed QuanTimed-DSGD method \ufb01nds \ufb01rst-order\nstationary points with vanishing approximation error, even though the quantization and stochastic\ngradient noises are non-vanishing. Also, the approximation error decays as fast as O(T 1/3) with\niterations. Theorem 2 also implies from (11) that the local models reach consensus with a rate of\nO(T 1/3). Moreover, it shows that to \ufb01nd an (\u232b, \u21e2)-approximate solution QuanTimed-DSGD requires\nat most O(max{\u232b3,\u21e2 3}) iterations.\n5 Experimental Results\n\nIn this section, we numerically evaluate the performance of the proposed QuanTimed-DSGD method\ndescribed in Algorithm 1 for solving a class of non-convex decentralized optimization problems.\n\n7\n\n\fIn particular, we compare the total run-time of QuanTimed-DSGD scheme with the ones for three\nbenchmarks which are brie\ufb02y described below.\n\u2022 Decentralized SGD (DSGD) (Yuan et al., 2016): Each worker updates its decision variable as\nxi,t+1 =Pj2Ni\nwijxj,t \u21b5erfi(xi,t). 
We note that the exchanged messages are not quantized and the local gradients are computed for a fixed batch size.\n• Quantized Decentralized SGD (Q-DSGD) (Reisizadeh et al., 2019a): Iterates are updated according to (8). Similar to the QuanTimed-DSGD scheme, Q-DSGD employs quantized message-passing; however, the gradients are computed for a fixed batch size in each iteration.\n• Asynchronous DSGD: Each worker updates its model without waiting to receive the updates of its neighbors, i.e., x_{i,t+1} = Σ_{j∈N_i} w_ij x_{j,τ_j} − α ∇̃f_i(x_{i,t}), where x_{j,τ_j} denotes the most recent model for node j. In our implementation of this scheme, models are exchanged without quantization.\n\nNote that the first two methods mentioned above, i.e., DSGD and Q-DSGD, operate synchronously across the workers, as does our proposed QuanTimed-DSGD method. To be more specific, worker nodes wait to receive the decision variables from all of their neighbor nodes and then synchronously update according to an update rule. In QuanTimed-DSGD (Figure 1, right), this waiting time consists of a fixed gradient computation time given by the deadline Td plus the communication time of the message exchanges. Due to the random computation speeds, different workers end up computing gradients of different and random batch sizes B_{i,t} across workers i and iterations t. In DSGD (and Q-DSGD), however (Figure 1, left), the gradient computation time varies across the workers, since computing a fixed-batch gradient of size B takes a random time whose expected value is proportional to the batch size B, and hence the slowest nodes (stragglers) determine the overall synchronization time Tmax. 
Asynchronous DSGD mitigates stragglers since each worker iteratively computes a gradient of batch size B and updates its local model using the most recent models of its neighboring nodes available in its memory (Figure 1, middle).
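The effect of a deadline on batch sizes can be illustrated numerically: if worker i computes gradients at a random speed V_i (samples per unit time), imposing a deadline T_d yields a random batch size of roughly V_i · T_d, whereas a fixed batch of size b forces every iteration to wait for the slowest worker. A minimal sketch, borrowing the Uniform(10, 90) speed model from the experimental setup (the constants are otherwise illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, b = 50, 20                 # workers, target expected batch size

# Random computation speeds (samples per unit time), one per worker
V = rng.uniform(10, 90, size=n)
Td = b / 50.0                 # deadline Td = b / E[V], with E[V] = (10 + 90) / 2 = 50

# Deadline-based scheme: each worker processes whatever it finishes by Td
batch = np.floor(V * Td).astype(int)

# Fixed-batch synchronous scheme: the slowest worker dictates the iteration time
t_fixed = b / V.min()

print(batch.min(), batch.max(), round(batch.mean(), 1), round(Td, 2), round(t_fixed, 2))
```

The per-iteration compute time is pinned at T_d, while the fixed-batch time b / min_i V_i degrades as the number of workers grows — the straggler effect that QuanTimed-DSGD trades against random, smaller batches on slow nodes.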
Figure 1: Gradient computation timeline for three methods: DSGD, Asynchronous-DSGD, QuanTimed-DSGD.

Data and Experimental Setup. We carry out two sets of experiments, over the CIFAR-10 and MNIST datasets, where each worker is assigned a sample set of size m = 200 for both datasets. For CIFAR-10, we implement binary classification using a fully connected neural network with one hidden layer of 30 neurons; each image is converted to a vector of length 1024. For MNIST, we use a fully connected neural network with one hidden layer of size 50 to classify the input image into 10 classes. In the experiments over CIFAR-10, step-sizes are fine-tuned as follows: (α, ε) = (0.08/T^{1/6}, 14/T^{1/2}) for QuanTimed-DSGD and Q-DSGD, and α = 0.015 for DSGD and Asynchronous DSGD. In the MNIST experiments, step-sizes are fine-tuned to (α, ε) = (0.3/T^{1/6}, 15/T^{1/2}) for QuanTimed-DSGD and Q-DSGD, and α = 0.2 for DSGD.
We implement the unbiased low-precision quantizer in (7) with various quantization levels s, and we let T_c denote the communication time of a p-dimensional vector without quantization (16-bit precision); the communication time of a quantized vector is scaled according to the quantization level. To ensure that the expected batch size used at each node equals a target positive number b, we choose the deadline T_d = b/E[V], where V ∼ Uniform(10, 90) is the random computation speed. The communication graph is a random Erdős–Rényi graph with edge connectivity p_c = 0.4 and n = 50 nodes. The weight matrix is designed as W = I − L/κ, where L is the Laplacian matrix of the graph and κ > λ_max(L)/2.
Results.
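The experimental setup above can be sketched in a few lines: draw an Erdős–Rényi graph, form W = I − L/κ with κ just above λ_max(L)/2 (so all eigenvalues of W lie in (−1, 1] and rows sum to 1), and apply an s-level unbiased stochastic quantizer. The QSGD-style rounding rule below is an assumption in the spirit of (7), not necessarily the paper's exact quantizer.

```python
import numpy as np

rng = np.random.default_rng(2)
n, pc, s = 50, 0.4, 4

# Random Erdos-Renyi graph: symmetric adjacency with edge probability pc
A = np.triu((rng.random((n, n)) < pc).astype(float), k=1)
A = A + A.T
L = np.diag(A.sum(axis=1)) - A          # graph Laplacian (L @ ones = 0)

# Weight matrix W = I - L/kappa with kappa > lambda_max(L)/2
kappa = 1.01 * np.linalg.eigvalsh(L)[-1] / 2
W = np.eye(n) - L / kappa               # rows sum to 1; eigenvalues in (-1, 1]

def quantize(x, s, rng):
    """Unbiased s-level stochastic quantizer (QSGD-style sketch): E[quantize(x)] = x."""
    norm = np.linalg.norm(x)
    if norm == 0.0:
        return x
    scaled = np.abs(x) / norm * s       # position of each entry in [0, s]
    low = np.floor(scaled)
    round_up = rng.random(x.shape) < (scaled - low)   # randomly round up or down
    return np.sign(x) * norm * (low + round_up) / s
```

Averaging many independent quantizations of a vector recovers the vector itself; this unbiasedness is the property the convergence analysis relies on despite the non-vanishing quantization noise.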
Figure 2 compares the total training run-time of the QuanTimed-DSGD and DSGD schemes. On CIFAR-10 (left), for instance, with the same (effective) batch sizes, the proposed QuanTimed-DSGD achieves speedups of up to 3× compared to DSGD.

Figure 2: Comparison of QuanTimed-DSGD and vanilla DSGD methods for training a neural network on CIFAR-10 (left) and MNIST (right) datasets (T_c = 3).

Figure 3: Comparison of QuanTimed-DSGD, Q-DSGD, and vanilla DSGD methods for training a neural network on CIFAR-10 (left) and MNIST (right) datasets (T_c = 3).

Figure 4: Left: Comparison of QuanTimed-DSGD with Asynchronous DSGD and DSGD for training a neural network on CIFAR-10 (T_c = 3). Right: Effect of T_d on the loss for CIFAR-10 (T_c = 1).

In Figure 3, we further compare these two schemes against the Q-DSGD benchmark. Although Q-DSGD improves upon vanilla DSGD by employing quantization, the proposed QuanTimed-DSGD achieves a further 2× speedup in training time over Q-DSGD (left).
To evaluate straggler mitigation in QuanTimed-DSGD, we compare its run-time with the Asynchronous DSGD benchmark in Figure 4 (left). While Asynchronous DSGD outperforms DSGD in training run-time by avoiding slow nodes, the proposed QuanTimed-DSGD scheme improves upon Asynchronous DSGD by up to 30%. These plots further illustrate that QuanTimed-DSGD significantly reduces the training time by simultaneously limiting the communication load through quantization and mitigating stragglers through deadline-based computation. The deadline T_d can indeed be optimized for the minimum training run-time, as illustrated in Figure 4 (right). Additional numerical results on neural networks with four hidden layers and on the ImageNet dataset are provided in the supplementary materials.

6 Acknowledgments

The authors acknowledge support from the National Science Foundation (NSF) under grant CCF-1909320 and the UC Office of the President under grant LFR-18-548175. The research of H.
Hassani is supported by NSF grants 1755707 and 1837253.

References

Alistarh, D., Grubic, D., Li, J., Tomioka, R., and Vojnovic, M. (2017). QSGD: Communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems, pages 1707–1718.

Ananthanarayanan, G., Ghodsi, A., Shenker, S., and Stoica, I. (2013). Effective straggler mitigation: Attack of the clones. In 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI 13), pages 185–198.

Ananthanarayanan, G., Kandula, S., Greenberg, A. G., Stoica, I., Lu, Y., Saha, B., and Harris, E. (2010). Reining in the outliers in map-reduce clusters using Mantri. In OSDI, volume 10, page 24.

Aysal, T. C., Coates, M., and Rabbat, M. (2007). Distributed average consensus using probabilistic quantization. In IEEE/SP 14th Workshop on Statistical Signal Processing (SSP '07), pages 640–644. IEEE.

Berahas, A. S., Iakovidou, C., and Wei, E. (2019). Nested distributed gradient methods with adaptive quantized communication. arXiv preprint arXiv:1903.08149.

Bernstein, J., Wang, Y.-X., Azizzadenesheli, K., and Anandkumar, A. (2018). signSGD: Compressed optimisation for non-convex problems. arXiv preprint arXiv:1802.04434.

Bottou, L. and Bousquet, O. (2008). The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems, pages 161–168.

Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122.

Chang, T.-H., Hong, M., and Wang, X. (2015). Multi-agent distributed optimization via inexact consensus ADMM. IEEE Transactions on Signal Processing, 63(2):482–497.

Choi, H.-L. and How, J. P. (2010).
Continuous trajectory planning of mobile sensors for informative forecasting. Automatica, 46(8):1266–1275.

Dean, J. and Barroso, L. A. (2013). The tail at scale. Communications of the ACM, 56(2):74–80.

Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., Senior, A., Tucker, P., Yang, K., Le, Q. V., et al. (2012). Large scale distributed deep networks. In Advances in Neural Information Processing Systems, pages 1223–1231.

Doan, T. T., Maguluri, S. T., and Romberg, J. (2018). Accelerating the convergence rates of distributed subgradient methods with adaptive quantization. arXiv preprint arXiv:1810.13245.

Duchi, J. C., Agarwal, A., and Wainwright, M. J. (2012). Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592–606.

Dutta, S., Joshi, G., Ghosh, S., Dube, P., and Nagpurkar, P. (2018). Slow and stale gradients can win the race: Error-runtime trade-offs in distributed SGD. arXiv preprint arXiv:1803.01113.

El Chamie, M., Liu, J., and Başar, T. (2016). Design and analysis of distributed averaging with quantized communication. IEEE Transactions on Automatic Control, 61(12):3870–3884.

Ferdinand, N., Al-Lawati, H., Draper, S., and Nokleby, M. (2019). Anytime minibatch: Exploiting stragglers in online distributed optimization. In International Conference on Learning Representations.

Hong, M., Hajinezhad, D., and Zhao, M.-M. (2017). Prox-PDA: The proximal primal-dual algorithm for fast distributed nonconvex optimization and learning over networks. In Proceedings of the 34th International Conference on Machine Learning, ICML '17, pages 1529–1538. JMLR.org.

Hong, M., Lee, J. D., and Razaviyayn, M. (2018). Gradient primal-dual algorithm converges to second-order stationary solutions for nonconvex distributed optimization.
arXiv preprint arXiv:1802.08941.

Jakovetic, D., Xavier, J., and Moura, J. M. (2014). Fast distributed gradient methods. IEEE Transactions on Automatic Control, 59(5):1131–1146.

Jha, D. K., Chattopadhyay, P., Sarkar, S., and Ray, A. (2016). Path planning in GPS-denied environments via collective intelligence of distributed sensor networks. International Journal of Control, 89(5):984–999.

Jiang, Z., Balu, A., Hegde, C., and Sarkar, S. (2017). Collaborative deep learning in fixed topology networks. In Advances in Neural Information Processing Systems, pages 5904–5914.

Kashyap, A., Basar, T., and Srikant, R. (2006). Quantized consensus. In 2006 IEEE International Symposium on Information Theory, pages 635–639.

Koloskova, A., Stich, S. U., and Jaggi, M. (2019). Decentralized stochastic optimization and gossip algorithms with compressed communication. arXiv preprint arXiv:1902.00340.

Lan, G. and Zhou, Y. (2018). Asynchronous decentralized accelerated stochastic gradient descent. arXiv preprint arXiv:1809.09258.

LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. Nature, 521(7553):436–444.

Lee, C.-S., Michelusi, N., and Scutari, G. (2018a). Distributed quantized weight-balancing and average consensus over digraphs. In 2018 IEEE Conference on Decision and Control (CDC), pages 5857–5862. IEEE.

Lee, C.-S., Michelusi, N., and Scutari, G. (2018b). Finite rate quantized distributed optimization with geometric convergence. In 2018 52nd Asilomar Conference on Signals, Systems, and Computers, pages 1876–1880. IEEE.

Lee, K., Lam, M., Pedarsani, R., Papailiopoulos, D., and Ramchandran, K. (2018c). Speeding up distributed machine learning using codes. IEEE Transactions on Information Theory, 64(3):1514–1529.

Lian, X., Zhang, C., Zhang, H., Hsieh, C.-J., Zhang, W., and Liu, J. (2017a). Can decentralized algorithms outperform centralized algorithms?
A case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 5330–5340.

Lian, X., Zhang, W., Zhang, C., and Liu, J. (2017b). Asynchronous decentralized parallel stochastic gradient descent. arXiv preprint arXiv:1710.06952.

Mei, S., Bai, Y., and Montanari, A. (2018). The landscape of empirical risk for nonconvex losses. The Annals of Statistics, 46(6A):2747–2774.

Mokhtari, A. and Ribeiro, A. (2016). DSA: Decentralized double stochastic averaging gradient algorithm. The Journal of Machine Learning Research, 17(1):2165–2199.

Mokhtari, A., Shi, W., Ling, Q., and Ribeiro, A. (2016). DQM: Decentralized quadratically approximated alternating direction method of multipliers. IEEE Transactions on Signal Processing, 64(19):5158–5173.

Nedic, A., Olshevsky, A., Ozdaglar, A., and Tsitsiklis, J. N. (2008). Distributed subgradient methods and quantization effects. In 2008 47th IEEE Conference on Decision and Control (CDC), pages 4177–4184. IEEE.

Nedic, A. and Ozdaglar, A. (2009). Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61.

Peng, Z., Xu, Y., Yan, M., and Yin, W. (2016). On the convergence of asynchronous parallel iteration with unbounded delays. Journal of the Operations Research Society of China, pages 1–38.

Qu, G. and Li, N. (2017). Accelerated distributed Nesterov gradient descent. arXiv preprint arXiv:1705.07176.

Rabbat, M. G. and Nowak, R. D. (2005). Quantized incremental algorithms for distributed optimization. IEEE Journal on Selected Areas in Communications, 23(4):798–808.

Recht, B., Re, C., Wright, S., and Niu, F. (2011). Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Proc.
of the 25th Annual Conference on Neural Information Processing Systems (NIPS), pages 693–701.

Reisizadeh, A., Mokhtari, A., Hassani, H., and Pedarsani, R. (2018). Quantized decentralized consensus optimization. In 2018 IEEE Conference on Decision and Control (CDC), pages 5838–5843. IEEE.

Reisizadeh, A., Mokhtari, A., Hassani, H., and Pedarsani, R. (2019a). An exact quantized decentralized gradient descent algorithm. IEEE Transactions on Signal Processing, 67(19):4934–4947.

Reisizadeh, A., Prakash, S., Pedarsani, R., and Avestimehr, A. S. (2019b). Coded computation over heterogeneous clusters. IEEE Transactions on Information Theory.

Reisizadeh, A., Prakash, S., Pedarsani, R., and Avestimehr, A. S. (2019c). CodedReduce: A fast and robust framework for gradient aggregation in distributed learning. arXiv preprint arXiv:1902.01981.

Ribeiro, A. (2010). Ergodic stochastic optimization algorithms for wireless communication and networking. IEEE Transactions on Signal Processing, 58(12):6369–6386.

Scaman, K., Bach, F., Bubeck, S., Massoulié, L., and Lee, Y. T. (2018). Optimal algorithms for non-smooth distributed optimization in networks. In Advances in Neural Information Processing Systems, pages 2740–2749.

Schizas, I. D., Ribeiro, A., and Giannakis, G. B. (2008). Consensus in ad hoc WSNs with noisy links. Part I: Distributed estimation of deterministic signals. IEEE Transactions on Signal Processing, 56(1):350–364.

Scutari, G., Facchinei, F., and Lampariello, L. (2017). Parallel and distributed methods for constrained nonconvex optimization. Part I: Theory. IEEE Transactions on Signal Processing, 65(8):1929–1944.

Scutari, G. and Sun, Y. (2018). Distributed nonconvex constrained optimization over time-varying digraphs. arXiv preprint arXiv:1809.01106.

Seaman, K., Bach, F., Bubeck, S., Lee, Y. T., and Massoulié, L. (2017).
Optimal algorithms for smooth and strongly convex distributed optimization in networks. In Proceedings of the 34th International Conference on Machine Learning, pages 3027–3036. JMLR.org.

Seide, F., Fu, H., Droppo, J., Li, G., and Yu, D. (2014). 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Fifteenth Annual Conference of the International Speech Communication Association.

Shi, W., Ling, Q., Wu, G., and Yin, W. (2015a). EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966.

Shi, W., Ling, Q., Wu, G., and Yin, W. (2015b). A proximal gradient algorithm for decentralized composite optimization. IEEE Transactions on Signal Processing, 63(22):6013–6023.

Shi, W., Ling, Q., Yuan, K., Wu, G., and Yin, W. (2014). On the linear convergence of the ADMM in decentralized consensus optimization. IEEE Transactions on Signal Processing, 62(7):1750–1761.

Sun, H. and Hong, M. (2018). Distributed non-convex first-order optimization and information processing: Lower complexity bounds and rate optimal algorithms. arXiv preprint arXiv:1804.02729.

Tandon, R., Lei, Q., Dimakis, A. G., and Karampatziakis, N. (2016). Gradient coding. arXiv preprint arXiv:1612.03301.

Tsianos, K. I., Lawlor, S., and Rabbat, M. G. (2012). Push-sum distributed dual averaging for convex optimization. In CDC, pages 5453–5458.

Tsitsiklis, J. N. and Luo, Z.-Q. (1987). Communication complexity of convex optimization. Journal of Complexity, 3(3):231–243.

Uribe, C. A., Lee, S., Gasnikov, A., and Nedić, A. (2018). A dual approach for optimal algorithms in distributed optimization over networks. arXiv preprint arXiv:1809.00710.

Wang, D., Joshi, G., and Wornell, G. (2014). Efficient task replication for fast response times in parallel computation.
In ACM SIGMETRICS Performance Evaluation Review, volume 42, pages 599–600. ACM.

Wu, T., Yuan, K., Ling, Q., Yin, W., and Sayed, A. H. (2017). Decentralized consensus optimization with asynchrony and delays. IEEE Transactions on Signal and Information Processing over Networks, 4(2):293–307.

Yu, Q., Maddah-Ali, M. A., and Avestimehr, A. S. (2017). Polynomial codes: An optimal design for high-dimensional coded matrix multiplication. arXiv preprint arXiv:1705.10464.

Yuan, K., Ling, Q., and Yin, W. (2016). On the convergence of decentralized gradient descent. SIAM Journal on Optimization, 26(3):1835–1854.

Yuksel, S. and Basar, T. (2003). Quantization and coding for decentralized LTI systems. In 42nd IEEE Conference on Decision and Control, volume 3, pages 2847–2852. IEEE.

Zeng, J. and Yin, W. (2018). On nonconvex decentralized gradient descent. IEEE Transactions on Signal Processing, 66(11):2834–2848.

Zhang, X., Liu, J., Zhu, Z., and Bentley, E. S. (2018). Compressed distributed gradient descent: Communication-efficient consensus over networks. arXiv preprint arXiv:1812.04048.