{"title": "Local SGD with Periodic Averaging: Tighter Analysis and Adaptive Synchronization", "book": "Advances in Neural Information Processing Systems", "page_first": 11082, "page_last": 11094, "abstract": "Communication overhead is one of the key challenges that hinders the scalability of distributed optimization algorithms. In this paper, we study local distributed SGD, where data is partitioned among computation nodes, and the computation nodes perform local updates with periodically exchanging the model among the workers to perform averaging. While local SGD is empirically shown to provide promising results, a theoretical understanding of its performance remains open. In this paper, we strengthen convergence analysis for local SGD, and show that local SGD can be far less expensive and applied far more generally than current theory suggests. Specifically, we show that for loss functions that satisfy the Polyak-Kojasiewicz condition, $O((pT)^{1/3})$ rounds of communication suffice to achieve a linear speed up, that is, an error of $O(1/pT)$, where $T$ is the total number of model updates at each worker. This is in contrast with previous work which required higher number of communication rounds, as well as was limited to strongly convex loss functions, for a similar asymptotic performance. We also develop an adaptive synchronization scheme that provides a general condition for linear speed up. Finally, we validate the theory with experimental results, running over AWS EC2 clouds and an internal GPUs cluster.", "full_text": "Local SGD with Periodic Averaging:\n\nTighter Analysis and Adaptive Synchronization\n\nFarzin Haddadpour\n\nPenn State\n\nfxh18@psu.edu\n\nMohammad Mahdi Kamani\n\nPenn State\n\nmqk5591@psu.edu\n\nMehrdad Mahdavi\n\nPenn State\n\nmzm616@psu.edu\n\nViveck R. Cadambe\n\nPenn State\n\nvxc12@psu.edu\n\nAbstract\n\nCommunication overhead is one of the key challenges that hinders the scalability\nof distributed optimization algorithms. 
In this paper, we study local distributed SGD, where data is partitioned among computation nodes, and the computation nodes perform local updates, periodically exchanging the model among the workers to perform averaging. While local SGD is empirically shown to provide promising results, a theoretical understanding of its performance remains open. We strengthen convergence analysis for local SGD, and show that local SGD can be far less expensive and applied far more generally than current theory suggests. Specifically, we show that for loss functions that satisfy the Polyak-Łojasiewicz condition, O((pT)^{1/3}) rounds of communication suffice to achieve a linear speed up, that is, an error of O(1/pT), where T is the total number of model updates at each worker. This is in contrast with previous work, which required a higher number of communication rounds and was limited to strongly convex loss functions, for a similar asymptotic performance. We also develop an adaptive synchronization scheme that provides a general condition for linear speed up. Finally, we validate the theory with experimental results, running over AWS EC2 clouds and an internal GPU cluster.\n\n1 Introduction\n\nWe consider the problem of distributed empirical risk minimization, where a set of p machines, each with access to a different local shard of training examples Di, i = 1, 2, . . . , p, attempt to jointly solve the following optimization problem over the entire data set D = D1 ∪ . . . ∪ Dp in parallel:\n\nmin_{x ∈ R^d} F(x) ≜ (1/p) Σ_{i=1}^p f(x; Di),   (1)\n\nwhere f(·; Di) is the training loss over the data shard Di. 
The predominant optimization methodology to solve the above optimization problem is stochastic gradient descent (SGD), where the model parameters are iteratively updated by\n\nx^{(t+1)} = x^{(t)} − η g̃^{(t)},   (2)\n\nwhere x^{(t)} and x^{(t+1)} are the solutions at the tth and (t + 1)th iterations, respectively, and g̃^{(t)} is a stochastic gradient of the cost function evaluated on a small mini-batch of all data.\n\nIn this paper, we are particularly interested in synchronous distributed stochastic gradient descent algorithms for non-convex optimization problems, mainly due to their recent successes and popularity in deep learning models [26, 29, 46, 47].\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nTable 1: Comparison of different local-SGD with periodic averaging based algorithms.\n\nStrategy | Convergence Rate | Communication Rounds (T/τ) | Extra Assumption | Setting\n[43] | O(G²/√(pT)) | O(p^{3/4} T^{3/4}) | Bounded Gradients | Non-convex\n[38] | O(1/√(pT)) | O(p^{3/2} T^{1/2}) | No | Non-convex\n[33] | O(G²/(pT)) | O(p^{1/2} T^{1/2}) | Bounded Gradients | Strongly Convex\nThis Paper | O(1/(pT)) | O(p^{1/3} T^{1/3}) | No | Non-convex under PL Condition\n\nParallelizing the updating rule in Eq. (2) can be done simply by replacing g̃^{(t)} with the average of the partial gradients computed by each worker over a random mini-batch sample of its own data shard. In fully synchronous SGD, the computation nodes, after evaluating gradients over the sampled mini-batch, exchange their updates in every iteration to ensure that all nodes have the same updated model. Despite its ease of implementation, updating the model in fully synchronous SGD incurs a significant amount of communication, in terms of both the number of rounds and the amount of data exchanged per communication round. 
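As a minimal illustration of the fully synchronous scheme just described, the sketch below (a toy example of our own, not the paper's implementation) performs one synchronous step on hypothetical quadratic local losses and checks that averaging the workers' gradients recovers the gradient of the global objective F:

```python
import numpy as np

# One step of fully synchronous distributed SGD on p workers.
# Each worker computes a stochastic gradient on its own shard; the
# averaged gradient updates a single shared model, so all workers
# stay in sync at every iteration. The quadratic local losses
# f_i(x) = 0.5 * ||x - c_i||^2 (gradient: x - c_i) are illustrative.
rng = np.random.default_rng(0)
p, d, eta = 4, 3, 0.1
centers = rng.normal(size=(p, d))   # minimizer of each local loss

def local_grad(x, i):
    return x - centers[i]

x = np.zeros(d)
avg_grad = np.mean([local_grad(x, i) for i in range(p)], axis=0)
x_next = x - eta * avg_grad         # Eq. (2) with the averaged gradient

# The synchronous step equals a plain SGD step on F = mean of the f_i.
full_grad = x - centers.mean(axis=0)
assert np.allclose(avg_grad, full_grad)
```

Because the update is linear in the gradient, averaging gradients and then stepping is identical to stepping on the average objective, which is what makes the fully synchronous variant exact but communication-heavy.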
The communication cost is, in fact, among the primary obstacles towards scaling distributed SGD to large scale deep learning applications [30, 4, 44, 21]. A central idea that has emerged recently to reduce the communication overhead of vanilla distributed SGD, while preserving the linear speedup, is local SGD, which is the focus of our work. In local SGD, the idea is to perform local updates with periodic averaging, wherein machines update their own local models, and the models of the different nodes are averaged periodically [43, 38, 51, 23, 45, 49, 33]. Because of local updates, the model averaging approach reduces the number of communication rounds in training and can, therefore, be much faster in practice. However, as the model at every iteration is not updated based on the entire data, it suffers from a residual error with respect to fully synchronous SGD; but it can be shown that if the averaging period is chosen properly, the residual error can be compensated. For instance, in [33] it has been shown that for strongly convex loss functions, with a fixed mini-batch size with local updates and periodic averaging, when T model update iterations are performed at each node, the linear speedup of parallel SGD is attainable with only O(√(pT)) rounds of communication, with each node performing τ = O(√(T/p)) local updates for every round. If p < T, this is a significant improvement over naive parallel SGD, which requires T rounds of communication. This motivates us to study the following key question: Can we reduce the number of communication rounds even more, and yet achieve linear speedup?\n\nIn this paper, we give an affirmative answer to this question by providing a tighter analysis of local SGD via model averaging [43, 38, 51, 23, 45, 49]. 
By focusing on possibly non-convex loss functions that satisfy smoothness and the Polyak-Łojasiewicz condition [18], and performing a careful convergence analysis, we demonstrate that O((pT)^{1/3}) rounds of communication suffice to achieve linear speed up for local SGD. To the best of our knowledge, this is the first work that presents bounds better than O(√(pT)) on the communication complexity of local SGD with fixed minibatch sizes; our results are summarized in Table 1.\n\nThe convergence analysis of periodic averaging, where the models are averaged across nodes after every τ local updates, was shown in [49], but it did not prove a linear speed up. For non-convex optimization, [43] shows that by choosing the number of local updates τ = O(T^{1/4}/p^{3/4}), model averaging achieves linear speedup. As a further improvement, [38] shows that even after removing the bounded gradient assumption, linear speedup can be achieved for non-convex optimization with the choice of τ = O(T^{1/2}/p^{3/2}). [33] shows that by setting τ = O(T^{1/2}/p^{1/2}), linear speedup can be achieved with O(√(pT)) rounds of communication. The present work can be considered as a tightening of the aforementioned known results. In summary, the main contributions of this paper are highlighted as follows:\n\n• We improve the upper bound on the number of local updates in [33] by establishing a linear speedup O(1/pT) for non-convex optimization problems under the Polyak-Łojasiewicz condition with τ = O(T^{2/3}/p^{1/3}); hence, O(p^{1/3}T^{1/3}) communication rounds are sufficient, in contrast to previous work that showed a sufficiency of O(√(pT)). Importantly, our analysis does not require a boundedness assumption for stochastic gradients, unlike [33]. 
• We introduce an adaptive scheme for choosing the communication frequency and elaborate on the conditions under which linear speedup can be achieved. We also empirically verify that the adaptive scheme outperforms the fixed periodic averaging scheme.\n\n• Finally, we complement our theoretical results with experimental results on an Amazon EC2 cluster and an internal GPU cluster.\n\n2 Other Related Work\n\nAsynchronous parallel SGD. For large scale machine learning optimization problems, parallel mini-batch SGD suffers from synchronization delay due to a few slow machines, slowing down the entire computation. To mitigate synchronization delay, asynchronous SGD methods are studied in [28, 8, 19]. These methods, though faster than synchronized methods, lead to convergence error issues due to stale gradients. [2] shows that a limited amount of delay can be tolerated while preserving linear speedup for convex optimization problems. Furthermore, [50] indicates that even polynomially growing delays can be tolerated by utilizing a quasilinear step-size sequence, but without achieving linear speedup.\n\nGradient compression based schemes. A popular approach to reducing the communication cost is to decrease the number of transmitted bits at each iteration via gradient compression. Limiting the number of bits in the floating point representation is studied in [8, 13, 25]. In [4, 40, 44], random quantization schemes are studied. Gradient vector sparsification is another approach, analyzed in [4, 40, 39, 5, 30, 34, 9, 3, 36, 21, 32].\n\nPeriodic model averaging. One-shot averaging, which can be seen as an extreme case of model averaging, was introduced in [51, 23]. In these works, it is shown empirically that one-shot averaging works well in a number of optimization problems. 
However, it is still an open problem whether one-shot averaging can achieve a linear speed-up with respect to the number of workers. In fact, [45] shows that one-shot averaging can yield inaccurate solutions for certain non-convex optimization problems. As a potential solution, [45] suggests that more frequent averaging in the beginning can improve the performance. [48, 31, 11, 16] present statistical convergence analyses with only one pass over the training data, which usually is not enough for the training error to converge. The advantages of model averaging have been studied from an empirical point of view in [27, 7, 24, 35, 17, 20]. Specifically, they show that model averaging performs well empirically in terms of reducing communication cost for a given accuracy. Furthermore, for the case of T = τ, the work [16] provides speedup with respect to bias and variance for quadratic optimization problems. There is another line of research which aims to reduce communication cost by adding data redundancy. For instance, reference [15] shows that by adding a controlled amount of redundancy through coding theoretic means, linear regression can be solved through one round of communication. Additionally, [14] shows an interesting trade-off between the amount of data redundancy and the accuracy of local SGD for general non-convex optimization.\n\nParallel SGD with varying minibatch sizes. References [10, 6] show, for strongly convex stochastic minimization, that SGD with exponentially increasing batch sizes can achieve a linear convergence rate on a single machine. Recently, [42] has shown that, remarkably, with exponentially growing mini-batch sizes it is possible to achieve linear speed up (i.e., an error of O(1/pT)) with only log T iterations of the algorithm; thereby, when implemented in a distributed setting, this corresponds to log T rounds of communication. 
The result of [42] implies that SGD with exponentially increasing batch sizes has a similar convergence behavior to full-fledged (non-stochastic) gradient descent. While the algorithm of [42] provides a different way of reducing communication in the distributed setting, for a large number of iterations their algorithm will require large mini-batches, which washes away the computational benefits of the stochastic gradient descent algorithm over its deterministic counterpart. Furthermore, for certain real-world data sets, it is well known that larger minibatches also lead to poor generalization and gradient saturation, which lead to significant performance gaps between the ideal and practical speed up [12, 22, 41, 20]. Our own experiments also reveal this (see Fig. 1, which illustrates this for logistic regression and a fixed learning rate).\n\nFigure 1: Running SyncSGD for different mini-batch sizes on the Epsilon dataset with logistic regression. Increasing the mini-batch size can result in divergence, as is the case here for mini-batch size 1024 compared to mini-batch size 512. For the experiment setup please refer to Section 6. A similar observation can be found in [20].\n\nAlgorithm 1 LUPA-SGD(τ): Local updates with periodic averaging.\nInputs: x^{(0)} as an initial global model and τ as the averaging period.\n1: for t = 1, 2, . . . , T do\n2:   parallel for j = 1, 2, . . . , p do\n3:     The j-th machine uniformly and independently samples a mini-batch ξ_j^{(t)} ⊂ D at iteration t.\n4:     Evaluate the stochastic gradient over the mini-batch, g̃_j^{(t)}, as in (3).\n5:     if τ divides t do\n6:       x_j^{(t+1)} = (1/p) Σ_{j=1}^p (x_j^{(t)} − η_t g̃_j^{(t)})\n7:     else do\n8:       x_j^{(t+1)} = x_j^{(t)} − η_t g̃_j^{(t)}\n9:     end if\n10:   end parallel for\n11: end for\n12: Output: x̄^{(T)} = (1/p) Σ_{j=1}^p x_j^{(T)}\n\n
Our work is complementary to the approach of [42], as we focus on approaches that use local updates with a fixed minibatch size, which in our experiments is a hyperparameter that is tuned to the data set.\n\n3 Local SGD with Periodic Averaging\n\nIn this section, we introduce the local SGD with model averaging algorithm and state the main assumptions we make to derive the convergence rates.\n\nSGD with Local Updates and Periodic Averaging. Consider a setting with training data D, loss functions fi : R^d → R for each data point indexed by i ∈ {1, 2, . . . , |D|}, and p distributed machines. Without loss of generality, it will be notationally convenient to assume D = {1, 2, . . . , |D|} in the sequel. For any subset S ⊆ D, we denote f(x, S) = Σ_{i∈S} fi(x) and F(x) = (1/p) f(x, D). Let ξ denote a |D| × 1 binary random vector that encodes a subset of D of cardinality B, or equivalently, ξ is a random vector of Hamming weight B. In our local updates with periodic averaging SGD algorithm, denoted by LUPA-SGD(τ) where τ represents the number of local updates, at iteration t the jth machine samples a mini-batch ξ_j^{(t)}, where ξ_j^{(t)}, j = 1, 2, . . . , p, t = 1, 2, . . . , τ, are independent realizations of ξ. The samples are then used to calculate the stochastic gradient as follows:\n\ng̃_j^{(t)} ≜ (1/B) ∇f(x_j^{(t)}, ξ_j^{(t)})   (3)\n\nNext, each machine updates its own local version of the model x_j^{(t)} using:\n\nx_j^{(t+1)} = x_j^{(t)} − η_t g̃_j^{(t)}   (4)\n\nAfter every τ iterations, we do the model averaging, where we average the local versions of the model across all p machines. The pseudocode of the algorithm is shown in Algorithm 1. The algorithm proceeds for T iterations, alternating between τ local updates followed by a communication round where the local solutions of all p machines are aggregated to update the global parameters. 
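Algorithm 1 can be simulated end to end on a toy problem. The sketch below is an illustrative NumPy simulation with made-up quadratic local objectives (it is not the paper's experiment code): p workers take τ local SGD steps following Eqs. (3)-(4) and average their models every τ iterations.

```python
import numpy as np

# Toy simulation of LUPA-SGD(tau): quadratic local objectives
# f_j(x) = 0.5 * ||x - c_j||^2 with additive gradient noise.
rng = np.random.default_rng(1)
p, d, T, tau, eta, sigma = 4, 2, 60, 5, 0.1, 0.01
centers = rng.normal(size=(p, d))      # local minimizers (synthetic)
x = np.zeros((p, d))                   # one model copy per worker

for t in range(1, T + 1):
    noise = sigma * rng.normal(size=(p, d))
    grads = (x - centers) + noise      # stochastic local gradients
    x = x - eta * grads                # Eq. (4): local update
    if t % tau == 0:                   # every tau steps: model averaging
        x[:] = x.mean(axis=0)

x_bar = x.mean(axis=0)                 # averaged model
x_star = centers.mean(axis=0)          # minimizer of the global average F
assert np.linalg.norm(x_bar - x_star) < 0.1
```

Between averaging rounds the workers drift toward their own local minimizers; the periodic averaging step is what pulls the iterates back toward the minimizer of the global objective, at the cost of only T/τ communication rounds.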
We note that unlike parallel SGD, where the machines are always in sync through frequent communication, in local SGD the local solutions are aggregated every τ iterations.\n\nAssumptions. Our convergence analysis is based on the following standard assumptions. We use the notations g(x) ≜ ∇F(x, D) and g̃(x) ≜ (1/B) ∇f(x, ξ) below. We drop the dependence of these functions on x when it is clear from context.\n\nAssumption 1 (Unbiased estimation). The stochastic gradient evaluated on a mini-batch ξ ⊂ D and at any point x is an unbiased estimator of the partial full gradient, i.e., E[g̃(x)] = g(x) for all x.\n\nAssumption 2 (Bounded variance [6]). The variance of stochastic gradients evaluated on a mini-batch of size B from D is bounded as\n\nE[‖g̃ − g‖²] ≤ C1‖g‖² + σ²/B,   (5)\n\nwhere C1 and σ are non-negative constants. Note that the bounded variance assumption (see [6]) is a stronger form of the above with C1 = 0.\n\nAssumption 3 (L-smoothness, µ-Polyak-Łojasiewicz (PL)). The objective function F(x) is differentiable and L-smooth, ‖∇F(x) − ∇F(y)‖ ≤ L‖x − y‖, ∀x, y ∈ R^d, and it satisfies the Polyak-Łojasiewicz condition with constant µ: (1/2)‖∇F(x)‖² ≥ µ(F(x) − F(x*)), ∀x ∈ R^d, where x* is an optimal solution, that is, F(x) ≥ F(x*), ∀x.\n\nRemark 1. Note that the PL condition does not require convexity. For instance, simple functions such as f(x) = (1/4)x² + sin²(2x) are not convex, but are µ-PL. The PL condition is a generalization of strong convexity, and µ-strong convexity implies µ-Polyak-Łojasiewicz (PL); see, e.g., [18] for more details. Therefore, any result based on the µ-PL assumption also applies assuming µ-strong convexity. 
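The PL inequality in Assumption 3 can be checked numerically on a grid. The snippet below does so for the standard non-convex PL example from [18], f(x) = x² + 3 sin²(x), which [18] shows to be µ-PL with µ = 1/32; both the function and the constant are taken from [18], not from the present paper.

```python
import numpy as np

# Check 0.5 * |f'(x)|^2 >= mu * (f(x) - f*) on a dense grid for the
# non-convex example f(x) = x^2 + 3 sin^2(x) from [18] (mu = 1/32).
f = lambda x: x**2 + 3 * np.sin(x)**2
df = lambda x: 2 * x + 6 * np.sin(x) * np.cos(x)
f_star = 0.0                       # global minimum value, attained at x = 0
mu = 1 / 32

xs = np.linspace(-10, 10, 100001)
lhs = 0.5 * df(xs)**2
rhs = mu * (f(xs) - f_star)
assert np.all(lhs >= rhs - 1e-12)  # PL inequality holds on the grid
```

A grid check like this is of course not a proof, but it is a quick way to falsify a conjectured PL constant for a given one-dimensional objective.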
It is noteworthy that many popular convex optimization problems, such as logistic regression and least-squares, are often not strongly convex but do satisfy the µ-PL condition [18].\n\n4 Convergence Analysis\n\nIn this section, we present the convergence analysis of the LUPA-SGD(τ) algorithm. All the proofs are deferred to the appendix. We define an auxiliary variable x̄^{(t)} = (1/p) Σ_{j=1}^p x_j^{(t)}, which is the average model across the p different machines at iteration t. Using the definition of x̄^{(t)}, the update rule in Algorithm 1 can be written as:\n\nx̄^{(t+1)} = x̄^{(t)} − η ((1/p) Σ_{j=1}^p g̃_j^{(t)}),   (6)\n\nwhich is equivalent to\n\nx̄^{(t+1)} = x̄^{(t)} − η ∇F(x̄^{(t)}) + η (∇F(x̄^{(t)}) − (1/p) Σ_{j=1}^p g̃_j^{(t)}),\n\nthus establishing a connection between our algorithm and perturbed SGD with deviation (∇F(x̄^{(t)}) − (1/p) Σ_{j=1}^p g̃_j^{(t)}). We show that, by the i.i.d. assumption and averaging with a properly chosen number of local updates, we can reduce the variance of the unbiased gradients to obtain the desired convergence rates with linear speed up. The convergence rate of the LUPA-SGD(τ) algorithm is stated below:\n\nTheorem 1. 
For LUPA-SGD(\u03c4 ) with \u03c4 local updates, under Assumptions 1 - 3, if we choose the\nlearning rate as \u03b7t = 4\n\u03b1 ) <\np ), and initialize all local model parameters at the same point \u00afx(0), for \u03c4 suf\ufb01ciently\n\u00b5 C1(\u03c4 \u2212\n\n\u00b5(t+a) where a = \u03b1\u03c4 + 4 with \u03b1 being a constant satisfying \u03b1 exp (\u2212 2\n\n(\u03c4 \u2212 1)\u03c4 (a + 1)\u03c4\u22122, 32L2\n\n\u03ba,192( p+1\nlarge to ensure that1 that 4(a \u2212 3)\u03c4\u22121L(C1 + p) \u2264 64L2(p+1)\n1)(a + 1)\u03c4\u22122 \u2264 64L2\n\u03c4 \u2265 \"( p+1\n\n\u00b5 (\u03c4 \u2212 1)\u03c4 (a + 1)\u03c4\u22122 and\n\u03b1 + 6\u03b1# +-\"( p+1\n2\"( p+1\n\n\u03b1 + 6\u03b1#2\n\u03b1 \u2212 \u03b12#\n\np )192\u03ba2e 4\np )192\u03ba2e 4\n\np )192\u03ba2e 4\n\n\u00b5p\n\n1Note that this is a mild condition: if we choose \u03c4 as an increasing function of T , e.g., Corollary 1, this\n\ncondition holds. Note also that C1 can be tuned to be small enough if required via appropriate sampling.\n\n5\n\n+ 20\"( p+1\n\np )192\u03ba2e 4\n\n\u03b1 \u2212 \u03b12#\n\n(7)\n\n\fafter T iterations we have:\n\nE!F (\u00afx\n\n(T )) \u2212 F \u2217\" \u2264\n\n\u00b5Bp(T + a)3\nwhere F \u2217 is the global minimum and \u03ba = L/\u00b5 is the condition number.\n\n\u00b5Bpa3E!F (\u00afx\n\n(0)) \u2212 F \u2217\" + 4\u03ba\u03c32T (T + 2a) + 256\u03ba2\u03c32T (\u03c4 \u2212 1)\n\n,\n\n(8)\n\nAn immediate result of above theorem is the following:\n\nCorollary 1. In Theorem 1 choosing \u03c4 = O. T\n\np\n\n1\n3 B\n\n2\n3\n\n1\n\n3/ leads to the following error bound:\n\nE(F (\u00afx(T )) \u2212 F \u2217) \u2264 O0 Bp(\u03b1\u03c4 + 4)3 + T 2\n\n1 = O. 
1\npBT/ ,\nTherefore, for large number of iterations T the convergence rate becomes O\" 1\nimplication of Theorem 1 is that by proper choice of \u03c4 , i.e., O\"T 2\n\nBp(T + a)3\n\n3 /p 1\n\npBT#, thus achieving\n3#, and periodically averaging\n\nthe local models it is possible to reduce the variance of stochastic gradients as discussed before.\nFurthermore, as \u00b5-strong convexity implies \u00b5-PL condition [18], Theorem 1 holds for \u00b5-strongly\nconvex cost functions as well.\n\na linear speed up with respect to the mini-batch size B and the number of machines p. A direct\n\n4.1 Comparison with existing algorithms\n\nNoting that the number of communication rounds is T /\u03c4 , for general non-convex optimization, [38]\nimproves the number of communication rounds in [43] from O(p 3\n2 ). In [33], by\nexploiting bounded variance and bounded gradient assumptions, it has been shown that for strongly\n\nlinear speed up can be achieved. In comparison to [33], we show that using the weaker Assump-\n\nconvex functions with \u03c4 = O(&T /p), or equivalently T /\u03c4 = O(\u221apT ) communication rounds,\n3# or equivalently\ntion 3, for non-convex cost functions under PL condition with \u03c4 = O\"T 2\n3# communication rounds, linear speed up can be achieved. All these results are\nT /\u03c4 = O\"(pT )\n\nsummarized in Table 1.\n\n4 ) to O(p 3\n\n3 /p 1\n\n4 T 3\n\n2 T 1\n\n1\n\nThe detailed proof of Theorem 1 will be provided in appendix, but here we discuss how a tighter\nconvergence rate compared to [33] is obtainable. In particular, the main reason behind improve-\nment of the LUPA-SGD over [33] is due to the difference in Assumption 3 and a novel technique\nintroduced to prove the convergence rate. 
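To make the communication budgets compared above concrete, the snippet below plugs in the experimental setting used later in Section 6 (p = 5 machines, mini-batch size B = 128, and T = 21875 iterations) and contrasts the averaging period of this paper with that of [33]:

```python
import math

# Section 6 setting: p = 5 machines, mini-batch size B = 128, T = 21875.
p, B, T = 5, 128, 21875

tau_ours = T ** (2 / 3) / (p * B) ** (1 / 3)  # this paper: O(T^(2/3)/(pB)^(1/3)), ~90.8
tau_prev = math.sqrt(T / (p * B))             # [33]: O(sqrt(T/(pB))), ~5.85

rounds_ours = T / tau_ours                    # O((pT)^(1/3))-type budget
rounds_prev = T / tau_prev                    # O(sqrt(pT))-type budget

assert round(tau_ours) == 91                  # the tau = 91 used in Section 6
assert tau_ours > tau_prev and rounds_ours < rounds_prev
```

A longer averaging period τ directly translates into fewer communication rounds T/τ for the same number of local iterations, which is the quantity both analyses trade off against the residual error.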
The convergence rate analysis of [33] is based on the uni-\nB ,\nwhich leads to the following bound on the difference between local solutions and their average at\ntth iteration:\n\nformly bounded gradient assumption, E*\u2225\u02dcgj\u22252\n\n2+ \u2264 G2, and bounded variance, E*\u2225\u02dcgj \u2212 gj\u22252\n\n2+ \u2264 \u03c32\n\n(9)\n\np\n\n1\np\n\n!j=1\n\nE*\u2225\u00afx(t) \u2212 x(t)\nj \u22252\n2+ \u2264 4\u03b72\n\nt G2\u03c4 2.\n\np\n\nconvergence bound which determines the maximum allowable size of the local updates without hurt-\n\n\u00b5T 2# in their\nIn [33] it is shown that weighted averaging over the term (9) results in the term O\" \u03ba\u03c4 2\ning optimal convergence rate. However, our analysis based on the assumption E\u03bej*\u2225\u02dcgj \u2212 gj\u22252+ \u2264\nC1\u2225gj\u22252 + \u03c32\n\u03c4 \u230b\u03c4 ):\np / \u03c4 \u03b72\n!j=1\n\nB , implies the following bound (see Lemma 3 in appendix with tc ! \u230a t\n)\u22252 + 2. p + 1\n\nj \u22252\u22642. p + 1\n\n(10)\nNote that we obtain (10) using the non-increasing property of \u03b7t from Lemma 3 by careful analysis\nof the effect of \ufb01rst term in (10) and the weighted averaging. In particular, in our analysis we show\nthat the second term in (10) can be reduced to 256\u03ba2\u03c32T (\u03c4\u22121)\nin Theorem 1; hence resulting in\nimproved upper bound over the number of local updates.\n\np / [C1 + \u03c4 ]\n\nE\u2225\u00afx(t) \u2212 x(t)\n\n\u2225 \u2207F (x(k)\n\n!k=tc\n\n!j=1\n\n\u00b5pB(T +a)3\n\n\u03c32\nB\n\nt\u22121\n\n\u03b72\nk\n\ntc\n\np\n\n.\n\nj\n\n6\n\n\f5 Adaptive LUPA-SGD\n\nThe convergence results discussed so far are indicated based on a \ufb01xed number of local updates, \u03c4 .\nRecently, [45] and [20] have shown empirically that more frequent communication in the beginning\nleads to improved performance over \ufb01xed communication period.\n\nThe main idea behind adaptive variant of LUPA-SGD stems from the following observation. 
Let us consider the convergence error of the LUPA-SGD algorithm as stated in (8). A careful investigation of the obtained rate O(1/pT) reveals that we need to have a³ E[F(x̄^{(0)}) − F*] = O(T²) for a = ατ + 4, with α being a constant, or equivalently τ = O(T^{2/3}/(p^{1/3}(F(x̄^{(0)}) − F*)^{1/3})). Therefore, the number of local updates τ can be chosen based on the distance of the objective at the initial model, x̄^{(0)}, to the objective at the optimal solution, x*. Inspired by this observation, we can think of the ith communication period as if the machines restart training at a new initial point x̄^{(iτ0)}, where τ0 is the number of initial local updates, and propose the following strategy to adaptively decide the number of local updates before averaging the models:\n\nτ_i = ⌈(F(x̄^{(0)})/(F(x̄^{(iτ0)}) − F*))^{1/3}⌉ τ0  ①→  τ_i = ⌈(F(x̄^{(0)})/F(x̄^{(iτ0)}))^{1/3}⌉ τ0,   (11)\n\nwhere Σ_{i=1}^E τ_i = T with E the total number of synchronizations, and ① comes from the fact that F(x^{(t)}) ≥ F*, and as a result we can simply drop the unknown global minimum value F* from the denominator of (11). Note that (11) generates an increasing sequence of numbers of local updates. A variation of this choice of τ_i is discussed in Section 6. We denote the adaptive algorithm by ADA-LUPA-SGD(τ1, . . . , τE) for an arbitrary (not necessarily increasing) sequence of positive integers. The following theorem analyzes the convergence rate of the adaptive algorithm, ADA-LUPA-SGD(τ1, . . . , τE).\n\nTheorem 2. For ADA-LUPA-SGD (τ1, . . . 
,\u03c4 E) with local updates, under Assumptions 1 to 3, if we\nchoose the learning rate as \u03b7t = 4\n\u00b5(t+c) and all local model parameters are initialized at the same\npoint, for \u03c4i, 1 \u2264 i \u2264 E suf\ufb01ciently large to ensure that 4(c \u2212 3)\u03c4i\u22121L(C1 + p) \u2264 64L2(p+1)\n(\u03c4i \u2212\n1)\u03c4i(c + 1)\u03c4i\u22122, 32L2\n\u00b5 (\u03c4i \u2212 1)\u03c4i(c + 1)\u03c4i\u22122, and \u03c4i, i = 1, 2, . . . , E\nsatis\ufb01es the condition in (7), then after T =\u2019E\n(0)) \u2212 F \u2217\" +\nE!F (\u00afx\n\n\u00b5 C1(\u03c4i \u2212 1)(c + 1)\u03c4i\u22122 \u2264 64L2\n\ni=1 \u03c4i iterations we have:\n\n256\u03ba2\u03c32 #E\n\n(T )) \u2212 F \u2217\" \u2264\n\n4\u03ba\u03c32T (T + 2c)\n\u00b5Bp(T + c)3 +\n\nc3\n\n(T + c)3 E!F (\u00afx\n\n\u00b5Bp(T + c)3\n\n\u00b5p\n\ni=1(\u03c4i \u2212 1)\u03c4i\n\n. (12)\n\nwhere c = \u03b1 max1\u2264i\u2264E \u03c4i + 4, \u03b1 is a constant satisfying \u03b1 exp (\u2212 2\nminimum and \u03ba = L/\u00b5 is the condition number.\n\n\u03b1 ) <\u03ba $192( p+1\n\np ), F \u2217 is the global\n\nWe emphasize that Algorithm 1 with sequence of local updates \u03c41, . . . ,\u03c4 E, preserves linear speed up\ni=1 \u03c4i(\u03c4i\u22121) = O(T 2),\npB#. Note that exponentially increasing \u03c4i that results in a total of\n\nas long as the following three conditions are satis\ufb01ed: i)\u2019E\niii) (max1\u2264i\u2264E \u03c4i)3 = O\" T 2\n\nO(log T ) communication rounds, does not satisfy these three conditions. 
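The adaptive rule (11) is easy to implement given (approximate) objective values at each synchronization point. The sketch below uses hypothetical placeholder losses, and `next_period` is our own helper name, not from the paper:

```python
import math

# Adaptive rule (11): scale the i-th communication period by the cube
# root of the ratio of the initial objective value to the current one.
def next_period(F0, F_current, tau0):
    # tau_i = ceil((F(xbar(0)) / F(xbar(i * tau0)))^(1/3)) * tau0
    return math.ceil((F0 / F_current) ** (1 / 3)) * tau0

tau0, F0 = 10, 8.0
losses = [8.0, 4.0, 1.0, 0.5]   # hypothetical F(xbar) after each round
schedule = [next_period(F0, F, tau0) for F in losses]
assert schedule == [10, 20, 20, 30]
```

As the objective decreases, the rule lengthens the communication period, matching the empirical observation cited above that frequent averaging matters most early in training.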
Thus our result sheds\nsome theoretical insight of ADA-LUPA algorithm on how big we can choose \u03c4i- under our setup\nand convergence techniques while preserving linear speed up - although, we note that impossibility\nresults need to be derived in future work to cement this insight.\n\ni=1 \u03c4i = T , ii)\u2019E\n\nAdditionally, the result of [37] is based on minimizing convergence error with respect to the wall-\nclock time using an adaptive synchronization scheme, while our focus is on reducing the number of\ncommunication rounds for a \ufb01xed number of model updates. Given a model for wall clock time, our\nanalysis can be readily extended to further \ufb01ne-tune the communication-computation complexity of\n[37].\n\n7\n\n\fFigure 2: Comparison of the convergence rate of SyncSGD with LUPA-SGD with \u03c4 = 5 [33],\n\u03c4 = 91 (ours) and one-shot (with only one communication round).\n\n6 Experiments\n\nTo validate the proposed algorithm compared to existing work and algorithms, we conduct experi-\nments on Epsilon dataset2, using logistic regression model, which satis\ufb01es PL condition. Epsilon\ndataset, a popular large scale binary dataset, consists of 400, 000 training samples and 100, 000 test\nsamples with feature dimension of 2000.\n\nExperiment setting. We run our experiments on two different settings implemented with different\nlibraries to show its ef\ufb01cacy on different platforms. Most of the experiments will be run on Ama-\nzon EC2 cluster with 5 p2.xlarge instances. In this environment we use PyTorch [26] to implement\nLUPA-SGD as well as the baseline SyncSGD. We also use an internal high performance computing\n(HPC) cluster equipped with NVIDIA Tesla V100 GPUs. In this environment we use Tensor\ufb02ow [1]\nto implement both SyncSGD and LUPA-SGD. The performance on both settings shows the superi-\nority of the algorithm in both time and convergence3.\n\nImplementations and setups. 
To run our algorithm, as stated, we use logistic regression. The learning rate and regularization parameter are 0.01 and 1 × 10^{−4}, respectively, and the mini-batch size is 128 unless otherwise stated. We use the mpi4py library from OpenMPI as the MPI library for distributed training.\n\nNormal training. The first experiment is normal training on the Epsilon dataset. As stated, the Epsilon dataset has 400,000 training samples, and if we want to run the experiment for 7 epochs on 5 machines with a mini-batch size of 128 (T = 21875), based on Table 1 we can calculate the value of τ, which for our LUPA-SGD is T^{2/3}/(pb)^{1/3} ≈ 91. If we followed the τ in [33], we would have to set τ to √(T/(pb)) ≈ 5 for this experiment. We also include the results for one-shot learning, which is local SGD with only one round of communication at the end. The results, depicted in Figure 2, show that LUPA-SGD with the higher τ can indeed converge to the same level as SyncSGD, at a faster rate in terms of wall-clock time.\n\nSpeedup. To show that LUPA-SGD with a greater number of local updates still benefits from linear speedup as the number of machines increases, we run our experiment on different numbers of machines. Then, we report the time each of them takes to reach a certain error level, say ε = 0.35. The results are the average of 5 repeats.\n\nAdaptive LUPA-SGD. To show how ADA-LUPA-SGD works, we run two experiments,\n\nFigure 3: Changing the number of machines and calculating the time to reach a certain level of error rate (ε = 0.35). It indicates that LUPA-SGD with τ = 91 can benefit from linear speedup by increasing the number of machines. 
The experiment is repeated 5 times and the average is reported.

2https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html
3The implementation code is available at https://github.com/mmkamani7/LUPA-SGD.
4https://www.open-mpi.org/

Figure 4: Comparison of the convergence rate of LUPA-SGD and ADA-LUPA-SGD, with τ = 91 for LUPA-SGD, and τ0 = 91 and τi = (1 + iα)τ0 with α = 1.09 for ADA-LUPA-SGD, so as to have 10 rounds of communication. The results show that ADA-LUPA-SGD can reach the same error level as LUPA-SGD with fewer rounds of communication.

first with a constant τ = 91, and the other with an increasing number of local updates, starting at τ0 = 91 with τi = (1 + iα)τ0 for α ≥ 0. We set α so as to obtain a given number of communication rounds. This experiment was run on the Tensorflow setting described before.

We note that access to the function value F(x(t)) is needed only for the theoretical analysis and is not necessary in practice, as long as the choice of τi satisfies the conditions in the statement of the theorem. In fact, as explained in our experiments, we do NOT use the function value oracle; instead we increase τi linearly across communication periods (see Figure 4), which demonstrates improvement over keeping τi constant.

7 Conclusion and Future Work

In this paper, we strengthen the theory of local updates with periodic averaging for distributed non-convex optimization. We improve the previously known bound on the number of local updates while preserving the linear speed up, and validate our results through experiments. We also present an adaptive algorithm for deciding the number of local updates as the algorithm proceeds.

Our work opens a few interesting directions for future work.
First, it is still unclear whether we can preserve linear speed up with larger numbers of local updates (e.g., τ = O(T/log T), which would require only O(log T) communication rounds). Recent studies have made remarkable observations about using large mini-batch sizes from a practical standpoint: [41] demonstrated that the maximum allowable mini-batch size is bounded by the gradient diversity quantity, and [42] showed that using larger mini-batch sizes can lead to superior training error convergence. These observations raise an interesting question that is worthy of investigation. In particular, an interesting direction motivated by our work and the contrasting views of these works would be exploring the maximum allowable τ for which performance does not decay under a fixed bound on the mini-batch size. Finally, obtaining lower bounds on the number of local updates for a fixed mini-batch size to achieve linear speedup is an interesting research question.

Acknowledgement

This work was partially supported by the NSF CCF 1553248 and NSF CCF 1763657 grants.

References

[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016.

[2] Alekh Agarwal and John C Duchi. Distributed delayed stochastic optimization. In Advances in Neural Information Processing Systems, pages 873–881, 2011.

[3] Alham Fikri Aji and Kenneth Heafield. Sparse communication for distributed gradient descent. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 440–445, 2017.

[4] Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. QSGD: Communication-efficient SGD via gradient quantization and encoding.
In Advances in Neural Information Processing Systems, pages 1709–1720, 2017.

[5] Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, and Anima Anandkumar. signSGD: Compressed optimisation for non-convex problems. arXiv preprint arXiv:1802.04434, 2018.

[6] Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.

[7] Kai Chen and Qiang Huo. Scalable training of deep learning machines by incremental block training with intra-block parallel optimization and blockwise model-update filtering. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5880–5884. IEEE, 2016.

[8] Christopher M De Sa, Ce Zhang, Kunle Olukotun, and Christopher Ré. Taming the wild: A unified analysis of hogwild-style algorithms. In Advances in Neural Information Processing Systems, pages 2674–2682, 2015.

[9] Nikoli Dryden, Tim Moon, Sam Ade Jacobs, and Brian Van Essen. Communication quantization for data-parallel training of deep neural networks. In 2016 2nd Workshop on Machine Learning in HPC Environments (MLHPC), pages 1–8. IEEE, 2016.

[10] Michael P Friedlander and Mark Schmidt. Hybrid deterministic-stochastic methods for data fitting. Technical report, 2011.

[11] Antoine Godichon-Baggioni and Sofiane Saadane. On the rates of convergence of parallelized averaged stochastic gradient algorithms. arXiv preprint arXiv:1710.07926, 2017.

[12] Noah Golmant, Nikita Vemuri, Zhewei Yao, Vladimir Feinberg, Amir Gholami, Kai Rothauge, Michael W Mahoney, and Joseph Gonzalez. On the computational inefficiency of large batch sizes for stochastic gradient descent. arXiv preprint arXiv:1811.12941, 2018.

[13] Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep learning with limited numerical precision.
In International Conference on Machine Learning, pages 1737–1746, 2015.

[14] Farzin Haddadpour, Mohammad Mahdi Kamani, Mehrdad Mahdavi, and Viveck Cadambe. Trading redundancy for communication: Speeding up distributed SGD for non-convex optimization. In International Conference on Machine Learning, pages 2545–2554, 2019.

[15] Farzin Haddadpour, Yaoqing Yang, Viveck Cadambe, and Pulkit Grover. Cross-iteration coded computing. In 2018 56th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 196–203. IEEE, 2018.

[16] Prateek Jain, Sham M Kakade, Rahul Kidambi, Praneeth Netrapalli, and Aaron Sidford. Parallelizing stochastic gradient descent for least squares regression: mini-batching, averaging, and model misspecification. Journal of Machine Learning Research, 18(223):1–42, 2018.

[17] Michael Kamp, Linara Adilova, Joachim Sicking, Fabian Hüger, Peter Schlicht, Tim Wirtz, and Stefan Wrobel. Efficient decentralized deep learning by dynamic model averaging. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 393–409. Springer, 2018.

[18] Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 795–811. Springer, 2016.

[19] Xiangru Lian, Yijun Huang, Yuncheng Li, and Ji Liu. Asynchronous parallel stochastic gradient for nonconvex optimization. In Advances in Neural Information Processing Systems, pages 2737–2745, 2015.

[20] Tao Lin, Sebastian U Stich, and Martin Jaggi. Don't use large mini-batches, use local SGD. arXiv preprint arXiv:1808.07217, 2018.

[21] Yujun Lin, Song Han, Huizi Mao, Yu Wang, and William J Dally.
Deep gradient compression: Reducing the communication bandwidth for distributed training. arXiv preprint arXiv:1712.01887, 2017.

[22] Siyuan Ma, Raef Bassily, and Mikhail Belkin. The power of interpolation: Understanding the effectiveness of SGD in modern over-parametrized learning. arXiv preprint arXiv:1712.06559, 2017.

[23] Ryan McDonald, Keith Hall, and Gideon Mann. Distributed training strategies for the structured perceptron. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 456–464. Association for Computational Linguistics, 2010.

[24] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pages 1273–1282, 2017.

[25] Taesik Na, Jong Hwan Ko, Jaeha Kung, and Saibal Mukhopadhyay. On-chip training of recurrent neural networks with limited numerical precision. In 2017 International Joint Conference on Neural Networks (IJCNN), pages 3716–3723. IEEE, 2017.

[26] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.

[27] Daniel Povey, Xiaohui Zhang, and Sanjeev Khudanpur. Parallel training of DNNs with natural gradient and parameter averaging. arXiv preprint arXiv:1410.7455, 2014.

[28] Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 693–701, 2011.

[29] Frank Seide and Amit Agarwal. CNTK: Microsoft's open-source deep-learning toolkit. In KDD, 2016.

[30] Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu.
1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Fifteenth Annual Conference of the International Speech Communication Association, 2014.

[31] Ohad Shamir and Nathan Srebro. Distributed stochastic optimization and learning. In 2014 52nd Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 850–857. IEEE, 2014.

[32] Sebastian U Stich, Jean-Baptiste Cordonnier, and Martin Jaggi. Sparsified SGD with memory. In Advances in Neural Information Processing Systems, pages 4447–4458, 2018.

[33] Sebastian Urban Stich. Local SGD converges fast and communicates little. In ICLR 2019 International Conference on Learning Representations, 2019.

[34] Nikko Strom. Scalable distributed DNN training using commodity GPU cloud computing. In Sixteenth Annual Conference of the International Speech Communication Association, 2015.

[35] Hang Su and Haoyu Chen. Experiments on parallel training of deep neural network using model averaging. arXiv preprint arXiv:1507.01239, 2015.

[36] Xu Sun, Xuancheng Ren, Shuming Ma, and Houfeng Wang. meProp: Sparsified back propagation for accelerated deep learning with reduced overfitting. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 3299–3308. JMLR.org, 2017.

[37] Jianyu Wang and Gauri Joshi. Adaptive communication strategies to achieve the best error-runtime trade-off in local-update SGD. arXiv preprint arXiv:1810.08313, 2018.

[38] Jianyu Wang and Gauri Joshi. Cooperative SGD: A unified framework for the design and analysis of communication-efficient SGD algorithms. arXiv preprint arXiv:1808.07576, 2018.

[39] Jianqiao Wangni, Jialei Wang, Ji Liu, and Tong Zhang. Gradient sparsification for communication-efficient distributed optimization.
In Advances in Neural Information Processing Systems, pages 1299–1309, 2018.

[40] Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. TernGrad: Ternary gradients to reduce communication in distributed deep learning. In Advances in Neural Information Processing Systems, pages 1509–1519, 2017.

[41] Dong Yin, Ashwin Pananjady, Max Lam, Dimitris Papailiopoulos, Kannan Ramchandran, and Peter Bartlett. Gradient diversity: a key ingredient for scalable distributed learning. arXiv preprint arXiv:1706.05699, 2017.

[42] Hao Yu and Rong Jin. On the computation and communication complexity of parallel SGD with dynamic batch sizes for stochastic non-convex optimization. In International Conference on Machine Learning, pages 7174–7183, 2019.

[43] Hao Yu, Sen Yang, and Shenghuo Zhu. Parallel restarted SGD with faster convergence and less communication: Demystifying why model averaging works for deep learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 5693–5700, 2019.

[44] Hantian Zhang, Jerry Li, Kaan Kara, Dan Alistarh, Ji Liu, and Ce Zhang. ZipML: Training linear models with end-to-end low precision, and a little bit of deep learning. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 4035–4043. JMLR.org, 2017.

[45] Jian Zhang, Christopher De Sa, Ioannis Mitliagkas, and Christopher Ré. Parallel SGD: When does averaging help? arXiv preprint arXiv:1606.07365, 2016.

[46] X. Zhang, M. M. Khalili, and M. Liu. Recycled ADMM: Improving the privacy and accuracy of distributed algorithms. IEEE Transactions on Information Forensics and Security, pages 1–1, 2019.

[47] Xueru Zhang, Mohammad Mahdi Khalili, and Mingyan Liu. Improving the privacy and accuracy of ADMM-based distributed algorithms.
In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 5796–5805. PMLR, 10–15 Jul 2018.

[48] Yuchen Zhang, Martin J Wainwright, and John C Duchi. Communication-efficient algorithms for statistical optimization. In Advances in Neural Information Processing Systems, pages 1502–1510, 2012.

[49] Fan Zhou and Guojing Cong. On the convergence properties of a k-step averaging stochastic gradient descent algorithm for nonconvex optimization. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pages 3219–3227. AAAI Press, 2018.

[50] Zhengyuan Zhou, Panayotis Mertikopoulos, Nicholas Bambos, Peter W Glynn, Yinyu Ye, Li-Jia Li, and Fei-Fei Li. Distributed asynchronous optimization with unbounded delays: How slow can you go? In ICML 2018 - 35th International Conference on Machine Learning, pages 1–10, 2018.

[51] Martin Zinkevich, Markus Weimer, Lihong Li, and Alex J Smola. Parallelized stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 2595–2603, 2010.