{"title": "An Accelerated Decentralized Stochastic Proximal Algorithm for Finite Sums", "book": "Advances in Neural Information Processing Systems", "page_first": 954, "page_last": 964, "abstract": "Modern large-scale finite-sum optimization relies on two key aspects: distribution and stochastic updates. For smooth and strongly convex problems, existing decentralized algorithms are slower than modern accelerated variance-reduced stochastic algorithms when run on a single machine, and are therefore not efficient. Centralized algorithms are fast, but their scaling is limited by global aggregation steps that result in communication bottlenecks. In this work, we propose an efficient \\textbf{A}ccelerated \\textbf{D}ecentralized stochastic algorithm for \\textbf{F}inite \\textbf{S}ums named ADFS, which uses local stochastic proximal updates and randomized pairwise communications between nodes. On $n$ machines, ADFS learns from $nm$ samples in the same time it takes optimal algorithms to learn from $m$ samples on one machine. This scaling holds until a critical network size is reached, which depends on communication delays, on the number of samples $m$, and on the network topology. We provide a theoretical analysis based on a novel augmented graph approach combined with a precise evaluation of synchronization times and an extension of the accelerated proximal coordinate gradient algorithm to arbitrary sampling. 
We illustrate the improvement of ADFS over state-of-the-art decentralized approaches with experiments.", "full_text": "An Accelerated Decentralized Stochastic Proximal Algorithm for Finite Sums

Hadrien Hendrikx (INRIA - DIENS - PSL Research University, hadrien.hendrikx@inria.fr)
Francis Bach (INRIA - DIENS - PSL Research University, francis.bach@inria.fr)
Laurent Massoulié (INRIA - DIENS - PSL Research University, laurent.massoulie@inria.fr)

Abstract

Modern large-scale finite-sum optimization relies on two key aspects: distribution and stochastic updates. For smooth and strongly convex problems, existing decentralized algorithms are slower than modern accelerated variance-reduced stochastic algorithms when run on a single machine, and are therefore not efficient. Centralized algorithms are fast, but their scaling is limited by global aggregation steps that result in communication bottlenecks. In this work, we propose an efficient Accelerated Decentralized stochastic algorithm for Finite Sums named ADFS, which uses local stochastic proximal updates and randomized pairwise communications between nodes. On n machines, ADFS learns from nm samples in the same time it takes optimal algorithms to learn from m samples on one machine. This scaling holds until a critical network size is reached, which depends on communication delays, on the number of samples m, and on the network topology. We provide a theoretical analysis based on a novel augmented graph approach combined with a precise evaluation of synchronization times and an extension of the accelerated proximal coordinate gradient algorithm to arbitrary sampling. We illustrate the improvement of ADFS over state-of-the-art decentralized approaches with experiments.

1 Introduction

The success of machine learning models is mainly due to their capacity to train on huge amounts of data. 
Distributed systems can be used to process more data than one computer can store, or to increase the pace at which models are trained by splitting the work among many computing nodes. In this work, we focus on problems of the form:

min_{θ ∈ R^d} ∑_{i=1}^n f_i(θ), where f_i(θ) = ∑_{j=1}^m f_{i,j}(θ) + (σ_i/2)‖θ‖².   (1)

This is the typical ℓ2-regularized empirical risk minimization problem with n computing nodes that have m local training examples each. The function f_{i,j} represents the loss function for the j-th training example of node i and is assumed to be convex and L_{i,j}-smooth [Nesterov, 2013, Bubeck, 2015]. These problems are usually solved by first-order methods, and the basic distributed algorithms compute gradients in parallel over several machines [Nedic and Ozdaglar, 2009]. Another way to speed up training is to use stochastic algorithms [Bottou, 2010, Defazio et al., 2014, Johnson and Zhang, 2013], that take advantage of the finite sum structure of the problem to use cheaper iterations while preserving fast convergence. This paper aims at bridging the gap between stochastic and decentralized algorithms when local functions are smooth and strongly convex.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Table 1: Comparison of various state-of-the-art decentralized algorithms to reach accuracy ε on regular graphs. Constant factors are omitted, as well as the log(ε^{-1}) factor in the TIME column. Reported runtime for Point-SAGA corresponds to running it on a single machine with nm samples. To allow for direct comparison, we assume that computing a dual gradient of a function f_i as required by MSDA and ESDACD takes time m, although it is generally more expensive than to compute m separate proximal operators of single f_{i,j} functions.

ALGORITHM                        | SYNCHRONY | STOCHASTIC | TIME
POINT-SAGA [DEFAZIO, 2016]       | N/A       | ✓          | nm + √(nm κ_s)
MSDA [SCAMAN ET AL., 2017]       | GLOBAL    | ✗          | √κ_b (m + τ/√γ)
ESDACD [HENDRIKX ET AL., 2019]   | LOCAL     | ✗          | (m + τ)√(κ_b/γ)
DSBA [SHEN ET AL., 2018]         | GLOBAL    | ✓          | (m + κ_s + γ^{-1})(1 + τ)
ADFS (THIS PAPER)                | LOCAL     | ✓          | m + √(m κ_s) + (1 + τ)√(κ_s/γ)

In the rest of this paper, following Scaman et al. [2017], we assume that nodes are linked by a communication network and can only exchange messages with their neighbours. We further assume that each communication takes time τ and that processing one sample, i.e., computing the proximal operator for a single function f_{i,j}, takes time 1. The proximal operator of a function f_{i,j} is defined by prox_{η f_{i,j}}(x) = arg min_v (1/(2η))‖v − x‖² + f_{i,j}(v). The condition number of the Laplacian matrix of the graph representing the communication network is denoted γ. This natural constant appears in the running time of many decentralized algorithms and is for instance of order O(1) for the complete graph and O(n^{-1}) for the 2D grid. More generally, γ^{-1/2} is typically of the same order as the diameter of the graph. Following notations from Xiao et al. [2019], we define the batch and stochastic condition numbers κ_b and κ_s (which are classical quantities in the analysis of finite sum optimization) such that for all i, κ_b ≥ M_i/σ_i where M_i is the smoothness constant of the function f_i, and κ_s ≥ κ_i, with κ_i = 1 + ∑_{j=1}^m L_{i,j}/σ_i the stochastic condition number of node i. Although κ_s is always bigger than κ_b, it is generally of the same order of magnitude, leading to the practical superiority of stochastic algorithms. 
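The constant γ can be computed directly from the graph Laplacian. Below is a minimal NumPy sketch (our own illustration, not code from the paper) comparing a complete graph, where γ = 1, with a path graph, where γ^{-1/2} grows with the diameter:

```python
import numpy as np

def laplacian_gamma(adj):
    # gamma = lambda_min^+ / lambda_max for the Laplacian L = D - A of a connected graph.
    lap = np.diag(adj.sum(axis=1)) - adj
    eigs = np.sort(np.linalg.eigvalsh(lap))
    # eigs[0] is (numerically) zero for a connected graph; eigs[1] is the smallest non-zero one.
    return eigs[1] / eigs[-1]

n = 20
complete = np.ones((n, n)) - np.eye(n)                            # complete graph K_n
path = np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)   # path graph on n nodes

gamma_complete = laplacian_gamma(complete)   # K_n Laplacian eigenvalues are 0 and n, so gamma = 1
gamma_path = laplacian_gamma(path)           # small: gamma^{-1/2} scales like the diameter
```

For K_n the non-zero Laplacian eigenvalues all equal n, giving γ = 1; for the path graph γ is of order n^{-2}, consistent with γ^{-1/2} being of the order of the diameter.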
The next paragraphs discuss the relevant state of the art for both distributed and stochastic methods, and Table 1 sums up the speeds of the main decentralized algorithms available to solve Problem (1). Although it is not a distributed algorithm, Point-SAGA [Defazio, 2016], an optimal single-machine algorithm, is also presented for comparison.

Centralized gradient methods. A simple way to split work between nodes is to distribute gradient computations and to aggregate them on a parameter server. Provided the network is fast enough, this allows the system to learn from the datasets of n workers in the same time one worker would need to learn from its own dataset. Yet, these approaches are very sensitive to stochastic delays, slow nodes, and communication bottlenecks. Asynchronous methods may be used [Recht et al., 2011, Leblond et al., 2017, Xiao et al., 2019] to address the first two issues, but computing gradients on older (or even inconsistent) versions of the parameter harms convergence [Chen et al., 2016]. Therefore, this paper focuses on decentralized algorithms, which are generally less sensitive to communication bottlenecks [Lian et al., 2017].

Decentralized gradient methods. In their synchronous versions, decentralized algorithms alternate rounds of computations (in which all nodes compute gradients with respect to their local data) and communications, in which nodes exchange information with their direct neighbors [Duchi et al., 2012, Shi et al., 2015, Nedic et al., 2017, Tang et al., 2018, He et al., 2018]. Communication steps often consist in averaging gradients or parameters with neighbours, and can thus be abstracted as multiplication by a so-called gossip matrix. MSDA [Scaman et al., 2017] is a batch decentralized synchronous algorithm, and it is optimal with respect to the constants γ and κ_b among batch algorithms that can only perform these two operations. Instead of performing global synchronous updates, some approaches inspired from gossip algorithms [Boyd et al., 2006] use randomized pairwise communications [Nedic and Ozdaglar, 2009, Johansson et al., 2009, Colin et al., 2016]. This for example allows fast nodes to perform more updates in order to benefit from their increased computing power. These randomized algorithms do not suffer from the usual worst-case analyses of bounded-delay asynchronous algorithms, and can thus have fast rates because the step-size does not need to be reduced in the presence of delays. For example, ESDACD [Hendrikx et al., 2019] achieves the same optimal speed as MSDA when batch computations are faster than communications (τ > m). However, both use gradients of the Fenchel conjugates of the full local functions, which are generally much harder to get than regular gradients.

Stochastic algorithms for finite sums. All distributed methods presented earlier are batch methods that rely on computing full gradient steps of each function f_i. Stochastic methods perform updates based on randomly chosen functions f_{i,j}. In the smooth and strongly convex setting, they can be coupled with variance reduction [Schmidt et al., 2017, Shalev-Shwartz and Zhang, 2013, Johnson and Zhang, 2013, Defazio et al., 2014] and acceleration, to achieve the m + √(m κ_s) optimal finite-sum rate, which greatly improves over the m√κ_b batch optimum when the dataset is large. Examples of such methods include Accelerated-SDCA [Shalev-Shwartz and Zhang, 2014], APCG [Lin et al., 2015], Point-SAGA [Defazio, 2016] or Katyusha [Allen-Zhu, 2017].

Decentralized stochastic methods. In the smooth and strongly convex setting, DSA [Mokhtari and Ribeiro, 2016] and later DSBA [Shen et al., 2018] are two linearly converging stochastic decentralized algorithms. DSBA uses the proximal operator of individual functions f_{i,j} to significantly improve over DSA in terms of rates. Yet, DSBA does not enjoy the √(m κ_s) accelerated rates and needs an excellent network with very fast communications. Indeed, nodes need to communicate each time they process a single sample, resulting in many communication steps. CHOCO-SGD [Koloskova et al., 2019] is a simple decentralized stochastic algorithm with support for compressed communications. Yet, it is not linearly convergent and it requires to communicate between each gradient step as well. Therefore, to the best of our knowledge, there is no decentralized stochastic algorithm with accelerated linear convergence rate or low communication complexity without sparsity assumptions (i.e., sparse features in linear supervised learning).

ADFS. The main contribution of this paper is a locally synchronous Accelerated Decentralized stochastic algorithm for Finite Sums, named ADFS. It is very similar to APCG for empirical risk minimization in the limit case n = 1 (single machine), for which it gets the same m + √(m κ_s) rate. Besides, this rate stays unchanged when the number of machines grows, meaning that ADFS can process n times more data in the same amount of time on a network of size n. This scaling lasts as long as (1 + τ)√κ_s · γ^{-1/2} < m + √(m κ_s). This means that ADFS is at least as fast as MSDA unless both the network is extremely fast (communications are faster than evaluating a single proximal operator) and the diameter of the graph is very large compared to the size of the local finite sums. Therefore, ADFS outperforms MSDA and DSBA in most standard machine learning settings, combining optimal network scaling with the efficient distribution of optimal sequential finite-sum algorithms. 
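The scaling condition above is easy to evaluate numerically. A small sketch (illustrative numbers of our own choosing, not from the paper) checking whether the (1 + τ)√(κ_s/γ) communication term stays below the m + √(m κ_s) computation term:

```python
import math

def adfs_scaling_holds(m, tau, kappa_s, gamma):
    # True when (1 + tau) * sqrt(kappa_s / gamma) < m + sqrt(m * kappa_s),
    # i.e. when the linear speedup of ADFS is not limited by the network.
    comm_term = (1 + tau) * math.sqrt(kappa_s / gamma)
    comp_term = m + math.sqrt(m * kappa_s)
    return comm_term < comp_term

# Example: m = 10^4 samples per node, kappa_s = 10^4, communication delay tau = 5.
well_connected = adfs_scaling_holds(m=1e4, tau=5, kappa_s=1e4, gamma=1.0)    # complete-graph-like
poorly_connected = adfs_scaling_holds(m=1e4, tau=5, kappa_s=1e4, gamma=1e-6)  # huge diameter
```

With these numbers the condition holds on the well-connected graph (600 < 2·10^4) but fails when γ = 10^{-6}, matching the discussion above: only a very large diameter combined with fast communications breaks the scaling.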
Note however that, similarly to DSBA and Point-SAGA, ADFS requires evaluating prox_{f_{i,j}}, which requires solving a local optimization problem. Yet, in the case of linear models such as logistic regression, it is only a constant factor slower than computing ∇f_{i,j}, and it is especially much faster than computing the gradient of the conjugate of the full dual functions ∇f*_i required by ESDACD and MSDA, which were not designed for finite sums on each node in the first place.

ADFS is based on three novel technical contributions: (i) a novel augmented graph approach which yields the dual formulation of Section 2, (ii) an extension of the APCG algorithm to arbitrary sampling that is applied to the dual problem in order to get the generic algorithm of Section 3, and (iii) the analysis of local synchrony, which is performed in Section 4. Finally, Section 5 presents a relevant choice of parameters leading to the rates shown in Table 1, and an experimental comparison is done in Section 6. A Python implementation of ADFS is also provided in supplementary material.

2 Model and Derivations

We now specify our approach to solve the problem in Equation (1). The first (classical) step consists in considering that all nodes have a local parameter, but that all local parameters should be equal because the goal is to have the global minimizer of the sum. Therefore, the problem writes:

min_{θ ∈ R^{n×d}} ∑_{i=1}^n f_i(θ^{(i)}) such that θ^{(i)} = θ^{(j)} if j ∈ N(i),   (2)

where N(i) represents the neighbors of node i in the communication graph. Then, ESDACD and MSDA are obtained by applying accelerated (coordinate) gradient descent to an appropriate dual formulation of Problem (2). In the dual formulation, constraints become variables and so updating a dual coordinate consists in performing an update along an edge of the network. 
In this work, we consider a new virtual graph in order to get a stochastic algorithm for finite sums. The transformation is sketched in Figure 1, and consists in replacing each node of the initial network by a star network.

Figure 1: Illustration of the augmented graph for n = 3 and m = 3.

The centers of the stars are connected by the actual communication network, and the center of the star network replacing node i has the local function f_i^{comm} : x ↦ (σ_i/2)‖x‖². The center of node i is then connected with m nodes whose local functions are the functions f_{i,j} for j ∈ {1, ..., m}. If we denote E the number of edges of the initial graph, then the augmented graph has n(1 + m) nodes and E + nm edges.

Then, we consider one parameter vector θ^{(i,j)} for each function f_{i,j} and one vector θ^{(i)} for each function f_i^{comm}. Therefore, there is one parameter vector for each node in the augmented graph. We impose the standard constraint that the parameter of each node must be equal to the parameters of its neighbors, but neighbors are now taken in the augmented graph. This yields the following minimization problem:

min_{θ ∈ R^{n(1+m)×d}} ∑_{i=1}^n [ ∑_{j=1}^m f_{i,j}(θ^{(i,j)}) + (σ_i/2)‖θ^{(i)}‖² ]   (3)

such that θ^{(i)} = θ^{(j)} if j ∈ N(i), and θ^{(i,j)} = θ^{(i)} for all j ∈ {1, ..., m}.

In the rest of the paper, we use letters k, ℓ to refer to any nodes in the augmented graph, and letters i, j to specifically refer to a communication node and one of its virtual nodes. More precisely, we denote (k, ℓ) the edge between the nodes k and ℓ in the augmented graph. Note that k and ℓ can be virtual or communication nodes. We denote e^{(k)} the unit vector of R^{n(1+m)} corresponding to node k, and e_{kℓ} the unit vector of R^{E+nm} corresponding to edge (k, ℓ). To clearly make the distinction between node variables and edge variables, for any matrix on the set of nodes of the augmented graph x ∈ R^{n(1+m)×d} we write that x^{(k)} = x^T e^{(k)} for k ∈ {1, ..., n(1 + m)} (superscript notation), and for any matrix on the set of edges of the augmented graph λ ∈ R^{(E+nm)×d} we write that λ_{kℓ} = λ^T e_{kℓ} (subscript notation) for any edge (k, ℓ). For node variables, we use the subscript notation with a t to denote time, for instance in Algorithm 1. By a slight abuse of notations, we use indices (i, j) instead of (k, ℓ) when specifically referring to virtual edges (or virtual nodes) and denote λ_{ij} instead of λ_{i,(i,j)} the virtual edge between node i and node (i, j) in the augmented graph. The constraints of Problem (3) can be rewritten A^T θ = 0 in matrix form, where A ∈ R^{n(1+m)×(nm+E)} is such that Ae_{kℓ} = µ_{kℓ}(e^{(k)} − e^{(ℓ)}) for some µ_{kℓ} > 0. Then, the dual formulation of this problem writes:

max_{λ ∈ R^{(nm+E)×d}} − ∑_{i=1}^n [ ∑_{j=1}^m f*_{i,j}((Aλ)^{(i,j)}) + (1/(2σ_i))‖(Aλ)^{(i)}‖² ],   (4)

where the parameter λ is the Lagrange multiplier associated with the constraints of Problem (3): more precisely, for an edge (k, ℓ), λ_{kℓ} ∈ R^d is the Lagrange multiplier associated with the constraint µ_{kℓ}(e^{(k)} − e^{(ℓ)})^T θ = 0. At this point, the functions f_{i,j} are only assumed to be convex (and not necessarily strongly convex), meaning that the functions f*_{i,j} are potentially non-smooth. This problem could be bypassed by transferring some of the quadratic penalty from the communication nodes to the virtual nodes before going to the dual formulation. 
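The constraint matrix A just defined can be built explicitly for small graphs. A hedged NumPy sketch (our own construction, with arbitrary µ = 1 weights and our own node indexing) checking that A^T θ = 0 exactly when all augmented-graph parameters agree:

```python
import numpy as np

def build_A(n, m, comm_edges, mu=1.0):
    # Incidence-like matrix of the augmented graph: center node i has column index i,
    # virtual node (i, j) has index n + i * m + j, and A e_kl = mu * (e_k - e_l).
    edges = list(comm_edges) + [(i, n + i * m + j) for i in range(n) for j in range(m)]
    A = np.zeros((n * (1 + m), len(edges)))
    for col, (k, l) in enumerate(edges):
        A[k, col] = mu
        A[l, col] = -mu
    return A

n, m = 3, 2
A = build_A(n, m, comm_edges=[(0, 1), (1, 2)])  # path graph on the 3 centers
assert A.shape == (n * (1 + m), 2 + n * m)      # n(1+m) nodes, E + nm edges

theta_consensus = np.ones((n * (1 + m), 1))     # all node parameters equal
theta_split = theta_consensus.copy()
theta_split[0] = 2.0                            # one center disagrees
```

Here A A^T is the (weighted) Laplacian of the augmented graph, so its row sums vanish; and A^T θ is zero on the consensus point but not on the split one.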
Yet, this approach fails when m is large because the smoothness parameter of f*_{i,j} would scale as m/σ_i at best, whereas a smoothness of order 1/σ_i is required to match optimal finite-sum methods. A better option is to consider the f*_{i,j} terms as non-smooth and perform proximal updates on them. The rate of proximal gradient methods such as APCG [Lin et al., 2015] does not depend on the strong convexity parameter of the non-smooth functions f*_{i,j}. Each f*_{i,j} is (1/L_{i,j})-strongly convex (because f_{i,j} was L_{i,j}-smooth), so we can rewrite the previous equation in order to transfer all the strong convexity to the communication node. Noting that (Aλ)^{(i,j)} = −µ_{ij}λ_{ij} when node (i, j) is a virtual node associated with node i, we rewrite the dual problem as:

min_{λ ∈ R^{(E+nm)×d}} q_A(λ) + ∑_{i=1}^n ∑_{j=1}^m f̃*_{i,j}(λ_{ij}),   (5)

with f̃*_{i,j} : x ↦ f*_{i,j}(−µ_{ij}x) − (µ_{ij}²/(2L_{i,j}))‖x‖² and q_A : x ↦ Trace((1/2) x^T A^T Σ^{-1} A x), where Σ is the diagonal matrix such that e^{(i)T} Σ e^{(i)} = σ_i if i is a center node and e^{(i,j)T} Σ e^{(i,j)} = L_{i,j} if it is the virtual node (i, j). Since dual variables are associated with edges, using coordinate descent algorithms on dual formulations from a well-chosen augmented graph of constraints allows us to handle both computations and communications in the same framework. Indeed, choosing a variable corresponding to an actual edge of the network results in a communication along this edge, whereas choosing a virtual edge results in a local computation step. Then, we balance the ratio between communications and computations by adjusting the probability of picking a given kind of edges.

3 The Algorithm: ADFS Iterations and Expected Error

In this section, we detail our new ADFS algorithm. 
In order to obtain it, we introduce a generalized version of the APCG algorithm [Lin et al., 2015], which we detail in Appendix A. More specifically, this generalized version allows for arbitrary sampling of coordinates, which is required to use different probabilities for communications and computations. It also includes corrections for functions that are strongly convex on a subspace only, which is the case of our augmented dual problem since the Laplacian of a graph is not full rank. Then we apply it to Problem (5) to get Algorithm 1. Due to lack of space, we only present the smooth version of ADFS here, but a non-smooth version is presented in Appendix B, along with the derivations required to obtain Algorithm 1 and Theorem 1. We denote A† the pseudo-inverse of A and W_{kℓ} ∈ R^{n(1+m)×n(1+m)} the matrix such that W_{kℓ} = (e^{(k)} − e^{(ℓ)})(e^{(k)} − e^{(ℓ)})^T for any edge (k, ℓ). Note that variables x_t, y_t and v_t from Algorithm 1 are variables associated with the nodes of the augmented graph and are therefore matrices in R^{n(1+m)×d} (one row for each node). They are obtained by multiplying the dual variables of the proximal coordinate gradient algorithm applied to the dual problem of Equation (5) by A on the left. 
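Before stating the algorithm, a simplified NumPy sketch of one smooth (communication-edge) iteration may help fix the shape of these updates; all names and weights below are our own toy choices, and the proximal correction for virtual edges is omitted. It also illustrates the sparsity discussed below: the term W_{kℓ} Σ^{-1} y_t only touches rows k and ℓ.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 6, 2                      # N = n(1 + m) nodes of the augmented graph
rho, eta, p, R = 0.1, 0.05, 0.25, 1.0   # toy step-size and sampling parameters
Sigma_inv = np.diag(1.0 / rng.uniform(0.5, 2.0, size=N))
x = rng.normal(size=(N, d))
v = rng.normal(size=(N, d))

k, l = 0, 3                      # sampled edge (k, l)
e = np.zeros((N, 1)); e[k], e[l] = 1.0, -1.0
W_kl = e @ e.T                   # W_kl = (e_k - e_l)(e_k - e_l)^T, rank one

y = (x + rho * v) / (1 + rho)                                   # line 4
g = eta * W_kl @ Sigma_inv @ y                                  # gradient-like term of line 6
v_new = (1 - rho) * v + rho * y - g                             # line 6, no proximal correction
x_new = y + (rho * R / p) * (v_new - (1 - rho) * v - rho * y)   # line 11
```

Only rows k and ℓ of g are non-zero (and they are opposite), so every other node merely takes convex combinations of its own variables.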
We denote σ_A = λ^+_min(A^T Σ^{-1} A) the smallest non-zero eigenvalue of the matrix A^T Σ^{-1} A.

Algorithm 1 ADFS(A, (σ_i), (L_{i,j}), (µ_{kℓ}), (p_{kℓ}), ρ)
1: σ_A = λ^+_min(A^T Σ^{-1} A), η̃_{kℓ} = ρµ_{kℓ}²/(σ_A p_{kℓ}), R_{kℓ} = e_{kℓ}^T A†A e_{kℓ}   // Initialization
2: x_0 = y_0 = v_0 = z_0 = 0 ∈ R^{n(1+m)×d}
3: for t = 0 to K − 1 do   // Run for K iterations
4:   y_t = (1/(1 + ρ))(x_t + ρv_t)
5:   Sample edge (k, ℓ) with probability p_{kℓ}   // Edge sampled from the augmented graph
6:   z_{t+1} = v_{t+1} = (1 − ρ)v_t + ρy_t − η̃_{kℓ} W_{kℓ} Σ^{-1} y_t   // Nodes k and ℓ communicate y_t
7:   if (k, ℓ) is the virtual edge between node i and virtual node (i, j) then
8:     v^{(i,j)}_{t+1} = prox_{η̃_{ij} f̃*_{i,j}}(z^{(i,j)}_{t+1})   // Virtual node update using f_{i,j}
9:     v^{(i)}_{t+1} = z^{(i)}_{t+1} + z^{(i,j)}_{t+1} − v^{(i,j)}_{t+1}   // Center node update
10:  end if
11:  x_{t+1} = y_t + (ρR_{kℓ}/p_{kℓ})(v_{t+1} − (1 − ρ)v_t − ρy_t)
12: end for
13: return θ_K = Σ^{-1} v_K   // Return primal parameter

Theorem 1. We denote θ* the minimizer of the primal function F : x ↦ ∑_{i=1}^n f_i(x), and θ*_A the minimizer of the dual function F*_A = q_A + ψ (where ψ is the non-smooth sum of the f̃*_{i,j} terms in Equation (5)). Then θ_t as output by Algorithm 1 verifies:

E[‖θ_t − θ*‖²] ≤ C_0 (1 − ρ)^t,  if ρ² ≤ min_{(k,ℓ)} λ^+_min(A^T Σ^{-1} A) p_{kℓ}² / (µ_{kℓ}² R_{kℓ} (Σ^{-1}_{kk} + Σ^{-1}_{ℓℓ})),   (6)

with C_0 = λ_max(A^T Σ^{-2} A)(‖A†A θ*_A‖² + 2σ_A^{-1}(F*_A(0) − F*_A(θ*_A))).

We discuss several aspects related to the implementation of Algorithm 1 below, and provide its Python implementation in supplementary material.

Convergence rate. The parameter ρ controls the convergence rate of ADFS. It is defined by the minimum of the individual rates for each edge, which explicitly depend on parameters related to the functions themselves (1/(Σ^{-1}_{kk} + Σ^{-1}_{ℓℓ})), to the graph topology (R_{kℓ} = e_{kℓ}^T A†A e_{kℓ}), to a mix of both (λ^+_min(A^T Σ^{-1} A)/µ_{kℓ}²) and to the sampling probabilities of the edges (p_{kℓ}²). Note that these quantities are very different depending on whether edges are virtual or not. For example, the parameters µ_{kℓ} for communication edges are related to the communication matrix by the fact that the Laplacian of the communication network writes L = ∑_{communication (k,ℓ)} µ_{kℓ}² W_{kℓ}. In Section 5, we carefully choose the parameters µ_{kℓ} and p_{kℓ} based on the graph and the local functions to get the best convergence speed. Note that once µ_{kℓ} and p_{kℓ} are fixed, the choice of the other parameters (such as R_{kℓ}, ρ, η and σ_A) is fixed as well (no extra tuning is required).

Obtaining Line 6. The form of the communication update (virtual or not) comes from the fact that the update in direction (k, ℓ) writes A∇_{kℓ}q_A(y_t) = Ae_{kℓ}e_{kℓ}^T A^T Σ^{-1} y_t = µ_{kℓ}² W_{kℓ} Σ^{-1} y_t.

Sparse updates. Although the updates of Algorithm 1 involve all nodes of the network, it is actually possible to implement them efficiently so that only two nodes are actually involved in each update, as described below. Indeed, W_{kℓ} is a very sparse matrix, so (W_{kℓ} Σ^{-1} y_t)^{(k)} = Σ^{-1}_{kk} y_t^{(k)} − Σ^{-1}_{ℓℓ} y_t^{(ℓ)} = −(W_{kℓ} Σ^{-1} y_t)^{(ℓ)}, and (W_{kℓ} Σ^{-1} y_t)^{(h)} = 0 for h ≠ k, ℓ. Therefore, only the following situations can happen:

1. Communication updates: If (k, ℓ) is a communication edge, the update only requires nodes k and ℓ to exchange parameters and perform a weighted difference between them. Note that the Laplacian of the communication graph is ∑_{kℓ} µ_{kℓ}² W_{kℓ}.

2. Local updates: If (k, ℓ) is the virtual edge between node i and its j-th virtual node, the parameter exchange of line 6 is local, and the proximal term involves function f_{i,j} only.

3. Convex combinations: If we choose h ≠ k, ℓ, then v^{(h)}_{t+1} and y^{(h)}_{t+1} are obtained by convex combinations of y^{(h)}_t and v^{(h)}_t, so the update is cheap and local. Besides, nodes actually need the value of their parameters only when they perform updates of type 1 or 2. Therefore, they can simply store how many updates of this type they should have done and perform them all at once before each communication or local update.

Primal proximal step. Algorithm 1 uses proximal steps performed on f̃*_{i,j} : x ↦ f*_{i,j}(−µ_{ij}x) − (µ_{ij}²/(2L_{i,j}))‖x‖² instead of f_{i,j}. Yet, it is possible to use the Moreau identity to express prox_{η f̃*_{i,j}} using only the proximal operator of f_{i,j}, which can easily be evaluated for many objective functions. The exact derivations are presented in Appendix B.3.

Linear case. 
For many standard machine learning problems, f_{i,j}(θ) = ℓ(X_{i,j}^T θ) with X_{i,j} ∈ R^d. This implies that f*_{i,j}(θ) = +∞ whenever θ ∉ Vec(X_{i,j}). Therefore, the proximal steps on the Fenchel conjugate only have support on X_{i,j}, meaning that they are one-dimensional problems that can be solved in constant time using for example the Newton method when no analytical solution is available. Warm starts (initializing on the previous solution) can also be used for solving the local problems even faster, so that in the end a one-dimensional proximal update is only a constant time slower than a gradient update. Note that this also allows to store the parameters v_t and y_t as scalar coefficients for virtual nodes, thus greatly reducing the memory footprint of ADFS.

Unbalanced local datasets. We assume that all local datasets are of fixed size m in order to ease reading. Yet, the impact of the value of m on Algorithm 1 is indirect, and unbalanced datasets can be handled without any change. Still, this may affect waiting time, since nodes with large local datasets will generally be busier than nodes with smaller ones.

4 Distributed Execution and Synchronization Time

Theorem 1 gives bounds on the expected error after a given number of iterations. To assess the actual speed of the algorithm, it is still required to know how long executing a given number of iterations takes. This is easy with synchronous algorithms such as MSDA or DSBA, in which all nodes iteratively perform local updates or communication rounds. In this case, executing n_comp computing rounds and n_comm communication rounds simply takes time n_comp + τ n_comm. 
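Returning to the proximal steps of Section 3: the Moreau identity mentioned there can be checked numerically. A minimal sketch (our own illustration with a scalar quadratic loss, for which both the primal prox and the conjugate prox have closed forms; the helper names are ours):

```python
def prox_quadratic(u, coef, eta):
    # prox_{eta * f}(u) for f(z) = coef/2 * z^2, i.e. argmin_v (v - u)^2/2 + eta*coef*v^2/2.
    return u / (1.0 + eta * coef)

def prox_conjugate_via_moreau(u, coef, eta):
    # Moreau identity: prox_{eta f*}(u) = u - eta * prox_{f/eta}(u / eta).
    return u - eta * prox_quadratic(u / eta, coef, 1.0 / eta)

u, coef, eta = 3.0, 2.0, 0.5
lhs = prox_conjugate_via_moreau(u, coef, eta)
rhs = u / (1.0 + eta / coef)   # direct closed form, since f*(z) = z^2 / (2 * coef)
```

Both expressions evaluate to u·coef/(coef + η), so the conjugate prox needed by Algorithm 1 can indeed be computed from the primal prox alone.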
ADFS relies on randomized pairwise communications, so it is necessary to sample a schedule, i.e., a random sequence of edges from the augmented graph, and evaluate how fast this schedule can be executed. Note that the execution time crucially depends on how many edges can be updated in parallel, which itself depends on the graph and on the random schedule sampled.

Figure 2: Illustration of parallel execution and local synchrony. Nodes from a toy graph execute the schedule [(A, C), (B, D), (A, B), (D), (C, D)], where (D) means that node D performs a local update. Each node needs to execute its updates in the partial order defined by the schedule. In particular, node C has to perform update (A, C) and then update (C, D), so it is idle between times τ and τ + 1 because it needs to wait for node D to finish its local update before the communication update (C, D) can start. We assume τ > 1, since the local update terminates before the communication update (A, B). Contrary to synchronous algorithms, no global notion of rounds exists, and some nodes (such as node D) perform more updates than others.

Shared schedule. Even though they only actively take part in a small fraction of the updates, all nodes need to execute the same schedule to correctly implement Algorithm 1. To generate this shared schedule, all nodes are given a seed and the sampling probabilities of all edges. This allows them to avoid deadlocks and to precisely know how many convex combinations to perform between v_t and y_t.

Execution time. The problem of bounding the probability that a random schedule of fixed length exceeds a given execution time can be cast in the framework of fork-join queuing networks with blocking [Zeng et al., 2018]. 
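The blocking execution model of Figure 2 can be simulated in a few lines. A small sketch (our own, replaying the toy schedule from the caption) that tracks each node's clock and starts a communication only once both endpoints are free:

```python
def schedule_makespan(nodes, schedule, tau, local_time=1.0):
    # Blocking model: a communication (k, l) starts once both k and l are free and
    # takes time tau; a local update (k,) starts once k is free and takes local_time.
    free_at = {v: 0.0 for v in nodes}
    for update in schedule:
        if len(update) == 2:            # pairwise communication
            k, l = update
            start = max(free_at[k], free_at[l])
            free_at[k] = free_at[l] = start + tau
        else:                           # local computation
            (k,) = update
            free_at[k] += local_time
    return max(free_at.values())

# Toy schedule from Figure 2: node C waits for D's local update before (C, D).
T = schedule_makespan('ABCD', [('A', 'C'), ('B', 'D'), ('A', 'B'), ('D',), ('C', 'D')], tau=2.0)
```

With τ = 2 the makespan is 2τ + 1 = 5: exactly the Figure 2 scenario, where (C, D) cannot start before time τ + 1.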
In particular, queuing theory [Baccelli et al., 1992] tells us that the average time per iteration exists for any fixed probability distribution over a given augmented graph. Unfortunately, existing quantitative results are not precise enough for our purpose, so we generalize the method introduced by Hendrikx et al. [2019] to get a finer bound. While their result is valid when the only possible operation is communicating with a neighbor, we extend it to the case in which nodes can also perform local computations. For the rest of this paper, we denote p_comm the probability of performing a communication update and p_comp the probability of performing a local update. They are such that p_comp + p_comm = 1. We also define p_comm^max = n max_k ∑_{ℓ∈N(k)} p_{kℓ}/2, where neighbors are taken in the communication network only. When all nodes have the same probability to participate in an update, p_comm^max = p_comm. Then, the following theorem holds (see proof in Appendix C):

Theorem 2. Let T(t) be the time needed for the system to execute a schedule of size t, i.e., t iterations of Algorithm 1. If all nodes perform local computations with probability p_comp/n with p_comp > p_comm^max, or if τ > 1, then there exists C < 24 such that:

P( (1/t) T(t) ≤ (C/n)(p_comp + 2τ p_comm^max) ) → 1 as t → ∞.   (7)

Note that the constant C is a worst-case estimate and that it is much smaller for homogeneous communication probabilities. This novel result states that the number of iterations that Algorithm 1 can perform per unit of time increases linearly with the size of the network. This is possible because each iteration only involves two nodes, so many iterations can be done in parallel. 
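The linear-in-n throughput predicted by Theorem 2 can be observed empirically with a Monte-Carlo version of the blocking model above. A self-contained sketch (our own toy setting: a ring network, uniform edge sampling, fixed seed):

```python
import random

def random_schedule_time(n, p_comp, tau, t, seed=0):
    # Average per-iteration time of a random schedule on a ring of n nodes: with
    # probability p_comp a uniform node does a local update (time 1), otherwise a
    # uniform ring edge (k, k+1) communicates (time tau), blocking both endpoints.
    rng = random.Random(seed)
    free_at = [0.0] * n
    for _ in range(t):
        if rng.random() < p_comp:
            k = rng.randrange(n)
            free_at[k] += 1.0
        else:
            k = rng.randrange(n)
            l = (k + 1) % n
            start = max(free_at[k], free_at[l])
            free_at[k] = free_at[l] = start + tau
    return max(free_at) / t

# Doubling n should roughly halve the time per iteration (parallelism across nodes).
t1 = random_schedule_time(n=10, p_comp=0.7, tau=2.0, t=5000)
t2 = random_schedule_time(n=20, p_comp=0.7, tau=2.0, t=5000)
```

With p_comp = 0.7 > p_comm the per-iteration time on the larger ring is close to half that of the smaller one, in line with the (C/n)(p_comp + 2τ p_comm^max) bound.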
The assumption p_comp > p_comm is responsible for the 1 + τ factor instead of τ in Table 1, which prevents ADFS from benefiting from network acceleration when communications are cheap (τ < 1). Note that this is an actual restriction of following a schedule, as detailed in Appendix C. Yet, network operations generally suffer from communication protocol overhead, whereas computing a single proximal update often either has a closed-form solution or reduces to a simple one-dimensional problem in the linear case. Therefore, assuming τ > 1 is not very restrictive in the finite-sum setting.

5 Performances and Parameters Choice in the Homogeneous Setting

We now prove the time to convergence of ADFS presented in Table 1, and detail the conditions under which it holds. Indeed, Section 3 presents ADFS in full generality, but the different parameters have to be chosen carefully to reach optimal speed. In particular, we have to choose the coefficients μ to make sure that the graph augmentation trick does not cause the smallest positive eigenvalue of A^⊤ Σ^{-1} A to shrink too much. Similarly, ρ is defined in Equation (6) by a minimum over all edges of a given quantity. This quantity heavily depends on whether the edge is an actual communication edge or a virtual edge. One can trade p_comp for p_comm so that the minimum is the same for both kinds of edges, but Theorem 2 tells us that this is only possible as long as p_comp > p_comm.

Parameters choice. We define L = A_comm A_comm^⊤ ∈ R^{n×n}, the Laplacian of the communication graph, with A_comm ∈ R^{n×E} such that A_comm e_{kℓ} = μ_{kℓ}(e^(k) − e^(ℓ)) for every edge (k, ℓ) ∈ E_comm, the set of communication edges.
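As noted above, the local steps are cheap in the linear case: each virtual-edge update reduces to a one-dimensional proximal problem. As an illustration (generic textbook formulas, not the authors' code; `prox_quadratic` and `prox_logistic` are our own hypothetical names), the proximal operator has a closed form for the squared loss, and a few scalar Newton steps suffice for the logistic loss.

```python
import math

def prox_quadratic(v, b, eta):
    """Closed form for prox_{eta f}(v) with f(x) = 0.5 * (x - b)**2,
    i.e. argmin_x 0.5 * (x - v)**2 + eta * 0.5 * (x - b)**2."""
    return (v + eta * b) / (1.0 + eta)

def prox_logistic(v, y, eta, iters=20):
    """prox_{eta f}(v) for f(x) = log(1 + exp(-y * x)), computed by
    Newton's method on the scalar optimality condition; a handful of
    iterations suffices for this smooth one-dimensional problem."""
    x = v
    for _ in range(iters):
        s = 1.0 / (1.0 + math.exp(y * x))   # = sigmoid(-y * x)
        grad = (x - v) - eta * y * s        # derivative of the objective
        hess = 1.0 + eta * y * y * s * (1.0 - s)
        x -= grad / hess
    return x

print(prox_quadratic(0.0, 2.0, 1.0))  # → 1.0
```

Both routines run in constant time per call, which is why a single proximal update is typically much cheaper than a network operation with protocol overhead.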
Then, we define γ̃ = min_{(k,ℓ)∈E_comm} λ⁺_min(L) n² / (μ²_{kℓ} R_{kℓ} E²). As shown in Appendix D.2, γ̃ ≈ γ for regular graphs such as the complete graph or the grid, justifying the use of γ instead of γ̃ in Table 1. We assume for simplicity that σ_i = σ and that κ_i = 1 + σ_i⁻¹ Σ_{j=1}^m L_{i,j} = κ_s for all nodes. For virtual edges, we choose μ²_{ij} = λ⁺_min(L) L_{i,j} / (σ κ_i) and p_{ij} = p_comp (1 + L_{i,j} σ⁻¹)^{1/2} / (n S_comp), with S_comp = n⁻¹ Σ_{i=1}^n Σ_{j=1}^m (1 + L_{i,j} σ⁻¹)^{1/2}. This corresponds to using a standard importance sampling scheme for selecting samples. For communication edges (k, ℓ) ∈ E_comm, we choose uniform p_{kℓ} = p_comm/E and μ²_{kℓ} = 1/2.

Parameters tuning. The previous paragraph specifies relevant choices of the parameters μ_{kℓ} and p_{kℓ}; therefore, ADFS can be run without manual tuning. Extra tuning (such as of the communication probabilities) could be performed to adapt to specific heterogeneous situations. Yet, this should be considered as an extra degree of freedom that other algorithms may not have access to, rather than as an extra parameter to tune. For example, the choice of uniform communication probabilities is automatically enforced by synchronous gossip-based algorithms such as MSDA or DSBA (all edges are activated at each step). Note that choosing different values of μ_{kℓ} for communication edges amounts to tuning the gossip matrix, which is generally considered as an input of the problem. Our specific choice of μ_{ij} for virtual edges allows us to precisely bound the strong convexity σ_A of the augmented problem, as shown in Appendix D.1.

Influence of the network topology. The topology of the network only impacts the convergence rate through the constant γ̃, which is almost equal to the eigengap of the Laplacian of the graph for regular networks. This dependence is standard, as can be seen in Table 1. The topology can also influence the synchronization time, since the presence of hubs generally increases waiting times.

Theorem 3. If we choose p_comm = min(1/2, (1 + S_comp √(γ̃/κ_s))⁻¹), then running Algorithm 1 for K_ε = ρ⁻¹ log(ε⁻¹) iterations guarantees E[‖θ_{K_ε} − θ*‖²] ≤ C₀ ε, and takes time T(K_ε), with:

T(K_ε) ≤ √2 C (m + √(m κ_s) + √2 (1 + 4τ) √(κ_s/γ̃)) log(1/ε)

with probability tending to 1 as ρ⁻¹ log(ε⁻¹) → ∞, with C₀ and C as in Theorems 1 and 2.

Theorem 3 assumes that all communication probabilities and condition numbers are exactly equal in order to ease reading; a more detailed version with rates for more heterogeneous settings can be found in Appendix D. Note that while algorithms such as MSDA require polynomials of the initial gossip matrix to model several consecutive communication steps, we can tune the amounts of communication and computation more directly, simply by adjusting p_comp and p_comm.

6 Experiments

In this section, we illustrate the theoretical results by showing how ADFS compares with MSDA [Scaman et al., 2017], ESDACD [Hendrikx et al., 2019], Point-SAGA [Defazio, 2016], and DSBA [Shen et al., 2018].
All algorithms (except for DSBA, for which we fine-tuned the step-size) were run with out-of-the-box hyperparameters given by theory on data extracted from the standard Higgs, Covtype and RCV1 datasets from LibSVM. The underlying graph is assumed to be a 2D grid network. Experiments were run in a distributed manner on an actual computing cluster. Yet, plots are shown for idealized times in order to abstract implementation details, as well as to ensure that reported timings were not impacted by the cluster status. All the details of the experimental setup, as well as a comparison with centralized algorithms, can be found in Appendix E. An implementation of ADFS is also available in the supplementary material.

Figure 3: Performances of various decentralized algorithms on the logistic regression task with m = 10⁴ points per node, regularization parameter σ = 1 and communication delays τ = 5 on 2D grid networks of different sizes. Panels: (a) Higgs, n = 4, m = 10⁴, σ = 1; (b) Higgs, n = 100, m = 10⁴, σ = 1; (c) Covtype, n = 100, m = 10⁴, σ = 1; (d) RCV1, n = 100, m = 10³, σ = 10⁻⁴.

Figure 3a shows that, as predicted by theory, ADFS and Point-SAGA have similar rates on small networks, whereas all other algorithms are significantly slower. Figures 3b, 3c and 3d use a much larger grid to evaluate how these algorithms scale. In this setting, Point-SAGA is the slowest algorithm since it has 100 times less computing power available. MSDA performs quite well on the Covtype dataset thanks to its very good network scaling (dependent on κ_b rather than κ_s). Yet, the m√κ_b factor dominates on the Higgs dataset, making it significantly slower. DSBA has to communicate after each proximal step, thus having to wait for a time τ = 5 at each step.
ESDACD does not perform well either, because m > τ and it has to perform as many batch computing steps as communication steps. ADFS does not suffer from any of these drawbacks and therefore outperforms other approaches by a large margin on these experiments. This illustrates the fact that ADFS combines the strengths of accelerated stochastic algorithms, such as Point-SAGA, and fast decentralized algorithms, such as MSDA. We see that DSBA initially outperforms ADFS on the RCV1 dataset. This may be due to statistical reasons, since there is more overlap between the local datasets of the different nodes in this experiment than in the others. Yet, we see that ADFS has a better rate in the steady state and quickly catches up. Besides, we still used the value τ = 5, but a much higher value of τ would be more realistic in this high-dimensional setting, since local computations are sparse whereas communications are full-dimensional. We only compare DSBA and ADFS in this setting since the high dimensionality of the dataset made the computation of dual gradients expensive, and Point-SAGA is much slower when using 100 nodes since it is a single-machine algorithm, as shown on the Higgs and Covtype datasets.

7 Conclusion

In this paper, we provided a novel accelerated stochastic algorithm for decentralized optimization. To the best of our knowledge, it is the first decentralized algorithm that successfully leverages the finite-sum structure of the objective functions to match the rates of the best known sequential algorithms while having the network scaling of optimal batch algorithms. The analysis in this paper could be extended to better handle heterogeneous settings, both in terms of hardware (computing times, delays) and local functions (different regularities).
Finally, finding a locally synchronous algorithm that can take advantage of arbitrarily low communication delays (beyond the τ > 1 limit) to scale to large graphs is still an open problem.

Acknowledgement

We acknowledge support from the European Research Council (grant SEQUOIA 724063).

References

Zeyuan Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. In Proceedings of the Symposium on Theory of Computing, pages 1200–1205, 2017.

François Baccelli, Guy Cohen, Geert Jan Olsder, and Jean-Pierre Quadrat. Synchronization and Linearity: an Algebra for Discrete Event Systems. John Wiley & Sons Ltd, 1992.

Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT, pages 177–186. Springer, 2010.

Stephen Boyd, Arpita Ghosh, Balaji Prabhakar, and Devavrat Shah. Randomized gossip algorithms. IEEE Transactions on Information Theory, 52(6):2508–2530, 2006.

Sébastien Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231–357, 2015.

Jianmin Chen, Xinghao Pan, Rajat Monga, Samy Bengio, and Rafal Jozefowicz. Revisiting distributed synchronous SGD. arXiv preprint arXiv:1604.00981, 2016.

Igor Colin, Aurélien Bellet, Joseph Salmon, and Stéphan Clémençon. Gossip dual averaging for decentralized optimization of pairwise functions. In Proceedings of the International Conference on Machine Learning, pages 1388–1396, 2016.

Aaron Defazio. A simple practical accelerated method for finite sums. In Advances in Neural Information Processing Systems, pages 676–684, 2016.

Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives.
In Advances in Neural Information Processing Systems, pages 1646–1654, 2014.

John C. Duchi, Alekh Agarwal, and Martin J. Wainwright. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592–606, 2012.

Olivier Fercoq and Peter Richtárik. Accelerated, parallel, and proximal coordinate descent. SIAM Journal on Optimization, 25(4):1997–2023, 2015.

Lie He, An Bian, and Martin Jaggi. COLA: Decentralized linear learning. In Advances in Neural Information Processing Systems, pages 4536–4546, 2018.

Hadrien Hendrikx, Francis Bach, and Laurent Massoulié. Accelerated decentralized optimization with local updates for smooth and strongly convex objectives. In Artificial Intelligence and Statistics, 2019.

Björn Johansson, Maben Rabi, and Mikael Johansson. A randomized incremental subgradient method for distributed optimization in networked systems. SIAM Journal on Optimization, 20(3):1157–1170, 2009.

Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.

Anastasia Koloskova, Sebastian U. Stich, and Martin Jaggi. Decentralized stochastic optimization and gossip algorithms with compressed communication. In International Conference on Machine Learning, 2019.

Rémi Leblond, Fabian Pedregosa, and Simon Lacoste-Julien. ASAGA: Asynchronous parallel SAGA. In Artificial Intelligence and Statistics, pages 46–54, 2017.

Yin Tat Lee and Aaron Sidford. Efficient accelerated coordinate descent methods and faster algorithms for solving linear systems. In Annual Symposium on Foundations of Computer Science (FOCS), pages 147–156, 2013.

Xiangru Lian, Ce Zhang, Huan Zhang, Cho-Jui Hsieh, Wei Zhang, and Ji Liu. Can decentralized algorithms outperform centralized algorithms?
A case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 5330–5340, 2017.

Qihang Lin, Zhaosong Lu, and Lin Xiao. An accelerated proximal coordinate gradient method. In Advances in Neural Information Processing Systems, pages 3059–3067, 2014.

Qihang Lin, Zhaosong Lu, and Lin Xiao. An accelerated randomized proximal coordinate gradient method and its application to regularized empirical risk minimization. SIAM Journal on Optimization, 25(4):2244–2273, 2015.

Tao Lin, Sebastian U. Stich, and Martin Jaggi. Don't use large mini-batches, use local SGD. arXiv preprint arXiv:1808.07217, 2018.

Aryan Mokhtari and Alejandro Ribeiro. DSA: Decentralized double stochastic averaging gradient algorithm. Journal of Machine Learning Research, 17(1):2165–2199, 2016.

Angelia Nedic and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.

Angelia Nedic, Alex Olshevsky, and Wei Shi. Achieving geometric convergence for distributed optimization over time-varying graphs. SIAM Journal on Optimization, 27(4):2597–2633, 2017.

Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer Science & Business Media, 2013.

Yurii Nesterov and Sebastian U. Stich. Efficiency of the accelerated coordinate descent method on structured optimization problems. SIAM Journal on Optimization, 27(1):110–123, 2017.

Neal Parikh and Stephen Boyd. Proximal algorithms. Foundations and Trends in Optimization, 1(3):127–239, 2014.

Kumar Kshitij Patel and Aymeric Dieuleveut. Communication trade-offs for synchronized distributed SGD with large step size. arXiv preprint arXiv:1904.11325, 2019.

Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu.
Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 693–701, 2011.

Kevin Scaman, Francis Bach, Sébastien Bubeck, Yin Tat Lee, and Laurent Massoulié. Optimal algorithms for smooth and strongly convex distributed optimization in networks. In International Conference on Machine Learning, pages 3027–3036, 2017.

Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1-2):83–112, 2017.

Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14(Feb):567–599, 2013.

Shai Shalev-Shwartz and Tong Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. In International Conference on Machine Learning, pages 64–72, 2014.

Zebang Shen, Aryan Mokhtari, Tengfei Zhou, Peilin Zhao, and Hui Qian. Towards more efficient stochastic decentralized learning: Faster convergence and sparse communication. In International Conference on Machine Learning, pages 4631–4640, 2018.

Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015.

Hanlin Tang, Xiangru Lian, Ming Yan, Ce Zhang, and Ji Liu. D2: Decentralized training over decentralized data. In International Conference on Machine Learning, pages 4855–4863, 2018.

Lin Xiao, Adams Wei Yu, Qihang Lin, and Weizhu Chen. DSCOVR: Randomized primal-dual block coordinate algorithms for asynchronous distributed optimization. Journal of Machine Learning Research, 20(43):1–58, 2019.

Yun Zeng, Augustin Chaintreau, Don Towsley, and Cathy H. Xia. Throughput scalability analysis of fork-join queueing networks.
Operations Research, 66(6):1728–1743, 2018.