{"title": "COLA: Decentralized Linear Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 4536, "page_last": 4546, "abstract": "Decentralized machine learning is a promising emerging paradigm in view of global challenges of data ownership and privacy. We consider learning of linear classification and regression models, in the setting where the training data is decentralized over many user devices, and the learning algorithm must run on-device, on an arbitrary communication network, without a central coordinator.\nWe propose COLA, a new decentralized training algorithm with strong theoretical guarantees and superior practical performance. Our framework overcomes many limitations of existing methods, and achieves communication efficiency, scalability, elasticity as well as resilience to changes in data and allows for unreliable and heterogeneous participating devices.", "full_text": "COLA: Decentralized Linear Learning\n\nLie He\u21e4\nEPFL\n\nlie.he@epfl.ch\n\nAn Bian\u21e4\nETH Zurich\n\nybian@inf.ethz.ch\n\nAbstract\n\nMartin Jaggi\n\nEPFL\n\nmartin.jaggi@epfl.ch\n\nDecentralized machine learning is a promising emerging paradigm in view of\nglobal challenges of data ownership and privacy. We consider learning of linear\nclassi\ufb01cation and regression models, in the setting where the training data is\ndecentralized over many user devices, and the learning algorithm must run on-\ndevice, on an arbitrary communication network, without a central coordinator. We\npropose COLA, a new decentralized training algorithm with strong theoretical\nguarantees and superior practical performance. 
Our framework overcomes many limitations of existing methods, and achieves communication efficiency, scalability, elasticity as well as resilience to changes in data, and allows for unreliable and heterogeneous participating devices.

1 Introduction

With the immense growth of data, decentralized machine learning has become not only attractive but a necessity. Personal data from, for example, smart phones, wearables and many other mobile devices is sensitive and exposed to a great risk of data breaches and abuse when collected by a centralized authority or enterprise. Nevertheless, many users have gotten accustomed to giving up control over their data in return for useful machine learning predictions (e.g. recommendations), which benefit from joint training on the data of all users combined in a centralized fashion.
In contrast, decentralized learning aims at learning this same global machine learning model without any central server. Instead, we rely only on distributed computations of the devices themselves, with each user's data never leaving its device of origin. While increasing research progress has been made towards this goal, major challenges in terms of privacy as well as algorithmic efficiency, robustness and scalability remain to be addressed. Motivated by the aforementioned challenges, we make progress in this work on the important problem of training generalized linear models in a fully decentralized environment.
Existing research on decentralized optimization, min_{x∈R^n} F(x), can be categorized into two main directions. The seminal line of work started by Bertsekas and Tsitsiklis in the 1980s, cf. [Tsitsiklis et al., 1986], tackles this problem by splitting the parameter vector x by coordinates/components among the devices. A second more recent line of work including e.g. 
[Nedic and Ozdaglar, 2009, Duchi et al., 2012, Shi et al., 2015, Mokhtari and Ribeiro, 2016, Nedic et al., 2017] addresses sum-structured objectives F(x) = Σ_k F_k(x), where F_k is the local cost function of node k. This structure is closely related to empirical risk minimization in a learning setting. See e.g. [Cevher et al., 2014] for an overview of both directions. While the first line of work typically only provides convergence guarantees for smooth objectives F, the second approach often suffers from a "lack of consensus", that is, the minimizers of {F_k}_k are typically different since the data is in general not distributed i.i.d. between devices.

*These two authors contributed equally

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Contributions. In this paper, our main contribution is to propose COLA, a new decentralized framework for training generalized linear models with convergence guarantees. Our scheme resolves both described issues in existing approaches, using techniques from primal-dual optimization, and can be seen as a generalization of COCOA [Smith et al., 2018] to the decentralized setting. More specifically, the proposed algorithm offers

- Convergence Guarantees: Linear and sublinear convergence rates are guaranteed for strongly convex and general convex objectives, respectively. Our results are free of the restrictive assumptions made by stochastic methods [Zhang et al., 2015, Wang et al., 2017], which require an i.i.d. data distribution over all devices.

- Communication Efficiency and Usability: Employing a data-local subproblem between each communication round, COLA not only achieves communication efficiency but also allows the re-use of existing efficient single-machine solvers for on-device learning. 
We provide practical decentralized primal-dual certificates to diagnose the learning progress.

- Elasticity and Fault Tolerance: Unlike sum-structured approaches such as SGD, COLA is provably resilient to changes in the data, in the network topology, and to participating devices disappearing, straggling or re-appearing in the network.

Our implementation is publicly available under github.com/epfml/cola .

1.1 Problem statement

Setup. Many machine learning and signal processing models are formulated as a composite convex optimization problem of the form

    min_u l(u) + r(u),

where l is a convex loss function of a linear predictor over data and r is a convex regularizer. Some cornerstone applications include e.g. logistic regression, SVMs, Lasso, generalized linear models, each combined with or without L1, L2 or elastic-net regularization. Following the setup of [Dünner et al., 2016, Smith et al., 2018], these training problems can be mapped to either of the two following formulations, which are dual to each other:

    min_{x∈R^n} [ F_A(x) := f(Ax) + Σ_i g_i(x_i) ]    (A)

    min_{w∈R^d} [ F_B(w) := f*(w) + Σ_i g_i*(−A_i^⊤ w) ],    (B)

where f* and g_i* are the convex conjugates of f and g_i, respectively. Here x ∈ R^n is a parameter vector and A := [A_1; ...; A_n] ∈ R^{d×n} is a data matrix with column vectors A_i ∈ R^d, i ∈ [n]. We assume that f is smooth (Lipschitz gradient) and that g(x) := Σ_{i=1}^n g_i(x_i) is separable.

Data partitioning. As in [Jaggi et al., 2014, Dünner et al., 2016, Smith et al., 2018], we assume the dataset A is distributed over K machines according to a partition {P_k}_{k=1}^K of the columns of A. Note that this convention maintains the flexibility of partitioning the training dataset either by samples (through mapping applications to (B), e.g. for SVMs) or by features (through mapping applications to (A), e.g. for Lasso or L1-regularized logistic regression). 
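As a concrete illustration of this mapping and the column partitioning (our own sketch with toy data; the names and sizes are hypothetical, not from the paper), the Lasso objective (1/2)‖Ax − b‖² + λ‖x‖₁ is an instance of formulation (A) with f(v) := (1/2)‖v − b‖² and g_i(x_i) := λ|x_i|, and the columns of A can be split by features into blocks {P_k}:

```python
import numpy as np

# Hypothetical toy problem: d samples, n features, K nodes.
rng = np.random.default_rng(0)
d, n, K, lam = 20, 12, 4, 0.1
A = rng.standard_normal((d, n))   # data matrix, one column A_i per feature
b = rng.standard_normal(d)

# f(v) = 0.5*||v - b||^2 is smooth; g_i(x_i) = lam*|x_i| is separable.
f = lambda v: 0.5 * np.sum((v - b) ** 2)
g = lambda x: lam * np.sum(np.abs(x))
F_A = lambda x: f(A @ x) + g(x)   # formulation (A)

# Partition {P_k} of the columns (features) over the K nodes.
parts = np.array_split(np.arange(n), K)

def block(x, k):
    """x_[k]: the n-vector equal to x on P_k and zero elsewhere."""
    x_k = np.zeros_like(x)
    x_k[parts[k]] = x[parts[k]]
    return x_k

x = rng.standard_normal(n)
# The blocks are disjoint and sum back to x, so local updates cover x.
assert np.allclose(sum(block(x, k) for k in range(K)), x)

F0 = F_A(np.zeros(n))  # equals 0.5*||b||^2, since g(0) = 0
```

The same scaffolding applies when partitioning by samples through formulation (B) instead.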
For x ∈ R^n, we write x_[k] ∈ R^n for the n-vector with elements (x_[k])_i := x_i if i ∈ P_k and (x_[k])_i := 0 otherwise, and analogously A_[k] ∈ R^{d×n_k} for the corresponding set of local data columns on node k, which is of size n_k = |P_k|.

Network topology. We consider the task of joint training of a global machine learning model in a decentralized network of K nodes. Its connectivity is modelled by a mixing matrix W ∈ R_+^{K×K}. More precisely, W_ij ∈ [0, 1] denotes the connection strength between nodes i and j, with a non-zero weight indicating the existence of a pairwise communication link. We assume W to be symmetric and doubly stochastic, which means each row and column of W sums to one.
The spectral properties of W used in this paper are that the eigenvalues of W are real, and 1 = λ_1(W) ≥ ··· ≥ λ_K(W) ≥ −1. Let the second largest magnitude of the eigenvalues of W be β := max{|λ_2(W)|, |λ_K(W)|}. The quantity 1 − β is called the spectral gap, a quantity well-studied in graph theory and network analysis. The spectral gap measures the level of connectivity among nodes. In the extreme case when W is diagonal, and thus an identity matrix, the spectral gap is 0 and there is no communication among nodes. To ensure convergence of decentralized algorithms, we impose 
the standard assumption of a positive spectral gap of the network, which includes all connected graphs, such as e.g. a ring or 2-D grid topology; see also Appendix B for details.

1.2 Related work

Research in decentralized optimization dates back to the 1980s with the seminal work of Bertsekas and Tsitsiklis, cf. [Tsitsiklis et al., 1986]. Their framework focuses on the minimization of a (smooth) function by distributing the components of the parameter vector x among agents. In contrast, a second more recent line of work [Nedic and Ozdaglar, 2009, Duchi et al., 2012, Shi et al., 2015, Mokhtari and Ribeiro, 2016, Nedic et al., 2017, Scaman et al., 2017, 2018] considers minimization of a sum of individual local cost functions F(x) = Σ_i F_i(x), which are potentially non-smooth. Our work can be seen as bridging the two scenarios to the primal-dual setting (A) and (B).
While decentralized optimization is a relatively mature area in the operations research and automatic control communities, it has recently received a surge of attention for machine learning applications, see e.g. [Cevher et al., 2014]. Decentralized gradient descent (DGD) with diminishing stepsizes was proposed by [Nedic and Ozdaglar, 2009, Jakovetic et al., 2012], showing convergence to the optimal solution at a sublinear rate. [Yuan et al., 2016] further prove that DGD converges to a neighborhood of a global optimum at a linear rate when used with a constant stepsize for strongly convex objectives. [Shi et al., 2015] present EXTRA, which offers a significant performance boost compared to DGD by using a gradient-tracking technique. [Nedic et al., 2017] propose the DIGing algorithm to handle a time-varying network topology. For a static and symmetric W, DIGing recovers EXTRA by redefining the two mixing matrices in EXTRA. The dual averaging method [Duchi et al., 2012] converges at a sublinear rate with a dynamic stepsize. Under a strong convexity assumption, decomposition techniques such as decentralized ADMM (DADMM, also known as consensus ADMM) have linear convergence for time-invariant undirected graphs, if subproblems are solved exactly [Shi et al., 2014, Wei and Ozdaglar, 2013]. DADMM+ [Bianchi et al., 2016] is a different primal-dual approach with more efficient closed-form updates in each step (as compared to ADMM), and is proven to converge, but without a rate. 
Compared to COLA, neither DADMM nor DADMM+ can be flexibly adapted to the communication-computation tradeoff due to their fixed update definition, and both require additional hyperparameters to tune in each use-case (including the ρ from ADMM). Notably, COLA shows superior performance compared to DIGing and decentralized ADMM in our experiments. [Scaman et al., 2017, 2018] present lower complexity bounds and optimal algorithms for objectives of the form F(x) = Σ_i F_i(x). Specifically, [Scaman et al., 2017] assumes each F_i(x) is smooth and strongly convex, and [Scaman et al., 2018] assumes each F_i(x) is Lipschitz continuous and convex. Additionally, [Scaman et al., 2018] needs a boundedness constraint on the input problem. In contrast, COLA can handle non-smooth and non-strongly convex objectives (A) and (B), suited to the mentioned applications in machine learning and signal processing. For smooth nonconvex models, [Lian et al., 2017] demonstrate that a variant of decentralized parallel SGD can outperform the centralized variant when the network latency is high. They further extend it to the asynchronous setting [Lian et al., 2018] and to deal with large data variance among nodes [Tang et al., 2018a] or with unreliable network links [Tang et al., 2018b]. For decentralized, asynchronous consensus optimization, [Wu et al., 2018] extend the existing PG-EXTRA and prove convergence of the algorithm. [Sirb and Ye, 2018] prove an O(K/ε²) rate for stale and stochastic gradients. [Lian et al., 2018] achieve an O(1/ε) rate and linear speedup with respect to the number of workers.
In the distributed setting with a central server, algorithms of the COCOA family [Yang, 2013, Jaggi et al., 2014, Ma et al., 2015, Dünner et al., 2018] (see [Smith et al., 2018] for a recent overview) are targeted for problems of the forms (A) and (B). 
For convex models, COCOA has been shown to significantly outperform competing methods including e.g. ADMM and distributed SGD. Other centralized algorithm representatives are parallel SGD variants such as [Agarwal and Duchi, 2011, Zinkevich et al., 2010] and more recent distributed second-order methods [Zhang and Lin, 2015, Reddi et al., 2016, Gargiani, 2017, Lee and Chang, 2017, Dünner et al., 2018, Lee et al., 2018]. In this paper we extend COCOA to the challenging decentralized environment, with no central coordinator, while maintaining all of its nice properties. We are not aware of any existing primal-dual methods in the decentralized setting, except the recent work of [Smith et al., 2017] on federated learning for the special case of multi-task learning problems. Federated learning was first described by [Konečný et al., 2015, 2016, McMahan et al., 2017] as decentralized learning for on-device learning applications, combining a global shared model with local personalized models. Current federated optimization algorithms (like FedAvg in [McMahan et al., 2017]) are still close to the centralized setting.

Algorithm 1: COLA: Communication-Efficient Decentralized Linear Learning
 1  Input: Data matrix A distributed column-wise according to partition {P_k}_{k=1}^K. Mixing matrix W. Aggregation parameter γ ∈ [0, 1], and local subproblem parameter σ′ as in (1). Starting point x^(0) := 0 ∈ R^n, v^(0) := 0 ∈ R^d, v_k^(0) := 0 ∈ R^d ∀ k = 1, ..., K;
 2  for t = 0, 1, 2, ..., T do
 3      for k ∈ {1, 2, ..., K} in parallel over all nodes do
 4          compute locally averaged shared vector v_k^(t+1/2) := Σ_{l=1}^K W_kl v_l^(t)
 5          Δx_[k] ← Θ-approximate solution to subproblem (1) at v_k^(t+1/2)
 6          update local variable x_[k]^(t+1) := x_[k]^(t) + γ Δx_[k]
 7          compute update of local estimate Δv_k := A_[k] Δx_[k]
 8          v_k^(t+1) := v_k^(t+1/2) + γ K Δv_k
 9      end
10  end

In contrast, our work provides a fully decentralized alternative algorithm for federated learning with generalized linear models.

2 The decentralized algorithm: COLA

The COLA framework is summarized in Algorithm 1. For a given input problem we map it to either the (A) or the (B) formulation, and define the locally stored dataset A_[k] and the local part of the weight vector x_[k] in node k accordingly. While v = Ax is the shared state being communicated in COCOA, this is generally unknown to a node in the fully decentralized setting. Instead, we maintain v_k, a local estimate of v in node k, and use it as a surrogate in the algorithm.

New data-local quadratic subproblems. During a computation step, node k locally solves the following minimization problem

    min_{Δx_[k] ∈ R^n} G_k^{σ′}(Δx_[k]; v_k, x_[k]),    (1)

where

    G_k^{σ′}(Δx_[k]; v_k, x_[k]) := (1/K) f(v_k) + ∇f(v_k)^⊤ A_[k] Δx_[k] + (σ′/(2τ)) ‖A_[k] Δx_[k]‖² + Σ_{i∈P_k} g_i(x_i + (Δx_[k])_i).    (2)

Crucially, this subproblem only depends on the local data A_[k] and the local vectors v_l from the neighborhood of the current node k. In contrast, in COCOA [Smith et al., 2018] the subproblem is defined in terms of a global aggregated shared vector v_c := Ax ∈ R^d, which is not available in the decentralized setting.² The aggregation parameter γ ∈ [0, 1] does not need to be tuned; in fact, we use the default γ := 1 throughout the paper, see [Ma et al., 2015] for a discussion. Once γ is settled, a safe choice of the subproblem relaxation parameter σ′ is given as σ′ := γK. σ′ can be additionally tightened using an improved Hessian subproblem (Appendix E.3).

Algorithm description. At time t on node k, v_k^(t+1/2) is a local estimate of the shared variable after a communication step (i.e. gossip mixing). The local subproblem (1) based on this estimate is solved and yields Δx_[k]. Then we calculate Δv_k := A_[k] Δx_[k], and update the local shared vector v_k^(t+1). We allow the local subproblem to be solved approximately:

Assumption 1 (Θ-approximate solution). Let Θ ∈ [0, 1] be the relative accuracy of the local solver (potentially randomized), in the sense of returning an approximate solution Δx_[k] at each step t, s.t.

    ( E[ G_k^{σ′}(Δx_[k]; v_k, x_[k]) − G_k^{σ′}(Δx*_[k]; v_k, x_[k]) ] ) / ( G_k^{σ′}(0; v_k, x_[k]) − G_k^{σ′}(Δx*_[k]; v_k, x_[k]) ) ≤ Θ,

where Δx*_[k] ∈ arg min_{Δx_[k]∈R^n} G_k^{σ′}(Δx_[k]; v_k, x_[k]), for each k ∈ [K].

²Subproblem interpretation: Note that for the special case of γ := 1, σ′ := K, by smoothness of f, our subproblem in (2) is an upper bound on

    min_{Δx_[k]∈R^n} (1/K) f(A(x + K Δx_[k])) + Σ_{i∈P_k} g_i(x_i + (Δx_[k])_i),    (3)

which is a scaled block-coordinate update of block k of the original objective (A). This assumes that we have consensus v_k ≡ Ax ∀ k. For quadratic objectives (i.e. when f ≡ ‖·‖²₂ and A describes the quadratic), equality of the formulations (2) and (3) holds. Furthermore, by convexity of f, the sum of (3) is an upper bound on the centralized update f(A(x + Δx)) + g(x + Δx). Both inequalities quantify the overhead of the distributed algorithm over the centralized version, see also [Yang, 2013, Ma et al., 2015, Smith et al., 2018] for the non-decentralized case.

Elasticity to network size, compute resources and changing data, and fault tolerance. Real-world communication networks are not homogeneous and static, but vary greatly in availability, computation, communication and storage capacity. Also, the training data is subject to changes. While these issues impose significant challenges for most existing distributed training algorithms, we show that COLA offers adaptivity to such dynamic and heterogeneous scenarios.
Scalability and elasticity in terms of availability and computational capacity can be modelled by a node-specific local accuracy parameter Θ_k in Assumption 1, as proposed by [Smith et al., 2017]. 
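To make the interplay between the gossip step (line 4 of Algorithm 1) and the inexact local subproblem solves concrete, here is a minimal single-process simulation for Lasso on a ring of K nodes, with γ := 1, σ′ := γK, τ = 1, and a few passes of proximal coordinate descent as the local solver. This is our own sketch under these assumptions, not the released implementation at github.com/epfml/cola; all variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, K, lam, tau = 30, 16, 4, 0.1, 1.0   # f(v)=0.5*||v-b||^2 is (1/tau)-smooth, tau=1
A = rng.standard_normal((d, n))
b = rng.standard_normal(d)
grad_f = lambda v: v - b
F_A = lambda x: 0.5 * np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x))

parts = np.array_split(np.arange(n), K)   # column partition {P_k}
gamma, sigma_p = 1.0, 1.0 * K             # safe choice sigma' := gamma*K

# Metropolis mixing matrix for a ring: symmetric and doubly stochastic.
W = np.zeros((K, K))
for k in range(K):
    W[k, (k - 1) % K] = W[k, (k + 1) % K] = 1.0 / 3.0
    W[k, k] = 1.0 / 3.0

x = np.zeros(n)
V = np.zeros((K, d))                      # row k holds v_k, node k's estimate of Ax
f0 = F_A(x)

def local_step(k, v_k, x, passes=3):
    """Inexactly solve subproblem (1) with a few prox coordinate-descent passes."""
    dx = np.zeros(len(parts[k]))
    r = np.zeros(d)                       # r = A_[k] @ dx, maintained incrementally
    g = grad_f(v_k)
    for _ in range(passes):
        for j, i in enumerate(parts[k]):
            a_i = A[:, i]
            r_wo = r - a_i * dx[j]        # residual without coordinate i
            c = a_i @ g + (sigma_p / tau) * (a_i @ r_wo)
            q = (sigma_p / tau) * (a_i @ a_i)
            u = x[i] - c / q              # unconstrained minimizer, then soft-threshold
            new = np.sign(u) * max(abs(u) - lam / q, 0.0)
            dx[j] = new - x[i]
            r = r_wo + a_i * dx[j]
    return dx, r

for t in range(50):
    V = W @ V                             # gossip mixing: v_k^(t+1/2) = sum_l W_kl v_l
    for k in range(K):                    # "in parallel": the blocks P_k are disjoint
        dx, dv = local_step(k, V[k], x)
        x[parts[k]] += gamma * dx
        V[k] += gamma * K * dv

assert np.allclose(V.mean(axis=0), A @ x)  # Lemma 1, eq. (4): (1/K) sum_k v_k = Ax
assert F_A(x) < f0                         # objective decreased from the zero start
```

Note how the gossip step preserves the average of the v_k (W is doubly stochastic), and the γK scaling on line 8 of Algorithm 1 keeps the invariant (1/K) Σ_k v_k = Ax exactly, which is the content of Lemma 1 below.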
The more resources node k has, the more accurate (smaller) Θ_k we can use. The same mechanism also allows dealing with fault tolerance and stragglers, which is crucial e.g. on a network of personal devices. More specifically, when a new node k joins the network, its x_[k] variables are initialized to 0; when node k leaves, its x_[k] is frozen, and its subproblem is not touched anymore (i.e. Θ_k = 1). Using the same approach, we can adapt to dynamic changes in the dataset, such as additions and removals of local data columns, by adjusting the size of the local weight vector accordingly. Unlike gradient-based methods and ADMM, COLA does not require parameter tuning to converge, increasing resilience to drastic changes.
Extension to improved second-order subproblems. In the centralized setting, it has recently been shown that Hessian information of f can be properly utilized to define improved local subproblems [Lee and Chang, 2017, Dünner et al., 2018]. Similar techniques can be applied to COLA as well; details are given in Appendix E.
Extension to time-varying graphs. Similar to scalability and elasticity, it is also straightforward to extend COLA to a time-varying graph under proper assumptions. If we use the time-varying model of [Nedic et al., 2017, Assumption 1], where an undirected graph is connected within B gossip steps, then changing COLA to perform B communication steps and one computation step per round still guarantees convergence. Details of this setup are provided in Appendix E.

3 On the convergence of COLA

In this section we present a convergence analysis of the proposed decentralized algorithm COLA for both general convex and strongly convex objectives. 
In order to capture the evolution of COLA, we reformulate the original problem (A) by incorporating both x and the local estimates {v_k}_{k=1}^K:

    min_{x, {v_k}_{k=1}^K} H_A(x, {v_k}_{k=1}^K) := (1/K) Σ_{k=1}^K f(v_k) + g(x)    (DA)
    such that v_k = Ax, k = 1, ..., K.

While consensus is not always satisfied during Algorithm 1, the following relations between the decentralized objective and the original one (A) always hold. All proofs are deferred to Appendix C.

Lemma 1. Let {v_k} and x be the iterates generated during the execution of Algorithm 1. At any timestep it holds that

    (1/K) Σ_{k=1}^K v_k = Ax,    (4)
    F_A(x) ≤ H_A(x, {v_k}_{k=1}^K) ≤ F_A(x) + (1/(2τK)) Σ_{k=1}^K ‖v_k − Ax‖².    (5)

The dual problem and duality gap of the decentralized objective (DA) are given in Lemma 2.

Lemma 2 (Decentralized Dual Function and Duality Gap). The Lagrangian dual of the decentralized formulation (DA) is

    min_{{w_k}_{k=1}^K} H_B({w_k}_{k=1}^K) := (1/K) Σ_{k=1}^K f*(w_k) + Σ_{i=1}^n g_i*( −A_i^⊤ ((1/K) Σ_{k=1}^K w_k) ).    (DB)

Given primal variables {x, {v_k}_{k=1}^K} and dual variables {w_k}_{k=1}^K, the duality gap is:

    G_H(x, {v_k}_{k=1}^K, {w_k}_{k=1}^K) := (1/K) Σ_k ( f(v_k) + f*(w_k) ) + g(x) + Σ_{i=1}^n g_i*( −(1/K) Σ_k A_i^⊤ w_k ).    (6)

If the dual variables are fixed to the optimality condition w_k := ∇f(v_k), then the dual variables can be omitted from the argument list of the duality gap, namely G_H(x, {v_k}_{k=1}^K). Note that the decentralized duality gap generalizes the duality gap of COCOA: when consensus is ensured, i.e., v_k ≡ Ax and w_k ≡ ∇f(Ax), the decentralized duality gap recovers that of COCOA.

3.1 Linear rate for strongly convex objectives

We use the following data-dependent quantities in our main theorems:

    σ_k := max_{x_[k]∈R^n} ‖A_[k] x_[k]‖² / ‖x_[k]‖²,   σ_max := max_{k=1,...,K} σ_k,   σ := Σ_{k=1}^K σ_k n_k.    (7)

If the {g_i} are strongly convex, COLA achieves the following linear rate of convergence.

Theorem 1 (Strongly Convex g_i). 
Consider Algorithm 1 with γ := 1 and let Θ be the quality of the local solver as in Assumption 1. Let g_i be μ_g-strongly convex for all i ∈ [n] and let f be (1/τ)-smooth. Let σ̄′ := (1 + β)σ′, α := (1 + (1 − β)²/(36(1 + Θ)))⁻¹ and η := (1 − Θ)(1 − α). Then after T iterations of Algorithm 1 with³

    T ≥ ((1 + η s₀)/(η s₀)) log( ε_H^(0) / ε_H ),  with  s₀ := τμ_g / (τμ_g + σ_max σ̄′) ∈ [0, 1],    (8)

it holds that E[ H_A(x^(T), {v_k^(T)}_{k=1}^K) − H_A(x*, {v_k*}_{k=1}^K) ] ≤ ε_H. Furthermore, after T iterations with

    T ≥ ((1 + η s₀)/(η s₀)) log( (1/(η s₀)) · ε_H^(0) / ε_{G_H} ),

we have the expected duality gap E[ G_H(x^(T), {Σ_{l=1}^K W_kl v_l^(T)}_{k=1}^K) ] ≤ ε_{G_H}.

3.2 Sublinear rate for general convex objectives

Models such as sparse logistic regression, Lasso, and group Lasso are non-strongly convex. For such models, we show that COLA enjoys a O(1/T) sublinear rate of convergence for all network topologies with a positive spectral gap.

Theorem 2 (Non-strongly Convex Case). Consider Algorithm 1, using a local solver of quality Θ. Let g_i(·) have L-bounded support, and let f be (1/τ)-smooth. Let ε_{G_H} > 0 be the desired duality gap. Then after T iterations where

    T ≥ T₀ + max{ ⌈1/η⌉, 4L²σ̄′σ / (τ ε_{G_H} η) },
    T₀ ≥ t₀ + ( (2/η) ( 8L²σ̄′σ / (τ ε_{G_H}) − 1 ) )₊ ,
    t₀ ≥ max{ 0, ⌈ ((1 + η)/η) log( 2τ ( H_A(x^(0), {v_l^(0)}) − H_A(x*, {v*}) ) / (4L²σ̄′σ) ) ⌉ },

and σ̄′ := (1 + β)σ′, α := (1 + (1 − β)²/(36(1 + Θ)))⁻¹ and η := (1 − Θ)(1 − α), we have that the expected duality gap satisfies

    E[ G_H(x̄, {v̄_k}_{k=1}^K, {w̄_k}_{k=1}^K) ] ≤ ε_{G_H}

at the averaged iterates x̄ := (1/(T − T₀)) Σ_{t=T₀+1}^{T−1} x^(t), v′_k := Σ_{l=1}^K W_kl v_l, v̄_k := (1/(T − T₀)) Σ_{t=T₀+1}^{T−1} (v′_k)^(t), and w̄_k := (1/(T − T₀)) Σ_{t=T₀+1}^{T−1} ∇f((v′_k)^(t)).

Note that the assumption of bounded support for the g_i functions is not restrictive in the general convex case, as discussed e.g. in [Dünner et al., 2016].

³ε_H^(0) := H_A(x^(0), {v_k^(0)}_{k=1}^K) − H_A(x*, {v_k*}_{k=1}^K) is the initial suboptimality.

3.3 Local certificates for global accuracy

Accuracy certificates for the training error are very useful for practitioners to diagnose the learning progress. In the centralized setting, the duality gap serves as such a certificate, and is available as a stopping criterion on the master node. In the decentralized setting of our interest, this is more challenging, as consensus is not guaranteed. Nevertheless, we show in the following Proposition 1 that certificates for the decentralized objective (DA) can be computed from local quantities:

Proposition 1 (Local Certificates). Assume g_i has L-bounded support, and let N_k := {j : W_jk > 0} be the set of nodes accessible to node k. Then for any given ε > 0, we have G_H(x; {v_k}_{k=1}^K) ≤ ε if for all k = 1, ..., K the following two local conditions are satisfied:

    v_k^⊤ ∇f(v_k) + Σ_{i∈P_k} ( g_i(x_i) + g_i*(−A_i^⊤ ∇f(v_k)) ) ≤ ε/(2K),    (9)
    ‖ ∇f(v_k) − (1/|N_k|) Σ_{j∈N_k} ∇f(v_j) ‖₂ ≤ ( Σ_{k=1}^K n_k² σ_k )^{−1/2} · ε/(2L√K).    (10)

The local conditions (9) and (10) have a clear interpretation. The first one ensures that the duality gap of the local subproblem given by v_k, as on the left hand side of (9), is small. The second condition (10) guarantees that the consensus violation is bounded, by ensuring that the gradient of each node is similar to those of its neighboring nodes.

Remark 1. 
The resulting certificate from Proposition 1 is local, in the sense that no global vector aggregations are needed to compute it. For a certificate on the global objective, the boolean flag of each local condition (9) and (10) being satisfied or not needs to be shared with all nodes, but this requires extremely little communication. Exact values of the parameters σ and Σ_{k=1}^K n_k²σ_k are not required to be known, and any valid upper bound can be used instead. We can use the local certificates to avoid unnecessary work on local problems which are already optimized, as well as to continuously quantify how much newly arriving local data has to be re-optimized in the case of online training. The local certificates can also be used to quantify the contribution of newly joining or departing nodes, which is particularly useful in the elastic scenario described above.

4 Experimental results

Here we illustrate the advantages of COLA in three respects: firstly, we investigate its application on different network topologies and with varying subproblem quality Θ; secondly, we compare COLA with state-of-the-art decentralized baselines: 1) DIGing [Nedic et al., 2017], which generalizes the gradient-tracking technique of the EXTRA algorithm [Shi et al., 2015], and 2) decentralized ADMM (aka consensus ADMM), which extends the classical ADMM (Alternating Direction Method of

Figure 1: Suboptimality for solving Lasso (λ=10⁻⁶) for the RCV1 dataset on a ring of 16 nodes. We illustrate the performance of COLA: a) number of iterations; b) time. κ here denotes the number of local data passes per communication round.

Figure 2: Convergence of COLA for solving problems on a ring of K=16 nodes. 
Left) Ridge regression on the URL reputation dataset (λ=10⁻⁴); Right) Lasso on the webspam dataset (λ=10⁻⁵).

Table 1: Datasets Used for Empirical Study

Dataset        #Training   #Features   Sparsity
URL            2M          3M          3.5e-5
Webspam        350K        16M         2.0e-4
Epsilon        400K        2K          1.0
RCV1 Binary    677K        47K         1.6e-3

Multipliers) method [Boyd et al., 2011] to the decentralized setting [Shi et al., 2014, Wei and Ozdaglar, 2013]; finally, we show that COLA works in the challenging unreliable network environment where each node has a certain chance to drop out of the network.
We implement all algorithms in PyTorch with an MPI backend. The decentralized network topology is simulated by running one thread per graph node, on a 2×12 core Intel Xeon CPU E5-2680 v3 server with 256 GB RAM. Table 1 describes the datasets⁴ used in the experiments. For Lasso, the columns of A are features. For ridge regression, the columns are features and samples for COLA primal and COLA dual, respectively. The order of columns is shuffled once before being distributed across the nodes. Due to space limits, details on the experimental configurations are included in Appendix D.
Effect of approximation quality Θ. We study the convergence behavior in terms of the approximation quality Θ. Here, Θ is controlled by the number of data passes κ on subproblem (1) per node. Figure 1 shows that increasing κ always results in fewer iterations (fewer communication rounds) for COLA. However, given a fixed network bandwidth, it leads to a clear trade-off for the overall wall-clock time, showing the cost of both communication and computation: a larger κ leads to fewer communication rounds, but it also takes more time to solve the subproblems. These observations suggest that one can adjust Θ for each node to handle system heterogeneity, as discussed at the end of Section 2.
Effect of graph topology. 
Fixing K=16, we test the performance of COLA on 5 different topologies: ring, 2-connected cycle, 3-connected cycle, 2D grid and complete graph. The mixing matrix W is given by Metropolis weights for all test cases (details in Appendix B). Convergence curves are plotted in Figure 3. One can observe that COLA converges monotonically for all topologies; especially, when all nodes in the network are equal, a smaller β leads to a faster convergence rate. This is consistent with the intuition that 1 − β measures the connectivity level of the topology.
Superior performance compared to baselines. We compare COLA with DIGing and D-ADMM for strongly and general convex problems. For the general convex objective, we use Lasso regression with λ = 10⁻⁴ on the webspam dataset; for the strongly convex objective, we use ridge regression with λ = 10⁻⁵ on the URL reputation dataset. For ridge regression, we can map COLA to both the primal and the dual problem. Figure 2 traces the log-suboptimality. One can observe that for both general and strongly convex objectives, COLA significantly outperforms DIGing and decentralized ADMM in terms of number of communication rounds and computation time. While DIGing and D-ADMM need parameter tuning to ensure convergence and efficiency, COLA is much easier to deploy as it is parameter free. Additionally, the convergence guarantees of ADMM rely on exact subproblem solvers, whereas inexact solvers are allowed for COLA.

⁴https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/

Figure 3: Performance comparison of COLA on different topologies. Solving Lasso regression (λ=10⁻⁶) for the RCV1 dataset with 16 nodes.

Figure 4: Performance of COLA when nodes have a chance p of staying in the network, on the URL dataset (λ=10⁻⁴). x_[k] is frozen when node k leaves the network.

Fault tolerance to unreliable nodes. Assume each node of a network only has a chance of p to participate in each round. 
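One simple way to realize the required weight adjustment (our own illustration; the paper defers topology details to its Appendix B) is to recompute Metropolis weights on the subgraph of currently active nodes. This keeps W symmetric and doubly stochastic and, as long as the remaining graph stays connected, preserves a positive spectral gap 1 − β:

```python
import numpy as np

def metropolis_weights(adj):
    """Metropolis mixing matrix of an undirected graph: symmetric, doubly stochastic."""
    K = adj.shape[0]
    deg = adj.sum(axis=1)
    W = np.zeros((K, K))
    for i in range(K):
        for j in range(K):
            if adj[i, j]:
                W[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()   # self-weight absorbs the remaining mass
    return W

# Ring of 8 nodes; node 3 drops out for the current round.
K = 8
adj = np.zeros((K, K), dtype=bool)
for k in range(K):
    adj[k, (k + 1) % K] = adj[(k + 1) % K, k] = True

active = [k for k in range(K) if k != 3]
W = metropolis_weights(adj[np.ix_(active, active)])

assert np.allclose(W, W.T)               # symmetric
assert np.allclose(W.sum(axis=1), 1.0)   # rows (hence columns) sum to one

# The 7 remaining nodes form a path, which is still connected: the second
# largest eigenvalue magnitude beta stays below 1 (positive spectral gap).
eigs = np.sort(np.abs(np.linalg.eigvalsh(W)))
assert 1.0 - eigs[-2] > 0
```

When the dropped node rejoins, its row and column are restored and the weights are recomputed the same way.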
If a new node k joins the network, its local variables are initialized as x_[k] := 0; if node k leaves the network, its x_[k] is frozen with Θ_k = 1. All remaining nodes dynamically adjust their weights to maintain the doubly stochastic property of W. We run COLA on such unreliable networks for different values of p and show the results in Figure 4. First, one can observe that for all p > 0 the suboptimality decreases monotonically as COLA progresses. It is also clear from the results that a smaller dropout rate (a larger p) leads to faster convergence of COLA.

5 Discussion and conclusions

In this work we have studied training generalized linear models in the fully decentralized setting. We proposed a communication-efficient decentralized framework, termed COLA, which is free of parameter tuning. We proved that it has a sublinear rate of convergence for general convex problems, allowing e.g. L1 regularizers, and a linear rate of convergence for strongly convex objectives. Our scheme offers primal-dual certificates which are useful in the decentralized setting. We demonstrated that COLA offers full adaptivity to heterogeneous distributed systems on arbitrary network topologies, is adaptive to changes in network size and data, and offers fault tolerance and elasticity. Future research directions include improved subproblems, extensions to network topologies with directed graphs, and recent communication compression schemes [Stich et al., 2018].

Acknowledgments. We thank Prof. Bharat K. Bhargava for fruitful discussions. We acknowledge funding from SNSF grant 200021_175796, Microsoft Research JRC project 'Coltrain', as well as a Google Focused Research Award.

References

John N Tsitsiklis, Dimitri P Bertsekas, and Michael Athans. Distributed asynchronous deterministic and stochastic gradient optimization algorithms. 
IEEE Transactions on Automatic Control, 31(9):803–812, 1986.

Angelia Nedic and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.

John C Duchi, Alekh Agarwal, and Martin J Wainwright. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592–606, March 2012.

Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015.

Aryan Mokhtari and Alejandro Ribeiro. DSA: Decentralized double stochastic averaging gradient algorithm. Journal of Machine Learning Research, 17(61):1–35, 2016.

Angelia Nedic, Alex Olshevsky, and Wei Shi. Achieving geometric convergence for distributed optimization over time-varying graphs. SIAM Journal on Optimization, 27(4):2597–2633, 2017.

Volkan Cevher, Stephen Becker, and Mark Schmidt. Convex optimization for big data: Scalable, randomized, and parallel algorithms for big data analytics. IEEE Signal Processing Magazine, 31(5):32–43, 2014.

Virginia Smith, Simone Forte, Chenxin Ma, Martin Takáč, Michael I Jordan, and Martin Jaggi. CoCoA: A general framework for communication-efficient distributed optimization. Journal of Machine Learning Research, 18(230):1–49, 2018.

Sixin Zhang, Anna E Choromanska, and Yann LeCun. Deep learning with elastic averaging SGD. In NIPS 2015 - Advances in Neural Information Processing Systems 28, pages 685–693, 2015.

Jialei Wang, Weiran Wang, and Nathan Srebro. Memory and communication efficient distributed stochastic optimization with minibatch prox.
In ICML 2017 - Proceedings of the 34th International Conference on Machine Learning, pages 1882–1919, June 2017.

Celestine Dünner, Simone Forte, Martin Takáč, and Martin Jaggi. Primal-dual rates and certificates. In ICML 2016 - Proceedings of the 33rd International Conference on Machine Learning, pages 783–792, 2016.

Martin Jaggi, Virginia Smith, Martin Takáč, Jonathan Terhorst, Sanjay Krishnan, Thomas Hofmann, and Michael I Jordan. Communication-efficient distributed dual coordinate ascent. In Advances in Neural Information Processing Systems, pages 3068–3076, 2014.

Kevin Scaman, Francis R Bach, Sébastien Bubeck, Yin Tat Lee, and Laurent Massoulié. Optimal algorithms for smooth and strongly convex distributed optimization in networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pages 3027–3036, 2017.

Kevin Scaman, Francis Bach, Sébastien Bubeck, Yin Tat Lee, and Laurent Massoulié. Optimal algorithms for non-smooth distributed optimization in networks. arXiv preprint arXiv:1806.00291, 2018.

Dusan Jakovetic, Joao Xavier, and Jose MF Moura. Convergence rate analysis of distributed gradient methods for smooth optimization. In Telecommunications Forum (TELFOR), 2012 20th, pages 867–870. IEEE, 2012.

Kun Yuan, Qing Ling, and Wotao Yin. On the convergence of decentralized gradient descent. SIAM Journal on Optimization, 26(3):1835–1854, 2016.

Wei Shi, Qing Ling, Kun Yuan, Gang Wu, and Wotao Yin. On the linear convergence of the ADMM in decentralized consensus optimization. IEEE Transactions on Signal Processing, 62(7):1750–1761, 2014.

Ermin Wei and Asuman Ozdaglar. On the O(1/k) convergence of asynchronous distributed alternating direction method of multipliers. arXiv preprint, July 2013.

Pascal Bianchi, Walid Hachem, and Franck Iutzeler.
A coordinate descent primal-dual algorithm and application to distributed asynchronous optimization. IEEE Transactions on Automatic Control, 61(10):2947–2957, 2016.

Xiangru Lian, Ce Zhang, Huan Zhang, Cho-Jui Hsieh, Wei Zhang, and Ji Liu. Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 5336–5346, 2017.

Xiangru Lian, Wei Zhang, Ce Zhang, and Ji Liu. Asynchronous decentralized parallel stochastic gradient descent. In ICML 2018 - Proceedings of the 35th International Conference on Machine Learning, 2018.

Hanlin Tang, Xiangru Lian, Ming Yan, Ce Zhang, and Ji Liu. D2: Decentralized training over decentralized data. arXiv preprint arXiv:1803.07068, 2018a.

Hanlin Tang, Shaoduo Gan, Ce Zhang, Tong Zhang, and Ji Liu. Communication compression for decentralized training. In NIPS 2018 - Advances in Neural Information Processing Systems, 2018b.

Tianyu Wu, Kun Yuan, Qing Ling, Wotao Yin, and Ali H Sayed. Decentralized consensus optimization with asynchrony and delays. IEEE Transactions on Signal and Information Processing over Networks, 4(2):293–307, 2018.

Benjamin Sirb and Xiaojing Ye. Decentralized consensus algorithm with delayed and stochastic gradients. SIAM Journal on Optimization, 28(2):1232–1254, 2018.

Tianbao Yang. Trading computation for communication: Distributed stochastic dual coordinate ascent. In NIPS 2013 - Advances in Neural Information Processing Systems 26, 2013.

Chenxin Ma, Virginia Smith, Martin Jaggi, Michael I Jordan, Peter Richtárik, and Martin Takáč. Adding vs. averaging in distributed primal-dual optimization. In ICML 2015 - Proceedings of the 32nd International Conference on Machine Learning, pages 1973–1982, 2015.

Celestine Dünner, Aurelien Lucchi, Matilde Gargiani, An Bian, Thomas Hofmann, and Martin Jaggi.
A distributed second-order algorithm you can trust. In ICML 2018 - Proceedings of the 35th International Conference on Machine Learning, pages 1357–1365, July 2018.

Alekh Agarwal and John C Duchi. Distributed delayed stochastic optimization. In Advances in Neural Information Processing Systems, pages 873–881, 2011.

Martin Zinkevich, Markus Weimer, Lihong Li, and Alex J Smola. Parallelized stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 2595–2603, 2010.

Yuchen Zhang and Xiao Lin. DiSCO: Distributed optimization for self-concordant empirical loss. In International Conference on Machine Learning, pages 362–370, 2015.

Sashank J Reddi, Jakub Konečný, Peter Richtárik, Barnabás Póczós, and Alex Smola. AIDE: Fast and communication efficient distributed optimization. arXiv preprint arXiv:1608.06879, 2016.

Matilde Gargiani. Hessian-CoCoA: A general parallel and distributed framework for non-strongly convex regularizers. Master's thesis, ETH Zurich, June 2017.

Ching-pei Lee and Kai-Wei Chang. Distributed block-diagonal approximation methods for regularized empirical risk minimization. arXiv preprint arXiv:1709.03043, 2017.

Ching-pei Lee, Cong Han Lim, and Stephen J Wright. A distributed quasi-Newton algorithm for empirical risk minimization with nonsmooth regularization. In ACM International Conference on Knowledge Discovery and Data Mining, 2018.

Virginia Smith, Chao-Kai Chiang, Maziar Sanjabi, and Ameet Talwalkar. Federated multi-task learning. In NIPS 2017 - Advances in Neural Information Processing Systems 30, 2017.

Jakub Konečný, Brendan McMahan, and Daniel Ramage. Federated optimization: Distributed optimization beyond the datacenter.
arXiv preprint arXiv:1511.03575, 2015.

Jakub Konečný, H Brendan McMahan, Felix X Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016.

Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pages 1273–1282, 2017.

Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, Jonathan Eckstein, et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.

Sebastian U Stich, Jean-Baptiste Cordonnier, and Martin Jaggi. Sparsified SGD with memory. In NIPS 2018 - Advances in Neural Information Processing Systems, 2018.

Ralph Tyrell Rockafellar. Convex Analysis. Princeton University Press, 2015.

W Keith Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109, 1970.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS Workshop on Autodiff, 2017.

F Pedregosa, G Varoquaux, A Gramfort, V Michel, B Thirion, O Grisel, M Blondel, P Prettenhofer, R Weiss, V Dubourg, J Vanderplas, A Passos, D Cournapeau, M Brucher, M Perrot, and E Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.