{"title": "Communication-Efficient Distributed Dual Coordinate Ascent", "book": "Advances in Neural Information Processing Systems", "page_first": 3068, "page_last": 3076, "abstract": "Communication remains the most significant bottleneck in the performance of distributed optimization algorithms for large-scale machine learning. In this paper, we propose a communication-efficient framework, COCOA, that uses local computation in a primal-dual setting to dramatically reduce the amount of necessary communication. We provide a strong convergence rate analysis for this class of algorithms, as well as experiments on real-world distributed datasets with implementations in Spark. In our experiments, we find that as compared to state-of-the-art mini-batch versions of SGD and SDCA algorithms, COCOA converges to the same .001-accurate solution quality on average 25\u00d7 as quickly.", "full_text": "Communication-Ef\ufb01cient\n\nDistributed Dual Coordinate Ascent\n\nMartin Jaggi \u2217\nETH Zurich\n\nVirginia Smith \u2217\nUC Berkeley\n\nMartin Tak\u00b4a\u02c7c\n\nLehigh University\n\nJonathan Terhorst\n\nUC Berkeley\n\nSanjay Krishnan\n\nUC Berkeley\n\nThomas Hofmann\n\nETH Zurich\n\nMichael I. Jordan\n\nUC Berkeley\n\nAbstract\n\nCommunication remains the most signi\ufb01cant bottleneck in the performance of\ndistributed optimization algorithms for large-scale machine learning. In this pa-\nper, we propose a communication-ef\ufb01cient framework, COCOA, that uses local\ncomputation in a primal-dual setting to dramatically reduce the amount of nec-\nessary communication. We provide a strong convergence rate analysis for this\nclass of algorithms, as well as experiments on real-world distributed datasets with\nimplementations in Spark. 
In our experiments, we find that as compared to state-of-the-art mini-batch versions of SGD and SDCA algorithms, COCOA converges to the same .001-accurate solution quality on average 25× as quickly.

1 Introduction

With the immense growth of available data, developing distributed algorithms for machine learning is increasingly important, and yet remains a challenging topic both theoretically and in practice. On typical real-world systems, communicating data between machines is vastly more expensive than reading data from main memory, e.g. by a factor of several orders of magnitude when leveraging commodity hardware.¹ Yet, despite this reality, most existing distributed optimization methods for machine learning require significant communication between workers, often equalling the amount of local computation (or reading of local data). This includes for example popular mini-batch versions of online methods, such as stochastic subgradient (SGD) and coordinate descent (SDCA).

In this work, we target this bottleneck. We propose a distributed optimization framework that allows one to freely steer the trade-off between communication and local computation. In doing so, the framework can be easily adapted to the diverse spectrum of available large-scale computing systems, from high-latency commodity clusters to low-latency supercomputers or the multi-core setting.

Our new framework, COCOA (Communication-efficient distributed dual Coordinate Ascent), supports objectives for linear regularized loss minimization, encompassing a broad class of machine learning models. By leveraging the primal-dual structure of these optimization problems, COCOA effectively combines partial results from local computation while avoiding conflict with updates simultaneously computed on other machines. In each round, COCOA employs steps of an arbitrary dual optimization method on the local data on each machine, in parallel.
A single update vector is then communicated to the master node. For example, when choosing to perform H iterations (usually on the order of the data size n) of an online optimization method locally per round, our scheme saves a factor of H in terms of communication compared to the corresponding naive distributed update scheme (i.e., updating a single point before communication). When processing the same number of datapoints, this is clearly a dramatic savings.

* Both authors contributed equally.
¹ On typical computers, the latency for accessing data in main memory is on the order of 100 nanoseconds. In contrast, the latency for sending data over a standard network connection is around 250,000 nanoseconds.

Our theoretical analysis (Section 4) shows that this significant reduction in communication cost comes with only a very moderate increase in the amount of total computation, in order to reach the same optimization accuracy. We show that, in general, the distributed COCOA framework will inherit the convergence rate of the internally-used local optimization method. When using SDCA (randomized dual coordinate ascent) as the local optimizer and assuming smooth losses, this convergence rate is geometric.

In practice, our experiments with the method implemented on the fault-tolerant Spark platform [1] confirm both the clock time performance and huge communication savings of the proposed method on a variety of distributed datasets. Our experiments consistently show order of magnitude gains over traditional mini-batch methods of both SGD and SDCA, and significant gains over the faster but theoretically less justified local SGD methods.

Related Work. As we discuss below (Section 5), our approach is distinguished from recent work on parallel and distributed optimization [2, 3, 4, 5, 6, 7, 8, 9] in that we provide a general framework for improving the communication efficiency of any dual optimization method.
To the best of our knowledge, our work is the first to analyze the convergence rate for an algorithm with this level of communication efficiency, without making data-dependent assumptions. The presented analysis covers the case of smooth losses, but should also be extendable to the non-smooth case. Existing methods using mini-batches [4, 2, 10] are closely related, though our algorithm makes significant improvements by immediately applying all updates locally while they are processed, a scheme that is not considered in the classic mini-batch setting. This intuitive modification results in dramatically improved empirical results and also strengthens our theoretical convergence rate. More precisely, the convergence rate shown here only degrades with the number of workers K, instead of with the significantly larger mini-batch size (typically of order n) in the case of mini-batch methods.

Our method builds on a closely related recent line of work of [2, 3, 11, 12]. We generalize the algorithm of [2, 3] by allowing the use of arbitrary (dual) optimization methods as the local subroutine within our framework. In the special case of using coordinate ascent as the local optimizer, the resulting algorithm is very similar, though with a different computation of the coordinate updates. Moreover, we provide the first theoretical convergence rate analysis for such methods, without making strong assumptions on the data.

The proposed COCOA framework in its basic variant is entirely free of tuning parameters or learning rates, in contrast to SGD-based methods. The only choice to make is the selection of the internal local optimization procedure, steering the desired trade-off between communication and computation. When choosing a primal-dual optimizer as the internal procedure, the duality gap readily provides a fair stopping criterion and efficient accuracy certificates during optimization.

Paper Outline.
The rest of the paper is organized as follows. In Section 2 we describe the problem setting of interest. Section 3 outlines the proposed framework, COCOA, and the convergence analysis of this method is presented in Section 4. We discuss related work in Section 5, and compare against several other state-of-the-art methods empirically in Section 6.

2 Setup

A large class of methods in machine learning and signal processing can be posed as the minimization of a convex loss function of linear predictors with a convex regularization term:

\min_{w \in \mathbb{R}^d} \Big[ P(w) := \frac{\lambda}{2} \|w\|^2 + \frac{1}{n} \sum_{i=1}^{n} \ell_i(w^T x_i) \Big].   (1)

Here the training examples are real-valued vectors x_i ∈ R^d; the loss functions ℓ_i, i = 1, ..., n are convex and depend possibly on labels y_i ∈ R; and λ > 0 is the regularization parameter. Using the setup of [13], we assume the regularizer is the ℓ2-norm for convenience. Examples of this class of problems include support vector machines, as well as regularized linear and logistic regression, ordinal regression, and others.

The most popular method to solve problems of the form (1) is the stochastic subgradient method (SGD) [14, 15, 16].
In this setting, SGD becomes an online method where every iteration only requires access to a single data example (x_i, y_i), and the convergence rate is well-understood.

The associated conjugate dual problem of (1) takes the following form, and is defined over one dual variable for each example in the training set:

\max_{\alpha \in \mathbb{R}^n} \Big[ D(\alpha) := -\frac{\lambda}{2} \|A\alpha\|^2 - \frac{1}{n} \sum_{i=1}^{n} \ell_i^*(-\alpha_i) \Big],   (2)

where ℓ*_i is the conjugate (Fenchel dual) of the loss function ℓ_i, and the data matrix A ∈ R^{d×n} collects the (normalized) data examples A_i := (1/(λn)) x_i in its columns. The duality comes with the convenient mapping from dual to primal variables w(α) := Aα as given by the optimality conditions [13]. For any configuration of the dual variables α, we have the duality gap defined as P(w(α)) − D(α). This gap is a computable certificate of the approximation quality to the unknown true optimum P(w*) = D(α*), and therefore serves as a useful stopping criterion for algorithms.

For problems of the form (2), coordinate descent methods have proven to be very efficient, and come with several benefits over primal methods. In randomized dual coordinate ascent (SDCA), updates are made to the dual objective (2) by solving for one coordinate completely while keeping all others fixed. This algorithm has been implemented in a number of software packages (e.g. LibLinear [17]), and has proven very suitable for use in large-scale problems, while giving stronger convergence results than the primal-only methods (such as SGD), at the same iteration cost [13].
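As a concrete illustration of the duality gap certificate P(w(α)) − D(α), the following is a minimal numpy sketch. It instantiates (1) and (2) for least-squares loss ℓ_i(a) = (a − y_i)²/2, whose conjugate is ℓ*_i(u) = u²/2 + u·y_i; this loss choice and all variable names are our illustrative assumptions, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 50, 10, 0.1
X = rng.standard_normal((n, d))        # rows are the examples x_i
y = rng.standard_normal(n)
A = X.T / (lam * n)                    # columns A_i := x_i / (lambda n)

def primal(w):
    # P(w) from eq. (1), with l_i(a) = (a - y_i)^2 / 2
    return lam / 2 * w @ w + np.mean((X @ w - y) ** 2) / 2

def dual(alpha):
    # D(alpha) from eq. (2); l_i*(u) = u^2/2 + u*y_i, evaluated at u = -alpha_i
    w = A @ alpha                      # the mapping w(alpha) := A alpha
    return -lam / 2 * w @ w - np.mean(alpha ** 2 / 2 - alpha * y)

alpha = rng.standard_normal(n)         # an arbitrary dual point
gap = primal(A @ alpha) - dual(alpha)  # computable certificate, >= 0 by weak duality
```

The gap is non-negative for any α by weak duality, and shrinks to zero only as α approaches the dual optimum, which is what makes it usable as a stopping criterion.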
In addition to superior performance, this method also benefits from requiring no stepsize, and having a well-defined stopping criterion given by the duality gap.

3 Method Description

The COCOA framework, as presented in Algorithm 1, assumes that the data {(x_i, y_i)}_{i=1}^n for a regularized loss minimization problem of the form (1) is distributed over K worker machines. We associate with the datapoints their corresponding dual variables {α_i}_{i=1}^n, being partitioned between the workers in the same way. The core idea is to use the dual variables to efficiently merge the parallel updates from the different workers without much conflict, by exploiting the fact that they all work on disjoint sets of dual variables.

Algorithm 1: COCOA: Communication-Efficient Distributed Dual Coordinate Ascent
  Input: T ≥ 1, scaling parameter 1 ≤ β_K ≤ K (default: β_K := 1).
  Data: {(x_i, y_i)}_{i=1}^n distributed over K machines
  Initialize: α_[k]^(0) ← 0 for all machines k, and w^(0) ← 0
  for t = 1, 2, ..., T
    for all machines k = 1, 2, ..., K in parallel
      (Δα_[k], Δw_k) ← LocalDualMethod(α_[k]^(t−1), w^(t−1))
      α_[k]^(t) ← α_[k]^(t−1) + (β_K / K) · Δα_[k]
    end
    reduce: w^(t) ← w^(t−1) + (β_K / K) · Σ_{k=1}^K Δw_k
  end

In each round, the K workers in parallel perform some steps of an arbitrary optimization method, applied to their local data. This internal procedure tries to maximize the dual formulation (2), only with respect to their own local dual variables. We call this local procedure LocalDualMethod, as specified in the template Procedure A.
Our core observation is that the necessary information each worker requires about the state of the other dual variables can be very compactly represented by a single primal vector w ∈ R^d, without ever sending around data or dual variables between the machines.

Procedure A: LocalDualMethod: Dual algorithm for problem (2) on a single coordinate block k
  Input: Local α_[k] ∈ R^{n_k}, and w ∈ R^d consistent with the other coordinate blocks of α, s.t. w = Aα
  Data: Local {(x_i, y_i)}_{i=1}^{n_k}
  Output: Δα_[k] and Δw := A_[k] Δα_[k]

Procedure B: LocalSDCA: SDCA iterations for problem (2) on a single coordinate block k
  Input: H ≥ 1, α_[k] ∈ R^{n_k}, and w ∈ R^d consistent with the other coordinate blocks of α, s.t. w = Aα
  Data: Local {(x_i, y_i)}_{i=1}^{n_k}
  Initialize: w^(0) ← w, Δα_[k] ← 0 ∈ R^{n_k}
  for h = 1, 2, ..., H
    choose i ∈ {1, 2, ..., n_k} uniformly at random
    find Δα maximizing −(λn/2) ‖w^(h−1) + (1/(λn)) Δα x_i‖² − ℓ*_i(−(α_i^(h−1) + Δα))
    α_i^(h) ← α_i^(h−1) + Δα
    (Δα_[k])_i ← (Δα_[k])_i + Δα
    w^(h) ← w^(h−1) + (1/(λn)) Δα x_i
  end
  Output: Δα_[k] and Δw := A_[k] Δα_[k]

Allowing the subroutine to process more than one local data example per round dramatically reduces the amount of communication between the workers. By definition, COCOA in each outer iteration only requires communication of a single vector for each worker, that is Δw_k ∈ R^d. Further, as we will show in Section 4, COCOA inherits the convergence guarantee of any algorithm run locally on each node in the inner loop of Algorithm 1.
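The outer loop of Algorithm 1 combined with LocalSDCA (Procedure B) can be sketched as a single-process simulation. This is a minimal illustration, not the paper's reference implementation: we instantiate least-squares loss so the Procedure-B coordinate subproblem has a closed form, and the names (`local_sdca`, `cocoa`) and all parameter values are our own choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, lam, K = 200, 20, 0.1, 4
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)  # scale so ||x_i|| <= 1, as assumed in Section 4
y = rng.standard_normal(n)
blocks = np.array_split(np.arange(n), K)       # disjoint coordinate blocks, one per machine

def primal(w):
    return lam / 2 * w @ w + np.mean((X @ w - y) ** 2) / 2

def dual(alpha):
    w = X.T @ alpha / (lam * n)                # w(alpha) := A alpha
    return -lam / 2 * w @ w - np.mean(alpha ** 2 / 2 - alpha * y)

def local_sdca(alpha, w, block, H):
    """Procedure B sketch: H single-coordinate steps on one block."""
    w = w.copy()
    d_alpha = np.zeros(n)
    for _ in range(H):
        i = rng.choice(block)
        x = X[i]
        # closed-form maximizer of the Procedure-B subproblem for least-squares loss
        da = (y[i] - (alpha[i] + d_alpha[i]) - w @ x) / (1 + x @ x / (lam * n))
        d_alpha[i] += da
        w += da * x / (lam * n)
    return d_alpha, X.T @ d_alpha / (lam * n)  # (delta_alpha, delta_w)

def cocoa(T, H, beta_K=1.0):
    """Algorithm 1 sketch: T outer rounds, one vector communicated per machine per round."""
    alpha, w = np.zeros(n), np.zeros(d)
    for _ in range(T):
        updates = [local_sdca(alpha, w, b, H) for b in blocks]  # in parallel on a real cluster
        for d_alpha, d_w in updates:           # reduce step
            alpha += beta_K / K * d_alpha
            w += beta_K / K * d_w
    return alpha, w

alpha, w = cocoa(T=50, H=200)
gap = primal(w) - dual(alpha)                  # duality-gap certificate, driven toward 0
```

Note how each round communicates only the K vectors Δw_k, while H coordinate updates per machine are applied locally; on a real cluster the list comprehension would be a parallel map followed by a reduce.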
We suggest to use randomized dual coordinate ascent (SDCA) [13] as the internal optimizer in practice, as implemented in Procedure B, and also used in our experiments.

Notation. In the same way the data is partitioned across the K worker machines, we write the dual variable vector as α = (α_[1], ..., α_[K]) ∈ R^n with the corresponding coordinate blocks α_[k] ∈ R^{n_k} such that Σ_k n_k = n. The submatrix A_[k] collects the columns of A (i.e. rescaled data examples) which are available locally on the k-th worker. The parameter T determines the number of outer iterations of the algorithm, while when using an online internal method such as LocalSDCA, the number of inner iterations H determines the computation-communication trade-off factor.

4 Convergence Analysis

Considering the dual problem (2), we define the local suboptimality on each coordinate block as

ε_{D,k}(α) := max_{α̂_[k] ∈ R^{n_k}} D((α_[1], ..., α̂_[k], ..., α_[K])) − D((α_[1], ..., α_[k], ..., α_[K])),   (3)

that is, how far we are from the optimum on block k with all other blocks fixed. Note that this differs from the global suboptimality max_α̂ D(α̂) − D((α_[1], ..., α_[K])).

Assumption 1 (Local Geometric Improvement of LocalDualMethod). We assume that there exists Θ ∈ [0, 1) such that for any given α, LocalDualMethod when run on block k alone returns a (possibly random) update Δα_[k] such that

E[ε_{D,k}((α_[1], ..., α_[k−1], α_[k] + Δα_[k], α_[k+1], ..., α_[K]))] ≤ Θ · ε_{D,k}(α).   (4)

Note that this assumption is satisfied for several available implementations of the inner procedure LocalDualMethod, in particular for LocalSDCA, as shown in the following proposition. From here on, we assume that the input data is scaled such that ‖x_i‖ ≤ 1 for all datapoints. Proofs of all statements are provided in the supplementary material.

Proposition 1. Assume the loss functions ℓ_i are (1/γ)-smooth. Then for LocalSDCA, Assumption 1 holds with

Θ = ( 1 − (1/ñ) · (λnγ)/(1 + λnγ) )^H,   (5)

where ñ := max_k n_k is the size of the largest block of coordinates.

Theorem 2. Assume that Algorithm 1 is run for T outer iterations on K worker machines, with the procedure LocalDualMethod having local geometric improvement Θ, and let β_K := 1. Further, assume the loss functions ℓ_i are (1/γ)-smooth. Then the following geometric convergence rate holds for the global (dual) objective:

E[D(α*) − D(α^(T))] ≤ ( 1 − (1 − Θ) · (1/K) · (λnγ)/(σ + λnγ) )^T · ( D(α*) − D(α^(0)) ).   (6)

Here σ is any real number satisfying

σ ≥ σ_min := max_{α ∈ R^n} λ²n² · ( Σ_{k=1}^K ‖A_[k] α_[k]‖² − ‖Aα‖² ) / ‖α‖² ≥ 0.   (7)

Lemma 3. If K = 1 then σ_min = 0. For any K ≥ 1, when assuming ‖x_i‖ ≤ 1 ∀i, we have 0 ≤ σ_min ≤ ñ. Moreover, if datapoints between different workers are orthogonal, i.e. (AᵀA)_{i,j} = 0 for all i, j such that i and j do not belong to the same part, then σ_min = 0.

If we choose K = 1, then Theorem 2 together with Lemma 3 implies that

E[D(α*) − D(α^(T))] ≤ Θ^T · ( D(α*) − D(α^(0)) ),

as expected, showing that the analysis is tight in the special case K = 1. More interestingly, we observe that for any K, in the extreme case when the subproblems are solved to optimality (i.e. letting H → ∞ in LocalSDCA), the algorithm as well as the convergence rate match that of serial/parallel block-coordinate descent [18, 19].

Note: If choosing the starting point as α^(0) := 0 as in the main algorithm, then it is known that D(α*) − D(α^(0)) ≤ 1 (see e.g. Lemma 20 in [13]).

5 Related Work

Distributed Primal-Dual Methods. Our approach is most closely related to recent work by [2, 3], which generalizes the distributed optimization method for linear SVMs as in [11] to the primal-dual setting considered here (which was introduced by [13]). The difference between our approach and the 'practical' method of [2] is that our internal steps directly correspond to coordinate descent iterations on the global dual objective (2), for coordinates in the current block, while in [3, Equation 8] and [2], the inner iterations apply to a slightly different notion of the sub-dual problem defined on the local data. In terms of convergence results, the analysis of [2] only addresses the mini-batch case without local updates, while the more recent paper [3] shows a convergence rate for a variant of COCOA with inner coordinate steps, but under the unrealistic assumption that the data is orthogonal between the different workers. In this case, the optimization problems become independent, so that an even simpler single-round communication scheme summing the individual resulting models w would give an exact solution.
Instead, we show a linear convergence rate for the full problem class of smooth losses, without any assumptions on the data, in the same generality as the non-distributed setting of [13].

While the experimental results in all papers [11, 2, 3] are encouraging for this type of method, they do not yet provide a quantitative comparison of the gains in communication efficiency, or compare to the analogous SGD schemes that use the same distribution and communication patterns, which is the main goal of our experiments in Section 6. For the special case of linear SVMs, the first paper to propose the same algorithmic idea was [11], which used LibLinear in the inner iterations. However, the proposed algorithm [11] processes the blocks sequentially (not in the parallel or distributed setting). Also, it is assumed that the subproblems are solved to near optimality on each block before selecting the next, making the method essentially standard block-coordinate descent. While no convergence rate was given, the empirical results in the journal paper [12] suggest that running LibLinear for just one pass through the local data performs well in practice. Here, we prove this, quantify the communication efficiency, and show that fewer local steps can improve the overall performance. For the LASSO case, [7] has proposed a parallel coordinate descent method converging to the true optimum, which could potentially also be interpreted in our framework here.

Mini-Batches. Another closely related avenue of research includes methods that use mini-batches to distribute updates. In these methods, a mini-batch, or sample, of the data examples is selected for processing at each iteration. All updates within the mini-batch are computed based on the same fixed parameter vector w, and then these updates are either added or averaged in a reduce step and communicated back to the worker machines.
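The mini-batch pattern just described (all b updates computed against one shared, fixed w, then combined with a scaling β_b/b) can be sketched as follows. This is a minimal numpy illustration under a least-squares instantiation with β_b = 1 (conservative averaging); the loss choice and the name `minibatch_cd_round` are our assumptions, not from the paper or [4].

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, lam, b = 100, 10, 0.1, 10
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = rng.standard_normal(n)

def dual(alpha):
    w = X.T @ alpha / (lam * n)                  # w(alpha) := A alpha
    return -lam / 2 * w @ w - np.mean(alpha ** 2 / 2 - alpha * y)

def minibatch_cd_round(alpha, w, beta_b):
    batch = rng.choice(n, size=b, replace=False)
    d_alpha = np.zeros(n)
    for i in batch:                              # every update sees the SAME fixed w
        x = X[i]
        d_alpha[i] = (y[i] - alpha[i] - w @ x) / (1 + x @ x / (lam * n))
    # combine: beta_b = 1 averages (safe), beta_b = b adds (can overshoot)
    alpha = alpha + beta_b / b * d_alpha
    w = w + X.T @ d_alpha / (lam * n) * beta_b / b
    return alpha, w

alpha, w = np.zeros(n), np.zeros(d)
for _ in range(200):
    alpha, w = minibatch_cd_round(alpha, w, beta_b=1.0)
```

Contrast this with COCOA's inner loop, where each local update immediately sees the effect of the previous ones through the locally maintained w.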
This concept has been studied for both SGD and SDCA, see e.g. [4, 10] for the SVM case. The so-called naive variant of [2] is essentially identical to mini-batch dual coordinate descent, with a slight difference in defining the sub-problems.

As is shown in [2] and below in Section 6, the performance of these algorithms suffers when processing large batch sizes, as they do not take local updates immediately into account. Furthermore, they are very sensitive to the choice of the parameter β_b, which controls the magnitude of combining all updates between β_b := 1 for (conservatively) averaging, and β_b := b for (aggressively) adding the updates (here we denote by b the size of the selected mini-batch, which can be of size up to n). This instability is illustrated by the fact that even the change to β_b := 2 instead of β_b := 1 can lead to divergence of coordinate descent (SDCA) in the simple case of just two coordinates [4]. In practice it can be very difficult to choose the correct data-dependent parameter β_b, especially for large mini-batch sizes b ≈ n, as the parameter range spans many orders of magnitude and directly controls the step size of the resulting algorithm, and therefore the convergence rate [20, 21]. For sparse data, the work of [20, 21] gives some data-dependent choices of β_b which are safe.

Known convergence rates for the mini-batch methods degrade linearly with the growing batch size b ≈ Θ(n). More precisely, the improvement in objective function per example processed degrades with a factor of β_b in [4, 20, 21]. In contrast, our convergence rate as shown in Theorem 2 only degrades with the much smaller number of worker machines K, which in practical applications is often several orders of magnitude smaller than the mini-batch size b.

Single Round of Communication. One extreme is to consider methods with only a single round of communication (e.g.
one map-reduce operation), as in [22, 6, 23]. The output of these methods is the average of K individual models, trained only on the local data on each machine. In [22], the authors give conditions on the data and computing environment under which these one-communication algorithms may be sufficient. In general, however, the true optimum of the original problem (1) is not the average of these K models, no matter how accurately the subproblems are solved [24].

Naive Distributed Online Methods, Delayed Gradients, and Multi-Core. On the other extreme, a natural way to distribute updates is to let every machine send updates to the master node (sometimes called the "parameter server") as soon as they are performed. This is what we call naive distributed SGD / CD in our experiments. The amount of communication for such naive distributed online methods is the same as the number of data examples processed. In contrast to this, the number of communicated vectors in our method is divided by H, that is, the number of inner local steps performed per outer iteration, which can be Θ(n).

The early work of [25] introduced the nice framework of gradient updates where the gradients come with some delays, i.e. are based on outdated iterates, and shows some robust convergence rates. In the machine learning setting, [26] and the later work of [27] have provided additional insights into these types of methods. However, these papers study the case of smooth objective functions of a sum structure, and so do not directly apply to the general case we consider here. In the same spirit, [5] implements SGD with communication-intense updates after each example processed, allowing asynchronous updates again with some delay. For coordinate descent, the analogous approach was studied in [28].
Both methods [5, 28] are H times less efficient in terms of communication when compared to COCOA, and are designed for multi-core shared memory machines (where communication is as fast as memory access). They require the same amount of communication as naive distributed SGD / CD, which we include in our experiments in Section 6, and a slightly larger number of iterations due to the asynchronicity. The 1/t convergence rate shown in [5] only holds under strong sparsity assumptions on the data. A more recent paper [29] deepens the understanding of such methods, but still only applies to very sparse data. For general data, [30] theoretically shows that 1/ε² communication rounds of single vectors are enough to obtain ε-quality for linear classifiers, with the rate growing with K² in the number of workers. Our new analysis here makes the dependence on 1/ε logarithmic.

6 Experiments

In this section, we compare COCOA to traditional mini-batch versions of stochastic dual coordinate ascent and stochastic gradient descent, as well as the locally-updating version of stochastic gradient descent. We implement mini-batch SDCA (denoted mini-batch-CD) as described in [4, 2]. The SGD-based methods are mini-batch and locally-updating versions of Pegasos [16], differing only in whether the primal vector is updated locally on each inner iteration or not, and whether the resulting combination/communication of the updates is by an average over the total size KH of the mini-batch (mini-batch-SGD) or just over the number of machines K (local-SGD).
For each algorithm, we additionally study the effect of scaling the average by a parameter β_K, as first described in [4], while noting that it is a benefit to avoid having to tune this data-dependent parameter.

We apply these algorithms to standard hinge loss ℓ2-regularized support vector machines, using implementations written in Spark on m1.large Amazon EC2 instances [1]. Though this non-smooth case is not yet covered in our theoretical analysis, we still see remarkable empirical performance. Our results indicate that COCOA is able to converge to .001-accurate solutions nearly 25× as fast compared to the other algorithms, when all use β_K = 1. The datasets used in these analyses are summarized in Table 1, and were distributed among K = 4, 8, and 32 nodes, respectively. We use the same regularization parameters as specified in [16, 17].

Table 1: Datasets for Empirical Study

Dataset  | Training (n) | Features (d) | Sparsity | λ    | Workers (K)
cov      | 522,911      | 54           | 22.22%   | 1e-6 | 4
rcv1     | 677,399      | 47,236       | 0.16%    | 1e-6 | 8
imagenet | 32,751       | 160,000      | 100%     | 1e-5 | 32

In comparing each algorithm and dataset, we analyze progress in primal objective value as a function of both time (Figure 1) and communication (Figure 2). For all competing methods, we present the result for the batch size (H) that yields the best performance in terms of reduction in objective value over time. For the locally-updating methods (COCOA and local-SGD), these tend to be larger batch sizes corresponding to processing almost all of the local data at each outer step. For the non-locally-updating mini-batch methods (mini-batch SDCA [4] and mini-batch SGD [16]), these typically correspond to smaller values of H, as averaging the solutions to guarantee safe convergence becomes less of an impediment for smaller batch sizes.

Figure 1: Primal Suboptimality vs.
Time for Best Mini-Batch Sizes (H): For β_K = 1, COCOA converges more quickly than all other algorithms, even when accounting for different batch sizes.

Figure 2: Primal Suboptimality vs. # of Communicated Vectors for Best Mini-Batch Sizes (H): A clear correlation is evident between the number of communicated vectors and wall-time to convergence (Figure 1).

First, we note that there is a clear correlation between the wall-time spent processing each dataset and the number of vectors communicated, indicating that communication has a significant effect
on convergence speed. We see clearly that COCOA is able to converge to a more accurate solution in all datasets much faster than the other methods. On average, COCOA reaches a .001-accurate solution for these datasets 25× faster than the best competitor. This is a testament to the algorithm's ability to avoid communication while still making significant global progress by efficiently combining the local updates of each iteration. The improvements are robust for both regimes n ≫ d and n ≪ d.

Figure 3: Effect of H on COCOA.

Figure 4: Best β_K Scaling Values for H = 1e5 and H = 100.

In Figure 3 we explore the effect of H, the computation-communication trade-off factor, on the convergence of COCOA for the Cov dataset on a cluster of 4 nodes. As described above, increasing H decreases communication but also affects the convergence properties of the algorithm. In Figure 4, we attempt to scale the averaging step of each algorithm by using various β_K values, for two different batch sizes on the Cov dataset (H = 1e5 and H = 100). We see that though β_K has a larger impact on the smaller batch size, it is still not enough to improve the mini-batch algorithms beyond what is achieved by COCOA and local-SGD.

7 Conclusion

We have presented a communication-efficient framework for distributed dual coordinate ascent algorithms that can be used to solve large-scale regularized loss minimization problems. This is crucial in settings where datasets must be distributed across multiple machines, and where communication amongst nodes is costly.
We have shown that the proposed algorithm performs competitively on real-world, large-scale distributed datasets, and have presented the first theoretical analysis of this algorithm that achieves competitive convergence rates without making additional assumptions on the data itself.

It remains open to obtain improved convergence rates for more aggressive updates corresponding to βK > 1, which might be suitable for the 'safe' update techniques of [4] and the related expected separable over-approximations of [18, 19], here applied to K instead of n blocks. Furthermore, it remains open to show convergence rates for local SGD in the same communication-efficient setting as described here.

Acknowledgments. We thank Shivaram Venkataraman, Ameet Talwalkar, and Peter Richtárik for fruitful discussions. MJ acknowledges support by the Simons Institute for the Theory of Computing.

References

[1] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. NSDI, 2012.

[2] Tianbao Yang. Trading Computation for Communication: Distributed Stochastic Dual Coordinate Ascent. NIPS, 2013.

[3] Tianbao Yang, Shenghuo Zhu, Rong Jin, and Yuanqing Lin. On Theoretical Analysis of Distributed Stochastic Dual Coordinate Ascent. arXiv:1312.1031, 2013.

[4] Martin Takáč, Avleen Bijral, Peter Richtárik, and Nathan Srebro. Mini-Batch Primal and Dual Methods for SVMs.
ICML, 2013.

[5] Feng Niu, Benjamin Recht, Christopher Ré, and Stephen J. Wright. Hogwild!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent. NIPS, 2011.

[6] Martin A. Zinkevich, Markus Weimer, Alex J. Smola, and Lihong Li. Parallelized Stochastic Gradient Descent. NIPS, 2010.

[7] Joseph K. Bradley, Aapo Kyrola, Danny Bickson, and Carlos Guestrin. Parallel Coordinate Descent for L1-Regularized Loss Minimization. ICML, 2011.

[8] Jakub Mareček, Peter Richtárik, and Martin Takáč. Distributed Block Coordinate Descent for Minimizing Partially Separable Functions. arXiv:1408.2467, 2014.

[9] Ion Necoara and Dragos Clipici. Efficient parallel coordinate descent algorithm for convex optimization problems with separable constraints: Application to distributed MPC. Journal of Process Control, 23(3):243–253, 2013.

[10] Martin Takáč, Peter Richtárik, and Nathan Srebro. Primal-Dual Parallel Coordinate Descent for Machine Learning Optimization. Manuscript, 2014.

[11] Hsiang-Fu Yu, Cho-Jui Hsieh, Kai-Wei Chang, and Chih-Jen Lin. Large Linear Classification When Data Cannot Fit in Memory. KDD, 2010.

[12] Hsiang-Fu Yu, Cho-Jui Hsieh, Kai-Wei Chang, and Chih-Jen Lin.
Large Linear Classification When Data Cannot Fit in Memory. ACM Transactions on Knowledge Discovery from Data, 5(4):1–23, 2012.

[13] Shai Shalev-Shwartz and Tong Zhang. Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization. JMLR, 14:567–599, 2013.

[14] Herbert Robbins and Sutton Monro. A Stochastic Approximation Method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.

[15] Léon Bottou. Large-Scale Machine Learning with Stochastic Gradient Descent. COMPSTAT'2010 - Proceedings of the 19th International Conference on Computational Statistics, pages 177–187, 2010.

[16] Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, and Andrew Cotter. Pegasos: Primal Estimated Sub-Gradient Solver for SVM. Mathematical Programming, 127(1):3–30, 2010.

[17] Cho-Jui Hsieh, Kai-Wei Chang, Chih-Jen Lin, S. Sathiya Keerthi, and S. Sundararajan. A Dual Coordinate Descent Method for Large-scale Linear SVM. ICML, 2008.

[18] Peter Richtárik and Martin Takáč. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 144(1-2):1–38, 2014.

[19] Peter Richtárik and Martin Takáč. Parallel Coordinate Descent Methods for Big Data Optimization. arXiv:1212.0873, 2012.

[20] Peter Richtárik and Martin Takáč. Distributed Coordinate Descent Method for Learning with Big Data. arXiv:1310.2059, 2013.

[21] Olivier Fercoq, Zheng Qu, Peter Richtárik, and Martin Takáč. Fast Distributed Coordinate Descent for Non-Strongly Convex Losses. IEEE Workshop on Machine Learning for Signal Processing, 2014.

[22] Yuchen Zhang, John C. Duchi, and Martin J. Wainwright. Communication-Efficient Algorithms for Statistical Optimization.
JMLR, 14:3321–3363, 2013.

[23] Gideon Mann, Ryan McDonald, Mehryar Mohri, Nathan Silberman, and Daniel D. Walker. Efficient Large-Scale Distributed Training of Conditional Maximum Entropy Models. NIPS, 2009.

[24] Ohad Shamir, Nathan Srebro, and Tong Zhang. Communication-Efficient Distributed Optimization using an Approximate Newton-type Method. ICML, 2014.

[25] John N. Tsitsiklis, Dimitri P. Bertsekas, and Michael Athans. Distributed asynchronous deterministic and stochastic gradient optimization algorithms. IEEE Transactions on Automatic Control, 31(9):803–812, 1986.

[26] Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, and Lin Xiao. Optimal Distributed Online Prediction Using Mini-Batches. JMLR, 13:165–202, 2012.

[27] Alekh Agarwal and John C. Duchi. Distributed Delayed Stochastic Optimization. NIPS, 2011.

[28] Ji Liu, Stephen J. Wright, Christopher Ré, Victor Bittorf, and Srikrishna Sridhar. An Asynchronous Parallel Stochastic Coordinate Descent Algorithm. ICML, 2014.

[29] John C. Duchi, Michael I. Jordan, and H. Brendan McMahan. Estimation, Optimization, and Parallelism when Data is Sparse. NIPS, 2013.

[30] Maria-Florina Balcan, Avrim Blum, Shai Fine, and Yishay Mansour. Distributed Learning, Communication Complexity and Privacy.
COLT, 23:26.1–26.22, 2012.