{"title": "Communication/Computation Tradeoffs in Consensus-Based Distributed Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 1943, "page_last": 1951, "abstract": "We study the scalability of consensus-based distributed optimization algorithms by considering two questions: How many processors should we use for a given problem, and how often should they communicate when communication is not free? Central to our analysis is a problem-specific value $r$ which quantifies the communication/computation tradeoff. We show that organizing the communication among nodes as a $k$-regular expander graph~\\cite{kRegExpanders} yields speedups, while when all pairs of nodes communicate (as in a complete graph), there is an optimal number of processors that depends on $r$. Surprisingly, a speedup can be obtained, in terms of the time to reach a fixed level of accuracy, by communicating less and less frequently as the computation progresses. Experiments on a real cluster solving metric learning and non-smooth convex minimization tasks demonstrate strong agreement between theory and practice.", "full_text": "Communication/Computation Tradeoffs in Consensus-Based Distributed Optimization

Konstantinos I. Tsianos, Sean Lawlor, and Michael G. Rabbat
Department of Electrical and Computer Engineering
McGill University, Montréal, Canada
{konstantinos.tsianos, sean.lawlor}@mail.mcgill.ca, michael.rabbat@mcgill.ca

Abstract

We study the scalability of consensus-based distributed optimization algorithms by considering two questions: How many processors should we use for a given problem, and how often should they communicate when communication is not free? Central to our analysis is a problem-specific value r which quantifies the communication/computation tradeoff. 
We show that organizing the communication among nodes as a k-regular expander graph [1] yields speedups, while when all pairs of nodes communicate (as in a complete graph), there is an optimal number of processors that depends on r. Surprisingly, a speedup can be obtained, in terms of the time to reach a fixed level of accuracy, by communicating less and less frequently as the computation progresses. Experiments on a real cluster solving metric learning and non-smooth convex minimization tasks demonstrate strong agreement between theory and practice.

1 Introduction

How many processors should we use, and how often should they communicate, for large-scale distributed optimization? We address these questions by studying the performance and limitations of a class of distributed algorithms that solve the general optimization problem

minimize_{x ∈ X} F(x) = (1/m) Σ_{j=1}^{m} l_j(x)    (1)

where each function l_j(x) is convex over a convex set X ⊆ R^d. This formulation applies widely in machine learning scenarios, where l_j(x) measures the loss of model x with respect to data point j, and F(x) is the cumulative loss over all m data points.

Although efficient serial algorithms exist [2], the increasing size of available data and problem dimensionality are pushing computers to their limits and the need for parallelization arises [3]. 
Among many proposed distributed approaches for solving (1), we focus on consensus-based distributed optimization [4, 5, 6, 7], where each component function in (1) is assigned to a different node in a network (i.e., the data is partitioned among the nodes), and the nodes interleave local gradient-based optimization updates with communication using a consensus protocol to collectively converge to a minimizer of F(x).

Consensus-based algorithms are attractive because they make distributed optimization possible without requiring centralized coordination or significant network infrastructure (as opposed to, e.g., hierarchical schemes [8]). In addition, they combine simplicity of implementation with robustness to node failures and resilience to communication delays [9]. These qualities are important in clusters, which are typically shared among many users, where algorithms need to be immune to slow nodes that use part of their computation and communication resources for unrelated tasks. The main drawback of consensus-based optimization algorithms is the potentially high communication cost associated with distributed consensus. At the same time, existing convergence bounds in terms of iterations (e.g., (7) below) suggest that increasing the number of processors slows down convergence, which contradicts the intuition that more computing resources are better.

This paper focuses on understanding the limitations and potential for scalability of consensus-based optimization. We build on the distributed dual averaging framework [4]. The key to our analysis is to attach to each iteration a cost that involves two competing terms: a computation cost per iteration, which decreases as we add more processors, and a communication cost, which depends on the network. Our cost expression quantifies the communication/computation tradeoff by a parameter r that is easy to estimate for a given problem and platform. 
The role of r is essential; for example, when nodes communicate at every iteration, we show that in complete graph topologies there exists an optimal number of processors n_opt = 1/√r, while for k-regular expander graphs [1], increasing the network size yields a diminishing speedup. Similar results are obtained when nodes communicate every h > 1 iterations, and even when h increases with time. We validate our analysis with experiments on a cluster. Our results show a remarkable agreement between theory and practice.

In Section 2 we formalize the distributed optimization problem and summarize the distributed dual averaging algorithm. Section 3 introduces the communication/computation tradeoff and contains the basic analysis where nodes communicate at every iteration. The general case of sparsifying communication is treated in Section 4. Section 5 tests our theoretical results on a real cluster implementation, and Section 6 discusses some future extensions.

2 Distributed Convex Optimization

Assume we have at our disposal a cluster with n processors to solve (1), and suppose without loss of generality that m is divisible by n. In the absence of any other information, we partition the data evenly among the processors and our objective becomes to solve the optimization problem

minimize_{x ∈ X} F(x) = (1/m) Σ_{j=1}^{m} l_j(x) = (1/n) Σ_{i=1}^{n} ( (n/m) Σ_{j=1}^{m/n} l_{j|i}(x) ) = (1/n) Σ_{i=1}^{n} f_i(x)    (2)

where we use the notation l_{j|i} to denote the loss associated with the jth local data point at processor i (i.e., j|i = (i−1)m/n + j). The local objective functions f_i(x) at each node are assumed to be L-Lipschitz and convex. The recent distributed optimization literature contains multiple consensus-based algorithms with similar rates of convergence for solving this type of problem. 
We adopt the distributed dual averaging (DDA) framework [4] because its analysis admits a clear separation between the standard (centralized) optimization error and the error due to distributing computation over a network, facilitating our investigation of the communication/computation tradeoff.

2.1 Distributed Dual Averaging (DDA)

In DDA, nodes iteratively communicate and update optimization variables to solve (2). Nodes only communicate if they are neighbors in a communication graph G = (V, E), with the |V| = n vertices being the processors. The communication graph is user-defined (application layer) and does not necessarily correspond to the physical interconnections between processors. DDA requires three additional quantities: a 1-strongly convex proximal function ψ : R^d → R satisfying ψ(x) ≥ 0 and ψ(0) = 0 (e.g., ψ(x) = (1/2) x^T x); a positive step size sequence a(t) = O(1/√t); and an n × n doubly stochastic consensus matrix P with entries p_{ij} > 0 only if either i = j or (j, i) ∈ E, and p_{ij} = 0 otherwise. The algorithm repeats, for each node i in discrete steps t, the following updates:

z_i(t) = Σ_{j=1}^{n} p_{ij} z_j(t−1) + g_i(t−1)    (3)

x_i(t) = argmin_{x ∈ X} { ⟨z_i(t), x⟩ + (1/a(t)) ψ(x) }    (4)

x̂_i(t) = (1/t) ( (t−1)·x̂_i(t−1) + x_i(t) )    (5)

where g_i(t−1) ∈ ∂f_i(x_i(t−1)) is a subgradient of f_i(x) evaluated at x_i(t−1). In (3), the variable z_i(t) ∈ R^d maintains an accumulated subgradient up to time t and represents node i's belief of the direction of the optimum. To update z_i(t) in (3), each node must communicate to exchange the variables z_j(t) with its neighbors in G. 
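To make the updates concrete, here is a minimal single-process simulation of (3)-(5) that we add for illustration; the scalar toy objective f_i(x) = |x − c_i|, the centers, and the uniform consensus matrix are our own choices, not the paper's experimental setup. With ψ(x) = (1/2)x² and X = R, the argmin in (4) has the closed form x_i(t) = −a(t) z_i(t):

```python
import numpy as np

def dda(P, centers, T, A=1.0):
    """Simulate DDA updates (3)-(5) with psi(x) = x^2/2 and X = R, so that
    step (4) reduces to x_i(t) = -a(t) z_i(t). Toy objective: f_i(x) = |x - c_i|."""
    n = len(centers)
    z = np.zeros(n)          # dual variables z_i(t)
    x = np.zeros(n)          # primal iterates x_i(t)
    xhat = np.zeros(n)       # running averages, eq. (5)
    for t in range(1, T + 1):
        g = np.sign(x - centers)          # subgradients g_i(t-1) of f_i at x_i(t-1)
        z = P @ z + g                     # consensus + accumulation, eq. (3)
        x = -(A / np.sqrt(t)) * z         # eq. (4) with psi(x) = x^2/2, X = R
        xhat = ((t - 1) * xhat + x) / t   # eq. (5)
    return xhat

centers = np.array([0.0, 1.0, 2.0, 3.0])     # hypothetical per-node data
n = len(centers)
P = np.full((n, n), 1.0 / n)                 # complete graph, uniform weights
xhat = dda(P, centers, T=5000)
print(np.mean(np.abs(xhat[0] - centers)))    # close to the optimum F* = 1.0
```

With the complete-graph P used here, every node's running average approaches a minimizer of F(x) = (1/n) Σ_i |x − c_i|, i.e., a median of the centers.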
If ψ(x*) ≤ R², then for the local running averages x̂_i(t) defined in (5), the error from a minimizer x* of F(x) after T iterations is bounded by (Theorem 1, [4])

Err_i(T) = F(x̂_i(T)) − F(x*) ≤ R²/(T a(T)) + (L²/(2T)) Σ_{t=1}^{T} a(t−1) + (L/T) Σ_{t=1}^{T} a(t) [ (2/n) Σ_{j=1}^{n} ‖z̄(t) − z_j(t)‖∗ + ‖z̄(t) − z_i(t)‖∗ ]    (6)

where L is the Lipschitz constant, ‖·‖∗ denotes the dual norm, z̄(t) = (1/n) Σ_{i=1}^{n} z_i(t), and ‖z̄(t) − z_i(t)‖∗ quantifies the network error as a disagreement between the direction to the optimum at node i and the consensus direction z̄(t) at time t. Furthermore, from Theorem 2 in [4], with a(t) = A/√t, after optimizing for A we have a bound on the error,

Err_i(T) ≤ C₁ log(T√n)/√T,    C₁ = 2LR √( 19 + 12/(1 − √λ₂) ),    (7)

where λ₂ is the second largest eigenvalue of P. The dependence on the communication topology is reflected through λ₂, since the sparsity structure of P is determined by G. According to (7), increasing n slows down the rate of convergence even if λ₂ does not depend on n.

3 Communication/Computation Tradeoff

In consensus-based distributed optimization algorithms such as DDA, the communication graph G and the cost of transmitting a message have an important influence on convergence speed, especially when communicating one message requires a non-trivial amount of time (e.g., if the dimension d of the problem is very high).

We are interested in the shortest time to obtain an ε-accurate solution (i.e., Err_i(T) ≤ ε). 
From (7), convergence is faster for topologies with good expansion properties, i.e., when the spectral gap 1 − √λ₂ does not shrink too quickly as n grows. In addition, it is preferable to have a balanced network, where each node has the same number of neighbors, so that all nodes spend roughly the same amount of time communicating per iteration. Below we focus on two particular cases and take G to be either a complete graph (i.e., all pairs of nodes communicate) or a k-regular expander [1].

By using more processors, the total amount of communication inevitably increases. At the same time, more data can be processed in parallel in the same amount of time. We focus on the scenario where the size m of the dataset is fixed but possibly very large. To understand whether there is room for speedup, we move away from measuring iterations and employ a time model that explicitly accounts for communication cost. This allows us to study the communication/computation tradeoff and draw conclusions based on the total amount of time to reach an ε-accurate solution.

3.1 Time model

At each iteration, in step (3), processor i computes a local subgradient on its subset of the data:

g_i(x) = ∂f_i(x)/∂x = (n/m) Σ_{j=1}^{m/n} ∂l_{j|i}(x)/∂x.    (8)

The cost of this computation increases linearly with the subset size. Let us normalize time so that one processor computes a subgradient on the full dataset of size m in 1 time unit. Then, using n CPUs, each local subgradient takes 1/n time units to compute. We ignore the time required to compute the projection in step (4); often this can be done very efficiently and requires negligible time when m is large compared to n and d.

We account for the cost of communication as follows. In the consensus update (3), each pair of neighbors in G transmits and receives one variable z_j(t−1). 
Since the message size depends only on the problem dimension d and does not change with m or n, we denote by r the time required to transmit and receive one message, relative to the 1 time unit required to compute the full gradient on all the data. If every node has k neighbors, the cost of one iteration in a network of n nodes is

1/n + kr time units / iteration.    (9)

Using this time model, we study the convergence rate bound (7) after attaching an appropriate time unit cost per iteration. To obtain a speedup by increasing the number of processors n for a given problem, we must ensure that ε-accuracy is achieved in fewer time units.

3.2 Simple Case: Communicate at every Iteration

In the original DDA description (3)-(5), nodes communicate at every iteration. According to our time model, T iterations will cost τ = T(1/n + kr) time units. From (7), the time τ(ε) to reach error ε is found by substituting for T and solving for τ(ε). Ignoring the log factor in (7), we get

C₁ / √( τ(ε) / (1/n + kr) ) = ε  ⟹  τ(ε) = (C₁²/ε²) (1/n + kr) time units.    (10)

This simple manipulation reveals some important facts. If communication is free, then r = 0. If in addition the network G is a k-regular expander, then λ₂ is fixed [10], C₁ is independent of n, and τ(ε) = C₁²/(ε² n). Thus, in the ideal situation, we obtain a linear speedup by increasing the number of processors, as one would expect. In reality, of course, communication is not free.

Complete graph. Suppose that G is the complete graph, where k = n − 1 and λ₂ = 0. In this scenario we cannot keep increasing the network size without eventually harming performance due to the excessive communication cost. 
For a problem with a communication/computation tradeoff r, the optimal number of processors is calculated by minimizing τ(ε) over n:

∂τ(ε)/∂n = 0  ⟹  n_opt = 1/√r.    (11)

Again, in accordance with intuition, if the communication cost is too high (i.e., r ≥ 1), so that it takes more time to transmit and receive a gradient than to compute it, using a complete graph cannot speed up the optimization. We reiterate that r is a quantity that can be easily measured for given hardware and a given optimization problem. As we report in Section 5, the optimal value predicted by our theory agrees very well with experimental performance on a real cluster.

Expander. For the case where G is a k-regular expander, the communication cost per node remains constant as n increases. From (10) and the expression for C₁ in (7), we see that n can be increased without losing performance, although the benefit diminishes (relative to kr) as n grows.

4 General Case: Sparse Communication

The previous section analyzes the case where processors communicate at every iteration. Next we investigate the more general situation where we adjust the frequency of communication.

4.1 Bounded Intercommunication Intervals

Suppose that a consensus step takes place once every h + 1 iterations. That is, the algorithm repeats h ≥ 1 cheap iterations (no communication) of cost 1/n time units, followed by an expensive iteration (with communication) of cost 1/n + kr. This strategy clearly reduces the overall average cost per iteration. The caveat is that the network error ‖z̄(t) − z_i(t)‖∗ is higher because fewer consensus steps have been executed.

In a cheap iteration we replace the update (3) by z_i(t) = z_i(t−1) + g_i(t−1). 
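Stepping back to the single-iteration cost model for a moment, the tradeoff in (10)-(11) is easy to check numerically. The sketch below is ours, with an illustrative value of r; it scans n on a complete graph, where k = n − 1 and λ₂ = 0:

```python
import math

def tau_complete(n, r):
    """tau(eps) of (10) on a complete graph (k = n - 1, lambda_2 = 0),
    dropping the constant factor C_1^2 / eps^2."""
    return 1.0 / n + (n - 1) * r

r = 0.03                                   # illustrative tradeoff value
best_n = min(range(1, 101), key=lambda n: tau_complete(n, r))
print(best_n, 1.0 / math.sqrt(r))          # integer minimizer vs. n_opt = 1/sqrt(r)
```

With r = 0.03 the integer minimizer is n = 6, bracketing the continuous optimum 1/√r ≈ 5.77, the same qualitative behaviour reported for the cluster experiments in Section 5.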
After some straightforward algebra we can show that [for (12), (16) please consult the supplementary material]:

z_i(t) = Σ_{w=0}^{H_t−1} Σ_{k=0}^{h−1} Σ_{j=1}^{n} [P^{H_t−w}]_{ij} g_j(wh + k) + Σ_{k=0}^{Q_t−1} g_i(t − Q_t + k),    (12)

where H_t = ⌊(t−1)/h⌋ counts the number of communication steps in t iterations, and Q_t = mod(t, h) if mod(t, h) > 0 and Q_t = h otherwise. Using the fact that P1 = 1, we obtain

z̄(t) − z_i(t) = (1/n) Σ_{s=1}^{n} z_s(t) − z_i(t)
= Σ_{w=0}^{H_t−1} Σ_{k=0}^{h−1} Σ_{j=1}^{n} ( 1/n − [P^{H_t−w}]_{ij} ) g_j(wh + k)    (13)
+ (1/n) Σ_{s=1}^{n} Σ_{k=0}^{Q_t−1} ( g_s(t − Q_t + k) − g_i(t − Q_t + k) ).    (14)

Taking norms, recalling that the f_i are convex and Lipschitz, and since Q_t ≤ h, we arrive at

‖z̄(t) − z_i(t)‖∗ ≤ Σ_{w=0}^{H_t−1} ‖ (1/n) 1^T − [P^{H_t−w}]_{i,:} ‖₁ hL + 2hL.    (15)

Using a technique similar to that in [4] to bound the ℓ₁ distance of row i of P^{H_t−w} to its stationary distribution as t grows, we can show that

‖z̄(t) − z_i(t)‖∗ ≤ 2hL log(T√n)/(1 − √λ₂) + 3hL    (16)

for all t ≤ T. Comparing (16) to equation (29) in [4], the network error within t iterations is no more than h times larger when a consensus step is only performed once every h + 1 iterations. Finally, we substitute the network error in (6). 
For a(t) = A/√t, we have Σ_{t=1}^{T} a(t) ≤ 2A√T, and

Err_i(T) ≤ ( R²/A + AL² ( 1 + 18h + 12h/(1 − √λ₂) ) ) log(T√n)/√T = C_h log(T√n)/√T.    (17)

We minimize the leading term C_h over A to obtain

A = (R/L) ( 1 + 18h + 12h/(1 − √λ₂) )^{−1/2}  and  C_h = 2RL √( 1 + 18h + 12h/(1 − √λ₂) ).    (18)

Of the T iterations, only H_T = ⌊(T−1)/h⌋ involve communication. So, T iterations will take

τ = (T − H_T)(1/n) + H_T (1/n + kr) = T/n + H_T kr time units.    (19)

To achieve ε-accuracy, ignoring again the logarithmic factor, we need T = C_h²/ε² iterations, or

τ(ε) = T/n + ⌊(T−1)/h⌋ kr ≤ (C_h²/ε²) ( 1/n + kr/h ) time units.    (20)

From the last expression, for a fixed number of processors n, there exists an optimal value for h that depends on the network size and communication graph G:

h_opt = √( nkr / ( 18 + 12/(1 − √λ₂) ) ).    (21)

If the network is a complete graph, using h_opt yields τ(ε) = O(n); i.e., using more processors hurts performance when not communicating at every iteration. On the other hand, if the network is a k-regular expander, then τ(ε) = c₁/√n + c₂ for constants c₁, c₂, and we obtain a diminishing speedup.

4.2 Increasingly Sparse Communication

Next, we consider progressively increasing the intercommunication intervals. This captures the intuition that as the optimization moves closer to the solution, progress slows down, and a processor should have “something significantly new to say” before it communicates. 
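Before developing the increasingly sparse schedule, the bound (20) and the optimal interval (21) from the previous subsection can be sanity-checked numerically. The sketch below is ours; the expander parameters (n, k, λ₂) and r are illustrative, not taken from the paper:

```python
import math

def tau_bound(n, k, r, lam2, h):
    """tau(eps) bound of (20) up to the constant factor 4 R^2 L^2 / eps^2:
    C_h^2 is proportional to (1 + c*h) with c = 18 + 12/(1 - sqrt(lam2)),
    and this multiplies the per-iteration cost (1/n + k*r/h)."""
    c = 18.0 + 12.0 / (1.0 - math.sqrt(lam2))
    return (1.0 + c * h) * (1.0 / n + k * r / h)

# illustrative k-regular expander; values are ours, not from the paper
n, k, r, lam2 = 256, 8, 0.1, 0.25
c = 18.0 + 12.0 / (1.0 - math.sqrt(lam2))
h_opt = math.sqrt(n * k * r / c)           # eq. (21)
print(h_opt, tau_bound(n, k, r, lam2, 2) < tau_bound(n, k, r, lam2, 1))
```

Here h_opt ≈ 2.2, and evaluating (20) confirms that communicating every second iteration beats both h = 1 and much larger intervals for these parameters.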
Let h_j − 1 denote the number of cheap iterations performed between the (j−1)st and jth expensive iteration; i.e., the first communication is at iteration h_1, the second at iteration h_1 + h_2, and so on. We consider schemes where h_j = j^p for p ≥ 0. The number of iterations at which nodes communicate, out of the first T total iterations, is given by H_T = max{ H : Σ_{j=1}^{H} h_j ≤ T }. We have

∫_{y=1}^{H_T} y^p dy ≤ Σ_{j=1}^{H_T} j^p ≤ 1 + ∫_{y=1}^{H_T} y^p dy  ⟹  (H_T^{p+1} − 1)/(p+1) ≤ T ≤ (H_T^{p+1} + p)/(p+1),    (22)

which means that H_T = Θ(T^{1/(p+1)}) as T → ∞. Similar to (15), the network error is bounded as

‖z̄(t) − z_i(t)‖∗ ≤ Σ_{w=0}^{H_t−1} ‖ (1/n) 1^T − [P^{H_t−w}]_{i,:} ‖₁ h_w L + 2 h_t L.    (23)

We split the sum into two terms based on whether or not the powers of P have converged. Using the split point t̂ = log(T√n)/(1 − √λ₂), the ℓ₁ term is bounded by 2 when w is large and by 1/T when w is small:

‖z̄(t) − z_i(t)‖∗ ≤ (L/T) Σ_{w=0}^{H_t−1−t̂} h_w + 2L Σ_{w=H_t−t̂}^{H_t−1} h_w + 2 h_t L    (24)
≤ (L/T) Σ_{w=0}^{H_t−1−t̂} w^p + 2L t̂ (H_t − 1)^p + 2 t^p L    (25)
≤ (L/T) (H_t − t̂ − 1)^{p+1}/(p+1) + 2L t̂ (H_t − 1)^p + 2 t^p L    (26)
≤ L/(p+1) + Lp/(T(p+1)) + 2L t̂ H_t^p + 2 t^p L,    (27)

since T > H_t − t̂ − 1. 
Substituting this bound into (6) and taking the step size sequence to be a(t) = A/t^q, with A and q to be determined, we get

Err_i(T) ≤ R²/(A T^{1−q}) + L²A/(2(1−q) T^q) + 3L²A/((p+1)(1−q) T^q) + 3L²pA/((p+1)(1−q) T^{1+q}) + (6L² t̂ A / T) Σ_{t=1}^{T} H_t^p / t^q + (6L²A / T) Σ_{t=1}^{T} t^{p−q}.    (28)

The first four summands converge to zero when 0 < q < 1. Since H_t = Θ(t^{1/(p+1)}),

(1/T) Σ_{t=1}^{T} H_t^p / t^q ≤ (1/T) Σ_{t=1}^{T} O( t^{p/(p+1) − q} ) ≤ O( T^{p/(p+1) − q} ),    (29)

which converges to zero if p/(p+1) < q. To bound the last term, note that (1/T) Σ_{t=1}^{T} t^{p−q} ≤ T^{p−q}/(p−q+1) < T^{p−q}, so the term goes to zero as T → ∞ if p < q. In conclusion, Err_i(T) converges no slower than O( log(T√n) / T^{q−p} ). If we choose q = 1/2 to balance the first three summands, then for small p > 0 the rate of convergence is arbitrarily close to O( log(T√n)/√T ), while nodes communicate increasingly infrequently as T → ∞.

Out of T total iterations, DDA executes H_T = Θ(T^{1/(p+1)}) expensive iterations involving communication and T − H_T cheap iterations without communication, so

τ(ε) = O( T/n + T^{p/(p+1)} kr ) = O( T ( 1/n + kr / T^{1/(p+1)} ) ).    (30)

In this case, the communication cost kr becomes a less and less significant proportion of τ(ε) as T increases. 
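An increasingly sparse schedule h_j = j^p is simple to generate in practice. This helper is ours, written for illustration; it lists the consensus iterations among the first T and numerically confirms the H_T = Θ(T^{1/(p+1)}) scaling from (22):

```python
def comm_schedule(T, p):
    """Iterations at which consensus steps occur when the j-th
    intercommunication gap is h_j = j**p (Section 4.2)."""
    times, t, j = [], 0.0, 1
    while True:
        t += j ** p              # wait h_j = j^p iterations before communicating
        if t > T:
            break
        times.append(round(t))
        j += 1
    return times

T, p = 100000, 0.3
H_T = len(comm_schedule(T, p))
print(H_T, T ** (1.0 / (1 + p)))  # H_T grows like T^{1/(p+1)}
```

For T = 100000 and p = 0.3, H_T comes out near ((p+1)T)^{1/(p+1)} ≈ 8600, i.e., nodes communicate on fewer than 9% of the iterations.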
So for any 0 < p < 1/2, if k is fixed, we approach a linear speedup behaviour Θ(T/n). To get Err_i(T) ≤ ε, ignoring the logarithmic factor, we need

T = (C_p/ε)^{2/(1−2p)} iterations, with C_p = 2LR √( 7 + (12p + 12)/((3p+1)(1 − √λ₂)) + 12/(2p+1) ).    (31)

From this last equation we see that for 0 < p < 1/2 we have C_p < C₁, so using increasingly sparse communication should, in fact, be faster than communicating at every iteration.

5 Experimental Evaluation

To verify our theoretical findings, we implement DDA on a cluster of 14 nodes with 3.2 GHz Pentium 4HT processors and 1 GB of memory each, connected via Ethernet that allows for roughly 11 MB/sec throughput per node. Our implementation is in C++ using the send and receive functions of OpenMPI v1.4.4 for communication. The Armadillo v2.3.91 library, linked to LAPACK and BLAS, is used for efficient numerical computations.

5.1 Application to Metric Learning

Metric learning [11, 12, 13] is a computationally intensive problem where the goal is to find a distance metric D(u, v) such that points that are related have a very small distance under D, while for unrelated points D is large. Following the formulation in [14], we have a data set {u_j, v_j, s_j}_{j=1}^{m} with u_j, v_j ∈ R^d and s_j ∈ {−1, 1} signifying whether or not u_j is similar to v_j (e.g., similar if they are from the same class). Our goal is to find a symmetric positive semi-definite matrix A ⪰ 0 to define a pseudo-metric of the form D_A(u, v) = √( (u − v)^T A (u − v) ). 
To that end, we use a hinge-type loss function l_j(A, b) = max{ 0, s_j ( D_A(u_j, v_j)² − b ) + 1 }, where b ≥ 1 is a threshold that determines whether two points are dissimilar according to D_A(·,·). In the batch setting, we formulate the convex optimization problem

minimize_{A, b} F(A, b) = Σ_{j=1}^{m} l_j(A, b)  subject to A ⪰ 0, b ≥ 1.    (32)

The subgradient of l_j at (A, b) is zero if s_j ( D_A(u_j, v_j)² − b ) ≤ −1. Otherwise,

∂l_j(A, b)/∂A = s_j (u_j − v_j)(u_j − v_j)^T  and  ∂l_j(A, b)/∂b = −s_j.    (33)

Since DDA uses vectors x_i(t) and z_i(t), we represent each pair (A_i(t), b_i(t)) as a d² + 1 dimensional vector. The communication cost is thus quadratic in the dimension. In step (3) of DDA, we use the proximal function ψ(x) = (1/2) x^T x, in which case (4) simplifies to taking x_i(t) = −a(t−1) z_i(t), followed by projecting x_i(t) to the constraint set by setting b_i(t) ← max{1, b_i(t)} and projecting A_i(t) to the set of positive semi-definite matrices by first taking its eigenvalue decomposition and reconstructing A_i(t) after forcing any negative eigenvalues to zero.

We use the MNIST digits dataset, which consists of 28 × 28 pixel images of handwritten digits 0 through 9. Representing images as vectors, we have d = 28² = 784 and a problem with d² + 1 = 614657 dimensions, trying to learn a 784 × 784 matrix A. With double precision arithmetic, each DDA message has a size of approximately 4.7 MB. We construct a dataset by randomly selecting 5000 pairs from the full MNIST data. One node needs 29 seconds to compute a gradient on this dataset, and sending and receiving 4.7 MB takes 0.85 seconds. The communication/computation tradeoff value is estimated as r = 0.85/29 ≈ 0.0293. According to (11), when G is a complete graph, we expect to have optimal performance when using n_opt = 1/√r = 5.8 nodes. 
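The per-pair subgradient (33) and the PSD projection applied after the DDA update are short to express in code. The sketch below is ours (a random symmetric test matrix stands in for MNIST data), using numpy's symmetric eigendecomposition for the projection:

```python
import numpy as np

def metric_subgrad(A, b, u, v, s):
    """Subgradient (33) of l_j(A, b) = max{0, s_j((u-v)^T A (u-v) - b) + 1}."""
    w = u - v
    if s * (w @ A @ w - b) <= -1.0:
        return np.zeros_like(A), 0.0      # hinge inactive: subgradient is zero
    return s * np.outer(w, w), -s         # dl/dA and dl/db

def project_psd(M):
    """Project a symmetric matrix onto the PSD cone by zeroing negative eigenvalues."""
    vals, vecs = np.linalg.eigh(M)
    return (vecs * np.maximum(vals, 0.0)) @ vecs.T

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
A = project_psd((M + M.T) / 2)            # projection step used after DDA update (4)
print(np.linalg.eigvalsh(A).min() >= -1e-10)  # True
```

In the cluster implementation the same eigenvalue clipping is what makes the projection cost negligible relative to the 29-second gradient computation.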
Figure 1(left) shows the evolution of the average function value F̄(t) = (1/n) Σ_i F(x̂_i(t)) for 1 to 14 processors connected as a complete graph, where x̂_i(t) is as defined in (5). There is a very good match between theory and practice, since the fastest convergence is achieved with n = 6 nodes.

In the second experiment, to make r closer to 0, we apply PCA to the original data and keep the top 87 principal components, containing 90% of the energy. The dimension of the problem is reduced dramatically to 87 · 87 + 1 = 7570, and the message size to 59 KB. Using 60000 random pairs of MNIST data, the time to compute one gradient on the entire dataset with one node is 2.1 seconds, while the time to transmit and receive 59 KB is only 0.0104 seconds. Again, for a complete graph, Figure 1(right) illustrates the evolution of F̄(t) for 1 to 14 nodes. As we see, increasing n speeds up the computation. The speedup we get is close to linear at first, but diminishes since communication is not entirely free. In this case r = 0.0104/2.1 = 0.005 and n_opt = 14.15.

5.2 Nonsmooth Convex Minimization

Next we create an artificial problem where the minima of the components f_i(x) at each node are very different, so that communication is essential in order to obtain an accurate optimizer of F(x).

Figure 1: (Left) In a subset of the full MNIST data, for our specific hardware, n_opt = 1/√r = 5.8. The fastest convergence is achieved on a complete graph of 6 nodes. 
(Right) In the reduced MNIST data using PCA, the communication cost drops and a speedup is achieved by scaling up to 14 processors.

We define f_i(x) as a sum of high dimensional quadratics,

f_i(x) = Σ_{j=1}^{M} max( l¹_{j|i}(x), l²_{j|i}(x) ),  with  l^ξ_{j|i}(x) = (x − c^ξ_{j|i})^T (x − c^ξ_{j|i}),  ξ ∈ {1, 2},    (34)

where x ∈ R^{10,000}, M = 15,000, and c¹_{j|i}, c²_{j|i} are the centers of the quadratics. Figure 2 illustrates again the average function value F̄(t) for 10 nodes in a complete graph topology. The baseline performance is when nodes communicate at every iteration (h = 1). For this problem r = 0.00089 and, from (21), h_opt = 1. Naturally, communicating every 2 iterations (h = 2) slows down convergence. Over the duration of the experiment, with h = 2, each node communicates with its peers 55 times. We selected p = 0.3 for increasingly sparse communication, and got H_T = 53 communications per node. As we see, even though nodes communicate as much as in the h = 2 case, convergence is even faster than communicating at every iteration. This verifies our intuition that communication is more important in the beginning. Finally, the case where p = 1 is shown. This value is out of the permissible range, and as expected DDA does not converge to the right solution.

Figure 2: Sparsifying communication to minimize (34) with 10 nodes in a complete graph topology. When waiting t^0.3 iterations between consensus steps, convergence is faster than communicating at every iteration (h = 1), even though the total number of consensus steps performed over the duration of the experiment is equal to communicating every 2 iterations (h = 2). When waiting a linear number of iterations between consensus steps (h = t), DDA does not converge to the right solution. 
Note: all methods are initialized from the same value; the x-axis starts at 5 sec.

6 Conclusions and Future Work

The analysis and experimental evaluation in this paper focus on distributed dual averaging and reveal its capability to scale with the network size. We expect that similar results hold for other consensus-based algorithms such as [5], as well as various distributed averaging-type algorithms (e.g., [15, 16, 17]). In the future we will extend the analysis to the case of stochastic optimization, where h_t = t^p could correspond to using increasingly larger mini-batches.

References
[1] O. Reingold, S. Vadhan, and A. Wigderson, “Entropy waves, the zig-zag graph product, and new constant-degree expanders,” Annals of Mathematics, vol. 155, no. 2, pp. 157–187, 2002.
[2] Y. Nesterov, “Primal-dual subgradient methods for convex problems,” Mathematical Programming Series B, vol. 120, pp. 221–259, 2009.
[3] R. Bekkerman, M. Bilenko, and J. Langford, Scaling up Machine Learning, Parallel and Distributed Approaches. Cambridge University Press, 2011.
[4] J. Duchi, A. Agarwal, and M. Wainwright, “Dual averaging for distributed optimization: Convergence analysis and network scaling,” IEEE Transactions on Automatic Control, vol. 57, no. 3, pp. 592–606, 2011.
[5] A. Nedic and A. Ozdaglar, “Distributed subgradient methods for multi-agent optimization,” IEEE Transactions on Automatic Control, vol. 54, no. 1, January 2009.
[6] B. Johansson, M. Rabi, and M. 
Johansson, “A randomized incremental subgradient method for distributed optimization in networked systems,” SIAM Journal on Control and Optimization, vol. 20, no. 3, 2009.
[7] S. S. Ram, A. Nedic, and V. V. Veeravalli, “Distributed stochastic subgradient projection algorithms for convex optimization,” Journal of Optimization Theory and Applications, vol. 147, no. 3, pp. 516–545, 2011.
[8] A. Agarwal and J. C. Duchi, “Distributed delayed stochastic optimization,” in Neural Information Processing Systems, 2011.
[9] K. I. Tsianos and M. G. Rabbat, “Distributed dual averaging for convex optimization under communication delays,” in American Control Conference (ACC), 2012.
[10] F. Chung, Spectral Graph Theory. AMS, 1998.
[11] E. P. Xing, A. Y. Ng, M. I. Jordan, and S. Russell, “Distance metric learning, with application to clustering with side-information,” in Neural Information Processing Systems, 2003.
[12] K. Q. Weinberger and L. K. Saul, “Distance metric learning for large margin nearest neighbor classification,” Journal of Optimization Theory and Applications, vol. 10, pp. 207–244, 2009.
[13] K. Q. Weinberger, F. Sha, and L. K. Saul, “Convex optimizations for distance metric learning and pattern classification,” IEEE Signal Processing Magazine, 2010.
[14] S. Shalev-Shwartz, Y. Singer, and A. Y. Ng, “Online and batch learning of pseudo-metrics,” in ICML, 2004, pp. 743–750.
[15] M. A. Zinkevich, M. Weimer, A. Smola, and L. Li, “Parallelized stochastic gradient descent,” in Neural Information Processing Systems, 2010.
[16] R. McDonald, K. Hall, and G. Mann, “Distributed training strategies for the structured perceptron,” in Annual Conference of the North American Chapter of the Association for Computational Linguistics, 2012, pp. 456–464.
[17] G. Mann, R. McDonald, M. Mohri, N. 
Silberman, and D. D. Walker, “Efficient large-scale distributed training of conditional maximum entropy models,” in Neural Information Processing Systems, 2009, pp. 1231–1239.
", "award": [], "sourceid": 958, "authors": [{"given_name": "Konstantinos", "family_name": "Tsianos", "institution": null}, {"given_name": "Sean", "family_name": "Lawlor", "institution": null}, {"given_name": "Michael", "family_name": "Rabbat", "institution": null}]}