{"title": "Distributed Dual Averaging In Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 550, "page_last": 558, "abstract": "The goal of decentralized optimization over a network is to optimize a global objective formed by a sum of local (possibly nonsmooth) convex functions using only local computation and communication. We develop and analyze distributed algorithms based on dual averaging of subgradients, and we provide sharp bounds on their convergence rates as a function of the network size and topology. Our analysis clearly separates the convergence of the optimization algorithm itself from the effects of communication constraints arising from the network structure. We show that the number of iterations required by our algorithm scales inversely in the spectral gap of the network. The sharpness of this prediction is confirmed both by theoretical lower bounds and simulations for various networks.", "full_text": "Distributed Dual Averaging in Networks\n\nJohn C. Duchi1\n\nAlekh Agarwal1\n\nMartin J. Wainwright1,2\n\nDepartment of Electrical Engineering and Computer Science1 and Department of Statistics2\n\nUniversity of California, Berkeley\n\nBerkeley, CA 94720-1776\n\n{jduchi,alekh,wainwrig}@eecs.berkeley.edu\n\nAbstract\n\nThe goal of decentralized optimization over a network is to optimize a global ob-\njective formed by a sum of local (possibly nonsmooth) convex functions using\nonly local computation and communication. We develop and analyze distributed\nalgorithms based on dual averaging of subgradients, and provide sharp bounds on\ntheir convergence rates as a function of the network size and topology. Our anal-\nysis clearly separates the convergence of the optimization algorithm itself from\nthe effects of communication constraints arising from the network structure. We\nshow that the number of iterations required by our algorithm scales inversely in\nthe spectral gap of the network. 
The sharpness of this prediction is confirmed both by theoretical lower bounds and simulations for various networks.\n\n1 Introduction\n\nNetwork-structured optimization problems arise in a variety of application domains within the information sciences and engineering. A canonical example that arises in machine learning is the problem of minimizing a loss function averaged over a large dataset (e.g. [16, 17]). With terabytes of data, it is desirable (even necessary) to assign smaller subsets of the data to different processors, and the processors must communicate to find parameters that minimize the loss over the entire dataset. Problems such as multi-agent coordination, estimation problems in sensor networks, and packet routing are all naturally cast as distributed convex minimization [1, 13, 24]. The seminal work of Tsitsiklis and colleagues [22, 1] analyzed algorithms for minimization of a smooth function f known to several agents while distributing processing of components of the parameter vector x ∈ R^n. More recently, a few researchers have shifted focus to problems in which each processor locally has its own convex (potentially non-differentiable) objective function [18, 15, 21, 11].\n\nIn this paper, we provide a simple new subgradient algorithm for distributed constrained optimization of a convex function. We refer to it as a dual averaging subgradient method, since it is based on maintaining and forming weighted averages of subgradients throughout the network. This approach is essentially different from previously developed distributed subgradient methods [18, 15, 21, 11], and these differences facilitate our analysis of network scaling issues—how convergence rates depend on network size and topology. Indeed, the second main contribution of this paper is a careful analysis that demonstrates a close link between convergence of the algorithm and the underlying spectral properties of the network. 
The convergence rates for a different algorithm given by the papers [18, 15] grow exponentially in the number of nodes n in the network. Ram et al. [21] provide a tighter analysis that yields convergence rates scaling cubically in the network size, but independent of the network topology. Consequently, their analysis does not capture the intuition that distributed algorithms should converge faster on “well-connected” networks—expander graphs being a prime example—than on poorly connected networks (e.g., chains or cycles). Johansson et al. [11] analyze a low communication peer-to-peer protocol that attains rates dependent on network structure. However, in their algorithm only one node has a current parameter value, while all nodes in our algorithm maintain good estimates of the optimum at all times. This is important in online or streaming problems where nodes are expected to act or answer queries in real-time. In additional comparison to previous work, our analysis yields network scaling terms that are often substantially sharper. Our development yields an algorithm with convergence rate that scales inversely in the spectral gap of the network. By exploiting known results on spectral gaps for graphs with n nodes, we show that our algorithm obtains an ε-optimal solution in O(n^2/ε^2) iterations for a single cycle or path, O(n/ε^2) iterations for a two-dimensional grid, and O(1/ε^2) iterations for a bounded degree expander graph. Simulation results show excellent agreement with these theoretical predictions.\n\n2 Problem set-up and algorithm\n\nIn this section, we provide a formal statement of the distributed minimization problem and a description of the distributed dual averaging algorithm.\n\nDistributed minimization: We consider an optimization problem based on functions that are distributed over a network. 
More specifically, let G = (V, E) be an undirected graph over the vertex set V = {1, 2, . . . , n} with edge set E ⊂ V × V. Associated with each i ∈ V is a convex function fi : R^d → R, and our overarching goal is to solve the constrained optimization problem\n\n  min_{x ∈ X} (1/n) Σ_{i=1}^n fi(x),\n\nwhere X is a closed convex set. Each function fi is convex and hence subdifferentiable, but need not be smooth. We assume without loss of generality that 0 ∈ X, since we can simply translate X. Each node i ∈ V is associated with a separate agent, and each agent i maintains its own parameter vector xi ∈ R^d. The graph G imposes communication constraints on the agents: in particular, agent i has local access to only the objective function fi and can communicate directly only with its immediate neighbors j ∈ N(i) := {j ∈ V | (i, j) ∈ E}.\n\nA concrete motivating example for these types of problems is the machine learning scenario described in Section 1. In this case, the set X is the parameter space of the learner. Each function fi is the empirical loss over the subset of data assigned to processor i, and the average f is the empirical loss over the entire dataset. We use cluster computing as our model, so each processor is a node in the cluster and the graph G contains edges between processors connected with small latencies; this setup avoids communication bottlenecks of architectures with a centralized master node.\n\nDual averaging: Our algorithm is based on a dual averaging algorithm [20] for minimization of a (potentially nonsmooth) convex function f subject to the constraint that x ∈ X. We begin by describing the standard version of the algorithm. The dual averaging scheme is based on a proximal function ψ : R^d → R assumed to be strongly convex with respect to a norm ‖·‖; more precisely,\n\n  ψ(y) ≥ ψ(x) + ⟨∇ψ(x), y − x⟩ + (1/2)‖x − y‖^2 for all x, y ∈ X.\n\nWe assume w.l.o.g. that ψ ≥ 0 on X and that ψ(0) = 0. Such proximal functions include the canonical quadratic ψ(x) = (1/2)‖x‖_2^2, which is strongly convex with respect to the ℓ2-norm, and the negative entropy ψ(x) = Σ_{j=1}^d (x_j log x_j − x_j), which is strongly convex with respect to the ℓ1-norm for x in the probability simplex.\n\nWe assume that each function fi is L-Lipschitz with respect to the same norm ‖·‖, that is,\n\n  |fi(x) − fi(y)| ≤ L ‖x − y‖ for x, y ∈ X.   (1)\n\nMany cost functions fi satisfy this type of Lipschitz condition, for instance, convex functions on a compact domain X or any polyhedral function on an arbitrary domain [8]. The Lipschitz condition (1) implies that for any x ∈ X and any subgradient gi ∈ ∂fi(x), we have ‖gi‖_* ≤ L, where ‖·‖_* denotes the dual norm to ‖·‖, defined by ‖v‖_* := sup_{‖u‖=1} ⟨v, u⟩.\n\nThe dual averaging algorithm generates a sequence of iterates {x(t), z(t)}_{t=0}^∞ contained within X × R^d. At time step t, the algorithm receives a subgradient g(t) ∈ ∂f(x(t)) and updates\n\n  z(t + 1) = z(t) − g(t) and x(t + 1) = Π_X^ψ(−z(t + 1), α(t)).   (2)\n\nHere {α(t)}_{t=0}^∞ is a non-increasing sequence of positive stepsizes and\n\n  Π_X^ψ(z, α) := argmin_{x ∈ X} { ⟨z, x⟩ + (1/α) ψ(x) }   (3)\n\nis a type of projection. Intuitively, given the current iterate (x(t), z(t)), the next iterate x(t + 1) is chosen to minimize an averaged first-order approximation to the function f, while the proximal function ψ and stepsize α(t) > 0 enforce that the iterates {x(t)}_{t=0}^∞ do not oscillate wildly. 
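To make the updates (2)–(3) concrete, the following sketch (our own illustration, not code from the paper) runs dual averaging with the quadratic proximal function ψ(x) = ‖x‖²/2 on a one-dimensional nonsmooth objective f(x) = |x − 1| over X = [−5, 5]. For this ψ and X, the projection (3) has a closed form: the unconstrained minimizer is −αz, clipped onto the interval. The stepsize α(t) = R/(L√t) is a choice consistent with the 1/√t scaling used later in Theorem 2; the specific constants are assumptions for the demo.

```python
import math

def proj_quadratic(z, alpha, radius=5.0):
    # Pi_X^psi(z, alpha) = argmin_{|x| <= radius} { z*x + x^2 / (2*alpha) }:
    # the unconstrained minimizer is -alpha*z, then clip onto [-radius, radius].
    return max(-radius, min(radius, -alpha * z))

def dual_averaging(subgrad, T, R=5.0, L=1.0):
    # Updates (2)-(3) with psi(x) = x^2/2 and stepsize alpha(t) = R/(L*sqrt(t)).
    z, x, avg = 0.0, 0.0, 0.0
    for t in range(1, T + 1):
        g = subgrad(x)
        z = z - g                          # z(t+1) = z(t) - g(t)
        alpha = R / (L * math.sqrt(t))
        x = proj_quadratic(-z, alpha)      # x(t+1) = Pi(-z(t+1), alpha(t))
        avg += (x - avg) / t               # running average of the iterates
    return avg

def subgrad(x):
    # A subgradient of f(x) = |x - 1|, whose minimizer over X is x* = 1.
    if x > 1.0:
        return 1.0
    if x < 1.0:
        return -1.0
    return 0.0

x_hat = dual_averaging(subgrad, T=5000)
```

Note that it is the running average of the iterates, not the last iterate, that the analysis below bounds; the same convention reappears as x̂i(T) in the distributed algorithm.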
The algorithm is similar to the follow the perturbed/regularized leader algorithms developed in the context of online learning [12], though in this form the algorithm seems to be originally due to Nesterov [20]. In Section 4, we relate the above procedure to the distributed algorithm we now describe.\n\nDistributed dual averaging: Here we consider a novel extension of dual averaging to the distributed setting. For all times t, each node i ∈ V maintains a pair of vectors (xi(t), zi(t)) ∈ X × R^d. At iteration t, node i computes a subgradient gi(t) ∈ ∂fi(xi(t)) of the local function fi and receives {zj(t), j ∈ N(i)} from its neighbors. Its update of the current estimate xi(t) is based on a weighted average of these parameters. To model the process, let P ∈ R^{n×n} be a doubly stochastic symmetric matrix with Pij > 0 only if (i, j) ∈ E when i ≠ j. Thus Σ_{j=1}^n Pij = Σ_{j ∈ N(i)} Pij = 1 for all i ∈ V and Σ_{i=1}^n Pij = Σ_{i ∈ N(j)} Pij = 1 for all j ∈ V. Given a non-increasing sequence {α(t)}_{t=0}^∞ of positive stepsizes, each node i ∈ V updates\n\n  zi(t + 1) = Σ_{j ∈ N(i)} Pji zj(t) − gi(t) and xi(t + 1) = Π_X^ψ(−zi(t + 1), α(t)),   (4)\n\nwhere the projection Π_X^ψ was defined in (3). In words, node i computes the new dual parameter zi(t + 1) from a weighted average of its own subgradient gi(t) and the parameters {zj(t), j ∈ N(i)} in its neighborhood; it then computes the local iterate xi(t + 1) by a proximal projection. We show convergence of the local sequence {xi(t)}_{t=1}^∞ to an optimum of the global objective via the local average x̂i(T) = (1/T) Σ_{t=1}^T xi(t), which can evidently be computed in a decentralized manner.\n\n3 Main results and consequences\n\nWe will now state the main results of this paper and illustrate some of their consequences. 
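A minimal sketch of the distributed update (4) follows (our own illustration; the cycle topology, the absolute-value losses f_i(x) = |x − θ_i|, and the parameter choices are assumptions for the demo, not from the paper). On a cycle every node has degree 2, so the mixing weights are uniform 1/3 on a node and its two neighbors; each node mixes its neighbors' dual variables, subtracts its local subgradient, and applies the proximal projection with ψ(x) = x²/2.

```python
import math

def ddavg_cycle(thetas, T, R=5.0, L=1.0):
    # Distributed dual averaging, update (4), on a single cycle of n nodes.
    # Node i holds f_i(x) = |x - thetas[i]|; the global objective is the average,
    # minimized at the median of thetas. X = [-R, R], psi(x) = x^2/2.
    n = len(thetas)
    z = [0.0] * n
    x = [0.0] * n
    xbar = [0.0] * n  # running local averages \hat{x}_i(T)
    for t in range(1, T + 1):
        # Mix dual variables with uniform 1/3 weights (cycle: degree 2).
        mixed = [(z[(i - 1) % n] + z[i] + z[(i + 1) % n]) / 3.0
                 for i in range(n)]
        alpha = R / (L * math.sqrt(t))
        for i in range(n):
            g = 1.0 if x[i] > thetas[i] else (-1.0 if x[i] < thetas[i] else 0.0)
            z[i] = mixed[i] - g                     # z_i(t+1) from (4)
            x[i] = max(-R, min(R, alpha * z[i]))    # Pi(-z_i(t+1), alpha(t))
            xbar[i] += (x[i] - xbar[i]) / t
    return xbar
```

For example, with thetas = [0, 1, 2, 3, 4] the global minimizer is the median 2, and after a few thousand iterations every node's local average x̂_i(T) should sit near 2, illustrating that all nodes, not just one, track the optimum.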
We give the proofs and a deeper investigation of related corollaries at length in the sections that follow.\n\nConvergence of distributed dual averaging: We start with a result on the convergence of the distributed dual averaging algorithm that provides a decomposition of the error into an optimization term and the cost associated with network communication. In order to state this theorem, we define the averaged dual variable z̄(t) := (1/n) Σ_{i=1}^n zi(t), and we recall the local time-average x̂i(T).\n\nTheorem 1 (Basic convergence result). Given sequences {xi(t)}_{t=0}^∞ and {zi(t)}_{t=0}^∞ generated by the updates (4) with step size sequence {α(t)}_{t=0}^∞, for each node i ∈ V and any x* ∈ X, we have\n\n  f(x̂i(T)) − f(x*) ≤ (1/(T α(T))) ψ(x*) + (L^2/(2T)) Σ_{t=1}^T α(t − 1) + (3L/T) Σ_{t=1}^T α(t) max_{j=1,...,n} ‖z̄(t) − zj(t)‖_*.\n\nTheorem 1 guarantees that after T steps of the algorithm, every node i ∈ V has access to a locally defined quantity x̂i(T) such that the difference f(x̂i(T)) − f(x*) is upper bounded by a sum of three terms. The first two terms in the upper bound in the theorem are optimization error terms that are common to subgradient algorithms. The third term is the penalty incurred due to having different estimates at different nodes in the network, and it measures the deviation of each node's estimate of the average gradient from the true average gradient. Thus, roughly, Theorem 1 ensures that as long as the bound on the deviation ‖z̄(t) − zi(t)‖_* is tight enough, for appropriately chosen α(t) (say α(t) ≈ 1/√t), the error of x̂i(T) is small uniformly across all nodes i ∈ V.\n\nConvergence rates and network topology: We now turn to investigation of the effects of network topology on convergence rates. In this section,[1] we assume that the network topology is static and that communication occurs via a fixed doubly stochastic weight matrix P at every round. Since P is symmetric and stochastic, it has largest singular value σ1(P) = 1. As the following result shows, the convergence of our algorithm is controlled by the spectral gap γ(P) := 1 − σ2(P) of P.\n\nTheorem 2 (Rates based on spectral gap). Under the conditions and notation of Theorem 1, suppose moreover that ψ(x*) ≤ R^2. With step size choice α(t) = R √(1 − σ2(P)) / (4L√t), we have\n\n  f(x̂i(T)) − f(x*) ≤ 8 (RL/√T) · log(T√n) / √(1 − σ2(P)) for all i ∈ V.\n\n[1] We can weaken these conditions; see the long version of this paper for extensions to random P [4].\n\nFigure 1. (a) A 3-connected cycle. (b) 1-connected two-dimensional grid with non-toroidal boundary conditions. (c) A random geometric graph. (d) A random 3-regular expander graph.\n\nThis theorem establishes a tight connection between the convergence rate of distributed subgradient methods and the spectral properties of the underlying network. The inverse dependence on the spectral gap 1 − σ2(P) is quite natural, since it is well-known to determine the rates of mixing in random walks on graphs [14], and the propagation of information in our algorithm is integrally tied to the random walk on the underlying graph with transition probabilities specified by P. Johansson et al. [11] establish rates for their Markov incremental gradient method (MIGD) of √(n Γii / T), where Γ = (I − P + 11ᵀ/n)^{−1}; performing an eigendecomposition of the Γ matrix shows that √(n Γii) is always lower bounded by 1/√(1 − σ2(P)), the network scaling term in our bound in Theorem 2.\n\nUsing Theorem 2, one can derive explicit convergence rates for several classes of interesting networks, and Figure 1 illustrates four graph topologies of interest. As a first example, the k-connected cycle in panel (a) is formed by placing n nodes on a circle and connecting each node to its k neighbors on the right and left. The grid (panel (b)) is obtained by connecting nodes to their k nearest neighbors in axis-aligned directions. In panel (c), we show a random geometric graph, constructed by placing nodes uniformly at random in [0, 1]^2 and connecting any two nodes separated by a distance less than some radius r > 0. These graphs are often used to model the connectivity patterns of distributed devices such as wireless sensor motes [7]. Finally, panel (d) shows an instance of a bounded degree expander, which belongs to a special class of sparse graphs that have very good mixing properties [3]. For many random graph models, a typical sample is an expander with high probability (e.g. random degree regular graphs [5]). In addition, there are several deterministic constructions of expanders that are degree regular (see Section 6.3 of Chung [3] for further details).\n\nIn order to state explicit convergence rates, we need to specify a particular choice of the matrix P that respects the graph structure. Let A ∈ R^{n×n} be the symmetric adjacency matrix of the undirected graph G, satisfying Aij = 1 when (i, j) ∈ E and Aij = 0 otherwise. For each node i ∈ V, let δi = |N(i)| = Σ_{j=1}^n Aij denote the degree of node i, and define the diagonal matrix D = diag{δ1, . . . , δn}. Letting δmax = max_{i ∈ V} δi denote the maximum degree, we define\n\n  Pn(G) := I − (1/(δmax + 1)) (D − A),   (5)\n\nwhich is symmetric and doubly stochastic by construction. The following result summarizes our conclusions for the choice (5) of stochastic matrix for different network topologies. We state the results in terms of optimization error achieved after T iterations and the number of iterations TG(ε; n) required to achieve error ε for network type G with n nodes. (These are equivalent statements.)\n\nCorollary 1. Under the conditions of Theorem 2, using P = Pn(G) gives the following rates.\n\n(a) k-connected paths and cycles: f(x̂i(T)) − f(x*) = O( (RL/√T) · n log(Tn)/k ), T(ε; n) = Õ(n^2/ε^2).\n\n(b) k-connected √n × √n grids: f(x̂i(T)) − f(x*) = O( (RL/√T) · √n log(Tn)/k ), T(ε; n) = Õ(n/ε^2).\n\n(c) Random geometric graphs with connectivity radius r = Ω(√(log^{1+ε} n / n)) for any ε > 0: f(x̂i(T)) − f(x*) = O( (RL/√T) · √(n / log n) log(Tn) ) with high probability, T(ε; n) = Õ(n/ε^2).\n\n(d) Expanders with bounded ratio of minimum to maximum node degree: f(x̂i(T)) − f(x*) = O( (RL/√T) · log(Tn) ), T(ε; n) = Õ(1/ε^2).\n\nBy comparison, the results in the paper [11] give similar bounds for grids and cycles, but for d-dimensional grids we have T(ε; n) = O(n^{2/d}/ε^2) while MIGD achieves T(ε; n) = O(n/ε^2); for expanders and the complete graph MIGD achieves T(ε; n) = O(n/ε^2). We provide the proof of Corollary 1 in Appendix A. 
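The matrix in (5) is straightforward to construct and check numerically. The sketch below (our own check, not from the paper) builds Pn(G) for a single cycle, where each node has degree 2 and the off-diagonal weights are 1/3, and estimates σ₂(P) by power iteration after deflating the all-ones eigenvector. For the cycle the spectral gap is (2/3)(1 − cos(2π/n)) = Θ(1/n²), which is the source of the Õ(n²/ε²) rate in part (a) of Corollary 1.

```python
import math
import random

def cycle_matrix(n):
    # P_n(G) = I - (D - A)/(delta_max + 1) from equation (5), for a single cycle:
    # every node has degree 2, so each node keeps weight 1/3 and gives 1/3
    # to each of its two neighbors.
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        P[i][i] = 1.0 / 3.0
        P[i][(i - 1) % n] = 1.0 / 3.0
        P[i][(i + 1) % n] = 1.0 / 3.0
    return P

def second_eigenvalue(P, iters=1000):
    # Power iteration orthogonal to the all-ones eigenvector (eigenvalue 1).
    # Valid here because P is symmetric and its most negative eigenvalue is
    # smaller in magnitude than lambda_2.
    n = len(P)
    rng = random.Random(0)
    v = [rng.random() for _ in range(n)]
    for _ in range(iters):
        mean = sum(v) / n
        v = [vi - mean for vi in v]  # deflate the top eigenvector
        w = [sum(P[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        v = [wi / norm for wi in w]
    # Rayleigh quotient of the converged unit vector.
    return sum(v[i] * sum(P[i][j] * v[j] for j in range(n)) for i in range(n))
```

For the cycle the answer is known in closed form (σ₂ = 1/3 + (2/3) cos(2π/n)), so the power-iteration estimate can be validated exactly, and doubling n should shrink the gap by roughly a factor of four.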
Up to logarithmic factors, the optimization term in the convergence rate is always of the order RL/√T, while the remaining terms vary depending on the network topology. In general, Theorem 2 implies that at most TG(ε; n) = O( (1/ε^2) · 1/(1 − σ2(Pn(G))) ) iterations are required to achieve an ε-accurate solution when using the matrix Pn(G) defined in (5). It is interesting to ask whether this upper bound is actually tight. On one hand, it is known that even for centralized optimization algorithms, any subgradient method requires at least Ω(1/ε^2) iterations to achieve ε-accuracy [19], so that the 1/ε^2 term is unavoidable. The next proposition addresses the complementary issue, namely whether the inverse spectral gap term is unavoidable for the dual averaging algorithm. For the quadratic proximal function ψ(x) = (1/2)‖x‖_2^2, the following result establishes a lower bound on the number of iterations in terms of graph topology and network structure:\n\nProposition 1. Consider the dual averaging algorithm (4) with quadratic proximal function and communication matrix Pn(G). For any graph G with n nodes, the number of iterations TG(c; n) required to achieve a fixed accuracy c > 0 is lower bounded as TG(c; n) = Ω( 1/(1 − σ2(Pn(G))) ).\n\nThe proof of this result, given in Appendix B, involves constructing a “hard” optimization problem and lower bounding the number of iterations required for our algorithm to solve it. In conjunction with Corollary 1, Proposition 1 implies that our predicted network scaling is sharp. Indeed, in Section 5, we show that the theoretical scalings from Corollary 1—namely, quadratic, linear, and constant in network size n—are well-matched in simulations of our algorithm.\n\n4 Proof sketches\n\nSetting up the analysis: Using techniques similar to some past work [18], we establish convergence via the two sequences z̄(t) := (1/n) Σ_{i=1}^n zi(t) and y(t) := Π_X^ψ(−z̄(t), α). 
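A useful sanity check on this setup (our own illustration, not code from the paper): because P is doubly stochastic, one step of update (4) changes the average dual variable by exactly the average subgradient, z̄(t+1) = z̄(t) − (1/n) Σ_j g_j(t), no matter how the mixing redistributes the individual z_i. The snippet below verifies this identity for one random step; the cycle weights and the dimensions n = 6, d = 3 are arbitrary choices for the demo.

```python
import random

random.seed(1)
n, d = 6, 3

# A symmetric doubly stochastic matrix: uniform 1/3 weights on a cycle.
P = [[0.0] * n for _ in range(n)]
for i in range(n):
    P[i][i] = 1.0 / 3.0
    P[i][(i - 1) % n] = 1.0 / 3.0
    P[i][(i + 1) % n] = 1.0 / 3.0

# Random dual variables and subgradients for one step.
z = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
g = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]

# One step of update (4): z_i(t+1) = sum_j P_ji z_j(t) - g_i(t).
z_next = [[sum(P[j][i] * z[j][k] for j in range(n)) - g[i][k]
           for k in range(d)] for i in range(n)]

zbar = [sum(z[i][k] for i in range(n)) / n for k in range(d)]
zbar_next = [sum(z_next[i][k] for i in range(n)) / n for k in range(d)]
gbar = [sum(g[i][k] for i in range(n)) / n for k in range(d)]
```

The identity holds to machine precision because the columns of P sum to one, which is exactly the double-stochasticity used in the analysis.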
The average sum of gradients z̄(t) evolves in a very simple way: in particular, we have\n\n  z̄(t + 1) = (1/n) Σ_{i=1}^n ( Σ_{j=1}^n Pji zj(t) − gi(t) ) = z̄(t) − (1/n) Σ_{j=1}^n gj(t),   (6)\n\nwhere the second equality follows from the double-stochasticity of P. The simple evolution (6) of the averaged dual sequence allows us to avoid difficulties with the non-linearity of projection that have been challenging in earlier work. Before proceeding with the proof of Theorem 1, we state a few useful results regarding the convergence of the standard dual averaging algorithm [20].\n\nLemma 2 (Nesterov). Let {g(t)}_{t=1}^∞ ⊂ R^d be an arbitrary sequence and {x(t)}_{t=1}^∞ be defined by the updates (2). For a non-increasing sequence {α(t)}_{t=0}^∞ of positive stepsizes and any x* ∈ X,\n\n  Σ_{t=1}^T ⟨g(t), x(t) − x*⟩ ≤ (1/2) Σ_{t=1}^T α(t − 1) ‖g(t)‖_*^2 + (1/α(T)) ψ(x*).\n\nOur second lemma allows us to restrict our analysis to the sequence {y(t)}_{t=0}^∞ defined previously.\n\nLemma 3. Consider sequences {xi(t)}_{t=1}^∞, {zi(t)}_{t=0}^∞, and {y(t)}_{t=0}^∞ that evolve according to (4). Then for each i ∈ V and any x* ∈ X, we have\n\n  Σ_{t=1}^T f(xi(t)) − f(x*) ≤ Σ_{t=1}^T f(y(t)) − f(x*) + L Σ_{t=1}^T α(t) ‖z̄(t) − zi(t)‖_*.\n\nNow we give the proof of the first theorem.\n\nProof of Theorem 1: Our proof is based on analyzing the sequence {y(t)}_{t=0}^∞. For any x* ∈ X,\n\n  Σ_{t=1}^T f(y(t)) − f(x*) = Σ_{t=1}^T (1/n) Σ_{i=1}^n [fi(xi(t)) − fi(x*)] + Σ_{t=1}^T (1/n) Σ_{i=1}^n [fi(y(t)) − fi(xi(t))] ≤ Σ_{t=1}^T (1/n) Σ_{i=1}^n [fi(xi(t)) − fi(x*)] + Σ_{t=1}^T (L/n) Σ_{i=1}^n ‖y(t) − xi(t)‖,   (7)\n\nby the L-Lipschitz continuity of the fi. Letting gi(t) ∈ ∂fi(xi(t)) be a subgradient of fi at xi(t),\n\n  (1/n) Σ_{t=1}^T Σ_{i=1}^n [fi(xi(t)) − fi(x*)] ≤ (1/n) Σ_{t=1}^T Σ_{i=1}^n ⟨gi(t), y(t) − x*⟩ + (1/n) Σ_{t=1}^T Σ_{i=1}^n ⟨gi(t), xi(t) − y(t)⟩.   (8)\n\nBy definition of z̄(t) and y(t), we have y(t) = argmin_{x ∈ X} { ⟨(1/n) Σ_{s=1}^{t−1} Σ_{i=1}^n gi(s), x⟩ + (1/α(t)) ψ(x) }. Thus, we see that the first term in the decomposition (8) can be written in the same way as the bound in Lemma 2, and as a consequence, we have the bound\n\n  Σ_{t=1}^T ⟨(1/n) Σ_{i=1}^n gi(t), y(t) − x*⟩ ≤ (L^2/2) Σ_{t=1}^T α(t − 1) + (1/α(T)) ψ(x*).   (9)\n\nIt remains to control the final two terms in the bounds (7) and (8). Since ‖gi(t)‖_* ≤ L by assumption, we use the α-Lipschitz continuity of the projection Π_X^ψ(·, α) [9, Theorem X.4.2.1] to see that\n\n  ‖y(t) − xi(t)‖ = ‖Π_X^ψ(−z̄(t), α(t)) − Π_X^ψ(−zi(t), α(t))‖ ≤ α(t) ‖z̄(t) − zi(t)‖_*,\n\nso that both remaining terms are bounded by (L/n) Σ_{t=1}^T Σ_{i=1}^n α(t) ‖z̄(t) − zi(t)‖_*. Combining this bound with (7) and (9) yields the running sum bound\n\n  Σ_{t=1}^T [f(y(t)) − f(x*)] ≤ (L^2/2) Σ_{t=1}^T α(t − 1) + (1/α(T)) ψ(x*) + (2L/n) Σ_{t=1}^T Σ_{j=1}^n α(t) ‖z̄(t) − zj(t)‖_*.   (10)\n\nApplying Lemma 3 to (10) gives that Σ_{t=1}^T [f(xi(t)) − f(x*)] is upper bounded by\n\n  (L^2/2) Σ_{t=1}^T α(t − 1) + (1/α(T)) ψ(x*) + (2L/n) Σ_{t=1}^T Σ_{j=1}^n α(t) ‖z̄(t) − zj(t)‖_* + L Σ_{t=1}^T α(t) ‖z̄(t) − zi(t)‖_*.\n\nDividing both sides by T and using convexity of f yields the bound in Theorem 1.\n\nProof of Theorem 2: For this proof sketch, we adopt the following notational conventions. For an n × n matrix B, we call its singular values σ1(B) ≥ σ2(B) ≥ · · · ≥ σn(B) ≥ 0. For a real symmetric B, we use λ1(B) ≥ λ2(B) ≥ · · · ≥ λn(B) to denote the n real eigenvalues of B. We let Δn = {x ∈ R^n | x ⪰ 0, Σ_{i=1}^n xi = 1} denote the n-dimensional probability simplex. We make frequent use of the following inequality [10]: for any positive integer t = 1, 2, . . . and any x ∈ Δn,\n\n  ‖P^t x − 1/n‖_TV = (1/2) ‖P^t x − 1/n‖_1 ≤ (√n/2) ‖P^t x − 1/n‖_2 ≤ (1/2) σ2(P)^t √n.   (11)\n\nWe focus on controlling the network error term in Theorem 1, (L/n) Σ_{t=1}^T Σ_{i=1}^n α(t) ‖z̄(t) − zi(t)‖_*. Define the matrix Φ(t, s) = P^{t−s+1}, and let [Φ(t, s)]ji be entry j of column i of Φ(t, s). Then\n\n  zi(t + 1) = Σ_{j=1}^n [Φ(t, s)]ji zj(s) − Σ_{r=s+1}^t Σ_{j=1}^n [Φ(t, r)]ji gj(r − 1) − gi(t).   (12)\n\nClearly the above reduces to the standard update (4) when s = t. Since z̄(t) evolves simply as in (6), we assume w.l.o.g. that zi(0) = 0 and use (12) to see\n\n  zi(t) − z̄(t) = Σ_{s=1}^{t−1} Σ_{j=1}^n (1/n − [Φ(t − 1, s)]ji) gj(s − 1) + (1/n) Σ_{j=1}^n (gj(t − 1) − gi(t − 1)).   (13)\n\nWe use the fact that ‖gi(t)‖_* ≤ L for all i and t and (13) to see that\n\n  ‖z̄(t) − zi(t)‖_* ≤ Σ_{s=1}^{t−1} Σ_{j=1}^n ‖gj(s − 1)‖_* |1/n − [Φ(t − 1, s)]ji| + (1/n) Σ_{j=1}^n ‖gj(t − 1) − gi(t − 1)‖_* ≤ L Σ_{s=1}^{t−1} ‖[Φ(t − 1, s)]i − 1/n‖_1 + 2L.   (14)\n\nNow we break the sum in (14) into two terms separated by a cutoff point t̂. The first term consists of “throwaway” terms, that is, timesteps s for which the Markov chain with transition matrix P has not mixed, while the second consists of steps s for which ‖[Φ(t − 1, s)]i − 1/n‖_1 is small. Note that the indexing on Φ(t − 1, s) = P^{t−s} implies that for small s, Φ(t − 1, s) is close to uniform. From the inequality (11), we have ‖[Φ(t, s)]j − 1/n‖_1 ≤ √n σ2(P)^{t−s+1}. Hence, if t − s ≥ log ε^{−1} / log σ2(P)^{−1} − 1, then we are guaranteed ‖[Φ(t, s)]j − 1/n‖_1 ≤ √n ε. Thus, by setting ε^{−1} = T√n, for t − s + 1 ≥ log(T√n) / log σ2(P)^{−1}, we have ‖[Φ(t, s)]j − 1/n‖_1 ≤ 1/T. For larger s, we simply have ‖[Φ(t, s)]j − 1/n‖_1 ≤ 2. The above suggests that we split the sum at t̂ = log(T√n) / log σ2(P)^{−1}: the t̂ most recent (unmixed) terms each contribute at most 2, while the remaining (at most T) terms each contribute at most 1/T, so that\n\n  ‖z̄(t) − zi(t)‖_* ≤ L Σ_{s=t−t̂}^{t−1} ‖Φ(t − 1, s) ei − 1/n‖_1 + L Σ_{s=1}^{t−1−t̂} ‖Φ(t − 1, s) ei − 1/n‖_1 + 2L ≤ 2L log(T√n) / log σ2(P)^{−1} + 3L ≤ 2L log(T√n) / (1 − σ2(P)) + 3L.   (15)\n\nThe last inequality follows from the concavity of log(·), since log σ2(P)^{−1} ≥ 1 − σ2(P). Combining (15) with the running sum bound (10) in the proof of the basic theorem, Theorem 1, we find that for x* ∈ X,\n\n  Σ_{t=1}^T f(y(t)) − f(x*) ≤ (L^2/2) Σ_{t=1}^T α(t − 1) + (1/α(T)) ψ(x*) + (4L^2 log(T√n) / (1 − σ2(P))) Σ_{t=1}^T α(t) + 6L^2 Σ_{t=1}^T α(t).\n\nAppealing to Lemma 3 allows us to obtain the same result on the sequence xi(t) with slightly worse constants. Since Σ_{t=1}^T t^{−1/2} ≤ 2√T − 1, using the assumption that ψ(x*) ≤ R^2, bounding f(x̂i(T)) ≤ (1/T) Σ_{t=1}^T f(xi(t)), and setting α(t) as in the theorem statement completes the proof.\n\n5 Simulations\n\nIn this section, we report experimental results on the network scaling behavior of the distributed dual averaging algorithm as a function of the graph structure and number of processors n. 
These results illustrate the excellent agreement of the empirical behavior with our theoretical predictions. For all experiments reported here, we consider distributed minimization of a sum of hinge losses. We solve a synthetic classification problem, in which we are given n pairs of the form (ai, yi) ∈ R^d × {−1, +1}, where ai ∈ R^d corresponds to a feature vector and yi ∈ {−1, +1} is the associated label. Given the shorthand notation [c]_+ := max{0, c}, the hinge loss associated with a linear classifier based on x is given by fi(x) = [1 − yi ⟨ai, x⟩]_+. The global objective is given by the sum f(x) := (1/n) Σ_{i=1}^n [1 − yi ⟨ai, x⟩]_+. Setting L = max_i ‖ai‖_2, we note that f is L-Lipschitz and non-smooth at any point with ⟨ai, x⟩ = yi. As is common, we impose a quadratic regularization, choosing X = {x ∈ R^d | ‖x‖_2 ≤ 5}. Then for a given graph size n, we form a random instance of this SVM classification problem. Although this is a specific ensemble of problems, we have observed qualitatively similar behavior for other problem classes. In all cases, we use the optimal setting of the step size α specified in Theorem 2 and Corollary 1.\n\nFigure 2. Plot of the function error versus the number of iterations for a grid graph. Each curve corresponds to a grid with a different number of nodes (n ∈ {225, 400, 625}). As expected, larger graphs require more iterations to reach a pre-specified tolerance ε > 0, as defined by the iteration number T(ε; n). The network scaling problem is to determine how T(ε; n) scales as a function of n.\n\nFigure 3. Each plot shows the number of iterations required to reach a fixed accuracy ε (vertical axis) versus the network size n (horizontal axis). Panels show the same plot for different graph topologies: (a) single cycle; (b) two-dimensional grid; and (c) bounded degree expander.\n\nFigure 2 provides plots of the function error max_i [f(x̂i(T)) − f(x*)] versus the number of iterations for grid graphs with a varying number of nodes n ∈ {225, 400, 625}. In addition to demonstrating convergence, these plots also show how the convergence time scales as a function of the graph size. We also experimented with the algorithm and stepsize suggested by previous analyses [21]; the resulting stepsize is so small that the method effectively jams and makes no progress.\n\nIn Figure 3, we compare the theoretical predictions of Corollary 1 with the actual behavior of dual subgradient averaging. Each panel shows the function TG(ε; n) versus the graph size n for the fixed value ε = 0.1; the three different panels correspond to different graph types: cycles (a), grids (b) and expanders (c). 
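A toy version of this experiment can be reproduced in a few lines (our own sketch, not the paper's code; the cycle topology, n = 20 nodes, d = 2 features, labels generated from a hypothetical direction w, and T = 2000 iterations are all choices for the demo). Each node holds one hinge loss and runs update (4) with the quadratic proximal function, so the projection onto X = {‖x‖₂ ≤ 5} is a radial rescaling of the unconstrained minimizer.

```python
import math
import random

def run_ddavg(n=20, d=2, T=2000, radius=5.0, seed=0):
    # One hinge loss f_i(x) = [1 - y_i <a_i, x>]_+ per node, cycle topology.
    rng = random.Random(seed)
    w = [1.0, -1.0]  # hypothetical ground-truth direction (demo assumption)
    a = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
    y = [1.0 if sum(wk * ak for wk, ak in zip(w, ai)) >= 0 else -1.0 for ai in a]
    L = max(math.sqrt(sum(c * c for c in ai)) for ai in a)

    def f(x):  # global objective: average hinge loss
        return sum(max(0.0, 1.0 - yi * sum(xk * ak for xk, ak in zip(x, ai)))
                   for ai, yi in zip(a, y)) / n

    z = [[0.0] * d for _ in range(n)]
    xs = [[0.0] * d for _ in range(n)]
    xhat = [[0.0] * d for _ in range(n)]
    for t in range(1, T + 1):
        # Mix dual variables on the cycle (uniform 1/3 weights).
        mixed = [[(z[(i - 1) % n][k] + z[i][k] + z[(i + 1) % n][k]) / 3.0
                  for k in range(d)] for i in range(n)]
        alpha = radius / (L * math.sqrt(t))
        for i in range(n):
            margin = y[i] * sum(xk * ak for xk, ak in zip(xs[i], a[i]))
            g = [0.0] * d if margin >= 1.0 else [-y[i] * ak for ak in a[i]]
            z[i] = [mixed[i][k] - g[k] for k in range(d)]
            x = [alpha * zk for zk in z[i]]  # unconstrained minimizer for psi = ||x||^2/2
            nrm = math.sqrt(sum(c * c for c in x))
            if nrm > radius:                 # project onto the ball X
                x = [c * radius / nrm for c in x]
            xs[i] = x
            xhat[i] = [xhat[i][k] + (x[k] - xhat[i][k]) / t for k in range(d)]
    return f, xhat

f, xhat = run_ddavg()
errs = [f(xi) for xi in xhat]
```

Since f(0) = 1 for hinge losses, every node's time-averaged iterate should achieve substantially smaller objective value after the run, and the per-node values should be close to one another, reflecting the consensus behavior the theory predicts.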
In the panels, each point on the solid blue curve is the average of 20 trials, and the bars show standard errors. For comparison, the dotted black line shows the theoretical prediction. Note that the agreement between the empirical behavior and theoretical predictions is excellent in all cases. In particular, panel (a) exhibits the quadratic scaling predicted for the cycle, panel (b) exhibits the linear scaling expected for the grid, and panel (c) shows that expander graphs have the desirable property of constant network scaling.

6 Conclusions

In this paper, we have developed and analyzed an efficient algorithm for distributed optimization based on dual averaging of subgradients. In addition to establishing convergence, we provided a careful analysis of the algorithm's network scaling. Our results show an inverse scaling in the spectral gap of the graph, and we showed via a matching lower bound that this prediction is tight in general. We have implemented our method, and our simulations show that these theoretical predictions provide a very accurate characterization of its behavior. In the extended version of this paper [4], we also show that it is possible to extend our algorithm and analysis to the cases in which communication is random rather than fixed, in which the algorithm receives stochastic subgradient information, and to minimization of composite regularized objectives of the form f(x) + ϕ(x).

Acknowledgements: JCD was supported by an NDSEG fellowship and Google. AA was supported by a Microsoft Research Fellowship. In addition, AA was partially supported by NSF grants DMS-0707060 and DMS-0830410. MJW and AA were partially supported by AFOSR-09NL184.

References

[1] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1989.
[2] S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah. Randomized gossip algorithms. IEEE Transactions on Information Theory, 52(6):2508–2530, 2006.
[3] F. R. K. Chung. Spectral Graph Theory. AMS, 1998.
[4] J. Duchi, A. Agarwal, and M. Wainwright. Dual averaging for distributed optimization: convergence analysis and network scaling. URL http://arxiv.org/abs/1005.2012, 2010.
[5] J. Friedman, J. Kahn, and E. Szemerédi. On the second eigenvalue of random regular graphs. In Proceedings of the Twenty-First Annual ACM Symposium on Theory of Computing, pages 587–598, New York, NY, USA, 1989. ACM.
[6] R. Gray. Toeplitz and circulant matrices: A review. Foundations and Trends in Communications and Information Theory, 2(3):155–239, 2006.
[7] P. Gupta and P. R. Kumar. The capacity of wireless networks. IEEE Transactions on Information Theory, 46(2):388–404, 2000.
[8] J. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms I. Springer, 1996.
[9] J. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms II. Springer, 1996.
[10] R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, 1985.
[11] B. Johansson, M. Rabi, and M. Johansson. A randomized incremental subgradient method for distributed optimization in networked systems. SIAM Journal on Optimization, 20(3):1157–1170, 2009.
[12] A. Kalai and S. Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291–307, 2005.
[13] V. Lesser, C. Ortiz, and M. Tambe, editors. Distributed Sensor Networks: A Multiagent Perspective, volume 9. Kluwer Academic Publishers, May 2003.
[14] D. Levin, Y. Peres, and E. Wilmer. Markov Chains and Mixing Times. American Mathematical Society, 2008.
[15] I. Lobel and A. Ozdaglar. Distributed subgradient methods over random networks. Technical Report 2800, MIT LIDS, 2008.
[16] R. McDonald, K. Hall, and G. Mann. Distributed training strategies for the structured perceptron. In North American Chapter of the Association for Computational Linguistics (NAACL), 2010.
[17] A. Nedic and D. P. Bertsekas. Incremental subgradient methods for nondifferentiable optimization. SIAM Journal on Optimization, 12(1):109–138, 2001.
[18] A. Nedic and A. Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54:48–61, 2009.
[19] A. Nemirovski and D. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley, New York, 1983.
[20] Y. Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming A, 120(1):261–283, 2009.
[21] S. Sundhar Ram, A. Nedic, and V. V. Veeravalli. Distributed subgradient projection algorithm for convex optimization. In IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 3653–3656, 2009.
[22] J. Tsitsiklis. Problems in decentralized decision making and computation. PhD thesis, Massachusetts Institute of Technology, 1984.
[23] U. von Luxburg, A. Radl, and M. Hein. Hitting times, commute distances, and the spectral gap for large random geometric graphs. URL http://arxiv.org/abs/1003.1266, 2010.
[24] L. Xiao, S. Boyd, and S. J. Kim. Distributed average consensus with least-mean-square deviation. Journal of Parallel and Distributed Computing, 67(1):33–46, 2007.