{"title": "Distributed Delayed Stochastic Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 873, "page_last": 881, "abstract": "We analyze the convergence of gradient-based optimization algorithms whose updates depend on delayed stochastic gradient information. The main application of our results is to the development of distributed minimization algorithms where a master node performs parameter updates while worker nodes compute stochastic gradients based on local information in parallel, which may give rise to delays due to asynchrony. Our main contribution is to show that for smooth stochastic problems, the delays are asymptotically negligible. In application to distributed optimization, we show $n$-node architectures whose optimization error in stochastic problems---in spite of asynchronous delays---scales asymptotically as $\\order(1 / \\sqrt{nT})$, which is known to be optimal even in the absence of delays.", "full_text": "Distributed Delayed Stochastic Optimization\n\nAlekh Agarwal\n\nJohn C. Duchi\n\nDepartment of Electrical Engineering and Computer Sciences\n\nUniversity of California, Berkeley\n\nBerkeley, CA 94720\n\n{alekh,jduchi}@eecs.berkeley.edu\n\nAbstract\n\nWe analyze the convergence of gradient-based optimization algorithms\nwhose updates depend on delayed stochastic gradient information. The\nmain application of our results is to the development of distributed mini-\nmization algorithms where a master node performs parameter updates while\nworker nodes compute stochastic gradients based on local information in\nparallel, which may give rise to delays due to asynchrony. 
Our main contribution is to show that for smooth stochastic problems, the delays are asymptotically negligible. In application to distributed optimization, we show n-node architectures whose optimization error in stochastic problems, in spite of asynchronous delays, scales asymptotically as O(1/√(nT)), which is known to be optimal even in the absence of delays.

1 Introduction

We focus on stochastic convex optimization problems of the form

    minimize_{x ∈ X} f(x)   for   f(x) := E_P[F(x; ξ)] = ∫_Ξ F(x; ξ) dP(ξ),    (1)

where X ⊆ R^d is a closed convex set, P is a probability distribution over Ξ, and F(·; ξ) is convex for all ξ ∈ Ξ, so that f is convex. Classical stochastic gradient algorithms [18, 16] iteratively update a parameter x(t) ∈ X by sampling ξ ∼ P, computing g(t) = ∇F(x(t); ξ), and performing the update x(t + 1) = Π_X(x(t) − α(t)g(t)), where Π_X denotes projection onto the set X and α(t) ∈ R is a stepsize. In this paper, we analyze asynchronous gradient methods, where instead of receiving current information g(t), the procedure receives out-of-date gradients g(t − τ(t)) = ∇F(x(t − τ(t)), ξ), where τ(t) is the (potentially random) delay at time t. The central contribution of this paper is to develop algorithms that, under natural assumptions about the functions F in the objective (1), achieve asymptotically optimal convergence rates for stochastic convex optimization in spite of delays.

Our model of delayed gradient information is particularly relevant in distributed optimization scenarios, where a master maintains the parameters x while workers compute stochastic gradients of the objective (1) using a local subset of the data. 
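To make the delayed update concrete, here is a minimal sketch (our own illustration, not the paper's experimental code) of projected stochastic gradient descent where the gradient used at time t is τ steps stale, x(t + 1) = Π_X(x(t) − α(t) g(t − τ)), run on a hypothetical toy objective E[(x − ξ)²] with an l2-ball constraint:

```python
import numpy as np

def project_l2_ball(x, radius):
    """Euclidean projection onto {x : ||x||_2 <= radius}."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else (radius / norm) * x

def delayed_sgd(grad_sample, x0, radius, T, tau, step=lambda t: 1.0 / np.sqrt(t)):
    """Projected SGD with delayed gradients:
    x(t+1) = Pi_X(x(t) - alpha(t) * g(t - tau))."""
    history = [x0]                                   # x(0), x(1), ...
    x = x0
    for t in range(1, T + 1):
        x_stale = history[max(t - 1 - tau, 0)]       # parameter from tau rounds back
        g = grad_sample(x_stale)                     # stochastic gradient at stale point
        x = project_l2_ball(x - step(t) * g, radius)
        history.append(x)
    return x

# toy problem: minimize E[(x - xi)^2] with xi ~ N(1, 1); the optimum is x* = 1
rng = np.random.default_rng(0)
grad = lambda x: 2.0 * (x - rng.normal(1.0, 1.0, size=x.shape))
x_final = delayed_sgd(grad, np.zeros(1), radius=5.0, T=2000, tau=5)
```

Despite the 5-step delay, the iterate settles near the optimum once the stepsize has decayed, which previews the paper's message that delays are asymptotically negligible for smooth problems.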
Master-worker architectures are natural for distributed computation, and other researchers have considered models similar to those in this paper [12, 10]. By allowing delayed and asynchronous updates, we can avoid synchronization issues that commonly handicap distributed systems.

Distributed optimization has been studied for several decades, tracing back at least to seminal work of Bertsekas and Tsitsiklis ([3, 19, 4]) on asynchronous computation and minimization of smooth functions where the parameter vector is distributed. More recent work has studied problems in which each processor or node i in a network has a local function f_i, and the goal is to minimize the sum f(x) = (1/n) Σ_{i=1}^n f_i(x) [12, 13, 17, 7]. Our work is closest to Nedić et al.'s asynchronous incremental subgradient method [12], who analyze gradient projection steps taken using out-of-date gradients; see Figure 1 for an illustration.

Figure 1: Cyclic delayed update architecture. Workers compute gradients in parallel, passing out-of-date (stochastic) gradients g_i(t − τ) = ∇f_i(x(t − τ)) to the master. The master responds with current parameters. The diagram shows parameters and gradients communicated between rounds t and t + n − 1.

Nedić et al. prove that the procedure converges, and a slight extension of their results shows that the optimization error of the procedure after T iterations is at most O(√(τ/T)), τ being the delay in gradients. Without delay, a centralized stochastic gradient algorithm attains convergence rate O(1/√T). All the approaches mentioned above give slower convergence than this centralized rate in distributed settings, paying a penalty for data being split across a network; as Dekel et al. 
[5] note, one would expect that parallel computation actually speeds convergence. Langford et al. [10] also study asynchronous methods in the setup of stochastic optimization and attempt to remove the penalty for the delayed procedure under an additional smoothness assumption; however, their paper has a technical error (see the long version [2] for details). The main contributions of our paper are (1) to remove the delay penalty for smooth functions and (2) to demonstrate benefits in convergence rate by leveraging parallel computation even in spite of delays.

We build on results of Dekel et al. [5], who give reductions of stochastic optimization algorithms (e.g. [8, 9]) to show that for smooth objectives f, when n processors compute stochastic gradients in parallel using a common parameter x it is possible to achieve convergence rate O(1/√(Tn)). The rate holds so long as most processors remain synchronized for most of the time [6]. We show similar results, but we analyze the effects of asynchronous gradient updates where all the nodes in the network can suffer delays, quantifying the impact of the delays. In application to distributed optimization, we show that under different network assumptions, we achieve convergence rates ranging from O(min{n³/T, (n/T)^(2/3)} + 1/√(Tn)) to O(min{n/T, 1/T^(2/3)} + 1/√(Tn)), which is O(1/√(nT)) asymptotically in T. The time necessary to achieve an ε-optimal solution to the problem (1) is asymptotically O(1/(nε²)), a factor of n (the size of the network) better than a centralized procedure in spite of delay. Proofs of our results can be found in the long version of this paper [2].

Notation   We denote a general norm by ‖·‖, and its associated dual norm ‖·‖* is defined as ‖z‖* := sup_{x : ‖x‖ ≤ 1} ⟨z, x⟩. The subdifferential set of a function f at a point x is ∂f(x) := {g ∈ R^d | f(y) ≥ f(x) + ⟨g, y − x⟩ for all y ∈ dom f}. A function f is G-Lipschitz w.r.t. the norm ‖·‖ on X if |f(x) − f(y)| ≤ G‖x − y‖ for all x, y ∈ X, and f is L-smooth on X if ‖∇f(x) − ∇f(y)‖* ≤ L‖x − y‖, or equivalently, f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (L/2)‖x − y‖². A convex function h is c-strongly convex with respect to a norm ‖·‖ over X if

    h(y) ≥ h(x) + ⟨g, y − x⟩ + (c/2)‖x − y‖²   for all x, y ∈ X and g ∈ ∂h(x).    (2)

2 Setup and Algorithms

To build intuition for the algorithms we analyze, we first describe the delay-free algorithm underlying our approach: the dual averaging algorithm of Nesterov [15].¹ The dual averaging algorithm is based on a strongly convex proximal function ψ(x); we assume without loss of generality that ψ(x) ≥ 0 for all x ∈ X and (by scaling) that ψ is 1-strongly convex.

¹Essentially identical results to those we present here also hold for extensions of mirror descent [14], but we omit these for lack of space.

At time t, the algorithm updates a dual vector z(t) and primal vector x(t) ∈ X using a subgradient g(t) ∈ ∂F(x(t); ξ(t)), where ξ(t) is drawn i.i.d. according to P:

    z(t + 1) = z(t) + g(t)   and   x(t + 1) = argmin_{x ∈ X} { ⟨z(t + 1), x⟩ + (1/α(t + 1)) ψ(x) }.    (3)

For the remainder of the paper, we will use the following three essentially standard assumptions [8, 9, 20] about the stochastic optimization problem (1).

Assumption I (Lipschitz Functions). For P-a.e. ξ, the function F(·; ξ) is convex. 
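For the common choice ψ(x) = (1/2)‖x‖₂² over an l2 ball, the argmin in the dual averaging update (3) has a closed form: x(t + 1) is the projection of −α(t + 1) z(t + 1) onto the ball. A minimal sketch of one step under that assumption (function names are ours):

```python
import numpy as np

def dual_averaging_step(z, g, alpha_next, radius):
    """One step of the dual averaging update (3) with the proximal function
    psi(x) = 0.5 * ||x||_2^2 over an l2 ball: the argmin reduces to
    x(t+1) = Pi_X(-alpha(t+1) * z(t+1))."""
    z = z + g                        # z(t+1) = z(t) + g(t)
    x = -alpha_next * z              # unconstrained minimizer of <z,x> + psi(x)/alpha
    norm = np.linalg.norm(x)
    if norm > radius:                # project onto the constraint ball
        x = (radius / norm) * x
    return z, x

# toy usage: minimize <c, x> over the unit ball; the minimizer is -c/||c||
c = np.array([1.0, 0.0])
z, x = np.zeros(2), np.zeros(2)
for t in range(1, 101):
    z, x = dual_averaging_step(z, c, 1.0 / np.sqrt(t + 1), radius=1.0)
```

For this linear toy objective the iterate reaches the boundary minimizer (−1, 0) after a few steps, since z(t) = t·c grows while α(t) shrinks only as 1/√t.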
Moreover, for any x ∈ X and v ∈ ∂F(x; ξ), we have E[‖v‖*²] ≤ G².

Assumption II (Smooth Functions). The expected function f has L-Lipschitz continuous gradient, and for all x ∈ X the variance bound E[‖∇f(x) − ∇F(x; ξ)‖*²] ≤ σ² holds.

Assumption III (Compactness). For all x ∈ X, ψ(x) ≤ R²/2.

Several commonly used functions satisfy the above assumptions, for example:

(i) The logistic loss: F(x; ξ) = log[1 + exp(⟨x, ξ⟩)]. The objective F satisfies Assumptions I and II so long as ‖ξ‖* has finite second moment.

(ii) Least squares: F(x; ξ) = (a − ⟨x, b⟩)², where ξ = (a, b) for a ∈ R^d and b ∈ R, satisfies Assumptions I and II if X is compact and ‖ξ‖* has finite fourth moment.

Under Assumption III, Assumptions I and II imply finite-sample convergence rates for the update (3). Define the time-averaged vector x̂(T) := (1/T) Σ_{t=1}^T x(t + 1). Under Assumption I, dual averaging satisfies E[f(x̂(T))] − f(x*) = O(RG/√T) for the stepsize choice α(t) = R/(G√t) [15, 20]. The result is sharp to constant factors [14, 1], but can be further improved using Assumption II. Building on work of Juditsky et al. [8] and Lan [9], Dekel et al. [5, Appendix A] show that the stepsize choice α(t)^(−1) = L + σ√t/R yields the convergence rate

    E[f(x̂(T))] − f(x*) = O(LR²/T + σR/√T).    (4)

Delayed Optimization Algorithms   We now turn to extending the dual averaging update (3) to the setting in which, instead of receiving a current gradient g(t) at time t, the procedure receives a gradient g(t − τ(t)), that is, a stochastic gradient of the objective (1) computed at the point x(t − τ(t)). 
Our analysis admits any sequence τ(t) of delays as long as the mapping t ↦ t − τ(t) is one-to-one and satisfies E[τ(t)²] ≤ B² < ∞. We consider the dual averaging algorithm with g(t) replaced by g(t − τ(t)):

    z(t + 1) = z(t) + g(t − τ(t))   and   x(t + 1) = argmin_{x ∈ X} { ⟨z(t + 1), x⟩ + (1/α(t + 1)) ψ(x) }.    (5)

By combining the techniques Nedić et al. [12] developed with the convergence proofs of dual averaging [15], it is possible to show that so long as E[τ(t)] ≤ B < ∞ for all t, Assumptions I and III and the stepsize choice α(t) = R/(G√(Bt)) give E[f(x̂(T))] − f(x*) = O(RG√B/√T). In the next section we show how to avoid the √B penalty.

3 Convergence rates for delayed optimization of smooth functions

We now state and discuss the implications of two results for asynchronous stochastic gradient methods. Our first convergence result is for the update rule (5), while the second averages several stochastic subgradients for every update, each with a potentially different delay.

3.1 Simple delayed optimization

Our focus in this section is to remove the √B penalty for the delayed update rule (5) using Assumption II; the penalty arises for non-smooth optimization because subgradients can vary drastically even when measured at nearby points. We show that under the smoothness condition, the errors from delay become second order: the penalty is asymptotically negligible.

Theorem 1. Let x(t) be defined by the update (5). Define α(t)^(−1) = L + η(t), where η(t) = η√t or η(t) ≡ η√T for all t. The average x̂(T) = Σ_{t=1}^T x(t + 1)/T satisfies

    E[f(x̂(T))] − f(x*) ≤ (LR² + 6τGR)/T + 2ηR²/√T + σ²/(η√T) + 4LG²(τ + 1)² log T/(η²T).

We make a few remarks about the theorem. The log T factor on the last term is not present when using the fixed stepsize η√T. Furthermore, though we omit it here for lack of space, the analysis also extends to random delays as long as E[τ(t)²] ≤ B²; see the long version [2] for details. Finally, based on Assumption II, we can set η = σ/R, which makes the rate asymptotically O(σR/√T), the same as in the delay-free case so long as τ = o(T^(1/4)).

The take-home message from Theorem 1 is thus that the penalty in convergence rate due to the delay τ(t) is asymptotically negligible. In the next section, we show the implications of this result for robust distributed stochastic optimization algorithms.

3.2 Combinations of delays

In some scenarios, including distributed settings similar to those we discuss in the next section, the procedure has access not only to a single delayed gradient but to several stochastic gradients with different delays. To abstract away the essential parts of this situation, we assume that the procedure receives n stochastic gradients g_1, ..., g_n ∈ R^d, where each has a potentially different delay τ(i). Let λ = (λ_i)_{i=1}^n be an (unspecified) vector in the probability simplex. Then the procedure performs the following updates at time t:

    z(t + 1) = z(t) + Σ_{i=1}^n λ_i g_i(t − τ(i)),   x(t + 1) = argmin_{x ∈ X} { ⟨z(t + 1), x⟩ + (1/α(t + 1)) ψ(x) }.    (6)

The next theorem builds on the proof of Theorem 1.

Theorem 2. Under Assumptions I-III, let α(t) = (L + η(t))^(−1) and η(t) = η√t or η(t) ≡ η√T for all t. The average x̂(T) = Σ_{t=1}^T x(t + 1)/T for the update sequence (6) satisfies

    E[f(x̂(T))] − f(x*) ≤ (2LR² + 6 Σ_{i=1}^n λ_i τ(i) GR)/T + 4ηR²/√T + 4 Σ_{i=1}^n λ_i LG²(τ(i) + 1)² log T/(η²T)
                          + (1/(η√T)) E‖ Σ_{i=1}^n λ_i [∇f(x(t − τ(i))) − g_i(t − τ(i))] ‖*².

We illustrate the consequences of Theorem 2 for distributed optimization in the next section.

4 Distributed Optimization

We now turn to what we see as the main purpose and application of the above results: developing robust and efficient algorithms for distributed stochastic optimization. Our main motivations here are machine learning applications where the data is so large that it cannot fit on a single computer. Examples of the form (1) include logistic or linear regression, as described respectively in Sec. 2(i) and (ii). We consider both stochastic and online/streaming scenarios for such problems. In the simplest setting, the distribution P in the objective (1) is the empirical distribution over an observed dataset, that is, f(x) = (1/N) Σ_{i=1}^N F(x; ξ_i). We divide the N samples among n workers so that each worker has an N/n-sized subset of the data. In online learning applications, the distribution P is the unknown distribution generating the data, and each worker receives a stream of independent data points ξ ∼ P. Worker i uses its subset of the data, or its stream, to compute g_i ∈ R^d, an estimate of the gradient ∇f of the global f. 
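The combined-delay update (6) can be simulated in a few lines. The sketch below (our own toy harness, not the paper's implementation) uses uniform weights λ_i = 1/n, ψ(x) = (1/2)‖x‖₂², an η(t) ∝ √t stepsize schedule, and a hypothetical noisy quadratic objective; each of the n "workers" evaluates its stochastic gradient at a parameter that is τ(i) rounds stale:

```python
import numpy as np

def delayed_combined_dual_averaging(grad_at, x0, radius, T, delays, seed=0):
    """Sketch of update (6): the master sums n stochastic gradients with
    weights lambda_i = 1/n, worker i's gradient being evaluated at the
    tau_i-stale parameter, then takes a dual averaging step with
    psi(x) = 0.5 * ||x||_2^2 over an l2 ball."""
    rng = np.random.default_rng(seed)
    n = len(delays)
    history = [np.asarray(x0, dtype=float)]
    z = np.zeros_like(history[0])
    for t in range(1, T + 1):
        g = np.zeros_like(z)
        for tau in delays:
            x_stale = history[max(t - 1 - tau, 0)]   # worker's stale parameter
            g = g + grad_at(x_stale, rng) / n        # lambda_i = 1/n
        z = z + g
        alpha = 1.0 / np.sqrt(t + 1)                 # eta(t) ~ sqrt(t) schedule
        x = -alpha * z
        norm = np.linalg.norm(x)
        if norm > radius:
            x = (radius / norm) * x
        history.append(x)
    return sum(history[1:]) / T                      # averaged iterate x_hat(T)

# toy objective f(x) = 0.5 * ||x - c||^2 with small gradient noise; optimum c
c = np.array([1.0, -1.0])
grad_noisy = lambda x, rng: (x - c) + 0.05 * rng.normal(size=x.shape)
x_hat = delayed_combined_dual_averaging(grad_noisy, np.zeros(2), radius=3.0,
                                        T=4000, delays=[0, 1, 2, 3])
```

With four workers at delays 0 through 3, the averaged iterate ends up close to the optimum, illustrating that the heterogeneous delays contribute only lower-order error for this smooth objective.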
We assume that g_i is an unbiased estimate of ∇f(x), which is satisfied, for example, in the online setting or when each worker computes the gradient g_i based on samples picked at random without replacement from its subset of the data.

The architectural assumptions we make are based off of master/worker topologies, but the convergence results in Section 3 allow us to give procedures robust to delay and asynchrony. The architectures build on the naïve scheme of having each worker simultaneously compute a stochastic gradient and send it to the master, which takes a gradient step on the averaged gradient. 

Figure 2: Master-worker averaging network. (a): parameters stored at different nodes at time t; a node at distance d from the master has the parameter x(t − d). (b): gradients computed at different nodes; a node at distance d from the master computes gradient g(t − d).

Figure 3: Communication of gradient information toward the master node at time t from node 1 at distance d from the master: node 1 sends (1/3)g_1(t − d) + (1/3)g_2(t − d − 2) + (1/3)g_3(t − d − 2) upward, storing {x(t − d), g_2(t − d − 2), g_3(t − d − 2)}, while its children at depth d + 1 store {x(t − d − 1)} and compute g_2(t − d − 1) and g_3(t − d − 1). Information stored at time t by node i appears in brackets to the right of node i.
While the n gradients are computed in parallel in the naïve scheme, accumulating and averaging n gradients at the master takes Ω(n) time, offsetting the gains of parallelization, and the procedure is non-robust to laggard workers.

Cyclic Delayed Architecture   This protocol is the delayed update algorithm mentioned in the introduction, and it computes n stochastic gradients of f(x) in parallel. Formally, worker i has parameter x(t − τ) and computes g_i(t − τ) = ∇F(x(t − τ); ξ_i(t)) ∈ R^d, where ξ_i(t) is a random variable sampled at worker i from the distribution P. The master maintains a parameter vector x ∈ X. At time t, the master receives g_i(t − τ) from some worker i, computes x(t + 1), and passes it back to worker i only. Other workers do not see x(t + 1) and continue their gradient computations on stale parameter vectors. In the simplest case, each node suffers a delay of τ = n, though our analysis applies to random delays as well. Recall Fig. 1 for a description of the process.

Locally Averaged Delayed Architecture   At a high level, the protocol we now describe combines the delayed updates of the cyclic delayed architecture with averaging techniques of previous work [13, 7]. We assume a network G = (V, E), where V is a set of n nodes (workers) and E are the edges between the nodes. We select one of the nodes as the master, which maintains the parameter vector x(t) ∈ X over time.

The algorithm works via a series of multicasting and aggregation steps on a spanning tree rooted at the master node. In the broadcast phase, the master sends its current parameter vector x(t) to its immediate neighbors. Simultaneously, every other node broadcasts its current parameter vector (which, for a depth-d node, is x(t − d)) to its children in the spanning tree. See Fig. 2(a). Every worker computes its local gradient at its new parameter (see Fig. 2(b)). 
The communication then proceeds from the leaves toward the root. The leaf nodes communicate their gradients to their parents, and each parent takes the gradients of its leaf nodes from the previous round (received at iteration t − 1) and averages them with its own gradient, passing this averaged gradient back up the tree. Again simultaneously, each node takes the averaged gradient vectors of its children from the previous rounds, averages them with its current gradient vector, and passes the result up the spanning tree. See Fig. 3. The master node receives an average of delayed gradients from the entire tree, giving rise to updates of the form (6). We note that this is similar to the MPI all-reduce operation, except our implementation is non-blocking, since we average delayed gradients with different delays at different nodes.

4.1 Convergence rates for delayed distributed minimization

We turn now to corollaries of the results from the previous sections, which show that even asynchronous distributed procedures achieve asymptotically faster rates than centralized procedures. The key is that workers can pipeline updates by computing asynchronously and in parallel, so each worker can compute a low-variance estimate of the gradient ∇f(x). We ignore the constants L, G, R, and σ, which do not depend on the characteristics of the network. We also assume that each worker i uses m independent samples ξ_i(j) ∼ P, j = 1, ..., m, to compute the stochastic gradient as g_i(t) = (1/m) Σ_{j=1}^m ∇F(x(t); ξ_i(j)). Using the cyclic protocol as in Fig. 1, Theorem 1 gives the following result.

Corollary 1. Let ψ(x) = (1/2)‖x‖₂², assume the conditions in Theorem 1, and assume that each worker uses m samples ξ ∼ P to compute the gradient it communicates to the master. Then with the choice η(t) = max{τ^(2/3) T^(1/3), √(T/m)}, the update (5) satisfies

    E[f(x̂(T))] − f(x*) = O( min{ τ^(2/3)/T^(2/3), τ²m/T } + 1/√(Tm) ).

Proof   Noting that σ² = E[‖∇f(x) − g_i(t)‖₂²] = E[‖∇f(x) − ∇F(x; ξ)‖₂²]/m = O(1/m) when workers use m independent stochastic gradient samples, the corollary is immediate.

As in Theorem 1, the corollary generalizes to random delays as long as E[τ(t)²] ≤ B² < ∞, with τ replaced by B in the result. So long as B = o(T^(1/4)), the first term in the bound is asymptotically negligible, and we achieve a convergence rate of O(1/√(Tn)) when m = O(n).

The cyclic delayed architecture has the drawback that information from a worker can take τ = O(n) time to reach the master. While the algorithm is robust to delay, the downside of the architecture is that the essentially τ²m or τ^(2/3) terms in the bounds above can be quite large. To address the large-n drawback, we turn our attention to the locally averaged architecture described by Figs. 2 and 3, where delays can be smaller, since they depend only on the height of a spanning tree in the network. As a result of the communication procedure, the master receives a convex combination of the stochastic gradients evaluated at each worker i. Specifically, the master receives gradients of the form g_λ(t) = Σ_{i=1}^n λ_i g_i(t − τ(i)) for some λ in the simplex, where τ(i) is the delay of worker i, which puts us in the setting of Theorem 2. We now make the reasonable assumption that the gradient errors ∇f(x(t)) − g_i(t) are uncorrelated across the nodes in the network.² In statistical applications, for example, each worker may own independent data or receive streaming data from independent sources. We also set ψ(x) = (1/2)‖x‖₂² and observe that

    E‖ Σ_{i=1}^n λ_i [∇f(x(t − τ(i))) − g_i(t − τ(i))] ‖₂² = Σ_{i=1}^n λ_i² E‖ ∇f(x(t − τ(i))) − g_i(t − τ(i)) ‖₂².

This gives the following corollary to Theorem 2.

Corollary 2. Set λ_i = 1/n for all i, ψ(x) = (1/2)‖x‖₂², and η(t) = σ√T/(R√n). Let τ̄ and τ²‾ denote the averages of the delays τ(i) and of the squared delays τ(i)², respectively. Under the conditions of Theorem 2,

    E[f(x̂(T)) − f(x*)] = O( LR²/T + τ̄GR/T + LG²R²n τ²‾/(σ²T) + Rσ/√(Tn) ).

Asymptotically, E[f(x̂(T))] − f(x*) = O(1/√(Tn)). In this architecture, the delay τ is bounded by the graph diameter D. Furthermore, we can use a slightly different stepsize setting, as in Corollary 1, to get an improved rate of O(min{(D/T)^(2/3), nD²/T} + 1/√(Tn)). It is also possible, but outside the scope of this extended abstract, to give fast(er) convergence rates dependent on communication costs (details can be found in the long version [2]).

4.2 Running-time comparisons

We now explicitly study the running times of the centralized stochastic gradient algorithm (3), the cyclic delayed protocol with the update (5), and the locally averaged architecture with the update (6). 
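To make the convex combination g_λ(t) received by the master concrete, here is a small helper (our own construction, not from the paper) that computes, for a spanning tree given as child lists, the weight and extra staleness with which each node's gradient reaches the top, assuming each node uniformly averages its own current gradient with its children's previous-round messages as in Fig. 3 (so each level adds two rounds of staleness):

```python
def master_combination(children, node, weight=1.0, depth=0):
    """Return {node: (lambda_weight, extra_staleness)} for the subtree rooted
    at `node`, under the uniform local-averaging rule: a node with c children
    gives weight 1/(1 + c) to its own gradient and to each child's message,
    and a node k levels down is 2k rounds staler."""
    kids = children.get(node, [])
    share = weight / (1 + len(kids))
    out = {node: (share, 2 * depth)}
    for ch in kids:
        out.update(master_combination(children, ch, share, depth + 1))
    return out

# the three-node example of Fig. 3: node 1 with leaf children 2 and 3
w = master_combination({1: [2, 3], 2: [], 3: []}, node=1)
```

This reproduces the combination shown in Fig. 3: node 1 contributes weight 1/3 with no extra staleness, and each leaf contributes weight 1/3 with two extra rounds of staleness; by induction the weights always sum to one, so the master indeed receives a convex combination of the form (6).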
To make comparisons more cleanly, we avoid constants, assuming that the bound σ² on E‖∇f(x) − ∇F(x; ξ)‖² is 1 and that sampling ξ ∼ P and evaluating ∇F(x; ξ) requires unit time. It is also clear that if we receive m uncorrelated samples of ξ, the variance satisfies E‖∇f(x) − (1/m) Σ_{j=1}^m ∇F(x; ξ_j)‖² ≤ 1/m.

²Similar results continue to hold under weak correlation.

Centralized (3):  E[f(x̂)] − f(x*) = O( √(1/T) )
Cyclic (5):       E[f(x̂)] − f(x*) = O( min{ n^(2/3)/T^(2/3), n³/T } + 1/√(Tn) )
Local (6):        E[f(x̂)] − f(x*) = O( min{ D^(2/3)/T^(2/3), nD²/T } + 1/√(nT) )

Table 1: Upper bounds on optimization error after T units of time. See text for details.

Now we state our assumptions on the relative times used by each algorithm. Let T be the number of units of time allocated to each algorithm, and let the centralized, cyclic delayed, and locally averaged delayed algorithms complete T_cent, T_cycle, and T_dist iterations, respectively, in time T. It is clear that T_cent = T. We assume that the distributed methods use m_cycle and m_dist samples of ξ ∼ P to compute stochastic gradients. For concreteness, we assume that communication is of the same order as computing the gradient of one sample ∇F(x; ξ). In the cyclic setup of Sec. 3.1, it is reasonable to assume that m_cycle = Ω(n) to avoid idling of workers. For m_cycle = Ω(n), the master requires m_cycle/n units of time to receive one gradient update, so (m_cycle/n) T_cycle = T. In the local communication framework, if each node uses m_dist samples to compute a gradient, the master receives a gradient every m_dist units of time, and hence m_dist T_dist = T. We summarize our assumptions by saying 
We summarize our assumptions by saying\nthat in T units of time, each algorithm performs the following number of iterations:\n\nn\n\nTcent = T,\n\nTcycle =\n\nT n\n\nmcycle\n\n,\n\nand\n\nTdist =\n\nT\n\nmdist\n\n.\n\n(7)\n\nCombining with the bound (4) and Corollaries 1 and 2, we get the results in Table 1.\nAsymptotically in the number of units of time T , both the cyclic and locally communicating\nstochastic optimization schemes have the same convergence rate. Comparing the lower\norder terms, since D \u2264 n for any network, the locally averaged algorithm always guarantees\nbetter performance than the cyclic algorithm. For speci\ufb01c graph topologies, however, we\ncan quantify the time improvements (assuming we are in the n2/3/T 2/3 regime):\n\n\u2022 n-node cycle or path: D = n so that both methods have the same convergence rate.\n\u2022 \u221an-by-\u221an grid: D = \u221an, so the distributed method has a factor of n2/3/n1/3 =\n\u2022 Balanced trees and expander graphs: D = O(log n), so the distributed method has\n\nn1/3 improvement over the cyclic architecture.\n\na factor\u2014ignoring logarithmic terms\u2014of n2/3 improvement over cyclic.\n\n5 Numerical Results\n\nThough this paper focuses mostly on the theoretical analysis of delayed stochastic methods,\nit is important to understand their practical aspects. 
To that end, we use the cyclic delayed method (5) to solve a somewhat large logistic regression problem:

    minimize_x  f(x) = (1/N) Σ_{i=1}^N log(1 + exp(−b_i ⟨a_i, x⟩))   subject to ‖x‖₂ ≤ R.    (8)

We use the Reuters RCV1 dataset [11], which consists of N ≈ 800000 news articles, each labeled with a combination of the four labels economics, government, commerce, and medicine. In the above example, the vectors a_i ∈ {0, 1}^d, d ≈ 10⁵, are feature vectors representing the words in each article, and the labels b_i are 1 if the article is about government, −1 otherwise.

We simulate the cyclic delayed optimization algorithm (5) for the problem (8) for several choices of the number of workers n and the number of samples m computed at each worker. We summarize the results in Figure 4. We fix an ε (in this case, ε = .05), then measure the time it takes the stochastic algorithm (5) to output an x̂ such that f(x̂) ≤ inf_{x ∈ X} f(x) + ε. We perform each experiment ten times. 

Figure 4: Estimated time to compute an ε-accurate solution to the objective (8) as a function of the number of workers n. See text for details. Plot (a): convergence time assuming the costs of communication to the master and of gradient computation are the same. Plot (b): convergence time assuming the cost of communication to the master is 16 times that of gradient computation.
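As a sanity check of the objective (8), here is a small, numerically stable implementation of the logistic loss and its gradient (our own sketch; the constraint ‖x‖₂ ≤ R, the RCV1 data loading, and the simulation harness are omitted):

```python
import numpy as np

def logistic_loss_and_grad(x, A, b):
    """Objective (8): f(x) = (1/N) * sum_i log(1 + exp(-b_i <a_i, x>)).
    Rows of A are the feature vectors a_i; b has entries in {-1, +1}."""
    margins = b * (A @ x)
    # logaddexp(0, -m) = log(1 + exp(-m)), computed stably for large |m|
    loss = np.mean(np.logaddexp(0.0, -margins))
    # d/dm log(1 + exp(-m)) = -1/(1 + exp(m)), and dm_i/dx = b_i * a_i
    coeffs = -b / (1.0 + np.exp(margins))
    grad = (A.T @ coeffs) / len(b)
    return loss, grad

# tiny synthetic example (RCV1 itself is roughly 800000 x 10^5)
A = np.array([[1.0, 2.0], [-1.0, 0.5], [0.0, -1.0]])
b = np.array([1.0, -1.0, 1.0])
x0 = np.array([0.3, -0.2])
loss, grad = logistic_loss_and_grad(x0, A, b)
```

At x = 0 every margin is zero, so the loss is exactly log 2, and the gradient can be checked against central finite differences.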
The two plots differ in the amount of time C required to communicate the parameters x between the master and the workers, relative to the amount of time needed to compute the gradient on one sample in the objective (8). For the left plot in Fig. 4(a), we assume that C = 1, while in Fig. 4(b), we assume that C = 16.

For Fig. 4(a), each worker uses m = n samples to compute a stochastic gradient for the objective (8). The plotted results show that the delayed update (5) enjoys speedup (the ratio of time to ε-accuracy for an n-node system versus the centralized procedure) nearly linear in the number n of worker machines until n ≥ 15 or so. Since we use the stepsize choice η(t) ∝ √(t/n), which yields the predicted convergence rate given by Corollary 1, the n²m/T ≈ n³/T term in the convergence rate presumably becomes non-negligible for larger n. This expands on earlier experimental work with a similar method [10], which experimentally demonstrated linear speedup for small values of n but did not investigate larger network sizes.

In Fig. 4(b), we study the effects of more costly communication by assuming that communication is C = 16 times more expensive than gradient computation. As argued in the long version [2], we set the number of samples each worker computes to m = Cn = 16n and correspondingly reduce the damping stepsize to η(t) ∝ √(t/(Cn)). In the regime of more expensive communication, as our theoretical results predict, small numbers of workers still enjoy significant speedups over a centralized method, but eventually the cost of communication and delays mitigates some of the benefits of parallelization. The alternate choice of stepsize η(t) = n^(2/3) T^(1/3) gives qualitatively similar performance.

6 Conclusion and Discussion

In this paper, we have studied delayed dual averaging algorithms for stochastic optimization, showing applications of our results to distributed optimization. 
We showed that for smooth problems, we can preserve the performance benefits of parallelization over centralized stochastic optimization even when we relax synchronization requirements. Specifically, we presented methods that take advantage of distributed computational resources and are robust to node failures, communication latency, and node slowdowns. In addition, though we omit these results for brevity, it is possible to extend all of our expected convergence results to high-probability guarantees.

Acknowledgments

AA was supported by a Microsoft Research Fellowship and NSF grant CCF-1115788, and JCD was supported by the NDSEG Program and Google. We are very grateful to Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, and Lin Xiao for communicating their proof of the bound (4). We would also like to thank Yoram Singer and Dimitri Bertsekas for reading a draft of this manuscript and giving useful feedback and references.

References

[1] A. Agarwal, P. Bartlett, P. Ravikumar, and M. Wainwright. Information-theoretic lower bounds on the oracle complexity of convex optimization. In Advances in Neural Information Processing Systems 23, 2009.

[2] A. Agarwal and J. C. Duchi. Distributed delayed stochastic optimization. URL http://arxiv.org/abs/1104.5525, 2011.

[3] D. P. Bertsekas. Distributed asynchronous computation of fixed points. Mathematical Programming, 27:107-120, 1983.

[4] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Prentice-Hall, Inc., 1989.

[5] O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao. Optimal distributed online prediction using mini-batches. URL http://arxiv.org/abs/1012.1367, 2010.

[6] O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao. Robust distributed online prediction. URL http://arxiv.org/abs/1012.1370, 2010.

[7] J. Duchi, A. Agarwal, and M. Wainwright.
Dual averaging for distributed optimization: convergence analysis and network scaling. IEEE Transactions on Automatic Control, to appear, 2011.

[8] A. Juditsky, A. Nemirovski, and C. Tauvel. Solving variational inequalities with the stochastic mirror-prox algorithm. URL http://arxiv.org/abs/0809.0815, 2008.

[9] G. Lan. An optimal method for stochastic composite optimization. Mathematical Programming Series A, to appear, 2010. Online first, URL http://www.ise.ufl.edu/glan/papers/OPT_SA4.pdf.

[10] J. Langford, A. Smola, and M. Zinkevich. Slow learners are fast. In Advances in Neural Information Processing Systems 22, pages 2331-2339, 2009.

[11] D. Lewis, Y. Yang, T. Rose, and F. Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361-397, 2004.

[12] A. Nedić, D. P. Bertsekas, and V. S. Borkar. Distributed asynchronous incremental subgradient methods. In D. Butnariu, Y. Censor, and S. Reich, editors, Inherently Parallel Algorithms in Feasibility and Optimization and their Applications, volume 8 of Studies in Computational Mathematics, pages 381-407. Elsevier, 2001.

[13] A. Nedić and A. Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54:48-61, 2009.

[14] A. Nemirovski and D. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley, New York, 1983.

[15] Y. Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming A, 120(1):261-283, 2009.

[16] B. T. Polyak. Introduction to Optimization. Optimization Software, Inc., 1987.

[17] S. S. Ram, A. Nedić, and V. V. Veeravalli. Distributed stochastic subgradient projection algorithms for convex optimization. Journal of Optimization Theory and Applications, 147(3):516-545, 2010.

[18] H. Robbins and S. Monro.
A stochastic approximation method. Annals of Mathematical Statistics, 22:400-407, 1951.

[19] J. Tsitsiklis. Problems in decentralized decision making and computation. PhD thesis, Massachusetts Institute of Technology, 1984.

[20] L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11:2543-2596, 2010.