{"title": "Asynchronous stochastic convex optimization: the noise is in the noise and SGD don't care", "book": "Advances in Neural Information Processing Systems", "page_first": 1531, "page_last": 1539, "abstract": "We show that asymptotically, completely asynchronous stochastic gradient procedures achieve optimal (even to constant factors) convergence rates for the solution of convex optimization problems under nearly the same conditions required for asymptotic optimality of standard stochastic gradient procedures. Roughly, the noise inherent to the stochastic approximation scheme dominates any noise from asynchrony. We also give empirical evidence demonstrating the strong performance of asynchronous, parallel stochastic optimization schemes, demonstrating that the robustness inherent to stochastic approximation problems allows substantially faster parallel and asynchronous solution methods. In short, we show that for many stochastic approximation problems, as Freddie Mercury sings in Queen's \\emph{Bohemian Rhapsody}, ``Nothing really matters.''", "full_text": "Asynchronous stochastic convex optimization:\n\nthe noise is in the noise and SGD don\u2019t care\n\nSorathan Chaturapruek1\n\nJohn C. Duchi2\n\nChristopher R\u00b4e1\n\nDepartments of 1Computer Science, 2Electrical Engineering, and 2Statistics\n\nStanford University\nStanford, CA 94305\n\n{sorathan,jduchi,chrismre}@stanford.edu\n\nAbstract\n\nWe show that asymptotically, completely asynchronous stochastic gradient proce-\ndures achieve optimal (even to constant factors) convergence rates for the solution\nof convex optimization problems under nearly the same conditions required for\nasymptotic optimality of standard stochastic gradient procedures. Roughly, the\nnoise inherent to the stochastic approximation scheme dominates any noise from\nasynchrony. 
We also give empirical evidence demonstrating the strong performance of asynchronous, parallel stochastic optimization schemes, demonstrating that the robustness inherent to stochastic approximation problems allows substantially faster parallel and asynchronous solution methods. In short, we show that for many stochastic approximation problems, as Freddie Mercury sings in Queen's Bohemian Rhapsody, "Nothing really matters."

1 Introduction

We study a natural asynchronous stochastic gradient method for the solution of minimization problems of the form

minimize_x f(x) := E_P[F(x; W)] = ∫_Ω F(x; ω) dP(ω),   (1)

where x ↦ F(x; ω) is convex for each ω ∈ Ω, P is a probability distribution on Ω, and the vector x ∈ R^d. Stochastic gradient techniques for the solution of problem (1) have a long history in optimization, starting from the early work of Robbins and Monro [19] and continuing on through Ermoliev [7], Polyak and Juditsky [16], and Nemirovski et al. [14]. The latter two show how certain long stepsizes and averaging techniques yield more robust and asymptotically optimal optimization schemes, and we show how their results extend to practical parallel and asynchronous settings.

We consider an extension of previous stochastic gradient methods to a natural family of asynchronous gradient methods [3], where multiple processors can draw samples from the distribution P and asynchronously perform updates to a centralized (shared) decision vector x. Our iterative scheme is based on the HOGWILD! algorithm of Niu et al. [15], which is designed to asynchronously solve certain stochastic optimization problems in multi-core environments, though our analysis and iterations are different.
In particular, we study the following procedure, where each processor runs asynchronously and independently of the others, though they maintain a shared integer iteration counter k; each processor P asynchronously performs the following:

(i) Processor P reads current problem data x
(ii) Processor P draws a random sample W ∼ P, computes g = ∇F(x; W), and increments the centralized counter k
(iii) Processor P updates x ← x − αk g sequentially for each coordinate j = 1, 2, . . . , d by incrementing [x]j ← [x]j − αk[g]j, where the scalars αk are a non-increasing stepsize sequence.

We assume that scalar addition is atomic: the addition of −αk[g]j propagates eventually, but maybe out of order. Our main results show that because of the noise inherent to the sampling process for W, the errors introduced by asynchrony in iterations (i)–(iii) are asymptotically negligible: they do not matter. Even more, we can efficiently construct an x from the asynchronous process possessing optimal convergence rate and asymptotic variance. This has consequences for solving stochastic optimization problems on multi-core and multi-processor systems; we can leverage parallel computing without performing any synchronization, so that given a machine with m processors, we can read data and perform updates m times as quickly as with a single processor, and the error from reading stale information on x becomes asymptotically negligible. In Section 2, we state our main convergence theorems about the asynchronous iteration (i)–(iii) for solving problem (1). Our main result, Theorem 1, gives explicit conditions under which our results hold, and we give applications to specific stochastic optimization problems as well as a general result for asynchronous solution of operator equations.
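As a concrete illustration of the iteration (i)–(iii), the following is a minimal, hypothetical sketch (not the authors' implementation): several Python threads share a NumPy decision vector and a loosely maintained counter, performing per-coordinate updates with no locks. The least-squares objective, stepsize constants, and all variable names are our own illustrative assumptions.

```python
import threading
import numpy as np

def async_sgd(A, b, n_threads=4, alpha=0.2, beta=0.6, n_iters=20000):
    """Lock-free asynchronous SGD sketch for f(x) = E[(<a,x> - b)^2 / 2].

    Each worker repeatedly: (i) reads the shared x, (ii) samples a data
    point and computes g = grad F(x; W), incrementing the shared counter,
    (iii) writes x[j] -= alpha_k * g[j] coordinate by coordinate, unsynchronized.
    """
    d = A.shape[1]
    x = np.zeros(d)               # shared decision vector
    state = {"k": 0}              # shared iteration counter (racy on purpose)

    def worker():
        local = np.random.default_rng(threading.get_ident() % 2**32)
        for _ in range(n_iters // n_threads):
            xi = x.copy()                         # (i) possibly stale read
            i = local.integers(len(b))            # (ii) sample W ~ P
            g = (A[i] @ xi - b[i]) * A[i]
            state["k"] += 1
            step = alpha * state["k"] ** (-beta)  # alpha_k = alpha * k^{-beta}
            for j in range(d):                    # (iii) per-coordinate write
                x[j] -= step * g[j]

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads: t.start()
    for t in threads: t.join()
    return x
```

On CPython the GIL serializes individual bytecode operations, so this only simulates the interleaving of reads and writes; the relevant point is that no ordering whatsoever is enforced between readers and writers of x.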
Roughly, all we require for our (optimal) convergence results is that the Hessian of f be positive definite near x⋆ = argmin_x f(x) and that the gradients ∇f(x) be smooth.

Several researchers have provided and analyzed asynchronous algorithms for optimization. Bertsekas and Tsitsiklis [3] provide a comprehensive study both of models of asynchronous computation and analyses of asynchronous numerical algorithms. More recent work has studied asynchronous gradient procedures, though it often imposes strong conditions on gradient sparsity, conditioning of the Hessian of f, or allowable types of asynchrony; as we show, none are essential. Niu et al. [15] propose HOGWILD! and show that under sparsity and smoothness assumptions (essentially, that the gradients ∇F(x; W) have a vanishing fraction of non-zero entries, that f is strongly convex, and ∇F(x; ω) is Lipschitz for all ω), convergence guarantees similar to the synchronous case are possible; Agarwal and Duchi [1] showed under restrictive ordering assumptions that some delayed gradient calculations have negligible asymptotic effect; and Duchi et al. [4] extended Niu et al.'s results to a dual averaging algorithm that works for non-smooth, non-strongly-convex problems, so long as certain gradient sparsity assumptions hold. Researchers have also investigated parallel coordinate descent solvers; Richtárik and Takáč [18] and Liu et al. [13] show how certain "near-separability" properties of an objective function f govern the convergence rate of parallel coordinate descent methods, the latter focusing on asynchronous schemes. As we show, large-scale stochastic optimization renders many of these problem assumptions unnecessary.

In addition to theoretical results, in Section 3 we give empirical results on the power of parallelism and asynchrony in the implementation of stochastic approximation procedures.
Our experiments demonstrate two results: first, even in non-asymptotic finite-sample settings, asynchrony introduces little degradation in solution quality, regardless of data sparsity (a common assumption in previous analyses); that is, asynchronously-constructed estimates are statistically efficient. Second, we show that there is some subtlety in implementation of these procedures in real hardware; while increases in parallelism lead to concomitant linear improvements in the speed with which we compute solutions to problem (1), in some cases we require strategies to reduce hardware resource competition between processors to achieve the full benefits of asynchrony.

Notation  A sequence of random variables or vectors Xn converges in distribution to Z, denoted Xn →d Z, if E[f(Xn)] → E[f(Z)] for all bounded continuous functions f. We let Xn →p Z denote convergence in probability, meaning that lim_n P(‖Xn − Z‖ > ε) = 0 for any ε > 0. The notation N(µ, Σ) denotes the multivariate Gaussian with mean µ and covariance Σ.

2 Main results

Our main results repose on a few standard assumptions often used for the analysis of stochastic optimization procedures, which we now detail, along with a few necessary definitions. We let k denote the iteration counter used throughout the asynchronous gradient procedure. Given that we compute g = ∇F(x; W) with counter value k in the iterations (i)–(iii), we let xk denote the (possibly inconsistent) particular x used to compute g, and likewise say that g = gk, noting that the update to x is then performed using αk.
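To make the indexing and the stepsize choice αk = α k^(−β) concrete, here is a minimal serial (single-processor, hence trivially consistent) sketch of the iteration with Polyak–Ruppert iterate averaging. The quadratic objective, constants, and seed are illustrative assumptions of ours, not the authors' code.

```python
import numpy as np

def averaged_sgd(sample_grad, d, n, alpha=0.3, beta=0.6, seed=0):
    """Serial SGD x_{k+1} = x_k - alpha_k * g_k with alpha_k = alpha * k^{-beta},
    returning the averaged iterate xbar_n = (1/n) sum_k x_k."""
    rng = np.random.default_rng(seed)
    x = np.zeros(d)
    xbar = np.zeros(d)
    for k in range(1, n + 1):
        g = sample_grad(x, rng)           # noisy gradient g_k at iterate x_k
        x = x - alpha * k ** (-beta) * g
        xbar += (x - xbar) / k            # running average of the iterates
    return xbar

# Illustrative problem: F(x; (a, b)) = 0.5 * (<a, x> - b)^2, b = <a, x*> + noise
x_star = np.array([1.0, -1.0])

def grad(x, rng):
    a = rng.normal(size=2)
    b = a @ x_star + 0.1 * rng.normal()
    return (a @ x - b) * a
```

The averaged iterate, rather than the final one, is what carries the optimal asymptotic covariance in the results below.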
In addition, throughout the paper, we assume there is some finite bound M < ∞ such that no processor reads information more than M steps out of date.

2.1 Asynchronous convex optimization

We now present our main theoretical results for solving the stochastic convex problem (1), giving the necessary assumptions on f and F(·; W) for our results. Our first assumption roughly states that f has a quadratic expansion near the (unique) optimal point x⋆ and is smooth.

Assumption A. The function f has unique minimizer x⋆ and is twice continuously differentiable in the neighborhood of x⋆ with positive definite Hessian H = ∇²f(x⋆) ≻ 0, and there is a covariance matrix Σ ≻ 0 such that

E[∇F(x⋆; W)∇F(x⋆; W)⊤] = Σ.

Additionally, there exists a constant C < ∞ such that the gradients ∇F(x; W) satisfy

E[‖∇F(x; W) − ∇F(x⋆; W)‖²] ≤ C ‖x − x⋆‖²  for all x ∈ R^d.   (2)

Lastly, f has L-Lipschitz continuous gradient: ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖ for all x, y ∈ R^d.

Assumption A guarantees the uniqueness of the vector x⋆ minimizing f(x) over R^d and ensures that f is well-behaved enough for our asynchronous iteration procedure to introduce negligible noise over a non-asynchronous procedure. In addition to Assumption A, we make one of two additional assumptions. In the first case, we assume that f is strongly convex:

Assumption B. The function f is λ-strongly convex over all of R^d for some λ > 0, that is,

f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (λ/2)‖x − y‖²  for x, y ∈ R^d.   (3)

Our alternate assumption is a Lipschitz assumption on f itself, made by virtue of a second moment bound on ∇F(x; W).

Assumption B'.
There exists a constant G < ∞ such that for all x ∈ R^d,

E[‖∇F(x; W)‖²] ≤ G².   (4)

With our assumptions in place, we state our main theorem.

Theorem 1. Let the iterates xk be generated by the asynchronous process (i), (ii), (iii) with stepsize choice αk = α k^(−β), where β ∈ (1/2, 1) and α > 0. Let Assumption A and either of Assumptions B or B' hold. Then

(1/√n) Σ_{k=1}^n (xk − x⋆) →d N(0, H^(−1)ΣH^(−1)) = N(0, (∇²f(x⋆))^(−1)Σ(∇²f(x⋆))^(−1)).

Before moving to example applications of Theorem 1, we note that its convergence guarantee is generally unimprovable even by numerical constants. Indeed, for classical statistical problems, the covariance H^(−1)ΣH^(−1) is the inverse Fisher information, and by the Le Cam–Hájek local minimax theorems [9] and results on Bahadur efficiency [21, Chapter 8], this is the optimal covariance matrix, and the best possible rate is n^(−1/2). As for function values, using the delta method [e.g. 10, Theorem 1.8.12], we can show the optimal convergence rate of 1/n on function values.

Corollary 1. Let the conditions of Theorem 1 hold. Then

n ( f( (1/n) Σ_{k=1}^n xk ) − f(x⋆) ) →d (1/2) tr[H^(−1)Σ] · χ²_1,

where χ²_1 denotes a chi-squared random variable with 1 degree of freedom, and H = ∇²f(x⋆) and Σ = E[∇F(x⋆; W)∇F(x⋆; W)⊤].

2.2 Examples

We now give two classical statistical optimization problems to illustrate Theorem 1. We verify that the conditions of Assumptions A and B or B' are not overly restrictive.

Linear regression  Standard linear regression problems satisfy the conditions of Assumption B. In this case, the data ω = (a, b) ∈ R^d × R and the objective F(x; ω) = (1/2)(⟨a, x⟩ − b)². If we have moment bounds E[‖a‖₂⁴] < ∞, E[b²] < ∞ and H = E[aa⊤] ≻ 0, we have ∇²f(x⋆) = H, and the assumptions of Theorem 1 are certainly satisfied. Standard modeling assumptions yield more concrete guarantees. For example, if b = ⟨a, x⋆⟩ + ε where ε is independent mean-zero noise with E[ε²] = σ², the minimizer of f(x) = E[F(x; W)] is x⋆, we have ⟨a, x⋆⟩ − b = −ε, and

E[∇F(x⋆; W)∇F(x⋆; W)⊤] = E[(⟨a, x⋆⟩ − b)aa⊤(⟨a, x⋆⟩ − b)] = E[aa⊤ε²] = σ²E[aa⊤] = σ²H.

In particular, the asynchronous iterates satisfy

(1/√n) Σ_{k=1}^n (xk − x⋆) →d N(0, σ²H^(−1)) = N(0, σ²E[aa⊤]^(−1)),

which is the (minimax optimal) asymptotic covariance of the ordinary least squares estimate of x⋆.

Logistic regression  As long as the data has finite second moment, logistic regression problems satisfy all the conditions of Assumption B' in Theorem 1. We have ω = (a, b) ∈ R^d × {−1, 1} and instantaneous objective F(x; ω) = log(1 + exp(−b⟨a, x⟩)). For fixed ω, this function is Lipschitz continuous and has gradient and Hessian

∇F(x; ω) = −(1/(1 + exp(b⟨a, x⟩))) ba  and  ∇²F(x; ω) = (e^{b⟨a,x⟩}/(1 + e^{b⟨a,x⟩})²) aa⊤,

where ∇F(x; ω) is Lipschitz continuous as ‖∇²F(x; (a, b))‖ ≤ (1/4)‖a‖₂². So long as E[‖a‖₂²] < ∞ and E[∇²F(x⋆; W)] ≻ 0 (i.e. E[aa⊤] is positive definite), Theorem 1 applies to logistic regression.

2.3 Extension to nonlinear problems

We prove Theorem 1 by way of a more general result on finding the zeros of a residual operator R : R^d → R^d, where we only observe noisy views of R(x), and there is a unique x⋆ such that R(x⋆) = 0. Such situations arise, for example, in the solution of stochastic monotone operator problems (cf. Juditsky, Nemirovski, and Tauvel [8]). In this more general setting, we consider the following asynchronous iterative process, which extends that for the convex case outlined previously. Each processor P performs the following asynchronously and independently:

(i) Processor P reads current problem data x
(ii) Processor P receives vector g = R(x) + ξ, where ξ is a random (conditionally) mean-zero noise vector, and increments a centralized counter k
(iii) Processor P updates x ← x − αk g sequentially for each coordinate j = 1, 2, . . . , d by incrementing [x]j ← [x]j − αk[g]j.

As in the convex case, we associate vectors xk and gk with the update performed using αk, and we let ξk denote the noise vector used to construct gk. These iterates and assignment of indices imply that xk has the form

xk = −Σ_{i=1}^{k−1} αi E_{ki} gi,   (5)

where E_{ki} ∈ {0, 1}^{d×d} is a diagonal matrix whose jth diagonal entry captures whether coordinate j of the ith gradient has been incorporated into iterate xk. We define an increasing sequence of σ-fields Fk by

Fk = σ(ξ1, . . . , ξk, {E_{ij} : i ≤ k + 1, j ≤ i}),   (6)

that is, the noise variables ξk are adapted to the filtration Fk, and these σ-fields are the smallest containing both the noise and all index updates that have occurred and that will occur to compute xk+1.
Thus we have xk+1 ∈ Fk, and our mean-zero assumption on the noise ξ is

E[ξk | Fk−1] = 0.

We base our analysis on Polyak and Juditsky's study [16] of stochastic approximation procedures, so we enumerate a few more requirements—modeled on theirs—for our results on convergence of the asynchronous iterations for solving the nonlinear equality R(x⋆) = 0. We assume there is a Lyapunov function V satisfying V(x) ≥ λ‖x‖² for all x ∈ R^d, ‖∇V(x) − ∇V(y)‖ ≤ L‖x − y‖ for all x, y, that ∇V(0) = 0, and V(0) = 0. This implies

λ‖x‖² ≤ V(x) ≤ V(0) + ⟨∇V(0), x − 0⟩ + (L/2)‖x‖² = (L/2)‖x‖²   (7)

and ‖∇V(x)‖² ≤ L²‖x‖² ≤ (L²/λ)V(x). We make the following assumptions on the residual R.

Assumption C. There exists a matrix H ∈ R^{d×d} with H ≻ 0, a parameter 0 < γ ≤ 1, constant C < ∞, and ε > 0 such that if x satisfies ‖x − x⋆‖ ≤ ε,

‖R(x) − H(x − x⋆)‖ ≤ C ‖x − x⋆‖^{1+γ}.

Assumption C essentially requires that R is differentiable at x⋆ with derivative matrix H ≻ 0. We also make a few assumptions on the noise process ξ; specifically, we assume ξ implicitly depends on x ∈ R^d (so that we may write ξk = ξ(xk)), and that the following assumption holds.

Assumption D. The noise vector ξ(x) decomposes as ξ(x) = ξ(0) + ζ(x), where ξ(0) is a process satisfying E[ξk(0)ξk(0)⊤ | Fk−1] →p Σ ≻ 0 for a matrix Σ ∈ R^{d×d}, sup_k E[‖ξk(0)‖² | Fk−1] < ∞ with probability 1, and E[‖ζk(x)‖² | Fk−1] ≤ C ‖x − x⋆‖² for a constant C < ∞ and all x ∈ R^d.

As in the convex case, we make one of two additional assumptions, which should be compared with Assumptions B and B'.
The first is that R gives globally strong information about x⋆.

Assumption E (Strongly convex residuals). There exists a constant λ0 > 0 such that for all x ∈ R^d,

⟨∇V(x − x⋆), R(x)⟩ ≥ λ0 V(x − x⋆).

Alternatively, we may make an assumption on the boundedness of R, which we shall see suffices for proving our main results.

Assumption E' (Bounded residuals). There exist λ0 > 0 and ε > 0 such that ⟨∇V(x − x⋆), R(x)⟩ ≥ λ0 V(x − x⋆) whenever 0 < ‖x − x⋆‖ ≤ ε, and inf_{‖x−x⋆‖>ε} ⟨∇V(x − x⋆), R(x)⟩ > 0. In addition there exists C < ∞ such that ‖R(x)‖ ≤ C and E[‖ξk‖² | Fk−1] ≤ C² for all k and x.

With these assumptions in place, we obtain the following more general version of Theorem 1; indeed, we show that Theorem 1 is a consequence of this result.

Theorem 2. Let V be a function satisfying inequality (7), and let Assumptions C and D hold. Let the stepsizes αk = α k^(−β), where 1/(1 + γ) < β < 1. Let one of Assumptions E or E' hold. Then

(1/√n) Σ_{k=1}^n (xk − x⋆) →d N(0, H^(−1)ΣH^(−1)).

We may compare this result to Polyak and Juditsky's Theorem 2 [16], which gives identical asymptotic convergence guarantees but with somewhat weaker conditions on the function V and stepsize sequence αk. Our stronger assumptions, however, allow our result to apply even in fully asynchronous settings.

2.4 Proof sketch

We provide rigorous proofs in the long version of this paper [5], providing an abbreviated sketch here. First, to show that Theorem 1 follows from Theorem 2, we set R(x) = ∇f(x) and V(x) = (1/2)‖x‖². We can then show that Assumption A, which guarantees a second-order Taylor expansion, implies Assumption C with γ = 1 and H = ∇²f(x⋆).
Moreover, Assumption B (or B') implies Assumption E (respectively, E'), while to see that Assumption D holds, we set ξ(0) = ∇F(x⋆; W), taking Σ = E[∇F(x⋆; W)∇F(x⋆; W)⊤] and ζ(x) = ∇F(x; W) − ∇F(x⋆; W), and applying inequality (2) of Assumption A to satisfy Assumption D with the vector ζ.

The proof of Theorem 2 is somewhat more involved. Roughly, we show the asymptotic equivalence of the sequence xk from expression (5) to the easier-to-analyze sequence x̃k = −Σ_{i=1}^{k−1} αi gi. Asymptotically, we obtain E[‖xk − x̃k‖²] = O(αk²), while the iterates x̃k—in spite of their incorrect gradient calculations—are close enough to a correct stochastic gradient iterate that they possess optimal asymptotic normality properties. This "close enough" follows by virtue of the squared error bounds for ζ in Assumption D, which guarantee that ξk essentially behaves like an i.i.d. sequence asymptotically (after application of the Robbins–Siegmund martingale convergence theorem [20]), which we then average to obtain a central limit theorem.

3 Experimental results

We provide empirical results studying the performance of asynchronous stochastic approximation schemes on several simulated and real-world datasets. Our theoretical results suggest that asynchrony should introduce little degradation in solution quality, which we would like to verify; we also investigate the engineering techniques necessary to truly leverage the power of asynchronous stochastic procedures. In our experiments, we focus on linear and logistic regression, the examples given in Section 2.2; that is, we have data (ai, bi) ∈ R^d × R (for linear regression) or (ai, bi) ∈ R^d × {−1, 1} (for logistic regression), for i = 1, . . .
, N, and objectives

f(x) = (1/2N) Σ_{i=1}^N (⟨ai, x⟩ − bi)²  and  f(x) = (1/N) Σ_{i=1}^N log(1 + exp(−bi⟨ai, x⟩)).   (8)

We perform each of our experiments using a 48-core Intel Xeon machine with 1 terabyte of RAM, and have put code and binaries to replicate our experiments on CodaLab [6]. The Xeon architecture puts each core onto one of four sockets, where each socket has its own memory. To limit the impact of communication overhead in our experiments, we limit all experiments to at most 12 cores, all on the same socket. Within an experiment—based on the empirical expectations (8)—we iterate in epochs, meaning that our stochastic gradient procedure repeatedly loops through all examples, each exactly once.¹ Within an epoch, we use a fixed stepsize α, decreasing the stepsize by a factor of .9 between each epoch (this matches the experimental protocol of Niu et al. [15]). Within each epoch, we choose examples in a randomly permuted order, where the order changes from epoch to epoch (cf. [17]). To address issues of hardware resource contention (see Section 3.2 for more on this), in some cases we use a mini-batching strategy. Abstractly, in the formulation of the basic problem (1), this means that in each calculation of a stochastic gradient g we draw B ≥ 1 samples W1, . . . , WB i.i.d.
according to P, then set

g(x) = (1/B) Σ_{b=1}^B ∇F(x; Wb).   (9)

The mini-batching strategy (9) does not change the (asymptotic) convergence guarantees of asynchronous stochastic gradient descent, as the covariance matrix Σ = E[g(x⋆)g(x⋆)⊤] satisfies Σ = (1/B) E[∇F(x⋆; W)∇F(x⋆; W)⊤], while the total iteration count is reduced by a factor of B. Lastly, we measure the performance of optimization schemes via speedup, defined as

speedup = (average epoch runtime on a single core using HOGWILD!) / (average epoch runtime on m cores).   (10)

In our experiments, as increasing the number m of cores does not change the gap in optimality f(xk) − f(x⋆) after each epoch, speedup is equivalent to the ratio of the time required to obtain an ε-accurate solution using a single processor/core to that required to obtain an ε-accurate solution using m processors/cores.

3.1 Efficiency and sparsity

For our first set of experiments, we study the effect that data sparsity has on the convergence behavior of asynchronous methods—sparsity has been an essential part of the analysis of many asynchronous and parallel optimization schemes [15, 4, 18], while our theoretical results suggest it should be unimportant—using the linear regression objective (8). We generate synthetic linear regression problems with N = 10^6 examples in d = 10^3 dimensions via the following procedure. Let ρ ∈ (0, 1] be the desired fraction of non-zero gradient entries, and let Πρ be a random projection operator that zeros out all but a fraction ρ of the elements of its argument, meaning that for a ∈ R^d, Πρ(a) uniformly at random chooses ρd elements of a, leaves them identical, and zeroes the remaining elements.
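A minimal sketch of this projection operator and of synthetic data built with it might look as follows; this is our own illustration (the authors' CodaLab code is the authoritative version), and all function names are hypothetical.

```python
import numpy as np

def sparse_project(a, rho, rng):
    """Pi_rho: keep a uniformly random fraction rho of a's entries, zero the rest."""
    d = a.size
    keep = rng.choice(d, size=max(1, int(rho * d)), replace=False)
    out = np.zeros_like(a)
    out[keep] = a[keep]
    return out

def make_regression(N, d, rho, noise=1.0, seed=0):
    """Synthetic linear regression: b_i = <a_i, u*> + eps_i with sparsified a_i."""
    rng = np.random.default_rng(seed)
    u_star = rng.normal(size=d)
    A = np.stack([sparse_project(rng.normal(size=d), rho, rng) for _ in range(N)])
    b = A @ u_star + noise * rng.normal(size=N)
    # the optimality gap is measured against the exact least-squares solution
    x_star, *_ = np.linalg.lstsq(A, b, rcond=None)
    return A, b, x_star
```

Here `np.linalg.lstsq` plays the role of the closed-form solution (AᵀA)⁻¹Aᵀb used to measure the optimality gap.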
We generate data for our linear regression by drawing a random vector u⋆ ∼ N(0, I), then constructing bi = ⟨ai, u⋆⟩ + εi, i = 1, . . . , N, where εi ∼ N(0, 1) i.i.d., ai = Πρ(ãi) with ãi ∼ N(0, I) i.i.d., and Πρ(ãi) denotes an independent random sparse projection of ãi. To measure the optimality gap, we directly compute x⋆ = (AᵀA)⁻¹Aᵀb, where A = [a1 a2 · · · aN]⊤ ∈ R^{N×d}.

In Figure 1, we plot the results of simulations using densities ρ ∈ {.005, .01, .2, 1} and mini-batch size B = 10, showing the gap f(xk) − f(x⋆) as a function of the number of epochs for each of the given sparsity levels. We give results using 1, 2, 4, and 10 processor cores (increasing degrees of asynchrony), and from the plots, we see that regardless of the number of cores, the convergence behavior is nearly identical, with very minor degradations in performance for the sparsest data. (We plot the gaps f(xk) − f(x⋆) on a logarithmic axis.) Moreover, as the data becomes denser, the more asynchronous methods—larger number of cores—achieve performance essentially identical to the fully synchronous method in terms of convergence versus number of epochs. In Figure 2, we plot the speedup achieved using different numbers of cores. We also include speedup achieved using multiple cores with explicit synchronization (locking) of the updates, meaning that instead of allowing asynchronous updates, each of the cores globally locks the decision vector when it reads, unlocks and performs mini-batched gradient computations, and locks the vector again when it updates the vector.

¹Strictly speaking, this violates the stochastic gradient assumption, but it allows direct comparison with the original HOGWILD! code and implementation [15].
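For contrast with the lock-free iteration, here is a minimal sketch (our illustration, not the paper's implementation) of this locking baseline, in which every read and write of the shared vector holds one global lock while the gradient itself is computed unlocked. The quadratic objective and constants are illustrative assumptions.

```python
import threading
import numpy as np

def locked_sgd_worker(x, lock, grads, steps):
    """One worker of the explicit-synchronization baseline: lock to read x,
    compute a (mini-batched) gradient unlocked, then lock again to update."""
    for step in steps:
        with lock:                  # lock around the read of the shared vector
            x_local = x.copy()
        g = grads(x_local)          # gradient computed without holding the lock
        with lock:                  # lock again around the write
            x -= step * g

# Shared state: decision vector, one global lock, illustrative quadratic gradient.
d = 5
x = np.zeros(d)
lock = threading.Lock()
target = np.ones(d)
grads = lambda z: z - target        # gradient of 0.5 * ||z - target||^2
steps = [0.1] * 2000

threads = [threading.Thread(target=locked_sgd_worker, args=(x, lock, grads, steps))
           for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
```

The two lock acquisitions per update are exactly the serialization that the asynchronous scheme avoids, which is why the locking curves in Figure 2 fall well below linear speedup.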
We can see that the locking performance curve is much worse than the without-locking performance curve across all densities. That the locking strategy also gains some speedup when the density is higher is likely due to longer computation of the gradients. However, the locking-strategy performance is still not competitive with that of the without-locking strategy.

[Figure 1 omitted: log-scale plots of f(xk) − f(x⋆) versus epochs for 1, 4, 8, and 10 cores; panels (a) ρ = .005, (b) ρ = .01, (c) ρ = .2, (d) ρ = 1.]

Figure 1. (Exponential backoff stepsizes) Optimality gaps for synthetic linear regression experiments showing effects of data sparsity and asynchrony on f(xk) − f(x⋆). A fraction ρ of each vector ai ∈ R^d is non-zero.

[Figure 2 omitted: speedup versus number of cores (linear speedup, without locking, with locking); panels (a) ρ = .005, (b) ρ = .01, (c) ρ = .2, (d) ρ = 1.]

Figure 2. (Exponential backoff stepsizes) Speedups for synthetic linear regression experiments showing effects of data sparsity on speedup (10).
A fraction ρ of each vector ai ∈ R^d is non-zero.

3.2 Hardware issues and cache locality

We detail a small set of experiments investigating hardware issues that arise even in implementation of asynchronous gradient methods. The Intel x86 architecture (as with essentially every processor architecture) organizes memory in a hierarchy, going from L1 to L3 (level 1 to level 3) caches of increasing sizes. An important aspect of the speed of different optimization schemes is the relative fraction of memory hits, meaning accesses to memory that is cached locally (in order of decreasing speed, L1, L2, or L3 cache). In Table 1, we show the proportion of cache misses at each level of the memory hierarchy for our synthetic regression experiment with fully dense data (ρ = 1) over the execution of 20 epochs, averaged over 10 different experiments. We compare memory contention when the batch size B used to compute the local asynchronous gradients (9) is 1 and 10. We see that the proportion of misses for the fastest two levels—1 and 2—of the cache for B = 1 increases significantly with the number of cores, while increasing the batch size to B = 10 substantially mitigates cache incoherency. In particular, we maintain (near) linear increases in iteration speed with little degradation in solution quality (the gap f(x̂) − f(x⋆) output by each of the procedures with and without batching is identical to within 10⁻³; cf.
Figure 1(d)).

No batching (B = 1):
Number of cores:         1       4       8       10
fraction of L1 misses:   0.0009  0.0017  0.0025  0.0026
fraction of L2 misses:   0.5638  0.6594  0.7551  0.7762
fraction of L3 misses:   0.6152  0.4528  0.3068  0.2841
epoch average time (s):  4.2101  1.6577  1.4052  1.3183
speedup:                 1.00    2.54    3.00    3.19

Batch size B = 10:
Number of cores:         1       4       8       10
fraction of L1 misses:   0.0012  0.0011  0.0011  0.0011
fraction of L2 misses:   0.5420  0.5467  0.5537  0.5621
fraction of L3 misses:   0.5677  0.5895  0.5714  0.5578
epoch average time (s):  4.4286  1.1868  0.6971  0.6220
speedup:                 1.00    3.73    6.35    7.12

Table 1. Memory traffic for batched updates (9) versus non-batched updates (B = 1) for a dense linear regression problem in d = 10^3 dimensions with a sample of size N = 10^6. Cache misses are substantially higher with B = 1.

3.3 Real datasets

We perform experiments using three different real-world datasets: the Reuters RCV1 corpus [11], the Higgs detection dataset [2], and the Forest Cover dataset [12]. Each represents a binary classification problem which we formulate using logistic regression. We briefly detail statistics for each:

(1) The Reuters RCV1 dataset consists of N ≈ 7.81 · 10⁵ data vectors (documents) ai ∈ {0, 1}^d with d ≈ 5 · 10⁴ dimensions; each vector has sparsity approximately ρ = 3 · 10⁻³. Our task is to classify each document as being about corporate industrial topics (CCAT) or not.
(2) The Higgs detection dataset consists of N = 10⁶ data vectors ãi ∈ R^{d0}, with d0 = 28.
We quantize each coordinate into 5 bins containing equal fractions of the coordinate values and encode each vector ã_i as a vector a_i ∈ {0, 1}^{5 d_0} whose non-zero entries correspond to the quantiles into which the coordinates fall. The task is to detect (simulated) emissions from a linear accelerator.

(3) The Forest Cover dataset consists of N ≈ 5.7 · 10^5 data vectors a_i ∈ {−1, 1}^d with d = 54, and the task is to predict forest growth types.

Figure 3. (Exponential backoff stepsizes) Optimality gaps f(x_k) − f(x⋆) versus epochs on the (a) RCV1 (ρ = .003), (b) Higgs (ρ = 1), and (c) Forest Cover (ρ = 1) datasets, with curves for 1, 4, 8, and 10 cores.

Figure 4. (Exponential backoff stepsizes) Logistic regression experiments showing speedup (10) versus number of cores on the (a) RCV1 (ρ = .003), (b) Higgs (ρ = 1), and (c) Forest Cover (ρ = 1) datasets, comparing the lock-free method against ideal linear speedup.

In Figure 3, we plot the gap f(x_k) − f(x⋆) as a function of epochs, giving standard error intervals over 10 runs for each experiment.
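The quantile-based binarization described for the Higgs dataset above can be sketched as follows. This is a minimal NumPy illustration of the idea (equal-frequency bins per coordinate, one-hot encoded); the function name and details such as the handling of values exactly at bin edges are our assumptions, not the authors' code:

```python
import numpy as np

def quantile_encode(A, bins=5):
    """One-hot encode each coordinate of A by equal-frequency bins.

    A is an (N, d0) real matrix; the output is an (N, bins * d0)
    binary matrix with exactly one non-zero entry per original
    coordinate, indicating the quantile bin the value falls into.
    """
    N, d0 = A.shape
    out = np.zeros((N, bins * d0), dtype=np.uint8)
    for j in range(d0):
        # Interior bin edges at the 20th, 40th, 60th, 80th percentiles
        # give bins containing (roughly) equal fractions of the values.
        edges = np.quantile(A[:, j], np.linspace(0, 1, bins + 1)[1:-1])
        idx = np.searchsorted(edges, A[:, j], side="right")
        out[np.arange(N), bins * j + idx] = 1
    return out
```

Each encoded example then has exactly d_0 non-zero entries, and the equal-frequency bins make the representation insensitive to heavy-tailed or differently scaled raw features.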
There is essentially no degradation in objective value for the different numbers of processors, and in Figure 4, we plot the speedup achieved using 1, 4, 8, and 10 cores with batch size B = 10. Asynchronous gradient methods achieve speedups of between 6× and 8× on each of the datasets using 10 cores.

References

[1] A. Agarwal and J. C. Duchi. Distributed delayed stochastic optimization. In Advances in Neural Information Processing Systems 24, 2011.
[2] P. Baldi, P. Sadowski, and D. Whiteson. Searching for exotic particles in high-energy physics with deep learning. Nature Communications, 5, July 2014.
[3] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Prentice-Hall, Inc., 1989.
[4] J. C. Duchi, M. I. Jordan, and H. B. McMahan. Estimation, optimization, and parallelism when data is sparse. In Advances in Neural Information Processing Systems 26, 2013.
[5] J. C. Duchi, S. Chaturapruek, and C. Ré. Asynchronous stochastic convex optimization. arXiv:1508.00882 [math.OC], 2015.
[6] J. C. Duchi, S. Chaturapruek, and C. Ré. Asynchronous stochastic convex optimization, 2015. URL https://www.codalab.org/worksheets/. Code for reproducing experiments.
[7] Y. M. Ermoliev. On the stochastic quasi-gradient method and stochastic quasi-Feyer sequences. Kibernetika, 2:72–83, 1969.
[8] A. Juditsky, A. Nemirovski, and C. Tauvel. Solving variational inequalities with the stochastic mirror-prox algorithm. Stochastic Systems, 1(1):17–58, 2011.
[9] L. Le Cam and G. L. Yang. Asymptotics in Statistics: Some Basic Concepts. Springer, 2000.
[10] E. L. Lehmann and G. Casella. Theory of Point Estimation, Second Edition. Springer, 1998.
[11] D. Lewis, Y. Yang, T. Rose, and F. Li. RCV1: A new benchmark collection for text categorization research. 
Journal of Machine Learning Research, 5:361–397, 2004.
[12] M. Lichman. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.
[13] J. Liu, S. J. Wright, C. Ré, V. Bittorf, and S. Sridhar. An asynchronous parallel stochastic coordinate descent algorithm. In Proceedings of the 31st International Conference on Machine Learning, 2014.
[14] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
[15] F. Niu, B. Recht, C. Ré, and S. Wright. Hogwild: a lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems 24, 2011.
[16] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.
[17] B. Recht and C. Ré. Beneath the valley of the noncommutative arithmetic-geometric mean inequality: conjectures, case-studies, and consequences. In Proceedings of the Twenty Fifth Annual Conference on Computational Learning Theory, 2012.
[18] P. Richtárik and M. Takáč. Parallel coordinate descent methods for big data optimization. Mathematical Programming, Online first, 2015. URL http://link.springer.com/article/10.1007/s10107-015-0901-6.
[19] H. Robbins and S. Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400–407, 1951.
[20] H. Robbins and D. Siegmund. A convergence theorem for non-negative almost supermartingales and some applications. In Optimizing Methods in Statistics, pages 233–257. Academic Press, New York, 1971.
[21] A. W. van der Vaart. Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 1998. 
ISBN 0-521-49603-9.