{"title": "Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent", "book": "Advances in Neural Information Processing Systems", "page_first": 693, "page_last": 701, "abstract": "Stochastic Gradient Descent (SGD) is a popular algorithm that can achieve state-of-the-art performance on a variety of machine learning tasks. Several researchers have recently proposed schemes to parallelize SGD, but all require performance-destroying memory locking and synchronization. This work aims to show using novel theoretical analysis, algorithms, and implementation that SGD can be implemented *without any locking*. We present an update scheme called Hogwild which allows processors access to shared memory with the possibility of overwriting each other's work. We show that when the associated optimization problem is sparse, meaning most gradient updates only modify small parts of the decision variable, then Hogwild achieves a nearly optimal rate of convergence. We demonstrate experimentally that Hogwild outperforms alternative schemes that use locking by an order of magnitude.", "full_text": "HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent\n\nFeng Niu (leonn@cs.wisc.edu), Benjamin Recht (brecht@cs.wisc.edu), Christopher Ré (chrisre@cs.wisc.edu), Stephen J. Wright (swright@cs.wisc.edu)\nComputer Sciences Department, University of Wisconsin-Madison, Madison, WI 53706\n\nAbstract\n\nStochastic Gradient Descent (SGD) is a popular algorithm that can achieve state-of-the-art performance on a variety of machine learning tasks. Several researchers have recently proposed schemes to parallelize SGD, but all require performance-destroying memory locking and synchronization. This work aims to show, using novel theoretical analysis, algorithms, and implementation, that SGD can be implemented without any locking. 
We present an update scheme called HOGWILD! which allows processors access to shared memory with the possibility of overwriting each other's work. We show that when the associated optimization problem is sparse, meaning most gradient updates only modify small parts of the decision variable, then HOGWILD! achieves a nearly optimal rate of convergence. We demonstrate experimentally that HOGWILD! outperforms alternative schemes that use locking by an order of magnitude.\n\n1 Introduction\n\nWith its small memory footprint, robustness against noise, and rapid learning rates, Stochastic Gradient Descent (SGD) has proved to be well suited to data-intensive machine learning tasks [3,5,24]. However, SGD's scalability is limited by its inherently sequential nature; it is difficult to parallelize. Nevertheless, the recent emergence of inexpensive multicore processors and mammoth, web-scale data sets has motivated researchers to develop several clever parallelization schemes for SGD [4, 10, 12, 16, 27]. As many large data sets are currently pre-processed in a MapReduce-like parallel-processing framework, much of the recent work on parallel SGD has focused naturally on MapReduce implementations. MapReduce is a powerful tool developed at Google for extracting information from huge logs (e.g., “find all the URLs from 100TB of Web data”) that was designed to ensure fault tolerance and to simplify the maintenance and programming of large clusters of machines [9]. But MapReduce is not ideally suited for online, numerically intensive data analysis. Iterative computation is difficult to express in MapReduce, and the overhead to ensure fault tolerance can result in dismal throughput. 
Indeed, even Google researchers themselves suggest that other systems, for example Dremel, are more appropriate than MapReduce for data analysis tasks [20].\nFor some data sets, the sheer size of the data dictates that one use a cluster of machines. However, there are a host of problems in which, after appropriate preprocessing, the data necessary for statistical analysis may consist of a few terabytes or less. For such problems, one can use a single inexpensive work station as opposed to a hundred-thousand-dollar cluster. Multicore systems have significant performance advantages, including (1) low-latency, high-throughput shared main memory (a processor in such a system can write and read the shared physical memory at over 12GB/s with latency in the tens of nanoseconds); and (2) high bandwidth off multiple disks (a thousand-dollar RAID can pump data into main memory at over 1GB/s). In contrast, a typical MapReduce setup will read incoming data at rates less than tens of MB/s due to frequent checkpointing for fault tolerance. The high rates achievable by multicore systems move the bottlenecks in parallel computation to synchronization (or locking) amongst the processors [2,13]. Thus, to enable scalable data analysis on a multicore machine, any performant solution must minimize the overhead of locking.\nIn this work, we propose a simple strategy for eliminating the overhead associated with locking: run SGD in parallel without locks, a strategy that we call HOGWILD!. In HOGWILD!, processors are allowed equal access to shared memory and are able to update individual components of memory at will. Such a lock-free scheme might appear doomed to fail as processors could overwrite each other's progress. 
However, when the data access is sparse, meaning that individual SGD steps only modify a small part of the decision variable, we show that memory overwrites are rare and that they introduce barely any error into the computation when they do occur. We demonstrate both theoretically and experimentally a near-linear speedup with the number of processors on commonly occurring sparse learning problems.\nIn Section 2, we formalize a notion of sparsity that is sufficient to guarantee such a speedup and provide canonical examples of sparse machine learning problems in classification, collaborative filtering, and graph cuts. Our notion of sparsity allows us to provide theoretical guarantees of linear speedups in Section 4. As a by-product of our analysis, we also derive rates of convergence for algorithms with constant stepsizes. We demonstrate that robust 1/k convergence rates are possible with constant-stepsize schemes that implement an exponential back-off in the constant over time. This result is interesting in and of itself and shows that one need not settle for 1/√k rates to ensure robustness in SGD algorithms.\nIn practice, we find that the computational performance of a lock-free procedure exceeds even our theoretical guarantees. We experimentally compare lock-free SGD to several recently proposed methods. We show that all methods that propose memory locking are significantly slower than their respective lock-free counterparts on a variety of machine learning applications.\n\n2 Sparse Separable Cost Functions\n\nOur goal throughout is to minimize a function f : X ⊆ Rⁿ → R of the form\n\nf(x) = Σ_{e∈E} f_e(x_e) .  (1)\n\nHere e denotes a small subset of {1, . . . , n} and x_e denotes the values of the vector x on the coordinates indexed by e. 
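To make the separable structure concrete, here is a minimal sketch of how a cost of the form (1) can be represented as a list of terms, each carrying only the coordinate subset e it touches. The function names and data layout are our own illustration, not from the paper:

```python
def make_separable_cost(terms):
    """terms: list of (indices, fn) pairs, where fn maps the subvector x_e
    (a list of len(indices) floats) to a scalar. Returns f(x) = sum_e f_e(x_e)."""
    def f(x):
        return sum(fn([x[i] for i in idx]) for idx, fn in terms)
    return f

# Two terms over n = 4 coordinates; each f_e sees only its own subvector x_e.
terms = [((0, 1), lambda xe: (xe[0] - xe[1]) ** 2),
         ((2, 3), lambda xe: xe[0] * xe[1])]
f = make_separable_cost(terms)
```

Each term can be sampled and differentiated independently, which is exactly the access pattern the lock-free updates below exploit.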
The key observation that underlies our lock-free approach is that the natural cost functions associated with many machine learning problems of interest are sparse in the sense that |E| and n are both very large but each individual f_e acts only on a very small number of components of x. That is, each subvector x_e contains just a few components of x.\nThe cost function (1) induces a hypergraph G = (V, E) whose nodes are the individual components of x. Each subvector x_e induces an edge e ∈ E in the graph consisting of some subset of nodes. A few examples illustrate this concept.\nSparse SVM. Suppose our goal is to fit a support vector machine to some data pairs E = {(z_1, y_1), . . . , (z_|E|, y_|E|)}, where z ∈ Rⁿ and y is a label for each (z, y) ∈ E:\n\nminimize_x Σ_{α∈E} max(1 − y_α xᵀz_α, 0) + λ‖x‖²₂ ,  (2)\n\nand we know a priori that the examples z_α are very sparse (see for example [14]). To write this cost function in the form of (1), let e_α denote the components which are non-zero in z_α and let d_u denote the number of training examples which are non-zero in component u (u = 1, 2, . . . , n). Then we can rewrite (2) as\n\nminimize_x Σ_{α∈E} ( max(1 − y_α xᵀz_α, 0) + λ Σ_{u∈e_α} x_u²/d_u ) .  (3)\n\nEach term in the sum (3) depends only on the components of x indexed by the set e_α.\nMatrix Completion. In the matrix completion problem, we are provided entries of a low-rank, n_r × n_c matrix Z from the index set E. Such problems arise in collaborative filtering, Euclidean distance estimation, and clustering [8,17,23]. Our goal is to reconstruct Z from this sparse sampling of data. A popular heuristic recovers the estimate of Z as a product LR* of factors obtained from the following minimization:\n\nminimize_{(L,R)} Σ_{(u,v)∈E} (L_u R*_v − Z_uv)² + (μ/2)‖L‖²_F + (μ/2)‖R‖²_F ,  (4)\n\nwhere L is n_r × r, R is n_c × r, and L_u (resp. 
R_v) denotes the uth (resp. vth) row of L (resp. R) [17, 23, 25]. To put this problem in sparse form, i.e., as (1), we write (4) as\n\nminimize_{(L,R)} Σ_{(u,v)∈E} { (L_u R*_v − Z_uv)² + (μ/(2|E_u|))‖L_u‖²_F + (μ/(2|E_v|))‖R_v‖²_F } ,\n\nwhere E_u = {v : (u, v) ∈ E} and E_v = {u : (u, v) ∈ E}.\nGraph Cuts. Problems involving minimum cuts in graphs frequently arise in machine learning (see [6] for a comprehensive survey). In such problems, we are given a sparse, nonnegative matrix W which indexes similarity between entities. Our goal is to find a partition of the index set {1, . . . , n} that best conforms to this similarity matrix. Here the graph structure is explicitly determined by the similarity matrix W; arcs correspond to nonzero entries in W. For example, we may have a list of n strings, and W_uv might index the similarity of each pair of strings, and we want to match each string to some list of D entities. Each node is associated with a vector x_i in the D-dimensional simplex S_D = {ζ ∈ R^D : ζ_v ≥ 0, Σ_{v=1}^D ζ_v = 1}. Here, two-way cuts use D = 2, but multiway cuts with tens of thousands of classes also arise in entity resolution problems [18]. Several authors (e.g., [7]) propose to minimize the cost function\n\nminimize_x Σ_{(u,v)∈E} w_uv ‖x_u − x_v‖₁  subject to x_v ∈ S_D for v = 1, . . . , n .  (5)\n\nIn all three of the preceding examples, the number of components involved in a particular term f_e is a small fraction of the total number of entries. We formalize this notion by defining the following statistics of the hypergraph G:\n\nΩ := max_{e∈E} |e| ,  Δ := (1/|E|) max_{1≤v≤n} |{e ∈ E : v ∈ e}| ,  ρ := (1/|E|) max_{e∈E} |{ê ∈ E : ê ∩ e ≠ ∅}| .  (6)\n\nThe quantity Ω simply quantifies the size of the hyperedges. ρ determines the maximum fraction of edges that intersect any given edge, while Δ determines the maximum fraction of edges that intersect any variable. 
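The statistics in (6) can be computed directly from an edge list. The following sketch (our own naming; the ρ computation is O(|E|²), so it is only for modestly sized hypergraphs) makes the definitions operational:

```python
def hypergraph_stats(edges, n):
    """edges: list of sets of node indices in {0, ..., n-1}. Returns (Ω, Δ, ρ)."""
    E = len(edges)
    omega = max(len(e) for e in edges)          # Ω: size of the largest hyperedge
    degree = [0] * n                            # number of edges touching each node
    for e in edges:
        for v in e:
            degree[v] += 1
    delta = max(degree) / E                     # Δ: max fraction of edges at one node
    # ρ: max fraction of edges that intersect some given edge (including itself)
    rho = max(sum(1 for other in edges if e & other) for e in edges) / E
    return omega, delta, rho
```

For instance, three edges {0,1}, {1,2}, {3} over four nodes give Ω = 2, Δ = 2/3, and ρ = 2/3.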
ρ is a measure of the sparsity of the hypergraph, while Δ measures its node-regularity. For our examples, we can make the following observations about ρ and Δ.\n\n1. Sparse SVM. Δ is simply the maximum frequency with which any feature appears in an example, while ρ measures how clustered the hypergraph is. If some features are very common across the data set, then ρ will be close to one.\n\n2. Matrix Completion. If we assume that the provided examples are sampled uniformly at random and we see more than n_c log(n_c) of them, then Δ ≈ log(n_r)/n_r and ρ ≈ 2 log(n_r)/n_r. This follows from a coupon collector argument [8].\n\n3. Graph Cuts. Δ is the maximum degree divided by |E|, and ρ is at most 2Δ.\n\nWe now describe a simple protocol that achieves a linear speedup in the number of processors when Ω, Δ, and ρ are relatively small.\n\n3 The HOGWILD! Algorithm\n\nHere we discuss the parallel processing setup. We assume a shared-memory model with p processors. The decision variable x is accessible to all processors. Each processor can read x, and can\n\nAlgorithm 1 HOGWILD! update for individual processors\n1: loop\n2:   Sample e uniformly at random from E\n3:   Read current state x_e and evaluate G_e(x_e)\n4:   for v ∈ e do x_v ← x_v − γ G_{e,v}(x_e)\n5: end loop\n\ncontribute an update vector to x. The vector x is stored in shared memory, and we assume that the componentwise addition operation\n\nx_v ← x_v + a\n\ncan be performed atomically by any processor for a scalar a and v ∈ {1, . . . , n}. This operation does not require a separate locking structure on most modern hardware: such an operation is a single atomic instruction on GPUs and DSPs, and it can be implemented via a compare-and-exchange operation on a general-purpose multicore processor like the Intel Nehalem. 
In contrast, the operation of updating many components at once requires an auxiliary locking structure.\nEach processor then follows the procedure in Algorithm 1. Let G_e(x_e) denote a gradient or subgradient of the function f_e multiplied by |E|. That is,\n\n|E|⁻¹ G_e(x_e) ∈ ∂f_e(x_e) .\n\nSince it is clear from the notation, we often write G_e(x), dropping the subscript that identifies the affected indices of x. Note that as a consequence of the uniform random sampling of e from E, we have\n\nE[G_e(x_e)] ∈ ∂f(x) .\n\nIn Algorithm 1, each processor samples a term e ∈ E uniformly at random, computes the gradient of f_e at x_e, and then writes\n\nx_v ← x_v − γ G_{e,v}(x_e),  for each v ∈ e.  (7)\n\nWe assume that the stepsize γ is a fixed constant. Note that the processor modifies only the variables indexed by e, leaving all of the components in e^c (i.e., not in e) alone. Even though the processors have no knowledge as to whether any of the other processors have modified x, we define x_j to be the state of the decision variable x after j updates have been performed¹. Since two processors can write to x at the same time, we need to be a bit careful with this definition, but we simply break ties at random. Note that x_j is generally updated with a stale gradient, which is based on a value of x read many clock cycles earlier. We use x_{k(j)} to denote the value of the decision variable used to compute the gradient or subgradient that yields the state x_j.\nIn what follows, we provide conditions under which this asynchronous, incremental gradient algorithm converges. 
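As an illustration of Algorithm 1 applied to the sparse SVM cost (3), the following sketch runs the lock-free update (7) from plain Python threads over a shared, unlocked list. All names here are our own; note that CPython's interpreter lock means this shows the lock-free access pattern rather than a true parallel speedup:

```python
import random
import threading

def hogwild_svm(examples, n, gamma=0.1, lam=0.01, n_threads=2, steps=500):
    """examples: list of (z, y) with z a sparse dict {index: value}, y in {-1, +1}.
    Minimizes the hinge loss plus a regularizer split across terms via the
    counts d_u, as in (3)."""
    x = [0.0] * n                    # shared decision variable, never locked
    d = [0] * n                      # d_u: number of examples nonzero in component u
    for z, _ in examples:
        for u in z:
            d[u] += 1

    def worker():
        rng = random.Random()
        for _ in range(steps):
            z, y = rng.choice(examples)                          # sample a term e
            margin = y * sum(x[u] * zu for u, zu in z.items())   # possibly stale read
            for u, zu in z.items():                              # touch only v in e
                g = (-y * zu if margin < 1 else 0.0) + 2 * lam * x[u] / d[u]
                x[u] -= gamma * g                                # per-component write

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return x
```

On a toy separable problem (one positive example on feature 0, one negative on feature 1), the learned weights pick up the correct signs even with concurrent, unsynchronized writers.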
Moreover, we show that if the hypergraph induced by f is isotropic and sparse, then this algorithm converges in nearly the same number of gradient steps as its serial counterpart. Since we are running in parallel and without locks, this means that we get a nearly linear speedup in terms of the number of processors.\n\n4 Fast Rates for Lock-Free Parallelism\n\nTo state our theoretical results, we must describe several quantities that are important in the analysis of our parallel stochastic gradient descent scheme. We follow the notation and assumptions of Nemirovski et al. [21]. To simplify the analysis, we will assume that each f_e in (1) is a convex function. We assume Lipschitz continuous differentiability of f with Lipschitz constant L:\n\n‖∇f(x′) − ∇f(x)‖ ≤ L‖x′ − x‖, for all x′, x ∈ X.  (8)\n\nWe also assume f is strongly convex with modulus c. By this we mean that\n\nf(x′) ≥ f(x) + (x′ − x)ᵀ∇f(x) + (c/2)‖x′ − x‖², for all x′, x ∈ X.  (9)\n\n¹Our notation overloads subscripts of x. For clarity throughout, subscripts i, j, and k refer to iteration counts, and v and e refer to components or subsets of components.\n\nWhen f is strongly convex, there exists a unique minimizer x⋆ and we denote f⋆ = f(x⋆). We additionally assume that there exists a constant M such that\n\n‖G_e(x_e)‖₂ ≤ M almost surely for all x ∈ X.  (10)\n\nWe assume throughout that γc < 1. (Indeed, when γc > 1, even ordinary gradient descent algorithms will diverge.) Our main results are summarized by the following proposition.\n\nProposition 4.1 Suppose in Algorithm 1 that the lag between when a gradient is computed and when it is used in step j \u2014 namely, j − k(j) \u2014 is always less than or equal to τ, and that γ is defined to be\n\nγ = ϑεc / ( 2LM²(1 + 6ρτ + 4τ²ΩΔ^{1/2}) )  (11)\n\nfor some ε > 0 and ϑ ∈ (0, 1). Define D_0 := ‖x_0 − x⋆‖² and let k be an integer satisfying\n\nk ≥ 2LM²(1 + 6τρ + 6τ²ΩΔ^{1/2}) log(LD_0/ε) / (c²ϑε) .  (12)\n\nThen after k updates of x, we have E[f(x_k) − f⋆] ≤ ε.\n\nA proof of Proposition 4.1 is provided in the full version of this paper [22]. In the case that τ = 0, this reduces to precisely the rate achieved by the serial SGD protocol. A similar rate is achieved if τ = o(n^{1/4}), as ρ and Δ are typically both o(1/n). In our setting, τ is proportional to the number of processors, and hence as long as the number of processors is less than n^{1/4}, we get nearly the same recursion as in the linear rate.\nNote that up to the log(1/ε) term in (12), our analysis nearly provides a 1/k rate of convergence for a constant-stepsize SGD scheme, both in the serial and parallel cases. Moreover, note that our rate of convergence is fairly robust to error in the value of c; we pay linearly for our underestimate of the curvature of f. In contrast, Nemirovski et al. demonstrate that when the stepsize is inversely proportional to the iteration counter, an overestimate of c can result in exponential slow-down [21]!\nRobust 1/k rates. We note that a 1/k rate can be achieved by a slightly more complicated protocol where the stepsize is slowly decreased after a large number of iterations. Suppose we run Algorithm 1 for a fixed number of gradient updates K with stepsize γ < 1/c. Then, we wait for the threads to coalesce, reduce γ by a constant factor β ∈ (0, 1), and run for β⁻¹K iterations. This scheme results in a 1/k rate of convergence, with the only synchronization overhead occurring at the end of each “round” or “epoch” of iteration. In some sense, this piecewise-constant stepsize protocol approximates a 1/k diminishing stepsize. The main difference of our approach from previous analysis is that our stepsizes are always less than 1/c, in contrast to beginning with very large stepsizes. 
Always working with small stepsizes allows us to avoid the possible exponential slow-downs that occur with standard diminishing-stepsize schemes.\n\n5 Related Work\n\nMost schemes for parallelizing stochastic gradient descent are variants of ideas presented in the seminal text by Bertsekas and Tsitsiklis [4]. For instance, in this text, they describe using stale gradient updates computed across many computers in a master-worker setting and describe settings where different processors control access to particular components of the decision variable. They prove global convergence of these approaches, but do not provide rates of convergence. (This is one way in which our work extends this prior research.) These authors also show that SGD convergence is robust to a variety of models of delay in computation and communication in [26].\nRecently, parallel schemes have been proposed in a variety of contexts. In MapReduce settings, Zinkevich et al. proposed running many instances of stochastic gradient descent on different machines and averaging their output [27]. Though the authors claim this method can reduce both the variance of their estimate and the overall bias, we show in our experiments that for the sorts of problems we are concerned with, this method does not outperform a serial scheme.\nSchemes involving the averaging of gradients via a distributed protocol have also been proposed by several authors [10, 12]. 
While these methods do achieve linear speedups, they are difficult to implement efficiently on multicore machines, as they require massive communication overhead. Distributed averaging of gradients requires message passing between the cores, and the cores need to synchronize frequently in order to compute reasonable gradient averages.\nThe work most closely related to our own is a round-robin scheme proposed by Langford et al. [16]. In this scheme, the processors are ordered and each updates the decision variable in order. When the time required to lock memory for writing is dwarfed by the gradient computation time, this method results in a linear speedup, as the errors induced by the lag in the gradients are not too severe. However, we note that in many applications of interest in machine learning, gradient computation time is incredibly fast, and we now demonstrate that in a variety of applications, HOGWILD! outperforms such a round-robin approach by an order of magnitude.\n\nFigure 1: Comparison of wall clock time of HOGWILD! and RR. Each algorithm is run for 20 epochs and parallelized over 10 cores.\n\ntype | data set | size (GB) | ρ | Δ | HOGWILD! time (s) | train error | test error | RR time (s) | train error | test error\nSVM | RCV1 | 0.9 | 0.44 | 1.0 | 9.5 | 0.297 | 0.339 | 61.8 | 0.297 | 0.339\nMC | Netflix | 1.5 | 2.5e-3 | 2.3e-3 | 301.0 | 0.754 | 0.928 | 2569.1 | 0.754 | 0.927\nMC | KDD | 3.9 | 3.0e-3 | 1.8e-3 | 877.5 | 19.5 | 22.6 | 7139.0 | 19.5 | 22.6\nMC | Jumbo | 30 | 2.6e-7 | 1.4e-7 | 9453.5 | 0.031 | 0.013 | N/A | N/A | N/A\nCuts | DBLife | 3e-3 | 8.6e-3 | 4.3e-3 | 230.0 | 10.6 | N/A | 413.5 | 10.5 | N/A\nCuts | Abdomen | 18 | 9.2e-4 | 9.2e-4 | 1181.4 | 3.99 | N/A | 7467.25 | 3.99 | N/A\n\n6 Experiments\n\nWe ran numerical experiments on a variety of machine learning tasks, and compared against a round-robin approach proposed in [16] and implemented in Vowpal Wabbit [15]. We refer to this approach as RR. 
To be as fair as possible to prior art, we hand-coded RR to be nearly identical to the HOGWILD! approach, with the only difference being the schedule for how the gradients are updated. One notable change in RR from the Vowpal Wabbit software release is that we optimized RR's locking and signaling mechanisms to use spinlocks and busy waits (there is no need for generic signaling to implement round-robin). We verified that this optimization results in nearly an order of magnitude improvement in wall clock time for all problems that we discuss.\nWe also compare against a model which we call AIG, which can be seen as a middle ground between RR and HOGWILD!. AIG runs a protocol identical to HOGWILD! except that it locks all of the variables in e before and after the for loop on line 4 of Algorithm 1. Our experiments demonstrate that even this fine-grained locking induces undesirable slow-downs.\nAll of the experiments were coded in C++ and run on an identical configuration: a machine with dual Xeon X650 CPUs (6 cores each, with 2-way hyperthreading), 24GB of RAM, and a software RAID-0 over 7 2TB Seagate Constellation 7200RPM disks. The kernel is Linux 2.6.18-128. We never use more than 2GB of memory. All training data is stored on the seven-disk RAID-0. We implemented a custom file scanner to demonstrate the speed of reading data sets off disk into small shared memory. This allows us to read data from the RAID at a rate of nearly 1GB/s.\nAll of the experiments use a constant stepsize γ which is diminished by a factor β at the end of each pass over the training set. We run all experiments for 20 such passes, even though fewer epochs are often sufficient for convergence. We show results for the largest value of the learning rate γ which converges, and we use β = 0.9 throughout. 
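The experimental schedule just described, a stepsize held constant within each pass and cut by β between passes, amounts to a short list of per-epoch stepsizes. A minimal sketch, with names of our own choosing:

```python
def epoch_stepsizes(gamma0, beta, n_epochs):
    """Stepsize for each pass over the data: constant within a pass,
    multiplied by beta between passes (gamma0 = initial rate)."""
    return [gamma0 * beta ** i for i in range(n_epochs)]
```

With γ₀ = 1.0 and β = 0.9, the first three passes would use stepsizes 1.0, 0.9, and 0.81; the robust 1/k protocol of Section 4 uses the same geometric decay but also grows the round length by β⁻¹ each round.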
We note that the results look the same across a large range of (γ, β) pairs and that all three parallelization schemes achieve train and test errors within a few percent of one another. We present experiments on the classes of problems described in Section 2.\nSparse SVM. We tested our sparse SVM implementation on the Reuters RCV1 data set on the binary text classification task CCAT [19]. There are 804,414 examples split into 23,149 training and 781,265 test examples, and there are 47,236 features. We swapped the training set and the test set for our experiments to demonstrate the scalability of the parallel multicore algorithms. In this example,\n\n[Figure 2 plots omitted: each panel shows speedup versus number of threads for HOGWILD!, AIG, and RR.]\nFigure 2: Total CPU time versus number of threads for (a) RCV1, (b) Abdomen, and (c) DBLife.\n\nρ = 0.44 and Δ = 1.0, large values that suggest a bad case for HOGWILD!. Nevertheless, in Figure 2(a), we see that HOGWILD! is able to achieve a factor of 3 speedup, while RR gets worse as more threads are added. Indeed, for fast gradients, RR is worse than a serial implementation.\nFor this data set, we also implemented the approach in [27], which runs multiple SGD instances in parallel and averages their output. In Figure 3(b), we display the train error of the ensemble average across parallel threads at the end of each pass over the data. We note that the threads only communicate at the very end of the computation, but we want to demonstrate the effect of parallelization on train error. Each of the parallel threads touches every data example in each pass. 
Thus, the 10-thread run does 10× more gradient computations than the serial version. Here, the error is the same whether we run in serial or with ten instances. We conclude that on this problem, there is no advantage to running in parallel with this averaging scheme.\nMatrix Completion. We ran HOGWILD! on three very large matrix completion problems. The Netflix Prize data set has 17,770 rows, 480,189 columns, and 100,198,805 revealed entries. The KDD Cup 2011 (task 2) data set has 624,961 rows, 1,000,990 columns, and 252,800,275 revealed entries. We also synthesized a low-rank matrix with rank 10, 1e7 rows and columns, and 2e9 revealed entries. We refer to this instance as “Jumbo.” In this synthetic example, ρ and Δ are both around 1e-7. These values contrast sharply with the real data sets, where ρ and Δ are both on the order of 1e-3.\nFigure 3(a) shows the speedups for these three data sets using HOGWILD!. Note that the Jumbo and KDD examples do not fit in our allotted memory, but even when reading data off disk, HOGWILD! attains a near-linear speedup. The Jumbo problem takes just over two and a half hours to complete. Speedup graphs like those in Figure 2 comparing HOGWILD! to AIG and RR on the three matrix completion experiments are provided in the full version of this paper. Similar to the other experiments with quickly computable gradients, RR does not show any improvement over a serial approach. In fact, with 10 threads, RR is 12% slower than serial on KDD Cup and 62% slower on Netflix. We did not allow RR to run to completion on Jumbo because it took several hours.\nGraph Cuts. Our first cut problem was a standard image-segmentation-by-graph-cuts problem popular in computer vision. We computed a two-way cut of the abdomen data set [1]. 
This data set consists of a volumetric scan of a human abdomen, and the goal is to segment the image into organs. The image has 512 × 512 × 551 voxels, and the associated graph is 6-connected with maximum capacity 10. Both ρ and Δ are equal to 9.2e-4. We see that HOGWILD! speeds up the cut problem by more than a factor of 4 with 10 threads, while RR is twice as slow as the serial version.\nOur second graph cut problem sought a multi-way cut to determine entity recognition in a large database of web data. We created a data set of clean entity lists from the DBLife website and of entity mentions from the DBLife Web Crawl [11]. The data set consists of 18,167 entities and 180,110 mentions, with similarities given by string similarity. In this problem each stochastic gradient step must compute a Euclidean projection onto a simplex of dimension 18,167. As a result, the individual stochastic gradient steps are quite slow. Nonetheless, the problem is still very sparse, with ρ = 8.6e-3 and Δ = 4.2e-3. Consequently, in Figure 2, we see that HOGWILD! achieves a ninefold speedup with 10 cores. Since the gradients are slow, RR is able to achieve a parallel speedup for this problem; however, the speedup with ten processors is only by a factor of 5. That is, even in this case where the gradient computations are very slow, HOGWILD! outperforms a round-robin scheme.\n\n[Figure 3 plots omitted: (a) speedup versus number of splits for Jumbo, Netflix, and KDD; (b) train error versus epoch for 1, 3, and 10 threads; (c) speedup versus gradient delay for HOGWILD!, AIG, and RR.]\nFigure 3: (a) Speedup for the three matrix completion problems with HOGWILD!. In all three cases, massive speedup is achieved via parallelism. 
(b) The training error at the end of each epoch of SVM training on RCV1 for the averaging algorithm [27]. (c) Speedup achieved over the serial method for various levels of delay (measured in nanoseconds).\n\nWhat if the gradients are slow? As we saw with the DBLife data set, the RR method does get a nearly linear speedup when the gradient computation is slow. This raises the question of whether RR ever outperforms HOGWILD! for slow gradients. To answer this question, we ran the RCV1 experiment again and introduced an artificial delay at the end of each gradient computation to simulate a slow gradient. In Figure 3(c), we plot the wall clock time required to solve the SVM problem as we vary the delay for both the RR and HOGWILD! approaches.\nNotice that HOGWILD! achieves a greater decrease in computation time across the board. The speedups for both methods are the same when the delay is a few milliseconds. That is, if a gradient takes longer than one millisecond to compute, RR is on par with HOGWILD! (but not better). At this rate, one is only able to compute about a million stochastic gradients per hour, so the gradient computations must be very labor intensive in order for the RR method to be competitive.\n\n7 Conclusions\n\nOur proposed HOGWILD! algorithm takes advantage of sparsity in machine learning problems to enable near-linear speedups on a variety of applications. Empirically, our implementations outperform our theoretical analysis. For instance, ρ is quite large in the RCV1 SVM problem, yet we still obtain significant speedups. Moreover, our algorithms allow parallel speedup even when the gradients are computationally intensive.\nOur HOGWILD! schemes can be generalized to problems where some of the variables occur quite frequently as well. We could choose to not update certain variables that would be in particularly high contention. 
For instance, we might want to add a bias term to our support vector machine, and we could still run a HOGWILD! scheme, updating the bias only every thousand iterations or so.\nFor future work, it would be of interest to enumerate structures that allow for parallel gradient computations with no collisions at all. That is, it may be possible to bias the SGD iterations to completely avoid memory contention between processors. An investigation into such biased orderings would enable even faster computation of machine learning problems.\n\nAcknowledgements\n\nBR is generously supported by ONR award N00014-11-1-0723 and NSF award CCF-1139953. CR is generously supported by the Air Force Research Laboratory (AFRL) under prime contract no. FA8750-09-C-0181, the NSF CAREER award under IIS-1054009, ONR award N000141210041, and gifts or research awards from Google, LogicBlox, and Johnson Controls, Inc. SJW is generously supported by NSF awards DMS-0914524 and DMS-0906818 and DOE award DE-SC0002283. Any opinions, findings, and conclusions or recommendations expressed in this work are those of the authors and do not necessarily reflect the views of any of the above sponsors, including DARPA, AFRL, or the US government.\n\nReferences\n[1] Max-flow problem instances in vision. From http://vision.csd.uwo.ca/data/maxflow/.\n[2] K. Asanovic et al. The landscape of parallel computing research: A view from Berkeley. Technical Report UCB/EECS-2006-183, Electrical Engineering and Computer Sciences, University of California at Berkeley, 2006.\n[3] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, MA, 2nd edition, 1999.\n[4] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Athena Scientific, Belmont, MA, 1997.\n[5] L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems, 2008.\n[6] Y. Boykov and V. 
Kolmogorov. An experimental comparison of min-cut/max-\ufb02ow algorithms for energy\nminimization in vision. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(9):1124\u2013\n1137, 2004.\n\n[7] G. C\u02d8alinescu, H. Karloff, and Y. Rabani. An improved approximation algorithm for multiway cut. In\n\nProceedings of the thirtieth annual ACM Symposium on Theory of Computing, pages 48\u201352, 1998.\n\n[8] E. Cand`es and B. Recht. Exact matrix completion via convex optimization. Foundations of Computational\n\nMathematics, 9(6):717\u2013772, 2009.\n\n[9] J. Dean and S. Ghemawat. MapReduce: simpli\ufb01ed data processing on large clusters. Communications of\n\nthe ACM, 51(1):107\u2013113, 2008.\n\n[10] O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao. Optimal distributed online prediction using mini-\n\nbatches. Technical report, Microsoft Research, 2011.\n\n[11] A. Doan. http://dblife.cs.wisc.edu.\n[12] J. Duchi, A. Agarwal, and M. J. Wainwright. Distributed dual averaging in networks. In Advances in\n\nNeural Information Processing Systems, 2010.\n\n[13] S. H. Fuller and L. I. Millett, editors. The Future of Computing Performance: Game Over or Next\nLevel. Committee on Sustaining Growth in Computing Performance. The National Academies Press,\nWashington, D.C., 2011.\n\n[14] T. Joachims. Training linear svms in linear time. In Proceedings of the ACM Conference on Knowledge\n\nDiscovery and Data Mining (KDD), 2006.\n\n[15] J. Langford. https://github.com/JohnLangford/vowpal_wabbit/wiki.\n[16] J. Langford, A. J. Smola, and M. Zinkevich. Slow learners are fast. In Advances in Neural Information\n\nProcessing Systems, 2009.\n\n[17] J. Lee, , B. Recht, N. Srebro, R. R. Salakhutdinov, and J. A. Tropp. Practical large-scale optimization for\n\nmax-norm regularization. In Advances in Neural Information Processing Systems, 2010.\n\n[18] T. Lee, Z. Wang, H. Wang, and S. Hwang. Web scale entity resolution using relational evidence. 
Tech-\nnical report, Microsoft Research, 2011. Available at http://research.microsoft.com/apps/\npubs/default.aspx?id=145839.\n\n[19] D. Lewis, Y. Yang, T. Rose, and F. Li. RCV1: A new benchmark collection for text categorization\n\nresearch. Journal of Machine Learning Research, 5:361\u2013397, 2004.\n\n[20] S. Melnik, A. Gubarev, J. J. Long, G. Romer, S. Shivakumar, M. Tolton, and T. Vassilakis. Dremel:\n\nInteractive analysis of web-scale datasets. In Proceedings of VLDB, 2010.\n\n[21] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to\n\nstochastic programming. SIAM Journal on Optimization, 19(4):1574\u20131609, 2009.\n\n[22] F. Niu, B. Recht, C. R\u00b4e, and S. J. Wright. Hogwild!: A lock-free approach to parallelizing stochastic\n\ngradient descent. Technical report, 2011. arxiv.org/abs/1106.5730.\n\n[23] B. Recht, M. Fazel, and P. Parrilo. Guaranteed minimum rank solutions of matrix equations via nuclear\n\nnorm minimization. SIAM Review, 52(3):471\u2013501, 2010.\n\n[24] S. Shalev-Shwartz and N. Srebro. SVM Optimization: Inverse dependence on training set size. In Pro-\n\nceedings of the 25th Internation Conference on Machine Learning, 2008.\n\n[25] N. Srebro, J. Rennie, and T. Jaakkola. Maximum margin matrix factorization. In Advances in Neural\n\nInformation Processing Systems, 2004.\n\n[26] J. Tsitsiklis, D. P. Bertsekas, and M. Athans. Distributed asynchronous deterministic and stochastic gra-\n\ndient optimization algorithms. IEEE Transactions on Automatic Control, 31(9):803\u2013812, 1986.\n\n[27] M. Zinkevich, M. Weimer, A. Smola, and L. Li. Parallelized stochastic gradient descent. 
Advances in\n\nNeural Information Processing Systems, 2010.\n\n9\n\n\f", "award": [], "sourceid": 485, "authors": [{"given_name": "Benjamin", "family_name": "Recht", "institution": null}, {"given_name": "Christopher", "family_name": "Re", "institution": null}, {"given_name": "Stephen", "family_name": "Wright", "institution": null}, {"given_name": "Feng", "family_name": "Niu", "institution": null}]}