{"title": "Large Scale Distributed Deep Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 1223, "page_last": 1231, "abstract": "Recent work in unsupervised feature learning and deep learning has shown that  being able to train large models can dramatically improve performance.  In this  paper, we consider the problem of training a deep network with billions of  parameters using tens of thousands of CPU cores.  We have developed a  software framework called DistBelief that can utilize computing clusters  with thousands of machines to train large models.  Within this framework, we  have developed two algorithms for large-scale distributed training: (i) Downpour  SGD, an asynchronous stochastic gradient descent procedure supporting a  large number of model replicas, and (ii) Sandblaster, a framework that supports  for a variety of distributed batch optimization procedures, including a distributed  implementation of L-BFGS.  Downpour SGD and Sandblaster L-BFGS both  increase the scale and speed of deep network training.  We have successfully  used our system to train a deep network 100x larger than previously reported in  the literature, and achieves state-of-the-art performance on ImageNet, a visual  object recognition task with 16 million images and 21k categories.  We show that  these same techniques dramatically accelerate the training of a more modestly  sized deep network for a commercial speech recognition service. Although we  focus on and report performance of these methods as applied to training large  neural networks, the underlying algorithms are applicable to any gradient-based  machine learning algorithm.", "full_text": "Large Scale Distributed Deep Networks\n\nJeffrey Dean, Greg S. Corrado, Rajat Monga, Kai Chen,\n\nMatthieu Devin, Quoc V. Le, Mark Z. Mao, Marc\u2019Aurelio Ranzato,\n\nAndrew Senior, Paul Tucker, Ke Yang, Andrew Y. 
Ng\n\n{jeff, gcorrado}@google.com\n\nGoogle Inc., Mountain View, CA\n\nAbstract\n\nRecent work in unsupervised feature learning and deep learning has shown that be-\ning able to train large models can dramatically improve performance. In this paper,\nwe consider the problem of training a deep network with billions of parameters\nusing tens of thousands of CPU cores. We have developed a software framework\ncalled DistBelief that can utilize computing clusters with thousands of machines to\ntrain large models. Within this framework, we have developed two algorithms for\nlarge-scale distributed training: (i) Downpour SGD, an asynchronous stochastic\ngradient descent procedure supporting a large number of model replicas, and (ii)\nSandblaster, a framework that supports a variety of distributed batch optimization\nprocedures, including a distributed implementation of L-BFGS. Downpour SGD\nand Sandblaster L-BFGS both increase the scale and speed of deep network train-\ning. We have successfully used our system to train a deep network 30x larger than\npreviously reported in the literature, achieving state-of-the-art performance on\nImageNet, a visual object recognition task with 16 million images and 21k cate-\ngories. We show that these same techniques dramatically accelerate the training\nof a more modestly sized deep network for a commercial speech recognition ser-\nvice. Although we focus on and report performance of these methods as applied\nto training large neural networks, the underlying algorithms are applicable to any\ngradient-based machine learning algorithm.\n\n1\n\nIntroduction\n\nDeep learning and unsupervised feature learning have shown great promise in many practical ap-\nplications. 
State-of-the-art performance has been reported in several domains, ranging from speech\nrecognition [1, 2] and visual object recognition [3, 4] to text processing [5, 6].\nIt has also been observed that increasing the scale of deep learning, with respect to the number\nof training examples, the number of model parameters, or both, can drastically improve ultimate\nclassification accuracy [3, 4, 7]. These results have led to a surge of interest in scaling up the\ntraining and inference algorithms used for these models [8] and in improving applicable optimization\nprocedures [7, 9]. The use of GPUs [1, 2, 3, 8] is a significant advance in recent years that makes\nthe training of modestly sized deep networks practical. A known limitation of the GPU approach is\nthat the training speed-up is small when the model does not fit in GPU memory (typically less than\n6 gigabytes). To use a GPU effectively, researchers often reduce the size of the data or parameters\nso that CPU-to-GPU transfers are not a significant bottleneck. While data and parameter reduction\nwork well for small problems (e.g., acoustic modeling for speech recognition), they are less attractive\nfor problems with a large number of examples and dimensions (e.g., high-resolution images).\nIn this paper, we describe an alternative approach: using large-scale clusters of machines to distribute\ntraining and inference in deep networks. We have developed a software framework called DistBe-\nlief that enables model parallelism within a machine (via multithreading) and across machines (via\n\n1\n\n\fmessage passing), with the details of parallelism, synchronization and communication managed by\nthe framework. In addition to supporting model parallelism, the DistBelief framework also supports\ndata parallelism, where multiple replicas of a model are used to optimize a single objective. 
Within\nthis framework, we have designed and implemented two novel methods for large-scale distributed\ntraining: (i) Downpour SGD, an asynchronous stochastic gradient descent procedure which lever-\nages adaptive learning rates and supports a large number of model replicas, and (ii) Sandblaster\nL-BFGS, a distributed implementation of L-BFGS that uses both data and model parallelism.1 Both\nDownpour SGD and Sandblaster L-BFGS enjoy signi\ufb01cant speed gains compared to more conven-\ntional implementations of SGD and L-BFGS.\nOur experiments reveal several surprising results about large-scale nonconvex optimization. Firstly,\nasynchronous SGD, rarely applied to nonconvex problems, works very well for training deep net-\nworks, particularly when combined with Adagrad [10] adaptive learning rates. Secondly, we show\nthat given suf\ufb01cient resources, L-BFGS is competitive with or faster than many variants of SGD.\nWith regard to speci\ufb01c applications in deep learning, we report two main \ufb01ndings: that our dis-\ntributed optimization approach can both greatly accelerate the training of modestly sized models,\nand that it can also train models that are larger than could be contemplated otherwise. To illustrate\nthe \ufb01rst point, we show that we can use a cluster of machines to train a modestly sized speech model\nto the same classi\ufb01cation accuracy in less than 1/10th the time required on a GPU. To illustrate the\nsecond point, we trained a large neural network of more than 1 billion parameters and used this\nnetwork to drastically improve on state-of-the-art performance on the ImageNet dataset, one of the\nlargest datasets in computer vision.\n\n2 Previous work\n\nIn recent years commercial and academic machine learning data sets have grown at an unprece-\ndented pace. In response, a great many authors have explored scaling up machine learning algo-\nrithms through parallelization and distribution [11, 12, 13, 14, 15, 16, 17]. 
Much of this research has\nfocused on linear, convex models, where distributed gradient computation is the natural \ufb01rst step.\nWithin this area, some groups have relaxed synchronization requirements, exploring delayed gradi-\nent updates for convex problems [12, 17]. In parallel, other groups working on problems with sparse\ngradients (problems where only a tiny fraction of the coordinates of the gradient vector are non-zero\nfor any given training example) have explored lock-less asynchronous stochastic gradient descent\non shared-memory architectures (i.e. single machines) [5, 18]. We are interested in an approach\nthat captures the best of both worlds, allowing the use of a cluster of machines asynchronously\ncomputing gradients, but without requiring that the problem be either convex or sparse.\nIn the context of deep learning, most work has focused on training relatively small models on a single\nmachine (e.g., Theano [19]). Suggestions for scaling up deep learning include the use of a farm of\nGPUs to train a collection of many small models and subsequently averaging their predictions [20],\nor modifying standard deep networks to make them inherently more parallelizable [21]. Our focus\nis scaling deep learning techniques in the direction of training very large models, those with a few\nbillion parameters, but without introducing restrictions on the form of the model. In special cases\nwhere one layer dominates computation, some authors have considered distributing computation in\nthat one layer and replicating computation in the remaining layers [5]. But in the general case where\nmany layers of the model are computationally intensive, full model parallelism in a spirit similar\nto [22] is required. 
To be successful, however, we believe that model parallelism must be combined\nwith clever distributed optimization techniques that leverage data parallelism.\nWe considered a number of existing large-scale computational tools for application to our prob-\nlem, MapReduce [23] and GraphLab [24] being notable examples. We concluded that MapRe-\nduce, designed for parallel data processing, was ill-suited for the iterative computations inherent in\ndeep network training; whereas GraphLab, designed for general (unstructured) graph computations,\nwould not exploit computing ef\ufb01ciencies available in the structured graphs typically found in deep\nnetworks.\n\n1We implemented L-BFGS within the Sandblaster framework, but the general approach is also suitable for\n\na variety of other batch optimization methods.\n\n2\n\n\fFigure 1: An example of model parallelism in DistBelief. A \ufb01ve layer deep neural network with\nlocal connectivity is shown here, partitioned across four machines (blue rectangles). Only those\nnodes with edges that cross partition boundaries (thick lines) will need to have their state transmitted\nbetween machines. Even in cases where a node has multiple edges crossing a partition boundary,\nits state is only sent to the machine on the other side of that boundary once. 
Within each partition,\ncomputation for individual nodes will be parallelized across all available CPU cores.\n\n3 Model parallelism\n\nTo facilitate the training of very large deep networks, we have developed a software framework,\nDistBelief, that supports distributed computation in neural networks and layered graphical models.\nThe user defines the computation that takes place at each node in each layer of the model, and the\nmessages that should be passed during the upward and downward phases of computation.2 For\nlarge models, the user may partition the model across several machines (Figure 1), so that respon-\nsibility for the computation for different nodes is assigned to different machines. The framework\nautomatically parallelizes computation in each machine using all available cores, and manages com-\nmunication, synchronization and data transfer between machines during both training and inference.\nThe performance benefits of distributing a deep network across multiple machines depend on the\nconnectivity structure and computational needs of the model. Models with a large number of param-\neters or high computational demands typically benefit from access to more CPUs and memory, up\nto the point where communication costs dominate. We have successfully run large models with up\nto 144 partitions in the DistBelief framework with significant speedups, while more modestly sized\nmodels show decent speedups for up to 8 or 16 partitions. (See Section 5, under the heading Model\nParallelism Benchmarks, for experimental results.) Obviously, models with local connectivity struc-\ntures tend to be more amenable to extensive distribution than fully-connected structures, given their\nlower communication requirements. The typical cause of less-than-ideal speedups is variance in\nprocessing times across the different machines, leading to many machines waiting for the single\nslowest machine to finish a given phase of computation. 
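The cross-boundary communication rule described above can be illustrated with a small sketch (a hypothetical helper, not part of DistBelief's actual API; the node names and machine assignment below are invented for the example):

```python
# Hypothetical sketch (not DistBelief's API): given a partitioned layered graph,
# compute which node states must be transmitted across machine boundaries.
# A node's state goes to a given remote machine at most once, even if several
# of its outgoing edges land on that machine.

def cross_machine_transfers(edges, assignment):
    """edges: (src, dst) pairs; assignment: node -> machine id.
    Returns the set of (node, remote_machine) state transfers required."""
    transfers = set()
    for src, dst in edges:
        if assignment[src] != assignment[dst]:
            # The set deduplicates repeated edges into the same remote machine.
            transfers.add((src, assignment[dst]))
    return transfers

# Toy example: node "a" has two edges into machine 1, but its state is sent once.
edges = [("a", "c"), ("a", "d"), ("b", "e")]
assignment = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 0}
print(cross_machine_transfers(edges, assignment))  # {('a', 1)}
```

The deduplication in the set mirrors the point made in Figure 1's caption: a node with multiple edges crossing the same partition boundary still sends its state across that boundary only once.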
Nonetheless, for our largest models, we can\nef\ufb01ciently use 32 machines where each machine achieves an average CPU utilization of 16 cores, for\na total of 512 CPU cores training a single large neural network. When combined with the distributed\noptimization algorithms described in the next section, which utilize multiple replicas of the entire\nneural network, it is possible to use tens of thousands of CPU cores for training a single model,\nleading to signi\ufb01cant reductions in overall training times.\n\n4 Distributed optimization algorithms\n\nParallelizing computation within the DistBelief framework allows us to instantiate and run neural\nnetworks considerably larger than have been previously reported. But in order to train such large\nmodels in a reasonable amount of time, we need to parallelize computation not only within a single\n\n2In the case of a neural network \u2018upward\u2019 and \u2018downward\u2019 might equally well be called \u2018feedforward\u2019 and\n\n\u2018backprop\u2019, while for a Hidden Markov Model, they might be more familiar as \u2018forward\u2019 and \u2018backward\u2019.\n\n3\n\n\fFigure 2: Left: Downpour SGD. Model replicas asynchronously fetch parameters w and push gra-\ndients \u2206w to the parameter server. Right: Sandblaster L-BFGS. A single \u2018coordinator\u2019 sends small\nmessages to replicas and the parameter server to orchestrate batch optimization.\n\ninstance of the model, but to distribute training across multiple model instances. In this section we\ndescribe this second level of parallelism, where we employ a set of DistBelief model instances, or\nreplicas, to simultaneously solve a single optimization problem.\nWe present a comparison of two large-scale distributed optimization procedures: Downpour SGD,\nan online method, and Sandblaster L-BFGS, a batch method. 
Both methods leverage the concept\nof a centralized sharded parameter server, which model replicas use to share their parameters. Both\nmethods take advantage of the distributed computation DistBelief allows within each individual\nreplica. But most importantly, both methods are designed to tolerate variance in the processing\nspeed of different model replicas, and even the wholesale failure of model replicas which may be\ntaken of\ufb02ine or restarted at random.\nIn a sense, these two optimization algorithms implement an intelligent version of data parallelism.\nBoth approaches allow us to simultaneously process distinct training examples in each of the many\nmodel replicas, and periodically combine their results to optimize our objective function.\n\n4.1 Downpour SGD\n\nStochastic gradient descent (SGD) is perhaps the most commonly used optimization procedure for\ntraining deep neural networks [25, 26, 3]. Unfortunately, the traditional formulation of SGD is\ninherently sequential, making it impractical to apply to very large data sets where the time required\nto move through the data in an entirely serial fashion is prohibitive.\nTo apply SGD to large data sets, we introduce Downpour SGD, a variant of asynchronous stochas-\ntic gradient descent that uses multiple replicas of a single DistBelief model. The basic approach is\nas follows: We divide the training data into a number of subsets and run a copy of the model on\neach of these subsets. The models communicate updates through a centralized parameter server,\nwhich keeps the current state of all parameters for the model, sharded across many machines (e.g.,\nif we have 10 parameter server shards, each shard is responsible for storing and applying updates\nto 1/10th of the model parameters) (Figure 2). 
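The parameter-sharding rule in the parenthetical above can be made concrete with a toy sketch; the modulo mapping is an assumed, illustrative choice, since the paper does not specify the exact index-to-shard assignment:

```python
# Toy illustration of the sharding rule above: with S parameter server shards,
# each shard owns a fixed 1/S of the parameter indices. The modulo mapping is
# an assumption for illustration; the paper does not specify the assignment.

def shard_of(param_index, num_shards):
    return param_index % num_shards

NUM_SHARDS = 10
owners = [shard_of(i, NUM_SHARDS) for i in range(100)]
counts = [owners.count(s) for s in range(NUM_SHARDS)]
print(counts)  # each of the 10 shards owns 1/10th of the 100 parameters
```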
This approach is asynchronous in two distinct as-\npects: the model replicas run independently of each other, and the parameter server shards also run\nindependently of one another.\nIn the simplest implementation, before processing each mini-batch, a model replica asks the pa-\nrameter server service for an updated copy of its model parameters. Because DistBelief models\nare themselves partitioned across multiple machines, each machine needs to communicate with just\nthe subset of parameter server shards that hold the model parameters relevant to its partition. After\nreceiving an updated copy of its parameters, the DistBelief model replica processes a mini-batch of\ndata to compute a parameter gradient, and sends the gradient to the parameter server, which then\napplies the gradient to the current value of the model parameters.\nIt is possible to reduce the communication overhead of Downpour SGD by limiting each model\nreplica to request updated parameters only every n_fetch steps and send updated gradient values only\nevery n_push steps (where n_fetch might not be equal to n_push). In fact, the process of fetching\n\n4\n\n[Figure 2 detail: each replica fetches parameters w from the parameter server and pushes gradients \u0394w; the server applies w\u2019 = w - \u03b7\u0394w]\n\n\fparameters, pushing gradients, and processing training data can be carried out in three only weakly\nsynchronized threads (see the Appendix for pseudocode). In the experiments reported below we\nfixed n_fetch = n_push = 1 for simplicity and ease of comparison to traditional SGD.\nDownpour SGD is more robust to machine failures than standard (synchronous) SGD. For syn-\nchronous SGD, if one machine fails, the entire training process is delayed; whereas for asynchronous\nSGD, if one machine in a model replica fails, the other model replicas continue processing their\ntraining data and updating the model parameters via the parameter servers. 
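The replica loop described above (fetch parameters every n_fetch steps, push accumulated gradients every n_push steps) might look roughly like the following single-process sketch; the parameter-server stub, the fixed learning rate, and grad_fn are stand-ins, not the paper's actual interfaces:

```python
# Single-process sketch of the Downpour replica loop. The ParameterServerStub
# class, the learning rate eta, and grad_fn are illustrative assumptions, not
# the paper's interfaces; n_fetch = n_push = 1 recovers the simple case.
import numpy as np

class ParameterServerStub:
    def __init__(self, dim):
        self.w = np.zeros(dim)
    def fetch(self):
        return self.w.copy()
    def push(self, grad, eta=0.1):
        self.w -= eta * grad  # the server applies the update to its own state

def downpour_replica(server, batches, grad_fn, n_fetch=1, n_push=1):
    w = server.fetch()
    acc = np.zeros_like(w)  # gradient accumulated locally between pushes
    for step, batch in enumerate(batches, start=1):
        if step % n_fetch == 0:
            w = server.fetch()  # parameters may already be slightly stale
        acc += grad_fn(w, batch)
        if step % n_push == 0:
            server.push(acc)
            acc[:] = 0.0

# Toy objective: each batch b contributes gradient (w - b), so the server's
# parameter should drift toward b = 1.
server = ParameterServerStub(dim=1)
downpour_replica(server, [np.array([1.0])] * 100, grad_fn=lambda w, b: w - b)
```

With several replicas running this loop concurrently against a shared server, the gradients applied would be computed from slightly stale parameters, which is exactly the source of stochasticity the text goes on to discuss.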
On the other hand, the\nmultiple forms of asynchronous processing in Downpour SGD introduce a great deal of additional\nstochasticity in the optimization procedure. Most obviously, a model replica is almost certainly\ncomputing its gradients based on a set of parameters that are slightly out of date, in that some other\nmodel replica will likely have updated the parameters on the parameter server in the meantime. But\nthere are several other sources of stochasticity beyond this: Because the parameter server shards act\nindependently, there is no guarantee that at any given moment the parameters on each shard of the\nparameter server have undergone the same number of updates, or that the updates were applied in\nthe same order. Moreover, because the model replicas are permitted to fetch parameters and push\ngradients in separate threads, there may be additional subtle inconsistencies in the timestamps of\nparameters. There is little theoretical grounding for the safety of these operations for nonconvex\nproblems, but in practice we found relaxing consistency requirements to be remarkably effective.\nOne technique that we have found to greatly increase the robustness of Downpour SGD is the use\nof the Adagrad [10] adaptive learning rate procedure. Rather than using a single fixed learning\nrate on the parameter server (\u03b7 in Figure 2), Adagrad uses a separate adaptive learning rate for each\nparameter. Let \u03b7i,K be the learning rate of the i-th parameter at iteration K and \u2206wi,K its gradient,\nthen we set: \u03b7i,K = \u03b3/\u221a(\u2211j=1..K \u2206wi,j\u00b2). Because these learning rates are computed only from the\nsummed squared gradients of each parameter, Adagrad is easily implemented locally within each\nparameter server shard. 
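As a sketch, this per-parameter rule can be implemented inside a single shard using only a running sum of squared gradients; the class name and the small eps term guarding against division by zero are our additions:

```python
# Sketch of the per-parameter Adagrad rule quoted above, as one parameter
# server shard could apply it locally: only the running sum of squared
# gradients for the parameters this shard owns is needed. The class name and
# the eps guard against division by zero are our additions, not the paper's.
import numpy as np

class AdagradShard:
    def __init__(self, dim, gamma=0.01, eps=1e-8):
        self.w = np.zeros(dim)
        self.sq_sum = np.zeros(dim)  # running sum of squared gradients
        self.gamma = gamma           # the constant scaling factor gamma
        self.eps = eps
    def apply_gradient(self, grad):
        self.sq_sum += grad ** 2
        eta = self.gamma / (np.sqrt(self.sq_sum) + self.eps)  # per-parameter rate
        self.w -= eta * grad

shard = AdagradShard(dim=2, gamma=0.01)
shard.apply_gradient(np.array([3.0, 4.0]))
# The first update moves each coordinate by roughly gamma * sign(gradient).
```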
The value of \u03b3, the constant scaling factor for all learning rates, is generally\nlarger (perhaps by an order of magnitude) than the best fixed learning rate used without Adagrad.\nThe use of Adagrad extends the maximum number of model replicas that can productively work\nsimultaneously, and combined with a practice of \u201cwarmstarting\u201d model training with only a single\nmodel replica before unleashing the other replicas, it has virtually eliminated stability concerns in\ntraining deep networks using Downpour SGD (see results in Section 5).\n\n4.2 Sandblaster L-BFGS\n\nBatch methods have been shown to work well in training small deep networks [7]. To apply these\nmethods to large models and large datasets, we introduce the Sandblaster batch optimization frame-\nwork and discuss an implementation of L-BFGS using this framework.\nA key idea in Sandblaster is distributed parameter storage and manipulation. The core of the opti-\nmization algorithm (e.g., L-BFGS) resides in a coordinator process (Figure 2), which does not have\ndirect access to the model parameters.\nInstead, the coordinator issues commands drawn from a\nsmall set of operations (e.g., dot product, scaling, coefficient-wise addition, multiplication) that can\nbe performed by each parameter server shard independently, with the results being stored locally\non the same shard. Additional information, e.g., the history cache for L-BFGS, is also stored on the\nparameter server shard on which it was computed. This allows running large models (billions of\nparameters) without incurring the overhead of sending all the parameters and gradients to a single\ncentral server. (See the Appendix for pseudocode.)\nIn typical parallelized implementations of L-BFGS, data is distributed to many machines and each\nmachine is responsible for computing the gradient on a speci\ufb01c subset of data examples. 
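The coordinator pattern just described (small commands, shard-local vector arithmetic, only scalars returned) can be sketched as follows; Shard and coordinator_dot are illustrative names, not Sandblaster's real interface:

```python
# Illustrative sketch (not Sandblaster's real interface) of the coordinator
# pattern described above: the coordinator never holds full vectors; each shard
# runs small vector operations on its local slices, and only scalars (e.g.,
# partial dot products) travel back to the coordinator.
import numpy as np

class Shard:
    """Owns a slice of every named vector (parameters, gradients, history)."""
    def __init__(self, **vecs):
        self.vecs = {k: np.asarray(v, dtype=float) for k, v in vecs.items()}
    def axpy(self, out, a, x, y):
        self.vecs[out] = a * self.vecs[x] + self.vecs[y]  # shard-local update
    def partial_dot(self, x, y):
        return float(self.vecs[x] @ self.vecs[y])         # small scalar reply

def coordinator_dot(shards, x, y):
    # The coordinator only sums per-shard scalars; no full vector ever moves.
    return sum(s.partial_dot(x, y) for s in shards)

# Two shards each hold half of x = [1, 2, 3, 4] and y = [1, 1, 1, 1].
shards = [Shard(x=[1.0, 2.0], y=[1.0, 1.0]), Shard(x=[3.0, 4.0], y=[1.0, 1.0])]
print(coordinator_dot(shards, "x", "y"))  # 10.0
```

Because operations like axpy run entirely within a shard and dot products reduce to one scalar per shard, the coordinator's messages stay small regardless of model size, which is what lets this design scale to billions of parameters.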
The gra-\ndients are sent back to a central server (or aggregated via a tree [16]). Many such methods wait for\nthe slowest machine, and therefore do not scale well to large shared clusters. To mitigate this\nproblem, we employ the following load balancing scheme: The coordinator assigns each of the N\nmodel replicas a small portion of work, much smaller than 1/Nth of the total size of a batch, and\nassigns replicas new portions whenever they are free. With this approach, faster model replicas do\nmore work than slower replicas. To further manage slow model replicas at the end of a batch, the\ncoordinator schedules multiple copies of the outstanding portions and uses the result from whichever\nmodel replica finishes first. This scheme is similar to the use of \u201cbackup tasks\u201d in the MapReduce\nframework [23]. Prefetching of data, along with supporting data affinity by assigning sequential\n\n5\n\n\fportions of data to the same worker, makes data access a non-issue. In contrast with Downpour\nSGD, which requires relatively high-frequency, high-bandwidth parameter synchronization with the\nparameter server, Sandblaster workers only fetch parameters at the beginning of each batch (when\nthey have been updated by the coordinator), and only send the gradients every few completed por-\ntions (to protect against replica failures and restarts).\n\n5 Experiments\n\nWe evaluated our optimization algorithms by applying them to training models for two different deep\nlearning problems: object recognition in still images and acoustic processing for speech recognition.\nThe speech recognition task was to classify the central region (or frame) in a short snippet of audio as\none of several thousand acoustic states. We used a deep network with five layers: four hidden layers\nwith sigmoidal activations and 2560 nodes each, and a softmax output layer with 8192 nodes. 
The\ninput representation was 11 consecutive overlapping 25 ms frames of speech, each represented by\n40 log-energy values. The network was fully-connected layer-to-layer, for a total of approximately\n42 million model parameters. We trained on a data set of 1.1 billion weakly labeled examples,\nand evaluated on a hold out test set. See [27] for similar deep network con\ufb01gurations and training\nprocedures.\nFor visual object recognition we trained a larger neural network with locally-connected receptive\n\ufb01elds on the ImageNet data set of 16 million images, each of which we scaled to 100x100 pixels [28].\nThe network had three stages, each composed of \ufb01ltering, pooling and local contrast normalization,\nwhere each node in the \ufb01ltering layer was connected to a 10x10 patch in the layer below. Our\ninfrastructure allows many nodes to connect to the same input patch, and we ran experiments varying\nthe number of identically connected nodes from 8 to 36. The output layer consisted of 21 thousand\none-vs-all logistic classi\ufb01er nodes, one for each of the ImageNet object categories. See [29] for\nsimilar deep network con\ufb01gurations and training procedures.\n\nModel parallelism benchmarks: To explore the scaling behavior of DistBelief model parallelism\n(Section 3), we measured the mean time to process a single mini-batch for simple SGD training as\na function of the number of partitions (machines) used in a single model instance. In Figure 3 we\nquantify the impact of parallelizing across N machines by reporting the average training speed-up:\nthe ratio of the time taken using only a single machine to the time taken using N. Speedups for\ninference steps in these models are similar and are not shown here.\nThe moderately sized speech model runs fastest on 8 machines, computing 2.2\u00d7 faster than using a\nsingle machine. (Models were con\ufb01gured to use no more than 20 cores per machine.) 
Partitioning\n\nFigure 3: Training speed-up for four different deep networks as a function of machines allocated\nto a single DistBelief model instance. Models with more parameters benefit more from the use of\nadditional machines than do models with fewer parameters.\n\n6\n\n[Figure 3 plot: x-axis, machines per model instance (1 to 128); y-axis, training speed-up; curves for Speech (42M parameters) and Images (80M, 330M, and 1.7B parameters)]\n\n\fFigure 4: Left: Training accuracy (on a portion of the training set) for different optimization meth-\nods. Right: Classification accuracy on the held-out test set as a function of training time. Downpour\nand Sandblaster experiments initialized using the same \u223c10 hour warmstart of simple SGD.\n\nthe model on more than 8 machines actually slows training, as network overhead starts to dominate\nin the fully-connected network structure and there is less work for each machine to perform with\nmore partitions.\nIn contrast, the much larger, locally-connected image models can benefit from using many more\nmachines per model replica. The largest model, with 1.7 billion parameters, benefits the most, giving\na speedup of more than 12\u00d7 using 81 machines. For these large models, using more machines\ncontinues to increase speed, but with diminishing returns.\n\nOptimization method comparisons: To evaluate the proposed distributed optimization proce-\ndures, we ran the speech model described above in a variety of configurations. We consider two\nbaseline optimization procedures: training a DistBelief model (on 8 partitions) using conventional\n(single replica) SGD, and training the identical model on a GPU using CUDA [27]. 
The three dis-\ntributed optimization methods we compare to these baseline methods are: Downpour SGD with a\nfixed learning rate, Downpour SGD with Adagrad learning rates, and Sandblaster L-BFGS.\nFigure 4 shows classification performance as a function of training time for each of these methods\non both the training and test sets. Our goal is to obtain the maximum test set accuracy in the\nminimum amount of training time, regardless of resource requirements. Conventional single replica\nSGD (black curves) is the slowest to train. Downpour SGD with 20 model replicas (blue curves)\nshows a significant improvement. Downpour SGD with 20 replicas plus Adagrad (orange curve)\nis modestly faster. Sandblaster L-BFGS using 2000 model replicas (green curves) is considerably\nfaster yet again. The fastest, however, is Downpour SGD plus Adagrad with 200 model replicas (red\ncurves). Given access to sufficient CPU resources, both Sandblaster L-BFGS and Downpour SGD\nwith Adagrad can train models substantially faster than a high-performance GPU.\nThough we did not confine the above experiments to a fixed resource budget, it is interesting to\nconsider how the various methods trade off resource consumption for performance. We analyze\nthis by arbitrarily choosing a fixed test set accuracy (16%), and measuring the time each method\ntook to reach that accuracy as a function of machines and utilized CPU cores (Figure 5). One of the\nfour points on each trace corresponds to a training configuration shown in Figure 4; the other three\npoints are alternative configurations.\nIn this plot, points closer to the origin are preferable in that they take less time while using fewer re-\nsources. 
In this regard, Downpour SGD using Adagrad appears to be the best trade-off: For any fixed\nbudget of machines or cores, Downpour SGD with Adagrad takes less time to reach the accuracy\ntarget than either Downpour SGD with a fixed learning rate or Sandblaster L-BFGS. For any allotted\ntraining time to reach the accuracy target, Downpour SGD with Adagrad used fewer resources than\nSandblaster L-BFGS, and in many cases Downpour SGD with a fixed learning rate could not even\nreach the target within the deadline. The Sandblaster L-BFGS system does show promise in terms\n\n7\n\n[Figure 4 plots: x-axis, time (hours); y-axis, average frame accuracy (%) on the training set (left) and test set (right); curves for SGD [1], GPU [1], DownpourSGD [20], DownpourSGD [20] w/Adagrad, DownpourSGD [200] w/Adagrad, and Sandblaster L-BFGS [2000]]\n\n\fFigure 5: Time to reach a fixed accuracy (16%) for different optimization strategies as a function of\nthe number of machines (left) and cores (right).\n\nof its scaling with additional cores, suggesting that it may ultimately produce the fastest training\ntimes if used with an extremely large resource budget (e.g., 30k cores).\n\nApplication to ImageNet: The previous experiments demonstrate that our techniques can accel-\nerate the training of neural networks with tens of millions of parameters. However, the more sig-\nnificant advantage of our cluster-based approach to distributed optimization is its ability to scale to\nmodels that are much larger than can be comfortably fit on a single machine, let alone a single GPU.\nAs a first step toward exploring the capabilities of very large neural networks, we used Downpour\nSGD to train the 1.7 billion parameter image model described above on the ImageNet object classi-\nfication task. 
As detailed in [29], this network achieved a cross-validated classification accuracy of\nover 15%, a relative improvement of over 60% over the best performance we are aware of on the\n21k-category ImageNet classification task.\n\n6 Conclusions\n\nIn this paper we introduced DistBelief, a framework for parallel distributed training of deep net-\nworks. Within this framework, we discovered several effective distributed optimization strategies.\nWe found that Downpour SGD, a highly asynchronous variant of SGD, works surprisingly well for\ntraining nonconvex deep learning models. Sandblaster L-BFGS, a distributed implementation of\nL-BFGS, can be competitive with SGD, and its more efficient use of network bandwidth enables it\nto scale to a larger number of concurrent cores for training a single model. That said, the combi-\nnation of Downpour SGD with the Adagrad adaptive learning rate procedure emerges as the clearly\ndominant method when working with a computational budget of 2000 CPU cores or less.\nAdagrad was not originally designed to be used with asynchronous SGD, and neither method is\ntypically applied to nonconvex problems. It is surprising, therefore, that they work so well together,\nand on highly nonlinear deep networks. We conjecture that Adagrad automatically stabilizes volatile\nparameters in the face of the flurry of asynchronous updates, and naturally adjusts learning rates to\nthe demands of different layers in the deep network.\nOur experiments show that our new large-scale training methods can use a cluster of machines to\ntrain even modestly sized deep networks significantly faster than a GPU, and without the GPU\u2019s\nlimitation on the maximum size of the model. 
To demonstrate the value of being able to train larger models, we have trained a model with over 1 billion parameters to achieve better than state-of-the-art performance on the ImageNet object recognition challenge.

Acknowledgments

The authors would like to thank Samy Bengio, Tom Dean, John Duchi, Yuval Netzer, Patrick Nguyen, Yoram Singer, Sebastian Thrun, and Vincent Vanhoucke for their indispensable advice, support, and comments.

[Figure 5 plots: time (hours) to 16% accuracy vs. number of machines and cores, comparing Downpour SGD, Downpour SGD w/Adagrad, Sandblaster L-BFGS, and a GPU (CUDA cores) baseline.]

References

[1] G. Dahl, D. Yu, L. Deng, and A. Acero. Context-dependent pre-trained deep neural networks for large vocabulary speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 2012.

[2] G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, and B. Kingsbury. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine, 2012.

[3] D. C. Ciresan, U. Meier, L. M. Gambardella, and J. Schmidhuber. Deep big simple neural nets excel on handwritten digit recognition. CoRR, 2010.

[4] A. Coates, H. Lee, and A. Y. Ng. An analysis of single-layer networks in unsupervised feature learning. In AISTATS 14, 2011.

[5] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155, 2003.

[6] R. Collobert and J. Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In ICML, 2008.

[7] Q. V. Le, J. Ngiam, A. Coates, A. Lahiri, B. Prochnow, and A. Y. Ng. On optimization methods for deep learning. In ICML, 2011.

[8] R. Raina, A. Madhavan, and A. Y. Ng.
Large-scale deep unsupervised learning using graphics processors. In ICML, 2009.

[9] J. Martens. Deep learning via Hessian-free optimization. In ICML, 2010.

[10] J. C. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.

[11] Q. Shi, J. Petterson, G. Dror, J. Langford, A. Smola, A. Strehl, and V. Vishwanathan. Hash kernels. In AISTATS, 2009.

[12] J. Langford, A. Smola, and M. Zinkevich. Slow learners are fast. In NIPS, 2009.

[13] G. Mann, R. McDonald, M. Mohri, N. Silberman, and D. Walker. Efficient large-scale distributed training of conditional maximum entropy models. In NIPS, 2009.

[14] R. McDonald, K. Hall, and G. Mann. Distributed training strategies for the structured perceptron. In NAACL, 2010.

[15] M. Zinkevich, M. Weimer, A. Smola, and L. Li. Parallelized stochastic gradient descent. In NIPS, 2010.

[16] A. Agarwal, O. Chapelle, M. Dudik, and J. Langford. A reliable effective terascale linear learning system. In AISTATS, 2011.

[17] A. Agarwal and J. Duchi. Distributed delayed stochastic optimization. In NIPS, 2011.

[18] F. Niu, B. Recht, C. Ré, and S. J. Wright. Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. In NIPS, 2011.

[19] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley, and Y. Bengio. Theano: a CPU and GPU math expression compiler. In SciPy, 2010.

[20] D. Ciresan, U. Meier, and J. Schmidhuber. Multi-column deep neural networks for image classification. Technical report, IDSIA, 2012.

[21] A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, U. Toronto, 2009. [Note: numbering follows the original list.]

[21] L. Deng, D. Yu, and J. Platt. Scalable stacking and learning for building deep architectures. In ICASSP, 2012.

[22] A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, U. Toronto, 2009.

[23] J. Dean and S. Ghemawat.
MapReduce: Simplified data processing on large clusters. CACM, 2008.

[24] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. Hellerstein. Distributed GraphLab: A framework for machine learning in the cloud. In VLDB, 2012.

[25] L. Bottou. Stochastic gradient learning in neural networks. In Proceedings of Neuro-Nîmes 91, 1991.

[26] Y. LeCun, L. Bottou, G. Orr, and K. Muller. Efficient backprop. In Neural Networks: Tricks of the Trade. Springer, 1998.

[27] V. Vanhoucke, A. Senior, and M. Z. Mao. Improving the speed of neural networks on CPUs. In Deep Learning and Unsupervised Feature Learning Workshop, NIPS 2011, 2011.

[28] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.

[29] Q. V. Le, M. A. Ranzato, R. Monga, M. Devin, K. Chen, G. S. Corrado, J. Dean, and A. Y. Ng. Building high-level features using large scale unsupervised learning. In ICML, 2012.