{"title": "Pipe-SGD: A Decentralized Pipelined SGD Framework for Distributed Deep Net Training", "book": "Advances in Neural Information Processing Systems", "page_first": 8045, "page_last": 8056, "abstract": "Distributed training of deep nets is an important technique to address some of the present day computing challenges like memory consumption and computational demands. Classical distributed approaches, synchronous or asynchronous, are based on the parameter server architecture, i.e., worker nodes compute gradients which are communicated to the parameter server while updated parameters are returned. Recently, distributed training with AllReduce operations gained popularity as well. While many of those operations seem appealing, little is reported about wall-clock training time improvements. In this paper, we carefully analyze the AllReduce based setup, propose timing models which include network latency, bandwidth, cluster size and compute time, and demonstrate that a pipelined training with a width of two combines the best of both synchronous and asynchronous training. Specifically, for a setup consisting of a four-node GPU cluster we show wall-clock time training improvements of up to 5.4x compared to conventional approaches.", "full_text": "Pipe-SGD: A Decentralized Pipelined SGD\n\nFramework for Distributed Deep Net Training\n\nYoujie Li\u2020, Mingchao Yu*, Songze Li*, Salman Avestimehr*,\n\nNam Sung Kim\u2020, and Alexander Schwing\u2020\n\n\u2020University of Illinois at Urbana-Champaign\n\n*University of Southern California\n\nAbstract\n\nDistributed training of deep nets is an important technique to address some of the\npresent day computing challenges like memory consumption and computational de-\nmands. 
Classical distributed approaches, synchronous or asynchronous, are based on the parameter server architecture, i.e., worker nodes compute gradients which are communicated to the parameter server while updated parameters are returned. Recently, distributed training with AllReduce operations gained popularity as well. While many of those operations seem appealing, little is reported about wall-clock training time improvements. In this paper, we carefully analyze the AllReduce based setup, propose timing models which include network latency, bandwidth, cluster size and compute time, and demonstrate that a pipelined training with a width of two combines the best of both synchronous and asynchronous training. Specifically, for a setup consisting of a four-node GPU cluster we show wall-clock time training improvements of up to 5.4× compared to conventional approaches.

1 Introduction

Deep nets [25, 3] are omnipresent across fields from computer vision and natural language processing to computational biology and robotics. Across domains and tasks they have demonstrated impressive results by automatically extracting hierarchical abstractions of representations from many different datasets. The surge in popularity pivoted in the 2010s, with impressive results being demonstrated on the ImageNet dataset [22, 42]. Since then, deep nets have been applied to many more tasks. Prominent examples include recognition of places [53], playing of Atari games [34, 35], and the game of Go [45].
Common to all those methods is the use of large datasets to fuel the many layers of deep nets.

Importantly, in the last few years, the number of layers, or more generally the depth of the computation tree, has increased significantly from a few layers for LeNet [26] to several 100s or 1000s [14, 24]. Inherent to the increasing complexity of the computation graph is an increase in training time and often also an increase in the amount of data that is processed. Traditionally, computational performance increases do not keep up with the desired processing needs despite the use of accelerators like GPUs. Beyond accelerators, parallelization of computation across multiple computers is therefore popular. However, it requires frequent communication to exchange a large amount of data among compute nodes while the bandwidth of network interfaces is limited. This in turn significantly diminishes the benefit of parallelization, as a substantial fraction of training time is spent communicating data. The fraction of time spent on communication is further increased when applying accelerators [16, 7, 38, 52, 48, 49, 44], as they decrease computation time while leaving communication time untouched.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

To take advantage of parallelization across machines, a variety of approaches have been developed, starting from the popular MapReduce paradigm [9, 51, 19, 37]. Despite their benefits, communication-heavy training of deep nets is often based on custom implementations [8, 6, 36, 20] relying on the parameter server architecture [28, 27, 15], where the centralized server aggregates the gradients from workers and distributes the updated weights, either in a synchronous or asynchronous manner. Recent research proposed to use a decentralized architecture with global synchronization among nodes [12, 33].
However, common to all the aforementioned techniques is that little is reported regarding the timing analysis of distributed deep net training.

In this paper, we analyze the wall-clock time trade-offs between communication and computation. To this end we develop a model to assess the training time based on a set of parameters such as latency, cluster size, network bandwidth, model size, etc. Based on the results of our model we develop Pipe-SGD, a framework with pipelined training and balanced communication, and show its convergence properties by adjusting the proofs of [23, 15]. We also show what types of compression can be efficiently included in an AllReduce-based framework. Finally, we assess the speedups of our proposed approach on a GPU cluster of four nodes with a 10GbE network, showing wall-clock time training improvements by a factor of 3.2 ∼ 5.4× compared to conventional centralized and decentralized approaches, without degradation in accuracy.

2 Background

General Training of Deep Nets: Training of deep nets involves finding the parameters w of a predictor F(x,w) given input data x. To this end we minimize a loss function ℓ(F(x,w),y) which compares the predictor output F(x,w), for given data x and the current w, to the ground-truth annotation y. Given a dataset D = {(x,y)}, finding w is formally summarized via:

    min_w f_D(w) := 1/|D| · ∑_{(x,y)∈D} ℓ(F(x,w), y).    (1)

Optimization of the objective given in Eq. (1) w.r.t. the parameters w, e.g., via gradient descent using ∂f_D/∂w, can be challenging due to not only the complexity of evaluating the predictor F(x,w) and its derivative, but also the size of the dataset |D|. Consequently, stochastic gradient descent (SGD) emerged as a popular technique. We randomly sample a subset B of the dataset, often also referred to as a minibatch. Instead of computing the gradient on the entire dataset D, we approximate it using the samples in the minibatch, i.e., we assume ∂f_D/∂w ≈ ∂f_B/∂w. However, for present-day datasets and predictors, computation of the gradient ∂f_B/∂w on a single machine is still challenging. Minibatch sizes |B| of less than 20 samples are common, e.g., when training for semantic image segmentation [5].

Distributed Training of Deep Nets: To train larger models or to increase the minibatch size, distributed training on multiple compute nodes is used [8, 15, 6, 27, 28, 36, 16]. A popular architecture to facilitate distributed training is the parameter server framework [15, 27, 28]. The parameter server maintains a copy of the current parameters and communicates with a group of worker nodes, each of which operates on a small minibatch to compute local gradients based on the retrieved parameters w. Upon having completed its task, a worker shares its gradients with the parameter server. Once the parameter server has obtained all or some of the gradients, it updates the parameters using the negative gradient direction and afterwards shares the latest values with the workers.

Asynchronous updates, where each worker independently pulls w from the server, computes its own local gradient, and pushes the result back, are available as well and illustrated in Fig. 1 (a). Due to the asynchrony, minimal synchronization overhead is traded for staleness of gradients. Methods for staleness control exist, which bound the number of delay steps [15]. However, note that stale gradients may slow down training significantly.

Importantly, all those frameworks are based on a centralized compute topology which forms a communication bottleneck, increasing the training time as the cluster size scales.
The time taken by pushing gradients, performing the update, and pulling w can be linear in the cluster size due to network congestion. Therefore, most recently, decentralized training frameworks gained popularity in both the synchronous and the asynchronous setting [30, 31]. However, those approaches assume decentralized workers are either completely synchronous (as in Fig. 1 (b)) or completely asynchronous, which requires either dealing with the longest execution time in every iteration or paying for uncontrolled gradient staleness.

Figure 1: Comparison between different distributed learning frameworks: (a) parameter server with asynchronous training, (b) decentralized synchronous training, and (c) decentralized pipeline training.

Compression in Distributed Training: As the model size increases and the cluster size scales, communication overhead in distributed learning systems dominates the training time, e.g., up to 80 ∼ 90% even in a high-speed network environment [29, 10]. To reduce the communication time, various compression algorithms have been proposed recently [43, 46, 11, 4, 50, 33, 2]; some focus on reducing the precision of communicated gradients through scalar quantization, down to 1 bit, while others focus on reducing the quantity of gradients to be transferred. Most compression works, however, only emphasize achieving a high compression ratio or a low loss in accuracy, without reporting the wall-clock training time.

In practice, compression without knowledge of the communication process is usually counter-productive [29], i.e., the total training time often increases. This is due to the fact that AllReduce is a multi-step algorithm which requires transferred gradients to be compressed and decompressed repeatedly, with a worst-case complexity linear in the cluster size, as we discuss below in Sec.
3.2.

3 Decentralized Pipelined Stochastic Gradient Descent

Overview: To address the aforementioned issues (network congestion at a central server, long execution time for synchronous training, and stale gradients in asynchronous training) we propose a new decentralized learning framework, Pipe-SGD, shown in Fig. 1 (c). It balances communication among nodes via AllReduce and pipelines the local training iterations to hide communication time. We developed Pipe-SGD by analyzing a timing model for wall-clock training time under different resource conditions and various communication approaches. We find that the proposed Pipe-SGD is optimal when gradient updates are delayed by only one iteration and the time taken by each iteration is dominated by local computation on the workers. Moreover, we found lossy compression to further reduce communication time without impacting accuracy.

Due to local pipelined training, balanced communication, and compression, the communication time is no longer part of the critical path, i.e., it is completely masked by computation, leading to linear speedup of end-to-end training time as the cluster size scales. Finally, we prove the convergence of Pipe-SGD for convex and strongly convex objectives by adjusting the proofs of [23, 15].

3.1 Timing Models and Decentralized Pipe-SGD

Timing Model: We propose timing models based on decentralized synchronous SGD to analyze the wall-clock runtime of training. Each training iteration consists of three major stages: model update, gradient computation, and gradient communication. Classical synchronous SGD (Fig. 1 (b)) runs local iterations on workers sequentially: each update depends on the gradient from the previous iteration, i.e., the iteration dependency is 1.
Therefore the total runtime of synchronous SGD can be formulated as:

    l_total_sync = T · (l_up + l_comp + l_comm),    (2)

where T denotes the total number of training iterations and l_up, l_comp, l_comm refer to the time taken by update, compute, and communication, respectively. It is apparent that synchronous SGD depends on the sum of the execution times of all stages, which leads to long end-to-end training time.

Figure 2: Timing model of Pipe-SGD: (a) each worker with limited resources, (b) sequential vs. pipelined gradient communication, and (c) an example of gradient communication: Ring-AllReduce.

On the contrary, Pipe-SGD relaxes the iteration dependency to K, i.e., each update depends only on the gradients of the K-th last iteration. This enables interleaving between neighboring iterations while maintaining globally synchronized communication, as shown in Fig. 1 (c). If we assume ideal conditions where both computation resources (CPU, GPU, other accelerators) and communication resources (communication links) are unlimited or abundant in count/bandwidth, then the total runtime of Pipe-SGD is:

    l_total_pipe = T/K · (l_up + l_comp + l_comm),    (3)

where K denotes the iteration dependency or the gradient staleness.
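As a quick numeric illustration, the runtime models of Eq. (2) and Eq. (3) can be evaluated directly. The per-stage times in the sketch below are assumed values for illustration only, not measurements:

```python
# Sketch of the synchronous (Eq. 2) vs. ideal pipelined (Eq. 3) timing models.
# The per-stage times l_up, l_comp, l_comm are illustrative assumptions.

def total_sync(T, l_up, l_comp, l_comm):
    """Eq. (2): synchronous SGD pays every stage on the critical path."""
    return T * (l_up + l_comp + l_comm)

def total_pipe_ideal(T, K, l_up, l_comp, l_comm):
    """Eq. (3): with unlimited resources, K interleaved iterations overlap."""
    return T / K * (l_up + l_comp + l_comm)

T, K = 1000, 2
l_up, l_comp, l_comm = 1.0, 20.0, 15.0      # assumed milliseconds per stage

assert total_sync(T, l_up, l_comp, l_comm) == 36000.0
assert total_pipe_ideal(T, K, l_up, l_comp, l_comm) == 18000.0  # K-fold shorter
```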
We observe that the end-to-end training time of Pipe-SGD can be shortened by a factor of K. However, the ideal resource assumption does not hold in practice, because both computation and communication resources are strictly limited on each worker node in today's distributed systems. As a result, the timing model for distributed learning is resource bound, i.e., either communication or computation bound, as shown in Fig. 2 (a), and the total runtime is:

    l_total_pipe = T · max(l_up + l_comp, l_comm),    (4)

where the total runtime is solely determined by either computation or communication resources, regardless of K (when K ≥ 2). Also, since gradient updates are always delayed by (K − 1) iterations, increasing K beyond 2 only harms, i.e., the optimal value is K = 2 for Pipe-SGD with limited resources. Hence, the staleness of gradients is limited to 1 iteration, i.e., the minimal staleness achievable with asynchronous updates. Moreover, we generally prefer a computation-bound setting for distributed training systems, i.e., l_up + l_comp > l_comm. To achieve this we discuss compression techniques in Sec. 3.2.

In addition to pipelined execution of iterations, we also analyze pipelined gradient communication within each iteration to reduce training time. Computation of gradients, i.e., the backward pass, and communication of gradients are often executed in a strictly sequential manner (see Fig. 2 (b)). However, pipelined gradient communication, i.e., communicating gradients immediately after they are computed, is feasible. Again, we assume limited resources and compare sequential and pipelined gradient communication in Fig. 2 (b).

To analyze the detailed timing of those two approaches, we use the timing models for communication of [47]. Communication of gradients is an AllReduce operation which aggregates the gradient vectors from all workers, performs the sum reduction element-wise, and then sends the result back to all.
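The semantics of this operation are simple: every worker must end up holding the element-wise sum of all workers' gradient vectors. A minimal single-process sketch of these semantics (the three workers and their gradient values are made up for illustration):

```python
# Minimal sketch of AllReduce *semantics*: every worker ends up with the
# element-wise sum of all workers' gradient vectors. This simulates p workers
# inside one process; real implementations achieve the same result with
# multi-step algorithms that use far less traffic per link.

def allreduce_sum(worker_grads):
    """Return the per-worker results after a sum-AllReduce."""
    n = len(worker_grads[0])
    total = [sum(g[i] for g in worker_grads) for i in range(n)]
    return [total[:] for _ in worker_grads]   # every worker holds the full sum

grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # assumed gradients on 3 workers
assert allreduce_sum(grads) == [[9.0, 12.0]] * 3
```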
In practice, the underlying algorithms are much more involved [47]. For example, Ring-AllReduce, one of the fastest AllReduce algorithms, performs gradient aggregation collectively among workers through balanced communication. As shown in Fig. 2 (c), each worker transmits only a block of the entire gradient vector to its neighbor and performs the sum reduction on the received block. This “transmit-and-reduce” runs in parallel on all workers until the gradient blocks are fully reduced on a worker (a different worker for each block). Afterwards those fully reduced blocks are sent back to the remaining workers along the virtual ring. This approach optimally utilizes the network bandwidth of all nodes.

Adopting the Ring-AllReduce model of [47], we obtain the total runtime of Pipe-SGD with sequential gradient communication under the limited resource assumption via:

    l_total_pipe_s = T · max( l_up + l_for + l_back,  2(p − 1)·α + 2((p − 1)/p)·n·β + ((p − 1)/p)·n·γ + S ),    (5)

where l_for and l_back denote forward-pass and backward-pass time, p the number of workers, α the network latency, n the model size in bytes, β the per-byte transfer time, γ the per-byte sum reduction time, and S the global synchronization time.

Similarly, we obtain the total runtime of Pipe-SGD with pipelined gradient communication via:

    l_total_pipe_p = T · max( l_up + l_for + l_b,  2(p − 1)L·α + 2((p − 1)/p)·n·β + ((p − 1)/p)·n·γ + L·S ),    (6)

where L denotes the number of gradient segments and l_b the backward-pass time taken by the first segment.

Algorithm 1: Decentralized Pipe-SGD training algorithm for each worker.

On the computation thread of each worker:
1: Initialize by the same model w[0], learning rate γ, iteration dependency K, and number of iterations T.
2: for t = 1, ..., T do
3:   Wait until the aggregated gradient g^c_sum in compressed format at iteration [t − K] is ready
4:   Decompress gradient g_sum[t − K] ← Decompress(g^c_sum[t − K])
5:   Update w[t] ← w[t − 1] − γ · g_sum[t − K]
6:   Load a batch B of training data
7:   Forward pass to compute current loss f_B
8:   Backward pass to compute gradient g_local[t] ← ∂f_B/∂w[t]
9:   Compress gradient g^c_local[t] ← Compress(g_local[t])
10:  Denote local gradient g^c_local[t] as ready
11: end for

On the communication thread of each worker:
1: Initialize aggregated gradients g^c_sum of iterations [1 − K, 1 − K + 1, ..., 0] as zero and mark them as ready
2: for t = 1, ..., T do
3:   Wait until local gradient g^c_local[t] is ready
4:   AllReduce g^c_sum[t] ← ∑ g^c_local[t]
5:   Denote aggregated gradient g^c_sum[t] as ready
6: end for

Based on Eq. (5) and Eq. (6) we note: if a pipelined system remains communication bound, then sequential gradient communication is preferred over pipelined gradient communication (Eq. (5) is smaller than Eq. (6) due to positive L). In practice, distributed training of large models is often communication bound, making the sequential exchange the best option.

To sum up, based on our timing models, we find that Pipe-SGD is optimal when K = 2, the system is compute bound (after compression), and sequential gradient communication is used. Note that although our model is derived for Ring-AllReduce, this conclusion also applies to other AllReduce algorithms, such as recursive doubling, recursive halving and doubling, pairwise exchange, etc. [47].

Decentralized Pipeline SGD: Guided by the timing models, we develop the decentralized Pipe-SGD framework illustrated in Fig.
1 (c), where neighboring training iterations on workers are interleaved with a width of K = 2 while the execution within each iteration remains strictly sequential. Decentralized workers perform pipelined training in parallel, with synchronization on gradient communication after every iteration. Due to the synchronous nature of our framework, the gradient update is always delayed by K − 1 iterations, which enforces a deterministic rather than an uncontrolled staleness. In our optimal setting, the number of iterations for a delayed update is 1, compared to O(p), where p is the cluster size, in conventional asynchronous parameter server training [15, 31, 1]. Importantly, our framework still enjoys the advantage of an asynchronous approach: interleaving of training iterations to reduce end-to-end runtime. Also, different from the parameter server architecture, we do not congest a head node. Instead, every worker is responsible for aggregating only part of the gradients in a balanced manner, so that communication and aggregation time scale much better.

More formally, we outline the algorithmic structure of our implementation for each worker in Alg. 1. Specifically, each worker runs two threads: one for computation and one for communication. The former consumes the aggregated gradient of the K-th last iteration and generates the local gradient to be communicated; the latter exchanges the local gradient and buffers the aggregated result to be consumed by the former.

Figure 3: Pipelining within AllReduce: (a) block transfer in native Ring-AllReduce and pipelined Ring-AllReduce, and (b) block transfer with light-weight compression.

3.2 Compression in Pipe-SGD

To further reduce the communication time we integrate lossy compression into our decentralized Pipe-SGD framework.
Unlike the conventional parameter server or recent decentralized frameworks that transfer parameters over the network [8, 6, 27, 28, 36, 16, 15, 31, 30], our approach communicates only gradients; we verified empirically that gradients are much more tolerant to lossy compression than the model parameters. This seems intuitive, since reducing the precision of the parameters in every iteration directly harms the final precision of the trained model.

Importantly, as mentioned in Sec. 3.1, compressing the communicated gradients contributes to the optimal setting of Pipe-SGD. Once Pipe-SGD is completely computation bound, linear speedup of end-to-end training time can be realized as the cluster size increases. Analytically, we show this by deriving the scaling efficiency using the timing model of Eq. (4). Assume that: 1) single-node training takes T_single iterations to complete, with an execution time of l_single per iteration; 2) given a Pipe-SGD cluster with p workers, we use the same batch size on each worker as on the single node [12]; 3) the single node and Pipe-SGD train for the same number of epochs on the dataset. From 2) and 3), the total number of iterations required for Pipe-SGD is T_single/p, because Pipe-SGD has a p times larger global batch size while still training on the same number of samples. From this we obtain the scaling efficiency SE of Pipe-SGD via

    SE = Actual Speedup / Ideal Speedup = (l_single · T_single / l_total_pipe) / p = (l_single · T_single) / (max(l_up + l_comp, l_comm) · (T_single/p) · p) = (l_up + l_comp) / max(l_up + l_comp, l_comm),    (7)

where the last step uses l_single = l_up + l_comp, as a single node incurs no communication. Thus, once our system becomes compute bound with compressed communication, Pipe-SGD achieves linear speedup as the cluster scales, i.e., SE = 1.

To maintain applicability of Ring-AllReduce, we choose two simple compression approaches: truncation and scalar quantization.
Truncation drops the less significant mantissa bits of the floating-point value of each gradient. Scalar quantization discretizes each gradient value into an integer of limited bit width, with the quantization range determined by the maximal element of the gradient vector. Due to their simplicity, both approaches are easy to parallelize, minimizing overhead.

Note that compression itself can be compute-heavy, and the introduced computation overhead can outweigh the benefit of compressed communication. This is particularly true for AllReduce-based communication, which performs multiple steps to transfer and reduce the data (see Fig. 2 (c)) and therefore invokes compression and decompression repeatedly, i.e., once per “transmit-and-reduce” step, with an invocation count linear in the cluster size. As a consequence, many of the proposed complex compression techniques [43, 46, 11, 4, 50, 33] fail in the communication-optimal AllReduce setting, resulting in longer wall-clock time. For these reasons, compression embedded inside AllReduce must be light, fast, and easy to parallelize, such as floating-point truncation or our element-wise quantization.

Indeed, pipelining within AllReduce can help alleviate the heavy overhead of complex compression. However, its benefit might still be limited. Instead of pipelining training iterations as in Pipe-SGD, pipelining within AllReduce interleaves the gradient communication and reduction within each AllReduce operation, as illustrated in Fig. 3 (a). Since the communication time is often larger than the reduction time, the latter can be hidden by the former. Once compression is used (as in Fig. 3 (b)), the two pipeline stages become (decompression, sum, compression) and (compressed communication), such that a light compression overhead can be masked completely.
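A minimal sketch of these two light-weight compressors follows. The exact rounding mode and bit packing of our implementation are not specified here, so `truncate_fp32` and `quantize` below are illustrative variants of 16-bit truncation and 8-bit scalar quantization:

```python
import struct

# Illustrative sketches of the two light compressors (assumed variants):
# 16-bit float truncation and 8-bit max-scaled scalar quantization.

def truncate_fp32(x, keep_bytes=2):
    """Drop the low-order mantissa bytes of a float32 (16-bit truncation)."""
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    shift = 8 * (4 - keep_bytes)
    bits &= ~((1 << shift) - 1)        # zero the least significant mantissa bits
    return struct.unpack('>f', struct.pack('>I', bits))[0]

def quantize(grad, bits=8):
    """Scalar-quantize a gradient vector; range set by its max magnitude."""
    scale = max(abs(g) for g in grad) or 1.0
    levels = (1 << (bits - 1)) - 1     # symmetric signed range, e.g. ±127
    q = [round(g / scale * levels) for g in grad]   # small ints on the wire
    return q, scale

def dequantize(q, scale, bits=8):
    levels = (1 << (bits - 1)) - 1
    return [v * scale / levels for v in q]

q, s = quantize([0.5, -1.0, 0.25])
assert dequantize(q, s)[1] == -1.0     # the max-magnitude element is exact
assert abs(truncate_fp32(3.14159) - 3.14159) < 0.02
```

Truncation is purely element-wise, while quantization needs one pass over the vector to find the scale; both are cheap enough to run once per “transmit-and-reduce” step.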
Although complex compression may also benefit from the pipelined AllReduce, the improvement is limited because the time spent on complex compression often outweighs the communication time. For example, we implemented [50] within the pipelined AllReduce and found that the compression overhead is 1.6 ∼ 2.3× the uncompressed communication time and 25.6 ∼ 36.8× the compressed communication time for the benchmarks in Sec. 4, in which case the heavy overhead cannot be masked. Complete masking requires the compression overhead to be smaller than the compressed communication. In the remainder, we only consider light compression (truncation/quantization) with native AllReduce.

3.3 Convergence

To prove the convergence of Pipe-SGD we adapt the derivation for parameter-server based asynchronous training [15, 23]. We can show that the convergence rate of Pipe-SGD for convex objectives via SGD is 8FL·√(K/T), where K = 2, and F and L are constants for gradient distance and Lipschitz continuity, respectively. We can also show the convergence of Pipe-SGD for strongly convex functions, and find a rate of O(log T / T) for gradient descent. These rates are consistent with [15, 23].
Due to the page limit we defer details to the supplementary material.

4 Experimental Evaluation

In this section, we demonstrate the efficacy of our approach on four benchmarks using three datasets: MNIST [26], CIFAR100 [21], and ImageNet [42]. We briefly review characteristics of those datasets before discussing metrics and setup, and finally presenting experimental results and analysis.

Datasets and Deep Net Architecture
• MNIST: The MNIST dataset consists of 60,000 training and 10,000 test images, each showing one of ten possible digits. The images are of size 28×28 pixels with digits located at the center of the images. We use a classical 3-layer perceptron, MNIST-MLP, with both hidden layers being 500-dimensional, and a global batch size of 100.
• ImageNet: For our experiments we use 1,281,167 training and 50,000 validation examples from the ImageNet challenge. Each example comprises a color image of 256×256 pixels and belongs to one of 1000 classes. We use the classical AlexNet [22] and ResNet [14], both with a global batch size of 256.
• CIFAR100: The CIFAR100 dataset is composed of 50,000 training and 10,000 test examples with 100 classes. The simple AlexNet-style CIFAR100 architecture of [32] is used for benchmarking this dataset. It consists of 3 convolutional layers and 2 fully connected layers followed by a softmax layer; the detailed parameters are available in [32]. Importantly, we adapt this 5-layer CIFAR100-CNN into a convex optimization benchmark, CIFAR100-Convex, to match our proof of convergence. Convexity is achieved by training only the last fully connected layer while fixing the parameters of all previous layers.

Metrics and Setup
We measure the wall-clock time of end-to-end training, i.e., the same number of iterations across the different settings.
For each benchmark, we evaluate our proposed timing model using the end-to-end training time and detailed timing breakdowns. We plot the test/validation accuracy over training time to evaluate the actual convergence. Also, final top-1 accuracies on the test/validation set are reported. For the setup, we use a cluster of four nodes, each of which consists of a Titan XP GPU [40] and a Xeon CPU E5-2640 [17]. We employ an additional node as the parameter server to support the conventional centralized design. All nodes are connected by 10Gb Ethernet. We implement a distributed training framework in C++ using CUDA 8.0 [39], MKL 2018 [18], and OpenMPI 2.0 [41], which supports both the parameter-server and Pipe-SGD approaches.

Results and Analysis
We evaluate the performance of three different frameworks: parameter server with synchronous SGD (PS-Sync), decentralized synchronous SGD (D-Sync), and Pipe-SGD. Our compression schemes, i.e., 16-bit truncation (T) and 8-bit quantization (Q), are also applied to the AllReduce communication in D-Sync and Pipe-SGD. Evaluation results are summarized in Fig. 4, where the first two columns show convergence performance and the third column shows detailed timing breakdowns with final accuracies labeled.

Figure 4: Experimental results: Each row shows a different benchmark. The left two columns show convergence via test/validation accuracy vs. wall-clock training time, where the first column is an inset of the second one. The rightmost column shows the detailed timing breakdown of end-to-end training. Note that the final top-1 accuracies on the test/validation set are labeled on top of the bars.

Convergence: From Fig. 4, we observe that the decentralized approaches, i.e., D-Sync and Pipe-SGD, converge much faster than the parameter server even without compression, and Pipe-SGD shows the fastest convergence among these frameworks, especially when compression is applied.
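The two light compression schemes (T and Q) can be sketched as follows. This is an illustrative NumPy version under our own naming; the paper's actual GPU kernels and quantizer details may differ:

```python
import numpy as np

def truncate16(grad):
    # 16-bit truncation: cast gradients to IEEE half precision,
    # halving the bytes pushed through AllReduce.
    return grad.astype(np.float16)

def quantize8(grad):
    # 8-bit linear quantization: map each value onto 256 levels spanning
    # the tensor's range; transmit int8 codes plus two scalars.
    lo, hi = float(grad.min()), float(grad.max())
    scale = (hi - lo) / 255.0 if hi > lo else 1.0
    codes = np.round((grad - lo) / scale).astype(np.uint8)
    return codes, lo, scale

def dequantize8(codes, lo, scale):
    # Inverse map back to float32 before the reduction step.
    return codes.astype(np.float32) * scale + lo
```

Truncation halves the communicated volume (a 2× compression factor) while quantization quarters it (4×), matching the factors discussed below.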
For example, the convergence curve of CIFAR100-Convex shows that D-Sync is around 40% faster than PS-Sync, and Pipe-SGD is another 37% faster than D-Sync. The advantage of Pipe-SGD is further boosted by compression, i.e., truncation in this case, yielding an additional 46% faster convergence than D-Sync with the same compression scheme. Pipe-SGD therefore prevails by a large margin.
Timing Breakdown: From Fig. 4, the comparison between centralized and decentralized designs shows a 50% reduction in uncompressed communication time, justifying the efficacy of balanced communication. Once compression is applied, a further reduction is observed. However, the actual improvement in D-Sync falls short of the compression factors of 2× for truncation and 4× for quantization, because the compression overhead is paid on the critical path of D-Sync. In contrast, Pipe-SGD can hide this overhead together with computation due to its pipelined nature, as shown by "D-Sync+T" vs. "Pipe-SGD+T" in the MNIST benchmark. As communication is further reduced by quantization, the system becomes compute bound and Pipe-SGD switches to hiding the communication instead, thus reaching the optimal setting of Pipe-SGD. This optimum can also be achieved via the simplest truncation for models with less dominant communication time, e.g., ResNet18 and CIFAR100-Convex. As a result, our approach achieves a speedup of 2.0 ∼ 3.2× compared to D-Sync and 4.0 ∼ 5.4× compared to PS-Sync for these benchmarks.
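The role of pipelining in these breakdowns can be summarized with a back-of-the-envelope cost model. The helper names and timing values below are our own illustrative assumptions, with the pipelined steady state approximated as the maximum of the compute side and the compression-plus-communication side:

```python
def iter_time_dsync(t_compute, t_compress, t_comm):
    # D-Sync: compute, compression, and AllReduce all sit on the
    # critical path of every iteration.
    return t_compute + t_compress + t_comm

def iter_time_pipe(t_compute, t_compress, t_comm):
    # Pipe-SGD (pipeline width two): compression and AllReduce of
    # iteration i overlap with the computation of iteration i+1, so the
    # steady-state iteration time is the larger of the two sides.
    return max(t_compute, t_compress + t_comm)

# Communication-bound example: pipelining hides the (smaller) compute time.
comm_bound = iter_time_pipe(t_compute=3.0, t_compress=0.5, t_comm=6.0)     # 6.5
# Compute-bound example, e.g., after quantization shrinks the comm time:
compute_bound = iter_time_pipe(t_compute=3.0, t_compress=0.5, t_comm=1.0)  # 3.0
```

In the compute-bound regime, shrinking communication further buys nothing, which is why the simplest truncation already reaches the optimum for ResNet18 and CIFAR100-Convex.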
Note that these speedups are based on the comparison between different approaches in the same cluster, without scaling the cluster size.

[Figure 4 plots: one row per benchmark (MNIST-MLP, AlexNet, ResNet18, CIFAR100-Convex), with accuracy-vs.-time curves and timing-breakdown bars for PS-Sync, D-Sync, D-Sync with 16-bit truncation/8-bit quantization, Pipe-SGD, and Pipe-SGD with 16-bit truncation/8-bit quantization.]

Accuracy: Considering the potential drawbacks of the 1-iteration stale update and lossy compression in Pipe-SGD, we also evaluate the final
test/validation accuracies after end-to-end training, as shown in Fig. 4. Interestingly, in our optimal settings "Pipe-SGD+T/Q," we find that only AlexNet drops top-1 accuracy, by 0.005 compared to the D-Sync baseline, while all other benchmarks show slightly improved accuracies. To obtain the best accuracies for the two large non-convex models, AlexNet and ResNet, we employ a warm-up scheme similar to [33], i.e., we do not turn on pipelined training until the 5th epoch, before which we stick to D-Sync training to avoid undesirable gradient changes in the initial stage. Since the warm-up period is marginal compared to the total number of epochs, system performance benefits from Pipe-SGD most of the time. Note that for smaller models, especially convex ones (e.g., CIFAR100-Convex), no warm-up is required.

5 Related Work

Li et al. [27, 28] proposed a parameter server framework for distributed learning along with several approaches to reduce the cost of communication among compute nodes, such as exchanging only nonzero parameter values, local caching of index lists, and random skipping of messages to be transmitted. Abadi et al. [1] also proposed a centralized framework, TensorFlow, which incorporates model and data parallelism for training deep nets. Both works support the asynchronous setting to improve communication efficiency, but without controlling the staleness of the gradient updates. Ho et al. [15] proposed SSP, another centralized asynchronous framework, but with bounded staleness for gradients. The key ideas of SSP are: 1) each worker has its own iteration index; 2) the slowest and fastest workers must be within S iterations of each other, otherwise the fastest worker is forced to wait until the slowest worker catches up. However, this bound S applies to the iteration drift among workers rather than directly to the staleness of updates at the parameter server.
As a result, each worker within the bound can still commit its updates to the server asynchronously, so the most recent gradient update can be heavily stale. In the worst case, the staleness is linear in the cluster size.
Lin et al. [33] employed AllReduce as the gradient aggregation method in their synchronous framework, but little is reported regarding wall-clock time benefits, especially considering that a fully synchronous design is limited by the longest execution time among all workers. Besides, Lian et al. proposed AD-PSGD [31], which parallelizes SGD over decentralized workers in a completely asynchronous fashion. Workers run completely independently and only communicate with a set of neighboring nodes to exchange trained weights, i.e., neighboring models are averaged to replace each worker's local model in each iteration. However, this approach suffers from uncontrolled staleness, which in practice increases with cluster size and the time taken by each iteration. In addition, such a communication method requires each worker to act as the center node of a local graph, which results in a local communication bottleneck. As a result, each worker suffers from a long iteration time, which further increases the staleness of weight updates. Although Lian et al. [31] compared their framework with a fully synchronous design in wall-clock time, the performance turns out to be similar when network speeds are roughly equal.
Recently, independent work [13] also proposed a distributed pipelined system for DNN training. Different from Pipe-SGD, [13] focuses on pipelining with model parallelism, partitioning the DNN layers onto different machines and pipelining the execution of the machines by injecting consecutive mini-batches into the first one. This approach reduces communication load since only activations and gradients of a subset of layers are communicated between machines.
However, complex mechanisms (such as profiling, a partitioning algorithm, and replicated stages) are necessary to balance the workload among the machines, since compute resources otherwise sit idle. Furthermore, [13] may suffer from staleness of the weight updates, which is linear in the number of stages. This limits the effectiveness of model pipelining and throttles speedups.

6 Conclusion

We developed a rigorous timing model for distributed deep net training which takes into account network latency, model size, byte transfer time, etc. Based on our timing model and realistic resource assumptions, e.g., limited network bandwidth, we assessed scalability and developed Pipe-SGD, a pipelined training framework which is able to mask the faster of computation or communication time. We showed efficacy of the proposed method on a four-node GPU cluster connected with 10Gb links. Rigorously assessing wall-clock time for Pipe-SGD, we achieve improvements of up to 5.4× compared to conventional approaches.

Acknowledgement

This work is supported in part by grants from NSF (IIS 17-18221, CNS 17-05047, CNS 15-57244, CCF-1763673 and CCF-1703575). This work is also supported by 3M and the IBM-ILLINOIS Center for Cognitive Computing Systems Research (C3SR). In addition, this material is based in part upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR001117C0053. The views, opinions, and/or findings expressed are those of the author(s) and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.

References

[1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. A. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zhang. TensorFlow: A System for Large-Scale Machine Learning.
In\nOSDI, 2016.\n\n[2] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic. QSGD: Communication-Ef\ufb01cient SGD via\n\nGradient Quantization and Encoding. In NIPS, 2017.\n\n[3] Y. Bengio, A. Courville, and P. Vincent. Representation Learning: A Review and New Perspectives. PAMI,\n\n2013.\n\n[4] C.-Y. Chen, J. Choi, D. Brand, A. Agrawal, W. Zhang, and K. Gopalakrishnan. AdaComp : Adaptive\n\nResidual Gradient Compression for Data-Parallel Distributed Training. In AAAI, 2018.\n\n[5] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic Image Segmentation with\n\nDeep Convolutional Nets and Fully Connected CRFs. In ICLR, 2015.\n\n[6] T. Chilimbi, Y. Suzue, J. Apacible, and K. Kalyanaraman. Project Adam: Building an ef\ufb01cient and scalable\n\ndeep learning training system. In OSDI, 2014.\n\n[7] H. Cui, H. Zhang, G. R. Ganger, P. B. Gibbons, and E. P. Xing. GeePS: Scalable Deep Learning on\n\nDistributed GPUs with a GPU-Specialized Parameter Server. In EuroSys, 2016.\n\n[8] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang,\n\nQ. V. Le, and A. Y. Ng. Large Scale Distributed Deep Networks. In NIPS, 2012.\n\n[9] J. Dean and S. Ghemawat. MapReduce: Simpli\ufb01ed Data Processing on Large Clusters. Communications\n\nof the ACM, 2008.\n\n[10] N. Dryden, N. Maruyama, T. Moon, T. Benson, A. Yoo, M. Snir, and B. V. Essen. Aluminum: An\nAsynchronous, GPU-Aware Communication Library Optimized for Large-Scale Training of Deep Neural\nNetworks on HPC Systems. In MLHPC, 2018.\n\n[11] N. Dryden, T. Moon, S. A. Jacobs, and B. V. Essen. Communication Quantization for Data-Parallel\n\nTraining of Deep Neural Networks. In MLHPC, 2016.\n\n[12] P. Goyal, P. Doll\u00e1r, R. B. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He.\n\nAccurate, Large Minibatch SGD: Training ImageNet in 1 Hour. In CVPR, 2017.\n\n[13] A. Harlap, D. Narayanan, A. Phanishayee, V. Seshadri, N. R. 
Devanur, G. R. Ganger, and P. B. Gibbons. PipeDream: Fast and Efficient Pipeline Parallel DNN Training. arXiv:1806.03377v1, 2018.

[14] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In CVPR, 2016.
[15] Q. Ho, J. Cipar, H. Cui, S. Lee, J. K. Kim, P. B. Gibbons, G. A. Gibson, G. R. Ganger, and E. P. Xing. More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server. In NIPS, 2013.
[16] F. N. Iandola, K. Ashraf, M. W. Moskewicz, and K. Keutzer. FireCaffe: Near-linear Acceleration of Deep Neural Network Training on Compute Clusters. In CVPR, 2016.
[17] Intel Corporation. Xeon CPU E5, https://www.intel.com/content/www/us/en/products/processors/xeon/e5-processors.html, 2017.
[18] Intel Corporation. Intel Math Kernel Library, https://software.intel.com/en-us/mkl, 2018.
[19] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed Data-parallel Programs from Sequential Building Blocks. In ACM SIGOPS, 2007.
[20] H. Kim, J. Park, J. Jang, and S. Yoon. DeepSpark: A Spark-based Distributed Deep Learning Framework for Commodity Clusters. arXiv:1602.08191 [cs], 2016.
[21] A. Krizhevsky. Learning Multiple Layers of Features from Tiny Images, 2009.
[22] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In NIPS, 2012.
[23] J. Langford, A. J. Smola, and M. Zinkevich. Slow Learners are Fast. In NIPS, 2009.
[24] G. Larsson, M. Maire, and G. Shakhnarovich. FractalNet: Ultra-Deep Neural Networks without Residuals. arXiv:1605.07648, 2016.
[25] Y. LeCun, Y. Bengio, and G. E. Hinton. Deep Learning. Nature, 2015.
[26] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based Learning Applied to Document Recognition. Proceedings of the IEEE, 1998.
[27] M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B.-Y. Su.
Scaling Distributed Machine Learning with the Parameter Server. In OSDI, 2014.

[28] M. Li, D. G. Andersen, A. J. Smola, and K. Yu. Communication Efficient Distributed Machine Learning with the Parameter Server. In NIPS, 2014.
[29] Y. Li, J. Park, M. Alian, Y. Yuan, Z. Qu, P. Pan, R. Wang, A. G. Schwing, H. Esmaeilzadeh, and N. S. Kim. A Network-Centric Hardware/Algorithm Co-Design to Accelerate Distributed Training of Deep Neural Networks. In MICRO, 2018.
[30] X. Lian, C. Zhang, H. Zhang, C. Hsieh, W. Zhang, and J. Liu. Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent. In NIPS, 2017.
[31] X. Lian, W. Zhang, C. Zhang, and J. Liu. Asynchronous Decentralized Parallel Stochastic Gradient Descent. arXiv:1710.06952v3, 2018.
[32] R. Liao, A. Schwing, R. Zemel, and R. Urtasun. Learning Deep Parsimonious Representations. In NIPS, 2016.
[33] Y. Lin, S. Han, H. Mao, Y. Wang, and W. J. Dally. Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training. In ICLR, 2018.
[34] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing Atari with Deep Reinforcement Learning. In NIPS Deep Learning Workshop, 2013.
[35] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level Control through Deep Reinforcement Learning. Nature, 2015.
[36] P. Moritz, R. Nishihara, I. Stoica, and M. I. Jordan. SparkNet: Training Deep Networks in Spark. In ICLR, 2016.
[37] D. G. Murray, R. Isaacs, F. McSherry, M. Isard, P. Barham, and M. Abadi. Naiad: A Timely Dataflow System. In SOSP, 2013.
[38] Nvidia. GPU-Based Deep Learning Inference: A Performance and Power Analysis.
Whitepaper, 2015.

[39] NVIDIA Corporation. NVIDIA CUDA C Programming Guide, 2010.
[40] NVIDIA Corporation. TITAN Xp, https://www.nvidia.com/en-us/design-visualization/products/titan-xp/, 2017.
[41] OpenMPI Community. OpenMPI: A High Performance Message Passing Library, https://www.open-mpi.org/, 2017.
[42] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.
[43] F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu. 1-Bit Stochastic Gradient Descent and Its Application to Data-Parallel Distributed Training of Speech DNNs. In INTERSPEECH, 2014.
[44] H. Sharma, J. Park, D. Mahajan, E. Amaro, J. K. Kim, C. Shao, A. Misra, and H. Esmaeilzadeh. From High-Level Deep Neural Models to FPGAs. In MICRO, 2016.
[45] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis. Mastering the Game of Go with Deep Neural Networks and Tree Search. Nature, 2016.
[46] N. Strom. Scalable Distributed DNN Training using Commodity GPU Cloud Computing. In INTERSPEECH, 2015.
[47] R. Thakur, R. Rabenseifner, and W. Gropp. Optimization of Collective Communication Operations in MPICH. IJHPCA, 2005.
[48] Q. Wang, Y. Li, and P. Li. Liquid State Machine based Pattern Recognition on FPGA with Firing-Activity Dependent Power Gating and Approximate Computing. In ISCAS, 2016.
[49] Q. Wang, Y. Li, B. Shao, S. Dey, and P. Li. Energy Efficient Parallel Neuromorphic Architectures with Approximate Arithmetic on FPGA. Neurocomputing, 2017.
[50] W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, and H. Li.
TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning. In NIPS, 2017.

[51] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: Cluster Computing with Working Sets. In HotCloud, 2010.
[52] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong. Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks. In FPGA, 2015.
[53] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning Deep Features for Scene Recognition using Places Database. In NIPS, 2014.