{"title": "Bayesian Distributed Stochastic Gradient Descent", "book": "Advances in Neural Information Processing Systems", "page_first": 6378, "page_last": 6388, "abstract": "We introduce Bayesian distributed stochastic gradient descent (BDSGD), a high-throughput algorithm for training deep neural networks on parallel clusters. This algorithm uses amortized inference in a deep generative model to perform joint posterior predictive inference of mini-batch gradient computation times in a compute cluster specific manner. Specifically, our algorithm mitigates the straggler effect in synchronous, gradient-based optimization by choosing an optimal cutoff beyond which mini-batch gradient messages from slow workers are ignored. In our experiments, we show that eagerly discarding the mini-batch gradient computations of stragglers not only increases throughput but actually increases the overall rate of convergence as a function of wall-clock time by virtue of eliminating idleness. The principal novel contribution and finding of this work goes beyond this by demonstrating that using the predicted run-times from a generative model of cluster worker performance improves substantially over the static-cutoff prior art, leading to reduced deep neural net training times on large computer clusters.", "full_text": "Bayesian Distributed Stochastic Gradient Descent\n\nMichael Teng\n\nDepartment of Engineering Sciences\n\nUniversity of Oxford\n\nmteng@robots.ox.ac.uk\n\nFrank Wood\n\nDepartment of Computer Science\nUniversity of British Columbia\n\nfwood@cs.ubc.ca\n\nAbstract\n\nWe introduce Bayesian distributed stochastic gradient descent (BDSGD), a high-\nthroughput algorithm for training deep neural networks on parallel computing\nclusters. This algorithm uses amortized inference in a deep generative model to\nperform joint posterior predictive inference of mini-batch gradient computation\ntimes in a compute cluster speci\ufb01c manner. 
Specifically, our algorithm mitigates\nthe straggler effect in synchronous, gradient-based optimization by choosing an\noptimal cutoff beyond which mini-batch gradient messages from slow workers are\nignored. The principal novel contribution and finding of this work goes beyond\nthis by demonstrating that using the predicted run-times from a generative model\nof cluster worker performance improves over the static-cutoff prior art, leading\nto higher gradient computation throughput on large compute clusters. In our\nexperiments we show that eagerly discarding the mini-batch gradient computations\nof stragglers not only increases throughput but sometimes also increases the\noverall rate of convergence as a function of wall-clock time by virtue of eliminating\nidleness.\n\n1 Introduction\n\nDeep learning success stories are predicated on large neural network models being trained using\never larger amounts of data. While the computational speed and memory available on individual\ncomputers and GPUs continually grow, there will always be some problems and settings in which\nthe amount of training data available will not fit entirely into the memory of one computer. What is\nmore, and even for a fixed amount of data, as the number of parameters in a neural network or the\ncomplexity of the computation it performs increases, so too do the incurred economic and time costs\nto train. Both large training datasets and complex networks inspire parallel training algorithms.\nIn this work we focus on parallel stochastic gradient descent (SGD). Like the substantial and growing\nbody of work on this topic (Recht et al. (2011); Dean et al. (2012); McMahan and Streeter (2014);\nZhang et al. (2015)) we too focus on gradients computed in parallel on \u201cmini-batches\u201d\ndrawn from the training data. 
However, unlike most of these methods which are asynchronous in\nnature, we focus instead on improving the performance of synchronous distributed SGD, very much\nlike Chen et al. (2016), upon whose work we directly build.\nA problem in fully synchronous distributed SGD is the straggler effect. This real-world effect is\ncaused by the small and constantly varying subset of worker nodes that, for factors outside our con-\ntrol, perform their mini-batch gradient computation slower than the rest of the concurrent workers,\ncausing long idle times in workers which already have \ufb01nished. Chen et al. (2016) introduce a\nmethod of mitigating the straggler effect on wall-clock convergence rate by picking a \ufb01xed cut-off\nfor the number of workers on which to wait before synchronously updating the parameter on a cen-\ntralized parameter server. They found, as we demonstrate in this work as well, that the increased\ngradient computation throughput that comes from reducing idle time more than offsets the loss of a\nsmall fraction of mini-batch gradient contributions per gradient descent step.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fOur work exploits this same key idea but improves the way the likely number of stragglers is identi-\n\ufb01ed. In particular we instrument and generate training data once for a particular compute cluster and\nneural network architecture, and then use this data to train a lagged generative latent-variable time-\nseries model that is used to predict the joint run-time behavior of all the workers in the cluster. For\nhighly contentious clusters with poor job schedulers, such a model might reasonably be expected to\nlearn to model latent states that produce correlated, grouped increases in observed run-times due to\nresource contention. 
For well-engineered clusters, such a model might learn that worker run-times\nare nearly perfectly independently and identically distributed.\nSpecifying such a \ufb02exible model by hand would be dif\ufb01cult. Also, we will need to perform real-\ntime posterior predictive inference in said model at runtime to dynamically predict straggler cut-\noff. For both these reasons we use the variational autoencoder loss (Kingma and Welling, 2013)\nto simultaneously learn not only the model parameters but also the parameters of an amortized\ninference neural network (Ritchie et al., 2016; Le et al., 2017) that allows for real-time approximate\npredictive inference of worker run-times.\nThe main contributions of this paper are:\n\n\u2022 The idea of using amortized inference in a deep state space model to predict compute\ncluster worker run-times, in particular for use in a distributed synchronous gradient descent\nalgorithm.\n\n\u2022 The BDSGD algorithm itself, including the approximations made to enable real-time pos-\n\nterior predictive inference.\n\n\u2022 The empirical veri\ufb01cation at scale of the increased gradient computation throughput that\n\nour algorithm yields when training deep neural networks in parallel on large clusters.\n\nThe rest of the paper is organized as follows. In section 2, we give necessary background on why\nand how synchronous distributed SGD can be improved. In section 3, we explain our choice of\ngenerative model for cutoff determination. 
In section 4, we present our experimental results.\n\n2 Background and Motivation\n\nIn stochastic gradient descent, we use unbiased estimates of gradients to update parameter settings.\nSynchronous distributed SGD differs from single-threaded mini-batch SGD in that the mini-batch\nof size m is distributed to N total workers that locally compute sub-mini-batch gradients before\ncommunicating the result back to a centralized parameter server that updates the parameter vector\nusing an update rule:\n\n$$\\theta^{(t+1)} = \\theta^{(t)} - \\alpha \\frac{1}{N} \\sum_{i=1}^{N} f\\left(\\theta^{(t)}, (i-1)\\frac{m}{N}, i\\frac{m}{N}\\right) \\qquad (1)$$\n\nwith\n\n$$f(\\theta, a, b) = \\frac{1}{b-a} \\sum_{k=0}^{b-a} \\nabla_{\\theta^{(t)}} F(\\theta, z^{(k)}, y^{(k)})$$\n\nwhere $\\theta$ are the network parameters, F is the loss function, and $\\alpha$ is the learning rate. Although\nnot shown, asynchronous SGD is lock-free, and parameter updates are made whenever any worker\nreports a sub-mini-batch gradient to the parameter server, which results in out-of-date or stale\ngradient information (Recht et al., 2011). Unlike asynchronous distributed SGD, synchronous distributed\nSGD is equivalent to single-threaded SGD with batch size m. This allows the hyperparameters $\\alpha$ and\nm to be tuned in the distributed setting without having to consider the possibility of stale gradients\n(Hoffer et al., 2017).\n\n2.1 Effect of Stragglers\n\nIn synchronous SGD, we can attribute low throughput, in the sense of central parameter updates per\nunit time, to the straggler effect that arises in real-world cluster computing scenarios with multiple\nworkers computing in parallel. Consider Equation 1, in which $f(\\theta, \\cdot, \\cdot)$ is computed independently\non a memory-isolated logical processor. 
Let $x_j$ be the time it takes for f to be computed on the\nworker indexed by j for $j \\in 1 \\ldots N$.\n\nFigure 1: Oracle throughput curves (best achievable in hindsight) for synchronous SGD runs for\nthree different neural networks on the same 2175-worker cluster. From left: low variance (1-2%\nstragglers), medium variance (2-4% stragglers), and high variance (8-12% stragglers) throughput\ncurves with mean \u00b1 std worker gradient computation times being 5.34 \u00b1 0.13 seconds, 2.83 \u00b1 0.077\nseconds, and 0.24 \u00b1 0.018 seconds, respectively. The x-axes are the number of workers. The y-axes\nshow throughput achieved if all workers beyond the x-axis value are ignored. When runtimes for\ngradient computations have low variance relative to total runtime, Chen et al. underestimates the\noptimal cutoff point whereas when runtimes have proportionally higher variance, Chen et al.\noverestimates the optimal cutoff point. Our approach achieves more accurate estimates of the optimal\ncutoff in both scenarios.\n\nDistributed compute clusters are not perfect, otherwise $x_j$\nwould be a constant, independent of j, and all workers would finish at the same time and only\nbe idle while the parameter server aggregates the gradients and sends back the new parameters.\nInstead $x_j$ is actually random. Moreover, the joint distribution of all the $x_j$\u2019s is likely, again in\nreal-world settings, to be non-trivially correlated owing to cluster architecture and resource contention.\nFor instance, most modern clusters consist of computers or graphics processing units each in turn\nhaving a small number of independent processors, so slow-downs in one logical processing unit are\nlikely to be exhibited by others sharing the same bus or network address. What is more, in modern\noperating systems, time-correlated contention is quite common, particularly in clusters under queue\nmanagement systems, when, for instance, other processes are concurrently executed. 
All this yields\nworker compute times that may be non-trivially correlated in both time and in \u201cspace\u201d.\nOur aim is to significantly reduce the effect of stragglers on throughput and to do so by modeling\ncluster worker compute times in a way that intelligently and adaptively responds to the kinds of\ncorrelated run-time variations actually observed in the real world. What we find is that doing so\nimproves overall performance of distributed mini-batch SGD.\n\n3 Methodology\n\nOur approach works by maximizing the total throughput of parameter updates during a distributed\nmini-batch SGD run. The basic idea, shared with Chen et al. (2016), is to predict a cutoff, $c_t < N$,\nfor each iteration of SGD which dictates the total number of workers on which to wait before taking a\ngradient step in the parameter space. While Chen et al. (2016) use a fixed cutoff, $c_t = 0.943 \\cdot N, \\forall t$,\nwe would like $c_t$ to evolve dynamically with each iteration and in a manner specific to\neach (compute cluster, neural network architecture) pair. We note that for overall rate of convergence,\nthroughput is not the exact quantity we wish to maximize; the ideal target is some quantity related to the\nrate of expected gain in objective function value, but throughput is the proxy we use in this work.\nAlso, paradoxically, lower throughput, by virtue of smaller mini-batch sizes, may in some instances\nincrease the rate of convergence, an effect previously documented in the literature (Masters and\nLuschi, 2018).\nThe central considerations are: what is the notion of throughput we should optimize? And how\ndo we predict the cutoff that achieves it? Simply optimizing overall run-time admits a trivial and\nunhelpful solution of setting $c_t = 0$. Each iteration and the overall algorithm would then take no\ntime but achieve nothing. Instead we seek to maximize the number of workers to finish in a given\namount of time, i.e. 
throughput $\\Omega(c)$, which we define to be:\n\n$$\\Omega(c) = \\frac{c}{\\tilde{x}_{(c)}}$$\n\nwhere c indexes the ordered worker run-times $\\tilde{x}_{(c)}$. Note that, for now, we avoid indexing run-times\nby SGD loop iteration. Soon, we will address temporal correlation between worker runtimes.\n\n[Figure 1 panels, y-axis in gradients / second: WideResNet-16:10 (Oracle: 2172, Chen et al.: 2051); ResNet-64 (Oracle: 2095, Chen et al.: 2051); ResNet-16 (Oracle: 2004, Chen et al.: 2051).]\n\nWith this definition, we can plot throughput curves that show how throughput drops off as the\nstraggler effect increases (Figure 1). On the well-configured cluster used to produce Figure 1, a high\npercentage of workers (between 80-95%) finish at roughly the same time, so in this regime, throughput\nof the system increases linearly for each additional worker. However, continuing to wait for\nmore workers includes some stragglers which eventually decreases the overall throughput.\nWe define our objective to be to maximize the throughput of the system as defined above,\ni.e. $\\arg\\max_c \\Omega(c)$ at all times. This also implicitly handles the tradeoff between iteration speedup\nand the learning signal reduction that comes from using a higher variance gradient estimate given\nby discarding gradient information.\nSetting the cutoff optimally and dynamically requires a model which is able to learn and predict\nthe joint ordered run-times of all cluster workers. With such a model, we can make informed and\naccurate predictions about the next set of run-times per worker and consequently make a real-time,\nnear-optimal choice of c for the subsequent loop of mini-batch gradient calculations. 
How we model\ncompute cluster worker performance follows.\n\n3.1 Modeling Compute Cluster Worker Performance\n\nAs before, let $x_j \\in \\mathbb{R}^+$ be the time it takes for f to be computed on the worker indexed by j.\nAssume that these are distributed according to some distribution p. Given a set of n, $p(\\cdot)$-distributed\nrandom variables $x_1, x_2, \\ldots, x_n$ we wish to know the joint distribution of the n sorted random\nvariables $\\tilde{x}_{(1)}, \\tilde{x}_{(2)}, \\ldots, \\tilde{x}_{(n)}$. Such quantities are known as \u201corder statistics.\u201d Each $p(\\tilde{x}_{(j)})$ describes\nthe distribution of the jth smallest sorted run-time under independent draws from this underlying\ndistribution. Taking the mean of each order statistic allows us to derive a cutoff using our notion of\nthroughput, given as:\n\n$$\\arg\\max_c \\Omega(c) = \\arg\\max_c E\\left[\\frac{c}{\\tilde{x}_{(c)}}\\right] \\qquad (2)$$\n\n3.1.1 Elfving Cutoff\n\nThe first model of runtimes we consider assumes that they are independent and identically\ndistributed (iid) Gaussian. Under the assumption that $x_j \\sim \\mathcal{N}(\\mu_x, \\sigma_x^2)$ the distribution of each order\nstatistic $p(\\tilde{x}_{(1)}), p(\\tilde{x}_{(2)}), \\ldots, p(\\tilde{x}_{(n)})$ is independent and $E[\\tilde{x}_{(1)}] \\le E[\\tilde{x}_{(2)}] \\le \\ldots \\le E[\\tilde{x}_{(n)}]$.\nUnder the given iid normality assumption the distribution of each order statistic has closed form:\n\n$$p(\\tilde{x}_{(j)}) = Z(n, j) \\int_{-\\infty}^{\\infty} x [\\Phi(x)]^{j-1} [1 - \\Phi(x)]^{n-j} p(x) dx$$\n\nwhere $\\Phi(x)$ is the cumulative distribution function (CDF) of $\\mathcal{N}(\\mu_t, \\sigma_t^2)$ and $Z(n, j) = \\frac{n!}{(j-1)!(n-j)!}$.\nNote that each order statistic\u2019s distribution, including the maximum, increases as the variance of the\nrun-time distribution increases, while the average run-time does not.\nAs a baseline in subsequent sections we will use an approximation of the expected order statistics\nunder this iid normality assumption. This is known as the Elfving (1947) formula (Royston, 1982):\n\n$$E[\\tilde{x}_{(j)}] \\approx \\mu_t + \\Phi^{-1}\\left(\\frac{j - \\frac{\\pi}{8}}{n - \\frac{\\pi}{4} + 1}, 0, 1\\right) \\sigma_t \\qquad (3)$$\n\nHere, we note that the Elfving model requires full observability of runtimes to predict subsequent\nruntimes in a production setting. In practice, the parameters $\\mu_t, \\sigma_t$ in Eqn. 3 are fit using maximum\nlikelihood on the first fixed lagged window and remain static during the remainder of the run.\nWhile some clusters may approximate the strong assumptions required to use the Elfving formula for\ncutoff prediction, most compute clusters will emit joint order statistics of non-Gaussian distributed\ncorrelated random variables, for which no analytic expression exists. However, if we have a\npredictive model of the joint distribution of the $x_j$\u2019s (or $\\tilde{x}_{(j)}$\u2019s), we can use sorted samples from such a joint\ndistribution to obtain a Monte Carlo approximation of the order statistics. In the next section, we\nwill detail how to construct the predictive model in order to learn correlations of worker runtimes.\n\nFigure 2: Predicted throughputs. Each runtime plot (5 surrounding the top figure) shows the\nindividual runtimes of all workers (x-axis index) during an iteration of SGD on a 158 node cluster.\nWe show SGD iterations 1, 50, 100, 150, and 200, which highlight two significantly different\nregimes of persistent time-and-machine-identity correlated worker runtimes. The bottom-right\nfigure displays a comparison of throughputs achieved at each of the 200 SGD iterations by waiting\nfor all workers to finish (green) and using our approach, BDSGD (red), relative to the ground truth\nmaximum achievable (blue). 
BDSGD predicts cutoffs that achieve near optimal throughput, in a\nsetting where fixed cutoffs are insufficient and Elfving assumptions do not hold.\n\n3.1.2 Bayesian Cutoff\n\nIn this section, we formally introduce our proposed training method, which we call Bayesian\ndistributed SGD (BDSGD). Before introducing the design of the generative model we use to predict\nworker run-times, first consider the practical implications of using a generative model instead of\na purely autoregressive model. In short we can only consider worker run-time prediction models\nthat are extremely sample efficient to train. We also can only consider a kind of model that allows\nreal-time prediction because it will be in the inner loop of the parameter server and used to predict at\nrun-time how many straggling workers to ignore. Deep neural net auto-regressors satisfy the latter\nbut not the former. Generative models satisfy the former but historically not the latter; except now\ndeep neural net guided amortized inference in generative models does. This forms the core of our\ntechnical approach.\nWe will model the time sequence of observed joint worker run-times $x_{T-\\ell}, \\ldots, x_T$ using a deep\nstate space model where $z_{T-\\ell}, \\ldots, z_T$ is the time evolving unobserved latent state of the cluster.\nIn this framework, $x_{T-\\ell:T}$ may be replaced with directly modeling $\\tilde{x}_{T-\\ell:T}$, and we continue with\n$x_{T-\\ell:T}$ for clarity. 
The dependency structure of our model factorizes as\n\n$$p_\\theta(x_{T-\\ell:T}, z_{T-\\ell:T}) = \\prod_{i=T-\\ell}^{T} p_\\theta(z_i | z_{i-1}) \\prod_{i=T-\\ell}^{T} p_\\theta(x_i | z_i)$$\n\nwhere, for reasons specific to amortizing inference, we will restrict our model to a fixed-lag $\\ell$ window.\nThe principal use of the model is the accurate prediction of the next set of worker run-times from those\nthat have come before:\n\n$$p(x_{T+1} | x_{T-\\ell:T}) = \\int p_\\theta(x_{T+1} | z_{T+1}) p_\\theta(z_{T+1} | z_T) p(z_{T-\\ell:T} | x_{T-\\ell:T}) dz_{T-\\ell:T+1} \\qquad (4)$$\n\n3.1.3 Model Learning and Amortized Inference\n\nWith the coarse-grained model dependency defined, it remains to specify the fine-grained\nparameterization of the generative model, to explain how to train the model, and to show how to perform\nreal-time approximate inference in the model.\nWe use the deep linear dynamical model introduced by Krishnan et al. (2017), which constructs the\nLDS with MLP link functions between Gaussian distributed latent state and observation vectors.\nInference in the model is done with a non-Markovian proposal network. Namely, the transition and\nemission functions in our model are parameterized by neural networks, the inference over which is\nguided by an RNN. For a detailed exposition, see the supplementary materials.\n\n[Figure 2 panels: per-iteration worker runtime plots (x-axis: worker, y-axis: SGD runtime); throughput vs. SGD iteration (legend: maximum possible throughput, naive throughput, amortized inference throughput); time saved per run over naive (unnormalized), [naive time] - [cutoff time], vs. SGD iteration.]\n\n(a) SGD runtime profiles for two iterations\n\n(b) MNIST full training run\n\nFigure 3: Cumulative results of 158-worker cluster on 3-layer MLP training. The left two figures\nare plots of observed runtimes vs predicted runtime order statistics of two iterations of SGD of\nthe validation set in the BDSGD training step. The maximum throughput cutoff under the model\npredictions is shown in red, indicating a large chunk of idle time is reduced as a result of stopping\nearly. Notably, when there are exceptionally slow workers present, the cutoff is set to proceed\nwithout any of them as seen in the left figure of subplot (a). Subplot (b) shows MNIST validation loss\nfor model-based methods, Elfving and BDSGD, compared to naive synchronous (waiting for all\nworkers) and asynchronous (Hogwild) approaches, where dashed vertical lines indicate the time at\nwhich the final iteration completed (all training methods perform the same number of mini-batch\ngradient updates). In the lower right corner of subplot (b), we observe that BDSGD achieves the\nfastest time to complete the fixed number of gradient updates of the synchronous methods, while\nalso achieving the lowest validation loss.\n\nThe flexibility of such a model allows us to avoid making restrictive or inappropriate assumptions\nthat might be quite far from the true generative model while imposing rough structural assumptions\nthat seem appropriate like correlation over time and correlation between workers at a given time.\nThe remaining tasks are, given a set of training data (i.e. fully observed SGD runtimes specific\nto a cluster), to learn $\\theta$ and to train an amortized inference network to perform real-time inference in said\nmodel. 
For this we utilize the variational autoencoder-style loss used for amortized inference in deep\nprobabilistic programming with guide programs (Ritchie et al., 2016).\nWe use stochastic gradient descent to simultaneously optimize the variational evidence lower bound\n(ELBO) with respect to both $\\phi$ and $\\theta$:\n\n$$\\mathrm{ELBO} = E_{q_\\phi(z_{T-\\ell:t} | x_{T-\\ell:T})} \\log\\left(\\frac{p_\\theta(x_{T-\\ell:t}, z_{T-\\ell:t})}{q_\\phi(z_{T-\\ell:t} | x_{T-\\ell:T})}\\right)$$\n\nwhere\n\n$$q_\\phi(z_{T-\\ell:t} | x_{T-\\ell:T}) = \\prod_{t=T-\\ell}^{T} q_\\phi(z_t | z_{T-\\ell:t-1}, x_{T-\\ell:T}).$$\n\nDoing this yields a useful by-product. Maximizing the ELBO also drives the KL divergence between\n$q_\\phi(z_{T-\\ell:t} | x_{T-\\ell:T})$ and $p_\\theta(z_{T-\\ell:t} | x_{T-\\ell:t})$ to be small. We will exploit this fact in our experiments\nto enable cutoff prediction. In particular we will directly approximate Equation 4 by:\n\n$$p(x_{T+1} | x_{T-\\ell:T}) \\approx \\int p_\\theta(x_{T+1} | z_{T+1}) p_\\theta(z_{T+1} | z_T) q_\\phi(z_{T-\\ell:T} | x_{T-\\ell:T}) dz_{T-\\ell:T+1} \\approx \\frac{1}{K} \\sum_{k=1}^{K} p_\\theta(x_{T+1} | z_{T+1}) p_\\theta(z_{T+1} | z_T^{(k)}) \\qquad (5)$$\n\nwith $z_T^{(k)}$ being the last-time-step marginal of the kth of K samples from $q_\\phi(z_{T-\\ell:T} | x_{T-\\ell:T})$.\nThe predictive runtimes given by this technique can now be used to determine the\nthroughput-maximizing cutoff in the objective given by Equation 2.\n\n[Figure 3 panels: (a) minibatch runtime vs. worker index, predicted and true (panels A-D); (b) loss vs. time (seconds) for Hogwild!, Bayesian SGD, Synchronous, and Elfving SGD.]\n\n3.1.4 Handling Censored Run-times\n\nAs described, we use the learned inference network to predict future cutoffs rather than the\ngenerative model. 
Because variational inference jointly learns the model parameters along with the\ninference network, we could theoretically use an inference algorithm such as SMC (Doucet et al.,\n2001) for more accurate estimates of the true posterior predictive distribution. However, our cutoff\nprediction must be done in an amortized setting because we rely on it to be set for a gradient run\nprior to the updates returning from the workers. In a setting requiring fast, repeated inference, using\nan amortized method is often the only feasible approach, especially in large complex models.\nHowever, when using amortized inference, there is a practical issue of dealing with partially\nobserved, censored data. Since at run-time we are only waiting for c gradients up to the cutoff, and\nare in fact actually killing the straggling workers, we do not have the run-time information from\nthe straggling workers that would have finished past the cutoff. Inference in the generative model\ncould directly be made able to deal with censored data; however, our inference network runs an RNN\nwhich was trained on fully observed run-time vectors and therefore requires fully observed input\nto function correctly. Because of this, we deploy an effective approximate technique for imputing\nthe missing worker runtime values, which samples a new uncensored data point for every worker\nwhose gradients are dropped. Because we push estimates of the approximate posterior through\nthe generative model, we have a predictive run-time distribution for the current iteration of SGD\nbefore receiving actual updates from any worker. When eventually the cutoff is reached, and the\ncorresponding rate censor is observed, we are left with run-time distributions truncated at $\\tilde{x}_{(c)}$:\n\n$$p(\\tilde{x}, \\tilde{x} > \\tilde{x}_{(c)}) = \\frac{p(\\tilde{x})}{\\int_{\\tilde{x}_{(c)}}^{\\infty} p(\\tilde{x}) d\\tilde{x}} \\qquad (6)$$\n\nwhere we have left off the time index for clarity and $\\tilde{x}$ is any one of the censored worker runtime\nobservations. 
When a censored value is required, we take its corresponding predicted run-time\ndistribution and sample from its right-tailed truncation to get an approximate value for that missing\nrun-time. We find that this method works well to propagate the model forward, leading to still\naccurate predictions.\n\n4 Experiments\n\nTo test our model\u2019s ability to accurately predict joint worker runtimes, we perform experiments by\ntraining 4 different neural network architectures on one of two clusters of different architectures and\nsizes. To train the BDSGD generative models used, we first train the neural network architecture\nof interest using fully synchronous SGD and use the recorded worker runtimes during each SGD\niteration to learn a corresponding generative model of that particular neural network\u2019s gradient\ncomputation times on the cluster. As we will highlight, BDSGD model-based estimates of expected\nruntimes are sufficient to derive a straggler cutoff from their order statistics that leads to increased\nthroughput and/or faster training times in real world situations.\n\n4.1 Small Compute Cluster\n\nOn one cluster comprised of four nodes of 40 logical Intel Xeon processors, we benchmark Elfving\nand BDSGD cutoff predictors against the fully synchronous and fully asynchronous SGD with a\n158-worker model by using each method to train a 2-layer CNN on MNIST classification. At this\nscale, and on a small neural network model, we are still able to deploy a Hogwild training scheme\nthat does not diverge.\nThis cluster uses a job scheduler that allows jobs to be run concurrently on each node. From one\nof the fully synchronous SGD runs used to gather runtime data for BDSGD model learning, we\nfind that 40 of the workers localized to one machine node experience a temporary 2\u00d7 slowdown\nin gradient computation times, which we believe to have been caused by another job batched onto\nthe node. 
Figure 2 shows the transitional window of SGD worker runtimes, where the first 75 SGD\niterations experience this slowdown before returning to normal for the remaining 125 iterations. In\nthe case of 25% slow workers running 2\u00d7 slower, naive synchronous SGD decreases throughput\nby 50%. BDSGD, however, is able to correctly ignore all 40 slow workers (Fig. 3a), leading to\nnear-optimal throughput despite the 2\u00d7 slowdown in SGD iteration time.\n\nFigure 4: Comparison of BDSGD against Chen et al. on synchronous test data for three neural\nnetwork training runs with no early worker termination. From left: histogram of cutoff set by\neach method (Chen et al. always uses 2051 workers and naive always uses all 2175 workers),\nhistogram of the absolute difference between chosen cutoff and the oracle cutoff which would achieve\nhighest throughput, quartile-boxplot of gradient computation throughput per iteration, and percent\nwall clock time saved using each method over fully synchronous (naive) method. We observe that\nBDSGD best predicts the oracle cutoff, which leads to highest throughput in all three cases, and\nhighest expected wall-clock savings when BDSGD\u2019s average cutoff is lower than that fixed by Chen\net al. (ResNet-16 model).\n\nFigure 3b shows that our method achieves the fastest convergence to the lowest loss among\ncomparison methods performing synchronous SGD. Hogwild outperforms our approach in wall-clock time,\nbut its convergence is to a higher validation loss, as seen in the tails of the loss curves. Although not\nshown in subsequent experiments, Hogwild training diverges on larger clusters.\n\n4.2 Large Scale Computing\n\nOn a large compute cluster, we use 32 68-core CPU nodes of a Cray XC40 supercomputer to\ncompare 2175-worker BDSGD runs against the Chen et al. 
cutoff and naive methods on training three\nneural network architectures for CIFAR10 classi\ufb01cation: a WideResNet model (Zagoruyko and Ko-\nmodakis, 2016) and 16 and 64 layer ResNets. Using increasingly larger networks and batchsizes\nallows us to benchmark our speedup in situations called for by the large amount of recent work on\ntraining with 10K+ mini-batch sizes and high learning rates, (e.g. : Codreanu et al. (2017); You\net al. (2017a,b); Smith et al. (2017)). For generative models of these neural network models on this\ncluster, we empirically \ufb01nd that training the latent variable model to directly emit sorted runtime\norder statistics is both faster to train and more accurate. Sampled draws from these distributions are\nreordered as before to calculate the predicted maximum throughput.\nUnlike the 158-worker cluster, jobs on the Cray XC40 are sequestered to dedicated nodes by the\nscheduler. In Figure 4, we compare BDSGD and Chen et al.\u2019s \ufb01xed cutoff method on validation\nsets in order to isolate the effect of accurate cutoff prediction on expected iteration throughput and\nspeedup. BDSGD provides the best model of these runtimes, subsequently leading to near optimal\nthroughput.\nFigure 5 shows the wall-clock training times and throughputs achieved under real workloads for\neach training method on the same three neural networks. Comparing the production throughput in\nrows 1 and 2 of Figure 5 to the expected throughput in rows 1 and 2 of Figure 4, all training methods\nexperience a small drop-off in throughput when run in production due to communication costs and\nother additional overhead. For these two neural networks, BDSGD is still shown to produce the\nhighest throughput when used during training. 
For training the WideResNet, Chen et al.\u2019s method\nachieves a 1.2% speedup in wall-clock training time whereas BDSGD is able to calculate 5% more\ngradients to achieve a 1% speedup in wall-clock training time. Similarly, BDSGD achieves a 7.8%\nspeedup in wall-clock training time while using 3.2% more gradients than Chen et al.\u2019s method,\nwhich achieves a 9.3% speedup.\nIn the final row of Figure 5, we demonstrate that BDSGD prediction is robust to the\nscenario in which performing amortized inference in the generative model takes longer than\nthe workers need to finish their gradient computations. Here, we use a modified variant of BDSGD that\nfixes the predicted cutoff for ten iterations at a time in order to avoid being bottlenecked at every\niteration by the parameter server cutoff prediction. In doing so, we show in the final row of Figure 5\nthat one may still achieve a 7.6% speedup with sparse predictions from the generative model.\nAll training methods for the three neural network models in Figure 5 train to a similar final held-out\nvalidation accuracy.\n\n[Figure 4 plot. Rows: WideResNet-16:10, ResNet-64, ResNet-16. Panels: Exact Cutoff, Distance to Oracle Cutoff, Throughput per Iteration, % Time Saved over Naive. Legend: Naive, Chen et al., BDSGD, Oracle.]\n\nFigure 5: Production training runs comparing BDSGD, Chen et al., and fully synchronous (naive)\nmethods. The neural networks on each row are trained to convergence using the three training methods,\nfive times each. Each row in this figure corresponds to the same row in Figure 4. The columns\nshow quartile-boxplots and mean\u00b1std (purple error bar). From left to right: iteration throughput\nachieved during training, wall-clock time to fixed iteration (400, 600, and 1500 for WideResNet,\nResNet-64, and ResNet-16, respectively), and validation loss at a fixed wall-clock time (set to the\nwall-clock time at 50% of the total training time taken by Chen et al.\u2019s method). The two big neural\nnetwork models, ResNet-64 and WideResNet-16:10, achieve the highest throughput when using\nBDSGD, training to roughly equal or better validation loss at a fixed wall-clock time, while improving\ntotal training time by roughly the same as Chen et al. despite setting much higher cutoffs. The\nResNet-16 model demonstrates the ability to run a modified BDSGD on a small network, where the\namortized inference time exceeds gradient computation time on average.\n\n5 Discussion\n\nWe have presented a principled approach to achieving higher throughput in synchronous distributed\ngradient descent. Our primary contributions include describing how a model of worker runtimes\ncan be used to predict order statistics that allow for a near optimal choice of straggler cutoff that\nmaximizes gradient computation throughput.\nWhile the focus throughout has been on vanilla SGD, it should be clear that our method and\nalgorithm can be nearly trivially extended to most optimizers of choice so long as they are stochastic\nin their operation on the training set.
Most methods for learning deep neural network models today\nfit this description, including for instance the Adam optimizer (Kingma and Ba, 2014).\nWe conclude with a note that our method implicitly assumes that every mini-batch is of the same\ncomputational cost in expectation, which may not always be the case. Future work could extend\nthe inference network further (Rezende and Mohamed, 2015) or investigate variable length input\nin distributed training as in Ergen and Kozat (2017).\n\n[Figure 5 plot. Rows: WideResNet-16:10, ResNet-64, ResNet-16. Panels: Throughput per Iteration (gradients / second), Wall Clock Time To Fixed Iteration (time (seconds)), Loss at Fixed Wall Clock Time (top@1 accuracy). Legend: Naive, Chen et al., BDSGD.]\n\nAcknowledgments\n\nMichael Teng is supported under DARPA D3M, under Cooperative Agreement FA8750-17-2-0093\nand partially supported by the NERSC Big Data Center under Intel BDC. This research used resources\nof the National Energy Research Scientific Computing Center (NERSC), a DOE Office of\nScience User Facility supported by the Office of Science of the U.S. Department of Energy under\nContract No. DE-AC02-05CH11231. We acknowledge Intel for their funding support and we thank\nmembers of the UBC PLAI group for helpful discussions.\n\nReferences\nChen, J., X. Pan, R. Monga, S. Bengio, and R. Jozefowicz\n\n2016. Revisiting distributed synchronous SGD. arXiv preprint arXiv:1604.00981.\n\nCodreanu, V., D. Podareanu, and V. Saletore\n\n2017. Scale out for large minibatch SGD: Residual network training on ImageNet-1k with improved\naccuracy and reduced time to train. arXiv preprint arXiv:1711.04291.\n\nDean, J., G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker, K. Yang, Q. V.\nLe, et al.\n\n2012.
Large scale distributed deep networks. In Advances in neural information processing\nsystems, Pp. 1223\u20131231.\n\nDoucet, A., N. De Freitas, and N. Gordon\n\n2001. An introduction to sequential Monte Carlo methods. In Sequential Monte Carlo methods\nin practice, Pp. 3\u201314. Springer.\n\nErgen, T. and S. S. Kozat\n\n2017. Online training of LSTM networks in distributed systems for variable length data sequences.\nIEEE Transactions on Neural Networks and Learning Systems.\n\nHoffer, E., I. Hubara, and D. Soudry\n\n2017. Train longer, generalize better: closing the generalization gap in large batch training of\nneural networks. In Advances in Neural Information Processing Systems, Pp. 1729\u20131739.\n\nKingma, D. P. and J. Ba\n\n2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.\n\nKingma, D. P. and M. Welling\n\n2013. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.\n\nKrishnan, R. G., U. Shalit, and D. Sontag\n\n2017. Structured Inference Networks for Nonlinear State Space Models. In AAAI, Pp. 2101\u20132109.\n\nLe, T., A. Baydin, and F. Wood\n\n2017. Inference Compilation and Universal Probabilistic Programming. In 20th International\nConference on Artificial Intelligence and Statistics.\n\nMasters, D. and C. Luschi\n\n2018. Revisiting small batch training for deep neural networks. arXiv preprint arXiv:1804.07612.\n\nMcMahan, B. and M. Streeter\n\n2014. Delay-tolerant algorithms for asynchronous distributed online learning. In Advances in\nNeural Information Processing Systems, Pp. 2915\u20132923.\n\nRecht, B., C. Re, S. Wright, and F. Niu\n\n2011. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in\nneural information processing systems, Pp. 693\u2013701.\n\nRezende, D. J. and S. Mohamed\n\n2015. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770.\n\nRitchie, D., P. Horsfall, and N. D. Goodman\n\n2016.
Deep amortized inference for probabilistic programs. arXiv preprint arXiv:1610.05735.\n\nRoyston, J.\n\n1982. Algorithm AS 177: Expected normal order statistics (exact and approximate). Journal of\nthe royal statistical society. Series C (Applied statistics), 31(2):161\u2013165.\n\nSmith, S. L., P.-J. Kindermans, and Q. V. Le\n\n2017. Don\u2019t decay the learning rate, increase the batch size. arXiv preprint arXiv:1711.00489.\n\nYou, Y., I. Gitman, and B. Ginsburg\n\n2017a. Scaling SGD batch size to 32k for ImageNet training. arXiv preprint arXiv:1708.03888.\n\nYou, Y., Z. Zhang, C. Hsieh, J. Demmel, and K. Keutzer\n\n2017b. ImageNet training in minutes. CoRR, abs/1709.05011.\n\nZagoruyko, S. and N. Komodakis\n\n2016. Wide residual networks. arXiv preprint arXiv:1605.07146.\n\nZhang, S., A. E. Choromanska, and Y. LeCun\n\n2015. Deep learning with elastic averaging SGD. In Advances in Neural Information Processing\nSystems, Pp. 685\u2013693.\n", "award": [], "sourceid": 3138, "authors": [{"given_name": "Michael", "family_name": "Teng", "institution": "University of Oxford (visiting at University of British Columbia)"}, {"given_name": "Frank", "family_name": "Wood", "institution": "University of British Columbia"}]}