{"title": "Parallelized Stochastic Gradient Descent", "book": "Advances in Neural Information Processing Systems", "page_first": 2595, "page_last": 2603, "abstract": "With the increase in available data, parallel machine learning has become an increasingly pressing problem. In this paper we present the first parallel stochastic gradient descent algorithm, including a detailed analysis and experimental evidence. Unlike prior work on parallel optimization algorithms, our variant comes with parallel acceleration guarantees and poses no overly tight latency constraints, which might only be available in the multicore setting. Our analysis introduces a novel proof technique --- contractive mappings --- to quantify the speed of convergence of parameter distributions to their asymptotic limits. As a side effect this answers the question of how quickly stochastic gradient descent algorithms reach the asymptotically normal regime.", "full_text": "Parallelized Stochastic Gradient Descent\n\nMartin A. Zinkevich\n\nYahoo! Labs\n\nSunnyvale, CA 94089\n\nmaz@yahoo-inc.com\n\nAlex Smola\nYahoo! Labs\n\nSunnyvale, CA 94089\n\nsmola@yahoo-inc.com\n\nMarkus Weimer\n\nYahoo! Labs\n\nSunnyvale, CA 94089\n\nweimer@yahoo-inc.com\n\nLihong Li\nYahoo! Labs\n\nSunnyvale, CA 94089\n\nlihong@yahoo-inc.com\n\nAbstract\n\nWith the increase in available data, parallel machine learning has become an increasingly pressing problem. In this paper we present the first parallel stochastic gradient descent algorithm, including a detailed analysis and experimental evidence. Unlike prior work on parallel optimization algorithms [5, 7], our variant comes with parallel acceleration guarantees and poses no overly tight latency constraints, which might only be available in the multicore setting. Our analysis introduces a novel proof technique, contractive mappings, to quantify the speed of convergence of parameter distributions to their asymptotic limits. 
As a side effect this answers the question of how quickly stochastic gradient descent algorithms reach the asymptotically normal regime [1, 8].\n\n1 Introduction\n\nOver the past decade the amount of available data has increased steadily. By now, some industrial-scale datasets are approaching petabytes. Given that the bandwidth of storage and network per computer has not been able to keep up with the increase in data, the need to design data analysis algorithms which can perform most steps in a distributed fashion, without tight constraints on communication, has become ever more pressing. A simple example illustrates the dilemma: at current disk bandwidth and capacity (2TB at 100MB/s throughput) it takes about 6 hours just to read the contents of a single hard disk. For a decade, the move from batch to online learning algorithms was able to deal with increasing data set sizes, since it reduced the runtime behavior of inference algorithms from cubic or quadratic to linear in the sample size. However, whenever we have more than a single disk of data, it becomes computationally infeasible to process all data by stochastic gradient descent, which is an inherently sequential algorithm, at least if we want the result within a matter of hours rather than days.\n\nThree recent papers attempted to break this parallelization barrier, each of them with mixed success. [5] show that parallelization is easily possible in the multicore setting, where a tight coupling of the processing units ensures extremely low latency between the processors. 
In particular, for non-adversarial settings it is possible to obtain algorithms which scale perfectly in the number of processors, both in the case of bounded gradients and in the strongly convex case. Unfortunately, these algorithms are not applicable to a MapReduce setting, since the latter is fraught with considerable latency and bandwidth constraints between the computers.\n\nA more MapReduce-friendly set of algorithms was proposed by [3, 9]. In a nutshell, they rely on distributed computation of gradients locally on each computer which holds parts of the data, and subsequent aggregation of the gradients to perform a global update step. This algorithm scales linearly in the amount of data and log-linearly in the number of computers. That said, the overall cost in terms of computation and network is very high: it requires many passes through the dataset for convergence. Moreover, it requires many synchronization sweeps (i.e. MapReduce iterations). In other words, this algorithm is computationally very wasteful when compared to online algorithms.\n\n[7] attempted to deal with this issue by a rather ingenious strategy: solve the sub-problems exactly on each processor, and in the end average these solutions to obtain a joint solution. The key advantage of this strategy is that only a single MapReduce pass is required, thus dramatically reducing the amount of communication. Unfortunately, their proposed algorithm has a number of drawbacks: the theoretical guarantees they are able to obtain imply a significant variance reduction relative to the single-processor solution [7, Theorem 3, equation 13] but no bias reduction whatsoever [7, Theorem 2, equation 9] relative to a single-processor approach. Furthermore, their approach requires a relatively expensive algorithm (a full batch solver) to run on each processor. 
A further drawback of the analysis in [7] is that the convergence guarantees depend heavily on the degree of strong convexity as endowed by regularization. However, since regularization tends to decrease with increasing sample size, the guarantees become increasingly loose in practice as we see more data.\n\nWe attempt to combine the benefits of the single-average strategy proposed by [7] with the asymptotic analysis [8] of online learning. Our proposed algorithm is strikingly simple: denote by c_i(w) a loss function indexed by i and with parameter w. Then each processor carries out stochastic gradient descent on the set of c_i(w) with a fixed learning rate η for T steps, as described in Algorithm 1.\n\nAlgorithm 1 SGD({c_1, . . . , c_m}, T, η, w_0)\n\nfor t = 1 to T do\n\nDraw j ∈ {1, . . . , m} uniformly at random.\nw_t ← w_{t−1} − η ∂_w c_j(w_{t−1}).\n\nend for\nreturn w_T.\n\nOn top of the SGD routine, which is carried out on each computer, we have a master routine which aggregates the solutions in the same fashion as [7].\n\nAlgorithm 2 ParallelSGD({c_1, . . . , c_m}, T, η, w_0, k)\n\nfor all i ∈ {1, . . . , k} parallel do\n\nv_i = SGD({c_1, . . . , c_m}, T, η, w_0) on client i\nend for\nAggregate from all computers v = (1/k) Σ_{i=1}^k v_i and return v.\n\nThe key algorithmic difference to [7] is that the batch solver of the inner loop is replaced by a stochastic gradient descent algorithm, which digests not a fixed fraction of the data but rather a random fixed subset of the data. 
This means that if we process T instances per machine, each processor ends up seeing T/m of the data, which is likely to exceed 1/k.\n\nAlgorithm | Latency tolerance | MapReduce | Network IO | Scalability\nDistributed subgradient [3, 9] | moderate | yes | high | linear\nDistributed convex solver [7] | high | yes | low | unclear\nMulticore stochastic gradient [5] | low | no | n.a. | linear\nThis paper | high | yes | low | linear\n\nA direct implementation of the algorithms above would place every example on every machine: however, if T is much less than m, then it is only necessary for a machine to have access to the data it actually touches. Large scale learning, as defined in [2], is when an algorithm is bounded by the time available instead of by the amount of data available. Practically speaking, that means that one can consider the actual data in the real dataset to be a subset of a virtually infinite set, and drawing with replacement (as the theory here implies) and drawing without replacement from the infinite data set can both be simulated by shuffling the real data and accessing it sequentially. The initial distribution and shuffling can be a part of how the data is saved.\n\nAlgorithm 3 SimuParallelSGD(Examples {c_1, . . . , c_m}, Learning Rate η, Machines k)\n\nDefine T = ⌊m/k⌋\nRandomly partition the examples, giving T examples to each machine.\nfor all i ∈ {1, . . . , k} parallel do\n\nRandomly shuffle the data on machine i.\nInitialize w_{i,0} = 0.\nfor all t ∈ {1, . . . , T} do\n\nGet the tth example on the ith machine (this machine), c_{i,t}\nw_{i,t} ← w_{i,t−1} − η ∂_w c_{i,t}(w_{i,t−1})\n\nend for\n\nend for\nAggregate from all computers v = (1/k) Σ_{i=1}^k w_{i,T} and return v.\n\nSimuParallelSGD fits very well with the large scale learning paradigm as well as the MapReduce framework. Our paper presents an anytime algorithm via stochastic gradient descent. The algorithm requires no communication between machines until the end. 
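The whole procedure fits in a few lines of code. The following is a minimal, illustrative implementation of Algorithm 3 (SimuParallelSGD) that simulates the k machines sequentially; the one-dimensional regularized squared loss and the synthetic data are assumptions made for the sake of the example, not the paper's experimental setup.

```python
import random

def sgd(examples, eta, lam, w0=0.0):
    """Inner loop (Algorithm 1): fixed-learning-rate SGD over one machine's data."""
    w = w0
    for x, y in examples:
        # gradient of c_i(w) = (lam/2) w^2 + (1/2) (y - w x)^2
        grad = lam * w - (y - w * x) * x
        w -= eta * grad
    return w

def simu_parallel_sgd(examples, eta, lam, k):
    """Algorithm 3: shuffle, partition into k parts, run SGD per part, average."""
    random.shuffle(examples)
    t = len(examples) // k                      # T = floor(m / k)
    parts = [examples[i * t:(i + 1) * t] for i in range(k)]
    vs = [sgd(part, eta, lam) for part in parts]
    return sum(vs) / k                          # the single aggregation step

random.seed(0)
data = [(1.0, 2.0 + random.gauss(0.0, 0.1)) for _ in range(4000)]  # y ~ 2x + noise
v = simu_parallel_sgd(data, eta=0.05, lam=0.01, k=4)
print(round(v, 2))
```

Averaging the k local solutions is the only communication; with λ = 0.01 the returned parameter lands near the regularized least-squares optimum 2/(1 + λ) ≈ 1.98.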
This is perfectly suited to MapReduce settings. Asymptotically, the error approaches zero. The amount of time required is independent of the number of examples, depending only upon the regularization parameter and the desired error at the end.\n\n2 Formalism\n\nIn stark contrast to the simplicity of Algorithm 2, its convergence analysis is highly technical. Hence we limit ourselves to presenting the main results in this extended abstract. Detailed proofs are given in the appendix. Before delving into details we briefly outline the proof strategy:\n\n• When performing stochastic gradient descent with fixed (and sufficiently small) learning rate η, the distribution of the parameter vector is asymptotically normal [1, 8]. Since all computers are drawing from the same data distribution, they all converge to the same limit.\n\n• Averaging between the parameter vectors of k computers reduces variance by O(k^(−1/2)), similar to the result of [7]. However, it does not reduce bias (this is where [7] falls short).\n\n• To show that the bias due to joint initialization decreases, we need to show that the distribution of parameters per machine converges sufficiently quickly to the limit distribution.\n\n• Finally, we also need to show that the mean of the limit distribution for fixed learning rate is sufficiently close to the risk minimizer. That is, we need to take finite learning rate effects into account relative to the asymptotically normal regime.\n\n2.1 Loss and Contractions\n\nIn this paper we consider estimation with convex loss functions c_i : ℓ₂ → [0, ∞). While our analysis extends to other Hilbert spaces such as RKHSs, we limit ourselves to this class of functions for convenience. For instance, in the case of regularized risk minimization we have\n\nc_i(w) = (λ/2) ‖w‖² + L(x_i, y_i, w · x_i)    (1)\n\nwhere L is a convex function in w · x_i, such as (1/2)(y_i − w · x_i)² for regression or log[1 + exp(−y_i w · x_i)] for binary classification. 
The goal is to find an approximate minimizer of the overall risk\n\nc(w) = (1/m) Σ_{i=1}^m c_i(w).    (2)\n\nTo deal with stochastic gradient descent we need tools for quantifying distributions over w.\n\nLipschitz continuity: A function f : X → R is Lipschitz continuous with constant L with respect to a distance d if |f(x) − f(y)| ≤ L d(x, y) for all x, y ∈ X.\n\nHölder continuity: A function f is Hölder continuous with constant L and exponent α if |f(x) − f(y)| ≤ L d^α(x, y) for all x, y ∈ X.\n\nLipschitz seminorm: [10] introduce a seminorm. With minor modification we use\n\n‖f‖_Lip := inf {l where |f(x) − f(y)| ≤ l d(x, y) for all x, y ∈ X}.    (3)\n\nThat is, ‖f‖_Lip is the smallest constant for which Lipschitz continuity holds.\n\nHölder seminorm: Extending the Lipschitz norm for α ≥ 1:\n\n‖f‖_{Lip α} := inf {l where |f(x) − f(y)| ≤ l d^α(x, y) for all x, y ∈ X}.    (4)\n\nContraction: For a metric space (M, d), f : M → M is a contraction mapping if ‖f‖_Lip < 1.\n\nIn the following we assume that ‖L(x, y, y′)‖_Lip ≤ G as a function of y′ for all occurring data (x, y) ∈ X × Y and for all values of w within a suitably chosen (often compact) domain.\n\nTheorem 1 (Banach's Fixed Point Theorem) If (M, d) is a non-empty complete metric space, then any contraction mapping f on (M, d) has a unique fixed point x* = f(x*).\n\nCorollary 2 The sequence x_t = f(x_{t−1}) converges linearly with d(x*, x_t) ≤ ‖f‖_Lip^t d(x_0, x*).\n\nOur strategy is to show that the stochastic gradient descent mapping\n\nw ← φ_i(w) := w − η ∇c_i(w)    (5)\n\nis a contraction, where i is selected uniformly at random from {1, . . . , m}. This would allow us to demonstrate exponentially fast convergence. 
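The contraction property that this strategy relies on is easy to check numerically. The sketch below applies the update (5) to two different parameter values for a one-dimensional regularized squared loss (an assumed example, where the Lipschitz bound on the loss gradient is 1) and verifies that their distance shrinks at least as fast as (1 − ηλ)^t, as the analysis below asserts.

```python
def phi(w, x, y, eta, lam):
    """One step of update (5) on c(w) = (lam/2) w^2 + (1/2) (y - w x)^2."""
    return w - eta * (lam * w - (y - w * x) * x)

eta, lam = 0.1, 0.5
x, y = 0.8, 1.0            # a single example; eta <= (x^2 * 1 + lam)^-1 holds
w1, w2 = -3.0, 5.0         # two arbitrary starting points
d0 = abs(w1 - w2)
for _ in range(20):
    w1 = phi(w1, x, y, eta, lam)
    w2 = phi(w2, x, y, eta, lam)
d20 = abs(w1 - w2)
# distance after 20 steps stays below the (1 - eta * lam)^20 contraction envelope
print(d20 <= (1 - eta * lam) ** 20 * d0)
```

Because the update is affine in w for this loss, the per-step shrinkage factor is exactly |1 − η(λ + x²)|, which is indeed below 1 − ηλ for any admissible η.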
Note that since the algorithm selects i at random, different runs with the same initial settings can produce different results. A key tool is the following:\n\nLemma 3 Let c* ≥ ‖∂_ŷ L(x_i, y_i, ŷ)‖_Lip be a Lipschitz bound on the loss gradient. Then if η ≤ (‖x_i‖² c* + λ)^(−1), the update rule (5) is a contraction mapping in ℓ₂ with Lipschitz constant 1 − ηλ.\n\nWe prove this in Appendix B. If we choose η “low enough”, gradient descent uniformly becomes a contraction. We define\n\nη* := min_i (‖x_i‖² c* + λ)^(−1).    (6)\n\n2.2 Contraction for Distributions\n\nFor fixed learning rate η, stochastic gradient descent is a Markov process with state vector w. While there is considerable research regarding the asymptotic properties of this process [1, 8], not much is known regarding the number of iterations required until the asymptotic regime is reached. We now address the latter by extending the notion of contractions from mappings of points to mappings of distributions. For this we introduce the Monge-Kantorovich-Wasserstein earth mover's distance.\n\nDefinition 4 (Wasserstein metric) For a Radon space (M, d) let P(M, d) be the set of all distributions over the space. The Wasserstein distance between two distributions X, Y ∈ P(M, d) is\n\nW_z(X, Y) = ( inf_{γ∈Γ(X,Y)} ∫_{x,y} d^z(x, y) dγ(x, y) )^(1/z)    (7)\n\nwhere Γ(X, Y) is the set of probability distributions on (M, d) × (M, d) with marginals X and Y.\n\nThis metric has two very important properties: it is complete, and a contraction in (M, d) induces a contraction in (P(M, d), W_z). Given a mapping φ : M → M, we can construct p : P(M, d) → P(M, d) by applying φ pointwise to M. Let X ∈ P(M, d) and let X′ := p(X). 
Denote for any measurable event E its pre-image by φ^(−1)(E). Then we have X′(E) = X(φ^(−1)(E)).\n\nLemma 5 Given a metric space (M, d) and a contraction mapping φ on (M, d) with constant c, p is a contraction mapping on (P(M, d), W_z) with constant c.\n\nThis is proven in Appendix C. This shows that any single mapping is a contraction. However, since we draw c_i at random, we need to show that a mixture of such mappings is a contraction, too. Here the fact that we operate on distributions comes in handy, since a mixture of mappings on distributions is again a mapping on distributions.\n\nLemma 6 Given a Radon space (M, d), if p_1, . . . , p_k are contraction mappings with constants c_1, . . . , c_k with respect to W_z, and Σ_i a_i = 1 where a_i ≥ 0, then p = Σ_{i=1}^k a_i p_i is a contraction mapping with a constant of no more than [Σ_i a_i (c_i)^z]^(1/z).\n\nCorollary 7 If for all i, c_i ≤ c, then p is a contraction mapping with a constant of no more than c.\n\nThis is proven in Appendix C. We apply this to SGD as follows: Define p* = (1/m) Σ_{i=1}^m p_i to be the stochastic operation in one step. Denote by D^0_η the initial parameter distribution from which w_0 is drawn, and by D^t_η the parameter distribution after t steps, which is obtained via D^t_η = p*(D^{t−1}_η). Then the following holds:\n\nTheorem 8 For any z ∈ N, if η ≤ η*, then p* is a contraction mapping on (M, W_z) with contraction rate (1 − ηλ). Moreover, there exists a unique fixed point D*_η such that p*(D*_η) = D*_η. Finally, if w_0 = 0 with probability 1, then W_z(D^0_η, D*_η) = G/λ, and W_z(D^T_η, D*_η) ≤ (G/λ)(1 − ηλ)^T.\n\nThis is proven in Appendix F. 
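Theorem 8 can be illustrated with a small simulation: run many independent SGD chains from two very different starting distributions and track the Wasserstein distance between the two empirical parameter clouds; it collapses geometrically. The quadratic loss and the noise model below are assumptions for illustration only; in one dimension, W_2 between two equal-size samples reduces to matching sorted values.

```python
import random

def step(w, eta, lam, rng):
    """One SGD step on c(w) = (lam/2) w^2 + (1/2) (y - w)^2 with a random target y."""
    y = rng.gauss(1.0, 0.5)
    return w - eta * (lam * w - (y - w))

def w2(sample_a, sample_b):
    """Exact 2-Wasserstein distance between equal-size 1-d empirical samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    return (sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)) ** 0.5

rng = random.Random(1)
eta, lam, n = 0.05, 0.5, 2000
A = [0.0] * n          # chains drawn from D^0: every w0 = 0
B = [5.0] * n          # a second, far-away starting distribution
d0 = w2(A, B)          # initial distance = 5.0
for _ in range(100):
    A = [step(w, eta, lam, rng) for w in A]
    B = [step(w, eta, lam, rng) for w in B]
print(w2(A, B) < 0.05 * d0)
```

Both clouds are driven toward the same stationary distribution, so the distance between them shrinks roughly like (1 − ηλ)^t, up to sampling noise.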
The contraction rate (1 − ηλ) can be proven by applying Lemma 3, Lemma 5, and Corollary 7. As we show later, ‖w_t‖ ≤ G/λ with probability 1, so Pr_{w∈D*_η}[d(0, w) ≤ G/λ] = 1, and since w_0 = 0, this implies W_z(D^0_η, D*_η) = G/λ. From this, Corollary 2 establishes W_z(D^T_η, D*_η) ≤ (G/λ)(1 − ηλ)^T.\n\nThis means that for a suitable choice of η we achieve exponentially fast convergence in T to some stationary distribution D*_η. Note that this distribution need not be centered at the risk minimizer of c(w). What the result does establish is a guarantee that each computer carrying out Algorithm 1 converges rapidly to the same distribution over w, which will allow us to obtain good bounds if we can bound the 'bias' and 'variance' of D*_η.\n\n2.3 Guarantees for the Stationary Distribution\n\nAt this point, we know there exists a stationary distribution, and our algorithms are converging to that distribution exponentially fast. However, unlike in traditional gradient descent, the stationary distribution is not necessarily concentrated at the optimal point. In particular, the harder parts of understanding this algorithm involve understanding the properties of the stationary distribution. First, we show that the mean of the stationary distribution has low error. Therefore, if we ran for a really long time and averaged over many samples, the error would be low.\n\nTheorem 9 c(E_{w∈D*_η}[w]) − min_{w∈R^n} c(w) ≤ 2ηG².\n\nProven in Appendix G using techniques from regret minimization. 
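The bound of Theorem 9 can be sanity-checked by simulation: run fixed-η SGD for a long time, average the iterates to approximate the mean of D*_η, and compare the excess risk against 2ηG². The one-dimensional regularized logistic loss below, with random labels y ∈ {−1, +1} and hence minimizer w* = 0 by symmetry, is an assumed toy setting (G = 1 bounds the loss-gradient magnitude).

```python
import math
import random

def grad(w, y, lam):
    """Gradient of c(w) = (lam/2) w^2 + log(1 + exp(-y w)) for one label y."""
    return lam * w - y / (1.0 + math.exp(y * w))

def risk(w, lam):
    # expectation over y uniform on {-1, +1} of the loss above
    return lam / 2 * w * w + 0.5 * (math.log(1 + math.exp(-w)) + math.log(1 + math.exp(w)))

rng = random.Random(0)
eta, lam, G = 0.1, 0.5, 1.0
w, total, n = 0.0, 0.0, 0
for t in range(60000):
    y = rng.choice((-1.0, 1.0))
    w -= eta * grad(w, y, lam)
    if t >= 10000:                 # discard burn-in, then average the iterates
        total += w
        n += 1
w_bar = total / n                  # approximates the mean of D*_eta
excess = risk(w_bar, lam) - risk(0.0, lam)
print(excess <= 2 * eta * G * G)
```

In this symmetric setting the time-averaged iterate sits very close to w* = 0, so the measured excess risk is far below the 2ηG² ceiling; the theorem's point is that the ceiling holds regardless of the problem.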
Secondly, we show that the squared distance from the optimal point, and therefore the variance, is low.\n\nTheorem 10 The average squared distance of D*_η from the optimal point is bounded by:\n\nE_{w∈D*_η}[(w − w*)²] ≤ 4ηG² / ((2 − ηλ)λ).\n\nIn other words, the squared distance is bounded by O(ηG²/λ).\n\nProven in Appendix I using techniques from reinforcement learning. In what follows, if x ∈ M and Y ∈ P(M, d), we define W_z(x, Y) to be the W_z distance between Y and a distribution with probability 1 at x. Throughout the appendix, we develop tools to show that the distribution over the output vector of the algorithm is “near” μ_{D*_η}, the mean of the stationary distribution. In particular, if D^{T,k}_η is the distribution over the final vector of ParallelSGD after T iterations on each of k machines with learning rate η, then W_2(μ_{D*_η}, D^{T,k}_η) = (E_{x∈D^{T,k}_η}[(x − μ_{D*_η})²])^(1/2) becomes small. Then, we need to connect the error of the mean of the stationary distribution to a distribution that is near this mean.\n\nTheorem 11 Given a cost function c such that ‖c‖_L and ‖∇c‖_L are bounded, and a distribution D such that σ_D is bounded, then, for any v:\n\nE_{w∈D}[c(w)] − min_w c(w) ≤ W_2(v, D) (2 ‖∇c‖_L (c(v) − min_w c(w)))^(1/2) + ‖∇c‖_L (W_2(v, D))² + (c(v) − min_w c(w)).    (8)\n\nThis is proven in Appendix K. The proof is related to the Kantorovich-Rubinstein theorem, and bounds the Lipschitz constant of c near v based on c(v) − min_w c(w). 
At this point, we are ready to state the main theorem:\n\nTheorem 12 If η ≤ η* and T = (ln k − ln η − ln λ)/(2ηλ), then:\n\nE_{w∈D^{T,k}_η}[c(w)] − min_w c(w) ≤ (8ηG²/√(kλ)) ‖∇c‖_L + (8ηG² ‖∇c‖_L)/(kλ) + 2ηG².    (9)\n\nThis is proven in Appendix K.\n\n2.4 Discussion of the Bound\n\nThe guarantee obtained in (9) appears rather unusual insofar as it does not have an explicit dependency on the sample size. This is to be expected, since we obtained a bound in terms of risk minimization on the given corpus rather than a learning bound. Instead, the runtime required depends only on the accuracy of the solution itself.\n\nIn comparison to [2], we look at the number of iterations required to reach accuracy ρ for SGD in Table 2 of [2]. We ignore the effect of the dimensions (such as ν and d), setting these parameters to 1, and assume that the conditioning number is κ = 1/λ and ρ = η. In terms of our bound, we assume G = 1 and ‖∇c‖_L = 1. In order to make our error of order η, we must set k = 1/λ. So, the Bottou paper claims a bound of νκ²/ρ iterations, which we interpret as 1/(ηλ²) time. Modulo logarithmic factors, we require 1/λ machines to run for 1/(ηλ) time, which is the same order of computation, but a dramatic speedup, by a factor of 1/λ, in wall clock time.\n\nAnother important aspect of the algorithm is that it can be made arbitrarily precise: by halving η and roughly doubling T, the error is halved. Also, the bound captures how much parallelization can help: if k > ‖∇c‖_L/(ηλ²), then the last term, 2ηG², will start to dominate.\n\n3 Experiments\n\nData: We performed experiments on a proprietary dataset drawn from a major email system with labels y ∈ {±1} and binary, sparse features. 
The dataset contains 3,189,235 time-stamped instances, out of which the last 681,015 instances are used to form the test set, leaving 2,508,220 training points. We used hashing to compress the features into a 2^18-dimensional space. In total, the dataset contained 785,751,531 features after hashing, which means that each instance has about 313 features on average. Thus, the average sparsity of each data point is 0.0012. All instances have been normalized to unit length for the experiments.\n\nFigure 1: Relative training error with λ = 1e−3: Huber loss (left) and squared error (right)\n\nApproach: In order to evaluate the parallelization ability of the proposed algorithm, we followed this procedure: For each configuration (see below), we trained up to 100 models, each on an independent, random permutation of the full training data. During training, the model is stored on disk after k = 10,000 · 2^i updates. We then averaged the models obtained for each i and evaluated the resulting model. That way, we obtained the performance of the algorithm after each machine has seen k samples. This approach is geared equally towards estimating the parallelization ability of our optimization algorithm and its usefulness for machine learning. This is in contrast to the evaluation approach taken in [7], which focused solely on the machine learning aspect without studying the performance of the optimization approach.\n\nEvaluation measures: We report both the normalized root mean squared error (RMSE) on the test set and the normalized value of the objective function during training. 
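The feature-hashing step described above can be sketched generically; the hash function and the token names below are illustrative assumptions, not the production pipeline:

```python
import zlib

DIM = 2 ** 18   # size of the hashed feature space, as in the experiments

def hash_features(tokens, dim=DIM):
    """Map sparse binary features (string tokens) into a fixed-size index space."""
    vec = {}
    for tok in tokens:
        idx = zlib.crc32(tok.encode()) % dim   # any stable hash works here
        vec[idx] = vec.get(idx, 0.0) + 1.0     # collisions simply add up
    # normalize to unit length, as done for the experiments
    norm = sum(v * v for v in vec.values()) ** 0.5
    return {i: v / norm for i, v in vec.items()}

x = hash_features(["subject:hello", "from:known-sender", "has-attachment"])
print(abs(sum(v * v for v in x.values()) - 1.0) < 1e-9)
```

With roughly 313 active features per instance and 2^18 buckets, hash collisions are rare, which is why this compression loses little predictive information.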
We normalize the RMSE\nsuch that 1.0 is the RMSE obtained by training a model in one single, sequential pass over the data.\nThe objective function values are normalized in much the same way such that the objective function\nvalue of a single, full sequential pass over the data reaches the value 1.0.\nCon\ufb01gurations: We studied both the Huber and the squared error loss. While the latter does not\nsatisfy all the assumptions of our proofs (its gradient is unbounded), it is included due to its popu-\nlarity. We choose to evaluate using two different regularization constants, \u03bb = 1e\u22123 and \u03bb = 1e\u22126\nin order to estimate the performance characteristics both on smooth, \u201ceasy\u201d problems (1e\u22123) and on\nhigh-variance, \u201chard\u201d problems (1e\u22126). In all experiments, we \ufb01xed the learning rate to \u03b7 = 1e\u22123.\n\n3.1 Results and Discussion\n\nOptimization: Figure 1 shows the relative objective function values for training using 1, 10 and\n100 machines with \u03bb = 1e\u22123. In terms of wall clock time, the models obtained on 100 machines\nclearly outperform the ones obtained on 10 machines, which in turn outperform the model trained\non a single machine. There is no signi\ufb01cant difference in behavior between the squared error and\nthe Huber loss in these experiments, despite the fact that the squared error is effectively unbounded.\nThus, the parallelization works in the sense that many machines obtain a better objective function\nvalue after each machine has seen k instances. Additionally, the results also show that data-local\nparallelized training is feasible and bene\ufb01cial with the proposed algorithm in practice. Note that\nthe parallel training needs slightly more machine time to obtain the same objective function value,\nwhich is to be expected. 
Also unsurprising, yet noteworthy, is the trade-off between the number of\nmachines and the quality of the solution: The solution obtained by 10 machines is much more of an\nimprovement over using one machine than using 100 machines is over 10.\nPredictive Performance: Figure 2 shows the relative test RMSE for 1, 10 and 100 machines with\n\u03bb = 1e\u22123. As expected, the results are very similar to the objective function comparison: The\nparallel training decreases wall clock time at the price of slightly higher machine time. Again, the\ngain in performance between 1 and 10 machines is much higher than the one between 10 and 100.\n\n7\n\n\fFigure 2: Relative Test-RMSE with \u03bb = 1e\u22123: Huber loss (left) and squared error (right)\n\nFigure 3: Relative train-error using Huber loss: \u03bb = 1e\u22123 (left), \u03bb = 1e\u22126 (right)\n\nPerformance using different \u03bb: The last experiment is conducted to study the effect of the regu-\nlarization constant \u03bb on the parallelization ability: Figure 3 shows the objective function plot using\nthe Huber loss and \u03bb = 1e\u22123 and \u03bb = 1e\u22126. The lower regularization constant leads to more\nvariance in the problem which in turn should increase the bene\ufb01t of the averaging algorithm. The\nplots exhibit exactly this characteristic: For \u03bb = 1e\u22126, the loss for 10 and 100 machines not only\ndrops faster, but the \ufb01nal solution for both beats the solution found by a single pass, adding further\nempirical evidence for the behaviour predicted by our theory.\n\n4 Conclusion\n\nIn this paper, we propose a novel data-parallel stochastic gradient descent algorithm that enjoys a\nnumber of key properties that make it highly suitable for parallel, large-scale machine learning: It\nimposes very little I/O overhead: Training data is accessed locally and only the model is communi-\ncated at the very end. This also means that the algorithm is indifferent to I/O latency. 
These aspects make the algorithm an ideal candidate for a MapReduce implementation, whereby it inherits the latter's superb data locality and fault tolerance properties. Our analysis of the algorithm's performance is based on a novel technique that uses contraction theory to quantify the finite-sample convergence rate of stochastic gradient descent. We show worst-case bounds that are comparable to stochastic gradient descent in terms of wall clock time, and vastly faster in terms of overall time. Lastly, our experiments on a large-scale real-world dataset show that the parallelization reduces the wall clock time needed to obtain a given solution quality. Unsurprisingly, we also see diminishing marginal utility of adding more machines. Finally, solving problems with more variance (smaller regularization constant) benefits more from the parallelization.\n\nReferences\n\n[1] Shun-ichi Amari. A theory of adaptive pattern classifiers. IEEE Transactions on Electronic Computers, 16:299–307, 1967.\n\n[2] L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems, 2008.\n\n[3] C. T. Chu, S. K. Kim, Y. A. Lin, Y. Y. Yu, G. Bradski, A. Ng, and K. Olukotun. Map-reduce for machine learning on multicore. In B. Schölkopf, J. Platt, and T. Hofmann, editors, Advances in Neural Information Processing Systems 19, 2007.\n\n[4] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. In Conference on Computational Learning Theory, 2010.\n\n[5] J. Langford, A. J. Smola, and M. Zinkevich. Slow learners are fast. In Neural Information Processing Systems, 2009.\n\n[6] J. Langford, A. J. Smola, and M. Zinkevich. Slow learners are fast. arXiv:0911.0491, 2009.\n\n[7] G. Mann, R. McDonald, M. Mohri, N. Silberman, and D. Walker. 
Efficient large-scale distributed training of conditional maximum entropy models. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 1231–1239, 2009.\n\n[8] N. Murata, S. Yoshizawa, and S. Amari. Network information criterion — determining the number of hidden units for artificial neural network models. IEEE Transactions on Neural Networks, 5:865–872, 1994.\n\n[9] Choon Hui Teo, S. V. N. Vishwanathan, Alex J. Smola, and Quoc V. Le. Bundle methods for regularized risk minimization. J. Mach. Learn. Res., 11:311–365, January 2010.\n\n[10] U. von Luxburg and O. Bousquet. Distance-based classification with Lipschitz functions. Journal of Machine Learning Research, 5:669–695, 2004.\n\n[11] M. Zinkevich. Online convex programming and generalised infinitesimal gradient ascent. In Proc. Intl. Conf. Machine Learning, pages 928–936, 2003.\n", "award": [], "sourceid": 1162, "authors": [{"given_name": "Martin", "family_name": "Zinkevich", "institution": null}, {"given_name": "Markus", "family_name": "Weimer", "institution": null}, {"given_name": "Lihong", "family_name": "Li", "institution": null}, {"given_name": "Alex", "family_name": "Smola", "institution": null}]}