{"title": "Slow Learners are Fast", "book": "Advances in Neural Information Processing Systems", "page_first": 2331, "page_last": 2339, "abstract": "Online learning algorithms have impressive convergence properties when it comes to risk minimization and convex games on very large problems. However, they are inherently sequential in their design which prevents them from taking advantage of modern multi-core architectures. In this paper we prove that online learning with delayed updates converges well, thereby facilitating parallel online learning.", "full_text": "Slow Learners are Fast\n\nJohn Langford, Alexander J. Smola, Martin Zinkevich\n\nMachine Learning, Yahoo! Labs and Australian National University\n\n4401 Great America Pky, Santa Clara, 95051 CA\n\n{jl, maz, smola}@yahoo-inc.com\n\nAbstract\n\nOnline learning algorithms have impressive convergence properties when it comes\nto risk minimization and convex games on very large problems. However, they are\ninherently sequential in their design which prevents them from taking advantage\nof modern multi-core architectures. In this paper we prove that online learning\nwith delayed updates converges well, thereby facilitating parallel online learning.\n\n1\n\nIntroduction\n\nOnline learning has become the paradigm of choice for tackling very large scale estimation prob-\nlems. The convergence properties are well understood and have been analyzed in a number of differ-\nent frameworks such as by means of asymptotics [12], game theory [8], or stochastic programming\n[13]. Moreover, learning-theory guarantees show that O(1) passes over a dataset suf\ufb01ce to obtain\noptimal estimates [3, 2]. This suggests that online algorithms are an excellent tool for learning.\nThis view, however, is slightly deceptive for several reasons: current online algorithms process\none instance at a time. That is, they receive the instance, make some prediction, incur a loss, and\nupdate an associated parameter. 
In other words, the algorithms are entirely sequential in their nature. While this is acceptable in single-core processors, it is highly undesirable given that the number of processing elements available to an algorithm is growing exponentially (e.g. modern desktop machines have up to 8 cores, graphics cards up to 1024 cores). It is therefore very wasteful if only one of these cores is actually used for estimation.

A second problem arises from the fact that network and disk I/O have not been able to keep up with the increase in processor speed. A typical network interface has a throughput of 100MB/s and disk arrays have comparable parameters. This means that current algorithms reach their limit at problems of size 1TB whenever the algorithm is I/O bound (this amounts to a training time of 3 hours), or even smaller problems whenever the model parametrization makes the algorithm CPU bound.

Finally, distributed and cloud computing are unsuitable for today's online learning algorithms. This creates a pressing need to design algorithms which break the sequential bottleneck. We propose two variants. To our knowledge, this is the first paper which provides theoretical guarantees combined with empirical evidence for such an algorithm. Previous work, e.g. by [6], proved rather inconclusive in terms of theoretical and empirical guarantees.

In a nutshell, we propose the following two variants: several processing cores perform stochastic gradient descent independently of each other while sharing a common parameter vector which is updated asynchronously. This allows us to accelerate computationally intensive problems whenever gradient computations are relatively expensive. A second variant assumes that we have linear function classes where parts of the function can be computed independently on several cores.
Subsequently the results are combined and the combination is then used for a descent step. A common feature of both algorithms is that the update occurs with some delay: in the first case other cores may have updated the parameter vector in the meantime; in the second case, other cores may have already computed parts of the function for the subsequent examples before an update.

2 Algorithm

2.1 Platforms

We begin with an overview of three platforms which are available for parallelization of algorithms. They differ in their structural parameters, such as synchronization ability, latency, and bandwidth, and consequently they are better suited to different styles of algorithms. The description is not comprehensive by any means. For instance, there exist numerous variants of communication paradigms for distributed and cloud computing.

Shared Memory Architectures: The commercially available 4-16 core CPUs on servers and desktop computers fall into this category. They are general purpose processors which operate on a joint memory space where each of the processors can execute arbitrary pieces of code independently of other processors. Synchronization is easy via shared memory/interrupts/locks. A second example are graphics cards. There the number of processing elements is vastly higher (1024 on high-end consumer graphics cards), although they tend to be bundled into groups of 8 cores (also referred to as multiprocessing elements), each of which can execute a given piece of code in a data-parallel fashion. An issue is that explicit synchronization between multiprocessing elements is difficult — it requires computing kernels on the processing elements to complete.

Clusters: To increase I/O bandwidth one can combine several computers in a cluster using MPI or PVM as the underlying communications mechanism. A clear limit here is bandwidth constraints and latency for inter-computer communication.
On Gigabit Ethernet the TCP/IP latency can be in the order of 100µs, the equivalent of 10^5 clock cycles on a processor, and network bandwidth tends to be a factor 100 slower than memory bandwidth.

Grid Computing: Computational paradigms such as MapReduce [4] and Hadoop are well suited for the parallelization of batch-style algorithms [17]. In comparison to cluster configurations, communication and latency are further constrained. For instance, often individual processing elements are unable to communicate directly with other elements, with disk / network storage being the only mechanism of inter-process data transfer. Moreover, the latency is significantly increased.

We consider only the first two platforms since latency plays a critical role in the analysis of the class of algorithms we propose. While we do not exclude the possibility of devising parallel online algorithms suited to grid computing, we believe that the family of algorithms proposed in this paper is unsuitable and a significantly different synchronization paradigm would be needed.

2.2 Delayed Stochastic Gradient Descent

Many learning problems can be written as convex minimization problems. It is our goal to find some parameter vector x (which is drawn from some Banach space X with associated norm ‖·‖) such that the sum over convex functions fi : X → R takes on the smallest value possible. For instance, (penalized) maximum likelihood estimation in exponential families with fully observed data falls into this category; so do Support Vector Machines and their structured variants. This also applies to distributed games with a communications constraint within a team.

At the outset we make no special assumptions on the order or form of the functions fi. In particular, an adversary may choose to order or generate them in response to our previous choices of x. In other cases, the functions fi may be drawn from some distribution (e.g.
whenever we deal with induced losses). It is our goal to find a sequence of xi such that the cumulative loss Σi fi(xi) is minimized. With some abuse of notation we identify the average empirical and expected loss both by f*. This is possible, simply by redefining p(f) to be the uniform distribution over F. Denote by

f*(x) := (1/|F|) Σi fi(x)  or  f*(x) := E_{f∼p(f)}[f(x)]   (1)

and correspondingly

x* := argmin_{x∈X} f*(x)   (2)

the average risk. We assume that x* exists (convexity does not guarantee a bounded minimizer) and that it satisfies ‖x*‖ ≤ R (this is always achievable, simply by intersecting X with the unit-ball of radius R). We propose the following algorithm:

Algorithm 1 Delayed Stochastic Gradient Descent
  Input: Feasible space X ⊆ R^n, annealing schedule ηt and delay τ ∈ N
  Initialization: set x1, ..., xτ = 0 and compute corresponding gt = ∇ft(xt).
  for t = τ + 1 to T + τ do
    Obtain ft and incur loss ft(xt)
    Compute gt := ∇ft(xt)
    Update xt+1 = argmin_{x∈X} ‖x − (xt − ηt gt−τ)‖ (Gradient Step and Projection)
  end for

Figure 1: Data parallel stochastic gradient descent with shared parameter vector. Observations are partitioned on a per-instance basis among n processing units. Each of them computes its own gradient gt = ∂x ft(xt). Since each computer is updating x in a round-robin fashion, it takes a delay of τ = n − 1 between gradient computation and when the gradients are applied to x.
In this paper the annealing schedule will be either ηt = σ/(t−τ) or ηt = σ/√(t−τ). Often, X = R^n. If we set τ = 0, Algorithm 1 becomes an entirely standard stochastic gradient descent algorithm. The only difference with delayed stochastic gradient descent is that we do not update the parameter vector xt with the current gradient gt but rather with a delayed gradient gt−τ that we computed τ steps previously. We extend this to bounds which are dependent on strong convexity [1, 7] to obtain adaptive algorithms which can take advantage of well-behaved optimization problems in practice. An extension to Bregman divergences is possible. See [11] for details.

2.3 Templates

Asynchronous Optimization: Assume that we have n processors which can process data independently of each other, e.g. in a multicore platform, a graphics card, or a cluster of workstations. Moreover, assume that computing the gradient of ft(x) is at least n times as expensive as it is to update x (read, add, write). This occurs, for instance, in the case of conditional random fields [15, 18], in planning [14], and in ranking [19].

The rationale for delayed updates can be seen in the following setting: assume that we have n cores performing stochastic gradient descent on different instances ft while sharing one common parameter vector x. If we allow each core in a round-robin fashion to update x one at a time then there will be a delay of τ = n − 1 between when we see ft and when we get to update xt+τ. The delay arises since updates by different cores cannot happen simultaneously. This setting is preferable whenever computation of ft itself is time consuming.

Note that there is no need for explicit thread-level synchronization between individual cores. All we need is a read / write-locking mechanism for x or alternatively, atomic updates on the parameter vector.
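As a concrete illustration, Algorithm 1 can be simulated serially in a few lines of Python (a minimal sketch, not the paper's code; the toy quadratic objective, radius R, and all names are illustrative assumptions). A buffer holds the last τ gradients so that the step at time t applies gt−τ, followed by projection onto the ball of radius R:

```python
import numpy as np

def delayed_sgd(grad, x0, T, tau, sigma, R):
    """Sketch of Algorithm 1: gradient step with delay tau, then projection onto ||x|| <= R."""
    x = x0.copy()
    buf = []  # FIFO of gradients; front element is g_{t - tau}
    for t in range(1, T + 1):
        buf.append(grad(x))               # compute g_t at the current iterate
        if len(buf) > tau:                # the delayed gradient g_{t - tau} is now available
            g = buf.pop(0)
            eta = sigma / np.sqrt(t)      # annealing schedule (here sigma / sqrt(t) for simplicity)
            x -= eta * g                  # delayed gradient step
            norm = np.linalg.norm(x)      # projection step of Algorithm 1
            if norm > R:
                x *= R / norm
    return x

# toy strongly convex problem: f(x) = 0.5 * ||x - x_star||^2, so grad f(x) = x - x_star
x_star = np.array([1.0, -1.0])
x = delayed_sgd(lambda x: x - x_star, np.zeros(2), T=5000, tau=5, sigma=0.1, R=10.0)
```

Even with a delay of τ = 5 the iterate ends up close to the minimizer, which is the qualitative behavior the later theorems quantify.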
On a multi-computer cluster we can use a similar mechanism simply by having one server act as a state-keeper which retains an up-to-date copy of x, while the loss-gradient computation clients can retrieve at any time a copy of x and send gradient update messages to the state keeper.

Pipelined Optimization: The key impediment in the previous template is that it required significant amounts of bandwidth solely for the purpose of synchronizing the state vector. This can be addressed by parallelizing computing the function value fi(x) explicitly rather than attempting to compute several instances of fi(x) simultaneously. Such situations occur, e.g. when fi(x) = g(⟨φ(zi), x⟩) for high-dimensional φ(zi). If we decompose the data zi (or its features) over n nodes we can compute partial function values and also all partial updates locally. The only communication required is to combine partial values and to compute gradients with respect to ⟨φ(zi), x⟩.

This causes delay since the second stage is processing results of the first stage while the latter has already moved on to processing ft+1 or further. While the architecture is quite different, the effects are identical: the parameter vector x is updated with some delay τ. Note that here τ can be much smaller than the number of processors and mainly depends on the latency of the communication channel. Also note that in this configuration the memory access for x is entirely local.

Randomization: Order of observations matters for delayed updates: imagine that an adversary, aware of the delay τ, bundles each of the τ most similar instances ft together. In this case we will incur a loss that can be τ times as large as in the non-delayed case and require a learning rate which is τ times smaller.
The reason being that only after seeing τ instances of ft will we be able to respond to the data. Such highly correlated settings do occur in practice: for instance, e-mails or search keywords have significant temporal correlation (holidays, political events, time of day) and cannot be treated as iid data. Randomization of the order of data can be used to alleviate the problem.

3 Lipschitz Continuous Losses

Due to space constraints we only state the results and omit the proofs. A more detailed analysis can be found in [11]. We begin with a simple game theoretic analysis that only requires ft to be convex and where the subdifferentials are bounded, ‖∇ft(x)‖ ≤ L, by some L > 0. Denote by x* the minimizer of f*(x). It is our goal to bound the regret R associated with a sequence X = {x1, ..., xT} of parameters. If all terms are convex we obtain

R[X] := Σ_{t=1}^{T} [ft(xt) − ft(x*)] ≤ Σ_{t=1}^{T} ⟨∇ft(xt), xt − x*⟩ = Σ_{t=1}^{T} ⟨gt, xt − x*⟩.   (3)

Next define a potential function measuring the distance between xt and x*. In the more general analysis this will become a Bregman divergence. We define D(x‖x′) := (1/2)‖x − x′‖². At the heart of our regret bounds is the following which bounds the instantaneous risk at a given time [16]:

Lemma 1 For all x* and for all t > τ, if X = R^n, the following expansion holds:

⟨xt−τ − x*, gt−τ⟩ = (ηt/2)‖gt−τ‖² + [D(x*‖xt) − D(x*‖xt+1)]/ηt + Σ_{j=1}^{min(τ, t−(τ+1))} ηt−j ⟨gt−τ−j, gt−τ⟩

Note that the decomposition of Lemma 1 is very similar to standard regret decomposition bounds, such as [21].
The key difference is that we now have an additional term characterizing the correlation between successive gradients which needs to be bounded. In the worst case, when the gradients are highly correlated, all we can do is bound ⟨gt−τ−j, gt−τ⟩ ≤ L², which yields the following:

Theorem 2 Suppose all the cost functions are Lipschitz continuous with a constant L and max_{x,x′∈X} D(x‖x′) ≤ F². Given ηt = σ/√(t−τ) for some constant σ > 0, the regret of the delayed update algorithm is bounded by

R[X] ≤ σL²√T + (F²/σ)√T + L²στ²/2 + 2L²στ√T   (4)

and consequently for σ² = F²/(2τL²) and T ≥ τ² we obtain the bound

R[X] ≤ 4FL√(τT)   (5)

In other words the algorithm converges at rate O(√(τT)). This is similar to what we would expect in the worst case: an adversary may reorder instances such as to maximally slow down progress. In this case a parallel algorithm is no faster than a sequential code. This result may appear overly pessimistic but the following example shows that such worst-case scaling behavior is to be expected:

Lemma 3 Assume that an optimal online algorithm with regard to a convex game achieves regret R[m] after seeing m instances. Then any algorithm which may only use information that is at least τ instances old has a worst case regret bound of τ R[m/τ].

Our construction works by designing a sequence of functions fi where for a fixed n ∈ N all fnτ+j are identical (for j ∈ {1, ..., τ}). That is, we send identical functions to the algorithm while it has no chance of responding to them.
Hence, even an algorithm knowing that we will see τ identical instances in a row but being disallowed to respond to them for τ instances will do no better than one which sees every instance once but is allowed to respond instantly.

The useful consequence of Theorem 2 is that we are guaranteed to converge at all even if we encounter delay (the latter is not trivial — after all, we could end up with an oscillating parameter vector for overly aggressive learning rates). While such extreme cases hardly occur in practice, we need to make stronger assumptions in terms of correlation of ft and the degree of smoothness in ft to obtain tighter bounds. We conclude this section by studying a particularly convenient case: the setting when the functions fi are strongly convex, satisfying

fi(x*) ≥ fi(x) + ⟨x* − x, ∂x fi(x)⟩ + (λ/2)‖x − x*‖²   (6)

Here we can get rid of the D(x*‖x1) dependency in the loss bound.

Theorem 4 Suppose that the functions fi are strongly convex with parameter λ > 0. Moreover, choose the learning rate ηt = 1/(λ(t−τ)) for t > τ and ηt = 0 for t ≤ τ. Then under the assumptions of Theorem 2 we have the following bound:

R[X] ≤ λτF² + [1/2 + τ] (L²/λ) (1 + τ + log T)   (7)

The key difference is that now we need to take the additional contribution of the gradient correlations into account. As before, we pay a linear price in the delay τ.

4 Decorrelating Gradients

To improve our bounds beyond the most pessimistic case we need to assume that the adversary is not acting in the most hostile fashion possible. In the following we study the opposite case — namely that the adversary is drawing the functions fi iid from an arbitrary (but fixed) distribution.
The key reason for this requirement is that we need to control the value of ⟨gt, gt′⟩ for adjacent gradients. The flavor of the bounds we use will be in terms of the expected regret rather than an actual regret. Conversions from expected to realized regret are standard. See e.g. [13, Lemma 2] for an example of this technique. For this purpose we need to take expectations of sums of copies of the bound of Lemma 1. Note that this is feasible since expectations are linear and whenever products between more than one term occur, they can be seen as products which are conditionally independent given past parameters, such as ⟨gt, gt′⟩ for |t − t′| ≤ τ (in this case no information about gt can be used to infer gt′ or vice versa, given that we already know all the history up to time min(t, t′) − 1).

A key quantity in our analysis is a bound on the correlation between subsequent instances. In some cases we will only be able to obtain bounds on the expected regret rather than the actual regret. For the reasons pointed out in Lemma 3 this is an in-principle limitation of the setting.

Our first strategy is to assume that ft arises from a scalar function of a linear function class. This leads to bounds which, while still bearing a linear penalty in τ, make do with considerably improved constants. The second strategy makes stringent smoothness assumptions on ft, namely it assumes that the gradients themselves are Lipschitz continuous. This will lead to guarantees for which the delay becomes increasingly irrelevant as the algorithm progresses.

4.1 Covariance bounds for linear function classes

Many functions ft(x) depend on x only via an inner product.
They can be expressed as ft(x) = l(yt, ⟨zt, x⟩) and hence gt(x) = ∇ft(x) = zt ∂⟨zt,x⟩ l(yt, ⟨zt, x⟩).

Now assume that

|∂⟨zt,x⟩ l(yt, ⟨zt, x⟩)| ≤ Λ for all x and all t.   (8)

This holds, e.g. in the case of logistic regression, the soft-margin hinge loss, and novelty detection. In all three cases we have Λ = 1. Robust loss functions such as Huber's regression score [9] also satisfy (8), although with a different constant (the latter depends on the level of robustness). For such problems it is possible to bound the correlation between subsequent gradients via the following lemma:

Lemma 5 Denote by (y, z), (y′, z′) ∼ Pr(y, z) random variables which are drawn independently of x, x′ ∈ X. In this case

E_{y,z,y′,z′}[⟨∂x l(y, ⟨z, x⟩), ∂x l(y′, ⟨z′, x′⟩)⟩] ≤ Λ² ‖E_{z,z′}[z′ z⊤]‖_Frob =: L²α   (9)

Here we defined α to be the scaling factor which quantifies by how much gradients are correlated. This yields a tighter version of Theorem 2.

Corollary 6 Given ηt = σ/√(t−τ) and the conditions of Lemma 5, the regret of the delayed update algorithm is bounded by

R[X] ≤ σL²√T + (F²/σ)√T + L²α στ²/2 + 2L²α στ√T   (10)

Hence for σ² = F²/(2ταL²) (assuming that τα ≥ 1) and T ≥ τ² we obtain R[X] ≤ 4FL√(ατT).

4.2 Bounds for smooth gradients

The key to improving the rate, rather than the constant with regard to which the bounds depend on τ, is to impose further smoothness constraints on ft.
The rationale is quite simple: we want to ensure that small changes in x do not lead to large changes in the gradient. This is precisely what we need in order to show that a small delay (which amounts to small changes in x) will not impact the update that is carried out to a significant amount. More specifically we assume that the gradient of f is a Lipschitz-continuous function. That is,

‖∇ft(x) − ∇ft(x′)‖ ≤ H ‖x − x′‖.   (11)

Such a constraint effectively rules out piecewise linear loss functions, such as the hinge loss, structured estimation, or the novelty detection loss. Nonetheless, since this discontinuity only occurs on a set of measure 0, delayed stochastic gradient descent still works very well on them in practice.

Theorem 7 In addition to the conditions of Theorem 2, assume that the functions fi are i.i.d., that H ≥ L√τ/(4F), and that H also upper-bounds the change in the gradients as in Equation 11. Moreover, assume that we choose a learning rate ηt = σ/√(t−τ) with σ = F/L. In this case the risk is bounded by

E[R[X]] ≤ [28.3 F²H + (2/3) FL + (4/3) F²H log T] τ² + (8/3) FL √T.   (12)

Note that the convergence bound, which is O(τ² log T + √T), is governed by two different regimes. Initially, a delay of τ can be quite harmful since subsequent gradients are highly correlated. At a later stage, when optimization becomes increasingly an averaging process, a delay of τ in the updates proves to be essentially harmless.
The key difference to the bounds of Theorem 2 is that now the rate of convergence has improved dramatically and is essentially as good as in sequential online learning. Note that H does not influence the asymptotic convergence properties but it significantly affects the initial convergence properties.

This is exactly what one would expect: initially, while we are far away from the solution x*, parallelism does not help much in providing us with guidance to move towards x*. However, after a number of steps online learning effectively becomes an averaging process for variance reduction around x* since the stepsize is sufficiently small. In this case averaging becomes the dominant force, hence parallelization does not degrade convergence further. Such a setting is desirable — after all, we want to have good convergence for extremely large amounts of data.

4.3 Bounds for smooth gradients with strong convexity

We conclude this section with the tightest of all bounds — the setting where the losses are all strongly convex and smooth. This occurs, for instance, for logistic regression with ℓ2 regularization. Such a requirement implies that the objective function f*(x) is sandwiched between two quadratic functions, hence it is not too surprising that we should be able to obtain rates comparable with what is possible in the minimization of quadratic functions. Also note that the ratio between upper and lower quadratic bound loosely corresponds to the condition number of a quadratic function — the ratio between the largest and smallest eigenvalue of the matrix involved in the optimization problem.

Figure 2: Experiments with simulated delay on the TREC dataset (left) and on a proprietary dataset (right).
In both cases a delay of 10 has no effect on the convergence whatsoever and even a delay of 100 is still quite acceptable.

Figure 3: Time performance on a subset of the TREC dataset which fits into memory, using the quadratic representation. There was either one thread (a serial implementation) or 3 or more threads (master and 2 or more slaves).

Theorem 8 Under the assumptions of Theorem 4, in particular, assuming that all functions fi are i.i.d. and strongly convex with constant λ and corresponding learning rate ηt = 1/(λ(t−τ)), and provided that Equation 11 holds, we have the following bound on the expected regret:

E[R[X]] ≤ (10/9) ( λτF² + [1/2 + τ] (L²/λ) [1 + τ + log(3τ + Hτ/λ)] ) + (L²/(2λ)) [1 + log T] + π²τ²HL²/(6λ²).   (13)

As before, this improves the rate of the bound. Instead of a dependency of the form O(τ log T) we now have the dependency O(τ log τ + log T). This is particularly desirable for large T. We are now within a small factor of what a fully sequential algorithm can achieve. In fact, we could make the constant arbitrarily small for large enough T.

5 Experiments

In our experiments we focused on pipelined optimization. In particular, we used two different training sets that were based on e-mails: the TREC dataset [5], consisting of 75,419 e-mail messages, and a proprietary (significantly harder) dataset of which we took 100,000 e-mails. These e-mails were tokenized by whitespace.
The problem there is one of binary classification where we minimized a 'Huberized' soft-margin loss function

ft(x) = l(yt ⟨zt, x⟩) where l(χ) = 1/2 − χ if χ ≤ 0, (1/2)(χ − 1)² if χ ∈ [0, 1], and 0 otherwise.   (14)

Here yt ∈ {±1} denote the labels of the binary classification problem, and l is the smoothed quadratic soft-margin loss of [10]. We used two feature representations: a linear one which amounted to a simple bag of words representation, and a quadratic one which amounted to generating a bag of word pairs (consecutive or not).

To deal with high-dimensional feature spaces we used hashing [20]. In particular, for the TREC dataset we used 2^18 feature bins and for the proprietary dataset we used 2^24 bins. Note that hashing comes with performance guarantees which state that the canonical distortion due to hashing is sufficiently small for the dimensionality we picked. We tried to address the following issues:

1. The obvious question is a systematic one: how much of a convergence penalty do we incur in practice due to delay. This experiment checks the goodness of our bounds. We checked convergence for a system where the delay is given by τ ∈ {0, 10, 100, 1000}.

2. Secondly, we checked on an actual parallel implementation whether the algorithm scales well.
Unlike the previous check, this includes issues such as memory contention, thread synchronization, and general feasibility of a delayed updating architecture.

Implementation The code was written in Java, although several of the fundamentals were based upon VW [10], that is, hashing and the choice of loss function. We added regularization using lazy updates of the parameter vector (i.e. we rescale the updates and occasionally rescale the parameter). This is akin to Leon Bottou's SGD code. For robustness, we used ηt = 1/√t.

All timed experiments were run on a single, 8 core machine with 32 GB of memory. In general, at least 6 of the cores were free at any given time. In order to achieve advantages of parallelization, we divide the feature space {1 ... n} into roughly equal pieces, and assign a slave thread to each piece. Each slave is given both the weights for its pieces, as well as the corresponding pieces of the examples. The master is given the label of each example. We compute the dot product separately on each piece, and then send these results to a master. The master adds the pieces together, calculates the update, and then sends that back to the slaves. Then, the slaves update their weight vectors in proportion to the magnitude of the central classifier. What makes this work quickly is that there are multiple examples in flight through this dataflow simultaneously. Note that between the time when a dot product is calculated for an example and when the results have been transcribed, the weight vector has been updated with several other earlier examples and the dot products have been calculated from several later examples. As a safeguard we limited the maximum delay to 100 examples. In this case the compute slave would simply wait for the pipeline to clear.

The first experiment that we ran was a simulation where we artificially added a delay between the update and the product (Figure 2a).
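The master/slave dataflow described in the implementation can be sketched synchronously, ignoring the in-flight pipelining (a hypothetical Python illustration, not the paper's Java code; the shard count, toy data, and all names are assumptions). Each slave holds one shard of the weight vector and of the features; only scalars cross the master/slave boundary:

```python
import numpy as np

def huberized_loss_grad(chi):
    """Derivative of the smoothed soft-margin loss l from Eq. (14) with respect to chi."""
    if chi <= 0:
        return -1.0
    if chi <= 1:
        return chi - 1.0
    return 0.0

def sharded_sgd_step(shards_w, shards_z, y, eta):
    """One update: slaves compute partial dot products on their feature shards,
    the master combines them into a scalar, and slaves update their own shards."""
    partials = [w @ z for w, z in zip(shards_w, shards_z)]  # slaves: partial <z, x>
    margin = y * sum(partials)                              # master: chi = y_t <z_t, x>
    factor = -eta * y * huberized_loss_grad(margin)         # master: scalar sent back
    for w, z in zip(shards_w, shards_z):                    # slaves: local updates only
        w += factor * z

# toy run: 4 shards, linearly separable data (all quantities made up for the sketch)
rng = np.random.default_rng(0)
n_shards, dim = 4, 32
w_star = rng.normal(size=dim)
shards_w = [np.zeros(dim // n_shards) for _ in range(n_shards)]
for t in range(1, 2001):
    z = rng.normal(size=dim)
    y = 1.0 if z @ w_star > 0 else -1.0
    sharded_sgd_step(shards_w, np.split(z, n_shards), y, eta=1.0 / np.sqrt(t))
w = np.concatenate(shards_w)
```

In the actual pipelined system several examples are in flight at once, so each update lands on a weight vector that is a few examples stale; this sketch is the τ = 0 limit of that dataflow.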
We ran this experiment using linear features, and observed that the performance did not noticeably degrade with a delay of 10 examples, did not significantly degrade with a delay of 100, but with a delay of 1000, the performance became much worse.

The second experiment that we ran was with a proprietary dataset (Figure 2b). In this case, the delays hurt less; we conjecture that this was because the information gained from each example was smaller. In fact, even a delay of 1000 does not result in particularly bad performance.

Since even the sequential version already handled 150,000 examples per second, we tested parallelization only for quadratic features, where throughput would be in the order of 1000 examples per second. Here parallelization dramatically improved performance — see Figure 3. To control for disk access we loaded a subset of the data into memory and carried out the algorithm on it.

Summary and Discussion

The type of updates we presented is a rather natural one. However, intuitively, having a delay of τ is like having a learning rate that is τ times larger. In this paper, we have shown theoretically how independence between examples can make the actual effect much smaller.

The experimental results showed three important aspects: first of all, small simulated delayed updates do not hurt much, and in harder problems they hurt less; secondly, in practice it is hard to speed up “easy” problems with a small amount of computation, such as e-mails with linear features; finally, when examples are larger or harder, the speedups can be quite dramatic.

References

[1] Peter L. Bartlett, Elad Hazan, and Alexander Rakhlin. Adaptive online gradient descent. In J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, Cambridge, MA, 2008. MIT Press.

[2] L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In J. C.
Platt, D. Koller, Y. Singer, and S.T. Roweis, editors, NIPS. MIT Press, 2007.

[3] Léon Bottou and Yann LeCun. Large scale online learning. In S. Thrun, L. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16, pages 217–224, Cambridge, MA, 2004. MIT Press.

[4] C.T. Chu, S.K. Kim, Y.A. Lin, Y.Y. Yu, G. Bradski, A. Ng, and K. Olukotun. Map-reduce for machine learning on multicore. In B. Schölkopf, J. Platt, and T. Hofmann, editors, Advances in Neural Information Processing Systems 19, 2007.

[5] G. Cormack. TREC 2007 spam track overview. In The Sixteenth Text REtrieval Conference (TREC 2007) Proceedings, 2007.

[6] O. Delalleau and Y. Bengio. Parallel stochastic gradient descent, 2007. CIAR Summer School, Toronto.

[7] C.B. Do, Q.V. Le, and C.-S. Foo. Proximal regularization for online and batch learning. In A.P. Danyluk, L. Bottou, and M.L. Littman, editors, Proceedings of the 26th Annual International Conference on Machine Learning, ICML 2009, Montreal, Quebec, Canada, June 14-18, 2009, volume 382, page 33. ACM, 2009.

[8] Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169–192, 2007.

[9] P.J. Huber. Robust Statistics. John Wiley and Sons, New York, 1981.

[10] J. Langford, L. Li, and A. Strehl. Vowpal Wabbit online learning project, 2007. http://hunch.net/?p=309.

[11] J. Langford, A.J. Smola, and M. Zinkevich. Slow learners are fast. arXiv:0911.0491.

[12] N. Murata, S. Yoshizawa, and S. Amari. Network information criterion — determining the number of hidden units for artificial neural network models. IEEE Transactions on Neural Networks, 5:865–872, 1994.

[13] Y. Nesterov and J.-P. Vial. Confidence level solutions for stochastic programming. Technical Report 2000/13, Université Catholique de Louvain - Center for Operations Research and Economics, 2000.

[14] N. Ratliff, J. Bagnell, and M. Zinkevich. Maximum margin planning. In International Conference on Machine Learning, July 2006.

[15] N. Ratliff, J. Bagnell, and M. Zinkevich. (Online) subgradient methods for structured prediction. In Eleventh International Conference on Artificial Intelligence and Statistics (AIStats), March 2007.

[16] Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. In Proc. Intl. Conf. Machine Learning, 2007.

[17] Choon Hui Teo, S.V.N. Vishwanathan, Alex J. Smola, and Quoc V. Le. Bundle methods for regularized risk minimization. J. Mach. Learn. Res., 2009. Submitted in February 2009.

[18] S.V.N. Vishwanathan, Nicol N. Schraudolph, Mark Schmidt, and Kevin Murphy. Accelerated training of conditional random fields with stochastic gradient methods. In Proc. Intl. Conf. Machine Learning, pages 969–976, New York, NY, USA, 2006. ACM Press.

[19] M. Weimer, A. Karatzoglou, Q. Le, and A. Smola. CofiRank - maximum margin matrix factorization for collaborative ranking. In J.C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20. MIT Press, Cambridge, MA, 2008.

[20] K. Weinberger, A. Dasgupta, J. Attenberg, J. Langford, and A.J. Smola. Feature hashing for large scale multitask learning. In L. Bottou and M. Littman, editors, International Conference on Machine Learning, 2009.

[21] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proc. Intl. Conf.
Machine Learning, pages 928–936, 2003.