{"title": "More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server", "book": "Advances in Neural Information Processing Systems", "page_first": 1223, "page_last": 1231, "abstract": "We propose a parameter server system for distributed ML, which follows a Stale Synchronous Parallel (SSP) model of computation that maximizes the time computational workers spend doing useful work on ML algorithms, while still providing correctness guarantees. The parameter server provides an easy-to-use shared interface for read/write access to an ML model's values (parameters and variables), and the SSP model allows distributed workers to read older, stale versions of these values from a local cache, instead of waiting to get them from a central storage. This significantly increases the proportion of time workers spend computing, as opposed to waiting. Furthermore, the SSP model ensures ML algorithm correctness by limiting the maximum age of the stale values. We provide a proof of correctness under SSP, as well as empirical results demonstrating that the SSP model achieves faster algorithm convergence on several different ML problems, compared to fully-synchronous and asynchronous schemes.", "full_text": "More Effective Distributed ML via a Stale\nSynchronous Parallel Parameter Server\n\n\u2020Qirong Ho, \u2020James Cipar, \u00a7Henggang Cui, \u2020Jin Kyu Kim, \u2020Seunghak Lee,\n\u2021Phillip B. Gibbons, \u2020Garth A. Gibson, \u00a7Gregory R. Ganger, \u2020Eric P. 
Xing
†School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213
qho@, jcipar@, jinkyuk@, seunghak@, garth@, epxing@cs.cmu.edu
§Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213
hengganc@, ganger@ece.cmu.edu
‡Intel Labs, Pittsburgh, PA 15213
phillip.b.gibbons@intel.com

Abstract

We propose a parameter server system for distributed ML, which follows a Stale Synchronous Parallel (SSP) model of computation that maximizes the time computational workers spend doing useful work on ML algorithms, while still providing correctness guarantees. The parameter server provides an easy-to-use shared interface for read/write access to an ML model's values (parameters and variables), and the SSP model allows distributed workers to read older, stale versions of these values from a local cache, instead of waiting to get them from a central storage. This significantly increases the proportion of time workers spend computing, as opposed to waiting. Furthermore, the SSP model ensures ML algorithm correctness by limiting the maximum age of the stale values. We provide a proof of correctness under SSP, as well as empirical results demonstrating that the SSP model achieves faster algorithm convergence on several different ML problems, compared to fully-synchronous and asynchronous schemes.

1 Introduction

Modern applications awaiting next-generation machine intelligence systems have posed unprecedented scalability challenges. These scalability needs arise from at least two aspects: 1) massive data volume, such as societal-scale social graphs [10, 25] with up to hundreds of millions of nodes; and 2) massive model size, such as the Google Brain deep neural network [9] containing billions of parameters. 
Although there exist means and theories to support reductionist approaches like subsam-\npling data or using small models, there is an imperative need for sound and effective distributed ML\nmethodologies for users who cannot be well-served by such shortcuts. Recent efforts towards dis-\ntributed ML have made signi\ufb01cant advancements in two directions: (1) Leveraging existing common\nbut simple distributed systems to implement parallel versions of a limited selection of ML models,\nthat can be shown to have strong theoretical guarantees under parallelization schemes such as cyclic\ndelay [17, 1], model pre-partitioning [12], lock-free updates [21], bulk synchronous parallel [5], or\neven no synchronization [28] \u2014 these schemes are simple to implement but may under-exploit the\nfull computing power of a distributed cluster. (2) Building high-throughput distributed ML architec-\ntures or algorithm implementations that feature signi\ufb01cant systems contributions but relatively less\ntheoretical analysis, such as GraphLab [18], Spark [27], Pregel [19], and YahooLDA [2].\nWhile the aforementioned works are signi\ufb01cant contributions in their own right, a naturally desirable\ngoal for distributed ML is to pursue a system that (1) can maximally unleash the combined compu-\ntational power in a cluster of any given size (by spending more time doing useful computation and\nless time waiting for communication), (2) supports inference for a broad collection of ML methods,\nand (3) enjoys correctness guarantees. In this paper, we explore a path to such a system using the\n\n1\n\n\fidea of a parameter server [22, 2], which we de\ufb01ne as the combination of a shared key-value store\nthat provides a centralized storage model (which may be implemented in a distributed fashion) with\na synchronization model for reading/updating model values. 
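As a minimal sketch of this abstraction (illustrative Python only, with hypothetical names; this is not the system built in this paper): a parameter server is, at its core, a shared key-value store whose values accept associative, commutative updates, so the current value of a parameter is simply the sum of all updates applied to it.

```python
# Minimal sketch of the parameter-server abstraction: a shared key-value
# store accepting additive (associative, commutative) updates.
# All names here are illustrative, not the paper's actual implementation.

class ParameterServer:
    def __init__(self):
        self._store = {}                      # key -> current parameter value

    def update(self, key, delta):
        """Apply an additive update; application order does not matter."""
        self._store[key] = self._store.get(key, 0.0) + delta

    def read(self, key):
        """Return the current (fully up-to-date) value."""
        return self._store.get(key, 0.0)

# Two workers contribute updates to the same parameter:
ps = ParameterServer()
ps.update("theta", 1.0)    # worker 1
ps.update("theta", 2.5)    # worker 2
assert ps.read("theta") == 3.5   # true value = sum of all updates
```

The synchronization model then governs how stale a `read` is allowed to be, which is the subject of the rest of the paper.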
The key-value store provides easy-to-program read/write access to shared parameters needed by all workers, and the synchronization model maximizes the time each worker spends on useful computation (versus communication with the server) while still providing algorithm correctness guarantees.
Towards this end, we propose a parameter server using a Stale Synchronous Parallel (SSP) model of computation, for distributed ML algorithms that are parallelized into many computational workers (technically, threads) spread over many machines. In SSP, workers can make updates δ to a parameter1 θ, where the updates follow an associative, commutative form θ ← θ + δ. Hence, the current true value of θ is just the sum over updates δ from all workers. When a worker asks for θ, the SSP model will give it a stale (i.e. delayed) version of θ that excludes recent updates δ. More formally, a worker reading θ at iteration c will see the effects of all δ from iteration 0 to c − s − 1, where s ≥ 0 is a user-controlled staleness threshold. In addition, the worker may get to see some recent updates δ beyond iteration c − s − 1. The idea is that SSP systems should deliver as many updates δ as possible, without missing any updates older than a given age — a concept referred to as bounded staleness [24]. The practical effect of this is twofold: (1) workers can perform more computation instead of waiting for other workers to finish, and (2) workers spend less time communicating with the parameter server, and more time doing useful computation. 
Bounded staleness distinguishes SSP from cyclic-delay systems [17, 1] (where θ is read with inflexible staleness), Bulk Synchronous Parallel (BSP) systems like Hadoop (workers must wait for each other at the end of every iteration), or completely asynchronous systems [2] (workers never wait, but θ has no staleness guarantees).
We implement an SSP parameter server with a table-based interface, called SSPtable, that supports a wide range of distributed ML algorithms for many models and applications. SSPtable itself can also be run in a distributed fashion, in order to (a) increase performance, or (b) support applications where the parameters θ are too large to fit on one machine. Moreover, SSPtable takes advantage of bounded staleness to maximize ML algorithm performance, by reading the parameters θ from caches on the worker machines whenever possible, and only reading θ from the parameter server when the SSP model requires it. Thus, workers (1) spend less time waiting for each other, and (2) spend less time communicating with the parameter server. Furthermore, we show that SSPtable (3) helps slow, straggling workers to catch up, providing a systems-based solution to the "last reducer" problem on systems like Hadoop (while we note that theory-based solutions are also possible). SSPtable can be run on multiple server machines (called "shards"), thus dividing its workload over the cluster; in this manner, SSPtable can (4) service more workers simultaneously, and (5) support very large models that cannot fit on a single machine. Finally, the SSPtable server program can also be run on worker machines, which (6) provides a simple but effective strategy for allocating machines between workers and the parameter server.
Our theoretical analysis shows that (1) SSP generalizes the bulk synchronous parallel (BSP) model, and that (2) stochastic gradient algorithms (e.g. 
for matrix factorization or topic models) under SSP not only converge, but do so at least as fast as cyclic-delay systems [17, 1] (and potentially even faster depending on implementation). Furthermore, our implementation of SSP, SSPtable, supports a wide variety of algorithms and models, and we demonstrate it on several popular ones: (a) Matrix Factorization with stochastic gradient descent [12], (b) Topic Modeling with collapsed Gibbs sampling [2], and (c) Lasso regression with parallelized coordinate descent [5]. Our experimental results show that, for these 3 models and algorithms, (i) SSP yields faster convergence than BSP (up to several times faster), and (ii) SSP yields faster convergence than a fully asynchronous (i.e. no staleness guarantee) system. We explain SSPtable's better performance in terms of algorithm progress per iteration (quality) and iterations executed per unit time (quantity), and show that SSPtable hits a "sweet spot" between quality and quantity that is missed by BSP and fully asynchronous systems.
2 Stale Synchronous Parallel Model of Computation
We begin with an informal explanation of SSP: assume a collection of P workers, each of which makes additive updates to a shared parameter x ← x + u at regular intervals called clocks. Clocks are similar to iterations, and represent some unit of progress by an ML algorithm. 
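As a toy illustration of bounded staleness (hypothetical code, not SSPtable itself): a reader at clock c with staleness threshold s is guaranteed every update timestamped at most c − s − 1, and may additionally see newer ones; with s = 0 this collapses to the BSP guarantee of seeing everything up to clock c − 1.

```python
# Illustrative simulation of the bounded-staleness read guarantee
# (hypothetical code): a worker at clock c is guaranteed every update
# timestamped <= c - s - 1, and may or may not see newer updates.

def guaranteed_visible(updates, c, s):
    """updates: list of (timestamp, value); return the guaranteed sum."""
    return sum(v for (t, v) in updates if t <= c - s - 1)

updates = [(0, 1.0), (1, 2.0), (2, 4.0), (3, 8.0)]
# With staleness s = 1, a reader at clock c = 3 is guaranteed timestamps <= 1:
assert guaranteed_visible(updates, c=3, s=1) == 3.0
# With s = 0 (the BSP case), the same reader sees everything up to clock 2:
assert guaranteed_visible(updates, c=3, s=0) == 7.0
```

Anything newer than the guaranteed window is "best effort": an SSP system may deliver it, but the algorithm cannot rely on it.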
Every worker has its own integer-valued clock c, and workers only commit their updates at the end of each clock. Updates may not be immediately visible to other workers trying to read x — in other words, workers only see effects from a "stale" subset of updates. The idea is that, with staleness, workers can retrieve updates from caches on the same machine (fast) instead of querying the parameter server over the network (slow). Given a user-chosen staleness threshold s ≥ 0, SSP enforces the following bounded staleness conditions (see Figure 1 for a graphical illustration):
• The slowest and fastest workers must be ≤ s clocks apart — otherwise, the fastest worker is forced to wait for the slowest worker to catch up.
• When a worker with clock c commits an update u, that u is timestamped with time c.
• When a worker with clock c reads x, it will always see effects from all u with timestamp ≤ c − s − 1. It may also see some u with timestamp > c − s − 1 from other workers.
• Read-my-writes: A worker p will always see the effects of its own updates u_p.
Since the fastest and slowest workers are ≤ s clocks apart, a worker reading x at clock c will see all updates with timestamps in [0, c − s − 1], plus a (possibly empty) "adaptive" subset of updates in the range [c − s, c + s − 1]. Note that when s = 0, the "guaranteed" range becomes [0, c − 1] while the adaptive range becomes empty, which is exactly the Bulk Synchronous Parallel model of computation.

1 For example, the parameter θ might be the topic-word distributions in LDA, or the factor matrices in a matrix decomposition, while the updates δ could be adding or removing counts to topic-word or document-word tables in LDA, or stochastic gradient steps in a matrix decomposition.
Let us look at how SSP applies to an example ML algorithm.
2.1 An example: Stochastic Gradient Descent for Matrix Problems
The Stochastic Gradient Descent (SGD) [17, 12] algorithm optimizes an objective function by applying gradient descent to random subsets of the data. Consider a matrix completion task, which involves decomposing an N × M matrix D into two low-rank matrices LR ≈ D, where L, R have sizes N × K and K × M (for a user-specified K). The data matrix D may have missing entries, corresponding to missing data. Concretely, D could be a matrix of users against products, with D_ij representing user i's rating of product j. Because users do not rate all possible products, the goal is to predict ratings for missing entries D_ab given known entries D_ij. If we found low-rank matrices L, R such that L_i· · R_·j ≈ D_ij for all known entries D_ij, we could then predict D_ab = L_a· · R_·b for unknown entries D_ab.
To perform the decomposition, let us minimize the squared difference between each known entry D_ij and its prediction L_i· · R_·j (note that other loss functions and regularizers are also possible):

\min_{L,R} \sum_{(i,j) \in \mathrm{Data}} \Big( D_{ij} - \sum_{k=1}^{K} L_{ik} R_{kj} \Big)^2 . \qquad (1)

Figure 1: Bounded Staleness under the SSP Model

As a first step towards SGD, consider solving Eq (1) using coordinate gradient descent on L, R:

\frac{\partial O^{MF}}{\partial L_{ik}} = \sum_{(a,b) \in \mathrm{Data}} \delta(a = i) \big[ -2 D_{ab} R_{kb} + 2 L_{a\cdot} R_{\cdot b} R_{kb} \big], \qquad
\frac{\partial O^{MF}}{\partial R_{kj}} = \sum_{(a,b) \in \mathrm{Data}} \delta(b = j) \big[ -2 D_{ab} L_{ak} + 2 L_{a\cdot} R_{\cdot b} L_{ak} \big],

where O^{MF} is the objective in Eq (1), and δ(a = i) equals 1 if a = i, and 0 otherwise. This can be transformed into an SGD algorithm by replacing the full sum over entries (a, b) with a subsample (with appropriate reweighting). The entries D_ab can then be distributed over multiple workers, and their gradients computed in parallel [12].
We assume that D is "tall", i.e. 
N > M (or transpose D so this is true), and partition the rows of D and L over the processors. Only R needs to be shared among all processors, so we let it be the SSP shared parameter x := R. SSP allows many workers to read/write to R with minimal waiting, though the workers will only see stale values of R. This tradeoff is beneficial because without staleness, the workers must wait for a long time when reading R from the server (as our experiments will show). While having stale values of R decreases convergence progress per iteration, SSP more than makes up by enabling significantly more iterations per minute, compared to fully synchronous systems. Thus, SSP yields more convergence progress per minute, i.e. faster convergence.

[Figure 1 plots worker progress against clocks 0–9 for four workers under staleness threshold 3, distinguishing updates visible to all workers, updates visible to Worker 1 due to read-my-writes, and updates not necessarily visible to Worker 1; Worker 1 must wait on further reads until Worker 2 has reached clock 4.]

[Figure 2 depicts a client process containing application threads with per-thread caches, a process cache, and pending requests, communicating with multiple table servers that hold the table data.]

Figure 2: Cache structure of SSPtable, with multiple server shards

Note that SSP is not limited to stochastic gradient matrix algorithms: it can also be applied to parallel collapsed sampling on topic models [2] (by storing the word-topic and document-topic tables in x), parallel coordinate descent on Lasso regression [5] (by storing the regression coefficients in x), as well as any other parallel algorithm or model with shared parameters that all workers need read/write access to. 
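To make the update equations of Section 2.1 concrete, here is a single-machine SGD sketch for the matrix completion objective in Eq (1), in plain NumPy. It is a sketch only: the distributed version additionally partitions the entries of D over workers and keeps R in the parameter server, and all function names here are illustrative.

```python
import numpy as np

# Single-machine SGD sketch for Eq (1): for each known entry D_ij, take a
# gradient step on row L_i and column R_j. Gradient signs follow the
# coordinate derivatives in the text: d/dL_ik = -2 * err * R_kj, etc.

def sgd_step(D_entries, L, R, eta):
    """D_entries: list of (i, j, D_ij) known entries; updates L, R in place."""
    for (i, j, d) in D_entries:
        err = d - L[i, :] @ R[:, j]          # residual for this entry
        grad_Li = -2.0 * err * R[:, j]       # gradient w.r.t. row L_i
        grad_Rj = -2.0 * err * L[i, :]       # gradient w.r.t. column R_j
        L[i, :] -= eta * grad_Li
        R[:, j] -= eta * grad_Rj
    return L, R
```

Repeatedly calling `sgd_step` over the known entries drives the squared reconstruction error of Eq (1) down; under SSP, each worker would run the same inner loop against a cached, possibly stale copy of R.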
Our experiments will show that SSP performs better than bulk synchronous parallel and asynchronous systems for matrix completion, topic modeling and Lasso regression.
3 SSPtable: an Efficient SSP System
An ideal SSP implementation would fully exploit the leeway granted by the SSP model's bounded staleness property, in order to balance the time workers spend waiting on reads with the need for freshness in the shared data. This section describes our initial implementation of SSPtable, which is a parameter server conforming to the SSP model, and that can be run on many server machines at once (distributed). Our experiments with this SSPtable implementation show that SSP can indeed improve convergence rates for several ML models and algorithms, while additional tuning of cache management policies could further improve the performance of SSPtable.
SSPtable follows a distributed client-server architecture. Clients access shared parameters using a client library, which maintains a machine-wide process cache and optional per-thread2 thread caches (Figure 2); the latter are useful for improving performance, by reducing inter-thread synchronization (which forces workers to wait) when a client ML program executes multiple worker threads on each of multiple cores of a client machine. The server parameter state is divided (sharded) over multiple server machines, and a normal configuration would include a server process on each of the client machines. Programming with SSPtable follows a simple table-based API for reading/writing to shared parameters x (for example, the matrix R in the SGD example of Section 2.1):
• Table Organization: SSPtable supports an unlimited number of tables, which are divided into rows, which are further subdivided into elements. These tables are used to store x.
• read_row(table,row,s): Retrieve a table-row with staleness threshold s. 
The user can then query individual row elements.
• inc(table,row,el,val): Increase a table-row-element by val, which can be negative. These changes are not propagated to the servers until the next call to clock().
• clock(): Inform all servers that the current thread/processor has completed one clock, and commit all outstanding inc()s to the servers.
Any number of read_row() and inc() calls can be made in-between calls to clock(). Different thread workers are permitted to be at different clocks; however, bounded staleness requires that the fastest and slowest threads be no more than s clocks apart. In this situation, SSPtable forces the fastest thread to block (i.e. wait) on calls to read_row(), until the slowest thread has caught up. To maintain the "read-my-writes" property, we use a write-back policy: all writes are immediately committed to the thread caches, and are flushed to the process cache and servers upon clock().
To maintain bounded staleness while minimizing wait times on read_row() operations, SSPtable uses the following cache protocol: Let every table-row in a thread or process cache be endowed with a clock r_thread or r_proc respectively. Let every thread worker be endowed with a clock c, equal to the number of times it has called clock(). Finally, define the server clock c_server to be the minimum over all thread clocks c. When a thread with clock c requests a table-row, it first checks its thread cache. If the row is cached with clock r_thread ≥ c − s, then it reads the row. Otherwise, it checks the process cache next — if the row is cached with clock r_proc ≥ c − s, then it reads the row. At this point, no network traffic has been incurred yet. However, if both caches miss, then a network request is sent to the server (which forces the thread to wait for a reply). The server returns its view of the table-row as well as the clock c_server. 
Because the fastest and slowest threads can be no more than s clocks apart, and because a thread's updates are sent to the server whenever it calls clock(), the returned server view always satisfies the bounded staleness requirements for the asking thread. After fetching a row from the server, the corresponding entry in the thread/process caches and the clocks r_thread, r_proc are then overwritten with the server view and clock c_server.
A beneficial consequence of this cache protocol is that the slowest thread only performs costly server reads every s clocks. Faster threads may perform server reads more frequently, and as frequently as every clock if they are consistently waiting for the slowest thread's updates. This distinction in work per thread does not occur in BSP, wherein every thread must read from the server on every clock. Thus, SSP not only reduces overall network traffic (thus reducing wait times for all server reads), but also allows slow, straggler threads to avoid server reads in some iterations. Hence, the slow threads naturally catch up — in turn allowing fast threads to proceed instead of waiting for them. In this manner, SSP maximizes the time each machine spends on useful computation, rather than waiting.

2 We assume that every computation thread corresponds to one ML algorithm worker.

4 Theoretical Analysis of SSP
Formally, the SSP model supports operations x ← x ⊕ (z · y), where x, y are members of a ring with an abelian operator ⊕ (such as addition), and a multiplication operator · such that z · y = y′ where y′ is also in the ring. 
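As a toy check of this abstraction (illustrative code, with ⊕ taken as vector addition): because the updates commute and associate, any application order yields the same final state, which is exactly the property that lets SSP apply delayed updates whenever they arrive.

```python
import itertools

# Toy check that updates of the form x <- x + (z * y) over real vectors
# commute: every permutation of the update sequence yields the same state.
# (Illustrative code; values chosen so floating-point addition is exact.)

def apply_updates(x0, updates, order):
    x = list(x0)
    for idx in order:
        z, y = updates[idx]
        x = [xi + z * yi for xi, yi in zip(x, y)]
    return x

x0 = [0.0, 0.0]
updates = [(1.0, [1.0, 2.0]), (0.5, [4.0, 0.0]), (-2.0, [0.0, 1.0])]
results = {tuple(apply_updates(x0, updates, p))
           for p in itertools.permutations(range(len(updates)))}
assert len(results) == 1      # every ordering produces the same state
```

This order-independence is what makes the noisy views of Section 4 well-defined: a worker's state is determined by *which* updates it has seen, not by when they arrived.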
In the context of ML, we shall focus on addition and multiplication over real vectors x, y and scalar coefficients z, i.e. x ← x + (zy); such operations can be found in the update equations of many ML inference algorithms, such as gradient descent [12], coordinate descent [5] and collapsed Gibbs sampling [2]. In what follows, we shall informally refer to x as the "system state", u = zy as an "update", and to the operation x ← x + u as "writing an update".
We assume that P workers write updates at regular time intervals (referred to as "clocks"). Let u_{p,c} be the update written by worker p at clock c through the write operation x ← x + u_{p,c}. The updates u_{p,c} are a function of the system state x, and under the SSP model, different workers will "see" different, noisy versions of the true state x. Let \tilde{x}_{p,c} be the noisy state read by worker p at clock c, implying that u_{p,c} = G(\tilde{x}_{p,c}) for some function G. We now formally re-state bounded staleness, which is the key SSP condition that bounds the possible values \tilde{x}_{p,c} can take:
SSP Condition (Bounded Staleness): Fix a staleness s. Then, the noisy state \tilde{x}_{p,c} is equal to

\tilde{x}_{p,c} = x_0 + \underbrace{\Big[ \sum_{c'=1}^{c-s-1} \sum_{p'=1}^{P} u_{p',c'} \Big]}_{\text{guaranteed pre-window updates}} + \underbrace{\Big[ \sum_{c'=c-s}^{c-1} u_{p,c'} \Big]}_{\text{guaranteed read-my-writes updates}} + \underbrace{\Big[ \sum_{(p',c') \in S_{p,c}} u_{p',c'} \Big]}_{\text{best-effort in-window updates}}, \qquad (2)

where S_{p,c} ⊆ W_{p,c} = ([1, P] \ {p}) × [c − s, c + s − 1] is some subset of the updates u written in the width-2s "window" W_{p,c}, which ranges from clock c − s to c + s − 1 and does not include updates from worker p. In other words, the noisy state \tilde{x}_{p,c} consists of three parts:
1. Guaranteed "pre-window" updates from clock 0 to c − s − 1, over all workers.
2. Guaranteed "read-my-writes" set {(p, c − s), . . .
 , (p, c − 1)} that covers all "in-window" updates made by the querying worker3 p.
3. Best-effort "in-window" updates S_{p,c} from the width-2s window4 [c − s, c + s − 1] (not counting updates from worker p). An SSP implementation should try to deliver as many updates from S_{p,c} as possible, but may choose not to depending on conditions.
Notice that S_{p,c} is specific to worker p at clock c; other workers at different clocks will observe different S. Also, observe that SSP generalizes the Bulk Synchronous Parallel (BSP) model:
BSP Corollary: Under zero staleness s = 0, SSP reduces to BSP. Proof: s = 0 implies the window [c − s, c + s − 1] = [c, c − 1] = ∅, and therefore \tilde{x}_{p,c} exactly consists of all updates until clock c − 1. □
Our key tool for convergence analysis is to define a reference sequence of states x_t, informally referred to as the "true" sequence (this is different and unrelated to the SSPtable server's view):

x_t = x_0 + \sum_{t'=0}^{t} u_{t'}, \quad \text{where } u_{t'} := u_{t' \bmod P,\, \lfloor t'/P \rfloor}.

In other words, we sum updates by first looping over workers (t mod P), then over clocks ⌊t/P⌋. We can now bound the difference between the "true" sequence x_t and the noisy views \tilde{x}_{p,c}:

3 This is a "read-my-writes" or self-synchronization property, i.e. workers will always see any updates they make. Having such a property makes sense because self-synchronization does not incur a network cost.
4 The width 2s is only an upper bound for the slowest worker. The fastest worker with clock c_max has a width-s window [c_max − s, c_max − 1], simply because no updates for clocks ≥ c_max have been written yet.

Lemma 1: Assume s ≥ 1, and let \tilde{x}_t := \tilde{x}_{t \bmod P,\, \lfloor t/P \rfloor}, so that

\tilde{x}_t = x_t - \underbrace{\Big[ \sum_{i \in A_t} u_i \Big]}_{\text{missing updates}} + \underbrace{\Big[ \sum_{i \in B_t} u_i \Big]}_{\text{extra updates}}, \qquad (3)

where we have decomposed the difference between \tilde{x}_t and x_t into A_t, the index set of updates u_i that are missing from \tilde{x}_t (w.r.t. 
x_t), and B_t, the index set of "extra" updates in \tilde{x}_t but not in x_t. We then claim that |A_t| + |B_t| ≤ 2s(P − 1), and furthermore, min(A_t ∪ B_t) ≥ max(1, t − (s + 1)P), and max(A_t ∪ B_t) ≤ t + sP.
Proof: Comparing Eq. (3) with (2), we see that the extra updates obey B_t ⊆ S_{t mod P, ⌊t/P⌋}, while the missing updates obey A_t ⊆ (W_{t mod P, ⌊t/P⌋} \ S_{t mod P, ⌊t/P⌋}). Because |W_{t mod P, ⌊t/P⌋}| = 2s(P − 1), the first claim immediately follows. The second and third claims follow from looking at the left- and right-most boundaries of W_{t mod P, ⌊t/P⌋}. □
Lemma 1 basically says that the "true" state x_t and the noisy state \tilde{x}_t only differ by at most 2s(P − 1) updates u_t, and that these updates cannot be more than (s + 1)P steps away from t. These properties can be used to prove convergence bounds for various algorithms; in this paper, we shall focus on stochastic gradient descent (SGD) [17]:
Theorem 1 (SGD under SSP): Suppose we want to find the minimizer x^* of a convex function f(x) = \frac{1}{T} \sum_{t=1}^{T} f_t(x), via gradient descent on one component \nabla f_t at a time. We assume the components f_t are also convex. Let u_t := -\eta_t \nabla f_t(\tilde{x}_t), where \eta_t = \frac{\sigma}{\sqrt{t}} with \sigma = \frac{F}{L\sqrt{2(s+1)P}} for certain constants F, L. Then, under suitable conditions (f_t are L-Lipschitz and the distance between two points D(x‖x′) ≤ F²),

R[X] := \Big[ \frac{1}{T} \sum_{t=1}^{T} f_t(\tilde{x}_t) \Big] - f(x^*) \le 4FL\sqrt{\frac{2(s+1)P}{T}}.

This means that the noisy worker views \tilde{x}_t converge in expectation to the true view x^* (as measured by the function f(), and at rate O(T^{-1/2})). We defer the proof to the appendix, noting that it generally follows the analysis in Langford et al. 
[17], except in places where Lemma 1 is involved. Our bound is also similar to [17], except that (1) their fixed delay τ has been replaced by our staleness upper bound 2(s + 1)P, and (2) we have shown convergence of the noisy worker views \tilde{x}_t rather than a true sequence x_t. Furthermore, because the constant factor 2(s + 1)P is only an upper bound to the number of erroneous updates, SSP's rate of convergence has a potentially tighter constant factor than Langford et al.'s fixed-staleness system (details are in the appendix).
5 Experiments
We show that the SSP model outperforms fully-synchronous models such as Bulk Synchronous Parallel (BSP) that require workers to wait for each other on every iteration, as well as asynchronous models with no model staleness guarantees. The general experimental details are:
• Computational models and implementation: SSP, BSP and Asynchronous5. We used SSPtable for the first two (BSP is just staleness 0 under SSP), and implemented the Asynchronous model using many of the caching features of SSPtable (to keep the implementations comparable).
• ML models (and parallel algorithms): LDA Topic Modeling (collapsed Gibbs sampling), Matrix Factorization (stochastic gradient descent) and Lasso regression (coordinate gradient descent). All algorithms were implemented using SSPtable's parameter server interface. For TM and MF, we ran the algorithms in a "full batch" mode (where the algorithm's workers collectively touch every data point once per clock()), as well as a "10% minibatch" mode (workers touch 10% of the data per clock()). 
Due to implementation limitations, we did not run Lasso under the Async model.
• Datasets: Topic Modeling: New York Times (N = 100m tokens, V = 100k terms, K = 100 topics); Matrix Factorization: Netflix (480k-by-18k matrix with 100m nonzeros, rank K = 100 decomposition); Lasso regression: synthetic dataset (N = 500 samples with P = 400k features6). We use a static data partitioning strategy explained in the Appendix.
• Compute cluster: Multi-core blade servers connected by 10 Gbps Ethernet, running VMware ESX. We use one virtual machine (VM) per physical machine. Each VM is configured with 8 cores (either 2.3GHz or 2.5GHz each) and 23GB of RAM, running on top of Debian Linux 7.0.

5 The Asynchronous model is used in many ML frameworks, such as YahooLDA [2] and HogWild! [21].
6 This is the largest data size we could get the Lasso algorithm to converge on, under ideal BSP conditions.

Convergence Speed. Figure 3 shows objective vs. time plots for the three ML algorithms, over several machine configurations. We are interested in how long each algorithm takes to reach a given objective value, which corresponds to drawing horizontal lines on the plots. On each plot, we show curves for BSP (zero staleness), Async, and SSP for the best staleness value ≥ 1 (we generally omit the other SSP curves to reduce clutter). In all cases except Topic Modeling with 8 VMs, SSP converges to a given objective value faster than BSP or Async. The gap between SSP and the other systems increases with more VMs and smaller data batches, because both of these factors lead to increased network communication — which SSP is able to reduce via staleness. We also provide a scalability-with-N-machines plot in the Appendix.
Computation Time vs Network Waiting Time. To understand why SSP performs better, we look at how the Topic Modeling (TM) algorithm spends its time during a fixed number of clock()s. 
In\nthe 2nd row of Figure 3, we see that for any machine con\ufb01guration, the TM algorithm spends roughly\nthe same amount of time on useful computation, regardless of the staleness value. However, the time\nspent waiting for network communication drops rapidly with even a small increase in staleness,\nallowing SSP to execute clock()s more quickly than BSP (staleness 0). Furthermore, the ratio of\nnetwork-to-compute time increases as we add more VMs, or use smaller data batches. At 32 VMs\nand 10% data minibatches, the TM algorithm under BSP spends six times more time on network\ncommunications than computation. In contrast, the optimal value of staleness, 32, exhibits a 1:1\nratio of communication to computation. Hence, the value of SSP lies in allowing ML algorithms\nto perform far more useful computations per second, compared to the BSP model (e.g. Hadoop).\nSimilar observations hold for the MF and Lasso applications (graphs not shown for space reasons).\nIteration Quantity and Quality. The network-compute ratio only partially explains SSP\u2019s behav-\nior; we need to examine each clock()\u2019s behavior to get a full picture. In the 3rd row of Figure 3,\nwe plot the number of clocks executed per worker per unit time for the TM algorithm, as well as\nthe objective value at each clock. Higher staleness values increase the number of clocks executed\nper unit time, but decrease each clock\u2019s progress towards convergence (as suggested by our theory);\nMF and Lasso also exhibit similar behavior (graphs not shown). Thus, staleness is a tradeoff be-\ntween iteration quantity and quality \u2014 and because the iteration rate exhibits diminishing returns\nwith higher staleness values, there comes a point where additional staleness starts to hurt the rate of\nconvergence per time. 
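This trade-off can be caricatured with synthetic numbers (an assumption-laden sketch, not measured data): if clocks per second saturate as staleness grows while progress per clock decays, their product peaks at an intermediate staleness. Both functional forms below are assumptions chosen purely for illustration.

```python
# Synthetic illustration of the quantity/quality trade-off (not measured
# data): clocks/sec rises with diminishing returns in staleness s, while
# progress per clock falls, so progress/sec peaks at an intermediate s.

def clocks_per_sec(s):
    return 10.0 * (1.0 - 0.9 ** (s + 1))     # assumed: saturating in s

def progress_per_clock(s):
    return 1.0 / (1.0 + 0.5 * s)             # assumed: staler reads help less

progress_per_sec = {s: clocks_per_sec(s) * progress_per_clock(s)
                    for s in range(10)}
best = max(progress_per_sec, key=progress_per_sec.get)
assert 0 < best < 9    # the optimum is neither zero staleness nor the maximum
```

Under these assumed curves the product is maximized at a small positive staleness, mirroring the "sweet spot" behavior observed in the experiments.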
This explains why the best staleness value in a given setting is some finite constant 0 < s < ∞; hence, SSP can hit a "sweet spot" between quality and quantity that BSP and Async do not achieve. Automatically finding this sweet spot for a given problem is a subject for future work.

6 Related Work and Discussion

The idea of staleness has been explored before: in ML academia, it has been analyzed in the context of cyclic-delay architectures [17, 1], in which machines communicate with a central server (or each other) under a fixed schedule (and hence fixed staleness). Even the bulk synchronous parallel (BSP) model inherently produces stale communications, the effects of which have been studied for algorithms such as Lasso regression [5] and topic modeling [2]. Our work differs in that SSP advocates bounded (rather than fixed) staleness to allow higher computational throughput via local machine caches. Furthermore, SSP's performance does not degrade when parameter updates frequently collide on the same vector elements, unlike asynchronous lock-free systems [21]. We note that staleness has been informally explored in the industrial setting at large scales; our work provides a first attempt at rigorously justifying staleness as a sound ML technique.

Distributed platforms such as Hadoop and GraphLab [18] are popular for large-scale ML. The biggest difference between them and SSPtable is the programming model: Hadoop uses a stateless map-reduce model, while GraphLab uses stateful vertex programs organized into a graph. In contrast, SSPtable provides a convenient shared-memory programming model based on a table/matrix API, making it easy to convert single-machine parallel ML algorithms into distributed versions. In particular, the algorithms used in our experiments (LDA, MF, Lasso) are all straightforward conversions of single-machine algorithms.
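To make the programming-model contrast concrete, here is a minimal single-process sketch of the kind of shared-table interface described above. The class and method names (Table, read_row, inc, clock) are illustrative stand-ins, not SSPtable's actual API, and this toy always serves the latest values rather than an s-stale cache.

```python
# Illustrative stand-in for a shared parameter table; names are hypothetical,
# not SSPtable's actual API.
class Table:
    def __init__(self):
        self.rows = {}        # row key -> {column: value}
        self.worker_clock = 0

    def read_row(self, key):
        # A real SSP server may return a locally cached copy that is at most
        # s clocks stale; this toy version always returns the latest values.
        return dict(self.rows.get(key, {}))

    def inc(self, key, col, delta):
        # Updates are commutative increments, so they can be applied in any order.
        row = self.rows.setdefault(key, {})
        row[col] = row.get(col, 0.0) + delta

    def clock(self):
        # Signals the end of this worker's iteration.
        self.worker_clock += 1

# Converting a single-machine gradient step is mostly mechanical:
# parameter reads become read_row() and writes become inc().
table = Table()
grad = {"w": {"x1": 0.5, "x2": -0.2}}  # hypothetical gradient for row "w"
lr = 0.1
for key, cols in grad.items():
    for col, g in cols.items():
        table.inc(key, col, -lr * g)
table.clock()
assert abs(table.read_row("w")["x1"] - (-0.05)) < 1e-12
```

Because writes are expressed as increments rather than overwrites, delayed updates from other workers can be merged in later without conflicting, which is what makes a stale-read cache workable.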
Hadoop's BSP execution model is a special case of SSP, making SSPtable more general in that regard; however, Hadoop also provides fault-tolerance and distributed filesystem features that SSPtable does not cover. Finally, there exist special-purpose tools such as Vowpal Wabbit [16] and YahooLDA [2]. Whereas these systems are targeted at a subset of ML algorithms, SSPtable can be used by any ML algorithm that tolerates stale updates.

The distributed systems community has typically examined staleness in the context of consistency models. The TACT model [26] describes consistency along three dimensions: numerical error, order error, and staleness. Other work [24] attempts to classify existing systems according to a number of consistency properties, specifically naming the concept of bounded staleness.

Figure 3: Experimental results: SSP, BSP and Asynchronous parameter servers running Topic Modeling, Matrix Factorization and Lasso regression. The Convergence graphs plot objective function (i.e. solution quality) against time. For Topic Modeling, we also plot computation time vs network waiting time, as well as how staleness affects iteration (clock) frequency (Quantity) and objective improvement per iteration (Quality). [Figure panels omitted: Topic Modeling convergence, compute-vs-network time, and iteration quantity/quality at 8 VMs, 32 VMs, and 32 VMs with 10% minibatches; Lasso convergence at 16 VMs; Matrix Factorization convergence at 8 VMs, 32 VMs, and 32 VMs with 10% minibatches.]

The vector clocks used in SSPtable are similar to those in Fidge [11] and Mattern [20], which were in turn inspired by Lamport clocks [15]. However, SSPtable uses vector clocks to track the freshness of the data, rather than causal relationships between updates.
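A rough sketch of this freshness use follows; the function name and the exact rule are our paraphrase of SSP's read guarantee (a reader at clock c must see every worker's updates through clock c − s − 1), not SSPtable's implementation.

```python
def cache_is_fresh(vector_clock, reader_clock, staleness):
    """vector_clock[w] = newest clock of worker w whose updates are already
    folded into the cached copy. The copy suffices for a reader at
    `reader_clock` under staleness bound `staleness` if it contains every
    worker's updates through clock reader_clock - staleness - 1."""
    return min(vector_clock) >= reader_clock - staleness - 1

# A reader at clock 4 with staleness 1 needs everyone's updates through clock 2.
assert cache_is_fresh([3, 4, 5], reader_clock=4, staleness=1)      # min is 3 >= 2
assert not cache_is_fresh([1, 4, 5], reader_clock=4, staleness=1)  # worker 0 too old
```

Note the contrast with causal uses of vector clocks: no happens-before comparison is made between two events; the clock is reduced to a single freshness bound via its minimum entry.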
Cipar et al. [8] give an informal definition of the SSP model, motivated by the need to reduce straggler effects in large compute clusters.

In databases, bounded staleness has been applied to improve update and query performance. LazyBase [7] allows staleness bounds to be configured on a per-query basis, and uses this relaxed staleness to improve both query and update performance. FAS [23] keeps data replicated in a number of databases, each providing a different freshness/performance tradeoff. Data stream warehouses [13] collect data about timestamped events, and provide different consistency guarantees depending on the freshness of the data. Staleness (or freshness/timeliness) has also been applied in other fields such as sensor networks [14], dynamic web content generation [3], web caching [6], and information systems [4].

Acknowledgments

Qirong Ho is supported by an NSS-PhD Fellowship from A-STAR, Singapore. This work is supported in part by NIH 1R01GM087694 and 1R01GM093156, DARPA FA87501220324, and NSF IIS1111142 to Eric P. Xing. We thank the member companies of the PDL Consortium (Actifio, APC, EMC, Emulex, Facebook, Fusion-IO, Google, HP, Hitachi, Huawei, Intel, Microsoft, NEC, NetApp, Oracle, Panasas, Samsung, Seagate, Symantec, VMware, Western Digital) for their interest, insights, feedback, and support.
This work is also supported in part by Intel via the Intel Science and Technology Center for Cloud Computing (ISTC-CC) and hardware donations from Intel and NetApp.

[Plot data for the Figure 3 panels omitted; see the Figure 3 caption for the panel layout.]
References

[1] A. Agarwal and J. C. Duchi. Distributed delayed stochastic optimization. In 51st IEEE Annual Conference on Decision and Control (CDC), pages 5451-5452. IEEE, 2012.

[2] A. Ahmed, M. Aly, J. Gonzalez, S. Narayanamurthy, and A. J. Smola. Scalable inference in latent variable models. In WSDM, pages 123-132, 2012.

[3] A. Labrinidis and N. Roussopoulos. Balancing performance and data freshness in web database servers. In VLDB, pages 393-404, September 2003.

[4] M. Bouzeghoub. A framework for analysis of data freshness. In Proceedings of the 2004 International Workshop on Information Quality in Information Systems (IQIS '04), pages 59-67, 2004.

[5] J. K. Bradley, A. Kyrola, D. Bickson, and C. Guestrin. Parallel coordinate descent for L1-regularized loss minimization. In International Conference on Machine Learning (ICML 2011), June 2011.

[6] L. Bright and L. Raschid. Using latency-recency profiles for data delivery on the web. In Proceedings of the 28th International Conference on Very Large Data Bases (VLDB '02), pages 550-561, 2002.

[7] J. Cipar, G. Ganger, K.
Keeton, C. B. Morrey III, C. A. Soules, and A. Veitch. LazyBase: trading freshness for performance in a scalable database. In Proceedings of the 7th ACM European Conference on Computer Systems, pages 169-182, 2012.

[8] J. Cipar, Q. Ho, J. K. Kim, S. Lee, G. R. Ganger, G. Gibson, K. Keeton, and E. Xing. Solving the straggler problem with bounded staleness. In HotOS '13. USENIX, 2013.

[9] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In NIPS 2012, 2012.

[10] Facebook. www.facebook.com/note.php?note_id=10150388519243859, January 2013.

[11] C. J. Fidge. Timestamps in message-passing systems that preserve the partial ordering. In 11th Australian Computer Science Conference, pages 55-66, University of Queensland, Australia, 1988.

[12] R. Gemulla, E. Nijkamp, P. J. Haas, and Y. Sismanis. Large-scale matrix factorization with distributed stochastic gradient descent. In KDD, pages 69-77. ACM, 2011.

[13] L. Golab and T. Johnson. Consistency in a stream warehouse. In CIDR 2011, pages 114-122.

[14] C.-T. Huang. Loft: low-overhead freshness transmission in sensor networks. In SUTC 2008, pages 241-248, Washington, DC, USA, 2008. IEEE Computer Society.

[15] L. Lamport. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, 21(7):558-565, July 1978.

[16] J. Langford, L. Li, and A. Strehl. Vowpal Wabbit online learning project, 2007.

[17] J. Langford, A. J. Smola, and M. Zinkevich. Slow learners are fast. In Advances in Neural Information Processing Systems, pages 2331-2339, 2009.

[18] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. Distributed GraphLab: a framework for machine learning and data mining in the cloud. PVLDB, 2012.

[19] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N.
Leiser, and G. Czajkowski. Pregel: a system for large-scale graph processing. In Proceedings of the 2010 International Conference on Management of Data, pages 135-146. ACM, 2010.

[20] F. Mattern. Virtual time and global states of distributed systems. In Proc. Workshop on Parallel and Distributed Algorithms, pages 215-226. North-Holland / Elsevier, 1989.

[21] F. Niu, B. Recht, C. Ré, and S. J. Wright. Hogwild!: a lock-free approach to parallelizing stochastic gradient descent. In NIPS, 2011.

[22] R. Power and J. Li. Piccolo: building fast, distributed programs with partitioned tables. In Proceedings of the USENIX Conference on Operating Systems Design and Implementation (OSDI), pages 1-14, 2010.

[23] U. Röhm, K. Böhm, H.-J. Schek, and H. Schuldt. FAS: a freshness-sensitive coordination middleware for a cluster of OLAP components. In VLDB 2002, pages 754-765. VLDB Endowment, 2002.

[24] D. Terry. Replicated data consistency explained through baseball. Technical Report MSR-TR-2011-137, Microsoft Research, October 2011.

[25] Yahoo! http://webscope.sandbox.yahoo.com/catalog.php?datatype=g, 2013.

[26] H. Yu and A. Vahdat. Design and evaluation of a conit-based continuous consistency model for replicated services. ACM Transactions on Computer Systems, 20(3):239-282, August 2002.

[27] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: cluster computing with working sets. In Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, 2010.

[28] M. Zinkevich, M. Weimer, A. Smola, and L. Li. Parallelized stochastic gradient descent.
In Advances in Neural Information Processing Systems 23, pages 1-9, 2010.