{"title": "Communication Efficient Distributed Machine Learning with the Parameter Server", "book": "Advances in Neural Information Processing Systems", "page_first": 19, "page_last": 27, "abstract": "This paper describes a third-generation parameter server framework for distributed machine learning. This framework offers two relaxations to balance system performance and algorithm efficiency. We propose a new algorithm that takes advantage of this framework to solve non-convex non-smooth problems with convergence guarantees. We present an in-depth analysis of two large scale machine learning problems ranging from $\\ell_1$-regularized logistic regression on CPUs to reconstruction ICA on GPUs, using 636TB of real data with hundreds of billions of samples and dimensions. We demonstrate using these examples that the parameter server framework is an effective and straightforward way to scale machine learning to larger problems and systems than have been previously achieved.", "full_text": "Communication Efficient Distributed Machine Learning with the Parameter Server\n\nMu Li\u2217\u2020, David G. Andersen\u2217, Alexander Smola\u2217\u2021, and Kai Yu\u2020\n\u2217Carnegie Mellon University \u2020Baidu \u2021Google\n{muli, dga}@cs.cmu.edu, alex@smola.org, yukai@baidu.com\n\nAbstract\n\nThis paper describes a third-generation parameter server framework for distributed machine learning. This framework offers two relaxations to balance system performance and algorithm efficiency. We propose a new algorithm that takes advantage of this framework to solve non-convex non-smooth problems with convergence guarantees. We present an in-depth analysis of two large scale machine learning problems ranging from \u21131-regularized logistic regression on CPUs to reconstruction ICA on GPUs, using 636TB of real data with hundreds of billions of samples and dimensions. 
We demonstrate using these examples that the parameter server framework is an effective and straightforward way to scale machine learning to larger problems and systems than have been previously achieved.\n\n1 Introduction\n\nIn realistic industrial machine learning applications the datasets range from 1TB to 1PB. For example, a social network with 100 million users and 1KB of data per user holds 100TB. Problems in online advertising and user-generated content analysis have complexities of a similar order of magnitude [12]. Such huge quantities of data allow learning powerful and complex models with 10^9 to 10^12 parameters [9], at which scale a single machine is often not powerful enough to complete these tasks in time.\n\nDistributed optimization is becoming a key tool for solving large scale machine learning problems [1, 3, 10, 21, 19]. The workloads are partitioned into worker machines, which access the globally shared model as they simultaneously perform local computations to refine the model. However, efficient implementations of distributed optimization algorithms for machine learning applications are not easy. A major challenge is the inter-machine data communication:\n\u2022 Worker machines must frequently read and write the globally shared parameters. This massive data access requires an enormous amount of network bandwidth. However, bandwidth is one of the scarcest resources in datacenters [6], often 10-100 times smaller than memory bandwidth and shared among all running applications and machines. This leads to a huge communication overhead and becomes a bottleneck for distributed optimization algorithms.\n\u2022 Many optimization algorithms are sequential, requiring frequent synchronization among worker machines. In each synchronization, all machines need to wait for the slowest machine. 
However, due to imperfect workload partitioning, network congestion, or interference from other running jobs, slow machines are inevitable, and this waiting becomes another bottleneck.\n\nIn this work, we build upon our prior work designing an open-source third-generation parameter server framework [4] to understand the scope of machine learning algorithms to which it can be applied, and to what benefit. Figure 1 gives an overview of the scale of the largest machine learning experiments performed on a number of state-of-the-art systems; we confirmed the numbers with the authors of these systems whenever possible.\n\nCompared to these systems, our parameter server is several orders of magnitude more scalable in terms of both parameters and nodes. The parameter server communicates data asynchronously to reduce the communication cost. The resulting data inconsistency is a trade-off between system performance and algorithm convergence rate. The system offers two relaxations to address data (in)consistency: first, rather than arguing for a specific consistency model [29, 7, 15], we support flexible consistency models; second, the system allows user-specific filters for fine-grained consistency management. In addition, the system provides other features such as data replication, instantaneous failover, and elastic scalability.\n\nFigure 1: Comparison of the largest publicly reported machine learning experiments performed by each system. The results are current as of April 2014.\n\nMotivating Application. 
Consider the following general regularized optimization problem:\n\nminimize_w F(w) where F(w) := f(w) + h(w) and w \u2208 R^p. (1)\n\nWe assume that the loss function f : R^p \u2192 R is continuously differentiable but not necessarily convex, and the regularizer h : R^p \u2192 R is convex, lower semicontinuous, block separable, but possibly non-smooth.\n\nThe proposed algorithm solves this problem based on the proximal gradient method [23]. However, it differs from the latter in four aspects, to efficiently tackle very high-dimensional and sparse data:\n\u2022 Only a subset (block) of coordinates is updated at a time: (block) Gauss-Seidel updates are shown to be efficient on sparse data [36, 27].\n\u2022 The model a worker maintains is only partially consistent with other machines, due to asynchronous data communication.\n\u2022 The proximal operator uses coordinate-specific learning rates to adapt progress to the sparsity pattern inherent in the data.\n\u2022 Only coordinates that would change the associated model weights are communicated, to reduce network traffic.\n\nWe demonstrate the efficiency of the proposed algorithm by applying it to two challenging problems: (1) non-smooth \u21131-regularized logistic regression on sparse text datasets with over 100 billion examples and features; (2) a non-convex and non-smooth ICA reconstruction problem [18], extracting billions of sparse features from dense image data. We show that the combination of the proposed algorithm and system effectively reduces both the communication cost and the programming effort. In particular, 300 lines of code suffice to implement \u21131-regularized logistic regression with nearly no communication overhead for industrial-scale problems.\n\nOutline: We first provide background in Section 2. Next, we address the two relaxations in Section 3 and the proposed algorithm in Section 4. 
In Section 5 (and also Appendices B and C), we present the applications with the experimental results. We conclude with a discussion in Section 6.\n\n2 Background\n\nRelated Work. The parameter server framework [29] has proliferated both in academia and in industry. Related systems have been implemented at Amazon, Baidu, Facebook, Google [10], Microsoft, and Yahoo [2]. There are also open-source implementations, such as YahooLDA [2] and Petuum [15]. As introduced in [29, 2], the first generation of parameter servers lacked flexibility and performance. The second generation of parameter servers was application specific, exemplified by Distbelief [10] and the synchronization mechanism in [20]. Petuum modified YahooLDA by imposing bounded delay instead of eventual consistency and aimed for a general platform [15], but it placed more constraints on the threading model of worker machines. Compared to previous work, our third-generation system greatly improves system performance, and also provides flexibility and fault tolerance.\n\nBeyond the parameter server, there exist many general-purpose distributed systems for machine learning applications. Many mandate synchronous and iterative communication. For example, Mahout [5], based on Hadoop [13], and MLI [30], based on Spark [37], both adopt the iterative MapReduce framework [11]. On the other hand, Graphlab [21] supports global parameter synchronization on a best-effort basis. These systems scale well to a few hundred nodes, primarily on dedicated research clusters. However, at a larger scale the synchronization requirement creates performance bottlenecks. 
The primary advantage of the parameter server over these systems is the flexibility of the consistency models it offers.\n\nThere is also a growing interest in asynchronous algorithms. Shotgun [7], part of Graphlab, performs parallel coordinate descent for solving \u21131-regularized optimization problems. Other methods partition observations over several machines and update the model in a data-parallel fashion [34, 17, 38, 3, 1, 19]. Lock-free variants were proposed in Hogwild [26]. Mixed variants which partition both data and parameters into non-overlapping components were introduced in [33], albeit at the price of having to move or replicate data across several machines. Lastly, the NIPS framework [31] discusses general non-convex approximate proximal methods.\n\nThe proposed algorithm differs from existing approaches mainly in two aspects. First, we focus on solving large scale problems. Given the size of the data and the limited network bandwidth, neither the shared-memory approach of Shotgun and Hogwild nor moving the entire dataset during training is desirable. Second, we aim at solving general non-convex and non-smooth composite objective functions. Different from [31], we derive a convergence theorem with weaker assumptions, and furthermore we carry out experiments at many orders of magnitude larger scale.\n\nThe Parameter Server Architecture. An instance of the parameter server [4] contains a server group and several worker groups, where a group consists of several machines. Each machine in the server group maintains a portion of the global parameters, and all servers communicate with each other to replicate and/or migrate parameters for reliability and scaling.\n\nA worker stores only a portion of the training data and computes the local gradients or other statistics. 
Workers communicate only with the servers to retrieve and update the shared parameters. Each worker group may contain a scheduler machine, which assigns workloads to workers and monitors their progress. When workers are added to or removed from the group, the scheduler can reschedule the unfinished workloads. Each worker group runs an application, thus allowing for multi-tenancy. For example, an ad-serving system and an inference algorithm can run concurrently in different worker groups.\n\nThe shared model parameters are represented as sorted (key,value) pairs. Alternatively, we can view this as a sparse vector or matrix that interacts with the training data through the built-in multi-threaded linear algebra functions. Data exchange is achieved via two operations: push and pull. A worker can push all (key,value) pairs within a range to the servers, or pull the corresponding values from the servers.\n\nDistributed Subgradient Descent. For the motivating example introduced in (1), we can implement a standard distributed subgradient descent algorithm [34] using the parameter server. As illustrated in Figure 2 and Algorithm 1, the training data is partitioned and distributed among all the workers. The model w is learned iteratively. In each iteration, each worker computes the local gradients using its own training data, and the servers aggregate these gradients to update the globally shared parameter w. The workers then retrieve the updated weights from the servers.\n\nA worker needs the model w to compute its gradients. However, for very high-dimensional training data, the model may not fit in a worker. Fortunately, such data are often sparse, and a worker typically only requires a subset of the model. To illustrate this point, we randomly assigned samples in the dataset used in Section 5 to workers, and then counted the model parameters a worker needed for computing gradients. 
We found that when using 100 workers, the average worker only needs 7.8% of the model. With 10,000 workers this reduces to 0.15%. Therefore, despite the large total size of w, the working set of w needed by a particular worker can be cached trivially.\n\nAlgorithm 1 Distributed Subgradient Descent Solving (1) in the Parameter Server\nWorker r = 1, . . . , m:\n1: Load a part of training data {y_{i_k}, x_{i_k}}_{k=1}^{n_r}\n2: Pull the working set w_r^{(0)} from servers\n3: for t = 1 to T do\n4: Gradient g_r^{(t)} \u2190 \u2211_{k=1}^{n_r} \u2202\u2113(x_{i_k}, y_{i_k}, w_r^{(t)})\n5: Push g_r^{(t)} to servers\n6: Pull w_r^{(t+1)} from servers\n7: end for\nServers:\n1: for t = 1 to T do\n2: Aggregate g^{(t)} \u2190 \u2211_{r=1}^{m} g_r^{(t)}\n3: w^{(t+1)} \u2190 w^{(t)} \u2212 \u03b7 (g^{(t)} + \u2202h(w^{(t)}))\n4: end for\n\nFigure 2: One iteration of Algorithm 1. Each worker only caches the working set of w.\n\n3 Two Relaxations of Data Consistency\n\nWe now introduce the two relaxations that are key to the proposed system. We encourage the reader interested in systems details such as server key layout, elastic scalability, and continuous fault tolerance to see our prior work [4].\n\n3.1 Asynchronous Task Dependency\n\nWe decompose the workloads in the parameter server into tasks that are issued by a caller to a remote callee. There is considerable flexibility in terms of what constitutes a task: for instance, a task can be a push or a pull that a worker issues to servers, or a user-defined function that the scheduler issues to any node, such as an iteration in the distributed subgradient algorithm. Tasks can also contain subtasks. For example, a worker performs one push and one pull per iteration in Algorithm 1.\n\nTasks are executed asynchronously: the caller can perform further computation immediately after issuing a task. The caller marks a task as finished only once it receives the callee\u2019s reply. 
A reply could be the function return of a user-defined function, the (key,value) pairs requested by the pull, or an empty acknowledgement. The callee marks a task as finished only if the call of the task has returned and all subtasks issued by this call are finished.\n\nBy default callees execute tasks in parallel for best performance. A caller wishing to render task execution sequential can insert an execute-after-finished dependency between tasks. The diagram on the right illustrates the execution of three tasks. Tasks 10 and 11 are independent, but 12 depends on 11. The callee therefore begins task 11 immediately after the gradients are computed in task 10. Task 12, however, is postponed until after the pull of 11.\n\nTask dependencies aid implementing algorithm logic. For example, the aggregation logic at the servers in Algorithm 1 can be implemented by having the updating task depend on the push tasks of all workers. In this way, the weight w is updated only after all worker gradients have been aggregated.\n\n3.2 Flexible Consistency Models via Task Dependency Graphs\n\nThe dependency graph introduced above can be used to relax consistency requirements. Independent tasks improve the system efficiency by parallelizing the usage of CPU, disk, and network bandwidth. However, this may lead to data inconsistency between nodes. In the diagram above, worker r starts iteration 11 before the updated model w_r^{(11)} is pulled back; it therefore uses the outdated model w_r^{(10)} and computes the same gradient as it did in iteration 10, namely g_r^{(11)} = g_r^{(10)}. This inconsistency can potentially slow down the convergence of Algorithm 1. However, some algorithms may be less sensitive to this inconsistency. For example, if only a block of w is updated in each iteration of Algorithm 2, starting iteration 11 without waiting for 10 causes only a portion of w to be inconsistent.\n\nThe trade-off between algorithm efficiency and system performance depends on various factors in practice, such as feature correlation, hardware capacity, and datacenter load. 
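This trade-off can be made concrete with a toy simulation. The following sketch is our own illustration, not the parameter server's API: it runs gradient descent on a separable quadratic while gradients are computed on weights up to \u03c4 iterations stale, mimicking a bounded-delay model, and it shows that a larger delay calls for a smaller learning rate, in line with the convergence analysis of Section 4.

```python
def delayed_gd(tau, lr, steps=300):
    """Gradient descent on f(w) = 0.5 * sum_k c_k * (w_k - target_k)^2,
    where each gradient is evaluated on weights up to `tau` steps stale.
    Curvatures and targets are fixed toy values, chosen only for illustration."""
    curv = [0.5, 1.0, 1.5, 2.0]        # per-coordinate curvature c_k
    target = [1.0, -2.0, 3.0, 0.5]     # the optimum w*
    w = [0.0] * len(curv)
    hist = [list(w)]                   # past iterates a slow worker may see
    for t in range(steps):
        stale = hist[max(0, t - tau)]  # model delayed by at most tau steps
        grad = [c * (s - m) for c, s, m in zip(curv, stale, target)]
        w = [wk - lr * gk for wk, gk in zip(w, grad)]
        hist.append(list(w))
    return max(abs(wk - mk) for wk, mk in zip(w, target))  # final error
```

In this toy setup, `delayed_gd(0, 0.2)` and `delayed_gd(8, 0.05)` both converge, while `delayed_gd(8, 0.2)` diverges: the stability region shrinks as the delay grows, which is the same qualitative behavior as the learning-rate bound in (5).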
Unlike other systems that force the algorithm designer to adopt a specific consistency model that may be ill-suited to the situation at hand, the parameter server provides full flexibility over consistency models through task dependency graphs, which are directed acyclic graphs defined by tasks and their dependencies. Consider the following three examples:\n\nSequential Consistency requires all tasks to be executed one by one. The next task can be started only if the previous one has finished. It produces results identical to a single-thread implementation. Bulk Synchronous Processing uses this approach.\n\nEventual Consistency, to the contrary, allows all tasks to be started simultaneously. [29] describes such a system for LDA. This approach is advisable only when the underlying algorithms are very robust with regard to delays.\n\nBounded Delay limits the staleness of parameters. When a maximal delay \u03c4 is set, a new task will be blocked until all tasks issued more than \u03c4 steps earlier have finished (\u03c4 = 0 yields sequential consistency and \u03c4 = \u221e recovers eventual consistency). Algorithm 2 uses such a model.\n\nNote that dependency graphs allow for more advanced consistency models. For example, the scheduler may increase or decrease the maximal delay according to the runtime progress to dynamically balance the efficiency-convergence trade-off.\n\n3.3 Flexible Consistency Models via User-defined Filters\n\nTask dependency graphs manage data consistency between tasks. User-defined filters allow for more fine-grained control of consistency (e.g., within a task). A filter can transform and selectively synchronize the (key,value) pairs communicated in a task. Several filters can be applied together for better data compression. 
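As a minimal sketch of this mechanism (our own Python illustration; the real system implements filters natively in its communication layer), a filter can be modeled as a function that transforms the dictionary of (key,value) updates before it is sent, and a task simply applies its chain of filters in order:

```python
def significantly_modified(threshold):
    """Drop entries whose change since the last sync is below `threshold`."""
    last = {}  # per-filter state: last value synchronized for each key
    def apply(updates):
        out = {}
        for k, v in updates.items():
            if abs(v - last.get(k, 0.0)) > threshold:
                out[k] = v
                last[k] = v  # remember what the receiver now holds
        return out
    return apply

def random_skip(keep_every):
    """Subsample entries, keeping one in `keep_every` (deterministic here
    for illustration; a real subsampler would randomize)."""
    def apply(updates):
        return {k: v for i, (k, v) in enumerate(sorted(updates.items()))
                if i % keep_every == 0}
    return apply

def send(updates, filters):
    for f in filters:  # filters are applied in order before pushing
        updates = f(updates)
    return updates
```

Chaining, say, a threshold filter with a subsampling filter sends strictly less data than either alone, at the price of coarser updates.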
Some example filters are:\n\nSignificantly modified filter: only pushes entries that have changed by more than a threshold since they were last synchronized.\n\nRandom skip filter: subsamples entries before sending; the skipped entries are left out of the calculations.\n\nKKT filter: takes advantage of the optimality condition when solving the proximal operator: a worker only pushes gradients that are likely to affect the weights on the servers. We discuss it in more detail in Section 5.\n\nKey caching filter: Each push and pull communicates a range of (key,value) pairs. When the same range is chosen again, it is likely that only the values have been modified while the keys are unchanged. If both the sender and receiver have cached these keys, the sender only needs to send the values together with a signature of the keys, effectively doubling the usable network bandwidth.\n\nCompressing filter: The values communicated are often compressible numbers, such as zeros, small integers, and floating point numbers with more than enough precision. This filter reduces the data size by using lossless or lossy data compression algorithms\u00b9.\n\n4 Delayed Block Proximal Gradient Method\n\nIn this section, we propose an efficient algorithm that takes advantage of the parameter server to solve the previously defined non-convex and non-smooth optimization problem (1).\n\n\u00b9Both key caching and data compression are presented as system-level optimizations in the prior work [4]; here we generalize them into user-defined filters.\n\n[Figure: task timelines under (a) sequential, (b) eventual, and (c) bounded-delay (\u03c4 = 1) consistency.]\n\nAlgorithm 2 Delayed Block Proximal Gradient Method Solving (1)\nScheduler:\n1: Partition parameters into k blocks b1, . . .
, bk\n2: for t = 1 to T: pick a block b_{i_t} and issue the task to workers\nWorker r at iteration t:\n1: Wait until all iterations before t \u2212 \u03c4 are finished\n2: Compute the first-order gradient g_r^{(t)} and coordinate-specific learning rates u_r^{(t)} on block b_{i_t}\n3: Push g_r^{(t)} and u_r^{(t)} to servers with user-defined filters, e.g., the random skip or the KKT filter\n4: Pull w_r^{(t+1)} from servers with user-defined filters, e.g., the significantly modified filter\nServers at iteration t:\n1: Aggregate g^{(t)} and u^{(t)}\n2: Solve the generalized proximal operator (2): w^{(t+1)} \u2190 Prox^U_{\u03b3_t}(w^{(t)} \u2212 \u03b3_t g^{(t)}) with U = diag(u^{(t)})\n\nProximal Gradient Methods. For a closed proper convex function h(x) : R^p \u2192 R \u222a {\u221e}, define the generalized proximal operator\n\nProx^U_\u03b3(x) := argmin_{y \u2208 R^p} h(y) + (1/2\u03b3) \u2016x \u2212 y\u2016^2_U, where \u2016x\u2016^2_U := x^T U x. (2)\n\nThe Mahalanobis norm \u2016x\u2016_U is taken with respect to a positive semidefinite matrix U \u2ab0 0. Many proximal algorithms choose U = I. To minimize the composite objective function f(w) + h(w), proximal gradient algorithms update w in two steps: a forward step performing steepest gradient descent on f and a backward step carrying out a projection using h. Given a learning rate \u03b3_t > 0 at iteration t, these two steps can be written as\n\nw^{(t+1)} = Prox^U_{\u03b3_t}[w^{(t)} \u2212 \u03b3_t \u2207f(w^{(t)})] for t = 1, 2, . . . (3)\n\nAlgorithm. We relax the consistency model of the proximal gradient methods with a block scheme to reduce the sensitivity to data inconsistency. The proposed algorithm is shown in Algorithm 2. It differs from the standard method as well as Algorithm 1 in four substantial ways, to take advantage of the opportunities offered by the parameter server and to handle high-dimensional sparse data.\n1. 
Only a block of parameters is updated per iteration.\n2. The workers compute both gradients and coordinate-specific learning rates, e.g., the diagonal part of the second derivative, on this block.\n3. Iterations are asynchronous. We use a bounded-delay model over iterations.\n4. We employ user-defined filters to suppress the transmission of parts of the data whose effect on the model is likely to be negligible.\n\nConvergence Analysis. To prove convergence we need to make a number of assumptions. As before, we decompose the loss f into blocks f_i associated with the training data stored by worker i, that is, f = \u2211_i f_i. Next we assume that block b_t is chosen at iteration t. A key assumption is that for given parameter changes the rate of change in the gradients of f is bounded. More specifically, we need to bound the change affecting the block itself and the amount of \u201ccrosstalk\u201d to other blocks.\n\nAssumption 1 (Block Lipschitz Continuity) There exist positive constants L_{var,i} and L_{cov,i} such that for any iteration t and all x, y \u2208 R^p with x_i = y_i for any i \u2209 b_t we have\n\n\u2016\u2207_{b_t} f_i(x) \u2212 \u2207_{b_t} f_i(y)\u2016 \u2264 L_{var,i} \u2016x \u2212 y\u2016 for 1 \u2264 i \u2264 m, (4a)\n\u2016\u2207_{b_s} f_i(x) \u2212 \u2207_{b_s} f_i(y)\u2016 \u2264 L_{cov,i} \u2016x \u2212 y\u2016 for 1 \u2264 i \u2264 m, t < s \u2264 t + \u03c4, (4b)\n\nwhere \u2207_b f(x) denotes block b of \u2207f(x). Further define L_var := \u2211_{i=1}^m L_{var,i} and L_cov := \u2211_{i=1}^m L_{cov,i}.\n\nThe following Theorem 2 indicates that this algorithm converges to a stationary point under the relaxed consistency model, provided that a suitable learning rate is chosen. 
Note that since the overall objective is non-convex, no guarantees of optimality are possible in general.\n\nTheorem 2 Assume that updates are performed with a delay bounded by \u03c4, and assume that we apply a random skip filter on pushing gradients and a significantly-modified filter on pulling weights with threshold O(t^{\u22121}). Moreover, assume that the gradients of the loss are Lipschitz continuous as per Assumption 1. Denote by M_t the minimal coordinate-specific learning rate at time t. For any \u03b5 > 0, Algorithm 2 converges to a stationary point in expectation if the learning rate \u03b3_t satisfies\n\n\u03b3_t \u2264 M_t / (L_var + \u03c4 L_cov + \u03b5) for all t > 0. (5)\n\nThe proof is given in Appendix A. Intuitively, the difference between w^{(t\u2212\u03c4)} and w^{(t)} will be small when reaching a stationary point. As a consequence, the change in gradients will also vanish. The inexact gradient obtained from the delayed and inexact model is therefore likely a good approximation of the true gradient, so the convergence results of proximal gradient methods can be applied.\n\nNote that when the delay increases, we should decrease the learning rate to guarantee convergence. However, a larger value is possible when the block partition and ordering are chosen carefully. For example, if the features in a block are less correlated then L_var decreases; if the block is less related to the previous blocks, then L_cov decreases, as also exploited in [26, 7].\n\n5 Experiments\n\nWe now show how the general framework discussed above can be used to solve challenging machine learning problems. Due to space constraints we only present experimental results for a 0.6PB dataset below. Details on smaller datasets are relegated to Appendix B. Moreover, we discuss non-smooth reconstruction ICA in Appendix C.\n\nSetup. 
We chose \u21131-regularized logistic regression for evaluation because it is one of the most popular algorithms used in industry for large scale risk minimization [9]. We collected an ad click prediction dataset with 170 billion samples and 65 billion unique features. The uncompressed dataset size is 636TB. We ran the parameter server on 1000 machines, each with 16 CPU cores, 192GB DRAM, and connected by 10 Gb Ethernet. 800 machines acted as workers, and 200 were servers. The cluster was in concurrent use by other jobs during operation.\n\nAlgorithm. We adopted Algorithm 2 with upper bounds on the diagonal entries of the Hessian as the coordinate-specific learning rates. Features were randomly split into 580 blocks according to the feature group information. We chose a fixed learning rate by observing the convergence speed.\n\nWe designed a Karush-Kuhn-Tucker (KKT) filter to skip inactive coordinates. It is analogous to the active-set selection strategies of SVM optimization [16] and active set selectors [22]. Assume w_k = 0 for coordinate k and let g_k be the current gradient. According to the optimality condition of the proximal operator, also known as the soft-shrinkage operator, w_k will remain 0 if |g_k| \u2264 \u03bb. Therefore, it is not necessary for a worker to send g_k (or u_k). We use an old value \u011d_k to approximate g_k, to further avoid computing g_k. Thus, coordinate k is skipped by the KKT filter if |\u011d_k| \u2264 \u03bb \u2212 \u03b4, where \u03b4 \u2208 [0, \u03bb] controls how aggressive the filtering is.\n\nImplementation. To the best of our knowledge, no open source system can scale sparse logistic regression to the scale described in this paper. Graphlab provides only a multi-threaded, single machine implementation; we compare it with ours in Appendix B. Mlbase, Petuum and REEF do not support sparse logistic regression (as confirmed with the authors in 4/2014). 
We compare the parameter server with two special-purpose second-generation parameter servers, named System A and System B, developed by a large Internet company.\n\nBoth System A and System B adopt the sequential consistency model, but the former uses a variant of L-BFGS while the latter runs an algorithm similar to ours. Notably, both systems consist of more than 10K lines of code. The parameter server only requires 300 lines of code for the same functionality as System B (the latter was developed by an author of this paper). The parameter server successfully moves most of the system complexity from the algorithmic implementation into reusable components.\n\nFigure 3: Convergence of sparse logistic regression on a 636TB dataset.\n\nFigure 4: Average time per worker spent on computation and waiting during optimization.\n\nFigure 5: Time to reach the same convergence criteria under various allowed delays.\n\nFigure 6: The reduction in sent data size when stacking various filters together.\n\nExperimental Results. We compare these systems by running them to reach the same convergence criteria. Figure 3 shows that System B outperforms System A due to its better algorithm. The parameter server, in turn, speeds up System B by a factor of 2 while using essentially the same algorithm. It achieves this because the consistency relaxations significantly reduce the waiting time (Figure 4). Figure 5 shows that increasing the allowed delay significantly decreases the waiting time though it slightly slows convergence. The best trade-off is a delay of 8, which results in a 1.6x speedup compared with the sequential consistency model. As can be seen in Figure 6, key caching saves 50% of the network traffic. Compression reduces the servers\u2019 traffic significantly due to the model sparsity, while it is less effective for the workers because the gradients are often non-zero. These gradients can, however, be filtered efficiently by the KKT filter. 
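The KKT filter's skipping rule, together with the soft-shrinkage operator it is derived from, can be sketched as follows (a simplified Python rendition of the logic described in the Setup; the function names and the dict-based interface are ours):

```python
def kkt_filter(weights, approx_grads, lam, delta):
    """Return the coordinates whose gradients should still be pushed.
    A coordinate k with weight 0 is skipped when a stale gradient estimate
    satisfies |g_hat_k| <= lam - delta; delta in [0, lam] controls how
    aggressive the filtering is."""
    assert 0.0 <= delta <= lam
    return [k for k, w in weights.items()
            if w != 0.0 or abs(approx_grads[k]) > lam - delta]

def soft_shrink(x, thresh):
    """Soft-shrinkage: the proximal operator of lam * |.|_1, applied per
    coordinate; it maps x to 0 whenever |x| <= thresh."""
    if x > thresh:
        return x - thresh
    if x < -thresh:
        return x + thresh
    return 0.0
```

For instance, with `lam=0.5` and `delta=0.1`, a zero-weight coordinate whose stale gradient estimate is 0.2 is skipped, since soft-shrinkage would keep its weight at 0 anyway, while any non-zero coordinate is always sent.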
In total, these filters give 40x and 12x compression rates for servers and workers, respectively.\n\n6 Conclusion\n\nThis paper examined the application of a third-generation parameter server framework to modern distributed machine learning algorithms. We show that it is possible to design algorithms well suited to this framework; in this case, an asynchronous block proximal gradient method to solve general non-convex and non-smooth problems, with provable convergence. This algorithm is a good match for the relaxations available in the parameter server framework: controllable asynchrony via task dependencies and user-definable filters to reduce the data communication volume. We presented experiments for several challenging tasks on real datasets of up to 0.6PB with hundreds of billions of samples and features to demonstrate its efficiency. We believe that this third-generation parameter server is an important and useful building block for scalable machine learning. Finally, the source code is available at http://parameterserver.org.\n\nReferences\n\n[1] A. Agarwal and J. C. Duchi. Distributed delayed stochastic optimization. In IEEE CDC, 2012.\n[2] A. Ahmed, M. Aly, J. Gonzalez, S. Narayanamurthy, and A. J. Smola. Scalable inference in latent variable models. In WSDM, 2012.\n[3] A. Ahmed, N. Shervashidze, S. Narayanamurthy, V. Josifovski, and A. J. Smola. Distributed large-scale natural graph factorization. In WWW, 2013.\n[4] M. Li, D. G. Andersen, J. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. Shekita, and B. Y. Su. Scaling Distributed Machine Learning with the Parameter Server. 
In OSDI, 2014.
[5] Apache Foundation. Mahout project, 2012. http://mahout.apache.org.
[6] L. A. Barroso and U. Hölzle. The datacenter as a computer: An introduction to the design of warehouse-scale machines. Synthesis Lectures on Computer Architecture, 4(1):1–108, 2009.
[7] J. K. Bradley, A. Kyrola, D. Bickson, and C. Guestrin. Parallel coordinate descent for L1-regularized loss minimization. In ICML, 2011.
[8] J. Byers, J. Considine, and M. Mitzenmacher. Simple load balancing for distributed hash tables. In Peer-to-Peer Systems II, pages 80–87. Springer, 2003.
[9] K. Canini. Sibyl: A system for large scale supervised machine learning. Technical talk, 2012.
[10] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In NIPS, 2012.
[11] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. CACM, 2008.
[12] Domo. Data Never Sleeps 2.0, 2014. http://www.domo.com/learn.
[13] The Apache Software Foundation. Apache Hadoop, 2009. http://hadoop.apache.org/core/.
[14] S. H. Gunderson. Snappy. https://code.google.com/p/snappy/.
[15] Q. Ho, J. Cipar, H. Cui, S. Lee, J. Kim, P. Gibbons, G. Gibson, G. Ganger, and E. Xing. More effective distributed ML via a stale synchronous parallel parameter server. In NIPS, 2013.
[16] T. Joachims. Making large-scale SVM learning practical. Advances in Kernel Methods, 1999.
[17] J. Langford, A. J. Smola, and M. Zinkevich. Slow learners are fast. In NIPS, 2009.
[18] Q. V. Le, A. Karpenko, J. Ngiam, and A. Y. Ng. ICA with reconstruction cost for efficient overcomplete feature learning. In NIPS, 2011.
[19] M. Li, D. G. Andersen, and A. J. Smola. Distributed delayed proximal gradient methods. In NIPS Workshop on Optimization for Machine Learning, 2013.
[20] M. Li, L. Zhou, Z. Yang, A. Li, F. Xia, D. G. Andersen, and A. J. Smola.
Parameter server for distributed machine learning. In Big Learning NIPS Workshop, 2013.
[21] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. Distributed GraphLab: A framework for machine learning and data mining in the cloud. In PVLDB, 2012.
[22] S. Matsushima, S. V. N. Vishwanathan, and A. J. Smola. Linear support vector machines via dual cached loops. In KDD, 2012.
[23] N. Parikh and S. Boyd. Proximal algorithms. Foundations and Trends in Optimization, 2013.
[24] K. B. Petersen and M. S. Pedersen. The Matrix Cookbook, 2008. Version 20081110.
[25] A. Phanishayee, D. G. Andersen, H. Pucha, A. Povzner, and W. Belluomini. Flex-KV: Enabling high-performance and flexible KV systems. In Management of Big Data Systems, 2012.
[26] B. Recht, C. Re, S. J. Wright, and F. Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In NIPS, 2011.
[27] P. Richtárik and M. Takáč. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 2012.
[28] A. Rowstron and P. Druschel. Pastry: Scalable, decentralized object location and routing for large-scale peer-to-peer systems. In Distributed Systems Platforms, 2001.
[29] A. J. Smola and S. Narayanamurthy. An architecture for parallel topic models. In VLDB, 2010.
[30] E. Sparks, A. Talwalkar, V. Smith, J. Kottalam, X. Pan, J. Gonzalez, M. J. Franklin, M. I. Jordan, and T. Kraska. MLI: An API for distributed machine learning. 2013.
[31] S. Sra. Scalable nonconvex inexact proximal splitting. In NIPS, 2012.
[32] I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan. Chord: A scalable peer-to-peer lookup service for internet applications. SIGCOMM Computer Communication Review, 2001.
[33] C. Teflioudi, F. Makari, and R. Gemulla. Distributed matrix completion. In ICDM, 2012.
[34] C. H. Teo, S. V. N. Vishwanathan, A. J.
Smola, and Q. V. Le. Bundle methods for regularized risk minimization. JMLR, January 2010.
[35] R. van Renesse and F. B. Schneider. Chain replication for supporting high throughput and availability. In OSDI, 2004.
[36] G. X. Yuan, K. W. Chang, C. J. Hsieh, and C. J. Lin. A comparison of optimization methods and software for large-scale l1-regularized linear classification. JMLR, 2010.
[37] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica. Fast and interactive analytics over Hadoop data with Spark. USENIX ;login:, August 2012.
[38] M. Zinkevich, A. J. Smola, M. Weimer, and L. Li. Parallelized stochastic gradient descent. In NIPS, 2010.