{"title": "On Model Parallelization and Scheduling Strategies for Distributed Machine Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 2834, "page_last": 2842, "abstract": "Distributed machine learning has typically been approached from a data parallel perspective, where big data are partitioned to multiple workers and an algorithm is executed concurrently over different data subsets under various synchronization schemes to ensure speed-up and/or correctness. A sibling problem that has received relatively less attention is how to ensure efficient and correct model parallel execution of ML algorithms, where parameters of an ML program are partitioned to different workers and undergone concurrent iterative updates. We argue that model and data parallelisms impose rather different challenges for system design, algorithmic adjustment, and theoretical analysis. In this paper, we develop a system for model-parallelism, STRADS, that provides a programming abstraction for scheduling parameter updates by discovering and leveraging changing structural properties of ML programs. STRADS enables a flexible tradeoff between scheduling efficiency and fidelity to intrinsic dependencies within the models, and improves memory efficiency of distributed ML. We demonstrate the efficacy of model-parallel algorithms implemented on STRADS versus popular implementations for topic modeling, matrix factorization, and Lasso.", "full_text": "On Model Parallelization and Scheduling Strategies\n\nfor Distributed Machine Learning\n\n\u2020Seunghak Lee, \u2020Jin Kyu Kim, \u2020Xun Zheng, \u00a7Qirong Ho, \u2020Garth A. Gibson, \u2020Eric P. 
Xing\n\n\u2020School of Computer Science\nCarnegie Mellon University\n\nPittsburgh, PA 15213\n\nseunghak@, jinkyuk@, xunzheng@,\n\ngarth@, epxing@cs.cmu.edu\n\n\u00a7Institute for Infocomm Research\n\nA*STAR\n\nSingapore 138632\nhoqirong@gmail.com\n\nAbstract\n\nDistributed machine learning has typically been approached from a data parallel\nperspective, where big data are partitioned to multiple workers and an algorithm\nis executed concurrently over different data subsets under various synchroniza-\ntion schemes to ensure speed-up and/or correctness. A sibling problem that has\nreceived relatively less attention is how to ensure ef\ufb01cient and correct model par-\nallel execution of ML algorithms, where parameters of an ML program are parti-\ntioned to different workers and undergone concurrent iterative updates. We argue\nthat model and data parallelisms impose rather different challenges for system de-\nsign, algorithmic adjustment, and theoretical analysis. In this paper, we develop a\nsystem for model-parallelism, STRADS, that provides a programming abstraction\nfor scheduling parameter updates by discovering and leveraging changing struc-\ntural properties of ML programs. STRADS enables a \ufb02exible tradeoff between\nscheduling ef\ufb01ciency and \ufb01delity to intrinsic dependencies within the models, and\nimproves memory ef\ufb01ciency of distributed ML. We demonstrate the ef\ufb01cacy of\nmodel-parallel algorithms implemented on STRADS versus popular implementa-\ntions for topic modeling, matrix factorization, and Lasso.\n\n1\n\nIntroduction\n\nAdvancements in sensory technologies and digital storage media have led to a prevalence of \u201cBig\nData\u201d collections that have inspired an avalanche of recent efforts on \u201cscalable\u201d machine learning\n(ML). In particular, numerous data-parallel solutions from both algorithmic [28, 10] and system\n[7, 25] angles have been proposed to speed up inference and learning on Big Data. 
The recently\nemerged parameter server architecture [15, 18] has started to pave the way for a uni\ufb01ed programming\ninterface for data parallel algorithms, based on various parallelization models such as stale syn-\nchronous parallelism (SSP) [15], eager SSP [5], and value-bound asynchronous parallelism [23].\nHowever, in addition to Big Data, modern large-scale ML problems have started to encounter\nthe so-called Big Model challenge [8, 1, 17], in which models with millions if not billions of pa-\nrameters and/or variables (such as in deep networks [6] or large-scale topic models [20]) must be\nestimated from big (or even modestly-sized) datasets. Such Big Model problems seem to have re-\nceived less systematic investigation. In this paper, we propose a model-parallel framework for such\nan investigation.\nAs is well known, a data-parallel algorithm computes, in each worker, a partial update of all model\nparameters (or latent model states in some cases), based on only the subset of data on that worker\nand a local copy of the model parameters stored on that worker, and then aggregates these partial\nupdates to obtain a global estimate of the model parameters [15]. In contrast, a model-parallel\nalgorithm aims to update a subset of parameters in parallel on each worker \u2014 using either\nall data, or different subsets of the data [4] \u2014 in a way that preserves as much correctness as possi-\nble, by ensuring that the updates from each subset are highly compatible. 
Obviously, such a scheme\ndirectly alleviates memory bottlenecks caused by massive parameter sizes in big models; but even\nfor small or mid-sized models, an effective model parallel scheme is still highly valuable because it\ncan speed up an algorithm by updating multiple parameters concurrently, using multiple machines.\nWhile data-parallel algorithms such as stochastic gradient descent [27] can be advantageous over\ntheir sequential counterparts \u2014 thanks to concurrent processing over data using various bounded-\nasynchronous schemes \u2014 they require every worker to have full access to all global parameters; fur-\nthermore they leverage an assumption that different data subsets are i.i.d. given the shared global\nparameters. For a model-parallel program, however, in which model parameters are distributed to\ndifferent workers, one cannot blindly leverage such an i.i.d. assumption over arbitrary parameter\nsubsets, because doing so will cause incorrect estimates due to incompatibility of sub-results from\ndifferent workers (e.g., imagine trivially parallelizing a long, simplex-constrained vector across mul-\ntiple workers \u2014 independent updates will break the simplex constraint). Therefore, existing data-\nparallel schemes and frameworks, which cannot support sophisticated constraint and/or consistency\nsatis\ufb01ability mechanisms across workers, are not easily adapted to model-parallel programs. On the\nother hand, as explored in a number of recent works, explicit analysis of dependencies across model\nparameters, coupled with the design of suitable parallel schemes accordingly, opens up new oppor-\ntunities for big models. For example, as shown in [4], model-parallel coordinate descent allows us\nto update multiple parameters in parallel, and our work in this paper furthers this approach by allow-\ning some parameters to be prioritized over others. 
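The simplex example above can be made concrete with a small numerical sketch (a toy illustration of the failure mode, written by us; it is not code from the paper):

```python
import numpy as np

# Toy version of the simplex example: a probability vector must sum
# to 1, but if each worker updates its own block of coordinates
# independently, the global constraint silently breaks.
theta = np.full(6, 1.0 / 6)                 # starts on the simplex

# Naive model-parallel step: 3 workers each rescale their own
# 2-coordinate block with a purely local multiplicative update.
factors = np.array([2.0, 2.0, 1.0, 1.0, 0.5, 0.5])
naive = theta * factors                     # no cross-worker sync

# Coordinated step: the same local updates followed by one global
# renormalization (a cross-worker aggregation, as in a pull step).
coordinated = naive / naive.sum()

print(naive.sum())        # 7/6 -- the simplex constraint is violated
print(coordinated.sum())  # back to 1 (up to float error)
```

The point is only that correctness here requires a cross-worker synchronization step that data-parallel i.i.d. reasoning does not provide.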
Furthermore, one can take advantage of model\nstructures to avoid interference and loss of correctness during concurrent parameter updates (e.g.,\nnearly independent parameters can be grouped to be updated in parallel [21]), and in this paper,\nwe explore how to discover such structures in an ef\ufb01cient and scalable manner. To date, model-\nparallel algorithms are usually developed for a speci\ufb01c application such as matrix factorization [10]\nor Lasso [4] \u2014 thus, there is a need for developing programming abstractions and interfaces that can\ntackle the common challenges of Big Model problems, while also exposing new opportunities such\nas parameter prioritization to speed up convergence without compromising inference correctness.\nEffectively and conveniently programming a model-parallel algorithm stands as another challenge,\nas it requires mastery of detailed communication management in a cluster. Existing distributed\nframeworks such as MapReduce [7], Spark [25], and GraphLab [19] have shown that a variety of\nML applications can be supported by a single, common programming interface (e.g. Map/Reduce\nor Gather/Apply/Scatter). Crucially, these frameworks allow the user to specify a coarse order to\nparameter updates, but automatically decide on the precise execution order \u2014 for example, MapRe-\nduce and Spark allow users to specify that parallel jobs should be executed in some topological order;\ne.g. mappers are guaranteed to be followed by reducers, but the system will execute the mappers\nin an arbitrary parallel or sequential order that it deems suitable. 
Similarly, GraphLab chooses the\nnext node to be updated based on its \u201cchromatic engine\u201d and the user\u2019s choice of graph consistency\nmodel, but the user only has loose control over the update order (through the input graph structure).\nWhile this coarse-grained, fully-automatic scheduling is certainly convenient, it does not offer the\n\ufb01ne-grained control needed to avoid parallelization of parameters with subtle interdependencies that\nmight not be present in the super\ufb01cial problem or graph structure (which can then lead to algorithm\ndivergence, as in Lasso [4]). Moreover, most of these frameworks do not allow users to easily prior-\nitize parameters based on new criteria, for more rapid convergence (though we note that GraphLab\nallows node prioritization through a priority queue). It is true that data-parallel algorithms can be im-\nplemented ef\ufb01ciently on these frameworks, and in principle, one can also implement model-parallel\nalgorithms on top of them. Nevertheless, we argue that without \ufb01ne-grained control over parameter\nupdates, we would miss many new opportunities for accelerating ML algorithm convergence.\nTo address these challenges, we develop STRADS (STRucture-Aware Dynamic Scheduler), a sys-\ntem that performs automatic scheduling and parameter prioritization for dynamic Big Model paral-\nlelism, and is designed to enable investigation of new ML-system opportunities for ef\ufb01cient man-\nagement of memory and accelerated convergence of ML algorithms, while making a best-effort to\npreserve existing convergence guarantees for model-parallel algorithms (e.g. convergence of Lasso\nunder parallel coordinate descent). STRADS provides a simple abstraction for users to program ML\nalgorithms, consisting of three \u201cconceptual\u201d actions: schedule, push and pull. 
Schedule speci\ufb01es the next subset of model parameters to be updated in parallel, push speci\ufb01es\nhow individual workers compute partial results on those parameters, and pull speci\ufb01es how those\npartial results are aggregated to perform the full parameter update. A high-level view of STRADS is\nillustrated in Figure 1. We stress that these actions only specify the abstraction for managed\nmodel-parallel ML programs; they do not dictate the underlying implementation. A key-value store\nallows STRADS to handle a large number of parameters in distributed fashion, accessible from all\nmaster and worker machines.\n\nFigure 1: High-level architecture of our STRADS system interface for dynamic model parallelism.\n\n
// Generic STRADS application
schedule() {
  // Select U params x[j] to be sent
  // to the workers for updating
  ...
  return (x[j_1], ..., x[j_U])
}
\n\nAs a showcase for STRADS, we implement and provide schedule/push/pull pseudocode for three\npopular ML applications: topic modeling (LDA), matrix factorization (MF), and Lasso. It is our\nhope that: (1) the STRADS interface enables Big Model problems to be solved in distributed\nfashion with modest programming effort, and (2) the STRADS mechanism accelerates the\nconvergence of Big ML algorithms through good scheduling (particularly through user-de\ufb01ned\nscheduling criteria). In our experiments, we present some evidence of STRADS\u2019s success: topic\nmodeling with 3.9M docs, 10K topics, and 21.8M vocabulary (200B parameters), MF with rank-2K\non a 480K-by-10K matrix (1B parameters), and Lasso with 100M features (100M parameters).\n2 Scheduling for Big Model Parallelism with STRADS\n\u201cModel parallelism\u201d refers to parallelization of an ML algorithm over the space of shared\nmodel parameters, rather than the space of (usually i.i.d.) data samples. 
At a high level,\nmodel parameters are the changing intermedi-\nate quantities that an ML algorithm iteratively\nupdates, until convergence is reached. A key\nadvantage of the model-parallel approach is\nthat it explicitly partitions the model param-\neters into subsets, allowing ML problems with\nmassive model spaces to be tackled on ma-\nchines with limited memory (see supplement\nfor details of STRADS memory usage).\nTo enable users to systematically and pro-\ngrammatically exploit model parallelism,\nSTRADS de\ufb01nes a programming interface,\nwhere the user writes three functions for an ML problem: schedule, push and pull\n(Figures 1, 2). STRADS repeatedly schedules\nand executes these functions in that order, thus creating an iterative model-parallel algorithm.\nBelow, we describe the three functions.\nSchedule: This function selects U model parameters to be dispatched for updates (Figure 1).\nWithin the schedule function, the programmer may access all data D and all model parameters\nx, in order to decide which U parameters to dispatch. A simple schedule is to select model param-\neters according to a \ufb01xed sequence, or uniformly at random. As we shall later see, schedule\nalso allows model parameters to be selected in a way that: (1) focuses on the fastest-converging pa-\nrameters, while avoiding already-converged parameters; (2) avoids parallel dispatch of parameters\nwith inter-dependencies, which can lead to divergence or parallelization errors.\nPush & Pull: These functions describe the \ufb02ow of model parameters x from the scheduler to\nthe workers performing the update equations, as in Figure 1. Push dispatches a set of parameters\n{xj1, . . . , xjU} to each worker p, which then computes a partial update z for {xj1, . . . , xjU} (or a\nsubset of it). 
When writing push, the user can take advantage of data partitioning: e.g., when only a fraction\n$1/P$ of the data samples are stored at each worker, the p-th worker should compute partial results\n$z_j^p = \sum_{D_i} f_{x_j}(D_i)$ by iterating over its $1/P$ data points $D_i$. Pull is used to\ncollect the partial results $\{z_j^p\}$ from all workers, and commit them to the parameters\n$\{x_{j_1}, \ldots, x_{j_U}\}$. Our STRADS LDA, MF, and Lasso applications partition the data\nsamples uniformly over machines.\n\n
push(worker = p, pars = (x[j_1],...,x[j_U])) {
  // Compute partial update z for U params x[j]
  // at worker p
  ...
  return z
}

pull(workers = [p], pars = (x[j_1],...,x[j_U]),
     updates = [z]) {
  // Use partial updates z from workers p to
  // update U params x[j]. sync() is automatic.
  ...
}
\n\nFigure 2: STRADS interface: Basic functional signatures of schedule, push, pull, using pseudocode.\n\n
// STRADS LDA
schedule() {
  dispatch = []                          // Empty list
  for a=1..U                             // Rotation scheduling
    idx = ((a+C-1) mod U) + 1
    dispatch.append( V[q_idx] )
  return dispatch
}

push(worker = p, pars = [V_a, ..., V_U]) {
  t = []                                 // Empty list
  for (i,j) in W[q_p]                    // Fast Gibbs sampling
    if w[i,j] in V_p
      t.append( (i,j,f_1(i,j,D,B)) )
  return t
}

pull(workers = [p], pars = [V_a, ..., V_U],
     updates = [t]) {
  for all (i,j)                          // Update sufficient stats
    (D,B) = f_2([t])
}
\nFigure 3: STRADS LDA pseudocode. De\ufb01nitions for f1, f2, qp are in the text. C is a global model parameter.\n\n3 Leveraging Model-Parallelism in ML Applications through STRADS\nIn this section, we explore how users can apply model-parallelism to their ML applications, using\nSTRADS. 
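As a minimal illustration of the schedule/push/pull abstraction above, the following single-process Python sketch runs one round over toy data. The names and signatures here are ours for illustration, not the STRADS API:

```python
# Minimal single-process sketch of a schedule/push/pull "round":
# schedule picks parameters, each worker's push computes a partial
# result from its data shard, and pull aggregates and commits.

def run_rounds(schedule, push, pull, params, data_parts, num_rounds):
    """schedule(params) -> indices to update; push(p, j, params, shard)
    -> partial update from worker p; pull(j, zs) -> committed value."""
    for _ in range(num_rounds):
        for j in schedule(params):                    # dispatch U params
            zs = [push(p, j, params, data_parts[p])   # workers compute
                  for p in range(len(data_parts))]    # partial results
            params[j] = pull(j, zs)                   # aggregate/commit
    return params

# Toy use: distributed mean of each feature; each of 2 workers holds
# one data shard, push returns a partial sum, pull aggregates.
data_parts = [[(1.0, 2.0)], [(3.0, 6.0)]]

def schedule(params):            # static schedule: all params
    return [0, 1]

def push(p, j, params, shard):   # partial sum over worker p's shard
    return sum(row[j] for row in shard)

def pull(j, zs):                 # aggregate partial sums -> mean
    return sum(zs) / 2           # 2 = number of workers

print(run_rounds(schedule, push, pull, [0.0, 0.0], data_parts, 1))
# -> [2.0, 4.0]
```

A real system would of course dispatch the `push` calls to remote machines and read/write `params` through the key-value store rather than a local list.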
As case studies, we design and experiment on three ML applications \u2014 LDA, MF, and\nLasso \u2014 in order to show that model-parallelism in STRADS can be simple to implement, yet also\npowerful enough to expose new and interesting opportunities for speeding up distributed ML.\n3.1 Latent Dirichlet Allocation (LDA)\nWe introduce STRADS programming through\ntopic modeling via LDA [3]. Big LDA mod-\nels provide a strong use case for model-\nparallelism: when thousands of topics and mil-\nlions of words are used, the LDA model con-\ntains billions of global parameters, and data-\nparallel implementations face the challenge of\nproviding access to all these parameters; in con-\ntrast, model-parallelism explicitly divides up\nthe parameters, so that workers only need to ac-\ncess a fraction of parameters at a given time.\nFormally, LDA takes a corpus of N docu-\nments as input \u2014 represented as word \u201ctokens\u201d\nwij \u2208 W , where i is the document index and\nj is the word position index \u2014 and outputs K\ntopics as well as N K-dimensional topic vec-\ntors (soft assignments of topics to each docu-\nment). LDA is commonly reformulated as a\n\u201ccollapsed\u201d model [14], in which some of the\nlatent variables are integrated out for faster inference. Inference is performed using Gibbs sampling,\nwhere each word-topic indicator (denoted zij \u2208 Z) is sampled in turn according to its distribution\nconditioned on all other parameters. To perform this computation without having to iterate over all\nW , Z, suf\ufb01cient statistics are kept in the form of a \u201cdoc-topic\u201d table D, and a \u201cword-topic\u201d table\nB. 
A full description of the LDA model is in the supplement.\nSTRADS implementation: In order to perform model-parallelism, we \ufb01rst identify the model\nparameters, and create a schedule strategy over them. In LDA, the assignments zij are\nthe model parameters, while D, B are summary statistics over zij that are used to speed up the\nsampler. Our schedule strategy equally divides the V words into U subsets V1, . . . , VU (where U\nis the number of workers). Each worker will only sample words from one subset Va at a time (via\npush), and update the suf\ufb01cient statistics D, B via pull. Subsequent invocations of schedule will\n\u201crotate\u201d subsets amongst workers, so that every worker touches all U subsets every U invocations.\nFor data partitioning, we divide the document tokens wij \u2208 W evenly across workers, and denote\nworker p\u2019s set of tokens by Wqp, where qp is the index set for the p-th worker. Further details and\nanalysis of the pseudocode, particularly how push-pull constitutes a model-parallel execution of\nLDA, are in the supplement.\nModel parallelism results in low error: Parallel Gibbs sampling is not generally guaranteed\nto converge [12], unless the parameters being sampled for concurrent updates are conditionally\nindependent of each other. STRADS model-parallel LDA assigns workers to disjoint words V\nand documents wij; thus, each worker\u2019s parameters zij are almost conditionally independent of\nother workers, resulting in very low sampling error1. As evidence, we de\ufb01ne an error score \u2206t\nthat measures the divergence between the true word-topic distribution/table B, versus the local\ncopy seen at each worker (a full mathematical explanation is in the supplement). \u2206t ranges from\n[0, 2] (where 0 means no error). 
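The rotation schedule described above can be checked in a few lines. Assuming the index formula from the LDA pseudocode, idx = ((a+C-1) mod U) + 1, with C acting as a round counter (an assumption on our part; the paper only says C is a global model parameter), the subsets dispatched within a round are disjoint and each worker cycles through all U subsets:

```python
# Sketch of the "rotation" word-subset schedule for LDA: in round c,
# worker a is dispatched subset ((a + c - 1) mod U) + 1, so no two
# workers share a word subset within a round (supporting near-
# conditional independence), and each worker sees all U subsets.

U = 4  # number of workers / word subsets

def rotation(round_c):
    # subset index dispatched to each worker a = 1..U in this round
    return [((a + round_c - 1) % U) + 1 for a in range(1, U + 1)]

for c in range(1, U + 1):
    assert len(set(rotation(c))) == U   # disjoint within each round

seen_by_worker_1 = {rotation(c)[0] for c in range(1, U + 1)}
print(sorted(seen_by_worker_1))         # -> [1, 2, 3, 4]
```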
Figure 4 plots \u2206t for the \u201cWikipedia unigram\u201d dataset (see \u00a75 for experimental details) with\nK = 5000 topics and 64 machines (128 processor cores total). \u2206t is \u2264 0.002 throughout,\ncon\ufb01rming that STRADS LDA exhibits very small parallelization error.\n\nFigure 4: STRADS LDA: Parallelization error \u2206t at each iteration, on the Wikipedia unigram dataset with K = 5000 and 64 machines.\n\n1This sampling error arises because workers see different versions of B \u2014 which is unavoidable when\nparallelizing LDA inference, because the Gibbs sampler is inherently sequential.\n\n3.2 Matrix Factorization (MF)\nWe now consider matrix factorization (collaborative \ufb01ltering), which can be used to predict users\u2019\nunknown preferences, given their known preferences and the preferences of others. Formally, MF\ntakes an incomplete matrix A \u2208 RN\u00d7M as input, where N is the number of users, and M is the\nnumber of items. The idea is to discover rank-K matrices W \u2208 RN\u00d7K and H \u2208 RK\u00d7M such that\nWH \u2248 A. Thus, the product WH can be used to predict the missing entries (user preferences).\nLet \u2126 be the set of indices of observed entries in A, let \u2126i be the set of observed column indices in\nthe i-th row of A, and let \u2126j be the set of observed row indices in the j-th column of A. Then, the\nMF task is de\ufb01ned by an optimization problem:\n$\min_{W,H} \sum_{(i,j) \in \Omega} (a_{ij} - w_i h_j)^2 + \lambda(\|W\|_F^2 + \|H\|_F^2)$.\nWe solve this objective using a parallel coordinate descent algorithm [24].\nSTRADS implementation: Our MF schedule strategy is to partition the rows of A into\nU disjoint index sets qp, and the columns of A into U disjoint index sets rp. We then dispatch\nthe model parameters W, H in a round-robin fashion. To update the rows of W, each worker p uses\npush to compute partial summations on its assigned columns rp of A and H; the columns of H are\nupdated similarly with rows qp of A and W. Finally, pull aggregates the partial summations, and\nthen updates the entries in W and H. In Figure 5, we show the STRADS MF pseudocode, and\nfurther details are in the supplement.\n\n
// STRADS Matrix Factorization
schedule() {
  // Round-robin scheduling
  if counter <= U                        // Do W
    return W[q_counter]
  else                                   // Do H
    return H[r_(counter-U)]
}

push(worker = p, pars = X[s]) {
  z = []                                 // Empty list
  if counter <= U                        // X is from W
    for row in s, k=1..K
      z.append( (f_1(row,k,p),f_2(row,k,p)) )
  else                                   // X is from H
    for col in s, k=1..K
      z.append( (g_1(k,col,p),g_2(k,col,p)) )
  return z
}

pull(workers=[p], pars=X[s], updates=[z]) {
  if counter <= U                        // X is from W
    for row in s, k=1..K
      W[row,k] = f_3(row,k,[z])
  else                                   // X is from H
    for col in s, k=1..K
      H[k,col] = g_3(k,col,[z])
  counter = (counter mod 2*U) + 1
}
\nFigure 5: STRADS MF pseudocode. De\ufb01nitions for f1, g1, . . . and qp, rp are in the text. counter is a global model variable.\n\n3.3 Lasso\nSTRADS not only supports simple static schedules, but also dynamic, adaptive strategies that take\nthe model state into consideration. Speci\ufb01cally, the STRADS Lasso implementation schedules\nparameter updates by (1) prioritizing coef\ufb01cients that contribute the most to algorithm convergence,\nand (2) avoiding the simultaneous update of coef\ufb01cients whose dimensions are highly\ninter-dependent. These properties complement each other in an algorithmically ef\ufb01cient way, as we\nshall see.\nFormally, Lasso can be de\ufb01ned by an optimization problem:\n$\min_\beta \frac{1}{2}\|y - X\beta\|_2^2 + \lambda \sum_j |\beta_j|$,\nwhere \u03bb is a regularization parameter that determines the sparsity of \u03b2. 
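The Lasso objective above is typically minimized coordinate-wise. The following is a minimal sequential coordinate-descent sketch with the standard soft-thresholding operator (assuming unit-normalized feature columns; this is the textbook update, not the distributed STRADS version):

```python
import numpy as np

# Minimal dense cyclic coordinate descent for Lasso (sequential,
# textbook version). Assumes columns of X are normalized so that
# x_j^T x_j = 1, which gives the update
#   beta_j <- S(x_j^T y - sum_{k != j} x_j^T x_k beta_k, lam).

def soft_threshold(g, lam):
    # S(g, lam) := sign(g) * (|g| - lam)_+
    return np.sign(g) * max(abs(g) - lam, 0.0)

def lasso_cd(X, y, lam, iters=100):
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        for j in range(X.shape[1]):
            r = y - X @ beta + X[:, j] * beta[j]   # residual w/o x_j
            beta[j] = soft_threshold(X[:, j] @ r, lam)
    return beta

# Toy check: for an orthonormal design the solution is the
# closed form S(x_j^T y, lam) applied per coordinate.
beta = lasso_cd(np.eye(3), np.array([2.0, -0.5, 1.0]), lam=1.0)
print(beta)   # converges to [1, 0, 0] (zeros up to sign)
```

The distributed version discussed next splits exactly the inner sum over samples across workers.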
We solve Lasso using the coordinate descent (CD) update rule [9]:\n$\beta_j^{(t)} \leftarrow S(x_j^T y - \sum_{k \neq j} x_j^T x_k \beta_k^{(t-1)}, \lambda)$,\nwhere $S(g, \lambda) := \mathrm{sign}(g)(|g| - \lambda)_+$.\nSTRADS implementation: The Lasso schedule dynamically selects parameters to be updated with\nthe following prioritization scheme: rapidly changing parameters are updated more frequently than\nothers. First, we de\ufb01ne a probability distribution c = [c1, . . . , cJ ] over \u03b2; the purpose of c is\nto prioritize \u03b2j\u2019s during schedule, and thus speed up convergence. In particular, we observe that\nchoosing \u03b2j with probability $c_j = f_1(j) :\propto (\delta\beta_j^{(t-1)})^2 + \eta$ substantially\nspeeds up the Lasso convergence rate, where \u03b7 is a small positive constant, and\n$\delta\beta_j^{(t-1)} = \beta_j^{(t-2)} - \beta_j^{(t-1)}$.\nTo prevent non-convergence due to dimension inter-dependencies [4], we only schedule \u03b2j and \u03b2k\nfor concurrent updates if $x_j^T x_k \approx 0$. This is performed as follows: \ufb01rst, select\n$L' (> L)$ indices of coef\ufb01cients from the probability distribution c to form a set C (|C| = L').\nNext, choose a subset B \u2282 C of size L such that $x_j^T x_k < \rho$ for all j, k \u2208 B, where\n\u03c1 \u2208 (0, 1]; we represent this selection procedure by the function f2(C). Note that this procedure\nis inexpensive: by selecting $L'$ candidate \u03b2j\u2019s \ufb01rst, only $L'^2$ dependencies need to be\nchecked, as opposed to $J^2$, where J is the total number of features. 
Here $L'$ and \u03c1 are user-de\ufb01ned parameters.\nWe execute push and pull to update the coef\ufb01cients indexed by B using U workers in parallel.\nThe rows of the data matrix X are partitioned into U submatrices, and the p-th worker stores the\nsubmatrix $X_{q_p} \in \mathbb{R}^{|q_p| \times J}$; with X partitioned in this manner, we need to\nmodify the CD update rule accordingly. Using U workers, push computes U partial summations for\neach selected \u03b2j, j \u2208 B, denoted by $\{z_{j,1}^{(t)}, \ldots, z_{j,U}^{(t)}\}$, where\n$z_{j,p}$ represents the partial summation for \u03b2j in the p-th worker at the t-th iteration:\n$z_{j,p}^{(t)} \leftarrow f_3(p, j) := \sum_{i \in q_p} \{ (x_j^i)^T y - \sum_{k \neq j} (x_j^i)^T (x_k^i) \beta_k^{(t-1)} \}$.\nAfter all pushes have been completed, pull updates \u03b2j via\n$\beta_j^{(t)} = f_4(j, [z_{j,p}^{(t)}]) := S(\sum_{p=1}^{U} z_{j,p}^{(t)}, \lambda)$.\n\n
// STRADS Lasso
schedule() {
  // Priority-based scheduling
  for all j                              // Get new priorities
    c_j = f_1(j)
  for a=1..L\u2019                            // Prioritize betas
    random draw s_a using [c_1, ..., c_J]
  // Get \u2019safe\u2019 betas
  (j_1, ..., j_L) = f_2(s_1, ..., s_L\u2019)
  return (b[j_1], ..., b[j_L])
}

push(worker = p, pars = (b[j_1],...,b[j_L])) {
  z = []                                 // Empty list
  for a=1..L                             // Compute partial sums
    z.append( f_3(p,j_a) )
  return z
}

pull(workers = [p], pars = (b[j_1],...,b[j_L]),
     updates = [z]) {
  for a=1..L                             // Aggregate partial sums
    b[j_a] = f_4(j_a,[z])
}
\nFigure 6: STRADS Lasso pseudocode. De\ufb01nitions for f1, f2, . . . are given in the text.\n\nAnalysis of STRADS Lasso scheduling: We wish to highlight several notable aspects of the\nSTRADS Lasso schedule mentioned above. In brief, the sampling distribution f1(j) and the model\ndependency control scheme with threshold \u03c1 allow STRADS to speed up the convergence rate of\nLasso. To analyze this claim, let us rewrite the Lasso problem by duplicating original features\nwith opposite sign: $F(\beta) := \min_\beta \frac{1}{2}\|y - X\beta\|_2^2 + \lambda \sum_{j=1}^{2J} \beta_j$.\nHere, with an abuse of notation, X contains 2J features and \u03b2j \u2265 0, for all j = 1, . . . , 2J.\nThen, we have the following analysis of our scheduling scheme.\nProposition 1 Suppose B is the set of indices of coef\ufb01cients updated in parallel at the t-th\niteration, and \u03c1 is a suf\ufb01ciently small constant such that\n$\rho\,\delta\beta_j^{(t)}\delta\beta_k^{(t)} \approx 0$ for all $j \neq k \in B$. Then, the sampling\ndistribution $p(j) \propto (\delta\beta_j^{(t)})^2$ approximately maximizes a lower bound on\n$E_B[F(\beta^{(t)}) - F(\beta^{(t)} + \Delta\beta^{(t)})]$.\nProposition 1 (see supplement for proof) shows that our scheduling attempts to speed up the con-\nvergence of Lasso by decreasing the objective as much as possible at every iteration. However, in\npractice, we approximate $p(j) \propto (\delta\beta_j^{(t)})^2$ with\n$f_1(j) \propto (\delta\beta_j^{(t-1)})^2 + \eta$, because $\delta\beta_j^{(t)}$ is unavailable at the\nt-th iteration before computing $\beta_j^{(t)}$; we add \u03b7 to give all \u03b2j\u2019s non-zero probability\nof being updated, to account for the approximation.\n4 STRADS System Architecture and Implementation\nOur STRADS system implementation uses multiple master/scheduler machines, multiple worker\nmachines, and a single \u201cmaster\u201d coordinator2 machine that directs the activities of the schedulers\nand workers. The basic unit of STRADS execution is a \u201cround\u201d, which consists of schedule-push-\npull in that order. 
In more detail (Figure 1), (1) the masters execute schedule to pick U sets of\nmodel parameters x that can be safely updated in parallel (if the masters need to read parameters,\nthey get them from the key-value stores); (2) jobs for push, which update the U sets of parameters,\nare dispatched via the coordinator to the workers (again, workers read parameters from the key-value\nstores), which then execute push to compute partial updates z for each parameter; (3) the key-value\nstores execute pull to aggregate the partial updates z, and keep newly updated parameters.\nTo ef\ufb01ciently use multiple cores/machines in the scheduler pool, STRADS uses pipelined schedule\ncomputations, i.e., masters compute schedule and queue jobs in advance for future rounds. In other\nwords, parameters to be updated are determined by the masters without waiting for workers\u2019 param-\neter updates; the jobs for parameter updates are dispatched to workers in turn by the coordinator. By\npipelining schedule, the master machines do not become a bottleneck even with a large number of\nworkers. Speci\ufb01cally, the pipelined strategy does not incur any parallelization errors if the parameters\nx for push can be ordered in a manner that does not depend on their actual values (e.g. the MF and\nLDA applications). For programs whose schedule outcome depends on the current values of x (e.g.\nLasso), the strategy is equivalent to executing schedule based on stale values of x, similar to how\nparameter servers allow computations to be executed on stale model parameters [15, 1].\n\n2 The coordinator relays jobs from the masters to the workers, which does not bottleneck at the 10- to\n100-machine scale explored in this paper. Distributing the coordinator is left for future work.\n\n
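The pipelining idea can be sketched with a small queue: schedules for future rounds are precomputed from possibly stale parameter values, so dispatch never has to wait on schedule. This is a toy single-process illustration of ours; the real system distributes this work across scheduler machines:

```python
from collections import deque

# Sketch of pipelined scheduling: the scheduler keeps DEPTH rounds'
# worth of pre-computed schedules queued, so a schedule consumed in
# round t may have been computed from parameter values that are up
# to DEPTH rounds stale.

DEPTH = 2                      # rounds scheduled in advance

def make_schedule(params):
    # toy priority schedule: indices of the two largest-magnitude
    # params, based on whatever (possibly stale) values we see now
    return sorted(range(len(params)), key=lambda j: -abs(params[j]))[:2]

params = [0.1, 5.0, -3.0, 0.2]
queue = deque(make_schedule(params) for _ in range(DEPTH))  # pre-fill

executed = []
for _ in range(3):             # three rounds
    batch = queue.popleft()    # dispatch a pre-computed schedule
    for j in batch:            # workers "update" these params
        params[j] *= 0.5
    queue.append(make_schedule(params))  # refill pipeline (stale ok)
    executed.append(batch)

print(executed)                # -> [[1, 2], [1, 2], [1, 2]]
```

Note that the batch executed in each round was chosen before the previous round's updates landed, which is exactly the stale-schedule behavior described in the text.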
In the Lasso\nexperiments in \u00a75, such a schedule strategy with stale values greatly improved the convergence rate.\nSTRADS does not have to perform push-pull communication between the masters and the workers\n(which would bottleneck the masters). Instead, the model parameters x are made globally accessible\nthrough a distributed, partitioned key-value store (represented by standard arrays in our pseudocode).\nA variety of key-value store synchronization schemes exist, such as Bulk Synchronous Parallel\n(BSP), Stale Synchronous Parallel (SSP) [15], and Asynchronous Parallel (AP). In this paper, we\nuse BSP synchronization; we leave the use of alternative schemes like SSP or AP as future work.\nWe implemented STRADS using C++ and the Boost libraries; OpenMPI 1.4.5 was used for\nasynchronous communication between the master schedulers, workers, and key-value stores.\n5 Experiments\nWe now demonstrate that our STRADS implementations of LDA, MF and Lasso can (1) reach larger\nmodel sizes than other baselines; (2) converge at least as fast, if not faster, than other baselines; and (3)\nwith additional machines, use less memory per machine (ef\ufb01cient partitioning). For\nbaselines, we used (a) a STRADS implementation of distributed Lasso with only a naive round-\nrobin scheduler (Lasso-RR), (b) GraphLab\u2019s Alternating Least Squares (ALS) implementation of\nMF [19], and (c) YahooLDA for topic modeling [1]. Note that Lasso-RR imitates, on STRADS, the\nrandom scheduling scheme proposed by the Shotgun algorithm. We chose GraphLab and YahooLDA,\nas they are popular choices for distributed MF and LDA.\nWe conducted experiments on two clusters [11] (with 2-core and 16-core machines respectively),\nto show the effectiveness of STRADS model-parallelism across different hardware. We used the\n2-core cluster for LDA, and the 16-core cluster for Lasso and MF. 
The 2-core cluster contains 128 machines, each with two 2.6GHz AMD cores and 8GB RAM, connected via a 1Gbps network interface. The 16-core cluster contains 9 machines, each with 16 2.1GHz AMD cores and 64GB RAM, connected via a 40Gbps network interface. Both clusters exhibit a 4GB-per-core memory ratio, a setting commonly observed in the machine learning literature [22, 13] that closely matches the more cost-effective instances on Amazon EC2. All our experiments use a fixed data size, and we vary the number of machines and/or the model size (unless otherwise stated); furthermore, we set λ = 0.001 for Lasso and λ = 0.05 for MF.
5.1 Datasets
Latent Dirichlet Allocation We used 3.9M English Wikipedia abstracts, and conducted experiments using both unigram (1-word) tokens (V = 2.5M unique unigrams, 179M tokens) and bigram (2-word) tokens [16] (V = 21.8M unique bigrams, 79M tokens). We note that our bigram vocabulary (21.8M) is an order of magnitude larger than recently published results [1], demonstrating that STRADS scales to very large models. We set the number of topics to K = 5000 and 10000 (also larger than in the recent literature [1]), which yields extremely large word-topic tables: 25B elements (unigram) and 218B elements (bigram).
Matrix Factorization We used the Netflix dataset [2] for our MF experiments: 100M anonymized ratings from 480,189 users on 17,770 movies. We varied the rank of W, H from K = 20 to 2000, which exceeds the upper limit of previous MF papers [26, 10, 24].
Lasso We used synthetic data with 50K samples and J = 10M to 100M features, where every feature xj has only 25 non-zero samples. To simulate correlations between adjacent features (which exist in real-world data sets), we first generate x1 ∼ Unif(0, 1). Then, with probability 0.9, we make xj ∼ Unif(0, 1), and with probability 0.1, xj ∼ 0.9xj−1 + 0.1 Unif(0, 1), for j = 2, . . .
, J.
5.2 Speed and Model Sizes
Figure 7 shows the time taken by each algorithm to reach a fixed objective value (over a range of model sizes), as well as the largest model size that each baseline was capable of running.

Figure 7: Convergence time versus model size for STRADS and baselines for (left) LDA, (center) MF, and (right) Lasso. We omit the bars if a method did not reach 98% of STRADS's convergence point (YahooLDA and GraphLab-MF failed at 2.5M-Vocab/10K-topics and rank K ≥ 80, respectively). STRADS not only reaches larger model sizes than YahooLDA, GraphLab, and Lasso-RR, but also converges significantly faster.

Figure 8: Convergence trajectories of different methods for (left) LDA, (center) MF, and (right) Lasso.

For LDA and MF, STRADS handles much larger model sizes than either YahooLDA (which could handle 5K topics on the unigram dataset) or GraphLab (which could handle rank < 80), while converging more quickly; we attribute STRADS's faster convergence to lower parallelization error (LDA only) and reduced synchronization requirements through careful model partitioning (LDA, MF). We observed that each YahooLDA worker stores a portion of the word-topic table, specifically those elements referenced by the words in the worker's data partition. Because our experiments feature very large vocabulary sizes, even a small fraction of the word-topic table can be too large for a single machine's memory, which caused YahooLDA to fail on the larger experiments. For Lasso, STRADS converges more quickly than Lasso-RR because of our dynamic schedule strategy, which is graphically captured in the convergence trajectory in Figure 8: observe that STRADS's dynamic schedule causes the Lasso objective to plunge quickly to the optimum at around 250 seconds.
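To make the contrast with round-robin scheduling concrete, the following is a minimal single-machine sketch of dynamic, priority-based coordinate selection for Lasso. It is not the paper's exact scheduler: the priority rule (sample coordinates proportional to the magnitude of their last update) is a standard stand-in, it omits STRADS's dependency checking between correlated coordinates, and all function names are illustrative.

```python
import numpy as np

# Sketch of dynamic (priority-based) coordinate scheduling for Lasso.
# Coordinates whose last update |delta x_j| was large get scheduled more
# often; the soft-threshold update is standard coordinate descent.

def soft_threshold(a, lam):
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def lasso_dynamic(X, y, lam, U=5, rounds=200, seed=0):
    rng = np.random.default_rng(seed)
    n, J = X.shape
    x = np.zeros(J)
    resid = y.copy()               # residual y - X @ x, kept incrementally
    col_sq = (X ** 2).sum(axis=0)  # precomputed column norms ||X_j||^2
    prio = np.full(J, 1.0)         # uniform priorities before any update
    for _ in range(rounds):
        # schedule: sample U coordinates proportional to their priority
        p = prio / prio.sum()
        picked = rng.choice(J, size=min(U, J), replace=False, p=p)
        for j in picked:           # push: sequential here; parallel in STRADS
            rho = X[:, j] @ resid + col_sq[j] * x[j]
            new_xj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_xj - x[j]
            resid -= X[:, j] * delta   # pull: fold the update into the residual
            x[j] = new_xj
            prio[j] = abs(delta) + 1e-6  # bigger change -> higher priority
    return x

def lasso_objective(X, y, x, lam):
    r = y - X @ x
    return 0.5 * (r @ r) + lam * np.abs(x).sum()
```

The design choice mirrors the intuition behind the dynamic schedule: coordinates that are still moving receive more updates, so the objective drops quickly early on, while the small floor term (1e-6) keeps every coordinate eligible so converged coordinates are still revisited occasionally.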
We also see that STRADS LDA and MF achieved better objective values than the other baselines, confirming that STRADS model-parallelism is fast without compromising convergence quality.
5.3 Scalability
In Figure 9, we show the convergence trajectories and time-to-convergence for STRADS LDA using different numbers of machines at a fixed model size (unigram with 2.5M vocab and 5K topics). The plots confirm that STRADS LDA exhibits faster convergence with more machines, and that the time to convergence almost halves with every doubling of machines (near-linear scaling).

Figure 9: STRADS LDA scalability with increasing machines at a fixed model size. (Left) Convergence trajectories; (Right) Time taken to reach a log-likelihood of −2.6 × 10^9.

6 Conclusions
In this paper, we presented a programmable framework for dynamic Big Model-parallelism that provides the following benefits: (1) scalability and efficient memory utilization, allowing larger models to be run with additional machines; (2) the ability to invoke dynamic schedules that reduce model parameter dependencies across workers, leading to lower parallelization error and thus faster, correct convergence. An important direction for future research is to reduce the communication costs of using STRADS. We also want to explore the use of STRADS for other popular ML applications, such as support vector machines and logistic regression.
Acknowledgments
This work was done under support from NSF IIS1447676, CNS-1042543 (PRObE [11]), DARPA FA87501220324, and support from Intel via the Intel Science and Technology Center for Cloud Computing (ISTC-CC).
References
[1] A. Ahmed, M. Aly, J. Gonzalez, S. Narayanamurthy, and A. J. Smola. Scalable inference in latent variable models.
In WSDM, 2012.
[2] J. Bennett and S. Lanning. The Netflix prize. In Proceedings of KDD Cup and Workshop, 2007.
[3] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. The Journal of Machine Learning Research, 3:993–1022, 2003.
[4] J. K. Bradley, A. Kyrola, D. Bickson, and C. Guestrin. Parallel coordinate descent for l1-regularized loss minimization. In ICML, 2011.
[5] W. Dai, A. Kumar, J. Wei, Q. Ho, G. Gibson, and E. P. Xing. High-performance distributed ML at scale through parameter server consistency models. In AAAI, 2014.
[6] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, M. Z. Mao, M. Ranzato, A. W. Senior, P. A. Tucker, et al. Large scale distributed deep networks. In NIPS, 2012.
[7] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.
[8] J. Fan, R. Samworth, and Y. Wu. Ultrahigh dimensional feature selection: beyond the linear model. The Journal of Machine Learning Research, 10:2013–2038, 2009.
[9] J. Friedman, T.
Hastie, H. Hofling, and R. Tibshirani. Pathwise coordinate optimization. Annals of Applied Statistics, 1(2):302–332, 2007.
[10] R. Gemulla, E. Nijkamp, P. J. Haas, and Y. Sismanis. Large-scale matrix factorization with distributed stochastic gradient descent. In SIGKDD, 2011.
[11] G. Gibson, G. Grider, A. Jacobson, and W. Lloyd. PRObE: A thousand-node experimental cluster for computer systems research. USENIX ;login:, 38, 2013.
[12] J. Gonzalez, Y. Low, A. Gretton, and C. Guestrin. Parallel Gibbs sampling: From colored fields to thin junction trees. In AISTATS, 2011.
[13] J. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin. PowerGraph: Distributed graph-parallel computation on natural graphs. In OSDI, 2012.
[14] T. L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences of the United States of America, 101(Suppl 1):5228–5235, 2004.
[15] Q. Ho, J. Cipar, H. Cui, J. Kim, S. Lee, P. B. Gibbons, G. Gibson, G. R. Ganger, and E. P. Xing. More effective distributed ML via a stale synchronous parallel parameter server. In NIPS, 2013.
[16] J. H. Lau, T. Baldwin, and D. Newman. On collocations and topic models. ACM Transactions on Speech and Language Processing (TSLP), 10(3):10, 2013.
[17] Q. V. Le, M. A. Ranzato, R. Monga, M. Devin, K. Chen, G. S. Corrado, J. Dean, and A. Y. Ng. Building high-level features using large scale unsupervised learning. In ICML, 2012.
[18] M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B. Su. Scaling distributed machine learning with the parameter server. In OSDI, 2014.
[19] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. Distributed GraphLab: A framework for machine learning and data mining in the cloud. In VLDB, 2012.
[20] D. Newman, A. Asuncion, P. Smyth, and M. Welling.
Distributed algorithms for topic models. The Journal of Machine Learning Research, 10:1801–1828, 2009.
[21] C. Scherrer, A. Tewari, M. Halappanavar, and D. Haglin. Feature clustering for accelerating parallel coordinate descent. In NIPS, 2012.
[22] Y. Wang, X. Zhao, Z. Sun, H. Yan, L. Wang, Z. Jin, L. Wang, Y. Gao, J. Zeng, Q. Yang, et al. Towards topic modeling for big data. arXiv:1405.4402 [cs.IR], 2014.
[23] J. Wei, W. Dai, A. Kumar, X. Zheng, Q. Ho, and E. P. Xing. Consistent bounded-asynchronous parameter servers for distributed ML. arXiv:1312.7869 [stat.ML], 2013.
[24] H. Yu, C. Hsieh, S. Si, and I. Dhillon. Scalable coordinate descent approaches to parallel matrix factorization for recommender systems. In ICDM, 2012.
[25] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: Cluster computing with working sets. In HotCloud, 2010.
[26] Y. Zhou, D. Wilkinson, R. Schreiber, and R. Pan. Large-scale parallel collaborative filtering for the Netflix prize. In AAIM, 2008.
[27] M. Zinkevich, J. Langford, and A. J. Smola. Slow learners are fast. In NIPS, 2009.
[28] M. Zinkevich, M. Weimer, L. Li, and A. J. Smola. Parallelized stochastic gradient descent. In NIPS, 2010.