{"title": "Asynchronous Distributed Learning of Topic Models", "book": "Advances in Neural Information Processing Systems", "page_first": 81, "page_last": 88, "abstract": "Distributed learning is a problem of fundamental interest in machine learning and cognitive science. In this paper, we present asynchronous distributed learning algorithms for two well-known unsupervised learning frameworks: Latent Dirichlet Allocation (LDA) and Hierarchical Dirichlet Processes (HDP). In the proposed approach, the data are distributed across P processors, and processors independently perform Gibbs sampling on their local data and communicate their information in a local asynchronous manner with other processors. We demonstrate that our asynchronous algorithms are able to learn global topic models that are statistically as accurate as those learned by the standard LDA and HDP samplers, but with significant improvements in computation time and memory. We show speedup results on a 730-million-word text corpus using 32 processors, and we provide perplexity results for up to 1500 virtual processors. As a stepping stone in the development of asynchronous HDP, a parallel HDP sampler is also introduced.", "full_text": "Asynchronous Distributed Learning of Topic Models

Arthur Asuncion, Padhraic Smyth, Max Welling
Department of Computer Science
University of California, Irvine
{asuncion,smyth,welling}@ics.uci.edu

Abstract

Distributed learning is a problem of fundamental interest in machine learning and cognitive science. In this paper, we present asynchronous distributed learning algorithms for two well-known unsupervised learning frameworks: Latent Dirichlet Allocation (LDA) and Hierarchical Dirichlet Processes (HDP). In the proposed approach, the data are distributed across P processors, and processors independently perform Gibbs sampling on their local data and communicate their information in a local asynchronous manner with other processors.
We demonstrate that our asynchronous algorithms are able to learn global topic models that are statistically as accurate as those learned by the standard LDA and HDP samplers, but with significant improvements in computation time and memory. We show speedup results on a 730-million-word text corpus using 32 processors, and we provide perplexity results for up to 1500 virtual processors. As a stepping stone in the development of asynchronous HDP, a parallel HDP sampler is also introduced.

1 Introduction

Learning algorithms that can perform in a distributed asynchronous manner are of interest for several different reasons. The increasing availability of multi-processor and grid computing technology provides an immediate and practical motivation to develop learning algorithms that are able to take advantage of such computational resources. Similarly, the increasing proliferation of networks of low-cost devices motivates the investigation of distributed learning in the context of sensor networks. On a deeper level, there are fundamental questions about distributed learning from the viewpoints of artificial intelligence and cognitive science.

In this paper, we focus on the specific problem of developing asynchronous distributed learning algorithms for a class of unsupervised learning techniques, specifically LDA [1] and HDP [2], with learning via Gibbs sampling. The frameworks of LDA and HDP have recently become popular due to their effectiveness at extracting low-dimensional representations from sparse high-dimensional data, with multiple applications in areas such as text analysis and computer vision. A promising approach to scaling these algorithms to large data sets is to distribute the data across multiple processors and develop appropriate distributed topic-modeling algorithms [3, 4, 5].
There are two somewhat distinct motivations for distributed computation in this context: (1) to address the memory issue when the original data and count matrices used by the algorithm exceed the main memory capacity of a single machine; and (2) to use multiple processors to significantly speed up topic-learning, e.g., learning a topic model in near real-time for tens of thousands of documents returned by a search engine.

While synchronous distributed algorithms for topic models have been proposed in earlier work, here we investigate asynchronous distributed learning of topic models. Asynchronous algorithms provide several computational advantages over their synchronous counterparts: (1) no global synchronization step is required; (2) the system is extremely fault-tolerant due to its decentralized nature; (3) heterogeneous machines with different processor speeds and memory capacities can be used; (4) new processors and new data can be incorporated into the system at any time.

Our primary novel contribution is the introduction of new asynchronous distributed algorithms for LDA and HDP, based on local collapsed Gibbs sampling on each processor. We assume an asynchronous "gossip-based" framework [6] which only allows pairwise interactions between random processors. Our distributed framework can provide substantial memory and time savings over single-processor computation, since each processor only needs to store and perform Gibbs sweeps over 1/P-th of the data, where P is the number of processors.
Furthermore, the asynchronous approach can scale to large corpora and large numbers of processors, since no global synchronization steps are required. While building towards an asynchronous algorithm for HDP, we also introduce a novel synchronous distributed inference algorithm for HDP, again based on collapsed Gibbs sampling.

In the proposed framework, individual processors perform Gibbs sampling locally based on a noisy, inexact view of the global topics. As a result, our algorithms are not necessarily sampling from the proper global posterior distribution. Nonetheless, as we will show in our experiments, these algorithms are empirically very robust and converge rapidly to high-quality solutions.

We first review collapsed Gibbs sampling for LDA and HDP. Then we describe the details of our distributed algorithms. We present perplexity and speedup results for our algorithms when applied to text data sets. We conclude with a discussion of related work and future extensions of our work.

2 A brief review of topic models

Before delving into the details of our distributed algorithms, we first describe the LDA and HDP topic models. In LDA, each document j is modeled as a mixture over K topics, and each topic k is a multinomial distribution, $\phi_{wk}$, over a vocabulary of W words¹, drawn from a Dirichlet distribution with parameter $\eta$. In order to generate a new document, the document's mixture over topics, $\theta_{kj}$, is first sampled from a Dirichlet distribution with parameter $\alpha$. For each token i in that document, a topic assignment $z_{ij}$ is sampled from $\theta_{kj}$, and the specific word $x_{ij}$ is drawn from $\phi_{wz_{ij}}$.
The graphical model for LDA is shown in Figure 1, and the generative process is:

$$\theta_{k,j} \sim \mathcal{D}[\alpha] \qquad \phi_{w,k} \sim \mathcal{D}[\eta] \qquad z_{ij} \sim \theta_{k,j} \qquad x_{ij} \sim \phi_{w,z_{ij}}.$$

[Figure 1: Graphical models for LDA (left) and HDP (right).]

Given observed data, it is possible to infer the posterior distribution of the latent variables. One can perform collapsed Gibbs sampling [7] by integrating out $\theta_{kj}$ and $\phi_{wk}$ and sampling the topic assignments in the following manner:

$$P(z_{ij} = k \mid z^{\neg ij}, w) \propto \frac{N^{\neg ij}_{wk} + \eta}{\sum_w N^{\neg ij}_{wk} + W\eta} \left( N^{\neg ij}_{jk} + \alpha \right). \qquad (1)$$

$N_{wk}$ denotes the number of word tokens of type w assigned to topic k, while $N_{jk}$ denotes the number of tokens in document j assigned to topic k. $N^{\neg ij}$ denotes the count with token ij removed.

The HDP mixture model is composed of a hierarchy of Dirichlet processes. HDP is similar to LDA and can be viewed as the model that results from taking the infinite limit of the following finite mixture model. Let L be the number of mixture components, and $\beta_k$ be top-level Dirichlet variables drawn from a Dirichlet distribution with parameter $\gamma/L$. The mixture for each document, $\theta_{kj}$, is generated from a Dirichlet with parameter $\alpha\beta_k$. The multinomial topic distributions, $\phi_{wk}$, are drawn from a base Dirichlet distribution with parameter $\eta$. As in LDA, $z_{ij}$ is sampled from $\theta_{kj}$, and word $x_{ij}$ is sampled from $\phi_{wz_{ij}}$.
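As a concrete illustration of the collapsed Gibbs update in Equation 1, the following is a minimal sketch of one sweep over all tokens. The function and variable names (`gibbs_step`, `Nk`, etc.) are our own illustrative choices, not the authors' implementation:

```python
import numpy as np

def gibbs_step(z, docs, words, Nwk, Njk, Nk, alpha=0.1, eta=0.01, seed=0):
    """One collapsed Gibbs sweep for LDA, following Equation 1 (a sketch).

    z[i], docs[i], words[i] : topic, document, and word type of token i
    Nwk, Njk                : word-topic and document-topic count matrices
    Nk                      : column sums of Nwk (the denominator counts)
    """
    W, K = Nwk.shape
    rng = np.random.default_rng(seed)
    for i in range(len(z)):
        w, j, k_old = words[i], docs[i], z[i]
        # Remove token i from the counts (the "neg ij" superscript).
        Nwk[w, k_old] -= 1; Njk[j, k_old] -= 1; Nk[k_old] -= 1
        # P(z_ij = k) proportional to (N_wk + eta)/(sum_w N_wk + W*eta) * (N_jk + alpha)
        p = (Nwk[w] + eta) / (Nk + W * eta) * (Njk[j] + alpha)
        k_new = rng.choice(K, p=p / p.sum())
        # Add token i back under its new assignment.
        Nwk[w, k_new] += 1; Njk[j, k_new] += 1; Nk[k_new] += 1
        z[i] = k_new
    return z
```

Note that the sweep maintains the invariant that the count matrices always agree with the current assignments z, which is what makes the collapsed sampler correct.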
If we take the limit of this model as L goes to infinity, we obtain HDP:

$$\beta_k \sim \mathcal{D}[\gamma/L] \qquad \theta_{k,j} \sim \mathcal{D}[\alpha\beta_k] \qquad \phi_{w,k} \sim \mathcal{D}[\eta] \qquad z_{ij} \sim \theta_{k,j} \qquad x_{ij} \sim \phi_{w,z_{ij}}.$$

To sample from the posterior, we follow the details of the direct assignment sampler for HDP [2]. Both $\theta_{kj}$ and $\phi_{wk}$ are integrated out, and $z_{ij}$ is sampled from a conditional distribution that is almost identical to that of LDA, except that a small amount of probability mass is reserved for the instantiation of a new topic. Note that although HDP is defined to have an infinite number of topics, the only topics that are instantiated are those that are actually used.

¹To avoid clutter, we write $\phi_{wk}$ or $\theta_{kj}$ to denote the set of all components, i.e. $\{\phi_{wk}\}$ or $\{\theta_{kj}\}$. Similarly, when sampling from a Dirichlet, we write $\theta_{kj} \sim \mathcal{D}[\alpha\beta_k]$ instead of $[\theta_{1,j}, .., \theta_{K,j}] \sim \mathcal{D}[\alpha\beta_1, .., \alpha\beta_K]$.

3 Asynchronous distributed learning for the LDA model

We consider the problem of learning an LDA model with K topics in a distributed fashion where documents are distributed across P processors. Each processor p stores the following local variables: $w^p_{ij}$ contains the word type for each token i in document j on the processor, and $z^p_{ij}$ contains the assigned topic for each token. $N^{\neg p}_{wk}$ is the global word-topic count matrix stored at the processor; this matrix stores counts of other processors gathered during the communication step and does not include the processor's local counts. $N^p_{kj}$ is the local document-topic count matrix (derived from $z^p$), $N^p_w$ is the simple word count on a processor (derived from $w^p$), and $N^p_{wk}$ is the local word-topic count matrix (derived from $z^p$ and $w^p$), which only contains the counts of data on the processor.

Newman et al.
[5] introduced a parallel version of LDA based on collapsed Gibbs sampling (which we will call Parallel-LDA). In Parallel-LDA, each processor receives 1/P of the documents in the corpus and the z's are globally initialized. Each iteration of the algorithm is composed of two steps: a Gibbs sampling step and a synchronization step. In the sampling step, each processor samples its local $z^p$ by using the global topics of the previous iteration. In the synchronization step, the local counts $N^p_{wk}$ on each processor are aggregated to produce a global set of word-topic counts $N_{wk}$. This process is repeated for either a fixed number of iterations or until the algorithm has converged.

Parallel-LDA can provide substantial memory and time savings. However, it is a fully synchronous algorithm, since it requires global synchronization at each iteration. In some applications, a global synchronization step may not be feasible: some processors may be unavailable, while other processors may be in the middle of a long Gibbs sweep, due to differences in processor speeds. To gain the benefits of asynchronous computing, we introduce an asynchronous distributed version of LDA (Async-LDA) that follows a similar two-step process to that above. Each processor performs a local Gibbs sampling step followed by a step of communicating with another random processor.

For Async-LDA, during each iteration, the processors perform a full sweep of collapsed Gibbs sampling over their local topic assignment variables $z^p$ according to the following conditional distribution, in a manner directly analogous to Equation 1:

$$P(z^p_{ij} = k \mid z^{\neg ij}_p, w^p) \propto \frac{(N^{\neg p} + N^p)^{\neg ij}_{wk} + \eta}{\sum_w (N^{\neg p} + N^p)^{\neg ij}_{wk} + W\eta} \left( N^{\neg ij}_{pjk} + \alpha \right). \qquad (2)$$

The combination of $N^{\neg p}_{wk}$ and $N^p_{wk}$ is used in the sampling equation.
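The two-step structure of a Parallel-LDA iteration can be sketched as follows. This is a simplified sequential simulation of the scheme in [5], not the authors' code; `processors` and `sample_local` are hypothetical names, and a real implementation would run the sweeps in parallel and use an all-reduce across machines for the aggregation:

```python
import numpy as np

def parallel_lda_iteration(processors, sample_local):
    """One Parallel-LDA iteration, sketched sequentially.

    `processors` is a list of dicts holding each processor's local state;
    `sample_local` stands in for a full local Gibbs sweep and returns the
    updated local counts N^p_wk.
    """
    # Step 1 (sampling): each processor sweeps its local z^p using the
    # global topic counts from the previous iteration.
    for p in processors:                  # run in parallel in practice
        p["Nwk"] = sample_local(p, p["global_Nwk"])
    # Step 2 (synchronization): aggregate N_wk = sum_p N^p_wk ...
    global_Nwk = sum(p["Nwk"] for p in processors)
    # ... and broadcast the new global counts back to every processor.
    for p in processors:
        p["global_Nwk"] = global_Nwk
    return global_Nwk
```

The global barrier between the two steps is exactly what Async-LDA removes.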
Recall that $N^{\neg p}_{wk}$ represents processor p's belief of the counts of all the other processors with which it has already communicated (not including processor p's local counts), while $N^p_{wk}$ is the processor's local word-topic counts. Thus, the sampling of the $z^p$'s is based on the processor's "noisy view" of the global set of topics.

Once the inference of $z^p$ is complete (and $N^p_{wk}$ is updated), the processor finds another finished processor and initiates communication². We are generally interested in the case where memory and communication bandwidth are both limited. We also assume in the simplified gossip scheme that a processor can establish communication with every other processor; later in the paper we also discuss scenarios that relax these assumptions.

Algorithm 1 Async-LDA
  for each processor p in parallel do
    repeat
      Sample $z^p$ locally (Equation 2)
      Receive $N^g_{wk}$ from random proc g
      Send $N^p_{wk}$ to proc g
      if p has met g before then
        $N^{\neg p}_{wk} \leftarrow N^{\neg p}_{wk} - \tilde{N}^g_{wk} + N^g_{wk}$
      else
        $N^{\neg p}_{wk} \leftarrow N^{\neg p}_{wk} + N^g_{wk}$
      end if
    until convergence
  end for

In the communication step, let us consider the case where two processors, p and g, have never met before. In this case, the processors simply exchange their local $N^p_{wk}$'s (their local contribution to the global topic set), and processor p simply adds $N^g_{wk}$ to its $N^{\neg p}_{wk}$, and vice versa.

Consider the case where two processors meet again. The processors should not simply swap and add their local counts again; rather, each processor should first remove from $N^{\neg p}_{wk}$ the influence of the other processor from their previous encounter, in order to prevent processors that frequently meet from over-influencing each other.
We assume in the general case that a processor does not store in memory the previous counts of all the other processors that processor p has already met. Since the previous local counts of the other processor were already absorbed into $N^{\neg p}_{wk}$ and are thus not retrievable, we must take a different approach. In Async-LDA, the processors exchange their $N^p_{wk}$'s, from which the count of words on each processor, $N^p_w$, can be derived. Using processor g's $N^g_w$, processor p creates $\tilde{N}^g_{wk}$ by sampling $N^g_w$ topic values randomly without replacement from the collection $\{N^{\neg p}_{wk}\}$. We can imagine that there are $\sum_k N^{\neg p}_{wk}$ colored balls, with $N^{\neg p}_{wk}$ balls of color k, from which we pick $N^g_w$ balls uniformly at random without replacement. This process is equivalent to sampling from a multivariate hypergeometric distribution. $\tilde{N}^g_{wk}$ acts as a substitute for the $N^g_{wk}$ that processor p received during their previous encounter. Since all knowledge of the previous $N^g_{wk}$ is lost, this method can be justified by Laplace's principle of indifference (or the principle of maximum entropy). Finally, we update $N^{\neg p}_{wk}$ by subtracting $\tilde{N}^g_{wk}$ and adding the current $N^g_{wk}$:

$$N^{\neg p}_{wk} \leftarrow N^{\neg p}_{wk} - \tilde{N}^g_{wk} + N^g_{wk} \quad \text{where} \quad \tilde{N}^g_{w,k} \sim \mathcal{MH}[N^g_w;\, N^{\neg p}_{w,1}, .., N^{\neg p}_{w,K}]. \qquad (3)$$

Pseudocode for Async-LDA is provided in the display box for Algorithm 1. The assumption of limited memory can be relaxed by allowing processors to cache previous counts of other processors; the cached $N^g_{wk}$ would then replace $\tilde{N}^g_{wk}$.

²We don't discuss in general the details of how processors might identify other processors that have finished their iteration, but we imagine that a standard protocol could be used, like P2P.
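The merge step of Equation 3 can be sketched with NumPy, whose random `Generator` provides a multivariate hypergeometric sampler. The function name `merge_counts` and the overall structure are our own illustration, not the authors' code:

```python
import numpy as np

def merge_counts(N_not_p, N_g, rng):
    """Async-LDA communication update (Equation 3) on processor p (a sketch).

    N_not_p : W x K matrix N^(neg p)_wk of counts from previously-met processors
    N_g     : W x K matrix N^g_wk just received from processor g

    g's old contribution was absorbed into N^(neg p) and forgotten, so a
    proxy count matrix is drawn per word type w from a multivariate
    hypergeometric distribution: N^g_w balls are picked without replacement
    from an urn with N^(neg p)_wk balls of "color" k.
    """
    N_g_w = N_g.sum(axis=1)            # per-word token counts on g
    N_tilde = np.zeros_like(N_not_p)
    for w in range(N_not_p.shape[0]):
        N_tilde[w] = rng.multivariate_hypergeometric(N_not_p[w], N_g_w[w])
    return N_not_p - N_tilde + N_g     # Equation 3
```

Because each row of the proxy is drawn from the urn defined by the corresponding row of $N^{\neg p}_{wk}$, the subtraction can never drive a count negative.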
We can also relax the assumption of limited bandwidth. Processor p could forward its individual cached counts (from other processors) to g, and vice versa, to quicken the dissemination of information. In fixed topologies where the network is not fully connected, forwarding is necessary to propagate the counts across the network. Our approach can be applied to a wide variety of scenarios with varying memory, bandwidth, and topology constraints.

4 Synchronous and asynchronous distributed learning for the HDP model

Inference for HDP can be performed in a distributed manner as well. Before discussing our asynchronous HDP algorithm, we first describe a synchronous parallel inference algorithm for HDP.

We begin with necessary notation for HDPs: $\gamma$ is the concentration parameter for the top-level Dirichlet Process (DP), $\alpha$ is the concentration parameter for the document-level DP, the $\beta_k$'s are top-level topic probabilities, and $\eta$ is the Dirichlet parameter for the base distribution. The graphical model for HDP is shown in Figure 1.

We introduce Parallel-HDP, which is analogous to Parallel-LDA except that new topics may be added during the Gibbs sweep. Documents are again distributed across the processors. Each processor maintains local $\beta^p_k$ parameters which are augmented when a new topic is locally created. During the Gibbs sampling step, each processor locally samples the $z^p$ topic assignments. In the synchronization step, the local word-topic counts $N^p_{wk}$ are aggregated into a single matrix of global counts $N_{wk}$, and the local $\beta^p_k$'s are averaged to form a global $\beta_k$. The $\alpha$, $\beta_k$ and $\gamma$ hyperparameters are also globally resampled during the synchronization step; see Teh et al. [2] for details. We fix $\eta$ to be a small constant. While $\alpha$ and $\gamma$ can also be fixed, sampling these parameters improves the rate of convergence.
To facilitate sampling, relatively flat gamma priors are placed on $\alpha$ and $\gamma$. Finally, these parameters and the global count matrix are distributed back to the processors.

Algorithm 2 Parallel-HDP
  repeat
    for each processor p in parallel do
      Sample $z^p$ locally
      Send $N^p_{wk}$, $\beta^p_k$ to master node
    end for
    $N_{wk} \leftarrow \sum_p N^p_{wk}$
    $\beta_k \leftarrow (\sum_p \beta^p_k)\,/\,P$
    Resample $\alpha$, $\beta_k$, $\gamma$ globally
    Distribute $N_{wk}$, $\alpha$, $\beta_k$, $\gamma$ to all processors
  until convergence

Algorithm 3 Async-HDP
  for each processor p in parallel do
    repeat
      Sample $z^p$ and then $\alpha^p$, $\beta^p_k$, $\gamma^p$ locally
      Receive $N^g_{wk}$, $\alpha^g$, $\beta^g_k$ from random proc g
      Send $N^p_{wk}$, $\alpha^p$, $\beta^p_k$ to proc g
      if p has met g before then
        $N^{\neg p}_{wk} \leftarrow N^{\neg p}_{wk} - \tilde{N}^g_{wk} + N^g_{wk}$
      else
        $N^{\neg p}_{wk} \leftarrow N^{\neg p}_{wk} + N^g_{wk}$
      end if
      $\alpha^p \leftarrow (\alpha^p + \alpha^g)\,/\,2$ and $\beta^p_k \leftarrow (\beta^p_k + \beta^g_k)\,/\,2$
    until convergence
  end for

Motivated again by the advantages of local asynchronous communication between processors, we propose an Async-HDP algorithm. It is very similar in spirit to Async-LDA, and so we focus on the differences in our description.
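The synchronization step of Algorithm 2 amounts to a sum of count matrices and an average of the $\beta$ vectors. A minimal sketch follows; zero-padding to the widest local topic set is our own simplification for handling processors that created different numbers of topics, and the global resampling of $\alpha$, $\beta_k$, $\gamma$ [2] is omitted:

```python
import numpy as np

def parallel_hdp_sync(local_Nwk, local_beta):
    """Parallel-HDP synchronization step (a sketch).

    Local word-topic counts N^p_wk are summed, and the per-processor
    top-level topic weights beta^p_k are averaged. Matrices and vectors
    are zero-padded to the largest local number of topics K.
    """
    P = len(local_Nwk)
    K = max(N.shape[1] for N in local_Nwk)

    def pad(a, K):
        out = np.zeros(a.shape[:-1] + (K,))
        out[..., : a.shape[-1]] = a
        return out

    Nwk = sum(pad(N, K) for N in local_Nwk)        # N_wk = sum_p N^p_wk
    beta = sum(pad(b, K) for b in local_beta) / P  # beta_k = (sum_p beta^p_k)/P
    return Nwk, beta
```

Note that after padding and averaging, the merged $\beta$ need not sum exactly to one; the subsequent global resampling step restores a proper top-level distribution.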
First, the sampling equation for $z^p$ is different from that of Async-LDA, since some probability mass is reserved for new topics:

$$P(z^p_{ij} = k \mid z^{\neg ij}_p, w^p) \propto \begin{cases} \frac{(N^{\neg p} + N^p)^{\neg ij}_{wk} + \eta}{\sum_w (N^{\neg p} + N^p)^{\neg ij}_{wk} + W\eta} \left( N^{\neg ij}_{pjk} + \alpha^p \beta^p_k \right), & \text{if } k \leq K^p \\ \frac{\alpha^p \beta^p_{new}}{W}, & \text{if } k \text{ is new.} \end{cases}$$

Table 1: Data sets used for perplexity and speedup experiments

                                             KOS       NIPS        NYT          PUBMED
  Total number of documents in training set  3,000     1,500       300,000      8,200,000
  Size of vocabulary                         6,906     12,419      102,660      141,043
  Total number of words                      410,595   1,932,365   99,542,125   737,869,083
  Total number of documents in test set      430       184         –            –

We resample the hyperparameters $\alpha^p$, $\beta^p_k$, $\gamma^p$ locally³ during the inference step, and keep $\eta$ fixed. In Async-HDP, a processor can add new topics to its collection during the inference step. Thus, when two processors communicate, the number of topics on each processor might be different. One way to merge topics is to perform bipartite matching across the two topic sets, using the Hungarian algorithm. However, performing this topic-matching step imposes a computational penalty as the number of topics increases. In our experiments for Async-LDA, Parallel-HDP, and Async-HDP, we do not perform topic matching; we simply combine the topics on different processors based on their topic IDs, and (somewhat surprisingly) the topics gradually self-organize and align. Newman et al. [5] also observed this same behavior occurring in Parallel-LDA.

During the communication step, the counts $N^p_{wk}$ and the parameters $\alpha^p$ and $\beta^p_k$ are exchanged and merged. Async-HDP removes a processor's previous influence through the same $\mathcal{MH}$ technique used in Async-LDA.
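The case split in the Async-HDP sampling equation can be sketched as a function returning $K^p + 1$ unnormalized weights, where the last entry corresponds to instantiating a new topic. The argument names are our own illustrative choices:

```python
import numpy as np

def async_hdp_probs(Nwk_view, Njk_doc, w, alpha_p, beta_p, beta_new, eta=0.01):
    """Unnormalized Async-HDP sampling weights for one token of word
    type w (a sketch with hypothetical argument names).

    Nwk_view : W x Kp matrix (N^(neg p) + N^p)^(neg ij), p's current topic view
    Njk_doc  : length-Kp vector of document-topic counts for the token's doc
    beta_p   : length-Kp vector of top-level weights beta^p_k
    beta_new : mass beta^p_new reserved for an as-yet-uninstantiated topic

    Returns Kp + 1 weights; the last entry is the (unnormalized)
    probability of creating a new topic.
    """
    W = Nwk_view.shape[0]
    existing = (Nwk_view[w] + eta) / (Nwk_view.sum(axis=0) + W * eta) \
               * (Njk_doc + alpha_p * beta_p)
    new_topic = alpha_p * beta_new / W   # uniform over the W word types
    return np.append(existing, new_topic)
```

A caller would normalize these weights and, if the last index is drawn, append a new column to its local count matrices and split off mass from $\beta^p_{new}$.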
Pseudocode for Async-HDP is provided in the display box for Algorithm 3.

5 Experiments

We use four text data sets for evaluation: KOS, a data set derived from blog entries (dailykos.com); NIPS, a data set derived from NIPS papers (books.nips.cc); NYT, a collection of news articles from the New York Times (nytimes.com); and PUBMED, a large collection of PubMed abstracts (ncbi.nlm.nih.gov/pubmed/). The characteristics of these four data sets are summarized in Table 1.

For our perplexity experiments, parallel processors were simulated in software and run on the smaller data sets (KOS, NIPS), to enable us to test the statistical limits of our algorithms. Actual parallel hardware is used to measure speedup on the larger data sets (NYT, PUBMED). Our simulation features a gossip scheme over a fully connected network that lets each processor communicate with one other random processor at the end of every iteration, e.g., with P=100, there are 50 pairs at each iteration. In our perplexity experiments, the data set is separated into a training set and a test set. We learn our models on the training set, and then we measure the performance of our algorithms on the test set using perplexity, a widely-used metric in the topic modeling community.

We briefly describe how perplexity is computed for our models. Perplexity is simply the exponentiated average per-word log-likelihood. For each of our experiments, we perform S = 5 different Gibbs runs, with each run lasting 1500 iterations (unless otherwise noted), and we obtain a sample at the end of each of those runs. The 5 samples are then averaged when computing perplexity.
For Parallel-HDP, perplexity is calculated in the same way as in standard HDP:

$$\log p(x^{test}) = \sum_{jw} \log \frac{1}{S} \sum_s \sum_k \hat{\theta}^s_{jk} \hat{\phi}^s_{wk} \quad \text{where} \quad \hat{\theta}^s_{jk} = \frac{\alpha\beta_k + N^s_{jk}}{\sum_k (\alpha\beta_k) + N^s_j},\; \hat{\phi}^s_{wk} = \frac{\eta + N^s_{wk}}{W\eta + N^s_k}. \qquad (4)$$

After the model is run on the training data, $\hat{\phi}^s_{wk}$ is available in sample s. To obtain $\hat{\theta}^s_{jk}$, one must resample the topic assignments on the first half of each document in the test set while holding $\hat{\phi}^s_{wk}$ fixed. Perplexity is evaluated on the second half of each document in the test set, given $\hat{\phi}^s_{wk}$ and $\hat{\theta}^s_{jk}$.

The perplexity calculation for Async-LDA and Async-HDP uses the same formula. Since each processor effectively learns a separate local topic model, we can directly compute the perplexity for each processor's local model. In our experiments, we report the average perplexity among processors, and we show error bars denoting the minimum and maximum perplexity among all processors. The variance of perplexities between processors is usually quite small, which suggests that the local topic models learned on each processor are equally accurate.

For KOS and NIPS, we used the same settings for priors and hyperpriors: $\alpha = 0.1$, $\eta = 0.01$ for LDA and Async-LDA, and $\eta = 0.01$, $\gamma \sim \mathrm{Gam}(10, 1)$, and $\alpha \sim \mathrm{Gam}(2, 1)$ for the HDP algorithms.

³Sampling $\alpha^p$, $\beta^p_k$, $\gamma^p$ requires a global view of variables like $m_{\cdot k}$, the total number of "tables" serving "dish" k [2].
These values can be asynchronously propagated in the same way that the counts are propagated.

[Figure 2: (a) Left: Async-LDA perplexities on KOS. (b) Middle: Async-LDA perplexities on NIPS. (c) Right: Async-LDA perplexities on KOS with many procs. Cache=5 when P≥100. 3000 iterations run when P≥500.]

5.1 Async-LDA perplexity and speedup results

Figures 2(a,b) show the perplexities for Async-LDA on the KOS and NIPS data sets for varying numbers of topics. The variation in perplexities between LDA and Async-LDA is slight and is significantly less than the variation in perplexities as the number of topics K is changed. These numbers suggest that Async-LDA converges to solutions of the same quality as standard LDA. While these results are based on a single test/train split of the corpus, we have also performed cross-validation experiments (results not shown) which give essentially the same results across different test/train splits.

We also stretched the limits of our algorithm by increasing P (e.g. for P=1500, there are only two documents on each processor), and we found that performance was virtually unchanged (Figure 2(c)). As a baseline, we ran an experiment where processors never communicate. As the number of processors P was increased from 10 to 1500, the corresponding perplexities increased from 2600 to 5700, dramatically higher than for our Async-LDA algorithm, indicating (unsurprisingly) that processor communication is essential to obtain good quality models.
Figure 3(a) shows the rate of convergence of Async-LDA. As the number of processors increases, the rate of convergence slows, since it takes more iterations for information to propagate to all the processors. However, it is important to note that one iteration in real time of Async-LDA is up to P times faster than one iteration of LDA. We show the same curve in terms of estimated real time in Figure 3(b), assuming a parallel efficiency of 0.5, and one can see that Async-LDA converges much more quickly than LDA. Figure 3(c) shows actual speedup results for Async-LDA on NYT and PUBMED, and the speedups are competitive with those reported for Parallel-LDA [5]. As the data set size grows, the parallel efficiency increases, since communication overhead is dwarfed by the sampling time.

In Figure 3(a), we also show the performance of a baseline asynchronous averaging scheme, where global counts are averaged together: $N^{\neg p}_{wk} \leftarrow (N^{\neg p}_{wk} + N^{\neg g}_{wk})/d + N^g_{wk}$. To prevent unbounded count growth, d must be greater than 2, and so we arbitrarily set d to 2.5. While this averaging scheme initially converges quickly, it converges to a final solution that is worse than Async-LDA's, regardless of the setting for d.

The rate of convergence for Async-LDA with P=100 can be dramatically improved by letting each processor maintain a cache of previous $N^g_{wk}$ counts of other processors. Figures 3(a,b), C=5, show the improvement made by letting each processor cache the five most recently seen $N^g_{wk}$'s. Note that we still assume limited bandwidth: processors do not forward individual cached counts, but instead share a single matrix of combined cache counts that helps the processors to achieve faster burn-in time.
In this manner, one can elegantly make a tradeoff between time and memory.

[Figure 3: (a) Left: Convergence plot for Async-LDA on KOS, K=16. (b) Middle: Same plot with x-axis as relative time. (c) Right: Speedup results for NYT and PUBMED on a cluster, using Message Passing Interface.]

[Figure 4: (a) Left: Perplexities for Parallel-HDP and Async-HDP. Cache=5 used for Async-HDP P=100. (b) Middle: Convergence plot for Parallel-HDP on KOS. (c) Right: Convergence plot for Async-HDP on KOS.]

5.2 Parallel-HDP and Async-HDP results

Perplexities for Parallel-HDP after 1500 iterations are shown in Figure 4(a), and they suggest that the model generated by Parallel-HDP has nearly the same predictive power as standard HDP.
Figure 4(b) shows that Parallel-HDP converges at essentially the same rate as standard HDP on the KOS data set, even though topics are generated at a slower rate. Topics grow at a slower rate in Parallel-HDP since new topics that are generated locally on each processor are merged together during each synchronization step. In this experiment, while the number of topics is still growing, the perplexity has converged, because the newest topics are smaller and do not significantly affect the predictive power of the model. The number of topics does stabilize after thousands of iterations.

Perplexities for Async-HDP are shown in Figures 4(a,c) as well. On the NIPS data set, there is a slight perplexity degradation, which is partially due to non-optimal parameter settings for $\alpha$ and $\gamma$. Topics are generated at a slightly faster rate for Async-HDP than for Parallel-HDP because Async-HDP takes a less aggressive approach to pruning small topics, since processors need to be careful when pruning topics locally. Like Parallel-HDP, Async-HDP converges rapidly to a good solution.

5.3 Extended experiments for realistic scenarios

In certain applications, it is desirable to learn a topic model incrementally as new data arrives. In our framework, if new data arrives, we simply assign the new data to a new processor, and then let that new processor enter the "world" of processors with which it can begin to communicate. Our asynchronous approach requires no global initialization or global synchronization step. We do assume a fixed global vocabulary, but one can imagine schemes which allow the vocabulary to grow as well. We performed an experiment for Async-LDA where we introduced 10 new processors (each carrying new data) every 100 iterations. In the first 100 iterations, only 10% of the KOS data is known, and every 100 iterations, an additional 10% of the data is added to the system through new processors.
Figure 5(a) shows that perplexity decreases as more processors and data are added. After 1000 iterations, the perplexity of Async-LDA has converged to the standard LDA perplexity. Thus, in this experiment, learning in an online fashion does not adversely affect the final model.

In the experiments described previously, documents were randomly distributed across processors. In reality, a processor may have a document set specialized to only a few topics. We investigated Async-LDA's behavior on a non-random distribution of documents over processors. After running LDA (K=20) on NIPS, we used the inferred mixtures θjk to separate the corpus into 20 different sets of documents corresponding to the 20 topics. We assigned 2 sets of documents to each of 10 processors, so that each processor had a document set that was specialized to 2 topics. Figure 5(b) shows that Async-LDA performs just as well on this non-random distribution of documents.

Figure 5: (a) Left: Online learning for Async-LDA on KOS, K=16. (b) Middle: Comparing random vs. non-random distribution of documents for Async-LDA on NIPS, K=20. (c) Right: Async-LDA on KOS, K=16, where processors have varying amounts of data. In all 3 cases, Async-LDA converges to a good solution.

Another situation of interest is the case where the amount of data on each processor varies.
KOS was divided into 30 blocks of 100 documents, and these blocks were assigned to 10 processors according to the distribution {7, 6, 4, 3, 3, 2, 2, 1, 1, 1}. We assume that if a processor has k blocks, then it takes k units of time to complete one sampling sweep. Figure 5(c) shows that this load imbalance does not significantly affect the final perplexity achieved. More generally, the time Tp that each processor p takes to perform Gibbs sampling dictates the communication graph that will ensue. There exist pathological cases where the graph may be disconnected due to phase-locking (e.g., 5 processors with times T = {10, 12, 14, 19, 20}, where P1, P2, P3 enter the network at time 0 and P4, P5 enter the network at time 34). However, the graph is guaranteed to be connected over time if Tp has a stochastic component (e.g., due to network delays), a reasonable assumption in practice.

In our experiments, we assumed a fully connected network of processors and did not focus on other network topologies. After running Async-LDA on both a 10x10 fixed grid network and a 100-node chain network on KOS (K=16), we verified that Async-LDA achieves the same perplexity as LDA as long as caching and forwarding of cached counts occur between processors.

6 Discussion and conclusions

The work most closely related to this paper is that of Mimno and McCallum [3] and Newman et al. [5], who each propose parallel algorithms for the collapsed sampler for LDA. In other work, Nallapati et al. [4] parallelize the variational EM algorithm for LDA, and Wolfe et al. [8] examine asynchronous EM algorithms for LDA. The primary distinctions between our work and other work on distributed LDA based on Gibbs sampling are that (a) our algorithms use purely asynchronous communication rather than a global synchronous scheme, and (b) we have also extended these ideas (synchronous and asynchronous) to HDP.
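The notion above of the communication graph being "connected over time" can be made operational: accumulate every pairwise exchange that occurs during a run and ask whether the resulting graph has a single component. A minimal union-find sketch (an illustration with hypothetical inputs, not part of the algorithms in the paper):

```python
def connected_over_time(n_procs, exchanges):
    """Union-find check: do the pairwise exchanges accumulated over a
    run link all n_procs processors into one component?"""
    parent = list(range(n_procs))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in exchanges:
        parent[find(i)] = find(j)
    return len({find(p) for p in range(n_procs)}) == 1

# Toy example: if P3 and P4 only ever exchange with each other, the
# accumulated graph stays split into two components.
print(connected_over_time(5, [(0, 1), (1, 2), (3, 4)]))          # False
print(connected_over_time(5, [(0, 1), (1, 2), (3, 4), (2, 3)]))  # True
```

Connectivity of this accumulated graph is what allows local counts to eventually propagate to every processor; a disconnected graph, as in the phase-locking example, means some processors never see each other's information.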
More generally, exact parallel Gibbs sampling is difficult to perform due to the sequential nature of MCMC. Brockwell [9] presents a pre-fetching parallel algorithm for MCMC, but this technique is not applicable to the collapsed sampler for LDA. There is also a large body of prior work on gossip algorithms (e.g., [6]), such as Newscast EM, a gossip algorithm for performing EM on Gaussian mixture learning [10].

Although processors perform local Gibbs sampling based on inexact global counts, our algorithms nonetheless produce solutions that are nearly the same as those of standard single-processor samplers. Providing a theoretical justification for these distributed algorithms is still an open area of research.

We have proposed a new set of algorithms for distributed learning of LDA and HDP models. Our perplexity and speedup results suggest that topic models can be learned in a scalable asynchronous fashion in a wide variety of situations. One can imagine our algorithms being performed by a large network of idle processors, in an effort to mine the terabytes of information available on the Internet.

Acknowledgments

This material is based upon work supported in part by NSF under Award IIS-0083489 (PS, AA), IIS-0447903 and IIS-0535278 (MW), and an NSF graduate fellowship (AA). MW was also supported by ONR under Grant 00014-06-1-073, and PS was also supported by a Google Research Award.

References

[1] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. JMLR, 3:993–1022, 2003.
[2] Y. Teh, M. Jordan, M. Beal, and D. Blei. Hierarchical Dirichlet processes. JASA, 101(476), 2006.
[3] D. Mimno and A. McCallum. Organizing the OCA: learning faceted subjects from a library of digital books. In JCDL '07, pages 376–385, New York, NY, USA, 2007. ACM.
[4] R. Nallapati, W. Cohen, and J. Lafferty. Parallelized variational EM for latent Dirichlet allocation: An experimental evaluation of speed and scalability.
In ICDM Workshop on High Perf. Data Mining, 2007.
[5] D. Newman, A. Asuncion, P. Smyth, and M. Welling. Distributed inference for latent Dirichlet allocation. In NIPS 20. MIT Press, Cambridge, MA, 2008.
[6] S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah. Gossip algorithms: design, analysis and applications. In INFOCOM, pages 1653–1664, 2005.
[7] T. L. Griffiths and M. Steyvers. Finding scientific topics. PNAS, 101 Suppl 1:5228–5235, April 2004.
[8] J. Wolfe, A. Haghighi, and D. Klein. Fully distributed EM for very large datasets. In ICML '08, pages 1184–1191, New York, NY, USA, 2008. ACM.
[9] A. Brockwell. Parallel Markov chain Monte Carlo simulation by pre-fetching. JCGS, 15, No. 1, 2006.
[10] W. Kowalczyk and N. Vlassis. Newscast EM. In NIPS 17. MIT Press, Cambridge, MA, 2005.