{"title": "Parallel Inference for Latent Dirichlet Allocation on Graphics Processing Units", "book": "Advances in Neural Information Processing Systems", "page_first": 2134, "page_last": 2142, "abstract": "The recent emergence of Graphics Processing Units (GPUs) as general-purpose parallel computing devices provides us with new opportunities to develop scalable learning methods for massive data. In this work, we consider the problem of parallelizing two inference methods on GPUs for latent Dirichlet Allocation (LDA) models, collapsed Gibbs sampling (CGS) and collapsed variational Bayesian (CVB). To address limited memory constraints on GPUs, we propose a novel data partitioning scheme that effectively reduces the memory cost. Furthermore, the partitioning scheme balances the computational cost on each multiprocessor and enables us to easily avoid memory access conflicts. We also use data streaming to handle extremely large datasets. Extensive experiments showed that our parallel inference methods consistently produced LDA models with the same predictive power as sequential training methods did but with 26x speedup for CGS and 196x speedup for CVB on a GPU with 30 multiprocessors; actually the speedup is almost linearly scalable with the number of multiprocessors available. The proposed partitioning scheme and data streaming can be easily ported to many other models in machine learning.", "full_text": "Parallel Inference for Latent Dirichlet Allocation on\n\nGraphics Processing Units\n\nFeng Yan\n\nDepartment of CS\nPurdue University\n\nWest Lafayette, IN 47907\n\nNingyi Xu\n\nMicrosoft Research Asia\n\nNo. 49 Zhichun Road\nBeijing, P.R. 
China\n\nYuan (Alan) Qi\n\nDepartments of CS and Statistics\n\nPurdue University\n\nWest Lafayette, IN 47907\n\nAbstract\n\nThe recent emergence of Graphics Processing Units (GPUs) as general-purpose parallel computing devices provides us with new opportunities to develop scalable learning methods for massive data. In this work, we consider the problem of parallelizing two inference methods on GPUs for latent Dirichlet Allocation (LDA) models, collapsed Gibbs sampling (CGS) and collapsed variational Bayesian (CVB) inference. To address the limited memory on GPUs, we propose a novel data partitioning scheme that effectively reduces the memory cost. This partitioning scheme also balances the computational cost on each multiprocessor and enables us to easily avoid memory access conflicts. We use data streaming to handle extremely large datasets. Extensive experiments showed that our parallel inference methods consistently produced LDA models with the same predictive power as sequential training methods did, but with 26x speedup for CGS and 196x speedup for CVB on a GPU with 30 multiprocessors. The proposed partitioning scheme and data streaming make our approach scalable with more multiprocessors. Furthermore, they can be used as general techniques to parallelize other machine learning models.\n\n1 Introduction\n\nLearning from massive datasets, such as text, images, and high-throughput biological data, has applications in various scientific and engineering disciplines. The scale of these datasets, however, often demands high, sometimes prohibitive, computational cost. To address this issue, an obvious approach is to parallelize learning methods across multiple processors. 
While large CPU clusters are commonly used for parallel computing, Graphics Processing Units (GPUs) provide a powerful alternative platform for developing parallel machine learning methods. A GPU has massive numbers of built-in parallel thread processors and high-speed memory, and therefore potentially offers one to two orders of magnitude higher peak flops and memory throughput than its CPU counterpart. Although GPUs are not suited to complex logical computation, they can significantly reduce the running time of numerically intensive applications. GPUs are also more cost-effective and energy-efficient: the current high-end GPU has over 50x more peak flops than a CPU at the same price, and, given a similar power consumption, GPUs perform more flops per watt than CPUs. For large-scale industrial applications, such as web search engines, efficient learning methods on GPUs can make a big difference in energy consumption and equipment cost. However, parallel computing on GPUs can be a challenging task because of several limitations, such as the relatively small memory size.\n\nIn this paper, we demonstrate how to overcome these limitations with an exemplary data-intensive application: training Latent Dirichlet Allocation (LDA) models. LDA models have been successfully applied to text analysis. For large corpora, however, it takes days, even months, to train them. Our parallel approaches take advantage of the parallel computing power of GPUs and exploit the algorithmic structures of LDA learning methods, therefore significantly reducing the computational cost. Furthermore, our parallel inference approaches, based on a new data partition scheme and data streaming, can be applied not only to GPUs but to any shared memory machine. 
Specifically, the main contributions of this paper include:\n\n\u2022 We introduce parallel collapsed Gibbs sampling (CGS) and parallel collapsed variational Bayesian (CVB) inference for LDA models on GPUs. We also analyze the convergence property of the parallel variational inference and show that, with mild convexity assumptions, the parallel inference monotonically increases the variational lower bound until convergence.\n\n\u2022 We propose a fast data partition scheme that efficiently balances the workloads across processors, fully utilizing the massive parallel mechanisms of GPUs.\n\n\u2022 Based on this partitioning scheme, our method is also independent of specific memory consistency models: with partitioned data and parameters in exclusive memory sections, we avoid access conflicts and do not sacrifice speedup to the extra cost of a memory consistency mechanism.\n\n\u2022 We propose a data streaming scheme, which allows our methods to handle very large corpora that cannot be stored in a single GPU.\n\n\u2022 Extensive experiments show that both parallel inference algorithms on GPUs achieve the same predictive power as their sequential inference counterparts on CPUs, but significantly faster. The speedup is near linear in the number of multiprocessors on the GPU card.\n\n2 Latent Dirichlet Allocation\n\nWe briefly review the LDA model and two inference algorithms for LDA.\u00b9 LDA models each of D documents as a mixture over K latent topics, and each topic k is a multinomial distribution over a word vocabulary of W distinct words, denoted by \u03c6_k = {\u03c6_kw}, where \u03c6_k is drawn from a symmetric Dirichlet prior with parameter \u03b2. In order to generate a document j, the document's mixture over topics, \u03b8_j = {\u03b8_jk}, is drawn from a symmetric Dirichlet prior with parameter \u03b1 first. 
For the ith token in the document, a topic assignment z_ij is drawn with topic k chosen with probability \u03b8_jk. Then word x_ij is drawn from the z_ij-th topic, with x_ij taking on value w with probability \u03c6_{z_ij w}. Given the training data with N words x = {x_ij}, we need to compute the posterior distribution over the latent variables.\n\nCollapsed Gibbs sampling [4] is an efficient procedure to sample the posterior distribution of the topic assignments z = {z_ij} by integrating out all \u03b8_jk and \u03c6_kw. Given the current state of all but one variable z_ij, the conditional distribution of z_ij is\n\nP(z_ij = k | z^{\u00acij}, x, \u03b1, \u03b2) \u221d (n^{\u00acij}_{x_ij k} + \u03b2) / (n^{\u00acij}_k + W\u03b2) \u00b7 (n^{\u00acij}_{jk} + \u03b1)    (1)\n\nwhere n_wk denotes the number of tokens with word w assigned to topic k, n_jk denotes the number of tokens in document j assigned to topic k, and n^{\u00acij}_k = \u03a3_w n^{\u00acij}_{wk}. The superscript \u00acij denotes that the variable is calculated as if token x_ij were removed from the training data.\n\nCGS is very efficient because the variance is greatly reduced by sampling in a collapsed state space. Teh et al. [9] applied the same state space to variational Bayesian inference and proposed the collapsed variational Bayesian (CVB) inference algorithm. It has been shown that CVB has a theoretically tighter variational bound than standard VB. In CVB, the posterior of z is approximated by a factorized posterior q(z) = \u03a0_ij q(z_ij | \u03b3_ij), where q(z_ij | \u03b3_ij) is multinomial with variational parameter \u03b3_ij = {\u03b3_ijk}. The inference task is to find variational parameters maximizing the variational lower bound L(q) = \u03a3_z q(z) log [p(z, x | \u03b1, \u03b2) / q(z)]. The authors used a computationally efficient Gaussian approximation.\n\n\u00b9 We use indices to represent topics, documents and vocabulary words.\n\n
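As a concrete illustration, the count bookkeeping behind the Gibbs update in Equation (1) can be sketched in a few lines of Python. This is a toy sketch with made-up counts, not the authors' GPU implementation:

```python
# Toy sketch of the collapsed Gibbs update in Equation (1): sample a new
# topic for one token with probability proportional to
# (n_wk + beta) / (n_k + W*beta) * (n_jk + alpha),
# after removing the token's current assignment from the counts.
# All counts below are illustrative, not from real data.
import random

random.seed(0)
alpha, beta = 0.5, 0.1
K, W = 3, 4

n_wk = [[2, 0, 1], [1, 3, 0], [0, 1, 2], [1, 1, 1]]   # word-topic counts
n_jk = [4, 3, 2]                                       # topic counts for one document j
n_k = [sum(n_wk[w][k] for w in range(W)) for k in range(K)]

def sample_topic(w, old_k):
    # remove the token from the counts (the "neg ij" state)
    n_wk[w][old_k] -= 1; n_jk[old_k] -= 1; n_k[old_k] -= 1
    weights = [(n_wk[w][k] + beta) / (n_k[k] + W * beta) * (n_jk[k] + alpha)
               for k in range(K)]
    new_k = random.choices(range(K), weights=weights)[0]
    # add the token back under its new assignment
    n_wk[w][new_k] += 1; n_jk[new_k] += 1; n_k[new_k] += 1
    return new_k

k = sample_topic(w=0, old_k=0)
assert k in range(K)
assert sum(n_k) == sum(map(sum, n_wk))  # counts stay consistent
```

This remove/sample/restore pattern is what makes the parallel schedule of Section 3 safe: two tokens with different documents and different words touch disjoint entries of n_jk and n_wk.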
The updating formula for \u03b3_ij is similar to the CGS update:\n\n\u03b3_ijk \u221d (E_q[n^{\u00acij}_{x_ij k}] + \u03b2)(E_q[n^{\u00acij}_{jk}] + \u03b1)(E_q[n^{\u00acij}_k] + W\u03b2)^{\u22121} \u00b7 exp( \u2212 Var_q[n^{\u00acij}_{x_ij k}] / (2(E_q[n^{\u00acij}_{x_ij k}] + \u03b2)\u00b2) \u2212 Var_q[n^{\u00acij}_{jk}] / (2(E_q[n^{\u00acij}_{jk}] + \u03b1)\u00b2) + Var_q[n^{\u00acij}_k] / (2(E_q[n^{\u00acij}_k] + W\u03b2)\u00b2) )    (2)\n\n3 Parallel Algorithms for LDA Training\n\n3.1 Parallel Collapsed Gibbs Sampling\n\nA natural way to parallelize LDA training is to distribute documents across P processors. Based on this idea, Newman et al. [8] introduced a parallel implementation of CGS on distributed machines, called AD-LDA. In AD-LDA, the D documents and document-specific counts n_jk are distributed over P processors, with D/P documents on each processor. In each iteration, every processor p independently runs local Gibbs sampling with its own copy of topic-word counts n^p_kw and topic counts n^p_k = \u03a3_w n^p_kw in parallel. Then a global synchronization aggregates the local counts n^p_kw to produce global counts n_kw and n_k. AD-LDA achieved substantial speedup compared with single-processor CGS training without sacrificing prediction accuracy. However, it needs to store P copies of the topic-word counts n_kw for all processors, which is unrealistic for GPUs with large P and large datasets due to the device memory limitation. For example, a dataset having 100,000 vocabulary words needs at least 1.4 GBytes to store 256-topic n_wk for 60 processors, exceeding the device memory capacity of current high-end GPUs. In order to address this issue, we develop a parallel CGS algorithm that only requires one copy of n_kw.\n\nOur parallel CGS algorithm is motivated by the following observation: for word token w1 in document j1 and word token w2 in document j2, if w1 \u2260 w2 and j1 \u2260 j2, simultaneous updates of topic assignment by (1) have no memory read/write conflicts on the document-topic counts n_jk and topic-word counts n_wk. The algorithmic flow is summarized in Algorithm 1.\n\nAlgorithm 1: Parallel Collapsed Gibbs Sampling\nInput: word tokens x, document partition J_1, ..., J_P and vocabulary partition V_1, ..., V_P\nOutput: n_jk, n_wk, z_ij\nInitialize the topic assignment of each word token; set n^p_k \u2190 n_k\nrepeat\n  for l = 0 to P \u2212 1 do\n    /* Sampling step */\n    for each processor p in parallel do\n      Sample z_ij for j \u2208 J_p and x_ij \u2208 V_{p\u2295l} by Equation (1), using the global counts n_wk and n_jk and the local counts n^p_k\n    end\n    /* Synchronization step */\n    Update n^p_k according to Equation (3)\n  end\nuntil convergence\n\nIn addition to dividing all documents J = {1, ..., D} into P disjoint sets of documents J_1, ..., J_P and distributing them to the P processors, we further divide the vocabulary words V = {1, ..., W} into P disjoint subsets V_1, ..., V_P, and each processor p (p = 0, ..., P \u2212 1) stores a local copy of the topic counts n^p_k. Every parallel CGS training iteration consists of P epochs, and each epoch consists of a sampling step and a synchronization step. In the sampling step of the lth epoch (l = 0, ..., P \u2212 1), processor p samples the topic assignments z_ij whose document index is j \u2208 J_p and word index is x_ij \u2208 V_{p\u2295l}. Here \u2295 is the modulus-P addition operation defined by\n\na \u2295 b = (a + b) mod P,\n\nand all processors run the sampling simultaneously without memory read/write conflicts on the global counts n_jk and n_wk. 
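The conflict-freedom of this epoch schedule can be checked with a small sketch (pure Python; the value of P and the block indices are illustrative):

```python
# Sketch of the epoch schedule used by the parallel CGS: in epoch l,
# processor p works on document block J_p and word block V_{(p+l) mod P},
# so no two processors touch the same document block or word block at
# the same time.  P here is illustrative.

P = 4  # number of "processors" (thread blocks)

def schedule(P):
    """For each epoch l, the list of (doc_block, word_block) pairs."""
    return [[(p, (p + l) % P) for p in range(P)] for l in range(P)]

for epoch in schedule(P):
    doc_blocks = [d for d, _ in epoch]
    word_blocks = [w for _, w in epoch]
    # Conflict freedom: within an epoch every processor owns a distinct
    # document block and a distinct word block.
    assert len(set(doc_blocks)) == P and len(set(word_blocks)) == P

# Over the P epochs, processor p visits every word block exactly once.
visited = {p: sorted((p + l) % P for l in range(P)) for p in range(P)}
assert all(v == list(range(P)) for v in visited.values())
```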
Then the synchronization step uses (3) to aggregate the n^p_k into the global counts n_k, which are used as local counts in the next epoch:\n\nn_k \u2190 n_k + \u03a3_p (n^p_k \u2212 n_k),    n^p_k \u2190 n_k.    (3)\n\nOur parallel CGS can be regarded as an extension of AD-LDA that uses the data partition in local sampling and inserts P \u2212 1 more synchronization steps within an iteration. Since our data partition guarantees that no two processors access either the same document or the same word in an epoch, the synchronization of n_wk in AD-LDA is equivalent to keeping n_wk unchanged after the sampling step of the epoch. Because P processors concurrently sample new topic assignments in parallel CGS, we do not necessarily sample from the correct posterior distribution. However, we can view the procedure as a stochastic optimization method that maximizes p(z | x, \u03b1, \u03b2). A justification of this viewpoint can be found in [8].\n\n3.2 Parallel Collapsed Variational Bayesian\n\nThe collapsed Gibbs sampling and the collapsed variational Bayesian inference [9] are similar in their algorithmic structures. As pointed out by Asuncion et al. [2], there are striking similarities between CGS and CVB. A single iteration of our parallel CVB also consists of P epochs, and each epoch has an updating step and a synchronization step. The updating step updates variational parameters in a similar manner to the sampling step of parallel CGS: counts in CGS are replaced by expectations and variances, and new variational parameters are computed by (2). The synchronization step involves an affine combination of the variational parameters in the natural parameter space.\n\nSince the multinomial distribution belongs to the exponential family, we can represent the multinomial distribution over K topics defined by the mean parameter \u03b3_ij in the natural parameter \u03bb_ij = (\u03bb_ijk) by \u03bb_ijk = log( \u03b3_ijk / (1 \u2212 \u03a3_{k'\u2260K} \u03b3_ijk') ) for k = 1, 2, ..., K \u2212 1, and the domain of \u03bb_ij is unconstrained. Thus maximizing L(q(\u03bb)) becomes an unconstrained optimization problem. Denote \u03bb_m = (\u03bb_ij)_{j\u2208J_m}, \u03bb = (\u03bb_0, ..., \u03bb_{P\u22121}), and let \u03bb^new and \u03bb^old be the variational parameters immediately after and before the updating step, respectively. Let \u03bb^(p) = (\u03bb^old_0, ..., \u03bb^new_p, ..., \u03bb^old_{P\u22121}). We pick \u03bb^sync as the updated \u03bb from a one-parameter class of variational parameters \u03bb(\u03bc) that combines the contributions from all processors:\n\n\u03bb(\u03bc) = \u03bb^old + \u03bc \u03a3_{i=0}^{P\u22121} (\u03bb^(i) \u2212 \u03bb^old),    \u03bc \u2265 0.\n\nTwo special cases are of interest: 1) \u03bb^sync = \u03bb(1/P) is a convex combination of the {\u03bb^(p)}; and 2) \u03bb^sync = \u03bb(1) = \u03bb^new. If (quasi)concavity [3] holds in sufficiently large neighborhoods of the sequence of \u03bb(\u03bc), say near a local maximum having a negatively defined Hessian, then L(q(\u03bb(\u03bc))) \u2265 min_p L(q(\u03bb^(p))) \u2265 L(q(\u03bb^old)) and L(q) converges locally. For the second case, we keep \u03b3^new and only update E_q[n_k] and Var_q[n_k], similarly to (3), in the synchronization step. The formulas are\n\nE[n_k] \u2190 E[n_k] + \u03a3_p (E[n^p_k] \u2212 E[n_k]),    E[n^p_k] \u2190 E[n_k]\nVar[n_k] \u2190 Var[n_k] + \u03a3_p (Var[n^p_k] \u2212 Var[n_k]),    Var[n^p_k] \u2190 Var[n_k]    (4)\n\nAlso, \u03bb(1) assigns a larger step size to the direction \u03a3_{i=0}^{P\u22121} (\u03bb^(i) \u2212 \u03bb^old); thus we can achieve a faster convergence rate if it is an ascending direction. It should be noted that our choice of \u03bb^sync does not guarantee global convergence, but we shall see that \u03bb(1) can produce models that have almost the same predictive power and variational lower bounds as single-processor CVB.\n\n3.3 Data Partition\n\nIn order to achieve maximal speedup, we need partitions that produce balanced workloads across processors, and we also want generating the data partition to consume only a small fraction of the time of the whole training process.\n\nTo present both algorithms in a unified way, we define the co-occurrence matrix R = (r_jw) as follows: for parallel CGS, r_jw is the number of occurrences of word w in document j; for parallel CVB, r_jw = 1 if w occurs at least once in j, and otherwise r_jw = 0. We define the submatrix R_mn = (r_jw), \u2200 j \u2208 J_m, w \u2208 V_n. The optimal data partition is equivalent to minimizing the following cost function:\n\nC = \u03a3_{l=0}^{P\u22121} max_{(m,n): m\u2295l=n} {C_mn},    C_mn = \u03a3_{r_jw \u2208 R_mn} r_jw    (5)\n\nThe basic operation in the proposed algorithms is either sampling topic assignments (in CGS) or updating variational parameters (in CVB). Each value of l in the first summation term in (5) is associated with one epoch. All R_mn satisfying m \u2295 l = n are the P submatrices of R whose entries are used to perform basic operations in epoch l. The number of both types of basic operations on each unique document/word pair (j, w) is r_jw, so the total number of basic operations in R_mn is C_mn for a single processor. Since all processors have to wait for the slowest processor to complete its job before a synchronization step, the maximal C_mn is the number of basic operations of the slowest processor. Thus the total number of basic operations is C. 
We define the data partition efficiency \u03b7 for given row and column partitions by\n\n\u03b7 = C_opt / C,    C_opt = \u03a3_{j\u2208J, w\u2208V} r_jw / P    (6)\n\nwhere C_opt is the theoretically minimal number of basic operations. By definition \u03b7 \u2264 1, and the higher the \u03b7, the better the partitions. Exact optimization of (5) can be achieved by solving an equivalent integer programming problem. Since integer programming is NP-hard in general, and the large number of free variables for real-world datasets makes it intractable to solve, we use a simple approximate algorithm to perform the data partitioning. In our observation, it works well empirically.\n\nWe use the convention of initial values j_0 = w_0 = 0. Our data partition algorithm divides the row index set J into disjoint subsets J_m = {j_{m\u22121} + 1, ..., j_m}, where j_m = argmin_{j'} | mC_opt \u2212 \u03a3_{j\u2264j'} \u03a3_w r_jw |. Similarly, we divide the column index set V into disjoint subsets V_n = {w_{n\u22121} + 1, ..., w_n} by w_n = argmin_{w'} | nC_opt \u2212 \u03a3_{w\u2264w'} \u03a3_j r_jw |. This algorithm is fast, since it needs only one full sweep over all word tokens or unique document/word pairs to calculate the j_m and w_n. In practice, we can run this algorithm for several random permutations of J or V and take the partitions with the highest \u03b7.\n\nWe empirically obtained high \u03b7 on large datasets with this approximate algorithm. For a word token x in the corpus, the probability that x is the word w is P(x = w), and the probability that x is in document j is P(x in j). If we assume these two distributions are independent and the x are i.i.d., then for a fixed P the law of large numbers asserts P(x \u2208 J_m) \u2248 (j_m \u2212 j_{m\u22121})/D \u2248 1/P and P(x \u2208 V_n) \u2248 (w_n \u2212 w_{n\u22121})/W \u2248 1/P. Independence gives E[C_mn] \u2248 C_opt/P, where C_mn = \u03a3_x 1{x \u2208 J_m, x \u2208 V_n}. Furthermore, the law of large numbers and the central limit theorem also give C_mn \u2248 C_opt/P, and the distribution of C_mn is approximately a normal distribution. 
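The greedy boundary search and the efficiency measure can be sketched as follows. This is a toy sketch under the definitions of (5)-(6); the co-occurrence matrix R and the helper names are made up for illustration:

```python
# Illustrative sketch (not the authors' code) of the greedy partition of
# Section 3.3: choose row (or column) boundaries so the cumulative weight is
# as close as possible to m * C_opt, then measure partition efficiency eta.
# The toy co-occurrence matrix R below is made up for the example.

def greedy_boundaries(weights, P):
    """Boundaries j_1..j_{P-1} tracking multiples of C_opt = total / P."""
    c_opt = sum(weights) / P
    bounds, cum, j = [], 0.0, 0
    for m in range(1, P):
        target = m * c_opt
        # advance while it moves the cumulative weight closer to the target
        while j < len(weights) and abs(cum + weights[j] - target) <= abs(cum - target):
            cum += weights[j]
            j += 1
        bounds.append(j)
    return bounds

def efficiency(R, row_bounds, col_bounds, P):
    # eta = C_opt / C with C = sum_l max_{m (+) l = n} C_mn  (Equations 5-6)
    rb = [0] + row_bounds + [len(R)]
    cb = [0] + col_bounds + [len(R[0])]
    def block(m, n):  # C_mn: total weight of submatrix R_mn
        return sum(R[j][w] for j in range(rb[m], rb[m + 1])
                   for w in range(cb[n], cb[n + 1]))
    C = sum(max(block(m, (m + l) % P) for m in range(P)) for l in range(P))
    return (sum(map(sum, R)) / P) / C

R = [[1, 2, 0, 1], [0, 1, 1, 2], [2, 0, 1, 1], [1, 1, 2, 0]]  # toy r_jw
P = 2
rows = greedy_boundaries([sum(r) for r in R], P)
cols = greedy_boundaries([sum(R[j][w] for j in range(4)) for w in range(4)], P)
eta = efficiency(R, rows, cols, P)
assert 0 < eta <= 1
```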
Although the independence and i.i.d. assumptions are not true for real data, the above analysis holds in an approximate way. Indeed, when P = 10, the C_mn of the NIPS and NY Times datasets (see Section 4) accepted the null hypothesis of Lilliefors' normality test at a 0.05 significance level.\n\n3.4 GPU Implementation and Data Streaming\n\nWe used a Leadtek GeForce 280 GTX GPU (G280) in this experiment. The G280 has 30 on-chip multiprocessors running at 1296 MHz, and each multiprocessor has 8 thread processors that are responsible for executing all threads deployed on the multiprocessor in parallel. The G280 has 1 GByte of on-board device memory with a memory bandwidth of 141.7 GB/s. We adopted NVIDIA's Compute Unified Device Architecture (CUDA) as our GPU programming environment. CUDA programs run in a Single Program Multiple Threads (SPMT) fashion. All threads are divided into equal-sized thread blocks. Threads in the same thread block are executed on one multiprocessor, and a multiprocessor can execute a number of thread blocks. We map a \u201cprocessor\u201d in the preceding algorithmic description to a thread block. For a word token, fine-grained parallel calculations, such as (1) and (2), are realized by parallel threads inside a thread block.\n\nGiven the limited amount of device memory on GPUs, we cannot load all training data and model parameters into device memory for large-scale datasets. However, the sequential nature of Gibbs sampling and variational Bayesian inference allows us to implement a data streaming [5] scheme which effectively reduces the GPU device memory requirements. Temporal data and variables, x_ij, z_ij and \u03b3_ij, are sent to a working space in GPU device memory on the fly. Computation and data transfer are carried out simultaneously, i.e. 
data transfer latency is hidden by computation.\n\nTable 1: datasets used in the experiments.\n\ndataset | KOS | NIPS | NYT\nNumber of documents, D | 3,430 | 1,500 | 300,000\nNumber of words, W | 6,906 | 12,419 | 102,660\nNumber of word tokens, N | 467,714 | 1,932,365 | 99,542,125\nNumber of unique document/word pairs, M | 353,160 | 746,316 | 69,679,427\n\nFigure 1: Test set perplexity versus number of processors P for KOS (left) and NIPS (right).\n\n4 Experiments\n\nWe used three text datasets retrieved from the UCI Machine Learning Repository\u00b2 for evaluation. Statistical information about these datasets is shown in Table 1. For each dataset, we randomly extracted 90% of all word tokens as the training set, and the remaining 10% of word tokens form the test set. We set \u03b1 = 50/K and \u03b2 = 0.1 in all experiments [4]. We use \u03bb^sync = \u03bb(1) in the parallel CVB, and this setting works well in all of our experiments.\n\n4.1 Perplexity\n\nWe measure the performance of the parallel algorithms using test set perplexity, defined as exp(\u2212(1/N^test) log p(x^test)). For CVB, the test set likelihood p(x^test) is computed as\n\nlog p(x^test) = \u03a3_ij log \u03a3_k \u03b8\u0304_jk \u03c6\u0304_{x_ij k},    \u03b8\u0304_jk = (\u03b1 + E[n_jk]) / (K\u03b1 + \u03a3_k E[n_jk]),    \u03c6\u0304_wk = (\u03b2 + E[n_wk]) / (W\u03b2 + E[n_k])    (7)\n\nWe report the average perplexity and the standard deviation of 10 randomly initialized runs for the parallel CVB. The typical burn-in period of CGS is about 200 iterations. We compute the likelihood p(x^test) for CGS by averaging S = 10 samples at the end of 1000 iterations from different chains:\n\nlog p(x^test) = \u03a3_ij log (1/S) \u03a3_s \u03a3_k \u03b8\u0302^s_jk \u03c6\u0302^s_{x_ij k},    \u03b8\u0302^s_jk = (\u03b1 + n^s_jk) / (K\u03b1 + \u03a3_k n^s_jk),    \u03c6\u0302^s_wk = (\u03b2 + n^s_wk) / (W\u03b2 + n^s_k)    (8)\n\nTwo small datasets, KOS and NIPS, are used in the perplexity experiment. 
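The perplexity computation of Equation (7) can be sketched with toy counts (illustrative values only, not trained parameters):

```python
# Hedged sketch of the CVB test-set perplexity in Equation (7); the expected
# counts below are made-up toy numbers, not learned values.
import math

alpha, beta = 0.5, 0.1
K, W = 2, 3
# toy expected counts E[n_jk] (docs x topics) and E[n_wk] (words x topics)
E_njk = [[3.0, 1.0], [0.5, 2.5]]
E_nwk = [[2.0, 0.0], [1.0, 2.0], [0.5, 1.5]]
E_nk = [sum(E_nwk[w][k] for w in range(W)) for k in range(K)]

def theta(j, k):  # smoothed document-topic proportion, Equation (7)
    return (alpha + E_njk[j][k]) / (K * alpha + sum(E_njk[j]))

def phi(w, k):    # smoothed topic-word proportion, Equation (7)
    return (beta + E_nwk[w][k]) / (W * beta + E_nk[k])

test_tokens = [(0, 0), (0, 1), (1, 2)]  # (document j, word w) pairs
log_lik = sum(math.log(sum(theta(j, k) * phi(w, k) for k in range(K)))
              for j, w in test_tokens)
perplexity = math.exp(-log_lik / len(test_tokens))
assert perplexity > 1.0  # a proper model's perplexity exceeds 1
```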
We computed test set perplexity for different values of K and P. Figure 1 shows the test set perplexity on KOS (left) and NIPS (right). We used the CPU to compute perplexity for P = 1 and the GPU for P = 10, 30, 60. For a fixed K, there is no significant difference between the parallel and the single-processor algorithms. This suggests that our parallel algorithms converge to models with the same predictive power, in terms of perplexity, as the single-processor LDA algorithms.\n\nPerplexity as a function of iteration number for parallel CGS and parallel CVB on NIPS is shown in Figure 2 (a) and (b), respectively. Since CVB actually maximizes the variational lower bound L(q) on the training set, we also investigated the convergence rate of the variational lower bound, computed using an exact method suggested in [9]. Figure 2 (c) shows the per-word-token variational lower bound as a function of iteration for P = 1, 10, 30 on a sampled subset of KOS (K = 64). Both parallel algorithms converge as rapidly as the single-processor LDA algorithms. Therefore, when P gets larger, the convergence rate does not curtail the speedup. We surmise that the results in Figure 2 may be due to the frequent synchronization and relatively big step sizes in our algorithms. In fact, as we decreased the number of synchronizations in the parallel CVB, the results became significantly worse. The curve \u201c\u03bc=1/P, P=10\u201d in Figure 2 (b) was obtained by setting \u03bb^sync = \u03bb(1/P). It converged considerably slower than the other curves because of its small step size.\n\n\u00b2 http://archive.ics.uci.edu/ml/datasets/Bag+of+Words\n\nFigure 2: (a) Test set perplexity as a function of iteration number for the parallel CGS on NIPS, K = 256. (b) Test set perplexity as a function of iteration number for the parallel CVB on NIPS, K = 128. (c) Variational lower bound on a dataset sampled from KOS, K = 64.\n\n4.2 Speedup\n\nThe speedup is measured against a PC equipped with an Intel quad-core 2.4 GHz CPU and 4 GBytes of memory; only one core of the CPU is used. All CPU implementations are compiled by the Microsoft C++ compiler 8.0 with -O2 optimization. We did our best to optimize the code, for example by using a better data layout and reducing redundant computation; the final CPU code is almost twice as fast as the initial code.\n\nOur speedup experiments are conducted on the NIPS dataset for both parallel algorithms, and on the large NYT dataset only for the parallel CGS, because the \u03b3_ij of the NYT dataset require too much memory to fit into our PC's host memory. We measure the speedup over a range of P, with and without data streaming. As the baseline, average running times on the CPU are: 4.24 seconds on NIPS (K = 256) and 22.1 seconds on NYT (K = 128) for the parallel CGS, and 11.1 seconds on NIPS (K = 128) for the parallel CVB. Figure 3 shows the speedup of the parallel CGS (left) and of the parallel CVB (right), with the data partition efficiency \u03b7 shown under the speedup. We note that when P > 30, more threads are deployed on each multiprocessor. 
Therefore data transfer between the device memory and the multiprocessors is better hidden by computation on the threads. As a result, we get extra speedup when the number of \u201cprocessors\u201d (thread blocks) is larger than the number of multiprocessors on the GPU.\n\nFigure 3: Speedup of parallel CGS (left) on NIPS and NYT, and speedup of parallel CVB (right) on NIPS. Average running times on the CPU are 4.24 seconds on NIPS and 22.1 seconds on NYT for the parallel CGS, and 11.1 seconds on NIPS for the parallel CVB, respectively. Although using data streaming reduces the speedup of parallel CVB due to the low bandwidth between the PC host memory and the GPU device memory, it enables us to use a GPU card to process large-volume data.\n\n
But the speedup with or without data\nstreaming differs dramatically for the parallel CVB,\nbecause its computation bandwidth is roughly \u223c 7.2\nGB/s for K = 128 due to large memory usage of \u03b3ij,\nhigher than the maximal bandwidth that data stream-\ning can provide. The high speedup of the parallel CVB\nwithout data streaming is due to a hardware supported\nexponential function and a high performance implementation of parallel reduction that is used to\nnormalize \u03b3ij calculated from (2). Figure 3 (right) shows that the larger the P , the smaller the\nspeedup for the parallel CVB with data streaming. The reason is when P becomes large, the data\nstreaming management becomes more complicated and introduces more latencies on data transfer.\nFigure 4 shows data partition ef\ufb01ciency \u03b7 of various data partition algorithms for P = 10, 30, 60 on\nNIPS. \u201ccurrent\u201d is the data partition algorithm proposed in section 3.3, \u201ceven\u201d partitions documents\nP c and wn = b nW\nP c.\nand word vocabulary into roughly equal-sized subsets by setting jm = b mD\n\u201crandom\u201d is a data partition obtained by randomly partitioning documents and words. We see that\nthe proposed data partition algorithm outperforms the other algorithms.\nMore than 20x speedup is achieved for both parallel algorithms with data streaming. The speedup\nof the parallel CGS enables us to run 1000 iterations (K=128) Gibbs sampling on the large NYT\ndataset within 1.5 hours, and it yields the same perplexity 3639 (S = 5) as the result obtained from\n30-hour training on a CPU.\n5 Related Works and Discussion\n\nFigure 4: data partition ef\ufb01ciency \u03b7 of\nvarious data partition algorithms for P =\n10, 30, 60. Due to the negligible overheads\nfor the synchronization steps, the speedup\nis proportional to \u03b7 in practice.\n\nOur work is closely related to several previous works, including the distributed LDA by Newman\net al. 
[8], the asynchronous distributed LDA by Asuncion et al. [1] and the parallelized variational EM algorithm for LDA by Nallapati et al. [7]. In these works LDA training was parallelized on distributed CPU clusters and achieved impressive speedup. Unlike those works, ours shows how to use GPUs to achieve significant, scalable speedup for LDA training while maintaining correct, accurate predictions.\n\nMasada et al. recently proposed a GPU implementation of CVB [6]. They keep one copy of n_wk while simply maintaining the same algorithmic structure for their GPU implementation as Newman et al. did on a CPU cluster. However, with the limited memory size of a GPU, compared to that of a CPU cluster, this can lead to memory access conflicts. The issue becomes severe when one runs many parallel jobs (thread blocks), and leads to wrong inference results and operation failure, as reported by Masada et al. Therefore, their method is not easily scalable, due to memory access conflicts. Different from their approach, ours is scalable to more multiprocessors thanks to the proposed partitioning scheme and data streaming. These can also be used as general techniques to parallelize other machine learning models that involve sequential operations on matrices, such as online training of matrix factorization.\n\nAcknowledgements\n\nWe thank Max Welling and David Newman for providing us with the link to the experimental data. We also thank the anonymous reviewers, Dong Zhang and Xianxing Zhang for their invaluable inputs. F. Yan conducted this research at Microsoft Research Asia. F. Yan and Y. Qi were supported by NSF IIS-0916443 and Microsoft Research.\n\nReferences\n\n[1] A. Asuncion, P. Smyth, and M. Welling. Asynchronous distributed learning of topic models. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, NIPS, pages 81\u201388. MIT Press, 2008.\n\n[2] A. 
Asuncion, M. Welling, P. Smyth, and Y. W. Teh. On smoothing and inference for topic models. In Proceedings of the International Conference on Uncertainty in Artificial Intelligence, 2009.\n\n[3] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, March 2004.\n\n[4] T. L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, 101 (suppl. 1):5228\u20135235, April 2004.\n\n[5] F. Labonte, P. Mattson, W. Thies, I. Buck, C. Kozyrakis, and M. Horowitz. The stream virtual machine. In PACT '04: Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques, pages 267\u2013277, Washington, DC, USA, 2004. IEEE Computer Society.\n\n[6] T. Masada, T. Hamada, Y. Shibata, and K. Oguri. Accelerating collapsed variational Bayesian inference for latent Dirichlet allocation with Nvidia CUDA compatible devices. In IEA-AIE, 2009.\n\n[7] R. Nallapati, W. Cohen, and J. Lafferty. Parallelized variational EM for latent Dirichlet allocation: An experimental evaluation of speed and scalability. 2007.\n\n[8] D. Newman, A. Asuncion, P. Smyth, and M. Welling. Distributed inference for latent Dirichlet allocation. In NIPS, 2007.\n\n[9] Y. W. Teh, D. Newman, and M. Welling. A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation. In B. Sch\u00f6lkopf, J. C. Platt, and T. Hoffman, editors, NIPS, pages 1353\u20131360. MIT Press, 2006.\n", "award": [], "sourceid": 546, "authors": [{"given_name": "Feng", "family_name": "Yan", "institution": null}, {"given_name": "Ningyi", "family_name": "Xu", "institution": null}, {"given_name": "Yuan", "family_name": "Qi", "institution": null}]}