Distributed Inference for Latent Dirichlet Allocation

David Newman, Arthur Asuncion, Padhraic Smyth, Max Welling
Department of Computer Science
University of California, Irvine
{newman,asuncion,smyth,welling}@ics.uci.edu

Advances in Neural Information Processing Systems, pages 1081-1088

Abstract

We investigate the problem of learning a widely-used latent-variable model, the Latent Dirichlet Allocation (LDA) or "topic" model, using distributed computation, where each of P processors only sees 1/P of the total data set. We propose two distributed inference schemes that are motivated from different perspectives. The first scheme uses local Gibbs sampling on each processor with periodic updates; it is simple to implement and can be viewed as an approximation to a single-processor implementation of Gibbs sampling. The second scheme relies on a hierarchical Bayesian extension of the standard LDA model to directly account for the fact that data are distributed across P processors; it has a theoretical guarantee of convergence but is more complex to implement than the approximate method. Using five real-world text corpora we show that distributed learning works very well for LDA models, i.e., perplexity and precision-recall scores for distributed learning are indistinguishable from those obtained with single-processor learning.
Our extensive experimental results include large-scale distributed computation on 1000 virtual processors, and speedup experiments on learning topics in a 100-million-word corpus using 16 processors.

1 Introduction

Very large data sets, such as collections of images, text, and related data, are becoming increasingly common, with examples ranging from digitized collections of books by companies such as Google and Amazon, to large collections of images at Web sites such as Flickr, to the recent Netflix customer recommendation data set. These data sets present major opportunities for machine learning, such as the ability to explore much richer and more expressive models, as well as providing new and interesting domains for the application of learning algorithms.

However, the scale of these data sets also brings significant challenges for machine learning, particularly in terms of computation time and memory requirements. For example, a text corpus with 1 million documents, each containing 1000 words on average, will require approximately 12 GBytes of memory to store its 10^9 words, which is beyond the main-memory capacity of most single-processor machines. Similarly, if one assumes that a simple operation (such as computing a probability vector over categories using Bayes' rule) takes on the order of 10^-6 seconds per word, then a full pass through 10^9 words takes 1000 seconds. Thus, algorithms that make multiple passes over a corpus of this size (as occurs in many clustering and classification algorithms) will have run times measured in days.

An obvious approach to addressing these time and memory issues is to distribute the learning algorithm over multiple processors [1, 2, 3]. In particular, with P processors, it is somewhat trivial to get around the memory problem by distributing 1/P of the total data to each processor.
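As a sanity check, the memory and run-time estimates above follow directly from the stated assumptions (roughly 12 bytes of state per word, and 10^-6 seconds per simple per-word operation; both figures are the paper's illustrative assumptions, not measurements):

```python
documents = 1_000_000
words_per_document = 1000
total_words = documents * words_per_document          # 10**9 words

bytes_per_word = 12          # assumed per-token storage cost
memory_gbytes = total_words * bytes_per_word / 10**9  # ~12 GBytes

seconds_per_word = 1e-6      # assumed cost of one simple operation
pass_seconds = total_words * seconds_per_word         # one full pass, ~1000 s
```

A few hundred such passes, as a typical sampler or clustering algorithm requires, therefore already runs into days on a single machine.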
However, the computation problem remains non-trivial for a fairly large class of learning algorithms: how to combine local processing on each of the P processors to arrive at a useful global solution.

In this general context we investigate distributed learning algorithms for the LDA model [4]. LDA models are arguably among the most successful recent learning algorithms for analyzing count data such as text. However, they can take days to learn for large corpora, and thus distributed learning would be particularly useful for this type of model.

The novel contributions of this paper are as follows:

- We introduce two algorithms that perform distributed inference for LDA models, one of which is simple to implement but does not necessarily sample from the correct posterior distribution, and the other of which optimizes the correct posterior quantity but is more complex to implement and slower to run.

- We demonstrate that both distributed algorithms produce models that are statistically indistinguishable (in terms of predictive power) from models obtained on a single processor, and that they can learn these models much faster than a single processor while requiring storage of only 1/P-th of the data on each processor.

2 Latent Dirichlet Allocation

Before introducing our distributed algorithms for LDA, we briefly review the standard LDA model. LDA models each of D documents as a mixture over T latent topics, each topic being a multinomial distribution over a W-word vocabulary. For document j, we first draw a mixing proportion θ_j = {θ_k|j} from a Dirichlet with parameter α. For the i-th word in the document, a topic z_ij is drawn with topic k chosen with probability θ_k|j; then word x_ij is drawn from the z_ij-th topic, with x_ij taking on value w with probability φ_w|z_ij. Finally, a Dirichlet prior with parameter β is placed on the topics φ_k = {φ_w|k}. Thus, the generative process is given by

  θ_k|j ~ D[α],   φ_w|k ~ D[β],   z_ij ~ θ_k|j,   x_ij ~ φ_w|z_ij.      (1)

Given the observed words x = {x_ij}, the task of Bayesian inference is to compute the posterior distribution over the latent topic indices z = {z_ij}, the mixing proportions θ, and the topics φ. An efficient procedure is to use collapsed Gibbs sampling [5], where θ and φ are marginalized out and the latent variables z are sampled. Given the current state of all but one variable z_ij, the conditional probability of z_ij is

  p(z_ij = k | z^¬ij, x, α, β) ∝ (N_wk^¬ij + β) / (N_k^¬ij + Wβ) · (N_kj^¬ij + α),      (2)

where w = x_ij, the superscript ¬ij means that the corresponding data item is excluded from the count values, N_wk is the number of times word w is assigned to topic k, and N_kj is the number of words in document j assigned to topic k. We use the convention that missing indices are summed out: N_k = Σ_w N_wk and N_j = Σ_k N_kj.

3 Distributed Inference Algorithms for LDA

We now present two versions of LDA in which the data and the parameters are distributed over distinct processors. We distribute the D documents over P processors, with D_p = D/P documents on each processor.
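As an illustration, the single-processor collapsed Gibbs update of Eq. 2 takes only a few lines. This is a minimal sketch under our own naming conventions (the function and array names are ours, not the authors'), with the count arrays N_wk, N_kj and N_k stored explicitly:

```python
import numpy as np

def gibbs_pass(docs, z, Nwk, Nkj, Nk, alpha, beta, rng):
    """One sweep of collapsed Gibbs sampling for LDA (Eq. 2).

    docs[j] is a list of word ids for document j; z[j][i] is the topic of
    the i-th word of document j.  Nwk, Nkj, Nk are the word-topic,
    topic-document, and topic count arrays.
    """
    W, T = Nwk.shape
    for j, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[j][i]
            # remove the current assignment from the counts (the ¬ij terms)
            Nwk[w, k] -= 1; Nkj[k, j] -= 1; Nk[k] -= 1
            # unnormalized conditional p(z_ij = k | rest) from Eq. 2
            p = (Nwk[w] + beta) / (Nk + W * beta) * (Nkj[:, j] + alpha)
            k = rng.choice(T, p=p / p.sum())
            # add the new assignment back into the counts
            Nwk[w, k] += 1; Nkj[k, j] += 1; Nk[k] += 1
            z[j][i] = k
    return z
```

Each step removes a word's current assignment from the counts, samples a new topic from Eq. 2, and adds the assignment back, so the counts remain consistent with z throughout the sweep.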
We partition the data x (the words from the D documents) into x = {x_1, ..., x_P} and the corresponding topic assignments into z = {z_1, ..., z_P}, where x_p and z_p exist only on processor p. Document-specific counts N_kj are likewise distributed; however, every processor maintains its own copy of the word-topic and topic counts, N_wk and N_k. We denote these processor-specific counts by N_kjp, N_wkp and N_kp.

3.1 Approximate Distributed Inference

In our Approximate Distributed LDA model (AD-LDA), we simply implement LDA on each processor, and simultaneous Gibbs sampling is performed independently on each of the P processors, as if each processor thinks it is the only processor. On processor p, given the current state of all but one variable z_ijp, the topic assignment to the i-th word in document j on processor p, z_ijp is sampled from

  p(z_ijp = k | z_p^¬ij, x_p, α, β) ∝ (N_wkp^¬ij + β) / (N_kp^¬ij + Wβ) · (N_kjp^¬ij + α),      (3)

where w = x_ijp.

Figure 1: (Left) Graphical model for LDA.
(Right) Graphical model for HD-LDA. Variables are repeated over the indices of the random variables. Square boxes indicate parameters.

Note that N_wkp is not the result of P separate LDA models running on separate data. In particular, Σ_wk N_wkp = N, the total number of words across all processors, as opposed to N_p, the number of words on processor p. After processor p has reassigned z_p, we have modified counts N_kjp, N_wkp and N_kp. To merge back to a single set of counts, after a number of Gibbs sampling steps (e.g., after a single pass through the data on each processor), we perform the global update, using a reduce-scatter operation,

  N_wk ← N_wk + Σ_p (N_wkp − N_wk),   followed by   N_wkp ← N_wk,      (4)

where the N_wk inside the sum are the counts that all processors started with before the sweep of the Gibbs sampler. The counts N_k are computed as N_k = Σ_w N_wk. Note that this global update correctly reflects the topic assignments z (i.e., N_wk can also be regenerated directly from z).

We can consider this algorithm to be an approximation to the single-processor Gibbs sampler in the following sense: at the start of each iteration, all of the processors have the same set of counts. However, as each processor starts sampling, the global count matrix is changing in a way that is unknown to each processor. Thus, in Equation 3, the sampling is not being done according to the true current global count (or true posterior distribution), but to an approximation.
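In a message-passing implementation, Eq. 4 is a reduce-scatter over the count-difference matrices. Simulated in memory (the function name is ours), the merge is just accumulation of per-processor differences:

```python
import numpy as np

def ad_lda_merge(start_Nwk, local_Nwks):
    """AD-LDA global update (Eq. 4): every processor began the sweep with
    the shared counts start_Nwk and drifted to its own local_Nwks[p];
    the merged counts accumulate all local count differences."""
    merged = start_Nwk.copy()
    for local in local_Nwks:
        merged += local - start_Nwk
    return merged
```

After the merge, every processor's copy N_wkp is reset to the merged counts before the next sweep, so all processors again start from a common state.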
We have experimented with "repairing" reversibility of the sampler by adding a phase that re-traces the Gibbs moves starting at the (global) end-state, but we found that, due to the curse of dimensionality, virtually all steps ended up being rejected.

3.2 Hierarchical Distributed Inference

A more principled way to model parallel processes is to build them directly into the probabilistic model. Imagine a parent collection of topics φ_k. This parent has P children φ_kp, which represent the topic distributions on the various processors. We assume φ_kp is sampled from φ_k according to a Dirichlet distribution with topic-dependent strength parameter β_k. The model that lives on each processor is simply an LDA model. Hence, the generative process is given by

  β_k ~ G[a, b],   φ_k ~ D[γ],   φ_kp ~ D[β_k φ_k],   θ_jp ~ D[α],   z_ijp ~ θ_jp,   x_ijp ~ φ_z_ijp,p.      (5)

The graphical model corresponding to this Hierarchical Distributed LDA (HD-LDA) is shown on the right of Figure 1, with standard LDA shown on the left for comparison. This model is different from the two other topic hierarchies we found in the literature, namely 1) the deeper version of the hierarchical Dirichlet process mentioned in [6], and 2) Pachinko allocation [7]. The first places a deeper hierarchical prior on θ (instead of on φ), while the second deals with a document-specific hierarchy of topic assignments. These types of hierarchies do not suit our need to facilitate parallel computation.

As is the case for LDA, inference for HD-LDA is most efficient if we marginalize out θ and φ_kp. We derive the following conditional probability necessary for the Gibbs sampler:

  p(z_ijp = k | z_p^¬ij, x_p, α, β, φ) ∝ (N_wkp^¬ij + β_k φ_w|k) / (N_kp^¬ij + β_k) · (N_kjp^¬ij + α).      (6)

In our experiments we learn MAP estimates for the global variables φ_k, β_k and α. Alternatively, one can derive Gibbs sampling equations using the auxiliary-variable method explained in [6], but we leave exploration of this inference technique for future research. Inference is thus based on integrating out θ and φ_kp, sampling z, and learning the MAP values of φ_k, β_k and α. The entire algorithm can be understood as expectation maximization on a collapsed space, where the M-step corresponds to MAP updates and the E-step corresponds to sampling. As such, the proposed Monte Carlo EM (MCEM) algorithm is guaranteed to converge in expectation (e.g., [8]). The MAP learning rules are derived using the bounds of [9]. They are given by fixed-point updates of the form

  β_k ← ( a − 1 + β_k Σ_p Σ_w φ_w|k [Ψ(N_wkp + β_k φ_w|k) − Ψ(β_k φ_w|k)] ) / ( b + Σ_p [Ψ(N_kp + β_k) − Ψ(β_k)] ),      (7)

with analogous updates for φ_w|k and α, where Ψ is the digamma function. Careful selection of hyper-parameters is critical to making HD-LDA work well, and we used our experience with AD-LDA to guide these choices. We set a, b and γ so that the modes of the resulting priors match the values of α and β used in our LDA and AD-LDA experiments.

We can view HD-LDA as a mixture model with P LDA mixture components, where the data have been hard-assigned to their respective clusters (processors), and the parameters of the clusters are generated from a shared prior distribution. This view clarifies the procedure we have adopted for testing: first we sample the assignment variables z_ij for the first half of the test document (analogous to folding in). Given these samples, we compute the likelihood of the test document under the model for each processor. Assuming equal prior weights for each processor, we then compute responsibilities, which are given by the likelihoods, normalized over processors.
The probability of the remainder of the test document is then given by the responsibility-weighted average over the processors.

4 Experiments

The two distributed algorithms are initialized by first randomly assigning topics to z and then, from this assignment, counting topics in documents, N_kjp, and words in topics, N_wkp, for each processor. Recall that for AD-LDA the count arrays N_wkp and N_kp are the same on every processor (initially, and after every global update). For each run of each algorithm, a sample was taken after 500 iterations of the Gibbs sampler, well after the typical burn-in period of 200-300 iterations. Multiple processors were simulated in software (by separating the data, running sequentially through each processor, and simulating the global update step), except for the speedup experiments, which were run on a 16-processor computer.

It is not obvious a priori that the AD-LDA algorithm will in general converge to a useful result. Later in this section we describe a set of systematic empirical results with AD-LDA, but we first use an illustrative toy example to provide some insight into how AD-LDA learns a model. The toy example has a vocabulary of W = 3 words. The left panel of Figure 2 shows the L1 distance between the model's estimate of a particular topic-word distribution and the true distribution, as a function of Gibbs iteration, for both single-processor LDA and AD-LDA with P = 2. LDA and AD-LDA have qualitatively the same three-phase learning dynamics.[1] The first four or so iterations ("early burn-in") correspond to somewhat random movement close to the randomly initialized starting point.

[1] For clarity, the results in this figure are plotted for a single run, single data set, etc.; we observed qualitatively similar results over a large variety of such simulations.

Figure 2: (Left) L1 distance to the mode for LDA and for P=2
AD-LDA. (Center) Projection of topics onto the simplex, showing convergence to the mode. (Right) Same setup as the center panel, but with P = 10 processors.

In the next phase ("burn-in"), both algorithms rapidly move in parameter space towards the posterior mode. Finally, at equilibrium, both are sampling around the mode. The center panel of Figure 2 plots the same run in the 2-d planar simplex corresponding to the 3-word topic distribution. This panel shows the path in parameter space of each model: a few small steps near the starting point (top right corner), a move down to the true solution (bottom left), and then sampling near the posterior mode for the rest of the iterations. For each Gibbs iteration, the parameters corresponding to each of the two individual processors, and those parameters after merging, are shown (for AD-LDA). We observed that, after the initial few iterations, the individual processor steps and the merge step each resulted in a move closer to the mode. The right panel of Figure 2 illustrates the same qualitative behavior as the center panel, but now for 10 processors. One might worry that the AD-LDA algorithm would get "trapped" close to the initial starting point, e.g., due to repeated label mismatching of the topics across processors. In practice we have consistently observed that the algorithm quickly discards such configurations (due to the stochastic nature of the moves) and "latches" onto a consistent labeling that then rapidly moves it towards the posterior mode.

It is useful to think of AD-LDA as an approximation to stochastic ascent in the space of assignment variables z. On a single processor, one can view Gibbs sampling during burn-in as a stochastic algorithm that moves up the likelihood surface. With multiple processors, each processor computes an upward direction in its own subspace, keeping all other directions fixed.
The global update step then recombines these directions by vector addition, in the same way one would compute a gradient using finite differences. This is expected to be accurate as long as the surface is locally convex or concave, but it may break down at saddle points. We conjecture that AD-LDA works reliably because saddle points are 1) unstable and 2) rare, due to the fact that the posterior appears often to be highly peaked for LDA models and high-dimensional count data sets.

To evaluate AD-LDA and HD-LDA systematically, we measured performance using test-set perplexity, computed as Perp(x^test) = exp(−(1/N^test) log p(x^test)). For every test document, half the words (chosen at random) are put in a fold-in part, and the remaining words are put in a test part. The document mix θ_j is learned using the fold-in part, and log probability is computed using this mix and the words from the test part, ensuring that the test words are never seen before being used. For AD-LDA, the perplexity computation exactly follows that of LDA, since a single set of topic counts N_wk is saved when a sample is taken. In contrast, all P copies of N_wkp are required to compute perplexity for HD-LDA, as described in the previous section. Except where stated, perplexities are computed for all algorithms using S samples from the posterior (from 10 different chains), using

  log p(x^test) = Σ_jw N^test_jw log [ (1/S) Σ_s Σ_k θ̂^s_kj φ̂^s_wk ],  with  θ̂^s_kj = (N^s_kj + α)/(N^s_j + Tα)  and  φ̂^s_wk = (N^s_wk + β)/(N^s_k + Wβ),      (8)

with the analogous expression being used for HD-LDA.

We compared LDA (Gibbs sampling on a single processor) and our two distributed algorithms, AD-LDA and HD-LDA, using three data sets: KOS (from dailykos.com), NIPS (from books.nips.cc) and NYTIMES (from ldc.upenn.edu). Each data set was split into a training set and a test set. Size parameters for these data sets are shown in Table 1; for each corpus, D is the number of documents, W is the vocabulary size and N is the total number of words.

            KOS       NIPS        NYTIMES
  D_train   3000      1500        300,000
  W         6906      12,419      102,660
  N         410,000   1,900,000   100,000,000
  D_test    430       184         34,658

Table 1: Size parameters for the three data sets used in perplexity and speedup experiments.

Using the three data sets and the three models, we computed test-set perplexities for a range of numbers of topics T and for numbers of processors P ranging from 10 to 1000 for our distributed models.

Figure 3: Test perplexity of models versus number of processors P for KOS (left) and NIPS (right). P=1 corresponds to LDA (circles); AD-LDA (crosses) and HD-LDA (squares) are shown at P=10 and P=100.

Figure 3 clearly shows that, for a fixed number of topics, the perplexity results are essentially the same whether we use single-processor LDA or either of the two algorithms with data distributed across multiple processors (either 10 or 100).
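The computation in Eq. 8 amounts to averaging the document-topic and topic-word mixtures over the S saved samples before taking logs. A sketch, with array shapes and names that are our own assumptions rather than the paper's:

```python
import numpy as np

def perplexity(Ntest, samples):
    """Test perplexity per Eq. 8.

    Ntest is a (documents x vocabulary) matrix of test-word counts;
    samples is a list of S (theta_hat, phi_hat) pairs, with theta_hat of
    shape (topics x documents) and phi_hat of shape (vocabulary x topics).
    """
    S = len(samples)
    # p[w, j] = (1/S) * sum_s sum_k phi[s][w, k] * theta[s][k, j]
    p = sum(phi @ theta for theta, phi in samples) / S
    log_prob = float((Ntest * np.log(p.T)).sum())
    return float(np.exp(-log_prob / Ntest.sum()))
```

A quick sanity check: with a single sample whose topics are uniform over a W-word vocabulary, the perplexity is exactly W.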
The \ufb01gure shows the test set perplexity for KOS\nperplexity is computed by\n(left) and NIPS (right), versus number of processors,\nLDA (circles), and we use our distributed models \u2013 AD-LDA (crosses), and HD-LDA (squares) \u2013\nperplexities. Though not shown, perplexities for AD-LDA\nto compute the\n\u0003\u000b\u0007\nremained approximately constant as the number of processors was further increased to\n\u0007\n\u0007\nfor NIPS, demonstrating effective distributed learning with only 3 documents\nfor KOS and\non each processor.\nIt is worth emphasizing that, despite no formal convergence guarantees, the\napproximate distributed algorithm converged to good solutions in every single one of the more than\none thousand experiments we did using \ufb01ve real-world data sets, plus synthesized data sets designed\nto be \u201chard\u201d to learn (i.e., topics mutually exclusively distributed over processors)\u2014page limitations\npreclude a full description of all these results in this paper.\n\n2\u0001\u0006\u0007\n\n. The\n\n\u0003\u000b\u0007\n\nand\n\n\u0003\u000b\u0007\n\nl\n\ny\nt\ni\nx\ne\np\nr\ne\nP\n\n2000\n\n1950\n\n1900\n\n1850\n\n1800\n\n1750\n\n1700\n\nLDA\nAD\u2212LDA P=10\nAD\u2212LDA P=100\nHD\u2212LDA P=10\nHD\u2212LDA P=100\n\n2000\n\n1900\n\n1800\n\n1700\n\n1600\n\n1500\n\n1400\n\n1300\n\n1200\n\n1100\n\nl\n\ny\nt\ni\nx\ne\np\nr\ne\nP\n\n50\n\n100\n\n150\n\n200\n\nIteration\n\n250\n\n300\n\n350\n\n400\n\n1000\n0\n\n100\n\n200\n\nLDA\nAD\u2212LDA P=10\nHD\u2212LDA P=10\n\n500\n\n600\n\n700\n\n300\n\n400\n\nNumber of Topics\n\nFigure 4: (Left) Test perplexity versus iteration.\n\n(Right) Test perplexity versus number of topics.\n\nTo properly determine the utility of the distributed algorithms, it is necessary to check whether the\nparallelized samplers are systematically converging more slowly than single processor sampling. 
Figure 5: (Left) Precision/recall results. (Right) Parallel speedup results.

If this were the case, it would mitigate the computational gains of parallelization. In fact, our experiments consistently showed (somewhat surprisingly) that the convergence rate for the distributed algorithms is just as rapid as for the single-processor case. As an example, Figure 4 (left) shows test perplexity versus iteration number of the Gibbs sampler (NIPS data set). During burn-in, up to iteration 200, the distributed models are actually converging slightly faster than single-processor LDA. Also note that one iteration of AD-LDA (or HD-LDA) on a parallel computer takes a fraction of the wall-clock time of one iteration of LDA.

We also investigated whether the results were sensitive to the number of topics used in the models, e.g., whether the distributed algorithms' performance diverges when the number of topics becomes very large. Figure 4 (right) shows the test-set perplexity computed on the NIPS data set, as a function of the number of topics, for the different algorithms and a fixed number of processors, P=10 (not shown here are the results for the KOS data set, which were quite
similar). The perplexities of the different algorithms closely track each other as T varies. Sometimes the distributed algorithms produce slightly lower perplexities than those of single-processor LDA. This lower perplexity may be due to, for AD-LDA, parameters constantly splitting and merging, producing an internal averaging effect; and for HD-LDA, test perplexity being computed using P copies of saved parameters.

Finally, to demonstrate that the low perplexities obtained from the distributed algorithms with P=100 processors are not just due to averaging effects, we split the NIPS corpus into one hundred 15-document collections and ran LDA separately on each of these hundred collections. Test perplexity computed by averaging the 100 separate LDA models was 2117, versus the P=100 test perplexity of 1575 for AD-LDA and HD-LDA. This shows that simple averaging of results from separate processors does not perform nearly as well as distributed coordinated learning.

Our distributed algorithms also perform well under other performance metrics. We performed precision/recall calculations using TREC's AP and FR collections and measured performance using the well-known mean average precision (MAP) metric used in IR research. Figure 5 (left) again shows that AD-LDA and HD-LDA (both using P=10) perform similarly to LDA. All three LDA models have significantly higher precision than TF-IDF on the AP and FR collections (significance was computed using a t-test at the 0.05 level). These calculations were run with a fixed number of topics T.

The per-processor, per-iteration time and space complexity of LDA and AD-LDA are shown in Table 2. AD-LDA's memory requirement scales well as collections grow, because while the number of words N can get arbitrarily large (which can be offset by increasing P), the vocabulary size W asymptotes. Similarly, the time complexity scales well, since the leading-order term NT is divided by P.
The WT term accounts for the communication cost of the reduce-scatter operation on the count differences N_wkp − N_wk, which is executed in log2(P) stages. Because of this additional WT term, parallel efficiency will depend on the ratio of the NT/P computation term to the WT communication term, with efficiency increasing as this ratio increases. The space and time complexity of HD-LDA are similar to those of AD-LDA, but HD-LDA has bigger constants.

            LDA              AD-LDA
  Space     N + (D + W)T     N/P + (D/P + W)T
  Time      NT               NT/P + WT log2(P)

Table 2: Per-processor space and per-iteration time complexity of LDA and AD-LDA.

Using our large NYTIMES data set, we performed speedup experiments on a 16-processor SMP shared-memory computer using P = 1, 2, 4, 8 and 16 processors (since we did not have access to a distributed-memory computer). The single-processor LDA run with 1000 iterations on this data set takes more than 10 days on a 3GHz workstation, so it is an ideal computation to speed up. The speedup results, shown in Figure 5 (right), show reasonable parallel efficiency, with a speedup of approximately 8.4 using P = 16 processors. This speedup reduces our NYTIMES run from 10 days (880 sec/iteration on 1 processor) to the order of 1 day (105 sec/iteration on 16 processors). Note, however, that while the implementation on an SMP machine captures some distributed effects (e.g., time to synchronize), it does not accurately reflect the extra time for communication.
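The reported per-iteration timings imply the speedup and parallel efficiency directly:

```python
sec_per_iter_1p = 880.0     # NYTIMES, 1 processor (measured)
sec_per_iter_16p = 105.0    # NYTIMES, 16 processors (measured)
iterations = 1000

speedup = sec_per_iter_1p / sec_per_iter_16p      # ~8.4x
efficiency = speedup / 16                         # ~52% parallel efficiency
days_1p = iterations * sec_per_iter_1p / 86400    # just over 10 days
days_16p = iterations * sec_per_iter_16p / 86400  # ~1.2 days
```

The sub-linear speedup reflects synchronization overhead on the shared-memory machine; on a true distributed-memory cluster, the communication term in Table 2 would also contribute.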
However, we do expect that for problems with large N relative to KW, parallel efficiency will be high.

5 Discussion and Conclusions

Prior work on parallelizing probabilistic learning algorithms has focused largely on EM-optimization algorithms, e.g., parallel updates of expected sufficient statistics for mixture models [2, 1]. In the statistical literature, running multiple MCMC chains in parallel is one approach to parallelization (e.g., the method of parallel tempering), but it requires that each processor store a copy of the full data set. Since MCMC is inherently sequential, parallel sampling using distributed subsets of the data will not in general yield a proper MCMC sampler except in special cases [10]. Mimno and McCallum [11] recently proposed the DCM-LDA model, where processor-specific sets of topics are learned independently on each processor for local subsets of the data, without any communication between processors, followed by a global clustering of the topics from the different processors. While this method is highly scalable, it does not lead to a single global set of topics that represent individual documents, nor is it defined by a generative process.

We proposed two different approaches to distributing MCMC sampling across processors for an LDA model. With AD-LDA we sample from an approximation to the posterior density by allowing different processors to concurrently sample latent topic assignments on their local subsets of the data. Despite having no formal convergence guarantees, AD-LDA works very well empirically and is easy to implement. With HD-LDA we adapt the underlying LDA model to map directly onto the distributed computational infrastructure. While this model is more complicated than AD-LDA and slower to run (because of digamma evaluations), it inherits the usual convergence properties of MCEM.
Careful selection of hyper-parameters was critical to making HD-LDA work well.

In conclusion, both of our proposed algorithms learn models with predictive performance that is no different from that of single-processor LDA. On each processor they burn in and converge at the same rate as LDA, yielding significant speedups in practice. The space and time complexity of both models make them scalable to enormous problems, for example, collections with billions to trillions of words. There are several potentially interesting research directions that can be pursued using the algorithms proposed here as a starting point, e.g., asynchronous local communication (as opposed to the synchronous global communication used in this paper) and more complex schemes that allow data to move adaptively from one processor to another. The distributed scheme of AD-LDA can also be used to parallelize other machine learning algorithms: using the same principles, we have implemented distributed versions of NMF and PLSA, and initial results suggest that these distributed algorithms also work well in practice.

6 Acknowledgements

This material is based upon work supported by the National Science Foundation: DN and PS were supported by NSF grants SCI-0225642, CNS-0551510, and IIS-0083489; AA was supported by an NSF graduate fellowship; and MW was supported by grants IIS-0535278 and IIS-0447903.

References

[1] C. Chu, S. Kim, Y. Lin, Y. Yu, G. Bradski, A. Ng, and K. Olukotun. Map-Reduce for machine learning on multicore. In NIPS 19, pages 281-288. MIT Press, Cambridge, MA, 2007.

[2] W. Kowalczyk and N. Vlassis. Newscast EM. In NIPS 17, pages 713-720. MIT Press, Cambridge, MA, 2005.

[3] A. Das, M. Datar, A. Garg, and S. Rajaram.
Google News personalization: scalable online collaborative filtering. In Proceedings of the 16th International World Wide Web Conference, 2007.

[4] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. JMLR, 3:993-1022, 2003.

[5] T. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, 101:5228-5235, 2004.

[6] Y.W. Teh, M. Jordan, M. Beal, and D. Blei. Sharing clusters among related groups: hierarchical Dirichlet processes. In NIPS 17, pages 1385-1392. MIT Press, Cambridge, MA, 2005.

[7] W. Li and A. McCallum. Pachinko allocation: DAG-structured mixture models of topic correlations. In ICML, pages 577-584, 2006.

[8] G. Wei and M. Tanner. A Monte Carlo implementation of the EM algorithm and the poor man's data augmentation algorithms. Journal of the American Statistical Association, 85(411):699-704, 1990.

[9] T. Minka. Estimating a Dirichlet distribution. http://research.microsoft.com/~minka/papers/dirichlet/, 2003.

[10] A. Brockwell. Parallel Markov chain Monte Carlo simulation by pre-fetching. Journal of Computational and Graphical Statistics, 15:246-261, 2006.

[11] D. Mimno and A. McCallum. Organizing the OCA: learning faceted subjects from a library of digital books. In Joint Conference on Digital Libraries, pages 376-385, 2007.