{"title": "Augur: Data-Parallel Probabilistic Modeling", "book": "Advances in Neural Information Processing Systems", "page_first": 2600, "page_last": 2608, "abstract": "Implementing inference procedures for each new probabilistic model is time-consuming and error-prone. Probabilistic programming addresses this problem by allowing a user to specify the model and then automatically generating the inference procedure. To make this practical it is important to generate high performance inference code. In turn, on modern architectures, high performance requires parallel execution. In this paper we present Augur, a probabilistic modeling language and compiler for Bayesian networks designed to make effective use of data-parallel architectures such as GPUs. We show that the compiler can generate data-parallel inference code scalable to thousands of GPU cores by making use of the conditional independence relationships in the Bayesian network.", "full_text": "Augur: Data-Parallel Probabilistic Modeling\n\nJean-Baptiste Tristan1, Daniel Huang2, Joseph Tassarotti3,\n1Oracle Labs {jean.baptiste.tristan, adam.pocock,\n\nAdam Pocock1, Stephen J. Green1, Guy L. Steele, Jr1\nstephen.x.green, guy.steele}@oracle.com\n2Harvard University dehuang@fas.harvard.edu\n3Carnegie Mellon University jtassaro@cs.cmu.edu\n\nAbstract\n\nImplementing inference procedures for each new probabilistic model is time-\nconsuming and error-prone. Probabilistic programming addresses this problem\nby allowing a user to specify the model and then automatically generating the\ninference procedure. To make this practical it is important to generate high per-\nformance inference code. In turn, on modern architectures, high performance re-\nquires parallel execution. In this paper we present Augur, a probabilistic modeling\nlanguage and compiler for Bayesian networks designed to make effective use of\ndata-parallel architectures such as GPUs. 
We show that the compiler can generate\ndata-parallel inference code scalable to thousands of GPU cores by making use of\nthe conditional independence relationships in the Bayesian network.\n\n1\n\nIntroduction\n\nMachine learning, and especially probabilistic modeling, can be dif\ufb01cult to apply. A user needs to\nnot only design the model, but also implement an ef\ufb01cient inference procedure. There are many dif-\nferent inference algorithms, many of which are conceptually complicated and dif\ufb01cult to implement\nat scale. This complexity makes it dif\ufb01cult to design and test new models, or to compare inference\nalgorithms. Therefore any effort to simplify the use of probabilistic models is useful.\nProbabilistic programming [1], as introduced by BUGS [2], is a way to simplify the application of\nmachine learning based on Bayesian inference. It allows a separation of concerns: the user speci\ufb01es\nwhat needs to be learned by describing a probabilistic model, while the runtime automatically gen-\nerates the how, i.e., the inference procedure. Speci\ufb01cally the programmer writes code describing a\nprobability distribution, and the runtime automatically generates an inference algorithm which sam-\nples from the distribution. Inference itself is a computationally intensive and challenging problem.\nAs a result, developing inference algorithms is an active area of research. These include determinis-\ntic approximations (such as variational methods) and Monte Carlo approximations (such as MCMC\nalgorithms). The problem is that most of these algorithms are conceptually complicated, and it is\nnot clear, especially to non-experts, which one would work best for a given model.\nIn this paper we present Augur, a probabilistic modeling system, embedded in Scala, whose design\nis guided by two observations. The \ufb01rst is that if we wish to bene\ufb01t from advances in hardware we\nmust focus on producing highly parallel inference algorithms. 
We show that many MCMC inference\nalgorithms are highly data-parallel [3, 4] within a single Markov Chain, if we take advantage of\nthe conditional independence relationships of the input model (e.g., the assumption of i.i.d. data\nmakes the likelihood independent across data points). Moreover, we can automatically generate\ngood data-parallel inference with a compiler. This inference runs ef\ufb01ciently on common highly\nparallel architectures such as Graphics Processing Units (GPUs). We note that parallelism brings\ninteresting trade-offs to MCMC performance as some inference techniques generate less parallelism\nand thus scale poorly.\n\n1\n\n\fThe second observation is that a high performance system begins by selecting an appropriate in-\nference algorithm, and this choice is often the hardest problem. For example, if our system only\nimplements Metropolis-Hastings inference, there are models for which our system will be of no\nuse, even given large amounts of computational power. We must design the system so that we can\ninclude the latest research on inference while reusing pre-existing analyses and optimizations. Con-\nsequently, we use an intermediate representation (IR) for probability distributions that serves as a\ntarget for modeling languages and as a basis for inference algorithms, allowing us to easily extend\nthe system. We will show this IR is key to scaling the system to very large networks.\nWe present two main results: \ufb01rst, some inference algorithms are highly data-parallel and a compiler\ncan automatically generate effective GPU implementations; second, it is important to use a symbolic\nrepresentation of a distribution rather than explicitly constructing a graphical model in memory,\nallowing the system to scale to much larger models (such as LDA).\n\n2 The Augur Language\n\nWe present two example model speci\ufb01cations in Augur, latent Dirichlet allocation (LDA) [5], and\na multivariate linear regression model. 
The supplementary material shows how to generate samples\nfrom the models, and how to use them for prediction. It also contains six more example probabilistic\nmodels in Augur: polynomial regression, logistic regression, a categorical mixture model, a Gaus-\nsian Mixture Model (GMM), a Naive Bayes Classi\ufb01er, and a Hidden Markov Model (HMM). Our\nlanguage is similar in form to BUGS [2] and Stan [6], except our statements are implicitly parallel.\n\n2.1 Specifying a Model\n\nThe LDA model speci\ufb01cation is shown in Figure 1a. The probability distribution is a Scala object\n(object LDA) composed of two declarations. First, we declare the support of the probability\ndistribution as a class named sig. The support of the LDA model is composed of four arrays, one\neach for the distribution of topics per document (theta), the distribution of words per topic (phi),\nthe topics assignments (z), and the words in the corpus (w). The support is used to store the inferred\nmodel parameters. These last two arrays are \ufb02at representations of ragged arrays, and thus we do\nnot require the documents to be of equal length. The second declaration speci\ufb01es the probabilistic\nmodel for LDA in our embedded domain speci\ufb01c language (DSL) for Bayesian networks. The\nDSL is marked by the bayes keyword and delimited by the enclosing brackets. The model \ufb01rst\ndeclares the parameters of the model: K for the number of topics, V for the vocabulary size, M for\nthe number of documents, and N for the array of document sizes. In the model itself, we de\ufb01ne the\nhyperparameters (values alpha and beta) for the Dirichlet distributions and sample K Dirichlets\nof dimension V for the distribution of words per topic (phi) and M Dirichlets of dimension K for\nthe distribution of topics per document (theta). 
Then, for each word in each document, we draw\na topic z from theta, and \ufb01nally a word from phi conditioned on the topic we drew for z.\nThe regression model in Figure 1b is de\ufb01ned in the same way using similar language features. In\nthis example the support comprises the (x, y) data points, the weights w, the bias b, and the noise\ntau. The model uses an additional sum function to sum across the feature vector.\n\n2.2 Using a Model\n\nOnce a model is speci\ufb01ed, it can be used as any other Scala object by writing standard Scala code.\nFor instance, one may want to use the LDA model with a training corpus to learn a distribution\nof words per topic and then use it to learn the per-document topic distribution of a test corpus. In\nthe supplementary material we provide a code sample which shows how to use an Augur model\nfor such a task. Each Augur model forms a distribution, and the runtime system generates a Dist\ninterface which provides two methods: map, which implements maximum a posteriori estimation,\nand sample, which returns a sequence of samples. 
Both of these calls require a similar set of arguments: a list of additional variables to be observed (e.g., to fix the phi values at test time in LDA), the model hyperparameters, the initial state of the model support, the model support that stores the inferred parameters, the number of MCMC samples, and the chosen inference method.

object LDA {
  class sig(var phi: Array[Double],
            var theta: Array[Double],
            var z: Array[Int],
            var w: Array[Int])

  val model = bayes {
    (K: Int, V: Int, M: Int, N: Array[Int]) => {
      val alpha = vector(K, 0.1)
      val beta = vector(V, 0.1)
      val phi = Dirichlet(V, beta).sample(K)
      val theta = Dirichlet(K, alpha).sample(M)
      val w =
        for (i <- 1 to M) yield {
          for (j <- 1 to N(i)) yield {
            val z: Int =
              Categorical(K, theta(i)).sample()
            Categorical(V, phi(z)).sample()
          }
        }
      observe(w)
  }}}

(a) An LDA model in Augur. The model specifies the distribution p(φ, θ, z | w).

object LinearRegression {
  class sig(var w: Array[Double],
            var b: Double,
            var tau: Double,
            var x: Array[Double],
            var y: Array[Double])

  val model = bayes {
    (K: Int, N: Int, l: Double, u: Double) => {
      val w = Gaussian(0, 10).sample(K)
      val b = Gaussian(0, 10).sample()
      val tau = InverseGamma(3.0, 1.0).sample()
      val x = for (i <- 1 to N)
        yield Uniform(l, u).sample(K)
      val y = for (i <- 1 to N) yield {
        val phi = for (j <- 1 to K) yield
          w(j) * x(i)(j)
        Gaussian(phi.sum + b, tau).sample()
      }
      observe(x, y)
  }}}

(b) A multivariate regression in Augur. The model specifies the distribution p(w, b, τ | x, y).

Figure 1: Example Augur programs.

3 System Architecture

We now describe how a model specification is transformed into CUDA code running on a GPU. Augur has two distinct compilation phases. 
The \ufb01rst phase transforms the block of code following\nthe bayes keyword into our IR for probability distributions, and occurs when scalac is invoked.\nThe second phase happens at runtime, when a method is invoked on the model. At that point, the IR\nis transformed, analyzed, and optimized, and then CUDA code is emitted, compiled, and executed.\nDue to these two phases, our system is composed of two distinct components that communicate\nthrough the IR: the front end, where the DSL is converted into the IR, and the back end, where\nthe IR is compiled down to the chosen inference algorithm (currently Metropolis-Hastings, Gibbs\nsampling, or Metropolis-Within-Gibbs). We use the Scala macro system to de\ufb01ne the modeling\nlanguage in the front end. The macro system allows us to de\ufb01ne a set of functions (called \u201cmacros\u201d)\nthat are executed by the Scala compiler on the code enclosed by the macro invocation. We currently\nfocus on Bayesian networks, but other DSLs (e.g., Markov random \ufb01elds) could be added without\nmodi\ufb01cations to the back end. The implementation of the macros to de\ufb01ne the Bayesian network\nlanguage is conceptually uninteresting so we omit further details.\nSeparating the compilation into two distinct phases provides many advantages. As our language is\nimplemented using Scala\u2019s macro system, it provides automatic syntax highlighting, method name\ncompletion, and code refactoring in any IDE which supports Scala. This improves the usability of the\nDSL as we require no special tools support. We also use Scala\u2019s parser, semantic analyzer (e.g., to\ncheck that variables have been de\ufb01ned), and type checker. Additionally we bene\ufb01t from scalac\u2019s\noptimizations such as constant folding and dead code elimination. 
Then, because we compile the IR to CUDA code at run time, we know the values of all the hyperparameters and the size of the dataset. This enables better optimization strategies, and also gives us important insights into how to extract parallelism (Section 4.2). For example, when compiling LDA, we know that the number of topics is much smaller than the number of documents, and thus parallelizing over documents produces more parallelism than parallelizing over topics. This is analogous to JIT compilation in modern runtime systems, where the compiler can make different decisions at runtime based upon the program state.

4 Generation of Data-Parallel Inference

We now explain how Augur generates data-parallel samplers by exploiting the conditional independence structure of the model. We will use the two examples from Section 2 to explain how the compiler analyzes the model and generates the inference code.

When we invoke an inference procedure on a model (e.g., by calling model.map), Augur compiles the IR into CUDA inference code for that model. Our aim with the IR is to make the parallelism explicit in the model and to support further analysis of the probability distributions contained within. For example, a \prod indicates that each sub-term in the expression can be evaluated in parallel. Informally, our IR expressions are generated from this Backus-Naur Form (BNF) grammar:

P ::= p(\vec{X}) \mid p(\vec{X} \mid \vec{Y}) \mid P\,P \mid \frac{1}{P} \mid \prod_{i}^{N} P \mid \int_{X} P\,dx \mid \{P\}_{c}

The use of a symbolic representation for the model is key to Augur's ability to scale to large networks. 
Indeed, as we show in the experimental study (Section 5), popular probabilistic modeling systems such as JAGS [7] or Stan [8] reify the graphical model, resulting in unreasonable memory consumption for models such as LDA. However, a consequence of our symbolic representation is that it is more difficult to discover conjugacy relationships, a point we return to later.

4.1 Generating data-parallel MH samplers

To use Metropolis-Hastings (MH) inference, the compiler emits code for a function f that is proportional to the distribution to be sampled. This code is then linked with our library implementation of MH. The function f is the product of the prior and the model likelihood and is extracted automatically from the model specification. In our regression example this function is f(x, y, \tau, b, w) = p(b)\,p(\tau)\,p(w)\,p(x)\,p(y \mid x, b, \tau, w), which we rewrite to

f(x, y, \tau, b, w) = p(b)\,p(\tau) \left( \prod_{k}^{K} p(w_k) \right) \left( \prod_{n}^{N} p(x_n)\,p(y_n \mid x_n \cdot w + b, \tau) \right)

In this form, the compiler knows that the distribution factorizes into a large number of terms that can be evaluated in parallel and then efficiently multiplied together. Each (x, y) pair contributes to the likelihood independently (i.e., the data is i.i.d.), so each pair can be evaluated in parallel and the compiler can optimize accordingly. In practice, we work in log-space, so we perform summations instead of multiplications. The compiler then generates the CUDA code to evaluate f from the IR. 
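To make the MH scheme concrete, here is a minimal random-walk Metropolis sketch for the regression model in plain Python, standing in for the generated CUDA (the Gaussian proposal, its scale, and the fixed hyperparameters are assumptions for illustration, not Augur's generated code):

```python
import math
import random

def log_f(w, b, tau, xs, ys):
    # log f = log prior + sum of per-datapoint log-likelihood terms.
    # Each term of the sum over datapoints is independent, which is
    # exactly the data-parallel part on a GPU. (tau is treated as the
    # noise variance here; an illustrative assumption.)
    lp = -(b ** 2) / (2 * 10.0 ** 2)                      # prior on b
    lp -= sum(wk ** 2 / (2 * 10.0 ** 2) for wk in w)      # prior on w
    for x, y in zip(xs, ys):
        mean = sum(wk * xk for wk, xk in zip(w, x)) + b
        lp += -0.5 * math.log(2 * math.pi * tau) - (y - mean) ** 2 / (2 * tau)
    return lp

def mh_step(w, b, tau, xs, ys, rng, scale=0.1):
    # Propose all weights and the bias jointly, accept with the
    # Hastings ratio computed in log-space.
    w_new = [wk + rng.gauss(0, scale) for wk in w]
    b_new = b + rng.gauss(0, scale)
    log_ratio = log_f(w_new, b_new, tau, xs, ys) - log_f(w, b, tau, xs, ys)
    if math.log(rng.random()) < log_ratio:
        return w_new, b_new
    return w, b

rng = random.Random(0)
xs = [[1.0], [2.0], [3.0]]
ys = [2.1, 3.9, 6.2]            # roughly y = 2x
w, b = [0.0], 0.0
for _ in range(5000):
    w, b = mh_step(w, b, 1.0, xs, ys, rng)
```

In the generated CUDA, the per-datapoint terms inside `log_f` are each computed by their own thread and combined with a parallel reduction.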
This code generation step is conceptually simple and we will not explain it further.
It is interesting to note that the code scales well despite the simplicity of this parallelization: there is a large amount of parallelism because it is roughly proportional to the number of data points; uncovering the parallelism in the code does not increase the amount of computation performed; and the ratio of computation to global memory accesses is high enough to hide the memory latency.

4.2 Generating data-parallel Gibbs samplers

Alternatively, we can generate a Gibbs sampler for conjugate models. We would prefer to generate a Gibbs sampler for LDA, as an MH sampler would have a very low acceptance ratio. To generate a Gibbs sampler, the compiler needs to figure out how to sample from each univariate conditional distribution. As an example, to draw \theta_m as part of the (\tau+1)th sample, the compiler needs to generate code that samples from the following distribution:

p(\theta_m^{\tau+1} \mid w^{\tau+1}, z^{\tau+1}, \theta_1^{\tau+1}, \ldots, \theta_{m-1}^{\tau+1}, \theta_{m+1}^{\tau}, \ldots, \theta_M^{\tau})

As we previously explained, our compiler uses a symbolic representation of the model: the advantage is that we can scale to large networks, but the disadvantage is that it is more challenging to uncover conjugacy and independence relationships between variables. To accomplish this, the compiler uses an algebraic rewrite system that aims to rewrite the above expression in terms of expressions it knows (i.e., the joint distribution of the model). We show a few selected rules below to give a flavor of the rewrite system. 
The full set of 14 rewrite rules is given in the supplementary material.

(a) \frac{P}{P} \Rightarrow 1
(b) \int P(x)\,Q\,dx \Rightarrow Q \int P(x)\,dx
(c) \prod_{i}^{N} P(x_i) \Rightarrow \prod_{i}^{N} \{P(x_i)\}_{q(i)=T} \prod_{i}^{N} \{P(x_i)\}_{q(i)=F}
(d) P(x \mid y) \Rightarrow \frac{P(x, y)}{\int P(x, y)\,dx}

Rule (a) states that like terms can be canceled. Rule (b) says that terms that do not depend on the variable of integration can be pulled out of the integral. Rule (c) says that we can partition a product over N terms into two products, one where a predicate q is true on the indexing variable and one where it is false. Rule (d) is a combination of the product and sum rules. Currently, the rewrite system comprises the rules we found useful in practice, and it is easy to extend the system with more rules.
Going back to our example, the compiler rewrites the desired expression into the one below:

\frac{1}{Z}\,p(\theta_m^{\tau+1}) \prod_{j}^{N(m)} p(z_{mj} \mid \theta_m^{\tau+1})

In this form, it is clear that each of \theta_1, \ldots, \theta_M is independent of the others after conditioning on the other random variables. As a result, they may all be sampled in parallel.
At each step, the compiler can test for a conjugacy relation. In the above form, the compiler recognizes that the z_{mj} are drawn from a categorical distribution and \theta_m is drawn from a Dirichlet, and can exploit the fact that these are conjugate distributions. The posterior distribution for \theta_m is Dirichlet(\alpha + c_m), where c_m is a vector whose kth entry is the number of z of topic k in document m. Importantly, the compiler now knows that sampling each \theta_m requires a counting phase over the z.
The case of the \phi variables is more interesting. 
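The update the compiler derives for each \theta_m (count the topic assignments, then draw from Dirichlet(\alpha + c_m)) can be sketched in plain Python, with Dirichlet variates generated by normalizing Gamma draws as in [9] (an illustrative sketch, not Augur's generated code):

```python
import random

def sample_dirichlet(alphas, rng):
    # Standard construction: normalize independent Gamma(alpha_k, 1) draws.
    gammas = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(gammas)
    return [g / total for g in gammas]

def sample_theta_m(z_m, K, alpha, rng):
    # Counting phase: c_m[k] = number of tokens in document m assigned topic k.
    counts = [0] * K
    for z in z_m:
        counts[z] += 1
    # Conjugate update: theta_m ~ Dirichlet(alpha + c_m).
    return sample_dirichlet([alpha[k] + counts[k] for k in range(K)], rng)

rng = random.Random(0)
theta = sample_theta_m([0, 0, 1, 0, 2], K=3, alpha=[0.1, 0.1, 0.1], rng=rng)
```

Since each document's update touches only that document's counts, one GPU thread (or thread block) per document suffices to draw all the \theta_m in parallel.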
In this case, we want to sample from

p(\phi_k^{\tau+1} \mid w^{\tau+1}, z^{\tau+1}, \theta^{\tau+1}, \phi_1^{\tau+1}, \ldots, \phi_{k-1}^{\tau+1}, \phi_{k+1}^{\tau}, \ldots, \phi_K^{\tau})

After applying the rewrite system to this expression, the compiler discovers that it is equal to

\frac{1}{Z}\,p(\phi_k) \prod_{i}^{M} \prod_{j}^{N(i)} \{p(w_{ij} \mid \phi_{z_{ij}})\}_{k = z_{ij}}

The key observation the compiler uses to reach this conclusion is that the z are distributed according to a categorical distribution and are used to index into the \phi array. Therefore, they partition the set of words w into K disjoint sets w_1 \uplus \ldots \uplus w_K, one for each topic. More concretely, the probability of the words drawn from topic k can be rewritten in partitioned form using rule (c) as \prod_{i}^{M} \prod_{j}^{N(i)} \{p(w_{ij} \mid \phi_{z_{ij}})\}_{k = z_{ij}}, since once a word's topic is fixed, the word depends upon only one of the \phi_k distributions. In this form, the compiler recognizes that it should draw from Dirichlet(\beta + c_k), where c_k is the count of words assigned to topic k. In general, the compiler detects this pattern when it discovers that samples drawn from categorical distributions are being used to index into arrays.
Finally, the compiler turns to analyzing the z_{ij}. It detects that they can be sampled in parallel, but it does not find a conjugacy relationship. However, it discovers that the z_{ij} are drawn from discrete distributions, so each univariate conditional distribution can be calculated exactly and sampled from. In cases where the distributions are continuous, it tries to use another approximate sampling method to sample from that variable.
One concern with such a rewrite system is that it may fail to find a conjugacy relation if the model has a complicated structure. 
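The exact update for a discrete z described above amounts to tabulating its conditional over the K topics and sampling from the resulting categorical distribution. A plain-Python sketch, assuming the standard uncollapsed-LDA form p(z = k | rest) \propto \theta_{mk}\,\phi_{k,w} (an illustration, not Augur's generated code):

```python
import random

def sample_z(theta_m, phi, word, rng):
    # Tabulate the unnormalized conditional over the K topics; this is
    # exact because the support is finite.
    weights = [theta_m[k] * phi[k][word] for k in range(len(theta_m))]
    # Inverse-CDF draw from the normalized discrete distribution.
    u = rng.random() * sum(weights)
    acc = 0.0
    for k, wgt in enumerate(weights):
        acc += wgt
        if u <= acc:
            return k
    return len(weights) - 1

rng = random.Random(0)
theta_m = [0.2, 0.8]
phi = [[0.9, 0.1], [0.1, 0.9]]   # phi[k][v]
z = sample_z(theta_m, phi, word=1, rng=rng)
```

Because each z depends only on its own word, its document's \theta, and \phi, every token's update is independent and one GPU thread per token applies.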
So far we have found our rewrite system to be robust and it can \ufb01nd all the\nusual conjugacy relations for models such as LDA, GMMs or HMMs, but it suffers from the same\nshortcomings as implementations of BUGS when deeper mathematics are required to discover a\nconjugacy relation (as would be the case for instance for a non-linear regression). In the cases where\na conjugacy relation cannot be found, the compiler will (like BUGS) resort to using Metropolis-\nHastings and therefore exploit the inherent parallelism of the model likelihood.\nFinally, note that the rewrite rules are applied deterministically and the process will always terminate\nwith the same result. Overall, the cost of analysis is negligible compared to the sampling time for\nlarge data sets. Although the rewrite system is simple, it enables us to use a concise symbolic\nrepresentation for the model and thereby scale to large networks.\n\n4.3 Data-parallel Operations on Distributions\n\nTo produce ef\ufb01cient code, the compiler needs to uncover parallelism, but we also need a library of\ndata-parallel operations for distributions. For instance, in LDA, there are two steps where we sample\nfrom many Dirichlet distributions in parallel. When drawing the per document topic distributions,\neach thread can draw a \u03b8i by generating K Gamma variates and normalizing them [9]. Since the\n\n5\n\n\fnumber of documents is usually very large, this produces enough parallelism to make full use of\nthe GPU\u2019s cores. However, this may not produce suf\ufb01cient parallelism when drawing the \u03c6k, be-\ncause the number of topics is usually small compared to the number of cores. Consequently, we\nuse a different procedure which exposes more parallelism (the algorithm is given in the supplemen-\ntary material). To generate K Dirichlet variates over V categories with concentration parameters\n\u03b111, . . . 
, \u03b1KV , we \ufb01rst generate a matrix A where Aij \u223c Gamma(\u03b1ij) and then normalize each row\nof this matrix. To sample the \u03b8i, we could launch a thread per row. However, as the number of\ncolumns is much larger than the number of rows, we launch a thread to generate the gamma variates\nfor each column, and then separately compute a normalizing constant for each row by multiplying\nthe matrix with a vector of ones using CUBLAS. This is an instance where the two-stage compilation\nprocedure (Section 3) is useful, because the compiler is able to use information about the relative\nsizes of K and V to decide that the complex scheme will be more ef\ufb01cient than the simple scheme.\nThis sort of optimization is not unique to the Dirichlet distribution. For example, when sampling\na large number of multivariate normals by applying a linear transformation to a vector of normal\nsamples, the strategy for extracting parallelism may change based on the number of samples to\ngenerate, the dimension of the multinormal, and the number of GPU cores. We found that issues\nlike these were crucial to generating high-performance data-parallel samplers.\n\n4.4 Parallelism & Inference Tradeoffs\n\nIt is dif\ufb01cult to give a cost model for Augur programs. Traditional approaches are not necessarily\nappropriate for probabilistic inference because there are tradeoffs between faster sampling times and\nconvergence which are not easy to characterize. In particular, different inference methods may affect\nthe amount of parallelism that can be exploited in a model. For example, in the case of multivariate\nregression, we can use the Metropolis-Hastings sampler presented above, which lets us sample from\nall the weights in parallel. However, we may be better off generating a Metropolis-Within-Gibbs\nsampler where the weights are sampled one at a time. 
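The Metropolis-Within-Gibbs alternative just mentioned, where the weights are updated one at a time, can be sketched in plain Python (the proposal scale, priors, and data are assumptions for illustration; note that the per-datapoint likelihood terms inside each Hastings ratio remain independent, and thus parallelizable):

```python
import math
import random

def log_lik(w, b, tau, xs, ys):
    # Sum of independent per-datapoint terms: the data-parallel part.
    lp = 0.0
    for x, y in zip(xs, ys):
        mean = sum(wk * xk for wk, xk in zip(w, x)) + b
        lp += -0.5 * math.log(2 * math.pi * tau) - (y - mean) ** 2 / (2 * tau)
    return lp

def mwg_sweep(w, b, tau, xs, ys, rng, scale=0.2):
    # Metropolis-Within-Gibbs: one MH accept/reject per weight. Less
    # parallelism across weights, but each likelihood evaluation can
    # still be computed in parallel over the data.
    for k in range(len(w)):
        proposal = list(w)
        proposal[k] = w[k] + rng.gauss(0, scale)
        log_prior_diff = (w[k] ** 2 - proposal[k] ** 2) / (2 * 10.0 ** 2)
        log_ratio = (log_prior_diff
                     + log_lik(proposal, b, tau, xs, ys)
                     - log_lik(w, b, tau, xs, ys))
        if math.log(rng.random()) < log_ratio:
            w = proposal
    return w

rng = random.Random(1)
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
ys = [1.0, -1.1, 0.1, 1.2]      # roughly w = (1, -1), b = 0
w = [0.0, 0.0]
for _ in range(2000):
    w = mwg_sweep(w, 0.0, 0.5, xs, ys, rng)
```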
This reduces the amount of exploitable\nparallelism, but it may converge faster, and there may still be enough parallelism in each calculation\nof the Hastings ratio by evaluating the likelihood in parallel.\nMany of the optimizations in the literature that improve the mixing time of a Gibbs sampler, such as\nblocking or collapsing, reduce the available parallelism by introducing dependencies between previ-\nously independent variables. In a system like Augur it is not always bene\ufb01cial to eliminate variables\n(e.g., by collapsing) if it introduces more dependencies for the remaining variables. Currently Au-\ngur cannot generate a blocked or collapsed sampler, but there is interesting work on automatically\nblocking or collapsing variables [10] that we wish to investigate in the future. Our experimental\nresults on LDA demonstrate this tradeoff between mixing and runtime. There we show that while\na collapsed Gibbs sampler converges more quickly in terms of the number of samples compared to\nan uncollapsed sampler, the uncollapsed sampler converges more quickly in terms of runtime. This\nis due to the uncollapsed sampler having much more available parallelism. We hope that as more\noptions and inference strategies are added to Augur, users will be able to experiment further with the\ntradeoffs of different inference methods in a way that would be too time-consuming to do manually.\n\n5 Experimental Study\n\nWe provide experimental results for the two examples presented throughout the paper and in the\nsupplementary material for a Gaussian Mixture Model (GMM). More detailed information on the\nexperiments can be found in the supplementary material.\nTo test multivariate regression and the GMM, we compare Augur\u2019s performance to those of two\npopular languages for statistical modeling, JAGS [7] and Stan [8]. JAGS is an implementation of\nBUGS, and performs inference using Gibbs sampling, adaptive MH, and slice sampling. 
Stan uses a No-U-Turn sampler, a variant of Hamiltonian Monte Carlo. For the regression, we configured Augur to use MH,1 while for the GMM Augur generated a Gibbs sampler. In our LDA experiments we also compare Augur to a handwritten CUDA implementation of a Gibbs sampler, and to the collapsed Gibbs sampler [12] from the Factorie library [13]. The former is a comparison for an optimised GPU implementation, while the latter is a baseline for a CPU Scala implementation.

1 Augur could not generate a Gibbs sampler for regression, as the conjugacy relation for the weights is not a simple application of conjugacy rules [11]. JAGS avoids this issue by adding specific rules for linear regression.

(a) Multivariate linear regression results on the UCI WineQuality-red dataset (RMSE vs. training time for Augur, JAGS, and Stan).

(b) Predictive probability vs. time for up to 2048 samples with three LDA implementations: Augur, hand-written CUDA, and Factorie's collapsed Gibbs.

Figure 2: Experimental results on multivariate linear regression and LDA.

5.1 Experimental Setup

For the linear regression experiment, we used data sets from the UCI regression repository [14]. 
The\nGaussian Mixture Model experiments used two synthetic data sets, one generated from 3 clusters,\nthe other from 4 clusters. For the LDA benchmark, we used a corpus extracted from the simple\nEnglish variant of Wikipedia, with standard stopwords removed. This corpus has 48556 documents,\na vocabulary size of 37276 words, and approximately 3.3 million tokens. From that we sampled\n1000 documents to use as a test set, removing words which appear only in the test set. To evaluate\nthe model we measure the log predictive probability [15] on the test set.\nAll experiments ran on a single workstation with an Intel Core i7 4820k CPU, 32 GB RAM, and an\nNVIDIA GeForce Titan Black. The Titan Black uses the Kepler architecture. All probability values\nare calculated in double precision. The CPU performance results using Factorie are calculated using\na single thread, as the multi-threaded samplers are neither stable nor performant in the tested release.\nThe GPU results use all 960 double-precision ALU cores available in the Titan Black. The Titan\nBlack has 2880 single-precision ALU cores, but single precision resulted in poor quality inference\nresults, though the speed was greatly improved.\n\n5.2 Results\n\nIn general, our results show that once the problem is large enough we can amortize Augur\u2019s startup\ncost of model compilation to CUDA, nvcc compilation to a GPU binary, and copying the data to\nand from the GPU. This cost is approximately 9 seconds averaged across all our experiments. After\nthis point Augur scales to larger numbers of samples in shorter runtimes than comparable systems.\nIn this mode we are using Augur to \ufb01nd a likely set of parameters rather than generating a set of\nsamples with a large effective sample size for posterior estimation. 
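The evaluation metric used above, per-token log predictive probability, reduces to averaging the log-probability the model assigns to each held-out token. A plain-Python sketch, under the simplifying assumption that it is computed from point estimates of \theta and \phi (the estimator of [15] is more involved):

```python
import math

def log10_predictive(docs, theta, phi):
    # docs: list of documents, each a list of word ids;
    # theta[m][k] and phi[k][v] are assumed point estimates.
    total, tokens = 0.0, 0
    for m, doc in enumerate(docs):
        for v in doc:
            # Marginalize the token's topic: p(v) = sum_k theta[m][k] * phi[k][v].
            p = sum(theta[m][k] * phi[k][v] for k in range(len(phi)))
            total += math.log10(p)
            tokens += 1
    return total / tokens   # average log10 probability per held-out token

theta = [[0.5, 0.5]]
phi = [[0.9, 0.1], [0.1, 0.9]]
score = log10_predictive([[0, 1, 0]], theta, phi)
```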
We have not investigated the\neffective sample size vs runtime tradeoff, though the MH approach we use for regression is likely to\nhave a lower effective sample size than the HMC used in Stan.\nOur linear regression experiments show that Augur\u2019s inference is similar to JAGS in runtime and\nperformance, and better than Stan. Augur takes longer to converge as it uses MH, though once we\nhave amortized the compilation time it draws samples very quickly. The regression datasets tend to\nbe quite small in terms of both number of random variables and number of datapoints, so it is harder\nto amortize the costs of GPU execution. However, the results are very different for models where the\nnumber of inferred parameters grows with the data set. In the GMM example in the supplementary,\n\n7\n\n\fwe show that Augur scales to larger problems than JAGS or Stan. For 100, 000 data points, Augur\ndraws a thousand samples in 3 minutes while JAGS takes more than 21 minutes and Stan requires\nmore than 6 hours. Each system found the correct means and variances for the clusters; our aim was\nto measure the scaling of runtime with problem size.\nResults from the LDA experiment are presented in Figure 2b and use predictive probability to mon-\nitor convergence over time. We compute the predictive probability and record the time (in seconds)\nafter drawing 2i samples, for i ranging from 0 to 11 inclusive. It takes Augur 8.1 seconds to draw its\n\ufb01rst sample for LDA. Augur\u2019s performance is very close to that of the hand-written CUDA imple-\nmentation, and much faster than the Factorie collapsed Gibbs sampler. Indeed, it takes the collapsed\nLDA implementation 6.7 hours longer than Augur to draw 2048 samples. We note that the col-\nlapsed Gibbs sampler appears to have converged after 27 samples, in approximately 27 minutes.\nThe uncollapsed implementations converge after 29 samples, in approximately 4 minutes. 
We also implemented LDA in JAGS and Stan, but they ran into scalability issues. The Stan version of LDA (taken from the Stan user's manual [6]) uses 55 GB of RAM but failed to draw a sample in a week of computation time. We could not test JAGS as it required more than 128 GB of RAM. In comparison, Augur uses less than 1 GB of RAM for this experiment.

6 Related Work

Augur is similar to probabilistic modeling languages such as BUGS [16], Factorie [13], Dimple [17], Infer.NET [18], and Stan [8]. This family of languages explicitly represents a probability distribution, restricting the expressiveness of the modeling language to improve performance. For example, Factorie, Dimple, and Infer.NET provide languages for factor graphs, enabling these systems to take advantage of specific efficient inference algorithms (e.g., belief propagation). Stan, while Turing complete, focuses on probabilistic models with continuous variables using a No-U-Turn sampler (recent versions also support discrete variables). In contrast, Augur focuses on Bayesian networks, allowing a compact symbolic representation and enabling the generation of data-parallel code.
Another family of probabilistic programming languages is characterized by the ability to express all computable generative models by reasoning over execution traces which implicitly represent probability distributions. These are typically Turing-complete languages with probabilistic primitives, and include Venture [19], Church [20], and Figaro [21]. Augur and the modeling languages described above are less expressive than these languages, and so describe a restricted set of probabilistic programs.
However, performing inference over program traces generated by a model, instead of over the model itself, makes it more difficult to generate an efficient inference algorithm.

7 Discussion

We show that it is possible to automatically generate parallel MCMC inference algorithms, and that it is possible to extract sufficient parallelism to saturate a modern GPU with thousands of cores. The choice of a Single-Instruction Multiple-Data (SIMD) architecture such as a GPU is central to the success of Augur, as it allows many parallel threads with low overhead. Creating thousands of CPU threads is less effective, as each thread has too little work to amortize the overhead. GPU threads are comparatively cheap, and this allows for many small parallel tasks (like likelihood calculations for a single datapoint). Our compiler achieves this parallelization with no extra information beyond that which is normally encoded in a graphical model description, and uses a symbolic representation that allows scaling to large models (particularly for latent variable models like LDA). It also makes it easy to run different inference algorithms and evaluate the tradeoffs between convergence and sampling time. The generated inference code is competitive in terms of model performance with other probabilistic modeling systems, and can sample from large problems much more quickly.
The current version of Augur runs on a single GPU, which introduces another tier into the memory hierarchy, as data and samples need to be streamed between the GPU's memory and main memory. Augur does not currently support problems larger than GPU memory, though it is possible to analyse the generated inference code and automatically generate the data movement code [22]. This movement code can execute concurrently with the sampling process.
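The per-datapoint parallelism described above can be illustrated outside of CUDA. The sketch below is a minimal Python/NumPy stand-in for the GPU pattern (the function name is ours, not Augur's): each datapoint contributes one independent Gaussian log-likelihood term, which is the kind of small task that maps to a single GPU thread, and the terms are then combined by a reduction.

```python
import numpy as np

def data_parallel_loglik(x, mu, sigma):
    # Each datapoint's log-likelihood term is independent of the others,
    # so on a GPU each term would be computed by its own thread (the
    # "map" phase of the computation)...
    per_point = (-0.5 * np.log(2.0 * np.pi * sigma ** 2)
                 - (x - mu) ** 2 / (2.0 * sigma ** 2))
    # ...and the terms would then be summed by a parallel reduction.
    return per_point.sum()
```

The same map-then-reduce structure applies to any likelihood that factorizes over conditionally independent observations, which is what the compiler's conditional independence analysis exposes.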
One area we have not investigated is expanding Augur to clusters of GPUs, though this will introduce the synchronization problems others have encountered when scaling up MCMC [23].

References
[1] N. D. Goodman. The principles and practice of probabilistic programming. In Proc. of the 40th ACM Symp. on Principles of Programming Languages, POPL '13, pages 399–402, 2013.
[2] A. Thomas, D. J. Spiegelhalter, and W. R. Gilks. BUGS: A program to perform Bayesian inference using Gibbs sampling. Bayesian Statistics, 4:837–842, 1992.
[3] W. D. Hillis and G. L. Steele, Jr. Data parallel algorithms. Comm. of the ACM, 29(12):1170–1183, 1986.
[4] G. E. Blelloch. Programming parallel algorithms. Comm. of the ACM, 39:85–97, 1996.
[5] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
[6] Stan Dev. Team. Stan Modeling Language User's Guide and Reference Manual, Version 2.2, 2014.
[7] M. Plummer. JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling. In 3rd International Workshop on Distributed Statistical Computing (DSC 2003), pages 20–22, 2003.
[8] M. D. Hoffman and A. Gelman. The No-U-Turn Sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15:1593–1623, 2014.
[9] G. Marsaglia and W. W. Tsang. A simple method for generating gamma variables. ACM Trans. Math. Softw., 26(3):363–372, 2000.
[10] D. Venugopal and V. Gogate. Dynamic blocking and collapsing for Gibbs sampling. In 29th Conf. on Uncertainty in Artificial Intelligence, UAI '13, 2013.
[11] R. Neal. CSC 2541: Bayesian methods for machine learning, 2013. Lecture 3.
[12] T. L. Griffiths and M. Steyvers. Finding scientific topics. In Proc. of the National Academy of Sciences, volume 101, 2004.
[13] A. McCallum, K. Schultz, and S. Singh.
Factorie: Probabilistic programming via imperatively defined factor graphs. In Neural Information Processing Systems 22, pages 1249–1257, 2009.
[14] K. Bache and M. Lichman. UCI machine learning repository, 2013.
[15] M. Hoffman, D. Blei, C. Wang, and J. Paisley. Stochastic variational inference. Journal of Machine Learning Research, 14:1303–1347, 2013.
[16] D. Lunn, D. Spiegelhalter, A. Thomas, and N. Best. The BUGS project: Evolution, critique and future directions. Statistics in Medicine, 2009.
[17] S. Hershey, J. Bernstein, B. Bradley, A. Schweitzer, N. Stein, T. Weber, and B. Vigoda. Accelerating inference: Towards a full language, compiler and hardware stack. CoRR, abs/1212.2991, 2012.
[18] T. Minka, J. M. Winn, J. P. Guiver, and D. A. Knowles. Infer.NET 2.5, 2012. Microsoft Research Cambridge.
[19] V. K. Mansinghka, D. Selsam, and Y. N. Perov. Venture: a higher-order probabilistic programming platform with programmable inference. CoRR, abs/1404.0099, 2014.
[20] N. D. Goodman, V. K. Mansinghka, D. Roy, K. Bonawitz, and J. B. Tenenbaum. Church: A language for generative models. In 24th Conf. on Uncertainty in Artificial Intelligence, UAI 2008, pages 220–229, 2008.
[21] A. Pfeffer. Figaro: An object-oriented probabilistic programming language. Technical report, Charles River Analytics, 2009.
[22] J. Ragan-Kelley, C. Barnes, A. Adams, S. Paris, F. Durand, and S. Amarasinghe. Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. ACM SIGPLAN Notices, 48(6):519–530, 2013.
[23] A. Smola and S. Narayanamurthy. An architecture for parallel topic models.
Proceedings of the VLDB Endowment, 3(1-2):703–710, 2010.