{"title": "Nonstandard Interpretations of Probabilistic Programs for Efficient Inference", "book": "Advances in Neural Information Processing Systems", "page_first": 1152, "page_last": 1160, "abstract": "Probabilistic programming languages allow modelers to specify a stochastic process using syntax that resembles modern programming languages. Because the program is in machine-readable format, a variety of techniques from compiler design and program analysis can be used to examine the structure of the distribution represented by the probabilistic program. We show how nonstandard interpretations of probabilistic programs can be used to craft efficient inference algorithms: information about the structure of a distribution (such as gradients or dependencies) is generated as a monad-like side computation while executing the program. These interpretations can be easily coded using special-purpose objects and operator overloading. We implement two examples of nonstandard interpretations in two different languages, and use them as building blocks to construct inference algorithms: automatic differentiation, which enables gradient based methods, and provenance tracking, which enables efficient construction of global proposals.", "full_text": "Nonstandard Interpretations of Probabilistic Programs for Efficient Inference\n\nDavid Wingate\nBCS / LIDS, MIT\nwingated@mit.edu\n\nNoah D. Goodman\nPsychology, Stanford\nngoodman@stanford.edu\n\nAndreas Stuhlm\u00fcller\nBCS, MIT\nast@mit.edu\n\nJeffrey M. Siskind\nECE, Purdue\nqobi@purdue.edu\n\nAbstract\n\nProbabilistic programming languages allow modelers to specify a stochastic process using syntax that resembles modern programming languages. Because the program is in machine-readable format, a variety of techniques from compiler design and program analysis can be used to examine the structure of the distribution represented by the probabilistic program. 
We show how nonstandard interpretations of probabilistic programs can be used to craft efficient inference algorithms: information about the structure of a distribution (such as gradients or dependencies) is generated as a monad-like side computation while executing the program. These interpretations can be easily coded using special-purpose objects and operator overloading. We implement two examples of nonstandard interpretations in two different languages, and use them as building blocks to construct inference algorithms: automatic differentiation, which enables gradient-based methods, and provenance tracking, which enables efficient construction of global proposals.\n\n1 Introduction\n\nProbabilistic programming simplifies the development of probabilistic models by allowing modelers to specify a stochastic process using syntax that resembles modern programming languages. These languages permit arbitrary mixing of deterministic and stochastic elements, resulting in tremendous modeling flexibility. The resulting programs define probabilistic models that serve as prior distributions: running the (unconditional) program forward many times results in a distribution over execution traces, with each trace being a sample from the prior. Examples include BLOG [13], Bayesian Logic Programs [10], IBAL [18], CHURCH [6], Stochastic MATLAB [28], and HANSEI [11].\n\nThe primary challenge in developing such languages is scalable inference. Inference can be viewed as reasoning about the posterior distribution over execution traces conditioned on a particular program output, and is difficult because of the flexibility these languages present: in principle, an inference algorithm must behave reasonably for any program a user wishes to write. Sample-based MCMC algorithms are the state-of-the-art method, due to their simplicity, universality, and compositionality. 
But in probabilistic modeling more generally, efficient inference algorithms are designed by taking advantage of structure in distributions. How can we find structure in a distribution defined by a probabilistic program? A key observation is that some languages, such as CHURCH and Stochastic MATLAB, are defined in terms of an existing (non-probabilistic) language. Programs in these languages may literally be executed in their native environments, suggesting that tools from program analysis and programming language theory can be leveraged to find and exploit structure in the program for inference, much as a compiler might find and exploit structure for performance.\n\nHere, we show how nonstandard interpretations of probabilistic programs can help craft efficient inference algorithms. Information about the structure of a distribution (such as gradients, dependencies or bounds) is generated as a monad-like side computation while executing the program. This extra information can be used to, for example, construct good MH proposals, or search efficiently for a local maximum. We focus on two such interpretations: automatic differentiation and provenance tracking, and show how they can be used as building blocks to construct efficient inference algorithms. We implement nonstandard interpretations in two different languages (CHURCH and Stochastic MATLAB), and experimentally demonstrate that while they typically incur some additional execution overhead, they dramatically improve inference performance.\n\n2 Background and Related Work\n\nWe begin by outlining our setup, following [28]. We define an unconditioned probabilistic program to be a parameterless function f with an arbitrary mix of stochastic and deterministic elements (hereafter, we will use the terms function and program interchangeably). The function f may be written in any language, but our running example will be MATLAB. 
We allow the function to be arbitrarily complex inside, using any additional functions, recursion, language constructs or external libraries it wishes. The only constraint is that the function must be self-contained, with no external side-effects which would impact the execution of the function from one run to another.\n\nAlg. 1: A Gaussian-Gamma mixture\n1: for i=1:1000\n2:   if ( rand > 0.5 )\n3:     X(i) = randn;\n4:   else\n5:     X(i) = gammarnd;\n6:   end;\n7: end;\n\nThe stochastic elements of f must come from a set of known, fixed elementary random primitives, or ERPs. Complex distributions are constructed compositionally, using ERPs as building blocks. In MATLAB, ERPs may be functions such as rand (sample uniformly from [0,1]) or randn (sample from a standard normal). Higher-order random primitives, such as nonparametric distributions, may also be defined, but must be fixed ahead of time. Formally, let T be the set of ERP types. We assume that each type t \u2208 T is a parametric family of distributions p_t(x | \u03b8_t), with parameters \u03b8_t.\n\nNow, consider what happens while executing f. As f is executed, it encounters a series of ERPs. Alg. 1 shows an example of a simple f written in MATLAB with three syntactic ERPs: rand, randn, and gammarnd. During execution, depending on the return value of each call to rand, different paths will be taken through the program, and different ERPs will be encountered. We call this path an execution trace. A total of 2000 random choices will be made when executing this f.\n\nLet f_{k|x_1,\u22ef,x_{k\u22121}} be the k\u2019th ERP encountered while executing f, and let x_k be the value it returns. Note that the parameters passed to the k\u2019th ERP may change depending on previous x_k\u2019s (indeed, its type may also change, as well as the total number of ERPs). We denote by x all of the random choices which are made by f, so f defines the probability distribution p(x). 
In our example, x \u2208 R^{2000}. The probability p(x) is the product of the probability of each individual ERP choice:\n\np(x) = \u220f_{k=1}^{K} p_{t_k}(x_k | \u03b8_{t_k}, x_1, \u22ef, x_{k\u22121})    (1)\n\nagain noting explicitly that types and parameters may depend arbitrarily on previous random choices. To simplify notation, we will omit the conditioning on the values of previous ERPs, but again wish to emphasize that these dependencies are critical and cannot be ignored. By f_k, it should therefore be understood that we mean f_{k|x_1,\u22ef,x_{k\u22121}}, and by p_{t_k}(x_k | \u03b8_{t_k}) we mean p_{t_k}(x_k | \u03b8_{t_k}, x_1, \u22ef, x_{k\u22121}).\n\nGenerative functions as described above are, of course, easy to write. A much harder problem, and our goal in this paper, is to reason about the posterior conditional distribution p(x|y), where we define y to be a subset of random choices which we condition on and (in an abuse of notation) x to be the remaining random choices. For example, we may condition f on the X(i)\u2019s, and reason about the sequence of rand\u2019s most likely to generate the X(i)\u2019s. For the rest of this paper, we will drop y and simply refer to p(x), but it should be understood that the goal is always to perform inference in conditional distributions.\n\n2.1 Nonstandard Interpretations of Probabilistic Programs\n\nWith an outline of probabilistic programming in hand, we now turn to nonstandard interpretations. The idea of nonstandard interpretations originated in model theory and mathematical logic, where it was proposed that a set of axioms could be interpreted by different models. For example, differential geometry can be considered a nonstandard interpretation of classical arithmetic.\n\nIn programming, a nonstandard interpretation replaces the domain of the variables in the program with a new domain, and redefines the semantics of the operators in the program to be consistent with the new domain. 
This allows reuse of program syntax while implementing new functionality. For example, the expression \u201ca \u2217 b\u201d can be interpreted equally well if a and b are either scalars or matrices, but the \u201c\u2217\u201d operator takes on different meanings. Practically, many useful nonstandard interpretations can be implemented with operator overloading: variables are redefined to be objects with operators that implement special functionality, such as tracing, reference counting, or profiling.\n\nFor the purposes of inference in probabilistic programs, we will augment each random choice x_k with additional side information s_k, and replace each x_k with the tuple \u27e8x_k, s_k\u27e9. The native interpreter for the probabilistic program can then interpret the source code as a sequence of operations on these augmented data types. For a recent example of this, we refer the reader to [24].\n\n3 Automatic Differentiation\n\nFor probabilistic models with many continuous-valued random variables, the gradient of the likelihood \u2207_x p(x) provides local information that can significantly improve the properties of Monte-Carlo inference algorithms. For instance, Langevin Monte-Carlo [20] and Hamiltonian MCMC [15] use this gradient as part of a variable-augmentation technique (described below). We would like to be able to use gradients in the probabilistic-program setting, but p(x) is represented implicitly by the program. How can we compute its gradient? We use automatic differentiation (AD) [3, 7], a nonstandard interpretation that automatically constructs \u2207_x p(x). 
The automatic nature of AD is critical because it relieves the programmer from hand-computing derivatives for each model; moreover, some probabilistic programs dynamically create or delete random variables, making simple closed-form expressions for the gradient very difficult to find.\n\nUnlike finite differencing, AD computes an exact derivative of a function f at a point (up to machine precision). To do this, AD relies on the chain rule to decompose the derivative of f into derivatives of its sub-functions: ultimately, known derivatives of elementary functions are composed together to yield the derivative of the compound function. This composition can be computed as a nonstandard interpretation of the underlying elementary functions.\n\nThe derivative computation as a composition of the derivatives of the elementary functions can be performed in different orders. In forward mode AD [27], computation of the derivative proceeds by propagating perturbations of the input toward the output. This can be done by a nonstandard interpretation that extends each real value to the first two terms of its Taylor expansion [26], overloading each elementary function to operate on these real \u201cpolynomials\u201d. Because the derivatives of f at c can be extracted from the coefficients of \u03b5 in f(c + \u03b5), this allows computation of the gradient. In reverse mode AD [25], computation of the derivative proceeds by propagating sensitivities of the output toward the input. One way this can be done is by a nonstandard interpretation that extends each real value into a \u201ctape\u201d that captures the trace of the real computation which led to that value from the inputs, overloading each elementary function to incrementally construct these tapes. 
Such a tape can be postprocessed, in a fashion analogous to backpropagation [21], to yield the gradient. These two approaches have complementary computational tradeoffs: reverse mode (which we use in our implementation) can compute the gradient of a function f : R^n \u2192 R with the same asymptotic time complexity as computing f, but not the same asymptotic space complexity (due to its need for saving the computation trace), while forward mode can compute the gradient with the same asymptotic space complexity, but with a factor of O(n) slowdown (due to its need for constructing the gradient out of partial derivatives along each independent variable).\n\nThere are implementations of AD for many languages, including SCHEME (e.g., [17]), FORTRAN (e.g., ADIFOR [2]), C (e.g., ADOL\u2013C [8]), C++ (e.g., FADBAD++ [1]), MATLAB (e.g., INTLAB [22]), and MAPLE (e.g., GRADIENT [14]). See www.autodiff.org. Additionally, overloading and AD are well established techniques that have been applied to machine learning, and even to application-specific programming languages for machine learning, e.g., LUSH [12] and DYNA [4]. In particular, DYNA applies a nonstandard interpretation for \u2227 and \u2228 as a semiring (\u00d7 and +, + and max, . . .) in a memoizing PROLOG to generalize Viterbi, forward/backward, inside/outside, etc. and uses AD to derive the outside algorithm from the inside algorithm and support parameter estimation, but unlike probabilistic programming, it does not model general stochastic processes and does not do general inference over such. Our use of overloading and AD differs in that it facilitates inference in complicated models of general stochastic processes formulated as probabilistic programs. Probabilistic programming provides a powerful and convenient framework for formulating complicated models and, more importantly, separating such models from orthogonal inference mechanisms. 
Moreover, overloading provides a convenient mechanism for implementing many such inference mechanisms (e.g., Langevin MC, Hamiltonian MCMC, Provenance Tracking, as demonstrated below) in a probabilistic programming language.\n\n(define (perlin-pt x y keypt power)\n  (* 255 (sum (map (lambda (p2 pow)\n                     (let ((x0 (floor (* p2 x))) (y0 (floor (* p2 y))))\n                       (* pow (2d-interp (keypt x0 y0) (keypt (+ 1 x0) y0) (keypt x0 (+ 1 y0)) (keypt (+ 1 x0) (+ 1 y0))))))\n                   powers-of-2 power))))\n\n(define (perlin xs ys power)\n  (let ([keypt (mem (lambda (x y) (/ 1 (+ 1 (exp (- (gaussian 0.0 2.0)))))))])\n    (map (lambda (x) (map (lambda (y) (perlin-pt x y keypt power)) xs)) ys)))\n\nFigure 1: Code for the structured Perlin noise generator. 2d-interp is B-spline interpolation.\n\nAlg. 2: Hamiltonian MCMC\n1: repeat forever\n2: Gibbs step:\n3:   Draw momentum m \u223c N(0, \u03c3^2)\n4: Metropolis step:\n5:   Start with current state (x, m)\n6:   Simulate Hamiltonian dynamics to give (x\u2032, m\u2032)\n7:   Accept w/ p = min[1, e^{\u2212H(x\u2032,m\u2032)+H(x,m)}]\n8: end;\n\n3.1 Hamiltonian MCMC\n\nTo illustrate the power of AD in probabilistic programming, we build on Hamiltonian MCMC (HMC), an efficient algorithm whose popularity has been somewhat limited by the necessity of computing gradients, a difficult task for complex models. Neal [15] introduces HMC as an inference method which \u201cproduces distant proposals for the Metropolis algorithm, thereby avoiding the slow exploration of the state space that results from the diffusive behavior of simple random-walk proposals.\u201d HMC begins by augmenting the state space with \u201cmomentum variables\u201d m. The distribution over this augmented space is e^{\u2212H(x,m)}, where the Hamiltonian function H decomposes into the sum of a potential energy term U(x) = \u2212 ln p(x) and a kinetic energy K(m) which is usually taken to be Gaussian. 
Inference proceeds by alternating between a Gibbs step and Metropolis step: fixing the current state x, a new momentum m is sampled from the prior over m; then x and m are updated together by following a trajectory according to Hamiltonian dynamics. Discrete integration of Hamiltonian dynamics requires the gradient of H, and must be done with a symplectic (i.e. volume preserving) integrator (following [15] we use the Leapfrog method). While this is a complex computation, incorporating gradient information dramatically improves performance over vanilla random-walk style MH moves (such as Gaussian drift kernels), and its statistical efficiency also scales much better with dimensionality than simpler methods [15]. AD can also compute higher-order derivatives. For example, Hessian matrices can be used to construct blocked Metropolis moves [9] or proposals based on Newton\u2019s method [19], or as part of Riemannian manifold methods [5].\n\n3.2 Experiments and Results\n\nWe implemented HMC by extending BHER [28], a lightweight implementation of the CHURCH language which provides simple, but universal, MH inference. We used an implementation of AD based on [17] that uses hygienic operator overloading to do both forward and reverse mode AD for Scheme (the target language of the BHER compiler).\n\nThe goal is to compute \u2207_x p(x). By Eq. 1, p(x) is the product of the probabilities of the individual choices x_i (though each probability can depend on previous choices, through the program evaluation). To compute p(x), BHER executes the corresponding program, accumulating likelihoods. Each time a continuous ERP is created or retrieved, we wrap it in a \u201ctape\u201d object which is used to track gradient information; as the likelihood p(x) is computed, these tapes flow through the program and through appropriately overloaded operators, resulting in a dependency graph for the real portion of the computation. 
The gradient is then computed in reverse mode, by \u201cback-propagating\u201d along this graph. We implement an HMC kernel by using this gradient in the leapfrog integrator. Since program states may contain a combination of discrete and continuous ERPs, we use an overall cycle kernel which alternates between a standard MH kernel for individual discrete random variables and the HMC kernel for all continuous random choices. To decrease burn-in time, we initialize the sampler by using annealed gradient ascent (again implemented using AD).\n\nWe ran two sets of experiments that illustrate two different benefits of HMC with AD: automated gradients of complex code, and good statistical efficiency.\n\nStructured Perlin noise generation. Our first experiment uses HMC to generate modified Perlin noise with soft symmetry structure. Perlin noise is a procedural texture used by computer graphics artists to add realism to natural textures such as clouds, grass or tree bark. We generate Perlin-like noise by layering octaves of random but smoothly varying functions. We condition the result on approximate diagonal symmetry, forcing the resulting image to incorporate additional structure without otherwise skewing the statistics of the image. Note that the MAP solution for this problem is uninteresting, as it is a uniform image; it is the variations around the MAP that provide rich texture. We generated 48x48 images; the model had roughly 1000 variables.\n\n[Figure 2 plot. Left panel label: \u201cdiagonal symmetry\u201d. Right panel: y-axis \u201cDistance to true expectation\u201d, x-axis \u201cSamples\u201d, curves for MCMC and HMC.]\n\nFigure 2: On the left: samples from the structured Perlin noise generator. On the right: convergence of expected mean for a draw from a 3D spherical Gaussian conditioned on lying on a line.\n\nFig. 
2 shows the result via typical samples generated by HMC, where the approximate symmetry is clearly visible. A code snippet demonstrating the complexity of the calculations is shown in Fig. 1; this experiment illustrates how the automatic nature of the gradients is most helpful, as it would be time consuming to compute these gradients by hand, particularly since we are free to condition using any function of the image.\n\nComplex conditioning. For our second example, we demonstrate the improved statistical efficiency of the samples generated by HMC versus BHER\u2019s standard MCMC algorithm. The goal is to sample points from a complex 3D distribution, defined by starting with a Gaussian prior, and sampling points that are noisily conditioned to be on a line running through R^3. This creates complex interactions with the prior to yield a smooth, but strongly coupled, energy landscape.\n\n1: x \u223c N(\u00b5, \u03c3)\n2: k \u223c Bernoulli(e^{\u2212dist(line,x)/noise})\n3: Condition on k = 1\n\n[Plot: Normal distribution noisily conditioned on line (2D projection)]\n\nFig. 2 compares our HMC implementation with BHER\u2019s standard MCMC engine. The x-axis denotes samples, while the y-axis denotes the convergence of an estimator of certain marginal statistics of the samples. We see that this estimator converges much faster for HMC, implying that the samples which are generated are less autocorrelated \u2013 affirming that HMC is indeed making better distal moves. HMC is about 5x slower than MCMC for this experiment, but the overhead is justified by the significant improvement in the statistical quality of the samples.\n\n4 Provenance Tracking for Fine-Grained Dynamic Dependency Analysis\n\nOne reason gradient-based inference algorithms are effective is that the chain rule of derivatives propagates information backwards from the data up to the proposal variables. 
But gradients, and the chain rule, are only defined for continuous variables. Is there a corresponding structure for discrete choices? We now introduce a new nonstandard interpretation based on provenance tracking (PT). In programming language theory, the provenance of a variable is the history of variables and computations that combined to form its value. We use this idea to track fine-grained dependency information between random values and intermediate computations as they combine to form a likelihood. We then use this provenance information to construct good global proposals for discrete variables as part of a novel factored multiple-try MCMC algorithm.\n\n4.1 Defining and Implementing Provenance Tracking\n\nLike AD, PT can be implemented with operator overloading. Because provenance information is much coarser than gradient information, the operators in PT objects have a particularly simple form; most program expressions can be covered by considering a few cases. Let X denote the set {x_i} of all (not necessarily random) variables in a program. Let R(x) \u2282 X define the provenance of a variable x. Given R(x), the provenance of expressions involving x can be computed by breaking down expressions into a sequence of unary operations, binary operations, and function applications. Constants have empty provenances.\n\nLet x and y be expressions in the program (consisting of an arbitrary mix of variables, constants, functions and operators). For a binary operation x \u2299 y, the provenance R(x \u2299 y) of the result is defined to be R(x \u2299 y) = R(x) \u222a R(y). Similarly, for a unary operation, the provenance R(\u2299x) = R(x). For assignments, x = y \u21d2 R(x) = R(y). For a function, R(f(x, y, ...)) may be computed by examining the expressions within f; a worst-case approximation is R(f(x, y, ...)) = R(x) \u222a R(y) \u222a \u22ef. A few special cases are also worth noting. 
Strictly speaking, the previous rules track a superset of provenance information because some functions and operations are constant for certain inputs. In the case of multiplication, x \u2217 0 = 0, so R(x \u2217 0) = {}. Accounting for this gives tighter provenances, implying, for example, that special considerations apply to sparse linear algebra.\n\nIn the case of probabilistic programming, recall that random variables (or ERPs) are represented as stochastic functions f_i that accept parameters \u03b8_i. Whenever a random variable is conditioned, the output of the corresponding f_i is fixed; thus, while the likelihood of a particular output of f_i depends on \u03b8_i, the specific output of f_i does not. For the purposes of inference, therefore, R(f_i(\u03b8_i)) = {}.\n\n4.2 Using Provenance Tracking as Part of Inference\n\nProvenance information could be used in many ways. Here, we illustrate one use: to help construct good block proposals for MH inference. Our basic idea is to construct a good global proposal by starting with a random global proposal (which is unlikely to be good) and then inhibiting the bad parts. We do this by allowing each element of the likelihood to \u201cvote\u201d on which proposals seemed good. This can be considered a factored version of a multiple-try MCMC algorithm [16].\n\nThe algorithm is shown in Fig. 3. Let x^O be the starting state. In step (2), we propose a new state x^{O\u2032}. This new state changes many ERPs at once, and is unlikely to be good (for the proof, we require that x_i^{O\u2032} \u2260 x_i^O for all i). In step (3), we accept or reject each element of the proposal based on a function \u03b1. Our choice of \u03b1 (Fig. 3, left) uses PT, as we explain below. In step (4) we construct a new proposal x^M by \u201cmixing\u201d two states: we set the variables in the accepted set A to the values of x_i^{O\u2032}, and we leave the variables in the rejected set R at their original values in x^O. 
In steps (5-6) we compute the forward probabilities. In steps (7-8) we sample one possible path backwards from x^M to x^O, with the relevant probabilities. Finally, in step (9) we accept or reject the overall proposal.\n\nWe use \u03b1(x^O, x^{O\u2032}) to allow the likelihood to \u201cvote\u201d in a fine-grained way for which proposals seemed good and which seemed bad. To do this, we compute p(x^O) using PT to track how each x_i^O influences the overall likelihood p(x^O). Let D(i; x^O) denote the \u201cdescendants\u201d of variable x_i^O, defined as all ERPs whose likelihood x_i^O impacted. We also use PT to compute p(x^{O\u2032}), again tracking dependents D(i; x^{O\u2032}), and let D(i) be the joint set of ERPs that x_i influences in either state x^O or x^{O\u2032}. We then use D(i), p(x^O) and p(x^{O\u2032}) to estimate the amount by which each constituent element x_i^{O\u2032} in the proposal changed the likelihood. We assign \u201ccredit\u201d to each i as if it were the only proposal \u2013 that is, we assume that if, for example, the likelihood went up, it was entirely due to the change in x_i^O. Of course, the variables\u2019 effects are not truly independent; this is a fully-factored approximation to those effects. The final \u03b1 is shown in Fig. 
3 (left), where we define p(x_{D(i)}) to be the likelihood of only the subset of variables that x_i impacted.\n\nHere, we prove that our algorithm is valid MCMC by following [16] and showing detailed balance. To do this, we must integrate over all possible rejected paths of the negative bits x_R^{O\u2032} and x_R^{MI}:\n\np(x^O) P(x^M | x^O)\n  = p(x^O) \u222b_{x_R^{O\u2032}} \u222b_{x_R^{MI}} Q_A^{O\u2032} Q_R^{O\u2032} P_A^M P_R^M Q_R^{MI} min{1, [p(x^M) Q_A^{MI} P_A^{MI} P_R^{MI}] / [p(x^O) Q_A^{O\u2032} P_A^M P_R^M]}\n  = \u222b_{x_R^{O\u2032}} \u222b_{x_R^{MI}} Q_R^{O\u2032} Q_R^{MI} min{p(x^O) Q_A^{O\u2032} P_A^M P_R^M, p(x^M) Q_A^{MI} P_A^{MI} P_R^{MI}}\n  = p(x^M) \u222b_{x_R^{O\u2032}} \u222b_{x_R^{MI}} Q_A^{MI} Q_R^{MI} P_A^{MI} P_R^{MI} Q_R^{O\u2032} min{1, [p(x^O) Q_A^{O\u2032} P_A^M P_R^M] / [p(x^M) Q_A^{MI} P_A^{MI} P_R^{MI}]}\n  = p(x^M) P(x^O | x^M)\n\nwhere the subtlety to the equivalence is that the rejected bits x_R^{O\u2032} and x_R^{MI} have switched roles. \u25a1\n\nAlg. 3: Factored Multiple-Try MH\n1: Begin in state x^O. Assume it is composed of individual ERPs x^O = {x_1^O, \u22ef, x_k^O}.\n2: Propose a new state for many ERPs. For i = 1, \u22ef, k, propose x_i^{O\u2032} \u223c Q(x_i^{O\u2032} | x^O) s.t. x_i^{O\u2032} \u2260 x_i^O.\n3: Decide to accept or reject each element of x^{O\u2032}. This test can depend arbitrarily on x^O and x^{O\u2032}, but must decide for each ERP independently; let \u03b1_i(x^O, x^{O\u2032}) be the probability of accepting x_i^{O\u2032}. Let A be the set of indices of accepted proposals, and R the set of rejected ones.\n4: Construct a new state, x^M = {x_i^{O\u2032} : i \u2208 A} \u222a {x_j^O : j \u2208 R}. 
This new state mixes new values for the ERPs from the accepted set A and old values for the ERPs in the rejected set R.\n5: Let P_A^M = \u220f_{i\u2208A} \u03b1_i(x^O, x^{O\u2032}) be the probability of accepting the ERPs in A, and let P_R^M = \u220f_{j\u2208R} (1 \u2212 \u03b1_j(x^O, x^{O\u2032})) be the probability of rejecting the ERPs in R.\n6: Let Q_A^{O\u2032} = \u220f_{i\u2208A} Q(x_i^{O\u2032} | x^O) and Q_R^{O\u2032} = \u220f_{j\u2208R} Q(x_j^{O\u2032} | x^O).\n7: Construct a new state x^{MI}. Propose new values for all of the rejected ERPs using x^M as the start state, but leave ERPs in the accepted set at their original value. For j \u2208 R let x_j^{MI} \u223c Q(\u00b7 | x^M). Then, x^{MI} = {x_i^O : i \u2208 A} \u222a {x_j^{MI} : j \u2208 R}.\n8: Let P_A^{MI} = \u220f_{i\u2208A} \u03b1_i(x^M, x^{MI}), and let P_R^{MI} = \u220f_{j\u2208R} (1 \u2212 \u03b1_j(x^M, x^{MI})).\n9: Accept x^M with probability min{1, (p(x^M) Q_A^{MI} P_A^{MI} P_R^{MI}) / (p(x^O) Q_A^{O\u2032} P_A^M P_R^M)}.\n\nAlg. 4: A PT-based Acceptance Test\n1: The PT algorithm implements \u03b1_i(x, x\u2032).\n2: Compute p(x), tracking D(x_i; x)\n3: Compute p(x\u2032), tracking D(x_i; x\u2032)\n4: Let D(i) = D(x_i; x) \u222a D(x_i; x\u2032)\n5: Let \u03b1_i(x, x\u2032) = min{1, (p(x\u2032_i) p(x\u2032_{D(i)})) / (p(x_i) p(x_{D(i)}))}\n\n[Figure 3 illustration: individual ERPs shown in the start state, proposed state, mixed state and reverse path, with the accepted and rejected sets marked.]\n\nFigure 3: The factored multiple-try MH algorithm (top), the PT-based acceptance test (left) and an illustration of the process of mixing together different elementary proposals (right).\n\n4.3 Experiments and Results\n\nWe implemented provenance tracking in Stochastic MATLAB [28] by leveraging MATLAB\u2019s object oriented capabilities, which provide full operator overloading. We tested on four tasks: a Bayesian \u201cmesh induction\u201d task, a small QMR problem, probabilistic matrix factorization [23] and an integer-valued variant of PMF. 
We measured performance by examining likelihood as a function of wall-clock time; an important property of the provenance tracking algorithm is that it can help mitigate constant factors affecting inference performance.\n\nBayesian mesh induction. The BMI task is simple: given a prior distribution over meshes and a target image, sample a mesh which, when rendered, looks like the target image. The prior is a Gaussian centered around a \u201cmean mesh,\u201d which is a perfect sphere; Gaussian noise is added to each vertex to deform the mesh. The model is shown in Alg. 5. The rendering function is a custom OPENGL renderer implemented as a MEX function. No gradients are available for this renderer, but it is reasonably easy to augment it with provenance information recording vertices of the triangle that were responsible for each pixel. This allows us to make proposals to mesh vertices, while assigning credit based on pixel likelihoods.\n\nAlg. 5: Bayesian Mesh Induction\nfunction X = bmi( base_mesh )\n  mesh = base_mesh + randn;\n  img = render( mesh );\n  X = img + randn;\nend;\n\nResults for this task are shown in Fig. 4 (\u201cFace\u201d). Even though the renderer is quite fast, MCMC with simple proposals fails: after proposing a change to a single variable, it must re-render the image in order to compute the likelihood. In contrast, making large, global proposals is very effective. Fig. 4 (top) shows a sequence of images representing burn-in of the model as it starts from the initial condition and samples its way towards regions of high likelihood. 
A video demonstrating the results is available at http://www.mit.edu/~wingated/papers/index.html.\n\n[Figure 4 panels. Top row: Input, frames over Time, Target. Bottom row: log likelihood versus time (seconds) for Face, QMR, PMF and Integer PMF.]\n\nFigure 4: Top: Frames from the face task. Bottom: results on Face, QMR, PMF and Integer PMF.\n\nQMR. The QMR model is a bipartite, binary model relating diseases (hidden) to symptoms (observed) using a log-linear noisy-or model. Base rates on diseases can be quite low, so \u201cexplaining away\u201d can cause poor mixing. Here, MCMC with provenance tracking is effective: it finds high-likelihood solutions quickly, again outperforming naive MCMC.\n\nProbabilistic Matrix Factorization. For the PMF task, we factored a matrix A \u2208 R^{1000\u00d71000} with 99% sparsity. PMF places a Gaussian prior over two matrices, U \u2208 R^{1000\u00d710} and V \u2208 R^{1000\u00d710}, for a total of 20,000 parameters. The model assumes that A_{ij} \u223c N(U_i V_j^T, 1). In Fig. 4, we see that MCMC with provenance tracking is able to find regions of much higher likelihood much more quickly than naive MCMC. We also compared to an efficient hand-coded MCMC sampler which is capable of making, scoring and accepting/rejecting about 20,000 proposals per second. Interestingly, MCMC with provenance tracking is more efficient than the hand-coded sampler, presumably because of the economies of scale that come with making global proposals.\n\nInteger Probabilistic Matrix Factorization. The Integer PMF task is like ordinary PMF, except that every entry in U and V is constrained to be an integer between 1 and 10. 
These constraints imply that no gradients exist. Empirically, this does not seem to matter for the efficiency of the algorithm relative to standard MCMC: in Fig. 4 we again see dramatic performance improvements over the baseline Stochastic MATLAB sampler and the hand-coded sampler.\n\n5 Conclusions\nWe have shown how nonstandard interpretations of probabilistic programs can be used to extract structural information about a distribution, and how this information can be used as part of a variety of inference algorithms. The information can take the form of gradients, Hessians, fine-grained dependencies, or bounds. Empirically, we have implemented two such interpretations and demonstrated how this information can be used to find regions of high likelihood quickly, and how it can be used to generate samples with improved statistical properties versus random-walk style MCMC. There are other types of interpretations which could provide additional information. For example, interval arithmetic [22] could be used to provide bounds or as part of adaptive importance sampling.\n\nEach of these interpretations can be used alone or in concert with each other; one of the advantages of the probabilistic programming framework is the clean separation of models and inference algorithms, making it easy to explore combinations of inference algorithms for complex models. More generally, this work begins to illuminate the close connections between probabilistic inference and programming language theory. It is likely that other techniques from compiler design and program analysis could be fruitfully applied to inference problems in probabilistic programs.\n\nAcknowledgments\nDW was supported in part by AFOSR (FA9550-07-1-0075) and by Shell Oil, Inc. NDG was supported in part by ONR (N00014-09-1-0124) and a J. S. McDonnell Foundation Scholar Award. JMS was supported in part by NSF (CCF-0438806), by NRL (N00173-10-1-G023), and by ARL (W911NF-10-2-0060). 
All views expressed in this paper are the sole responsibility of the authors.\n\nReferences\n[1] C. Bendtsen and O. Stauning. FADBAD, a flexible C++ package for automatic differentiation. Technical Report IMM\u2013REP\u20131996\u201317, Department of Mathematical Modelling, Technical University of Denmark, Lyngby, Denmark, Aug. 1996.\n[2] C. H. Bischof, A. Carle, G. F. Corliss, A. Griewank, and P. D. Hovland. ADIFOR: Generating derivative codes from Fortran programs. Scientific Programming, 1(1):11\u201329, 1992.\n[3] G. Corliss, C. Faure, A. Griewank, L. Hasco\u00ebt, and U. Naumann. Automatic Differentiation: From Simulation to Optimization. Springer-Verlag, New York, NY, 2001.\n[4] J. Eisner, E. Goldlust, and N. A. Smith. Compiling comp ling: Weighted dynamic programming and the Dyna language. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT-EMNLP), pages 281\u2013290, Vancouver, October 2005.\n[5] M. Girolami and B. Calderhead. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. J. R. Statist. Soc. B, 73(2):123\u2013214, 2011.\n[6] N. Goodman, V. Mansinghka, D. Roy, K. Bonawitz, and J. Tenenbaum. Church: a language for generative models. In Uncertainty in Artificial Intelligence (UAI), 2008.\n[7] A. Griewank. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. Number 19 in Frontiers in Applied Mathematics. SIAM, 2000.\n[8] A. Griewank, D. Juedes, and J. Utke. ADOL-C, a package for the automatic differentiation of algorithms written in C/C++. ACM Trans. Math. Software, 22(2):131\u2013167, 1996.\n[9] E. Herbst. Gradient and Hessian-based MCMC for DSGE models (job market paper), 2010.\n[10] K. Kersting and L. D. Raedt. Bayesian logic programming: Theory and tool. In L. Getoor and B. Taskar, editors, An Introduction to Statistical Relational Learning. MIT Press, 2007.\n[11] O. Kiselyov and C. Shan. Embedded probabilistic programming. In Domain-Specific Languages, pages 360\u2013384, 2009.\n[12] Y. LeCun and L. Bottou. Lush reference manual. Technical report, 2002. URL http://lush.sourceforge.net.\n[13] B. Milch, B. Marthi, S. Russell, D. Sontag, D. L. Ong, and A. Kolobov. BLOG: Probabilistic models with unknown objects. In International Joint Conference on Artificial Intelligence (IJCAI), pages 1352\u20131359, 2005.\n[14] M. B. Monagan and W. M. Neuenschwander. GRADIENT: Algorithmic differentiation in Maple. In International Symposium on Symbolic and Algebraic Computation (ISSAC), pages 68\u201376, July 1993.\n[15] R. M. Neal. MCMC using Hamiltonian dynamics. In Handbook of Markov Chain Monte-Carlo (Steve Brooks, Andrew Gelman, Galin Jones and Xiao-Li Meng, Eds.), 2010.\n[16] S. Pandolfi, F. Bartolucci, and N. Friel. A generalization of the multiple-try Metropolis algorithm for Bayesian estimation and model selection. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2010.\n[17] B. A. Pearlmutter and J. M. Siskind. Lazy multivariate higher-order forward-mode AD. In Symposium on Principles of Programming Languages (POPL), pages 155\u2013160, 2007. doi: 10.1145/1190215.1190242.\n[18] A. Pfeffer. IBAL: A probabilistic rational programming language. In International Joint Conference on Artificial Intelligence (IJCAI), pages 733\u2013740. Morgan Kaufmann Publ., 2001.\n[19] Y. Qi and T. P. Minka. Hessian-based Markov chain Monte-Carlo algorithms (unpublished manuscript), 2002.\n[20] P. J. Rossky, J. D. Doll, and H. L. Friedman. Brownian dynamics as smart Monte Carlo simulation. Journal of Chemical Physics, 69:4628\u20134633, 1978.\n[21] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 323:533\u2013536, 1986.\n[22] S. Rump. INTLAB - INTerval LABoratory. In Developments in Reliable Computing, pages 77\u2013104. Kluwer Academic Publishers, Dordrecht, 1999.\n[23] R. Salakhutdinov and A. Mnih. Probabilistic matrix factorization. In Neural Information Processing Systems (NIPS), 2008.\n[24] J. M. Siskind and B. A. Pearlmutter. First-class nonstandard interpretations by opening closures. In Symposium on Principles of Programming Languages (POPL), pages 71\u201376, 2007. doi: 10.1145/1190216.1190230.\n[25] B. Speelpenning. Compiling Fast Partial Derivatives of Functions Given by Algorithms. PhD thesis, Department of Computer Science, University of Illinois at Urbana-Champaign, Jan. 1980.\n[26] B. Taylor. Methodus Incrementorum Directa et Inversa. London, 1715.\n[27] R. E. Wengert. A simple automatic derivative evaluation program. Commun. ACM, 7(8):463\u2013464, 1964.\n[28] D. Wingate, A. Stuhlmueller, and N. D. Goodman. Lightweight implementations of probabilistic programming languages via transformational compilation. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2011.\n", "award": [], "sourceid": 674, "authors": [{"given_name": "David", "family_name": "Wingate", "institution": null}, {"given_name": "Noah", "family_name": "Goodman", "institution": null}, {"given_name": "Andreas", "family_name": "Stuhlmueller", "institution": null}, {"given_name": "Jeffrey", "family_name": "Siskind", "institution": null}]}