{"title": "A Probabilistic Programming Approach To Probabilistic Data Analysis", "book": "Advances in Neural Information Processing Systems", "page_first": 2011, "page_last": 2019, "abstract": "Probabilistic techniques are central to data analysis, but different approaches can be challenging to apply, combine, and compare. This paper introduces composable generative population models (CGPMs), a computational abstraction that extends directed graphical models and can be used to describe and compose a broad class of probabilistic data analysis techniques. Examples include discriminative machine learning, hierarchical Bayesian models, multivariate kernel methods, clustering algorithms, and arbitrary probabilistic programs. We demonstrate the integration of CGPMs into BayesDB, a probabilistic programming platform that can express data analysis tasks using a modeling definition language and structured query language. The practical value is illustrated in two ways. First, the paper describes an analysis on a database of Earth satellites, which identifies records that probably violate Kepler\u2019s Third Law by composing causal probabilistic programs with non-parametric Bayes in 50 lines of probabilistic code. Second, it reports the lines of code and accuracy of CGPMs compared with baseline solutions from standard machine learning libraries.", "full_text": "A Probabilistic Programming Approach To\n\nProbabilistic Data Analysis\n\nMIT Probabilistic Computing Project\n\nMIT Probabilistic Computing Project\n\nFeras Saad\n\nfsaad@mit.edu\n\nVikash Mansinghka\n\nvkm@mit.edu\n\nAbstract\n\nProbabilistic techniques are central to data analysis, but different approaches can\nbe challenging to apply, combine, and compare. This paper introduces composable\ngenerative population models (CGPMs), a computational abstraction that extends\ndirected graphical models and can be used to describe and compose a broad class\nof probabilistic data analysis techniques. 
Examples include discriminative machine\nlearning, hierarchical Bayesian models, multivariate kernel methods, clustering\nalgorithms, and arbitrary probabilistic programs. We demonstrate the integration\nof CGPMs into BayesDB, a probabilistic programming platform that can express\ndata analysis tasks using a modeling de\ufb01nition language and structured query\nlanguage. The practical value is illustrated in two ways. First, the paper describes\nan analysis on a database of Earth satellites, which identi\ufb01es records that probably\nviolate Kepler\u2019s Third Law by composing causal probabilistic programs with non-\nparametric Bayes in 50 lines of probabilistic code. Second, it reports the lines of\ncode and accuracy of CGPMs compared with baseline solutions from standard\nmachine learning libraries.\n\n1\n\nIntroduction\n\nProbabilistic techniques are central to data analysis, but can be dif\ufb01cult to apply, combine, and\ncompare. Such dif\ufb01culties arise because families of approaches such as parametric statistical modeling,\nmachine learning and probabilistic programming are each associated with different formalisms and\nassumptions. The contributions of this paper are (i) a way to address these challenges by de\ufb01ning\nCGPMs, a new family of composable probabilistic models; (ii) an integration of this family into\nBayesDB [10], a probabilistic programming platform for data analysis; and (iii) empirical illustrations\nof the ef\ufb01cacy of the framework for analyzing a real-world database of Earth satellites.\nWe introduce composable generative population models (CGPMs), a computational formalism that\ngeneralizes directed graphical models. CGPMs specify a table of observable random variables with\na \ufb01nite number of columns and countably in\ufb01nitely many rows. They support complex intra-row\ndependencies among the observables, as well as inter-row dependencies among a \ufb01eld of latent random\nvariables. 
CGPMs are described by a computational interface for generating samples and evaluating\ndensities for random variables derived from the base table by conditioning and marginalization. This\npaper shows how to package discriminative statistical learning techniques, dimensionality reduction\nmethods, arbitrary probabilistic programs, and their combinations, as CGPMs. We also describe\nalgorithms and illustrate new syntaxes in the probabilistic Metamodeling Language for building\ncomposite CGPMs that can interoperate with BayesDB.\nThe practical value is illustrated in two ways. First, we describe a 50-line analysis that identi\ufb01es\nsatellite data records that probably violate their theoretical orbital characteristics. The BayesDB script\nbuilds models that combine non-parametric Bayesian structure learning with a causal probabilistic\nprogram that implements a stochastic variant of Kepler\u2019s Third Law. Second, we illustrate coverage\nand conciseness of the CGPM abstraction by quantifying the improvement in accuracy and reduction\nin lines of code achieved on a representative data analysis task.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\f2 Composable Generative Population Models\n\nA composable generative population model represents a data generating process for an exchangeable\nsequence of random vectors (x1, x2, . . . ), called a population. Each member xr is T -dimensional,\nand element x[r,t] takes values in an observation space Xt, for t \u2208 [T ] and r \u2208 N. A CGPM G is\nformally represented by a collection of variables that characterize the data generating process:\n\nG = (\u03b1, \u03b8, Z = {zr : r \u2208 N}, X = {xr : r \u2208 N}, Y = {yr : r \u2208 N}).\n\n\u2022 \u03b1: Known, \ufb01xed quantities about the population, such as metadata and hyperparameters.\n\u2022 \u03b8: Population-level latent variables relevant to all members of the population.\n\u2022 zr = (z[r,1], . . . 
z[r,L]): Member-specific latent variables that govern only member r directly.
• xr = (x[r,1], . . . , x[r,T]): Observable output variables for member r. A subset of these variables may be observed and recorded in a dataset D.
• yr = (y[r,1], . . . , y[r,I]): Input variables, such as "feature vectors" in a purely discriminative model.
A CGPM is required to satisfy the following conditional independence constraint:

∀ r ≠ r′ ∈ N, ∀ t, t′ ∈ [T] : x[r,t] ⊥⊥ x[r′,t′] | {α, θ, zr, zr′}.   (1)

Eq (1) formalizes the notion that all dependencies across members r ∈ N are completely mediated by the population parameters θ and member-specific variables zr. However, elements x[r,i] and x[r,j] within a member are generally free to assume any dependence structure. Similarly, the member-specific latents in Z may be either uncoupled or highly-coupled given population parameters θ.
CGPMs differ from the standard mathematical definition of a joint density in that they are defined in terms of a computational interface (Listing 1). As computational objects, they explicitly distinguish between the sampler for the random variables from their joint distribution, and the assessor of their joint density. 
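The computational interface can be sketched as a minimal Python base class. The method names mirror the interface (simulate, logpdf, incorporate, unincorporate, infer); the concrete Python signatures below are illustrative assumptions, not a reference implementation.

```python
from abc import ABC, abstractmethod

class CGPM(ABC):
    """Sketch of the CGPM computational interface.

    Method names follow the paper's interface; the signatures are
    illustrative assumptions, not a reference implementation.
    """

    @abstractmethod
    def simulate(self, member, query, evidence=None, inputs=None):
        """Draw s ~ x[r,Q] | {x[r,E], y_r, D} for member r."""

    @abstractmethod
    def logpdf(self, member, query, evidence=None, inputs=None):
        """Evaluate log p(x[r,Q] | {x[r,E], y_r, D})."""

    def incorporate(self, member, measurement):
        """Record a measurement into the dataset D."""

    def unincorporate(self, member):
        """Eliminate all measurements for member r."""

    def infer(self, program):
        """Adjust internal latent state per the learning procedure."""
```

Any model exposing these five methods — whether backed by a closed-form density, a machine learning package, or a probabilistic program — can then participate in composition and querying on equal terms.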
In particular, a CGPM is required to sample/assess the joint distribution of a subset of output variables x[r,Q] conditioned on another subset x[r,E], and marginalizing over x[r,[T]\(Q∪E)].

Listing 1 Computational interface for composable generative population models.
• s ← simulate (G, member: r, query: Q = {qk}, evidence: x[r,E], input: yr)
Generate a sample from the distribution s ∼G x[r,Q] | {x[r,E], yr, D}.
• c ← logpdf (G, member: r, query: x[r,Q], evidence: x[r,E], input: yr)
Evaluate the log density log pG(x[r,Q] | {x[r,E], yr, D}).
• G′ ← incorporate (G, measurement: x[r,t] or yr)
Record a measurement x[r,t] ∈ Xt (or yr) into the dataset D.
• G′ ← unincorporate (G, member: r)
Eliminate all measurements of input and output variables for member r.
• G′ ← infer (G, program: T)
Adjust internal latent state in accordance with the learning procedure specified by program T.

2.1 Primitive univariate CGPMs and their statistical data types

The statistical data type (Figure 1) of a population variable xt generated by a CGPM provides a more refined taxonomy than its "observation space" Xt. The (parameterized) support of a statistical type is the set in which samples from simulate take values. Each statistical type is also associated with a base measure which ensures logpdf is well-defined. In high-dimensional populations with heterogeneous types, logpdf is taken against the product measure of these base measures. The statistical type also identifies invariants that the variable maintains. For instance, the values of a NOMINAL variable are permutation-invariant. Figure 1 shows statistical data types provided by the Metamodeling Language from BayesDB. 
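As an illustration, a primitive CGPM for a NUMERICAL variable can be implemented in a handful of lines. This sketch fixes the NORMAL parameters at construction — a simplifying assumption; a fuller version would learn them in infer from incorporated data.

```python
import math
import random

class NormalCGPM:
    """Primitive CGPM for a NUMERICAL variable (support (-inf, inf),
    Lebesgue base measure). Parameters are fixed at construction; a
    fuller version would learn mu, sigma in infer from incorporated
    data, e.g. by maximum likelihood or a conjugate Bayesian prior."""

    def __init__(self, mu=0.0, sigma=1.0):
        self.mu, self.sigma = mu, sigma

    def simulate(self, member, query, evidence=None, inputs=None):
        # Univariate output: return one draw of the single query variable.
        return random.gauss(self.mu, self.sigma)

    def logpdf(self, member, x, evidence=None, inputs=None):
        # log N(x | mu, sigma^2)
        z = (x - self.mu) / self.sigma
        return -0.5 * z * z - math.log(self.sigma * math.sqrt(2.0 * math.pi))
```

The other primitive types follow the same pattern, swapping in the appropriate univariate density and sampler (e.g. POISSON for COUNT/RATE, VON-MISES for CYCLIC).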
The final column shows some examples of primitive CGPMs that are compatible with each statistical type; they implement logpdf directly using univariate probability density functions, and algorithms for simulate are well known [4]. For infer, their parameters may be fixed, or learned from data using, e.g., maximum likelihood [2, Chapter 7] or Bayesian priors [5]. We refer to an extended version of this paper [14, Section 3] for using these primitives to implement CGPMs for a broad collection of model classes, including non-parametric Bayes, nearest neighbors, PCA, discriminative machine learning, and multivariate kernel methods.

Statistical Data Type | Parameters       | Support              | Measure/σ-Algebra | Primitive CGPM
BINARY                | –                | {0, 1}               | (#, 2^{0,1})      | BERNOULLI
NOMINAL               | symbols: S       | {0, . . . , S−1}     | (#, 2^[S])        | CATEGORICAL
COUNT/RATE            | base: b          | {0, 1/b, 2/b, . . .} | (#, 2^N)          | POISSON, GEOMETRIC
CYCLIC                | period: p        | (0, p)               | (λ, B(R))         | VON-MISES
MAGNITUDE             | –                | (0, ∞)               | (λ, B(R))         | LOGNORMAL, EXPON
NUMERICAL             | –                | (−∞, ∞)              | (λ, B(R))         | NORMAL
NUMERICAL-RANGED      | low: l, high: h  | (l, h) ⊂ R           | (λ, B(R))         | BETA, NORMAL-TRUNC

Figure 1: Statistical data types for population variables generated by CGPMs available in the BayesDB Metamodeling Language, and samples from their marginal distributions.

2.2 Implementing general CGPMs as probabilistic programs in VentureScript

In this section, we show how to implement simulate and logpdf (Listing 1) for composable generative models written in VentureScript [8], a probabilistic programming language with programmable inference. 
For simplicity, this section assumes a stronger conditional independence constraint:

∃ l, l′ ∈ [L] such that (r, t) ≠ (r′, t′) =⇒ x[r,t] ⊥⊥ x[r′,t′] | {α, θ, z[r,l], z[r′,l′], yr, yr′}.   (2)

In words, for every observable element x[r,t], there exists a latent variable z[r,l] which (in addition to θ) mediates all coupling with other variables in the population. The member latents Z may still exhibit arbitrary dependencies. The approach for simulate and logpdf described below is based on approximate inference in tagged subparts of the Venture trace, which carries a full realization of all random choices (population and member-specific latent variables) made by the program. The runtime system carries a set of K traces {(θk, Zk)}, k = 1, . . . , K, sampled from an approximate posterior pG(θ, Z | D). These traces are assigned weights depending on the user-specified evidence x[r,E] in the simulate/logpdf function call. G represents the CGPM as a probabilistic program, and the input yr and latent variables Zk are treated as ambient quantities in θk. The distribution of interest is

pG(x[r,Q] | x[r,E], D) = ∫θ pG(x[r,Q] | x[r,E], θ, D) pG(θ | x[r,E], D) dθ
  = ∫θ pG(x[r,Q] | x[r,E], θ, D) [ pG(x[r,E] | θ, D) pG(θ | D) / pG(x[r,E] | D) ] dθ   (3)
  ≈ (1 / ∑_{k=1}^{K} wk) ∑_{k=1}^{K} pG(x[r,Q] | x[r,E], θk, D) wk,   where θk ∼G · | D.   (4)

The weight wk = pG(x[r,E] | θk, D) of trace θk is the likelihood of the evidence. The weighting scheme (4) is a computational trade-off avoiding the requirement to run posterior inference on population parameters θ for a query about member r. 
It suffices to derive the distribution for only θk:

pG(x[r,Q] | x[r,E], θk, D) = ∫ pG(x[r,Q], z^k_r | x[r,E], θk, D) dz^k_r   (5)
  = ∫ ( ∏_{q∈Q} pG(x[r,q] | z^k_r, θk) ) pG(z^k_r | x[r,E], θk, D) dz^k_r
  ≈ (1/J) ∑_{j=1}^{J} ∏_{q∈Q} pG(x[r,q] | z^{k,j}_r, θk),   (6)

where z^{k,j}_r ∼G · | {x[r,E], θk, D}. Eq (5) suggests that simulate can be implemented by sampling (x[r,Q], z^k_r) ∼G · | {x[r,E], θk, D} from the joint local posterior, then returning elements x[r,Q]. Eq (6) shows that logpdf can be implemented by first sampling the member latents z^k_r ∼G · | {x[r,E], θk, D} from the local posterior; using the conditional independence constraint (2), the query x[r,Q] then factors into a product of density terms for each element x[r,q].

To aggregate over {θk}_{k=1}^{K}, for simulate the runtime obtains the queried sample by first drawing k ∼ CATEGORICAL({w1, . . . , wK}), then returns the sample x[r,Q] drawn from trace θk. Similarly, logpdf is computed using the weighted Monte Carlo estimator (6). Algorithms 2a and 2b summarize implementations of simulate and logpdf in a general probabilistic programming environment.

Algorithm 2a simulate for CGPMs in a probabilistic programming environment.
1: function SIMULATE(G, r, Q, x[r,E], yr)
2:   for k = 1, . . . , K do                                   ▷ for each trace k
3:     if z^k_r ∉ Z^k then                                     ▷ if member r has unknown local latents
4:       z^k_r ∼G · | {θk, Z^k, D}                             ▷ sample them from the prior
5:     wk ← ∏_{e∈E} pG(x[r,e] | θk, z^k_r)                     ▷ weight the trace by likelihood of evidence
6:   k ∼ CATEGORICAL({w1, . . . , wK})                         ▷ importance resample the traces
7:   {x[r,Q], z^k_r} ∼G · | {θk, Z^k, D ∪ {yr, x[r,E]}}        ▷ run a transition operator leaving target invariant
8:   return x[r,Q]                                             ▷ select query variables from the resampled trace

Algorithm 2b logpdf for CGPMs in a probabilistic programming environment.
1: function LOGPDF(G, r, x[r,Q], x[r,E], yr)
2:   Run steps 2 through 5 from Algorithm 2a                   ▷ retrieve the trace weights
3:   for k = 1, . . . , K do                                   ▷ for each trace k
4:     for j = 1, . . . , J do                                 ▷ obtain J samples of latents in scope of member r
5:       z^{k,j}_r ∼G · | {θk, Z^k, D ∪ {yr, x[r,E]}}          ▷ run a transition operator leaving target invariant
6:       h_{k,j} ← ∏_{q∈Q} pG(x[r,q] | θk, z^{k,j}_r)          ▷ compute the density estimate
7:     rk ← (1/J) ∑_{j=1}^{J} h_{k,j}                          ▷ aggregate density estimates by simple Monte Carlo
8:     qk ← rk · wk                                            ▷ importance weight the estimate
9:   return log(∑_{k=1}^{K} qk) − log(∑_{k=1}^{K} wk)          ▷ weighted importance sampling over all traces

2.3 Inference in a composite network of CGPMs

This section shows how CGPMs are composed by applying the output of one to the input of another. This allows us to build complex probabilistic models out of simpler primitives directly as software. Section 3 demonstrates surface-level syntaxes in the Metamodeling Language for constructing these composite structures. 
We report experiments including up to three layers of composed CGPMs.
Let Ga be a CGPM with output xa∗ and input ya∗, and Gb have output xb∗ and input yb∗ (the symbol ∗ indexes all members r ∈ N). The composition GbB ◦ GaA applies the subset of outputs xa[∗,A] of Ga to the inputs yb[∗,B] of Gb, where |A| = |B| and the variables are type-matched (Figure 1). This operation results in a new CGPM Gc with output xa∗ ∪ xb∗ and input ya∗ ∪ yb[∗,\B]. In general, a collection {Gk : k ∈ [K]} of CGPMs can be organized into a generalized directed graph G[K], which itself is a CGPM. Node k is an "internal" CGPM Gk, and the labeled edge aA → bB denotes the composition GbB ◦ GaA. The directed acyclic edge structure applies only to edges between elements of different CGPMs in the network; elements xk[∗,i], xk[∗,j] within Gk may satisfy the more general constraint (1).
Algorithms 3a and 3b show sampling-importance-resampling and ratio-likelihood weighting algorithms that combine simulate and logpdf from each individual Gk to compute queries against network G[K]. The symbol πk = {(p, t) : xp[∗,t] ∈ yk∗} refers to the set of all output elements from upstream CGPMs connected to the inputs of Gk, so that {πk : k ∈ [K]} encodes the graph adjacency matrix. Subroutine 3c generates a full realization of all unconstrained variables, and weights forward samples from the network by the likelihood of constraints. 
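The weighted forward-sampling subroutine just described can be sketched in Python. The node container, the `parents` attribute, and the dictionary-based sample are hypothetical simplifications chosen for illustration; each node is assumed to expose single-variable simulate/logpdf methods in the style of the CGPM interface.

```python
def weighted_sample(nodes, order, constraints, r):
    """Sketch of weighted forward sampling in a DAG of CGPM-like nodes:
    walk the network in topological order, scoring constrained outputs
    by their likelihood and forward-simulating everything else.

    `nodes` maps a name to an object with single-variable simulate and
    logpdf methods and a `parents` list (hypothetical layout chosen for
    illustration); `order` is a topological sort of the node names;
    `constraints` maps node names to observed values for member r.
    """
    sample, log_w = {}, 0.0
    for k in order:
        # Inputs of node k are the already-sampled outputs of its parents.
        inputs = {p: sample[p] for p in nodes[k].parents}
        if k in constraints:
            # Constrained node: update the weight by its likelihood.
            log_w += nodes[k].logpdf(r, constraints[k], inputs=inputs)
            sample[k] = constraints[k]
        else:
            # Unconstrained node: forward-simulate its output.
            sample[k] = nodes[k].simulate(r, inputs=inputs)
    return sample, log_w
```

Repeating this routine J times and resampling (or averaging) by the returned weights yields the network-level simulate and logpdf estimators.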
Algorithm 3b is based on ratio-likelihood weighting (both terms in line 6 are computed by unnormalized importance sampling) and admits an analysis with known error bounds when logpdf and simulate of each Gk are exact [7].

Algorithm 3a simulate in a directed acyclic network of CGPMs.
1: function SIMULATE(Gk, r, Qk, x^k[r,Ek], y^k_r, for k ∈ [K])
2:   for j = 1, . . . , J do                                   ▷ generate J importance samples
3:     (sj, wj) ← WEIGHTED-SAMPLE({x^k[r,Ek] : k ∈ [K]})       ▷ retrieve jth weighted sample
4:   m ∼ CATEGORICAL({w1, . . . , wJ})                         ▷ resample by importance weights
5:   return {x^k[r,Qk] ∈ sm : k ∈ [K]}                         ▷ return query variables from the selected sample

Algorithm 3b logpdf in a directed acyclic network of CGPMs.
1: function LOGPDF(Gk, r, x^k[r,Qk], x^k[r,Ek], y^k_r, for k ∈ [K])
2:   for j = 1, . . . , J do                                   ▷ generate J importance samples
3:     (sj, wj) ← WEIGHTED-SAMPLE({x^k[r,Qk∪Ek] : k ∈ [K]})    ▷ joint density of query/evidence
4:   for j = 1, . . . , J′ do                                  ▷ generate J′ importance samples
5:     (s′j, w′j) ← WEIGHTED-SAMPLE({x^k[r,Ek] : k ∈ [K]})     ▷ marginal density of evidence
6:   return log(∑_{j∈[J]} wj / ∑_{j∈[J′]} w′j) − log(J/J′)     ▷ return likelihood ratio importance estimate

Algorithm 3c Weighted forward sampling in a directed acyclic network of CGPMs.
1: function WEIGHTED-SAMPLE(constraints: x^k[r,Ck], for k ∈ [K])
2:   (s, log w) ← (∅, 0)                                       ▷ initialize empty sample with zero weight
3:   for k ∈ TOPOSORT({π1, . . . , πK}) do                     ▷ topologically sort CGPMs using adjacency matrix
4:     ỹ^k_r ← y^k_r ∪ {x^p[r,t] ∈ s : (p, t) ∈ πk}            ▷ retrieve required inputs at node k
5:     log w ← log w + logpdf(Gk, r, x^k[r,Ck], ∅, ỹ^k_r)      ▷ update weight by likelihood of constraint
6:     x^k[r,\Ck] ← simulate(Gk, r, \Ck, x^k[r,Ck], ỹ^k_r)     ▷ simulate unconstrained nodes
7:     s ← s ∪ x^k[r,Ck∪\Ck]                                   ▷ append all node values to sample
8:   return (s, w)                                             ▷ return the overall sample and its weight

3 Analyzing satellites using CGPMs built from causal probabilistic programs, discriminative machine learning, and Bayesian non-parametrics

This section outlines a case study applying CGPMs to a database of 1163 satellites maintained by the Union of Concerned Scientists [12]. The dataset contains 23 numerical and categorical features of each satellite such as its material, functional, physical, orbital and economic characteristics. The list of variables and examples of three representative satellites are shown in Table 1. A detailed study of this database using BayesDB is provided in [10]. Here, we compose the baseline CGPM in BayesDB, CrossCat [9], a non-parametric Bayesian structure learner for high-dimensional data tables, with several CGPMs: a classical physics model written in VentureScript, a random forest classifier, factor analysis, and an ordinary least squares regressor. These composite models allow us to identify satellites that probably violate their orbital mechanics (Figure 2), as well as accurately infer the anticipated lifetimes of new satellites (Figure 3). 
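The deterministic core of the physics model used in this case study can be written directly in Python, mirroring the keplers_law procedure in the VentureScript program of Figure 2 (constants GM = 398600 km³/s² and an Earth radius of 6378 km, as in that program).

```python
import math

GM = 398600.0          # Earth's gravitational parameter, km^3/s^2
EARTH_RADIUS = 6378.0  # Earth's radius, km

def keplers_law(apogee_km, perigee_km):
    """Theoretical orbital period in minutes from Kepler's Third Law,
    mirroring the keplers_law procedure in the Figure 2 VentureScript."""
    # Semi-major axis measured from Earth's center, km.
    a = 0.5 * (abs(apogee_km) + abs(perigee_km)) + EARTH_RADIUS
    return 2.0 * math.pi * math.sqrt(a ** 3 / GM) / 60.0
```

For the International Space Station (apogee 422 km, perigee 401 km) this yields approximately 92.8 minutes, matching its recorded period in Table 1; the CGPM models each satellite's observed period as this theoretical value plus cluster-specific noise.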
We refer to [14, Section 6] for several more experiments on a broader set of data analysis tasks, as well as comparisons to baseline machine learning solutions.

Variable                     | International Space Station            | AAUSat-3                   | Advanced Orion 5 (NRO L-32, USA 223)
Country of Operator          | Multinational                          | Denmark                    | USA
Operator Owner               | NASA/Multinational                     | Aalborg University         | National Reconnaissance Office (NRO)
Users                        | Government                             | Civil                      | Military
Purpose                      | Scientific Research                    | Technology Development     | Electronic Surveillance
Class of Orbit               | LEO                                    | LEO                        | GEO
Type of Orbit                | Intermediate                           | NaN                        | NaN
Perigee km                   | 401                                    | 770                        | 35500
Apogee km                    | 422                                    | 787                        | 35500
Eccentricity                 | 0.00155                                | 0.00119                    | 0
Period minutes               | 92.8                                   | 100.42                     | NaN
Launch Mass kg               | NaN                                    | 0.8                        | 5000
Dry Mass kg                  | NaN                                    | NaN                        | NaN
Power watts                  | NaN                                    | NaN                        | NaN
Date of Launch               | 36119                                  | 41330                      | 40503
Anticipated Lifetime         | 30                                     | 1                          | NaN
Contractor                   | Boeing Satellite Systems/Multinational | Aalborg University         | National Reconnaissance Laboratory
Country of Contractor        | Multinational                          | Denmark                    | USA
Launch Site                  | Baikonur Cosmodrome                    | Satish Dhawan Space Center | Cape Canaveral
Launch Vehicle               | Proton                                 | PSLV                       | Delta 4 Heavy
Source Used for Orbital Data | www.satellitedebris.net 12/12          | SC - ASCR                  | SC - ASCR
longitude radians of geo     | NaN                                    | NaN                        | 1.761037215
Inclination radians          | 0.9005899                              | 1.721418241                | 0

Table 1: Variables in the satellite population, and three representative satellites. 
The records are\nmultivariate, heterogeneously typed, and contain arbitrary patterns of missing data.\n\n5\n\n\f1\n2\n3\n4\n5\n6\n7\n8\n9\n10\n11\n12\n13\n14\n15\n16\n17\n18\n19\n20\n21\n22\n23\n24\n25\n26\n27\n28\n29\n30\n31\n32\n33\n34\n35\n36\n37\n38\n39\n40\n41\n42\n43\n44\n45\n46\n47\n48\n49\n50\n\nCREATE TABLE satellites_ucs FROM 'satellites.csv';\nCREATE POPULATION satellites FOR satellites_ucs WITH SCHEMA ( GUESS STATTYPES FOR (*) );\n\nCREATE METAMODEL satellites_hybrid FOR satellites WITH BASELINE CROSSCAT (\n\nOVERRIDE GENERATIVE MODEL FOR type_of_orbit\nGIVEN apogee_km, perigee_km, period_minutes, users, class_of_orbit\nUSING RANDOM_FOREST (num_categories = 7);\n\nOVERRIDE GENERATIVE MODEL FOR launch_mass_kg, dry_mass_kg, power_watts, perigee_km, apogee_km\nUSING FACTOR_ANALYSIS (dimensionality = 2);\n\nOVERRIDE GENERATIVE MODEL FOR period_minutes\nAND EXPOSE kepler_cluster_id CATEGORICAL, kepler_noise NUMERICAL\nGIVEN apogee_km, perigee_km USING VENTURESCRIPT (program = '\n\ndefine dpmm_kepler = () -> {\n\n// Definition of DPMM Kepler model program.\n\nassume keplers_law = (apogee, perigee) -> {\n\n(GM, earth_radius) = (398600, 6378);\na = .5*(abs(apogee) + abs(perigee)) + earth_radius;\n2 * pi * sqrt(a**3 / GM) / 60 };\n\n// Latent variable priors.\nassume crp_alpha = gamma(1,1);\nassume cluster_id_sampler = make_crp(crp_alpha);\nassume noise_sampler = mem((cluster) -> make_nig_normal(1, 1, 1, 1));\n// Simulator for latent variables (kepler_cluster_id and kepler_noise).\nassume sim_cluster_id = mem((rowid, apogee, perigee) -> {\n\ncluster_id_sampler() #rowid:1 });\n\nassume sim_noise = mem((rowid, apogee, perigee) -> {\n\ncluster_id = sim_cluster_id(rowid, apogee, perigee);\nnoise_sampler(cluster_id)() #rowid:2 });\n\n// Simulator for observable variable (period_minutes).\nassume sim_period = mem((rowid, apogee, perigee) -> {\n\nkeplers_law(apogee, perigee) + sim_noise(rowid, apogee, perigee) });\n\nassume outputs = [sim_period, sim_cluster_id, 
sim_noise];\n\n// List of output variables.\n\n};\n// Procedures for observing the output variables.\ndefine obs_cluster_id = (rowid, apogee, perigee, value, label) -> {\n\n$label: observe sim_cluster_id( $rowid, $apogee, $perigee) = atom(value); };\n\ndefine obs_noise = (rowid, apogee, perigee, value, label) -> {\n\n$label: observe sim_noise( $rowid, $apogee, $perigee) = value; };\n\ndefine obs_period = (rowid, apogee, perigee, value, label) -> {\n\ntheoretical_period = run(sample keplers_law($apogee, $perigee));\nobs_noise( rowid, apogee, perigee, value - theoretical_period, label); };\n\ndefine observers = [obs_period, obs_cluster_id, obs_noise];\ndefine inputs = [\"apogee\", \"perigee\"];\ndefine transition = (N) -> { default_markov_chain(N) };\n\n// List of observer procedures.\n// List of input variables.\n// Transition operator.\n\n'));\nINITIALIZE 10 MODELS FOR satellites_hybrid;\nANALYZE satellites_hybrid FOR 100 ITERATIONS;\nINFER name, apogee_km, perigee_km, period_minutes, kepler_cluster_id, kepler_noise FROM satellites;\n\nFigure 2: A session in BayesDB to detect satellites whose orbits are likely violations of\nKepler\u2019s Third Law using a causal composable generative population model written in\nVentureScript. The dpmm_kepler CGPM (line 17) learns a DPMM on the residuals of each\nsatellite\u2019s deviation from its theoretical orbit. Both the cluster identity and inferred noise are\nexposed latent variables (line 14). Each dot in the scatter plot (left) is a satellite in the population,\nand its color represents the latent cluster assignment learned by dpmm_kepler. 
The histogram\n(right) shows that each of the four detected clusters roughly translates to a qualitative description\nof the deviation: yellow (negligible), magenta (noticeable), green (large), and blue (extreme).\n\n6\n\n010000200003000040000Perigee[km]010002000300040005000Period[mins]Orion6GeotailMeridian4Amos5NavStarClustersIdenti\ufb01edbyKeplerCGPMCluster1Cluster2Cluster3Cluster4TheoreticallyFeasibleOrbits1e-101e-51e01e51e10MagntiudeofDeviationfromKepler\u00b4sLaw[mins2]202122232425262728NumberofSatellitesOrion6GeotailMeridian4Amos5EmpiricalDistributionofOrbitalDeviationsNegligibleNoticeableLargeExtreme\fCREATE TABLE data_train FROM 'sat_train.csv';\n.nullify data_train 'NaN';\n\nCREATE POPULATION satellites FOR data_train\n\nWITH SCHEMA(\n\nGUESS STATTYPES FOR (*)\n\nCREATE METAMODEL crosscat_ols FOR satellites\n\nWITH BASELINE CROSSCAT(\n\nOVERRIDE GENERATIVE MODEL FOR\n\nanticipated_lifetime\n\nGIVEN\n\ntype_of_orbit, perigee_km, apogee_km,\nperiod_minutes, date_of_launch,\nlaunch_mass_kg\n\nUSING LINEAR_REGRESSION\n\n);\n\n);\n\n1\n2\n3\n4\n5\n6\n7\n8\n9\n10\n11\n12\n13\n14\n15\n16\n17\n18\n19\n20\n21\n22\n23\n24\n25\n26\n27\n28\n29\n30\n31\n32\n\nINITIALIZE 4 MODELS FOR crosscat_ols;\nANALYZE crosscat_ols FOR 100 ITERATION WAIT;\n\nCREATE TABLE data_test FROM 'sat_test.csv';\n.nullify data_test 'NaN';\n.sql INSERT INTO data_train\nSELECT * FROM data_test;\n\nCREATE TABLE predicted_lifetime AS\n\nINFER EXPLICIT\n\nPREDICT anticipated_lifetime\nCONFIDENCE prediction_confidence\nFROM satellites WHERE _rowid_ > 1000;\n\n(a) Full session in BayesDB which loads the\ntraining and test sets, creates a hybrid CGPM,\nand runs the regression using CrossCat+OLS.\n\ndef dummy_code_categoricals(frame, maximum=10):\n\ndef dummy_code_categoricals(series):\n\ncategories = pd.get_dummies(\n\nseries, dummy_na=1)\n\nif len(categories.columns) > maximum-1:\n\nreturn None\n\nif sum(categories[np.nan]) == 0:\n\ndel 
categories[np.nan]\n\ncategories.drop(\n\ncategories.columns[-1], axis=1,\ninplace=1)\n\nreturn categories\n\ndef append_frames(base, right):\n\nfor col in right.columns:\n\nbase[col] = pd.DataFrame(right[col])\n\nnumerical = frame.select_dtypes([float])\ncategorical = frame.select_dtypes([object])\n\ncategorical_coded = filter(\nlambda s: s is not None,\n[dummy_code_categoricals(categorical[c])\n\nfor c in categorical.columns])\n\njoined = numerical\n\nfor sub_frame in categorical_coded:\nappend_frames(joined, sub_frame)\n\nreturn joined\n(b) Ad-hoc Python routine (used by baselines)\nfor coding nominal predictors in a dataframe\nwith missing values and mixed data types.\n\nFigure 3: In a high-dimensional regression problem with mixed data types and missing data,\nthe composite CGPM improves prediction accuracy over purely generative and purely discrim-\ninative baselines. The task is to infer the anticipated lifetime of a held-out satellite given categorical\nand numerical features such as type of orbit, launch mass, and orbital period. As feature vectors in\nthe test set have missing entries, purely discriminative models (ridge, lasso, OLS) either heuristically\nimpute missing features, or ignore the features and predict the anticipated lifetime using the mean\nin the training set. The purely generative model (CrossCat) can impute missing features from their\njoint distribution, but only indirectly mediates dependencies between the predictors and response\nthrough latent variables. The composite CGPM (CrossCat+OLS) in panel (a) combines advantages\nof both approaches; statistical imputation followed by regression on the features leads to improved\npredictive accuracy. The reduced code size is a result of using SQL, BQL, & MML, for preprocessing,\nmodel-building and predictive querying, as opposed to collections of ad-hoc scripts such as panel (b).\n\nFigure 2 shows the MML program for constructing the hybrid CGPM on the satellites population. 
In terms of the compositional formalism from Section 2.3, the CrossCat CGPM (specified by the MML BASELINE keyword) learns the joint distribution of variables at the "root" of the network (i.e., all variables from Table 1 which do not appear as arguments to an MML OVERRIDE command). The dpmm_kepler CGPM in line 16 of the top panel in Figure 2 accepts apogee_km and perigee_km as input variables y = (A, P), and produces as output the period_minutes x = (T). These variables characterize the elliptical orbit of a satellite and are constrained by the relationships

e = (A − P)/(A + P)   and   T = 2π √( ((A + P)/2)³ / GM ),

where e is the eccentricity and GM is a physical constant. The program specifies a stochastic version of Kepler's Law using a Dirichlet process mixture model for the distribution over errors (between the theoretical and observed period),

P ∼ DP(α, NORMAL-INVERSE-GAMMA(m, V, a, b)),
(µr, σ²_r) | P ∼ P,
ε_r | {µr, σ²_r, yr} ∼ NORMAL(· | µr, σ²_r),   where ε_r := Tr − KEPLER(Ar, Pr).

The lower panels of Figure 2 illustrate how the dpmm_kepler CGPM clusters satellites based on the magnitude of the deviation from their theoretical orbits; the variables (deviation, cluster identity, etc) in these figures are obtained from the BQL query on line 50. For instance, the satellite Orion6, shown in the right panel of Figure 2, belongs to a component with "extreme" deviation. 
Further investigation reveals that Orion6 has a recorded period of 23.94 minutes, most likely a data entry error for the true period of 24 hours (1440 minutes); we have reported such errors to the maintainers of the database.

The data analysis task in Figure 3 is to infer the anticipated_lifetime xr of a new satellite, given a set of features yr such as its type_of_orbit and perigee_km. A simple OLS regressor with normal errors is used for the response pGols(xr | yr). The CrossCat baseline learns a joint generative model for the covariates pGcrosscat(yr). The composite CGPM crosscat_ols built in Figure 3 (left panel) thus carries the full joint distribution over the predictors and response pG(xr, yr), leading to more accurate predictions. Advantages of this hybrid approach are further discussed in the figure.

4 Related Work and Discussion

This paper has shown that it is possible to use a computational formalism in probabilistic programming to uniformly apply, combine, and compare a broad class of probabilistic data analysis techniques. By integrating CGPMs into BayesDB [10] and expressing their compositions in the Metamodeling Language, we have shown it is possible to combine CGPMs synthesized by automatic model discovery [9] with custom probabilistic programs, which accept and produce multivariate inputs and outputs, into coherent joint probabilistic models. Advantages of this hybrid approach to modeling and inference include combining the strengths of both generative and discriminative techniques, as well as savings in code complexity from the uniformity of the CGPM interface.

While our experiments have constructed CGPMs using VentureScript and Python implementations, the general probabilistic programming interface of CGPMs makes it possible for BayesDB to interact with a variety of systems such as BUGS [15], Stan [1], BLOG [11], Figaro [13], and others. 
Each of these systems provides varying levels of model expressiveness and inference capabilities, and can be used to construct domain-specific CGPMs with different performance properties based on the data analysis task at hand. Moreover, by expressing the data analysis tasks in BayesDB using the model-independent Bayesian Query Language [10, Section 3], CGPMs can be queried without necessarily exposing their internal structures to end users. Taken together, these characteristics help illustrate the broad utility of the BayesDB probabilistic programming platform and architecture [14, Section 5], which in principle can be used to create and query novel combinations of black-box machine learning, statistical modeling, computer simulation, and probabilistic generative models.

Our applications have so far focused on CGPMs for analyzing populations from standard multivariate statistics. A promising area for future work is extending the computational abstraction of CGPMs, as well as the Metamodeling and Bayesian Query Languages, to cover analysis tasks in other domains such as longitudinal populations [3], statistical relational settings [6], or natural language processing and computer vision. Another extension, important in practice, is developing alternative compositional algorithms for querying CGPMs (Section 2.3). The importance sampling strategy used for compositional simulate and logpdf may only be feasible when the networks are shallow and the constituent CGPMs are fairly noisy; better Monte Carlo strategies or perhaps even variational strategies may be needed for deeper networks.
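As a concrete caricature of that baseline strategy, the following sketch estimates logpdf for a two-node chain (a parent CGPM feeding a child CGPM) by likelihood weighting; logpdf_composite and normal_logpdf are hypothetical helpers, not the actual BayesDB implementation:

```python
import numpy as np

def logpdf_composite(simulate_parent, logpdf_child, x, n_samples=1000):
    """Estimate log p(x) = log E_{y ~ parent}[p(x | y)] by likelihood
    weighting: sample y from the parent's simulate and average the child's
    conditional density p(x | y) over the samples."""
    log_w = np.array([logpdf_child(x, simulate_parent())
                      for _ in range(n_samples)])
    # log-mean-exp, computed stably by factoring out the maximum
    m = log_w.max()
    return m + np.log(np.exp(log_w - m).mean())

def normal_logpdf(v, mean, var):
    """Log density of a univariate Normal(mean, var) at v."""
    return -0.5 * (np.log(2 * np.pi * var) + (v - mean) ** 2 / var)
```

For the conjugate chain y ~ Normal(0, 1), x | y ~ Normal(y, 1), the estimator closely matches the exact marginal x ~ Normal(0, 2); when the child is nearly deterministic the weights degenerate, which is precisely the failure mode that motivates better Monte Carlo or variational strategies for deeper networks.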
Additional future work for composite CGPMs includes (i) algorithms for jointly learning the internal parameters of each individual CGPM, using, e.g., imputations from its parents, and (ii) new meta-algorithms for structure learning among a collection of compatible CGPMs, in a similar spirit to the non-parametric divide-and-conquer method from [9].

We hope the formalisms in this paper lead to practical, unifying tools for data analysis that integrate these ideas, and provide abstractions that enable the probabilistic programming community to collaboratively explore these research directions.

References

[1] B. Carpenter, A. Gelman, M. Hoffman, D. Lee, B. Goodrich, M. Betancourt, M. A. Brubaker, J. Guo, P. Li, and A. Riddell. Stan: A probabilistic programming language. Journal of Statistical Software, 2016.

[2] G. Casella and R. Berger. Statistical Inference. Duxbury Advanced Series in Statistics and Decision Sciences. Thomson Learning, 2002.

[3] M. Davidian and D. M. Giltinan. Nonlinear Models for Repeated Measurement Data, volume 62. CRC Press, 1995.

[4] L. Devroye. Sample-based non-uniform random variate generation. In Proceedings of the 18th Conference on Winter Simulation, pages 260–265. ACM, 1986.

[5] D. Fink. A compendium of conjugate priors. 1997.

[6] N. Friedman, L. Getoor, D. Koller, and A. Pfeffer. Learning probabilistic relational models. In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence (IJCAI-99), pages 1300–1309, 1999.

[7] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.

[8] V. Mansinghka, D. Selsam, and Y. Perov. Venture: a higher-order probabilistic programming platform with programmable inference. CoRR, abs/1404.0099, 2014.

[9] V. Mansinghka, P. Shafto, E. Jonas, C. Petschulat, M. Gasner, and J. B. Tenenbaum.
CrossCat: A fully Bayesian nonparametric method for analyzing heterogeneous, high-dimensional data. arXiv preprint arXiv:1512.01272, 2015.

[10] V. Mansinghka, R. Tibbetts, J. Baxter, P. Shafto, and B. Eaves. BayesDB: A probabilistic programming system for querying the probable implications of data. arXiv preprint arXiv:1512.05006, 2015.

[11] B. Milch, B. Marthi, S. Russell, D. Sontag, D. L. Ong, and A. Kolobov. BLOG: Probabilistic models with unknown objects. Statistical Relational Learning, page 373, 2007.

[12] Union of Concerned Scientists. UCS Satellite Database, 2015.

[13] A. Pfeffer. Figaro: An object-oriented probabilistic programming language. Charles River Analytics Technical Report, 137, 2009.

[14] F. Saad and V. Mansinghka. Probabilistic data analysis with probabilistic programming. arXiv preprint arXiv:1608.05347, 2016.

[15] D. J. Spiegelhalter, A. Thomas, N. G. Best, W. Gilks, and D. Lunn. BUGS: Bayesian inference using Gibbs sampling, version 0.5. http://www.mrc-bsu.cam.ac.uk/bugs, 1996.