{"title": "On Tractable Computation of Expected Predictions", "book": "Advances in Neural Information Processing Systems", "page_first": 11169, "page_last": 11180, "abstract": "Computing expected predictions of discriminative models is a fundamental task in machine learning that appears in many interesting applications such as fairness, handling missing values, and data analysis. Unfortunately, computing expectations of a discriminative model with respect to a probability distribution defined by an arbitrary generative model has been proven to be hard in general. In fact, the task is intractable even for simple models such as logistic regression and a naive Bayes distribution. In this paper, we identify a pair of generative and discriminative models that enables tractable computation of expectations, as well as moments of any order, of the latter with respect to the former in case of regression. Specifically, we consider expressive probabilistic circuits with certain structural constraints that support tractable probabilistic inference. Moreover, we exploit the tractable computation of high-order moments to derive an algorithm to approximate the expectations for classification scenarios in which exact computations are intractable. Our framework to compute expected predictions allows for handling of missing data during prediction time in a principled and accurate way and enables reasoning about the behavior of discriminative models. We empirically show our algorithm to consistently outperform standard imputation techniques on a variety of datasets. 
Finally, we illustrate how our framework can be used for exploratory data analysis.", "full_text": "On Tractable Computation of Expected Predictions\n\nPasha Khosravi, YooJung Choi, Yitao Liang, Antonio Vergari, and Guy Van den Broeck\n\n{pashak,yjchoi,yliang,aver,guyvdb}@cs.ucla.edu\n\nDepartment of Computer Science\n\nUniversity of California, Los Angeles\n\nAbstract\n\nComputing expected predictions of discriminative models is a fundamental task in machine learning that appears in many interesting applications such as fairness, handling missing values, and data analysis. Unfortunately, computing expectations of a discriminative model with respect to a probability distribution defined by an arbitrary generative model has been proven to be hard in general. In fact, the task is intractable even for simple models such as logistic regression and a naive Bayes distribution. In this paper, we identify a pair of generative and discriminative models that enables tractable computation of expectations, as well as moments of any order, of the latter with respect to the former in case of regression. Specifically, we consider expressive probabilistic circuits with certain structural constraints that support tractable probabilistic inference. Moreover, we exploit the tractable computation of high-order moments to derive an algorithm to approximate the expectations for classification scenarios in which exact computations are intractable. Our framework to compute expected predictions allows for handling of missing data during prediction time in a principled and accurate way and enables reasoning about the behavior of discriminative models. 
We empirically show our algorithm to consistently outperform standard imputation techniques on a variety of datasets. Finally, we illustrate how our framework can be used for exploratory data analysis.\n\n1 Introduction\n\nLearning predictive models like regressors or classifiers from data has become a routine exercise in machine learning nowadays. Nevertheless, making predictions and reasoning about classifier behavior on unseen data is still a highly challenging task for many real-world applications. This is even more true when data is affected by uncertainty, e.g., in the case of noisy or missing observations.\nA principled way to deal with this kind of uncertainty is to probabilistically reason about the expected outcomes of a predictive model on a particular feature distribution, that is, to compute mathematical expectations of the predictive model w.r.t. a generative model representing the feature distribution. This is a common need that arises in many scenarios, including dealing with missing data [20, 14], performing feature selection [37, 4, 7], handling sensor failure and resource scaling [12], seeking explanations [25, 21, 3], or determining how "fair" the learned predictor is [38, 39, 8].\nWhile dealing with the above expectations is ubiquitous in machine learning, computing the expected predictions of an arbitrary discriminative model w.r.t. an arbitrary generative model is in general computationally intractable [14, 26]. As one would expect, the more expressive these models become, the harder it is to compute the expectations. More interestingly, even resorting to simpler discriminative models like logistic regression does not help reduce the complexity of such a task: computing the first moment of its predictions w.r.t. 
a naive Bayes model is known to be NP-hard [14].\nIn this work, we introduce a pair of expressive generative and discriminative models for regression for which it is possible to compute not only expectations but any moment efficiently. We leverage recent advancements in probabilistic circuit representations. Specifically, we prove that generative and discriminative circuits enable computing the moments in time polynomial in the size of the circuits, when they are subject to certain structural constraints which do not hinder their expressiveness. Moreover, we demonstrate that for classification even the aforementioned structural constraints cannot guarantee tractable computation. However, the expectations can be efficiently approximated in polynomial time by leveraging our algorithm for the computation of arbitrary moments.\nLastly, we investigate applications of computing expectations. We first consider the challenging scenario of missing values at test time. There, we empirically demonstrate that computing expectations of a discriminative circuit w.r.t. a generative one is a more robust and accurate option than many imputation baselines, not only for regression but also for classification. In addition, we show how we can leverage this framework for exploratory data analysis to understand the behavior of predictive models within different sub-populations.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n2 Expectations and higher order moments of discriminative models\n\nWe use uppercase letters for random variables, e.g., X, and lowercase letters for their assignments, e.g., x. Analogously, we denote sets of variables in bold uppercase, e.g., X, and their assignments in bold lowercase, e.g., x. 
The set of all possible values that X can take is denoted as X.\nLet p be a probability distribution over X and f : X → R be a discriminative model, e.g., a regressor, that assigns a real value (outcome) to each complete input configuration x ∈ X (features). The task of computing the k-th moment of f with respect to the distribution p is defined as:\n\nM_k(f, p) ≜ E_{x∼p(x)} [(f(x))^k].    (1)\n\nComputing moments of arbitrary degree k allows one to probabilistically reason about the outcomes of f. That is, it provides a description of the distribution of its predictions, assuming p as the data-generating distribution. For instance, we can compute the mean of f w.r.t. p: E_p[f] = M_1(f, p), or reason about the dispersion (variance) of its outcomes: VAR_p(f) = M_2(f, p) − (M_1(f, p))^2.\nThese computations can be a very useful tool to reason in a principled way about the behavior of f in the presence of uncertainty, such as making predictions with missing feature values [14] or deciding on a subset of X to observe [16, 37]. For example, given a partial assignment x_o to a subset X_o ⊆ X, the expected prediction of f over the unobserved variables can be computed as E_{x∼p(x|x_o)} [f(x)], which is equivalent to M_1(f, p(·|x_o)).\nUnfortunately, computing arbitrary moments, and even just the expectation, of a discriminative model w.r.t. an arbitrary distribution is, in general, computationally hard. Under the restrictive assumptions that p fully factorizes, i.e., p(X) = ∏_i p(X_i), and that f is a simple linear model of the form f(x) = ∑_i φ_i x_i, computing expectations can be done in linear time. However, the task suddenly becomes NP-hard even for slightly more expressive models, for instance when p is a naive Bayes distribution and f is a logistic regression (a generalized linear model with a sigmoid activation function). 
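As a concrete illustration of the tractable special case above, here is a minimal sketch (with made-up weights and marginals, not code from the paper): under a fully factorized p over binary features, E_p[∑_i φ_i x_i] = ∑_i φ_i P(X_i = 1), computable in time linear in the number of features, which we can sanity-check against brute-force enumeration.

```python
from itertools import product

def linear_expectation(phi, marginals):
    # E_p[sum_i phi_i * x_i] = sum_i phi_i * P(X_i = 1) under a fully
    # factorized p over binary features: linear time in the number of features.
    return sum(w * q for w, q in zip(phi, marginals))

def brute_force_expectation(phi, marginals):
    # Enumerate all 2^n configurations and weight f(x) by p(x).
    total = 0.0
    for x in product([0, 1], repeat=len(phi)):
        px = 1.0
        for xi, q in zip(x, marginals):
            px *= q if xi == 1 else 1 - q
        total += px * sum(w * xi for w, xi in zip(phi, x))
    return total

phi = [0.5, -1.0, 2.0]        # hypothetical linear weights
marginals = [0.2, 0.7, 0.5]   # hypothetical P(X_i = 1)
assert abs(linear_expectation(phi, marginals)
           - brute_force_expectation(phi, marginals)) < 1e-9
```

The agreement holds only because f is linear and p factorizes; as the text notes, relaxing either assumption even slightly makes the exact computation NP-hard.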
See [14] for a detailed discussion.\nIn Section 4, we propose a pair of generative and discriminative models that are highly expressive and yet still allow for polytime computation of exact moments and expectations of the latter w.r.t. the former. We first review the necessary background material in Section 3.\n\n3 Generative and discriminative circuits\n\nThis section introduces the pair of circuit representations we choose as expressive generative and discriminative models. In both cases, we assume the input is discrete. We later establish under which conditions computing expected predictions becomes tractable.\n\nLogical circuits  A logical circuit [11, 9] is a directed acyclic graph representing a logical formula where each node n encodes a logical sub-formula, denoted as [n]. Each inner node in the graph is either an AND or an OR gate, and each leaf (input) node encodes a Boolean literal (e.g., X or ¬X). We denote the set of child nodes of a gate n as ch(n). An assignment x satisfies node n if it satisfies the logical formula encoded by n, written x |= [n]. Fig. 1 depicts some examples of\n\n[Figure 1 appears here: (a) a vtree; (b) a probabilistic circuit; (c) a logistic/regression circuit.]\n\nFigure 1: A vtree (a) over X = {X1, X2, X3} and a generative and discriminative circuit pair (b, c) that conform to it. 
AND gates are colored as the vtree nodes they correspond to (blue and orange). For the discriminative circuit on the right, "hot wires" that form a path from input to output are colored red, for the given input configuration x = (X1 = 1, X2 = 0, X3 = 0).\n\nlogical circuits. Several syntactic properties of circuits enable efficient logical and probabilistic reasoning over them [11, 29]. We now review those properties, as they will be pivotal for our efficient computations of expectations and high-order moments in Section 4.\n\nSyntactic Properties  A circuit is said to be decomposable if for every AND gate its inputs depend on disjoint sets of variables. For notational simplicity, we will assume decomposable AND gates to have two inputs, denoted the L(eft) and R(ight) children, depending on variables X_L and X_R respectively. In addition, a circuit satisfies structured decomposability if each of its AND gates decomposes according to a vtree, a binary tree structure whose leaves are the circuit variables. That is, the L (resp. R) child of an AND gate depends on variables that appear in the left (resp. right) branch of its corresponding vtree node. Fig. 1 shows a vtree and visually maps its nodes to the AND gates of two example circuits. A circuit is smooth if, for every OR gate, all its children depend on the same set of variables [32]. Lastly, a circuit is deterministic if, for any input, at most one child of every OR node has a non-zero output. For example, Fig. 1c highlights in red the wires that are true, and that form a path from the root to the leaves, given input x = (X1 = 1, X2 = 0, X3 = 0). Note that every OR gate in Fig. 
1c has at most one hot input wire, because of the determinism property.\n\nGenerative probabilistic circuits  A probabilistic circuit (PC) is characterized by its logical circuit structure and parameters θ that are assigned to the inputs of each OR gate. Intuitively, each PC node n recursively defines a distribution p_n over the subset of the variables X appearing in the sub-circuit rooted at it. More precisely:\n\np_n(x) = 1_n(x) if n is a leaf;  p_L(x_L) · p_R(x_R) if n is an AND gate;  ∑_{i∈ch(n)} θ_i p_i(x) if n is an OR gate.    (2)\n\nHere, 1_n(x) ≜ 1{x |= [n]} indicates whether the leaf n is satisfied by input x. Moreover, x_L and x_R indicate the subsets of configuration x restricted to the decomposition defined by an AND gate over its L (resp. R) child. As such, an AND gate of a PC represents a factorization over independent sets of variables, whereas an OR gate defines a mixture model. Unless otherwise noted, in this paper we adopt PCs that satisfy structured decomposability and smoothness as our generative circuits.\nPCs allow for the exact computation of the probability of complete and partial configurations (that is, marginalization) in time linear in the size of the circuit. A well-known example of a PC is the probabilistic sentential decision diagram (PSDD) [15].¹ PSDDs have been successfully employed as state-of-the-art density estimators not only for unstructured [19] but also for structured feature spaces [5, 30, 31]. 
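The recursion in Eq. 2 can be sketched as a small evaluator (our own illustrative node encoding, not the paper's implementation): leaves check their literal, AND gates multiply their children's values over disjoint scopes, and OR gates return the θ-weighted sum of their children.

```python
def pc_value(node, x):
    """Evaluate p_n(x) following Eq. 2. A node is a tuple:
    ('leaf', var, polarity) | ('and', left, right) | ('or', [(theta, child), ...])."""
    kind = node[0]
    if kind == 'leaf':
        _, var, polarity = node
        return 1.0 if x[var] == polarity else 0.0          # 1_n(x)
    if kind == 'and':
        return pc_value(node[1], x) * pc_value(node[2], x)  # p_L * p_R
    # OR gate: mixture weighted by the theta parameters
    return sum(theta * pc_value(child, x) for theta, child in node[1])

# A tiny smooth, decomposable PC over X1, X2 with toy parameters:
# p(X1, X2) = 0.3 * [X1=1][X2=1] + 0.7 * [X1=0][X2=1]
pc = ('or', [
    (0.3, ('and', ('leaf', 'X1', 1), ('leaf', 'X2', 1))),
    (0.7, ('and', ('leaf', 'X1', 0), ('leaf', 'X2', 1))),
])
assert pc_value(pc, {'X1': 1, 'X2': 1}) == 0.3
# The mixture weights sum to one, so the distribution normalizes:
assert abs(sum(pc_value(pc, {'X1': a, 'X2': b})
               for a in (0, 1) for b in (0, 1)) - 1.0) < 1e-9
```

This toy circuit evaluates complete configurations; the marginalization property mentioned above additionally allows summing out unobserved variables in a single linear-time pass.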
Other types of PCs include sum-product networks (SPNs) and cutset networks, yet those representations are typically decomposable but not structured decomposable [23, 24].\n\n¹PSDDs by definition also satisfy determinism, but we do not require this property for computing moments.\n\nDiscriminative circuits  For the discriminative model f, we adopt and extend the semantics of logistic circuits (LCs): discriminative circuits recently introduced for classification [18]. An LC is defined by a decomposable, smooth and deterministic logical circuit with parameters φ on the inputs to OR gates. Moreover, we will work with LCs that are structured decomposable, which is a restriction already supported by their learning algorithms [18]. An LC acts as a classifier on top of a rich set of non-linear features, extracted by its logical circuit structure. Specifically, an LC assigns an embedding representation h(x) to each input example x. Each feature h(x)_k in the embedding is associated with one input k of one of the OR gates in the circuit (and thus also with one parameter φ_k). It corresponds to a logical formula that can be readily extracted from the logical circuit structure.\nClassification is performed on this new feature representation by applying a sigmoid non-linearity: f_LC(x) ≜ 1/(1 + e^{−∑_k φ_k h(x)_k}), and similar to logistic regression it is amenable to convex parameter optimization. Alternatively, one can fully characterize an LC by recursively defining the output g_m(x) of each node m given x. It can be computed as:\n\ng_m(x) = 0 if m is a leaf;  g_L(x_L) + g_R(x_R) if m is an AND gate;  ∑_{j∈ch(m)} 1_j(x)(φ_j + g_j(x)) if m is an OR gate.    (3)\n\nAgain, 1_j(x) is an indicator for x |= [j], effectively using the determinism property of LCs to select which input to pass through. 
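Eq. 3 admits an equally small evaluator (again an illustrative encoding of our own, with toy parameters): leaves output 0, AND gates add their children, and a deterministic OR gate passes through its single "hot" child j while adding that child's parameter φ_j.

```python
def satisfies(node, x):
    # Does x satisfy the logical formula [node]?  This is 1_j(x) in Eq. 3.
    kind = node[0]
    if kind == 'leaf':
        return x[node[1]] == node[2]
    if kind == 'and':
        return satisfies(node[1], x) and satisfies(node[2], x)
    return any(satisfies(child, x) for _, child in node[1])

def rc_value(node, x):
    """Evaluate g_m(x) following Eq. 3. A node is a tuple:
    ('leaf', var, polarity) | ('and', left, right) | ('or', [(phi, child), ...])."""
    kind = node[0]
    if kind == 'leaf':
        return 0.0                    # g of a leaf is 0
    if kind == 'and':
        return rc_value(node[1], x) + rc_value(node[2], x)
    # Deterministic OR: at most one child's sub-formula is satisfied by x.
    return sum(phi + rc_value(child, x)
               for phi, child in node[1] if satisfies(child, x))

# Toy circuit: g(x) = 1.5 if X1=1, else -2.0; the LC output is sigmoid(g(x)).
rc = ('or', [(1.5, ('leaf', 'X1', 1)), (-2.0, ('leaf', 'X1', 0))])
assert rc_value(rc, {'X1': 1}) == 1.5
assert rc_value(rc, {'X1': 0}) == -2.0
```

Determinism is what makes the OR case well-defined here: since at most one child fires, the sum collapses to a single term, exactly the "selection" behavior the text describes.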
Then classification is done by applying a sigmoid function to the output of the circuit root r: f_LC(x) = 1/(1 + e^{−g_r(x)}). The increased expressive power of LCs w.r.t. simple linear regressors lies in the rich representations h(x) they learn, which in turn rely on the underlying circuit structure as a powerful feature extractor [34, 33].\nLCs have been introduced for classification and were shown to outperform larger neural networks [18]. We also leverage them for regression; that is, we are interested in computing the expectations of the output of the root node g_r(x) w.r.t. a generative model p. We call an LC with no sigmoid function applied to g_r(x) a regression circuit (RC). As we will show in the next section, we are able to exactly compute any moment of an RC g w.r.t. a PC p, that is, M_k(g, p), in time polynomial in the size of the circuits, if p and g share the same vtree.\n\n4 Computing expectations and moments for circuit pairs\n\nWe now introduce our main result, which leads to efficient algorithms for tractable Expectation and Moment Computation of Circuit pairs (EC2 and MC2), in which the discriminative model is an RC and the generative model is a PC, and where both circuits are structured decomposable, sharing the same vtree. Recall that we also assume all circuits to be smooth, and the RC to be deterministic.\nTheorem 1. Let n and m be root nodes of a PC and an RC with the same vtree over X. Let s_n and s_m be their respective numbers of edges. Then, the k-th moment of g_m w.r.t. the distribution encoded by p_n, that is, M_k(g_m, p_n), can be computed exactly in time O(k² s_n s_m).²\n\nMoreover, this complexity is attained by the MC2 algorithm, which we describe in the next section. We then investigate how this result can be generalized to arbitrary circuit pairs and how restrictive the structural requirements are. In fact, we demonstrate that computing expectations and moments for circuit pairs not sharing a vtree is #P-hard. 
Furthermore, we address the hardness of computing expectations for an LC w.r.t. a PC, due to the introduction of the sigmoid function over g, by approximating them through the tractable computation of moments with the MC2 algorithm.\n\n4.1 EC2: Expectations of regression circuits\n\nIntuitively, the computation of expectations becomes tractable because we can "break it down" to the leaves of the PC and RC, where it reduces to trivial computations. Indeed, the two circuits sharing the same vtree is the property that enables a polynomial-time recursive decomposition, because it ensures that pairs of nodes considered by the algorithm depend on exactly the same set of variables.\n\n² This is a loose upper bound, since the algorithm only looks at a small subset of pairs of edges in the circuits. A tighter bound would be O(k² ∑_v s_v t_v), where v ranges over vtree nodes and s_v (resp. t_v) counts the number of edges going into the nodes of the PC (resp. RC) that can be attributed to the vtree node v.\n\nAlgorithm 1 EC2(n, m)    ▷ Cache recursive calls to achieve polynomial complexity\nRequire: A PC node n and an RC node m\n  if m is Leaf then return 0\n  else if n is Leaf then\n    if [n] |= [m_L] then return φ_{m_L}\n    if [n] |= [m_R] then return φ_{m_R}\n  else if n, m are OR then return ∑_{i∈ch(n)} θ_i ∑_{j∈ch(m)} (EC2(i, j) + φ_j PR(i, j))\n  else if n, m are AND then return PR(n_L, m_L) EC2(n_R, m_R) + PR(n_R, m_R) EC2(n_L, m_L)\n\nWe will now show how this computation recursively decomposes over pairs of OR and AND gates, starting from the roots of the PC p and RC g. We refer the reader to the Appendix for detailed proofs of all Propositions and Theorems in this section. Without loss of generality, we assume that the roots of both p and g are OR gates, and that circuit nodes alternate between AND and OR gates layerwise.\nProposition 1. Let n and m be OR gates of a PC and an RC, respectively. 
Then the expectation of the regressor g_m w.r.t. the distribution p_n is:\n\nM_1(g_m, p_n) = ∑_{i∈ch(n)} θ_i ∑_{j∈ch(m)} (M_1(1_j · g_j, p_i) + φ_j M_1(1_j, p_i)).\n\nThe above proposition illustrates how the expectation of an OR gate of an RC w.r.t. an OR gate in the PC is a weighted sum of the expectations of the child nodes. The number of smaller expectations to be computed is quadratic in the number of children. More specifically, one now has to compute the expectations of two different functions w.r.t. the children of the PC node n.\nFirst, M_1(1_j, p_i) is the expectation of the indicator function associated with the j-th child of m (see Eq. 3) w.r.t. the i-th child node of n. Intuitively, this translates to the probability of the logical formula [j] being satisfied according to the distribution encoded by p_i. Fortunately, this can be computed efficiently, in time linear in the size of each of the two circuits (quadratic overall), as already demonstrated in [5].\nOn the other hand, computing the other expectation term, M_1(1_j · g_j, p_i), requires a novel algorithm tailored to RCs and PCs. We next show how to further decompose this expectation from AND gates to their OR children.\nProposition 2. Let n and m be AND gates of a PC and an RC, respectively. Let n_L and n_R (resp. m_L and m_R) be the left and right children of n (resp. m). Then the expectation of the function (1_m · g_m) w.r.t. 
the distribution p_n is:\n\nM_1(1_m · g_m, p_n) = M_1(1_{m_L}, p_{n_L}) M_1(g_{m_R}, p_{n_R}) + M_1(1_{m_R}, p_{n_R}) M_1(g_{m_L}, p_{n_L}).\n\nWe are again left with the task of computing expectations of the RC node indicator functions, i.e., M_1(1_{m_L}, p_{n_L}) and M_1(1_{m_R}, p_{n_R}), which can also be done by exploiting the algorithm in [5]. Furthermore, note that the other expectation terms (M_1(g_{m_L}, p_{n_L}) and M_1(g_{m_R}, p_{n_R})) can readily be computed using Proposition 1, since they concern pairs of OR nodes.\nWe briefly highlight how determinism in the regression circuit plays a crucial role in enabling this computation. In fact, OR gates being deterministic ensures that the otherwise non-decomposable product of indicator functions 1_m · 1_k, where m is a parent OR gate of an AND gate k, turns out to be equal to 1_k. We refer the readers to Appendix A.3 for a detailed discussion.\nRecursively, one is guaranteed to reach pairs of leaf nodes in the RC and PC, for which the respective expectations can be computed in O(1) by checking if their associated Boolean indicators agree, and by noting that g_m(x) = 0 if m is a leaf (see Eq. 3). Putting it all together, we obtain the recursive procedure shown in Algorithm 1. Here, PR(n, m) refers to the algorithm in [5] that computes M_1(1_m, p_n). As the algorithm computes expectations in a bottom-up fashion, the intermediate computations can be cached to avoid evaluating the same pair of nodes more than once, therefore keeping the complexity as stated by our Theorem 1.\n\n4.2 MC2: Moments of regression circuits\n\nOur algorithmic solution goes beyond the tractable computation of the sole expectation of an RC. Indeed, any arbitrary-order moment of g_m can be computed w.r.t. p_n, still in polynomial time. We call this algorithm MC2 and we delineate its main routines with the following Propositions:³\nProposition 3. Let n and m be OR gates of a PC and an RC, respectively. Then the k-th moment of the regressor g_m w.r.t. 
the distribution p_n is:\n\nM_k(g_m, p_n) = ∑_{i∈ch(n)} θ_i ∑_{j∈ch(m)} ∑_{l=0}^{k} C(k, l) φ_j^{k−l} M_l(1_j · g_j, p_i),\n\nwhere C(k, l) denotes the binomial coefficient.\nProposition 4. Let n and m be AND gates of a PC and an RC, respectively. Let n_L and n_R (resp. m_L and m_R) be the left and right children of n (resp. m). Then the k-th moment of the function (1_m · g_m) w.r.t. the distribution p_n is:\n\nM_k(1_m · g_m, p_n) = ∑_{l=0}^{k} C(k, l) M_l(1_{m_L} · g_{m_L}, p_{n_L}) M_{k−l}(1_{m_R} · g_{m_R}, p_{n_R}).\n\nAnalogous to computing simple expectations, by recursively and alternately applying Propositions 3 and 4, we arrive at the moments of the leaves of both circuits, while gradually reducing the order k of the involved moments.\nFurthermore, the lower-order moments in Proposition 4 that decompose to the L and R children, e.g., M_l(1_{m_L} · g_{m_L}, p_{n_L}), can be computed by noting that they reduce to:\n\nM_k(1_m · g_m, p_n) = M_1(1_m, p_n) if k = 0, and M_k(g_m, p_n) otherwise.    (4)\n\nNote again that these computations are made possible by the interplay of the determinism of g and the shared vtrees between p and g. From the former it follows that a sum over OR gate children reduces to a single child value. The latter ensures that the AND gates in p and g decompose in the same way, thereby enabling efficient computations.\nGiven this, a natural question arises: "If we do not require a PC p and an RC g to have the same vtree structure, is computing M_k(g, p) still tractable?" Unfortunately, this is not the case, as we demonstrate in the following theorem.\nTheorem 2. Computing any moment of an RC g_m w.r.t. a PC distribution p_n where both have arbitrary vtrees is #P-hard.\n\nAt a high level, we can reduce #SAT, a well-known #P-complete problem on CNF sentences, to the moment computation problem. 
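The binomial convolution at the heart of Proposition 4 can be checked numerically in isolation (a sketch with toy moments of our own choosing, not values from the paper): for independent scopes, the k-th raw moment of a sum g_L + g_R is the binomial convolution of the children's raw moments.

```python
from math import comb

def sum_moments(mL, mR, k):
    # Raw moments of g_L + g_R under independent p_L, p_R:
    # M_k = sum_l C(k, l) * M_l(g_L) * M_{k-l}(g_R), mirroring Proposition 4.
    return sum(comb(k, l) * mL[l] * mR[k - l] for l in range(k + 1))

# Toy check: g_L and g_R are independent Bernoulli(0.5) indicators, whose raw
# moments are M_0 = 1 and M_k = 0.5 for every k >= 1.
m = [1.0, 0.5, 0.5]
# E[g_L + g_R] = 1.0 and E[(g_L + g_R)^2] = Var + mean^2 = 0.5 + 1.0 = 1.5
assert sum_moments(m, m, 1) == 1.0
assert sum_moments(m, m, 2) == 1.5
```

In the full MC2 recursion this convolution is applied at every AND-gate pair, with the OR-gate rule of Proposition 3 supplying the child moments; determinism and the shared vtree are what license treating the two scopes independently.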
Given a choice of different vtrees, we can construct an RC and a PC in time polynomial in the size of the CNF formula such that its #SAT value can be computed using the expectation of the RC w.r.t. the PC. We refer to Appendix A.3 for more details.\nSo far, we have focused our analysis on RCs, the analogue of LCs for regression. One would hope that the efficient computations of EC2 could be carried over to LCs to compute the expected predictions of classifiers. However, the application of the sigmoid function σ on the regressor g, even when g shares the same vtree as p, makes the problem intractable, as our next theorem shows.\nTheorem 3. Taking the expectation of an LC (σ ∘ g_m) w.r.t. a PC distribution p_n is NP-hard even if n and m share the same vtree.\n\nThis follows from a recent result that taking the expectation of a logistic regression w.r.t. a naive Bayes distribution is NP-hard [14]; see Appendix A.4 for a detailed proof.\n\n4.3 Approximating expectations of classifiers\n\nTheorem 3 leaves us with no hope of computing exact expected predictions in a tractable way even for pairs of generative PCs and discriminative LCs conforming to the same vtree. Nevertheless, we can leverage the ability to efficiently compute the moments of the RC g_m to efficiently approximate\n\n³ The algorithm MC2 can easily be derived from EC2 in Algorithm 1, using the equations in this section.\n\nFigure 2: Evaluating EC2 for predictions under different percentages of missing features (x-axis) over four real-world regression datasets in terms of the RMSE (y-axis) of the predictions of g((x_m, x_o)). Overall, exactly computing the expected predictions via EC2 outperforms simple imputation schemes like median and mean as well as more sophisticated ones like MICE [1] or computing the MPE configuration with the PC p. 
Detailed dataset statistics can be found in Appendix B.\n\nthe expectation of γ ∘ g_m, with γ being any differentiable non-linear function, including the sigmoid σ. Using a Taylor series approximation around a point α, we define the following d-order approximation:\n\nT_d(γ ∘ g_m, p_n) ≜ ∑_{k=0}^{d} (γ^(k)(α) / k!) M_k(g_m − α, p_n).\n\nSee Appendix A.5 for a detailed derivation and more intuition behind this approximation.\n\n5 Expected prediction in action\n\nIn this section, we empirically evaluate the usefulness and effectiveness of computing the expected predictions of our discriminative circuits with respect to generative ones.⁴ First, we tackle the challenging task of making predictions in the presence of missing values at test time, for both regression and classification.⁵ Second, we show how our framework can be used to reason about the behavior of predictive models. We employ it in the context of exploratory data analysis, to check for biases in the predictive models, or to search for interesting patterns in the predictions associated with sub-populations in the data distribution.\n\n5.1 Reasoning with missing values: an application\n\nTraditionally, prediction with missing values has been addressed by imputation, which substitutes missing values with presumably reasonable alternatives such as the mean or median, estimated from training data [28]. These imputation methods are typically heuristic and model-agnostic [20]. To overcome this, the notion of expected predictions has recently been proposed in [14] as a probabilistically principled and model-aware way to deal with missing values. Formally, we want to compute\n\nE_{x_m∼p(x_m|x_o)} [f(x_m x_o)],    (5)\n\nwhere x_m (resp. x_o) denotes the configuration of a sample x that is missing (resp. observed) at test time. In the case of regression, we can exactly compute Eq. 
5 for a pair of generative and discriminative circuits sharing the same vtree by our proposed algorithm, after observing that\n\nE_{x_m∼p(x_m|x_o)} [f(x_m x_o)] = (1 / p(x_o)) E_{x_m∼p(x_m, x_o)} [f(x_m x_o)],    (6)\n\nwhere p(x_m, x_o) is the unnormalized distribution encoded by the generative circuit configured for the evidence x_o. That is, the sub-circuits depending on the variables in X_o have been fixed according to the input x_o. This transformation, and computing any marginal p(x_o), can be done efficiently in time linear in the size of the PC [10].\nTo demonstrate the generality of our method, we construct a 6-dataset testing suite: four datasets are common regression benchmarks from several domains [13], and the rest are classification tasks on the MNIST and FASHION datasets [36, 35].\n\n⁴Our implementation of the algorithm and experiments are available at https://github.com/UCLA-StarAI/mc2.\n⁵In the case of classification, we use the Taylor expansion approximation discussed in Section 4.3.\n\n[Figure 2 panels appear here: Abalone, Delta, Insurance, and Elevators; legend: Median, Sample, M1 (ours), Mean, MICE, MPE.]\n\nFigure 3: Evaluating the first-order Taylor approximation T_1(σ ∘ g_m, p_n) of the expected predictions of a classifier for missing-value imputation, for different percentages of missing features (x-axis), in terms of the accuracy (y-axis).\n\nWe compare our method with classical imputation techniques, such as standard mean and median imputation, and more sophisticated (and computationally intensive) imputation techniques such as multiple imputation by chained equations (MICE) [1]. Moreover, we adopt a natural and strong baseline: imputing the missing values by the most probable explanation (MPE) [10], computed by probabilistic reasoning on the generative circuit p. 
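To make Eq. 5 concrete, here is a toy sketch (a fully made-up model: two binary features, a factorized p, and a small sigmoid classifier; not the circuits used in the paper) comparing the expected prediction over a missing feature with plain mean imputation.

```python
from math import exp

def sigmoid(t):
    return 1.0 / (1.0 + exp(-t))

def f(x1, x2):
    # Toy classifier: sigmoid over a linear score (hypothetical weights).
    return sigmoid(3.0 * x1 + 1.0 * x2 - 2.0)

p1 = 0.25                  # hypothetical P(X1 = 1); X2 = 1 is observed
# Eq. 5 by enumeration over the missing feature X1:
expected = (1 - p1) * f(0, 1) + p1 * f(1, 1)
# Mean imputation: plug in E[X1] instead of reasoning over it.
mean_imputed = f(p1, 1)

# The two generally differ because f is nonlinear (a Jensen-type gap):
assert abs(expected - mean_imputed) > 1e-3
```

With circuits, the enumeration is replaced by the linear-time conditioning of Eq. 6 (or the Taylor approximation of Section 4.3 for classifiers), but the quantity being computed is the same as in this brute-force toy.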
Note that the MPE inference acts as an imputation: it returns the mode of the input feature distribution, while EC2 conveys a more global statistic of the distribution of the outputs of the predictive model.\nTo enforce that the discriminative-generative pair of circuits share the same vtree, we first generate a fixed random and balanced vtree and use it to guide the respective parameter and structure learning algorithms of our circuits. In our experiments we adopt PSDDs [15] for the generative circuits. PSDDs are a subset of PCs, since they also satisfy determinism. Although we do not require determinism of generative circuits for moment computation, we use PSDDs due to the availability of their learning algorithms.\nOn image data, however, we exploit the already learned and publicly available LC structure in [18], which scores 99.4% accuracy on MNIST, competitive with much larger deep models. We learn a PSDD with the same vtree. For RCs, we adapt the parameter and structure learning of LCs [18], substituting the logistic regression objective with a ridge regression during optimization. For structure learning of both LCs and RCs, we considered up to 100 iterations while monitoring the loss on a held-out set. For PSDDs we employ the parameter and structure learning of [19] with default parameters and run it for up to 1000 iterations, until no significant improvement is seen on a held-out set.\nFigure 2 shows our method outperforming the other regression baselines. This can be explained by the fact that it computes the exact expectation, while the other techniques make restrictive assumptions to approximate the expectation. 
Mean and median imputations effectively assume that the features are independent; MICE⁶ assumes a fixed dependence formula between the features; and, as already stated, MPE only considers the highest-probability term in the expansion of the expectation.

Additionally, as we see in Figure 3, our approximation method for classification, using just the first-order expansion T1(σ ◦ g_m, p_n), is able to outperform the predictions of the other competitors. This suggests that our method is effective in approximating the true expected values. These experiments agree with the observations from [14] that, given missing data, probabilistically reasoning about the outcome of a classifier by taking expectations can generally outperform imputation techniques. Our advantage clearly comes from the PSDD learning a better density estimate of the data distribution, instead of relying on fixed prior assumptions about the features. An additional demonstration of this fact comes from the excellent performance of MPE on both datasets, which again can be credited to the PSDD learning a good distribution over the features.

5.2 Reasoning about predictive models for exploratory data analysis

We now showcase an example of how our framework can be utilized for exploratory data analysis while reasoning about the behavior of a given predictive model. Suppose an insurance company has hired us to analyze both their data and the predictions of their regression model. To simulate this scenario, we use the RC and PC circuits that were learned on the real-world Insurance dataset in the previous section (see Figure 2). This dataset lists the yearly health insurance cost of individuals living in the US, with features such as age, smoking habits, and location.
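The exploratory queries that follow all reduce to conditional moments of the regressor: the mean of the predictions under a sub-population p(· | e) is M1, and the standard deviation combines the first two moments as sqrt(M2 − M1²). A minimal sketch of that combination, where the weighted values are hypothetical stand-ins for a circuit's conditional distribution over predictions:

```python
# Given M1 = E[f] and M2 = E[f^2] under a conditioned distribution
# p(. | e), the standard deviation is sqrt(M2 - M1^2). The values and
# probabilities below are hypothetical, not the paper's circuits.

def moments(values, weights):
    """First and second moments of a finite weighted distribution."""
    z = sum(weights)
    m1 = sum(w * v for v, w in zip(values, weights)) / z
    m2 = sum(w * v * v for v, w in zip(values, weights)) / z
    return m1, m2

preds = [10_000, 25_000, 30_000, 45_000]  # hypothetical regressor outputs
probs = [0.1, 0.4, 0.3, 0.2]              # hypothetical p(x | e) weights

m1, m2 = moments(preds, probs)
std = (m2 - m1 * m1) ** 0.5
print(round(m1), round(std))
```

In the framework itself, M1 and M2 are computed exactly by the moment algorithm of Section 4 on the circuit pair, rather than by enumeration.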
Our task is to examine the behavior of the predictions, for instance whether they are biased by some sensitive attribute or whether there exist interesting patterns across sub-populations of the data.

⁶ On the Elevators dataset, we report MICE results only up to 30% missing features, as the imputation method is computationally heavy and required more than 10 hours to complete.

We might start by asking: "how different are the insurance costs between smokers and non-smokers?" This can be easily computed as

$M_1(f, p(\cdot \mid \text{Smoker})) - M_1(f, p(\cdot \mid \text{Non-smoker})) = 31{,}355 - 8{,}741 = 22{,}614$   (7)

by applying the same conditioning as in Equations 5 and 6. We can also ask: "is the predictive model biased by gender?" To answer this question, it would be interesting to compute

$M_1(f, p(\cdot \mid \text{Female})) - M_1(f, p(\cdot \mid \text{Male})) = 14{,}170 - 13{,}196 = 974$   (8)

As expected, being a smoker affects the health insurance cost much more than gender does. If it were the opposite, we would conclude that the model may be unfair or misbehaving.

In addition to examining the effect of a single feature, we may study the model on a smaller sub-population by conditioning the distribution on multiple features. For instance, suppose the insurance company is interested in expanding and, as part of its marketing plan, wants to know the effect of an individual's region, e.g., southeast (SE) versus southwest (SW), for the sub-population of female (F) smokers (S) with one child (C). By computing the following quantities, we discover that the difference in their average insurance cost is relevant, but the difference in their standard deviations is even more so, indicating a significantly different treatment of this population between regions:

$\mathbb{E}_{p_{SE}}[f] = M_1(f, p(\cdot \mid F, S, C, SE)) = 30{,}974, \quad \text{STD}_{p_{SE}}[f] = \sqrt{M_2(\cdot) - (M_1(\cdot))^2} = 11{,}229$   (9)
$\mathbb{E}_{p_{SW}}[f] = M_1(f, p(\cdot \mid F, S, C, SW)) = 27{,}250, \quad \text{STD}_{p_{SW}}[f] = \sqrt{M_2(\cdot) - (M_1(\cdot))^2} = 7{,}717$   (10)

However, one may ask why we do not estimate these values directly from the dataset. The main issue in doing so is that, as we condition on more features, fewer (if not zero) matching samples are present in the data. For example, only 4 and 3 samples match the criteria of the last two queries, respectively. Furthermore, it is not uncommon for the data to be unavailable due to sensitivity or privacy concerns, with only the models available; for instance, two insurance agencies in different regions might want to partner without sharing their data.

The expected prediction framework with probabilistic circuits allows us to efficiently compute these queries, with interesting applications in explainability and fairness. We leave a more rigorous exploration of these applications for future work.

6 Related Work

Using expected predictions to handle missing values was introduced by Khosravi et al. [14]: given a logistic regression model, they learned a conforming naive Bayes model and then computed the expected prediction using only the learned naive Bayes model. In contrast, we take the expected prediction using two distinct models; moreover, probabilistic circuits are much more expressive models. Imputation is a common and well-studied way to handle missing features; for more detail and a history of the techniques we refer the reader to Buuren [2] and Little and Rubin [20].

Probabilistic circuits enable a wide range of tractable operations. Given the two circuits, our expected prediction algorithm operates on pairs of children of nodes in the two circuits corresponding to the same vtree node, and hence has a quadratic run-time.
There are other applications that operate on similar pairs of nodes, such as multiplying the distributions of two PSDDs [29], computing the probability of a logical formula [6], and computing the KL divergence [17].

7 Conclusion

In this paper we investigated under which model assumptions it is tractable to compute expectations of certain discriminative models. We proved that, for regression, pairing a discriminative circuit with a generative one sharing the same vtree structure allows computing not only expectations but also arbitrary high-order moments in poly-time. Furthermore, we characterized when the task is otherwise hard, e.g., for classification, where a non-decomposable, non-linear function is introduced. For this scenario we devised an approximate computation that leverages the aforementioned efficient computation of the moments of regressors. Finally, we showcased how the expected prediction framework can help a data analyst reason about a predictive model's behavior under different sub-populations. This opens up several interesting research venues, from applications like reasoning about missing values and performing feature selection, to scenarios where exact and approximate computations of expected predictions can be combined.

Acknowledgements

This work is partially supported by NSF grants #IIS-1633857, #CCF-1837129, DARPA XAI grant #N66001-17-2-4032, NEC Research, and gifts from Intel and Facebook Research.

References

[1] M. J. Azur, E. A. Stuart, C. Frangakis, and P. J. Leaf. Multiple imputation by chained equations: What is it and how does it work? International Journal of Methods in Psychiatric Research, 20(1):40–49, 2011.

[2] S. van Buuren. Flexible Imputation of Missing Data. CRC Press, 2018.

[3] C.-H. Chang, E. Creager, A. Goldenberg, and D. Duvenaud. Explaining image classifiers by counterfactual generation.
In International Conference on Learning Representations, 2019.

[4] A. Choi, Y. Xue, and A. Darwiche. Same-decision probability: A confidence measure for threshold-based decisions. International Journal of Approximate Reasoning, 53(9):1415–1428, 2012.

[5] A. Choi, G. Van den Broeck, and A. Darwiche. Tractable learning for structured probability spaces: A case study in learning preference distributions. In Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI), 2015.

[6] A. Choi, G. Van den Broeck, and A. Darwiche. Tractable learning for structured probability spaces: A case study in learning preference distributions. In Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI), 2015.

[7] Y. Choi, A. Darwiche, and G. Van den Broeck. Optimal feature selection for decision robustness in Bayesian networks. In Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI), 2017.

[8] Y. Choi, G. Farnadi, B. Babaki, and G. Van den Broeck. Learning fair naive Bayes classifiers by discovering and eliminating discrimination patterns. CoRR, abs/1906.03843, 2019.

[9] A. Darwiche. A differential approach to inference in Bayesian networks. Journal of the ACM, 2003.

[10] A. Darwiche. Modeling and Reasoning with Bayesian Networks. Cambridge University Press, 2009.

[11] A. Darwiche and P. Marquis. A knowledge compilation map. Journal of Artificial Intelligence Research, 17:229–264, 2002.

[12] L. I. Galindez Olascoaga, W. Meert, M. Verhelst, and G. Van den Broeck. Towards hardware-aware tractable learning of probabilistic models. In Advances in Neural Information Processing Systems 32 (NeurIPS), 2019.

[13] J. Khiari, L. Moreira-Matias, A. Shaker, B. Ženko, and S. Džeroski. MetaBags: Bagged meta-decision trees for regression. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 637–652.
Springer, 2018.

[14] P. Khosravi, Y. Liang, Y. Choi, and G. Van den Broeck. What to expect of classifiers? Reasoning about logistic regression with missing features. In Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI), 2019.

[15] D. Kisa, G. Van den Broeck, A. Choi, and A. Darwiche. Probabilistic sentential decision diagrams. In Proceedings of the 14th International Conference on Principles of Knowledge Representation and Reasoning (KR), 2014.

[16] A. Krause and C. Guestrin. Optimal value of information in graphical models. Journal of Artificial Intelligence Research, 35:557–591, 2009.

[17] Y. Liang and G. Van den Broeck. Towards compact interpretable models: Shrinking of learned probabilistic sentential decision diagrams. In IJCAI 2017 Workshop on Explainable Artificial Intelligence (XAI), 2017.

[18] Y. Liang and G. Van den Broeck. Learning logistic circuits. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence (AAAI), 2019.

[19] Y. Liang, J. Bekker, and G. Van den Broeck. Learning the structure of probabilistic sentential decision diagrams. In Proceedings of the 33rd Conference on Uncertainty in Artificial Intelligence (UAI), 2017.

[20] R. J. Little and D. B. Rubin. Statistical Analysis with Missing Data, volume 793. Wiley, 2019.

[21] S. M. Lundberg and S.-I. Lee. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems 30, pages 4765–4774. Curran Associates, Inc., 2017.

[22] W. J. Nash, T. L. Sellers, S. R. Talbot, A. J. Cawthorn, and W. B. Ford. The population biology of abalone (Haliotis species) in Tasmania. I. Blacklip abalone (H. rubra) from the north coast and islands of Bass Strait. Sea Fisheries Division, Technical Report, 48, 1994.

[23] H. Poon and P. Domingos. Sum-product networks: A new deep architecture. In UAI, 2011.

[24] T. Rahman, P. Kothalkar, and V. Gogate. Cutset networks: A simple, tractable, and scalable approach for improving the accuracy of Chow-Liu trees. In Machine Learning and Knowledge Discovery in Databases, volume 8725 of LNCS, pages 630–645. Springer, 2014.

[25] M. T. Ribeiro, S. Singh, and C. Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144. ACM, 2016.

[26] D. Roth. On the hardness of approximate reasoning. Artificial Intelligence, 82(1–2):273–302, 1996.

[27] Y. Rozenholc, T. Mildenberger, and U. Gather. Combining regular and irregular histograms by penalized likelihood. Computational Statistics & Data Analysis, 54(12):3313–3323, 2010.

[28] J. L. Schafer. Multiple imputation: A primer. Statistical Methods in Medical Research, 8(1):3–15, 1999.

[29] Y. Shen, A. Choi, and A. Darwiche. Tractable operations for arithmetic circuits of probabilistic models. In Advances in Neural Information Processing Systems 29, pages 3936–3944. Curran Associates, Inc., 2016.

[30] Y. Shen, A. Choi, and A. Darwiche. A tractable probabilistic model for subset selection. In UAI, 2017.

[31] Y. Shen, A. Choi, and A. Darwiche. Conditional PSDDs: Modeling and learning with modular knowledge. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[32] A. Shih, G. Van den Broeck, P. Beame, and A. Amarilli. Smoothing structured decomposable circuits. In Advances in Neural Information Processing Systems 32 (NeurIPS), 2019.

[33] A. Vergari, N. Di Mauro, and F. Esposito. Visualizing and understanding sum-product networks. arXiv preprint, 2016.
URL https://arxiv.org/abs/1608.08266.

[34] A. Vergari, R. Peharz, N. Di Mauro, A. Molina, K. Kersting, and F. Esposito. Sum-product autoencoding: Encoding and decoding representations using sum-product networks. In AAAI, 2018.

[35] H. Xiao, K. Rasul, and R. Vollgraf. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. CoRR, abs/1708.07747, 2017.

[36] Y. LeCun, C. Cortes, and C. J. Burges. The MNIST database of handwritten digits, 2009.

[37] S. Yu, B. Krishnapuram, R. Rosales, and R. B. Rao. Active sensing. In Artificial Intelligence and Statistics, pages 639–646, 2009.

[38] M. B. Zafar, I. Valera, M. G. Rodriguez, and K. P. Gummadi. Fairness constraints: Mechanisms for fair classification. arXiv preprint arXiv:1507.05259, 2015.

[39] M. B. Zafar, I. Valera, M. Rodriguez, K. Gummadi, and A. Weller. From parity to preference-based notions of fairness in classification. In Advances in Neural Information Processing Systems, pages 229–239, 2017.