{"title": "Bayesian Learning of Sum-Product Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 6347, "page_last": 6358, "abstract": "Sum-product networks (SPNs) are flexible density estimators and have received significant attention due to their attractive inference properties. While parameter learning in SPNs is well developed, structure learning leaves something to be desired: Even though there is a plethora of SPN structure learners, most of them are somewhat ad-hoc and based on intuition rather than a clear learning principle. In this paper, we introduce a well-principled Bayesian framework for SPN structure learning. First, we decompose the problem into i) laying out a computational graph, and ii) learning the so-called scope function over the graph. The first is rather unproblematic and akin to neural network architecture validation. The second represents the effective structure of the SPN and needs to respect the usual structural constraints in SPN, i.e. completeness and decomposability. While representing and learning the scope function is somewhat involved in general, in this paper, we propose a natural parametrisation for an important and widely used special case of SPNs. These structural parameters are incorporated into a Bayesian model, such that simultaneous structure and parameter learning is cast into monolithic Bayesian posterior inference. In various experiments, our Bayesian SPNs often improve test likelihoods over greedy SPN learners. Further, since the Bayesian framework protects against overfitting, we can evaluate hyper-parameters directly on the Bayesian model score, waiving the need for a separate validation set, which is especially beneficial in low data regimes. Bayesian SPNs can be applied to heterogeneous domains and can easily be extended to nonparametric formulations. 
Moreover, our Bayesian approach is the first, which consistently and robustly learns SPN structures under missing data.", "full_text": "Bayesian Learning of Sum-Product Networks\n\nMartin Trapp1,2, Robert Peharz3, Hong Ge3,\n\nFranz Pernkopf1, Zoubin Ghahramani4,3\n1Graz University of Technology, 2OFAI,\n\n3University of Cambridge, 4Uber AI\n\nmartin.trapp@tugraz.at, rp587@cam.ac.uk, hg344@cam.ac.uk\n\npernkopf@tugraz.at, zoubin@eng.cam.ac.uk\n\nAbstract\n\nSum-product networks (SPNs) are \ufb02exible density estimators and have received\nsigni\ufb01cant attention due to their attractive inference properties. While parameter\nlearning in SPNs is well developed, structure learning leaves something to be\ndesired: Even though there is a plethora of SPN structure learners, most of them are\nsomewhat ad-hoc and based on intuition rather than a clear learning principle. In\nthis paper, we introduce a well-principled Bayesian framework for SPN structure\nlearning. First, we decompose the problem into i) laying out a computational\ngraph, and ii) learning the so-called scope function over the graph. The \ufb01rst is\nrather unproblematic and akin to neural network architecture validation. The second\nrepresents the effective structure of the SPN and needs to respect the usual structural\nconstraints in SPN, i.e. completeness and decomposability. While representing\nand learning the scope function is somewhat involved in general, in this paper,\nwe propose a natural parametrisation for an important and widely used special\ncase of SPNs. These structural parameters are incorporated into a Bayesian model,\nsuch that simultaneous structure and parameter learning is cast into monolithic\nBayesian posterior inference. In various experiments, our Bayesian SPNs often\nimprove test likelihoods over greedy SPN learners. 
Further, since the Bayesian\nframework protects against over\ufb01tting, we can evaluate hyper-parameters directly\non the Bayesian model score, waiving the need for a separate validation set, which\nis especially bene\ufb01cial in low data regimes. Bayesian SPNs can be applied to\nheterogeneous domains and can easily be extended to nonparametric formulations.\nMoreover, our Bayesian approach is the \ufb01rst, which consistently and robustly learns\nSPN structures under missing data.\n\n1\n\nIntroduction\n\nSum-product networks (SPNs) [29] are a prominent type of deep probabilistic model, as they are\na \ufb02exible representation for high-dimensional distributions, yet allowing fast and exact inference.\nLearning SPNs can be naturally organised into structure learning and parameter learning, following\nthe same dichotomy as in probabilistic graphical models (PGMs) [16]. Like in PGMs, state-of-\nthe-art SPN parameter learning covers a wide range of well-developed techniques. In particular,\nvarious maximum likelihood approaches have been proposed using either gradient-based optimisation\n[37, 27, 3, 40] or expectation-maximisation (and related schemes) [25, 29, 46]. In addition, several\ndiscriminative criteria, e.g. [9, 14, 39, 32], as well as Bayesian approaches to parameter learning,\ne.g. [44, 33, 43], have been developed.\nConcerning structure learning, however, the situation is remarkably different. Although there is a\nplethora of structure learning approaches for SPNs, most of them can be described as a heuristic. 
For example, the most prominent structure learning scheme, LearnSPN [10], derives an SPN structure by recursively clustering the data instances (yielding sum nodes) and partitioning data dimensions (yielding product nodes).

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: A computational graph G (left) and an SPN structure (right), defined by the scope function ψ, discovered using posterior inference on an encoding of ψ. The SPN contains only a subset of the nodes in G, as some sub-trees are allocated an empty scope (dotted), evaluating to constant 1. Note that the graph G only encodes the topological layout of nodes, while the "effective" SPN structure is encoded via ψ.

Each of these steps can be understood as some local structure improvement, and as an attempt to optimise a local criterion. While LearnSPN is an intuitive scheme and elegantly maps the structural SPN semantics onto an algorithmic procedure, the fact that the global goal of structure learning is not declared is unsatisfying. This principal shortcoming of LearnSPN is shared by its many variants, such as online LearnSPN [18], ID-SPN [35], LearnSPN-b [42], mixed SPNs [22], and automatic Bayesian density analysis (ABDA) [43].
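To make the recursive scheme concrete, here is a heavily simplified, purely illustrative Python sketch in the spirit of LearnSPN; the naive half-splits stand in for the actual independence tests and instance clustering, and all names are hypothetical:

```python
import random

# Toy sketch of LearnSPN-style recursion (illustrative only, NOT the original
# algorithm): alternately partition dimensions into groups (product node) and
# cluster instances (sum node); stop at single dimensions (leaf).
def learn_spn(data, dims, depth=0):
    if len(dims) == 1:
        return ("leaf", dims[0])
    if depth % 2 == 0:
        # "Partition" step: a naive split into two halves stands in for an
        # independence test over dimensions.
        mid = len(dims) // 2
        return ("product", [learn_spn(data, dims[:mid], depth + 1),
                            learn_spn(data, dims[mid:], depth + 1)])
    # "Cluster" step: a naive split of the instances stands in for clustering.
    half = len(data) // 2
    return ("sum", [0.5, 0.5], [learn_spn(data[:half], dims, depth + 1),
                                learn_spn(data[half:], dims, depth + 1)])

data = [[random.random() for _ in range(4)] for _ in range(8)]
structure = learn_spn(data, dims=[0, 1, 2, 3])
```

Each recursive call optimises only a local criterion, which is exactly the point the text makes: no global objective for the resulting structure is ever declared.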
Other approaches also lack a sound learning principle: [5] and [1] derive SPN structures from k-means and SVD clustering, respectively; [23] grows SPNs bottom-up using a heuristic based on the information bottleneck; [6] uses a heuristic structure exploration; and [13] uses a variant of hard EM to decide when to enlarge or shrink an SPN structure.

All of the approaches mentioned above fall short of posing some fundamental questions: What is a good SPN structure? Or, what is a good principle to derive an SPN structure? This situation is somewhat surprising, since the literature on PGMs offers a rich set of learning principles: in PGMs, the main strategy is to optimise a structure score such as minimum description length (MDL) [38], the Bayesian information criterion (BIC) [16] or the Bayes-Dirichlet (BD) score [4, 11]. Moreover, in [7] an approximate MCMC sampler was proposed for full Bayesian structure learning.

In this paper, we propose a well-principled Bayesian approach to SPN learning, by simultaneously performing inference over both structure and parameters. We first decompose the structure learning problem into two steps, namely i) proposing a computational graph, laying out the arrangement of sums, products and leaf distributions, and ii) learning the so-called scope function, which assigns to each node its scope.1 The first step is straightforward, as computational graphs have only very few requirements, while the second step, learning the scope function, is more involved in full generality. Therefore, we propose a parametrisation of the scope function for a widely used special case of SPNs, namely SPNs following so-called tree-shaped region graphs [5, 27]. This restriction allows us to elegantly encode the scope function via categorical variables.
Now Bayesian learning becomes conceptually simple: we equip all latent variables and the leaves with appropriate priors and perform monolithic Bayesian inference, implemented via Gibbs updates. Figure 1 illustrates our approach of disentangling structure learning and performing Bayesian inference on an encoding of the scope function.

In summary, our main contributions in this paper are:

• We propose a novel and well-principled approach to SPN structure learning, by decomposing the problem into finding a computational graph and learning a scope function.

• To learn the scope function, we propose a natural parametrisation for an important sub-type of SPNs, which allows us to formulate a joint Bayesian framework simultaneously over structure and parameters.

• Bayesian SPNs are protected against overfitting, waiving the necessity of a separate validation set, which is beneficial in low data regimes. Furthermore, they naturally deal with missing data and are the first, to the best of our knowledge, to consistently and robustly learn SPN structures under missing data. Bayesian SPNs can easily be extended to nonparametric formulations, supporting growing data domains.

1The scope of a node is the subset of random variables the node is responsible for; it needs to fulfil the so-called completeness and decomposability conditions, see Section 3.

2 Related Work

The majority of structure learning approaches for SPNs, such as LearnSPN [10] and its variants, e.g. [18, 35, 42, 22], or other approaches, such as [5, 23, 1, 6, 13], heuristically generate a structure by optimising some local criterion. However, none of these approaches defines an overall goal of structure learning, and all of them lack a sound objective to derive SPN structures.
Bayesian SPNs, on the other hand, follow a well-principled approach using posterior inference over structure and parameters.

The most notable attempts at principled structure learning of SPNs include ABDA [43] and the existing nonparametric variants of SPNs [17, 41]. Even though the Bayesian treatment in ABDA, i.e. posterior inference over the parameters of the latent variable models located at the leaf nodes, can be understood as a kind of local Bayesian structure learning, the approach heavily relies on a heuristically predefined SPN structure. In fact, Bayesian inference in ABDA only allows adaptations of the parameters and the latent variables at the leaves and does not infer the general structure of the SPN. Therefore, ABDA can be understood as a particular case of Bayesian SPNs in which the overall structure is kept fixed, and inference is only performed over the latent variables at the leaves and the parameters of the SPN.

On the other hand, nonparametric formulations of SPNs, i.e. [17, 41], use Bayesian nonparametric priors for both structure and parameters. However, the existing approaches do not use an efficient representation; e.g. [41] uses uniform distributions over all possible partitions of the scope at each product node, making posterior inference infeasible for real-world applications. Bayesian SPNs and their nonparametric extensions, on the other hand, can be applied to real-world applications, as shown in Section 6.

Besides structure learning in SPNs, there are various approaches for other tractable probabilistic models (which also allow exact and efficient inference), such as probabilistic sentential decision diagrams (PSDDs) [15] and cutset networks (CNets) [31]. Most notably, the work in [19] introduces a greedy approach to optimise a heuristically defined global objective for structure learning in PSDDs.
However, similar to structure learning of selective SPNs [24] (a restricted sub-type of SPNs), the global objective of the optimisation is not well-principled. Existing approaches for CNets, on the other hand, mainly use heuristics to define the structure: in [21], structures are constructed randomly, while [30] compiles a learned latent variable model, such as an SPN, into a CNet. However, all these approaches lack a sound objective defining what a good structure is.

3 Background

Let X = {X1, . . . , XD} be a set of D random variables (RVs), for which N i.i.d. samples are available. Let xn,d be the nth observation of the dth dimension and xn := (xn,1, . . . , xn,D). Our goal is to estimate the distribution of X using a sum-product network (SPN). In the following we review SPNs, but use a more general definition than usual, in order to facilitate our discussion below. In this paper, we define an SPN S as a 4-tuple S = (G, ψ, w, θ), where G is a computational graph, ψ is a scope function, w is a set of sum-weights, and θ is a set of leaf parameters. In the following, we explain these terms in more detail.

Definition 1 (Computational graph). The computational graph G is a connected directed acyclic graph, containing three types of nodes: sums (S), products (P) and leaves (L). A node in G has no children if and only if it is of type L. When we do not discriminate between node types, we use N for a generic node. S, P, L, and N denote the collections of all S, all P, all L, and all N in G, respectively. The set of children of node N is denoted as ch(N). In this paper, we require that G has only a single root (node without parent).

Definition 2 (Scope function). The scope function is a function ψ : N → 2^X, assigning to each node in G a subset of X (2^X denotes the power set of X). It has the following properties:

1. If N is the root node, then ψ(N) = X.

2. 
If N is a sum or product, then ψ(N) = ⋃_{N'∈ch(N)} ψ(N').

3. For each S ∈ S we have ∀N, N' ∈ ch(S) : ψ(N) = ψ(N') (completeness).

4. For each P ∈ P we have ∀N, N' ∈ ch(P) : ψ(N) ∩ ψ(N') = ∅ (decomposability).

Each node N in G represents a distribution over the random variables ψ(N), as described in the following. Each leaf L computes a pre-specified distribution over its scope ψ(L) (for ψ(L) = ∅, we set L ≡ 1). We assume that L is parametrised by θL, and that θL represents a distribution for any possible choice of ψ(L). In the most naive setting, we would maintain a separate parameter set for each of the 2^D possible choices for ψ(L), but this would quickly become intractable. In this paper, we simply assume that θL contains D parameters θL,1, . . . , θL,D of single-dimensional distributions (e.g. Gaussian, Bernoulli, etc.), and that for a given ψ(L), the represented distribution factorises: L = ∏_{Xi∈ψ(L)} p(Xi | θL,i). However, more elaborate schemes are possible. Note that our definition of leaves is quite distinct from prior art: previously, leaves were defined to be distributions over a fixed scope; our leaves define at all times distributions over all 2^D possible scopes. The set θ = {θL} denotes the collection of parameters for all leaf nodes. A sum node S computes a weighted sum S = ∑_{N∈ch(S)} wS,N N. Each weight wS,N is non-negative and can w.l.o.g. [26, 45] be assumed to be normalised: wS,N ≥ 0, ∑_{N∈ch(S)} wS,N = 1. We denote the set of all sum-weights of S as wS and use w to denote the set of all sum-weights in the SPN. A product node P simply computes P = ∏_{N∈ch(P)} N.

The two conditions we require for ψ, completeness and decomposability, ensure that each node N is a probability distribution over ψ(N). The distribution represented by S is defined as the distribution of the root node in G and denoted as S(x). Furthermore, completeness and decomposability are essential to render many inference scenarios tractable in SPNs. In particular, arbitrary marginalisation tasks reduce to marginalisation at the leaves, i.e. they simplify to several marginalisation tasks over (small) subsets of X, followed by an evaluation of the internal part (sums and products) in a simple feed-forward pass [26]. Thus, exact marginalisation can be computed in time linear in the size of the SPN (assuming constant-time marginalisation at the leaves). Conditioning can be tackled similarly. Note that marginalisation and conditioning are key inference routines in probabilistic reasoning, which is why SPNs are generally referred to as tractable probabilistic models.

4 Bayesian Sum-Product Networks

All previous work defines the structure of an SPN in an entangled way, i.e. the scope is seen as an inherent property of the nodes in the graph. In this paper, however, we propose to decouple these aspects of an SPN structure by searching over G and nested learning of ψ. Note that G has few structural requirements and can be validated like a neural network structure. Consequently, we fix G in the following discussion and cross-validate it in our experiments. Learning ψ is challenging, as ψ has non-trivial structure due to the completeness and decomposability conditions. In the following, we develop a parametrisation of ψ and incorporate it into a Bayesian framework.
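The feed-forward semantics and linear-time marginalisation described in Section 3 can be sketched as follows (illustrative Python with hypothetical class names; unit-variance Gaussian leaves are an assumption made for the example). Setting a dimension to None marginalises it at the leaves, while the internal sums and products are evaluated unchanged:

```python
import math

# Minimal SPN node classes (hypothetical names, for illustration only).
class Leaf:
    def __init__(self, scope, means):
        self.scope = scope            # set of dimension indices
        self.means = means            # per-dimension Gaussian means (unit variance)

    def value(self, x):
        # Factorised leaf: product of univariate Gaussians over its scope.
        # Dimensions set to None are marginalised out (their factor is 1).
        v = 1.0
        for d in self.scope:
            if x[d] is not None:
                v *= math.exp(-0.5 * (x[d] - self.means[d]) ** 2) / math.sqrt(2 * math.pi)
        return v

class Sum:
    def __init__(self, children, weights):
        self.children, self.weights = children, weights

    def value(self, x):
        # Weighted sum over children (weights assumed normalised).
        return sum(w * c.value(x) for w, c in zip(self.weights, self.children))

class Product:
    def __init__(self, children):
        self.children = children

    def value(self, x):
        v = 1.0
        for c in self.children:
            v *= c.value(x)
        return v

# A tiny complete and decomposable SPN over X1, X2.
l1 = Leaf({0}, {0: -1.0}); l2 = Leaf({0}, {0: 1.0})
l3 = Leaf({1}, {1: 0.0})
spn = Sum([Product([l1, l3]), Product([l2, l3])], [0.3, 0.7])

full = spn.value([0.5, 0.2])    # joint density at (x1, x2)
marg = spn.value([0.5, None])   # marginal over X1, with X2 integrated out
```

The marginal is obtained in a single feed-forward pass of the same network, which is the linear-time marginalisation property the text refers to.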
We first revisit Bayesian parameter learning in SPNs using a fixed ψ.

4.1 Learning Parameters w, θ – Fixing Scope Function ψ

The key insight for Bayesian parameter learning [44, 33, 43] is that sum nodes can be interpreted as latent variables, clustering data instances [29, 45, 25]. Formally, consider any sum node S and assume that it has KS children. For each data instance xn and each S, we introduce a latent variable ZS,n with KS states and a categorical distribution given by the weights wS of S. Intuitively, the sum node S represents a latent clustering of data instances over its children. Let Zn = {ZS,n}_{S∈S} be the collection of all ZS,n. To establish the interpretation of sum nodes as latent variables, we introduce the notion of an induced tree [44]. We omit the sub-script n when a distinction between data instances is not necessary.

Definition 3 (Induced tree [44]). Let an SPN S = (G, ψ, w, θ) be given. Consider a sub-graph T of G obtained as follows: i) for each sum S ∈ G, delete all but one outgoing edge, and ii) delete all nodes and edges which are now unreachable from the root. Any such T is called an induced tree of G (sometimes also denoted as an induced tree of S). The SPN distribution can always be written as the mixture

S(x) = ∑_{T∼S} ∏_{(S,N)∈T} wS,N ∏_{L∈T} L(xL),    (1)

where the sum runs over all possible induced trees in S, and L(xL) denotes the evaluation of L on the restriction of x to ψ(L).

Figure 2: Plate notation of our generative model for Bayesian structure and parameter learning.

We define a function T(z) which assigns to each value z of Z the induced tree determined by z, i.e. where z indicates the kept sum edges in Definition 3.
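As a numerical sanity check of this latent-variable view (toy numbers, purely illustrative): summing p(z) p(x | z) over all joint states of the sum latents recovers the feed-forward SPN value, because the latents of sum nodes outside the induced tree marginalise to one.

```python
import itertools

# Toy two-layer SPN: a root sum over two products; each product multiplies two
# inner sums; each inner sum mixes two fixed leaf values (all numbers made up).
w_root = [0.4, 0.6]                                          # weights of root sum S0
w_inner = [[0.3, 0.7], [0.5, 0.5], [0.2, 0.8], [0.9, 0.1]]   # weights of S1..S4
leaves = [[0.9, 0.2], [0.4, 0.5], [0.6, 0.3], [0.1, 0.7]]    # leaf values under S1..S4

inner_vals = [sum(w * l for w, l in zip(ws, ls)) for ws, ls in zip(w_inner, leaves)]
feed_forward = (w_root[0] * inner_vals[0] * inner_vals[1]
                + w_root[1] * inner_vals[2] * inner_vals[3])

# Sum over ALL joint states z = (z0, z1, ..., z4) of p(z) * p(x | z):
# p(z) is the product of all sum-weights, p(x | z) the product of leaves in T(z).
total = 0.0
for z in itertools.product(range(2), repeat=5):
    z0, zs = z[0], z[1:]
    prior = w_root[z0]
    for ws, zi in zip(w_inner, zs):
        prior *= ws[zi]
    # Induced tree T(z): the root choice z0 selects inner sums (S1, S2) or (S3, S4);
    # the states of the two inner sums outside T(z) do not affect the likelihood.
    active = (0, 1) if z0 == 0 else (2, 3)
    lik = 1.0
    for i in active:
        lik *= leaves[i][zs[i]]
    total += prior * lik

assert abs(total - feed_forward) < 1e-12
```

The weights of the inactive inner sums sum to one and hence drop out of the marginal, which is exactly why the mixture over induced trees equals the feed-forward value.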
Note that the function T(z) is surjective but not injective, and thus T(z) is not invertible. However, it is "partially" invertible, in the following sense: any T splits the set of all sum nodes S into two sets, namely the set of sum nodes S_T which are contained in T, and the set of sum nodes S̄_T which are not. For any T, we can identify (invert) the state zS for any S ∈ S_T, as it corresponds to the unique child of S in T. On the other hand, the state of any S ∈ S̄_T is arbitrary. In short, given an induced tree T, we can perfectly retrieve the states of the (latent variables of) sum nodes in T, while the states of the other latent variables are arbitrary.

Now, define the conditional distribution p(x | z) = ∏_{L∈T(z)} L(xL) and the prior p(z) = ∏_{S∈S} wS,zS, where wS,zS is the sum-weight indicated by zS. When marginalising Z from the joint p(x, z) = p(x | z) p(z), we yield

∑_z ∏_{S∈S} wS,zS ∏_{L∈T(z)} L(xL) = ∑_T ∑_{z∈T⁻¹(T)} ∏_{S∈S} wS,zS ∏_{L∈T(z)} L(xL)    (2)

= ∑_T ∏_{(S,N)∈T} wS,N ∏_{L∈T} L(xL) ( ∑_{z̄} ∏_{S∈S̄_T} wS,z̄S ) = S(x),    (3)

where the bracketed factor ∑_{z̄} ∏_{S∈S̄_T} wS,z̄S = 1, establishing the SPN distribution (1) as a latent variable model with Z marginalised out. In (2), we split the sum over all z into a double sum over all induced trees T and all z ∈ T⁻¹(T), where T⁻¹(T) is the pre-image of T under T, i.e. the set of all z for which T(z) = T.
As discussed above, the set T⁻¹(T) is made up of a unique z-assignment for each S ∈ S_T, corresponding to the unique sum-edge (S, N) ∈ T, and all possible assignments for S ∈ S̄_T, leading to (3).

It is now conceptually straightforward to extend the model to a Bayesian setting, by equipping the sum-weights w and leaf parameters θ with suitable priors. In this paper, we assume Dirichlet priors for the sum-weights and some parametric form L(· | θL) for each leaf, with a conjugate prior over θL, leading to the following generative model:

wS | α ∼ Dir(wS | α)  ∀S,        zS,n | wS ∼ Cat(zS,n | wS)  ∀S ∀n,
θL | γ ∼ p(θL | γ)  ∀L,          xn | zn, θ ∼ ∏_{L∈T(zn)} L(xL,n | θL)  ∀n.    (4)

We now extend the model to also comprise the SPN's "effective" structure, the scope function ψ.

4.2 Jointly Learning w, θ and ψ

Given a computational graph G, we wish to learn ψ in addition to the SPN's parameters w and θ, and adopt it in our generative model (4). For general graphs G, representing ψ in an amenable form is rather involved. Therefore, in this paper, we restrict ourselves to the class of SPNs whose computational graph G follows a tree-shaped region graph, which leads to a natural encoding of ψ. Region graphs can be understood as a "vectorised" representation of SPNs, and have been used in several SPN learners, e.g. [5, 23, 27].

Definition 4 (Region graph). Given a set of random variables X, a region graph is a tuple (R, ψ) where R is a connected directed acyclic graph containing two types of nodes: regions (R) and partitions (P). R is bipartite w.r.t. these two types of nodes, i.e. children of R are only of type P and vice versa. R has a single root (node with no parents) of type R, and all leaves are also of type R. Let R be the set of all R and P be the set of all P. The scope function is a function ψ : R ∪ P → 2^X, with the following properties: 1) If R ∈ R is the root, then ψ(R) = X. 2) If Q is either a region with children or a partition, then ψ(Q) = ⋃_{Q'∈ch(Q)} ψ(Q'). 3) For all P ∈ P we have ∀R, R' ∈ ch(P) : ψ(R) ∩ ψ(R') = ∅. 4) For all R ∈ R we have ∀P ∈ ch(R) : ψ(R) = ψ(P).

Note that we generalised previous notions of a region graph [5, 23, 27], also decoupling its graphical structure R and the scope function ψ (we are deliberately overloading the symbol ψ here). Given a region graph (R, ψ), we can easily construct an SPN structure (G, ψ) as follows. To construct the SPN graph G, we introduce a single sum node for the root region in R; this sum node will be the output of the SPN. For each leaf region R, we introduce I SPN leaves. For each other region R, which is neither root nor leaf, we introduce J sum nodes. Both I and J are hyper-parameters of the model. For each partition P we introduce all possible cross-products of nodes from P's child regions. More precisely, let ch(P) = {R1, . . . , RK} and let Nk be the assigned set of nodes in each child region Rk. Now, we construct all possible cross-products P = N1 × ··· × NK, where Nk ∈ Nk, for 1 ≤ k ≤ K. Each of these cross-products is connected as a child of each sum node in each parent region of P. We refer to the supplement for a detailed description, including the algorithm to construct the region graphs used in this paper.

The scope function ψ of the SPN is inherited from the ψ of the region graph: any SPN node introduced for a region (partition) gets the same scope as the region (partition) itself.
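A minimal sketch of this construction on a toy tree-shaped region graph (illustrative Python; the names and dict-based region-graph encoding are assumptions, with I and J as in the text):

```python
import itertools

# Build SPN nodes on top of a tree-shaped region graph (sketch, hypothetical names).
# Leaf regions get I leaf nodes, internal regions J sum nodes (1 for the root),
# and each partition contributes all cross-products of its child regions' nodes.
I, J = 2, 2

def build_nodes(region, is_root=True):
    """Return the list of SPN nodes assigned to `region` (a nested dict)."""
    if not region.get("children"):                    # leaf region -> I leaves
        return [("leaf", region["name"], i) for i in range(I)]
    n_sums = 1 if is_root else J                      # single sum node at the root
    sums = [("sum", region["name"], j) for j in range(n_sums)]
    for partition in region["children"]:              # region children are partitions
        child_nodes = [build_nodes(r, is_root=False) for r in partition["children"]]
        # All cross-products of nodes from the partition's child regions ...
        products = [("prod", combo) for combo in itertools.product(*child_nodes)]
        # ... become children of every sum node in the parent region.
        for s in sums:
            region.setdefault("edges", []).extend((s, p) for p in products)
    return sums

# Toy region graph: root region over one partition with two leaf-region children.
rg = {"name": "root", "children": [
        {"name": "part0", "children": [{"name": "R1"}, {"name": "R2"}]}]}
root_nodes = build_nodes(rg)
# 1 root sum node; I * I = 4 cross-products, each connected to the root sum.
```

With I = J = 1 the resulting G is a tree; for larger I and J the cross-products make it a DAG, as noted in the text.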
It is easy to check that, if the SPN's G follows R using the above construction, any proper scope function according to Definition 4 corresponds to a proper scope function according to Definition 2.

In this paper, we consider SPN structures (G, ψ) following a tree-shaped region graph (R, ψ), i.e. each node in R has at most one parent. Note that G is in general not tree-shaped in this case, unless I = J = 1. Further note that this sub-class of SPNs is still very expressive, and that many SPN learners, e.g. [10, 27], also restrict to it.

When the SPN follows a tree-shaped region graph, the scope function can be encoded as follows. Let P be any partition and R1, . . . , R|ch(P)| its children. For each data dimension d, we introduce a discrete latent variable YP,d with states 1, . . . , |ch(P)|. Intuitively, the latent variable YP,d represents a decision to assign dimension d to a particular child, given that all partitions "above" have decided to assign d onto the path leading to P (this path is unique since R is a tree). More formally, we define:

Definition 5 (Induced scope function). Let R be a tree-shaped region graph structure, let YP,d be defined as above, let Y = {YP,d}_{P∈R, d∈{1...D}}, and let y be any assignment of Y. Let Q denote any node in R, and let Π be the unique path from the root to Q (exclusive of Q). The scope function induced by y is defined as

ψy(Q) := { Xd | ∏_{P∈Π} 1[R_{yP,d} ∈ Π] = 1 },    (5)

i.e. ψy(Q) contains Xd if, for each partition in Π, the child indicated by yP,d is also in Π. It is easy to check that for any tree-shaped R and any y, the induced scope function ψy is a proper scope function according to Definition 4.
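The routing view of this encoding can be sketched as follows (illustrative Python with hypothetical data structures): each dimension starts at the root and is handed down, a region passes it to all its child partitions, and a partition P forwards it to the single child indicated by yP,d, so a node's scope is exactly the set of dimensions whose path reaches it.

```python
# Sketch of Definition 5 on a toy tree-shaped region graph (hypothetical names):
# regions map to their child partitions, partitions to their child regions, and
# y[(P, d)] selects which child of partition P dimension d is routed to.
regions = {"root": ["part"]}                   # leaf regions have no entry
partitions = {"part": ["R_left", "R_right"]}
D = 4
y = {("part", 0): 0, ("part", 1): 1, ("part", 2): 1, ("part", 3): 0}

def induced_scope(regions, partitions, y, D):
    """Route every dimension down from the root and collect each node's scope."""
    nodes = (list(regions) + list(partitions)
             + [r for kids in partitions.values() for r in kids])
    scope = {n: set() for n in nodes}

    def route(node, d):
        scope[node].add(d)
        if node in regions:                    # a region passes d to all its partitions
            for p in regions[node]:
                route(p, d)
        elif node in partitions:               # a partition routes d to ONE child
            route(partitions[node][y[(node, d)]], d)
        # otherwise: leaf region, routing stops

    for d in range(D):
        route("root", d)
    return scope

scopes = induced_scope(regions, partitions, y, D)
```

Completeness and decomposability hold by construction: every dimension reaches exactly one child of each partition on its path, so sibling regions always have disjoint scopes whose union is the partition's scope.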
Conversely, for any proper scope function according to Definition 4, there exists a y such that ψy ≡ ψ.2

We can now incorporate Y into our model. To this end, we assume a Dirichlet prior on the child-selection probabilities vP of each partition and extend the generative model (4) as follows:

wS | α ∼ Dir(wS | α)  ∀S,        zS,n | wS ∼ Cat(zS,n | wS)  ∀S ∀n,
vP | β ∼ Dir(vP | β)  ∀P,        yP,d | vP ∼ Cat(yP,d | vP)  ∀P ∀d,
θL | γ ∼ p(θL | γ)  ∀L,          xn | zn, y, θ ∼ ∏_{L∈T(zn)} L(xy,n | θL)  ∀n.    (6)

2Note that the relation between y and ψy is similar to the relation between z and T, i.e. each ψy corresponds in general to many y's. Also note that the encoding of ψ for general region graphs is more involved, since the path to each Q is not unique anymore, requiring consistency among the y.

Here, the notation xy,n denotes the evaluation of L on the scope induced by y. Figure 2 illustrates our generative model in plate notation, in which directed edges indicate dependencies between variables. Furthermore, our Bayesian formulation naturally allows for various nonparametric formulations of SPNs. In particular, one can use the stick-breaking construction [36] of a Dirichlet process mixture model with SPNs as mixture components. We illustrate this approach in the experiments.3

5 Sampling-based Inference

Let X = {xn}_{n=1}^N be a training set of N observations xn; we aim to draw posterior samples from our generative model given X. For this purpose, we perform Gibbs sampling, alternating between i) updating the parameters w, θ (fixed y), and ii) updating y (fixed w, θ).

Updating Parameters w, θ (fixed y). We follow the same procedure as in [43], i.e. in order to sample w and θ, we first sample assignments zn for all the sum latent variables Zn in the SPN, and subsequently sample new w and θ.
For a given set of parameters (w and θ), each zn can be drawn independently and follows standard SPN ancestral sampling. The latent variables not visited during ancestral sampling are drawn from the prior. After sampling all zn, the sum-weights are sampled from the respective Dirichlet posterior, i.e. Dir(α + cS,1, . . . , α + cS,KS), where cS,k = ∑_{n=1}^N 1[zS,n = k] denotes the number of observations assigned to child k. The parameters at the leaf nodes can be updated similarly; see [43] for further details.

Updating the Structure y (fixed w, θ). We use a collapsed Gibbs sampler to sample all yP,d assignments. For this purpose, we marginalise out v (c.f. the dependency structure in Figure 2) and sample yP,d from the respective conditional. Let yP denote the set of all dimension assignments at partition P, let yP,≠d denote the exclusion of d from yP, and let y\P,d denote the assignments of dimension d on all partitions except partition P. The conditional probability of assigning dimension d to child k under P is then

p(yP,d = k | yP,≠d, y\P,d, X, z, θ, β) ∝ p(yP,d = k | yP,≠d, β) p(X | yP,d = k, y\P,d, z, θ).    (7)

Note that the conditional prior in Equation 7 follows standard derivations, i.e. p(yP,d = k | yP,≠d, β) = (β + mP,k) / ∑_{j=1}^{|ch(P)|} (β + mP,j), where mP,k = ∑_{d'∈ψ(P)\{d}} 1[yP,d' = k] are component counts. The second term in Equation 7 is the product over the marginal likelihood terms of each product node in P.
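The rich-get-richer prior term in this update is easy to sketch (illustrative Python, hypothetical names): it counts how many other dimensions the partition currently routes to each child and smooths the counts with β.

```python
# Sketch of the collapsed conditional prior p(y_{P,d} = k | y_{P,-d}, beta):
# a Dirichlet-categorical "rich-get-richer" rule over a partition's children.
def conditional_prior(assignments, d, n_children, beta):
    """assignments: dict mapping dimension -> child index, for one partition P."""
    counts = [0] * n_children
    for dim, k in assignments.items():
        if dim != d:                      # exclude dimension d itself
            counts[k] += 1
    smoothed = [beta + m for m in counts]
    total = sum(smoothed)
    return [s / total for s in smoothed]

# Partition with 2 children; dims 0, 1, 2 currently on child 0, dim 3 on child 1.
y_P = {0: 0, 1: 0, 2: 0, 3: 1}
probs = conditional_prior(y_P, d=3, n_children=2, beta=0.5)
# Child 0 is favoured: (0.5 + 3) / 4 vs (0.5 + 0) / 4 -> [0.875, 0.125]
```

In the actual sampler this prior is multiplied by the marginal likelihood of routing dimension d to child k, so a child is chosen both for being popular and for explaining the data well.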
Intuitively, values of yP,d are more likely if other dimensions are assigned to the same child (rich-get-richer) and if the product of the marginal likelihoods of child k has low variance in d.

Given a set of T posterior samples, we can compute predictions for an unseen datum x∗ using an approximation of the posterior predictive distribution, i.e.

p(x∗ | X) ≈ (1/T) ∑_{t=1}^T S(x∗ | G, ψ_{y(t)}, w(t), θ(t)),    (8)

where S(x∗ | G, ψ_{y(t)}, w(t), θ(t)) denotes the SPN of the tth posterior sample, t = 1, . . . , T. Note that we can represent the resulting distribution as a single SPN with T children (sub-SPNs).

6 Experiments

We assessed the performance of our approach on discrete [10] and heterogeneous data [43], as well as on three datasets with missing values. We constructed G using the algorithm described in the supplement and used a grid search over the parameters of the graph. Further, we used 5·10³ burn-in steps and estimated the predictive distribution using 10⁴ samples from the posterior. Since the Bayesian framework is protected against overfitting, we combined the training and validation sets and followed classical Bayesian model selection [34], i.e. using the Bayesian model evidence. Note that within the validation loop, the computational graph remains fixed. We list details on the selected

3See https://github.com/trappmartin/BayesianSumProductNetworks for an implementation of Bayesian SPNs in the form of a Julia package, accompanied by the code and datasets used for the experiments.

Table 1: Average test log-likelihoods on discrete datasets using SOTA, Bayesian SPNs (ours) and infinite mixtures of SPNs (ours∞). Significant differences are underlined. Overall best result is in bold. 
In addition we list the best-to-date (BTD) results obtained using SPNs, PSDDs or CNets.\nDataset\nNLTCS\nMSNBC\nKDD\nPlants\nAudio\nJester\nNet\ufb02ix\nAccidents\nRetail\nPumsb-star\nDNA\nKosarak\nMSWeb\nBook\nEachMovie\nWebKB\nReuters-52\n20 Newsgrp\nBBC\nAD\n\nours\u221e\nBTD\nours\nID-SPN\nCCCP\n\u22126.00\n\u22125.97\n\u22126.02\n\u22126.03\n\u22126.02\n\u22126.03\n\u22126.03\n\u22126.06\n\u22126.05\n\u22126.04\n\u22122.12\n\u22122.11\n\u22122.13\n\u22122.13\n\u22122.13\n\u221212.87 \u221212.54\n\u221211.84\n\u221212.68\n\u221212.94\n\u221239.79 \u221239.77\n\u221239.39\n\u221240.02\n\u221239.79\n\u221252.86 \u221252.42\n\u221251.29\n\u221252.86\n\u221252.88\n\u221256.36 \u221256.31\n\u221255.71\n\u221256.78\n\u221256.80\n\u221227.70 \u221226.98\n\u221234.10\n\u221226.98\n\u221233.89\n\u221210.83 \u221210.83\n\u221210.72\n\u221210.92\n\u221210.85\n\u221224.23 \u221222.41\n\u221222.41\n\u221231.34\n\u221231.96\n\u221284.92 \u221281.21\n\u221281.07\n\u221292.84\n\u221292.95\n\u221210.88 \u221210.60\n\u221210.52\n\u221210.77\n\u221210.74\n\u22129.73\n\u22129.62\n\u22129.97\n\u22129.88\n\u22129.89\n\u221234.14 \u221234.13\n\u221234.14\n\u221235.01\n\u221234.34\n\u221251.66 \u221250.94\n\u221250.34\n\u221252.56\n\u221251.51\n\u2212157.49 \u2212151.84 \u2212156.02 \u2212157.33 \u2212149.20\n\u221284.63 \u221283.35\n\u221284.31\n\u221284.44\n\u221281.87\n\u2212153.21 \u2212151.47 \u2212151.99 \u2212151.95 \u2212151.02\n\u2212248.93 \u2212249.70 \u2212254.69 \u2212229.21\n\u221227.20 \u221219.05\n\u221214.00\n\n\u221263.80\n\nLearnSPN RAT-SPN\n\u22126.01\n\u22126.04\n\u22122.13\n\u221213.44\n\u221239.96\n\u221252.97\n\u221256.85\n\u221235.49\n\u221210.91\n\u221232.53\n\u221297.23\n\u221210.89\n\u221210.12\n\u221234.68\n\u221253.63\n\u2212157.53\n\u221287.37\n\u2212152.06\n\u2212252.14 
\u2212248.60\n\u221248.47\n\n\u22126.11\n\u22126.11\n\u22122.18\n\u221212.98\n\u221240.50\n\u221253.48\n\u221257.33\n\u221230.04\n\u221211.04\n\u221224.78\n\u221282.52\n\u221210.99\n\u221210.25\n\u221235.89\n\u221252.49\n\u2212158.20\n\u221285.07\n\u2212155.93\n\u2212250.69\n\u221219.73\n\n\u221263.80\n\nparameters and the runtime for each dataset in the supplement, c.f. Table 3. For posterior inference in\nin\ufb01nite mixtures of SPNs, we used the distributed slice sampler [8].\nTable 1 lists the test log-likelihood scores of state-of-the-art (SOTA) structure learners, i.e. LearnSPN\n[10], LearnSPN with parameter optimisation (CCCP) [46] and ID-SPN [35], random region-graphs\n(RAT-SPN) [27] and the results obtained using Bayesian SPNs (ours) and in\ufb01nite mixtures of Bayesian\nSPN (ours\u221e) on discrete datasets. In addition we list the best-to-date (BTD) results, collected based\non the most recent works on structure learning for SPNs [12], PSDDs [19] and CNets[21, 30]. Note\nthat the BTD results are often by large ensembles over structures. Signi\ufb01cant differences to the best\nSOTA approach under the Mann-Whitney-U-Test [20] with p < 0.01 are underlined. We refer to the\nsupplement for an extended results table and further details on the signi\ufb01cance tests. We see that\nBayesian SPNs and in\ufb01nite mixtures generally improve over LearnSPN and RAT-SPN. Further, in\nmany cases, we observe an improvement over LearnSPN with additional parameter learning and often\nobtain results comparable to ID-SPN or sometimes outperforms BTD results. Note that ID-SPN uses\na more expressive SPN formulation with Markov networks as leaves and also uses a sophisticated\nlearning algorithm.\nAdditionally, we conducted experiments on heterogeneous data, see: [22, 43], and compared against\nmixed SPNs (MSPN) [22] and ABDA [43]. We used mixtures over likelihood models as leaves,\nsimilar to Vergari et al. 
[43], and performed inference over the structure, parameters and likelihood models. Further details can be found in the supplement. Table 2 lists the test log-likelihood scores of all approaches, indicating that our approaches perform comparably to structure learners tailored to heterogeneous datasets and sometimes outperform SOTA.4 Interestingly, we obtain, by a large margin, better test scores for Autism, which might indicate that existing approaches overfit in this case, while our formulation naturally penalises complex models.

4 Note that we did not apply a significance test, as implementations of existing approaches are not available.

Table 2: Average test log-likelihoods on heterogeneous datasets using SOTA, Bayesian SPNs (ours) and infinite mixtures of SPNs (ours∞). Overall best result is indicated in bold.

[Table 2 body: per-dataset test log-likelihoods for MSPN, ABDA, ours and ours∞ on Abalone, Adult, Australian, Autism, Breast, Chess, Crx, Dermatology, Diabetes, German, Student and Wine.]

We compared the test log-likelihood of LearnSPN, ID-SPN and Bayesian SPNs against an increasing number of observations having 50% of their dimensions missing completely at random [28]. We evaluated LearnSPN and ID-SPN by i) removing all observations with missing values and ii) using k-nearest-neighbour imputation [2] (denoted with an asterisk). Note that we selected k-NN imputation because it arguably provides a stronger baseline than simple mean imputation (while being computationally more demanding). All methods have been trained using the full training set, i.e. training and validation set combined, and were evaluated using default parameters to ensure a fair comparison across methods and levels of missing values. See supplement Section A.3 for further details.

Figure 3: Performance under missing values for discrete datasets with increasing dimensionality (D): (a) EachMovie (D: 500, N: 5526), (b) WebKB (D: 839, N: 3361), (c) BBC (D: 1058, N: 1895). Results for LearnSPN are shown in dashed lines, results for ID-SPN in dotted lines, and our approach is indicated using solid lines. Star (⋆) indicates k-NN imputation while (◦) means no imputation. [Plots show test log-likelihood against the percentage of observations with missing values.]
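The robustness under missing data exploits the fact that SPNs marginalise missing dimensions exactly: at a univariate leaf, a missing value simply replaces the leaf density with 1. A minimal sketch of this, combined with the posterior-predictive average of Equation 8 — our own illustration with made-up toy parameters, not the paper's implementation:

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    """Univariate Gaussian density."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def spn_density(x, w, mus, sigmas):
    """Tiny two-layer SPN: one sum node over K product nodes, each a
    product of D univariate Gaussian leaves. NaN entries in x are
    treated as missing and marginalised out (leaf density -> 1)."""
    K, D = mus.shape
    comp = np.ones(K)
    for k in range(K):
        for d in range(D):
            if not np.isnan(x[d]):               # skip = marginalise missing dims
                comp[k] *= gauss_pdf(x[d], mus[k, d], sigmas[k, d])
    return float(w @ comp)

rng = np.random.default_rng(1)
T, K, D = 5, 3, 4
# stand-ins for T posterior samples of (w, theta)
samples = [(rng.dirichlet(np.ones(K)),
            rng.normal(size=(K, D)),
            np.ones((K, D))) for _ in range(T)]

x_star = np.array([0.5, np.nan, -1.0, np.nan])   # 50% of dimensions missing
# Equation 8: average over posterior samples, i.e. a single SPN
# with T sub-SPN children and uniform weights 1/T
p_pred = np.mean([spn_density(x_star, w, mu, sg) for w, mu, sg in samples])
```

A fully missing datum yields density 1, as all leaves are marginalised and the sum-weights of each sample sum to one; no imputation step is ever required.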
Figure 3 shows that our formulation is consistently robust against missing values, while SOTA approaches often suffer, sometimes even when additional imputation is used.

7 Conclusion

Structure learning is an important topic in SPNs, and many promising directions have been proposed in recent years. However, most of these approaches are based on intuition and refrain from declaring an explicit and global principle for structure learning. In this paper, our primary motivation is to change this practice. To this end, we phrase structure (and joint parameter) learning as Bayesian inference in a latent variable model. Our experiments show that this principled approach competes well with prior art and that we gain several benefits, such as automatic protection against overfitting, robustness under missing data and a natural extension to nonparametric formulations.

A critical insight for our approach is to decompose structure learning into two steps, namely constructing a computational graph and separately learning the SPN's scope function, which determines the "effective" structure of the SPN. We believe that this novel approach will be stimulating for future work. For example, while we used Bayesian inference over the scope function, it could also be optimised, e.g. with gradient-based techniques. Further, more sophisticated approaches to identify the computational graph, e.g. using AutoML techniques or neural architecture search (NAS) [47], could be fruitful directions.
The Bayesian framework presented in this paper allows several natural extensions, such as parameterising the scope function with hierarchical priors, using variational inference for large-scale approximate Bayesian inference, and relaxing the necessity of a given computational graph by incorporating nonparametric priors in all stages of the model formalism.

Acknowledgements

This work was partially funded by the Austrian Science Fund (FWF): I2706-N31 and has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie Grant Agreement No. 797223 — HYBSPN.

References

[1] T. Adel, D. Balduzzi, and A. Ghodsi. Learning the structure of sum-product networks via an SVD-based algorithm. In Proceedings of UAI, 2015.

[2] L. Beretta and A. Santaniello. Nearest neighbor imputation algorithms: a critical evaluation. BMC Medical Informatics and Decision Making, 16(3):74, 2016.

[3] C. J. Butz, J. S. Oliveira, A. E. dos Santos, and A. L. Teixeira. Deep convolutional sum-product networks. In Proceedings of AAAI, 2019.

[4] G. F. Cooper and E. Herskovits. A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9(4):309–347, 1992.

[5] A. W. Dennis and D. Ventura. Learning the architecture of sum-product networks using clustering on variables. In Proceedings of NeurIPS, pages 2042–2050, 2012.

[6] A. W. Dennis and D. Ventura. Greedy structure search for sum-product networks. In Proceedings of IJCAI, pages 932–938, 2015.

[7] N. Friedman and D. Koller. Being Bayesian about network structure: a Bayesian approach to structure discovery in Bayesian networks. Machine Learning, 50(1-2):95–125, 2003.

[8] H. Ge, Y. Chen, M. Wan, and Z. Ghahramani. Distributed inference for Dirichlet process mixture models. In Proceedings of ICML, pages 2276–2284, 2015.

[9] R. Gens and P. Domingos.
Discriminative learning of sum-product networks. In Proceedings of NeurIPS,\n\npages 3248\u20133256, 2012.\n\n[10] R. Gens and P. Domingos. Learning the structure of sum-product networks. Proceedings of ICML, pages\n\n873\u2013880, 2013.\n\n[11] D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian networks: The combination of\n\nknowledge and statistical data. Machine learning, 20(3):197\u2013243, 1995.\n\n[12] P. Jaini, A. Ghose, and P. Poupart. Prometheus: Directly learning acyclic directed graph structures for\n\nsum-product networks. In Proceedings of PGM, pages 181\u2013192, 2018.\n\n[13] A. Kalra, A. Rashwan, W.-S. Hsu, P. Poupart, P. Doshi, and G. Trimponias. Online structure learning for\nfeed-forward and recurrent sum-product networks. In Proceedings of NeurIPS, pages 6944\u20136954, 2018.\n\n[14] H. Kang, C. D. Yoo, and Y. Na. Maximum margin learning of t-SPNs for cell classi\ufb01cation with \ufb01ltered\n\ninput. IEEE Journal of Selected Topics in Signal Processing, 10(1):130\u2013139, 2016.\n\n[15] D. Kisa, G. Van den Broeck, A. Choi, and A. Darwiche. Probabilistic sentential decision diagrams. In\n\nProceedings of KR, 2014.\n\n[16] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.\n\n[17] S. Lee, C. Watkins, and B. Zhang. Non-parametric bayesian sum-product networks. In Proceedings of\n\nLTPM workshop at ICML, 2014.\n\n[18] S.-W. Lee, M.-O. Heo, and B.-T. Zhang. Online incremental structure learning of sum-product networks.\n\nIn Proceedings of NeurIPS, pages 220\u2013227, 2013.\n\n[19] Y. Liang, J. Bekker, and G. Van den Broeck. Learning the structure of probabilistic sentential decision\n\ndiagrams. In Proceedings of UAI, 2017.\n\n[20] H. B. Mann and D. R. Whitney. On a test of whether one of two random variables is stochastically larger\n\nthan the other. The annals of mathematical statistics, pages 50\u201360, 1947.\n\n[21] N. Di Mauro, A. Vergari, T. M. Altomare Basile, and F. 
Esposito. Fast and accurate density estimation with extremely randomized cutset networks. In Proceedings of ECML/PKDD, pages 203–219, 2017.

[22] A. Molina, A. Vergari, N. Di Mauro, S. Natarajan, F. Esposito, and K. Kersting. Mixed sum-product networks: A deep architecture for hybrid domains. In Proceedings of AAAI, 2018.

[23] R. Peharz, B. C. Geiger, and F. Pernkopf. Greedy part-wise learning of sum-product networks. In Proceedings of ECML/PKDD, pages 612–627, 2013.

[24] R. Peharz, R. Gens, and P. Domingos. Learning selective sum-product networks. In Proceedings of LTPM workshop at ICML, 2014.

[25] R. Peharz, R. Gens, F. Pernkopf, and P. Domingos. On the latent variable interpretation in sum-product networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(10):2030–2044, 2017.

[26] R. Peharz, S. Tschiatschek, F. Pernkopf, and P. Domingos. On theoretical properties of sum-product networks. In Proceedings of AISTATS, 2015.

[27] R. Peharz, A. Vergari, K. Stelzner, A. Molina, X. Shao, M. Trapp, K. Kersting, and Z. Ghahramani. Random sum-product networks: A simple but effective approach to probabilistic deep learning. In Proceedings of UAI, 2019.

[28] D. F. Polit and C. T. Beck. Nursing Research: Generating and Assessing Evidence for Nursing Practice. Lippincott Williams & Wilkins, 2008.

[29] H. Poon and P. Domingos. Sum-product networks: A new deep architecture. In Proceedings of UAI, pages 337–346, 2011.

[30] T. Rahman, S. Jin, and V. Gogate. Look ma, no latent variables: Accurate cutset networks via compilation. In Proceedings of ICML, pages 5311–5320, 2019.

[31] T. Rahman, P. Kothalkar, and V. Gogate. Cutset networks: A simple, tractable, and scalable approach for improving the accuracy of Chow-Liu trees. In Proceedings of ECML/PKDD, pages 630–645, 2014.

[32] A. Rashwan, P. Poupart, and C. Zhitang.
Discriminative training of sum-product networks by extended\n\nBaum-Welch. In Proceedings of PGM, pages 356\u2013367, 2018.\n\n[33] A. Rashwan, H. Zhao, and P. Poupart. Online and distributed Bayesian moment matching for parameter\n\nlearning in sum-product networks. In Proceedings of AISTATS, pages 1469\u20131477, 2016.\n\n[34] C. E. Rasmussen and Z. Ghahramani. Occam\u2019s Razor. In Proceedings of NeurIPS, pages 294\u2013300, 2001.\n\n[35] A. Rooshenas and D. Lowd. Learning sum-product networks with direct and indirect variable interactions.\n\nIn Proceedings of ICML, pages 710\u2013718, 2014.\n\n[36] J. Sethuraman. A constructive de\ufb01nition of Dirichlet priors. Statistica sinica, pages 639\u2013650, 1994.\n\n[37] O. Sharir, R. Tamari, N. Cohen, and A. Shashua. Tractable generative convolutional arithmetic circuits.\n\narXiv preprint arXiv:1610.04167, 2016.\n\n[38] J. Suzuki. A construction of Bayesian networks from databases based on an MDL principle. In Proceedings\n\nof UAI, pages 266\u2013273, 1993.\n\n[39] M. Trapp, T. Madl, R. Peharz, F. Pernkopf, and R. Trappl. Safe semi-supervised learning of sum-product\n\nnetworks. In Proceedings of UAI, 2017.\n\n[40] M. Trapp, R. Peharz, and F. Pernkopf. Optimisation of overparametrized sum-product networks. CoRR,\n\nabs/1905.08196, 2019.\n\n[41] M. Trapp, R. Peharz, M. Skowron, T. Madl, F. Pernkopf, and R. Trappl. Structure inference in sum-product\n\nnetworks using in\ufb01nite sum-product trees. In Proceedings of BNP workshop at NeurIPS, 2016.\n\n[42] A. Vergari, N. Di Mauro, and F. Esposito. Simplifying, regularizing and strengthening sum-product\n\nnetwork structure learning. In Proceedings of ECML/PKDD, pages 343\u2013358, 2015.\n\n[43] A. Vergari, A. Molina, R. Peharz, Z. Ghahramani, K. Kersting, and I. Valera. Automatic Bayesian density\n\nanalysis. In Proceedings of AAAI, 2019.\n\n[44] H. Zhao, T. Adel, G. J. Gordon, and B. Amos. 
Collapsed variational inference for sum-product networks. In Proceedings of ICML, pages 1310–1318, 2016.

[45] H. Zhao, M. Melibari, and P. Poupart. On the relationship between sum-product networks and Bayesian networks. In Proceedings of ICML, pages 116–124, 2015.

[46] H. Zhao, P. Poupart, and G. J. Gordon. A unified approach for learning the parameters of sum-product networks. In Proceedings of NeurIPS, pages 433–441, 2016.

[47] B. Zoph and Q. V. Le. Neural architecture search with reinforcement learning. In Proceedings of ICLR, 2017.