{"title": "Linear Time Computation of Moments in Sum-Product Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 6894, "page_last": 6903, "abstract": "Bayesian online algorithms for Sum-Product Networks (SPNs) need to update their posterior distribution after seeing one single additional instance. To do so, they must compute moments of the model parameters under this distribution. The best existing method for computing such moments scales quadratically in the size of the SPN, although it scales linearly for trees. This unfortunate scaling makes Bayesian online algorithms prohibitively expensive, except for small or tree-structured SPNs. We propose an optimal linear-time algorithm that works even when the SPN is a general directed acyclic graph (DAG), which significantly broadens the applicability of Bayesian online algorithms for SPNs. There are three key ingredients in the design and analysis of our algorithm: 1). For each edge in the graph, we construct a linear time reduction from the moment computation problem to a joint inference problem in SPNs. 2). Using the property that each SPN computes a multilinear polynomial, we give an efficient procedure for polynomial evaluation by differentiation without expanding the network that may contain exponentially many monomials. 3). We propose a dynamic programming method to further reduce the computation of the moments of all the edges in the graph from quadratic to linear. 
We demonstrate the usefulness of our linear time algorithm by applying it to develop a linear time assumed density filter (ADF) for SPNs.", "full_text": "Linear Time Computation of Moments in Sum-Product Networks\n\nHan Zhao\nMachine Learning Department\nCarnegie Mellon University\nPittsburgh, PA 15213\nhan.zhao@cs.cmu.edu\n\nGeoff Gordon\nMachine Learning Department\nCarnegie Mellon University\nPittsburgh, PA 15213\nggordon@cs.cmu.edu\n\nAbstract\n\nBayesian online algorithms for Sum-Product Networks (SPNs) need to update their posterior distribution after seeing a single additional instance. To do so, they must compute moments of the model parameters under this distribution. The best existing method for computing such moments scales quadratically in the size of the SPN, although it scales linearly for trees. This unfortunate scaling makes Bayesian online algorithms prohibitively expensive, except for small or tree-structured SPNs. We propose an optimal linear-time algorithm that works even when the SPN is a general directed acyclic graph (DAG), which significantly broadens the applicability of Bayesian online algorithms for SPNs. There are three key ingredients in the design and analysis of our algorithm: 1). For each edge in the graph, we construct a linear time reduction from the moment computation problem to a joint inference problem in SPNs. 2). Using the property that each SPN computes a multilinear polynomial, we give an efficient procedure for evaluating this polynomial by differentiation without expanding it, even though it may contain exponentially many monomials. 3). We propose a dynamic programming method to further reduce the computation of the moments of all the edges in the graph from quadratic to linear. 
We demonstrate the usefulness of our linear time algorithm by applying it to develop a linear time assumed density filter (ADF) for SPNs.\n\n1 Introduction\n\nSum-Product Networks (SPNs) have recently attracted interest because of their flexibility in modeling complex distributions as well as the tractability of performing exact marginal inference [11, 5, 6, 9, 16\u201318, 10]. They are general-purpose inference machines over which one can perform exact joint, marginal and conditional queries in time linear in the size of the network. It has been shown that discrete SPNs are equivalent to arithmetic circuits (ACs) [3, 8] in the sense that one can transform each SPN into an equivalent AC and vice versa in linear time and space with respect to the network size [13]. SPNs are also closely connected to probabilistic graphical models: by interpreting each sum node in the network as a hidden variable and each product node as a rule encoding context-specific conditional independence [1], every SPN can be equivalently converted into a Bayesian network where compact data structures are used to represent the local probability distributions [16]. This relationship characterizes the probabilistic semantics encoded by the network structure and allows practitioners to design principled and efficient parameter learning algorithms for SPNs [17, 18].\nMost existing batch learning algorithms for SPNs can be straightforwardly adapted to the online setting, where the network updates its parameters after it receives one instance at each time step. This online learning setting makes SPNs more widely applicable in various real-world scenarios, including the case where either the data set is too large to store at once, or the network needs to adapt to changes in the external data distribution. Recently Rashwan et al. 
[12] proposed an online Bayesian Moment Matching (BMM) algorithm to learn the probability distribution of the model parameters of SPNs based on the method of moments.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nLater Jaini et al. [7] extended this algorithm to the continuous case where the leaf nodes in the network are assumed to be Gaussian distributions. At a high level BMM can be understood as an instance of the general assumed density filtering framework [14] where the algorithm finds an approximate posterior distribution within a tractable family of distributions by the method of moments. Specifically, BMM for SPNs works by matching the first and second order moments of the approximate tractable posterior distribution to the exact but intractable posterior. An essential sub-routine of the above two algorithms [12, 7] is to efficiently compute the exact first and second order moments of the one-step update posterior distribution (cf. Sec. 3.2). Rashwan et al. [12] designed a recursive algorithm to achieve this goal in linear time when the underlying network structure is a tree, and this algorithm is also used by Jaini et al. [7] in the continuous case. However, the algorithm only works when the underlying network structure is a tree, and a naive computation of such moments in a DAG will scale quadratically w.r.t. the network size. Often this quadratic computation is prohibitively expensive even for SPNs of moderate size.\nIn this paper we propose a linear time (and space) algorithm that is able to compute any moments of all the network parameters simultaneously even when the underlying network structure is a DAG. There are three key ingredients in the design and analysis of our algorithm: 1). A linear time reduction from the moment computation problem to the joint inference problem in SPNs, 2). 
A succinct procedure for evaluating the network polynomial by differentiation without expanding it, and 3). A dynamic programming method to further reduce the quadratic computation to linear. The differential approach [3] used for polynomial evaluation can also be applied for exact inference in Bayesian networks. This technique has also been implicitly used in the recent development of a concave-convex procedure (CCCP) for optimizing the weights of SPNs [18]. Essentially, by reducing the moment computation problem to a joint inference problem in SPNs, we are able to exploit the fact that the network polynomial of an SPN computes a multilinear function of the model parameters, so we can efficiently evaluate this polynomial by differentiation even if the polynomial may contain exponentially many monomials, provided that the polynomial admits a tractable circuit representation. Dynamic programming can be further used to trade off a constant factor in space complexity (using two additional copies of the network) to reduce the quadratic time complexity to linear, so that all the edge moments can be computed simultaneously in two passes of the network. To demonstrate the usefulness of our linear time sub-routine for computing moments, we apply it to design an efficient assumed density filter [14] to learn the parameters of SPNs in an online fashion. ADF runs in linear time and space due to our efficient sub-routine. As an additional contribution, we also show that ADF and BMM can both be understood under a general framework of moment matching, where the only difference lies in the moments chosen to be matched and how to match the chosen moments.\n\n2 Preliminaries\n\nWe use [n] to abbreviate {1, 2, . . . 
, n}, and we reserve S to represent an SPN, and use |S| to mean the size of an SPN, i.e., the number of edges plus the number of nodes in the graph.\n\n2.1 Sum-Product Networks\n\nA sum-product network S is a computational circuit over a set of random variables X = {X1, . . . , Xn}. It is a rooted directed acyclic graph. The internal nodes of S are sums or products and the leaves are univariate distributions over Xi. In its simplest form, the leaves of S are indicator variables IX=x, which can also be understood as categorical distributions whose entire probability mass is on a single value. Edges from sum nodes are parameterized with positive weights. A sum node computes a weighted sum of its children, and a product node computes the product of its children. If we interpret each node in an SPN as a function of leaf nodes, then the scope of a node in an SPN is defined as the set of variables that appear in this function. More formally, for any node v in an SPN, if v is a terminal node, say, an indicator variable over X, then scope(v) = {X}, else scope(v) = \u222a_{\u02dcv \u2208 Ch(v)} scope(\u02dcv). An SPN is complete iff each sum node has children with the same scope, and is decomposable iff for every product node v, scope(vi) \u2229 scope(vj) = \u2205 for every pair (vi, vj) of children of v. It has been shown that every valid SPN can be converted into a complete and decomposable SPN with at most a quadratic increase in size [16] without changing the underlying distribution. As a result, in this work we assume that all the SPNs we discuss are complete and decomposable.\n\nLet x be an instantiation of the random vector X. 
We associate an unnormalized probability Vk(x; w) with each node k when the input to the network is x and the network weights are set to w:\n\nVk(x; w) = p(Xi = xi) if k is a leaf node over Xi; \u220f_{j \u2208 Ch(k)} Vj(x; w) if k is a product node; \u2211_{j \u2208 Ch(k)} wk,j Vj(x; w) if k is a sum node,  (1)\n\nwhere Ch(k) is the child list of node k in the graph and wk,j is the edge weight associated with sum node k and its child node j. The probability of a joint assignment X = x is computed by the value at the root of S with input x divided by a normalization constant Vroot(1; w): p(x) = Vroot(x; w)/Vroot(1; w), where Vroot(1; w) is the value of the root node when all the values of leaf nodes are set to 1. This essentially corresponds to marginalizing out the random vector X, which ensures that p(x) defines a proper probability distribution. Remarkably, all queries w.r.t. x, including joint, marginal, and conditional, can be answered in time linear in the size of the network.\n\n2.2 Bayesian Networks and Mixture Models\n\nWe provide two alternative interpretations of SPNs that will be useful later to design our linear time moment computation algorithm. The first one relates SPNs with Bayesian networks (BNs). Informally, any complete and decomposable SPN S over X = {X1, . . . , Xn} can be converted into a bipartite BN of size O(n|S|) [16]. In this construction, each internal sum node in S corresponds to one latent variable in the constructed BN, and each leaf distribution node corresponds to one observable variable in the BN. Furthermore, the constructed BN will be a simple bipartite graph with one layer of local latent variables pointing to one layer of observable variables X. An observable variable is a child of a local latent variable if and only if the observable variable appears as a descendant of the latent variable (sum node) in the original SPN. 
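The recursive evaluation in (1) is a single bottom-up pass. A minimal sketch on a toy SPN (the tuple encoding of nodes is ours, not the paper's; leaves are pre-evaluated to p(Xi = xi) for one fixed input x):

```python
# Hypothetical node encoding (illustrative, not from the paper):
#   ("leaf", p)                -- leaf value p(X_i = x_i) at the current input
#   ("prod", [child, ...])     -- product node
#   ("sum", [(w, child), ...]) -- sum node with positive edge weights w_{k,j}

def value(node):
    """Bottom-up evaluation of V_k(x; w) as in Eq. (1)."""
    kind = node[0]
    if kind == "leaf":
        return node[1]
    if kind == "prod":
        out = 1.0
        for child in node[1]:
            out *= value(child)
        return out
    if kind == "sum":
        return sum(w * value(child) for w, child in node[1])
    raise ValueError("unknown node type: %r" % (kind,))

# Toy SPN over two variables: a weighted mixture of two product distributions.
leaf = lambda p: ("leaf", p)
spn = ("sum", [
    (0.3, ("prod", [leaf(0.9), leaf(0.2)])),
    (0.7, ("prod", [leaf(0.4), leaf(0.6)])),
])
v_root = value(spn)  # 0.3 * 0.18 + 0.7 * 0.24 = 0.222
```

Re-running the same pass with every leaf value set to 1 yields the normalization constant Vroot(1; w).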
This means that the SPN S can be understood as a BN where the number of latent variables per instance is O(|S|).\nThe second perspective is to view an SPN S as a mixture model with exponentially many mixture components [4, 18]. More specifically, we can decompose each complete and decomposable SPN S into a sum of induced trees, where each tree corresponds to a product of univariate distributions. To proceed, we first formally define what we call induced trees:\nDefinition 1 (Induced tree SPN). Given a complete and decomposable SPN S over X = {X1, . . . , Xn}, T = (TV, TE) is called an induced tree SPN from S if 1). Root(S) \u2208 TV; 2). If v \u2208 TV is a sum node, then exactly one child of v in S is in TV, and the corresponding edge is in TE; 3). If v \u2208 TV is a product node, then all the children of v in S are in TV, and the corresponding edges are in TE.\nIt has been shown that Def. 1 produces subgraphs of S that are trees as long as the original SPN S is complete and decomposable [4, 18]. One useful result based on the concept of induced trees is:\nTheorem 1 ([18]). Let \u03c4S = Vroot(1; 1). Then \u03c4S counts the number of unique induced trees in S, and Vroot(x; w) can be written as \u2211_{t=1}^{\u03c4S} (\u220f_{(k,j) \u2208 TtE} wk,j) \u220f_{i=1}^{n} pt(Xi = xi), where Tt is the t-th unique induced tree of S and pt(Xi) is a univariate distribution over Xi in Tt as a leaf node.\nThm. 1 shows that \u03c4S = Vroot(1; 1) can also be computed efficiently by setting all the edge weights to 1. In general, counting problems are in the #P complexity class [15], and the fact that both probabilistic inference and the counting problem are tractable in SPNs implies that SPNs work on subsets of distributions that have succinct/efficient circuit representations. Without loss of generality, assuming that sum layers alternate with product layers in S, we have \u03c4S = \u03a9(2^{H(S)}), where H(S) is the height of S. 
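Thm. 1's counting identity can be checked mechanically: evaluating the circuit with every leaf value and every edge weight set to 1 returns the number of induced trees. A sketch on a toy network (the tuple encoding is ours):

```python
def count_induced_trees(node):
    """Evaluate V_root(1; 1): all leaves and edge weights set to 1 (Thm. 1)."""
    kind = node[0]
    if kind == "leaf":
        return 1                       # leaf value replaced by 1
    if kind == "prod":
        total = 1
        for child in node[1]:
            total *= count_induced_trees(child)
        return total
    # sum node: ignore the stored weights, i.e., set them all to 1
    return sum(count_induced_trees(child) for _, child in node[1])

# Toy complete and decomposable SPN: a root product over two sum nodes,
# each sum node choosing between two leaves of the same scope.
leaf = ("leaf", None)
sum1 = ("sum", [(0.5, leaf), (0.5, leaf)])   # over X1
sum2 = ("sum", [(0.2, leaf), (0.8, leaf)])   # over X2
root = ("prod", [sum1, sum2])
n_trees = count_induced_trees(root)  # 2 choices x 2 choices = 4 induced trees
```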
Hence the mixture model represented by S has a number of mixture components that is exponential in the height of S. Thm. 1 characterizes both the number of components and the form of each component in the mixture model, as well as their mixture weights. For the convenience of later discussion, we call Vroot(x; w) the network polynomial of S.\nCorollary 1. The network polynomial Vroot(x; w) is a multilinear function of w with positive coefficients on each monomial.\nCorollary 1 holds since each monomial corresponds to an induced tree and each edge appears at most once in the tree. This property will be crucial in our derivation of a linear time algorithm for moment computation in SPNs.\n\n3 Linear Time Exact Moment Computation\n\n3.1 Exact Posterior Has Exponentially Many Modes\n\nLet m be the number of sum nodes in S. Suppose we are given a fully factorized prior distribution p0(w; \u03b1) = \u220f_{k=1}^{m} p0(wk; \u03b1k) over w. It is worth pointing out that the fully factorized prior distribution is well justified by the bipartite graph structure of the equivalent BN we introduced in Section 2.2. We are interested in computing the moments of the posterior distribution after we receive one observation from the world. Essentially, this is the Bayesian online learning setting where we update the belief about the distribution of model parameters as we observe data from the world sequentially. Note that wk corresponds to the weight vector associated with sum node k, so wk is a vector that satisfies wk > 0 and 1^T wk = 1. 
Let us assume that the prior distribution for each wk is Dirichlet, i.e.,\n\np0(w; \u03b1) = \u220f_{k=1}^{m} Dir(wk; \u03b1k) = \u220f_{k=1}^{m} [ \u0393(\u2211_j \u03b1k,j) / \u220f_j \u0393(\u03b1k,j) ] \u220f_j wk,j^{\u03b1k,j \u2212 1}\n\nAfter observing one instance x, the exact posterior distribution is p(w | x) = p0(w; \u03b1) p(x | w) / p(x). Let Zx \u225c p(x) and note that the network polynomial also computes the likelihood p(x | w). Plugging the expressions for the prior distribution and the network polynomial into the above Bayes formula, we have\n\np(w | x) = (1/Zx) \u2211_{t=1}^{\u03c4S} (\u220f_{k=1}^{m} Dir(wk; \u03b1k)) (\u220f_{(k,j) \u2208 TtE} wk,j) \u220f_{i=1}^{n} pt(xi)\n\nSince the Dirichlet is a conjugate distribution to the multinomial, each term in the summation is an updated Dirichlet with a multiplicative constant. So the above equation shows that the exact posterior distribution becomes a mixture of \u03c4S Dirichlets after one observation. On a data set of D instances, the exact posterior will become a mixture of \u03c4S^D components, which is intractable to maintain since \u03c4S = \u03a9(2^{H(S)}).\nThe hardness of maintaining the exact posterior distribution calls for an approximate scheme where we can sequentially update our belief about the distribution while at the same time efficiently maintaining the approximation. Assumed density filtering [14] is such a framework: the algorithm chooses an approximate distribution from a tractable family of distributions after observing each instance. A typical choice is to match the moments of an approximation to the exact posterior.\n\n3.2 The Hardness of Computing Moments\n\nIn order to find an approximate distribution to match the moments of the exact posterior, we need to be able to compute those moments under the exact posterior. 
This is not a problem for traditional mixture models such as mixtures of Gaussians or latent Dirichlet allocation, since the number of mixture components in those models is assumed to be a small constant. However, this is not the case for SPNs, where the effective number of mixture components is \u03c4S = \u03a9(2^{H(S)}), which also depends on the input network S.\nTo simplify the notation, for each t \u2208 [\u03c4S], we define ct \u225c \u220f_{i=1}^{n} pt(xi)\u00b9 and ut \u225c \u222b_w p0(w) \u220f_{(k,j) \u2208 TtE} wk,j dw. That is, ct corresponds to the product of leaf distributions in the t-th induced tree Tt, and ut is the moment of \u220f_{(k,j) \u2208 TtE} wk,j, i.e., the product of tree edges, under the prior distribution p0(w). Realizing that the posterior distribution needs to satisfy the normalization constraint, we have:\n\n\u2211_{t=1}^{\u03c4S} ct \u222b_w p0(w) \u220f_{(k,j) \u2208 TtE} wk,j dw = \u2211_{t=1}^{\u03c4S} ct ut = Zx  (2)\n\nNote that the prior distribution for a sum node is a Dirichlet distribution. In this case we can compute a closed form expression for ut as:\n\nut = \u220f_{(k,j) \u2208 TtE} \u222b_{wk} p0(wk) wk,j dwk = \u220f_{(k,j) \u2208 TtE} E_{p0(wk)}[wk,j] = \u220f_{(k,j) \u2208 TtE} \u03b1k,j / \u2211_{j'} \u03b1k,j'  (3)\n\n\u00b9For ease of notation, we omit the explicit dependency of ct on the instance x.\n\nMore generally, let f(\u00b7) be a function applied to each edge weight in an SPN. We use the notation Mp(f) to mean the moment of function f evaluated under distribution p. 
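Under the Dirichlet prior, (3) says ut is just a product of Dirichlet means over the tree's sum edges. A small sketch (the alpha values and the edge list below are made-up illustrative numbers):

```python
def dirichlet_mean(alpha, j):
    # E[w_{k,j}] under Dir(w_k; alpha_k) = alpha_{k,j} / sum_{j'} alpha_{k,j'}
    return alpha[j] / sum(alpha)

# Hypothetical induced tree T_t touching two sum nodes k = 0 and k = 1:
alphas = {0: [2.0, 3.0], 1: [1.0, 1.0]}  # one Dirichlet per sum node
tree_edges = [(0, 1), (1, 0)]            # the edges (k, j) in T_tE

u_t = 1.0
for k, j in tree_edges:                  # Eq. (3): product of Dirichlet means
    u_t *= dirichlet_mean(alphas[k], j)
# u_t = (3/5) * (1/2) = 0.3
```

The product factorizes exactly because an induced tree uses at most one edge per sum node, so the integral splits over the independent Dirichlet factors.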
We are interested in computing Mp(f) where p = p(w | x), which we call the one-step update posterior distribution. More specifically, for each edge weight wk,j, we would like to compute the following quantity:\n\nMp(f(wk,j)) = \u222b_w f(wk,j) p(w | x) dw = (1/Zx) \u2211_{t=1}^{\u03c4S} ct \u222b_w p0(w) f(wk,j) \u220f_{(k',j') \u2208 TtE} wk',j' dw  (4)\n\nWe note that (4) is not trivial to compute as it involves \u03c4S = \u03a9(2^{H(S)}) terms. Furthermore, in order to conduct moment matching, we need to compute the above moment for each edge (k, j) emanating from a sum node. A naive computation will lead to a total time complexity of \u03a9(|S| \u00b7 2^{H(S)}). A linear time algorithm to compute these moments has been designed by Rashwan et al. [12] when the underlying structure of S is a tree. This algorithm recursively computes the moments in a top-down fashion along the tree. However, this algorithm breaks down when the graph is a DAG.\nIn what follows we will present an O(|S|) time and space algorithm that is able to compute all the moments simultaneously for general SPNs with DAG structures. We will first show a linear time reduction from the moment computation in (4) to a joint inference problem in S, and then proceed to use the differential trick to efficiently compute (4) for each edge in the graph. The final component will be a dynamic program that simultaneously computes (4) for all edges wk,j in the graph by trading constant factors of space complexity to reduce time complexity.\n\n3.3 Linear Time Reduction from Moment Computation to Joint Inference\n\nLet us first compute (4) for a fixed edge (k, j). Our strategy is to partition all the induced trees based on whether they contain the tree edge (k, j) or not. Define TF = {Tt | (k, j) \u2209 Tt, t \u2208 [\u03c4S]} and TT = {Tt | (k, j) \u2208 Tt, t \u2208 [\u03c4S]}. 
In other words, TF corresponds to the set of trees that do not contain edge (k, j) and TT corresponds to the set of trees that contain edge (k, j). Then,\n\nMp(f(wk,j)) = (1/Zx) \u2211_{Tt \u2208 TT} ct \u222b_w p0(w) f(wk,j) \u220f_{(k',j') \u2208 TtE} wk',j' dw + (1/Zx) \u2211_{Tt \u2208 TF} ct \u222b_w p0(w) f(wk,j) \u220f_{(k',j') \u2208 TtE} wk',j' dw  (5)\n\nFor the induced trees that contain edge (k, j), we have\n\n(1/Zx) \u2211_{Tt \u2208 TT} ct \u222b_w p0(w) f(wk,j) \u220f_{(k',j') \u2208 TtE} wk',j' dw = (1/Zx) \u2211_{Tt \u2208 TT} ct ut Mp'0,k(f(wk,j))  (6)\n\nwhere p'0,k is the one-step update posterior Dirichlet distribution for sum node k after absorbing the term wk,j. Similarly, for the induced trees that do not contain the edge (k, j):\n\n(1/Zx) \u2211_{Tt \u2208 TF} ct \u222b_w p0(w) f(wk,j) \u220f_{(k',j') \u2208 TtE} wk',j' dw = (1/Zx) \u2211_{Tt \u2208 TF} ct ut Mp0,k(f(wk,j))  (7)\n\nwhere p0,k is the prior Dirichlet distribution for sum node k. The above equation holds by changing the order of integration and realizing that since (k, j) is not in tree Tt, \u220f_{(k',j') \u2208 TtE} wk',j' does not contain the term wk,j. 
Note that both Mp0,k(f(wk,j)) and Mp'0,k(f(wk,j)) are independent of specific induced trees, so we can combine the above two parts to express Mp(f(wk,j)) as:\n\nMp(f(wk,j)) = ((1/Zx) \u2211_{Tt \u2208 TF} ct ut) Mp0,k(f(wk,j)) + ((1/Zx) \u2211_{Tt \u2208 TT} ct ut) Mp'0,k(f(wk,j))  (8)\n\nFrom (2) we have\n\n(1/Zx) \u2211_{t=1}^{\u03c4S} ct ut = 1 and \u2211_{Tt \u2208 TT} ct ut + \u2211_{Tt \u2208 TF} ct ut = \u2211_{t=1}^{\u03c4S} ct ut\n\nThis implies that Mp(f) is in fact a convex combination of Mp0,k(f) and Mp'0,k(f). In other words, since both Mp0,k(f) and Mp'0,k(f) can be computed in closed form for each edge (k, j), in order to compute (4) we only need to be able to compute the two coefficients efficiently. Recall that for each induced tree Tt, we have the expression of ut as ut = \u220f_{(k,j) \u2208 TtE} \u03b1k,j / \u2211_{j'} \u03b1k,j'. The term \u2211_{t=1}^{\u03c4S} ct ut can thus be expressed as:\n\n\u2211_{t=1}^{\u03c4S} ct ut = \u2211_{t=1}^{\u03c4S} \u220f_{(k,j) \u2208 TtE} (\u03b1k,j / \u2211_{j'} \u03b1k,j') \u220f_{i=1}^{n} pt(xi)  (9)\n\nThe key observation that allows us to find the linear time reduction lies in the fact that (9) shares exactly the same functional form as the network polynomial, with the only difference being the specification of edge weights in the network. The following lemma formalizes our argument.\nLemma 1. \u2211_{t=1}^{\u03c4S} ct ut can be computed in O(|S|) time and space in a bottom-up evaluation of S.\nProof. 
Compare the form of (9) to the network polynomial:\n\np(x | w) = Vroot(x; w) = \u2211_{t=1}^{\u03c4S} (\u220f_{(k,j) \u2208 TtE} wk,j) \u220f_{i=1}^{n} pt(xi)  (10)\n\nClearly (9) and (10) share the same functional form; the only difference lies in that the edge weight used in (9) is given by \u03b1k,j / \u2211_{j'} \u03b1k,j' while the edge weight used in (10) is given by wk,j, both of which are constrained to be positive and locally normalized. This means that in order to compute the value of (9), we can replace all the edge weights wk,j with \u03b1k,j / \u2211_{j'} \u03b1k,j', and a bottom-up evaluation pass of S will give us the desired result at the root of the network. The linear time and space complexity then follows from the linear time and space inference complexity of SPNs. \u220e\n\nIn other words, we reduce the original moment computation problem for edge (k, j) to a joint inference problem in S with a set of weights determined by \u03b1.\n\n3.4 Efficient Polynomial Evaluation by Differentiation\n\nTo evaluate (8), we also need to compute \u2211_{Tt \u2208 TT} ct ut efficiently, where the sum is over the subset of induced trees that contain edge (k, j). Again, due to the exponential lower bound on the number of unique induced trees, a brute force computation is infeasible in the worst case. The key observation is that we can use the differential trick to solve this problem by realizing that Zx = \u2211_{t=1}^{\u03c4S} ct ut is a multilinear function in \u03b1k,j / \u2211_{j'} \u03b1k,j', \u2200(k, j), and that it has a tractable circuit representation since it shares the same network structure with S.\nLemma 2. \u2211_{Tt \u2208 TT} ct ut = wk,j (\u2202 \u2211_{t=1}^{\u03c4S} ct ut / \u2202wk,j), and it can be computed in O(|S|) time and space in a top-down differentiation of S.\nProof. 
Define wk,j \u225c \u03b1k,j / \u2211_{j'} \u03b1k,j'. Then\n\n\u2211_{Tt \u2208 TT} ct ut = \u2211_{Tt \u2208 TT} (\u220f_{(k',j') \u2208 TtE} wk',j') \u220f_{i=1}^{n} pt(xi) = wk,j \u2211_{Tt \u2208 TT} (\u220f_{(k',j') \u2208 TtE, (k',j') \u2260 (k,j)} wk',j') \u220f_{i=1}^{n} pt(xi) + 0 \u00b7 \u2211_{Tt \u2208 TF} ct ut = wk,j ((\u2202/\u2202wk,j) \u2211_{Tt \u2208 TT} ct ut + (\u2202/\u2202wk,j) \u2211_{Tt \u2208 TF} ct ut) = wk,j (\u2202/\u2202wk,j) \u2211_{t=1}^{\u03c4S} ct ut\n\nwhere the second equality is by Corollary 1, i.e., the network polynomial is a multilinear function of wk,j, and the third equality holds because TF is the set of trees that do not contain wk,j. The last equality follows by simple algebraic transformations. In summary, the above lemma holds because the differential operator applied to a multilinear function acts as a selector for all the monomials containing a specific variable. Hence, \u2211_{Tt \u2208 TF} ct ut = \u2211_{t=1}^{\u03c4S} ct ut \u2212 \u2211_{Tt \u2208 TT} ct ut can also be computed. To show the linear time and space complexity, recall that the differentiation w.r.t. wk,j can be efficiently computed by back-propagation in a top-down pass of S once we have computed \u2211_{t=1}^{\u03c4S} ct ut in a bottom-up pass of S. \u220e\n\nRemark. The fact that we can compute the differentiation w.r.t. wk,j using the original circuit without expanding it underlies many recent advances in the algorithmic design of SPNs. Zhao et al. [18, 17] used the above differential trick to design a linear time collapsed variational algorithm and the concave-convex procedure for parameter estimation in SPNs. A different but related approach, where the differential operator is taken w.r.t. 
input indicators rather than model parameters, is applied in computing the marginal probability in Bayesian networks and junction trees [3, 8]. We finish this discussion by concluding that when the polynomial computed by the network is a multilinear function of the model parameters or input indicators (as in SPNs), the differential operator w.r.t. a variable can be used as an efficient way to compute the sum of the subset of monomials that contain the specific variable.\n\n3.5 Dynamic Programming: from Quadratic to Linear\n\nDefine Dk(x; w) = \u2202Vroot(x; w) / \u2202Vk(x; w). Then the differentiation term \u2202 \u2211_{t=1}^{\u03c4S} ct ut / \u2202wk,j in Lemma 2 can be computed via back-propagation in a top-down pass of the network as follows:\n\n\u2202 \u2211_{t=1}^{\u03c4S} ct ut / \u2202wk,j = (\u2202Vroot(x; w) / \u2202Vk(x; w)) (\u2202Vk(x; w) / \u2202wk,j) = Dk(x; w) Vj(x; w)  (11)\n\nLet \u03bbk,j = wk,j Vj(x; w) Dk(x; w) / Vroot(x; w) and fk,j = f(wk,j); then the final formula for computing the moment of edge weight wk,j under the one-step update posterior p is given by\n\nMp(fk,j) = (1 \u2212 \u03bbk,j) Mp0(fk,j) + \u03bbk,j Mp'0(fk,j)  (12)\n\nCorollary 2. For each edge (k, j), (8) can be computed in O(|S|) time and space.\nThe corollary simply follows from Lemma 1 and Lemma 2 together with the assumption that the moments under the prior have closed form solutions. By definition, we also have \u03bbk,j = \u2211_{Tt \u2208 TT} ct ut / Zx, hence 0 \u2264 \u03bbk,j \u2264 1, \u2200(k, j). This shows that \u03bbk,j computes the fraction of the network polynomial contributed by the induced trees that contain edge (k, j). Roughly speaking, this measures how important the contribution of a specific edge is to the whole network polynomial. As a result, we can interpret (12) as follows: the more important the edge is, the larger the portion of the moment that comes from the new observation. 
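Concretely, all the \u03bbk,j in (11)-(12) come out of one bottom-up pass for the values V and one top-down pass for the derivatives D. A sketch on a DAG-capable toy representation (node ids and layout are ours, not the paper's):

```python
# lambda_{k,j} = w_{k,j} V_j(x; w) D_k(x; w) / V_root(x; w) for every sum edge,
# via one forward (evaluation) and one backward (differentiation) pass.
nodes = {
    "l1": ("leaf", 0.9), "l2": ("leaf", 0.2),
    "l3": ("leaf", 0.4), "l4": ("leaf", 0.6),
    "p1": ("prod", ["l1", "l2"]),
    "p2": ("prod", ["l3", "l4"]),
    "root": ("sum", [(0.3, "p1"), (0.7, "p2")]),
}
order = ["l1", "l2", "l3", "l4", "p1", "p2", "root"]  # children before parents

V = {}
for nid in order:                        # bottom-up evaluation pass
    kind, payload = nodes[nid]
    if kind == "leaf":
        V[nid] = payload
    elif kind == "prod":
        v = 1.0
        for c in payload:
            v *= V[c]
        V[nid] = v
    else:
        V[nid] = sum(w * V[c] for w, c in payload)

D = {nid: 0.0 for nid in nodes}
D["root"] = 1.0
for nid in reversed(order):              # top-down differentiation pass
    kind, payload = nodes[nid]
    if kind == "prod":
        for c in payload:
            D[c] += D[nid] * V[nid] / V[c]   # assumes V[c] != 0, for brevity
    elif kind == "sum":
        for w, c in payload:
            D[c] += w * D[nid]

lam = {}
for nid in order:
    kind, payload = nodes[nid]
    if kind == "sum":
        for w, c in payload:
            lam[(nid, c)] = w * V[c] * D[nid] / V["root"]
```

For the root sum node D_root = 1, so its \u03bb values are just the posterior responsibilities of its children; in a DAG the same two passes serve every edge at once.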
We visualize our moment computation method for a single edge (k, j) in Fig. 1.\n\nFigure 1: The moment computation only needs three quantities: the forward evaluation value at node j, the backward differentiation value at node k, and the weight of edge (k, j).\n\nRemark. CCCP for SPNs was originally derived using a sequential convex relaxation technique, where in each iteration a concave surrogate function is constructed and optimized. The key update in each iteration of CCCP ([18], (7)) is given as follows: w'k,j \u221d wk,j Vj(x; w) Dk(x; w) / Vroot(x; w), where the R.H.S. is exactly the same as \u03bbk,j defined above. From this perspective, CCCP can also be understood as implicitly applying the differential trick to compute \u03bbk,j, i.e., the relative importance of edge (k, j), and then taking updates according to this importance measure.\nIn order to compute the moments of all the edge weights wk,j, a naive computation would scale as O(|S|^2) because there are O(|S|) edges in the graph and, from Cor. 2, each such computation takes O(|S|) time. The key observation that allows us to further reduce the complexity to linear comes from the structure of \u03bbk,j: \u03bbk,j only depends on three terms, i.e., the forward evaluation value Vj(x; w), the backward differentiation value Dk(x; w) and the original weight of the edge wk,j. This implies that we can use dynamic programming to cache both Vj(x; w) and Dk(x; w) in a bottom-up evaluation pass and a top-down differentiation pass, respectively. At a high level, we trade off a constant factor in space complexity (using two additional copies of the network) to reduce the quadratic time complexity to linear.\nTheorem 2. For all edges (k, j), (8) can be computed in O(|S|) time and space.\nProof. 
During the bottom-up evaluation pass, in order to compute the value $V_{\text{root}}(\mathbf{x}; \mathbf{w})$ at the root of $S$, we also obtain the value $V_j(\mathbf{x}; \mathbf{w})$ at every node $j$ in the graph. So instead of discarding these intermediate values, we cache them by allocating additional space at each node $j$. After one bottom-up evaluation pass of the network, we thus have all the $V_j(\mathbf{x}; \mathbf{w})$, at the cost of one additional copy of the network. Similarly, during the top-down differentiation pass, because of the chain rule, we also obtain all the intermediate $D_k(\mathbf{x}; \mathbf{w})$ at each node $k$; again, we cache them. Once we have both $V_j(\mathbf{x}; \mathbf{w})$ and $D_k(\mathbf{x}; \mathbf{w})$ for each edge $(k, j)$, we can obtain the moments of all the weighted edges in $S$ simultaneously from (12). Because the whole process requires only one bottom-up evaluation pass and one top-down differentiation pass of $S$, the time complexity is $2|S|$. Since we use two additional copies of $S$, the space complexity is $3|S|$. □

We summarize the linear time algorithm for moment computation in Alg. 1.

Algorithm 1 Linear Time Exact Moment Computation
Input: Prior $p_0(\mathbf{w} \mid \boldsymbol{\alpha})$, moment $f$, SPN $S$ and input $\mathbf{x}$.
Output: $M_p(f(w_{k,j})), \forall (k, j)$.
1: $w_{k,j} \leftarrow \alpha_{k,j} / \sum_{j'} \alpha_{k,j'}, \forall (k, j)$.
2: Compute $M_{p_0}(f(w_{k,j}))$ and $M_{p'_0}(f(w_{k,j})), \forall (k, j)$.
3: Bottom-up evaluation pass of $S$ with input $\mathbf{x}$. Record $V_k(\mathbf{x}; \mathbf{w})$ at each node $k$.
4: Top-down differentiation pass of $S$ with input $\mathbf{x}$. Record $D_k(\mathbf{x}; \mathbf{w})$ at each node $k$.
5: Compute the exact moment for each $(k, j)$: $M_p(f_{k,j}) = (1 - \lambda_{k,j}) M_{p_0}(f_{k,j}) + \lambda_{k,j} M_{p'_0}(f_{k,j})$.

4 Applications in Online Moment Matching

In this section we use Alg. 1 as a sub-routine to develop a new Bayesian online learning algorithm for SPNs based on assumed density filtering [14].
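The two-pass computation behind Alg. 1 and Theorem 2 can be sketched as follows. The toy three-leaf SPN, its dictionary encoding, and all numeric values here are hypothetical illustrations; only the bottom-up/top-down caching scheme mirrors the algorithm:

```python
import math

# Sketch of the two cached passes of Alg. 1 on a hypothetical toy SPN DAG.
# Nodes: ("sum", [(child, weight), ...]), ("prod", [children]), ("leaf", value).
# Toy SPN: a root sum over two products that SHARE leaf 4, so it is a DAG.
nodes = {
    0: ("sum",  [(1, 0.6), (2, 0.4)]),
    1: ("prod", [3, 4]),
    2: ("prod", [4, 5]),
    3: ("leaf", 0.9), 4: ("leaf", 0.8), 5: ("leaf", 0.7),
}
topo = [3, 4, 5, 1, 2, 0]          # children before parents

# Bottom-up evaluation pass: cache V at every node.
V = {}
for n in topo:
    kind, payload = nodes[n]
    if kind == "leaf":
        V[n] = payload
    elif kind == "prod":
        V[n] = math.prod(V[c] for c in payload)
    else:  # sum node
        V[n] = sum(w * V[c] for c, w in payload)

# Top-down differentiation pass: cache D_k = dV_root / dV_k at every node.
D = {n: 0.0 for n in nodes}
D[0] = 1.0                          # derivative of the root w.r.t. itself
for n in reversed(topo):
    kind, payload = nodes[n]
    if kind == "sum":
        for c, w in payload:
            D[c] += w * D[n]
    elif kind == "prod":
        for c in payload:
            D[c] += (V[n] / V[c]) * D[n]   # product of siblings (V[c] != 0)

# lambda_{k,j} for every sum edge, read off the two cached passes (eq. 12).
lambdas = {(k, j): w * V[j] * D[k] / V[0]
           for k, (kind, payload) in nodes.items() if kind == "sum"
           for j, w in payload}
```

Because leaf 4 is shared by both products, the same two passes cover the DAG case with no extra work; in this toy example the root's two edge ratios $\lambda_{0,1}$ and $\lambda_{0,2}$ sum to one, consistent with $\lambda$ being a normalized mixture of induced-tree contributions.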
To do so, we find an approximate distribution by minimizing the KL divergence between the one-step update posterior and the approximate distribution. Let $\mathcal{P} = \{q \mid q = \prod_{k=1}^m \text{Dir}(\mathbf{w}_k; \boldsymbol{\beta}_k)\}$, i.e., $\mathcal{P}$ is the space of products of Dirichlet densities that are decomposable over all the sum nodes in $S$. Note that since $p_0(\mathbf{w}; \boldsymbol{\alpha})$ is fully decomposable, we have $p_0 \in \mathcal{P}$. One natural choice is to find an approximate distribution $q \in \mathcal{P}$ that minimizes the KL divergence between $p(\mathbf{w} \mid \mathbf{x})$ and $q$, i.e.,

$$\hat{p} = \arg\min_{q \in \mathcal{P}} \text{KL}(p(\mathbf{w} \mid \mathbf{x}) \,\|\, q)$$

It is not hard to show that when $q$ is an exponential family distribution, which is the case in our setting, the minimization problem corresponds to solving the following moment matching equation:

$$\mathbb{E}_{p(\mathbf{w} \mid \mathbf{x})}[T(\mathbf{w}_k)] = \mathbb{E}_{q(\mathbf{w})}[T(\mathbf{w}_k)] \qquad (13)$$

where $T(\mathbf{w}_k)$ is the vector of sufficient statistics of $q(\mathbf{w}_k)$. When $q(\cdot)$ is a Dirichlet, we have $T(\mathbf{w}_k) = \log \mathbf{w}_k$, where the log is taken elementwise. This principle of finding an approximate distribution is also known as reverse information projection in the information theory literature [2]. As a comparison, information projection corresponds to minimizing $\text{KL}(q \,\|\, p(\mathbf{w} \mid \mathbf{x}))$ within the same family of distributions $q \in \mathcal{P}$. By utilizing our efficient linear time algorithm for exact moment computation, we propose a Bayesian online learning algorithm for SPNs based on the above moment matching principle, called assumed density filtering (ADF). The pseudocode is shown in Alg. 2.

In the ADF algorithm, for each edge weight $w_{k,j}$ the above moment matching equation amounts to solving

$$\psi(\beta_{k,j}) - \psi\Big(\sum_{j'} \beta_{k,j'}\Big) = \mathbb{E}_{p(\mathbf{w} \mid \mathbf{x})}[\log w_{k,j}]$$

where $\psi(\cdot)$ is the digamma function. This is a system of nonlinear equations in $\boldsymbol{\beta}$ whose R.H.S. can be computed using Alg. 1 in $O(|S|)$ time for all the edges $(k, j)$.
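For Dirichlet factors these right-hand sides have a closed form, since $\mathbb{E}_{\text{Dir}(\boldsymbol{\alpha})}[\log w_{k,j}] = \psi(\alpha_{k,j}) - \psi(\sum_{j'} \alpha_{k,j'})$, and (12) with $f = \log$ mixes the moments under $p_0$ and $p'_0$ (the prior with one pseudo-count added to $w_{k,j}$). The sketch below is an illustrative assumption of ours, not the paper's code; the hand-rolled `digamma` stands in for, e.g., `scipy.special.psi`:

```python
import math

def digamma(x):
    """Digamma via upward recurrence plus an asymptotic expansion (x > 0)."""
    r = 0.0
    while x < 6.0:          # psi(x) = psi(x + 1) - 1/x
        r -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    # psi(x) ~ ln x - 1/(2x) - 1/(12 x^2) + 1/(120 x^4) - 1/(252 x^6)
    return r + math.log(x) - 0.5 * inv \
             - inv2 * (1/12 - inv2 * (1/120 - inv2 / 252))

def rhs_log_moment(alpha_kj, alpha_sum, lam):
    """E_p[log w_kj] = (1 - lam) E_{p0}[log w] + lam E_{p0'}[log w], eq. (12).

    Under Dir(alpha), E[log w_kj] = psi(alpha_kj) - psi(sum_j' alpha_kj');
    p0' adds one pseudo-count to w_kj, shifting both digamma arguments by 1.
    """
    m_prior = digamma(alpha_kj) - digamma(alpha_sum)
    m_post = digamma(alpha_kj + 1.0) - digamma(alpha_sum + 1.0)
    return (1.0 - lam) * m_prior + lam * m_post
```

With $\lambda_{k,j}$ obtained from the two passes of Alg. 1, one such scalar value per edge fully specifies the nonlinear system above.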
To efficiently solve it, we take $\exp(\cdot)$ on both sides of the equation and approximate the L.H.S. using the fact that $\exp(\psi(\beta_{k,j})) \approx \beta_{k,j} - \frac{1}{2}$ for $\beta_{k,j} > 1$. Expanding the R.H.S. of the above equation using the identity from (12), we have:

$$\exp\Big(\psi(\beta_{k,j}) - \psi\Big(\sum_{j'} \beta_{k,j'}\Big)\Big) = \exp\big(\mathbb{E}_{p(\mathbf{w} \mid \mathbf{x})}[\log w_{k,j}]\big)$$
$$\Leftrightarrow\quad \frac{\beta_{k,j} - \frac{1}{2}}{\sum_{j'} \beta_{k,j'} - \frac{1}{2}} = \left(\frac{\alpha_{k,j} - \frac{1}{2}}{\sum_{j'} \alpha_{k,j'} - \frac{1}{2}}\right)^{1 - \lambda_{k,j}} \times \left(\frac{\alpha_{k,j} + \frac{1}{2}}{\sum_{j'} \alpha_{k,j'} + \frac{1}{2}}\right)^{\lambda_{k,j}} \qquad (14)$$

Note that $(\alpha_{k,j} - 0.5)/(\sum_{j'} \alpha_{k,j'} - 0.5)$ is approximately the mean of the prior Dirichlet $p_0$, and $(\alpha_{k,j} + 0.5)/(\sum_{j'} \alpha_{k,j'} + 0.5)$ is approximately the mean of $p'_0$, where $p'_0$ is the posterior obtained by adding one pseudo-count to $w_{k,j}$. So (14) essentially finds a posterior with hyperparameter $\boldsymbol{\beta}$ such that the posterior mean is approximately the weighted geometric mean of the means given by $p_0$ and $p'_0$, weighted by $\lambda_{k,j}$.

Instead of matching the moments given by the sufficient statistics, also known as the natural moments, BMM tries to find an approximate distribution $q$ by matching the first-order moments, i.e., the means of the prior and the one-step update posterior. Using the same notation, we want $q$ to match the following equation:

$$\mathbb{E}_{q(\mathbf{w})}[\mathbf{w}_k] = \mathbb{E}_{p(\mathbf{w} \mid \mathbf{x})}[\mathbf{w}_k] \quad\Leftrightarrow\quad \frac{\beta_{k,j}}{\sum_{j'} \beta_{k,j'}} = (1 - \lambda_{k,j})\, \frac{\alpha_{k,j}}{\sum_{j'} \alpha_{k,j'}} + \lambda_{k,j}\, \frac{\alpha_{k,j} + 1}{\sum_{j'} \alpha_{k,j'} + 1} \qquad (15)$$

Again, we can interpret the above equation as finding the posterior hyperparameter $\boldsymbol{\beta}$ such that the posterior mean is given by the weighted arithmetic mean of the means given by $p_0$ and $p'_0$, weighted by $\lambda_{k,j}$. Notice that due to the normalization constraint, we cannot solve for $\boldsymbol{\beta}$ directly from the above equations; to solve for $\boldsymbol{\beta}$ we would need one more equation added to the system. However, from line 1 of Alg. 1, what we need in the next iteration of the algorithm is not $\boldsymbol{\beta}$ itself, but only its normalized version. So we can drop the additional equation and use (15) directly as the update formula in our algorithm.

Using Alg. 1 as a sub-routine, both ADF and BMM enjoy linear running time, sharing the same order of time complexity as CCCP. However, since CCCP directly optimizes the data log-likelihood, in practice we observe that CCCP often outperforms ADF and BMM in log-likelihood scores.

Algorithm 2 Assumed Density Filtering for SPN
Input: Prior $p_0(\mathbf{w} \mid \boldsymbol{\alpha})$, SPN $S$ and inputs $\{\mathbf{x}_i\}_{i=1}^{\infty}$.
1: $p(\mathbf{w}) \leftarrow p_0(\mathbf{w} \mid \boldsymbol{\alpha})$
2: for $i = 1, \ldots, \infty$ do
3:   Apply Alg. 1 to compute $\mathbb{E}_{p(\mathbf{w} \mid \mathbf{x}_i)}[\log w_{k,j}]$ for all edges $(k, j)$.
4:   Find $\hat{p} = \arg\min_{q \in \mathcal{P}} \text{KL}(p(\mathbf{w} \mid \mathbf{x}_i) \,\|\, q)$ by solving the moment matching equation (13).
5:   $p(\mathbf{w}) \leftarrow \hat{p}(\mathbf{w})$.
6: end for

5 Conclusion

We propose an optimal linear time algorithm to efficiently compute the moments of model parameters in SPNs in online settings. The key techniques used in the design of our algorithm are the linear time reduction from moment computation to joint inference, the differential trick that efficiently evaluates a multilinear function without expanding it, and dynamic programming to further reduce redundant computations. Using the proposed algorithm as a sub-routine, we improve the time complexity of BMM from quadratic to linear on general SPNs with DAG structures.
We also use the proposed algorithm as a sub-routine to design a new online algorithm, ADF. As a future direction, we hope to apply the proposed moment computation algorithm in the design of efficient structure learning algorithms for SPNs. We also expect that the analysis techniques we develop may find other uses in learning SPNs.

Acknowledgements

HZ thanks Pascal Poupart for providing insightful comments. HZ and GG are supported in part by ONR award N000141512365.

References

[1] C. Boutilier, N. Friedman, M. Goldszmidt, and D. Koller. Context-specific independence in Bayesian networks. In Proceedings of the Twelfth International Conference on Uncertainty in Artificial Intelligence, pages 115–123. Morgan Kaufmann Publishers Inc., 1996.
[2] I. Csiszár and F. Matus. Information projections revisited. IEEE Transactions on Information Theory, 49(6):1474–1490, 2003.
[3] A. Darwiche. A differential approach to inference in Bayesian networks. Journal of the ACM (JACM), 50(3):280–305, 2003.
[4] A. Dennis and D. Ventura. Greedy structure search for sum-product networks. In International Joint Conference on Artificial Intelligence, volume 24, 2015.
[5] R. Gens and P. Domingos. Discriminative learning of sum-product networks. In Advances in Neural Information Processing Systems, pages 3248–3256, 2012.
[6] R. Gens and P. Domingos. Learning the structure of sum-product networks. In Proceedings of the 30th International Conference on Machine Learning, pages 873–880, 2013.
[7] P. Jaini, A. Rashwan, H. Zhao, Y. Liu, E. Banijamali, Z. Chen, and P. Poupart. Online algorithms for sum-product networks with continuous variables. In Proceedings of the Eighth International Conference on Probabilistic Graphical Models, pages 228–239, 2016.
[8] J. D. Park and A. Darwiche. A differential semantics for jointree algorithms. Artificial Intelligence, 156(2):197–216, 2004.
[9] R. Peharz, S. Tschiatschek, F. Pernkopf, and P. Domingos. On theoretical properties of sum-product networks. In AISTATS, 2015.
[10] R. Peharz, R. Gens, F. Pernkopf, and P. Domingos. On the latent variable interpretation in sum-product networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(10):2030–2044, 2017.
[11] H. Poon and P. Domingos. Sum-product networks: A new deep architecture. In Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence, pages 2551–2558, 2011.
[12] A. Rashwan, H. Zhao, and P. Poupart. Online and distributed Bayesian moment matching for parameter learning in sum-product networks. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, pages 1469–1477, 2016.
[13] A. Rooshenas and D. Lowd. Learning sum-product networks with direct and indirect variable interactions. In ICML, 2014.
[14] H. W. Sorenson and A. R. Stubberud. Non-linear filtering by approximation of the a posteriori density. International Journal of Control, 8(1):33–51, 1968.
[15] L. G. Valiant. The complexity of computing the permanent. Theoretical Computer Science, 8(2):189–201, 1979.
[16] H. Zhao, M. Melibari, and P. Poupart. On the relationship between sum-product networks and Bayesian networks. In ICML, 2015.
[17] H. Zhao, T. Adel, G. Gordon, and B. Amos. Collapsed variational inference for sum-product networks. In ICML, 2016.
[18] H. Zhao, P. Poupart, and G. Gordon. A unified approach for learning the parameters of sum-product networks. In NIPS, 2016.