{"title": "Online Structure Learning for Feed-Forward and Recurrent Sum-Product Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 6944, "page_last": 6954, "abstract": "Sum-product networks have recently emerged as an attractive representation due to their dual view as a special type of deep neural network with clear semantics and a special type of probabilistic graphical model for which inference is always tractable. Those properties follow from some conditions (i.e., completeness and decomposability) that must be respected by the structure of the network. As a result, it is not easy to specify a valid sum-product network by hand and therefore structure learning techniques are typically used in practice. This paper describes a new online structure learning technique for feed-forward and recurrent SPNs. The algorithm is demonstrated on real-world datasets with continuous features for which it is not clear what network architecture might be best, including sequence datasets of varying length.", "full_text": "Online Structure Learning for Feed-Forward and\n\nRecurrent Sum-Product Networks\n\nAgastya Kalra\u2217, Abdullah Rashwan\u2217, Wilson Hsu, Pascal Poupart\n\nCheriton School of Computer Science, Waterloo AI Institute, University of Waterloo, Canada\n\nagastya.kalra@gmail.com,{arashwan,wwhsu,ppoupart}@uwaterloo.ca\n\nVector Institute, Toronto, Canada\n\nPrashant Doshi\n\nDepartment of Computer Science\n\nUniversity of Georgia, USA\n\nGeorge Trimponias\n\nHuawei Noah\u2019s Ark Lab, Hong Kong\n\ng.trimponias@huawei.com\n\npdoshi@cs.uga.edu\n\nAbstract\n\nSum-product networks have recently emerged as an attractive representation due\nto their dual view as a special type of deep neural network with clear semantics\nand a special type of probabilistic graphical model for which marginal inference is\nalways tractable. 
These properties follow from the conditions of completeness and\ndecomposability, which must be respected by the structure of the network. As a\nresult, it is not easy to specify a valid sum-product network by hand and therefore\nstructure learning techniques are typically used in practice. This paper describes\na new online structure learning technique for feed-forward and recurrent SPNs.\nThe algorithm is demonstrated on real-world datasets with continuous features and\nsequence datasets of varying length for which the best network architecture is not\nobvious.\n\n1\n\nIntroduction\n\nSum-product networks (SPNs) were introduced as a new type of deep representation [13] equivalent\nto arithmetic circuits [3]. They distinguish themselves from other types of neural networks by\nseveral desirable properties: 1) The quantities computed by each node can be clearly interpreted as\n(un-normalized) probabilities. 2) SPNs can represent the same discrete distributions as Bayesian and\nMarkov networks [19] while ensuring that exact inference2 has linear complexity with respect to\nthe size of the network. 3) SPNs are generative models that naturally handle arbitrary queries with\nmissing data while allowing the inputs and outputs to vary.\nThere is a catch: these nice properties arise only when the structure of the network satis\ufb01es the\nconditions of decomposability and completeness [13]. Hence, it is not easy to specify sum-product\nnetworks by hand. In particular, fully connected networks typically violate those conditions. While\nthis may seem like a major drawback, the bene\ufb01t is that researchers have been forced to develop\nstructure learning techniques to obtain valid SPNs that satisfy those conditions [4, 7, 12, 9, 16, 1, 18,\n14, 10]. In deep learning, feature engineering has been replaced by architecture engineering, however\nthis is a tedious process that many practitioners would like to automate. 
Hence, there is a need for\nscalable structure learning techniques.\n\n\u2217 Equal contribution; \ufb01rst author was selected based on a coin \ufb02ip.\n2 Most types of inference are tractable except for marginal MAP inference, which is still intractable for SPNs.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nTo that effect, we propose a new online structure learning technique for feed-forward and recurrent\nSPNs [10]. The approach starts with a network structure that assumes that all features are independent.\nThis network structure is then updated as a stream of data points is processed. Whenever a non-negligible\ncorrelation is detected between some features, the network structure is updated to capture\nthis correlation. The approach is evaluated on several large benchmark datasets, including sequence\ndata of varying length.\n\n2 Background\n\nPoon and Domingos presented SPNs as a new type of deep architecture consisting of a rooted\nacyclic directed graph with interior nodes that are sums and products while the leaves are tractable\ndistributions, including Bernoulli distributions for discrete SPNs and Gaussian distributions for\ncontinuous SPNs. Each edge emanating from sum nodes is labeled with a non-negative weight w. An\nSPN encodes a function f (X = x) that takes as input a variable assignment X = x and produces an\noutput at its root. This function is de\ufb01ned recursively at each node i as follows:\n\n$$f_i(X = x) = \\begin{cases} \\Pr(X_i = x_i) & \\text{if } \\mathrm{isLeaf}(i) \\\\ \\sum_j w_j f_{\\mathrm{child}_j(i)}(x) & \\text{if } \\mathrm{isSum}(i) \\\\ \\prod_j f_{\\mathrm{child}_j(i)}(x) & \\text{if } \\mathrm{isProduct}(i) \\end{cases}$$\n\nHere, Xi = xi denotes the variable assignment restricted to the variables contained in the leaf i. If\nnone of the variables in leaf i are instantiated by X = x then Pr(Xi = xi) = Pr(\u2205) = 1. 
If leaf i\ncontains continuous variables, then Pr(Xi = xi) should be interpreted as pdf (Xi=xi).\nAn SPN is a neural network in the sense that each interior node can be interpreted as computing a\nlinear combination (for sum node) or non-linear combination (for product node) of its children. An\nSPN can also be viewed as encoding a joint distribution over the random variables in its leaves when\nthe network structure satis\ufb01es certain conditions. These conditions are often de\ufb01ned in terms of the\nnotion of scope.\nDe\ufb01nition 1 (Scope). The scope(i) of a node i is the set of variables that are descendants of i.\n\nA suf\ufb01cient set of conditions to ensure that the SPN encodes a valid joint distribution includes:\nDe\ufb01nition 2 (Completeness [2, 13]). An SPN is complete if all children of the same sum node have\nthe same scope.\nDe\ufb01nition 3 (Decomposability [2, 13]). An SPN is decomposable if all children of the same product\nnode have disjoint scopes.\n\nDecomposability allows us to interpret product nodes as computing factored distributions with respect\nto disjoint sets of variables, which ensures that the product is a valid distribution over the union of\nthe scopes of the children. Similarly, completeness allows us to interpret sum nodes as computing a\nmixture of the distributions encoded by the children since they all have the same scope. Each child is\na mixture component with mixture probability proportional to its weight. Hence, in complete and\ndecomposable SPNs, the sub-SPN rooted at each node can be interpreted as encoding an unnormalized\njoint distribution over its scope. 
We can use the function f to answer inference queries with respect\nto the joint distribution encoded by the entire SPN:\n\nMarginal queries: $$\\Pr(X = x) = \\frac{f_{\\mathrm{root}}(X = x)}{f_{\\mathrm{root}}(\\emptyset)}$$\n\nConditional queries: $$\\Pr(X = x \\mid Y = y) = \\frac{f_{\\mathrm{root}}(X = x, Y = y)}{f_{\\mathrm{root}}(Y = y)}$$\n\nUnlike most neural networks that can answer queries with \ufb01xed inputs and outputs only, SPNs can\nanswer conditional inference queries with varying inputs and outputs simply by changing the set of\nvariables that are queried (outputs) and conditioned on (inputs). Furthermore, SPNs can be used to\ngenerate data by sampling from the joint distributions they encode. This is achieved by a top-down\npass through the network. Starting at the root, each child of a product node is followed, a single child\nof a sum node is sampled according to the unnormalized distribution encoded by the weights of the\nsum node and a variable assignment is sampled in each leaf that is reached. This is particularly useful\nin natural language generation tasks and image completion tasks [13].\nNote also that inference queries other than marginal MAP can be answered exactly in linear time with\nrespect to the size of the network since each query requires two evaluations of the network function f\nand each evaluation is performed in a bottom-up pass through the network. This means that SPNs can\nalso be viewed as a special type of tractable probabilistic graphical model, in contrast to Bayesian and\nMarkov networks for which inference is #P-hard [17]. Any SPN can be converted into an equivalent\nbipartite Bayesian network without any exponential blow up, while Bayesian and Markov networks\ncan be converted into equivalent SPNs at the risk of an exponential blow up [19]. In practice, we do\nnot convert probabilistic graphical models (PGMs) into SPNs since we typically learn SPNs directly\nfrom data. The tractable nature of SPNs ensures that the resulting distribution permits exact tractable\ninference.\nMelibari et al. 
[10] proposed dynamic SPNs (a.k.a. recurrent SPNs) to model sequence data of\nvariable length. A recurrent SPN consists of a bottom network that feeds into a template network\n(repeated as many times as needed) that feeds into a top network. The template network describes\nthe recurrent part of the network. Inputs to the template network include data features and interface\nnodes with the earlier part of the network while the output consists of nodes that interface with the\nprevious and subsequent part of the network. Melibari et al. [10] describe an invariance property for\ntemplate networks that ensures that the resulting recurrent SPN encodes a valid distribution.\n\n2.1 Parameter Learning\n\nThe weights of an SPN are its parameters. They can be estimated by maximizing the likelihood\nof a dataset (generative training) [13] or the conditional likelihood of some output features given\nsome input features (discriminative training) by stochastic gradient descent (SGD) [6]. Since SPNs\nare generative probabilistic models where the sum nodes can be interpreted as hidden variables\nthat induce a mixture, the parameters can also be estimated by the Expectation Maximization\nschema (EM) [13, 11]. Zhao et al. [21] provides a unifying framework that explains how likelihood\nmaximization in SPNs corresponds to a signomial optimization problem where SGD is a \ufb01rst order\nprocedure, sequential monomial approximations are also possible and EM corresponds to a concave-\nconvex procedure that converges faster than other techniques. Since SPNs are deep architectures,\nSGD and EM suffer from vanishing updates and therefore \u201dhard\u201d variants have been proposed to\nremedy this problem [13, 6]. By replacing all sum nodes by max nodes in an SPN, we obtain\na max-product network where the gradient is constant (hard SGD) and latent variables become\ndeterministic (hard EM). It is also possible to train SPNs in an online fashion based on streaming\ndata [9, 15, 20, 8]. 
In particular, it was shown that online Bayesian moment matching [15, 8] and\nonline collapsed variational Bayes [20] perform much better than SGD and online EM.\n\n2.2 Structure Learning\n\nSince it is dif\ufb01cult to specify network structures for SPNs that satisfy the decomposability and\ncompleteness properties, several automated structure learning techniques have been proposed [4, 7,\n12, 9, 16, 1, 18, 14, 10]. The \ufb01rst two structure learning techniques [4, 7] are top down approaches\nthat alternate between instance clustering to construct sum nodes and variable partitioning to construct\nproduct nodes. We can also combine instance clustering and variable partitioning in one step with\na rank-one submatrix extraction by performing a singular value decomposition [1]. Alternatively,\nwe can learn the structure of SPNs in a bottom-up fashion by incrementally clustering correlated\nvariables [12]. These algorithms all learn SPNs with a tree structure and univariate leaves. It is\npossible to learn SPNs with multivariate leaves by using a hybrid technique that learns an SPN in a\ntop down fashion, but stops early and constructs multivariate leaves by \ufb01tting a tractable probabilistic\ngraphical model over the variables in each leaf [16, 18]. It is also possible to merge similar subtrees\ninto directed acyclic graphs in a post-processing step to reduce the size of the resulting SPN [14].\nIn the context of recurrent SPNs, Melibari et al. [10] describe a search-and-score structure learning\ntechnique that does a local search over the space of template network structures while using scoring\nbased on log likelihood computations.\nSo far, all these structure learning algorithms are batch techniques that assume that the full dataset\nis available and can be scanned multiple times. Lee et al. [9] describe an online structure learning\ntechnique that gradually grows a network structure based on mini-batches. 
The algorithm is a variant\nof LearnSPN [7] where the clustering step is modi\ufb01ed to use online clustering. As a result, sum\nnodes can be extended with more children when the algorithm encounters a mini-batch that exhibits\nadditional clusters. Product nodes are never modi\ufb01ed after their creation. This technique requires\nlarge mini-batches to detect the emergence of new clusters and it assumes \ufb01xed length data so it is\nunable to generate structures for recurrent SPNs.\nIn this paper, we describe the \ufb01rst online structure learning technique for feed-forward and recurrent\nSPNs. It is more accurate and it scales better than the of\ufb02ine search-and-score technique introduced\npreviously [10]. It also scales better than the technique that uses online clustering [9] while working\nwith small mini-batches and recurrent SPNs.\n\n3 Online Learning\n\nTo simplify the exposition, we assume that the leaf nodes have Gaussian distributions (though we\nshow results in the experiments with Bernoulli distributions and it is straightforward to generalize to\nother distributions). A leaf node may have more than one variable in its scope, in which case it follows\na multivariate Gaussian distribution. Suppose we want to model a probability distribution over a\nd-dimensional space. The algorithm starts with a fully factorized joint probability distribution over\nall variables, p(x) = p(x1, x2, . . . , xd) = p1(x1)p2(x2)\u00b7\u00b7\u00b7 pd(xd). This distribution is represented\nby a product node with d children, the ith of which is a univariate distribution over xi. Initially, we\nassume that the variables are independent, and the algorithm will update this probability distribution\nas new data points are processed.\nGiven a mini-batch of data points, the algorithm passes the points through the network from the root\nto the leaf nodes and updates each node along the way. 
This update includes two parts: i) updating\nthe parameters of the SPN, and ii) updating the structure of the network.\n\n3.1 Parameter update\n\nThere are two types of parameters in the model: weights on the branches under a sum node, and\nparameters for the Gaussian distribution in a leaf node. We use an online version of the hard\nEM algorithm to update the network parameters [13]. We prove that the algorithm monotonically\nimproves the likelihood of the last data point. We also extend it to work for Gaussian leaf nodes. The\npseudocode of this procedure (Alg. 1) is provided in the supplementary material.\nEvery node in the network has a count, nc, initialized to 1. When a data point is received, the\nlikelihood of this data point is computed at each node. Then the parameters of the network are\nupdated in a recursive top-down fashion by starting at the root node. When a sum node is traversed,\nits count is increased by 1 and the count of the child with the highest likelihood is increased by 1. In\na feed-forward network, the weight ws,c of a branch between a sum node s and one of its children\nc is estimated as ws,c = nc / ns, where ns is the count of the sum node and nc is the count of the child\nnode. We then recursively update the subtree of the child with the highest likelihood.\nWhen a product node is traversed, we recursively update the subtrees rooted at each of its children. For Gaussian leaf nodes,\nwe keep track of the empirical mean vector \u00b5 and covariance matrix \u03a3 for the variables in their scope.\nWhen a leaf node with a current count of n receives a batch of m data points x(1), x(2), . . . 
, x(m),\nthe empirical mean \u00b5 and covariance \u03a3 are updated according to the following equations:\n\n$$\\mu'_i = \\frac{1}{n+m} \\left( n\\mu_i + \\sum_{k=1}^{m} x^{(k)}_i \\right)$$\n\n$$\\Sigma'_{i,j} = \\frac{1}{n+m} \\left[ n\\Sigma_{i,j} + \\sum_{k=1}^{m} \\left( x^{(k)}_i - \\mu_i \\right) \\left( x^{(k)}_j - \\mu_j \\right) \\right] - (\\mu'_i - \\mu_i)(\\mu'_j - \\mu_j) \\qquad (1)$$\n\nwhere i and j index the variables in the leaf node\u2019s scope.\nThe update of these suf\ufb01cient statistics can be seen as locally maximizing the likelihood of the data.\nThe empirical mean and covariance of the Gaussian leaves locally increase the likelihood of the\ndata that reach that leaf. Similarly, the count ratios used to set the weights under a sum node locally\nincrease the likelihood of the data that reach each child. We prove this result below.\nTheorem 1. Let $\\theta_s$ be the set of parameters of an SPN s, and let $f_s(\\cdot \\mid \\theta_s)$ be the probability density\nfunction of the SPN. Given an observation x, suppose the parameters are updated to $\\theta'_s$ based on the\nrunning average update procedure, then $f_s(x \\mid \\theta'_s) \\ge f_s(x \\mid \\theta_s)$.\n\nProof. We will prove the theorem by induction. First suppose the SPN is just one leaf node. In\nthis case, the parameters are the empirical mean and covariance, which is the maximum likelihood\nestimator for a Gaussian distribution. Suppose $\\theta$ consists of the parameters learned using n data\npoints x(1), . . . , x(n), and $\\theta'$ consists of the parameters learned using the same n data points and an\nadditional observation x. Then we have\n\n$$f_s(x \\mid \\theta'_s) \\prod_{i=1}^{n} f_s(x^{(i)} \\mid \\theta'_s) \\ge f_s(x \\mid \\theta_s) \\prod_{i=1}^{n} f_s(x^{(i)} \\mid \\theta_s) \\ge f_s(x \\mid \\theta_s) \\prod_{i=1}^{n} f_s(x^{(i)} \\mid \\theta'_s) \\qquad (2)$$\n\nThus we get $f_s(x \\mid \\theta'_s) \\ge f_s(x \\mid \\theta_s)$. Suppose we have an SPN s where each child SPN t satis\ufb01es the\nproperty $f_t(x \\mid \\theta'_t) \\ge f_t(x \\mid \\theta_t)$. If the root of s is a product node, then $f_s(x \\mid \\theta'_s) = \\prod_t f_t(x \\mid \\theta'_t) \\ge \\prod_t f_t(x \\mid \\theta_t) = f_s(x \\mid \\theta_s)$. Now suppose the root of s is a sum node. Let $n_t$ be the count of child t, let n be the count of the sum node,\nand let $u = \\arg\\max_t f_t(x \\mid \\theta_t)$ be the child with the highest likelihood. Then we have\n\n$$\\begin{aligned} f_s(x \\mid \\theta'_s) &= \\frac{1}{n+1} \\left( f_u(x \\mid \\theta'_u) + \\sum_t n_t f_t(x \\mid \\theta'_t) \\right) \\\\ &\\ge \\frac{1}{n+1} \\left( f_u(x \\mid \\theta_u) + \\sum_t n_t f_t(x \\mid \\theta_t) \\right) \\\\ &\\ge \\frac{1}{n+1} \\left( \\sum_t \\frac{n_t}{n} f_t(x \\mid \\theta_t) + \\sum_t n_t f_t(x \\mid \\theta_t) \\right) \\\\ &= \\frac{1}{n} \\sum_t n_t f_t(x \\mid \\theta_t) = f_s(x \\mid \\theta_s) \\end{aligned}$$\n\n3.2 Structure update\n\nThe simple online parameter learning technique described above can be easily extended to enable\nonline structure learning. In the supplementary material, Alg. 2 describes the pseudocode of the\nresulting procedure called oSLRAU (online Structure Learning with Running Average Update).\nSimilar to leaf nodes, each product node also keeps track of the empirical mean vector and empirical\ncovariance matrix of the variables in its scope. These are updated in the same way as the leaf nodes.\nInitially, when a product node is created using traditional structure learning, all variables in the scope\nare assumed independent (see Alg. 3 in the supplementary material). 
As new data points arrive at a\nproduct node, the covariance matrix is updated, and if the absolute value of the Pearson correlation\ncoef\ufb01cient between two variables is above a certain threshold, the algorithm updates the structure so\nthat the two variables become correlated in the model.\n\nFigure 1: Depiction of how correlations between variables are introduced. Left: original product node with\nthree children. Middle: combine Child1 and Child2 into a multivariate leaf node (Alg. 4). Right: create a mixture\nto model the correlation (Alg. 5).\n\nWe correlate two variables in the model by combining the child nodes whose scopes contain the\ntwo variables. The algorithm employs two approaches to combine the two child nodes: a) create\na multivariate leaf node (Alg. 4 in the supplementary material), or b) create a mixture of two\ncomponents over the variables (Alg. 5 in the supplementary material). These two processes are\ndepicted in Figure 1. On the left, a product node with scope x1, . . . , x5 originally has three children.\nThe product node keeps track of the empirical mean and covariance for these \ufb01ve variables. Suppose\nit receives a mini-batch of data and updates the statistics. As a result of this update, x1 and x3 now\nhave a correlation above the threshold. In the middle of Figure 1, the algorithm combines the two\nchild nodes that have x1 and x3 in their scope, and turns them into a multivariate leaf node. Since the\nproduct node already keeps track of the mean and covariance of these variables, we can simply use\nthose statistics as the parameters for the new leaf node.\nAnother way to correlate x1 and x3 is to create a mixture, as shown in Figure 1(right). The mixture\nhas two components. The \ufb01rst contains the original children of the product node that contain x1 and\nx3. The second component is a new product node, which is again initialized to have a fully factorized\ndistribution over its scope (see Alg. 
3 in the supplementary material). The mini-batch of data points\nare then passed down the new mixture to update its parameters. Although the children are drawn like\nleaf nodes in the diagrams, they can in fact be entire subtrees. Since the process does not involve the\nparameters of a child, it works the same way if some of the children are trees instead of single nodes.\nThe technique chosen to induce a correlation depends on the number of variables in the scope. The\nalgorithm creates a multivariate leaf node when the combined scope of the two children has a number\nof variables that does not exceed some threshold and if the total number of variables in the problem\nis greater than this threshold, otherwise it creates a mixture. Since the number of parameters in\nmultivariate Gaussian leaves grows at a quadratic rate with respect to the number of variables, it is not\nadvised to consider multivariate leaves with too many variables. In contrast, the mixture construction\nincreases the number of parameters at a linear rate.\nTo simplify the structure, if a product node ends up with only one child, it is removed from the\nnetwork, and its only child is joined with its parent. Similarly, if a sum node ends up being a child of\nanother sum node, then the child sum node can be removed, and all its children are promoted one\nlayer up. We also prune subtrees periodically when the count at a node does not increase for several\nmini-batches. This helps to prevent over\ufb01tting and to adapt to changes in non-stationary settings.\nNote that this structure learning technique does a single pass through the data and therefore is entirely\nonline. The time and space complexity of updating the structure after each data point is linear in the\nsize of the network (i.e., # of edges) and quadratic in the number of features (since product nodes\nstore a covariance matrix that is quadratic in the size of their scope). 
The algorithm also ensures that\nthe decomposability and completeness properties are preserved after each update.\n\n3.3 Updates in Recurrent Networks\n\nWe can generalize the parameter and structure updates described in the previous sections to handle\nrecurrent SPNs as follows. We start with a bottom network that has k fully factored distributions.\nThe template network initially has k interface input product nodes, an intermediate layer of k sum\nnodes and an output interface layer of k product nodes. Fig. 2(top) shows an initial template network\nwhen k = 2. The top network consists of a single sum node linked to the output interface layer of the\ntemplate network. For the parameter updates, we unroll the recurrent SPN by creating as many copies\nof the template network as needed to match the length of a data sequence. Fig. 2(bottom) shows an\nunrolled recurrent SPN over 3 time steps. We use a single shared count for each node of the template\nnetwork even though template nodes are replicated multiple times. A shared count is incremented\neach time a data sequence goes through its associated node in any copy of the template network.\nSimilarly, the empirical mean and covariance of each leaf in the template network are shared across\nall copies of the template network.\n\nFigure 2: Top: A generic template network with interface nodes drawn in red and leaf distributions drawn in\nblue. Bottom: A recurrent SPN unrolled over 3 time steps.\n\nStructure updates in recurrent networks can also be done by detecting correlations between pairs of\nvariables that are not already captured by the network. A single shared covariance matrix is estimated\nat each product node of the template network. 
To circumvent the fact that the scope of a product node\nwill differ in each copy of the template network, we relabel the scope of each input interface node to\na single unique binary latent variable that takes value 1 when a data sequence traverses this node and\n0 otherwise. These latent binary variables can be thought of as summarizing the information below\nthe input interface nodes of each copy of the template network. This ensures that the variables in each\ncopy of the template network are equivalent and therefore we can maintain a shared covariance matrix\nat each product node of the template network. When a signi\ufb01cant correlation is detected between the\nvariables in the scope of two different children of a product node, a mixture is introduced as depicted\nin the right part of Fig. 1.\n\n4 Experiments\n\nWe compare the performance of oSLRAU with other methods on both simple and larger data sets\nwith continuous variables. We begin this section by describing the data sets.\n\n4.1 Synthetic Data\n\nAs a proof of concept, we test the algorithm on a synthetic dataset. We generate data from a\n3-dimensional distribution\n\np(x1, x2, x3) = [0.25N (x1|1, 1)N (x2|2, 2) + 0.25N (x1|11, 1)N (x2|12, 2)\n\n+ 0.25N (x1|21, 1)N (x2|22, 2) + 0.25N (x1|31, 1)N (x2|32, 2)]N (x3|3, 3)\n\nwhere N (\u00b7|\u00b5, \u03c32) is the normal distribution with mean \u00b5 and variance \u03c32. Therefore, the \ufb01rst two\ndimensions x1 and x2 are generated from a Gaussian mixture with four components, and x3 is\nindependent of the other two variables.\n\n(a)\n\n(b)\n\nFigure 3: Learning the structure from the toy dataset using univariate leaf nodes after (a) 200 data points and (b)\n500 data points. Blue dots are the data points from the toy dataset, and the red ellipses show diagonal Gaussian\ncomponents learned.\n\nStarting from a fully factorized distribution, we would expect x3 to remain factorized after learning\nfrom data. 
Furthermore, the algorithm should generate new components along the \ufb01rst two dimensions\nas more data points are received since x1 and x2 are correlated. This is indeed observed in Figures 3a\nand 3b, which show the structure learned after 200 and 500 data points. The variable x3 remains\nfactorized regardless of the number of data points seen, whereas more components are created for x1\nand x2 as more data points are processed. Bottom charts in Figures 3a and 3b show the data points\nalong the \ufb01rst two dimensions and the Gaussian components learned. We observe that the algorithm\ngenerates new components to model the correlation between x1 and x2 as it processes more data.\n\n4.2 Large Continuous Datasets\n\nWe also tested oSLRAU\u2019s combined parameter and structure updates on large real-world datasets with\ncontinuous features (see supplementary material for details about each dataset). Table 1 compares the\naverage log-likelihood of oSLRAU to that of randomly generated networks and a modi\ufb01ed version of\nILSPN [9] that we adapted to Gaussian SPNs. For a fair comparison we generated random networks\nthat are at least as large as the networks obtained by oSLRAU. Observe that oSLRAU achieves\nhigher log-likelihood than random networks since it effectively discovers empirical correlations and\ngenerates a structure that captures those correlations. ILSPN ran out of memory for 3 problems where\nit generated networks of more than 7.5 Gb. It underperformed oSLRAU on two other problems since\nit never modi\ufb01es its product nodes after creation and its online clustering technique is not suitable for\nstreaming data as it requires fairly large batch sizes to create new clusters.\n\nTable 1: Average log-likelihood scores with standard error on large real-world data sets. The best results among\nthe online techniques (random, ILSPN, oSLRAU and RealNVP online) are highlighted in bold. 
Results for\nRealNVP of\ufb02ine are also included for comparison purposes. \u201c\u2014\u201d indicates that ILSPN exceeded the memory\nlimit of 7.5 Gb.\n\nDataset | Random | ILSPN | oSLRAU | RealNVP Online | RealNVP Of\ufb02ine\nVoxforge | -33.9 \u00b1 0.3 | \u2014 | -29.6 \u00b1 0.0 | -169.0 \u00b1 0.6 | -168.2 \u00b1 0.8\nPower | -2.83 \u00b1 0.13 | -1.85 \u00b1 0.02 | -2.46 \u00b1 0.11 | -18.70 \u00b1 0.19 | -17.85 \u00b1 0.22\nNetwork | -5.34 \u00b1 0.03 | -4.71 \u00b1 0.16 | -4.27 \u00b1 0.04 | -10.80 \u00b1 0.02 | -7.89 \u00b1 0.05\nGasSen | -114 \u00b1 2 | \u2014 | -102 \u00b1 4 | -748 \u00b1 99 | -443 \u00b1 64\nMSD | -538.8 \u00b1 0.7 | \u2014 | -531.4 \u00b1 0.3 | -362.4 \u00b1 0.4 | -257.1 \u00b1 2.03\nGasSenH | -21.5 \u00b1 1.3 | -182.3 \u00b1 4.5 | -15.6 \u00b1 1.2 | -44.5 \u00b1 0.1 | 44.2 \u00b1 0.1\n\nTable 2: Large datasets: comparison of oSLRAU with and without periodic pruning.\n\nDataset | log-likelihood (no pruning) | log-likelihood (pruning) | time in sec (no pruning) | time in sec (pruning) | SPN size in # nodes (no pruning) | SPN size in # nodes (pruning)\nPower | -2.46 \u00b1 0.11 | -2.40 \u00b1 0.18 | 183 | 39 | 23360 | 5330\nNetwork | -4.27 \u00b1 0.02 | -4.20 \u00b1 0.09 | 14 | 12 | 7214 | 5739\nGasSen | -102 \u00b1 4 | -130 \u00b1 3 | 351 | 276 | 5057 | 1749\nMSD | -527.7 \u00b1 0.28 | -526.8 \u00b1 0.27 | 74 | 72 | 1442 | 1395\nGasSenH | -15.6 \u00b1 1.2 | -17.7 \u00b1 1.58 | 12 | 10 | 920 | 467\n\nWe also compare oSLRAU to a publicly available implementation of RealNVP, which is a different\ntype of generative neural network used for density estimation [5]. Since the benchmarks include\na variety of problems from different domains and it is not clear which network architecture would\nwork best, we used a default 2-hidden-layer fully connected network. The two layers have the same\nsize. For a fair comparison, we used a number of nodes per layer that yields approximately the same\nnumber of parameters as the SPNs. 
Training was done by stochastic gradient descent in TensorFlow\nwith a step size of 0.01 and mini-batch sizes that vary from 100 to 1500 depending on the size of\nthe dataset. We report the results for online learning (single iteration) and of\ufb02ine learning (when\nvalidation loss stops decreasing). In this experiment, the correlation threshold was kept constant at\n0.1. To determine the maximum number of variables in multivariate leaves, we utilized the following\nrule: at most one variable per leaf if the problem has 3 features or less and then increase the maximum\nnumber of variables per leaf up to 4 depending on the number of features. Further analysis on\nthe effects of varying the maximum number of variables per leaf is included in the supplementary\nmaterial. oSLRAU outperformed RealNVP on 5 of the 6 datasets. This can be explained by the\nfact that oSLRAU learns a structure that is suited for each problem while RealNVP does not learn\nany structure. Note that RealNVP may yield better results by using a different architecture than the\ndefault of 2-hidden layers, however in the absence of domain knowledge this is dif\ufb01cult. Furthermore,\nonline learning with streaming data precludes an of\ufb02ine search over some hyperparameters such as\nthe number of layers and nodes in order to re\ufb01ne the architecture. Hence, the results presented in\nTable 1 highlight the importance of an online learning technique such as oSLRAU to obtain a suitable\nnetwork structure with streaming data in the absence of domain knowledge.\nTable 2 reports the training time (seconds) and the size (# of nodes) of the SPNs constructed for each\ndataset by oSLRAU with and without periodic pruning. After every 1% of a dataset is processed,\nsubtrees that have not been updated in the last percent of the dataset are pruned. This helps to mitigate\nover\ufb01tting while decreasing the size of the SPNs. 
The experiments were carried out on an Amazon\nc4.xlarge machine with 4 vCPUs (high-frequency Intel Xeon E5-2666 v3 Haswell processors) and\n7.5 GB of RAM. The times are short since oSLRAU does a single pass through the data.\nAdditional experiments are included in the supplementary material to evaluate the effect of the\nhyperparameters. Additional empirical comparisons between oSLRAU and other techniques are also\npresented in the supplementary material.\n\n4.3 Nonstationary Generative Learning\n\nWe evaluate the effectiveness of the periodic pruning technique to adapt to changes in a nonstationary\nenvironment by feeding oSLRAU a stream of 50,000 images from the MNIST dataset ordered by\ntheir label from 0 to 9. The bottom row of Fig. 4 shows a sample of images generated by the SPN\n(14,000 nodes) constructed by oSLRAU with pruning after every 6000 images. As the last images in the\nstream are 8s and 9s, oSLRAU pruned parts of its network related to other digits and it generated mostly\n9\u2019s. When pruning is disabled, the top row of Fig. 4 shows that the SPN (17,000 nodes) constructed by\noSLRAU can generate a mixture of digits as it learned to generate all digits.\n\nFigure 4: Top row: sample images generated by SPN learned by oSLRAU without pruning. Bottom: sample\nimages generated by SPN learned by oSLRAU with pruning every 6000 images.\n\n4.4 Sequence Data\n\nWe also tested oSLRAU\u2019s ability to learn the structure of recurrent SPNs. Table 3 reports the average\nlog likelihood based on 10-fold cross validation with 5 sequence datasets. The number of sequences, the\naverage length of the sequences and the number of observed variables are reported under the name of\neach dataset. 
We compare oSLRAU to the previous search-and-score\n(S&S) technique [10] for recurrent SPNs (RSPNs) with Gaussian leaves as well as HMMs with\nmixture-of-Gaussians emission distributions and recurrent neural networks (RNNs) with LSTM units\nand output units that compute the mean of Gaussians. The number of interface nodes in RSPNs,\nhidden states in HMMs and LSTM units in RNNs was capped at 15. Both parameter and structure\nlearning were performed for the RSPNs while only parameter learning was performed for the HMMs\nand RNNs. The RNNs were trained by minimizing squared loss, which is mathematically equivalent\nto maximizing the data likelihood when we interpret each output as the mean of a univariate Gaussian\nwith a fixed variance.\nThe variance of each Gaussian was optimized by a grid search in [0.01,0.1] in increments of 0.01 and\nin [0.1,2] in increments of 0.1. We did this solely for the purpose of reporting the log likelihood of\nthe test data with RNNs, which would not be possible otherwise. oSLRAU outperformed the other\ntechniques on 4 of the 5 datasets. It learned better structures than S&S in less time. oSLRAU took\nless than 11 minutes per dataset while S&S took 1 day per dataset.\n\nTable 3: Average log-likelihood and standard error based on 10-fold cross validation. 
(#i,length,#oVars)\nindicates the number of data instances, average length of the sequences and number of observed variables per\ntime step.\n\nDataset | hillValley | eegEye | libras | JapanVowels | ozLevel\n(#i,length,#oVars) | (600,100,1) | (14970,14,1) | (350,90,1) | (270,16,12) | (2170,24,2)\nHMM | 286 \u00b1 6.9 | 22.9 \u00b1 1.8 | -116.5 \u00b1 2.2 | -275 \u00b1 13 | -34.6 \u00b1 0.3\nRNN | 205 \u00b1 23 | 15.2 \u00b1 3.9 | -92.9 \u00b1 12.9 | -257 \u00b1 35 | -15.3 \u00b1 0.8\nRSPN+S&S | 296 \u00b1 16.1 | 25.9 \u00b1 2.1 | -93.5 \u00b1 7.2 | -241 \u00b1 12 | -34.4 \u00b1 0.4\nRSPN+oSLRAU | 299.5 \u00b1 18 | 36.9 \u00b1 1.4 | -83.5 \u00b1 5.4 | -231 \u00b1 12 | -30.1 \u00b1 0.4\n\n5 Conclusion and Future Work\n\nThis paper describes a new online structure learning technique for feed-forward and recurrent SPNs.\noSLRAU can learn the structure of SPNs in domains for which it is unclear what might be a good\nstructure, including sequence datasets of varying length. This algorithm can also scale to large\ndatasets efficiently. We plan to extend this work by learning the structure of SPNs in an online and\ndiscriminative fashion. Discriminative learning is essential to attain good accuracy in classification.\n\nAcknowledgments\n\nThis research was funded by Huawei Technologies and NSERC. Prashant Doshi acknowledges\nsupport from NSF grant #IIS-1815598.\n\nReferences\n\n[1] Adel, Tameem, Balduzzi, David, and Ghodsi, Ali. Learning the structure of sum-product networks via an SVD-based algorithm. In UAI, pp. 32\u201341, 2015.\n\n[2] Darwiche, Adnan. A logical approach to factoring belief networks. KR, 2:409\u2013420, 2002.\n\n[3] Darwiche, Adnan. A differential approach to inference in Bayesian networks. JACM, 50(3):280\u2013305, 2003.\n\n[4] Dennis, Aaron and Ventura, Dan. Learning the architecture of sum-product networks using clustering on variables. In Advances in Neural Information Processing Systems, pp. 
2033\u20132041, 2012.\n\n[5] Dinh, Laurent, Sohl-Dickstein, Jascha, and Bengio, Samy. Density estimation using Real NVP. In International Conference on Learning Representations, 2017.\n\n[6] Gens, Robert and Domingos, Pedro. Discriminative learning of sum-product networks. In NIPS, pp. 3248\u20133256, 2012.\n\n[7] Gens, Robert and Domingos, Pedro. Learning the structure of sum-product networks. In ICML, pp. 873\u2013880, 2013.\n\n[8] Jaini, Priyank, Rashwan, Abdullah, Zhao, Han, Liu, Yue, Banijamali, Ershad, Chen, Zhitang, and Poupart, Pascal. Online algorithms for sum-product networks with continuous variables. In Conference on Probabilistic Graphical Models, pp. 228\u2013239, 2016.\n\n[9] Lee, Sang-Woo, Heo, Min-Oh, and Zhang, Byoung-Tak. Online incremental structure learning of sum-product networks. In International Conference on Neural Information Processing (ICONIP), pp. 220\u2013227. Springer, 2013.\n\n[10] Melibari, Mazen, Poupart, Pascal, Doshi, Prashant, and Trimponias, George. Dynamic sum product networks for tractable inference on sequence data. In Conference on Probabilistic Graphical Models, pp. 345\u2013355, 2016.\n\n[11] Peharz, Robert. Foundations of Sum-Product Networks for Probabilistic Modeling. PhD thesis, Medical University of Graz, 2015.\n\n[12] Peharz, Robert, Geiger, Bernhard C, and Pernkopf, Franz. Greedy part-wise learning of sum-product networks. In Machine Learning and Knowledge Discovery in Databases, pp. 612\u2013627. Springer, 2013.\n\n[13] Poon, Hoifung and Domingos, Pedro. Sum-product networks: A new deep architecture. In UAI, pp. 2551\u20132558, 2011.\n\n[14] Rahman, Tahrima and Gogate, Vibhav. Merging strategies for sum-product networks: From trees to graphs. In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, UAI, 2016.\n\n[15] Rashwan, Abdullah, Zhao, Han, and Poupart, Pascal. 
Online and distributed Bayesian moment matching for parameter learning in sum-product networks. In Artificial Intelligence and Statistics, pp. 1469\u20131477, 2016.\n\n[16] Rooshenas, Amirmohammad and Lowd, Daniel. Learning sum-product networks with direct and indirect variable interactions. In ICML, pp. 710\u2013718, 2014.\n\n[17] Roth, Dan. On the hardness of approximate reasoning. Artificial Intelligence, 82(1):273\u2013302, 1996.\n\n[18] Vergari, Antonio, Di Mauro, Nicola, and Esposito, Floriana. Simplifying, regularizing and strengthening sum-product network structure learning. In ECML-PKDD, pp. 343\u2013358, 2015.\n\n[19] Zhao, Han, Melibari, Mazen, and Poupart, Pascal. On the relationship between sum-product networks and Bayesian networks. In International Conference on Machine Learning, pp. 116\u2013124, 2015.\n\n[20] Zhao, Han, Adel, Tameem, Gordon, Geoff, and Amos, Brandon. Collapsed variational inference for sum-product networks. In ICML, 2016.\n\n[21] Zhao, Han, Poupart, Pascal, and Gordon, Geoffrey J. A unified approach for learning the parameters of sum-product networks. In Advances in Neural Information Processing Systems, pp. 433\u2013441, 2016.", "award": [], "sourceid": 3452, "authors": [{"given_name": "Agastya", "family_name": "Kalra", "institution": "University of Waterloo"}, {"given_name": "Abdullah", "family_name": "Rashwan", "institution": "University of Waterloo"}, {"given_name": "Wei-Shou", "family_name": "Hsu", "institution": "University of Waterloo"}, {"given_name": "Pascal", "family_name": "Poupart", "institution": "University of Waterloo & RBC Borealis AI"}, {"given_name": "Prashant", "family_name": "Doshi", "institution": "University of Georgia"}, {"given_name": "Georgios", "family_name": "Trimponias", "institution": "Huawei Technologies Co., Ltd."}]}