{"title": "Deep Homogeneous Mixture Models: Representation, Separation, and Approximation", "book": "Advances in Neural Information Processing Systems", "page_first": 7136, "page_last": 7145, "abstract": "At their core, many unsupervised learning models provide a compact representation of homogeneous density mixtures, but their similarities and differences are not always clearly understood. In this work, we formally establish the relationships among latent tree graphical models (including special cases such as hidden Markov models and tensorial mixture models), hierarchical tensor formats and sum-product networks. Based on this connection, we then give a unified treatment of exponential separation in \\emph{exact} representation size between deep mixture architectures and shallow ones. In contrast, for \\emph{approximate} representation, we show that the conditional gradient algorithm can approximate any homogeneous mixture within $\\epsilon$ accuracy by combining $O(1/\\epsilon^2)$ ``shallow'' architectures, where the hidden constant may decrease (exponentially) with respect to the depth. Our experiments on both synthetic and real datasets confirm the benefits of depth in density estimation.", "full_text": "Deep Homogeneous Mixture Models:\n\nRepresentation, Separation, and Approximation\n\nDepartment of Computer Science & Waterloo AI Institute\n\nPriyank Jaini\n\nUniversity of Waterloo\npjaini@uwaterloo.ca\n\nPascal Poupart\n\nUniversity of Waterloo, Vector Institute & Waterloo AI Institute\n\nppoupart@uwaterloo.ca\n\nDepartment of Computer Science & Waterloo AI Institute\n\nYaoliang Yu\n\nUniversity of Waterloo\n\nyaoliang.yu@uwaterloo.ca\n\nAbstract\n\nAt their core, many unsupervised learning models provide a compact representation\nof homogeneous density mixtures, but their similarities and differences are not\nalways clearly understood. 
In this work, we formally establish the relationships among latent tree graphical models (including special cases such as hidden Markov models and tensorial mixture models), hierarchical tensor formats and sum-product networks. Based on this connection, we then give a unified treatment of exponential separation in exact representation size between deep mixture architectures and shallow ones. In contrast, for approximate representation, we show that the conditional gradient algorithm can approximate any homogeneous mixture within ε accuracy by combining O(1/ε²) “shallow” architectures, where the hidden constant may decrease (exponentially) with respect to the depth. Our experiments on both synthetic and real datasets confirm the benefits of depth in density estimation.

1 Introduction

Multivariate density estimation, a widely studied problem in statistics and machine learning [28], is becoming even more relevant nowadays due to the availability of huge amounts of unlabeled data in various applications. Many unsupervised and semi-supervised learning algorithms either implicitly (e.g. generative adversarial networks) or explicitly estimate (some functional of) the underlying density function. In this work, we study the problem of density estimation with an explicit representation through finite mixture models (FMMs) [19], which have endured thorough scientific scrutiny over decades. The popularity of FMMs is largely due to their simplicity, interpretability, and universality, in the sense that, given sufficiently many components (satisfying mild conditions), FMMs can approximate any distribution to an arbitrary level of accuracy [22].

Many familiar unsupervised models in machine learning, at their core, provide a compact representation of homogeneous density mixtures. 
This list includes (but is not limited to) hidden Markov models (HMM), the recently proposed tensorial mixture models (TMM) [26], latent tree graphical models (LTM) [21], hierarchical tensor formats (HTF) [13], and sum-product networks (SPN) [9; 24]. However, despite all being a certain form of FMM, the precise relationships among these models are not always well-understood. Our first contribution fills this gap: we prove (roughly) that {HMM, TMM} ⊆ LTM ⊆ HTF ⊆ SPN. Moreover, converting from a lower to an upper class can be achieved in linear time and without any increase in size. Our results not only clarify the similarities and subtle differences between these widely-used models, but also pave the way for a unified treatment of many properties of such models, using tools from linear algebra.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

We next investigate the consequence of converting a deep mixture model into a shallow one. We first prove that the (nonnegative) tensor rank exactly characterizes the minimum size of a shallow SPN (or LTM or HTF, due to equivalence) that represents a given homogeneous mixture. Then, we show that a generic “deep” SPN (with depth at least 2) can be exactly represented by a shallow SPN only when the latter contains exponentially many product nodes. Our result significantly extends those in [7; 26; 10; 18; 8] in various aspects, but most saliently from the restrictive full binary tree [7; 26] to any rooted tree. As a consequence, our results imply that a generic HMM (whose underlying tree is “completely” unbalanced) cannot be exactly represented by any polynomially-sized shallow SPN, which, to our best knowledge, has not been shown before.

From a practical point of view, exact representations are an overkill: it suffices to approximate a given density mixture with reasonable accuracy. 
Our third contribution demonstrates that under the ℓ∞ metric, we can approximate any homogeneous density mixture within ε accuracy by combining O(1/ε²) shallow SPNs. However, our proof requires knowledge of the target density and hence is not practical. Instead, borrowing a classic idea from [17], we show that minimizing the KL divergence using the conditional gradient algorithm can also approximate any homogeneous mixture within ε accuracy by combining O(1/ε²) base SPNs, where the hidden constant decreases exponentially w.r.t. the depth of the base SPNs. Each iteration of the conditional gradient algorithm amounts to learning a base SPN and hence can be efficiently implemented. We conduct thorough experiments on both synthetic and real datasets and confirm the benefits of depth in density estimation.

We proceed as follows: In §2 we introduce homogeneous density mixtures. In §3 we articulate the relationships among various popular mixture models. §4 examines the exponential separation in exact representation size between deep and shallow models, while §5 turns to approximate representations. We report our experiments in §6 and finally we conclude in §7. All proofs are deferred to Appendix C.

2 Density Estimation using Mixture Models

In this section, we introduce our main problem: how to estimate a multivariate density through an explicit, finite homogeneous mixture. To set up the stage, let x = (x1, . . . , xd), with xi ∈ Xi, where each Xi is a Borel (measurable) subset of the Euclidean space Ei. We equip a Borel measure µi on Xi. All our subsequent measure-theoretic definitions are w.r.t. the Borel σ-field of Xi and the measure µi. Let X = X1 × ··· × Xd and µ = µ1 × ··· × µd be the product space and product measure, respectively. For each i ∈ [d] := {1, . . . , d}, let Fi be a class of density functions (w.r.t. µi) of the variable xi, and let Gi = conv(Fi) be its convex hull. The function class Fi is essentially our basis of densities for the variable xi. Our setting here follows that in [18] and includes both continuous and discrete distributions.

We are interested in constructing a finite density mixture [19], using component densities from the basis class F = ∪_{i=1}^d Fi. We assume that our finite mixture f is “homogeneous,” i.e.

    f(x) = ∑_{j1=1}^{k1} ∑_{j2=1}^{k2} ··· ∑_{jd=1}^{kd} W_{j1,j2,...,jd} ∏_{i=1}^d f^i_{ji}(xi) = ⟨W, f⃗^1(x1) ⊗ ··· ⊗ f⃗^d(xd)⟩,    (1)

where f⃗^i := (f^i_1, . . . , f^i_{ki}) ∈ F_i^{ki}, W ∈ ⊗_i R_+^{ki} ≃ R_+^{k1×···×kd} is a d-order density tensor (nonnegative and sums to 1), and ⟨·,·⟩ is the standard inner product on the tensor product space. We refer to the excellent book [13] and Appendix A for some basic definitions about tensors. By dropping linearly dependent densities in each Fi, we can assume w.l.o.g. that the tensor representation W is unique.

There are a number of reasons for restricting to homogeneous mixtures: Firstly, this is the most common choice for estimating a multivariate density function [28]. Secondly, we can always apply the usual “homogenization” trick, i.e., by enlarging the function class Fi and appending the (improper) density 1 to each Fi. Thirdly, homogeneous densities are “universal” if each class Fi is, c.f. Appendix A of [26]. In other words, any joint density can be approximated arbitrarily well by a homogeneous density, provided that each marginal class Fi can approximate any marginal density arbitrarily well and the size (i.e. ki) tends to ∞. 
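The tensor evaluation in (1) can be sketched directly: build a nonnegative tensor W that sums to 1, evaluate the d vectors of basis densities at a point, and contract. The Gaussian bases and their means below are illustrative assumptions, not part of the paper.

```python
import numpy as np

# Sketch of eq. (1): a homogeneous mixture evaluates as
#   f(x) = <W, f1(x1) ⊗ ... ⊗ fd(xd)>,
# where W is a nonnegative tensor summing to 1 and each f^i_j is a univariate
# basis density. Gaussian bases with random means are illustrative assumptions.

def gauss_pdf(x, mu):
    # unit-variance Gaussian density, evaluated elementwise
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2.0 * np.pi)

rng = np.random.default_rng(0)
d, k = 3, 2                          # d variables, k basis densities per variable
means = rng.normal(size=(d, k))      # hypothetical component means

W = rng.random((k,) * d)
W /= W.sum()                         # nonnegative and sums to 1: a density tensor

def mixture_density(x):
    # f_vecs[i][j] = f^i_j(x_i): the j-th basis density of variable i at x_i
    f_vecs = [gauss_pdf(x[i], means[i]) for i in range(d)]
    val = W
    for v in f_vecs:                 # contract <W, f1 ⊗ ... ⊗ fd> one mode at a time
        val = np.tensordot(val, v, axes=([0], [0]))
    return float(val)

print(mixture_density(np.zeros(d)))
```

The successive `tensordot` contractions cost O(k^d) here, which is exactly the explicit-storage burden that the compact representations of §3 avoid.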
See Appendix F.1 for some empirical verifications, where we show that convex combinations of relatively few isotropic Gaussians can approximate mixtures of Gaussians with full covariance matrices surprisingly well. Lastly, as we argue below, many known models in machine learning are simply compact representations of homogeneous mixtures.

3 Compact Representation of Homogeneous Mixtures

We now recall a few unsupervised learning models in machine learning and show that they have a compact representation of homogeneous mixtures at their core. We prove the precise relationship amongst them. Our results clarify the similarities and differences of these recent developments, and pave the way for a unified treatment of depth separation (Section 4) and model approximation (Section 5).

Sum-Product Networks (SPN) [9; 24; 18] An SPN T is a rooted tree whose leaves are density functions f^i_j(xi) over each of the variables x1, . . . , xd and whose internal nodes are either a sum node or a product node. Each edge (u, v) emanating from a sum node u has an associated nonnegative weight wuv. The value Tu at a sum node u is the weighted sum of the values of its children, ∑_{v∈ch(u)} wuv Tv. The value Tv at a product node v is the product of the values of its children, ∏_{u∈ch(v)} Tu. The value of an SPN T is the expression evaluated at the root node, which we denote as T(x). The scope of a node v in an SPN is the set of all variables that appear in the leaves of the sub-SPN rooted at v. We only consider decomposable and complete SPNs, i.e., the children of each sum node must have the same scope and the children of each product node must have disjoint scopes. The main advantage of a decomposable and complete SPN over a generic graphical model is that joint, marginal and conditional queries can be answered by two network evaluations and hence, exact inference takes linear time with respect to the size of the network [9; 24; 18]. 
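The sum/product evaluation rules above can be sketched as a small recursive evaluator; the node layout below is a hypothetical two-component example (cf. the latent class model of Figure 1), not the paper's implementation.

```python
# Minimal sketch of evaluating a (complete, decomposable) SPN bottom-up:
# a sum node returns the weighted sum of its children, a product node the
# product of its children; leaves hold precomputed density values f^i_j(x_i).
# One pass over the network, so evaluation is linear in its size.

def eval_spn(node, leaf_vals):
    kind = node[0]
    if kind == "leaf":                      # ("leaf", index into leaf_vals)
        return leaf_vals[node[1]]
    if kind == "prod":                      # ("prod", [children])
        val = 1.0
        for child in node[1]:
            val *= eval_spn(child, leaf_vals)
        return val
    # ("sum", [(weight, child), ...]) with nonnegative weights summing to 1
    return sum(w * eval_spn(child, leaf_vals) for w, child in node[1])

# Hypothetical two-component mixture over x1, x2:
spn = ("sum", [
    (0.3, ("prod", [("leaf", 0), ("leaf", 1)])),   # f^1_1(x1) * f^2_1(x2)
    (0.7, ("prod", [("leaf", 2), ("leaf", 3)])),   # f^1_2(x1) * f^2_2(x2)
])
vals = [0.5, 0.2, 0.1, 0.4]                        # leaf densities at some x
print(eval_spn(spn, vals))                         # 0.3*0.5*0.2 + 0.7*0.1*0.4
```

Marginalizing a variable amounts to replacing its leaf values with 1 and re-evaluating, which is what makes marginal queries linear-time in the network size.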
In comparison, inference in Bayesian Networks and Markov Networks may take exponential time in terms of the size of the network. W.l.o.g. we can rearrange an SPN to have alternating sum and product layers (see Theorem C.1). The latent variable semantics [23], as well as the fact that SPNs represent a mixture model over their leaf densities [24], is well-known. It is also informally known that many tractable graphical models can be treated as SPNs, but precise characterizations are scarce (see [29], which relates SPNs with Bayesian Networks).

Self-similar SPNs (S3PN) We call an SPN self-similar if, at every sum node, the sub-tree rooted at each of its (product node) children is the same, except that the weights at corresponding sum nodes and the densities (but not the variables) at corresponding leaf nodes may differ. This special class of SPNs is exactly equivalent to some recently proposed unsupervised learning models, as we show below.

Hierarchical Tensor Format (HTF) [13] We showed in (1) that a homogeneous mixture can be identified with a tensor W, whose explicit storage can, however, be quite challenging since its size is ∏_{i=1}^d ki. HTF [13] aims at representing tensors compactly, hence can also be used for representing homogeneous mixtures. An HTF consists of a dimension-partition rooted tree (DPT) T, d vector spaces Vi with bases¹ Fi at the d leaf nodes, and at most d − 1 internal nodes which are certain subspaces of the tensor product of vector spaces at disjoint children nodes. Note that the dimension of the tensor product U ⊗ V is the product of the dimensions of U and V. 
The key in HTF is to truncate each tensor product with a (much smaller) subspace, hence keeping the total storage manageable. Moreover, at each internal node v with k children nodes {vi}, instead of storing its r bases directly, we store r coefficient tensors {w^{v,γ} : γ ∈ [r]} such that, recursively, the γ-th basis at node v is ∑_{j1} ··· ∑_{jk} w^{v,γ}_{j1,...,jk} v_{j1} ⊗ ··· ⊗ v_{jk}, where {v_{ji}} consists of the bases at the i-th child node vi. To our best knowledge, HTFs have not been recognized as SPNs previously, although they have been used in a spectral method for latent variable models [27].

To turn an HTF into an SPN, more precisely an S3PN, we start from the root of the dimension-partition tree T. For each internal node v with, say, r bases and k children nodes {vi}, each of which has ri bases itself, we create three layers in the corresponding S3PN: the first layer has r sum nodes {S^v_γ}; the second layer has ∏_{i=1}^k ri product nodes {P^v_{j1,...,jk}}, each of which is (fully) connected to the first layer with respective weights w^{v,γ}_{j1,...,jk}; and the third layer consists of ∑_{i=1}^k ri sum nodes {S^{vi}_{ji}}. The product node P^v_{j1,...,jk} is connected to the k sum nodes {S^{v1}_{j1}, . . . , S^{vk}_{jk}}. Note that the weights w^{v,γ}_{j1,...,jk} need not be positive or sum to 1 in HTF, although for representing a homogeneous mixture we can make this choice, and we call this subclass HTF+. Clearly, our construction is reversible, hence we can turn an S3PN into an equivalent HTF+ as well. The construction takes linear time and there is no increase of representation size. See Figs. 1, 5 for simple illustrations². In summary, HTF is exactly S3PN with arbitrary weights.

¹More generally frames; in particular, the elements need not be linearly independent.

Figure 1: Left: A simple latent class model (special case of LTM). The superscript 2 indicates the number of values the hidden variable H can take. Middle: The equivalent S3PN, where f^i_j(xi) = p(Xi = xi|H = j) is from the density class Fi. Right: The dimension-partition tree in an equivalent HTF+. The superscript indicates the number of bases, which should be the same for sibling nodes.

Diagonal HTF (dHTF) [13] For later reference, let us call the subclass of HTFs whose coefficient tensors w^{v,γ} (that define bases recursively at internal nodes of the DPT, see above) are diagonal for all v and γ as dHTF, i.e., siblings in the DPT must have the same number of bases (ri ≡ r) and w^{v,γ}_{j1,...,jk} ≠ 0 only when j1 = . . . = jk. In neural network terminology, dHTFs are “locally connected.” Compared to the fully connected HTF, dHTFs significantly reduce the representation size (at the expense of expressiveness, see Figure 7). For instance, the ∏_{i=1}^k ri = r^k product nodes in the above conversion from HTF to S3PN are reduced to merely r product nodes.

Latent Tree Models (LTM) [21; 27; 5] An LTM is a rooted tree graphical model with observed variables Xi on the leaves and hidden variables Hj on the internal nodes. Note that we allow the observed variables Xi to be either continuous or discrete, but the hidden variables Hj can take only finitely many values. Using conditional independence, the joint density of the observed variables is given as

    f(x1, . . . , xd) = ∑_{h1} ··· ∑_{ht} W(h1, . . . , ht) ∏_{i=1}^d f^i_{hπi}(xi),    (2)

where Hπi is the parent of Xi. From (2) it is clear that an LTM is a homogeneous density mixture, whose tensor representation is given by the joint density W of the hidden variables. 
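For the simplest LTM, a latent class model with one hidden variable (as in Figure 1), eq. (2) reduces to marginalizing a single H; a minimal sketch with hypothetical CPT numbers:

```python
import numpy as np

# Sketch of eq. (2) for a latent class model: one hidden variable H with
# prior Pr(H = h), and each observed X_i depending only on H, so
#   f(x1,...,xd) = sum_h Pr(H = h) * prod_i p(x_i | H = h).
# All numbers below are illustrative assumptions.

prior = np.array([0.4, 0.6])            # Pr(H = h), h in {1, 2}
# cond[i, h] = f^i_h(x_i) = p(X_i = x_i | H = h), already evaluated at some x
cond = np.array([[0.9, 0.2],
                 [0.3, 0.8],
                 [0.5, 0.5]])

f_x = float(prior @ cond.prod(axis=0))  # marginalize out H
print(f_x)
```

In this latent class case the tensor representation W of the mixture is diagonal, with W_{h,...,h} = Pr(H = h), which is exactly the "locally connected" structure of dHTF+ above.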
What is less\nknown3 is that LTMs are a special subclass of self-similar SPNs. It may appear that the size of\nS3PN is larger than that of an equivalent LTM, but this is because S3PN also encodes the conditional\nprobability tables (CPT) into its structure whereas LTMs require other means to store CPTs. Note\nalso that to evaluate an LTM, one usually needs to run a separately designed algorithm (such as\nmessage passing), while in S3PN we evaluate the leaf densities and propagate in linear time to the\nroot. In summary, LTM is a subclass of S3PN with CPTs encoded as edge weights and with inference\nsimpli\ufb01ed as network propagation. More precisely, LTM is exactly dHTF+, since conditioned on the\nparent, all children nodes must depend on the same realization. An algorithm for converting LTMs\ninto equivalent S3PNs, along with more examples (Figs. 1-6), can be found in Appendix B.1.\nTensorial Mixture Models (TMM) [26; 7; 6] TMM [26] is a recently proposed subclass of dHTF+\nwhere nodes on the same level of the dimension-partition tree must have the same number of bases.\nClearly, TMM is a strict subclass of LTM since the latter only requires sibling nodes in the DPT to\nhave the same number of bases. We note that TMM, as de\ufb01ned in [26], also assumes the DPT to be\nbinary and balanced, i.e. each internal node has exactly two children, although this condition can be\neasily relaxed. See Figure 2 and its reduced form in Appendix B.3 for a simple example. Further, in\nAppendix B.4, we give an example of an LTM that is not a TMM.\nHidden Markov Models (HMM) [3; 25] HMM is a strict subclass of LTM. [14] recently observed\nthat HMM is equivalent to the tensor-train format, a special subclass of dHTF+ where the DPT is\nbinary and completely \u201cimbalanced.\u201d See Appendix B.5 for a simple example. 
In some sense, TMM and HMM are the two opposite extremes within dHTF+ (or equivalently LTM). Further, in Appendix B.6 we give an example of an S3PN that is not an LTM, and in Appendix B.7 we give an example of an SPN that is not an S3PN, leading to the following summary:

Theorem 3.1. {TMM, HMM} ⊆ LTM = dHTF+ ⊆ HTF+ = S3PN ⊆ SPN, in the sense that we can convert in linear time from a lower representation class to an upper one, without any increase in size.

²All of our illustrations of S3PN in the main text are drawn with some redundant leaves, for the sake of making the self-similar property apparent. See Appendix B for the reduced (but equivalent) counterparts.

³As evidence, we note that the recent survey [21] on LTMs did not mention SPNs at all.

Figure 2: Left: A dimension-partition tree in HTF. The superscripts indicate the number of bases, which should remain constant on each level. Middle: The equivalent S3PN. The leaf f^i_j is the j-th basis of vector space Vi. Right: An equivalent TMM. The superscripts indicate the number of values each hidden variable can take (again, remaining constant on each level).

It is important to point out one subtlety here: any (complete and decomposable) SPN, if expanded at the root, is a homogeneous mixture (c.f. (1)). Hence, any SPN is even equivalent to an LCM (i.e. 
an\nLTM with one hidden variable taking many values, like in Figure 1), at the expense of potentially\nincreasing the size (signi\ufb01cantly). Thus, the containment in Theorem 3.1 should be understood under\nthe premise of not increasing the representation size. It would be interesting to understand if the\ncontainment is strict if only polynomial increase in size is allowed. We provide more comparing\nexamples in Appendix B for different models, and in the next section we discuss the (huge) size\nconsequence from converting a certain upper representation class to some lower one.\n\n4 Depth Separation\nIn the previous section, we established relationships among different representation schemes for\nhomogeneous density mixtures. In this section, we prove an exponential separation in size when\nconverting one representation to another and extend the results in [10; 18; 7; 26]. The key is to exploit\nthe equivalence to HTF, which allows us to bound the model size using linear algebra.\nWe call a (complete and decomposable) SPN shallow if it has only one sum node, followed by\na layer of product nodes. Using the equivalence in Section 3, we know a shallow SPN (trivially\nself-similar) is equivalent to an LCM (a latent tree model with one hidden node taking as many values\nas the number of product nodes), or an HTF+ whose DPT has depth 1 (c.f. Figure 1). Recall that\nrank+(W) denotes the nonnegative rank of a tensor and nnz(W) is the number of nonzeros (c.f.\nAppendix A). The leaf nodes in SPN (LTM) or the leaf bases in HTF are either from F (union of\nlinearly independent component densities) or G (the convex hull), see the de\ufb01nitions in Section 2.\nOur \ufb01rst result characterizes the model capacity of shallow SPNs (LCMs):\nTheorem 4.1. If a shallow SPN T, with leaf (input) nodes from G, represents the density mixture W,\nthen T has at least rank+(W) many product nodes. 
Conversely, there always exists a shallow SPN that represents W using rank+(W) product nodes and 1 sum node.

In other words, the nonnegative rank characterizes the smallest size of shallow SPNs (LCMs) that represent the density mixture W. Similarly, we can prove the following result when the leaf nodes are from F instead of the convex hull G.

Theorem 4.2. If a shallow SPN T, with leaf nodes from F, represents the density mixture W, then either T has at least nnz(W) product nodes or rank+(W) = 1. Conversely, there always exists a shallow SPN that represents W using nnz(W) product nodes and 1 sum node.

Note that we always have rank(W) ≤ rank+(W) ≤ nnz(W), thus the lower bound in Theorem 4.2 is stronger than that in Theorem 4.1. This is not surprising, because an SPN with leaf nodes from G is the same as an SPN with leaf nodes from F and with an additional layer of sum nodes appended at the bottom (to perform the convex hull operation). This difference already indicates that an additional layer of sum nodes at the bottom can strictly increase the expressive power of SPNs. This distinction between leaf nodes from F or from G has, to our best knowledge, not been noted before.

The significance of Theorem 4.1 and Theorem 4.2 is that they give exact characterizations of the model size of shallow SPNs, and they pave the way for comparing more interesting models. For convenience, we state our next result in terms of LTMs, but the consequence for dHTFs or SPNs should be clear, thanks to the equivalence in Theorem 3.1.

Theorem 4.3. Let an LTM T have d observed variables X = {X1, . . . , Xd} with parents Hi taking ri values respectively. 
Assuming the CPTs of T are sampled from a continuous distribution, then almost surely, the tensor representation W for T has rank at least

    max_{1≤m≤d/2}  max_{{S1,...,Sm,S̄1,...,S̄m}⊆X}  ∏_{i=1}^m min{ri, r̄i, ki, k̄i},    (3)

where ki (k̄i) is the number of (linearly independent) component densities that Si (S̄i) has, and Si (S̄i) are non-siblings.

Corollary 4.4. In addition to the setting in Theorem 4.3, if each observed variable Xi has b sibling observed variables and ri ≡ r ≤ k ≡ ki, then the tensor representation W has rank at least r^⌊d/b⌋.

Corollary 4.5. In addition to the setting in Theorem 4.3, if each observed variable Xi has no sibling observed variables and ri ≡ r ≤ k ≡ ki, then the tensor representation W has rank at least r^⌊d/2⌋.

Combining Corollary 4.4 with Theorem 4.2, we conclude that an LTM T with d observed variables Xi, where every b of them share the same hidden parent node, is equivalent to an LCM T′ where the hidden node must take at least r^⌊d/b⌋ many values. Note that T has Θ(d/b) hidden variables, each of them taking r values, thus the total size of the CPTs of T is Θ(rd/b) while the total size of that of T′ is r^⌊d/b⌋, an exponential blow-up. By combining Corollary 4.5 with Theorem 4.2, a similar conclusion can be made for converting an HMM into an LCM. Of course, an interpretation using SPNs is also readily available: Almost all depth-L S3PNs (L ≥ 2) with weights sampled from a continuous distribution can be written as a shallow SPN with necessarily exponentially many product nodes.

To our best knowledge, [10] was the first to construct a polynomial that, while representable by a polynomially-sized depth-log d SPN, would require exponentially many product nodes if represented by a shallow SPN. However, the deep SPN given in [10, Figure 1] is not complete. 
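The converse direction of Theorem 4.1 (a nonnegative CP decomposition of W directly yields a shallow SPN with one product node per rank-one term, and leaves from G) can be checked numerically; a minimal sketch with random illustrative factors:

```python
import numpy as np

# Sketch of the converse in Theorem 4.1: a nonnegative CP decomposition
#   W = sum_g  weights[g] * a1[g] ⊗ a2[g] ⊗ a3[g]     (r terms)
# yields a shallow SPN: one sum node over r product nodes, whose leaves are
# convex combinations <a_i[g], f_i(x_i)> of basis densities (leaves from G).
# Factors, weights, and basis values below are illustrative random data.

rng = np.random.default_rng(1)
r, k, d = 2, 3, 3
factors = [rng.random((r, k)) for _ in range(d)]   # nonnegative CP factors
for a in factors:
    a /= a.sum(axis=1, keepdims=True)              # rows: convex-combination weights
weights = np.array([0.25, 0.75])                   # sum-node weights

# the dense tensor W assembled from the CP terms
W = sum(weights[g] * np.einsum('i,j,k->ijk',
                               factors[0][g], factors[1][g], factors[2][g])
        for g in range(r))

f_basis = [rng.random(k) for _ in range(d)]        # f^i_j(x_i) evaluated at some x

# shallow SPN: sum over r product nodes, leaves <factor row, basis values>
shallow = sum(weights[g] * np.prod([factors[i][g] @ f_basis[i] for i in range(d)])
              for g in range(r))

# direct homogeneous-mixture evaluation <W, f1 ⊗ f2 ⊗ f3>
direct = np.einsum('ijk,i,j,k->', W, *f_basis)
print(abs(shallow - direct) < 1e-12)
```

The point of §4 is the other direction: for a generic deep model, the number r of such rank-one terms (the nonnegative rank) is necessarily exponential in d.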
Recently, [7] proved that the existence result of [10] is in fact generic. However, the results of [7] and the subsequent work [26] are limited to full binary trees. In contrast, our general Theorem 4.3 holds for any tree, and we allow non-sibling nodes to take different numbers of values. As a result, we are able to handle HMMs, the opposite extreme of TMM. Another important point we want to emphasize is that the exponential separation from a shallow (i.e. depth-1) tree can be achieved by increasing the depth by merely 1, as opposed to the depth-log d constructions in [10; 26].

We end this section by making another observation about Theorem 4.3: It also allows us to compare the model sizes of LTMs T1 and T2 where, say, T1, after removing its root R, is a subtree of T2. Indeed, in this case we need only define the children nodes of R as “observed” variables. Then, T1 becomes an LCM and T2 serves as T in Theorem 4.3, with observed variables as the children nodes of R. This essentially extends [7, Theorem 3] from a full binary tree to any tree, allowing non-sibling nodes to take different numbers of values.

5 Approximate Representation

In the previous section, we proved that homogeneous mixtures representable by “deep” architectures (such as SPN or LTM) of polynomial size cannot be exactly represented by a shallow one of sub-exponential size. In this section, we address a more intricate and relevant question: What if we are only interested in an approximate representation?

To formulate the problem, let g and h be two homogeneous mixtures with tensor representations W and Z, respectively. We consider the distance dist(g, h) := ‖W − Z‖ for some norm ‖ · ‖ specified later. Using the characterization in Theorem 4.1, we formulate our approximation problem as follows. Let Δ be a perturbation tensor with ‖Δ‖ ≤ ε. 
What is the minimum value of rank+(W + Δ), i.e. the size of a shallow SPN? This motivates the following definition adapted from [1]:

    ε-rank+(W) = min { rank+(W + Δ) : ‖Δ‖ ≤ ε } = min { rank+(Z) : ‖Z − W‖ ≤ ε }.    (4)

In other words, ε-rank+ is precisely the minimum size of a shallow SPN (LCM) that approximates a specified mixture W with accuracy ε. We can similarly define ε-rank, where we replace the nonnegative rank with the usual rank in (4). Note that the notion of ε-rank depends on the norm ‖ · ‖.

ℓ∞-norm Let the norm in the definition (4) be the usual ℓ∞ norm, and we signify this choice with the notation ε-rank∞. In this setting, we can prove the following nearly-tight bound on the ε-rank.

Theorem 5.1. Fix ε > 0 and tensor W ∈ R^{k1×···×kd}. Then, for some (small) constant c > 0,

    ε-rank∞(W) ≤ c‖W‖tr / ε²,    (5)

where ‖W‖tr is the tensor trace norm. A similar result holds for ε-rank∞+(W). The dependence on ε is tight up to a log factor.

Note that the representative tensor W for a homogeneous density mixture f is nonnegative and sums to 1, in which case ‖W‖tr ≤ ‖W‖1 = 1. Thus, very surprisingly, Theorem 5.1 confirms that any deep SPN (or any LTM or HTF+) can be approximated by some shallow SPN with accuracy ε under the ℓ∞ metric and with at most c/ε² many product nodes. 
Of course, this does not contradict the impossibility results in [7] and [18], because the accuracy ε there is exponentially small.

Theorem 5.1 remains mostly of theoretical interest, though, because (i) a straightforward application of Theorem 5.1 leads to a disappointing bound on the total variational distance between the two homogeneous mixtures f and g, due to scaling by the big constant ∏_i ki; and (ii) in practical applications we do not have access to W, so the constructive algorithm in our proof does not apply.

KL divergence In contrast to the above ℓ∞ approximation, we now give an efficient algorithm to approximate a homogeneous density mixture h, using a classic idea of [17]. We propose to estimate h by minimizing the KL divergence over the convex hull⁴ of a hypothesis class H:

    min_{Wg ∈ conv(H)} KL(h‖g),    (6)

where KL(h‖g) := ∫ h(x) log[h(x)/g(x)] dµ(x), and Wg is the representative tensor for the mixture g. Following [17], we apply the conditional gradient algorithm [12] to solve (6): Given g_{t−1}, we find

    (ηt, ft) ← arg min_{η∈[0,1], Wf∈H} KL(h ‖ (1 − η)g_{t−1} + ηf),    gt ← (1 − ηt)g_{t−1} + ηt ft.    (7)

One can also simply set ηt = 2/(2 + t), as is common in practice. 
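A toy instantiation of the updates (7) with the ηt = 2/(2 + t) schedule: here H is a small hypothetical dictionary of fixed discrete densities on a grid, and the inner minimization is a grid-weighted stand-in for the sample-based step; a real implementation would instead fit a base SPN in each iteration.

```python
import numpy as np

# Toy sketch of conditional gradient for KL minimization, eq. (7):
# greedily mix in one base density per iteration with step eta_t = 2/(2+t).
# The dictionary H_vals and the target are illustrative random data.

rng = np.random.default_rng(2)
n, m = 400, 6
H_vals = rng.random((m, n)) ** 4 + 1e-3
H_vals /= H_vals.sum(axis=1, keepdims=True)   # each row: a density on n grid points
target = H_vals[0]                            # h: here simply one of the candidates

g = np.full(n, 1.0 / n)                       # start from the uniform density
for t in range(1, 50):
    eta = 2.0 / (2.0 + t)                     # the common step-size schedule
    # grid stand-in for the inner step: pick f maximizing E_h[log((1-eta)g + eta f)]
    scores = [target @ np.log((1.0 - eta) * g + eta * f) for f in H_vals]
    g = (1.0 - eta) * g + eta * H_vals[int(np.argmax(scores))]

kl = target @ np.log(target / g)              # KL(h || g_T) on the grid
print(kl)
```

Each iterate g stays a convex combination of dictionary elements, mirroring how the algorithm builds a mixture of base SPNs one component at a time.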
Note that (7) can be approximately solved based on an i.i.d. sample x1, . . . , xn and hence is practical:

    max_{η∈[0,1], Wf∈H}  ∑_{i=1}^n log[(1 − η)g_{t−1}(xi) + ηf(xi)].    (8)

Using basically the same argument as in [17], the above algorithm enjoys the following guarantee:

    KL(h‖gt) ≤ ch δ/t,    (9)

where δ = sup{ log[⟨W, f⃗1 ⊗ ··· ⊗ f⃗d⟩ / ⟨Z, f⃗1 ⊗ ··· ⊗ f⃗d⟩] : W, Z ∈ H, x ∈ X }, and

    ch = min{ p ≥ 0 : Wh = ∑_{i=1}^p λi Wi, Wi ∈ H, λ ≥ 0, 1⊤λ = 1 }    (10)

is essentially the rank of the mixture h (with tensor representation Wh) w.r.t. the class H.

The important conclusion we draw from the above bound (9) is as follows: First, the constant ch is no larger than ∏_i ki if H is any of the classes in Theorem 3.1 (since we only consider finite homogeneous mixtures h). Second, if the target density h is a combination of a small number of densities in H, then ch is small and we can approximate h using the algorithm (7) efficiently. Third, ch can be vastly different for different hypothesis classes H, as shown in Section 4. For instance, if h is a generic TMM and H is the shallow class LCM, then ch is exponential in d, whereas if H is the class TMM, then ch can be as small as 1. There is a trade-off, though, since solving (8) for a simpler class (such as LCM) is easier than for a deeper one (such as TMM). We will verify this trade-off in our experiments.

6 Experiments

We perform experiments on both synthetic and real-world data to reinforce our theoretical findings. Firstly, we present experiments on synthetic data to demonstrate the expressive power of an SPN and the algorithm proposed in (7)-(8), which we call SPN-CG. 
Next, we present two sets of experiments on real-world datasets and report results for image classification under missing data.

Figure 3: Depth efficiency and performance of SPN-CG

Synthetic data   First, in Appendix F.1 we confirm that a Gaussian mixture model (GMM) with full covariance matrices can be well approximated by a homogeneous mixture model represented by an SPN learned using SPN-CG. Second, we generate 20,000 samples from a 16-dimensional GMM under three different settings: (i) an 8-component GMM with full covariance matrices, (ii) an 8-component GMM with diagonal covariance matrices, and (iii) GMMs represented by a deep SPN with 4 layers; we then estimate each using SPN-CG. We consider L ∈ {1, 2, 3, 4} layers, where L = 1 corresponds to a shallow network and L = 4 corresponds to a network in TMM (a full binary tree). For each L, at every iteration of SPN-CG we add a network with L layers. In Figure 3, we plot the number of iterations and the total running time until convergence w.r.t. the depth for each setting described above. We make the following observations: as the depth (layer) increases, the number of iterations decreases sharply, since adding a deeper network is effectively the same as adding exponentially many shallower networks (confirming Section 4). Moreover, although learning a deeper network in each iteration is more expensive than learning a shallower one, the sharp decrease in iterations fully compensates for this overhead and leads to a much reduced total running time.
The advantage of using deeper networks is more pronounced when the data is indeed generated from a deep model.

Image Classification under Missing Data by Marginalization   A natural setting in which to test the effectiveness of generative models like deep SPNs is classification in the regime of missing data. Generative models can cope with missing data naturally by marginalizing out the missing values, effectively learning all possible completions for classification. As stated earlier, SPNs are attractive because inference, marginalization and evaluating conditionals are tractable and amount to one pass through the network. This is in stark contrast with discriminative models, which often rely either on data imputation techniques (which result in sub-optimal classification) or on assuming that the distribution of missing values is the same at training and test time, an assumption that is often not valid in practice.

We perform experiments on MNIST [15] for digit classification and on small NORB [16] for 3D object recognition under the MAR (missing at random) regime as described in [26] (Section 3). We experiment with two missing-value distributions: (i) an i.i.d. mask with a fixed probability of missing each pixel, and (ii) a mask obtained as the union of rectangles of a certain size, each positioned uniformly at random in the image. Concretely, let $P(X, Y)$ be the joint distribution over the images ($X \in \mathbb{R}^d$) and labels $Y \in [M]$. Further, let $Z$ be a random binary vector conditioned on $X = x$ with distribution $Q(Z \mid X = x)$. To generate images with missing pixels, we sample $z \in \{0,1\}^d$ and consider the vector $x \odot z$. A pixel $x_i$, $i \in [d]$, is considered missing if $z_i = 0$, in which case the corresponding coordinate in $x \odot z$ holds $*$; it holds $x_i$ if $z_i = 1$. In the MAR setting that we consider for our experiments, $Q(Z = z \mid X = x)$ is a function of both $z$ and $x$ but is independent of changes to $x_i$ if $z_i = 0$, i.e.
$Z$ is independent of the missing pixels. As described in [26], the optimal classification rule in the MAR regime is $h^*(x \odot z) = \arg\max_y P(Y = y \mid w(x, z))$, where $w(x, z)$ is the realization in which $X$ coincides with $x$ on the coordinates $i$ for which $z_i = 1$.

Our main goal with these experiments is to test our algorithm SPN-CG in high-dimensional real-world settings and to show the efficacy of learning SPNs by iteratively increasing their expressiveness. Therefore, we directly adapt the experiments presented in [26]. Specifically, we adapt the code of HT-TMM for our SPN-CG by following the details in [26]. In each iteration of our algorithm, we add an SPN structure identical to that of HT-TMM. Therefore, the first iteration of our algorithm (i.e., SPN-CG1) amounts to a structure similar to HT-TMM, while additional iterations increase the network capacity. For each iteration, we train the network using an Adam SGD variant with a base learning rate of 0.03 and momentum parameters $\beta_1 = \beta_2 = 0.9$.
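The two masking distributions described above can be sketched as follows. Shapes, parameter names and the NaN encoding of the missing symbol $*$ are our own illustrative choices, not taken from the code of [26].

```python
import numpy as np

def iid_mask(shape, p_missing, rng):
    """(i) i.i.d. mask: z_i = 0 (missing) independently with probability p_missing."""
    return (rng.random(shape) >= p_missing).astype(np.uint8)

def rectangle_mask(shape, n_rects, rect_size, rng):
    """(ii) z = 0 on the union of n_rects random rect_size x rect_size boxes."""
    h, w = shape
    z = np.ones(shape, dtype=np.uint8)
    for _ in range(n_rects):
        top = rng.integers(0, h - rect_size + 1)   # uniform placement
        left = rng.integers(0, w - rect_size + 1)
        z[top:top + rect_size, left:left + rect_size] = 0
    return z

rng = np.random.default_rng(0)
x = rng.random((28, 28))              # a stand-in "image"
z = iid_mask(x.shape, 0.5, rng)
x_obs = np.where(z == 1, x, np.nan)   # x ⊙ z, with * (missing) encoded as NaN
```

In both cases the mask $z$ is drawn without looking at the values of the pixels it removes, matching the MAR condition that $Q(Z = z \mid X = x)$ does not depend on $x_i$ when $z_i = 0$.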
For each added network structure, we train the model for 22,000 iterations on MNIST and 40,000 on NORB.

^4 This is similar in spirit to [20; 2], which learn mixtures of trees, but the algorithms are quite different.

Figure 4: Performance of SPN-CG for missing data on MNIST and NORB

Due to space limits, Figure 4 only presents results comparing our model with (i) data imputation techniques that complete missing pixels with zeros or with NICE [11], a generative model suited for inpainting, followed by a ConvNet for prediction, (ii) an SPN with structure learned from data as proposed in [24], augmented with a class variable to maximize the joint probability, and (iii) shallow networks, to demonstrate the benefits of depth. A more comprehensive figure showing comparisons with several other algorithms is given in Appendix F.2, along with details.

SPN-CG1 and SPN-CG3 in Figure 4 stand for one and three iterations of our algorithm, respectively. The results show that SPN-CG performs well in all regimes of missing data for both MNIST and NORB. Furthermore, other generative models, including the SPN with structure learning, perform comparably only when a few pixels are missing but perform very poorly compared to SPN-CG when larger amounts of data are missing.
Our results here complement those in [26], where these experiments were first reported with state-of-the-art results.

7 Conclusion

We have formally established the relationships among some popular unsupervised learning models, such as latent tree graphical models, hierarchical tensor formats and sum-product networks, based on which we further provided a unified treatment of exponential separation in exact representation size between deep architectures and shallow ones. Surprisingly, for approximate representation, the conditional gradient algorithm can approximate any homogeneous mixture within accuracy $\epsilon$ by combining $O(1/\epsilon^2)$ shallow models, where the hidden constant may decrease exponentially w.r.t. the depth. Experiments on both synthetic and real datasets confirmed our theoretical findings.

Acknowledgement

The authors gratefully acknowledge support from the NSERC discovery program.

References

[1] Noga Alon. Perturbed identity matrices have high rank: Proof and applications. Combinatorics, Probability and Computing, 18(1-2):3-15, 2009.

[2] Animashree Anandkumar, Daniel Hsu, Furong Huang, and Sham M. Kakade. Learning mixtures of tree graphical models. In Advances in Neural Information Processing Systems, 2012.

[3] Leonard E. Baum and Ted Petrie. Statistical inference for probabilistic functions of finite state Markov chains. The Annals of Mathematical Statistics, 37(6):1554-1563, 1966.

[4] Richard Caron and Tim Traynor. The zero set of a polynomial. Technical report, 2005.

[5] Myung Jin Choi, Vincent Y. F. Tan, Animashree Anandkumar, and Alan S. Willsky. Learning latent tree graphical models.
Journal of Machine Learning Research, 12:1771-1812, 2011.

[6] Nadav Cohen, Or Sharir, Yoav Levine, Ronen Tamari, David Yakira, and Amnon Shashua. Analysis and design of convolutional networks via hierarchical tensor decompositions, 2017. arXiv:1705.02302v4.

[7] Nadav Cohen, Or Sharir, and Amnon Shashua. On the expressive power of deep learning: A tensor analysis. In Conference on Learning Theory, pages 698-728, 2016.

[8] Nadav Cohen and Amnon Shashua. Convolutional rectifier networks as generalized tensor decompositions. In ICML, 2016.

[9] Adnan Darwiche. A differential approach to inference in Bayesian networks. Journal of the ACM (JACM), 50(3):280-305, 2003.

[10] Olivier Delalleau and Yoshua Bengio. Shallow vs. deep sum-product networks. In Advances in Neural Information Processing Systems, pages 666-674, 2011.

[11] Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.

[12] Marguerite Frank and Philip Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1-2):95-110, 1956.

[13] Wolfgang Hackbusch. Tensor Spaces and Numerical Tensor Calculus. Springer, 2012.

[14] M. Ishteva. Tensors and latent variable models.
In The 12th International Conference on Latent Variable Analysis and Signal Separation (LVA/ICA), pages 49-55, 2015.

[15] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.

[16] Yann LeCun, Fu Jie Huang, and Leon Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2004), volume 2, pages II-104. IEEE, 2004.

[17] Jonathan Q. Li and Andrew R. Barron. Mixture density estimation. In Advances in Neural Information Processing Systems, pages 279-285, 2000.

[18] James Martens and Venkatesh Medabalimi. On the expressive efficiency of sum product networks. arXiv preprint arXiv:1411.7717, 2014.

[19] Geoffrey McLachlan and David Peel. Finite Mixture Models. John Wiley & Sons, 2004.

[20] Marina Meila and Michael I. Jordan. Learning with mixtures of trees. Journal of Machine Learning Research, 1:1-48, 2000.

[21] Raphaël Mourad, Christine Sinoquet, Nevin L. Zhang, Tengfei Liu, and Philippe Leray. A survey on latent tree models and applications. Journal of Artificial Intelligence Research, 47:157-203, 2013.

[22] Hien D. Nguyen and Geoffrey J. McLachlan. On approximations via convolution-defined mixture models. arXiv preprint arXiv:1611.03974, 2016.

[23] Robert Peharz, Robert Gens, Franz Pernkopf, and Pedro Domingos. On the latent variable interpretation in sum-product networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(10):2030-2044, 2017.

[24] Hoifung Poon and Pedro Domingos. Sum-product networks: A new deep architecture. In Uncertainty in Artificial Intelligence (UAI), 2011.

[25] Lawrence R. Rabiner.
A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257-286, 1989.

[26] Or Sharir, Ronen Tamari, Nadav Cohen, and Amnon Shashua. Tensorial mixture models, 2018. arXiv:1610.04167v5.

[27] Le Song, Haesun Park, Mariya Ishteva, Ankur Parikh, and Eric Xing. Hierarchical tensor decomposition of latent tree graphical models. In ICML, 2013.

[28] Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer, 2009.

[29] Han Zhao, Mazen Melibari, and Pascal Poupart. On the relationship between sum-product networks and Bayesian networks. In International Conference on Machine Learning, pages 116-124, 2015.