{"title": "Kernel Embeddings of Latent Tree Graphical Models", "book": "Advances in Neural Information Processing Systems", "page_first": 2708, "page_last": 2716, "abstract": "Latent tree graphical models are natural tools for expressing long range and hierarchical dependencies among many variables which are common in computer vision, bioinformatics and natural language processing problems. However, existing models are largely restricted to discrete and Gaussian variables due to computational constraints; furthermore, algorithms for estimating the latent tree structure and learning the model parameters are largely restricted to heuristic local search. We present a method based on kernel embeddings of distributions for latent tree graphical models with continuous and non-Gaussian variables. Our method can recover the latent tree structures with provable guarantees and perform local-minimum free parameter learning and efficient inference. Experiments on simulated and real data show the advantage of our proposed approach.", "full_text": "Kernel Embeddings of Latent Tree Graphical Models\n\nLe Song\n\nCollege of Computing\n\nGeorgia Institute of Technology\n\nlsong@cc.gatech.edu\n\nAnkur P. Parikh\n\nSchool of Computer Science\nCarnegie Mellon University\napparikh@cs.cmu.edu\n\nEric P. Xing\n\nSchool of Computer Science\nCarnegie Mellon University\nepxing@cs.cmu.edu\n\nAbstract\n\nLatent tree graphical models are natural tools for expressing long range and hi-\nerarchical dependencies among many variables which are common in computer\nvision, bioinformatics and natural language processing problems. However, exist-\ning models are largely restricted to discrete and Gaussian variables due to com-\nputational constraints; furthermore, algorithms for estimating the latent tree struc-\nture and learning the model parameters are largely restricted to heuristic local\nsearch. 
We present a method based on kernel embeddings of distributions for\nlatent tree graphical models with continuous and non-Gaussian variables. Our\nmethod can recover the latent tree structures with provable guarantees and per-\nform local-minimum free parameter learning and ef\ufb01cient inference. Experiments\non simulated and real data show the advantage of our proposed approach.\n\nIntroduction\n\n1\nReal world problems often produce high dimensional features with sophisticated statistical depen-\ndency structures. One way to compactly model these statistical structures is to use probabilistic\ngraphical models that relate the observed features to a set of latent or hidden variables. By de\ufb01n-\ning a joint probabilistic model over observed and latent variables, the marginal distribution of the\nobserved variables is obtained by integrating out the latent ones. This allows complex distributions\nover observed variables (e.g., clique models) to be expressed in terms of more tractable joint models\n(e.g., tree models) over the augmented variable space. Probabilistic models with latent variables\nhave been deployed successfully to a diverse range of problems such as in document analysis [3],\nsocial network modeling [10], speech recognition [18] and bioinformatics [5].\nIn this paper, we will focus on latent variable models where the latent structures are trees (we call it\na \u201clatent tree\u201d for short). In these tree-shaped graphical models, the leaves are the set of observed\nvariables (e.g., taxa, pixels, words) while the internal nodes are hidden and intuitively \u201crepresent\u201d\nthe common properties of their descendants (e.g., distinct ancestral species, objects in an image,\nlatent semantics). This class of models strike a nice balance between their representation power\n(e.g., ability to model cliques) and the complexity of learning and inference processes on these\nstructures (e.g., message passing is exact on trees). 
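To make concrete how marginalizing a hidden variable couples the observed leaves, here is a minimal discrete sketch of my own (a toy example, not from the paper): a binary hidden root with two observed binary leaves.

```python
import numpy as np

# Toy latent tree: hidden root H with two observed leaves X1, X2 (all binary).
# The joint factorizes as P(H, X1, X2) = P(H) P(X1|H) P(X2|H); marginalizing
# out H couples the leaves even though they are independent given H.
p_h = np.array([0.6, 0.4])                    # P(H)
p_x1_h = np.array([[0.9, 0.2], [0.1, 0.8]])   # P(X1|H); each column sums to 1
p_x2_h = np.array([[0.7, 0.3], [0.3, 0.7]])   # P(X2|H)

# P(X1, X2) = sum_h P(h) P(X1|h) P(X2|h)
p_obs = np.einsum('h,ih,jh->ij', p_h, p_x1_h, p_x2_h)
```

The resulting `p_obs` is a proper distribution over the leaves but is not the outer product of its marginals: the latent root induces dependence between the observed variables, which is exactly the representational benefit described above.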
In particular, we will study the problems of\nestimating the latent tree structures, learning the model parameters and performing inference on\nthese models for continuous and non-Gaussian variables where it is not easy to specify a parametric\nfamily.\nIn previous works, the challenging problem of estimating the structure of latent trees has largely\nbeen tackled by heuristics since the search space of structures is intractable. For instance, Zhang et\nal. [28] proposed a search heuristic for hierarchical latent class models by de\ufb01ning a series of local\nsearch operations and using EM to compute the likelihood of candidate structures. Harmeling and\nWilliams [8] proposed a greedy algorithm to learn binary trees by joining two nodes with a high\nmutual information and iteratively performing EM to compute the mutual information among newly\nadded hidden nodes. Alternatively, Bayesian hierarchical clustering [9] is an agglomerative cluster-\ning technique that merges clusters based on a statistical hypothesis test. Many other local search\nheuristics based on maximum parsimony and maximum likelihood methods can also be found from\n\n1\n\n\fthe phylogenetic community [21]. However, none of these methods extend easily to the nonpara-\nmetric case since they require the data to be discrete or to have a parametric form such that statistical\ntests or likelihoods/EM can be easily computed.\nGiven the structures of the latent trees, learning the model parameters has predominantly relied on\nlikelihood maximization and local search heuristics such as expectation maximization (EM) [6].\nBesides the problem of local minima, non-Gaussian statistical features such as multimodality and\nskewness may pose additional challenges for EM. For instance, parametric models such as mixture\nof Gaussians may lead to an exponential blowup in terms of representation during the inference\nstage of EM, so further approximations may be needed to make these cases tractable. 
Furthermore,\nEM can require many iterations to reach a prescribed training precision.\nIn this paper, we propose a method for latent tree models with continuous and non-Gaussian ob-\nservation based on the concept of kernel embedding of distributions [23]. The problems we try to\naddress are: how to estimate the structures of latent trees with provable guarantees, and how to\nperform local-minimum-free parameter learning and ef\ufb01cient inference given the tree structures, all\nin nonparametric fashion. The main \ufb02avor of our method is to exploit the spectral properties of\nthe joint embedding (or covariance operators) in both the structure recovery and learning/inference\nstage. For the former, we de\ufb01ne a distance measure between variables based on the singular value\ndecomposition of covariance operators. This allows us to generalize some of the distance based\nlatent tree learning procedures such as neighbor joining [20] and the recursive grouping methods [4]\nto the nonparametric setting. These distance based methods come with strong statistical guarantees\nwhich carry over to our nonparametric generalization. After the structure is recovered, we further\nuse the covariance operator and its principal singular vectors to design surrogates for parameters\nof the latent variables (called a \u201cspectral algorithm\u201d). One advantage of our spectral algorithm is\nthat it is local-minimum-free and hence amenable for further statistical analysis (see [11, 25, 16] for\nprevious work on spectral algorithms). Last, we will demonstrate the advantage of our method over\nexisting approaches in both simulation and real data experiments.\n2 Latent Tree Graphical Models\nWe will focus on latent variable models where the observed variables are continuous and non-\nGaussian and the conditional independence structures are speci\ufb01ed by trees. 
We will use uppercase letters to denote random variables (e.g., Xi) and lowercase letters their instantiations (e.g., xi). A latent tree model defines a joint distribution over a set, O = {X1, . . . , XO}, of O observed variables and a set, H = {XO+1, . . . , XO+H}, of H hidden variables. The complete set of variables is denoted by X = O ∪ H. For simplicity, we will assume that all observed variables have the same domain XO, and all hidden variables take values from XH and have finite dimension d.
The joint distribution of X in a latent tree model is fully characterized by a set of conditional distributions (CDs). More specifically, we can select an arbitrary latent node in the tree as the root, and reorient all edges away from the root. Then the set of CDs between nodes and their parents, P(Xi|Xπi), is sufficient to characterize the joint distribution (for the root node Xr, we set P(Xr|Xπr) = P(Xr); and we use P to refer to the density in the continuous case), P(X) = ∏_{i=1}^{O+H} P(Xi|Xπi). Compared to tree models which are defined solely on observed variables, latent tree models encompass a much larger class of models, allowing more flexibility in modeling observed variables. This is evident if we sum out the latent variables in the joint distribution,

P(O) = Σ_H ∏_{i=1}^{O+H} P(Xi|Xπi).    (1)

This expression leads to complicated conditional independence structures between observed variables depending on the tree topology. In other words, latent tree models allow complex distributions over observed variables (e.g., clique models) to be expressed in terms of more tractable joint models over the augmented variable space. This can lead to a significant saving in model parametrization.
For simplicity of explanation, we will focus on latent tree structures where each internal node has exactly 3 neighbors. We can reroot the tree and redirect all the edges away from the root.
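The rerooting step above is a simple traversal; the sketch below (helper names are my own) orients an undirected tree away from a chosen root, recovering the parent map π used in P(X) = ∏_i P(Xi|Xπi).

```python
from collections import deque

def reroot(adjacency, root):
    """Orient all edges of an undirected tree away from `root`.

    Returns a parent map: parent[root] is None, and parent[v] is the
    neighbor of v on the path toward the root (i.e., pi(v)).
    """
    parent = {root: None}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for nbr in adjacency[node]:
            if nbr not in parent:      # not yet visited
                parent[nbr] = node
                queue.append(nbr)
    return parent

# Toy tree with hidden nodes 'h1', 'h2' and observed leaves 1..4; each
# internal node has exactly 3 neighbors, as assumed in the text.
adj = {'h1': [1, 2, 'h2'], 'h2': ['h1', 3, 4],
       1: ['h1'], 2: ['h1'], 3: ['h2'], 4: ['h2']}
parents = reroot(adj, 'h1')
```

Rerooting at a different node changes the parent map but not the represented joint distribution, which is why the root may be chosen arbitrarily.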
For a variable Xs, we use αs to denote its sibling, πs to denote its parent, ιs to denote its left child and ρs to denote its right child; the root node will have 3 children, and we use ωs to denote the extra child. All the observed variables are leaves in the tree, and we will use ι*s, ρ*s, π*s to denote an observed variable which is found by tracing in the direction from node s to its left child ιs, right child ρs, and its parent πs respectively. s* denotes any observed variable in the subtree rooted at node s.

3 Kernel Density Estimator and Hilbert Space Embedding
Kernel density estimation (KDE) is a nonparametric way of fitting the density of continuous random variables with non-Gaussian statistical features such as multi-modality and skewness [22]. However, traditional KDE cannot model the latent tree structure. In this paper, we will show that the kernel density estimator can be augmented to deal with latent tree structures using a recent concept called Hilbert space embedding of distributions [23]. Next, we will first explain the basic idea of KDE and distribution embeddings, and show how they are related.

Kernel density estimator. Given a set of i.i.d. samples S = {(x¹ᵢ, . . . , xᴼᵢ)}ⁿᵢ₌₁ from P(X1, . . . , XO), KDE estimates the density via

P̂(x1, . . . , xO) = (1/n) Σᵢ₌₁ⁿ ∏ⱼ₌₁ᴼ k(xj, xʲᵢ),    (2)

where k(x, x′) is a kernel function. A commonly used kernel function, which we will focus on, is the Gaussian RBF kernel k(x, x′) = (1/√(2πσ²)) exp(−‖x − x′‖²/(2σ²)).
For the Gaussian RBF kernel, there exists a feature map φ : ℝ ↦ F such that k(x, x′) = ⟨φ(x), φ(x′)⟩_F, and the feature space has the reproducing property, i.e. for all f ∈ F, f(x) = ⟨f, φ(x)⟩_F. Products of kernels are also kernels, which allows us to write ∏ⱼ₌₁ᴼ k(xj, x′j) as a single inner product ⟨⊗ⱼ₌₁ᴼ φ(xj), ⊗ⱼ₌₁ᴼ φ(x′j)⟩_{F^O}. Here ⊗ⱼ₌₁ᴼ denotes the tensor product of O feature vectors which results in a rank-1 tensor of order O. This inner product can be understood by analogy to the finite dimensional case: given x, y, z, x′, y′, z′ ∈ ℝᵈ, (x⊤x′)(y⊤y′)(z⊤z′) = ⟨x ⊗ y ⊗ z, x′ ⊗ y′ ⊗ z′⟩.

Hilbert space embedding. C_O := E_O[⊗ⱼ₌₁ᴼ φ(Xj)] is called the Hilbert space embedding of distribution P(O) with tensor features ⊗ⱼ₌₁ᴼ φ(Xj). In other words, the embedding of a distribution is simply the expected feature of that distribution. The essence of Hilbert space embedding is to represent distributions as elements in Hilbert spaces, and then subsequent manipulation of the distributions can be carried out via Hilbert space operations such as inner product and distance. We next show how to represent a KDE using distribution embeddings.
Taking the expected value of a KDE with respect to the random sample S,

E_S[P̂(x1, . . . , xO)] = E_O[∏ⱼ₌₁ᴼ k(xj, Xj)] = ⟨E_O[⊗ⱼ₌₁ᴼ φ(Xj)], ⊗ⱼ₌₁ᴼ φ(xj)⟩_{F^O},    (3)

we see that this expected value is the inner product between the embedding C_O and the tensor features ⊗ⱼ₌₁ᴼ φ(xj). If we replace the embedding C_O by its finite sample estimate Ĉ_O := (1/n) Σᵢ₌₁ⁿ (⊗ⱼ₌₁ᴼ φ(xʲᵢ)), we recover the density estimator in (2). Alternatively, using tensor notation (described in supplemental), we can rewrite equation (3) as

⟨E_O[⊗ⱼ₌₁ᴼ φ(Xj)], ⊗ⱼ₌₁ᴼ φ(xj)⟩_{F^O} = C_O ×̄_O φ(xO) · · · ×̄_2 φ(x2) ×̄_1 φ(x1),    (4)

where C_O is a big tensor of order O which can be difficult to store and maintain. While traditional KDE can not make use of the fact that the embedding C_O originates from a distribution with latent tree structure, the embedding view actually allows us to exploit this special structure and further decompose C_O to simpler tensors of much lower orders.
4 Kernel Embedding of Latent Tree Graphical Models
In this section, we assume that the structures of the latent tree graphical models are given, and we will deal with structure learning in the next section. We will show that the tensor expression of KDE in (4) can be computed recursively using a collection of lower order tensors. Essentially, these lower order tensors correspond to the conditional densities in the latent tree graphical models; and the recursive computations try to integrate out the latent variables in the model, and they correspond to the steps in the message passing algorithm for graphical model inference. The challenge is that the message passing algorithm becomes nontrivial to represent and implement in continuous and nonparametric settings. Previous methods may lead to exponential blowup in their message representation and hence various approximations are needed, such as expectation propagation [15], mixture of Gaussian simplification [27], and sampling [12].
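The equivalence between the KDE value and a contraction of the empirical embedding, as in equations (3)-(4), can be checked exactly in a finite-dimensional feature space. The following toy sketch (the feature map and data are my own choices, not from the paper) uses an explicit 3-dimensional φ so the embedding of two variables is just a 3×3 matrix.

```python
import numpy as np

# With an explicit feature map phi, k(x, x') = <phi(x), phi(x')>, the "KDE"
# value (1/n) sum_i k(x1, x1_i) k(x2, x2_i) equals the contraction of the
# empirical embedding C = (1/n) sum_i phi(x1_i) (x) phi(x2_i) with
# phi(x1) (x) phi(x2) -- the O = 2 case of equations (3)-(4).
rng = np.random.default_rng(1)

def phi(x):
    """Explicit polynomial feature map; the induced kernel is its inner product."""
    return np.array([1.0, x, x**2])

X1, X2 = rng.normal(size=20), rng.normal(size=20)
C = np.mean([np.outer(phi(a), phi(b)) for a, b in zip(X1, X2)], axis=0)

x1, x2 = 0.3, -0.7
direct = np.mean([(phi(x1) @ phi(a)) * (phi(x2) @ phi(b))
                  for a, b in zip(X1, X2)])        # kernel-sum form (2)
via_embedding = phi(x1) @ C @ phi(x2)              # contraction form (4)
```

For an infinite-dimensional RBF feature space the same identity holds with Gram matrices in place of explicit features; the toy version just makes the equality easy to verify numerically.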
In contrast, the distribution embedding view allows us to represent and implement the message passing algorithm efficiently without resorting to approximations. Furthermore, it also allows us to develop a local-minimum-free algorithm for learning the parameters of latent tree graphical models.

4.1 Covariance Operator and Conditional Embedding Operator
We will first explain the concept of conditional embedding operators, which are the nonparametric counterparts of conditional probability tables in the discrete case. Conditional embedding operators will be the key building blocks of a nonparametric message passing algorithm, just as conditional probability tables are for the ordinary message passing algorithm.
Following [7], we first define the covariance operator CXsXt which allows us to compute the expectation of the product of functions f(Xs) and g(Xt), i.e., EXsXt[f(Xs)g(Xt)], using linear operations in the RKHS. More formally, let CXsXt : F ↦ F be such that for all f, g ∈ F,

EXsXt[f(Xs)g(Xt)] = ⟨f, EXsXt[φ(Xs) ⊗ φ(Xt)] g⟩_F = ⟨f, CXsXt g⟩_F = Cst ×̄_2 g ×̄_1 f,    (5)

where we abbreviate the notation CXsXt as Cst, and will follow such abbreviation in the rest of the paper (e.g., Cs² is an abbreviation for CXsXs). This can be understood by analogy with the finite dimensional case: if x, y, z, v ∈ ℝᵈ, then x⊤(yz⊤)v = (yz⊤) ×̄_2 v ×̄_1 x, where we use the tensor-vector multiplication notation from [13] (see supplemental for details). In other words, the covariance operator is also the embedding of the joint distribution P(Xs, Xt).
Then the conditional embedding operator can be defined via covariance operators according to Song et al. [26].
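A finite-dimensional sketch of this construction, C_{t|s} = C_ts C_ss⁻¹ (with a small ridge term, the standard regularized estimator), is easy to sanity-check: with identity features φ(x) = x and linearly related data, applying the estimated operator to a point should approximate the conditional expectation. The data-generating setup below is my own toy, not from the paper.

```python
import numpy as np

# Estimate the (1-dimensional) conditional embedding operator
# C_{t|s} = C_ts (C_ss + lam)^(-1) from samples, with phi(x) = x.
# Since Xt = 2 Xs + noise, we expect C_{t|s} phi(xs) ~= E[Xt | xs] = 2 xs.
rng = np.random.default_rng(2)
n = 50000
Xs = rng.normal(size=n)
Xt = 2.0 * Xs + 0.1 * rng.normal(size=n)

C_ts = np.mean(Xt * Xs)          # empirical cross-covariance
C_ss = np.mean(Xs * Xs)          # empirical (auto-)covariance
lam = 1e-6                       # ridge regularizer for the inverse
C_t_given_s = C_ts / (C_ss + lam)

cond_mean = C_t_given_s * 1.5    # conditional mean embedding at xs = 1.5
```

In the genuinely nonparametric case the same formula is applied to Gram matrices of RBF features rather than scalars, but the algebra is identical.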
A conditional embedding operator allows us to compute conditional expectations\nEXt|xs [f (Xt)] as linear operations in the RKHS. Let Ct|s := CtsC\u22121\n(6)\nIn other words, the operator Ct|s takes the feature map \u03c6(xs) of the point on which we condition,\nand outputs the conditional expectation of the feature \u03c6(Xt) with respect to P(Xt|xs). Although the\nformula looks similar to the Gaussian case, it is important to note that the conditional embedding\noperator allows us to compute the conditional expectation of any f \u2208 F, regardless of the distribu-\ntion of the random variable in feature space (aside from the condition that h(\u00b7) := EXt|Xs=\u00b7[f (Xt)]\nis in the RKHS on Xs, as noted by Song et al.). In particular, we do not need to assume the random\nvariables have a Gaussian distribution in feature space.\n4.2 Representation for Message Passing Algorithm\nFor simplicity, we will focus on latent trees where all latent variables have degree 3 (but our\nmethod can be generalized to higher degrees). 
We \ufb01rst introduce latent variables into equation (3),\nEO\u222aH\n; Then we integrate out the latent variables according to the latent tree\nstructure using a message passing algorithm [17],\n\nEXt|xs[f (Xt)] = (cid:10)f, EXt|xs [\u03c6(Xt)(cid:11)\n\nss such that for all f \u2208 F,\nF = Ct|s \u00af\u00d72 \u03c6(xs) \u00af\u00d71 f.\n\nF = (cid:10)f, Ct|s\u03c6(xs)(cid:11)\n\n(cid:104)(cid:81)O\n\nj=1 k(xj, Xj)\n\n(cid:105)\n\n* At a leaf node (always observed variable) we pass the following message to its parent\n\n** An internal latent variable aggregates incoming messages from its two children and then\n\nms(X\u03c0s ) = EXs|X\u03c0s\nsends an outgoing message to its own parent ms(X\u03c0s) = EXs|X\u03c0s\n\n[k(xs, Xs)].\n\n[m\u03b9s (Xs)m\u03c1s(Xs)].\n\nis integrated out br := EO[(cid:81)O\n\n*** Finally, at the root node, all incoming messages are multiplied together and the root variable\n\nj=1 k(xj, Xj)] = EXr [m\u03b9s (Xr)m\u03c1s(Xr)m\u03c9r (Xr)].\n\nThe challenge is that message passing becomes nontrivial to represent and implement in continuous\nand nonparametric settings. Previous methods may lead to exponential blowup in their message\nrepresentation and hence various approximations are needed, such as expectation propagation [15],\nmixture of Gaussian simpli\ufb01cation [27], and sampling [12].\nSong et al. [24] show that the above 3 message update operations can be expressed using Hilbert\nspace embeddings [26], and no further approximation is needed in the message computation. Basi-\ncally, the embedding approach assume that messages are functions in the reproducing kernel Hilbert\nspace, and message update is an operator that takes several functions as inputs and output another\nfunction in the reproducing kernel Hilbert space. 
More speci\ufb01cally, message updates are linear (or\nmulti-linear) operations in feature space,\n\n\u00af\u00d71 \u03c6(xs)\n* At leaf nodes, we have mts(\u00b7) = EXs|X\u03c0s =\u00b7[k(xs, Xs)] = C(cid:62)\ns|\u03c0s\n** At internal nodes, we de\ufb01ne a tensor product reproducing kernel Hilbert space H := F\u2297F,\nunder which the product of incoming messages can be written as a single inner product,\nm\u03b9s(Xs) m\u03c1s(Xs) = (cid:104)m\u03b9s , \u03c6(Xs)(cid:105)(cid:104)m\u03c1s, \u03c6(Xs)(cid:105) = (cid:104)m\u03b9s \u2297 m\u03c1s, \u03c6(Xs) \u2297 \u03c6(Xs)(cid:105)H\nThen the message update becomes\n\nms(\u00b7) =(cid:10)m\u03b9s \u2297 m\u03c1s , EXs|X\u03c0s =\u00b7 [\u03c6(Xs) \u2297 \u03c6(Xs)](cid:11)\n\n\u03c6(xs) = Cs|\u03c0s\n\nH = Cs2|\u03c0s\n\n\u00af\u00d72 m\u03c1s\n\n\u00af\u00d71 m\u03b9s\n\n(7)\n\n4\n\n\f\u03c0s\u03c0s.\n\n\u00af\u00d71 m\u03b9r\n\n\u00af\u00d72 m\u03c1r\n\n= Cr3 \u00af\u00d73 m\u03c9r\n\n[\u03c6(Xs) \u2297 \u03c6(Xs) \u2297 \u03c6(Xs)], and the operator C\u22121\n\n*** Finally, at the root nodes, we use the property of tensor product features and arrives at:\n\nwhere we de\ufb01ne the conditional embedding operator for the tensor features \u03c6(Xs)\u2297\u03c6(Xs).\nBy analogy with (6)), Cs2|\u03c0s is de\ufb01ned in terms of a covariance operator Cs2\u03c0s\n:=\nEXsX\u03c0s\nEr[m\u03b9r (Xr) m\u03c1r (Xr) m\u03c9r (Xr)] = (cid:104)m\u03b9r \u2297 m\u03c1r \u2297 m\u03c9r , EXr [\u03c6(Xr) \u2297 \u03c6(Xr) \u2297 \u03c6(Xr)](cid:105)\n(8)\nWe note that the traditional kernel density estimator needs to estimate a tensor of order O involving\nall observed variables (equation (4)). By making use of the conditional independence structure of\nlatent tree models, we only need to estimate tensors of much smaller orders. 
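The three update rules (*), (**), (***) have an exact discrete analog in which conditional embedding operators become conditional probability tables, which makes the recursion easy to verify against brute force. The tree below (a toy of my own: hidden root R with children H, X3, X4, and hidden H with children X1, X2) mirrors the structure assumed in the text.

```python
import numpy as np

rng = np.random.default_rng(3)

def cpt(rows, cols):
    """Random conditional probability table P(child|parent); columns sum to 1."""
    t = rng.random((rows, cols))
    return t / t.sum(axis=0)

p_r = np.array([0.5, 0.3, 0.2])           # P(R), hidden root with 3 children
P_h_r = cpt(3, 3)                         # P(H|R), hidden internal node
P_x1_h, P_x2_h = cpt(2, 3), cpt(2, 3)     # observed leaves under H
P_x3_r, P_x4_r = cpt(2, 3), cpt(2, 3)     # observed leaves under R
x = (0, 1, 1, 0)                          # observed evidence for X1..X4

# (*)  leaf messages are functions of the parent's state:
m1, m2 = P_x1_h[x[0]], P_x2_h[x[1]]
m3, m4 = P_x3_r[x[2]], P_x4_r[x[3]]
# (**) internal node H multiplies its children's messages, integrates H out,
#      and sends the result (a function of R) to its parent:
m_h = P_h_r.T @ (m1 * m2)
# (***) the root multiplies all incoming messages and integrates R out:
b = p_r @ (m_h * m3 * m4)
```

The belief `b` matches the entry of the full joint tensor at the evidence, while only ever touching two-variable tables, which is the point of the decomposition.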
Particularly, we only\nneed to estimate tensors involving two variables (for each parent-child pair), and then the density\ncan be estimated via message passing algorithms using these tensors of much smaller order.\nThe drawback of the representations in (7) and (8) is that they require exact knowledge of conditional\nembedding operators associated with latent variables, but none of these are available in training.\nNext we will show that we can still make use of the tensor decomposition representation without the\nneed for recovering the latent variables explicitly.\n4.3 Spectral Algorithm for Learning Latent Tree Parameters\nOur observation from (7) and (8) is that if we can recover the conditional embedding operators\nassociated with latent variables up to some invertible transformations, we will still be able to com-\n\npute latent tree density correctly. For example, we can transform the messages: (cid:101)m\u03b9s = T\u03b9s m\u03b9s,\n(cid:101)m\u03c1s = T\u03c1sm\u03c1s, and (cid:101)m\u03c9s = T\u03c9sm\u03c9s, and we can update these transformed messages:\n\u00af\u00d71 (cid:101)m\u03b9s\n\u00af\u00d71 (cid:101)m\u03b9r\n\n* At leaf nodes, (cid:101)ms = T (cid:62)\n** At internal nodes, (cid:101)ms = (Cs2|\u03c0s \u00d71 T \u22121\n\ns C(cid:62)\ns|\u03c0s\n*** At the root, br = (Cr3 \u00d71 T \u22121\n\ns ) \u00af\u00d72 (cid:101)m\u03c1s\n\u00af\u00d72 (cid:101)m\u03c1r\n\n) \u00af\u00d73 (cid:101)m\u03c9r\n\n\u00d72 T \u22121\n\u00d73 T \u22121\n\n\u00d72 T \u22121\n\n\u00d73 T (cid:62)\n\n\u03c6(xs)\n\n\u03c9r\n\n\u03c1r\n\n\u03c1s\n\n\u03b9r\n\n\u03b9s\n\nwithout changing the \ufb01nal br. Basically, all the invertible transformations T cancel out with each\nother. These transformations provide us an additional degree of freedom for algorithm design: we\ncan choose the invertible transforms cleverly, such that the transformed representation can be recov-\nered from observed quantities without the need for accessing the latent variables. 
This representation\nis related to but different from that of [16] for discrete variables which uses only 3rd order tensors.\nThe kernel case is more challenging and requires qth order tensors (where q is the degree of a node).\nMore speci\ufb01cally, these transformations T can be constructed from cross covariance operators of\n\u22121 and let\ncertain pairs of observed variables and their singular vectors U. We set Ts = (U(cid:62)\nUs be the top d right eigenvectors of C\u03c0\u2217\ns s\u2217. Consider the simple case for the leaf node (\u2217). In this\ns = U(cid:62)\ncase, we can set s\u2217 = s and get that T \u22121\ns Cs|\u03c0s )\nC\u03c0\u2217\n\ns Cs\u2217|\u03c0s)\ns Cs|\u03c0s. Consider the following expansion:\n) = \u03c6(xs)(cid:62)Cs\u03c0\u2217\n\n(9)\n(10)\nHere \u2020 denotes pseudo-inverse. The general pattern is that we can relate the transformed latent\nquantity to observed quantities in two different ways such that we can solve for the transformed\ns in the\n\u03c9r at the root. We summarize the\n\nlatent quantity. 
A similar strategy can be applied to (cid:101)Cs2|\u03c0s := Cs2|\u03c0s \u00d71 T \u22121\ninternal message update, and the (cid:101)Cr3 := Cr3 \u00d71 T \u22121\n\n\u21d2 (cid:101)ms = (C\u03c0\u2217\n) = \u03c6T (xs)Cs|\u03c0s (U(cid:62)\n\u2020\ns sUs)\n\n(U(cid:62)\ns s\u03c6(xs)\n\n(cid:101)m(cid:62)\ns (U(cid:62)\n\ns Cs|\u03c0s)(C\u03c02\n\nresults on how to compute the transformed quantities below (see supplemental for details).\n\n\u00d72 T \u22121\n\n\u00d73 T \u22121\n\n\u00d72 T \u22121\n\nC(cid:62)\ns|\u03c0s\n\u03c0\u2217\n\n\u00d73 T (cid:62)\n\ns Cs\u03c0\u2217\n\n\u22121\n\n\u03c1s\n\n\u03c1s\n\n\u03b9s\n\n\u03b9s\n\ns\n\ns\n\ns\n\n* At leaf nodes, (cid:101)ms = (C\u03c0\u2217\n** At internal nodes, (cid:101)Cs2|\u03c0s = C\u03b9\u2217\n*** At the root, (cid:101)Cr3 = C\u03b9\u2217\n\n\u2020\ns sUs)\n\nC\u03c0\u2217\ns \u03c1\u2217\ns \u03c0\u2217\n\u00d71 U(cid:62)\n\nr \u03c1\u2217\n\nr \u03c9\u2217\n\n\u03b9r\n\nr\n\ns\n\ns s\u03c6(xs).\n\u00d71 U(cid:62)\n\u00d72 U(cid:62)\n\n\u03b9s\n\n\u03c1r\n\n\u00d72 U(cid:62)\n\u03c1s\n\u00d73 U(cid:62)\n.\n\n\u03c9r\n\n\u00d73 (C\u03c0\u2217\ns \u03b9\u2217\n\ns\n\nUs)\u2020.\n\nThe above results give us an ef\ufb01cient algorithm for computing the expected kernel density br which\ncan take into account the latent tree structures while at the same time avoiding the local minimum\nproblems associated with explicitly recovering latent parameters. The main computation only in-\nvolves tensor-matrix and tensor-vector multiplications, and a sequence of singular value decompo-\nsitions of pairwise cross covariance operators. After we obtain the transformed quantities, we can\nthen use them in the message passing algorithm to obtain the \ufb01nal belief br.\n\ni=1 drawn i.i.d. from a P(O), the spectral algorithm for latent\ntrees proceeds by replacing all population quantities by their empirical counterpart. For instance,\n\n1, . . . 
, xi\n\nGiven a sample S =(cid:8)(xi\n\nO)(cid:9)n\n\n5\n\n\fthe SVD of covariance operators between Xs and Xt can be estimated by \ufb01rst forming matrices\nvalue decomposition of (cid:98)C can be carried out to obtain an estimate for (cid:98)U (See [25] for more details).\nn \u03a6\u03a5(cid:62). Then a singular\n\u03a5 = (\u03c6(x1\n\ns )) and \u03a6 = (\u03c6(x1\n\nt ), . . . , \u03c6(xn\n\ns), . . . , \u03c6(xn\n\nt )), and estimate (cid:98)Cts = 1\n\n5 Structure Learning of Latent Tree Graphical Models\nThe last section focused on density estimation where the structure of the latent tree is known. In this\nsection, we focus on learning the structure of the latent tree. Structure learning of latent trees is a\nchallenging problem that has largely been tackled by heuristics since the search space of structures\nis intractable. The additional challenge in our case is that the observed variables are continuous and\nnon-Gaussian, which we are not aware of any existing methods for this problem.\nStructure learning algorithm We develop a distance based method for constructing latent trees of\ncontinuous, non-Gaussian variables. The idea is that if we have a tree metric (distance) between dis-\ntributions on observed nodes, we can use the property of the tree metric to reconstruct the latent tree\nstructure using algorithms such as neighbor joining [20] and the recursive grouping algorithm [4].\nThese methods take a distance matrix among all pairs of observed variables as input and output a\ntree by iteratively adding hidden nodes. While these methods are iterative, they have strong theo-\nretical guarantees on structure recovery when the true distance matrix forms an additive tree metric.\nHowever, most previously known tree metrics are de\ufb01ned for discrete and Gaussian variables. The\nadditional challenge in our case is that the observed variables are continuous and non-Gaussian. 
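Distance-based builders such as neighbor joining work because path-additive distances on a tree satisfy the classical four-point condition, which pins down the quartet topology. A small self-contained check (the toy tree and edge lengths are my own) illustrates this:

```python
# Toy tree with leaves a, b, c, d and internal nodes u, v:  a-u, b-u, u-v,
# c-v, d-v.  Pairwise distances are path sums of edge lengths; for the
# quartet ((a,b),(c,d)) additivity forces the four-point condition
#   d(a,b) + d(c,d)  <  d(a,c) + d(b,d)  =  d(a,d) + d(b,c),
# which is exactly the signal that distance-based structure learners exploit.
edge = {('a', 'u'): 1.0, ('b', 'u'): 2.0, ('u', 'v'): 3.0,
        ('c', 'v'): 1.5, ('d', 'v'): 0.5}
path = {('a', 'b'): [('a', 'u'), ('b', 'u')],
        ('c', 'd'): [('c', 'v'), ('d', 'v')],
        ('a', 'c'): [('a', 'u'), ('u', 'v'), ('c', 'v')],
        ('b', 'd'): [('b', 'u'), ('u', 'v'), ('d', 'v')],
        ('a', 'd'): [('a', 'u'), ('u', 'v'), ('d', 'v')],
        ('b', 'c'): [('b', 'u'), ('u', 'v'), ('c', 'v')]}
d = {pair: sum(edge[e] for e in p) for pair, p in path.items()}

s1 = d[('a', 'b')] + d[('c', 'd')]   # pairs on the same side of edge u-v
s2 = d[('a', 'c')] + d[('b', 'd')]   # the two "crossing" sums are equal
s3 = d[('a', 'd')] + d[('b', 'c')]
```

The smallest of the three sums identifies which pairs of leaves share an internal neighbor, and repeating this test drives quartet-based and agglomerative reconstruction methods.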
We propose a tree metric below which works for continuous non-Gaussian cases.
Tree metric and pseudo-determinant We will first explain some basic concepts of a tree metric. If the joint probability distribution P(X) has a latent tree structure, then a distance measure dst between an arbitrary pair of variables Xs and Xt is called a tree metric if it satisfies the following path additive condition: dst = Σ_{(u,v)∈Path(s,t)} duv. For discrete and Gaussian variables, a tree metric can be defined via the determinant |·| [4]

dst = −(1/2) log|Cst Cst⊤| + (1/4) log|Css Css⊤| + (1/4) log|Ctt Ctt⊤|,    (11)

where Cst denotes the joint probability matrix in the discrete case and the covariance in the Gaussian case; Css is the diagonalized marginal probability vector in the discrete case and the variance in the Gaussian case. However, this definition of tree metric is restricted in the sense that it requires all discrete variables to have the same number of states and all Gaussian variables to have the same dimension. This is because the determinant is only defined (and non-zero) for square and non-singular matrices. For our more general scenario, where the observed variables are continuous non-Gaussian but the hidden variables have dimension d, we will define a tree metric based on the pseudo-determinant which works for our operators.
Nonparametric tree metric The pseudo-determinant is defined as the product of non-zero singular values of an operator, |C|⋆ = ∏ᵢ₌₁ᵈ σᵢ(C). In our case, since we assume that the dimension of the hidden variables is d, the pseudo-determinant is simply the product of the top d singular values.
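The path additivity of the pseudo-determinant based metric dst = −½ log|Cst Cst⊤|⋆ + ¼ log|Css Css⊤|⋆ + ¼ log|Ctt Ctt⊤|⋆ defined below can be verified numerically in a linear-Gaussian toy model where every operator is an ordinary d × d matrix, so the pseudo-determinant over the top d singular values coincides with |det|. The construction below (matrices A, B and the noise level are my own choices) uses population covariances for the path Xs − Xu − Xt.

```python
import numpy as np

# Xu ~ N(0, I); Xs = A Xu + noise; Xt = B Xu + noise.  Then the population
# cross-covariances are C_su = A, C_tu = B, C_st = A B^T, and the metric
# should satisfy d(s,t) = d(s,u) + d(u,t) exactly.
rng = np.random.default_rng(4)
d = 3
A, B = rng.normal(size=(d, d)), rng.normal(size=(d, d))
sigma2 = 0.25

C = {('s', 'u'): A, ('t', 'u'): B, ('s', 't'): A @ B.T,
     ('s', 's'): A @ A.T + sigma2 * np.eye(d),
     ('t', 't'): B @ B.T + sigma2 * np.eye(d),
     ('u', 'u'): np.eye(d)}

def logpdet(M, k=d):
    """Log pseudo-determinant: sum of log of the top-k singular values."""
    return np.sum(np.log(np.linalg.svd(M, compute_uv=False)[:k]))

def dist(a, b):
    """Pseudo-determinant tree metric between variables a and b."""
    Cab = C[(a, b)] if (a, b) in C else C[(b, a)].T
    return (-0.5 * logpdet(Cab @ Cab.T)
            + 0.25 * logpdet(C[(a, a)] @ C[(a, a)].T)
            + 0.25 * logpdet(C[(b, b)] @ C[(b, b)].T))
```

In the nonparametric setting the matrices are replaced by empirical cross-covariance operators (Gram-matrix computations), but the additivity argument is the same.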
Then we define the distance metric between two continuous non-Gaussian variables Xs and Xt as

dst = −(1/2) log|Cst Cst⊤|⋆ + (1/4) log|Css Css⊤|⋆ + (1/4) log|Ctt Ctt⊤|⋆.    (12)

One can prove that (12) defines a tree metric by inducting on the path length. Here we only show the additive property for the simplest path Xs − Xu − Xt involving only a single hidden variable Xu. In this case, we first factorize |Cst Cst⊤|⋆ into |C_{s|u} Cuu C_{t|u}⊤ C_{t|u} Cuu C_{s|u}⊤|⋆ according to the Markov property. Then using Sylvester's determinant theorem, the latter is also equal to |C_{s|u}⊤ C_{s|u} Cuu C_{t|u}⊤ C_{t|u} Cuu|⋆ by flipping C_{s|u}⊤ to the front. Next, introducing two copies of |Cuu|⋆ and rearranging terms, we have

|Cst Cst⊤|⋆ = |C_{s|u} Cuu Cuu C_{s|u}⊤|⋆ |C_{t|u} Cuu Cuu C_{t|u}⊤|⋆ / (|Cuu|⋆ |Cuu|⋆) = |Csu Csu⊤|⋆ |Ctu Ctu⊤|⋆ / |Cuu Cuu|⋆.    (13)

Last, we plug this into (12) and we have the desired path additive property

dst = −(1/2) log|Csu Csu⊤|⋆ − (1/2) log|Ctu Ctu⊤|⋆ + (1/2) log|Cuu Cuu⊤|⋆ + (1/4) log|Css Css⊤|⋆ + (1/4) log|Ctt Ctt⊤|⋆ = dsu + dut.

6 Experiments
We evaluate our method on synthetic data as well as a real-world crime/communities dataset [1, 19]. For all experiments we compare to 2 existing approaches.
The first is to assume the data are multivariate Gaussians and use the tree metric defined in [4] (which is essentially a function of the correlation coefficient).

Figure 1: Comparison of our kernel structure learning method to the Gaussian and Nonparanormal methods on different tree structures.

Figure 2: Histogram of the differences between the estimated number of hidden states and the true number of states.

The second existing approach we compare to is the Nonparanormal (NPN) [14], which assumes that there exist marginal transformations f1, . . . , fp such that f1(X1), . . . , fp(Xp) ∼ N(µ, Σ). If the data come from a Nonparanormal distribution, then the transformed data are assumed to be multivariate Gaussians and the same tree metric as in the Gaussian case can be used on the transformed data. Our approach makes far fewer assumptions about the data than either of these two methods, which can be more favorable in practice.
To perform learning and inference in our approach, we use the spectral algorithm and message passing algorithm described earlier in the paper. For inference in the Gaussian (and Nonparanormal) cases, we use the technique in [4] to learn the model parameters (covariance matrix). Once the covariance matrix has been estimated, computing the marginal of one variable given a set of evidence reduces to solving a linear equation of one variable [2].
Synthetic data: structure recovery. The first experiment is to demonstrate how our method compares to the Gaussian and Nonparanormal methods in terms of structure recovery. We experiment with 3 different tree types (each with 64 leaves or observed variables): a balanced binary tree, a completely skewed binary tree (like an HMM), and randomly generated binary trees.
For all trees, we use the following generative process to generate the n-th sample from a node s (denoted x_s^{(n)}): if s is the root, sample from a mixture of 2 Gaussians; otherwise, with probability 1/2 sample from a Gaussian with mean x_{\pi_s}^{(n)}, and with probability 1/2 sample from a Gaussian with mean -x_{\pi_s}^{(n)}, where \pi_s denotes the parent of s.

We vary the training sample size from 200 to 100,000. Once we have computed the empirical tree distance matrix for each algorithm, we use the neighbor joining algorithm [20] to learn the trees. For evaluation we compare the number of hops between each pair of leaves in the true tree to that in the estimated tree. For a pair of leaves i, j the error is defined as

error(i, j) = |hops^*(i,j) - \widehat{hops}(i,j)| / hops^*(i,j) + |hops^*(i,j) - \widehat{hops}(i,j)| / \widehat{hops}(i,j),

where hops^* is the true number of hops and \widehat{hops} is the estimated number of hops. The total error is then computed by summing the error over all pairs of leaves.

The performance of our method depends on the number of singular values chosen, and we experimented with 2, 5 and 8 singular values. Furthermore, we choose the bandwidth \sigma for the Gaussian RBF kernel needed for the covariance operators using the median distance between pairs of training points. For all these choices our method performs better than the Gaussian and Nonparanormal methods. This is to be expected, since the data we generated are neither Gaussian nor Nonparanormal, yet our method is able to learn the structure correctly. We also note that balanced binary trees are the easiest to learn while the skewed trees are the hardest (Figure 1).

Synthetic data: model selection. Next we evaluate the ability of our model to select the correct number of singular values via held-out likelihood. For this experiment we use a balanced binary tree with 16 leaves (31 nodes in total) and 100,000 samples.
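The pairwise hop-count error above is straightforward to compute; a minimal sketch follows, where packaging the true and estimated hop counts as symmetric matrices is our own interface choice.

```python
import numpy as np

def structure_error(hops_true, hops_est):
    # total structure-recovery error: for each unordered leaf pair
    # (i, j), |h* - h^| / h*  +  |h* - h^| / h^, summed over all pairs
    H = np.asarray(hops_true, dtype=float)
    G = np.asarray(hops_est, dtype=float)
    iu = np.triu_indices_from(H, k=1)   # each unordered pair once
    diff = np.abs(H[iu] - G[iu])
    return float(np.sum(diff / H[iu] + diff / G[iu]))
```

Note that both relative-error terms are needed to penalize over- and under-estimation of path lengths symmetrically.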
A different generative process is used so that it is clear what the correct number of singular values should be (when the hidden state space is continuous, as in our first synthetic experiment, this is unclear). Each internal node is discrete and takes on d values. Each leaf is a mixture of d Gaussians, where the component to sample from is dictated by the discrete value of its parent.

Figure 3: (a) Visualization of the kernel latent tree learned from the crime data (highlighted groups: elderly, urban/rural, education/job, divorce/crime/poverty, race); (b) comparison of our method to the Gaussian and NPN methods on the predictive task.

We vary d from 2 through 5 and then run our method for a range of 2 through 8 singular values. We select the model that has the highest likelihood, computed using our spectral algorithm, on a hold-out set of 500 examples. We then take the difference between the number of singular values chosen and the true number, and plot histograms of this difference (ideally all the trials should fall in the zero bin). The experiment is run for 20 trials. As we can see in Figure 2, when d is low, the held-out likelihood computed by our method does a fairly good job of recovering the correct number.
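The selection rule itself is simple: fit a model per candidate number of singular values and keep the one with the highest held-out likelihood. In the sketch below, `fit` and `loglik` are hypothetical stand-ins for the paper's spectral training procedure and likelihood computation.

```python
def select_num_singular_values(candidate_ranks, train, heldout, fit, loglik):
    # fit(train, d): learn a model truncated to d singular values
    # loglik(model, heldout): held-out log-likelihood of that model
    # (both callables are assumed interfaces, not the paper's code)
    scores = {d: loglik(fit(train, d), heldout) for d in candidate_ranks}
    return max(scores, key=scores.get)
```

With candidate ranks 2 through 8, this returns the rank whose model scores best on the hold-out set.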
However, as the true number of singular values rises, our method underestimates the true number (although it is still fairly close).

Crime Data. Finally, we explore the performance of our method on a communities and crime dataset from the UCI repository [1, 19]. In this dataset, several real-valued attributes, such as ethnicity proportions, income, poverty rate and divorce rate, are collected for a large number of communities, and the goal is to predict the number of violent crimes (proportional to the size of the community) that occur based on these attributes. In general these attributes are highly skewed and therefore not well characterized by a Gaussian model.

We divide the data into 1400 samples for training, 300 samples for model selection (held-out likelihood), and 300 samples for testing. We pick the first 50 of these attributes, plus the violent crime variable, and construct a latent tree using our tree metric and the neighbor joining algorithm [20]. We depict the tree in Figure 3 and highlight a few coherent groupings. For example, the "elderly" group attributes are those related to retirement and social security (and thus correlated). The large clustering in the center is where the class variable (violent crimes) is located, next to the poverty rate and the divorce rate among other relevant variables. Other groupings include type of occupation and education level, as well as ethnic proportions. Thus, overall our method captures sensible relationships.

For a more quantitative evaluation, we condition on a set E of evidence variables and predict the violent crimes class label. We experiment with evidence sets of varying sizes, from 5 to 40, repeating with 40 randomly chosen evidence sets of each fixed size. Since the crime variable is a number between 0 and 1, our error measure is simply err(\hat{c}) = |\hat{c} - c^*| (where \hat{c} is the predicted value and c^* is the true value).
As one can see in Figure 3, our method outperforms both the Gaussian and the Nonparanormal over the range of query sizes. Thus, in this case our method is better able to capture the skewed distributions of the variables than the other methods.

Acknowledgments
This work was partially done when LS was at Carnegie Mellon University and Google Research. This work is also supported by an NSF Graduate Research Fellowship (under Grant No. 0750271) to APP, NIH 1R01GM093156, NIH 1RC2HL101487, NSF DBI-0546594, and an Alfred P. Sloan Fellowship to EPX.

References

[1] A. Asuncion and D.J. Newman. UCI Machine Learning Repository, 2007.
[2] D. Bickson. Gaussian belief propagation: Theory and application. ArXiv preprint arXiv:0811.2518, 2008.
[3] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. In NIPS, 2002.
[4] M.J. Choi, V.Y.F. Tan, A. Anandkumar, and A.S. Willsky. Learning latent tree graphical models. ArXiv preprint arXiv:1009.2722, 2010.
[5] A. Clark. Inference of haplotypes from PCR-amplified samples of diploid populations. Molecular Biology and Evolution, 7(2):111–122, 1990.
[6] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39(1):1–22, 1977.
[7] K. Fukumizu, F. R. Bach, and M. I. Jordan. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. J. Mach. Learn. Res., 5:73–99, 2004.
[8] S. Harmeling and C.K.I. Williams. Greedy learning of binary latent trees. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010.
[9] K.A. Heller and Z. Ghahramani. Bayesian hierarchical clustering. In Proceedings of the 22nd International Conference on Machine Learning, pages 297–304. ACM, 2005.
[10] Peter D. Hoff, Adrian E. Raftery, and Mark S.
Handcock. Latent space approaches to social network analysis. JASA, 97(460):1090–1098, 2002.
[11] D. Hsu, S. Kakade, and T. Zhang. A spectral algorithm for learning hidden Markov models. In COLT, 2009.
[12] A. Ihler and D. McAllester. Particle belief propagation. In AISTATS, pages 256–263, 2009.
[13] Tamara Kolda and Brett Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.
[14] H. Liu, J. Lafferty, and L. Wasserman. The nonparanormal: Semiparametric estimation of high dimensional undirected graphs. The Journal of Machine Learning Research, 10:2295–2328, 2009.
[15] T. Minka. Expectation Propagation for approximate Bayesian inference. PhD thesis, MIT Media Lab, Cambridge, USA, 2001.
[16] A. Parikh, L. Song, and E. Xing. A spectral algorithm for latent tree graphical models. In ICML, 2011.
[17] J. Pearl. Causality: Models, Reasoning and Inference. Cambridge University Press, 2001.
[18] L. R. Rabiner and B. H. Juang. An introduction to hidden Markov models. IEEE ASSP Magazine, 3(1):4–16, January 1986.
[19] M. Redmond and A. Baveja. A data-driven software tool for enabling cooperative information sharing among police departments. European Journal of Operational Research, 141(3):660–678, 2002.
[20] N. Saitou, M. Nei, et al. The neighbor-joining method: A new method for reconstructing phylogenetic trees. Mol Biol Evol, 4(4):406–425, 1987.
[21] C. Semple and M.A. Steel. Phylogenetics, volume 24. Oxford University Press, USA, 2003.
[22] B. W. Silverman. Density Estimation for Statistics and Data Analysis. Monographs on Statistics and Applied Probability. Chapman and Hall, London, 1986.
[23] A.J. Smola, A. Gretton, L. Song, and B. Schölkopf. A Hilbert space embedding for distributions. In E. Takimoto, editor, Algorithmic Learning Theory, Lecture Notes in Computer Science. Springer, 2007.
[24] L. Song, A. Gretton, and C. Guestrin.
Nonparametric tree graphical models. In 13th Workshop on Artificial Intelligence and Statistics, volume 9 of JMLR Workshop and Conference Proceedings, pages 765–772, 2010.
[25] Le Song, Byron Boots, Sajid Siddiqi, Geoffrey Gordon, and Alex Smola. Hilbert space embeddings of hidden Markov models. In International Conference on Machine Learning, 2010.
[26] Le Song, Jonathan Huang, Alex Smola, and Kenji Fukumizu. Hilbert space embeddings of conditional distributions. In ICML, 2009.
[27] E. Sudderth, A. Ihler, W. Freeman, and A. Willsky. Nonparametric belief propagation. In CVPR, 2003.
[28] N.L. Zhang. Hierarchical latent class models for cluster analysis. The Journal of Machine Learning Research, 5:697–723, 2004.