{"title": "Learning Mixtures of Tree Graphical Models", "book": "Advances in Neural Information Processing Systems", "page_first": 1052, "page_last": 1060, "abstract": "We consider unsupervised estimation of mixtures of discrete graphical models, where the class variable is hidden and each mixture component can have a potentially different Markov graph structure and parameters over the observed variables. We propose a novel method for estimating the mixture components with provable guarantees. Our output is a tree-mixture model which serves as a good approximation to the underlying graphical model mixture. The sample and computational requirements for our method scale as $\\poly(p, r)$, for an $r$-component mixture of $p$-variate graphical models, for a wide class of models which includes tree mixtures and mixtures over bounded degree graphs.", "full_text": "Learning Mixtures of Tree Graphical Models\n\nAnimashree Anandkumar\n\nUC Irvine\n\nDaniel Hsu\n\nMicrosoft Research New England\n\na.anandkumar@uci.edu\n\ndahsu@microsoft.com\n\nFurong Huang\n\nUC Irvine\n\nfurongh@uci.edu\n\nSham M. Kakade\n\nMicrosoft Research New England\nskakade@microsoft.com\n\nAbstract\n\nWe consider unsupervised estimation of mixtures of discrete graphical models,\nwhere the class variable is hidden and each mixture component can have a poten-\ntially different Markov graph structure and parameters over the observed variables.\nWe propose a novel method for estimating the mixture components with provable\nguarantees. Our output is a tree-mixture model which serves as a good approxi-\nmation to the underlying graphical model mixture. 
The sample and computational requirements for our method scale as poly(p, r), for an r-component mixture of p-variate graphical models, for a wide class of models which includes tree mixtures and mixtures over bounded degree graphs.\n\nKeywords: Graphical models, mixture models, spectral methods, tree approximation.\n\n1 Introduction\n\nThe framework of graphical models allows for parsimonious representation of high-dimensional data by encoding statistical relationships among the given set of variables through a graph, known as the Markov graph. Recent works have shown that a wide class of graphical models can be estimated efficiently in high dimensions [1\u20133]. However, graphical models alone frequently do not suffice to explain all the characteristics of the observed data. For instance, there may be latent or hidden variables, which can influence the observed data in myriad ways.\n\nIn this paper, we consider latent variable models, where a latent variable can alter the relationships (both structural and parametric) among the observed variables. In other words, we posit the observed data as being generated from a mixture of graphical models, where each mixture component has a potentially different Markov graph structure and parameters. The choice variable corresponding to the selection of the mixture component is hidden. Such a class of graphical model mixtures can incorporate context-specific dependencies, and employs multiple graph structures to model the observed data. This leads to a significantly richer class of models, compared to single graphical models.\n\nLearning graphical model mixtures is, however, far more challenging than learning graphical models. State-of-the-art theoretical guarantees are mostly limited to mixtures of product distributions, also known as latent class models or na\u00efve Bayes models. 
These models are restrictive since they do not allow for dependencies among the observed variables within each mixture component. Our work significantly generalizes this class and allows for general Markov dependencies among the observed variables in each mixture component.\n\nThe output of our method is a tree mixture model, which is a good approximation for the underlying graphical model mixture. The motivation behind fitting the observed data to a tree mixture is clear: inference can be performed efficiently via belief propagation in each of the mixture components. See [4] for a detailed discussion. Thus, a tree mixture model offers a good tradeoff between single-tree models, which are too simplistic, and general graphical model mixtures, where inference is not tractable.\n\n1.1 Summary of Results\n\nWe propose a novel method with provable guarantees for unsupervised estimation of discrete graphical model mixtures. Our method has three main stages: graph structure estimation, parameter estimation, and tree approximation. The first stage involves estimation of the union graph structure G\u222a := \u222ahGh, which is the union of the Markov graphs {Gh} of the respective mixture components. Our method is based on a series of rank tests, and can be viewed as a generalization of conditional-independence tests for graphical model selection (e.g. [1, 5, 6]). We establish that our method is efficient (in terms of computational and sample complexities) when the underlying union graph has sparse vertex separators. This includes tree mixtures and mixtures with bounded degree graphs. The second stage of our algorithm involves parameter estimation of the mixture components. In general, this problem is NP-hard. We provide conditions for tractable estimation of pairwise marginals of the mixture components. 
Roughly, we exploit the conditional-independence relationships to convert the given model into a series of mixtures of product distributions. Parameter estimation for product distribution mixtures has been well studied (e.g. [7\u20139]), and is based on spectral decompositions of the observed moments. We leverage these techniques to obtain estimates of the pairwise marginals for each mixture component. The final stage, obtaining tree approximations, involves running the standard Chow-Liu algorithm [10] on each component using the estimated pairwise marginals of the mixture components.\n\nWe prove that our method correctly recovers the union graph structure and the tree structures corresponding to maximum-likelihood tree approximations of the mixture components. Note that if the underlying model is a tree mixture, we correctly recover the tree structures of the mixture components. The sample and computational complexities of our method scale as poly(p, r), for an r-component mixture of p-variate graphical models, when the union graph has sparse vertex separators between any node pair. This includes tree mixtures and mixtures with bounded degree graphs. To the best of our knowledge, this is the first work to provide provable learning guarantees for graphical model mixtures. Our algorithm is also efficient for practical implementation, and preliminary experiments suggest an advantage over EM with respect to running times and accuracy of structure estimation of the mixture components. Thus, our approach for learning graphical model mixtures has both theoretical and practical implications.\n\n1.2 Related Work\n\nGraphical Model Selection: Graphical model selection is a well studied problem, starting from the seminal work of Chow and Liu [10] on finding the maximum-likelihood tree approximation of a graphical model. Works on high-dimensional loopy graphical model selection are more recent. 
They can be classified into mainly two groups: non-convex local approaches [1, 2, 6] and those based on convex optimization [3, 11]. However, these works are not directly applicable to learning mixtures of graphical models. Moreover, our proposed method also provides a new approach for graphical model selection in the special case when there is only one mixture component.\n\nLearning Mixture Models: Mixture models have been extensively studied, and there are a number of recent works on learning high-dimensional mixtures, e.g. [12, 13]. These works provide guarantees on recovery under various separation constraints between the mixture components and/or have computational and sample complexities growing exponentially in the number of mixture components r. In contrast, the so-called spectral methods have both computational and sample complexities scaling only polynomially in the number of components, and do not impose stringent separation constraints. Spectral methods are applicable for parameter estimation in mixtures of discrete product distributions [7] and more generally for latent trees [8] and general linear multiview mixtures [9]. We leverage these techniques for parameter estimation in models beyond product distribution mixtures.\n\n2 Graphical Models and their Mixtures\n\nA graphical model is a family of multivariate distributions Markov on a given undirected graph [14]. In a discrete graphical model, each node in the graph v \u2208 V is associated with a random variable Yv taking values in a finite set Y. Let d := |Y| denote the cardinality of the set and p := |V| denote the number of variables. A vector of random variables Y := (Y1, . . . 
, Yp) with a joint probability mass function (pmf) P is Markov on the graph G if P satisfies the global Markov property\n\nP(yA, yB | yS(A,B;G)) = P(yA | yS(A,B;G)) P(yB | yS(A,B;G)), \u2200A, B \u2282 V : N[A] \u2229 N[B] = \u2205,\n\nfor all disjoint sets A, B \u2282 V, where the set S(A, B; G) is a node separator\u00b9 between A and B, and N[A] denotes the closed neighborhood of A (i.e., including A).\n\nWe consider mixtures of discrete graphical models. Let H denote the discrete hidden choice variable corresponding to the selection of a mixture component, taking values in [r] := {1, . . . , r}, and let Y denote the observed random vector. Denote \u03c0H := [P(H = h)]^\u22a4_h as the probability vector of the mixing weights, and Gh as the Markov graph of the distribution P(y|H = h) of each mixture component. Given n i.i.d. samples y^n = [y1, . . . , yn]^\u22a4 from P(y), our goal is to find a tree approximation for each mixture component {P(y|H = h)}h. We do not assume any knowledge of the mixing weights \u03c0H, the Markov graphs {Gh}h, or the parameters of the mixture components {P(y|H = h)}h. Moreover, since the variable H is latent, we do not a priori know the mixture component from which a sample is drawn. Thus, a major challenge is the decomposition of the observed statistics into the component models, and we tackle this in three main stages. First, we estimate the union graph G\u222a := \u222a^r_{h=1} Gh, which is the union of the Markov graphs of the components. We then use this graph estimate \u011c\u222a to obtain the pairwise marginals of the respective mixture components {P(y|H = h)}h. Finally, the Chow-Liu algorithm provides tree approximations {Th}h of the mixture components.\n\n3 Estimation of the Union of Component Graphs\n\nWe propose a novel method for learning graphical model mixtures by first estimating the union graph G\u222a = \u222a^r_{h=1} Gh, which is the union of the graphs of the components. 
In the special case when Gh \u2261 G\u222a, this gives the graph estimate of the components. However, the union graph G\u222a appears to have no direct relationship with the marginalized model P(y). We first provide intuitions on how G\u222a relates to the observed statistics.\n\nIntuitions: We first establish the simple result that the union graph G\u222a satisfies the Markov property in each mixture component. Recall that S(u, v; G\u222a) denotes a vertex separator between nodes u and v in G\u222a.\n\nFact 1 (Markov Property of G\u222a) For any two nodes u, v \u2208 V such that (u, v) /\u2208 G\u222a,\n\nYu \u22a5\u22a5 Yv | YS, H, S := S(u, v; G\u222a). (1)\n\nProof: The separator set in G\u222a, denoted by S := S(u, v; G\u222a), is also a vertex separator for u and v in each of the component graphs Gh. This is because removal of S disconnects u and v in each Gh. Thus, we have the Markov property in each component: Yu \u22a5\u22a5 Yv | YS, {H = h}, for each h \u2208 [r], and the above result follows.\n\nThe above result can be exploited to obtain the union graph estimate as follows: two nodes u, v are not neighbors in G\u222a if a separator set S can be found which results in conditional independence, as in (1). The main challenge is that the variable H is not observed, and thus conditional independence cannot be directly inferred via observed statistics. However, the effect of H on the observed statistics can be quantified as follows:\n\nLemma 1 (Rank Property) Given an r-component mixture of graphical models with G\u222a = \u222a^r_{h=1} Gh, for any u, v \u2208 V such that (u, v) /\u2208 G\u222a and S := S(u, v; G\u222a), the probability matrix Mu,v,{S;k} := [P[Yu = i, Yv = j, YS = k]]_{i,j} has rank at most r for any k \u2208 Y^{|S|}.\n\n\u00b9A set S(A, B; G) \u2282 V is a separator of sets A and B if the removal of nodes in S(A, B; G) separates A and B into distinct components.\n\nThe proof is given in [15]. 
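The rank property can be checked numerically. The following sketch (our own illustration, with made-up dimensions and distributions) constructs the exact pairwise probability matrix of a 2-component mixture in the simplest case of an empty separator, where Yu and Yv are conditionally independent given H, and confirms that its rank is at most r even though d > r:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 4, 2                      # node dimension d, number of components r

pi = np.array([0.7, 0.3])        # mixing weights P(H = h)
Mu = rng.dirichlet(np.ones(d), size=r).T   # Mu[:, h] = P(Yu = . | H = h)
Mv = rng.dirichlet(np.ones(d), size=r).T   # Mv[:, h] = P(Yv = . | H = h)

# Observed matrix: M[i, j] = P(Yu = i, Yv = j) = sum_h pi[h] Mu[i, h] Mv[j, h]
Muv = Mu @ np.diag(pi) @ Mv.T

assert abs(Muv.sum() - 1.0) < 1e-12        # a valid joint pmf
print(np.linalg.matrix_rank(Muv))          # at most r = 2, although d = 4
```

The rank deficiency of the observed d x d matrix is exactly the fingerprint of the hidden variable that Lemma 1 exploits.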
Thus, the effect of marginalizing the choice variable H is seen in the rank of the observed probability matrices Mu,v,{S;k}. When u and v are non-neighbors in G\u222a, a separator set S can be found such that the rank of Mu,v,{S;k} is at most r. In order to use this result as a criterion for inferring neighbors in G\u222a, we require that the rank of Mu,v,{S;k} for any neighbors (u, v) \u2208 G\u222a be strictly larger than r. This requires the dimension of each node variable to satisfy d > r. We discuss in detail the set of sufficient conditions for correctly recovering G\u222a in Section 3.1.\n\nTractable Graph Families: Another obstacle in using Lemma 1 to estimate the graph G\u222a is computational: the search for separators S for any node pair u, v \u2208 V is exponential in p := |V| if no further constraints are imposed. We consider graph families where a vertex separator of size at most \u03b7 can be found for any (u, v) /\u2208 G\u222a. Under our framework, the hardness of learning a union graph is parameterized by \u03b7. Similar observations have been made before for graphical model selection [1]. There are many natural families where \u03b7 is small:\n\n1. If G\u222a is trivial (i.e., has no edges), then \u03b7 = 0, and we have a mixture of product distributions.\n\n2. When G\u222a is a tree, i.e., we have a mixture model Markov on the same tree, then \u03b7 = 1, since there is a unique path between any two nodes on a tree.\n\n3. For an arbitrary r-component tree mixture, G\u222a = \u222ahTh, where each component is a tree distribution, we have that \u03b7 \u2264 r (since for any node pair, there is a unique path in each of the r trees {Th}, and separating the node pair in each Th also separates them on G\u222a).\n\n4. For an arbitrary mixture of bounded degree graphs, we have \u03b7 \u2264 \u2211_{h\u2208[r]} \u2206h, where \u2206h is the maximum degree in Gh, i.e., the Markov graph corresponding to component {H = h}.\n\nIn general, \u03b7 depends on the respective bounds \u03b7h for the component graphs Gh, as well as the extent of their overlap. In the worst case, \u03b7 can be as high as \u2211_{h\u2208[r]} \u03b7h, while in the special case when Gh \u2261 G\u222a, the bound remains \u03b7h \u2261 \u03b7. Note that for a general graph G\u222a with treewidth tw(G\u222a) and maximum degree \u2206(G\u222a), we have that \u03b7 \u2264 min(\u2206(G\u222a), tw(G\u222a)).\n\nAlgorithm 1 \u011c^n_\u222a = RankTest(y^n; \u03be_{n,p}, \u03b7, r) for estimating G\u222a := \u222a^r_{h=1} Gh of an r-component mixture using the y^n samples, where \u03b7 is the bound on the size of vertex separators between any node pair in G\u222a and \u03be_{n,p} is a threshold on the singular values. Rank(A; \u03be) denotes the effective rank of matrix A, i.e., the number of singular values larger than \u03be. M\u0302^n_{u,v,{S;k}} := [P\u0302^n(Yu = i, Yv = j, YS = k)]_{i,j} is the empirical estimate computed using the n i.i.d. samples y^n. Initialize \u011c^n_\u222a = (V, \u2205). For each u, v \u2208 V, estimate M\u0302^n_{u,v,{S;k}} from y^n for some configuration k \u2208 Y^{|S|}; if\n\nmin_{S \u2282 V \\{u,v}, |S| \u2264 \u03b7} Rank(M\u0302^n_{u,v,{S;k}}; \u03be_{n,p}) > r, (2)\n\nthen add (u, v) to \u011c^n_\u222a.\n\nRank Test: Based on the above observations, we propose a rank test to estimate G\u222a := \u222a_{h\u2208[r]} Gh, the union graph, in Algorithm 1. The method is based on a search for potential separators S between any two given nodes u, v \u2208 V, based on the effective rank of M\u0302^n_{u,v,{S;k}}: if the effective rank is r or less, then u and v are declared as non-neighbors (with S as their separator). If no such set is found, they are declared as neighbors. Thus, the method involves searching for separators for each node pair u, v \u2208 V, by considering all sets S \u2282 V \\{u, v} satisfying |S| \u2264 \u03b7. 
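The effective-rank criterion at the heart of Algorithm 1 can be sketched as follows; `effective_rank` and `declares_edge` are our own illustrative names, and the real test additionally searches over separator sets S and configurations k:

```python
import numpy as np

def effective_rank(A, xi):
    """Rank(A; xi): the number of singular values of A exceeding xi."""
    return int((np.linalg.svd(A, compute_uv=False) > xi).sum())

def declares_edge(matrices, r, xi):
    """Declare (u, v) neighbors if the effective rank stays above r for every
    candidate separator; `matrices` stands in for the estimated probability
    matrices M_{u,v,{S;k}} over the candidate separators S."""
    return min(effective_rank(M, xi) for M in matrices) > r

rng = np.random.default_rng(1)
low = rng.random((5, 2)) @ rng.random((2, 5))  # rank 2: non-neighbors for r = 2
full = rng.random((5, 5))                      # generically full rank: neighbors
print(declares_edge([low], r=2, xi=1e-8), declares_edge([full], r=2, xi=1e-8))
```

With exact statistics a tiny threshold suffices, as here; with empirical estimates the threshold must absorb sampling noise, which is the role of the choice of xi analyzed in Section 3.1.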
The computational complexity of this procedure is O(p^{\u03b7+2} d^3), where d is the dimension of each node variable Yi, for i \u2208 V, and p is the number of nodes. This is because the number of rank tests performed is O(p^{\u03b7+2}) over all node pairs and conditioning sets; each rank test has O(d^3) complexity since it involves a singular value decomposition (SVD) of a d \u00d7 d matrix.\n\n3.1 Analysis of the Rank Test\n\nWe now provide guarantees for the success of rank tests in estimating G\u222a. As noted before, we require that the number of components r and the dimension d of each node variable satisfy d > r. Moreover, we assume bounds on the size of separator sets, \u03b7 = O(1). This includes tree mixtures and mixtures over bounded degree graphs. In addition, the following parameters determine the success of the rank tests.\n\n(A1) Rank condition for neighbors: Let Mu,v,{S;k} := [P(Yu = i, Yv = j, YS = k)]_{i,j} and\n\n\u03c1min := min_{(u,v)\u2208G\u222a, S\u2282V\\{u,v}, |S|\u2264\u03b7} max_{k\u2208Y^{|S|}} \u03c3_{r+1}(Mu,v,{S;k}) > 0, (3)\n\nwhere \u03c3_{r+1}(\u00b7) denotes the (r + 1)th singular value, when the singular values are arranged in descending order \u03c31(\u00b7) \u2265 \u03c32(\u00b7) \u2265 . . . \u2265 \u03c3d(\u00b7). This ensures that the probability matrices for neighbors (u, v) \u2208 G\u222a have (effective) rank at least r + 1, and thus, the rank test can correctly distinguish neighbors from non-neighbors. It rules out the presence of spurious low rank matrices between neighboring nodes in G\u222a (for instance, when the nodes are marginally independent or when the distribution is degenerate).\n\n(A2) Choice of threshold \u03be: The threshold \u03be on singular values is chosen as \u03be := \u03c1min/2.\n\n(A3) Number of Samples: Given \u03b4 \u2208 (0, 1), the number of samples n satisfies\n\nn > nRank(\u03b4; p) := max( (1/t^2)(2 log p + log \u03b4^{\u22121} + log 2), (2/(\u03c1min \u2212 t))^2 ), (4)\n\nfor some t \u2208 (0, \u03c1min) (e.g. t = \u03c1min/2), where p is the number of nodes.\n\nWe now provide the result on the success of recovering the union graph G\u222a := \u222a^r_{h=1} Gh.\n\nTheorem 1 (Success of Rank Tests) RankTest(y^n; \u03be, \u03b7, r) recovers the correct graph G\u222a, which is the union of the component Markov graphs, under (A1)\u2013(A3), with probability at least 1 \u2212 \u03b4.\n\nA special case of the above result is graphical model selection, where there is a single graphical model (r = 1) and we are interested in estimating its graph structure.\n\nCorollary 1 (Application to Graphical Model Selection) Given n i.i.d. samples y^n, RankTest(y^n; \u03be, \u03b7, 1) is structurally consistent under (A1)\u2013(A3), with probability at least 1 \u2212 \u03b4.\n\nRemarks: Thus, the rank test is also applicable to graphical model selection. Previous works (see Section 1.2) have proposed tests based on conditional independence, using either conditional mutual information or conditional variation distances, see [1, 6]. The rank test above is thus an alternative test for conditional independence in graphical models, resulting in graph structure estimation. In addition, it extends naturally to estimation of the union graph structure of mixture components. 
Our above result establishes that our method is also efficient in high dimensions, since it only requires a logarithmic number of samples for structural consistency (n = \u2126(log p)).\n\n4 Parameter Estimation of Mixture Components\n\nHaving obtained an estimate of the union graph G\u222a, we now describe a procedure for estimating the parameters of the mixture components {P(y|H = h)}. Our method is based on spectral decomposition, proposed previously for mixtures of product distributions [7\u20139]. We recap it briefly below and then describe how it can be adapted to the more general setting of graphical model mixtures.\n\nRecap of Spectral Decomposition in Mixtures of Product Distributions: Consider the case where V = {u, v, w} and Yu \u22a5\u22a5 Yv \u22a5\u22a5 Yw | H. For simplicity, assume that d = r, i.e., the hidden and observed variables have the same dimension. This assumption will be removed subsequently. Denote Mu|H := [P(Yu = i|H = j)]_{i,j}, and similarly for Mv|H, Mw|H, and assume that they are full rank. Denote the probability matrices Mu,v := [P(Yu = i, Yv = j)]_{i,j} and Mu,v,{w;k} := [P(Yu = i, Yv = j, Yw = k)]_{i,j}. The parameters (i.e., the matrices Mu|H, Mv|H, Mw|H) can be estimated as follows:\n\nLemma 2 (Mixture of Product Distributions) Given the above model, let \u03bb^(k) = [\u03bb^(k)_1, . . . , \u03bb^(k)_d]^\u22a4 be the column vector with the d eigenvalues given by\n\n\u03bb^(k) := Eigenvalues(Mu,v,{w;k} M^{\u22121}_{u,v}), k \u2208 Y. (5)\n\nLet \u039b := [\u03bb^(1)|\u03bb^(2)| . . . |\u03bb^(d)] be the matrix whose kth column is \u03bb^(k). We have\n\nMw|H := [P(Yw = i|H = j)]_{i,j} = \u039b^\u22a4. (6)\n\nFor the proof of the above result and for the general algorithm (when d \u2265 r), see [9]. Thus, if we have a general product distribution mixture over nodes in V, we can learn the parameters by performing the above spectral decomposition over different triplets {u, v, w}. 
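Lemma 2 can be verified numerically on exact moments. The sketch below uses made-up conditional probability matrices with d = r = 3 and recovers, for each k, the values {P(Yw = k | H = h)}h as the eigenvalues in (5) (returned in arbitrary order within each k):

```python
import numpy as np

rng = np.random.default_rng(2)
d = r = 3
pi = rng.dirichlet(np.ones(r))              # mixing weights
A = rng.dirichlet(np.ones(d), size=r).T     # A[:, h] = P(Yu = . | H = h)
B = rng.dirichlet(np.ones(d), size=r).T     # B[:, h] = P(Yv = . | H = h)
W = rng.dirichlet(np.ones(d), size=r).T     # W[k, h] = P(Yw = k | H = h)

Muv = A @ np.diag(pi) @ B.T                 # [P(Yu = i, Yv = j)]
for k in range(d):
    # [P(Yu = i, Yv = j, Yw = k)] = A diag(W[k] * pi) B^T
    Muvk = A @ np.diag(W[k] * pi) @ B.T
    lam = np.linalg.eigvals(Muvk @ np.linalg.inv(Muv)).real
    # eigenvalues come back in arbitrary order, so compare them as sets
    assert np.allclose(np.sort(lam), np.sort(W[k]))
```

The arbitrary ordering of the eigenvalues within each k is precisely the hidden-label permutation issue: the cure is to diagonalize all the matrices with one fixed eigenvector matrix, since Muvk @ inv(Muv) = A diag(W[k]) inv(A) shares the eigenvectors A across all k.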
However, an obstacle remains: spectral decomposition over different triplets {u, v, w} results in different permutations of the labels of the hidden variable H. To overcome this, note that any two triplets (u, v, w) and (u, v\u2032, w\u2032) share the same set of eigenvectors in (5) when the \u201cleft\u201d node u is the same. Thus, if we consider a fixed node u\u2217 \u2208 V as the \u201cleft\u201d node and use a fixed matrix to diagonalize (5) for all triplets, we obtain a consistent ordering of the hidden labels over all triplet decompositions.\n\nParameter Estimation in Graphical Model Mixtures: We now adapt the above procedure for estimating the components of a general graphical model mixture. We first make a simple observation on how to obtain mixtures of product distributions by considering separators on the union graph G\u222a. For any three nodes u, v, w \u2208 V which are not neighbors on G\u222a, let Suvw denote a multiway vertex separator, i.e., the removal of the nodes in Suvw disconnects u, v and w in G\u222a. Along the lines of Fact 1,\n\nYu \u22a5\u22a5 Yv \u22a5\u22a5 Yw | YSuvw , H, \u2200u, v, w : (u, v), (v, w), (w, u) /\u2208 G\u222a. (7)\n\nThus, by fixing the configuration of the nodes in Suvw, we obtain a product distribution mixture over {u, v, w}. If the previously proposed rank test is successful in estimating G\u222a, then we possess correct knowledge of the separators Suvw. In this case, we can obtain estimates {P(Yw|YSuvw = k, H = h)}h by fixing the nodes in Suvw and using the spectral decomposition described in Lemma 2, and the procedure can be repeated over different triplets {u, v, w}.\n\nAn obstacle remains, viz., the permutation of hidden labels over different triplet decompositions {u, v, w}. 
In the case of a product distribution mixture, as discussed previously, this is resolved by fixing the \u201cleft\u201d node in the triplet to some u\u2217 \u2208 V and using the same matrix for diagonalization over different triplets. However, an additional complication arises when we consider graphical model mixtures, where conditioning over separators is required. We require that the permutation of the hidden labels be unchanged upon conditioning over different values of the variables in the separator set Su\u2217vw. This holds when the separator set Su\u2217vw has no effect on node u\u2217, i.e., we require that\n\n\u2203u\u2217 \u2208 V, s.t. Yu\u2217 \u22a5\u22a5 YV\\u\u2217 | H, (8)\n\nwhich implies that u\u2217 is isolated from all other nodes in the graph G\u222a.\n\nCondition (8) is required for identifiability if we only operate on statistics over different triplets (along with their separator sets). In other words, if we resort to operations over only low order statistics, we require additional conditions such as (8) for identifiability. However, our setting is a significant generalization over mixtures of product distributions, where (8) is required to hold for all nodes.\n\nFinally, since our goal is to estimate pairwise marginals of the mixture components, in place of node w in the triplet {u, v, w} in Lemma 2, we need to consider a node pair a, b \u2208 V. The general algorithm allows the variables in the triplet to have different dimensions; see [9] for details. Thus, we obtain estimates of the pairwise marginals of the mixture components. 
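Given estimated pairwise marginals for one mixture component, the final Chow-Liu step reduces to a maximum-weight spanning tree under pairwise mutual information. A minimal sketch (the helper names and the toy binary joint pmfs are ours, not the paper's implementation):

```python
import numpy as np

def mutual_information(P):
    """I(Ya; Yb) from a joint pmf matrix P[i, j] = P(Ya = i, Yb = j)."""
    outer = np.outer(P.sum(axis=1), P.sum(axis=0))
    nz = P > 0
    return float(np.sum(P[nz] * np.log(P[nz] / outer[nz])))

def chow_liu(n, pairwise):
    """Kruskal's algorithm on edges weighted by mutual information.
    pairwise[(a, b)]: estimated joint pmf matrix for each node pair a < b."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    tree = []
    for a, b in sorted(pairwise, key=lambda e: -mutual_information(pairwise[e])):
        ra, rb = find(a), find(b)
        if ra != rb:                        # keep the edge if no cycle forms
            parent[ra] = rb
            tree.append((a, b))
    return tree

# Toy 3-node chain 0 - 1 - 2: the weaker (0, 2) correlation is left out.
strong = np.array([[0.45, 0.05], [0.05, 0.45]])
weak = np.array([[0.30, 0.20], [0.20, 0.30]])
print(chow_liu(3, {(0, 1): strong, (1, 2): strong, (0, 2): weak}))
# -> [(0, 1), (1, 2)]
```

Running this once per component, with the spectrally estimated marginals as input, yields the tree approximations {Th}h.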
For details on implementation, refer to [15].\n\n4.1 Analysis and Guarantees\n\nIn addition to (A1)\u2013(A3) in Section 3.1, which guarantee correct recovery of G\u222a, and the conditions discussed above, the success of parameter estimation depends on the following quantities:\n\n(A4) Non-degeneracy: For each node pair a, b \u2208 V, and any subset S \u2282 V \\{a, b} with |S| \u2264 2\u03b7 and k \u2208 Y^{|S|}, the probability matrix M(a,b)|H,{S;k} := [P(Ya,b = i|H = j, YS = k)]_{i,j} \u2208 R^{d^2 \u00d7 r} has rank r.\n\n(A5) Spectral Bounds and Number of Samples: Refer to the various spectral bounds used to obtain K(\u03b4; p, d, r) in [15], where \u03b4 \u2208 (0, 1) is fixed. Given any fixed \u03b5 \u2208 (0, 1), assume that the number of samples satisfies\n\nn > nspect(\u03b4, \u03b5; p, d, r) := 4K^2(\u03b4; p, d, r)/\u03b5^2. (9)\n\nNote that (A4) is a natural condition required for the success of spectral decomposition and has been previously imposed for learning product distribution mixtures [7\u20139]. Moreover, when (A4) does not hold, i.e., when the matrices are not full rank, parameter estimation is computationally at least as hard as learning parity with noise, which is conjectured to be computationally hard [8]. Condition (A5) is required for learning product distribution mixtures [9], and we inherit it here.\n\nWe now provide guarantees for estimation of the pairwise marginals of the mixture components. 
Let \u2016\u00b7\u20162 on a vector denote the \u21132 norm.\n\nTheorem 2 (Parameter Estimation of Mixture Components) Under the assumptions (A1)\u2013(A5), the spectral decomposition method outputs P\u0302spect(Ya, Yb|H = h), for each a, b \u2208 V, such that for all h \u2208 [r], there exists a permutation \u03c4(h) \u2208 [r] with\n\n\u2016P\u0302spect(Ya, Yb|H = h) \u2212 P(Ya, Yb|H = \u03c4(h))\u20162 \u2264 \u03b5, (10)\n\nwith probability at least 1 \u2212 4\u03b4.\n\nRemark: Recall that p denotes the number of variables, r is the number of mixture components, d is the dimension of each node variable, and \u03b7 is the bound on separator sets between any node pair in the union graph. We establish in [15] that K(\u03b4; p, d, r) is O(p^{2\u03b7+2} d^{2\u03b7} r^5 \u03b4^{\u22121} polylog(p, d, r, \u03b4^{\u22121})). Thus, we require the number of samples in (9) to scale as n = \u2126(p^{4\u03b7+4} d^{4\u03b7} r^{10} \u03b4^{\u22122} \u03b5^{\u22122} polylog(p, d, r, \u03b4^{\u22121})). Since we consider models where \u03b7 = O(1) is a small constant, this implies that we have a polynomial sample complexity in p, d, r.\n\nTree Approximation of Mixture Components: The final step involves using the estimated pairwise marginals of each component {P\u0302spect(Ya, Yb|H = h)} to obtain a tree approximation of the component via the Chow-Liu algorithm [10]. We now impose a standard condition of non-degeneracy on each mixture component to guarantee the existence of a unique tree structure corresponding to the maximum-likelihood tree approximation of the mixture component.\n\n(A6) Separation of Mutual Information: Let Th denote the maximum-likelihood tree approximation corresponding to the model P(y|H = h) when exact statistics are input, and let\n\n\u03d1 := min_{h\u2208[r]} min_{(a,b)/\u2208Th} min_{(u,v)\u2208Path(a,b;Th)} (I(Yu, Yv|H = h) \u2212 I(Ya, Yb|H = h)), (11)\n\nwhere Path(a, b; Th) denotes the edges along the path connecting a and b in Th. 
Intuitively, \u03d1 denotes the \u201cbottleneck\u201d where errors are most likely to occur in tree structure estimation. See [16] for a detailed discussion.\n\n(A7) Number of Samples: Given \u03b5tree defined in [15], we require\n\nn > nspect(\u03b4, \u03b5tree; p, d, r), (12)\n\nwhere nspect is given by (9). Intuitively, \u03b5tree provides the bound on the distortion of the estimated pairwise marginals of the mixture components required for correct estimation of the tree approximations, and depends on \u03d1 in (11).\n\nTheorem 3 (Tree Approximations of Mixture Components) Under (A1)\u2013(A7), the Chow-Liu algorithm outputs the correct tree structures corresponding to maximum-likelihood tree approximations of the mixture components {P(y|H = h)} with probability at least 1 \u2212 4\u03b4, when the estimates of pairwise marginals {P\u0302spect(Ya, Yb|H = h)} from the spectral decomposition method are input.\n\nFigure 1: Performance of the proposed method, EM and EM initialized with the proposed method output on a tree mixture with two components. (a) Overall likelihood of the mixture; (b) conditional likelihood of the strong component; (c) conditional likelihood of the weak component.\n\nFigure 2: Classification error and normalized edit distances of the proposed method, EM and EM initialized with the proposed method output on the tree mixture. (a) Classification error; (b) strong component edit distance; (c) weak component edit distance.\n\n5 Experiments\n\nExperimental results are presented on synthetic data. We estimate the graphs using the proposed algorithm and compare the performance of our method with EM [4]. Comprehensive results based on the normalized edit distances and log-likelihood scores between the estimated and the true graphs are presented. We generate samples from a mixture over two different trees (r = 2) with mixing weights \u03c0 = [0.7, 0.3] using Gibbs sampling. Each mixture component is generated from the standard Potts model on p = 60 nodes, where the node variables are ternary (d = 3), and the number of samples n \u2208 [2.5 \u00d7 10^3, 10^4]. The joint distribution of the nodes in each mixture component is given by\n\nP(y|H = h) \u221d exp( \u2211_{(i,j)\u2208Gh} Ji,j;h (I(Yi = Yj) \u2212 1) + \u2211_{i\u2208V} Ki;h Yi ),\n\nwhere I is the indicator function and Jh := {Ji,j;h} are the edge potentials in the model. For the first component (H = 1), the edge potentials J1 are chosen uniformly from [5, 5.05], while for the second component (H = 2), J2 are chosen from [0.5, 0.55]. 
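The component distributions above can be written as a small scoring function. A sketch of the unnormalized log-probability of one component (the 3-node chain, edge set, and potential values here are illustrative, not the exact experimental setup):

```python
def potts_log_score(y, edges, J, K):
    """Unnormalized log P(y | H = h) for the Potts model above:
    sum_{(i,j) in G_h} J[i,j] * (1[y_i = y_j] - 1) + sum_i K[i] * y_i."""
    pair = sum(J[(i, j)] * ((y[i] == y[j]) - 1.0) for (i, j) in edges)
    node = sum(K[i] * y[i] for i in range(len(y)))
    return pair + node

edges = [(0, 1), (1, 2)]            # a tiny 3-node chain component
J = {(0, 1): 5.0, (1, 2): 5.0}      # "strong" edge potentials
K = [0.0, 0.0, 0.0]                 # node potentials set to zero
print(potts_log_score([1, 1, 2], edges, J, K))  # one disagreeing edge -> -5.0
```

Agreeing edges contribute 0 and disagreeing edges contribute -J, so large J (the strong component) concentrates mass on configurations where neighbors agree, while small J (the weak component) gives much weaker correlations.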
We refer to the first component as strong and the second as weak, since the correlations vary widely between the two models due to this choice of parameters. The node potentials are all set to zero (K_{i;h} = 0), except at the isolated node u* in the union graph. The performance of the proposed method is compared with EM. We consider 10 random initializations of EM and run each to convergence. We also evaluate EM initialized with the output of the proposed method (referred to as Proposed+EM in the figures). We observe in Fig. 1a that the overall likelihood under our method is comparable with EM; intuitively, this is because EM attempts to maximize the overall likelihood. However, our algorithm has significantly superior performance with respect to the edit distance, i.e., the error in estimating the tree structures of the two components, as seen in Fig. 2. In fact, EM never manages to recover the structure of the weak component (i.e., the component with weak correlations). Intuitively, this is because EM uses the overall likelihood as the criterion for tree selection; under the above choice of parameters, the weak component has a much lower contribution to the overall likelihood, and thus EM is unable to recover it. We also observe in Fig. 1b and Fig. 1c that our proposed method has superior performance in terms of the conditional likelihood of both components. Classification error is evaluated in Fig. 2a, where our method achieves smaller errors than EM.

The above experimental results confirm our theoretical analysis and suggest the advantages of our basic technique over more common approaches.
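The edit distance used as the structure-error metric can be read as the symmetric difference between the true and estimated edge sets of each component's tree. A small sketch of this reading (the normalization by twice the number of true edges is our assumption, chosen so the score lies in [0, 1]):

```python
def tree_edit_distance(true_edges, est_edges, normalize=True):
    """Structural error between two trees on the same node set.

    Counts edges present in one tree but not the other (symmetric
    difference of the edge sets); 0 means exact structure recovery.
    When normalize=True, divides by twice the number of true edges so
    the score lies in [0, 1] (both trees have p - 1 edges).
    """
    def canon(E):
        return {tuple(sorted(e)) for e in E}
    t, e = canon(true_edges), canon(est_edges)
    diff = len(t ^ e)
    return diff / (2 * len(t)) if normalize else diff
```

Under this metric, EM's failure to recover the weak component shows up as a persistently large edit distance in Fig. 2c, while the proposed method's edit distance decays with the sample size.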
Our method provides a point of tractability in the spectrum of probabilistic models, and extending beyond the class we consider here is a promising direction of future research.

Acknowledgements: The first author is supported in part by NSF Award CCF-1219234, AFOSR Award FA9550-10-1-0310, ARO Award W911NF-12-1-0404, and setup funds at UCI. The third author is supported by NSF Award 1028394 and AFOSR Award FA9550-10-1-0310.

References

[1] A. Anandkumar, V. Y. F. Tan, F. Huang, and A. S. Willsky. High-Dimensional Structure Learning of Ising Models: Local Separation Criterion. Accepted to Annals of Statistics, Jan. 2012.

[2] A. Jalali, C. Johnson, and P. Ravikumar. On Learning Discrete Graphical Models Using Greedy Methods. In Proc. of NIPS, 2011.

[3] P. Ravikumar, M. J. Wainwright, and J. Lafferty. High-Dimensional Ising Model Selection Using l1-Regularized Logistic Regression. Annals of Statistics, 2008.

[4] M. Meila and M. I. Jordan. Learning with Mixtures of Trees. J. of Machine Learning Research, 1:1–48, 2001.

[5] P. Spirtes and C. Meek. Learning Bayesian Networks with Discrete Variables from Data. In Proc. of Intl. Conf. on Knowledge Discovery and Data Mining, pages 294–299, 1995.

[6] G. Bresler, E. Mossel, and A. Sly. Reconstruction of Markov Random Fields from Samples: Some Observations and Algorithms. In Intl. Workshop APPROX: Approximation, Randomization and Combinatorial Optimization, pages 343–356. Springer, 2008.

[7] J. T. Chang. Full Reconstruction of Markov Models on Evolutionary Trees: Identifiability and Consistency. Mathematical Biosciences, 137(1):51–73, 1996.

[8] E. Mossel and S. Roch. Learning Nonsingular Phylogenies and Hidden Markov Models. The Annals of Applied Probability, 16(2):583–614, 2006.

[9] A. Anandkumar, D. Hsu, and S. M. Kakade. A Method of Moments for Mixture Models and Hidden Markov Models. In Proc. of Conf. on Learning Theory, June 2012.

[10] C. Chow and C. Liu. Approximating Discrete Probability Distributions with Dependence Trees. IEEE Trans. on Information Theory, 14(3):462–467, 1968.

[11] N. Meinshausen and P. Bühlmann. High-Dimensional Graphs and Variable Selection with the Lasso. Annals of Statistics, 34(3):1436–1462, 2006.

[12] M. Belkin and K. Sinha. Polynomial Learning of Distribution Families. In IEEE Annual Symposium on Foundations of Computer Science, pages 103–112, 2010.

[13] A. Moitra and G. Valiant. Settling the Polynomial Learnability of Mixtures of Gaussians. In IEEE Annual Symposium on Foundations of Computer Science, 2010.

[14] S. L. Lauritzen. Graphical Models. Clarendon Press, 1996.

[15] A. Anandkumar, D. Hsu, and S. M. Kakade. Learning High-Dimensional Mixtures of Graphical Models. Preprint, available on arXiv:1203.0697, Feb. 2012.

[16] V. Y. F. Tan, A. Anandkumar, and A. S. Willsky. A Large-Deviation Analysis for the Maximum-Likelihood Learning of Tree Structures. IEEE Trans. on Information Theory, 57(3):1714–1735, March 2011.