{"title": "Estimating Dependency Structure as a Hidden Variable", "book": "Advances in Neural Information Processing Systems", "page_first": 584, "page_last": 590, "abstract": "", "full_text": "Estimating Dependency Structure as a Hidden \n\nVariable \n\nMarina Meill and Michael I. Jordan \n\n{mmp, jordan}@ai.mit.edu \n\nCenter for Biological & Computational Learning \n\nMassachusetts Institute of Technology \n\n45 Carleton St. E25-201 \nCambridge, MA 02142 \n\nAbstract \n\nThis paper introduces a probability model, the mixture of trees that can \naccount for sparse, dynamically changing dependence relationships. We \npresent a family of efficient algorithms that use EM and the Minimum \nSpanning Tree algorithm to find the ML and MAP mixture of trees for a \nvariety of priors, including the Dirichlet and the MDL priors. \n\n1 INTRODUCTION \nA fundamental feature of a good model is the ability to uncover and exploit independencies \nin the data it is presented with. For many commonly used models, such as neural nets and \nbelief networks, the dependency structure encoded in the model is fixed, in the sense that it \nis not allowed to vary depending on actual values of the variables or with the current case. \nHowever, dependency structures that are conditional on values of variables abound in the \nworld around us. Consider for example bitmaps of handwritten digits. They obviously \ncontain many dependencies between pixels; however, the pattern of these dependencies \nwill vary across digits. Imagine a medical database recording the body weight and other \ndata for each patient. The body weight could be a function of age and height for a healthy \nperson, but it would depend on other conditions if the patient suffered from a disease or \nwas an athlete. \n\nModels that are able to represent data conditioned dependencies are decision trees and \nmixture models, including the soft counterpart of the decision tree, the mixture of experts. 
\nDecision trees, however, can only represent certain patterns of dependency, and in particular are designed to represent a set of conditional probability tables and not a joint probability distribution. Mixtures are more flexible, and the rest of this paper focuses on one special case called the mixture of trees. \n\nWe will consider domains where the observed variables are related by pairwise dependencies only, and these dependencies are sparse enough to contain no cycles. Therefore they can be represented graphically as a tree. The structure of the dependencies may vary from one instance to the next. We index the set of possible dependency structures by a discrete structure variable z (that can be observed or hidden), thereby obtaining a mixture. \n\nIn the framework of graphical probability models, tree distributions enjoy many properties that make them attractive as modelling tools: they have a flexible topology, are intuitively appealing, sampling and computing likelihoods take linear time, and simple efficient algorithms for marginalizing and conditioning (O(|V|^2) or less) exist. Fitting the best tree to a given distribution can be done exactly and efficiently (Chow and Liu, 1968). Trees can capture simple pairwise interactions between variables but can prove insufficient for more complex distributions. Mixtures of trees enjoy most of the computational advantages of trees and, in addition, they are universal approximators over the space of all distributions. Therefore, they are fit for domains where the dependency patterns become tree-like when a possibly hidden variable is instantiated. \n\nMixture models have been used extensively in the statistics and neural network literature. Of relevance to the present work are mixtures of Gaussians, whose distribution space, in the case of continuous variables, overlaps with the space of mixtures of trees.
Work on fitting a tree to a distribution in the Maximum Likelihood (ML) framework was pioneered by (Chow and Liu, 1968) and was extended to polytrees by (Pearl, 1988) and to mixtures of trees with observed structure variable by (Geiger, 1992; Friedman and Goldszmidt, 1996). Mixtures of factorial distributions were studied by (Kontkanen et al., 1996), whereas (Thiesson et al., 1997) discusses mixtures of general belief nets. Multinets (Geiger, 1996), which are essentially mixtures of Bayes nets, include mixtures of trees as a special case. It is, however, worth studying mixtures of trees separately for their special computational advantages. \n\nThis work presents efficient algorithms for learning mixture of trees models with unknown or hidden structure variable. The following section introduces the model; section 3 develops the basic algorithm for its estimation from data in the ML framework. Section 4 discusses the introduction of priors over mixture of trees models and presents several realistic factorized priors for which the MAP estimate can be computed by modified versions of the basic algorithm. The properties of the model are verified by simulation in section 5, and section 6 concludes the paper. \n\n2 THE MIXTURE OF TREES MODEL \n\nIn this section we introduce the mixture of trees model and the notation that will be used throughout the paper. Let V denote the set of variables of interest. According to the graphical model paradigm, each variable is viewed as a vertex of a graph. Let r_v denote the number of values of variable v \\in V, x_v a particular value of v, and x_A an assignment to the variables in the subset A of V. To simplify notation, x_V will be denoted by x. \n\nWe use trees as graphical representations for families of probability distributions over V that satisfy a common set of independence relationships encoded in the tree topology.
In this representation, an edge of the tree shows a direct dependence or, more precisely, the absence of an edge between two variables signifies that they are independent, conditioned on all the other variables in V. We shall call a graph that has no cycles a tree^1 and shall denote by E the set of its (undirected) edges. A probability distribution T that is conformal with the tree (V, E) is a distribution that can be factorized as: \n\n    T(x) = \\frac{\\prod_{(u,v) \\in E} T_{uv}(x_u, x_v)}{\\prod_{v \\in V} T_v(x_v)^{\\deg v - 1}}    (1) \n\nHere \\deg v denotes the degree of v, i.e. the number of edges incident to node v \\in V. The factors T_{uv} and T_v are the marginal distributions under T: \n\n    T_{uv}(x_u, x_v) = \\sum_{x_{V-\\{u,v\\}}} T(x_u, x_v, x_{V-\\{u,v\\}}),    T_v(x_v) = \\sum_{x_{V-\\{v\\}}} T(x_v, x_{V-\\{v\\}})    (2) \n\nThe distribution itself will be called a tree when no confusion is possible. Note that a tree distribution has, for each edge (u, v) \\in E, a factor depending on x_u, x_v only. If the tree is connected, i.e. it spans all the nodes in V, it is often called a spanning tree. \n\nAn equivalent representation for T in terms of conditional probabilities is \n\n    T(x) = \\prod_{v \\in V} T_{v|pa(v)}(x_v | x_{pa(v)})    (3) \n\nThe form (3) can be obtained from (1) by choosing an arbitrary root in each connected component and recursively substituting T_{v,pa(v)}/T_{pa(v)} by T_{v|pa(v)} starting from the root. pa(v) represents the parent of v in the thus directed tree, or the empty set if v is the root of a connected component. The directed tree representation has the advantage of having independent parameters. The total number of free parameters in either representation is \\sum_{(u,v) \\in E} r_u r_v - \\sum_{v \\in V} (\\deg v - 1) r_v. \n\n^1 In the graph theory literature, our definition corresponds to a forest. The connected components of a forest are called trees.
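As a concrete illustration of the factorization (1), the sketch below evaluates a tree distribution over three binary variables with edges E = {(0,1), (1,2)} and checks that it sums to one. All marginal tables are illustrative numbers of our own (chosen to be mutually consistent), not values from the paper.

```python
import itertools
import numpy as np

# Node marginals T_v(x_v) for three binary variables (illustrative numbers)
T0 = np.array([0.6, 0.4])
T1 = np.array([0.5, 0.5])
T2 = np.array([0.3, 0.7])

# Pairwise marginals T_uv(x_u, x_v); their row/column sums reproduce the
# node marginals above, as eq. (2) requires.
T01 = np.array([[0.4, 0.2], [0.1, 0.3]])   # rows: x0, cols: x1
T12 = np.array([[0.2, 0.3], [0.1, 0.4]])   # rows: x1, cols: x2

deg = {0: 1, 1: 2, 2: 1}                   # node degrees in the chain 0-1-2

def T(x):
    """Eq. (1): product of edge marginals over product of node marginals^(deg-1)."""
    num = T01[x[0], x[1]] * T12[x[1], x[2]]
    den = T0[x[0]] ** (deg[0] - 1) * T1[x[1]] ** (deg[1] - 1) * T2[x[2]] ** (deg[2] - 1)
    return num / den

total = sum(T(x) for x in itertools.product([0, 1], repeat=3))
print(total)  # approx 1.0: the factorization defines a proper distribution
```

Because the pairwise tables are consistent with the node marginals, the factorization telescopes along the chain and normalizes exactly.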
\n\nNow we define a mixture of trees to be a distribution of the form \n\n    Q(x) = \\sum_{k=1}^{m} \\lambda_k T^k(x);    \\lambda_k \\geq 0, k = 1, ..., m;    \\sum_{k=1}^{m} \\lambda_k = 1    (4) \n\nFrom the graphical models perspective, a mixture of trees can be viewed as containing an unobserved choice variable z, taking value k \\in {1, ..., m} with probability \\lambda_k. Conditioned on the value of z, the distribution of the visible variables X is a tree. The m trees may have different structures and different parameters. Note that because of the structure variable, a mixture of trees is not properly a belief network, but most of the results here owe to the belief network perspective. \n\n3 THE BASIC ALGORITHM: ML FITTING OF MIXTURES OF TREES \n\nThis section shows how a mixture of trees can be fit to an observed dataset in the Maximum Likelihood paradigm via the EM algorithm (Dempster et al., 1977). The observations are denoted by {x^1, x^2, ..., x^N}; the corresponding values of the structure variable are {z^i, i = 1, ..., N}. \n\nFollowing the usual EM procedure for mixtures, the Expectation (E) step consists in estimating the posterior probability of each tree generating datapoint x^i \n\n    Pr[z^i = k | x^{1,...,N}, model] = \\gamma_k(i) = \\frac{\\lambda_k T^k(x^i)}{\\sum_{k'} \\lambda_{k'} T^{k'}(x^i)}    (5) \n\nThen the expected complete log-likelihood to be maximized by the M step of the algorithm is \n\n    E[l_c | x^{1,...,N}, model] = \\sum_{k=1}^{m} \\Gamma_k [\\log \\lambda_k + \\sum_{i=1}^{N} P^k(x^i) \\log T^k(x^i)]    (6) \n\n    \\Gamma_k = \\sum_{i=1}^{N} \\gamma_k(i),    P^k(x^i) = \\gamma_k(i) / \\Gamma_k    (7) \n\nThe maximizing values for the parameters \\lambda are \\lambda_k^{new} = \\Gamma_k / N. To obtain the new distributions T^k, we have to maximize for each k the expression that is the negative of the cross-entropy between P^k and T^k: \n\n    \\sum_{i=1}^{N} P^k(x^i) \\log T^k(x^i)    (8) \n\nFigure 1: The Basic Algorithm: ML Fitting of a Mixture of Trees \n\nInput: Dataset {x^1, ..., x^N} \n       Initial model m, T^k, \\lambda_k, k = 1, ..., m \n       Procedure MST(weights) that fits a maximum weight spanning tree over V \nIterate until convergence: \n  E step:  compute \\gamma_k(i), P^k(x^i) for k = 1, ..., m, i = 1, ..., N by (5), (7) \n  M step: \n    M1. \\lambda_k <- \\Gamma_k / N, k = 1, ..., m \n    M2. compute marginals P^k_v, P^k_{uv}, u, v \\in V, k = 1, ..., m \n    M3. compute mutual information I^k_{uv}, u, v \\in V, k = 1, ..., m \n    M4. call MST({I^k_{uv}}) to generate E_{T^k} for k = 1, ..., m \n    M5. T^k_{uv} <- P^k_{uv}; T^k_v <- P^k_v for (u, v) \\in E_{T^k}, k = 1, ..., m \n\nThis problem can be solved exactly as shown in (Chow and Liu, 1968). Here we give a brief description of the procedure. First, one has to compute the mutual information between each pair of variables in V under the target distribution P \n\n    I_{uv} = I_{vu} = \\sum_{x_u, x_v} P_{uv}(x_u, x_v) \\log \\frac{P_{uv}(x_u, x_v)}{P_u(x_u) P_v(x_v)},    u, v \\in V, u \\neq v    (9) \n\nSecond, the optimal tree structure is found by a Maximum Spanning Tree (MST) algorithm using I_{uv} as the weight for edge (u, v), for all u, v \\in V. Once the tree is found, its marginals T_{uv} (or T_{v|u}), (u, v) \\in E_T, are exactly equal to the corresponding marginals P_{uv} of the target distribution P. They are already computed as an intermediate step in the computation of the mutual informations I_{uv} (9). \n\nIn our case, the target distribution for T^k is represented by the posterior sample distribution P^k. Note that although each tree fit to P^k is optimal, for the encompassing problem of fitting a mixture of trees to a sample distribution only a local optimum is guaranteed to be reached. The algorithm is summarized in figure 1. \n\nThis procedure is based on one important assumption that should be made explicit now. It is the parameter independence assumption: the distribution T^k_{v|pa(v)} for any k, v and value of pa(v) is a multinomial with r_v - 1 free parameters that are independent of any other parameters of the mixture.
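The per-component Chow–Liu fit (steps M2–M5) can be sketched as follows. This is a minimal illustration, not the authors' code: the names (`fit_tree`, `data`, `gamma`) are ours, and it assumes all variables take values in {0, ..., n_values-1}.

```python
import numpy as np

def mutual_info(p_uv):
    """Mutual information of a 2-D joint probability table, eq. (9)."""
    p_u = p_uv.sum(axis=1, keepdims=True)
    p_v = p_uv.sum(axis=0, keepdims=True)
    mask = p_uv > 0
    return np.sum(p_uv[mask] * np.log(p_uv[mask] / (p_u @ p_v)[mask]))

def fit_tree(data, gamma, n_values):
    """Best tree for the weighted sample distribution P^k (Chow and Liu, 1968).

    data:  (N, d) integer array of observations x^i
    gamma: (N,) posteriors gamma_k(i) of one mixture component
    """
    n, d = data.shape
    w = gamma / gamma.sum()                     # P^k puts weight gamma_k(i)/Gamma_k on x^i
    # M2: pairwise marginals P^k_uv
    marg = {}
    for u in range(d):
        for v in range(u + 1, d):
            t = np.zeros((n_values, n_values))
            np.add.at(t, (data[:, u], data[:, v]), w)
            marg[(u, v)] = t
    # M3: mutual informations I^k_uv serve as edge weights
    I = {e: mutual_info(t) for e, t in marg.items()}
    # M4: maximum weight spanning tree via Kruskal's algorithm
    parent = list(range(d))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    edges = []
    for (u, v), _ in sorted(I.items(), key=lambda e: -e[1]):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            edges.append((u, v))
    # M5: the tree's edge marginals are copied from P^k
    return edges, {e: marg[e] for e in edges}
```

On data where one pair of variables is deterministically coupled and a third is independent, the returned tree contains the coupled pair as an edge, as the mutual-information weighting predicts.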
\n\nIt is possible to constrain the m trees to share the same structure, thus constructing a truly \nBayesian network. To achieve this, it is sufficient to replace the weights in step M4 by \nLk J~tI and run the MST algorithm only once to obtain the common structure ET. The \ntree stuctures obtained by the basic algorithm are connected. The following section will \ngive reasons and ways to obtain disconnected tree structures. \n4 MAP MIXTURES OF TREES \nIn this section we extend the basic algorithm to the problem of finding the Maximum a \nPosteriori (MAP) probability mixture of trees for a given dataset. In other words, we will \nconsider a nonuniform prior P[mode/] and will be searching for the mixture of trees that \nmaximizes \n\nlog P[model\\x1 , .. . N] = 10gP[xl, ... N\\model] + log P[model] + constant. \n\n(10) \nFactorized priors The present maximization problem differs from the ML problem solved \nin the previous section only by the addition of the term log P[model]. We can as well \n\n\f588 \n\nM. Meilii and M. l. Jordan \n\napproach it from the EM point of view, by iteratively maximizing \nE [logP[modelJxl , ... N, ZI, ... NJ] = E[lc{xl , ... N, zl , ... NJmodel)] + 10gP[model] \nIt is easy to see that the added term does not have any influence on the E step,which \nwill proceed exactly as before. However, in the M step, we must be able to successfully \nmaximize the r.h.s. of (11). Therefore, we look for priors of the form \n\n(11) \n\nP[model] = P[AI, .. . m] II P[Tkl \n\nm \n\nk=1 \n\n(12) \n\nThis class of priors is in agreement with the parameter independence assumption and \nincludes the conjugate prior for the multinomial distribution which is the Dirichlet prior. A \nDirichlet prior over a tree can be represented as a table of fictitious marginal probabilities \nP~~ for each pair u , v of variables plus an equivalent sample size Nt that gives the strength \nof the prior (Heckerman et al., 1995). 
However, for Dirichlet priors, the maximization over tree structures (corresponding to step M4) can only be performed iteratively (Meilă et al., 1997). \n\nMDL (Minimum Description Length) priors are less informative priors. They attempt to balance the number of parameters that are estimated with the amount of data available, usually by introducing a penalty on model complexity. For the experiments in section 5 we used edge pruning. More smoothing methods are presented in (Meilă et al., 1997). To penalize the number of parameters in each component we introduce a prior that penalizes each edge that is added to a tree, thus encouraging the algorithm to produce disconnected trees. The edge pruning prior is P[T]