{"title": "A MCMC Approach to Hierarchical Mixture Modelling", "book": "Advances in Neural Information Processing Systems", "page_first": 680, "page_last": 686, "abstract": null, "full_text": "A MCMC Approach to Hierarchical Mixture Modelling \n\nChristopher K. I. Williams \n\nInstitute for Adaptive and Neural Computation \nDivision of Informatics, University of Edinburgh \n5 Forrest Hill, Edinburgh EH1 2QL, Scotland, UK \n\nckiw@dai.ed.ac.uk \nhttp://anc.ed.ac.uk \n\nAbstract \n\nThere are many hierarchical clustering algorithms available, but these lack a firm statistical basis. Here we set up a hierarchical probabilistic mixture model, where data is generated in a hierarchical tree-structured manner. Markov chain Monte Carlo (MCMC) methods are demonstrated which can be used to sample from the posterior distribution over trees containing variable numbers of hidden units. \n\n1 Introduction \n\nOver the past decade or two mixture models have become a popular approach to clustering or competitive learning problems. They have the advantage of having a well-defined objective function and fit in with the general trend of viewing neural network problems in a statistical framework. However, one disadvantage is that they produce a \"flat\" cluster structure rather than the hierarchical tree structure that is returned by some clustering algorithms such as the agglomerative single-link method (see e.g. [12]). In this paper I formulate a hierarchical mixture model, which retains the advantages of the statistical framework, but also features a tree-structured hierarchy. \n\nThe basic idea is illustrated in Figure 1(a). At the root of the tree (level 1) we have a single centre (marked with a x). This is the mean of a Gaussian with large variance (represented by the large circle). A random number of centres (in this case 3) are sampled from the level 1 Gaussian, to produce 3 new centres (marked with o's). 
The variance associated with the level 2 Gaussians is smaller. A number of level 3 units are produced and associated with the level 2 Gaussians. The centre of each level 3 unit (marked with a +) is sampled from its parent Gaussian. This hierarchical process could be continued indefinitely, but in this example we generate data from the level 3 Gaussians, as shown by the dots in Figure 1(a). \n\nA three-level version of this model would be a standard mixture model with a Gaussian prior on where the centres are located. In the four-level model the third level centres are clumped together around the second level means, and it is this that distinguishes the model from a flat mixture model. Another view of the generative process is given in Figure 1(b), where the tree structure denotes which nodes are children of particular parents. Note also that this is a directed acyclic graph, with the arrows denoting dependence of the position of the child on that of the parent. \n\nIn section 2 we describe the theory of probabilistic hierarchical clustering and give a discussion of related work. Experimental results are described in section 3. \n\nFigure 1: The basic idea of the hierarchical mixture model. (a) x denotes the root of the tree, the second level centres are denoted by o's and the third level centres by +'s. Data is generated from the third level centres by sampling random points from Gaussians whose means are the third level centres. (b) The corresponding tree structure. \n\n2 Theory \n\nWe describe in turn (i) the prior over trees, (ii) the calculation of the likelihood given a data vector, (iii) Markov chain Monte Carlo (MCMC) methods for the inference of the tree structure given data and (iv) related work. 
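The generative process of Figure 1 can be sketched in code. The following is a minimal 1-d illustration assuming NumPy; the function name, tree representation and the specific parameter values are my own choices, not from the paper. Layer sizes follow the offset-Poisson prior of section 2.1, each node picks a parent uniformly from the layer above, and each centre is a Gaussian displacement from its parent's position.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_tree(lambdas, sigmas):
    """Sample layer sizes n, parent indicators Z and centre positions
    from the hierarchical prior (1-d sketch; names are illustrative)."""
    n = [1]                                # n_1 = 1: the single root node
    parents, means = [], [np.zeros(1)]     # root mean mu_1 at the origin
    for lam, sig in zip(lambdas, sigmas):
        n_next = int(rng.poisson(lam * n[-1])) + 1   # Poisson offset by 1
        z = rng.integers(0, n[-1], size=n_next)      # parent chosen w.p. 1/n_i
        means.append(means[-1][z] + sig * rng.standard_normal(n_next))
        n.append(n_next)
        parents.append(z)
    return n, parents, means

# A 4-level tree with the decreasing variances used in the experiments.
n, parents, means = sample_tree([1.5, 2.0, 3.0],
                                [np.sqrt(10.0), 1.0, np.sqrt(0.01)])
x = means[-1]    # observed data: positions of the final-layer points
```

Note that the offset of 1 guarantees every layer is non-empty, matching the prior described below.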
\n\n2.1 Prior over trees \n\nWe describe first the prior over the number of units in each layer, and then the prior on connections between layers. Consider an L layer hierarchical model. The root node is in level 1, there are n_2 nodes in level 2, and so on down to n_L nodes on level L. These n's are collected together in the vector n. We use a Markovian model for P(n), so that P(n) = P(n_1)P(n_2|n_1) ... P(n_L|n_{L-1}) with P(n_1) = δ(n_1, 1). Currently these are taken to be Poisson distributions offset by 1, so that n_{i+1} | n_i ~ Po(λ_i n_i) + 1, where λ_i is a parameter associated with level i. The offset is used so that there must always be at least one unit in any layer. \n\nGiven n, we next consider how the tree is formed. The tree structure describes which node in the ith layer is the parent of each node in the (i+1)th layer, for i = 1, ..., L-1. Each unit has an indicator vector which stores the index of the parent to which it is attached. We collect all these indicator vectors together into a matrix, denoted Z(n). The probability of a node in layer (i+1) connecting to any node in layer i is taken to be 1/n_i. Thus \n\nP(n, Z(n)) = P(n)P(Z(n)|n) = P(n) ∏_{i=1}^{L-1} (1/n_i)^{n_{i+1}}. \n\nWe now describe the generation of a random tree given n and Z(n). For simplicity we describe the generation of points in 1-d below, although everything can be extended to arbitrary dimension very easily. The mean μ_1 of the level 1 Gaussian is at the origin¹. (¹ It is easy to relax this assumption so that μ_1 has a prior Gaussian distribution, or is located at some point other than the origin.) The level 2 means μ_j^2, j = 1, ..., n_2 are generated from N(μ_1, σ_1^2), where σ_1^2 is the variance associated with the level 1 node. Similarly, the position of each level 3 node is generated from its level 2 parent as a displacement from the position of the level 2 parent. 
This displacement is a Gaussian RV with zero mean and variance σ_2^2. This process continues on down to the visible variables. In order for this model to be useful, we require that σ_1^2 > σ_2^2 > ... > σ_{L-1}^2, i.e. that the variability introduced at successive levels declines monotonically (cf. the scaling of wavelet coefficients). \n\n2.2 Calculation of the likelihood \n\nThe data that we observe are the positions of the points in the final layer; this is denoted x. To calculate the likelihood of x under this model, we need to integrate out the locations of the means of the hidden variables in levels 2 through to L-1. This can be done explicitly; however, we can shorten this calculation by realizing that given Z(n), the generative distribution for the observables x is Gaussian N(0, C). The covariance matrix C can be calculated as follows. Consider two leaf nodes indexed by k and l. The Gaussian RVs that generated the positions of these two leaves can be denoted \n\nx_k = w_k^1 + w_k^2 + ... + w_k^{L-1}, x_l = w_l^1 + w_l^2 + ... + w_l^{L-1}. \n\nTo calculate the covariance between x_k and x_l, we simply calculate ⟨x_k x_l⟩. This depends crucially on how many of the w's are shared between nodes k and l (cf. path analysis). For example, if w_k^1 ≠ w_l^1, i.e. the nodes lie in different branches of the tree at level 1, their covariance is zero. If k = l, the variance is just the sum of the variances of each RV in the tree. In between, the covariance of x_k and x_l can be determined by finding at what level in the tree their common parent occurs. \n\nUnder these assumptions, the log likelihood L of x given Z(n) is \n\nL = -(1/2) x^T C^{-1} x - (1/2) log|C| - (n_L/2) log 2π.   (1) \n\nIn fact this calculation can be speeded up by taking account of the tree structure (see e.g. [8]). 
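As a concrete illustration, the covariance construction and the likelihood of Eq. (1) can be written down directly. This is a sketch assuming NumPy; the parent-array tree representation and the small example tree are my own, not the paper's.

```python
import numpy as np

def leaf_covariance(parents, sigmas):
    """Build C for the leaves: C[k, l] sums sigma_i^2 over the levels at
    which leaves k and l share the same ancestor (illustrative sketch).
    parents[i][j] is the layer-(i+1) parent of node j in layer i+2."""
    n_leaf = len(parents[-1])
    anc = [np.arange(n_leaf)]                # ancestors, starting at the leaves
    for p in reversed(parents):
        anc.append(np.asarray(p)[anc[-1]])   # walk one layer up the tree
    anc = anc[::-1]                          # anc[i]: ancestor of each leaf in layer i+1
    C = np.zeros((n_leaf, n_leaf))
    for sig, a in zip(sigmas, anc[1:]):      # w^i is shared iff same layer-(i+1) ancestor
        C += sig**2 * (a[:, None] == a[None, :])
    return C

def log_likelihood(x, C):
    """Eq. (1): -x^T C^{-1} x / 2 - log|C| / 2 - (n_L / 2) log 2 pi."""
    _, logdet = np.linalg.slogdet(C)
    return (-0.5 * x @ np.linalg.solve(C, x) - 0.5 * logdet
            - 0.5 * len(x) * np.log(2 * np.pi))

# Tiny 4-level example: 1 root, 2 level-2 nodes, 3 level-3 nodes, 4 leaves.
parents = [np.zeros(2, dtype=int), np.array([0, 0, 1]), np.array([0, 1, 1, 2])]
sigmas = [np.sqrt(10.0), 1.0, np.sqrt(0.01)]
C = leaf_covariance(parents, sigmas)
# Leaves 0 and 1 share only their level-2 ancestor, so C[0, 1] = 10;
# leaf 3 lies in the other level-1 branch from leaf 0, so C[0, 3] = 0.
```

The diagonal of C is the full sum of variances (here 10 + 1 + 0.01), exactly as the k = l case above states.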
Note also that the posterior means (and variances) of the hidden variables can be calculated based on the covariances between the hidden and visible nodes. Again, this calculation can be carried out more efficiently; see Pearl [11] (section 7.2) for details. \n\n2.3 Inference for n and Z(n) \n\nGiven n we have the problem of trying to infer the connectivity structure Z given the observations x. Of course what we are interested in is the posterior distribution over Z, i.e. P(Z|x, n). One approach is to use a Markov chain Monte Carlo (MCMC) method to sample from this posterior distribution. A straightforward way to do this is to use the Metropolis algorithm, where we propose changes in the structure by changing the parent of a single node at a time. Note the similarities of this algorithm to the work of Williams and Adams [14] on Dynamic Trees (DTs); the main differences are (i) that disconnections are not allowed, i.e. we maintain a single tree (rather than a forest), and (ii) that the variables in the DT image models are discrete rather than Gaussian. \n\nWe also need to consider moves that change n. This can be effected with a split/merge move. In the split direction, consider a node with a parent and several children. Split this node and randomly assign the children to the two split nodes. Each of the split nodes keeps the same parent. The probability of accepting this move under the Metropolis-Hastings scheme is \n\nα = min(1, [P(n', Z(n')|x) Q(n, Z(n); n', Z(n'))] / [P(n, Z(n)|x) Q(n', Z(n'); n, Z(n))]), \n\nwhere Q(n', Z(n'); n, Z(n)) is the proposal probability of configuration (n', Z(n')) given configuration (n, Z(n)). This scheme is based on the work on MCMC model composition (MC³) by Madigan and York [9], and on Green's work on reversible jump MCMC [5]. \n\nAnother move that changes n is to remove \"dangling\" nodes, i.e. 
nodes which have no children. This occurs when all the nodes in a given layer \"decide\" not to use one or more nodes in the layer above. \n\nAn alternative to sampling from the posterior is to use approximate inference, such as mean-field methods. These are currently being investigated for DT models [1]. \n\n2.4 Related work \n\nThere are a very large number of papers on hierarchical clustering; in this work we have focussed on expressing hierarchical clustering in terms of probabilistic models. For example, Ambros-Ingerson et al. [2] and Mozer [10] developed models where the idea is to cluster data at a coarse level, subtract out the mean and cluster the residuals (recursively). This paper can be seen as a probabilistic interpretation of this idea. \n\nThe reconstruction of phylogenetic trees from biological sequence (DNA or protein) information gives rise to the problem of inferring a binary tree from the data. Durbin et al. [3] (chapter 8) show how a probabilistic formulation of the problem can be developed, and describe the link to agglomerative hierarchical clustering algorithms as approximations to the full probabilistic method (see §8.6 in [3]). Much of the biological sequence work uses discrete variables, which diverges somewhat from the focus of the current work. However, work by Edwards (1970) [4] concerns a branching Brownian-motion process, which has some similarities to the model described above. Important differences are that Edwards' model is in continuous time, and the variances of the particles are derived from a Wiener process (and so have variance proportional to the lifetime of the particle). This is in contrast to the decreasing sequence of variances at a given number of levels assumed in the above model. 
\nOne important difference between the model discussed in this paper and the phylogenetic tree model is that points in higher levels of the phylogenetic tree are taken to be individuals at an earlier time in evolutionary history, which is not the interpretation we require here. \n\nA very different notion of hierarchy in mixture models can be found in the work on the AutoClass system [6]. They describe a model involving class hierarchy and inheritance, but their trees specify over which dimensions sharing of parameters occurs (e.g. means and covariance matrices for Gaussians). In contrast, the model in this paper creates a hierarchy over examples labelled 1, ..., n rather than over dimensions. \n\nXu and Pearl [15] discuss the inference of a tree-structured belief network based on knowledge of the covariances of the leaf nodes. This algorithm cannot be applied directly in our case as the covariances are not known, although we note that if multiple runs from a given tree structure were available the covariances might be approximated using sample estimates. \n\nOther ideas concerning hierarchical clustering are discussed in [13] and [7]. \n\n3 Experiments \n\nWe describe two sets of experiments to explore these ideas. \n\n3.1 Searching over Z with n fixed \n\n100 4-level random trees were generated from the prior, using values of λ_1 = 1.5, λ_2 = 2, λ_3 = 3, and σ_1^2 = 10, σ_2^2 = 1, σ_3^2 = 0.01. These trees had between 4 and 79 leaf nodes, with an average of 30. For each tree n was kept the same as in the generative tree, and sampling was carried out over Z starting from a random initial configuration. A given node proposes changing its parent, and this proposal is accepted or rejected with the usual Metropolis probability. In one sweep, each node in levels 3 and 4 makes such a move. (Level 2 nodes only have one possible parent, so there is no point in such a move there.) 
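One such sweep over the parent-change move might be sketched as follows, assuming NumPy and recomputing the likelihood of Eq. (1) from the tree. Since P(Z(n)|n) is uniform for fixed n and the uniform proposal is symmetric, the Metropolis ratio reduces to a likelihood ratio. This is an illustrative sketch under my own tree representation, not the author's implementation, and a real implementation would update C incrementally rather than rebuilding it per proposal.

```python
import numpy as np

rng = np.random.default_rng(1)

def log_lik(x, parents, sigmas):
    """Gaussian log-likelihood of the leaves given the tree, Eq. (1)."""
    anc = [np.arange(len(x))]
    for p in reversed(parents):
        anc.append(np.asarray(p)[anc[-1]])   # ancestors one layer further up
    anc = anc[::-1]
    C = sum(s**2 * (a[:, None] == a[None, :]) for s, a in zip(sigmas, anc[1:]))
    _, logdet = np.linalg.slogdet(C)
    return (-0.5 * x @ np.linalg.solve(C, x) - 0.5 * logdet
            - 0.5 * len(x) * np.log(2 * np.pi))

def metropolis_sweep(x, parents, sigmas):
    """Each node in levels 3..L proposes a uniformly drawn new parent;
    accept with probability min(1, likelihood ratio)."""
    cur = log_lik(x, parents, sigmas)
    for i in range(1, len(parents)):       # skip level-2 nodes: only one parent
        n_parent = len(parents[i - 1])     # number of nodes in the layer above
        for j in range(len(parents[i])):
            old = parents[i][j]
            parents[i][j] = rng.integers(0, n_parent)
            new = log_lik(x, parents, sigmas)
            if np.log(rng.random()) < new - cur:
                cur = new                  # accept the parent change
            else:
                parents[i][j] = old        # reject: restore the old parent
    return cur

# 100 sweeps on a small random configuration with hypothetical data.
parents = [np.zeros(2, dtype=int), np.array([0, 0, 1]), np.array([0, 1, 1, 2])]
sigmas = [np.sqrt(10.0), 1.0, np.sqrt(0.01)]
x = np.array([-3.1, -2.9, -3.3, 4.0])
for _ in range(100):
    lp = metropolis_sweep(x, parents, sigmas)
```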
\nTo obtain a representative sample of P(Z(n)|n, x), we should run the chain for as long as possible. However, we can also use the chain to find configurations with high posterior probability, and in this case running for longer only increases the chances of finding a better configuration. In our experiments the sampler was run for 100 sweeps. As P(Z(n)|n) is uniform for fixed n, the posterior is simply proportional to the likelihood term. It would also be possible to run simulated annealing with the same move set to search explicitly for the maximum a posteriori (MAP) configuration. \n\nThe results are that for 76 of the 100 cases the highest posterior probability (HPP) configuration had higher posterior probability than the generative tree, for 20 cases the same tree was found, and in 4 cases the HPP solution was inferior to the generative tree. The fact that in almost all cases the sampler found a configuration as good as or better than the generative one in a relatively small number of sweeps is very encouraging. \n\nIn Figure 2 the generative (left column) and HPP trees for fixed n (middle column) are plotted for two examples. In panel (b) note the \"dangling\" node in level 2, which means that the level 3 nodes to the left end up in an inferior configuration to (a). By contrast, in panel (e) the sampler has found a better (less tangled) configuration than the generative model (d). \n\nFigure 2: (a) and (d) show the generative trees for two examples. The corresponding HPP trees for fixed n are plotted in (b) and (e) and those for variable n in (c) and (f). The number in each panel is the log posterior probability of the configuration. The nodes in levels 2 and 3 are shown located at their posterior means. Apparent non-tree structures are caused by two nodes being plotted almost on top of each other. 
\n\n3.2 Searching over both nand Z \n\nGiven some data x we will not usually know appropriate numbers of hidden units. This \nmotivates searching over both Z and n, which can be achieved using the split/merge moves \ndiscussed in section 2.3. \nIn the experiments below the initial numbers of units in levels 2 and 3 (denoted n2 and \n\n\fA MCMC Approach to Hierarchical Mixture Modelling \n\n685 \n\n113) were set using the simple-minded formulae 113 = rdim(x)/A31112 = r113/A21. A \nproper inferential calculation for 71,2 and 71,3 can be carried out, but it requires the solution \nof a non-linear optimization problem. Given 112 and 113, the initial connection configuration \nwas chosen randomly. \n\nThe search method used was to propose a split/merge move (with probability 0.5:0.5) in \nlevel 2, then to sample the level 2 to level 3 connections, and then to propose a split-merge \nmove in level 3, and then update the level 3 to level 4 connections. This comprised a single \nsweep, and as above 100 sweeps were used. \n\nExperiments were conducted on the same trees used in section 3.1. In this case the results \nwere that for 50 out of the 100 cases, the HPP configuration had higher posterior probability \nthan the generative tree, for 11 cases the same tree was found and in 39 cases the HPP \nsolution was inferior to the generative tree. Overall these results are less good than the \nones in section 3.1, but it should be remembered that the search space is now much larger, \nand so it would be expected that one would need to search longer. Comparing the results \nfrom fixed n against those with variable n shows that in 42 out of 100 cases the variable \nn method gave a higher posterior probability. in 45 cases it was lower and in 13 cases the \nsame trees were found. \n\nThe rightmost column of Figure 2 shows the HPP configurations when sampling with vari(cid:173)\nable n on the two examples discussed above. 
In panel (c) the solution found is not very dissimilar to that in panel (b), although the overall probability is lower. In (f), the solution found uses just one level 2 centre rather than two, and obtains a higher posterior probability than the configurations in (e) and (d). \n\n4 Discussion \n\nThe results above indicate that the proposed model behaves sensibly, and that reasonable solutions can be found with relatively short amounts of search. The method has been demonstrated on univariate data, but extending it to multivariate Gaussian data for which each dimension is independent given the tree structure is very easy, as the likelihood calculation is independent in each dimension. \n\nThere are many other directions in which the model can be developed. Firstly, the model as presented has uniform mixing proportions, so that children are equally likely to connect to each potential parent. This can be generalized so that there is a non-uniform vector of connection probabilities in each layer. Also, given a tree structure and independent Dirichlet priors over these probability vectors, these parameters can be integrated out analytically. Secondly, the model can be made to generate iid data by regarding the penultimate layer as mixture centres; in this case the term P(n_L|n_{L-1}) would be ignored when computing the probability of the tree. Thirdly, it would be possible to add the variance variables to the MCMC scheme, e.g. using the Metropolis algorithm, after defining a suitable prior on the sequence of variances σ_1^2, ..., σ_{L-1}^2. The constraint that all variances in the same level are equal could also be relaxed by allowing them to depend on hyperparameters set at every level. Fourthly, there may be improved MCMC schemes that can be devised. For example, in the current implementation the posterior means of the candidate units are not taken into account when proposing merge moves (cf. [5]). 
Fifthly, for the multivariate Gaussian version we can consider a tree-structured factor analysis model, so that higher levels in the tree need not have the same dimensionality as the data vectors. \n\nOne can also consider a version where each dimension is a multinomial rather than a continuous variable. In this case one might consider a model where a multinomial parameter vector θ_l in the tree is generated from its parent by θ_l = γ θ_{l-1} + (1 - γ) r, where γ ∈ [0,1] and r is a random vector of probabilities. An alternative model could be to build a tree-structured prior on the α parameters of the Dirichlet prior for the multinomial distribution. \n\nAcknowledgments \n\nThis work is partially supported through EPSRC grant GRIL 78161 Probabilistic Models for Sequences. I thank the Gatsby Computational Neuroscience Unit (UCL) for organizing the \"Mixtures Day\" in March 1999 and supporting my attendance, and Peter Green, Phil Dawid and Peter Dayan for helpful discussions at the meeting. I also thank Amos Storkey for helpful discussions and Magnus Rattray for (accidentally!) pointing me towards the chapters on phylogenetic trees in [3]. \n\nReferences \n\n[1] N. J. Adams, A. Storkey, Z. Ghahramani, and C. K. I. Williams. MFDTs: Mean Field Dynamic Trees. Submitted to ICPR 2000, 1999. \n\n[2] J. Ambros-Ingerson, R. Granger, and G. Lynch. Simulation of Paleocortex Performs Hierarchical Clustering. Science, 247:1344-1348, 1990. \n\n[3] R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis. Cambridge University Press, Cambridge, UK, 1998. \n\n[4] A. W. F. Edwards. Estimation of the Branch Points of a Branching Diffusion Process. Journal of the Royal Statistical Society B, 32(2):155-174, 1970. \n\n[5] P. J. Green. Reversible Jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82(4):711-732, 1995. \n\n[6] R. Hanson, J. 
Stutz, and P. Cheeseman. Bayesian Classification with Correlation and Inheritance. In IJCAI-91: Proceedings of the Twelfth International Joint Conference on Artificial Intelligence, Sydney, Australia, 1991. \n\n[7] T. Hofmann and J. M. Buhmann. Hierarchical Pairwise Data Clustering by Mean-Field Annealing. In F. Fogelman-Soulie and P. Gallinari, editors, Proc. ICANN 95. EC2 et Cie, 1995. \n\n[8] M. R. Luettgen and A. S. Willsky. Likelihood Calculation for a Class of Multiscale Stochastic Models, with Application to Texture Discrimination. IEEE Trans. Image Processing, 4(2):194-207, 1995. \n\n[9] D. Madigan and J. York. Bayesian Graphical Models for Discrete Data. International Statistical Review, 63:215-232, 1995. \n\n[10] M. C. Mozer. Discovering Discrete Distributed Representations with Iterated Competitive Learning. In R. P. Lippmann, J. E. Moody, and D. S. Touretzky, editors, Advances in Neural Information Processing Systems 3. Morgan Kaufmann, 1991. \n\n[11] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Mateo, CA, 1988. \n\n[12] B. Ripley. Pattern Recognition and Neural Networks. Cambridge University Press, Cambridge, UK, 1996. \n\n[13] N. Vasconcelos and A. Lippmann. Learning Mixture Hierarchies. In M. J. Kearns, S. A. Solla, and D. A. Cohn, editors, Advances in Neural Information Processing Systems 11, pages 606-612. MIT Press, 1999. \n\n[14] C. K. I. Williams and N. J. Adams. DTs: Dynamic Trees. In M. J. Kearns, S. A. Solla, and D. A. Cohn, editors, Advances in Neural Information Processing Systems 11. MIT Press, 1999. \n\n[15] L. Xu and J. Pearl. Structuring Causal Tree Models with Continuous Variables. In L. N. Kanal, T. S. Levitt, and J. F. Lemmer, editors, Uncertainty in Artificial Intelligence 3. Elsevier, 1989. 
", "award": [], "sourceid": 1650, "authors": [{"given_name": "Christopher", "family_name": "Williams", "institution": null}]}