{"title": "Probabilistic Abstraction Hierarchies", "book": "Advances in Neural Information Processing Systems", "page_first": 913, "page_last": 920, "abstract": null, "full_text": "Probabilistic Abstraction Hierarchies\n\nEran Segal\n\nComputer Science Dept.\n\nStanford University\neran@cs.stanford.edu\n\nDaphne Koller\n\nComputer Science Dept.\n\nStanford University\n\nkoller@cs.stanford.edu\n\nDirk Ormoneit\n\normoneit@cs.stanford.edu\n\nComputer Science Dept.\n\nStanford University\n\nAbstract\n\nMany domains are naturally organized in an abstraction hierarchy or taxonomy,\nwhere the instances in \u201cnearby\u201d classes in the taxonomy are similar. In this pa-\nper, we provide a general probabilistic framework for clustering data into a set\nof classes organized as a taxonomy, where each class is associated with a prob-\nabilistic model from which the data was generated. The clustering algorithm\nsimultaneously optimizes three things: the assignment of data instances to clus-\nters, the models associated with the clusters, and the structure of the abstraction\nhierarchy. A unique feature of our approach is that it utilizes global optimization\nalgorithms for both of the last two steps, reducing the sensitivity to noise and the\npropensity to local maxima that are characteristic of algorithms such as hierarchi-\ncal agglomerative clustering that only take local steps. We provide a theoretical\nanalysis for our algorithm, showing that it converges to a local maximum of the\njoint likelihood of model and data. We present experimental results on synthetic\ndata, and on real data in the domains of gene expression and text.\n\n1 Introduction\nMany domains are naturally associated with a hierarchical taxonomy, in the form of a tree,\nwhere instances that are close to each other in the tree are assumed to be more \u201csimilar\u201d than\ninstances that are further away. 
In biological systems, for example, creating a taxonomy of the instances is often one of the first steps in understanding the system. In particular, much of the work on analyzing gene expression data [3] has focused on creating gene hierarchies. Similarly, in text domains, creating a hierarchy of documents is a common task [12, 7].

In many of these applications, the hierarchy is unknown; indeed, discovering the hierarchy is often a key part of the analysis. The standard algorithms applied to the problem typically use an agglomerative bottom-up approach [3] or a divide-and-conquer top-down approach [8]. Although these methods have been shown to be useful in practice, they suffer from several limitations: First, they proceed via a series of local improvements, making them particularly prone to local maxima. Second, at least the bottom-up approaches are typically applied to the raw data; models (if any) are constructed as a post-processing step. Thus, domain knowledge about the type of distribution from which data instances are sampled is rarely used in the formation of the hierarchy.

In this paper, we present probabilistic abstraction hierarchies (PAH), a probabilistically principled general framework for learning abstraction hierarchies from data which overcomes these difficulties. We use a Bayesian approach, where the different models correspond to different abstraction hierarchies. The prior is designed to enforce our intuitions about taxonomies: nearby classes have similar data distributions. More specifically, a model in a PAH is a tree, where each node in the tree is associated with a class-specific probabilistic model (CPM). Data is generated only at the leaves of the tree, so that a model basically defines a mixture distribution whose components are the CPMs at the leaves of the tree.
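Since data is generated only at the leaves, evaluating a PAH on an observation amounts to evaluating a mixture density whose components are the leaf CPMs. A minimal sketch, assuming diagonal-Gaussian leaf CPMs (the function and argument names are ours, not the paper's):

```python
import numpy as np

def pah_log_density(x, leaf_means, leaf_vars, leaf_weights):
    """Log density of one observation under the leaf mixture of a PAH.

    Each leaf CPM is assumed to be a diagonal Gaussian; `leaf_weights`
    are the multinomial probabilities of choosing each leaf.
    """
    x = np.asarray(x, dtype=float)
    log_comps = []
    for mu, var, w in zip(leaf_means, leaf_vars, leaf_weights):
        mu = np.asarray(mu, dtype=float)
        var = np.asarray(var, dtype=float)
        # log N(x | mu, diag(var)) for one leaf CPM
        ll = -0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mu) ** 2 / var)
        log_comps.append(np.log(w) + ll)
    m = max(log_comps)  # log-sum-exp over leaves, for numerical stability
    return m + np.log(sum(np.exp(c - m) for c in log_comps))
```

Internal-node CPMs do not appear in this density; they only shape the prior over hierarchies.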
The CPMs at the internal nodes are used to define the prior over models: We prefer models where the CPM at a child node is close to the CPM at its parent, relative to some distance function between CPMs. Our framework allows a wide range of notions of distance between models; we essentially require only that the distance function be convex in the parameters of the two CPMs. For example, if a CPM is a Gaussian distribution, we might use a simple squared Euclidean distance between the parameters of the two CPMs.

We present a novel algorithm for learning the model parameters and the tree structure in this framework. Our algorithm is based on the structural EM (SEM) approach of [4], but utilizes "global" optimization steps for learning the best possible hierarchy and CPM parameters (see also [5, 13] for similar global optimization steps within SEM). Each step in our procedure is guaranteed to increase the joint probability of model and data, and hence (like SEM) our procedure is guaranteed to converge to a local optimum.

Our approach has several advantages. (1) It provides principled probabilistic semantics for hierarchical models. (2) It is model based, which allows us to exploit domain structural knowledge more easily. (3) It utilizes global optimization steps, which tend to avoid local maxima and help make the model less sensitive to noise. (4) The abstraction hierarchy tends to pull the parameters of one model closer to those of nearby ones, which naturally leads to a form of parameter smoothing or shrinkage [12].

We present experiments for PAH on synthetic data and on two real data sets: gene expression and text.
Our results show that the PAH approach produces hierarchies that are more robust to noise in the data, and that the learned hierarchies generalize better to test data than those produced by hierarchical agglomerative clustering.

2 Probabilistic Abstraction Hierarchy
Let X be the domain of some random observation, e.g., the set of possible assignments to a set of features. Our goal is to take a set of instances in X, and to cluster them into some set of k classes. Standard "flat" clustering approaches (for example, AutoClass [1] or the k-means algorithm) are special cases of a generative mixture model. In such models, each data instance belongs to one of the k classes, each of which is associated with a different class-specific probabilistic model (CPM). Each data instance is sampled independently by first selecting one of the k classes according to a multinomial distribution, and then randomly selecting the data instance itself from the CPM of the chosen class.

In standard clustering models, there is no relation between the individual CPMs, which can be arbitrarily different. In this paper, we propose a model where the different classes are related to each other via an abstraction hierarchy, such that classes that are "nearby" in the hierarchy have similar probabilistic models. More precisely, we define:

Definition 2.1 A probabilistic abstraction hierarchy (PAH) H is a tree T with nodes M_1, ..., M_m and undirected edges E, such that T has exactly k leaves L_1, ..., L_k. Each node M_i, i = 1, ..., m, is associated with a CPM, which defines a distribution over X; we use theta_i to denote the parameters of M_i. We also have a multinomial distribution over the leaves L_1, ..., L_k; we use α to denote the parameters of this distribution.

Our framework does not, in principle, place restrictions on the form of the CPMs; we can use any probabilistic model that defines a probability distribution over X. For example, M_i may be a Bayesian network, in which case its specification would include the parameters, and perhaps also the network structure; in a different setting, M_i may be a hidden Markov model. In practice, however, the choice of CPMs has ramifications both for the overall hierarchical model and the algorithm.

As discussed above, we assume that data is generated only from the leaves of the tree. Thus, we augment X with an additional hidden class variable C for each data item, which takes the values 1, ..., k, denoting the leaf that was chosen to generate this item. Given a PAH H, an element x in X, and a value c for C, we define P(x, C = c | H) = P(C = c | α) P(x | L_c), where P(C = c | α) is the multinomial distribution over the leaves and P(x | L_c) is the conditional density of the data item given the CPM at leaf c. The induced distribution of x given H, from which the data are generated, is simply P(x | H), where C is summed out from P(x, C | H).

Figure 1: (a) A PAH with 3 leaves over a 4-dimensional continuous state space, along with a visualization of the Gaussian distribution for the 3rd dimension. (b) Two different weight-preserving transformations for a tree with 4 leaves M_1, ..., M_4.

As we mentioned, the role of the internal nodes in the tree is to enforce an intuitive interpretation of the model as an abstraction hierarchy, by enforcing similarity between CPMs at nearby leaves. We achieve this goal by defining a prior distribution over abstraction hierarchies H using a distance function d(M_i, M_j) that penalizes the distance between neighboring CPMs M_i and M_j. Note that we do not require that d be a distance in the mathematical sense; instead, we only require that it be symmetric (as we chose to use undirected trees), non-negative, and that d(M_i, M_j) = 0 iff M_i = M_j.¹ One obvious choice is to define d(M_i, M_j) = (1/2)(D_KL(M_i || M_j) + D_KL(M_j || M_i)), where D_KL(M_i || M_j) is the KL-distance between the distributions that M_i and M_j define over X. This distance measure has the advantage of being applicable to any pair of CPMs over the same space, even if their parameterization is different. Given a definition of d, we define the prior over PAHs as P(H) ∝ exp(-λ Σ_{(i,j) in E} d(M_i, M_j)), where λ represents the extent to which differences in distances are penalized (larger λ represents a larger penalty).²

Given a set of data instances D with domain X, our goal is to find a PAH H that maximizes P(H | D) ∝ P(H) P(D | H) or, equivalently, argmax_H [log P(H) + log P(D | H)]. By maximizing this expression, we are trading off the fit of the mixture model over the leaves to the data D, and the desire to generate a hierarchy in which nearby models are similar.

Fig.
1(a) illustrates a typical PAH with Gaussian CPM distributions, where a CPM close to the leaves of the tree is more specialized and thus has fairly peaked distributions. Conversely, CPMs closer to the root of the tree, acting to bridge between their neighbors, are expected to have less peaked distributions, and peak only around parts of the distribution which are common to an entire subtree.

3 Learning the Models
Our goal in this section is to learn a PAH H from a data set D = {x[1], ..., x[N]}. This learning task is fairly complex, as many aspects are unknown: the structure of the tree T, the CPMs M_1, ..., M_m at the nodes of T, the parameters α, and the assignment of the instances in D to leaves of T. Hence, the likelihood function has multiple local maxima, and no general method exists for finding the global maximum. In this section, we provide an efficient algorithm for finding a locally optimal H.

¹Two models are considered identical if they define the same distribution: P(x | M_i) = P(x | M_j) for all x.
²Care must be taken to ensure that P(H) is a proper probability distribution, but this will always be the case for the choice of d we use in this paper. We also note that, if desired, we can modify this prior to incorporate a prior over the parameters of the M_i's.

To simplify the algorithm, we assume that the structure of the CPMs M_i is fixed. This reduces the choice of each M_i to a pure numerical optimization problem. The general framework of our algorithm extends to cases where we also have to solve the model selection problem for each M_i, but the computational issues are somewhat different.

We first discuss the case of complete data, where for each data instance x[n] we are given the leaf from which it was generated. For this case, we show how to learn the structure of the tree T and the setting of the parameters. This problem, of constructing a tree over a set of points that is not fixed, is very closely related to the Steiner tree problem [10], virtually all of whose variants are NP-hard. We propose a heuristic approach that decouples the joint optimization problem into two subproblems: optimizing the CPM parameters given the tree structure, and learning a tree structure given a set of CPMs. Somewhat surprisingly, we show that our careful choice of additive prior allows each of these subproblems to be tackled very effectively using global optimization techniques.

We begin with the task of learning the CPMs. Thus, assume that we are given both the structure of the tree T and the assignment of each data instance x[n] to one of the leaves. It remains to find α and the parameters of the CPMs that maximize log P(H, D). Substituting the definitions into F = log P(H) + log P(D | H), we get that

F = -λ Σ_{(i,j) in E} d(M_i, M_j) + Σ_n [log α_{c[n]} + log P(x[n] | L_{c[n]})] + const.   (1)

The first term involving the multinomial parameters α separates from the rest, so that the optimization of F relative to α reduces to straightforward maximum likelihood estimation. To optimize the CPM parameters, the key property turns out to be the convexity of the d function, which holds for a wide variety of choices of CPMs and d; in particular, it holds for the models used in our experiments. The convexity property allows us to find the global optimum of F using a simple iterative procedure.
In each iteration, we optimize the parameters of one of the M_i's, fixing the parameters of the remaining CPMs M_j (j ≠ i). An examination of (1) shows that the optimization of each CPM M_i involves only the data cases assigned to M_i (if M_i is a leaf) and the parameters of the CPMs that are neighbors of M_i in the tree, thereby simplifying the computation substantially. This procedure is repeated for each of the M_i's in a round robin fashion, until convergence. By the joint convexity of F, this iterative procedure is guaranteed to converge to the global optimum of F.

We now turn our attention to the second subproblem, of learning the structure of the tree given the learned CPMs. We first consider an empty tree containing only the (unconnected) leaf nodes L_1, ..., L_k, and find the optimal parameter settings for each leaf CPM as described above. Note that these CPMs are unrelated, and the parameters of each one are computed independently of the other CPMs. Given this initial set of CPMs for the leaf nodes L_1, ..., L_k, the algorithm tries to learn a good tree structure T relative to these CPMs. The goal is to find the lowest weight tree, subject to the restriction that the tree structure must keep the same set of leaves L_1, ..., L_k. Due to the decomposability of log P(H), the penalty of the tree can be measured via the sum of the edge weights d(M_i, M_j). This problem is also a variant of the Steiner tree problem. As a heuristic substitute, we follow the lines of [5] and use a minimum spanning tree (MST) algorithm for constructing low-weight trees.

At each iteration, the algorithm starts out with a tree over some set of nodes M_1, ..., M_m. It takes the leaves L_1, ..., L_k of this tree, and constructs an MST over them. Of course, in the resulting tree, some of the L_i are no longer leaves. This problem is corrected by a transformation that "pushes" a leaf down the tree, duplicating its model; this transformation preserves the weight (score) of the tree. By using only L_1, ..., L_k, the algorithm simply "throws away" the entire structure of the previous tree. However, we can also construct new MSTs built from all nodes M_1, ..., M_m; for all leaves L_i which end up as internal nodes, we perform the same transformation described above. In both cases, this transformation is not unique, as it depends on the order in which the steps are executed; see Fig. 1(b). The algorithm therefore generates an entire pool of candidate trees (from both L_1, ..., L_k and M_1, ..., M_m), generated using different random resolutions of ambiguities in the weight-preserving transformation. For each such tree, the CPM learning algorithm is used to find an optimal setting of the parameters. The trees are evaluated relative to our score (log P(H, D)), and the highest scoring tree is kept.

The tree just constructed has a new set of CPMs, so we can repeat this process.
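A minimal sketch of one such candidate-tree construction, with Prim's algorithm for the MST and the leaf-duplicating transformation applied in a fixed node order (the names are ours; the actual algorithm randomizes the order of the transformation steps to generate a pool of candidates):

```python
def mst_edges(dist):
    """Prim's algorithm over a complete graph given a symmetric distance matrix."""
    n = len(dist)
    used = [False] * n
    used[0] = True
    edges = []
    for _ in range(n - 1):
        # cheapest edge crossing from the tree to a node outside it
        w, i, j = min((dist[a][b], a, b)
                      for a in range(n) if used[a]
                      for b in range(n) if not used[b])
        edges.append((i, j))
        used[j] = True
    return edges

def leaf_preserving_tree(models, dist_fn):
    """Build an MST over the leaf CPMs, then restore every model that ended up
    internal by pushing a zero-weight duplicate leaf below it; because
    dist_fn(M, M) = 0, the transformation preserves the weight of the tree.
    """
    k = len(models)
    dist = [[dist_fn(a, b) for b in models] for a in models]
    edges = mst_edges(dist)
    degree = [0] * k
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1
    models = list(models)
    for i in range(k):
        if degree[i] > 1:               # leaf i became internal in the MST
            models.append(models[i])    # duplicate its CPM...
            edges.append((i, len(models) - 1))  # ...and push it down as a leaf
    return models, edges
```

Every original CPM reappears at a leaf of the resulting tree, and the total edge weight equals the MST weight, since the added edges cost d(M, M) = 0.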
To detect termination, the algorithm also keeps the tree from the previous iteration, and terminates when the score of all trees in the newly constructed pool is lower than the score of the best tree from the previous iteration.

Finally, we address the fact that the data we have is incomplete, in that the assignments C[n] of data instances to classes are not determined. We address the problem of incomplete data using the standard Expectation Maximization (EM) algorithm [2] and the structural EM algorithm [4], which extends EM to the problem of model selection. Starting from an initial model, the algorithm iterates the following two steps: The E-step computes the distribution over the unobserved variables given the observed data and the current model. In our case, the distribution over the unobserved variables is computed by evaluating P(C[n] = c | x[n], H) for each instance x[n] and each leaf c. The M-step learns new models that increase the expected log likelihood of the data, relative to the distribution computed in the E-step. In our case, the M-step is precisely the algorithm for complete data described above, but using a soft assignment of data instances to nodes in the tree. The full algorithm is shown in Fig. 2.

A simple analysis along the lines of [4] can be used to show that the log-probability log P(H | D) increases at every M-step. We therefore obtain the following theorem:

Theorem 3.1 The algorithm in Fig. 2 converges to a local maximum of log P(H | D).

Figure 2: Abstraction Hierarchy Learning Algorithm.
1. Initialize the tree T and the models at the leaves. Randomly initialize α.
2. Repeat until convergence:
(a) T-step: i. Choose an MST over some subset of the nodes, using d(M_i, M_j) as edge weights. ii. Transform the MST so that L_1, ..., L_k become leaves.
(b) E-step: For each instance x[n], compute the posterior probabilities for the indicator variable C[n].
(c) M-step: Update the CPMs and α.

4 Experimental Results

We focus our experimental results on genomic expression data, although we also provide some results on a text dataset. In gene expression data, the level of mRNA transcript of every gene in the cell is measured simultaneously, using DNA microarray technology. This genomic expression data provides researchers with much insight towards understanding the overall cellular behavior.
The most commonly used method for analyzing this data is clustering, a process which identifies clusters of genes that share similar expression patterns (e.g., [3]), and which are therefore also often involved in similar cellular processes.

We apply PAH to this data, using CPMs that are products of independent Gaussians, one per experiment, in which case the KL-distance is simply the sum of squared distances between the means of the corresponding Gaussian components, normalized by their variance. We therefore define d(M_i, M_j) = D_KL(M_i || M_j).

The most popular clustering method for genomic expression data to date is hierarchical agglomerative clustering (HAC) [3], which builds a hierarchy among the genes by iteratively merging the closest genes relative to some distance metric. We use the same distance metric for HAC. (Note that in HAC the metric is used as the distance between data cases, whereas in our algorithm it is used as the distance between models.) To perform a direct comparison between PAH and HAC, we often need to obtain a probabilistic model from HAC.
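The CPM distance used here (and reused as HAC's data-space metric) is straightforward to compute. A minimal sketch for two CPMs that are products of independent Gaussians sharing per-experiment variances (names are ours):

```python
import numpy as np

def gaussian_cpm_distance(mu_i, mu_j, var):
    """d(M_i, M_j) for product-of-Gaussians CPMs with shared variances:
    the KL divergence reduces to the variance-normalized squared distance
    between the mean vectors, and is therefore automatically symmetric."""
    mu_i = np.asarray(mu_i, dtype=float)
    mu_j = np.asarray(mu_j, dtype=float)
    var = np.asarray(var, dtype=float)
    return float(np.sum((mu_i - mu_j) ** 2 / (2.0 * var)))
```

Note that d(M, M) = 0 and d is symmetric, as the prior over hierarchies requires.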
To do so, we create CPMs from the genes that HAC assigned to each internal node. In both PAH and HAC, we then assign each gene (in the training set or the test set) to the hierarchy by choosing the best (highest likelihood) CPM among all the nodes in the tree (including internal nodes) and recording the probability P(g | M_best) that this CPM assigns to the gene.

Structure Recovery. A good algorithm for learning abstraction hierarchies should recover the true hierarchy as well as possible. To test this, we generated a synthetic data set, and measured the ability of each method to recover the distances between pairs of instances (genes) in the generating model, where distance here is the length of the path between two genes in the hierarchy.

We generated the data set by sampling from the leaves of a PAH; to make the data realistic, we sampled from a PAH that we learned from a real gene expression data set. To allow a comparison with HAC, we generated one data instance from each leaf. We generated data for 80 (imaginary) genes and 100 experiments, for a total of 8000 measurements. For robustness, we generated 5 different such data sets and ran PAH and HAC for each data set.

We used the correlation and the L1 error between the pairwise distances in the original and the learned tree as measures of similarity. The correlation was markedly higher, and the average L1 error markedly lower, for PAH than for HAC. These results show that PAH recovers an abstraction hierarchy much better than HAC.

Generalization. We next tested the ability of the different methods to generalize to unobserved (test) data, measuring the extent to which each method captures the underlying structure in the data. We ran these tests on the yeast data set of [6]. We selected 953 genes with significant changes in expression, using their full set of 93 experiments.

Again, we ran PAH and HAC and evaluated performance using 5-fold cross validation. For PAH we also used different settings for λ (the coefficient of the penalty term in P(H)), which explores the performance in the range between only fitting the data (λ = 0) and greatly favoring hierarchies in which nearby models are similar (large λ). In both cases, we learned a model using training data, and evaluated the log-likelihood of test instances as described above. The results, summarized in Fig. 3(a), clearly show that PAH generalizes much better to previously unobserved data than HAC, and that PAH works best at some tradeoff between fitting the data and generating a hierarchy in which nearby models are similar.

Robustness. Our goal in constructing a hierarchy is to extract meaningful biological conclusions from the data. However, data is invariably partial and noisy. If our analysis produces very different results for slightly different training data, the biological conclusions are unlikely to be meaningful. Thus, we want genes that are assigned to nearby nodes in the tree to be close together also in hierarchies learned from perturbed data sets.

We tested robustness to noise by learning a model from the original data set and from perturbed data sets in which we permuted a varying percentage of the expression measurements. We then compared the distances (the path length in the tree) between the nodes assigned to every pair of genes in trees learned from the original data and trees learned from perturbed data sets. The results are shown in Fig.
3(b), demonstrating that PAH preserves the pairwise distances extremely well even when a substantial fraction of the data is perturbed (and performs reasonably well for even heavier permutation), while HAC completely deteriorates when only a small fraction of the data is permuted.

Figure 3: (a) Generalization to test data: average log probability of test instances as a function of λ, for PAH and HAC. (b) Robustness to noise: correlation coefficient between pairwise distances as a function of the percentage of data permuted, for PAH and HAC.

Figure 4: (a) Robustness of PAH and HAC to different subsets of training instances: average L1 difference of pairwise distances on the training and test sets, for training fractions p = 90%, 80%, and 70%. (b) Word hierarchy learned on Cora data.

A second important test is robustness to our particular choice of training data: a particular training set reflects only a subset of the experiments that we could have performed. In this experiment, we used the Yeast Compendium data of [9], which measures the expression profiles triggered by specific gene mutations. We selected 450 genes and all 298 arrays, focusing on genes that changed significantly.
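The comparisons in this section all reduce to pairwise path lengths between the tree nodes to which genes are assigned. A minimal sketch of that measure (names are ours):

```python
from collections import deque

def pairwise_tree_distances(n_nodes, edges, pairs):
    """Number of edges on the path between each queried pair of tree nodes,
    computed by BFS from each distinct source node."""
    adj = [[] for _ in range(n_nodes)]
    for i, j in edges:
        adj[i].append(j)
        adj[j].append(i)

    def bfs(src):
        dist = [-1] * n_nodes
        dist[src] = 0
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if dist[v] < 0:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        return dist

    from_src = {}
    result = []
    for a, b in pairs:
        if a not in from_src:
            from_src[a] = bfs(a)  # cache one BFS per source node
        result.append(from_src[a][b])
    return result
```

Correlating, or taking the average L1 difference of, these distances across two learned trees gives the kind of robustness score reported in Figs. 3(b) and 4(a).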
For each of three values of p (90%, 80%, and 70%), we generated ten different training sets by sampling (without replacement) p percent of the 450 genes, the rest of which form a test set.

We then placed both training and test genes within the hierarchy. For each data set, every pair of genes either appear together in the training set, the test set, or do not appear together (i.e., one appears in the training set and the other in the test set). We compared, for each pair of genes, their distances in training sets in which they appear together and their distances in test sets in which they appear together. The results are summarized in Fig. 4(a). Our results on the training data show that PAH consistently constructs very similar hierarchies, even from very different subsets of the data. By contrast, the hierarchies constructed by HAC are much less consistent. The results on the test data are even more striking. PAH is very consistent about its classification into the hierarchy even of test instances, ones not used to construct the hierarchy. In fact, there is no significant difference between its performance on the training data and the test data. By contrast, HAC places test instances in very different configurations in different trees, reducing our confidence in the biological validity of the learned structure.

Intuitiveness. To get qualitative insight into the hierarchies produced, we ran PAH on 350 documents from the Probabilistic Methods category in the Cora dataset (cora.whizbang.com) and learned hierarchies among the (stemmed) words. We constructed a vector for each word with an entry for each document whose value is the TFIDF-
4(b) shows parts of the learned hierarchy, consisting of 441 nodes, where we list high-confidence words for each node. PAH organized related words into the same region of the tree. Within each region, many words were arranged in a way that is consistent with our intuitive notion of abstraction.

5 Discussion
We presented probabilistic abstraction hierarchies, a general framework for learning abstraction hierarchies from data, which relates different classes in the hierarchy by a tree whose nodes correspond to class-specific probability models (CPMs). We utilize a Bayesian approach, where the prior favors hierarchies in which nearby classes have similar data distributions, by penalizing the distance between neighboring CPMs.

A unique feature of PAH is the use of global optimization steps for constructing the hierarchy and for finding the optimal setting of the entire set of parameters. This feature differentiates us from many other approaches that build hierarchies by local improvements of the objective function, or that optimize a fixed hierarchy [7]. The global optimization steps help in avoiding local maxima and in reducing sensitivity to noise. Our approach leads naturally to a form of parameter smoothing, and provides much better generalization for test data and robustness to noise than other clustering approaches.

In principle, we can use any probabilistic model for the CPM, as long as it defines a probability distribution over the state space. We have recently [14] applied this approach to the substantially more complex problem of clustering proteins based on their amino acid sequence, using profile HMMs [11].

Acknowledgements. We thank Nir Friedman for useful comments. This work was supported by NSF Grant ACI-0082554 under the NSF ITR program, and by the Sloan Foundation. Eran Segal was also supported by a Stanford Graduate Fellowship (SGF).

References
[1] P. Cheeseman and J. Stutz.
Bayesian Classification (AutoClass): Theory and Results. AAAI Press, 1995.

[2] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, B 39:1–39, 1977.

[3] M. Eisen, P. Spellman, P. Brown, and D. Botstein. Cluster analysis and display of genome-wide expression patterns. PNAS, 95:14863–68, 1998.

[4] N. Friedman. The Bayesian structural EM algorithm. In Proc. UAI, 1998.

[5] N. Friedman, M. Ninio, I. Pe'er, and T. Pupko. A structural EM algorithm for phylogenetic inference. In Proc. RECOMB, 2001.

[6] A. P. Gasch et al. Genomic expression program in the response of yeast cells to environmental changes. Mol. Bio. Cell, 11:4241–4257, 2000.

[7] T. Hofmann. The cluster-abstraction model: Unsupervised learning of topic hierarchies from text data. In Proc. IJCAI, 1999.

[8] T. Hofmann. The cluster-abstraction model: Unsupervised learning of topic hierarchies from text data. In Proc. International Joint Conference on Artificial Intelligence, 1999.

[9] T. R. Hughes et al. Functional discovery via a compendium of expression profiles. Cell, 102(1):109–26, 2000.

[10] F. K. Hwang, D. S. Richards, and P. Winter. The Steiner Tree Problem. Annals of Discrete Mathematics, Vol. 53, North-Holland, 1992.

[11] A. Krogh, M. Brown, S. Mian, K. Sjolander, and D. Haussler. Hidden Markov models in computational biology: Applications to protein modeling. Mol. Biology, 235:1501–1531, 1994.

[12] A. McCallum, R. Rosenfeld, T. Mitchell, and A. Ng. Improving text classification by shrinkage in a hierarchy of classes. In Proc. ICML, 1998.

[13] M. Meila and M. I. Jordan. Learning with mixtures of trees. Machine Learning, 1:1–48, 2000.

[14] E. Segal and D. Koller. Probabilistic hierarchical clustering for biological data.
In Proc. RECOMB, 2002.