{"title": "Semi-Supervised Learning with Trees", "book": "Advances in Neural Information Processing Systems", "page_first": 257, "page_last": 264, "abstract": "", "full_text": "Semi-Supervised Learning with Trees

Charles Kemp, Thomas L. Griffiths, Sean Stromsten & Joshua B. Tenenbaum

Department of Brain and Cognitive Sciences, MIT, Cambridge, MA 02139

{ckemp,gruffydd,sean_s,jbt}@mit.edu

Abstract

We describe a nonparametric Bayesian approach to generalizing from few labeled examples, guided by a larger set of unlabeled objects and the assumption of a latent tree-structure to the domain. The tree (or a distribution over trees) may be inferred using the unlabeled data. A prior over concepts generated by a mutation process on the inferred tree(s) allows efficient computation of the optimal Bayesian classification function from the labeled examples. We test our approach on eight real-world datasets.

1 Introduction

People have remarkable abilities to learn concepts from very limited data, often just one or a few labeled examples per class. Algorithms for semi-supervised learning try to match this ability by extracting strong inductive biases from a much larger sample of unlabeled data. A general strategy is to assume some latent structure T that underlies both the label vector Y to be learned and the observed features X of the full data (unlabeled and labeled; see Figure 1). The unlabeled data can be used to help identify the latent structure T, and an assumption that Y is somehow “smooth” with respect to T – or in Bayesian terms, can be assigned a strong prior conditional on T – provides the inductive bias needed to estimate Y successfully from very few labeled examples Yobs.

Different existing approaches can be understood within this framework. The closest to our current work is [1] and its cousins [2-5]. 
The structure T is assumed to be a low-dimensional manifold, whose topology is approximated by a sparse neighborhood graph defined over the data points (based on Euclidean distance between feature vectors in the X matrix). The label vector Y is assumed to be smooth with respect to T; [1] implements this smoothness assumption by defining a Gaussian field over all complete labelings Y of the neighborhood graph that expects neighbors to have the same label. This approach performs well in classifying data with a natural manifold structure, e.g., handwritten digits.

Figure 1: A general approach to semi-supervised learning. X is an observed object-feature matrix, Y the hidden vector of true labels for these objects and Yobs a sparse vector of observed labels. The unlabeled data in X assist in inferring Y by allowing us to infer some latent structure T that is assumed to generate both X and Y.

The graphical model in Figure 1 suggests a more general strategy for exploiting other kinds of latent structure T, not just low-dimensional manifolds. In particular, trees arise prominently in both natural and human-generated domains (e.g., in biology, language and information retrieval). Here we describe an approach to semi-supervised learning based on mapping the data onto the leaf nodes of a rooted (and typically ultrametric) tree T. The label vector Y is generated from a stochastic mutation process operating over branches of T. Tree T can be inferred from unlabeled data using either bottom-up methods (agglomerative clustering) or more complex probabilistic methods. The mutation process defines a prior over all possible labelings of the unlabeled data, favoring those that maximize a tree-specific notion of “smoothness”. Figure 2 illustrates this Tree-Based Bayes (TBB) approach. 
Each of the 32 objects in this dataset has two continuous features (x and y coordinates); X is a 32-by-2 matrix. Yobs contains four entries, two positive and two negative. The shading in part (b) represents a probabilistic inference about Y: the darker an object’s node in the tree, the more likely that its label is positive.

TBB classifies unlabeled data by integrating over all possible labelings of the domain that are consistent with the observed labels Yobs, and is thus an instance of optimal Bayesian concept learning [6]. Typically, optimal Bayes is of theoretical interest only [7], because the sum over labelings is in general intractable and it is difficult to specify sufficiently powerful and noise-resistant priors for real-world domains. Here, a prior defined in terms of a tree-based mutation process makes the approach efficient and empirically successful.

The next section describes TBB, as well as a simple heuristic method, Tree Nearest Neighbor (TNN), which we show approximates TBB in the limit of high mutation rate. Section 3 presents experimental comparisons with other approaches on a range of datasets.

Figure 2: Illustration of the Tree-Based Bayesian approach to semi-supervised learning. (a) We observe a set of unlabeled objects (small points) with some latent hierarchical structure (gray ellipses) along with two positive and two negative examples of a new concept (black and white circles). (b) Inferring the latent tree, and treating the concept as generated from a mutation process on the tree, we can probabilistically classify the unlabeled objects.

2 Tree-Based Bayes (TBB)

We assume a binary classification problem with Y ∈ {−1, 1}^n. We choose a label yi for unlabeled object xi by computing p(yi = 1 | Yobs, X) and thresholding at 0.5. 
Generalization to the multi-class case will be straightforward.

Ideally we would sum over all possible latent trees T:

    p(yi = 1 | Yobs, X) = Σ_T p(yi = 1 | Yobs, T) p(T | Yobs, X)    (1)

First we consider p(yi = 1 | Yobs, T) and the classification of object xi given a particular tree T. Section 2.2 discusses p(T | Yobs, X), the inference of tree T, and approaches to approximating the sum over trees in Equation 1.

We predict object xi’s label by summing over all possible complete labelings Y of the data:

    p(yi = 1 | Yobs, T) = Σ_Y p(yi = 1 | Y) p(Y | Yobs, T)    (2)
                        = Σ_Y p(yi = 1 | Y) p(Yobs | Y, T) p(Y | T) / p(Yobs | T)    (3)
                        = Σ_Y p(yi = 1 | Y) p(Yobs | Y) p(Y | T) / Σ_Y p(Yobs | Y) p(Y | T)    (4)

In general, the likelihood p(Yobs | Y) depends on assumptions about sampling and noise. Typical simplifying assumptions are that the labeled objects were chosen randomly from all objects in the domain, and that all observations are free of noise. Then p(Yobs | Y) ∝ 1 if Yobs is consistent with Y and is zero otherwise.

Under these assumptions, Equation 4 becomes:

    p(yi = 1 | Yobs, T) = Σ_{Y consistent with Yobs : yi = 1} p(Y | T) / Σ_{Y consistent with Yobs} p(Y | T)    (5)

The probability that yi = 1 reduces to the weighted fraction of label vectors consistent with Yobs that set yi = 1, with each label vector weighted by its prior under the tree, p(Y | T).

When class frequencies are unbalanced, small training sets provide little scope for learning if constructed using random sampling. Consider the problem of identifying genetic markers for a disease that afflicts one person in 10,000. A training set for this problem might be constructed by “retrospective sampling,” e.g. taking data from 20 patients with the disease and 20 healthy subjects. 
Randomly sampling subjects from the entire population would mean that even a medium-sized training set would have little chance of including anyone with the disease.

Retrospective sampling can be modeled by specifying a more complex likelihood p(Yobs | Y). The likelihood can also be modified to handle additional complexities, such as learning from labeled examples of just a single class, or learning in the presence of label noise. We consider none of these complexities here. Our experiments explore both random and retrospective sampling, but the algorithm we implement is strictly correct only for noise-free learning under random sampling.

2.1 Bayesian classification with a mutation model

In many tree-structured domains it is natural to think of features arising from a history of stochastic events or mutations. We develop a mutation model that induces a sensible “smoothness” prior p(Y | T) and enables efficient computation of Equation 5 via belief propagation on a Bayes net. The model combines aspects of several previous proposals for probabilistic learning with trees [8, 9, 10].

Let L be a feature corresponding to the class label. Suppose that L is defined at every point along every branch, not just at the leaf nodes where the data points lie. Imagine L spreading out over the tree from root to leaves: it starts out at the root with some value and could switch values at any point along any branch. Whenever a branch splits, both lower branches inherit the value of L at the point immediately before the split.

Transitions between states of L are modeled using a continuous-time Markov chain with infinitesimal matrix:

    Q = [ −λ    λ ]
        [  λ   −λ ]

The free parameter, λ, will be called the mutation rate. 
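For this Q, the matrix exponential e^{Qt} has a simple closed form (stated as Equation 6 below). A quick numerical sketch, assuming nothing beyond the two-state chain just defined (the function name is ours, not the paper's):

```python
import math

def transition_matrix(lam, t):
    '''Closed-form e^(Qt) for the symmetric two-state mutation chain:
    the off-diagonal entry, the probability that the label flips along
    a branch of length t, is (1 - e^(-2*lam*t)) / 2.'''
    same = (1 + math.exp(-2 * lam * t)) / 2
    flip = (1 - math.exp(-2 * lam * t)) / 2
    return [[same, flip], [flip, same]]

# Flips are more likely on longer branches, approaching 1/2 as t grows.
p_short = transition_matrix(1.0, 0.1)
p_long = transition_matrix(1.0, 10.0)
```

The flip probability stays strictly below 1/2 on any finite branch, which is what makes labelings with few mutations a priori more likely.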
Note that the mutation process is symmetric: mutations from −1 to 1 are just as likely as mutations in the other direction. Other models of mutation could be substituted if desired. Generalization to the k-class case is achieved by specifying a k by k matrix Q, with −λ on the diagonal and λ/(k−1) on the off-diagonal.

Transition probabilities along a branch of length t are given by:

    e^{Qt} = [ (1 + e^{−2λt})/2   (1 − e^{−2λt})/2 ]
             [ (1 − e^{−2λt})/2   (1 + e^{−2λt})/2 ]    (6)

That is, the probability that a parent and child separated by a branch of length t have different values of L is (1 − e^{−2λt})/2.

This mutation process induces a prior p(Y | T) equal to the probability of generating the label vector Y over leaves of T under the mutation process. The resulting distribution favors labelings that are “smooth” with respect to T. Regardless of λ, it is always more likely for L to stay the same than to switch its value along a branch. Thus labelings that do not require very many mutations are preferred, and the two hypotheses that assign the same label to all leaf nodes receive the most weight. Because mutations are more likely to occur along longer branches, the prior also favors hypotheses in which label changes occur between clusters (where branches tend to be longer) rather than within clusters (where branches tend to be shorter).

The independence assumptions implicit in the mutation model allow the right side of Equation 5 to be computed efficiently. Inspired by [9], we set up a Bayes net with the same topology as T that captures the joint probability distribution over all nodes. 
We associate with each branch a conditional probability table that specifies the value of the child conditioned on the value of the parent (based on Equation 6), and set the prior probabilities at the root node to the uniform distribution (the stationary distribution of the Markov chain specified by Q). Evaluating Equation 5 now reduces to a standard problem of inference in a Bayes net – we clamp the nodes in Yobs to their observed values, and compute the posterior marginal probability at node yi. The tree structure makes this computation efficient and allows specially tuned inference algorithms, as in [9].

2.2 A distribution over trees

We now consider p(T | Yobs, X), the second component of Equation 1. Using Bayes’ theorem:

    p(T | Yobs, X) ∝ p(Yobs, X | T) p(T)    (7)

We assume that each discrete feature in X is generated independently over T according to the mutation model just outlined. Continuous features can be handled by an analogous stochastic diffusion process in a continuous space (see for example [11]). Because the features are conditionally independent of each other and of Yobs given the tree, p(Yobs, X | T) can be computed using the methods of the previous section.

To finish the theoretical development of the model it remains only to specify p(T), a prior over tree structures. Section 3.2 uses a uniform prior, but a Dirichlet Diffusion Tree prior is another option [11].

2.3 Approximating the sum over trees

The sum over trees in Equation 1 is intractable for datasets of even moderate size. We therefore consider two approximations. Markov Chain Monte Carlo (MCMC) techniques have been used to approximate similar sums over trees in Bayesian phylogenetics [12], and Section 3.2 applies these ideas to a small-scale example. Although theoretically attractive, MCMC approaches are still expensive to use with large datasets. 
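As an illustration of the Section 2.1 computation, here is a minimal Python sketch of p(yi = 1 | Yobs, T) on a fixed tree, computed by an upward (pruning-style) pass; the dict-based tree encoding and the function names are our own, not the paper's implementation:

```python
import math

def transition(lam, t):
    '''Pairwise transition probabilities from Equation 6.'''
    same = (1 + math.exp(-2 * lam * t)) / 2
    diff = (1 - math.exp(-2 * lam * t)) / 2
    return {(-1, -1): same, (1, 1): same, (-1, 1): diff, (1, -1): diff}

def upward(node, obs, lam):
    '''Returns {state: p(observed leaves below node | node state)}.
    Leaves are dicts {'name': ...}; internal nodes also carry
    'children': [(child, branch_length), ...].'''
    if not node.get('children'):                       # leaf node
        y = obs.get(node['name'])                      # None if unlabeled
        return {s: 1.0 if y is None or y == s else 0.0 for s in (-1, 1)}
    msg = {s: 1.0 for s in (-1, 1)}
    for child, t in node['children']:
        P = transition(lam, t)
        below = upward(child, obs, lam)
        for s in (-1, 1):
            msg[s] *= sum(P[(s, sc)] * below[sc] for sc in (-1, 1))
    return msg

def posterior(root, obs, target, lam):
    '''p(y_target = 1 | Yobs, T): clamp the target leaf each way and
    compare the resulting evidence terms (uniform prior at the root).'''
    def evidence(label):
        msg = upward(root, {**obs, target: label}, lam)
        return 0.5 * (msg[-1] + msg[1])
    pos, neg = evidence(1), evidence(-1)
    return pos / (pos + neg)

# Tiny example: a balanced four-leaf tree with one positive and one
# negative label; the unlabeled leaf that shares a subtree with the
# positive example should lean positive.
leaf = lambda name: {'name': name}
tree = {'name': 'root', 'children': [
    ({'name': 'nA', 'children': [(leaf('a1'), 0.5), (leaf('a2'), 0.5)]}, 0.5),
    ({'name': 'nB', 'children': [(leaf('b1'), 0.5), (leaf('b2'), 0.5)]}, 0.5)]}
p = posterior(tree, {'a1': 1, 'b1': -1}, 'a2', lam=1.0)  # comes out above 0.5
```

Clamping the target and renormalizing is equivalent to the marginal computed by belief propagation on the clamped network, which is all Equation 5 requires here.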
Section 3.1 follows a simpler approach: we assume that most of the probability p(T | Yobs, X) is concentrated on or near the most probable tree T* and approximate Equation 1 as p(yi = 1 | Yobs, T*). The tree T* can be estimated using more or less sophisticated means. In Section 3.1 we use a greedy method – average-link agglomerative clustering on the object-feature matrix X, using Hamming or Euclidean distance in discrete or continuous domains, respectively. In Section 3.2 we compare this greedy method to the best tree found in our MCMC runs. Note that we ignore Yobs when building T*, because we run many trials on each dataset and do not want to compute a new tree for each value of Yobs. Since our data include many features and few labeled objects, the contribution of Yobs is likely to be negligible.

2.4 Tree Nearest Neighbor (TNN)

A Bayesian formulation based on the mutation process provides a principled approach to learning with trees, but there are simpler algorithms that instantiate similar intuitions. For instance, we could build a one-nearest-neighbor classifier using the metric of distance in the tree T (with ties resolved randomly). It is clear how this Tree Nearest Neighbor (TNN) algorithm reflects the assumption that nearby leaves in T are likely to have the same label, but it is not necessarily clear when and why this simple approach should work well.

An analysis of Tree-Based Bayes provides some insight here – TBB and TNN become equivalent when the λ parameter of TBB is set sufficiently high.

Theorem 1 For each ultrametric tree T, there is a λ0 such that TNN and TBB produce identical classifications for all examples with a unique nearest neighbor when λ > λ0.

A proof is available at http://www.mit.edu/~ckemp/papers/treesslproof.pdf, but we give some intuition for the result here. 
Consider the Bayes net described in Section 2.1 and suppose xi is an unlabeled object. The value chosen for yi will depend on all the labels in Yobs, but the influence of any single label decreases with distance in the tree from yi. Once λ becomes sufficiently high it can be shown that yi is always determined uniquely by the closest labeled example in the tree.

Given this equivalence between the algorithms, TNN is the method of choice when a high mutation rate is indicated. It is not only faster, but numerically more stable. For large values of λ, the probabilities manipulated by TBB become very close to 0.5 and variables that should be different may become indistinguishable within the limits of computational precision. Our implementation of TBB therefore uses TNN when cross-validation indicates that a sufficiently high value of λ is required.

3 Experiments

3.1 Trees versus Manifolds

We compared TBB and TNN with the Laplacian method of Belkin and Niyogi [4], an approach that effectively assumes a latent manifold structure T. We also ran generic one-nearest neighbor (NN) as a baseline.

The best performing method on a given dataset should be the algorithm that assumes the right latent structure for that domain. We therefore tested the algorithms on several different types of data: four taxonomic datasets (Beetles, Crustaceans, Salamanders and Worms, with 192, 56, 30 and 286 objects respectively), two molecular biology sets (Gene Promoter and Gene Splice, with sizes 106 and 3190), and two “manifold” sets (Digits and Vowels, with sizes 10,000 and 990).

The taxonomic datasets were expected to have a tree-like structure. Each set describes the external anatomy of a group of species, based on data available at http://biodiversity.uno.edu/delta/. 
One feature in the Beetles set, for example, indicates whether a beetle’s body is “strongly flattened, slightly flattened to moderately convex, or strongly convex.” Since these taxonomic sets do not include class labels, we chose features at random to stand in for the class label. We averaged across five such choices for each dataset.

The molecular biology sets were taken from the UCI repository. The objects in both sets are strings of DNA, and tree structures might also be appropriate here since these strings arose through evolution. The manifold sets arose from human motor behaviors, and were therefore expected to have a low-dimensional manifold structure. The Digits data are a subset of the MNIST data, and the Vowels data are taken from the UCI repository.

Our experiments focused on learning from very small labeled sets. The number of labeled examples was always set to a small multiple (m = 1, 2, 3, 5, or 10) of the total number of classes. The algorithms were compared under random and retrospective sampling, and training sets were always sampled with replacement. For each training-set size m, we averaged across 10 values of Yobs obtained by randomly sampling from the vector Y. Free parameters for TBB (λ) and Laplacian (number of nearest neighbors, number of eigenvectors) were chosen using randomized leave-one-out cross-validation.

Figure 3a shows the performance of the algorithms under random sampling for four representative datasets. TBB outperforms the other algorithms across the four taxonomic sets (only Beetles and Crustaceans shown), but the differences between TBB and Nearest Neighbor are rather small. These results do suggest a substantial advantage for TBB over Laplacian in tree-structured domains. 
As expected, this pattern is reversed on the Digits set, but it is encouraging that the tree-based methods can still improve on Nearest Neighbor even for datasets that are not normally associated with trees. Neither method beats the baseline on the Vowels or the Gene Promoter sets, but TBB performs well on the Gene Splice set, which suggests that it may find further uses in computational biology.

More dramatic differences between the algorithms appear under retrospective sampling (Figure 3b). There is a clear advantage here for TBB on the taxonomic sets. TBB fares better than the other algorithms when the class proportions in the training set do not match the proportions in the population, and it turns out that many of the features in the taxonomic datasets are unbalanced. Since the other datasets have classes of approximately equal size, the results for retrospective sampling are similar to those for random sampling.

While not conclusive, our results suggest that TBB may be the method of choice on tree-structured datasets, and is robust even for datasets (like Digits) that are not clearly tree-structured.

3.2 MCMC over trees

Figure 3 shows that TBB can perform well on real-world datasets using only a single tree. Working with a distribution over trees, although costly, could improve performance when there is not sufficient data to strongly constrain the best tree, or when the domain is not strongly tree-structured. 
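The Metropolis-Hastings scheme over tree topologies mentioned in Section 2.3 can be sketched briefly in Python; the nested-dict tree representation, the restriction to the local (nearest-neighbor-interchange) proposal, and the stand-in log_score are our own simplifications, not the paper's code:

```python
import copy, math, random

def nodes(tree):
    '''All nodes of a tree stored as nested dicts, with internal nodes
    carrying 'children': [(child, branch_length), ...].'''
    yield tree
    for child, _ in tree.get('children', []):
        yield from nodes(child)

def nni_proposal(tree):
    '''Nearest-neighbor interchange: pick a random internal edge and swap
    a subtree on one side with a subtree on the other. Returns a new tree;
    binary trees with at least one internal edge are assumed.'''
    tree = copy.deepcopy(tree)
    edges = [(p, i) for p in nodes(tree) if p.get('children')
             for i, (c, _) in enumerate(p['children']) if c.get('children')]
    parent, i = random.choice(edges)
    child = parent['children'][i][0]
    j = 1 - i                                 # sibling slot (binary tree)
    k = random.randrange(len(child['children']))
    # Swap the subtrees, keeping each branch length with its edge.
    (sib, t1), (gc, t2) = parent['children'][j], child['children'][k]
    parent['children'][j] = (gc, t1)
    child['children'][k] = (sib, t2)
    return tree

def mh_step(tree, log_score):
    '''One Metropolis-Hastings step, treating the NNI proposal as symmetric.
    log_score(T) stands in for log p(Yobs, X | T) + log p(T).'''
    proposal = nni_proposal(tree)
    delta = log_score(proposal) - log_score(tree)
    return proposal if random.random() < math.exp(min(0.0, delta)) else tree
```

A full sampler would alternate such local moves with the global subtree-pruning-and-regrafting proposals discussed below, and average Equation 1 over the visited trees after burn-in.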
Using a small synthetic example, we explored one such case: learning from very sparse and noisy data in a tree-structured domain.

Figure 3: Error rates (TBB, NN, TNN and Laplacian) for four datasets (Beetles, Crustaceans, Gene Splice and Digits) under (a) random and (b) retrospective sampling, as a function of the number of labeled examples m per class. Mean standard error bars for each dataset are shown in the upper right corner of the plot.

We generated artificial datasets consisting of 20 objects. Each dataset was based on a “true” tree T0, with objects at the leaves of T0. Each object was represented by a vector of 20 binary features generated by a mutation process over T0, with high λ. Most feature values were missing; the algorithms saw only 5 of the 20 features for each object. For each dataset, we created 20 test concepts from the same mutation process. The algorithms saw m labeled examples of each test concept and had to infer the labels of the remaining objects. This experiment was repeated for 10 random trees T0.

Our MCMC approach was inspired by an algorithm for reconstruction of phylogenetic trees [12], which uses Metropolis-Hastings over tree topologies with two kinds of proposals: local (nearest neighbor interchange) and global (subtree pruning and regrafting). Unlike the previous section, none of the trees considered (including the true tree T0) was ultrametric. Instead, each branch in each tree was assigned a fixed length. This meant that any two trees with the same hierarchical structure were identical, and we did not have to store trees with the same topology but different branch lengths.

Figure 4: Error rates on sparse artificial data (TBB with the ideal, MCMC, modal and agglomerative trees, and NN) as a function of the number of labels observed.

Figure 4 shows the mean classification error rate, based on 1600 samples after a burn-in of 400 iterations. Four versions of TBB are shown: “ideal” uses the true tree T0, “MCMC” uses model averaging over a distribution of trees, “modal” uses the single most likely tree in the distribution, and “agglom” uses a tree built by average-link clustering. The ideal learner beats all others because the true tree is impossible to identify with such sparse data. Using MCMC over trees brings TBB substantially closer to the ideal than simpler alternatives that ignore the tree structure (NN) or consider only a single tree (modal, agglom).

4 Conclusion

We have shown how to make optimal Bayesian concept learning tractable in a semi-supervised setting by assuming a latent tree structure that can be inferred from the unlabeled data and defining a prior for concepts based on a mutation process over the tree. Our Bayesian framework supports many possible extensions, including active learning, feature selection, and model selection. 
Inferring the nature of the latent structure T – rather than assuming a manifold structure or a tree structure – is a particularly interesting problem. When little is known about the form of T, Bayesian methods for model selection could be used to choose among approaches that assume manifolds, trees, flat clusters, or other canonical representational forms.

Acknowledgments This project was supported by the DARPA CALO program and NTT Communication Science Laboratories. Our implementation of the Laplacian method was based on code provided by Mikhail Belkin.

References

[1] X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In ICML, volume 20, 2003.

[2] M. Szummer and T. Jaakkola. Partially labeled classification with Markov random walks. In NIPS, volume 14, 2002.

[3] A. Blum and S. Chawla. Learning from labeled and unlabeled data using graph mincuts. In ICML, volume 18, 2001.

[4] M. Belkin and P. Niyogi. Semi-supervised learning on manifolds. 2003. To appear in Machine Learning, Special Issue on Theoretical Advances in Data Clustering.

[5] O. Chapelle, J. Weston, and B. Schölkopf. Cluster kernels for semi-supervised learning. In NIPS, volume 15, 2003.

[6] T. M. Mitchell. Machine Learning. McGraw-Hill, 1997.

[7] D. Haussler, M. Kearns, and R. Schapire. Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. Machine Learning, 14(1), 1994.

[8] C. Kemp and J. B. Tenenbaum. Theory-based induction. In Proceedings of the 25th Annual Conference of the Cognitive Science Society, 2003.

[9] L. Shih and D. Karger. Learning classes correlated to a hierarchy. 2003. Unpublished manuscript.

[10] J.-P. Vert. A tree kernel to analyze phylogenetic profiles. Bioinformatics, 1(1):1–9, 2002.

[11] R. Neal. Defining priors for distributions using Dirichlet diffusion trees. 
Technical Report 0108, University of Toronto, 2001.

[12] H. Jow, C. Hudelot, M. Rattray, and P. Higgs. Bayesian phylogenetics using an RNA substitution model applied to early mammalian evolution. Molecular Biology and Evolution, 19(9):1591–1601, 2002.
", "award": [], "sourceid": 2464, "authors": [{"given_name": "Charles", "family_name": "Kemp", "institution": null}, {"given_name": "Thomas", "family_name": "Griffiths", "institution": null}, {"given_name": "Sean", "family_name": "Stromsten", "institution": null}, {"given_name": "Joshua", "family_name": "Tenenbaum", "institution": null}]}