{"title": "The Time-Marginalized Coalescent Prior for Hierarchical Clustering", "book": "Advances in Neural Information Processing Systems", "page_first": 2969, "page_last": 2977, "abstract": "We introduce a new prior for use in Nonparametric Bayesian Hierarchical Clustering. The prior is constructed by marginalizing out the time information of Kingman\u2019s coalescent, providing a prior over tree structures which we call the Time-Marginalized Coalescent (TMC). This allows for models which factorize the tree structure and times, providing two benefits: more flexible priors may be constructed and more efficient Gibbs type inference can be used. We demonstrate this on an example model for density estimation and show the TMC achieves competitive experimental results.", "full_text": "The Time-Marginalized Coalescent Prior for\n\nHierarchical Clustering\n\nLevi Boyles\n\nMax Welling\n\nDepartment of Computer Science\n\nUniversity of California, Irvine\n\nDepartment of Computer Science\n\nUniversity of California, Irvine\n\nIrvine, CA 92617\n\nlboyles@uci.edu\n\nIrvine, CA 92617\n\nwelling@uci.edu\n\nAbstract\n\nWe introduce a new prior for use in Nonparametric Bayesian Hierarchical Clus-\ntering. The prior is constructed by marginalizing out the time information of\nKingman\u2019s coalescent, providing a prior over tree structures which we call the\nTime-Marginalized Coalescent (TMC). This allows for models which factorize\nthe tree structure and times, providing two bene\ufb01ts: more \ufb02exible priors may be\nconstructed and more ef\ufb01cient Gibbs type inference can be used. We demonstrate\nthis on an example model for density estimation and show the TMC achieves com-\npetitive experimental results.\n\n1\n\nIntroduction\n\nHierarchical clustering models aim to \ufb01t hierarchies to data, and enjoy the property that cluster-\nings of varying size can be obtained by \u201cpruning\u201d the tree at particular levels. 
In contrast, standard clustering models must specify the number of clusters beforehand, while Nonparametric Bayesian (NPB) clustering models such as the Dirichlet Process Mixture (DPM) [5, 13] directly infer the (effective) number of clusters. Hierarchical clustering is often used in population genetics for inferring ancestral history and in bioinformatics for genetic clustering, and has also seen use in computer vision [18, 1] and topic modelling [3, 1].\n\nNPB models are a class of models of growing popularity. Being Bayesian, these models can easily quantify the uncertainty in the resulting inferences, and being nonparametric, they can seamlessly adapt to increasingly complicated data, avoiding the model selection problem. NPB hierarchical clustering models are an important regime of such models, and have been shown to have superior performance to alternative models in many domains [8]. Thus, further advances in the applicability of these models are important.\n\nThere has been substantial work on NPB models for hierarchical clustering. Dirichlet Diffusion Trees (DDT) [16], Kingman's Coalescent [9, 10, 4, 20], and Pitman-Yor Diffusion Trees (PYDT) [11] all provide models in which data is generated from a Continuous-Time Markov Chain (CTMC) that lives on a tree that splits (or coalesces) according to some continuous-time process. The nested CRP and DP [3, 17] and Tree-Structured Stick Breaking (TSSB) [1] define priors over tree structures from which data is directly generated.\n\nAlthough there is an extensive and impressive literature on the subject demonstrating its useful clustering properties, NPB hierarchical clustering has yet to see widespread use. The high computational cost typically associated with these models is a likely inhibitor to their adoption.
The CTMC-based models are typically more computationally intensive than the direct generation models, and there has been substantial work in improving the speed of inference in these models. [12] introduces a variational approximation for the DDT, and [7, 6] provide more efficient SMC schemes for the Coalescent. The direct generation models are typically faster, but usually come at some cost or limitation; for example, the TSSB allows (and requires) that data live at some of its internal nodes.\n\nFigure 1: Coalescent tree construction. (left) A pair is uniformly drawn from N = 5 points to coalesce. (middle) The coalescence time t5 is drawn from Exp(C(5,2)), where C(n,2) denotes the binomial coefficient "n choose 2", and another pair on the remaining 4 points is drawn uniformly. (right) After drawing t4 ~ Exp(C(4,2)), the coalescence time for the newly coalesced pair is t5 + t4.\n\nFigure 2: Consider the trees one might construct by uniformly picking pairs of points to join, starting with four leaves {a, b, c, d}. One can join a and b first, and then c and d (and then the two parents), or c and d and then a and b, to construct the tree on the left. By defining a uniform prior over φn, and then marginalizing out the order of the internal nodes ρ (equivalently, the order in which pairs are joined), we then have a prior over ψn that puts more mass on balanced trees than unbalanced ones. For example, the tree on the right can only be constructed in one way by node joining.\n\nOur contribution is a new prior over tree structures that is simpler than many of the priors described above, yet still retains the exchangeability and consistency properties of a NPB model. The prior is derived by marginalizing out the times and ordering of internal nodes of the coalescent. The remaining distribution is an exchangeable and consistent prior over tree structures.
This prior may be used directly with a data generation process, or a notion of time may be reintroduced, providing a prior with a factorization between the tree structures and times. The simplicity of the prior allows for great flexibility and the potential for more efficient inference algorithms. For the purposes of this paper, we focus on one such possible model, wherein we generate branch lengths according to a process similar to stick-breaking.\n\nWe introduce the proposed prior on tree structures in Section 2, the distribution over times conditioned on tree structure in Section 3.1, and the data generation process in Section 3.2. We show experimental results in Section 4, and conclude in Section 5.\n\n2 Coalescent Prior on Tree Structures\n\n2.1 Kingman's Coalescent\n\nKingman's Coalescent provides a prior over balanced, edge-weighted trees, wherein the weights are often interpreted as representing some notion of time. See Figure 1. A coalescent tree can be sampled as follows: start with n points and n dangling edges hanging from them, and all weights set to 0. Sample a time from Exp(C(n,2)), and add this value to the weight of each of the dangling edges. Then pick a pair uniformly at random to coalesce (giving rise to their mutual parent, whose new dangling edge has weight 0). Repeat this process on the remaining n − 1 points until a full weighted tree is constructed. Note, however, that the weights do not influence the choice of tree structures sampled, which suggests we can marginalize out the times and still retain an exchangeable and consistent distribution over trees.
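The sampling procedure just described is short enough to sketch directly. The following is a minimal illustration, not from the paper: the dict-based tree encoding and the node numbering (leaves 0..n−1, internal nodes created afterwards) are our own illustrative choices, and the exponential rate is C(k,2) = k(k−1)/2 for k surviving lineages.

```python
import random

def sample_coalescent(n, rng=None):
    """Sample a tree from Kingman's coalescent over n leaves.

    Returns (children, times): `children` maps each internal node to the
    pair it merged; `times` maps each node to its coalescence time
    (leaves sit at time 0).  Encoding is an illustrative assumption.
    """
    rng = rng or random.Random()
    active = list(range(n))              # the dangling edges
    times = {i: 0.0 for i in range(n)}
    children = {}
    t, nxt = 0.0, n
    while len(active) > 1:
        k = len(active)
        t += rng.expovariate(k * (k - 1) / 2)   # waiting time ~ Exp(C(k,2))
        a, b = rng.sample(active, 2)            # uniform pair to coalesce
        active.remove(a)
        active.remove(b)
        children[nxt] = (a, b)                  # their mutual parent
        times[nxt] = t
        active.append(nxt)
        nxt += 1
    return children, times
```

Since the uniform pair choices never look at the sampled times, dropping `times` from the return value is exactly the marginalization that yields the TMC.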
What remains is simply a process in which a uniformly chosen pair of points is joined at every iteration.\n\n2.2 Coalescent Distribution over Trees\n\nWe consider two types of tree-like structures: generic (rooted) unweighted tree graphs, which we denote ψn, living in Ψn, and trees of the previous type but with a specified ordering ρ on the internal (non-leaf) nodes of the tree, denoted (ψn, ρ) = φn ∈ Φn. Marginalizing out the times of the coalescent gives a uniform prior over ordered tree structures φn.\n\nFigure 3: (left) A sample from the described prior with stick-breaking parameter B(1, 1) (uniform). (middle) A sample using B(2, 2). (right) A sample using B(4, 2).\n\nThe order information is necessary because for a given ψn there are multiple ways of constructing it by uniformly picking pairs to join; see Figure 2. If there are i remaining nodes to join, there are C(i,2) ways of joining them, so we have for the probability of a particular φn:\n\np(φn) = ∏_{i=2}^{n} C(i,2)^{−1}\n\nThis defines an exchangeable and consistent prior over Φn; exchangeable because p(φn) does not depend on the order in which the data is seen, and consistent because the conditional prior1 is well defined – we can imagine adding a new leaf to an existing φn, which creates a new internal node. Let yi denote the ith internal node2 of φn, i ∈ {1...n − 1}, and let y* denote the new internal node. There are n ways of attaching the new internal node below y1, n − 1 ways of attaching below y2, and so on, giving n(n+1)/2 = C(n+1,2) ways of attaching y* into φn. Thus if we make this choice uniformly at random, we get that the probability of the new tree is p(φ_{n+1}) = p(φn) C(n+1,2)^{−1} = ∏_{i=2}^{n+1} C(i,2)^{−1}.\n\nIt is possible to marginalize out the ordering information in the coalescent tree structures φn to derive exchangeable, consistent priors on "unordered" tree structures ψn. We can perform this marginalization by counting how many ordered tree structures φn ∈ Φn are consistent with a particular unordered tree structure ψn.\n\nLemma 1. A tree ψn has T(ψn) = (n − 1)! / ∏_{i=1}^{n−1} m_i possible orderings on its internal nodes, where m_i is the number of internal nodes in the subtree rooted at node i.\n\n(For proof see the supplementary material.) This is in agreement with what we would expect: for an unbalanced tree, m_i = {1, 2, ..., n − 1}, so this gives T = 1. Since an unbalanced tree imposes a full ordering on the internal nodes, there can be only one unbalanced ordered tree that maps to the corresponding unbalanced unordered tree. As the tree becomes more balanced, the m_i decrease, increasing T.\n\nThus the probability of a particular ψn is T(ψn) times the probability of an ordered tree φn under the coalescent:3\n\np(ψn) = T(ψn) ∏_{i=2}^{n} C(i,2)^{−1} = [(n − 1)! / ∏_{i=1}^{n−1} m_i] ∏_{i=2}^{n} C(i,2)^{−1}    (1)\n\nTheorem 1. p(ψn) defines an exchangeable and consistent prior over Ψn.\n\np(ψn) is clearly still exchangeable as it does not depend on any order of the data, and was defined by marginalizing out a consistent process, so its conditional priors are still well defined and thus p(ψn) is consistent.
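Lemma 1 and Eq. (1) are easy to evaluate on a concrete tree. Below is a small sketch; the dict encoding, with internal nodes numbered after the leaves and the root carrying the largest id, is our own illustrative convention, not from the paper.

```python
from math import comb, factorial, prod

def tmc_prior(children, n):
    """Return (T, p): the number of internal-node orderings
    T(psi_n) = (n-1)! / prod_i m_i  (Lemma 1), and the prior
    p(psi_n) = T(psi_n) * prod_{i=2}^{n} C(i,2)^{-1}  (Eq. 1).
    `children` maps each internal node to its two children; leaves are
    0..n-1 and internal nodes n..2n-2 (an assumed encoding)."""
    ms = {}

    def count(node):          # m_i: number of internal nodes in subtree at node
        if node not in children:
            return 0          # leaf
        left, right = children[node]
        ms[node] = 1 + count(left) + count(right)
        return ms[node]

    count(max(children))      # root has the largest id in this encoding
    T = factorial(n - 1) // prod(ms.values())
    p = T / prod(comb(i, 2) for i in range(2, n + 1))
    return T, p
```

For four leaves, the balanced tree `{4: (0, 1), 5: (2, 3), 6: (4, 5)}` can be built by joining either (0, 1) or (2, 3) first, so T = 2 and p = 2/18, while a caterpillar tree admits only one order, giving T = 1 and p = 1/18.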
For a more explicit proof see the supplementary material.\n\n1The sequential sampling scheme often associated with NPB models; for example the conditional prior for\nthe CRP is the prior probability of adding the n + 1st point to one of the existing clusters (or a new cluster)\ngiven the clustering on the \ufb01rst n points.\n\n2When times are given, we index the internal nodes from most recent to root. Otherwise, nodes are ordered\n\nsuch that parents always succeed children.\n\n3It has been brought to our attention that this prior and its connection to the coalescent has been studied\nbefore in [2] as the beta-splitting model with parameter \u03b2 = 0, and later in [14] under the framework of\nGibbs-Fragmentation trees.\n\n3\n\n\fFigure 4: The subtree Sl rooted at the red node l is pruned in preparation for an MCMC move. We perform\nslice sampling to determine where the pruned subtree\u2019s parent should be placed next in the remaining tree \u03c0l.\n\nFigure 5: We compute the posterior pdf for each branch that we might attach to. If \u03b1 or \u03b2 are greater than one,\nthe Beta prior on branch lengths can cause these pdfs to go to zero at the limits of their domain. Thus to enable\nmoves across branches we compute the extrema of each pdf so that all valid intervals are found.\n\n3 Data Generation Model\n\nGiven a prior over tree structures such as (1), we can de\ufb01ne a data generating process in many\nways (indeed, any L1 bounded martingale will do [19]); here we restrict our attention to generative\nmodels in which we \ufb01rst sample times given a tree structure, and then sample the data according to\nsome process described on those times (in our case Brownian Motion). 
Examples of other potential data generation models include those in [1], such as the "Generalized Gaussian Diffusions," and the multinomial likelihood often used with the Coalescent.\n\n3.1 Branch Lengths\n\nGiven a tree structure ψ, we can sample branch lengths s_i = t_{p_i} − t_i, with t_i the time of coalescent event i, t = 0 at the leaves, and p_i the parent of i. Consider the following construction, similar to a stick-breaking process: start with a stick of unit length. Starting at the root, travel down the given ψ, and at each split point duplicate the current stick into two sticks, assigning one to each child. Then, sample a Beta random variable B for each of the two sticks whose corresponding children are not leaves. B will be the proportion of the remaining stick attributed to that branch of the tree until the next split point (sticks afterwards will be of length proportional to (1 − B)). We have B_i = 1 − (t_i/t_{p_i}) = s_i/t_{p_i}. The total prior over branch lengths can thus be written as:\n\np({B_i}|ψ) = ∏_{i=1}^{N−2} B(B_i|α, β)    (2)\n\nSee Figure 3 for samples from this prior. Note that any distribution whose support is the unit interval may be used, and in fact more innovative schemes for sampling the times may be used as well; one of the major advantages of the TMC over the Coalescent and DDT is that the times may be defined in a variety of ways.\n\nThere is a single Beta random variable attributed to each internal node of the tree (except the root, which has B set to 1). Since the order in which we see the data does not affect the way in which we sample these stick proportions, the process remains exchangeable. We denote pairs (ψn, {B_i}) as π, i.e. 
a tree structure with branch lengths.\n\n3.2 Brownian Motion\n\nGiven a tree \u03c0 we can de\ufb01ne a likelihood for continuous data xi \u2208 Rp using Brownian motion.\nWe denote the length of each branch segment of the tree si. Data is generated as follows: we start\nat some unknown location in Rp at time t = 1 and immediately split into two independent Wiener\nprocesses (with parameter \u039b), each denoted yi, where i is the index of the relevant branch in \u03c0. Each\n\n4\n\n\fFigure 6: (left) Approximated log-density using a DP mixture. (midleft) Log-density using a Dirichlet Diffu-\nsion Tree model. (midright) Log-density using our model directly. (right) Log-density using our model with a\nheavy-tailed noise model at the leaves. Contours are spaced 1 apart, for a total of 15 contours. In the probability\ndomain the various densities look similar.\n\nFigure 7: Posterior sample from our model applied to the leukemia dataset. Best viewed in color. Each pure\nsubtree is painted a color unique to the class associated with it. The OTHERS class is a set of datapoints to\nwhich no diagnostic label was assigned. A larger view of this \ufb01gure can be found in the supplementary material.\n\nof the processes evolves for times si, then, a new independent Wiener process is instantiated at the\ntime of each split, and this continues until all processes reach the leaves of \u03c0 (i.e. t = 0), at which\npoint the yis at the leaves are associated with the data x. This is a similar likelihood to the ones used\nfor Dirichlet Diffusion Trees [16] and the Coalescent [20] for continuous data.\n\n3.2.1 Likelihood Computation\n\nThe likelihood p(x|\u03c0) can be calculated using a single bottom-up sweep of message passing. 
As in [20], by marginalizing out from the leaves towards the root, the message at an internal node i is a Gaussian with mean ŷ_i and variance Λν_i.\n\nThe ν and ŷ messages can be written for any number of incoming nodes in a single form:\n\nν_i^{−1} = Σ_{j∈c(i)} (ν_j + s_j)^{−1};    ŷ_i = ν_i Σ_{j∈c(i)} ŷ_j / (ν_j + s_j)\n\nwhere c(i) are the nodes sending incoming messages to i. We can compute the likelihood using any arbitrary node as the root for message passing. Fixing a particular node as root, we can write the total likelihood of the tree as:\n\np(x|π) = ∏_{i=1}^{n−1} Z_{c(i)}(x, π)    (3)\n\nWhen |c(i)| = 1 (e.g. when passing through the root at t = 1), Z_{c(i)} = 1. When |c(i)| = 2 and |c(i)| = 3 (when collecting at an arbitrary node i chosen as the root):\n\nZ_{l_i,r_i}(x, π) = |2πΛ̂_i|^{−1/2} exp(−(1/2) ||ŷ_{r_i} − ŷ_{l_i}||²_{Λ̂_i});    Λ̂_i = Λ(ν_{l_i} + ν_{r_i} + s_{l_i} + s_{r_i})    (4)\n\nZ_{p_i,l_i,r_i}(x, π) = |2πΛ|^{−1} ν*^{−k/2} exp(−(1/2)(ν*_{p_i} ||ŷ_{l_i} − ŷ_{r_i}||²_Λ/ν* + ν*_{r_i} ||ŷ_{p_i} − ŷ_{l_i}||²_Λ/ν* + ν*_{l_i} ||ŷ_{p_i} − ŷ_{r_i}||²_Λ/ν*))    (5)\n\nν*_{p_i} = ν_{p_i} + s_i;    ν*_{l_i} = ν_{l_i} + s_{l_i};    ν*_{r_i} = ν_{r_i} + s_{r_i};    ν* = ν*_{l_i}ν*_{r_i} + ν*_{r_i}ν*_{p_i} + ν*_{l_i}ν*_{p_i}\n\nwhere ||·||_Λ corresponds to the Mahalanobis norm with covariance Λ.
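To make the recursion concrete, here is a one-dimensional sketch of the bottom-up sweep, corresponding to the two-child factor of Eq. (4) with the root factor equal to 1. The scalar `lam` plays the role of Λ, and the dict encoding with parents numbered after their children is an assumption of ours.

```python
import math

def tree_log_likelihood(children, times, x, lam):
    """Bottom-up message passing for the Brownian-motion likelihood (1-D).

    Each internal node i with children l, r contributes a factor
    N(yhat_l - yhat_r | 0, lam*(nu_l + s_l + nu_r + s_r)); the node's
    outgoing message is the Gaussian with precision-weighted mean yhat_i
    and variance lam*nu_i, where nu_i^{-1} = (nu_l+s_l)^{-1} + (nu_r+s_r)^{-1}.
    `x` maps leaves to observations; assumes parent ids exceed child ids."""
    nu = {i: 0.0 for i in x}          # leaves: exact observations
    yhat = dict(x)
    log_z = 0.0
    for i in sorted(children):        # children are processed before parents
        l, r = children[i]
        vl = nu[l] + (times[i] - times[l])    # nu_l + s_l
        vr = nu[r] + (times[i] - times[r])    # nu_r + s_r
        var = lam * (vl + vr)
        d = yhat[l] - yhat[r]
        log_z += -0.5 * math.log(2 * math.pi * var) - 0.5 * d * d / var
        nu[i] = 1.0 / (1.0 / vl + 1.0 / vr)
        yhat[i] = nu[i] * (yhat[l] / vl + yhat[r] / vr)
    return log_z
```

On a two-leaf tree with observations 0 and 1, unit Λ, and a root at t = 0.5, the difference of the two leaf values is N(0, 1) under the model, so the log-likelihood is −½log(2π) − ½.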
These messages are derived by using the product of Gaussian pdf identities.\n\n3.3 MCMC Inference\n\nWe propose an MCMC procedure that samples from the posterior distribution over π as follows. First, a random node l is pruned from the tree (so that its parent p_l has no parent and only one child), giving the pruned subtree S_l and remaining tree π_l. See Figure 4. We then consider all possible moves that would place p_l into a valid location elsewhere in the tree. For each branch, indexed by the node i "below" it, we compute the posterior density function of where p_l should be placed on that branch. We then slice sample on this collection of density functions. See Figure 5. By cycling through the nodes to prune and reattach, we achieve a Gibbs sampler over π.\n\nWe can efficiently compute the relative change in the likelihood p(x|π) through a combination of belief propagation and local computations. First we perform belief propagation on π_l to give upward and downward messages, and on S_l to give only upward messages. Denote π(S, i, t) as the tree formed by attaching S above node i at time t in π. For the new tree we imagine collecting messages to node p_l, resulting in a new factor Z_{i,l,p_i}(x, π_l(S_l, i, t)). The messages directly downstream of this factor are Z_{c(i)}(x, π_l) and Z_{p_{p_i},r_{p_i}}(x, π_l) (if l_{p_i} = i, i.e. i is the "left" child of its parent). If we now imagine that the original likelihood was computed by collecting to node p_i, then we see that the first factor should replace the factor Z_{l_{p_i},r_{p_i},p_{p_i}}(x, π_l) at node p_i, while the latter factor was already included in Z_{l_{p_i},r_{p_i},p_{p_i}}(x, π_l). All other factors do not change. The total (multiplicative) change in the likelihood is thus,\n\nΔZ(π_l(S_l, i, t)) = Z_{i,l,p_i}(x, π_l(S_l, i, t)) Z_{p_{p_i},r_{p_i}}(x, π_l) / Z_{l_{p_i},r_{p_i},p_{p_i}}(x, π_l)    (6)\n\nThe update in prior probability for adding the parent of l in the segment (i, p_i) (with times t_i and t_{p_i}) at time t is proportional to the product of the Beta pdfs in (2) that arise when π_l(S_l, i, t) is constructed, and inversely proportional to the Beta pdf that is removed from π_l, as well as being proportional to the overall prior probability over ψn:4\n\np(π_l(S_l, i, t)) ∝ [B(1 − t/t_{p_i}; α, β) B(1 − t_i/t; α, β) B(1 − t_l/t; α, β) / B(1 − t_i/t_{p_i}; α, β)] p(ψ(π_l(S_l, i, t)))    (7)\n\nwhere ψ(π) gives the structure part of π = (ψ, {B_i}). p(ψ(π_l(S_l, i, t))) can be computed for all i in linear time via dynamic programming (it does not depend on the actual value of t). By taking the product of (6) and (7) we get the joint posterior of (i, t):\n\np(π_l(S_l, i, t)|X) ∝ ΔZ(π_l(S_l, i, t)) p(π_l(S_l, i, t))    (8)\n\np(π_l(S_l, i, t)|X) defines the distribution from which we would like to sample. We propose a slice sampler that can propose adding S_l to any segment in π_l. For a fixed i, p(π_l(S_l, i, t)|X) is typically unimodal, and typically has a small number of modes at most. If we can find all of the extrema of the posterior, we can easily find the intervals that contain positive probability density for slice sampling moves (see Figure 5). Thus this slice sampling procedure will mix as quickly as slice sampling on a single unimodal distribution.
We find the extrema of these functions using Newton methods.\n\nThe overall sampling procedure is then to sample a new location for each node (both leaves and internal nodes) of the tree using the Gibbs sampling scheme explained above.\n\n3.4 Hyperparameter Inference\n\nAs we do not know the structure of the data beforehand, we may not want to predetermine the specific values of α, β and Λ. Thus we define hyperpriors on these parameters and infer them as well. For simplicity we assume the form Λ = kI for the Brownian motion covariance parameter. We use an Inverse-Gamma prior on k, so that k^{−1} ~ G(κ, θ), giving the posterior:\n\nk^{−1}|X ~ G( (N − 1)p/2 + κ, (1/2) Σ_{i=1}^{N−1} d_i + θ )\n\nwhere N is the number of datapoints, p is the dimension, d_i = ||ŷ_{l_i} − ŷ_{r_i}||² / (ν_{l_i} + ν_{r_i} + s_{l_i} + s_{r_i}), and ||·|| is the Euclidean norm.\n\nBy putting a G(ξ, λ) prior on α − 1 and β − 1, we achieve a posterior for these parameters:\n\np(α, β|ξ, λ, X) ∝ (α − 1)^ξ (β − 1)^ξ e^{−λ(α−1+β−1)} ∏_{i=1}^{N−1} [1/B(α, β)] (1 − t_i/t_{p_i})^{α−1} (t_i/t_{p_i})^{β−1}\n\nThis posterior is log-concave and thus unimodal.\n\n4Note that if either l or i is a leaf, then the prior term will be simpler than the one listed here.
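The conjugate update for k is a one-line Gibbs step. A sketch, assuming the sufficient statistics d_i have already been computed during message passing; note that `random.gammavariate` takes a shape and a *scale*, hence the reciprocal of the rate.

```python
import random

def gibbs_update_k(dists, p, kappa, theta, rng=None):
    """Draw k given the d_i statistics.  With k^{-1} ~ G(kappa, theta)
    a priori, the conditional is
        k^{-1} | X ~ Gamma(shape = (N-1)*p/2 + kappa,
                           rate  = 0.5*sum_i d_i + theta),
    where `dists` holds the N-1 values d_i.  Returns k itself."""
    rng = rng or random.Random()
    shape = len(dists) * p / 2 + kappa
    rate = 0.5 * sum(dists) + theta
    return 1.0 / rng.gammavariate(shape, 1.0 / rate)   # scale = 1/rate
```

Averaging k^{−1} over many draws recovers the posterior mean shape/rate, which is a quick sanity check on the parameterization.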
We perform slice sampling to update α and β.\n\n3.5 Predictive Density\n\nGiven a set of samples from the posterior, we can approximate the posterior density estimate by sampling a test point located at y_t into each of these trees repeatedly (giving new trees π′ = π(y_t, i, s) for various values of i and s) and approximating p(y_t|X) as:\n\np(y_t|X) = ∫ p(π|X) p(y_t|π, X) dπ = ∫ dπ p(π|X) ∫ dπ′ p(π′, y_t|π, X) ≈ Σ_{π_i ~ π|X} Σ_{π′_j ~ p(π′_j|π_i,X)} p(y_t|π′_j, X)\n\nwhere p(π′|π, X) = ∫ p(π′, y_t|π, X) dy_t. By integrating out y_{l_i} in (5), we get a modification of (8) that is proportional to p(π′|π, X). Slice sampling from this gives us several new trees π′_j for each π_i, where one of the leaves is not observed. p(y_t|π′_j, X) is then available by message passing to the leaf associated with y_t (denoted l), which results in a Gaussian over y_t. Thus the final approximated density is a mixture of Gaussians with a component for each of the π′_j.\n\nPerforming the aforementioned integration (after replacing ŷ_{l_i} with y_t), we get:\n\nZ_pred(i, t) ∝ ∫ dy_t Z_{i,p_i,l}(x, π′(y_t)) = |2πΛ|^{−1/2} (ν*_i + ν*_{p_i})^{−k/2} exp( −(1/2) (ν*_l + (ν*_i^{−1} + ν*_{p_i}^{−1})^{−1}) d_{i,p_i} )\n\nwhere d_{i,p_i} = ||ŷ_i − ŷ_{p_i}||²_{Λν*}. This gives the posterior density for the location of the unobserved point:\n\np(π′ = π({l}, i, t)|π, X) ∝ Z_pred(i, t) [Z_{l_i,r_i}(x, π_l) / Z_{l_i,r_i,p_i}(x, π_l)] p(π({l}, i, t))\n\nwhere p(π({l}, i, t)) is as in (7).\n\n4 Experiments\n\nWe compare our model to Dirichlet Diffusion Trees (DDT) [16] and to Dirichlet Process Mixtures (DPM) [5, 13]. We used Radford Neal's Flexible Bayesian Modeling package [15] for both the DDT and the DPM experiments. All algorithms were run with vague hyperpriors, except for the DPM concentration parameter, which we set to 0.1 as we did not expect many modes for these experiments.\n\n4.1 Synthetic Data\n\nTo qualitatively compare our method to Dirichlet Process Mixtures and Dirichlet Diffusion Trees, we ran all three methods on a simulated dataset with N = 200, p = 2. The data is generated from a mixture of heavy-tailed distributions to demonstrate the differences between these algorithms when presented with outliers. As can be seen in Figure 6, the DDT fits a density with reasonably heavy tails, whereas our model fits a narrower distribution. This is a result of the fact that the divergence function of the DDT strongly encourages the branch lengths to be small, and thus a larger variance is required to explain the data. Our model can be combined with a heavy-tailed observation model to produce densities with heavier tails – see the rightmost panel of Figure 6.\n\nFigure 8: A comparison of our method to the DDT and DPM, using predictive log likelihood on test data. Plots show performance over time, except the DPM, which shows the result after convergence. (left) The comparison on a p = 200 version of the St. Jude's Leukemia dataset. The "TMC - k" runs are with k fixed throughout the run. 
(middle) The comparison on the p = 1000 version of the Leukemia dataset. (right) The comparison on an N = 1400, p = 200 bag-of-visual-words dataset.\n\n4.2 Gene Clustering\n\nWe applied our model to the St. Jude's Leukemia dataset [22], which has N_train = 215 datapoints, N_test = 112, and was preprocessed5 to have p = 1000 dimensions. We preprocessed the data so that each dimension had unit variance. Associated with each datapoint is one of 6 classifications of leukemia, or a 7th class with which no diagnosis was attributed. We applied our method to the full dataset to see if it could recover these classes. Figure 7 shows the posterior tree sampled after about 28 Gibbs passes (about 10 minutes). We also compared our method against the DDT and DPM on these models' abilities to predict test data on a p = 200 subset of the p = 1000 dataset, as well as on the p = 1000 dataset; see Figure 8. On the p = 200 dataset, both the DDT and the TMC outperform the DPM, with the TMC performing slightly worse than the DDT. We attribute this difference in performance to our model's weaker prior on the branch lengths, which causes our model to overfit slightly; if we preset the diffusion variance of our model to a value somewhat larger than the data variance, our performance improves. On the p = 1000 dataset, the same phenomenon is observed.\n\n4.3 Computer Vision Features\n\nWe also cluster visual bag-of-words features collected from bird images from Visipedia [21]. We worked on a dataset of size N = 1400, N_test = 1412, where each observation belongs to one of 200 classes of birds; see Figure 8. Again our method performs better than the DPM but not as well as the DDT. Fixing the variance does improve the performance of our algorithm, but not enough to improve over the DDT.
By marginalizing out the time of the coalescent, we\nachieve a prior from which data can be either generated directly via a graphical model living on\ntrees, or by a CTMC lying on a distribution over times for the branch lengths \u2013 in the style of the\ncoalescent and DDT. However, unlike the coalescent and DDT, in our model the times are generated\nconditioned on the tree structure; giving potential for more interesting models or more ef\ufb01cient\ninference. The simplicity of the prior allows for ef\ufb01cient Gibbs style inference, and we provide an\nexample model and demonstrate that it can achieve similar performance to that of the DDT. However,\nto achieve that performance the diffusion variance must be set in advance, suggesting that alternative\ndistributions over the branch lengths may provide better performance than the one explored here.\n\nAcknowledgements\n\nThis material is based upon work supported by the National Science Foundation under Grant No.\n0914783, 0928427, 1018433, 1216045.\n\n5We simply took the 1000 dimensions with the highest variance.\n\n8\n\n\fReferences\n\n[1] R.P. Adams, Z. Ghahramani, and M.I. Jordan. Tree-structured stick breaking for hierarchical\n\ndata. Advances in Neural Information Processing Systems, 23:19\u201327, 2010.\n\n[2] D Aldous. Probability distributions on cladograms. IMA Volumes in Mathematics and its . . . ,\n\n1995.\n\n[3] David Blei, Thomas L. Grif\ufb01ths, Michael I. Jordan, and Joshua B. Tenenbaum. Hierarchi-\ncal topic models and the nested chinese restaurant process.\nIn Sebastian Thrun, Lawrence\nSaul, and Bernhard Sch\u00a8olkopf, editors, Advances in Neural Information Processing Systems,\nvolume 16. MIT Press, Cambridge, MA, 2004.\n\n[4] A. Drummond and A. Rambaut. Beast: Bayesian evolutionary analysis by sampling trees.\n\nBMC evolutionary biology, 7(1):214, 2007.\n\n[5] T.S. Ferguson. Bayesian density estimation by mixtures of normal distributions. 
Recent advances in statistics, pages 287–303, 1983.\n\n[6] D. Görür, L. Boyles, and M. Welling. Scalable inference on Kingman's coalescent using pair similarity. In Proceedings of AISTATS, 2012.\n\n[7] D. Görür and Y. W. Teh. An efficient sequential Monte Carlo algorithm for coalescent clustering. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 521–528, 2009.\n\n[8] Katherine Heller and Zoubin Ghahramani. Bayesian hierarchical clustering. In Proceedings of ICML, volume 22, 2005.\n\n[9] J. F. C. Kingman. The coalescent. Stochastic Processes and their Applications, 13:235–248, 1982.\n\n[10] J. F. C. Kingman. On the genealogy of large populations. Journal of Applied Probability, 19:27–43, 1982.\n\n[11] D.A. Knowles and Z. Ghahramani. Pitman-Yor diffusion trees. arXiv preprint arXiv:1106.2494, 2011.\n\n[12] D.A. Knowles, J. Van Gael, and Z. Ghahramani. Message passing algorithms for Dirichlet diffusion trees. In Proceedings of the 28th Annual International Conference on Machine Learning, 2011.\n\n[13] A.Y. Lo. On a class of Bayesian nonparametric estimates: I. Density estimates. The Annals of Statistics, 12(1):351–357, 1984.\n\n[14] Peter McCullagh, Jim Pitman, and Matthias Winkel. Gibbs fragmentation trees. Bernoulli, 14(4):988–1002, November 2008.\n\n[15] R. Neal. Software for flexible Bayesian modeling and Markov chain sampling. See http://www.cs.toronto.edu/radford/fbm.software.html, 2003.\n\n[16] R.M. Neal. Density modeling and clustering using Dirichlet diffusion trees. Bayesian Statistics, 7:619–629, 2003.\n\n[17] A. Rodriguez, D.B. Dunson, and A.E. Gelfand. The nested Dirichlet process. Journal of the American Statistical Association, 103(483):1131–1154, 2008.\n\n[18] R. Salakhutdinov, J. Tenenbaum, and A. Torralba. 
Learning to learn with compound hd models.\n\nIn Advances in Neural Information Processing Systems 21, 2012.\n\n[19] J. Steinhardt and Z. Ghahramani. Flexible martingale priors for deep hierarchies. In Interna-\ntional Conference on Arti\ufb01cial Intelligence and Statistics (AISTATS), volume 43, pages 61\u201362,\n2012.\n\n[20] Y. W. Teh, H. Daum\u00b4e III, and D. M. Roy. Bayesian agglomerative clustering with coalescents.\n\nIn Advances in Neural Information Processing Systems, volume 20, 2008.\n\n[21] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona. Caltech-\nUCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology,\n2010.\n\n[22] E.J. Yeoh, M.E. Ross, S.A. Shurtleff, W.K. Williams, D. Patel, R. Mahfouz, F.G. Behm, S.C.\nRaimondi, M.V. Relling, A. Patel, et al. Classi\ufb01cation, subtype discovery, and prediction of\noutcome in pediatric acute lymphoblastic leukemia by gene expression pro\ufb01ling. Cancer cell,\n1(2):133\u2013143, 2002.\n\n9\n\n\f", "award": [], "sourceid": 1348, "authors": [{"given_name": "Levi", "family_name": "Boyles", "institution": null}, {"given_name": "Max", "family_name": "Welling", "institution": null}]}