{"title": "Binary to Bushy: Bayesian Hierarchical Clustering with the Beta Coalescent", "book": "Advances in Neural Information Processing Systems", "page_first": 1079, "page_last": 1087, "abstract": "Discovering hierarchical regularities in data is a key problem in interacting with large datasets, modeling cognition, and encoding knowledge. A previous Bayesian solution---Kingman's coalescent---provides a convenient probabilistic model for data represented as a binary tree. Unfortunately, this is inappropriate for data better described by bushier trees. We generalize an existing belief propagation framework of Kingman's coalescent to the beta coalescent, which models a wider range of tree structures. Because of the complex combinatorial search over possible structures, we develop new sampling schemes using sequential Monte Carlo and Dirichlet process mixture models, which render inference efficient and tractable. We present results on both synthetic and real data that show the beta coalescent outperforms Kingman's coalescent on real datasets and is qualitatively better at capturing data in bushy hierarchies.", "full_text": "Binary to Bushy: Bayesian Hierarchical Clustering with the Beta Coalescent\n\nYuening Hu1, Jordan Boyd-Graber2, Hal Daumé III3, Z. Irene Ying4\n\n1, 3: Computer Science, 2: iSchool and UMIACS, 4: Agricultural Research Service\n\nynhu@cs.umd.edu, {jbg,hal}@umiacs.umd.edu, zhu.ying@ars.usda.gov\n\n1, 2, 3: University of Maryland, 4: Department of Agriculture\n\nAbstract\n\nDiscovering hierarchical regularities in data is a key problem in interacting with large datasets, modeling cognition, and encoding knowledge. A previous Bayesian solution—Kingman’s coalescent—provides a probabilistic model for data represented as a binary tree. Unfortunately, this is inappropriate for data better described by bushier trees. 
We generalize an existing belief propagation framework of Kingman’s coalescent to the beta coalescent, which models a wider range of tree structures. Because of the complex combinatorial search over possible structures, we develop new sampling schemes using sequential Monte Carlo and Dirichlet process mixture models, which render inference efficient and tractable. We present results on synthetic and real data that show the beta coalescent outperforms Kingman’s coalescent and is qualitatively better at capturing data in bushy hierarchies.\n\n1 The Need For Bushy Hierarchical Clustering\n\nHierarchical clustering is a fundamental data analysis problem: given observations, what hierarchical grouping of those observations effectively encodes the similarities between observations? This is a critical task for understanding and describing observations in many domains [1, 2], including natural language processing [3], computer vision [4], and network analysis [5]. In all of these cases, natural and intuitive hierarchies are not binary but are instead bushy, with more than two children per parent node. Our goal is to provide efficient algorithms to discover bushy hierarchies.\n\nWe review existing nonparametric probabilistic clustering algorithms in Section 2, with particular focus on Kingman’s coalescent [6] and its generalization, the beta coalescent [7, 8]. While Kingman’s coalescent has attractive properties—it is probabilistic and has edge “lengths” that encode how similar clusters are—it only produces binary trees. The beta coalescent (Section 3) does not have this restriction. 
However, naïve inference is impractical, because bushy trees are more complex: we need to consider all possible subsets of nodes to construct each internal node in the hierarchy.\n\nOur first contribution is a generalization of the belief propagation framework [9] for the beta coalescent to compute the joint probability of observations and trees (Section 3). After describing sequential Monte Carlo posterior inference for the beta coalescent, we develop efficient inference strategies in Section 4, where we use proposal distributions that draw on the connection between Dirichlet processes—a ubiquitous Bayesian nonparametric tool for non-hierarchical clustering—and hierarchical coalescents to make inference tractable. We present results on both synthetic and real data that show the beta coalescent captures bushy hierarchies and outperforms Kingman’s coalescent (Section 5).\n\n2 Bayesian Clustering Approaches\n\nRecent hierarchical clustering techniques have been incorporated inside statistical models; this requires formulating clustering as a statistical—often Bayesian—problem. Heller et al. [10] build binary trees based on marginal likelihoods, extended by Blundell et al. [11] to trees with arbitrary branching structure. Adams et al. [12] propose a tree-structured stick-breaking process to generate trees with unbounded width and depth, which supports data observations at leaves and internal nodes.1 However, these models do not model edge lengths, an important property for describing how “tight” the clustering is at particular nodes.\n\nHierarchical models can be divided into complementary “fragmentation” and “coagulation” frameworks [7]. Both produce hierarchical partitions of a dataset. Fragmentation models start with a single partition and divide it into ever more specific partitions until only singleton partitions remain. 
Coagulation frameworks repeatedly merge singleton partitions until only one partition remains. Pitman-Yor diffusion trees [13], a generalization of Dirichlet diffusion trees [14], are an example of a bushy fragmentation model; they model edge lengths and build non-binary trees.\n\nOur focus, instead, is on bottom-up coalescent models [8], a family of coagulation models complementary to diffusion trees, which can also discover hierarchies and edge lengths. In this model, n nodes are observed (we use both “observed” to emphasize that nodes are known and “leaves” to emphasize topology). These observed nodes are generated through some unknown tree with latent edges and unobserved internal nodes. Each node (both observed and latent) has a single parent. The convention in such models is to assume our observed nodes come at time t = 0, and at time −∞ all nodes share a common ur-parent through some sequence of intermediate parents.\n\nConsider a set of n individuals observed at the present (time t = 0). All individuals start in one of n singleton sets. At time t_i, a set of these nodes coalesces into a new node. Once a set merges, their parent replaces the original nodes. This is called a coalescent event. This process repeats until there is only one node left, and a complete tree structure π (Figure 1) is obtained.\n\nDifferent coalescents are defined by different probabilities of merging a set of nodes. This is called the coalescent rate, defined by a general family of coalescents: the lambda coalescent [7, 15]. We represent the rate via the symbol λ^k_n, the rate at which k out of n nodes merge into a parent node. From a collection of n nodes, k ≤ n can coalesce at some coalescent event (k can be different for different coalescent events). The rate of a fraction γ of the nodes coalescing is given by γ^{−2} Λ(dγ), where Λ(dγ) is a finite measure on [0, 1]. 
So k nodes merge at rate\n\nλ^k_n = ∫_0^1 γ^{k−2} (1 − γ)^{n−k} Λ(dγ)   (2 ≤ k ≤ n).   (1)\n\nChoosing different measures yields different coalescents. A degenerate Dirac delta measure at 0 results in Kingman’s coalescent [6], where λ^k_n is 1 when k = 2 and zero otherwise. Because this gives zero probability to non-binary coalescent events, this only creates binary trees.\n\nAlternatively, using a beta distribution BETA(2 − α, α) as the measure Λ yields the beta coalescent. When α is closer to 1, the tree is bushier; as α approaches 2, it becomes Kingman’s coalescent. If we have n_{i−1} nodes at time t_{i−1} in a beta coalescent, the rate λ^{k_i}_{n_{i−1}} for a children set of k_i nodes at time t_i and the total rate λ_{n_{i−1}} of any children set merging—summing over all possible mergers—are\n\nλ^{k_i}_{n_{i−1}} = Γ(k_i − α) Γ(n_{i−1} − k_i + α) / ( Γ(2 − α) Γ(α) Γ(n_{i−1}) )   and   λ_{n_{i−1}} = ∑_{k_i=2}^{n_{i−1}} (n_{i−1} choose k_i) λ^{k_i}_{n_{i−1}}.   (2)\n\nEach coalescent event also has an edge length—duration—δ_i. The duration of an event comes from an exponential distribution, δ_i ∼ Exp(λ_{n_{i−1}}), and the parent node forms at time t_i = t_{i−1} − δ_i. Shorter durations mean that the children more closely resemble their parent (the mathematical basis for similarity is specified by a transition kernel, Section 3).\n\nAnalogous to Kingman’s coalescent, the prior probability of a complete tree π is the product of all of its constituent coalescent events i = 1, . . . , m, each merging k_i children (the first factor below) after duration δ_i (the second factor),\n\np(π) = ∏_{i=1}^{m} p(k_i | n_{i−1}) · p(δ_i | k_i, n_{i−1}) = ∏_{i=1}^{m} λ^{k_i}_{n_{i−1}} · exp(−λ_{n_{i−1}} δ_i).   (3)\n\n1 This is appropriate where the entirety of a population is known—both ancestors and descendants. We focus on the case where only the descendants are known. For a concrete example, see Section 5.2.\n\nAlgorithm 1 SMC inference for generating a tree\n1: for particle s = 1, 2, · · · , S do\n2:   Initialize n^s = n, i = 0, t^s_0 = 0, w^s_0 = 1.\n3:   Initialize the node set V^s = {ρ_0, ρ_1, · · · , ρ_n}.\n4: while ∃ s ∈ {1 · · · S} where n^s > 1 do\n5:   Update i = i + 1.\n6:   for particle s = 1, 2, · · · , S do\n7:     if n^s == 1 then\n8:       Continue.\n9:     Propose a duration δ^s_i by Equation 10.\n10:     Set coalescent time t^s_i = t^s_{i−1} − δ^s_i.\n11:     Sample partitions p^s_i from the DPMM.\n12:     Propose a children set ρ^s_{c_i} according to Equation 11.\n13:     Update weight w^s_i by Equation 13.\n14:     Update n^s = n^s − |ρ^s_{c_i}| + 1.\n15:     Remove ρ^s_{c_i} from V^s, add ρ^s_i to V^s.\n16:   Compute effective sample size ESS [16].\n17:   if ESS < S/2 then\n18:     Resample particles [17].\n\n3 Beta Coalescent Belief Propagation\n\n(a) Kingman’s coalescent   (b) the beta coalescent\n\nFigure 1: The beta coalescent can merge four similar nodes at once, while Kingman’s coalescent only merges two each time.\n\nThe beta coalescent prior only depends on the topology of the tree. In real clustering applications, we also care about a node’s children and features. 
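To make the generative process concrete, the following sketch samples coalescent events from the prior of Equations 2 and 3: draw a duration from the exponential with the total rate, then pick a children-set size in proportion to its rate. This is an illustrative sketch under our own naming, not the authors' code.

```python
import math
import random

def log_rate_k(n, k, alpha):
    # log of the rate λ^k_n for the beta coalescent, Λ = Beta(2 − α, α) (Equation 2)
    return (math.lgamma(k - alpha) + math.lgamma(n - k + alpha)
            - math.lgamma(2 - alpha) - math.lgamma(alpha) - math.lgamma(n))

def sample_tree_events(n, alpha, seed=0):
    """Sample (k_i, delta_i) coalescent events from the prior until one node remains."""
    rng = random.Random(seed)
    events = []
    while n > 1:
        # total rate λ_n sums C(n, k) · λ^k_n over k = 2..n (Equation 2)
        log_terms = [math.log(math.comb(n, k)) + log_rate_k(n, k, alpha)
                     for k in range(2, n + 1)]
        m = max(log_terms)
        weights = [math.exp(t - m) for t in log_terms]
        total_rate = math.exp(m) * sum(weights)
        delta = rng.expovariate(total_rate)                 # δ_i ~ Exp(λ_n)
        k = rng.choices(range(2, n + 1), weights=weights)[0]
        events.append((k, delta))
        n = n - k + 1                                       # k children replaced by one parent
    return events
```

With α near 1 the sampled k_i are frequently larger than 2 (bushy mergers); as α approaches 2, pairwise mergers dominate and the sampled trees become Kingman-like.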
In this section, we define the nodes and their features, and then review how we use message passing to compute the probabilities of trees.\n\nAn internal node ρ_i is defined as the merger of other nodes. The children set of node ρ_i, ρ_{c_i}, coalesces into a new node ρ_i ≡ ∪_{b∈c_i} ρ_b. This encodes the identity of the nodes that participate in specific coalescent events; Equation 3, in contrast, only considers the number of nodes involved in an event. In addition, each node is associated with a multidimensional feature vector y_i.\n\nTwo terms specify the relationship between nodes’ features: an initial distribution p_0(y_i) and a transition kernel κ_{t_i t_b}(y_i, y_b). The initial distribution can be viewed as a prior or regularizer for feature representations. The transition kernel encourages a child’s feature y_b (at time t_b) to resemble feature y_i (formed at t_i); shorter durations t_b − t_i increase the resemblance.\n\nIntuitively, the transition kernel can be thought of as a similarity score: the more similar the features are, the more likely the nodes are to be merged. For Brownian diffusion (discussed in Section 4.3), the transition kernel follows a Gaussian distribution centered at a feature. The covariance matrix Σ is determined by the mutation rate µ [18, 9], the probability of a mutation in an individual. Different kernels (e.g., multinomial, tree kernels) can be applied depending on modeling assumptions of the feature representations.\n\nTo compute the probability of the beta coalescent tree π and observed data x, we generalize the belief propagation framework used by Teh et al. [9] for Kingman’s coalescent; this is a more scalable alternative to other approaches for computing the probability of a beta coalescent tree [19]. 
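As a small illustration of the kernel-as-similarity intuition, here is the density of a scalar Brownian transition kernel; the helper name and scalar restriction are ours (the paper's features are multidimensional):

```python
import math

def brownian_kernel(y_parent, y_child, duration, mu=1.0):
    """Gaussian transition density: the child feature is centered at the parent
    feature with variance duration * mu, where mu is the mutation rate."""
    var = duration * mu
    return math.exp(-(y_child - y_parent) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)
```

More similar features, or shorter durations, yield higher density, matching the similarity-score reading above.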
We define a subtree structure θ_i = {θ_{i−1}, δ_i, ρ_{c_i}}; thus the tree θ_m after the final coalescent event m is a complete tree π. The message for node ρ_i marginalizes over the features of the nodes in its children set.2 The total message for a parent node ρ_i is\n\nM_{ρ_i}(y_i) = Z^{−1}_{ρ_i}(x|θ_i) ∏_{b∈c_i} ∫ κ_{t_i t_b}(y_i, y_b) M_{ρ_b}(y_b) dy_b,   (4)\n\nwhere Z_{ρ_i}(x|θ_i) is the local normalizer, which can be computed as the combination of the initial distribution and messages from a set of children,\n\nZ_{ρ_i}(x|θ_i) = ∫ p_0(y_i) ∏_{b∈c_i} ( ∫ κ_{t_i t_b}(y_i, y_b) M_{ρ_b}(y_b) dy_b ) dy_i.   (5)\n\nRecursively performing this marginalization through message passing provides the joint probability of a complete tree π and the observations x. At the root,\n\nZ_{−∞}(x|θ_m) = ∫∫ p_0(y_{−∞}) κ_{−∞,t_m}(y_{−∞}, y_m) M_{ρ_m}(y_m) dy_m dy_{−∞},   (6)\n\nwhere p_0(y_{−∞}) is the initial feature distribution and m is the number of coalescent events. This gives the marginal probability of the whole tree,\n\np(x|π) = Z_{−∞}(x|θ_m) ∏_{i=1}^{m} Z_{ρ_i}(x|θ_i).   (7)\n\nThe joint probability of a tree π combines the prior (Equation 3) and likelihood (Equation 7),\n\np(x, π) = Z_{−∞}(x|θ_m) ∏_{i=1}^{m} λ^{k_i}_{n_{i−1}} exp(−λ_{n_{i−1}} δ_i) · Z_{ρ_i}(x|θ_i).   (8)\n\n2 When ρ_b is a leaf, the message M_{ρ_b}(y_b) is a delta function centered on the observation.\n\n3.1 Sequential Monte Carlo Inference\n\nSequential Monte Carlo (SMC)—often called particle filters—estimates a structured sequence of hidden variables based on observations [20]. 
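Schematically, one sequential importance resampling step maintains weighted particles, reweights each by a target-over-proposal ratio, and resamples when the effective sample size drops. The sketch below is generic, not the authors' implementation: `propose` and `log_target_incr` are placeholder callables standing in for the model-specific proposal and target distributions.

```python
import math
import random

def smc_step(particles, log_weights, propose, log_target_incr, rng):
    """One SIR step. `propose(state, rng)` returns (new_state, log q(new|old));
    `log_target_incr(old, new)` returns log p(x|new) + log p(new|old)."""
    new_states, new_logw = [], []
    for s, lw in zip(particles, log_weights):
        s_new, log_q = propose(s, rng)
        new_states.append(s_new)
        new_logw.append(lw + log_target_incr(s, s_new) - log_q)  # weight update
    # normalize weights and compute the effective sample size
    m = max(new_logw)
    w = [math.exp(x - m) for x in new_logw]
    z = sum(w)
    w = [x / z for x in w]
    ess = 1.0 / sum(x * x for x in w)
    if ess < len(w) / 2:                       # resample when ESS drops too low
        idx = rng.choices(range(len(w)), weights=w, k=len(w))
        new_states = [new_states[i] for i in idx]
        new_logw = [0.0] * len(w)
    return new_states, new_logw
```

The ESS-triggered resampling branch mirrors the final steps of Algorithm 1.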
For coalescent models, this estimates the posterior distribution over tree structures given observations x. Initially (i = 0) each observation is in a singleton cluster;3 in subsequent steps (i > 0), points coalesce into more complicated tree structures θ^s_i, where s is the particle index and we add the superscript s to all related notation to distinguish between particles. We use sequential importance resampling [21, SIR] to weight each particle s at time t_i, denoted w^s_i.\n\nThe weights from SIR approximate the posterior. Computing the weights requires a conditional distribution of data given a latent state p(x|θ^s_i), a transition distribution between latent states p(θ^s_i | θ^s_{i−1}), and a proposal distribution f(θ^s_i | θ^s_{i−1}, x). Together, these distributions define the weights\n\nw^s_i = w^s_{i−1} · p(x|θ^s_i) p(θ^s_i | θ^s_{i−1}) / f(θ^s_i | θ^s_{i−1}, x).   (9)\n\nThen we can approximate the posterior distribution of the hidden structure using the normalized weights, which become more accurate with more particles.\n\nTo apply SIR inference to belief propagation with the beta coalescent prior, we first define the particle space structure. The sth particle represents a subtree θ^s_{i−1} at time t^s_{i−1}, and a transition to a new subtree θ^s_i = {θ^s_{i−1}, δ^s_i, ρ^s_{c_i}}, which takes a set of nodes ρ^s_{c_i} from θ^s_{i−1} and merges them at t^s_i = t^s_{i−1} − δ^s_i. Our proposal distribution must provide the duration δ^s_i and the children set ρ^s_{c_i} to merge based on the previous subtree θ^s_{i−1}.\n\nWe propose the duration δ^s_i from the prior exponential distribution and propose a children set from the posterior distribution based on the local normalizers.4 This is the “priorpost” method in Teh et al. [9]. However, this approach is intractable. Given n_{i−1} nodes at time t_i, we must consider all possible children sets, (n_{i−1} choose 2) + (n_{i−1} choose 3) + · · · + (n_{i−1} choose n_{i−1}) in total. The computational complexity grows from O(n^2_{i−1}) (Kingman’s coalescent) to O(2^{n_{i−1}}) (beta coalescent).\n\n3 The relationship between time and particles is non-intuitive. Time t goes backward with subsequent steps. When we use time-specific adjectives for particles, this is with respect to inference.\n4 This is a special case of Section 4.2’s algorithm, where the restriction set Ω_i is all possible subsets.\n\n4 Efficiently Finding Children Sets with DPMM\n\nWe need a more efficient way to consider possible children sets. Even for Kingman’s coalescent, which only considers pairs of nodes, Gorur et al. [22] do not exhaustively consider all pairs. Instead, they use data structures from computational geometry to select the R closest pairs as their restriction set, reducing inference to O(n log n). While finding closest pairs is a traditional problem in computational geometry, discovering arbitrary-sized sets is less studied.\n\nIn this section, we describe how we use a Dirichlet process mixture model [23, DPMM] to discover a restriction set Ω, integrating DPMMs into the SMC proposal. We first briefly review what DPMMs are, describe why they are attractive, and then describe how we incorporate DPMMs in SMC inference.\n\nThe DPMM is defined by a concentration β and a base distribution G_0. A distribution over mixtures is drawn from a Dirichlet process (DP): G ∼ DP(β, G_0). Each observation x_i is assigned to a mixture component µ_i drawn from G. 
Because the Dirichlet process is a discrete distribution, observations i and j can have the same mixture component (µ_i = µ_j). When this happens, points are said to be in the same partition. Posterior inference can discover a distribution over partitions. A full derivation of these sampling equations appears in the supplemental material.\n\n4.1 Attractive Properties of DPMMs\n\nDPMMs and Coalescents  Berestycki [8] showed that the distribution over partitions in a Dirichlet process is equivalent to the distribution over coalescents’ allelic partitions—the set of members that have the same feature representation—when the mutation rate µ of the associated kernel is half of the Dirichlet concentration β (Section 3). For Brownian diffusion, we can connect the DPMM with coalescents by setting the kernel covariance Σ = µI to Σ = β/2 I.\n\nThe base distribution G_0 is also related to nodes’ features. The base distribution G_0 of a Dirichlet process generates the probability measure G for each block, which generates the nodes in a block. As a result, we can select a base distribution that fits the distribution of the samples in the coalescent process. For example, if we use a Gaussian distribution for the transition kernel and prior, a Gaussian is also appropriate as the DPMM base distribution.\n\nEffectiveness as a Proposal  The necessary condition for a valid proposal [24] is that it should have support on a superset of the true posterior. In our case, the distribution over partitions provided by the DPMM considers all possible children sets that could be merged in the coalescent. Thus the new proposal with the DPMM satisfies this requirement, and it is a valid proposal.\n\nIn addition, Chen [25] gives a set of desirable criteria for a good proposal distribution: it accounts for outliers, considers the likelihood, and lies close to the true posterior. 
The DPMM fulfills these criteria. First, the DPMM provides a distribution over all partitions. Varying the concentration parameter β can control the length of the tail of the distribution over partitions. Second, choosing the base distribution of the DPMM appropriately models the feature likelihood; i.e., ensuring the DPMM places similar nodes together in a partition with high probability. Third, the DPMM qualitatively provides reasonable children sets when compared with exhaustively considering all children sets (Figure 2(c)).\n\n4.2 Incorporating DPMM in SMC Proposals\n\nTo address the inference intractability in Section 3.1, we use the DPMM to obtain a distribution over partitions of nodes. Each partition contains clusters of nodes, and we take a union over all partitions to create a restriction set Ω_i = {ω_i1, ω_i2, · · ·}, where each ω_ij is a subset of the n_{i−1} nodes. A standard Gibbs sampler provides these partitions (see supplemental).\n\nWith this restriction set Ω_i, we propose the duration time δ^s_i from the exponential distribution and propose a children set ρ^s_{c_i} based on the local normalizers\n\nf_i(δ^s_i | θ^s_{i−1}) = λ^s_{n_{i−1}} exp(−λ^s_{n_{i−1}} δ^s_i),   (10)\n\nf_i(ρ^s_{c_i} | δ^s_i, θ^s_{i−1}) = ( Z_{ρ_i}(x | θ^s_{i−1}, δ^s_i, ρ^s_{c_i}) / Z_0 ) · I[ρ^s_{c_i} ∈ Ω^s_i],   (11)\n\nwhere Ω^s_i restricts the candidate children sets, I is the indicator, and we replace Z_{ρ_i}(x|θ^s_i) with Z_{ρ_i}(x | θ^s_{i−1}, δ^s_i, ρ^s_{c_i}) since they are equivalent here. The normalizer is\n\nZ_0 = ∑_{ρ′_c} Z_{ρ_i}(x | θ^s_{i−1}, δ^s_i, ρ′_c) · I[ρ′_c ∈ Ω^s_i] = ∑_{ρ′_c ∈ Ω^s_i} Z_{ρ_i}(x | θ^s_{i−1}, δ^s_i, ρ′_c).   (12)\n\nApplying the true distribution (the ith multiplicand from Equation 8) and the proposal distribution (Equation 10 and Equation 11) to the SIR weight update (Equation 9),\n\nw^s_i = w^s_{i−1} · λ^{|ρ^s_{c_i}|}_{n_{i−1}} · ( ∑_{ρ′_c ∈ Ω^s_i} Z_{ρ_i}(x | θ^s_{i−1}, δ^s_i, ρ′_c) ) / λ^s_{n_{i−1}},   (13)\n\nwhere |ρ^s_{c_i}| is the size of the children set ρ^s_{c_i}; parameter λ^{|ρ^s_{c_i}|}_{n_{i−1}} is the rate of the children set ρ^s_{c_i} (Equation 2); and λ^s_{n_{i−1}} is the rate of all possible sets given a total number of nodes n_{i−1} (Equation 2).\n\nWe can view this new proposal as a coarse-to-fine process: the DPMM proposes candidate children sets; SMC selects a children set from the DPMM to coalesce. Since the coarse step is faster and filters “bad” children sets, the slower finer step considers fewer children sets, saving computation time (Algorithm 1). If Ω_i has all children sets, it recovers exhaustive SMC. We estimate the effective sample size [16] and resample [17] when needed. For smaller sets, the DPMM is sometimes impractical (and only provides singleton clusters). In such cases it is simpler to enumerate all children sets.\n\n4.3 Example Transition Kernel: Brownian Diffusion\n\nThis section uses Brownian diffusion as an example of the message passing framework. 
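Anticipating the Gaussian message updates below, the parent message for scalar features can be sketched as a precision-weighted combination of child messages, after each child's variance is inflated by its branch length. The helper is an assumed illustration (leaf messages are encoded as variance-0 Gaussians), not the authors' code.

```python
def merge_messages(children, t_parent):
    """Combine children's Gaussian messages into the parent message.

    `children` is a list of (y_hat, v, t_child) triples; time runs backward
    toward the root, so t_parent < t_child and t_child - t_parent is the
    branch length added to each child's variance."""
    precisions = [1.0 / (v + t_child - t_parent) for _, v, t_child in children]
    v_parent = 1.0 / sum(precisions)
    y_parent = v_parent * sum(p * y for p, (y, _, _) in zip(precisions, children))
    return y_parent, v_parent
```

For example, two leaves with features 0 and 2 merged one time unit in the past give a parent message centered at 1 with variance 1/2.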
The initial distribution p_0(y) of each node is N(0, ∞); the transition kernel κ_{t_i t_b}(y, ·) is a Gaussian centered at y with variance (t_b − t_i)Σ, where Σ = µI, µ = β/2, and β is the concentration parameter of the DPMM. Then the local normalizer Z_{ρ_i}(x|θ_i) is\n\nZ_{ρ_i}(x|θ_i) = ∫ N(y_i; 0, ∞) ∏_{b∈c_i} N(y_i; ŷ_{ρ_b}, Σ(v_{ρ_b} + t_b − t_i)) dy_i,   (14)\n\nand the node message M_{ρ_i}(y_i) is normally distributed, M_{ρ_i}(y_i) ∼ N(y_i; ŷ_{ρ_i}, Σ v_{ρ_i}), where\n\nv_{ρ_i} = ( ∑_{b∈c_i} (v_{ρ_b} + t_b − t_i)^{−1} )^{−1}   and   ŷ_{ρ_i} = ( ∑_{b∈c_i} ŷ_{ρ_b} / (v_{ρ_b} + t_b − t_i) ) · v_{ρ_i}.\n\n5 Experiments: Finding Bushy Trees\n\nIn this section, we compare trees built by the beta coalescent (beta) against those built by Kingman’s coalescent (kingman) and hierarchical agglomerative clustering [26, hac] on both synthetic and real data. We show beta performs best and can capture data in more interpretable, bushier trees.\n\nSetup  The parameter α for the beta coalescent is between 1 and 2. The closer α is to 1, the bushier the tree is, and we set α = 1.2.5 We set the mutation rate to 1; thus the DPMM parameter is initialized as β = 2 and updated using slice sampling [27]. All experiments use 100 initial iterations of DPMM inference, with 30 more iterations after each coalescent event (forming a new particle).\n\nMetrics  We use three metrics to evaluate the quality of the trees discovered by our algorithm: purity, subtree, and path length. The dendrogram purity score [28, 10] measures how well the leaves in a subtree belong to the same class. For any two leaf nodes, we find the least common subsumer node s and—for the subtree rooted at s—measure the fraction of leaves with the same class labels. 
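The purity computation just described can be sketched as follows, with arbitrary-branching trees represented as nested tuples; this is a hypothetical helper for illustration, not the authors' evaluation code:

```python
from itertools import combinations

def dendrogram_purity(tree, labels):
    """Average, over same-class leaf pairs, of the fraction of same-class
    leaves in the subtree rooted at the pair's least common subsumer."""
    scores = []

    def visit(node):
        # Returns the leaves under `node`; pairs drawn from different
        # children have `node` as their least common subsumer.
        if not isinstance(node, tuple):
            return [node]
        child_leaves = [visit(c) for c in node]
        all_leaves = [leaf for ls in child_leaves for leaf in ls]
        for i, j in combinations(range(len(child_leaves)), 2):
            for a in child_leaves[i]:
                for b in child_leaves[j]:
                    if labels[a] == labels[b]:
                        same = sum(labels[leaf] == labels[a] for leaf in all_leaves)
                        scores.append(same / len(all_leaves))
        return all_leaves

    visit(tree)
    return sum(scores) / len(scores)
```

For the tree ((0, 1), (2, 3)) with labels ['a', 'a', 'a', 'b'], the same-class pairs score 1.0, 0.75, and 0.75, giving purity of 2.5/3 ≈ 0.833.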
The subtree score [9] is the ratio between the number of internal nodes with all children in the same class and the total number of internal nodes. The path length score is the average difference—over all pairs—of the lowest common subsumer distance between the true tree and the generated tree, where the lowest common subsumer distance is the distance between the root and the lowest common subsumer of two nodes. For purity and subtree, higher is better, while for length, lower is better. Scores are in expectation over particles and averaged across chains.\n\n5.1 Synthetic Hierarchies\n\nTo test our inference method, we generated synthetic data with edge lengths (full details available in the supplemental material); we also assume each child of the root has a unique label and the descendants have the same label as their parent node (except the root node).\n\nWe compared beta against kingman and hac by varying the number of observations (Figure 2(a)) and feature dimensions (Figure 2(b)). In both cases, beta is comparable to kingman and hac (no edge length). While increasing the feature dimension improves both scores, more observations do not: for synthetic data, a small number of observations suffices to construct a good tree.\n\n5 With DPMM proposals, α has a negligible effect, so we elide further analysis for different α values.\n\n(a) Increasing observations   (b) Increasing dimension   (c) beta vs. enum\n\nFigure 2: Figures 2(a) and 2(b) show the effect of changing the underlying data size or number of dimensions. 
Figure 2(c) shows that our DPMM proposal for children sets is comparable to an exhaustive enumeration of all possible children sets (enum).\n\nTo evaluate the effectiveness of using our DPMM as a proposal distribution, we compare against exhaustively enumerating all children-set candidates (enum) while keeping the SMC otherwise unchanged; this experiment uses ten data points (enum is completely intractable on larger data). Beta uses the DPMM and achieved similar accuracy (Figure 2(c)) while greatly improving efficiency.\n\n5.2 Human Tissue Development\n\nOur first real dataset is based on the developmental biology of human tissues. As a human develops, tissues specialize, starting from three embryonic germ layers: the endoderm, ectoderm, and mesoderm. These eventually form all human tissues. For example, one developmental pathway is ectoderm → neural crest → cranial neural crest → optic vesicle → cornea. Because each germ layer specializes into many different types of cells at specific times, it is inappropriate to model this development as a binary tree, or with clustering models lacking path lengths.\n\nHistorically, uncovering these specialization pathways is a painstaking process, requiring inspection of embryos at many stages of development; however, massively parallel sequencing data make it possible to efficiently form developmental hypotheses based on similar patterns of gene expression. To investigate this question, we use the transcriptome of 27 tissues with known, unambiguous, time-specific lineages [29]. We reduce the original 182,727 dimensions via principal component analysis [30, PCA]. We use five chains with five particles per chain.\n\nUsing reference developmental trees, beta performs better on all three scores (Table 1) because beta builds up a bushy hierarchy more similar to the true tree. The tree recovered by beta (Figure 3) reflects human development. 
The first major differentiation is the division of embryonic cells into three layers of tissue: endoderm, mesoderm, and ectoderm. These go on to form almost all adult organs and cells. The placenta (magenta), however, forms from a fourth cell type, the trophoblast; this is placed in its own cluster at the root of the tree. It also successfully captures ectodermal tissue lineage. However, mesodermic and endodermic tissues, which are highly diverse, do not cluster as well. Tissues known to secrete endocrine hormones (dashed borders) cluster together.\n\n5.3 Clustering 20-newsgroups Data\n\nFollowing Heller et al. [10], we also compare the three models on 20-newsgroups,6 a multilevel hierarchy first dividing into general areas (rec, space, and religion) before specializing into areas such as baseball or hockey.7 This true hierarchy is inset in the bottom right of Figure 4, and we assume each edge has the same length. We apply latent Dirichlet allocation [31] with 50 topics to this corpus, and use the topic distribution for each document as the document feature. We use five chains with eighty particles per chain.\n\n6 http://qwone.com/~jason/20Newsgroups/\n7 We use “rec.autos”, “rec.sport.baseball”, “rec.sport.hockey”, and “sci.space” newsgroups but also—in contrast to Heller et al. [10]—added “soc.religion.christian”.\n\nFigure 3: One sample hierarchy of human tissue from beta. Color indicates germ layer origin of tissue. Dashed border indicates secretory function. 
While neural tissues from the ectoderm were clustered correctly, some mesoderm and endoderm tissues were commingled. The cluster also preferred placing secretory tissues together and higher in the tree.\n\nFigure 4: One sample hierarchy of the 20-newsgroups from beta. Each small square is a document colored by its class label. Large rectangles represent a subtree with all the enclosed documents as leaf nodes. Most of the documents from the same group are clustered together; the three “rec” groups are merged together first, and then merged with the religion and space groups.\n\nTable 1: Comparing the three models: beta performs best on all three scores.\n\n           |        Biological Data                    |        20-newsgroups Data\n           | hac     kingman         beta              | hac     kingman         beta\npurity ↑   | 0.453   0.474 ± 0.029   0.492 ± 0.028     | 0.465   0.510 ± 0.047   0.565 ± 0.081\nsubtree ↑  | 0.240   0.302 ± 0.033   0.331 ± 0.050     | 0.571   0.651 ± 0.013   0.720 ± 0.013\nlength ↓   | −       0.654 ± 0.041   0.586 ± 0.051     | −       0.477 ± 0.027   0.333 ± 0.047\n\nAs with the biological data, beta performs best on all scores for 20-newsgroups. Figure 4 shows a bushy tree built by beta, which mostly recovered the true hierarchy. Documents within a newsgroup merge first, then the three “rec” groups, followed by “space” and “religion” groups. We only use topic distributions as features, so better results could be possible with more comprehensive features.\n\n6 Conclusion\n\nThis paper generalizes Bayesian hierarchical clustering, moving from Kingman’s coalescent to the beta coalescent. Our novel inference scheme based on SMC and DPMM makes this generalization practical and efficient. 
This new model provides a bushier tree, often a more realistic view of data. While we only consider real-valued vectors, which we model through the ubiquitous Gaussian, other likelihoods might be better suited to other applications. For example, for discrete data such as in natural language processing, a multinomial likelihood may be more appropriate; this is a straightforward extension of our model via other transition kernels and DPMM base distributions. Recent work uses the coalescent as a means of producing a clustering in tandem with a downstream task such as classification [32]. Hierarchies are often taken as given a priori in natural language processing. Particularly for linguistic tasks, a fully statistical model like the beta coalescent that jointly learns the hierarchy and a downstream task could improve performance in dependency parsing [33] (clustering parts of speech), multilingual sentiment [34] (finding sentiment-correlated words across languages), or topic modeling [35] (finding coherent words that should co-occur in a topic).

Acknowledgments

We would like to thank the anonymous reviewers for their helpful comments, and thank Héctor Corrada Bravo for pointing us to human tissue data. This research was supported by NSF grant #1018625. Any opinions, findings, conclusions, or recommendations expressed here are those of the authors and do not necessarily reflect the view of the sponsor.

References

[1] Kaufman, L., P. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis.
John Wiley, 1990.
[2] Jain, A. K. Data clustering: 50 years beyond k-means. Pattern Recognition Letters, 31(8):651–666, 2010.
[3] Brown, P. F., V. J. D. Pietra, P. V. deSouza, et al. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–479, 1992.
[4] Bergen, J., P. Anandan, K. Hanna, et al. Hierarchical model-based motion estimation. In ECCV. 1992.
[5] Girvan, M., M. E. J. Newman. Community structure in social and biological networks. PNAS, 99:7821–7826, 2002.
[6] Kingman, J. F. C. On the genealogy of large populations. Journal of Applied Probability, 19:27–43, 1982.
[7] Pitman, J. Coalescents with multiple collisions. The Annals of Probability, 27:1870–1902, 1999.
[8] Berestycki, N. Recent progress in coalescent theory. In Ensaios Matemáticos, vol. 16. 2009.
[9] Teh, Y. W., H. Daumé III, D. M. Roy. Bayesian agglomerative clustering with coalescents. In NIPS. 2008.
[10] Heller, K. A., Z. Ghahramani. Bayesian hierarchical clustering. In ICML. 2005.
[11] Blundell, C., Y. W. Teh, K. A. Heller. Bayesian rose trees. In UAI. 2010.
[12] Adams, R., Z. Ghahramani, M. Jordan. Tree-structured stick breaking for hierarchical data. In NIPS. 2010.
[13] Knowles, D., Z. Ghahramani. Pitman-Yor diffusion trees. In UAI. 2011.
[14] Neal, R. M. Density modeling and clustering using Dirichlet diffusion trees. Bayesian Statistics, 7:619–629, 2003.
[15] Sagitov, S. The general coalescent with asynchronous mergers of ancestral lines. Journal of Applied Probability, 36:1116–1125, 1999.
[16] Neal, R. M. Annealed importance sampling. Technical Report 9805, University of Toronto, 1998.
[17] Fearnhead, P. Sequential Monte Carlo methods in filter theory. PhD thesis, University of Oxford, 1998.
[18] Felsenstein, J. Maximum-likelihood estimation of evolutionary trees from continuous characters. Am J Hum Genet, 25(5):471–492, 1973.
[19] Birkner, M., J. Blath, M. Steinrücken.
Importance sampling for lambda-coalescents in the infinitely many sites model. Theoretical Population Biology, 79(4):155–173, 2011.
[20] Doucet, A., N. De Freitas, N. Gordon, eds. Sequential Monte Carlo Methods in Practice. Springer, 2001.
[21] Gordon, N., D. Salmond, A. Smith. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proceedings F, Radar and Signal Processing, 140(2):107–113, 1993.
[22] Görür, D., L. Boyles, M. Welling. Scalable inference on Kingman's coalescent using pair similarity. JMLR, 22:440–448, 2012.
[23] Antoniak, C. E. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics, 2(6):1152–1174, 1974.
[24] Cappé, O., S. Godsill, E. Moulines. An overview of existing methods and recent advances in sequential Monte Carlo. Proceedings of the IEEE, 95(5):899–924, 2007.
[25] Chen, Z. Bayesian filtering: From Kalman filters to particle filters, and beyond. McMaster, [Online], 2003.
[26] Eads, D. Hierarchical clustering (scipy.cluster.hierarchy). SciPy, 2007.
[27] Neal, R. M. Slice sampling. Annals of Statistics, 31:705–767, 2003.
[28] Powers, D. M. W. Unsupervised learning of linguistic structure: an empirical evaluation. International Journal of Corpus Linguistics, 2:91–131, 1997.
[29] Jongeneel, C., M. Delorenzi, C. Iseli, et al. An atlas of human gene expression from massively parallel signature sequencing (MPSS). Genome Res, 15:1007–1014, 2005.
[30] Shlens, J. A tutorial on principal component analysis. Systems Neurobiology Laboratory, Salk Institute for Biological Studies, 2005.
[31] Blei, D. M., A. Ng, M. Jordan. Latent Dirichlet allocation. JMLR, 3:993–1022, 2003.
[32] Rai, P., H. Daumé III. The infinite hierarchical factor regression model. In NIPS. 2008.
[33] Koo, T., X. Carreras, M. Collins. Simple semi-supervised dependency parsing. In ACL. 2008.
[34] Boyd-Graber, J., P.
Resnik. Holistic sentiment analysis across languages: Multilingual supervised latent Dirichlet allocation. In EMNLP. 2010.
[35] Andrzejewski, D., X. Zhu, M. Craven. Incorporating domain knowledge into topic modeling via Dirichlet forest priors. In ICML. 2009.