{"title": "Tree-Structured Stick Breaking for Hierarchical Data", "book": "Advances in Neural Information Processing Systems", "page_first": 19, "page_last": 27, "abstract": "Many data are naturally modeled by an unobserved hierarchical structure. In this paper we propose a flexible nonparametric prior over unknown data hierarchies. The approach uses nested stick-breaking processes to allow for trees of unbounded width and depth, where data can live at any node and are infinitely exchangeable. One can view our model as providing infinite mixtures where the components have a dependency structure corresponding to an evolutionary diffusion down a tree. By using a stick-breaking approach, we can apply Markov chain Monte Carlo methods based on slice sampling to perform Bayesian inference and simulate from the posterior distribution on trees. We apply our method to hierarchical clustering of images and topic modeling of text data.", "full_text": "Tree-Structured Stick Breaking for Hierarchical Data\n\nRyan Prescott Adams∗\nDept. of Computer Science\nUniversity of Toronto\n\nZoubin Ghahramani\nDept. of Engineering\nUniversity of Cambridge\n\nMichael I. Jordan\nDepts. of EECS and Statistics\nUniversity of California, Berkeley\n\nAbstract\n\nMany data are naturally modeled by an unobserved hierarchical structure. In this paper we propose a flexible nonparametric prior over unknown data hierarchies. The approach uses nested stick-breaking processes to allow for trees of unbounded width and depth, where data can live at any node and are infinitely exchangeable. One can view our model as providing infinite mixtures where the components have a dependency structure corresponding to an evolutionary diffusion down a tree. By using a stick-breaking approach, we can apply Markov chain Monte Carlo methods based on slice sampling to perform Bayesian inference and simulate from the posterior distribution on trees. 
We apply our method to hierarchical clustering\nof images and topic modeling of text data.\n\n1\n\nIntroduction\n\nStructural aspects of models are often critical to obtaining \ufb02exible, expressive model families. In\nmany cases, however, the structure is unobserved and must be inferred, either as an end in itself\nor to assist in other estimation and prediction tasks. This paper addresses an important instance\nof the structure learning problem: the case when the data arise from a latent hierarchy. We take a\ndirect nonparametric Bayesian approach, constructing a prior on tree-structured partitions of data that\nprovides for unbounded width and depth while still allowing tractable posterior inference.\nProbabilistic approaches to latent hierarchies have been explored in a variety of domains. Unsu-\npervised learning of densities and nested mixtures has received particular attention via \ufb01nite-depth\ntrees [1], diffusive branching processes [2] and hierarchical clustering [3, 4]. Bayesian approaches to\nlearning latent hierarchies have also been useful for semi-supervised learning [5], relational learning\n[6] and multi-task learning [7]. In the vision community, distributions over trees have been useful as\npriors for \ufb01gure motion [8] and for discovering visual taxonomies [9].\nIn this paper we develop a distribution over probability measures that imbues them with a natural\nhierarchy. These hierarchies have unbounded width and depth and the data may live at internal nodes\non the tree. As the process is de\ufb01ned in terms of a distribution over probability measures and not as a\ndistribution over data per se, data from this model are in\ufb01nitely exchangeable; the probability of any\nset of data is not dependent on its ordering. 
Unlike other infinitely exchangeable models [2, 4], a pseudo-time process is not required to describe the distribution on trees and it can be understood in terms of other popular Bayesian nonparametric models.\nOur new approach allows the components of an infinite mixture model to be interpreted as part of a diffusive evolutionary process. Such a process captures the natural structure of many data. For example, some scientific papers are considered seminal: they spawn new areas of research and cause new papers to be written. We might expect that within a text corpus of scientific documents, such papers would be the natural ancestors of more specialized papers that followed on from the new ideas. This motivates two desirable features of a distribution over hierarchies: 1) ancestor data (the “prototypes”) should be able to live at internal nodes in the tree, and 2) as the ancestor/descendant relationships are not known a priori, the data should be infinitely exchangeable.\n\n∗http://www.cs.toronto.edu/~rpa/\n\n(a) Dirichlet process stick breaking   (b) Tree-structured stick breaking\n\nFigure 1: a) Dirichlet process stick-breaking procedure, with a linear partitioning. b) Interleaving two stick-breaking processes yields a tree-structured partition. Rows 1, 3 and 5 are ν-breaks. Rows 2 and 4 are ψ-breaks.\n\n2 A Tree-Structured Stick-Breaking Process\n\nStick-breaking processes based on the beta distribution have played a prominent role in the development of Bayesian nonparametric methods, most significantly with the constructive approach to the Dirichlet process (DP) due to Sethuraman [10]. 
A random probability measure G can be drawn from a DP with base measure αH using a sequence of beta variates via:\n\nG = ∑_{i=1}^∞ π_i δ_{θ_i},   π_i = ν_i ∏_{i′=1}^{i−1} (1 − ν_{i′}),   π_1 = ν_1,   θ_i ∼ H,   ν_i ∼ Be(1, α).   (1)\n\nWe can view this as taking a stick of unit length and breaking it at a random location. We call the left side of the stick π_1 and then break the right side at a new place, calling the left side of this new break π_2. If we continue this process of “keep the left piece and break the right piece again” as in Fig. 1a, pairing each π_i with an atom θ_i drawn from H, we can view this as a random probability measure centered on H. The distribution over the sequence (π_1, π_2, ···) is a case of the GEM distribution [11], which also includes the Pitman-Yor process [12]. Note that in Eq. (1) the θ_i are i.i.d. from H; in the current paper these parameters will be drawn according to a hierarchical process.\nThe GEM construction provides a distribution over infinite partitions of the unit interval, with natural numbers as the index set as in Fig. 1a. In this paper, we extend this idea to create a distribution over infinite partitions that also possess a hierarchical graph topology. To do this, we will use finite-length sequences of natural numbers as our index set on the partitions. Borrowing notation from the Pólya tree (PT) construction [13], let ε = (ε_1, ε_2, ···, ε_K) denote a length-K sequence of positive integers, i.e., ε_k ∈ N_+. We denote the zero-length string as ε = ø and use |ε| to indicate the length of ε's sequence. These strings will index the nodes in the tree and |ε| will then be the depth of node ε.\nWe interleave two stick-breaking procedures as in Fig. 1b. 
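As a concrete numerical illustration (ours, not the paper's), the GEM construction of Eq. (1) truncated to K sticks can be sampled in a few lines; the truncation level K is an assumption of the sketch:

```python
import numpy as np

def dp_stick_weights(alpha, K, rng):
    """First K GEM(alpha) weights: pi_i = nu_i * prod_{i' < i} (1 - nu_i')."""
    nu = rng.beta(1.0, alpha, size=K)                          # nu_i ~ Be(1, alpha)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - nu)[:-1]))
    return nu * remaining                                      # pi_1 = nu_1, etc.

rng = np.random.default_rng(0)
pi = dp_stick_weights(alpha=2.0, K=1000, rng=rng)
```

For large K the weights nearly exhaust the unit stick, since the unbroken remainder shrinks geometrically.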
The first has beta variates ν_ε ∼ Be(1, α(|ε|)), which determine the size of a given node's partition as a function of depth. The second has beta variates ψ_ε ∼ Be(1, γ), which determine the branching probabilities. Interleaving these processes partitions the unit interval. The size of the partition associated with each ε is given by\n\nπ_ø = ν_ø,   π_ε = ν_ε φ_ε ∏_{ε′≺ε} φ_{ε′} (1 − ν_{ε′}),   φ_{εε_i} = ψ_{εε_i} ∏_{j=1}^{ε_i−1} (1 − ψ_{εj}),   (2)\n\nwhere εε_i denotes the sequence that results from appending ε_i onto the end of ε, and ε′ ≺ ε indicates that ε could be constructed by appending onto ε′. When viewing these strings as identifying nodes on a tree, {εε_i : ε_i ∈ 1, 2, ···} are the children of ε and {ε′ : ε′ ≺ ε} are the ancestors of ε. The {π_ε} in Eq. (2) can be seen as products of several decisions on how to allocate mass to nodes and branches in the tree: the {φ_ε} determine the probability of a particular sequence of children and the ν_ε and (1 − ν_ε) terms determine the proportion of mass allotted to ε versus nodes that are descendants of ε.\n\n(a) α_0 = 1, λ = 1/2, γ = 1/5   (b) α_0 = 1, λ = 1, γ = 1/5   (c) α_0 = 1, λ = 1, γ = 1   (d) α_0 = 5, λ = 1/2, γ = 1/5   (e) α_0 = 5, λ = 1, γ = 1/5   (f) α_0 = 5, λ = 1/2, γ = 1   (g) α_0 = 25, λ = 1/2, γ = 1/5   (h) α_0 = 25, λ = 1/2, γ = 1\n\nFigure 2: Eight samples of trees over partitions of fifty data, with different hyperparameter settings. The circles are represented nodes, and the squares are the data. Note that some of the sampled trees have represented nodes with no data associated with them and that the branch ordering does not correspond to a size-biased permutation.\n\nWe require that the {π_ε} sum to one. The ψ-sticks have no effect upon this, but α(·) : N → R_+ (the depth-varying parameter for the ν-sticks) must satisfy ∑_{j=1}^∞ ln(1 + 1/α(j−1)) = +∞ (see [14]). This is clearly true for α(j) = α_0 > 0. A useful function that also satisfies this condition is α(j) = λ^j α_0 with α_0 > 0, λ ∈ (0, 1]. The decay parameter λ allows a distribution over trees with most of the mass at an intermediate depth. 
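To make Eq. (2) concrete, here is a small self-contained sketch (ours, with a finite depth/width truncation and the arbitrary choices α(·) ≡ 1, γ = 1) that computes π_ε for every node of a truncated tree and checks that the masses form a sub-probability:

```python
import itertools
import numpy as np

def node_mass(eps, nu, psi):
    """pi_eps from Eq. (2): descend from the root; at each ancestor take the
    'continue' mass (1 - nu), the chosen branch's psi-stick, and the (1 - psi)
    factors of earlier siblings; finally stop at eps with probability nu_eps."""
    mass = 1.0
    for k in range(len(eps)):
        parent, i = eps[:k], eps[k]
        mass *= (1.0 - nu[parent]) * psi[parent + (i,)]
        for j in range(i):
            mass *= 1.0 - psi[parent + (j,)]
    return mass * nu[eps]

rng = np.random.default_rng(0)
depth, width = 3, 4
nodes = [e for k in range(depth + 1) for e in itertools.product(range(width), repeat=k)]
nu  = {e: rng.beta(1.0, 1.0) for e in nodes}        # nu_eps ~ Be(1, alpha(|eps|))
psi = {e: rng.beta(1.0, 1.0) for e in nodes if e}   # psi-sticks governing branches
pi  = {e: node_mass(e, nu, psi) for e in nodes}
```

The truncated masses sum to strictly less than one; the deficit is the mass of the infinitely many unrepresented nodes.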
This is the α(·) we will assume throughout the remainder of the paper.\n\nAn Urn-based View\n\nWhen a Bayesian nonparametric model induces partitions over data, it is sometimes possible to construct a Blackwell-MacQueen [15] type urn scheme that corresponds to sequentially generating data, while integrating out the underlying random measure. The “Chinese restaurant” metaphor for the Dirichlet process is a popular example. In our model, we can use such an urn scheme to construct a treed partition over a finite set of data.\nThe urn process can be seen as a path-reinforcing Bernoulli trip down the tree where each datum starts at the root and descends into children until it stops at some node. The first datum lives at the root node with probability 1/(α(0)+1), otherwise it descends and instantiates a new child. It stays at this new child with probability 1/(α(1)+1) or descends again and so on. A later datum stays at node ε with probability (N_ε + 1)/(N_ε + N_{ε≺·} + α(|ε|) + 1), where N_ε is the number of previous data that stopped at ε, and N_{ε≺·} is the number of previous data that came down this path of the tree but did not stop at ε, i.e., a sum over all descendants: N_{ε≺·} = ∑_{ε≺ε′} N_{ε′}. If a datum descends to ε but does not stop then it chooses which child to descend to according to a Chinese restaurant process where the previous customers are only those data that have also descended to this point. That is, if it has reached node ε but will not stay there, it descends to existing child εε_i with probability (N_{εε_i} + N_{εε_i≺·})/(N_{ε≺·} + γ) and instantiates a new child with probability γ/(N_{ε≺·} + γ). A particular path therefore becomes more likely according to its “popularity” with previous data. 
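The urn scheme just described is easy to simulate. The following sketch (our illustration, with arbitrarily chosen hyperparameters) sequentially assigns N data to the nodes of a growing tree:

```python
import numpy as np

def urn_tree(N, alpha, gamma, rng):
    """Path-reinforcing urn: each datum descends from the root, stopping at a
    node or choosing a child by a Chinese restaurant rule. `alpha` maps a
    depth to the nu-stick parameter alpha(depth)."""
    stops = {(): 0}     # N_eps: data that stopped at eps
    passes = {(): 0}    # data that passed through eps without stopping
    children = {(): []}
    assignments = []
    for _ in range(N):
        eps = ()
        while True:
            # Stay at eps w.p. (N_eps + 1) / (N_eps + N_pass + alpha(|eps|) + 1).
            if rng.random() < (stops[eps] + 1) / (stops[eps] + passes[eps] + alpha(len(eps)) + 1):
                stops[eps] += 1
                break
            passes[eps] += 1
            kids = children[eps]
            # Existing child w.p. prop. to its subtree count; new child w.p. prop. to gamma.
            w = np.array([stops[c] + passes[c] for c in kids] + [gamma], dtype=float)
            idx = rng.choice(len(w), p=w / w.sum())
            if idx == len(kids):
                new = eps + (len(kids),)
                stops[new], passes[new], children[new] = 0, 0, []
                kids.append(new)
            eps = kids[idx]
        assignments.append(eps)
    return assignments, stops

rng = np.random.default_rng(0)
assign, stops = urn_tree(50, alpha=lambda d: 1.0, gamma=0.5, rng=rng)
```

Each datum reinforces the path it takes, so popular subtrees attract later data, mirroring the samples in Fig. 2.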
Note that a node can be a part of a popular path without having any data of its own. Fig. 2 shows the structures over fifty data drawn from this process with different hyperparameter settings. Note that the branch ordering in a realization of the urn scheme will not necessarily be the same as that of the size-biased ordering [16] of the partitions in Fig. 1b: the former is a tree over a finite set of data and the latter is over a random infinite partition.\nThe urn view allows us to compare this model to other priors on infinite trees. One contribution of this model is that the data can live at internal nodes in the tree, but are nevertheless infinitely exchangeable. This is in contrast to the model in [8], for example, which is not infinitely exchangeable. The nested Chinese restaurant process (nCRP) [17] provides a distribution over trees of unbounded width and depth, but data correspond to paths of infinite length, requiring an additional distribution over depths that is not path-dependent. The Pólya tree [13] uses a recursive stick-breaking process to specify a distribution over nested partitions in a binary tree; however, the data live at infinitely-deep leaf nodes. The marginal distribution on the topology of a Dirichlet diffusion tree [2] (and the clustering variant of Kingman's coalescent [4]) provides path-reinforcement and infinite exchangeability; however, it requires a pseudo-time hazard process and data do not live at internal nodes.\n\n3 Hierarchical Priors for Node Parameters\n\nOne can view the stick-breaking construction of the Dirichlet process as generating an infinite partition and then labeling each cell i with parameter θ_i drawn i.i.d. from H. In a mixture model, data from the ith component are generated independently according to a distribution f(x | θ_i), where x takes values in a sample space X. 
In our model, we continue to assume that the data are generated independently given the latent labeling, but to take advantage of the tree-structured partitioning of Section 2 an i.i.d. assumption on the node parameters is inappropriate. Rather, the distribution over the parameters at node ε, denoted θ_ε, should depend in an interesting way on its ancestors {θ_{ε′} : ε′ ≺ ε}. A natural way to specify such dependency is via a directed graphical model, with the requirement that edges must always point down the tree. An intuitive subclass of such graphical models is that in which a child is conditionally independent of all ancestors, given its parents and any global hyperparameters. This is the case we will focus on here, as it provides a useful view of the parameter-generation process as a “diffusion down the tree” via a Markov transition kernel that can be essentially any distribution with a location parameter. Coupling such a kernel, which we denote T(θ_{εε_i} ← θ_ε), with a root-level prior p(θ_ø) and the node-wise data distribution f(x | θ_ε), we have a complete model for infinitely exchangeable tree-structured data on X. We now examine a few specific examples.\n\nGeneralized Gaussian Diffusions\n\nIf our data distribution f(x | θ) is such that the parameters can be specified as a real-valued vector θ ∈ R^M, then we can use a Gaussian distribution to describe the parent-to-child transition kernel: T_norm(θ_{εε_i} ← θ_ε) = N(θ_{εε_i} | η θ_ε, Λ), where η ∈ [0, 1). Such a kernel captures the simple idea that the child's parameters are noisy versions of the parent's, as specified by the covariance matrix Λ, while η ensures that all parameters in the tree have a finite marginal variance. 
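As a quick check of that last claim (our one-dimensional sketch with Λ = lam; the stationary N(0, Λ/(1 − η²)) root prior is an assumption for illustration, since the paper only requires some root-level prior), simulating the kernel down a root-to-leaf path shows the marginal variance staying bounded:

```python
import numpy as np

rng = np.random.default_rng(0)
eta, lam, depth, n_paths = 0.9, 1.0, 5, 20000

# Root draws from the (assumed) stationary prior, then repeated application of
# T_norm(theta_child <- theta_parent) = N(eta * theta_parent, lam).
theta = rng.normal(0.0, np.sqrt(lam / (1 - eta**2)), size=n_paths)
for _ in range(depth):
    theta = eta * theta + rng.normal(0.0, np.sqrt(lam), size=n_paths)

# With |eta| < 1 the marginal variance is lam / (1 - eta^2) at every depth.
target = lam / (1 - eta**2)
```

With η = 1 the variance would instead grow linearly with depth, which is why the shrinkage factor matters.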
While this will not result in a conjugate model unless the data are themselves Gaussian, it has the simple property that each node's parameter has a Gaussian prior that is specified by its parent. We present an application of this model in Section 5, where we model images as a distribution over binary vectors obtained by transforming a real-valued vector to (0, 1) via the logistic function.\n\nChained Dirichlet-Multinomial Distributions\n\nIf each datum is a set of counts over M discrete outcomes, as in many finite topic models, a multinomial model for f(x | θ) may be appropriate. In this case, X = N^M, and θ_ε takes values in the (M−1)-simplex. We can construct a parent-to-child transition kernel via a Dirichlet distribution with concentration parameter κ: T_dir(θ_{εε_i} ← θ_ε) = Dir(κ θ_ε), using a symmetric Dirichlet for the root node, i.e., θ_ø ∼ Dir(κ 1).\n\nHierarchical Dirichlet Processes\n\nA very general way to specify the distribution over data is to say that it is drawn from a random probability measure with a Dirichlet process prior. In our case, one flexible approach would be to model the data at node ε with a distribution G_ε as in Eq. (1). This means that θ_ε ∼ G_ε, where G_ε now corresponds to an infinite set of parameters. The hierarchical Dirichlet process (HDP) [18] provides a natural parent-to-child transition kernel for the tree-structured model, again with concentration parameter κ: T_hdp(G_{εε_i} ← G_ε) = DP(κ G_ε). At the top level, we specify a global base measure H for the root node, i.e., G_ø ∼ H. One negative aspect of this transition kernel is that the G_ε will have a tendency to collapse down onto a single atom. One remedy is to smooth the kernel with η as in the Gaussian case, i.e., T_hdp(G_{εε_i} ← G_ε) = DP(κ (η G_ε + (1 − η) H)).\n\n4 Inference via Markov chain Monte Carlo\n\nWe have so far defined a model for data that are generated from the parameters associated with the nodes of a random tree. Having seen N data and assuming a model f(x | θ_ε) as in the previous section, we wish to infer possible trees and model parameters. As in most complex probabilistic models, closed form inference is impossible and we instead generate posterior samples via Markov chain Monte Carlo (MCMC). To operate efficiently over a variety of regimes without tuning, we use slice sampling [19] extensively. This allows us to sample from the true posterior distribution over the finite quantities of interest despite our model containing an infinite number of parameters. The primary data structure in our Markov chain is the set of N strings describing the current assignments of data to nodes, which we denote {ε_n}_{n=1}^N. We represent the ν-sticks and parameters θ_ε for all nodes that are traversed by the data in its current assignments, i.e., {ν_ε, θ_ε : ∃n, ε ≺ ε_n}. 
We also represent all ψ-sticks in the “hull” of the tree that contains the data: if at some node ε one of the N data paths passes through child εε_i, then we represent all the ψ-sticks in the set ⋃_{ε_n} ⋃_{εε_i ⪯ ε_n} {ψ_{εε_j} : ε_j ≤ ε_i}.\n\nfunction SAMP-ASSIGNMENT(n)\n  p_slice ∼ Uni(0, f(x_n | θ_{ε_n}))\n  u_min ← 0, u_max ← 1\n  loop\n    u ∼ Uni(u_min, u_max)\n    ε ← FIND-NODE(u, ø)\n    p ← f(x_n | θ_ε)\n    if p > p_slice then return ε\n    else if ε < ε_n then u_min ← u\n    else u_max ← u\n\nfunction FIND-NODE(u, ε)\n  if u < ν_ε then return ε\n  else\n    u ← (u − ν_ε)/(1 − ν_ε)\n    while u > 1 − ∏_j (1 − ψ_{εε_j}) do\n      Draw a new ψ-stick\n    e ← edges from ψ-sticks\n    i ← bin index for u from edges\n    Draw θ_{εε_i} and ν_{εε_i} if necessary\n    u ← (u − e_i)/(e_{i+1} − e_i)\n    return FIND-NODE(u, εε_i)\n\nfunction SIZE-BIASED-PERM(ε)\n  ρ ← ∅\n  while represented children do\n    w ← weights from {ψ_{εε_i}}\n    w ← w \ ρ\n    j ∼ w\n    ρ ← append j\n  return ρ\n\nSlice Sampling Data Assignments\n\nThe primary challenge in inference with Bayesian nonparametric mixture models is often sampling from the posterior distribution over assignments, as it is frequently difficult to integrate over the infinity of unrepresented components. To avoid this difficulty, we use a slice sampling approach that can be viewed as a combination of the Dirichlet slice sampler of Walker [20] and the retrospective sampler of Papaspiliopoulos and Roberts [21].\nSection 2 described a path-reinforcing process for generating data from the model. 
An alternative method is to draw a uniform variate u on (0, 1) and break sticks until we know which π_ε the u fell into. One can imagine throwing a dart at the top of Fig. 1b and considering which π_ε it hits. We would draw the sticks and parameters from the prior, as needed, conditioning on the state instantiated from any previous draws and with parent-to-child transitions enforcing the prior downwards in the tree. The pseudocode function FIND-NODE(u, ε) with u ∼ Uni(0, 1) and ε = ø draws such a sample. This representation leads to a slice sampling scheme on u that does not require any tuning parameters. To slice sample the assignment of the nth datum, currently assigned to ε_n, we initialize our slice sampling bounds to (0, 1). We draw a new u from the bounds and use the FIND-NODE function to determine the associated ε from the currently-represented state, plus any additional state that must be drawn from the prior. We do a lexical comparison (“string-like”) of the new ε and our current state ε_n to determine whether this new path corresponds to a u that is “above” or “below” our current state. This lexical comparison prevents us from having to represent the initial u_n. We shrink the slice sampling bounds appropriately, depending on the comparison, until we find a u that satisfies the slice. This procedure is given in pseudocode as SAMP-ASSIGNMENT(n). 
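A minimal sketch of FIND-NODE (ours; node parameters θ and the surrounding slice-sampling loop are omitted for brevity) illustrates the retrospective stick draws:

```python
import numpy as np

class LazyTree:
    """Descend the tree, lazily drawing nu- and psi-sticks from the prior
    only when the uniform variate u requires them."""
    def __init__(self, alpha, gamma, rng):
        self.alpha, self.gamma, self.rng = alpha, gamma, rng
        self.nu, self.psi = {}, {}

    def find_node(self, u, eps=()):
        if eps not in self.nu:
            self.nu[eps] = self.rng.beta(1.0, self.alpha(len(eps)))
        nu = self.nu[eps]
        if u < nu:
            return eps                                   # u fell within pi_eps
        u = (u - nu) / (1.0 - nu)                        # renormalize to descendant mass
        sticks = self.psi.setdefault(eps, [])
        while u >= 1.0 - np.prod([1.0 - s for s in sticks]):
            sticks.append(self.rng.beta(1.0, self.gamma))   # retrospective psi draw
        edges, rem = [0.0], 1.0                          # cumulative child intervals
        for s in sticks:
            edges.append(edges[-1] + s * rem)
            rem *= 1.0 - s
        i = int(np.searchsorted(edges, u, side='right')) - 1
        u = (u - edges[i]) / (edges[i + 1] - edges[i])
        return self.find_node(u, eps + (i,))

tree = LazyTree(alpha=lambda d: 1.0, gamma=1.0, rng=np.random.default_rng(0))
nodes = [tree.find_node(u) for u in np.linspace(0.01, 0.99, 25)]
```

Because drawn sticks are cached, repeated calls with the same u deterministically return the same node, and increasing u yields nodes in the depth-first (lexical) order on which the slice comparison relies.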
After performing this procedure, we can discard any state that is not in the previously-mentioned hull of representation.\n\nGibbs Sampling Stick Lengths\n\nGiven the represented sticks and the current assignments of nodes to data, it is straightforward to resample the lengths of the sticks from the posterior beta distributions\n\nν_ε | data ∼ Be(N_ε + 1, N_{ε≺·} + α(|ε|)),   ψ_{εε_i} | data ∼ Be(N_{εε_i≺·} + 1, γ + ∑_{j>i} N_{εε_j≺·}),\n\nwhere N_ε and N_{ε≺·} are the path-based counts as described in Section 2.\n\nGibbs Sampling the Ordering of the ψ-Sticks\n\nWhen using the stick-breaking representation of the Dirichlet process, it is crucial for mixing to sample over possible orderings of the sticks. In our model, we include such moves on the ψ-sticks. We iterate over each instantiated node ε and perform a Gibbs update of the ordering of its immediate children using its invariance under size-biased permutation (SBP) [16]. For a given node, the ψ-sticks provide a “local” set of weights that sum to one. We repeatedly draw without replacement from the discrete distribution implied by the weights and keep the ordering that results. 
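The draw-without-replacement move can be sketched as follows (our standalone illustration; in the model the weights would be the branch weights implied by a node's ψ-sticks):

```python
import numpy as np

def size_biased_perm(weights, rng):
    """Size-biased permutation of indices: repeatedly sample an index without
    replacement with probability proportional to its weight."""
    w = np.asarray(weights, dtype=float)
    remaining = list(range(len(w)))
    perm = []
    while remaining:
        p = w[remaining] / w[remaining].sum()
        j = rng.choice(len(remaining), p=p)
        perm.append(remaining.pop(j))
    return perm

rng = np.random.default_rng(0)
firsts = [size_biased_perm([0.7, 0.2, 0.1], rng)[0] for _ in range(4000)]
```

The heaviest component tends to come first: with these weights, index 0 leads the permutation about 70% of the time.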
Pitman [16] showed that distributions over sequences such as our ψ-sticks are invariant under such permutations and we can view the SIZE-BIASED-PERM(ε) procedure as a Metropolis–Hastings proposal with an acceptance ratio that is always one.\n\nSlice Sampling Stick-Breaking Hyperparameters\n\nGiven all of the instantiated sticks, we slice sample from the conditional posterior distribution over the hyperparameters α_0, λ and γ:\n\np(α_0, λ | {ν_ε}) ∝ I(α_0^min < α_0 < α_0^max) I(λ^min < λ < λ^max) ∏_ε Be(ν_ε | 1, λ^{|ε|} α_0),\n\np(γ | {ψ_ε}) ∝ I(γ^min < γ < γ^max) ∏_ε Be(ψ_ε | 1, γ),\n\nwhere the products are over nodes in the aforementioned hull. We initialize the bounds of the slice sampler with the bounds of the top-hat prior.\n\nFigure 3: These figures show a subset of the tree learned from the 50,000 CIFAR-100 images. The top tree only shows nodes for which there were at least 250 images. The ten shown at each node are those with the highest probability under the node's distribution. The second row shows three expanded views of subtrees, with nodes that have at least 50 images. Detailed views of portions of these subtrees are shown in the third row.\n\nSelecting a Single Tree\n\nWe have so far described a procedure for generating posterior samples from the tree structures and associated stick-breaking processes. If our objective is to find a single tree, however, samples from the posterior distribution are unsatisfying. 
Following [17], we report a best single tree structure over the data by choosing the sample from our Markov chain that has the highest complete-data likelihood p({x_n, ε_n}_{n=1}^N | {ν_ε}, {ψ_ε}, α_0, λ, γ).\n\n5 Hierarchical Clustering of Images\n\nWe applied our model and MCMC inference to the problem of hierarchically clustering the CIFAR-100 image data set 1. These data are a labeled subset of the 80 million tiny images data [22] with 50,000 32×32 color images. We did not use the labels in our clustering. We modeled the images via 256-dimensional binary features that had been previously extracted from each image (i.e., x_n ∈ {0, 1}^256) using a deep neural network that had been trained for an image retrieval task [23]. We used a factored Bernoulli likelihood at each node, parameterized by a latent 256-dimensional real vector (i.e., θ_ε ∈ R^256) that was transformed component-wise via the logistic function:\n\nf(x_n | θ_ε) = ∏_{d=1}^{256} (1 + exp{−θ_ε^(d)})^{−x_n^(d)} (1 + exp{θ_ε^(d)})^{−(1−x_n^(d))}.\n\nThe prior over the parameters of a child node was Gaussian with its parent's value as the mean. The covariance of the prior (Λ in Section 3) was diagonal and inferred as part of the Markov chain. We placed independent Uni(0.01, 1) priors on the elements of the diagonal. To efficiently learn the node parameters, we used Hamiltonian (hybrid) Monte Carlo (HMC) [24], taking 25 leapfrog HMC steps, with a randomized step size. We occasionally interleaved a slice sampling move for robustness.\n\n1http://www.cs.utoronto.ca/~kriz/cifar.html\n\nFigure 4: A subtree of documents from NIPS 1-12, inferred using 20 topics. Only nodes with at least 50 documents are shown. 
Each node shows three aggregated statistics at that node: the \ufb01ve most common author\nnames, the \ufb01ve most common words and a histogram over the years of proceedings.\nFor the stick-breaking processes, we used \u03b10\u223cUni(10, 50), \u03bb\u223cUni(0.05, 0.8), and \u03b3\u223cUni(1, 10).\nUsing Python on a single core of a modern workstation each MCMC iteration of the entire model\n(including slice sampled reassignment of all 50,000 images) requires approximately three minutes.\nFig. 3 represents a part of the tree with the best complete-data log likelihood after 4000 such iterations.\nThe tree provides a useful visualization of the data set, capturing broad variations in color at the\nhigher levels of the tree, with lower branches varying in texture and shape. A larger version of this\ntree is provided in the supplementary material.\n\n6 Hierarchical Modeling of Document Topics\n\nWe also used our approach in a bag-of-words topic model, applying it to 1740 papers from NIPS\n1\u201312 2. As in latent Dirichlet allocation (LDA) [25], we consider a topic to be a distribution over\nwords and each document to be described by a distribution over topics. In LDA, each document has a\nunique topic distribution. In our model, however, each document lives at a node and that node has a\nunique topic distribution. Thus multiple documents share a distribution over topics if they inhabit the\nsame node. Each node\u2019s topic distribution is from a chained Dirichlet-multinomial as described in\nSection 3. The topics each have symmetric Dirichlet priors over their word distributions. This results\nin a different kind of topic model than that provided by the nested Chinese restaurant process. In the\nnCRP, each node corresponds to a topic and documents are spread across in\ufb01nitely-long paths down\nthe tree. Each word is drawn from a distribution over depths that is given by a GEM distribution. 
In the nCRP, it is not the documents that have the hierarchy, but the topics.\nWe did two kinds of analyses. The first is a visualization as with the image data of the previous section, using all 1740 documents. The subtree in Fig. 4 shows the nodes that had at least fifty documents, along with the most common authors and words at that node. The normalized histogram in each box shows which of the twelve years are represented among the documents in that node. An expanded version of this tree is provided in the supplementary material.\n\n2http://cs.nyu.edu/~roweis/data.html\n\n(a) Improvement versus multinomial, by number of topics   (b) Best perplexity per word, by folds\n\nFigure 5: Results of predictive performance comparison between latent Dirichlet allocation (LDA) and tree-structured stick breaking (TSSB). a) Mean improvement in perplexity per word over Laplace-smoothed multinomial, as a function of topics (larger is better). The error bars show the standard deviation of the improvement across the ten folds. b) Best predictive perplexity per word for each fold (smaller is better). The numbers above the LDA and TSSB bars show how many topics were used to achieve this.\n\nSecondly, we quantitatively assessed the predictive performance of the model. We created ten random partitions of the NIPS corpus into 1200 training and 540 test documents. We then performed inference with different numbers of topics (10, 20, . . . , 100) and evaluated the predictive perplexity of the held-out data using an empirical likelihood estimate taken from a mixture of multinomials (pseudo-documents of infinite length, see, e.g., [26]) with 100,000 components. As Fig. 5a shows, our model improves in performance over standard LDA for smaller numbers of topics. This improvement appears to be due to the constraints on possible topic distributions that are imposed by the diffusion. For larger numbers of topics, however, it may be that these constraints become a hindrance and the model may be allocating predictive mass to regions where it is not warranted. In absolute terms, more topics did not appear to improve predictive performance for LDA or the tree-structured model. Both models performed best with fewer than fifty topics and the best tree model outperformed the best LDA model on all folds, as shown in Fig. 5b.\nThe MCMC inference procedure we used to train our model was as follows: first, we ran Gibbs sampling of a standard LDA topic model for 1000 iterations. We then burned in the tree inference for 500 iterations with fixed word-topic associations. We then allowed the word-topic associations to vary and burned in for an additional 500 iterations, before drawing 5000 samples from the full posterior. For the comparison, we burned in LDA for 1000 iterations and then drew 5000 samples from the posterior [27]. For both models we thinned the samples by a factor of 50. 
The mixing of the topic model seems to be somewhat sensitive to the initialization of the κ parameter in the chained Dirichlet-multinomial, and we initialized this parameter to be the same as the number of topics.

7 Discussion

We have presented a model for a distribution over random measures that also constructs a hierarchy, with the goal of providing a general-purpose prior on tree-structured data. Our approach is novel in that it combines infinite exchangeability with a representation that allows data to live at internal nodes of the tree, without a hazard rate process. We have developed a practical inference approach based on Markov chain Monte Carlo and demonstrated it on two real-world data sets in different domains.

The imposition of structure on the parameters of an infinite mixture model is an increasingly important topic. In this light, our notion of evolutionary diffusion down a tree sits within the larger class of models that construct dependencies between distributions on random measures [28, 29, 18].

Acknowledgements

The authors wish to thank Alex Krizhevsky for providing the image feature data. We also thank Kurt Miller, Iain Murray, Hanna Wallach, and Sinead Williamson for valuable discussions, and Yee Whye Teh for suggesting Gibbs moves based on size-biased permutation. RPA is a Junior Fellow of the Canadian Institute for Advanced Research.

References

[1] Christopher K. I. Williams. A MCMC approach to hierarchical mixture modelling. In Advances in Neural Information Processing Systems 12, pages 680–686, 2000.

[2] Radford M. Neal. Density modeling and clustering using Dirichlet diffusion trees.
In Bayesian Statistics 7, pages 619–629, 2003.

[3] Katherine A. Heller and Zoubin Ghahramani. Bayesian hierarchical clustering. In Proceedings of the 22nd International Conference on Machine Learning, 2005.

[4] Yee Whye Teh, Hal Daumé III, and Daniel Roy. Bayesian agglomerative clustering with coalescents. In Advances in Neural Information Processing Systems 20, 2007.

[5] Charles Kemp, Thomas L. Griffiths, Sean Stromsten, and Joshua B. Tenenbaum. Semi-supervised learning with trees. In Advances in Neural Information Processing Systems 16, 2004.

[6] Daniel M. Roy, Charles Kemp, Vikash K. Mansinghka, and Joshua B. Tenenbaum. Learning annotated hierarchies from relational data. In Advances in Neural Information Processing Systems 19, 2007.

[7] Hal Daumé III. Bayesian multitask learning with latent hierarchies. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence, 2009.

[8] Edward Meeds, David A. Ross, Richard S. Zemel, and Sam T. Roweis. Learning stick-figure models using nonparametric Bayesian priors over trees. In IEEE Conference on Computer Vision and Pattern Recognition, 2008.

[9] Evgeniy Bart, Ian Porteous, Pietro Perona, and Max Welling. Unsupervised learning of visual taxonomies. In IEEE Conference on Computer Vision and Pattern Recognition, 2008.

[10] Jayaram Sethuraman. A constructive definition of Dirichlet priors. Statistica Sinica, 4:639–650, 1994.

[11] Jim Pitman. Poisson–Dirichlet and GEM invariant distributions for split-and-merge transformation of an interval partition. Combinatorics, Probability and Computing, 11:501–514, 2002.

[12] Jim Pitman and Marc Yor. The two-parameter Poisson–Dirichlet distribution derived from a stable subordinator. The Annals of Probability, 25(2):855–900, 1997.

[13] R. Daniel Mauldin, William D. Sudderth, and S. C. Williams. Pólya trees and random distributions.
The Annals of Statistics, 20(3):1203–1221, September 1992.

[14] Hemant Ishwaran and Lancelot F. James. Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association, 96(453):161–173, March 2001.

[15] David Blackwell and James B. MacQueen. Ferguson distributions via Pólya urn schemes. Annals of Statistics, 1(2):353–355, 1973.

[16] Jim Pitman. Random discrete distributions invariant under size-biased permutation. Advances in Applied Probability, 28(2):525–539, 1996.

[17] David M. Blei, Thomas L. Griffiths, and Michael I. Jordan. The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. Journal of the ACM, 57(2):1–30, 2010.

[18] Yee Whye Teh, Michael I. Jordan, Matthew J. Beal, and David M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581, 2006.

[19] Radford M. Neal. Slice sampling (with discussion). The Annals of Statistics, 31(3):705–767, 2003.

[20] Stephen G. Walker. Sampling the Dirichlet mixture model with slices. Communications in Statistics, 36:45–54, 2007.

[21] Omiros Papaspiliopoulos and Gareth O. Roberts. Retrospective Markov chain Monte Carlo methods for Dirichlet process hierarchical models. Biometrika, 95(1):169–186, 2008.

[22] Antonio Torralba, Rob Fergus, and William T. Freeman. 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(11):1958–1970, 2008.

[23] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, Department of Computer Science, University of Toronto, 2009.

[24] Radford M. Neal. MCMC using Hamiltonian dynamics. In Handbook of Markov chain Monte Carlo. Chapman and Hall / CRC Press.

[25] David M. Blei, Andrew Y. Ng, and Michael I. Jordan.
Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.

[26] Hanna M. Wallach, Iain Murray, Ruslan Salakhutdinov, and David Mimno. Evaluation methods for topic models. In Proceedings of the 26th International Conference on Machine Learning, 2009.

[27] Tom L. Griffiths and Mark Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences of the United States of America, 101(Suppl. 1):5228–5235, 2004.

[28] Steven N. MacEachern. Dependent nonparametric processes. In Proceedings of the Section on Bayesian Statistical Science, 1999.

[29] Steven N. MacEachern, Athanasios Kottas, and Alan E. Gelfand. Spatial nonparametric Bayesian models. Technical Report 01-10, Institute of Statistics and Decision Sciences, Duke University, 2001.