{"title": "Dependent nonparametric trees for dynamic hierarchical clustering", "book": "Advances in Neural Information Processing Systems", "page_first": 1152, "page_last": 1160, "abstract": "Hierarchical clustering methods offer an intuitive and powerful way to model a wide variety of data sets. However, the assumption of a fixed hierarchy is often overly restrictive when working with data generated over a period of time: We expect both the structure of our hierarchy, and the parameters of the clusters, to evolve with time. In this paper, we present a distribution over collections of time-dependent, infinite-dimensional trees that can be used to model evolving hierarchies, and present an efficient and scalable algorithm for performing approximate inference in such a model. We demonstrate the efficacy of our model and inference algorithm on both synthetic data and real-world document corpora.", "full_text": "Dependent nonparametric trees for dynamic\n\nhierarchical clustering\n\nAvinava Dubey\u2217\u2020, Qirong Ho\u2217\u2021,\n\nSinead Williamson\u00a3, Eric P. Xing\u2020\n\n\u2020 Machine Learning Department, Carnegie Mellon University\n\n\u2021 Institute for Infocomm Research, A*STAR\n\n\u00a3 McCombs School of Business, University of Texas at Austin\n\nakdubey@cs.cmu.edu, hoqirong@gmail.com\n\nsinead.williamson@mccombs.utexas.edu, epxing@cs.cmu.edu\n\nAbstract\n\nHierarchical clustering methods offer an intuitive and powerful way to model a\nwide variety of data sets. However, the assumption of a \ufb01xed hierarchy is of-\nten overly restrictive when working with data generated over a period of time:\nWe expect both the structure of our hierarchy, and the parameters of the clus-\nters, to evolve with time. 
In this paper, we present a distribution over collections of time-dependent, infinite-dimensional trees that can be used to model evolving hierarchies, and present an efficient and scalable algorithm for performing approximate inference in such a model. We demonstrate the efficacy of our model and inference algorithm on both synthetic data and real-world document corpora.

1 Introduction

Hierarchically structured clustering models offer a natural representation for many forms of data. For example, we may wish to hierarchically cluster animals, where "dog" and "cat" are subcategories of "mammal", and "poodle" and "dachshund" are subcategories of "dog". When modeling scientific articles, articles about machine learning and programming languages may be subcategories under computer science. Representing clusters in a tree structure allows us to explicitly capture these relationships, and allows clusters that are closer in tree-distance to have more similar parameters.

Since hierarchical structures occur commonly, there exists a rich literature on statistical models for trees. We are interested in nonparametric distributions over trees – that is, distributions over trees with infinitely many leaves and infinitely many internal nodes. We can model any finite data set using a finite subset of such a tree, marginalizing over the infinitely many unoccupied branches. The advantage of such an approach is that we do not have to specify the tree dimensionality in advance, and can grow our representation in a consistent manner if we observe more data.

In many settings, our data points are associated with a point in time – for example the date when a photograph was taken or an article was written.
A stationary clustering model is inappropriate in such a context: The number of clusters may change over time; the relative popularities of clusters may vary; and the location of each cluster in parameter space may change. As an example, consider a topic model for scientific articles over the twentieth century. The field of computer science – and therefore topics related to it – did not exist in the first half of the century. The proportion of scientific articles devoted to genetics has likely increased over the century, and the terminology used in such articles has changed with the development of new sequencing technology.

Despite this, to the best of our knowledge, there are no nonparametric distributions over time-evolving trees in the literature. There exist a variety of distributions over stationary trees [1, 14, 5, 13, 10], and time-evolving non-hierarchical clustering models [16, 7, 11, 2, 4, 12] – but no models that combine time evolution and hierarchical structure. The reason for this is likely to be practical: Inference in trees is typically very computationally intensive, and adding temporal variation will, in general, increase the computational requirements. Designing such a model must, therefore, proceed hand in hand with developing efficient and scalable inference schemes.

(a) Infinite tree (b) Changing popularity (c) Cluster/topic drift

Figure 1: Our dependent tree-structured stick breaking process can model trees of arbitrary size and shape, and captures popularity and parameter changes through time.
a) Model any number of nodes (clusters, topics), of any branching factor, and up to any depth. b) Nodes can change in probability mass, or new nodes can be created. c) Node parameters can evolve over time.

In this paper, we define a distribution over temporally varying trees with infinitely many nodes that captures this form of variation, and describe how this model can cluster both real-valued observations and text data. Further, we propose a scalable approximate inference scheme that can be run in parallel, and demonstrate its efficacy on synthetic data where ground-truth clustering is available, as well as demonstrate qualitative and quantitative performance on three text corpora.

2 Background

The model proposed in this paper is a dependent nonparametric process with tree-structured marginals. A dependent nonparametric process [12] is a distribution over collections of random measures indexed by values in some covariate space, such that at each covariate value, the marginal distribution is given by some known nonparametric distribution. For example, a dependent Dirichlet process [12, 7, 11] is a distribution over collections of probability measures with Dirichlet process-distributed marginals; a dependent Pitman-Yor process [15] is a distribution over collections of probability measures with Pitman-Yor process-distributed marginals; a dependent Indian buffet process [17] is a distribution over collections of matrices with Indian buffet process-distributed marginals; etc. If our covariate space is time, such distributions can be used to construct nonparametric, time-varying models.

There are two main methods of inducing dependency: Allowing the sizes of the atoms composing the measure to vary across covariate space, and allowing the parameter values associated with the atoms to vary across covariate space.
In the context of a time-dependent topic model, these methods correspond to allowing the popularity of a topic to change over time, and allowing the words used to express a topic to change over time (topic drift). Our proposed model incorporates both forms of dependency. In the supplement, we discuss some specific dependent nonparametric models that share properties with our model.

The key difference between our proposed model and existing dependent nonparametric models is that ours has tree-distributed marginals. There are a number of options for the marginal distribution over trees, as we discuss in the supplement. We choose a distribution over infinite-dimensional trees known as the tree-structured stick breaking process [TSSBP, 1], described in Section 2.1.

2.1 The tree-structured stick-breaking process

The tree-structured stick-breaking process (TSSBP) is a distribution over trees with infinitely many leaves and infinitely many internal nodes. Each node ε within the tree is associated with a mass π_ε such that Σ_ε π_ε = 1, and each data point is assigned to a node in the tree according to p(z_n = ε) = π_ε, where z_n is the node assignment of the nth data point. The TSSBP is unique among the current toolbox of random infinite-dimensional trees in that data can be assigned to an internal node, rather than a leaf, of the tree. This property is often desirable; for example in a topic modeling context, a document could be assigned to a general topic such as "science" that lives toward the root of the tree, or to a more specific topic such as "genetics" that is a descendant of the science topic.

The TSSBP can be represented using two interleaving stick-breaking processes – one (parametrized by α) that determines the size of a node and another (parametrized by γ) that determines the branching probabilities.
Index the root node as node ∅ and let π_∅ be the mass assigned to it. Index its (countably infinite) child nodes as node 1, node 2, . . . and let π_1, π_2, . . . be the masses assigned to them; index the child nodes of node 1 as nodes 1·1, 1·2, . . . and let π_{1·1}, π_{1·2}, . . . be the masses assigned to nodes 1·1, 1·2, . . . ; etc. Then we can sample the infinite-dimensional tree as:

ν_ε ∼ Beta(1, α(|ε|)),  ψ_ε ∼ Beta(1, γ),  π_∅ = ν_∅,  φ_∅ = 1,
φ_{ε·i} = ψ_{ε·i} ∏_{j=1}^{i−1} (1 − ψ_{ε·j}),  π_ε = ν_ε φ_ε ∏_{ε'≺ε} (1 − ν_{ε'}) φ_{ε'},    (1)

where |ε| indicates the depth of node ε, and ε' ≺ ε indicates that ε' is an ancestor node of ε. We refer to the resulting infinite-dimensional weighted tree as Π = ((π_ε), (φ_{ε·i})).

3 Dependent tree-structured stick-breaking processes

We now describe a dependent tree-structured stick-breaking process where both atom sizes and their locations vary with time. We first describe a distribution over atom sizes, and then use this distribution over collections of trees as the basis for time-varying clustering models and topic models.

3.1 A distribution over time-varying trees

We start with the basic TSSBP model [1] (described in Section 2.1 and the left of Figure 1), and modify it so that the latent variables ν_ε, ψ_ε and π_ε are replaced with sequences ν_ε^{(t)}, ψ_ε^{(t)} and π_ε^{(t)} indexed by discrete time t ∈ T (the middle of Figure 1).
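Before adding time dependence, it is useful to see the stationary construction in Equation 1 in executable form. The sketch below is ours, not code from the paper; the truncation depth/width and all names are assumptions:

```python
import numpy as np

def tssbp_weights(alpha0=1.0, lam=0.5, gamma=0.5, depth=3, width=3, seed=0):
    """Sample node masses pi_eps for a finite truncation of the TSSBP.

    Nodes are tuples: () is the root, (0,), (1,), ... its children, etc.
    alpha(d) = lam**d * alpha0, as in the paper. Returns {node: pi_node}."""
    rng = np.random.default_rng(seed)
    pi = {}

    def recurse(node, reach):
        # `reach` is phi_eps * prod over ancestors eps' of (1 - nu_eps') phi_eps'
        nu = rng.beta(1.0, alpha0 * lam ** len(node))  # nu_eps ~ Beta(1, alpha(|eps|))
        pi[node] = nu * reach                          # mass that stops at this node
        if len(node) == depth:
            return
        stick = 1.0
        for i in range(width):
            psi = rng.beta(1.0, gamma)                 # psi_{eps.i} ~ Beta(1, gamma)
            branch = psi * stick                       # phi_{eps.i}
            stick *= 1.0 - psi
            recurse(node + (i,), (1.0 - nu) * reach * branch)

    recurse((), 1.0)                                   # pi_root = nu_root (phi_root = 1)
    return pi

weights = tssbp_weights()
total = sum(weights.values())
assert all(w > 0 for w in weights.values()) and total < 1.0
```

The truncated masses sum to less than one; the remainder is the mass the infinite tree assigns to branches the truncation never visits.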
The forms of ν_ε^{(t)} and ψ_ε^{(t)} are chosen so that the marginal distribution over the π_ε^{(t)} is as described in Equation 1.

Let N^{(t)} be the number of observations at time t, and let z_n^{(t)} be the node allocation of the nth observation at time t. For each node ε at time t, let X_ε^{(t)} = Σ_{n=1}^{N_t} I(z_n^{(t)} = ε) be the number of observations assigned to node ε at time t, and Y_ε^{(t)} = Σ_{n=1}^{N_t} I(ε ≺ z_n^{(t)}) be the number of observations assigned to descendants of node ε. Introduce a "window" parameter h ∈ N. We can then define a prior predictive distribution over the tree at time t, as

ν_ε^{(t)} ∼ Beta(1 + Σ_{t'=t−h}^{t−1} X_ε^{(t')}, α(|ε|) + Σ_{t'=t−h}^{t−1} Y_ε^{(t')}),
ψ_{ε·i}^{(t)} ∼ Beta(1 + Σ_{t'=t−h}^{t−1} (X_{ε·i}^{(t')} + Y_{ε·i}^{(t')}), γ + Σ_{t'=t−h}^{t−1} Σ_{j>i} (X_{ε·j}^{(t')} + Y_{ε·j}^{(t')})).    (2)

Following [1], we let α(d) = λ^d α_0, for α_0 > 0 and λ ∈ (0, 1). This defines a sequence of trees (Π^{(t)} = ((π_ε^{(t)}), (φ_{ε·i}^{(t)})), t ∈ T).

Intuitively, the prior distribution over a tree at time t is given by the posterior distribution of the (stationary) TSSBP, conditioned on the observations in some window t − h, . . . , t − 1. The following theorem gives the equivalence of the dynamic TSSBP (dTSSBP) and the TSSBP.

Theorem 1. The marginal posterior distribution of the dTSSBP, at time t, follows a TSSBP.

The proof is a straightforward extension of that for the generalized Pólya urn dependent Dirichlet process [7] and is given in the supplementary material.
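The window-based prior in Equation 2 only needs, for each node, the counts X and Y accumulated over the window t−h, . . . , t−1. A minimal sketch of these sufficient statistics and the resulting Beta draw (our own illustration; representing nodes as tuples is a convention we chose):

```python
import numpy as np

def window_counts(assignments, node, t, h):
    """Sufficient statistics for Equation 2 at time t.

    `assignments` maps epoch t' -> list of node tuples z_n^{(t')}.
    X counts observations assigned to `node` itself over t-h, ..., t-1;
    Y counts observations assigned to strict descendants of `node`."""
    X = Y = 0
    for tp in range(max(0, t - h), t):
        for z in assignments.get(tp, []):
            if z == node:
                X += 1
            elif len(z) > len(node) and z[:len(node)] == node:
                Y += 1   # `node` is an ancestor of z
    return X, Y

def sample_nu(X, Y, alpha_d, rng=None):
    """Draw nu_eps^{(t)} ~ Beta(1 + X, alpha(|eps|) + Y)."""
    rng = rng or np.random.default_rng(0)
    return rng.beta(1.0 + X, alpha_d + Y)

# Hypothetical epochs; () is the root.
assignments = {0: [(), (0,), (0, 1)], 1: [(0,), (0,)]}
X, Y = window_counts(assignments, (), t=2, h=2)
assert (X, Y) == (1, 4)   # one point stopped at the root, four below it
```

Nodes with large recent counts receive a prior favouring large ν, which is what ties the tree at time t to its recent past.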
The above theorem implies that Equation 2 defines a dependent tree-structured stick-breaking process. We note that an alternative choice for inducing dependency would be to down-weight the contribution of observations from previous time-steps. For example, we could exponentially decay the contributions of observations from previous time-steps, inducing a similar form of dependency as that found in the recurrent Chinese restaurant process [2]. However, unlike the method described in Equation 2, such an approach would not yield stationary TSSBP-distributed marginals.

3.2 Dependent hierarchical clustering

The construction above gives a distribution over infinite-dimensional trees, which in turn have a probability distribution over their nodes. In order to use this distribution in a hierarchical Bayesian model for data, we must associate each node with a parameter value θ_ε^{(t)}. We let Θ^{(t)} denote the set of all parameters θ_ε^{(t)} associated with a tree Π^{(t)}. We wish to capture two properties: 1) Within a tree Π^{(t)}, nodes have similar values to their parents; and 2) Between trees Π^{(t)} and Π^{(t+1)}, corresponding parameters θ_ε^{(t)} and θ_ε^{(t+1)} have similar values. This form of variation is shown in the right of Figure 1. In this subsection, we present two models that exhibit these properties: One appropriate for real-valued data, and one appropriate for multinomial data.

3.2.1 A time-varying, tree-structured mixture of Gaussians

An infinite mixture of Gaussians is a flexible choice for density estimation and clustering real-valued observations. Here, we suggest a time-varying hierarchical clustering model that is similar to the generalized Gaussian model of [1].
The model assumes Gaussian-distributed data at each node, and allows the means of clusters to evolve in an auto-regressive model, as below:

θ_∅^{(t)} | θ_∅^{(t−1)} ∼ N(θ_∅^{(t−1)}, σ_0 σ_1^a I),  θ_{ε·i}^{(t)} | θ_ε^{(t)}, θ_{ε·i}^{(t−1)} ∼ N(m, s² I),    (3)

where s² = (1/(σ_0 σ_1^{|ε·i|}) + η²/(σ_0 σ_1^{|ε·i|+a}))^{−1}, m = s² · (θ_ε^{(t)}/(σ_0 σ_1^{|ε·i|}) + η θ_{ε·i}^{(t−1)}/(σ_0 σ_1^{|ε·i|+a})), σ_0 > 0, σ_1 ∈ (0, 1), η ∈ [0, 1), and a ≥ 1. Due to the self-conjugacy of the Gaussian distribution, this corresponds to a Markov network with factor potentials given by unnormalized Gaussian distributions: Up to a normalizing constant, the factor potential associated with the link between θ_ε^{(t−1)} and θ_ε^{(t)} is Gaussian with variance σ_0 σ_1^{|ε|+a}, and the factor potential associated with the link between θ_ε^{(t)} and θ_{ε·i}^{(t)} is Gaussian with variance σ_0 σ_1^{|ε·i|}.

For a single time point, this allows for fractal-like behavior, where the distance between child and parent decreases down the tree. This behavior, which is not used in the generalized Gaussian model of [1], makes it easier to identify the root node, and guarantees that the marginal distribution over the location of the leaf nodes has finite variance. The a parameter enforces the idea that the amount of variation between θ_ε^{(t)} and θ_{ε·i}^{(t)} is smaller than that between θ_ε^{(t)} and θ_ε^{(t+1)}, while η ensures the variance of node parameters remains finite across time.
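Because the Gaussian potentials are self-conjugate, the conditional mean m and variance s² in Equation 3 are a precision-weighted combination of the tree link (parent at time t) and the time link (same node at time t−1). A sketch under our reading of the formula (parameter values are arbitrary):

```python
import numpy as np

def node_posterior(theta_parent, theta_prev, depth, sigma0=1.0, sigma1=0.5,
                   eta=0.5, a=1):
    """Compute (m, s^2) for one node, combining two Gaussian potentials.

    Precisions add: 1/(sigma0*sigma1^depth) from the parent link and
    eta^2/(sigma0*sigma1^(depth+a)) from the previous-time link."""
    var_tree = sigma0 * sigma1 ** depth          # parent -> child variance
    var_time = sigma0 * sigma1 ** (depth + a)    # t-1 -> t variance
    s2 = 1.0 / (1.0 / var_tree + eta ** 2 / var_time)
    m = s2 * (np.asarray(theta_parent) / var_tree
              + eta * np.asarray(theta_prev) / var_time)
    return m, s2

# With eta = 0 the time link vanishes and the conditional reduces to the
# stationary tree prior centred on the parent.
m, s2 = node_posterior([1.0, 0.0], [0.0, 1.0], depth=2, eta=0.0)
assert np.allclose(m, [1.0, 0.0]) and np.isclose(s2, 0.25)
```

Because σ_1 < 1, the variances shrink geometrically with depth, which is what produces the fractal-like behavior described above.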
We chose spherical Gaussian distributions\nto ensure that structural variation is captured by the tree rather than by node parameters.\n3.3 A time-varying model for hierarchically clustering documents\nGiven a dictionary of V words, a document can be represented using a V -dimensional term fre-\nquency vector, that corresponds to a location on the surface of the (V \u2212 1)-dimensional unit sphere.\nThe von Mises-Fisher distribution, with mean direction \u00b5 and concentration parameter \u03c4, provides\na distribution on this space. A mixture of von Mises-Fisher distributions can, therefore, be used to\ncluster documents [3, 8]. Following the terminology of topic modeling [6], the mean direction \u00b5k\nassociated with the kth cluster can be interpreted as the topic associated with that cluster.\nWe construct a time-dependent hierarchical clustering model appropriate for documents by associ-\nating nodes of our dependent nonparametric tree with topics. Let x(t)\nn be the vector associated with\nthe nth document at time t. 
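The mapping from a document's term-frequency vector to the unit sphere, which the vMF likelihood assumes, is a simple normalization (a sketch of standard practice, not code from the paper):

```python
import numpy as np

def to_unit_sphere(term_counts):
    """Map a term-frequency vector onto the surface of the unit sphere,
    the support of the von Mises-Fisher distribution."""
    v = np.asarray(term_counts, dtype=float)
    norm = np.linalg.norm(v)
    if norm == 0.0:
        raise ValueError("empty document")
    return v / norm

x = to_unit_sphere([3, 0, 4])
assert np.isclose(np.linalg.norm(x), 1.0)
assert np.allclose(x, [0.6, 0.0, 0.8])
```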
We assign a mean parameter θ_ε^{(t)} to each node ε in each tree Π^{(t)} as

θ_∅^{(t)} | θ_∅^{(t−1)} ∼ vMF(τ_∅^{(t)}, ρ_∅^{(t)}),  θ_{ε·i}^{(t)} | θ_ε^{(t)}, θ_{ε·i}^{(t−1)} ∼ vMF(τ_{ε·i}^{(t)}, ρ_{ε·i}^{(t)}),    (4)

where ρ_∅^{(t)} = κ_0 √(1 + κ_1^{2a} + 2κ_1^a (θ_{−1}^{(t)} · θ_∅^{(t−1)})), τ_∅^{(t)} = (κ_0 θ_{−1}^{(t)} + κ_0 κ_1^a θ_∅^{(t−1)}) / ρ_∅^{(t)}, ρ_{ε·i}^{(t)} = κ_0 κ_1^{|ε·i|} √(1 + κ_1^{2a} + 2κ_1^a (θ_ε^{(t)} · θ_{ε·i}^{(t−1)})), τ_{ε·i}^{(t)} = (κ_0 κ_1^{|ε·i|} θ_ε^{(t)} + κ_0 κ_1^{|ε·i|+a} θ_{ε·i}^{(t−1)}) / ρ_{ε·i}^{(t)}, κ_0 > 0, κ_1 > 1, and θ_{−1}^{(t)} is a probability vector of the same dimension as the θ_ε^{(t)} that can be interpreted as the parent of the root node at time t.¹ This yields similar dependency behavior to that described in Section 3.2.1. Conditioned on Π^{(t)} and Θ^{(t)} = (θ_ε^{(t)}), we sample each document x_n^{(t)} according to z_n^{(t)} ∼ Discrete(Π^{(t)}) and x_n^{(t)} ∼ vMF(θ_{z_n^{(t)}}^{(t)}, β). This is a hierarchical extension of the temporal vMF mixture proposed by [8].

4 Online Learning

In many time-evolving applications, we observe data points in an online setting. We are typically interested in obtaining predictions for future data points, or characterizing the clustering structure of current data, rather than improving predictive performance on historic data.
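The τ and ρ updates in Equation 4 follow from multiplying two vMF potentials: the natural parameters (concentration times mean direction) add, the new concentration is the norm of the sum, and the new mean direction is the normalized sum. A sketch under that reading (all names and parameter values are ours):

```python
import numpy as np

def combine_vmf(theta_parent, theta_prev, depth, kappa0=10.0, kappa1=2.0, a=1):
    """Mean direction tau and concentration rho for a node at time t.

    Combines vMF(.; theta_parent, kappa0*kappa1^depth) from the tree link
    with vMF(.; theta_prev, kappa0*kappa1^(depth+a)) from the time link."""
    k_tree = kappa0 * kappa1 ** depth
    k_time = kappa0 * kappa1 ** (depth + a)
    natural = k_tree * np.asarray(theta_parent) + k_time * np.asarray(theta_prev)
    rho = np.linalg.norm(natural)    # resulting concentration
    tau = natural / rho              # resulting (unit-norm) mean direction
    return tau, rho

tau, rho = combine_vmf([1.0, 0.0], [0.0, 1.0], depth=1)
assert np.isclose(np.linalg.norm(tau), 1.0)
```

When the parent direction and the previous-time direction agree, their dot product is large and ρ grows, i.e. the node's new mean is tightly concentrated.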
We therefore propose a sequential online learning algorithm, where at each time t we infer the parameter settings for the tree Π^{(t)} conditioned on the previous trees, which we do not re-learn. This allows us to focus our computational efforts on the most recent (and likely relevant) data. This has the added advantage of reducing the computational demands of the algorithm, as we do not incorporate a backwards pass through the data, and are only ever considering a fraction of the data at a time.

In developing an inference scheme, there is always a trade-off between estimate quality and computational requirements. MCMC samplers are often the "gold standard" of inference techniques, because they have the true posterior distribution as the stationary distribution of their Markov chain. However, they can be very slow, particularly in complex models. Estimating the parameter setting that maximizes the data likelihood is much cheaper, but cannot capture the full posterior.

¹In our experiments, we set θ_{−1}^{(t)} to be the average over all data points at time t. This ensures that the root node is close to the centroid of the data, rather than the periphery.

In order to develop an inference algorithm that is parallelizable, runs in reasonable time, but still obtains good predictive performance, we combine Gibbs sampling steps for learning the tree parameters (Π^{(t)}) and the topic indicators (z_n^{(t)}) with a MAP method for estimating the location parameters (θ_ε^{(t)}). The resulting algorithm has the following desirable properties:

1. The priors for ν_ε^{(t)}, ψ_ε^{(t)} only depend on {X_ε^{(0)}, Y_ε^{(0)}} . . . {X_ε^{(t−1)}, Y_ε^{(t−1)}}, whose sufficient statistics can be updated in amortized constant time.
2. The posteriors for ν_ε^{(t)}, ψ_ε^{(t)} are conditionally independent given {z_n^{(1)}} . . . {z_n^{(t)}}. Hence we can Gibbs sample ν_ε^{(t)}, ψ_ε^{(t)} in parallel given the cluster assignments {z_n^{(1)}} . . . {z_n^{(t)}} (or more precisely, their sufficient statistics {X_ε, Y_ε}). Similarly, we can Gibbs sample the cluster/topic assignments {z_n^{(t)}} in parallel given the parameters {ν_ε^{(t)}, ψ_ε^{(t)}, θ_ε^{(t)}} and the data, as well as infer the MAP estimate of {θ_ε^{(t)}} in parallel given the data and the cluster/topic assignments. Because of the online assumption, we do not consider evidence from times u > t.

Sampling ν_ε^{(t)}, ψ_ε^{(t)}: Due to the conjugacy between the beta and binomial distributions, we can easily Gibbs sample the stick-breaking parameters

ν_ε^{(t)} | X_ε, Y_ε ∼ Beta(1 + Σ_{t'=t−h}^{t} X_ε^{(t')}, α(|ε|) + Σ_{t'=t−h}^{t} Y_ε^{(t')}),
ψ_{ε·i}^{(t)} | X_{ε·i}, Y_{ε·i} ∼ Beta(1 + Σ_{t'=t−h}^{t} (X_{ε·i}^{(t')} + Y_{ε·i}^{(t')}), γ + Σ_{t'=t−h}^{t} Σ_{j>i} (X_{ε·j}^{(t')} + Y_{ε·j}^{(t')})).

The ν_ε^{(t)} and ψ_ε^{(t)} distributions for each node are conditionally independent given the counts X, Y, and so the sampler can be parallelized. We only explicitly store π_ε^{(t)}, ν_ε^{(t)}, ψ_ε^{(t)}, φ_ε^{(t)}, θ_ε^{(t)} for nodes ε with nonzero counts, i.e. Σ_{t'=t−h}^{t} X_ε^{(t')} + Y_ε^{(t')} > 0.

Sampling z_n^{(t)}: Conditioned on the ν_ε^{(t)} and ψ_ε^{(t)}, the distribution over the cluster assignments z_n^{(t)} is just given by the TSSBP. We therefore use the slice sampling method described in [1] to Gibbs sample z_n^{(t)} given {ν_ε^{(t)}}, {ψ_ε^{(t)}}, x_n^{(t)}, θ. Since the cluster assignments are conditionally independent given the tree, this step can be performed in parallel.

Learning θ: It is possible to Gibbs sample the cluster parameters θ; however, in the document clustering case described in Section 3.3, this requires far more time than sampling all other parameters. To improve the speed of our algorithm, we instead use maximum a posteriori (MAP) estimates for θ, obtained using a parallel coordinate ascent algorithm. Notably, conditioned on the trees at time t − 1 and t + 1, the θ_ε^{(t)} for odd-numbered tree depths |ε| are conditionally independent given the θ_{ε'}^{(t)} at even-numbered tree depths |ε'|, and vice versa. Hence, our algorithm alternates between parallel optimization of odd-depth θ_ε^{(t)}, and parallel optimization of even-depth θ_ε^{(t)}.

In general, the conditional distribution of a cluster parameter θ_ε^{(t)} depends on the values of its predecessor θ_ε^{(t−1)}, its successor θ_ε^{(t+1)}, its parent at time t, and its children at time t. In some cases, not all of these values will be available – for example if a node was unoccupied at previous time steps.
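The odd/even-depth alternation used for the parallel MAP updates can be made concrete with a small helper (our own sketch; representing nodes as depth-indexed tuples is an assumption). Nodes in one parity class are conditionally independent given the other class:

```python
def depth_parity_schedule(nodes):
    """Split tree nodes (tuples; the root is ()) into even- and odd-depth
    groups. Each group can be MAP-updated in parallel while the other
    group, and the temporal neighbours, are held fixed."""
    even = [n for n in nodes if len(n) % 2 == 0]
    odd = [n for n in nodes if len(n) % 2 == 1]
    return even, odd

nodes = [(), (0,), (1,), (0, 0), (0, 1)]
even, odd = depth_parity_schedule(nodes)
assert even == [(), (0, 0), (0, 1)] and odd == [(0,), (1,)]
```

When parts of a node's Markov blanket are missing, for example because the node was unoccupied at earlier times, the update must be approximated, as the text discusses next.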
In this case, the distribution now depends on the full history of the parent node. For computational reasons, and because we do not wish to store the full history, we approximate the distribution as being dependent only on observed members of the node's Markov blanket.

5 Experimental evaluation

We evaluate the performance of our model on both synthetic and real-world data sets. Evaluation on synthetic data sets allows us to verify that our inference algorithm allows us to recover the "true" evolving hierarchical structure underlying our data. Evaluation on real-world data allows us to evaluate whether our modeling assumptions are useful in practice.

5.1 Synthetic data

We manually created a time-evolving tree, as shown in Figure 2, with Gaussian-distributed data at each node. This synthetic time-evolving tree features temporal variation in node probabilities, temporal variation in node parameters, and addition and deletion of nodes. Using the Gaussian model described in Equation 3, we inferred the structure of the tree at each time period as described in Section 4. Figure 3 shows the recovered tree structure, demonstrating the ability of our inference algorithm to recover the expected evolving hierarchical structure. Note that it accurately captures evolution in node probabilities and location, and the addition and deletion of new nodes.

Figure 2: Ground truth tree, evolving over three time steps

Figure 3: Recovered tree structure, over three consecutive time periods.
Each color indicates a node in the tree and each arrow indicates a branch connecting parent to child; nodes are consistently colored across time.

Model (depth)    TWITTER       SOU           PNAS
dTSSBP (4)       522 ± 4.35    2708 ± 32.0   4562 ± 116
dTSSBP (3)       249 ± 0.98    1320 ± 33.6   3217 ± 195
o-TSSBP (4)      414 ± 3.31    1455 ± 44.5   2672 ± 357
o-TSSBP (3)      199 ± 2.19    583 ± 16.4    1163 ± 196
T-TSSBP (4)      335 ± 54.8    1687 ± 329    4333 ± 647
T-TSSBP (3)      182 ± 24.1    1089 ± 143    2962 ± 685
dDP              204 ± 8.82    834 ± 51.2    2374 ± 51.7
o-DP             136 ± 0.42    633 ± 18.8    1061 ± 10.5
T-DP             112 ± 10.9    890 ± 70.5    2174 ± 134

Table 1: Test set average log-likelihood on three datasets. Depth limits are shown in parentheses.

5.2 Real-world data

In Section 3.3, we described how the dependent TSSBP can be combined with a von Mises-Fisher likelihood to cluster documents. To evaluate this model, we looked at three corpora:

• TWITTER: 673,102 tweets containing hashtags relevant to the NFL, collected over 18 weeks in 2011 and containing 2,636 unique words (after stopwording). We grouped the tweets into 9 two-week epochs.
• PNAS: 79,800 paper titles from the Proceedings of the National Academy of Sciences between 1915 and 2005, containing 36,901 unique words (after stopwording). We grouped the titles into 10 ten-year epochs.
• STATE OF THE UNION (SOU): Presidential SoU addresses from 1790 through 2002, containing 56,352 sentences and 21,505 unique words (after stopwording). We grouped the sentences into 21 ten-year epochs.

In each case, documents were represented using their vector of term frequencies.

Our hypothesis is that the topical structure of language is hierarchically structured and time-evolving, and that a model that captures these properties will achieve better performance than models that ignore hierarchical structure and/or temporal evolution. To test these hypotheses, we compare our dependent tree-structured stick-breaking process (dTSSBP) against several online nonparametric models for document clustering:

1. Multiple tree-structured stick-breaking processes (T-TSSBP): We modeled the entire corpus using the stationary TSSBP model, with each node modeled using an independent von Mises-Fisher distribution. Each time period is modeled with a separate tree, using a similar implementation to our time-dependent TSSBP.
2. "Online" tree-structured stick-breaking process (o-TSSBP): This simulates online learning of a single, stationary tree over the entire corpus. We used our dTSSBP implementation with an infinite window h = ∞, and once a node is created at time t, we prevent its vMF mean θ_ε^{(t)} from changing in future time points.
3. Dependent Dirichlet process (dDP): We modeled the entire corpus using an h-order Markov generalized Pólya urn DDP [7]. This model was implemented by modifying our dTSSBP code to have a single level. Node parameters were evolved as θ_k^{(t+1)} ∼ vMF(θ_k^{(t)}, ξ).
4. Multiple Dirichlet processes (T-DP): We modeled the entire corpus using DP mixtures of von Mises-Fisher distributions, one DP per time period. Each node was modeled using an independent von Mises-Fisher distribution. We used our own implementation.

Figure 4: PNAS dataset: Birth, growth, and death of tree-structured topics in our dTSSBP model.
This illustration captures some trends in American scientific research throughout the 20th century, by focusing on the evolution of parent and child topics in two major scientific areas: Chemistry and Immunology (the rest of the tree has been omitted for clarity). At each epoch, we show the number of documents assigned to each topic, as well as its most popular words (according to the vMF mean θ).

5. "Online" Dirichlet process (o-DP): This simulates online learning of a single DP over the entire corpus. We used our dDP implementation with an infinite window h = ∞, and once a cluster is instantiated at time t, we prevent its vMF mean θ^{(t)} from changing in future time points.

Evaluation scheme: We divide each dataset into two parts: the first 50% and the last 50% of time points. We use the first 50% to tune model parameters and select a good random restart (by training on 90% and testing on 10% of the data at each time point), and then use the last 50% to evaluate the performance of the best parameters/restart (again, by training on 90% and testing on 10% of the data). When training the 3 TSSBP-based models, we grid-searched κ_0 ∈ {1, 10, 100, 1000, 10000}, and fixed κ_1 = 1, a = 0 for simplicity. Each value of κ_0 was run 5 times to get different random restarts, and we took the best κ_0-restart pair for evaluation on the last 50% of time points. For the 3 DP-based models, there is no κ_0 parameter, so we simply took 5 random restarts and used the best one for evaluation. For all TSSBP- and DP-based models, we repeated the evaluation phase 5 times to get error bars. Every dTSSBP trial completed in under 20 minutes on a single processor core, while we observed moderate (though not perfectly linear) speedups with 2-4 processors.

Parameter settings: For all models, we estimated each node/cluster's vMF concentration parameter β from the data.
For the TSSBP-based models, we used stick-breaking parameters γ = 0.5 and α(d) = 0.5^d, and set θ_{−1}^{(t)} to the average document term frequency vector at time t. In order to keep running times reasonable, we limit the TSSBP-based models to a maximum depth of either 3 or 4 (we report results for both).² For the DP-based models, we used a Dirichlet process concentration parameter of 1. The dDP's inter-epoch vMF concentration parameter was set to ξ = 0.001.

Results: Table 1 shows the average log (unnormalized) likelihoods on the test sets (from the last 50% of time points). The tree-based models uniformly out-perform the non-hierarchical models, while the max-depth-4 tree models outperform the max-depth-3 ones. On all 3 datasets, the max-depth-4 dTSSBP uniformly outperforms all other models, confirming our initial hypothesis.

5.3 Qualitative results

In addition to high-quality quantitative results, we find that the time-dependent tree model gives good qualitative performance. Figure 4 shows two time-evolving sub-trees obtained from the PNAS data set.
The top level shows a sub-tree concerned with Chemistry; the bottom level shows a sub-tree concerned with Immunology. Our dynamic tree model discovers closely-related topics and groups them under a sub-tree, and creates, grows and destroys individual sub-topics as needed to fit the data. For instance, our model captures the sudden surge in Immunology-related research from 1975–1984, which happened right after the structure of the antibody molecule was identified a few years prior.

In the Chemistry topic, the study of mechanical properties of materials (pressure, acoustic properties, specific heat, etc.) is a constant presence throughout the century. The study of electrical properties of materials starts off with a topic (in purple) that seems devoted to Physical Chemistry. However, following the development of Quantum Mechanics in the 30s, this line of research became more closely aligned with Physics than Chemistry, and it disappears from the sub-tree. In its wake, we see the growth of a topic more concerned with electrolytes, solutions and salts, which remained within the sphere of Chemistry.

² One justification is that shallow hierarchies are easier to interpret than deep ones; see [5, 9].

[Figure 4 contents: per-epoch document counts and top vMF words for the Chemistry (1915–1974) and Immunology (1965–1994) sub-trees.]

Figure 5: State of the Union dataset: Birth, growth, and death of tree-structured topics in our dTSSBP model. This illustration captures some key events in American history. At each epoch, we show the number of documents assigned to each topic, as well as its most popular words (according to the vMF mean θ).

Figure 5 shows time-evolving sub-trees obtained from the State of the Union dataset. We see a sub-tree tracking the development of the Cold War. The parent node contains general terms relevant to the Cold War; starting from the 1970s, a child node (shown in purple) contains terms relevant to nuclear arms control, in light of the Strategic Arms Limitation Talks of that decade.
The same decade also sees the birth of a child node focused on Asia (shown in cyan), contemporaneous with President Richard Nixon's historic visit to China in 1972. In addition to the Cold War, we also see topics corresponding to events such as the Mexican War, the Civil War and the Indian Wars, demonstrating our model's ability to detect events in a timeline.

6 Discussion

In this paper, we have proposed a flexible nonparametric model for dynamically-evolving, hierarchically structured data. This model can be applied to multiple types of data using appropriate choices of likelihood; we present an application in document clustering that combines high-quality quantitative performance with intuitively interpretable results. One of the significant challenges in constructing nonparametric dependent tree models is the need for efficient inference algorithms. We make judicious use of approximations and combine MCMC and MAP approximation techniques to develop an inference algorithm that can be applied in an online setting, while being parallelizable.

Acknowledgements: This research was supported by NSF Big data IIS1447676, DARPA XDATA FA87501220324 and NIH GWAS R01GM087694.

[Figure 5 contents: per-epoch document counts and top vMF words for the Cold War (1960–2000), Mexican War (1840–1850), Civil War (1860–1870) and Indian Wars (1790–1840) sub-trees.]

References

[1] R. Adams, Z. Ghahramani, and M. Jordan. Tree-structured stick breaking for hierarchical data. In Advances in Neural Information Processing Systems, 2010.

[2] A. Ahmed and E. Xing. Dynamic non-parametric mixture models and the recurrent Chinese restaurant process: with applications to evolutionary clustering. In SDM, 2008.

[3] A. Banerjee, I. Dhillon, J. Ghosh, and S. Sra. Clustering on the unit hypersphere using von Mises-Fisher distributions. Journal of Machine Learning Research, 6:1345–1382, 2005.

[4] D. Blei and P. Frazier. Distance dependent Chinese restaurant processes. Journal of Machine Learning Research, 12:2461–2488, 2011.

[5] D. Blei, T. Griffiths, M. Jordan, and J. Tenenbaum. Hierarchical topic models and the nested Chinese restaurant process. In Advances in Neural Information Processing Systems, 2004.

[6] D. Blei, A. Ng, and M. Jordan.
Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.

[7] F. Caron, M. Davy, and A. Doucet. Generalized Polya urn for time-varying Dirichlet processes. In UAI, 2007.

[8] S. Gopal and Y. Yang. Von Mises-Fisher clustering models. In International Conference on Machine Learning, 2014.

[9] Q. Ho, J. Eisenstein, and E. Xing. Document hierarchies from text and links. In Proceedings of the 21st International Conference on World Wide Web, pages 739–748. ACM, 2012.

[10] J. Kingman. On the genealogy of large populations. Journal of Applied Probability, 19:27–43, 1982.

[11] D. Lin, E. Grimson, and J. Fisher. Construction of dependent Dirichlet processes based on Poisson processes. In Advances in Neural Information Processing Systems, 2010.

[12] S. N. MacEachern. Dependent nonparametric processes. In Bayesian Statistical Science, 1999.

[13] R. M. Neal. Density modeling and clustering using Dirichlet diffusion trees. Bayesian Statistics, 7:619–629, 2003.

[14] A. Rodriguez, D. Dunson, and A. Gelfand. The nested Dirichlet process. Journal of the American Statistical Association, 103(483), 2008.

[15] E. Sudderth and M. Jordan. Shared segmentation of natural scenes using dependent Pitman-Yor processes. In Advances in Neural Information Processing Systems, 2008.

[16] X. Wang and A. McCallum. Topics over time: a non-Markov continuous-time model of topical trends. In Knowledge Discovery and Data Mining, 2006.

[17] S. Williamson, P. Orbanz, and Z. Ghahramani. Dependent Indian buffet processes.
In Artificial Intelligence and Statistics, 2010.