{"title": "Parallel Sampling of HDPs using Sub-Cluster Splits", "book": "Advances in Neural Information Processing Systems", "page_first": 235, "page_last": 243, "abstract": "We develop a sampling technique for Hierarchical Dirichlet process models. The parallel algorithm builds upon [Chang & Fisher 2013] by proposing large split and merge moves based on learned sub-clusters. The additional global split and merge moves drastically improve convergence in the experimental results. Furthermore, we discover that cross-validation techniques do not adequately determine convergence, and that previous sampling methods converge slower than were previously expected.", "full_text": "Parallel Sampling of HDPs using Sub-Cluster Splits\n\nJason Chang\nCSAIL, MIT\n\njchang7@csail.mit.edu\n\nJohn W. Fisher III\n\nCSAIL, MIT\n\nfisher@csail.mit.edu\n\nAbstract\n\nWe develop a sampling technique for Hierarchical Dirichlet process models. The\nparallel algorithm builds upon [1] by proposing large split and merge moves based\non learned sub-clusters. The additional global split and merge moves drastically\nimprove convergence in the experimental results. Furthermore, we discover that\ncross-validation techniques do not adequately determine convergence, and that\nprevious sampling methods converge slower than were previously expected.\n\n1\n\nIntroduction\n\nHierarchical Dirichlet Process (HDP) mixture models were \ufb01rst introduced by Teh et al. [2]. HDPs\nextend the Dirichlet Process (DP) to model groups of data with shared cluster statistics. Since\ntheir inception, HDPs and related models have been used in many statistical problems, including\ndocument analysis [2], object categorization [3], and as a prior for hidden Markov models [4].\nThe success of HDPs has garnered much interest in inference algorithms. Variational techniques\n[5, 6] are often used for their parallelization and speed, but lack the limiting guarantees of Markov\nchain Monte Carlo (MCMC) methods. 
Unfortunately, MCMC algorithms tend to converge slowly. In this work, we extend the recent DP Sub-Cluster algorithm [1] to HDPs to accelerate convergence by inferring "sub-clusters" in parallel and using them to propose large split moves.
Extensions to the HDP are complicated by the additional DP, which violates conjugacy assumptions used in [1]. Furthermore, split/merge moves require computing the joint model likelihood, which, prior to this work, was unknown in the common Direct Assignment HDP representation [2]. We discover that significant overlap in cluster distributions necessitates new global split/merge moves that change all clusters simultaneously. Our experiments on synthetic and real-world data validate the improved convergence of the proposed method. Additionally, our analysis of joint summary statistics suggests that other MCMC methods may converge prematurely in finite time.

2 Related Work

The seminal work of [2] introduced the Chinese Restaurant Franchise (CRF) and the Direct Assignment (DA) sampling algorithms for the HDP. Since then, many alternatives have been developed. Because HDP inference often extends methods from DPs, we briefly discuss relevant work on both models that focuses on convergence and scalability. Current methods are summarized in Table 1.
Simple Gibbs sampling methods, such as CRF or DA, may converge slowly in complex models. Works such as [11, 12, 13, 14] address this issue in DPs with split/merge moves. Wang and Blei [7] developed the only split/merge MCMC method for HDPs by extending the Sequentially Allocated Merge-Split (SAMS) algorithm for DPs developed in [13]. Unfortunately, reported results in [7] only show a marginal improvement over Gibbs sampling. Our experiments suggest that this is likely due to properties of the specific sampler, and that a different formulation significantly improves convergence.
Additionally, SAMS cannot be parallelized, and is therefore only tested on a corpus with 263K words. By designing a parallel algorithm, we test on a corpus of 100M words.

Table 1: Capabilities of MCMC Sampling Algorithms for HDPs

                       CRF [2]  DA [2]  SAMS [7]  FSD [4]  Hog-Wild [8]  Super-Cluster [9]  Proposed
Infinite Model            ✓        ✓        ✓        ·          ✓              ✓               ✓
MCMC Guarantees           ✓        ✓        ✓        ✓          ·              ✓               ✓
Non-Conjugate Priors      ∗        ∗        ·        ✓          ·              ∗               ✓
Parallelizable            ·        ·        ·        ✓          ✓              ✓               ✓
Local Splits/Merges       ·        ·        ✓        ·          ·              ·               ✓
Global Splits/Merges      ·        ·        ·        ·          ·              ·               ✓

∗ potentially possible with some adaptation of the DP Metropolis-Hastings framework of [10].

There has also been work on parallel sampling algorithms for HDPs. Fox et al. [4] generalize the work of Ishwaran and Zarepour [15] by approximating the highest-level DP with a finite symmetric Dirichlet (FSD). Iterations of this approximation can be parallelized, but fixing the model order is undesirable since it no longer grows with the data. Furthermore, our experiments suggest that this algorithm exhibits poor convergence. Newman et al. [8] present an alternative parallel approximation related to Hog-Wild Gibbs sampling [16, 17]. Each processor independently runs a Gibbs sampler on its assigned data, followed by a resynchronization step across all processors. This approximation has been shown to perform well on cross-validation metrics, but loses the limiting guarantees of MCMC. Additionally, we will show that cross-validation metrics are not suitable for analyzing convergence.
An exact parallel algorithm for DPs and HDPs was recently developed by Williamson et al. [9] by grouping clusters into independent super-clusters.
Unfortunately, the parallelization does not scale well [18], and convergence is often impeded [1]. Regardless of exactness, all current parallel sampling algorithms exhibit poor convergence due to their local nature, while existing split/merge proposals are largely ineffective and cannot be parallelized.

2.1 DP Sub-Clusters Algorithm

The recent DP Sub-Cluster algorithm [1] addresses these issues by combining non-ergodic Markov chains into an ergodic chain and proposing splits from learned sub-clusters. We briefly review relevant aspects of the DP Sub-Cluster algorithm here. MCMC algorithms typically satisfy two conditions: detailed balance and ergodicity. Detailed balance ensures that the target distribution is a stationary distribution of the chain, while ergodicity guarantees uniqueness of the stationary distribution. The method of [1] combines a Gibbs sampler that is restricted to non-empty clusters with a Metropolis-Hastings (MH) algorithm that proposes splits and merges. Since any Gibbs or MH sampler satisfies detailed balance, the true posterior distribution is guaranteed to be a stationary distribution of the chain. Furthermore, the combination of the two samplers enforces ergodicity and guarantees convergence to the stationary distribution.
The DP Sub-Cluster algorithm also augments the model with auxiliary variables that learn a two-component mixture model for each cluster. These "sub-clusters" are subsequently used to propose splits that are learned over time instead of built in a single iteration like previous methods. In this paper, we extend these techniques to HDPs. As we will show, considerable work is needed to address the higher-level DP and the overlapping distributions that exist in topic modeling.

3 Hierarchical Dirichlet Processes

We begin with a brief review of the equivalent CRF and DA representations of the HDP [2] depicted in Figures 1a–1b.
Due to the prolific use of HDPs in topic modeling, we refer to the variables by their topic modeling names. β is the corpus-level, global topic proportions, θk is the parameter for topic k, and xji is the ith word in document j. Here, the CRF and DA representations depart. In the CRF, π̃j is drawn from a stick-breaking process [19], and each "customer" (i.e., word) is assigned to a "table" through tji ∼ Categorical(π̃j). The higher-level DP then assigns "dishes" (i.e., topics) to tables via kjt ∼ Categorical(β). The association of customers to dishes through the tables is equivalent to assigning a word to a topic. In the CRF, multiple tables can be assigned the same dish. The DA formulation combines these multiple instances and directly assigns a word to a topic with zji. The resulting document-specific topic proportions, πj, aggregate multiple π̃j values.

(a) HDP CRF Model  (b) HDP DA Model  (c) HDP Augmented DA Model
Figure 1: Graphical models. (c) Hyper-parameters are omitted and auxiliary variables are dotted.

Figure 2: Visualization of augmented sample space.

For reasons which will be discussed, inference in the DA formulation still relies on some aspects of the CRF. We adopt the notation of [2], where the number of tables in restaurant j serving dish k is denoted mjk, and the number of customers in restaurant j at table t eating dish k is njtk. Marginal counts are represented with dots, e.g., nj·· ≜ Σ_{t,k} njtk and mj· ≜ Σ_k mjk represent the number of customers and dishes in restaurant j, respectively. We refer the reader to [2] for additional details.

4 Restricted Parallel Sampling

We draw on the DP Sub-Cluster algorithm to combine a restricted, parallel Gibbs sampler with split/merge moves (as described in Section 2.1).
The former is detailed here, and the latter is developed in Section 5. Because the restricted Gibbs sampler cannot create new topics, dimensions of the infinite vectors β, π, and θ associated with empty clusters need not be instantiated. Extending the DA sampling algorithm of [2] results in the following restricted posterior distributions:

p(β|m) = Dir(m·1, . . . , m·K, γ),                                          (1)
p(πj|β, z) = Dir(αβ1 + nj·1, . . . , αβK + nj·K, αβK+1),                    (2)
p(θk|x, z) ∝ fx(xIk ; θk) fθ(θk; λ),                                        (3)
p(zji|x, πj, θ) ∝ Σ_{k=1}^{K} πjk fx(xji; θk) 1I[zji = k],                  (4)
p(mjk|β, z) = fm(mjk; αβk, nj·k) ≜ Γ(αβk)/Γ(αβk + nj·k) · s(nj·k, mjk) (αβk)^{mjk}.   (5)

Since p(β|π) is not known analytically, we use the auxiliary variable, mjk, as derived by [2, 20]. Here, s(n, m) denotes unsigned Stirling numbers of the first kind. We note that β and π are now (K + 1)–length vectors partitioning the space, where the last components, βK+1 and πj(K+1), aggregate the weight of all empty topics. Additionally, Ik ≜ {j, i : zji = k} denotes the set of indices in topic k, and fx and fθ denote the observation and prior distributions. We note that if fθ is conjugate to fx, Equation (3) stays in the same family of parametric distributions as fθ(θ; λ).
Equations (1–5), each of which can be sampled in parallel, fully specify the restricted Gibbs sampler. The astute reader may notice similarities with the FSD approximation used in [4]. The main differences are that the β distribution in Equation (1) is exact, and that sampling z in Equation (4) is explicitly restricted to non-empty clusters.
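To make the sweep concrete, the following is a minimal sketch of one pass over Equations (1)–(5) for a toy multinomial topic model. All function and variable names are ours, not the paper's; the table counts m_jk are drawn with the standard Bernoulli (Chinese-restaurant) construction, which we use here as a stand-in for the Stirling-number form f_m in Equation (5).

```python
import numpy as np

def sample_m(alpha_beta_k, n, rng):
    # Number of tables given concentration alpha*beta_k and n customers:
    # customer i starts a new table with probability a/(a+i).
    a = alpha_beta_k
    return int(sum(rng.random() < a / (a + i) for i in range(n)))

def restricted_gibbs_sweep(x, z, beta, theta, alpha, gamma, rng):
    """One sweep of Equations (1)-(5) for a toy multinomial HDP model.
    x[j]: word ids for document j; z[j]: current topic assignments;
    beta: (K+1)-vector of global proportions; theta: K x V topics."""
    D, (K, V) = len(x), theta.shape
    for j in range(D):
        n_jk = np.bincount(z[j], minlength=K).astype(float)
        # Eq. (2): document proportions (last entry aggregates empty topics).
        pi_j = rng.dirichlet(np.append(alpha * beta[:K] + n_jk, alpha * beta[K]))
        # Eq. (4): resample assignments, restricted to non-empty topics.
        p = pi_j[:K, None] * theta[:, x[j]]          # K x N_j
        p /= p.sum(axis=0)
        z[j] = np.array([rng.choice(K, p=p[:, i]) for i in range(len(x[j]))])
    # Eq. (5): auxiliary table counts m_jk.
    m = np.array([[sample_m(alpha * beta[k], int((z[j] == k).sum()), rng)
                   for k in range(K)] for j in range(D)])
    # Eq. (1): global proportions (tiny floor keeps Dirichlet params positive).
    beta = rng.dirichlet(np.append(m.sum(axis=0) + 1e-9, gamma))
    # Eq. (3): topic parameters under a conjugate Dirichlet(1) prior.
    for k in range(K):
        counts = sum(np.bincount(x[j][z[j] == k], minlength=V) for j in range(D))
        theta[k] = rng.dirichlet(counts + 1.0)
    return z, beta, theta
```

Each of the four blocks touches disjoint variables given the others, which is what makes the sweep parallelizable across documents and topics.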
Unlike [4], however, this sampler is guaranteed to converge to the true HDP model when combined with any split move (cf. Section 2.1).

5 Augmented Sub-Cluster Space for Splits and Merges

In this section we develop the augmented, sub-cluster model, which is aimed at finding a two-component mixture model containing a likely split of the data. As demonstrated in [1], these splits perform well in DPs because they improve at every iteration of the algorithm. Unfortunately, because these splits perform poorly in HDPs, we modify the formulation to propose more flexible moves.
For each topic, k, we fit two sub-topics, kℓ and kr, referred to as the "left" and "right" sub-topics. Each topic is augmented with auxiliary global sub-topic proportions, β̄k = {β̄kℓ, β̄kr}, document-level sub-topic proportions, π̄jk = {π̄jkℓ, π̄jkr}, and sub-topic parameters, θ̄k = {θ̄kℓ, θ̄kr}. Furthermore, a sub-topic assignment, z̄ji ∈ {ℓ, r}, is associated with each word, xji. The augmented space is summarized in Figure 1c and visualized in Figure 2. These auxiliary variables are denoted with the same symbol as their "regular-topic" counterparts (with an added bar) to allude to their similarities.
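One way to picture the augmentation: every regular topic carries a bar-ed copy of each model variable, plus a two-way responsibility for each of its words. A minimal container (the class and function names are ours, purely illustrative):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SubTopics:
    """Auxiliary 'bar-ed' variables attached to one regular topic k."""
    beta_bar: np.ndarray   # global sub-topic proportions {beta_kl, beta_kr}
    theta_bar: np.ndarray  # 2 x V sub-topic parameters {theta_kl, theta_kr}

def left_responsibility(pi_bar_jk, theta_bar, word):
    # Posterior probability that a word assigned to topic k belongs to
    # the 'left' sub-topic, given document-level sub-proportions pi_bar_jk.
    w = pi_bar_jk * theta_bar[:, word]
    return w[0] / w.sum()
```

The split proposals of Section 5.1 amount to promoting the two halves of such a container to full topics.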
Extending the work of [1], we adopt the following auxiliary generative and marginal posterior distributions:

Generative Distributions:
p(β̄k) = Dir(γ, γ),
p(π̄jk|β̄k) = Dir(αβ̄kℓ, αβ̄kr),
p(θ̄k) = ∏_{h∈{ℓ,r}} fθ(θ̄kh; λ),
p(z̄|π̄, θ̄, z, x) = ∏_{k=1}^{K} ∏_{j,i∈Ik} π̄jkz̄ji fx(xji; θ̄kz̄ji) / Z̄ji(π̄, θ̄, z, x),
where Z̄ji(π̄, θ̄, z, x) ≜ Σ_{h∈{ℓ,r}} π̄jzjih fx(xji; θ̄zjih).

Marginal Posterior Distributions:
p(β̄k|•) = Dir(γ + m̄·kℓ, γ + m̄·kr),                                        (6)
p(π̄jk|•) = Dir(αβ̄kℓ + nj·kℓ, αβ̄kr + nj·kr),                              (7)
p(θ̄kh|•) ∝ fx(xĪkh ; θ̄kh) fθ(θ̄kh; λ),                                    (8)
p(z̄ji|•) ∝ π̄jzjiz̄ji fx(xji; θ̄zjiz̄ji) / Σ_{h∈{ℓ,r}} π̄jzjih fx(xji; θ̄zjih),   (9)
p(m̄jkh|•) = fm(m̄jkh; αβ̄kh, nj·kh),                                        (10)

where • denotes all other variables. Full derivations are given in the supplement. Notice the similarity between these posterior distributions and Equations (1–5). Inference is performed by interleaving the sampling of Equations (1–5) with Equations (6–10). Furthermore, each step can be parallelized.

5.1 Sub-Topic Split/Merge Proposals

We adopt a Metropolis-Hastings (MH) [21] framework that proposes a split/merge from the sub-topics and either accepts or rejects it.
Denoting v ≜ {β, π, z, θ} and v̄ ≜ {β̄, π̄, z̄, θ̄} as the sets of regular and auxiliary variables, a sampled proposal, {v̂, v̂̄} ∼ q(v̂, v̂̄|v), is accepted with probability

Pr[{v, v̄} = {v̂, v̂̄}] = min[1, (p(x, v̂) p(v̂̄|x, v̂)) / (p(x, v) p(v̄|x, v)) · (q(v|x, v̂) q(v̄|x, v̂, v)) / (q(v̂|x, v) q(v̂̄|x, v, v̂))] = min[1, H].   (11)

H is known as the Hastings ratio. Algorithm 1 outlines a general split/merge MH framework, where steps 1–2 propose a sample from q(v̂|x, v) q(v̂̄|x, v, v̄, v̂). Sampling the variables other than ẑ is detailed here, after which we discuss three versions of Algorithm 1 with variants on sampling ẑ.

Algorithm 1 Split-Merge Framework
1. Propose assignments, ẑ, global proportions, β̂, document proportions, π̂, and parameters, θ̂.
2. Defer the proposal of auxiliary variables to the restricted sampling of Equations (1–10).
3. Accept/reject the proposal with the Hastings ratio.

(Step 1: β̂): In Metropolis-Hastings, convergence typically improves as the proposal distribution gets closer to the target distribution. Thus, it would be ideal to propose β̂ from p(β|ẑ). Unfortunately, p(β|z) cannot be expressed analytically without conditioning on the dish counts, m·k, as in Equation (1). Since the distribution of dish counts depends on β itself, we approximate its value with

m̃jk(z) ≜ argmax_m p(m|β = 1/K, z) = argmax_m Γ(1/K)/Γ(1/K + nj·k) · s(nj·k, m) (1/K)^m,   (12)

where the global topic proportions have essentially been substituted with 1/K. We note that the dependence on z is implied through the counts, n.
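Since the Γ ratio in Equation (12) does not depend on m, the mode can be found by maximizing log s(n, m) + m log(αβ) over m. A minimal sketch (our naming), computing the unsigned Stirling numbers of the first kind in log space via the recurrence s(i, m) = s(i−1, m−1) + (i−1) s(i−1, m):

```python
import math
import numpy as np

def mjk_mode(n, ab):
    """argmax_m of s(n, m) * ab**m, i.e. the mode used in Equation (12)
    when ab = 1/K. Works in log space to avoid overflow; O(n^2)."""
    if n == 0:
        return 0
    log_s = np.full(n + 1, -np.inf)
    log_s[1] = 0.0                      # s(1, 1) = 1
    for i in range(2, n + 1):
        prev = log_s.copy()
        for m in range(i, 0, -1):       # s(i, m) = s(i-1, m-1) + (i-1) s(i-1, m)
            log_s[m] = np.logaddexp(prev[m - 1], math.log(i - 1) + prev[m])
    scores = log_s[1:] + np.arange(1, n + 1) * math.log(ab)
    return int(np.argmax(scores)) + 1
```

As expected, the mode grows with the concentration ab: a small weight 1/K yields few "tables" even for many words, while a large weight pushes the mode toward n.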
We then propose global topic proportions from

β̂ ∼ q(β̂|ẑ) = p(β̂|m̃(ẑ)) = Dir(m̃·1(ẑ), · · · , m̃·K(ẑ), γ).   (13)

We will denote m̃jk ≜ m̃jk(z) and m̃̂jk ≜ m̃jk(ẑ). We emphasize that the approximate m̃̂jk is only used for a proposal distribution, and the resulting chain will still satisfy detailed balance.
(Step 1: π̂): Conditioned on β and z, the distribution of π is known to be Dirichlet. Thus, we propose π̂ ∼ p(π̂|β̂, ẑ) by sampling directly from the true posterior distribution of Equation (2).
(Step 1: θ̂): If fθ is conjugate to fx, we sample θ̂ directly from the posterior of Equation (3). For non-conjugate models, any proposal can be used while adjusting for it in the Hastings ratio.
(Step 2): We use the Deferred MH sampler developed in [1], which sets q(v̂̄|x, v̂) = p(v̂̄|x, v̂) by deferring the sampling of auxiliary variables to the restricted sampler of Section 5. Splits and merges are then only proposed for topics where auxiliary variables have already burned in. In practice burn-in is quite fast, and is determined by monitoring the sub-topic data likelihoods.
(Step 3): Finally, the above proposals result in the following Hastings ratio:

H = (p(β̂, ẑ) p(x|ẑ)) / (p(β, z) p(x|z)) · (q(z|v̂, v̂̄) q(β|z)) / (q(ẑ|v, v̄) q(β̂|ẑ)).   (14)

The data likelihood, p(x|z), is known analytically, and q(β|z) can be calculated according to Equation (13). The prior distribution, p(β, z), is expressed in the following proposition:
Proposition 5.1. Let z be a set of topic assignments with integer values in {1, . . . , K}.
Let β be a (K + 1)–length vector representing global topic weights, and βK+1 be the sum of weights associated with empty topics. The prior distribution, p(β, z), marginalizing over π, can be expressed as

p(β, z) = [ γ βK+1^{γ−1} ∏_{k=1}^{K} βk^{−1} ] × [ ∏_{j=1}^{D} Γ(α)/Γ(α + nj··) ∏_{k=1}^{K} Γ(αβk + nj·k)/Γ(αβk) ].   (15)

Proof. See supplemental material.
The remaining term in Equation (14), q(ẑ|v, v̄), is the probability of proposing a particular split. In the following sections, we describe three possible split constructions using the sub-clusters. Since the other steps remain the same, we only discuss the proposal distributions for ẑ and β̂.

5.1.1 Deterministic Split/Merge Proposals

The method of [1] constructs a split deterministically by copying the sub-cluster labels for a single cluster. We refer to this proposal as a local split, which only changes assignments within one topic, as opposed to a global split (discussed shortly), which changes all topic assignments. A local deterministic split will essentially be accepted only if the joint likelihood increases. Unfortunately, as we show in the supplement, samples from the typical set of an HDP do not have high likelihood. Deterministic split and merge proposals are, consequently, very rarely accepted.
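Equation (15) is straightforward to evaluate in log space, which is how it enters the Hastings ratios in practice. A minimal sketch (our naming; n_jk holds the count matrix n_{j·k}):

```python
import math
import numpy as np

def log_prior_beta_z(beta, n_jk, alpha, gamma):
    """log p(beta, z) from Proposition 5.1 / Equation (15).
    beta: length-(K+1) vector whose last entry aggregates empty topics;
    n_jk: D x K matrix of word counts per document and topic."""
    D, K = n_jk.shape
    # First bracket: gamma * beta_{K+1}^{gamma-1} * prod_k beta_k^{-1}.
    lp = math.log(gamma) + (gamma - 1.0) * math.log(beta[K])
    lp -= np.log(beta[:K]).sum()
    # Second bracket: per-document Gamma ratios.
    n_j = n_jk.sum(axis=1)
    lp += D * math.lgamma(alpha) - sum(math.lgamma(alpha + n) for n in n_j)
    ab = alpha * beta[:K]
    lp += sum(math.lgamma(ab[k] + n_jk[j, k]) - math.lgamma(ab[k])
              for j in range(D) for k in range(K))
    return lp
```

For D = 1, K = 1, and no observed words the Gamma ratios cancel and the expression reduces to γ βK+1^{γ−1}/β1, a handy sanity check.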
We now suggest two alternative pairs of split and merge proposals, each with their own benefits and drawbacks.

5.1.2 Local Split/Merge Proposals

Here, we depart from the approach of [1] by sampling a local split of topic a into topics b and c. Temporary parameters, {π̃b, π̃c, θ̃b, θ̃c}, and topic assignments, ẑ, are sampled according to

(π̃b, π̃c) = πa · (π̄aℓ, π̄ar),  (θ̃b, θ̃c) = (θ̄aℓ, θ̄ar)
  ⟹  q(ẑ|v, v̄) ∝ ∏_{j,i∈Ia} Σ_{k∈{b,c}} π̃k fx(xji; θ̃k) 1I[ẑji = k].   (16)

We note that a sample from q(ẑ|v, v̄) is already drawn from the restricted Gibbs sampler described in Equation (9). Therefore, no additional computation is needed to propose the split. If the split is rejected, ẑ is simply used as the next sample of the auxiliary z̄ for cluster a.
A β̂ is then drawn by splitting βa into β̂b and β̂c according to a local version of Equation (13):

q(β̂b, β̂c|ẑ, βa) = Dir(β̂b/βa, β̂c/βa; m̃̂·b, m̃̂·c).   (17)

The corresponding merge move combines topics b and c into topic a by deterministically performing

q(ẑji|v) = 1I[ẑji = a], ∀j, i ∈ Ib ∪ Ic,   q(β̂a|v) = δ(β̂a − (βb + βc)).   (18)

This results in the following Hastings ratio for a local split (derivation in supplement):

H = γ Γ(m̃̂·b) Γ(m̃̂·c)/Γ(m̃̂·b + m̃̂·c) · βa^{m̃̂·b+m̃̂·c} / (β̂b^{m̃̂·b} β̂c^{m̃̂·c}) · p(x|ẑ)/p(x|z) · 1/q(ẑ|v, v̄) · Q^M_{K+1}/Q^S_K · ∏_j Γ(αβa)/Γ(αβa + nj·a) ∏_{k∈{b,c}} Γ(αβ̂k + n̂j·k)/Γ(αβ̂k),   (19)

where Q^S_K and Q^M_K are the probabilities of selecting a specific split or merge with K topics. We record q(ẑ|v, v̄) when sampling from Equation (9), and all other terms are computed via sufficient statistics. We set Q^S_K = 1 by proposing all splits at each iteration. Q^M_K will be discussed shortly.
The Hastings ratio for a merge is essentially the reciprocal of Equation (19). However, the reverse split move, q(z|v̂, v̂̄), relies on the inferred sub-topic parameters, which are not readily available due to the Deferred MH algorithm. Instead, we approximate the Hastings ratio by substituting the two original topic parameters, θb and θc, for the proposed sub-topics. The quality of this approximation rests on the similarity between the regular-topics and the sub-topics. Generating the reverse move that splits topic a into b and c can then be approximated as

q(z|v̂, v̂̄) ≈ ∏_{j,i∈Ib∪Ic} πzji fx(xji; θzji) / (πb fx(xji; θb) + πc fx(xji; θc)) = Lbb Lcc / (Lbc Lcb),   (20)
Lkk ≜ ∏_{j,i∈Ik} πk fx(xji; θk),   Lkl ≜ ∏_{j,i∈Ik} [πk fx(xji; θk) + πl fx(xji; θl)].   (21)

All of the terms in Equation (20) are already calculated in the restricted Gibbs steps. When aggregated correctly in the K × K matrix, L, the Hastings ratio for any proposed merge is evaluated in constant time. However, if topics b and c are merged into a, further merging a with another cluster cannot be efficiently computed without looping through the data. We therefore only propose ⌊K/2⌋ merges by generating a random permutation of the integers [1, K], and proposing to merge disjoint neighbors. For example, if the random permutation for K = 7 is {3 1 7 4 2 6 5}, we propose to merge topics 3 and 1, topics 7 and 4, and topics 2 and 6. This results in Q^M_K = 2⌊K/2⌋ / (K(K − 1)).

5.1.3 Global Split/Merge Proposals

In many applications where clusters have significant overlap (e.g., topic modeling), local splits may be too constrained since only points within a single topic change. We now develop global split and merge moves, which reassign the data in all topics. A global split first constructs temporary topic proportions, π̃, and parameters, θ̃, followed by proposing topic assignments for all words with:

(π̃b, π̃c) = πa · (π̄aℓ, π̄ar),  π̃k = πk, ∀k ≠ a,
(θ̃b, θ̃c) = (θ̄aℓ, θ̄ar),  θ̃k = θk, ∀k ≠ a
  ⟹  q(ẑ|v, v̄) = ∏_{j,i} π̃ẑji fx(xji; θ̃ẑji) / Σ_k π̃k fx(xji; θ̃k).   (22)

Similarly, the corresponding merge move is constructed according to

π̃a = πb + πc,  π̃k = πk, ∀k ≠ b, c,
θ̃a ∼ q(θ̃a|z, x),  θ̃k = θk, ∀k ≠ b, c
  ⟹  q(ẑ|v, v̄) = ∏_{j,i} π̃ẑji fx(xji; θ̃ẑji) / Σ_k π̃k fx(xji; θ̃k).   (23)

The proposal for θ̃a is written in a general form; if priors are conjugate, one should propose directly from the posterior. After Equations (22)–(23), β̂ is sampled via Equation (13).
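The disjoint-pair merge selection described in Section 5.1.2 can be sketched as follows (function names are ours; topics are 0-indexed here, whereas the paper indexes them 1..K):

```python
import random

def propose_merge_pairs(K, rng=random):
    """Disjoint merge candidates from a random permutation of topics
    0..K-1: neighbors (perm[0], perm[1]), (perm[2], perm[3]), ...,
    giving floor(K/2) proposals per iteration."""
    perm = list(range(K))
    rng.shuffle(perm)
    return [(perm[2 * i], perm[2 * i + 1]) for i in range(K // 2)]

def q_merge(K):
    # Probability that a *specific* unordered pair appears among the
    # proposals: Q^M_K = floor(K/2) / C(K,2) = 2*floor(K/2) / (K*(K-1)).
    return 2 * (K // 2) / (K * (K - 1))
```

Because the pairs are disjoint, each merge can be evaluated from the precomputed matrix L without re-touching the data.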
All remaining steps follow Algorithm 1. The resulting Hastings ratio for a global split (see supplement) is expressed as

H = γ Γ(γ + m̃··)/Γ(γ + m̃̂··) · p(x|ẑ)/p(x|z) · q(z|v̂, v̂̄) q(θ̃a|z) / q(ẑ|v, v̄) · Q^M_{K+1}/Q^S_K · ∏_{k=1}^{K} [ βk^{m̃·k}/Γ(m̃·k) ∏_{j=1}^{D} Γ(αβk)/Γ(αβk + nj·k) ] · ∏_{k=1}^{K+1} [ Γ(m̃̂·k)/β̂k^{m̃̂·k} ∏_{j=1}^{D} Γ(αβ̂k + n̂j·k)/Γ(αβ̂k) ].   (24)

Similar to local merges, the Hastings ratio for a global merge depends on the proposed sub-topic parameters. We approximate these with the main-topic parameters prior to the merge.
Unlike the local split/merge proposals, proposing ẑ requires significant computation by looping through all data points. As such, we only propose a single global split and merge each iteration. Thus, Q^S_K = 1/K and Q^M_K = 2/(K(K − 1)). We emphasize that the developed global moves are very different from previous local split/merge moves in DPs and HDPs (e.g., [1, 7, 11, 13, 14]). We conjecture that this is the reason the split/merge moves in [7] only made negligible improvement.

6 Experiments

We now test the proposed HDP Sub-Clusters method on topic modeling. The algorithm is summarized in the following steps: (1) initialize β and z randomly; (2) sample π, θ, π̄, and θ̄ via Equations (2, 3, 7, 8); (3) sample z and z̄ via Equations (4, 9); (4) propose ⌊K/2⌋ local merges followed by K local splits; (5) propose a global merge followed by a global split; (6) sample m and m̄ via Equations (5, 10); (7) sample β and β̄ via Equations (1, 6); (8) repeat from Step 2 until convergence.
We fix the hyper-parameters, but resampling techniques [2] can easily be incorporated. All results are averaged over 10 sample paths. Source code can be downloaded from http://people.csail.mit.edu/jchang7.

Figure 3: Synthetic "bars" example. (a) Visualizing topic word distributions without splits/merges for K = 5. (b)–(c) Number of inferred topics for different split/merge proposals and parallelizations. (d) Comparing sampling algorithms with a single processor and initialized to a single topic.

Figure 4: Results on AP. (a) 1, 25, 50, and 75 initial topics. (b) Switching algorithms at 1000 secs.

6.1 Synthetic Bars Dataset

We synthesized 200 documents from the "bars" example of [22] with a dictionary of 25 words that can be arranged in a 5x5 grid. Each of the 10 true topics forms a horizontal or vertical bar. To visualize the sub-topics, we initialize to 5 topics and do not propose splits or merges. The resulting regular- and sub-topics are shown in Figure 3a. Notice how the sub-topics capture likely splits.
Next, we consider different split/merge proposals in Figure 3b. The "Combined" algorithm uses local and global moves. The deterministic moves are often rejected, resulting in slow convergence. While global moves are not needed in such a well-separated dataset, we have observed that they make a significant impact in real-world datasets. Furthermore, since every step of the sampling algorithm can be parallelized, we achieve a linear speedup in the number of processors, as shown in Figure 3c.
Figure 3d compares convergence without parallelization to the Direct Assignment (DA) sampler and the Finite Symmetric Dirichlet (FSD) of order 20.
Since all algorithms should sample from the same model, the goal here is to analyze convergence speed. We plot two summary statistics: the likelihood of a single held-out word (HOW) from each document, and the number of inferred topics. While the HOW likelihood for FSD converges at 1 second, the number of topics converges at 100 seconds. This suggests that cross-validation techniques, which evaluate model fit, cannot solely determine MCMC convergence. We note that FSD tends to first create all L topics and slowly remove them.

6.2 Real-World Corpora Datasets

Next, we consider the Associated Press (AP) dataset [23] with 436K words in 2K documents. We manually set the FSD order to 100. Results using 16 cores (except DA, which cannot be parallelized) with 1, 25, 50, and 75 initial topics are shown in Figure 4a. All samplers should converge to the same statistics regardless of the initialization. While the HOW likelihood converges for 3/4 FSD initializations, the number of topics indicates that no DA or FSD sample paths have converged. Unlike the well-separated, synthetic dataset, the Sub-Clusters method that only uses local splits and merges does not converge to a good solution here. In contrast, all initializations of the Sub-Clusters method have converged to a high HOW likelihood with only approximately 20 topics. The path taken by each sampler in the joint HOW likelihood / number of topics space is shown in the right panel of Figure 4a. This visualization helps to illustrate the different approaches taken by each algorithm.
Figure 5a shows confusion matrices, C, of the inferred topics. Each element of C is defined as Cr,c = Σ_x fx(x; θr) log fx(x; θc), and captures the likelihood of a random word from topic r
Figure 5: (a) Confusion matrices on AP for SUB-CLUSTERS, DA, and FSD (left to right). Outlines are overlaid to compare size. (b) Four inferred topics from the NYTimes articles.

Figure 6: Results on (a) Enron emails and (b) NYTimes articles for 1 and 50 initial topics.

evaluated under topic c. DA and FSD both converge to many topics that are easily confused, whereas the Sub-Clusters method converges to a smaller set of more distinguishable topics.
Rigorous proofs about convergence are quite difficult. Furthermore, even though the approximations made in calculating the Hastings ratios for local and global splits (e.g., Equation (20)) are backed by intuition, they complicate the analysis. Instead, we run each sample path for 2,000 seconds. After 1,000 seconds, we switch the Sub-Clusters sample paths to FSD and all other sample paths to Sub-Clusters. Markov chains that have converged should not change when switching the sampler. Figure 4b shows that switching from DA, FSD, or the local version of Sub-Clusters immediately changes the number of topics, but switching Sub-Clusters to FSD has no effect. We believe that the number of topics is slightly higher in the former because the Sub-Cluster method struggles to create small topics. By construction, the splits make large moves, in contrast to DA and FSD, which often create single-word topics. This suggests that alternating between FSD and Sub-Clusters may work well.
Finally, we consider two large datasets from [24]: Enron Emails with 6M words in 40K documents and NYTimes Articles with 100M words in 300K documents.
We note that the NYTimes dataset is 3 orders of magnitude larger than those considered in the HDP split/merge work of [7]. Again, we manually set the FSD order, here to 200. Results initialized to 1 and 50 topics are shown in Figure 6. On such large datasets, it is difficult to predict convergence times; after 28 hours, it seems as though no algorithm has converged. However, the Sub-Clusters method seems to be approaching a solution, whereas FSD has yet to prune topics and DA has yet to achieve a good cross-validation score. Four topics inferred by the Sub-Clusters method on the NYTimes dataset are visualized in Figure 5b. These words seem to describe plausible topics (e.g., music, terrorism, basketball, and wine).

7 Conclusion

We have developed a new parallel sampling algorithm for the HDP that proposes split and merge moves. Unlike previous attempts, the proposed global splits and merges exhibit significantly improved convergence on a variety of datasets. We have also shown that cross-validation metrics in isolation can lead to the erroneous conclusion that an MCMC sampling algorithm has converged. By considering the number of topics and the held-out likelihood jointly, we show that previous sampling algorithms converge very slowly.

Acknowledgments

This research was partially supported by the Office of Naval Research Multidisciplinary Research Initiative program, award N000141110688, and by VITALITE, which receives support from the Army Research Office Multidisciplinary Research Initiative program, award W911NF-11-1-0391.

References

[1] J. Chang and J. W. Fisher, III.
Parallel sampling of DP mixture models using sub-cluster splits. In Advances in Neural Information Processing Systems, Dec 2013.

[2] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581, 2006.

[3] E. B. Sudderth. Graphical Models for Visual Object Recognition and Tracking. PhD thesis, Massachusetts Institute of Technology, 2006.

[4] E. B. Fox, E. B. Sudderth, M. I. Jordan, and A. S. Willsky. An HDP-HMM for systems with state persistence. In International Conference on Machine Learning, July 2008.

[5] Y. W. Teh, K. Kurihara, and M. Welling. Collapsed variational inference for HDP. In Advances in Neural Information Processing Systems, volume 20, 2008.

[6] M. Bryant and E. Sudderth. Truly nonparametric online variational inference for Hierarchical Dirichlet processes. In Advances in Neural Information Processing Systems, 2012.

[7] C. Wang and D. Blei. A split-merge MCMC algorithm for the Hierarchical Dirichlet process. arXiv:1207.1657 [stat.ML], 2012.

[8] D. Newman, A. Asuncion, P. Smyth, and M. Welling. Distributed algorithms for topic models. Journal of Machine Learning Research, 10:1801–1828, December 2009.

[9] S. Williamson, A. Dubey, and E. P. Xing. Parallel Markov chain Monte Carlo for nonparametric mixture models. In International Conference on Machine Learning, 2013.

[10] R. Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9(2):249–265, June 2000.

[11] S. Jain and R. Neal. A split-merge Markov chain Monte Carlo procedure for the Dirichlet process mixture model. Journal of Computational and Graphical Statistics, 13:158–182, 2000.

[12] P. J. Green and S. Richardson. Modelling heterogeneity with and without the Dirichlet process. Scandinavian Journal of Statistics, pages 355–375, 2001.

[13] D. B. Dahl. An improved merge-split sampler for conjugate Dirichlet process mixture models. Technical report, University of Wisconsin - Madison Dept. of Statistics, 2003.

[14] S. Jain and R. Neal. Splitting and merging components of a nonconjugate Dirichlet process mixture model. Bayesian Analysis, 2(3):445–472, 2007.

[15] H. Ishwaran and M. Zarepour. Exact and approximate sum-representations for the Dirichlet process. Canadian Journal of Statistics, 30:269–283, 2002.

[16] F. Niu, B. Recht, C. Ré, and S. J. Wright. Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, 2011.

[17] M. J. Johnson, J. Saunderson, and A. S. Willsky. Analyzing Hogwild parallel Gaussian Gibbs sampling. In Advances in Neural Information Processing Systems, 2013.

[18] Y. Gal and Z. Ghahramani. Pitfalls in the use of parallel inference for the Dirichlet process. In Workshop on Big Learning, NIPS, 2013.

[19] J. Sethuraman. A constructive definition of Dirichlet priors. Statistica Sinica, pages 639–650, 1994.

[20] C. E. Antoniak. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. Annals of Statistics, 2(6):1152–1174, 1974.

[21] W. K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109, 1970.

[22] T. L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, 101:5228–5235, April 2004.

[23] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, March 2003.

[24] K. Bache and M. Lichman.
UCI Machine Learning Repository, 2013.