{"title": "Parallel Sampling of DP Mixture Models using Sub-Cluster Splits", "book": "Advances in Neural Information Processing Systems", "page_first": 620, "page_last": 628, "abstract": "We present a novel MCMC sampler for Dirichlet process mixture models that can be used for conjugate or non-conjugate prior distributions. The proposed sampler can be massively parallelized to achieve significant computational gains. A non-ergodic restricted Gibbs iteration is mixed with split/merge proposals to produce a valid sampler. Each regular cluster is augmented with two sub-clusters to construct likely split moves. Unlike many previous parallel samplers, the proposed sampler accurately enforces the correct stationary distribution of the Markov chain without the need for approximate models. Empirical results illustrate that the new sampler exhibits better convergence properties than current methods.", "full_text": "Parallel Sampling of DP Mixture\nModels using Sub-Clusters Splits\n\nJason Chang\u2217\nCSAIL, MIT\n\njchang7@csail.mit.edu\n\nJohn W. Fisher III\u2217\n\nCSAIL, MIT\n\nfisher@csail.mit.edu\n\nAbstract\n\nWe present an MCMC sampler for Dirichlet process mixture models that can\nbe parallelized to achieve signi\ufb01cant computational gains. We combine a non-\nergodic, restricted Gibbs iteration with split/merge proposals in a manner that\nproduces an ergodic Markov chain. Each cluster is augmented with two sub-\nclusters to construct likely split moves. Unlike some previous parallel samplers,\nthe proposed sampler enforces the correct stationary distribution of the Markov\nchain without the need for \ufb01nite approximations. Empirical results illustrate that\nthe new sampler exhibits better convergence properties than current methods.\n\n1\n\nIntroduction\n\nDirichlet process mixture models (DPMMs) are widely used in the machine learning community\n(e.g. [28, 32]). 
Among other things, the elegant theory behind DPMMs has extended finite mixture models to include automatic model selection in clustering problems. One popular method for posterior inference in DPMMs is to draw samples of latent variables using a Markov chain Monte Carlo (MCMC) scheme. Extensions to the DPMM such as the hierarchical Dirichlet process [29] and the dependent Dirichlet process [18] also typically employ sampling-based inference.

Posterior sampling in complex models such as DPMMs is often difficult because samplers that propose local changes exhibit poor convergence. Split and merge moves, first considered in DPMMs by [13], attempt to address these convergence issues. Alternatively, approximate inference methods such as the variational algorithms of [3] and [15] can be used. While variational algorithms do not have the limiting guarantees of MCMC methods and may also suffer from similar convergence issues, they are appealing for use in large datasets as they lend themselves to parallelization. Here, we develop a sampler for DPMMs that: (1) preserves limiting guarantees; (2) proposes splits and merges to improve convergence; (3) can be parallelized to accommodate large datasets; and (4) is applicable to a wide variety of DPMMs (conjugate and non-conjugate). To our knowledge, no current sampling algorithms satisfy all of these properties simultaneously.

While we focus on DP mixture models here, similar methods can be extended to mixture models with other priors (finite Dirichlet distributions, Pitman-Yor processes, etc.).

2 Related Work

Owing to the wealth of literature on DPMM samplers, we focus on the most relevant work in our overview. Other sampling algorithms (e.g. [17]) and inference methods (e.g. 
[3]) are not discussed. The majority of DPMM samplers fit into one of two categories: collapsed-weight samplers that marginalize over the mixture weights or instantiated-weight samplers that explicitly represent them. Capabilities of current algorithms, which we now discuss, are summarized in Table 1.

*Jason Chang was partially supported by the Office of Naval Research Multidisciplinary Research Initiative (MURI) program, award N000141110688. John Fisher was partially supported by the Defense Advanced Research Projects Agency, award FA8650-11-1-7154.

Table 1: Capabilities of MCMC Sampling Algorithms

                                CW   [11, 12]   [7, 24]   [19, 31]   [5, 9, 13]   [14]   Proposed Method
Exact Model                     ✓        ·         ✓          ✓           ✓         ✓          ✓
Splits & Merges                 ·        ·         ·          ·           ✓         ✓          ✓
Intra-cluster Parallelizable    ·        ·         ·          ✓           ·         ·          ✓
Inter-cluster Parallelizable    ·        ✓         ✓          ·           ·         ·          ✓
Non-conjugate Priors            ✓        ✓         ✓          ·           ·         ✓          ✓

Collapsed-weight (CW) samplers using both conjugate (e.g. [4, 6, 20, 22, 30]) and non-conjugate (e.g. [21, 23]) priors sample the cluster labels iteratively one data point at a time. When a conjugate prior is used, one can also marginalize out cluster parameters. However, as noted by multiple authors (e.g. [5, 13, 17]), these methods often exhibit slow convergence. Additionally, due to the particular marginalization schemes, these samplers cannot be parallelized.

Instantiated-weight (IW) samplers explicitly represent cluster weights, typically using a finite approximation to the DP (e.g. [11, 12]). Recently, [7] and [24] have eliminated the need for this approximation; however, IW samplers still suffer from convergence issues.
If cluster parameters are marginalized, it can be very unlikely for a single point to start a new cluster. When cluster parameters are instantiated, samples of parameters from the prior are often a poor fit to the data. However, IW samplers are often useful because they can be parallelized across each data point conditioned on the weights and parameters. We refer to this type of algorithm as "inter-cluster parallelizable", since the cluster label for each point within a cluster can be sampled in parallel.

The recent works of [19] and [31] present an alternative parallelization scheme for CW samplers. They observe that multiple clusters can be grouped into "super-clusters" and that each super-cluster can be sampled independently. We refer to this type of implementation as "intra-cluster parallelizable", since points in different super-clusters can be sampled in parallel, but points within a cluster cannot. This distinction is important as many problems of interest contain far more data points than clusters, and the greatest computational gain may come from inter-cluster parallelizable algorithms. Due to their particular construction, current algorithms group super-clusters solely based on the size of each super-cluster. In the sequel, we show empirically that this can lead to slow convergence and demonstrate how data-based super-clusters improve upon these methods.

Recent CW samplers consider larger moves to address convergence issues. Green and Richardson [9] present a reversible jump MCMC sampler that proposes splitting and merging components. While a general framework is presented, proposals are model-dependent and generic choices are not specified. Proposed splits are unlikely to fit the posterior since auxiliary variables governing the split cluster parameters and weights are proposed independent of the data.
Jain and Neal [13, 14] construct a split by running multiple restricted Gibbs scans for a single cluster in conjugate and non-conjugate models. While each restricted scan improves the constructed split, it also increases the amount of computation needed. As such, it is not easy to determine how many restricted scans are needed. Dahl [5] proposes a split scheme for conjugate models by reassigning the labels of a cluster sequentially. All current split samplers construct a proposed move to be used in a Metropolis-Hastings framework. If the split is rejected, considerable computation is wasted, and all information contained in learning the split is forgotten. In contrast, the proposed method of fitting sub-clusters iteratively learns likely split proposals with the auxiliary variables. Additionally, we show that split proposals can be computed in parallel, allowing for very efficient implementations.

3 Dirichlet Process Mixture Model Samplers

In this section we give a brief overview of DPMMs. For a more in-depth treatment, we refer the reader to [27]. A graphical model for the DPMM is shown in Figure 1a, where i indexes a particular data point, x is the vector of observed data, z is the vector of cluster indices, π is the infinite vector of mixture weights, α is the concentration parameter for the DP, θ is the vector of cluster parameters, and λ is the hyperparameter for the corresponding DP base measure.

3.1 Instantiated-Weight Samplers using Approximations to the Dirichlet Process

The constructive proof of the Dirichlet process [26] shows that a DP can be sampled by iteratively scaling an infinite sequence of Beta random variables. Therefore, posterior MCMC inference in a DPMM could, in theory, alternate between the following samplers:

    (\pi_1, \dots, \pi_\infty) \sim p(\pi \mid z, \alpha),    (1)
    \theta_k \overset{\propto}{\sim} f_x(x_{\{k\}}; \theta_k) f_\theta(\theta_k; \lambda),    \forall k \in \{1, \dots, \infty\},    (2)
    z_i \overset{\propto}{\sim} \sum_{k=1}^{\infty} \pi_k f_x(x_i; \theta_k) \, \mathbb{1}[z_i = k],    \forall i \in \{1, \dots, N\},    (3)

where \overset{\propto}{\sim} denotes sampling from a distribution proportional to the right-hand side, x_{\{k\}} denotes the (possibly empty) set of data labeled k, and f_◦(·) denotes a particular form of the probability density function of ◦. We use f_x(x_{\{k\}}; θ_k) to denote the product of likelihoods for all data points in cluster k. When conjugate priors are used, the posterior distribution for cluster parameters is in the same family as the prior:

    p(\theta_k \mid x, z, \lambda) \propto f_\theta(\theta_k; \lambda) f_x(x_{\{k\}}; \theta_k) \propto f_\theta(\theta_k; \lambda^*_k),    (4)

where λ*_k denotes the posterior hyperparameters for cluster k. Unfortunately, the infinite-length sequences of π and θ clearly make this procedure impossible.

As an approximation, authors have considered the truncated stick-breaking representation [11] and the finite symmetric Dirichlet distribution [12]. These approximations become more accurate when the truncation is much larger than the true number of components. However, the true number of clusters is often unknown. When cluster parameters are explicitly sampled, these algorithms may additionally suffer from slow convergence. In particular, a broad prior will often result in a very small probability of creating new clusters, since the probability of generating a parameter from the prior that fits a single data point is small.

3.2 Collapsed-Weight Samplers using the Chinese Restaurant Process

Alternatively, the weights can be marginalized to form a collapsed-weight sampler.
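Before turning to the collapsed-weight scheme, the truncated stick-breaking representation of Section 3.1 can be made concrete. The sketch below is an illustrative helper (not from the paper's released code; the function name and interface are assumptions): it draws a weight vector of K sticks with β_k ~ Beta(1, α), with the final stick absorbing all remaining mass, which is exactly the truncation discussed above.

```python
import numpy as np

def truncated_stick_breaking(alpha, K, rng):
    """Sample (pi_1, ..., pi_K) from a truncated stick-breaking construction."""
    betas = rng.beta(1.0, alpha, size=K)   # stick proportions beta_k ~ Beta(1, alpha)
    betas[-1] = 1.0                        # truncation: last stick absorbs the leftover mass
    # pi_k = beta_k * prod_{j<k} (1 - beta_j)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas[:-1])))
    return betas * remaining

pi = truncated_stick_breaking(alpha=1.0, K=100, rng=np.random.default_rng(0))
```

By construction the truncated weights sum to one exactly, so the approximation error shows up only as the last component soaking up the tail mass, not as an improper distribution.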
By exchangeability, a label can be drawn using the Chinese Restaurant Process (CRP) [25], which assigns a new customer (i.e. data point) to a particular table (i.e. cluster) with the following predictive distribution:

    p(z_i \mid x, z_{\setminus i}; \alpha) \propto \sum_k N_{k \setminus i} \, f_x(x_i; \lambda^*_{k \setminus i}) \, \mathbb{1}[z_i = k] + \alpha f_x(x_i; \lambda) \, \mathbb{1}[z_i = \hat{k}],    (5)

where \setminus i denotes all indices excluding i, N_{k \setminus i} is the number of elements in z_{\setminus i} with label k, \hat{k} is a new cluster label, and f_x(◦; λ) denotes the distribution of x when marginalizing over parameters. When a non-conjugate prior is used, a computationally expensive Metropolis-Hastings step (e.g. [21, 23]) must be used when sampling the label for each data point.

4 Exact Parallel Instantiated-Weight Samplers

We now present a novel alternative to the instantiated-weight samplers that does not require any finite model approximations. The detailed balance property underlies most MCMC sampling algorithms. In particular, if one desires to sample from a target distribution, π(z), satisfying detailed balance for an ergodic Markov chain guarantees that simulations of the chain will uniquely converge to the target distribution of interest. We now consider the atypical case of simulating from a non-ergodic chain with a transition distribution that satisfies detailed balance.

Definition 4.1 (Detailed Balance). Let π(z) denote the target distribution. If a Markov chain is constructed with a transition distribution q(ẑ|z) that satisfies π(z) q(ẑ|z) = π(ẑ) q(z|ẑ), then the chain is said to satisfy the detailed balance condition and π(z) is guaranteed to be a stationary distribution of the chain.

We define a restricted sampler as one that satisfies detailed balance (e.g. using the Hastings ratio [10]) but does not result in an ergodic chain.
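Definition 4.1 can be verified numerically for a small finite chain. The sketch below (illustrative only, not from the paper) builds a Metropolized random-walk kernel for a hand-picked three-state target and checks that π(z) q(ẑ|z) is symmetric, which also implies that π is stationary.

```python
import numpy as np

# Hand-picked 3-state target distribution (an assumption for illustration).
pi = np.array([0.2, 0.3, 0.5])

prop = np.full((3, 3), 1.0 / 3.0)                  # symmetric proposal
acc = np.minimum(1.0, pi[None, :] / pi[:, None])   # Metropolis acceptance min(1, pi_j/pi_i)
q = prop * acc
np.fill_diagonal(q, 0.0)
np.fill_diagonal(q, 1.0 - q.sum(axis=1))           # rejected mass stays at the current state

balance = pi[:, None] * q                          # entries pi(z) q(zhat | z)
```

Since `balance` equals its transpose, the kernel satisfies detailed balance, and summing either side over rows recovers `pi @ q == pi`.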
We note that without ergodicity, detailed balance does not imply uniqueness in, or convergence to, the stationary distribution. One key observation of this work is that multiple restricted samplers can be combined to form an ergodic chain. In particular, we consider a sampler that is restricted to only sample labels belonging to non-empty clusters. Such a sampler is not ergodic because it cannot create new clusters. However, when mixed with a sampler that proposes splits, the resulting chain is ergodic and yields a valid sampler. We now consider a restricted Gibbs sampler. The coupled split sampler is discussed in Section 5.

Figure 1: (a)-(b) Graphical models for the DPMM and augmented super-cluster space. Auxiliary variables are dotted. (c) An illustration of the super-cluster grouping. Nodes represent clusters, arrows point to neighbors, and colors represent the implied super-clusters.

4.1 Restricted DPMM Gibbs Sampler with Super-Clusters

A property stemming from the definition of Dirichlet processes is that the measure for every finite partitioning of the measurable space is distributed according to a Dirichlet distribution [8]. While the DP places an infinite-length prior on the labels, any realization of z will belong to a finite number of clusters. Supposing z_i ∈ {1, ..., K}, ∀i, we show in the supplement that the posterior distribution of mixture weights, π, conditioned on the cluster labels can be expressed as

    (\pi_1, \dots, \pi_K, \tilde{\pi}_{K+1}) \sim \mathrm{Dir}(N_1, \dots, N_K, \alpha),    (6)

where N_k = \sum_i \mathbb{1}[z_i = k] is the number of points in cluster k, and \tilde{\pi}_{K+1} = \sum_{k=K+1}^{\infty} \pi_k is the sum of all empty mixture weights. This relationship has previously been noted in the literature (c.f. [29]). In conjunction with Definition 4.1, this leads to the following iterated restricted Gibbs sampler:

    (\pi_1, \dots, \pi_K, \tilde{\pi}_{K+1}) \sim \mathrm{Dir}(N_1, \dots, N_K, \alpha),    (7)
    \theta_k \overset{\propto}{\sim} f_x(x_{\{k\}}; \theta_k) f_\theta(\theta_k; \lambda),    \forall k \in \{1, \dots, K\},    (8)
    z_i \overset{\propto}{\sim} \sum_{k=1}^{K} \pi_k f_x(x_i; \theta_k) \, \mathbb{1}[z_i = k],    \forall i \in \{1, \dots, N\}.    (9)

We note that each of these steps can be parallelized and, because the mixture parameters are explicitly represented, this procedure works for conjugate and non-conjugate priors. When non-conjugate priors are used, any proposal that leaves the stationary distribution invariant can be used (c.f. [23]).

Similar to previous super-cluster methods, we can also restrict each cluster to only consider moving to a subset of other clusters. The super-clusters of [19] and [31] are formed using a size-biased sampler. This can lead to slower convergence since clusters with similar data may not be in the same super-cluster. By observing that any similarly restricted Gibbs sampler satisfies detailed balance, any randomized algorithm that assigns finite probability to any super-cluster grouping can be used. As shown in Figure 1b, we augment the sample space with super-cluster groups, g, that group similar clusters together. Conditioned on g, Equation 9 is altered to only consider labels within the super-cluster that the data point currently belongs to. The super-cluster sampling procedure is described in Algorithm 1. Here, D denotes an arbitrary distance measure between probability distributions. In our experiments, we use the symmetric version of the KL-divergence (the J-divergence). When the J-divergence is difficult to calculate, any distance measure can be substituted.
For example, in the case of multinomial distributions, we use the J-divergence between the corresponding categorical distributions as a proxy. An illustration of the implied super-cluster grouping from the algorithm is shown in Figure 1c, and a visualization of an actual super-cluster grouping is shown in Figure 2. Notice that the super-cluster groupings using [19] are essentially random, while our super-clusters are grouped by similar data.

Algorithm 1 Sampling Super-Clusters with Similar Clusters
1. Form the adjacency matrix, A, where A_{k,m} = \exp[-D(f_x(◦; \theta_k), f_x(◦; \theta_m))]
2. For each cluster, k, sample a random neighbor k' according to k' \overset{\propto}{\sim} \sum_m A_{k,m} \, \mathbb{1}[k' = m]
3. Form the groups of super-clusters, g, by finding the separate connected subgraphs

Figure 2: (left) A visualization of the algorithm. Each set of uniquely colored ellipses indicates one cluster. Solid ellipses indicate regular clusters and dotted ellipses indicate sub-clusters. Colors of data points indicate super-cluster membership. (right) Inferred clusters and super-clusters from [19].

5 Parallel Split/Merge Moves via Sub-Clusters

The preceding section showed that an exact MCMC sampling algorithm can be constructed by alternating between a restricted Gibbs sampler and split moves. While any split proposal (e.g. [5, 13, 14]) can result in an ergodic chain, we now develop efficient split moves that are compatible with conjugate and non-conjugate priors and that can be parallelized.
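Algorithm 1 above can be implemented directly from pairwise divergences. The sketch below is illustrative (not the paper's implementation): it assumes `div` is a precomputed symmetric matrix of divergences D between the clusters' predictive densities (the J-divergence in the paper's experiments), samples one neighbor per cluster, and extracts the connected components with a small union-find.

```python
import numpy as np

def sample_superclusters(div, rng):
    """Algorithm 1 sketch: A[k, m] = exp(-div[k, m]); each cluster samples
    one neighbor; connected components of the resulting graph define g."""
    K = div.shape[0]
    A = np.exp(-div)
    # Step 2: cluster k picks neighbor k' with probability proportional to A[k, k']
    nbr = [rng.choice(K, p=A[k] / A[k].sum()) for k in range(K)]

    # Step 3: connected components via union-find with path compression
    parent = list(range(K))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    for k in range(K):
        parent[find(k)] = find(nbr[k])          # union k with its sampled neighbor
    return np.array([find(k) for k in range(K)])  # super-cluster id per cluster

# Two well-separated blocks of clusters (hypothetical divergences).
div = np.array([[0., 0., 50., 50.],
                [0., 0., 50., 50.],
                [50., 50., 0., 0.],
                [50., 50., 0., 0.]])
g = sample_superclusters(div, np.random.default_rng(0))
```

With divergences this large, exp(-50) makes a cross-block neighbor essentially impossible, so dissimilar clusters land in different super-clusters while similar ones may merge.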
We will augment the space with auxiliary variables, noting that samples of the non-auxiliary variables can be obtained by drawing samples from the joint space and simply discarding any auxiliary values.

5.1 Augmenting the Space with Auxiliary Variables

Since the goal is to design a model that is tailored toward splitting clusters, we augment each regular cluster with two explicit sub-clusters (herein referred to as the "left" and "right" sub-clusters). Each data point is then attributed with a sub-cluster label, z̄_i ∈ {ℓ, r}, indicating whether it comes from the left or right sub-cluster. Additionally, each sub-cluster has an associated pair of weights, π̄_k = {π̄_{k,ℓ}, π̄_{k,r}}, and parameters, θ̄_k = {θ̄_{k,ℓ}, θ̄_{k,r}}. These auxiliary variables are named in a similar fashion to their regular-cluster counterparts because of the similarities between sub-clusters and regular clusters. One naïve choice for the auxiliary parameter distributions is

    p(\bar{\pi}_k) = \mathrm{Dir}(\bar{\pi}_{k,\ell}, \bar{\pi}_{k,r}; \alpha/2, \alpha/2), \qquad p(\bar{\theta}_k) = f_\theta(\bar{\theta}_{k,\ell}; \lambda) f_\theta(\bar{\theta}_{k,r}; \lambda),    (10)

    p(\bar{z} \mid \bar{\pi}, \bar{\theta}, x, z) = \prod_k \prod_{\{i; z_i = k\}} \frac{\bar{\pi}_{k,\bar{z}_i} f_x(x_i; \bar{\theta}_{k,\bar{z}_i})}{\bar{\pi}_{k,\ell} f_x(x_i; \bar{\theta}_{k,\ell}) + \bar{\pi}_{k,r} f_x(x_i; \bar{\theta}_{k,r})}.    (11)

The corresponding graphical model is shown in Figure 3a. It would be advantageous if the form of the posterior for the auxiliary variables matched that of the regular clusters in Equations 7-9. Unfortunately, because the normalization in Equation 11 depends on π̄ and θ̄, this choice of auxiliary distributions does not result in the posterior distributions for π̄ and θ̄ that one would expect. We note that this problem only arises in the auxiliary space, where x generates the auxiliary label z̄ (in contrast to the regular space, where z generates x).
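The per-point factor in Equation 11 is just a two-component responsibility. A numerically stable, log-space evaluation looks like the following (an illustrative helper, not from the paper; the argument names are assumptions):

```python
import math

def left_subcluster_prob(log_pi_l, log_f_l, log_pi_r, log_f_r):
    """P(zbar_i = l) = pi_l * f_l / (pi_l * f_l + pi_r * f_r), as in Eq. 11,
    computed from log weights and log likelihoods to avoid underflow."""
    a = log_pi_l + log_f_l          # log of the left (unnormalized) term
    b = log_pi_r + log_f_r          # log of the right (unnormalized) term
    m = max(a, b)                   # subtract the max before exponentiating
    return math.exp(a - m) / (math.exp(a - m) + math.exp(b - m))
```

With symmetric weights and equal likelihoods the responsibility is 1/2; when one sub-cluster's likelihood dominates, the probability saturates toward 1 without overflow.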
Additional details are provided in the supplement. Consequently, we alter the distribution over sub-cluster parameters to be

    p(\bar{\theta}_k \mid x, z, \bar{\pi}) \propto f_\theta(\bar{\theta}_{k,\ell}; \lambda) f_\theta(\bar{\theta}_{k,r}; \lambda) \prod_{\{i; z_i = k\}} \left( \bar{\pi}_{k,\ell} f_x(x_i; \bar{\theta}_{k,\ell}) + \bar{\pi}_{k,r} f_x(x_i; \bar{\theta}_{k,r}) \right).    (12)

It is easily verified that this choice results in the following conditional posterior distributions:

    (\bar{\pi}_{k,\ell}, \bar{\pi}_{k,r}) \sim \mathrm{Dir}(\bar{N}_{k,\ell} + \alpha/2, \bar{N}_{k,r} + \alpha/2),    \forall k \in \{1, \dots, K\},    (13)
    \bar{\theta}_{k,s} \overset{\propto}{\sim} f_x(x_{\{k,s\}}; \bar{\theta}_{k,s}) f_\theta(\bar{\theta}_{k,s}; \lambda),    \forall k \in \{1, \dots, K\}, \forall s \in \{\ell, r\},    (14)
    \bar{z}_i \overset{\propto}{\sim} \sum_{s \in \{\ell, r\}} \bar{\pi}_{z_i,s} f_x(x_i; \bar{\theta}_{z_i,s}) \, \mathbb{1}[\bar{z}_i = s],    \forall i \in \{1, \dots, N\},    (15)

which essentially match the distributions for regular-cluster parameters in Equations 7-9. We note that the joint distribution over the augmented space cannot be expressed analytically as a result of only specifying Equation 12 up to a proportionality constant that depends on π̄, x, and z. The corresponding graphical model is shown in Figure 3b.

5.2 Restricted Gibbs Sampling in Augmented Space

Restricted sampling in the augmented space can be performed in a similar fashion as before. One can draw a sample from the space of K regular clusters by sampling all the regular- and sub-cluster parameters conditioned on labels and data from Equations 7, 8, 13, and 14. Conditioned on these parameters, one can sample a regular-cluster label followed by a sub-cluster label for each data point from Equations 9 and 15. All of these steps can be computed in parallel.

Figure 3: Graphical models for the augmented DPMMs: (a) the unmatched and (b) the matched augmented sub-cluster models.
Auxiliary variables are dotted.

5.3 Metropolis-Hastings Sub-Cluster Split Moves

A pair of inferred sub-clusters contains a likely split of the corresponding regular cluster. We exploit these auxiliary variables to propose likely splits. Similar to previous methods, we use a Metropolis-Hastings (MH) MCMC [10] method for proposed splits. A new set of random variables, {π̂, θ̂, ẑ, π̄̂, θ̄̂, z̄̂}, is proposed via some proposal distribution, q, and accepted with probability

    \min\left[1, \frac{p(\hat{\pi}, \hat{z}, \hat{\theta}, x) \, p(\hat{\bar{\pi}}, \hat{\bar{\theta}}, \hat{\bar{z}} \mid x, \hat{z})}{p(\pi, z, \theta, x) \, p(\bar{\pi}, \bar{\theta}, \bar{z} \mid x, z)} \cdot \frac{q(\pi, z, \theta, \bar{\pi}, \bar{\theta}, \bar{z} \mid \hat{\pi}, \hat{z}, \hat{\theta}, \hat{\bar{\pi}}, \hat{\bar{\theta}}, \hat{\bar{z}})}{q(\hat{\pi}, \hat{z}, \hat{\theta}, \hat{\bar{\pi}}, \hat{\bar{\theta}}, \hat{\bar{z}} \mid \pi, z, \theta, \bar{\pi}, \bar{\theta}, \bar{z})}\right] = \min[1, H],    (16)

where H is the "Hastings ratio". Because of the required reverse proposal in the Hastings ratio, we must propose both merges and splits. Unfortunately, because the joint likelihood for the augmented space cannot be analytically expressed, the Hastings ratio for an arbitrary proposal distribution cannot be computed. A very specific proposal distribution, which we now discuss, does result in a tractable Hastings ratio. A split or merge move, denoted by Q, is first selected at random. In our examples, all possible splits and merges are considered since the number of clusters is much smaller than the number of data points.
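In practice, the accept step of Equation 16 is carried out on the log of the Hastings ratio. The helper below is a generic MH accept test (standard practice, not code from the paper): draw u ~ Uniform(0, 1) and accept iff log u < min(0, log H).

```python
import math
import random

def mh_accept(log_H, u=None):
    """Accept a proposal with probability min(1, H), given log H."""
    if u is None:
        u = random.random()
    if u <= 0.0:          # guard: log(0) is undefined; u == 0 always accepts
        return True
    return math.log(u) < min(0.0, log_H)
```

Working in the log domain matters here because the ratio in Equation 16 multiplies many per-point likelihoods, which underflow long before the ratio itself is meaningless.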
When the number of clusters is large, any randomized proposal over moves can be used instead. Conditioned on Q = Q_split-c, which splits cluster c into m and n, or Q = Q_merge-mn, which merges clusters m and n into c, a new set of variables is sampled as follows:

For Q = Q_split-c:
    (\hat{z}_{\{m\}}, \hat{z}_{\{n\}}) = \text{split-c}(z, \bar{z}),    (17)
    (\hat{\pi}_m, \hat{\pi}_n) = \pi_c \cdot (u_m, u_n), \quad (u_m, u_n) \sim \mathrm{Dir}(\hat{N}_m, \hat{N}_n),    (18)
    (\hat{\theta}_m, \hat{\theta}_n) \sim q(\hat{\theta}_m, \hat{\theta}_n \mid x, \hat{z}, \hat{\bar{z}}),    (19)
    (\hat{\bar{v}}_m, \hat{\bar{v}}_n) \sim p(\hat{\bar{v}}_m, \hat{\bar{v}}_n \mid x, \hat{z}).    (20)

For Q = Q_merge-mn:
    \hat{z}_{\{c\}} = \text{merge-mn}(z),
    \hat{\pi}_c = \pi_m + \pi_n,
    \hat{\theta}_c \sim q(\hat{\theta}_c \mid x, \hat{z}, \hat{\bar{z}}),
    \hat{\bar{v}}_c \sim p(\hat{\bar{v}}_c \mid x, \hat{z}).

Here, v̄_k = {π̄_k, θ̄_k, z̄_{\{k\}}} denotes the set of auxiliary variables for cluster k, the function split-c(◦) splits the labels of cluster c based on the sub-cluster labels, and merge-mn(◦) merges the labels of clusters m and n. The proposal of cluster parameters is written in a general form so that users can specify their own proposal for non-conjugate priors. All other cluster parameters remain the same. Sampling auxiliary variables from Equation 20 will be discussed shortly. Assuming that this can be performed, we show in the supplement that the resulting Hastings ratio for a split is

    H_{\text{split-c}} = \alpha \prod_{k \in \{m,n\}} \frac{\Gamma(\hat{N}_k) f_\theta(\hat{\theta}_k; \lambda) f_x(x_{\{k\}}; \hat{\theta}_k)}{q(\hat{\theta}_k \mid x, z, \hat{\bar{z}})} \cdot \frac{q(\theta_c \mid x, z, \hat{\bar{z}})}{\Gamma(N_c) f_\theta(\theta_c; \lambda) f_x(x_{\{c\}}; \theta_c)} = \frac{\alpha \prod_{k \in \{m,n\}} \Gamma(\hat{N}_k) f_x(x_{\{k\}}; \lambda)}{\Gamma(N_c) f_x(x_{\{c\}}; \lambda)}.    (21)

The first expression can be used for non-conjugate models, and the second expression can be used in conjugate models where new cluster parameters are sampled directly from the posterior distribution. We note that these expressions do not have any residual normalization terms and can be computed exactly, even though the joint distribution of the augmented space cannot be expressed analytically. Unfortunately, the Hastings ratio for a merge move is slightly more complicated. We discuss these complications following the explanation of sampling the auxiliary variables in the next section.

5.4 Deferred Metropolis-Hastings Sampling

The preceding section showed that sampling a split according to Equations 17-20 results in an accurate MH framework. However, sampling the auxiliary variables from Equation 20 is not straightforward. This step is equivalent to sampling cluster parameters and labels for a 2-component mixture model, which is known to be difficult. One typically samples from this space using an MCMC procedure. In fact, that is precisely what the restricted Gibbs sampler is doing. We therefore sample from Equation 20 by running a restricted Gibbs sampler for each newly proposed pair of sub-clusters until they have burned in. We monitor the data likelihood for cluster m, L_m = f_x(x_{\{m,\ell\}}; \bar{\theta}_{m,\ell}) \cdot f_x(x_{\{m,r\}}; \bar{\theta}_{m,r}), and declare burn-in once L_m begins to oscillate.

Furthermore, due to the implicit marginalization of auxiliary variables, the restricted Gibbs sampler and split moves that act on clusters that were not recently split do not depend on the proposed auxiliary variables. As such, these proposals can be computed before the auxiliary variables are even proposed. The sampling of auxiliary variables of a recently split cluster is deferred to the restricted Gibbs sampler while the other sampling steps run concurrently.
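In the conjugate case, the second form of Equation 21 reduces to counts and marginal likelihoods. The sketch below (log domain; the function signature is an assumption, not the paper's code) takes the log marginal likelihoods log f_x(x_{{m}}; λ), log f_x(x_{{n}}; λ), and log f_x(x_{{c}}; λ) as inputs, with N_c = N̂_m + N̂_n.

```python
from math import lgamma, log

def log_hastings_split(alpha, N_m, N_n, logmarg_m, logmarg_n, logmarg_c):
    """log H_split-c for the conjugate form of Eq. 21:
    H = alpha * Gamma(Nm) * Gamma(Nn) * f(x{m};l) * f(x{n};l)
        / (Gamma(Nc) * f(x{c};l)),  with Nc = Nm + Nn."""
    return (log(alpha) + lgamma(N_m) + lgamma(N_n) + logmarg_m + logmarg_n
            - lgamma(N_m + N_n) - logmarg_c)
```

Using `lgamma` keeps the Gamma factors finite even for clusters with thousands of points, where Γ(N) itself overflows.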
Once a set of proposed sub-clusters has burned in, the corresponding clusters can be proposed to split again.

5.5 Merge Moves with Random Splits

The Hastings ratio for a merge depends on the proposed auxiliary variables for the reverse split. Since proposed splits are deterministic conditioned on the sub-cluster labels, the Hastings ratio will be zero if the proposed sub-cluster labels for a merge do not match those of the current clusters. We show in the supplement that as the number of data points grows, the acceptance ratio for a merge move quickly decays. With only 256 data points, the acceptance ratio for a merge proposal over 1000 trials in a 1D Gaussian mixture model did not exceed 10^{-16}. We therefore approximate all merges with an automatic rejection. Unfortunately, this can lead to slow convergence in certain situations. Fortunately, there is a very simple sampler that is good at proposing merges: a data-independent, random split proposal generated from the prior with a corresponding merge move. A split is constructed by sampling a random cluster, c, followed by a random partitioning of its data points from a Dirichlet-multinomial distribution. In general, these data-independent splits will be nonsensical and result in a rejection. However, the corresponding merge moves are accepted with much higher probability than the sub-cluster merges. We refer the interested reader to the supplement for additional details.

6 Results

In this section, we compare the proposed method against other MCMC sampling algorithms. We consider three different versions of the proposed algorithm: sub-clusters with and without super-clusters (SUBC and SUBC+SUPC) and an approximate method that does not wait for the convergence of sub-clusters to split (SUBC+SUPC APPROX). We note that while we do not expect this last version to converge to the correct distribution, empirical results show that it is similar in average performance. We compare the proposed methods against four other methods: the finite symmetric Dirichlet approximate model (FSD) with 100 components, a Rao-Blackwellized Gibbs sampler (GIBBS), a Rao-Blackwellized version of the original super-cluster work of [19] (GIBBS+SUPC), and the current state-of-the-art split/merge sampler [5] (GIBBS+SAMS). In our implementations, the concentration parameter is not resampled, though one could easily use a slice sampler if desired.

We first compare these algorithms on synthetic Gaussian data with a Normal Inverse-Wishart prior. 100,000 data points are simulated from ten 2D Gaussian clusters. The average log likelihood for multiple sample paths obtained using the algorithms without parallelization for different numbers of initial clusters K and concentration parameters α is shown in the first two columns of Figure 4. In this high data regime, α should have little effect on the resulting clusters. However, we find that the samplers without split/merge proposals (FSD, GIBBS, GIBBS+SUPC) perform very poorly when the initial number of clusters and the concentration parameter are small. We also find that the super-cluster method, GIBBS+SUPC, performs even worse than regular Gibbs sampling. This is likely due to super-clusters not being grouped by similar data: when data points cannot move between different super-clusters, convergence is hindered. In contrast, the proposed super-cluster method does not suffer from the same convergence problems and is comparable to SUBC because there are a small number of clusters. Finally, the approximate sub-cluster method has significant gains when only one initial cluster is used, but performs approximately the same with more initial clusters.

Next we consider parallelizing the algorithms using 16 cores in the last column of Figure 4.
The four inter-cluster parallelizable algorithms, SUBC, SUBC+SUPC, SUBC+SUPC APPROX, and FSD, exhibit an order of magnitude speedup, while the intra-cluster parallelizable algorithm GIBBS+SUPC only has minor gains. As expected, parallelization does not aid the convergence of the algorithms, only the speed at which they converge.

Figure 4: Synthetic data results for various initial clusters K, concentration parameters α, and cores.

Figure 5: Log likelihood vs. computation time for real data. All parallel algorithms use 16 cores.

We now show results on real data. We test a Gaussian model with a Normal Inverse-Wishart prior on the MNIST dataset [16] by first running PCA on the 70,000 training and test images to reduce them to 50 dimensions. Results on the MNIST dataset are shown in Figure 5a. We additionally test the algorithm on multinomial data with a Dirichlet prior on the following datasets: Associated Press [2] (2,246 documents and a 10,473-dimension dictionary), Enron Emails [1] (39,861 documents and a 28,102-dimension dictionary), New York Times articles [1] (300,000 documents and a 102,660-dimension dictionary), and PubMed abstracts [1] (8,200,000 documents and a 141,043-dimension dictionary). Results are shown in Figure 5b-e. In contrast to HDP models, each document is treated as a single draw from a multinomial distribution. We note that on the PubMed dataset, we had to increase the approximation of FSD to 500 components after observing that SUBC inferred approximately 400 clusters. On real data, it is clearly evident that the other algorithms have issues with convergence. In fact, in the allotted time, no algorithm besides the proposed methods converges to the same log likelihood under the two different initializations on the larger datasets.
The presented sub-cluster methods converge to a better sample faster than the other algorithms converge to a worse one.
On the small Associated Press dataset, the proposed methods actually perform slightly worse than the GIBBS methods. Approximately 20 clusters are inferred for this dataset, resulting in approximately 100 observations per cluster. In these small-data regimes, it is important to marginalize over as many variables as possible. We believe that because the GIBBS methods marginalize over the cluster parameters and weights, they achieve better performance than the sub-cluster methods and FSD, which explicitly instantiate these variables. This is not an issue with larger datasets.

7 Conclusion

We have presented a novel sampling algorithm for Dirichlet process mixture models. By alternating between a restricted Gibbs sampler and a split proposal, finite approximations to the DPMM are not needed and efficient inter-cluster parallelization can be achieved. Additionally, the proposed method for constructing splits based on fitting sub-clusters is, to our knowledge, the first parallelizable split algorithm for mixture models. Results on both synthetic and real data demonstrate that the speed of the sampler is orders of magnitude faster than other exact MCMC methods. Publicly available source code used in this work can be downloaded at http://people.csail.mit.edu/jchang7/.

References
[1] K. Bache and M. Lichman. UCI machine learning repository, 2013.
[2] D. M. Blei, T. L. Griffiths, M. I. Jordan, and J. B. Tenenbaum. Hierarchical topic models and the nested Chinese restaurant process. In NIPS, 2003.
[3] D. M. Blei and M. I. Jordan. Variational inference for Dirichlet process mixtures. Bayesian Analysis, 1:121–144, 2005.
[4] C. A. Bush and S. N. MacEachern. A semiparametric Bayesian model for randomised block designs. Biometrika, 83:275–285, 1996.
[5] D. B. Dahl. 
An improved merge-split sampler for conjugate Dirichlet process mixture models. Technical report, University of Wisconsin - Madison Dept. of Statistics, 2003.
[6] M. D. Escobar and M. West. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90(430):577–588, 1995.
[7] S. Favaro and Y. W. Teh. MCMC for normalized random measure mixture models. Statistical Science, 2013.
[8] T. S. Ferguson. A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1(2):209–230, 1973.
[9] P. J. Green and S. Richardson. Modelling heterogeneity with and without the Dirichlet process. Scandinavian Journal of Statistics, pages 355–375, 2001.
[10] W. K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109, 1970.
[11] H. Ishwaran and L. F. James. Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association, 96:161–173, 2001.
[12] H. Ishwaran and M. Zarepour. Exact and approximate sum-representations for the Dirichlet process. Canadian Journal of Statistics, 30:269–283, 2002.
[13] S. Jain and R. Neal. A split-merge Markov chain Monte Carlo procedure for the Dirichlet process mixture model. Journal of Computational and Graphical Statistics, 13:158–182, 2004.
[14] S. Jain and R. Neal. Splitting and merging components of a nonconjugate Dirichlet process mixture model. Bayesian Analysis, 2(3):445–472, 2007.
[15] K. Kurihara, M. Welling, and Y. W. Teh. Collapsed variational Dirichlet process mixture models. In International Joint Conference on Artificial Intelligence, 2007.
[16] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[17] P. Liang, M. I. Jordan, and B. Taskar. 
A permutation-augmented sampler for DP mixture models. In Proceedings of the 24th International Conference on Machine Learning, 2007.
[18] D. Lin, E. Grimson, and J. W. Fisher III. Construction of dependent Dirichlet processes based on Poisson processes. In NIPS, 2010.
[19] D. Lovell, R. P. Adams, and V. K. Mansingka. Parallel Markov chain Monte Carlo for Dirichlet process mixtures. In Workshop on Big Learning, NIPS, 2012.
[20] S. N. MacEachern. Estimating normal means with a conjugate style Dirichlet process prior. In Communications in Statistics: Simulation and Computation, 1994.
[21] S. N. MacEachern and P. Müller. Estimating mixture of Dirichlet process models. Journal of Computational and Graphical Statistics, 7(2):223–238, June 1998.
[22] R. Neal. Bayesian mixture modeling. In Proceedings of the 11th International Workshop on Maximum Entropy and Bayesian Methods of Statistical Analysis, 1992.
[23] R. Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9(2):249–265, June 2000.
[24] O. Papaspiliopoulos and G. O. Roberts. Retrospective Markov chain Monte Carlo methods for Dirichlet process hierarchical models. Biometrika, 95(1):169–186, 2008.
[25] J. Pitman. Combinatorial stochastic processes. Technical report, U.C. Berkeley Dept. of Statistics, 2002.
[26] J. Sethuraman. A constructive definition of Dirichlet priors. Statistica Sinica, pages 639–650, 1994.
[27] E. B. Sudderth. Graphical Models for Visual Object Recognition and Tracking. PhD thesis, Massachusetts Institute of Technology, 2006.
[28] E. B. Sudderth, A. B. Torralba, W. T. Freeman, and A. S. Willsky. Describing visual scenes using transformed Dirichlet processes. In NIPS, 2006.
[29] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. 
Journal of the American Statistical Association, 101(476):1566–1581, 2006.
[30] M. West, P. Müller, and S. N. MacEachern. Hierarchical priors and mixture models, with application in regression and density estimation. Aspects of Uncertainty, pages 363–386, 1994.
[31] S. A. Williamson, A. Dubey, and E. P. Xing. Parallel Markov chain Monte Carlo for nonparametric mixture models. In ICML, 2013.
[32] E. P. Xing, R. Sharan, and M. I. Jordan. Bayesian haplotype inference via the Dirichlet process. In ICML, 2004.