Paper ID: 380
Title: Parallel Sampling of DP Mixture Models using Sub-Cluster Splits
Reviews

Submitted by Assigned_Reviewer_5

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper presents a new MCMC sampling scheme for Dirichlet process mixture models. The sampling scheme can be parallelised, and empirical results are presented that show the method has superior convergence properties to existing methods. These are both important qualities in the context of working with DP mixture models.

The presented method is constructed from 'restricted' samplers, that is, samplers that satisfy detailed balance but are not ergodic. The authors show that one can combine multiple such samplers to produce ergodic chains, hence ensuring the uniqueness of the stationary distribution to which the MCMC chain converges. By working with these 'restricted' samplers, they are able to produce a sampling scheme with the above desirable properties. The resulting sampler, a combination of restricted Gibbs steps and 'split' steps, is valid.

Results are presented on both synthetic (2D Gaussian clusters) and real data. Even without parallelisation, there are gains to be had in terms of convergence. When parallelised, the algorithm of course runs much faster (with regard to wall-clock time).

This is an interesting and useful piece of research. DPs are widely used and are often computationally intensive to work with. Being able to efficiently parallelise MCMC sampling for a DP is therefore very useful. This is also an interesting and novel way to construct an MCMC sampler, and so may represent a new direction for research in this area.
Q2: Please summarize your review in 1-2 sentences
This paper presents a new MCMC sampling scheme for Dirichlet process mixture models. This scheme is highly parallelisable and has good convergence properties.

Submitted by Assigned_Reviewer_6

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The authors propose a parallel algorithm for the DPMM that parallelizes a RJMCMC sampler that jumps between finite models. While the parallelization and the RJMCMC sampler are proposed together, I will separate them for the purpose of this review, in order to ask questions about each part separately.

First, the RJMCMC algorithm (by which I mean the algorithm we would have on a single cluster). Here, we use a reversible-jump MCMC algorithm to jump between finite-dimensional Dirichlet distributions. As an aside, since $\bar{\pi}_{K+1}$ is not used in the mixture model (the mixture model is defined on the renormalized occupied K components), it would seem to make more sense to define a K-dimensional, rather than a (K-1)-dimensional, Dirichlet distribution; this is valid under marginalization properties of the Dirichlet distribution, since equation 10 samples from a distribution proportional to $\pi_1 \ldots \pi_K$.
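(For concreteness, the marginalization property I have in mind is the standard Dirichlet aggregation/neutrality result, written here in generic notation rather than the paper's: if $(\pi_1, \ldots, \pi_{K+1}) \sim \mathrm{Dir}(\alpha_1, \ldots, \alpha_{K+1})$ and $S = \pi_1 + \ldots + \pi_K$, then $(\pi_1/S, \ldots, \pi_K/S) \sim \mathrm{Dir}(\alpha_1, \ldots, \alpha_K)$, independently of $S$ -- which I believe is the property equation 10 exploits.)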

To jump between model dimensionalities, the authors propose a split/merge RJMCMC step that is reminiscent of that of Green and Richardson. In fact, the "naive" choice is very similar to the Green and Richardson method, and I am not sure why it is not valid in this case. I would like to see a more detailed comparison of the Green and Richardson method (and other split/merge methods) with the proposed method, and the reasons for the different design choices.

The parallelization method splits the clusters onto processors with probability related to the similarity between clusters. Each data point can only be assigned to processors within its own cluster. The split/merge method of creating new clusters, and the fact that atoms are explicitly instantiated, ensures that there is no alignment problem of new clusters across processors. Updating the cluster parameters can be done in parallel; however, contrary to line 194, I do not see how sampling the $\pi_k$ can be done in parallel. Further, sampling the superclusters must be done in parallel.

The experiments suggest improved performance; however, I would be interested in seeing further experimental evaluation and information. What is the number of superclusters? How often are $\pi$ and the super-clusters updated? How does performance scale with the number of clusters? Also, it would be interesting to discuss why your split/merge sampler performs so much better than the other split/merge sampler.


Additional points:

Line 107: The Dirichlet process can be sampled by normalizing an infinite sequence of gamma random variables, not beta random variables (or, more accurately, by normalizing a sample from a gamma process). The Sethuraman paper cited describes an iterative algorithm for drawing size-biased samples from a DP, but it is based on the fact that the DP has Dirichlet marginals rather than on the normalization of an unnormalized random measure.
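(To spell out the two constructions I am contrasting, in generic notation and from memory: the normalization view writes $G = \Gamma / \Gamma(\Theta)$, where $\Gamma$ is a gamma process with base measure $\alpha H$; Sethuraman's stick-breaking construction instead writes $G = \sum_k \pi_k \delta_{\theta_k}$ with $\pi_k = v_k \prod_{j<k}(1-v_j)$, $v_k \sim \mathrm{Beta}(1,\alpha)$ i.i.d., and $\theta_k \sim H$ i.i.d. -- so the beta random variables enter through stick-breaking, not through normalization.)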

Line ~079, line ~131: In addition to the approximate truncated and finite methods mentioned, there exist instantiated-weight samplers that are asymptotically correct -- for example, slice sampling and retrospective sampling approaches, or the recent paper by Favaro & Teh.

Q2: Please summarize your review in 1-2 sentences
Scalable inference is an important goal in Bayesian nonparametrics, and this shows signs of being a useful approach; however, I would like to see further evaluation and exploration.

Submitted by Assigned_Reviewer_7

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
Massively Parallel Sampling of DP Mixture Models using Sub-Clusters

The paper proposes a new sampling strategy for Dirichlet process mixture models that can be applied both in the conjugate and non-conjugate setting. The method is based on restricted Gibbs sampling, which can be computed in parallel, and split-merge moves. Together, these two move types form an ergodic Markov chain, which importantly can be implemented in a highly parallel fashion. An important idea used in the paper is that the parameter space is augmented with sub- and super-clusters which serve to facilitate both the split moves and the restricted Gibbs moves.

Efficiently sampling from Dirichlet process mixtures is an important problem, and this paper makes a significant contribution towards designing correct parallelizable MCMC methods for this problem.

Regarding the use of super-clusters in the restricted Gibbs move, I am not sure I understand the argumentation. If we think of Algorithm 1 as just some random procedure for choosing a super-cluster, which then selects the particular restricted Gibbs proposal used subsequently, then for validity that random procedure must not depend on the current state of the MCMC chain, but it does. If we instead think of the super-clusters, g, as another parameter in an augmented parameter space, then Algorithm 1 simply defines the conditional distribution of g given all other parameters; however, when sampling the parameters conditioned on g, the joint distribution of the parameters and g should be used when computing the acceptance probability. Perhaps this is what the authors suggest, but I am not sure, or maybe I misunderstand something?

Another thing that puzzles me is the proposed split procedure. As I understand it, the authors propose a split-merge procedure that in itself is a valid MCMC procedure - thus in theory running this alone would sample correctly from the posterior. However, it is noted that the merge step is (almost) never accepted and can be omitted. I would have liked some more discussion about this, since this (for me at least) goes against intuition.

The ideas in the paper are well presented, although I found some of the argumentation regarding the choices of distributions for the auxiliary variables as well as the above mentioned details difficult to follow. I think the introduction, review, and experimental section are well written and sufficiently detailed.

I look forward to hearing the authors' response to my comments.

UPDATE: I am happy with the authors' comments and confident that this paper is a solid and important contribution.
Q2: Please summarize your review in 1-2 sentences
A well written paper with interesting and useful new ideas, but with some unclear parts that could be improved.

Submitted by Assigned_Reviewer_8

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
Clarity:
Lots of little issues with the references: "dp", "dirichlet", "Statstica", "markov"
Good writing and grammar.

Quality:
General parallel clustering methods need to be cited as well: the database community has looked at distributed clustering for years, starting with BIRCH by Zhang, Ramakrishnan and Livny, the use of variants of KD-trees, and simple map-reduce-style approaches in an EM framework. Bradley and Fayyad had papers on other scaling and distributed methods in the late 90s. The effort here is *not* on the mixture-model esoterics; it's on large-scale distance/log-prob calculations.

Split-and-merge clustering algorithms in an EM style, keeping a pair of sub-clusters, were published in 1970 by Wallace and Boulton with SNOB. The Jain and Neal split-merge proposals do not appear to work well on larger problems, and so MCMC split-merge approaches are important.

The focus on Dirichlet process mixture models could be relaxed. One can use Dirichlet mixture models with a large K and small alpha as well, since these often compete well with DPs. So the focus should be on probabilistic mixture models generally. That said, the authors do a great job of simplifying things and basically removing any complexity created by the DP. Equation (7) is the key. Using DPs to "pick the right number of clusters" is not always significant ... one can use Pitman-Yors, and even Dirichlets, for this, and setting the other hyperparameters can greatly affect the number of clusters chosen.

This paper is a great tour of techniques in sampling. Very good read. I made lots of notes on technical details to check, but everything seemed to pan out. I would like to see a better theoretical statement about how multiple restricted samplers can be stitched together to create a correct sampler. Automatically rejecting the merge moves (end of section 5.3) is an interesting result!

The results section is way too short, though I don't know what could be eliminated to give more room for it. The results seem to justify the approach, though the sizes are a bit small.

One problem I see: 3000 seconds for a full Gibbs cycle versus 5 seconds for your algorithm. That is a 600-fold speedup. Perhaps your tricks with the g variable have allowed a good speedup, but 600?

Originality:
The combination of the sub-cluster idea with the special samplers and the restricted samplers is original in my understanding. Perhaps better split-merge MCMC sampling exists already?

Significance:
This is the first significant attempt I have seen to address split-merge clustering in MCMC using methods from early EM-style algorithms. Anyway, this is a significant development that should open the way for many other problems where split-merge is needed, not just clustering. I have some other applications in mind, e.g., topic models!!!
Q2: Please summarize your review in 1-2 sentences
Authors: please reread the review; there have been some changes.

The related works section is poor. The contribution, however, seems to be a tractable split-merge sampling procedure for MCMC clustering, and that is very important, for the Dirichlet distribution too.
Author Feedback

Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 6000 characters. Note however that reviewers and area chairs are very busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We thank the reviewers for their insightful and helpful comments. They will be taken into account in any final submission of the work. We would appreciate suggestions from the reviewers for sections that can be shortened so that other sections that need more discussion can be expanded. Individual reviewer comments are addressed below.


==R6==
We believe R6 has a misconception about this work. The majority of computation time in typical methods is spent sampling cluster labels. The presented algorithm exhibits intra-cluster parallelization via super-clusters *and* inter-cluster parallelization, since each point's label assignment can be sampled in parallel via the restricted Gibbs sampler. Inter-cluster parallelization samplers typically require approximations or result in slower convergence (see below ++). The presented method does not suffer from these issues.

“make more sense to define a K-dimensional rather than a (K-1)-dimensional Dirichlet”
We are unclear as to what R6 means. While the weights in Eqn 8 are a (K+1)-dimensional Dirichlet, the restricted sampler only considers the non-empty, K-dimensional Dirichlet. There is no (K-1)-dimensional Dirichlet.

“I am not sure why [the naive choice is] not valid”
It is valid, but it results in posterior distributions that are difficult to sample from and that go against intuition [Ln 250-257]. Details were left for the supplement [Suppl. Ln 40-85] so as not to detract from the paper.

“comparison [to] Green and Richardson [10]”
[10] develops a general framework for split/merge moves, but does not explicitly specify how to choose their “mock weights”, w, or their mapping “vector functions”, m. Proposed splits are unlikely to fit the posterior since the auxiliary variables, u and w, governing the split cluster parameters and weights are proposed independently of the data. Furthermore, similar to all other split methods [4,14,15], a split is constructed in linear time and fed into a Hastings acceptance step. If the split is rejected, considerable computation is wasted, and all information gained in learning the split is forgotten. In contrast, our splits are learned over many iterations via the parallel restricted Gibbs sampler. Conditioned on the sub-clusters, the proposed splits are evaluated in constant time [Ln 69-78, 280-285, 336-350].

“sampling the pi_k... in parallel”
They are sampled by normalizing K independent draws from a Gamma distribution. The normalization term can be computed with a parallel reduction, and the actual sampling and division are independent and parallelizable.
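As an illustrative sketch only (hypothetical names, not our actual implementation), the normalized-Gamma construction of a Dirichlet draw looks roughly like the following; every step except the single reduction is embarrassingly parallel across clusters.

    import numpy as np

    def sample_pi(alpha):
        # One independent Gamma(alpha_k, 1) draw per cluster; the draws are
        # mutually independent, so they can be generated in parallel.
        gammas = np.random.gamma(shape=alpha, scale=1.0)
        # The normalizer is a single sum (a parallel reduction in practice);
        # the final divisions are again independent across clusters.
        return gammas / gammas.sum()

    # e.g., K = 3 clusters with posterior Dirichlet parameters (counts plus prior mass)
    pi = sample_pi(np.array([12.5, 7.0, 0.5]))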

“number of superclusters?”
This is automatically determined from our algorithm. While the count depends on the data, we typically see 1-5 super-clusters for ~20 clusters.

“How often are pi and the super-clusters updated?”
We alternate the sampling of {labels and sub-cluster labels}, {cluster parameters and weights}, and {super-cluster labels}. These details were mistakenly left out and will be included in any final submission.

“scale with number of clusters?”
It should scale similarly to other sampling algorithms, since the core of the algorithm is the same.

“why [is performance] better than the other split/merge sampler.”
See [Ln. 69-78] and the comparison to other split/merge methods above.

“The DP can be sampled by normalizing an infinite sequence of gamma”
Agreed, we will correct the error.

Missing references++
The slice/retrospective samplers of [Walker, Papaspiliopoulos & Roberts, and Favaro & Teh] are exact instantiated-weight samplers. In these cases, if the cluster parameters (theta) are not instantiated, the sampling cannot be parallelized. If they are instantiated, empty cluster parameters are sampled from the prior, and it is unlikely to obtain a parameter that fits the data [Ln. 135-137]. The final submission will include the references.


==R7==
“when sampling the parameters conditioned on g the joint distribution of the parameters and g should be used”
We circumvent this problem with the following relationship: p(theta,g|x,z) = p(theta|x,z) p(g|theta,x,z). We can then draw a joint sample of theta and g by theta~p(theta|x,z) [Eqn 9], followed by g~p(g|theta,x,z) [Alg 1]. However, as indicated by the results, the super-clusters do not have much of an effect in the ~20-cluster datasets.
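To restate this in generic notation: since p(theta,g|x,z) = p(theta|x,z) p(g|theta,x,z), composing the two exact conditional draws theta~p(theta|x,z) and then g~p(g|theta,x,z) yields an exact joint sample of (theta,g); because g is drawn from its exact conditional, no additional Metropolis-Hastings correction involving g is required (the standard blocked Gibbs argument).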

“split-merge procedure that in itself is a valid MCMC procedure”
Actually, without the restricted Gibbs sampler, the proposed deterministic splits would never change, and the resulting chain would not be ergodic [Ln 280-350].

“the merge step is never accepted... more discussion, since this goes against intuition.”
Detailed discussion was left to the supplement [Suppl. Ln 182-256] so as not to distract from the main points, and due to space limitations.


==R8==
“parallel clustering methods need to be cited”
These were omitted due to the focus on DPMMs. We will include them in a final submission, space permitting.

“The focus on DPMMs is misdirected”
We suppose this is a matter of preference. While the methodology applies to generic mixture models, we felt it was interesting that combining splits with restricted moves creates an ergodic chain. Restricted moves are unnecessary for finite mixture models as all clusters can be instantiated.

“I would like to see a better theoretical statement about how multiple restricted samplers can be stitched together to create a correct sampler”
It is quite difficult to give general conditions on combining multiple restricted samplers for an arbitrary model. The restricted samplers must combine to make the chain reachable and aperiodic, but these are just the general principles of ergodicity.
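For reference, the standard argument we rely on, stated in generic terms: if each restricted kernel T_i leaves the target distribution pi invariant (e.g., each satisfies detailed balance with respect to pi), then applying them in any fixed sequence, or choosing among them at random with fixed probabilities, also leaves pi invariant; if the combined chain is additionally pi-irreducible and aperiodic, it is ergodic with pi as its unique stationary distribution. The model-specific work lies in verifying irreducibility and aperiodicity of the particular combination, which is what the restricted Gibbs moves together with the split moves provide here.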

“the [dataset] sizes are a bit small”
We test on datasets that are feasible to run with typical algorithms. EDIT: Thanks to R8, we have corrected an inefficiency present in all algorithms, which improves performance for the final submission.