Paper ID: 263
Title: Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation
Reviews

Submitted by Assigned_Reviewer_1

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper proposed a sequential Bayesian inference method for Dirichlet process mixture models.
The proposed algorithm is simple, but outperformed existing work such as truncation-free variational inference.

My concern is that the model selection depends on the test \rho(K+1) < \epsilon.
That is, the performance of this method depends on \epsilon.
The effective value of \epsilon seems to be sensitive to the properties of the data and the type of mixture component.
How do you choose \epsilon, and how robust is the method to this choice?

The highly related work is:
David J. Nott, Xiaole Zhang, Christopher Yau & Ajay Jasra,
"A sequential algorithm for fast fitting of Dirichlet process mixture models",
http://www.stat.nus.edu.sg/~staja/vsugs2.pdf

I'd like you to comment on this paper.
Q2: Please summarize your review in 1-2 sentences
This paper proposed a sequential Bayesian inference method for Dirichlet process mixture models.
The proposed algorithm is simple, but outperformed more complex methods such as truncation-free variational inference.

Submitted by Assigned_Reviewer_4

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
In this paper the authors presented a sequential variational inference algorithm for Dirichlet process mixture models. The authors used a posterior characterization of normalized random measures with independent increments as the basis for a variational distribution that was then used on a sequential decomposition of the posterior. The algorithm was demonstrated on a Gaussian mixture model applied to real and synthetic data, and a non-conjugate DP mixture of Dirichlet distributions to cluster text documents.

This is a nice paper that aims to make a particular Bayesian nonparametric model useful to analyze massive data sets. The overall presentation is clear and well-motivated. The experiments are explained clearly and demonstrate advantages of the proposed algorithm.

I think that the impact of the paper in expanding the usefulness of Bayesian nonparametric models could be greatly improved by including a supplementary materials section with some proofs and derivations so that a wider audience can understand how the algorithm works and adapt it for their problem. Specifically, including the following would make the paper more self-contained and much clearer:
1. A proof of Lemma 1. The theorem that the proof is a result from is in a pretty dense paper. Restating the theorem and showing how Lemma 1 follows would be very helpful.
2. Derive Equation 12. The variational distribution is non-standard and it's pretty tedious to derive Equation 12.
3. Derive Equations 15 and 16 from Equation 14. The variational distribution is non-standard, the notation for the variational distribution changed from q(D) to q(z,phi), and the sequential variational framework is different enough from the standard setup that it would save some mathematical headaches for the reader to see how Equations 15 and 16 arise.

A few minor comments on the paper that should be addressed: This work should situate itself with respect to the work on sequential Monte Carlo algorithms for Dirichlet process mixture models, which use the same sequential decomposition of the posterior. I think Equation 14 is missing the \phi's in the variational posterior. There are a few parameters of the algorithm, namely \epsilon_d and \epsilon_r; what were these set to for the experiments, or were they found with cross-validation? Lastly, the abbreviation CRF is not standard; why not CRP?
Q2: Please summarize your review in 1-2 sentences
This is an interesting paper that introduces a sequential variational inference algorithm in order to make DP mixture models useful for massive data. The clarity and usefulness of the paper to the research community could be improved with a supplemental materials section with derivations of the results.

Submitted by Assigned_Reviewer_5

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The authors present a fast algorithm for learning DPs using a sequential variational approximation to the predictive distribution. The experiments indicate significant speed-ups compared to state-of-the-art learning algorithms for DPs.

The paper is well-written and fairly easy to follow. The idea of sequential variational approximation to the predictive distribution seems novel (to the best of my knowledge) and the one-pass algorithm is attractive for applying DPs to large datasets. I feel the experimental section could have been better; other than that, I think this is a good paper.

Comments:

I believe the authors incorrectly use CRF (instead of CRP) to denote Chinese Restaurant process in (4) and elsewhere.

Line 358: The authors claim that the poor performance of CGS and TFV is due to random initialization. It might be a good idea to present results for CGS and TFV when they are initialized with the output of SVA to verify if this is indeed the case.

Experimental section:
- How sensitive are the results to the order in which the data is processed? Perhaps the authors could present results for multiple shuffles of the dataset.
- What are the hyperparameter values in Sections 6.2 and 6.3?
- How do CGS and MC-SM perform on the datasets in Sections 6.2 and 6.3?
- A comparison of K for the different methods in Sections 6.2 and 6.3 might be helpful.
- Section 6.3: The experimental setup is not clear. Is each document represented by a 150-dimensional feature vector? How are these features obtained for test documents?
- Some indication of sensitivity to \epsilon_r and \epsilon_d would be useful.

I encourage the authors to make their code publicly available.

Minor points:
\nu not defined in line 120
typo: "along is line" in line 74
line 150: "given by i.e."
Line 383: Not clear how there are 240M feature vectors. There should be only 2000 x 128 = 256K feature vectors per image right?

UPDATE (after rebuttal): The authors' response addressed my questions satisfactorily.
Q2: Please summarize your review in 1-2 sentences
The idea of sequential variational approximation to the predictive distribution seems novel and appears to be a promising direction for scaling up Bayesian non-parametric models to large datasets. Experimental section could be improved.
Author Feedback

Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 6000 characters. Note however that reviewers and area chairs are very busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We thank all reviewers for the constructive feedback.

R1 asked about the setting of epsilon.

Empirically, we found that epsilon influences how quickly new components are introduced. However, when the algorithm is applied to a large dataset, the final results (i.e. the clustering of the samples) are not very sensitive to this value. To be more specific, if we set epsilon to a relatively high value (i.e. we are more conservative in adding new components), components are added at a slower pace, but each data cluster eventually gets its own component. On the other hand, setting this threshold too low introduces many spurious components, and one has to resort to the prune-and-merge steps to clean up the model. We will provide a brief discussion of the influence of epsilon in the final version.

In our experiments, we conducted cross-validation over the synthetic dataset and chose to set \epsilon to 0.001. This setting works well across all the experiments presented in the paper. We plan to further investigate whether it is helpful to adjust its value adaptively on the fly.
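As a minimal illustration of the \rho(K+1) < \epsilon test discussed above (not verbatim from the paper; the function name, the normalization, and the redistribution of the leftover mass are assumptions for illustration only):

import numpy as np

def assign_point(rho, eps=1e-3):
    # Hypothetical sketch: rho is a length-(K+1) vector of (unnormalized)
    # assignment weights; rho[:K] correspond to the K existing components,
    # rho[K] to a brand-new component. Assumes K >= 1 (the first data point
    # always opens a component).
    rho = np.asarray(rho, dtype=float)
    rho = rho / rho.sum()               # normalize to probabilities
    K = len(rho) - 1
    if rho[K] < eps:                    # too little evidence for a new cluster:
        q = rho[:K] / rho[:K].sum()     # fold that mass back into existing ones
        return q, False
    return rho, True                    # keep the (K+1)-dim assignment, open a new cluster

For example, assign_point([0.70, 0.29, 0.01]) with eps=0.001 would open a new component, while a much smaller rho[K] would be folded back into the existing components.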

R1 asked for comments on Nott et al.'s paper.

Similar to our algorithm, the VSUGS method proposed in that paper takes a sequential updating approach, but it relies on a different approximation. In particular, we approximate the joint posterior over both the data allocation and the model parameters, while VSUGS is based on approximating the posterior over the data allocation alone. Also, VSUGS requires fixing a truncation level T in advance, which may lead to difficulties in practice (especially for large data). Our algorithm provides a way to tackle this and no longer requires a fixed truncation.
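Roughly, in informal notation the distinction is

q_SVA(z_{1:n}, \phi) \approx p(z_{1:n}, \phi | x_{1:n})   versus   q_VSUGS(z_{1:n}) \approx p(z_{1:n} | x_{1:n}),

where z_{1:n} denotes the data allocation and \phi the component parameters.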

R2 suggested including supplementary materials with detailed proofs and derivations.

While we have included a sketch of the derivation in Section 4, we agree that providing the detailed derivations in supplementary materials will make the paper clearer and easier to follow. We appreciate this suggestion and will provide such materials along with the final version.

R3 suggested presenting results for CGS and TFV when they are initialized with the output of SVA (instead of random initialization).

We think this is a very good idea, as such an experiment gives a clearer picture of how different parts of the algorithm influence the final performance. We ran a quick experiment to investigate this on the synthetic data and found that using SVA for initialization did considerably improve the performance of CGS and TFV. However, these two methods still tend to introduce spurious or unnecessary components occasionally after initialization.

R3 had several questions regarding the experimental section:

(a) Sensitivity of the results to the order in which the data is processed: in our experiments, the dataset was randomly shuffled, and the algorithm performed similarly across shuffles. However, there might exist (contrived) adverse orderings on which the algorithm would fail.
(b) The hyper-parameter values in Sections 6.2 and 6.3 were found via cross-validation. In particular, for Section 6.2 we set alpha to 10.0 and sigma_x to 16.0, while for Section 6.3 we set alpha to 10.0 and gamma_x to 2.0.
(c) Setup of Section 6.3: it is worth noting that we are not trying to discover topics using a nonparametric model; instead, we simply run LDA to obtain the topics and use the topic-proportion vector to represent each document. A DPMM is then used to cluster the documents into groups.
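A rough sketch of this two-step pipeline, for illustration only: the library choices, the truncation level, and the Gaussian DP mixture stand-in (in place of the paper's DP mixture of Dirichlet distributions fitted with SVA) are assumptions, not the paper's implementation.

import numpy as np
from gensim import corpora, models
from sklearn.mixture import BayesianGaussianMixture

def cluster_documents(tokenized_docs, num_topics=150, truncation=50):
    dictionary = corpora.Dictionary(tokenized_docs)
    bows = [dictionary.doc2bow(doc) for doc in tokenized_docs]

    # Step 1: fit LDA and represent each document by its topic-proportion vector.
    lda = models.LdaModel(bows, id2word=dictionary, num_topics=num_topics)
    features = np.zeros((len(bows), num_topics))
    for i, bow in enumerate(bows):
        for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
            features[i, topic_id] = prob

    # Step 2: cluster the topic-proportion vectors with a (truncated) DP mixture.
    dpmm = BayesianGaussianMixture(
        n_components=truncation,
        weight_concentration_prior_type="dirichlet_process",
    )
    return dpmm.fit_predict(features)

Test documents would be represented the same way, by projecting their bag-of-words onto the already-trained topics.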

R3 suggests making the code publicly available. We appreciate this suggestion and plan to release the code, together with the experiment scripts, when the paper is published.

Finally, we thank both R2 and R3 for catching our mistake of using CRF in place of CRP. We will correct this.