Paper ID: 600
Title: Memoized Online Variational Inference for Dirichlet Process Mixture Models

Submitted by Assigned_Reviewer_1

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance.
The authors propose a new online variational inference algorithm, called memoized online variational inference (MOV inference).
The proposed online algorithm is not technically new.
It is not a stochastic-gradient (step-wise) online algorithm but an incremental algorithm that updates parameters by replacing old statistics with new ones.
This framework was proposed by Neal & Hinton.
The difference is trivial: Neal & Hinton's incremental algorithm is based on the EM algorithm and updates parameters after each data point, while the proposed inference is based on variational inference and updates parameters after each batch.
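The replace-old-statistics-with-new pattern the review describes can be sketched as follows. This is an illustrative toy, not the paper's implementation; all names are hypothetical, and the "statistic" is just a scalar sum:

```python
def memoized_update(total, cache, batch_id, new_stats):
    """Swap one batch's cached summary statistic inside the global total."""
    old = cache.get(batch_id, 0.0)   # stale contribution (0 on first visit)
    cache[batch_id] = new_stats      # remember the fresh summary
    return total - old + new_stats   # total always reflects every batch seen

# Toy usage: each batch is summarized by the sum of its data points.
cache, total = {}, 0.0
total = memoized_update(total, cache, 0, sum([1.0, 2.0]))  # visit batch 0
total = memoized_update(total, cache, 1, sum([4.0, 5.0]))  # visit batch 1
total = memoized_update(total, cache, 0, sum([1.5, 2.5]))  # revisit batch 0
# total is now 4.0 + 9.0 = 13.0
```

The key point of contrast with Neal & Hinton is only the granularity: the cached unit is a batch summary rather than a per-data-point summary.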

The authors propose a novel framework for adapting the number of mixture components, called the birth-and-merge method.
The existing constructions of new components are either heuristic or highly specialized to a particular model.
The proposed birth moves can add many new components at once, and are applicable to any exponential family mixture model.
This framework is interesting and a strong point.
However, even if the existing work is either heuristic or highly specialized to a particular model, the authors should compare the proposed method against existing methods such as [4] and [12].
In particular, [4] and [12] are applicable to the Gaussian mixture model.

Q2: Please summarize your review in 1-2 sentences
The birth and merge algorithm is interesting.
However, the experimental setting is not convincing.
Stochastic variational inference is not state-of-the-art in the truncation-free setting.

Submitted by Assigned_Reviewer_5

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance.
The authors propose a novel method for performing memory-efficient inference in Dirichlet Process (DP) models. There are two key advances presented in this work: memoized online (MO) learning of the model parameters using small subsets of data, and an efficient birth-merge move for dynamically altering the truncation level of the stick-breaking approximation.

The first advance is quite similar to approaches for memoized learning with EM algorithms. To my knowledge, however, this approach has not been applied to non-parametric Bayesian models, which is the major novelty of this part of the work.

The birth merge move is in my view the more interesting advance in this paper, as it elegantly handles the issue of proposing births while keeping a limited subset of the data in memory. In the results section the authors apply the model to several synthetic and real datasets and show that MO learning is able to escape local optima, particularly when used with the birth merge move.

The paper is generally well written and easy to follow. There are some small points (listed below) which would make the exposition easier to follow and the experiments reproducible. In particular, the specific settings for the birth-merge parameters would be useful. The work is a nice synthesis of previous work in different settings along with some novel contributions. The problem of scaling DPs to large datasets is generally important, since it allows for clustering with automatic inference of model complexity. The authors' method is simple to implement, efficient, and (except for births or merges) maintains the monotonic increase of the lower bound.

Minor Revisions:

- The exposition of the memoized batch updates is excellent and can be readily implemented from the authors' description. One thing which would make it even easier would be if the authors were to provide the complete set of update equations for the Gaussian emission model used in the experiments in a supplemental file. This would be particularly helpful to readers not familiar with VB methods, for example the computation of $E_q[\log w_k(v)]$.

- When trying to implement the birth/merge moves I found the following points could be clarified in the text or a supplemental file with pseudo code:

1) How was $N'$ (the size of the birth batches) set in the experiments?

2) How was the DP fit in the creation step of the birth move? That is, was Gibbs sampling, batch VB, etc. used? I assume batch VB was used, but it is not stated explicitly as far as I can see.

3) How exactly are the sufficient statistics inflated to protect newborn clusters in the third phase of the birth proposals?

4) For the merge moves, clarify what exactly is cached to compute the entropy term. Is it a single set of $O(K^2)$ values, or one set of values per batch?

- The authors mention several times where components are inserted and deleted in the size-biased ordering of the stick lengths. Could they clarify the significance of these choices?

- I am not sure why the authors did not compare against stochastic online methods using adaptive learning rates, as in reference [2] of the paper.

- I found the bottom row of figure 2 somewhat confusing. What do the colours used in the covariance matrix indicate?

- It would be great if the code for this method and the other methods used in the comparisons could be made available.
Q2: Please summarize your review in 1-2 sentences
A good paper which is worthy of publication after some small revisions.

Submitted by Assigned_Reviewer_6

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance.
The proposed "memoized online variational inference" is a natural technique for exploiting the additivity of sufficient statistics for a certain class of exponential family probabilistic models. At its core, the method simply propagates information between the global and local latent random variables more frequently than traditional VB. In some of my previous research I actually used this technique for a specific model, assuming it was a standard trick for VB (and not knowing of the authors' work), and found that it greatly improved my results. So I can independently attest that memoized updates should be highly applicable and beneficial to VB practitioners. I'm glad to see a paper explicitly work out the updates.

The birth/merge updates are neat, but they're still heuristics: on line 216 the authors mention that "construction of these new components is either heuristic [4] or highly specialized...In contrast our birth moves..." when referring to previous work (but their method is entirely heuristic as well). Other than that, these heuristics seem well-thought-out, and I plan to give them a shot in some of my own work.

How exactly does the citation on line 202 (to [11]: A stochastic gradient method with an exponential convergence rate for finite training sets) motivate your use of cached gradients?

The experiments seem solid, but I encourage the authors to release their experimental code. Of course, the conclusions drawn in this section are data/model dependent, but the authors make an above-average effort to understand their algorithm and test on a number of different circumstances/data. I appreciate that the authors separately compared the memoized algorithm with and without the birth/merge steps.

I would have liked to see a different model besides the DP-GMM explored: the memoized result applies to general exponential family models, after all. From Figure 2: why do the MO and MO-BM results not depend on the number of batches? Are the results the same with N batches? I would also like to have seen an analysis of batch size and its effect on the inference results.

The "discussion" section is a misnomer; perhaps use "summary."

NB: The experimental line plots are not discernible when printing from a black and white printer.

Review summary:

The quality of this paper is high: the results are technically sound and the claims are well supported by their diverse experimental section.

The paper is clearly written and organized.

I would like to run some of these experiments myself, and while the experimental section is fairly detailed, a link to the actual experimental code would be helpful.

The paper is fairly original. My initial perspective was that their update results were somewhat trivial, but a quick literature search indicates that their update method is not (explicitly) in widespread use. It's certainly beneficial to share these types of helpful tricks to practitioners in our field.
Q2: Please summarize your review in 1-2 sentences
This paper is a solid contribution in VB inference that applies to a fairly broad class of probabilistic models.
Author Feedback

Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 6000 characters. Note however that reviewers and area chairs are very busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We thank the reviewers for their helpful comments. In our paper, we develop memoized online variational Bayes (MO-VB), an inference algorithm that does not require any learning rate and is insensitive to batch size. Compared to alternative methods like stochastic online variational Bayes (SO-VB), our approach delivers much more consistent performance without the need for extensive parameter tuning. We would like to emphasize that while our approach builds on a rich variational inference literature, it differs in several key respects which are crucial to its scalability, robustness to initialization (via birth-merge moves), and applicability to nonparametric models.

In some sense, all modern work on variational methods follows closely from the ideas in Neal & Hinton (and various tutorials by Jordan and colleagues, and decades-older work in statistical physics). But there are several novel contributions which do not follow easily, including:
* The nested family of truncations which allows these ideas to be correctly applied to Bayesian nonparametric models.
* The practical demonstration that compared to SO-VB, the MO-VB approach is simpler (in implementation and parameter tuning), equally scalable, and often more accurate.
* The development of birth-merge moves which operate on batches of data, but nevertheless avoid local optima and (in the case of merges) are guaranteed to increase the full-data likelihood bound.
* Achieving all of this in a framework that is directly applicable to a broad range of models.

CODE (R5 & R6):
We are pleased that the reviewers show interest in the code for our algorithms. We are in fact working on an open source, Python software package which implements the various inference frameworks compared in this paper for a range of Bayesian nonparametric models (mixture models, topic models, relational models, time series models, etc.). This code will be publicly released.

* Bryant & Sudderth, NIPS 2012 [4]: In line 347, we do compare our merge moves (which have provable guarantees to improve the ELBO) to the merges found in [4], which use only the current minibatch and have no whole-dataset guarantees. We find that [4]'s merges can cause the ELBO to drop significantly, as shown in Fig. 2's "GreedyMerge" traces.
* Wang & Blei, NIPS 2012 [12]: This work relies on a local collapsed Gibbs sampler to create new components. Such samplers are widely known to be slow to add needed components (and become less effective as dataset size grows). In some internal experiments, we replaced the batch VB in the creation step of the birth move with a collapsed Gibbs sampler, and found it both slower and less accurate (sometimes slightly, sometimes substantially). We excluded these comparisons due to space constraints.
* Ranganath et al., ICML 2013 [2]: Code is not available for the SO-VB adaptive learning rates proposed in this paper, which became available just before the NIPS deadline. We found the details of their approach too complex to reproduce. Our MO-VB lacks learning rates entirely, making it much simpler.

R6: why do MO-VB results not depend on batch size (Fig. 2)?
MO-VB global parameter updates always account for the whole dataset via memoized sufficient statistics. This makes our algorithm very insensitive to batch size, since the influence of each batch is balanced at every update. In contrast, SO-VB is much more sensitive due to its reliance on the current batch alone.
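This batch-size insensitivity can be seen in a toy sketch (ours, not the released code): after one full pass, the memoized global statistic equals the whole-dataset statistic no matter how the data is partitioned. The function name and the use of a scalar sum as the "sufficient statistic" are illustrative assumptions.

```python
import numpy as np

def full_pass_stats(data, num_batches):
    """One memoized pass over `data` split into `num_batches` batches."""
    cache, total = {}, 0.0
    for b, batch in enumerate(np.array_split(data, num_batches)):
        old = cache.get(b, 0.0)
        cache[b] = batch.sum()          # per-batch sufficient statistic
        total = total - old + cache[b]  # replace the stale contribution
    return total

data = np.arange(10, dtype=float)
# Identical global statistic for any number of batches:
assert all(full_pass_stats(data, k) == data.sum() for k in (1, 2, 5, 10))
```

SO-VB, by contrast, scales a noisy single-batch estimate by a learning rate, so its trajectory does depend on how the data is split.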

R6: new birth moves are well-thought-out, but still should be called "heuristic".
We'll find a better word to describe the differences. Our method is certainly more general-purpose than prior work: creating new components by running batch VB on a targeted subsampling of data is applicable to most popular models.

R5: some algorithm details missing.
We appreciate the detailed list. We'll improve figure captions and add needed information to a supplement or longer technical report.

R6: would have liked comparison of MO-VB on another model in addition to GMM.
We are in fact working on extensions to other models, but even fully describing our work on the GMM was difficult in the NIPS page constraints. We intend to experiment more broadly in future papers.

R6: connection to ref [11] is unclear.
We'll improve the description. Like our MO-VB approach, [11]'s algorithm exploits additivity to store a memoized version of a "whole-dataset" gradient that can be updated incrementally.
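For readers unfamiliar with [11] (SAG), the analogy can be sketched as follows: per-example gradients are cached, and each update swaps one stale cached gradient so that a running "whole-dataset" gradient is maintained at single-example cost. This is our illustrative toy for a scalar least-squares problem, not code from either paper.

```python
import numpy as np

def sag_step(w, X, y, grad_cache, grad_sum, i, lr=0.1):
    """One SAG update: refresh example i's cached gradient, then step
    using the incrementally maintained whole-dataset gradient."""
    g_new = 2.0 * (w * X[i] - y[i]) * X[i]  # fresh per-example gradient
    grad_sum += g_new - grad_cache[i]       # swap out the stale contribution
    grad_cache[i] = g_new
    return w - lr * grad_sum / len(X), grad_sum

# Toy usage: fit w in y ~ w*x; iterates approach the least-squares
# solution w = 2 while touching only one example per step.
X, y = np.ones(4), 2.0 * np.ones(4)
w, grad_sum, grad_cache = 0.0, 0.0, np.zeros(4)
for t in range(400):
    w, grad_sum = sag_step(w, X, y, grad_cache, grad_sum, t % 4)
```

The parallel to MO-VB is the additivity: a cached per-unit summary (gradient there, sufficient statistics here) is subtracted and re-added, so the global quantity always reflects the full dataset.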