Paper ID: 874
Title: Streaming Variational Bayes
Reviews

Submitted by Assigned_Reviewer_4

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper presents a streaming algorithm for inference in topic models. After each mini-batch, the algorithm maintains the posterior over the model parameters given the data seen thus far. This is in contrast to variational LDA, where the quantity maintained is an approximation to the posterior of the full data set (which requires knowing a priori the size of the dataset). The resulting algorithm can be used in truly streaming settings, doesn't require setting learning rates, and is less sensitive to mini-batch size.

Overall this is an interesting idea. A few comments:

1) Regarding evaluation, I am a bit concerned that you *only* ran with a Dirichlet parameter of 1 (line 312) over \beta (the topic-word distribution). Usually values of .01, .1, or smaller are used for this hyperparameter to encourage sparse topics. In the SVI paper, a value of .01 was used. I am really curious to see how things would change as you vary that prior. Wouldn't that initial prior matter more in your case, since the batch size is more or less "given" and thus the rate at which this prior is overwritten is controlled by that parameter? In SVI the learning rate would help to some degree. Some analysis is required here.

2) I think the main contribution of the paper is the departure from approximating the final posterior to approximating the "running posterior".

3) You can still throw many threads at SVI and scale it (just distribute the VI step over the documents in the mini-batch), why not? Please update table 1 based on that.

4) You can still use Hogwild style update over the natural gradient in SVI as well. Why not?

With that in mind I suggest framing the main contribution around 2 (which is great by itself) unless there is a good argument against 3 and 4 in SVI.

5) nit. You only ran the algorithm in multi-threaded mode. Hogwild itself was proposed as a multi-threaded algorithm and getting hogwild to run in truly distributed settings (across machines) is not trivial with network delay and what not. Please discuss this in the paper.

6) It would help to see how the topics evolve as documents arrive akin to the SVI paper.


Q2: Please summarize your review in 1-2 sentences
A streaming inference algorithm for LDA using variational Bayes recursive updates. Less sensitive to tuning parameters than existing approaches (SVI) and gives comparable results. A few points in the evaluation need to be discussed/considered to make sure that these results carry over to other hyperparameter settings.

Submitted by Assigned_Reviewer_5

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
AC:

This paper presents an interesting idea. It is very well-written and
well-executed. It is a clear accept.

Authors:

This was a fascinating paper. Thank you. I have some comments and
criticisms. I would appreciate a response in the author rebuttal.

I agree with the main idea of your paper: SVI cannot really be
used in a true streaming setting because we have to set D, the data
set size. (Note: In the final version of our long paper about SVI, we
removed references to "streaming data" and cited this as a point for
future work.) Further issues with SVI are its sensitivity to the
learning rate and mini-batch size. Your taking the perspective of
incremental Bayesian updating, as an alternative to SVI, is very
elegant.

- You frame SVI as being "designed" for the single pass setting (e.g.,
in the abstract, p5, and elsewhere). This isn't our perspective about
it. Rather, SVI repeatedly subsamples mini-batches, with replacement,
from the data set. In many applications, a user may want to treat a
single pass, divided into mini-batches, as though this were the case.
But note that this is different from the original conception of the
algorithm. I think you should make this clear.

- If we were to use SVI in a streaming setting and had to artificially
set D, this would essentially interact with setting the learning rate
parameters. I suggest mentioning this in your paper.

- The parallelized asynchronous component is very nice, but the reader
might mistakenly think here that it only applies to SDA-Bayes. You
should note, though it is not the subject of your paper, that this
innovation applies as easily to SVI. As with SDA-Bayes, one can march
forward with asynchronous updates (despite the lack of theory) and
speed up the algorithm.

- My reading of the results is this. SDA-Bayes performs as well as SVI
if we run SVI with one pass through the data. It comes from a
different perspective about approximating the posterior, does not
require setting a learning rate, and handles infinite sets of data
without having to make up a data set size.

- That said, I'm not sure the comparison to the multi-threaded version
is fair. Doesn't that version effectively set the minibatch size to
8,192 (32 * 256)? With that minibatch size, how do the algorithms
compare? (I emphasize that I don't think it matters: getting around
learning rates and D is a significant contribution.)

- In showing how SVI is sensitive to learning rate and D, I suggest
putting SDA-Bayes (at the same mini-batch size) as a horizontal line.
This will emphasize that while users must explore these parameters in
SVI, they don't need to with your algorithm.

Small comments:

- p4: "utilized" -> "used" (optional, of course)

- p6: does [3] use a subset? It does to compare to batch, but I
believe we also analyzed the full Wikipedia corpus.

Citation to mention:

- Ranganath et al. have worked on automatically setting learning rates
in SVI:

R. Ranganath, C. Wang, D. Blei, and E. Xing. An adaptive learning rate for stochastic variational inference. International Conference on Machine Learning, 2013.
Q2: Please summarize your review in 1-2 sentences
This paper presents an interesting idea. It is very well-written and
well-executed. It is a clear accept.

Submitted by Assigned_Reviewer_6

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The authors present a Streaming Distributed Asynchronous (SDA) framework for variational Bayesian computation aimed at the analysis of large datasets. They assume a variational form q(θ) for the posterior on model parameters that is exponential family, which they express as q(θ) ∝ exp[ξ . T(θ)]. Note that this parameter vector ξ = (χ ν) contains both the cumulant of the conjugate sufficient statistics χ and the scale parameter ν. The authors then define a set of parameters ξ_0 for the variational approximation of the prior p(θ), and make the observation that when the data is split up into batches, independent posteriors ξ_b for each batch may be combined to obtain an approximation for the posterior on the aggregate data, ξ = ξ_0 + Σ_b (ξ_b - ξ_0). They propose an algorithm where each new batch is used to update ξ_0, i.e. for each new batch the prior is taken to be the posterior after observation of the previous batch. This algorithm can be run in a distributed setting where each worker returns Δξ_b = (ξ_b - ξ_0) to the master, which then updates its copy of ξ_0 <- ξ_0 + Δξ_b accordingly.
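The update scheme described above can be sketched in a few lines (a minimal illustration, not the authors' implementation; `batch_posterior` is a hypothetical stand-in for any batch inference routine that maps prior natural parameters and a data batch to posterior natural parameters in the same exponential family, here a toy Gaussian-with-known-variance model where the update is exact):

```python
import numpy as np

def batch_posterior(xi_prior, batch):
    """Toy stand-in for a batch VB routine. For a Gaussian with known
    variance, the posterior natural parameters are the prior's plus the
    batch's sufficient statistics (sum of observations, count)."""
    return xi_prior + np.array([batch.sum(), float(len(batch))])

def sda_master(xi_0, batches):
    """Streaming update: each worker computes delta = xi_b - xi_prior
    against the master's current posterior, and the master accumulates
    xi <- xi + delta. With exact updates the result matches the batch
    posterior on all the data, regardless of arrival order."""
    xi = xi_0.copy()
    for batch in batches:
        delta = batch_posterior(xi, batch) - xi  # worker's contribution
        xi += delta                              # master's update
    return xi

rng = np.random.default_rng(0)
data = rng.normal(size=100)
batches = np.split(data, 10)

xi_stream = sda_master(np.zeros(2), batches)
xi_batch = batch_posterior(np.zeros(2), data)
assert np.allclose(xi_stream, xi_batch)  # streaming == batch when exact
```

When the per-batch inference is only approximate (as with VB for LDA), this equality no longer holds exactly, which is precisely where the paper's layers of approximation enter.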

The proposed methodology is simple but useful, primarily because it presents a trivial parallelization strategy that may be applied to any algorithm that obtains an exponential family estimate of a posterior p(θ | x). The main competitor to this method is stochastic variational inference (SVI), a method that uses stochastic optimization based on a gradient estimate of the evidence lower bound. A disadvantage of SVI, as the authors note, is that it is sensitive to parameters for the total number of expected documents, batch size, and step size. However, I am not entirely convinced this is always a bad thing. While tuning parameters can indeed be costly, having such tunable parameters also makes a method more flexible. The presented results on LDA are promising, but they are not sufficient to demonstrate that SDA can match or exceed SVI results in a variety of domains while eliminating the need for such tunable settings.

I would have also liked to have seen a clearer motivation for the proposed distributed update scheme. When comparing their chosen update scheme with one where ξ_0 is held constant for each of the workers, the authors note that the "key difference between the first and second frameworks proposed above is that, in the second, the latest posterior is used as a prior. This latter framework is more in line with the streaming update of Eq. (2) but introduces a new layer of approximation". Unfortunately the implications of this approximation are left unexplored, save for the otherwise unqualified statement that "we find that the latter framework performs better in practice, so we focus on it exclusively in what follows".

It would seem that the essential difference between SDA and SVI is that in SDA ξ grows with the number of documents, whereas the variational parameters have a fixed strength in SVI. This implies that in SDA (1) the entropy of the posterior q(θ) decreases as more data is seen, and that (2) convergence is ensured by virtue of the fact that in later updates (ξ_b - ξ_0) << ξ_0. In SVI convergence is achieved by annealing the step size to 0 and ξ has a fixed strength determined by the total number of documents D, which in an on-line setting one might interpret as a buffer size. It would be interesting to see a more detailed analysis of which of these two strategies is more effective in terms of ensuring rapid convergence while avoiding local maxima.

In short I believe this may be a promising approach to variational inference in large data sets, chiefly because it is by design suitable to distributed inference in a map-reduce type setting. However I do feel more thorough testing is needed before this method can be considered a drop-in replacement for SVI.

--
Minor points / Questions

• I see no reason why SVI could not be performed with distributed updates precisely in the manner proposed here for SDA. Have the authors considered this scenario?

• Are the authors using their own implementation of SVI in the comparison, Matt Hoffman's code (http://www.cs.princeton.edu/~mdhoffma/code/onlineldavb.tar), or the Vowpal Wabbit implementation (http://hunch.net/~vw)? Given that a run-time comparison is performed, it would be desirable to state this explicitly.

• The authors write that in SVI "the stochastic gradient is computed for a single data point (document) at a time". This statement is technically correct but somewhat confusing, since it suggests SVI is performed on single data-points, not batches.
Q2: Please summarize your review in 1-2 sentences
This paper presents a simple and interesting approach to variational inference in large datasets that is by design suitable to a distributed map-reduce type setting, and at the same time eliminates the step size parameters that need to be tuned in SVI. The experimental results however, which only consider LDA, leave some question as to the comparative strengths and weaknesses relative to SVI.

Submitted by Assigned_Reviewer_7

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper proposes an inference framework (SDA) for streaming data in the Bayesian setting. This framework exploits the additive relationship between new data points and exponential family natural parameters, yielding a Bayesian streaming/online update equation that (a) does not require prior knowledge of the number of data points (unlike SVI), (b) has fewer tuning parameters than SVI, and (c) in the case of exact update equations, is perfectly suited to asynchronous updates (though in practice, approximate update equations are used, making the effect of asynchrony somewhat unclear). While SDA's single-threaded performance is about 4 times slower than SVI's, SDA's lack of a need to tune parameters and good asynchronous performance are promising for large-scale distributed inference, particularly at large cluster sizes.

Quality:

The arguments and derivations appear sound, while the experimental evidence is adequate (though a wider variety of datasets would have been welcome, such as the NY Times dataset). However, I found the description of how SDA was applied to LDA using VB to be insufficient. As a matter of completeness, the authors should have provided an explicit, algorithmic description of how SDA uses VB to update the LDA natural parameters, just as they did for EP in the appendix.

Minor comment: In section 2.3, the authors claim that the "second framework works better in practice", despite the fact it introduces additional approximation. I wonder if this is simply a consequence of using VB, which is a kind of local approximation? It seems that giving the workers a recent (albeit inaccurate) version of the posterior parameters should yield a better VB approximation than the prior parameters.

Clarity:

I found the writing to be somewhat lacking in intuition, and a little difficult to follow. I felt as if the key ideas were lost in the math and descriptions, and I had to make multiple passes before I could appreciate the value of SDA.

Originality:

While the idea of updating natural family parameters is not new within statistics, its application to streaming data on complex models that require approximating distributions is new to me. I would say the paper is original in the context of large-scale machine learning on complex models.

Significance:

I believe SDA provides significant utility for large-scale ML, despite being slower than SVI currently. SDA is closer to "working out of the box" than SVI, because SDA has no tuning parameters and does not depend on the number of documents. The fact that SDA can be run asynchronously is a big plus at large scales.
Q2: Please summarize your review in 1-2 sentences
SDA provides significant usability advantages over SVI, though it is currently slower. Some technical and clarity issues detract from the paper's quality, but I nevertheless found it valuable.
Author Feedback

Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 6000 characters. Note however that reviewers and area chairs are very busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We are very grateful to the reviewers for their insightful comments, helpful suggestions, and the errors they spotted. On a general note, we agree with the reviewers that our main goal was to demonstrate a method for calculating the "running posterior" as opposed to a single posterior. We meant for the distributed, asynchronous part to show that our algorithm can be run quickly and robustly in practice, not to differentiate it from SVI. We will endeavor to make this distinction clearer throughout the text. We agree that SVI's computations may be distributed asynchronously, as for general stochastic gradient descent. We respond to other comments in more detail below.

Reviewer_4

1) I am a bit concerned that you *only* ran with a Dirichlet parameter of 1 (line 312) over \beta (topic word distribution). Usually values of .01, .1 or smaller are used for this hyper-parameter

Thank you for noticing this discrepancy. We will examine more carefully how the value of eta affects both SDA-Bayes and SVI in further experiments.

5) nit. You only ran the algorithm in multi-threaded mode. Hogwild itself was proposed as a multi-threaded algorithm and getting hogwild to run in truly distributed settings (across machines) is not trivial with network delay and what not. Please discuss this in the paper.

It is fair to note we use multiple processors rather than multiple machines. However, we feel that discussing this in more depth would be somewhat far from the main thrust of the paper (the running approximation). We do hope to work in distributed settings across machines in the future, particularly given that our algorithm is naturally suited to a map-reduce implementation.

6) It would help to see how the topics evolve as documents arrive akin to the SVI paper.

This would be interesting for future work (in particular, if the dataset is changing over time), but our focus in this paper was to demonstrate an inference algorithm and not really to delve too far into topic models.

Reviewer_5

- You frame SVI as being "designed" for the single pass setting (e.g., in the abstract, p5, and elsewhere).

Thank you for pointing out this mistake; we agree and will correct our wording.

- That said, I'm not sure the comparison to the multi-threaded version is fair. Doesn't that version effectively set the minibatch size to 8,192 (32 * 256)? With that minibatch size, how do the algorithms compare?

This is true. Our main goal with the multi-threaded version was to show that SDA-Bayes can be run in a reasonable amount of time (which we believe is key to practical streaming). For minibatch size, all of our lines in Figure 3 are for the single-thread case to facilitate direct comparison between SVI and SDA across this dimension.

- p6: does [3] use a subset? It does to compare to batch, but I believe we also analyzed the full Wikipedia corpus.

Thank you for calling this error in our discussion to our attention.

Reviewer_6

• The presented results on LDA are promising, but they are not sufficient to demonstrate that SDA can match or exceed SVI results in a variety of domains while eliminating the need for such tunable settings.

We disagree that eliminating the SVI tuning parameters is the main contribution of this paper. Rather, we see this work as solving fundamentally different problems from SVI: finding a running posterior, working out of the box for (any) model that already has a batch approximation in the same exponential family as the prior, and doing these things in reasonable time.

• I would have also liked to have seen a clearer motivation for the proposed distributed update scheme.

We hope to examine this interesting phenomenon more carefully in future work. We hypothesize that this effect may occur due to the lack of identifiability between different permutations of topics. Using the latest posterior for each minibatch calculation may help the algorithm settle on a particular permutation rather than averaging over all permutations. But this is merely conjecture.

• It would be interesting to see a more detailed analysis of which of these two strategies is more effective in terms of ensuring rapid convergence while avoiding local maxima.

We believe that convergence in SVI is fundamentally different from inference in SDA-Bayes. In the SVI case, D' data points (with D' << D) might be sufficient to quickly estimate the posterior for D data points. In the SDA-Bayes case, the number of data points processed corresponds to the size of the posterior being estimated.

• Are the authors using their own implementation of SVI in the analysis comparison

On line 310, we note that we use Matt Hoffman's code and give a link in citation [17].

• The authors write that in SVI "the stochastic gradient is computed for a single data point (document) at a time". This statement is technically correct but somewhat confusing, since it suggests SVI is performed on single data-points, not batches.

We will correct this misleading wording.

• The experimental results however, which only consider LDA, leave some question as to the comparative strengths and weaknesses relative to SVI.

We preferred to study a single model in depth rather than say little about many models given the constrained space. We do hope to address other interesting, complex models in future work. We also note that the original SVI paper focused exclusively on LDA, a model of significant current interest.

Reviewer_7

- As a matter of completeness, the authors should have provided an explicit, algorithmic description of how SDA uses VB to update the LDA natural parameters, just as they did for EP in the appendix.

Excellent idea. We are happy to provide more detail in an additional appendix for our derivations.

- Minor comment: In section 2.3, the authors claim that the "second framework works better in practice", despite the fact it introduces additional approximation.

Please see our response to Reviewer_6.