*Submitted by Assigned_Reviewer_1*
**Q1**: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper proposes a novel probabilistic deep architecture for modelling count data. In addition, the authors propose a modelling extension for multi-class classification problems (i.e., discriminative modelling). The authors derive two inference techniques, a sampling-based one and a variational one, and show empirical gains over unsupervised methods on several large-scale datasets.
Questions/Comments:
- I found the model to be clearly explained and motivated. I enjoyed reading Section 2.3. I think adding a figure to illustrate Equation 3 would be useful.
- I think it may be slightly misleading to imply that [23] does not scale with the number of zeros. My understanding is that using a Poisson likelihood allows all methods to effectively ignore the zeros. Further, it seems that most of the computational gains come from that rather than from the higher layers (which are typically much smaller). [23] also allows non-linear activation functions.
- Computation: it appears that your model scales well. It would be interesting to have an idea of how computation scales and how long it takes to learn on these larger datasets. Providing a rough comparison to competing models (docNADE, LDA, and replicated softmax) would also be useful.
- When reporting the results of the classification experiments, it seems that you compare only to unsupervised techniques, so the comparison is not entirely fair. It would be good to add at least one simple supervised baseline (e.g., a small neural net with a softmax output and the word frequencies as inputs).
Other comments:
- Line 165: sparsity of the Dirichlet relies on your choice of parameters (eta). I think it would be good to make this clear.
**Q2**: Please summarize your review in 1-2 sentences
This is a good paper. The model scales, and pushes our understanding of deep generative models. The discriminative extension is also worth noting. Empirical results are relatively good as well.
*Submitted by Assigned_Reviewer_2*
**Q1**: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper presents a multilayer model of count vectors using Poisson Factor Analysis at all layers (providing interpretable topics) and binary units connecting these layers (learning topic correlations). MCMC and SVI inference are straightforward -- all conditional posteriors are in closed form -- and inference scales with the number of non-zero observations and hidden units. The model is an incremental change from Zhou et al. (2012), removing the global beta-Bernoulli process and using the Bernoulli-Poisson link to avoid using sigmoid belief networks.
Both the model and inference described in the paper are elegant. The model is only an incremental improvement from prior work (notably Zhou et al. (2012)), but it's likely to be of significant interest to the community.
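For readers less familiar with it, the Bernoulli--Poisson link mentioned above (introduced by Zhou, 2015) can be sketched in a few lines; the notation below is illustrative, not necessarily the paper's own:

```latex
% A binary unit b is obtained by thresholding a latent Poisson count n:
n \sim \mathrm{Poisson}(\lambda), \qquad
b = \mathbf{1}(n \ge 1), \qquad
\Pr(b = 1 \mid \lambda) = 1 - e^{-\lambda}.
```

Given b, the latent count is recovered exactly: n = 0 when b = 0, and n follows a zero-truncated Poisson when b = 1, which is what keeps the conditional posteriors in closed form.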
1. The experimental analysis has been done with an eye towards comparing to other deep and one-layer models. However, there's hardly any effort in exploring the proposed model itself.
a. How did you fix the layer widths? Have you studied more than two layers?
b. How does the model deal with overdispersion in the data?
c. How should the layer widths decay with depth?
2. It is surprising that even your one-layer model does significantly better than ORSM and LDA (which is similar to PFA). Is this due to your approach to discriminative topic models? No explanation is provided.
3. Why is ORSM not included in Table 1?
**Q2**: Please summarize your review in 1-2 sentences
The paper presents a tweak to existing deep network topic models (combining ideas from Zhou et al. (2012) and Zhou et al. (2015)) and shows how a hierarchy of Poisson Factor Analysis units can be connected using hidden binary units.
Although an incremental contribution, both MCMC and variational inference are made much simpler due to local conjugacy, and the experimental results show superior performance. The paper is lacking an experimental investigation of the network structure -- instead, the structure seems to be arbitrarily fixed. It is still likely to be of much interest to the deep learning and topic modeling communities.
*Submitted by Assigned_Reviewer_3*
**Q1**: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper extends previous work on using deep Poisson Factor Analysis (PFA) for topic modeling by using a Bernoulli-Poisson link instead of logistic functions. The paper also describes a way to jointly model documents and their associated discrete labels. Experiments show the proposed method outperforms related baselines in held-out perplexity and classification accuracy.
The paper extends existing work (Gan et al, ICML'15 in particular) to provide a more flexible way to define the prior on documents' proportions over topics. Even though mostly a combination of existing ideas, I think the paper provides some advances in applying deep models for topic modeling. Here are some of my detailed comments:
- It is interesting to see that both MCMC and variational inference techniques are included. One of the arguments for using VI in the paper is its scalability, so it would be interesting to see a running-time comparison between the two inference techniques.
- Jointly capturing documents with their associated metadata is a well-studied problem. I wonder why the performance of traditional supervised topic models such as sLDA (for classification) is not included for comparison in Section 5.
- For readers who are not familiar with the conventional notation of deep models, I would suggest including a figure in Section 2 to illustrate the different layers and their inputs/outputs.
**Q2**: Please summarize your review in 1-2 sentences
The paper extends and improves previous work on using deep Poisson Factor Analysis for topic modeling. While mostly a combination of existing ideas, I think the paper provides some advances in applying deep models for topic modeling.
**Q1**: Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except that your text should be limited to a maximum of 5000 characters. Note, however, that reviewers and area chairs are busy and may not read long, vague rebuttals. It is in your own interest to be concise and to the point.
We thank the reviewers for their encouragement and constructive criticism; we address their main concerns and comments below. We use the following abbreviations: R1 refers to reviewer 1, and R1/3 refers to both reviewers 1 and 3.
R1/3/4/5: We will provide runtimes as proxies for the computational cost in the experiments.
R1/3/4: We will add results with supervised baselines: a neural network and sLDA, as suggested by the reviewers.
R1/3: We did not include a graphical model in Section 2 due to space limitations, but we will be happy to add it to the supplementary material.
R1: We will make clear that [23], since it is based on a Poisson likelihood, can effectively ignore zeros, and that it also allows for non-linear activation functions.
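As an aside, the "ignore the zeros" point can be made concrete with a small sketch (hypothetical function and variable names; it assumes, for illustration, a rank-K Poisson factorization with rate matrix theta @ phi): the x*log(lambda) term vanishes at zero counts, and the -sum(lambda) term factorizes, so the log-likelihood never touches the zero entries of the count matrix.

```python
from math import lgamma

import numpy as np


def poisson_loglik_sparse(rows, cols, vals, theta, phi):
    """Poisson log-likelihood of a count matrix whose rate is theta @ phi,
    computed while touching only the non-zero entries.

    rows, cols, vals: COO representation of the observed counts.
    theta: (D, K) document-by-factor rates; phi: (K, V) factor-by-word rates.
    """
    # Rates at the non-zero entries only: lambda_dv = theta[d] . phi[:, v]
    lam_nz = np.einsum('ik,ki->i', theta[rows], phi[:, cols])
    # The -sum(lambda) term factorizes, so the dense D x V rate matrix is
    # never materialized: sum_dv lambda_dv = (sum_d theta_d) . (sum_v phi_v)
    total_rate = theta.sum(axis=0) @ phi.sum(axis=1)
    # log Pois(x | lam) = x*log(lam) - lam - log(x!); the first and last
    # terms are zero at x = 0, so only non-zeros contribute.
    log_fact = sum(lgamma(v + 1.0) for v in vals)
    return float((vals * np.log(lam_nz)).sum() - total_rate - log_fact)
```

The same factorization is what lets Poisson-likelihood models scale with the number of non-zeros rather than with the full D x V matrix size.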
R1: We will make clearer that sparsity of the Dirichlet prior relies on the choice of \eta.
R2: We fixed the layer widths to be consistent with DPFA's experiments. As stated in the paper, we also tried models with more hidden units per layer and one additional layer, but the results were not significantly different from those shown in Table 1. It is worth noting that because our model has sparse loadings at all its layers, due to the Dirichlet prior used, it should be fairly robust to layer-width settings, as long as the widths are large enough.
R2: The model accommodates overdispersion well via r_k in the prior distribution for \theta; in fact, it can be shown that marginalizing out \theta in our Poisson-Gamma formulation yields a negative binomial likelihood with parameter r_k [30], meaning that our model effectively decouples the mean and variance of the observed data. This interesting point will be added to the paper.
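The marginalization appealed to here is the standard Gamma--Poisson mixture identity; a sketch, using an assumed shape--scale parameterization (p_k is illustrative notation, not necessarily the paper's):

```latex
x \mid \theta \sim \mathrm{Poisson}(\theta), \quad
\theta \sim \mathrm{Gamma}\!\left(r_k,\ \tfrac{p_k}{1-p_k}\right)
\;\Longrightarrow\;
p(x) = \int_0^\infty \mathrm{Pois}(x \mid \theta)\, p(\theta)\, d\theta
     = \frac{\Gamma(x + r_k)}{x!\,\Gamma(r_k)}\, p_k^{x}\, (1-p_k)^{r_k},
```

i.e. x ~ NB(r_k, p_k), with mean r_k p_k / (1 - p_k) and variance r_k p_k / (1 - p_k)^2. The variance exceeds the mean by the factor 1/(1 - p_k), which is the mean--variance decoupling the response refers to.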
R2: We did not investigate how layer widths should decay with depth, but we agree with the reviewer that it would be interesting to see; we leave it as future work.
R2: We believe that the difference in performance between LDA, ORSM, and our discriminative PFA is due to our joint learning approach; we will make this clearer.
R2: We will include ORSM results in Table 1; the difference in perplexity between RSM and ORSM is not that large, about 20 units in [25].
R4: We did not include DPFA in the classification results because the model was not originally conceived for that purpose [6]. We agree that a discriminative version of DPFA is possible, but it is beyond the scope of our paper.