
Submitted by Assigned_Reviewer_5
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The authors propose a new topic model, collaborative topic Poisson factorization (CTPF), which combines ideas from Poisson factorization (PF) and collaborative topic regression (CTR). The idea behind the generative model is intuitive and appealing: learn a latent representation for users and articles that explains the observed ratings as well as the content of the articles. The authors use stochastic variational inference, which enables them to apply the model to large datasets. The authors claim that the use of a shared latent representation allows them to attack the cold-start recommendation problem. I like the idea a lot and I think it can potentially be very useful (e.g. as a web service for discovering new scientific articles), assuming the authors' claims are true. However, the presentation needs to be improved before I can wholeheartedly accept the paper.
 I felt that the paper is targeted at someone who already knows PF and CTR; the text in some sections reads like a "diff file" highlighting differences between the CTPF model and the older models :) It'd be better to present the CTPF model first and then perhaps describe how PF and CTR arise as special cases of CTPF.
 Figure 1: It might be better to move it to the discussion section. \theta and \epsilon are not defined in section 1. I was confused by what "topic offsets" meant until I read the later sections. (Line 72 describes the red bars as "preferences of readers who have the EM paper in their libraries", which is not helpful either.)
 Evaluation: Do you assume that articles not rated are irrelevant? If so, does it not add noise to the computed precision and recall? Are these precision/recall values significant enough that users of such a system would notice an improvement in quality? Can you provide the performance of the "ideal" recommender which sorts by true relevance? Perhaps measures such as Normalized Discounted Cumulative Gain (NDCG) used in information retrieval would be more useful?
 The evaluation compares models just based on precision and recall. I think it would be better to compare other model properties as well (e.g. distribution over number of ratings per article, distribution of number of ratings per article)
 Line 88: Not clear to me why Poisson factorization handles sparse data better than Gaussian factorization. It's very common in recommender systems to let the objective function depend just on the observed ratings which requires us to update just the latent vectors for users/articles with nonzero observations.
 Line 130: PF model seems to use a separate rate parameter (gamma distribution) for each document and user. Is a separate rate not needed here?
Line 140: \epsilon is referred to as "document user weights" here and "topic offsets" elsewhere, which makes it hard to follow. It'd be better to use a consistent reference.
 How was K set in the experiments? How sensitive is the algorithm to Gamma hyperparameters? Some sensitivity analysis on toy data might be helpful.
Line 374: "Content [1]" is misleading. The fLDA paper uses different priors, likelihood and a different inference scheme. I would encourage the authors to compare to fLDA if the source code is available. I guess "topic offset variables" refers to \epsilon?
 Line 414: why are train/test/validation proportions different for the two datasets? 1% of documents as a test set seems too small. Do you average results on multiple train/test splits?
 Figure 5: what do you mean by "readers in the field"? In A), B), C) it would be simpler if you specify what you mean in terms of \theta and \epsilon. There's some discrepancy between the figure and the caption ("ontologies and applications" vs "information retrieval").
 Line 431: The paper seems to end quite abruptly. It'd be good to have a discussion section to summarize the contributions and discuss future directions.
Typos, etc:
 Line 107: use r instead of y
 Line 285: \tau_0, \kappa not defined.
Update (after author rebuttal):
The authors have addressed most of my concerns. I have updated my rating, trusting the authors to make the necessary changes.
I think it'd be helpful to add a brief discussion early on about implicit feedback and how it differs from explicit feedback, for example the differences explained in Hu et al. [13]. IMO, the PF model is more appealing in the implicit feedback setting; the authors' response about why PF is better than competitors in the implicit feedback setting might be worth adding to the main text.
The fact that unobserved responses are treated as 0s in the evaluation is a bit unsatisfying and I am slightly worried how this assumption affects the authors' claims about 4x improvement over state-of-the-art. That said, I understand that evaluation on implicit feedback datasets is less straightforward. I strongly encourage the authors to make the data as well as the code publicly available so that others can study this problem in more detail.
Q2: Please summarize your review in 1-2 sentences
The authors propose an interesting approach to improve recommender systems by using a topic model to model content. The authors combine tricks from several existing papers (recommendation module from PF, the idea of combining topic model with a matrix factorization model from CTR/fLDA, auxiliary variable trick + SVI for scalability) to develop (what appears to be) a novel approach to address the cold-start recommendation problem. Overall, the approach is promising, but the presentation needs to be improved.
Submitted by Assigned_Reviewer_23
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
Summary: The authors present collaborative topic Poisson factorization (CTPF), a generative model of both text and user rating/log data. The running application in this paper is the important problem of recommending scientific articles to people based on previous rating/interaction data. CTPF draws mainly upon two recent models: collaborative topic regression (CTR) of Wang and Blei and Poisson factorization of Gopalan et al. Each document is represented by two latent vectors in K-dimensional topic space: \theta, based on the text of the document, and \epsilon, based on the document's readers. Each user is represented by a latent K-dimensional topic affinity vector, x. Observed word counts for each document are drawn from a Poisson centered on the product of \theta and the topic-word matrix, while the observed user-document ratings are drawn from a Poisson centered on x * (\theta + \epsilon), leading to a very elegant combination of text data and readership data. The authors present both batch and stochastic variational inference algorithms for approximating the posterior, and then experimental results showing state-of-the-art recall and precision @20 performance on two real-world data sets.
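For readers who want to see the generative process in this summary spelled out, here is a minimal sketch in Python/NumPy. It follows only what the summary states (Poisson word counts centered on \theta times the topic-word matrix, Poisson ratings centered on x * (\theta + \epsilon)). The function name `sample_ctpf`, the array shapes, and the Gamma shape/rate values are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def sample_ctpf(n_users=5, n_docs=4, n_vocab=30, K=3, shape=0.3, rate=0.3, seed=0):
    """Draw one synthetic data set from the process described in the summary.

    All sizes and the Gamma shape/rate values are assumptions for illustration.
    """
    rng = np.random.default_rng(seed)
    scale = 1.0 / rate  # NumPy parameterizes the Gamma by shape and scale

    beta = rng.gamma(shape, scale, size=(n_vocab, K))   # topic-word matrix
    theta = rng.gamma(shape, scale, size=(n_docs, K))   # document topic intensities
    eps = rng.gamma(shape, scale, size=(n_docs, K))     # document topic offsets
    x = rng.gamma(shape, scale, size=(n_users, K))      # user topic affinities

    # Word counts: Poisson centered on theta times the topic-word matrix.
    words = rng.poisson(theta @ beta.T)                 # shape (n_docs, n_vocab)
    # Ratings: Poisson centered on x * (theta + eps).
    ratings = rng.poisson(x @ (theta + eps).T)          # shape (n_users, n_docs)
    return words, ratings
```

Because \theta appears in both rates, the text and the readership data constrain the same document representation, which is the coupling the summary highlights.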

Overall, the model is elegant and intuitive, the motivation is real and important, this work is executed well, the paper is written quite clearly, and the results are meaningful. I think this paper should be accepted. Some details below:
Quality:
 The model makes a lot of sense and elegantly addresses the ubiquitous cold start problem by additively combining factors based on the text of articles and the readership of articles in a Poisson factorization setting.
 Inference is efficient due to conditional conjugacy of Poisson factorization.
 Results show meaningful improvement over a state-of-the-art model like CTR (which incidentally won best paper at KDD 2011) and simpler/more factorized variants of the proposed model.
 Have you tried anything beyond binary ratings matrices? The Poisson model certainly doesn't assume binary ratings, and so it would be interesting to see performance in that setting.
Clarity: I can't overstate the importance of describing a model clearly and completely, and the authors do an excellent job of this here. A few points, though:
 line 81: what makes the topics from CTPF more interpretable than latent factors from matrix factorization?
 line 106-107: do you mean "greater than one"?
 line 169-170 and line 214: should say y_ud not y_dv. Also, the 12K indexing of y_ud mentioned on line 241 should probably be moved earlier, since the first time you see it is in Figure 2.
 would be nice to see a full listing of the variational updates in the appendix, to make it easy on others to implement the procedure
Originality and Significance:  This paper builds upon two recent models, and in that sense is incremental, but it is executed well, and the combination of the two ideas (Poisson factorization and collaborative topic regression) leads to the most elegant joint model of text and user ratings I've seen.
Q2: Please summarize your review in 1-2 sentences
Overall, the model is elegant and intuitive, the motivation is real and important, this work is executed well, the paper is written quite clearly, and the results are meaningful. I think this paper should be accepted. However, I'm curious how the model performs on non-binary rating data.
Submitted by Assigned_Reviewer_46
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper proposes a new graphical model that uncovers latent topics for a text corpus and latent interests of users. The main novelty is to use Poisson factorization jointly in the document and user spaces, by incorporating user/document offsets, thus capturing users' deviations from the pure topic intensities. The standard variational tools for training this model are used, which is quite easy in this case since the model is by construction conditionally conjugate.
The paper is very clean and nicely written, and the idea is nice and quite simple. The idea, although seemingly based on several previous tools from the huge graphical model literature, seems to be new (although I am not an expert in this field, so novelty should be assessed by another reviewer).
I only have minor comments on the paper:
 p. 4, l. 12: a sum of Poisson random variables is itself Poisson under independence, which is the case here since the variational distribution is fully factorized. Please make this argument clearer.
 Intuitively, it seems that initializing the latent topics using LDA helps a lot in obtaining good accuracy for the overall algorithm. A sentence explaining to what extent this choice is important would be helpful, in particular, what happens if no such initialization is done.
 p. 6: I am not convinced by the explanation of the choice of hyperparameters a, b, ... How do you choose these numerical values? Is there any cross-validation under the carpet for this?
 p. l.7 l.10: in your text preprocessing, are words stemmed and lemmatized? Is there any hashing involved to reduce dimension?
 p. l.7: I don't understand the sentence "We treat user clicks as...". Please explain more clearly how the users' ratings for documents are constructed from this dataset. It can't be only binary, as ratings are Poisson distributed in your model.
Q2: Please summarize your review in 1-2 sentences
A good paper about a new graphical model for joint factorization with latent factors of document and user/document relations, with good numerical experiments.
Q1: Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 6000 characters. Note however, that reviewers and area chairs are busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We thank the reviewers for their constructive feedback.
> R23. Have you tried anything beyond binary ratings matrices?
> R46. It can't be only binary, as ratings are Poisson distributed in your model.
PF can capture marginal user and item distributions well, and optimizing the likelihood requires iteration over only the nonzero ratings (Gopalan et al. (2013)). These advantages hold even in binary settings; therefore, we prefer it over Gaussian/Bernoulli models.
We focused on binary data sets of implicit feedback. Our extended work will address non-binary data. Our ongoing work suggests that "binarizing" provides better latent components and predictive accuracy. In a pilot study, we developed a censored Poisson model. It was computationally more expensive and it did not give better empirical performance.
> R23. what makes the topics from CTPF more interpretable than latent factors from matrix factorization?
By tying latent factors to topics in content, CTPF allows factors to be interpreted as distributions over words, rather than a list of document titles. Further, the topics of new documents enable cold-start recommendations.
> R5. Evaluation: Do you assume that articles not rated are irrelevant? If so, does it not add noise to the computed precision and recall?
We assume that the unrated articles are observed zeros. This includes rated articles that are held out as part of the test set. The precision/recall metrics measure the ability of the fitted model to rank the rated test articles above the unrated articles.
> R5. Are these precision/recall values significant enough that users of such a system would notice an improvement in quality. Can you provide the performance of the "ideal" recommender which sorts by true relevance? Perhaps measures such as NDCG would be more useful?
Ideal recommender systems would have perfect (normalized) mean precision (100%). Precision at 100 is a challenging task in domains with a large number of items (260K and 600K in our case). Further, we beat the state-of-the-art model with up to ~4x improvement in mean precision.
The NDCG measure may provide additional insight. We will include NDCG scores in our revised paper.
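To make the metrics under discussion concrete, the following is a minimal sketch of precision, recall, and NDCG at a cutoff k for a single user, with every unrated item treated as a zero. The function name `ranking_metrics` and its interface are assumptions for illustration; the paper's exact normalization of precision may differ.

```python
import numpy as np

def ranking_metrics(scores, relevant, k):
    """Precision@k, recall@k and NDCG@k for one user.

    scores:   predicted score for every candidate item (higher is better)
    relevant: set of item indices with a held-out rating; all other items
              are treated as zeros, as in the evaluation discussed above
    """
    scores = np.asarray(scores, dtype=float)
    top = np.argsort(-scores, kind="stable")[:k]          # top-k item indices
    hits = np.array([1.0 if int(i) in relevant else 0.0 for i in top])

    precision = hits.sum() / k
    recall = hits.sum() / len(relevant) if relevant else 0.0

    # Binary-gain DCG: a hit at rank r (1-based) contributes 1 / log2(r + 1).
    discounts = 1.0 / np.log2(np.arange(2, len(top) + 2))
    dcg = float((hits * discounts).sum())
    # The ideal ranking puts all relevant items first.
    ideal_hits = min(len(relevant), len(top))
    idcg = float(discounts[:ideal_hits].sum())
    ndcg = dcg / idcg if idcg > 0 else 0.0
    return float(precision), float(recall), ndcg
```

Unlike precision and recall at k, NDCG rewards placing a held-out item at rank 1 more than placing it at rank k, which is the additional insight a rank-aware measure can offer.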
> R5. I think it would be better to compare other model properties as well (e.g. distribution over number of ratings per article, distribution of number of ratings per article)
Thank you for this suggestion. We agree that model checking, for example, using posterior predictive checks, is useful. We pursue this approach in our extended work.
> R5. Line 88: Not clear to me why PF handles sparse data better than Gaussian factorization.
For sparse, implicit feedback data sets, Gaussian factorization depends either on a careful assignment of confidence weights to each observation (Hu et al. (2008)), or on randomly synthesizing negative examples (e.g., Paquet et al.). PF naturally solves this problem by capturing user activity with a Poisson distribution (recall that a sum of Poissons is a Poisson). Further, under PF, updating the latent variables with all matrix entries is almost as easy as updating with only nonzero entries.
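The claim that using all matrix entries costs little more than using the nonzero ones can be illustrated with the Poisson log-likelihood itself. The sketch below assumes a plain factorization with rate x_u^T \theta_d (in CTPF the item vector would be \theta_d + \epsilon_d); the function names are hypothetical. The sum of rates over all (user, item) pairs factorizes as (sum_u x_u)^T (sum_d \theta_d), so the zeros never need to be visited one by one.

```python
import numpy as np
from math import lgamma

def poisson_loglik_sparse(rows, cols, vals, x, theta):
    """Log-likelihood over ALL entries, touching only the nonzero ones.

    rows, cols, vals: coordinates and values of the nonzero ratings
    x: (n_users, K) user factors; theta: (n_items, K) item factors
    """
    # Rates at the nonzero entries only.
    rates = np.einsum("ik,ik->i", x[rows], theta[cols])
    nonzero_term = sum(v * np.log(r) - lgamma(v + 1.0) for v, r in zip(vals, rates))
    # Sum of rates over every entry, computed in O((n_users + n_items) * K).
    total_rate = float(x.sum(axis=0) @ theta.sum(axis=0))
    return float(nonzero_term) - total_rate

def poisson_loglik_dense(y, x, theta):
    """Reference implementation that visits every entry of the matrix."""
    rates = x @ theta.T
    log_fact = np.vectorize(lambda v: lgamma(v + 1.0))(y)
    return float((y * np.log(rates) - rates - log_fact).sum())
```

The two functions agree up to floating-point error, but the sparse version scales with the number of observed ratings rather than with the full size of the matrix.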
> R5. Line 130: Is a separate rate not needed here?
We presented the simplest content-based PF model. We can extend CTPF with Gamma priors on the rate parameters, but this is not essential to obtain gains from PF.
> R5. why are train/test/validation proportions different for the two datasets? 1% of documents as a test set seems too small. Do you average results on multiple train/test splits?
Only our validation sets differed across data sets. This was done to account for the sparsity of the data. For the out-matrix experiments on arXiv, 1% of the documents corresponds to 8600 articles; we found it to be adequate for out-matrix experiments. We held out 20% of the observed ratings for the in-matrix experiments.
Given the massive size of the data sets, we did not average results on multiple train/test splits. Each data point in Figure 4 is an average across 10,000 users.
> R5. Line 374: "Content [1]" is misleading. The fLDA paper uses different priors, likelihood and a different inference scheme. I would encourage the authors to compare to fLDA if the source code is available.
Thank you for this suggestion. We agree that fLDA is only roughly similar to a "Content-only" model, and will make this clear in our revised version. Our understanding is that fLDA requires having user covariates, which we do not have. Similar to Wang et al. (2013), we compared to our model with the topic offset variables fixed at zero.
> R46. ...what extent LDA initialization is important could be helpful, in particular, what happens if no such initialization is done
We agree; we will discuss the performance from random initialization in the final version.
> R46, R5. How do you choose the Gamma hyperparameters? How sensitive is the algorithm to the hyperparameter settings? How was K set in the experiments?
We fixed the Gamma hyperparameters following Gopalan et al. (2014). Those authors chose hyperparameters from a study on the MovieLens data set, varying the scale/shape hyperparameters in the set {0.1, 1, 10}. The algorithm is seen to be robust to hyperparameter changes. We set K to 100 in all of our experiments.
On massive data sets, such exploration is infeasible. We plan to develop an empirical Bayes based approach in our future work, but we note the superior performance of PF models with fixed hyperparameter settings.
> R46. p. l.7 l.10 in your text preprocessing, words are stemmed and lemmatized? Is there any hashing involved to reduce dimension?
We don't use stemming; we reduce the vocabulary size with tf-idf.
> R5. Figure 5: what do you mean by "readers in the field"?
We mean readers interested in the field, as in, those with a high preference value at that topic.
 