
Submitted by Assigned_Reviewer_1
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
A Bayesian approach depending only on positive labels is presented for multilabel learning problems. I'm not very familiar with the framework, but assumptions (6)-(9) should be explained in more detail, even if such priors are standard. Is there any restriction if D > L, L > D, or L is small?
The authors focus on Gibbs sampling and EM, but other inference algorithms could be discussed.
Please give the timings in Table 1 for each algorithm.
Q2: Please summarize your review in 1-2 sentences
The Bayesian framework is comprehensive and the paper is clearly written. The method is satisfyingly justified, and the objective is clear.
Submitted by Assigned_Reviewer_2
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper addresses the problem of large-scale multilabel classification. The key feature of the proposed algorithm is the construction of a new likelihood model for multilabel data, through the introduction of auxiliary latent variables.
Inference in this model has a nice property: it scales linearly with the number of positive examples, making the algorithm particularly suited to the sparse-positive-labels case. The generative model is simple enough that all expectations required for the E-step can be computed in closed form, allowing for efficient optimization. The authors then compare their algorithm to several competing state-of-the-art algorithms on different datasets, obtaining very competitive results.
The paper is very well written, the derivations are clear and the experiments show compelling results.
Some specific comments:
The authors stress in section 2.2 that the resulting link function is asymmetric, contrasting this with the logistic link function, which is symmetric. However, couldn't we make it asymmetric by merely introducing bias parameters, as commonly done in GLMs/MLPs?
It seems that this is an interesting remark, but it does not deserve a whole section dedicated to it.
The proposed generative model for labels introduces an intermediate nonlinearity between the features and the labels, Eq. (8). This is also pointed out as one of the innovations of their model (Sec. 6). This extra nonlinearity makes the model similar to an MLP with one hidden layer. It would be interesting if the authors discussed this relationship in more detail or even provided an empirical comparison on the same datasets.
Q2: Please summarize your review in 1-2 sentences
The paper is very well written, the derivations are clear and the experiments show compelling results.
Submitted by Assigned_Reviewer_3
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper is about a new Bayesian method for multi-label learning. The goal is to classify accurately in settings where there are many potential labels but only a few of them apply to each data point.
The basis of the new results is a new generative model for the label vector of each example. Specifically, the label vector y_n of the nth example is generated as
y_n = f(V \sigma(W x_n)),
where W x_n is a lower-dimensional projection of the nth instance x_n, to which an elementwise sigmoid activation \sigma is applied. The final operation f corresponds to drawing Poisson random variables with rates given by V \sigma(W x_n) and thresholding these so-called latent counts by taking the minimum with 1.
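For concreteness, the generative process described above can be sketched as follows (all dimensions, parameter values, and names here are illustrative, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

D, K, L = 50, 10, 100            # feature dim, embedding dim, label dim (illustrative)
W = rng.normal(size=(K, D))      # projection to the lower-dimensional space
V = rng.random(size=(L, K))      # nonnegative factor loadings over labels

def generate_labels(x):
    """Draw one binary label vector y = min(Poisson(V sigma(W x)), 1)."""
    h = 1.0 / (1.0 + np.exp(-(W @ x)))   # elementwise sigmoid of the projection
    rates = V @ h                        # one Poisson rate per label
    counts = rng.poisson(rates)          # latent counts
    return np.minimum(counts, 1)         # threshold the counts at 1

x = rng.normal(size=D)
y = generate_labels(x)                   # binary vector of length L
```

Note that marginally P(y_l = 1) = 1 - exp(-rate_l), which is the Bernoulli-Poisson link the reviews refer to.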
The inference problem boils down to (heuristically) estimating various parameters of the model using Gibbs sampling and EM. In doing so, the authors observe that the latent counts are sufficient statistics of the model parameters.
Evaluation: I found the generative model interesting and natural, as it combines low-rank approximations with nonlinearities in an interesting way. It can also be interpreted as a three-layer neural net.
I'm not sure, however, that inference in this model is really feasible on larger problems. The authors evaluate the approach on four tiny data sets. The largest one has fewer than 20k samples, and one has as few as 40 test samples. It's not clear to me that these benchmarks are representative of real problems. The authors have yet to demonstrate that this approach will be competitive with, say, WSABIE on even medium-size data sets (too large to fit in memory, but manageable on a single machine). In particular, I don't follow the narrative in this paper, which repeatedly emphasizes the "large-scale" or "massive data" aspect when there isn't even a medium-data benchmark.
At any rate, the multi label problem is also interesting for small data and in this regime the approach appears to be competitive with or better than state of the art.
In addition to stronger experimental results, it would also be interesting to have more theoretical support for the inference method. Currently, the approach is heuristic.
Small remark:
- The g() function in (10) appears to be a typo. If it's supposed to be there, please define it.
Q2: Please summarize your review in 1-2 sentences
A new generative model for multilabel learning via Bayesian inference. Experimental results on small data sets are encouraging.
Submitted by Assigned_Reviewer_4
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The authors present a novel learning framework for multilabel prediction problems.
The framework (BMLPL) is a generative approach, which applies a nonlinear transformation to the observed features to generate label vectors.
The approach depends only on positive labels, which proves useful in contexts where the absence of a label could either mean it is negative or unobserved.
Importantly, the authors show that their method scales only with the number of positive labels, a useful feature for sparse label matrices.
The authors then propose efficient inference methods based on Gibbs sampling and EM.
The paper provides a robust explanation of the proposed method, including the presentation of an asymmetric link function that more rapidly approaches 1, making it less reliant on negative labels.
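To illustrate the asymmetry mentioned above, here is a quick numerical comparison of the symmetric logistic link with the complementary log-log (cloglog) link, which approaches 1 double-exponentially fast (a generic illustration, not code from the paper):

```python
import math

def logistic(z):
    # symmetric around 0.5: logistic(-z) = 1 - logistic(z)
    return 1.0 / (1.0 + math.exp(-z))

def cloglog(z):
    # complementary log-log link: 1 - exp(-exp(z)); not symmetric around 0.5
    return 1.0 - math.exp(-math.exp(z))

# cloglog saturates toward 1 much faster than the logistic as z grows
for z in [-2.0, 0.0, 2.0]:
    print(f"z={z:+.1f}  logistic={logistic(z):.4f}  cloglog={cloglog(z):.4f}")
```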
Furthermore, the connection to topic models is an interesting one and provides the added benefit of interpreting the mapping of labels into lower dimensional space.
The author(s) conclude with experimental comparisons to other labelembedding approaches.
Results show superior AUC to prior work, although the comparison to LEML (Yu et al.) does not appear to be significant.
It would be helpful if the authors explore this comparison in more detail.
Furthermore, while the authors claim their method is more robust to label imbalances, there is very little explanation of how this arises from their framework, and no justification for the claim in the experimental results.
Among the 4 datasets utilized, the authors should be able to extract this information.
The experiments further show that BMLPL performs well when labels are removed, showing the benefit of not relying on negative labels.
Lastly, the results show that EM using conjugate gradient approximations efficiently arrives at accurate solutions compared to complete EM and Gibbs sampling.
One criticism is that the computational run time of BMLPL, a major advantage according to the authors, is not compared to other methods either analytically or experimentally.
Regardless, the paper provides an intriguing new direction in multilabel prediction.
Q2: Please summarize your review in 1-2 sentences
The paper is interesting from a multilabel prediction perspective but requires more work in terms of experiments.
Submitted by Assigned_Reviewer_5
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
Nice understandable paper that I can imagine enjoying in person. The use of a Bayesian model with a truncated Poisson makes sense to me.
I don't find AUC tables entirely convincing, though. Even more so here because the differences in AUC, especially for LEML and BMLPL, aren't huge. Plus AUC doesn't give any insight into the errors and misclassification rates.
Q2: Please summarize your review in 1-2 sentences
nice paper
Submitted by Assigned_Reviewer_6
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper describes a generative approach to multilabel classification that uses only positive data, while making no further assumptions about unknown versus negative labels. This leads to a fast learning algorithm for sparse positive data and avoids the problem of balancing classes.
The method exploits correlations between classes through a latent embedding using a Poisson factor model. The core part of the contribution hinges on work by Zhou, Hannah, Dunson, and Carin (2012), who study the Poisson factor model, suggest a trick for multinomial sampling, and derive a Gibbs sampler.
The paper under review applies the Poisson factor model to the task of multilabel classification with the following extensions:
- extension to a cloglog link function
- restriction of the Poisson to exclude zeros
- adaptation of Zhou et al.'s Gibbs sampling to match the link function using Polya-Gamma sampling
- two alternatives to Gibbs sampling: an EM algorithm and EM with a conjugate gradient method.
The paper evaluates runtime performance versus prediction performance (AUC) on four data sets with varying numbers of classes, documents, and positive annotations. Additionally, the performance under missing labels is studied, demonstrating state-of-the-art performance.
The authors compare to recent state-of-the-art methods, such as WSABIE, BCS, and LEML.
Strong points:
- Sound method, with a sound basis in Zhou et al., with significant extensions
- Important machine learning problem, with state-of-the-art results
- An appropriate amount of related work is discussed
Q2: Please summarize your review in 1-2 sentences
Great paper for a relevant problem.
Q1: Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except that your text should be limited to a maximum of 5000 characters. Note, however, that reviewers and area chairs are busy and may not read long, vague rebuttals. It is in your own interest to be concise and to the point.
We thank the reviewers for the careful reading and useful feedback. We will incorporate the suggestions in the final version. Our response to the reviewers' specific questions/comments is given below.
**Assigned_Reviewer_1**
- Our framework doesn't have any restrictions on the relative values of D and L. In fact, the number of labels L could potentially be very large (D can be very large as well).
- Our model enjoys full local conjugacy, which makes it amenable to Gibbs sampling. Furthermore, the analytically available expectations of the latent variables allow efficient EM inference. As we discussed in section 8, it is also possible to extend these inference algorithms to the online setting to scale our framework to even larger data sets.
- We could not report the run times because the implementations of our model and the other baselines weren't directly comparable (MATLAB vs. C/C++). However, our models are expected to be considerably more efficient than the other baselines (especially when the label matrix is sparse and massive) because the likelihood only needs to be evaluated on the nonzero labels (please also see the response to Assigned_Reviewer_5).
**Assigned_Reviewer_3**
- In the final version, we will include another experiment on synthetic data to illustrate the impact of the cloglog link.
**Assigned_Reviewer_4**
- Following the reviewer's suggestion, we will make the discussion of label imbalance (section 2.2) shorter. Adding a bias parameter to the logistic can model the asymmetry; our Bernoulli-Poisson model, however, captures it naturally (and in addition leads to computational advantages).
- As suggested, we will discuss the relationship of our nonlinear-embedding-based model with an MLP with one hidden layer.
**Assigned_Reviewer_5**
- Quantitative improvements over LEML: LEML is one of the state-of-the-art methods for multilabel prediction, and our proposed model performs comparably to or better than LEML in almost all cases. Moreover, our model offers several other advantages over LEML: (1) while the label embeddings learned by LEML are linear, our model can learn *nonlinear* embeddings; and (2) our model also produces more interpretable results, e.g., inferring topics over labels, which can be useful in its own right for many applications (e.g., learning a topic model conditioned on document features). Another key difference is that while LEML is an optimization-based method, ours is a fully Bayesian approach, which makes it more suitable for other tasks such as active learning on the labels.
- More justification of the robustness against label imbalance: In the final version, we will add a synthetic-data experiment, compare with a logistic model, and include a discussion.
- Computational running time: LEML has a C/C++ implementation, so we could not directly compare its running time with our model's MATLAB implementation, but we will reimplement LEML in MATLAB and include a timing comparison in the final version. Note, however, that, analytically, as we discussed in section 4, the running time of our model depends not on the label matrix size but only on the number of positive labels (in contrast, a logistic model for the labels has a cost that depends on the size of the label matrix), and the dependence of the runtime on the other quantities (N, D, K, number of nonzero features in the X matrix, etc.) is similar to LEML's.
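The claimed scaling can be seen directly in the likelihood: under the Bernoulli-Poisson link, the log-likelihood decomposes into a total rate sum, computable without forming the N x L label matrix, plus a correction over the positive entries only. A minimal sketch (all sizes, names, and values here are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
N, L, K = 200, 300, 10                 # instances, labels, embedding dim (illustrative)
Theta = rng.random((N, K))             # rows play the role of sigma(W x_n)
V = rng.random((L, K))                 # nonnegative factor loadings

# sparse positive labels stored as deduplicated (n, l) index pairs
pairs = {(int(n), int(l)) for n, l in
         zip(rng.integers(0, N, 500), rng.integers(0, L, 500))}
rows = np.array([p[0] for p in pairs])
cols = np.array([p[1] for p in pairs])

# sum of ALL rates lambda_{nl} = theta_n . v_l, via (sum_n theta_n) . (sum_l v_l)
rate_sum = Theta.sum(axis=0) @ V.sum(axis=0)

# correction evaluated only at the positive entries:
# log P(y=1) = log(1 - exp(-lambda));  log P(y=0) = -lambda
lam_pos = np.einsum("ij,ij->i", Theta[rows], V[cols])
loglik = np.sum(np.log1p(-np.exp(-lam_pos)) + lam_pos) - rate_sum
```

The cost is O((N + L + nnz) K) rather than O(N L K): only the positive labels enter the per-entry computation.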
**Assigned_Reviewer_6**
- AUC is a useful metric (and is used in several recent papers on multilabel learning) because in many problems we care more about ranking the positive labels higher than the negative labels. AUC is also less sensitive to label imbalance. In the final version, in a separate appendix, we will also add some other metrics such as precision@k.
**Assigned_Reviewer_7**
- Larger-scale problems: Our experiments were on moderately large data sets, if not extremely large ones. We would like to point out that all our experiments were done using a standard laptop with modest hardware (8GB RAM). Other existing methods can be considerably slower on such hardware (especially if the label matrix is massive). On more advanced hardware, our model is readily scalable, easily parallelizable, and can be applied to much larger data sets. Finally, the EM algorithm we proposed can be easily extended to an *online* EM implementation, which is another way to make our method scale to much larger problems.
- About the theoretical support for the inference method: Our model admits closed-form Gibbs sampling (which provably converges to the true posterior). The EM algorithm is also guaranteed to converge to a local optimum.
- We used 'g' to denote a nonlinear mapping from the label space to the lower-dimensional space. Here the function 'g' represents a combination of the conditioning on the features and the Bernoulli-Poisson link. We will make this clearer in the final version.
