Review for NeurIPS paper: Temporal Positive-unlabeled Learning for Biomedical Hypothesis Generation via Risk Estimation

NeurIPS 2020

Review 1

Summary and Contributions: This paper proposes to use temporal PU learning to tackle the hypothesis generation (HG) problem. Specifically, the authors formulate the HG problem as future connectivity prediction on a dynamic attributed graph. The authors claim that the experiments on COVID-19 datasets validate the effectiveness of the proposed model.

Strengths: 1. The idea of combining PU learning, GRU, and GraphSAGE is novel. 2. The application to COVID-19 is much appreciated.

Weaknesses: 1. The writing and presentation of this paper can be further improved. 2. The proposed method is not compared with SOTA methods, and some important baselines are missing. 3. The code and datasets are not provided, which hurts reproducibility. I understand that the data may be confidential, but the authors should at least provide the code for their algorithm.

Correctness: Maybe correct.

Clarity: Can be improved.

Relation to Prior Work: Some important prior works are missing.

Reproducibility: No

Additional Feedback: 1. In the Introduction, when I read the sentence “When two terms co-occurred at time t in scientific discovery …”, I was not sure what the “terms” are in this paper. The authors should explain this notion earlier and more clearly. 2. The authors claim that they propose a variational inference model to estimate the positive prior. In fact, the main idea is to minimize the difference between two distributions. Such an idea has already been presented in “Class-prior Estimation for Learning from Positive and Unlabeled Data” (ACML 15). The main difference is that the ACML paper uses the L1-distance while this paper uses the KL divergence. Therefore, the ACML method should be discussed and compared. 3. In the related work on PU learning, there are other important works that treat the unlabeled examples as negatives with label noise, such as “Loss decomposition and centroid estimation for positive and unlabeled learning” (PAMI 19) and “Positive and Unlabeled Learning via Loss Decomposition and Centroid Estimation” (IJCAI 18). These two papers can be cited. 4. The compared PU baselines are a bit weak. The authors claim that “The used SOTA PU learning methods include [14] …” However, [14] was published in 2008 and is not SOTA. SOTA includes nPU, nnPU, LDCE, and some GAN-based PU models. Therefore, the comparisons with PU learning methods should be improved. 5. The text in the figures is too small and unclear, for example in Fig. 1 and Fig. 2. 6. The presentation of this paper can be further improved, for example “The idea is to find variational distribution variables theta* that minimize the Kullback-Leibler (KL) divergence” should be “The idea is to find variational distribution variables theta* that minimizes the Kullback-Leibler (KL) divergence”. Generally, I feel that this is a borderline paper. However, considering that this paper contains some publishable results, I currently give a positive score to this paper.
-----------------------------update after rebuttal---------------------------- I thank the authors for addressing my concerns. I hope the experimental part of this paper can be enhanced if this paper is finally accepted. Besides, the release of the code and datasets as promised by the authors is also appreciated.

Review 2

Summary and Contributions: The paper designs a novel PU learning algorithm named TRP to predict future connectivity on a dynamic attributed graph (the HG problem). The algorithm can be divided into two parts. The first part transforms the HG problem into a PU learning problem, so that PU methods based on the unbiased risk estimator can be applied. The second part estimates the positive prior, also by optimizing an objective function related to the ELBO.

Strengths: The ideas and process of the algorithm are sound and clear, and the proposed method achieves comparatively high accuracy on ground-truth datasets. The major contribution of this paper is to transform a practically important problem, the medical HG problem, into a PU learning problem. It shows the potential of PU learning in real-world applications besides image classification.

Weaknesses: 1) Class prior estimation is an intractable problem for PU learning, and it is hard to identify this quantity from data without the assumption of irreducibility (the negative distribution cannot be a mixture that contains the positive distribution). The authors seem to avoid this problem by using a Gaussian mixture model with two components, but this leads to another problem: is the GMM suitable for the data? 2) This is related to the point above. Can the authors show (at least by numerical results) that the class prior is well estimated in the experiments? I am curious how the ratio of the positive dataset to the unlabeled dataset influences the effectiveness of the model. 3) It is interesting to see that nnPU is worse than uPU in the experiments. I think more analysis is required, because the optimal PU classifier must satisfy "the non-negative restriction of the risk estimation" according to the theoretical analysis. ==== Update after rebuttal: I thank the authors for addressing the main points. But more details on the estimation of the class prior are still required.
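
For readers less familiar with the uPU/nnPU distinction raised in point 3, the following is a minimal sketch of the two empirical risks, not the paper's implementation. The function names, the sigmoid surrogate loss, and the example scores and prior are assumptions chosen for illustration. It only shows the risk values; the training procedure of nnPU is not reproduced here.

```python
import numpy as np

def sigmoid_loss(scores, label):
    # Sigmoid surrogate loss l(z, y) = 1 / (1 + exp(y * z)), with y in {+1, -1}.
    return 1.0 / (1.0 + np.exp(label * np.asarray(scores, dtype=float)))

def pu_risks(pos_scores, unl_scores, prior):
    """Return (uPU risk, nnPU risk) for classifier scores on positive and
    unlabeled samples, given an assumed class prior p(y=1)."""
    pos_as_pos = sigmoid_loss(pos_scores, +1).mean()  # positives treated as positive
    pos_as_neg = sigmoid_loss(pos_scores, -1).mean()  # positives treated as negative
    unl_as_neg = sigmoid_loss(unl_scores, -1).mean()  # unlabeled treated as negative
    # Estimate of the negative-class risk; it can dip below zero on finite samples.
    neg_part = unl_as_neg - prior * pos_as_neg
    upu = prior * pos_as_pos + neg_part
    nnpu = prior * pos_as_pos + max(0.0, neg_part)    # non-negative correction
    return upu, nnpu

# Hypothetical scores from a classifier that separates the two sets very sharply.
upu, nnpu = pu_risks([5.0, 6.0, 7.0], [-5.0, -6.0, -7.0], prior=0.4)
```

With these hypothetical scores the unclipped estimate goes negative while the clipped one stays non-negative, which is the behaviour the reviewer's question about nnPU versus uPU refers to.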

Correctness: Almost correct except the part mentioned in "Weaknesses"

Clarity: Yes

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback:

Review 3

Summary and Contributions: This paper aims to address the hypothesis generation (HG) problem. It treats the HG problem as connectivity prediction by capturing features on a temporally dynamic graph with positive-unlabeled learning. The authors also propose a variational inference model to estimate the positive prior to help node embedding learning. I have read the rebuttal.

Strengths: (1) The authors propose the Temporal Relationship Predictor (TRP) model to calculate the future connection score for term pairs based on the PU learning framework. It is the first application of PU learning to the HG problem and to dynamic graphs. (2) To acquire a more reliable positive prior, the authors propose a variational inference method that models the learned pair embeddings with a Gaussian mixture distribution and minimizes the KL divergence to estimate the GMM parameters \beta. (3) Experimental results show the effectiveness of the TRP method, which achieves SOTA results in PU learning. The authors also give a detailed analysis and visualization of the results.

Weaknesses: (1) The introduction of L^E in Eq. (7) could be clearer. (2) It would be better to conduct some ablation studies to show the effectiveness of the prior estimation.

Correctness: yes

Clarity: yes

Relation to Prior Work: yes

Reproducibility: No

Additional Feedback: Overall, the paper is well written, and the model structure and training details are clearly presented. Questions: 1. Would you release the datasets?

Review 4

Summary and Contributions: The paper contributes a method for modeling the evolution of connections in a graph that treats links that are unobserved so far as unlabeled rather than negative. It is applied to a hypothesis generation problem by modeling the co-occurrence of biomedical terms in paper titles and abstracts over the last 75 years.

Strengths: The treatment of links that are unobserved so far as unlabeled rather than assuming they are negative/absent makes a lot of sense. The approach of modeling the temporal evolution of the graph also seems advantageous.

Weaknesses: The exposition of the methodology is dense and hard to follow (Section 3). Is there a way to provide confidence intervals on the values in Table 2?
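
One common way to attach confidence intervals to a reported metric is a percentile bootstrap over test examples. The sketch below is an illustration under stated assumptions, not a description of how Table 2 was produced: the function name `bootstrap_ci`, the per-pair 0/1 correctness array, and the choice of 2000 resamples are all hypothetical, and it applies directly only to metrics that are means of per-example scores (such as accuracy). Ranking metrics would need the metric recomputed on each resample.

```python
import numpy as np

def bootstrap_ci(per_example_scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of per-example scores."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(per_example_scores, dtype=float)
    n = scores.shape[0]
    # Resample example indices with replacement, one row per bootstrap replicate.
    idx = rng.integers(0, n, size=(n_boot, n))
    boot_means = scores[idx].mean(axis=1)
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return scores.mean(), lo, hi

# Hypothetical data: 1 if a test term pair was classified correctly, else 0.
correct = np.array([1] * 80 + [0] * 20)
point, lo, hi = bootstrap_ci(correct)
```

Reporting `point` together with `[lo, hi]` for each method would show whether the differences between rows are larger than the resampling noise.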

Correctness: I would like to understand the part about estimating p(y=1) better. I was under the impression that this quantity was unidentifiable from positive-unlabeled data without strong assumptions. Does the SCAR assumption enable estimation? Does it rely on other assumptions as well?
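
The identifiability concern can be made concrete with a standard argument, given here as a sketch rather than as the paper's derivation. The symbols \pi, p_+, p_-, \pi', and p'_- are notation assumed for this illustration, and the unlabeled data are assumed to be drawn from the marginal p(x).

```latex
\begin{align*}
% Unlabeled marginal as a mixture with true class prior \pi = p(y=1):
p(x) &= \pi\, p_+(x) + (1-\pi)\, p_-(x) \\
% For any \pi' with 0 < \pi' \le \pi, the same marginal can be rewritten as
p(x) &= \pi'\, p_+(x) + (1-\pi')\, p'_-(x),
\qquad
p'_-(x) = \frac{(\pi-\pi')\, p_+(x) + (1-\pi)\, p_-(x)}{1-\pi'}
\end{align*}
```

Since p'_- is itself a valid density, positive and unlabeled samples alone cannot distinguish \pi from \pi'. SCAR only states that the labeled positives are drawn from p_+ independently of x, so it does not remove this ambiguity by itself. An additional condition, such as irreducibility (p_- contains no p_+ component) or a parametric form for the components, is what singles out one value.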

Clarity: There is substantial room for improvement in clarity. - The text in all figures is way too small to be reasonably readable. - The paper would benefit from another editing pass for English grammar. Minor points: - h is undefined at first use. - GRU is not spelled out.

Relation to Prior Work: Seems reasonable -- but see questions about estimating p(y=1).

Reproducibility: Yes

Additional Feedback: I have some skepticism about the utility of this approach for hypothesis generation (HG). I see in the results that some terms were linked at the last time step which truly did get added at the end of the time series. But how do you envision this informing scientific research? That if two terms are predicted to co-occur, it will inspire a new study? Are there examples (very generally) of HG leading to new scientific discoveries? Thanks to the authors for the feedback provided. I think these changes will improve the paper. I am changing my reproducibility score to "Yes" since the authors plan to release the code and datasets. I am changing my overall score to 6.