NeurIPS 2020

Partial Optimal Transport with applications on Positive-Unlabeled Learning


Review 1

Summary and Contributions: The paper shows that the solution of the partial Wasserstein problem can be obtained by first solving an extended (full) Wasserstein problem and then dropping the added row and column. The paper then applies the same trick to the partial Gromov-Wasserstein problem and provides an algorithm to solve it.
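For concreteness, here is a minimal sketch of the dummy-point reduction described above, assuming the POT library's exact solver ot.emd; the dummy costs (xi and the dummy-dummy entry) are placeholders, and the paper's proposition should be consulted for the precise values required for equivalence.

```python
import numpy as np
import ot  # POT: Python Optimal Transport


def partial_wasserstein_dummy(p, q, C, s, xi=0.0):
    """Extend the cost matrix with a dummy row/column, solve a standard (full)
    OT problem, then drop the dummy part. `s` is the mass to transport between
    the real points; xi and the dummy-dummy cost are placeholders here, the
    paper's proposition gives the exact values required for equivalence."""
    n, m = C.shape
    C_ext = np.zeros((n + 1, m + 1))
    C_ext[:n, :m] = C
    C_ext[:n, m] = xi                        # real point -> dummy point
    C_ext[n, :m] = xi                        # dummy point -> real point
    C_ext[n, m] = 2 * xi + 2 * C.max() + 1   # discourage dummy -> dummy transport
    p_ext = np.append(p, q.sum() - s)        # dummy mass absorbs untransported mass
    q_ext = np.append(q, p.sum() - s)
    T_ext = ot.emd(p_ext, q_ext, C_ext)      # exact solver on the extended problem
    return T_ext[:n, :m]                     # partial plan of total mass s
```

Dropping the last row and column then yields the partial plan; how sensitive the result is to xi is precisely the tuning concern raised under Weaknesses below.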

Strengths: Introducing an outlier bin into the optimal transport problem is not a new idea; see, for example, [1]. In this paper, the authors use this idea to solve the partial assignment problem and extend it to the partial GW problem with theoretical justification. This part is novel, and the paper provides detailed mathematical proofs. The proposed method has been tested on several datasets and compared with other baseline methods; in some cases it shows better performance. [1] SuperGlue: Learning Feature Matching with Graph Neural Networks, CVPR 2020

Weaknesses:
1. Although the partial assignment problem is addressed, how to set the mass of the dummy point and the value of ξ in the cost matrix may be an issue. These parameters depend on the dataset and can strongly affect the result, so they may be difficult to tune.
2. I am curious about the running time, which is entirely missing from the paper.

Correctness: The proposed method seems mathematically sound and feasible in practice.

Clarity: The paper is well written.

Relation to Prior Work: The paper clearly explains the contribution.

Reproducibility: Yes

Additional Feedback: I agree with R3's opinion that the paper is difficult to read; more explanatory sentences would make it easier to follow. The rebuttal addresses all my concerns, so I keep my rating for this paper.


Review 2

Summary and Contributions: This paper tackles the problem of computing partial optimal transport metrics. More precisely, the authors consider both the partial-Wasserstein (partial-W) and the partial-Gromov-Wasserstein (partial-GW) distances. Their first contribution is to show that the partial-W can be reduced to a (full) Wasserstein problem by adding dummy points. For the partial-GW approach, a Frank-Wolfe algorithm is proposed, where a partial-W distance is computed at each iteration. The second contribution is to show that the PU-learning problem can be solved by using partial-W and partial-GW distances. Experiments on various datasets illustrate the efficiency of this approach.
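To illustrate what the Frank-Wolfe scheme mentioned here looks like, below is a rough, self-contained sketch: it uses a squared-loss GW gradient, assumes symmetric cost matrices, takes a naive 2/(k+2) step instead of the paper's line search, and reuses the dummy-point oracle from the sketch after Review 1. It illustrates the idea (a partial-W problem solved at each iteration) and is not the paper's exact algorithm.

```python
import numpy as np
import ot  # POT: Python Optimal Transport


def _partial_w_plan(p, q, C, s):
    """Linear minimization oracle: partial-W via the dummy-point reduction."""
    n, m = C.shape
    C_ext = np.zeros((n + 1, m + 1))
    C_ext[:n, :m] = C
    C_ext[n, m] = 2 * C.max() + 1.0                    # block dummy-to-dummy transport
    p_ext, q_ext = np.append(p, q.sum() - s), np.append(q, p.sum() - s)
    return ot.emd(p_ext, q_ext, C_ext)[:n, :m]


def partial_gw_fw(C1, C2, p, q, s, n_iter=50):
    """Frank-Wolfe on a squared-loss partial-GW objective: linearize around the
    current plan T and solve a partial-W problem (cost = gradient) to obtain the
    descent direction. C1, C2 are assumed symmetric (e.g. distance matrices)."""
    T = np.outer(p, q) * s / (p.sum() * q.sum())       # feasible plan of mass s
    for it in range(n_iter):
        # gradient of sum_{i,j,k,l} (C1[i,k] - C2[j,l])^2 * T[i,j] * T[k,l]
        a = (C1 ** 2) @ T.sum(axis=1)                  # term depending on i only
        b = (C2 ** 2) @ T.sum(axis=0)                  # term depending on j only
        grad = 2.0 * (a[:, None] + b[None, :] - 2.0 * C1 @ T @ C2.T)
        T_dir = _partial_w_plan(p, q, grad, s)         # partial-W at each iteration
        gamma = 2.0 / (it + 2.0)                       # naive step (paper: line search)
        T = (1.0 - gamma) * T + gamma * T_dir
    return T
```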

Strengths:
- This work has pedagogical qualities: the different variants of the Wasserstein distance are recalled very clearly.
- The "dummy point" trick to compute the partial-W is very smart and interesting.

Weaknesses: The definition of the Gromov-Wasserstein distance should be discussed a bit more, as it is less usual than the Wasserstein distance. Otherwise it seems a bit arbitrary ... For instance, is there a case where both W and GW coincide?

Correctness: Yes, the experiments illustrate the theory.

Clarity: Yes, the paper is well written.

Relation to Prior Work: Previous related works on W and GW distances are recalled, as well as existing PU-learning methods.

Reproducibility: Yes

Additional Feedback: Typos and comments:
- line 97: typo in the expression of PW
- line 163: the formula of the posterior probability should be p(y=1|x)
- I felt a bit lost when reading Section 3: I think the use of partial-W to compute partial-GW should be stated more explicitly at the beginning of Section 3 --> the authors will clarify this point
- Notation in Section 4.2: it was a bit disturbing to me that "p" does not refer to the "positive" distribution...
- Using PU-learning as a motivation seems a bit weak considering the (sophisticated) optimal transport methods developed in the article. PU-learning is already fairly well understood using simpler ideas relying on importance sampling (see e.g. [a] or [b] below). Are there more challenging transfer learning tasks that could be solved using your partial-(G)W approach? --> well answered in the rebuttal: color transfer, point cloud registration, deep learning
References:
[a] du Plessis et al., "Analysis of learning from positive and unlabeled data"
[b] Vogel et al., "Weighted Empirical Risk Minimization: Sample Selection Bias Correction based on Importance Sampling"


Review 3

Summary and Contributions: The paper concerns optimal transport (OT) plans minimizing Wasserstein (W) and Gromov-Wasserstein (GW) distances between the empirical distributions of a pair of samples, and their application to Positive-Unlabeled (PU) learning. A transport plan between two samples is represented as a matrix with linear equality constraints. The OT plan is formulated as a solution to an optimization problem over a space of transport matrices. The formulation for the Wasserstein distance is based on a single cost matrix representing the pairwise distances between the points of the two samples, whereas the Gromov-Wasserstein distance is based on two cost matrices representing within-sample pairwise distances. The paper's main contribution concerns partial optimal transport plans (for both W and GW distances) that constrain the mass transported between the samples. The formulation relies on the same optimization criterion as for full optimal transport, although with different constraints on the transport matrix. In the case of the Wasserstein distance, the formulation is relaxed back to the original constraints by adding two dummy points to the two samples and modifying the cost matrix. In the case of the Gromov-Wasserstein (GW) distance, a Frank-Wolfe based algorithm is derived to solve the optimization problem. PU learning is cast as a partial OT problem with an additional regularization term. The approach is tested empirically with comparisons to existing PU learning methods.
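For readers less familiar with this setting, the full and partial coupling sets this summary refers to can be written as follows (standard partial-OT notation; the paper's own symbols may differ slightly):

```latex
% admissible couplings for full OT between histograms p \in \mathbb{R}^n and q \in \mathbb{R}^m
\Pi(p,q) = \{\, T \ge 0 \;:\; T\mathbf{1}_m = p,\;\; T^\top\mathbf{1}_n = q \,\}

% partial couplings transporting only a prescribed mass s \le \min(\|p\|_1, \|q\|_1)
\Pi^s(p,q) = \{\, T \ge 0 \;:\; T\mathbf{1}_m \le p,\;\; T^\top\mathbf{1}_n \le q,\;\; \mathbf{1}_n^\top T\,\mathbf{1}_m = s \,\}
```

The partial-W (resp. partial-GW) problem minimizes the same linear (resp. quadratic) objective as its full counterpart, but over \Pi^s(p,q) instead of \Pi(p,q).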

Strengths: The paper contains novel algorithms for partial optimal transport and technically sound theory to support them. The application to PU learning is intuitive, and it is the first approach to address domain adaptation with different features in the PU context.

Weaknesses:
Overall: When applied to PU learning, the theoretical implications are not well understood, and an important state-of-the-art method is not compared with. Assuming the class prior to be known is a significant limitation. Furthermore, the time complexity of the algorithm might make it impractical. The paper is difficult to read, and significant effort should be spent simplifying the notation, making the paper self-contained and, wherever possible, stating the meaning and implications of the formulas and constraints.
Details:
1) The partial OT formulation enforces that the mass transported back and forth between the two samples is equal. The way the PU learning problem is set up, with p_i = 1/n and q_i = 1/m, as partial OT seems to be suboptimal. The mass \pi (positives) from the unlabeled sample gets transported to only mass \pi of the labeled sample. However, ideally it should get transported to the entirety of the labeled sample. If p_i is defined as 1/(n\pi), then the positives in the unlabeled sample would correspond to a mass of 1 and could be transported to the entirety of the labeled sample (a worked restatement of this point is given right after this Weaknesses section).
2) An important PU method is not included in the comparisons [3].
3) The algorithms for estimating the class prior are only suitable for the unbiased PU setting. Assuming that the class prior can be reliably estimated in the biased case is not realistic. Some experiments on the sensitivity to the class prior should be conducted.
4) Solving PU learning with bias is an intractable problem in general. That is why most algorithms for biased PU learning first make an assumption about the bias and then derive a method specific to handling that bias. What kind of bias does the proposed method handle? What prevents the method from learning the biased positives from the unlabeled data? If the colored MNIST experiment were designed with the unlabeled data containing both red and green colored digits, and the prior were misspecified or set closer to the proportion of the green positives in the unlabeled data, would the partial-GW method assign a higher score to the red positives compared to the negatives? Perhaps simulating a one-dimensional biased PU dataset could be used to illustrate the debiasing effects of the proposed method.
5) The text following Equation 6, including the introduction of \alpha and the regularization, is quite challenging to follow. Please attempt to give more intuition and make the paper self-contained. g and the notation T(i, I_g) are not defined.
6) Would the partial-GW method still work if the negatives were obtained by adding a constant to the positives (think one-dimensional data)? The pairwise distances between the negatives would be similar to the pairwise distances between the positives. Is it conceivable that the labeled positives are equally likely to be transported to the negatives or the positives in the unlabeled data?
[3] Kiryo R, Niu G, du Plessis MC, Sugiyama M. Positive-unlabeled learning with non-negative risk estimator. In Advances in Neural Information Processing Systems 2017 (pp. 1675-1685).
========= After Rebuttal =========
I still find the section on the regularization unclear. The clarification provided by the authors is related to a typo, but not to the regularization. If accepted, the authors must spend significant effort in making this clear.
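A worked restatement of Detail 1 above, under the masses stated there (this only makes the reviewer's argument explicit; it is not a claim taken from the paper): with n unlabeled points of mass 1/n each, m labeled positives of mass 1/m each, class prior \pi, and transported mass s = \pi,

```latex
\underbrace{\pi n}_{\#\text{ unlabeled positives}} \cdot \frac{1}{n} \;=\; \pi
\qquad\text{versus}\qquad
\pi n \cdot \frac{1}{n\pi} \;=\; 1 \;=\; \underbrace{m \cdot \tfrac{1}{m}}_{\text{total labeled mass}}
```

so with p_i = 1/n the unlabeled positives can only be matched to a fraction \pi of the labeled mass, whereas with p_i = 1/(n\pi) they carry mass 1 and can be matched to the entire labeled sample.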

Correctness: Yes

Clarity:
1) The equation after line 76 suggests that the number of points and the number of bins are the same. This is counterintuitive, since a histogram usually has significantly fewer bins than the size of the sample. Moreover, Dirac functions are point masses, not bins. Furthermore, expressing p and q as summations is not consistent with introducing them as vectors.
2) The paper would be easier to read if words were used to express the meaning of the formulas and the implications of the constraints were stated explicitly. For example, in Section 2, it would be helpful to say that each row of T sums to the corresponding p_i and each column to the corresponding q_i, and that ||p||_1 = ||q||_1 is implicitly enforced by the constraints on T, since the sum of the p_i's and the sum of the q_i's are both forced to equal the sum of all entries of T (see the restatement after this list).
3) The authors should spend effort on improving the notation. For example, p_i and q_i can be defined in terms of n_U and n_P; there is no need for n and m. p is used both for the vector of probabilities and for the exponent of the distance, which can be confusing.
5) After stating their choice of p_i and q_i in PU learning, it would help if the authors provided the rationale behind that choice.
6) The authors claim that the partial-W problem is relaxed to the full OT problem, which is unconstrained. However, the full OT problem has its own set of linear equality constraints.
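Regarding point 2 above, the implicit mass-equality constraint follows by summing the marginal constraints on T (a restatement of the reviewer's remark in standard notation):

```latex
\|p\|_1 \;=\; \mathbf{1}_n^\top p \;=\; \mathbf{1}_n^\top T\,\mathbf{1}_m \;=\; q^\top \mathbf{1}_m \;=\; \|q\|_1
```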

Relation to Prior Work: yes

Reproducibility: Yes

Additional Feedback:
1) Important references on class proportion estimation are missing [1,2].
2) How is the PU classifier obtained from the optimal transport plan?
3) Please comment on the time complexity of the algorithms.
[1] Jain S, White M, Trosset MW, Radivojac P. Nonparametric semi-supervised learning of class proportions. arXiv preprint arXiv:1601.01944, 2016.
[2] Zeiberg D, Jain S, Radivojac P. Fast Nonparametric Estimation of Class Proportions in the Positive-Unlabeled Classification Setting. In AAAI 2020 (pp. 6729-6736).


Review 4

Summary and Contributions: 1. The paper proposes an algorithm to solve partial Wasserstein and Gromov-Wasserstein problems. 2. The proposed method can be applied to solve the positive-unlabeled learning problem, and the experimental results demonstrate its effectiveness.

Strengths: 1. The paper is well written and easy to follow. 2. The proposed method is novel and can be applied to solve the positive-unlabeled learning problem, and the results are OK.

Weaknesses: My main concerns are listed as follows:
1. There are some typos in the manuscript, e.g., "betwenn" in the Abstract.
2. It is a pity that the authors only perform experiments on positive-unlabeled learning; optimal transport techniques have been used in many applications. More results on other applications, such as transfer learning, few-shot learning, or zero-shot learning, with more baseline methods compared, would be better.
3. In recent years, both optimal transport and deep learning have been hot research topics. The authors are encouraged to explain how to extend the proposed method to integrate with deep learning models.
4. For Figure 1, are the figures generated by real experiments or artificially? If they are artificially generated, can the authors conduct some real-world experiments to support the phenomena observed in these figures? This would be an important evaluation of the proposed method.
5. When citing literature, the tense of sentences is inconsistent, e.g., "Peyré et al. (2016) proposed" and "Chizat et al. (2018) propose".

Correctness: All the claims, methods and empirical methodology are correct.

Clarity: The paper is well-written.

Relation to Prior Work: Yes. Regarding the exact computation of partial Wasserstein metrics, previous works do not provide a partial formulation.

Reproducibility: Yes

Additional Feedback: The manuscript addresses the partial Wasserstein and Gromov-Wasserstein problems and proposes exact algorithms to solve them. The authors also apply their method to positive-unlabeled learning, and the experimental results demonstrate its effectiveness.
- pos:
1. The manuscript is well written and the structure is fine.
2. The proposed method can be applied to the positive-unlabeled learning problem; I think it may also fit other applications.
- con:
1. There are some typos in the manuscript, e.g., "betwenn" in the Abstract.
2. It is a pity that the authors only perform experiments on positive-unlabeled learning; optimal transport techniques have been used in many applications. More results on other applications, such as transfer learning, few-shot learning, or zero-shot learning, with more baseline methods compared, would be better.
3. In recent years, both optimal transport and deep learning have been hot research topics. The authors are encouraged to explain how to extend the proposed method to integrate with deep learning models.
4. For Figure 1, are the figures generated by real experiments or artificially? If they are artificially generated, can the authors conduct some real-world experiments to support the phenomena observed in these figures? This would be an important evaluation of the proposed method.
5. When citing literature, the tense of sentences is inconsistent, e.g., "Peyré et al. (2016) proposed" and "Chizat et al. (2018) propose".