NeurIPS 2020

CO-Optimal Transport


Review 1

Summary and Contributions: This paper introduces a method based on optimal transport, COOT, to compare distributions supported on different vector spaces. It consists of simultaneously learning two transport plans, one for samples and one for features. The authors provide an algorithm to compute COOT which can be adapted to entropic regularization, and discuss the relation between COOT and Gromov-Wasserstein (GW). Finally, they illustrate COOT on heterogeneous domain adaptation and co-clustering (i.e., clustering of both samples and features) tasks.
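For reference, the COOT objective (in my own transcription, with sample weights w, w' and feature weights v, v') optimizes the two plans jointly:

```latex
\mathrm{COOT}(X, X') = \min_{\pi^s \in \Pi(w, w'),\; \pi^v \in \Pi(v, v')}
\sum_{i,j,k,l} L\big(X_{ik}, X'_{jl}\big)\, \pi^s_{ij}\, \pi^v_{kl}
```

where \pi^s couples samples and \pi^v couples features.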

Strengths: The problem of simultaneously matching samples and features across different domains is well-motivated (as stated by the authors, it notably has applications to embedding alignment in e.g. NLP, and to domain adaptation), and the proposed method is convincing and novel. It adapts the OT and GW frameworks in a natural manner. The fact that COOT still defines a distance (in a permutational sense) shows that it is theoretically well-grounded. The links the authors establish with GW are informative. The illustrations and experiments showcase COOT convincingly.

Weaknesses: The main weakness of this work lies in the computational cost of COOT, as Algorithm 1 relies on iteratively solving OT or matrix balancing problems (in the case of entropic regularization). However, this is arguably inherent to the problem it tackles, as GW also suffers from high costs, and the authors provide clear information on the complexity of their method. Adding plots (e.g. in the supplemental) showing the empirical behaviour of Algorithm 1 in different settings (values of n, d, and regularization) would help assess how big an issue this complexity is.

Correctness: I have not checked all proofs in detail but the results and the experimental methodology seem correct.

Clarity: I have found the paper very clearly written (at least for someone with some knowledge of OT and GW). One point that could be clarified, however, is which loss L is used in the experiments. I assume it is quadratic: this is clearly stated in the co-clustering experiment description, but I could not find the information for HDA. Likewise, exactly what regularization is used? If I understand correctly, entropic regularization for both features and samples in co-clustering, but only for features in HDA? If so, could the authors discuss these different choices?

Relation to Prior Work: The relation to prior work is clearly discussed.

Reproducibility: Yes

Additional Feedback: (i) In Proposition 1, the authors show that if L is an Lp norm, then COOT is positive, symmetric and satisfies the triangle inequality, and further that if n=n’ and d=d’, COOT(X, X’) is 0 iff X = X’ up to permutation. Then, l. 134-135, they say that when n≠n’ or d≠d’ the identity of indiscernibles cannot be proven, but (l. 60 in the supplemental) that COOT is then > 0. But arguably, when n≠n’ or d≠d’, X and X’ are necessarily different, so the identity of indiscernibles would still be satisfied, which would show that COOT is still a distance (the point being that “X=X’” in a permutational sense implies that n=n’ and d=d’). (ii) In Figure 1, the authors illustrate the sample and feature plans between the MNIST and USPS datasets. However, the feature mapping is a little hard to read. It would be interesting to illustrate feature mappings between two datasets with clearly interpretable features (e.g. physical quantities), to see whether features playing similar roles are indeed matched to one another. (iii) Less importantly, although I am not a grammar expert, the usage of possessives seems a bit unusual in some places (e.g. l. 28 “correspondences’ estimation” could be “correspondence estimation”, l. 213 “features’ mapping” -> “feature mappings”, etc.).
########### POST-REBUTTAL #############
Thanks to the authors for answering my points. I have read all reviews and responses, and agree with the other reviewers that empirical convergence is the main issue. As the authors have provided some content on that subject in their response, I will maintain my score.


Review 2

Summary and Contributions: This work proposes a new formulation of optimal transport (CO-OT) to compare distributions in different metric spaces with potentially different dimensions. Given two datasets, the main idea is to align both the samples and the features by optimizing over a pair of transportation maps. An alternating optimization algorithm is proposed to solve this problem (performing several classical OT subproblems), and several numerical experiments illustrate the intuition behind the double assignment (MNIST-USPS) and the ability of CO-OT to outperform state-of-the-art methods in domain adaptation and co-clustering applications.
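To make the alternating scheme concrete, here is a minimal NumPy/POT sketch of the BCD loop for the squared loss with uniform weights. It is my own naive transcription, not the authors' implementation; the explicit O(n n' d d') loss tensor is for readability only.

```python
import numpy as np
import ot  # POT: Python Optimal Transport

def coot_bcd(X, X2, n_iter=50, tol=1e-7):
    """Naive BCD sketch for COOT with squared loss and uniform weights."""
    n, d = X.shape
    n2, d2 = X2.shape
    w, w2 = ot.unif(n), ot.unif(n2)   # sample weights
    v, v2 = ot.unif(d), ot.unif(d2)   # feature weights
    pi_v = np.outer(v, v2)            # initial feature plan
    # 4-way loss tensor L(X_ik, X'_jl) = (X_ik - X'_jl)^2
    Lt = (X[:, None, :, None] - X2[None, :, None, :]) ** 2
    prev = np.inf
    for _ in range(n_iter):
        Ms = np.einsum('ijkl,kl->ij', Lt, pi_v)  # sample cost given pi_v
        pi_s = ot.emd(w, w2, Ms)                 # exact OT between samples
        Mv = np.einsum('ijkl,ij->kl', Lt, pi_s)  # feature cost given pi_s
        pi_v = ot.emd(v, v2, Mv)                 # exact OT between features
        loss = float(np.sum(Mv * pi_v))          # current COOT objective
        if abs(prev - loss) < tol:               # stop when the loss stabilizes
            break
        prev = loss
    return pi_s, pi_v, loss
```

Replacing each ot.emd call with ot.sinkhorn (with a regularization parameter) would give the entropically regularized variant mentioned above.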

Strengths:
- CO-OT has a simple and intuitive formulation and can be solved iteratively using previously known algorithms.
- It provides a generalization of Gromov-Wasserstein that leverages the raw data instead of kernel representations of the data.
- The experiments show a significant improvement in accuracy compared to other methods.

Weaknesses:
- Limited discussion of computational aspects.
- Lack of context for the obtained results.

Correctness: Except for some unclear settings in the experiments, the main claims of this paper are sound.

Clarity: Well written, well illustrated.

Relation to Prior Work: Several methods are discussed throughout the paper and compared with in the experiments. The authors also draw important connections between CO-OT and relevant proposals in the literature (Gromov-Wasserstein, InvOT).

Reproducibility: Yes

Additional Feedback: ####### POST REBUTTAL #######
I find the authors' response very solid. My questions were answered convincingly, and I would very much like to see the 3rd paragraph of the rebuttal, with the additional experiments, in the final version of the paper.
#################
Overall this is a good paper; CO-OT is well motivated and illustrated using the MNIST / USPS example. The positioning with respect to Gromov-Wasserstein is clear, and the algorithmic part amounts to nested loops of the simplex / Sinkhorn algorithm.
1) In L121, the statement "COOT provides a tight convex relaxation of BAP" is true in terms of the constraint set, but the objective function is still not jointly convex in the pair (pi_s, pi_v), is it? Did the authors try different initializations of Alg. 1 to see how sensitive it is, especially for co-clustering methods when minimizing over X?
2) Given the (at least) quadratic complexity of the inner OT problems, computing COOT would be problematic if the alternating optimization takes many iterations to converge. How many iterations were necessary to reach convergence? It would be useful to have a plot showing the decrease of the loss function. A quick look at the code shows that Alg. 1 could be accelerated by keeping track of the previous dual variables to warm-start Sinkhorn's algorithm (see the sketch below).
3) The accuracy scores obtained by EGW in all experiment tables (even those in the appendix) are no better than chance! Perhaps the choice of eps was not adequate here? Several hyperparameters were fixed once and for all (L249-252) with [16] as a reference, but in [16] EGW performed significantly better and was even competitive with SGW.
4) Minor: LaTeX typo in L296.
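A minimal sketch of the warm-start suggestion in point 2, assuming plain Sinkhorn scaling iterations; the scalings (u, v) returned by one outer iteration are fed back in at the next. This is my own illustration, not the authors' code.

```python
import numpy as np

def sinkhorn_warmstart(a, b, M, reg, u=None, v=None, n_iter=1000, tol=1e-9):
    """Sinkhorn iterations accepting previous scalings (u, v) as a warm start."""
    K = np.exp(-M / reg)                      # Gibbs kernel
    u = np.ones_like(a) if u is None else u
    v = np.ones_like(b) if v is None else v
    for _ in range(n_iter):
        u_prev = u
        u = a / (K @ v)                       # enforce the row marginal
        v = b / (K.T @ u)                     # enforce the column marginal
        if np.max(np.abs(u - u_prev)) < tol:
            break
    plan = u[:, None] * K * v[None, :]
    return plan, u, v                         # return the scalings for reuse
```

Since the cost matrix changes only slightly between outer iterations, the previous (u, v) pair should be a good initialization and could substantially cut the number of inner iterations.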


Review 3

Summary and Contributions: The paper proposes a new OT distance, called COOT, which simultaneously considers differences between the samples and the features of the two distributions of interest. Experimental studies on heterogeneous domain adaptation and bi-clustering problems show the usefulness of the proposed distance.

Strengths: The proposed OT distance is useful in problems where the ambient data spaces of the two distributions are different. The authors also provide an optimization algorithm to compute the distance (and the transportation plans). Applications of COOT to heterogeneous domain adaptation show performance improvements over state-of-the-art methods.

Weaknesses: Though the usefulness of COOT has been shown experimentally via two applications, theoretical grounding for COOT is missing. In the COOT-clustering application, the authors do not show quantitative results on real-world datasets such as Olivetti Faces and MovieLens. ==== Update after rebuttal ==== I appreciate the feedback from the authors. However, I am still not satisfied regarding the quantitative results on real-world datasets, as synthetic datasets are meant for validation (checking correctness), not for comparison against baseline methods.

Correctness: The claims in the three propositions look correct, though I did not check them further.

Clarity: The paper is easy to follow; however, the motivation for COOT needs to be elaborated.

Relation to Prior Work: The authors have discussed connections to other OT distances, especially the Gromov-Wasserstein (GW) distance.

Reproducibility: Yes

Additional Feedback: In line 39, the authors claim that "our new formulation includes GW as a special case"; can the authors show under which conditions COOT becomes GW?
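For concreteness, comparing the two objectives suggests one such condition (my reading, to be confirmed by the authors): taking X = C and X' = C' to be intra-domain similarity matrices and constraining the two plans to coincide,

```latex
\sum_{i,j,k,l} L\big(C_{ik}, C'_{jl}\big)\, \pi^s_{ij}\, \pi^v_{kl}
\quad \xrightarrow{\ \pi^s = \pi^v = \pi\ } \quad
\sum_{i,j,k,l} L\big(C_{ik}, C'_{jl}\big)\, \pi_{ij}\, \pi_{kl}
```

which is exactly the GW objective over \pi \in \Pi(w, w').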


Review 4

Summary and Contributions: The paper addresses the problem of solving optimal transport in a context where the two spaces are of different dimensions and the numbers of samples may differ. It therefore presents an alternative to Gromov-Wasserstein. Unlike GW, which relies on the distributions of distances between pairs of points in each space, COOT makes use of the features by finding two couplings instead of one: one between samples and one between features. Even though the problem turns out to be NP-hard, one can solve it approximately via block coordinate descent, which basically boils down to alternately solving OT for one coupling, then for the other, until convergence. The main contribution of this paper is to propose this new approach of solving for two couplings instead of one, together with a simple algorithm grounded in OT.

Strengths: To begin with, the paper is well-written and easy to understand despite the theoretical complexity of the problem. I have also appreciated the effort put into linking the paper with previous work and, more generally, the substantial references to the literature supporting the various claims. In my opinion, the proposed algorithm shines by its simplicity, basically alternating between two OT problems. Finally, the illustration of the method is quite convincing and the experimental validation supports the claims.

Weaknesses: Although the complexity analysis is given to the reader, I think that details about the empirical speed of convergence of the algorithm and its memory limitations are lacking. Even though the Sinkhorn algorithm can be qualified as "lightspeed" and its complexity is quadratic, running many alternations of two Sinkhorn solves may in the end be costly in time and memory. I would have liked to know how long it takes to run on MNIST / USPS, how many Sinkhorn iterations are needed for each inner OT, and how many outer iterations are needed. I wonder whether that is the reason why, in the HDA experiment, the authors take only 20 samples per class, ending up with n=n'=200 instead of using a bigger set. What is the limitation here, and can other methods support more samples? In that case it might also be unfair to limit other methods to 200 samples because COOT cannot handle more than this, if that is indeed the reason for this choice. If so, one possible way to deal with this would be to use a subset of samples to run COOT in a first step that is only meant to find the feature transport \pi^v, and then use all the transported samples (mapped by \pi^v) to find the transport \pi^s in the common feature space (see the sketch below). About the face dataset, could the same experiment be run on CelebA? Or are the images too big for this? In other words, what can a reader expect from COOT in terms of ranges of n, n', d, d'?
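A minimal sketch of this two-step idea, assuming squared Euclidean costs and uniform weights; coot_solver stands for any COOT routine returning a feature plan pi_v of shape (d, d') with uniform marginals (a hypothetical helper, not the authors' method).

```python
import numpy as np
import ot  # POT: Python Optimal Transport

def two_step_coot(X, X2, subset_size, coot_solver, seed=0):
    """Step 1: estimate pi_v on a subsample; step 2: one full-sample OT."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    n2 = X2.shape[0]
    idx = rng.choice(n, subset_size, replace=False)
    idx2 = rng.choice(n2, subset_size, replace=False)
    pi_v = coot_solver(X[idx], X2[idx2])          # feature plan from the subset
    # Barycentric projection of X' onto X's feature space: rows of pi_v sum
    # to 1/d, hence the factor d to turn them into weighted averages.
    X2_mapped = d * (X2 @ pi_v.T)
    M = ot.dist(X, X2_mapped)                     # squared Euclidean cost
    pi_s = ot.emd(ot.unif(n), ot.unif(n2), M)     # sample plan on the full data
    return pi_s, pi_v
```

This would reduce the expensive alternating part to a subsample while still producing a sample coupling for the full datasets.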

Correctness: I believe so, with the one question above about the empirical setup in HDA that restricts the number of samples to 20 per class.

Clarity: The paper is well written and very clear. It is a good read.

Relation to Prior Work: I am not an expert in the field, but I have enjoyed the rich bibliography, and I felt that most of the claims were supported by citations and that the experimental validation compares against many previous works. Also, I am not aware of any other work that operates on the samples and the features at the same time.

Reproducibility: Yes

Additional Feedback: ########### POST-REBUTTAL #############
I would like to thank the authors for answering my questions in their rebuttal. I find the paper and the rebuttal convincing and I am maintaining my score.
#######################################
A small error on line 85: the authors refer to paragraph 2.2 as explaining the algorithm; however, the algorithm is described in paragraph 2.3. Figure 1 is very interesting but takes time to parse at first glance. I feel that some examples from Figures 2 and 3 in the supplementary material are easier to understand and would be complementary. Also, even if I acknowledge the effort put into solving different tasks to illustrate the method, the last one about movie ratings does not bring much to the paper in my opinion; this space could instead be used to discuss practical running time / memory usage, or to bring in some interesting results / proofs from the supplementary material, which is very complete. Finally, I wonder whether it would be possible to add more constraints on \pi^v without breaking the simplicity of the method. For example, in the MNIST -> USPS example, we expect some piecewise continuity, meaning that nearby pixel locations in MNIST should also be close in USPS. The feature order is not completely random in this specific case, and this structure is preserved in both spaces.