NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Reviewer 1
Review
--------
This paper is very well written and easy to follow. The Wasserstein barycenter problem computes the Fréchet mean of a set of probability distributions. When the support of the unknown barycenter is fixed, the problem is an LP; if not, one can alternately minimize with respect to the support points and the probability weights. The authors propose a general accelerated algorithm for both cases with cubic complexity with respect to the dimension. The main finding of this work is that their method solves the OT problem without a strictly convex regularization (such as entropy) to speed up the computation. Thus, they manage to obtain more accurate estimates of the barycenter in a competitive time.

Questions
---------
1. In L.79-80, the authors mention "Our algorithms also inherits a natural structure potentially fitting parallel computing scheme well". What exactly does "potentially" mean? Since Sinkhorn's algorithm (IBP) is parallelizable along the N axis (across the number of input distributions), this statement deserves to be discussed.
2. Perhaps I am missing something, but if the complexity of the proposed method is cubic with respect to the dimension, how come the plots of Figure 4 show a linear rate in the second row? When varying m, does the dimension of ALL distributions not change at the same time?
3. The second column of Figure 4 is not clear. Is the computation time not already displayed in the first column? And why are not all methods displayed in the last three columns?
4. On regular grids such as images, the iterations of Sinkhorn's algorithm can be written as kernel convolutions (if the squared l2 distance is used as the ground metric, which is the case here). This idea was introduced in https://people.csail.mit.edu/jsolomon/assets/convolutional_w2.compressed.pdf but is not made very clear in that paper. Briefly, each iteration can be seen as a Gaussian smoothing which -- thanks to the separability of the squared l2 -- can be applied to rows and columns independently. This reduces the complexity of each iteration from O(Nm^2) to O(Nm^{3/2}) for N 2D distributions in R^m. This trick is crucial and should have been used when comparing with Sinkhorn.
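[Editor's note] To make the separable-convolution trick in question 4 concrete, here is a minimal Python sketch of a Sinkhorn iteration on a p x p grid in which the kernel K = exp(-C/eps) is never stored: it is applied as two 1-D Gaussian convolutions. This is only an illustration of the trick described in the convolutional Wasserstein distances paper, not code from the submission; the unit-grid coordinates, eps, and the function names are assumptions.

    # Sketch (no numerical safeguards): matrix-free application of the entropic
    # kernel K = exp(-||xi - xj||^2 / eps) on a p x p grid, exploiting the
    # separability of the squared-l2 cost across rows and columns.
    import numpy as np

    def apply_kernel(v, p, eps):
        """Compute K @ v in O(p^3) = O(m^{3/2}) time, m = p^2, without forming K."""
        t = np.linspace(0.0, 1.0, p)
        G = np.exp(-(t[:, None] - t[None, :]) ** 2 / eps)  # 1-D Gaussian kernel, p x p
        V = v.reshape(p, p)
        return (G @ V @ G.T).ravel()                        # row then column smoothing

    def sinkhorn_grid(mu, nu, p, eps, n_iter=200):
        """Standard Sinkhorn scaling updates u = mu / (K v), v = nu / (K^T u),
        with the symmetric kernel K applied matrix-free via apply_kernel."""
        u = np.ones(p * p)
        v = np.ones(p * p)
        for _ in range(n_iter):
            u = mu / apply_kernel(v, p, eps)
            v = nu / apply_kernel(u, p, eps)
        return u, v

Each kernel application costs two p x p matrix products, i.e. O(m^{3/2}) per distribution instead of the O(m^2) of a dense K, and the memory footprint drops from O(m^2) to O(m).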
Reviewer 2
==================================================
AFTER AUTHOR FEEDBACK

I maintain my score. However, I'm not convinced by the authors' response to my question on showing the derivation of the previous best run time. They need to use more current algorithms for LP (for example, the one by Cohen, Lee, and Song) in order to compute how fast this problem can be solved. Without a derivation using these algorithms, it's unclear *how much* of an improvement this submission provides, and therefore difficult to determine the full contribution of the paper.
==================================================

ORIGINALITY: See comment in 1. The paper recognizes and exploits some properties of the objects in the problem in a way that lets the authors improve the algorithm's runtime by orders of magnitude. This is an original observation/contribution. However, I am concerned about the reported run time for the previous best interior-point algorithm: in particular, in line 69 of the supplementary material (paragraph titled "Low Theoretical Complexity"), the previous best run time is reported to be O(N^3 m^4), but no citation or computation is provided for this claim. In order to be convinced that the authors' work is a significant improvement upon previous work, I'd like to request the following from them:
(1) an explicit computation showing how the previous best run time of O(N^3 m^4) is obtained, and
(2) the run times implied by some classical and contemporary results for LPs: the papers by Vaidya https://link.springer.com/article/10.1007/BF01580859 and Lee-Sidford https://arxiv.org/abs/1312.6713, and the current best LP result by Cohen-Lee-Song https://arxiv.org/abs/1905.04447.

QUALITY: Overall, I found the paper to be high quality, with adequate citations for various results. However, I do have the following concerns. I am confused why the NeurIPS paper by Cuturi is cited for Iterative Bregman Projection (line 249 in the supplementary pdf) when that paper doesn't even mention Bregman projections. For Figure 2, I think the evaluation of SLRM and DLRM should be done against more recent solvers for normal equations, such as those mentioned in the ORIGINALITY paragraph above, and also against more specialized interior-point solvers (I am not sure the MATLAB solver exploits the structure of the normal equations).

CLARITY: Overall, the paper reads quite well. However, I think the clarity can be improved in the following places. In line 115, the authors describe an LP and then all the vectors/variables/matrices in it; however, the vector c is not defined. Also, it would help to clearly state, separately, the problem dimension and the number of constraints at this point. It would also greatly help to give the reader a "sense" of each of the m_i's, so that the reader can better interpret m sum_i m_i + m and Nm + sum_i m_i + 1 (just as you have done in the section "Low Theoretical Complexity" of the supplementary material, where you let m_i = m). In Algorithm 1, it would help to have a caption or text saying which system z(i) is the solution of; as written, I found it difficult to understand from the procedure alone that it is actually solving (bar_A D bar_A^T) z = f (a generic sketch of such a normal-equations solve appears after this review). Finally, the authors are using the cost of an n x n matrix inversion to be n^3, when this should be n^\omega.
The current value of \omega is roughly 2.372927 (see Table 3 of Le Gall and Urrutia, https://arxiv.org/abs/1708.05622), so I think this should be made clear.

SIGNIFICANCE: I think the formulation and use of the block structure is certainly novel for this problem, and anyone who tries to solve this problem can benefit from the authors' observations. However, as someone who has not worked on the Wasserstein barycenter problem before, I don't think I can judge the importance of the problem itself. I think the idea of using structure could be carried over to other interior-point methods for LPs.
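[Editor's note] For readers unfamiliar with the interior-point terminology in the CLARITY comment above, here is a minimal, generic sketch of the normal-equations solve (A D A^T) z = f that arises in one primal-dual interior-point step. This is illustrative only and is not the authors' Algorithm 1 (whose point is precisely to exploit the block structure of bar_A instead of forming and factoring the dense normal matrix); the variable and function names are assumptions mirroring the review's notation.

    # Generic normal-equations step of a primal-dual interior-point method for an
    # LP in standard form (min c^T x  s.t.  A x = b, x >= 0).  Illustrative only.
    import numpy as np
    from scipy.linalg import cho_factor, cho_solve

    def normal_equations_step(A, x, s, f):
        """Solve (A D A^T) z = f with D = diag(x / s), the scaling matrix of a
        standard primal-dual iteration; the dense Cholesky below is the expensive
        step that a structure-exploiting solver is designed to avoid."""
        d = x / s                           # diagonal of D, elementwise
        M = (A * d) @ A.T                   # A D A^T without materializing diag(d)
        c, lower = cho_factor(M)            # dense Cholesky of the SPD normal matrix
        return cho_solve((c, lower), f)     # z such that (A D A^T) z = f

Whether one counts the factorization of M as n^3 or n^\omega is exactly the reviewer's final point above.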
Reviewer 3
-------- After Author Feedback ---------
I stand by my initial review and score. I believe the paper presents an interesting new approach to computing Wasserstein barycenters that is well motivated and well tested. My initial misgivings were addressed sufficiently in the rebuttal.
-----------------------------------------------

Typo, line 93: Wasserstein distance instead of barycenter.

The paragraphs from 132--148 imply that the algorithm recovers a globally optimal solution to the free-support problem. Can you elaborate on or clarify this? A globally optimal solution to the Wasserstein barycenter problem implies a globally optimal solution to k-means, which is known to be NP-hard (this reduction is written out in the note after this review).

In lines 237--239 it is mentioned that the memory usage of MAAIPM is similar to that of Sinkhorn-type algorithms. The memory usage of Sinkhorn algorithms is dominated by the cost of storing the kernel matrix K = exp(-d(x, y) / epsilon). However, there are work-arounds for this issue that only require applying an operator to a vector, and thus reduce the memory requirement to O(n + m). Have the authors compared with these solutions, for example Solomon et al., "Convolutional Wasserstein Distances: Efficient Optimal Transportation on Geometric Domains"?

In addition to Gurobi, can the authors compare with CPLEX and Mosek? I have frequently found those to be faster than Gurobi for transport problems. Another point of comparison that is also tailor-made for transport problems can be found here: https://github.com/nbonneel/network_simplex (it's true that this is for computing distances, but it would be a good point of comparison nonetheless). The authors can also compare with their reference [17] for a regularized transport solution with faster convergence than Sinkhorn, with reference [2] for an empirically faster Sinkhorn iteration scheme with better memory requirements, or with Altschuler et al., "Massively scalable Sinkhorn distances via the Nyström method," for a very fast and memory-efficient approximation of Sinkhorn. I hope the authors provide code later on.

In brief:
Originality: A novel algorithm for computing unregularized Wasserstein barycenters via a different interior-point method for solving the associated linear program.
Quality: The paper is technically sound, and the claims are well supported by thorough experiments. The algorithms are described in full and could easily be implemented independently.
Clarity: The paper is readable, the algorithms are explained well, and the experiments are documented clearly.
Significance: One shortcoming of this paper is the lack of comparison with newer, faster versions of regularized optimal transport (mentioned above). Unregularized optimal transport is of independent interest, but this submission would be strengthened by showing that MAAIPM is faster than, e.g., Greenkhorn.
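[Editor's note] To make the NP-hardness argument in the free-support comment above explicit, here is the standard reduction, spelled out under the usual squared-l2 ground cost (this is the reviewer's reasoning made concrete, not a claim from the paper): already for a single empirical measure, the free-support barycenter with at most k support points is exactly k-means, because with free weights the optimal plan sends each data point entirely to its nearest support point:

    \min_{\mu:\,|\operatorname{supp}(\mu)|\le k} \; W_2^2\!\Big(\mu,\ \tfrac{1}{n}\sum_{j=1}^{n}\delta_{x_j}\Big)
    \;=\; \min_{y_1,\dots,y_k}\ \frac{1}{n}\sum_{j=1}^{n}\ \min_{1\le i\le k}\ \|x_j - y_i\|^2 .

Hence a guaranteed globally optimal free-support barycenter (even for N = 1 input distribution) would yield a globally optimal k-means clustering, which motivates the request to clarify lines 132--148.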