NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID: 7732
Title: A Graph Theoretic Additive Approximation of Optimal Transport

Reviewer 1

The paper provides a nice link to the more classical literature on graph-theoretic maximum-flow algorithms for bipartite matching and adapts a classic algorithm to the specific problem at hand. The detailed analysis will be of interest to algorithms researchers, and the experiments seem to show a practical advantage as well, which is quite exciting. While I did not check the mathematics line by line, the treatment seems careful.

Reviewer 2

Edit: I thank the authors for answering my questions in their response. I have raised my rating to a full accept. This is a nice paper!

=========

=1. I checked all proofs, and they seem correct to me (a few minor corrections below). However, the one thing I miss is that the invariants (I1) and (I2) in L259-262, although critical to the analysis, are never proved. These are not immediately obvious to me; please explain the proof in the rebuttal and the paper.

=2. Parallelization is not discussed. A key feature of the Sinkhorn algorithm that makes it very scalable, and thus led to its widespread use, is its parallelizability: it requires only matrix-vector multiplications (see the sketch after point 3b below). It seems to me that the proposed algorithm is intrinsically not amenable to parallelization. Indeed, the algorithm proceeds in phases, and in each phase it runs Dijkstra's algorithm and then sequential iterations of DFS; each of these parts is hard to parallelize or cannot be parallelized well. Please comment on this in the rebuttal and the paper.

=3. I have concerns with the empirical comparisons and with the strong claim in the abstract that the "empirical results suggest our algorithm significantly outperforms existing approaches in execution time". This is for several reasons:

=3a. Note that the reported runtimes are shockingly large for such small problem sizes. Indeed, runtimes are in the tens of seconds or a few minutes for only n = 784 (tiny for Sinkhorn). This is undoubtedly because the precision demanded is *extremely* high: delta/C roughly between 0.02/1000 = 2e-5 and 0.2/1000 = 2e-4. (See below for why the ratio delta/C is the relevant quantity.) The runtime should be reported for delta/C in a range starting from a much more moderate precision, e.g. 1e-2. I wonder whether (indeed, I suspect) Sinkhorn is faster in this moderate-error regime. This is important since, in practice, many ML applications compute OT only to moderate precision. (Admittedly, this is something of a generalization, since error requirements vary by application, but it is my understanding that it is often true in practice.) Moreover, the implementation you are using appears to be extremely numerically unstable at such high precision. Indeed, it requires floating-point operations on numbers of size exp(-4 log(n) C/delta). Even for the lowest precision you use, delta/C = 2e-4, this means numbers of size roughly exp(-100,000), which is absurd numerically. Such numerical errors likely worsen Sinkhorn's error, and may also make Sinkhorn run much slower, since matrix scaling becomes a much harder problem when the matrix to be scaled has more zero entries (note that exp(-100,000) will simply be 0 on a computer); see e.g. [Linial et al., "A deterministic strongly polynomial algorithm..."]. There are various tricks used pervasively in implementations of Sinkhorn to deal with such numerical issues; see e.g. [Peyré and Cuturi's book, Ch. 4.4] and the log-domain variant in the sketch below.

=3b. Scalability in the problem size n should also be investigated. Please plot runtimes for varied n with delta/C fixed. I suspect the Sinkhorn algorithm will be able to scale to larger problems than the proposed algorithm, in part because Sinkhorn is parallelizable while the proposed algorithm is not (see point 2 above).
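To make points 2 and 3a concrete, here is a minimal Python sketch (mine, not the authors' code; the names and the choice eta = delta/(4 log n) are illustrative assumptions from the standard Sinkhorn analysis) showing that a Sinkhorn iteration is just two matrix-vector products, that the kernel underflows at the precision used in the experiments, and the standard log-domain fix:

```python
import numpy as np
from scipy.special import logsumexp

# Illustrative toy, not the paper's algorithm. eta is set to
# delta / (4 log n) as in the standard Sinkhorn approximation analysis.
n, C_max, delta = 784, 1000.0, 0.2
eta = delta / (4 * np.log(n))
print(np.exp(-C_max / eta))   # 0.0 -- the kernel entries underflow entirely

# Toy instance at a moderate precision instead (costs rescaled to [0, 1]).
rng = np.random.default_rng(0)
C = rng.random((n, n))
a = b = np.full(n, 1.0 / n)
eta = 1e-2

# Plain Sinkhorn: each iteration is two matvecs, hence embarrassingly
# parallel (and GPU-friendly) -- the point of item 2.
K = np.exp(-C / eta)
u = v = np.ones(n)
for _ in range(100):
    u = a / (K @ v)
    v = b / (K.T @ u)

# Log-domain Sinkhorn: the standard fix for the underflow in item 3a;
# still elementwise-parallel across the dual variables f and g.
f = g = np.zeros(n)
for _ in range(100):
    f = eta * np.log(a) - eta * logsumexp((g[None, :] - C) / eta, axis=1)
    g = eta * np.log(b) - eta * logsumexp((f[:, None] - C) / eta, axis=0)
```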
=3c. It should be noted that Sinkhorn iterations can be implemented significantly faster, in o(n^2) time, when the cost matrix has certain structure that often occurs in practice. This is the case even for the real-world dataset used in your experiments, although your implementation does not exploit it. Indeed, there the datapoints lie on a uniform grid (pixels in an image), so Sinkhorn iterations can be computed in \tilde{O}(n) time rather than O(n^2) using the FFT (see e.g. Peyré and Cuturi's book, Ch. 4.3, and the sketch at the end of this review). Moreover, even if the points were not on a grid, since you are using squared Euclidean distances, the Sinkhorn iterations can still be computed to provably sufficient accuracy in \tilde{O}(n) time (see "Massively scalable Sinkhorn distances via the Nystrom method").

I also find several plots misleading:

==For additive approximations, only the (scale-invariant) ratio delta/C is meaningful for understanding the error, *not* delta by itself. All plots should be rescaled so that C is fixed at 1 and only delta changes.

==I find Figure 2c misleading, since you are plotting the actual error in terms of the parameter delta, yet are changing the error favorably for your algorithm (L330-336). I would remove Fig. 2c. (If you still want to incorporate this information, perhaps change the x-axes of Figs. 2a,b to "error" rather than "delta".)

EXPOSITION

=The primal and dual LP formulations should be written somewhere, especially since the paper uses a slightly unconventional unbalanced version of OT. This would give simple, natural explanations of condition (C) as complementary slackness (L178), would clarify what is meant by "dual feasibility conditions" (L169), etc.; a generic sketch of the LP pair follows these comments. Also, I think the proof of Lemma 2.1 can be significantly shortened using LP duality/complementary slackness.

=Some reorganization would help the reader. Consider making S1.1 (cost scaling) its own section. Also consider dividing what is currently S2.1 (the algorithm) at L259, to separate the algorithm's description from its analysis.

=For NeurIPS, it is highly encouraged to include pseudocode of the full algorithm somewhere; perhaps put it in the supplement if space is lacking.

=The proofs would benefit from a reorganization pass: several times, facts about the algorithm are proved in odd orders, and it takes the reader some time to figure out which part of the proof is being referenced. E.g., condition (C) is proved in L263-273 in paragraph form (as opposed to in a lemma) and is later used in App L36 without reference; consider putting this in a lemma for easier referencing.

=L259-262: In the invariants, integrality should be mentioned (for the dual variables y, the slacks, etc.), since, although simple, it is used repeatedly in the following proofs and was the whole point of the cost scaling in S1.1.
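To make the LP remark concrete, here is a generic sketch of the balanced primal/dual pair (my notation, assumed close to the paper's; the paper's unbalanced variant would relax the marginal constraints):

```latex
\begin{align*}
\text{(P)}\quad & \min_{\sigma \ge 0} \sum_{a \in A,\; b \in B} c(a,b)\,\sigma(a,b)
  \quad \text{s.t.}\quad \sum_{b \in B} \sigma(a,b) = \mu(a),\quad
  \sum_{a \in A} \sigma(a,b) = \nu(b), \\
\text{(D)}\quad & \max_{y} \sum_{a \in A} \mu(a)\,y(a) + \sum_{b \in B} \nu(b)\,y(b)
  \quad \text{s.t.}\quad y(a) + y(b) \le c(a,b) \quad \forall (a,b) \in A \times B.
\end{align*}
```

Complementary slackness, \sigma(a,b) > 0 \implies y(a) + y(b) = c(a,b) (i.e. zero slack s(a,b) = c(a,b) - y(a) - y(b)), is then presumably exactly condition (C), and "dual feasibility" is the constraint in (D).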
MINOR (FIXABLE) ERRATA IN PROOFS

=L193: typo: eq (7) is missing a delta'/2 factor in the second summand.

=L269-270: "Since the claim is true initially, it is sufficient to argue that..." It must also be argued that no non-free demand vertex can later become free. This is not hard, but should be mentioned.

=App L9: "exactly one" is not true. It can be more than one, but they always come in telescoping pairs, so the argument goes through.

=App L44: \sum_P \Phi(P) is missing the r_P factors.

=App L46: "there are at most \mathcal{U} augmenting paths" is not the right justification. Rather, it is because \sum_{P \in \mathcal{P}} r_P, i.e. the total pushed flow summed over all augmenting paths, is at most \mathcal{U}. Indeed, this implies \sum_{P \in \mathcal{P}} r_P \Phi(P) \le \mathcal{U} \cdot (2), as desired.

=App L47: the first equality should be an inequality; see (1) in the main text.

MINOR COMMENTS

=L71-72: are there runtimes independent of the dimension of the points?

=L129: typo: "with for".

=L138: typo: comma should be a period.

=L166: typo: "any" -> "a".

=L178: typo: y(v) -> y(a).

=L181: replace "such that ... is 0" with just "satisfying (C)".

=L190: typo: missing "a".

=L224: "several" -> "at least one".

=L233: mention briefly that the slacks are all non-negative (by 1-feasibility, eqs (3),(4)), which is why Dijkstra's algorithm can be used.

=L242-248: mention briefly that this must be implemented with DFS, not BFS.

=Inconsistent notation for G_sigma: sometimes it has a vector arrow over it, sometimes not.

=References are missing capitalization, e.g. GAN, Wasserstein, Hungarian, Gaussian, Sinkhorn, etc.

=App L14: typo: comma after "because".

=App L26: consider changing the notation "wr" to a single letter.

=App eq (3): align the parentheses sizes.
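Finally, to illustrate the grid case in point 3c, here is a minimal sketch (mine, under the stated assumptions: a uniform 1-D grid and squared Euclidean cost, so that K = exp(-C/eta) is a Gaussian kernel and K v is a convolution; an image would use a separable 2-D FFT, and all names are illustrative):

```python
import numpy as np

def gaussian_matvec_fft(v, grid, eta):
    # (K v)_i = sum_j exp(-(x_i - x_j)^2 / eta) v_j in O(m log m) via the
    # FFT: on a uniform grid, K v is a convolution with a Gaussian kernel.
    m = len(grid)
    k = np.exp(-((grid - grid[0]) ** 2) / eta)  # kernel at displacements 0..m-1
    c = np.concatenate([k, k[:0:-1]])           # circulant layout, length 2m-1
    size = len(c)
    return np.fft.irfft(np.fft.rfft(c) * np.fft.rfft(v, size), size)[:m]

def sinkhorn_fft(a, b, grid, eta, iters=200):
    # Balanced Sinkhorn scaling; K is symmetric here, so K^T = K and each
    # half-iteration is one O(m log m) FFT matvec instead of an O(m^2) one.
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(iters):
        u = a / gaussian_matvec_fft(v, grid, eta)
        v = b / gaussian_matvec_fft(u, grid, eta)
    return u, v  # approximate transport plan: diag(u) @ K @ diag(v)

# Toy usage: 784 points on [0, 1] with uniform marginals.
grid = np.linspace(0.0, 1.0, 784)
a = b = np.full(784, 1.0 / 784)
u, v = sinkhorn_fft(a, b, grid, eta=1e-2)
```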

Reviewer 3

The authors propose an approximation algorithm for computing the optimal transport that improves the time complexity of the best known comparable algorithm by a polylog factor and is efficient in practice. The paper provides a comprehensive review of the related work and puts the result well into context. The paper is generally well written; however, comprehending the technical part requires familiarity with graph algorithms for flow and assignment problems, and the presentation could probably be made clearer. To satisfy the space limit, some proofs might be moved to the appendix. The result is obtained by a combination of classical scaling and primal-dual optimization techniques. The strength of the paper is its theoretical analysis. The approach allows for efficient implementations, as demonstrated by the experimental evaluation. However, the experimental evaluation is the weakest part of the paper, as it compares the algorithm only to the Sinkhorn method; a comparison to other methods for the same problem would be desirable.