Reviews: Massively scalable Sinkhorn distances via the Nyström method

The problem of estimating Sinkhorn distance has been well studied. Given two probability distributions p and q, both on the same metric space, the Sinkhorn distance is a regularized notion of how measuring how different the two probability distributions are. In general, calculating the Sinkhorn distance needs O(n^2) time, where n = support of p or q. As the authors discuss, there are efficient algorithms that utilize the structure of the metric space (e.g. it being low-dimensional) and try to make this more efficient. In this paper, the novel contribution is the following: the authors utilize the connection between Sinkhorn distance and Sinkhorn scaling to give an efficient algorithm. They then use Nystrom method to do Sinkhorn scaling. Again, while Nystrom methods have been used in this setting before, this presents a neat bound in terms of what they call "effective dimension". They are able to prove that their algorithm leads to a guaranteed bound on the Sinkhorn distance. They also perform experiments to demonstrate the efficiency. The experiments seems quite exhaustive in terms of the baseline algorithms compared against. My overall impression is the following: while a number of blocks that are used in the algorithm are existing in literature, putting them together and proving the theoretical bounds requires substantial insight. The authors also extend their bounds to special cases when the data lie in low-dimensional manifolds. The paper seems well written too. I was not entirely sure why we would expect R to appear in the run-time? Any intuition , e.g. whether it is inevitable or is a effect of this analysis, would be useful.

Reviewer 2

Comments: - interesting paper - please check the paper w.r.t. the usage of similarities, dissimilarities, distances and make it consistent - some assumptions like psd should be clearly mentioned - 'method constructs a low-rank approximation to a Gaussian kernel' - well technically it can be used for any psd kernel - and even more generic (non-psd) as shown in A. Gisbrecht et al. Metric and non-metric proximity transformations at linear costs. Neurocomputing 167: 643-657 (2015) - Adaptive Nystroem is fine but can lead to suboptimal results for very small set of landmarks hence the number of landmarks should be reasonable large -> which at the end has some negative impact on the memory / runtime consumption - Could you please also comment on the out of sample extension - e.g. if I use the calculated/approximated proximities and generate a model how should I effectively and valid apply the model on new data - would it be 'safe' to calculate the proximities exact? - it would be nice to have a larger evaluation on the practical impact (not only runtime) of the approximation on its usage in some algorithm (so far only Fig 2) --> this is somewhat covered in the supplement but its a bit sad not to have a clear summary in the main paper

Reviewer 3

This paper proposes simple method to approximate Sinkhorn distance quickly by combining two previous techniques and presents a new analysis. The experimental results also show that the algorithm can obtain a low excess error efficiently. The results seem convincing. The presentation could be improved. The background is not enough and some notations are used without definition.

Paper ID:	2482
Title:	Massively scalable Sinkhorn distances via the Nyström method

Reviewer 1

Reviewer 2

Reviewer 3