NeurIPS 2020

Statistical and Topological Properties of Sliced Probability Divergences


Review 1

Summary and Contributions: This paper establishes a general theory for sliced probability divergences, which generalizes the popular theory on the Wasserstein distance to a much broader class of divergences, including integral probability metrics and Sinkhorn divergences. Topologically, the paper proves that the sliced divergences metrize the weak topology under mild conditions, as well as the topological equivalence between the sliced divergence and the divergence it is based upon when the probability measures have finite supports. Statistically, the paper shows that the convergence rate (a.k.a. sample complexity) of the sliced divergences is of the same order as that of the base divergence for one-dimensional measures, thereby avoiding the curse of dimensionality. The appendix provides various extra interesting examples.
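For readers less familiar with the construction, here is a minimal sketch (my own illustrative code, not the authors') of the sliced divergence described above, with the 1D Wasserstein-1 distance standing in for the base divergence ∆ and a plain Monte Carlo average over random directions:

    import numpy as np
    from scipy.stats import wasserstein_distance  # exact 1D Wasserstein-1 between empirical samples

    def sliced_w1(x, y, n_projections=100, seed=None):
        # x: (n, d) samples from mu, y: (m, d) samples from nu.
        rng = np.random.default_rng(seed)
        d = x.shape[1]
        # Draw directions uniformly on the unit sphere S^{d-1}.
        thetas = rng.normal(size=(n_projections, d))
        thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)
        # Average the 1D base divergence over the projected (pushed-forward) samples.
        return np.mean([wasserstein_distance(x @ th, y @ th) for th in thetas])

    # Example: two Gaussians in d = 50; the estimate only ever solves 1D problems.
    rng = np.random.default_rng(0)
    x = rng.normal(0.0, 1.0, size=(500, 50))
    y = rng.normal(0.5, 1.0, size=(500, 50))
    print(sliced_w1(x, y, n_projections=200, seed=1))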

Strengths: The paper is well-written and the proofs are clear and well-structured. The unified theory should be of interest in many areas of machine learning and statistics.

Weaknesses: I only have a few minor comments and suggestions.

(1) Line 168, "We note that the L-Lipschitz assumption is not crucial for this result and can be exchanged with a uniform continuity assumption". Could you elaborate on how the Lipschitz assumption can be relaxed?

(2) One major advantage of the sliced divergences is the dimension-independent sample complexity. But is it meaningful to compare the sample complexity of a sliced divergence with that of its base divergence? After all, they are on different scales.

(3) Line 53, "we prove that the 'sample complexity' of S∆ is proportional to the sample complexity of ∆ and does not depend on the dimension d". This was confusing on my first read. I think it is more precise to say that "the 'sample complexity' of S∆ is proportional to the sample complexity of ∆ for one-dimensional measures...".

(4) Proposition 2 can be sharpened using the concentration of measure on the sphere (see, e.g., Theorem 1.5 of Vershynin (2011)). In particular, let f_{ij}(\theta) = <x_i - x_j, \theta> / R. Then f_{ij} is 1-Lipschitz with median 0. Since \theta is uniformly distributed on the unit sphere, P(|f_{ij}(\theta)| >= t) <= 2e^{-d t^2 / 2}. Therefore, by a union bound, max_{ij} |f_{ij}(\theta)| <= t with probability at least 1 - n(n-1)e^{-d t^2 / 2}. This would give a bound of order R^2 \log(n / \delta) / d instead of the bound of order R^2 \log(n / \delta) / \sqrt{d}. (A quick numerical sanity check of this scaling is sketched at the end of this review.)

(5) Are there clean examples where S∆(μ_n, ν_n) --> 0 but ∆(μ_n, ν_n) -/-> 0 for the TV distance, or for other divergences on an unbounded domain?

Reference: Vershynin, Roman. "Lectures in geometric functional analysis." Unpublished manuscript (2011). Available at http://www-personal.umich.edu/romanv/papers/GFA-book/GFA-book.pdf

------------------ Post-rebuttal update -------------------

The authors' response clearly addressed all my concerns, so I raised my score to 8.
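For completeness, here is the numerical sanity check mentioned in point (4) above (my own sketch; the point-cloud sizes and dimensions are illustrative): for θ uniform on the unit sphere, max_{ij} |<x_i - x_j, θ>| / R should track sqrt(log(n)/d), so its square scales like log(n)/d rather than log(n)/sqrt(d).

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100
    for d in (10, 100, 1000):
        x = rng.normal(size=(n, d))                       # n points in R^d
        diffs = x[:, None, :] - x[None, :, :]
        R = np.max(np.linalg.norm(diffs, axis=-1))        # diameter of the point cloud
        theta = rng.normal(size=d)
        theta /= np.linalg.norm(theta)                    # theta uniform on the unit sphere
        proj = x @ theta
        f_max = np.max(np.abs(proj[:, None] - proj[None, :])) / R
        print(d, f_max, np.sqrt(np.log(n) / d))           # f_max tracks sqrt(log(n)/d)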

Correctness: Yes

Clarity: Yes

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback:


Review 2

Summary and Contributions: This paper develops and analyzes the statistical and topological properties of sliced probability divergences based on integral probability metrics. It considers projections of a multivariate probability measure onto particular directions, yielding Sliced Probability Divergences (SPDs), which can be used to learn a large class of distributions. Theoretical properties are provided.
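To fix ideas, here is a rough sketch of the construction as I understand it (my own illustrative code, not the authors'), with a Gaussian-kernel MMD as the 1D base IPM:

    import numpy as np

    def mmd2_1d(u, v, bandwidth=1.0):
        # Biased squared MMD between 1D samples, Gaussian kernel.
        k = lambda a, b: np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * bandwidth ** 2))
        return k(u, u).mean() + k(v, v).mean() - 2.0 * k(u, v).mean()

    def sliced_mmd2(x, y, n_projections=100, seed=None):
        # Average the 1D base divergence over projections onto random directions.
        rng = np.random.default_rng(seed)
        thetas = rng.normal(size=(n_projections, x.shape[1]))
        thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)
        return np.mean([mmd2_1d(x @ th, y @ th) for th in thetas])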

Strengths: I think this is an interesting paper, addressing the properties of IPMs under 1D projections. The paper progresses clearly, and both the theoretical and empirical results provided are sound.

Weaknesses: For sampling the random projections, a uniform distribution over directions is used, as stated in Theorem 2. This amounts to saying that every direction is equally important. When comparing distributions, projections onto some directions are arguably more "important" for distinguishing them; would exploiting this improve the results (say, Theorem 2)? A sketch of what I have in mind follows below. Also, as in line 198, the complexity gap between the estimated slicing and the population slicing is stated but not explicitly analyzed.
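To be concrete about the alternative I have in mind (an illustrative sketch of my own, not something in the paper): rather than averaging uniformly, one could keep only the most discriminative sampled direction, in the spirit of max-sliced divergences.

    import numpy as np
    from scipy.stats import wasserstein_distance

    def max_sliced_w1(x, y, n_candidates=100, seed=None):
        rng = np.random.default_rng(seed)
        thetas = rng.normal(size=(n_candidates, x.shape[1]))
        thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)
        # Keep the direction along which the projected measures differ the most.
        return max(wasserstein_distance(x @ th, y @ th) for th in thetas)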

Correctness: I don't find a major flaw in the theory and the empirical examples are correct.

Clarity: I think the paper is well presented.

Relation to Prior Work: The paper has an adequate survey of previous work. Since the paper discusses implicit generative models, maybe https://arxiv.org/pdf/1610.03483.pdf and https://arxiv.org/abs/1611.04488 are useful citations. Since the paper discusses MMD, Wasserstein, and Sinkhorn divergences in applications, it may be nice to draw the connections via https://arxiv.org/abs/1810.02733

Reproducibility: Yes

Additional Feedback: --- after response --- After the authors' response, I have gained a better understanding of the current state of research w.r.t. the choice of projection as well as the complexity analysis. Thank you.


Review 3

Summary and Contributions: The paper derives several topological and statistical properties of sliced probability divergences, which have been shown to be useful in practice but not well-studied from a theoretical point of view.

Strengths: The properties seem to be carefully considered and derived, and applicable to any base divergence and its sliced version. The results seem to corroborate and explain observed empirical behavior. Several experiments and examples support the results and a new and effective sliced-Sinkhorn divergence is proposed. I'm convinced by the other reviews and the authors' response that this is a good paper.

Weaknesses: I imagine that the results, while largely theoretical, would be interesting and important for a subcommunity of NeurIPS.

Correctness: This is far outside my area so all I can say is that the paper seems carefully and confidently written, with a thorough understanding of its context.

Clarity: The paper is well-written in the sense of grammar, etc. but I imagine that even for an expert it would be dense and terse. It seems like a huge amount of expertise and deep understanding of prior work would be required to fully understand the derivations.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: I apologize that this paper is so far outside my field of expertise that I'm unable to provide a useful review. I hope that other reviewers are more knowledgeable and constructive.


Review 4

Summary and Contributions: The authors investigate the properties of sliced divergences, which have been used in machine learning as a computationally friendly alternative to their unsliced counterparts for high-dimensional applications. Technical results are stated including, but not limited to, characterising the slicing operation's ability to preserve metric properties and metrize the weak topology, and sample complexity results.

Strengths: The technical results are nontrivial, of significant importance, and novel enough on their own. The promising experimental results are a welcome bonus.

Weaknesses: None

Correctness: I am satisfied with the correctness and empirical methodology.

Clarity: The paper is very well written, with clear, concise notation. I think the statement of Theorem 1 would be improved by the use of more mathematical notation to break up the text. L#121: Since the push-forward operation is defined right next to the sliced Wasserstein divergence, I think it would make sense to move the definition of the θ* notation here too. It took me a couple of minutes to locate the star notation definition in the boilerplate at the beginning of Section 2.
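For concreteness, the reading I eventually settled on (a toy illustration of my own, with assumed variable names): for an empirical measure, the push-forward θ*μ under the projection z -> <θ, z> is simply the empirical measure of the projected samples.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=(1000, 5))       # samples of an empirical measure mu_n in R^5
    theta = rng.normal(size=5)
    theta /= np.linalg.norm(theta)       # a direction on the unit sphere
    projected = x @ theta                # samples of the push-forward theta* mu_n on the real line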

Relation to Prior Work: The authors give an extensive and thorough account of previous work and how their results build upon the results in this area.

Reproducibility: Yes

Additional Feedback: