NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, Vancouver Convention Center
Paper ID: 3487
Title: Maximum Mean Discrepancy Gradient Flow

Reviewer 1

This paper seems to accomplish two feats at once: it provides a rather deep dive into the specific topic of gradient flows w.r.t. the MMD, while it also lays out original propositions and theorems that establish the paper's main contributions. The first two sections of the paper are excellent and provide a solid introduction to the material in the subsequent sections.

Per C1, it appears this is fully realized in Proposition 7 in Section 3.2. As an outsider to this level of detail in the field, it is unclear to me how stringent the assumption there is for guaranteeing convergence to a global optimum. The proofs seem correct and the other assumptions (e.g. A) are quite mild, but it is unclear whether this condition can even be checked in a simple case. The authors mention it can be checked in the case of an (unweighted) negative Sobolev distance, but it would be nice to clarify whether the same is known for their weighted distance.

Per C2, the authors propose a method to avoid local minima both theoretically and empirically. The idea of mollifying the particles before pushing them through grad f is their way of making sure the support of the particles does not collapse over time (see the sketch after this review). This is demonstrated empirically in Section 4.2. I have a few questions about this experiment:
* How sensitive is the method to the choice of beta_n? From a support-collapse perspective it appears you want them large, but if they are too large then they bias the diffusion.
* Is it true that the validation error eventually plateaus? Is this because beta_n > 0?
* Are there any other methods that the noisy gradient method can be compared to, e.g., SVGD?

Originality: The paper appears to provide two novel results, including many supporting propositions along the way. It is unclear how readily these results extend to GANs or the training of other NNs. The authors provide a theoretical result in the appendix (Theorem 10), but it is not empirically clear how useful these updates are in comparison to standard gradient descent.

Quality: The paper appears correct, although I have not read through all the proofs in the appendix (it is a 33-page paper!).

Clarity: The paper is wonderfully written. Despite the technical nature of the content, it is very clear and approachable for those not completely familiar with the subject matter.

Significance: Not being an expert in this field, I have some uncertainty in my assessment of the overall impact of this paper. However, the paper is well written and provides two novel ideas relating to gradient flows for the MMD, which seem like a worthy contribution to the field.

ADDENDUM: Upon reading the other reviewers' critiques and the author feedback, my score here stands. It is not exactly clear how much weaker the assumptions used in Proposition 7 are than the state of the art in the literature, so I am unable to make any meaningful changes.
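
A minimal NumPy sketch of the kind of noise-injected MMD particle update discussed above, under the assumptions of a Gaussian kernel, equally weighted particles, and i.i.d. samples from the target; the names witness_grad, noisy_mmd_step, step, beta, and bw are illustrative, not taken from the paper:

import numpy as np

def witness_grad(x, particles, targets, bw=1.0):
    # Gradient at x of the empirical MMD witness function
    #   f(x) = mean_i k(x, y_i) - mean_j k(x, z_j)
    # with a Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 * bw^2)),
    # where y_i are the current particles and z_j are target samples.
    def grad_mean_k(x, pts):
        diffs = x - pts                                          # (n, d)
        k = np.exp(-np.sum(diffs ** 2, axis=1) / (2 * bw ** 2))  # (n,)
        return -(k[:, None] * diffs).mean(axis=0) / bw ** 2      # (d,)
    return grad_mean_k(x, particles) - grad_mean_k(x, targets)

def noisy_mmd_step(particles, targets, step=0.1, beta=0.5, bw=1.0, rng=None):
    # One discretized MMD-flow step with noise injection: the witness
    # gradient is evaluated at a mollified (noised) copy of each particle,
    # but the original particle is the one that gets moved.
    rng = np.random.default_rng() if rng is None else rng
    noised = particles + beta * rng.standard_normal(particles.shape)
    grads = np.array([witness_grad(x, particles, targets, bw) for x in noised])
    return particles - step * grads

Annealing beta across iterations (the beta_n above) is exactly where the trade-off raised in the questions shows up: larger noise keeps the support from collapsing, but too much noise biases the update.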

Reviewer 2

The problem studied by the authors, of defining (and proving convergence of) a gradient flow of the MMD with respect to the Wasserstein distance, is an important one. As noted in the paper, understanding the behavior of this gradient flow could lead to a better understanding of the training dynamics of GANs. Eventually this could lead to better, theoretically sound procedures to train GANs.

At a high level, the convergence analysis follows the steps in Ambrosio, Gigli and Savaré (2008), who studied the gradient flow of the KL divergence with respect to the Wasserstein distance (studied by JKO initially). While the main ingredients of the analysis, such as Proposition 4 and Proposition 5 (via displacement convexity), have direct analogs in the proofs of Ambrosio et al., the presentation here is done well and seems to be technically correct. In Section 4 the authors also present simple extensions to their algorithm, adding noise to regularize and a sample-based approximation to the gradients to make the algorithm practically viable. They also provide simple extensions to their theory to motivate these additions.

Overall the paper is well written, and the problem they solve is interesting and important. Some terms, however, are used before they are defined: for example, in Ln. 31 witness functions are undefined (a standard definition is recalled after this review). The assumptions used in the paper sit deep in Appendix C without any mention of this in the main paper; it would be useful to state them, or link to them, in the main text. Further, the authors could perhaps give some intuition behind the definition and usefulness of displacement convexity (also known as geodesic convexity) and how it plays into their proof.

===== Post Author Feedback =====
Thank you for your polite rebuttal. In the discussion, an important point was raised about the locality condition made in the proofs of Propositions 7 and 8. I would urge the authors to motivate and provide an interpretation for this condition (which seems to be both crucial and hard to check) in subsequent drafts.
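
For reference on the undefined term in Ln. 31: with a kernel k, a current measure \mu, and a target \pi (generic symbols, not necessarily the paper's notation), the MMD witness function is the difference of kernel mean embeddings, and the MMD is its RKHS norm:
\[
  f_{\mu,\pi}(x) = \int k(x, y)\, \mathrm{d}\mu(y) - \int k(x, y)\, \mathrm{d}\pi(y),
  \qquad
  \mathrm{MMD}(\mu, \pi) = \lVert f_{\mu,\pi} \rVert_{\mathcal{H}},
\]
where \mathcal{H} is the RKHS associated with k.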

Reviewer 3

For any objective functional of a probability measure, one can write down its Wasserstein gradient flow. When the objective functional is quadratic, the gradient flow is of particular interest and has been studied in the literature:
1. In interacting particle systems in physics, [Carrillo, McCann, Villani, 2006].
2. In the case of neural networks, [Mei, Montanari, and Nguyen, 2018], [Rotskoff and Vanden-Eijnden, 2018].
3. In the case of the KSD flow, [Mroueh, Sercu, and Raj, 2019].

One difficulty with the Wasserstein flow of a quadratic functional is to establish a global convergence result under reasonable assumptions. The new results in this paper (which are the main contribution, as the authors write in the introduction) are the global convergence results in Proposition 7 and Theorem 6. However, these results are quite weak in the sense that they require unverifiable assumptions on the trajectory. This makes the theory read as: the gradient flow will converge to the global optimum if you start near the global optimum, which is essentially local convergence. Therefore, this “global convergence theory” does not sound satisfactory.

In the conclusion section, the authors claim novelty in introducing the MMD flow. The authors recognized that the MMD functional is a quadratic functional of the probability measure of interest (a standard expansion is recalled after this review), and that existing results on Wasserstein flows of quadratic functionals (the descending properties, entropy regularization, the propagation-of-chaos bound for discretizations) can be applied to the MMD functional. Since there are already works introducing the KSD flow and SVGD that relate generative modeling and Wasserstein gradient flows, whether recognizing this relationship is enough of a contribution is questionable from the reviewer's point of view. On the positive side, the paper is well written and properly cited.

---- After reading the response:
I thank the authors for clarifying the connections and differences between the MMD flow and the results in the literature. In terms of writing and references, I think this paper did pretty well. After discussing with the other reviewer, I also believe this paper did a good job of adapting the proofs for Wasserstein gradient flows to its setting. This is the reason I decided to increase my score to 5. My concern is still the trajectory-dependent convergence results (Propositions 7 and 8 and Theorem 6), which are the new ingredients of this paper. I agree with the authors that it is unclear a priori which locality condition is best suited to characterize convergence of the MMD flow. Still, I am not very satisfied with the assumptions in lines 218, 219, and 268. I am not convinced that these assumptions and results provide more insight than saying gradient descent will converge to the global minimum if it does not get stuck at a bad local minimum. (Proposition 8 seems to provide exponential convergence, but the condition \sum_n \beta_n^2 = \infty with \beta_n satisfying Eq. (18) is trajectory dependent and cannot be easily checked.)
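
To make the quadratic-functional observation concrete: with a kernel k and target \pi (generic notation, not necessarily the paper's), the squared MMD expands as
\[
  \mathrm{MMD}^2(\mu, \pi)
  = \iint k(x, y)\, \mathrm{d}\mu(x)\, \mathrm{d}\mu(y)
  - 2 \iint k(x, y)\, \mathrm{d}\mu(x)\, \mathrm{d}\pi(y)
  + \iint k(x, y)\, \mathrm{d}\pi(x)\, \mathrm{d}\pi(y),
\]
which is quadratic in \mu, so the existing machinery for Wasserstein flows of quadratic functionals mentioned above applies directly.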