{"title": "Maximum Mean Discrepancy Gradient Flow", "book": "Advances in Neural Information Processing Systems", "page_first": 6484, "page_last": 6494, "abstract": "We construct a Wasserstein gradient flow of the maximum mean discrepancy (MMD) and study its convergence properties.\n The MMD is an integral probability metric defined for a reproducing kernel Hilbert space (RKHS), and serves as a metric on probability measures for a sufficiently rich RKHS. We obtain conditions for convergence of the gradient flow towards a global optimum, that can be related to particle transport when optimizing neural networks.\n We also propose a way to regularize this MMD flow, based on an injection of noise in the gradient. This algorithmic fix comes with theoretical and empirical evidence.\nThe practical implementation of the flow is straightforward, since both the MMD and its gradient have simple closed-form expressions, which can be easily estimated with samples.", "full_text": "Maximum Mean Discrepancy Gradient Flow\n\nMichael Arbel\n\nAnna Korba\n\nGatsby Computational Neuroscience Unit\n\nGatsby Computational Neuroscience Unit\n\nUniversity College London\n\nmichael.n.arbel@gmail.com\n\nUniversity College London\n\na.korba@ucl.ac.uk\n\nAdil Salim\n\nKAUST\n\nadil.salim@kaust.edu.sa\n\nVisual Computing Center\n\nGatsby Computational Neuroscience Unit\n\nArthur Gretton\n\nUniversity College London\n\narthur.gretton@gmail.com\n\nAbstract\n\nWe construct a Wasserstein gradient \ufb02ow of the maximum mean discrepancy\n(MMD) and study its convergence properties. The MMD is an integral probability\nmetric de\ufb01ned for a reproducing kernel Hilbert space (RKHS), and serves as a\nmetric on probability measures for a suf\ufb01ciently rich RKHS. We obtain conditions\nfor convergence of the gradient \ufb02ow towards a global optimum, that can be related\nto particle transport when optimizing neural networks. 
We also propose a way to regularize this MMD flow, based on an injection of noise in the gradient. This algorithmic fix comes with theoretical and empirical evidence. The practical implementation of the flow is straightforward, since both the MMD and its gradient have simple closed-form expressions, which can be easily estimated with samples.\n\n1 Introduction\n\nWe address the problem of defining a gradient flow on the space of probability distributions endowed with the Wasserstein metric, which transports probability mass from a starting distribution \u03bd to a target distribution \u00b5. Our flow is defined on the maximum mean discrepancy (MMD) [19], an integral probability metric [33] which uses the unit ball in a characteristic RKHS [43] as its witness function class. Specifically, we choose the function in the witness class that has the largest difference in expectation under \u03bd and \u00b5: this difference constitutes the MMD. The idea of descending a gradient flow over the space of distributions can be traced back to the seminal work of [24], who revealed that the Fokker-Planck equation is a gradient flow of the Kullback-Leibler divergence. Its time-discretization leads to the celebrated Langevin Monte Carlo algorithm, which comes with strong convergence guarantees (see [14, 15]), but requires knowledge of an analytical form of the target \u00b5. A more recent gradient flow approach, Stein Variational Gradient Descent (SVGD) [29], also leverages this analytical \u00b5.\n\nThe study of particle flows defined on the MMD relates to two important topics in modern machine learning. The first is the training of Implicit Generative Models, notably generative adversarial networks [18]. Integral probability metrics have been used extensively as critic functions in this setting: these include the Wasserstein distance [3, 17, 22] and the maximum mean discrepancy [2, 4, 5, 16, 27, 28]. 
In\n[32, Section 3.3], a connection between IGMs and particle transport is proposed, where it is shown\nthat gradient \ufb02ow on the witness function of an integral probability metric takes a similar form to the\ngenerator update in a GAN. The critic IPM in this case is the Kernel Sobolev Discrepancy (KSD),\nwhich has an additional gradient norm constraint on the witness function compared with the MMD. It\nis intended as an approximation to the negative Sobolev distance from the optimal transport literature\n[35, 36, 45]. There remain certain differences between gradient \ufb02ow and GAN training, however.\nFirst, and most obviously, gradient \ufb02ow can be approximated by representing \u03bd as a set of particles,\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fwhereas in a GAN \u03bd is the output of a generator network. The requirement that this generator network\nbe a smooth function of its parameters causes a departure from pure particle \ufb02ow. Second, in modern\nimplementations [2, 5, 27], the kernel used in computing the critic witness function for an MMD\nGAN critic is parametrized by a deep network, and an alternating optimization between the critic\nparameters and the generator parameters is performed. Despite these differences, we anticipate that\nthe theoretical study of MMD \ufb02ow convergence will provide helpful insights into conditions for GAN\nconvergence, and ultimately, improvements to GAN training algorithms.\nRegarding the second topic, we note that the properties of gradient descent for large neural networks\nhave been modeled using the convergence towards a global optimum of particle transport in the\npopulation limit, when the number of particles goes to in\ufb01nity [12, 31, 38, 41]. 
In particular, [37]\nshow that gradient descent on the parameters of a neural network can also be seen as a particle\ntransport problem, which has as its population limit a gradient \ufb02ow of a functional de\ufb01ned for\nprobability distributions over the parameters of the network. This functional is in general non-convex,\nwhich makes the convergence analysis challenging. The particular structure of the MMD allows us to\nrelate its gradient \ufb02ow to neural network optimization in a well-speci\ufb01ed regression setting similar to\n[12, 37] (we make this connection explicit in Appendix F).\nOur main contribution in this work is to establish conditions for convergence of MMD gradient \ufb02ow\nto its global optimum. We give detailed descriptions of MMD \ufb02ow for both its continuous-time and\ndiscrete instantiations in Section 2. In particular, the MMD \ufb02ow may employ a sample approximation\nfor the target \u00b5: unlike e.g. Langevin Monte Carlo or SVGD, it does not require \u00b5 in analytical form.\nGlobal convergence is especially challenging to prove: while for functionals that are displacement\nconvex, the gradient \ufb02ow can be shown to converge towards a global optimum [1], the case of\nnon-convex functionals, like the MMD, requires different tools. A modi\ufb01ed gradient \ufb02ow is proposed\nin [37] that uses particle birth and death to reach global optimality. Global optimality may also be\nachieved simply by teleporting particles from \u03bd to \u00b5, as occurs for the Sobolev Discrepancy \ufb02ow\nabsent a kernel regulariser [32, Theorem 4, Appendix D]. Note, however, that the regularised Kernel\nSobolev Discrepancy \ufb02ow does not rely on teleportation.\nOur approach takes inspiration in particular from [7], where it is shown that although the 1-Wasserstein\ndistance is non-convex, it can be optimized up to some barrier that depends on the diameter of the\ndomain of the target distribution. 
Similarly to [7], we provide in Section 3 a barrier on the gradient flow of the MMD, although the tightness of this barrier in terms of the target diameter remains to be established. We obtain a further condition on the evolution of the flow to ensure global optimality, and give rates of convergence in that case; however, the condition is a strong one: it implies that the negative Sobolev distance between the target and the current particles remains bounded at all times. We thus propose a way to regularize the MMD flow, based on a noise injection (Section 4) in the gradient, with more tractable theoretical conditions for convergence. Encouragingly, the noise injection is shown in practice to ensure convergence in a simple illustrative case where the original MMD flow fails. Finally, while our emphasis has been on establishing conditions for convergence, we note that MMD gradient flow has a simple O(MN + N\u00b2) implementation for N \u03bd-samples and M \u00b5-samples, and requires only evaluating the gradient of the kernel k on the given samples.\n\n2 Gradient flow of the MMD in W2\n\n2.1 Construction of the gradient flow\n\nIn this section we introduce the gradient flow of the Maximum Mean Discrepancy (MMD) and highlight some of its properties. We start by briefly reviewing the MMD introduced in [19]. We define X \u2282 Rd as the closure of a convex open set, and P2(X) as the set of probability distributions on X with finite second moment, equipped with the 2-Wasserstein metric denoted W2. For any \u03bd \u2208 P2(X), L2(\u03bd) is the set of square integrable functions w.r.t. \u03bd. The reader may find a relevant mathematical background in Appendix A.\n\nMaximum Mean Discrepancy. Given a characteristic kernel k : X \u00d7 X \u2192 R, we denote by H its corresponding RKHS (see [42]). The space H is a Hilbert space with inner product \u27e8\u00b7, \u00b7\u27e9H and norm \u2225\u00b7\u2225H. 
We will rely on specific assumptions on the kernel which are given in Appendix B. In particular, Assumption (A) states that the gradient of the kernel, \u2207k, is Lipschitz with constant L. For such kernels, it is possible to define the Maximum Mean Discrepancy as a distance on P2(X).\n\nThe MMD can be written as the RKHS norm of the unnormalised witness function f\u00b5,\u03bd between \u00b5 and \u03bd, which is the difference between the mean embeddings of \u03bd and \u00b5:\n\nMMD(\u00b5, \u03bd) = \u2225f\u00b5,\u03bd\u2225H, where f\u00b5,\u03bd(z) = \u222b k(x, z) d\u03bd(x) \u2212 \u222b k(x, z) d\u00b5(x), \u2200z \u2208 X. (1)\n\nThroughout the paper, \u00b5 will be fixed and \u03bd can vary, hence we will only consider the dependence in \u03bd and denote F(\u03bd) = (1/2) MMD\u00b2(\u00b5, \u03bd). A direct computation [32, Appendix B] shows that for any finite measure \u03c7 such that \u03bd + \u03b5\u03c7 \u2208 P2(X), we have\n\nlim_{\u03b5\u21920} \u03b5\u207b\u00b9(F(\u03bd + \u03b5\u03c7) \u2212 F(\u03bd)) = \u222b f\u00b5,\u03bd(x) d\u03c7(x). (2)\n\nThis means that f\u00b5,\u03bd is the differential of F(\u03bd). Interestingly, F(\u03bd) admits a free-energy expression:\n\nF(\u03bd) = \u222b V(x) d\u03bd(x) + (1/2) \u222b\u222b W(x, y) d\u03bd(x) d\u03bd(y) + C, (3)\n\nwhere V is a confinement potential, W an interaction potential, and C a constant, defined by:\n\nV(x) = \u2212\u222b k(x, x\u2032) d\u00b5(x\u2032), W(x, x\u2032) = k(x, x\u2032), C = (1/2) \u222b\u222b k(x, x\u2032) d\u00b5(x) d\u00b5(x\u2032). (4)\n\nFormulation (3) and the simple expression of the differential in (2) will be key to construct a gradient flow of F(\u03bd), to transport particles. 
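Since both the MMD and the witness function have closed-form expressions in terms of the kernel, they are straightforward to estimate from samples. The following is a minimal sketch (not the paper's reference code) using a Gaussian kernel, which has a Lipschitz gradient as required by Assumption (A); the helper names and the bandwidth `sigma` are illustrative choices of ours:

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 sigma^2)); any kernel with a
    # Lipschitz gradient (Assumption (A)) could be substituted here.
    d2 = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd_squared(x, y, sigma=1.0):
    # Plug-in (V-statistic) estimate of
    # MMD^2(mu, nu) = E k(X, X') - 2 E k(X, Y) + E k(Y, Y'),
    # with x ~ nu and y ~ mu; F(nu) is then 0.5 * mmd_squared.
    kxx = gaussian_kernel(x, x, sigma)
    kyy = gaussian_kernel(y, y, sigma)
    kxy = gaussian_kernel(x, y, sigma)
    return kxx.mean() - 2 * kxy.mean() + kyy.mean()

def witness_gradient(z, x, y, sigma=1.0):
    # grad_z f_{mu,nu}(z) = int grad_z k(x, z) dnu(x) - int grad_z k(x, z) dmu(x),
    # estimated with samples x ~ nu, y ~ mu, at evaluation points z.
    def mean_grad(samples, z):
        diff = z[:, None, :] - samples[None, :, :]              # (n_z, n_s, d)
        k = np.exp(-np.sum(diff ** 2, axis=-1) / (2 * sigma ** 2))
        # grad_z k(s, z) = -(z - s) / sigma^2 * k(s, z) for the Gaussian kernel
        return (-(diff / sigma ** 2) * k[..., None]).mean(axis=1)
    return mean_grad(x, z) - mean_grad(y, z)
```

The MMD estimate above is the biased V-statistic; for the flow itself only the witness gradient is needed, matching the differential in (2).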
In (4), V re\ufb02ects the potential generated by \u00b5 and acting on each\nparticle, while W re\ufb02ects the potential arising from the interactions between those particles.\n\nGradient \ufb02ow of the MMD. We consider now the problem of transporting mass from an initial\ndistribution \u03bd0 to a target distribution \u00b5, by \ufb01nding a continuous path \u03bdt starting from \u03bd0 that converges\nto \u00b5 while decreasing F(\u03bdt). Such a path should be physically plausible, in that teleportation\nphenomena are not allowed. For instance, the path \u03bdt = (1 \u2212 e\u2212t)\u00b5 + e\u2212t\u03bd0 would constantly\nteleport mass between \u00b5 and \u03bd0 although it decreases F since F(\u03bdt) = e\u22122tF(\u03bd0) [32, Section 3.1,\nCase 1]. The physicality of the path is understood in terms of classical statistical physics: given\nan initial con\ufb01guration \u03bd0 of N particles, these can move towards a new con\ufb01guration \u00b5 through\nsuccessive small transformations, without jumping from one location to another.\nOptimal transport theory provides a way to construct such a continuous path by means of the\ncontinuity equation. Given a vector \ufb01eld Vt on X and an initial condition \u03bd0, the continuity equation\nis a partial differential equation which de\ufb01nes a path \u03bdt evolving under the action of the vector \ufb01eld\nVt, and reads \u2202t\u03bdt = \u2212div(\u03bdtVt) for all t \u2265 0. The reader can \ufb01nd more detailed discussions in\nAppendix A.2 or [39]. Following [1], a natural choice is to choose Vt as the negative gradient of the\ndifferential of F(\u03bdt) at \u03bdt, since it corresponds to a gradient \ufb02ow of F associated with the W2 metric\n(see Appendix A.3). 
By (2), we know that the differential of F(\u03bdt) at \u03bdt is given by f\u00b5,\u03bdt, hence Vt(x) = \u2212\u2207f\u00b5,\u03bdt(x).\u00b9 The gradient flow of F is then defined by the solution (\u03bdt)t\u22650 of\n\n\u2202t\u03bdt = div(\u03bdt\u2207f\u00b5,\u03bdt). (5)\n\nEquation (5) is non-linear in that the vector field depends itself on \u03bdt. This type of equation is associated in the probability theory literature with the so-called McKean-Vlasov process [26, 30],\n\ndXt = \u2212\u2207f\u00b5,\u03bdt(Xt)dt, X0 \u223c \u03bd0. (6)\n\nIn fact, (6) defines a process (Xt)t\u22650 whose distribution (\u03bdt)t\u22650 satisfies (5), as shown in Proposition 1. (Xt)t\u22650 can be interpreted as the trajectory of a single particle, starting from an initial random position X0 drawn from \u03bd0. The trajectory is driven by the velocity field \u2212\u2207f\u00b5,\u03bdt, and is affected by other particles. These interactions are captured by the velocity field through the dependence on the current distribution \u03bdt of all particles. Existence and uniqueness of a solution to (5) and (6) are guaranteed in the next proposition, whose proof is given in Appendix C.1.\n\nProposition 1. Let \u03bd0 \u2208 P2(X). Then, under Assumption (A), there exists a unique process (Xt)t\u22650 satisfying the McKean-Vlasov equation in (6) such that X0 \u223c \u03bd0. Moreover, the distribution \u03bdt of Xt is the unique solution of (5) starting from \u03bd0, and defines a gradient flow of F.\n\n\u00b9Also, Vt = \u2207V + \u2207W \u22c6 \u03bdt (see Appendix A.3), where \u22c6 denotes the classical convolution.\n\nBesides existence and uniqueness of the gradient flow of F, one expects F to decrease along the path \u03bdt and ideally to converge towards 0. 
The first property is rather easy to obtain, and is the object of Proposition 2; it is similar to the result for the KSD flow in [32, Section 3.1].\n\nProposition 2. Under Assumption (A), F(\u03bdt) is decreasing in time and satisfies:\n\ndF(\u03bdt)/dt = \u2212\u222b \u2225\u2207f\u00b5,\u03bdt(x)\u2225\u00b2 d\u03bdt(x). (7)\n\nThis property results from (5) and the energy identity in [1, Theorem 11.3.2] and is proved in Appendix C.1. From (7), F can be seen as a Lyapunov functional for the dynamics defined by (5), since it is decreasing in time. Hence, the continuous-time gradient flow introduced in (5) allows us to formally consider the notion of gradient descent on P2(X) with F as a cost function. A time-discretized version of the flow naturally follows, and is provided in the next section.\n\n2.2 Euler scheme\n\nWe consider here a forward-Euler scheme of (5). For any measurable map T : X \u2192 X and \u03bd \u2208 P2(X), we denote the pushforward measure by T#\u03bd (see Appendix A.2). Starting from \u03bd0 \u2208 P2(X) and using a step-size \u03b3 > 0, a sequence \u03bdn \u2208 P2(X) is given by iteratively applying\n\n\u03bdn+1 = (I \u2212 \u03b3\u2207f\u00b5,\u03bdn)#\u03bdn. (8)\n\nFor all n \u2265 0, equation (8) is the distribution of the process defined by\n\nXn+1 = Xn \u2212 \u03b3\u2207f\u00b5,\u03bdn(Xn), X0 \u223c \u03bd0. (9)\n\nThe asymptotic behavior of (8) as n \u2192 \u221e will be the object of Section 3. For now, we provide a guarantee that the sequence (\u03bdn)n\u2208N approaches (\u03bdt)t\u22650 as the step-size \u03b3 \u2192 0.\n\nProposition 3. Let n \u2265 0. Consider \u03bdn defined in (8), and the interpolation path \u03c1\u03b3_t defined as \u03c1\u03b3_t = (I \u2212 (t \u2212 n\u03b3)\u2207f\u00b5,\u03bdn)#\u03bdn, \u2200t \u2208 [n\u03b3, (n + 1)\u03b3). 
Then, under Assumption (A), for all T > 0,\n\nW2(\u03c1\u03b3_t, \u03bdt) \u2264 \u03b3C(T), \u2200t \u2208 [0, T], (10)\n\nwhere C(T) is a constant that depends only on T.\n\nA proof of Proposition 3 is provided in Appendix C.2 and relies on standard techniques to control the discretization error of a forward-Euler scheme. Proposition 3 means that \u03bdn can be linearly interpolated, giving rise to a path \u03c1\u03b3_t which gets arbitrarily close to \u03bdt on bounded intervals. Note that as T \u2192 \u221e the bound C(T) is expected to blow up. However, this result is enough to show that (8) is indeed a discrete-time flow of F. In fact, provided that \u03b3 is small enough, F(\u03bdn) is a decreasing sequence, as shown in Proposition 4.\n\nProposition 4. Under Assumption (A), and for \u03b3 \u2264 2/(3L), the sequence F(\u03bdn) is decreasing, and\n\nF(\u03bdn+1) \u2212 F(\u03bdn) \u2264 \u2212\u03b3(1 \u2212 (3\u03b3/2)L) \u222b \u2225\u2207f\u00b5,\u03bdn(x)\u2225\u00b2 d\u03bdn(x), \u2200n \u2265 0.\n\nProposition 4, whose proof is given in Appendix C.2, is a discrete analog of Proposition 2. In fact, (8) is intractable in general as it requires the knowledge of \u2207f\u00b5,\u03bdn (and thus of \u03bdn) exactly at each iteration n. Nevertheless, we present in Section 4.2 a practical algorithm using a finite number of samples which is provably convergent towards (8) as the sample size increases. We thus begin by studying the convergence properties of the time-discretized MMD flow (8) in the next section.\n\n3 Convergence properties of the MMD flow\n\nWe are interested in analyzing the asymptotic properties of the gradient flow of F. Although we know from Propositions 2 and 4 that F decreases in time, it can very well converge to local minima. One way to see this is by looking at the equilibrium condition for (7). 
As a non-negative and decreasing function, t \u21a6 F(\u03bdt) is guaranteed to converge towards a finite limit l \u2265 0, which implies in turn that the r.h.s. of (7) converges to 0. If \u03bdt happens to converge towards some distribution \u03bd\u2217, it is possible to show that the equilibrium condition (11) must hold [31, Prop. 2]:\n\n\u222b \u2225\u2207f\u00b5,\u03bd\u2217(x)\u2225\u00b2 d\u03bd\u2217(x) = 0. (11)\n\nCondition (11) does not necessarily imply that \u03bd\u2217 is a global optimum, unless the loss function has a particular structure [11]. For instance, this would hold if the kernel were linear in at least one of its dimensions. However, when a characteristic kernel is required (to ensure the MMD is a distance), such a structure cannot be exploited. Similarly, the claim that the KSD flow converges globally, [32, Prop. 3, Appendix B.1], requires an assumption [32, Assump. A] that excludes local minima which are not global (see Appendix D.1; recall the KSD is related to the MMD). Global convergence of the flow is harder to obtain, and will be the topic of this section. The main challenge is the lack of convexity of F w.r.t. the Wasserstein metric. We show that F is merely \u039b-convex, and that standard optimization techniques only provide a loose bound on its asymptotic value. We next exploit a Lojasiewicz type inequality to prove convergence to the global optimum, provided that a particular quantity remains bounded at all times.\n\n3.1 Optimization in a (W2) non-convex setting\n\nThe displacement convexity of a functional F is an important criterion in characterizing the convergence of its Wasserstein gradient flow. 
Displacement convexity states that t \u21a6 F(\u03c1t) is a convex function whenever (\u03c1t)t\u2208[0,1] is a path of minimal length between two distributions \u00b5 and \u03bd (see Definition 2). Displacement convexity should not be confused with mixture convexity, which corresponds to the usual notion of convexity. As a matter of fact, F is mixture convex in that it satisfies: F(t\u03bd + (1 \u2212 t)\u03bd\u2032) \u2264 tF(\u03bd) + (1 \u2212 t)F(\u03bd\u2032) for all t \u2208 [0, 1] and \u03bd, \u03bd\u2032 \u2208 P2(X) (see Lemma 25). Unfortunately, F is not displacement convex. Instead, F only satisfies a weaker notion of displacement convexity called \u039b-displacement convexity, given in Definition 4 (Appendix A.4).\n\nProposition 5. Under Assumptions (A) to (C), F is \u039b-displacement convex, and satisfies\n\nF(\u03c1t) \u2264 (1 \u2212 t)F(\u03bd) + tF(\u03bd\u2032) \u2212 \u222b\u2080\u00b9 \u039b(\u03c1s, vs)G(s, t) ds (12)\n\nfor all \u03bd, \u03bd\u2032 \u2208 P2(X) and any displacement geodesic (\u03c1t)t\u2208[0,1] from \u03bd to \u03bd\u2032 with velocity vectors (vt)t\u2208[0,1]. The functional \u039b is defined for any pair (\u03c1, v) with \u03c1 \u2208 P2(X) and \u2225v\u2225 \u2208 L2(\u03c1) by\n\n\u039b(\u03c1, v) = \u2225\u222b v(x)\u00b7\u2207xk(x, \u00b7) d\u03c1(x)\u2225\u00b2H \u2212 \u221a(2\u03bbdF(\u03c1)) \u222b \u2225v(x)\u2225\u00b2 d\u03c1(x), (13)\n\nwhere (s, t) \u21a6 G(s, t) = s(1 \u2212 t)1{s \u2264 t} + t(1 \u2212 s)1{s \u2265 t} and \u03bb is defined in Assumption (C).\n\nProposition 5 can be obtained by computing the second time derivative of F(\u03c1t), which is then lower-bounded by \u039b(\u03c1t, vt) (see Appendix D.2). In (13), the map \u039b is a difference of two non-negative terms: thus \u222b\u2080\u00b9 \u039b(\u03c1s, vs)G(s, t) ds can become negative, and displacement convexity does not hold in general. 
[8, Theorem 6.1] provides a convergence result when only \u039b-displacement convexity holds, as long as either the potential or the interaction term is convex enough. In fact, as mentioned in [8, Remark 6.4], the convexity of either term could compensate for a lack of convexity of the other. Unfortunately, this cannot be applied for the MMD since both terms involve the same kernel but with opposite signs. Hence, even under convexity of the kernel, a concave term appears and cancels the effect of the convex term. Moreover, the requirement that the kernel be positive semi-definite makes it hard to construct interesting convex kernels. However, it is still possible to provide an upper bound on the asymptotic value of F(\u03bdn) when (\u03bdn)n\u2208N are obtained using (8). This bound is given in Theorem 6, and depends on a scalar K(\u03c1n) := \u222b\u2080\u00b9 \u039b(\u03c1n_s, vn_s)(1 \u2212 s) ds, where (\u03c1n_s)s\u2208[0,1] is a constant-speed displacement geodesic from \u03bdn to the optimal value \u00b5, with velocity vectors (vn_s)s\u2208[0,1] of constant norm.\n\nTheorem 6. Let \u00afK be the average of (K(\u03c1j))0\u2264j\u2264n. Under Assumptions (A) to (C) and if \u03b3 \u2264 1/(3L),\n\nF(\u03bdn) \u2264 W2\u00b2(\u03bd0, \u00b5)/(2\u03b3n) \u2212 \u00afK. (14)\n\nTheorem 6 is obtained using techniques from optimal transport and optimization. It relies on Proposition 5 and Proposition 4 to prove an extended variational inequality (see Proposition 16), and concludes using a suitable Lyapunov function. A full proof is given in Appendix D.3. When \u00afK is non-negative, one recovers the usual O(1/n) convergence rate of the gradient descent algorithm. However, \u00afK can be negative in general, and would therefore act as a barrier on the optimal value that F(\u03bdn) can achieve when n \u2192 \u221e. In that sense, the above result is similar to [7, Theorem 6.9]. Theorem 6 only provides a loose bound, however. 
In Section 3.2 we show global convergence, under the boundedness at all times t of a specific distance between \u03bdt and \u00b5.\n\n3.2 A condition for global convergence\n\nThe lack of convexity of F, as shown in Section 3.1, suggests that a finer analysis of the convergence should be performed. One strategy is to provide estimates for the dynamics in Proposition 2 using differential inequalities, which can be solved using Gronwall's lemma (see [34]). Such inequalities are known in the optimization literature as Lojasiewicz inequalities (see [6]), and upper-bound F(\u03bdt) by the absolute value of its time derivative \u222b \u2225\u2207f\u00b5,\u03bdt(x)\u2225\u00b2 d\u03bdt(x). The latter is the squared weighted Sobolev semi-norm of f\u00b5,\u03bdt (see Appendix D.4), also written \u2225f\u00b5,\u03bdt\u2225\u02d9H(\u03bdt). Thus one needs to find a relationship between F(\u03bdt) = (1/2)\u2225f\u00b5,\u03bdt\u2225\u00b2H and \u2225f\u00b5,\u03bdt\u2225\u02d9H(\u03bdt). For this purpose, we consider the weighted negative Sobolev distance on P2(X), defined by duality using \u2225\u00b7\u2225\u02d9H(\u03bd) (see also [36]).\n\nDefinition 1. Let \u03bd \u2208 P2(X), with its corresponding weighted Sobolev semi-norm \u2225\u00b7\u2225\u02d9H(\u03bd). The weighted negative Sobolev distance \u2225p \u2212 q\u2225\u02d9H\u207b\u00b9(\u03bd) between any p and q in P2(X) is defined, with possibly infinite values, as\n\n\u2225p \u2212 q\u2225\u02d9H\u207b\u00b9(\u03bd) = sup_{f\u2208L2(\u03bd), \u2225f\u2225\u02d9H(\u03bd)\u22641} |\u222b f(x) dp(x) \u2212 \u222b f(x) dq(x)|. (15)\n\nEquation (15) plays a fundamental role in dynamic optimal transport. It can be seen as the minimum kinetic energy needed to advect the mass \u03bd to q (see [32]). 
It is shown in Appendix D.4 that\n\n\u2225f\u00b5,\u03bdt\u2225\u00b2H \u2264 \u2225f\u00b5,\u03bdt\u2225\u02d9H(\u03bdt) \u2225\u00b5 \u2212 \u03bdt\u2225\u02d9H\u207b\u00b9(\u03bdt). (16)\n\nProvided that \u2225\u00b5 \u2212 \u03bdt\u2225\u02d9H\u207b\u00b9(\u03bdt) remains bounded by some positive constant C at all times, (16) leads to a functional version of the Lojasiewicz inequality for F. It is then possible to use the general strategy explained earlier to prove the convergence of the flow to a global optimum:\n\nProposition 7. Under Assumption (A),\n\n(i) If \u2225\u00b5 \u2212 \u03bdt\u2225\u00b2\u02d9H\u207b\u00b9(\u03bdt) \u2264 C for all t \u2265 0, then: F(\u03bdt) \u2264 C / (CF(\u03bd0)\u207b\u00b9 + 4t).\n\n(ii) If \u2225\u00b5 \u2212 \u03bdn\u2225\u00b2\u02d9H\u207b\u00b9(\u03bdn) \u2264 C for all n \u2265 0, then: F(\u03bdn) \u2264 C / (CF(\u03bd0)\u207b\u00b9 + 4\u03b3(1 \u2212 (3/2)\u03b3L)n).\n\nProofs of Proposition 7 (i) and (ii) are direct consequences of Propositions 2 and 4 and the bounded energy assumption: see Appendix D.4. The fact that (15) appears in the context of Wasserstein flows of F is not a coincidence. Indeed, (15) is a linearization of the Wasserstein distance (see [35, 36] and Appendix D.6). Gradient flows of F defined under different metrics would involve other kinds of distances instead of (15). For instance, [37] consider gradient flows under a hybrid metric (a mixture between the Wasserstein distance and the KL divergence), where convergence rates can then be obtained provided that the chi-square divergence \u03c7\u00b2(\u00b5\u2225\u03bdt) remains bounded. As shown in Appendix D.6, the square root of \u03c7\u00b2(\u00b5\u2225\u03bdt) turns out to linearize the square root of KL(\u00b5\u2225\u03bdt) when \u00b5 and \u03bdt are close. Hence, we conjecture that gradient flows of F under a metric d can be shown to converge when the linearization of the metric remains bounded. 
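For intuition, the continuous-time rate in Proposition 7 (i) can be recovered in a few lines by combining the decay identity (7) with (16); the following is only a sketch, under the assumed uniform bound \u2225\u00b5 \u2212 \u03bdt\u2225\u00b2\u02d9H\u207b\u00b9(\u03bdt) \u2264 C:

```latex
% Using 2 F(\nu_t) = \|f_{\mu,\nu_t}\|_{\mathcal{H}}^2, the identity (7),
% and the duality bound (16):
\frac{\mathrm{d}F(\nu_t)}{\mathrm{d}t}
  = -\|f_{\mu,\nu_t}\|_{\dot{H}(\nu_t)}^2
  \le -\frac{\|f_{\mu,\nu_t}\|_{\mathcal{H}}^4}{\|\mu - \nu_t\|_{\dot{H}^{-1}(\nu_t)}^2}
  \le -\frac{4\, F(\nu_t)^2}{C}.
% Hence \frac{\mathrm{d}}{\mathrm{d}t}\, F(\nu_t)^{-1} \ge 4/C, and
% integrating from 0 to t gives F(\nu_t) \le C / (C F(\nu_0)^{-1} + 4t).
```

The discrete-time rate in (ii) follows the same pattern, with Proposition 4 replacing (7).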
This can be verified on simple examples for \u2225\u00b5 \u2212 \u03bdt\u2225\u02d9H\u207b\u00b9(\u03bdt), as discussed in Appendix D.5. However, it remains hard to guarantee this condition in general. One possible approach could be to regularize F using an estimate of (15). Indeed, [32] considers the gradient flow of a regularized version of the negative Sobolev distance which can be written in closed form, and shows that this decreases the MMD. Combining both losses could improve the overall convergence properties of the MMD, albeit at additional computational cost. In the next section, we propose a different approach to improve the convergence, and a particle-based algorithm to approximate the MMD flow in practice.\n\n4 A practical algorithm to descend the MMD flow\n\n4.1 A noisy update as a regularization\n\nWe showed in Section 3.1 that F is a non-convex functional, and derived a condition in Section 3.2 to reach the global optimum. We now address the case where such a condition does not necessarily hold, and provide a regularization of the gradient flow to help achieve global optimality in this scenario. Our starting point will be the equilibrium condition in (11). If an equilibrium \u03bd\u2217 that satisfies (11) happens to have a positive density, then f\u00b5,\u03bd\u2217 would be constant everywhere. This in turn would mean that f\u00b5,\u03bd\u2217 = 0 when the RKHS does not contain constant functions, as for a Gaussian kernel [44, Corollary 4.44]. Hence, \u03bd\u2217 would be a global optimum since F(\u03bd\u2217) = 0. The limit distribution \u03bd\u2217 might be singular, however, and can even be a Dirac distribution [31, Theorem 6]. Although the gradient \u2207f\u00b5,\u03bd\u2217 is not identically 0 in that case, (11) only evaluates it on the support of \u03bd\u2217, on which \u2207f\u00b5,\u03bd\u2217 = 0 holds. 
Hence a possible fix would be to make sure that the unnormalised witness gradient is also evaluated at points outside of the support of \u03bd\u2217. Here, we propose to regularize the flow by injecting noise into the gradient during the updates of (9):\n\nXn+1 = Xn \u2212 \u03b3\u2207f\u00b5,\u03bdn(Xn + \u03b2nUn), n \u2265 0, (17)\n\nwhere Un is a standard Gaussian variable and \u03b2n is the noise level at iteration n. Compared to (8), the sample here is first blurred before evaluating the gradient. Intuitively, if \u03bdn approaches a local optimum \u03bd\u2217, \u2207f\u00b5,\u03bdn would be small on the support of \u03bdn but it might be much larger outside of it; hence evaluating \u2207f\u00b5,\u03bdn outside the support of \u03bdn can help in escaping the local minimum. The stochastic process (17) is different from adding a diffusion term to (5). The latter case would correspond to regularizing F using an entropic term as in [31, 40] (see also Appendix A.5 on the Langevin diffusion) and was shown to converge to a global optimum that is in general different from the global minimum of the un-regularized loss. Eq. (17) is also different from [9, 13], where F (and thus its associated velocity field) is regularized by convolving the interaction potential W in (4) with a mollifier. The optimal solution of a regularized version of the functional F will generally differ from the non-regularized one, however, which is not desirable in our setting. Eq. (17) is more closely related to the continuation methods [10, 20, 21] and graduated optimization [23] used for non-convex optimization in Euclidean spaces, which inject noise into the gradient of a loss function F at each iteration. The key difference is the dependence of f\u00b5,\u03bdn on \u03bdn, which is inherently due to functional optimization. 
We show in Proposition 8 that (17) attains the global minimum of F provided that the level of the noise is well controlled, with the proof given in Appendix E.1.

Proposition 8. Let (ν_n)_{n∈ℕ} be defined by (17) with an initial ν_0. Denote D_{β_n}(ν_n) = E_{x∼ν_n, u∼g}[‖∇f_{µ,ν_n}(x + β_n u)‖²], with g the density of the standard gaussian distribution. Under Assumptions (A) and (D), and for a choice of β_n such that

8λ² β_n² F(ν_n) ≤ D_{β_n}(ν_n),    (18)

the following inequality holds:

F(ν_{n+1}) − F(ν_n) ≤ −(γ/2)(1 − 3γL) D_{β_n}(ν_n).    (19)

Moreover, if Σ_{i=0}^{n} β_i² → ∞, then

F(ν_n) ≤ F(ν_0) e^{−4λ²γ(1−3γL) Σ_{i=0}^{n} β_i²},    (20)

where λ and L are defined in Assumptions (A) and (D) and depend only on the choice of the kernel.

A particular case where Σ_{i=0}^{n} β_i² → ∞ holds is when β_n decays as 1/√n while still satisfying (18). In this case, convergence occurs in polynomial time: with β_i² ∝ 1/i, the sum in (20) grows like log n, so the bound decays as a power of n. At each iteration, the level of the noise needs to be adjusted so that the gradient is not too blurred; this ensures that each step decreases the loss functional. However, β_n does not need to decrease at each iteration: it could be increased adaptively whenever needed. For instance, when the sequence gets closer to a local optimum, it is helpful to increase the level of the noise to probe the gradient in regions where its value is not flat. Note that for β_n = 0 in (19), we recover a similar bound to Proposition 4.

4.2 The sample-based approximate scheme

We now provide a practical algorithm to implement the noisy updates in the previous section, which employs a discretization in space.
The update (17) involves computing expectations of the gradient of the kernel k w.r.t. the target distribution µ and the current distribution ν_n at each iteration n. This suggests a simple approximate scheme, based on samples from these two distributions, where at each iteration n, we model a system of N interacting particles (X^i_n)_{1≤i≤N} and their empirical distribution in order to approximate ν_n. More precisely, given i.i.d. samples (X^i_0)_{1≤i≤N} and (Y^m)_{1≤m≤M} from ν_0 and µ and a step-size γ, the approximate scheme iteratively updates the i-th particle as

X^i_{n+1} = X^i_n − γ ∇f_{µ̂,ν̂_n}(X^i_n + β_n U^i_n),    (21)

where the U^i_n are i.i.d. standard gaussians, and µ̂, ν̂_n denote the empirical distributions of (Y^m)_{1≤m≤M} and (X^i_n)_{1≤i≤N}, respectively. It is worth noting that for β_n = 0, (21) is equivalent to gradient descent over the particles (X^i_n) using a sample-based version of the MMD. Implementing (21) is straightforward, as it only requires evaluating the gradient of k on the current particles and target samples. Pseudocode is provided in Algorithm 1. The overall computational cost of the algorithm at each iteration is O((M + N)N), with O(M + N) memory. The computational cost becomes O(M + N) when the kernel is approximated using random features, as is the case for regression with neural networks (Appendix F). This is in contrast to the cubic cost of the flow of the KSD [32], which requires solving a linear system at each iteration. The cost can also be compared to the algorithm in [40], which involves computing empirical CDFs and quantile functions of random projections of the particles. The approximation scheme in (21) is a particle version of (17), so one would expect it to converge towards its population version (17) as M and N go to infinity.
This is shown below.

Theorem 9. Let n ≥ 0 and T > 0. Let ν_n and ν̂_n be defined by (8) and (21) respectively. Suppose Assumption (A) holds and that β_n < B for all n, for some B > 0. Then for any n ≤ T/γ:

E[W₂(ν̂_n, ν_n)] ≤ (1/4) ( (1/√N) (B + var(ν_0)^{1/2}) e^{2LT} + (1/√M) var(µ)^{1/2} (e^{4LT} − 1) ).

Theorem 9 controls the propagation of chaos at each iteration, and uses techniques from [25]. Notice also that these rates remain true when no noise is added to the updates, i.e. for the original flow when B = 0. A proof is provided in Appendix E.2. The dependence in 1/√M underlines the fact that our procedure could be interesting as a sampling algorithm when one only has access to M samples of µ (see Appendix A.5 for a more detailed discussion).

Experiments

Figure 1: Comparison between different training methods for student-teacher ReLU networks with gaussian output non-linearity and synthetic data uniform on a hyper-sphere. In blue, (21) is used without noise (β_n = 0), while in red noise is added with the following schedule: β_0 > 0 and β_n is decreased by half after every 10³ epochs. In green, a diffusion term is added to the particles with noise level kept fixed during training (β_n = β_0). In purple, the KSD is used as a cost function instead of the MMD. In all cases, the kernel is estimated using random features (RF) with a batch size of 10². The best step-size was selected for each method from {10⁻³, 10⁻², 10⁻¹} and was used for 10⁴ epochs on a dataset of 10³ samples (RF). Initial parameters of the networks are drawn from i.i.d. gaussians: N(0, 1) for the teacher and N(10⁻³, 1) for the student.
Results are averaged over 10 different runs.

[Figure 1: three panels showing (a) training error (MMD²) against time, (b) test error against epochs, and (c) sensitivity to the noise level, for MMD, MMD + noise injection, MMD + diffusion, and KSD.]

Figure 1 illustrates the behavior of the proposed algorithm (21) in a simple setting and compares it with three other methods: MMD without noise injection (blue traces), MMD with diffusion (green traces) and KSD (purple traces, [32]). Here, a student network is trained to produce the outputs of a teacher network using gradient descent. More details on the experiment are provided in Appendix G.1. As discussed in Appendix F, this setting can be seen as a stochastic version of the MMD flow, since the kernel is estimated using random features at each iteration ((91) in Appendix G.1). Here, the MMD flow fails to converge towards the global optimum. Such behavior is consistent with the observations in [11] when the parameters are initialized from gaussian noise with relatively high variance (which is the case here). On the other hand, adding noise to the gradient seems to lead to global convergence. Indeed, the training error decreases below 10⁻⁵ and leads to a much better validation error. While adding a small diffusion term (green) helps convergence, the noise injection (red) still outperforms it. This also holds for the KSD (purple), which leads to a good solution (b), although at a much higher computational cost (a). Our noise injection method (red) is also robust to the amount of noise, and achieves the best performance over a wide region (c). On the other hand, MMD + diffusion (green) performs well only for much smaller values of noise, located in a narrow region.
This is expected, since adding a diffusion changes the optimal solution, unlike the injection, under which the global optimum of the MMD remains a fixed point of the algorithm.
Another illustrative experiment, on a simple flow between Gaussians, is given in Appendix G.2.

5 Conclusion

We have introduced the MMD flow, a novel flow over the space of distributions, with a practical space-time discretized implementation and a regularisation scheme to improve convergence. We provide theoretical results highlighting intrinsic properties of the regular MMD flow, and guarantees on convergence based on recent results in optimal transport, probabilistic interpretations of PDEs, and particle algorithms. Future work will focus on a deeper understanding of regularization for the MMD flow, and its application in sampling and optimization for large neural networks.

References

[1] L. Ambrosio, N. Gigli, and G. Savaré. Gradient flows: in metric spaces and in the space of probability measures. Springer Science & Business Media, 2008.

[2] M. Arbel, D. J. Sutherland, M. Bińkowski, and A. Gretton. “On gradient regularizers for MMD GANs.” In: NIPS (2018).

[3] M. Arjovsky and L. Bottou. “Towards Principled Methods for Training Generative Adversarial Networks.” In: ICLR. 2017. arXiv: 1701.04862.

[4] M. G. Bellemare, I. Danihelka, W. Dabney, S. Mohamed, B. Lakshminarayanan, S. Hoyer, and R. Munos. The Cramer Distance as a Solution to Biased Wasserstein Gradients. 2017. arXiv: 1705.10743.

[5] M. Bińkowski, D. J. Sutherland, M. Arbel, and A. Gretton. “Demystifying MMD GANs.” In: ICLR. 2018.

[6] A. Blanchet and J. Bolte. “A family of functional inequalities: Łojasiewicz inequalities and displacement convex functions.” In: Journal of Functional Analysis 275.7 (2018), pp. 1650–1673.

[7] L. Bottou, M. Arjovsky, D. Lopez-Paz, and M. Oquab. “Geometrical insights for implicit generative modeling.” In: Braverman Readings in Machine Learning. Key Ideas from Inception to Current State. Springer, 2018, pp. 229–268.

[8] J. A. Carrillo, R. J. McCann, and C. Villani. “Contractions in the 2-Wasserstein Length Space and Thermalization of Granular Media.” In: Archive for Rational Mechanics and Analysis 179.2 (Feb. 2006), pp. 217–263. URL: https://doi.org/10.1007/s00205-005-0386-1 (visited on 07/27/2019).

[9] J. A. Carrillo, K. Craig, and F. S. Patacchini. “A blob method for diffusion.” In: Calculus of Variations and Partial Differential Equations 58.2 (2019), p. 53.

[10] P. Chaudhari, A. Oberman, S. Osher, S. Soatto, and G. Carlier. “Deep Relaxation: partial differential equations for optimizing deep neural networks.” In: arXiv:1704.04932 [cs, math] (2017). URL: http://arxiv.org/abs/1704.04932.

[11] L. Chizat and F. Bach. “A Note on Lazy Training in Supervised Differentiable Programming.” In: arXiv:1812.07956 [cs, math] (Dec. 2018). arXiv: 1812.07956. URL: http://arxiv.org/abs/1812.07956 (visited on 05/05/2019).

[12] L. Chizat and F. Bach. “On the global convergence of gradient descent for over-parameterized models using optimal transport.” In: NIPS. 2018.

[13] K. Craig and A. Bertozzi. “A blob method for the aggregation equation.” In: Mathematics of computation 85.300 (2016), pp. 1681–1717.

[14] A. S. Dalalyan and A. Karagulyan. “User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient.” In: Stochastic Processes and their Applications (2019).

[15] A. Durmus, S. Majewski, and B. Miasojedow. “Analysis of Langevin Monte Carlo via convex optimization.” In: arXiv preprint arXiv:1802.09188 (2018).

[16] G. K. Dziugaite, D. M. Roy, and Z. Ghahramani.
\u201cTraining generative neural networks via\n\nMaximum Mean Discrepancy optimization.\u201d In: UAI. 2015.\n\n[17] A. Genevay, G. Peyr\u00e9, and M. Cuturi. \u201cLearning Generative Models with Sinkhorn Diver-\n\ngences.\u201d In: AISTATS. 2018. arXiv: 1706.00292.\nI. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville,\nand Y. Bengio. \u201cGenerative Adversarial Nets.\u201d In: NIPS. 2014. arXiv: 1406.2661.\n\n[18]\n\n[19] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Sch\u00f6lkopf, and A. Smola. \u201cA kernel two-sample\n\ntest.\u201d In: Journal of Machine Learning Research (2012).\n\n[20] C. Gulcehre, M. Moczulski, F. Visin, and Y. Bengio. \u201cMollifying networks.\u201d In: arXiv preprint\n\narXiv:1608.04980 (2016).\n\n[21] C. Gulcehre, M. Moczulski, M. Denil, and Y. Bengio. \u201cNoisy activation functions.\u201d In: ICML.\n\n2016.\nI. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville. \u201cImproved Training of\nWasserstein GANs.\u201d In: NIPS. 2017. arXiv: 1704.00028.\n\n[22]\n\n[23] E. Hazan, K. Y. Levy, and S. Shalev-Shwartz. \u201cOn graduated optimization for stochastic\n\nnon-convex problems.\u201d In: ICML. 2016.\n\n[24] R. Jordan, D. Kinderlehrer, and F. Otto. \u201cThe variational formulation of the Fokker\u2013Planck\n\nequation.\u201d In: SIAM journal on mathematical analysis 29.1 (1998), pp. 1\u201317.\n\n[25] B. Jourdain, S. M\u00e9l\u00e9ard, and W. Woyczynski. \u201cNonlinear SDEs driven by Levy proesses and\n\nrelated PDEs.\u201d In: arXiv preprint arXiv:0707.2723 (2007).\n\n[26] M. Kac. \u201cFoundations of kinetic theory.\u201d In: Proceedings of The third Berkeley symposium on\nmathematical statistics and probability. Vol. 3. University of California Press Berkeley and\nLos Angeles, California. 1956, pp. 171\u2013197.\n\n[27] C.-L. Li, W.-C. Chang, Y. Cheng, Y. Yang, and B. P\u00f3czos. 
\u201cMMD GAN: Towards Deeper\nUnderstanding of Moment Matching Network.\u201d In: arXiv:1705.08584 [cs, stat] (May 2017).\narXiv: 1705.08584. URL: http://arxiv.org/abs/1705.08584 (visited on 11/13/2018).\n\n[28] Y. Li, K. Swersky, and R. Zemel. \u201cGenerative moment matching networks.\u201d In: arXiv preprint\n\narXiv:1502.02761 (2015).\n\n[29] Q. Liu. \u201cStein variational gradient descent as gradient \ufb02ow.\u201d In: Advances in neural information\n\nprocessing systems. 2017, pp. 3115\u20133123.\n\n[30] H. McKean Jr. \u201cA class of Markov processes associated with nonlinear parabolic equations.\u201d\nIn: Proceedings of the National Academy of Sciences of the United States of America 56.6\n(1966), p. 1907.\n\n[31] S. Mei, A. Montanari, and P.-M. Nguyen. \u201cA mean \ufb01eld view of the landscape of two-layer\nneural networks.\u201d In: Proceedings of the National Academy of Sciences 115.33 (2018), E7665\u2013\nE7671.\n\n[32] Y. Mroueh, T. Sercu, and A. Raj. \u201cSobolev Descent.\u201d In: AISTATS. 2019.\n[33] A. M\u00fcller. \u201cIntegral Probability Metrics and their Generating Classes of Functions.\u201d In:\n\nAdvances in Applied Probability 29.2 (1997), pp. 429\u2013443.\nJ. A. Oguntuase. \u201cOn an inequality of Gronwall.\u201d In: Journal of Inequalities in Pure and\nApplied Mathematics (2001).\n\n[34]\n\n[35] F. Otto and C. Villani. \u201cGeneralization of an inequality by Talagrand and links with the\nlogarithmic Sobolev inequality.\u201d In: Journal of Functional Analysis 173.2 (2000), pp. 361\u2013\n400.\n[36] R. Peyre. \u201cComparison between W2 distance and \u02d9H\u22121 norm, and localisation of Wasserstein\ndistance.\u201d In: ESAIM: Control, Optimisation and Calculus of Variations 24.4 (2018), pp. 1489\u2013\n1501.\n\n[37] G. Rotskoff, S. Jelassi, J. Bruna, and E. Vanden-Eijnden. \u201cGlobal convergence of neuron\n\nbirth-death dynamics.\u201d In: ICML. 2019.\n\n[38] G. M. Rotskoff and E. Vanden-Eijnden. 
\u201cNeural networks as interacting particle systems:\nAsymptotic convexity of the loss landscape and universal scaling of the approximation error.\u201d\nIn: arXiv preprint arXiv:1805.00915 (2018).\n\n[39] F. Santambrogio. \u201cOptimal transport for applied mathematicians.\u201d In: Birk\u00e4user, NY 55 (2015),\n\npp. 58\u201363.\n\n10\n\n\f[40] U. \u00b8Sim\u00b8sekli, A. Liutkus, S. Majewski, and A. Durmus. \u201cSliced-Wasserstein \ufb02ows: Nonpara-\n\nmetric generative modeling via optimal transport and diffusions.\u201d In: ICML. 2019.\nJ. Sirignano and K. Spiliopoulos. \u201cMean \ufb01eld analysis of neural networks: A central limit\ntheorem.\u201d In: arXiv preprint arXiv:1808.09372 (2018).\n\n[41]\n\n[42] A. J. Smola and B. Scholkopf. Learning with kernels. Vol. 4. Citeseer, 1998.\n[43] B. K. Sriperumbudur, A. Gretton, K. Fukumizu, B. Sch\u00f6lkopf, and G. R. Lanckriet. \u201cHilbert\nspace embeddings and metrics on probability measures.\u201d In: Journal of Machine Learning\nResearch 11.Apr (2010), pp. 1517\u20131561.\nI. Steinwart and A. Christmann. Support Vector Machines. 1st. Springer Publishing Company,\nIncorporated, 2008.\n\n[44]\n\n[45] C. Villani. Optimal transport: old and new. Vol. 338. Springer Science & Business Media,\n\n2008.\n\n11\n\n\f", "award": [], "sourceid": 3487, "authors": [{"given_name": "Michael", "family_name": "Arbel", "institution": "UCL"}, {"given_name": "Anna", "family_name": "Korba", "institution": "Gatsby Unit - UCL"}, {"given_name": "Adil", "family_name": "SALIM", "institution": "KAUST"}, {"given_name": "Arthur", "family_name": "Gretton", "institution": "Gatsby Unit, UCL"}]}