{"title": "On gradient regularizers for MMD GANs", "book": "Advances in Neural Information Processing Systems", "page_first": 6700, "page_last": 6710, "abstract": "We propose a principled method for gradient-based regularization of the critic of GAN-like models trained by adversarially optimizing the kernel of a Maximum Mean Discrepancy (MMD). We show that controlling the gradient of the critic is vital to having a sensible loss function, and devise a method to enforce exact, analytical gradient constraints at no additional cost compared to existing approximate techniques based on additive regularizers. The new loss function is provably continuous, and experiments show that it stabilizes and accelerates training, giving image generation models that outperform state-of-the-art methods on $160 \times 160$ CelebA and $64 \times 64$ unconditional ImageNet.", "full_text": "On gradient regularizers for MMD GANs

Michael Arbel*
Gatsby Computational Neuroscience Unit
University College London
michael.n.arbel@gmail.com

Dougal J. Sutherland*
Gatsby Computational Neuroscience Unit
University College London
dougal@gmail.com

Mikołaj Bińkowski
Department of Mathematics
Imperial College London
mikbinkowski@gmail.com

Arthur Gretton
Gatsby Computational Neuroscience Unit
University College London
arthur.gretton@gmail.com

Abstract

We propose a principled method for gradient-based regularization of the critic of GAN-like models trained by adversarially optimizing the kernel of a Maximum Mean Discrepancy (MMD). We show that controlling the gradient of the critic is vital to having a sensible loss function, and devise a method to enforce exact, analytical gradient constraints at no additional cost compared to existing approximate techniques based on additive regularizers.
The new loss function is provably continuous, and experiments show that it stabilizes and accelerates training, giving image generation models that outperform state-of-the-art methods on 160 × 160 CelebA and 64 × 64 unconditional ImageNet.

* These authors contributed equally.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

1 Introduction

There has been an explosion of interest in implicit generative models (IGMs) over the last few years, especially after the introduction of generative adversarial networks (GANs) [16]. These models allow approximate samples from a complex high-dimensional target distribution P, using a model distribution Q_θ, where estimation of likelihoods, exact inference, and so on are not tractable. GAN-type IGMs have yielded very impressive empirical results, particularly for image generation, far beyond the quality of samples seen from most earlier generative models [e.g. 18, 22, 23, 24, 38]. These excellent results, however, have depended on adding a variety of regularization methods and other tricks to stabilize the notoriously difficult optimization problem of GANs [38, 42]. Some of this difficulty is perhaps because, when a GAN is viewed as minimizing a discrepancy D_GAN(P, Q_θ), its gradient ∇_θ D_GAN(P, Q_θ) does not provide useful signal to the generator if the target and model distributions are not absolutely continuous, as is nearly always the case [2].

An alternative set of losses are the integral probability metrics (IPMs) [36], which can give credit to models Q_θ "near" to the target distribution P [3, 8, Section 4 of 15]. IPMs are defined in terms of a critic function: a "well-behaved" function with large amplitude where P and Q_θ differ most. The IPM is the difference in the expected critic under P and Q_θ, and is zero when the distributions agree. The Wasserstein IPMs, whose critics are made smooth via a Lipschitz constraint, have been particularly successful in IGMs [3, 14, 18]. But the Lipschitz constraint must hold uniformly, which can be hard to enforce. A popular approximation has been to apply a gradient constraint only in expectation [18]: the critic's gradient norm is constrained to be small on points chosen uniformly between P and Q_θ.

Another class of IPMs used as IGM losses are the Maximum Mean Discrepancies (MMDs) [17], as in [13, 28]. Here the critic function is a member of a reproducing kernel Hilbert space (except in [50], who learn a deep approximation to an RKHS critic). Better performance can be obtained, however, when the MMD kernel is not based directly on image pixels, but on learned features of images. Wasserstein-inspired gradient regularization approaches can be used on the MMD critic when learning these features: [27] uses weight clipping [3], and [5, 7] use a gradient penalty [18]. The recent Sobolev GAN [33] uses a similar constraint on the expected gradient norm, but phrases it as estimating a Sobolev IPM rather than loosely approximating Wasserstein. This expectation can be taken over the same distribution as [18], but other measures are also proposed, such as (P + Q_θ)/2. A second recent approach, the spectrally normalized GAN [32], controls the Lipschitz constant of the critic by enforcing the spectral norms of the weight matrices to be 1. Gradient penalties also benefit GANs based on f-divergences [37]: for instance, the spectral normalization technique of [32] can be applied to the critic network of an f-GAN. Alternatively, a gradient penalty can be defined to approximate the effect of blurring P and Q_θ with noise [40], which addresses the problem of non-overlapping support [2]. This approach has recently been shown to yield locally convergent optimization in some cases with non-continuous distributions, where the original GAN does not [30].

In this paper, we introduce a novel regularization for the MMD GAN critic of [5, 7, 27], which directly targets generator performance, rather than adopting regularization methods intended to approximate Wasserstein distances [3, 18]. The new MMD regularizer derives from an approach widely used in semi-supervised learning [10, Section 2], where the aim is to define a classification function f which is positive on P (the positive class) and negative on Q_θ (the negative class), in the absence of labels on many of the samples. The decision boundary between the classes is assumed to be in a region of low density for both P and Q_θ: f should therefore be flat where P and Q_θ have support (areas with constant label), and have a larger slope in regions of low density. Bousquet et al. [10] propose as their regularizer on f a sum of the variance and a density-weighted gradient norm. We adopt a related penalty on the MMD critic, with the difference that we only apply the penalty on P: thus, the critic is flatter where P has high mass, but does not vanish on the generator samples from Q_θ (which we optimize). In excluding Q_θ from the critic function constraint, we also avoid the concern raised by [32] that a critic depending on Q_θ will change with the current minibatch, potentially leading to less stable learning. The resulting discrepancy is no longer an integral probability metric: it is asymmetric, and the critic function class depends on the target P being approximated.

We first discuss in Section 2 how MMD-based losses can be used to learn implicit generative models, and how a naive approach could fail.
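The density-weighted penalty described above (a critic that is flat where P has mass, with the penalty estimated in expectation over samples) can be illustrated with a small numpy sketch. The toy critic, the sampling distributions, and all names below are our own illustration, not the paper's code:

```python
import numpy as np

def expected_grad_sq_norm(f_grad, samples):
    """Monte Carlo estimate of E_{X~P} ||grad f(X)||^2, the penalized quantity."""
    grads = np.array([f_grad(x) for x in samples])
    return np.mean(np.sum(grads**2, axis=1))

# Toy critic f(x) = tanh(w . x): steep near the hyperplane w . x = 0, flat far from it.
w = np.array([3.0, -1.0])
f_grad = lambda x: (1.0 - np.tanh(w @ x)**2) * w  # analytic gradient of f

rng = np.random.default_rng(0)
near_boundary = rng.normal(0.0, 0.1, size=(500, 2))        # "P" where the critic is steep
far_from_boundary = rng.normal(0.0, 0.1, size=(500, 2)) + 2.0  # "P" where the critic is flat

steep = expected_grad_sq_norm(f_grad, near_boundary)
flat = expected_grad_sq_norm(f_grad, far_from_boundary)
print(steep > flat)  # the penalty only bites where the critic varies on P's support
```

The point of the sketch is that the same critic incurs a large or a negligible penalty depending on where P puts its mass, which is exactly the data-dependent behavior the regularizer is after.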
This motivates our new discrepancies, introduced in Section 3. Section 4 demonstrates that these losses outperform state-of-the-art models for image generation.

2 Learning implicit generative models with MMD-based losses

An IGM is a model Q_θ which aims to approximate a target distribution P over a space X ⊂ R^d. We will define Q_θ by a generator function G_θ : Z → X, implemented as a deep network with parameters θ, where Z is a space of latent codes, say R^128. We assume a fixed distribution on Z, say Z ∼ Uniform([−1, 1]^128), and call Q_θ the distribution of G_θ(Z). We will consider learning by minimizing a discrepancy D between distributions, with D(P, Q_θ) ≥ 0 and D(P, P) = 0, which we call our loss. We aim to minimize D(P, Q_θ) with stochastic gradient descent on an estimator of D.

In the present work, we will build losses D based on the Maximum Mean Discrepancy,

MMD_k(P, Q) = sup_{f : ‖f‖_{H_k} ≤ 1} E_{X∼P}[f(X)] − E_{Y∼Q}[f(Y)],   (1)

an integral probability metric where the critic class is the unit ball within H_k, the reproducing kernel Hilbert space with a kernel k. The optimization in (1) admits a simple closed-form optimal critic, f*(t) ∝ E_{X∼P}[k(X, t)] − E_{Y∼Q}[k(Y, t)]. There is also an unbiased, closed-form estimator of MMD²_k with appealing statistical properties [17] – in particular, its sample complexity is independent of the dimension of X, compared to the exponential dependence [52] of the Wasserstein distance

W(P, Q) = sup_{f : ‖f‖_Lip ≤ 1} E_{X∼P}[f(X)] − E_{Y∼Q}[f(Y)].   (2)

The MMD is continuous in the weak topology for any bounded kernel with Lipschitz embeddings [46, Theorem 3.2(b)], meaning that if P_n converges in distribution to P, P_n →_D P, then MMD(P_n, P) → 0. (W is continuous in the slightly stronger Wasserstein topology [51, Definition 6.9]; P_n →_W P implies P_n →_D P, and the two notions coincide if X is bounded.) Continuity means the loss can provide better signal to the generator as Q_θ approaches P, as opposed to e.g. Jensen–Shannon, where the loss could be constant until suddenly jumping to 0 [e.g. 3, Example 1]. The MMD is also strict, meaning it is zero iff P = Q_θ, for characteristic kernels [45]. The Gaussian kernel yields an MMD both continuous in the weak topology and strict. Thus in principle, one need not conduct any alternating optimization in an IGM at all, but merely choose generator parameters θ to minimize MMD_k.

Despite these appealing properties, using simple pixel-level kernels leads to poor generator samples [8, 13, 28, 48]. More recent MMD GANs [5, 7, 27] achieve better results by using a parameterized family of kernels, {k_ψ}_{ψ∈Ψ}, in the Optimized MMD loss previously studied by [44, 46]:

D^Ψ_MMD(P, Q) := sup_{ψ∈Ψ} MMD_{k_ψ}(P, Q).   (3)

We primarily consider kernels defined by some fixed kernel K on top of a learned low-dimensional representation φ_ψ : X → R^s, i.e. k_ψ(x, y) = K(φ_ψ(x), φ_ψ(y)), denoted k_ψ = K ∘ φ_ψ. In practice, K is a simple characteristic kernel, e.g. Gaussian, and φ_ψ is usually a deep network with output dimension say s = 16 [7] or even s = 1 (in our experiments). If {φ_ψ} is powerful enough, this choice is sufficient; we need not try to ensure each k_ψ is characteristic, as did [27].

Proposition 1. Suppose k_ψ = K ∘ φ_ψ, with K characteristic and {φ_ψ} rich enough that for any P ≠ Q, there is a ψ ∈ Ψ for which φ_ψ#P ≠ φ_ψ#Q.² Then if P ≠ Q, D^Ψ_MMD(P, Q) > 0.

Proof. Let ψ̂ ∈ Ψ be such that φ_ψ̂#P ≠ φ_ψ̂#Q. Then, since K is characteristic,

D^Ψ_MMD(P, Q) = sup_{ψ∈Ψ} MMD_K(φ_ψ#P, φ_ψ#Q) ≥ MMD_K(φ_ψ̂#P, φ_ψ̂#Q) > 0.

To estimate D^Ψ_MMD, one can conduct alternating optimization to estimate a ψ̂ and then update the generator according to MMD_{k_ψ̂}, similar to the scheme used in GANs and WGANs. (This form of estimator is justified by an envelope theorem [31], although it is invariably biased [7].) Unlike D_GAN or W, fixing a ψ̂ and optimizing the generator still yields a sensible distance MMD_{k_ψ̂}.

Early attempts at minimizing D^Ψ_MMD in an IGM, though, were unsuccessful [48, footnote 7]. This could be because for some kernel classes, D^Ψ_MMD is stronger than Wasserstein or MMD.

Example 1 (DiracGAN [30]). We wish to model a point mass at the origin of R, P = δ_0, with any possible point mass, Q_θ = δ_θ for θ ∈ R. We use a Gaussian kernel of any bandwidth, which can be written as k_ψ = K ∘ φ_ψ with φ_ψ(x) = ψx for ψ ∈ Ψ = R and K(a, b) = exp(−½ (a − b)²). Then MMD²_{k_ψ}(δ_0, δ_θ) = 2 [1 − exp(−½ ψ² θ²)], and

D^Ψ_MMD(δ_0, δ_θ) = √2 if θ ≠ 0, and 0 if θ = 0.

Considering D^Ψ_MMD(δ_0, δ_{1/n}) = √2 ↛ 0, even though δ_{1/n} →_W δ_0, shows that the Optimized MMD distance is not continuous in the weak or Wasserstein topologies.

This also causes optimization issues. Figure 1 (a) shows gradient vector fields in parameter space, v(θ, ψ) ∝ (−∇_θ MMD²_{k_ψ}(δ_0, δ_θ), ∇_ψ MMD²_{k_ψ}(δ_0, δ_θ)). Some sequences following v (e.g. A) converge to an optimal solution (0, ψ), but some (B) move in the wrong direction, and others (C) are stuck because there is essentially no gradient. Figure 1 (c, red) shows that the optimal D^Ψ_MMD critic is very sharp near P and Q; this is less true for cases where the algorithm converged.

We can avoid these issues if we ensure a bounded Lipschitz critic:³

Proposition 2. Assume the critics f_ψ(x) = (E_{X∼P} k_ψ(X, x) − E_{Y∼Q} k_ψ(Y, x)) / MMD_{k_ψ}(P, Q) are uniformly bounded and have a common Lipschitz constant: sup_{x∈X, ψ∈Ψ} |f_ψ(x)| < ∞ and sup_{ψ∈Ψ} ‖f_ψ‖_Lip < ∞. In particular, this holds when k_ψ = K ∘ φ_ψ and

sup_{a∈R^s} K(a, a) < ∞,   ‖K(a, ·) − K(b, ·)‖_{H_K} ≤ L_K ‖a − b‖_{R^s},   sup_{ψ∈Ψ} ‖φ_ψ‖_Lip ≤ L_φ < ∞.

Then D^Ψ_MMD is continuous in the weak topology: if P_n →_D P, then D^Ψ_MMD(P_n, P) → 0.

² f#P denotes the pushforward of a distribution: if X ∼ P, then f(X) ∼ f#P.
³ [27, Theorem 4] makes a similar claim to Proposition 2, but its proof was incorrect: it tries to uniformly bound MMD_{k_ψ} by W_2, but the bound used is for a Wasserstein in terms of ‖k_ψ(x, ·) − k_ψ(y, ·)‖_{H_{k_ψ}}.

Figure 1: The setting of Example 1. (a, b): parameter-space gradient fields for the MMD and the SMMD (Section 3.3); the horizontal axis is θ, and the vertical 1/ψ. (c): optimal MMD critics for θ = 20 with different kernels. (d): the MMD and the distances of Section 3 optimized over ψ.

Proof. The main result is [12, Corollary 11.3.4]. To show the claim for k_ψ = K ∘ φ_ψ, note that |f_ψ(x) − f_ψ(y)| ≤ ‖f_ψ‖_{H_{k_ψ}} ‖k_ψ(x, ·) − k_ψ(y, ·)‖_{H_{k_ψ}}, which since ‖f_ψ‖_{H_{k_ψ}} = 1 is at most

‖K(φ_ψ(x), ·) − K(φ_ψ(y), ·)‖_{H_K} ≤ L_K ‖φ_ψ(x) − φ_ψ(y)‖_{R^s} ≤ L_K L_φ ‖x − y‖_{R^d}.

Indeed, if we put a box constraint on ψ [27] or regularize the gradient of the critic function [7], the resulting MMD GAN generally matches or outperforms WGAN-based models. Unfortunately, though, an additive gradient penalty doesn't substantially change the vector field of Figure 1 (a), as shown in Figure 5 (Appendix B). We will propose distances with much better convergence behavior.

3 New discrepancies for learning implicit generative models

Our aim here is to introduce a discrepancy that can provide useful gradient information when used as an IGM loss.
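The failure that motivates this section, the discontinuity of the Optimized MMD in Example 1, is easy to verify numerically. The sketch below is our own illustration (a finite grid over ψ stands in for the supremum):

```python
import numpy as np

def mmd2_dirac(theta, psi):
    """MMD^2 between delta_0 and delta_theta for k_psi(x, y) = exp(-0.5 psi^2 (x-y)^2):
    k(0,0) + k(theta,theta) - 2 k(0,theta) = 2 (1 - exp(-0.5 psi^2 theta^2))."""
    return 2.0 * (1.0 - np.exp(-0.5 * psi**2 * theta**2))

psis = np.logspace(-2, 4, 200)  # optimize the kernel bandwidth over a wide grid

def optimized_mmd(theta):
    return np.sqrt(max(mmd2_dirac(theta, p) for p in psis))

print(optimized_mmd(0.0))   # 0: the distributions are equal
print(optimized_mmd(1e-2))  # approaches sqrt(2) even for theta barely away from 0
```

However close θ gets to 0, some sufficiently narrow kernel still separates δ_0 from δ_θ perfectly, so the optimized distance stays near √2 instead of decaying, which is exactly the discontinuity the regularized distances below are designed to remove.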
Proofs of results in this section are deferred to Appendix A.

3.1 Lipschitz Maximum Mean Discrepancy

Proposition 2 shows that an MMD-like discrepancy can be continuous under the weak topology even when optimizing over kernels, if we directly restrict the critic functions to be Lipschitz. We can easily define such a distance, which we call the Lipschitz MMD: for some λ > 0,

LipMMD_{k,λ}(P, Q) := sup_{f∈H_k : ‖f‖²_Lip + λ‖f‖²_{H_k} ≤ 1} E_{X∼P}[f(X)] − E_{Y∼Q}[f(Y)].   (4)

For a universal kernel k, we conjecture that LipMMD_{k,λ}(P, Q) → W(P, Q) as λ → 0. But for any k and λ, LipMMD is upper-bounded by W, as (4) optimizes over a smaller set of functions than (2). Thus D^{Ψ,λ}_LipMMD(P, Q) := sup_{ψ∈Ψ} LipMMD_{k_ψ,λ}(P, Q) is also upper-bounded by W, and hence is continuous in the Wasserstein topology. It also shows excellent empirical behavior on Example 1 (Figure 1 (d), and Figure 5 in Appendix B). But estimating LipMMD_{k,λ}, let alone D^{Ψ,λ}_LipMMD, is in general extremely difficult (Appendix D), as finding ‖f‖_Lip requires optimization in the input space. Constraining the mean gradient rather than the maximum, as we will do next, is far more tractable.

3.2 Gradient-Constrained Maximum Mean Discrepancy

We define the Gradient-Constrained MMD for λ > 0 and using some measure µ as

GCMMD_{µ,k,λ}(P, Q) := sup_{f∈H_k : ‖f‖_{S(µ),k,λ} ≤ 1} E_{X∼P}[f(X)] − E_{Y∼Q}[f(Y)],   (5)

where ‖f‖²_{S(µ),k,λ} := ‖f‖²_{L²(µ)} + ‖∇f‖²_{L²(µ)} + λ‖f‖²_{H_k}.   (6)

Here ‖g‖²_{L²(µ)} = ∫ ‖g(x)‖² µ(dx) denotes the squared L² norm. Rather than directly constraining the Lipschitz constant, the second term ‖∇f‖²_{L²(µ)} encourages the function f to be flat where µ has mass. In experiments we use µ = P, flattening the critic near the target sample. We add the first term following [10]: in one dimension and with µ uniform, ‖·‖_{S(µ),·,0} is then an RKHS norm with the kernel κ(x, y) = exp(−‖x − y‖), which is also a Sobolev space. The correspondence to a Sobolev norm is lost in higher dimensions [53, Ch. 10], but we also found the first term to be beneficial in practice.

We can exploit some properties of H_k to compute (5) analytically. Call the difference in kernel mean embeddings η := E_{X∼P}[k(X, ·)] − E_{Y∼Q}[k(Y, ·)] ∈ H_k; recall MMD(P, Q) = ‖η‖_{H_k}.

Proposition 3. Let µ̂ = (1/M) Σ_{m=1}^M δ_{X_m}. Define η(X) ∈ R^M with mth entry η(X_m), and ∇η(X) ∈ R^{Md} with (m, i)th entry⁴ ∂_i η(X_m). Then under Assumptions (A) to (D) in Appendix A.1,

GCMMD²_{µ̂,k,λ}(P, Q) = (1/λ) [ MMD²(P, Q) − P̄(η) ],

P̄(η) = [η(X); ∇η(X)]ᵀ ( [K, Gᵀ; G, H] + λM I_{M+Md} )⁻¹ [η(X); ∇η(X)],

where K is the kernel matrix K_{m,m'} = k(X_m, X_{m'}), G is the matrix of left derivatives⁵ G_{(m,i),m'} = ∂_i k(X_m, X_{m'}), and H that of derivatives of both arguments, H_{(m,i),(m',j)} = ∂_i ∂_{j+d} k(X_m, X_{m'}).

As long as P and Q have integrable first moments, and µ has second moments, Assumptions (A) to (D) are satisfied e.g. by a Gaussian or linear kernel on top of a differentiable φ. We can thus estimate the GCMMD based on samples from P, Q, and µ by using the empirical mean η̂ for η. This discrepancy indeed works well in practice: Appendix F.2 shows that optimizing our estimate of D^{µ,Ψ,λ}_GCMMD = sup_{ψ∈Ψ} GCMMD_{µ,k_ψ,λ} yields a good generative model on MNIST. But the linear system of size M + Md is impractical: even on 28 × 28 images and using a low-rank approximation, the model took days to converge. We therefore design a less expensive discrepancy in the next section.

The GCMMD is related to some discrepancies previously used in IGM training. The Fisher GAN [34] uses only the variance constraint ‖f‖²_{L²(µ)} ≤ 1. The Sobolev GAN [33] constrains ‖∇f‖²_{L²(µ)} ≤ 1, along with a vanishing boundary condition on f to ensure a well-defined solution (although this was not used in the implementation, and can cause very unintuitive critic behavior; see Appendix C). The authors considered several choices of µ, including the WGAN-GP measure [18] and mixtures (P + Q_θ)/2. Rather than enforcing the constraints in closed form as we do, though, these models used additive regularization. We will compare to the Sobolev GAN in experiments.

3.3 Scaled Maximum Mean Discrepancy

We will now derive a lower bound on the Gradient-Constrained MMD which retains many of its attractive qualities but can be estimated in time linear in the dimension d.

Proposition 4. Make Assumptions (A) to (D). For any f ∈ H_k, ‖f‖_{S(µ),k,λ} ≤ (1/σ_{µ,k,λ}) ‖f‖_{H_k}, where

σ_{µ,k,λ} := 1 / sqrt( λ + ∫ k(x, x) µ(dx) + Σ_{i=1}^d ∫ [∂²k(y, z) / ∂y_i ∂z_i]_{(y,z)=(x,x)} µ(dx) ).

We then define the Scaled Maximum Mean Discrepancy based on this bound of Proposition 4:

SMMD_{µ,k,λ}(P, Q) := sup_{f : (1/σ_{µ,k,λ}) ‖f‖_{H_k} ≤ 1} E_{X∼P}[f(X)] − E_{Y∼Q}[f(Y)] = σ_{µ,k,λ} MMD_k(P, Q).   (7)

⁴ We use (m, i) to denote (m − 1)d + i; thus ∇η(X) stacks ∇η(X_1), . . . , ∇η(X_M) into one vector.
⁵ We use ∂_i k(x, y) to denote the partial derivative with respect to x_i, and ∂_{i+d} k(x, y) that for y_i.

Because the constraint in the optimization of (7) is more restrictive than in that of (5), we have that SMMD_{µ,k,λ}(P, Q) ≤ GCMMD_{µ,k,λ}(P, Q). The Sobolev norm ‖f‖_{S(µ),k,λ}, and a fortiori the gradient norm under µ, is thus also controlled for the SMMD critic. We also show in Appendix F.1 that SMMD_{µ,k,λ} behaves similarly to GCMMD_{µ,k,λ} on Gaussians.

If k = K ∘ φ and K(a, b) = g(‖a − b‖²), then σ⁻²_{k,µ,λ} = λ + g(0) + 2|g′(0)| E_µ[‖∇φ(X)‖²_F]. Or if K is linear, K(a, b) = aᵀb, then σ⁻²_{k,µ,λ} = λ + E_µ[‖φ(X)‖² + ‖∇φ(X)‖²_F]. Estimating these terms based on samples from µ is straightforward, giving a natural estimator for the SMMD.

Of course, if µ and k are fixed, the SMMD is simply a constant times the MMD, and so behaves in essentially the same way as the MMD. But optimizing the SMMD over a kernel family Ψ, D^{µ,Ψ,λ}_SMMD(P, Q) := sup_{ψ∈Ψ} SMMD_{µ,k_ψ,λ}(P, Q), gives a distance very different from D^Ψ_MMD (3). Figure 1 (b) shows the vector field for the Optimized SMMD loss in Example 1, using the WGAN-GP measure µ = Uniform(0, θ). The optimization surface is far more amenable: in particular the location C, which formerly had an extremely small gradient that made learning effectively impossible, now converges very quickly by first reducing the critic gradient until some signal is available. Figure 1 (d) demonstrates that D^{µ,Ψ,λ}_SMMD, like D^{µ,Ψ,λ}_GCMMD and D^{Ψ,λ}_LipMMD but in sharp contrast to D^Ψ_MMD, is continuous with respect to the location θ and provides a strong gradient towards 0.

We can establish that D^{µ,Ψ,λ}_SMMD is continuous in the Wasserstein topology under some conditions:

Theorem 1. Let k_ψ = K ∘ φ_ψ, with φ_ψ : X → R^s a fully-connected L-layer network with Leaky-ReLU_α activations whose layers do not increase in width, and K satisfying mild smoothness conditions Q_K < ∞ (Assumptions (II) to (V) in Appendix A.2). Let Ψ_κ be the set of parameters where each layer's weight matrices have condition number cond(W^l) = ‖W^l‖ / σ_min(W^l) ≤ κ < ∞. If µ has a density (Assumption (I)), then

D^{µ,Ψ_κ,λ}_SMMD(P, Q) ≤ ( Q_K κ^{L/2} / (√(dL) α^{L/2}) ) W(P, Q).

Thus if P_n →_W P, then D^{µ,Ψ_κ,λ}_SMMD(P_n, P) → 0, even if µ is chosen to depend on P and Q.

Uniform bounds vs bounds in expectation. Controlling ‖∇f_ψ‖²_{L²(µ)} = E_µ ‖∇f_ψ(X)‖² does not necessarily imply a bound on ‖f_ψ‖_Lip = sup_{x∈X} ‖∇f_ψ(x)‖, and so does not in general give continuity via Proposition 2. Theorem 1 implies that when the network's weights are well-conditioned, it is sufficient to only control ‖∇f_ψ‖²_{L²(µ)}, which is far easier in practice than controlling ‖f‖_Lip. If we instead tried to directly control ‖f‖_Lip with e.g. spectral normalization (SN) [32], we could significantly reduce the expressiveness of the parametric family. In Example 1, constraining ‖φ_ψ‖_Lip = 1 limits us to only Ψ = {1}. Thus D^{{1}}_MMD is simply the MMD with an RBF kernel of bandwidth 1, which has poor gradients when θ is far from 0 (Figure 1 (c), blue). The Cauchy-Schwarz bound of Proposition 4 allows jointly adjusting the smoothness of k and the critic f, while SN must control the two independently. Relatedly, limiting ‖φ‖_Lip by limiting the Lipschitz norm of each layer could substantially reduce capacity, while ‖∇f_ψ‖_{L²(µ)} need not be decomposed by layer. Another advantage is that µ provides a data-dependent measure of complexity as in [10]: we do not needlessly prevent ourselves from using critics that behave poorly only far from the data.

Spectral parametrization. When the generator is near a local optimum, the critic might identify only one direction on which Q_θ and P differ. If the generator parameterization is such that there is no local way for the generator to correct it, the critic may begin to single-mindedly focus on this difference, choosing redundant convolutional filters and causing the condition number of the weights to diverge. If this occurs, the generator will be motivated to fix this single direction while ignoring all other aspects of the distributions, after which it may become stuck. We can help avoid this collapse by using a critic parameterization that encourages diverse filters with higher-rank weight matrices. Miyato et al. [32] propose to parameterize the weight matrices as W = W̄ / ‖W̄‖_op, where ‖W̄‖_op is the spectral norm of W̄. This parametrization works particularly well with D^{µ,Ψ,λ}_SMMD; Figure 2 (b) shows the singular values of the second layer of a critic's network (and Figure 9, in Appendix F.3, shows more layers), while Figure 2 (d) shows the evolution of the condition number during training.
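The scale factor σ_{µ,k,λ} that defines the SMMD has the closed form given after Proposition 4, which makes it easy to sketch numerically. The sketch below is our own illustration under simple assumptions (a Gaussian top-level kernel, a linear feature map, and a made-up MMD value), not the paper's code:

```python
import numpy as np

def smmd_scale(grad_feat_sq_norms, lam=1.0):
    """sigma_{mu,k,lambda} for k = K o phi with Gaussian K(a,b) = exp(-||a-b||^2 / 2):
    g(r) = exp(-r/2) gives g(0) = 1 and 2|g'(0)| = 1, so
    sigma^{-2} = lam + 1 + E_mu ||grad phi(X)||_F^2."""
    return 1.0 / np.sqrt(lam + 1.0 + np.mean(grad_feat_sq_norms))

# A linear feature map phi(x) = W x has constant Jacobian W, so ||grad phi(x)||_F^2 = ||W||_F^2.
rng = np.random.default_rng(1)
W = rng.normal(size=(2, 5))
grad_sq = np.full(100, np.sum(W**2))  # one (identical) sample of the squared Frobenius norm

mmd = 0.8  # stand-in value for MMD_k(P, Q); SMMD just rescales it
smmd = smmd_scale(grad_sq) * mmd
print(smmd <= mmd)  # the scale is at most 1/sqrt(lam + 1), so SMMD never exceeds MMD here
print(smmd_scale(10 * grad_sq) < smmd_scale(grad_sq))  # steeper features shrink the scale
```

The second print shows the mechanism discussed above: making the features (and hence the critic) steeper under µ automatically deflates the distance, so the kernel cannot win the sup in D^{µ,Ψ,λ}_SMMD by becoming arbitrarily sharp.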
The conditioning of the weight matrix remains stable throughout training with spectral parametrization, while it worsens through training in the default case.

4 Experiments

We evaluated unsupervised image generation on three datasets: CIFAR-10 [26] (60 000 images, 32 × 32), CelebA [29] (202 599 face images, resized and cropped to 160 × 160 as in [7]), and the more challenging ILSVRC2012 (ImageNet) dataset [41] (1 281 167 images, resized to 64 × 64). Code for all of these experiments is available at github.com/MichaelArbel/Scaled-MMD-GAN.

Losses. All models are based on a scalar-output critic network φ : X → R, except MMDGAN-GP, where φ : X → R^16 as in [7]. The WGAN and Sobolev GAN use a critic f = φ, while the GAN uses a discriminator D(x) = 1/(1 + exp(−φ(x))). The MMD-based methods use a kernel k(x, y) = exp(−(φ(x) − φ(y))²/2), except for MMDGAN-GP, which uses a mixture of RQ kernels as in [7]. Increasing the output dimension of the critic or using a different kernel didn't substantially change the performance of our proposed method. We also consider SMMD with a linear top-level kernel, k(x, y) = φ(x) φ(y); because this becomes essentially identical to a WGAN (Appendix E), we refer to this method as SWGAN. SMMD and SWGAN use µ = P; Sobolev GAN uses µ = (P + Q)/2 as in [33]. We choose λ and an overall scaling to obtain the losses:

SMMD: MMD̂²_k(P, Q_θ) / (1 + 10 E_P̂[‖∇φ(X)‖²_F]),

SWGAN: (E_P̂[φ(X)] − E_Q̂θ[φ(X)]) / sqrt(1 + 10 E_P̂[|φ(X)|²] + E_P̂[‖∇φ(X)‖²_F]).

Architecture. For CIFAR-10, we used the CNN architecture proposed by [32] with a 7-layer critic and a 4-layer generator. For CelebA, we used a 5-layer DCGAN discriminator and a 10-layer ResNet generator as in [7]. For ImageNet, we used a 10-layer ResNet for both the generator and discriminator. In all experiments we used 64 filters for the smallest convolutional layer, and doubled this at each layer (CelebA/ImageNet) or every other layer (CIFAR-10). The input codes for the generator are drawn from Uniform([−1, 1]^128). We consider two parameterizations for each critic: a standard one where the parameters can take any real value, and a spectral parametrization (denoted SN-) as above [32]. Models without explicit gradient control (SN-GAN, SN-MMDGAN, SN-MMDGAN-L2, SN-WGAN) fix the scale to 1, giving spectral normalization; others learn it, using a spectral parameterization.

Training. All models were trained for 150 000 generator updates on a single GPU, except for ImageNet, where the model was trained on 3 GPUs simultaneously. To limit communication overhead we averaged the MMD estimate on each GPU, giving the block MMD estimator [54]. We always used 64 samples per GPU from each of P and Q, and 5 critic updates per generator step. We used initial learning rates of 0.0001 for CIFAR-10 and CelebA, 0.0002 for ImageNet, and decayed these rates using the KID adaptive scheme of [7]: every 2 000 steps, generator samples are compared to those from 20 000 steps ago, and if the relative KID test [9] fails to show an improvement three consecutive times, the learning rate is decayed by 0.8. We used the Adam optimizer [25] with β₁ = 0.5, β₂ = 0.9.

Evaluation. To compare the sample quality of different models, we considered three different scores based on the Inception network [49] trained for ImageNet classification, all using default parameters in the implementation of [7]. The Inception Score (IS) [42] is based on the entropy of predicted labels; higher values are better. Though standard, this metric has many issues, particularly on datasets other than ImageNet [4, 7, 20]. The FID [20] instead measures the similarity of samples from the generator and the target as the Wasserstein-2 distance between Gaussians fit to their intermediate representations. It is more sensible than the IS and is becoming standard, but its estimator is strongly biased [7]. The KID [7] is similar to FID, but by using a polynomial-kernel MMD its estimates enjoy better statistical properties and are easier to compare. (A similar score was recommended by [21].)

Results. Table 1a presents the scores for models trained on both CIFAR-10 and CelebA datasets. On CIFAR-10, SN-SWGAN and SN-SMMDGAN performed comparably to SN-GAN. But on CelebA, SN-SWGAN and SN-SMMDGAN dramatically outperformed the other methods with the same architecture in all three metrics. They also trained faster, and consistently outperformed other methods over multiple initializations (Figure 2 (a)). It is worth noting that SN-SWGAN far outperformed WGAN-GP on both datasets. Table 1b presents the scores for SMMDGAN and SN-SMMDGAN trained on ImageNet, and the scores of pre-trained models using BGAN [6] and SN-GAN [32].⁶

⁶ These models are courtesy of the respective authors and also trained at 64 × 64 resolution. SN-GAN used the same architecture as our model, but trained for 250 000 generator iterations; BS-GAN used a similar 5-layer ResNet architecture and trained for 74 epochs, comparable to SN-GAN.

Figure 2: The training process on CelebA. (a) KID scores. We report a final score for SN-GAN slightly before its sudden failure mode; MMDGAN and SN-MMDGAN were unstable and had scores around 100. (b) Singular values of the second layer, both early (dashed) and late (solid) in training. (c) The critic complexity σ⁻²_{µ,k,λ} for several MMD-based methods. (d) The condition number in the first layer through training.
SN alone does not control σ_{µ,k,λ}, and SMMD alone does not control the condition number.

Figure 3: Samples from various models. Top, 64 × 64 ImageNet: (a) Scaled MMD GAN with SN; (b) SN-GAN; (c) Boundary Seeking GAN. Bottom, 160 × 160 CelebA: (d) Scaled MMD GAN with SN; (e) Scaled WGAN with SN; (f) MMD GAN with GP+L2.

Table 1: Mean (standard deviation) of score estimates, based on 50 000 samples from each model.

(a) CIFAR-10 and CelebA.

Method | CIFAR-10 IS | CIFAR-10 FID | CIFAR-10 KID×10³ | CelebA IS | CelebA FID | CelebA KID×10³
WGAN-GP | 6.9±0.2 | 31.1±0.2 | 22.2±1.1 | 2.7±0.0 | 29.2±0.2 | 22.0±1.0
MMDGAN-GP-L2 | 6.9±0.1 | 31.4±0.3 | 23.3±1.1 | 2.6±0.0 | 20.5±0.2 | 13.0±1.0
Sobolev-GAN | 7.0±0.1 | 30.3±0.3 | 22.3±1.2 | 2.9±0.0 | 16.4±0.1 | 10.6±0.5
SMMDGAN | 7.0±0.1 | 31.5±0.4 | 22.2±1.1 | 2.7±0.0 | 18.4±0.2 | 11.5±0.8
SN-GAN | 7.2±0.1 | 26.7±0.2 | 16.1±0.9 | 2.7±0.0 | 22.6±0.1 | 14.6±1.1
SN-SWGAN | 7.2±0.1 | 28.5±0.2 | 17.6±1.1 | 2.8±0.0 | 14.1±0.2 | 7.7±0.5
SN-SMMDGAN | 7.3±0.1 | 25.0±0.3 | 16.6±2.0 | 2.8±0.0 | 12.4±0.2 | 6.1±0.4

(b) ImageNet.

Method | IS | FID | KID×10³
BGAN | 10.7±0.4 | 43.9±0.3 | 47.0±1.1
SN-GAN | 11.2±0.1 | 47.5±0.1 | 44.4±2.2
SMMDGAN | 10.7±0.2 | 38.4±0.3 | 39.3±2.5
SN-SMMDGAN | 10.9±0.1 | 36.6±0.2 | 34.6±1.6

The proposed methods substantially outperformed both pre-trained models in FID and KID scores. Figure 3 shows samples on ImageNet and CelebA; Appendix F.4 has more.

Spectrally normalized WGANs / MMDGANs. To control for the contribution of the spectral parametrization to the performance, we evaluated variants of MMDGANs, WGANs and Sobolev-GAN using spectral normalization (in Table 2, Appendix F.3). WGAN and Sobolev-GAN led to unstable training and didn't converge at all (Figure 11) despite many attempts. MMDGAN converged on CIFAR-10 (Figure 11) but was unstable on CelebA (Figure 10). The gradient control due to SN is thus probably too loose for these methods. This is reinforced by Figure 2 (c), which shows that the expected gradient of the critic network is much better controlled by SMMD, even when SN is used. We also considered variants of these models with a learned φ while also adding a gradient penalty and an L2 penalty on critic activations [7, footnote 19]. These generally behaved similarly to MMDGAN, and didn't lead to substantial improvements. We ran the same experiments on CelebA, but aborted the runs early when it became clear that training was not successful.

Rank collapse. We occasionally observed the failure mode for SMMD where the critic becomes low-rank, discussed in Section 3.3, especially on CelebA; this failure was obvious even in the training objective. Figure 2 (b) is one of these examples. Spectral parametrization seemed to prevent this behavior. We also found one could avoid collapse by reverting to an earlier checkpoint and increasing the RKHS regularization parameter λ, but did not do this for any of the experiments here.

5 Conclusion

We studied gradient regularization for MMD-based critics in implicit generative models, clarifying how previous techniques relate to the D^Ψ_MMD loss.
Based on these insights, we proposed the Gradient-\nhow previous techniques relate to the D \nConstrained MMD and its approximation the Scaled MMD, a new loss function for IGMs that\ncontrols gradient behavior in a principled way and obtains excellent performance in practice.\nOne interesting area of future study for these distances is their behavior when used to diffuse particles\ndistributed as Q towards particles distributed as P. Mroueh et al. [33, Appendix A.1] began such a\nstudy for the Sobolev GAN loss; [35] proved convergence and studied discrete-time approximations.\nAnother area to explore is the geometry of these losses, as studied by Bottou et al. [8], who showed\npotential advantages of the Wasserstein geometry over the MMD. Their results, though, do not\naddress any distances based on optimized kernels; the new distances introduced here might have\ninteresting geometry of their own.\n\n9\n\n\fReferences\n\n[1] B. Amos and J. Z. Kolter. \u201cOptNet: Differentiable Optimization as a Layer in Neural Net-\n\n[2] M. Arjovsky and L. Bottou. \u201cTowards Principled Methods for Training Generative Adversarial\n\n[3] M. Arjovsky, S. Chintala, and L. Bottou. \u201cWasserstein Generative Adversarial Networks.\u201d\n\nworks.\u201d ICML. 2017. arXiv: 1703.00443.\n\nNetworks.\u201d ICLR. 2017. arXiv: 1701.04862.\n\nICML. 2017. arXiv: 1701.07875.\n\n[4] S. Barratt and R. Sharma. A Note on the Inception Score. 2018. arXiv: 1801.01973.\n[5] M. G. Bellemare, I. Danihelka, W. Dabney, S. Mohamed, B. Lakshminarayanan, S. Hoyer,\nand R. Munos. The Cramer Distance as a Solution to Biased Wasserstein Gradients. 2017.\narXiv: 1705.10743.\n\n[6] D. Berthelot, T. Schumm, and L. Metz. BEGAN: Boundary Equilibrium Generative Adversarial\n\n[7] M. Bi\u00b4nkowski, D. J. Sutherland, M. Arbel, and A. Gretton. \u201cDemystifying MMD GANs.\u201d\n\nNetworks. 2017. arXiv: 1703.10717.\n\nICLR. 2018. arXiv: 1801.01401.\n\n[8] L. Bottou, M. Arjovsky, D. Lopez-Paz, and M. Oquab. 
\u201cGeometrical Insights for Implicit\nGenerative Modeling.\u201d Braverman Readings in Machine Learning: Key Iedas from Inception\nto Current State. Ed. by L. Rozonoer, B. Mirkin, and I. Muchnik. LNAI Vol. 11100. Springer,\n2018, pp. 229\u2013268. arXiv: 1712.07822.\n\n[9] W. Bounliphone, E. Belilovsky, M. B. Blaschko, I. Antonoglou, and A. Gretton. \u201cA Test of\nRelative Similarity For Model Selection in Generative Models.\u201d ICLR. 2016. arXiv: 1511.\n04581.\n\n[10] O. Bousquet, O. Chapelle, and M. Hein. \u201cMeasure Based Regularization.\u201d NIPS. 2004.\n[11] A. Brock, T. Lim, J. M. Ritchie, and N. Weston. \u201cNeural Photo Editing with Introspective\n\nAdversarial Networks.\u201d ICLR. 2017. arXiv: 1609.07093.\n\n[12] R. M. Dudley. Real Analysis and Probability. 2nd ed. Cambridge University Press, 2002.\n[13] G. K. Dziugaite, D. M. Roy, and Z. Ghahramani. \u201cTraining generative neural networks via\n\nMaximum Mean Discrepancy optimization.\u201d UAI. 2015. arXiv: 1505.03906.\n\n[14] A. Genevay, G. Peyr\u00e9, and M. Cuturi. \u201cLearning Generative Models with Sinkhorn Diver-\n\ngences.\u201d AISTATS. 2018. arXiv: 1706.00292.\n\n[15] T. Gneiting and A. E. Raftery. \u201cStrictly proper scoring rules, prediction, and estimation.\u201d JASA\n\n102.477 (2007), pp. 359\u2013378.\nI. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville,\nand Y. Bengio. \u201cGenerative Adversarial Nets.\u201d NIPS. 2014. arXiv: 1406.2661.\n\n[16]\n\n[17] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Sch\u00f6lkopf, and A. J. Smola. \u201cA Kernel Two-\n\nSample Test.\u201d JMLR 13 (2012).\nI. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville. \u201cImproved Training of\nWasserstein GANs.\u201d NIPS. 2017. arXiv: 1704.00028.\n\n[19] A. G\u00fcng\u00f6r. \u201cSome bounds for the product of singular values.\u201d International Journal of\n\n[18]\n\nContemporary Mathematical Sciences (2007).\n\n[20] M. Heusel, H. 
Ramsauer, T. Unterthiner, B. Nessler, G. Klambauer, and S. Hochreiter. “GANs Trained by a Two Time-Scale Update Rule Converge to a Nash Equilibrium.” NIPS. 2017. arXiv: 1706.08500.
[21] G. Huang, Y. Yuan, Q. Xu, C. Guo, Y. Sun, F. Wu, and K. Weinberger. An empirical study on evaluation metrics of generative adversarial networks. 2018. arXiv: 1806.07755.
[22] X. Huang, M.-Y. Liu, S. Belongie, and J. Kautz. “Multimodal Unsupervised Image-to-Image Translation.” ECCV. 2018. arXiv: 1804.04732.
[23] Y. Jin, K. Zhang, M. Li, Y. Tian, H. Zhu, and Z. Fang. Towards the Automatic Anime Characters Creation with Generative Adversarial Networks. 2017. arXiv: 1708.05509.
[24] T. Karras, T. Aila, S. Laine, and J. Lehtinen. “Progressive Growing of GANs for Improved Quality, Stability, and Variation.” ICLR. 2018. arXiv: 1710.10196.
[25] D. Kingma and J. Ba. “Adam: A Method for Stochastic Optimization.” ICLR. 2015. arXiv: 1412.6980.
[26] A. Krizhevsky. Learning Multiple Layers of Features from Tiny Images. 2009.
[27] C.-L. Li, W.-C. Chang, Y. Cheng, Y. Yang, and B. Póczos. “MMD GAN: Towards Deeper Understanding of Moment Matching Network.” NIPS. 2017. arXiv: 1705.08584.
[28] Y. Li, K. Swersky, and R. Zemel. “Generative Moment Matching Networks.” ICML. 2015. arXiv: 1502.02761.
[29] Z. Liu, P. Luo, X. Wang, and X. Tang. “Deep learning face attributes in the wild.” ICCV. 2015. arXiv: 1411.7766.
[30] L. Mescheder, A. Geiger, and S. Nowozin. “Which Training Methods for GANs do actually Converge?” ICML. 2018. arXiv: 1801.04406.
[31] P. Milgrom and I. Segal. “Envelope theorems for arbitrary choice sets.” Econometrica 70.2 (2002), pp. 583–601.
[32] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida. “Spectral Normalization for Generative Adversarial Networks.” ICLR. 2018. arXiv: 1802.05957.
[33] Y.
Mroueh, C.-L. Li, T. Sercu, A. Raj, and Y. Cheng. “Sobolev GAN.” ICLR. 2018. arXiv: 1711.04894.
[34] Y. Mroueh and T. Sercu. “Fisher GAN.” NIPS. 2017. arXiv: 1705.09675.
[35] Y. Mroueh, T. Sercu, and A. Raj. Regularized Kernel and Neural Sobolev Descent: Dynamic MMD Transport. 2018. arXiv: 1805.12062.
[36] A. Müller. “Integral Probability Metrics and their Generating Classes of Functions.” Advances in Applied Probability 29.2 (1997), pp. 429–443.
[37] S. Nowozin, B. Cseke, and R. Tomioka. “f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization.” NIPS. 2016. arXiv: 1606.00709.
[38] A. Radford, L. Metz, and S. Chintala. “Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks.” ICLR. 2016. arXiv: 1511.06434.
[39] J. R. Retherford. “Review: J. Diestel and J. J. Uhl, Jr., Vector measures.” Bull. Amer. Math. Soc. 84.4 (July 1978), pp. 681–685.
[40] K. Roth, A. Lucchi, S. Nowozin, and T. Hofmann. “Stabilizing Training of Generative Adversarial Networks through Regularization.” NIPS. 2017. arXiv: 1705.09367.
[41] O. Russakovsky et al. ImageNet Large Scale Visual Recognition Challenge. 2014. arXiv: 1409.0575.
[42] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. “Improved Techniques for Training GANs.” NIPS. 2016. arXiv: 1606.03498.
[43] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
[44] B. K. Sriperumbudur, K. Fukumizu, A. Gretton, G. R. G. Lanckriet, and B. Schölkopf. “Kernel choice and classifiability for RKHS embeddings of probability distributions.” NIPS. 2009.
[45] B. K. Sriperumbudur, K. Fukumizu, and G. R. G. Lanckriet. “Universality, Characteristic Kernels and RKHS Embedding of Measures.” JMLR 12 (2011), pp. 2389–2410.
arXiv: 1003.0887.
[46] B. Sriperumbudur. “On the optimal estimation of probability measures in weak and strong topologies.” Bernoulli 22.3 (2016), pp. 1839–1893. arXiv: 1310.8240.
[47] I. Steinwart and A. Christmann. Support Vector Machines. Springer, 2008.
[48] D. J. Sutherland, H.-Y. Tung, H. Strathmann, S. De, A. Ramdas, A. Smola, and A. Gretton. “Generative Models and Model Criticism via Optimized Maximum Mean Discrepancy.” ICLR. 2017. arXiv: 1611.04488.
[49] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. “Rethinking the Inception Architecture for Computer Vision.” CVPR. 2016. arXiv: 1512.00567.
[50] T. Unterthiner, B. Nessler, C. Seward, G. Klambauer, M. Heusel, H. Ramsauer, and S. Hochreiter. “Coulomb GANs: Provably Optimal Nash Equilibria via Potential Fields.” ICLR. 2018. arXiv: 1708.08819.
[51] C. Villani. Optimal Transport: Old and New. Springer, 2009.
[52] J. Weed and F. Bach. “Sharp asymptotic and finite-sample rates of convergence of empirical measures in Wasserstein distance.” Bernoulli (forthcoming). arXiv: 1707.00087.
[53] H. Wendland. Scattered Data Approximation. Cambridge University Press, 2005.
[54] W. Zaremba, A. Gretton, and M. B. Blaschko. “B-tests: Low Variance Kernel Two-Sample Tests.” NIPS. 2013. arXiv: 1307.1954.