{"title": "Banach Wasserstein GAN", "book": "Advances in Neural Information Processing Systems", "page_first": 6754, "page_last": 6763, "abstract": "Wasserstein Generative Adversarial Networks (WGANs) can be used to generate realistic samples from complicated image distributions. The Wasserstein metric used in WGANs is based on a notion of distance between individual images, which induces a notion of distance between probability distributions of images. So far the community has considered $\\ell^2$ as the underlying distance. We generalize the theory of WGAN with gradient penalty to Banach spaces, allowing practitioners to select the features to emphasize in the generator. We further discuss the effect of some particular choices of underlying norms, focusing on Sobolev norms. Finally, we demonstrate a boost in performance for an appropriate choice of norm on CIFAR-10 and CelebA.", "full_text": "Banach Wasserstein GAN\n\nJonas Adler\n\nDepartment of Mathematics\n\nKTH - Royal institute of Technology\n\nResearch and Physics\n\nElekta\n\njonasadl@kth.se\n\nSebastian Lunz\n\nDepartment of Applied Mathematics\n\nand Theoretical Physics\nUniversity of Cambridge\nlunz@math.cam.ac.uk\n\nAbstract\n\nWasserstein Generative Adversarial Networks (WGANs) can be used to generate\nrealistic samples from complicated image distributions. The Wasserstein metric\nused in WGANs is based on a notion of distance between individual images, which\ninduces a notion of distance between probability distributions of images. So far\nthe community has considered (cid:96)2 as the underlying distance. We generalize the\ntheory of WGAN with gradient penalty to Banach spaces, allowing practitioners to\nselect the features to emphasize in the generator. We further discuss the effect of\nsome particular choices of underlying norms, focusing on Sobolev norms. 
Finally,\nwe demonstrate a boost in performance for an appropriate choice of norm on\nCIFAR-10 and CelebA.\n\n1\n\nIntroduction\n\nGenerative Adversarial Networks (GANs) are one of the most popular generative models [6]. A\nneural network, the generator, learns a map that takes random input noise to samples from a given\ndistribution. The training involves using a second neural network, the critic, to discriminate between\nreal samples and the generator output.\nIn particular, [2, 7] introduces a critic built around the Wasserstein distance between the distribution\nof true images and generated images. The Wasserstein distance is inherently based on a notion of\ndistance between images which in all implementations of Wasserstein GANs (WGAN) so far has\nbeen the (cid:96)2 distance. On the other hand, the imaging literature contains a wide range of metrics used\nto compare images [4] that each emphasize different features of interest, such as edges or to more\naccurately approximate human observer perception of the generated image.\nThere is hence an untapped potential in selecting a norm beyond simply the classical (cid:96)2 norm. We\ncould for example select an appropriate Sobolev space to either emphasize edges, or large scale\nbehavior. In this work we extend the classical WGAN theory to work on these and more general\nBanach spaces.\nOur contributions are as follows:\n\ngradient penalty (GP) term to any separable complete normed space.\n\n\u2022 We introduce Banach Wasserstein GAN (BWGAN), extending WGAN implemented via a\n\u2022 We describe how BWGAN can be ef\ufb01ciently implemented. The only practical difference\nfrom classical WGAN with gradient penalty is that the (cid:96)2 norm is replaced with a dual norm.\nWe also give theoretically grounded heuristics for the choice of regularization parameters.\n\u2022 We compare BWGAN with different norms on the CIFAR-10 and CelebA datasets. 
Using the\nSpace L10, which puts strong emphasize on outliers, we achieve an unsupervised inception\nscore of 8.31 on CIFAR-10, state of the art for non-progressive growing GANs.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\f(cid:18)\n\n(cid:19)1/p\n\n2 Background\n\n2.1 Generative adversarial networks\n\nGenerative Adversarial Networks (GANs) [6] perform generative modeling by learning a map\nG : Z \u2192 B from a low-dimensional latent space Z to image space B, mapping a \ufb01xed noise\ndistribution PZ to a distribution of generated images PG.\nIn order to train the generative model G, a second network D is used to discriminate between original\nimages drawn from a distribution of real images Pr and images drawn from PG. The generator is\ntrained to output images that are conceived to be realistic by the critic D. The process is iterated,\nleading to the famous minimax game [6] between generator G and critic D\n\nmin\n\nG\n\nmax\n\nD\n\nEX\u223cPr [log(D(X))] + EZ\u223cPZ [log(1 \u2212 D(G\u0398(Z)))] .\n\n(1)\n\nAssuming the discriminator is perfectly trained, this gives rise to the Jensen\u2013Shannon divergence\n(JSD) as distance measure between the distributions PG and Pr [6, Theorem 1].\n\n2.2 Wasserstein metrics\n\nTo overcome undesirable behavior of the JSD in the presence of singular measures [1], in [2] the\nWasserstein metric is introduced to quantify the distance between the distributions PG and Pr. While\nthe JSD is a strong metric, measuring distances point-wise, the Wasserstein distance is a weak metric,\nmeasuring the cost of transporting one probability distribution to another. This allows it to stay \ufb01nite\nand provide meaningful gradients to the generator even when the measures are mutually singular.\nIn a rather general form, the Wasserstein metric takes into account an underlying metric dB :\nB \u00d7 B \u2192 R on a Polish (e.g. separable completely metrizable) space B. 
In its primal formulation,\nthe Wasserstein-p, p \u2265 1, distance is de\ufb01ned as\n\nWassp(PG, Pr) :=\n\ninf\n\nE(X1,X2)\u223c\u03c0dB(X1, X2)p\n\n\u03c0\u2208\u03a0(PG,Pr)\n\n(2)\nwhere \u03a0(PG, Pr) denotes the set of distributions on B\u00d7B with marginals PG and Pr. The Wasserstein\ndistance is hence highly dependent on the choice of metric dB.\nThe Kantorovich-Rubinstein duality [19, 5.10] provides a way of more ef\ufb01ciently computing the\nWasserstein-1 distance (which we will henceforth simply call the Wasserstein distance, Wass =\nWass1) between measures on high dimensional spaces. The duality holds in the general setting of\nPolish spaces and states that\n\n,\n\nWass(PG, Pr) = sup\nLip(f )\u22641\n\nEX\u223cPGf (X) \u2212 EX\u223cPr f (X).\n\n(3)\n\nThe supremum is taken over all Lipschitz continuous functions f : B \u2192 R with Lipschitz constant\nequal or less than one. We note that in this dual formulation, the dependence of f on the choice of\nmetric is encoded in the condition of f being 1-Lipschitz and recall that a function f : B \u2192 R is\n\u03b3-Lipschitz if\n\n|f (x) \u2212 f (y)| \u2264 \u03b3dB(x, y).\n\nIn an abstract sense, the Wasserstein metric could be used in GAN training by using a critic D to\napproximate the supremum in (3). The generator uses the loss EZ\u223cPZ D(G(Z)). In the case of a\nperfectly trained critic D, this is equivalent to using the Wasserstein loss Wass(PG, Pr) to train G [2,\nTheorem 3].\n\n2.3 Wasserstein GAN\n\nImplementing GANs with the Wasserstein metric requires to approximate the supremum in (3) with a\nneural network. In order to do so, the Lipschitz constraint has to be enforced on the network. In the\npaper Wasserstein GAN [2] this was achieved by restricting all network parameters to lie within a\nprede\ufb01ned interval. This technique typically guarantees that the network is \u03b3 Lipschitz for some \u03b3\nfor any metric space. 
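Stepping back to the primal formulation (2): for discrete measures on finitely many points it can be solved exactly as a linear program over transport plans. A minimal sketch using SciPy (the function name `wasserstein_1` and the dense constraint matrix are ours, chosen for clarity rather than efficiency):

```python
import numpy as np
from scipy.optimize import linprog

def wasserstein_1(p, q, d):
    """Wasserstein-1 distance between discrete distributions p and q
    with ground metric d[i, j] = d_B(x_i, x_j), via the primal LP (2).

    The transport plan pi is an (n, m) matrix, flattened row-major; the
    equality constraints fix its row sums to p and its column sums to q."""
    n, m = len(p), len(q)
    cost = np.asarray(d, dtype=float).reshape(-1)
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):            # row-sum constraints: sum_j pi[i, j] = p[i]
        A_eq[i, i * m:(i + 1) * m] = 1.0
    for j in range(m):            # column-sum constraints: sum_i pi[i, j] = q[j]
        A_eq[n + j, j::m] = 1.0
    b_eq = np.concatenate([p, q])
    res = linprog(cost, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun

# Moving a unit mass across distance 1 costs exactly 1:
d = np.array([[0.0, 1.0], [1.0, 0.0]])
print(wasserstein_1([1.0, 0.0], [0.0, 1.0], d))  # 1.0, up to solver tolerance
```

This exact approach only scales to tiny problems; the dual formulation (3) approximated by a critic network is what makes the distance usable for image distributions.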
However, it typically reduces the set of admissible functions to a proper subset\n\n2\n\n\fof all \u03b3 Lipschitz functions, hence introducing an uncontrollable additional constraint on the network.\nThis can lead to training instabilities and artifacts in practice [7].\nIn [7] strong evidence was presented that the condition can better be enforced by working with another\ncharacterization of 1\u2212Lipschitz functions. In particular, they prove that if B = Rn, d(x, y)B =\n(cid:107)x \u2212 y(cid:107)2 we have the gradient characterization\n\nf is 1 \u2212 Lipschitz \u21d0\u21d2 (cid:107)\u2207f (x)(cid:107)2 \u2264 1\n\nfor all x \u2208 Rn.\n\nThey softly enforce this condition by adding a penalty term to the loss function of D that takes the\nform\n\n(cid:16)(cid:107)\u2207D( \u02c6X)(cid:107)2 \u2212 1\n(cid:17)2\n\nE \u02c6X\n\n,\n\n(4)\n\nwhere the distribution of \u02c6X is taken to be the uniform distributions on lines connecting points drawn\nfrom PG and Pr.\nHowever, penalizing the (cid:96)2 norm of the gradient corresponds speci\ufb01cally to choosing the (cid:96)2 norm as\nunderlying distance measure on image space. Some research has been done on generalizing GAN\ntheory to other spaces [18, 11], but in its current form WGAN with gradient penalty does not extend\nto arbitrary choices of underlying spaces B. We shall give a generalization to a large class of spaces,\nthe (separable) Banach spaces, but \ufb01rst we must introduce some notation.\n\n2.4 Banach spaces of images\n\nA vector space is a collection of objects (vectors) that can be added together and scaled, and can be\nseen as a generalization of the Euclidean space Rn. If a vector space B is equipped with a notion of\nlength, a norm (cid:107) \u00b7 (cid:107)B : B \u2192 R, we call it a normed space. 
The most commonly used norm is the (cid:96)2\nnorm de\ufb01ned on Rn, given by\n\n(cid:32) n(cid:88)\n\n(cid:33)1/2\n\nx2\ni\n\n.\n\n(cid:107)x(cid:107)2 =\n\ni=1\n\nSuch spaces can be used to model images in a very general fashion. In a pixelized model, the image\nspace B is given by the discrete pixel values, B \u223c Rn\u00d7n. Continuous image models that do not rely\non the concept of pixel discretization include the space of square integrable functions over the unit\nsquare. The norm (cid:107)\u00b7(cid:107)B gives room for a choice on how distances between images are measured. The\nEuclidean distance is a common choice, but many other distance notions are possible that account for\nmore speci\ufb01c image features, like the position of edges in Sobolev norms.\nA normed space is called a Banach space if it is complete, that is, Cauchy sequences converge.\nFinally, a space is separable if there exists some countable dense subset. Completeness is required in\norder to ensure that the space is rich enough for us to de\ufb01ne limits whereas separability is necessary\nfor the usual notions of probability to hold. These technical requirements formally hold in \ufb01nite\ndimensions but are needed in the in\ufb01nite dimensional setting. We note that all separable Banach\nspaces are Polish spaces and we can hence de\ufb01ne Wasserstein metrics on them using the induced\nmetric dB(x, y) = (cid:107)x \u2212 y(cid:107)B.\nFor any Banach space B, we can consider the space of all bounded linear functionals B \u2192 R, which\nwe will denote B\u2217 and call the (topological) dual of B. It can be shown [17] that this space is itself a\nBanach space with norm (cid:107) \u00b7 (cid:107)B\u2217 : B\u2217 \u2192 R given by\n(cid:107)x\u2217(cid:107)B\u2217 = sup\nx\u2208B\n\nx\u2217(x)\n(cid:107)x(cid:107)B\n\n(5)\n\n.\n\nIn what follows, we will give some examples of Banach spaces along with explicit characterizations\nof their duals. 
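In finite dimensions the dual norm (5) is concrete and computable. As a sanity check, a minimal sketch for the ℓ^p norm on R^n (the helper names are ours), verifying the Hölder duality [ℓ^p]^* = ℓ^q with 1/p + 1/q = 1 by evaluating the functional at the known maximizer:

```python
import numpy as np

def lp_norm(x, p):
    """The l^p norm on R^n."""
    return np.sum(np.abs(x) ** p) ** (1.0 / p)

def dual_norm_lp(y, p):
    """Dual norm (5) of the functional x -> <y, x> on (R^n, ||.||_p).
    By Hoelder duality this equals the l^q norm with 1/p + 1/q = 1,
    and the supremum in (5) is attained at x_i = sign(y_i) |y_i|^(q - 1)."""
    q = p / (p - 1.0)
    x_star = np.sign(y) * np.abs(y) ** (q - 1.0)
    attained = float(np.dot(y, x_star) / lp_norm(x_star, p))  # value at maximizer
    return float(lp_norm(y, q)), attained

dual, attained = dual_norm_lp(np.array([3.0, -4.0]), p=2.0)
print(dual, attained)  # 5.0 5.0, since l^2 is its own dual
```

The two returned values agree for any p > 1, which is exactly the statement that the supremum in (5) is attained and equals the ℓ^q norm.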
We will give the characterizations in continuum, but they are also Banach spaces in their discretized (finite dimensional) forms.

L^p-spaces. Let Ω be some domain, for example Ω = [0, 1]² to model square images. The set of functions x : Ω → R with norm

\|x\|_{L^p} = \left( \int_\Omega x(t)^p \, dt \right)^{1/p}    (6)

is a Banach space with dual [L^p]^* = L^q where 1/p + 1/q = 1. In particular, we note that [L^2]^* = L^2. The parameter p controls the emphasis on outliers, with higher values corresponding to a stronger focus on outliers. In the extreme case p = 1, the norm is known to induce sparsity, ensuring that all but a small amount of pixels are set to the correct values.

Sobolev spaces. Let Ω be some domain, then the set of functions x : Ω → R with norm

\|x\|_{W^{1,2}} = \left( \int_\Omega x(t)^2 + |\nabla x(t)|^2 \, dt \right)^{1/2},

where ∇x is the spatial gradient, is an example of a Sobolev space. In this space, more emphasis is put on the edges than in e.g. L^p spaces, since if \|x_1 - x_2\|_{W^{1,2}} is small then not only are their absolute values close, but so are their edges.
Since taking the gradient is equivalent to multiplying with ξ in Fourier space, the concept of Sobolev spaces can be generalized to arbitrary (real) derivative orders s if we use the norm

\|x\|_{W^{s,p}} = \left( \int_\Omega \left( \mathcal{F}^{-1}\left[ (1 + |\xi|^2)^{s/2} \mathcal{F} x \right](t) \right)^p \, dt \right)^{1/p},    (7)

where \mathcal{F} is the Fourier transform. The tuning parameter s allows one to control which frequencies of an image are emphasized: a negative value of s corresponds to amplifying low frequencies, hence prioritizing the global structure of the image. 
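For p = 2, the norm (7) can be evaluated entirely in Fourier space by Parseval's identity, without the inverse transform. A minimal discrete sketch (the function name is ours, and the frequency scaling via `xi_max` mirrors the unit choice |ξ| ≤ 5 used in section 4):

```python
import numpy as np

def sobolev_norm(x, s, xi_max=5.0):
    """Discrete W^{s,2} norm of a 2-D array via eq. (7) with p = 2.
    By Parseval, the L^2 norm of F^{-1}[(1+|xi|^2)^{s/2} F x] equals the
    l^2 norm of the weighted Fourier coefficients themselves.
    Frequencies are scaled so that |xi_k| <= xi_max along each axis."""
    n, m = x.shape
    fx = np.fft.fft2(x, norm="ortho")          # unitary DFT, so Parseval holds
    xi1 = np.fft.fftfreq(n) * 2.0 * xi_max     # frequencies in [-xi_max, xi_max)
    xi2 = np.fft.fftfreq(m) * 2.0 * xi_max
    weight = (1.0 + xi1[:, None] ** 2 + xi2[None, :] ** 2) ** (s / 2.0)
    return float(np.sqrt(np.sum((weight * np.abs(fx)) ** 2)))

# s = 0 recovers the plain l^2 norm, reflecting W^{0,2} = L^2:
x = np.ones((4, 4))
print(round(sobolev_norm(x, s=0.0), 6))  # 4.0
```

Since the weight (1 + |ξ|²)^{s/2} is ≥ 1 for s > 0 and ≤ 1 for s < 0, the norm grows with s, with the low/high frequency emphasis described above.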
On the other hand, high values of s amplify high frequencies, thus putting emphasis on sharp local structures, like the edges or ridges of an image.
The dual of the Sobolev space, [W^{s,p}]^*, is W^{-s,q} where q is as above [3]. Under weak assumptions on Ω, all Sobolev spaces with 1 ≤ p < ∞ are separable. We note that W^{0,p} = L^p and in particular we recover as an important special case W^{0,2} = L^2.
There is a wide range of other norms that can be defined for images, see appendix A and [5, 3] for a further overview of norms and their respective duals.

3 Banach Wasserstein GANs

In this section we generalize the loss (4) to separable Banach spaces, allowing us to effectively train a Wasserstein GAN using arbitrary norms.
We will show that the characterization of γ-Lipschitz functions via the norm of the differential can be extended from the ℓ² setting in (4) to arbitrary Banach spaces by considering the gradient as an element in the dual of B. In particular, for any Banach space B with norm \|\cdot\|_B, we will derive the loss function

L = \frac{1}{\gamma}\left(\mathbb{E}_{X \sim P_\Theta} D(X) - \mathbb{E}_{X \sim P_r} D(X)\right) + \lambda\, \mathbb{E}_{\hat X}\left(\frac{1}{\gamma}\|\partial D(\hat X)\|_{B^*} - 1\right)^2,    (8)

where λ, γ ∈ R are regularization parameters, and show that a minimizer of this is an approximation to the Wasserstein distance on B.

3.1 Enforcing the Lipschitz constraint in Banach spaces

Throughout this chapter, let B denote a Banach space with norm \|\cdot\|_B and f : B → R a continuous function. We require a more general notion of gradient: the function f is called Fréchet differentiable at x ∈ B if there is a bounded linear map ∂f(x) : B → R such that

\lim_{\|h\|_B \to 0} \frac{1}{\|h\|_B} \left| f(x + h) - f(x) - [\partial f(x)](h) \right| = 0.    (9)

The differential ∂f(x) is hence an element of the dual space B^*. We note that the usual notion of gradient ∇f(x) in R^n with the standard inner product is connected to the Fréchet derivative via

[\partial f(x)](h) = \nabla f(x) \cdot h.

The following theorem allows us to characterize all Lipschitz continuous functions according to the dual norm of the Fréchet derivative.

Lemma 1. Assume f : B → R is Fréchet differentiable. Then f is γ-Lipschitz if and only if

\|\partial f(x)\|_{B^*} \le \gamma \quad \forall x \in B.    (10)

Proof. Assume f is γ-Lipschitz. Then for all x, h ∈ B and ε > 0

[\partial f(x)](h) = \lim_{\epsilon \to 0} \frac{1}{\epsilon}\left(f(x + \epsilon h) - f(x)\right) \le \lim_{\epsilon \to 0} \frac{\gamma \epsilon \|h\|_B}{\epsilon} = \gamma \|h\|_B,

hence by the definition of the dual norm, eq. (5), we have

\|\partial f(x)\|_{B^*} = \sup_{h \in B} \frac{[\partial f(x)](h)}{\|h\|_B} \le \sup_{h \in B} \frac{\gamma \|h\|_B}{\|h\|_B} \le \gamma.

Now let f satisfy (10) and let x, y ∈ B. 
De\ufb01ne the function g : R \u2192 R by\nx(t) = tx + (1 \u2212 t)y,\n\ng(t) = f (x(t)),\n\nwhere\n\nAs x(t + \u2206t) \u2212 x(t) = \u2206t(x \u2212 y), we see that g is everywhere differentiable and\n\ng(cid:48)(t) =(cid:2)\u2202f(cid:0)x(t)(cid:1)(cid:3)(x \u2212 y).\n\n|g(cid:48)(t)| =(cid:12)(cid:12)(cid:2)\u2202f(cid:0)x(t)(cid:1)(cid:3)(x \u2212 y)(cid:12)(cid:12) \u2264 (cid:107)\u2202f (x(t))(cid:107)B\u2217(cid:107)x \u2212 y(cid:107)B \u2264 \u03b3(cid:107)x \u2212 y(cid:107)B,\n\nHence\n\nwhich gives\n\n|f (x) \u2212 f (y)| = |g(1) \u2212 g(0)| \u2264\n\n|g(cid:48)(t)| dt \u2264 \u03b3(cid:107)x \u2212 y(cid:107)B,\n\n(cid:90) 1\n\nthus \ufb01nishing the proof.\n\n0\n\nUsing lemma 1 we see that a \u03b3-Lipschitz requirement in Banach spaces is equivalent to the dual\nnorm of the Fr\u00e9chet derivative being less than \u03b3 everywhere. In order to enforce this we need to\ncompute (cid:107)\u2202f (x)(cid:107)B\u2217. As shown in section 2.4, the dual norm can be readily computed for a range of\ninteresting Banach spaces, but we also need to compute \u2202f (x), preferably using readily available\nautomatic differentiation software. However, such software can typically only compute derivatives in\nRn with the standard norm.\nConsider a \ufb01nite dimensional Banach space B equipped by any norm (cid:107) \u00b7 (cid:107)B. By Lemma 1, gradient\nnorm penalization requires characterizing (e.g. giving a basis for) the dual B\u2217 of B. This can be a\ndif\ufb01cult for in\ufb01nite dimensional Banach spaces. In a \ufb01nite dimensional however setting, there is an\nlinear continuous bijection \u03b9 : Rn \u2192 B given by\n\n\u03b9(x)i = xi.\n\n(11)\nThis isomorphism implicitly relies on the fact that a basis of B can be chosen and can be mapped\nto the corresponding dual basis. 
This does not generalize to the infinite dimensional setting, but we hope that this is not a very limiting assumption in practice.
We note that we can write f = g ∘ ι where g : R^n → R, and automatic differentiation can be used to compute the derivative ∂g(x) efficiently. Further, note that the chain rule yields

\partial f(x) = \iota^* \left( \partial g(\iota(x)) \right),

where ι^* : R^n → B^* is the adjoint of ι, which is readily shown to be as simple as ι itself: ι^*(x)_i = x_i.
This shows that computing derivatives in finite dimensional Banach spaces can be done using standard automatic differentiation libraries with only some formal mathematical corrections. In an implementation, the operators ι, ι^* would be implicit.
In terms of computational costs, the difference between general Banach Wasserstein GANs and the ones based on the ℓ² metric lies in the computation of the gradient of the dual norm. By the chain rule, any computational step outside the calculation of this gradient is the same for any choice of underlying notion of distance. This in particular includes any forward pass or backpropagation step through the layers of the network used as discriminator. If there is an efficient framework available to compute the gradient of the dual norm, as in the case of the Fourier transform used for Sobolev spaces, the computational expenses hence stay essentially the same independent of the choice of norm.

3.2 Regularization parameter choices

The network will be trained by adding the regularization term

\lambda\, \mathbb{E}_{\hat X}\left(\frac{1}{\gamma}\|\partial D(\hat X)\|_{B^*} - 1\right)^2.

Here, λ is a regularization constant and γ is a scaling factor controlling which norm we compute. In particular D will approximate γ times the Wasserstein distance. 
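For B = ℓ^p the penalty term is straightforward to write down. A minimal sketch given already-computed critic gradients (in a real implementation these would come from automatic differentiation as ι^*∂g above; here they are plain arrays, and the function names are ours):

```python
import numpy as np

def dual_lp_norm(g, p):
    """Dual norm on (R^n, ||.||_p): the l^q norm with 1/p + 1/q = 1."""
    q = p / (p - 1.0)
    return np.sum(np.abs(g) ** q) ** (1.0 / q)

def gradient_penalty(grads, p, lam, gamma):
    """Monte-Carlo estimate of the penalty term
    lam * E[(||dD(X_hat)||_{B*} / gamma - 1)^2] for B = l^p,
    where grads[k] is the Frechet derivative of D at the k-th sample."""
    norms = np.array([dual_lp_norm(g, p) for g in grads])
    return float(lam * np.mean((norms / gamma - 1.0) ** 2))

# Gradients whose dual norm already equals gamma incur no penalty:
grads = [np.array([3.0, 4.0]), np.array([0.0, 5.0])]   # l^2 norms both 5
print(gradient_penalty(grads, p=2.0, lam=10.0, gamma=5.0))  # 0.0
```

Swapping `dual_lp_norm` for any other computable dual norm (e.g. the Fourier-weighted Sobolev dual norm) is the only change needed for a different space B.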
In the original WGAN-GP paper [7] and most following work λ = 10 and γ = 1, while γ = 750 was used in Progressive GAN [9]. However, it is easy to see that these values are specific to the ℓ² norm and that we would need to re-tune them if we change the norm. In order to avoid having to hand-tune these for every choice of norm, we will derive some heuristic parameter choice rules that work with any norm.
For our heuristic, we will start by assuming that the generator is the zero-generator, always returning zero. Assuming symmetry of the distribution of true images P_r, the discriminator will then essentially be decided by a single constant f(x) = c\|x\|_B, where c solves the optimization problem

\min_{c \in \mathbb{R}} \; \mathbb{E}_{X \sim P_r}\left[ -\frac{c \|X\|_B}{\gamma} + \frac{\lambda (c - \gamma)^2}{\gamma^2} \right].

By solving this explicitly we find

c = \gamma \left( 1 + \frac{\mathbb{E}_{X \sim P_r} \|X\|_B}{2\lambda} \right).

Since we are trying to approximate γ times the Wasserstein distance, and since the norm has Lipschitz constant 1, we want c ≈ γ. Hence to get a small relative error we need E_{X∼P_r}\|X\|_B ≪ 2λ. With this theory to guide us, we can make the heuristic rule

\lambda \approx \mathbb{E}_{X \sim P_r} \|X\|_B.

In the special case of CIFAR-10 with the ℓ² norm this gives λ ≈ 27, which agrees with earlier practice (λ = 10) reasonably well.
Further, in order to keep the training stable we assume that the network should be approximately scale preserving. Since the operation x → ∂D(x) is the deepest part of the network (twice the depth of the forward evaluation), we will enforce \|x\|_{B^*} ≈ \|\partial D(x)\|_{B^*}. Assuming λ was appropriately chosen, we find in general (by lemma 1) \|\partial D(x)\|_{B^*} ≈ γ. 
Hence we want γ ≈ \|x\|_{B^*}. We pick the expected value as a representative and hence we obtain the heuristic

\gamma \approx \mathbb{E}_{X \sim P_r} \|X\|_{B^*}.

For CIFAR-10 with the ℓ² norm this gives γ = λ ≈ 27 and may explain the improved performance obtained in [9].
A nice property of the above parameter choice rules is that they can be used with any underlying norm. By using these parameter choice rules we avoid the issue of hand tuning further parameters when training using different norms.

4 Computational results

To demonstrate computational feasibility and to show how the choice of norm can impact the trained generator, we implemented Banach Wasserstein GAN with various Sobolev and L^p norms, applied to CIFAR-10 and CelebA (64 × 64 pixels). The implementation was done in TensorFlow and the architecture used was a faithful re-implementation of the residual architecture used in [7], see appendix B. For the loss function, we used the loss eq. (8) with parameters according to section 3.2 and the norm chosen according to either the Sobolev norm eq. (7) or the L^p norm eq. (6). In the case of the Sobolev norm, we selected units such that |ξ| ≤ 5. Following [9], we add a small 10^{-5} E_{X∼P_r} D(X)² term to the discriminator loss to stop it from drifting during the training.
For training we used the Adam optimizer [10] with learning rate decaying linearly from 2·10^{-4} to 0 over 100 000 iterations with β₁ = 0, β₂ = 0.9. We used 5 discriminator updates per generator update. The batch size used was 64. 
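The two parameter heuristics of section 3.2, λ ≈ E‖X‖_B and γ ≈ E‖X‖_{B*}, are cheap to estimate from a batch of real images. A small sketch for B = ℓ^p (the function name is ours; for p = 2 the space is self-dual, so both values coincide):

```python
import numpy as np

def heuristic_parameters(batch, p):
    """Estimate lambda ~ E||X||_p and gamma ~ E||X||_q (q dual to p)
    from a batch of images, following the rules of section 3.2."""
    q = p / (p - 1.0)
    flat = np.abs(batch.reshape(len(batch), -1))
    lam = float(np.mean(np.sum(flat ** p, axis=1) ** (1.0 / p)))
    gam = float(np.mean(np.sum(flat ** q, axis=1) ** (1.0 / q)))
    return lam, gam

# For CIFAR-10-sized images scaled to [-1, 1], lambda comes out roughly
# sqrt(3072 / 3) = 32 under the l^2 norm, the same order as the lambda ~ 27
# estimate quoted in section 3.2:
batch = np.random.default_rng(0).uniform(-1.0, 1.0, size=(64, 3, 32, 32))
lam, gam = heuristic_parameters(batch, p=2.0)
print(lam == gam)  # True: l^2 is its own dual
```

The same pattern applies to any norm with a computable dual; only the two norm evaluations change.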
In order to evaluate the reproducibility of the results on CIFAR-10, we followed this up by training an ensemble of 5 generators using SGD with warm restarts following [12]. Each warm restart used 10 000 generator steps. Our implementation is available online¹.
Some representative samples from the generator on both datasets can be seen in figs. 1 and 5. See appendix C for samples from each of the W^{s,2} and L^p spaces investigated as well as samples from the corresponding Fréchet derivatives.

Figure 1: Generated CIFAR-10 samples for some L^p spaces. (a) p = 1.3 (b) p = 2.0 (c) p = 10.0

Figure 2: FID scores for BWGAN on CIFAR-10. (a) W^{s,2} (b) L^p

Figure 3: Inception scores for BWGAN on CIFAR-10.

Table 4: Inception Scores on CIFAR-10.

Method              | Inception Score
DCGAN [16]          | 6.16 ± .07
EBGAN [21]          | 7.07 ± .10
WGAN-GP [7]         | 7.86 ± .07
CT GAN [20]         | 8.12 ± .12
SNGAN [14]          | 8.22 ± .05
W^{-3/2,2}-BWGAN    | 8.26 ± .07
L^{10}-BWGAN        | 8.31 ± .07
Progressive GAN [9] | 8.80 ± .05

For evaluation, we report Fréchet Inception Distance (FID) [8] and Inception scores, both computed from 50K samples. A high image quality corresponds to high Inception and low FID scores. On the CIFAR-10 dataset, both FID and inception scores indicate that negative s and large values of p lead to better image quality. On CelebA, the best FID scores are obtained for values of s between -1 and 0 and around p = 0, whereas the training becomes unstable for p = 10. We further compare our CIFAR-10 results in terms of Inception scores to existing methods, see table 4. 
To the best of our knowledge, the inception score of 8.31 ± 0.07, achieved using the L^{10} space, is state of the art for non-progressive growing methods. Our FID scores are also highly competitive: for CIFAR-10 we achieve 16.43 using L^4. We also note that our result for W^{0,2} = ℓ² is slightly better than the reference implementation, despite using the same network. We suspect that this is due to our improved parameter choices.

¹ https://github.com/adler-j/bwgan

Figure 5: Generated CelebA samples for Sobolev spaces W^{s,2}. (a) s = -2 (b) s = 0 (c) s = 2

Figure 6: FID scores for BWGAN on CelebA. (a) W^{s,2} (b) L^p

5 How about metric spaces?

Gradient norm penalization according to lemma 1 is only valid in Banach spaces, but a natural alternative to penalizing gradient norms is to enforce the Lipschitz condition directly (see [15]). This would potentially allow training Wasserstein GAN on general metric spaces by adding a penalty term of the form

\mathbb{E}_{X,Y}\left[ \left( \left( \frac{|f(X) - f(Y)|}{d_B(X, Y)} - 1 \right)_+ \right)^2 \right].    (12)

While theoretically equivalent to gradient norm penalization when the distributions of X and Y are chosen appropriately, this term is very likely to have considerably higher variance in practice.
For example, if we assume that d is not bounded from below and consider two points x, y ∈ M that are sufficiently close, then a penalty term of the Lipschitz quotient as in (12) imposes a condition on the differential around x and y in the direction (x - y) only, i.e. only |∂f(x̃)(x - y)| ≤ 1 is ensured. 
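The difference-quotient penalty (12) itself is a one-liner. A minimal sketch (the function name is ours) that also illustrates the one-sided (·)₊ clamp, which only punishes quotients exceeding 1:

```python
import numpy as np

def lipschitz_penalty(f_x, f_y, d_xy):
    """Monte-Carlo estimate of the penalty (12): the squared one-sided
    excess of the difference quotient |f(X) - f(Y)| / d(X, Y) over 1."""
    quotient = np.abs(np.asarray(f_x) - np.asarray(f_y)) / np.asarray(d_xy)
    return float(np.mean(np.maximum(quotient - 1.0, 0.0) ** 2))

# A pair with quotient <= 1 incurs no penalty; a quotient of 3
# contributes (3 - 1)^2 = 4, so the mean over both pairs is 2:
print(lipschitz_penalty([1.0, 3.0], [0.5, 0.0], [1.0, 1.0]))  # 2.0
```

As the surrounding discussion notes, each sampled pair only probes the critic in one direction, which is why this estimator tends to be much noisier than the dual-norm gradient penalty.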
In the case of two distributions that are already close, we will with high probability sample the difference quotient in a spatial direction that is parallel to the data, hence not exhausting the Lipschitz constraint, i.e. |∂f(x̃)(x - y)| ≪ 1. Difference quotient penalization (12) then does not effectively enforce the Lipschitz condition. Gradient norm penalization, on the other hand, ensures this condition in all spatial directions simultaneously by considering the dual norm of the differential.
On the other hand, if d is bounded from below, the above argument fails. For example, a Wasserstein GAN over a space equipped with the trivial metric

d_{\text{trivial}}(x, y) = \begin{cases} 0 & \text{if } x = y \\ 1 & \text{else} \end{cases}

approximates the Total Variation distance [19]. Using the regularizer eq. (12) we get a slight variation of Least Squares GAN [13]. We do not further investigate this line of reasoning.

6 Conclusion

We analyzed the dependence of Wasserstein GANs (WGANs) on the notion of distance between images and showed how choosing distances other than the ℓ² metric can be used to make WGANs focus on particular image features of interest. We introduced a generalization of WGANs with gradient norm penalization to Banach spaces, allowing one to easily implement WGANs for a wide range of underlying norms on images. This opens up a new degree of freedom to design the algorithm to account for the image features relevant in a specific application.
On the CIFAR-10 and CelebA datasets, we demonstrated the impact a change in norm has on model performance. In particular, we computed FID scores for Banach Wasserstein GANs using different Sobolev spaces W^{s,p} and found a correlation between the values of both s and p and model performance.
While this work was motivated by images, the theory is general and can be applied to data in any normed space. 
In the future, we hope that practitioners take a step back and ask themselves if the ℓ² metric is really the best measure of fit, or if some other metric better emphasizes what they want to achieve with their generative model.

Acknowledgments

The authors would like to acknowledge Peter Maass for bringing us together as well as important support from Ozan Öktem, Axel Ringh, Johan Karlsson, Jens Sjölund, Sam Power and Carola Schönlieb.
The work by J.A. was supported by the Swedish Foundation of Strategic Research grants AM13-0049, ID14-0055 and Elekta. The work by S.L. was supported by the EPSRC grant EP/L016516/1 for the University of Cambridge Centre for Doctoral Training, and the Cambridge Centre for Analysis. We also acknowledge the support of the Cantab Capital Institute for the Mathematics of Information.

References

[1] Martín Arjovsky and Léon Bottou. Towards Principled Methods for Training Generative Adversarial Networks. International Conference on Learning Representations (ICLR), 2017.
[2] Martín Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein Generative Adversarial Networks. International Conference on Machine Learning (ICML), 2017.
[3] Haim Brezis. Functional Analysis, Sobolev Spaces and Partial Differential Equations. Springer Science & Business Media, 2010.
[4] Tony F. Chan and Jianhong Jackie Shen. Image Processing and Analysis: Variational, PDE, Wavelet, and Stochastic Methods, volume 94. SIAM, 2005.
[5] Michel Marie Deza and Elena Deza. Encyclopedia of Distances. Springer Berlin Heidelberg, 2009.
[6] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in Neural Information Processing Systems (NIPS), 2014.
[7] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C. Courville. Improved training of Wasserstein GANs. 
Advances in Neural Information Processing Systems\n(NIPS), 2017.\n\n9\n\n\f[8] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter.\nGans trained by a two time-scale update rule converge to a local nash equilibrium. In I. Guyon,\nU. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors,\nAdvances in Neural Information Processing Systems 30, pages 6626\u20136637. Curran Associates,\nInc., 2017.\n\n[9] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive Growing of GANs for\nImproved Quality, Stability, and Variation. International Conference on Learning Representa-\ntions (ICLR), 2018.\n\n[10] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. International\n\nConference on Learning Representations (ICLR), 2015.\n\n[11] Shuang Liu, Olivier Bousquet, and Kamalika Chaudhuri. Approximation and Convergence\n\nProperties of Generative Adversarial Learning. arXiv, 2017.\n\n[12] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic Gradient descent with Warm Restarts.\n\nInternational Conference on Learning Representations (ICLR), 2017.\n\n[13] Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, Zhen Wang, and Stephen Paul Smolley.\nLeast squares generative adversarial networks. IEEE International Conference on Computer\nVision (ICCV), 2017.\n\n[14] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral Normaliza-\ntion for Generative Adversarial Networks. International Conference on Learning Representa-\ntions (ICLR), 2018.\n\n[15] Henning Petzka, Asja Fischer, and Denis Lukovnikov. On the regularization of Wasserstein\n\nGANs. International Conference on Learning Representations (ICLR), 2018.\n\n[16] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised Representation Learning with\nDeep Convolutional Generative Adversarial Networks. International Conference on Learning\nRepresentations (ICLR), 2016.\n\n[17] Walter Rudin. 
Functional analysis. International series in pure and applied mathematics, 1991.\n[18] Calvin Seward, Thomas Unterthiner, Urs Bergmann, Nikolay Jetchev, and Sepp Hochreiter.\n\nFirst Order Generative Adversarial Networks. arXiv, 2018.\n\n[19] C\u00e9dric Villani. Optimal transport: old and new, volume 338. Springer Science & Business\n\nMedia, 2008.\n\n[20] Xiang Wei, Zixia Liu, Liqiang Wang, and Boqing Gong. Improving the Improved Training of\n\nWasserstein GANs. International Conference on Learning Representations (ICLR), 2018.\n\n[21] Junbo Jake Zhao, Micha\u00ebl Mathieu, and Yann LeCun. Energy-based Generative Adversarial\n\nNetwork. International Conference on Learning Representations (ICLR), 2017.\n\n10\n\n\f", "award": [], "sourceid": 3402, "authors": [{"given_name": "Jonas", "family_name": "Adler", "institution": "KTH - Royal Institute of Technology"}, {"given_name": "Sebastian", "family_name": "Lunz", "institution": "University of Cambridge"}]}