{"title": "Approximation and Convergence Properties of Generative Adversarial Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 5545, "page_last": 5553, "abstract": "Generative adversarial networks (GAN) approximate a target data distribution by jointly optimizing an objective function through a \"two-player game\" between a generator and a discriminator.  Despite their empirical success, however, two very basic questions on how well they can approximate the target distribution remain unanswered. First, it is not known how restricting the discriminator family affects the approximation quality. Second, while a number of different objective functions have been proposed, we do not understand when convergence to the global minima of the objective function leads to convergence to the target distribution under various notions of distributional convergence.   In this paper, we address these questions in a broad and unified setting by defining a notion of adversarial divergences that includes a number of recently proposed objective functions. We show that if the objective function is an adversarial divergence with some additional conditions, then using a restricted discriminator family has a moment-matching effect. Additionally, we show that for objective functions that are strict adversarial divergences, convergence in the objective function implies weak convergence, thus generalizing previous results.", "full_text": "Approximation and Convergence Properties of\n\nGenerative Adversarial Learning\n\nShuang Liu\n\nUniversity of California, San Diego\n\nOlivier Bousquet\n\nGoogle Brain\n\nshuangliu@ucsd.edu\n\nobousquet@google.com\n\nKamalika Chaudhuri\n\nUniversity of California, San Diego\n\nkamalika@cs.ucsd.edu\n\nAbstract\n\nGenerative adversarial networks (GAN) approximate a target data distribution by\njointly optimizing an objective function through a \"two-player game\" between a\ngenerator and a discriminator. 
Despite their empirical success, however, two very basic questions on how well they can approximate the target distribution remain unanswered. First, it is not known how restricting the discriminator family affects the approximation quality. Second, while a number of different objective functions have been proposed, we do not understand when convergence to the global minima of the objective function leads to convergence to the target distribution under various notions of distributional convergence.

In this paper, we address these questions in a broad and unified setting by defining a notion of adversarial divergences that includes a number of recently proposed objective functions. We show that if the objective function is an adversarial divergence with some additional conditions, then using a restricted discriminator family has a moment-matching effect. Additionally, we show that for objective functions that are strict adversarial divergences, convergence in the objective function implies weak convergence, thus generalizing previous results.

1 Introduction

Generative adversarial networks (GANs) have attracted an enormous amount of recent attention in machine learning. In a generative adversarial network, the goal is to produce an approximation to a target data distribution $\mu$ from which only samples are available. This is done iteratively via two components – a generator and a discriminator, which are usually implemented by neural networks. The generator takes in random (usually Gaussian or uniform) noise as input and attempts to transform it to match the target distribution $\mu$; the discriminator aims to accurately discriminate between samples from the target distribution and those produced by the generator. 
Estimation proceeds by iteratively refining the generator and the discriminator to optimize an objective function until the target distribution is indistinguishable from the distribution induced by the generator. The practical success of GANs has led to a large volume of recent literature on variants with many desirable properties; examples are the f-GAN [10], the MMD-GAN [5, 9], and the Wasserstein-GAN [2], among many others.

In spite of their enormous practical success, unlike more traditional methods such as maximum likelihood inference, GANs are theoretically rather poorly understood. In particular, two very basic questions on how well they can approximate the target distribution $\mu$, even in the presence of a very large number of samples and perfect optimization, remain largely unanswered.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

The first relates to the role of the discriminator in the quality of the approximation. In practice, the discriminator is usually restricted to belong to some family, and it is not understood in what sense this restriction affects the distribution output by the generator. The second question relates to convergence; different variants of GANs have been proposed that involve different objective functions (to be optimized by the generator and the discriminator). However, it is not understood under what conditions minimizing the objective function leads to a good approximation of the target distribution. More precisely, does a sequence of distributions output by the generator that converges to the global minimum under the objective function always converge to the target distribution $\mu$ under some standard notion of distributional convergence?

In this work, we consider these two questions in a broad setting. 
We first characterize a very general class of objective functions that we call adversarial divergences, and we show that they capture the objective functions used by a variety of existing procedures, including the original GAN [7], f-GAN [10], MMD-GAN [5, 9], WGAN [2], improved WGAN [8], as well as a class of entropic regularized optimal transport problems [6]. We then define the class of strict adversarial divergences – a subclass of adversarial divergences where the minimizer of the objective function is uniquely the target distribution. This characterization allows us to address the two questions above in a unified setting, and to translate the results to an entire class of GANs with little effort.

First, we address the role of the discriminator in the approximation in Section 4. We show that if the objective function is an adversarial divergence that obeys certain conditions, then using a restricted class of discriminators has the effect of matching generalized moments. A concrete consequence of this result is that in linear f-GANs, where the discriminator family is the set of all affine functions over a vector of feature maps $\psi$ and the objective function is an f-GAN objective, the optimal distribution $\nu$ output by the GAN will satisfy $\mathbb{E}_{x \sim \mu}[\psi(x)] = \mathbb{E}_{x \sim \nu}[\psi(x)]$ regardless of the specific f-divergence chosen in the objective function. Furthermore, we show that a neural network GAN is just a supremum of linear GANs, and therefore has the same moment-matching effect.

We next address convergence in Section 5. We show that convergence in an adversarial divergence implies some standard notion of topological convergence. In particular, we show that provided an objective function is a strict adversarial divergence, convergence to $\mu$ in the objective function implies weak convergence of the output distribution to $\mu$. 
While convergence properties of some isolated objective functions were known before [2], this result extends them to a broad class of GANs. An additional consequence of this result is the observation that, as the Wasserstein distance metrizes weak convergence of probability distributions (see e.g. [14]), Wasserstein-GANs have the weakest objective functions in the class of strict adversarial divergences (footnote 1).

Footnote 1: Weakness is actually a desirable property, since it prevents the divergence from being too discriminative (saturating), thus providing more information about how to modify the model to approximate the true distribution.

2 Notations

We use bold constants (e.g., $\mathbf{0}$, $\mathbf{1}$, $\mathbf{x_0}$) to denote constant functions. We denote by $f \circ g$ the composition of the functions $f$ and $g$. We denote by $Y^X$ the set of functions from the set $X$ to the set $Y$. We denote by $\mu \otimes \nu$ the product measure of $\mu$ and $\nu$. We denote by $\mathrm{int}(X)$ the interior of the set $X$. We denote by $\mathbb{E}_{\mu}[f]$ the integral of $f$ with respect to the measure $\mu$.

Let $f : \mathbb{R} \to \mathbb{R} \cup \{+\infty\}$ be a convex function. We denote by $\mathrm{dom}\, f$ the effective domain of $f$, that is, $\mathrm{dom}\, f = \{x \in \mathbb{R} : f(x) < +\infty\}$, and we denote by $f^*$ the convex conjugate of $f$, that is, $f^*(x^*) = \sup_{x \in \mathbb{R}} \{x^* \cdot x - f(x)\}$.

For a topological space $\Omega$, we denote by $C(\Omega)$ the set of continuous functions on $\Omega$, by $C_b(\Omega)$ the set of bounded continuous functions on $\Omega$, by $\mathrm{rca}(\Omega)$ the set of finite signed regular Borel measures on $\Omega$, and by $\mathcal{P}(\Omega)$ the set of probability measures on $\Omega$.

Given a non-empty subspace $Y$ of a topological space $X$, denote by $X/Y$ the quotient space equipped with the quotient topology $\sim_Y$, where for any $a, b \in X$, $a \sim_Y b$ if and only if $a = b$ or $a, b$ both belong to $Y$. The equivalence class of each element $a \in X$ is denoted by $[a] = \{b : a \sim_Y b\}$.

3 General Framework

Let $\mu$ be the target data distribution from which we can draw samples. 
Our goal is to find a generative model $\nu$ to approximate $\mu$. Informally, most GAN-style algorithms model this approximation as solving the following problem:
$$\inf_{\nu} \sup_{f \in \mathcal{F}} \mathbb{E}_{x \sim \mu,\, y \sim \nu}[f(x, y)],$$
where $\mathcal{F}$ is a class of functions. The process is usually considered adversarial in the sense that it can be thought of as a two-player minimax game, where a generator $\nu$ is trying to mimic the true distribution $\mu$, and an adversary $f$ is trying to distinguish between the true and generated distributions. However, another way to look at it is as the minimization of the following objective function:
$$\nu \mapsto \sup_{f \in \mathcal{F}} \mathbb{E}_{x \sim \mu,\, y \sim \nu}[f(x, y)]. \quad (1)$$
This objective function measures how far the target distribution $\mu$ is from the current estimate $\nu$. Hence, minimizing this function can lead to a good approximation of the target distribution $\mu$.

This leads us to the concept of an adversarial divergence.

Definition 1 (Adversarial divergence). Let $X$ be a topological space and $\mathcal{F} \subseteq C_b(X^2)$, $\mathcal{F} \neq \emptyset$. An adversarial divergence $\tau$ over $X$ is a function
$$\mathcal{P}(X) \times \mathcal{P}(X) \to \mathbb{R} \cup \{+\infty\}, \qquad (\mu, \nu) \mapsto \tau(\mu \| \nu) = \sup_{f \in \mathcal{F}} \mathbb{E}_{\mu \otimes \nu}[f]. \quad (2)$$

Observe that in Definition 1, if we have a fixed target distribution $\mu$, then (2) reduces to the objective function (1). Also, notice that because $\tau$ is the supremum of a family of linear functions (in each of the variables $\mu$ and $\nu$ separately), it is convex in each of its variables.

Definition 1 captures the objective functions used by a variety of existing GAN-style procedures. In practice, although the function class $\mathcal{F}$ can be complicated, it is usually a transformation of a simple function class $\mathcal{V}$, which is the set of discriminators or critics, as they have been called in the GAN literature. 
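As an illustration of Definition 1 (our sketch, not part of the paper), when $\mathcal{F}$ is a small finite family the supremum can be estimated by Monte Carlo: draw paired samples from $\mu \otimes \nu$ and keep the best empirical mean. The polynomial critics below are a hypothetical choice, used only to make the sketch concrete.

```python
import numpy as np

def adversarial_divergence(x, y, family):
    """Monte-Carlo sketch of tau(mu||nu) = sup_{f in F} E_{mu (x) nu}[f]
    for a finite family F of functions f(x, y) -> R, using paired samples."""
    return max(float(np.mean(f(x, y))) for f in family)

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 20_000)    # samples from the target mu
y = rng.normal(-0.5, 1.0, 20_000)   # samples from a model nu

# IPM-style critics f(x, y) = v(x) - v(y) with v(s) = s**k (hypothetical choice).
family = [lambda a, b, k=k: a ** k - b ** k for k in (1, 2, 3)]
divergence = adversarial_divergence(x, y, family)
```

For critics of the form $f(x, y) = v(x) - v(y)$, the estimate is the largest gap between generalized moments of $\mu$ and $\nu$ over the family, and it is exactly 0 when the two sample sets coincide.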
We give some examples by specifying $\mathcal{F}$ and $\mathcal{V}$ for each objective function.

(a) GAN [7].
$\mathcal{F} = \{(x, y) \mapsto \log(u(x)) + \log(1 - u(y)) : u \in \mathcal{V}\}$,
$\mathcal{V} = (0, 1)^X \cap C_b(X)$.

(b) f-GAN [10]. Let $f : \mathbb{R} \to \mathbb{R} \cup \{+\infty\}$ be a convex lower semi-continuous function. Assume $f^*(x) \ge x$ for any $x \in \mathbb{R}$, $f^*$ is continuously differentiable on $\mathrm{int}(\mathrm{dom}\, f^*)$, and there exists $x_0 \in \mathrm{int}(\mathrm{dom}\, f^*)$ such that $f^*(x_0) = x_0$.
$\mathcal{F} = \{(x, y) \mapsto v(x) - f^*(v(y)) : v \in \mathcal{V}\}$,
$\mathcal{V} = (\mathrm{dom}\, f^*)^X \cap C_b(X)$.

(c) MMD-GAN [5, 9]. Let $k : X^2 \to \mathbb{R}$ be a universal reproducing kernel. Let $M$ be the set of signed measures on $X$.
$\mathcal{F} = \{(x, y) \mapsto v(x) - v(y) : v \in \mathcal{V}\}$,
$\mathcal{V} = \{x \mapsto \mathbb{E}_{\mu}[k(x, \cdot)] : \mu \in M,\ \mathbb{E}_{\mu \otimes \mu}[k] \le 1\}$.

(d) Wasserstein-GAN (WGAN) [2]. Assume $X$ is a metric space.
$\mathcal{F} = \{(x, y) \mapsto v(x) - v(y) : v \in \mathcal{V}\}$,
$\mathcal{V} = \{v \in C_b(X) : \|v\|_{\mathrm{Lip}} \le K\}$,
where $K$ is a positive constant and $\|\cdot\|_{\mathrm{Lip}}$ denotes the Lipschitz constant.

(e) WGAN-GP (Improved WGAN) [8]. Assume $X$ is a convex subset of a Euclidean space.
$\mathcal{F} = \{(x, y) \mapsto v(x) - v(y) - \eta\, \mathbb{E}_{t \sim U}[(\|\nabla v(tx + (1 - t)y)\|_2 - 1)^p] : v \in \mathcal{V}\}$,
$\mathcal{V} = C^1(X)$,
where $U$ is the uniform distribution on $[0, 1]$, $\eta$ is a positive constant, and $p \in (1, \infty)$.

(f) (Regularized) Optimal Transport [6] (footnote 2). Let $c : X^2 \to \mathbb{R}$ be some transportation cost function, and let $\epsilon \ge 0$ be the strength of the regularization. If $\epsilon = 0$ (no regularization), then
$\mathcal{F} = \{(x, y) \mapsto u(x) + v(y) : (u, v) \in \mathcal{V}\}, \quad (3)$
$\mathcal{V} = \{(u, v) \in C_b(X) \times C_b(X) : u(x) + v(y) \le c(x, y) \text{ for any } x, y \in X\}$;
if $\epsilon > 0$, then
$\mathcal{F} = \{(x, y) \mapsto u(x) + v(y) - \epsilon \exp((u(x) + v(y) - c(x, y))/\epsilon) : u, v \in \mathcal{V}\}, \quad (4)$
$\mathcal{V} = C_b(X)$.

In order to study an adversarial divergence $\tau$, it is critical to first understand at which points the divergence is minimized. More precisely, let $\tau$ be an adversarial divergence and $\mu^*$ be the target probability measure. 
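As a concrete numerical instance of example (c) (an illustrative sketch with an assumed Gaussian kernel and bandwidth, not the paper's code), the squared MMD can be estimated from samples with a plug-in V-statistic:

```python
import numpy as np

def mmd_squared(x, y, bandwidth=1.0):
    """Plug-in (V-statistic) estimate of the squared MMD between samples
    x ~ mu and y ~ nu, with a Gaussian kernel as in example (c)."""
    def gram(a, b):
        return np.exp(-((a[:, None] - b[None, :]) ** 2) / (2.0 * bandwidth ** 2))
    return gram(x, x).mean() + gram(y, y).mean() - 2.0 * gram(x, y).mean()

rng = np.random.default_rng(1)
same = mmd_squared(rng.normal(0, 1, 500), rng.normal(0, 1, 500))  # nu close to mu
diff = mmd_squared(rng.normal(0, 1, 500), rng.normal(2, 1, 500))  # nu far from mu
```

Because the Gaussian kernel is universal, on a compact metric space the resulting MMD is a metric on $\mathcal{P}(X)$, which is why example (c) appears again below as a strict adversarial divergence.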
We are interested in the set of probability measures that minimize the divergence $\tau$ when the first argument of $\tau$ is set to $\mu^*$, i.e., the set $\arg\min \tau(\mu^* \| \cdot) = \{\mu : \tau(\mu^* \| \mu) = \inf_{\nu} \tau(\mu^* \| \nu)\}$. Formally, we define the set $\mathrm{OPT}_{\tau, \mu^*}$ as follows.

Definition 2 ($\mathrm{OPT}_{\tau, \mu^*}$). Let $\tau$ be an adversarial divergence over a topological space $X$, and $\mu^* \in \mathcal{P}(X)$. Define $\mathrm{OPT}_{\tau, \mu^*}$ to be the set of probability measures that minimize the function $\tau(\mu^* \| \cdot)$. That is,
$$\mathrm{OPT}_{\tau, \mu^*} \triangleq \left\{\mu \in \mathcal{P}(X) : \tau(\mu^* \| \mu) = \inf_{\mu' \in \mathcal{P}(X)} \tau(\mu^* \| \mu')\right\}.$$

Ideally, the target probability measure $\mu^*$ should be the one and only one that minimizes the objective function. The notion of a strict adversarial divergence captures this property.

Definition 3 (Strict adversarial divergence). Let $\tau$ be an adversarial divergence over a topological space $X$. We call $\tau$ a strict adversarial divergence if for any $\mu^* \in \mathcal{P}(X)$, $\mathrm{OPT}_{\tau, \mu^*} = \{\mu^*\}$.

For example, if the underlying space $X$ is a compact metric space, then examples (c) and (d) induce metrics on $\mathcal{P}(X)$ (see, e.g., [12]), and are therefore strict adversarial divergences.

In the next two sections, we answer two questions regarding the set $\mathrm{OPT}_{\tau, \mu^*}$: how well do the elements of $\mathrm{OPT}_{\tau, \mu^*}$ approximate the target distribution $\mu^*$ when we restrict the class of discriminators? (Section 4); and does a sequence of distributions that converges in an adversarial divergence also converge to $\mathrm{OPT}_{\tau, \mu^*}$ under some standard notion of distributional convergence? (Section 5)

4 Generalized Moment Matching

To motivate the discussion in this section, recall example (b) in Section 3. 
It can be shown that under some mild conditions, $\tau$, the objective function of the f-GAN, is actually the f-divergence, and the minimizer of $\tau(\mu^* \| \cdot)$ is only $\mu^*$ [10]. However, in practice, the discriminator class $\mathcal{V}$ is usually implemented by a feedforward neural network, and it is known that a fixed neural network has limited capacity (e.g., it cannot implement the set of all bounded continuous functions). Therefore, one could ask what happens if we restrict $\mathcal{V}$ to a subclass $\mathcal{V}_0$. Obviously, one would expect $\mu^*$ to no longer be the unique minimizer of $\tau(\mu^* \| \cdot)$; that is, $\mathrm{OPT}_{\tau, \mu^*}$ contains elements other than $\mu^*$. What can we say about the elements of $\mathrm{OPT}_{\tau, \mu^*}$ now? Are all of them close to $\mu^*$ in a certain sense? In this section we answer these questions.

More formally, we consider $\mathcal{F} = \{m_\theta - r_\theta : \theta \in \Theta\}$ to be a function class indexed by a set $\Theta$. We can think of $\Theta$ as the parameter set of a feedforward neural network. Each $m_\theta$ is thought of as a matching between two distributions, in the sense that $\mu$ and $\nu$ are matched under $m_\theta$ if and only if $\mathbb{E}_{\mu \otimes \nu}[m_\theta] = 0$. In particular, if each $m_\theta$ corresponds to some function $v_\theta$ such that $m_\theta(x, y) = v_\theta(x) - v_\theta(y)$, then $\mu$ and $\nu$ are matched under $m_\theta$ if and only if some generalized moments of $\mu$ and $\nu$ are equal: $\mathbb{E}_{\mu}[v_\theta] = \mathbb{E}_{\nu}[v_\theta]$. Each $r_\theta$ can be thought of as a residual.

We will now relate the matching condition to the optimality of the divergence. In particular, define
$$\mathcal{M}_{\mu^*} \triangleq \{\mu : \forall \theta \in \Theta,\ \mathbb{E}_{\mu^*}[v_\theta] = \mathbb{E}_{\mu}[v_\theta]\}.$$
We will give sufficient conditions for members of $\mathcal{M}_{\mu^*}$ to be in $\mathrm{OPT}_{\tau, \mu^*}$.

Footnote 2: To the best of our knowledge, neither (3) nor (4) was used in any GAN algorithm. However, since our focus in this paper is not implementing new algorithms, we leave experiments with this formulation for future work.

Theorem 4. Let $X$ be a topological space, $\Theta \subseteq \mathbb{R}^n$, $\mathcal{V} = \{v_\theta \in C_b(X) : \theta \in \Theta\}$, and $\mathcal{R} = \{r_\theta \in C_b(X^2) : \theta \in \Theta\}$. Let $m_\theta(x, y) = v_\theta(x) - v_\theta(y)$. If there exists $c \in \mathbb{R}$ such that for any $\mu, \nu \in \mathcal{P}(X)$, $\inf_{\theta \in \Theta} \mathbb{E}_{\mu \otimes \nu}[r_\theta] = c$ and there exists some $\theta^\nu_\mu \in \Theta$ such that $\mathbb{E}_{\mu \otimes \nu}[r_{\theta^\nu_\mu}] = c$ and $\mathbb{E}_{\mu \otimes \nu}[m_{\theta^\nu_\mu}] \ge 0$, then $\tau(\mu \| \nu) = \sup_{\theta \in \Theta} \mathbb{E}_{\mu \otimes \nu}[m_\theta - r_\theta]$ is an adversarial divergence over $X$ and for any $\mu^* \in \mathcal{P}(X)$, $\mathrm{OPT}_{\tau, \mu^*} \supseteq \mathcal{M}_{\mu^*}$.

We now review examples (a)-(e) in Section 3, show how to write each $f \in \mathcal{F}$ as $m_\theta - r_\theta$, and specify $\theta^\nu_\mu$ in each case such that the conditions of Theorem 4 are satisfied.

(a) GAN. Note that for any $x \in (0, 1)$, $\log(1/(x(1 - x))) \ge \log 4$. Write
$$f_\theta(x, y) = \log(u_\theta(x)) + \log(1 - u_\theta(y)) = m_\theta(x, y) - r_\theta(x, y), \quad (5)$$
where $m_\theta(x, y) = \log(u_\theta(x)) - \log(u_\theta(y))$ and $r_\theta(x, y) = \log(1/(u_\theta(y)(1 - u_\theta(y))))$. Let $u_{\theta^\nu_\mu} = \mathbf{\frac{1}{2}}$; then $\mathbb{E}_{\mu \otimes \nu}[m_{\theta^\nu_\mu}] = 0$ and $r_\theta(x, y) \ge r_{\theta^\nu_\mu}(x, y) = \log 4$.

(b) f-GAN. Recall that $f^*(x) - x \ge 0$ for any $x \in \mathbb{R}$ and $f^*(x_0) = x_0$. Write
$$f_\theta(x, y) = v_\theta(x) - f^*(v_\theta(y)) = m_\theta(x, y) - r_\theta(x, y),$$
where $m_\theta(x, y) = v_\theta(x) - v_\theta(y)$ and $r_\theta(x, y) = f^*(v_\theta(y)) - v_\theta(y)$. Let $v_{\theta^\nu_\mu} = \mathbf{x_0}$; then $\mathbb{E}_{\mu \otimes \nu}[m_{\theta^\nu_\mu}] = 0$ and $r_\theta(x, y) \ge r_{\theta^\nu_\mu}(x, y) = 0$.

(c, d) MMD-GAN or Wasserstein-GAN. Write $f_\theta(x, y) = m_\theta(x, y) - r_\theta(x, y)$, where $m_\theta(x, y) = v_\theta(x) - v_\theta(y)$ and $r_\theta = 0$. Let $v_{\theta^\nu_\mu} = \mathbf{0}$; then $\mathbb{E}_{\mu \otimes \nu}[m_{\theta^\nu_\mu}] = 0$ and $r_\theta(x, y) = r_{\theta^\nu_\mu}(x, y) = 0$.

(e) WGAN-GP. Note that the gradient penalty term is nonnegative. Write
$$f_\theta(x, y) = m_\theta(x, y) - r_\theta(x, y),$$
where $m_\theta(x, y) = v_\theta(x) - v_\theta(y)$ and $r_\theta(x, y) = \eta\, \mathbb{E}_{t \sim U}[(\|\nabla v_\theta(tx + (1 - t)y)\|_2 - 1)^p]$. Let
$$v_{\theta^\nu_\mu} = \begin{cases} (x_1, x_2, \cdots, x_n) \mapsto \sum_{i=1}^n x_i / \sqrt{n}, & \text{if } \mathbb{E}_{\mu}[\sum_{i=1}^n x_i] \ge \mathbb{E}_{\nu}[\sum_{i=1}^n x_i], \\ (x_1, x_2, \cdots, x_n) \mapsto -\sum_{i=1}^n x_i / \sqrt{n}, & \text{otherwise;} \end{cases}$$
then $\mathbb{E}_{\mu \otimes \nu}[m_{\theta^\nu_\mu}] \ge 0$ and $r_\theta(x, y) \ge r_{\theta^\nu_\mu}(x, y) = 0$.

We now refine the previous result and show that under some additional conditions on $m_\theta$ and $r_\theta$, the optimal elements of $\tau$ are fully characterized by the matching condition, i.e., $\mathrm{OPT}_{\tau, \mu^*} = \mathcal{M}_{\mu^*}$.

Theorem 5. Under the assumptions of Theorem 4, suppose $\theta^\nu_\mu \in \mathrm{int}(\Theta)$, both $\theta \mapsto \mathbb{E}_{\mu \otimes \nu}[m_\theta]$ and $\theta \mapsto \mathbb{E}_{\mu \otimes \nu}[r_\theta]$ have gradients at $\theta^\nu_\mu$, and
$$\left(\mathbb{E}_{\mu \otimes \nu}[m_{\theta^\nu_\mu}] = 0 \text{ and } \exists \theta_0,\ \mathbb{E}_{\mu \otimes \nu}[m_{\theta_0}] \neq 0\right) \implies \nabla_\theta\, \mathbb{E}_{\mu \otimes \nu}[m_\theta]\big|_{\theta = \theta^\nu_\mu} \neq 0. \quad (6)$$
Then for any $\mu^* \in \mathcal{P}(X)$, $\mathrm{OPT}_{\tau, \mu^*} = \mathcal{M}_{\mu^*}$.

We remark that Theorem 4 is relatively intuitive, while Theorem 5 requires extra conditions and is quite counter-intuitive, especially for algorithms like f-GANs.

4.1 Example: Linear f-GAN

We first consider a simple algorithm called the linear f-GAN. Suppose we are provided with a feature map $\psi$ that maps each point $x$ in the sample space $X$ to a feature vector $(\psi_1(x), \psi_2(x), \cdots, \psi_n(x))$, where each $\psi_i \in C_b(X)$. 
We are satisfied that any distribution $\mu$ is a good approximation of the target distribution $\mu^*$ as long as $\mathbb{E}_{\mu^*}[\psi] = \mathbb{E}_{\mu}[\psi]$. For example, if $X \subseteq \mathbb{R}$ and $\psi_k(x) = x^k$, then $\mathbb{E}_{\mu^*}[\psi] = \mathbb{E}_{\mu}[\psi]$ says that the first $n$ moments of $\mu^*$ and $\mu$ are matched. Recall that in the standard f-GAN (example (b) in Section 3), $\mathcal{V} = (\mathrm{dom}\, f^*)^X \cap C_b(X)$. Now, instead of using the discriminator class $\mathcal{V}$, we use a restricted discriminator class $\mathcal{V}_0 \subseteq \mathcal{V}$ containing the linear (or more precisely, affine) transformations of $\psi$ – the set $\mathcal{V}_0 = \{\theta^{\mathrm{T}}(\psi, 1) : \theta \in \Theta\} \subseteq \mathcal{V}$, where $\Theta = \{\theta \in \mathbb{R}^{n+1} : \forall x \in X,\ \theta^{\mathrm{T}}(\psi(x), 1) \in \mathrm{dom}\, f^*\}$. We will show that now $\mathrm{OPT}_{\tau, \mu^*}$ contains exactly those $\mu$ such that $\mathbb{E}_{\mu^*}[\psi] = \mathbb{E}_{\mu}[\psi]$, regardless of the specific $f$ chosen. Formally,

Corollary 6 (linear f-GAN). Let $X$ be a compact topological space. Let $f$ be a function as defined in example (b) of Section 3. Let $\psi = (\psi_i)_{i=1}^n$ be a vector of continuously differentiable functions on $X$. Let $\Theta = \{\theta \in \mathbb{R}^{n+1} : \forall x \in X,\ \theta^{\mathrm{T}}(\psi(x), 1) \in \mathrm{dom}\, f^*\}$. Let $\tau$ be the objective function of the linear f-GAN,
$$\tau(\mu \| \nu) = \sup_{\theta \in \Theta} \left(\mathbb{E}_{\mu}[\theta^{\mathrm{T}}(\psi, 1)] - \mathbb{E}_{\nu}[f^* \circ (\theta^{\mathrm{T}}(\psi, 1))]\right).$$
Then for any $\mu^* \in \mathcal{P}(X)$, $\mathrm{OPT}_{\tau, \mu^*} = \{\mu : \tau(\mu^* \| \mu) = 0\} = \{\mu : \mathbb{E}_{\mu^*}[\psi] = \mathbb{E}_{\mu}[\psi]\} \ni \mu^*$.

A very concrete example of Corollary 6 is the linear KL-GAN, where $f(u) = u \log u$, $f^*(t) = \exp(t - 1)$, $\psi = (\psi_i)_{i=1}^n$, and $\Theta = \mathbb{R}^{n+1}$. The objective function is
$$\tau(\mu \| \nu) = \sup_{\theta \in \mathbb{R}^{n+1}} \left(\mathbb{E}_{\mu}[\theta^{\mathrm{T}}(\psi, 1)] - \mathbb{E}_{\nu}[\exp(\theta^{\mathrm{T}}(\psi, 1) - 1)]\right).$$

4.2 Example: Neural Network f-GAN

Next we consider a more general and practical example: an f-GAN where the discriminator class $\mathcal{V}_0 = \{v_\theta : \theta \in \Theta\}$ is implemented by a feedforward neural network with weight parameter set $\Theta$. We assume that all the activation functions are continuously differentiable (e.g., sigmoid, tanh), and that the last layer of the network is a linear transformation plus a bias. We also assume $\mathrm{dom}\, f^* = \mathbb{R}$ (e.g., the KL-GAN, where $f^*(t) = \exp(t - 1)$).

Now observe that when all the weights before the last layer are fixed, the last layer acts as a discriminator in a linear f-GAN. More precisely, let $\Theta_{\mathrm{pre}}$ be the index set for the weights before the last layer. Then each $\theta_{\mathrm{pre}} \in \Theta_{\mathrm{pre}}$ corresponds to a feature map $\psi_{\theta_{\mathrm{pre}}}$. Letting $\tau_{\theta_{\mathrm{pre}}}$ be the linear f-GAN that corresponds to $\theta_{\mathrm{pre}}$, the adversarial divergence induced by the neural network f-GAN is
$$\tau(\mu^* \| \mu) = \sup_{\theta_{\mathrm{pre}} \in \Theta_{\mathrm{pre}}} \tau_{\theta_{\mathrm{pre}}}(\mu^* \| \mu).$$
Clearly $\mathrm{OPT}_{\tau, \mu^*} \supseteq \bigcap_{\theta_{\mathrm{pre}} \in \Theta_{\mathrm{pre}}} \mathrm{OPT}_{\tau_{\theta_{\mathrm{pre}}}, \mu^*}$. For the other direction, note that by Corollary 6, for any $\theta_{\mathrm{pre}} \in \Theta_{\mathrm{pre}}$, $\tau_{\theta_{\mathrm{pre}}}(\mu^* \| \mu) \ge 0$ and $\tau_{\theta_{\mathrm{pre}}}(\mu^* \| \mu^*) = 0$. Therefore $\tau(\mu^* \| \mu) \ge 0$ and $\tau(\mu^* \| \mu^*) = 0$. If $\mu \in \mathrm{OPT}_{\tau, \mu^*}$, then $\tau(\mu^* \| \mu) = 0$. As a consequence, $\tau_{\theta_{\mathrm{pre}}}(\mu^* \| \mu) = 0$ for any $\theta_{\mathrm{pre}} \in \Theta_{\mathrm{pre}}$. Therefore $\mathrm{OPT}_{\tau, \mu^*} \subseteq \bigcap_{\theta_{\mathrm{pre}} \in \Theta_{\mathrm{pre}}} \mathrm{OPT}_{\tau_{\theta_{\mathrm{pre}}}, \mu^*}$. 
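The moment-matching characterization of Corollary 6 can be checked numerically for the linear KL-GAN above. The sketch below is our illustration, not the paper's code: it maximizes the objective over $\theta$ by gradient ascent (the objective is concave in $\theta$), with the assumed feature map $\psi(x) = (x, x^2)$. For $\mu^* = N(0, 1)$ and $\nu = N(1, 1)$, $\log(d\mu^*/d\nu)$ is affine in these features, so the restricted supremum should recover $\mathrm{KL}(\mu^* \| \nu) = 0.5$; when the first two moments match, the value is 0.

```python
import numpy as np

def linear_kl_gan(x, y, iters=5000, lr=0.05):
    """Estimate the linear KL-GAN objective
        tau(mu||nu) = sup_theta E_mu[theta^T (psi,1)] - E_nu[exp(theta^T (psi,1) - 1)]
    with feature map psi(s) = (s, s^2), by gradient ascent on theta.
    The objective is concave in theta, so plain gradient ascent suffices."""
    feats = lambda s: np.stack([s, s ** 2, np.ones_like(s)], axis=1)  # (psi, 1)
    px, py = feats(x), feats(y)
    theta = np.zeros(3)
    for _ in range(iters):
        w = np.exp(py @ theta - 1.0)  # exp(theta^T (psi,1) - 1) on nu's samples
        theta += lr * (px.mean(axis=0) - (py * w[:, None]).mean(axis=0))
    return (px @ theta).mean() - np.exp(py @ theta - 1.0).mean()

rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.0, 10_000)        # samples from mu* = N(0, 1)
matched = linear_kl_gan(x, x)           # first two moments match -> value ~ 0
mismatched = linear_kl_gan(x, x + 1.0)  # nu = N(1, 1) -> value ~ KL = 0.5
```

Gradient ascent is used instead of a library optimizer only to keep the sketch dependency-free; any concave maximizer would do.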
Therefore, by Corollary 6,
$$\mathrm{OPT}_{\tau, \mu^*} = \bigcap_{\theta_{\mathrm{pre}} \in \Theta_{\mathrm{pre}}} \mathrm{OPT}_{\tau_{\theta_{\mathrm{pre}}}, \mu^*} = \{\mu : \forall \theta \in \Theta,\ \mathbb{E}_{\mu^*}[v_\theta] = \mathbb{E}_{\mu}[v_\theta]\}.$$
That is, the minimizers of the neural network f-GAN are exactly those distributions that are indistinguishable under the expectation of any discriminator network $v_\theta$.

5 Convergence

To motivate the discussion in this section, consider the following question. Let $\delta_{x_0}$ be the delta distribution at $x_0 \in \mathbb{R}$; that is, $x = x_0$ with probability 1. Now, does the sequence of delta distributions $\delta_{1/n}$ converge to $\delta_1$? Almost everyone would answer no. However, does the sequence of delta distributions $\delta_{1/n}$ converge to $\delta_0$? Most people would answer yes, based on the intuition that $1/n \to 0$ and so does the sequence of corresponding delta distributions, even though the support of $\delta_{1/n}$ never intersects the support of $\delta_0$. Convergence of distributions can therefore be defined not only in a pointwise way, but also in a way that takes into account the underlying structure of the sample space.

Now return to our adversarial divergence framework. Given an adversarial divergence $\tau$, is it possible that $\tau(\delta_1 \| \delta_{1/n})$ converges to the global minimum of $\tau(\delta_1 \| \cdot)$? How do we define convergence to a set of points instead of a single point, in order to explain the convergence behaviour of any adversarial divergence? In this section we answer these questions.

We start from two standard notions from functional analysis.

Definition 7 (Weak-* topology on $\mathcal{P}(X)$ (see e.g. [11])). Let $X$ be a compact metric space. By associating with each $\mu \in \mathrm{rca}(X)$ the linear functional $f \mapsto \mathbb{E}_{\mu}[f]$ on $C(X)$, we have that $\mathrm{rca}(X)$ is the continuous dual of $C(X)$ with respect to the uniform norm on $C(X)$ (see e.g. [4]). Therefore we can equip $\mathrm{rca}(X)$ (and hence $\mathcal{P}(X)$) with the weak-* topology, which is the coarsest topology on $\mathrm{rca}(X)$ such that $\{\mu \mapsto \mathbb{E}_{\mu}[f] : f \in C(X)\}$ is a set of continuous linear functionals on $\mathrm{rca}(X)$.

Definition 8 (Weak convergence of probability measures (see e.g. [11])). Let $X$ be a compact metric space. A sequence of probability measures $(\mu_n)$ in $\mathcal{P}(X)$ is said to weakly converge to a measure $\mu^* \in \mathcal{P}(X)$ if $\forall f \in C(X)$, $\mathbb{E}_{\mu_n}[f] \to \mathbb{E}_{\mu^*}[f]$; or, equivalently, if $(\mu_n)$ is weak-* convergent to $\mu^*$.

The definitions of the weak-* topology and weak convergence respect the topological structure of the sample space. For example, it is easy to check that the sequence of delta distributions $\delta_{1/n}$ weakly converges to $\delta_0$, but not to $\delta_1$.

Now note that Definition 8 only defines weak convergence of a sequence of probability measures to a single target measure. Here we generalize the definition from a single target measure to a set of target measures through the quotient topology as follows.

Definition 9 (Weak convergence of probability measures to a set). Let $X$ be a compact metric space, equip $\mathcal{P}(X)$ with the weak-* topology, and let $A$ be a non-empty subspace of $\mathcal{P}(X)$. A sequence of probability measures $(\mu_n)$ in $\mathcal{P}(X)$ is said to weakly converge to the set $A$ if $([\mu_n])$ converges to $A$ in the quotient space $\mathcal{P}(X)/A$.

With everything properly defined, we are now ready to state our convergence result. Note that an adversarial divergence is not necessarily a metric, and therefore does not necessarily induce a topology. However, convergence in an adversarial divergence can still imply some type of topological convergence. More precisely, we show a convergence result that holds for any adversarial divergence $\tau$ as long as the sample space is a compact metric space. 
Informally, we show that for any target probability measure, if $\tau(\mu^* \| \mu_n)$ converges to the global minimum of $\tau(\mu^* \| \cdot)$, then $(\mu_n)$ weakly converges to the set of measures that achieve the global minimum. Formally,

Theorem 10. Let $X$ be a compact metric space, $\tau$ an adversarial divergence over $X$, and $\mu^* \in \mathcal{P}(X)$; then $\mathrm{OPT}_{\tau, \mu^*} \neq \emptyset$. Let $(\mu_n)$ be a sequence of probability measures in $\mathcal{P}(X)$. If $\tau(\mu^* \| \mu_n) \to \inf_{\mu'} \tau(\mu^* \| \mu')$, then $(\mu_n)$ weakly converges to the set $\mathrm{OPT}_{\tau, \mu^*}$.

As a special case of Theorem 10, if $\tau$ is a strict adversarial divergence, i.e., $\mathrm{OPT}_{\tau, \mu^*} = \{\mu^*\}$, then converging to the minimizer of the objective function implies the usual weak convergence to the target probability measure. For example, it can be checked that the objective function of the f-GAN is a strict adversarial divergence; therefore, converging in the objective function of an f-GAN implies the usual weak convergence to the target probability measure.

To compare this result with our intuition, we return to the example of the sequence of delta distributions and show that, as long as $\tau$ is a strict adversarial divergence, $\tau(\delta_1 \| \delta_{1/n})$ does not converge to the global minimum of $\tau(\delta_1 \| \cdot)$. Observe that if $\tau(\delta_1 \| \delta_{1/n})$ converged to the global minimum of $\tau(\delta_1 \| \cdot)$, then according to Theorem 10, $\delta_{1/n}$ would weakly converge to $\delta_1$, which leads to a contradiction.

However, Theorem 10 does more than exclude undesired possibilities. It also enables us to make general statements about the structure of the class of adversarial divergences. The structural result is easily stated under the notion of relative strength between adversarial divergences, which is defined as follows.

Definition 11 (Relative strength between adversarial divergences). Let $\tau_1$ and $\tau_2$ be two adversarial divergences. If for any sequence of probability measures $(\mu_n)$ and any target probability measure $\mu^*$, $\tau_1(\mu^* \| \mu_n) \to \inf_{\mu} \tau_1(\mu^* \| \mu)$ implies $\tau_2(\mu^* \| \mu_n) \to \inf_{\mu} \tau_2(\mu^* \| \mu)$, then we say $\tau_1$ is stronger than $\tau_2$ and $\tau_2$ is weaker than $\tau_1$. We say $\tau_1$ is equivalent to $\tau_2$ if $\tau_1$ is both stronger and weaker than $\tau_2$. We say $\tau_1$ is strictly stronger (strictly weaker) than $\tau_2$ if $\tau_1$ is stronger (weaker) than $\tau_2$ but not equivalent to it. We say $\tau_1$ and $\tau_2$ are not comparable if $\tau_1$ is neither stronger nor weaker than $\tau_2$.

Not much is known about the relative strength of different adversarial divergences. If the underlying sample space is nice (e.g., a subset of a Euclidean space), then the variational (GAN-style) formulation of f-divergences using bounded continuous functions coincides with the original definition [15], and therefore f-divergences are adversarial divergences. [2] showed that the KL-divergence is stronger than the JS-divergence, which is equivalent to the total variation distance, which is in turn strictly stronger than the Wasserstein-1 distance.

[Figure 1: Structure of the class of strict adversarial divergences]

However, the novel fact is that we can reach the weakest strict adversarial divergence. Indeed, one implication of Theorem 10 is that if $X$ is a compact metric space and $\tau$ is a strict adversarial divergence over $X$, then $\tau$-convergence implies the usual weak convergence of probability measures. In particular, since the Wasserstein distance metrizes weak convergence of probability distributions (see e.g. [14]), as a direct consequence of Theorem 10, the Wasserstein distance is in the equivalence class of the weakest strict adversarial divergences. 
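The $\delta_{1/n}$ example makes the last point concrete; the following numeric check is our illustration, not the paper's. The Wasserstein-1 distance from $\delta_{1/n}$ to $\delta_0$ vanishes, witnessing weak convergence, while a stronger divergence such as total variation stays at 1 because the supports are disjoint.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def tv_diracs(a, b):
    """Total variation distance between the Dirac measures delta_a and delta_b:
    1 whenever the two point masses sit on different points."""
    return 0.0 if a == b else 1.0

ns = (1, 10, 100, 1000)
w_to_0 = [wasserstein_distance([1.0 / n], [0.0]) for n in ns]   # shrinks like 1/n
tv_to_0 = [tv_diracs(1.0 / n, 0.0) for n in ns]                 # constantly 1
```

This is exactly the sense in which total variation is strictly stronger than the Wasserstein distance: convergence in TV forces Wasserstein convergence, but $\delta_{1/n} \to \delta_0$ happens in Wasserstein distance while TV never budges.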
In the other direction, there exists a trivial strict adversarial divergence

τ_Trivial(μ‖ν) ≜ 0 if μ = ν, and +∞ otherwise, (7)

that is stronger than any other strict adversarial divergence. We now combine our convergence results with some previous results to obtain the following structural result.

Corollary 12. The class of strict adversarial divergences over a bounded and closed subset of a Euclidean space has the structure shown in Figure 1, where τ_Trivial is defined as in (7), τ_MMD corresponds to example (c) in Section 3, τ_Wasserstein corresponds to example (d) in Section 3, and τ_KL, τ_Reverse-KL, τ_TV, τ_JS, τ_Hellinger correspond to example (b) in Section 3 with f(x) being x log x, −log x, ½|x − 1|, −(x + 1) log((x + 1)/2) + x log x, and (√x − 1)², respectively. Each rectangle in Figure 1 represents an equivalence class, inside of which are some examples. In particular, τ_Trivial is in the equivalence class of the strongest strict adversarial divergences, while τ_MMD and τ_Wasserstein are in the equivalence class of the weakest strict adversarial divergences.

6 Related Work

There has been an explosion of work on GANs over the past couple of years; however, most of the work has been empirical in nature. A body of literature has looked at designing variants of GANs which use different objective functions.
Examples include [10], which proposes using the f-divergence between the target μ and the generated distribution ν, and [5, 9], which propose the MMD distance. Inspired by previous work, we identify a family of GAN-style objective functions in full generality and show general properties of the objective functions in this family.

There has also been some work on comparing different GAN-style objective functions in terms of their convergence properties, either in a GAN-related setting [2] or in a general IPM setting [12]. Unlike these results, which look at the relationships between several specific strict adversarial divergences, our results apply to an entire class of GAN-style objective functions and establish their convergence properties. For example, [2] shows that the KL-divergence, the JS-divergence, and the total variation distance are all stronger than the Wasserstein distance; our results generalize this part of their result and show that any strict adversarial divergence is stronger than the Wasserstein distance and the divergences equivalent to it. Furthermore, our results also apply to non-strict adversarial divergences.

That being said, our results are not a complete generalization of previous convergence results such as [2, 12]. Our results do not provide any method to compare two strict adversarial divergences when neither is equivalent to the Wasserstein distance or the trivial divergence. In contrast, [2] shows that the KL-divergence is stronger than the JS-divergence, which is equivalent to the total variation distance, which is strictly stronger than the Wasserstein-1 distance.

Finally, there has been some additional theoretical literature on understanding GANs, which considers orthogonal aspects of the problem. [3] address the question of whether we can achieve generalization bounds when training GANs. [13] focus on optimizing the estimating power of kernel distances. [5]
[5]\nstudy generalization bounds for MMD-GAN in terms of fat-shattering dimension.\n\n8\n\n\f7 Discussion and Conclusions\n\nIn conclusion, our results provide insights on the cost or loss functions that should be used in GANs.\nThe choice of cost function plays a very important role in this case \u2013 more so, for example, than data\ndomains or network architectures. For example, most works still use the DCGAN architecture, while\nchanging the cost functions to achieve different levels of performance, and which cost function is\nbetter is still a matter of debate. In particular we provide a framework for studying many different\nGAN criteria in a way that makes them more directly comparable, and under this framework, we\nstudy both approximation and convergence properties of various loss functions.\n\n8 Acknowledgments\n\nWe thank Iliya Tolstikhin, Sylvain Gelly, and Robert Williamson for helpful discussions. The work\nof KC and SL were partially supported by NSF under IIS 1617157.\n\nReferences\n[1] C. D. Aliprantis and O. Burkinshaw. Principles of real analysis. Academic Press, 1998.\n\n[2] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN. CoRR, abs/1701.07875, 2017.\n\n[3] S. Arora, R. Ge, Y. Liang, T. Ma, and Y. Zhang. Generalization and equilibrium in generative\n\nadversarial nets (gans). CoRR, abs/1703.00573, 2017.\n\n[4] H. G. Dales, J. F.K. Dashiell, A.-M. Lau, and D. Strauss. Banach Spaces of Continuous\nFunctions as Dual Spaces. CMS Books in Mathematics. Springer International Publishing,\n2016.\n\n[5] G. K. Dziugaite, D. M. Roy, and Z. Ghahramani. Training generative neural networks via\n\nmaximum mean discrepancy optimization. In UAI 2015.\n\n[6] A. Genevay, M. Cuturi, G. Peyr\u00e9, and F. R. Bach. Stochastic optimization for large-scale\n\noptimal transport. In NIPS 2016.\n\n[7] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and\n\nY. Bengio. Generative adversarial nets. In NIPS 2014.\n\n[8] I. 
Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of Wasserstein GANs. CoRR, abs/1704.00028, 2017.

[9] Y. Li, K. Swersky, and R. Zemel. Generative moment matching networks. In ICML 2015.

[10] S. Nowozin, B. Cseke, and R. Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In NIPS 2016.

[11] W. Rudin. Functional Analysis. International Series in Pure and Applied Mathematics. McGraw-Hill, 1991.

[12] B. K. Sriperumbudur, A. Gretton, K. Fukumizu, B. Schölkopf, and G. R. G. Lanckriet. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11:1517–1561, 2010.

[13] D. J. Sutherland, H. F. Tung, H. Strathmann, S. De, A. Ramdas, A. J. Smola, and A. Gretton. Generative models and model criticism via optimized maximum mean discrepancy. In ICLR 2017.

[14] C. Villani. Optimal Transport: Old and New. Grundlehren der mathematischen Wissenschaften. Springer-Verlag Berlin Heidelberg, 2009.

[15] Y. Wu. Lecture notes: Information-theoretic methods for high-dimensional statistics. 2017.