{"title": "A Primal-Dual link between GANs and Autoencoders", "book": "Advances in Neural Information Processing Systems", "page_first": 415, "page_last": 424, "abstract": "Since the introduction of Generative Adversarial Networks (GANs) and Variational Autoencoders (VAE), the literature on generative modelling has witnessed an overwhelming resurgence. The impressive, yet elusive empirical performance of GANs has lead to the rise of many GAN-VAE hybrids, with the hopes of GAN level performance and additional benefits of VAE, such as an encoder for feature reduction, which is not offered by GANs. Recently, the Wasserstein Autoencoder (WAE) was proposed, achieving performance similar to that of GANs, yet it is still unclear whether the two are fundamentally different or can be further improved into a unified model. In this work, we study the $f$-GAN and WAE models and make two main discoveries. First, we find that the $f$-GAN and WAE objectives partake in a primal-dual relationship and are equivalent under some assumptions, which then allows us to explicate the success of WAE. Second, the equivalence result allows us to, for the first time, prove generalization bounds for Autoencoder models, which is a pertinent problem when it comes to theoretical analyses of generative models. Furthermore, we show that the WAE objective is related to other statistical quantities such as the $f$-divergence and in particular, upper bounded by the Wasserstein distance, which then allows us to tap into existing efficient (regularized) optimal transport solvers. Our findings thus present the first primal-dual relationship between GANs and Autoencoder models, comment on generalization abilities and make a step towards unifying these models.", "full_text": "A Primal-Dual link between GANs and Autoencoders\n\nHisham Husain\u2021,\u2020 Richard Nock\u2020,\u2021,\u2663 Robert C. 
Williamson‡,†

‡The Australian National University, †Data61, ♣The University of Sydney

firstname.lastname@{data61.csiro.au,anu.edu.au}

Abstract

Since the introduction of Generative Adversarial Networks (GANs) and Variational Autoencoders (VAE), the literature on generative modelling has witnessed an overwhelming resurgence. The impressive, yet elusive, empirical performance of GANs has led to the rise of many GAN-VAE hybrids, with the hopes of GAN-level performance and the additional benefits of VAE, such as an encoder for feature reduction, which is not offered by GANs. Recently, the Wasserstein Autoencoder (WAE) was proposed, achieving performance similar to that of GANs, yet it is still unclear whether the two are fundamentally different or can be further improved into a unified model. In this work, we study the f-GAN and WAE models and make two main discoveries. First, we find that the f-GAN and WAE objectives partake in a primal-dual relationship and are equivalent under some assumptions, which then allows us to explicate the success of WAE. Second, the equivalence result allows us to, for the first time, prove generalization bounds for Autoencoder models, which is a pertinent problem when it comes to theoretical analyses of generative models. Furthermore, we show that the WAE objective is related to other statistical quantities such as the f-divergence and, in particular, is upper bounded by the Wasserstein distance, which then allows us to tap into existing efficient (regularized) optimal transport solvers. 
Our findings thus present the first primal-dual relationship between GANs and Autoencoder models, comment on generalization abilities and make a step towards unifying these models.

1 Introduction

Implicit probabilistic models [1] are defined to be the pushforward of a simple distribution PZ over a latent space Z through a map G : Z → X, where X is the space of the input data. Such models allow easy sampling, but the computation of the corresponding probability density function is intractable. The goal of these methods is to match G#PZ to a target distribution PX by minimizing D(PX, G#PZ), for some discrepancy D(·,·) between distributions. An overwhelming number of methods have emerged after the introduction of Generative Adversarial Networks [2, 3] and Variational Autoencoders [4] (GANs and VAEs), which have established two distinct paradigms: Adversarial (networks) training and Autoencoders respectively. Adversarial training involves a set of functions D, referred to as discriminators, with an objective of the form

D(PX, G#PZ) = max_{d∈D} { E_{x∼PX}[a(d(x))] − E_{x∼G#PZ}[b(d(x))] },   (1)

for some functions a : R → R and b : R → R. Autoencoder methods are concerned with finding a function E : X → Z, referred to as an encoder, whose goal is to reverse G, and learn a feature space with the objective

D(PX, G#PZ) = min_E { R(G, E) + Ω(E) },   (2)

where R(G, E) is the reconstruction loss and acts to ensure G and E reverse each other, and Ω(E) is a regularization term. Much work on Autoencoder methods has focused upon the choice of Ω.

In practice, the two methods demonstrate contrasting strengths and limitations, which have resulted in differing directions of progress. 

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
Indeed, there is a lack of theoretical understanding of how these frameworks are parametrized and it is not clear whether the methods are fundamentally different. For example, Adversarial training based methods have empirically demonstrated high performance when it comes to producing realistic looking samples from PX. However, GANs often have problems with convergence and stability of training [5]. Autoencoders, on the other hand, deal with a better behaved objective and learn an encoder in the process, making them useful for feature representation. However, in practice, Autoencoder based methods have exhibited shortfalls, such as producing blurry samples on image based datasets [6]. This has motivated researchers to adapt Autoencoder models by borrowing elements from Adversarial networks in the hopes of GAN-level performance whilst learning an encoder. Examples include replacing Ω with Adversarial objectives [7, 8] or replacing the reconstruction loss with an adversarial objective [9, 10]. Recently, the Wasserstein Autoencoder (WAE) [6] has been shown to subsume these two methods with an Adversarial based Ω, and has demonstrated performance similar to that of Adversarial methods.

Understanding the connection between the two paradigms is important not only for the practical purposes outlined above but for the inheritance of theoretical analyses from one another. For example, when it comes to directions of progress, Adversarial training methods now have theoretical guarantees on generalization performance [11], however no such theoretical results have been obtained to date for Autoencoders. 
Indeed, generalization performance is a pressing concern, since both techniques implicitly assume that the samples represent the target distribution [12], which eventually leads to memorizing the training data.

In this work, we study the two paradigms and in particular focus on f-GANs [3] for Adversarial training and Wasserstein Autoencoders (WAE) for Autoencoders, which generalize the original GAN and VAE models respectively. We prove that the f-GAN objective with Lipschitz (with respect to a metric c) discriminators is equivalent to the WAE objective with cost c. In particular, we show that the WAE objective is an upper bound; schematically we get

f-GAN ≤ WAE,

and discuss the tightness of this bound. Our result is a generalization of the Kantorovich-Rubinstein duality and thus suggests a primal-dual relationship between Adversarial and Autoencoder methods. Consequently we show, to the best of our knowledge, the first generalization bounds for Autoencoders. Furthermore, using this equivalence, we show that the WAE objective is related to key statistical quantities such as the f-divergence and Wasserstein distance, which allows us to tap into efficient (regularized) OT solvers.

The main contributions can be summarized as the following:

▷ (Theorem 8) Establishes an equivalence between Adversarial training and Wasserstein Autoencoders, showing conditions under which the f-GAN and WAE coincide. This further justifies the similar performance of WAE to GAN based methods. When the conditions are not met, we have an inequality, which allows us to comment on the behavior of the methods.

▷ (Theorems 9, 10 and 14) Show that the WAE objective is related to other statistical quantities such as the f-divergence and Wasserstein distance.

▷ (Theorem 13) Provides generalization bounds for WAE. 
In particular, this focuses on the empirical variant of the WAE objective, which allows the use of Optimal Transport (OT) solvers, as they are concerned with discrete distributions. This allows one to employ efficient (regularized) OT solvers for the estimation of WAE, f-GANs and the generalization bounds.

2 Preliminaries

2.1 Notation

We will use X to denote the input space (a Polish space), typically taken to be a Euclidean space. We use Z to denote the latent space, also taken to be Euclidean. We use N* to denote the natural numbers without 0: N \ {0}. We denote by P(X) the set of probability measures over X, and elements of this set will be referred to as distributions. If P ∈ P(X) happens to be absolutely continuous with respect to the Lebesgue measure λ then we will use dP/dλ to refer to the density function (the Radon-Nikodym derivative with respect to the Lebesgue measure). The set F(X, Z) refers to all measurable functions from X into Z. For any T ∈ F(X, Z) and any measure μ ∈ P(X), the pushforward measure of μ through T, denoted T#μ ∈ P(Z), is such that T#μ(A) = μ(T⁻¹(A)) for any measurable set A ⊆ Z. We will use functions to represent conditional distributions over a space Z conditioned on elements of X, for example E ∈ F(X, P(Z)), so that for any x ∈ X, E(x) = E(·|x) ∈ P(Z). For any P ∈ P(X), the support of P is supp(P) = {x ∈ X : x ∈ Nx open ⟹ P(Nx) > 0}. In any metric space (X, c), for any set S ⊆ X, we define the diameter of S to be diam_c(S) = sup_{x,x'∈S} c(x, x'). Given a metric c over X, for any f ∈ F(X, R), Lip_c(f) denotes the Lipschitz constant of f with respect to c, and Hc = {g ∈ F(X, R) : Lip_c(g) ≤ 1}. For some set S ⊆ R, 1_S corresponds to the convex indicator function, i.e. 1_S(x) = 0 if x ∈ S and 1_S(x) = ∞ otherwise. 
For any x ∈ X, δ_x corresponds to the characteristic function, with δ_x = 1 if x = 0 and δ_x = 0 if x ≠ 0.

2.2 Background

2.2.1 Probability Discrepancies

Probability discrepancies are central to the objective of finding the best fitting model. We introduce some key discrepancies and their notation, which will appear later.

Definition 1 (f-Divergence) For a convex function f : R → (−∞, ∞] with f(1) = 0, and for any P, Q ∈ P(X) with P absolutely continuous with respect to Q, the f-Divergence between P and Q is

D_f(P, Q) := ∫_X f(dP/dQ) dQ,

with D_f(P, Q) = ∞ if P is not absolutely continuous with respect to Q.

An example of a method to compute the f-divergence is to first compute dP/dQ and then estimate the integral empirically using samples from Q.

Definition 2 (Integral Probability Metric) For a fixed function class F ⊆ F(X, R), the Integral Probability Metric (IPM) based on F between P, Q ∈ P(X) is defined as

IPM_F(P, Q) := sup_{f∈F} { ∫_X f(x) dP(x) − ∫_X f(x) dQ(x) }.

If −F = F then IPM_F forms a metric over P(X) [13]. A particular IPM we will make use of is the Total Variation (TV): TV(P, Q) = IPM_V(P, Q), where V = {h ∈ F(X, R) : |h| ≤ 1}. 
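To make Definitions 1 and 2 concrete, here is a minimal sketch (not from the paper) that evaluates D_f for discrete distributions on a shared finite support; the choices of f below recover the KL divergence and the |x − 1| generator whose f-divergence coincides with the sum of absolute mass differences:

```python
from math import log, inf

def f_divergence(f, P, Q):
    """D_f(P, Q) = sum_x Q(x) * f(P(x)/Q(x)) for discrete P, Q on the
    same finite support; returns inf when P is not absolutely
    continuous with respect to Q (i.e. Q(x) = 0 while P(x) > 0)."""
    total = 0.0
    for p, q in zip(P, Q):
        if q == 0.0:
            if p > 0.0:
                return inf   # P not absolutely continuous w.r.t. Q
            continue         # 0 * f(0/0) handled as 0 by convention
        total += q * f(p / q)
    return total

kl = lambda t: t * log(t) if t > 0 else 0.0   # generator for KL(P || Q)
tv = lambda t: abs(t - 1.0)                   # generator for sum |P - Q|

P = [0.5, 0.3, 0.2]
Q = [0.4, 0.4, 0.2]
print(f_divergence(kl, P, Q))
print(f_divergence(tv, P, Q))
```

Both generators satisfy f(1) = 0, as Definition 1 requires, and the absolute-continuity clause reproduces the D_f = ∞ convention.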
We also note that when f(x) = |x − 1|, TV = D_f, and thus TV is both an IPM and an f-divergence.

Definition 3 For any P, Q ∈ P(X), define the set of couplings between P and Q to be

Π(P, Q) = { π ∈ P(X × X) : ∫ π(x, y) dy = P, ∫ π(x, y) dx = Q }.

For a cost c : X × X → R+, the Wasserstein distance between P and Q is

W_c(P, Q) := inf_{π∈Π(P,Q)} ∫_{X×X} c(x, y) dπ(x, y).

The Wasserstein distance can be regarded as an infinite linear program and thus admits a dual form, and in the case of c being a metric, belongs to the class of IPMs. We summarize this fact in the following lemma [14].

Lemma 4 (Wasserstein Duality) Let (X, c) be a metric space, and let Hc be the set of all 1-Lipschitz functions with respect to c. Then for any P, Q ∈ P(X), we have

W_c(P, Q) = sup_{h∈Hc} { ∫_X h(x) dP(x) − ∫_X h(x) dQ(x) } = IPM_{Hc}(P, Q).

2.3 Generative Models

In both GAN and VAE models, we have a latent space Z (typically taken to be R^d, with d small) and a prior distribution PZ ∈ P(Z) (e.g. a unit variance Gaussian). We have a function referred to as the generator, G : Z → X, which induces the generated distribution PG ∈ P(X) as the pushforward of PZ through G: PG = G#PZ. The true data distribution will be referred to as PX ∈ P(X). The common goal of the two methods is to find a generator G such that the samples generated by pushing forward PZ through G (G#PZ) are close to the true data distribution (PX). More formally, one can cast this as an optimization problem of finding the best G such that D(PG, PX) is minimized, where D(·,·) is some discrepancy between distributions. 
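For distributions supported on the real line with cost c(x, y) = |x − y|, the infimum over couplings in Definition 3 has a well-known closed form as the integral of the gap between the two CDFs; a minimal illustrative sketch (not part of the paper) for point masses on a shared sorted grid:

```python
def wasserstein_1d(xs, P, Q):
    """W_c(P, Q) for c(x, y) = |x - y| and distributions placing masses
    P and Q on the sorted grid xs: integrate |CDF_P - CDF_Q|."""
    cdf_gap, total = 0.0, 0.0
    for i in range(len(xs) - 1):
        cdf_gap += P[i] - Q[i]                      # running CDF difference
        total += abs(cdf_gap) * (xs[i + 1] - xs[i]) # area between CDFs
    return total

xs = [0.0, 1.0, 3.0]
# All mass at 0 vs. all mass at 3: the mass must travel distance 3.
print(wasserstein_1d(xs, [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]))  # → 3.0
```

The same quantity could be obtained by searching over couplings directly, but the CDF form makes the linear-program structure of Definition 3 computable in linear time in one dimension.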
Both methods (as we outline below) utilize their own discrepancies between PX and PG, which offer their own benefits and weaknesses.

2.3.1 Wasserstein Autoencoder

Let E : X → P(Z) denote a probabilistic encoder¹, which maps each point x to a conditional distribution E(x) ∈ P(Z), referred to as the posterior distribution. The pushforward of PX through E, E#PX, will be referred to as the aggregated posterior.

Definition 5 (Wasserstein Autoencoder [6]) Let c : X × X → R≥0, λ > 0 and Ω : P(Z) × P(Z) → R≥0 with Ω(P, P) = 0 for all P ∈ P(Z). The Wasserstein Autoencoder objective is

WAE_{c,λ·Ω}(PX, G) = inf_{E∈F(X,P(Z))} { ∫_X E_{z∼E(x)}[c(x, G(z))] dPX(x) + λ·Ω(E#PX, PZ) }.

We remark that there are various choices of c and λ·Ω. [6] select these by tuning λ and selecting different measures of discrepancy between probability distributions for Ω.

2.3.2 f-Generative Adversarial Network

Let d : X → R denote a discriminator function.

Definition 6 (f-GAN [3]) Let f : R → (−∞, ∞] denote a convex function with f(1) = 0 and D ⊆ F(X, R) a set of discriminators. The f-GAN model minimizes the following objective for a generator G : Z → X:

GAN_f(PX, G; D) := sup_{d∈D} { E_{x∼PX}[d(x)] − E_{z∼PZ}[f*(d(G(z)))] },   (3)

where f*(x) = sup_y {x·y − f(y)} is the convex conjugate of f.

There are two knobs in this method, namely D, the set of discriminators, and the convex function f. The objective in (3) is a variational approximation to D_f [3]; if D = F(X, R), then GAN_f(PX, G; D) = D_f(PX, PG) [15]. 
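The variational approximation in (3) can be checked numerically on a toy case; the sketch below (illustrative, not from the paper) takes a two-point sample space, the generator f(t) = t log t whose divergence is KL and whose conjugate is f*(y) = exp(y − 1), and grid-searches the discriminator values, which is possible because the objective separates across points of a finite space:

```python
from math import exp, log

# f(t) = t*log(t) generates the KL divergence; its convex conjugate is
# f*(y) = exp(y - 1), which enters the f-GAN objective (3).
f_star = lambda y: exp(y - 1.0)

P, Q = [0.7, 0.3], [0.5, 0.5]   # stand-ins for P_X and P_G on two points

# sup_d E_P[d(x)] - E_Q[f*(d(x))]: on a finite space the objective is a
# sum over points, so each value d(x) is optimized by its own grid search.
grid = [i / 1000.0 for i in range(-3000, 3001)]
gan_value = sum(max(p * d - q * f_star(d) for d in grid) for p, q in zip(P, Q))

kl = sum(p * log(p / q) for p, q in zip(P, Q))
print(gan_value, kl)   # the variational value approaches KL(P || Q) from below
```

With the full discriminator class the supremum is attained at d(x) = 1 + log(dP/dQ)(x) and the bound is tight, matching the D = F(X, R) case stated above.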
In the case of f(x) = x log(x) − (x + 1) log(x + 1) + 2 log 2, we recover the original GAN [2].

3 Related Work

Current attempts at building a taxonomy for generative models have largely been within each paradigm or through the proposal of hybrid methods that borrow elements from the two. We first review major and relevant advances in each paradigm, and then move on to discuss results that are close to the technical contributions of our work.

The line of Autoencoders begins with Ω = 0, which is the original autoencoder concerned only with the reconstruction loss. VAE then introduced a non-zero Ω, along with implementing Gaussian encoders [4]. This was then replaced by an adversarial objective [7], which is sample based and consequently allows arbitrary encoders. In the spirit of unification, Adversarial Autoencoders (AAE) [8] proposed Ω to be a discrepancy between the pushforward of the target distribution through the encoder (E#PX) and the prior distribution (PZ) in the latent space, which was then shown to be equivalent to the VAE Ω minus a mutual information term [16]. Independently, InfoVAE [17] proposed a similar objective, which was subsequently shown to be equivalent to adding a mutual information term. [6] reparametrized the Wasserstein distance into an Autoencoder objective (WAE) whose Ω term generalizes AAE, and reported performance comparable to that of Adversarial methods. Other attempts also include adjusting the reconstruction loss itself to be adversarial [9, 10]. 

¹We remark that this is not standard notation in the VAE and Variational Inference literature.
Another work that focuses on WAE is the Sinkhorn Autoencoder (SAE) [18], which selects Ω to be the Wasserstein distance and shows that the overall objective is an upper bound on the Wasserstein distance between PX and PG.

[19] discussed the two paradigms and their unification by interpreting GANs from the perspective of variational inference, which allowed a connection to VAE, resulting in a GAN implemented with importance weighting techniques. While this approach is the closest to our work in forming a link, their results apply only to the standard VAE (and not to other AE methods such as WAE) and cannot be extended to all f-GANs. [20] introduced the notion of an Adversarial divergence, which subsumes mainstream adversarial based methods. This also led to a formal understanding of how the selected discriminator set D affects the final generator G learned. However, this approach is silent with regard to Autoencoder based methods. [11] established the tradeoff between the Rademacher complexity of the discriminator class D and the generalization performance of G, with no results for Autoencoders. These theoretical advances in Adversarial training methods are inherited by Autoencoders as a consequence of the equivalence presented in our work.

One key point in the proof of our equivalence is the use of a result that decomposes the GAN objective into an f-divergence and an IPM for a restricted class of discriminators (which we use for Lipschitz functions). This decomposition is used in [21] and applied to linear f-GANs, showing that the adversarial training objective decomposes into a mixture of maximum likelihood and moment matching. [22] used this decomposition with Lipschitz discriminators as we do, but does not make any extension or further progress towards establishing the link to WAE. 
Indeed, GANs with Lipschitz discriminators have been independently studied in [23], which suggests that one should enforce Lipschitz constraints to provide useful gradients.

4 f-Wasserstein Autoencoders

We define a new objective that will help us in the proof of the main theorems of this paper.

Definition 7 (f-Wasserstein Autoencoder) Let c : X × X → R, λ > 0 and f : R → (−∞, ∞] a convex function (with f(1) = 0). The f-Wasserstein Autoencoder (f-WAE) objective is

W̄_{c,λ·f}(PX, G) = inf_{E∈F(X,P(Z))} { W_c(PX, (G ◦ E)#PX) + λD_f(E#PX, PZ) }.   (4)

In the proof of the main result, we will show that the f-WAE objective is indeed the same as the WAE objective when using the same cost c and selecting the regularizer to be λ·Ω = D_{λf} = λD_f. The only difference between this and the standard WAE is the use of W_c(PX, (G ◦ E)#PX) as the reconstruction loss instead of the standard cost term, which upper bounds it (Lemma 18). We now present the main theorem that captures the relationship between f-GAN and WAE.

Theorem 8 (f-GAN and WAE equivalence) Suppose (X, c) is a metric space and let Hc denote the set of all functions from X to R that are 1-Lipschitz (with respect to c). Let f : R → (−∞, ∞] be a convex function with f(1) = 0. Then for all λ > 0,

GAN_{λf}(PX, G; Hc) ≤ WAE_{c,λ·D_f}(PX, G),   (5)

with equality if G is invertible.

Proof (sketch; see Section A.1 for the full proof).
The proof begins by establishing certain properties of Hc (Lemma 16), allowing us to use the dual form of restricted GANs (Theorem 15):

GAN_f(PX, G; Hc) = inf_{P'∈P(X)} { D_f(P', PG) + sup_{h∈Hc} { E_{PX}[h] − E_{P'}[h] } } = inf_{P'∈P(X)} { D_f(P', PG) + W_c(P', PX) }.   (6)

The key is to reparametrize (6) by optimizing over couplings. By rewriting P' = (G ◦ E)#PX for some E ∈ F(X, P(Z)) and rewriting (6) as an optimization over E (Lemma 20), we obtain

inf_{P'∈P(X)} { D_f(P', PG) + W_c(P', PX) } = inf_{E∈F(X,P(Z))} { D_f((G ◦ E)#PX, PG) + W_c((G ◦ E)#PX, PX) }.   (7)

We then have

D_f((G ◦ E)#PX, PG) = D_f(G#(E#PX), G#PZ) ≤(∗) D_f(E#PX, PZ),

with equality in (∗) if G is invertible (Lemma 17). A weaker condition suffices if f is differentiable, namely that G is invertible with respect to f' ◦ d(E#PX)/dPZ in the sense that

G(z) = G(z') ⟹ f' ◦ (d(E#PX)/dPZ)(z) = f' ◦ (d(E#PX)/dPZ)(z'),   (8)

noting that an invertible G trivially satisfies this requirement. Letting f ← λf, we have D_f(·,·) ← λD_f(·,·), and so from Equation (7) we have

GAN_{λf}(PX, G; Hc) ≤(∗) inf_{E∈F(X,P(Z))} { λD_f(E#PX, PZ) + W_c((G ◦ E)#PX, PX) } = W̄_{c,λ·f}(PX, G) ≤ inf_{E∈F(X,P(Z))} { λD_f(E#PX, PZ) + ∫_X E_{z∼E(x)}[c(x, G(z))] dPX(x) } = WAE_{c,λ·D_f}(PX, G),

where the final inequality follows from the fact that W_c(PX, (G ◦ E)#PX) ≤ ∫_X E_{z∼E(x)}[c(x, G(z))] dPX(x) (Lemma 18). 
Using the fact that W̄ ≥ WAE (Lemma 19) completes the proof.

When G is invertible, we remark that PG can still be expressive and capable of modelling complex distributions in WAE and GAN models. For example, if G is implemented with feedforward neural networks and G is invertible, then PG can model deformed exponential families [24], which encompass a large class of distributions appearing in statistical physics and information geometry [25, 26]. There exist many invertible activation functions under which G will be invertible. Furthermore, in the proof of the Theorem it is clear that W̄ and WAE are the same objective (from Lemma 18 and Lemma 19).

When using f = 1_{1} (f(x) = 0 if x = 1 and f(x) = ∞ otherwise), and noting that f*(x) = x, Theorem 8 (with λ = 1) reduces to

sup_{h∈Hc} { E_{x∼PX}[h(x)] − E_{x∼PG}[h(x)] } = GAN_f(PX, G; Hc) ≤ W̄_{c,f}(PX, G) = inf_{E∈F(X,P(Z)) : E#PX=PZ} { W_c(PX, (G ◦ E)#PX) } = inf_{E∈F(X,P(Z)) : E#PX=PZ} { W_c(PX, G#PZ) } = W_c(PX, PG),

which is the standard primal-dual relation for Wasserstein distances as in Lemma 4. Hence, Theorem 8 can be viewed as a generalization of this primal-dual relationship, where Autoencoder and Adversarial objectives represent primal and dual forms respectively.

We note that the left hand side of Equation (5) does not engage the prior space Z as much as the right hand side, in the sense that one can set Z = X, G = Id (which is invertible) and PZ = PG, which results in the exact same f-GAN objective since G#PZ = Id#PG = PG, yet the equivalent f-WAE objective (from Theorem 8) will be different. This makes the Theorem versatile under reparametrizations, which we exploit in the proof of Theorem 10. 
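The Kantorovich-Rubinstein duality that Theorem 8 generalizes can be verified by brute force on a tiny example; the sketch below (illustrative, not from the paper) computes both sides of Lemma 4 on a two-point space, where a coupling has a single free parameter and a 1-Lipschitz discriminator is determined by the difference of its two values:

```python
# Two-point space {x0, x1} with metric cost c(x, y) = |x - y|; P and Q
# list the masses placed on (x0, x1).
x0, x1 = 0.0, 2.0
P, Q = [0.8, 0.2], [0.3, 0.7]

def primal(P, Q, steps=10000):
    """Brute-force inf over couplings: a coupling of two 2-point
    marginals is determined by one free parameter t = pi(x0, x0)."""
    lo, hi = max(0.0, P[0] + Q[0] - 1.0), min(P[0], Q[0])
    best = float("inf")
    for i in range(steps + 1):
        t = lo + (hi - lo) * i / steps
        pi = [[t, P[0] - t], [Q[0] - t, 1.0 - P[0] - Q[0] + t]]
        cost = sum(pi[a][b] * abs([x0, x1][a] - [x0, x1][b])
                   for a in range(2) for b in range(2))
        best = min(best, cost)
    return best

def dual(P, Q, steps=1000):
    """Brute-force sup over 1-Lipschitz h: the objective only depends on
    the difference h(x0) - h(x1), constrained to [-|x1-x0|, |x1-x0|]."""
    best = -float("inf")
    for i in range(-steps, steps + 1):
        diff = (x1 - x0) * i / steps          # candidate h(x0) - h(x1)
        best = max(best, diff * (P[0] - Q[0]))
    return best

print(primal(P, Q), dual(P, Q))   # both approach |P[0]-Q[0]| * |x1-x0| = 1.0
```

The two searches agree up to grid resolution, matching the primal-dual equality that the f = 1_{1} reduction above recovers.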
We now consider weighting the reconstruction along with the regularization term in W̄ (which is equivalent to weighting WAE); this simply amounts to re-weighting the cost since, for any γ > 0,

W̄_{γ·c,λ·f}(PX, G) = inf_{E∈F(X,P(Z))} { γW_c((G ◦ E)#PX, PX) + λD_f(E#PX, PZ) }.

The idea of weighting the regularization term by λ was introduced in [27] and was furthermore studied empirically in [28], showing that the choice of λ influences the learning of disentanglement in the latent space. We show that if λ = 1 and γ is larger than some γ*, then W̄ becomes an f-divergence (Theorem 9). On the other hand, if we fix γ = 1 and take λ larger than some λ*, then W̄ becomes the Wasserstein distance and, in particular, equality holds in (5) (Theorem 10). We show explicitly how high γ and λ need to be for such equalities to occur. This is surprising since the f-divergence and Wasserstein distance are quite different distortions.

We begin with the f-divergence case. Consider f : R → (−∞, ∞] convex, differentiable and with f(1) = 0, and assume that PX is absolutely continuous with respect to PG, so that D_f(PX, PG) < ∞.

Theorem 9 Set c(x, y) = δ_{x−y} and let f : R → (−∞, ∞] be a convex and differentiable function (with f(1) = 0). Let γ* = sup_{x,x'∈X} | f'(dPX/dPG)(x) − f'(dPX/dPG)(x') | and suppose PG is absolutely continuous with respect to PX and that G is invertible. Then we have, for all γ ≥ γ*,

W̄_{γ·c,f}(PX, G) = D_f(PX, PG).

(Proof in Appendix, Section A.3.) 
It is important to note that W_c(PX, PG) = TV(PX, PG) when c(x, y) = δ_{x−y}, and so Theorem 9 tells us that the objective with a weighted total variation reconstruction loss and an f-divergence prior regularization amounts to the f-divergence. It was shown in [24] that when G is an invertible feedforward neural network, then D_f(PX, PG) is a Bregman divergence (a well regarded quantity in information geometry) between the parametrizations of the network for a fixed choice of activation function of G, which depends on f. Hence, a practitioner should design G with such an activation function when using f-WAE in the above setting (c(x, y) = δ_{x−y} and γ = γ*) with G invertible, so that the information theoretic divergence (D_f) between the distributions becomes an information geometric divergence involving the network parameters.

We now show that if λ is selected higher than λ* := sup_{P'∈P(X)} ( W_c(P', PG) / D_f(P', PG) ), then W̄ becomes W_c and furthermore we have equality between f-GAN, f-WAE and WAE.

Theorem 10 Let c : X × X → R be a metric. For any convex function f : R → (−∞, ∞] (with f(1) = 0), we have for all λ ≥ λ*,

GAN_{λf}(PX, G; Hc) = W̄_{c,λ·f}(PX, G) = WAE_{c,λ·D_f}(PX, G) = W_c(PX, PG).

(Proof in Appendix, Section A.4.) Note that Theorem 10 holds for any f (satisfying the properties of the Theorem) and so one can estimate the Wasserstein distance using any f as long as λ is scaled to λ*. In order to understand how high λ* can be, there are two extremes in which the supremum may be unbounded. 
The first case is when P' is taken far from PG so that W_c(P', PG) increases; however, one should note that when Δ = max_{x,x'∈X} c(x, x') < ∞ we have W_c ∈ [0, Δ], so W_c stays finite whereas D_f(P', PG) can possibly diverge to ∞, driving the ratio to 0. The other case is when P' is made close to PG, in which case 1/D_f(P', PG) → ∞; however W_c(P', PG) → 0 as well, so the ratio can still be small in this case, depending on the rate of decrease of W_c relative to D_f. Now suppose that f(x) = |x − 1| and c(x, y) = δ_{x−y}, in which case D_f = W_c and thus λ* = 1. In this case, Theorem 10 reduces to the standard result [29] that the Wasserstein distance and f-divergence intersect at the variational divergence under these conditions.

5 Generalization bounds

We present generalization bounds using machinery developed in [30], with the following definitions.

Definition 11 (Covering Numbers) For a set S ⊆ X, we denote by N_η(S) the η-covering number of S, which is the smallest m ∈ N* such that there exist closed balls B1, ..., Bm of radius η with S ⊆ ∪_{i=1}^m B_i. For any P ∈ P(X), set

N_η(P, τ) := inf { N_η(S) : P(S) ≥ 1 − τ },

and the (η, τ)-dimension is d_η(P, τ) := log N_η(P, τ) / (− log η). 
For\nany P \u2208 P(X) in a metric space (X, c), we use de\ufb01ne \u2206P,c = diamc(supp(P )).\nTheorem 13 Let (X, c) be a metric space and suppose \u2206 := max{\u2206c,PX , \u2206c,PG} < \u221e. For any\nn \u2208 N\u2217, let \u02c6PX and \u02c6PG denote the empirical distribution with n samples drawn i.i.d from PX and\nPG respectively. Let sX > d\u2217(PX ) and sG > d\u2217(PG). For all f : R \u2192 (\u2212\u221e,\u221e] convex functions,\nf (1) = 0, \u03bb > 0 and \u03b4 \u2208 (0, 1), then with probability at least 1 \u2212 \u03b4, we have\n\n(cid:32)\n\n(cid:115)\n\nGAN\u03bbf (PX , G; Hc) \u2264 W c,\u03bb\u00b7f ( \u02c6PX , PG) + O\n\nn\u22121/sX + \u2206\n\nand if f (x) = |x \u2212 1| is chosen then\n\nGAN\u03bbf (PX , G; Hc) \u2264 W c,\u03bb\u00b7f ( \u02c6PX , \u02c6PG) + O\n\n(cid:32)\n\nn\u22121/sX + n\u22121/sG + \u2206\n\n,\n\n(cid:19)(cid:33)\n(cid:19)(cid:33)\n\n(cid:18) 1\n(cid:18) 1\n\n\u03b4\n\n1\nn\n\nln\n\n(cid:115)\n\n1\nn\n\nln\n\n\u03b4\n\n(9)\n\n.\n\n(10)\n\n(cid:111)\n\n(cid:110)\n\n(Proof in Appendix, Section A.2). First note that there is no requirement on G to be invertible and no\nrestriction on \u03bb. Second, there are the quantities sX,sG and \u2206 that are in\ufb02uenced by the distributions\nPX and PG. It is interesting to note that d\u2217 is related to fractal dimensions [31] and thus relates the\nconvergence of GANs to statistical geometry. If G is invertible in the above then the left hand side\nof both bounds becomes W c,\u03bb\u00b7f (PX , G) by Theorem 8. 
In general, P̂X and P̂G will not share the same support, in which case D_f(P̂X, P̂G) = ∞. This would lead one to suspect the same of W̄_{c,λ·f}(P̂X, P̂G); however, this is not the case, since

W̄_{c,λ·f}(P̂X, P̂G) ≤ inf_{E∈F(X,P(Z))} { W_c((G ◦ E)#P̂X, P̂X) + λD_f(E#P̂X, PZ) },

and so E ∈ F(X, P(Z)) can be selected such that E#P̂X shares the support of PZ, resulting in a bounded value. We now show the relationship between W̄ and W_c.

Theorem 14 For any c : X × X → R, λ > 0 and convex function f : R → (−∞, ∞] (with f(1) = 0), we have W̄_{c,λ·f}(PX, G) ≤ W_c(PX, PG).

(Proof in Appendix, Section A.5.) This suggests that in order to minimize W̄, one can minimize W_c. Indeed, the majority of OT solvers are concerned with discrete distributions, which is exactly what is present on the right hand side of the generalization bounds: W̄_{c,λ·f}(P̂X, P̂G).

6 Discussion and Conclusion

This work is the first to prove a generalized primal-dual relationship between GANs and Autoencoders. Our result elucidates the close performance between WAE and f-GANs. Furthermore, we explored the effect of weighting the reconstruction and regularization terms in the WAE objective, showing relationships to both f-divergences and Wasserstein metrics along with the impact on the duality relationship. This equivalence allowed us to prove generalization results which, to the best of our knowledge, are the first bounds given for Autoencoder models. 
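As noted above, the discrete objectives on empirical distributions are natural targets for regularized OT solvers; a minimal, pure-Python Sinkhorn sketch (illustrative only — the cost matrix, weights and hyperparameters below are arbitrary choices, not from the paper):

```python
from math import exp

def sinkhorn(C, p, q, eps=0.05, iters=500):
    """Entropy-regularized OT: alternately rescale K = exp(-C/eps) so
    that the plan u_i * K_ij * v_j has marginals p and q."""
    n, m = len(p), len(q)
    K = [[exp(-C[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        u = [p[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [q[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    plan = [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
    cost = sum(plan[i][j] * C[i][j] for i in range(n) for j in range(m))
    return plan, cost

# Empirical measures on three atoms each; C holds pairwise costs c(x_i, y_j)
# for atoms at 0, 1, 2 on the line.
C = [[0.0, 1.0, 2.0], [1.0, 0.0, 1.0], [2.0, 1.0, 0.0]]
p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]
plan, cost = sinkhorn(C, p, q)
print(cost)   # close to the unregularized optimum W_c(p, q) = 0.6 here
```

For small regularization eps, the linear transport cost of the Sinkhorn plan approaches the Wasserstein distance, which is the sense in which such solvers estimate the discrete quantities appearing in the bounds.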
The results imply that we can employ efficient (regularized) OT solvers to approximate upper bounds on the generalization bounds, which involve discrete distributions and are thus natural for such solvers.

The consequences of unifying two paradigms are plentiful, generalization bounds being one example. One line of extension of the presented work is to explore the use of a general cost c (as opposed to a metric), invoking the generalized Wasserstein dual with the goal of forming a generalized GAN. Our paper provides a basis to unify Adversarial Networks and Autoencoders through a primal-dual relationship, and opens doors for the further unification of related models.

Acknowledgments

We would like to acknowledge anonymous reviewers and the Australian Research Council of Data61.

References

[1] Shakir Mohamed and Balaji Lakshminarayanan. Learning in implicit generative models. arXiv preprint arXiv:1610.03483, 2016.

[2] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

[3] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pages 271–279, 2016.

[4] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

[5] Ian Goodfellow. NIPS 2016 tutorial: Generative adversarial networks. arXiv preprint arXiv:1701.00160, 2016.

[6] Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schoelkopf. Wasserstein auto-encoders. arXiv preprint arXiv:1711.01558, 2017.

[7] Lars Mescheder, Sebastian Nowozin, and Andreas Geiger.
Adversarial variational Bayes: Unifying variational autoencoders and generative adversarial networks. arXiv preprint arXiv:1701.04722, 2017.

[8] Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.

[9] Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Olivier Mastropietro, Alex Lamb, Martin Arjovsky, and Aaron Courville. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.

[10] Aibek Alanov, Max Kochurov, Daniil Yashkov, and Dmitry Vetrov. Pairwise augmented GANs with adversarial reconstruction loss. arXiv preprint arXiv:1810.04920, 2018.

[11] Pengchuan Zhang, Qiang Liu, Dengyong Zhou, Tao Xu, and Xiaodong He. On the discrimination-generalization tradeoff in GANs. arXiv preprint arXiv:1711.02771, 2017.

[12] Ke Li and Jitendra Malik. On the implicit assumptions of GANs. arXiv preprint arXiv:1811.12402, 2018.

[13] Alfred Müller. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29(2):429–443, 1997.

[14] Cédric Villani. Optimal Transport: Old and New, volume 338. Springer Science & Business Media, 2008.

[15] XuanLong Nguyen, Martin J Wainwright, and Michael I Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847–5861, 2010.

[16] Matthew D Hoffman and Matthew J Johnson. ELBO surgery: yet another way to carve up the variational evidence lower bound. In Workshop in Advances in Approximate Bayesian Inference, NIPS, 2016.

[17] Shengjia Zhao, Jiaming Song, and Stefano Ermon. InfoVAE: Information maximizing variational autoencoders. arXiv preprint arXiv:1706.02262, 2017.

[18] Giorgio Patrini, Marcello Carioni, Patrick Forre, Samarth Bhargav, Max Welling, Rianne van den Berg, Tim Genewein, and Frank Nielsen.
Sinkhorn autoencoders. arXiv preprint arXiv:1810.01118, 2018.

[19] Zhiting Hu, Zichao Yang, Ruslan Salakhutdinov, and Eric P Xing. On unifying deep generative models. arXiv preprint arXiv:1706.00550, 2017.

[20] Shuang Liu, Olivier Bousquet, and Kamalika Chaudhuri. Approximation and convergence properties of generative adversarial learning. In Advances in Neural Information Processing Systems, pages 5545–5553, 2017.

[21] Shuang Liu and Kamalika Chaudhuri. The inductive bias of restricted f-GANs. arXiv preprint arXiv:1809.04542, 2018.

[22] Farzan Farnia and David Tse. A convex duality framework for GANs. In Advances in Neural Information Processing Systems, pages 5254–5263, 2018.

[23] Zhiming Zhou, Yuxuan Song, Lantao Yu, Hongwei Wang, Weinan Zhang, Zhihua Zhang, and Yong Yu. Understanding the effectiveness of Lipschitz-continuity in generative adversarial nets. 2018.

[24] Richard Nock, Zac Cranko, Aditya K Menon, Lizhen Qu, and Robert C Williamson. f-GANs in an information geometric nutshell. In Advances in Neural Information Processing Systems, pages 456–464, 2017.

[25] Shun-ichi Amari. Information Geometry and Its Applications. Springer, 2016.

[26] Lisa Borland. Ito-Langevin equations within generalized thermostatistics. Physics Letters A, 245(1-2):67–72, 1998.

[27] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. 2016.

[28] Alexander Alemi, Ben Poole, Ian Fischer, Joshua Dillon, Rif A Saurous, and Kevin Murphy. Fixing a broken ELBO. In International Conference on Machine Learning, pages 159–168, 2018.

[29] Bharath K Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Bernhard Schölkopf, and Gert RG Lanckriet. On integral probability metrics, φ-divergences and binary classification.
arXiv preprint arXiv:0901.2698, 2009.

[30] Jonathan Weed and Francis Bach. Sharp asymptotic and finite-sample rates of convergence of empirical measures in Wasserstein distance. arXiv preprint arXiv:1707.00087, 2017.

[31] Kenneth Falconer. Fractal Geometry: Mathematical Foundations and Applications. John Wiley & Sons, 2004.

[32] Peter L Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.