Learning GANs and Ensembles Using Discrepancy

Ben Adlam
Google Research
New York, NY 10011
adlam@google.com

Corinna Cortes
Google Research
New York, NY 10011
corinna@google.com

Mehryar Mohri
Google Research & CIMS
New York, NY 10012
mohri@google.com

Ningshan Zhang
New York University
New York, NY 10012
nzhang@stern.nyu.edu

Abstract

Generative adversarial networks (GANs) generate data based on minimizing a divergence between two distributions. The choice of that divergence is therefore critical.
We argue that the divergence must take into account the hypothesis set and the loss function used in a subsequent learning task, where the data generated by a GAN serves for training. Taking that structural information into account is also important to derive generalization guarantees. Thus, we propose to use the discrepancy measure, which was originally introduced for the closely related problem of domain adaptation and which precisely takes into account the hypothesis set and the loss function. We show that discrepancy admits favorable properties for training GANs and prove explicit generalization guarantees. We present efficient algorithms using discrepancy for two tasks: training a GAN directly, namely DGAN, and mixing previously trained generative models, namely EDGAN. Our experiments on toy examples and several benchmark datasets show that DGAN is competitive with other GANs and that EDGAN outperforms existing GAN ensembles, such as AdaGAN.

1 Introduction

Generative adversarial networks (GANs) consist of a family of methods for unsupervised learning. A GAN learns a generative model that can easily output samples following a distribution P_θ, which aims to mimic the real data distribution P_r. The parameter θ of the generator is learned by minimizing a divergence between P_r and P_θ, and different choices of this divergence lead to different GAN algorithms: the Jensen-Shannon divergence gives the standard GAN [Goodfellow et al., 2014, Salimans et al., 2016], the Wasserstein distance gives the WGAN [Arjovsky et al., 2017, Gulrajani et al., 2017], the squared maximum mean discrepancy gives the MMD GAN [Li et al., 2015, Dziugaite et al., 2015, Li et al., 2017], and the f-divergence gives the f-GAN [Nowozin et al., 2016], just to name a few.
There are many other GANs derived from other divergences; see [Goodfellow, 2017] and [Creswell et al., 2018] for more extensive surveys.

The choice of the divergence is critical in the design of a GAN. But how should that divergence be selected or defined? We argue that its choice must take into consideration the structure of a learning task and include, in particular, the hypothesis set and the loss function considered. In contrast, divergences that ignore the hypothesis set typically cannot benefit from any generalization guarantee (see, for example, Arora et al. [2017]). The loss function is also crucial: while many GAN applications aim to generate synthetic samples indistinguishable from original ones, for example images [Karras et al., 2018, Brock et al., 2019] or Anime characters [Jin et al., 2017], in many other applications, the generated samples are used to improve subsequent learning tasks, such as data augmentation [Frid-Adar et al., 2018], improved anomaly detection [Zenati et al., 2018], or model compression [Liu et al., 2018b]. Such subsequent learning tasks require optimizing a specific loss function applied to the data. Thus, it would seem beneficial to explicitly incorporate this loss in the training of a GAN.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

A natural divergence that accounts for both the loss function and the hypothesis set is the discrepancy measure introduced by Mansour et al. [2009]. Discrepancy plays a key role in the analysis of domain adaptation, which is closely related to the GAN problem, and of other related problems such as drifting and time series prediction [Mohri and Medina, 2012, Kuznetsov and Mohri, 2015]. Several important generalization bounds for domain adaptation are expressed in terms of discrepancy [Mansour et al., 2009, Cortes and Mohri, 2014, Ben-David et al., 2007].
We define discrepancy in Section 2 and give examples illustrating the benefit of using discrepancy to measure the divergence between distributions.

In this work, we design a new GAN technique, discrepancy GAN (DGAN), that minimizes the discrepancy between P_θ and P_r. By training a GAN with discrepancy, we obtain theoretical guarantees for subsequent learning tasks that use the samples it generates. We show that discrepancy is continuous with respect to the generator's parameter θ, under mild conditions, which makes training DGAN easier. Another key property of the discrepancy is that it can be accurately estimated from finite samples when the hypothesis set admits bounded complexity. This property does not hold for popular metrics such as the Jensen-Shannon divergence and the Wasserstein distance.

Moreover, we propose to use discrepancy to learn an ensemble of pre-trained GANs, which results in our EDGAN algorithm. By considering an ensemble of GANs, one can greatly reduce the problem of missing modes that frequently occurs when training a single GAN. We show that the discrepancy between the true distribution and the ensemble distribution learned on finite samples converges to the discrepancy between the true distribution and the optimal ensemble distribution, as the sample size increases. We also show that the EDGAN problem can be formulated as a convex optimization problem, thereby benefiting from strong convergence guarantees. Recent work of Tolstikhin et al. [2017], Arora et al. [2017], Ghosh et al. [2018] and Hoang et al. [2018] also considered mixing GANs, motivated either by boosting algorithms such as AdaBoost, or by the minimax theorem in game theory.
These algorithms train multiple generators and learn the mixture weights simultaneously, yet none of them explicitly optimizes the mixture weights once the multiple GANs have been learned, a step that can provide additional improvement, as demonstrated by our experiments with EDGAN.

The term "discrepancy" has previously been used in the GAN literature with a different definition. The squared maximum mean discrepancy (MMD), originally proposed by Gretton et al. [2012], is used as the distance metric for training MMD GAN [Li et al., 2015, Dziugaite et al., 2015, Li et al., 2017]. The MMD between two distributions is defined with respect to a family of functions F, usually assumed to be a reproducing kernel Hilbert space (RKHS) induced by a kernel function, but MMD does not take into account the loss function. LSGAN [Mao et al., 2017] also adopts the squared loss for the discriminator, as we do for DGAN. Feizi et al. [2017] and Deshpande et al. [2018] consider minimizing the quadratic Wasserstein distance between the true and the generated samples, which involves the squared loss as well. However, their training objectives are vastly different from ours. Finally, when the hypothesis set is the family of linear functions with bounded norm and the loss function is the squared loss, DGAN coincides with the objective sought by McGAN [Mroueh et al., 2017], that of matching the empirical covariance matrices of the true and the generated distributions. However, McGAN uses the nuclear norm, while DGAN uses the spectral norm in that case.

The rest of this paper is organized as follows. In Section 2, we define discrepancy and prove that it benefits from several favorable properties, including continuity with respect to the generator's parameter and the possibility of accurately estimating it from finite samples.
In Section 3, we describe our discrepancy GAN (DGAN) and ensemble discrepancy GAN (EDGAN) algorithms, with a discussion of the optimization solution and theoretical guarantees. We report the results of a series of experiments (Section 4), on both toy examples and several benchmark datasets, showing that DGAN is competitive with other GANs and that EDGAN outperforms existing GAN ensembles, such as AdaGAN.

2 Discrepancy

Let P_r denote the real data distribution on X, which, without loss of generality, we can assume to be X = {x ∈ R^d : ‖x‖₂ ≤ 1}. A GAN generates a sample in X via the following procedure: it first draws a random noise vector z ∈ Z from a fixed distribution P_z, typically a multivariate Gaussian, and then passes z through the generator g_θ : Z → X, typically a neural network parametrized by θ ∈ Θ. Let P_θ denote the resulting distribution of g_θ(z). Given a distance metric d(·,·) between two distributions, a GAN's learning objective is to minimize d(P_r, P_θ) over θ ∈ Θ.

In Appendix A, we present and discuss two instances of the distance metric d(·,·) and two widely used GANs: the Jensen-Shannon divergence for the standard GAN [Goodfellow et al., 2014], and the Wasserstein distance for WGAN [Arjovsky et al., 2017]. Furthermore, we show that the Wasserstein distance can be viewed as a discrepancy that ignores the hypothesis set and the loss function, which is one of the reasons why it cannot benefit from theoretical guarantees. In this section, we describe the discrepancy measure and motivate its use by showing that it benefits from several important favorable properties.

Consider a hypothesis set H and a symmetric loss function ℓ : Y × Y → R, which will be used in future supervised learning tasks on the true (and probably also the generated) data. Given H and ℓ, the discrepancy between two distributions P and Q is defined by:

    disc_{H,ℓ}(P, Q) = sup_{h,h′∈H} | E_{x∼P}[ℓ(h(x), h′(x))] − E_{x∼Q}[ℓ(h(x), h′(x))] |.    (1)

Equivalently, let ℓ_H = {x ↦ ℓ(h(x), h′(x)) : h, h′ ∈ H} be the family of discriminators induced by ℓ and H; then the discrepancy can be written as disc_{H,ℓ}(P, Q) = sup_{f∈ℓ_H} | E_P[f(x)] − E_Q[f(x)] |.

How would subsequent learning tasks benefit from samples generated by GANs trained with discrepancy? We show that, under mild conditions, any hypothesis performing well on P_θ (with loss function ℓ) is guaranteed to perform well on P_r, as long as the discrepancy disc_{H,ℓ}(P_θ, P_r) is small.

Theorem 1. Assume the true labeling function f : X → Y is contained in the hypothesis set H. Then, for any hypothesis h ∈ H,

    E_{x∼P_r}[ℓ(h, f)] ≤ E_{x∼P_θ}[ℓ(h, f)] + disc_{H,ℓ}(P_θ, P_r).

Theorem 1 shows that the learner can learn a model using samples drawn from P_θ whose expected loss on P_r is guaranteed to be at most its expected loss on P_θ plus the discrepancy, which is minimized by the algorithm. The proof follows from the definition of discrepancy. Due to space limitations, we provide all proofs in Appendix B.

2.1 Hypothesis set and loss function

We argue that discrepancy is more favorable than Wasserstein-type distance measures, since it makes explicit the dependence on the loss function and the hypothesis set. We consider two widely used learning scenarios: the 0-1 loss with linear separators, and the squared loss with Lipschitz functions.

0-1 Loss, Linear Separators. Consider the two distributions on R² illustrated in Figure 1a: Q (filled circles ●) is identical to P (circles ○), but with all points shifted to the right by a small amount ε. Then, by the definition of the Wasserstein distance, W(P, Q) = ε, since to transport P to Q, one need only move each point to the right by ε.
When ε is small, WGAN views the two distributions as close and thus stops training. On the other hand, when ℓ is the 0-1 loss and H is the set of linear separators, disc_{H,ℓ}(P, Q) = 1, which is achieved at the pair h, h′ shown in Figure 1a, with E_P[1_{h(x)≠h′(x)}] = 1 and E_Q[1_{h(x)≠h′(x)}] = 0. Thus, DGAN continues training to push Q towards P.

The example above is an extreme case where P and Q are separable. In more practical scenarios, the domains of the two distributions may overlap significantly, as illustrated in Figure 1b, where P is in red and Q is in blue, and the shaded areas contain 95% of the probability mass. Again, Q equals P shifted to the right by ε, and thus W(P, Q) = ε. Since the non-overlapping area carries a sizable probability mass, the discrepancy between P and Q is still large, for the same reason as in Figure 1a.

These examples demonstrate the importance of taking hypothesis sets and loss functions into account when comparing two distributions: even though two distributions appear geometrically
"close" under the Wasserstein distance, a classifier trained on one distribution may perform poorly on the other.

[Figure 1: Distributions P and Q may appear "close" under the Wasserstein distance, but the discrepancy between the two is still large, where the discrepancy is defined by the 0-1 loss and linear separators. (a) Non-overlapping distributions, P: {○}, Q: {●}. (b) Overlapping distributions, P: {red}, Q: {blue}.]
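The contrast illustrated by Figure 1a can be checked numerically. The sketch below uses a deliberately simplified one-dimensional setting (our illustrative assumption, not the paper's 2-D construction): threshold classifiers h_t(x) = 1[x ≥ t] stand in for linear separators, so the disagreement region of any pair (h_t, h_t′) is an interval, and the empirical discrepancy under the 0-1 loss can be found by brute force over pairs of thresholds.

```python
import itertools

import numpy as np

def w1_shift(P, Q):
    # Wasserstein-1 distance between equal-size 1-D empirical samples:
    # optimal transport on the line matches sorted points.
    return np.mean(np.abs(np.sort(P) - np.sort(Q)))

def disc_01_thresholds(P, Q, grid):
    """Empirical discrepancy for the 0-1 loss with threshold classifiers
    h_t(x) = 1[x >= t]. A pair (t1, t2) disagrees exactly on the interval
    [min(t1, t2), max(t1, t2)), so the discrepancy is the largest
    |P(I) - Q(I)| over such intervals."""
    best = 0.0
    for t1, t2 in itertools.combinations(grid, 2):
        lo, hi = min(t1, t2), max(t1, t2)
        p = np.mean((P >= lo) & (P < hi))
        q = np.mean((Q >= lo) & (Q < hi))
        best = max(best, abs(p - q))
    return best

eps = 0.01
P = np.zeros(50)          # all "real" points at 0
Q = np.full(50, eps)      # generated points, shifted right by eps
grid = np.linspace(-0.05, 0.06, 100)
print(w1_shift(P, Q))                  # ~ 0.01: tiny transport cost
print(disc_01_thresholds(P, Q, grid))  # 1.0: a strip separates the supports
```

As in Figure 1a, an interval placed between the two supports captures all of P and none of Q, so the discrepancy equals 1 while the Wasserstein distance is only ε; a discrepancy-trained GAN would keep pushing Q toward P.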
According to Theorem 1, such unfortunate behaviors are less likely to happen with disc_{H,ℓ}.

Squared Loss, Lipschitz Functions. Next, we consider the squared loss and the hypothesis set of 1-Lipschitz functions H = {h : |h(x) − h(x′)| ≤ ‖x − x′‖₂, ∀x, x′ ∈ X}; then ℓ_H = {[h(x) − h′(x)]² : h, h′ ∈ H}. We can show that ℓ_H is a subset of the 4-Lipschitz functions on X. Then, by the definitions of discrepancy and of the Wasserstein distance, disc_{H,ℓ}(P, Q) is comparable to W(P, Q):

    disc_{H,ℓ}(P, Q) = sup_{f∈ℓ_H} E_P[f(x)] − E_Q[f(x)] ≤ sup_{f: 4-Lipschitz} E_P[f(x)] − E_Q[f(x)] = 4 W(P, Q).

However, the inequality above can be quite loose since, depending on the hypothesis set, ℓ_H may be only a small subset of all 4-Lipschitz functions. For instance, when H is the set of linear functions with norm bounded by one, ℓ_H = {(wᵀx)² : ‖w‖ ≤ 2}, a significantly smaller set than the family of all 4-Lipschitz functions. Thus, disc_{H,ℓ}(P, Q) can be a tighter measure than W(P, Q), depending on H.

2.2 Continuity and estimation

In this section, we discuss two favorable properties of discrepancy: its continuity, under mild assumptions, with respect to the generator's parameter θ, a property shared with the Wasserstein distance; and the fact that it can be accurately estimated from finite samples, which holds for neither the Jensen-Shannon divergence nor the Wasserstein distance. The continuity property is summarized in the following theorem.

Theorem 2. Let H = {h : X → Y} be a family of µ-Lipschitz functions, and assume that the loss function ℓ is continuous and symmetric in its arguments, and is bounded by M. Assume further that ℓ satisfies the triangle inequality, or that it can be written as ℓ(y, y′) = f(|y − y′|) for some Lipschitz function f. Assume that g_θ : Z → X is continuous in θ.
Then, disc_{H,ℓ}(P_r, P_θ) is continuous in θ.

The assumptions of Theorem 2 are easily satisfied in practice, where h ∈ H and g_θ are neural networks whose parameters are restricted to a compact set, and where the loss function can be either the ℓ₁ loss, ℓ(y, y′) = |y − y′|, or the squared loss, ℓ(y, y′) = (y − y′)². If the discrepancy is continuous in θ, then, as the sequence of parameters θ_t converges to θ*, the discrepancy also converges: |disc_{H,ℓ}(P_r, P_{θ_t}) − disc_{H,ℓ}(P_r, P_{θ*})| → 0, which is a desirable property for training DGAN. The reader is referred to Arjovsky et al. [2017] for a more extensive discussion of the continuity properties of various distance metrics and their effects on training GANs.

Next, we show that discrepancy can be accurately estimated from finite samples. Let S_r and S_θ be i.i.d. samples drawn from P_r and P_θ with |S_r| = m and |S_θ| = n, and let P̂_r and P̂_θ be the empirical distributions induced by S_r and S_θ, respectively. Recall that the empirical Rademacher complexity of a hypothesis set G on a sample S of size m is defined by:

    R̂_S(G) = (2/m) E_σ[ sup_{g∈G} Σ_{i=1}^m σ_i g(x_i) ],

where σ₁, σ₂, ..., σ_m are i.i.d. random variables with P(σ_i = 1) = P(σ_i = −1) = 1/2. The empirical Rademacher complexity measures the complexity of the hypothesis set G. The next theorem presents the estimation guarantees for discrepancy.

Theorem 3. Assume the loss is bounded: ℓ ≤ M. For any δ > 0, with probability at least 1 − δ over the draw of S_r and S_θ,

    |disc_{H,ℓ}(P_r, P_θ) − disc_{H,ℓ}(P̂_r, P̂_θ)| ≤ R̂_{S_r}(ℓ_H) + R̂_{S_θ}(ℓ_H) + 3M (√(log(4/δ)/(2m)) + √(log(4/δ)/(2n))).

Furthermore, when the loss function ℓ(h, h′) is a q-Lipschitz function of h − h′, we have

    |disc_{H,ℓ}(P_r, P_θ) − disc_{H,ℓ}(P̂_r, P̂_θ)| ≤ 4q (R̂_{S_r}(H) + R̂_{S_θ}(H)) + 3M (√(log(4/δ)/(2m)) + √(log(4/δ)/(2n))).

In the rest of this paper, we consider the squared loss ℓ(y, y′) = (y − y′)², which is bounded and 2-Lipschitz when |h(x)| ≤ 1 for all h ∈ H and x ∈ X. Furthermore, when H is a family of feedforward neural networks, Cortes et al. [2017] provide an explicit upper bound R̂_S(H) = O(1/√m) on its complexity, and thus the right-hand side of the above inequality is in O(1/√m + 1/√n). Then, for m and n sufficiently large, the empirical discrepancy is close to the true discrepancy. The fact that the discrepancy can be accurately estimated from finite samples is important since, when training DGAN, we can only approximate the true discrepancy with a batch of samples. In contrast, the Jensen-Shannon divergence and the Wasserstein distance do not admit this favorable property [Arora et al., 2017].

3 Algorithms

In this section, we show how to compute the discrepancy and train DGAN for various hypothesis sets and the squared loss. We also propose to learn an ensemble of pre-trained GANs by minimizing discrepancy. We name this method EDGAN and present its learning guarantees.

3.1 DGAN algorithm

Given a parametric family of hypotheses H = {h_w : w ∈ W}, DGAN is defined as the following min-max optimization problem:

    min_{θ∈Θ} max_{w,w′∈W} E_{x∼P_r}[ℓ(h_w(x), h_{w′}(x))] − E_{x∼P_θ}[ℓ(h_w(x), h_{w′}(x))].    (2)

As with other GANs, DGAN is trained by iteratively solving the min-max problem (2). The minimization over the generator's parameters θ can be tackled by the standard stochastic gradient descent (SGD) algorithm with back-propagation. The inner maximization problem, which computes the discrepancy, can be solved efficiently when ℓ is the squared loss.

We first consider H to be the set of linear functions with bounded norm: H = {x ↦ wᵀx : ‖w‖₂ ≤ 1, w ∈ R^d}. Recall the definitions of S_r, S_θ, P̂_r and P̂_θ from Section 2.2. In addition, let X_r and X_θ denote the corresponding m × d and n × d data matrices, where each row represents one input.

Proposition 4. When ℓ is the squared loss and H is the family of linear functions with norm bounded by 1,

    disc_{H,ℓ}(P̂_r, P̂_θ) = 2 ‖(1/n) X_θᵀ X_θ − (1/m) X_rᵀ X_r‖₂,

where ‖·‖₂ denotes the spectral norm.

Thus, the discrepancy disc_{H,ℓ}(P̂_r, P̂_θ) equals twice the largest eigenvalue in absolute value of the data-dependent matrix M(θ) = (1/n) X_θᵀ X_θ − (1/m) X_rᵀ X_r. Given v*(θ), the corresponding eigenvector at the optimal solution, we can then back-propagate the loss disc_{H,ℓ}(P̂_r, P̂_θ) = 2 v*(θ)ᵀ M(θ) v*(θ) to optimize θ. The maximum or minimum eigenvalue of M(θ) can be computed in O(d²) [Golub and Van Loan, 1996], and the power method can be used to closely approximate it.

The closed-form solution of Proposition 4 holds for a family H of linear mappings. To generate realistic outputs with DGAN, however, we need a more complex hypothesis set H, such as a family of deep neural networks (DNNs). Thus, we adopt the following approach: first, we fix a pre-trained DNN classifier, such as the Inception network, and pass the samples through this network to obtain the last (or any other) layer of embeddings f : X → E, where E is the embedding space. Next, we compute the discrepancy on the embedded samples with H being the family of linear functions with bounded norm, which admits a closed-form solution according to Proposition 4.
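Under the stated assumptions (squared loss, unit-norm linear hypotheses), the closed form of Proposition 4 is straightforward to compute. The sketch below, with hypothetical sample matrices, evaluates it with NumPy's spectral norm:

```python
import numpy as np

def empirical_discrepancy(X_r, X_t):
    """Closed form of Proposition 4: squared loss, linear hypotheses
    with L2 norm bounded by 1.

    X_r: (m, d) real samples; X_t: (n, d) generated samples.
    Returns twice the spectral norm of the second-moment difference."""
    m, n = X_r.shape[0], X_t.shape[0]
    M = X_t.T @ X_t / n - X_r.T @ X_r / m
    # Spectral norm = largest singular value; for the symmetric M this
    # is the largest eigenvalue in absolute value.
    return 2.0 * np.linalg.norm(M, 2)

rng = np.random.default_rng(0)
X_real = rng.normal(size=(500, 8))          # "real" batch
X_bad = 2.0 * rng.normal(size=(400, 8))     # generated batch, wrong scale
X_good = rng.normal(size=(400, 8))          # generated batch, right scale
assert empirical_discrepancy(X_real, X_bad) > empirical_discrepancy(X_real, X_good)
```

A generator whose second moments mismatch the real data (here, the wrong scale) receives a larger discrepancy, which is exactly the signal back-propagated to θ in DGAN.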
In practice, it also makes sense to train the embedding network together with the generator: let f_ζ be the embedding network parametrized by ζ; then DGAN optimizes both f_ζ and g_θ. See Algorithm 1 for a single update step of DGAN. In particular, the learner can either compute F(ζ_t, θ_t) exactly, or use an approximation based on the power method. Note that when the learner uses a pre-fixed embedding network f, the update step for ζ_{t+1} can be skipped.

Algorithm 1 UPDATE DGAN(ζ_t, θ_t, η)
  X_r ← [f_{ζ_t}(x_1), ..., f_{ζ_t}(x_m)]ᵀ, where x_i ∼ P_r
  X_θ ← [f_{ζ_t}(x′_1), ..., f_{ζ_t}(x′_n)]ᵀ, where x′_i ∼ P_{θ_t}
  F(ζ_t, θ_t) ← ‖(1/n) X_θᵀ X_θ − (1/m) X_rᵀ X_r‖₂
  Update: ζ_{t+1} ← ζ_t + η ∇_ζ F(ζ_t, θ_t)
  Update: θ_{t+1} ← θ_t − η ∇_θ F(ζ_t, θ_t)

Algorithm 2 UPDATE EDGAN(α_t, f, η)
  X_r ← [f(x_1), ..., f(x_{n_r})]ᵀ, where x_i ∼ P_r
  X_k ← [f(x^k_1), ..., f(x^k_{n_k})]ᵀ, where x^k_i ∼ P_{θ_k}, for k = 1, ..., p
  F(α_t) ← ‖Σ_{k=1}^p (α_{t,k}/n_k) X_kᵀ X_k − (1/n_r) X_rᵀ X_r‖₂
  Update: α_{t+1} ← α_t − η ∇_α F(α_t)

3.2 EDGAN algorithm

Next, we show that discrepancy provides a principled way of choosing the ensemble weights used to mix pre-trained GANs, one which admits favorable convergence guarantees.

Let g_1, ..., g_p be p pre-trained GANs. For a given mixture weight α = (α_1, ..., α_p) ∈ Δ, where Δ = {(α_1, ..., α_p) : α_k ≥ 0, Σ_{k=1}^p α_k = 1} is the simplex in R^p, we define the ensemble of the p GANs by g_α = Σ_{k=1}^p α_k g_k.
To draw a sample from the ensemble g_α, we first sample an index k ∈ [p] = {1, 2, ..., p} according to the multinomial distribution with parameter α, and then return a random sample generated by the chosen GAN g_k. We denote by P_α the distribution of g_α. EDGAN determines the mixture weight α by minimizing the discrepancy between P_α and the real data distribution P_r: min_{α∈Δ} disc_{H,ℓ}(P_α, P_r).

To learn the mixture weight α, we approximate the true distributions by their empirical counterparts: for each k ∈ [p], we randomly draw a set of n_k samples from g_k, and we randomly draw n_r samples from the real data distribution P_r. Let S_k and S_r denote the corresponding sets of samples, and let P̂_k and P̂_r denote the induced empirical distributions, respectively. For a given α, let P̂_α = Σ_{k=1}^p α_k P̂_k be the empirical counterpart of P_α. We first present a convergence result for the EDGAN method, and then describe how to train EDGAN.

Let α* and α̂ be the discrepancy minimizers under the true and the empirical distributions, respectively:

    α* = argmin_{α∈Δ} disc_{H,ℓ}(P_α, P_r),    α̂ = argmin_{α∈Δ} disc_{H,ℓ}(P̂_α, P̂_r).

For simplicity, we set n_k = n_r = n for all k ∈ [p], but the following result can easily be extended to arbitrary batch sizes for each generator.

Theorem 5. For any δ > 0, with probability at least 1 − δ over the draw of samples,

    |disc_{H,ℓ}(P_α̂, P_r) − disc_{H,ℓ}(P_{α*}, P_r)| ≤ 2 (R̂_S(ℓ_H) + 3M √(log[4(p + 1)/δ]/(2n))),

where R̂_S(ℓ_H) = max{R̂_{S_1}(ℓ_H), ..., R̂_{S_p}(ℓ_H), R̂_{S_r}(ℓ_H)}. Furthermore, when the loss function ℓ(h, h′) is a q-Lipschitz function of h − h′, the following holds with probability 1 − δ:

    |disc_{H,ℓ}(P_α̂, P_r) − disc_{H,ℓ}(P_{α*}, P_r)| ≤ 2 (4q R̂_S(H) + 3M √(log[4(p + 1)/δ]/(2n))),

where R̂_S(H) = max{R̂_{S_1}(H), ..., R̂_{S_p}(H), R̂_{S_r}(H)}.

When ℓ is the squared loss and H is the family of feedforward neural networks, the upper bound on R̂_S(ℓ_H) is in O(1/√n).
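The two-stage sampling procedure for g_α described above can be sketched as follows; the "generators" here are hypothetical one-dimensional samplers standing in for pre-trained GANs, since any sampler with the same interface would do:

```python
import numpy as np

def sample_ensemble(generators, alpha, n, rng):
    """Draw n samples from the mixture g_alpha: pick a generator index
    k ~ Multinomial(alpha) for each sample, then let the chosen
    generator g_k produce the sample."""
    ks = rng.choice(len(generators), size=n, p=alpha)
    return np.array([generators[k](rng) for k in ks])

# Hypothetical pre-trained "generators": each maps noise to one sample
# and covers a different mode of the data.
g1 = lambda rng: rng.normal(loc=-2.0)   # left mode
g2 = lambda rng: rng.normal(loc=+2.0)   # right mode
rng = np.random.default_rng(0)
samples = sample_ensemble([g1, g2], alpha=[0.5, 0.5], n=1000, rng=rng)
# with equal weights, roughly half the samples come from each mode
print((samples > 0).mean())
```

This mixture view is what lets an ensemble cover modes that any single pre-trained generator misses; EDGAN only has to choose α.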
Since we can generate unlimited samples from each of the $p$ pre-trained GANs, $n$ can be as large as the number of available real samples, and thus the discrepancy between the learned ensemble $P_{\widehat{\alpha}}$ and the real data $P_r$ can be very close to the discrepancy between the optimal ensemble $P_{\alpha^*}$ and the real data $P_r$. This is a very favorable generalization guarantee for EDGAN: it suggests that the mixture weight learned on the training data is guaranteed to generalize and perform well on the test data, a fact also corroborated by our experiments.

To compute the discrepancy for EDGAN, we again begin with linear mappings $H = \{x \mapsto w^\top x : \|w\|_2 \le 1,\ w \in \mathbb{R}^d\}$. For each generator $k \in [p]$, we obtain an $n_k \times d$ data matrix $X_k$, and similarly we have the $n_r \times d$ data matrix $X_r$ for the real samples. Then, by the proof of Proposition 4, discrepancy minimization can be written as

$$\min_{\alpha \in \Delta} \mathrm{disc}_{H,\ell}(\widehat{P}_\alpha, \widehat{P}_r) = 2 \min_{\alpha \in \Delta} \|M(\alpha)\|_2, \quad \text{with } M(\alpha) = \sum_{k=1}^p \frac{\alpha_k}{n_k} X_k^\top X_k - \frac{1}{n_r} X_r^\top X_r. \qquad (3)$$

Figure 2: Random samples from DGAN trained on MNIST.

Figure 3: Random samples from DGAN trained on CIFAR10.

Since $M(\alpha)$ and $-M(\alpha)$ are affine, and thus convex, functions of $\alpha$, $\|M(\alpha)\|_2 = \sup_{\|v\|_2 \le 1} |v^\top M(\alpha) v|$ is also convex in $\alpha$, as the supremum of a set of convex functions is convex. Thus, problem (3) is a convex optimization problem, thereby benefitting from strong convergence guarantees.

Note that $\|M(\alpha)\|_2 = \max\{\lambda_{\max}(M(\alpha)), \lambda_{\max}(-M(\alpha))\}$. Thus, one way to solve problem (3) is to cast it as a semi-definite programming (SDP) problem:

$$\min_{\alpha, \lambda}\ \lambda, \quad \text{s.t.}\ \lambda I - M(\alpha) \succeq 0,\ \lambda I + M(\alpha) \succeq 0,\ \alpha \ge 0,\ \mathbf{1}^\top \alpha = 1.$$

An alternative solution consists of using the power method to approximate the spectral norm, which is faster when the sample dimension $d$ is large.
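The objective of problem (3) can be evaluated directly; below is a minimal NumPy sketch with synthetic stand-in data matrices. It illustrates both that weighting toward the generator whose samples match the real data lowers the objective, and the convexity of the objective in $\alpha$.

```python
import numpy as np

def edgan_objective(alpha, X_gens, X_r):
    """2 * || sum_k (alpha_k / n_k) X_k^T X_k - (1 / n_r) X_r^T X_r ||_2,
    the objective of problem (3) for a mixture weight alpha on the simplex."""
    M = -X_r.T @ X_r / X_r.shape[0]
    for a_k, X_k in zip(alpha, X_gens):
        M = M + a_k * X_k.T @ X_k / X_k.shape[0]
    return 2.0 * np.linalg.norm(M, 2)   # ord=2 is the spectral norm

rng = np.random.default_rng(0)
X_r = rng.standard_normal((500, 3))          # "real" samples
X_1 = rng.standard_normal((500, 3))          # generator matching the real law
X_2 = 3.0 * rng.standard_normal((500, 3))    # badly mis-scaled generator
```

Putting all the weight on the matching generator yields a far smaller objective than putting it on the mis-scaled one, and the value at any intermediate $\alpha$ is never worse than the corresponding average of the endpoints.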
As with DGAN, we can also consider a more complex hypothesis set $H$ by first passing samples through an embedding network $f$, and then letting $H$ be the set of linear mappings on the embedded samples. Since the generators are already pre-trained for EDGAN, we no longer need to train the embedding network and instead keep it fixed. See Algorithm 2 for one training step of EDGAN.

4 Experiments

4.1 DGAN

In this section, we show that DGAN obtains competitive results on the benchmark datasets MNIST, CIFAR10, CIFAR100, and CelebA (at resolution 128 × 128). We performed unconditional generation and did not use the labels in the datasets. We trained both the discriminator's embedding layer and the generator with the discrepancy loss, as in Algorithm 1. Note that we did not attempt to optimize the architecture and other hyperparameters to obtain state-of-the-art results; we used a standard DCGAN architecture. The main architectural modification for DGAN is that the final dense layer of the discriminator has output dimension greater than 1, since in DGAN the discriminator outputs an embedding rather than a single score. The size of this embedding layer is a hyperparameter that can be tuned, but we refrained from doing so here. See Table 6 in Appendix C for the DGAN architectures.
One important observation is that larger embedding layers require more samples to accurately estimate the population covariance matrix of the embedding layer under the data and generated distributions (and hence the spectral norm of their difference).

To enforce the Lipschitz assumption of our theorems, weight clipping [Arjovsky et al., 2017], gradient penalization [Gulrajani et al., 2017], spectral normalization [Miyato et al., 2018], or some combination of these can be used. We found gradient penalization useful for its stabilizing effect on training, and obtained the best performance with gradient penalization combined with weight clipping. Table 1 lists the Inception score (IS) and Fréchet Inception distance (FID) on the various datasets. All results are the best of five trials. While our scores are not state-of-the-art [Brock et al., 2019], they are close to those achieved by similar unconditional DCGANs [Miyato et al., 2018, Lucic et al., 2018]. Figures 2-5 show samples from a trained DGAN that are not cherry-picked.

Figure 4: Random samples from DGAN trained on CIFAR100.

Figure 5: Random samples from DGAN trained on CelebA at resolution 128 × 128.

Table 1: Inception Score (IS) and Fréchet Inception Distance (FID) for various datasets.

Dataset     IS     FID (train)   FID (test)
CIFAR10     7.02   26.7          30.7
CIFAR100    7.31   28.9          33.3
CelebA      2.15   59.2          -

4.2 EDGAN

Toy example  We first considered the toy datasets described in Section 4.1 of AdaGAN [Tolstikhin et al., 2017], where we can explicitly compare various GANs with well-defined, likelihood-based performance metrics. The true data distribution is a mixture of 9 isotropic Gaussian components on $X = \mathbb{R}^2$, with their centers uniformly distributed on a circle.
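A minimal sketch of this toy target distribution follows; the circle radius, component variance, and evenly spaced centers are illustrative assumptions, not the exact parameters used in the experiments.

```python
import numpy as np

def sample_toy_mixture(n, n_components=9, radius=5.0, sigma=0.25, seed=0):
    """Mixture of isotropic Gaussians on R^2 whose centers lie on a circle
    (radius and sigma are illustrative choices, not the paper's values)."""
    rng = np.random.default_rng(seed)
    angles = 2.0 * np.pi * np.arange(n_components) / n_components
    centers = radius * np.stack([np.cos(angles), np.sin(angles)], axis=1)
    idx = rng.integers(n_components, size=n)   # uniform component weights
    return centers[idx] + sigma * rng.standard_normal((n, 2))
```

Every draw picks a component uniformly at random and adds isotropic Gaussian noise around its center, so all samples concentrate near the circle of centers.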
We used the AdaGAN algorithm to sequentially generate 10 GANs, and compared various ensembles of these 10 networks: GAN1, generated by the baseline GAN algorithm; Ada5 and Ada10, generated by AdaGAN with the first 5 or 10 GANs, respectively; and EDGAN5 and EDGAN10, the EDGAN ensembles of the first 5 or 10 GANs, respectively.

We ran the EDGAN algorithm with the squared loss and linear mappings. To measure performance, we computed the likelihood of the generated data under the true distribution, $L(S_\theta)$, and the likelihood of the true data under the generated distribution, $L(S_r)$. We used kernel density estimation with a cross-validated bandwidth to approximate the density of both $P_\theta$ and $P_r$, as in Tolstikhin et al. [2017]. We report a subset of the ensembles here and present the full results in Appendix C. Table 2 compares the two likelihood-based metrics averaged over 10 repetitions, with standard deviations in parentheses.

Table 2: Likelihood-based metrics of various ensembles of 10 GANs.

           L(Sr)             L(Sθ)
GAN1       -12.39 (± 2.12)   -796.05 (± 12.48)
Ada10      -4.33 (± 0.30)    -266.60 (± 24.91)
EDGAN10    -3.99 (± 0.20)    -148.97 (± 14.13)

Figure 6: The true (red) and the generated (blue) distributions using (a) GAN1; (b) Ada10; (c) EDGAN10.

Table 3: Each row uses a different embedding to calculate the discrepancy between the generated images and the CIFAR10 test set.

                 GAN1    GAN2    GAN3    GAN4    GAN5    Best GAN  Average  EDGAN
InceptionLogits  285.09  259.61  259.64  271.21  272.23  259.61    259.12   255.3
InceptionPool    70.52   64.37   69.48   69.69   68.7    64.37     66.08    63.98
MobileNet        109.09  90.47   88.01   90.9    93.08   88.01     85.71    81.83
PNASNet          35.18   36.42   34.94   34.38   36.52   34.38     34.66    33.97
NASNet           54.61   52.66   59.01   61.79   64.97   52.66     55.66    52.46
AmoebaNet        97.71   110.83  108.61  105.31  110.5   97.71     104.91   97.71

We can see that for both metrics,
ensembles of networks learned by EDGAN outperformed AdaGAN with the same number of base networks. Figure 6 shows the true distribution (in red) and the generated distribution (in blue). The single GAN model (Figure 6(a)) does not work well. As AdaGAN gradually mixes in more networks, the generated distribution gets closer to the true distribution (Figure 6(b)). By explicitly learning the mixture weights using discrepancy, EDGAN10 (Figure 6(c)) further improves over Ada10: the spread of the generated distribution is reduced, and it now concentrates closely around the true one.

CIFAR10  We used five pre-trained generators from Lucic et al. [2018] (all publicly available on TF-Hub) as base learners in the ensemble. The models were trained with different hyperparameters and had different levels of performance. We then took 50k samples from each generator and from the training split of CIFAR10, and embedded these images using a pre-trained classifier. We used several embeddings: InceptionV3's logits layer [Szegedy et al., 2016], InceptionV3's pooling layer [Szegedy et al., 2016], MobileNet [Sandler et al., 2018], PNASNet [Liu et al., 2018a], NASNet [Zoph and Le, 2017], and AmoebaNet [Real et al., 2019]. All of these models are also available on TF-Hub. For each embedding, we trained an ensemble and evaluated its discrepancy on the test set of CIFAR10 and on 10k independent samples from each generator. We report these results in Table 3. In all cases, EDGAN performs as well as or better than the best individual generator and the uniform average of the generators. This also shows that discrepancy generalizes well from the training to the test data. Interestingly, depending on which embedding is used for the ensemble, drastically different mixture weights are optimal, which demonstrates the importance of the hypothesis class for discrepancy.
We list the learned ensemble weights in Table 5 in Appendix C.

5 Conclusion

We advocated the use of discrepancy for defining GANs and proved a series of favorable properties for it, including continuity under mild assumptions, the possibility of accurately estimating it from finite samples, and the generalization guarantees it benefits from. We also showed empirically that DGAN is competitive with other GANs, and that EDGAN, which we showed can be formulated as a convex optimization problem, outperforms existing GAN ensembles. For future work, one can use generative models with discrepancy in adaptation, as shown in Appendix D, where the goal is to learn a feature embedding for the target domain such that its distribution is close to the distribution of the embedded source domain. DGAN also has connections with standard Maximum Entropy models (Maxent), as discussed in Appendix E.

Acknowledgments

This work was partly supported by NSF CCF-1535987, NSF IIS-1618662, and a Google Research Award. We thank Judy Hoffman for helpful pointers to the literature.

References

M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning (ICML), pages 214-223, 2017.

S. Arora, R. Ge, Y. Liang, T. Ma, and Y. Zhang. Generalization and equilibrium in generative adversarial nets (GANs). In International Conference on Machine Learning (ICML), pages 224-232, 2017.

S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira. Analysis of representations for domain adaptation. In Advances in Neural Information Processing Systems, pages 137-144, 2007.

A. Brock, J. Donahue, and K. Simonyan. Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations (ICLR), 2019.

C. Cortes and M. Mohri. Domain adaptation and sample bias correction theory and algorithm for regression. Theor.
Comput. Sci., 519:103-126, 2014.

C. Cortes, X. Gonzalvo, V. Kuznetsov, M. Mohri, and S. Yang. AdaNet: Adaptive structural learning of artificial neural networks. In International Conference on Machine Learning (ICML), pages 874-883, 2017.

A. Creswell, T. White, V. Dumoulin, K. Arulkumaran, B. Sengupta, and A. A. Bharath. Generative adversarial networks: An overview. IEEE Signal Processing Magazine, 35(1):53-65, 2018.

I. Deshpande, Z. Zhang, and A. G. Schwing. Generative modeling using the sliced Wasserstein distance. In Computer Vision and Pattern Recognition (CVPR), pages 3483-3491, 2018.

M. D. Donsker and S. S. Varadhan. Asymptotic evaluation of certain Markov process expectations for large time, I. Communications on Pure and Applied Mathematics, 28(1):1-47, 1975.

G. K. Dziugaite, D. M. Roy, and Z. Ghahramani. Training generative neural networks via maximum mean discrepancy optimization. In Conference on Uncertainty in Artificial Intelligence (UAI), pages 258-267, 2015.

S. Feizi, C. Suh, F. Xia, and D. Tse. Understanding GANs: the LQG setting. arXiv preprint arXiv:1710.10793, 2017.

M. Frid-Adar, I. Diamant, E. Klang, M. Amitai, J. Goldberger, and H. Greenspan. GAN-based synthetic medical image augmentation for increased CNN performance in liver lesion classification. Neurocomputing, 321:321-331, 2018.

A. Ghosh, V. Kulharia, V. P. Namboodiri, P. H. Torr, and P. K. Dokania. Multi-agent diverse generative adversarial networks. In Computer Vision and Pattern Recognition (CVPR), pages 8513-8521, 2018.

G. H. Golub and C. F. Van Loan. Matrix Computations. The Johns Hopkins University Press, 3rd edition, 1996.

I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672-2680, 2014.

I. J. Goodfellow.
NIPS 2016 tutorial: Generative adversarial networks. arXiv preprint arXiv:1701.00160, 2017.

A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. J. Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723-773, 2012.

I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pages 5769-5779, 2017.

Q. Hoang, T. D. Nguyen, T. Le, and D. Q. Phung. MGAN: training generative adversarial nets with multiple generators. In International Conference on Learning Representations (ICLR), 2018.

Y. Jin, J. Zhang, M. Li, Y. Tian, H. Zhu, and Z. Fang. Towards the automatic anime characters creation with generative adversarial networks. arXiv preprint arXiv:1708.05509, 2017.

T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial networks. arXiv preprint arXiv:1812.04948, 2018.

V. Kuznetsov and M. Mohri. Learning theory and algorithms for forecasting non-stationary time series. In Advances in Neural Information Processing Systems, pages 541-549, 2015.

C. Li, W. Chang, Y. Cheng, Y. Yang, and B. Póczos. MMD GAN: towards deeper understanding of moment matching network. In Advances in Neural Information Processing Systems, pages 2200-2210, 2017.

Y. Li, K. Swersky, and R. S. Zemel. Generative moment matching networks. In International Conference on Machine Learning (ICML), volume 37, pages 1718-1727, 2015.

C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L.-J. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy. Progressive neural architecture search. In European Conference on Computer Vision (ECCV), pages 19-34, 2018a.

R. Liu, N. Fusi, and L. Mackey. Model compression with Generative Adversarial Networks. arXiv preprint arXiv:1812.02271, 2018b.

M. Lucic, K.
Kurach, M. Michalski, S. Gelly, and O. Bousquet. Are GANs created equal? A large-scale study. In Advances in Neural Information Processing Systems, pages 700-709, 2018.

Y. Mansour, M. Mohri, and A. Rostamizadeh. Domain adaptation: Learning bounds and algorithms. In Conference on Learning Theory (COLT), 2009.

X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. Paul Smolley. Least squares generative adversarial networks. In International Conference on Computer Vision (ICCV), pages 2794-2802, 2017.

T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations (ICLR), 2018.

M. Mohri and A. M. Medina. New analysis and algorithm for learning with drifting distributions. In Algorithmic Learning Theory (ALT), pages 124-138, 2012.

Y. Mroueh, T. Sercu, and V. Goel. McGan: Mean and covariance feature matching GAN. In International Conference on Machine Learning (ICML), pages 2527-2535, 2017.

S. Nowozin, B. Cseke, and R. Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pages 271-279, 2016.

E. Real, A. Aggarwal, Y. Huang, and Q. V. Le. Regularized evolution for image classifier architecture search. In AAAI Conference on Artificial Intelligence, pages 4780-4789, 2019.

T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pages 2234-2242, 2016.

M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Computer Vision and Pattern Recognition (CVPR), pages 4510-4520, 2018.

C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision.
In Computer Vision and Pattern Recognition (CVPR), pages 2818-2826, 2016.

I. O. Tolstikhin, S. Gelly, O. Bousquet, C.-J. Simon-Gabriel, and B. Schölkopf. AdaGAN: Boosting generative models. In Advances in Neural Information Processing Systems, pages 5424-5433, 2017.

E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. In Computer Vision and Pattern Recognition (CVPR), pages 2962-2971, 2017.

C. Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.

H. Zenati, C. S. Foo, B. Lecouat, G. Manek, and V. R. Chandrasekhar. Efficient GAN-based anomaly detection. arXiv preprint arXiv:1802.06222, 2018.

B. Zoph and Q. V. Le. Neural architecture search with reinforcement learning. In International Conference on Learning Representations (ICLR), 2017.