{"title": "Fisher GAN", "book": "Advances in Neural Information Processing Systems", "page_first": 2513, "page_last": 2523, "abstract": "Generative Adversarial Networks (GANs) are powerful models for learning complex distributions. Stable training of GANs has been addressed in many recent works which explore different metrics between distributions. In this paper we introduce Fisher GAN that fits within the Integral Probability Metrics (IPM) framework for training GANs. Fisher GAN defines a data dependent constraint on the second order moments of the critic. We show in this paper that Fisher GAN allows for stable and time efficient training that does not compromise the capacity of the critic, and does not need data independent constraints such as weight clipping. We analyze our Fisher IPM theoretically and provide an algorithm based on Augmented Lagrangian for Fisher GAN. We validate our claims on both image sample generation and semi-supervised classification using Fisher GAN.", "full_text": "Fisher GAN\n\nYoussef Mroueh\u21e4, Tom Sercu\u21e4\n\nmroueh@us.ibm.com, tom.sercu1@ibm.com\n\n\u21e4 Equal Contribution\n\nAI Foundations, IBM Research AI\nIBM T.J Watson Research Center\n\nAbstract\n\nGenerative Adversarial Networks (GANs) are powerful models for learning com-\nplex distributions. Stable training of GANs has been addressed in many recent\nworks which explore different metrics between distributions. In this paper we\nintroduce Fisher GAN which \ufb01ts within the Integral Probability Metrics (IPM)\nframework for training GANs. Fisher GAN de\ufb01nes a critic with a data dependent\nconstraint on its second order moments. We show in this paper that Fisher GAN\nallows for stable and time ef\ufb01cient training that does not compromise the capacity\nof the critic, and does not need data independent constraints such as weight clip-\nping. We analyze our Fisher IPM theoretically and provide an algorithm based on\nAugmented Lagrangian for Fisher GAN. 
We validate our claims on both image\nsample generation and semi-supervised classi\ufb01cation using Fisher GAN.\n\n1\n\nIntroduction\n\nGenerative Adversarial Networks (GANs) [1] have recently become a prominent method to learn\nhigh-dimensional probability distributions. The basic framework consists of a generator neural\nnetwork which learns to generate samples which approximate the distribution, while the discriminator\nmeasures the distance between the real data distribution, and this learned distribution that is referred\nto as fake distribution. The generator uses the gradients from the discriminator to minimize the\ndistance with the real data distribution. The distance between these distributions was the object of\nstudy in [2], and highlighted the impact of the distance choice on the stability of the optimization. The\noriginal GAN formulation optimizes the Jensen-Shannon divergence, while later work generalized\nthis to optimize f-divergences [3], KL [4], the Least Squares objective [5]. Closely related to our\nwork, Wasserstein GAN (WGAN) [6] uses the earth mover distance, for which the discriminator\nfunction class needs to be constrained to be Lipschitz. To impose this Lipschitz constraint, WGAN\nproposes to use weight clipping, i.e. a data independent constraint, but this comes at the cost of\nreducing the capacity of the critic and high sensitivity to the choice of the clipping hyper-parameter.\nA recent development Improved Wasserstein GAN (WGAN-GP) [7] introduced a data dependent\nconstraint namely a gradient penalty to enforce the Lipschitz constraint on the critic, which does not\ncompromise the capacity of the critic but comes at a high computational cost.\nWe build in this work on the Integral probability Metrics (IPM) framework for learning GAN of [8].\nIntuitively the IPM de\ufb01nes a critic function f, that maximally discriminates between the real and\nfake distributions. 
We propose a theoretically sound and time-efficient data-dependent constraint on the critic of Wasserstein GAN that allows stable GAN training without compromising the capacity of the critic. Where WGAN-GP penalizes the gradients of the critic, Fisher GAN imposes a constraint on the second order moments of the critic. This extension of the IPM framework is inspired by Fisher Discriminant Analysis.
The main contributions of our paper are:

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

1. We introduce in Section 2 the Fisher IPM, a scaling invariant distance between distributions. Fisher IPM introduces a data-dependent constraint on the second order moments of the critic that discriminates between the two distributions. Such a constraint ensures the boundedness of both the metric and the critic. We show in Section 2.2 that Fisher IPM, when approximated with neural networks, corresponds to a discrepancy between whitened mean feature embeddings of the distributions; in other words, a mean feature discrepancy measured with a Mahalanobis distance in the space computed by the neural network.
2. We show in Section 3 that Fisher IPM corresponds to the Chi-squared distance (χ²) when the critic has unlimited capacity (the critic belongs to a universal hypothesis function class). Moreover we prove in Theorem 2 that even when the critic is parametrized by a neural network, it approximates the χ² distance up to a factor which is an inner product between the optimal critic and the neural network critic. We finally derive generalization bounds for the critic learned from samples of the two distributions, assessing the statistical error and its convergence to the Chi-squared distance as a function of the sample size.
3.
We use Fisher IPM as a GAN objective¹ and formulate an algorithm that combines desirable properties (Table 1): a stable and meaningful loss between distributions for GAN as in Wasserstein GAN [6], at a low computational cost similar to simple weight clipping, while not compromising the capacity of the critic via a data-dependent constraint, but at a much lower computational cost than [7]. Fisher GAN achieves strong semi-supervised learning results without need of batch normalization in the critic.

Table 1: Comparison between Fisher GAN and recent related approaches.

                        Stability | Unconstrained capacity | Efficient computation | Representation power (SSL)
Standard GAN [1, 9]        ✗      |           ✓            |           ✓           |            ✓
WGAN, McGan [6, 8]         ✓      |           ✗            |           ✓           |            ✗
WGAN-GP [7]                ✓      |           ✓            |           ✗           |            ?
Fisher GAN (ours)          ✓      |           ✓            |           ✓           |            ✓

2 Learning GANs with Fisher IPM

2.1 Fisher IPM in an arbitrary function space: General framework

Integral Probability Metric (IPM). Intuitively an IPM defines a critic function f, belonging to a function class F, that maximally discriminates between two distributions. The function class F defines how f is bounded, which is crucial to define the metric. More formally, consider a compact space X in R^d. Let F be a set of measurable, symmetric and bounded real valued functions on X. Let P(X) be the set of measurable probability distributions on X. Given two probability distributions P, Q ∈ P(X), the IPM indexed by a symmetric function space F is defined as follows [10]:

$$d_{\mathcal{F}}(\mathbb{P},\mathbb{Q}) = \sup_{f \in \mathcal{F}} \Big\{ \mathbb{E}_{x\sim\mathbb{P}}\, f(x) - \mathbb{E}_{x\sim\mathbb{Q}}\, f(x) \Big\}. \qquad (1)$$

It is easy to see that d_F defines a pseudo-metric over P(X). Note specifically that if F is not bounded, sup_f will scale f to be arbitrarily large. By choosing F appropriately [11], various distances between probability measures can be defined.
First formulation: Rayleigh Quotient. In order to define an IPM in the GAN context, [6, 8] impose the boundedness of the function space via a data-independent constraint. This was achieved by restricting the norms of the weights parametrizing the function space to an ℓ_p ball. Imposing such a data-independent constraint makes the training highly dependent on the constraint hyper-parameters and restricts the capacity of the learned network, limiting the usability of the learned critic in a semi-supervised learning task. Here we take a different angle and design the IPM to be scaling invariant, as a Rayleigh quotient. Instead of measuring the discrepancy between means as in Equation (1), we measure a standardized discrepancy, so that the distance is bounded by construction. Standardizing this discrepancy introduces, as we will see, a data-dependent constraint that controls the growth of the weights of the critic f and ensures the stability of training while maintaining the capacity of the critic. Given two distributions P, Q ∈ P(X), the Fisher IPM for a function space F is defined as follows:

$$d_{\mathcal{F}}(\mathbb{P},\mathbb{Q}) = \sup_{f \in \mathcal{F}} \frac{\mathbb{E}_{x\sim\mathbb{P}}[f(x)] - \mathbb{E}_{x\sim\mathbb{Q}}[f(x)]}{\sqrt{\tfrac{1}{2}\,\mathbb{E}_{x\sim\mathbb{P}} f^2(x) + \tfrac{1}{2}\,\mathbb{E}_{x\sim\mathbb{Q}} f^2(x)}}. \qquad (2)$$

¹Code is available at https://github.com/tomsercu/FisherGAN

Figure 1: Illustration of Fisher IPM with Neural Networks. ω is a convolutional neural network which defines the embedding space.
v is the direction in this embedding space with maximal mean separation ⟨v, μ_ω(P) − μ_ω(Q)⟩, constrained by the hyperellipsoid v⊤ Σ_ω(P; Q) v = 1.

While a standard IPM (Equation (1)) maximizes the discrepancy between the means of a function under two different distributions, Fisher IPM looks for a critic f that achieves a tradeoff between maximizing the discrepancy between the means under the two distributions (between-class variance), and reducing the pooled second order moment (an upper bound on the intra-class variance).
Standardized discrepancies have a long history in statistics and so-called two-samples hypothesis testing. For example the classic two-samples Student's t-test defines the student statistic as the ratio between the means discrepancy and the sum of standard deviations. It is now well established that learning generative models has its roots in the two-samples hypothesis testing problem [12]. Non-parametric two-samples testing and model criticism from the kernel literature lead to the so-called kernel maximum mean discrepancy (MMD) [13]. The MMD cost function and the mean matching IPM for a general function space have recently been used for training GANs [14, 15, 8].
Interestingly Harchaoui et al. [16] proposed Kernel Fisher Discriminant Analysis for the two-samples hypothesis testing problem, and showed its statistical consistency. The standard Fisher discrepancy used in Linear Discriminant Analysis (LDA) or Kernel Fisher Discriminant Analysis (KFDA) can be written:

$$\sup_{f \in \mathcal{F}} \frac{\big(\mathbb{E}_{x\sim\mathbb{P}}[f(x)] - \mathbb{E}_{x\sim\mathbb{Q}}[f(x)]\big)^2}{\mathrm{Var}_{x\sim\mathbb{P}}(f(x)) + \mathrm{Var}_{x\sim\mathbb{Q}}(f(x))},$$

where Var_{x∼P}(f(x)) = E_{x∼P} f²(x) − (E_{x∼P} f(x))². Note that in LDA F is restricted to linear functions, and in KFDA F is restricted to a Reproducing Kernel Hilbert Space (RKHS).
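Since Eq. (2) is a simple ratio of sample averages, it can be estimated directly from data. A minimal numpy sketch follows; the 1-D Gaussian data and the linear critic are our own illustrative choices, not the paper's networks:

```python
import numpy as np

def fisher_ipm_objective(f, x_real, x_fake):
    """Empirical version of the ratio in Eq. (2) for a fixed critic f:
    mean difference divided by the square root of the pooled second moment."""
    fr, ff = f(x_real), f(x_fake)
    num = fr.mean() - ff.mean()
    den = np.sqrt(0.5 * (fr ** 2).mean() + 0.5 * (ff ** 2).mean())
    return num / den

rng = np.random.default_rng(0)
x_real = rng.normal(1.0, 1.0, 100_000)     # samples from P
x_fake = rng.normal(-1.0, 1.0, 100_000)    # samples from Q
f = lambda x: 3.7 * x                      # an arbitrary linear critic

r1 = fisher_ipm_objective(f, x_real, x_fake)
r2 = fisher_ipm_objective(lambda x: 100.0 * f(x), x_real, x_fake)
```

Rescaling the critic by any positive constant leaves the objective unchanged (r1 equals r2 up to floating point), which is exactly the scaling invariance that motivates the constrained form (Eq. (3)).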
Our Fisher IPM (Eq. (2)) deviates from the standard Fisher discrepancy since the numerator is not squared, and we use in the denominator the second order moments instead of the variances. Moreover, in our definition of Fisher IPM, F can be any symmetric function class.
Second formulation: Constrained form. Since the distance is scaling invariant, d_F can be written equivalently in the following constrained form:

$$d_{\mathcal{F}}(\mathbb{P},\mathbb{Q}) = \sup_{f \in \mathcal{F},\ \frac{1}{2}\mathbb{E}_{x\sim\mathbb{P}} f^2(x) + \frac{1}{2}\mathbb{E}_{x\sim\mathbb{Q}} f^2(x) = 1} \mathscr{E}(f) := \mathbb{E}_{x\sim\mathbb{P}}[f(x)] - \mathbb{E}_{x\sim\mathbb{Q}}[f(x)]. \qquad (3)$$

Specifying P, Q: Learning GAN with Fisher IPM. We turn now to the problem of learning GAN with Fisher IPM. Given a distribution P_r ∈ P(X), we learn a function g_θ : Z ⊂ R^{n_z} → X, such that for z ∼ p_z, the distribution of g_θ(z) is close to the real data distribution P_r, where p_z is a fixed distribution on Z (for instance z ∼ N(0, I_{n_z})). Let P_θ be the distribution of g_θ(z), z ∼ p_z. Using Fisher IPM (Equation (3)) indexed by a parametric function class F_p, the generator minimizes the IPM: min_{g_θ} d_{F_p}(P_r, P_θ). Given samples {x_i, i = 1 … N} from P_r and samples {z_j, j = 1 … M} from p_z, we solve the following empirical problem:

$$\min_{g_\theta}\ \sup_{f_p \in \mathcal{F}_p}\ \hat{\mathscr{E}}(f_p, g_\theta) := \frac{1}{N}\sum_{i=1}^{N} f_p(x_i) - \frac{1}{M}\sum_{j=1}^{M} f_p(g_\theta(z_j)) \quad \text{subject to } \hat{\Omega}(f_p, g_\theta) = 1, \qquad (4)$$

where $\hat{\Omega}(f_p, g_\theta) = \frac{1}{2N}\sum_{i=1}^{N} f_p^2(x_i) + \frac{1}{2M}\sum_{j=1}^{M} f_p^2(g_\theta(z_j))$. For simplicity we will take M = N.

2.2 Fisher IPM with Neural Networks

We will specifically study the case where F is a finite dimensional Hilbert space induced by a neural network ω (see Figure 1 for an illustration). In this case, an IPM with a data-independent constraint is equivalent to mean matching [8].
We will now show that Fisher IPM gives rise to a whitened mean matching interpretation, or equivalently to mean matching with a Mahalanobis distance.
Rayleigh Quotient. Consider the function space F_{v,ω} defined as follows:

$$\mathcal{F}_{v,\omega} = \{ f(x) = \langle v, \omega(x) \rangle \mid v \in \mathbb{R}^m,\ \omega : \mathcal{X} \to \mathbb{R}^m \},$$

where ω is typically parametrized with a multi-layer neural network. We define the mean and covariance (Gramian) feature embedding of a distribution as in McGan [8]:

$$\mu_\omega(\mathbb{P}) = \mathbb{E}_{x\sim\mathbb{P}}\,\omega(x) \quad \text{and} \quad \Sigma_\omega(\mathbb{P}) = \mathbb{E}_{x\sim\mathbb{P}}\,\omega(x)\omega(x)^\top.$$

Fisher IPM as defined in Equation (2) on F_{v,ω} can be written as follows:

$$d_{\mathcal{F}_{v,\omega}}(\mathbb{P},\mathbb{Q}) = \max_{\omega}\ \max_{v}\ \frac{\langle v, \mu_\omega(\mathbb{P}) - \mu_\omega(\mathbb{Q}) \rangle}{\sqrt{v^\top \big(\tfrac{1}{2}\Sigma_\omega(\mathbb{P}) + \tfrac{1}{2}\Sigma_\omega(\mathbb{Q}) + \gamma I_m\big)\, v}}, \qquad (5)$$

where we added a regularization term (γ > 0) to avoid singularity of the covariances. Note that if ω is implemented with homogeneous non-linearities such as ReLU, swapping (v, ω) for (cv, c′ω) with any constants c, c′ > 0 leaves the distance d_{F_{v,ω}} unchanged; hence the scaling invariance.
Constrained Form. Since the Rayleigh quotient is not amenable to optimization, we will consider Fisher IPM as a constrained optimization problem. By virtue of the scaling invariance and the constrained form of the Fisher IPM given in Equation (3), d_{F_{v,ω}} can be written equivalently as:

$$d_{\mathcal{F}_{v,\omega}}(\mathbb{P},\mathbb{Q}) = \max_{\omega}\ \max_{v,\ v^\top (\frac{1}{2}\Sigma_\omega(\mathbb{P}) + \frac{1}{2}\Sigma_\omega(\mathbb{Q}) + \gamma I_m)\, v = 1} \langle v, \mu_\omega(\mathbb{P}) - \mu_\omega(\mathbb{Q}) \rangle. \qquad (6)$$

Define the pooled covariance Σ_ω(P; Q) = ½Σ_ω(P) + ½Σ_ω(Q) + γ I_m. Doing a simple change of variable u = (Σ_ω(P; Q))^{1/2} v we see that:

$$d_{\mathcal{F}_{u,\omega}}(\mathbb{P},\mathbb{Q}) = \max_{\omega}\ \max_{u,\ \|u\|=1} \Big\langle u,\ \big(\Sigma_\omega(\mathbb{P};\mathbb{Q})\big)^{-\frac{1}{2}}\big(\mu_\omega(\mathbb{P}) - \mu_\omega(\mathbb{Q})\big) \Big\rangle = \max_{\omega} \Big\| \big(\Sigma_\omega(\mathbb{P};\mathbb{Q})\big)^{-\frac{1}{2}}\big(\mu_\omega(\mathbb{P}) - \mu_\omega(\mathbb{Q})\big) \Big\|, \qquad (7)$$

hence we see that Fisher IPM corresponds to the worst case distance between whitened means. Since the means are whitened, we don't need to impose further constraints on ω as in [6, 8]. Another
Another\ninterpretation of the Fisher IPM stems from the fact that:\n\n2 (\u00b5!(P) \u00b5!(Q)) ,\n\n= max\n\ndFv,! (P, Q) = max\n\n! (P; Q)(\u00b5!(P) \u00b5!(Q)),\n\nfrom which we see that Fisher IPM is a Mahalanobis distance between the mean feature embeddings\nof the distributions. The Mahalanobis distance is de\ufb01ned by the positive de\ufb01nite matrix \u2303w(P; Q).\nWe show in Appendix A that the gradient penalty in Improved Wasserstein [7] gives rise to a similar\nMahalanobis mean matching interpretation.\nLearning GAN with Fisher IPM. Hence we see that learning GAN with Fisher IPM:\n\nmin\ng\u2713\n\nmax\n\n!\n\nv,v>( 1\n\n2 \u2303!(Pr)+ 1\n\n2 \u2303!(P\u2713)+Im)v=1hv, \u00b5w(Pr) \u00b5!(P\u2713)i\nmax\n\ncorresponds to a min-max game between a feature space and a generator. The feature space tries\nto maximize the Mahalanobis distance between the feature means embeddings of real and fake\ndistributions. The generator tries to minimize the mean embedding distance.\n\n3 Theory\nWe will start \ufb01rst by studying the Fisher IPM de\ufb01ned in Equation (2) when the function space has full\ndx < 1.\ncapacity i.e when the critic belongs to L2(X , 1\nTheorem 1 shows that under this condition, the Fisher IPM corresponds to the Chi-squared distance\nbetween distributions, and gives a closed form expression of the optimal critic function f (See\nAppendix B for its relation with the Pearson Divergence). Proofs are given in Appendix D.\n\n2 (P + Q)) meaning thatRX\n\nf 2(x) (P(x)+Q(x))\n\n2\n\n4\n\n\fFigure 2: Example on 2D synthetic data, where both P and Q are \ufb01xed normal distributions with the\nsame covariance and shifted means along the x-axis, see (a). Fig (b, c) show the exact 2 distance\nfrom numerically integrating Eq (8), together with the estimate obtained from training a 5-layer MLP\nwith layer size = 16 and LeakyReLU nonlinearity on different training sample sizes. 
The MLP is trained using Algorithm 1, where sampling from the generator is replaced by sampling from Q, and the χ² MLP estimate is computed with Equation (2) on a large number of held-out samples (i.e. an out-of-sample estimate). We see in (b) that for a large enough sample size, the MLP estimate is extremely good. In (c) we see that for smaller sample sizes, the MLP approximation bounds the ground truth χ² from below (see Theorem 2) and converges to the ground truth roughly as O(1/√N) (Theorem 3). We notice that when the distributions have small χ² distance, a larger training size is needed to get a better estimate; again this is in line with Theorem 3.

Theorem 1 (Chi-squared distance at full capacity). Consider the Fisher IPM for F being the space of all measurable functions endowed by ½(P + Q), i.e. F := L₂(X, (P+Q)/2). Define the Chi-squared distance between two distributions:

$$\chi_2(\mathbb{P},\mathbb{Q}) = \sqrt{\int_{\mathcal{X}} \frac{\big(\mathbb{P}(x) - \mathbb{Q}(x)\big)^2}{\frac{\mathbb{P}(x)+\mathbb{Q}(x)}{2}}\, dx}. \qquad (8)$$

The following holds true for any P, Q, P ≠ Q:
1) The Fisher IPM for F = L₂(X, (P+Q)/2) is equal to the Chi-squared distance defined above: d_F(P, Q) = χ₂(P, Q).
2) The optimal critic of the Fisher IPM on L₂(X, (P+Q)/2) is:

$$f_\chi(x) = \frac{1}{\chi_2(\mathbb{P},\mathbb{Q})} \cdot \frac{\mathbb{P}(x) - \mathbb{Q}(x)}{\frac{\mathbb{P}(x)+\mathbb{Q}(x)}{2}}.$$

We note here that LSGAN [5] at full capacity corresponds to a Chi-squared divergence, with the main difference that LSGAN has different objectives for the generator and the discriminator (bilevel optimization), and hence does not optimize a single objective that is a distance between distributions. The Chi-squared divergence can also be achieved in the f-GAN framework of [3]. We discuss the advantages of the Fisher formulation in Appendix C.
Optimizing over L₂(X, (P+Q)/2) is not tractable, hence we have to restrict our function class to a hypothesis class H that enables tractable computations.
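Theorem 1 can be checked directly on a finite alphabet, where the integral in Eq. (8) becomes a sum. A small numpy sketch with two hypothetical discrete distributions of our own choosing:

```python
import numpy as np

P = np.array([0.5, 0.3, 0.2])      # two discrete distributions standing in
Q = np.array([0.2, 0.3, 0.5])      # for P and Q on a 3-letter alphabet
M = 0.5 * (P + Q)                  # the mixture (P + Q) / 2

chi2 = np.sqrt(np.sum((P - Q) ** 2 / M))   # Eq. (8) as a finite sum
f_chi = (P - Q) / M / chi2                 # optimal critic from Theorem 1
```

The optimal critic has unit norm in L₂(X, (P+Q)/2), and its mean gap E_P f − E_Q f attains χ₂(P, Q), matching both claims of the theorem.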
Here are some typical choices of the space H: linear functions in the input features, an RKHS, or a non-linear multilayer neural network with a linear last layer (F_{v,ω}). In this section we don't make any assumptions about the function space, and show in Theorem 2 how the Chi-squared distance is approximated in H, and how this depends on the approximation error of the optimal critic f_χ in H.

Theorem 2 (Approximating Chi-squared distance in an arbitrary function space H). Let H be an arbitrary symmetric function space. We define the inner product ⟨f, f_χ⟩_{L₂(X,(P+Q)/2)} = ∫_X f(x) f_χ(x) (P(x)+Q(x))/2 dx, which induces the Lebesgue norm. Let S_{L₂(X,(P+Q)/2)} = {f : X → R, ‖f‖_{L₂(X,(P+Q)/2)} = 1} be the unit sphere in L₂(X, (P+Q)/2). The Fisher IPM defined on an arbitrary function space H, d_H(P, Q), approximates the Chi-squared distance. The approximation quality depends on the cosine of the approximation of the optimal critic f_χ in H. Since H is symmetric this cosine is always positive (otherwise the same equality holds with an absolute value):

$$d_{\mathcal{H}}(\mathbb{P},\mathbb{Q}) = \chi_2(\mathbb{P},\mathbb{Q}) \sup_{f \in \mathcal{H} \cap S_{L_2(\mathcal{X},\frac{\mathbb{P}+\mathbb{Q}}{2})}} \langle f, f_\chi \rangle_{L_2(\mathcal{X},\frac{\mathbb{P}+\mathbb{Q}}{2})}.$$

Equivalently we have the following relative approximation error:

$$\frac{\chi_2(\mathbb{P},\mathbb{Q}) - d_{\mathcal{H}}(\mathbb{P},\mathbb{Q})}{\chi_2(\mathbb{P},\mathbb{Q})} = \frac{1}{2} \inf_{f \in \mathcal{H} \cap S_{L_2(\mathcal{X},\frac{\mathbb{P}+\mathbb{Q}}{2})}} \| f - f_\chi \|^2_{L_2(\mathcal{X},\frac{\mathbb{P}+\mathbb{Q}}{2})}.$$

From Theorem 2, we know that we always have d_H(P, Q) ≤ χ₂(P, Q).
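The lower bound d_H ≤ χ₂ of Theorem 2 is easy to see numerically by restricting the critic to a one-dimensional function space; a sketch on hypothetical discrete distributions (our own example, not from the paper):

```python
import numpy as np

P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.2, 0.3, 0.5])
M = 0.5 * (P + Q)
chi2 = np.sqrt(np.sum((P - Q) ** 2 / M))
f_chi = (P - Q) / M / chi2               # optimal critic (Theorem 1)

def d_H(h):
    """Fisher IPM restricted to the 1-D symmetric space H = {c * h : c in R}."""
    h = h / np.sqrt(np.sum(h ** 2 * M))  # project onto the unit sphere of L2(X, (P+Q)/2)
    return abs(np.sum((P - Q) * h))
```

Any restricted direction h gives d_H = χ₂⟨f_χ, h⟩ ≤ χ₂ for unit-norm h, with equality exactly when f_χ ∈ H.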
Moreover, if the space H is rich enough to provide a good approximation of the optimal critic f_χ, then d_H is a good approximation of the Chi-squared distance χ₂.
Generalization bounds for the sample quality of the estimated Fisher IPM from samples of P and Q can be derived akin to [11], with the main difficulty that for Fisher IPM we have to bound the excess risk of a cost function with data-dependent constraints on the function class. We give generalization bounds for learning the Fisher IPM in the supplementary material (Theorem 3, Appendix E). In a nutshell, the generalization error of the critic learned in a hypothesis class H from samples of P and Q decomposes into the approximation error from Theorem 2 and a statistical error that is bounded using data-dependent local Rademacher complexities [17] and scales like O(√(1/n)), n = MN/(M+N). We illustrate in Figure 2 our main theoretical claims on a toy problem.

4 Fisher GAN Algorithm using ALM

For any choice of the parametric function class F_p (for example F_{v,ω}), denote the constraint in Equation (4) by $\hat{\Omega}(f_p, g_\theta) = \frac{1}{2N}\sum_{i=1}^{N} f_p^2(x_i) + \frac{1}{2N}\sum_{j=1}^{N} f_p^2(g_\theta(z_j))$. Define the Augmented Lagrangian [18] corresponding to the Fisher GAN objective and constraint given in Equation (4):

$$\mathcal{L}_F(p, \theta, \lambda) = \hat{\mathscr{E}}(f_p, g_\theta) + \lambda\big(1 - \hat{\Omega}(f_p, g_\theta)\big) - \frac{\rho}{2}\big(\hat{\Omega}(f_p, g_\theta) - 1\big)^2, \qquad (9)$$

where λ is the Lagrange multiplier and ρ > 0 is the quadratic penalty weight. We alternate between optimizing the critic and the generator. Similarly to [7] we impose the constraint when training the critic only. Given θ, for training the critic we solve max_p min_λ L_F(p, θ, λ). Then given the critic parameters p we optimize the generator weights θ to minimize the objective min_θ Ê(f_p, g_θ). We give in Algorithm 1 an algorithm for Fisher GAN; note that we use ADAM [19] for optimizing the parameters of the critic and the generator.
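A minimal numeric sketch of the alternating updates implied by Eq. (9), on a 1-D toy problem of our own construction (linear critic f(x) = v·x, location generator g(z) = z + θ, plain gradient steps standing in for ADAM; this is an illustration of the scheme, not the paper's DCGAN setup):

```python
import numpy as np

rng = np.random.default_rng(0)
rho, eta, n_c, N = 0.2, 0.05, 5, 512   # penalty weight, learning rate, critic iters, batch size
v, lam, theta = 0.0, 0.0, 0.0          # critic weight, Lagrange multiplier, generator shift

# Real data P_r = N(2, 1); noise p_z = N(0, 1), so the generator's mean is theta.
for _ in range(2000):
    for _ in range(n_c):                        # critic + multiplier updates
        x = rng.normal(2.0, 1.0, N)             # minibatch from P_r
        gz = theta + rng.normal(0.0, 1.0, N)    # minibatch from the generator
        s = 0.5 * np.mean(x ** 2) + 0.5 * np.mean(gz ** 2)
        omega = v * v * s                       # constraint estimate Omega_hat
        # d/dv of L_F = E_hat + lam * (1 - omega) - rho/2 * (omega - 1)**2
        g_v = (x.mean() - gz.mean()) - 2.0 * lam * v * s - 2.0 * rho * (omega - 1.0) * v * s
        v += eta * g_v                          # gradient ascent on the critic
        lam -= rho * (1.0 - omega)              # SGD descent on the multiplier
    theta += (eta / 5.0) * v                    # generator step: d/dtheta E[f(g(z))] = v here
```

As training proceeds, the multiplier pushes the constraint Ω̂ toward 1 while the critic separates the two means, and the generator shift θ drifts toward the real mean (2).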
We use SGD for the Lagrange multiplier with learning rate ρ, following practices in Augmented Lagrangian methods [18].

Algorithm 1 Fisher GAN
Input: ρ penalty weight, η learning rate, n_c number of iterations for training the critic, N batch size
Initialize p, θ, λ = 0
repeat
    for j = 1 to n_c do
        Sample a minibatch x_i, i = 1 … N, x_i ∼ P_r
        Sample a minibatch z_i, i = 1 … N, z_i ∼ p_z
        (g_p, g_λ) ← (∇_p L_F, ∇_λ L_F)(p, θ, λ)
        p ← p + η ADAM(p, g_p)
        λ ← λ − ρ g_λ   {SGD rule on λ with learning rate ρ}
    end for
    Sample z_i, i = 1 … N, z_i ∼ p_z
    d_θ ← ∇_θ Ê(f_p, g_θ) = −∇_θ (1/N) Σ_{i=1}^{N} f_p(g_θ(z_i))
    θ ← θ − η ADAM(θ, d_θ)
until θ converges

Figure 3: Samples and plots of the loss Ê(·), Lagrange multiplier λ, and constraint Ω̂(·) on 3 benchmark datasets. We see that during training, as λ grows slowly, the constraint becomes tight.

Figure 4: No Batch Norm: Training results from a critic f without batch normalization. Fisher GAN (left) produces decent samples, while WGAN with weight clipping (right) does not. We hypothesize that this is due to the implicit whitening that Fisher GAN provides. (Note that WGAN-GP does also successfully converge without BN [7].) For both models the learning rate was appropriately reduced.

5 Experiments

We experimentally validate the proposed Fisher GAN. We claim three main results: (1) stable training with a meaningful and stable loss that decreases as training progresses and correlates with sample quality, similar to [6, 7]; (2) very fast convergence to good sample quality as measured by inception score; (3) competitive semi-supervised learning performance, on par with literature baselines, without requiring normalization of the critic.
We report results on three benchmark datasets: CIFAR-10 [20], LSUN [21] and CelebA [22].
We parametrize the generator g_θ and critic f with convolutional neural networks following the model design from DCGAN [23]. For 64 × 64 images (LSUN, CelebA) we use the model architecture in Appendix F.2; for CIFAR-10 we train at 32 × 32 resolution, using the architecture in F.3 for experiments regarding sample quality (inception score), while for semi-supervised learning we use a better regularized discriminator similar to the OpenAI [9] and ALI [24] architectures, as given in F.4. We used Adam [19] as optimizer for all our experiments, with hyper-parameters given in Appendix F.
Qualitative: Loss stability and sample quality. Figure 3 shows samples and plots during training. For LSUN we use a higher number of D updates (n_c = 5), since we see, similarly to WGAN, that the loss shows large fluctuations with lower n_c values. For CIFAR-10 and CelebA we use a reduced n_c = 2 with no negative impact on loss stability. CIFAR-10 here was trained without any label information. We show both train and validation loss on LSUN and CIFAR-10, showing, as can be expected, no overfitting on the large LSUN dataset and some overfitting on the small CIFAR-10 dataset. To back up our claim that Fisher GAN provides stable training, we trained both a Fisher GAN and a WGAN where the batch normalization in the critic f was removed (Figure 4).
Quantitative analysis: Inception Score and Speed. It is agreed upon that evaluating generative models is hard [25]. We follow the literature in using "inception score" [9] as a metric for the quality

Figure 5: CIFAR-10 inception scores under 3 training conditions.
Corresponding samples are given in rows from top to bottom (a, b, c). The inception score plots mirror Figure 3 from [7]. Note: all inception scores are computed from the same tensorflow codebase, using the architecture described in Appendix F.3, and with weight initialization from a normal distribution with stdev = 0.02. In Appendix F.1 we show that these choices also benefit our WGAN-GP baseline.

of CIFAR-10 samples. Figure 5 shows the inception score as a function of the number of g_θ updates and wallclock time. All timings are obtained by running on a single K40 GPU on the same cluster. We see from Figure 5 that Fisher GAN both produces better inception scores, and has a clear speed advantage over WGAN-GP.
Quantitative analysis: SSL. One of the main premises of unsupervised learning is to learn features on a large corpus of unlabeled data in an unsupervised fashion, which are then transferable to other tasks. This provides a proper framework to measure the performance of our algorithm, and leads us to quantify the performance of Fisher GAN by semi-supervised learning (SSL) experiments on CIFAR-10. We do joint supervised and unsupervised training on CIFAR-10, by adding a cross-entropy term to the IPM objective, in conditional and unconditional generation.

Table 2: CIFAR-10 inception scores using the resnet architecture and codebase from [7]. We used Layer Normalization [26], which outperformed unnormalized resnets.
Apart from this, no additional hyperparameter tuning was done to get stable training of the resnets.

Unsupervised (score):
  ALI [24]: 5.34 ± .05
  BEGAN [27]: 5.62
  DCGAN [23] (in [28]): 6.16 ± .07
  Improved GAN (-L+HA) [9]: 6.86 ± .06
  EGAN-Ent-VI [29]: 7.07 ± .10
  DFM [30]: 7.72 ± .13
  WGAN-GP ResNet [7]: 7.86 ± .07
  Fisher GAN ResNet (ours): 7.90 ± .05

Supervised (score):
  SteinGan [31]: 6.35
  DCGAN (with labels, in [31]): 6.58
  Improved GAN [9]: 8.09 ± .07
  Fisher GAN ResNet (ours): 8.16 ± .12
  AC-GAN [32]: 8.25 ± .07
  SGAN-no-joint [28]: 8.37 ± .08
  WGAN-GP ResNet [7]: 8.42 ± .10
  SGAN [28]: 8.59 ± .12

Unconditional Generation with CE Regularization. We parametrize the critic f as in F_{v,ω}. While training the critic using the Fisher GAN objective L_F given in Equation (9), we train a linear classifier on the feature space ω of the critic whenever labels are available (K labels). The linear classifier is trained with cross-entropy (CE) minimization. The critic loss then becomes:

$$\mathcal{L}_D = \mathcal{L}_F - \lambda_D \sum_{(x,y) \in \text{lab}} CE(x, y; S, \omega), \quad \text{where } CE(x, y; S, \omega) = -\log\big[\mathrm{Softmax}(\langle S, \omega(x) \rangle)_y\big],$$

where S ∈ R^{K×m} is the linear classifier and ⟨S, ω⟩ ∈ R^K with slight abuse of notation. λ_D is the regularization hyper-parameter. We now sample three minibatches for each critic update: one labeled batch from the small labeled dataset for the CE term, and an unlabeled batch plus a generated batch for the IPM.
Conditional Generation with CE Regularization. We also trained conditional generator models, conditioning the generator on y by concatenating the input noise with a 1-of-K embedding of the label: we now have g_θ(z, y). We parametrize the critic in F_{v,ω} and modify the critic objective as above. We also add a cross-entropy term for the generator to minimize during its training step:

$$\mathcal{L}_G = \hat{\mathscr{E}} + \lambda_G \sum_{z \sim p(z),\, y \sim p(y)} CE(g_\theta(z, y), y; S, \omega).$$
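The CE terms can be sketched numerically. A minimal numpy version of the softmax cross-entropy computed on top of critic features, with random stand-ins for ω(x) (our own toy data, not the paper's network):

```python
import numpy as np

rng = np.random.default_rng(0)
K, m, lam_D = 10, 64, 0.1        # classes, feature dim, CE weight lambda_D

def cross_entropy(S, feats, labels):
    """Mean of CE(x, y; S, omega) = -log Softmax(<S, omega(x)>)_y over a batch."""
    logits = feats @ S.T                                  # (batch, K)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_p[np.arange(len(labels)), labels].mean()

S = 0.01 * rng.normal(size=(K, m))    # linear classifier on top of omega(x)
feats = rng.normal(size=(32, m))      # stand-in for critic features omega(x)
labels = rng.integers(0, K, size=32)  # labeled minibatch

ce = cross_entropy(S, feats, labels)
L_F = 0.0                             # Fisher IPM part of Eq. (9), omitted here
L_D = L_F - lam_D * ce                # critic maximizes L_D, i.e. minimizes CE
```

With a near-zero classifier the CE sits near log K (the uniform-prediction value), and since CE ≥ 0 the regularizer can only lower L_D.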
For generator updates we still need to sample only a single minibatch, since we use the minibatch of samples from g_θ(z, y) to compute both the IPM loss Ê and the CE term. The labels are sampled according to the prior y ∼ p(y), which defaults to the discrete uniform prior when there is no class imbalance. We found λ_D = λ_G = 0.1 to be optimal.
New Parametrization of the Critic: "K + 1 SSL". One specific successful formulation of SSL in the standard GAN framework was provided in [9], where the discriminator classifies samples into K + 1 categories: the K correct classes, and K + 1 for fake samples. Intuitively this puts the real classes in competition with the fake class. In order to implement this idea in the Fisher framework, we define a new function class for the critic that puts in competition the K class directions of the classifier S_y and another "K+1" direction v that indicates fake samples. Hence we propose the following parametrization for the critic:

$$f(x) = \sum_{y=1}^{K} p(y|x)\, \langle S_y, \omega(x) \rangle - \langle v, \omega(x) \rangle, \quad \text{where } p(y|x) = \mathrm{Softmax}(\langle S, \omega(x) \rangle)_y,$$

which is also optimized with cross-entropy. Note that this critic does not fall under the interpretation with whitened means from Section 2.2, but does fall under the general Fisher IPM framework from Section 2.1. We can use this critic with both conditional and unconditional generation in the same way as described above. In this setting we found λ_D = 1.5, λ_G = 0.1 to be optimal.
Layerwise normalization on the critic. For most GAN formulations following DCGAN design principles, batch normalization (BN) [33] in the critic is an essential ingredient.
From our semi-supervised learning experiments however, it appears that batch normalization gives substantially worse performance than layer normalization (LN) [26] or even no layerwise normalization. We attribute this to the implicit whitening that Fisher GAN provides.
Table 3 shows the SSL results on CIFAR-10. We show that Fisher GAN has competitive results, on par with state-of-the-art literature baselines. When comparing to WGAN with weight clipping, it becomes clear that we recover the lost SSL performance. Results with the K + 1 critic are better across the board, consistently proving the advantage of our proposed K + 1 formulation. Conditional generation does not provide gains in the settings with layer normalization or without normalization.

Table 3: CIFAR-10 SSL results (misclassification rate).

Number of labeled examples:            1000           2000           4000           8000
CatGAN [34]                              —              —            19.58            —
Improved GAN (FM) [9]              21.83 ± 2.01   19.61 ± 2.09   18.63 ± 2.32   17.72 ± 1.82
ALI [24]                           19.98 ± 0.89   19.09 ± 0.44   17.99 ± 1.62   17.05 ± 1.49
WGAN (weight clipping) Uncond          69.01          56.48          40.85          30.56
WGAN (weight clipping) Cond            68.11          58.59          42.00          30.91
Fisher GAN BN Cond                     36.37          32.03          27.42          22.85
Fisher GAN BN Uncond                   36.42          33.49          27.36          22.82
Fisher GAN BN K+1 Cond                 34.94          28.04          23.85          20.75
Fisher GAN BN K+1 Uncond               33.49          28.60          24.19          21.59
Fisher GAN LN Cond                 26.78 ± 1.04   23.30 ± 0.39   20.56 ± 0.64   18.26 ± 0.25
Fisher GAN LN Uncond               24.39 ± 1.22   22.69 ± 1.27   19.53 ± 0.34   17.84 ± 0.15
Fisher GAN LN K+1 Cond             20.99 ± 0.66   19.01 ± 0.21   17.41 ± 0.38   15.50 ± 0.41
Fisher GAN LN K+1 Uncond           19.74 ± 0.21   17.87 ± 0.38   16.13 ± 0.53   14.81 ± 0.16
Fisher GAN No Norm K+1 Uncond      21.15 ± 0.54   18.21 ± 0.30   16.74 ± 0.19   14.80 ± 0.15

6 Conclusion

We have defined Fisher GAN, which
provides a stable and fast way of training GANs. Fisher GAN is based on a scale invariant IPM, obtained by constraining the second order moments of the critic. We provide an interpretation as whitened (Mahalanobis) mean feature matching and χ² distance. We show graceful theoretical and empirical advantages of our proposed Fisher GAN.

Acknowledgments. The authors thank Steven J. Rennie for many helpful discussions and Martin Arjovsky for helpful clarifications and pointers.

References

[1] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014.
[2] Martin Arjovsky and Léon Bottou. Towards principled methods for training generative adversarial networks. In ICLR, 2017.
[3] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In NIPS, 2016.
[4] Casper Kaae Sønderby, Jose Caballero, Lucas Theis, Wenzhe Shi, and Ferenc Huszár. Amortised MAP inference for image super-resolution. ICLR, 2017.
[5] Xudong Mao, Qing Li, Haoran Xie, Raymond Y. K. Lau, and Zhen Wang. Least squares generative adversarial networks. arXiv:1611.04076, 2016.
[6] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. ICML, 2017.
[7] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. Improved training of Wasserstein GANs. arXiv:1704.00028, 2017.
[8] Youssef Mroueh, Tom Sercu, and Vaibhava Goel. McGan: Mean and covariance feature matching GAN. arXiv:1702.08398, ICML, 2017.
[9] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. NIPS, 2016.
[10] Alfred Müller. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 1997.
[11] Bharath K. Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Bernhard Schölkopf, and Gert R. G. Lanckriet. On the empirical estimation of integral probability metrics. Electronic Journal of Statistics, 2012.
[12] Shakir Mohamed and Balaji Lakshminarayanan. Learning in implicit generative models. arXiv:1610.03483, 2016.
[13] Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. JMLR, 2012.
[14] Yujia Li, Kevin Swersky, and Richard Zemel. Generative moment matching networks. In ICML, 2015.
[15] Gintare Karolina Dziugaite, Daniel M. Roy, and Zoubin Ghahramani. Training generative neural networks via maximum mean discrepancy optimization. UAI, 2015.
[16] Zaïd Harchaoui, Francis R. Bach, and Eric Moulines. Testing for homogeneity with kernel Fisher discriminant analysis. In NIPS, 2008.
[17] Peter L. Bartlett, Olivier Bousquet, and Shahar Mendelson. Local Rademacher complexities. Annals of Statistics, 2005.
[18] J. Nocedal and S. J. Wright. Numerical Optimization. Springer, 2nd edition, 2006.
[19] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[20] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Master's thesis, 2009.
[21] Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser, and Jianxiong Xiao. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv:1506.03365, 2015.
[22] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In ICCV, 2015.
[23] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv:1511.06434, 2015.
[24] Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Alex Lamb, Martin Arjovsky, Olivier Mastropietro, and Aaron Courville. Adversarially learned inference. ICLR, 2017.
[25] Lucas Theis, Aäron van den Oord, and Matthias Bethge. A note on the evaluation of generative models. ICLR, 2016.
[26] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv:1607.06450, 2016.
[27] David Berthelot, Tom Schumm, and Luke Metz. BEGAN: Boundary equilibrium generative adversarial networks. arXiv:1703.10717, 2017.
[28] Xun Huang, Yixuan Li, Omid Poursaeed, John Hopcroft, and Serge Belongie. Stacked generative adversarial networks. arXiv:1612.04357, 2016.
[29] Zihang Dai, Amjad Almahairi, Philip Bachman, Eduard Hovy, and Aaron Courville. Calibrating energy-based generative adversarial networks. arXiv:1702.01691, 2017.
[30] David Warde-Farley and Yoshua Bengio. Improving generative adversarial networks with denoising feature matching. ICLR, 2017.
[31] Dilin Wang and Qiang Liu. Learning to draw samples: With application to amortized MLE for generative adversarial learning. arXiv:1611.01722, 2016.
[32] Augustus Odena, Christopher Olah, and Jonathon Shlens. Conditional image synthesis with auxiliary classifier GANs. arXiv:1610.09585, 2016.
[33] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
[34] Jost Tobias Springenberg. Unsupervised and semi-supervised learning with categorical generative adversarial networks. arXiv:1511.06390, 2015.
[35] Alessandra Tosi, Søren Hauberg, Alfredo Vellido, and Neil D. Lawrence. Metrics for probabilistic geometries. UAI, 2014.
[36] Bharath K. Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Bernhard Schölkopf, and Gert R. G. Lanckriet. On integral probability metrics, φ-divergences and binary classification. 2009.
[37] I. Ekeland and T. Turnbull. Infinite-dimensional Optimization and Convexity. The University of Chicago Press, 1983.
[38] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.
[39] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. arXiv:1502.01852, 2015.