{"title": "MMD GAN: Towards Deeper Understanding of Moment Matching Network", "book": "Advances in Neural Information Processing Systems", "page_first": 2203, "page_last": 2213, "abstract": "Generative moment matching network (GMMN) is a deep generative model that differs from Generative Adversarial Network (GAN) by replacing the discriminator in GAN with a two-sample test based on kernel maximum mean discrepancy (MMD). Although some theoretical guarantees of MMD have been studied, the empirical performance of GMMN is still not as competitive as that of GAN on challenging and large benchmark datasets. The computational efficiency of GMMN is also less desirable in comparison with GAN, partially due to its requirement for a rather large batch size during the training. In this paper, we propose to improve both the model expressiveness of GMMN and its computational efficiency by introducing {\\it adversarial kernel learning} techniques, as the replacement of a fixed Gaussian kernel in the original GMMN. The new approach combines the key ideas in both GMMN and GAN, hence we name it MMD-GAN. The new distance measure in MMD-GAN is a meaningful loss that enjoys the advantage of weak$^*$ topology and can be optimized via gradient descent with relatively small batch sizes. 
In our evaluation on multiple benchmark datasets, including MNIST, CIFAR-10, CelebA and LSUN, the performance of MMD-GAN significantly outperforms GMMN, and is competitive with other representative GAN works.", "full_text": "MMD GAN: Towards Deeper Understanding of Moment Matching Network

Chun-Liang Li1,* Wei-Cheng Chang1,* Yu Cheng2 Yiming Yang1 Barnabás Póczos1
1 Carnegie Mellon University, 2 AI Foundations, IBM Research
{chunlial,wchang2,yiming,bapoczos}@cs.cmu.edu  chengyu@us.ibm.com
(* denotes equal contribution)

Abstract

Generative moment matching network (GMMN) is a deep generative model that differs from Generative Adversarial Network (GAN) by replacing the discriminator in GAN with a two-sample test based on kernel maximum mean discrepancy (MMD). Although some theoretical guarantees of MMD have been studied, the empirical performance of GMMN is still not as competitive as that of GAN on challenging and large benchmark datasets. The computational efficiency of GMMN is also less desirable in comparison with GAN, partially due to its requirement for a rather large batch size during the training. In this paper, we propose to improve both the model expressiveness of GMMN and its computational efficiency by introducing adversarial kernel learning techniques, as the replacement of a fixed Gaussian kernel in the original GMMN. The new approach combines the key ideas in both GMMN and GAN, hence we name it MMD GAN. The new distance measure in MMD GAN is a meaningful loss that enjoys the advantage of weak* topology and can be optimized via gradient descent with relatively small batch sizes. 
In our evaluation on multiple benchmark datasets, including MNIST, CIFAR-10, CelebA and LSUN, the performance of MMD GAN significantly outperforms GMMN, and is competitive with other representative GAN works.

1 Introduction

The essence of unsupervised learning is to model the underlying distribution P_X of the data X. Deep generative models [1, 2] use deep learning to approximate the distribution of complex datasets with promising results. However, modeling an arbitrary density is a statistically challenging task [3]. In many applications, such as caption generation [4], accurate density estimation is not even necessary, since we are only interested in sampling from the approximated distribution.
Rather than estimating the density of P_X, Generative Adversarial Network (GAN) [5] starts from a base distribution P_Z over Z, such as a Gaussian distribution, and trains a transformation network g_θ such that P_θ ≈ P_X, where P_θ is the underlying distribution of g_θ(z) and z ∼ P_Z. During training, GAN-based algorithms require an auxiliary network f to estimate the distance between P_X and P_θ. Different probabilistic (pseudo) metrics have been studied [5–8] under the GAN framework.
Instead of training an auxiliary network f for measuring the distance between P_X and P_θ, generative moment matching network (GMMN) [9, 10] uses kernel maximum mean discrepancy (MMD) [11], the centerpiece of nonparametric two-sample testing, to measure the distance between the distributions. During training, g_θ is trained to pass the hypothesis test (i.e., to minimize the MMD distance). [11] shows that even the simple Gaussian kernel enjoys strong theoretical guarantees (Theorem 1). However, the empirical performance of GMMN does not match its theoretical properties: there are no promising empirical results comparable with GAN on challenging benchmarks [12, 13]. 
Computationally, it also requires a larger batch size than GAN for training, which is considered less efficient [9, 10, 14, 8].

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

In this work, we try to improve GMMN by using MMD with adversarially learned kernels, instead of fixed Gaussian kernels, to obtain better hypothesis-testing power. The main contributions of this work are:
• In Section 2, we prove that training g_θ via MMD with learned kernels is continuous and differentiable in θ, which guarantees the model can be trained by gradient descent. Second, we prove that the new distance measure via kernel learning is a sensitive loss function with respect to the distance between P_X and P_θ (weak* topology). Empirically, the loss decreases as the two distributions get closer.
• In Section 3, we propose a practical realization called MMD GAN that learns the generator g_θ with the adversarially trained kernel. We further propose a feasible set reduction to speed up and stabilize the training of MMD GAN.
• In Section 5, we show that MMD GAN is computationally more efficient than GMMN and can be trained with a much smaller batch size. We also demonstrate that MMD GAN has promising results on challenging datasets, including CIFAR-10, CelebA and LSUN, where GMMN fails. To the best of our knowledge, ours is the first MMD-based work to achieve results comparable with other GAN works on these datasets.
Finally, we also study the connection to existing works in Section 4. Interestingly, we show that Wasserstein GAN [8] is a special case of the proposed MMD GAN under certain conditions. The unified view shows more connections between moment matching and GAN, which can potentially inspire new algorithms based on well-developed tools in statistics [15]. 
Our experiment code is available at https://github.com/OctoberChang/MMD-GAN.

2 GAN, Two-Sample Test and GMMN

Assume we are given data {x_i}_{i=1}^n, where x_i ∈ X and x_i ∼ P_X. If we are interested in sampling from P_X, it is not necessary to estimate its density. Instead, Generative Adversarial Network (GAN) [5] trains a generator g_θ parameterized by θ to transform samples z ∼ P_Z, where z ∈ Z, into g_θ(z) ∼ P_θ such that P_θ ≈ P_X. To measure the similarity between P_X and P_θ via their samples {x_i}_{i=1}^n and {g_θ(z_j)}_{j=1}^n during training, [5] trains a discriminator f parameterized by φ. The learning is done by playing a two-player game, where f tries to distinguish x_i from g_θ(z_j), while g_θ aims to confuse f by generating g_θ(z_j) similar to x_i.
On the other hand, distinguishing two distributions from finite samples is known as the two-sample test in statistics. One way to conduct a two-sample test is via kernel maximum mean discrepancy (MMD) [11]. Given two distributions P and Q, and a kernel k, the squared MMD distance is defined as

    M_k(P, Q) = ||μ_P − μ_Q||²_H = E_P[k(x, x′)] − 2 E_{P,Q}[k(x, y)] + E_Q[k(y, y′)].

Theorem 1. [11] Given a kernel k, if k is a characteristic kernel, then M_k(P, Q) = 0 iff P = Q.
GMMN: One example of a characteristic kernel is the Gaussian kernel k(x, x′) = exp(−||x − x′||²). Based on Theorem 1, [9, 10] propose the generative moment-matching network (GMMN), which trains g_θ by

    min_θ M_k(P_X, P_θ),    (1)

with a fixed Gaussian kernel k rather than training an additional discriminator f as in GAN.

2.1 MMD with Kernel Learning

In practice we use finite samples from the distributions to estimate the MMD distance. 
Given X = {x_1, ..., x_n} ∼ P and Y = {y_1, ..., y_n} ∼ Q, one estimator of M_k(P, Q) is

    M̂_k(X, Y) = (1/(n(n−1))) Σ_{i≠i′} k(x_i, x_{i′}) − (2/(n(n−1))) Σ_{i≠j} k(x_i, y_j) + (1/(n(n−1))) Σ_{j≠j′} k(y_j, y_{j′}).

Because of sampling variance, M̂_k(X, Y) may not be zero even when P = Q. We then conduct a hypothesis test with null hypothesis H_0: P = Q. For a given allowable probability of false rejection α, we can only reject H_0, which implies P ≠ Q, if M̂_k(X, Y) > c_α for some chosen threshold c_α > 0. Otherwise, Q passes the test and Q is indistinguishable from P under this test. Please refer to [11] for more details.
Intuitively, if the kernel k cannot produce a large MMD distance M_k(P, Q) when P ≠ Q, then M̂_k(X, Y) is more likely to be smaller than c_α. In that case we are unlikely to reject the null hypothesis H_0 with finite samples, which implies Q is not distinguishable from P. Therefore, instead of training g_θ via (1) with a pre-specified kernel k as in GMMN, we consider training g_θ via

    min_θ max_{k∈K} M_k(P_X, P_θ),    (2)

which takes different possible characteristic kernels k ∈ K into account. Alternatively, we can view (2) as replacing the fixed kernel k in (1) with the adversarially learned kernel argmax_{k∈K} M_k(P_X, P_θ), which provides a stronger signal where P ≠ P_θ to train g_θ. We refer interested readers to [16] for more rigorous discussions about testing power and increasing MMD distances.
However, it is difficult to optimize over all characteristic kernels when solving (2). By [11, 17], if f is an injective function and k is characteristic, then the resulting kernel k̃ = k ∘ f, where k̃(x, x′) = k(f(x), f(x′)), is still characteristic. 
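The finite-sample estimator of M_k(P, Q) above can be sketched in NumPy. This is a sketch assuming the Gaussian kernel exp(−||x − x′||²/(2σ²)); in the kernel-learning setting the inputs would first be mapped through f_φ. The off-diagonal sums implement the i ≠ i′, j ≠ j′ and i ≠ j restrictions over ordered pairs, normalized by n(n−1):

```python
import numpy as np

def gaussian_gram(a, b, sigma=1.0):
    # pairwise k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 * sigma^2))
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def mmd2_estimate(X, Y, sigma=1.0):
    """Estimate M_k(P, Q) from equal-size samples X ~ P and Y ~ Q."""
    n = X.shape[0]
    Kxx = gaussian_gram(X, X, sigma)
    Kyy = gaussian_gram(Y, Y, sigma)
    Kxy = gaussian_gram(X, Y, sigma)
    sum_xx = Kxx.sum() - np.trace(Kxx)   # sum over i != i'
    sum_yy = Kyy.sum() - np.trace(Kyy)   # sum over j != j'
    sum_xy = Kxy.sum() - np.trace(Kxy)   # sum over i != j
    return (sum_xx - 2.0 * sum_xy + sum_yy) / (n * (n - 1))
```

By construction the estimate is exactly zero when X and Y are identical sample sets, and it grows as the two samples separate, which is what the hypothesis test thresholds against c_α.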
If we have a family of injective functions parameterized by φ, denoted f_φ, we can change the objective to

    min_θ max_φ M_{k∘f_φ}(P_X, P_θ).    (3)

In this paper, we consider combining Gaussian kernels with injective functions f_φ, where k̃(x, x′) = exp(−||f_φ(x) − f_φ(x′)||²). One example function class for f_φ is {f_φ | f_φ(x) = φx, φ > 0}, which is equivalent to kernel bandwidth tuning. A more complicated realization is discussed in Section 3. In the following, we abuse notation and write M_{f_φ}(P, Q) for the MMD distance given the composition of the Gaussian kernel and f_φ. Note that [18] considers linear combinations of characteristic kernels, which can also be incorporated into the discussed composition kernels. A more general kernel is studied in [19].

2.2 Properties of MMD with Kernel Learning

[8] discuss different distances between distributions adopted by existing deep learning algorithms, and show that many of them, such as the Jensen-Shannon divergence [5] and total variation [7], are discontinuous, with the Wasserstein distance being an exception. The discontinuity makes gradient descent infeasible for training. From (3), we train g_θ by minimizing max_φ M_{f_φ}(P_X, P_θ). Next, we show that max_φ M_{f_φ}(P_X, P_θ) also enjoys the advantage of being a continuous and differentiable objective in θ under mild assumptions.
Assumption 2. g: Z × R^m → X is locally Lipschitz, where Z ⊆ R^d. We denote by g_θ(z) the evaluation at (z, θ) for convenience. Given f_φ and a probability distribution P_z over Z, g satisfies Assumption 2 if there are local Lipschitz constants L(θ, z) for f_φ ∘ g, independent of φ, such that E_{z∼P_z}[L(θ, z)] < +∞.
Theorem 3. Assume the generator function g_θ parameterized by θ satisfies Assumption 2. Let P_X be a fixed distribution over X and Z be a random variable over the space Z. 
We denote by P_θ the distribution of g_θ(Z); then max_φ M_{f_φ}(P_X, P_θ) is continuous everywhere and differentiable almost everywhere in θ.

If g_θ is parameterized by a feed-forward neural network, it satisfies Assumption 2 and can be trained via gradient descent and backpropagation, since the objective is continuous and differentiable by Theorem 3. More technical discussions are given in Appendix B.
Theorem 4. (weak* topology) Let {P_n} be a sequence of distributions. Considering n → ∞, under mild assumptions, max_φ M_{f_φ}(P_X, P_n) → 0 ⟺ P_n →_D P_X, where →_D means convergence in distribution [3].
Theorem 4 shows that max_φ M_{f_φ}(P_X, P_n) is a sensible cost function for the distance between P_X and P_n: the distance decreases as P_n gets closer to P_X, which helps supervise the improvement during training. All proofs are deferred to Appendix A. In the next section, we introduce a practical realization of training g_θ via optimizing min_θ max_φ M_{f_φ}(P_X, P_θ).

3 MMD GAN

To approximate (3), we use neural networks with expressive power to parameterize g_θ and f_φ. For g_θ, the assumption is local Lipschitzness, which commonly used feed-forward neural networks satisfy. Also, the gradient ∇_θ(max_φ f_φ ∘ g_θ) has to be bounded, which can be achieved by clipping φ [8] or by a gradient penalty [20]. The non-trivial part is that f_φ has to be injective. For an injective function f, there exists a function f⁻¹ such that f⁻¹(f(x)) = x, ∀x ∈ X and f⁻¹(f(g(z))) = g(z), ∀z ∈ Z,¹ which can be approximated by an autoencoder. In the following, we denote by φ = {φ_e, φ_d} the parameters of the discriminator networks, which consist of an encoder f_{φ_e}, and we train the corresponding decoder f_{φ_d} ≈ f⁻¹ to regularize f. 
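As a toy instance of objective (3), consider the bandwidth-tuning family f_φ(x) = φx from Section 2.1: the inner maximization over a grid of φ values simply picks the scaling with the strongest test signal. A minimal sketch (the grid of φ values and the biased V-statistic estimate are illustrative choices, not the paper's network realization):

```python
import numpy as np

def mmd2_scaled(X, Y, phi=1.0):
    # biased estimate with composition kernel k~(x, x') = exp(-||phi*x - phi*x'||^2)
    def gram(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-(phi ** 2) * d2)
    return gram(X, X).mean() - 2.0 * gram(X, Y).mean() + gram(Y, Y).mean()

def adversarial_mmd2(X, Y, phis=(0.25, 0.5, 1.0, 2.0, 4.0)):
    # inner max over the family f_phi(x) = phi * x on a fixed grid of phi values
    return max(mmd2_scaled(X, Y, phi) for phi in phis)
```

By construction adversarial_mmd2(X, Y) ≥ mmd2_scaled(X, Y, φ) for every φ in the grid, i.e., the adversarially chosen kernel can only strengthen the signal used to train g_θ.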
The objective (3) is relaxed to

    min_θ max_φ M_{f_{φ_e}}(P(X), P(g_θ(Z))) − λ E_{y∈X∪g(Z)} ||y − f_{φ_d}(f_{φ_e}(y))||².    (4)

Note that we ignore the autoencoder objective when we train θ, but we use (4) for a concise presentation. We note that our empirical study suggests the autoencoder objective is not necessary for successful GAN training, as we show in Section 5, even though the injective property is required in Theorem 1.
The proposed algorithm is similar to GAN [5], which optimizes two neural networks g_θ and f_φ in a minimax formulation, but the meaning of the objective is different. In [5], f_{φ_e} is a (binary) discriminator classifier that distinguishes the two distributions. In the proposed algorithm, distinguishing the two distributions is still done by a two-sample test via MMD, but with an adversarially learned kernel parametrized by f_{φ_e}; g_θ is then trained to pass the hypothesis test. More connections and differences with related works are discussed in Section 4. Because of the similarity to GAN, we call the proposed algorithm MMD GAN. We present an implementation with weight clipping in Algorithm 1, but one can easily extend it to other Lipschitz approximations, such as the gradient penalty [20].

Algorithm 1: MMD GAN, our proposed algorithm.
input: α the learning rate, c the clipping parameter, B the batch size, n_c the number of discriminator iterations per generator update.
initialize generator parameter θ and discriminator parameter φ;
while θ has not converged do
    for t = 1, . . . , n_c do
        Sample minibatches {x_i}_{i=1}^B ∼ P(X) and {z_j}_{j=1}^B ∼ P(Z)
        g_φ ← ∇_φ [ M_{f_{φ_e}}(P(X), P(g_θ(Z))) − λ E_{y∈X∪g(Z)} ||y − f_{φ_d}(f_{φ_e}(y))||² ]
        φ ← φ + α · RMSProp(φ, g_φ)
        φ ← clip(φ, −c, c)
    Sample minibatches {x_i}_{i=1}^B ∼ P(X) and {z_j}_{j=1}^B ∼ P(Z)
    g_θ ← ∇_θ M_{f_{φ_e}}(P(X), P(g_θ(Z)))
    θ ← θ − α · RMSProp(θ, g_θ)

Encoding Perspective of MMD GAN: Besides kernel selection, another way to see the proposed MMD GAN is to view f_{φ_e} as a feature transformation function, with the kernel two-sample test performed on the transformed feature space (i.e., the code space of the autoencoder). The optimization then finds a manifold with stronger signals for the MMD two-sample test. From this perspective, [9] is the special case of MMD GAN in which f_{φ_e} is the identity mapping; in that case, the kernel two-sample test is conducted in the original data space.

3.1 Feasible Set Reduction

Theorem 5. For any f_φ, there exists f_{φ′} such that M_{f_φ}(P_r, P_θ) = M_{f_{φ′}}(P_r, P_θ) and E_x[f_φ(x)] ⪰ E_z[f_{φ′}(g_θ(z))].
With Theorem 5, we can reduce the feasible set of φ during the optimization by solving

    min_θ max_φ M_{f_φ}(P_r, P_θ)   s.t.   E[f_φ(x)] ⪰ E[f_φ(g_θ(z))],

whose optimal solution is still equivalent to that of (2). However, it is hard to solve this constrained optimization problem with backpropagation. We relax the constraint, in the spirit of ordinal regression [21], to

    min_θ max_φ M_{f_φ}(P_r, P_θ) + λ min( E[f_φ(x)] − E[f_φ(g_θ(z))], 0 ),

which penalizes the objective only when the constraint is violated. In practice, we observe that reducing the feasible set makes the training faster and more stable.
¹Note that injective does not necessarily mean invertible.

4 Related Works

There has been a recent surge of work on improving GAN [5]. We review some related works here.
Connection with WGAN: If we compose f_φ with a linear kernel instead of a Gaussian kernel, and restrict the output dimension h to be 1, we then have the objective

    min_θ max_φ ||E[f_φ(x)] − E[f_φ(g_θ(z))]||².    (5)

Parameterizing f_φ and g_θ with neural networks, and assuming that for every φ there exists φ₀ ∈ Φ such that f_{φ₀} = −f_φ, recovers Wasserstein GAN (WGAN) [8].² If we treat f_φ(x) as a data transformation function, WGAN can be interpreted as first-order moment matching (linear kernel), while MMD GAN aims to match infinitely many moments, via the Taylor expansion of the Gaussian kernel [9]. Theoretically, the Wasserstein distance has guarantees similar to Theorems 1, 3 and 4. In practice, [22] show that neural networks do not have enough capacity to approximate the Wasserstein distance. In Section 5, we demonstrate that matching higher-order moments benefits the results. [23] also propose McGAN, which matches second-order moments from the primal-dual norm perspective. However, that algorithm requires matrix (tensor) decompositions because of exact moment matching [24], which is hard to scale to higher-order moment matching. On the other hand, by giving up exact moment matching, MMD GAN can match high-order moments with the kernel trick. More detailed discussions are in Appendix B.3.
Difference from Other Works with Autoencoders: Energy-based GANs [7, 25] also utilize an autoencoder (AE) in the discriminator, from an energy-model perspective: they minimize the reconstruction error of real samples x while maximizing the reconstruction error of generated samples g_θ(z). In contrast, MMD GAN uses the AE to approximate invertible functions by minimizing the reconstruction errors of both real samples x and generated samples g_θ(z). Also, [8] show that EBGAN approximates total variation, with the drawback of discontinuity, while MMD GAN optimizes the MMD distance. Another line of works [2, 26, 9] aims to match the AE code space f(x) and utilizes the decoder f_dec(·). 
[2, 26] match the distribution of f(x) and z via different distribution distances and generate data (e.g., images) by f_dec(z). [9] use MMD to match f(x) and g(z), and generate data via f_dec(g(z)). The proposed MMD GAN matches f(x) and f(g(z)), and generates data via g(z) directly, as GAN does. [27] is similar to MMD GAN, but it considers the KL-divergence without showing the continuity and weak* topology guarantees that we prove in Section 2.
Other GAN Works: In addition to the discussed works, there are several extensions of GAN. [28] proposes using the linear kernel to match the first moment of the discriminator's latent features. [14] considers the variance of the empirical MMD score during training. Also, [14] only improves the latent feature matching in [28] by using kernel MMD, instead of proposing an adversarial training framework as we study in Section 2. [29] uses the Wasserstein distance to match the distribution of the autoencoder loss instead of the data. One could extend [29] to higher-order matching based on the proposed MMD GAN. A parallel work [30] uses the energy distance, which can be treated as MMD GAN with a different kernel; however, there are some potential problems with its critic, as discussed in [31].

5 Experiment

We train MMD GAN for image generation on the MNIST [32], CIFAR-10 [33], CelebA [13], and LSUN bedrooms [12] datasets, whose numbers of training instances are 50K, 50K, 160K and 3M, respectively.
²Theoretically, they are not equivalent, but the practical neural network approximation results in the same algorithm.
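Putting objective (4) and the Section 3.1 relaxation together, the value the discriminator ascends can be written down directly; in practice the gradients in Algorithm 1 come from an autodiff framework. A sketch with hypothetical encoder/decoder callables and made-up penalty weights lam and rho (none of these names or values are from the paper):

```python
import numpy as np

def gram(a, b, sigma=1.0):
    # pairwise Gaussian kernel exp(-||a_i - b_j||^2 / (2 * sigma^2))
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def mmd2_biased(x, y, sigma=1.0):
    # biased V-statistic estimate, enough for a loss sketch
    return gram(x, x, sigma).mean() - 2.0 * gram(x, y, sigma).mean() + gram(y, y, sigma).mean()

def critic_objective(x, gz, enc, dec, lam=8.0, rho=16.0):
    """Value the discriminator maximizes: MMD in code space, minus the
    autoencoder penalty of eq. (4), plus the Section 3.1 hinge term."""
    hx, hg = enc(x), enc(gz)
    mmd = mmd2_biased(hx, hg)
    both = np.vstack([x, gz])
    recon = np.mean((both - dec(enc(both))) ** 2)     # reconstruction over X and g(Z)
    hinge = min(hx.mean() - hg.mean(), 0.0)           # collapsed to a scalar mean for brevity
    return mmd - lam * recon + rho * hinge
```

With identical real and generated batches and a perfect autoencoder, every term vanishes, which matches the intuition that the critic has no signal left to exploit.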
All sample images are generated from fixed random noise vectors and are not cherry-picked.
Network architecture: In our experiments, we follow the architecture of DCGAN [34], designing g_θ after its generator and f_φ after its discriminator, except that we expand the output layer of f_φ to h dimensions.
Kernel designs: The loss function of MMD GAN is implicitly associated with a family of characteristic kernels. Similar to prior seminal MMD papers [10, 9, 14], we consider a mixture of K RBF kernels, k(x, x′) = Σ_{q=1}^K k_{σ_q}(x, x′), where k_{σ_q} is a Gaussian kernel with bandwidth parameter σ_q. Tuning the kernel bandwidths σ_q optimally remains an open problem. In this work, we fix K = 5 and σ_q to {1, 2, 4, 8, 16}, and leave f_φ to learn the kernel (feature representation) under these σ_q.
Hyper-parameters: We use RMSProp [35] with a learning rate of 0.00005 for a fair comparison with WGAN, as suggested in its original paper [8]. We ensure the boundedness of the discriminator's model parameters by clipping the weights point-wise to the range [−0.01, 0.01], as required by Assumption 2. The dimensionality h of the latent space is set manually according to the complexity of the dataset: h = 16 for MNIST, h = 64 for CelebA, and h = 128 for CIFAR-10 and LSUN bedrooms. The batch size is B = 64 for all datasets.

5.1 Qualitative Analysis

Figure 1: Generated samples from GMMN-D (data space), GMMN-C (code space) and our MMD GAN with batch size B = 64; panels (a)-(c) show MNIST and (d)-(f) show CIFAR-10.

We start by comparing MMD GAN with GMMN on two standard benchmarks, MNIST and CIFAR-10. We consider two variants of GMMN. The first is the original GMMN, which trains the generator by minimizing the MMD distance in the original data space; we call it GMMN-D. 
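The mixture-of-RBF-kernel design described in the kernel-designs paragraph can be sketched as follows. This assumes the common convention k_σq(x, x′) = exp(−||x − x′||²/(2σ_q²)); the paper does not spell out its exact bandwidth convention, so treat this as one plausible reading:

```python
import numpy as np

SIGMAS = (1.0, 2.0, 4.0, 8.0, 16.0)  # the K = 5 fixed bandwidths from the paper

def mix_rbf_gram(a, b, sigmas=SIGMAS):
    # k(x, x') = sum_q exp(-||x - x'||^2 / (2 * sigma_q^2))
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return sum(np.exp(-d2 / (2.0 * s ** 2)) for s in sigmas)
```

In MMD GAN this Gram matrix is computed on encoded batches f_φ(x) rather than on raw pixels, so the fixed σ_q only set the scales at which the learned representation is compared.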
To compare with MMD GAN, we also pretrain an autoencoder for projecting data onto a manifold, fix the autoencoder as a feature transformation, and train the generator by minimizing the MMD distance in the code space; we call it GMMN-C.
The results are pictured in Figure 1. Both GMMN-D and GMMN-C are able to generate meaningful digits on MNIST because of the simple data structure. On closer inspection, nonetheless, the boundaries and shapes of the digits in Figures 1a and 1b are often irregular and non-smooth.

Figure 2: Generated samples from WGAN (a: MNIST, b: CelebA, c: LSUN) and MMD GAN (d: MNIST, e: CelebA, f: LSUN).

In contrast, the sample digits in Figure 1c are more natural, with smooth outlines and sharper strokes. On the CIFAR-10 dataset, both GMMN variants fail to generate meaningful images, producing only some low-level visual features. We observe similar behavior on other complex large-scale datasets such as CelebA and LSUN bedrooms, so those results are omitted. In contrast, the proposed MMD GAN successfully outputs natural images with sharp boundaries and high diversity. The results in Figure 1 confirm the success of the proposed adversarially learned kernels in enriching statistical testing power, which is the key difference between GMMN and MMD GAN.
If we increase the batch size of GMMN to 1024, the image quality improves, but it is still not competitive with MMD GAN at B = 64; the images are shown in Appendix C. This demonstrates that the proposed MMD GAN can be trained more efficiently than GMMN, with a smaller batch size.
Comparisons with GANs: There are several representative extensions of GANs. We consider the recent state-of-the-art WGAN [8], based on the DCGAN structure [34], because of its connection with MMD GAN discussed in Section 4. The results are shown in Figure 2. 
For MNIST, the digits generated by WGAN in Figure 2a are more unnatural, with peculiar strokes. By contrast, the digits from MMD GAN in Figure 2d enjoy smoother contours. Furthermore, both WGAN and MMD GAN generate diversified digits, avoiding the mode collapse problems that have appeared in the GAN training literature. For CelebA, we can see the difference between samples generated by WGAN and MMD GAN. Specifically, we observe varied poses, expressions, genders, skin colors and light exposure in Figures 2b and 2e. On a closer look (view on-screen with zooming in), we observe that faces from WGAN are more likely to be blurry and twisted, while faces from MMD GAN are more natural, with sharp, well-defined outlines. On the LSUN dataset, we could not distinguish salient differences between the samples generated by MMD GAN and WGAN.

5.2 Quantitative Analysis

To quantitatively measure the quality and diversity of generated samples, we compute the inception score [28] on CIFAR-10 images. The inception score measures sample quality and diversity for GANs using a pretrained Inception model [28]; models that generate collapsed samples have a relatively low score. Table 1 lists the results for 50K samples generated by various unsupervised generative models trained on the CIFAR-10 dataset. The inception scores of [36, 37, 28] are taken directly from the corresponding references.
Although both WGAN and MMD GAN can generate sharp images, as shown in Section 5.1, our score is better than those of other GAN techniques except for DFM [36]. This seems to confirm empirically that higher-order moment matching between the real-data and generated-sample distributions benefits generating more diversified sample images. 
Also note that DFM appears compatible with our method, and combining the training techniques of DFM with ours is a possible avenue for future work.

Table 1: Inception scores.
Method              Scores ± std.
Real data           11.95 ± .20
DFM [36]            7.72
ALI [37]            5.34
Improved GANs [28]  4.36
MMD GAN             6.17 ± .07
WGAN                5.88 ± .07
GMMN-C              3.94 ± .04
GMMN-D              3.47 ± .03

Figure 3: Computation time.

5.3 Stability of MMD GAN

We further illustrate how the MMD distance correlates with the quality of the generated samples. Figure 4 plots the evolution of the MMD GAN estimate of the MMD distance during training on the MNIST, CelebA and LSUN datasets. We report the average of M̂_{f_φ}(P_X, P_θ), smoothed with a moving average to reduce the variance caused by mini-batch stochastic training. We observe that throughout the training process, samples generated from the same noise vector across iterations remain similar in nature (e.g., face identity and bedroom style stay alike, while details and backgrounds evolve). This qualitative observation indicates valuable stability of the training process. The decreasing curve, together with the improving image quality, supports the weak* topology shown in Theorem 4. We can also see from the plot that the model converges very quickly; in Figure 4b, for example, it converges shortly after tens of thousands of generator iterations on the CelebA dataset.

Figure 4: Training curves and generated samples at different stages of training on (a) MNIST, (b) CelebA and (c) LSUN Bedrooms. We can see a clear correlation between lower distance and better sample quality.

5.4 Computation Issue

We conduct a time-complexity analysis with respect to the batch size B. The time complexity of each iteration is O(B) for WGAN and O(KB²) for our proposed MMD GAN with a mixture of K RBF kernels. 
The quadratic complexity O(B²) of MMD GAN comes from computing the kernel matrix, which is sometimes criticized as inapplicable with large batch sizes in practice. However, we note that several recent works, such as EBGAN [7], also match pairwise relations between the samples in a batch, leading to O(B²) complexity as well.
Empirically, we find that in a GPU environment, highly parallelized matrix operations reduce the quadratic time to almost linear time for modest B. Figure 3 compares the computation time per generator iteration for different B on a Titan X. When B = 64, the setting used for training MMD GAN in our experiments, the time per iteration of WGAN and MMD GAN is 0.268 and 0.676 seconds, respectively. When B = 1024, the batch size used for training GMMN in its references [9], the time per iteration becomes 4.431 and 8.565 seconds, respectively. This result supports our argument that, with powerful GPU parallel computation, the empirical computation time of MMD GAN is not quadratically more expensive than that of WGAN.

5.5 Better Lipschitz Approximation and Necessity of Auto-Encoder

Although we use weight clipping for the Lipschitz constraint in Assumption 2, one can also use other approximations, such as the gradient penalty [20]. On the other hand, in Algorithm 1 we present an algorithm with an auto-encoder to be consistent with the theory, which requires f_φ to be injective; however, we observe that this is not necessary in practice. We show some preliminary results of training MMD GAN with the gradient penalty and without the auto-encoder in Figure 5. This preliminary study indicates that MMD GAN can generate satisfactory results with other Lipschitz-constraint approximations. 
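For reference, the baseline weight-clipping surrogate used in Algorithm 1 and the hyper-parameters paragraph amounts to one line per parameter tensor (a sketch; real training code would apply this in-place after each discriminator update):

```python
import numpy as np

def clip_weights(params, c=0.01):
    # point-wise projection of every discriminator weight onto [-c, c],
    # the bound used in our experiments
    return [np.clip(w, -c, c) for w in params]
```

Weights already inside the range are left untouched, so clipping only activates when a parameter would escape the Lipschitz box.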
One potential piece of future work is conducting more thorough empirical comparisons between the different approximations.

Figure 5: MMD GAN results on (a) CIFAR-10 and (b) CelebA, at generator iteration 300K, using the gradient penalty [20] and without the auto-encoder reconstruction loss during training.

6 Discussion

We introduce a new deep generative model trained via MMD with adversarially learned kernels. We further study its theoretical properties and propose a practical realization, MMD GAN, which can be trained with a much smaller batch size than GMMN and has performance competitive with state-of-the-art GANs. We can view MMD GAN as a first practical step toward connecting moment matching networks and GAN. One important direction is applying well-developed tools from moment matching [15] to general GAN works, based on the connections shown by MMD GAN. Also, in Section 4, we connect WGAN and MMD GAN through first-order and infinite-order moment matching; [24] shows that finite-order moment matching (≈ 5) achieves the best performance on domain adaptation, and one could extend MMD GAN in this direction by using polynomial kernels. Last, in theory an injective mapping f_φ is necessary for the theoretical guarantees; however, we observe that it is not mandatory in practice, as shown in Section 5.5. One conjecture is that, when parameterized by neural networks, the mapping is usually injective with high probability, which is worth further study as future work.

Acknowledgments

We thank the reviewers for their helpful comments. This work is supported in part by the National Science Foundation (NSF) under grants IIS-1546329 and IIS-1563887.

References

[1] Ruslan Salakhutdinov and Geoffrey Hinton. Deep boltzmann machines. In AISTATS, 2009.

[2] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. In ICLR, 2013.

[3] Larry Wasserman. All of statistics: a concise course in statistical inference. 
Springer Science & Business Media, 2013.

[4] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2015.

[5] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014.

[6] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In NIPS, 2016.

[7] Junbo Zhao, Michael Mathieu, and Yann LeCun. Energy-based generative adversarial network. In ICLR, 2017.

[8] Martín Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. In ICML, 2017.

[9] Yujia Li, Kevin Swersky, and Richard Zemel. Generative moment matching networks. In ICML, 2015.

[10] Gintare Karolina Dziugaite, Daniel M. Roy, and Zoubin Ghahramani. Training generative neural networks via maximum mean discrepancy optimization. In UAI, 2015.

[11] Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. JMLR, 2012.

[12] Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser, and Jianxiong Xiao. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.

[13] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In CVPR, 2015.

[14] Dougal J. Sutherland, Hsiao-Yu Fish Tung, Heiko Strathmann, Soumyajit De, Aaditya Ramdas, Alexander J. Smola, and Arthur Gretton. Generative models and model criticism via optimized maximum mean discrepancy. In ICLR, 2017.

[15] Krikamol Muandet, Kenji Fukumizu, Bharath Sriperumbudur, and Bernhard Schölkopf. Kernel mean embedding of distributions: A review and beyond.
arXiv preprint arXiv:1605.09522, 2016.

[16] Kenji Fukumizu, Arthur Gretton, Gert R. Lanckriet, Bernhard Schölkopf, and Bharath K. Sriperumbudur. Kernel choice and classifiability for RKHS embeddings of probability distributions. In NIPS, 2009.

[17] Arthur Gretton, Bharath Sriperumbudur, Dino Sejdinovic, Heiko Strathmann, Sivaraman Balakrishnan, Massimiliano Pontil, and Kenji Fukumizu. Optimal kernel choice for large-scale two-sample tests. In NIPS, 2012.

[18] Arthur Gretton, Dino Sejdinovic, Heiko Strathmann, Sivaraman Balakrishnan, Massimiliano Pontil, Kenji Fukumizu, and Bharath K. Sriperumbudur. Optimal kernel choice for large-scale two-sample tests. In NIPS, 2012.

[19] Andrew Gordon Wilson, Zhiting Hu, Ruslan Salakhutdinov, and Eric P. Xing. Deep kernel learning. In AISTATS, 2016.

[20] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. Improved training of Wasserstein GANs. arXiv preprint arXiv:1704.00028, 2017.

[21] Ralf Herbrich, Thore Graepel, and Klaus Obermayer. Support vector learning for ordinal regression. In ICANN, 1999.

[22] Sanjeev Arora, Rong Ge, Yingyu Liang, Tengyu Ma, and Yi Zhang. Generalization and equilibrium in generative adversarial nets (GANs). arXiv preprint arXiv:1703.00573, 2017.

[23] Youssef Mroueh, Tom Sercu, and Vaibhava Goel. McGan: Mean and covariance feature matching GAN. arXiv preprint arXiv:1702.08398, 2017.

[24] Werner Zellinger, Thomas Grubinger, Edwin Lughofer, Thomas Natschläger, and Susanne Saminger-Platz. Central moment discrepancy (CMD) for domain-invariant representation learning. arXiv preprint arXiv:1702.08811, 2017.

[25] Shuangfei Zhai, Yu Cheng, Rogério Schmidt Feris, and Zhongfei Zhang. Generative adversarial networks as variational training of energy based models. arXiv preprint arXiv:1611.01799, 2016.

[26] Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. Adversarial autoencoders.
arXiv preprint arXiv:1511.05644, 2015.

[27] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Adversarial generator-encoder networks. arXiv preprint arXiv:1704.02304, 2017.

[28] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In NIPS, 2016.

[29] David Berthelot, Tom Schumm, and Luke Metz. BEGAN: Boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717, 2017.

[30] Marc G. Bellemare, Ivo Danihelka, Will Dabney, Shakir Mohamed, Balaji Lakshminarayanan, Stephan Hoyer, and Rémi Munos. The Cramér distance as a solution to biased Wasserstein gradients. arXiv preprint arXiv:1705.10743, 2017.

[31] Arthur Gretton. Notes on the Cramér GAN. https://medium.com/towards-data-science/notes-on-the-cramer-gan-752abd505c00, 2017. Accessed: 2017-11-02.

[32] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.

[33] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.

[34] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.

[35] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.

[36] David Warde-Farley and Yoshua Bengio. Improving generative adversarial networks with denoising feature matching. In ICLR, 2017.

[37] Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Alex Lamb, Martin Arjovsky, Olivier Mastropietro, and Aaron Courville. Adversarially learned inference. In ICLR, 2017.

[38] Bharath K. Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Schölkopf, and Gert R. G. Lanckriet.
Hilbert space embeddings and metrics on probability measures. JMLR, 2010.