{"title": "Dual Discriminator Generative Adversarial Nets", "book": "Advances in Neural Information Processing Systems", "page_first": 2670, "page_last": 2680, "abstract": "We propose in this paper a novel approach to tackle the problem of mode collapse encountered in generative adversarial networks (GANs). Our idea is intuitive but proven to be very effective, especially in addressing some key limitations of GAN. In essence, it combines the Kullback-Leibler (KL) and reverse KL divergences into a unified objective function, thus exploiting the complementary statistical properties of these divergences to effectively diversify the estimated density in capturing multiple modes. We term our method the dual discriminator generative adversarial network (D2GAN) which, unlike GAN, has two discriminators; together with a generator, it again forms a minimax game, wherein one discriminator rewards high scores for samples from the data distribution whilst the other discriminator, conversely, favors data from the generator, and the generator produces data to fool both discriminators. We develop theoretical analysis to show that, given the maximal discriminators, optimizing the generator of D2GAN reduces to minimizing both the KL and reverse KL divergences between the data distribution and the distribution induced from the data generated by the generator, hence effectively avoiding the mode collapsing problem. We conduct extensive experiments on synthetic and real-world large-scale datasets (MNIST, CIFAR-10, STL-10, ImageNet), where we have made our best effort to compare our D2GAN with the latest state-of-the-art GAN variants in comprehensive qualitative and quantitative evaluations. 
The experimental results demonstrate the competitive and superior performance of our approach in generating good-quality and diverse samples over baselines, and the capability of our method to scale up to the ImageNet database.", "full_text": "Dual Discriminator Generative Adversarial Nets\n\nTu Dinh Nguyen, Trung Le, Hung Vu, Dinh Phung\n\nDeakin University, Geelong, Australia\n\nCentre for Pattern Recognition and Data Analytics\n\n{tu.nguyen, trung.l, hungv, dinh.phung}@deakin.edu.au\n\nAbstract\n\nWe propose in this paper a novel approach to tackle the problem of mode collapse encountered in generative adversarial networks (GANs). Our idea is intuitive but proven to be very effective, especially in addressing some key limitations of GAN. In essence, it combines the Kullback-Leibler (KL) and reverse KL divergences into a unified objective function, thus exploiting the complementary statistical properties of these divergences to effectively diversify the estimated density in capturing multiple modes. We term our method the dual discriminator generative adversarial network (D2GAN) which, unlike GAN, has two discriminators; together with a generator, it again forms a minimax game, wherein one discriminator rewards high scores for samples from the data distribution whilst the other discriminator, conversely, favors data from the generator, and the generator produces data to fool both discriminators. We develop theoretical analysis to show that, given the maximal discriminators, optimizing the generator of D2GAN reduces to minimizing both the KL and reverse KL divergences between the data distribution and the distribution induced from the data generated by the generator, hence effectively avoiding the mode collapsing problem. 
We conduct extensive experiments on synthetic and real-world large-scale datasets (MNIST, CIFAR-10, STL-10, ImageNet), where we have made our best effort to compare our D2GAN with the latest state-of-the-art GAN variants in comprehensive qualitative and quantitative evaluations. The experimental results demonstrate the competitive and superior performance of our approach in generating good-quality and diverse samples over baselines, and the capability of our method to scale up to the ImageNet database.\n\n1 Introduction\n\nGenerative models are a subarea of research that has been rapidly growing in recent years, and successfully applied in a wide range of modern real-world applications (e.g., see chapter 20 in [9]). Their common approach is to address the density estimation problem, where one aims to learn a model distribution p_model that approximates the true, but unknown, data distribution p_data. Methods in this approach deal with two fundamental problems. First, the learning behaviors and performance of generative models depend on the choice of objective functions to train them [29, 15]. The most widely used objective, considered the de facto standard, follows the principle of maximum likelihood estimation, which seeks model parameters that maximize the likelihood of the training data. This is equivalent to minimizing the Kullback-Leibler (KL) divergence between data and model distributions: D_KL(p_data ‖ p_model). It has been observed that this minimization tends to result in a p_model that covers multiple modes of p_data, but may produce completely unseen and potentially undesirable samples [29]. By contrast, another approach is to swap the arguments and instead minimize D_KL(p_model ‖ p_data), which is usually referred to as the reverse KL divergence [23, 11, 15, 29]. 
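To make this asymmetry concrete, here is a small numerical sketch (ours, not from the paper; the densities, grid, and names are illustrative) comparing the two divergences when a single-Gaussian model sits on one mode of a bimodal target:

```python
import numpy as np

# Toy illustration: forward KL(p||q) vs reverse KL(q||p) when q is a
# single Gaussian covering only one mode of a bimodal p.
def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def kl(a, b, dx, eps=1e-12):
    # Discretized KL divergence between densities a and b on a common grid.
    return float(np.sum(a * np.log((a + eps) / (b + eps))) * dx)

x = np.linspace(-10.0, 10.0, 4001)
dx = x[1] - x[0]
# Bimodal "data" density: two well-separated modes.
p = 0.5 * gaussian_pdf(x, -3.0, 0.5) + 0.5 * gaussian_pdf(x, 3.0, 0.5)
# "Mode-seeking" model: a single Gaussian on one mode only.
q = gaussian_pdf(x, 3.0, 0.5)

forward = kl(p, q, dx)   # KL(p_data || p_model): heavily penalizes the missed mode
reverse = kl(q, p, dx)   # KL(p_model || p_data): barely penalizes mode dropping
print(f"forward KL = {forward:.2f}, reverse KL = {reverse:.2f}")
```

The forward KL blows up because p has mass where q is nearly zero, while the reverse KL stays small (near log 2 here, since q sits entirely inside one mixture component); this is why reverse-KL-like objectives tolerate mode collapse.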
It is observed that optimization towards the reverse KL divergence criterion mimics a mode-seeking process, where p_model concentrates on a single mode of p_data while ignoring other modes; this is known as the problem of mode collapse. These behaviors are well studied in [29, 15, 11].\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nThe second problem is the choice of formulation for the density function of p_model [9]. One might choose to define an explicit density function, and then straightforwardly follow the maximum likelihood framework to estimate the parameters. Another idea is to estimate the data distribution using an implicit density function, without the need for an analytical form of p_model (e.g., see [11] for further discussion). One of the most notable pioneering classes of the latter is the generative adversarial network (GAN) [10], an expressive generative model that is capable of producing sharp and realistic images of natural scenes. Different from most generative models that maximize the data likelihood or its lower bound, GAN takes a radical approach that simulates a game between two players: a generator G that generates data by mapping samples from a noise space to the input space; and a discriminator D that acts as a classifier to distinguish real samples of a dataset from fake samples produced by the generator G. Both G and D are parameterized via neural networks, thus this method can be categorized into the family of deep generative models, or generative neural models [9]. The optimization of GAN formulates a minimax problem wherein, given an optimal D, the learning objective turns into finding G that minimizes the Jensen-Shannon divergence (JSD): D_JS(p_data ‖ p_model). The behavior of JSD minimization has been empirically shown to be more similar to that of reverse KL than of KL divergence [29, 15]. 
This, however, leads to the aforementioned issue of mode collapse, which is indeed a notorious failure of GAN [11], where the generator only produces similar-looking images, yielding a low-entropy distribution with poor variety of samples.\n\nRecent attempts have been made to solve the mode collapsing problem by improving the training of GAN. One idea is to use the minibatch discrimination trick [27] to allow the discriminator to detect samples that are unusually similar to other generated samples. Although this heuristic helps to generate visually appealing samples very quickly, it is computationally expensive, and is thus normally used only in the last hidden layer of the discriminator. Another approach is to unroll the optimization of the discriminator by several steps to create a surrogate objective for the update of the generator during training [20]. A third approach is to train many generators that discover different modes of the data [14]. Alternatively, around the same time, there were various attempts to employ autoencoders as regularizers or auxiliary losses to penalize missing modes [5, 31, 4, 30]. These models can avoid the mode collapsing problem to a certain extent, but at the cost of computational complexity, with the exception of DFM in [31], rendering them unable to scale up to ImageNet, a large-scale and challenging visual dataset.\n\nAddressing these challenges, we propose a novel approach to both effectively avoid mode collapse and efficiently scale up to very large datasets (e.g., ImageNet). Our approach combines the KL and reverse KL divergences into a unified objective function, thus exploiting the complementary statistical properties of these divergences to effectively diversify the estimated density in capturing multiple modes. 
We materialize our idea using GAN's framework, resulting in a novel generative adversarial architecture containing three players: a discriminator D1 that rewards high scores for data sampled from p_data rather than generated from the generator distribution p_G; another discriminator D2 that, conversely, favors data from p_G rather than p_data; and a generator G that generates data to fool both discriminators. We term our proposed model the dual discriminator generative adversarial network (D2GAN).\n\nIt turns out that training D2GAN shares the same minimax formulation as GAN, which can be solved by alternately updating the generator and discriminators. We provide theoretical analysis showing that, given G, D1 and D2 with enough capacity, i.e., in the nonparametric limit, at the optimal points the training criterion indeed results in the minimal distance between data and model distributions with respect to both their KL and reverse KL divergences. This helps the model place a fair distribution of probability mass across the modes of the data-generating distribution, thus allowing one to recover the data distribution and generate diverse samples using the generator in a single shot. In addition, we further introduce hyperparameters to stabilize the learning and control the effect of each divergence.\n\nWe conduct extensive experiments on one synthetic dataset and four real-world large-scale datasets (MNIST, CIFAR-10, STL-10, ImageNet) of very different nature. Since evaluating generative models is notoriously hard [29], we have made our best effort to adopt a number of evaluation metrics from the literature to quantitatively compare our proposed model with the latest state-of-the-art baselines whenever possible. The experimental results reveal that our method is capable of improving the diversity while keeping good quality of generated samples. 
More importantly, our proposed model can be scaled up to train on the large-scale ImageNet database, obtaining a competitive variety score and generating reasonably good quality images.\n\nIn short, our main contributions are: (i) a novel generative adversarial model that encourages the diversity of samples produced by the generator; (ii) a theoretical analysis proving that our objective is optimized towards minimizing both the KL and reverse KL divergences and has a global optimum where p_G = p_data; and (iii) a comprehensive evaluation of the effectiveness of our proposed method using a wide range of quantitative criteria on large-scale datasets.\n\n2 Generative Adversarial Nets\n\nWe first review the generative adversarial network (GAN), introduced in [10] to formulate a game of two players: a discriminator D and a generator G. The discriminator D(x) takes a point x in data space and computes the probability that x is sampled from the data distribution P_data rather than generated by the generator G. At the same time, the generator first maps a noise vector z drawn from a prior P(z) to the data space, obtaining a sample G(z) that resembles the training data, and then uses this sample to challenge the discriminator. The mapping G(z) induces a generator distribution P_G in the data domain with probability density function p_G(x). Both G and D are parameterized by neural networks (see Fig. 1a for an illustration) and learned by solving the following minimax optimization:\n\nmin_G max_D J(G, D) = E_{x∼P_data}[log D(x)] + E_{z∼P_z}[log(1 − D(G(z)))]\n\nThe learning follows an iterative procedure wherein the discriminator and generator are alternately updated. 
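As a quick numerical illustration (our sketch, not part of the paper), J(G, D) can be estimated by Monte Carlo from discriminator outputs; at the equilibrium P_G = P_data the optimal discriminator outputs 0.5 everywhere and J attains −log 4:

```python
import numpy as np

# Monte Carlo sketch (illustrative) of the GAN objective
# J(G, D) = E_{x~Pdata}[log D(x)] + E_{z~Pz}[log(1 - D(G(z)))].
def gan_objective(d_real, d_fake):
    """d_real: D's outputs on real samples; d_fake: D's outputs on generated samples."""
    d_real, d_fake = np.asarray(d_real), np.asarray(d_fake)
    return float(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake)))

# With D(x) = 0.5 on every sample, J = 2 * log(0.5) = -log 4.
j_equilibrium = gan_objective([0.5] * 4, [0.5] * 4)
print(j_equilibrium)  # ≈ -1.3863
```

In practice the two expectations are estimated on minibatches of real and generated samples, and D ascends this quantity while G descends it.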
Given a fixed G, the maximization with respect to D results in the optimal discriminator D⋆(x) = p_data(x) / (p_data(x) + p_G(x)), whilst given this optimal D⋆, the minimization with respect to G turns into minimizing the Jensen-Shannon (JS) divergence between the data and model distributions: D_JS(P_data ‖ P_G) [10]. At the Nash equilibrium of the game, the model distribution recovers the data distribution exactly, P_G = P_data, and the discriminator D fails to differentiate real from fake data, as D(x) = 0.5, ∀x.\n\n(a) GAN. (b) D2GAN.\n\nFigure 1: An illustration of the standard GAN and our proposed D2GAN.\n\nSince the JS divergence has been empirically shown to behave similarly to the reverse KL divergence [29, 15, 11], GAN suffers from the mode collapsing problem, and thus its generated data samples have a low level of diversity [20, 5].\n\n3 Dual Discriminator Generative Adversarial Nets\n\nTo tackle GAN's problem of mode collapse, in what follows we present our main contribution: a framework that seeks an approximate distribution to effectively cover the many modes of multimodal data. Our intuition is based on GAN, but we formulate a three-player game that consists of two different discriminators D1 and D2, and one generator G. Given a sample x in data space, D1(x) rewards a high score if x is drawn from the data distribution P_data, and gives a low score if x is generated from the model distribution P_G. In contrast, D2(x) returns a high score for x generated from P_G whilst giving a low score for a sample drawn from P_data. Unlike GAN, the scores returned by our discriminators are values in R+ rather than probabilities in [0, 1]. Our generator G performs a similar role to that of GAN, i.e., producing data mapped from a noise space to mimic the real data and then fool both discriminators D1 and D2. All three players are parameterized by neural networks, wherein D1 and D2 do not share their parameters. 
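As an illustrative sketch (ours, not the paper's implementation), the three-player objective that the paper formalizes as Eq. (1), with stabilizing weights α and β, can be estimated by Monte Carlo from the two discriminators' positive scores:

```python
import numpy as np

# Sketch (illustrative) of the D2GAN three-player objective of Eq. (1):
# J = alpha*E[log D1(x)] + E[-D1(G(z))] + E[-D2(x)] + beta*E[log D2(G(z))],
# where D1, D2 output unbounded positive scores (e.g., via softplus).
def d2gan_objective(d1_real, d1_fake, d2_real, d2_fake, alpha=0.2, beta=0.1):
    d1_real, d1_fake = np.asarray(d1_real), np.asarray(d1_fake)
    d2_real, d2_fake = np.asarray(d2_real), np.asarray(d2_fake)
    return float(alpha * np.mean(np.log(d1_real)) - np.mean(d1_fake)
                 - np.mean(d2_real) + beta * np.mean(np.log(d2_fake)))

# D1 scores real data highly and fake data low; D2 does the opposite.
j = d2gan_objective(d1_real=[5.0, 4.0], d1_fake=[0.1, 0.2],
                    d2_real=[0.1, 0.2], d2_fake=[5.0, 4.0])
```

The discriminators ascend J while the generator descends the terms involving G(z); shrinking α and β downweights the log rewards relative to the unbounded linear penalty terms, which is the stabilization role the paper assigns to these weights.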
We term our proposed model the dual discriminator generative adversarial network (D2GAN). Fig. 1b shows an illustration of D2GAN.\n\nMore formally, D1, D2 and G now play the following three-player minimax optimization game:\n\nmin_G max_{D1,D2} J(G, D1, D2) = α × E_{x∼P_data}[log D1(x)] + E_{z∼P_z}[−D1(G(z))] + E_{x∼P_data}[−D2(x)] + β × E_{z∼P_z}[log D2(G(z))]   (1)\n\nwherein we have introduced hyperparameters 0 < α, β ≤ 1 to serve two purposes. The first is to stabilize the learning of our model. As the output values of the two discriminators are positive and unbounded, D1(G(z)) and D2(x) in Eq. (1) can become very large and have a much stronger impact on the optimization than log D1(x) and log D2(G(z)) do, rendering the learning unstable. To overcome this issue, we can decrease α and β, in effect making the optimization penalize D1(G(z)) and D2(x), thus helping to stabilize the learning. The second purpose of introducing α and β is to control the effect of the KL and reverse KL divergences on the optimization problem. This will be discussed below once we have derived the optimal solution.\n\nSimilar to GAN [10], our proposed network can be trained by alternately updating D1, D2 and G. We refer to the supplementary material for the pseudo-code for learning the parameters of D2GAN.\n\n3.1 Theoretical analysis\n\nWe now provide a formal theoretical analysis of our proposed model, which essentially shows that, given G, D1 and D2 of enough capacity, i.e., in the nonparametric limit, at the optimal points G can recover the data distribution by minimizing both the KL and reverse KL divergences between model and data distributions. We first consider the optimization problem with respect to (w.r.t.) the discriminators given a fixed generator.\n\nProposition 1. 
Given a fixed G, maximizing J(G, D1, D2) yields the following closed-form optimal discriminators D⋆1, D⋆2:\n\nD⋆1(x) = αp_data(x)/p_G(x) and D⋆2(x) = βp_G(x)/p_data(x)\n\nProof. According to the induced measure theorem [12], the two expectations are equal: E_{z∼P_z}[f(G(z))] = E_{x∼P_G}[f(x)], where f(x) = −D1(x) or f(x) = log D2(x). The objective function can thus be rewritten as:\n\nJ(G, D1, D2) = α × E_{x∼P_data}[log D1(x)] + E_{x∼P_G}[−D1(x)] + E_{x∼P_data}[−D2(x)] + β × E_{x∼P_G}[log D2(x)]\n= ∫_x [αp_data(x) log D1(x) − p_G(x)D1(x) − p_data(x)D2(x) + βp_G(x) log D2(x)] dx\n\nConsidering the function inside the integral, for each given x we maximize this function w.r.t. the two variables D1, D2 to find D⋆1(x) and D⋆2(x). Setting the derivatives w.r.t. D1 and D2 to 0, we gain:\n\nαp_data(x)/D1(x) − p_G(x) = 0 and βp_G(x)/D2(x) − p_data(x) = 0   (2)\n\nThe second derivatives, −αp_data(x)/D1(x)² and −βp_G(x)/D2(x)², are nonpositive, thus verifying that we have obtained the maximum solution and concluding the proof.\n\nNext, we fix D1 = D⋆1, D2 = D⋆2 and find the optimal solution G⋆ for the generator G.\n\nTheorem 2. At the Nash equilibrium point (G⋆, D⋆1, D⋆2) of the minimax optimization problem of D2GAN, each component has the following form:\n\nJ(G⋆, D⋆1, D⋆2) = α(log α − 1) + β(log β − 1), with D⋆1(x) = α and D⋆2(x) = β, ∀x, at p_G⋆ = p_data\n\nProof. Substituting D⋆1, D⋆2 from Eq. (2) into the objective function in Eq. (1) of the minimax problem, we gain:\n\nJ(G, D⋆1, D⋆2) = α × E_{x∼P_data}[log α + log(p_data(x)/p_G(x))] − α ∫_x p_G(x)(p_data(x)/p_G(x)) dx − β ∫_x p_data(x)(p_G(x)/p_data(x)) dx + β × E_{x∼P_G}[log β + log(p_G(x)/p_data(x))]\n= α(log α − 1) + β(log β − 1) + αD_KL(P_data ‖ P_G) + βD_KL(P_G ‖ P_data)   (3)\n\nwhere D_KL(P_data ‖ P_G) and D_KL(P_G ‖ P_data) are the KL and reverse KL divergences between the data and model (generator) distributions, respectively. These divergences are always nonnegative and are zero only when the two distributions are equal: p_G⋆ = p_data. In other words, the generator induces a distribution p_G⋆ identical to the data distribution p_data, and the two discriminators now fail to recognize real versus fake samples, since they return constant scores (α and β, respectively) for both. This concludes the proof.\n\nThe loss of the generator in Eq. (3) becomes an upper bound when the discriminators are not optimal. This loss shows that increasing α promotes the optimization towards minimizing the KL divergence D_KL(P_data ‖ P_G), thus helping the generative distribution cover multiple modes, though possibly including undesirable samples; whereas increasing β encourages the minimization of the reverse KL divergence D_KL(P_G ‖ P_data), hence enabling the generator to capture a single mode better, though possibly missing many modes. By empirically adjusting these two hyperparameters, we can balance the effect of the two divergences, and hence effectively avoid the mode collapsing issue.\n\n3.2 Connection to f-GAN\n\nNext we point out the relation between our proposed D2GAN and f-GAN, the model that extends the Jensen-Shannon divergence (JSD) of GAN to more general divergences, specifically f-divergences [23]. 
A divergence in the f-divergence family has the following form:\n\nD_f(P ‖ Q) = ∫_X q(x) f(p(x)/q(x)) dx\n\nwhere f : R+ → R is a convex, lower-semicontinuous function satisfying f(1) = 0. This function has a convex conjugate f*, also known as the Fenchel conjugate [13]: f*(t) = sup_{u∈dom_f} {ut − f(u)}. The function f* is again convex and lower-semicontinuous.\n\nConsidering P the true distribution and Q the generator distribution, we recover the learning problem of GAN by minimizing the f-divergence between P and Q. Based on the variational lower bound of the f-divergence proposed by Nguyen et al. [22], the objective function of f-GAN can be derived as follows:\n\nmin_θ max_φ F(θ, φ) = E_{x∼P}[g_f(V_φ(x))] + E_{x∼Q_θ}[−f*(g_f(V_φ(x)))]\n\nwhere Q is parameterized by θ (as the generator in GAN), V_φ : X → R is a function parameterized by φ (as the discriminator in GAN), and g_f : R → dom_f* is an output activation function (i.e., the discriminator's decision function) specific to the f-divergence used. Using appropriate functions g_f and f* (see Tab. 2 in [23]), we recover the minimization of the corresponding divergences, such as the JSD in GAN, and the KL (associated with discriminator D1) and reverse KL (associated with discriminator D2) of our D2GAN.\n\nThe f-GAN, however, only considers a single divergence. By contrast, our proposed method combines the KL and reverse KL divergences. Our idea is conceived upon weighing the advantages and disadvantages of these two divergences in covering multiple modes of data. Combining them into a unified objective function as in Eq. (3) lets us reverse-engineer this combination to finally obtain the optimization game in Eq. 
(1) that can be efficiently formulated and solved using the principle of GAN.\n\n4 Experiments\n\nIn this section, we conduct comprehensive experiments to demonstrate the improved mode coverage and the scalability of our proposed model on large-scale datasets. We use a synthetic 2D dataset for both visual and numerical verification, and four datasets of increasing diversity and size for numerical verification. We have made our best effort to compare the results of our method with those of the latest state-of-the-art GAN variants by replicating the experimental settings of the original work whenever possible.\n\nFor each experiment, we refer to the supplementary material for model architectures and additional results. Common settings are: (i) discriminator outputs use softplus activations, f(x) = ln(1 + e^x), i.e., a smooth, positive version of ReLU; (ii) the Adam optimizer [16] with learning rate 0.0002 and first-order momentum 0.5; (iii) a minibatch size of 64 samples for training both the generator and the discriminators; (iv) Leaky ReLU with a slope of 0.2; and (v) weights initialized from an isotropic Gaussian, N(0, 0.01), and zero biases. Our implementation is in TensorFlow [1] and we have published a version for reference1.\n\nFigure 2: The comparison of standard GAN, UnrolledGAN and our D2GAN on a 2D synthetic dataset. (a) Symmetric KL divergence. (b) Wasserstein distance. (c) Evolution of data (in blue) generated from GAN (top row), UnrolledGAN (middle row) and our D2GAN (bottom row) on 2D data of 8 Gaussians; data sampled from the true mixture are in red.\n\nWe now present our experiments on synthetic data, followed by those on large-scale real-world datasets.\n\n4.1 Synthetic data\n\nIn the first experiment, we reuse the experimental design proposed in [20] to investigate how well our D2GAN can deal with multiple modes in the data. 
More specifically, we sample training data from a 2D mixture of 8 Gaussian distributions with covariance matrix 0.02I and means arranged on a circle centered at zero with radius 2.0. Data in these low-variance mixture components are separated by areas of very low density. The aim is to examine model behavior in the presence of low-probability regions and well-separated modes.\n\nWe use a simple architecture: a generator with two fully connected hidden layers, and discriminators with one hidden layer of ReLU activations. This setting is identical to, and thus ensures a fair comparison with, UnrolledGAN2 [20]. Fig. 2c shows the evolution through time of 512 samples generated by our model and the baselines. It can be seen that the regular GAN generates data collapsing into a single mode hovering around the valid modes of the data distribution, reflecting mode collapse in GAN. By contrast, UnrolledGAN and D2GAN distribute data around all 8 mixture components, demonstrating their ability to successfully learn multimodal data in this case. At the final steps, our D2GAN captures the data modes more precisely than UnrolledGAN: in each mode, UnrolledGAN generates data concentrated on only a few points around the mode's centroid, and thus appears to produce fewer distinct samples than D2GAN, whose samples spread fairly across each entire mode.\n\nNext we quantitatively compare the quality of the generated data. Since we know the true distribution p_data in this case, we employ two measures, namely the symmetric KL divergence and the Wasserstein distance. These measures compute the distance between the true p_data and the normalized histograms of 10,000 points generated from our D2GAN, UnrolledGAN and GAN. Figs. 2a and 2b again clearly demonstrate the superiority of our approach over GAN and UnrolledGAN w.r.t. both distances (lower is better); notably, with the Wasserstein metric, the distance from ours to the true distribution almost reduces to zero. 
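For reference, these two metrics can be sketched in a few lines; this is our illustrative 1D version over histograms, not the paper's 2D evaluation code:

```python
import numpy as np

# Sketch (ours) of the two metrics: symmetric KL divergence and the
# Wasserstein-1 distance between normalized histograms on a shared grid.
def histogram(samples, bins):
    h, _ = np.histogram(samples, bins=bins)
    return h / h.sum()

def symmetric_kl(p, q, eps=1e-10):
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def wasserstein_1d(p, q, bin_width):
    # W1 between 1D histograms = integral of |CDF_p - CDF_q|.
    return float(np.sum(np.abs(np.cumsum(p) - np.cumsum(q))) * bin_width)

rng = np.random.default_rng(0)
bins = np.linspace(-4.0, 4.0, 81)
true_samples = rng.normal(0.0, 1.0, 10_000)
model_samples = rng.normal(0.5, 1.0, 10_000)  # a slightly shifted "generator"
p, q = histogram(true_samples, bins), histogram(model_samples, bins)
skl = symmetric_kl(p, q)
w1 = wasserstein_1d(p, q, bins[1] - bins[0])
```

Both quantities are zero for identical histograms and grow with the mismatch; the mean shift of 0.5 above yields a W1 of roughly that size.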
These figures also demonstrate the stability of our D2GAN (red curves) during training, as it fluctuates much less than GAN (green curves) and UnrolledGAN (blue curves).\n\n4.2 Real-world datasets\n\nWe now examine the performance of our proposed method on real-world datasets of increasing diversity and size. For networks containing convolutional layers, we closely follow the DCGAN design [24]. We use strided convolutions for the discriminators and fractional-strided convolutions for the generator instead of pooling layers. Batch normalization is applied to each layer, except the generator output layer and the discriminator input layers. We also use Leaky ReLU activations for the discriminators, and ReLU for the generator, except for its output, which is tanh, since we rescale the pixel intensities into the range [-1, 1] before feeding images to our model. The only difference is that, for our model, initializing the weights from N(0, 0.01) yields slightly better results than from N(0, 0.02). We again refer to the supplementary material for detailed architectures.\n\n1https://github.com/tund/D2GAN\n2We obtain the code of UnrolledGAN for 2D data from the link the authors provided in [20].\n\n4.2.1 Evaluation protocol\n\nEvaluating the quality of images produced by generative models is notoriously challenging due to the variety of probability criteria and the lack of a perceptually meaningful image similarity metric [29]. Even when a model can generate plausible images, it is not useful if those images are all visually similar. 
Therefore, in order to quantify the performance of covering data modes as well as producing high-quality samples, we use several different ad hoc metrics in different experiments to compare with other baselines.\n\nFirst we adopt the Inception score proposed in [27], computed by exp(E_x[D_KL(p(y | x) ‖ p(y))]), where p(y | x) is the conditional label distribution for image x, estimated using a pretrained Inception model [28], and p(y) is the marginal distribution: p(y) ≈ (1/N) Σ_{n=1}^{N} p(y | x_n = G(z_n)). This metric rewards good and varied samples, but is sometimes easily fooled by a model that collapses and generates very low-quality images, and thus fails to measure whether a model has been trapped in one bad mode. To address this problem, for labeled datasets, we further recruit the so-called MODE score introduced in [5]:\n\nexp(E_x[D_KL(p(y | x) ‖ p̃(y))] − D_KL(p(y) ‖ p̃(y)))\n\nwhere p̃(y) is the empirical distribution of labels estimated from the training data. This score can adequately reflect the variety and visual quality of images, as discussed in [5].\n\n4.2.2 Handwritten digit images\n\nWe start with the handwritten digit images of MNIST [19], which consists of 60,000 training and 10,000 testing 28×28 grayscale images of digits from 0 to 9. Following the setting in [5], we first assume that MNIST has 10 modes, representing connected components in the data manifold, associated with the 10 digit classes. We then perform an extensive grid search over hyperparameter configurations, wherein our two regularization constants α, β in Eq. (1) are varied over {0.01, 0.05, 0.1, 0.2}. For a fair comparison, we use the same parameter ranges and fully connected layers for our network (cf. 
the supplementary material for more details), and adopt the results of GAN and mode-regularized GAN (Reg-GAN) from [5].\n\nFor evaluation, we first train a simple yet effective 3-layer convolutional net3 that obtains a 0.65% error rate on the MNIST test set, and then employ it to predict the label probabilities and compute MODE scores for generated samples. Fig. 3 (left) shows the distributions of MODE scores obtained by the three models. Clearly, our proposed D2GAN significantly outperforms the standard GAN and Reg-GAN by achieving scores mostly around the maximum [8.0-9.0]. It is worth noting that we did not observe substantial differences in the average MODE scores obtained by varying the network size during the parameter search. We here report the result of the minimal network, with the smallest number of layers and hidden units.\n\nTo study the effect of α and β, we inspect the results obtained by this minimal network with varied α, β in Fig. 3 (right). There is a pattern that, given a fixed α, our D2GAN obtains a better MODE score when increasing β up to a certain value, after which the score can significantly decrease.\n\nMNIST-1K. The standard MNIST data with the 10-mode assumption is fairly trivial, so we also test our proposed model on a more challenging dataset built from it. We follow the technique used in [5, 20] to construct a new 1000-class MNIST dataset (MNIST-1K) by stacking three randomly selected digits to form an RGB image with a different digit image in each channel. The resulting data can be assumed to contain 1,000 distinct modes, corresponding to the combinations of digits in the 3 channels, from 000 to 999.\n\nIn this experiment, we use a more powerful model with convolutional layers for the discriminators and transposed convolutions for the generator. 
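The MNIST-1K construction described above can be sketched as follows (our illustrative code; `digits` and `labels` are hypothetical stand-ins for the real MNIST arrays):

```python
import numpy as np

# Sketch (ours) of the MNIST-1K construction: stack three randomly chosen
# 28x28 digit images into the R, G, B channels of one image; the image's
# "mode" is the 3-digit number formed by its channel labels (000-999).
def make_mnist_1k(digits, labels, n_samples, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(digits), size=(n_samples, 3))
    images = np.stack([digits[idx[:, c]] for c in range(3)], axis=-1)  # (N, 28, 28, 3)
    modes = labels[idx[:, 0]] * 100 + labels[idx[:, 1]] * 10 + labels[idx[:, 2]]
    return images, modes

# Tiny fake stand-in for MNIST: 100 blank images with cyclic "digit" labels.
fake_digits = np.zeros((100, 28, 28), dtype=np.float32)
fake_labels = np.arange(100) % 10
imgs, modes = make_mnist_1k(fake_digits, fake_labels, n_samples=256)
n_modes_covered = len(np.unique(modes))  # coverage count over generated samples
```

Mode coverage is then measured by classifying each channel of a generated sample with the pretrained digit classifier and counting the distinct 3-digit combinations, as described in the evaluation that follows.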
We measure the performance by the number of modes for which the model generates at least one sample among 25,600 samples in total, and by the reverse KL divergence between the model distribution (i.e., the label distribution predicted by the pretrained MNIST classifier used in the previous experiment) and the expected data distribution. Tab. 1 reports the results of our D2GAN compared with those of GAN and UnrolledGAN taken from [20], and DCGAN and Reg-GAN from [5]. Our proposed method again clearly demonstrates its superiority over the baselines by covering all modes and achieving the best distance, which is close to zero.\n\n3Network architecture is similar to https://github.com/fchollet/keras/blob/master/examples/mnist_cnn.py.\n\nFigure 3: Distributions of MODE scores (left) and average MODE scores (right) with varied α, β.\n\nTable 1: Numbers of modes covered and reverse KL divergence between model and data distributions.\n\nModel | # modes covered | D_KL(model ‖ data)\nGAN [20] | 628.0±140.9 | 2.58±0.75\nUnrolledGAN [20] | 817.4±37.9 | 1.43±0.12\nDCGAN [5] | 849.6±62.7 | 0.73±0.09\nReg-GAN [5] | 955.5±18.7 | 0.64±0.05\nD2GAN | 1000.0±0.00 | 0.08±0.01\n\n4.2.3 Natural scene images\n\nWe now extend our experiments to investigate the scalability of our proposed method on much more challenging large-scale image databases of natural scenes. We use three widely adopted datasets: CIFAR-10 [17], STL-10 [6] and ImageNet [26]. CIFAR-10 is a well-studied dataset of 50,000 32×32 training images of 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. STL-10, a subset of ImageNet, contains about 100,000 unlabeled 96×96 images, which is more diverse than CIFAR-10, but less so than the full ImageNet. We rescale all images down 3 times and train our networks at 32×32 resolution. 
ImageNet is a very large database of about 1.2 million natural images from 1,000 classes, commonly used as the most challenging benchmark to validate the scalability of deep models. We follow the preprocessing in [18], except that we subsample to 32×32 resolution. We use the code provided in [27] to compute the Inception score on 10 independent partitions of 50,000 generated samples.

Table 2: Inception scores on CIFAR-10.

Model               Score
Real data           11.24±0.16
WGAN [2]            3.82±0.06
MIX+WGAN [3]        4.04±0.07
Improved-GAN [27]   4.36±0.04
ALI [8]             5.34±0.05
BEGAN [4]           5.62
MAGAN [30]          5.67
DCGAN [24]          6.40±0.05
DFM [31]            7.72±0.13
D2GAN               7.15±0.07

Figure 4: Inception scores on STL-10 and ImageNet.

Tab. 2 and Fig. 4 show the Inception scores on the CIFAR-10, STL-10 and ImageNet datasets obtained by our model and by baselines collected from recent work in the literature. It is worth noting that we only compare with methods trained in a completely unsupervised manner, without label information. As a result, there are 8 baselines on CIFAR-10, whilst only DCGAN [24] and denoising feature matching (DFM) [31] are available on STL-10 and ImageNet. We use our own TensorFlow implementation of DCGAN with the same network architecture as our model for a fair comparison. In all 3 experiments, D2GAN fails to beat DFM, but outperforms the other baselines by large margins.
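The Inception score from [27] is exp(E_x[KL(p(y|x) ‖ p(y))]), averaged over independent partitions of the generated samples. A compact sketch (our own, assuming strictly positive class probabilities `probs` from the Inception network):

```python
import numpy as np

def inception_score(probs, n_splits=10):
    """probs: (N, C) class probabilities p(y|x) for N generated samples.
    Returns mean and std of exp(E_x[KL(p(y|x) || p(y))]) over n_splits."""
    scores = []
    for chunk in np.array_split(probs, n_splits):
        p_y = chunk.mean(axis=0, keepdims=True)               # marginal p(y)
        kl = (chunk * (np.log(chunk) - np.log(p_y))).sum(axis=1)
        scores.append(np.exp(kl.mean()))
    return float(np.mean(scores)), float(np.std(scores))
```

Higher is better: the score rewards confident per-sample predictions (quality) together with a near-uniform marginal over classes (diversity).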
The lower results compared with DFM suggest that using autoencoders for matching high-level features appears to be an effective way to encourage diversity.
This technique is compatible with our method, so integrating it could be a promising avenue for future work.
The two discriminators D1 and D2 have almost identical architectures, so they could potentially share parameters under many different schemes. We explore this direction by creating two versions of our D2GAN with the same hyperparameter settings. The first version shares all parameters of D1 and D2 except the last (output) layer. This model failed because the discriminator then contains far fewer parameters, rendering it unable to capture the two inverse ratios of the density functions. The second version shares all parameters of D1 and D2 except the last two layers. It performed better than the first, obtaining promising Inception scores (7.01 on CIFAR-10, 7.44 on STL-10 and 7.81 on ImageNet), but these results are still worse than those of our proposed model without parameter sharing.
Finally, we show several samples generated by our proposed model trained on these three datasets in Fig. 5. The samples are fair random draws, not cherry-picked. It can be seen that our D2GAN is able to produce visually recognizable images of cars, trucks, boats and horses on CIFAR-10. The objects become harder to recognize on STL-10, but the shapes of airplanes, cars, trucks and animals can still be identified, and images with various backgrounds such as sky, underwater, mountain and forest appear on ImageNet. This confirms the diversity of the samples generated by our model.

(a) CIFAR-10.

(b) STL-10.

(c) ImageNet.

Figure 5: Samples generated by our proposed D2GAN trained on natural image datasets. Due to the space limit, please refer to the supplementary material for larger plots.

5 Conclusion

To summarize, we have introduced a novel approach to combining the Kullback-Leibler (KL) and reverse KL divergences in a unified objective function for the density estimation problem.
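The complementary behavior of the two divergences can be seen in a small numerical illustration (our own toy example, not from the paper): the forward KL blows up when the model drops a data mode, while the reverse KL blows up when the model places mass where the data has none.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for discrete distributions; inf if q misses support of p."""
    mask = p > 0
    if np.any(q[mask] == 0):
        return np.inf
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = np.array([0.5, 0.5, 0.0])            # data: two equal modes
q_collapsed = np.array([1.0, 0.0, 0.0])  # model collapsed onto one mode
q_spread = np.array([0.45, 0.45, 0.10])  # model leaking mass off-support

print(kl(p, q_collapsed))  # inf: forward KL punishes the dropped mode
print(kl(q_collapsed, p))  # ~0.69: reverse KL barely notices the collapse
print(kl(p, q_spread))     # ~0.11: forward KL tolerates the spurious mass
print(kl(q_spread, p))     # inf: reverse KL punishes the spurious mass
```

Minimizing both divergences at once, as the D2GAN objective does, therefore penalizes both mode dropping and off-support samples.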
Our idea is to exploit the complementary statistical properties of the two divergences to improve both the quality and diversity of samples generated from the estimator. To that end, we propose a novel framework based on generative adversarial nets (GANs), which formulates a minimax game of three players: two discriminators and one generator, hence termed dual discriminator GAN (D2GAN). With the two discriminators fixed, learning the generator amounts to optimizing both the KL and reverse KL divergences simultaneously, which helps avoid mode collapse, a notorious drawback of GANs. We have conducted extensive experiments to demonstrate the effectiveness and scalability of our proposed approach using synthetic and large-scale real-world datasets. Compared with the latest state-of-the-art baselines, our model is more scalable, can be trained on the large-scale ImageNet dataset, and obtains Inception scores lower than those of the combination of a denoising autoencoder and GAN (DFM), but significantly higher than those of the others. Finally, we note that our method is orthogonal to, and could integrate, techniques used in those baselines, such as semi-supervised learning [27], conditional architectures [21, 7, 25] and autoencoders [5, 31].

Acknowledgments. This work was partially supported by the Australian Research Council (ARC) Discovery Grant Project DP160109394.

References

[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S.
Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.

[2] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.

[3] Sanjeev Arora, Rong Ge, Yingyu Liang, Tengyu Ma, and Yi Zhang. Generalization and equilibrium in generative adversarial nets (GANs). arXiv preprint arXiv:1703.00573, 2017.

[4] David Berthelot, Tom Schumm, and Luke Metz. BEGAN: Boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717, 2017.

[5] Tong Che, Yanran Li, Athul Paul Jacob, Yoshua Bengio, and Wenjie Li. Mode regularized generative adversarial networks. arXiv preprint arXiv:1612.02136, 2016.

[6] Adam Coates, Andrew Y. Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 215–223, 2011.

[7] Emily L. Denton, Soumith Chintala, Rob Fergus, et al. Deep generative image models using a Laplacian pyramid of adversarial networks. In Advances in Neural Information Processing Systems (NIPS), pages 1486–1494, 2015.

[8] Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Alex Lamb, Martin Arjovsky, Olivier Mastropietro, and Aaron Courville. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.
[9] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.

[10] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27 (NIPS), pages 2672–2680. Curran Associates, Inc., 2014.

[11] Ian J. Goodfellow. NIPS 2016 tutorial: Generative adversarial networks. CoRR, 2017.

[12] Somesh Das Gupta and Jun Shao. Mathematical statistics, 2000.

[13] Jean-Baptiste Hiriart-Urruty and Claude Lemaréchal. Fundamentals of convex analysis. Springer Science & Business Media, 2012.

[14] Quan Hoang, Tu Dinh Nguyen, Trung Le, and Dinh Phung. Multi-generator generative adversarial nets. arXiv preprint arXiv:1708.02556, 2017.

[15] Ferenc Huszár. How (not) to train your generative model: Scheduled sampling, likelihood, adversary? arXiv preprint arXiv:1511.05101, 2015.

[16] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[17] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Computer Science Department, University of Toronto, Tech. Rep, 1(4), 2009.

[18] Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. ImageNet classification with deep convolutional neural networks. In Proceedings of the 26th Annual Conference on Neural Information Processing Systems (NIPS), volume 2, pages 1097–1105, Lake Tahoe, United States, December 3–6 2012.

[19] Yann LeCun, Corinna Cortes, and Christopher J.C. Burges. The MNIST database of handwritten digits. 1998.

[20] Luke Metz, Ben Poole, David Pfau, and Jascha Sohl-Dickstein.
Unrolled generative adversarial networks. arXiv preprint arXiv:1611.02163, 2016.

[21] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.

[22] XuanLong Nguyen, Martin J. Wainwright, and Michael I. Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847–5861, 2010.

[23] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29 (NIPS), pages 271–279. Curran Associates, Inc., 2016.

[24] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

[25] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text to image synthesis. In Proceedings of The 33rd International Conference on Machine Learning (ICML), volume 3, 2016.

[26] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.

[27] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems (NIPS), pages 2226–2234, 2016.

[28] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2818–2826, 2016.

[29] Lucas Theis, Aäron van den Oord, and Matthias Bethge. A note on the evaluation of generative models. arXiv preprint arXiv:1511.01844, 2015.

[30] Ruohan Wang, Antoine Cully, Hyung Jin Chang, and Yiannis Demiris. MAGAN: Margin adaptation for generative adversarial networks. arXiv preprint arXiv:1704.03817, 2017.

[31] David Warde-Farley and Yoshua Bengio. Improving generative adversarial networks with denoising feature matching. In International Conference on Learning Representations (ICLR), 2017.