{"title": "Generative Adversarial Nets", "book": "Advances in Neural Information Processing Systems", "page_first": 2672, "page_last": 2680, "abstract": "We propose a new framework for estimating generative models via adversarial nets, in which we simultaneously train two models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than G. The training procedure for G is to maximize the probability of D making a mistake. This framework corresponds to a minimax two-player game. In the space of arbitrary functions G and D, a unique solution exists, with G recovering the training data distribution and D equal to 1/2 everywhere. In the case where G and D are defined by multilayer perceptrons, the entire system can be trained with backpropagation. There is no need for any Markov chains or unrolled approximate inference networks during either training or generation of samples. Experiments demonstrate the potential of the framework through qualitative and quantitative evaluation of the generated samples.", "full_text": "Generative Adversarial Nets\n\nIan J. Goodfellow\u2217, Jean Pouget-Abadie\u2020, Mehdi Mirza, Bing Xu, David Warde-Farley,\nSherjil Ozair\u2021, Aaron Courville, Yoshua Bengio\u00a7\n\nD\u00e9partement d\u2019informatique et de recherche op\u00e9rationnelle\nUniversit\u00e9 de Montr\u00e9al\nMontr\u00e9al, QC H3C 3J7\n\nAbstract\n\nWe propose a new framework for estimating generative models via an adversarial\nprocess, in which we simultaneously train two models: a generative model G\nthat captures the data distribution, and a discriminative model D that estimates\nthe probability that a sample came from the training data rather than G. The training\nprocedure for G is to maximize the probability of D making a mistake. This\nframework corresponds to a minimax two-player game. 
In the space of arbitrary\nfunctions G and D, a unique solution exists, with G recovering the training data\ndistribution and D equal to 1/2 everywhere. In the case where G and D are defined\nby multilayer perceptrons, the entire system can be trained with backpropagation.\nThere is no need for any Markov chains or unrolled approximate inference networks\nduring either training or generation of samples. Experiments demonstrate\nthe potential of the framework through qualitative and quantitative evaluation of\nthe generated samples.\n\n1 Introduction\n\nThe promise of deep learning is to discover rich, hierarchical models [2] that represent probability\ndistributions over the kinds of data encountered in artificial intelligence applications, such as natural\nimages, audio waveforms containing speech, and symbols in natural language corpora. So far, the\nmost striking successes in deep learning have involved discriminative models, usually those that\nmap a high-dimensional, rich sensory input to a class label [14, 20]. These striking successes have\nprimarily been based on the backpropagation and dropout algorithms, using piecewise linear units\n[17, 8, 9] which have a particularly well-behaved gradient. Deep generative models have had less\nof an impact, due to the difficulty of approximating many intractable probabilistic computations that\narise in maximum likelihood estimation and related strategies, and due to the difficulty of leveraging\nthe benefits of piecewise linear units in the generative context. We propose a new generative model\nestimation procedure that sidesteps these difficulties.\u00b9\n\nIn the proposed adversarial nets framework, the generative model is pitted against an adversary: a\ndiscriminative model that learns to determine whether a sample is from the model distribution or the\ndata distribution. 
The generative model can be thought of as analogous to a team of counterfeiters,\ntrying to produce fake currency and use it without detection, while the discriminative model is\nanalogous to the police, trying to detect the counterfeit currency. Competition in this game drives\nboth teams to improve their methods until the counterfeits are indistinguishable from the genuine\narticles.\n\n\u2217Ian Goodfellow is now a research scientist at Google, but did this work earlier as a UdeM student.\n\u2020Jean Pouget-Abadie did this work while visiting Universit\u00e9 de Montr\u00e9al from Ecole Polytechnique.\n\u2021Sherjil Ozair is visiting Universit\u00e9 de Montr\u00e9al from Indian Institute of Technology Delhi.\n\u00a7Yoshua Bengio is a CIFAR Senior Fellow.\n\u00b9All code and hyperparameters available at http://www.github.com/goodfeli/adversarial\n\nThis framework can yield specific training algorithms for many kinds of model and optimization\nalgorithm. In this article, we explore the special case when the generative model generates samples\nby passing random noise through a multilayer perceptron, and the discriminative model is also a\nmultilayer perceptron. We refer to this special case as adversarial nets. In this case, we can train\nboth models using only the highly successful backpropagation and dropout algorithms [16] and\nsample from the generative model using only forward propagation. No approximate inference or\nMarkov chains are necessary.\n\n2 Related work\n\nUntil recently, most work on deep generative models focused on models that provided a parametric\nspecification of a probability distribution function. The model can then be trained by maximizing\nthe log likelihood. In this family of models, perhaps the most successful is the deep Boltzmann\nmachine [25]. Such models generally have intractable likelihood functions and therefore require\nnumerous approximations to the likelihood gradient. 
These difficulties motivated the development\nof \u201cgenerative machines\u201d: models that do not explicitly represent the likelihood, yet are able to generate\nsamples from the desired distribution. Generative stochastic networks [4] are an example of\na generative machine that can be trained with exact backpropagation rather than the numerous approximations\nrequired for Boltzmann machines. This work extends the idea of a generative machine\nby eliminating the Markov chains used in generative stochastic networks.\nOur work backpropagates derivatives through generative processes by using the observation that\n\n$$\\lim_{\\sigma \\to 0} \\nabla_x \\mathbb{E}_{\\epsilon \\sim \\mathcal{N}(0, \\sigma^2 I)} f(x + \\epsilon) = \\nabla_x f(x).$$\n\nWe were unaware at the time we developed this work that Kingma and Welling [18] and Rezende\net al. [23] had developed more general stochastic backpropagation rules, allowing one to backpropagate\nthrough Gaussian distributions with finite variance, and to backpropagate to the covariance\nparameter as well as the mean. These backpropagation rules could allow one to learn the conditional\nvariance of the generator, which we treated as a hyperparameter in this work. Kingma and\nWelling [18] and Rezende et al. [23] use stochastic backpropagation to train variational autoencoders\n(VAEs). Like generative adversarial networks, variational autoencoders pair a differentiable\ngenerator network with a second neural network. Unlike generative adversarial networks, the second\nnetwork in a VAE is a recognition model that performs approximate inference. GANs require\ndifferentiation through the visible units, and thus cannot model discrete data, while VAEs require\ndifferentiation through the hidden units, and thus cannot have discrete latent variables. 
Other VAE-\nlike approaches exist [12, 22] but are less closely related to our method.\nPrevious work has also taken the approach of using a discriminative criterion to train a generative\nmodel [29, 13]. These approaches use criteria that are intractable for deep generative models. These\nmethods are dif\ufb01cult even to approximate for deep models because they involve ratios of probabili-\nties which cannot be approximated using variational approximations that lower bound the probabil-\nity. Noise-contrastive estimation (NCE) [13] involves training a generative model by learning the\nweights that make the model useful for discriminating data from a \ufb01xed noise distribution. Using a\npreviously trained model as the noise distribution allows training a sequence of models of increasing\nquality. This can be seen as an informal competition mechanism similar in spirit to the formal com-\npetition used in the adversarial networks game. The key limitation of NCE is that its \u201cdiscriminator\u201d\nis de\ufb01ned by the ratio of the probability densities of the noise distribution and the model distribution,\nand thus requires the ability to evaluate and backpropagate through both densities.\nSome previous work has used the general concept of having two neural networks compete. The most\nrelevant work is predictability minimization [26]. In predictability minimization, each hidden unit\nin a neural network is trained to be different from the output of a second network, which predicts\nthe value of that hidden unit given the value of all of the other hidden units. This work differs from\npredictability minimization in three important ways: 1) in this work, the competition between the\nnetworks is the sole training criterion, and is suf\ufb01cient on its own to train the network. 
Predictability\nminimization is only a regularizer that encourages the hidden units of a neural network to be statistically\nindependent while they accomplish some other task; it is not a primary training criterion.\n2) The nature of the competition is different. In predictability minimization, two networks\u2019 outputs\nare compared, with one network trying to make the outputs similar and the other trying to make the\noutputs different. The output in question is a single scalar. In GANs, one network produces a rich,\nhigh-dimensional vector that is used as the input to another network, and attempts to choose an input\nthat the other network does not know how to process. 3) The specification of the learning process\nis different. Predictability minimization is described as an optimization problem with an objective\nfunction to be minimized, and learning approaches the minimum of the objective function. GANs\nare based on a minimax game rather than an optimization problem, and have a value function that\none agent seeks to maximize and the other seeks to minimize. The game terminates at a saddle point\nthat is a minimum with respect to one player\u2019s strategy and a maximum with respect to the other\nplayer\u2019s strategy.\nGenerative adversarial networks have sometimes been confused with the related concept of \u201cadversarial\nexamples\u201d [28]. Adversarial examples are examples found by using gradient-based optimization\ndirectly on the input to a classification network, in order to find examples that are similar to the\ndata yet misclassified. This is different from the present work because adversarial examples are\nnot a mechanism for training a generative model. 
Instead, adversarial examples are primarily an\nanalysis tool for showing that neural networks behave in intriguing ways, often classifying\ntwo images differently with high confidence even though the difference between them is\nimperceptible to a human observer. The existence of such adversarial examples does suggest that\ngenerative adversarial network training could be inefficient, because they show that it is possible to\nmake modern discriminative networks confidently recognize a class without emulating any of the\nhuman-perceptible attributes of that class.\n\n3 Adversarial nets\n\nThe adversarial modeling framework is most straightforward to apply when the models are both\nmultilayer perceptrons. To learn the generator\u2019s distribution pg over data x, we define a prior on\ninput noise variables pz(z), then represent a mapping to data space as G(z; \u03b8g), where G is a\ndifferentiable function represented by a multilayer perceptron with parameters \u03b8g. We also define a\nsecond multilayer perceptron D(x; \u03b8d) that outputs a single scalar. D(x) represents the probability\nthat x came from the data rather than pg. We train D to maximize the probability of assigning the\ncorrect label to both training examples and samples from G. We simultaneously train G to minimize\nlog(1 \u2212 D(G(z))). In other words, D and G play the following two-player minimax game with\nvalue function V (G, D):\n\n$$\\min_G \\max_D V(D, G) = \\mathbb{E}_{x \\sim p_{\\text{data}}(x)}[\\log D(x)] + \\mathbb{E}_{z \\sim p_z(z)}[\\log(1 - D(G(z)))]. \\tag{1}$$\n\nIn the next section, we present a theoretical analysis of adversarial nets, essentially showing that\nthe training criterion allows one to recover the data generating distribution as G and D are given\nenough capacity, i.e., in the non-parametric limit. See Figure 1 for a less formal, more pedagogical\nexplanation of the approach. In practice, we must implement the game using an iterative, numerical\napproach. 
Optimizing D to completion in the inner loop of training is computationally prohibitive,\nand on finite datasets would result in overfitting. Instead, we alternate between k steps of optimizing\nD and one step of optimizing G. This results in D being maintained near its optimal solution, so\nlong as G changes slowly enough. The procedure is formally presented in Algorithm 1.\nIn practice, equation 1 may not provide sufficient gradient for G to learn well. Early in learning,\nwhen G is poor, D can reject samples with high confidence because they are clearly different from\nthe training data. In this case, log(1 \u2212 D(G(z))) saturates. Rather than training G to minimize\nlog(1 \u2212 D(G(z))) we can train G to maximize log D(G(z)). This objective function results in the\nsame fixed point of the dynamics of G and D but provides much stronger gradients early in learning.\n\n4 Theoretical Results\n\nThe generator G implicitly defines a probability distribution pg as the distribution of the samples\nG(z) obtained when z \u223c pz. Therefore, we would like Algorithm 1 to converge to a good estimator\nof pdata, if given enough capacity and training time. The results of this section are done in a non-parametric\nsetting, e.g., we represent a model with infinite capacity by studying convergence in the\nspace of probability density functions.\nWe will show in section 4.1 that this minimax game has a global optimum for pg = pdata. We will\nthen show in section 4.2 that Algorithm 1 optimizes Eq. 1, thus obtaining the desired result.\n\nFigure 1: Generative adversarial nets are trained by simultaneously updating the discriminative distribution\n(D, blue, dashed line) so that it discriminates between samples from the data generating distribution (black,\ndotted line) px and those of the generative distribution pg (G) (green, solid line). 
The lower horizontal line is\nthe domain from which z is sampled, in this case uniformly. The horizontal line above is part of the domain\nof x. The upward arrows show how the mapping x = G(z) imposes the non-uniform distribution pg on\ntransformed samples. G contracts in regions of high density and expands in regions of low density of pg. (a)\nConsider an adversarial pair near convergence: pg is similar to pdata and D is a partially accurate classifier.\n(b) In the inner loop of the algorithm D is trained to discriminate samples from data, converging to D\u2217(x) =\npdata(x) / (pdata(x) + pg(x)). (c) After an update to G, the gradient of D has guided G(z) to flow to regions that are more likely\nto be classified as data. (d) After several steps of training, if G and D have enough capacity, they will reach a\npoint at which both cannot improve because pg = pdata. The discriminator is unable to differentiate between\nthe two distributions, i.e. D(x) = 1/2.\n\nAlgorithm 1 Minibatch stochastic gradient descent training of generative adversarial nets. The number of\nsteps to apply to the discriminator, k, is a hyperparameter. We used k = 1, the least expensive option, in our\nexperiments.\n\nfor number of training iterations do\n  for k steps do\n    \u2022 Sample minibatch of m noise samples {z^(1), ..., z^(m)} from noise prior pz(z).\n    \u2022 Sample minibatch of m examples {x^(1), ..., x^(m)} from data generating distribution\n      pdata(x).\n    \u2022 Update the discriminator by ascending its stochastic gradient:\n\n$$\\nabla_{\\theta_d} \\frac{1}{m} \\sum_{i=1}^{m} \\left[ \\log D\\left(x^{(i)}\\right) + \\log\\left(1 - D\\left(G\\left(z^{(i)}\\right)\\right)\\right) \\right].$$\n\n  end for\n  \u2022 Sample minibatch of m noise samples {z^(1), ..., z^(m)} from noise prior pz(z).\n  \u2022 Update the generator by descending its stochastic gradient:\n\n$$\\nabla_{\\theta_g} \\frac{1}{m} \\sum_{i=1}^{m} \\log\\left(1 - D\\left(G\\left(z^{(i)}\\right)\\right)\\right).$$\n\nend for\nThe gradient-based updates can use any standard gradient-based learning rule. We used momentum\nin our experiments.\n\n4.1 Global Optimality of pg = pdata\n\nWe first consider the optimal discriminator D for any given generator G.\nProposition 1. For G fixed, the optimal discriminator D is\n\n$$D^*_G(x) = \\frac{p_{\\text{data}}(x)}{p_{\\text{data}}(x) + p_g(x)} \\tag{2}$$\n\nProof. The training criterion for the discriminator D, given any generator G, is to maximize the\nquantity V (G, D):\n\n$$\\begin{aligned} V(G, D) &= \\int_x p_{\\text{data}}(x) \\log(D(x)) \\, dx + \\int_z p_z(z) \\log(1 - D(G(z))) \\, dz \\\\ &= \\int_x p_{\\text{data}}(x) \\log(D(x)) + p_g(x) \\log(1 - D(x)) \\, dx \\end{aligned} \\tag{3}$$\n\nFor any (a, b) \u2208 R^2 \\ {0, 0}, the function y \u2192 a log(y) + b log(1 \u2212 y) achieves its maximum in\n[0, 1] at a/(a+b). The discriminator does not need to be defined outside of Supp(pdata) \u222a Supp(pg),\nconcluding the proof.\n\nNote that the training objective for D can be interpreted as maximizing the log-likelihood for estimating\nthe conditional probability P (Y = y|x), where Y indicates whether x comes from pdata\n(with y = 1) or from pg (with y = 0). The minimax game in Eq. 1 can now be reformulated as:\n\n$$\\begin{aligned} C(G) &= \\max_D V(G, D) \\\\ &= \\mathbb{E}_{x \\sim p_{\\text{data}}}[\\log D^*_G(x)] + \\mathbb{E}_{z \\sim p_z}[\\log(1 - D^*_G(G(z)))] \\\\ &= \\mathbb{E}_{x \\sim p_{\\text{data}}}[\\log D^*_G(x)] + \\mathbb{E}_{x \\sim p_g}[\\log(1 - D^*_G(x))] \\\\ &= \\mathbb{E}_{x \\sim p_{\\text{data}}}\\left[\\log \\frac{p_{\\text{data}}(x)}{p_{\\text{data}}(x) + p_g(x)}\\right] + \\mathbb{E}_{x \\sim p_g}\\left[\\log \\frac{p_g(x)}{p_{\\text{data}}(x) + p_g(x)}\\right] \\end{aligned} \\tag{4}$$\n\nTheorem 1. The global minimum of the virtual training criterion C(G) is achieved if and only if\npg = pdata. At that point, C(G) achieves the value \u2212 log 4.\nProof. 
For pg = pdata, D\u2217_G(x) = 1/2 (consider Eq. 2). Hence, by inspecting Eq. 4 at D\u2217_G(x) = 1/2, we\nfind C(G) = log(1/2) + log(1/2) = \u2212 log 4. To see that this is the best possible value of C(G), reached\nonly for pg = pdata, observe that\n\n$$\\mathbb{E}_{x \\sim p_{\\text{data}}}[-\\log 2] + \\mathbb{E}_{x \\sim p_g}[-\\log 2] = -\\log 4$$\n\nand that by subtracting this expression from C(G) = V (D\u2217_G, G), we obtain:\n\n$$C(G) = -\\log(4) + \\mathrm{KL}\\left(p_{\\text{data}} \\,\\middle\\|\\, \\frac{p_{\\text{data}} + p_g}{2}\\right) + \\mathrm{KL}\\left(p_g \\,\\middle\\|\\, \\frac{p_{\\text{data}} + p_g}{2}\\right) \\tag{5}$$\n\nwhere KL is the Kullback\u2013Leibler divergence. We recognize in the previous expression the Jensen\u2013Shannon\ndivergence between the model\u2019s distribution and the data generating process:\n\n$$C(G) = -\\log(4) + 2 \\cdot \\mathrm{JSD}\\left(p_{\\text{data}} \\,\\|\\, p_g\\right) \\tag{6}$$\n\nSince the Jensen\u2013Shannon divergence between two distributions is always non-negative, and zero\niff they are equal, we have shown that C\u2217 = \u2212 log(4) is the global minimum of C(G) and that the\nonly solution is pg = pdata, i.e., the generative model perfectly replicating the data distribution.\n\n4.2 Convergence of Algorithm 1\n\nProposition 2. If G and D have enough capacity, and at each step of Algorithm 1, the discriminator\nis allowed to reach its optimum given G, and pg is updated so as to improve the criterion\n\n$$\\mathbb{E}_{x \\sim p_{\\text{data}}}[\\log D^*_G(x)] + \\mathbb{E}_{x \\sim p_g}[\\log(1 - D^*_G(x))]$$\n\nthen pg converges to pdata.\n\nProof. Consider V (G, D) = U (pg, D) as a function of pg as done in the above criterion. Note\nthat U (pg, D) is convex in pg. The subderivatives of a supremum of convex functions include the\nderivative of the function at the point where the maximum is attained. 
In other words, if f(x) =\nsup_{\u03b1\u2208A} f_\u03b1(x) and f_\u03b1(x) is convex in x for every \u03b1, then \u2202f_\u03b2(x) \u2208 \u2202f if \u03b2 = arg sup_{\u03b1\u2208A} f_\u03b1(x).\nThis is equivalent to computing a gradient descent update for pg at the optimal D given the corresponding\nG. The function sup_D U (pg, D) is convex in pg with a unique global optimum as proven in Thm 1,\ntherefore with sufficiently small updates of pg, pg converges to pdata, concluding the proof.\nIn practice, adversarial nets represent a limited family of pg distributions via the function G(z; \u03b8g),\nand we optimize \u03b8g rather than pg itself, so the proofs do not apply. However, the excellent performance\nof multilayer perceptrons in practice suggests that they are a reasonable model to use despite\ntheir lack of theoretical guarantees.\n\nModel | MNIST | TFD\nDBN [3] | 138 \u00b1 2 | 1909 \u00b1 66\nStacked CAE [3] | 121 \u00b1 1.6 | 2110 \u00b1 50\nDeep GSN [5] | 214 \u00b1 1.1 | 1890 \u00b1 29\nAdversarial nets | 225 \u00b1 2 | 2057 \u00b1 26\n\nTable 1: Parzen window-based log-likelihood estimates. The reported numbers on MNIST are the mean log-likelihood\nof samples on test set, with the standard error of the mean computed across examples. On TFD, we\ncomputed the standard error across folds of the dataset, with a different \u03c3 chosen using the validation set of\neach fold. On TFD, \u03c3 was cross validated on each fold and the mean log-likelihood on each fold was computed.\nFor MNIST we compare against other models of the real-valued (rather than binary) version of the dataset.\n\n5 Experiments\n\nWe trained adversarial nets on a range of datasets including MNIST [21], the Toronto Face Database\n(TFD) [27], and CIFAR-10 [19]. The generator nets used a mixture of rectifier linear activations [17,\n8] and sigmoid activations, while the discriminator net used maxout [9] activations. Dropout [16]\nwas applied in training the discriminator net. 
While our theoretical framework permits the use of\ndropout and other noise at intermediate layers of the generator, we used noise as the input to only\nthe bottommost layer of the generator network.\nWe estimate probability of the test set data under pg by \ufb01tting a Gaussian Parzen window to the\nsamples generated with G and reporting the log-likelihood under this distribution. The \u03c3 parameter\nof the Gaussians was obtained by cross validation on the validation set. This procedure was intro-\nduced in Breuleux et al. [7] and used for various generative models for which the exact likelihood\nis not tractable [24, 3, 4]. Results are reported in Table 1. This method of estimating the likelihood\nhas somewhat high variance and does not perform well in high dimensional spaces but it is the best\nmethod available to our knowledge. Advances in generative models that can sample but not estimate\nlikelihood directly motivate further research into how to evaluate such models. In Figures 2 and 3\nwe show samples drawn from the generator net after training. While we make no claim that these\nsamples are better than samples generated by existing methods, we believe that these samples are at\nleast competitive with the better generative models in the literature and highlight the potential of the\nadversarial framework.\n\n6 Advantages and disadvantages\n\nThis new framework comes with advantages and disadvantages relative to previous modeling frame-\nworks. The disadvantages are primarily that there is no explicit representation of pg(x), and that D\nmust be synchronized well with G during training (in particular, G must not be trained too much\nwithout updating D, in order to avoid \u201cthe Helvetica scenario\u201d in which G collapses too many values\nof z to the same value of x to have enough diversity to model pdata), much as the negative chains of a\nBoltzmann machine must be kept up to date between learning steps. 
The advantages are that Markov\nchains are never needed, only backprop is used to obtain gradients, no inference is needed during\nlearning, and a wide variety of functions can be incorporated into the model. Table 2 summarizes\nthe comparison of generative adversarial nets with other generative modeling approaches.\nThe aforementioned advantages are primarily computational. Adversarial models may also gain\nsome statistical advantage from the generator network not being updated directly with data examples,\nbut only with gradients flowing through the discriminator. This means that components of the\ninput are not copied directly into the generator\u2019s parameters. Another advantage of adversarial networks\nis that they can represent very sharp, even degenerate distributions, while methods based on\nMarkov chains require that the distribution be somewhat blurry in order for the chains to be able to\nmix between modes.\n\n7 Conclusions and future work\n\nThis framework admits many straightforward extensions:\n\nFigure 2: Visualization of samples from the model. Rightmost column shows the nearest training example of\nthe neighboring sample, in order to demonstrate that the model has not memorized the training set. Samples\nare fair random draws, not cherry-picked. Unlike most other visualizations of deep generative models, these\nimages show actual samples from the model distributions, not conditional means given samples of hidden units.\nMoreover, these samples are uncorrelated because the sampling process does not depend on Markov chain\nmixing. a) MNIST b) TFD c) CIFAR-10 (fully connected model) d) CIFAR-10 (convolutional discriminator\nand \u201cdeconvolutional\u201d generator)\n\nFigure 3: Digits obtained by linearly interpolating between coordinates in z space of the full model.\n\n1. A conditional generative model p(x | c) can be obtained by adding c as input to both G and D.\n2. 
Learned approximate inference can be performed by training an auxiliary network to predict z\ngiven x. This is similar to the inference net trained by the wake-sleep algorithm [15] but with\nthe advantage that the inference net may be trained for a fixed generator net after the generator\nnet has finished training.\n3. One can approximately model all conditionals p(x_S | x_{\\not S}) where S is a subset of the indices\nof x by training a family of conditional models that share parameters. Essentially, one can use\nadversarial nets to implement a stochastic extension of the deterministic MP-DBM [10].\n4. Semi-supervised learning: features from the discriminator or inference net could improve performance\nof classifiers when limited labeled data is available.\n5. Efficiency improvements: training could be accelerated greatly by devising better methods for\ncoordinating G and D or determining better distributions to sample z from during training.\n\nThis paper has demonstrated the viability of the adversarial modeling framework, suggesting that\nthese research directions could prove useful.\n\nOperation | Deep directed graphical models | Deep undirected graphical models | Generative autoencoders | Adversarial models\nTraining | Inference needed during training. | Inference needed during training. MCMC needed to approximate partition function gradient. | Enforced tradeoff between mixing and power of reconstruction generation | Synchronizing the discriminator with the generator. Helvetica.\nInference | Learned approximate inference | Variational inference | MCMC-based inference | Learned approximate inference\nSampling | No difficulties | Requires Markov chain | Requires Markov chain | No difficulties\nEvaluating p(x) | Intractable, may be approximated with AIS | Intractable, may be approximated with AIS | Not explicitly represented, may be approximated with Parzen density estimation | Not explicitly represented, may be approximated with Parzen density estimation\nModel design | Models need to be designed to work with the desired inference scheme; some inference schemes support similar model families as GANs | Careful design needed to ensure multiple properties | Any differentiable function is theoretically permitted | Any differentiable function is theoretically permitted\n\nTable 2: Challenges in generative modeling: a summary of the difficulties encountered by different approaches\nto deep generative modeling for each of the major operations involving a model.\n\nAcknowledgments\n\nWe would like to acknowledge Patrice Marcotte, Olivier Delalleau, Kyunghyun Cho, Guillaume\nAlain and Jason Yosinski for helpful discussions. Yann Dauphin shared his Parzen window evaluation\ncode with us. We would like to thank the developers of Pylearn2 [11] and Theano [6, 1],\nparticularly Fr\u00e9d\u00e9ric Bastien who rushed a Theano feature specifically to benefit this project. Arnaud\nBergeron provided much-needed support with LaTeX typesetting. We would also like to thank\nCIFAR and Canada Research Chairs for funding, and Compute Canada and Calcul Qu\u00e9bec for\nproviding computational resources. Ian Goodfellow is supported by the 2013 Google Fellowship in\nDeep Learning. Finally, we would like to thank Les Trois Brasseurs for stimulating our creativity.\n\nReferences\n\n[1] Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I. J., Bergeron, A., Bouchard, N., and\nBengio, Y. (2012). Theano: new features and speed improvements. Deep Learning and Unsupervised\nFeature Learning NIPS 2012 Workshop.\n\n[2] Bengio, Y. (2009). Learning deep architectures for AI. Now Publishers.\n\n[3] Bengio, Y., Mesnil, G., Dauphin, Y., and Rifai, S. (2013). Better mixing via deep representations. In\nICML\u201913.\n\n[4] Bengio, Y., Thibodeau-Laufer, E., and Yosinski, J. (2014a). 
Deep generative stochastic networks trainable\nby backprop. In ICML\u201914.\n\n[5] Bengio, Y., Thibodeau-Laufer, E., Alain, G., and Yosinski, J. (2014b). Deep generative stochastic networks\ntrainable by backprop. In Proceedings of the 30th International Conference on Machine Learning\n(ICML\u201914).\n\n[6] Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley,\nD., and Bengio, Y. (2010). Theano: a CPU and GPU math expression compiler. In Proceedings of the\nPython for Scientific Computing Conference (SciPy). Oral Presentation.\n\n[7] Breuleux, O., Bengio, Y., and Vincent, P. (2011). Quickly generating representative samples from an\nRBM-derived process. Neural Computation, 23(8), 2053\u20132073.\n\n[8] Glorot, X., Bordes, A., and Bengio, Y. (2011). Deep sparse rectifier neural networks. In AISTATS\u20192011.\n\n[9] Goodfellow, I. J., Warde-Farley, D., Mirza, M., Courville, A., and Bengio, Y. (2013a). Maxout networks.\nIn ICML\u20192013.\n\n[10] Goodfellow, I. J., Mirza, M., Courville, A., and Bengio, Y. (2013b). Multi-prediction deep Boltzmann\nmachines. In NIPS\u20192013.\n\n[11] Goodfellow, I. J., Warde-Farley, D., Lamblin, P., Dumoulin, V., Mirza, M., Pascanu, R., Bergstra,\nJ., Bastien, F., and Bengio, Y. (2013c). Pylearn2: a machine learning research library. arXiv preprint\narXiv:1308.4214.\n\n[12] Gregor, K., Danihelka, I., Mnih, A., Blundell, C., and Wierstra, D. (2014). Deep autoregressive networks.\nIn ICML\u20192014.\n\n[13] Gutmann, M. and Hyvarinen, A. (2010). Noise-contrastive estimation: A new estimation principle for\nunnormalized statistical models. In Proceedings of The Thirteenth International Conference on Artificial\nIntelligence and Statistics (AISTATS\u201910).\n\n[14] Hinton, G., Deng, L., Dahl, G. E., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P.,\nSainath, T., and Kingsbury, B. (2012a). 
Deep neural networks for acoustic modeling in speech recognition.\nIEEE Signal Processing Magazine, 29(6), 82\u201397.\n\n[15] Hinton, G. E., Dayan, P., Frey, B. J., and Neal, R. M. (1995). The wake-sleep algorithm for unsupervised\nneural networks. Science, 268, 1158\u20131161.\n\n[16] Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2012b). Improving\nneural networks by preventing co-adaptation of feature detectors. Technical report, arXiv:1207.0580.\n\n[17] Jarrett, K., Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2009). What is the best multi-stage architecture\nfor object recognition? In Proc. International Conference on Computer Vision (ICCV\u201909), pages 2146\u20132153.\nIEEE.\n\n[18] Kingma, D. P. and Welling, M. (2014). Auto-encoding variational bayes. In Proceedings of the International\nConference on Learning Representations (ICLR).\n\n[19] Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of features from tiny images. Technical\nreport, University of Toronto.\n\n[20] Krizhevsky, A., Sutskever, I., and Hinton, G. (2012). ImageNet classification with deep convolutional\nneural networks. In NIPS\u20192012.\n\n[21] LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document\nrecognition. Proceedings of the IEEE, 86(11), 2278\u20132324.\n\n[22] Mnih, A. and Gregor, K. (2014). Neural variational inference and learning in belief networks. Technical\nreport, arXiv preprint arXiv:1402.0030.\n\n[23] Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate\ninference in deep generative models. Technical report, arXiv:1401.4082.\n\n[24] Rifai, S., Bengio, Y., Dauphin, Y., and Vincent, P. (2012). A generative process for sampling contractive\nauto-encoders. In ICML\u201912.\n\n[25] Salakhutdinov, R. and Hinton, G. E. (2009). Deep Boltzmann machines. 
In AISTATS\u20192009, pages 448\u2013455.\n\n[26] Schmidhuber, J. (1992). Learning factorial codes by predictability minimization. Neural Computation,\n4(6), 863\u2013879.\n\n[27] Susskind, J., Anderson, A., and Hinton, G. E. (2010). The Toronto face dataset. Technical Report UTML\nTR 2010-001, U. Toronto.\n\n[28] Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I. J., and Fergus, R. (2014).\nIntriguing properties of neural networks. ICLR, abs/1312.6199.\n\n[29] Tu, Z. (2007). Learning generative models via discriminative approaches. In Computer Vision and Pattern\nRecognition, 2007. CVPR\u201907. IEEE Conference on, pages 1\u20138. IEEE.", "award": [], "sourceid": 1384, "authors": [{"given_name": "Ian", "family_name": "Goodfellow", "institution": "Google"}, {"given_name": "Jean", "family_name": "Pouget-Abadie", "institution": "Harvard University"}, {"given_name": "Mehdi", "family_name": "Mirza", "institution": "University of Montreal"}, {"given_name": "Bing", "family_name": "Xu", "institution": "University of Alberta"}, {"given_name": "David", "family_name": "Warde-Farley", "institution": "Universit\u00e9 de Montr\u00e9al"}, {"given_name": "Sherjil", "family_name": "Ozair", "institution": "Indian Institute of Technology Delhi"}, {"given_name": "Aaron", "family_name": "Courville", "institution": "University of Montreal"}, {"given_name": "Yoshua", "family_name": "Bengio", "institution": "University of Montreal"}]}