{"title": "Implicit Generation and Modeling with Energy Based Models", "book": "Advances in Neural Information Processing Systems", "page_first": 3608, "page_last": 3618, "abstract": "Energy based models (EBMs) are appealing due to their generality and simplicity in likelihood modeling, but have been traditionally difficult to train. We present techniques to scale MCMC based EBM training on continuous neural networks, and we show its success on the high-dimensional data domains of ImageNet32x32, ImageNet128x128, CIFAR-10, and robotic hand trajectories, achieving better samples than other likelihood models and nearing the performance of contemporary GAN approaches, while covering all modes of the data. We highlight some unique capabilities of implicit generation such as compositionality and corrupt image reconstruction and inpainting. Finally, we show that EBMs are useful models across a wide variety of tasks, achieving state-of-the-art out-of-distribution classification, adversarially robust classification, state-of-the-art continual online class learning, and coherent long term predicted trajectory rollouts.", "full_text": "Implicit Generation and Modeling with Energy-Based\n\nModels\n\nYilun Du \u2217\nMIT CSAIL\n\nIgor Mordatch\nGoogle Brain\n\nAbstract\n\nEnergy based models (EBMs) are appealing due to their generality and simplicity\nin likelihood modeling, but have been traditionally dif\ufb01cult to train. We present\ntechniques to scale MCMC based EBM training on continuous neural networks,\nand we show its success on the high-dimensional data domains of ImageNet32x32,\nImageNet128x128, CIFAR-10, and robotic hand trajectories, achieving better\nsamples than other likelihood models and nearing the performance of contemporary\nGAN approaches, while covering all modes of the data. We highlight some unique\ncapabilities of implicit generation such as compositionality and corrupt image\nreconstruction and inpainting. 
Finally, we show that EBMs are useful models across\na wide variety of tasks, achieving state-of-the-art out-of-distribution classi\ufb01cation,\nadversarially robust classi\ufb01cation, state-of-the-art continual online class learning,\nand coherent long term predicted trajectory rollouts.\n\n1 Introduction\nLearning models of the data distribution and generating samples are important problems in machine\nlearning for which a number of methods have been proposed, such as Variational Autoencoders\n(VAEs) [Kingma and Welling, 2014] and Generative Adversarial Networks (GANs) [Goodfellow\net al., 2014]. In this work, we advocate for continuous energy-based models (EBMs), represented as\nneural networks, for generative modeling tasks and as a building block for a wide variety of tasks.\nThese models aim to learn an energy function E(x) that assigns low energy values to inputs x in the\ndata distribution and high energy values to other inputs. Importantly, they allow the use of an implicit\nsample generation procedure, where sample x is found from x \u223c e\u2212E(x) through MCMC sampling.\nCombining implicit sampling with energy-based models for generative modeling has a number of\nconceptual advantages compared to methods such as VAEs and GANs which use explicit functions to\ngenerate samples:\nSimplicity and Stability: An EBM is the only object that needs to be trained and designed. Separate\nnetworks are not tuned to ensure balance (for example, [He et al., 2019] point out that unbalanced training\ncan result in posterior collapse in VAEs or poor performance in GANs [Kurach et al., 2018]).\nSharing of Statistical Strength: Since the EBM is the only trained object, it requires fewer model\nparameters than approaches that use multiple networks. 
More importantly, the model being concen-\ntrated in a single network allows the training process to develop a shared set of features as opposed to\ndeveloping them redundantly in separate networks.\nAdaptive Computation Time: Implicit sample generation in our work is an iterative stochastic\noptimization process, which allows for a trade-off between generation quality and computation time.\n\n\u2217Work done at OpenAI\n\u2217Correspondence to: yilundu@mit.edu\n\u2217Additional results, source code, and pre-trained models are available at https://sites.google.com/view/igebm\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fThis allows for a system that can make fast coarse guesses or more deliberate inferences by running\nthe optimization process longer. It also allows for re\ufb01nement of external guesses.\nFlexibility Of Generation: The power of an explicit generator network can become a bottleneck on\nthe generation quality. For example, VAEs and \ufb02ow-based models are bound by the manifold structure\nof the prior distribution and consequently have issues modeling discontinuous data manifolds, often\nassigning probability mass to areas unwarranted by the data. EBMs avoid this issue by directly\nmodeling particular regions as high or lower energy.\nCompositionality: If we think of energy functions as costs for a certain goals or constraints, summa-\ntion of two or more energies corresponds to satisfying all their goals or constraints [Mnih and Hinton,\n2004, Haarnoja et al., 2017]. While such composition is simple for energy functions (or product of\nexperts [Hinton, 1999]), it induces complex changes to the generator that may be dif\ufb01cult to capture\nwith explicit generator networks.\nDespite these advantages, energy-based models with implicit generation have been dif\ufb01cult to use on\ncomplex high-dimensional data domains. 
In this work, we use Langevin dynamics [Welling and Teh,\n2011], which uses gradient information for effective sampling and initializes chains from random\nnoise for more mixing. We further maintain a replay buffer of past samples (similarly to [Tieleman,\n2008] or [Mnih et al., 2013]) and use them to initialize Langevin dynamics to allow mixing between\nchains. An overview of our approach is presented in Figure 1.\n\nEmpirically, we show that energy-based models trained\non CIFAR-10 or ImageNet image datasets generate higher\nquality image samples than likelihood models, nearing\nthat of contemporary GAN approaches, while not\nsuffering from mode collapse. The models exhibit prop-\nerties such as correctly assigning lower likelihood to out-\nof-distribution images than other methods (no spurious\nmodes) and generating diverse plausible image comple-\ntions (covering all data modes). Implicit generation allows\nour models to naturally denoise or inpaint corrupted im-\nages, convert general images to an image from a speci\ufb01c\nclass, and generate samples that are compositions of mul-\ntiple independent models.\nOur contributions in this work are threefold. Firstly, we\npresent an algorithm and techniques for training energy-\nbased models that scale to challenging high-dimensional\ndomains. Secondly, we highlight unique properties of energy-based models with implicit generation,\nsuch as compositionality and automatic decorruption and inpainting. Finally, we show that energy-\nbased models are useful across a series of domains, on tasks such as out-of-distribution generalization,\nadversarially robust classi\ufb01cation, multi-step trajectory prediction and online learning.\n2 Related Work\n\nFigure 1: Overview of our method and the\ninterrelationship of the components involved.\n\nEnergy-based models (EBMs) have a long history in machine learning. Ackley et al. 
[1985], Hinton\n[2006], Salakhutdinov and Hinton [2009] proposed latent based EBMs where energy is represented\nas a composition of latent and observable variables. In contrast, Mnih and Hinton [2004], Hinton et al.\n[2006] proposed EBMs where inputs are directly mapped to outputs, a structure we follow. We refer\nreaders to [LeCun et al., 2006] for a comprehensive tutorial on energy models.\nThe primary dif\ufb01culty in training EBMs comes from effectively estimating and sampling the partition\nfunction. One approach to train energy based models is to sample the partition function through\namortized generation. Kim and Bengio [2016], Zhao et al. [2016], Haarnoja et al. [2017], Kumar et al.\n[2019] propose learning a separate network to generate samples, which makes these methods closely\nconnected to GANs [Finn et al., 2016], but these methods do not have the advantages of implicit\nsampling noted in the introduction. Furthermore, amortized generation is prone to mode collapse,\nespecially when training the sampling network without an entropy term, which is often approximated\nor ignored.\nAn alternative approach is to use MCMC sampling to estimate the partition function. This has an\nadvantage of provable mode exploration and allows the bene\ufb01ts of implicit generation listed in the\nintroduction. Hinton [2006] proposed Contrastive Divergence, which uses gradient free MCMC\nchains initialized from training data to estimate the partition function. Similarly, Salakhutdinov and\nHinton [2009] apply contrastive divergence, while Tieleman [2008] proposes PCD, which propagates\nMCMC chains throughout training. By contrast, we initialize chains from random noise, allowing\neach mode of the model to be visited with equal probability. But initialization from random noise\ncomes at a cost of longer mixing times. 
As a result we use gradient based MCMC (Langevin\nDynamics) for more ef\ufb01cient sampling and to offset the increase in mixing time, which was also\nstudied previously in [Teh et al., 2003, Xie et al., 2016]. We note that HMC [Neal, 2011] may be an\neven more ef\ufb01cient gradient algorithm for MCMC sampling, though we found Langevin Dynamics to\nbe more stable. To allow gradient based MCMC, we use continuous inputs, while most approaches\nhave used discrete inputs. We build on the idea of PCD and maintain a replay buffer of past samples to\nadditionally reduce mixing times.\n3 Energy-Based Models and Sampling\nGiven a datapoint x, let E\u03b8(x) \u2208 R be the energy function. In our work this function is represented\nby a deep neural network parameterized by weights \u03b8. The energy function de\ufb01nes a probability\ndistribution via the Boltzmann distribution p\u03b8(x) = exp(\u2212E\u03b8(x))/Z(\u03b8), where Z(\u03b8) = \u222b exp(\u2212E\u03b8(x))dx\ndenotes the partition function. Generating samples from this distribution is challenging, with previous\nwork relying on MCMC methods such as random walk or Gibbs sampling [Hinton, 2006]. These\nmethods have long mixing times, especially for high-dimensional complex data such as images. To\nimprove the mixing time of the sampling procedure, we use Langevin dynamics, which makes use of\nthe gradient of the energy function to undergo sampling\n\n\u02dcxk = \u02dcxk\u22121 \u2212 (\u03bb/2) \u2207xE\u03b8(\u02dcxk\u22121) + \u03c9k, \u03c9k \u223c N(0, \u03bb)\n\n(1)\n\nwhere we let the above iterative procedure de\ufb01ne a distribution q\u03b8 such that \u02dcxK \u223c q\u03b8. As shown by\nWelling and Teh [2011], as K \u2192 \u221e and \u03bb \u2192 0, q\u03b8 \u2192 p\u03b8 and this procedure generates samples\nfrom the distribution de\ufb01ned by the energy function. 
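As a concrete illustration of the update in (1), the sketch below runs Langevin dynamics in numpy on a toy quadratic energy whose Boltzmann distribution is a unit-variance Gaussian. The function name `langevin_sample`, the step size, and the toy energy are our own illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def langevin_sample(grad_e, x0, step=0.01, n_steps=1000, rng=None):
    # Eq. (1): x_k = x_{k-1} - (step/2) * grad E(x_{k-1}) + noise, noise ~ N(0, step)
    rng = np.random.default_rng(0) if rng is None else rng
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        x = x - 0.5 * step * grad_e(x) + rng.normal(scale=np.sqrt(step), size=x.shape)
    return x

# Toy energy E(x) = ||x - mu||^2 / 2, whose Boltzmann distribution is N(mu, I)
mu = np.array([1.0, -2.0])
grad_e = lambda x: x - mu

samples = np.stack([langevin_sample(grad_e, np.zeros(2), rng=np.random.default_rng(s))
                    for s in range(200)])
print(samples.mean(axis=0))  # approaches mu as the chains mix
```

For the paper's image models, `grad_e` would instead be the network gradient of E with respect to the input, computed by backpropagation.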
Thus, samples are generated implicitly\u2020 by the\nenergy function E as opposed to being explicitly generated by a feedforward network.\nIn the domain of images, if the energy network has a convolutional architecture, the energy gradient\n\u2207xE in (1) conveniently has a deconvolutional architecture. Thus it mirrors a typical image generator\nnetwork architecture, but without needing to be explicitly designed or balanced. We take two views\nof the energy function E: \ufb01rstly, it is an object that de\ufb01nes a probability distribution over data, and\nsecondly it de\ufb01nes an implicit generator via (1).\n3.1 Maximum Likelihood Training\nWe want the distribution de\ufb01ned by E to model the data distribution pD, which we do by minimizing\nthe negative log likelihood of the data LML(\u03b8) = Ex\u223cpD [\u2212 log p\u03b8(x)], where \u2212 log p\u03b8(x) = E\u03b8(x) \u2212\nlog Z(\u03b8). This objective is known to have the gradient (see [Turner, 2005] for derivation) \u2207\u03b8LML =\nEx+\u223cpD [\u2207\u03b8E\u03b8(x+)] \u2212 Ex\u2212\u223cp\u03b8 [\u2207\u03b8E\u03b8(x\u2212)]. Intuitively, this gradient decreases the energy of the\npositive data samples x+, while increasing the energy of the negative samples x\u2212 from the model p\u03b8.\nWe rely on Langevin dynamics in (1) to generate q\u03b8 as an approximation of p\u03b8:\n\n\u2207\u03b8LML \u2248 Ex+\u223cpD [\u2207\u03b8E\u03b8(x+)] \u2212 Ex\u2212\u223cq\u03b8 [\u2207\u03b8E\u03b8(x\u2212)].\n\n(2)\n\nThis is similar to the gradient of the Wasserstein GAN objective [Arjovsky et al., 2017], but with an\nimplicit MCMC generating procedure and no gradient through sampling. This lack of gradient is\nimportant, as it is what separates the mode-covering diversity of likelihood models from the mode\ncollapse seen in GANs. The approximation in (2) is exact when Langevin dynamics generates samples\nfrom p\u03b8, which happens after a suf\ufb01cient number of steps (mixing time). 
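The gradient estimate in (2) can be checked on a toy one-parameter energy family for which exact negative samples are available. The family E_theta(x) = (x - theta)^2 / 2 (whose Boltzmann distribution is N(theta, 1)), the learning rate, and the batch size below are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(3.0, 1.0, size=5000)      # positive samples x+ ~ p_D

# Hypothetical scalar energy family E_theta(x) = (x - theta)^2 / 2
dE_dtheta = lambda theta, x: theta - x      # d E_theta(x) / d theta

theta = 0.0
for _ in range(300):
    x_pos = rng.choice(data, size=128)
    # For this toy family we can draw exact negatives x- ~ p_theta directly,
    # instead of approximating p_theta with Langevin dynamics as in the paper.
    x_neg = rng.normal(theta, 1.0, size=128)
    # Eq. (2): grad = E_pos[dE/dtheta] - E_neg[dE/dtheta]
    grad = dE_dtheta(theta, x_pos).mean() - dE_dtheta(theta, x_neg).mean()
    theta -= 0.1 * grad                     # descend the ML objective
print(theta)  # converges toward the data mean, 3.0
```

The fixed point of this update is the maximum-likelihood estimate, illustrating that the contrastive gradient pushes the model distribution toward the data distribution.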
We show in the supplement that pD and q\u03b8 appear to\nmatch each other in distribution, showing evidence that p\u03b8 matches q\u03b8. We note that even in cases when\na particular chain does not fully mix, since our initial proposal distribution is a uniform distribution,\nall modes are still equally likely to be explored.\n\n\u2020The deterministic case of the procedure in (1) is x = arg min E(x), which makes the connection to implicit functions\nmore clear.\n\n\f3.2 Sample Replay Buffer\nLangevin dynamics does not place restrictions on sample initialization \u02dcx0 given suf\ufb01cient sampling\nsteps. However, initialization plays a crucial role in mixing time. Persistent Contrastive Divergence\n(PCD) [Tieleman, 2008] maintains a single persistent chain to improve mixing and sample quality. We\nuse a sample replay buffer B in which we store past generated samples \u02dcx and use either these samples\nor uniform noise to initialize the Langevin dynamics procedure. This has the bene\ufb01t of continuing to\nre\ufb01ne past samples, effectively increasing the number of sampling steps K as well as sample diversity.\nIn all our experiments, we sample from B 95% of the time and from uniform noise otherwise.\n3.3 Regularization and Algorithm\nArbitrary energy models can have sharp changes in gradients that can make sampling with Langevin\ndynamics unstable. We found that constraining the Lipschitz constant of the energy network can\nameliorate these issues. To constrain the Lipschitz constant, we follow the method of [Miyato et al.,\n2018] and add spectral normalization to all layers of the model. Additionally, we found it useful\nto weakly L2 regularize energy magnitudes for both positive and negative samples during training,\nas otherwise, while the difference between positive and negative samples was preserved, the actual\nvalues would \ufb02uctuate to numerically unstable values. 
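A minimal sketch of the sample replay buffer B from Sec. 3.2 is given below; the class name, capacity handling, and noise range are our own assumptions, not the paper's exact implementation:

```python
import numpy as np

class ReplayBuffer:
    # Stores past generated samples and initializes Langevin chains from
    # them 95% of the time, and from uniform noise otherwise.
    def __init__(self, capacity, sample_shape, rng=None):
        self.capacity = capacity
        self.shape = sample_shape
        self.storage = []
        self.rng = rng if rng is not None else np.random.default_rng(0)

    def init_samples(self, n, p_buffer=0.95):
        # Uniform-noise init, overwritten by buffer samples with prob. p_buffer
        out = self.rng.uniform(-1.0, 1.0, size=(n,) + self.shape)
        if self.storage:
            reuse = self.rng.random(n) < p_buffer
            idx = self.rng.integers(len(self.storage), size=n)
            for i in np.flatnonzero(reuse):
                out[i] = self.storage[idx[i]]
        return out

    def add(self, samples):
        # Store refined samples, discarding the oldest beyond capacity
        self.storage.extend(np.asarray(samples))
        self.storage = self.storage[-self.capacity:]

buf = ReplayBuffer(capacity=10000, sample_shape=(32, 32, 3))
x0 = buf.init_samples(8)   # first call: all uniform noise
buf.add(x0)
x1 = buf.init_samples(8)   # later calls: mostly continuations of past chains
```

Re-drawing chain starts from the buffer is what lets short per-iteration chains act like one long persistent chain.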
Both forms of regularization also serve to\nensure that the partition function is integrable over the domain of the input, with spectral normalization\nensuring smoothness and the L2 coef\ufb01cient bounding the magnitude of the unnormalized distribution.\nWe present the algorithm below, where \u2126(\u00b7) indicates the stop gradient operator.\n\nAlgorithm 1 Energy training algorithm\nInput: data dist. pD(x), step size \u03bb, number of steps K\nB \u2190 \u2205\nwhile not converged do\n  x+i \u223c pD\n  x0i \u223c B with 95% probability and U otherwise\n  \u25b7 Generate sample from q\u03b8 via Langevin dynamics:\n  for sample step k = 1 to K do\n    \u02dcxk \u2190 \u02dcxk\u22121 \u2212 \u2207xE\u03b8(\u02dcxk\u22121) + \u03c9, \u03c9 \u223c N(0, \u03c3)\n  end for\n  x\u2212i = \u2126(\u02dcxKi)\n  \u25b7 Optimize objective \u03b1L2 + LML wrt \u03b8:\n  \u2206\u03b8 \u2190 \u2207\u03b8 (1/N) \u2211i [\u03b1(E\u03b8(x+i)2 + E\u03b8(x\u2212i)2) + E\u03b8(x+i) \u2212 E\u03b8(x\u2212i)]\n  Update \u03b8 based on \u2206\u03b8 using the Adam optimizer\n  B \u2190 B \u222a \u02dcxi\nend while\n\nFigure 2: Conditional ImageNet32x32 EBM samples\n\n4 Image Modeling\n\nFigure 3: Comparison of image generation techniques on the unconditional CIFAR-10 dataset. (a) GLOW Model; (b) EBM; (c) EBM (10 historical); (d) EBM Sample Buffer.\n\nIn this section, we show that EBMs are effective generative models for images. We show EBMs\nare able to generate high \ufb01delity images and exhibit mode coverage on CIFAR-10 and ImageNet.\n\fWe further show EBMs exhibit adversarial robustness and better out-of-distribution behavior than\nother likelihood models. Our model is based on the ResNet architecture (using conditional gains and\nbiases per class [Dumoulin et al.] for conditional models) with details in the supplement. We present\nsensitivity analysis, likelihoods, and ablations in the supplement in A.4. 
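Algorithm 1 can be exercised end to end in a 1D toy setting, with E_theta(x) = (x - theta)^2 / 2 standing in for the neural network energy. The constants (alpha, step size, chain length, buffer size) below are illustrative assumptions, not the paper's hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(2.0, 1.0, size=2000)     # p_D
theta, alpha, lam, K = 0.0, 0.01, 0.05, 60
E = lambda xs: 0.5 * (xs - theta) ** 2     # toy energy E_theta
dE_dtheta = lambda xs: theta - xs          # d E_theta / d theta
buffer = []
for it in range(300):
    x_pos = rng.choice(data, size=64)
    x = rng.uniform(-4.0, 4.0, size=64)    # uniform noise init U
    if buffer:                              # ...replaced by buffer samples 95% of the time
        reuse = rng.random(64) < 0.95
        x[reuse] = rng.choice(buffer, size=int(reuse.sum()))
    for k in range(K):                      # Langevin dynamics (Eq. 1)
        x = x - 0.5 * lam * (x - theta) + rng.normal(scale=np.sqrt(lam), size=64)
    x_neg = x                               # stop gradient: negatives are constants
    # gradient of alpha * L2 + L_ML, as in Algorithm 1
    grad = (dE_dtheta(x_pos).mean() - dE_dtheta(x_neg).mean()
            + alpha * (2 * E(x_pos) * dE_dtheta(x_pos)
                       + 2 * E(x_neg) * dE_dtheta(x_neg)).mean())
    theta -= 0.05 * grad                    # stand-in for the Adam update
    buffer.extend(x_neg)
    buffer = buffer[-5000:]
print(theta)  # approaches the data mean, 2.0
```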
We provide a comparison\nbetween EBMs and other likelihood models in A.5. Overall, we \ufb01nd that EBMs are more\nparameter and computationally ef\ufb01cient than likelihood models, though worse than GANs.\n4.1 Image Generation\nWe show unconditional CIFAR-10 images in Figure 3, with comparisons to GLOW [Kingma and\nDhariwal, 2018], and conditional ImageNet32x32 images in Figure 2. We provide qualitative images\nof ImageNet128x128 and other visualizations in A.1.\n\nModel | Inception* | FID\nCIFAR-10 Unconditional\nPixelCNN [Van Oord et al., 2016] | 4.60 | 65.93\nPixelIQN [Ostrovski et al., 2018] | 5.29 | 49.46\nEBM (single) | 6.02 | 40.58\nDCGAN [Radford et al., 2016] | 6.40 | 37.11\nWGAN + GP [Gulrajani et al., 2017] | 6.50 | 36.4\nEBM (10 historical ensemble) | 6.78 | 38.2\nSNGAN [Miyato et al., 2018] | 8.22 | 21.7\nCIFAR-10 Conditional\nImproved GAN | 8.09 | -\nEBM (single) | 8.30 | 37.9\nSpectral Normalization GAN | 8.59 | 25.5\nImageNet 32x32 Conditional\nPixelCNN | 8.33 | 33.27\nPixelIQN | 10.18 | 22.99\nEBM (single) | 18.22 | 14.31\nImageNet 128x128 Conditional\nACGAN [Odena et al., 2017] | 28.5 | -\nEBM** (single) | 28.6 | 43.7\nSNGAN | 36.8 | 27.62\n\nFigure 4: Table of Inception and FID scores for ImageNet32x32\nand CIFAR-10. Quantitative numbers for ImageNet32x32 from\n[Ostrovski et al., 2018]. (*) We use Inception Score (from the original\nOpenAI repo) to compare with legacy models, but strongly encourage\nfuture work to compare solely with FID score, since Langevin\nDynamics converges to minima that arti\ufb01cially in\ufb02ate Inception\nScore. (**) Conditional EBM models for 128x128 are smaller than\nthose in SNGAN.\n\nFigure 5: EBM image restoration\non images in the test set via MCMC.\nThe right column shows failure cases (approx. 10% of objects change with ground\ntruth initialization and 30% of objects\nchange in salt/pepper corruption or inpainting. 
The bottom two rows show the worst\ncases of change.)\n\nWe quantitatively evaluate image quality of EBMs with Inception score [Salimans et al., 2016] and\nFID score [Heusel et al., 2017] in Figure 4. Overall we obtain signi\ufb01cantly better scores than the likelihood\nmodels PixelCNN and PixelIQN, but worse than SNGAN [Miyato et al., 2018]. We found that in the\nunconditional case, mode exploration with Langevin took a very long time, so we also experimented\nin EBM (10 historical ensemble) with sampling jointly from the last 10 snapshots of the model. At\ntraining time, extensive exploration is ensured with the replay buffer (Figure 3d). Our models have\na similar number of parameters to SNGAN, but we believe that signi\ufb01cantly more parameters may\nbe necessary to generate high \ufb01delity images with mode coverage. On ImageNet128x128, due to\ncomputational constraints, we train a smaller network than SNGAN and do not train to convergence.\n4.2 Mode Evaluation\nWe evaluate over-\ufb01tting and mode coverage in EBMs. To test over-\ufb01tting, we plotted histograms of\nenergies for the CIFAR-10 train and test datasets in Figure 11 and note almost identical curves. In the\nsupplement, we show that the nearest neighbors of generated images are not identical to images in\nthe training dataset. To test mode coverage in EBMs, we investigate MCMC sampling on corrupted\nCIFAR-10 test images. Since Langevin dynamics is known to mix slowly [Neal, 2011] and reach\nlocal minima, we believe that good denoising after a limited number of steps of sampling indicates\nprobability modes at the respective test images. Similarly, lack of movement from a ground truth test\nimage initialization after the same number of steps likely indicates a probability mode at the test image.\nIn Figure 5, we \ufb01nd that if we initialize sampling with images from the test set, images do not move\nsigni\ufb01cantly. 
However, under the same number of steps, Figure 5 shows that we are able to reliably\ndecorrupt masked and salt-and-pepper corrupted images, indicating good mode coverage. We note\nthat a large number of sampling steps leads to more saturated images, due to sampling low\ntemperature modes, which are saturated across likelihood models (see appendix). In comparison,\nGANs have been shown to miss many modes of data and cannot reliably reconstruct many different\ntest images [Yeh et al.]. We note that such decorruption behavior is a nice property of implicit\ngeneration, without needing explicit knowledge of the corrupted pixels.\n\nFigure 6: Illustration of cross-class implicit sam-\npling on a conditional EBM. The EBM is condi-\ntioned on a particular class but is initialized with\nan image from a separate class.\n\nFigure 7: Illustration of image completions on a condi-\ntional ImageNet model. Our models exhibit diversity in\ninpainting.\n\nAnother common test for mode coverage and over\ufb01tting is masked inpainting [Van Oord et al., 2016].\nIn Figure 7, we mask out the bottom half of ImageNet images and test the ability to sample the\nmasked pixels, while \ufb01xing the value of unmasked pixels. Running Langevin dynamics on the images,\nwe \ufb01nd diversity of completions on train/test images, indicating low over\ufb01tting on the training set and\nthe diversity characteristic of likelihood models. Furthermore, by initializing sampling of a class conditional\nEBM with images from another class, we can further test for the presence of probability\nmodes at images far away from those seen in training. 
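Decorruption and inpainting via implicit generation can be sketched as Langevin dynamics started from the corrupted input, updating only the corrupted pixels while clamping the rest. The "trained" energy below is a stand-in with a single sharp mode at a known vector `mode`; the energy, curvature, and step counts are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
mode = np.linspace(0.0, 1.0, 16)         # pretend this is a mode of the data
grad_e = lambda x: 100.0 * (x - mode)    # grad of E(x) = 50 * ||x - mode||^2

x = mode.copy()
mask = np.zeros(16, dtype=bool)
mask[8:] = True
x[mask] = rng.uniform(0.0, 1.0, size=mask.sum())   # corrupt the masked half

lam = 0.005
for _ in range(2000):                    # noisy Langevin phase (Eq. 1)
    step = -0.5 * lam * grad_e(x) + rng.normal(scale=np.sqrt(lam), size=16)
    x[mask] += step[mask]                # unmasked pixels stay clamped
for _ in range(200):                     # final noise-free steps (the arg min case)
    x[mask] -= 0.5 * lam * grad_e(x)[mask]
print(np.abs(x - mode).max())            # corrupted half is restored
```

No explicit knowledge of which pixels were corrupted is needed by the energy itself; here the mask only determines which coordinates are free to move.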
We \ufb01nd in Figure 6 that sampling on\nsuch images using an EBM is able to generate images of the target class, indicating semantically\nmeaningful modes of probability even far away from the training distribution.\n4.3 Adversarial Robustness\nWe show conditional EBMs exhibit adversarial\nrobustness on CIFAR-10 classi\ufb01cation, without\nexplicit adversarial training. To compute logits\nfor classi\ufb01cation, we compute the negative en-\nergy of the image in each class. Our model, with-\nout \ufb01ne-tuning, achieves an accuracy of 49.6%.\nFigure 8 shows adversarial robustness curves.\nWe ran 20 steps of PGD as in [Madry et al.,\n2017] on the above logits. To undergo classi\ufb01-\ncation, we then ran 10 steps of sampling initialized\nfrom the starting image (with a bounded devia-\ntion of 0.03) from each conditional model, and\nthen classi\ufb01ed using the lowest energy condi-\ntional class. We found that running PGD in-\ncorporating sampling was less successful than\nwithout. Overall we \ufb01nd in Figure 8 that EBMs are very robust to adversarial perturbations and\noutperform the SOTA L\u221e model in [Madry et al., 2017] on L\u221e attacks with \u03f5 > 13.\n\nFigure 8: \u03f5 plots under L\u221e and L2 attacks of condi-\ntional EBMs as compared to PGD trained models in\n[Madry et al., 2017] and a baseline Wide ResNet18. (a) L\u221e robustness; (b) L2 robustness.\n\n4.4 Out-of-Distribution Generalization\nWe show EBMs exhibit better out-of-distribution (OOD) detection than other likelihood models.\nSuch a task requires models to have high likelihood on the data manifold and low likelihood at all\n\nFigure 11: Histogram of relative likelihoods for various datasets for Glow, PixelCNN++ and EBM models\n\nother locations and can be viewed as a proxy of log likelihood. Surprisingly, Nalisnick et al. 
[2019]\nfound that likelihood models such as VAE, PixelCNN, and Glow are unable to distinguish OOD data,\nassigning higher likelihood to many OOD images. We constructed our OOD metric following\n[Hendrycks and Gimpel, 2016] using Area Under the ROC Curve (AUROC) scores computed based\non classifying CIFAR-10 test images from other OOD images using relative log likelihoods. We use\nSVHN, Textures [Cimpoi et al., 2014], monochrome images, uniform noise and interpolations of\nseparate CIFAR-10 images as OOD distributions. We provide examples of OOD images in Figure 9.\nWe found that our proposed OOD metric correlated well with training progress in EBMs.\n\nModel | PixelCNN++ | Glow | EBM (ours)\nSVHN | 0.32 | 0.24 | 0.63\nTextures | 0.33 | 0.27 | 0.48\nConstant Uniform | 0.0 | 0.0 | 0.30\nUniform | 1.0 | 1.0 | 1.0\nCIFAR10 Interpolation | 0.71 | 0.59 | 0.70\nAverage | 0.47 | 0.42 | 0.62\n\nFigure 10: AUROC scores of out of distribution classi\ufb01cation on differ-\nent datasets. Only our model gets better than chance classi\ufb01cation.\n\nFigure 9: Illustration of im-\nages from each of the out of\ndistribution datasets.\n\nIn Figure 10, unconditional EBMs perform signi\ufb01cantly better out-of-distribution than other auto-\nregressive and \ufb02ow generative models, with an average OOD score of 0.62, while the closest, PixelCNN++,\nhas an OOD score of 0.47. We provide histograms of relative likelihoods for SVHN in Figure 11,\nwhich are also discussed in [Nalisnick et al., 2019, Hendrycks et al., 2018]. We believe that the\nreason for better generalization is two-fold. First, we believe that the negative sampling procedure\nin EBMs helps eliminate spurious minima. Second, we believe EBMs have a \ufb02exible structure that\nallows global context when estimating probability without imposing constraints on latent variable\nstructure. In contrast, auto-regressive models model likelihood sequentially, which makes global\ncoherence dif\ufb01cult. 
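The AUROC metric for OOD classification can be computed from relative log-likelihood scores via the Mann-Whitney rank-sum statistic; the toy score distributions below are hypothetical, not the paper's data:

```python
import numpy as np

def auroc(in_scores, out_scores):
    # AUROC as the probability that a random in-distribution score exceeds
    # a random OOD score (ties ignored, fine for continuous log-likelihoods).
    scores = np.concatenate([in_scores, out_scores])
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    n_in, n_out = len(in_scores), len(out_scores)
    r_in = ranks[:n_in].sum()
    return (r_in - n_in * (n_in + 1) / 2) / (n_in * n_out)

# Treating -E(x) as an unnormalized log-likelihood score per image:
rng = np.random.default_rng(0)
cifar_scores = rng.normal(1.0, 1.0, size=1000)   # hypothetical in-distribution
svhn_scores = rng.normal(-2.0, 1.0, size=1000)   # hypothetical OOD
print(auroc(cifar_scores, svhn_scores))          # well above chance (0.5)
```

Because only relative ordering matters, the unknown partition function Z(theta) cancels out of this metric.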
In a different vein, \ufb02ow based models must apply continuous transformations onto\na continuous connected probability distribution, which makes it very dif\ufb01cult to model disconnected\nmodes, and as a result they assign spurious density to connections between modes.\n5 Trajectory Modeling\nWe show that EBMs generate and generalize well in the different domain of trajectory modeling. We\ntrain EBMs to model the dynamics of a simulated robot hand manipulating a free cube object [OpenAI,\n2018]. We generated 200,000 different trajectories of length 100 from a trained policy (with every\n4th action set to a random action for diversity), with a 90-10 train-test split. Models are trained to\npredict the positions of all joints in the hand and the orientation and position of the cube one step in the\nfuture. We test performance by evaluating many-step roll-outs of self-predicted trajectories.\n5.1 Training Setup and Metrics\nWe compare EBM models to feedforward models (FC), both of which are composed of 3 layers of\n128 hidden units. We apply spectral normalization to FC to prevent multi-step explosion. We evaluate\nmulti-step trajectories by computing the Frechet Distance [Dowson and Landau, 1982] between predicted\nand ground truth distributions across all states at timestep t. 
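The Frechet distance between Gaussians fitted to two sample sets can be sketched as below. We assume diagonal covariances for simplicity (the general form requires a matrix square root of the covariance product); the function name and toy state data are our own assumptions:

```python
import numpy as np

def frechet_distance_diag(x, y):
    # Frechet distance between Gaussians fitted to samples x and y, with
    # diagonal covariances: d^2 = ||mu1 - mu2||^2 + sum((sigma1 - sigma2)^2)
    mu1, mu2 = x.mean(axis=0), y.mean(axis=0)
    s1, s2 = x.std(axis=0), y.std(axis=0)
    return np.sqrt(((mu1 - mu2) ** 2).sum() + ((s1 - s2) ** 2).sum())

# Compare predicted vs. ground truth states at one timestep (toy data):
rng = np.random.default_rng(0)
gt_states = rng.normal(0.0, 1.0, size=(1000, 24))     # hypothetical joint states
pred_states = gt_states + rng.normal(0.1, 0.2, size=(1000, 24))
print(frechet_distance_diag(gt_states, pred_states))  # small but nonzero
```

Unlike multi-step MSE, this compares distributions of states at each timestep, so it does not blow up from compounding per-step prediction error.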
We found this metric was a better measure of\ntrajectory quality than multi-step MSE, due to accumulation of error in the latter.\n\n\fFigure 12: Views of hand manipulation trajec-\ntories generated unconditionally from the same\nstate (1st frame).\n\nFigure 13: Conditional and Unconditional Mod-\neling of Hand Manipulation through Frechet Dis-\ntance\n\n5.2 Multi-Step Trajectory Generation\nWe evaluated EBMs for both action conditional and unconditional prediction of multi-step rollouts.\nQuantitatively, by computing the average Frechet distance across all time-steps, unconditional EBMs\nhave a value of 5.96 while unconditional FC networks have a value of 33.28. Conditional EBMs have\na value of 8.97 while a conditional FC has a value of 19.75. We provide plots of Frechet distance over time\nin Figure 13. In Figure 13, we observe that for unconditional hand modeling in a FC network, the\nFrechet distance increases dramatically in the \ufb01rst several time steps. Qualitatively, we found that\nthe same FC networks stop predicting hand movement after several steps, as demonstrated in\nFigure 12. In contrast, the Frechet distance increases slowly for unconditional EBMs. The unconditional\nmodels are able to represent multimodal transitions, such as different types of cube rotation, and\nFigure 12 shows that the unconditional EBMs generate diverse realistic trajectories.\n6 Online Learning\n\nMethod | Accuracy\nEWC [Kirkpatrick et al., 2017] | 19.80 (0.05)\nSI [Zenke et al., 2017] | 19.67 (0.09)\nNAS [Schwarz et al., 2018] | 19.52 (0.29)\nLwF [Li and Snavely, 2018] | 24.17 (0.33)\nVAE | 40.04 (1.31)\nEBM (ours) | 64.99 (4.27)\n\nTable 1: Comparison of various continual learning\nbenchmarks. Values averaged across 10 seeds, reported\nas mean (standard deviation).\n\nWe \ufb01nd that EBMs also perform well in con-\ntinual learning. We evaluate incremental class\nlearning on the Split MNIST task proposed\nin [Farquhar and Gal, 2018]. The task evalu-\nates overall MNIST digit classi\ufb01cation accuracy\ngiven 5 sequential training tasks of disjoint pairs\nof digits. We train a conditional EBM with 2 lay-\ners of 400 hidden units and compare with\na generative conditional VAE baseline with both\nencoder/decoder having 2 layers of 400 hidden\nunits. Additional training details are covered in\nthe appendix. We train the generative models\nto represent the joint distribution of images and\nlabels and classify based on the lowest energy\nlabel. Hsu et al. [2018] analyzed common continual learning algorithms such as EWC [Kirkpatrick\net al., 2017], SI [Zenke et al., 2017] and NAS [Schwarz et al., 2018] and \ufb01nd they obtain performance\naround 20%. LwF [Li and Snavely, 2018] performed the best with performance of 24.17 \u00b1 0.33,\nwhere all architectures use 2 layers of 400 hidden units. However, since each new task introduces\ntwo new MNIST digits, a test accuracy of around 20% indicates complete forgetting of previous\ntasks. In contrast, we found continual EBM training obtains signi\ufb01cantly higher performance of\n64.99 \u00b1 4.27. All experiments were run with 10 seeds.\nA crucial difference is that negative training in EBMs only locally \"forgets\" information corresponding\nto negative samples. Thus, when new classes are seen, negative samples are conditioned on the new\nclass, and the EBM only forgets unlikely data from the new class. In contrast, the cross entropy\nobjective used to train common continual learning algorithms down-weights the likelihood of all\nclasses not seen. 
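The classify-by-lowest-energy rule used for continual learning (and for the robust classifier in Sec. 4.3) can be sketched as follows. The quadratic per-class energy is a hypothetical stand-in for a trained conditional network:

```python
import numpy as np

# Logits are negative conditional energies -E(x, y); prediction is the
# lowest-energy label. class_centers is an illustrative toy parameterization.
class_centers = np.array([[0.0, 0.0], [3.0, 3.0], [-3.0, 3.0]])

def energy(x, y):
    # Toy conditional energy E(x, y) = ||x - center_y||^2 / 2
    return 0.5 * ((x - class_centers[y]) ** 2).sum()

def classify(x):
    logits = np.array([-energy(x, y) for y in range(len(class_centers))])
    return int(np.argmax(logits))   # lowest energy <=> highest logit

print(classify(np.array([2.8, 3.1])))   # → 1
```

Because the decision depends only on energy differences between labels, the partition function again cancels and no normalization is needed.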
We can apply this insight to other generative models, by maximizing the likelihood\nof a class conditional model at train time and then using the highest likelihood class as the classi\ufb01cation\nresult. We ran such a baseline using a VAE and obtained a performance of 40.04 \u00b1 1.31, which is\nhigher than other continual learning algorithms but less than that of an EBM.\n\f7 Compositional Generation\n\nFigure 14: A 2D example of combining EBMs through\nsummation and the resulting sampling trajectories.\n\nFinally, we show compositionality through im-\nplicit generation in EBMs. Consider a set of\nconditional EBMs for separate independent la-\ntents. Sampling through the joint distribution\non all latents is represented by generation on an\nEBM that is the sum of each conditional EBM\n[Hinton, 1999] and corresponds to a product of\nexperts model. As seen in Figure 14, summa-\ntion naturally allows composition of EBMs. We\nsample from the joint conditional distribution through Langevin dynamics sequentially from each model.\nWe conduct our experiments on the dSprites dataset [Higgins et al., 2017], which consists of all\npossible images of an object (square, circle or heart) varied by scale, position, and rotation, with labeled\nlatents. We trained conditional EBMs for each latent and found that scale, position and rotation\nworked well. The latent for shape was learned poorly, and we found that even our unconditional\nmodels were not able to reliably generate different shapes, which was also found in [Higgins et al.,\n2017]. We show some results on CelebA in A.6.\n\nFigure 15: Samples from the joint distribution of\n4 independent conditional EBMs on scale, posi-\ntion, rotation and shape (left panel) with associated\nground truth rendering (right panel).\n\nFigure 16: GT = Ground Truth. 
Images of cross-product generalization of size-position (left panel) and shape-position (right panel).

Joint Conditioning  In Figure 15, we provide generated images from joint conditional sampling. Under such sampling, we are able to generate images very close to the ground truth for all classes, with the exception of shape. This result also demonstrates mode coverage across all the data.

Zero-Shot Cross-Product Generalization  We evaluate the ability of EBMs to generalize to novel combinations of latents. We generate three datasets, D1: squares of different sizes at a central position; D2: the smallest-size square at each location; D3: different shapes at the central position. We evaluate size-position generalization by training independent energy functions on D1 and D2, and testing on generating squares of different sizes at all positions. We similarly evaluate shape-position generalization with D2 and D3. We generate samples at novel combinations by sampling from the summation of the energy functions (we first finetune the summation energy to generate both training datasets using a KL term defined in the appendix). We compare against a joint conditional model baseline.

We present generalization results in Figure 16. In the left panel of Figure 16, we find that EBMs are able to generalize to different sizes at different positions (albeit with some loss in sample quality), while a conditional model ignores the size latent and generates only images seen during training. In the right panel of Figure 16, we find that EBMs are able to generalize to combinations of shape and position by creating a distinctive shape for each conditioned shape latent at different positions (though the generated shape does not match that of the original shape latent), while the baseline is unable to generate samples. We believe the compositional nature of EBMs is crucial to generalizing in this task.

8 Conclusion

We have presented a series of techniques to scale up EBM training.
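For concreteness, the compositional sampling procedure of Section 7 (Langevin dynamics on a sum of energy functions, as in Figure 14) can be sketched on a 2D toy problem. The quadratic energies, finite-difference gradient, step size, and step count below are illustrative assumptions, not the paper's learned models or hyperparameters:

```python
import numpy as np

# Two toy 2D energy functions (hypothetical stand-ins for learned conditional
# EBMs): quadratic bowls with minima at (-1, 0) and (1, 0).
def energy_a(x):
    return 0.5 * np.sum((x - np.array([-1.0, 0.0])) ** 2)

def energy_b(x):
    return 0.5 * np.sum((x - np.array([1.0, 0.0])) ** 2)

def grad(f, x, eps=1e-4):
    """Central-difference gradient; a learned EBM would use autodiff instead."""
    g = np.zeros_like(x)
    for i in range(x.size):
        d = np.zeros_like(x)
        d[i] = eps
        g[i] = (f(x + d) - f(x - d)) / (2 * eps)
    return g

def langevin_sample(energies, x0, step=0.05, n_steps=200, seed=0):
    """Langevin dynamics on the SUM of energies (a product of experts):
    x <- x - (step / 2) * grad(sum_i E_i)(x) + N(0, step)."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for _ in range(n_steps):
        g = sum(grad(E, x) for E in energies)  # gradient of the summed energy
        x = x - 0.5 * step * g + np.sqrt(step) * rng.normal(size=x.shape)
    return x

# The chain starts far from both minima and settles near the mode of the
# product distribution, between them (cf. the "energy A + B" panel of Fig. 14).
x = langevin_sample([energy_a, energy_b], np.array([3.0, 3.0]))
```

Note that the summands are combined only at sampling time; in the paper's experiments they are learned conditional models over images rather than these toy quadratics.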
We further show unique benefits of implicit generation and EBMs, and believe there are many further directions to explore. Algorithmically, we think it would be interesting to explore methods for faster sampling, such as adaptive HMC. Empirically, it would be interesting to explore, extend, and better understand the results we have found, in directions such as compositionality, out-of-distribution detection, adversarial robustness, and online learning. Furthermore, it may be interesting to apply EBMs to other domains, such as text, and as a means for latent representation learning.

9 Acknowledgements

We would like to thank Ilya Sutskever, Alec Radford, Prafulla Dhariwal, Dan Hendrycks, Johannes Otterbach, Rewon Child, and everyone at OpenAI for helpful discussions.

References

David H Ackley, Geoffrey E Hinton, and Terrence J Sejnowski. A learning algorithm for boltzmann machines. Cognit. Sci., 9(1):147–169, 1985.

Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein gan. arXiv preprint arXiv:1701.07875, 2017.

M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi. Describing textures in the wild. In Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2014.

DC Dowson and BV Landau. The fréchet distance between multivariate normal distributions. Journal of multivariate analysis, 12(3):450–455, 1982.

Vincent Dumoulin, Jonathon Shlens, and Manjunath Kudlur. A learned representation for artistic style. In ICLR, 2017.

Sebastian Farquhar and Yarin Gal. Towards robust evaluations of continual learning. arXiv preprint arXiv:1805.09733, 2018.

Chelsea Finn, Paul Christiano, Pieter Abbeel, and Sergey Levine. A connection between generative adversarial networks, inverse reinforcement learning, and energy-based models.
In NIPS Workshop, 2016.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014.

Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. Improved training of wasserstein gans. In NIPS, 2017.

Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. arXiv preprint arXiv:1702.08165, 2017.

Junxian He, Daniel Spokoyny, Graham Neubig, and Taylor Berg-Kirkpatrick. Lagging inference networks and posterior collapse in variational autoencoders. arXiv preprint arXiv:1901.05534, 2019.

Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136, 2016.

Dan Hendrycks, Mantas Mazeika, and Thomas Dietterich. Deep anomaly detection with outlier exposure. arXiv preprint, 2018.

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pages 6626–6637, 2017.

Irina Higgins, Loic Matthey, Arka Pal, Christopher P Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. Beta-vae: Learning basic visual concepts with a constrained variational framework. In ICLR, 2017.

Geoffrey Hinton, Simon Osindero, Max Welling, and Yee-Whye Teh. Unsupervised discovery of nonlinear structure using contrastive backpropagation. Cognitive science, 30(4):725–731, 2006.

Geoffrey E Hinton. Products of experts. International Conference on Artificial Neural Networks, 1999.

Geoffrey E Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002.

Yen-Chang Hsu, Yen-Cheng Liu, and Zsolt Kira.
Re-evaluating continual learning scenarios: A categorization and case for strong baselines. arXiv preprint arXiv:1810.12488, 2018.

Taesup Kim and Yoshua Bengio. Deep directed generative models with energy-based probability estimation. arXiv preprint arXiv:1606.03439, 2016.

Diederik P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. arXiv preprint arXiv:1807.03039, 2018.

Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. In ICLR, 2014.

James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521–3526, 2017.

Rithesh Kumar, Anirudh Goyal, Aaron Courville, and Yoshua Bengio. Maximum entropy generators for energy-based models. arXiv preprint arXiv:1901.08508, 2019.

Karol Kurach, Mario Lucic, Xiaohua Zhai, Marcin Michalski, and Sylvain Gelly. The gan landscape: Losses, architectures, regularization, and normalization. arXiv preprint arXiv:1807.04720, 2018.

Yann LeCun, Sumit Chopra, and Raia Hadsell. A tutorial on energy-based learning. 2006.

Zhengqi Li and Noah Snavely. Learning intrinsic image decomposition from watching the world. In CVPR, 2018.

Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.

Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.

Andriy Mnih and Geoffrey Hinton. Learning nonlinear constraints with contrastive backpropagation.
Citeseer, 2004.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. In NIPS Workshop, 2013.

Eric Nalisnick, Akihiro Matsukawa, Yee Whye Teh, Dilan Gorur, and Balaji Lakshminarayanan. Do deep generative models know what they don't know? In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=H1xwNhCcYm.

Radford M Neal. Mcmc using hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11), 2011.

Augustus Odena, Christopher Olah, and Jonathon Shlens. Conditional image synthesis with auxiliary classifier gans. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2642–2651. JMLR.org, 2017.

OpenAI. Learning dexterous in-hand manipulation. arXiv preprint arXiv:1808.00177, 2018.

Georg Ostrovski, Will Dabney, and Rémi Munos. Autoregressive quantile networks for generative modeling. arXiv preprint arXiv:1806.05575, 2018.

Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.

Ruslan Salakhutdinov and Geoffrey E. Hinton. Deep boltzmann machines. In David A. Van Dyk and Max Welling, editors, AISTATS, volume 5 of JMLR Proceedings, pages 448–455. JMLR.org, 2009. URL http://www.jmlr.org/proceedings/papers/v5/salakhutdinov09a.html.

Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In NIPS, 2016.

Jonathan Schwarz, Jelena Luketina, Wojciech M Czarnecki, Agnieszka Grabska-Barwinska, Yee Whye Teh, Razvan Pascanu, and Raia Hadsell. Progress & compress: A scalable framework for continual learning. arXiv preprint arXiv:1805.06370, 2018.

Yee Whye Teh, Max Welling, Simon Osindero, and Geoffrey E Hinton.
Energy-based models for sparse overcomplete representations. Journal of Machine Learning Research, 4(Dec):1235–1260, 2003.

Tijmen Tieleman. Training restricted boltzmann machines using approximations to the likelihood gradient. In Proceedings of the 25th international conference on Machine learning, pages 1064–1071. ACM, 2008.

Richard Turner. CD notes. 2005.

Aaron Van Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. In ICML, 2016.

Max Welling and Yee W Teh. Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681–688, 2011.

Jianwen Xie, Yang Lu, Song-Chun Zhu, and Yingnian Wu. A theory of generative convnet. In International Conference on Machine Learning, pages 2635–2644, 2016.

Raymond A Yeh, Chen Chen, Teck-Yian Lim, Alexander G Schwing, Mark Hasegawa-Johnson, and Minh N Do. Semantic image inpainting with deep generative models. In CVPR, 2017.

Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 3987–3995. JMLR.org, 2017.

Junbo Zhao, Michael Mathieu, and Yann LeCun. Energy-based generative adversarial network. arXiv preprint arXiv:1609.03126, 2016.