{"title": "Generative Modeling by Estimating Gradients of the Data Distribution", "book": "Advances in Neural Information Processing Systems", "page_first": 11918, "page_last": 11930, "abstract": "We introduce a new generative model where samples are produced via Langevin dynamics using gradients of the data distribution estimated with score matching. Because gradients can be ill-defined and hard to estimate when the data resides on low-dimensional manifolds, we perturb the data with different levels of Gaussian noise, and jointly estimate the corresponding scores, i.e., the vector fields of gradients of the perturbed data distribution for all noise levels. For sampling, we propose an annealed Langevin dynamics where we use gradients corresponding to gradually decreasing noise levels as the sampling process gets closer to the data manifold. Our framework allows flexible model architectures, requires no sampling during training or the use of adversarial methods, and provides a learning objective that can be used for principled model comparisons. Our models produce samples \ncomparable to GANs on MNIST, CelebA and CIFAR-10 datasets, achieving a new state-of-the-art inception score of 8.87 on CIFAR-10. 
Additionally, we demonstrate that our models learn effective representations via image inpainting experiments.", "full_text": "Generative Modeling by Estimating Gradients of the\n\nData Distribution\n\nYang Song\n\nStanford University\n\nyangsong@cs.stanford.edu\n\nStefano Ermon\n\nStanford University\n\nermon@cs.stanford.edu\n\nAbstract\n\nWe introduce a new generative model where samples are produced via Langevin\ndynamics using gradients of the data distribution estimated with score matching.\nBecause gradients can be ill-de\ufb01ned and hard to estimate when the data resides on\nlow-dimensional manifolds, we perturb the data with different levels of Gaussian\nnoise, and jointly estimate the corresponding scores, i.e., the vector \ufb01elds of\ngradients of the perturbed data distribution for all noise levels. For sampling, we\npropose an annealed Langevin dynamics where we use gradients corresponding to\ngradually decreasing noise levels as the sampling process gets closer to the data\nmanifold. Our framework allows \ufb02exible model architectures, requires no sampling\nduring training or the use of adversarial methods, and provides a learning objective\nthat can be used for principled model comparisons. Our models produce samples\ncomparable to GANs on MNIST, CelebA and CIFAR-10 datasets, achieving a new\nstate-of-the-art inception score of 8.87 on CIFAR-10. Additionally, we demonstrate\nthat our models learn effective representations via image inpainting experiments.\n\n1\n\nIntroduction\n\nGenerative models have many applications in machine learning. To list a few, they have been\nused to generate high-\ufb01delity images [26, 6], synthesize realistic speech and music fragments [58],\nimprove the performance of semi-supervised learning [28, 10], detect adversarial examples and\nother anomalous data [54], imitation learning [22], and explore promising states in reinforcement\nlearning [41]. 
Recent progress is mainly driven by two approaches: likelihood-based methods [17, 29, 11, 60] and generative adversarial networks (GAN [15]). The former uses log-likelihood (or a suitable surrogate) as the training objective, while the latter uses adversarial training to minimize f-divergences [40] or integral probability metrics [2, 55] between model and data distributions. Although likelihood-based models and GANs have achieved great success, they have some intrinsic limitations. For example, likelihood-based models either have to use specialized architectures to build a normalized probability model (e.g., autoregressive models, flow models), or use surrogate losses (e.g., the evidence lower bound used in variational auto-encoders [29], contrastive divergence in energy-based models [21]) for training. GANs avoid some of the limitations of likelihood-based models, but their training can be unstable due to the adversarial training procedure. In addition, the GAN objective is not suitable for evaluating and comparing different GAN models. While other objectives exist for generative modeling, such as noise contrastive estimation [19] and minimum probability flow [50], these methods typically only work well for low-dimensional data.

In this paper, we explore a new principle for generative modeling based on estimating and sampling from the (Stein) score [33] of the logarithmic data density, which is the gradient of the log-density function at the input data point. This is a vector field pointing in the direction where the log data density grows the most. We use a neural network trained with score matching [24] to learn this vector field from data. We then produce samples using Langevin dynamics, which approximately works by gradually moving a random initial sample to high density regions along the (estimated) vector field of scores.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
However, there are two main challenges with this approach. First, if the data\ndistribution is supported on a low dimensional manifold\u2014as it is often assumed for many real world\ndatasets\u2014the score will be unde\ufb01ned in the ambient space, and score matching will fail to provide a\nconsistent score estimator. Second, the scarcity of training data in low data density regions, e.g., far\nfrom the manifold, hinders the accuracy of score estimation and slows down the mixing of Langevin\ndynamics sampling. Since Langevin dynamics will often be initialized in low-density regions of the\ndata distribution, inaccurate score estimation in these regions will negatively affect the sampling\nprocess. Moreover, mixing can be dif\ufb01cult because of the need of traversing low density regions to\ntransition between modes of the distribution.\nTo tackle these two challenges, we propose to perturb the data with random Gaussian noise of\nvarious magnitudes. Adding random noise ensures the resulting distribution does not collapse to a\nlow dimensional manifold. Large noise levels will produce samples in low density regions of the\noriginal (unperturbed) data distribution, thus improving score estimation. Crucially, we train a single\nscore network conditioned on the noise level and estimate the scores at all noise magnitudes. We\nthen propose an annealed version of Langevin dynamics, where we initially use scores corresponding\nto the highest noise level, and gradually anneal down the noise level until it is small enough to be\nindistinguishable from the original data distribution. Our sampling strategy is inspired by simulated\nannealing [30, 37] which heuristically improves optimization for multimodal landscapes.\nOur approach has several desirable properties. 
First, our objective is tractable for almost all parameterizations of the score networks without the need of special constraints or architectures, and can be optimized without adversarial training, MCMC sampling, or other approximations during training. The objective can also be used to quantitatively compare different models on the same dataset. Experimentally, we demonstrate the efficacy of our approach on MNIST, CelebA [34], and CIFAR-10 [31]. We show that the samples look comparable to those generated from modern likelihood-based models and GANs. On CIFAR-10, our model sets the new state-of-the-art inception score of 8.87 for unconditional generative models, and achieves a competitive FID score of 25.32. We show that the model learns meaningful representations of the data by image inpainting experiments.

2 Score-based generative modeling

Suppose our dataset consists of i.i.d. samples {xi ∈ R^D}_{i=1}^N from an unknown data distribution pdata(x). We define the score of a probability density p(x) to be ∇x log p(x). The score network sθ : R^D → R^D is a neural network parameterized by θ, which will be trained to approximate the score of pdata(x). The goal of generative modeling is to use the dataset to learn a model for generating new samples from pdata(x). The framework of score-based generative modeling has two ingredients: score matching and Langevin dynamics.

2.1 Score matching for score estimation

Score matching [24] was originally designed for learning non-normalized statistical models based on i.i.d. samples from an unknown data distribution. Following [53], we repurpose it for score estimation. Using score matching, we can directly train a score network sθ(x) to estimate ∇x log pdata(x) without training a model to estimate pdata(x) first.
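As a toy illustration of this idea (not part of the paper), consider estimating the score of a 1-D Gaussian from samples alone. For a linear score model s(x) = a·x + b, the integration-by-parts form of the score matching objective (Eq. (1) below), E[s'(x) + ½ s(x)²], is quadratic in (a, b) and can be minimized in closed form; `mu`, `sigma`, and the linear model are arbitrary choices for this demo:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 1.5
x = rng.normal(mu, sigma, size=100_000)  # i.i.d. samples from the "data" distribution

# For s(x) = a*x + b, the score matching objective
#   J(a, b) = E[s'(x) + 0.5 * s(x)^2] = a + 0.5 * E[(a*x + b)^2]
# is quadratic; setting its gradient to zero gives
#   a = -1 / Var(x),  b = mean(x) / Var(x).
m, v = x.mean(), x.var()
a_hat = -1.0 / v
b_hat = m / v

# The true score of N(mu, sigma^2) is (mu - x) / sigma^2,
# i.e., slope -1/sigma^2 and intercept mu/sigma^2.
print(a_hat, -1.0 / sigma**2)  # the two values nearly agree
print(b_hat, mu / sigma**2)    # likewise
```

The fit recovers the true score without ever evaluating a normalized density, which is the point of score matching.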
Different from the typical usage of score matching, we opt not to use the gradient of an energy-based model as the score network to avoid extra computation due to higher-order gradients. The objective minimizes ½ Epdata[∥sθ(x) − ∇x log pdata(x)∥₂²], which can be shown equivalent to the following up to a constant

Epdata(x)[ tr(∇x sθ(x)) + ½ ∥sθ(x)∥₂² ],   (1)

where ∇x sθ(x) denotes the Jacobian of sθ(x). As shown in [53], under some regularity conditions the minimizer of Eq. (1) (denoted as sθ*(x)) satisfies sθ*(x) = ∇x log pdata(x) almost surely. In practice, the expectation over pdata(x) in Eq. (1) can be quickly estimated using data samples. However, score matching is not scalable to deep networks and high dimensional data [53] due to the computation of tr(∇x sθ(x)). Below we discuss two popular methods for large scale score matching.

Denoising score matching  Denoising score matching [61] is a variant of score matching that completely circumvents tr(∇x sθ(x)). It first perturbs the data point x with a pre-specified noise distribution qσ(x̃ | x) and then employs score matching to estimate the score of the perturbed data distribution qσ(x̃) ≜ ∫ qσ(x̃ | x) pdata(x) dx. The objective was proved equivalent to the following:

½ Eqσ(x̃|x)pdata(x)[ ∥sθ(x̃) − ∇x̃ log qσ(x̃ | x)∥₂² ].   (2)

As shown in [61], the optimal score network (denoted as sθ*(x)) that minimizes Eq. (2) satisfies sθ*(x) = ∇x log qσ(x) almost surely. However, sθ*(x) = ∇x log qσ(x) ≈ ∇x log pdata(x) is true only when the noise is small enough such that qσ(x) ≈ pdata(x).

Sliced score matching  Sliced score matching [53] uses random projections to approximate tr(∇x sθ(x)) in score matching. The objective is

Epv Epdata[ vᵀ∇x sθ(x)v + ½ ∥sθ(x)∥₂² ],   (3)

where pv is a simple distribution of random vectors, e.g., the multivariate standard normal. As shown in [53], the term vᵀ∇x sθ(x)v can be efficiently computed by forward mode auto-differentiation. Unlike denoising score matching which estimates the scores of perturbed data, sliced score matching provides score estimation for the original unperturbed data distribution, but requires around four times more computations due to the forward mode auto-differentiation.

2.2 Sampling with Langevin dynamics

Langevin dynamics can produce samples from a probability density p(x) using only the score function ∇x log p(x). Given a fixed step size ε > 0, and an initial value x̃₀ ∼ π(x) with π being a prior distribution, the Langevin method recursively computes the following

x̃t = x̃t−1 + (ε/2) ∇x log p(x̃t−1) + √ε zt,   (4)

where zt ∼ N(0, I). The distribution of x̃T equals p(x) when ε → 0 and T → ∞, in which case x̃T becomes an exact sample from p(x) under some regularity conditions [62]. When ε > 0 and T < ∞, a Metropolis-Hastings update is needed to correct the error of Eq. (4), but it can often be ignored in practice [9, 12, 39]. In this work, we assume this error is negligible when ε is small and T is large. Note that sampling from Eq. (4) only requires the score function ∇x log p(x).
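The update in Eq. (4) can be sketched in a few lines of NumPy (an illustration, not the paper's code), here applied to a standard normal target whose score −x is known in closed form; the step size and iteration count are arbitrary demo choices:

```python
import numpy as np

def langevin(score, x0, eps, T, rng=None):
    # Eq. (4): x_t = x_{t-1} + (eps/2) * score(x_{t-1}) + sqrt(eps) * z_t
    if rng is None:
        rng = np.random.default_rng(0)
    x = np.array(x0, dtype=float)
    for _ in range(T):
        z = rng.standard_normal(x.shape)
        x = x + 0.5 * eps * score(x) + np.sqrt(eps) * z
    return x

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=10_000)  # initial values from a prior pi(x)
samples = langevin(lambda x: -x, x0, eps=0.05, T=1000, rng=rng)
print(samples.mean(), samples.std())      # approximately 0 and 1
```

With a finite ε the chain has a small discretization bias (here the stationary variance is inflated by a factor 1/(1 − ε/4)), which is the error the Metropolis-Hastings correction would remove.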
Therefore, in order to obtain samples from pdata(x), we can first train our score network such that sθ(x) ≈ ∇x log pdata(x) and then approximately obtain samples with Langevin dynamics using sθ(x). This is the key idea of our framework of score-based generative modeling.

3 Challenges of score-based generative modeling

In this section, we analyze more closely the idea of score-based generative modeling. We argue that there are two major obstacles that prevent a naïve application of this idea.

3.1 The manifold hypothesis

The manifold hypothesis states that data in the real world tend to concentrate on low dimensional manifolds embedded in a high dimensional space (a.k.a., the ambient space). This hypothesis empirically holds for many datasets, and has become the foundation of manifold learning [3, 47]. Under the manifold hypothesis, score-based generative models will face two key difficulties. First, since the score ∇x log pdata(x) is a gradient taken in the ambient space, it is undefined when x is confined to a low dimensional manifold. Second, the score matching objective Eq. (1) provides a consistent score estimator only when the support of the data distribution is the whole space (cf., Theorem 2 in [24]), and will be inconsistent when the data reside on a low-dimensional manifold.

Figure 1: Left: Sliced score matching (SSM) loss w.r.t. iterations. No noise is added to data. Right: Same but data are perturbed with N(0, 0.0001).

The negative effect of the manifold hypothesis on score estimation can be seen clearly from Fig. 1, where we train a ResNet (details in Appendix B.1) to estimate the data score on CIFAR-10. For fast training and faithful estimation of the data scores, we use the sliced score matching objective (Eq. (3)). As Fig. 1 (left) shows, when trained on the original CIFAR-10 images, the sliced score matching loss first decreases and then fluctuates irregularly. In contrast, if we perturb the data with a small Gaussian noise (such that the perturbed data distribution has full support over R^D), the loss curve will converge (right panel). Note that the Gaussian noise N(0, 0.0001) we impose is very small for images with pixel values in the range [0, 1], and is almost indistinguishable to human eyes.

3.2 Low data density regions

The scarcity of data in low density regions can cause difficulties for both score estimation with score matching and MCMC sampling with Langevin dynamics.

3.2.1 Inaccurate score estimation with score matching

In regions of low data density, score matching may not have enough evidence to estimate score functions accurately, due to the lack of data samples. To see this, recall from Section 2.1 that score matching minimizes the expected squared error of the score estimates, i.e., ½ Epdata[∥sθ(x) − ∇x log pdata(x)∥₂²]. In practice, the expectation w.r.t. the data distribution is always estimated using i.i.d. samples {xi}_{i=1}^N ∼ pdata(x). Consider any region R ⊂ R^D such that pdata(R) ≈ 0. In most cases {xi}_{i=1}^N ∩ R = ∅, and score matching will not have sufficient data samples to estimate ∇x log pdata(x) accurately for x ∈ R.

Figure 2: Left: ∇x log pdata(x); Right: sθ(x). The data density pdata(x) is encoded using an orange colormap: darker color implies higher density. Red rectangles highlight regions where ∇x log pdata(x) ≈ sθ(x).

To demonstrate the negative effect of this, we provide the result of a toy experiment (details in Appendix B.1) in Fig. 2, where we use sliced score matching to estimate scores of a mixture of Gaussians pdata = (1/5) N((−5, −5), I) + (4/5) N((5, 5), I). As the figure demonstrates, score estimation is only reliable in the immediate vicinity of the modes of pdata, where the data density is high.

3.2.2 Slow mixing of Langevin dynamics

When two modes of the data distribution are separated by low density regions, Langevin dynamics will not be able to correctly recover the relative weights of these two modes in reasonable time, and therefore might not converge to the true distribution. Our analyses of this are largely inspired by [63], which analyzed the same phenomenon in the context of density estimation with score matching.

Consider a mixture distribution pdata(x) = π p₁(x) + (1 − π) p₂(x), where p₁(x) and p₂(x) are normalized distributions with disjoint supports, and π ∈ (0, 1). In the support of p₁(x), ∇x log pdata(x) = ∇x(log π + log p₁(x)) = ∇x log p₁(x), and in the support of p₂(x), ∇x log pdata(x) = ∇x(log(1 − π) + log p₂(x)) = ∇x log p₂(x). In either case, the score ∇x log pdata(x) does not depend on π. Since Langevin dynamics use ∇x log pdata(x) to sample from pdata(x), the samples obtained will not depend on π. In practice, this analysis also holds when different modes have approximately disjoint supports—they may share the same support but be connected by regions of small data density.
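This π-independence is easy to check numerically. Below is a small 1-D analogue of the mixture (the paper's example is 2-D; the point 5.3 and the stability trick are demo choices), with the mixture score computed exactly via posterior responsibilities:

```python
import numpy as np

def mixture_score(x, pi, mu1=-5.0, mu2=5.0):
    # Exact score of pi*N(mu1, 1) + (1 - pi)*N(mu2, 1) in 1-D.
    # Computed via posterior responsibilities for numerical stability:
    # score(x) = w1(x) * (mu1 - x) + (1 - w1(x)) * (mu2 - x).
    l1 = np.log(pi) - 0.5 * (x - mu1) ** 2
    l2 = np.log(1.0 - pi) - 0.5 * (x - mu2) ** 2
    m = np.maximum(l1, l2)
    e1, e2 = np.exp(l1 - m), np.exp(l2 - m)
    w1 = e1 / (e1 + e2)
    return w1 * (mu1 - x) + (1.0 - w1) * (mu2 - x)

# Deep inside the right mode, the score barely depends on pi:
print(mixture_score(5.3, pi=0.2), mixture_score(5.3, pi=0.8))  # both about -0.3
```

Both evaluations agree to within about 1e-20, so Langevin dynamics driven by this score cannot see the mixture weight.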
In this case, Langevin dynamics can produce correct samples in theory, but may require a very small step size and a very large number of steps to mix.

To verify this analysis, we test Langevin dynamics sampling for the same mixture of Gaussians used in Section 3.2.1 and provide the results in Fig. 3. We use the ground truth scores when sampling with Langevin dynamics. Comparing Fig. 3(b) with (a), it is obvious that the samples from Langevin dynamics have incorrect relative density between the two modes, as predicted by our analysis.

Figure 3: Samples from a mixture of Gaussians with different methods. (a) Exact sampling. (b) Sampling using Langevin dynamics with the exact scores. (c) Sampling using annealed Langevin dynamics with the exact scores. Clearly Langevin dynamics estimate the relative weights between the two modes incorrectly, while annealed Langevin dynamics recover the relative weights faithfully.

4 Noise Conditional Score Networks: learning and inference

We observe that perturbing data with random Gaussian noise makes the data distribution more amenable to score-based generative modeling. First, since the support of our Gaussian noise distribution is the whole space, the perturbed data will not be confined to a low dimensional manifold, which obviates difficulties from the manifold hypothesis and makes score estimation well-defined. Second, large Gaussian noise has the effect of filling low density regions in the original unperturbed data distribution; therefore score matching may get more training signal to improve score estimation. Furthermore, by using multiple noise levels we can obtain a sequence of noise-perturbed distributions that converge to the true data distribution.
We can improve the mixing rate of Langevin dynamics on multimodal distributions by leveraging these intermediate distributions in the spirit of simulated annealing [30] and annealed importance sampling [37].

Built upon this intuition, we propose to improve score-based generative modeling by 1) perturbing the data using various levels of noise; and 2) simultaneously estimating scores corresponding to all noise levels by training a single conditional score network. After training, when using Langevin dynamics to generate samples, we initially use scores corresponding to large noise, and gradually anneal down the noise level. This helps smoothly transfer the benefits of large noise levels to low noise levels where the perturbed data are almost indistinguishable from the original ones. In what follows, we will elaborate more on the details of our method, including the architecture of our score networks, the training objective, and the annealing schedule for Langevin dynamics.

4.1 Noise Conditional Score Networks

Let {σi}_{i=1}^L be a positive geometric sequence that satisfies σ₁/σ₂ = ··· = σ_{L−1}/σ_L > 1. Let qσ(x) ≜ ∫ pdata(t) N(x | t, σ²I) dt denote the perturbed data distribution. We choose the noise levels {σi}_{i=1}^L such that σ₁ is large enough to mitigate the difficulties discussed in Section 3, and σ_L is small enough to minimize the effect on data. We aim to train a conditional score network to jointly estimate the scores of all perturbed data distributions, i.e., ∀σ ∈ {σi}_{i=1}^L : sθ(x, σ) ≈ ∇x log qσ(x). Note that sθ(x, σ) ∈ R^D when x ∈ R^D. We call sθ(x, σ) a Noise Conditional Score Network (NCSN).

Similar to likelihood-based generative models and GANs, the design of model architectures plays an important role in generating high quality samples.
In this work, we mostly focus on architectures useful for image generation, and leave the architecture design for other domains as future work. Since the output of our noise conditional score network sθ(x, σ) has the same shape as the input image x, we draw inspiration from successful model architectures for dense prediction of images (e.g., semantic segmentation). In the experiments, our model sθ(x, σ) combines the architecture design of U-Net [46] with dilated/atrous convolution [64, 65, 8]—both of which have been proved very successful in semantic segmentation. In addition, we adopt instance normalization in our score network, inspired by its superior performance in some image generation tasks [57, 13, 23], and we use a modified version of conditional instance normalization [13] to provide conditioning on σi. More details on our architecture can be found in Appendix A.

4.2 Learning NCSNs via score matching

Both sliced and denoising score matching can train NCSNs. We adopt denoising score matching as it is slightly faster and naturally fits the task of estimating scores of noise-perturbed data distributions. However, we emphasize that empirically sliced score matching can train NCSNs as well as denoising score matching. We choose the noise distribution to be qσ(x̃ | x) = N(x̃ | x, σ²I); therefore ∇x̃ log qσ(x̃ | x) = −(x̃ − x)/σ². For a given σ, the denoising score matching objective (Eq. (2)) is

ℓ(θ; σ) ≜ ½ Epdata(x) Ex̃∼N(x, σ²I)[ ∥sθ(x̃, σ) + (x̃ − x)/σ²∥₂² ].   (5)

Then, we combine Eq. (5) for all σ ∈ {σi}_{i=1}^L to get one unified objective

L(θ; {σi}_{i=1}^L) ≜ (1/L) Σ_{i=1}^L λ(σi) ℓ(θ; σi),   (6)

where λ(σi) > 0 is a coefficient function depending on σi. Assuming sθ(x, σ) has enough capacity, sθ*(x, σ) minimizes Eq. (6) if and only if sθ*(x, σi) = ∇x log qσi(x) a.s. for all i ∈ {1, 2, ···, L}, because Eq. (6) is a conical combination of L denoising score matching objectives.

There can be many possible choices of λ(·). Ideally, we hope that the values of λ(σi)ℓ(θ; σi) for all {σi}_{i=1}^L are roughly of the same order of magnitude. Empirically, we observe that when the score networks are trained to optimality, we approximately have ∥sθ(x, σ)∥₂ ∝ 1/σ. This inspires us to choose λ(σ) = σ². Because under this choice, we have λ(σ)ℓ(θ; σ) = σ²ℓ(θ; σ) = ½ E[∥σ sθ(x̃, σ) + (x̃ − x)/σ∥₂²]. Since (x̃ − x)/σ ∼ N(0, I) and ∥σ sθ(x, σ)∥₂ ∝ 1, we can easily conclude that the order of magnitude of λ(σ)ℓ(θ; σ) does not depend on σ.

We emphasize that our objective Eq. (6) requires no adversarial training, no surrogate losses, and no sampling from the score network during training (e.g., unlike contrastive divergence). Also, it does not require sθ(x, σ) to have special architectures in order to be tractable. In addition, when λ(·) and {σi}_{i=1}^L are fixed, it can be used to quantitatively compare different NCSNs.

4.3 NCSN inference via annealed Langevin dynamics

Algorithm 1 Annealed Langevin dynamics.
Require: {σi}_{i=1}^L, ε, T.
1: Initialize x̃₀
2: for i ← 1 to L do
3:   αi ← ε · σi²/σL²   ▷ αi is the step size.
4:   for t ← 1 to T do
5:     Draw zt ∼ N(0, I)
6:     x̃t ← x̃t−1 + (αi/2) sθ(x̃t−1, σi) + √αi zt
7:   end for
8:   x̃₀ ← x̃T
9: end for
return x̃T

After the NCSN sθ(x, σ) is trained, we propose a sampling approach—annealed Langevin dynamics (Alg. 1)—to produce samples, inspired by simulated annealing [30] and annealed importance sampling [37]. As shown in Alg. 1, we start annealed Langevin dynamics by initializing the samples from some fixed prior distribution, e.g., uniform noise. Then, we run Langevin dynamics to sample from qσ1(x) with step size α₁. Next, we run Langevin dynamics to sample from qσ2(x), starting from the final samples of the previous simulation and using a reduced step size α₂. We continue in this fashion, using the final samples of Langevin dynamics for qσi−1(x) as the initial samples of Langevin dynamics for qσi(x), and tuning down the step size αi gradually with αi = ε · σi²/σL². Finally, we run Langevin dynamics to sample from qσL(x), which is close to pdata(x) when σL ≈ 0.

Since the distributions {qσi}_{i=1}^L are all perturbed by Gaussian noise, their supports span the whole space and their scores are well-defined, avoiding difficulties from the manifold hypothesis. When σ₁ is sufficiently large, the low density regions of qσ1(x) become small and the modes become less isolated. As discussed previously, this can make score estimation more accurate, and the mixing of Langevin dynamics faster. We can therefore assume that Langevin dynamics produce good samples for qσ1(x). These samples are likely to come from high density regions of qσ1(x), which means
These samples are likely to come from high density regions of q\u03c31(x), which means\n\nL. Finally, we run Langevin\n\nend for\n\u02dcx0 \u2190 \u02dcxT\n\ns\u03b8(\u02dcxt\u22121, \u03c3i) +\n\nreturn \u02dcxT\n\n\u03b1i\n2\n\n\u221a\n\n\u03b1i zt\n\n6\n\n\fModel\nCIFAR-10 Unconditional\nPixelCNN [59]\nPixelIQN [42]\nEBM [12]\nWGAN-GP [18]\nMoLM [45]\nSNGAN [36]\nProgressiveGAN [25]\nNCSN (Ours)\nCIFAR-10 Conditional\nEBM [12]\nSNGAN [36]\nBigGAN [6]\n\n4.60\n5.29\n6.02\n\n7.86 \u00b1 .07\n7.90 \u00b1 .10\n8.22 \u00b1 .05\n8.80 \u00b1 .05\n8.87 \u00b1 .12\n\n8.30\n\n8.60 \u00b1 .08\n\n9.22\n\nInception\n\nFID\n\n65.93\n49.46\n40.58\n36.4\n18.9\n21.7\n\n-\n\n25.32\n\n37.9\n25.5\n14.73\n\nTable 1: Inception and FID scores for CIFAR-10\n\nFigure 4: Intermediate samples of annealed\nLangevin dynamics.\n\nthey are also likely to reside in the high density regions of q\u03c32 (x), given that q\u03c31(x) and q\u03c32(x) only\nslightly differ from each other. As score estimation and Langevin dynamics perform better in high\ndensity regions, samples from q\u03c31(x) will serve as good initial samples for Langevin dynamics of\nq\u03c32(x). Similarly, q\u03c3i\u22121 (x) provides good initial samples for q\u03c3i(x), and \ufb01nally we obtain samples\nof good quality from q\u03c3L(x).\nThere could be many possible ways of tuning \u03b1i according to \u03c3i in Alg. 1. Our choice is \u03b1i \u221d \u03c32\ni .\nThe motivation is to \ufb01x the magnitude of the \u201csignal-to-noise\u201d ratio \u03b1is\u03b8 (x,\u03c3i)\nin Langevin dynam-\n\u03b1i z\n\u03b1i z (cid:107)2\nics. Note that E[(cid:107) \u03b1is\u03b8 (x,\u03c3i)\n2]. 
Recall that empirically\nwe found (cid:107)s\u03b8(x, \u03c3)(cid:107)2 \u221d 1/\u03c3 when the score network is trained close to optimal, in which case\nE[(cid:107)\u03c3is\u03b8(x; \u03c3i)(cid:107)2\n4 does not depend on \u03c3i.\nTo demonstrate the ef\ufb01cacy of our annealed Langevin dynamics, we provide a toy example where the\ngoal is to sample from a mixture of Gaussian with two well-separated modes using only scores. We\napply Alg. 1 to sample from the mixture of Gausssian used in Section 3.2. In the experiment, we\nchoose {\u03c3i}L\ni=1 to be a geometric progression, with L = 10, \u03c31 = 10 and \u03c310 = 0.1. The results are\nprovided in Fig. 3. Comparing Fig. 3 (b) against (c), annealed Langevin dynamics correctly recover\nthe relative weights between the two modes whereas standard Langevin dynamics fail.\n\n2] \u221d 1. Therefore (cid:107) \u03b1is\u03b8 (x,\u03c3i)\n\n\u221a\n2\nE[(cid:107)\u03c3is\u03b8(x, \u03c3i)(cid:107)2\n\n2] \u2248 E[ \u03b1i(cid:107)s\u03b8 (x,\u03c3i)(cid:107)2\n\n\u03b1i z (cid:107)2 \u221d 1\n\n4\n\nE[(cid:107)\u03c3is\u03b8(x, \u03c3i)(cid:107)2\n\n2] \u221d 1\n\n\u221a\n2\n\n2\n\n] \u221d 1\n\n4\n\n\u221a\n2\n\n4\n\n5 Experiments\n\nIn this section, we demonstrate that our NCSNs are able to produce high quality image samples on\nseveral commonly used image datasets. In addition, we show that our models learn reasonable image\nrepresentations by image inpainting experiments.\n\nSetup We use MNIST, CelebA [34], and CIFAR-10 [31] datasets in our experiments. For CelebA,\nthe images are \ufb01rst center-cropped to 140 \u00d7 140 and then resized to 32 \u00d7 32. All images are rescaled\nso that pixel values are in [0, 1]. We choose L = 10 different standard deviations such that {\u03c3i}L\ni=1 is\na geometric sequence with \u03c31 = 1 and \u03c310 = 0.01. Note that Gaussian noise of \u03c3 = 0.01 is almost\nindistinguishable to human eyes for image data. 
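As an aside, the two-mode toy example of Section 3.2 can be reproduced in a few lines with Alg. 1 using the exact scores of the Gaussian-perturbed mixtures qσ. The sketch below is a 1-D analogue (not the paper's 2-D code); ε, T, the mode locations, and the sample count are arbitrary demo choices:

```python
import numpy as np

def perturbed_score(x, sigma, pi=0.2, mu1=-5.0, mu2=5.0):
    # Exact score of q_sigma = pi*N(mu1, 1+sigma^2) + (1-pi)*N(mu2, 1+sigma^2),
    # i.e., the original mixture convolved with N(0, sigma^2).
    v = 1.0 + sigma ** 2
    l1 = np.log(pi) - 0.5 * (x - mu1) ** 2 / v
    l2 = np.log(1.0 - pi) - 0.5 * (x - mu2) ** 2 / v
    m = np.maximum(l1, l2)
    e1, e2 = np.exp(l1 - m), np.exp(l2 - m)
    w1 = e1 / (e1 + e2)
    return (w1 * (mu1 - x) + (1.0 - w1) * (mu2 - x)) / v

def annealed_langevin(score, sigmas, x0, eps=0.02, T=100, rng=None):
    # Alg. 1 with exact scores: alpha_i = eps * sigma_i^2 / sigma_L^2.
    if rng is None:
        rng = np.random.default_rng(0)
    x = np.array(x0, dtype=float)
    for sigma in sigmas:
        alpha = eps * sigma ** 2 / sigmas[-1] ** 2
        for _ in range(T):
            z = rng.standard_normal(x.shape)
            x = x + 0.5 * alpha * score(x, sigma) + np.sqrt(alpha) * z
    return x

sigmas = np.geomspace(10.0, 0.1, 10)  # L = 10 geometric noise levels
x0 = np.random.default_rng(1).uniform(-8.0, 8.0, size=5000)
samples = annealed_langevin(perturbed_score, sigmas, x0)
print((samples > 0).mean())  # close to the true weight 0.8 of the right mode
```

Running vanilla Langevin dynamics at the single smallest σ instead leaves the fraction of samples per mode at the mercy of the initialization, mirroring Fig. 3(b) versus (c).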
When using annealed Langevin dynamics for image generation, we choose T = 100 and ε = 2 × 10⁻⁵, and use uniform noise as our initial samples. We found the results are robust w.r.t. the choice of T, and ε between 5 × 10⁻⁶ and 5 × 10⁻⁵ generally works fine. We provide additional details on model architecture and settings in Appendix A and B.

Figure 5: Uncurated samples on MNIST (a), CelebA (b), and CIFAR-10 (c) datasets.

Figure 6: Image inpainting on CelebA (left) and CIFAR-10 (right). The leftmost column of each figure shows the occluded images, while the rightmost column shows the original images.

Image generation  In Fig. 5, we show uncurated samples from annealed Langevin dynamics for MNIST, CelebA and CIFAR-10. As shown by the samples, our generated images have higher or comparable quality to those from modern likelihood-based models and GANs. To intuit the procedure of annealed Langevin dynamics, we provide intermediate samples in Fig. 4, where each row shows how samples evolve from pure random noise to high quality images. More samples from our approach can be found in Appendix C. We also show the nearest neighbors of generated images in the training dataset in Appendix C.2, in order to demonstrate that our model is not simply memorizing training images. To show it is important to learn a conditional score network jointly for many noise levels and use annealed Langevin dynamics, we compare against a baseline approach where we only consider one noise level {σ₁ = 0.01} and use the vanilla Langevin dynamics sampling method. Although this small added noise helps circumvent the difficulty of the manifold hypothesis (as shown by Fig. 1, things will completely fail if no noise is added), it is not large enough to provide information on scores in regions of low data density.
As a result, this baseline fails to generate reasonable images, as shown by the samples in Appendix C.1.
For quantitative evaluation, we report inception [48] and FID [20] scores on CIFAR-10 in Tab. 1. As an unconditional model, we achieve the state-of-the-art inception score of 8.87, which is even better than most reported values for class-conditional generative models. Our FID score of 25.32 on CIFAR-10 is also comparable to those of top existing models, such as SNGAN [36]. We omit scores on MNIST and CelebA because scores on these two datasets are not widely reported, and different preprocessing (such as the center crop size for CelebA) can lead to numbers that are not directly comparable.

Image inpainting In Fig. 6, we demonstrate that our score networks learn generalizable and semantically meaningful image representations that allow them to produce diverse image inpaintings. Note that some previous models such as PixelCNN can only impute images in raster-scan order. In contrast, our method can naturally handle images with occlusions of arbitrary shapes through a simple modification of the annealed Langevin dynamics procedure (details in Appendix B.3). We provide more image inpainting results in Appendix C.5.

6 Related work

Our approach has some similarities with methods that learn the transition operator of a Markov chain for sample generation [4, 51, 5, 16, 52]. For example, generative stochastic networks (GSN [4, 1]) use denoising autoencoders to train a Markov chain whose equilibrium distribution matches the data distribution. Similarly, our method trains the score function used in Langevin dynamics to sample from the data distribution. However, GSN often starts the chain very close to a training data point, and therefore requires the chain to transition quickly between different modes. In contrast, our annealed Langevin dynamics are initialized from unstructured noise.
Nonequilibrium\nThermodynamics (NET [51]) used a prescribed diffusion process to slowly transform data into\nrandom noise, and then learned to reverse this procedure by training an inverse diffusion. However,\nNET is not very scalable because it requires the diffusion process to have very small steps, and needs\nto simulate chains with thousands of steps at training time.\nPrevious approaches such as Infusion Training (IT [5]) and Variational Walkback (VW [16]) also\nemployed different noise levels/temperatures for training transition operators of a Markov chain.\nBoth IT and VW (as well as NET) train their models by maximizing the evidence lower bound of\na suitable marginal likelihood. In practice, they tend to produce blurry image samples, similar to\nvariational autoencoders. In contrast, our objective is based on score matching instead of likelihood,\nand we can produce images comparable to GANs.\nThere are several structural differences that further distinguish our approach from previous methods\ndiscussed above. First, we do not need to sample from a Markov chain during training. In contrast,\nthe walkback procedure of GSNs needs multiple runs of the chain to generate \u201cnegative samples\u201d.\nOther methods including NET, IT, and VW also need to simulate a Markov chain for every input to\ncompute the training loss. This difference makes our approach more ef\ufb01cient and scalable for training\ndeep models. Secondly, our training and sampling methods are decoupled from each other. For\nscore estimation, both sliced and denoising score matching can be used. For sampling, any method\nbased on scores is applicable, including Langevin dynamics and (potentially) Hamiltonian Monte\nCarlo [38]. 
Our framework allows arbitrary combinations of score estimators and (gradient-based) sampling approaches, whereas most previous methods tie the model to a specific Markov chain. Finally, our approach can be used to train energy-based models (EBMs) by using the gradient of an energy-based model as the score model. In contrast, it is unclear how previous methods that learn transition operators of Markov chains can be directly used for training EBMs.
Score matching was originally proposed for learning EBMs. However, many existing methods based on score matching are either not scalable [24] or fail to produce samples of comparable quality to VAEs or GANs [27, 49]. To obtain better performance when training deep energy-based models, some recent works have resorted to contrastive divergence [21], and proposed to sample with Langevin dynamics for both training and testing [12, 39]. However, unlike our approach, contrastive divergence uses the computationally expensive procedure of Langevin dynamics as an inner loop during training. The idea of combining annealing with denoising score matching has also been investigated in previous work in different contexts. The works [14, 7, 66] propose different annealing schedules for the noise used to train denoising autoencoders. However, that line of work targets representation learning for improving classification performance, rather than generative modeling. The method of denoising score matching can also be derived from the perspective of Bayes least squares [43, 44], using techniques of Stein's Unbiased Risk Estimator [35, 56].

7 Conclusion

We propose the framework of score-based generative modeling, where we first estimate gradients of data densities via score matching, and then generate samples via Langevin dynamics.
We analyze several challenges faced by a naïve application of this approach, and propose to tackle them by training Noise Conditional Score Networks (NCSN) and sampling with annealed Langevin dynamics. Our approach requires no adversarial training, no MCMC sampling during training, and no special model architectures. Experimentally, we show that our approach can generate high quality images that were previously only produced by the best likelihood-based models and GANs. We achieve the new state-of-the-art inception score on CIFAR-10, and an FID score comparable to SNGANs.

Acknowledgements

Toyota Research Institute ("TRI") provided funds to assist the authors with their research but this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity. This research was also supported by NSF (#1651565, #1522054, #1733686), ONR (N00014-19-1-2145), AFOSR (FA9550-19-1-0024).

References

[1] G. Alain, Y. Bengio, L. Yao, J. Yosinski, E. Thibodeau-Laufer, S. Zhang, and P. Vincent. GSNs: generative stochastic networks. Information and Inference, 2016.
[2] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 214–223, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
[3] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, 2003.
[4] Y. Bengio, L. Yao, G. Alain, and P. Vincent. Generalized denoising auto-encoders as generative models. In Advances in Neural Information Processing Systems, pages 899–907, 2013.
[5] F. Bordes, S. Honari, and P. Vincent. Learning to generate samples from noise through infusion training.
arXiv preprint arXiv:1703.06975, 2017.
[6] A. Brock, J. Donahue, and K. Simonyan. Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations, 2019.
[7] B. Chandra and R. K. Sharma. Adaptive noise schedule for denoising autoencoder. In International Conference on Neural Information Processing, pages 535–542. Springer, 2014.
[8] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848, 2017.
[9] T. Chen, E. Fox, and C. Guestrin. Stochastic gradient Hamiltonian Monte Carlo. In International Conference on Machine Learning, pages 1683–1691, 2014.
[10] Z. Dai, Z. Yang, F. Yang, W. W. Cohen, and R. R. Salakhutdinov. Good semi-supervised learning that requires a bad GAN. In Advances in Neural Information Processing Systems, pages 6510–6520, 2017.
[11] L. Dinh, D. Krueger, and Y. Bengio. NICE: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.
[12] Y. Du and I. Mordatch. Implicit generation and generalization in energy-based models. arXiv preprint arXiv:1903.08689, 2019.
[13] V. Dumoulin, J. Shlens, and M. Kudlur. A learned representation for artistic style. In International Conference on Learning Representations, 2017.
[14] K. J. Geras and C. Sutton. Scheduled denoising autoencoders. arXiv preprint arXiv:1406.3269, 2014.
[15] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[16] A. G. A. P. Goyal, N. R. Ke, S. Ganguli, and Y. Bengio. Variational walkback: Learning a transition operator as a stochastic recurrent net.
In Advances in Neural Information Processing Systems, pages 4392–4402, 2017.
[17] A. Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
[18] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pages 5767–5777, 2017.
[19] M. Gutmann and A. Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 297–304, 2010.
[20] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pages 6626–6637, 2017.
[21] G. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002.
[22] J. Ho and S. Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pages 4565–4573, 2016.
[23] X. Huang and S. Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE International Conference on Computer Vision, pages 1501–1510, 2017.
[24] A. Hyvärinen. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(Apr):695–709, 2005.
[25] T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In International Conference on Learning Representations, 2018.
[26] T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial networks. arXiv preprint arXiv:1812.04948, 2018.
[27] D. Kingma and Y. LeCun.
Regularized estimation of image statistics by score matching. In Advances in Neural Information Processing Systems 23: 24th Annual Conference on Neural Information Processing Systems 2010, NIPS 2010, 2010.
[28] D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pages 3581–3589, 2014.
[29] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In ICLR, 2014.
[30] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated annealing. Science, 220(4598):671–680, 1983.
[31] A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.
[32] G. Lin, A. Milan, C. Shen, and I. Reid. RefineNet: Multi-path refinement networks for high-resolution semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1925–1934, 2017.
[33] Q. Liu, J. Lee, and M. Jordan. A kernelized Stein discrepancy for goodness-of-fit tests. In International Conference on Machine Learning, pages 276–284, 2016.
[34] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), 2015.
[35] K. Miyasawa. An empirical Bayes estimator of the mean of a normal population. Bull. Inst. Internat. Statist, 38(181-188):1–2, 1961.
[36] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018.
[37] R. M. Neal. Annealed importance sampling. Statistics and Computing, 11(2):125–139, 2001.
[38] R. M. Neal. MCMC using Hamiltonian dynamics. arXiv preprint arXiv:1206.1901, 2012.
[39] E. Nijkamp, M. Hill, T. Han, S.-C. Zhu, and Y. N. Wu. On the anatomy of MCMC-based maximum likelihood learning of energy-based models.
arXiv preprint arXiv:1903.12370, 2019.
[40] S. Nowozin, B. Cseke, and R. Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pages 271–279, 2016.
[41] G. Ostrovski, M. G. Bellemare, A. van den Oord, and R. Munos. Count-based exploration with neural density models. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 2721–2730. JMLR.org, 2017.
[42] G. Ostrovski, W. Dabney, and R. Munos. Autoregressive quantile networks for generative modeling. In ICML, volume 80 of Proceedings of Machine Learning Research, pages 3933–3942. PMLR, 2018.
[43] M. Raphan and E. P. Simoncelli. Learning to be Bayesian without supervision. In Advances in Neural Information Processing Systems, pages 1145–1152, 2007.
[44] M. Raphan and E. P. Simoncelli. Least squares estimation without priors or supervision. Neural Computation, 23(2):374–420, 2011.
[45] S. Ravuri, S. Mohamed, M. Rosca, and O. Vinyals. Learning implicit generative models with the method of learned moments. In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4314–4323, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR.
[46] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI), volume 9351 of LNCS, pages 234–241. Springer, 2015. (Available on arXiv:1505.04597 [cs.CV].)
[47] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.
[48] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs.
In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.
[49] S. Saremi, A. Mehrjou, B. Schölkopf, and A. Hyvärinen. Deep energy estimator networks. arXiv preprint arXiv:1805.08306, 2018.
[50] J. Sohl-Dickstein, P. Battaglino, and M. R. DeWeese. Minimum probability flow learning. arXiv preprint arXiv:0906.4779, 2009.
[51] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265, 2015.
[52] J. Song, S. Zhao, and S. Ermon. A-NICE-MC: Adversarial training for MCMC. In Advances in Neural Information Processing Systems, pages 5140–5150, 2017.
[53] Y. Song, S. Garg, J. Shi, and S. Ermon. Sliced score matching: A scalable approach to density and score estimation. In Proceedings of the Thirty-Fifth Conference on Uncertainty in Artificial Intelligence, UAI 2019, Tel Aviv, Israel, July 22-25, 2019, page 204, 2019.
[54] Y. Song, T. Kim, S. Nowozin, S. Ermon, and N. Kushman. PixelDefend: Leveraging generative models to understand and defend against adversarial examples. In International Conference on Learning Representations, 2018.
[55] B. K. Sriperumbudur, K. Fukumizu, A. Gretton, B. Schölkopf, and G. R. Lanckriet. On integral probability metrics, φ-divergences and binary classification. arXiv preprint arXiv:0901.2698, 2009.
[56] C. M. Stein. Estimation of the mean of a multivariate normal distribution. The Annals of Statistics, pages 1135–1151, 1981.
[57] D. Ulyanov, A. Vedaldi, and V. Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016.
[58] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint, 2016.
[59] A. van den Oord, N.
Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves, et al. Conditional image generation with PixelCNN decoders. In Advances in Neural Information Processing Systems, pages 4790–4798, 2016.
[60] A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu. Pixel recurrent neural networks. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML'16, pages 1747–1756. JMLR.org, 2016.
[61] P. Vincent. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661–1674, 2011.
[62] M. Welling and Y. W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681–688, 2011.
[63] L. Wenliang, D. Sutherland, H. Strathmann, and A. Gretton. Learning deep kernels for exponential family densities. In International Conference on Machine Learning, pages 6737–6746, 2019.
[64] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. In International Conference on Learning Representations (ICLR), 2016.
[65] F. Yu, V. Koltun, and T. Funkhouser. Dilated residual networks. In Computer Vision and Pattern Recognition (CVPR), 2017.
[66] Q. Zhang and L. Zhang. Convolutional adaptive denoising autoencoders for hierarchical feature extraction. Frontiers of Computer Science, 12(6):1140–1148, 2018.