{"title": "Generalized Denoising Auto-Encoders as Generative Models", "book": "Advances in Neural Information Processing Systems", "page_first": 899, "page_last": 907, "abstract": "Recent work has shown how denoising and contractive autoencoders implicitly capture the structure of the data generating density, in the case where the corruption noise is Gaussian, the reconstruction error is the squared error, and the data is continuous-valued. This has led to various proposals for sampling from this implicitly learned density function, using Langevin and Metropolis-Hastings MCMC. However, it remained unclear how to connect the training procedure of regularized auto-encoders to the implicit estimation of the underlying data generating distribution when the data are discrete, or using other forms of corruption process and reconstruction errors. Another issue is the mathematical justification which is only valid in the limit of small corruption noise. We propose here a different attack on the problem, which deals with all these issues: arbitrary (but noisy enough) corruption, arbitrary reconstruction loss (seen as a log-likelihood), handling both discrete and continuous-valued variables, and removing the bias due to non-infinitesimal corruption noise (or non-infinitesimal contractive penalty).", "full_text": "Generalized Denoising Auto-Encoders as Generative\n\nModels\n\nYoshua Bengio, Li Yao, Guillaume Alain, and Pascal Vincent\n\nD\u00b4epartement d\u2019informatique et recherche op\u00b4erationnelle, Universit\u00b4e de Montr\u00b4eal\n\nAbstract\n\nRecent work has shown how denoising and contractive autoencoders implicitly\ncapture the structure of the data-generating density, in the case where the cor-\nruption noise is Gaussian, the reconstruction error is the squared error, and the\ndata is continuous-valued. This has led to various proposals for sampling from\nthis implicitly learned density function, using Langevin and Metropolis-Hastings\nMCMC. 
However, it remained unclear how to connect the training procedure of regularized auto-encoders to the implicit estimation of the underlying data-generating distribution when the data are discrete, or using other forms of corruption process and reconstruction errors. Another issue is the mathematical justification which is only valid in the limit of small corruption noise. We propose here a different attack on the problem, which deals with all these issues: arbitrary (but noisy enough) corruption, arbitrary reconstruction loss (seen as a log-likelihood), handling both discrete and continuous-valued variables, and removing the bias due to non-infinitesimal corruption noise (or non-infinitesimal contractive penalty).\n\n1 Introduction\nAuto-encoders learn an encoder function from input to representation and a decoder function back from representation to input space, such that the reconstruction (composition of encoder and decoder) is good for training examples. Regularized auto-encoders also involve some form of regularization that prevents the auto-encoder from simply learning the identity function, so that reconstruction error will be low at training examples (and hopefully at test examples) but high in general. Different variants of auto-encoders and sparse coding have been, along with RBMs, among the most successful building blocks in recent research in deep learning (Bengio et al., 2013b). Whereas the usefulness of auto-encoder variants as feature learners for supervised learning can directly be assessed by performing supervised learning experiments with unsupervised pre-training, what has remained until recently rather unclear is the interpretation of these algorithms in the context of pure unsupervised learning, as devices to capture the salient structure of the input data distribution. Whereas the answer is clear for RBMs, it is less obvious for regularized auto-encoders. 
Do they completely characterize the input distribution or only some aspect of it? For example, clustering algorithms such as k-means only capture the modes of the distribution, while manifold learning algorithms characterize the low-dimensional regions where the density concentrates.\nSome of the first ideas about the probabilistic interpretation of auto-encoders were proposed by Ranzato et al. (2008): they were viewed as approximating an energy function through the reconstruction error, i.e., being trained to have low reconstruction error at the training examples and high reconstruction error elsewhere (through the regularizer, e.g., sparsity or otherwise, which prevents the auto-encoder from learning the identity function). An important breakthrough then came, yielding a first formal probabilistic interpretation of regularized auto-encoders as models of the input distribution, with the work of Vincent (2011). This work showed that some denoising auto-encoders (DAEs) correspond to a Gaussian RBM and that minimizing the denoising reconstruction error (as a squared error) estimates the energy function through a regularized form of score matching, with the regularization disappearing as the amount of corruption noise goes to 0, and then converging to the same solution as score matching (Hyv\u00e4rinen, 2005). This connection and its generalization to other energy functions, giving rise to the general denoising score matching training criterion, is discussed in several other papers (Kingma and LeCun, 2010; Swersky et al., 2011; Alain and Bengio, 2013). Another breakthrough has been the development of an empirically successful sampling algorithm for contractive auto-encoders (Rifai et al., 2012), which basically involves composing encoding, decoding, and noise addition steps. 
This algorithm is motivated by the observation that the Jacobian\nmatrix (of derivatives) of the encoding function provides an estimator of a local Gaussian approxi-\nmation of the density, i.e., the leading singular vectors of that matrix span the tangent plane of the\nmanifold near which the data density concentrates. However, a formal justi\ufb01cation for this algorithm\nremains an open problem.\nThe last step in this development (Alain and Bengio, 2013) generalized the result from Vincent\n(2011) by showing that when a DAE (or a contractive auto-encoder with the contraction on the whole\nencode/decode reconstruction function) is trained with small Gaussian corruption and squared error\nloss, it estimates the score (derivative of the log-density) of the underlying data-generating distri-\nbution, which is proportional to the difference between reconstruction and input. This result does\nnot depend on the parametrization of the auto-encoder, but suffers from the following limitations: it\napplies to one kind of corruption (Gaussian), only to continuous-valued inputs, only for one kind of\nloss (squared error), and it becomes valid only in the limit of small noise (even though in practice,\nbest results are obtained with large noise levels, comparable to the range of the input).\nWhat we propose here is a different probabilistic interpretation of DAEs, which is valid for any data\ntype, any corruption process (so long as it has broad enough support), and any reconstruction loss\n(so long as we can view it as a log-likelihood).\nThe basic idea is that if we corrupt observed random variable X into \u02dcX using conditional distribution\nC( \u02dcX|X), we are really training the DAE to estimate the reverse conditional P (X| \u02dcX). 
Combining\nthis estimator with the known C( \u02dcX|X), we show that we can recover a consistent estimator of\nP (X) through a Markov chain that alternates between sampling from P (X| \u02dcX) and sampling from\nC( \u02dcX|X), i.e., encode/decode, sample from the reconstruction distribution model P (X| \u02dcX), apply\nthe stochastic corruption procedure C( \u02dcX|X), and iterate.\nThis theoretical result is validated through experiments on arti\ufb01cial data in a non-parametric setting\nand experiments on real data in a parametric setting (with neural net DAEs). We \ufb01nd that we can\nimprove the sampling behavior by using the model itself to de\ufb01ne the corruption process, yielding a\ntraining procedure that has some surface similarity to the contrastive divergence algorithm (Hinton,\n1999; Hinton et al., 2006).\n\nAlgorithm 1 THE GENERALIZED DENOISING AUTO-ENCODER TRAINING ALGORITHM requires\na training set or training distribution D of examples X, a given corruption process C( \u02dcX|X) from\nwhich one can sample, and with which one trains a conditional distribution P\u03b8(X| \u02dcX) from which\none can sample.\n\nrepeat\n\n\u2022 sample training example X \u223c D\n\u2022 sample corrupted input \u02dcX \u223c C( \u02dcX|X)\n\u2022 use (X, \u02dcX) as an additional training example towards minimizing the expected value of\n\u2212 log P\u03b8(X| \u02dcX), e.g., by a gradient step with respect to \u03b8.\n\nuntil convergence of training (e.g., as measured by early stopping on out-of-sample negative log-\nlikelihood)\n\n2 Generalizing Denoising Auto-Encoders\n2.1 De\ufb01nition and Training\nLet P(X) be the data-generating distribution over observed random variable X. Let C be a given\ncorruption process that stochastically maps an X to a \u02dcX through conditional distribution C( \u02dcX|X).\nThe training data for the generalized denoising auto-encoder is a set of pairs (X, \u02dcX) with X \u223c\nP(X) and \u02dcX \u223c C( \u02dcX|X). 
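The inner loop of Algorithm 1 is easy to state in code. The sketch below is illustrative rather than the paper's implementation: the toy two-prototype data, the bit-flip corruption C(X~|X), and the factorized-Bernoulli P_theta(X|X~) with affine logits are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy input dimension

def corrupt(x, flip_prob=0.3):
    """Corruption process C(X~|X): flip each bit independently."""
    flips = rng.random(x.shape) < flip_prob
    return np.where(flips, 1 - x, x)

# Toy data distribution P(X): two prototype patterns plus a little bit noise.
prototypes = np.array([[1, 1, 1, 1, 0, 0, 0, 0],
                       [0, 0, 0, 0, 1, 1, 1, 1]], dtype=float)

def sample_data():
    return corrupt(prototypes[rng.integers(0, 2)], flip_prob=0.05)

# P_theta(X | X~): factorized Bernoulli whose logits are affine in X~.
W, b = np.zeros((d, d)), np.zeros(d)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(x, x_tilde, lr=0.1):
    """One gradient step on -log P_theta(X | X~) (binary cross-entropy)."""
    global W, b
    p = sigmoid(x_tilde @ W + b)
    grad = p - x                      # gradient of the loss w.r.t. the logits
    W -= lr * np.outer(x_tilde, grad)
    b -= lr * grad
    return -np.sum(x * np.log(p + 1e-12) + (1 - x) * np.log(1 - p + 1e-12))

losses = []
for step in range(2000):              # the repeat-until-convergence loop
    x = sample_data()                 # sample training example X ~ D
    x_tilde = corrupt(x)              # sample corrupted input X~ ~ C(X~|X)
    losses.append(train_step(x, x_tilde))
```

In the real algorithm convergence would be monitored by early stopping on out-of-sample negative log-likelihood; here the running loss simply decreases from its starting value of d*log 2 as the denoiser learns.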
The DAE is trained to predict X given \u02dcX through a learned conditional distribution P\u03b8(X| \u02dcX), by choosing this conditional distribution within some family of distributions indexed by \u03b8, not necessarily a neural net. The training procedure for the DAE can generally be formulated as learning to predict X given \u02dcX by possibly regularized maximum likelihood, i.e., the generalization performance that this training criterion attempts to minimize is\n\nL(\u03b8) = \u2212E[log P\u03b8(X| \u02dcX)] (1)\n\nwhere the expectation is taken over the joint data-generating distribution\n\nP(X, \u02dcX) = P(X)C( \u02dcX|X). (2)\n\n2.2 Sampling\nWe define the following pseudo-Gibbs Markov chain associated with P\u03b8:\n\nXt \u223c P\u03b8(X| \u02dcXt\u22121)\n\u02dcXt \u223c C( \u02dcX|Xt) (3)\n\nwhich can be initialized from an arbitrary choice X0. This is the process by which we are going to generate samples Xt according to the model implicitly learned by choosing \u03b8. We define T (Xt|Xt\u22121) to be the transition operator that defines a conditional distribution for Xt given Xt\u22121, independently of t, so that the sequence of Xt\u2019s forms a homogeneous Markov chain. If the asymptotic marginal distribution of the Xt\u2019s exists, we call this distribution \u03c0(X), and we show below that it consistently estimates P(X).\nNote that the above chain is not a proper Gibbs chain in general because there is no guarantee that P\u03b8(X| \u02dcXt\u22121) and C( \u02dcX|Xt) are consistent with a unique joint distribution. In that respect, the situation is similar to the sampling procedure for dependency networks (Heckerman et al., 2000), in that the pairs (Xt, \u02dcXt\u22121) are not guaranteed to have the same asymptotic distribution as the pairs (Xt, \u02dcXt) as t \u2192 \u221e. 
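The pseudo-Gibbs chain of Eq. 3 can be simulated directly in a small discrete case. In the sketch below, all specifics (a 10-state distribution, a nearest-neighbour corruption on a ring) are illustrative assumptions, and the exact conditional P(X|X~) proportional to P(X)C(X~|X) stands in for a perfectly trained P_theta; the empirical marginal of the chain then approaches P(X):

```python
import numpy as np

rng = np.random.default_rng(1)
K = 10                                   # X takes values 0..9
P = np.arange(1.0, K + 1.0)
P /= P.sum()                             # target data distribution P(X)

# Corruption C(X~|X): move to a uniformly chosen neighbour on a ring, or stay.
C = np.zeros((K, K))
for x in range(K):
    for d in (-1, 0, 1):
        C[x, (x + d) % K] += 1.0 / 3.0

# Exact denoiser P(X|X~) ~ P(X) C(X~|X), standing in for a trained P_theta.
post = P[:, None] * C                    # post[x, xt] before normalization
post /= post.sum(axis=0)                 # normalize over x for each value of X~

x = 0
counts = np.zeros(K)
for t in range(60000):
    x_tilde = rng.choice(K, p=C[x])        # X~_t ~ C(.|X_t)
    x = rng.choice(K, p=post[:, x_tilde])  # X_{t+1} ~ P(.|X~_t)
    counts[x] += 1
pi = counts / counts.sum()               # empirical marginal of the chain
```

The gap `np.abs(pi - P).sum()` comes out on the order of 10^-2, i.e. the chain's asymptotic marginal matches P(X), which is the behavior Theorem 1 below establishes in general.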
As a follow-up to the results in the next section, it is shown in Bengio et al. (2013a) that dependency networks can be cast into the same framework (which is that of Generative Stochastic Networks), and that if the Markov chain is ergodic, then its stationary distribution will define a joint distribution between the random variables (here that would be X and \u02dcX), even if the conditionals are not consistent with it.\n2.3 Consistency\nNormally we only have access to a finite number n of training examples but as n \u2192 \u221e, the empirical training distribution approaches the data-generating distribution. To compensate for the finite training set, we generally introduce a (possibly data-dependent) regularizer \u2126 and the actual training criterion is a sum over n training examples (X, \u02dcX),\n\nLn(\u03b8) = (1/n) \u2211_{X\u223cP(X), \u02dcX\u223cC( \u02dcX|X)} [\u03bbn\u2126(\u03b8, X, \u02dcX) \u2212 log P\u03b8(X| \u02dcX)] (4)\n\nwhere we allow the regularization coefficient \u03bbn to be chosen according to the number of training examples n, with \u03bbn \u2192 0 as n \u2192 \u221e. With \u03bbn \u2192 0 we get that Ln \u2192 L (i.e. converges to generalization error, Eq. 1), so consistent estimators of P(X| \u02dcX) stay consistent. We define \u03b8n to be the minimizer of Ln(\u03b8) when given n training examples.\nWe define Tn to be the transition operator Tn(Xt|Xt\u22121) = \u222b P\u03b8n(Xt| \u02dcX)C( \u02dcX|Xt\u22121)d \u02dcX associated with \u03b8n (the parameter obtained by minimizing the training criterion with n examples), and define \u03c0n to be the asymptotic distribution of the Markov chain generated by Tn (if it exists). We also define T to be the operator of the Markov chain associated with the learned model as n \u2192 \u221e.\nTheorem 1. 
If P\u03b8n(X| \u02dcX) is a consistent estimator of the true conditional distribution P(X| \u02dcX) and Tn defines an ergodic Markov chain, then as the number of examples n \u2192 \u221e, the asymptotic distribution \u03c0n(X) of the generated samples converges to the data-generating distribution P(X).\nProof. If Tn is ergodic, then the Markov chain converges to a \u03c0n. Based on our definition of the \u201ctrue\u201d joint (Eq. 2), one obtains a conditional P(X| \u02dcX) \u221d P(X)C( \u02dcX|X). This conditional, along with P( \u02dcX|X) = C( \u02dcX|X), can be used to define a proper Gibbs chain where one alternately samples from P( \u02dcX|X) and from P(X| \u02dcX). Let T be the corresponding \u201ctrue\u201d transition operator, which maps the t-th sample X to the t + 1-th in that chain. That is, T (Xt|Xt\u22121) = \u222b P(Xt| \u02dcX)C( \u02dcX|Xt\u22121)d \u02dcX. T produces P(X) as asymptotic marginal distribution over X (as we consider more samples from the chain) simply because P(X) is the marginal distribution of the joint P(X)C( \u02dcX|X) to which the chain converges. By hypothesis we have that P\u03b8n(X| \u02dcX) \u2192 P(X| \u02dcX) as n \u2192 \u221e. Note that Tn is defined exactly as T but with P(Xt| \u02dcX) replaced by P\u03b8n(X| \u02dcX). Hence Tn \u2192 T as n \u2192 \u221e.\nNow let us convert the convergence of Tn to T into the convergence of \u03c0n(X) to P(X). We will exploit the fact that for the 2-norm, a matrix M and a unit vector v, ||M v||2 \u2264 sup||x||2=1 ||M x||2 = ||M||2. Consider M = T \u2212 Tn and v the principal eigenvector of T , which, by the Perron-Frobenius theorem, corresponds to the asymptotic distribution P(X). Since Tn \u2192 T , ||T \u2212 Tn||2 \u2192 0. Hence ||(T \u2212 Tn)v||2 \u2264 ||T \u2212 Tn||2 \u2192 0, which implies that Tnv \u2192 T v = v, where the last equality comes from the Perron-Frobenius theorem (the leading eigenvalue is 1). 
Since Tnv \u2192 v, v becomes the leading eigenvector of Tn, i.e., the asymptotic distribution of the Markov chain: \u03c0n(X) converges to the true data-generating distribution, P(X), as n \u2192 \u221e.\nHence the asymptotic sampling distribution associated with the Markov chain defined by Tn (i.e., the model) implicitly defines the distribution \u03c0n(X) learned by the DAE over the observed variable X. Furthermore, that estimator of P(X) is consistent so long as our (regularized) maximum likelihood estimator of the conditional P\u03b8(X| \u02dcX) is also consistent. We now provide sufficient conditions for the ergodicity of the chain operator (i.e., to apply theorem 1).\nCorollary 1. If P\u03b8(X| \u02dcX) is a consistent estimator of the true conditional distribution P(X| \u02dcX), and both the data-generating distribution and denoising model are contained in and non-zero in a finite-volume region V (i.e., \u2200 \u02dcX, \u2200X \u2209 V, P(X) = 0, P\u03b8(X| \u02dcX) = 0), and \u2200 \u02dcX, \u2200X \u2208 V, P(X) > 0, P\u03b8(X| \u02dcX) > 0, C( \u02dcX|X) > 0, and these statements remain true in the limit of n \u2192 \u221e, then the asymptotic distribution \u03c0n(X) of the generated samples converges to the data-generating distribution P(X).\nProof. To obtain the existence of a stationary distribution, it is sufficient to have irreducibility (every value reachable from every other value), aperiodicity (no cycle such that a return to some value is only possible through paths within the cycle), and recurrence (probability 1 of returning eventually). These conditions can be generalized to the continuous case, where we obtain ergodic Harris chains rather than ergodic Markov chains. 
If P\u03b8(X| \u02dcX) > 0 and C( \u02dcX|X) > 0 (for X \u2208 V ), then Tn(Xt|Xt\u22121) > 0 as well, because\n\nTn(Xt|Xt\u22121) = \u222b P\u03b8(Xt| \u02dcX)C( \u02dcX|Xt\u22121)d \u02dcX.\n\nThis positivity of the transition operator guarantees that one can jump from any point in V to any other point in one step, thus yielding irreducibility and aperiodicity. To obtain recurrence (preventing the chain from diverging to infinity), we rely on the assumption that the domain V is bounded. Note that although Tn(Xt|Xt\u22121) > 0 could be true for any finite n, we need this condition to hold for n \u2192 \u221e as well, to obtain the consistency result of theorem 1. By assuming this positivity (Boltzmann distribution) holds for the data-generating distribution, we make sure that \u03c0n does not converge to a distribution which puts 0\u2019s anywhere in V . Having satisfied all the conditions for the existence of a stationary distribution for Tn as n \u2192 \u221e, we can apply theorem 1 and obtain its conclusion.\nNote how these conditions take care of the various troubling cases one could think of. We avoid the case where there is no corruption (which would yield a wrong estimation, with the DAE simply learning a Dirac probability at its input). 
Second, we avoid the case where the chain wanders to infinity by assuming a finite volume where the model and data live, a real concern in the continuous case. If it became a real issue, we could perform rejection sampling to make sure that P (X| \u02dcX) produces X \u2208 V .\n2.4 Locality of the Corruption and Energy Function\nIf we believe that P (X| \u02dcX) is well estimated for all (X, \u02dcX) pairs, i.e., that it is approximately consistent with C( \u02dcX|X), then we get as many estimators of the energy function as we want, by picking a particular value of \u02dcX.\nLet us define the notation P (\u00b7) to denote the probability of the joint, marginals or conditionals over the pairs (Xt, \u02dcXt\u22121) that are produced by the model\u2019s Markov chain T as t \u2192 \u221e. So P (X) = \u03c0(X) is the asymptotic distribution of the Markov chain T , and P ( \u02dcX) the marginal over the \u02dcX\u2019s in that chain. The above assumption means that P ( \u02dcXt\u22121|Xt) \u2248 C( \u02dcXt\u22121|Xt) (which is not guaranteed in general, but only asymptotically as P approaches the true P). Then, by Bayes rule,\n\nP (X) = P (X| \u02dcX)P ( \u02dcX) / P ( \u02dcX|X) \u2248 P (X| \u02dcX)P ( \u02dcX) / C( \u02dcX|X) \u221d P (X| \u02dcX) / C( \u02dcX|X),\n\nso that we can get an estimated energy function from any given choice of \u02dcX through energy(X) \u2248 \u2212 log P (X| \u02dcX) + log C( \u02dcX|X), where one should note that the intractable partition function depends on the chosen value of \u02dcX.\nHow much can we trust that estimator and how should \u02dcX be chosen? First note that P (X| \u02dcX) has only been trained for pairs (X, \u02dcX) for which \u02dcX is relatively close to X (assuming that the corruption is indeed changing X generally into some neighborhood). 
Hence, although in theory (with infinite amount of data and capacity) the above estimator should be good, in practice it might be poor when X is far from \u02dcX. So if we pick a particular \u02dcX the estimated energy might be good for X in the neighborhood of \u02dcX but poor elsewhere. What we could do, though, is use a different approximate energy function in different regions of the input space. Hence the above estimator gives us a way to compare the probabilities of nearby points X1 and X2 (through their difference in energy), picking for example a midpoint \u02dcX = (X1 + X2)/2. One could also imagine that if X1 and XN are far apart, we could chart a path between X1 and XN with intermediate points Xk and use an estimator of the relative energies between the neighbors Xk, Xk+1, add them up, and obtain an estimator of the relative energy between X1 and XN .\n\nFigure 1: Although P(X) may be complex and multi-modal, P(X| \u02dcX) is often simple and approximately unimodal (e.g., multivariate Gaussian, pink oval) for most values of \u02dcX when C( \u02dcX|X) is a local corruption. P(X) can be seen as an infinite mixture of these local distributions (weighted by P( \u02dcX)).\n\nThis brings up an interesting point. If we could always obtain a good estimator P (X| \u02dcX) for any \u02dcX, we could just train the model with C( \u02dcX|X) = C( \u02dcX), i.e., with an unconditional noise process that ignores X. In that case, the estimator P (X| \u02dcX) would directly equal P (X) since \u02dcX and X are actually sampled independently in its \u201cdenoising\u201d training data. We would have gained nothing over just training a probabilistic model directly on the observed X\u2019s. The gain we expect from using the denoising framework is that if \u02dcX is a local perturbation of X, then the true P(X| \u02dcX) can be well approximated by a much simpler distribution than P(X). 
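The midpoint relative-energy estimator can be checked in a toy case where everything is available in closed form. Assuming (for illustration only) a standard normal P(X) and Gaussian corruption, the exact Gaussian posterior stands in for a well-trained P(X|X~), and the energy difference between two nearby points matches the true log-density difference:

```python
import math

sigma2 = 0.25  # variance of the Gaussian corruption C(X~|X) = N(X, sigma2)

def log_gauss(x, mean, var):
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def log_C(x_tilde, x):
    return log_gauss(x_tilde, x, sigma2)

def log_post(x, x_tilde):
    # Exact P(X|X~) when P(X) = N(0, 1): a Gaussian posterior.
    return log_gauss(x, x_tilde / (1 + sigma2), sigma2 / (1 + sigma2))

def energy(x, x_tilde):
    # The estimator from Section 2.4: -log P(X|X~) + log C(X~|X).
    return -log_post(x, x_tilde) + log_C(x_tilde, x)

x1, x2 = 0.3, 0.9
x_mid = 0.5 * (x1 + x2)                   # midpoint used as the shared X~
delta_hat = energy(x1, x_mid) - energy(x2, x_mid)
delta_true = -log_gauss(x1, 0, 1) + log_gauss(x2, 0, 1)
# delta_hat equals delta_true: the X~-dependent partition term cancels
# when two points are compared through the same X~.
```

Here the two quantities agree exactly because the posterior is exact; with a learned P(X|X~), the agreement holds only near the training pairs, as discussed above.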
See Figure 1 for a visual explanation: in the limit of very small perturbations, one could even assume that P(X| \u02dcX) can be well approximated by a simple unimodal distribution such as the Gaussian (for continuous data) or factorized binomial (for discrete binary data) commonly used in DAEs as the reconstruction probability function (conditioned on \u02dcX). This idea is already behind the non-local manifold Parzen windows (Bengio et al., 2006a) and non-local manifold tangent learning (Bengio et al., 2006b) algorithms: the local density around a point \u02dcX can be approximated by a multivariate Gaussian whose covariance matrix has leading eigenvectors that span the local tangent of the manifold near which the data concentrates (if it does). The idea of a locally Gaussian approximation of a density with a manifold structure is also exploited in the more recent work on the contractive auto-encoder (Rifai et al., 2011) and associated sampling procedures (Rifai et al., 2012). Finally, strong theoretical evidence in favor of that idea comes from the result of Alain and Bengio (2013): when the amount of corruption noise converges to 0 and the input variables have a smooth continuous density, then a unimodal Gaussian reconstruction density suffices to fully capture the joint distribution.\nHence, although P (X| \u02dcX) encapsulates all information about P(X) (assuming C given), it will generally have far fewer non-negligible modes, making it easier to approximate. 
This can be seen analytically by considering the case where P(X) is a mixture of many Gaussians and the corruption is a local Gaussian: P (X| \u02dcX) remains a Gaussian mixture, but one for which most of the modes have become negligible (Alain and Bengio, 2013). We return to this in Section 3, suggesting that in order to avoid spurious modes, it is better to have non-infinitesimal corruption, allowing faster mixing and successful burn-in not pulled by spurious modes far from the data.\n\nFigure 2: Walkback samples get attracted by spurious modes and contribute to removing them. Segment of data manifold in violet and example walkback path in red dotted line, starting on the manifold and going towards a spurious attractor. The vector field represents expected moves of the chain, for a unimodal P (X| \u02dcX), with arrows from \u02dcX to X.\n\n3 Reducing the Spurious Modes with Walkback Training\nSampling in high-dimensional spaces (like in the experiments below) using a simple local corruption process (such as Gaussian or salt-and-pepper noise) suggests that if the corruption is too local, the DAE\u2019s behavior far from the training examples can create spurious modes in the regions insufficiently visited during training. More training iterations or increasing the amount of corruption noise helps to substantially alleviate that problem, but we discovered an even bigger boost by training the DAE Markov chain to walk back towards the training examples (see Figure 2). We exploit knowledge of the currently learned model P (X| \u02dcX) to define the corruption, so as to pick values of \u02dcX that would be obtained by following the generative chain: wherever the model would go if we sampled using the generative Markov chain starting at a training example X, we consider it to be a kind of \u201cnegative example\u201d \u02dcX from which the auto-encoder should move away (and towards X). 
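The walk-away-then-learn-to-walk-back corruption just described can be sketched generically as a small helper; Algorithm 2 below gives the precise procedure, and the toy corruption and stand-in model in the usage example are illustrative assumptions only.

```python
import random

def walkback_corrupt(x, corrupt, sample_model, p=0.5, rng=random.Random(0)):
    """Walkback corruption: alternate the ordinary corruption C(X~|X) with
    sampling from the current model P(X|X~), returning every corrupted
    point visited.  During training, each returned X~* is paired with the
    original X as an extra example (X, X~*).  The walk length is geometric:
    the walk continues with probability p at each step."""
    x_star = x
    visited = []
    while True:
        x_tilde = corrupt(x_star)          # X~* ~ C(.|X*)
        visited.append(x_tilde)
        if rng.random() > p:               # stop with probability 1 - p
            return visited
        x_star = sample_model(x_tilde)     # X* ~ P(.|X~*): follow the chain

# Toy usage: integers, with a stand-in "model" that pulls back toward 0.
walk = walkback_corrupt(5,
                        corrupt=lambda z: z + 1,
                        sample_model=lambda z: z - 2)
```

Training then proceeds exactly as in Algorithm 1, but every element of the returned list is used to form an additional training pair with the starting X.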
The spirit of this procedure is thus very similar to the CD-k (Contrastive Divergence with k MCMC steps) procedure proposed to train RBMs (Hinton, 1999; Hinton et al., 2006).\nMore precisely, the modified corruption process \u02dcC we propose is the following, based on the original corruption process C. We use it in a version of the training algorithm called walkback, where we replace the corruption process C of Algorithm 1 by the walkback process \u02dcC of Algorithm 2. This also provides extra training examples (taking advantage of the \u02dcX samples generated along the walk away from X). It is called walkback because it forces the DAE to learn to walk back from the random walk it generates, towards the X\u2019s in the training set.\nAlgorithm 2: THE WALKBACK ALGORITHM is based on the walkback corruption process \u02dcC( \u02dcX|X), defined below in terms of a generic original corruption process C( \u02dcX|X) and the current model\u2019s reconstruction conditional distribution P (X| \u02dcX). For each training example X, it provides a sequence of additional training examples (X, \u02dcX\u2217) for the DAE. It has a hyper-parameter that is a geometric distribution parameter 0 < p < 1 controlling the length of these walks away from X, with p = 0.5 by default. Training by Algorithm 1 is the same, but using all \u02dcX\u2217 in the returned list L to form the pairs (X, \u02dcX\u2217) as training examples instead of just (X, \u02dcX).\n1: X\u2217 \u2190 X, L \u2190 [ ]\n2: Sample \u02dcX\u2217 \u223c C( \u02dcX|X\u2217)\n3: Sample u \u223c Uniform(0, 1)\n4: if u > p then\n5: Append \u02dcX\u2217 to L and return L\n6: If during training, append \u02dcX\u2217 to L, so (X, \u02dcX\u2217) will be an additional training example.\n7: Sample X\u2217 \u223c P (X| \u02dcX\u2217)\n8: goto 2.\n\nProposition 1. 
Let P (X) be the implicitly defined asymptotic distribution of the Markov chain alternating sampling from P (X| \u02dcX) and C( \u02dcX|X), where C is the original local corruption process. Under the assumptions of corollary 1, minimizing the training criterion in the walkback training algorithm for generalized DAEs (combining Algorithms 1 and 2) produces a P (X) that is a consistent estimator of the data-generating distribution P(X).\nProof. Consider that during training, we produce a sequence of estimators Pk(X| \u02dcX) where Pk corresponds to the k-th training iteration (modifying the parameters after each iteration). With the walkback algorithm, Pk\u22121 is used to obtain the corrupted samples \u02dcX from which the next model Pk is produced. If training converges, Pk \u2248 Pk+1 = P and we can then consider the whole corruption process \u02dcC fixed. By corollary 1, the Markov chain obtained by alternating samples from P (X| \u02dcX) and samples from \u02dcC( \u02dcX|X) converges to an asymptotic distribution P (X) which estimates the underlying data-generating distribution P(X). The walkback corruption \u02dcC( \u02dcX|X) corresponds to a few steps alternating sampling from C( \u02dcX|X) (the fixed local corruption) and sampling from P (X| \u02dcX). Hence the overall sequence when using \u02dcC can be seen as a Markov chain obtained by alternately sampling from C( \u02dcX|X) and from P (X| \u02dcX), just as it was when using merely C. Hence, once the model is trained with walkback, one can sample from it using corruption C( \u02dcX|X).\nA consequence is that the walkback training algorithm estimates the same distribution as the original denoising algorithm, but may do it more efficiently (as we observe in the experiments), by exploring the space of corruptions in a way that spends more time where it most helps the model.\n4 Experimental Validation\nNon-parametric case. 
The mathematical results presented here apply to any denoising training criterion where the reconstruction loss can be interpreted as a negative log-likelihood. This remains true whether or not the denoising machine P (X| \u02dcX) is parametrized as the composition of an encoder and decoder. This is also true of the asymptotic estimation results in Alain and Bengio (2013). We experimentally validate the above theorems in a case where the asymptotic limit (of enough data and enough capacity) can be reached, i.e., in a low-dimensional non-parametric setting. Fig. 3 shows the distribution recovered by the Markov chain for discrete data with only 10 different values. The conditional P (X| \u02dcX) was estimated by multinomial models and maximum likelihood (counting) from 5000 training examples. 5000 samples were generated from the chain to estimate the asymptotic distribution \u03c0n(X). For continuous data, Figure 3 also shows the result of 5000 generated samples and 500 original training examples with X \u2208 R10, with scatter plots of pairs of dimensions. The estimator is also non-parametric (Parzen density estimator of P (X| \u02dcX)).\n\nFigure 3: Top left: histogram of a data-generating distribution (true, blue), the empirical distribution (red), and the estimated distribution using a denoising maximum likelihood estimator. Other figures: pairs of variables (out of 10) showing the training samples and the model-generated samples.\n\nMNIST digits. We trained a DAE on the binarized MNIST data (thresholding at 0.5). A Theano1 (Bergstra et al., 2010) implementation is available2. The 784-2000-784 auto-encoder is trained for 200 epochs with the 50000 training examples and salt-and-pepper noise (probability 0.5 of corrupting each bit, setting it to 1 or 0 with probability 0.5). It has 2000 tanh hidden units and is trained by minimizing cross-entropy loss, i.e., maximum likelihood on a factorized Bernoulli reconstruction distribution. 
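The salt-and-pepper corruption used for this experiment can be sketched as follows (the stand-in binary image and the seed are illustrative assumptions):

```python
import numpy as np

def salt_and_pepper(x, corrupt_prob=0.5, rng=np.random.default_rng(0)):
    """Each bit is corrupted with probability corrupt_prob; a corrupted bit
    is set to 1 or 0 with probability 0.5, independently of its old value."""
    mask = rng.random(x.shape) < corrupt_prob
    noise = (rng.random(x.shape) < 0.5).astype(x.dtype)
    return np.where(mask, noise, x)

x = (np.arange(784) % 2).astype(np.float64)   # stand-in binarized image
x_tilde = salt_and_pepper(x)
# Only about a quarter of the bits actually change, since half of the
# corrupted bits land back on their old value by chance.
```

Note that this corruption has full support over binary vectors, which is what Corollary 1 asks of C(X~|X).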
With walkback training, a chain of 5 steps was used to generate 5 corrupted examples for each training example. Figure 4 shows samples generated with and without walkback. The quality of the samples was also estimated quantitatively by measuring the log-likelihood of the test set under a non-parametric density estimator \u02c6P (x) = mean_{\u02dcX} P (x| \u02dcX) constructed from 10000 consecutively generated samples ( \u02dcX from the Markov chain). The expected value E[ \u02c6P (x)] over the samples can be shown (Bengio and Yao, 2013) to be a lower bound (i.e. a conservative estimate) of the true (implicit) model density P (x). The test set log-likelihood bound was not used to select among model architectures, but visual inspection of samples generated did guide the preliminary search reported here. Optimization hyper-parameters (learning rate, momentum, and learning rate reduction schedule) were selected based on the training objective. We compare against a state-of-the-art RBM (Cho et al., 2013) with an AIS log-likelihood estimate of -64.1 (AIS estimates tend to be optimistic). We also drew samples from the RBM and applied the same estimator (using the mean of the RBM\u2019s P (x|h) with h sampled from the Gibbs chain), and obtained a log-likelihood non-parametric bound of -233, skipping 100 MCMC steps between samples (otherwise numbers are very poor for the RBM, which does not mix at all). The DAE log-likelihood bound with and without walkback is respectively -116 and -142, confirming visual inspection suggesting that the walkback algorithm produces fewer spurious samples. However, the RBM samples can be improved by a spatial blur. By tuning the amount of blur (the spread of the Gaussian convolution), we obtained a bound of -112 for the RBM. 
Blurring did not help the auto-encoder.

Figure 4: Successive samples generated by the Markov chain associated with the trained DAEs according to the plain sampling scheme (left) and the walkback sampling scheme (right). There are fewer "spurious" samples with the walkback algorithm.

5 Conclusion and Future Work
We have proven that training a model to denoise is a way to implicitly estimate the underlying data-generating process, and that a simple Markov chain that alternates sampling from the denoising model and from the corruption process converges to that estimator. This provides a means for generating data from any DAE (if the corruption is not degenerate, more precisely, if the above chain converges). We have validated those results empirically, both in a non-parametric setting and with real data. This study has also suggested a variant of the training procedure, walkback training, which seems to converge faster to the same target distribution.
One of the insights arising out of the theoretical results presented here is that in order to reach the asymptotic limit of fully capturing the data distribution P(X), it may be necessary for the model's P(X|X̃) to have the ability to represent multi-modal distributions over X (given X̃).

Acknowledgments
The authors would like to acknowledge input from A. Courville, I. Goodfellow, R. Memisevic, and K. Cho, as well as funding from NSERC, CIFAR (YB is a CIFAR Fellow), and Canada Research Chairs.

1 http://deeplearning.net/software/theano/
2 git@github.com:yaoli/GSN.git

References
Alain, G. and Bengio, Y. (2013). What regularized auto-encoders learn from the data generating distribution. In International Conference on Learning Representations (ICLR'2013).
Bengio, Y. and Yao, L. (2013). Bounding the test log-likelihood of generative models. Technical report, U. Montreal, arXiv.
Bengio, Y., Larochelle, H., and Vincent, P. (2006a).
Non-local manifold Parzen windows. In NIPS'05, pages 115–122. MIT Press.
Bengio, Y., Monperrus, M., and Larochelle, H. (2006b). Nonlocal estimation of manifold structure. Neural Computation, 18(10).
Bengio, Y., Thibodeau-Laufer, E., and Yosinski, J. (2013a). Deep generative stochastic networks trainable by backprop. Technical Report arXiv:1306.1091, Universite de Montreal.
Bengio, Y., Courville, A., and Vincent, P. (2013b). Unsupervised feature learning and deep learning: A review and new perspectives. IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI).
Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., and Bengio, Y. (2010). Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy).
Cho, K., Raiko, T., and Ilin, A. (2013). Enhanced gradient for training restricted Boltzmann machines. Neural Computation, 25(3), 805–831.
Heckerman, D., Chickering, D. M., Meek, C., Rounthwaite, R., and Kadie, C. (2000). Dependency networks for inference, collaborative filtering, and data visualization. Journal of Machine Learning Research, 1, 49–75.
Hinton, G. E. (1999). Products of experts. In ICANN'1999.
Hinton, G. E., Osindero, S., and Teh, Y. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527–1554.
Hyvärinen, A. (2005). Estimation of non-normalized statistical models using score matching. Journal of Machine Learning Research, 6, 695–709.
Kingma, D. and LeCun, Y. (2010). Regularized estimation of image statistics by score matching. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 1126–1134.
Ranzato, M., Boureau, Y.-L., and LeCun, Y. (2008).
Sparse feature learning for deep belief networks. In NIPS'07, pages 1185–1192, Cambridge, MA. MIT Press.
Rifai, S., Vincent, P., Muller, X., Glorot, X., and Bengio, Y. (2011). Contractive auto-encoders: Explicit invariance during feature extraction. In ICML'2011.
Rifai, S., Bengio, Y., Dauphin, Y., and Vincent, P. (2012). A generative process for sampling contractive auto-encoders. In ICML'2012.
Swersky, K., Ranzato, M., Buchman, D., Marlin, B., and de Freitas, N. (2011). On autoencoders and score matching for energy based models. In ICML'2011. ACM.
Vincent, P. (2011). A connection between score matching and denoising autoencoders. Neural Computation, 23(7).