{"title": "Variational Walkback: Learning a Transition Operator as a Stochastic Recurrent Net", "book": "Advances in Neural Information Processing Systems", "page_first": 4392, "page_last": 4402, "abstract": "We propose a novel method to {\\it directly} learn a stochastic transition operator whose repeated application provides generated samples. Traditional undirected graphical models approach this problem indirectly by learning a Markov chain model whose stationary distribution obeys detailed balance with respect to a parameterized energy function. The energy function is then modified so the model and data distributions match, with no guarantee on the number of steps required for the Markov chain to converge. Moreover, the detailed balance condition is highly restrictive: energy based models corresponding to neural networks must have symmetric weights, unlike biological neural circuits. In contrast, we develop a method for directly learning arbitrarily parameterized transition operators capable of expressing non-equilibrium stationary distributions that violate detailed balance, thereby enabling us to learn more biologically plausible asymmetric neural networks and more general non-energy based dynamical systems. The proposed training objective, which we derive via principled variational methods, encourages the transition operator to \"walk back\" (prefer to revert its steps) in multi-step trajectories that start at data-points, as quickly as possible back to the original data points. We present a series of experimental results illustrating the soundness of the proposed approach, Variational Walkback (VW), on the MNIST, CIFAR-10, SVHN and CelebA datasets, demonstrating superior samples compared to earlier attempts to learn a transition operator. 
We also show that although each rapid training trajectory is limited to a finite but variable number of steps, our transition operator continues to generate good samples well past the length of such trajectories, thereby demonstrating the match of its non-equilibrium stationary distribution to the data distribution. Source Code: http://github.com/anirudh9119/walkback_nips17", "full_text": "Variational Walkback: Learning a Transition Operator as a Stochastic Recurrent Net

Anirudh Goyal (MILA, Université de Montréal), anirudhgoyal9119@gmail.com
Nan Rosemary Ke (MILA, École Polytechnique de Montréal), rosemary.nan.ke@gmail.com
Surya Ganguli (Stanford University), sganguli@stanford.edu
Yoshua Bengio (MILA, Université de Montréal), yoshua.umontreal@gmail.com

Abstract

We propose a novel method to directly learn a stochastic transition operator whose repeated application provides generated samples. Traditional undirected graphical models approach this problem indirectly by learning a Markov chain model whose stationary distribution obeys detailed balance with respect to a parameterized energy function. The energy function is then modified so the model and data distributions match, with no guarantee on the number of steps required for the Markov chain to converge. Moreover, the detailed balance condition is highly restrictive: energy based models corresponding to neural networks must have symmetric weights, unlike biological neural circuits. In contrast, we develop a method for directly learning arbitrarily parameterized transition operators capable of expressing non-equilibrium stationary distributions that violate detailed balance, thereby enabling us to learn more biologically plausible asymmetric neural networks and more general non-energy based dynamical systems. 
The proposed training objective, which we derive via principled variational methods, encourages the transition operator to \"walk back\" (prefer to revert its steps) in multi-step trajectories that start at data points, returning as quickly as possible to the original data points. We present a series of experimental results illustrating the soundness of the proposed approach, Variational Walkback (VW), on the MNIST, CIFAR-10, SVHN and CelebA datasets, demonstrating superior samples compared to earlier attempts to learn a transition operator. We also show that although each rapid training trajectory is limited to a finite but variable number of steps, our transition operator continues to generate good samples well past the length of such trajectories, thereby demonstrating the match of its non-equilibrium stationary distribution to the data distribution. Source Code: http://github.com/anirudh9119/walkback_nips17

1 Introduction
A fundamental goal of unsupervised learning involves training generative models that can understand sensory data and employ this understanding to generate, or sample, new data and make new inferences. In machine learning, the vast majority of probabilistic generative models that can learn complex probability distributions over data fall into one of two classes: (1) directed graphical models, corresponding to a finite time feedforward generative process (e.g. variants of the Helmholtz machine (Dayan et al., 1995) like the Variational Auto-Encoder (VAE) (Kingma and Welling, 2013; Rezende et al., 2014)), or (2) energy function based undirected graphical models, corresponding to sampling from a stochastic process whose equilibrium stationary distribution obeys detailed balance with respect to the energy function (e.g. various Boltzmann machines (Salakhutdinov and Hinton, 2009)). 
31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

This detailed balance condition is highly restrictive: for example, energy-based undirected models corresponding to neural networks require symmetric weight matrices and very specific computations which may not match well with what biological neurons or analog hardware could compute.
In contrast, biological neural circuits are capable of powerful generative dynamics enabling us to model the world and imagine new futures. Cortical computation is highly recurrent and therefore its generative dynamics cannot simply map to the purely feed-forward, finite time generative process of a directed model. Moreover, the recurrent connectivity of biological circuits is not symmetric, and so their generative dynamics cannot correspond to sampling from an energy-based undirected model. Thus, the asymmetric biological neural circuits of our brain instantiate a type of stochastic dynamics arising from the repeated application of a transition operator* whose stationary distribution over neural activity patterns is a non-equilibrium distribution that does not obey detailed balance with respect to any energy function. Despite these fundamental properties of brain dynamics, machine learning approaches to training generative models currently lack effective methods to model complex data distributions through the repeated application of a transition operator that is not indirectly specified through an energy function, but rather is directly parameterized in ways that are inconsistent with the existence of any energy function. 
Indeed the lack of such methods constitutes a glaring gap in the pantheon of machine learning methods for training probabilistic generative models.
The fundamental goal of this paper is to provide a step toward filling such a gap by proposing a novel method to learn such directly parameterized transition operators, thereby providing an empirical method to control the stationary distributions of non-equilibrium stochastic processes that do not obey detailed balance, and match these distributions to data. The basic idea underlying our training approach is to start from a training example, and iteratively apply the transition operator while gradually increasing the amount of noise being injected (i.e., temperature). This heating process yields a trajectory that starts from the data manifold and walks away from the data due to the heating and to the mismatch between the model and the data distribution. Similarly to the update of a denoising autoencoder, we then modify the parameters of the transition operator so as to make the reverse of this heated trajectory more likely under a reverse cooling schedule. This encourages the transition operator to generate stochastic trajectories that evolve towards the data distribution, by learning to walk back the heated trajectories starting at data points. This walkback idea had been introduced for generative stochastic networks (GSNs) and denoising autoencoders (Bengio et al., 2013b) as a heuristic, and without temperature annealing. Here, we derive the specific objective function for learning the parameters through a principled variational lower bound, hence we call our training method variational walkback (VW). 
Despite the fact that the training procedure involves walking back a set of trajectories that last a finite, but variable number of time-steps, we find empirically that this yields a transition operator that continues to generate sensible samples for many more time-steps than are used to train it, demonstrating that our finite time training procedure can sculpt the non-equilibrium stationary distribution of the transition operator to match the data distribution. We show how VW emerges naturally from a variational derivation, with the need for annealing arising out of the objective of making the variational bound as tight as possible. We then describe experimental results illustrating the soundness of the proposed approach on the MNIST, CIFAR-10, SVHN and CelebA datasets. Intriguingly, we find that our finite time VW training process involves modifications of variational methods for training directed graphical models, while our potentially asymptotically infinite generative sampling process corresponds to non-equilibrium generalizations of energy based undirected models. Thus VW goes beyond the two disparate model classes of undirected and directed graphical models, while simultaneously incorporating good ideas from each.

2 The Variational Walkback Training Process
Our goal is to learn a stochastic transition operator p_T(s'|s) such that its repeated application yields samples from the data manifold. Here T reflects an underlying temperature, which we will modify during the training process. The transition operator is further specified by other parameters which must be learned from data. When K steps are chosen to generate a sample, the generative process has joint probability

p(s_0^K) = p(s_K) \prod_{t=1}^{K} p_{T_t}(s_{t-1} | s_t),

where T_t is the temperature at step t. We first give an intuitive description of our learning algorithm before deriving it via variational methods in the next section. The basic idea, as illustrated in Fig. 1 and Algorithm 1, is to follow a walkback strategy similar to that introduced in Alain and Bengio (2014). In particular, imagine a destructive process q_{T_{t+1}}(s_{t+1} | s_t) (red arrows in Fig. 1), which starts from a data point s_0 = x and evolves it stochastically to obtain a trajectory s_0, ..., s_K ≡ s_0^K, i.e.,

q(s_0^K) = q(s_0) \prod_{t=1}^{K} q_{T_t}(s_t | s_{t-1}),

where q(s_0) is the data distribution. Note that the p and q chains share the same parameters for the transition operator (one going backwards and one forward) but they start from different priors for their first step: q(s_0) is the data distribution while p(s_K) is a flat factorized prior (e.g. Gaussian). The training procedure trains the transition operator p_T to make reverse transitions of the destructive process more likely. For this reason we index time so that the destructive process operates forward in time, while the reverse generative process operates backwards in time, with the data distribution occurring at t = 0.

*A transition operator maps the previous-state distribution to a next-state distribution, and is implemented by a stochastic transformation which generates the next state of a Markov chain from the previous state.

Figure 1: Variational WalkBack framework. The generative process is represented by the blue arrows with the sequence of p_{T_t}(s_{t-1} | s_t) transitions. The destructive forward process starts at a datapoint (from q_{T_0}(s_0)) and gradually heats it through applications of q_{T_t}(s_t | s_{t-1}). Larger temperatures on the right correspond to a flatter distribution, so the whole destructive forward process maps the data distribution to a Gaussian and the creation process operates in reverse.
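As a concrete illustration, the heated destructive q-chain described above can be sketched in a few lines. The operator `toy_op`, the schedule constants, and all names below are illustrative placeholders assumed for this sketch, not the paper's released code:

```python
import numpy as np

def heated_trajectory(x, transition_sample, n_cool, t_max, rng):
    """Sample the destructive q-chain: start at a data point, take n_cool
    temperature-1 steps, then double the temperature each step (eq. (1):
    K = log2(T_max) + n).  `transition_sample(s, T, rng)` stands in for the
    learned operator q_T(s_t | s_{t-1}).  Returns states [s_0, ..., s_K]
    and the temperature used at each of the K steps."""
    K = int(np.ceil(np.log2(t_max))) + n_cool
    states, temps = [np.asarray(x, dtype=float)], []
    T = 1.0
    for t in range(1, K + 1):
        states.append(transition_sample(states[-1], T, rng))
        temps.append(T)
        if t > n_cool:          # heat only after the temperature-1 steps
            T *= 2.0
    return states, temps

# toy operator: mean-reverting drift plus noise whose scale grows with T
def toy_op(s, T, rng):
    return 0.9 * s + rng.normal(scale=np.sqrt(T) * 0.1, size=s.shape)

rng = np.random.default_rng(0)
states, temps = heated_trajectory(np.zeros(4), toy_op, n_cool=2, t_max=16.0, rng=rng)
```

Reversing `states` gives the targets the generative p-chain is trained to walk back through.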
In particular, we need only train the transition operator to reverse time by one step at each stage, making it unnecessary to solve a deep credit assignment problem by performing backpropagation through time across multiple walk-back steps. Overall, the destructive process generates trajectories that walk away from the data manifold, and the transition operator p_T learns to walk back these trajectories to sculpt the stationary distribution of p_T at T = 1 to match the data distribution.
Because we choose q_T to have the same parameters as p_T, they have the same transition operator but not the same joint distribution over the whole sequence, because of differing initial distributions for each trajectory. We also choose to increase temperature with time in the destructive process, following a temperature schedule T_1 ≤ ··· ≤ T_K. Thus the forward destructive (reverse generative) process corresponds to a heating (cooling) protocol. This training procedure is similar in spirit to DAE's (Vincent et al., 2008) or NET (Sohl-Dickstein et al., 2015), but with one major difference: the destructive process in these works corresponds to the addition of random noise which knows nothing about the current generative process during training. To understand why tying together destruction and creation may be a good idea, consider the special case in which p_T corresponds to a stochastic process whose stationary distribution obeys detailed balance with respect to the energy function of an undirected graphical model. Learning any such model involves two fundamental goals: the model must place probability mass (i.e. lower the energy function) where the data is located, and remove probability mass (i.e. raise the energy function) elsewhere. 
Probability modes where there is no data are known as spurious modes, and a fundamental goal of learning is to hunt down these spurious modes and remove them. Making the destructive process identical to the transition operator to be learned is motivated by the notion that the destructive process should then efficiently explore the spurious modes of the current transition operator. The walkback training will then destroy these modes. In contrast, in DAE's and NET's, since the destructive process corresponds to the addition of unstructured noise that knows nothing about the generative process, it is not clear that such an agnostic destructive process will efficiently seek out the spurious modes of the reverse, generative process.
We chose the annealing schedule empirically to minimize training time. The generative process starts by sampling a state s_K from a broad Gaussian p*(s_K), whose variance is initially equal to the total data variance σ^2_max (but can be later adapted to match the final samples from the inference trajectories). Then we sample from p_{T_max}(s_{K-1} | s_K), where T_max is a high enough temperature so that the resultant injected noise can move the state across the whole domain of the data. The injected noise used to simulate the effects of finite temperature has variance linearly proportional to temperature. Thus if σ^2 is the equivalent noise injected by the transition operator p_T at T = 1, we choose T_max = σ^2_max / σ^2 to achieve the goal of the first sample s_{K-1} being able to move across the entire range of the data distribution. Then we successively cool the temperature as we sample “previous” states s_{t-1} according to p_T(s_{t-1} | s_t), with T reduced by a factor of 2 at each step, followed by n steps at temperature 1. This cooling protocol requires the number of steps to be

K = log_2 T_max + n,     (1)

in order to go from T = T_max to T = 1 in K steps. We choose K from a random distribution. Thus the training procedure trains p_T to rapidly transition from a simple Gaussian distribution to the data distribution in a finite but variable number of steps. Ideally, this training procedure should then indirectly create a transition operator p_T at T = 1 whose repeated iteration samples the data distribution with a relatively rapid mixing time. Interestingly, this intuitive learning algorithm for a recurrent dynamical system, formalized in Algorithm 1, can be derived in a principled manner from variational methods that are usually applied to directed graphical models, as we see next.

Algorithm 1 VariationalWalkback(θ)
Train a generative model associated with a transition operator p_T(s | s') at temperature T (temperature 1 for sampling from the actual model), parameterized by θ. 
This transition operator injects noise of variance Tσ^2 at each step, where σ^2 is the noise level at temperature 1.
Require: Transition operator p_T(s | s') from which one can both sample and compute the gradient of log p_T(s | s') with respect to parameters θ, given s and s'.
Require: Precomputed σ^2_max, initially the data variance (or squared diameter).
Require: N_1 > 1, the number of initial temperature-1 steps of a q trajectory (or ending a p trajectory).
repeat
  Set p* to be a Gaussian with the mean and variance of the data.
  T_max ← σ^2_max / σ^2
  Sample n as a uniform integer between 0 and N_1
  K ← ceil(log_2 T_max) + n
  Sample x ~ data (or equivalently sample a minibatch to parallelize computation and process each element of the minibatch independently)
  Let s_0 = x, set initial temperature T = 1, initialize L = 0
  for t = 1 to K do
    Sample s_t ~ p_T(s | s_{t-1})
    Increment L ← L + log p_T(s_{t-1} | s_t)
    Update parameters with the log likelihood gradient ∂ log p_T(s_{t-1} | s_t) / ∂θ
    If t > n, increase the temperature with T ← 2T
  end for
  Increment L ← L + log p*(s_K)
  Update the mean and variance of p* to match the accumulated first- and second-moment statistics of the samples of s_K
until convergence, monitoring L on a validation set and doing early stopping

3 Variational Derivation of Walkback
The marginal probability of a data point s_0 at the end of the K-step generative cooling process is

p(s_0) = \sum_{s_1^K} p_{T_0}(s_0 | s_1) \left( \prod_{t=2}^{K} p_{T_t}(s_{t-1} | s_t) \right) p*(s_K)     (2)

where s_1^K = (s_1, s_2, ..., s_K). Here v = s_0 is a visible variable in our generative process, while the cooling trajectory that led to it can be thought of as a latent, hidden variable h = s_1^K. 
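Algorithm 1 above can be sketched numerically as follows. The fixed mean function, the constants, and the omission of the parameter update are simplifying assumptions of this sketch, not the paper's implementation; only the loss accumulation L = Σ_t log p_T(s_{t-1}|s_t) + log p*(s_K) is taken from the algorithm:

```python
import numpy as np

def gaussian_logpdf(x, mean, var):
    # log of an isotropic Gaussian density N(x; mean, var*I), summed over dims
    return -0.5 * np.sum((x - mean) ** 2 / var + np.log(2 * np.pi * var))

def walkback_loss(x, mean_fn, sigma2, n, t_max, rng):
    """One inner iteration of Algorithm 1: run the heated chain forward from
    data point x and accumulate L.  `mean_fn` stands in for the learned mean
    of the Gaussian transition operator; in practice θ would be updated by
    ascending dL/dθ at every step (omitted here)."""
    K = int(np.ceil(np.log2(t_max))) + n
    s, T, L = np.asarray(x, dtype=float), 1.0, 0.0
    for t in range(1, K + 1):
        s_next = mean_fn(s) + rng.normal(scale=np.sqrt(T * sigma2), size=s.shape)
        # credit the operator for walking back one step: log p_T(s_{t-1} | s_t)
        L += gaussian_logpdf(s, mean_fn(s_next), T * sigma2)
        s = s_next
        if t > n:
            T *= 2.0
    # terminal prior p*: a broad Gaussian (standard-normal mean for simplicity)
    L += gaussian_logpdf(s, np.zeros_like(s), t_max)
    return L

rng = np.random.default_rng(1)
L = walkback_loss(np.ones(3), lambda s: 0.5 * s, sigma2=0.1, n=1, t_max=8.0, rng=rng)
```

Note that each step contributes a purely local target, which is what removes the need for backpropagation through time.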
Recall the decomposition of the marginal log-likelihood via a variational lower bound,

ln p(v) ≡ ln \sum_h p(v|h) p(h) = \sum_h q(h|v) ln \frac{p(v,h)}{q(h|v)} + D_{KL}[q(h|v) || p(h|v)].     (3)

Here L, the first term on the right-hand side, is the variational lower bound which motivates the proposed training procedure, and q(h|v) is a variational approximation to p(h|v). Applying this decomposition to v = s_0 and h = s_1^K, we find

ln p(s_0) = \sum_{s_1^K} q(s_1^K | s_0) ln \frac{p(s_0 | s_1^K) p(s_1^K)}{q(s_1^K | s_0)} + D_{KL}[q(s_1^K | s_0) || p(s_1^K | s_0)].     (4)

Similarly to the EM algorithm, we aim to approximately maximize the log-likelihood with a 2-step procedure. Let θ_p be the parameters of the generative model p and θ_q be the parameters of the approximate inference procedure q. Before seeing the next example we have θ_q = θ_p. Then in the first step we update θ_p towards maximizing the variational bound L, for example by a stochastic gradient descent step. In the second step, we update θ_q by setting θ_q ← θ_p, with the objective to reduce the KL term in the above decomposition. See Sec. 3.1 below regarding conditions for the tightness of the bound, which may not be perfect, yielding a possibly biased gradient when we force the constraint θ_p = θ_q. We continue iterating this procedure with training examples s_0. We can obtain an unbiased Monte-Carlo estimator of L from a single trajectory:

L(s_0) ≈ \sum_{t=1}^{K} ln \frac{p_{T_t}(s_{t-1} | s_t)}{q_{T_t}(s_t | s_{t-1})} + ln p*(s_K)     (5)

with respect to θ_p, where s_0 is sampled from the data distribution q_{T_0}(s_0), and the single sequence s_1^K is sampled from the heating process q(s_1^K | s_0). We are making the reverse of heated trajectories more likely under the cooling process, leading to Algorithm 1. 
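The decomposition in eq. (3) can be checked numerically on a toy two-state model; every probability below is made up purely for illustration:

```python
import numpy as np

# Toy check of ln p(v) = L + D_KL[q(h|v) || p(h|v)]  (eq. (3))
p_joint = np.array([[0.30, 0.10],   # p(v, h): rows index v, columns index h
                    [0.15, 0.45]])
q = np.array([[0.6, 0.4],           # q(h | v): rows index v
              [0.2, 0.8]])

v = 0
p_v = p_joint[v].sum()                           # marginal p(v)
p_h_given_v = p_joint[v] / p_v                   # true posterior p(h | v)
L = np.sum(q[v] * np.log(p_joint[v] / q[v]))     # variational lower bound
dkl = np.sum(q[v] * np.log(q[v] / p_h_given_v))  # the gap
assert np.isclose(np.log(p_v), L + dkl)          # the identity holds exactly
assert L <= np.log(p_v)                          # and L is indeed a lower bound
```

The same bookkeeping, with h the whole heated trajectory, gives eq. (4).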
Such variational bounds have been used successfully in many learning algorithms in the past, such as the VAE (Kingma and Welling, 2013), except that they use an explicitly different set of parameters for p and q. Some VAE variants (Sønderby et al., 2016; Kingma et al., 2016) however mix the p-parameters implicitly in forming q, by using the likelihood gradient to iteratively form the approximate posterior.

3.1 Tightness of the variational lower bound
As seen in (4), the gap between L(s_0) and ln p(s_0) is controlled by D_{KL}[q(s_1^K|s_0) || p(s_1^K|s_0)], and the bound is therefore tight when the distribution of the heated trajectory, starting from a point s_0, matches the posterior distribution of the cooled trajectory ending at s_0. Explicitly, this KL divergence is given by

D_{KL} = \sum_{s_1^K} q(s_1^K | s_0) ln \left[ \prod_{t=1}^{K} \frac{q_{T_t}(s_t | s_{t-1})}{p_{T_t}(s_{t-1} | s_t)} \cdot \frac{p(s_0)}{p*(s_K)} \right].     (6)

As the heating process q unfolds forward in time, while the cooling process p unfolds backwards in time, we introduce the time reversal of the transition operator p_T, denoted by p^R_T, as follows. Under repeated application of the transition operator p_T, the state s settles into a stationary distribution π_T(s) at temperature T. The probability of observing a transition s_t → s_{t-1} under p_T in its stationary state is then p_T(s_{t-1}|s_t) π_T(s_t). The time-reversal p^R_T is the transition operator that makes the reverse transition equally likely for all state pairs, and therefore obeys

p_T(s_{t-1} | s_t) π_T(s_t) = p^R_T(s_t | s_{t-1}) π_T(s_{t-1})     (7)

for all pairs of states s_{t-1} and s_t. It is well known that p^R_T is a valid stochastic transition operator and has the same stationary distribution π_T(s) as p_T. 
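Eq. (7) can be illustrated on a toy finite-state chain, constructing p^R from p and its stationary distribution; the 3-state matrix below is an arbitrary example chosen to violate detailed balance:

```python
import numpy as np

# P[i, j] = p(next state = j | current state = i); an asymmetric toy chain
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.6, 0.3],
              [0.3, 0.1, 0.6]])

# stationary distribution pi: left eigenvector of P with eigenvalue 1
w, V = np.linalg.eig(P.T)
pi = np.real(V[:, np.argmax(np.real(w))])
pi = pi / pi.sum()

# eq. (7) rearranged: p^R(j | i) = p(i | j) * pi(j) / pi(i)
PR = (P.T * pi[None, :]) / pi[:, None]

assert np.allclose(PR.sum(axis=1), 1.0)   # p^R is a valid stochastic operator
assert np.allclose(pi @ PR, pi)           # and shares the stationary pi
# detailed balance would mean P == PR; this asymmetric chain violates it
assert not np.allclose(P, PR)
```

The row sums of `PR` equal 1 precisely because π is stationary for P, which is the "well known" fact quoted above.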
Furthermore, the process p_T obeys detailed balance if and only if it is invariant under time-reversal, so that p_T = p^R_T.
To better understand the KL divergence in (6), at each temperature T_t we use relation (7) to replace the cooling process p_{T_t}, which occurs backwards in time, with its time-reversal, unfolding forward in time, at the expense of introducing ratios of stationary probabilities. We also exploit the fact that q and p are the same transition operator. With these substitutions in (6), we find

D_{KL} = \sum_{s_1^K} q(s_1^K | s_0) ln \prod_{t=1}^{K} \frac{p_{T_t}(s_t | s_{t-1})}{p^R_{T_t}(s_t | s_{t-1})} + \sum_{s_1^K} q(s_1^K | s_0) ln \left[ \frac{p(s_0)}{p*(s_K)} \prod_{t=1}^{K} \frac{π_{T_t}(s_t)}{π_{T_t}(s_{t-1})} \right].     (8)

The first term in (8) is simply the KL divergence between the distribution over heated trajectories and the time reversal of the cooled trajectories. Since the heating (q) and cooling (p) processes are tied, this KL divergence is 0 if and only if p_{T_t} = p^R_{T_t} for all t. This time-reversal invariance requirement for vanishing KL divergence is equivalent to the transition operator p_T obeying detailed balance at all temperatures.

Now intuitively, the second term can be made small in the limit where K is large and the temperature sequence is annealed slowly. To see why, note that we can write the ratio of probabilities in this term as

\frac{p(s_0)}{π_{T_1}(s_0)} \cdot \frac{π_{T_1}(s_1)}{π_{T_2}(s_1)} \cdots \frac{π_{T_{K-1}}(s_{K-1})}{π_{T_K}(s_{K-1})} \cdot \frac{π_{T_K}(s_K)}{p*(s_K)},     (9)

which is similar in shape (but arising in a different context) to the product of probability ratios computed for annealed importance sampling (Neal, 2001) and reverse annealed importance sampling (Burda et al., 2014). 
Here it is manifest that, under slow incremental annealing schedules, we are comparing probabilities of the same state under slightly different distributions, so all ratios are close to 1. For example, under many steps, with slow annealing, the generative process approximately reaches its stationary distribution, p(s_0) ≈ π_{T_1}(s_0).
This slow annealing to go from p*(s_K) to p(s_0) corresponds to the quasistatic limit in statistical physics, where the work required to perform the transformation is equal to the free energy difference between states. To go faster, one must perform excess work, above and beyond the free energy difference, and this excess work is dissipated as heat into the surrounding environment. By writing the distributions in terms of energies and free energies, π_{T_t}(s_t) ∝ e^{-E(s_t)/T_t}, p*(s_K) = e^{-[E_K(s_K) - F_K]}, and p(s_0) = e^{-[E_0(s_0) - F_0]}, one can see that the second term in the KL divergence is closely related to the average heat dissipation in a finite time heating process (see e.g. Crooks (2000)).
This intriguing connection between the size of the gap in a variational lower bound and the excess heat dissipation in a finite time heating process opens the door to exploiting a wealth of work in statistical physics for finding optimal thermodynamic paths that minimize heat dissipation (Schmiedl and Seifert, 2007; Sivak and Crooks, 2012; Gingrich et al., 2016), which may provide new ideas to improve variational inference. In summary, tightness of the variational bound can be achieved if: (1) the transition operator of p approximately obeys detailed balance, and (2) the temperature annealing is done slowly over many steps. 
And intriguingly, the magnitude of the looseness of the bound is related to two physical quantities: (1) the degree of irreversibility of the transition operator p, as measured by the KL divergence between p and its time reversal p^R, and (2) the excess physical work, or equivalently, excess heat dissipated, in performing the heating trajectory.
To check, post-hoc, potential looseness of the variational lower bound, we can measure the degree of irreversibility of p_T by estimating the KL divergence D_{KL}(p_T(s'|s) π_T(s) || p_T(s|s') π_T(s')), which is 0 if and only if p_T obeys detailed balance and is therefore time-reversal invariant. This quantity can be estimated by

\frac{1}{K} \sum_{t=1}^{K} ln \frac{p_T(s_{t+1} | s_t)}{p_T(s_t | s_{t+1})},

where s_1^K is a long sequence sampled by repeatedly applying the transition operator p_T from a draw s_1 ~ π_T. If this quantity is strongly positive (negative) then forward transitions are more (less) likely than reverse transitions, and the process p_T is not time-reversal invariant. This estimated KL divergence can be normalized by the corresponding entropy to get a relative value (with 3.6% measured on a trained model, as detailed in the Appendix).

3.2 Estimating log likelihood via importance sampling
We can derive an importance sampling estimate of the negative log-likelihood by the following procedure. For each training example x, we sample a large number of destructive paths (as in Algorithm 1). 
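The irreversibility estimator above can be sketched directly; `log_p` is a placeholder for the operator's log-density. For a symmetric Gaussian random walk, where forward and reverse moves are equally likely, the estimate is exactly zero:

```python
import numpy as np

def irreversibility_estimate(traj, log_p):
    """Monte-Carlo estimate of KL(p_T(s'|s)pi(s) || p_T(s|s')pi(s')):
    (1/K) * sum_t [log p_T(s_{t+1}|s_t) - log p_T(s_t|s_{t+1})], computed
    from a long trajectory sampled in the stationary regime.  Zero means
    the chain looks the same run forward or backward."""
    terms = [log_p(traj[t + 1], traj[t]) - log_p(traj[t], traj[t + 1])
             for t in range(len(traj) - 1)]
    return float(np.mean(terms))

# unit-variance Gaussian step density, symmetric in its two arguments
log_p = lambda s_next, s: -0.5 * (s_next - s) ** 2
traj = list(np.random.default_rng(2).normal(size=10))
assert irreversibility_estimate(traj, log_p) == 0.0
```

An asymmetric (drifting) `log_p` would instead yield a nonzero value, flagging a violation of detailed balance.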
We then use the following formulation to estimate the log-likelihood log p(x):

log p(x) ≈ log E_{s_1^K ~ q(s_1^K | s_0 = x)} \left[ \frac{p*(s_K) \left( \prod_{t=2}^{K} p_{T_t}(s_{t-1} | s_t) \right) p_{T_0}(s_0 = x | s_1)}{q_{T_1}(s_1 | s_0 = x) \prod_{t=2}^{K} q_{T_t}(s_t | s_{t-1})} \right]     (10)

where x ~ p_D and the expectation is estimated by averaging the importance weights over the sampled destructive paths.

3.3 VW transition operators and their convergence
The VW approach allows considerable freedom in choosing transition operators, obviating the need for specifying them indirectly through an energy function. Here we consider Bernoulli and isotropic Gaussian transition operators for binary and real-valued data respectively. The form of the stochastic state update imitates a discretized version of the Langevin differential equation. The Bernoulli transition operator computes the element-wise probability as

ρ = sigmoid\left( \frac{(1 - α) * s_{t-1} + α * F_ρ(s_{t-1})}{T_t} \right).

The Gaussian operator computes a conditional mean and standard deviation via μ = (1 - α) * s_{t-1} + α * F_μ(s_{t-1}) and σ = T_t log(1 + e^{F_σ(s_{t-1})}). Here the F functions can be arbitrary parametrized functions, such as a neural net, and T_t is the temperature at time step t.

A natural question is when the finite time VW training process will learn a transition operator whose stationary distribution matches the data distribution, so that repeated sampling far beyond the training time continues to yield data samples. To partially address this, we prove the following theorem:
Proposition 1. 
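A minimal sketch of the Gaussian transition operator of Sec. 3.3 follows; the placeholder callables `F_mu` and `F_sigma` stand in for the neural networks, and the toy parameterization at the bottom is assumed for illustration only:

```python
import numpy as np

def gaussian_operator_step(s_prev, F_mu, F_sigma, alpha, T, rng):
    """One step of the Gaussian VW operator:
    mu    = (1 - alpha) * s + alpha * F_mu(s)
    sigma = T * log(1 + exp(F_sigma(s)))   (softplus keeps sigma positive)
    and returns a sample mu + sigma * eps with eps ~ N(0, I)."""
    mu = (1.0 - alpha) * s_prev + alpha * F_mu(s_prev)
    sigma = T * np.log1p(np.exp(F_sigma(s_prev)))
    return mu + sigma * rng.normal(size=s_prev.shape)

rng = np.random.default_rng(3)
s = np.zeros(5)
# toy parameterization: pull toward a fixed point, constant log-scale
s1 = gaussian_operator_step(s, F_mu=lambda v: 0.5 * v + 0.1,
                            F_sigma=lambda v: np.full_like(v, -2.0),
                            alpha=0.3, T=1.0, rng=rng)
```

Note how the temperature T enters only through the noise scale, so heating widens the operator's steps without changing its mean drift.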
If p has enough capacity, training data and training time, with slow enough annealing and a small departure from reversibility so p can match q, then at convergence of VW training, the transition operator p_T at T = 1 has the data generating distribution as its stationary distribution.

A proof can be found in the Appendix, but the essential intuition is that if the finite time generative process converges to the data distribution at multiple different VW walkback time-steps, then it remains on the data distribution for all future time at T = 1. We cannot always guarantee the preconditions of this theorem but we find experimentally that its essential outcome holds in practice.

4 Related Work
A variety of learning algorithms can be cast in the framework of Fig. 1. For example, for directed graphical models like VAEs (Kingma and Welling, 2013; Rezende et al., 2014), DBNs (Hinton et al., 2006), and Helmholtz machines in general, q corresponds to a recognition model, transforming data to a latent space, while p corresponds to a generative model that goes from latent to visible data in a finite number of steps. None of these directed models are designed to learn transition operators that can be iterated ad infinitum, as we do. Moreover, learning such models involves a complex, deep credit assignment problem, limiting the number of unobserved latent layers that can be used to generate data. Similar issues of limited trainable depth in a finite time feedforward generative process apply to Generative Adversarial Networks (GANs) (Goodfellow et al., 2014), which also further eschew the goal of specifically assigning probabilities to data points. 
Our method circumvents this deep credit assignment problem by providing training targets at each time-step; in essence each past time-step of the heated trajectory constitutes a training target for the future output of the generative operator p_T, thereby obviating the need for backpropagation across multiple steps. In contrast to VW, Generative Stochastic Networks (GSN) (Bengio et al., 2014) and DRAW (Gregor et al., 2015) also require training iterative operators by backpropagating across multiple computational steps.
VW is similar in spirit to DAE (Bengio et al., 2013b) and NET approaches (Sohl-Dickstein et al., 2015), but it retains two crucial differences. First, in each of these frameworks, q corresponds to a very simple destruction process in which unstructured Gaussian noise is injected into the data. This agnostic destruction process has no knowledge of the underlying generative process p that is to be learned, and therefore cannot be expected to efficiently explore spurious modes, or regions of space unoccupied by data to which p assigns high probability. VW has the advantage of using a high-temperature version of the model p itself as part of the destructive process, and so should be better than random noise injection at finding these spurious modes. A second crucial difference is that VW ties the weights of the transition operator across time-steps, thereby enabling us to learn a bona fide transition operator that can be iterated well beyond the training time, unlike DAEs and NET. There is also another related recent approach to learning a transition operator with a denoising cost, developed in parallel, called Infusion training (Bordes et al., 2017), which tries to reconstruct the target data in the chain, instead of the previous step in the destructive chain.

5 Experiments
VW is evaluated on four datasets: MNIST, CIFAR10 (Krizhevsky and Hinton, 2009), SVHN (Netzer et al., 2011) and CelebA (Liu et al., 2015). 
The MNIST, SVHN and CIFAR10 datasets were used as is, except for uniform noise added to MNIST and CIFAR10, as per Theis et al. (2016); the aligned and cropped version of CelebA was scaled from 218 x 178 pixels to 78 x 64 pixels and center-cropped at 64 x 64 pixels (Liu et al., 2015). We used the Adam optimizer (Kingma and Ba, 2014) and the Theano framework (Al-Rfou et al., 2016). More details are in the Appendix, and code for training and generation is at http://github.com/anirudh9119/walkback_nips17.

Table 1 compares VW with published NET results on CIFAR10.

Image Generation. Figures 3, 5, 6, 7 and 8 (see supplementary section) show VW samples on each of the datasets. For MNIST, real-valued views of the data are modeled.

Image Inpainting. We clamped the bottom part of CelebA test images (for each step during sampling) and ran them through the model. Figure 1 (see supplementary section) shows the generated conditional samples.

Model                                        bits/dim ≤
NET (Sohl-Dickstein et al., 2015)              5.40
VW (20 steps)                                  5.20
Deep VAE                                     < 4.54
VW (30 steps)                                  4.40
DRAW (Gregor et al., 2015)                   < 4.13
ResNet VAE with IAF (Kingma et al., 2016)      3.11

Table 1: Comparison on the CIFAR10 test set, average number of bits per data dimension (lower is better).

6 Discussion

6.1 Summary of results

Our main advance involves using variational inference to learn recurrent transition operators that can rapidly approach the data distribution and then be iterated much longer than the training time while still remaining on the data manifold.
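The inpainting procedure above amounts to conditional sampling by clamping: the known pixels are reset to their observed values after every application of the transition operator, so the chain only explores the unobserved region. A minimal sketch, using a stand-in noisy operator in place of the trained network:

```python
import numpy as np

rng = np.random.default_rng(1)

def transition(x, T=1.0):
    # Stand-in for the learned operator p_T(x'|x): a noisy contraction
    # toward 0. Purely illustrative; the paper uses a trained neural net.
    return 0.9 * x + 0.1 * np.sqrt(T) * rng.standard_normal(x.shape)

def inpaint(observed, known_mask, steps=50):
    """Sample with the transition operator while clamping the known
    pixels back to their observed values after every step."""
    x = rng.standard_normal(observed.shape)
    for _ in range(steps):
        x = transition(x)
        x[known_mask] = observed[known_mask]  # re-impose the clamped region
    return x

image = rng.standard_normal((8, 8))
mask = np.zeros((8, 8), dtype=bool)
mask[4:, :] = True                  # clamp the bottom half, as in the paper
result = inpaint(image, mask)
```

Because the clamp is re-applied at every step rather than only at initialization, the free pixels are repeatedly conditioned on the observed ones as the chain mixes.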
Our innovations enabling us to achieve this involved: (a) tying weights across time, (b) tying the destruction and generation processes together to efficiently destroy spurious modes, (c) using the past of the destructive process to train the future of the creation process, thereby circumventing issues with deep credit assignment (as in NET), (d) introducing an aggressive temperature annealing schedule to rapidly approach the data distribution (e.g., NET takes 1000 steps while VW takes only 30 steps to do so), and (e) introducing variable trajectory lengths during training to encourage the generator to stay on the data manifold for times longer than the training sequence length.

Indeed, it is often difficult to sample from recurrent neural networks for many more time-steps than the duration of their training sequences, especially for non-symmetric networks that could exhibit chaotic activity. Transition operators learned by VW can be stably sampled for exceedingly long times; for example, in experiments (see supplementary section) we trained our model on CelebA for 30 steps, while at test time we sampled for 100,000 time-steps. Overall, our method of learning a transition operator outperforms previous attempts at learning transition operators (i.e., VAE, GSN and NET) using a local learning rule.

Overall, we introduced a new approach to learning non-energy-based transition operators which inherits advantages from several previous generative models, including a training objective that requires rapidly generating the data in a finite number of steps (as in directed models), re-using the same parameters for each step (as in undirected models), directly parametrizing the generator (as in GANs and DAEs), and using the model itself to quickly find its own spurious modes (the walk-back idea).
We also anchor the algorithm in a variational bound, and show how its analysis suggests using the same transition operator for the destruction (inference) process and the creation (generation) process, with a cooling schedule during generation and a reverse heating schedule during inference.

6.2 New bridges between variational inference and non-equilibrium statistical physics

We connected the variational gap to physical notions like reversibility and heat dissipation. This novel bridge between variational inference and concepts like excess heat dissipation in non-equilibrium statistical physics could potentially open the door to improving variational inference by exploiting a wealth of work in statistical physics. For example, physical methods for finding optimal thermodynamic paths that minimize heat dissipation (Schmiedl and Seifert, 2007; Sivak and Crooks, 2012; Gingrich et al., 2016) could potentially be exploited to tighten lower bounds in variational inference. Moreover, motivated by the relation between the variational gap and reversibility, we verified empirically that the model converges towards an approximately reversible chain (see Appendix), making the variational bound tighter.

6.3 Neural weight asymmetry

A fundamental aspect of our approach is that we can train stochastic processes that need not exactly obey detailed balance, yielding access to a larger and potentially more powerful space of models. In particular, this enables us to relax the weight-symmetry constraint of undirected graphical models corresponding to neural networks, yielding a more brain-like iterative computation characteristic of asymmetric biological neural circuits. Our approach thus avoids the biologically implausible requirement of weight transport (Lillicrap et al., 2014), which arises as a consequence of imposing weight symmetry as a hard constraint.
With VW, this hard constraint is removed, although the training procedure itself may converge towards more symmetry. Such an approach towards symmetry is consistent with both empirical observations (Vincent et al., 2010) and theoretical analyses (Arora et al., 2015) of auto-encoders, for which symmetric weights are associated with minimizing reconstruction error.

6.4 A connection to the neurobiology of dreams

The learning rule underlying VW, when applied to an asymmetric stochastic neural network, yields a speculative but intriguing connection to the neurobiology of dreams. As discussed in Bengio et al. (2015), spike-timing dependent plasticity (STDP), a plasticity rule found in the brain (Markram and Sakmann, 1995), corresponds to increasing the probability of configurations towards which the network intrinsically likes to go (i.e., remembering observed configurations), while reverse-STDP corresponds to forgetting or unlearning the states towards which the network goes (which potentially may occur during sleep).

In the VW update applied to a neural network, the resultant learning rule does indeed strengthen synapses for which a presynaptic neuron is active before a postsynaptic neuron in the generative cooling process (STDP), and it weakens synapses in which a postsynaptic neuron is active before a presynaptic neuron in the heated destructive process (reverse-STDP). If, as suggested, the neurobiological function of sleep involves re-organizing memories and in particular unlearning spurious modes through reverse-STDP, then the heated destructive process may map to sleep states, in which the brain is hunting down and destroying spurious modes. In contrast, the cooling generative dynamics of VW may map to awake states, in which STDP reinforces neural trajectories moving towards observed sensory data.
Under this mapping, the relative incoherence of dreams compared to reality is qualitatively consistent with the heated destructive dynamics of VW, compared to the cooled transition operator in place during awake states.

6.5 Future work

Many questions remain open in terms of analyzing and extending VW. Of particular interest is the incorporation of latent layers. The state at each step would now include both visible x and latent h components. Essentially the same procedure can be run, except for the chain initialization, with s0 = (x, h0), where h0 is a sample from the posterior distribution of h given x.

Another interesting direction is to replace the log-likelihood objective at each step by a GAN-like objective, thereby avoiding the need to inject noise independently on each of the pixels during each transition step, and allowing latent-variable sampling to inject the required high-level decisions associated with the transition. Based on the earlier results of Bengio et al. (2013a), sampling in the latent space rather than in the pixel space should allow for better generative models and even better mixing between modes.

Overall, our work takes a step towards filling a relatively open niche in the machine learning literature on directly training non-energy-based iterative stochastic operators, and we hope that the many possible extensions of this approach could lead to a rich new class of more powerful brain-like machine learning models.

Acknowledgments

The authors would like to thank Benjamin Scellier, Ben Poole, Tim Cooijmans, Philemon Brakel, Gaétan Marceau Caron, and Alex Lamb for their helpful feedback and discussions, as well as NSERC, CIFAR, Google, Samsung, Nuance, IBM and Canada Research Chairs for funding, and Compute Canada for computing resources. S.G. would like to thank the Simons, McKnight, James S. McDonnell, and Burroughs Wellcome Foundations and the Office of Naval Research for support.
Y.B. would also like to thank Geoff Hinton for an analogy which is used in this work, shared while discussing contrastive divergence (personal communication). The authors would also like to express a debt of gratitude towards those who contributed to Theano over the years (as it is no longer maintained), making it such a great tool.

References

Al-Rfou, R., Alain, G., Almahairi, A., et al. (2016). Theano: A Python framework for fast computation of mathematical expressions. CoRR, abs/1605.02688.

Alain, G. and Bengio, Y. (2014). What regularized auto-encoders learn from the data-generating distribution. J. Mach. Learn. Res., 15(1):3563–3593.

Arora, S., Liang, Y., and Ma, T. (2015). Why are deep nets reversible: a simple theory, with implications for training. Technical report, arXiv:1511.05653.

Bengio, Y., Mesnard, T., Fischer, A., Zhang, S., and Wu, Y. (2015). An objective function for STDP. CoRR, abs/1509.05936.

Bengio, Y., Mesnil, G., Dauphin, Y., and Rifai, S. (2013a). Better mixing via deep representations. In ICML'2013.

Bengio, Y., Thibodeau-Laufer, É., Alain, G., and Yosinski, J. (2014). Deep generative stochastic networks trainable by backprop. In Proceedings of the 31st International Conference on Machine Learning - Volume 32, ICML'14, pages II-226–II-234. JMLR.org.

Bengio, Y., Yao, L., Alain, G., and Vincent, P. (2013b). Generalized denoising auto-encoders as generative models. In NIPS'2013, arXiv:1305.6663.

Bordes, F., Honari, S., and Vincent, P. (2017). Learning to generate samples from noise through infusion training. CoRR, abs/1703.06975.

Burda, Y., Grosse, R. B., and Salakhutdinov, R. (2014). Accurate and conservative estimates of MRF log-likelihood using reverse annealing. CoRR, abs/1412.8566.

Crooks, G. E. (2000). Path-ensemble averages in systems driven far from equilibrium. Physical Review E, 61(3):2361.

Dayan, P., Hinton, G.
E., Neal, R. M., and Zemel, R. S. (1995). The Helmholtz machine. Neural Comput., 7(5):889–904.

Gingrich, T. R., Rotskoff, G. M., Crooks, G. E., and Geissler, P. L. (2016). Near-optimal protocols in complex nonequilibrium transformations. Proceedings of the National Academy of Sciences, page 201606273.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680.

Gregor, K., Danihelka, I., Graves, A., and Wierstra, D. (2015). DRAW: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623.

Hinton, G. E., Osindero, S., and Teh, Y.-W. (2006). A fast learning algorithm for deep belief nets. Neural Comput., 18(7):1527–1554.

Kingma, D. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Kingma, D. P., Salimans, T., and Welling, M. (2016). Improving variational inference with inverse autoregressive flow. CoRR, abs/1606.04934.

Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of features from tiny images.

Lillicrap, T. P., Cownden, D., Tweed, D. B., and Akerman, C. J. (2014). Random feedback weights support learning in deep neural networks. arXiv:1411.0247.

Liu, Z., Luo, P., Wang, X., and Tang, X. (2015). Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, pages 3730–3738.

Markram, H. and Sakmann, B. (1995). Action potentials propagating back into dendrites trigger changes in efficacy. Soc. Neurosci. Abs, 21.

Neal, R. M. (2001). Annealed importance sampling. Statistics and Computing, 11(2):125–139.

Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A.
Y. (2011). Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, volume 2011, page 5.

Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082.

Salakhutdinov, R. and Hinton, G. (2009). Deep Boltzmann machines. In Artificial Intelligence and Statistics.

Schmiedl, T. and Seifert, U. (2007). Optimal finite-time processes in stochastic thermodynamics. Physical Review Letters, 98(10):108301.

Sivak, D. A. and Crooks, G. E. (2012). Thermodynamic metrics and optimal paths. Physical Review Letters, 108(19):190602.

Sohl-Dickstein, J., Weiss, E. A., Maheswaranathan, N., and Ganguli, S. (2015). Deep unsupervised learning using nonequilibrium thermodynamics. CoRR, abs/1503.03585.

Sønderby, C. K., Raiko, T., Maaløe, L., Sønderby, S. K., and Winther, O. (2016). Ladder variational autoencoders. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 3738–3746.

Theis, L., van den Oord, A., and Bethge, M. (2016). A note on the evaluation of generative models. In International Conference on Learning Representations.

Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. (2008). Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pages 1096–1103. ACM.

Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. (2010). Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. J.
Machine Learning Res., 11.