{"title": "GibbsNet: Iterative Adversarial Inference for Deep Graphical Models", "book": "Advances in Neural Information Processing Systems", "page_first": 5089, "page_last": 5098, "abstract": "Directed latent variable models that formulate the joint distribution as $p(x,z) = p(z) p(x \\mid z)$ have the advantage of fast and exact sampling. However, these models have the weakness of needing to specify $p(z)$, often with a simple fixed prior that limits the expressiveness of the model.  Undirected latent variable models discard the requirement that $p(z)$ be specified with a prior, yet sampling from them generally requires an iterative procedure such as blocked Gibbs-sampling that may require many steps to draw samples from the joint distribution $p(x, z)$.  We propose a novel approach to learning the joint distribution between the data and a latent code which uses an adversarially learned iterative procedure to gradually refine the joint distribution, $p(x, z)$, to better match with the data distribution on each step.  GibbsNet is the best of both worlds both in theory and in practice.  Achieving the speed and simplicity of a directed latent variable model, it is guaranteed (assuming the adversarial game reaches the virtual training criteria global minimum) to produce samples from $p(x, z)$ with only a few sampling iterations.  Achieving the expressiveness and flexibility of an undirected latent variable model, GibbsNet does away with the need for an explicit $p(z)$ and has the ability to do attribute prediction, class-conditional generation, and joint image-attribute modeling in a single model which is not trained for any of these specific tasks.  
We show empirically that GibbsNet is able to learn a more complex $p(z)$ and show that this leads to improved inpainting and iterative refinement of $p(x, z)$ for dozens of steps and stable generation without collapse for thousands of steps, despite being trained on only a few steps.", "full_text": "GibbsNet: Iterative Adversarial Inference for Deep Graphical Models

Alex Lamb
R Devon Hjelm
Yaroslav Ganin
Joseph Paul Cohen
Aaron Courville
Yoshua Bengio

Abstract

Directed latent variable models that formulate the joint distribution as p(x, z) = p(z)p(x | z) have the advantage of fast and exact sampling. However, these models have the weakness of needing to specify p(z), often with a simple fixed prior that limits the expressiveness of the model. Undirected latent variable models discard the requirement that p(z) be specified with a prior, yet sampling from them generally requires an iterative procedure such as blocked Gibbs sampling that may require many steps to draw samples from the joint distribution p(x, z). We propose a novel approach to learning the joint distribution between the data and a latent code which uses an adversarially learned iterative procedure to gradually refine the joint distribution, p(x, z), to better match the data distribution on each step. GibbsNet is the best of both worlds, both in theory and in practice. Achieving the speed and simplicity of a directed latent variable model, it is guaranteed (assuming the adversarial game reaches the global minimum of the virtual training criterion) to produce samples from p(x, z) with only a few sampling iterations. Achieving the expressiveness and flexibility of an undirected latent variable model, GibbsNet does away with the need for an explicit p(z) and has the ability to do attribute prediction, class-conditional generation, and joint image-attribute modeling in a single model which is not trained for any of these specific tasks.
We show empirically that GibbsNet is able to learn a more complex p(z) and show that this leads to improved inpainting and iterative refinement of p(x, z) for dozens of steps, and stable generation without collapse for thousands of steps, despite being trained on only a few steps.

1 Introduction

Generative models are powerful tools for learning an underlying representation of complex data. While early undirected models, such as Deep Boltzmann Machines or DBMs (Salakhutdinov and Hinton, 2009), showed great promise, in practice they did not scale well to complicated high-dimensional settings (beyond MNIST), possibly because of optimization and mixing difficulties (Bengio et al., 2012). More recent work on Helmholtz machines (Bornschein et al., 2015) and on variational autoencoders (Kingma and Welling, 2013) borrows from deep learning tools and can achieve impressive results, having now been adopted in a large array of domains (Larsen et al., 2015).

Many of the important generative models available to us rely on a formulation involving stochastic latent or hidden variables along with a generative relationship to the observed data. Arguably the simplest are directed graphical models (such as the VAE) with a factorized decomposition p(z, x) = p(z)p(x | z). In these, it is typical to assume that p(z) follows some factorized prior with simple statistics (such as Gaussian). While sampling with directed models is simple, inference and learning tend to be difficult and often require advanced techniques such as approximate inference using a proposal distribution for the true posterior.

[Figure 1 diagram. Unclamped chain: z_0 ∼ N(0, I); x_i ∼ p(x | z_i); z_{i+1} ∼ q(z | x_i); ...; z_N ∼ q(z | x_{N−1}); x_N ∼ p(x | z_N). Clamped chain: x_data ∼ q(x); ẑ ∼ q(z | x_data). Both pairs are scored by the joint discriminator D(z, x).]

Figure 1: Diagram illustrating the training procedure for GibbsNet.
The unclamped chain (dashed box) starts with a sample from an isotropic Gaussian distribution N(0, I) and runs for N steps. The last step (iteration N), shown as a solid pink box, is then compared with a single step from the clamped chain (solid blue box) using the joint discriminator D.

The other dominant family of graphical models is undirected graphical models, in which the joint is represented by a product of clique potentials and a normalizing factor. It is common to assume that the clique potentials are positive, so that the un-normalized density can be represented by an energy function E, and the joint is represented by p(x, z) = e^{−E(z, x)}/Z, where Z is the normalizing constant or partition function. These so-called energy-based models (of which the Boltzmann machine is an example) are potentially very flexible and powerful, but are difficult to train in practice and do not seem to scale well. Note also that in such models, the marginal p(z) can have a very rich form (as rich as that of p(x)).

The methods above rely on a fully parameterized joint distribution (and an approximate posterior in the case of directed models), trained with approximate maximum likelihood estimation (MLE, Dempster et al., 1977). Recently, generative adversarial networks (GANs, Goodfellow et al., 2014) have provided a likelihood-free approach to generative modeling that yields an implicit distribution unconstrained by density assumptions on the data. In comparison to MLE-based latent variable methods, generated samples can be of very high quality (Radford et al., 2015) and do not suffer from well-known problems associated with parameterizing noise in the observation space (Goodfellow, 2016).
Recently, there have been advances in incorporating latent variables in generative adversarial networks in a way reminiscent of Helmholtz machines (Dayan et al., 1995), such as adversarially learned inference (Dumoulin et al., 2017; Donahue et al., 2017) and implicit variational inference (Huszár, 2017).

These models, being essentially complex directed graphical models, rely on approximate inference to train. While potentially powerful, there is good evidence that using an approximate posterior necessarily limits the generator in practice (Hjelm et al., 2016; Rezende and Mohamed, 2015). In contrast, it would perhaps be more appropriate to start with inference (encoder) and generative (decoder) processes and derive the prior directly from these processes. This approach, which we call GibbsNet, uses these two processes to define the transition operator of a Markov chain similar to Gibbs sampling, alternating between sampling observations and sampling latent variables. This is similar to the previously proposed generative stochastic networks (GSNs, Bengio et al., 2013) but with a GAN training framework rather than minimizing reconstruction error. By training a discriminator to place a decision boundary between the data-driven distribution (with x clamped) and the free-running model (which alternates between sampling x and z), we are able to train the model so that the two joint distributions over (x, z) match. This approach is similar to Gibbs sampling in undirected models, yet, like traditional GANs, it lacks the strong parametric constraints, i.e., there is no explicit energy function. While losing some of the theoretical simplicity of undirected models, we gain great flexibility and ease of training. In summary, our method offers the following contributions:

• We introduce the theoretical foundation for a novel approach to learning and performing inference in deep graphical models.
The resulting model is similar to undirected graphical models, but avoids the need for MLE-based training and also lacks an explicitly defined energy, instead being trained with a GAN-like discriminator.

• We present a stable way of performing inference in the adversarial framework, meaning that useful inference is performed under a wide range of architectures for the encoder and decoder networks. This stability comes from the fact that the encoder q(z | x) appears in both the clamped and the unclamped chain, so it gets its training signal from both the discriminator in the clamped chain and the gradient in the unclamped chain.

• We show improvements in the quality of the latent space over models which use a simple prior for p(z). This manifests itself in improved conditional generation. The expressiveness of the latent space is also demonstrated in cleaner inpainting, smoother mixing when running blocked Gibbs sampling, and better separation between classes in the inferred latent space.

• Our model has the flexibility of undirected graphical models, including the ability to do label prediction, class-conditional generation, and joint image-label generation in a single model which is not explicitly trained for any of these specific tasks. To our knowledge, our model is the first to combine this flexibility with the ability to produce high-quality samples on natural images.

2 Proposed Approach: GibbsNet

The goal of GibbsNet is to train a graphical model whose transition operators are defined and learned directly, by matching the joint distribution of the free-running model with the joint distribution obtained when the observations are clamped to data.
This is analogous to and inspired by undirected graphical models, in which the transition operators corresponding to blocked Gibbs sampling move along a defined energy manifold; we will make this connection throughout our formulation. We first explain GibbsNet in the simplest case, where the graphical model consists of a single layer of observed units and a single layer of latent variables, with stochastic mappings from one to the other parameterized by arbitrary neural networks. Like Professor Forcing (Lamb et al., 2016), GibbsNet uses a GAN-like discriminator to make two distributions match: one corresponding to the model iteratively sampling both observations, x, and latent variables, z (free-running), and one corresponding to the same generative model but with the observations, x, clamped. The free-running generator is analogous to Gibbs sampling in Restricted Boltzmann Machines (RBM, Hinton et al., 2006) or Deep Boltzmann Machines (DBM, Salakhutdinov and Hinton, 2009). In the simplest case, the free-running generator is defined by conditional distributions q(z|x) and p(x|z) which stochastically map back and forth between data space x and latent space z.

To begin our free-running process, we start the chain with a latent variable sampled from a normal distribution, z ∼ N(0, I), and follow this with N steps of alternating between sampling from p(x|z) and q(z|x). For the clamped version, we do simple ancestral sampling from q(z|x), given x_data drawn from the data distribution q(x) (a training example). When the model has more layers (e.g., a hierarchy of layers with stochastic latent variables, à la DBM), the data-driven model also needs to iterate to correctly sample from the joint. While this situation highly resembles that of undirected graphical models, GibbsNet is trained adversarially so that its free-running generative states become indistinguishable from its data-driven states.
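The two chains can be sketched as follows. This is a minimal numpy sketch, not the paper's implementation: the linear-plus-noise maps stand in for the deep stochastic networks q(z|x) and p(x|z), and discriminator training is only indicated in comments.

```python
import numpy as np

rng = np.random.default_rng(0)
x_dim, z_dim, n_steps = 8, 2, 3

# Toy stochastic conditionals: linear maps plus Gaussian noise stand in
# for the deep encoder q(z|x) and decoder p(x|z).
W_enc = rng.normal(size=(z_dim, x_dim)) / np.sqrt(x_dim)
W_dec = rng.normal(size=(x_dim, z_dim)) / np.sqrt(z_dim)

def q_z_given_x(x):
    return W_enc @ x + 0.1 * rng.normal(size=z_dim)

def p_x_given_z(z):
    return W_dec @ z + 0.1 * rng.normal(size=x_dim)

def unclamped_pair(n_steps):
    """Free-running chain: z_0 ~ N(0, I), then alternate p(x|z) and q(z|x).

    Returns the final (x, z) pair, which is what the joint discriminator sees.
    """
    z = rng.normal(size=z_dim)
    x = p_x_given_z(z)
    for _ in range(n_steps - 1):
        z = q_z_given_x(x)
        x = p_x_given_z(z)
    return x, z

def clamped_pair(x_data):
    """Data-driven chain: a single ancestral step z ~ q(z | x_data)."""
    return x_data, q_z_given_x(x_data)

# A joint discriminator D(x, z) would be trained to separate these two pairs,
# and the encoder/decoder trained to make them indistinguishable.
x_free, z_free = unclamped_pair(n_steps)
x_real, z_real = clamped_pair(rng.normal(size=x_dim))
```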
In addition, while in principle undirected graphical models need to either start their chains from data or sample a very large number of steps, we find in practice that GibbsNet requires only a very small number of steps (on the order of 3 to 5 with very complex datasets) from noise.

An example of the free-running (unclamped) chain can be seen in Figure 2. An interesting aspect of GibbsNet is that we found it was enough, and in fact experimentally best, to back-propagate discriminator gradients through only a single step of the iterative procedure, yielding more stable training. An intuition for why this helps is that each step of the procedure is supposed to generate increasingly realistic samples; if we passed gradients through the whole iterative procedure, the gradient could encourage earlier steps to store features which have downstream value instead of immediately realistic x-values.

Figure 2: Evolution of samples for 20 iterations from the unclamped chain, trained on the SVHN dataset, starting on the left and ending on the right.

2.1 Theoretical Analysis

We consider the simple case of an undirected graph with single layers of visible and latent units, trained with alternating 2-step (p then q) unclamped chains, in the asymptotic scenario where the GAN objective is properly optimized. We then ask the following questions: in spite of training for a bounded number of Markov chain steps, are we learning a transition operator? Are the encoder and decoder estimating compatible conditionals associated with the stationary distribution of that transition operator? We find positive answers to both questions.

A high-level explanation of our argument is that if the discriminator is fooled, then the consecutive (z, x) pairs from the chain match the data-driven (z, x) pair. Because the two marginals on x from these two distributions match, we can show that the next z in the chain again forms the same joint distribution.
Similarly, we can show that the next x in the chain also forms the same joint with the previous z. Because the state only depends on the previous value of the chain (as it is Markov), all following steps of the chain will also match the clamped distribution. This explains the result, validated experimentally, that even though we train for just a few steps, we can generate high-quality samples for thousands or more steps.

Proposition 1. If (a) the stochastic encoder q(z|x) and stochastic decoder p(x|z) inject noise such that the transition operator defined by their composition (p followed by q, or vice versa) allows for all possible x-to-x or z-to-z transitions (x → z → x or z → x → z), and if (b) the GAN objectives are properly trained in the sense that the discriminator is fooled in spite of having sufficient capacity and training time, then (1) the Markov chain which alternates the stochastic encoder followed by the stochastic decoder as its transition operator T (or vice versa) has the data-driven distribution π_D as its stationary distribution π_T, and (2) the two conditionals q(z|x) and p(x|z) converge to compatible conditionals associated with the joint π_D = π_T.

Proof. When the stochastic decoder and encoder inject noise so that their composition forms a transition operator T with paths of non-zero probability from any state to any other state, then T is ergodic. So condition (a) implies that T has a stationary distribution π_T. The properly trained GAN discriminators for each of these two steps (condition (b)) force the matching of the distributions of the pairs (z_t, x_t) (from the generative trajectory) and (x, z), with x ∼ q(x), the data distribution, and z ∼ q(z | x), both pairs converging to the same data-driven distribution π_D. Because (z_t, x_t) has the same joint distribution as (z, x), it means that x_t has the same distribution as x.
Since z ∼ q(z | x), when we apply q to x_t we get z_{t+1}, which must form a joint (z_{t+1}, x_t) that has the same distribution as (z, x). Similarly, since we just showed that z_{t+1} has the same distribution as z, and thus the same as z_t, if we apply p to z_{t+1} we get x_{t+1}, and the joint (z_{t+1}, x_{t+1}) must have the same distribution as (z, x). Because the two pairs (z_t, x_t) and (z_{t+1}, x_{t+1}) have the same joint distribution π_D, the transition operator T, which maps samples (z_t, x_t) to samples (z_{t+1}, x_{t+1}), maps π_D to itself, i.e., π_D = π_T is both the data distribution and the stationary distribution of T, and result (1) is obtained. Now consider the "odd" pairs (z_{t+1}, x_t) and (z_{t+2}, x_{t+1}) in the generated sequences. Because of (1), x_t and x_{t+1} have the same marginal distribution π_D(x). Thus, when we apply the same q(z|x) to these x's, we obtain that (z_{t+1}, x_t) and (z_{t+2}, x_{t+1}) also have the same distribution. Following the same reasoning as for proving (1), we conclude that the associated transition operator T_odd also has π_D as its stationary distribution. So starting from z ∼ π_D(z) and applying p(x | z) gives an x such that the pair (z, x) has π_D as its joint distribution, i.e., π_D(z, x) = π_D(z) p(x | z). This means that p(x | z) = π_D(x, z)/π_D(z) is the x | z conditional of π_D. Since (z_t, x_t) also converges to the joint distribution π_D, we can apply the same argument starting from an x ∼ π_D(x) followed by q, and obtain that π_D(z, x) = π_D(x) q(z | x), so that q(z | x) = π_D(x, z)/π_D(x) is the z | x conditional of π_D. This proves result (2).

2.2 Architecture

GibbsNet always involves three networks: the inference network q(z|x), the generation network p(x|z), and the joint discriminator. In general, our architecture for these networks closely follows Dumoulin et al.
(2017), except that we use the boundary-seeking GAN (BGAN, Hjelm et al., 2017), as it explicitly optimizes matching between the opposing distributions (in this case, the model expectation and the data-driven joint distributions), allows us to use discrete variables where we consider learning graphs with labels or discrete attributes, and worked well across our experiments.

3 Related Work

Energy Models and Deep Boltzmann Machines The training and sampling procedure for generating from GibbsNet is very similar to that of a deep Boltzmann machine (DBM, Salakhutdinov and Hinton, 2009): both involve blocked Gibbs sampling between observation- and latent-variable layers. A major difference is that in a deep Boltzmann machine, the "decoder" p(x|z) and "encoder" p(z|x) exactly correspond to conditionals of a joint distribution p(x, z), which is parameterized by an energy function. This, in turn, puts strong constraints on the forms of the encoder and decoder.

In a restricted Boltzmann machine (RBM, Hinton, 2010), the visible units are conditionally independent given the hidden units on the adjacent layer, and likewise the hidden units are conditionally independent given the visible units. This may force the layers close to the data to be nearly deterministic, which could cause poor mixing and thus make learning difficult. These conditional independence assumptions in RBMs and DBMs have been discussed before in the literature as a potential weakness of these models (Bengio et al., 2012).

In our model, p(x|z) and q(z|x) are modeled by separate deep neural networks with no shared parameters.
The disadvantage is that the networks are over-parameterized, but this brings the added flexibility that these conditionals can be much deeper, can take advantage of all the recent advances in deep architectures, and have fewer conditional independence assumptions than DBMs and RBMs.

Generative Stochastic Networks Like GibbsNet, generative stochastic networks (GSNs, Bengio et al., 2013) also directly parameterize the transition operator of a Markov chain using deep neural networks. However, GSNs and GibbsNet have completely different training procedures. In GSNs, the training procedure is based on an objective similar to that of de-noising autoencoders (Vincent et al., 2008). GSNs begin by drawing a sample from the data, iteratively corrupting it, and then learning a transition operator which de-noises it (i.e., reverses the corruption), so that the reconstruction after k steps is brought closer to the original un-corrupted input.

In GibbsNet, there is no corruption in the visible space, and the learning procedure never involves "walk-back" (de-noising) towards a real data point. Instead, the processes from and to data are modeled by different networks, with the constraint that the marginal p(x) matches the real distribution imposed through the GAN loss on the joint distributions from the clamped and unclamped phases.

Non-Equilibrium Thermodynamics The Non-Equilibrium Thermodynamics method (Sohl-Dickstein et al., 2015) learns a reverse diffusion process against a forward diffusion process which starts from real data points and gradually injects noise until the data distribution matches an analytically tractable, simple distribution.
This is similar to GibbsNet in that generation involves a stochastic process initialized from noise, but differs in that Non-Equilibrium Thermodynamics is trained using MLE and relies on noising and reversal for training, similar to GSNs above.

Generative Adversarial Learning of Markov Chains The Adversarial Markov Chain algorithm (AMC, Song et al., 2017) learns a Markov chain over the data distribution in the visible space. GibbsNet and AMC are related in that they both involve adversarial training and an iterative procedure for generation. However, there are major differences. GibbsNet learns deep graphical models with latent variables, whereas the AMC method learns a transition operator directly in the visible space. The AMC approach involves running chains which start from real data points and repeatedly apply the transition operator, which is different from the clamped chain used in GibbsNet. The experiments shown in Figure 3 demonstrate that giving the latent variables to the discriminator in our method has a significant impact on inference.

Adversarially Learned Inference (ALI) Adversarially learned inference (ALI, Dumoulin et al., 2017) uses a discriminator to match the generative and inference distributions, p(x, z) and q(x, z) (which can be thought of as forward and backward models), so that p(z)p(x | z) = q(x)q(z | x). In the single-latent-layer case, GibbsNet also has forward and reverse models, p(x | z) and q(z | x). The unclamped chain is sampled as p(z), p(x | z), q(z | x), p(x | z), ..., and the clamped chain is sampled as q(x), q(z | x). We then adversarially encourage the clamped chain to match the equilibrium distribution of the unclamped chain. When the number of iterations is set to N = 1, GibbsNet reduces to ALI.
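The reduction can be made concrete with a small sketch (toy linear-plus-noise conditionals standing in for the real networks, not the paper's code): with a single step, the unclamped pair is exactly ALI's generative pair (z from the fixed prior, x decoded once), while extra steps re-encode and re-decode so the final z is no longer a draw from the prior.

```python
import numpy as np

rng = np.random.default_rng(1)
x_dim, z_dim = 6, 2
W_enc = rng.normal(size=(z_dim, x_dim)) / np.sqrt(x_dim)
W_dec = rng.normal(size=(x_dim, z_dim)) / np.sqrt(z_dim)

def q(x):  # toy stochastic encoder q(z|x)
    return W_enc @ x + 0.1 * rng.normal(size=z_dim)

def p(z):  # toy stochastic decoder p(x|z)
    return W_dec @ z + 0.1 * rng.normal(size=x_dim)

def unclamped_chain(n_steps):
    """Sample p(z), p(x|z), q(z|x), p(x|z), ... and record every (x, z) pair."""
    z = rng.normal(size=z_dim)   # z_0 from the fixed N(0, I) prior
    pairs = []
    for _ in range(n_steps):
        x = p(z)
        pairs.append((x, z))
        z = q(x)                 # re-encode: only reached when n_steps > 1
    return pairs

ali_pairs = unclamped_chain(1)   # N = 1: exactly ALI's pair (z ~ p(z), x ~ p(x|z))
gibbs_pairs = unclamped_chain(3) # N = 3: the final z has been re-inferred twice
```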
However, in the general setting of N > 1, GibbsNet should learn a richer representation than ALI, as the prior, p(z), is no longer forced to be the simple one used at the beginning of the unclamped phase.

4 Experiments and Results

The goal of our experiments is to explore and give insight into the joint distribution p(x, z) learned by GibbsNet and to understand how this joint distribution evolves over the course of the iterative inference procedure. Since ALI is identical to GibbsNet when the number of iterative inference steps is N = 1, results obtained with ALI serve as an informative baseline.

From our experiments, the clearest result (covered in detail below) is that the p(z) obtained with GibbsNet can be more complex than in ALI (or other directed graphical models). This is demonstrated directly in experiments with 2-D latent spaces and indirectly by improvements in classification when directly using the variables q(z | x). We achieve strong improvements over ALI using GibbsNet even when q(z | x) has exactly the same architecture in both models.

We also show that GibbsNet allows for gradual refinement of the joint, (x, z), in the sampling chain q(z | x), p(x | z). This is a result of the sampling chain making small steps towards the equilibrium distribution, which allows GibbsNet to gradually improve sampling quality when running for many iterations. Additionally, it allows for inpainting and conditional generation where the conditioning information is not fixed during training, and indeed where the model is not trained specifically for these tasks.

4.1 Expressiveness of GibbsNet's Learned Latent Variables

Latent structure of GibbsNet The latent variables from q(z | x) learned by GibbsNet are more expressive than those learned with ALI. We show this in two ways. First, we train a model on the MNIST digits 0, 1, and 9 with a 2-D latent space, which allows us to easily visualize inference.
As seen in Figure 3, GibbsNet is able to learn a latent space which is not Gaussian and which has a structure that makes the different classes well separated.

Semi-supervised learning Following from this, we show that the latent variables learned by GibbsNet are better for classification. The goal here is not to show state-of-the-art classification results, but instead to show that the requirement that p(z) be something simple (like a Gaussian, as in ALI) is undesirable, as it forces the latent space to be filled. This means that different classes need to be packed closely together in the latent space, which makes it hard for such a latent space to maintain the class during inference and reconstruction.

We evaluate this property on two datasets: Street View House Numbers (SVHN, Netzer et al., 2011) and permutation-invariant MNIST. In both cases we use the latent features q(z | x) directly from a trained model and train a 2-layer MLP on top of the latent variables, without passing the classifier's gradient through to q(z | x). ALI and GibbsNet were trained for the same amount of time and with exactly the same architecture for the discriminator, the generative network, p(x | z), and the inference network, q(z | x).

On permutation-invariant MNIST, ALI achieves 91% test accuracy and GibbsNet achieves 97.7% test accuracy. On SVHN, ALI achieves 66.7% test accuracy and GibbsNet achieves 79.6% test accuracy. This does not demonstrate a competitive classifier in either case, but rather demonstrates that the latent space inferred by GibbsNet keeps more information about its input image than the encoder learned by ALI.
This is consistent with the reported ALI reconstructions (Dumoulin et al., 2017) on SVHN, where the reconstructed image and the input image show the same digit roughly half of the time.

We found that the ineffectiveness of ALI's inferred latent variables for classification is a fairly robust result that holds across a variety of architectures for the inference network. For example, with 1024 units, we varied the number of fully-connected layers in ALI's inference network between 2 and 8 and found that the classification accuracies on the MNIST validation set ranged from 89.4% to 91.0%. Using 6 layers with 2048 units on each layer and a 256-dimensional latent prior achieved 91.2% accuracy. This suggests that the weak performance of the latent variables for classification is due to ALI's prior, and probably not due to a lack of capacity in the inference network.

Figure 3: Illustration of the distribution over inferred latent variables for real data points from the MNIST digits (0, 1, 9), learned with different models trained for roughly the same amount of time: GibbsNet with a deterministic decoder and the latent variables not given to the discriminator (a); GibbsNet with a stochastic decoder and the latent variables not given to the discriminator (b); ALI (c); GibbsNet with a deterministic decoder (f); GibbsNet with a stochastic decoder, two different runs (g and h); GibbsNet with a stochastic decoder's inferred latent states in an unclamped chain at 1, 2, 3, and 15 steps (d, e, i, and j, respectively). Note that we continue to see refinement in the marginal distribution of z when running for far more steps (15 steps) than we used during training (3 steps).

4.2 Inception Scores

The GAN literature is limited in terms of quantitative evaluation, with none of the existing techniques (such as inception scores) being satisfactory (Theis et al., 2015).
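For reference, the inception score is the standard quantity exp(E_x[KL(p(y|x) ‖ p(y))]) computed from a pretrained classifier's class posteriors (Salimans et al., 2016). A minimal numpy version of the formula follows; the toy probability matrices are illustrative inputs, not model outputs.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """exp(mean_x KL(p(y|x) || p(y))) from an (n_samples, n_classes) matrix
    of class posteriors produced by a pretrained classifier."""
    p_y = probs.mean(axis=0, keepdims=True)  # marginal p(y) over the sample set
    kl = np.sum(probs * (np.log(probs + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))

# Confident and diverse posteriors score high; uniform posteriors score 1.
confident = np.eye(10)[np.arange(100) % 10]  # one-hot, balanced over 10 classes
uniform = np.full((100, 10), 0.1)
```

With these inputs the score spans its full range: the uniform matrix gives exactly 1, and the balanced one-hot matrix approaches the number of classes (10).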
Nonetheless, we computed inception scores on CIFAR-10 using the standard method and code released by Salimans et al. (2016). In our experiments, we compared the inception scores of samples from GibbsNet and ALI on two tasks: generation and inpainting.

Our conclusion from the inception scores (Table 1) is that GibbsNet slightly improves sample quality but greatly improves the expressiveness of the latent space z, which leads to more detail being preserved in the inpainting chain and a much larger improvement in inception scores in that setting. The supplementary materials include examples of sampling and inpainting chains for both ALI and GibbsNet, which show differences in sampling and inpainting quality that are consistent with the inception scores.

Table 1: Inception scores for different models. Inpainting results were achieved by fixing the left half of the image while running the chain for four steps. Sampling refers to unconditional sampling.

Source          Inpainting   Samples
Real Images     11.24        11.24
ALI (ours)      5.59         5.41
ALI (Dumoulin)  N/A          5.34
GibbsNet        6.15         5.69

Figure 4: CIFAR samples from methods which learn transition operators.
Non-Equilibrium Thermodynamics (Sohl-Dickstein et al., 2015) after 1000 steps (left) and GibbsNet after 20 steps (right).

4.3 Generation, Inpainting, and Learning the Image-Attribute Joint Distribution

Generation Here, we compare generation on the CIFAR dataset against the Non-Equilibrium Thermodynamics method (Sohl-Dickstein et al., 2015), which also begins its sampling procedure from noise. We show in Figure 4 that, even with a relatively small number of steps (20) in its sampling procedure, GibbsNet outperforms the Non-Equilibrium Thermodynamics approach in sample quality, even when the latter uses many more steps (1000).

Inpainting The inpainting that can be done with the transition operator in GibbsNet is stronger than what can be done with an explicit conditional generative model, such as a conditional GAN, which is only suited to inpainting when the conditioning information is known during training or there is a strong prior over what types of conditioning will be performed at test time. We show here that GibbsNet performs more consistent and higher-quality inpainting than ALI, even when the two networks share exactly the same architecture for p(x | z) and q(z | x) (Figure 5), which is consistent with our results on latent structure above.

Joint generation Finally, we show that GibbsNet is able to learn the joint distribution between face images and their attributes (CelebA, Liu et al., 2015) (Figure 6). In this case, q(z | x, y) (where y is the attribute vector) is a network that takes both the image and the attributes, processing the two modalities separately before joining them into one network. p(x, y | z) is one network that splits into two networks to predict the modalities separately.
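The split decoder just described can be sketched as follows; this is a hypothetical minimal version with linear heads and illustrative dimensions, standing in for the deep networks used in the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
z_dim, x_dim, n_attrs = 16, 32, 40

# One shared latent feeds two heads: a continuous image head and a
# Bernoulli attribute head (toy linear maps, not the paper's architecture).
W_img = rng.normal(size=(x_dim, z_dim)) / np.sqrt(z_dim)
W_att = rng.normal(size=(n_attrs, z_dim)) / np.sqrt(z_dim)

def p_xy_given_z(z):
    """Sample both modalities from a single split decoder p(x, y | z)."""
    x = W_img @ z                                        # image branch
    attr_probs = 1.0 / (1.0 + np.exp(-(W_att @ z)))      # per-attribute Bernoulli params
    y = (rng.random(n_attrs) < attr_probs).astype(int)   # sampled binary attributes
    return x, y

z = rng.normal(size=z_dim)
x, y = p_xy_given_z(z)
```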
Training was done with the continuous boundary-seeking GAN (BGAN; Hjelm et al., 2017) on the image side (as in our other experiments) and the discrete BGAN on the attribute side, an importance-sampling-based technique for training GANs with discrete data.

5 Conclusion

We have introduced GibbsNet, a powerful new model for performing iterative inference and generation in deep graphical models. Although models like the RBM and the GSN have become less investigated in recent years, their theoretical properties are worth pursuing, and we follow the theoretical motivations here using a GAN-like objective. With a training and sampling procedure that is closely related to undirected graphical models, GibbsNet is able to learn a joint distribution which converges in a very small number of steps of its Markov chain, and with no requirement that the marginal p(z) match a simple prior. We prove that at convergence of training, in spite of unrolling only a few steps of the chain during training, we obtain a transition operator whose stationary distribution also matches the data and makes the conditionals p(x | z) and q(z | x) consistent with that unique joint stationary distribution. We show that this allows the prior, p(z), to be shaped into a complicated distribution (rather than a simple one, e.g., a spherical Gaussian) in which different classes have representations that are easily separable in the latent space. This leads to improved classification when the inferred latent variables q(z | x) are used directly. 
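The unclamped sampling procedure implied by this transition operator is the same alternation without any clamping. As a hedged sketch (again with `p_x_given_z` and `q_z_given_x` as placeholders for the trained networks, not the paper's actual code):

```python
import numpy as np

def sample_chain(p_x_given_z, q_z_given_x, z_dim, n_steps=20, rng=None):
    """Unclamped GibbsNet-style chain: start the latent from Gaussian noise,
    then alternate x ~ p(x | z) and z ~ q(z | x). No explicit prior p(z) is
    ever specified; the chain's stationary distribution defines the joint."""
    if rng is None:
        rng = np.random.default_rng(0)
    z = rng.standard_normal(z_dim)  # initial noise
    xs = []
    for _ in range(n_steps):
        x = p_x_given_z(z)          # generation step
        z = q_z_given_x(x)          # inference step
        xs.append(x)
    return xs, z
```

The empirical observation in the paper is that, although only a few such steps are unrolled during training, the chain remains stable when run for far more steps at test time.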
Finally, we show that GibbsNet's flexible prior produces a flexible model which can simultaneously perform inpainting, conditional image generation, and prediction with a single model not explicitly trained for any of these specific tasks, outperforming a competitive ALI baseline with the same setup.

(a) SVHN inpainting after 20 steps (ALI). (b) SVHN inpainting after 20 steps (GibbsNet).

Figure 5: Inpainting results on SVHN, where the right side is given and the left side is inpainted. In both cases the training procedure did not consider the inpainting or conditional generation task at all; inpainting is done by repeatedly applying the transition operator and clamping the right side of the image to its observed value. GibbsNet's richer latent space allows the transition operator to keep more of the structure of the input image, allowing for tighter inpainting.

Figure 6: Demonstration of learning the joint distribution between images and a list of 40 binary attributes. Attributes (right) are generated from a multinomial distribution as part of the joint with the image (left).

References

Bengio, Y., Mesnil, G., Dauphin, Y., and Rifai, S. (2012). Better mixing via deep representations. CoRR, abs/1207.4404.

Bengio, Y., Thibodeau-Laufer, E., and Yosinski, J. (2013). Deep generative stochastic networks trainable by backprop. CoRR, abs/1306.1091.

Bornschein, J., Shabanian, S., Fischer, A., and Bengio, Y. (2015). Training opposing directed models using geometric mean matching. CoRR, abs/1506.03877.

Dayan, P., Hinton, G. E., Neal, R. M., and Zemel, R. S. (1995). The Helmholtz machine. Neural Computation, 7(5):889–904.

Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. 
Series B (Methodological), pages 1–38.

Donahue, J., Krähenbühl, P., and Darrell, T. (2017). Adversarial feature learning. In Proceedings of the International Conference on Learning Representations (ICLR). arXiv:1605.09782.

Dumoulin, V., Belghazi, I., Poole, B., Lamb, A., Arjovsky, M., Mastropietro, O., and Courville, A. (2017). Adversarially learned inference. In Proceedings of the International Conference on Learning Representations (ICLR). arXiv:1606.00704.

Goodfellow, I. (2016). NIPS 2016 tutorial: Generative adversarial networks. arXiv preprint arXiv:1701.00160.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680.

Hinton, G. (2010). A practical guide to training restricted Boltzmann machines. Momentum, 9(1):926.

Hinton, G. E., Osindero, S., and Teh, Y.-W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554.

Hjelm, D., Salakhutdinov, R. R., Cho, K., Jojic, N., Calhoun, V., and Chung, J. (2016). Iterative refinement of the approximate posterior for directed belief networks. In Advances in Neural Information Processing Systems, pages 4691–4699.

Hjelm, R. D., Jacob, A. P., Che, T., Cho, K., and Bengio, Y. (2017). Boundary-seeking generative adversarial networks. arXiv preprint arXiv:1702.08431.

Huszár, F. (2017). Variational inference using implicit distributions. arXiv e-prints.

Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

Lamb, A., Goyal, A., Zhang, Y., Zhang, S., Courville, A., and Bengio, Y. (2016). Professor forcing: A new algorithm for training recurrent networks. Neural Information Processing Systems (NIPS) 2016.

Larsen, A. B. L., Sønderby, S. K., and Winther, O. (2015). 
Autoencoding beyond pixels using a learned similarity metric. CoRR, abs/1512.09300.

Liu, Z., Luo, P., Wang, X., and Tang, X. (2015). Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV).

Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A. Y. (2011). Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, volume 2011, page 5.

Radford, A., Metz, L., and Chintala, S. (2015). Unsupervised representation learning with deep convolutional generative adversarial networks. CoRR, abs/1511.06434.

Rezende, D. J. and Mohamed, S. (2015). Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770.

Salakhutdinov, R. and Hinton, G. (2009). Deep Boltzmann machines. In Artificial Intelligence and Statistics, pages 448–455.

Salimans, T., Goodfellow, I. J., Zaremba, W., Cheung, V., Radford, A., and Chen, X. (2016). Improved techniques for training GANs. CoRR, abs/1606.03498.

Sohl-Dickstein, J., Weiss, E. A., Maheswaranathan, N., and Ganguli, S. (2015). Deep unsupervised learning using nonequilibrium thermodynamics. CoRR, abs/1503.03585.

Song, J., Zhao, S., and Ermon, S. (2017). Generative adversarial learning of Markov chains. ICLR Workshop Track.

Theis, L., van den Oord, A., and Bethge, M. (2015). A note on the evaluation of generative models. arXiv e-prints.

Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. (2008). Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pages 1096–1103. 
ACM.