{"title": "An Architecture for Deep, Hierarchical Generative Models", "book": "Advances in Neural Information Processing Systems", "page_first": 4826, "page_last": 4834, "abstract": "We present an architecture which lets us train deep, directed generative models with many layers of latent variables. We include deterministic paths between all latent variables and the generated output, and provide a richer set of connections between computations for inference and generation, which enables more effective communication of information throughout the model during training. To improve performance on natural images, we incorporate a lightweight autoregressive model in the reconstruction distribution. These techniques permit end-to-end training of models with 10+ layers of latent variables. Experiments show that our approach achieves state-of-the-art performance on standard image modelling benchmarks, can expose latent class structure in the absence of label information, and can provide convincing imputations of occluded regions in natural images.", "full_text": "An Architecture for Deep, Hierarchical Generative\n\nModels\n\nPhilip Bachman\n\nphil.bachman@maluuba.com\n\nMaluuba Research\n\nAbstract\n\nWe present an architecture which lets us train deep, directed generative models\nwith many layers of latent variables. We include deterministic paths between all\nlatent variables and the generated output, and provide a richer set of connections\nbetween computations for inference and generation, which enables more effective\ncommunication of information throughout the model during training. To improve\nperformance on natural images, we incorporate a lightweight autoregressive model\nin the reconstruction distribution. These techniques permit end-to-end training of\nmodels with 10+ layers of latent variables. Experiments show that our approach\nachieves state-of-the-art performance on standard image modelling benchmarks,\ncan expose latent class structure in the absence of label information, and can\nprovide convincing imputations of occluded regions in natural images.\n\n1\n\nIntroduction\n\nTraining deep, directed generative models with many layers of latent variables poses a challenging\nproblem. Each layer of latent variables introduces variance into gradient estimation which, given\ncurrent training methods, tends to impede the \ufb02ow of subtle information about sophisticated structure\nin the target distribution. Yet, for a generative model to learn effectively, this information needs to\npropagate from the terminal end of a stochastic computation graph, back to latent variables whose\neffect on the generated data may be obscured by many intervening sampling steps.\nOne approach to solving this problem is to use recurrent, sequential stochastic generative processes\nwith strong interactions between their inference and generation mechanisms, as introduced in the\nDRAW model of Gregor et al. [5] and explored further in [1, 19, 22]. Another effective technique is\nto use lateral connections for merging bottom-up and top-down information in encoder/decoder type\nmodels. This approach is exempli\ufb01ed by the Ladder Network of Rasmus et al. [17], and has been\ndeveloped further for, e.g. 
generative modelling and image processing in [8, 23].

Models like DRAW owe much of their success to two key properties: they decompose the process of generating data into many small steps of iterative refinement, and their structure includes direct deterministic paths between all latent variables and the final output. In parallel, models with lateral connections permit different components of a model to operate at well-separated levels of abstraction, thus generating a hierarchy of representations. This property is not explicitly shared by DRAW-like models, which typically reuse the same set of latent variables throughout the generative process. This makes it difficult for any of the latent variables, or steps in the generative process, to individually capture abstract properties of the data. We distinguish between the depth used by DRAW and the depth made possible by lateral connections by describing them respectively as sequential depth and hierarchical depth. These two types of depth are complementary, rather than competing.

Our contributions focus on increasing hierarchical depth without forfeiting trainability. We combine the benefits of DRAW-like models and Ladder Networks by developing a class of models which we call Matryoshka Networks (abbr. MatNets), due to their deeply nested structure. In Section 2, we present the general architecture of a MatNet. In the MatNet architecture we:

• Combine the ability of, e.g. LapGANs [3] and Diffusion Nets [21] to learn hierarchically-deep generative models with the power of jointly-trained inference/generation¹.

• Use lateral connections, shortcut connections, and residual connections [7] to provide direct paths through the inference network to the latent variables, and from the latent variables to the generated output; this makes hierarchically-deep models easily trainable in practice.

¹ A significant downside of LapGANs and Diffusion Nets is that they define their inference mechanisms a priori. This is computationally convenient, but prevents the model from learning abstract representations.

Section 2 also presents several extensions to the core architecture including: mixture-based prior distributions, a method for regularizing inference to prevent overfitting in practical settings, and a method for modelling the reconstruction distribution p(x|z) with a lightweight, local autoregressive model. In Section 3, we present experiments showing that MatNets offer state-of-the-art performance on standard benchmarks for modelling simple images and compelling qualitative performance on challenging imputation problems for natural images. Finally, in Section 4 we provide further discussion of related work and promising directions for future work.

2 The Matryoshka Network Architecture

Matryoshka Networks combine three components: a top-down network (abbr. TD network), a bottom-up network (abbr. BU network), and a set of merge modules which merge information from the BU and TD networks. In the context of stochastic variational inference [10], all three components contribute to the approximate posterior distributions used during inference/training, but only the TD network participates in generation. We first describe the MatNet model formally, and then provide a procedural description of its three components. The full architecture is summarized in Fig. 1.

Figure 1: (a) The overall structure of a Matryoshka Network, and how information flows through the network during training. First, we perform a feedforward pass through the bottom-up network to generate a sequence of BU states. Next, we sample the initial latent variables conditioned on the final BU state. We then begin a stochastic feedforward pass through the top-down network. Whenever this feedforward pass requires sampling some latent variables, we get the sampling distribution by passing the corresponding TD and BU states through a merge module. This module draws conditional samples of the latent variables via reparametrization [10]. These latent samples are then combined with the current TD state, and the feedforward pass continues. Intuitively, this approach allows the TD network to invert the bottom-up network by tracking back along its intermediate states, and eventually recover its original input. (b) Detailed view of a merge module from the network in (a). This module stacks the relevant BU, TD, and merge states on top of each other, and then passes them through a convolutional residual module, as described in Eqn. 10. The output has three parts: the first provides means for the latent variables, the second provides their log-variances, and the third conveys updated state information to subsequent merge modules.
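As a concrete illustration of the flow described in the caption above, the following Python sketch organizes one inference-time pass through a MatNet. The module interfaces (bu_modules, td_modules, merge_modules, z0_sampler) are hypothetical stand-ins chosen for readability; they are not the interfaces used in the released code.

def matnet_inference_pass(x, bu_modules, td_modules, merge_modules, z0_sampler):
    # Bottom-up pass: run the deterministic BU network and keep every state.
    bu_states = []
    h = x
    for f_b in bu_modules:               # ordered from the data toward the top
        h = f_b(h)
        bu_states.append(h)

    # Sample the top-most latents z_0 conditioned on the final BU state, and
    # initialize the top-down state from them.
    z0, kl_0, h_td = z0_sampler(bu_states[-1])
    kl_terms = [kl_0]

    # Stochastic top-down pass: each merge module compares the current TD state
    # with the matching BU state (walking back down the remaining BU states),
    # samples z_i by reparametrization, and the TD module folds z_i into the
    # deterministic TD state.
    h_merge = None
    for f_t, f_m, h_bu in zip(td_modules, merge_modules, reversed(bu_states[:-1])):
        z_i, kl_i, h_merge = f_m(h_bu, h_td, h_merge)
        h_td = f_t(h_td, z_i)
        kl_terms.append(kl_i)

    return h_td, kl_terms   # h_td parameterizes p(x|z); kl_terms enter the bound

Generation follows the same top-down pass, except the per-layer means and log-variances come from the TD modules themselves (see Eqn. 8) rather than from merge modules.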
2.1 Formal Description

The distribution p(x) generated by a MatNet is encoded in its top-down network. To model p(x), the TD network decomposes the joint distribution p(x, z) over an observation x and a sequence of latent variables z ≡ {z_0, ..., z_d} into a sequence of simpler conditional distributions:

p(x) = \sum_{(z_d, ..., z_0)} p(x|z_d, ..., z_0) p(z_d|z_{d-1}, ..., z_0) ... p(z_i|z_{i-1}, ..., z_0) ... p(z_0),   (1)

which we marginalize with respect to the latent variables to get p(x). The TD network is designed so that each conditional p(z_i|z_{i-1}, ..., z_0) can be truncated to p(z_i|h^t_i) using an internal TD state h^t_i. See Eqns. 7/8 in Sec. 2.2 for procedural details.

The distribution q(z|x) used for inference in an unconditional MatNet involves the BU network, TD network, and merge modules. This distribution can be written:

q(z_d, ..., z_0|x) = q(z_0|x) q(z_1|z_0, x) ... q(z_i|z_{i-1}, ..., z_0, x) ... q(z_d|z_{d-1}, ..., z_0, x),   (2)

where each conditional q(z_i|z_{i-1}, ..., z_0, x) can be truncated to q(z_i|h^m_{i+1}) using an internal merge state h^m_{i+1} produced by the ith merge module. See Eqns. 10/11 in Sec. 2.2 for procedural details.

MatNets can also be applied to conditional generation problems like inpainting or pixel-wise segmentation. For, e.g. inpainting with known pixels x_k and missing pixels x_u, the predictive distribution of a conditional MatNet is given by:

p(x_u|x_k) = \sum_{(z_d, ..., z_0)} p(x_u|z_d, ..., z_0, x_k) p(z_d|z_{d-1}, ..., z_0, x_k) ... p(z_1|z_0, x_k) p(z_0|x_k).   (3)

Each conditional p(z_i|z_{i-1}, ..., z_0, x_k) can be truncated to p(z_i|h^{m:g}_{i+1}), where h^{m:g}_{i+1} indicates state in a merge module belonging to the generator network. Crucially, conditional MatNets include BU networks and merge modules that participate in generation, in addition to the BU networks and merge modules used by both conditional and unconditional MatNets during inference/training.

The distribution used for inference in a conditional MatNet is given by:

q(z_d, ..., z_0|x_k, x_u) = q(z_d|z_{d-1}, ..., z_0, x_k, x_u) ... q(z_1|z_0, x_k, x_u) q(z_0|x_k, x_u),   (4)

where each conditional q(z_i|z_{i-1}, ..., z_0, x_k, x_u) can be truncated to q(z_i|h^{m:i}_{i+1}), where h^{m:i}_{i+1} indicates state in a merge module belonging to the inference network. Note that, in a conditional MatNet, the distributions p(·|·) are not allowed to condition on x_u, while the distributions q(·|·) can.
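To make the conditional setting in Eqns. 3/4 concrete, the sketch below shows one simple way the known/unknown split for an inpainting problem might be prepared. The mask convention and the choice to concatenate the mask with the occluded image are illustrative assumptions, not details specified in the text.

import torch

def split_known_unknown(x, mask):
    # x: images with shape (N, C, H, W); mask: shape (N, 1, H, W), 1 where pixels
    # are known and 0 where they are missing.
    x_known = x * mask                 # x_k: what the conditional generator may see
    x_unknown = x * (1.0 - mask)       # x_u: what the model is trained to predict
    # The generator-side BU network conditions only on the known content (and the
    # mask), while the inference-side BU network may also see the missing content.
    generator_input = torch.cat([x_known, mask], dim=1)
    inference_input = torch.cat([x_known, x_unknown, mask], dim=1)
    return x_known, x_unknown, generator_input, inference_input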
MatNets are well-suited to training with Stochastic Gradient Variational Bayes [10]. In SGVB, one maximizes a lower-bound on the data log-likelihood based on the variational free-energy:

log p(x) >= E_{z ~ q(z|x)}[log p(x|z)] - KL(q(z|x) || p(z)),   (5)

for which p and q must satisfy a few simple assumptions, and where KL(q(z|x) || p(z)) indicates the KL divergence between the inference distribution q(z|x) and the model prior p(z). This bound is tight when the inference distribution matches the true posterior p(z|x) in the model joint distribution p(x, z) = p(x|z)p(z), in our case given by Eqns. 1/3.

For brevity, we only explicitly write the free-energy bound for a conditional MatNet, which is:

log p(x_u|x_k) >= E_{q(z_d, ..., z_0|x_k, x_u)}[log p(x_u|z_d, ..., z_0, x_k)] - KL(q(z_d, ..., z_0|x_k, x_u) || p(z_d, ..., z_0|x_k)).   (6)

With SGVB we can optimize the bound in Eqn. 6 using the “reparametrization trick” to allow easy backpropagation through the expectation over z ~ q(z|x_k, x_u). See [10, 18] for more details about this technique. The bound for unconditional MatNets is nearly identical; it just removes x_k.
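For reference, here is a minimal sketch of the unconditional bound in Eqn. 5, assuming diagonal Gaussian conditionals for every layer of latent variables and a Bernoulli reconstruction distribution; the helper names and tensor layout are illustrative and not taken from the released code.

import torch

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    # KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians, summed over
    # latent dimensions; returns one value per example.
    var_q, var_p = logvar_q.exp(), logvar_p.exp()
    kl = 0.5 * (logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)
    return kl.flatten(1).sum(dim=1)

def free_energy_bound(x, recon_logits, layer_params):
    # layer_params: list of (mu_q, logvar_q, mu_p, logvar_p), one tuple per layer
    # of latent variables; recon_logits parameterize a Bernoulli p(x|z).
    log_px = -torch.nn.functional.binary_cross_entropy_with_logits(
        recon_logits, x, reduction='none').flatten(1).sum(dim=1)
    kl_total = sum(gaussian_kl(*p) for p in layer_params)
    return log_px - kl_total   # maximize this (or minimize its negative)

The conditional bound in Eqn. 6 has the same form, with the reconstruction term restricted to the unknown pixels x_u.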
MatNets thus perform iterative stochastic re\ufb01nement\nthrough hierarchical depth, rather than through sequential depth as in DRAW2.\nMore precisely, the top-down modules in our convolutional MatNets compute:\n\ni)),\n\ni/vt\n\ni )), wt\n\ni; zi], vt\n\nht\ni+1 = lrelu(ht\n\ni + conv(lrelu(conv([ht\n\n(7)\nwhere [x; x(cid:48)] indicates tensor concatenation along the \u201cfeature\u201d axis, lrelu(\u00b7) indicates the leaky\nReLU function, conv(h, w) indicates shape-preserving convolution of the input h with the kernel w,\nand wt\ni indicate the trainable parameters for module i in the TD network. We elide bias terms\nfor brevity. When working with fully-connected models we use stochastic GRU-style state updates\nrather than the stochastic residual updates in Eq. 7. Exhaustive descriptions of the modules can be\nfound in our code at: https://github.com/Philip-Bachman/MatNets-NIPS.\nThese TD modules represent each conditional p(zi|zi\u22121, ..., z0) in Eq. 1 using p(zi|ht\ni places a distribution over zi using parameters [\u00af\u00b5i; log \u00af\u03c32\nf t\n\ni ] computed as follows:\n\ni). TD module\n\ni),\n\ni, vt\n\ni )), wt\n\n[\u00af\u00b5i; log \u00af\u03c32\n\ni ] = conv(lrelu(conv(ht\n\n(8)\nwhere we use \u00af\u00b7 to distinguish between Gaussian parameters from the generator network and those\nfrom the inference network (see Eqn. 11). The distributions p(\u00b7) all depend on the parameters \u03b8t.\nBottom-up networks in MatNets comprise sequences of modules in which each module receives input\nonly from the preceding BU module. Our BU networks are all deterministic and feedforward, but\nsensibly augmenting them with auxiliary latent variables [16, 15] and/or recurrence is a promising\ntopic for future work. Each non-terminal module f b\ni in the BU network computes an updated state:\n0, provides means and log-variances for sampling z0 via\nhb\ni = f b\nreparametrization [10]. To align BU modules with their counterparts in the TD network, we number\nthem in reverse order of evaluation. We structured the modules in our BU networks to take advantage\nof residual connections. Speci\ufb01cally, each BU module f b\n\ni+1; \u03b8b). The \ufb01nal module, f b\n\ni (hb\n\ni computes:\ni+1, vb\n\nhb\ni = lrelu(hb\n\ni+1 + conv(lrelu(conv(hb\n\ni )), wb\n\ni )),\n\n(9)\n\nwith operations de\ufb01ned as for Eq. 7. These updates can be replaced by GRUs, LSTMs, etc.\nThe updates described in Eqns. 7 and 9 both assume that module inputs and outputs are the same\nshape. We thus construct MatNets using groups of \u201cmeta modules\u201d, within which module input/output\nshapes are constant. To keep our network design (relatively) simple, we use one meta module for\neach spatial scale in our networks (e.g. scales of 14x14, 7x7, and fully-connected for MNIST). We\nconnect meta modules using layers which may upsample, downsample, and change feature dimension\nvia strided convolution. We use standard convolution layers, possibly with up or downsampling, to\nfeed data into and out of the bottom-up and top-down networks.\nDuring inference, merge modules compare the current top-down state with the state of the corre-\nsponding bottom-up module, conditioned on the current merge state, and choose a perturbation of the\ntop-down information to push it towards recovering the bottom-up network\u2019s input (i.e. minimize\nreconstruction error). The ith merge module outputs [\u00b5i; log \u03c32\ni ; \u03b8m), where\n\u00b5i and log \u03c32\ni+1 gives\nthe updated merge state. 
Bottom-up networks in MatNets comprise sequences of modules in which each module receives input only from the preceding BU module. Our BU networks are all deterministic and feedforward, but sensibly augmenting them with auxiliary latent variables [16, 15] and/or recurrence is a promising topic for future work. Each non-terminal module f^b_i in the BU network computes an updated state: h^b_i = f^b_i(h^b_{i+1}; \theta^b). The final module, f^b_0, provides means and log-variances for sampling z_0 via reparametrization [10]. To align BU modules with their counterparts in the TD network, we number them in reverse order of evaluation. We structured the modules in our BU networks to take advantage of residual connections. Specifically, each BU module f^b_i computes:

h^b_i = lrelu(h^b_{i+1} + conv(lrelu(conv(h^b_{i+1}, v^b_i)), w^b_i)),   (9)

with operations defined as for Eq. 7. These updates can be replaced by GRUs, LSTMs, etc.

The updates described in Eqns. 7 and 9 both assume that module inputs and outputs are the same shape. We thus construct MatNets using groups of “meta modules”, within which module input/output shapes are constant. To keep our network design (relatively) simple, we use one meta module for each spatial scale in our networks (e.g. scales of 14x14, 7x7, and fully-connected for MNIST). We connect meta modules using layers which may upsample, downsample, and change feature dimension via strided convolution. We use standard convolution layers, possibly with up or downsampling, to feed data into and out of the bottom-up and top-down networks.

During inference, merge modules compare the current top-down state with the state of the corresponding bottom-up module, conditioned on the current merge state, and choose a perturbation of the top-down information to push it towards recovering the bottom-up network's input (i.e. minimize reconstruction error). The ith merge module outputs [\mu_i; log \sigma^2_i; h^m_{i+1}] = f^m_i(h^b_i, h^t_i, h^m_i; \theta^m), where \mu_i and log \sigma^2_i are the mean and log-variance for sampling z_i via reparametrization, and h^m_{i+1} gives the updated merge state. As in the TD and BU networks, we use a residual update:

h^m_{i+1} = lrelu(h^m_i + conv(lrelu(conv([h^m_i; h^b_i; h^t_i], u_i)), v_i)),   (10)

[\mu_i; log \sigma^2_i] = conv(h^m_{i+1}, w_i),   (11)

in which the convolution kernels u_i, v_i, and w_i constitute the trainable parameters of this module. Each merge module thus computes an updated merge state and then reparametrizes a diagonal Gaussian using a linear function of the updated merge state.
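In the same illustrative style, a merge module implementing Eqns. 10/11 and the reparametrized sampling of z_i could be sketched as follows; channel sizes and kernel shapes are again assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MergeModule(nn.Module):
    # Merges the BU state, TD state, and running merge state (Eqn. 10), then maps
    # the new merge state to the mean and log-variance of q(z_i | ...) (Eqn. 11).
    def __init__(self, state_ch, z_ch, hid_ch=64):
        super().__init__()
        self.conv_u = nn.Conv2d(3 * state_ch, hid_ch, 3, padding=1)
        self.conv_v = nn.Conv2d(hid_ch, state_ch, 3, padding=1)
        self.conv_w = nn.Conv2d(state_ch, 2 * z_ch, 3, padding=1)

    def forward(self, h_bu, h_td, h_merge):
        stacked = torch.cat([h_merge, h_bu, h_td], dim=1)
        h_merge_new = F.leaky_relu(h_merge + self.conv_v(F.leaky_relu(self.conv_u(stacked))))
        mu, logvar = self.conv_w(h_merge_new).chunk(2, dim=1)
        # Reparametrization: z_i = mu + sigma * eps, with eps ~ N(0, I).
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return z, (mu, logvar), h_merge_new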
In our experiments all modules in all networks had their own trainable parameters. We experimented with parameter sharing and GRU-style state in our convolutional models. The stochastic convolutional GRU is particularly interesting when applied depth-wise (rather than time-wise as in [19]), as it implements a stochastic Neural GPU [9] trainable by variational inference and capable of multi-modal dynamics. We saw no performance gains with these changes, but they merit further investigation.

In unconditional MatNets, the top-most latent variables z_0 follow a zero-mean, unit-variance Gaussian prior, except in our experiments with mixture-based priors. In conditional MatNets, z_0 follows a distribution conditioned on the known values x_k. Conditional MatNets use parallel sets of BU and merge modules for the conditional generator and the inference network. BU modules in the conditional generator observe a partial input x_k, while BU modules in the inference network observe both x_k and the unknown values x_u (which the model is trained to predict). The generative BU and merge modules in a conditional MatNet interact with the TD modules analogously to the BU and merge modules used for inference. Our models used independent Bernoullis, diagonal Gaussians, or “integrated” Logistics (see [11]) for the final output distribution p(x|z_d, ..., z_0) / p(x_u|z_d, ..., z_0, x_k).

2.3 Model Extensions

We also develop several extensions for the MatNet architecture. The first is to replace the zero-mean, unit-variance Gaussian prior over z_0 with a Gaussian Mixture Model, which we train simultaneously with the rest of the model. When using a mixture prior, we use an analytical approximation to the required KL divergence. For a Gaussian distribution q, and a Gaussian mixture p with components {p_1, ..., p_k} and uniform mixture weights, we use the KL approximation:

KL(q || p) \approx log [ 1 / \sum_{i=1}^{k} exp(-KL(q || p_i)) ].   (12)

Our tests with mixture-based priors are only concerned with qualitative behaviour, so we do not worry about the approximation error in Eqn. 12.
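Eqn. 12 is simple to implement with standard log-sum-exp utilities. A sketch for a diagonal Gaussian q and diagonal Gaussian mixture components follows; the tensor shapes are illustrative assumptions.

import torch
from torch.distributions import Normal, kl_divergence

def mixture_kl_approx(mu_q, logvar_q, mix_mu, mix_logvar):
    # Eqn. 12: KL(q || p) ~= log( 1 / sum_i exp(-KL(q || p_i)) ), for a Gaussian q
    # and a mixture p with uniform component weights. mu_q/logvar_q have shape
    # (z_dim,); mix_mu/mix_logvar have shape (k, z_dim).
    q = Normal(mu_q, (0.5 * logvar_q).exp())
    kls = torch.stack([
        kl_divergence(q, Normal(m, (0.5 * lv).exp())).sum(dim=-1)
        for m, lv in zip(mix_mu, mix_logvar)])
    return -torch.logsumexp(-kls, dim=0)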
The second extension is a technique for regularizing the inference model to prevent overfitting beyond that which is present in the generator. This regularization is applied by optimizing:

maximize_q  E_{x ~ p(x)}[ E_{z ~ q(z|x)}[log p(x|z)] - KL(q(z|x) || p(z)) ].   (13)

This maximizes the free-energy bound for samples drawn from our model, but without changing their true log-likelihood. By maximizing Eqn. 13, we implicitly reduce KL(q(z|x) || p(z|x)), which is the gap between the free-energy bound and the true log-likelihood. A similar regularizer can be constructed for minimizing KL(p(z|x) || q(z|x)). We use (13) to reduce overfitting, and slightly boost test performance, in our experiments with MNIST and Omniglot.

The third extension off-loads responsibility for modelling sharp local dynamics in images, e.g. precise edge placements and small variations in textures, from the latent variables onto a local, deterministic autoregressive model. We use a simplified version of the masked convolutions in the PixelCNN of [25], modified to condition on the output of the final TD module in a MatNet. This modification is easy: we just concatenate the final TD module's output and the true image, and feed this into a PixelCNN with, e.g. five layers. A trick we use to improve gradient flow back to the MatNet is to feed the MatNet's output directly into each internal layer of the PixelCNN. In the masked convolution layers, connections to the MatNet output are unrestricted, since they are already separated from the ground truth by an appropriately-monitored noisy channel. Larger, more powerful mechanisms for combining local autoregressions and conditioning information are explored in [26].

3 Experiments

We measured quantitative performance of MatNets on three datasets: MNIST, Omniglot [13], and CIFAR 10 [12]. We used the 28x28 version of Omniglot described in [2], which can be found at: https://github.com/yburda/iwae. All quantitative experiments measured performance in terms of negative log-likelihood, with the CIFAR 10 scores rescaled to bits-per-pixel and corrected for discrete/continuous observations as described in [24]. We used the IWAE bound from [2] to evaluate our models, with 2500 samples in the bound. We performed additional experiments measuring the qualitative performance of MatNets using Omniglot, CelebA faces [14], LSUN 2015 towers, and LSUN 2015 churches. The latter three datasets are 64x64 color images with significant detail and non-trivial structure. Complete hyperparameters for model architecture and optimization can be found in the code at https://github.com/Philip-Bachman/MatNets-NIPS.

Figure 2: MatNet performance on quantitative benchmarks. All tables except the lower-right table describe standard unconditional generative NLL results. The lower-right table presents results from the structured prediction task in [22], in which 1-3 quadrants of an MNIST digit are visible, and NLL is measured on predictions for the unobserved quadrants.

Figure 3: Class-like structure learned by a MatNet trained on 28x28 Omniglot, without label information. The model used a GMM prior over z_0 with 50 mixture components. Each group of three columns corresponds to a mixture component. The top row shows validation set examples whose posterior over the mixture components placed them into each component. Subsequent rows show samples drawn by freely resampling latent variables from the model prior, conditioned on the top k layers of latent variables, i.e. {z_0, ..., z_{k-1}} being drawn from the approximate posterior for the example at the top of the column. From the second row down, we show k = {1, 2, 4, 6, 8, 10}.

We performed three quantitative tests using MNIST. The first tests measured generative performance on dynamically-binarized images using a fully-connected model (for comparison with [2, 23]) and on the fixed binarization from [20] using a convolutional model (for comparison with [25, 19]). MatNets improved on existing results in both settings. See the tables in Fig. 2. Our third tests with MNIST measured performance of conditional MatNets for structured prediction. For this, we recreated the tests described in [22]. MatNet performance on these tests was also strong, though the prior results were from a fully-connected model, which skews the comparison.

We also measured quantitative performance using the 32x32 color images of CIFAR 10. We trained two models on this data: one with a Gaussian reconstruction distribution and dequantization as described in [24], and the other which added a local autoregression and used the “integrated Logistic” likelihood described in [11]. The Gaussian model fell just short of the best previously reported result for a variational method (from [6]), and well short of the Pixel RNN presented in [25]. Performance on this task seems very dependent on a model's ability to predict pixel intensities precisely along edges. The ability to efficiently capture global structure has a relatively weak benefit. Mistaking a cat for a dog costs little when amortized over thousands of pixels, while misplacing a single edge can spike the reconstruction cost dramatically. We demonstrate the strength of this effect in Fig. 4, where we plot how the bits paid to encode observations are distributed among the modules in the network over the course of training for MNIST, Omniglot, and CIFAR 10. The plots show a stark difference between these distributions when modelling simple line drawings vs. when modelling more natural images. For CIFAR 10, almost all of the encoding cost was spent in the 32x32 layers of the network closest to the generated output. This was our motivation for adding a lightweight autoregression to p(x|z), which significantly reduced the gap between our model and the PixelRNN. Fig. 5 shows some samples from our model, which exhibit occasional glimpses of global and local structure.

Figure 4: This figure shows per-module divergences KL(q(z_i|h^m_{i+1}) || p(z_i|h^t_i)) over the course of training for models trained on MNIST, Omniglot, and CIFAR 10. The stacked area plots are grouped by “meta module” in the TD network. The MNIST and Omniglot models both had a single FC module and meta modules at spatial dimension 7x7 and 14x14. The meta modules at 7x7 and 14x14 both comprised 5 TD modules. The CIFAR 10 model (without autoregression) had one FC module, and meta modules at spatial dimension 8x8, 16x16, and 32x32. These meta modules comprised 2, 4, and 4 modules respectively. Light lines separate modules, and dark lines separate meta modules. The encoding cost on CIFAR 10 is clearly dominated by the low-level details encoded by the latent variables in the full-resolution TD modules closest to the output.
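As a reference for how numbers like these are typically computed, the sketch below shows an IWAE-style evaluation bound (with 2500 samples, as noted above) and a conversion from a per-image NLL in nats to bits per pixel, assuming the usual 3x32x32 = 3072 dimensions for CIFAR 10; the helper names are ours and the exact evaluation code may differ.

import math
import torch

def iwae_bound(log_p_xz, log_q_zx):
    # Importance-weighted bound on log p(x) from [2]: inputs have shape (K, N),
    # holding log p(x, z_k) and log q(z_k | x) for K posterior samples per image.
    k = log_p_xz.shape[0]
    return torch.logsumexp(log_p_xz - log_q_zx, dim=0) - math.log(k)

def nats_to_bits_per_pixel(nll_nats, num_dims=3 * 32 * 32):
    # Convert a per-image negative log-likelihood in nats to bits per pixel
    # dimension (3 x 32 x 32 = 3072 values for CIFAR 10).
    return nll_nats / (math.log(2.0) * num_dims)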
Our final quantitative test used the Omniglot handwritten character dataset, rescaled to 28x28 as in [2]. These tests used the same convolutional architecture as on MNIST. Our model outperformed previous results, as shown in Fig. 2. Using Omniglot we also experimented with placing a mixture-based prior distribution over the top-most latent variables z_0. The purpose of these tests was to determine whether the model could uncover latent class structure in the data without seeing any label information. We visualize results of these tests in Fig. 3. Additional description is provided in the figure caption. We placed a slight penalty on the entropy of the posterior distributions for each input to the model, to encourage a stronger separation of the mixture components. The inputs assigned to each mixture component (based on their posteriors) exhibit clear stylistic coherence.

In addition to qualitative tests exploring our model's ability to uncover latent factors of variation in Omniglot data, we tested the performance of our models at imputing missing regions of higher resolution images. These tests used images of celebrity faces, churches, and towers. These images include far more detail and variation than those in MNIST/Omniglot/CIFAR 10. We used two-stage models for these tests, in which each stage was a conditional MatNet. The first stage formed an initial guess for the missing image content, and the second stage then refined that guess. Both stages used the same architectures for their inference and generator networks. We sampled imputation problems by placing three 20x20 occluders uniformly at random in the image. Each stage had single TD modules at scales 32x32, 16x16, 8x8, and fully-connected. We trained models for roughly 200k updates, and show imputation performance on images from a test set that was held out during training. Results are shown in Fig. 5.

Figure 5 (panels: (a) CIFAR 10 samples, (b) CelebA Faces, (c) LSUN Churches, (d) LSUN Towers): Imputation results on challenging, real-world images. These images show predictions for missing data generated by a two-stage conditional MatNet, trained as described in Section 3. Each occluded region was 20x20 pixels. Locations for the occlusions were selected uniformly at random within the images. One interesting behaviour which emerged in these tests was that our model successfully learned to properly reconstruct the watermark for “shutterstock”, which was a source of many of the LSUN images; see the second input/output pair in the third row of (b).

4 Related Work and Discussion

Previous successful attempts to train hierarchically-deep models largely fall into a class of methods based on deconstructing, and then reconstructing data. Such approaches are akin to solving mazes by starting at the end and working backwards, or to learning how an object works by repeatedly disassembling and reassembling it. Examples include LapGANs [3], which deconstruct an image by repeatedly downsampling it, and Diffusion Nets [21], which deconstruct arbitrary data by subjecting it to a long sequence of small random perturbations. The power of these approaches stems from the way in which gradually deconstructing the data leaves behind a trail of crumbs which can be followed back to a well-formed observation. In the generative models of [3, 21], the deconstruction processes were defined a priori, which avoided the need for trained inference. This makes training significantly easier, but subverts one of the main motivations for working with latent variables and sample-based approximate inference, i.e. the ability to capture salient factors of variation in the inferred relations between latent variables and observed data. This deficiency is beginning to be addressed by, e.g.
the\nProbabilistic Ladder Networks of [23], which are a special case of our architecture in which the\ndeterministic paths from latent variables to observations are removed and the conditioning mechanism\nin inference is more restricted.\nReasoning about data through the posteriors induced by an appropriate generative model motivates\nsome intriguing work at the intersection of machine learning and cognitive science. This work shows\nthat, in the context of an appropriate generative model, powerful inference mechanisms are capable\nof exposing the underlying factors of variation in fairly sophisticated data. See, e.g. Lake et al. [13].\nTechniques for training coupled generation and inference have now reached a level that makes it\npossible to investigate these ideas while learning models end-to-end [4].\nIn future work we plan to apply our models to more \u201cinteresting\u201d generative modelling problems,\nincluding more challenging image data and problems in language/sequence modelling. The strong\nperformance of our models on benchmark problems suggests their potential for solving dif\ufb01cult\nstructured prediction problems. Combining the hierarchical depth of MatNets with the sequential\ndepth of DRAW is also worthwhile.\n\n8\n\n\fReferences\n[1] P. Bachman and D. Precup. Data generation as sequential decision making. In Advances in Neural\n\nInformation Processing Systems (NIPS), 2015.\n\n[2] Y. Burda, R. Grosse, and R. Salakhutdinov. Importance weighted auto-encoders. arXiv:1509.00519v1\n\n[cs.LG], 2015.\n\n[3] E. L. Denton, S. Chintala, A. Szlam, and R. Fergus. Deep generative models using a laplacian pyramid of\n\nadversarial networks. arXiv:1506.05751 [cs.CV], 2015.\n\n[4] S. M. A. Eslami, N. Heess, T. Weber, Y. Tassa, K. Kavucuoglu, and G. E. Hinton. Attend, infer, repeat:\n\nFast scene understanding with generative models. arXiv:1603.08575 [cs.CV], 2016.\n\n[5] K. Gregor, I. Danihelka, A. Graves, and D. Wierstra. Draw: A recurrent neural network for image\n\ngeneration. In International Conference on Machine Learning (ICML), 2015.\n\n[6] K. Gregor, F. Besse, D. J. Rezende, I. Danihelka, and D. Wierstra. Towards conceptual compression. In\n\narXiv:1604.08772v1 [stat.ML], 2016.\n\n[7] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv:1512.03385v1\n\n[cs.CV], 2015.\n\n[8] S. Honari, J. Yosinski, P. Vincent, and C. Pal. Recombinator networks: Learning coarse-to-\ufb01ne feature\n\naggregation. In Computer Vision and Pattern Recognition (CVPR), 2016.\n\n[9] L. Kaiser and I. Sutskever. Neural gpus learn algorithms. In International Conference on Learning\n\nRepresentations (ICLR), 2016.\n\n[10] D. P. Kingma and M. Welling. Auto-encoding variational bayes. In International Conference on Learning\n\nRepresentations (ICLR), 2014.\n\n[11] D. P. Kingma, T. Salimans, and M. Welling. Improving variational inference with inverse autoregressive\n\n\ufb02ow. arXiv:1606.04934 [cs.LG], 2016.\n\n[12] A. Krizhevsky and G. E. Hinton. Learning multiple layers of features from tiny images. Master\u2019s thesis,\n\nUniversity of Toronto, 2009.\n\n[13] B. M. Lake, R. Salakhutdinov, and J. B. Tenebaum. Human-level concept learning through probabilistic\n\nprogram induction. Science, 350(6266):1332\u20131338, 2015.\n\n[14] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In International Conference\n\non Computer Vision (ICCV), 2015.\n\n[15] L. Maal\u00f8e, C. K. S\u00f8nderby, S. K. S\u00f8nderby, and O. Winther. 
Auxiliary deep generative models. In\n\nInternational Conference on Machine Learning (ICML), 2016.\n\n[16] R. Ranganath, D. Tran, and D. M. Blei. Hierarchical variational models. In International Conference on\n\nMachine Learning (ICML), 2016.\n\n[17] A. Rasmus, H. Valpola, M. Honkala, M. Berglund, and T. Raiko. Semi-supervised learning with ladder\n\nnetworks. In Advances in Neural Information Processing Systems (NIPS), 2015.\n\n[18] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in\n\ndeep generative models. In International Conference on Machine Learning (ICML), 2014.\n\n[19] D. J. Rezende, S. Mohamed, I. Danihelka, K. Gregor, and D. Wierstra. One-shot generalization in deep\n\ngenerative models. In International Conference on Machine Learning (ICML), 2016.\n\n[20] R. Salakhutdinov and I. Murray. On the quantitative analysis of deep belief networks. In International\n\nConference on Machine Learning (ICML), 2008.\n\n[21] J. Sohl-Dickstein, E. A. Weiss, N. Maheswaranathan, and S. Ganguli. Deep unsupervised learning using\n\nnonequilibrium thermodynamics. In International Conference on Machine Learning (ICML), 2015.\n\n[22] K. Sohn, H. Lee, and X. Yan. Learning structured output representation using deep conditional generative\n\nmodels. In Advances in Neural Information Processing Systems (NIPS), 2015.\n\n[23] C. K. S\u00f8nderby, T. Raiko, L. Maal\u00f8e, S. K. S\u00f8nderby, and O. Winther. How to train deep variational\nautoencoders and probabilistic ladder networks. International Conference on Machine Learning (ICML),\n2016.\n\n[24] L. Theis and M. Bethge. Generative image modeling using spatial lstms. In Advances in Neural Information\n\nProcessing Systems (NIPS), 2015.\n\n[25] A. van den Oord, N. Kalchbrenner, and K. Kavucuoglu. Pixel recurrent neural networks. International\n\nConference on Machine Learning (ICML), 2016.\n\n[26] A. van den Oord, N. Kalchbrenner, O. Vinyals, L. Espeholt, A. Graves, and K. Kavucuoglu. Conditional\n\nimage generation with pixelcnn decoders. arXiv:1606.05328 [cs.CV], 2016.\n\n9\n\n\f", "award": [], "sourceid": 2450, "authors": [{"given_name": "Philip", "family_name": "Bachman", "institution": "Maluuba"}]}