{"title": "Flexible and accurate inference and learning for deep generative models", "book": "Advances in Neural Information Processing Systems", "page_first": 4166, "page_last": 4175, "abstract": "We introduce a new approach to learning in hierarchical latent-variable generative\nmodels called the \u201cdistributed distributional code Helmholtz machine\u201d, which\nemphasises flexibility and accuracy in the inferential process. Like the original\nHelmholtz machine and later variational autoencoder algorithms (but unlike adver-\nsarial methods) our approach learns an explicit inference or \u201crecognition\u201d model\nto approximate the posterior distribution over the latent variables. Unlike these\nearlier methods, it employs a posterior representation that is not limited to a narrow\ntractable parametrised form (nor is it represented by samples). To train the genera-\ntive and recognition models we develop an extended wake-sleep algorithm inspired\nby the original Helmholtz machine. This makes it possible to learn hierarchical\nlatent models with both discrete and continuous variables, where an accurate poste-\nrior representation is essential. We demonstrate that the new algorithm outperforms\ncurrent state-of-the-art methods on synthetic, natural image patch and the MNIST\ndata sets.", "full_text": "Flexible and accurate inference and learning\n\nfor deep generative models\n\nEszter V\u00e9rtes Maneesh Sahani\n\nGatsby Computational Neuroscience Unit\n\n{eszter, maneesh}@gatsby.ucl.ac.uk\n\nUniversity College London\n\nLondon, W1T 4JG\n\nAbstract\n\nWe introduce a new approach to learning in hierarchical latent-variable generative\nmodels called the \u201cdistributed distributional code Helmholtz machine\u201d, which\nemphasises \ufb02exibility and accuracy in the inferential process. 
Like the original\nHelmholtz machine and later variational autoencoder algorithms (but unlike adver-\nsarial methods) our approach learns an explicit inference or \u201crecognition\u201d model\nto approximate the posterior distribution over the latent variables. Unlike these\nearlier methods, it employs a posterior representation that is not limited to a narrow\ntractable parametrised form (nor is it represented by samples). To train the genera-\ntive and recognition models we develop an extended wake-sleep algorithm inspired\nby the original Helmholtz machine. This makes it possible to learn hierarchical\nlatent models with both discrete and continuous variables, where an accurate poste-\nrior representation is essential. We demonstrate that the new algorithm outperforms\ncurrent state-of-the-art methods on synthetic, natural image patch and the MNIST\ndata sets.\n\n1\n\nIntroduction\n\nThere is substantial interest in applying variational methods to learn complex latent-variable generative\nmodels, for which the full likelihood function (after marginalising over the latent variables) and its\ngradients are intractable. Unsupervised learning of such models has two complementary goals: to\nlearn a good approximation to the distribution of the observations; and also to learn the underlying\nstructural dependence so that the values of latent variables may be inferred from new observations.\nVariational methods rely on optimising a lower bound to the log-likelihood (the free energy), which\ndepends on an approximation to the posterior distribution over the latents [1]. The performance of\nvariational algorithms depends critically on the \ufb02exibility of the variational posterior. 
In cases where\nthe approximating class does not contain the true posterior distribution, variational learning may\nintroduce substantial bias to estimates of model parameters [2].\nVariational autoencoders [3, 4] combine the variational inference framework with the earlier idea of\nthe recognition network. This approach has made variational inference applicable to a large class\nof complex generative models. However, many challenges remain. Most current algorithms have\ndif\ufb01culty learning hierarchical generative models with multiple layers of stochastic latent variables\n[5]. Arguably, this class of models is crucial for modelling data where the underlying physical\nprocess is itself hierarchical in nature. Furthermore, the generative models typically considered in the\nliterature restrict the prior distribution to a simple form, most often a factorised Gaussian distribution,\nwhich makes it dif\ufb01cult to incorporate additional generative structure such as sparsity into the model.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fWe introduce a new approach to learning hierarchical generative models, the Distributed Distri-\nbutional Code (DDC) Helmholtz Machine, which combines two ideas that originate in theoretical\nneuroscience: the Helmholtz Machine with wake-sleep learning [6]; and distributed or population\ncodes for distributions [7, 8]. A key element of our method is that the approximate posterior distri-\nbution is represented as a set of expected suf\ufb01cient statistics, rather than by directly parametrising\nthe probability density function. This allows an accurate posterior approximation without being\nrestricted to a rigid parametric class. At the same time, the DDC Helmholtz Machine retains some of\nthe simplicity of the original Helmholtz Machine in that it does not require propagating gradients\nacross different layers of latent variables. 
The result is a robust method able to learn the parameters of each layer of a hierarchical generative model with far greater accuracy than achieved by current variational methods.

We begin by briefly reviewing variational learning (Section 2), deep exponential family models (Section 3), and the original Helmholtz Machine and wake-sleep algorithm (Section 4); before introducing the DDC (Section 5.1) and associated Helmholtz Machine (Section 5.2), whose performance we evaluate in Section 7.

2 Variational inference and learning in latent variable models

Consider a generative model for observations x that depend on latent variables z. Variational methods rely on optimising a lower bound on the log-likelihood by introducing an approximate posterior distribution q(z|x) over the latent variables:

log p_θ(x) ≥ F(q, θ, x) = log p_θ(x) − D_KL[q(z|x) || p_θ(z|x)]    (1)

The cost of computing the posterior approximation for each observation can be efficiently amortised by using a recognition model [9], an explicit function (with parameters φ, often a neural network) that for each x returns the parameters of an estimated posterior distribution: x ↦ q_φ(z).

A major source of bias in variational learning comes from the mismatch between the approximate and exact posterior distributions. The variational objective penalises this error using the "exclusive" Kullback-Leibler divergence (see Eq. 1), which typically results in an approximation that underestimates the posterior uncertainty [10].

Multi-sample objectives (IWAE [11]; VIMCO [12]) have been proposed to remedy the disadvantages of a restrictive posterior approximation.
Nonetheless, benefits of these methods are limited in cases when the proposal distribution is too far from the true posterior (see Section 7).

3 Deep exponential family models

We consider hierarchical generative models in which each conditional belongs to an exponential family, also known as deep exponential family models [13]. Let x ∈ X denote a single (vector) observation. The distribution of data x is determined by a sequence of L (vector) latent variables z_1 . . . z_L arranged in a conditional hierarchy (z_L → · · · → z_2 → z_1 → x) as follows:

p(z_L) = exp(θ_L^T S_L(z_L) − Φ_L(θ_L))
p(z_2|z_3) = exp(g_2(z_3, θ_2)^T S_2(z_2) − Φ_2(g_2(z_3, θ_2)))
p(z_1|z_2) = exp(g_1(z_2, θ_1)^T S_1(z_1) − Φ_1(g_1(z_2, θ_1)))
p(x|z_1) = exp(g_0(z_1, θ_0)^T S_0(x) − Φ_0(g_0(z_1, θ_0)))

Each conditional distribution is a member of a tractable exponential family, so that conditional sampling is possible. Using l ∈ {0, 1, 2, . . . , L} to denote the layer (with l = 0 for the observation), these distributions have sufficient statistic function S_l, natural parameter given by a known function g_l of both the parent variable and a parameter vector θ_l, and a log normaliser Φ_l that depends on this natural parameter.
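As a concrete illustration, ancestral sampling in such a model is straightforward once each conditional's natural parameter is known. The sketch below (hypothetical layer sizes and randomly drawn parameters, purely for illustration) instantiates a two-layer sigmoid belief network, the Bernoulli special case with S_l(z_l) = z_l and linear g_l that reappears in Section 4:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Hypothetical layer sizes and randomly drawn parameters (illustrative only).
D2, D1, Dx = 3, 5, 8
theta2 = rng.standard_normal(D2)          # top-layer natural parameters: g_L(theta_L) = theta_L
theta1 = rng.standard_normal((D1, D2))    # g_1(z2, theta1) = theta1 @ z2
theta0 = rng.standard_normal((Dx, D1))    # g_0(z1, theta0) = theta0 @ z1

def ancestral_sample():
    """Draw (x, z1, z2) top-down; each conditional is Bernoulli, whose
    natural parameter (log-odds) is a linear function of the parent layer."""
    z2 = rng.binomial(1, sigmoid(theta2))
    z1 = rng.binomial(1, sigmoid(theta1 @ z2))
    x = rng.binomial(1, sigmoid(theta0 @ z1))
    return x, z1, z2

x, z1, z2 = ancestral_sample()
```

Here the log normaliser Φ_l is the usual Bernoulli softplus term; all the method requires of the family is the ability to sample each conditional given its natural parameter.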
At the top layer, we lose no generality by taking gL(\u03b8L) = \u03b8L.\nWe will maintain the general notation here, as the method we propose is very broadly applicable\n(both to continuous and discrete latent variables), provided that the family remains tractable in the\nsense that we can ef\ufb01ciently sample from the conditional distributions given the natural parameters.\n\n4 The classic Helmholtz Machine and the wake-sleep algorithm\n\nThe Helmholtz Machine (HM) [6] comprises a latent-variable generative model that is to be \ufb01t to\ndata, and a recognition network, trained to perform approximate inference over the latent variables.\nThe latent variables of an HM generative model are arranged hierarchically in a directed acyclic\ngraph, with the variables in a given layer conditionally independent of one another given the variables\nin the layer above. In the original HM, all latent and observed variables were binary and formed a\nsigmoid belief network [14], which is a special case of deep exponential family models introduced\nin the previous section with Sl(zl) = zl and gl(zl+1, \u03b8l) = \u03b8lzl+1. The recognition network is a\nfunctional mapping with an analogous hierarchical architecture that takes each x to an estimate of the\nposterior probability of each zl, using a factorised mean-\ufb01eld representation.\nThe training of both generative model and recognition network follows a two-phase procedure\nknown as wake-sleep [15].\nIn the wake phase, observations x are fed through the recognition\nnetwork to obtain the posterior approximation q\u03c6(zl|x). In each layer the latent variables are sampled\nindependently conditioned on the samples of the layer below according to the probabilities determined\nby the recognition model parameters. These samples are then used to update the generative parameters\nto increase the expected joint likelihood \u2013 equivalent to taking gradient steps to increase the variational\nfree energy. 
In the sleep phase, the current generative model is used to provide joint samples of the\nlatent variables and \ufb01ctitious (or \u201cdreamt\u201d) observations and these are used as supervised training\ndata to adapt the recognition network. The algorithm allows for straightforward optimisation since\nparameter updates at each layer in both the generative and recognition models are based on locally\ngenerated samples of both the input and output of the layer.\nDespite the resemblance to the two-phase process of expectation-maximisation and approximate\nvariational methods, the sleep phase of wake-sleep does not necessarily increase the free-energy\nbound on the likelihood. Even in the limit of in\ufb01nite samples, the mean \ufb01eld representation q\u03c6(z|x)\nis learnt so that it minimises DKL[p\u03b8(z|x)(cid:107)q\u03c6(z|x)], rather than DKL[q\u03c6(z|x)(cid:107)p\u03b8(z|x)] as required\nby variational learning. For this reason, the mean-\ufb01eld approximation provided by the recognition\nmodel is particularly limiting, since it not only biases the learnt generative model (as in the variational\ncase) but it may also preclude convergence.\n\n5 The DDC Helmholtz Machine (DDC-HM)\n\n5.1 Distributed Distributional Codes\n\nThe key drawback of both the classic HM and most approximate variational methods is the need for a\ntractably parametrised posterior approximation. Our contribution is to instead adopt a \ufb02exible and\npowerful representation of uncertainty in terms of expected values of large set of (possibly random)\narbitrary nonlinear functions. We call this representation a Distributed Distributional Code (DDC) in\nacknowledgement of its history in theoretical neuroscience [7, 8]. 
In the DDC-HM, each posterior is represented by approximate expectations of non-linear encoding functions {T^(i)(z)}_{i=1...K} with respect to the true posterior p_θ(z|x):

r^(i)_l(x, φ) ≈ ⟨T^(i)(z_l)⟩_{p_θ(z_l|x)} ,    (2)

where r^(i)_l(x, φ), i = 1...K_l is the output of the recognition network (parametrised by φ) at the lth latent layer, and the angle brackets denote expectations. A finite—albeit large—set of expectations does not itself fully specify the probability distribution p_θ(z|x). Thus, the recognition outputs {r^(i)_l}_{i=1...K_l} are interpreted as representing an approximate posterior q(z_l|x) defined by the distribution of maximum entropy that agrees with all of the encoded expectations.

A standard calculation shows that this distribution has a density of the form [1, Ch.3]:

q(z|x) = (1 / Z(η(x))) exp( Σ_{i=1}^{K} η^(i)(x) T^(i)(z) )    (3)

where the η^(i) are natural parameters (derived as Lagrange multipliers enforcing the expectation constraints), and Z(η) is a normalising constant. Thus, in this view, the encoded distribution q is a member of the exponential family whose sufficient statistic functions correspond to the encoding functions {T^(i)(z)}, and the recognition network returns the expected sufficient statistics, or mean parameters. It follows that given a sufficiently large set of encoding functions, we can approximate the true posterior distribution arbitrarily well [16]. Throughout the paper we will use encoding functions of the following form:

T^(i)(z) = σ(w^(i)T z + b^(i)),  i = 1 . . . K ,    (4)

where w^(i) is a random linear projection with components sampled from a standard normal distribution, b^(i) is a similarly distributed bias term, and σ is a sigmoidal non-linearity.
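A minimal sketch of these random encoding functions, and of a DDC computed from them, follows; all sizes are illustrative, and a Monte Carlo average over samples stands in for the recognition network's output:

```python
import numpy as np

rng = np.random.default_rng(1)

def make_ddc_encoder(dim_z, K):
    """Random sigmoidal features T^(i)(z) = sigma(w^(i).z + b^(i)) as in Eq. (4),
    with w and b drawn from standard normal distributions."""
    W = rng.standard_normal((K, dim_z))
    b = rng.standard_normal(K)
    return lambda z: 1.0 / (1.0 + np.exp(-(z @ W.T + b)))

T = make_ddc_encoder(dim_z=2, K=100)

# A DDC for a distribution over z is the vector of expected encoding-function
# values r^(i) ~= <T^(i)(z)>; here it is approximated by averaging the features
# over samples, in place of a trained recognition-network output.
z_samples = rng.standard_normal((5000, 2))
r = T(z_samples).mean(axis=0)   # K expectations: the DDC representation
```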
That is, the representation is designed to capture information about the posterior distribution along K random projections in z-space. As a special case, we can recover the approximate posterior equivalent to the original HM if we consider linear encoding functions T^(i)(z) = z_i, corresponding to a factorised mean-field approximation.

Obtaining the posterior natural parameters {η^(i)} (and thus evaluating the density in Eq. 3) from the mean parameters {r^(i)} is not straightforward in the general case, since Z(η) is intractable. Thus, it is not immediately clear how a DDC representation can be used for learning. Our exact scheme will be developed below, but in essence it depends on the simple observation that most of the computations necessary for learning (and indeed most computations involving uncertainty) depend on the evaluation of appropriate expectations. Given a rich set of encoding functions {T^(i)}_{i=1...K} sufficient to approximate a desired function f using linear weights {α^(i)}, such expectations become easy to evaluate in the DDC representation:

f(z) ≈ Σ_i α^(i) T^(i)(z)   ⇒   ⟨f(z)⟩_{q(z)} ≈ Σ_i α^(i) ⟨T^(i)(z)⟩_{q(z)} = Σ_i α^(i) r^(i)    (5)

Thus, the richer the family of DDC encoding functions, the more accurate are both the approximated posterior distribution and the approximated expectations¹. We will make extensive use of this property in the following section, where we discuss how this posterior representation is learnt (sleep phase) and how it can be used to update the generative model (wake phase).

¹In a suitable limit, an infinite family of encoding functions would correspond to a mean embedding representation in a reproducing kernel Hilbert space [17].

5.2 The DDC Helmholtz Machine algorithm

Algorithm 1 DDC Helmholtz Machine training
  Initialise θ
  repeat
    Sleep phase:
      for s = 1 . . . S, sample: z_L^(s), . . . , z_1^(s), x^(s) ∼ p_θ(x, z_1, . . . , z_L)
      update recognition parameters {φ_l} [Eq. 7]
      update function approximators {α_l, β_l} [appendix]
    Wake phase:
      x ← {minibatch}
      evaluate r_l(x, φ) [Eq. 8]
      update θ: ∆θ ∝ ∇̂_θ F(x, r(x, φ), θ) [appendix]
  until |∇̂_θ F| < threshold

Following [6], the generative and recognition models in the DDC-HM are learnt in two separate phases (see Algorithm 1). The sleep phase involves learning a recognition network that takes data points x as input and produces expectations of the non-linear encoding functions {T^(i)} as given by Eq. (2); and learning how to use these expectations to update the generative model parameters using approximations of the form of Eq. (5). The wake phase updates the generative parameters by computing the approximate gradient of the free energy, using the posterior expectations learned in the sleep phase. Below we describe the two phases of the algorithm in more detail.

Sleep phase  One aim of the sleep phase, given a current generative model p_θ(x, z), is to update the recognition network so that the Kullback-Leibler divergence between the true and the approximate posterior is minimised:

φ = argmin_φ D_KL[p_θ(z|x) || q_φ(z|x)]    (6)

Since the DDC q(z|x) is in the exponential family, the KL divergence in Eq. (6) is minimised if the expectations of the sufficient statistics vector T = [T^(1), . . . , T^(K)] under the two distributions agree: ⟨T(z)⟩_{p_θ(z|x)} = ⟨T(z)⟩_{q_φ(z|x)}. Hence the parameters of the recognition model should be updated so that: r_l(x, φ_l) ≈ ⟨T(z_l)⟩_{p_θ(z_l|x)}. This requirement can be translated into an optimisation problem by sampling z_L^(s), . . . , z_1^(s), x^(s) from the generative model and minimising the error between the output of the recognition model r_l(x^(s), φ_l) and the encoding functions T evaluated at the generated sleep samples. For tractability, we substitute the squared loss in place of Eq. (6):

φ_l = argmin_{φ_l} Σ_s ‖ r_l(x^(s), φ_l) − T(z_l^(s)) ‖²    (7)

In principle, one could use any function approximator (such as a neural network) for the recognition model r_l(x^(s), φ_l), provided that it is sufficiently flexible to capture the mapping from the data to the encoded expectations. Here, we parallel the original HM, and use a recognition model that reflects the hierarchical structure of the generative model. For a model with 2 layers of latent variables:

h_1(x, W) = [W x]_+ ,    r_1(x, φ_1) = φ_1 · h_1(x, W) ,    r_2(x, φ_2) = φ_2 · r_1(x, φ_1) ,    (8)

where W, φ_1, φ_2 are matrices and [·]_+ is a rectifying non-linearity. Throughout this paper we use a fixed W ∈ R^{M×D_x} sampled from a normal distribution, and learn φ_1, φ_2 according to Eq. (7).

Recognition model learning in the DDC-HM thus parallels that of the original HM, albeit with a much richer posterior representation. The second aim of the DDC-HM sleep phase is quite different: a further set of weights must be learnt to approximate the gradients of the generative model joint likelihood.
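The way such sleep-trained linear readouts work can be sketched numerically. In the toy example below (an arbitrary target function and stand-in sample sets, purely for illustration), weights α are fitted by least squares on samples so that f(z) ≈ α · T(z), after which ⟨f(z)⟩_q is recovered as the same linear readout of the DDC r = ⟨T(z)⟩_q, as in Eq. (5):

```python
import numpy as np

rng = np.random.default_rng(2)

# DDC encoder as in Eq. (4): K random sigmoidal features of a scalar z.
K, S = 200, 5000
W, b = rng.standard_normal((K, 1)), rng.standard_normal(K)
T = lambda z: 1.0 / (1.0 + np.exp(-(z @ W.T + b)))

# "Sleep" samples (stand-in: a broad distribution covering the relevant range).
z_sleep = rng.standard_normal((S, 1))

# Fit alpha by least squares so that f(z) ~= alpha . T(z); f is an arbitrary
# nonlinear function chosen for illustration.
f = lambda z: np.cos(z[:, 0])
alpha, *_ = np.linalg.lstsq(T(z_sleep), f(z_sleep), rcond=None)

# Given a DDC r = <T(z)>_q for some posterior q (approximated here from
# stand-in posterior samples), <f(z)>_q is the same linear readout of r.
z_q = 0.5 + 0.1 * rng.standard_normal((5000, 1))
r = T(z_q).mean(axis=0)
approx = alpha @ r          # DDC estimate of <f(z)>_q
exact = f(z_q).mean()       # Monte Carlo reference
```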
This step is derived in the appendix, but summarised in the following section.

Wake phase  The aim in the wake phase is to update the generative parameters to increase the variational free energy F(q, θ), evaluated on data x, using a gradient step:

∆θ ∝ ∇_θ F(q, θ) = ⟨∇_θ log p_θ(x, z)⟩_{q(z|x)}    (9)

The update depends on the evaluation of an expectation over q(z|x). As discussed in Section 5.1, the DDC approximate posterior representation allows us to evaluate such expectations by approximating the relevant functions using the non-linear encoding functions T.

For deep exponential family generative models, the gradients of the free energy take the following form (see appendix):

∇_{θ_0} F = ∇_{θ_0} ⟨log p(x|z_1)⟩_q = S_0(x)^T ⟨∇g(z_1, θ_0)⟩_{q(z_1)} − ⟨µ^T_{x|z_1} ∇g(z_1, θ_0)⟩_{q(z_1)}
∇_{θ_l} F = ∇_{θ_l} ⟨log p(z_l|z_{l+1})⟩_q = ⟨S_l(z_l)^T ∇g(z_{l+1}, θ_l)⟩_{q(z_l, z_{l+1})} − ⟨µ^T_{z_l|z_{l+1}} ∇g(z_{l+1}, θ_l)⟩_{q(z_{l+1})}
∇_{θ_L} F = ∇_{θ_L} ⟨log p(z_L)⟩_q = ⟨S_L(z_L)⟩ − ∇Φ(θ_L) ,    (10)

where µ_{x|z_1}, µ_{z_l|z_{l+1}} are expected sufficient statistic vectors of the conditional distributions from the generative model: µ_{x|z_1} = ⟨S_0(x)⟩_{p(x|z_1)}, µ_{z_l|z_{l+1}} = ⟨S_l(z_l)⟩_{p(z_l|z_{l+1})}. Now the functions that must be approximated are the functions of {z_l} that appear within the expectations in Eqs. 10. As shown in the appendix, the coefficients of these combinations can be learnt by minimising a squared error on the sleep-phase samples, in parallel with the learning of the recognition model.

Thus, taking the gradient in the first line of Eq.
(10) as an example, we write ∇_θ g(z_1, θ_0) ≈ Σ_i α_0^(i) T^(i)(z_1) = α_0 · T(z_1) and evaluate the gradients as follows:

sleep:   α_0 ← argmin_{α_0} Σ_s ( ∇_θ g(z_1^(s), θ_0) − α_0 · T(z_1^(s)) )²    (11)
wake:   ⟨∇_θ g(z_1, θ_0)⟩_{q(z_1)} ≈ α_0 · ⟨T(z_1)⟩_{q(z_1)} = α_0 · r_1(x, φ_1)    (12)

with similar expressions providing all the gradients necessary for learning derived in the appendix.

In summary, in the DDC-HM computing the wake-phase gradients of the free energy becomes straightforward, since the necessary expectations are computed using approximations learnt in the sleep phase, rather than by an explicit construction of the intractable posterior. Furthermore, as shown in the appendix, using the function approximations trained using the sleep samples and the posterior representation produced by the recognition network, we can learn the generative model parameters without needing any explicit independence assumptions (within or across layers) about the posterior distribution.

6 Related work

Following the Variational Autoencoder (VAE; [3, 4]), there has been a renewed interest in using recognition models (originally introduced in the HM) in the context of learning complex generative models. The Importance Weighted Autoencoder (IWAE; [11]) optimises a tighter lower bound constructed by an importance sampled estimator of the log-likelihood, using the recognition model as a proposal distribution. This approach decreases the variational bias introduced by the factorised posterior approximation of the standard VAE. VIMCO [12] extends this approach to discrete latent variables and yields state-of-the-art generative performance on learning sigmoid belief networks. We compare our method to the IWAE and VIMCO in Section 7. Sonderby et al.
[5] demonstrate that the standard VAE has difficulty making use of multiple stochastic layers. To overcome this, they propose the Ladder Variational Autoencoder, with a modified parametrisation of the recognition model that includes a stochastic top-down pass through the generative model. The resulting posterior approximation is a factorised Gaussian, as for the VAE. Normalising Flows [18] relax the factorised Gaussian assumption on the variational posterior: through a series of invertible transformations, an arbitrarily complex posterior can be constructed. However, to our knowledge, they have not yet been successfully applied to deep hierarchical generative models.

7 Experiments

We have evaluated the performance of the DDC-HM on a directed graphical model comprising two stochastic latent layers and an observation layer. The prior on the top layer is a mixture of Gaussians, while the conditional distributions linking the layers below are Laplace and Gaussian, respectively:

p(z_2) = 1/2 ( N(z_2 | m, σ²) + N(z_2 | −m, σ²) )
p(z_1|z_2) = Laplace(z_1 | µ = 0, λ = softplus(B z_2))
p(x|z_1) = N(x | µ = Λ z_1, Σ_x = Ψ_diag)    (13)

We chose a generative model with a non-Gaussian prior distribution and sparse latent variables, models typically not considered in the VAE literature. Due to the sparsity and non-Gaussianity, learning in these models is challenging, and the use of a flexible posterior approximation is crucial. We show that the DDC-HM can provide a sufficiently rich posterior representation to learn accurately in such a model.
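Ancestral sampling from this generative model can be sketched as follows; the dimensions and parameter values here are arbitrary placeholders, not the settings used in the experiments:

```python
import numpy as np

rng = np.random.default_rng(3)

def softplus(a):
    return np.log1p(np.exp(a))

# Placeholder dimensions and parameters (illustrative only).
D2, D1, Dx = 1, 2, 2
m, sigma = 3.0, 0.1
B = rng.standard_normal((D1, D2))
Lam = rng.standard_normal((Dx, D1))   # loading matrix Lambda
psi = 0.1 * np.ones(Dx)               # diagonal observation noise Psi

def sample(n):
    """Ancestral samples from the model of Eq. (13): two-component
    mixture-of-Gaussians prior -> zero-mean Laplace with softplus-coupled
    scale -> linear Gaussian observation."""
    sign = rng.choice([-1.0, 1.0], size=(n, D2))       # mixture component
    z2 = sign * m + sigma * rng.standard_normal((n, D2))
    z1 = rng.laplace(loc=0.0, scale=softplus(z2 @ B.T))
    x = z1 @ Lam.T + np.sqrt(psi) * rng.standard_normal((n, Dx))
    return x, z1, z2

x, z1, z2 = sample(1000)
```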
We begin with low dimensional synthetic data to evaluate the performance of the approach, before evaluating performance on a data set of natural image patches.

Synthetic examples  To illustrate that the recognition network of the DDC-HM is powerful enough to capture dependencies implied by the generative model, we trained it on a data set generated from the model (N=10000). The dimensionality of the observation layer and of the first and second latent layers was set to Dx = 2, D1 = 2, D2 = 1, respectively, for both the true generative model and the fitted models. We used a recognition model with a hidden layer of size 100, and K1 = K2 = 100 encoding functions for each latent layer, with 200 sleep samples, and learned the parameters of the conditional distributions p(x|z1) and p(z1|z2) while keeping the prior on z2 fixed (m=3, σ=0.1).

Figure 1: Left: Examples of the distributions learned by the Variational Autoencoder (VAE), the Importance Weighted Variational Autoencoder (IWAE) with k=50 importance samples, and the DDC Helmholtz Machine. Right: histogram of log MMD values for different algorithms trained on synthetic datasets.

As a comparison, we have also fitted both a Variational Autoencoder (VAE) and an Importance Weighted Autoencoder (IWAE), using 2-layer recognition networks with 100 hidden units each, producing a factorised Gaussian posterior approximation (or proposal distribution for the IWAE). To make the comparison between the algorithms clear (i.e. independent of initial conditions and local optima of the objective functions), we initialised each model to the true generative parameters and ran the algorithms until convergence (1000 epochs, learning rate: 10−4, using the Adam optimiser; [19]). Figure 1 shows examples of the training data and data generated by the VAE, IWAE and DDC-HM models after learning.
The solution found by the DDC-HM matches the training data, suggesting\nthat the posterior approximation was suf\ufb01ciently accurate to avoid bias during learning. The VAE,\nas expected from its more restrictive posterior approximation, could capture neither the strong anti-\ncorrelation between latent variables nor the heavy tails of the distribution. Similar qualitative features\nare seen in the IWAE samples, suggesting that the importance weighting was unable to recover from\nthe strongly biased posterior proposal.\nWe quanti\ufb01ed the quality of the \ufb01ts by computing the maximum mean discrepancy (MMD) [17]\nbetween the training data and the samples generated by each model [20] 2. We used an exponentiated\nquadratic kernel with kernel width optimised for maximum test power [21]. We computed the MMD\nfor 25 data sets drawn using different generative parameters, and found that the MMD estimates were\nsigni\ufb01cantly lower for the DDC-HM than for the VAE or the IWAE (k=5, 50) (Figure 1).\nBeyond capturing the density of the data, correctly identifying the underlying latent structure is also\nan important criterion when evaluating algorithms for learning generative models. Figure 2 shows an\nexample where we have used Hamiltonian Monte Carlo to generate samples from the true posterior\ndistribution for one data point under the generative models learnt by each approach. We found that\nthere was close agreement between the posterior distributions of the true generative model and the\none learned by the DDC-HM. However, the biased recognition of the VAE and IWAE in turn biases\nthe learnt generative parameters so that the resulting posteriors (even when computed without the\nrecognition networks) appear closer to Gaussian.\n\nNatural image patches We tested the scalability of the DDC-HM by applying it to a natural image\ndata set [22]. 
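The quantitative comparisons in this section rest on the kernel two-sample MMD statistic; a minimal unbiased estimator with an exponentiated quadratic kernel can be sketched as below (the kernel width is left as a plain parameter here, rather than being optimised for test power as in the paper):

```python
import numpy as np

def mmd2_unbiased(X, Y, width):
    """Unbiased estimate of the squared maximum mean discrepancy between
    sample sets X and Y under an exponentiated quadratic (RBF) kernel [17]."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-d2 / (2.0 * width ** 2))
    Kxx, Kyy, Kxy = k(X, X), k(Y, Y), k(X, Y)
    n, m = len(X), len(Y)
    np.fill_diagonal(Kxx, 0.0)   # drop i == j terms for unbiasedness
    np.fill_diagonal(Kyy, 0.0)
    return (Kxx.sum() / (n * (n - 1))
            + Kyy.sum() / (m * (m - 1))
            - 2.0 * Kxy.mean())
```

Samples from a learned model would be compared against the training data, with a smaller value indicating a closer match; a common default for the width is the median pairwise distance, the median heuristic used later in the paper.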
We trained the same generative model as before on image patches with dimensionality Dx = 16 × 16 and varying sizes of latent layers. The recognition model had a hidden layer of size 500, K1 = 500, K2 = 100 encoding functions for z1 and z2, respectively, and used 1000 samples during the sleep phase. We compared the performance of our model with the IWAE (k=50) using the relative (three sample) MMD test [20] with an exponentiated quadratic kernel (width chosen by the median heuristic). The test establishes whether the MMD distance between distributions Px and Py is significantly smaller than the distance between Px and Pz. We used the image data set as our reference distribution and the IWAE being closer to the data as the null hypothesis. Table 1 summarises the results obtained on models with different latent dimensionality, all of them strongly preferring the DDC-HM.

²Estimating the log-likelihood by importance sampling in this model has proven to be unreliable due to the lack of a good proposal distribution.

Figure 2: Example posteriors corresponding to the learned generative models. The corner plots show the pairwise and marginal densities of the three latent variables, for the true model (top left), and the models learned by the VAE, IWAE (k=50) and DDC-HM.

Table 1: 3-sample MMD results. The table shows the results of the "relative" MMD test between the DDC-HM and the IWAE (k = 50) on the image patch data set for different generative model architectures. The null hypothesis tested: MMD_IWAE < MMD_DDC-HM. Small p values indicate that the model learned by the DDC-HM matches the data significantly better than the one learned by the IWAE (k = 50).
We obtained similar results when comparing to IWAE k = 1, 5 (not shown).

LATENT DIMENSIONS      IWAE     DDC HM    p-value
D1 = 10,  D2 = 10      0.126    0.0388    ≪ 10−5
D1 = 50,  D2 = 2       0.0754   0.0269    ≪ 10−5
D1 = 50,  D2 = 10      0.247    0.00313   ≪ 10−5
D1 = 100, D2 = 2       0.076    0.0211    ≪ 10−5
D1 = 100, D2 = 10      0.171    0.00355   ≪ 10−5

Sigmoid Belief Network trained on MNIST  Finally, we evaluated the capacity of our model to learn hierarchical generative models with discrete latent variables by training a sigmoid belief network (SBN). We used the binarised MNIST dataset of 28x28 images of handwritten digits [23]. The generative model had three layers of binary latent variables, with dimensionality of 200 in each layer. The recognition model had a sigmoidal hidden layer of size 300 and DDC representations of size 200 for each latent layer. As a comparison, we have also trained an SBN with the same architecture using the VIMCO algorithm (as described in [12]) with 50 samples from the proposal distribution³. To quantify the fits, we have performed the relative MMD test using the test set (N = 10000) as a reference distribution and two sets of samples of the same size generated from the SBN trained by the VIMCO and DDC-HM algorithms. Again, we used an exponentiated quadratic kernel with width chosen by the median heuristic.
The test strongly favoured the DDC-HM over VIMCO with p ≪ 10−5 (with MMD values of 6 × 10−4 and 2 × 10−3, respectively).

³The model achieved an estimated negative log-likelihood of 90.97 nats, similar to the one reported by [12] (90.9 nats).

8 Discussion

The DDC Helmholtz Machine offers a novel approach to learning hierarchical generative models, which combines the basic idea of the wake-sleep algorithm with a flexible posterior representation. The lack of strong parametric assumptions in the DDC representation allows the algorithm to learn generative models with complex posterior distributions accurately.

As in the classical Helmholtz Machine, the approximate posterior is found by seeking to minimise the "reverse" divergence D_KL[p(z|x) || q(z|x)], albeit within a much richer class of distributions. Thus, the modified wake-sleep algorithm presented here still does not directly optimise a variational lower bound on the log-likelihood. Rather, it can be viewed as following an approximation to the gradient of the log-likelihood, where the quality of the approximation depends on the richness of the DDC representation used. Precise conditions for convergence are yet to be established, but the expectation is that when the approximation is rich enough for the error in the resulting gradient estimate to be bounded, the algorithm will always reach a region around a local mode in which the true gradient does not exceed that error bound.

The DDC-HM recognition model can be trained layer-by-layer using the samples from the generative model, with no need to back-propagate gradients across stochastic layers. In the version discussed here, the recognition network depended on linear mappings between encoding functions and a fixed non-linear basis expansion of the input. This restrictive form allowed for closed-form updates in the sleep phase.
However, this assumption could be relaxed by introducing a neural network between each latent-variable layer, along with a modified learning scheme in the sleep phase. This approach may increase the accuracy of the posterior expectations computed during the wake phase.

Another future direction involves learning the non-linear encoding functions, or choosing them in accordance with the properties of the generative model (e.g. requiring sparsity in the random projections). Finally, a natural extension of the DDC representation, with expectations of a finite number of encoding functions, would be to approach the RKHS mean embedding, corresponding to infinitely many encoding functions [24, 25].

Even without these extensions, however, the DDC-HM offers a novel and powerful approach to probabilistic learning with complex hierarchical models.

Acknowledgments

This work was funded by the Gatsby Charitable Foundation.

References

[1] MJ Wainwright and MI Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2):1–305, 2008.

[2] RE Turner and M Sahani. Two problems with variational expectation maximisation for time-series models. In Bayesian Time Series Models, pp. 109–130. Cambridge University Press, 2011.

[3] DJ Rezende, S Mohamed, and D Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, pp. 1278–1286, 2014.

[4] D Kingma and M Welling. Auto-encoding variational Bayes. In 2nd International Conference on Learning Representations (ICLR 2014), 2014.

[5] CK Sonderby, T Raiko, L Maaloe, SK Sonderby, and O Winther. Ladder variational autoencoders. In Advances in Neural Information Processing Systems 29, pp. 3738–3746. Curran Associates, Inc., 2016.

[6] P Dayan, GE Hinton, RM Neal, and RS Zemel.
The Helmholtz machine. Neural Computation, 7(5):889–904, 1995.

[7] RS Zemel, P Dayan, and A Pouget. Probabilistic interpretation of population codes. Neural Computation, 10(2):403–430, 1998.

[8] M Sahani and P Dayan. Doubly distributional population codes: Simultaneous representation of uncertainty and multiplicity. Neural Computation, 15(10):2255–2279, 2003.

[9] S Gershman and N Goodman. Amortized inference in probabilistic reasoning. In Proceedings of the Annual Meeting of the Cognitive Science Society, vol. 36, 2014.

[10] T Minka. Divergence measures and message passing. Microsoft Research, 2005.

[11] Y Burda, R Grosse, and R Salakhutdinov. Importance weighted autoencoders. arXiv:1509.00519, 2015.

[12] A Mnih and D Rezende. Variational inference for Monte Carlo objectives. In Proceedings of the 33rd International Conference on Machine Learning, pp. 2188–2196, 2016.

[13] R Ranganath, L Tang, L Charlin, and D Blei. Deep exponential families. In Artificial Intelligence and Statistics, pp. 762–771, 2015.

[14] RM Neal. Connectionist learning of belief networks. Artificial Intelligence, 56(1):71–113, 1992.

[15] GE Hinton, P Dayan, BJ Frey, and RM Neal. The "wake-sleep" algorithm for unsupervised neural networks. Science, 268(5214):1158–1161, 1995.

[16] A Rahimi and B Recht. Uniform approximation of functions with random bases. In 46th Annual Allerton Conference on Communication, Control, and Computing, pp. 555–561, 2008.

[17] A Gretton, KM Borgwardt, MJ Rasch, B Scholkopf, and A Smola. A kernel two-sample test. Journal of Machine Learning Research, 13:723–773, 2012.

[18] D Rezende and S Mohamed. Variational inference with normalizing flows. In Proceedings of the 32nd International Conference on Machine Learning, pp. 1530–1538, 2015.

[19] DP Kingma and J Ba. Adam: A method for stochastic optimization.
arXiv:1412.6980, 2014.

[20] W Bounliphone, E Belilovsky, MB Blaschko, I Antonoglou, and A Gretton. A test of relative similarity for model selection in generative models. arXiv:1511.04581, 2015.

[21] W Jitkrittum, Z Szabo, K Chwialkowski, and A Gretton. Interpretable distribution features with maximum testing power. arXiv:1605.06796, 2016.

[22] JH van Hateren and A van der Schaaf. Independent component filters of natural images compared with simple cells in primary visual cortex. Proceedings of the Royal Society B: Biological Sciences, 265(1394):359–366, 1998.

[23] R Salakhutdinov and I Murray. On the quantitative analysis of deep belief networks. In Proceedings of the 25th International Conference on Machine Learning, pp. 872–879. ACM, 2008.

[24] A Smola, A Gretton, L Song, and B Schölkopf. A Hilbert space embedding for distributions. In International Conference on Algorithmic Learning Theory, pp. 13–31. Springer, 2007.

[25] S Grunewalder, G Lever, L Baldassarre, S Patterson, A Gretton, and M Pontil. Conditional mean embeddings as regressors. In Proceedings of the 29th International Conference on Machine Learning, vol. 2, pp. 1823–1830, 2012.