{"title": "Iterative Refinement of the Approximate Posterior for Directed Belief Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 4691, "page_last": 4699, "abstract": "Variational methods that rely on a recognition network to approximate the posterior of directed graphical models offer better inference and learning than previous methods. Recent advances that exploit the capacity and flexibility in this approach have expanded what kinds of models can be trained. However, as a proposal for the posterior, the capacity of the recognition network is limited, which can constrain the representational power of the generative model and increase the variance of Monte Carlo estimates. To address these issues, we introduce an iterative refinement procedure for improving the approximate posterior of the recognition network and show that training with the refined posterior is competitive with state-of-the-art methods. The advantages of refinement are further evident in an increased effective sample size, which implies a lower variance of gradient estimates.", "full_text": "Iterative Refinement of the Approximate Posterior for Directed Belief Networks\n\nR Devon Hjelm\n\nUniversity of New Mexico and the Mind Research Network\n\ndhjelm@mrn.org\n\nKyunghyun Cho\n\nCourant Institute & Center for Data Science, New York University\n\nkyunghyun.cho@nyu.edu\n\nJunyoung Chung\n\nUniversity of Montreal\n\njunyoung.chung@umontreal.ca\n\nRuss Salakhutdinov\n\nCarnegie Mellon University\n\nrsalakhu@cs.toronto.edu\n\nVince Calhoun\n\nUniversity of New Mexico and the Mind Research Network\n\nvcalhoun@mrn.org\n\nNebojsa Jojic\n\nMicrosoft Research\n\njojic@microsoft.com\n\nAbstract\n\nVariational methods that rely on a recognition network to approximate the posterior of directed graphical models offer better inference and learning than previous methods. 
Recent advances that exploit the capacity and flexibility in this approach have expanded what kinds of models can be trained. However, as a proposal for the posterior, the capacity of the recognition network is limited, which can constrain the representational power of the generative model and increase the variance of Monte Carlo estimates. To address these issues, we introduce an iterative refinement procedure for improving the approximate posterior of the recognition network and show that training with the refined posterior is competitive with state-of-the-art methods. The advantages of refinement are further evident in an increased effective sample size, which implies a lower variance of gradient estimates.\n\n1 Introduction\n\nVariational methods have surpassed traditional methods such as Markov chain Monte Carlo [MCMC, 15] and mean-field coordinate ascent [25] as the de facto standard approach for training directed graphical models. Helmholtz machines [3] are a type of directed graphical model that approximate the posterior distribution with a recognition network, providing fast inference as well as flexible learning that scales well to large datasets. Many recent significant advances in training Helmholtz machines come as estimators for the gradient of the objective w.r.t. the approximate posterior. The most successful of these methods, the variational autoencoder [VAE, 12], relies on a re-parameterization of the latent variables to pass the learning signal to the recognition network. 
This type of parameterization, however, is not available with discrete units, and the naive Monte Carlo estimate of the gradient has too high variance to be practical [3, 12]. However, good estimators are available through importance sampling [1], input-dependent baselines [13], a combination of baselines and importance sampling [14], and parametric Taylor expansions [9].\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nEach of these methods strives to be a lower-variance, unbiased gradient estimator. However, the reliance on the recognition network means that the quality of learning is bounded by the capacity of the recognition network, which in turn raises the variance. We demonstrate reducing the variance of Monte Carlo based estimators by iteratively refining the approximate posterior provided by the recognition network. The complete learning algorithm follows expectation-maximization [EM, 4, 16]: in the E-step, the variational parameters of the approximate posterior are initialized using the recognition network, then iteratively refined. The refinement procedure provides an asymptotically-unbiased estimate of the variational lowerbound, which is tight w.r.t. the true posterior and can be used to easily train both the recognition network and generative model during the M-step. The variance-reducing refinement is available to any directed graphical model and can give a more accurate estimate of the log-likelihood of the model. For the iterative refinement step, we use adaptive importance sampling [AIS, 17]. We demonstrate that the proposed refinement procedure is effective for training directed belief networks, providing better or competitive estimates of the log-likelihood. 
We also demonstrate that the improved posterior from refinement can improve inference and the accuracy of evaluation for models trained by other methods.\n\n2 Directed Belief Networks and Variational Inference\n\nA directed belief network is a generative directed graphical model consisting of a conditional density p(x|h) and a prior p(h), such that the joint density can be expressed as p(x, h) = p(x|h)p(h). In particular, the joint density factorizes into a hierarchy of conditional densities and a prior: p(x, h) = p(x|h_1) p(h_L) \prod_{l=1}^{L-1} p(h_l|h_{l+1}), where p(h_l|h_{l+1}) is the conditional density at the l-th layer and p(h_L) is a prior distribution of the top layer. Sampling from the model can be done simply via ancestral sampling: first sampling from the prior, then subsequently sampling from each layer until reaching the observation, x. This latent variable structure can improve model capacity, but inference can still be intractable, as is the case in sigmoid belief networks [SBN, 15], deep belief networks [DBN, 11], deep autoregressive networks [DARN, 7], and other models in which each of the conditional distributions involves complex nonlinear functions.\n\n2.1 Variational Lowerbound of a Directed Belief Network\n\nThe objective we consider is the likelihood function, p(x; φ), where φ represents the parameters of the generative model (e.g., a directed belief network). Estimating the likelihood function given the joint distribution, p(x, h; φ), above is not generally possible, as it requires intractable marginalization over h. Instead, we introduce an approximate posterior, q(h|x), as a proposal distribution. 
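As a concrete illustration of ancestral sampling through such a hierarchy, the following is a minimal NumPy sketch of a two-layer binary belief network p(h_2) p(h_1|h_2) p(x|h_1). The layer sizes and random weights here are hypothetical stand-ins for a trained model, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(4)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical 2-layer sigmoid belief network: p(h2) p(h1|h2) p(x|h1).
prior_p = np.full(5, 0.5)                            # p(h_2): factorized Bernoulli prior
W2, b2 = rng.normal(0, 0.5, (5, 8)), np.zeros(8)     # parameters of p(h_1|h_2)
W1, b1 = rng.normal(0, 0.5, (8, 12)), np.zeros(12)   # parameters of p(x|h_1)

def ancestral_sample():
    # Sample top-down: the prior first, then each conditional layer, then x.
    h2 = (rng.random(5) < prior_p).astype(float)
    h1 = (rng.random(8) < sigmoid(h2 @ W2 + b2)).astype(float)
    x = (rng.random(12) < sigmoid(h1 @ W1 + b1)).astype(float)
    return x

x = ancestral_sample()
```

Each layer is sampled only after its parent, so a single top-down pass yields an exact joint sample; it is inference (the reverse direction) that remains intractable.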
In this case, the log-likelihood can be bounded from below*:\n\n\log p(x) = \log \sum_h p(x, h) \ge \sum_h q(h|x) \log \frac{p(x, h)}{q(h|x)} = E_{q(h|x)} \left[ \log \frac{p(x, h)}{q(h|x)} \right] := L_1, \quad (1)\n\nwhere we introduce the subscript in the lowerbound to make the connection to importance sampling later. The bound is tight (i.e., L_1 = \log p(x)) when the KL divergence between the approximate and true posterior is zero (i.e., D_{KL}(q(h|x)||p(h|x)) = 0). The gradients of the lowerbound w.r.t. the generative model can be approximated using the Monte Carlo approximation of the expectation:\n\n\nabla_φ L_1 \approx \frac{1}{K} \sum_{k=1}^{K} \nabla_φ \log p(x, h^{(k)}; φ), \quad h^{(k)} \sim q(h|x). \quad (2)\n\nThe success of variational inference lies in the choice of approximate posterior, as a poor choice can result in a looser variational bound. A deep feed-forward recognition network parameterized by ψ has become a popular choice, such that q(h|x) = q(h|x; ψ), as it offers fast and flexible data-dependent inference [see, e.g., 22, 12, 13, 20]. Generally known as a “Helmholtz machine” [3], these approaches often require additional tricks to train, as the naive Monte Carlo gradient of the lowerbound w.r.t. the variational parameters has high variance. In addition, the variational lowerbound in Eq. (1) is constrained by the assumptions implicit in the choice of approximate posterior, as the approximate posterior must be within the capacity of the recognition network and factorial.\n\n* For clarity of presentation, we will often omit dependence on the parameters φ of the generative model, so that p(x, h) = p(x, h; φ).\n\nFigure 1: Iterative refinement for variational inference. An initial estimate of the variational parameters is made through a recognition network. The variational parameters are then updated iteratively, maximizing the lowerbound. 
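The Monte Carlo estimate of L_1 in Eqs. (1)–(2) can be sketched for a toy one-layer model. This is a minimal sketch under stated assumptions: the generative weights are random placeholders, and q is a fixed factorized Bernoulli proposal rather than a trained recognition network:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_bernoulli(x, p, eps=1e-7):
    # Sum of independent Bernoulli log-probabilities over the last axis.
    p = np.clip(p, eps, 1 - eps)
    return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p), axis=-1)

# Toy one-layer belief network: prior p(h), likelihood p(x|h), proposal q(h|x).
prior_p = np.full(4, 0.5)          # p(h): factorized Bernoulli prior
W = rng.normal(0, 0.1, (4, 6))     # hypothetical generative weights
b = np.zeros(6)

def lowerbound_L1(x, q_mu, K=100):
    # Monte Carlo estimate of L_1 = E_q[log p(x, h) - log q(h|x)]  (Eq. 1)
    h = (rng.random((K, q_mu.size)) < q_mu).astype(float)   # h^(k) ~ q(h|x)
    log_p_xh = log_bernoulli(h, prior_p) + log_bernoulli(x, sigmoid(h @ W + b))
    log_q = log_bernoulli(h, q_mu)
    return np.mean(log_p_xh - log_q)

x = (rng.random(6) < 0.5).astype(float)
q_mu = np.full(4, 0.5)             # stand-in for a recognition-network output
L1 = lowerbound_L1(x, q_mu)
```

The gradient estimate of Eq. (2) replaces the averaged scalar with the averaged gradient of log p(x, h^(k); φ) over the same samples.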
The final approximate posterior is used to train the generative model by sampling. The recognition network parameters are updated using the KL divergence between the refined posterior q_T and the output of the recognition network q_0.\n\n2.2 Importance Sampled Variational Lowerbound\n\nThese assumptions can be relaxed by using an unbiased K-sampled importance weighted estimate of the likelihood function (see [2] for details):\n\nL_1 \le L_K = E_{h^{(1)}, \ldots, h^{(K)} \sim q(h|x)} \left[ \log \frac{1}{K} \sum_{k=1}^{K} \frac{p(x, h^{(k)})}{q(h^{(k)}|x)} \right] = E \left[ \log \frac{1}{K} \sum_{k=1}^{K} w^{(k)} \right] \le \log p(x), \quad (3)\n\nwhere h^{(k)} \sim q(h|x) and w^{(k)} are the importance weights. This lowerbound is tighter than the single-sample version provided in Eq. (1) and is an asymptotically unbiased estimate of the log-likelihood as K \to \infty.\nThe gradient of the lowerbound w.r.t. the model parameters φ is simple and can be estimated as:\n\n\nabla_φ L_K \approx \sum_{k=1}^{K} \tilde{w}^{(k)} \nabla_φ \log p(x, h^{(k)}; φ), \quad \text{where } \tilde{w}^{(k)} = \frac{w^{(k)}}{\sum_{k'=1}^{K} w^{(k')}}. \quad (4)\n\nThe estimator in Eq. (3) can reduce the variance of the gradients, \nabla_ψ L_K, but in general additional variance reduction is needed [14]. Alternatively, importance sampling yields an estimate of the inclusive KL divergence, D_{KL}(p(h|x)||q(h|x)), which can be used for training the parameters ψ of the recognition network [1]. However, it is well known that importance sampling can yield heavily-skewed distributions over the importance weights [5], so that only a small number of the samples will effectively have non-zero weight. This is consequential not only in training, but also for evaluating models when using Eq. 
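A numerically stable, log-domain computation of the importance weights and the K-sample bound can be sketched as follows. The per-sample log p(x, h^(k)) and log q(h^(k)|x) values below are synthetic placeholders standing in for evaluations under a real model and proposal:

```python
import numpy as np

def logsumexp(a):
    # Stable log(sum(exp(a))): subtract the max before exponentiating.
    m = np.max(a)
    return m + np.log(np.sum(np.exp(a - m)))

rng = np.random.default_rng(1)
K = 50
log_p = rng.normal(-10.0, 1.0, K)   # placeholder values of log p(x, h^(k))
log_q = rng.normal(-9.0, 1.0, K)    # placeholder values of log q(h^(k)|x)

log_w = log_p - log_q                        # log importance weights
log_LK = logsumexp(log_w) - np.log(K)        # log (1/K) sum_k w^(k), the K-sample bound
w_tilde = np.exp(log_w - logsumexp(log_w))   # normalized weights for Eq. (4)
```

Working in log space avoids overflow and underflow when the weights are heavily skewed, which is exactly the regime the surrounding text warns about.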
(3) to estimate test log-probabilities, which requires drawing a very large number of samples (N \ge 100{,}000 in the literature for models trained on MNIST [7]).\nThe effective sample size, n_e, of importance-weighted estimates increases and is optimal when the approximate posterior matches the true posterior:\n\nn_e = \frac{\left( \sum_{k=1}^{K} w^{(k)} \right)^2}{\sum_{k=1}^{K} (w^{(k)})^2} \le K, \quad (5)\n\nwith equality when q(h|x) = p(h|x), in which case every weight equals w^{(k)} = p(x, h^{(k)})/p(h^{(k)}|x) = p(x), so that n_e = (K p(x))^2 / (K p(x)^2) = K. Conversely, importance sampling from a poorer approximate posterior will have a lower effective sample size, resulting in higher variance of the gradient estimates. In order to improve the effectiveness of importance sampling, we need a method for improving the approximate posterior over that provided by the recognition network.\n\n3 Iterative Refinement for Variational Inference (IRVI)\n\nTo address the above issues, iterative refinement for variational inference (IRVI) uses the recognition network as a preliminary guess of the posterior, then refines the posterior through iterative updates of the variational parameters. For the refinement step, IRVI uses a stochastic transition operator, g(.), that maximizes the variational lowerbound.\n\nAn overview of IRVI is given in Figure 1. For the expectation (E)-step, we feed the observation x through the recognition network to get the initial parameters, µ_0, of the approximate posterior, q_0(h|x; ψ). We then refine µ_0 by applying T updates to the variational parameters, µ_{t+1} = g(µ_t, x), iterating through T parameterizations µ1, . . . 
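The effective sample size of Eq. (5) is cheap to compute from log-weights; the sketch below (with synthetic log-weights) illustrates that uniform weights achieve n_e = K while skewed weights fall well below it:

```python
import numpy as np

def effective_sample_size(log_w):
    # n_e = (sum_k w^(k))^2 / sum_k (w^(k))^2  (Eq. 5), from log-weights.
    # Any common normalization constant cancels in the ratio, so we may
    # subtract the max log-weight for numerical stability.
    w = np.exp(log_w - np.max(log_w))
    return w.sum() ** 2 / np.sum(w ** 2)

rng = np.random.default_rng(2)
K = 1000
ess_uniform = effective_sample_size(np.zeros(K))          # equal weights: n_e = K
ess_skewed = effective_sample_size(rng.normal(0, 5, K))   # heavy-tailed log-weights: n_e << K
```

A collapsing n_e is a practical diagnostic that the proposal has drifted far from the true posterior, motivating the refinement procedure introduced next.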
, µT of the approximate posterior q_t(h|x).\nWith the final set of parameters, µ_T, the gradient estimate of the recognition parameters ψ in the maximization (M)-step is taken w.r.t. the negative exclusive KL divergence:\n\n-\nabla_ψ D_{KL}(q_T(h|x) || q_0(h|x; ψ)) \approx \frac{1}{K} \sum_{k=1}^{K} \nabla_ψ \log q_0(h^{(k)}|x; ψ), \quad (6)\n\nwhere h^{(k)} \sim q_T(h|x). Similarly, the gradients w.r.t. the parameters of the generative model φ follow Eqs. (2) or (4) using samples from the refined posterior q_T(h|x). As an alternative to Eq. (6), we can maximize the negative inclusive KL divergence using the refined approximate posterior:\n\n-\nabla_ψ D_{KL}(p(h|x) || q_0(h|x; ψ)) \approx \sum_{k=1}^{K} \tilde{w}^{(k)} \nabla_ψ \log q_0(h^{(k)}|x; ψ). \quad (7)\n\nThe form of the IRVI transition operator, g(µ_t, x), depends on the problem. In the case of continuous variables, we can make use of the VAE re-parameterization with the gradient of the lowerbound in Eq. (1) for our refinement step (see supplementary material). However, as this is not available with discrete units, we take a different approach that relies on adaptive importance sampling.\n\n3.1 Adaptive Importance Refinement (AIR)\n\nAdaptive importance sampling [AIS, 17] provides a general approach for iteratively refining the variational parameters. For Bernoulli distributions, we observe that the mean parameter of the true posterior, \hat{µ}, can be written as the expected value of the latent variables:\n\n\hat{µ} = E_{p(h|x)}[h] = \sum_h p(h|x) \, h = \frac{1}{p(x)} \sum_h \frac{p(x, h)}{q(h|x)} q(h|x) \, h \approx \sum_{k=1}^{K} \tilde{w}^{(k)} h^{(k)}. \quad (8)\n\nAs the initial estimator typically has high variance, AIS iteratively moves µ_t toward \hat{µ} by applying Eq. (8) until a stopping criterion is met. While using the update g(µ_t, x, γ) = \sum_{k=1}^{K} \tilde{w}^{(k)} h^{(k)} in principle works, a convex combination of the importance sample estimate of the current step and the parameters from the previous step tends to be more stable:\n\nh^{(k)} \sim \text{Bernoulli}(µ_t); \quad µ_{t+1} = g(µ_t, x, γ) = (1 - γ) µ_t + γ \sum_{k=1}^{K} \tilde{w}^{(k)} h^{(k)}. \quad (9)\n\nHere, γ is the inference rate and (1 - γ) can be thought of as the adaptive “damping” rate. This approach, which we call adaptive importance refinement (AIR), should work with any discrete parametric distribution. Although AIR is applicable with continuous Gaussian variables, which model second-order statistics, we leave adapting AIR to continuous latent variables for future work.\n\n3.2 Algorithm and Complexity\n\nThe general AIR algorithm follows Algorithm 1, with gradient variations following Eqs. (2), (4), (6), and (7). While iterative refinement may reduce the variance of stochastic gradient estimates and speed up learning, it comes at a computational cost, as each update is T times more expensive than fixed approximations. However, in addition to potential learning benefits, AIR can also improve the approximate posterior of an already-trained directed belief network at test time, independent of how the model was trained. 
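The damped update of Eq. (9) can be sketched end-to-end for a toy one-layer binary model. This is a minimal sketch under stated assumptions: the generative weights are random, hypothetical stand-ins for a trained model, and the flat initial µ plays the role of a recognition-network output:

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_bernoulli(h, p, eps=1e-7):
    p = np.clip(p, eps, 1 - eps)
    return np.sum(h * np.log(p) + (1 - h) * np.log(1 - p), axis=-1)

# Toy SBN-style joint: factorized Bernoulli prior and Bernoulli likelihood.
prior_p = np.full(8, 0.5)            # p(h)
W = rng.normal(0, 0.5, (8, 12))      # hypothetical generative weights for p(x|h)

def air_step(mu, x, gamma=0.1, K=20):
    # One refinement step: mu_{t+1} = (1 - gamma) mu_t + gamma sum_k w~^(k) h^(k)  (Eq. 9)
    h = (rng.random((K, mu.size)) < mu).astype(float)        # h^(k) ~ Bernoulli(mu_t)
    log_w = (log_bernoulli(h, prior_p)                        # log p(h^(k))
             + log_bernoulli(x, sigmoid(h @ W))               # + log p(x|h^(k))
             - log_bernoulli(h, mu))                          # - log q_t(h^(k)|x)
    w = np.exp(log_w - log_w.max())
    w_tilde = w / w.sum()                                     # normalized weights
    return (1 - gamma) * mu + gamma * (w_tilde @ h)

x = (rng.random(12) < 0.5).astype(float)
mu = np.full(8, 0.5)                 # initial guess, as from a recognition network
for _ in range(20):                  # T = 20 refinement steps
    mu = air_step(mu, x)
```

Because each step is a convex combination with the previous parameters, µ stays strictly inside (0, 1), and the damping (1 − γ) keeps the trajectory stable even when a single batch of importance weights is skewed.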
Our implementation following Algorithm 1 is available at https://github.com/rdevon/IRVI.\n\nAlgorithm 1 AIR\nRequire: A generative model p(x, h; φ) = p(x|h; φ)p(h; φ) and a recognition network µ_0 = f(x; ψ)\nRequire: A transition operator g(µ, x, γ) and inference rate γ\nCompute µ_0 = f(x; ψ) for q_0(h|x; ψ)\nfor t = 1:T do\n  Draw K samples h^{(k)} \sim q_{t-1}(h|x) and compute normalized importance weights \tilde{w}^{(k)}\n  µ_t = (1 - γ) µ_{t-1} + γ \sum_{k=1}^{K} \tilde{w}^{(k)} h^{(k)}\nend for\nif reweight then\n  \Delta φ \propto \sum_{k=1}^{K} \tilde{w}^{(k)} \nabla_φ \log p(x, h^{(k)}; φ)\nelse\n  \Delta φ \propto \frac{1}{K} \sum_{k=1}^{K} \nabla_φ \log p(x, h^{(k)}; φ)\nend if\nif inclusive KL divergence then\n  \Delta ψ \propto \sum_{k=1}^{K} \tilde{w}^{(k)} \nabla_ψ \log q_0(h^{(k)}|x; ψ)\nelse\n  \Delta ψ \propto \frac{1}{K} \sum_{k=1}^{K} \nabla_ψ \log q_0(h^{(k)}|x; ψ)\nend if\n\n4 Related Work\n\nAdaptive importance refinement (AIR) trades computation for expressiveness and is similar in this regard to the refinement procedure of hybrid MCMC for variational inference [HVI, 24] and normalizing flows for VAE [NF, 21]. HVI has a similar complexity to AIR, as it requires re-estimating the lowerbound at every step. While NF can be less expensive than AIR, both HVI and NF rely on the VAE re-parameterization to work, and thus cannot be applied to discrete variables. Sequential importance sampling [SIS, 5] can offer a better refinement step than AIS but typically requires resampling to control variance. 
While parametric versions exist that could be applicable to training directed graphical models with discrete units [8, 18], their applicability as a general refinement procedure is limited, as the refinement parameters need to be learned.\nImportance sampling is central to reweighted wake-sleep [RWS, 1], importance-weighted autoencoders [IWAE, 2], variational inference for Monte Carlo objectives [VIMCO, 14], and recent work on stochastic feed-forward networks [SFFN, 26, 19]. While each of these methods is competitive, they rely on importance samples from the recognition network and do not offer the low-variance estimates available from AIR. Neural variational inference and learning [NVIL, 13] is a single-sample and biased version of VIMCO, which is greatly outperformed by techniques that use importance sampling. Both NVIL and VIMCO reduce the variance of the Monte Carlo estimates of gradients by using an input-dependent baseline, but this approach does not necessarily provide a better posterior and cannot be used to give better estimates of the likelihood function or expectations.\nFinally, IRVI is meant to be a general approach to refining the approximate posterior. IRVI is not limited to the refinement step provided by AIR, and many different types of refinement steps are available to improve the posterior for the models above (see supplementary material for the continuous case). SIS and sequential importance resampling [SIR, 6] can be used as alternatives to AIR and may provide a better refinement step for IRVI.\n\n5 Experiments\n\nWe evaluate iterative refinement for variational inference (IRVI) using adaptive importance refinement (AIR) for both training and evaluating directed belief networks. We train and test on the following benchmarks: the binarized MNIST handwritten digit dataset [23] and the Caltech-101 Silhouettes dataset. 
We centered the MNIST and Caltech datasets by subtracting the mean image over the training set when used as input to the recognition network. For comparison, and to demonstrate improving the approximate posterior with refinement, we also train additional models using the reweighted wake-sleep algorithm [RWS, 1], the state of the art for many configurations of directed belief networks with discrete variables on these datasets. With our experiments, we show that 1) IRVI can train a variety of directed models as well as or better than existing methods, 2) the gains from refinement improve the approximate posterior and can be applied to models trained by other algorithms, and 3) IRVI can be used to improve a model with a relatively simple approximate posterior.\nModels were trained using the RMSprop algorithm [10] with a batch size of 100 and early stopping by the best recorded variational lowerbound on the validation dataset.\n\nFigure 2: The log-likelihood (left) and normalized effective sample size (right) with epochs in log-scale on the training set for AIR with 5 and 20 refinement steps (vanilla AIR), reweighted AIR with 5 and 20 refinement steps, reweighted AIR with the inclusive KL objective and 5 or 20 refinement steps, and reweighted wake-sleep (RWS), all with a single stochastic latent layer. All models were evaluated with 100 posterior samples, their respective number of refinement steps for the effective sample size (ESS), and with 20 refinement steps of AIR for the log-likelihood. Despite longer wall-clock time per epoch,\n\nFor AIR, 20 “inference steps” (K = 20), 20 adaptive samples (M = 20), and an adaptive damping rate, (1 - γ), of 0.9 were used during inference, chosen from validation in initial experiments. 20 posterior samples (N = 20) were used for model parameter updates for both AIR and RWS. 
All models were trained for 500 epochs and were fine-tuned for an additional 500 with a decaying learning rate and SGD.\nWe use a generative model composed of a) a factorized Bernoulli prior, as with sigmoid belief networks (SBNs), or b) an autoregressive prior, as in published MNIST results with deep autoregressive networks [DARN, 7]:\n\na) \; p(h) = \prod_i p(h_i), \quad P(h_i = 1) = \sigma(b_i); \qquad b) \; P(h_i = 1) = \sigma\left( b_i + \sum_{j=0}^{i-1} W_{r,ij} h_j \right), \quad (10)\n\nwhere σ is the sigmoid function (σ(x) = 1/(1 + \exp(-x))), W_r is a lower-triangular square matrix, and b is the bias vector.\nFor our experiments, we use conditional and approximate posterior densities that follow Bernoulli distributions: