{"title": "Improving Explorability in Variational Inference with Annealed Variational Objectives", "book": "Advances in Neural Information Processing Systems", "page_first": 9701, "page_last": 9711, "abstract": "Despite the advances in the representational capacity of approximate distributions for variational inference, the optimization process can still limit the density that is ultimately learned.\nWe demonstrate the drawbacks of biasing the true posterior to be unimodal, and introduce Annealed Variational Objectives (AVO) into the training of hierarchical variational methods.\nInspired by Annealed Importance Sampling, the proposed method facilitates learning by incorporating energy tempering into the optimization objective.\nIn our experiments, we demonstrate our method's robustness to deterministic warm up, and the benefits of encouraging exploration in the latent space.", "full_text": "Improving Explorability in Variational Inference with\n\nAnnealed Variational Objectives\n\nChin-Wei Huang\u2020,?,1 Shawn Tan\u2020,2 Alexandre Lacoste?,3 Aaron Courville\u2020k,4\n\n1chin-wei.huang@umontreal.ca, 2jing.shan.shawn.tan@umontreal.ca\n\n3allac@elementai.com, 4aaron.courville@umontreal.ca\n\n\u2020MILA, University of Montreal\n\n?Element AI\n\nkCIFAR Fellow\n\nAbstract\n\nDespite the advances in the representational capacity of approximate distributions\nfor variational inference, the optimization process can still limit the density that is\nultimately learned. We demonstrate the drawbacks of biasing the true posterior to be\nunimodal, and introduce Annealed Variational Objectives (AVO) into the training\nof hierarchical variational methods. Inspired by Annealed Importance Sampling,\nthe proposed method facilitates learning by incorporating energy tempering into\nthe optimization objective. 
In our experiments, we demonstrate our method's
robustness to deterministic warm-up, and the benefits of encouraging exploration
in the latent space.

1 Introduction

Variational Inference (VI) has played an important role in Bayesian model uncertainty calibration and
in unsupervised representation learning. It differs from Markov Chain Monte Carlo (MCMC)
methods, which rely on a Markov chain reaching equilibrium; in VI one can easily draw i.i.d.
samples from the variational distribution and enjoy lower variance in inference. On the other hand,
VI is subject to bias introduced by the approximating variational distribution.
As pointed out by Turner and Sahani (2011), variational approximations tend not to propagate
uncertainty well. This inaccuracy and overconfidence in inference can bias statistics of
certain features of the unobserved variable, such as the marginal likelihood of the data or the predictive
posterior in the context of a Bayesian model. We argue that this is especially true in amortized VI
setups (Kingma and Welling, 2014; Rezende et al., 2014), where one seeks to learn representations
of the data in an efficient manner. We note that this sacrifices the chance of exploring different
configurations of the representation during inference, and can bias and hurt the training of the model.
This bias is often attributed to the variational family that is used, such as a factorial Gaussian chosen for
computational tractability, which limits the expressiveness of the approximate posterior. In principle,
this can be alleviated by using a richer family of distributions, but in practice it remains a
challenging optimization problem.
To see this, we write the training objective of VI as the KL
divergence, also known as the variational free energy F, between the proposal q and the target
distribution f:

F(q) = E_q[log q(z) − log f(z)] = D_KL(q || f).

Due to the KL divergence, q gets penalized for allocating probability mass in regions where f
has low density. This behaviour of the objective can result in a cascade of consequences. The
variational distribution becomes biased towards being excessively confident, which, in turn, can
inhibit the variational approximation from escaping poor local minima, even when it has sufficient
representational capacity to accurately represent the posterior. This usually happens when the target

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

(posterior) distribution has a complicated structure, such as multimodality, where the variational
distribution might get stuck fitting only a subset of modes.
In what follows, we discuss two annealing techniques that are designed with two diverging goals in
mind. On the one hand, alpha-annealing is used to encourage exploration of significant modes and is
designed for learning a good inference model. On the other, beta-annealing facilitates the learning of
a good generative model by reducing noise and regularization during training.

Alpha-annealing: The optimization problem of inference stems from the non-convex nature of the
objective, and can be mitigated via energy tempering (Katahira et al., 2008; Abrol et al., 2014; Mandt
et al., 2016):

E_q[log q(z) − α log f(z)],    (1)

where α = 1/T and T is analogous to the temperature in statistical mechanics. The temperature
is usually initially set to be high, and gradually annealed to 1, i.e. α goes from a small value
to 1.
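A minimal Monte Carlo sketch of the tempered objective in Equation 1, using an illustrative Gaussian proposal and a bimodal unnormalized target (both are assumptions for illustration, not the paper's experimental setup):

```python
import numpy as np

rng = np.random.default_rng(0)

def log_f(z):
    # Unnormalized bimodal target: equal-weight Gaussian mixture at +/-2 (assumption).
    return np.logaddexp(-0.5 * (z - 2.0) ** 2, -0.5 * (z + 2.0) ** 2)

def annealed_free_energy(mu, sigma, alpha, n=10_000):
    # Monte Carlo estimate of E_q[log q(z) - alpha * log f(z)]  (Eq. 1),
    # with q = N(mu, sigma^2).
    z = mu + sigma * rng.standard_normal(n)
    log_q = -0.5 * ((z - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))
    return np.mean(log_q - alpha * log_f(z))

# Annealing alpha from a small value up to 1 smooths the landscape early on:
for alpha in (0.1, 0.5, 1.0):
    print(alpha, annealed_free_energy(0.0, 3.0, alpha))
```

With a broad proposal, the cross-entropy term is downweighted at small alpha, so poor mass placement is penalized less and the proposal is free to spread over several modes.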
The intuition is that when α is small the energy landscape is smoothed out, since ∇_z f(z)^α =
α f(z)^(α−1) ∇_z f(z) = α f(z)^α ∇_z log f(z) goes to zero everywhere as α → 0, provided log f is continuous
and has bounded gradient.
However, alpha-annealing might not be ideal in practice. Tempering schemes are typically suitable
for one-time inference, e.g. inferring the posterior distribution of a Bayesian model,
but can be time consuming for latent variable models or hierarchical Bayesian models where
multiple inferences are required for maximizing the marginal likelihood. Examples include deep
latent variable models (Kingma and Welling, 2014; Rezende et al., 2014), filtering and smoothing
in deep state space models (Chung et al., 2015; Fraccaro et al., 2016; Karl et al., 2016; Shabanian
et al., 2017), and hierarchical Bayes with an inference network (Edwards and Storkey, 2017). Energy
tempering in these setups would harm training due to the excess noise injected into the
gradient estimate of the model.

Beta-annealing: Deterministic warm-up (Raiko et al., 2007) is applied to improve training of
the generative model, which is of even greater importance in the case of hierarchical
models (Sønderby et al., 2016) and latent variable models with flexible likelihood (Gulrajani et al.,
2017; Bowman et al., 2016). Let the joint likelihood of observed data x and latent variable z be
p(x, z) = p(x|z)p(z), which is equal to the true posterior f(z) = p(z|x) up to a normalizing constant
(the marginal p(x)). The annealed variational objective (negative Evidence Lower Bound, or ELBO) is

E_q[β (log q(z) − log p(z)) − log p(x|z)],    (2)

where β is annealed from 0 to 1.
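The annealed objective of Equation 2 can be sketched for a toy conjugate pair; the standard-normal prior and unit-variance Gaussian likelihood below are illustrative assumptions, not the paper's model:

```python
import numpy as np

rng = np.random.default_rng(0)

def beta_schedule(step, warmup_steps):
    # Linear warm-up: beta goes from 0 to 1 over `warmup_steps` updates.
    return min(1.0, step / warmup_steps)

def annealed_neg_elbo(x, mu_q, sigma_q, beta, n=5_000):
    # Monte Carlo estimate of E_q[beta * (log q(z) - log p(z)) - log p(x|z)]  (Eq. 2)
    # for p(z) = N(0, 1), p(x|z) = N(z, 1), and q = N(mu_q, sigma_q^2).
    z = mu_q + sigma_q * rng.standard_normal(n)
    log_q = -0.5 * ((z - mu_q) / sigma_q) ** 2 - np.log(sigma_q * np.sqrt(2 * np.pi))
    log_p_z = -0.5 * z ** 2 - 0.5 * np.log(2 * np.pi)
    log_lik = -0.5 * (x - z) ** 2 - 0.5 * np.log(2 * np.pi)
    return np.mean(beta * (log_q - log_p_z) - log_lik)

# Early in training (beta near 0) the prior-contrastive term is downweighted,
# so the objective is dominated by the reconstruction term:
print(annealed_neg_elbo(1.0, 0.5, 0.7, beta=0.0))
print(annealed_neg_elbo(1.0, 0.5, 0.7, beta=1.0))
```

At beta = 1 the objective recovers the usual negative ELBO; for beta < 1, the KL-like term shrinks, which is exactly what encourages sharp, autoencoder-like encodings.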
The rationale behind this annealing is that the first term in the
parenthesis over-regularizes the model by forcing the approximate posterior q(z) to be like the prior
p(z); by reducing this prior-contrastive coefficient early on during training we allow
the decoder to make better use of the latent code to represent the underlying structure of the data.
However, the entropy term matters less when the coefficient β is small, making the model behave more
like a deterministic autoencoder and biasing the encoding distribution towards being sharp and unimodal.
This approach is clearly in conflict with the principle of energy tempering, where one allows the
approximate posterior to "explore" the energy landscape for significant modes.

This work aims to clarify the implications of these two kinds of annealing, in terms of the
inductive bias of the learning objective and algorithm. We first review a few techniques in the
VI and MCMC literature that tackle the expressiveness problem (i.e. limited expressiveness of the
approximate posterior) and the optimization problem (i.e. the mode-seeking problem) in inference.
We focus on the latter. These are summarized in Section 3. We introduce Annealed Variational
Objectives (AVO) in Section 4, which aims to satisfy the criteria the alpha- and beta-annealing schemes
seek to achieve separately. Finally, in Section 5 we demonstrate the biasing effect of VI with beta-annealing,
and the relative robustness and effectiveness of the proposed method.

2 Related work

Naturally, recent works in VI have focused on reducing representational bias, especially in the
setting of amortized VI, known as Variational Auto-Encoders (VAE) (Kingma and Welling, 2014;
Rezende et al., 2014), by (1) finding more expressive families of variational distributions without
losing computational tractability.
Explicit density methods include Rezende and Mohamed (2015);
Ranganath et al. (2016); Kingma et al. (2016); Tomczak and Welling (2016, 2017); Berg et al. (2018);
Huang et al. (2018). Other methods include implicit density methods (Huszár, 2017; Mescheder
et al., 2017; Shi et al., 2018). A second line of research focuses on (2) reducing the amortization error
introduced by the use of a conditional network (Cremer et al., 2018; Krishnan et al., 2017; Marino
et al., 2018; Kim et al., 2018).
In terms of non-parametric methods, the Importance Weighted Auto-Encoder (IWAE) developed by
Burda et al. (2016) uses several samples to evaluate the loss and reduce the variational gap, which
can be expensive when the decoder is much more complex. Nowozin (2018) further
reduces the bias in inference via Jackknife Variational Inference. However, Rainforth et al. (2018)
observe that the signal-to-noise ratio of the encoder's gradient vanishes with an increasing number of
importance samples (reduced bias), rendering the encoder an ineffective representation model in
the limit. Salimans et al. (2015) combine MCMC with VI, but the inference process requires
multiple passes through the decoder as well.
Our method is orthogonal to all of these: assuming we have a rich family of approximate posteriors at
hand, it smooths out the objective landscape using annealed objectives specifically for hierarchical
VI methods. We also note that it is possible to consider alternative losses to induce different
optimization behavior, as in Li and Turner (2016); Dieng et al. (2017).

3 Background

In what follows, we consider a latent variable model with joint probability p_θ(x, z) = p(x|z)p(z),
where x and z are the observed and real-valued unobserved variables, and θ denotes the set of parameters
to be learned. Due to the non-conjugate prior-likelihood pair or the non-linearity of the conditioning,
exact inference is often intractable.
Direct maximization of the marginal likelihood is intractable in general
because the marginalization involves integration: log p(x) = log ∫ p(x, z) dz. Thus, training is
usually conducted by maximizing the expected complete data log likelihood (ECLL) over an auxiliary
distribution q:

max_θ E_{q(z)}[log p_θ(x, z)].

When using the conditional q(z|x) we emphasize the use of a recognition network in amortized VI
(Kingma and Welling, 2014; Rezende et al., 2014). When the exact posterior p(z|x) is tractable,
one could choose q(z) = p(z|x), which leads to the well-known Expectation-Maximization
(EM) algorithm. When exact inference is not possible, we need to approximate the true posterior,
usually by sampling from a Markov chain in MCMC or from a variational distribution in VI. The
approximation induces bias in learning the model θ, as

E_q[log p_θ(x, z)] = log p_θ(x) + E_q[log p_θ(z|x)].

That is to say, maximizing the ECLL increases the marginal likelihood of the data while biasing the
true posterior to be more like the auxiliary distribution. The second effect vanishes as q(z)
approximates p(z|x) better.
In this paper we focus on the case of VI, where the auxiliary (variational) distribution is learned by
maximizing the ELBO, which is equivalent to minimizing D_KL(q(z) || p(z|x)).
Due to the zero-forcing property of the KL, q tends to be unimodal and more concentrated. We
emphasize that a highly flexible parametric form of q can potentially alleviate this problem but does
not address the issue of finding the optimal parameters. Before going into the technical details of the choice
of q and loss function, we discuss a few implications and effects of doing approximate inference
in practice:

1. At initialization, the true posterior is likely to be multi-modal.
Having a unimodal q helps
to regularize the latent space, biasing the true posterior towards being unimodal as well.
Depending on the type of data being modeled, this may be a good property to have. A
unimodal approximate posterior also means less noise when sampling from q, which facilitates
learning through lower-variance gradient estimates.

2. In cases where the true posterior has to be multi-modal, biasing the posterior to be unimodal
inhibits the model from learning the true generative process of the data. Allowing sufficient
uncertainty in the approximate posterior encourages the learning process to explore the
spurious modes in the true posterior.

3. Beta-annealing facilitates point 1 by lowering the penalty of the prior-contrastive term,
allowing the q(z) estimate to be sharp. Alpha-annealing encourages exploration
(point 2) by lowering the penalty of the cross-entropy term, increasing the relative importance of the
entropy term and therefore the tendency to explore the latent space.

3.1 Assumption of the variational family

Recent works have shown promising results from more expressive parametric forms of the
variational distribution. This allows the variational family to cover a wider range of distributions,
ideally including the true posterior that we seek to approximate. For instance, the Hierarchical
Variational Inference (HVI) methods (Ranganath et al., 2016) are a generic family of methods that
subsume discrete mixture proposals (e.g.
mixtures of Gaussians), auxiliary variable methods (Agakov
and Barber, 2004; Maaløe et al., 2016), and normalizing flows (Rezende and Mohamed, 2015) as
subfamilies.

In HVI, we use a latent variable model q(z_T) = ∫ q(z_T, z_{t<T}) dz_{t<T} as the variational distribution.

Figure 2: Learned posteriors given different locations where inference is ambiguous. (b) p(z | x = [0, −1]^⊤); (c) p(z | x = [0.1, −1]^⊤).

m(z) = W_m · h(z) + b_m;    g(z) = act_g(W_g · h(z) + b_g);    h(z) = act_h(W_h · z + b_h).

The activation functions act, act_g and act_h are chosen to be softplus, sigmoid and ReLU (Nair and
Hinton, 2010) (or ELU (Clevert et al., 2016) in the case of amortized VI). In the case of amortized VI,
we replace the dot product with the conditional weight normalization (Salimans and Kingma, 2016)
operator proposed in Krueger et al. (2017); Huang et al. (2018).

4.3 Loss-calibrated AVO

Since the correctness of the marginal q_T(z_T) trained with AVO depends on the optimality of each
transition operator, when used for amortized VI each update will not necessarily improve the marginal
as an approximate posterior. Hence, in amortized VI, we consider the following loss-calibrated
version of AVO⁵:

max over q_t(z_t|z_{t−1}), r_t(z_{t−1}|z_t) of
a · E_{q_t(z_t|z_{t−1}) q_{t−1}(z_{t−1})}[ log ( f̃_t(z_t) r_t(z_{t−1}|z_t) / ( q_t(z_t|z_{t−1}) q_{t−1}(z_{t−1}) ) ) ] + (1 − a) · L(x)

for all t, where a ∈ [0, 1] is a hyperparameter that determines the weight of AVO used in training.
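The extracted text omits the formal definition of the intermediate targets f̃_t from Section 4. Following the Annealed Importance Sampling construction (Neal, 2001) that inspires AVO, they can be sketched as geometric bridges between a tractable initial density and the true target; the standard-normal initial density, bimodal target, and linear schedule β_t = t/T below are all illustrative assumptions:

```python
import numpy as np

def log_f0(z):
    # Tractable initial log-density, e.g. a standard normal (assumption).
    return -0.5 * z ** 2

def log_f(z):
    # Unnormalized log-density of the true target (illustrative bimodal energy).
    return np.logaddexp(-0.5 * (z - 2.0) ** 2, -0.5 * (z + 2.0) ** 2)

def log_f_tilde(z, t, T):
    # AIS-style geometric bridge: (1 - beta_t) * log_f0 + beta_t * log_f,
    # with a linear schedule beta_t = t / T (assumption).
    beta_t = t / T
    return (1.0 - beta_t) * log_f0(z) + beta_t * log_f(z)

T = 10
z = np.linspace(-4.0, 4.0, 5)
for t in (0, 5, T):
    print(t, log_f_tilde(z, t, T))
```

At t = 0 the bridge coincides with the tractable initial density and at t = T with the true target, so each transition operator only needs to bridge a small change in the energy landscape.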
In practice, a naive implementation of this objective can be computationally expensive, so in the
amortized VI experiments we select the loss function stochastically (with probability a we maximize
AVO, otherwise the ELBO), which we found helps to make progress in improving the model p_θ(x, z).

5 Experiments

Our first experiment shows that having a unimodal approximate posterior is not universally benign.
The second and third experiments analyze the effect the optimization process has on the learned
(marginal) density of q_T, qualitatively and quantitatively. We do this in an unamortized setup, in an
attempt to suppress the confounding effect of suboptimal amortization. Finally, we experiment with
real data and demonstrate that HVI can be improved with our proposed method.

⁵ Note that in amortized VI f̃_t, r_t and q_t all depend on the input x, but we omit this in the notation for simplicity.

Figure 3: Learned marginals q_t(z_t) at different layers (T = 10). First row: true targets f_1, ..., f_10. Second row:
HVI-ELBO. Third row: HVI-AVO.

5.1 Biased noise model

To demonstrate the biasing effect of the approximate posterior in amortized VI, we use a toy
example with the following data distribution:

z ∼ N(0, 1),    ε_1, ε_2 ∼ N(0, σ²),
x_1 = sin(π tanh(ηz)) + ε_1,    (5)
x_2 = cos(π tanh(ηz)) + ε_2,    (6)

where η = 0.75. The result is the data distribution depicted in Figure 1a. The data distribution is
constructed so that x = [0, −1]^⊤ has a bimodal true posterior with a mode at each tail end of the
Gaussian. This is analogous to real data distributions in which two data points close in input
space can come from two well-separated modes in latent space (e.g. someone wearing glasses vs.
not wearing glasses).
We train three models which differ only in how their approximate posterior is parameterized:
(1) q(z|x) = p(z), trained using the IWAE loss with 500 samples; (2) a VAE with a
Gaussian approximate posterior q(z|x); and (3) a VAE trained with the AVO loss.
We use a decoder exactly as in Equation 6, fixed as the mean of the conditional Gaussian likelihood,
with the standard deviations of ε_1 and ε_2 parameterized as an MLP conditioned on z to be learned. The
densities learned by each model are depicted in Figure 1. We estimate the posterior of the learned
generative model in each case, and compare them in Figure 2. At the point x = [0, −1]^⊤, the true
posterior is bimodal, as discussed above. Shifting slightly to the left or right in x-space
towards the higher-density regions gives a sharper prediction in latent space at the respective
ends. In the unambiguous regions of the data, the posterior is unimodal.
We observe that the VAE encoder predicts the posterior to be centered around z = 0. From Figure 2
we can clearly see the zero-forcing effect mentioned before, where the encoder matches one of the modes
of the true posterior. As a consequence, the variance of the decoder's prediction at
that point is high, resulting in the plot seen in Figure 1c. The IWAE model, on the other hand, matches
the distribution better. With 500 samples per data point, we are effectively training the
generative model with a non-parametric posterior that can perfectly fit the true posterior. The price,
however, is that we require many samples to train the model.
For a simple task such as this one, this approach works; the example is meant to demonstrate the benefits of
having sufficient uncertainty in the proposal distribution.
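The toy data distribution of Equations 5 and 6 can be sampled as follows; the noise scale σ is garbled in the extracted text, so the value below is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
ETA = 0.75
SIGMA = 0.1  # noise std; the exact value is garbled in the extraction (assumption)

def sample_toy(n):
    # z ~ N(0, 1); x = (sin(pi*tanh(eta*z)), cos(pi*tanh(eta*z))) + noise (Eqs. 5-6).
    z = rng.standard_normal(n)
    angle = np.pi * np.tanh(ETA * z)
    x1 = np.sin(angle) + SIGMA * rng.standard_normal(n)
    x2 = np.cos(angle) + SIGMA * rng.standard_normal(n)
    return z, np.stack([x1, x2], axis=1)

# As z -> +inf or z -> -inf, tanh -> +/-1 and x -> (0, -1): two far-apart
# regions of z-space map to the same neighbourhood in x-space, so the true
# posterior p(z | x = [0, -1]) is bimodal.
z, x = sample_toy(10_000)
print(x.shape)
```

The samples lie near the unit circle (up to noise), tracing out the arc parameterized by π tanh(ηz).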
The proposed AVO method (T = 10) also
approximates the distribution well, but without incurring the same computational cost as IWAE.

5.2 Toy energy fitting

In this experiment, we take a four-mode mixture of Gaussians as the true target energy, and compare HVI
trained with the ELBO against HVI trained with AVO (with T = 10). In Figure 3 we see that HVI-ELBO
fits only two of the four modes, whereas HVI-AVO captures all four. Also, the first few layers of
HVI-ELBO contribute little to the resulting marginal q_T(z_T), whereas each layer of HVI-AVO
has to fit its assigned target.

Figure 4: Robustness of HVI to beta-annealing. "n" denotes normal ELBO and "a" denotes AVO. x-axis: the
percentage of total training time it takes for β to anneal linearly back to 1; y-axis: estimated E_q[−E(x) − log q(x)]
(negative KL) shifted by a log normalizing constant (higher is better).

Table 1: Amortized VI with VAE. L* is the ELBO, with tr, va and te representing the training, validation
and test sets respectively. log p(D*) is the estimated log-likelihood on the dataset. L indicates the number of
leap-frog steps used in Salimans et al.
(2015).

(a) MNIST

                T=0,a=0   T=5,a=0   T=5,a=0.2   T=5,a=0.5   L=4     L=8
−log p(Dva)     87.49     87.63     86.06       86.12       -       -
−log p(Dte)     87.50     87.62     86.06       86.12       86.40   85.51
−L tr           84.69     84.51     86.15       87.02       -       -
−L va           92.37     92.37     90.00       90.22       -       -
−L te           92.40     92.48     90.01       90.26       89.82   88.3
gen gap          7.71      7.97      3.86        3.24       -       -
var gap          4.90      4.86      3.95        4.14       3.42    2.79

(b) OMNIGLOT

                T=0,a=0   T=5,a=0   T=5,a=0.5
−log p(Dva)     108.41    106.32    105.18
−log p(Dte)     108.55    106.25    105.04
−L tr           106.67    111.32    108.94
−L va           114.68    110.78    109.47
−L te           115.25    113.19    111.94
gen gap           8.58      1.87      3.00
var gap           6.70      6.94      6.90

5.3 Quantitative analysis of robustness to beta-annealing

We repeat the experiments of Section 5.2 on six different energy functions E (summarized
in Appendix B) with different linear schedules of beta-annealing. We run 10 trials per energy
function, with 2000 stochastic updates, a batch size of 64 and a learning rate of 0.001. We
evaluate the resulting marginal q_T(z_T) by its negative KL divergence from the true energy
function. Since the entropy of q_T(z_T) is intractable, we estimate it using importance sampling; we
elaborate on this in Appendix C. We find that HVI-AVO consistently outperforms HVI-ELBO. While
the performance of HVI-ELBO mostly degrades as the beta-annealing schedule is prolonged,
HVI-AVO's performance is relatively invariant. In Appendix D, we visualize the learned marginals
q_t(z_t) under beta-annealing using both stochastic and deterministic transitions.

5.4 Amortized inference

We train VAEs using a standard VAE (with a Gaussian approximate posterior), HVI, and HVI with
AVO on the binarized MNIST dataset from Larochelle and Murray (2011), and the Omniglot dataset
as used in Burda et al. (2016). In these experiments, we used an MLP for both the encoder and
decoder, with gating activation as in Tomczak and Welling (2016).
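The evaluation metric of Section 5.3, the negative KL from the true energy up to a log-normalizing constant, can be estimated by Monte Carlo. The sketch below uses a tractable Gaussian q and an assumed four-mode mixture target (the paper's exact energies are in Appendix B and not reproduced here; for HVI marginals, log q_T itself must be estimated by importance sampling as in Appendix C):

```python
import numpy as np

rng = np.random.default_rng(0)
MEANS = np.array([[2., 2.], [2., -2.], [-2., 2.], [-2., -2.]])  # assumed mode layout

def log_f(z):
    # Unnormalized log-density of an equally weighted four-mode Gaussian mixture.
    d2 = ((z[:, None, :] - MEANS[None, :, :]) ** 2).sum(axis=2)
    return np.logaddexp.reduce(-0.5 * d2, axis=1)

def neg_kl(mu, sigma, n=20_000):
    # Monte Carlo estimate of E_q[log f(z) - log q(z)] for an isotropic Gaussian
    # q = N(mu, sigma^2 I); up to the constant log-normalizer of f this is the
    # negative KL divergence used as the evaluation metric.
    z = mu + sigma * rng.standard_normal((n, 2))
    log_q = (-0.5 * ((z - mu) / sigma) ** 2
             - np.log(sigma * np.sqrt(2 * np.pi))).sum(axis=1)
    return np.mean(log_f(z) - log_q)

# Compare a q sitting on one mode against a mismatched q far from every mode:
print(neg_kl(np.array([2., 2.]), 0.7))
print(neg_kl(np.zeros(2), 0.1))
```

A q that places its mass near the target's modes scores a higher (less negative) value, which is why the metric rewards marginals that cover the energy landscape well.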
Both the encoder and decoder
have 2 hidden layers, each with 300 hidden units. We used a latent dimension of 40 for
MNIST and 200 for Omniglot. We used hyperparameter search to determine the batch
size, learning rate, and beta-annealing schedule.
Table 1a lists the results of our experiments on MNIST, alongside results from Salimans et al. (2015)
combining MCMC with HVI for comparison (see Appendix E). Table 1b lists the results for Omniglot.
We find that HVI's approximate posterior trained with AVO tends to have a smaller variational gap
(estimated as the difference between test likelihood and ELBO), and also better test likelihood. It is also
worth noting that in the MNIST case, a smaller variational gap translates into a smaller generalization
gap and better test likelihood, while the same is not true in the OMNIGLOT example; this
corroborates our finding in Section 5.1 that the true posterior can be biased towards the approximate posterior,
resulting in a smaller variational gap (8.58) but a worse density model (6.70).

6 Conclusion

We find that despite the representational capacity of the chosen family of approximate distributions in
VI, the density that can be represented is still limited by the optimization process. We resolve this by
incorporating annealed objectives into the training of hierarchical variational methods. Experimentally,
we demonstrate (1) our method's robustness to deterministic warm-up, (2) the benefits of encouraging
exploration, and (3) the downside of biasing the true posterior to be unimodal.
Our method is
orthogonal to finding a richer family of variational distributions, and sheds light on an important
optimization issue that has thus far been neglected in the amortized VI literature.

7 Acknowledgements

We would like to thank Massimo Cassia and Aristide Baratin for their useful feedback and discussion.

References

Abrol, F., Mandt, S., Ranganath, R., and Blei, D. (2014). Deterministic annealing for stochastic
variational inference. stat, 1050:7.

Agakov, F. V. and Barber, D. (2004). An auxiliary variational method. In Neural Information
Processing.

Berg, R. v. d., Hasenclever, L., Tomczak, J. M., and Welling, M. (2018). Sylvester normalizing flows
for variational inference. arXiv preprint arXiv:1803.05649.

Bowman, S. R., Vilnis, L., Vinyals, O., Dai, A. M., Jozefowicz, R., and Bengio, S. (2016). Generating
sentences from a continuous space. In SIGNLL Conference on Computational Natural Language
Learning (CoNLL).

Burda, Y., Grosse, R., and Salakhutdinov, R. (2016). Importance weighted autoencoders. In
International Conference on Learning Representations.

Chung, J., Kastner, K., Dinh, L., Goel, K., Courville, A. C., and Bengio, Y. (2015). A recurrent latent
variable model for sequential data. In Advances in Neural Information Processing Systems.

Clevert, D.-A., Unterthiner, T., and Hochreiter, S. (2016). Fast and accurate deep network learning by
exponential linear units (ELUs). In International Conference on Learning Representations.

Cremer, C., Li, X., and Duvenaud, D. (2018). Inference suboptimality in variational autoencoders. In
International Conference on Machine Learning.

Dieng, A. B., Tran, D., Ranganath, R., Paisley, J., and Blei, D. (2017). Variational inference via χ
upper bound minimization. In Advances in Neural Information Processing Systems.

Edwards, H. and Storkey, A. (2017). Towards a neural statistician.
In International Conference on
Learning Representations.

Fraccaro, M., Sønderby, S. K., Paquet, U., and Winther, O. (2016). Sequential neural models with
stochastic layers. In Advances in Neural Information Processing Systems, pages 2199-2207.

Gulrajani, I., Kumar, K., Ahmed, F., Taiga, A. A., Visin, F., Vazquez, D., and Courville, A. (2017).
PixelVAE: A latent variable model for natural images. In International Conference on Learning
Representations.

Huang, C.-W., Krueger, D., Lacoste, A., and Courville, A. (2018). Neural autoregressive flows. In
International Conference on Machine Learning.

Huszár, F. (2017). Variational inference using implicit distributions. arXiv preprint arXiv:1702.08235.

Karl, M., Soelch, M., Bayer, J., and van der Smagt, P. (2016). Deep variational Bayes filters:
Unsupervised learning of state space models from raw data. In International Conference on
Learning Representations.

Katahira, K., Watanabe, K., and Okada, M. (2008). Deterministic annealing variant of variational
Bayes method. In Journal of Physics: Conference Series, volume 95, page 012015. IOP Publishing.

Kim, Y., Wiseman, S., Miller, A. C., Sontag, D., and Rush, A. M. (2018). Semi-amortized variational
autoencoders. In International Conference on Machine Learning.

Kingma, D. P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., and Welling, M. (2016).
Improved variational inference with inverse autoregressive flow. In Advances in Neural Information
Processing Systems.

Kingma, D. P. and Welling, M. (2014). Auto-encoding variational Bayes. In International Conference
on Learning Representations.

Krishnan, R. G., Liang, D., and Hoffman, M. (2017). On the challenges of learning with inference
networks on sparse, high-dimensional data. arXiv preprint arXiv:1710.06085.

Krueger, D., Huang, C.-W., Islam, R., Turner, R., Lacoste, A., and Courville, A. (2017). Bayesian
hypernetworks.
arXiv preprint arXiv:1710.04759.\n\nLarochelle, H. and Murray, I. (2011). The neural autoregressive distribution estimator. In International\n\nConference on Arti\ufb01cial Intelligence and Statistics.\n\nLi, Y. and Turner, R. E. (2016). R\u00e9nyi divergence variational inference. In Advances in Neural\n\nInformation Processing Systems.\n\nMaal\u00f8e, L., S\u00f8nderby, C. K., S\u00f8nderby, S. K., and Winther, O. (2016). Auxiliary deep generative\n\nmodels. In International Conference on Machine Learning.\n\nMandt, S., McInerney, J., Abrol, F., Ranganath, R., and Blei, D. (2016). Variational tempering. In\n\nInternational Conference on Arti\ufb01cial Intelligence and Statistics.\n\nMarino, J., Yue, Y., and Mandt, S. (2018). Iterative amortized inference. In International Conference\n\non Machine Learning.\n\nMescheder, L., Nowozin, S., and Geiger, A. (2017). Adversarial variational bayes: Unifying\nvariational autoencoders and generative adversarial networks. In International Conference on\nMachine Learning.\n\nNair, V. and Hinton, G. E. (2010). Recti\ufb01ed linear units improve restricted boltzmann machines. In\n\nInternational Conference on Machine Learning.\n\nNeal, R. M. (2001). Annealed importance sampling. Statistics and computing, 11(2).\n\nNowozin, S. (2018). Debiasing evidence approximations: On importance-weighted autoencoders and\n\njackknife variational inference. In International Conference on Learning Representations.\n\nRaiko, T., Valpola, H., Harva, M., and Karhunen, J. (2007). Building blocks for variational bayesian\n\nlearning of latent variable models. Journal of Machine Learning Research.\n\nRainforth, T., Kosiorek, A. R., Le, T. A., Maddison, C. J., Igl, M., Wood, F., and Teh, Y. W. (2018).\nTighter variational bounds are not necessarily better. In International Conference on Machine\nLearning.\n\nRanganath, R., Tran, D., and Blei, D. (2016). Hierarchical variational models. 
In International
Conference on Machine Learning.

Rezende, D. J. and Mohamed, S. (2015). Variational inference with normalizing flows. In International
Conference on Machine Learning.

Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate
inference in deep generative models. In International Conference on Machine Learning.

Salimans, T., Kingma, D., and Welling, M. (2015). Markov chain Monte Carlo and variational
inference: Bridging the gap. In International Conference on Machine Learning.

Salimans, T. and Kingma, D. P. (2016). Weight normalization: A simple reparameterization to
accelerate training of deep neural networks. In Advances in Neural Information Processing
Systems.

Shabanian, S., Arpit, D., Trischler, A., and Bengio, Y. (2017). Variational bi-LSTMs. arXiv preprint
arXiv:1711.05717.

Shi, J., Sun, S., and Zhu, J. (2018). Kernel implicit variational inference. In International Conference
on Learning Representations.

Sønderby, C. K., Raiko, T., Maaløe, L., Sønderby, S. K., and Winther, O. (2016). Ladder variational
autoencoders. In Advances in Neural Information Processing Systems, pages 3738-3746.

Tomczak, J. M. and Welling, M. (2016). Improving variational auto-encoders using Householder flow.
arXiv preprint arXiv:1611.09630.

Tomczak, J. M. and Welling, M. (2017). Improving variational auto-encoders using convex combination
linear inverse autoregressive flow. In Benelearn.

Turner, R. E. and Sahani, M. (2011). Two problems with variational expectation maximisation for
time-series models.
Bayesian Time Series Models, 1(3.1):3-1.", "award": [], "sourceid": 6174, "authors": [{"given_name": "Chin-Wei", "family_name": "Huang", "institution": "MILA"}, {"given_name": "Shawn", "family_name": "Tan", "institution": "Mila"}, {"given_name": "Alexandre", "family_name": "Lacoste", "institution": "Element AI"}, {"given_name": "Aaron", "family_name": "Courville", "institution": "U. Montreal"}]}