{"title": "REBAR: Low-variance, unbiased gradient estimates for discrete latent variable models", "book": "Advances in Neural Information Processing Systems", "page_first": 2627, "page_last": 2636, "abstract": "Learning in models with discrete latent variables is challenging due to high variance gradient estimators. Generally, approaches have relied on control variates to reduce the variance of the REINFORCE estimator. Recent work \\citep{jang2016categorical, maddison2016concrete} has taken a different approach, introducing a continuous relaxation of discrete variables to produce low-variance, but biased, gradient estimates. In this work, we combine the two approaches through a novel control variate that produces low-variance, \\emph{unbiased} gradient estimates. Then, we introduce a modification to the continuous relaxation and show that the tightness of the relaxation can be adapted online, removing it as a hyperparameter. We show state-of-the-art variance reduction on several benchmark generative modeling tasks, generally leading to faster convergence to a better final log-likelihood.", "full_text": "REBAR: Low-variance, unbiased gradient estimates\n\nfor discrete latent variable models\n\nGeorge Tucker1,\u21e4, Andriy Mnih2, Chris J. Maddison2,3,\n\nDieterich Lawson1,*, Jascha Sohl-Dickstein1\n\n1Google Brain, 2DeepMind, 3University of Oxford\n\n{gjt, amnih, dieterichl, jaschasd}@google.com\n\ncmaddis@stats.ox.ac.uk\n\nAbstract\n\nLearning in models with discrete latent variables is challenging due to high variance\ngradient estimators. Generally, approaches have relied on control variates to reduce\nthe variance of the REINFORCE estimator. Recent work (Jang et al., 2016; Maddi-\nson et al., 2016) has taken a different approach, introducing a continuous relaxation\nof discrete variables to produce low-variance, but biased, gradient estimates. 
In this\nwork, we combine the two approaches through a novel control variate that produces\nlow-variance, unbiased gradient estimates. Then, we introduce a modi\ufb01cation\nto the continuous relaxation and show that the tightness of the relaxation can be\nadapted online, removing it as a hyperparameter. We show state-of-the-art variance\nreduction on several benchmark generative modeling tasks, generally leading to\nfaster convergence to a better \ufb01nal log-likelihood.\n\n1\n\nIntroduction\n\nModels with discrete latent variables are ubiquitous in machine learning: mixture models, Markov\nDecision Processes in reinforcement learning (RL), generative models for structured prediction,\nand, recently, models with hard attention (Mnih et al., 2014) and memory networks (Zaremba &\nSutskever, 2015). However, when the discrete latent variables cannot be marginalized out analytically,\nmaximizing objectives over these models using REINFORCE-like methods (Williams, 1992) is\nchallenging due to high-variance gradient estimates obtained from sampling. Most approaches to\nreducing this variance have focused on developing clever control variates (Mnih & Gregor, 2014;\nTitsias & L\u00e1zaro-Gredilla, 2015; Gu et al., 2015; Mnih & Rezende, 2016). Recently, Jang et al. (2016)\nand Maddison et al. (2016) independently introduced a novel distribution, the Gumbel-Softmax or\nConcrete distribution, that continuously relaxes discrete random variables. Replacing every discrete\nrandom variable in a model with a Concrete random variable results in a continuous model where the\nreparameterization trick is applicable (Kingma & Welling, 2013; Rezende et al., 2014). The gradients\nare biased with respect to the discrete model, but can be used effectively to optimize large models.\nThe tightness of the relaxation is controlled by a temperature hyperparameter. 
In the low temperature\nlimit, the gradient estimates become unbiased, but the variance of the gradient estimator diverges, so\nthe temperature must be tuned to balance bias and variance.\nWe sought an estimator that is low-variance, unbiased, and does not require tuning additional\nhyperparameters. To construct such an estimator, we introduce a simple control variate based on the\ndifference between the REINFORCE and the reparameterization trick gradient estimators for the\nrelaxed model. This reduces variance, but does not outperform state-of-the-art methods on its own.\nOur key contribution is to show that it is possible to conditionally marginalize the control variate\n\n\u21e4Work done as part of the Google Brain Residency Program.\nSource code for experiments: github.com/tensorflow/models/tree/master/research/rebar\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fto signi\ufb01cantly improve its effectiveness. We call this the REBAR gradient estimator, because it\ncombines REINFORCE gradients with gradients of the Concrete relaxation. Next, we show that\na modi\ufb01cation to the Concrete relaxation connects REBAR to MuProp in the high temperature\nlimit. Finally, because REBAR is unbiased for all temperatures, we show that the temperature\ncan be optimized online to reduce variance further and relieve the burden of setting an additional\nhyperparameter.\nIn our experiments, we illustrate the potential problems inherent with biased gradient estimators on\na toy problem. Then, we use REBAR to train generative sigmoid belief networks (SBNs) on the\nMNIST and Omniglot datasets and to train conditional generative models on MNIST. Across tasks,\nwe show that REBAR has state-of-the-art variance reduction which translates to faster convergence\nand better \ufb01nal log-likelihoods. 
Although we focus on binary variables for simplicity, this work is equally applicable to categorical variables (Appendix C).\n\n2 Background\n\nFor clarity, we first consider a simplified scenario. Let b \\sim \\mathrm{Bernoulli}(\\theta) be a vector of independent binary random variables parameterized by \\theta. We wish to maximize\n\nE_{p(b)}[f(b, \\theta)],\n\nwhere f(b, \\theta) is differentiable with respect to b and \\theta, and we suppress the dependence of p(b) on \\theta to reduce notational clutter. This covers a wide range of discrete latent variable problems; for example, in variational inference f(b, \\theta) would be the stochastic variational lower bound.\nTypically, this problem has been approached by gradient ascent, which requires efficiently estimating\n\n\\frac{d}{d\\theta} E_{p(b)}[f(b, \\theta)] = E_{p(b)}\\left[ \\frac{\\partial f(b, \\theta)}{\\partial \\theta} + f(b, \\theta) \\frac{\\partial}{\\partial \\theta} \\log p(b) \\right]. (1)\n\nIn practice, the first term can be estimated effectively with a single Monte Carlo sample; however, a naive single-sample estimator of the second term has high variance. Because the dependence of f(b, \\theta) on \\theta is straightforward to account for, to simplify exposition we assume that f(b, \\theta) = f(b) does not depend on \\theta and concentrate on the second term.\n\n2.1 Variance reduction through control variates\n\nPaisley et al. (2012); Ranganath et al. (2014); Mnih & Gregor (2014); Gu et al. (2015) show that carefully designed control variates can reduce the variance of the second term significantly. Control variates seek to reduce the variance of such estimators using closed-form expectations for closely related terms. 
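To make the variance problem concrete, here is a minimal sketch (our own toy setup, not from the paper) of the single-sample score-function estimator for a single Bernoulli parameter; the values of \\theta and the target t are arbitrary choices:

```python
import numpy as np

# Toy setup (our choice): f(b) = (b - t)^2, b ~ Bernoulli(theta).
rng = np.random.default_rng(0)
theta, t = 0.3, 0.45
f = lambda b: (b - t) ** 2

# Exact gradient of E[f(b)] = theta*f(1) + (1-theta)*f(0) w.r.t. theta.
exact = f(1.0) - f(0.0)                        # = 0.1

# Single-sample REINFORCE estimates: f(b) * d/dtheta log p(b).
b = (rng.random(200_000) < theta).astype(float)
score = (b - theta) / (theta * (1 - theta))    # d/dtheta log Bernoulli(b; theta)
est = f(b) * score

# Unbiased, but each individual sample is noisy.
print(exact, est.mean(), est.std())
```

Averaging many such samples recovers the exact gradient, but the per-sample noise is what the control variates below aim to remove.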
We can subtract any c (random or constant) as long as we can correct for the bias (see Appendix A and Paisley et al. (2012) for a review of control variates in this context):\n\n\\frac{\\partial}{\\partial\\theta} E_{p(b,c)}[f(b)] = \\frac{\\partial}{\\partial\\theta}\\left( E_{p(b,c)}[f(b) - c] + E_{p(b,c)}[c] \\right) = E_{p(b,c)}\\left[ (f(b) - c) \\frac{\\partial}{\\partial\\theta} \\log p(b) \\right] + \\frac{\\partial}{\\partial\\theta} E_{p(b,c)}[c].\n\nFor example, NVIL (Mnih & Gregor, 2014) learns a c that does not depend2 on b and MuProp (Gu et al., 2015) uses a linear Taylor expansion of f around E_{p(b|\\theta)}[b]. Unfortunately, even with a control variate, the term can still have high variance.\n\n2.2 Continuous relaxations for discrete variables\n\nAlternatively, following Maddison et al. (2016), we can parameterize b as b = H(z), where H is the element-wise hard threshold function3 and z is a vector of independent Logistic random variables defined by\n\nz := g(u, \\theta) := \\log\\frac{\\theta}{1-\\theta} + \\log\\frac{u}{1-u},\n\nwhere u \\sim \\mathrm{Uniform}(0, 1).\n\n2In this case, c depends on the implicit observation in variational inference.\n3H(z) = 1 if z \\geq 0 and H(z) = 0 if z < 0.\n\nNotably, z is differentiably reparameterizable (Kingma & Welling, 2013; Rezende et al., 2014), but the discontinuous hard threshold function prevents us from using the reparameterization trick directly. Replacing all occurrences of the hard threshold function with a continuous relaxation H(z) \\approx \\sigma_\\lambda(z) := \\sigma(z/\\lambda) = (1 + \\exp(-z/\\lambda))^{-1}, however, results in a reparameterizable computational graph. Thus, we can compute low-variance gradient estimates for the relaxed model that approximate the gradient for the discrete model. In summary,\n\n\\frac{\\partial}{\\partial\\theta} E_{p(b)}[f(b)] = \\frac{\\partial}{\\partial\\theta} E_{p(z)}[f(H(z))] \\approx \\frac{\\partial}{\\partial\\theta} E_{p(z)}[f(\\sigma_\\lambda(z))] = E_{p(u)}\\left[ \\frac{\\partial}{\\partial\\theta} f(\\sigma_\\lambda(g(u, \\theta))) \\right],\n\nwhere \\lambda > 0 can be thought of as a temperature that controls the tightness of the relaxation (at low temperatures, the relaxation is nearly tight). 
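The relaxed, pathwise estimator can be sketched as follows for a scalar Bernoulli parameter (a hedged sketch with a hand-derived chain rule rather than autodiff; the toy objective f(s) = (s - t)^2 and the constants \\theta, t, \\lambda are our own choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
theta, t, lam = 0.3, 0.45, 0.5
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

# z = g(u, theta) is differentiably reparameterizable.
u = np.clip(rng.random(100_000), 1e-9, 1 - 1e-9)
z = np.log(theta / (1 - theta)) + np.log(u / (1 - u))
s = sigmoid(z / lam)                     # relaxed "b" = sigma_lambda(z)

# Pathwise gradient of E[(s - t)^2] w.r.t. theta by the chain rule:
# f'(s) * s(1-s)/lam * dz/dtheta, with dz/dtheta = 1/(theta*(1-theta)).
grad = 2 * (s - t) * s * (1 - s) / lam / (theta * (1 - theta))

# Per the paper, this kind of estimator has low variance at moderate
# temperatures, but it is biased for the discrete objective.
print(grad.mean(), grad.std())
```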
This generally results in a low-variance, but biased, Monte Carlo estimator for the discrete model. As \\lambda \\to 0, the approximation becomes exact, but the variance of the Monte Carlo estimator diverges. Thus, in practice, \\lambda must be tuned to balance bias and variance. See Appendix C and Jang et al. (2016); Maddison et al. (2016) for the generalization to the categorical case.\n\n3 REBAR\n\nWe seek a low-variance, unbiased gradient estimator. Inspired by the Concrete relaxation, our strategy will be to construct a control variate (see Appendix A for a review of control variates in this context) based on the difference between the REINFORCE gradient estimator for the relaxed model and the gradient estimator from the reparameterization trick. First, note that closely following Eq. 1,\n\nE_{p(b)}\\left[ f(b) \\frac{\\partial}{\\partial\\theta} \\log p(b) \\right] = \\frac{\\partial}{\\partial\\theta} E_{p(b)}[f(b)] = \\frac{\\partial}{\\partial\\theta} E_{p(z)}[f(H(z))] = E_{p(z)}\\left[ f(H(z)) \\frac{\\partial}{\\partial\\theta} \\log p(z) \\right]. (2)\n\nThe similar form of the REINFORCE gradient estimator for the relaxed model\n\n\\frac{\\partial}{\\partial\\theta} E_{p(z)}[f(\\sigma_\\lambda(z))] = E_{p(z)}\\left[ f(\\sigma_\\lambda(z)) \\frac{\\partial}{\\partial\\theta} \\log p(z) \\right] (3)\n\nsuggests it will be strongly correlated and thus be an effective control variate. Unfortunately, the Monte Carlo gradient estimator derived from the left hand side of Eq. 2 has much lower variance than the Monte Carlo gradient estimator derived from the right hand side. This is because the left hand side can be seen as analytically performing a conditional marginalization over z given b, which is noisily approximated by Monte Carlo samples on the right hand side (see Appendix B for details). Our key insight is that an analogous conditional marginalization can be performed for the control variate (Eq. 
3),\n\nE_{p(z)}\\left[ f(\\sigma_\\lambda(z)) \\frac{\\partial}{\\partial\\theta} \\log p(z) \\right] = E_{p(b)}\\left[ \\frac{\\partial}{\\partial\\theta} E_{p(z|b)}[f(\\sigma_\\lambda(z))] \\right] + E_{p(b)}\\left[ E_{p(z|b)}[f(\\sigma_\\lambda(z))] \\frac{\\partial}{\\partial\\theta} \\log p(b) \\right],\n\nwhere the first term on the right-hand side can be efficiently estimated with the reparameterization trick (see Appendix C for the details)\n\nE_{p(b)}\\left[ \\frac{\\partial}{\\partial\\theta} E_{p(z|b)}[f(\\sigma_\\lambda(z))] \\right] = E_{p(b)}\\left[ E_{p(v)}\\left[ \\frac{\\partial}{\\partial\\theta} f(\\sigma_\\lambda(\\tilde{z})) \\right] \\right],\n\nwhere v \\sim \\mathrm{Uniform}(0, 1) and \\tilde{z} \\equiv \\tilde{g}(v, b, \\theta) is the differentiable reparameterization for z|b (Appendix C). Therefore,\n\nE_{p(z)}\\left[ f(\\sigma_\\lambda(z)) \\frac{\\partial}{\\partial\\theta} \\log p(z) \\right] = E_{p(b)}\\left[ E_{p(v)}\\left[ \\frac{\\partial}{\\partial\\theta} f(\\sigma_\\lambda(\\tilde{z})) \\right] + E_{p(z|b)}[f(\\sigma_\\lambda(z))] \\frac{\\partial}{\\partial\\theta} \\log p(b) \\right].\n\nUsing this to form the control variate and correcting with the reparameterization trick gradient, we arrive at\n\n\\frac{\\partial}{\\partial\\theta} E_{p(b)}[f(b)] = E_{p(u,v)}\\left[ [f(H(z)) - \\eta f(\\sigma_\\lambda(\\tilde{z}))] \\frac{\\partial}{\\partial\\theta} \\log p(b) \\Big|_{b=H(z)} + \\eta \\frac{\\partial}{\\partial\\theta} f(\\sigma_\\lambda(z)) - \\eta \\frac{\\partial}{\\partial\\theta} f(\\sigma_\\lambda(\\tilde{z})) \\right], (4)\n\nwhere u, v \\sim \\mathrm{Uniform}(0, 1), z \\equiv g(u, \\theta), \\tilde{z} \\equiv \\tilde{g}(v, H(z), \\theta), and \\eta is a scaling on the control variate. The REBAR estimator is the single-sample Monte Carlo estimator of this expectation. To reduce computation and variance, we couple u and v using common random numbers (Appendix G, (Owen, 2013)). We estimate \\eta by minimizing the variance of the Monte Carlo estimator with SGD. In Appendix D, we present an alternative derivation of REBAR that is shorter, but less intuitive.\n\n3.1 Rethinking the relaxation and a connection to MuProp\n\nBecause \\sigma_\\lambda(z) \\to \\frac{1}{2} as \\lambda \\to \\infty, we consider an alternative relaxation\n\nH(z) \\approx \\sigma\\left( \\frac{\\lambda^2 + \\lambda + 1}{\\lambda(\\lambda + 1)} \\log\\frac{\\theta}{1-\\theta} + \\frac{1}{\\lambda} \\log\\frac{u}{1-u} \\right) = \\sigma_\\lambda(z_\\lambda), (5)\n\nwhere z_\\lambda = \\frac{\\lambda^2 + \\lambda + 1}{\\lambda + 1} \\log\\frac{\\theta}{1-\\theta} + \\log\\frac{u}{1-u}. As \\lambda \\to \\infty, the relaxation converges to the mean, \\theta, and still, as \\lambda \\to 0, the relaxation becomes exact. 
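Putting Eq. 4 together for a single Bernoulli parameter, a single-sample REBAR estimator can be sketched as below. This is a hedged toy sketch, not the paper's implementation: f is a toy objective of our own choosing, we set \\eta = 1, use independent u and v instead of the paper's common-random-number coupling (still unbiased), and hand-derive the derivatives rather than using autodiff.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, t, lam, eta, N = 0.3, 0.45, 0.5, 1.0, 500_000
sig = lambda x: 1.0 / (1.0 + np.exp(-x))
logit = lambda x: np.log(x) - np.log1p(-x)
f = lambda x: (x - t) ** 2      # toy objective (our choice)

u = np.clip(rng.random(N), 1e-9, 1 - 1e-9)
v = np.clip(rng.random(N), 1e-9, 1 - 1e-9)

z = logit(theta) + logit(u)     # z = g(u, theta)
b = (z >= 0).astype(float)      # b = H(z); H(z) = 1 iff u >= 1 - theta

# z~ = g~(v, b, theta): z | b is g(u', theta) with u' uniform on the
# interval of u consistent with b, so map v into that interval.
up = np.where(b == 1.0, (1 - theta) + v * theta, v * (1 - theta))
zt = logit(theta) + logit(up)

score = (b - theta) / (theta * (1 - theta))   # d/dtheta log p(b)
dz = 1.0 / (theta * (1 - theta))              # dz/dtheta
# dz~/dtheta with b held fixed (differentiating through u'(theta) too):
dzt = np.where(b == 1.0,
               dz - 1.0 / (theta * (1 - theta * (1 - v))),
               dz - 1.0 / ((1 - theta) * (1 - v * (1 - theta))))

def dpath(zz, dzd):
    # pathwise d/dtheta of f(sigma_lambda(zz)) by the chain rule
    s = sig(zz / lam)
    return 2 * (s - t) * s * (1 - s) / lam * dzd

rebar = ((f(b) - eta * f(sig(zt / lam))) * score
         + eta * dpath(z, dz) - eta * dpath(zt, dzt))
print(rebar.mean())   # unbiased: approaches the exact gradient f(1) - f(0) = 0.1
```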
Furthermore, as \\lambda \\to \\infty, the REBAR estimator converges to MuProp without the linear term (see Appendix E). We refer to this estimator as SimpleMuProp in the results.\n\n3.2 Optimizing temperature (\\lambda)\n\nThe REBAR gradient estimator is unbiased for any choice of \\lambda > 0, so we can optimize \\lambda to minimize the variance of the estimator without affecting its unbiasedness (similar to optimizing the dispersion coefficients in Ruiz et al. (2016)). In particular, denoting the REBAR gradient estimator by r(\\lambda), then\n\n\\frac{\\partial}{\\partial\\lambda} \\mathrm{Var}(r(\\lambda)) = \\frac{\\partial}{\\partial\\lambda}\\left( E[r(\\lambda)^2] - E[r(\\lambda)]^2 \\right) = E\\left[ 2 r(\\lambda) \\frac{\\partial r(\\lambda)}{\\partial\\lambda} \\right],\n\nbecause E[r(\\lambda)] does not depend on \\lambda. The resulting expectation can be estimated with a single-sample Monte Carlo estimator. This allows the tightness of the relaxation to be adapted online jointly with the optimization of the parameters and relieves the burden of choosing \\lambda ahead of time.\n\n3.3 Multilayer stochastic networks\n\nSuppose we have multiple layers of stochastic units (i.e., b = \\{b_1, b_2, \\ldots, b_n\\}) where p(b) factorizes as\n\np(b_{1:n}) = p(b_1) p(b_2|b_1) \\cdots p(b_n|b_{n-1}),\n\nand similarly for the underlying Logistic random variables p(z_{1:n}), recalling that b_i = H(z_i). We can define a relaxed distribution over z_{1:n} where we replace the hard threshold function H(z) with a continuous relaxation \\sigma_\\lambda(z). 
We refer to the relaxed distribution as q(z_{1:n}).\nWe can take advantage of the structure of p by using the fact that the high-variance REINFORCE term of the gradient also decomposes,\n\nE_{p(b)}\\left[ f(b) \\frac{\\partial}{\\partial\\theta} \\log p(b) \\right] = \\sum_i E_{p(b)}\\left[ f(b) \\frac{\\partial}{\\partial\\theta} \\log p(b_i|b_{i-1}) \\right].\n\nFocusing on the ith term, we have\n\nE_{p(b)}\\left[ f(b) \\frac{\\partial}{\\partial\\theta} \\log p(b_i|b_{i-1}) \\right] = E_{p(b_{1:i-1})}\\left[ E_{p(b_i|b_{i-1})}\\left[ E_{p(b_{i+1:n}|b_i)}[f(b)] \\frac{\\partial}{\\partial\\theta} \\log p(b_i|b_{i-1}) \\right] \\right],\n\nwhich suggests the following control variate\n\nE_{p(z_i|b_i,b_{i-1})}\\left[ E_{q(z_{i+1:n}|z_i)}[f(b_{1:i-1}, \\sigma_\\lambda(z_{i:n}))] \\right] \\frac{\\partial}{\\partial\\theta} \\log p(b_i|b_{i-1})\n\nfor the middle expectation. Similarly to the single-layer case, we can debias the control variate with terms that are reparameterizable. Note that due to the switch between sampling from p and sampling from q, this approach requires n passes through the network (one pass per layer). We discuss alternatives that do not require multiple passes through the network in Appendix F.\n\n3.4 Q-functions\n\nFinally, we note that since the derivation of this control variate is independent of f, the REBAR control variate can be generalized by replacing f with a learned, differentiable Q-function. This suggests that the REBAR control variate is applicable to RL, where it would allow a \u201cpseudo-action\u201d-dependent baseline. In this case, the pseudo-action would be the relaxation of the discrete output from a policy network.\n\n4 Related work\n\nMost approaches to optimizing an expectation of a function w.r.t. a discrete distribution based on samples from the distribution can be seen as applications of the REINFORCE (Williams, 1992) gradient estimator, also known as the likelihood ratio (Glynn, 1990) or score-function estimator (Fu, 2006). 
Following the notation from Section 2, the basic form of an estimator of this type is (f(b) - c) \\frac{\\partial}{\\partial\\theta} \\log p(b), where b is a sample from the discrete distribution and c is some quantity independent of b, known as a baseline. Such estimators are unbiased, but without a carefully chosen baseline their variance tends to be too high for the estimator to be useful, and much work has gone into finding effective baselines.\nIn the context of training latent variable models, REINFORCE-like methods have been used to implement sampling-based variational inference with either fully factorized (Wingate & Weber, 2013; Ranganath et al., 2014) or structured (Mnih & Gregor, 2014; Gu et al., 2015) variational distributions. All of these involve learned baselines: from simple scalar baselines (Wingate & Weber, 2013; Ranganath et al., 2014) to nonlinear input-dependent baselines (Mnih & Gregor, 2014). MuProp (Gu et al., 2015) combines an input-dependent baseline with a first-order Taylor approximation to the function based on the corresponding mean-field network to achieve further variance reduction. REBAR is similar to MuProp in that it also uses gradient information from a proxy model to reduce the variance of a REINFORCE-like estimator. The main difference is that in our approach the proxy model is essentially the relaxed (but still stochastic) version of the model we are interested in, whereas MuProp uses the mean-field version of the model as a proxy, which can behave very differently from the original model due to being completely deterministic. The relaxation we use was proposed by Maddison et al. (2016) and Jang et al. (2016) as a way of making discrete latent variable models reparameterizable, resulting in a low-variance but biased gradient estimator for the original model. REBAR, on the other hand, uses the relaxation in a control variate, which results in an unbiased, low-variance estimator. 
Alternatively, Titsias & L\u00e1zaro-Gredilla (2015) introduced local expectation gradients, a general-purpose unbiased gradient estimator for models with continuous and discrete latent variables. However, it typically requires substantially more computation than other methods. Recently, a specialized REINFORCE-like method was proposed for the tighter multi-sample version of the variational bound (Burda et al., 2015), which uses a leave-one-out technique to construct per-sample baselines (Mnih & Rezende, 2016). This approach is orthogonal to ours, and we expect it to benefit from incorporating the REBAR control variate.\n\n5 Experiments\n\nAs our goal was variance reduction to improve optimization, we compared our method to the state-of-the-art unbiased single-sample gradient estimators, NVIL (Mnih & Gregor, 2014) and MuProp (Gu et al., 2015), and the state-of-the-art biased single-sample gradient estimator Gumbel-Softmax/Concrete (Jang et al., 2016; Maddison et al., 2016), by measuring their progress on the training objective and the variance of the unbiased gradient estimators4. We start with an illustrative problem and then follow the experimental setup established in Maddison et al. (2016) to evaluate the methods on generative modeling and structured prediction tasks.\n\n4Both MuProp and REBAR require twice as much computation per step as NVIL and Concrete. To present comparable results with previous work, we plot our results in steps. However, to offer a fair comparison, NVIL should use two samples and thus reduce its variance by half (or \\log(2) \\approx 0.69 in our plots).\n\nFigure 1: Log variance of the gradient estimator (left) and loss (right) for the toy problem with t = 0.45. Only the unbiased estimators converge to the correct answer. 
We indicate the temperature in parentheses where relevant.\n\n5.1 Toy problem\n\nTo illustrate the potential ill-effects of biased gradient estimators, we evaluated the methods on a simple toy problem. We wish to minimize E_{p(b)}[(b - t)^2], where t \\in (0, 1) is a continuous target value, and we have a single parameter controlling the Bernoulli distribution. Figure 1 shows the perils of biased gradient estimators. The optimal solution is deterministic (i.e., p(b = 1) \\in \\{0, 1\\}), whereas the Concrete estimator converges to a stochastic one. All of the unbiased estimators correctly converge to the optimal loss, whereas the biased estimator fails to. For this simple problem, it is sufficient to reduce the temperature of the relaxation to achieve an acceptable solution.\n\n5.2 Learning sigmoid belief networks (SBNs)\n\nNext, we trained SBNs on several standard benchmark tasks. We follow the setup established in Maddison et al. (2016). We used the statically binarized MNIST digits from Salakhutdinov & Murray (2008) and a fixed binarization of the Omniglot character dataset. We used the standard splits into training, validation, and test sets. The network used several layers of 200 stochastic binary units interleaved with deterministic nonlinearities. In our experiments, we used either a linear deterministic layer (denoted linear) or 2 layers of 200 tanh units (denoted nonlinear).\n\n5.2.1 Generative modeling on MNIST and Omniglot\n\nFor generative modeling, we maximized a single-sample variational lower bound on the log-likelihood. We performed amortized inference (Kingma & Welling, 2013; Rezende et al., 2014) with an inference network with similar architecture in the reverse direction. 
In particular, denoting the image by x and the hidden layer stochastic activations by b \\sim q(b|x, \\theta), we have\n\n\\log p(x|\\theta) \\geq E_{q(b|x,\\theta)}[\\log p(x, b|\\theta) - \\log q(b|x, \\theta)],\n\nwhich has the required form for REBAR.\nTo measure the variance of the gradient estimators, we follow a single optimization trajectory and use the same random numbers for all methods. This significantly reduces the variance in our measurements. We plot the log variance of the unbiased gradient estimators in Figure 2 for MNIST (Appendix Figure App.3 for Omniglot). REBAR produced the lowest variance across linear and nonlinear models for both tasks. The reduction in variance was especially large for the linear models. For the nonlinear model, REBAR (0.1) reduced variance at the beginning of training, but its performance degraded later in training. REBAR was able to adaptively change the temperature as optimization progressed and retained superior variance reduction. We also observed that SimpleMuProp was a surprisingly strong baseline that improved significantly over NVIL. It performed similarly to MuProp despite not explicitly using the gradient of f.\nGenerally, lower variance gradient estimates led to faster optimization of the objective and convergence to a better final value (Figure 3, Table 1, Appendix Figures App.2 and App.4). For the nonlinear model, the Concrete estimator underperformed optimizing the training objective in both tasks.\n\nFigure 2: Log variance of the gradient estimator for the two layer linear model (left) and single layer nonlinear model (right) on the MNIST generative modeling task. All of the estimators are unbiased, so their variance is directly comparable. We estimated moments from exponential moving averages (with decay = 0.999; we found that the results were robust to the exact value). 
The temperature is shown in parentheses where relevant.\n\nFigure 3: Training variational lower bound for the two layer linear model (left) and single layer nonlinear model (right) on the MNIST generative modeling task. We plot 5 trials over different random initializations for each method with the median trial highlighted. The temperature is shown in parentheses where relevant.\n\nAlthough our primary focus was optimization, for completeness, we include results on the test set in Appendix Table App.2, computed with a 100-sample lower bound (Burda et al., 2015). Improvements on the training variational lower bound do not directly translate into improved test log-likelihood. Previous work (Maddison et al., 2016) showed that regularizing the inference network alone was sufficient to prevent overfitting. This led us to hypothesize that the overfitting was primarily due to overfitting in the inference network (q). To test this, we trained a separate inference network on the validation and test sets, taking care not to affect the model parameters. This reduced overfitting (Appendix Figure App.5), but did not completely resolve the issue, suggesting that the generative and inference networks jointly overfit.\n\n5.2.2 Structured prediction on MNIST\n\nStructured prediction is a form of conditional density estimation that aims to model high dimensional observations given a context. We followed the structured prediction task described by Raiko et al. (2014), where we modeled the bottom half of an MNIST digit (x) conditional on the top half (c). The conditional generative network takes as input c and passes it through an SBN. We optimized a single-sample lower bound on the log-likelihood\n\n\\log p(x|c, \\theta) \\geq E_{p(b|c,\\theta)}[\\log p(x|b, \\theta)].\n\nWe measured the log variance of the gradient estimator (Figure 4) and found that REBAR significantly reduced variance. 
In some configurations, MuProp excelled, especially with the single layer linear model where the first-order expansion that MuProp uses is most accurate. Again, the training objective performance generally mirrored the reduction in variance of the gradient estimator (Figure 5, Table 1).\n\n                     NVIL     MuProp   REBAR (0.1)  REBAR    Concrete (0.1)\nMNIST gen.\n  Linear 1 layer     112.5    111.7    111.7        111.6    111.3\n  Linear 2 layer     99.6     99.07    99           98.8     99.62\n  Nonlinear          102.2    101.5    101.4        101.1    102.8\nOmniglot gen.\n  Linear 1 layer     117.44   117.09   116.93       116.83   117.23\n  Linear 2 layer     109.98   109.55   109.12       108.99   109.95\n  Nonlinear          110.4    109.58   109          108.72   110.64\nMNIST struct. pred.\n  Linear 1 layer     69.17    64.33    65.73        65.21    65.49\n  Linear 2 layer     68.87    63.69    65.5         61.72    66.88\n  Nonlinear          54.08    47.6     47.302       46.44    47.02\n\nTable 1: Mean training variational lower bound over 5 trials with different random initializations. The standard error of the mean is given in the Appendix. We bolded the best performing method (up to standard error) for each task. We report trials using the best performing learning rate for each task.\n\nFigure 4: Log variance of the gradient estimator for the two layer linear model (left) and single layer nonlinear model (right) on the structured prediction task.\n\n6 Discussion\n\nInspired by the Concrete relaxation, we introduced REBAR, a novel control variate for REINFORCE, and demonstrated that it greatly reduces the variance of the gradient estimator. We also showed that with a modification to the relaxation, REBAR and MuProp are closely related in the high temperature limit. Moreover, we showed that we can adapt the temperature online and that it further reduces variance.\nRoeder et al. (2017) show that the reparameterization gradient includes a score function term which can adversely affect the gradient variance. 
Because the reparameterization gradient only enters the REBAR estimator through differences of reparameterization gradients, we implicitly implement the recommendation from (Roeder et al., 2017).\nWhen optimizing the relaxation temperature, we require the derivative with respect to \\lambda of the gradient of the parameters. Empirically, the temperature changes slowly relative to the parameters, so we might be able to amortize the cost of this operation over several parameter updates. We leave exploring these ideas to future work.\nIt would be natural to explore the extension to the multi-sample case (e.g., VIMCO (Mnih & Rezende, 2016)), to leverage the layered structure in our models using Q-functions, and to apply this approach to reinforcement learning.\n\nFigure 5: Training variational lower bound for the two layer linear model (left) and single layer nonlinear model (right) on the structured prediction task. We plot 5 trials over different random initializations for each method with the median trial highlighted.\n\nAcknowledgments\n\nWe thank Ben Poole and Eric Jang for helpful discussions and assistance replicating their results.\n\nReferences\n\nYuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.\n\nMichael C Fu. Gradient estimation. Handbooks in Operations Research and Management Science, 13:575-616, 2006.\n\nPeter W Glynn. Likelihood ratio gradient estimation for stochastic systems. Communications of the ACM, 33(10):75-84, 1990.\n\nShixiang Gu, Sergey Levine, Ilya Sutskever, and Andriy Mnih. MuProp: Unbiased backpropagation for stochastic neural networks. arXiv preprint arXiv:1511.05176, 2015.\n\nEric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-Softmax. arXiv preprint arXiv:1611.01144, 2016.\n\nDiederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. 
arXiv preprint arXiv:1412.6980, 2014.\n\nDiederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.\n\nChris J. Maddison, Daniel Tarlow, and Tom Minka. A* sampling. In Advances in Neural Information Processing Systems 27, 2014.\n\nChris J Maddison, Andriy Mnih, and Yee Whye Teh. The Concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.\n\nAndriy Mnih and Karol Gregor. Neural variational inference and learning in belief networks. In Proceedings of The 31st International Conference on Machine Learning, pp. 1791-1799, 2014.\n\nAndriy Mnih and Danilo Rezende. Variational inference for Monte Carlo objectives. In Proceedings of The 33rd International Conference on Machine Learning, pp. 2188-2196, 2016.\n\nVolodymyr Mnih, Nicolas Heess, Alex Graves, et al. Recurrent models of visual attention. In Advances in Neural Information Processing Systems, pp. 2204-2212, 2014.\n\nArt B. Owen. Monte Carlo Theory, Methods and Examples. 2013.\n\nJohn Paisley, David M Blei, and Michael I Jordan. Variational Bayesian inference with stochastic search. In Proceedings of the 29th International Conference on Machine Learning, pp. 1363-1370, 2012.\n\nTapani Raiko, Mathias Berglund, Guillaume Alain, and Laurent Dinh. Techniques for learning binary stochastic feedforward neural networks. arXiv preprint arXiv:1406.2989, 2014.\n\nRajesh Ranganath, Sean Gerrish, and David M Blei. Black box variational inference. In AISTATS, pp. 814-822, 2014.\n\nDanilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of The 31st International Conference on Machine Learning, pp. 1278-1286, 2014.\n\nGeoffrey Roeder, Yuhuai Wu, and David Duvenaud. 
Sticking the landing: An asymptotically zero-variance gradient estimator for variational inference. arXiv preprint arXiv:1703.09194, 2017.\n\nFrancisco JR Ruiz, Michalis K Titsias, and David M Blei. Overdispersed black-box variational inference. In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, pp. 647-656. AUAI Press, 2016.\n\nRuslan Salakhutdinov and Iain Murray. On the quantitative analysis of deep belief networks. In Proceedings of the 25th International Conference on Machine Learning, pp. 872-879. ACM, 2008.\n\nMichalis K Titsias and Miguel L\u00e1zaro-Gredilla. Local expectation gradients for black box variational inference. In Advances in Neural Information Processing Systems, pp. 2638-2646, 2015.\n\nRonald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229-256, 1992.\n\nDavid Wingate and Theophane Weber. Automated variational inference in probabilistic programming. arXiv preprint arXiv:1301.1299, 2013.\n\nWojciech Zaremba and Ilya Sutskever. Reinforcement learning neural Turing machines. arXiv preprint arXiv:1505.00521, 362, 2015.\n", "award": [], "sourceid": 1508, "authors": [{"given_name": "George", "family_name": "Tucker", "institution": "Google Brain"}, {"given_name": "Andriy", "family_name": "Mnih", "institution": "DeepMind"}, {"given_name": "Chris", "family_name": "Maddison", "institution": "University of Oxford / DeepMind"}, {"given_name": "John", "family_name": "Lawson", "institution": "Google Brain"}, {"given_name": "Jascha", "family_name": "Sohl-Dickstein", "institution": "Google Brain"}]}