{"title": "The Generalized Reparameterization Gradient", "book": "Advances in Neural Information Processing Systems", "page_first": 460, "page_last": 468, "abstract": "The reparameterization gradient has become a widely used method to obtain Monte Carlo gradients to optimize the variational objective. However, this technique does not easily apply to commonly used distributions such as beta or gamma without further approximations, and most practical applications of the reparameterization gradient fit Gaussian distributions. In this paper, we introduce the generalized reparameterization gradient, a method that extends the reparameterization gradient to a wider class of variational distributions. Generalized reparameterizations use invertible transformations of the latent variables which lead to transformed distributions that weakly depend on the variational parameters. This results in new Monte Carlo gradients that combine reparameterization gradients and score function gradients. We demonstrate our approach on variational inference for two complex probabilistic models. The generalized reparameterization is effective: even a single sample from the variational distribution is enough to obtain a low-variance gradient.", "full_text": "The Generalized Reparameterization Gradient\n\nFrancisco J. R. Ruiz\nUniversity of Cambridge\nColumbia University\n\nMichalis K. Titsias\nAthens University of Economics and Business\n\nDavid M. Blei\nColumbia University\n\nAbstract\n\nThe reparameterization gradient has become a widely used method to obtain Monte Carlo gradients to optimize the variational objective. However, this technique does not easily apply to commonly used distributions such as beta or gamma without further approximations, and most practical applications of the reparameterization gradient fit Gaussian distributions. 
In this paper, we introduce the generalized reparameterization gradient, a method that extends the reparameterization gradient to a wider class of variational distributions. Generalized reparameterizations use invertible transformations of the latent variables which lead to transformed distributions that weakly depend on the variational parameters. This results in new Monte Carlo gradients that combine reparameterization gradients and score function gradients. We demonstrate our approach on variational inference for two complex probabilistic models. The generalized reparameterization is effective: even a single sample from the variational distribution is enough to obtain a low-variance gradient.\n\n1 Introduction\n\nVariational inference (VI) is a technique for approximating the posterior distribution in probabilistic models (Jordan et al., 1999; Wainwright and Jordan, 2008). Given a probabilistic model p(x, z) of observed variables x and hidden variables z, the goal of VI is to approximate the posterior p(z | x), which is intractable to compute exactly for many models. The idea of VI is to posit a family of distributions over the latent variables q(z; v) with free variational parameters v. VI then fits those parameters to find the member of the family that is closest in Kullback-Leibler (KL) divergence to the exact posterior, v* = arg min_v KL(q(z; v) || p(z | x)). This turns inference into optimization, and different ways of doing VI amount to different optimization algorithms for solving this problem.\n\nFor a certain class of probabilistic models, those where each conditional distribution is in an exponential family, we can easily use coordinate ascent optimization to minimize the KL divergence (Ghahramani and Beal, 2001). However, many important models do not fall into this class (e.g., probabilistic neural networks or Bayesian generalized linear models). 
This is the scenario that we focus on in this paper.\n\nMuch recent research in VI has focused on these difficult settings, seeking effective optimization algorithms that can be used with any model. This has enabled the application of VI on nonconjugate probabilistic models (Carbonetto et al., 2009; Paisley et al., 2012; Ranganath et al., 2014; Titsias and Lázaro-Gredilla, 2014), deep neural networks (Neal, 1992; Hinton et al., 1995; Mnih and Gregor, 2014; Kingma and Welling, 2014), and probabilistic programming (Wingate and Weber, 2013; Kucukelbir et al., 2015; van de Meent et al., 2016).\n\nOne strategy for VI in nonconjugate models is to obtain Monte Carlo estimates of the gradient of the variational objective and to use stochastic optimization to fit the variational parameters. Within this strategy, there have been two main lines of research: black-box variational inference (BBVI) (Ranganath et al., 2014) and reparameterization gradients (Salimans and Knowles, 2013; Kingma and Welling, 2014). Each enjoys different advantages and limitations.\n\nBBVI expresses the gradient of the variational objective as an expectation with respect to the variational distribution using the log-derivative trick, also called REINFORCE or score function method (Glynn, 1990; Williams, 1992). It then takes samples from the variational distribution to calculate noisy gradients. BBVI is generic: it can be used with any type of latent variables and any model. However, the gradient estimates typically suffer from high variance, which can lead to slow convergence. Ranganath et al. (2014) reduce the variance of these estimates using Rao-Blackwellization (Casella and Robert, 1996) and control variates (Ross, 2002; Paisley et al., 2012; Gu et al., 2016).\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n
Other researchers have proposed further reductions, e.g., through local expectations (Titsias and Lázaro-Gredilla, 2015) and importance sampling (Ruiz et al., 2016).\n\nThe second approach to Monte Carlo gradients of the variational objective is through reparameterization (Price, 1958; Bonnet, 1964; Salimans and Knowles, 2013; Kingma and Welling, 2014; Rezende et al., 2014). This approach reparameterizes the latent variable z in terms of a set of auxiliary random variables whose distributions do not depend on the variational parameters (typically, a standard normal). This facilitates taking gradients of the variational objective because the gradient operator can be pushed inside the expectation, and because the resulting procedure only requires drawing samples from simple distributions, such as standard normals. We describe this in detail in Section 2.\n\nReparameterization gradients exhibit lower variance than BBVI gradients. They typically need only one Monte Carlo sample to estimate a noisy gradient, which leads to fast algorithms. Further, for some models, their variance can be bounded (Fan et al., 2015). However, reparameterization is not as generic as BBVI. It is typically used with Gaussian variational distributions and does not easily generalize to other common distributions, such as the gamma or beta, without using further approximations. (See Knowles (2015) for an alternative approach to deal with the gamma distribution.)\n\nWe develop the generalized reparameterization (G-REP) gradient, a new method to extend reparameterization to other variational distributions. The main idea is to define an invertible transformation of the latent variables such that the distribution of the transformed variables is only weakly governed by the variational parameters. (We make this precise in Section 3.) 
Our technique naturally combines both BBVI and reparameterization; it applies to a wide class of nonconjugate models; it maintains the black-box criteria of reusing variational families; and it avoids approximations. We empirically show in two probabilistic models, a nonconjugate factorization model and a deep exponential family (Ranganath et al., 2015), that a single Monte Carlo sample is enough to build an effective low-variance estimate of the gradient. In terms of speed, G-REP outperforms BBVI. In terms of accuracy, it outperforms automatic differentiation variational inference (ADVI) (Kucukelbir et al., 2016), which considers Gaussian variational distributions on a transformed space.\n\n2 Background\n\nConsider a probabilistic model p(x, z), where z denotes the latent variables and x the observations. We assume that the posterior distribution p(z | x) is analytically intractable and we wish to apply VI. We introduce a tractable distribution q(z; v) to approximate p(z | x) and minimize the KL divergence D_KL(q(z; v) || p(z | x)) with respect to the variational parameters v. This minimization is equivalently expressed as the maximization of the so-called evidence lower bound (ELBO) (Jordan et al., 1999),\n\nL(v) = E_{q(z;v)}[log p(x, z) - log q(z; v)] = E_{q(z;v)}[f(z)] + H[q(z; v)].   (1)\n\nWe denote\n\nf(z) ≜ log p(x, z)   (2)\n\nto be the model log-joint density and H[q(z; v)] to be the entropy of the variational distribution. When the expectation E_{q(z;v)}[f(z)] is analytically tractable, the maximization of the ELBO can be carried out using standard optimization methods. Otherwise, when it is intractable, other techniques are needed. Recent approaches rely on stochastic optimization to construct Monte Carlo estimates of the gradient with respect to the variational parameters. 
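When E_{q(z;v)}[f(z)] in Eq. 1 is intractable, a Monte Carlo estimate of the ELBO is straightforward. Below is a minimal sketch for a one-dimensional Gaussian variational distribution; the function names and the toy model are ours, not from the paper:

```python
import numpy as np

def elbo_estimate(logp, mu, sigma, n_samples=20_000, seed=0):
    """Monte Carlo estimate of the ELBO in Eq. 1 for a Gaussian q(z; mu, sigma).

    E_q[f(z)] is approximated with samples, while the Gaussian entropy
    H[q] = 0.5 * log(2*pi*e*sigma^2) is available in closed form.
    """
    rng = np.random.default_rng(seed)
    z = rng.normal(mu, sigma, size=n_samples)
    entropy = 0.5 * np.log(2.0 * np.pi * np.e * sigma**2)
    return np.mean(logp(z)) + entropy

# Toy log-joint whose exact posterior is N(0, 1); the ELBO then equals -KL(q || p),
# which is 0 at (mu, sigma) = (0, 1) and strictly negative elsewhere.
logp = lambda z: -0.5 * np.log(2.0 * np.pi) - 0.5 * z**2
```

Maximizing this noisy objective with stochastic optimization is exactly the setting the next paragraphs address.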
Below, we review the two main methods for building such Monte Carlo estimates: the score function method and the reparameterization trick.\n\nScore function method. A general way to obtain unbiased stochastic gradients is to use the score function method, also called log-derivative trick or REINFORCE (Williams, 1992; Glynn, 1990), which has been recently applied to VI (Paisley et al., 2012; Ranganath et al., 2014; Mnih and Gregor, 2014). It is based on writing the gradient of the ELBO with respect to v as\n\n∇_v L = E_{q(z;v)}[f(z) ∇_v log q(z; v)] + ∇_v H[q(z; v)],   (3)\n\nand then building Monte Carlo estimates by approximating the expectation with samples from q(z; v). The resulting estimator suffers from high variance, making it necessary to apply variance reduction methods such as control variates (Ross, 2002) or Rao-Blackwellization (Casella and Robert, 1996). Such variance reduction techniques have been used in BBVI (Ranganath et al., 2014).\n\nReparameterization. The reparameterization trick (Salimans and Knowles, 2013; Kingma and Welling, 2014) expresses the latent variables z as an invertible function of another set of variables ε, i.e., z = T(ε; v), such that the distribution of the new random variables q_ε(ε) does not depend on the variational parameters v. Under these assumptions, expectations with respect to q(z; v) can be expressed as E_{q(z;v)}[f(z)] = E_{q_ε(ε)}[f(T(ε; v))], and the gradient with respect to v can be pushed into the expectation, yielding\n\n∇_v L = E_{q_ε(ε)}[ ∇_z f(z)|_{z=T(ε;v)} ∇_v T(ε; v) ] + ∇_v H[q(z; v)].   (4)\n\nThe assumption here is that the log-joint f(z) is differentiable. The gradient ∇_z f(z) depends on the model, but it can be computed using automatic differentiation tools (Baydin et al., 2015). 
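The variance gap between the two estimators of Eqs. 3 and 4 is easy to see numerically. A sketch for a one-dimensional Gaussian q (the toy log-joint and all names are ours; the entropy terms of Eqs. 3 and 4 are omitted since they are identical in both):

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda z: -(z - 2.0) ** 2        # toy "log-joint" f(z)
mu, sigma = 0.0, 1.0                 # variational parameters of q(z) = N(mu, sigma^2)

eps = rng.standard_normal(10_000)
z = mu + sigma * eps                 # z = T(eps; v), the scalar case of Eq. 5

# Score-function estimator of d/dmu E_q[f(z)] (Eq. 3):
#   f(z) * d log q(z; mu, sigma)/dmu = f(z) * (z - mu) / sigma^2
score = f(z) * (z - mu) / sigma**2

# Reparameterization estimator (Eq. 4): f'(z) * dT/dmu = -2 (z - 2) * 1
reparam = -2.0 * (z - 2.0)

# Both are unbiased for the true gradient d/dmu E_q[f] = -2 (mu - 2) = 4,
# but the per-sample variance of the reparameterized version is far smaller
# (here roughly 4 versus roughly 87).
print(score.mean(), reparam.mean())
print(score.var(), reparam.var())
```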
Monte Carlo estimates of the reparameterization gradient typically present much lower variance than those based on Eq. 3. In practice, a single sample from q_ε(ε) is enough to obtain a low-variance estimate.[1]\n\nThe reparameterization trick is thus a powerful technique to reduce the variance of the estimator, but it requires a transformation ε = T^{-1}(z; v) such that q_ε(ε) does not depend on the variational parameters v. For instance, if the variational distribution is Gaussian with mean μ and covariance Σ, a straightforward transformation consists of standardizing the random variable z, i.e.,\n\nε = T^{-1}(z; μ, Σ) = Σ^{-1/2}(z - μ).   (5)\n\nThis transformation ensures that the (Gaussian) distribution q_ε(ε) does not depend on μ or Σ.\n\nFor a general variational distribution q(z; v), Kingma and Welling (2014) discuss three families of transformations: inverse cumulative density function (CDF), location-scale, and composition. However, these transformations may not apply in certain cases.[2] Notably, none of them apply to the gamma[3] and the beta distributions, although these distributions are often used in VI.\n\nNext, we show how to relax the constraint that the transformed density q_ε(ε) must not depend on the variational parameters v. We follow a standardization procedure similar to the Gaussian case in Eq. 5, but we allow the distribution of the standardized variable ε to depend (at least weakly) on v.\n\n3 The Generalized Reparameterization Gradient\n\nWe now generalize the reparameterization idea to distributions that, like the gamma or the beta, do not admit the standard reparameterization trick. We assume that we can efficiently sample from the variational distribution q(z; v), and that q(z; v) is differentiable with respect to z and v. 
We introduce a random variable ε defined by an invertible transformation\n\nz = T(ε; v),  and  ε = T^{-1}(z; v),   (6)\n\nwhere we can think of ε = T^{-1}(z; v) as a standardization procedure that attempts to make the distribution of ε weakly dependent on the variational parameters v. \u201cWeakly\u201d means that at least its first moment does not depend on v. For instance, if ε is defined to have zero mean, then its first moment has become independent of v. However, we do not assume that the resulting distribution of ε is completely independent of the variational parameters v, and therefore we write it as q_ε(ε; v). We use the distribution q_ε(ε; v) in the derivation of G-REP, but we write the final gradient as an expectation with respect to the original variational distribution q(z; v), from which we can sample.\n\nMore in detail, by the standard change-of-variable technique, the transformed density is\n\nq_ε(ε; v) = q(T(ε; v); v) J(ε, v),  where  J(ε, v) ≜ |det ∇_ε T(ε; v)|   (7)\n\nis a short-hand for the absolute value of the determinant of the Jacobian. We first use the transformation to rewrite the gradient of E_{q(z;v)}[f(z)] in (1) as\n\n∇_v E_{q(z;v)}[f(z)] = ∇_v E_{q_ε(ε;v)}[f(T(ε; v))] = ∇_v ∫ q_ε(ε; v) f(T(ε; v)) dε.   (8)\n\n[1] In the literature, there is no formal proof that reparameterization has lower variance than the score function estimator, except for some simple models (Fan et al., 2015). Titsias and Lázaro-Gredilla (2014) provide some intuitions, and Rezende et al. (2014) show some benefits of reparameterization in the Gaussian case.\n[2] The inverse CDF approach sets T^{-1}(z; v) to the CDF. This leads to a uniform distribution over ε on the unit interval, but it is not practical because the inverse CDF, T(ε; v), does not have an analytical solution in general. We develop an approach that does not require computation of (inverse) CDFs or their derivatives.\n[3] Composition is only available when it is possible to express the gamma as a sum of exponentials, i.e., its shape parameter is an integer, which is not generally the case in VI.\n\nWe now express the gradient as the sum of two terms, which we name g_rep and g_corr for reasons that we will explain below. We apply the log-derivative trick and the product rule for derivatives, yielding\n\n∇_v E_{q(z;v)}[f(z)] = ∫ q_ε(ε; v) ∇_v f(T(ε; v)) dε   [g_rep]\n                     + ∫ q_ε(ε; v) f(T(ε; v)) ∇_v log q_ε(ε; v) dε   [g_corr].   (9)\n\nWe rewrite Eq. 9 as an expression that involves expectations with respect to the original variational distribution q(z; v) only. For that, we define the following two auxiliary functions that depend on the transformation T(ε; v):\n\nh(ε; v) ≜ ∇_v T(ε; v),  and  u(ε; v) ≜ ∇_v log J(ε, v).   (10)\n\nAfter some algebra (see the Supplement for details), we obtain\n\ng_rep = E_{q(z;v)}[ ∇_z f(z) h(T^{-1}(z; v); v) ],\ng_corr = E_{q(z;v)}[ f(z) ( ∇_z log q(z; v) h(T^{-1}(z; v); v) + ∇_v log q(z; v) + u(T^{-1}(z; v); v) ) ].   (11)\n\nThus, we can finally write the full gradient of the ELBO as\n\n∇_v L = g_rep + g_corr + ∇_v H[q(z; v)].   (12)\n\nInterpretation of the generalized reparameterization gradient. 
The term g_rep is easily recognizable as the standard reparameterization gradient, and hence the label \u201crep.\u201d Indeed, if the distribution q_ε(ε; v) does not depend on the variational parameters v, then the term ∇_v log q_ε(ε; v) in Eq. 9 vanishes, making g_corr = 0. Thus, we may interpret g_corr as a \u201ccorrection\u201d term that appears when the transformed density depends on the variational parameters.\n\nFurthermore, we can recover the score function gradient in Eq. 3 by choosing the identity transformation, z = T(ε; v) = ε. In such case, the auxiliary functions in Eq. 10 become zero because the transformation does not depend on v, i.e., h(ε; v) = 0 and u(ε; v) = 0. This implies that g_rep = 0 and g_corr = E_{q(z;v)}[f(z) ∇_v log q(z; v)].\n\nAlternatively, we can interpret the G-REP gradient as a control variate of the score function gradient. For that, we rearrange Eqs. 9 and 11 to express the gradient as\n\n∇_v E_{q(z;v)}[f(z)] = E_{q(z;v)}[f(z) ∇_v log q(z; v)]\n                     + g_rep + E_{q(z;v)}[ f(z) ( ∇_z log q(z; v) h(T^{-1}(z; v); v) + u(T^{-1}(z; v); v) ) ],\n\nwhere the second line is the control variate, which involves the reparameterization gradient.\n\nTransformations. Eqs. 9 and 11 are valid for any transformation T(ε; v). However, we may expect some transformations to perform better than others, in terms of the variance of the resulting estimator. It seems sensible to search for transformations that make g_corr small, as the reparameterization gradient g_rep is known to present low variance in practice under standard smoothness conditions of the log-joint (Fan et al., 2015).[4] Transformations that make g_corr small are such that ε = T^{-1}(z; v) becomes weakly dependent on the variational parameters v. 
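Given a transformation and its auxiliary functions, the one-sample estimator of Eqs. 11-12 is mechanical to assemble. A sketch in which every callable is a placeholder we supply per variational family, not an API from the paper:

```python
def grep_single_sample(z, f, grad_f, grad_z_log_q, grad_v_log_q, h, u):
    """One-sample estimate of g_rep + g_corr (Eq. 11) from a draw z ~ q(z; v).

    The callables evaluate f(z), df/dz, d log q/dz, d log q/dv, and the
    auxiliary functions h and u of Eq. 10 at eps = T^{-1}(z; v). The entropy
    gradient of Eq. 12 must be added separately.
    """
    g_rep = grad_f(z) * h(z)
    g_corr = f(z) * (grad_z_log_q(z) * h(z) + grad_v_log_q(z) + u(z))
    return g_rep + g_corr

# Sanity check on the Gaussian location parameter mu (standard reparameterization,
# Eq. 5): h = dT/dmu = 1 and u = 0, the correction cancels pointwise, and the
# one-sample estimate reduces exactly to the reparameterization gradient f'(z).
mu, sigma, z = 0.0, 1.0, 0.7
f = lambda z: -(z - 2.0) ** 2
est = grep_single_sample(
    z, f,
    grad_f=lambda z: -2.0 * (z - 2.0),
    grad_z_log_q=lambda z: -(z - mu) / sigma**2,
    grad_v_log_q=lambda z: (z - mu) / sigma**2,
    h=lambda z: 1.0,
    u=lambda z: 0.0,
)
```

The Gaussian check illustrates the interpretation above: when q_ε does not depend on v, g_corr vanishes and only the reparameterization term survives.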
In the standard reparameterization of Gaussian random variables, the transformation takes the form in (5), and thus ε is a standardized version of z. We mimic this standardization idea for other distributions as well. In particular, for exponential family distributions, we use transformations of the form (sufficient statistic - expected sufficient statistic)/(scale factor). We present several examples in the next section.\n\n3.1 Examples\n\nFor concreteness, we show here some examples of the equations above for well-known probability distributions. In particular, we choose the gamma, log-normal, and beta distributions.\n\nGamma distribution. Let q(z; α, β) be a gamma distribution with shape α and rate β. We use a transformation based on standardization of the sufficient statistic log(z), i.e.,\n\nε = T^{-1}(z; α, β) = ( log(z) - ψ(α) + log(β) ) / sqrt(ψ_1(α)),\n\nwhere ψ(·) denotes the digamma function, and ψ_k(·) is its k-th derivative. This ensures that ε has zero mean and unit variance, and thus its two first moments do not depend on the variational parameters α and β. We now compute the auxiliary functions in Eq. 10 for the components of the gradient with respect to α and β, which take the form\n\nh_α(ε; α, β) = T(ε; α, β) ( ε ψ_2(α) / (2 sqrt(ψ_1(α))) + ψ_1(α) ),\nu_α(ε; α, β) = ε ψ_2(α) / (2 sqrt(ψ_1(α))) + ψ_1(α) + ψ_2(α) / (2 ψ_1(α)),\nh_β(ε; α, β) = - T(ε; α, β) / β,\nu_β(ε; α, β) = - 1/β.\n\nThe terms g_rep and g_corr are obtained after substituting these results in Eq. 11. We provide the final expressions in the Supplement. We remark here that the component of g_corr corresponding to the derivative with respect to the rate equals zero, i.e., g_corr^β = 0, meaning that the distribution of ε does not depend on the parameter β. Indeed, we can compute this distribution following Eq. 7 as\n\nq_ε(ε; α, β) = ( e^{α ψ(α)} sqrt(ψ_1(α)) / Γ(α) ) exp( ε α sqrt(ψ_1(α)) - exp( ε sqrt(ψ_1(α)) + ψ(α) ) ),\n\nwhere we can verify that it does not depend on β.\n\n[4] Techniques such as Rao-Blackwellization could additionally be applied to reduce the variance of g_corr. We do not apply any such technique in this paper.\n\nLog-normal distribution. For a log-normal distribution with location μ and scale σ, we can standardize the sufficient statistic log(z) as\n\nε = T^{-1}(z; μ, σ) = ( log(z) - μ ) / σ.\n\nThis leads to a standard normal distribution on ε, which does not depend on the variational parameters, and thus g_corr = 0. 
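The gamma transformation and its auxiliary functions above can be sanity-checked numerically against finite differences; a sketch, assuming NumPy and SciPy are available (ψ_k is `polygamma(k, ·)`; all function names are ours):

```python
import numpy as np
from scipy.special import polygamma

psi  = lambda a: polygamma(0, a)   # digamma
psi1 = lambda a: polygamma(1, a)   # trigamma
psi2 = lambda a: polygamma(2, a)

def T(eps, alpha, beta):
    # Inverse of the standardization: z = exp(eps*sqrt(psi1(alpha)) + psi(alpha) - log(beta))
    return np.exp(eps * np.sqrt(psi1(alpha)) + psi(alpha) - np.log(beta))

def T_inv(z, alpha, beta):
    return (np.log(z) - psi(alpha) + np.log(beta)) / np.sqrt(psi1(alpha))

def h_alpha(eps, alpha, beta):
    # dT/dalpha at fixed eps
    return T(eps, alpha, beta) * (eps * psi2(alpha) / (2.0 * np.sqrt(psi1(alpha))) + psi1(alpha))

def h_beta(eps, alpha, beta):
    # dT/dbeta at fixed eps
    return -T(eps, alpha, beta) / beta

alpha, beta, z = 2.3, 1.7, 0.9
eps = T_inv(z, alpha, beta)
```

Checking `T(T_inv(z)) == z` and comparing `h_alpha`, `h_beta` against central differences of `T` confirms the expressions for the gamma case.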
The auxiliary function h(ε; μ, σ), which is needed for g_rep, takes the form\n\nh_μ(ε; μ, σ) = T(ε; μ, σ),  h_σ(ε; μ, σ) = ε T(ε; μ, σ).\n\nThus, the reparameterization gradient is given in this case by\n\ng_rep^μ = E_{q(z;μ,σ)}[ z ∇_z f(z) ],  g_rep^σ = E_{q(z;μ,σ)}[ z T^{-1}(z; μ, σ) ∇_z f(z) ].\n\nThis corresponds to ADVI (Kucukelbir et al., 2016) with a logarithmic transformation over a positive random variable, since the variational distribution over the transformed variable is Gaussian. For a general variational distribution, we recover ADVI if the transformation makes ε Gaussian.\n\nBeta distribution. For a random variable z ~ Beta(α, β), we could rewrite z = z'_1/(z'_1 + z'_2) for z'_1 ~ Gamma(α, 1) and z'_2 ~ Gamma(β, 1), and apply the gamma reparameterization for z'_1 and z'_2. Instead, in the spirit of applying standardization directly over z, we define a transformation to standardize the logit function, logit(z) ≜ log(z/(1 - z)) (sum of sufficient statistics of the beta),\n\nε = T^{-1}(z; α, β) = ( logit(z) - ψ(α) + ψ(β) ) / σ(α, β).\n\nThis ensures that ε has zero mean. 
We can set the denominator to the standard deviation of logit(z). However, for larger-scaled models we found better performance with a denominator σ(α, β) that makes g_corr = 0 for the currently drawn sample z (see the Supplement for details), even though the variance of the transformed variable ε is not one in such case.[5] The reason is that g_corr suffers from high variance in the same way as the score function estimator does.\n\n[5] Note that this introduces some bias since we are ignoring the dependence of σ(α, β) on z.\n\n3.2 Algorithm\n\nWe now present our full algorithm for G-REP. It requires the specification of the variational family and the transformation T(ε; v). Given these, the full procedure is summarized in Algorithm 1. We use the adaptive step-size sequence proposed by Kucukelbir et al. (2016), which combines rmsprop (Tieleman and Hinton, 2012) and Adagrad (Duchi et al., 2011). Let g_k^(i) be the k-th component of the gradient at the i-th iteration, and ρ_k^(i) the step-size for that component. We set\n\nρ_k^(i) = η × i^{-0.5+δ} × ( τ + sqrt(s_k^(i)) )^{-1},  with  s_k^(i) = γ (g_k^(i))^2 + (1 - γ) s_k^(i-1),   (13)\n\nwhere we set δ = 10^{-16}, τ = 1, γ = 0.1, and we explore several values of η. Thus, we update the variational parameters as v^(i+1) = v^(i) + ρ^(i) ∘ ∇_v L, where '∘' is the element-wise product.\n\nAlgorithm 1: Generalized reparameterization gradient algorithm\ninput: data x, probabilistic model p(x, z), variational family q(z; v), transformation z = T(ε; v)\noutput: variational parameters v\nInitialize v\nrepeat\n  Draw a single sample z ~ q(z; v)\n  Compute the auxiliary functions h(T^{-1}(z; v); v) and u(T^{-1}(z; v); v) (Eq. 10)\n  Estimate g_rep and g_corr (Eq. 11, estimate the expectation with one sample)\n  Compute (analytic) or estimate (Monte Carlo) the gradient of the entropy, ∇_v H[q(z; v)]\n  Compute the noisy gradient ∇_v L (Eq. 12)\n  Set the step-size ρ^(i) (Eq. 13) and take a gradient step for v\nuntil convergence\n\n3.3 Related work\n\nA closely related VI method is ADVI, which also relies on reparameterization and has been incorporated into Stan (Kucukelbir et al., 2015, 2016). ADVI applies a transformation to the random variables such that their support is on the reals and then uses a Gaussian variational posterior on the transformed space. For instance, random variables that are constrained to be positive are first transformed through a logarithmic function and then a Gaussian variational approximating distribution is placed on the unconstrained space. Thus, ADVI struggles to approximate probability densities with singularities, which are useful in models where sparsity is appropriate. In contrast, the G-REP method allows us to estimate the gradient for a wider class of variational distributions, including gamma and beta distributions, which are more appropriate to encode sparsity constraints.\n\nSchulman et al. (2015) also write the gradient in the form given in Eq. 12 to automatically estimate the gradient through a backpropagation algorithm in the context of stochastic computation graphs. However, they do not provide additional insight into this equation, do not apply it to general VI, do not discuss transformations for any distributions, and do not report experiments. Thus, our paper complements Schulman et al. (2015) and provides an off-the-shelf tool for general VI.\n\n4 Experiments\n\nWe apply G-REP to perform mean-field VI on two nonconjugate probabilistic models: the sparse gamma deep exponential family (DEF) and a beta-gamma matrix factorization (MF) model. 
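The adaptive step-size rule of Eq. 13 takes only a few lines; a minimal sketch (function and variable names are ours):

```python
import numpy as np

def step_size(i, g, s_prev, eta=1.0, delta=1e-16, tau=1.0, gamma=0.1):
    """Element-wise adaptive step-size of Eq. 13 (the rmsprop/Adagrad hybrid
    of Kucukelbir et al., 2016). Returns the step rho^(i) and updated s^(i)."""
    s = gamma * g**2 + (1.0 - gamma) * s_prev   # running squared-gradient average
    rho = eta * i ** (-0.5 + delta) / (tau + np.sqrt(s))
    return rho, s

# One gradient step v <- v + rho (element-wise) * grad, as in the update rule:
v, s = np.zeros(2), np.zeros(2)
grad = np.array([3.0, -1.0])
rho, s = step_size(1, grad, s)
v = v + rho * grad
```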
The sparse gamma DEF (Ranganath et al., 2015) is a probabilistic model with several layers of latent locations and latent weights, mimicking the architecture of a deep neural network. The weights of the model are denoted by w_{kk'}^{(ℓ)}, where k and k' run over latent components, and ℓ indexes the layer. The latent locations are z_{nk}^{(ℓ)}, where n denotes the observation. We consider Poisson-distributed observations x_nd for each dimension d. Thus, the model is specified as\n\nz_{nk}^{(ℓ)} ~ Gamma( α_z, α_z / Σ_{k'} z_{nk'}^{(ℓ+1)} w_{k'k}^{(ℓ)} ),  x_nd ~ Poisson( Σ_k z_{nk}^{(1)} w_{kd}^{(0)} ).\n\nWe place gamma priors over the weights w^{(ℓ)} with rate 0.3 and shape 0.1, and a gamma prior with rate 0.1 and shape 0.1 over the top-layer latent variables z_{nk}^{(L)}. We set the hyperparameter α_z = 0.1, and we use L = 3 layers with 100, 40, and 15 latent factors.\n\nThe second model is a beta-gamma MF model with weights w_kd and latent locations z_nk. We use this model to describe binary observations x_nd, which are modeled as\n\nx_nd ~ Bernoulli( sigmoid( Σ_k logit(z_nk) w_kd ) ),\n\nwhere logit(z) = log(z/(1 - z)) and sigmoid(·) is the inverse logit function. We place a gamma prior with shape 0.1 and rate 0.3 over the weights w_kd, a uniform prior over the variables z_nk, and we use K = 100 latent components.\n\nDatasets. 
We apply the sparse gamma DEF on two different databases: (i) the Olivetti database at AT&T,[6] which consists of 400 (320 for training and 80 for test) 64 × 64 images of human faces in an 8-bit scale (0-255); and (ii) the collection of papers at the Neural Information Processing Systems (NIPS) 2011 conference, which consists of 305 documents and a vocabulary of 5715 effective words in a bag-of-words format (25% of words from all documents are set aside to form the test set).\n\n[6] http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html\n\nTable 1: (Left) Step-size constant η, reported for completeness. (Right) Average time per iteration in seconds. G-REP is 1-4 times slower than ADVI but above one order of magnitude faster than BBVI.\n\nStep-size constant η:\nDataset   g-rep  bbvi  advi\nOlivetti  5      1     0.1\nnips      0.5    5     1\nmnist     5      5     0.1\nOmniglot  5      -     0.1\n\nAverage time per iteration (s):\nDataset   g-rep  bbvi   advi\nOlivetti  0.46   12.90  0.17\nnips      0.83   20.95  0.25\nmnist     1.09   25.99  0.34\nOmniglot  5.50   -      4.10\n\nWe apply the beta-gamma MF on: (i) the binarized MNIST data,[7] which consists of 28 × 28 images of hand-written digits (we use 5000 training and 2000 test images); and (ii) the Omniglot dataset (Lake et al., 2015), which consists of 105 × 105 images of hand-written characters from different alphabets (we select 10 alphabets, with 4425 training images, 1475 test images, and 295 characters).\n\nEvaluation. We apply mean-field VI and we compare G-REP with BBVI (Ranganath et al., 2014) and ADVI (Kucukelbir et al., 2016). We do not apply BBVI on the Omniglot dataset due to its computational complexity. At each iteration, we evaluate the ELBO using one sample from the variational distribution, except for ADVI, for which we use 20 samples (for the Omniglot dataset, we only use one sample). We run each algorithm with a fixed computational budget of CPU time. 
After that time, we also evaluate the predictive log-likelihood on the test set, averaging over 100 posterior samples. For the NIPS data, we also compute the test perplexity (with one posterior sample) every 10 iterations, given by\n\nexp( - ( Σ_docs Σ_{w ∈ doc(d)} log p(w | #held out in doc(d)) ) / (#held out words) ).\n\nExperimental setup. To estimate the gradient, we use 30 Monte Carlo samples for BBVI, and only 1 for ADVI and G-REP. For BBVI, we use Rao-Blackwellization and control variates (we use a separate set of 30 samples to estimate the control variates). For BBVI and G-REP, we use beta and gamma variational distributions, whereas ADVI uses Gaussian distributions on the transformed space, which correspond to log-normal or logit-normal distributions on the original space. Thus, only G-REP and BBVI optimize the same variational family. We parameterize the gamma distribution in terms of its shape and mean, and the beta in terms of its shape parameters α and β. To avoid constrained optimization, we apply the transformation v' = log(exp(v) - 1) to the variational parameters that are constrained to be positive and take stochastic gradient steps with respect to v'. We use the analytic gradient of the entropy terms. We implement ADVI as described by Kucukelbir et al. (2016).\n\nWe use the step-size schedule in Eq. 13, and we explore the parameter η ∈ {0.1, 0.5, 1, 5}. For each algorithm and each dataset, we report the results based on the value of η for which the best ELBO was achieved. We report the values of η in Table 1 (left).\n\nResults. We show in Figure 1 the evolution of the ELBO as a function of the running time for three of the considered datasets. BBVI converges slower than the rest of the methods, since each iteration involves drawing multiple samples and evaluating the log-joint for each of them. 
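The held-out perplexity above is simple to compute from per-word held-out log-probabilities; a sketch in which the data layout (one array of word log-probabilities per document) is our assumption:

```python
import numpy as np

def held_out_perplexity(log_probs_per_doc):
    """Test perplexity as defined above: exp of the negative total held-out
    log-likelihood divided by the total number of held-out words.

    log_probs_per_doc holds, for each document d, an array with
    log p(w | #held out in doc(d)) for each held-out word w.
    """
    total_ll = sum(np.sum(lp) for lp in log_probs_per_doc)
    n_words = sum(len(lp) for lp in log_probs_per_doc)
    return np.exp(-total_ll / n_words)

# A model that is uniform over a vocabulary of size V scores perplexity exactly V:
V = 5715
docs = [np.full(10, -np.log(V)), np.full(3, -np.log(V))]
```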
advi and g-rep achieve similar bounds, except for the mnist dataset, for which g-rep provides a variational approximation that is closer to the posterior, since the elbo is higher. This is because a variational family with sparse gamma and beta distributions provides a better fit to the data than the variational family to which advi is limited (log-normal and logit-normal). advi seems to converge more slowly; however, we do not claim that advi converges more slowly than g-rep in general. Instead, the difference may be due to the different step-size schedules that we found to be optimal (see Table 1). We also report in Table 1 (right) the average time per iteration8 for each method: bbvi is the slowest method, and advi is the fastest because it involves simulation of Gaussian random variables only.

However, g-rep provides higher likelihood values than advi. We show in Figure 2a the evolution of the perplexity (lower is better) for the nips dataset, and in Figure 2b the resulting test log-likelihood (larger is better) for the rest of the considered datasets. In Figure 2b, we report the mean and standard deviation over 100 posterior samples. advi cannot fit the data as well as g-rep or bbvi because it is constrained to log-normal and logit-normal variational distributions.
These cannot capture sparsity, which is an important feature for the considered models. We can also conclude this by a simple visual inspection of the fitted models. In the Supplement, we compare images sampled from the g-rep and the advi posteriors, where we can observe that the latter are more blurry or lack some details.

7 http://yann.lecun.com/exdb/mnist
8 On the full mnist with 50,000 training images, g-rep (advi) took 8.08 (2.04) seconds per iteration.

Figure 1: Comparison between g-rep, bbvi, and advi in terms of the variational objective function. (a) elbo (Olivetti dataset). (b) elbo (mnist dataset). (c) elbo (Omniglot dataset).

Dataset     g-rep               bbvi                advi
Olivetti    −4.48 ± 0.01        −4.63 ± 0.01        −9.74 ± 0.08
mnist       −0.0888 ± 0.0004    −0.0932 ± 0.0004    −0.189 ± 0.009
Omniglot    −0.0472 ± 0.0001    −                   −0.0823 ± 0.0009

Figure 2: Comparison between g-rep, bbvi, and advi in terms of performance on the test set. (a) Perplexity (nips dataset). (b) Average test log-likelihood per entry x_nd. g-rep outperforms bbvi because the latter has not converged in the allowed time, and it also outperforms advi because of the variational family it uses.

5 Conclusion
We have introduced the generalized reparameterization gradient (g-rep), a technique to extend the standard reparameterization gradient to a wider class of variational distributions. As the standard reparameterization method, our method is applicable to any probabilistic model that is differentiable with respect to the latent variables. We have demonstrated the generalized reparameterization gradient on two nonconjugate probabilistic models to fit a variational approximation involving gamma and beta distributions.
We have also empirically shown that a single Monte Carlo sample is enough to obtain a noisy estimate of the gradient, therefore leading to a fast inference procedure.

Acknowledgments
This project has received funding from the EU H2020 programme (Marie Skłodowska-Curie grant agreement 706760), NSF IIS-1247664, ONR N00014-11-1-0651, DARPA FA8750-14-2-0009, DARPA N66001-15-C-4032, Adobe, the John Templeton Foundation, and the Sloan Foundation. The authors would also like to thank Kriste Krstovski, Alp Kucukelbir, and Christian A. Naesseth for helpful comments and discussions.

References
Baydin, A. G., Pearlmutter, B. A., and Radul, A. A. (2015). Automatic differentiation in machine learning: a survey. arXiv:1502.05767.
Bonnet, G. (1964). Transformations des signaux aléatoires a travers les systemes non linéaires sans mémoire. Annals of Telecommunications, 19(9):203–220.
Carbonetto, P., King, M., and Hamze, F. (2009). A stochastic approximation method for inference in probabilistic graphical models. In Advances in Neural Information Processing Systems.
Casella, G. and Robert, C. P. (1996). Rao-Blackwellisation of sampling schemes. Biometrika, 83(1):81–94.
Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159.
Fan, K., Wang, Z., Beck, J., Kwok, J., and Heller, K. A. (2015). Fast second order stochastic backpropagation for variational inference. In Advances in Neural Information Processing Systems.
Ghahramani, Z. and Beal, M. J. (2001). Propagation algorithms for variational Bayesian learning. In Advances in Neural Information Processing Systems.
Glynn, P. W. (1990). Likelihood ratio gradient estimation for stochastic systems.
Communications of the ACM, 33(10):75–84.
Gu, S., Levine, S., Sutskever, I., and Mnih, A. (2016). MuProp: Unbiased backpropagation for stochastic neural networks. In International Conference on Learning Representations.
Hinton, G., Dayan, P., Frey, B. J., and Neal, R. M. (1995). The wake-sleep algorithm for unsupervised neural networks. Science, 268(5214):1158–1161.
Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233.
Kingma, D. P. and Welling, M. (2014). Auto-encoding variational Bayes. In International Conference on Learning Representations.
Knowles, D. A. (2015). Stochastic gradient variational Bayes for gamma approximating distributions. arXiv:1509.01631v1.
Kucukelbir, A., Ranganath, R., Gelman, A., and Blei, D. M. (2015). Automatic variational inference in Stan. In Advances in Neural Information Processing Systems.
Kucukelbir, A., Tran, D., Ranganath, R., Gelman, A., and Blei, D. M. (2016). Automatic differentiation variational inference. arXiv:1603.00788.
Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B. (2015). Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338.
Mnih, A. and Gregor, K. (2014). Neural variational inference and learning in belief networks. In International Conference on Machine Learning.
Neal, R. (1992). Connectionist learning of belief networks. Artificial Intelligence, 56(1):71–113.
Paisley, J. W., Blei, D. M., and Jordan, M. I. (2012).
Variational Bayesian inference with stochastic search. In International Conference on Machine Learning.
Price, R. (1958). A useful theorem for nonlinear devices having Gaussian inputs. IRE Transactions on Information Theory, 4(2):69–72.
Ranganath, R., Gerrish, S., and Blei, D. M. (2014). Black box variational inference. In Artificial Intelligence and Statistics.
Ranganath, R., Tang, L., Charlin, L., and Blei, D. M. (2015). Deep exponential families. In Artificial Intelligence and Statistics.
Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning.
Ross, S. M. (2002). Simulation. Elsevier.
Ruiz, F. J. R., Titsias, M. K., and Blei, D. M. (2016). Overdispersed black-box variational inference. In Uncertainty in Artificial Intelligence.
Salimans, T. and Knowles, D. A. (2013). Fixed-form variational posterior approximation through stochastic linear regression. Bayesian Analysis, 8(4):837–882.
Schulman, J., Heess, N., Weber, T., and Abbeel, P. (2015). Gradient estimation using stochastic computation graphs. In Advances in Neural Information Processing Systems.
Tieleman, T. and Hinton, G. (2012). Lecture 6.5-RMSPROP: Divide the gradient by a running average of its recent magnitude. Coursera: Neural Networks for Machine Learning, 4.
Titsias, M. K. and Lázaro-Gredilla, M. (2014). Doubly stochastic variational Bayes for non-conjugate inference. In International Conference on Machine Learning.
Titsias, M. K. and Lázaro-Gredilla, M. (2015). Local expectation gradients for black box variational inference. In Advances in Neural Information Processing Systems.
van de Meent, J.-W., Tolpin, D., Paige, B., and Wood, F. (2016). Black-box policy search with probabilistic programs. In Artificial Intelligence and Statistics.
Wainwright, M. J.
and Jordan, M. I. (2008). Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2):1–305.
Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3–4):229–256.
Wingate, D. and Weber, T. (2013). Automated variational inference in probabilistic programming. arXiv:1301.1299.