{"title": "Local Expectation Gradients for Black Box Variational Inference", "book": "Advances in Neural Information Processing Systems", "page_first": 2638, "page_last": 2646, "abstract": "We introduce local expectation gradients which is a general purpose stochastic variational inference algorithm for constructing stochastic gradients by sampling from the variational distribution. This algorithm divides the problem of estimating the stochastic gradients over multiple variational parameters into smaller sub-tasks so that each sub-task explores intelligently the most relevant part of the variational distribution. This is achieved by performing an exact expectation over the single random variable that most correlates with the variational parameter of interest resulting in a Rao-Blackwellized estimate that has low variance. Our method works efficiently for both continuous and discrete random variables. Furthermore, the proposed algorithm has interesting similarities with Gibbs sampling but at the same time, unlike Gibbs sampling, can be trivially parallelized.", "full_text": "Local Expectation Gradients for Black Box\n\nVariational Inference\n\nMichalis K. Titsias\n\nAthens University of Economics and Business\n\nmtitsias@aueb.gr\n\nMiguel L\u00b4azaro-Gredilla\n\nVicarious\n\nmiguel@vicarious.com\n\nAbstract\n\nWe introduce local expectation gradients which is a general purpose stochastic\nvariational inference algorithm for constructing stochastic gradients by sampling\nfrom the variational distribution. This algorithm divides the problem of estimating\nthe stochastic gradients over multiple variational parameters into smaller sub-tasks\nso that each sub-task explores intelligently the most relevant part of the variational\ndistribution. This is achieved by performing an exact expectation over the single\nrandom variable that most correlates with the variational parameter of interest\nresulting in a Rao-Blackwellized estimate that has low variance. 
Our method works efficiently for both continuous and discrete random variables. Furthermore, the proposed algorithm has interesting similarities with Gibbs sampling but at the same time, unlike Gibbs sampling, can be trivially parallelized.\n\n1 Introduction\n\nStochastic variational inference has emerged as a promising and flexible framework for performing large scale approximate inference in complex probabilistic models. It significantly extends the traditional variational inference framework [7, 1] by incorporating stochastic approximation [16] into the optimization of the variational lower bound. Currently, there exist two major research directions in stochastic variational inference. The first one (data stochasticity) attempts to deal with massive datasets by constructing stochastic gradients from mini-batches of training examples [5, 6]. The second direction (expectation stochasticity) aims at dealing with the intractable expectations under the variational distribution that are encountered in non-conjugate probabilistic models [12, 14, 10, 18, 8, 15, 20]. Unifying these two ideas, it is possible to use stochastic gradients to address both massive datasets and intractable expectations. This results in a doubly stochastic estimation approach, where the mini-batch source of stochasticity is combined with the stochasticity associated with sampling from the variational distribution.\nIn this paper, we are interested in further investigating the expectation stochasticity, which in practice is dealt with by drawing samples from the variational distribution. A challenging issue here concerns the variance reduction of the stochastic gradients. 
Specifically, while the method based on the log derivative trick is currently the most general one, it has been observed to suffer severely from high variance [12, 14, 10] and thus it is applicable only together with sophisticated variance reduction techniques based on control variates. However, the construction of efficient control variates can depend strongly on the form of the probabilistic model. Therefore, it would be highly desirable to introduce more black box procedures, where simple stochastic gradients work well for any model, so that the end-user does not have to design model-dependent variance reduction techniques. Notice that, for continuous random variables and differentiable functions, the reparametrization approach [8, 15, 20] offers a simple black box procedure [20, 9] which does not require further model-dependent variance reduction. However, reparametrization is applicable neither to discrete spaces nor to non-differentiable models, and this greatly limits its scope of applicability.\nIn this paper, we introduce a simple black box algorithm for stochastic optimization in variational inference which provides stochastic gradients that have low variance without needing any extra variance reduction. This method is based on a new trick referred to as local expectation or integration. The key idea is that stochastic gradient estimation over multiple variational parameters can be divided into smaller sub-tasks, where each sub-task requires different amounts of information about different parts of the variational distribution. More precisely, each sub-task aims at exploiting the conditional independence structure of the variational distribution. 
Based on this intuitive idea we introduce the local expectation gradients algorithm, which provides a stochastic gradient over a variational parameter vi by performing an exact expectation over the associated latent variable xi while using a single sample of the remaining latent variables. Essentially, this is a Rao-Blackwellized estimate that allows us to dramatically reduce the variance of the stochastic gradient, so that, for instance, for continuous spaces the new stochastic gradient is guaranteed to have lower variance than the stochastic gradient corresponding to the reparametrization method when the latter utilizes a single sample. Furthermore, the local expectation algorithm has interesting similarities with Gibbs sampling, with the important difference that, unlike Gibbs sampling, it can be trivially parallelized.\n\n2 Stochastic variational inference\n\nHere, we discuss the main ideas behind current algorithms for stochastic variational inference, and particularly methods that sample from the variational distribution in order to approximate intractable expectations using Monte Carlo. Given a joint probability distribution p(y, x), where y are observations and x are latent variables (possibly including model parameters that are treated as random variables), and a variational distribution q_v(x), the objective is to maximize the lower bound\n\nF(v) = E_{q_v(x)}[log p(y, x) \u2212 log q_v(x)]  (1)\n     = E_{q_v(x)}[log p(y, x)] \u2212 E_{q_v(x)}[log q_v(x)],  (2)\n\nwith respect to the variational parameters v. Ideally, in order to tune v we would like a closed-form expression for the lower bound, so that we could subsequently maximize it using standard optimization routines such as gradient-based algorithms. However, for many probabilistic models and forms of the variational distribution at least one of the two expectations in (2) is intractable. 
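The bound in (1)-(2) can be checked numerically on a toy conjugate model. The sketch below is an illustrative assumption (model, values, and function names are not from the paper): it estimates the bound by Monte Carlo and compares it against the analytic log marginal likelihood.

```python
import numpy as np

# Monte Carlo estimate of F(v) = E_{q_v(x)}[log p(y,x) - log q_v(x)] for a
# conjugate toy model where the answer is known: prior x ~ N(0,1),
# likelihood y ~ N(x,1), observed y = 0.5.  With q set to the exact posterior
# N(y/2, 1/2), the bound equals the log marginal likelihood log N(y; 0, 2).

def log_normal(x, mean, var):
    return -0.5 * np.log(2 * np.pi * var) - 0.5 * (x - mean) ** 2 / var

def elbo_mc(y, q_mean, q_var, n_samples, rng):
    x = q_mean + np.sqrt(q_var) * rng.standard_normal(n_samples)
    log_joint = log_normal(x, 0.0, 1.0) + log_normal(y, x, 1.0)
    log_q = log_normal(x, q_mean, q_var)
    return np.mean(log_joint - log_q)

rng = np.random.default_rng(0)
y = 0.5
estimate = elbo_mc(y, q_mean=y / 2, q_var=0.5, n_samples=200_000, rng=rng)
exact = log_normal(y, 0.0, 2.0)   # analytic log marginal likelihood
print(estimate, exact)
```

With q equal to the exact posterior, every sample of log p(y, x) \u2212 log q(x) equals log p(y) exactly, so this particular estimator has zero variance; for any other q the same code returns a noisy, strictly smaller bound.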
Therefore, in general we are faced with the following intractable expectation\n\nF\u0303(v) = E_{q_v(x)}[f(x)],  (3)\n\nwhere f(x) can be either log p(y, x), \u2212log q_v(x) or log p(y, x) \u2212 log q_v(x), and from which we would like to efficiently estimate the gradient over v in order to apply gradient-based optimization.\nThe most general method for estimating the gradient \u2207_v F\u0303(v) is based on the log derivative trick, also called likelihood ratio or REINFORCE, which was invented in control theory and reinforcement learning [3, 21, 13] and has recently been used for variational inference [12, 14, 10]. Specifically, it makes use of the property \u2207_v q_v(x) = q_v(x) \u2207_v log q_v(x), which allows us to write the gradient as\n\n\u2207_v F\u0303(v) = E_{q_v(x)}[f(x) \u2207_v log q_v(x)]  (4)\n\nand then obtain an unbiased estimate according to\n\n(1/S) \u2211_{s=1}^{S} f(x^{(s)}) \u2207_v log q_v(x^{(s)}),  (5)\n\nwhere each x^{(s)} is an independent draw from q_v(x). While this estimate is unbiased, it has been observed to suffer severely from high variance, so that in practice it is necessary to consider variance reduction techniques such as those based on control variates [12, 14, 10].\nThe second approach is suitable for continuous spaces where f(x) is a differentiable function of x [8, 15, 20]. It is based on a simple transformation of (3) which moves the variational parameters v inside f(x), so that the expectation is eventually taken over a base distribution that does not depend on the variational parameters any more. For example, if the variational distribution is the 
For example, if the variational distribution is the\n\nGaussian N (x|\u00b5, LL(cid:62)) where v = (\u00b5, L), the expectation in (3) can be re-written as (cid:101)F(\u00b5, L) =\n\n2\n\n\f(cid:82) N (z|0, I)f (\u00b5 + Lz)dz and subsequently the gradient over (\u00b5, L) can be approximated by the\n\nfollowing unbiased Monte Carlo estimate\n\n\u2207(\u00b5,L)f (\u00b5 + Lz(s)),\n\n(6)\n\nS(cid:88)\n\ns=1\n\n1\nS\n\nwhere each z(s) is an independent sample from N (z|0, I). This estimate makes ef\ufb01cient use of the\nslope of f (x) which allows to perform informative moves in the space of (\u00b5, L). Furthermore, it has\nbeen shown experimentally in several studies [8, 15, 20, 9] that the estimate in (6) has relatively low\nvariance and can lead to ef\ufb01cient optimization even when a single sample is used at each iteration.\nNevertheless, a limitation of the approach is that it is only applicable to models where x is continuous\nand f (x) is differentiable. Even within this subset of models we are also additionally restricted to\nusing certain classes of variational distributions for which reparametrization is possible.\nNext we introduce an approach that is applicable to a broad class of models (both discrete and\ncontinuous), has favourable scaling properties and provides low-variance stochastic gradients.\n\n3 Local expectation gradients\n\nSuppose that the n-dimensional latent vector x in the probabilistic model takes values in some space\nS1 \u00d7 . . .Sn where each set Si can be continuous or discrete. We consider a variational distribution\nover x that is represented as a directed graphical model having the following joint density\n\nqv(x) =\n\nqvi (xi|pai),\n\n(7)\n\nwhere qvi(xi|pai) is the conditional factor over xi given the set of the parents denoted by pai.\nWe assume that each conditional factor has its own separate set of variational parameters vi and\nv = (vi, . . . , vn). 
The objective is then to obtain a stochastic approximation of the gradient of the lower bound over each variational parameter vi.\nOur method is motivated by the observation that each vi is influenced mostly by its corresponding latent variable xi, since vi determines the factor q_{v_i}(xi|pai). Therefore, to get information about the gradient over vi we should explore multiple possible values of xi and a rather smaller set of values of the remaining latent variables x_{\\i}. Next we take this idea to the extreme, using infinitely many draws of xi (i.e. essentially an exact expectation) together with just a single sample of x_{\\i}. More precisely, we factorize the variational distribution as q_v(x) = q(xi|mbi) q(x_{\\i}), where mbi denotes the Markov blanket of xi. The gradient over vi can be written as\n\n\u2207_{v_i} F\u0303(v) = E_{q(x)}[f(x) \u2207_{v_i} log q_{v_i}(xi|pai)] = E_{q(x_{\\i})}[ E_{q(xi|mbi)}[f(x) \u2207_{v_i} log q_{v_i}(xi|pai)] ],  (8)\n\nwhere in the second expression we used the law of iterated expectations. Then, an unbiased stochastic gradient, say at the t-th iteration of an optimization algorithm, can be obtained by drawing a single sample x^{(t)}_{\\i} from q(x_{\\i}) so that\n\nE_{q(xi|mb^{(t)}_i)}[f(x^{(t)}_{\\i}, xi) \u2207_{v_i} log q_{v_i}(xi|pa^{(t)}_i)] = \u2211_{xi} q\u0303(xi|mb^{(t)}_i) f(x^{(t)}_{\\i}, xi) \u2207_{v_i} q_{v_i}(xi|pa^{(t)}_i),  (9)\n\nwhere \u2211_{xi} denotes summation or integration and q\u0303(xi|mb^{(t)}_i) is the same as q(xi|mb^{(t)}_i) but with q_{v_i}(xi|pa^{(t)}_i) removed from the numerator.\u00b9 The above is the expression for the proposed stochastic gradient for the parameter vi. Notice that this estimate does not rely on the log derivative trick, since we never draw samples from q(xi|mb^{(t)}_i). Instead, the trick here is to perform a local expectation (integration or summation). To get an independent sample x^{(t)}_{\\i} from q(x_{\\i}) we can simply simulate a full latent vector x^{(t)} from q_v(x) by applying the standard ancestral sampling procedure for directed graphical models [1]. Then, the sub-vector x^{(t)}_{\\i} is by construction an independent draw from the marginal q(x_{\\i}). Furthermore, the sample x^{(t)} can be thought of as a global or pivot sample that needs to be drawn only once and can then be re-used multiple times in order to compute the stochastic gradients of all variational parameters (v1, . . . , vn) according to eq. (9).\nWhen the variable xi takes discrete values, the expectation in eq. (9) reduces to a sum of terms associated with all possible values of xi. On the other hand, when xi is a continuous variable the expectation in (9) corresponds to a univariate integral that in general may not be analytically tractable. In this case we shall use fast numerical integration methods.\nWe shall refer to the above algorithm for providing stochastic gradients over variational parameters as local expectation gradients; pseudo-code of a stochastic variational inference scheme that internally uses this algorithm is given in Algorithm 1.\n\nAlgorithm 1 Stochastic variational inference using local expectation gradients\nInput: f(x), q_v(x).\nInitialize v^{(0)}, t = 0.\nrepeat\n  Set t = t + 1.\n  Draw pivot sample x^{(t)} \u223c q_v(x).\n  for i = 1 to n do\n    dvi = E_{q(xi|mb^{(t)}_i)}[f(x^{(t)}_{\\i}, xi) \u2207_{v_i} log q_{v_i}(xi|pa^{(t)}_i)].\n    vi = vi + \u03b7t dvi.\n  end for\nuntil convergence criterion is met.\n\n\u00b9Notice that q(xi|mb^{(t)}_i) \u221d h(xi, mb^{(t)}_i) q_{v_i}(xi|pa^{(t)}_i) for some non-negative function h(\u00b7). 
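The loop body of Algorithm 1 can be sketched for the simplest case of a fully factorized Bernoulli variational distribution, where the local expectation over each x_i is just a two-term sum. The target f and all names below are illustrative assumptions, not the paper's models.

```python
import numpy as np

# Sketch of one Algorithm-1 step for q_v(x) = prod_i Bern(x_i | sigmoid(v_i)).
# For Bernoulli factors the exact expectation over x_i enumerates x_i in {0, 1}.

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def legrad_step(f, v, rng, lr=0.1):
    p = sigmoid(v)
    x = (rng.random(v.size) < p).astype(float)   # pivot sample x^(t) ~ q_v
    grad = np.empty_like(v)
    for i in range(v.size):                      # each i could run in parallel
        x1, x0 = x.copy(), x.copy()
        x1[i], x0[i] = 1.0, 0.0
        # sum over x_i of (d q(x_i)/d v_i) * f(x_{\i}, x_i), using
        # d q(x_i=1)/d v_i = p(1-p) and d q(x_i=0)/d v_i = -p(1-p)
        grad[i] = p[i] * (1.0 - p[i]) * (f(x1) - f(x0))
    return v + lr * grad, grad

# toy objective: maximize E_q[f] with f(x) = -sum((x - t)^2)
t = np.array([1.0, 0.0, 1.0])
f = lambda x: -np.sum((x - t) ** 2)
rng = np.random.default_rng(2)
v = np.zeros(3)
for _ in range(200):
    v, _ = legrad_step(f, v, rng)
print(sigmoid(v).round(2))   # probabilities move towards t = [1, 0, 1]
```

Note that the pivot sample is drawn once per step and shared by all coordinates, mirroring the parallelizable structure of Algorithm 1.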
Notice that Algorithm 1 corresponds to the case where f(x) = log p(y, x) \u2212 log q_v(x), while other cases can be expressed similarly.\nIn the next two sections we discuss some theoretical properties of local expectation gradients (Section 3.1) and draw interesting connections with Gibbs sampling (Section 3.2).\n\n3.1 Properties of local expectation gradients\n\nWe first derive the variance of the stochastic estimates obtained by local expectation gradients. In our analysis, we focus on the case of fitting a fully factorized variational distribution (and leave the more general case for future work) having the form\n\nq_v(x) = \u220f_{i=1}^{n} q_{v_i}(xi).  (10)\n\nFor this case the local expectation gradient for each parameter vi from eq. (9) simplifies to\n\nE_{q_{v_i}(xi)}[f(x_{\\i}, xi) \u2207_{v_i} log q_{v_i}(xi)] = \u2211_{xi} \u2207_{v_i} q_{v_i}(xi) f(x_{\\i}, xi),  (11)\n\nwhere for notational simplicity we write x^{(t)}_{\\i} as x_{\\i}. It is useful to define the following mean and covariance functions\n\nm(xi) = E_{q(x_{\\i})}[f(x_{\\i}, xi)],  (12)\nCov(xi, xi') = E_{q(x_{\\i})}[(f(x_{\\i}, xi) \u2212 m(xi))(f(x_{\\i}, xi') \u2212 m(xi'))],  (13)\n\nwhich characterize the variability of f(x_{\\i}, xi) as x_{\\i} varies according to q(x_{\\i}). Notice that based on eq. (12) the exact gradient of the variational lower bound over vi can also be written as \u2211_{xi} \u2207_{v_i} q_{v_i}(xi) m(xi), which has a form analogous to the local expectation gradient from (11), with the difference that f(x_{\\i}, xi) is now replaced by its mean value m(xi).\nWe can now characterize the variance of the stochastic gradient and describe some additional properties. All proofs for the following statements are given in the Supplementary Material.\nProposition 1. The variance of the stochastic gradient in (11) can be written as\n\n\u2211_{xi, xi'} \u2207_{v_i} q_{v_i}(xi) \u2207_{v_i} q_{v_i}(xi') Cov(xi, xi').  (14)\n\nThis gives us some intuition about when we can expect the variance of the estimate to be small. Two simple cases are: i) when the covariance function Cov(xi, xi') takes small values, which can occur when q(x_{\\i}) has low entropy, or ii) when Cov(xi, xi') is approximately constant. In fact, when Cov(xi, xi') is exactly constant the variance is zero (so that the stochastic gradient is exact), as the following proposition states.\nProposition 2. If Cov(xi, xi') = c for all xi and xi', then the variance in (14) is equal to zero.\nA case for which the condition Cov(xi, xi') = c holds exactly is when the function f(x) factorizes as f(x_{\\i}, xi) = fi(xi) + f_{\\i}(x_{\\i}) (see Supplementary Material for a proof). Such a factorization essentially implies that xi is independent of the remaining random variables, which makes the local expectation gradient exact. In contrast, to obtain exactness using the standard Monte Carlo stochastic gradient from eq. (5) (or any of its improvements that apply variance reduction) we would typically need to draw an infinite number of samples.\nTo further analyze local expectation gradients we can contrast them with stochastic gradients obtained by the reparametrization trick [8, 15, 20]. Suppose that we can reparametrize the random variable xi \u223c q_{v_i}(xi) according to xi = g(vi, zi), where zi \u223c qi(zi) and qi(zi) is a suitable base distribution. We further assume that the function f(x_{\\i}, xi) is differentiable with respect to xi and g(vi, zi) is differentiable with respect to vi. 
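Propositions 1 and 2 can be illustrated empirically. In the toy setting below (an illustrative assumption, not from the paper) the target f is additive, so the local expectation gradient for v_1 has exactly zero variance, while the plain log-derivative estimator of the same gradient does not.

```python
import numpy as np

# Empirical check: for additive f(x) = sum_i x_i and factorized Bernoulli q
# with means sigmoid(v_i), the local expectation estimator of the gradient
# w.r.t. v_1 is exact (zero variance); the log-derivative estimator is noisy.

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.default_rng(3)
v = np.array([0.3, -0.5, 0.8])
p = sigmoid(v)

ld, le = [], []
for _ in range(2000):
    x = (rng.random(3) < p).astype(float)
    f = x.sum()
    # log-derivative estimator: f(x) * d log q(x_1)/d v_1 = f(x) * (x_1 - p_1)
    ld.append(f * (x[0] - p[0]))
    # local expectation estimator: p'(v_1) * (f(x_{\1}, 1) - f(x_{\1}, 0))
    x1, x0 = x.copy(), x.copy()
    x1[0], x0[0] = 1.0, 0.0
    le.append(p[0] * (1 - p[0]) * (x1.sum() - x0.sum()))

print(np.var(ld), np.var(le))   # the second variance is (numerically) zero
```

Here f(x_{\1}, 1) \u2212 f(x_{\1}, 0) = 1 regardless of the pivot sample, so every local expectation draw equals the exact gradient \u03c3'(v_1), exactly as Proposition 2 predicts.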
Then, the exact gradient with respect to the variational parameter vi can be reparametrized as\n\n\u2207_{v_i} ( \u222b q(x_{\\i}) q_{v_i}(xi) f(x_{\\i}, xi) dx ) = \u222b q(x_{\\i}) qi(zi) \u2207_{v_i} f(x_{\\i}, g(vi, zi)) dx_{\\i} dzi,  (15)\n\nwhile a single-sample stochastic estimate that follows from this expression is\n\n\u2207_{v_i} f(x_{\\i}, g(vi, zi)),  x_{\\i} \u223c q(x_{\\i}), zi \u223c qi(zi).  (16)\n\nThe following statement gives a clear understanding of how this estimate compares with the corresponding local expectation gradient.\nProposition 3. Given that we can reparametrize xi as described above (and all differentiability conditions mentioned above hold), the gradient from (11) can be equivalently written as\n\n\u222b qi(zi) \u2207_{v_i} f(x_{\\i}, g(vi, zi)) dzi,  x_{\\i} \u223c q(x_{\\i}).  (17)\n\nClearly, the above expression is an expectation of the reparametrization gradient from eq. (16), and therefore by the standard Rao-Blackwellization argument the variance of the local expectation gradient is always lower than or equal to the variance of a single-sample estimate based on the reparametrization method. Notice that the reparametrization method is applicable only to continuous random variables and differentiable functions f(x). For such cases, however, reparametrization can be computationally more efficient than local expectation gradients, since the latter approach requires 1-D numerical integration to estimate the integral in (11) or the integral in (17),\u00b2 which can be computationally more expensive.\n\n\u00b2The exact value of the two integrals is the same. However, approximations of these two integrals based on numerical integration will typically not give the same value.\n\n3.2 Connection with Gibbs sampling\n\nThere are interesting similarities between local expectation gradients and Gibbs sampling. Firstly, notice that carrying out Gibbs sampling in the variational distribution in eq. (7) requires iteratively sampling from each conditional q(xi|mbi), for i = 1, . . . , n, and clearly the same conditional appears also in local expectation gradients, with the obvious difference that instead of sampling from q(xi|mbi) we now average under this distribution. Of course, in practice we never perform Gibbs sampling on a variational distribution but rather on the true posterior distribution, which is proportional to e^{f(x)} (where we assume that \u2212log q_v(x) is not part of f(x)). Specifically, at each Gibbs step we simulate a new value for some xi from the posterior conditional distribution, which is proportional to e^{f(x^{(t)}_{\\i}, xi)}, where x^{(t)}_{\\i} are the fixed values of the remaining random variables. An update in local expectation gradients is quite similar, because we also condition on some fixed remaining values x^{(t)}_{\\i} in order to update the parameter vi towards the direction where q(xi|mb^{(t)}_i) gets closer to the corresponding true posterior conditional distribution. Despite these similarities, there is a crucial computational difference between the two procedures. While in local expectation gradients it is perfectly valid to perform all updates of the variational parameters in parallel, given the pivot sample x^{(t)}, in Gibbs sampling all updates need to be executed in a serial manner. 
This difference is essentially a consequence of the fundamental difference between variational inference and Gibbs sampling: the former relies on optimization while the latter relies on convergence of a Markov chain.\n\n4 Experiments\n\nIn this section we apply local expectation gradients (LeGrad) to two different types of stochastic variational inference problems and compare it against the standard stochastic gradient based on the log derivative trick (LdGrad), which also incorporates variance reduction,\u00b3 as well as the reparametrization-based gradient (ReGrad) given by eq. (6). In Section 4.1 we consider a two-class classification problem using two digits from the MNIST database and approximate a Bayesian logistic regression model using stochastic variational inference. Then, in Section 4.2 we consider sigmoid belief networks [11] and fit them to the binarized version of the MNIST digits.\n\n4.1 Bayesian logistic regression\n\nIn this section we compare the three approaches in a challenging binary classification problem using Bayesian logistic regression. Specifically, given a dataset D \u2261 {zj, yj}_{j=1}^{m}, where zj \u2208 R^n is the input and yj \u2208 {\u22121, +1} the class label, we model the joint distribution over the observed labels and the parameters w by p(y, w) = (\u220f_{j=1}^{m} \u03c3(yj zj\u22a4 w)) p(w), where \u03c3(a) is the sigmoid function and p(w) denotes a zero-mean Gaussian prior on the weights w. We wish to apply the three algorithms in order to approximate the posterior over the regression parameters by a factorized variational Gaussian distribution of the form q_v(w) = \u220f_{i=1}^{n} N(wi|\u00b5i, \u2113i\u00b2). 
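Since each variational factor is a 1-D Gaussian, the local expectations can be computed with Gauss-Hermite quadrature. A minimal sketch follows; the target g is an illustrative polynomial (an assumption, not the actual model term) chosen so the exact answer is known.

```python
import numpy as np

# E_{N(w | mu, ell^2)}[g(w)] via K-node Gauss-Hermite quadrature, one factor
# q(w_i) = N(mu_i, ell_i^2) at a time.

def gauss_hermite_expectation(g, mu, ell, K=5):
    nodes, weights = np.polynomial.hermite.hermgauss(K)
    # change of variables w = mu + sqrt(2) * ell * t for the N(mu, ell^2) density
    w = mu + np.sqrt(2.0) * ell * nodes
    return np.sum(weights * g(w)) / np.sqrt(np.pi)

mu, ell = 0.3, 0.7
approx = gauss_hermite_expectation(lambda w: w ** 2, mu, ell)
exact = mu ** 2 + ell ** 2          # E[w^2] under N(mu, ell^2)
print(approx, exact)                # K = 5 is exact for polynomials up to degree 9
```

In practice g(w) would be the (non-polynomial) log-likelihood term times the score of the factor, so the K-node rule gives an approximation rather than an exact value.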
In the following we consider a subset of the MNIST dataset that includes all 12660 training examples from the digit classes 2 and 7, each with 784 pixels, so that by including the bias the number of weights is n = 785. To obtain the local expectation gradient for each (\u00b5i, \u2113i) we need to apply 1-D numerical integration. We used the quadrature rule having K = 5 nodes,\u2074 so that LeGrad was using S = 785 \u00d7 5 function evaluations per gradient estimation. For LdGrad we also set the number of samples to S = 785 \u00d7 5, so that LeGrad and LdGrad match exactly in the number of function evaluations and roughly in computational cost. When using the ReGrad approach based on (6) we construct the stochastic gradient using K = 5 target function gradient samples. This matches the computational cost, but ReGrad still has the unmatched advantage of having access to the gradient of the target function.\nThe variance of the stochastic gradient for parameter \u00b51 is shown in Figure 1(a)-(b). It is much smaller for LeGrad than for LdGrad, despite the two methods having almost the same computational cost and using the same amount of information about the target function. The evolution of the bound in Figure 1(c) clearly shows the advantage of using less noisy gradients. LdGrad will need a huge number of iterations to find the global optimum, despite having optimized the step size of its stochastic updates.\n\n4.2 Sigmoid belief networks\n\nIn the second example we consider sigmoid belief networks (SBNs) [11] and i) compare our approach with LdGrad in terms of variance and optimization efficiency and then ii) perform density estimation experiments by training sigmoid belief nets with fully connected hidden units using LeGrad. Note that ReGrad cannot be used on discrete models.\n\n\u00b3As discussed in [19], there are multiple unbiased sample-based estimators of (4), and using (5) directly tends to have a large variance. We use instead the estimator given by eq. 
(8) in [19]. Though other estimators with even lower variance exist, we restrict ourselves to those with the same scalability as the proposed LeGrad, requiring at most O(S|v|) computation per gradient estimation.\n\n\u2074Gaussian quadrature with K grid points integrates exactly polynomials of degree up to 2K \u2212 1.\n\nFigure 1: (a) Variance of the gradient for the variational parameter \u00b51 for LeGrad (red line) and ReGrad (blue line). (b) Variance of the gradient for the variational parameter \u00b51 for LdGrad (green line). (c) Evolution of the stochastic value of the lower bound.\n\nFor the variance reduction comparison we consider a network with an unstructured hidden layer where binary observed vectors yi \u2208 {0, 1}^D are generated independently according to\n\np(y|W) = \u2211_{x} \u220f_{d=1}^{D} [\u03c3(w_d\u22a4 x)]^{yd} [1 \u2212 \u03c3(w_d\u22a4 x)]^{1\u2212yd} p(x),  (18)\n\nwhere x \u2208 {0, 1}^K is a vector of hidden variables that follows a uniform distribution. The matrix W (which includes bias terms) contains the parameters to be estimated by fitting the model to the data. In theory we could use the EM algorithm to learn the parameters W; however, such an approach is not feasible because at the E step we need to compute the posterior distribution p(xi|yi, W) over each hidden variable, which clearly is intractable since each xi takes 2^K values. Therefore, we need to apply approximate inference, and next we consider stochastic variational inference using the local expectation gradients algorithm and compare this with the method in [19] eq. 
(8), which has the same scalability properties and which we have been denoting as LdGrad.\nMore precisely, we shall consider a variational distribution that consists of a recognition model [4, 2, 10, 8, 15] parametrized by a \u201creverse\u201d sigmoid network that predicts the latent vector xi from the associated observation yi: qV(xi) = \u220f_{k=1}^{K} [\u03c3(v_k\u22a4 yi)]^{xik} [1 \u2212 \u03c3(v_k\u22a4 yi)]^{1\u2212xik}. The variational parameters are contained in the matrix V (including the bias terms). The application of stochastic variational inference boils down to constructing a separate lower bound for each pair (yi, xi), so that the full bound is the sum of these individual terms (see Supplementary Material for explicit expressions). The maximization of the bound then proceeds by performing stochastic gradient updates for the model weights W and the variational parameters V. The update for W reduces to a logistic regression type of update, based upon drawing a single sample from the full variational distribution. On the other hand, obtaining effective and low variance stochastic gradients for the variational parameters V is considered a highly challenging task, and current advanced methods are based on control variates that employ neural networks as auxiliary models [10]. In contrast, the local expectation gradient for each variational parameter vk only requires evaluating\n\n\u2207_{v_k} F = \u2211_{i=1}^{n} \u2207_{v_k} Fi = \u2211_{i=1}^{n} \u03c3ik(1 \u2212 \u03c3ik) [ \u2211_{d=1}^{D} ( log(1 + e^{\u2212\u1ef9id w_d\u22a4 (x^{(t)}_{i\\k}, xik=0)}) \u2212 log(1 + e^{\u2212\u1ef9id w_d\u22a4 (x^{(t)}_{i\\k}, xik=1)}) ) + log((1 \u2212 \u03c3ik)/\u03c3ik) ] yi,  (19)\n\nwhere \u03c3ik = \u03c3(v_k\u22a4 yi) and \u1ef9id is the {\u22121, 1} encoding of yid. This expression is a weighted sum across data terms, where each term is a difference induced by the directions xik = 1 and xik = 0 for all hidden units {xik}_{i=1}^{n} associated with the variational factors that depend on vk.\nBased on the above model, we compare the performance of LeGrad and LdGrad when simultaneously optimizing V and W for a small set of 100 random binarized MNIST digits [17]. 
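The per-parameter computation has the generic form \u03c3ik(1 \u2212 \u03c3ik) \u00b7 [f(xik = 1) \u2212 f(xik = 0)] \u00b7 yi. The sketch below illustrates it for the data term of the bound only (toy sizes, random weights, entropy contribution omitted; all names are illustrative assumptions) and checks the result against finite differences of the exact expectation over x_ik.

```python
import numpy as np

# Local expectation gradient for one recognition parameter vector v_k of an
# SBN-like model: hold the other hidden units fixed at a pivot sample and take
# the exact expectation over x_k in {0, 1}.

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(4)
D, K = 6, 4                      # observed / hidden dimensions (toy)
W = rng.normal(size=(D, K))      # model weights
v_k = rng.normal(size=D)         # recognition weights for hidden unit k = 0
y = rng.integers(0, 2, size=D).astype(float)
y_pm = 2 * y - 1                 # {-1, 1} encoding of y
x = rng.integers(0, 2, size=K).astype(float)   # pivot sample for x_{i\k}

def bound_term(x_k):             # data term of the bound as a function of x_k
    xk = x.copy(); xk[0] = x_k
    return np.sum(np.log(sigmoid(y_pm * (W @ xk))))

s = sigmoid(v_k @ y)
grad = s * (1 - s) * (bound_term(1.0) - bound_term(0.0)) * y

# finite-difference check of d/dv_k E_{q(x_k)}[data term]
def expected_bound(v):
    q1 = sigmoid(v @ y)
    return q1 * bound_term(1.0) + (1 - q1) * bound_term(0.0)

eps = 1e-6
fd = np.array([(expected_bound(v_k + eps * e) - expected_bound(v_k - eps * e)) / (2 * eps)
               for e in np.eye(D)])
print(np.max(np.abs(grad - fd)))   # small
```

The check works because, with the other units fixed, the expectation over x_k is just a two-point mixture whose exact derivative with respect to v_k is \u03c3'(v_k\u22a4 y) \u00b7 (difference of the two bound values) \u00b7 y.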
The evolution of the instantaneous bound for H = 40 hidden units can be seen in Figure 2(a), where once again LeGrad shows superior performance and increased stability.\nIn the second series of experiments we consider a more complex sigmoid belief network where the prior p(x) over the hidden units becomes a fully connected distribution parametrized by an additional set of K(K + 1)/2 model weights (see Supplementary Material). Such a model can better capture the dependence structure of the hidden units and provide a good density estimator for high dimensional data. We trained this model using the 5 \u00d7 10\u2074 training examples of the binarized MNIST dataset, using mini-batches of size 100 and different numbers of hidden units: H = 200, 300, 500.\n\nTable 1: NLL scores on the test data for the binarized MNIST dataset. The left part of the table shows results for sigmoid belief nets (SBN) constructed and trained based on the approach from [10], denoted as NVIL, or by using the LeGrad algorithm. The right part of the table gives the performance of alternative state of the art models (reported in Table 1 in [10]).\n\nSBN     Dim          Test NLL      |  Model      Dim   Test NLL\nNVIL    200-200      99.8          |  FDARN      400   96.3\nNVIL    200-200-200  96.7          |  NADE       500   88.9\nNVIL    200-200-500  97.0          |  DARN       400   93.0\nLeGrad  200          96.0          |  RBM(CD3)   500   105.5\nLeGrad  300          95.1          |  RBM(CD25)  500   86.3\nLeGrad  500          94.9          |  MOB        500   137.6\n\n
Table 1 provides negative log likelihood (NLL) scores for LeGrad and several other methods reported in [10]. Notice that for LeGrad the NLLs are essentially variational upper bounds of the exact NLLs, obtained by Monte Carlo approximation of the variational bound (an estimate also considered in [10]). From Table 1 we can observe that LeGrad outperforms the advanced NVIL technique proposed in [10]. Finally, Figures 2(b) and 2(c) display model weights and a few examples of generated digits, respectively, after training the model with H = 200 units.\n\nFigure 2: (a) LeGrad (red) and LdGrad (green) convergence for the SBN model on a single mini-batch of 100 MNIST digits. (b) Weights W (filters) learned by LeGrad when training an SBN with H = 200 units on the full MNIST training set. (c) New digits generated from the trained model.\n\n5 Discussion\n\nLocal expectation gradients is a generic black box stochastic optimization algorithm that can be used to maximize objective functions of the form E_{q_v(x)}[f(x)], a problem that arises in variational inference. The idea behind this algorithm is to exploit the conditional independence structure of the variational distribution q_v(x). The algorithm is most closely related to stochastic optimization schemes that make use of the log derivative trick, which was invented in reinforcement learning [3, 21, 13] and has recently been used for variational inference [12, 14, 10]. The approaches in [12, 14, 10] can be thought of as following a global sampling strategy, where multiple samples are drawn from q_v(x) and variance reduction is then built in a posteriori, in a subsequent stage, through the use of control variates. 
In contrast, local expectation gradients reduce variance by directly changing the sampling strategy: instead of working with a global set of samples drawn from q_v(x), the strategy is to exactly marginalize out the random variable that has the largest influence on a specific gradient of interest, while using a single sample for the remaining random variables.

We believe that local expectation gradients can be applied to a wide range of stochastic optimization problems that arise in variational inference and in other domains. Here, we have demonstrated its use for variational inference in logistic regression and sigmoid belief networks.

References

[1] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.

[2] Jörg Bornschein and Yoshua Bengio. Reweighted wake-sleep. CoRR, 2014.

[3] Peter W. Glynn. Likelihood ratio gradient estimation for stochastic systems. Commun. ACM, 33(10):75–84, October 1990.

[4] Geoffrey Hinton, Peter Dayan, Brendan J. Frey, and Radford M. Neal. The wake-sleep algorithm for unsupervised neural networks. Science, 268(5214):1158–1161, 1995.

[5] Matthew D. Hoffman, David M. Blei, and Francis R. Bach. Online learning for latent Dirichlet allocation. In NIPS, pages 856–864, 2010.

[6] Matthew D. Hoffman, David M. Blei, Chong Wang, and John William Paisley. Stochastic variational inference. Journal of Machine Learning Research, 14(1):1303–1347, 2013.

[7] Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola, and Lawrence K. Saul. An introduction to variational methods for graphical models. Mach. Learn., 37(2):183–233, November 1999.

[8] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

[9] A. Kucukelbir, R. Ranganath, A.
Gelman, and D. M. Blei. Automatic variational inference in Stan. In Advances in Neural Information Processing Systems 28, 2015.

[10] Andriy Mnih and Karol Gregor. Neural variational inference and learning in belief networks. In The 31st International Conference on Machine Learning (ICML 2014), 2014.

[11] Radford M. Neal. Connectionist learning of belief networks. Artif. Intell., 56(1):71–113, July 1992.

[12] John William Paisley, David M. Blei, and Michael I. Jordan. Variational Bayesian inference with stochastic search. In ICML, 2012.

[13] J. Peters and S. Schaal. Policy gradient methods for robotics. In Proceedings of the IEEE International Conference on Intelligent Robotics Systems (IROS 2006), 2006.

[14] Rajesh Ranganath, Sean Gerrish, and David Blei. Black box variational inference. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics (AISTATS), pages 814–822, 2014.

[15] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In The 31st International Conference on Machine Learning (ICML 2014), 2014.

[16] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.

[17] Ruslan Salakhutdinov and Iain Murray. On the quantitative analysis of deep belief networks. In Andrew McCallum and Sam Roweis, editors, Proceedings of the 25th Annual International Conference on Machine Learning (ICML 2008), pages 872–879. Omnipress, 2008.

[18] Tim Salimans and David A. Knowles. Fixed-form variational posterior approximation through stochastic linear regression. Bayesian Anal., 8(4):837–882, 2013.

[19] Tim Salimans and David A. Knowles. On using control variates with stochastic approximation for variational Bayes and its connection to stochastic linear regression, January 2014.

[20] Michalis K.
Titsias and Miguel Lázaro-Gredilla. Doubly stochastic variational Bayes for non-conjugate inference. In The 31st International Conference on Machine Learning (ICML 2014), 2014.

[21] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn., 8(3-4):229–256, May 1992.