{"title": "Provable Gradient Variance Guarantees for Black-Box Variational Inference", "book": "Advances in Neural Information Processing Systems", "page_first": 329, "page_last": 338, "abstract": "Recent variational inference methods use stochastic gradient estimators whose variance is not well understood. Theoretical guarantees for these estimators are important to understand when these methods will or will not work. This paper gives bounds for the common \u201creparameterization\u201d estimators when the target is smooth and the variational family is a location-scale distribution. These bounds are unimprovable and thus provide the best possible guarantees under the stated assumptions.", "full_text": "Provable Gradient Variance Guarantees for\n\nBlack-Box Variational Inference\n\nJustin Domke\n\nCollege of Information and Computer Sciences\n\nUniversity of Massachusetts Amherst\n\ndomke@cs.umass.edu\n\nAbstract\n\nRecent variational inference methods use stochastic gradient estimators whose\nvariance is not well understood. Theoretical guarantees for these estimators are\nimportant to understand when these methods will or will not work. This paper\ngives bounds for the common \u201creparameterization\u201d estimators when the target is\nsmooth and the variational family is a location-scale distribution. These bounds\nare unimprovable and thus provide the best possible guarantees under the stated\nassumptions.\n\n1\n\nIntroduction\n\nTake a distribution p(z, x) representing relationships between data x and latent variables z. 
After\nobserving x, one might wish to approximate the marginal probability p(x) or the posterior p(z|x).\nVariational inference (VI) is based on the simple observation that for any distribution q(z),\n\n$$\\log p(x) = \\underbrace{\\mathbb{E}_{z \\sim q} \\log \\frac{p(z, x)}{q(z)}}_{\\mathrm{ELBO}(q)} + \\mathrm{KL}\\left(q(z) \\,\\|\\, p(z|x)\\right). \\tag{1}$$\n\nVI algorithms typically choose an approximating family $q_w$ and maximize $\\mathrm{ELBO}(q_w)$ over w.\nSince log p(x) is fixed, this simultaneously tightens a lower-bound on log p(x) and minimizes the\ndivergence from $q_w(z)$ to the posterior p(z|x).\nTraditional VI algorithms suppose p and $q_w$ are simple enough for certain expectations to have closed\nforms, leading to deterministic coordinate-ascent type algorithms [6, 1, 20]. Recent work has turned\ntowards stochastic optimization. There are two motivations for this. First, stochastic data subsampling\ncan give computational savings [7]. Second, more complex distributions can be addressed if p is\ntreated as a \u201cblack box\u201d, with no expectations available [9, 15, 19]. In both cases, one can still\nestimate a stochastic gradient of the ELBO [17] and thus use stochastic gradient optimization. It is\npossible to address very complex and large-scale problems using this strategy [10].\nThese improvements in scale and generality come at a cost: Stochastic optimization is typically less\nreliable than deterministic coordinate ascent. Convergence is often a challenge, and methods typically\nuse heuristics for parameters like step-sizes. Failures do frequently occur in practice [22, 11, 4].\nTo help understand when black-box VI can be expected to work, this paper investigates the variance\nof gradient estimates. This is a major issue in practice, and many ideas have been proposed to attempt\nto reduce the variance [8, 5, 12, 2, 18, 13, 14, 16]. Despite all this, few rigorous guarantees on the\nvariance of gradient estimators seem to be known (Sec. 
5.1).\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n1.1 Contributions\nThis paper studies \u201creparameterization\u201d (RP) or \u201cpath\u201d based gradient estimators when $q_w$ is in a\nmultivariate location-scale family. We decompose $\\mathrm{ELBO}(q_w) = l(w) + h(w)$ where h(w) is the\nentropy of $q_w$ (known in closed-form) and $l(w) = \\mathbb{E}_{z \\sim q_w} \\log p(z, x)$. The key assumption is that\nlog p(z, x) is (Lipschitz) smooth as a function of z, meaning that $\\nabla_z \\log p(z, x)$ can\u2019t change too\nquickly as z changes. Formally, f(z) is M-smooth if $\\|\\nabla f(z) - \\nabla f(z')\\|_2 \\le M \\|z - z'\\|_2$.\nBound for smooth target distributions: If g is the RP gradient estimator of $\\nabla l(w)$ and log p is\nM-smooth, then $\\mathbb{E}\\|g\\|^2$ is bounded by a quadratic function of w (Thm. 3). With a small\nrelaxation, this is $\\mathbb{E}\\|g\\|^2 \\le a M^2 \\|w - \\bar{w}\\|^2$ (Eq. (3)) where $\\bar{w}$ are fixed parameters and a\nis determined by the location-scale family.\nGeneralized bound: We extend this result to consider a more general notion of \u201cmatrix\u201d smoothness\n(Thm. 5) reflecting that the sensitivity of $\\nabla_z \\log p(z, x)$ to changes in z may depend on the\ndirection of change.\nData Subsampling: We again extend this result to consider data subsampling (Thm. 6). In particular,\nwe observe that non-uniform subsampling gives tighter bounds.\n\nIn all cases, we show that the bounds are unimprovable. We experimentally compare these bounds to\nempirical variance.\n\n2 Setup\n\nGiven some \u201cblack box\u201d function f, this paper studies estimating gradients of functions l of the form\n$l(w) = \\mathbb{E}_{z \\sim q_w} f(z)$. Now, suppose some base distribution s and mapping $T_w$ are known such that if\n$u \\sim s$, then $T_w(u) \\sim q_w$. 
Then, l can be written as\n\n$$l(w) = \\mathbb{E}_{u \\sim s} f(T_w(u)).$$\n\nIf we define $g = \\nabla_w f(T_w(u))$, then g is an unbiased estimate of $\\nabla l$, i.e. $\\mathbb{E}\\, g = \\nabla l(w)$. The same\nidea can be used when f is composed as a finite sum as $f(z) = \\sum_{n=1}^N f_n(z)$. 
If N is large, even\nevaluating f once might be expensive. However, take any positive distribution $\\pi$ over $n \\in \\{1, \\dots, N\\}$\nand sample $n \\sim \\pi$ independently of u. Then, if we define $g = \\nabla_w \\pi(n)^{-1} f_n(T_w(u))$, this is again an\nunbiased estimator with $\\mathbb{E}\\, g = \\nabla l(w)$.\nConvergence rates in stochastic optimization depend on the variability of the gradient estimator,\ntypically either via the expected squared norm (ESN) $\\mathbb{E}\\|g\\|_2^2$ or the trace of the variance $\\operatorname{tr} \\mathbb{V} g$. These\nare closely related, since $\\mathbb{E}\\|g\\|_2^2 = \\operatorname{tr} \\mathbb{V} g + \\|\\mathbb{E}\\, g\\|_2^2$.\nThe goal of this paper is to bound the variability of g for reparameterization / path estimators of g.\nThis requires making assumptions about (i) the transformation function $T_w$ and base distribution s\n(which determine $q_w$) and (ii) the target function f.\nHere, we are interested in the case of affine mappings. We use the mapping [17]\n\n$$T_w(u) = Cu + m,$$\n\nwhere w = (m, C) is a single vector of all parameters. This is the most common mapping used\nto represent location-scale families. That is, if $u \\sim s$ then $T_w(u)$ is equal in distribution to a\nlocation-scale family distribution. For example, if $s = \\mathcal{N}(0, I)$ then $T_w(u)$ is equal in distribution to\n$\\mathcal{N}(m, CC^\\top)$.\nWe will refer to the base distribution as standardized if the components of $u = (u_1, \\dots, u_d) \\sim s$ are\niid with $\\mathbb{E}\\, u_1 = \\mathbb{E}\\, u_1^3 = 0$ and $\\mathbb{V}\\, u_1 = 1$. The bounds will depend on the fourth moment $\\kappa = \\mathbb{E}[u_1^4]$,\nbut are otherwise independent of s.\nTo apply these estimators to VI, choose f(z) = log p(z, x). Then ELBO(w) = l(w) + h(w)\nwhere h is the entropy of $q_w$. Stochastic estimates of the gradient of l can be employed in a\nstochastic gradient method to maximize the ELBO. 
To model the stochastic setting, suppose that\n$X = (x_1, \\dots, x_N)$ are iid and $p(z, X) = p(z) \\prod_{n=1}^N p(x_n|z)$. Then, one may choose, e.g. $f_n(z) = \\frac{1}{N} \\log p(z) + \\log p(x_n|z)$. 
The entropy h is related to the (constant) entropy of the base distribution\nas h(w) = Entropy(s) + log |C|.\nThe main bounds of this paper concern estimators for the gradient of l(w) alone, disregarding h(w).\nThere are two reasons for this. First, in location-scale families, the exact gradient of h(w) is known.\nSecond, if one uses a stochastic estimator for h(w), this can be \u201cabsorbed\u201d into l(w) to some degree.\nThis is discussed further in Sec. 5.\n\n3 Variance Bounds\n\n3.1 Technical Lemmas\nWe begin with two technical lemmas which will do most of the work in the main results. Both have\n(somewhat laborious) proofs in Sec. 7 (Appendix). The first lemma relates the norm of the parameter\ngradient of $f(T_w(u))$ (with respect to w) to the norm of the gradient of f(z) itself, evaluated at\n$z = T_w(u)$.\nLemma 1. For any w and u,\n\n$$\\|\\nabla_w f(T_w(u))\\|_2^2 = \\|\\nabla f(T_w(u))\\|_2^2 \\left(1 + \\|u\\|_2^2\\right).$$\n\nThe proof is tedious but essentially amounts to calculating the derivative with respect to each\ncomponent of w (i.e. entries $m_i$ and $C_{ij}$), summing the square of all entries, and simplifying. The\nsecond lemma gives a closed-form for the expectation of a closely related expression that will appear\nin the proof of Thm. 3 as a consequence of applying Lem. 1.\nLemma 2. Let $u \\sim s$ for s standardized with $u \\in \\mathbb{R}^d$ and $\\mathbb{E}_{u \\sim s} u_i^4 = \\kappa$. Then for any $\\bar{z}$,\n\n$$\\mathbb{E}\\, \\|T_w(u) - \\bar{z}\\|_2^2 \\left(1 + \\|u\\|_2^2\\right) = (d + 1)\\|m - \\bar{z}\\|_2^2 + (d + \\kappa)\\|C\\|_F^2.$$\n\nAgain, the proof is tedious but based on simple ideas: Substitute the definition of $T_w$ into the left-hand\nside and expand all terms. This gives terms between zeroth and fourth order (in u). 
Calculating the\nexact expectation of each and simplifying using the assumption that s is standardized gives the result.\n\n3.2 Basic Variance Bound\nGiven these two lemmas, we give our major technical result, bounding the variability of a\nreparameterization-based gradient estimator. 
This will later be extended to consider data subsampling and a generalized notion of smoothness. Note that we do not require that f be convex.\nTheorem 3. Suppose f is M-smooth, $\\bar{z}$ is a stationary point of f, and s is standardized with $u \\in \\mathbb{R}^d$\nand $\\mathbb{E}\\, u_i^4 = \\kappa$. Let $g = \\nabla_w f(T_w(u))$ for $u \\sim s$. Then,\n\n$$\\mathbb{E}\\|g\\|_2^2 \\le M^2 \\left((d + 1)\\|m - \\bar{z}\\|_2^2 + (d + \\kappa)\\|C\\|_F^2\\right). \\tag{2}$$\n\nMoreover, this result is unimprovable without further assumptions.\n\nProof. We expand the definition of g, and use the above lemmas and the smoothness of f.\n\n$$\\begin{aligned}\n\\mathbb{E}\\|g\\|_2^2 &= \\mathbb{E}\\|\\nabla_w f(T_w(u))\\|_2^2 && \\text{(Definition of $g$)} \\\\\n&= \\mathbb{E}\\|\\nabla f(T_w(u))\\|_2^2 \\left(1 + \\|u\\|_2^2\\right) && \\text{(Lem. 1)} \\\\\n&= \\mathbb{E}\\|\\nabla f(T_w(u)) - \\nabla f(\\bar{z})\\|_2^2 \\left(1 + \\|u\\|_2^2\\right) && \\text{($\\nabla f(\\bar{z}) = 0$)} \\\\\n&\\le \\mathbb{E}\\, M^2 \\|T_w(u) - \\bar{z}\\|_2^2 \\left(1 + \\|u\\|_2^2\\right) && \\text{($f$ is smooth)} \\\\\n&= M^2 \\left((d + 1)\\|m - \\bar{z}\\|_2^2 + (d + \\kappa)\\|C\\|_F^2\\right) && \\text{(Lem. 2)}\n\\end{aligned}$$\n\nTo see that this is unimprovable without further assumptions, observe that the only inequality is using\nthe smoothness on f to bound the norm of the difference of gradients at $T_w(u)$ and at $\\bar{z}$. But for\n$f(z) = \\frac{M}{2}\\|z - \\bar{z}\\|_2^2$ this inequality is tight. Thus, for any M and $\\bar{z}$, there is a function f satisfying\nthe assumptions of the theorem such that Eq. (2) is an equality.\n\nWith a small amount of additional looseness, we can cast Eq. (2) into a more intuitive form. Define\n$\\bar{w} = (\\bar{z}, 0_{d \\times d})$, where $0_{d \\times d}$ is a $d \\times d$ matrix of zeros. 
Then, $\\|w - \\bar{w}\\|_2^2 = \\|m - \\bar{z}\\|_2^2 + \\|C\\|_F^2$, so\nwe can slightly relax Eq. (2) to the more user-friendly form of\n\n$$\\mathbb{E}\\|g\\|_2^2 \\le (d + \\kappa) M^2 \\|w - \\bar{w}\\|_2^2. \\tag{3}$$\n\nThe only additional looseness is bounding $d + 1 \\le d + \\kappa$. This is justified since when s is standardized,\n$\\kappa = \\mathbb{E}\\, u_i^4$ is the kurtosis, which is at least one. Here, $\\kappa$ is determined by s and does not depend on the\ndimensionality. For example, if s is Gaussian, $\\kappa = 3$. Thus, Eq. 
(3) will typically not be much looser\nthan Eq. (2).\nIntuitively, $\\bar{w}$ are parameters that concentrate q entirely at a stationary point of f. It is not hard to\nshow that $\\|w - \\bar{w}\\|_2^2 = \\mathbb{E}_{z \\sim q_w} \\|z - \\bar{z}\\|_2^2$. Thus, Eq. (3) intuitively says that $\\mathbb{E}\\|g\\|^2$ is bounded in\nterms of how far the average point sampled from $q_w$ is from $\\bar{z}$. Since f need not be convex, there\nmight be multiple stationary points. In this case, Thm. 3 holds simultaneously for all of them.\n\n3.3 Generalized Smoothness\n\nSince the above bound is not improvable, tightening it requires stronger assumptions. The tightness\nof Thm. 3 is determined by the smoothness condition that the difference of gradients at two points\nis bounded as $\\|\\nabla f(y) - \\nabla f(z)\\|_2 \\le M \\|y - z\\|_2$. For some problems, f may be much smoother\nin certain directions than others. In such cases, the smoothness constant M will need to reflect the\nworst-case direction. To produce a tighter bound for such situations, we generalize the notion of\nsmoothness to allow M to be a symmetric matrix.\nDefinition 4. f is M-matrix-smooth if $\\|\\nabla f(y) - \\nabla f(z)\\|_2 \\le \\|M(y - z)\\|_2$ (for symmetric M).\nWe can generalize the result in Thm. 3 to functions with this matrix-smoothness condition. The proof\nis very similar. The main difference is that after applying the smoothness condition, the matrix M\nneeds to be \u201cabsorbed\u201d into the parameters w = (m, C) before applying Lem. 2.\nTheorem 5. Suppose f is M-matrix-smooth, $\\bar{z}$ is a stationary point of f, and s is standardized with\n$u \\in \\mathbb{R}^d$ and $\\mathbb{E}\\, u_i^4 = \\kappa$. Let $g = \\nabla_w f(T_w(u))$ for $u \\sim s$. Then,\n\n$$\\mathbb{E}\\|g\\|_2^2 \\le (d + 1)\\|M(m - \\bar{z})\\|_2^2 + (d + \\kappa)\\|MC\\|_F^2. \\tag{4}$$\n\nMoreover, this result is unimprovable without further assumptions.\n\nProof. 
The proof closely mirrors that of Thm. 3. 
Here, given w = (m, C), we define v = (Mm, MC) to be w with M \u201cabsorbed\u201d into the parameters, so that $M(T_w(u) - \\bar{z}) = T_v(u) - M\\bar{z}$.\n\n$$\\begin{aligned}\n\\mathbb{E}\\|g\\|_2^2 &= \\mathbb{E}\\|\\nabla_w f(T_w(u))\\|_2^2 && \\text{(Definition of $g$)} \\\\\n&= \\mathbb{E}\\|\\nabla f(T_w(u))\\|_2^2 \\left(1 + \\|u\\|_2^2\\right) && \\text{(Lem. 1)} \\\\\n&= \\mathbb{E}\\|\\nabla f(T_w(u)) - \\nabla f(\\bar{z})\\|_2^2 \\left(1 + \\|u\\|_2^2\\right) && \\text{($\\nabla f(\\bar{z}) = 0$)} \\\\\n&\\le \\mathbb{E}\\|M(T_w(u) - \\bar{z})\\|_2^2 \\left(1 + \\|u\\|_2^2\\right) && \\text{($f$ is smooth)} \\\\\n&= \\mathbb{E}\\|T_v(u) - M\\bar{z}\\|_2^2 \\left(1 + \\|u\\|_2^2\\right) && \\text{(Absorb $M$ into $v$)} \\\\\n&= (d + 1)\\|Mm - M\\bar{z}\\|_2^2 + (d + \\kappa)\\|MC\\|_F^2 && \\text{(Lem. 2)}\n\\end{aligned}$$\n\nTo see that this is unimprovable, observe that the only inequality is the matrix-smoothness condition\non f. But for $f(z) = \\frac{1}{2}(z - \\bar{z})^\\top M (z - \\bar{z})$, that condition holds with equality: $\\|\\nabla f(y) - \\nabla f(z)\\|_2 = \\|M(y - z)\\|_2$. Thus, for any M and $\\bar{z}$, there is an f satisfying the assumptions of the\ntheorem such that the bound in Eq. (4) is an equality.\n\nIt\u2019s easy to see that this reduces to Thm. 3 in the case that f is smooth in the standard sense \u2013 this\ncorresponds to the situation where M is some constant times the identity. Alternatively, one can\nsimply observe that the two results are the same if M is a scalar. Thus, going forward we will use\nEq. (4) to represent the result with either type of smoothness assumption on f.\n\n3.4 Subsampling\n\nOften, the function f(z) takes the form of a sum over other functions $f_n(z)$, typically representing\ndifferent data. 
Write this as\n\n$$f(z) = \\sum_{n=1}^N f_n(z).$$\n\nTo estimate the gradient of $\\mathbb{E}_{u \\sim s} f(T_w(u))$, one can save time by using \u201csubsampling\u201d: That is, draw\na random n, and then estimate the gradient of $\\mathbb{E}_{u \\sim s} f_n(T_w(u))$. The following result bounds this\nprocedure. It essentially just takes a set of estimators, one corresponding to each function $f_n$, bounds\ntheir expected squared norm using the previous theorems, and then combines these.\nTheorem 6. 
Suppose $f_n$ is $M_n$-matrix-smooth, $\\bar{z}_n$ is a stationary point of $f_n$, and s is standardized\nwith $u \\in \\mathbb{R}^d$ and $\\mathbb{E}\\, u_i^4 = \\kappa$. Let $g = \\frac{1}{\\pi(n)} \\nabla_w f_n(T_w(u))$ for $u \\sim s$ and $n \\sim \\pi$ independent. Then,\n\n$$\\mathbb{E}\\|g\\|_2^2 \\le \\sum_{n=1}^N \\frac{1}{\\pi(n)} \\left((d + 1)\\|M_n(m - \\bar{z}_n)\\|_2^2 + (d + \\kappa)\\|M_n C\\|_F^2\\right). \\tag{5}$$\n\nMoreover, this result is unimprovable without further assumptions.\n\nProof. Consider a simple lemma: Suppose $a_1, \\dots, a_N$ are independent random vectors and $\\pi$ is any\ndistribution over $1, \\dots, N$. Let $b = a_n / \\pi(n)$ for $n \\sim \\pi$, where n is independent of $a_n$. It is easy to\nshow that $\\mathbb{E}\\, b = \\sum_{n=1}^N \\mathbb{E}\\, a_n$ and $\\mathbb{E}\\|b\\|_2^2 = \\sum_n \\mathbb{E}\\|a_n\\|_2^2 / \\pi(n)$. The result follows from applying\nthis with $a_n = \\nabla_w f_n(T_w(u))$, and then bounding $\\mathbb{E}\\|a_n\\|_2^2$ using Thm. 5.\nAgain, in this result the only source of looseness is the use of the smoothness bound for the component\nfunctions $f_n$. Accordingly, the result can be shown to be unimprovable: For any set of stationary\npoints $\\bar{z}_n$ and smoothness parameters $M_n$ we can construct functions $f_n$ (as in Thm. 5) for which the\nprevious theorems are tight and thus this result is also tight.\n\nThis result generalizes all the previous bounds: Thm. 5 is the special case when N = 1, while Thm. 3\nis the special case when N = 1 and $f_1$ is $M_1$-smooth (for a scalar $M_1$). The case where N > 1 but\n$f_n$ is $M_n$-smooth (for scalar $M_n$) is also useful \u2013 the bound in Eq. (5) remains valid, but with a scalar\n$M_n$.\n\n4 Empirical Evaluation\n\n4.1 Model and Datasets\n\nWe consider Bayesian linear regression and logistic regression models on various datasets (Table 1).\nGiven data $\\{(x_1, y_1), \\dots, (x_N, y_N)\\}$, let y be a vector of all $y_n$ and X a matrix of all $x_n$. 
We\nassume a Gaussian prior so that $p(z, y|X) = \\mathcal{N}(z|0, \\sigma^2 I) \\prod_{n=1}^N p(y_n|z, x_n)$. For linear regression,
For linear regression,\np(yn|z, xn) = N (yn|z>xi,\u21e2 2), while for logistic regression, p(yn|z, xn) = Sigmoid(ynz>xn).\nFor both models we use a prior of 2 = 1. For linear regression, we set \u21e22 = 4.\nTo justify the use of VI, apply the decomposition in Eq. (1) substituting p(z, y|X) in place of p(z, x)\nto get that\n\nlog p(y|X) = Ez\u21e0q\n\nlog\n\np(z, y|X)\n\nq(z)\n\n+ KL (q(z)kp(z|y, X)) .\n\nThus, adjusting the parameters of q to maximize the \ufb01rst term on the right tightens a lower-bound\non the conditional log likelihood log p(y|X) and minimizes the divergence from q to the posterior.\nSo, we again take our goal as maximizing l(w) + h(w). In the batch setting, f (z) = log p(z, y|X),\nwhile with subsampling, fn(z) = 1\n\nN log p(z) + log p(yn|z, xn).\n\n5\n\n\fFigure 1: How loose are the bounds compared to reality? Odd Rows: Evolution of the ELBO\nduring the single optimization trace used to compare all estimators. Even Rows: True and bounded\nvariance with gradients estimated in \u201cbatch\u201d (using the full dataset in each evaluation) and \u201cuniform\u201d\n(stochastically with \u21e1(n) = 1/N). The \ufb01rst two rows are for linear regression models, while the\nrest are for logistic regression. Key Observations: (i) Batch estimation is lower-variance but higher\ncost (ii) variance with stochastic estimation varies little over time (iii) using matrix smoothness\nsigni\ufb01cantly tightens bounds \u2013 and is exact for linear regression models.\n\nSec. 8 shows that if 0 \uf8ff 00(t) \uf8ff \u2713, thenPN\n\u2713PN\n\ni=1 aia>i . Applying this1 gives that f (z) and fn(z) are matrix-smooth for\n\nn=1 (a>n z + bn) is M-matrix-smooth for M =\n\nM =\n\n1\n2 I + c\n\nNXn=1\n\nxnx>n , and Mn =\n\n1\nN 2 I + c xnx>n ,\n\n1For linear regression, set (t) = t2/(2\u21e22), an = xn and bn = yn and observe that 00 = 1/\u21e22. For\nlogistic regression, set (t) = log Sigmoid(t), an = ynxn and bn = 0 and observe that 00 \uf8ff 1/4. 
Adding\nthe prior and using the triangle inequality gives the result.\n\nHere, $c = 1/\\rho^2$ for linear regression, and $c = 1/4$ for logistic regression. Taking the spectral norm\nof these matrices gives scalar smoothness constants. With subsampling, this is $\\|M_n\\|_2 = \\frac{1}{\\sigma^2 N} + c\\|x_n\\|_2^2$.\n\n[Figure 1 plot, mushrooms panel. Vertical axis: $\\|g\\|^2$. Legend (Grad Estimator): batch, uniform; series: bound (matrix), bound (scalar), true.]\n\nFigure 2: Tightening variance bounds reduces true variance. A comparison of the true (vertical\nbars) and bounded $\\mathbb{E}\\|g\\|^2$ values produced using five different gradient estimators. Batch does\nnot use subsampling. Uniform uses subsampling with $\\pi(n) = 1/N$, proportional uses $\\pi(n) \\propto M_n$,\nopt (scalar) numerically optimizes $\\pi(n)$ to tighten Eq. (5) with a scalar $M_n$ and opt (matrix)\ntightens Eq. (5) with a matrix $M_n$. For each sampling strategy, we show the variance bound both with\na scalar and matrix $M_n$. Uniform sampling has true and bounded values of $\\mathbb{E}\\|g\\|^2$ ranging between\n1.5x and 10x higher than those for sampling with $\\pi$ numerically optimized.\n\nTable 1: Regression (r) and classification (c) datasets\n\nDataset | Type | # data | # dims\nboston | r | 506 | 13\nfires | r | 517 | 12\ncpusmall | r | 8192 | 13\na1a | c | 1695 | 124\nionosphere | c | 351 | 35\naustralian | c | 690 | 15\nsonar | c | 208 | 61\nmushrooms | c | 8124 | 113\n\n4.2 Evaluation of Bounds\n\nTo enable a clear comparison of different estimators and bounds, we generate a single optimization\ntrace of parameter vectors w for each dataset. All\ncomparisons use this same trace. These use a conservative optimization method: Find a maximum $\\bar{z}$\nand then initialize to $w = (\\bar{z}, 0)$. 
Then, optimization\nuses proximal stochastic gradient descent (with the proximal operator reflecting h) with a step size of\n1/M (the scalar smoothness constant) and 1000 evaluations for each gradient estimate.\nFig. 
1 shows the evolution of the ELBO along with the variance of gradient estimation either in batch\nor stochastically with a uniform distribution over data. For each iteration and estimator, we plot the\nempirical $\\|g\\|^2$ along with this paper\u2019s bounds using either scalar or matrix smoothness.\n\n4.3 Sampling distributions\n\nWith subsampling, variability depends on the sampling distribution $\\pi$. We consider uniform sampling\nas well as three strategies that attempt to tighten the bound in Thm. 6. In general, $\\sum_n f(n)^2 / \\pi(n)$ is\nminimized over distributions $\\pi$ by $\\pi(n) \\propto |f(n)|$. Thus, the tightest bound is given by\n\n$$\\pi^*_w(n) \\propto \\sqrt{(d + 1)\\|M_n(m - \\bar{z}_n)\\|_2^2 + (d + \\kappa)\\|M_n C\\|_F^2}. \\tag{6}$$\n\nWe call this \u201copt (scalar)\u201d or \u201copt (matrix)\u201d when $M_n$ is a scalar or matrix, respectively. We also\nconsider a \u201cproportional\u201d heuristic with $\\pi(n) \\propto M_n$ for a scalar $M_n$. Sampling from Eq. (6) appears\nto require calculating the right-hand side for each n and then normalizing, which may not be practical\nfor large datasets. While there are obvious heuristics for recursively approximating $\\pi^*$ during an\noptimization, to maintain focus we do not pursue these ideas here.\nFig. 2 shows the empirical and true variance at the final iteration of the optimization shown in\nFig. 1. The basic conclusion is that using a more careful sampling distribution reduces both true and\nempirical variance.\n\n[Figure 2 plot, mushrooms panel. Vertical axis: $\\|g\\|^2$. Grad Estimator: batch, opt (matrix), opt (scalar), proportional, uniform. Bound Type: bound (matrix), bound (scalar).]\n\n5 Discussion\n\n5.1 Related work\n\nXu et al. [21] compute the variance of a reparameterization estimator applied to a quadratic function,\nwhen the variational distribution is a fully-factorized Gaussian. 
This paper can be seen as extending\nthis result to more general densities (full-rank location-scale families) and more general target\nfunctions (smooth functions).\nFan et al. [4] give an abstract variance bound for RP estimators. Essentially, they argue that if\n$g_i = \\nabla_{w_i} f(T_w(u))$ and $\\nabla_{w_i} f(T_w(u))$ is M-smooth as a function of u, then $\\mathbb{V}[g_i] \\le M^2 \\pi^2 / 4$\nwhen $u \\sim \\mathcal{N}(0, I)$. While this result is fairly abstract \u2013 there is no proof that the smoothness\nassumption holds for any particular M with any particular f and $T_w$ \u2013 it is similar in spirit to the\nresults in this paper.\n\n5.2 Variance vs Expected Squared Norms\nThe above results are on the expected squared norm (ESN) of the gradient $\\mathbb{E}\\|g\\|^2$. Some\nstochastic gradient convergence rates instead consider (the trace of) the variance $\\mathbb{V}[g]$. Since $\\operatorname{tr} \\mathbb{V}[g] = \\mathbb{E}\\|g\\|^2 - \\|\\mathbb{E}\\, g\\|^2$, ESN bounds are valid as variance bounds. Still, one can ask if these bounds are\nloose. The following (proof in Sec. 7.3) gives a lower-bound that shows that there is not much to gain\nfrom a direct bound on the variance rather than just using the ESN bound from Thm. 6.\nTheorem 7. 
For any symmetric matrices $M_1, \\dots, M_N$ and vectors $\\bar{z}_1, \\dots, \\bar{z}_N$, there are functions\n$f_1, \\dots, f_N$ such that (1) $f_n$ is $M_n$-matrix-smooth and has a stationary point at $\\bar{z}_n$ and (2) if s is\nstandardized with $u \\in \\mathbb{R}^d$ and $\\mathbb{E}\\, u_i^4 = \\kappa$, then for $g = \\frac{1}{\\pi(n)} \\nabla_w f_n(T_w(u))$,\n\n$$\\operatorname{tr} \\mathbb{V}[g] \\ge \\sum_{n=1}^N \\frac{1}{\\pi(n)} \\left(d\\,\\|M_n(m - \\bar{z}_n)\\|_2^2 + (d + \\kappa - 1)\\|M_n C\\|_F^2\\right).$$\n\nWhen $d \\gg 1$ this lower-bound is very close to the upper-bound on $\\mathbb{E}\\|g\\|^2$ in Thm. 6. Thus, under\nthis paper\u2019s assumptions, a variance bound cannot be significantly better than an ESN bound.\n\n5.3 The Entropy Term\n\nAll discussion in this paper has been for gradient estimators for l, while the goal is of course to\noptimize l + h. 
For location-scale families, h is known in closed-form, meaning the exact gradient \u2013\nor the proximal operator for h \u2013 can be computed exactly. Still, it has been observed that if $q_w$ is very\nclose to p(z|x), cancellations mean that estimating the gradient of h + l might have lower variance\nthan the gradient of l alone [12].\nWith any variational family, it is well-known that the gradient of the entropy can be represented\nas $-\\nabla_w \\mathbb{E}_{z \\sim q_w} \\log q_v(z)|_{v=w}$. That is, the dependence of $\\log q_w$ on w can be neglected under\ndifferentiation. Thus, if one wishes to stochastically estimate the gradient of h, one can treat $\\log q_v$ in\nthe same way as log p when calculating gradients. Then, one could apply the analysis in this paper to\n$f(z) = \\log p(z, x) - \\log q_v(z)$ rather than f(z) = log p(z, x) as done above. It is easy to imagine\nsituations where subtracting $\\log q_v$ (or a fraction of it) from log p would change $M_n$ and $\\bar{z}_n$ in such a\nway as to produce a tighter bound. Thus, the bounds in this paper are consistent with practices [5, 12]\nwhere using $\\log q_v$ as a control variate can reduce gradient variance.\n\n5.4 Smoothness and Convergence Guarantees\n\nAt a very high level, convergence rates for stochastic gradient methods require both (1) control of the\nvariability of the gradient estimator and (2) either convexity or Lipschitz smoothness of the objective.\nThis paper is dedicated entirely to the first goal. Independent recent work has addressed the second\nissue [3]. The basic summary is that if f(z) is smooth, then l(w) is smooth, and similarly if f(z) is\nstrongly convex. 
However, full convergence guarantees for black-box VI remain an open research\nproblem.\n\n5.5 Prospects for Generalizing Bounds to Other Variational Families\n\nThe bounds given in this paper are closely tied to location-scale families: The exact form of the\nreparameterization function $T_w$ is used in Lem. 1 and Lem. 2, which underlie the main results of\nThm. 3, Thm. 
5, and Eq. (4). Thus, extending our proof strategy to other variational families would\nrequire deriving new results analogous to Lem. 1 and Lem. 2 for the reparameterization function\n$T_w$ corresponding to those new variational families. Moreover, if the exact entropy is not available\nfor a variational family, the analysis must address the variance of the entropy gradient estimator, as\ndiscussed in Sec. 5.3.\n\n5.6 Limitations\n\nThis work has several limitations. First, it applies only to location-scale families, and requires that\nthe target objective be smooth. Second, if log p is smooth, it may still be challenging in practice to\nestablish what the smoothness constant is. Third, we observed that even with our strongest condition\nof matrix smoothness, some looseness remains in the bounds with the logistic regression examples.\nSince the ESN bound is unimprovable, this looseness cannot be removed without using more detailed\nstructure of the target log p. It is not obvious what this structure would be, or how it would be\nobtained for practical black-box inference problems.\n\nReferences\n[1] David M. Blei, Alp Kucukelbir, and Jon D. McAuliffe. Variational Inference: A Review for Statisticians. Journal of the American Statistical Association, 112(518):859\u2013877, 2017.\n\n[2] Alexander Buchholz, Florian Wenzel, and Stephan Mandt. Quasi-Monte Carlo Variational Inference. In ICML, 2018.\n\n[3] Justin Domke. Provable Smoothness Guarantees for Black-Box Variational Inference. arXiv:1901.08431 [cs, stat], 2019.\n\n[4] Kai Fan, Ziteng Wang, Jeff Beck, James Kwok, and Katherine Heller. Fast Second-Order Stochastic Backpropagation for Variational Inference. In NeurIPS, 2015.\n\n[5] Tomas Geffner and Justin Domke. Using Large Ensembles of Control Variates for Variational Inference. In NeurIPS, 2018.\n\n[6] Zoubin Ghahramani and Matthew Beal. Propagation Algorithms for Variational Bayesian Learning. In NeurIPS, 2001.\n\n[7] Matthew D. 
Hoffman, David M. Blei, Chong Wang, and John Paisley. Stochastic Variational Inference. Journal of Machine Learning Research, 14:1303\u20131347, 2013.\n\n[8] Andrew Miller, Nick Foti, Alexander D\u2019Amour, and Ryan P Adams. Reducing Reparameterization Gradient Variance. In NeurIPS, 2017.\n\n[9] Rajesh Ranganath, Sean Gerrish, and David M. Blei. Black Box Variational Inference. In AISTATS, 2014.\n\n[10] Jeffrey Regier, Kiran Pamnany, Ryan Giordano, Rollin Thomas, David Schlegel, Jon McAuliffe,\nand Prabhat. Learning an Astronomical Catalog of the Visible Universe through Scalable\nBayesian Inference. arXiv:1611.03404 [astro-ph, stat], 2016.\n\n[11] Jeffrey Regier, Michael I Jordan, and Jon McAuliffe. Fast Black-box Variational Inference through Stochastic Trust-Region Optimization. In NeurIPS, 2017.\n\n[12] Geoffrey Roeder, Yuhuai Wu, and David K Duvenaud. Sticking the Landing: Simple, Lower-Variance Gradient Estimators for Variational Inference. In NeurIPS, 2017.\n\n[13] Francisco J. R. Ruiz, Michalis K. Titsias, and David M. Blei. The Generalized Reparameterization Gradient. In NeurIPS, 2016.\n\n[14] Francisco J. R. Ruiz, Michalis K. Titsias, and David M. Blei. Overdispersed Black-Box Variational Inference. arXiv:1603.01140 [stat], 2016.\n\n[15] Tim Salimans and David A. Knowles. Fixed-Form Variational Posterior Approximation through Stochastic Linear Regression. Bayesian Analysis, 8(4):837\u2013882, 2013.\n\n[16] Linda S. L. Tan and David J. Nott. Gaussian variational approximation with sparse precision matrices. Statistics and Computing, 28(2):259\u2013275, 2018.\n\n[17] Michalis Titsias and Miguel L\u00e1zaro-Gredilla. Doubly Stochastic Variational Bayes for non-Conjugate Inference. In ICML, 2014.\n\n[18] Michalis K. Titsias and Miguel L\u00e1zaro-Gredilla. Local Expectation Gradients for Black Box Variational Inference. In NeurIPS, 2015.\n\n[19] David Wingate and Theophane Weber. 
Automated Variational Inference in Probabilistic Programming. arXiv:1301.1299 [cs, stat], 2013.\n\n[20] John Winn and Christopher M Bishop. Variational Message Passing. Journal of Machine Learning Research, 6:661\u2013694, 2005.\n\n[21] Ming Xu, Matias Quiroz, Robert Kohn, and Scott A. Sisson. Variance reduction properties of the reparameterization trick. In AISTATS, 2019.\n\n[22] Yuling Yao, Aki Vehtari, Daniel Simpson, and Andrew Gelman. Yes, but Did It Work?: Evaluating Variational Inference. In ICML, 2018.\n", "award": [], "sourceid": 161, "authors": [{"given_name": "Justin", "family_name": "Domke", "institution": "University of Massachusetts, Amherst"}]}