{"title": "Stein Variational Gradient Descent as Moment Matching", "book": "Advances in Neural Information Processing Systems", "page_first": 8854, "page_last": 8863, "abstract": "Stein variational gradient descent (SVGD) is a non-parametric inference algorithm that evolves a set of particles to fit a given distribution of interest. We analyze the non-asymptotic properties of SVGD, showing that there exists a set of functions, which we call the Stein matching set, whose expectations are exactly estimated by any set of particles that satisfies the fixed point equation of SVGD. This set is the image of Stein operator applied on the feature maps of the positive definite kernel used in SVGD. Our results provide a theoretical framework for analyzing the properties of SVGD with different kernels, shedding insight into optimal kernel choice. In particular, we show that SVGD with linear kernels yields exact estimation of means and variances on Gaussian distributions, while random Fourier features enable probabilistic bounds for distributional approximation. Our results offer a refreshing view of the classical inference problem as fitting Stein\u2019s identity or solving the Stein equation, which may motivate more efficient algorithms.", "full_text": "Stein Variational Gradient Descent\n\nas Moment Matching\n\nQiang Liu, Dilin Wang\n\nDepartment of Computer Science\nThe University of Texas at Austin\n\nAustin, TX 78712\n\n{lqiang, dilin}@cs.utexas.edu\n\nAbstract\n\nStein variational gradient descent (SVGD) is a non-parametric inference algorithm\nthat evolves a set of particles to \ufb01t a given distribution of interest. We analyze the\nnon-asymptotic properties of SVGD, showing that there exists a set of functions,\nwhich we call the Stein matching set, whose expectations are exactly estimated by\nany set of particles that satis\ufb01es the \ufb01xed point equation of SVGD. 
This set is the image of the Stein operator applied to the feature maps of the positive definite kernel used in SVGD. Our results provide a theoretical framework for analyzing the properties of SVGD with different kernels, shedding insight into optimal kernel choice. In particular, we show that SVGD with linear kernels yields exact estimation of means and variances on Gaussian distributions, while random Fourier features enable probabilistic bounds for distributional approximation. Our results offer a refreshing view of the classical inference problem as fitting Stein's identity or solving the Stein equation, which may motivate more efficient algorithms.

1 Introduction

One of the core problems of modern statistics and machine learning is to approximate difficult-to-compute probability distributions. Two fundamental ideas have been extensively studied and used in the literature: variational inference (VI) and Markov chain Monte Carlo (MCMC) sampling (e.g., Koller & Friedman, 2009; Wainwright et al., 2008). MCMC has the advantage of being non-parametric and asymptotically exact, but often suffers from difficulty in convergence, while VI frames inference as a parametric optimization of the KL divergence and works much faster in practice, but loses asymptotic consistency.
An ongoing theme of research is to combine the advantages of these two methodologies.

Stein variational gradient descent (SVGD) (Liu & Wang, 2016) is a synthesis of MCMC and VI that inherits the non-parametric nature of MCMC while maintaining the optimization perspective of VI. In brief, SVGD for a distribution p(x) updates a set of particles {x_i}_{i=1}^n in parallel with a velocity field φ(·) that balances the gradient force and the repulsive force,

x_i ← x_i + ε φ(x_i),   φ(·) = (1/n) Σ_{j=1}^n [ ∇_{x_j} log p(x_j) k(x_j, ·) + ∇_{x_j} k(x_j, ·) ],

where ε is a step size and k(x, x′) is a positive definite kernel defined by the user. This update is derived by approximating a kernelized Wasserstein gradient flow of the KL divergence (Liu et al., 2017), with connections to Stein's method (Stein, 1972) and optimal transport (Ollivier et al., 2014); see also Anderes & Coram (2002). SVGD has been applied to solve challenging inference problems in various domains; examples include Bayesian inference (Liu & Wang, 2016; Feng et al., 2017), uncertainty quantification (Zhu & Zabaras, 2018), reinforcement learning (Liu et al., 2017; Haarnoja et al., 2017), learning deep probabilistic models (Wang & Liu, 2016; Pu et al., 2017) and Bayesian meta learning (Feng et al., 2017; Kim et al., 2018).

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

However, the theoretical properties of SVGD are still largely unexplored. The only exceptions are Liu et al. (2017); Lu et al. (2018), which studied the partial differential equation that governs the evolution of the limit densities of the particles, with which convergence to the distribution of interest can be established. However, the results in Liu et al. (2017); Lu et al. (2018) are asymptotic in nature and hold only when the number of particles is very large. More recently, Chen et al.
(2018) studied non-asymptotic properties of a linear combination of SVGD and Langevin dynamics, whose analysis, however, mainly exploits the properties of Langevin dynamics and does not apply to SVGD alone. A theoretical understanding of SVGD in the finite sample size regime is still missing and of great practical importance, especially given that the particle sizes used in practice are often relatively small, thanks to the property that SVGD with a single particle (n = 1) exactly reduces to finding the mode (a.k.a. maximum a posteriori (MAP)).

Our Results We analyze the finite sample properties of SVGD. In contrast to the dynamical perspective of Liu et al. (2017), we directly study what properties a set of particles would have if it satisfies the fixed point equation of SVGD, regardless of how we obtain the particles algorithmically, or whether the fixed point is unique. Our analysis indicates that the fixed point equation of SVGD is essentially a moment matching condition which ensures that the fixed point particles {x*_i}_{i=1}^n exactly estimate the expectations of all the functions in a special function set F*:

(1/n) Σ_{i=1}^n f(x*_i) = E_p f,   ∀ f ∈ F*.

This set F*, which we call the Stein matching set, consists of the functions obtained by applying the Stein operator to the linear span of the feature maps of the kernel used by SVGD.

This framework allows us to understand the properties of different kernels (and the related feature maps) by studying their Stein matching sets F*, which should ideally either match the test functions that we are actually interested in estimating, or be as large as possible in order to approximate the overall distribution. This process is difficult in general, but we make two observations in this work:

i) We show that, by using linear kernels (features), SVGD can exactly estimate the mean and variance of Gaussian distributions when the number of
particles is larger than the dimension. Since Gaussian-like distributions appear widely in practice, and the estimates of mean and variance are often of special importance, linear kernels can provide a significant advantage over the typical Gaussian RBF kernels, especially in estimating the variance.

ii) Linear features are not sufficient to approximate the whole distribution. We show that, by using random features of strictly positive definite kernels, the fixed points of SVGD approximate the whole distribution with an O(1/√n) rate in kernelized Stein discrepancy.

Overall, our framework reveals a novel perspective that reduces the inference problem to either a regression problem of fitting Stein identities, or inverting the Stein operator, which is framed as solving a differential equation called the Stein equation. These ideas are significantly different from the traditional MCMC and VI methods that are currently popular in the machine learning literature, and draw novel connections to quasi-Monte Carlo and quadrature methods, among other techniques in applied mathematics. New efficient approximate inference methods may be motivated by our new perspective.

2 Background

We introduce the basic background of the Stein variational method, a framework of approximate inference that integrates ideas from Stein's method, kernel methods, and variational inference. The reader is referred to Liu et al. (2016); Liu & Wang (2016); Liu et al. (2017) and references therein for more details. For notation, all vectors are assumed to be column vectors. The differential operator ∇_x is viewed as a column vector of the same size as x ∈ R^d.
For example, ∇_x φ is an R^d-valued function when φ is a scalar-valued function, and ∇_x^T φ(x) = Σ_{i=1}^d ∂_{x_i} φ(x) is a scalar-valued function when φ is R^d-valued.

Stein's Identity Stein's identity forms the foundation of our framework. Given a positive differentiable density p(x) on X ⊆ R^d, one form of Stein's identity is

E_p[ ∇_x log p(x)^T φ(x) + ∇_x^T φ(x) ] = 0,   ∀ φ,

which holds for any differentiable, R^d-valued function φ that satisfies a proper zero-boundary condition. Stein's identity can be proved by a simple exercise of integration by parts. We may write Stein's identity in a more compact way by defining a Stein operator P_x:

E_p[ P_x^T φ(x) ] = 0,   where   P_x^T φ(x) = ∇_x log p(x)^T φ(x) + ∇_x^T φ(x),

where P_x is formally viewed as a d-dimensional column vector like ∇_x, and hence P_x^T φ is the inner product of P_x and φ, yielding a scalar-valued function.

The power of Stein's identity is that, for a given distribution p, it defines an infinite number of functions of the form P_x^T φ that have zero expectation under p, all of which depend on p only through the Stein operator P_x, or the score function ∇_x log p(x) = ∇p(x)/p(x), which is independent of the normalization constant of p that is often difficult to calculate.

Stein Discrepancy on RKHS Stein's identity can be leveraged to characterize the discrepancy between different distributions. The idea is that, for two different distributions p ≠ q, there shall exist a function φ such that E_q[P_x^T φ] ≠ 0. Consider functions φ in an R^d-valued reproducing kernel Hilbert space (RKHS) of the form H = H_0 × ··· × H_0, where H_0 is an R-valued RKHS with positive definite kernel k(x, x′). We may define a kernelized Stein discrepancy (KSD) (Liu et al., 2016; Chwialkowski et al., 2016; Oates et al., 2017):

D_k(q || p) = max_{φ∈H} { E_q[P_x^T φ(x)] : ||φ||_H ≤ 1 }.   (1)

The optimal φ in (1) can be solved in closed form:

φ*_{q,p}(·) ∝ E_{x∼q}[ P_x k(x, ·) ],   (2)

which yields a simple kernel-based representation of KSD:

D_k²(q || p) = E_{x,x′∼q}[ κ_p(x, x′) ],   where   κ_p(x, x′) = P_x^T (P_{x′} k(x, x′)),   (3)

where x and x′ are i.i.d. draws from q, and κ_p(x, x′) is a new "Steinalized" positive definite kernel obtained by applying the Stein operator twice, first w.r.t. variable x and then x′. It turns out that the RKHS related to the kernel κ_p(x, x′) is exactly the space of functions obtained by applying the Stein operator to functions in H, that is,

H_p = { P_x^T φ : ∀ φ ∈ H }.

By Stein's identity, all the functions in H_p have zero expectation under p. We can also define H_p⁺ to be the space of functions in H_p plus arbitrary constants, that is, H_p⁺ := { f(x) + c : f ∈ H_p, c ∈ R }, which can also be viewed as an RKHS, with kernel κ_p(x, x′) + 1. Stein discrepancy can be viewed as a maximum mean discrepancy (MMD) on the Steinalized RKHS H_p⁺ (or equivalently H_p):

D_k(q || p) = max_{f∈H_p⁺} { E_q f − E_p f : ||f||_{H_p⁺} ≤ 1 }.   (4)

Different from a typical MMD, here the RKHS depends on the distribution p.
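The closed-form representation (3) is straightforward to implement. Below is a minimal sketch (ours, not the authors' code) that estimates the squared KSD of a sample against a target p using the "Steinalized" RBF kernel κ_p; the score function and the bandwidth h are user-supplied assumptions.

```python
import numpy as np

def ksd_squared(X, score, h=1.0):
    """V-statistic estimate of D_k^2(q || p) = E_{x,x'~q}[kappa_p(x, x')],
    where kappa_p is obtained by applying the Stein operator of p to both
    arguments of the RBF kernel k(x, x') = exp(-||x - x'||^2 / (2 h^2))."""
    n, d = X.shape
    S = score(X)                                    # s(x_i) = grad log p(x_i), shape (n, d)
    diff = X[:, None, :] - X[None, :, :]            # x_i - x_j, shape (n, n, d)
    sq = np.sum(diff ** 2, axis=-1)
    K = np.exp(-sq / (2 * h ** 2))
    # kappa_p(x,x') = s(x)^T s(x') k + s(x)^T grad_{x'}k + s(x')^T grad_x k + tr(grad_x grad_{x'} k)
    t1 = (S @ S.T) * K
    t2 = np.einsum('id,ijd->ij', S, diff) * K / h ** 2   # s(x_i)^T (x_i - x_j) k / h^2
    t3 = (d / h ** 2 - sq / h ** 4) * K                  # trace term
    kappa = t1 + t2 + t2.T + t3
    return kappa.mean()

# Illustration: samples from p = N(0, I) versus the same samples shifted away from p.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))
score = lambda X: -X            # score of the standard Gaussian
```

Here `ksd_squared(X, score)` stays small for samples drawn from p, while `ksd_squared(X + 5.0, score)` is much larger, since a large KSD flags a violation of Stein's identity.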
In order to make the Stein discrepancy discriminative, in the sense that D_k(q || p) = 0 implies q = p, we need to take kernels k(x, x′) such that H_p⁺ is sufficiently large. It has been shown that this can be achieved if k(x, x′) is strictly positive definite or universal, in a proper technical sense (Liu et al., 2016; Chwialkowski et al., 2016; Gorham & Mackey, 2017; Oates et al., 2017).

It is useful to consider the kernels in a random feature representation (Rahimi & Recht, 2007),

k(x, x′) = E_{w∼p_w}[ φ(x, w) φ(x′, w) ],   (5)

where φ(x, w) is a set of features indexed by a random parameter w drawn from a distribution p_w. For example, the Gaussian RBF kernel k(x, x′) = exp(−||x − x′||₂² / (2h²)) admits

φ(x, w) = √2 cos( (1/h) w₁^T x + w₀ ),   (6)

where w₀ ∼ Unif([0, 2π]) and w₁ ∼ N(0, I). With the random feature representation, KSD can be rewritten as

D_k²(q || p) = E_{w∼p_w}[ || E_{x∼q}[ P_x φ(x, w) ] ||₂² ],   (7)

which can be viewed as the mean square error of Stein's identity E_{x∼q}[P_x φ(x, w)] = 0 over the random features. D_k²(q || p) = 0 implies q = p if the feature set G = {φ(x, w) : ∀ w} is rich enough. Note that the RKHS H and the feature set G are different; Stein discrepancy is an expected loss function on G as shown in (7), but a worst-case loss on H as shown in (1).

Stein Variational Gradient Descent (SVGD) SVGD is a deterministic sampling algorithm motivated by Stein discrepancy.
It is based on the following basic observation: given a distribution q, let q_[φ] be the distribution of x′ = x + ε φ(x) obtained by updating x with a velocity field φ, where ε is a small step size; then we have

KL(q_[φ] || p) = KL(q || p) − ε E_q[P_x^T φ] + O(ε²),

which shows that the decrease of the KL divergence is dominated by ε E_q[P_x^T φ]. In order to make q_[φ] move towards p as fast as possible, we should choose φ to maximize E_q[P_x^T φ], whose solution is exactly φ*_{q,p}(·) ∝ E_{x∼q}[P_x k(x, ·)] as shown in (2). In other words, φ*_{q,p} happens to be the best velocity field that pushes the probability mass of q towards p as fast as possible.

Motivated by this, SVGD approximates q with the empirical distribution of a set of particles {x_i}_{i=1}^n, and iteratively updates the particles by

x_i ← x_i + (ε/n) Σ_{j=1}^n [ P_{x_j} k(x_j, x_i) ].   (8)

Liu et al. (2017) studied the asymptotic properties of the dynamical system underlying SVGD, showing that the evolution of the limit density of the particles as n → ∞ can be captured by a nonlinear Fokker–Planck equation, and established its weak convergence to the target distribution p. However, the analyses in Liu et al. (2017) and Lu et al. (2018) do not cover the case when the sample size n is finite, which is more relevant to the practical performance. We address this problem by directly analyzing the properties of the fixed point equation of SVGD, yielding results that hold for finite sample size n and are independent of the update rule used to arrive at the fixed points.

3 SVGD as Moment Matching

This section presents our main results on the moment matching properties of SVGD and the related Stein matching sets.
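Update (8) above is simple to implement. The following is a minimal sketch (ours, not the authors' reference code) of one SVGD iteration with a Gaussian RBF kernel; the step size eps and bandwidth h are arbitrary choices.

```python
import numpy as np

def svgd_step(X, score, eps=0.1, h=1.0):
    """One SVGD update x_i <- x_i + (eps/n) sum_j [ k(x_j, x_i) score(x_j)
    + grad_{x_j} k(x_j, x_i) ] with RBF kernel k(x, x') = exp(-||x-x'||^2/(2 h^2))."""
    n, d = X.shape
    diff = X[:, None, :] - X[None, :, :]                   # x_i - x_j
    K = np.exp(-np.sum(diff ** 2, -1) / (2 * h ** 2))      # symmetric kernel matrix
    drift = K @ score(X)                                   # gradient force
    repulsion = np.einsum('ij,ijd->id', K, diff) / h ** 2  # grad_{x_j} k(x_j, x_i) summed over j
    return X + eps * (drift + repulsion) / n

# Illustration: 50 particles started far from a 1D standard normal drift to it.
rng = np.random.default_rng(0)
X = rng.uniform(-10.0, -5.0, size=(50, 1))
for _ in range(1000):
    X = svgd_step(X, lambda X: -X, eps=0.1, h=1.0)
```

After the loop the particle mean is close to 0 and the spread is of order 1; note that with a single particle (n = 1) the repulsion vanishes and the update reduces to gradient ascent on log p, i.e., mode finding, as discussed above.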
We start with Section 3.1, which introduces the basic idea and characterizes the Stein matching set of SVGD with general positive definite kernels. We then analyze in Section 3.2 the special case when the rank of the kernel is no larger than the particle size, in which case the Stein matching set is independent of the fixed points themselves. Section 3.3 shows that SVGD with linear features exactly estimates the first and second order moments of Gaussian distributions. Section 3.4 establishes a probabilistic bound for the case when random features are used.

3.1 Fixed Point of SVGD

Our basic idea is rather simple to illustrate. Assume X* = {x*_i}_{i=1}^n is a fixed point of SVGD and μ̂_{X*} its related empirical measure; then, according to (8), the fixed point condition of SVGD ensures

E_{x∼μ̂_{X*}}[ P_x k(x, x*_i) ] = 0,   ∀ i = 1, ..., n.   (9)

On the other hand, by Stein's identity, we have

E_{x∼p}[ P_x k(x, x*_i) ] = 0,   ∀ i = 1, ..., n.

This suggests that μ̂_{X*} exactly estimates the expectations under p of the functions of the form f(x) = P_x k(x, x*_i), all of which are zero. By the linearity of expectation, the same holds for all the functions in the linear span of the P_x k(x, x*_i).

Lemma 3.1. Assume X* = {x*_i}_{i=1}^n satisfies the fixed point equation (9) of SVGD. We have

(1/n) Σ_{i=1}^n f(x*_i) = E_p f,   ∀ f ∈ F*,

where the Stein matching set F* is the linear span of {P_x k(x, x*_i)}_{i=1}^n ∪ {1}; that is, F* consists of functions of the form

f(x) = Σ_{i=1}^n a_i^T P_x k(x, x*_i) + b,   ∀ a_i ∈ R^d, b ∈ R.

Equivalently, f(x) = P_x^T φ(x) + b where φ is in the linear span of {k(x, x*_i)}_{i=1}^n, that is, φ(x) = Σ_{i=1}^n a_i k(x, x*_i).

Extending Lemma 3.1, one can readily see that the SVGD fixed points can also approximate the expectations of functions that are close to F*. Specifically, let F*_ε be the ε-neighborhood of F*, that is, F*_ε = { f : inf_{f′∈F*} ||f − f′||_∞ ≤ ε }; then it is easily shown that

| (1/n) Σ_{i=1}^n f(x*_i) − E_p f | ≤ 2ε,   ∀ f ∈ F*_ε.

Therefore, the SVGD approximation can be viewed as prioritizing the functions within, or close to, F*. This is different in nature from Monte Carlo, which approximates the expectations of all bounded variance functions with the same O(1/√n) error rate. Instead, SVGD shares more similarity with quadrature and sigma point methods, which also find points (particles) to match expectations on a certain class of functions, but mostly only for polynomial functions and for simple distributions such as uniform or Gaussian distributions. SVGD provides a more general approach that can match the moments of richer classes of functions for more general, complex multivariate distributions.
As we show in Section 3.3, when using polynomial kernels, SVGD reduces to matching polynomials when applied to multivariate Gaussian distributions.

In this view, the performance of SVGD is essentially decided by the Stein matching set F*. We shall design the algorithm, by engineering the kernels or feature maps, to make F* as large as possible in order to approximate the distribution well, or to include the test functions of actual interest, such as the mean and variance.

3.2 Fixed Point of Feature-based SVGD

One undesirable property of F* in Lemma 3.1 is that it depends on the values of the fixed point particles X*, whose properties are difficult to characterize a priori. This makes it difficult to infer what kernel should be used to obtain a desirable F*. It turns out that the dependency of F* on X* can be essentially decoupled by using degenerate kernels corresponding to a finite number of feature maps. Specifically, we consider kernels of the form

k(x, x′) = Σ_{ℓ=1}^m φ_ℓ(x) φ_ℓ(x′),

where we assume the number m of features is no larger than the particle size n. Then, the fixed point condition of SVGD reduces to

E_{x∼μ̂_{X*}}[ Σ_{ℓ=1}^m P_x φ_ℓ(x) φ_ℓ(x*_j) ] = 0,   ∀ j ∈ [n].   (10)

Define Φ = [φ_ℓ(x*_j)]_{ℓ,j}, which is a matrix of size (m × n). If rank(Φ) ≥ m, then (10) reduces to

E_{x∼μ̂_{X*}}[ P_x φ_ℓ(x) ] = 0,   ∀ ℓ = 1, ..., m,   (11)

where the test function f(x) := P_x φ_ℓ(x) no longer depends on the fixed point X*.

Theorem 3.2.
Assume X* is a fixed point of SVGD with kernel k(x, x′) = Σ_{ℓ=1}^m φ_ℓ(x) φ_ℓ(x′). Define the (m × n) matrix Φ = [φ_ℓ(x*_i)]_{ℓ∈[m], i∈[n]}. If rank(Φ) ≥ m, then

(1/n) Σ_{i=1}^n f(x*_i) = E_p f,   ∀ f ∈ F*,

where the Stein matching set F* is the linear span of {P_x φ_ℓ(x)}_{ℓ=1}^m ∪ {1}; that is, it is the set of functions of the form

f(x) = Σ_{ℓ=1}^m a_ℓ^T P_x φ_ℓ(x) + b,   ∀ a_ℓ ∈ R^d, b ∈ R.   (12)

Note that the rank condition implies that we must have m ≤ n. The idea is that n particles can at most match n linearly independent features exactly. Here, although the rank condition still depends on the fixed point X* = {x*_i}_{i=1}^n and cannot be guaranteed a priori, it can be numerically verified once we obtain the values of X*. In our experiments, we find that the rank condition tends to hold in practice whenever n = m. In cases when it does fail, we can always rerun the algorithm with a larger n until it is satisfied. Intuitively, it seems to require bad luck to have a low-rank Φ when there are more particles than features (n ≥ m), although a theoretical guarantee is still missing.

Query-Specific Inference as Solving the Stein Equation Assume we are interested in a query-specific task of estimating E_p f for a specific test function f. In this case, we should ideally select the features {φ_ℓ}_ℓ such that (12) holds, to yield an exact estimation of E_p f. By the linearity of the Stein operator, (12) is equivalent to

Stein Equation:   f(x) = P_x^T φ(x) + b,   (13)

where φ(x) = Σ_{ℓ=1}^m a_ℓ φ_ℓ(x).
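As a concrete illustration (our example, not from the paper), the Stein equation (13) can be solved by hand for simple targets. For the one-dimensional standard normal p = N(0, 1), the score is ∇_x log p(x) = −x, and taking f(x) = x² gives:

```latex
\text{For } p = \mathcal{N}(0,1):\qquad
\mathcal{P}_x \phi(x) = -x\,\phi(x) + \phi'(x). \\[4pt]
\text{Take } f(x) = x^2,\ \phi(x) = -x,\ b = 1:\qquad
\mathcal{P}_x \phi(x) + b = x^2 - 1 + 1 = x^2 = f(x). \\[4pt]
\text{Hence } \mathbb{E}_p[x^2] = b = 1.
```

Note that φ(x) = −x lies in the span of the linear feature φ₁(x) = x, consistent with the linear-kernel result of Section 3.3: once the Stein equation for f is solved, the expectation E_p f is read off as the constant b.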
Eq (13) is known as the Stein equation when solving for φ and b given f, which effectively computes the inverse of the Stein operator.

The Stein equation plays a central role in Stein's method as a theoretical tool (Barbour & Chen, 2005). Here, we highlight its fundamental connection to the approximate inference problem: if we can exactly solve for φ and b given f, then the inference problem regarding f is already solved (without running SVGD), since we can easily see that E_p f = b by taking the expectation of both sides of (13). Mathematically, this reduces the integration problem of estimating E_p f to solving a differential equation. It suggests that the Stein equation is at least as hard as the inference problem itself, and we should not expect a tractable way to solve it in general cases. On the other hand, it also suggests that efficient approximate inference methods may be developed from approximate solutions of the Stein equation. A similar idea has been investigated by Oates et al. (2017), which developed a kernel approximation of the Stein equation based on a given set of points. SVGD allows us to further extend this idea by optimizing the set of points (particles) on which the approximation is defined.

3.3 Linear Feature SVGD is Exact for Gaussian

Although the Stein equation is difficult to solve in general, it is significantly simplified when the distribution p of interest is Gaussian. In the following, we show that when p is a multivariate Gaussian distribution, we can use linear features, related to the linear kernel k(x, x′) = x^T x′ + 1, to ensure that SVGD exactly estimates all the first and second order moments of p. This insight provides important practical guidance on the optimal kernel choice for Gaussian-like distributions.

Theorem 3.3.
Assume X* is a fixed point of SVGD with the polynomial kernel k(x, x′) = x^T x′ + 1. Let F* be the Stein matching set in Theorem 3.2. If p is a multivariate normal distribution on R^d, then F* ⊆ Poly(2), where Poly(2) is the set of all polynomials up to the second order, that is, Poly(2) = { x^T A x + b^T x + c : A ∈ R^{d×d}, b ∈ R^d, c ∈ R }.

Further, denote by Φ the (d + 1) × n matrix defined by

Φ = [ x₁  x₂  ···  x_n
       1   1   ···   1 ].

If rank(Φ) ≥ d + 1, then F* = Poly(2). In this case, any fixed point of SVGD exactly estimates both the mean and the covariance matrix of the target distribution.

More generally, if the features are polynomials of order j, the related Stein matching set consists of polynomials of order j + 1 for Gaussian distributions. We do not investigate this further because it is less common to estimate higher order moments in multivariate settings.

Theorem 3.3 suggests that it is a good heuristic to include linear features in SVGD, because Gaussian-like distributions appear widely thanks to the central limit theorem and the Bernstein–von Mises theorem, and the main goal of inference is often to estimate the mean and variance. In contrast, the more commonly used Gaussian RBF kernel does not enjoy similar exact recovery results for the mean and variance, even for Gaussian distributions.

A nice property of our result is that once we use fewer features than particles and solve the fixed point exactly, the features do not "interfere" with each other.
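Theorem 3.3 is easy to check numerically. Below is a small sketch (ours; the target parameters, particle count, step size, and iteration budget are arbitrary illustrative choices) that runs SVGD with the linear kernel k(x, x′) = x^T x′ + 1 on a 2D Gaussian and compares the particle moments with the true ones.

```python
import numpy as np

# Target: p = N(mu, Sigma) in 2D (values chosen arbitrarily for illustration).
mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
Sinv = np.linalg.inv(Sigma)

def score(X):
    # grad_x log p(x) = -Sigma^{-1} (x - mu), stacked as rows
    return -(X - mu) @ Sinv

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 2))     # n = 20 particles, so rank(Phi) >= d + 1 is easy to satisfy

eps = 0.05
for _ in range(3000):
    K = X @ X.T + 1.0                # linear kernel matrix k(x_j, x_i) = x_j^T x_i + 1
    # phi(x_i) = (1/n) sum_j [ s(x_j) k(x_j, x_i) + grad_{x_j} k(x_j, x_i) ]
    #          = (1/n) (K @ S)_i + x_i,   since grad_{x_j}(x_j^T x_i + 1) = x_i
    X = X + eps * ((K @ score(X)) / len(X) + X)
```

At convergence the particle mean and the (1/n-normalized) particle covariance closely match mu and Sigma, as the theorem predicts; with an RBF kernel the variance estimate would not enjoy this exactness.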
This allows us to "program" our algorithm by adding different types of features that serve different purposes in different cases.

3.4 Random Feature SVGD

Linear features are not sufficient for providing a consistent estimation of the whole distribution, even for Gaussian distributions. Non-degenerate kernels are required to obtain bounds on the whole distribution, but they complicate the analysis because their Stein matching set depends on the solution X*, as shown in Lemma C.1. Random features can be used to sidestep this difficulty (Rahimi & Recht, 2007), enabling us to analyze a random feature variant of SVGD with probabilistic bounds.

To set up, assume k(x, x′) is a universal kernel whose Stein discrepancy D_k(q || p) yields a discriminative measure of the difference between distributions. Assume k(x, x′) admits the random feature representation in (5), and we approximate it by drawing m random features,

k̂(x, x′) = (1/m) Σ_{ℓ=1}^m φ(x, w_ℓ) φ(x′, w_ℓ),

where the w_ℓ are i.i.d. drawn from p_w. We assume m ≤ n; then running SVGD with the kernel k̂(x, x′) (with the random features fixed during the iterations) yields a matching set that decouples from the fixed point X*. In this way, our result below establishes that D_k(μ̂_{X*} || p) = Õ(1/√n) with high probability. According to (4), this provides a uniform bound on E_{μ̂_{X*}} f − E_p f for all functions in the unit ball of H_p⁺.

Here, random features are introduced mainly to facilitate the theoretical analysis, but we also find that random feature SVGD works comparably to, and sometimes even better than, SVGD with the original non-degenerate kernel (see Appendix).
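The random feature construction in (5)–(6) is simple to verify numerically. The sketch below (ours) draws m features and checks that the finite-feature kernel k̂ approaches the RBF kernel it represents; m and h are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, h = 3, 20000, 1.0
W1 = rng.standard_normal((m, d))          # w1 ~ N(0, I)
w0 = rng.uniform(0.0, 2 * np.pi, m)       # w0 ~ Unif([0, 2*pi])

def features(X):
    """phi(x, w) = sqrt(2) * cos(w1^T x / h + w0), as in Eq. (6)."""
    return np.sqrt(2.0) * np.cos(X @ W1.T / h + w0)

X = rng.standard_normal((5, d))
K_hat = features(X) @ features(X).T / m   # finite-feature kernel k_hat, Eq. (5) with m draws
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, -1)
K = np.exp(-sq / (2 * h ** 2))            # exact Gaussian RBF kernel
```

The entrywise gap |K_hat − K| shrinks at the usual Monte Carlo O(1/√m) rate, so with m large the finite-rank kernel k̂ is a close surrogate for the universal kernel while keeping the Stein matching set decoupled from X*.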
This is because with a finite number n of particles, at most n basis functions of k(x, x′) can be effectively used, even if k(x, x′) itself has infinite rank. From the perspective of moment matching, there is no benefit to using universal kernels when the particle size n is finite.

In the sequel, we first explain the intuitive idea behind our result, highlighting a perspective that views inference as fitting a zero-valued curve with Stein's identity, and then introduce the technical details.

Distributional Inference as Fitting Stein's Identity Recall that our goal can be viewed as finding particles X* = {x*_i} such that their empirical measure μ̂_{X*} approximates the target distribution p. We re-frame this as finding μ̂_{X*} such that Stein's identity holds (approximately):

Find μ̂_{X*}   s.t.   E_{μ̂_{X*}}[ P_x φ(x, w) ] ≈ 0,   ∀ w.

We may view this as a special curve fitting problem: letting g_X(w) = E_{μ̂_X}[P_x φ(x, w)], we want to find a "parameter" X such that g_X(w) ≈ 0 for all inputs w.
The kernelized Stein discrepancy (KSD), as shown in (7), can be viewed as the root mean square loss of this fitting problem:

D_k²(μ̂_X || p) = E_{w∼p_w}[ ||g_X(w)||₂² ].   (14)

When replacing k(x, x′) with its random feature approximation k̂(x, x′), the corresponding KSD can be viewed as an empirical loss on a random sample {w_ℓ} from p_w:

D_k̂²(μ̂_X || p) = (1/m) Σ_{ℓ=1}^m [ ||g_X(w_ℓ)||₂² ].

By running SVGD with k̂(x, x′), we achieve g_{X*}(w_ℓ) = 0 for all ℓ at the fixed point, implying a zero empirical loss D_k̂(μ̂_{X*} || p) = 0, assuming the rank condition holds.

The key question, however, is to bound the expected loss D_k(μ̂_{X*} || p), which can be achieved using generalization bounds from statistical learning theory. In fact, standard results in learning theory suggest that the difference between the empirical loss and the expected loss is O(m^{−1/2}), yielding D_k²(μ̂_{X*} || p) = O(m^{−1/2}). However, following (4), this only implies E_{μ̂_{X*}} f − E_p f = O(m^{−1/4}) for f ∈ H_p⁺, which does not achieve the standard O(m^{−1/2}) rate. Fortunately, our setting is noise-free and we achieve zero empirical loss; thus, we can get a better rate of D_k²(μ̂_X || p) = Õ(m^{−1}) using the techniques in Srebro et al. (2010).

Bound for Random Features We now present our concentration bounds for random feature SVGD.

Assumption 3.4. 1) Assume {φ(x, w_ℓ)}_{ℓ=1}^m is a set of random features with the w_ℓ i.i.d.
drawn from p_w on domain W, and X* = {x*_i}_{i=1}^n is an approximate fixed point of SVGD with the random features φ(x, w_ℓ), in the sense that

| E_{x∼μ̂_{X*}} P_{x_j} φ(x, w_ℓ) | ≤ ε_j / √m,

where P_{x_j} is the Stein operator w.r.t. the j-th coordinate x_j of x. Assume ε² := Σ_{j=1}^d ε_j² < ∞.

2) Let sup_{x∈X, w∈W} |P_{x_j} φ(x, w)| = M_j, and M² := Σ_{j=1}^d M_j² < ∞. This may imply that X has to be compact, since ∇_x log p(x) is typically unbounded on a non-compact X (e.g., when p is standard Gaussian, ∇_x log p(x) = −x).

3) Define the function set

P_j Φ = { w ↦ P_{x_j} φ(x, w) : ∀ x ∈ X }.

We assume the Rademacher complexity of P_j Φ satisfies R_m(P_j Φ) ≤ R_j / √m, and R² := Σ_{j=1}^d R_j² < ∞.

Theorem 3.5. Under Assumption 3.4, for any δ > 0, we have with probability at least 1 − δ (in terms of the randomness of the feature parameters {w_ℓ}_{ℓ=1}^m),

D_k(μ̂_{X*} || p) ≤ (C/√m) [ ε² + log³ m + log(1/δ) ]^{1/2},   (15)

where C is a constant that depends on R and M.

Remark Recalling (4), Eq (15) provides a uniform bound

sup_{||f||_{H_p⁺} ≤ 1} { E_{μ̂_{X*}} f − E_p f } = O(m^{−1/2} log^{1.5} m).

This bound controls the worst error uniformly among all f ∈ H_p⁺. It is unclear if the logarithm factor log^{1.5} m is essential. In the following, we present a result that has an O(1/√m) rate without the logarithm factor, but that only holds for individual functions.

Theorem 3.6. Let F_∞ be the linear span of the Steinalized features:

f(x) = E_{w∼p_w}[ v(w)^T P_x φ(x, w) ],   (16)

where v(w) = [v₁(w), ..., v_d(w)] ∈ R^d is a vector of combination weights that satisfies sup_w ||v(w)||_∞ <
We may de\ufb01ne a norm on F\u221e by ||f||2F\u221e := inf v\nall v(w) that satis\ufb01es (16).\nAssume Assumption 3.4 holds, then for any given function f \u2208 F\u221e with ||f||F\u221e \u2264 1, we have with\nat least probability 1 \u2212 \u03b4,\n\n|E\u02c6\u00b5X\u2217 f \u2212 Epf| \u2264 C\u221a\nm\n\n(1 + \u0001 +(cid:112)2 log(1/\u03b4)),\n\nwhere C is a constant that depends on R and M.\nThe F\u221e de\ufb01ned above is closely related to the RKHS Hp. In fact, one can show that F\u221e is a dense\nsubset of Hp (Rahimi & Recht, 2008) and is hence quite rich if k(x, x(cid:48)) is set to be universal.\n\n8\n\n\f4 Conclusion\n\nWe analyze SVGD through the eyes of moment matching. Our results are non-asymptotic in nature\nand provide an insightful framework for understanding the in\ufb02uence of kernels in the behavior of\nSVGD \ufb01xed points. Our framework suggests promising directions to develop systematic ways of\noptimizing the choice of kernels, especially for the query-speci\ufb01c inference that focuses on speci\ufb01c\ntest functions. A particularly appealing idea is to \u201cprogram\u201d the inference algorithm by adding\nfeatures that serve speci\ufb01c purposes so that the algorithm can be easily adapted to meet the needs\nof different users. In general, we expect that the connection between approximation inference and\nStein\u2019s identity and Stein equation will provide further opportunities for deriving new generations of\napproximate inference algorithms.\nAnother advantage of our framework is that it separates the design of the \ufb01xed point equation with\nthe numerical algorithm used to achieve the \ufb01xed point. In this way, the iterative algorithm does\nnot have to be derived as an approximation of an in\ufb01nite dimensional gradient \ufb02ow, in contrast to\nthe original SVGD. 
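As a toy illustration of this separation, one can write down the SVGD fixed point equation for a simple target and hand it to any generic iterative scheme. The sketch below is our own illustration, not from the paper: it uses a one-dimensional standard Gaussian target with an RBF kernel, and the bandwidth, step size, and iteration count are arbitrary choices.

```python
import numpy as np

def svgd_direction(x, h=1.0):
    # phi(x_i) = (1/n) * sum_j [ k(x_j, x_i) * d/dx_j log p(x_j) + d/dx_j k(x_j, x_i) ]
    # for a 1D standard Gaussian target p, with RBF kernel k(a, b) = exp(-(a - b)^2 / h).
    diff = x[:, None] - x[None, :]        # diff[j, i] = x_j - x_i
    K = np.exp(-diff ** 2 / h)            # kernel matrix k(x_j, x_i)
    grad_K = -2.0 / h * diff * K          # d/dx_j k(x_j, x_i)
    score = -x                            # grad log p(x) = -x for N(0, 1)
    return (K * score[:, None] + grad_K).mean(axis=0)

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=1.0, size=20)  # particles initialized far from the target

# Solve phi(X) = 0 by damped fixed-point iteration; any root-finding or
# accelerated scheme could be substituted here.
for _ in range(2000):
    x = x + 0.5 * svgd_direction(x)

print(np.abs(svgd_direction(x)).max())  # residual of the fixed point equation (near zero)
print(x.mean(), x.var())                # roughly match the moments of N(0, 1)
```

The point of the sketch is that the solver loop is interchangeable: the fixed point equation $\phi(X) = 0$, not the particular iteration, defines the approximation.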
This allows us to apply various practical numerical methods and acceleration techniques to solve the fixed point equation faster, with convergence guarantees.

References

Anderes, Ethan and Coram, Marc. A general spline representation for nonparametric and semiparametric density estimates using diffeomorphisms. arXiv preprint arXiv:1205.5314, 2012.

Barbour, Andrew D and Chen, Louis Hsiao Yun. An introduction to Stein's method, volume 4. World Scientific, 2005.

Bartlett, Peter L and Mendelson, Shahar. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.

Chen, Changyou, Zhang, Ruiyi, Wang, Wenlin, Li, Bai, and Chen, Liqun. A unified particle-optimization framework for scalable Bayesian sampling. arXiv preprint arXiv:1805.11659, 2018.

Chwialkowski, Kacper, Strathmann, Heiko, and Gretton, Arthur. A kernel test of goodness-of-fit. In International Conference on Machine Learning (ICML), 2016.

Feng, Yihao, Wang, Dilin, and Liu, Qiang. Learning to draw samples with amortized Stein variational gradient descent. In Conference on Uncertainty in Artificial Intelligence (UAI), 2017.

Gorham, Jackson and Mackey, Lester. Measuring sample quality with kernels. In International Conference on Machine Learning (ICML), 2017.

Haarnoja, Tuomas, Tang, Haoran, Abbeel, Pieter, and Levine, Sergey. Reinforcement learning with deep energy-based policies. In International Conference on Machine Learning (ICML), pp. 1352–1361, 2017.

Kim, Taesup, Yoon, Jaesik, Dia, Ousmane, Kim, Sungwoong, Bengio, Yoshua, and Ahn, Sungjin. Bayesian model-agnostic meta-learning. In Advances in Neural Information Processing Systems (NIPS), 2018.

Koller, Daphne and Friedman, Nir. Probabilistic graphical models: principles and techniques. MIT Press, 2009.

Liu, Qiang and Wang, Dilin.
Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems (NIPS), pp. 2378–2386, 2016.

Liu, Qiang, Lee, Jason, and Jordan, Michael. A kernelized Stein discrepancy for goodness-of-fit tests. In International Conference on Machine Learning (ICML), pp. 276–284, 2016.

Liu, Yang, Ramachandran, Prajit, Liu, Qiang, and Peng, Jian. Stein variational policy gradient. In Conference on Uncertainty in Artificial Intelligence (UAI), 2017.

Lu, Jianfeng, Lu, Yulong, and Nolen, James. Scaling limit of the Stein variational gradient descent part I: The mean field regime. arXiv preprint arXiv:1805.04035, 2018.

Oates, Chris J, Girolami, Mark, and Chopin, Nicolas. Control functionals for Monte Carlo integration. Journal of the Royal Statistical Society, Series B, 2017.

Ollivier, Yann, Pajot, Hervé, and Villani, Cédric. Optimal Transport: Theory and Applications, volume 413. Cambridge University Press, 2014.

Pu, Yuchen, Gan, Zhe, Henao, Ricardo, Li, Chunyuan, Han, Shaobo, and Carin, Lawrence. VAE learning via Stein variational gradient descent. In Advances in Neural Information Processing Systems (NIPS), pp. 4239–4248, 2017.

Rahimi, Ali and Recht, Benjamin. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems (NIPS), pp. 1177–1184, 2007.

Rahimi, Ali and Recht, Benjamin. Uniform approximation of functions with random bases. In Communication, Control, and Computing, 46th Annual Allerton Conference on, pp. 555–561. IEEE, 2008.

Srebro, Nathan, Sridharan, Karthik, and Tewari, Ambuj. Smoothness, low noise and fast rates. In Advances in Neural Information Processing Systems (NIPS), pp. 2199–2207, 2010.

Stein, Charles. A bound for the error in the normal approximation to the distribution of a sum of dependent random variables.
In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Volume 2: Probability Theory, pp. 583–602, 1972.

Wainwright, Martin J, Jordan, Michael I, et al. Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning, 1(1–2):1–305, 2008.

Wang, Dilin and Liu, Qiang. Learning to draw samples: With application to amortized MLE for generative adversarial learning. arXiv preprint arXiv:1611.01722, 2016.

Zhu, Yinhao and Zabaras, Nicholas. Bayesian deep convolutional encoder–decoder networks for surrogate modeling and uncertainty quantification. Journal of Computational Physics, 366:415–447, 2018.