{"title": "Stein Variational Gradient Descent With Matrix-Valued Kernels", "book": "Advances in Neural Information Processing Systems", "page_first": 7836, "page_last": 7846, "abstract": "Stein variational gradient descent (SVGD) is a particle-based inference algorithm that  leverages gradient information for efficient approximate inference. In this work, we enhance SVGD by leveraging preconditioning matrices, such as the Hessian and Fisher information matrix, to incorporate geometric information into SVGD updates. We achieve this by presenting a generalization of SVGD that replaces the scalar-valued kernels in vanilla SVGD with more general matrix-valued kernels. This yields a significant extension of SVGD, and more importantly, allows us to flexibly incorporate various preconditioning matricesto accelerate the exploration in the probability landscape. Empirical results show that our method outperforms vanilla SVGD and a variety of baseline approaches over a range of real-world Bayesian inference tasks.", "full_text": "Stein Variational Gradient Descent with\n\nMatrix-Valued Kernels\n\nDilin Wang* Ziyang Tang\u21e4 Chandrajit Bajaj Qiang Liu\n\nDepartment of Computer Science, UT Austin\n\n{dilin, ztang, bajaj, lqiang}@cs.utexas.edu\n\nAbstract\n\nStein variational gradient descent (SVGD) is a particle-based inference algorithm\nthat leverages gradient information for ef\ufb01cient approximate inference. In this work,\nwe enhance SVGD by leveraging preconditioning matrices, such as the Hessian\nand Fisher information matrix, to incorporate geometric information into SVGD\nupdates. We achieve this by presenting a generalization of SVGD that replaces the\nscalar-valued kernels in vanilla SVGD with more general matrix-valued kernels.\nThis yields a signi\ufb01cant extension of SVGD, and more importantly, allows us to\n\ufb02exibly incorporate various preconditioning matrices to accelerate the exploration\nin the probability landscape. 
Empirical results show that our method outperforms\nvanilla SVGD and a variety of baseline approaches over a range of real-world\nBayesian inference tasks.\n\n1\n\nIntroduction\n\nApproximate inference of intractable distributions is a central task in probabilistic learning and\nstatistics. An efficient approximate inference algorithm must perform both efficient optimization\nto explore the high probability regions of the distributions of interest, and reliable uncertainty\nquantification for evaluating the variation of the given distributions. Stein variational gradient descent\n(SVGD) (Liu & Wang, 2016) is a deterministic sampling algorithm that achieves both desiderata by\noptimizing the samples using a procedure similar to gradient-based optimization, while achieving\nreliable uncertainty estimation using an interacting repulsive mechanism. SVGD has been shown\nto provide a fast and flexible alternative to traditional methods such as Markov chain Monte Carlo\n(MCMC) (e.g., Neal et al., 2011; Hoffman & Gelman, 2014) and parametric variational inference\n(VI) (e.g., Wainwright et al., 2008; Blei et al., 2017) in various challenging applications (e.g., Pu\net al., 2017; Wang & Liu, 2016; Kim et al., 2018; Haarnoja et al., 2017).\nOn the other hand, standard SVGD uses only first-order gradient information, and cannot leverage\nsecond-order methods, such as Newton's method and natural gradient, to achieve\nbetter performance on challenging problems with complex loss landscapes or domains. Unfortunately,\ndue to the special form of SVGD, it is not straightforward to derive second-order extensions of\nSVGD by simply extending similar ideas from optimization. 
While this problem has been recently\nconsidered (e.g., Detommaso et al., 2018; Liu & Zhu, 2018; Chen et al., 2019), the presented solutions\neither require heuristic approximations (Detommaso et al., 2018), or lead to complex algorithmic\nprocedures that are difficult to implement in practice (Liu & Zhu, 2018).\nOur solution to this problem is through a key generalization of SVGD that replaces the original scalar-valued positive definite kernels in SVGD with a class of more general matrix-valued positive definite\nkernels. Our generalization includes all previous variants of SVGD (e.g., Wang et al., 2018; Han &\nLiu, 2018) as special cases. More significantly, it allows us to easily incorporate various structured\n\n*Equal contribution\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fpreconditioning matrices into SVGD updates, including both Hessian and Fisher information matrices,\nas part of the generalized matrix-valued positive definite kernels. We develop theoretical results\nthat shed insight on the optimal design of the matrix-valued kernels, and also propose simple and fast\npractical procedures. We empirically evaluate both Newton- and Fisher-based extensions of SVGD on\nvarious practical benchmarks, including Bayesian neural regression and sentence classification, on\nwhich our methods show significant improvement over vanilla SVGD and other baseline approaches.\n\nNotation and Preliminary For notation, we use bold lower-case letters (e.g., x) for vectors in Rd,\nand bold upper-case letters (e.g., Q) for matrices. A symmetric function k : Rd × Rd → R is called a\npositive definite kernel if Σij ci k(xi, xj) cj ≥ 0 for any {ci} ⊂ R and {xi} ⊂ Rd. 
Every positive definite kernel k(x, x') is associated with a reproducing kernel Hilbert space (RKHS) Hk, which\nconsists of the closure of functions of the form\n\nf(x) = Σi ci k(x, xi), ∀{ci} ⊂ R, {xi} ⊂ Rd, (1)\n\nfor which the inner product and norm are defined by ⟨f, g⟩Hk = Σij ci sj k(xi, xj) and ‖f‖²Hk = Σij ci cj k(xi, xj),\nwhere we assume g(x) = Σi si k(x, xi). Denote by Hdk := Hk × . . . × Hk\nthe vector-valued RKHS consisting of Rd-valued functions φ = [φ1, . . . , φd]⊤ with each\nφℓ ∈ Hk. See e.g., Berlinet & Thomas-Agnan (2011) for a more rigorous treatment. For notational\nconvenience, we do not distinguish distributions on Rd and their density functions.\n\n2 Stein Variational Gradient Descent (SVGD)\n\nWe introduce the basic derivation of Stein variational gradient descent (SVGD), which provides a\nfoundation for our new generalization. See Liu & Wang (2016, 2018); Liu (2017) for more details.\nLet p(x) be a positive and continuously differentiable probability density function on Rd. Our goal\nis to find a set of points (a.k.a. particles) {xi}ni=1 ⊂ Rd to approximate p, such that the empirical\ndistribution q(x) = Σi δ(x − xi)/n of the particles weakly converges to p when n is large. Here\nδ(·) denotes the Dirac delta function.\nSVGD achieves this by starting from a set of initial particles, and iteratively updating them with a\ndeterministic transformation of the form\n\nxi ← xi + ε φ*k(xi), ∀i = 1, · · · , n, with φ*k = arg max_{φ ∈ Bk} { − (d/dε) KL(q[εφ] ‖ p) |ε=0 }, (2)\n\nwhere ε is a small step size, φ*k : Rd → Rd is an optimal transform function chosen to maximize the\ndecreasing rate of the KL divergence between the distribution of the particles and the target p, q[εφ]\ndenotes the distribution of the updated particles x' = x + εφ(x) as x ∼ q, and Bk is the unit ball of\nthe RKHS Hdk := Hk × . . . × Hk associated with a positive definite kernel k(x, x'), that is,\n\nBk = {φ ∈ Hdk : ‖φ‖Hdk ≤ 1}. (3)\n\nLiu & Wang (2016) showed that the objective in (2) can be expressed as a linear functional of φ,\n\n− (d/dε) KL(q[εφ] ‖ p) |ε=0 = Ex∼q[P⊤φ(x)], P⊤φ(x) = ∇x log p(x)⊤φ(x) + ∇⊤x φ(x), (4)\n\nwhere P is a differential operator called the Stein operator; here we formally view P and the derivative\noperator ∇x as Rd column vectors, hence P⊤φ and ∇⊤x φ are viewed as inner products, e.g.,\n∇⊤x φ = Σdℓ=1 ∇xℓ φℓ, with xℓ and φℓ being the ℓ-th coordinates of the vectors x and φ, respectively.\n\nWith (4), it is shown in Liu & Wang (2016) that the solution of (2) is\n\nφ*k(·) ∝ Ex∼q[P k(x, ·)] = Ex∼q[∇x log p(x) k(x, ·) + ∇x k(x, ·)]. (5)\n\nSuch φ*k provides the best update direction for the particles within the RKHS Hdk. By taking q to be\nthe empirical measure of the particles, i.e., q(x) = Σni=1 δ(x − xi)/n, and repeatedly applying this\nupdate on the particles, we obtain the SVGD algorithm using equations (2) and (5).\n\n2\n\n\f3 SVGD with Matrix-valued Kernels\n\nOur goal is to extend SVGD to allow efficient incorporation of preconditioning information for better\noptimization. We achieve this by providing a generalization of SVGD that leverages more general\nmatrix-valued kernels, to flexibly incorporate preconditioning information.\nThe key idea is to observe that the standard SVGD searches for the optimal φ in the RKHS Hdk =\nHk × · · · × Hk, a product of d copies of an RKHS of scalar-valued functions, which does not allow us\nto encode potential correlations between different coordinates of φ. 
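Before generalizing the kernel, it helps to make the scalar-kernel baseline concrete. The following is a minimal NumPy sketch of the vanilla SVGD update (5) with an RBF kernel — purely illustrative, not the authors' released implementation; it uses a fixed bandwidth h rather than the median trick, and plain gradient steps rather than Adagrad:

```python
import numpy as np

def svgd_direction(X, grad_logp, h=1.0):
    """Vanilla SVGD direction (5) for particles X of shape (n, d),
    using the RBF kernel k(x, x') = exp(-||x - x'||^2 / (2 h^2)).
    grad_logp maps an (n, d) array of particles to their scores (n, d)."""
    diffs = X[:, None, :] - X[None, :, :]        # diffs[i, j] = x_i - x_j
    K = np.exp(-np.sum(diffs ** 2, axis=-1) / (2 * h ** 2))
    drive = K @ grad_logp(X)                     # sum_j k(x_j, x_i) * score(x_j)
    # repulse[i] = sum_j grad_{x_j} k(x_j, x_i) = sum_j (x_i - x_j) k(x_i, x_j) / h^2
    repulse = np.sum(diffs * K[..., None], axis=1) / h ** 2
    return (drive + repulse) / X.shape[0]

# Toy usage: approximate N(0, 1) starting from particles centered at 5.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=1.0, size=(50, 1))
for _ in range(1000):
    X = X + 0.05 * svgd_direction(X, lambda Z: -Z)   # score of N(0, 1) is -x
```

Note that the same scalar kernel k is applied independently to every coordinate of the update.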
This limitation can be addressed\nby replacing Hd\nk with a more general RKHS of vector-valued functions (called vector-valued RKHS),\nwhich uses more \ufb02exible matrix-valued positive de\ufb01nite kernels to specify rich correlation structures\nbetween different coordinates. In this section, we \ufb01rst introduce the background of vector-valued\nRKHS with matrix-valued kernels in Section 3.1, and then propose and discuss our generalization of\nSVGD using matrix-valued kernels in Section 3.2-3.3.\n\n3.1 Vector-Valued RKHS with Matrix-Valued Kernels\nWe now introduce the background of matrix-valued positive de\ufb01nite kernels, which provides a most\ngeneral framework for specifying vector-valued RKHS. We focus on the intuition and key ideas in\nour introduction, and refer the readers to Alvarez et al. (2012); Carmeli et al. (2006) for mathematical\ntreatment.\nRecall that a standard real-valued RKHS Hk consists of the closure of the linear span of its kernel\nfunction k(\u00b7, x), as shown in (1). Vector-valued RKHS can be de\ufb01ned in a similar way, but consist of\nthe linear span of a matrix-valued kernel function:\n\nf (x) =Xi\n\nK(x, xi)ci,\n\n(6)\n\nfor any {ci}\u21e2 Rd and {xi}\u21e2 Rd, where K : Rd \u21e5 Rd ! Rd\u21e5d is now a matrix-valued kernel\nfunction, and ci are vector-valued weights. Similar to the scalar case, we can de\ufb01ne an inner product\nstructure hf , giHK =Pij c>i K(xi, xj)sj, where we assume g =Pi K(x, xi)si, and hence a\n=Pij c>i K(xi, xj)cj. In order to make the inner product and norm well de\ufb01ned,\nnorm kfk2\nHk\nthe matrix-value kernel K is required to be symmetric in that K(x, x0) = K(x0, x)>, and positive\nde\ufb01nite in thatPij c>i K(xi, xj)cj  0, for any {xi}\u21e2 Rd and {ci}\u21e2 Rd.\nMathematically, one can show that the closure of the set of functions in (6), equipped with the inner\nproduct de\ufb01ned above, de\ufb01nes a RKHS that we denote by HK. 
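As a quick numerical sanity check of the symmetry and positive-definiteness conditions above (purely illustrative; the kernel K(x, x') = k(x, x') Q with a fixed positive definite matrix Q is a hypothetical example, not a kernel prescribed by the paper at this point), one can assemble the nd × nd block Gram matrix and inspect its eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 6, 3
X = rng.normal(size=(n, d))

# Scalar RBF Gram matrix k[i, j] = k(x_i, x_j).
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
k = np.exp(-0.5 * sq)

# A fixed positive definite matrix Q (hypothetical preconditioner).
A = rng.normal(size=(d, d))
Q = A @ A.T + d * np.eye(d)

# Block Gram matrix of K(x, x') = k(x, x') Q: block (i, j) is k[i, j] * Q,
# i.e. the Kronecker product k ⊗ Q.
G = np.kron(k, Q)
assert np.allclose(G, G.T)                    # symmetry: K(x, x') = K(x', x)^T
assert np.linalg.eigvalsh(G).min() > -1e-8    # sum_ij c_i^T K(x_i, x_j) c_j >= 0
```

Both checks pass for any positive definite scalar kernel and Q, since the eigenvalues of a Kronecker product of two positive semi-definite matrices are products of their (nonnegative) eigenvalues.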
It is \u201creproducing\u201d because it has\nthe following reproducing property that generalizes the version for scalar-valued RKHS: for any\nf 2H K and any c 2 Rd, we have\n\nf (x)>c = hf (\u00b7), K(\u00b7, x)ciHK ,\n\n(7)\nwhere it is necessary to introduce c because the result of the inner product of two functions must be a\nscalar. A simple example of matrix kernel is K(x, x0) = k(x, x0)I, where I is the d \u21e5 d identity\nmatrix. It is related RKHS is HK = Hk \u21e5\u00b7\u00b7\u00b7\u21e5H k = Hd\n3.2 SVGD with Matrix-Valued Kernels\nIt is now natural to leverage matrix-valued kernels to obtain a generalization of SVGD (see Algo-\nrithm 1). The idea is simple: we now optimize  in the unit ball of a general vector-valued RKHS\nHK with a matrix valued kernel K(x, x0):\n\nk, as used in the original SVGD.\n\n\u21e4K = arg max\n\n2HK Ex\u21e0q\u21e5P>(x)\u21e4 , s.t. kkHK \uf8ff 1 .\n\n(8)\n\nThis yields a simple closed form solution similar to (5).\n\nTheorem 1. Let K(x, x0) be a matrix-valued positive de\ufb01nite kernel that is continuously differen-\ntiable on x and x0, the optimal \u21e4 in (8) is\n\n\u21e4K(\u00b7) / Ex\u21e0q [K(\u00b7, x)P] = Ex\u21e0q [K(\u00b7, x)rx log p(x) + K(\u00b7, x)rx] ,\n\n(9)\n\n3\n\n\fAlgorithm 1 Stein Variational Gradient Descent with Matrix-valued Kernels (Matrix SVGD)\n\nInput: A (possibly unnormalized) differentiable density function p(x) in Rd. A matrix-valued\npositive de\ufb01nite kernel K(x, x0). Step size \u270f.\nGoal: Find a set of particles {xi}n\nInitialize a set of particles {xi}n\nrepeat\n\ni=1 to represent the distribution p.\ni=1, e.g., by drawing from some simple distribution.\n\nxi xi +\n\n\u270f\nn\n\nnXj=1\u21e5K(xi, xj)rxj log p(xj) + K(xi, xj)rxj\u21e4 ,\n\nwhere K(\u00b7, x)rx is formally de\ufb01ned as the product of matrix K(\u00b7, x) and vector rx. 
The `-th\nelement of K(\u00b7, x)rx is (K(\u00b7, x)rx)` =Pd\nuntil Convergence\n\nm=1 rxmK`,m(\u00b7, x); see also (10).\n\nwhere the Stein operator P and derivative operator rx are again formally viewed as Rd-valued\ncolumn vectors, and K(\u00b7, x)P and K(\u00b7, x)rx are interpreted by the matrix multiplication rule.\nTherefore, K(\u00b7, x)P is a Rd-valued column vector, whose `-th element is de\ufb01ned by\n\n(K(\u00b7, x)P)` =\n\n(K`,m(\u00b7, x)rxm log p(x) + rxmK`,m(\u00b7, x)) ,\n\n(10)\n\ndXm=1\n\nwhere K`,m(x, x0) denotes the (`, m)- element of matrix K(x, x0) and xm the m-th element of x.\nSimilar to the case of standard SVGD, recursively applying the optimal transform \u21e4K on the particles\nyields a general SVGD algorithm shown in Algorithm 1, which we call matrix SVGD.\nParallel to vanilla SVGD, the gradient of matrix SVGD in (9) consists of two parts that account for\noptimization and diversity, respectively: the \ufb01rst part is a weighted average of gradient rx log p(x)\nmultiplied by a matrix-value kernel K(\u00b7, x); the other part consists of the gradient of the matrix-\nvalued kernel K, which, like standard SVGD, serves as a repulsive force to keep the particles away\nfrom each other to re\ufb02ect the uncertainty captured in distribution p.\nMatrix SVGD includes various previous variants of SVGD as special cases. The vanilla SVGD\ncorresponds to the case when K(x, x0) = k(x, x0)I, with I as the d\u21e5d identity matrix; the gradient-\nfree SVGD of Han & Liu (2018) can be treated as the case when K(x, x0) = k(x, x0)w(x)w(x0)I,\nwhere w(x) is an importance weight function; the graphical SVGD of Wang et al. 
(2018); Zhuo et al.\n`=1], where\n(2018) corresponds to a diagonal matrix-valued kernel: K(x, x0) = diag[{k`(x, x0)}d\neach k`(x, x0) is a \u201clocal\u201d scalar-valued kernel function related to the `-th coordinate x` of vector x.\n\n3.3 Matrix-Valued Kernels and Change of Variables\n\nIt is well known that preconditioned gradient descent can be interpreted as applying standard gradient\ndescent on a reparameterization of the variables. For example, let y = Q1/2x, where Q is a positive\nde\ufb01nite matrix, then log p(x) = log p(Q1/2y). Applying gradient descent on y and transform it\nback to the updates on x yields a preconditioned gradient update x x + \u270fQ1rx log p(x).\nWe now extend this idea to SVGD, for which matrix-valued kernels show up naturally as a conse-\nquence of change of variables. This justi\ufb01es the use of matrix-valued kernels and provides guidance\non the practical choice of matrix-valued kernels. We start with a basic result of how matrix-valued\nkernels change under change of variables (see Paulsen & Raghupathi (2016)).\nLemma 2. Assume H0 is an RKHS with a matrix kernel K0 : Rd \u21e5 Rd ! Rd\u21e5d. Let H be the set\nof functions formed by\n\n(x) = M (x)0(t(x)),\n\n80 2H 0,\n\nwhere M : Rd ! Rd\u21e5d is a \ufb01xed matrix-valued function and we assume M (x) is an invertible\nmatrix for all x, and t : Rd ! Rd is a \ufb01xed continuously differentiable one-to-one transform on Rd.\nFor 8, 0 2H , we can identity an unique 0, 00 2H 0 such that (x) = M (x)0(t(x)) and\n0(x) = M (x)00(t(x)). De\ufb01ne the inner product on H via h, 0iH = h0, 00iH0, then H is\n\n4\n\n\falso a vector-valued RKHS, whose matrix-valued kernel is\n\nK(x, x0) = M (x)K0(t(x), t(x0))M (x0)>.\n\nWe now present a key result, which characterizes the change of kernels when we apply invertible\nvariable transforms on the SVGD trajectory.\nTheorem 3. 
i) Let p and q be two distributions and p0, q0 the distribution of x0 = t(x) when x is\ndrawn from p, q, respectively, where t is a continuous differentiable one-to-one map on Rd. Assume p\nis a continuous differentiable density with Stein operator P, and P0 the Stein operator of p0. We have\n(11)\n\nEx\u21e0q0[P>0 0(x)] = Ex\u21e0q[P>(x)],\n\n(x) := rt(x)10(t(x)),\n\nwhere rt is the Jacobian matrix of t.\nii) Therefore, in the asymptotics of in\ufb01nitesimal step size (\u270f ! 0+), running SVGD with kernel K0\non p0 is equivalent to running SVGD on p with kernel\n\nwith\n\nK(x, x0) = rt(x)1K0(t(x), t(x0))rt(x0)>,\n\nin the sense that the trajectory of these two SVGD can be mapped to each other by the one-to-one\nmap t (and its inverse).\n\n3.4 Practical Choice of Matrix-Valued Kernels\nTheorem 3 suggests a conceptual procedure for constructing proper matrix kernels to incorporate\ndesirable preconditioning information: one can construct a one-to-one map t so that the distribution\np0 of x0 = t(x) is an easy-to-sample distribution with a simpler kernel K0(x, x0), which can be a\nstandard scalar-valued kernel or a simple diagonal kernel. Practical choices of t often involve rotating\nx with either Hessian matrix or Fisher information, allowing us to incorporating these information\ninto SVGD. In the sequel, we \ufb01rst illustrate this idea for simple Gaussian cases and then discuss\npractical approaches for non-Gaussian cases.\n\nConstant Preconditioning Matrices Consider the simple case when p is multivariate Gaussian,\ne.g., log p(x) =  1\n2 x>Qx+const, where Q is a positive de\ufb01nite matrix. In this case, the distribution\np0 of the transformed variable t(x) = Q1/2x is the standard Gaussian distribution that can be better\napproximated with a simpler kernel K0(x, x0), which can be chosen to be the standard RBF kernel\nsuggested in Liu & Wang (2016), the graphical kernel suggested in Wang et al. (2018), or the linear\nkernels suggested in Liu & Wang (2018). 
Theorem 3 then suggests to use a kernel of form\n\nQ := (x x0)>Q(x x0) and h is a bandwidth parameter. De\ufb01ne K0,Q(x, x0) :=\n\nwhere ||x x0||2\nK0Q1/2x, Q1/2x0. One can show that the SVGD direction of the kernel in (12) equals\n\u21e4KQ(\u00b7) = Q1Ex\u21e0q[r log p(x)K0,Q(\u00b7, x) + K0,Q(\u00b7, x)rx] = Q1\u21e4K0,Q(\u00b7),\nwhich is a linear transform of the SVGD direction of kernel K0,Q(x, x0) with matrix Q1.\nIn practice, when p is non-Gaussian, we can construct Q by taking averaging over the particles. For\nx log p(x) the negative Hessian matrix at x, we can construct Q by\nexample, denote by H(x) = r2\n\n(14)\n\nQ =\n\nH(xi)/n,\n\n(15)\n\nwhere {xi}n\ninformation matrix to obtain a natural gradient like variant of SVGD.\n\ni=1 are the particles from the previous iteration. We may replace H with the Fisher\n\nnXi=1\n\n5\n\nin which Q is applied on both the input x and the output side. As an example, taking K0 to be the\nscalar-valued Gaussian RBF kernel gives\n\nKQ(x, x0) := Q1/2K0\u21e3Q1/2x, Q1/2x0\u2318 Q1/2,\nQ\u25c6 ,\nKQ(x, x0) = Q1 exp\u2713\n1\n2h||x  x0||2\n\n(12)\n\n(13)\n\n\fPoint-wise Preconditioning A constant preconditioning matrix can not re\ufb02ect different curvature\nor geometric information at different points. A simple heuristic to address this limitation is to replace\nthe constant matrix Q with a point-wise matrix function Q(x); this motivates a kernel of form\n\nK(x, x0) = Q1/2(x)K0Q1/2(x)x, Q1/2(x0)x0Q1/2(x0).\n\nUnfortunately, this choice may yield expensive computation and dif\ufb01cult implementation in practice,\nbecause the SVGD update involves taking the gradient of the kernel K(x, x0), which would need\nto differentiate through matrix valued function Q(x). 
When Q(x) equals the Hessian matrix, for\nexample, it involves taking the third order derivative of log p(x), yielding an unnatural algorithm.\n\nMixture Preconditioning We instead propose a more practical approach to achieve point-wise\npreconditioning with a much simpler algorithm. The idea is to use a weighted combination of several\nconstant preconditioning matrices. This involves leveraging a set of anchor points {z`}m\n`=1 \u21e2 Rd,\neach of which is associated with a preconditioning matrix Q` = Q(z`) (e.g., their Hessian or Fisher\ninformation matrices). In practice, the anchor points {z`}m\n`=1 can be conveniently set to be the same\nas the particles {xi}n\n\ni=1. We then construct a kernel by\n\nKQ`(x, x0)w`(x)w`(x0),\n\n(16)\n\nK(x, x0) =\n\nmX`=1\nprobability, and hence should satisfyPm\n\nwhere KQ`(x, x0) is de\ufb01ned in (12) or (13), and w`(x) is a positive scalar-valued function that\ndecides the contribution of kernel KQ` at point x. Here w`(x) should be viewed as a mixture\n`=1 w`(x) = 1 for all x. In our empirical studies, we take\n\nw`(x) as the Gaussian mixture probability from the anchor points:\n\n,\n\nN (x; z`, Q1\n\n` ) :=\n\n1\nZ`\n\nexp\u2713\n\nQ`\u25c6 ,\n1\n2 kx  z`k2\n\n(17)\n\nw`(x) =\n\nN (x; z`, Q1\n` )\nPm\n`0=1 N (x; z`0, Q1\n`0 )\n\nwhere Z` = (2\u21e1)d/2 det(Q`)1/2. In this way, each point x is mostly in\ufb02uenced by the anchor point\nclosest to it, allowing us to apply different preconditioning for different points. Importantly, the\nSVGD update direction related to the kernel in (16) has a simple and easy-to-implement form:\n\n\u21e4K(\u00b7) =\n\nmX`=1\n\nw`(\u00b7)Ex\u21e0q\u21e5(w`(x)KQ`(\u00b7, x))P\u21e4 =\n\nmX`=1\n\nw`(\u00b7)\u21e4w`KQ`\n\n(\u00b7),\n\n(18)\n\nwhich is a weighted sum of a number of SVGD directions with constant preconditioning matrices\n(but with an asymmetric kernel w`(x)KQ`(\u00b7, x)).\nA Remark on Stein Variational Newton (SVN) Detommaso et al. (2018) provided a Newton-like\nvariation of SVGD. 
It is motivated by an intractable functional Newton framework, and arrives at a\npractical algorithm using a series of approximation steps. The update of SVN is\n\nxi ← xi + ε H̃i⁻¹ φ*k(xi), ∀i = 1, . . . , n, (19)\n\nwhere φ*k(·) is the standard SVGD gradient, and H̃i is a Hessian-like matrix associated with particle\nxi, defined by\n\nH̃i = Ex∼q[H(x) k(x, xi)² + (∇xi k(x, xi))⊗2],\n\nwhere H(x) = −∇²x log p(x), and w⊗2 := ww⊤. Due to the approximation introduced in the\nderivation of SVN, it does not correspond to a standard functional gradient flow like SVGD (unless\nH̃i = Q for all i, in which case it reduces to using a constant preconditioning matrix on SVGD\nlike (14)). SVN can be heuristically viewed as a \"hard\" variant of (18), which assigns each particle\nits own preconditioning matrix with probability one, but the mathematical forms do not match\nprecisely. On the other hand, it is useful to note that the set of fixed points of the SVN update (19)\nis identical to that of the standard SVGD update with φ*k(·), once all H̃i are positive definite\nmatrices. This is because at the fixed points of (19), we have H̃i⁻¹ φ*k(xi) = 0 for all i = 1, . . . , n,\nwhich is equivalent to φ*k(xi) = 0 for all i when all the H̃i are positive definite. Therefore, SVN can\nbe justified as an alternative fixed-point iteration method that achieves the same set of fixed points as the\nstandard SVGD.\n\n6\n\n\f(a) Initializations (b) Vanilla SVGD (c) SVN (d) Matrix-SVGD (average) (e) Matrix-SVGD (mixture) (f) Log MMD vs. Iteration\n\nFigure 1: Figures (a)-(e) show the particles obtained by various methods at the 30-th iteration. Figure\n(f) plots the log MMD (Gretton et al., 2012) vs. 
training iteration starting from the 10-th iteration.\nWe use the standard RBF kernel for evaluating MMD.\n\n4 Experiments\n\nWe demonstrate the effectiveness of our matrix SVGD on various practical tasks. We start with\na toy example and then proceed to more challenging tasks that involve logistic regression, neural\nnetworks and recurrent neural networks. For our method, we take the preconditioning matrices to be\neither Hessian or Fisher information matrices, depending on the application. For large scale Fisher\nmatrices in (recurrent) neural networks, we leverage the Kronecker-factored (KFAC) approximation\nby Martens & Grosse (2015); Martens et al. (2018) to enable ef\ufb01cient computation. We use RBF\nkernel for vanilla SVGD. The kernel K0(x, x0) in our matrix SVGD (see (12) and (13)) is also taken\nto be Gaussian RBF. Following Liu & Wang (2016), we choose the bandwidth of the Gaussian RBF\nkernels using the standard median trick and use Adagrad (Duchi et al., 2011) for stepsize. Our code\nis available at https://github.com/dilinwang820/matrix_svgd.\nThe algorithms we test are summarized here:\nVanilla SVGD, using the code by Liu & Wang (2016);\nMatrix-SVGD (average), using the constant preconditioning matrix kernel in (13), with Q to be\neither the average of the Hessian matrices or Fisher matrices of the particles (e.g., (15));\nMatrix-SVGD (mixture), using the mixture preconditioning matrix kernel in (16), where we pick\nthe anchor points to be particles themselves, that is, {z`}m\nStein variational Newton (SVN), based on the implementation of Detommaso et al. (2018);\nPreconditioned Stochastic Langevin Dynamics (pSGLD), which is a variant of SGLD (Li\net al., 2016), using a diagonal approximation of Fisher information as the preconditioned matrix.\n\n`=1 = {xi}n\n\ni=1;\n\n4.1 Two-Dimensional Toy Examples\nSettings We start with illustrating our method using a Gaussian mixture toy model (Figure 1), with\nexact Hessian matrices for preconditioning. 
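For reference, the constant-preconditioner update being compared here (the kernel (13) and direction (14)) can be sketched in NumPy as follows — an illustrative sketch under simplifying assumptions: a fixed positive definite Q, a fixed bandwidth h instead of the median trick, and plain gradient steps instead of Adagrad:

```python
import numpy as np

def matrix_svgd_direction(X, grad_logp, Q, h=1.0):
    """Matrix-SVGD direction for the constant-preconditioner kernel (13),
    K_Q(x, x') = Q^{-1} exp(-||x - x'||_Q^2 / (2 h)); by (14) this equals
    Q^{-1} times the SVGD direction computed with the Q-weighted RBF kernel."""
    diffs = X[:, None, :] - X[None, :, :]                    # x_i - x_j
    sqQ = np.einsum('ijk,kl,ijl->ij', diffs, Q, diffs)       # ||x_i - x_j||_Q^2
    K = np.exp(-sqQ / (2 * h))
    drive = K @ grad_logp(X)                                 # kernel-weighted scores
    # repulse[i] = sum_j grad_{x_j} k_Q(x_j, x_i) = sum_j K[i,j] Q (x_i - x_j) / h
    repulse = np.einsum('ij,kl,ijl->ik', K, Q, diffs) / h
    phi = (drive + repulse) / X.shape[0]
    return np.linalg.solve(Q, phi.T).T                       # premultiply by Q^{-1}

# Toy usage: an ill-conditioned Gaussian target with precision matrix Q,
# so Q is exactly the Hessian of -log p.
rng = np.random.default_rng(0)
Q = np.diag([100.0, 1.0])
X = 3.0 + 0.5 * rng.normal(size=(30, 2))
for _ in range(1000):
    X = X + 0.5 * matrix_svgd_direction(X, lambda Z: -Z @ Q, Q)
```

Because the gradient is premultiplied by Q⁻¹, both coordinates contract at the same rate despite the 100:1 conditioning of the target.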
For fair comparison, we search for the best learning rate for\nall algorithms exhaustively. We use 50 particles in all cases. We use the same initialization for\nall methods with the same random seeds.\n\nResults Figure 1 shows the results for the 2D toy examples. Appendix B shows more visualizations and\nresults on more examples. We can see that methods with Hessian information generally converge\nfaster than vanilla SVGD, and Matrix-SVGD (mixture) yields the best performance.\n\n4.2 Bayesian Logistic Regression\nSettings We consider the Bayesian logistic regression model for binary classification. Let D =\n{(xj, yj)}Nj=1 be a dataset with feature vectors xj and binary labels yj ∈ {0, 1}. The distribution of\ninterest is\n\np(θ | D) ∝ p(D | θ) p0(θ) with p(D | θ) = ∏Nj=1 [yj σ(θ⊤xj) + (1 − yj) σ(−θ⊤xj)],\n\nwhere σ(z) := 1/(1 + exp(−z)), and p0(θ) is the prior distribution, which we set to be the standard\nnormal N(θ; 0, I). The goal is to approximate the posterior distribution p(θ | D) with a set of\n\n7\n\n\f(a) Covtype: Test Accuracy vs. # of Iterations (b) Covtype: Test Log-Likelihood vs. # of Iterations (c) Protein: Test RMSE vs. # of Iterations (d) Protein: Test Log-Likelihood vs. # of Iterations\n\nFigure 2: (a)-(b) Results of Bayesian Logistic regression on the Covtype dataset. (c)-(d) Average test\nRMSE and log-likelihood vs. 
training batches on the Protein dataset for Bayesian neural regression.\n\nDataset | Test RMSE (pSGLD / Vanilla SVGD / Matrix-SVGD (average) / Matrix-SVGD (mixture)) | Test Log-Likelihood (pSGLD / Vanilla SVGD / Matrix-SVGD (average) / Matrix-SVGD (mixture))\nBoston | 2.699±0.155 / 2.785±0.169 / 2.898±0.184 / 2.861±0.207 | -2.717±0.166 / -2.847±0.182 / -2.706±0.158 / -2.669±0.141\nConcrete | 5.053±0.124 / 5.027±0.116 / 4.869±0.124 / 4.721±0.111 | -3.150±0.054 / -3.207±0.071 / -3.206±0.056 / -3.064±0.034\nEnergy | 0.985±0.024 / 0.889±0.024 / 0.795±0.025 / 0.868±0.025 | -1.249±0.036 / -1.395±0.029 / -1.315±0.020 / -1.135±0.026\nKin8nm | 0.091±0.001 / 0.093±0.001 / 0.092±0.001 / 0.090±0.001 | 0.964±0.012 / 0.956±0.011 / 0.973±0.010 / 0.975±0.011\nNaval | 0.002±0.000 / 0.004±0.000 / 0.001±0.000 / 0.000±0.000 | 5.383±0.081 / 4.312±0.087 / 4.535±0.093 / 5.639±0.048\nCombined | 4.042±0.034 / 4.088±0.033 / 4.056±0.033 / 4.029±0.033 | -2.821±0.009 / -2.832±0.009 / -2.824±0.009 / -2.817±0.009\nWine | 0.641±0.009 / 0.645±0.009 / 0.637±0.008 / 0.637±0.009 | -0.984±0.016 / -0.997±0.019 / -0.980±0.016 / -0.988±0.018\nProtein | 4.300±0.018 / 4.186±0.017 / 3.997±0.018 / 3.852±0.014 | -2.874±0.004 / -2.846±0.003 / -2.796±0.004 / -2.755±0.003\nYear | 8.630±0.007 / 8.686±0.010 / 8.637±0.005 / 8.594±0.009 | -3.568±0.002 / -3.577±0.002 / -3.569±0.001 / -3.561±0.002\n\nTable 1: Average test RMSE and log-likelihood in test data for UCI regression benchmarks.\n\nparticles {θi}ni=1, and then use it to predict the class labels for testing data points. We compare\nour methods with preconditioned stochastic gradient Langevin dynamics (pSGLD) (Li et al., 2016).\nBecause pSGLD is a sequential algorithm, for fair comparison, we obtain the samples of pSGLD by\nrunning n parallel chains of pSGLD for estimation. The preconditioning matrix in both pSGLD and\nmatrix SVGD is taken to be the Fisher information matrix.\nWe consider the binary Covtype2 dataset with 581,012 data points and 54 features. We partition the\ndata into 70% for training, 10% for validation and 20% for testing. We use the Adagrad optimizer with a\nmini-batch size of 256. We choose the best learning rate from [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0]\nfor each method on the validation set. For all the experiments and algorithms, we use n = 20 particles.\nResults are averaged over 20 random trials.\n\nResults Figure 2 (a) and (b) show the test accuracy and test log-likelihood of the different algorithms.\nWe can see that both Matrix-SVGD (average) and Matrix-SVGD (mixture) converge much\nfaster than both vanilla SVGD and pSGLD, reaching an accuracy of 0.75 in less than 500 iterations.\n\n4.3 Neural Network Regression\n\nSettings We apply our matrix SVGD to Bayesian neural network regression on UCI datasets.\nFor all experiments, we use a two-layer neural network with 50 hidden units and ReLU activation\nfunctions. We assign isotropic Gaussian priors to the neural network weights. All datasets3 are
All datasets3 are\nrandomly partitioned into 90% for training and 10% for testing. All results are averaged over 20\nrandom trials, except for Protein and Year, on which 5 random trials are performed. We use n = 10\nparticles for all methods. We use Adam optimizer with a mini-batch size of 100; for large dataset\nsuch as Year, we set the mini-batch size to be 1000. We use the Fisher information matrix with\nKronecker-factored (KFAC) approximation for preconditioning.\n\nResults Table 1 shows the performance in terms of the test RMSE and the test log-likelihood.\nWe can see that both Matrix-SVGD (average) and Matrix-SVGD (mixture), which use second-\norder information, achieve better performance than vanilla SVGD. Matrix-SVGD (mixture) yields\nthe best performance for both test RMSE and test log-likelihood in most cases. Figure 2 (c)-(d) show\nthat both variants of Matrix-SVGD converge much faster than vanilla SVGD and pSGLD on the\nProtein dataset.\n\n2https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html\n3https://archive.ics.uci.edu/ml/datasets.php\n\n8\n\n\f4.4 Sentence Classi\ufb01cation With Recurrent Neural Networks (RNN)\nSettings We consider the sentence classi\ufb01cation task on four datasets: MR (Pang & Lee,\n2005), CR (Hu & Liu, 2004), SUBJ (Pang & Lee, 2004), and MPQA (Wiebe et al., 2005).\nWe use a recurrent neural network (RNN) based\nmodel, p(y | x) = softmax(w>y hRN N (x, v)),\nwhere x is the input sentence, y is a discrete-\nvalued label of the sentence, and wy is a\nweight coef\ufb01cient related to label class y. And\nhRN N (x, v) is an RNN function with param-\neter v using a one-layer bidirectional GRU\nmodel (Cho et al., 2014) with 50 hidden units.\nWe apply matrix SVGD to infer the posterior\nof w = {wy : 8y}, while updating the RNN\nweights v using typical stochastic gradient descent. In all experiments, we use the pre-processed\ntext data provided in Gan et al. (2016). 
For all the datasets, we conduct 10-fold cross-validation for evaluation. We use n = 10 particles for all the methods. For training, we use a mini-batch size of 50 and run all the algorithms for 20 epochs with early stopping. We use the Fisher information matrix for preconditioning.

Table 2: Sentence classification errors on four benchmarks.

Method                  MR     CR     SUBJ   MPQA
SGLD                    20.52  18.65  7.66   11.24
pSGLD                   19.75  17.50  6.99   10.80
Vanilla SVGD            19.73  18.07  6.67   10.58
Matrix-SVGD (average)   19.22  17.29  6.76   10.79
Matrix-SVGD (mixture)   19.09  17.13  6.59   10.71

Results  Table 2 reports the test classification errors. We can see that Matrix-SVGD (mixture) generally performs the best among all algorithms.

5 Conclusion

We present a generalization of SVGD that leverages general matrix-valued positive definite kernels, which allows us to flexibly incorporate various preconditioning matrices, including Hessian and Fisher information matrices, to improve exploration in the probability landscape. We test our practical algorithms on a variety of real-world tasks and demonstrate their efficiency compared to existing methods.

Acknowledgement

This work is supported in part by NSF CRII 1830161 and NSF CAREER 1846421. We would like to acknowledge Google Cloud and Amazon Web Services (AWS) for their support.

References

Alvarez, M. A., Rosasco, L., Lawrence, N. D., et al. Kernels for vector-valued functions: A review. Foundations and Trends in Machine Learning, 4(3):195–266, 2012.

Berlinet, A. and Thomas-Agnan, C. Reproducing kernel Hilbert spaces in probability and statistics. Springer Science & Business Media, 2011.

Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.

Carmeli, C., De Vito, E., and Toigo, A.
Vector valued reproducing kernel Hilbert spaces of integrable functions and Mercer theorem. Analysis and Applications, 4(04):377–408, 2006.

Chen, W. Y., Barp, A., Briol, F.-X., Gorham, J., Girolami, M., Mackey, L., Oates, C., et al. Stein point Markov chain Monte Carlo. In International Conference on Machine Learning (ICML), 2019.

Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.

Detommaso, G., Cui, T., Marzouk, Y., Spantini, A., and Scheichl, R. A Stein variational Newton method. In Advances in Neural Information Processing Systems, pp. 9187–9197, 2018.

Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.

Gan, Z., Li, C., Chen, C., Pu, Y., Su, Q., and Carin, L. Scalable Bayesian learning of recurrent neural networks for language modeling. In ACL, 2016.

Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.

Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. Reinforcement learning with deep energy-based policies. arXiv preprint arXiv:1702.08165, 2017.

Han, J. and Liu, Q. Stein variational gradient descent without gradient. In International Conference on Machine Learning, 2018.

Hoffman, M. D. and Gelman, A. The No-U-Turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1):1593–1623, 2014.

Hu, M. and Liu, B. Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 168–177. ACM, 2004.

Kim, T., Yoon, J., Dia, O., Kim, S., Bengio, Y., and Ahn, S. Bayesian model-agnostic meta-learning. arXiv preprint arXiv:1806.03836, 2018.

Li, C., Chen, C., Carlson, D. E., and Carin, L. Preconditioned stochastic gradient Langevin dynamics for deep neural networks. In AAAI, volume 2, pp. 4, 2016.

Liu, C. and Zhu, J. Riemannian Stein variational gradient descent for Bayesian inference. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

Liu, Q. Stein variational gradient descent as gradient flow. In Advances in Neural Information Processing Systems, pp. 3115–3123, 2017.

Liu, Q. and Wang, D. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems, pp. 2378–2386, 2016.

Liu, Q. and Wang, D. Stein variational gradient descent as moment matching. In Advances in Neural Information Processing Systems, pp. 8867–8876, 2018.

Martens, J. and Grosse, R. Optimizing neural networks with Kronecker-factored approximate curvature. In International Conference on Machine Learning, pp. 2408–2417, 2015.

Martens, J., Ba, J., and Johnson, M. Kronecker-factored curvature approximations for recurrent neural networks. In ICLR, 2018.

Neal, R. M. et al. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11):2, 2011.

Pang, B. and Lee, L. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, pp. 271. Association for Computational Linguistics, 2004.

Pang, B. and Lee, L. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pp. 115–124. Association for Computational Linguistics, 2005.

Paulsen, V. I. and Raghupathi, M. An introduction to the theory of reproducing kernel Hilbert spaces, volume 152. Cambridge University Press, 2016.

Pu, Y., Gan, Z., Henao, R., Li, C., Han, S., and Carin, L. VAE learning via Stein variational gradient descent. In Advances in Neural Information Processing Systems, pp. 4236–4245, 2017.

Wainwright, M. J., Jordan, M. I., et al. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2):1–305, 2008.

Wang, D. and Liu, Q. Learning to draw samples: With application to amortized MLE for generative adversarial learning. arXiv preprint arXiv:1611.01722, 2016.

Wang, D., Zeng, Z., and Liu, Q. Stein variational message passing for continuous graphical models. In International Conference on Machine Learning, pp. 5206–5214, 2018.

Wiebe, J., Wilson, T., and Cardie, C. Annotating expressions of opinions and emotions in language. Language Resources and Evaluation, 39(2-3):165–210, 2005.

Zhuo, J., Liu, C., Shi, J., Zhu, J., Chen, N., and Zhang, B. Message passing Stein variational gradient descent. In International Conference on Machine Learning, pp. 6013–6022, 2018.