{"title": "A Stein variational Newton method", "book": "Advances in Neural Information Processing Systems", "page_first": 9169, "page_last": 9179, "abstract": "Stein variational gradient descent (SVGD) was recently proposed as a general purpose nonparametric variational inference algorithm: it minimizes the Kullback\u2013Leibler divergence between the target distribution and its approximation by implementing a form of functional gradient descent on a reproducing kernel Hilbert space [Liu & Wang, NIPS 2016]. In this paper, we accelerate and generalize the SVGD algorithm by including second-order information, thereby approximating a Newton-like iteration in function space. We also show how second-order information can lead to more effective choices of kernel. We observe significant computational gains over the original SVGD algorithm in multiple test cases.", "full_text": "A Stein variational Newton method\n\nGianluca Detommaso\n\nUniversity of Bath & The Alan Turing Institute\n\nTiangang Cui\n\nMonash University\n\ngd391@bath.ac.uk\n\nTiangang.Cui@monash.edu\n\nAlessio Spantini\n\nMassachusetts Institute of Technology\n\nYoussef Marzouk\n\nMassachusetts Institute of Technology\n\nspantini@mit.edu\n\nymarz@mit.edu\n\nRobert Scheichl\n\nHeidelberg University\n\nr.scheichl@uni-heidelberg.de\n\nAbstract\n\nStein variational gradient descent (SVGD) was recently proposed as a general\npurpose nonparametric variational inference algorithm [Liu & Wang, NIPS 2016]:\nit minimizes the Kullback\u2013Leibler divergence between the target distribution and\nits approximation by implementing a form of functional gradient descent on a\nreproducing kernel Hilbert space. In this paper, we accelerate and generalize the\nSVGD algorithm by including second-order information, thereby approximating\na Newton-like iteration in function space. We also show how second-order in-\nformation can lead to more effective choices of kernel. 
We observe signi\ufb01cant\ncomputational gains over the original SVGD algorithm in multiple test cases.\n\n1\n\nIntroduction\n\nApproximating an intractable probability distribution via a collection of samples\u2014in order to evaluate\narbitrary expectations over the distribution, or to otherwise characterize uncertainty that the distribu-\ntion encodes\u2014is a core computational challenge in statistics and machine learning. Common features\nof the target distribution can make sampling a daunting task. For instance, in a typical Bayesian\ninference problem, the posterior distribution might be strongly non-Gaussian (perhaps multimodal)\nand high dimensional, and evaluations of its density might be computationally intensive.\nThere exist a wide range of algorithms for such problems, ranging from parametric variational\ninference [4] to Markov chain Monte Carlo (MCMC) techniques [10]. Each algorithm offers a\ndifferent computational trade-off. At one end of the spectrum, we \ufb01nd the parametric mean-\ufb01eld\napproximation\u2014a cheap but potentially inaccurate variational approximation of the target density.\nAt the other end, we \ufb01nd MCMC\u2014a nonparametric sampling technique yielding estimators that are\nconsistent, but potentially slow to converge. In this paper, we focus on Stein variational gradient\ndescent (SVGD) [17], which lies somewhere in the middle of the spectrum and can be described as a\nparticular nonparametric variational inference method [4], with close links to the density estimation\napproach in [2].\nThe SVGD algorithm seeks a deterministic coupling between a tractable reference distribution of\nchoice (e.g., a standard normal) and the intractable target. This coupling is induced by a transport\nmap T that can transform a collection of reference samples into samples from the desired target\ndistribution. 
For a given pair of distributions, there may exist in\ufb01nitely many such maps [28]; several\nexisting algorithms (e.g., [27, 24, 21]) aim to approximate feasible transport maps of various forms.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fThe distinguishing feature of the SVGD algorithm lies in its de\ufb01nition of a suitable map T . Its central\nidea is to approximate T as a growing composition of simple maps, computed sequentially:\n\nT = T1 \u25e6 \u00b7\u00b7\u00b7 \u25e6 Tk \u25e6 \u00b7\u00b7\u00b7 ,\n\n(1)\nwhere each map Tk is a perturbation of the identity map along the steepest descent direction of a\nfunctional J that describes the Kullback\u2013Leibler (KL) divergence between the pushforward of the\nreference distribution through the composition T1 \u25e6 \u00b7\u00b7\u00b7 \u25e6 Tk and the target distribution. The steepest\ndescent direction is further projected onto a reproducing kernel Hilbert space (RKHS) in order to\ngive Tk a nonparametric closed form [3]. Even though the resulting map Tk is available explicitly\nwithout any need for numerical optimization, the SVGD algorithm implicitly approximates a steepest\ndescent iteration on a space of maps of given regularity.\nA primary goal of this paper is to explore the use of second-order information (e.g., Hessians)\nwithin the SVGD algorithm. Our idea is to develop the analogue of a Newton iteration\u2014rather\nthan gradient descent\u2014for the purpose of sampling distributions more ef\ufb01ciently. Speci\ufb01cally, we\ndesign an algorithm where each map Tk is now computed as the perturbation of the identity function\nalong the direction that minimizes a certain local quadratic approximation of J. Accounting for\nsecond-order information can dramatically accelerate convergence to the target distribution, at the\nprice of additional work per iteration. 
The tradeoff between speed of convergence and cost per\niteration is resolved in favor of the Newton-like algorithm\u2014which we call a Stein variational Newton\nmethod (SVN)\u2014in several numerical examples.\nThe ef\ufb01ciency of the SVGD and SVN algorithms depends further on the choice of reproducing kernel.\nA second contribution of this paper is to design geometry-aware Gaussian kernels that also exploit\nsecond-order information, yielding substantially faster convergence towards the target distribution\nthan SVGD or SVN with an isotropic kernel.\nIn the context of parametric variational inference, second-order information has been used to acceler-\nate the convergence of certain variational approximations, e.g., [14, 13, 21]. In this paper, however, we\nfocus on nonparametric variational approximations, where the corresponding optimisation problem\nis de\ufb01ned over an in\ufb01nite-dimensional RKHS of transport maps. More closely related to our work is\nthe Riemannian SVGD algorithm [18], which generalizes a gradient \ufb02ow interpretation of SVGD\n[15] to Riemannian manifolds, and thus also exploits geometric information within the inference task.\nThe rest of the paper is organized as follows. Section 2 brie\ufb02y reviews the SVGD algorithm, and\nSection 3 introduces the new SVN method. In Section 4 we introduce geometry-aware kernels for\nthe SVN method. Numerical experiments are described in Section 5. Proofs of our main results and\nfurther numerical examples addressing scaling to high dimensions are given in the supplementary\nmaterial. Code and all numerical examples are collected in our GitHub repository [1].\n\n2 Background\nSuppose we wish to approximate an intractable target distribution with density \u03c0 on Rd via an\nempirical measure, i.e., a collection of samples. 
Given samples {x_i} from a tractable reference density p on Rd, one can seek a transport map T : Rd \u2192 Rd such that the pushforward density of p under T, denoted by T\u2217p, is a close approximation to the target \u03c0.1 There exist infinitely many such maps [28]. The image of the reference samples under the map, {T(x_i)}, can then serve as an empirical measure approximation of \u03c0 (e.g., in the weak sense [17]).\n\nVariational approximation. Using the KL divergence to measure the discrepancy between the target \u03c0 and the pushforward T\u2217p, one can look for a transport map T that minimises the functional\n\nT \u21a6 DKL(T\u2217p || \u03c0)    (2)\n\nover a broad class of functions. The Stein variational method breaks the minimization of (2) into several simple steps: it builds a sequence of transport maps {T1, T2, . . . , Tl, . . .} to iteratively push an initial reference density p0 towards \u03c0. Given a scalar-valued RKHS H with a positive definite kernel k(x, x'), each transport map Tl : Rd \u2192 Rd is chosen to be a perturbation of the identity map I(x) = x along the vector-valued RKHS Hd \u2245 H \u00d7 \u00b7\u00b7\u00b7 \u00d7 H, i.e.,\n\nTl(x) := I(x) + Q(x)    for Q \u2208 Hd.    (3)\n\n1 If T is invertible, then T\u2217p(x) = p(T^{-1}(x)) |det(\u2207x T^{-1}(x))|.\n\nThe transport maps are computed iteratively. At each iteration l, our best approximation of \u03c0 is given by the pushforward density pl = (Tl \u25e6 \u00b7\u00b7\u00b7 \u25e6 T1)\u2217 p0. The SVGD algorithm then seeks a transport map Tl+1 = I + Q that further decreases the KL divergence between (Tl+1)\u2217pl and \u03c0,\n\nQ \u21a6 Jpl[Q] := DKL((I + Q)\u2217 pl || \u03c0),    (4)\n\nfor an appropriate choice of Q \u2208 Hd. In other words, the SVGD algorithm seeks a map Q \u2208 Hd such that\n\nJpl[Q] < Jpl[0],    (5)\n\nwhere 0(x) = 0 denotes the zero map. 
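As a quick numerical sanity check of the pushforward formula in footnote 1, consider a simple invertible linear map; this is a minimal sketch with a map and densities of our own choosing, not anything from the paper itself:

```python
import numpy as np

def normal_pdf(x, mean=0.0, std=1.0):
    # density of N(mean, std^2)
    z = (x - mean) / std
    return np.exp(-0.5 * z**2) / (std * np.sqrt(2.0 * np.pi))

# T(x) = 2x + 1 with reference p = N(0, 1); the pushforward is T*p = N(1, 4).
def pushforward_density(x):
    # footnote 1: (T*p)(x) = p(T^{-1}(x)) |det grad T^{-1}(x)|,
    # here T^{-1}(x) = (x - 1)/2 and |d/dx T^{-1}(x)| = 1/2
    return normal_pdf((x - 1.0) / 2.0) * 0.5

# agrees with the N(1, 2^2) density pointwise
assert np.isclose(pushforward_density(0.7), normal_pdf(0.7, mean=1.0, std=2.0))
```

For a nonlinear T the same formula applies with the Jacobian determinant evaluated pointwise.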
By construction, the sequence of pushforward densities {p0, p1, p2, . . . , pl, . . .} becomes increasingly close (in KL divergence) to the target \u03c0. Recent results on the convergence of the SVGD algorithm are presented in [15].\n\nFunctional gradient descent. The first variation of Jpl at S \u2208 Hd along V \u2208 Hd can be defined as\n\nDJpl[S](V) := lim_{\u03c4\u21920} (1/\u03c4)(Jpl[S + \u03c4V] \u2212 Jpl[S]).    (6)\n\nAssuming that the objective function Jpl : Hd \u2192 R is Fr\u00e9chet differentiable, the functional gradient of Jpl at S \u2208 Hd is the element \u2207Jpl[S] of Hd such that\n\nDJpl[S](V) = \u27e8\u2207Jpl[S], V\u27e9_{Hd}    \u2200 V \u2208 Hd,    (7)\n\nwhere \u27e8\u00b7,\u00b7\u27e9_{Hd} denotes an inner product on Hd.\n\nIn order to satisfy (5), the SVGD algorithm defines Tl+1 as a perturbation of the identity map along the steepest descent direction of the functional Jpl evaluated at the zero map, i.e.,\n\nTl+1 = I \u2212 \u03b5\u2207Jpl[0],    (8)\n\nfor a small enough \u03b5 > 0. It was shown in [17] that the functional gradient at 0 has a closed form expression given by\n\n\u2212\u2207Jpl[0](z) = E_{x\u223cpl}[k(x, z)\u2207x log \u03c0(x) + \u2207x k(x, z)].    (9)\n\nEmpirical approximation. There are several ways to approximate the expectation in (9). For instance, a set of particles {x_i^0}_{i=1}^n can be generated from the initial reference density p0 and pushed forward by the transport maps {T1, T2, . . .}. The pushforward density pl can then be approximated by the empirical measure given by the particles {x_i^l}_{i=1}^n, where x_i^l = Tl(x_i^{l\u22121}) for i = 1, . . . , n, so that\n\n\u2212\u2207Jpl[0](z) \u2248 G(z) := (1/n) \u2211_{j=1}^n [k(x_j^l, z) \u2207_{x_j^l} log \u03c0(x_j^l) + \u2207_{x_j^l} k(x_j^l, z)].    (10)\n\nThe first term in (10) corresponds to a weighted average steepest descent direction of the log-target density \u03c0 with respect to pl. This term is responsible for transporting particles towards high-probability regions of \u03c0. In contrast, the second term can be viewed as a \u201crepulsion force\u201d that spreads the particles along the support of \u03c0, preventing them from collapsing around the mode of \u03c0. The SVGD algorithm is summarised in Algorithm 1.\n\nAlgorithm 1: One iteration of the Stein variational gradient algorithm\nInput: Particles {x_i^l}_{i=1}^n at previous iteration l; step size \u03b5_{l+1}\nOutput: Particles {x_i^{l+1}}_{i=1}^n at new iteration l + 1\n1: for i = 1, 2, . . . , n do\n2:   Set x_i^{l+1} \u2190 x_i^l + \u03b5_{l+1} G(x_i^l), where G is defined in (10).\n3: end for\n\n3 Stein variational Newton method\n\nHere we propose a new method that incorporates second-order information to accelerate the convergence of the SVGD algorithm. We replace the steepest descent direction in (8) with an approximation of the Newton direction.\n\nFunctional Newton direction. Given a differentiable objective function Jpl, we can define the second variation of Jpl at 0 along the pair of directions V, W \u2208 Hd as\n\nD2Jpl[0](V, W) := lim_{\u03c4\u21920} (1/\u03c4)(DJpl[\u03c4W](V) \u2212 DJpl[0](V)).\n\nAt each iteration, the Newton method seeks to minimize a local quadratic approximation of Jpl. 
The minimizer W \u2208 Hd of this quadratic form defines the Newton direction and is characterized by the first-order stationarity conditions\n\nD2Jpl[0](V, W) = \u2212DJpl[0](V),    \u2200 V \u2208 Hd.    (11)\n\nWe can then look for a transport map Tl+1 that is a local perturbation of the identity map along the Newton direction, i.e.,\n\nTl+1 = I + \u03b5W,    (12)\n\nfor some \u03b5 > 0 that satisfies (5). The function W is guaranteed to be a descent direction if the bilinear form D2Jpl[0] in (11) is positive definite. The following theorem gives an explicit form for D2Jpl[0] and is proven in the Appendix.\n\nTheorem 1. The variational characterization of the Newton direction W = (w1, . . . , wd)\u22a4 \u2208 Hd in (11) is equivalent to\n\n\u2211_{i=1}^d \u27e8 \u2211_{j=1}^d \u27e8h_{ij}(y, z), w_j(z)\u27e9_H + \u2202_i Jpl[0](y), v_i(y) \u27e9_H = 0,    (13)\n\nfor all V = (v1, . . . , vd)\u22a4 \u2208 Hd, where\n\nh_{ij}(y, z) = E_{x\u223cpl}[\u2212\u2202^2_{ij} log \u03c0(x) k(x, y) k(x, z) + \u2202_i k(x, y) \u2202_j k(x, z)].    (14)\n\nWe propose a Galerkin approximation of (13). Let (x_k)_{k=1}^n be an ensemble of particles distributed according to pl(\u00b7), and define the finite-dimensional linear space Hd_n = span{k(x_1, \u00b7), . . . , k(x_n, \u00b7)}. We look for an approximate solution W = (w1, . . . , wd)\u22a4 in Hd_n, i.e.,\n\nw_j(z) = \u2211_{k=1}^n \u03b1^k_j k(x_k, z),    (15)\n\nfor some unknown coefficients (\u03b1^k_j), such that the residual of (13) is orthogonal to Hd_n. The following corollary gives an explicit characterization of the Galerkin solution and is proven in the Appendix.\n\nCorollary 1. The coefficients (\u03b1^k_j) are given by the solution of the linear system\n\n\u2211_{k=1}^n H^{s,k} \u03b1^k = \u2207J^s,    for all s = 1, . . . , n,    (16)\n\nwhere \u03b1^k := (\u03b1^k_1, . . . , \u03b1^k_d)\u22a4 is a vector of unknown coefficients, (H^{s,k})_{ij} := h_{ij}(x_s, x_k) is the evaluation of the symmetric form (14) at pairs of particles, and \u2207J^s := \u2212\u2207Jpl[0](x_s) represents the evaluation of the first variation at the s-th particle.\n\nIn practice, we can only evaluate a Monte Carlo approximation of H^{s,k} and \u2207J^s in (16) using the ensemble (x_k)_{k=1}^n.\n\nInexact Newton. The solution of (16) by means of direct solvers might be impractical for problems with a large number of particles n or high parameter dimension d, since it is a linear system with nd unknowns. Moreover, the solution of (16) might not lead to a descent direction (e.g., when \u03c0 is not log-concave). We address these issues by deploying two well-established techniques in nonlinear optimisation [31]. In the first approach, we solve (16) using the inexact Newton\u2013conjugate gradient (NCG) method [31, Chapters 5 and 7], wherein a descent direction can be guaranteed by appropriately terminating the conjugate gradient iterations; NCG requires only the matrix-vector product with each H^{s,k} and does not construct the matrix explicitly, and thus can be scaled to high dimensions. In the second approach, we simplify the problem further by taking a block-diagonal approximation of the second variation, breaking (16) into n decoupled d \u00d7 d linear systems\n\nH^{s,s} \u03b1^s = \u2207J^s,    s = 1, . . . , n.    (17)\n\nHere, we can either employ a Gauss\u2013Newton approximation of the Hessian \u22072 log \u03c0 in H^{s,s} or again use inexact Newton\u2013CG, to guarantee that the approximation of the Newton direction is a descent direction. Both the block-diagonal approximation and inexact NCG are more efficient than solving for the full Newton direction (16). 
In addition, the block-diagonal form (17) can be solved in parallel for each of the blocks, and hence it may best suit high-dimensional applications and/or large numbers of particles. In the supplementary material, we provide a comparison of these approaches on various examples. Both approaches provide similar progress per SVN iteration compared to the full Newton direction. Leveraging second-order information provides a natural scaling for the step size, i.e., \u03b5 = O(1). Here, the choice \u03b5 = 1 performs reasonably well in our numerical experiments (Section 5 and the Appendix). In future work, we will refine our strategy by considering either a line search or a trust region step. The resulting Stein variational Newton method is summarised in Algorithm 2.\n\nAlgorithm 2: One iteration of the Stein variational Newton algorithm\nInput: Particles {x_i^l}_{i=1}^n at stage l; step size \u03b5\nOutput: Particles {x_i^{l+1}}_{i=1}^n at stage l + 1\n1: for i = 1, 2, . . . , n do\n2:   Solve the linear system (16) for \u03b1^1, . . . , \u03b1^n\n3:   Set x_i^{l+1} \u2190 x_i^l + \u03b5 W(x_i^l) given \u03b1^1, . . . , \u03b1^n\n4: end for\n\n4 Scaled Hessian kernel\n\nIn the Stein variational method, the kernel weighs the contribution of each particle to a locally averaged steepest descent direction of the target distribution, and it also spreads the particles along the support of the target distribution. Thus it is essential to choose a kernel that can capture the underlying geometry of the target distribution, so that the particles can traverse its support efficiently. To this end, we can use the curvature information characterised by the Hessian of the logarithm of the target density to design anisotropic kernels.\n\nConsider a positive definite matrix A(x) that approximates the local Hessian of the negative logarithm of the target density, i.e., A(x) \u2248 \u2212\u22072_x log \u03c0(x). We introduce the metric\n\nM\u03c0 := E_{x\u223c\u03c0}[A(x)],    (18)\n\nto characterise the average curvature of the target density, stretching and compressing the parameter space in different directions. There are a number of computationally efficient ways to evaluate such an A(x): for example, the generalised eigenvalue approach in [20] and the Fisher information-based approach in [11]. The expectation in (18) is taken against the target density \u03c0, and thus cannot be directly computed. Utilising the ensemble {x_i^l}_{i=1}^n in each iteration, we introduce an alternative metric\n\nMpl := (1/n) \u2211_{i=1}^n A(x_i^l),    (19)\n\nto approximate M\u03c0. Similar approximations have also been introduced in the context of dimension reduction for statistical inverse problems; see [7]. Note that the computation of the metric (19) does not incur extra computational cost, as we have already calculated (approximations to) \u22072_x log \u03c0(x) at each particle in the Newton update.\n\nGiven a kernel of the generic form k(x, x') = f(\u2016x \u2212 x'\u20162), we can then use the metric Mpl to define an anisotropic kernel\n\nkl(x, x') = f((1/g(d)) \u2016x \u2212 x'\u20162_{Mpl}),\n\nwhere the norm \u2016\u00b7\u2016_{Mpl} is defined as \u2016x\u20162_{Mpl} = x\u22a4 Mpl x, and g(d) is a positive, real-valued function of the dimension d. For example, with g(d) = d, the Gaussian kernel used in the SVGD of [17] can be modified as\n\nkl(x, x') := exp(\u2212(1/(2d)) \u2016x \u2212 x'\u20162_{Mpl}).    (20)\n\nThe metric Mpl induces a deformed geometry in the parameter space: distance is greater along directions where the (average) curvature is large. This geometry directly affects how particles in SVGD or SVN flow, by shaping the locally-averaged gradients and the \u201crepulsion force\u201d among the particles, and tends to spread them more effectively over the high-probability regions of \u03c0.\n\nThe dimension-dependent scaling factor g(d) plays an important role in high-dimensional problems. Consider a sequence of target densities that converges to a limit as the dimension of the parameter space increases. For example, in the context of Bayesian inference on function spaces, e.g., [26], the posterior density is often defined on a discretisation of a function space, whose dimensionality increases as the discretisation is refined. In this case, the g(d)-weighted norm \u2016\u00b7\u20162/d is the square of the discretised L2 norm under certain technical conditions (e.g., the examples in Section 5.2 and the Appendix) and converges to the functional L2 norm as d \u2192 \u221e. With an appropriate scaling g(d), the kernel may thus exhibit robust behaviour with respect to discretisation if the target distribution has appropriate infinite-dimensional limits. For high-dimensional target distributions that do not have a well-defined limit with increasing dimension, an appropriately chosen scaling function g(d) can still improve the ability of the kernel to discriminate inter-particle distances. 
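The scaled kernel (20) is compact to write in code. In this sketch, M stands in for the metric Mpl, and the diagonal metric is an invented example meant only to show the effect of anisotropy:

```python
import numpy as np

def scaled_kernel(x, y, M, g):
    """Anisotropic Gaussian kernel exp(-||x - y||_M^2 / (2 g)),
    with ||v||_M^2 = v' M v; cf. eq. (20) with g = g(d)."""
    v = x - y
    return float(np.exp(-0.5 * (v @ M @ v) / g))

d = 3
M = np.diag([100.0, 1.0, 0.01])   # invented metric: strong curvature along x1
x, y = np.zeros(d), np.ones(d)

k_iso = scaled_kernel(x, y, np.eye(d), d)   # isotropic special case, M = I
k_aniso = scaled_kernel(x, y, M, d)         # distance inflated along x1
assert 0.0 < k_aniso < k_iso < 1.0          # anisotropy shrinks this kernel value
```

Directions of large average curvature thus contribute more to the metric distance, so particle pairs separated along those directions interact more weakly.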
Further numerical\ninvestigation of this effect is presented in the Appendix.\n\n5 Test cases\n\nWe evaluate our new SVN method with the scaled Hessian kernel on a set of test cases drawn from\nvarious Bayesian inference tasks. For these test cases, the target density \u03c0 is the (unnormalised)\nposterior density. We assume the prior distributions are Gaussian, that is, \u03c00(x) = N (mpr, Cpr),\nwhere mpr \u2208 Rd and Cpr \u2208 Rd\u00d7d are the prior mean and prior covariance, respectively. Also, we\nassume there exists a forward operator F : Rd \u2192 Rm mapping from the parameter space to the\ndata space. The relationship between the observed data and unknown parameters can be expressed\nas y = F(x) + \u03be, where \u03be \u223c N (0, \u03c32 I) is the measurement error and I is the identity matrix.\nThis relationship de\ufb01nes the likelihood function L(y|x) = N (F(x), \u03c32 I) and the (unnormalised)\nposterior density \u03c0(x) \u221d \u03c00(x)L(y|x).\nWe will compare the performance of SVN and SVGD, both with the scaled Hessian kernel (20) and\nthe heuristically-scaled isotropic kernel used in [17]. We refer to these algorithms as SVN-H, SVN-I,\nSVGD-H, and SVGD-I, where \u2018H\u2019 or \u2018I\u2019 designate the Hessian or isotropic kernel, respectively. Recall\nthat the heuristic used in the \u2018-I\u2019 algorithms involves a scaling factor based on the number of particles\nn and the median pairwise distance between particles [17]. Here we present two test cases, one\nmulti-modal and the other high-dimensional. In the supplementary material, we report on additional\ntests. First, we evaluate the performance of SVN-H with different Hessian approximations: the\nexact Hessian (full Newton), the block diagonal Hessian, and a Newton\u2013CG version of the algorithm\nwith exact Hessian. Second, we provide a performance comparison between SVGD and SVN on a\nhigh-dimensional Bayesian neural network. 
Finally, we provide further numerical investigations of the dimension-scalability of our scaled kernel.\n\n5.1 Two-dimensional double banana\n\nThe first test case is a two-dimensional bimodal and \u201cbanana\u201d shaped posterior density. The prior is a standard multivariate Gaussian, i.e., mpr = 0 and Cpr = I, and the observational error has standard deviation \u03c3 = 0.3. The forward operator is taken to be a scalar logarithmic Rosenbrock function, i.e.,\n\nF(x) = log((1 \u2212 x1)2 + 100(x2 \u2212 x1^2)2),\n\nwhere x = (x1, x2). We take a single observation y = F(xtrue) + \u03be, with xtrue being a random variable drawn from the prior and \u03be \u223c N(0, \u03c32 I).\n\nFigure 1 summarises the outputs of the four algorithms at selected iteration numbers, each with n = 1000 particles initially sampled from the prior \u03c00. The rows of Figure 1 correspond to the choice of algorithm and the columns to the outputs at different iteration numbers. We run 10, 50, and 100 iterations of SVN-H. To make a fair comparison, we rescale the number of iterations for each of the other algorithms so that the total cost (CPU time) is approximately the same. It is interesting to note that the Hessian kernel takes considerably less computational time than the isotropic kernel: whereas the Hessian kernel is automatically scaled, the isotropic kernel computes the distances between particles at each iteration in order to heuristically rescale the kernel.\n\nFigure 1: Particle configurations superimposed on contour plots of the double-banana density.\n\nThe first row of Figure 1 displays the performance of SVN-H, where second-order information is exploited both in the optimisation and in the kernel. After only 10 iterations, the algorithm has already converged, and the configuration of particles does not visibly change afterwards. 
Here, all the particles quickly reach the high-probability regions of the posterior distribution, due to the Newton acceleration in the optimisation. Additionally, the scaled Hessian kernel seems to spread the particles into a structured and precise configuration.\n\nThe second row shows the performance of SVN-I, where the second-order information is used exclusively in the optimisation. We can see the particles quickly moving towards the high-probability regions, but the configuration is much less structured. After 47 iterations, the algorithm has essentially converged, but the configuration of the particles is noticeably rougher than that of SVN-H.\n\nSVGD-H in the third row exploits second-order information exclusively in the kernel. Compared to SVN-I, the particles spread more quickly over the support of the posterior, but not all the particles reach the high-probability regions, due to slower convergence of the optimisation. The fourth row shows the original algorithm, SVGD-I. The algorithm lacks both of the benefits of second-order information: with slower convergence and a more haphazard particle distribution, it appears less efficient for reconstructing the posterior distribution.\n\n5.2 100-dimensional conditioned diffusion\n\nThe second test case is a high-dimensional model arising from a Langevin SDE, with state u : [0, T] \u2192 R and dynamics given by\n\ndu_t = [\u03b2 u_t (1 \u2212 u_t^2) / (1 + u_t^2)] dt + dx_t,    u_0 = 0.    (21)\n\nHere x = (x_t)_{t\u22650} is a Brownian motion, so that x \u223c \u03c00 = N(0, C), where C(t, t') = min(t, t'). This system represents the motion of a particle with negligible mass trapped in an energy potential, with thermal fluctuations represented by the Brownian forcing; it is often used as a test case for MCMC algorithms in high dimensions [6]. Here we use \u03b2 = 10 and T = 1. 
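A minimal Euler-Maruyama discretisation of the SDE (21) can be sketched as follows; the step size matches the discretisation described below, while the function and variable names are our own:

```python
import numpy as np

def simulate(dx, beta=10.0, dt=1e-2):
    """Euler-Maruyama solve of the Langevin SDE (21) with u_0 = 0,
    driven by a vector dx of Brownian increments."""
    u = np.zeros(len(dx) + 1)
    for k in range(len(dx)):
        drift = beta * u[k] * (1.0 - u[k]**2) / (1.0 + u[k]**2)
        u[k + 1] = u[k] + drift * dt + dx[k]
    return u

rng = np.random.default_rng(1)
dt = 1e-2
dx = rng.normal(0.0, np.sqrt(dt), size=100)   # increments of x on [0, 1]
u = simulate(dx, dt=dt)
obs = u[5::5]    # state at t_i = 0.05 i, i = 1, ..., 20 (every 5th step)
assert obs.shape == (20,)
```

The inference problem then treats the increments dx as the unknown parameter and the noisy values of obs as data.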
Our goal is to infer the driving process x and hence its pushforward to the state u.\n\nFigure 2: In each plot, the magenta path is the true solution of the discretised Langevin SDE; the blue line is the reconstructed posterior mean; the shaded area is the 90% marginal posterior credible interval at each time step.\n\nThe forward operator is defined by F(x) = [u_{t1}, u_{t2}, . . . , u_{t20}]\u22a4 \u2208 R20, where the t_i are equispaced observation times in the interval (0, 1], i.e., t_i = 0.05 i. By taking \u03c3 = 0.1, we define an observation y = F(xtrue) + \u03be \u2208 R20, where xtrue is a Brownian motion path and \u03be \u223c N(0, \u03c32 I). For discretization, we use an Euler\u2013Maruyama scheme with step size \u2206t = 10\u22122; therefore the dimensionality of the problem is d = 100. The prior is given by the Brownian motion x = (x_t)_{t\u22650}, described above.\n\nFigure 2 summarises the outputs of the four algorithms, each with n = 1000 particles initially sampled from \u03c00, and is presented in the same way as Figure 1 from the first test case. The iteration numbers are scaled so that we can compare outputs generated by the various algorithms using approximately the same amount of CPU time. In Figure 2, the path in magenta corresponds to the solution of the Langevin SDE in (21) driven by the true Brownian path xtrue. The red points correspond to the 20 noisy observations. The blue path is the reconstruction of the magenta path, i.e., it corresponds to the solution of the Langevin SDE driven by the posterior mean of (x_t)_{t\u22650}. Finally, the shaded area represents the marginal 90% credible interval of each dimension (i.e., at each time step) of the posterior distribution of u.\n\nWe observe excellent performance of SVN-H. After 50 iterations, the algorithm has already converged, accurately reconstructing the posterior mean (which in turn captures the trends of the true path) and the posterior credible intervals. 
(See Figure 3 and below for a validation of these results against a reference MCMC simulation.) SVN-I manages to provide a reasonable reconstruction of the target distribution: the posterior mean shows fair agreement with the true solution, but the credible intervals are slightly overestimated, compared to SVN-H and the reference MCMC. The overestimated credible intervals may be due to the poor dimension scaling of the isotropic kernel used by SVN-I. With the same amount of computational effort, SVGD-H and SVGD-I cannot reconstruct the posterior distribution: both the posterior mean and the posterior credible intervals depart significantly from their true values.\n\n[Figure 2 panels: SVN-H after 10, 50, and 100 iterations; SVN-I after 11, 54, and 108; SVGD-H after 40, 198, and 395; SVGD-I after 134, 668, and 1336.]\n\nIn Figure 3, we compare the posterior distribution approximated with SVN-H (using n = 1000 particles and 100 iterations) to that obtained with a reference MCMC run (using the DILI algorithm of [6] with an effective sample size of 10^5), showing overall good agreement. The thick blue and green paths correspond to the posterior means estimated by SVN-H and MCMC, respectively. The blue and green shaded areas represent the marginal 90% credible intervals (at each time step) produced by SVN-H and MCMC. In this example, the posterior mean of SVN-H matches that of MCMC quite closely, and both are comparable to the data-generating path (thick magenta line). (The posterior means are much smoother than the true path, which is to be expected.) 
The estimated credible intervals of SVN-H and MCMC also match fairly well along the entire path of the SDE.\n\nFigure 3: Comparison of reconstructed distributions from SVN-H and MCMC.\n\n6 Discussion\n\nIn general, the use of Gaussian reproducing kernels may be problematic in high dimensions, due to the locality of the kernel [8]. While we observe in Section 4 that using a properly rescaled Gaussian kernel can improve the performance of the SVN method in high dimensions, we also believe that a truly general-purpose nonparametric algorithm using local kernels will inevitably face further challenges in high-dimensional settings. A sensible approach to coping with high dimensionality is also to design algorithms that can detect and exploit essential structure in the target distribution, whether it be decaying correlation, conditional independence, low rank, multiple scales, and so on. See [25, 29] for recent efforts in this direction.\n\n7 Acknowledgements\n\nG. Detommaso is supported by the EPSRC Centre for Doctoral Training in Statistical Applied Mathematics at Bath (EP/L015684/1) and by a scholarship from the Alan Turing Institute. T. Cui, G. Detommaso, A. Spantini, and Y. Marzouk acknowledge support from the MATRIX Program on \u201cComputational Inverse Problems\u201d held at the MATRIX Institute, Australia, where this joint collaboration was initiated. A. Spantini and Y. Marzouk also acknowledge support from the AFOSR Computational Mathematics Program.\n\nReferences\n\n[1] http://github.com/gianlucadetommaso/Stein-variational-samplers\n\n[2] E. Anderes and M. Coram. A general spline representation for nonparametric and semiparametric density estimates using diffeomorphisms. arXiv preprint arXiv:1205.5314, 2012.\n\n[3] N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, p. 337\u2013404, 1950.\n\n[4] D. Blei, A. 
Kucukelbir, and J. D. McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, p. 859–877, 2017.
[5] W. Y. Chen, L. Mackey, J. Gorham, F. X. Briol, C. J. Oates. Stein points. In International Conference on Machine Learning. arXiv:1803.10161, 2018.
[6] T. Cui, K. J. H. Law, Y. M. Marzouk. Dimension-independent likelihood-informed MCMC. Journal of Computational Physics, 304: 109–137, 2016.
[7] T. Cui, J. Martin, Y. M. Marzouk, A. Solonen, and A. Spantini. Likelihood-informed dimension reduction for nonlinear inverse problems. Inverse Problems, 30(11):114015, 2014.
[8] D. Francois, V. Wertz, and M. Verleysen. About the locality of kernels in high-dimensional spaces. International Symposium on Applied Stochastic Models and Data Analysis, p. 238–245, 2005.
[9] S. Gershman, M. Hoffman, D. Blei. Nonparametric variational inference. arXiv preprint arXiv:1206.4665, 2012.
[10] W. R. Gilks, S. Richardson, and D. Spiegelhalter. Markov chain Monte Carlo in practice. CRC Press, 1995.
[11] M. Girolami and B. Calderhead. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123–214, 2011.
[12] J. Han and Q. Liu. Stein variational adaptive importance sampling. arXiv preprint arXiv:1704.05201, 2017.
[13] M. E. Khan, Z. Liu, V. Tangkaratt, Y. Gal. Vprop: Variational inference using RMSprop. arXiv preprint arXiv:1712.01038, 2017.
[14] M. E. Khan, W. Lin, V. Tangkaratt, Z. Liu, D. Nielsen. Adaptive-Newton method for explorative learning. arXiv preprint arXiv:1711.05560, 2017.
[15] Q. Liu. Stein variational gradient descent as gradient flow. In Advances in Neural Information Processing Systems (I. Guyon et al., Eds.), Vol. 30, p. 3118–3126, 2017.
[16] Y. Liu, P. Ramachandran, Q. Liu, and J. Peng. Stein variational policy gradient.
arXiv preprint arXiv:1704.02399, 2017.
[17] Q. Liu and D. Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems (D. D. Lee et al., Eds.), Vol. 29, p. 2378–2386, 2016.
[18] C. Liu and J. Zhu. Riemannian Stein variational gradient descent for Bayesian inference. arXiv preprint arXiv:1711.11216, 2017.
[19] D. G. Luenberger. Optimization by vector space methods. John Wiley & Sons, 1997.
[20] J. Martin, L. C. Wilcox, C. Burstedde, and O. Ghattas. A stochastic Newton MCMC method for large-scale statistical inverse problems with application to seismic inversion. SIAM Journal on Scientific Computing, 34(3): A1460–A1487, 2012.
[21] Y. M. Marzouk, T. Moselhy, M. Parno, and A. Spantini. Sampling via measure transport: An introduction. Handbook of Uncertainty Quantification, Springer, p. 1–41, 2016.
[22] R. M. Neal. MCMC using Hamiltonian dynamics. In Handbook of Markov Chain Monte Carlo (S. Brooks et al., Eds.), Chapman & Hall/CRC, 2011.
[23] Y. Pu, Z. Gan, R. Henao, C. Li, S. Han, and L. Carin. Stein variational autoencoder. arXiv preprint arXiv:1704.05155, 2017.
[24] D. Rezende and S. Mohamed. Variational inference with normalizing flows. arXiv:1505.05770, 2015.
[25] A. Spantini, D. Bigoni, and Y. Marzouk. Inference via low-dimensional couplings. Journal of Machine Learning Research, to appear. arXiv:1703.06131, 2018.
[26] A. M. Stuart. Inverse problems: a Bayesian perspective. Acta Numerica, 19, p. 451–559, 2010.
[27] E. G. Tabak and T. V. Turner. A family of nonparametric density estimation algorithms. Communications on Pure and Applied Mathematics, p. 145–164, 2013.
[28] C. Villani. Optimal Transport: Old and New. Springer-Verlag Berlin Heidelberg, 2009.
[29] D. Wang, Z. Zeng, and Q. Liu.
Structured Stein variational inference for continuous graphical models. arXiv:1711.07168, 2017.
[30] J. Zhuo, C. Liu, N. Chen, and B. Zhang. Analyzing and improving Stein variational gradient descent for high-dimensional marginal inference. arXiv preprint arXiv:1711.04425, 2017.
[31] S. Wright, J. Nocedal. Numerical Optimization. Springer Science, 1999.