{"title": "Convergence Guarantees for Adaptive Bayesian Quadrature Methods", "book": "Advances in Neural Information Processing Systems", "page_first": 6237, "page_last": 6248, "abstract": "Adaptive Bayesian quadrature (ABQ) is a powerful approach to numerical integration that empirically compares favorably with Monte Carlo integration on problems of medium dimensionality (where non-adaptive quadrature is not competitive).\nIts key ingredient is an acquisition function that changes as a function of  previously collected values of the integrand.\nWhile this adaptivity appears to be empirically powerful, it complicates analysis. Consequently, there are no theoretical guarantees so far for this class of methods. In this work, for a broad class of adaptive Bayesian quadrature methods, we prove consistency, deriving non-tight but informative convergence rates. To do so we introduce a new concept we call \\emph{weak adaptivity}. Our results identify a large and flexible class of adaptive Bayesian quadrature rules as consistent, within which practitioners can develop empirically efficient methods.", "full_text": "Convergence Guarantees\n\nfor Adaptive Bayesian Quadrature Methods\n\nMotonobu Kanagawa\u2020,#\u2217and Philipp Hennig#\n\n\u2020EURECOM, Sophia Antipolis, France\n\n#University of T\u00fcbingen and Max Planck Institute for Intelligent Systems, T\u00fcbingen, Germany\n\nmotonobu.kanagawa@eurecom.fr & philipp.hennig@uni-tuebingen.de\n\nAbstract\n\nAdaptive Bayesian quadrature (ABQ) is a powerful approach to numerical integra-\ntion that empirically compares favorably with Monte Carlo integration on problems\nof medium dimensionality (where non-adaptive quadrature is not competitive). Its\nkey ingredient is an acquisition function that changes as a function of previously\ncollected values of the integrand. While this adaptivity appears to be empirically\npowerful, it complicates analysis. 
Consequently, there are no theoretical guarantees\nso far for this class of methods. In this work, for a broad class of adaptive Bayesian\nquadrature methods, we prove consistency, deriving non-tight but informative convergence rates. To do so we introduce a new concept we call weak adaptivity. Our\nresults identify a large and \ufb02exible class of adaptive Bayesian quadrature rules as\nconsistent, within which practitioners can develop empirically ef\ufb01cient methods.\n\n1 Introduction\n\nNumerical integration, or quadrature/cubature, is a fundamental task in many areas of science and engineering. This includes machine learning and statistics, where such problems arise when computing\nmarginals and conditionals in probabilistic inference problems. In particular in hierarchical Bayesian\ninference, quadrature is generally required for the computation of the marginal likelihood, the key\nquantity for model selection, and for prediction, for which latent variables are to be marginalized out.\nTo describe the problem, let \u2126 be a compact metric space, \u00b5 be a \ufb01nite positive Borel measure on\n\u2126 (such as the Lebesgue measure on compact \u2126 \u2282 Rd) that plays the role of reference measure,\n\u03c0 : \u2126 \u2192 R be a known density function, and f : \u2126 \u2192 R be an integrand, a known function such that\nthe function value f (x) \u2208 R can be obtained for any given query x \u2208 \u2126. The task of quadrature is to\nnumerically compute the integral (assumed to be intractable analytically)\n\n\u222b f (x)\u03c0(x)d\u00b5(x).\n\nThis is done by evaluating the function values f (x1), . . . , f (xn) at design points x1, . . . , xn \u2208 \u2126\nand using them to approximate f and the integral. The points x1, . . . , xn should be \u201cgood\u201d in the\nsense that f (x1), . . . , f (xn) provide useful information for computing the integral.\nMonte Carlo methods are the classic alternative, where x1, . . . , xn are randomly generated from\na proposal distribution and the integral is approximated as \u2211_{i=1}^n wi f (xi), with w1, . . . , wn being\nimportance weights. Such Monte Carlo estimators achieve a convergence rate of order n^{\u22121/2}, where\nn is the number of design points, under the mild condition that f is a bounded function. This dimension-independent rate, and the mild condition on f, are among the reasons for the wide popularity\nand successes of Monte Carlo methods. However, as has long been known empirically to practitioners\n\n\u2217Most of this work was done when MK was af\ufb01liated with University of T\u00fcbingen and MPI IS, Germany\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nand also theoretically investigated recently [3, 10], practical (i.e. Markov Chain) Monte Carlo can\nstruggle in high dimensional integration, requiring a huge number of sample points to give a reliable\nestimate\u00b2: the curse of dimensionality appears in the constant term in front of the rate n^{\u22121/2} [22,\nSec. 2.5] [10, Thm. 2.1 and Sec. 3.4]. Thus, there have been a number of attempts at developing\nmethods that work better than Monte Carlo for high dimensional integration, such as Quasi Monte\nCarlo methods [14].\nAdaptive Bayesian quadrature (ABQ) is a recent approach from machine learning that actively,\nsequentially and deterministically selects design points to adapt to the target integrand [29, 30, 16, 1,\n9]. It is an extension of Bayesian quadrature (BQ) [28, 15, 8, 21], a probabilistic numerical method\nfor quadrature that makes use of prior knowledge about the integrand, such as smoothness and\nstructure, via a Gaussian process (GP) prior. Convergence rates of BQ methods take the form n^{\u2212s/d}\nif the integrand f is s-times differentiable, or the form exp(\u2212Cn^{1/d}) for some constant C > 0 if\nf is in\ufb01nitely smooth [8, 20]. 
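For reference, the Monte Carlo importance-sampling baseline described above, against whose n^{-1/2} rate the BQ rates are compared, can be sketched as follows. This is a minimal illustration; the Gaussian integrand, target and proposal are assumptions made for the example, not taken from the paper.

```python
import numpy as np

def mc_importance_estimate(f, pi, q_sample, q_density, n, rng):
    """Monte Carlo estimate of the integral of f(x) pi(x) dx: draw x_1..x_n from
    a proposal q and average the weighted values w_i f(x_i), w_i = pi(x_i)/q(x_i).
    The root-mean-square error decays at the dimension-independent rate n^{-1/2}."""
    x = q_sample(n, rng)          # design points drawn from the proposal
    w = pi(x) / q_density(x)      # importance weights
    return np.mean(w * f(x))

# Toy check: integral of x^2 * N(x; 0, 1) dx equals 1, using a N(0, 2^2) proposal.
rng = np.random.default_rng(0)
f = lambda x: x**2
pi = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
q_density = lambda x: np.exp(-x**2 / 8) / np.sqrt(8 * np.pi)
q_sample = lambda n, rng: rng.normal(0.0, 2.0, size=n)

est = mc_importance_estimate(f, pi, q_sample, q_density, n=100_000, rng=rng)
```

With the seed fixed, the estimate lands close to the true value 1; halving the error requires roughly four times as many samples, which is the n^{-1/2} behavior the BQ rates above are measured against.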
While the rates can be faster than Monte Carlo, the dimension d of the\nambient space now appears in the rate, meaning that BQ also suffers from the curse of dimensionality.\nABQ has been developed to improve upon such vanilla BQ methods. One drawback of vanilla BQ is\nthat the Gaussian process model prevents the use of certain kinds of relevant knowledge about the\nintegrand, such as it being positive (or non-negative), because they cannot be encoded in a Gaussian\ndistribution. Positive integrands are ubiquitous in machine learning and statistics, where integration\ntasks emerge in the marginalization and conditioning of probability density functions, which are\npositive by de\ufb01nition. In ABQ such prior knowledge is modelled by describing the integrand as given\nby a certain transformation (or warping) of a GP \u2014 for instance, an exponentiated GP [30, 29, 9] or a\nsquared GP [16]. ABQ methods with such transformations have empirically been shown to improve\nupon both standard BQ and Monte Carlo, leading to state-of-the-art wall-clock time performance on\nproblems of medium dimensionality.\nIf the transformation is nonlinear, as in the examples above, the transformed GP no longer allows an\nanalytic expression for its posterior process, and thus approximations are used to obtain a tractable\nacquisition function. In contrast to the posterior covariance of GPs, these acquisition functions then\nbecome dependent on previous observations, making the algorithm adaptive. This twist seems to be\ncritical for ABQ methods\u2019 superior empirical performance, but it complicates analysis. Thus, there\nhas been no theoretical guarantee for their convergence, rendering them heuristics in practice. This is\nproblematic since integration is usually an intermediate computational step in a larger system, and\nthus must be reliable. This paper provides the \ufb01rst convergence analysis for ABQ methods.\nIn Sec. 
2 we review ABQ methods, and formulate a generic class of acquisition functions that covers\nthose of [16, 1, 2, 9]. Our convergence analysis is done for this class. We also derive an upper-bound\non the quadrature error using a transformed integrand, which is applicable to any design points and\ngiven in terms of the GP posterior variance (Prop. 2.1). In Sec. 3, we establish a connection between\nABQ and certain weak greedy algorithms (Thm. 3.3). This is based on a new result that the scaled\nGP posterior variance can be interpreted in terms of a certain projection in a Hilbert space (Lemma\n3.1). Using this connection, we derive convergence rates of ABQ methods in Sec. 4. For the reader\u2019s\nconvenience, we present a high-level overview of the proof structure in Fig. 1.\nThe key to our analysis is a relatively general notion of active exploration that we term weak\nadaptivity. An ABQ method that satis\ufb01es weak adaptivity (and a few additional technical constraints)\nis consistent, and the conceptual space of weakly adaptive BQ methods is large and \ufb02exible. We hope\nthat our results spark a practical interest in the design of empirically ef\ufb01cient acquisition functions, to\nextend the reach of quadrature to problems of higher and higher dimensionality.\n\nRelated Work. For standard BQ methods, and the corresponding kernel quadrature rules, convergence properties have been studied extensively [e.g. 7, 19, 4, 40, 21, 11, 8, 27, 20]. Some of these\nworks theoretically analyze methods that deterministically generate design points [12, 5, 17, 7, 11].\nThese methods are, however, not adaptive, as design points are generated independently of the\nfunction values of the target integrand.\nOur analysis is technically related to the work by Santin and Haasdonk [34], which analyzed the\nso-called P-greedy algorithm, an algorithm to sequentially obtain design points using the GP posterior\n\n\u00b2For instance, Wenliang et al. [39, Fig. 
3] used 10^{10} Monte Carlo samples to estimate the normalizing\nconstant of their model, on problems with medium dimensionality (10 to 50 dims).\n\nFigure 1: Relationships between the various auxiliary results and how they yield the main results.\n\nvariance as an acquisition function. Our results can be regarded as a generalization of their result, so\nthat the acquisition function can include i) a scaling and a transformation of the GP posterior variance\nand ii) a data-dependent term that takes care of adaptation; see (4) for details.\nAdaptive methods have also been theoretically studied in the information-based complexity literature\n[23, 24, 25, 26]. The key result is that optimal points for quadrature can be obtained without observing\nactual function values if the hypothesis class of functions is symmetric and convex (e.g. the unit\nball in a Hilbert space): in this case adaptation does not help improve the performance. On the other\nhand, if the hypothesis class is either asymmetric or nonconvex, then adaptation may be helpful. For\ninstance, a class of positive functions is asymmetric because only one of f or \u2212f can be positive.\nThese results thus support the choice of acquisition functions of existing ABQ methods, where the\nadaptivity to function values is motivated by modeling the positivity of the integrand.\nNotation. N denotes the set of positive integers, R the real line, and Rd the d-dimensional Euclidean\nspace for d \u2208 N. Lp(\u2126) for 1 \u2264 p < \u221e is the Banach space of p-integrable functions, and L\u221e(\u2126) is\nthat of essentially bounded functions.\n\n2 Adaptive Bayesian Quadrature (ABQ)\n\nWe describe here ABQ methods, and present a generic form of acquisition functions that we analyze.\nWe also derive an upper-bound on the quadrature error using a transformed integrand in terms of the\nGP posterior variance, motivating our analysis in the later sections. 
Throughout the paper we assume\nthat the domain \u2126 is a compact metric space and \u00b5 is a \ufb01nite positive Borel measure on \u2126.\n\n2.1 Bayesian Quadrature with Transformation\n\nABQ methods deal with an integrand f that is a priori known to satisfy a certain constraint, for\nexample f (x) > 0 \u2200x \u2208 \u2126. Such a constraint is modeled by considering a certain transformation\nT : R \u2192 R, and assuming that there exists a latent function g : \u2126 \u2192 R such that the integrand f\nis given as the transformation of g, i.e., f (x) = T (g(x)), x \u2208 \u2126. Examples of T for modeling\npositivity include i) the square transformation T (y) = \u03b1 + y^2/2, where \u03b1 > 0 is a small constant such\nthat 0 < \u03b1 < inf_{x\u2208\u2126} f (x), assuming that f is bounded away from 0 [16]; and ii) the exponential\ntransformation T (y) = exp(y) [30, 29, 9]. Note that the identity map T (y) = y recovers standard\nBayesian quadrature (BQ) methods [28, 15, 7, 21]. To model the latent function g, a Gaussian process\n(GP) prior [32] is placed over g:\n\ng \u223c GP(m, k), (1)\n\nwhere m : \u2126 \u2192 R is a mean function and k : \u2126 \u00d7 \u2126 \u2192 R is a covariance kernel. Both m and k should be\nchosen to capture as much prior knowledge or belief about g (or its transformation f) as possible,\nsuch as smoothness and correlation structure; see e.g. [32, Chap. 4].\nAssume that a set of points Xn := {x1, . . . , xn} \u2282 \u2126 is given, such that the kernel matrix\nKn := (k(xi, xj))_{i,j=1}^n \u2208 Rn\u00d7n is invertible. Given the function values f (x1), . . . , f (xn), de\ufb01ne\ng(xi) := zi \u2208 R such that T (zi) = f (xi) for i = 1, . . . , n. Treating g(x1), . . . , g(xn) as \u201cobserved\ndata without noise,\u201d the posterior distribution of g under the GP prior (1) is again a GP,\n\ng | (xi, g(xi))_{i=1}^n \u223c GP(mg,Xn, kXn),\n\nwhere mg,Xn : \u2126 \u2192 R is the posterior mean function and kXn : \u2126 \u00d7 \u2126 \u2192 R is the posterior\ncovariance kernel, given by (see e.g. [32])\n\nmg,Xn(x) := m(x) + kn(x)\u22a4Kn^{\u22121}(gn \u2212 mn), (2)\nkXn(x, x\u2032) := k(x, x\u2032) \u2212 kn(x)\u22a4Kn^{\u22121}kn(x\u2032), (3)\n\nwhere kn(x) := (k(x, x1), . . . , k(x, xn))\u22a4 \u2208 Rn, gn := (g(x1), . . . , g(xn))\u22a4 \u2208 Rn and mn :=\n(m(x1), . . . , m(xn))\u22a4 \u2208 Rn. Then a quadrature estimate\u00b3 for the integral \u222b f (x)\u03c0(x)d\u00b5(x) is given\nas the integral \u222b T (mg,Xn(x))\u03c0(x)d\u00b5(x) of the transformed posterior mean function T (mg,Xn),\nor as the integral of the posterior expectation of the transformation, \u222b E\u00b4g T (\u00b4g(x))\u03c0(x)d\u00b5(x), where\n\u00b4g \u223c GP(mg,Xn, kXn) is the posterior GP. 
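Equations (2) and (3) can be exercised directly. The following minimal numpy sketch conditions a zero-mean GP on noise-free observations; the squared-exponential kernel, the 1-D inputs and the specific data are assumptions made for the illustration only.

```python
import numpy as np

def se_kernel(x, y, gamma=1.0):
    """Squared-exponential kernel k(x, x') = exp(-(x - x')^2 / gamma^2), 1-D inputs."""
    return np.exp(-((x[:, None] - y[None, :]) ** 2) / gamma**2)

def gp_posterior(X, g, xstar, kernel=se_kernel):
    """Noise-free GP posterior with zero prior mean m = 0:
    mean, cf. (2):  k_n(x)^T K_n^{-1} g_n
    var,  cf. (3):  k(x, x) - k_n(x)^T K_n^{-1} k_n(x)."""
    K = kernel(X, X)                       # kernel matrix K_n
    kx = kernel(xstar, X)                  # rows are k_n(x)^T at query points
    mean = kx @ np.linalg.solve(K, g)
    var = kernel(xstar, xstar).diagonal() - np.einsum(
        "ij,ji->i", kx, np.linalg.solve(K, kx.T))
    return mean, var

# The posterior interpolates: exact values and zero variance at observed points.
X = np.array([0.0, 0.5, 1.0])
mean, var = gp_posterior(X, np.sin(X), np.array([0.5, 2.0]))
```

At an observed point the posterior mean reproduces the observation and the posterior variance vanishes (up to numerical error); it is precisely the behavior of kXn(x, x) away from the data that the later analysis controls.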
The posterior covariance for \u222b f (x)\u03c0(x)d\u00b5(x) is given similarly;\nsee [9, 16] for details.\n\n2.2 A Generic Form of Acquisition Functions\n\nThe key remaining question is how to select good design points x1, . . . , xn \u2208 \u2126. ABQ methods\nsequentially and deterministically generate x1, . . . , xn using an acquisition function. Many of the\nacquisition functions can be formulated in the following generic form:\n\nx\u2113+1 \u2208 arg max_{x\u2208\u2126} a\u2113(x), where a\u2113(x) = F(q^2(x)kX\u2113(x, x)) b\u2113(x), (\u2113 = 0, 1, . . . , n\u22121) (4)\n\nwhere kX0(x, x) := k(x, x), F : [0,\u221e) \u2192 [0,\u221e) is an increasing function such that F (0) = 0,\nq : \u2126 \u2192 (0,\u221e), and b\u2113 : \u2126 \u2192 R is a function that may change at each iteration \u2113; e.g., it may\ndepend on the function values f (x1), . . . , f (x\u2113) of the target integrand f. Intuitively, b\u2113(x) is a\ndata-dependent term that makes the point selection adaptive to the target integrand, q(x) may be seen\nas a proposal density in importance sampling, and F determines the balance between the uncertainty\nsampling part q^2(x)kX\u2113(x, x) and the adaptation term b\u2113(x). We analyse ABQ with this generic\nform (4), aiming for results with wide applicability. Here are some representative choices.\nWarped Sequential Active Bayesian Integration (WSABI) [16]: Gunter et al. [16] employ the\nsquare transformation f (x) = T (g(x)) = \u03b1 + g^2(x)/2 with two acquisition functions: i) WSABI-L\n[16, Eq. 15], which is based on linearization of T and recovered with F (y) = y, q(x) = \u03c0(x) and\nb\u2113(x) = mg,X\u2113(x)^2; and ii) WSABI-M [16, Eq. 14], the one based on moment matching, given by\nF (y) = y, q(x) = \u03c0(x) and b\u2113(x) = (1/2)kX\u2113(x, x) + mg,X\u2113(x)^2.\nMoment-Matched Log-Transformation (MMLT) [9]: Chai and Garnett [9, 3rd row in Table 1]\nuse the exponential transformation f (x) = T (g(x)) = exp(g(x)) with the acquisition function given\nby F (y) = exp(y) \u2212 1, q(x) = 1 and b\u2113(x) = exp(kX\u2113(x, x) + 2mg,X\u2113(x)).\nVariational Bayesian Monte Carlo (VBMC) [1, 2]: Acerbi [2, Eq. 2] uses the identity f (x) =\nT (g(x)) = g(x) with the acquisition function given by F (y) = y^{\u03b41}, q(x) = 1 and b\u2113(x) =\n\u03c0\u2113(x)^{\u03b42} exp(\u03b43 mg,X\u2113(x)), where \u03c0\u2113 is the variational posterior at the \u2113-th iteration and \u03b41, \u03b42, \u03b43 \u2265 0\nare constants: setting \u03b41 = \u03b42 = \u03b43 = 1 recovers the original acquisition function [1, Eq. 9]. Acerbi\n[1, Sec. 2.1] considers an integrand f that is de\ufb01ned as the logarithm of a joint density, while \u03c0 is an\nintractable posterior that is gradually approximated by the variational posteriors \u03c0\u2113.\nFor the WSABI and MMLT, the acquisition function (4) is obtained by a certain approximation\nof the posterior variance of the integral \u222b f (x)\u03c0(x)d\u00b5(x) = \u222b T (g(x))\u03c0(x)d\u00b5(x); thus this is a\nform of uncertainty sampling. Such an approximation is needed because the posterior variance of\nthe integral is not available in closed form, due to the nonlinear transformation T. The resulting\nacquisition function includes the data-dependent term b\u2113(x), which encourages exploration in regions\nwhere the value of g(x) is expected to be large. This makes ABQ methods adaptive to the target\nintegrand. Alas, it also complicates analysis. Thus there has been no convergence guarantee for these\nABQ methods, which is what we aim to remedy in this paper.\n\n\u00b3The point is that, in contrast to the integral over f, this estimate should be analytically tractable. This\ndepends on the choices for T, k and \u03c0. For instance, for T (y) = y or T (y) = \u03b1 + y^2/2 with k and \u03c0 Gaussian,\nthe estimate can be obtained analytically [16], while for T (y) = exp(y) one needs approximations [cf. 9].\n\n2.3 Bounding the Quadrature Error with Transformation\n\nOur \ufb01rst result, which may be of independent interest, is an upper-bound on the error of the\nquadrature estimate \u222b T (mg,Xn(x))\u03c0(x)d\u00b5(x) based on a transformation described in Sec. 2.1. It\nis applicable to any point set Xn = {x1, . . . , xn}, and the bound is given in terms of the posterior\nvariance kXn(x, x). This gives us a motivation to study the behavior of this quantity for x1, . . . , xn\ngenerated by ABQ (4) in the later sections. Note that essentially the same bound holds for the other\nestimator \u222b E\u00b4g T (\u00b4g(x))\u03c0(x)d\u00b5(x) with \u00b4g \u223c GP(mg,Xn, kXn), which we describe in Appendix A.2.\nTo state the result, we need to introduce the Reproducing Kernel Hilbert Space (RKHS) of the\ncovariance kernel k of the GP prior. See e.g. [35, 36] for details of RKHS\u2019s, and [6, 18] for\ndiscussions of their close but subtle relation to the GP notion. Let Hk be the RKHS associated with\nthe covariance kernel k of the GP prior (1), with \u27e8\u00b7, \u00b7\u27e9Hk and \u2225\u00b7\u2225Hk being its inner-product and\nnorm, respectively. Hk is a Hilbert space consisting of functions on \u2126, such that i) k(\u00b7, x) \u2208 Hk\nfor all x \u2208 \u2126, and ii) h(x) = \u27e8k(\u00b7, x), h\u27e9Hk for all h \u2208 Hk and x \u2208 \u2126 (the reproducing property),\nwhere k(\u00b7, x) denotes the function of the \ufb01rst argument y \u2192 k(y, x), with x being \ufb01xed.\nAs a set of functions, Hk is given as the closure of the linear span of such functions k(\u00b7, x), i.e.,\nHk = span{k(\u00b7, x) | x \u2208 \u2126}, meaning that any h \u2208 Hk can be written as h = \u2211_{i=1}^\u221e \u03b1i k(\u00b7, yi) for\nsome (\u03b1i)_{i=1}^\u221e \u2282 R and (yi)_{i=1}^\u221e \u2282 \u2126 such that \u2225h\u2225^2_{Hk} = \u2211_{i,j=1}^\u221e \u03b1i\u03b1j k(yi, yj) < \u221e. We are now\nready to state our assumption:\nAssumption 1. T : R \u2192 R is continuously differentiable. For f : \u2126 \u2192 R, there exists g : \u2126 \u2192 R\nsuch that f (x) = T (g(x)), x \u2208 \u2126, and \u02dcg := g \u2212 m \u2208 Hk. It holds that \u2225k\u2225L\u221e(\u2126) :=\nsup_{x\u2208\u2126} k(x, x) < \u221e and \u2225m\u2225L\u221e(\u2126) := sup_{x\u2208\u2126} |m(x)| < \u221e.\nThe assumption \u02dcg := g \u2212 m \u2208 Hk is common in theoretical analyses of standard BQ methods, where\nT (y) = y and m = 0 [see e.g. 7, 40, 8, and references therein]. This assumption may be weakened\nby using proof techniques developed for standard BQ in the misspeci\ufb01ed setting [19, 20], but we leave\nthis for future work. The other conditions on T, k and m are weak.\nProposition 2.1. (proof in Appendix A.1) Let \u2126 be a compact metric space, Xn = {x1, . . . , xn} \u2282\n\u2126 be such that the kernel matrix Kn = (k(xi, xj))_{i,j=1}^n \u2208 Rn\u00d7n is invertible, and \u03c0 : \u2126 \u2192 [0,\u221e)\nand q : \u2126 \u2192 [0,\u221e) be continuous functions such that C\u03c0/q := \u222b_\u2126 \u03c0(x)/q(x)d\u00b5(x) < \u221e. Suppose\nthat Assumption 1 is satis\ufb01ed. Then there exists a constant C\u02dcg,m,k,T depending only on \u02dcg, m, k and\nT such that\n\n|\u222b f (x)\u03c0(x)d\u00b5(x) \u2212 \u222b T (mg,Xn(x))\u03c0(x)d\u00b5(x)| \u2264 C\u02dcg,m,k,T C\u03c0/q \u2225\u02dcg\u2225Hk sup_{x\u2208\u2126} q(x)\u221akXn(x, x).\n\n3 Connections to Weak Greedy Algorithms in Hilbert Spaces\n\nProp. 2.1 shows that to establish convergence guarantees for ABQ methods, it is suf\ufb01cient to analyze\nthe convergence behavior of the quantity sup_{x\u2208\u2126} q(x)\u221akXn(x, x) for points Xn = {x1, . . . , xn}\ngenerated from (4). This is what we focus on in the remainder.\nTo analyze the quantity sup_{x\u2208\u2126} q(x)\u221akXn(x, x) for points Xn = {x1, . . . , xn} generated from\nABQ (4), we show here that ABQ can be interpreted as a certain weak greedy algorithm studied\nby DeVore et al. [13]. To describe this, let H be a (generic) Hilbert space and C \u2282 H be a compact\nsubset. To de\ufb01ne some notation, let h1, . . . , hn \u2208 C be given. Denote by Sn := span(h1, . . . , hn) =\n{\u2211_{i=1}^n \u03b1ihi | \u03b11, . . . , \u03b1n \u2208 R} \u2282 H the linear subspace spanned by h1, . . . , hn. For a given h \u2208 C,\nlet dist(h, Sn) be the distance between h and Sn de\ufb01ned by\n\ndist(h, Sn) := inf_{g\u2208Sn} \u2225h \u2212 g\u2225H = inf_{\u03b11,...,\u03b1n\u2208R} \u2225h \u2212 \u2211_{i=1}^n \u03b1ihi\u2225H,\n\nwhere \u2225\u00b7\u2225H denotes the norm of H. Geometrically, this is the distance between h and its orthogonal\nprojection onto the subspace Sn. The task considered in [13] is to select h1, . . . , hn \u2208 C such that\nthe worst case error in C de\ufb01ned by\n\nen(C) := sup_{h\u2208C} dist(h, Sn) (5)\n\nbecomes as small as possible: h1, . . . 
, hn \u2208 C are to be chosen to approximate well the set C.\nThe following weak greedy algorithm is considered in DeVore et al. [13]. Let \u03b3 be a constant\nsuch that 0 < \u03b3 \u2264 1, and let n \u2208 N. First select h1 \u2208 C such that \u2225h1\u2225H \u2265 \u03b3 sup_{h\u2208C} \u2225h\u2225H. For\n\u2113 = 1, . . . , n \u2212 1, suppose that h1, . . . , h\u2113 have already been generated, and let S\u2113 = span(h1, . . . , h\u2113).\nThen select the next element h\u2113+1 \u2208 C such that\n\ndist(h\u2113+1, S\u2113) \u2265 \u03b3 sup_{h\u2208C} dist(h, S\u2113), (\u2113 = 1, . . . , n \u2212 1). (6)\n\nIn this paper we refer to such h1, . . . , hn as a \u03b3-weak greedy approximation of C in H because \u03b3 = 1\nrecovers the standard greedy algorithm, while \u03b3 < 1 weakens the \u201cgreediness\u201d of this rule. DeVore\net al. [13] derived convergence rates of the worst case error (5) as n \u2192 \u221e for h1, . . . , hn generated\nfrom this weak greedy algorithm.\n\nWeak Greedy Algorithms in the RKHS. To establish a connection to ABQ, we formulate the\nweak greedy algorithm in an RKHS. Let Hk be the RKHS of the covariance kernel k as in Sec. 2.3,\nand q(x) be the function in (4). We de\ufb01ne a subset Ck,q \u2282 Hk by\n\nCk,q := {q(x)k(\u00b7, x) | x \u2208 \u2126} \u2282 Hk.\n\nNote that Ck,q is the image of the mapping x \u2192 q(x)k(\u00b7, x) with \u2126 being the domain. Therefore\nCk,q is compact if k and q are continuous and \u2126 is compact; this is because in this case the mapping\nx \u2192 q(x)k(\u00b7, x) becomes continuous, and in general the image of a continuous mapping from a\ncompact domain is compact. Thus, we make the following assumption:\nAssumption 2. \u2126 is a compact metric space, q : \u2126 \u2192 R is continuous with q(x) > 0 for all x \u2208 \u2126,\nand k : \u2126 \u00d7 \u2126 \u2192 R is continuous.\nThe following simple lemma establishes a key connection between weak greedy algorithms and\nABQ. (Note that the result for the case q(x) = 1 is well known in the literature, and the novelty\nlies in that we allow for q(x) to be non-constant.) For a geometric interpretation of (7) in terms of\nprojections, see Fig. 2 in Appendix B.1.\nLemma 3.1. (proof in Appendix B.1) Let x1, . . . , xn \u2208 \u2126 be such that the kernel matrix Kn =\n(k(xi, xj))_{i,j=1}^n \u2208 Rn\u00d7n is invertible. De\ufb01ne hx := q(x)k(\u00b7, x) for any x \u2208 \u2126, and let Sn :=\nspan(hx1, . . . , hxn) \u2282 Hk. Assume that q(x) > 0 holds for all x \u2208 \u2126. Then for all x \u2208 \u2126 we have\n\nq^2(x)kXn(x, x) = dist^2(hx, Sn), (7)\n\nwhere kXn(x, x) is the GP posterior variance function given by (3). Moreover, we have\n\nen(Ck,q) = sup_{x\u2208\u2126} q(x)\u221akXn(x, x), (8)\n\nwhere en(Ck,q) is the worst case error de\ufb01ned by (5) with C := Ck,q and Sn de\ufb01ned here.\n\nEq. (8) of Lemma 3.1 suggests that we can analyze the convergence properties of sup_{x\u2208\u2126} q(x)\u221akXn(x, x)\nfor Xn = {x1, . . . , xn} generated from the ABQ rule (4) by analyzing those of the worst case error\nen(Ck,q) for the corresponding elements hx1, . . . , hxn, where hxi := q(xi)k(\u00b7, xi).\nAdaptive Bayesian Quadrature as a Weak Greedy Algorithm. We now show that the ABQ (4)\ngives a weak greedy approximation of the compact set Ck,q in the RKHS Hk in the sense of (6). We\nsummarize the required conditions in Assumptions 3 and 4. As mentioned in Sec. 1, Assumption 3 is the\ncrucial one: its implications for certain speci\ufb01c ABQ methods will be discussed in Sec. 4.2.\nAssumption 3 (Weak Adaptivity Condition). There are constants CL, CU > 0 such that CL <\nb\u2113(x) < CU holds for all x \u2208 \u2126 and for all \u2113 \u2208 N \u222a {0}.\nIntuitively, this condition enforces that ABQ does not overly focus on a speci\ufb01c local region in \u2126 and\nexplores the entire domain \u2126. For instance, consider the following two situations where Assumption 3\ndoes not hold: (a) b\u2113(x) \u2192 0 as \u2113 \u2192 \u221e for some local region x \u2208 A \u2282 \u2126, while b\u2113(x) remains\nbounded from below for x \u2208 \u2126\\A; (b) b\u2113(x) \u2192 +\u221e as \u2113 \u2192 \u221e for some local region x \u2208 B \u2282 \u2126,\nwhile b\u2113(x) remains bounded from above for x \u2208 \u2126\\B. In case (a), ABQ will not allocate any points\nto the region A at all after a \ufb01nite number of iterations. Thus, the information about the integrand f\non the region A will not be obtained after a \ufb01nite number of evaluations, which makes it dif\ufb01cult to\nguarantee the consistency of quadrature, unless f has a \ufb01nite degree of freedom on A. Similarly, in\ncase (b), ABQ will generate points only in the region B and no point in the rest of the region \u2126\\B,\nafter a \ufb01nite number of iterations. Assumption 3 prevents such problematic situations from occurring.\nAssumption 4. F : [0,\u221e) \u2192 [0,\u221e) is increasing and continuous, and F (0) = 0. For any\n0 < c \u2264 1, there is a constant 0 < \u03c8(c) \u2264 1 such that F^{\u22121}(cy) \u2265 \u03c8(c)F^{\u22121}(y) holds for all y \u2265 0.\nFor instance, if F (y) = y^\u03b4 for \u03b4 > 0 then F^{\u22121}(y) = y^{1/\u03b4} and thus we have \u03c8(c) = c^{1/\u03b4}\nfor 0 < c \u2264 1; \u03b4 = 1 is the case for the WSABI [16], and \u03b4 > 0 for the VBMC [1, 2]. If\nF (y) = exp(y) \u2212 1 as in the MMLT [9], we have F^{\u22121}(y) = log(y + 1) and it can be shown\nthat \u03c8(c) = c for 0 < c \u2264 1; see Appendix B.2. 
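Putting the pieces together, the generic rule (4) with WSABI-L style choices (F(y) = y, q = pi, b_l(x) = m_{g,X_l}(x)^2) can be sketched as below. This is a toy illustration under stated assumptions, not the authors' implementation: the arg max over Omega is replaced by a finite candidate grid, pi is taken uniform, a small diagonal jitter stabilizes the linear solves, and the first point falls back to pure uncertainty sampling since the prior mean is zero.

```python
import numpy as np

def se_kernel(x, y, gamma=1.0):
    """Squared-exponential kernel on 1-D inputs (an assumption for this sketch)."""
    return np.exp(-((x[:, None] - y[None, :]) ** 2) / gamma**2)

def abq_select(grid, f, n, alpha=0.01, kernel=se_kernel):
    """Sequential point selection in the generic form (4), WSABI-L style:
    T(y) = alpha + y^2/2, F(y) = y, q(x) = pi(x) (uniform here),
    b_l(x) = posterior_mean_of_g(x)^2."""
    T_inv = lambda y: np.sqrt(2.0 * (y - alpha))   # latent value g(x) with T(g(x)) = f(x)
    pi = np.full(grid.shape, 1.0 / (grid[-1] - grid[0]))
    X, z = [], []
    for _ in range(n):
        if not X:
            mean = np.zeros_like(grid)             # zero prior mean m = 0
            var = kernel(grid, grid).diagonal()
        else:
            Xa, za = np.array(X), np.array(z)
            K = kernel(Xa, Xa) + 1e-10 * np.eye(len(X))   # jitter for stability
            kx = kernel(grid, Xa)
            mean = kx @ np.linalg.solve(K, za)
            var = kernel(grid, grid).diagonal() - np.einsum(
                "ij,ji->i", kx, np.linalg.solve(K, kx.T))
        b = mean**2 if X else np.ones_like(grid)   # adaptation term b_l(x)
        acq = pi**2 * np.clip(var, 0.0, None) * b  # a_l(x) in (4), F = identity
        x_next = grid[np.argmax(acq)]
        X.append(x_next)
        z.append(T_inv(f(x_next)))                 # "observe" the latent function
    return np.array(X)

grid = np.linspace(-2.0, 2.0, 161)
points = abq_select(grid, f=lambda x: np.exp(-x**2), n=8)
```

Note how b_l keeps the selection adaptive: after the first evaluations, new points favor regions where the posterior mean of g (and hence the integrand) is large, while the variance factor still forces exploration of unvisited regions, in the spirit of the weak adaptivity condition (Assumption 3).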
Note that in Assumption 4 the inverse F^{\u22121} is\nwell-de\ufb01ned, since F is increasing and continuous.\nIn our analysis, we allow the point selection procedure of ABQ itself to be \u201cweak,\u201d in the sense that the\noptimization problem in (4) may be solved approximately.\u2074 That is, for a constant 0 < \u02dc\u03b3 \u2264 1 we\nassume that the points x1, . . . , xn satisfy\n\na\u2113(x\u2113+1) \u2265 \u02dc\u03b3 max_{x\u2208\u2126} a\u2113(x), (\u2113 = 0, 1, . . . , n \u2212 1). (9)\n\nThe case \u02dc\u03b3 = 1 amounts to exactly solving the global optimization problem of ABQ (4).\nThe following lemma guarantees that we can assume without loss of generality that the kernel matrix\nKn for the points x1, . . . , xn generated from the ABQ (4) is invertible under the assumptions above,\nsince otherwise sup_{x\u2208\u2126} kX\u2113(x, x) = 0 holds, implying that the quadrature error is 0 by Prop. 2.1.\nThis guarantees the applicability of Lemma 3.1 for points generated from the ABQ (4).\nLemma 3.2. (proof in Appendix B.3) Suppose that Assumptions 2, 3 and 4 are satis\ufb01ed. For a\nconstant 0 < \u02dc\u03b3 \u2264 1, assume that x1, . . . , xn are generated by a \u02dc\u03b3-weak version of ABQ (4), i.e., (9) is\nsatis\ufb01ed. Then one of the following holds: i) the kernel matrix K\u2113 = (k(xi, xj))_{i,j=1}^\u2113 \u2208 R\u2113\u00d7\u2113 is\ninvertible for all \u2113 = 1, . . . , n; or ii) there exists some \u2113 = 1, . . . , n such that sup_{x\u2208\u2126} kX\u2113(x, x) = 0.\n\nLemma 3.1 leads to the following theorem, which establishes a connection between ABQ and weak\ngreedy algorithms.\nTheorem 3.3. (proof in Appendix B.4) Suppose that Assumptions 2, 3 and 4 are satis\ufb01ed. For a\nconstant 0 < \u02dc\u03b3 \u2264 1, assume that x1, . . . , xn are generated by a \u02dc\u03b3-weak version of ABQ (4), i.e.,\n(9) is satis\ufb01ed. Let hxi = q(xi)k(\u00b7, xi) for i = 1, . . . , n. Then hx1, . . . , hxn are a \u03b3-weak greedy\napproximation of Ck,q in Hk with \u03b3 = \u221a\u03c8(\u02dc\u03b3CL/CU).\n\n4 Convergence Rates of Adaptive Bayesian Quadrature\n\nWe use the connection established in the previous section to derive convergence rates of ABQ. To\nthis end we introduce a quantity called the Kolmogorov n-width, which is de\ufb01ned (for a Hilbert space H\nand a compact subset C \u2282 H) by\n\ndn(C) := inf_{Un} sup_{h\u2208C} dist(h, Un),\n\nwhere the in\ufb01mum is taken over all n-dimensional subspaces Un of H. This is the worst case error\nfor the best possible solution using n elements in H; thus dn(C) \u2264 en(C) holds for any choice of Sn\nthat de\ufb01nes the worst case error en(C) in (5). The following result by DeVore et al. [13, Corollary\n3.3] relates the Kolmogorov n-width to the worst case error en(C) of a weak greedy algorithm.\nLemma 4.1. Let H be a Hilbert space and C \u2282 H be a compact subset. For 0 < \u03b3 \u2264 1, let\nh1, . . . , hn \u2208 C be a \u03b3-weak greedy approximation of C in H for n \u2208 N, and let en(C) be the worst\ncase error (5) for the subspace Sn := span(h1, . . . , hn). Then we have:\n\n\u2013 Exponential decay: Assume that there exist constants \u03b1 > 0, C0 > 0 and D0 > 0 such that\ndn(C) \u2264 C0 exp(\u2212D0n^\u03b1) holds for all n \u2208 N. Then en(C) \u2264 \u221a2 C0\u03b3^{\u22121} exp(\u2212D1n^\u03b1)\nholds for all n \u2208 N with D1 := 2^{\u22121\u22122\u03b1}D0.\n\n\u2074We thank George Wynne for pointing out that our analysis can be extended to this weak version of ABQ.\n\n\u2013 Polynomial decay: Assume that there exist constants \u03b1 > 0 and C0 > 0 such that dn(C) \u2264\nC0n^{\u2212\u03b1} holds for all n \u2208 N. 
Then $e_n(C) \leq C_1 n^{-\alpha}$ holds for all $n \in \mathbb{N}$ with $C_1 := 2^{5\alpha+1} \gamma^{-2} C_0$.

– Generic case: We have $e_n(C) \leq \sqrt{2}\, \gamma^{-1} \min_{1 \leq \ell < n} (d_\ell(C))^{(n-\ell)/n}$ for all $n \in \mathbb{N}$. In particular, $e_{2n}(C) \leq \sqrt{2}\, \gamma^{-1} \sqrt{d_n(C)}$ holds for all $n \in \mathbb{N}$.

Thus, the key is how to upper-bound the Kolmogorov $n$-width $d_n(C_{k,q})$ for the RKHS $\mathcal{H}_k$ associated with the covariance kernel $k$. Given such an upper bound, one can then derive convergence rates for ABQ using Thm. 3.3.

Below we demonstrate such results in the setting where $\Omega \subset \mathbb{R}^d$ is compact and $\mu$ is the Lebesgue measure, focusing on kernels with infinite smoothness, such as Gaussian and (inverse) multiquadric kernels, using Lemma 4.1 for the case of exponential decay. In a similar way (using Lemma 4.1 for the polynomial decay case) one can also derive rates for kernels with finite smoothness, such as Matérn and Wendland kernels. These additional results are presented in Appendix C.4. We emphasize that one can also analyze other cases (e.g., kernels on a sphere) by deriving upper bounds on the Kolmogorov $n$-width and using Thm. 3.3.

4.1 Convergence Rates for Kernels with Infinite Smoothness

We consider kernels with infinite smoothness, such as square-exponential kernels $k(x, x') = \exp(-\|x - x'\|^2 / \gamma^2)$ with $\gamma > 0$; multiquadric kernels $k(x, x') = (-1)^{\lceil \beta \rceil} (c^2 + \|x - x'\|^2)^{\beta}$ with $\beta, c > 0$ such that $\beta \notin \mathbb{N}$, where $\lceil \beta \rceil$ denotes the smallest integer greater than $\beta$; and inverse multiquadric kernels $k(x, x') = (c^2 + \|x - x'\|^2)^{-\beta}$ with $\beta > 0$. We have the following bound on the Kolmogorov $n$-width of $C_{k,q}$ for these kernels; the proof is in Appendix C.2.

Proposition 4.2.
Let \u2126 \u2282 Rd be a cube, and suppose that Assumption 2 is satis\ufb01ed. Let k be a\nsquare-exponential kernel or an (inverse) multiquadric kernel. Then there exist constants C0, D0 > 0\nsuch that dn(Ck,q) \u2264 C0 exp(\u2212D0n1/d) holds for all n \u2208 N.\nThe requirement for \u2126 to be a cube stems from the use of Wendland [38, Thm. 11.22] in our proof,\nwhich requires this condition. In fact, this can be weakened to \u2126 being a compact set satisfying an\ninterior cone condition, but the resulting rate weakens to O(exp(\u2212D1n\u22121/2d)) (note that this is still\nexponential); see [38, Sec. 11.4]. This also applies to the following results. Combining Prop. 4.2\n\nwith Lemma 3.1, Thm. 3.3 and Lemma 4.1, we now obtain a bound on supx\u2208\u2126 q(x)(cid:112)kXn (x, x).\n\nTheorem 4.3. (proof in Appendix C.3) Suppose that Assumptions 2, 3 and 4 are satis\ufb01ed. Let\n\u2126 \u2282 Rd be a cube, and k be a square-exponential kernel or an (inverse) multiquadric kernel. For\na constant 0 < \u02dc\u03b3 \u2264 1, assume that Xn = {x1, . . . , xn} \u2282 \u2126 are generated by a \u02dc\u03b3-weak version of\nABQ (4), i.e., (9) is satis\ufb01ed. Then there exist constants C1, D1 > 0 such that\n\nq(x)(cid:112)kXn (x, x) \u2264 C1\u03c8(\u02dc\u03b3CL/CU )\u22121/2 exp(\u2212D1n1/d)\n\n(n \u2208 N).\n\nsup\nx\u2208\u2126\n\n(cid:82)\n\nAs a directly corollary of Prop. 2.1 and Thm. 4.3, we \ufb01nally obtain a convergence rate of the ABQ\nwith an in\ufb01nitely smooth kernel, which is exponentially fast.\nCorollary 4.4. Suppose that Assumptions 1, 2, 3 and 4 are satis\ufb01ed, and that C\u03c0/q\n:=\n\u2126 \u03c0(x)/q(x)d\u00b5(x) < \u221e. Let \u2126 \u2282 Rd be a cube, and k be a square-exponential kernel or a\n(inverse) multiquadric kernel. For a constant 0 < \u02dc\u03b3 \u2264 1, assume that Xn = {x1, . . . , xn} \u2282 \u2126 are\ngenerated by a \u02dc\u03b3-weak version of ABQ (4), i.e., (9) is satis\ufb01ed. 
Then there exists a constant $D_1 > 0$ independent of $n \in \mathbb{N}$ such that

$$\left| \int f(x) \pi(x)\, d\mu(x) - \int T(m_{g,X_n}(x))\, \pi(x)\, d\mu(x) \right| = O(\exp(-D_1 n^{1/d})) \qquad (n \to \infty).$$

4.2 Discussions of the Weak Adaptivity Condition (Assumption 3)

We discuss the consequences of our results for the individual ABQ methods reviewed in Sec. 2.2. We do this in particular by discussing the weak adaptivity condition (Assumption 3), which requires that the data-dependent term $b_n(x)$ in (4) is uniformly bounded away from zero and infinity. (A discussion for VBMC by Acerbi [1, 2] is given in Appendix C.8. To summarize, Assumption 3 holds if the densities of the variational distributions are uniformly bounded away from zero and infinity.)

We first consider the WSABI-L approach by Gunter et al. [16], for which $b_n(x) = (m_{g,X_n}(x))^2$; a similar result is presented for WSABI-M in Appendix C.7. The following bounds for $b_n(x)$ follow from Lemma C.5 in Appendix C.5.

Lemma 4.5. Let $b_n(x) := (m_{g,X_n}(x))^2$. Suppose that Assumption 1 is satisfied, and that $\inf_{x \in \Omega} |m(x)| > 2 \|\tilde{g}\|_{\mathcal{H}_k} \|k\|_{L^\infty(\Omega)}^{1/2}$. Then Assumption 3 holds for $C_L := \big( \inf_{x \in \Omega} |m(x)| - 2 \|\tilde{g}\|_{\mathcal{H}_k} \|k\|_{L^\infty(\Omega)}^{1/2} \big)^2 > 0$ and $C_U := \big( \|m\|_{L^\infty(\Omega)} + 2 \|\tilde{g}\|_{\mathcal{H}_k} \|k\|_{L^\infty(\Omega)}^{1/2} \big)^2 < \infty$.

Lemma 4.5 implies that WSABI-L may not be consistent when, e.g., one uses the zero prior mean function $m(x) = 0$, since in this case the condition $\inf_{x \in \Omega} |m(x)| > 2 \|\tilde{g}\|_{\mathcal{H}_k} \|k\|_{L^\infty(\Omega)}^{1/2}$ is not satisfied.
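The failure mode behind Lemma 4.5 can be illustrated numerically. The following is a minimal sketch (not code from the paper): it computes a GP posterior mean under a square-exponential kernel on synthetic data and evaluates the WSABI-L factor $b_n(x) = (m_{g,X_n}(x))^2$ at a point far from the design points. All kernel parameters, data, and function names here are illustrative assumptions.

```python
import numpy as np

def se_kernel(x, y, ell=0.1):
    """Square-exponential kernel k(x, x') = exp(-(x - x')^2 / ell^2) on 1-D inputs."""
    return np.exp(-(x[:, None] - y[None, :]) ** 2 / ell ** 2)

def posterior_mean(x_query, X, g_obs, prior_mean=0.0, jitter=1e-10):
    """GP posterior mean m_{g,X}(x) = m(x) + k(x, X) K^{-1} (g(X) - m(X))."""
    K = se_kernel(X, X) + jitter * np.eye(len(X))
    weights = np.linalg.solve(K, g_obs - prior_mean)
    return prior_mean + se_kernel(x_query, X) @ weights

# Design points clustered on the left of Omega = [0, 1], with noiseless
# observations of an illustrative integrand g(x) = sin(3x) + 1.
X = np.array([0.1, 0.2, 0.3, 0.4])
g_obs = np.sin(3.0 * X) + 1.0

far = np.array([0.95])  # a query point far from all design points

# WSABI-L data-dependent factor b_n(x) = (m_{g,X_n}(x))^2
b_zero_mean = posterior_mean(far, X, g_obs, prior_mean=0.0)[0] ** 2
b_const_mean = posterior_mean(far, X, g_obs, prior_mean=2.0)[0] ** 2

# With prior mean m = 0, the posterior mean reverts to 0 far from the data,
# so b_n collapses there and the lower bound C_L > 0 of Assumption 3 fails:
# the neighbourhood of x = 0.95 is never favoured for exploration.
print(b_zero_mean)   # vanishingly small
# With constant prior mean m = 2, b_n stays near m^2 = 4 far from the data,
# so a uniform lower bound C_L > 0 can hold.
print(b_const_mean)
```

Under the modification $b_n(x) := \frac{1}{2}(m_{g,X_n}(x))^2 + \alpha$ discussed below, the same computation would return at least $\alpha$ everywhere, which is what restores weak adaptivity.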
Intuitively, the inconsistency may happen because the posterior mean $m_{g,X_n}(x)$ for inputs $x$ in regions distant from the current design points $x_1, \dots, x_n$ would become close to $0$, since the prior mean function is $0$; such regions will then never be explored in subsequent iterations, because of the form $b_n(x) = (m_{g,X_n}(x))^2$. One simple way to guarantee consistency is to make a modification like $b_n(x) := \frac{1}{2}(m_{g,X_n}(x))^2 + \alpha = T(m_{g,X_n}(x))$; then we can guarantee that $C_L \geq \alpha > 0$, encouraging exploration in the whole region $\Omega$. This then makes the algorithm consistent.

We next consider the MMLT method by Chai and Garnett [9], for which $b_n(x) = \exp(k_{X_n}(x, x) + 2 m_{g,X_n}(x))$. Lemma 4.6 below shows that the weak adaptivity condition holds for the MMLT as long as Assumption 1 is satisfied. Therefore, in contrast to WSABI, the MMLT is consistent without requiring a further assumption.

Lemma 4.6. (proof in Appendix C.6) Let $b_n(x) := \exp(k_{X_n}(x, x) + 2 m_{g,X_n}(x))$. Suppose that Assumption 1 is satisfied. Then Assumption 3 holds for $C_L := \exp\big(-2\|m\|_{L^\infty(\Omega)} - 4\|\tilde{g}\|_{\mathcal{H}_k} \|k\|_{L^\infty(\Omega)}^{1/2}\big) > 0$ and $C_U := \exp\big(\|k\|_{L^\infty(\Omega)} + 2\|m\|_{L^\infty(\Omega)} + 4\|\tilde{g}\|_{\mathcal{H}_k} \|k\|_{L^\infty(\Omega)}^{1/2}\big) < \infty$.

5 Conclusion and Outlook

Extending efficient numerical integration beyond the low-dimensional domain remains both a formidable challenge and a crucial desideratum for many areas. In machine learning, efficient numerical integration in the high-dimensional domain would be a game-changer for Bayesian learning. Developed by, and used in, the NeurIPS community, adaptive Bayesian quadrature is a promising new direction for progress in this fundamental problem class.
So far, it has been hindered by the absence of theoretical guarantees.

In this work, we have provided the first known convergence guarantees for ABQ methods, by analyzing a generic form of their acquisition functions. Of central importance is the notion of weak adaptivity which, speaking vaguely, ensures that the algorithm asymptotically does not "overly focus" on some evaluations. It is conceptually related to ideas like detailed balance and ergodicity, which play a similar role for Markov Chain Monte Carlo methods (where, speaking equally vaguely, they guard against the same kind of locality) [cf. §6.5 & 6.6 in 33]. Like those of MCMC, our sufficient conditions for consistency span a flexible class of design options, and can thus act as a guideline for the design of novel acquisition functions for ABQ, guided by practical and intuitive considerations. Based on the results presented herein, new ABQ methods may be proposed for domains beyond positive integrands, for example integrands with discontinuities [31] and those with spatially inhomogeneous smoothness.

An important theoretical question, however, remains to be addressed: while our results provide convergence guarantees for ABQ methods, they do not provide a theoretical explanation for why, how and when ABQ methods should be fundamentally better than non-adaptive methods. In fact, little is known about the theoretical properties of adaptive quadrature methods in general. In applied mathematics, they remain an open problem [23, 24, 25, 26].
While we have to leave this question of ABQ's potential advantages over standard BQ for future research, we consider this area to be highly promising on account of the fundamental role of high-dimensional integrals of structured functions in probabilistic machine learning.

Acknowledgements

We would like to express our gratitude to the anonymous reviewers for their constructive feedback. We also thank Alexandra Gessner, Hans Kersting, Tim Sullivan and George Wynne for their comments and for fruitful discussions. The authors gratefully acknowledge financial support by the European Research Council through ERC StG Action 757275 / PANAMA, by the DFG Cluster of Excellence "Machine Learning – New Perspectives for Science", EXC 2064/1, project number 390727645, by the German Federal Ministry of Education and Research (BMBF) through the Tübingen AI Center (FKZ: 01IS18039A, 01IS18039B), and by the Ministry of Science, Research and Arts of the State of Baden-Württemberg.

References

[1] L. Acerbi. Variational Bayesian Monte Carlo. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 8213–8223. Curran Associates, Inc., 2018.

[2] L. Acerbi. An Exploration of Acquisition and Mean Functions in Variational Bayesian Monte Carlo. In F. Ruiz, C. Zhang, D. Liang, and T. Bui, editors, Proceedings of The 1st Symposium on Advances in Approximate Bayesian Inference, volume 96 of Proceedings of Machine Learning Research, pages 1–10. PMLR, 2019.

[3] S. Agapiou, O. Papaspiliopoulos, D. Sanz-Alonso, and A. M. Stuart. Importance Sampling: Intrinsic Dimension and Computational Cost. Statistical Science, 32(3):405–431, 2017.

[4] F. Bach.
On the equivalence between kernel quadrature rules and random feature expansions. Journal of Machine Learning Research, 18(19):1–38, 2017.

[5] F. Bach, S. Lacoste-Julien, and G. Obozinski. On the equivalence between herding and conditional gradient algorithms. In Proceedings of the 29th International Conference on Machine Learning (ICML 2012), pages 1359–1366, 2012.

[6] A. Berlinet and C. Thomas-Agnan. Reproducing kernel Hilbert spaces in probability and statistics. Kluwer Academic Publishers, 2004.

[7] F.-X. Briol, C. J. Oates, M. Girolami, and M. A. Osborne. Frank-Wolfe Bayesian quadrature: Probabilistic integration with theoretical guarantees. In Advances in Neural Information Processing Systems 28, pages 1162–1170, 2015.

[8] F.-X. Briol, C. J. Oates, M. Girolami, M. A. Osborne, and D. Sejdinovic. Probabilistic Integration: A Role in Statistical Computation? (with Discussion and Rejoinder). Statistical Science, 34(1):1–22; rejoinder: 38–42, 2019.

[9] H. R. Chai and R. Garnett. Improving quadrature for constrained integrands. In K. Chaudhuri and M. Sugiyama, editors, Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS), volume 89 of Proceedings of Machine Learning Research, pages 2751–2759. PMLR, 2019.

[10] S. Chatterjee and P. Diaconis. The sample size required in importance sampling. Annals of Applied Probability, 28(2):1099–1135, 2018.

[11] W. Y. Chen, L. Mackey, J. Gorham, F.-X. Briol, and C. Oates. Stein points. In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 844–853. PMLR, 2018.

[12] Y. Chen, M. Welling, and A. Smola. Supersamples from kernel-herding.
In Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence (UAI 2010), pages 109–116, 2010.

[13] R. DeVore, G. Petrova, and P. Wojtaszczyk. Greedy algorithms for reduced bases in Banach spaces. Constructive Approximation, 37(3):455–466, 2013.

[14] J. Dick, F. Y. Kuo, and I. H. Sloan. High-dimensional numerical integration – the Quasi-Monte Carlo way. Acta Numerica, 22:133–288, 2013.

[15] Z. Ghahramani and C. E. Rasmussen. Bayesian Monte Carlo. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 505–512. MIT Press, 2003.

[16] T. Gunter, M. A. Osborne, R. Garnett, P. Hennig, and S. J. Roberts. Sampling for Inference in Probabilistic Models with Fast Bayesian Quadrature. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2789–2797. Curran Associates, Inc., 2014.

[17] F. Huszár and D. Duvenaud. Optimally-weighted herding is Bayesian quadrature. In Uncertainty in Artificial Intelligence, pages 377–385, 2012.

[18] M. Kanagawa, P. Hennig, D. Sejdinovic, and B. K. Sriperumbudur. Gaussian processes and kernel methods: A review on connections and equivalences. ArXiv preprint, 1807.02582v1, July 2018.

[19] M. Kanagawa, B. K. Sriperumbudur, and K. Fukumizu. Convergence guarantees for kernel-based quadrature rules in misspecified settings. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 3288–3296. Curran Associates, Inc., 2016.

[20] M. Kanagawa, B. K. Sriperumbudur, and K. Fukumizu. Convergence analysis of deterministic kernel-based quadrature rules in misspecified settings. Foundations of Computational Mathematics, 2019.

[21] T. Karvonen, C. J. Oates, and S. Särkkä. A Bayes–Sard cubature method. In S.
Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 5882–5893. Curran Associates, Inc., 2018.

[22] J. S. Liu. Monte Carlo Strategies in Scientific Computing. Springer-Verlag, New York, 2001.

[23] E. Novak. The adaption problem for nonsymmetric convex sets. Journal of Approximation Theory, 82:123–134, 1995.

[24] E. Novak. Optimal recovery and n-widths for convex classes of functions. Journal of Approximation Theory, 80:390–408, 1995.

[25] E. Novak. On the power of adaption. Journal of Complexity, 12:199–237, 1996.

[26] E. Novak. Some results on the complexity of numerical integration. In R. Cools and D. Nuyens, editors, Monte Carlo and Quasi-Monte Carlo Methods, Springer Proceedings in Mathematics & Statistics, volume 163, pages 161–183. Springer, Cham, 2016.

[27] C. J. Oates, J. Cockayne, F.-X. Briol, and M. Girolami. Convergence rates for a class of estimators based on Stein's method. Bernoulli, 25(2):1141–1159, 2019.

[28] A. O'Hagan. Bayes–Hermite quadrature. Journal of Statistical Planning and Inference, 29:245–260, 1991.

[29] M. A. Osborne, D. K. Duvenaud, R. Garnett, C. E. Rasmussen, S. J. Roberts, and Z. Ghahramani. Active learning of model evidence using Bayesian quadrature. In Advances in Neural Information Processing Systems (NIPS), pages 46–54, 2012.

[30] M. A. Osborne, R. Garnett, S. J. Roberts, C. Hart, S. Aigrain, and N. Gibson. Bayesian quadrature for ratios. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 832–840, 2012.

[31] L. Plaskota and G. W. Wasilkowski. The power of adaptive algorithms for functions with singularities. Journal of Fixed Point Theory and Applications, 6:227–248, 2009.

[32] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning.
MIT Press, 2006.

[33] C. Robert and G. Casella. Monte Carlo Statistical Methods. Springer, 2004.

[34] G. Santin and B. Haasdonk. Convergence rate of the data-independent P-greedy algorithm in kernel-based approximation. Dolomites Research Notes on Approximation, 10:68–78, 2017.

[35] B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, 2002.

[36] I. Steinwart and A. Christmann. Support Vector Machines. Springer, 2008.

[37] H. Wendland. Piecewise polynomial, positive definite and compactly supported radial functions of minimal degree. Advances in Computational Mathematics, 4(1):389–396, 1995.

[38] H. Wendland. Scattered Data Approximation. Cambridge University Press, Cambridge, UK, 2005.

[39] L. Wenliang, D. Sutherland, H. Strathmann, and A. Gretton. Learning deep kernels for exponential family densities. In K. Chaudhuri and R. Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 6737–6746. PMLR, 2019.

[40] X. Xi, F.-X. Briol, and M. Girolami. Bayesian quadrature for multiple related integrals. In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 5373–5382. PMLR, 2018.