{"title": "Frank-Wolfe Bayesian Quadrature: Probabilistic Integration with Theoretical Guarantees", "book": "Advances in Neural Information Processing Systems", "page_first": 1162, "page_last": 1170, "abstract": "There is renewed interest in formulating integration as an inference problem, motivated by obtaining a full distribution over numerical error that can be propagated through subsequent computation. Current methods, such as Bayesian Quadrature, demonstrate impressive empirical performance but lack theoretical analysis. An important challenge is to reconcile these probabilistic integrators with rigorous convergence guarantees. In this paper, we present the first probabilistic integrator that admits such theoretical treatment, called Frank-Wolfe Bayesian Quadrature (FWBQ). Under FWBQ, convergence to the true value of the integral is shown to be exponential and posterior contraction rates are proven to be superexponential. In simulations, FWBQ is competitive with state-of-the-art methods and out-performs alternatives based on Frank-Wolfe optimisation. Our approach is applied to successfully quantify numerical error in the solution to a challenging model choice problem in cellular biology.", "full_text": "Frank-Wolfe Bayesian Quadrature: Probabilistic\n\nIntegration with Theoretical Guarantees\n\nFranc\u00b8ois-Xavier Briol\nDepartment of Statistics\nUniversity of Warwick\n\nf-x.briol@warwick.ac.uk\n\nChris J. Oates\n\nSchool of Mathematical and Physical Sciences\n\nUniversity of Technology, Sydney\n\nchristopher.oates@uts.edu.au\n\nMark Girolami\n\nDepartment of Statistics\nUniversity of Warwick\n\n& The Alan Turing Institute for Data Science\n\nm.girolami@warwick.ac.uk\n\nMichael A. 
Osborne
Department of Engineering Science
University of Oxford
mosb@robots.ox.ac.uk

Abstract

There is renewed interest in formulating integration as a statistical inference problem, motivated by obtaining a full distribution over numerical error that can be propagated through subsequent computation. Current methods, such as Bayesian Quadrature, demonstrate impressive empirical performance but lack theoretical analysis. An important challenge is therefore to reconcile these probabilistic integrators with rigorous convergence guarantees. In this paper, we present the first probabilistic integrator that admits such theoretical treatment, called Frank-Wolfe Bayesian Quadrature (FWBQ). Under FWBQ, convergence to the true value of the integral is shown to be up to exponential and posterior contraction rates are proven to be up to super-exponential. In simulations, FWBQ is competitive with state-of-the-art methods and out-performs alternatives based on Frank-Wolfe optimisation. Our approach is applied to successfully quantify numerical error in the solution to a challenging Bayesian model choice problem in cellular biology.

1 Introduction

Computing integrals is a core challenge in machine learning, and numerical methods play a central role in this area. This reliance on numerical routines can be problematic when an integration routine is repeatedly called, perhaps millions of times, within a larger computational pipeline. In such situations, the cumulative impact of numerical errors can be unclear, especially in cases where the error has a non-trivial structural component. One solution is to model the numerical error statistically and to propagate this source of uncertainty through subsequent computations. 
Conversely, an understanding of how errors arise and propagate can enable the efficient focusing of computational resources upon the most challenging numerical integrals in a pipeline.

Classical numerical integration schemes do not account for prior information on the integrand and, as a consequence, can require an excessive number of function evaluations to obtain a prescribed level of accuracy [21]. Alternatives such as Quasi-Monte Carlo (QMC) can exploit knowledge of the smoothness of the integrand to obtain optimal convergence rates [7]. However, these optimal rates can only hold on sub-sequences of sample sizes n, a consequence of the fact that all function evaluations are weighted equally in the estimator [24]. A modern approach that avoids this problem is to consider arbitrarily weighted combinations of function values: the so-called quadrature rules (also called cubature rules). Whilst quadrature rules with non-equal weights have received comparatively little theoretical attention, it is known that the extra flexibility given by arbitrary weights can lead to extremely accurate approximations in many settings (see applications to image de-noising [3] and mental simulation in psychology [13]).

Probabilistic numerics, introduced in the seminal paper of [6], aims at re-interpreting numerical tasks as inference tasks that are amenable to statistical analysis.¹ Recent developments include probabilistic solvers for linear systems [14] and differential equations [5, 26]. For the task of computing integrals, Bayesian Quadrature (BQ) [22] and more recent work by [20] provide probabilistic numerics methods that produce a full posterior distribution on the output of numerical schemes. One advantage of this approach is that we can propagate uncertainty through all subsequent computations to explicitly model the impact of numerical error [15]. 
Contrast this with chaining together classical error bounds; the result in such cases will typically be a weak bound that provides no insight into the error structure. At present, a significant shortcoming of these methods is the absence of theoretical results relating to rates of posterior contraction. This is unsatisfying and has likely hindered the adoption of probabilistic approaches to integration, since it is not clear that the induced posteriors represent a sensible quantification of the numerical error (by classical, frequentist standards).

This paper establishes convergence rates for a new probabilistic approach to integration. Our results thus overcome a key perceived weakness associated with probabilistic numerics in the quadrature setting. Our starting point is recent work by [2], who cast the design of quadrature rules as a problem in convex optimisation that can be solved using the Frank-Wolfe (FW) algorithm. We propose a hybrid approach of [2] with BQ, taking the form of a quadrature rule, that (i) carries a full probabilistic interpretation, (ii) is amenable to rigorous theoretical analysis, and (iii) converges orders-of-magnitude faster, empirically, compared with the original approaches in [2]. In particular, we prove that super-exponential rates hold for posterior contraction (concentration of the posterior probability mass on the true value of the integral), showing that the posterior distribution provides a sensible and effective quantification of the uncertainty arising from numerical error. The methodology is explored in simulations and also applied to a challenging model selection problem from cellular biology, where numerical error could lead to mis-allocation of expensive resources.

2 Background

2.1 Quadrature and Cubature Methods

Let X ⊆ R^d, d ∈ N₊, be a measurable space and consider a probability density p(x) defined with respect to the Lebesgue measure on X. 
This paper focuses on computing integrals of the form ∫ f(x)p(x)dx for a test function f: X → R where, for simplicity, we assume f is square-integrable with respect to p(x). A quadrature rule approximates such integrals as a weighted sum of function values at some design points {x_i}_{i=1}^n ⊂ X:

∫_X f(x)p(x)dx ≈ Σ_{i=1}^n w_i f(x_i).   (1)

Viewing integrals as projections, we write p[f] for the left-hand side and p̂[f] for the right-hand side, where p̂ = Σ_{i=1}^n w_i δ(x_i) and δ(x_i) is a Dirac measure at x_i. Note that p̂ may not be a probability distribution; in fact, the weights {w_i}_{i=1}^n do not have to sum to one or be non-negative. Quadrature rules can be extended to multivariate functions f: X → R^d by taking each component in turn.

There are many ways of choosing combinations {x_i, w_i}_{i=1}^n in the literature. For example, taking weights to be w_i = 1/n with points {x_i}_{i=1}^n drawn independently from the probability distribution p(x) recovers basic Monte Carlo integration. The case with weights w_i = 1/n, but with points chosen with respect to some specific (possibly deterministic) scheme, includes kernel herding [4] and Quasi-Monte Carlo (QMC) [7]. In Bayesian Quadrature, the points {x_i}_{i=1}^n are chosen to minimise a posterior variance, with weights {w_i}_{i=1}^n arising from a posterior probability distribution.

Classical error analysis for quadrature rules is naturally couched in terms of minimising the worst-case estimation error. Let H be a Hilbert space of functions f: X → R, equipped with the inner product ⟨·,·⟩_H and associated norm ‖·‖_H. We define the maximum mean discrepancy (MMD) as:

MMD({x_i, w_i}_{i=1}^n) := sup_{f∈H: ‖f‖_H = 1} |p[f] − p̂[f]|.   (2)

The reader can refer to [27] for conditions on H that are needed for the existence of the MMD. The rate at which the MMD decreases with the number of samples n is referred to as the 'convergence rate' of the quadrature rule. For Monte Carlo, the MMD decreases at the slow rate O_P(n^{−1/2}) (where the subscript P specifies that the convergence is in probability). Let H be an RKHS with reproducing kernel k: X × X → R and denote the corresponding canonical feature map by Φ(x) = k(·, x), so that the mean element is given by µ_p(x) = p[Φ(x)] ∈ H. Then, following [27],

MMD({x_i, w_i}_{i=1}^n) = ‖µ_p − µ_p̂‖_H.   (3)

This shows that to obtain low integration error in the RKHS H, one only needs to obtain a good approximation of its mean element µ_p (as ∀f ∈ H: p[f] = ⟨f, µ_p⟩_H). Establishing theoretical results for such quadrature rules is an active area of research [1].

¹ A detailed discussion on probabilistic numerics and an extensive up-to-date bibliography can be found at http://www.probabilistic-numerics.org.

2.2 Bayesian Quadrature

Bayesian Quadrature (BQ) was originally introduced in [22] and later revisited by [11, 12] and [23]. The main idea is to place a functional prior on the integrand f, then update this prior through Bayes' theorem by conditioning on both samples {x_i}_{i=1}^n and function evaluations {f_i}_{i=1}^n at those sample points, where f_i = f(x_i). This induces a full posterior distribution over functions f and hence over the value of the integral p[f]. The most common implementation assumes a Gaussian Process (GP) prior f ∼ GP(0, k). A useful property motivating the use of GPs is that linear projection preserves normality, so that the posterior distribution for the integral p[f] is also a Gaussian, characterised by its mean and covariance. 
A natural estimate of the integral p[f] is given by the mean of this posterior distribution, which can be compactly written as

p̂_BQ[f] = zᵀK⁻¹f,   (4)

where z_i = µ_p(x_i) and K_ij = k(x_i, x_j). Notice that this estimator takes the form of a quadrature rule with weights w^BQ = zᵀK⁻¹. Recently, [25] showed how specific choices of kernel and design points for BQ can recover classical quadrature rules. This raises the question of how to select design points {x_i}_{i=1}^n. A particularly natural approach aims to minimise the posterior uncertainty over the integral p[f], which was shown in [16, Prop. 1] to equal:

v_BQ({x_i}_{i=1}^n) = p[µ_p] − zᵀK⁻¹z = MMD²({x_i, w_i^BQ}_{i=1}^n).   (5)

Thus, in the RKHS setting, minimising the posterior variance corresponds to minimising the worst-case error of the quadrature rule. Below we refer to Optimal BQ (OBQ) as BQ coupled with design points {x_i^OBQ}_{i=1}^n chosen to globally minimise (5). We also call Sequential BQ (SBQ) the algorithm that greedily selects design points to give the greatest decrease in posterior variance at each iteration. OBQ will give improved results over SBQ, but cannot be implemented in general, whereas SBQ is comparatively straightforward to implement. There are currently no theoretical results establishing the convergence of BQ, OBQ or SBQ.

Remark: (5) is independent of the observed function values f. As such, no active learning is possible in SBQ (i.e. surprising function values never cause a revision of a planned sampling schedule). This is not always the case: for example, [12] approximately encodes non-negativity of f into BQ, which leads to a dependence on f in the posterior variance. 
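To make (4) and (5) concrete, the following minimal sketch (not the authors' implementation) computes the BQ posterior mean and variance in a setting where the mean element is available in closed form: p = N(0, 1) on X = R with an EQ kernel k(x, y) = λ²exp(−(x−y)²/2σ²), for which µ_p(x) = λ²σ(σ²+1)^{−1/2}exp(−x²/2(σ²+1)) and p[µ_p] = λ²σ(σ²+2)^{−1/2}. The design grid and the choice (λ, σ) = (1, 1) are illustrative assumptions.

```python
import numpy as np

# Sketch of Bayesian Quadrature (eqs. 4-5) for p = N(0,1) and the EQ kernel
# k(x,y) = lam^2 exp(-(x-y)^2 / (2 sig^2)). The values lam = sig = 1 and the
# design grid are illustrative choices, not taken from the paper.
lam, sig = 1.0, 1.0

def k(x, y):
    return lam**2 * np.exp(-(x[:, None] - y[None, :])**2 / (2 * sig**2))

def mean_embedding(x):
    # mu_p(x) = int k(x,y) N(y;0,1) dy, closed form for the EQ kernel
    return lam**2 * sig / np.sqrt(sig**2 + 1) * np.exp(-x**2 / (2 * (sig**2 + 1)))

p_mu = lam**2 * sig / np.sqrt(sig**2 + 2)  # p[mu_p] = int int k dp dp

def bq_posterior(x, f):
    """Posterior mean z^T K^{-1} f (eq. 4) and variance p[mu_p] - z^T K^{-1} z (eq. 5)."""
    K = k(x, x) + 1e-10 * np.eye(len(x))   # small jitter for numerical stability
    z = mean_embedding(x)
    return z @ np.linalg.solve(K, f), p_mu - z @ np.linalg.solve(K, z)

x = np.linspace(-3, 3, 9)                   # fixed design points
f = k(x, np.array([0.0]))[:, 0]             # integrand f = k(., 0), an element of H
estimate, variance = bq_posterior(x, f)
truth = mean_embedding(np.array([0.0]))[0]  # p[f] = mu_p(0) since f = k(., 0)
```

Because the test integrand f = k(·, 0) lies in H and 0 is among the design points, the posterior mean recovers p[f] essentially exactly, while the posterior variance (5) remains non-negative and small.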
In this case sequential selection becomes an active strategy that outperforms batch selection in general.

2.3 Deriving Quadrature Rules via the Frank-Wolfe Algorithm

Despite the elegance of BQ, its convergence rates have not yet been rigorously established. In brief, this is because p̂_BQ[f] is an orthogonal projection of f onto the affine hull of {Φ(x_i)}_{i=1}^n, rather than e.g. the convex hull. Standard results from the optimisation literature apply to bounded domains, but the affine hull is not bounded (i.e. the BQ weights can be arbitrarily large and possibly negative). Below we describe a solution to the problem of computing integrals recently proposed by [2], based on the FW algorithm, that restricts attention to the (bounded) convex hull of {Φ(x_i)}_{i=1}^n.

Algorithm 1 The Frank-Wolfe (FW) and Frank-Wolfe with Line-Search (FWLS) Algorithms
Require: function J, initial state g_1 = ḡ_1 ∈ G (and, for FW only, a step-size sequence {ρ_i}_{i=1}^n)
1: for i = 2, ..., n do
2:   Compute ḡ_i = argmin_{g∈G} ⟨g, (DJ)(g_{i−1})⟩_×
3:   [For FWLS only, line search: ρ_i = argmin_{ρ∈[0,1]} J((1 − ρ)g_{i−1} + ρḡ_i)]
4:   Update g_i = (1 − ρ_i)g_{i−1} + ρ_i ḡ_i
5: end for

The Frank-Wolfe (FW) algorithm (Alg. 1), also called the conditional gradient algorithm, is a convex optimisation method introduced in [9]. It considers problems of the form min_{g∈G} J(g) where the function J: G → R is convex and continuously differentiable. A particular case of interest in this paper will be when the domain G is a compact and convex space of functions, as recently investigated in [17]. 
These assumptions imply the existence of a solution to the optimisation problem.

At each iteration i, the FW algorithm computes a linearisation of the objective function J at the previous state g_{i−1} ∈ G along its gradient (DJ)(g_{i−1}), and selects an 'atom' ḡ_i ∈ G that minimises the inner product between a state g and (DJ)(g_{i−1}). The new state g_i ∈ G is then a convex combination of the previous state g_{i−1} and of the atom ḡ_i. This convex combination depends on a step-size ρ_i which is pre-determined, and different versions of the algorithm may have different step-size sequences.

Our goal in quadrature is to approximate the mean element µ_p. Recently, [2] proposed to frame integration as a FW optimisation problem. Here, the domain G ⊆ H is a space of functions and the objective function is taken to be:

J(g) = (1/2)‖g − µ_p‖²_H.   (6)

This gives an approximation of the mean element, and J takes the form of half the posterior variance (or half the MMD²). In this functional approximation setting, minimisation of J is carried out over G = M, the marginal polytope of the RKHS H. The marginal polytope M is defined as the closure of the convex hull of Φ(X), so that in particular µ_p ∈ M. Assuming as in [18] that Φ(x) is uniformly bounded in feature space (i.e. ∃R > 0: ∀x ∈ X, ‖Φ(x)‖_H ≤ R), then M is a closed and bounded set and can be optimised over.

In order to define the algorithm rigorously in this case, we introduce the Fréchet derivative of J, denoted DJ: for H* the dual space of H, DJ: H → H* is the unique map such that for each g ∈ H, (DJ)(g) is the function mapping h ∈ H to (DJ)(g)(h) = ⟨g − µ_p, h⟩_H. We also introduce the bilinear map ⟨·,·⟩_×: H × H* → R which, for F ∈ H* given by F(g) = ⟨g, f⟩_H, is the rule giving ⟨h, F⟩_× = ⟨h, f⟩_H.

A particular advantage of this method is that it leads to 'sparse' solutions which are linear combinations of the atoms {ḡ_i}_{i=1}^n [2]. In particular this provides a weighted estimate for the mean element:

µ̂_FW := g_n = Σ_{i=1}^n w_i^FW ḡ_i,   w_i^FW := (Π_{j=i+1}^n (1 − ρ_{j−1})) ρ_{i−1},   (7)

where by default ρ_0 = 1, which leads to all w_i^FW ∈ [0, 1] when ρ_i = 1/(i + 1). A typical sequence of approximations to the mean element is shown in Fig. 1 (left), demonstrating that the approximation quickly converges to the ground truth (in black). Since minimisation of a linear function can be restricted to extreme points of the domain, the atoms will be of the form ḡ_i = Φ(x_i^FW) = k(·, x_i^FW) for some x_i^FW ∈ X. The minimisation in g over G from step 2 in Algorithm 1 therefore becomes a minimisation in x over X, and this algorithm therefore provides us with design points. In practice, at each iteration i, the FW algorithm hence selects a design point x_i^FW ∈ X which induces an atom ḡ_i and gives us an approximation of the mean element µ_p. 
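When, as in practice, the atoms are restricted to a finite candidate set {x_j}_{j=1}^M with x_j drawn from p, each iterate g_i can be summarised by a weight vector w over the candidates, and the pairing ⟨Φ(x_j), g_{i−1} − µ_p⟩_H reduces to (Kw − z)_j with K_jl = k(x_j, x_l) and z_j = µ_p(x_j). The loop can then be sketched in a few lines; the setup below (p = N(0, 1), EQ kernel with λ = σ = 1, closed-form µ_p) is an illustrative assumption, not the authors' code.

```python
import numpy as np

# Frank-Wolfe on J(g) = 0.5 * ||g - mu_p||_H^2 over the convex hull of
# {Phi(x_j)} for a finite candidate set {x_j}. Illustrative setup:
# p = N(0,1), EQ kernel with lam = sig = 1 (closed-form mean embedding).
lam, sig = 1.0, 1.0
cand = np.linspace(-4, 4, 201)  # candidate design points
K = lam**2 * np.exp(-(cand[:, None] - cand[None, :])**2 / (2 * sig**2))
z = lam**2 * sig / np.sqrt(sig**2 + 1) * np.exp(-cand**2 / (2 * (sig**2 + 1)))
p_mu = lam**2 * sig / np.sqrt(sig**2 + 2)  # p[mu_p]

def mmd2(w):
    # ||mu_p - sum_j w_j Phi(x_j)||_H^2 = w^T K w - 2 w^T z + p[mu_p]
    return w @ K @ w - 2 * w @ z + p_mu

n = 50
w = np.zeros(len(cand))
w[np.argmax(z)] = 1.0        # initial atom: greedy choice at g = 0
start = mmd2(w)
for i in range(1, n):
    grad = K @ w - z         # <Phi(x_j), g_{i-1} - mu_p>_H for each candidate
    j = np.argmin(grad)      # atom minimising the linearised objective
    rho = 1.0 / (i + 1)      # default step-size sequence
    w = (1 - rho) * w        # convex combination update
    w[j] += rho
```

With ρ_i = 1/(i + 1) the final weights are uniform over the selected atoms, matching the FW weights in (7); the squared MMD of the iterate decreases markedly over the run while the weights stay non-negative and sum to one.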
We denote by µ̂_FW this approximation after n iterations. Using the reproducing property, we can show that the FW estimate is a quadrature rule:

p̂_FW[f] := ⟨f, µ̂_FW⟩_H = ⟨f, Σ_{i=1}^n w_i^FW ḡ_i⟩_H = Σ_{i=1}^n w_i^FW ⟨f, k(·, x_i^FW)⟩_H = Σ_{i=1}^n w_i^FW f(x_i^FW).   (8)

The total computational cost for FW is O(n²). An extension known as FW with Line Search (FWLS) uses a line-search method to find the optimal step size ρ_i at each iteration (see Alg. 1).

Figure 1: Left: Approximations of the mean element µ_p using the FWLS algorithm, based on n = 1, 2, 5, 10, 50 design points (purple, blue, green, red and orange respectively). It is not possible to distinguish between approximation and ground truth when n = 50. Right: Density of a mixture of 20 Gaussian distributions, displaying the first n = 25 design points chosen by FW (red), FWLS (orange) and SBQ (green). Each method provides well-spaced design points in high-density regions. Most FW and FWLS design points overlap, partly explaining their similar performance in this case.

Once again, the approximation obtained by FWLS has a sparse expression as a convex combination of all the previously visited states, and we obtain an associated quadrature rule. FWLS has theoretical convergence rates that can be stronger than standard versions of FW, but has computational cost in O(n³). The authors in [10] provide a survey of FW-based algorithms and their convergence rates under different regularity conditions on the objective function and domain of optimisation.

Remark: The FW design points {x_i^FW}_{i=1}^n are generally not available in closed form. We follow the mainstream literature by selecting, at each iteration, the point that minimises the MMD over a finite collection of M points, drawn i.i.d. from p(x). 
The authors in [18] proved that this approximation adds a O(M^{−1/4}) term to the MMD, so that theoretical results on FW convergence continue to apply provided that M(n) → ∞ sufficiently quickly. Appendix A provides full details. In practice, one may also make use of a numerical optimisation scheme in order to select the points.

3 A Hybrid Approach: Frank-Wolfe Bayesian Quadrature

To combine the advantages of a probabilistic integrator with a formal convergence theory, we propose Frank-Wolfe Bayesian Quadrature (FWBQ). In FWBQ, we first select design points {x_i^FW}_{i=1}^n using the FW algorithm. However, when computing the quadrature approximation, instead of using the usual FW weights {w_i^FW}_{i=1}^n, we use the weights {w_i^BQ}_{i=1}^n provided by BQ. We denote this quadrature rule by p̂_FWBQ and also consider p̂_FWLSBQ, which uses FWLS in place of FW. As we show below, these hybrid estimators (i) carry the Bayesian interpretation of Sec. 2.2, (ii) permit a rigorous theoretical analysis, and (iii) out-perform existing FW quadrature rules by orders of magnitude in simulations. FWBQ is hence ideally suited to probabilistic numerics applications.

For these theoretical results we assume that f belongs to a finite-dimensional RKHS H, in line with recent literature [2, 10, 17, 18]. We further assume that X is a compact subset of R^d, that p(x) > 0 ∀x ∈ X and that k is continuous on X × X. Under these hypotheses, Theorem 1 establishes consistency of the posterior mean, while Theorem 2 establishes contraction for the posterior distribution.

Theorem 1 (Consistency). 
The posterior mean p̂_FWBQ[f] converges to the true integral p[f] at the following rates:

|p[f] − p̂_FWBQ[f]| ≤ MMD({x_i, w_i}_{i=1}^n) ≤ { √2 D²R⁻¹n⁻¹ for FWBQ;  2D exp(−(R²/2D²)n) for FWLSBQ }   (9)

where FWBQ uses the step-size ρ_i = 1/(i+1), D ∈ (0, ∞) is the diameter of the marginal polytope M and R ∈ (0, ∞) gives the radius of the smallest ball of centre µ_p included in M.

Note that all the proofs of this paper can be found in Appendix B. An immediate corollary of Theorem 1 is that FWLSBQ has an asymptotic error which is exponential in n and is therefore superior to that of any QMC estimator [7]. This is not a contradiction: recall that QMC restricts attention to uniform weights, while FWLSBQ is able to propose arbitrary weightings. In addition we highlight a robustness property: even when the assumptions of this section do not hold, one still obtains at least a rate O_P(n^{−1/2}) for the posterior mean using either FWBQ or FWLSBQ [8].

Remark: The choice of kernel affects the convergence of the FWBQ method [15]. Clearly, we expect faster convergence if the function we are integrating is 'close' to the space of functions induced by our kernel. Indeed, the kernel specifies the geometry of the marginal polytope M, which in turn directly influences the rate constants R and D associated with FW convex optimisation.

Consistency is only a stepping stone towards our main contribution, which establishes posterior contraction rates for FWBQ. 
Posterior contraction is important as these results justify, for the first time, the probabilistic numerics approach to integration; that is, we show that the full posterior distribution is a sensible quantification (at least asymptotically) of numerical error in the integration routine:

Theorem 2 (Contraction). Let S ⊆ R be an open neighbourhood of the true integral p[f] and let γ = inf_{r∈S^c} |r − p[f]| > 0. Then the posterior probability mass on S^c = R \ S vanishes at a rate:

prob(S^c) ≤ { (2D²/(√π Rγ)) n⁻¹ exp(−(γ²R²/8D⁴)n²) for FWBQ;  (2√2 D/(√π γ)) exp(−(R²/2D²)n − (γ²/8D²) exp((R²/D²)n)) for FWLSBQ }   (10)

where FWBQ uses the step-size ρ_i = 1/(i+1), D ∈ (0, ∞) is the diameter of the marginal polytope M and R ∈ (0, ∞) gives the radius of the smallest ball of centre µ_p included in M.

The contraction rates are exponential for FWBQ and super-exponential for FWLSBQ, and thus the two algorithms enjoy both a probabilistic interpretation and rigorous theoretical guarantees. A notable corollary is that OBQ enjoys the same rates as FWLSBQ, resolving a conjecture by Tony O'Hagan that OBQ converges exponentially [personal communication]:

Corollary. The consistency and contraction rates obtained for FWLSBQ apply also to OBQ.

4 Experimental Results

4.1 Simulation Study

To facilitate the experiments in this paper we followed [1, 2, 11, 18] and employed an exponentiated-quadratic (EQ) kernel k(x, x') := λ² exp(−‖x − x'‖²₂/2σ²). This corresponds to an infinite-dimensional RKHS, not covered by our theory; nevertheless, we note that all simulations are practically finite-dimensional due to rounding at machine precision. 
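The benefit of the BQ re-weighting at fixed design points can be checked directly: for any point set, the BQ weights w^BQ = K⁻¹z minimise the MMD in (3), so they can never do worse than the uniform (FW-style) weights. A small 1D sketch follows, using p = N(0, 1) and an EQ kernel with λ = σ = 1 (closed-form embeddings); this is an illustrative stand-in for the paper's 2D mixture experiment, not a reproduction of it.

```python
import numpy as np

# For fixed design points, compare the MMD^2 of uniform (FW-style) weights
# with that of the BQ weights K^{-1} z, which are MMD-optimal for the points.
# Setup (p = N(0,1), EQ kernel, lam = sig = 1) is an illustrative assumption.
lam, sig = 1.0, 1.0
x = np.linspace(-2, 2, 7)                   # fixed design points
K = lam**2 * np.exp(-(x[:, None] - x[None, :])**2 / (2 * sig**2))
K += 1e-10 * np.eye(len(x))                 # jitter for numerical stability
z = lam**2 * sig / np.sqrt(sig**2 + 1) * np.exp(-x**2 / (2 * (sig**2 + 1)))
p_mu = lam**2 * sig / np.sqrt(sig**2 + 2)   # p[mu_p]

def mmd2(w):
    # ||mu_p - mu_phat||_H^2 for the rule with weights w at the points x
    return w @ K @ w - 2 * w @ z + p_mu

w_uniform = np.full(len(x), 1.0 / len(x))   # FW-style equal weights
w_bq = np.linalg.solve(K, z)                # BQ weights (eq. 4)
mmd2_fw, mmd2_bq = mmd2(w_uniform), mmd2(w_bq)
```

Here mmd2_bq equals p[µ_p] − zᵀK⁻¹z, i.e. the BQ posterior variance (5), and is strictly smaller than mmd2_fw, mirroring (on a toy problem) the orders-of-magnitude MMD gap observed in the simulations below.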
See Appendix E for a finite-dimensional approximation using random Fourier features. EQ kernels are popular in the BQ literature as, when p is a mixture of Gaussians, the mean element µ_p is analytically tractable (see Appendix C). Some other (p, k) pairs that produce analytic mean elements are discussed in [1].

For this simulation study, we took p(x) to be a 20-component mixture of 2D-Gaussian distributions. Monte Carlo (MC) is often used for such distributions but has a slow convergence rate in O_P(n^{−1/2}). FW and FWLS are known to converge more quickly and are in this sense preferable to MC [2]. In our simulations (Fig. 2, left), both our novel methods FWBQ and FWLSBQ decreased the MMD much faster than the FW/FWLS methods of [2]. Here, the same kernel hyper-parameters (λ, σ) = (1, 0.8) were employed for all methods to ensure a fair comparison. This suggests that the best quadrature rules correspond to elements outside the convex hull of {Φ(x_i)}_{i=1}^n. Examples of those, including BQ, often assign negative weights to features (Fig. S1 right, Appendix D).

Figure 2: Simulation study. Left: Plot of the worst-case integration error squared (MMD²). Both FWBQ and FWLSBQ are seen to outperform FW and FWLS, with SBQ performing best overall. Right: Integral estimates for FWLS and FWLSBQ for a function f ∈ H. FWLS converges more slowly and provides only a point estimate for a given number of design points. In contrast, FWLSBQ converges faster and provides a full probability distribution over numerical error, shown shaded in orange (68% and 95% credible intervals). Ground truth corresponds to the dotted black line.

The principal advantage of our proposed methods is that they reconcile theoretical tractability with a fully probabilistic interpretation. For illustration, Fig. 2 (right) plots the posterior uncertainty due to numerical error for a typical integration problem based on this p(x). In-depth empirical studies of such posteriors exist already in the literature and the reader is referred to [3, 13, 22] for details. Beyond these theoretically tractable integrators, SBQ seems to give even better performance as n increases. An intuitive explanation is that SBQ picks {x_i}_{i=1}^n to minimise the MMD, whereas FWBQ and FWLSBQ only minimise an approximation of the MMD (its linearisation along DJ). In addition, the SBQ weights are optimal at each iteration, which is not true for FWBQ and FWLSBQ. We conjecture that Theorems 1 and 2 provide upper bounds on the rates of SBQ. This conjecture is partly supported by Fig. 1 (right), which shows that SBQ selects similar design points to FW/FWLS (but weights them optimally). Note also that FWBQ and FWLSBQ give very similar results. This is not surprising, as FWLS has no guarantees over FW in infinite-dimensional RKHSs [17].

4.2 Quantifying Numerical Error in a Proteomic Model Selection Problem

A topical bioinformatics application that extends recent work by [19] is presented. The objective is to select among a set of candidate models {M_i}_{i=1}^m for protein regulation. This choice is based on a dataset D of protein expression levels, in order to determine a 'most plausible' biological hypothesis for further experimental investigation. Each M_i is specified by a vector of kinetic parameters θ_i (full details in Appendix D). Bayesian model selection requires that these parameters are integrated out against a prior p(θ_i) to obtain marginal likelihood terms L(M_i) = ∫ p(D|θ_i)p(θ_i)dθ_i. Our focus here is on obtaining the maximum a posteriori (MAP) model M_j, defined as the maximiser of the posterior model probability L(M_j)/Σ_{i=1}^m L(M_i) (where we have assumed a uniform prior over model space). 
Numerical error in the computation of each term L(M_i), if unaccounted for, could cause us to return a model M_k that is different from the true MAP estimate M_j and lead to the mis-allocation of valuable experimental resources.

The problem is quickly exaggerated as the number m of models increases, since there are more opportunities for one of the L(M_i) terms to be 'too large' due to numerical error. In [19], the number m of models was combinatorial in the number of protein kinases measured in a high-throughput assay (currently ∼ 10² but in principle up to ∼ 10⁴). This led [19] to deploy substantial computing resources to ensure that numerical error in each estimate of L(M_i) was individually controlled. Probabilistic numerics provides a more elegant and efficient solution: at any given stage, we have a fully probabilistic quantification of our uncertainty in each of the integrals L(M_i), shown to be sensible both theoretically and empirically. This induces a full posterior distribution over numerical uncertainty in the location of the MAP estimate (i.e. 'Bayes all the way down'). As such we can determine, on-line, the precise point in the computational pipeline at which numerical uncertainty near the MAP estimate becomes acceptably small, and cease further computation.

The FWBQ methodology was applied to one of the model selection tasks in [19]. In Fig. 3 (left) we display posterior model probabilities for each of the m = 352 candidate models, where a low number (n = 10) of samples was used for each integral. (For display clarity only the first 50 models are shown.) In this low-n regime, numerical error introduces a second level of uncertainty that we quantify by combining the FWBQ error models for all integrals in the computational pipeline; this is summarised by a box plot (rather than a single point) for each of the models (obtained by sampling; details in Appendix D). 
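The sampling step admits a simple sketch: each integral L(M_i) carries an (approximately independent) Gaussian FWBQ posterior, so drawing one sample per integral and recording which model attains the largest sampled value yields a Monte Carlo approximation to the posterior probability that each candidate is the MAP model. The means and standard deviations below are invented for illustration; the real ones come from the FWBQ runs described in Appendix D.

```python
import numpy as np

# Propagate Gaussian posteriors over each marginal likelihood L(M_i) to a
# distribution over the identity of the MAP model. The means and standard
# deviations here are invented for illustration, not values from the paper.
rng = np.random.default_rng(0)
means = np.array([1.0, 2.0, 3.0])   # posterior means for L(M_1), L(M_2), L(M_3)
sds = np.array([0.01, 0.01, 0.01])  # small numerical uncertainty

def map_probabilities(means, sds, n_samples=10000):
    # Sample each L(M_i); the model with the largest sampled value is the
    # sampled MAP model (under a uniform prior over models).
    L = rng.normal(means, sds, size=(n_samples, len(means)))
    winners = np.argmax(L, axis=1)
    return np.bincount(winners, minlength=len(means)) / n_samples

p_map = map_probabilities(means, sds)               # numerical error negligible
p_map_noisy = map_probabilities(means, np.ones(3))  # numerical error dominates
```

With tight posteriors the identity of the MAP model is certain; inflating the numerical uncertainty spreads posterior mass over several candidates, which is exactly the effect summarised by the box plots.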
These box plots reveal that our estimated posterior model probabilities are completely dominated by numerical error. In contrast, when n is increased through 50, 100 and 200 (Fig. 3, right, and Fig. S2), the uncertainty due to numerical error becomes negligible. At n = 200 we can conclude that model 26 is the true MAP estimate and further computations can be halted. Correctness of this result was confirmed using the more computationally intensive methods in [19]. In Appendix D we compared the relative performance of FWBQ, FWLSBQ and SBQ on this problem. Fig. S1 shows that the BQ weights reduced the MMD by orders of magnitude relative to FW and FWLS, and that SBQ converged more quickly than both FWBQ and FWLSBQ.

Figure 3: Quantifying numerical error in a model selection problem. FWBQ was used to model the numerical error of each integral L(M_i) explicitly. For integration based on n = 10 design points, FWBQ tells us that the computational estimate of the model posterior will be dominated by numerical error (left). When instead n = 100 design points are used (right), uncertainty due to numerical error becomes much smaller (but not yet small enough to determine the MAP estimate).

5 Conclusions

This paper provides the first theoretical results for probabilistic integration, in the form of posterior contraction rates for FWBQ and FWLSBQ. This is an important step in the probabilistic numerics research programme [15] as it establishes a theoretical justification for using the posterior distribution as a model for the numerical integration error (which was previously assumed [e.g. 11, 12, 20, 23, 25]). 
The practical advantages conferred by a fully probabilistic error model were demonstrated on a model selection problem from proteomics, where the sensitivity of the MAP estimate was modelled in terms of the error arising from repeated numerical integration.

The strengths and weaknesses of BQ (notably, including scalability in the dimension d of X) are well-known and are inherited by our FWBQ methodology. We do not review these here but refer the reader to [22] for an extended discussion. Convergence, in the classical sense, was proven here to occur exponentially quickly for FWLSBQ, which partially explains the excellent performance of BQ and related methods seen in applications [12, 23], as well as resolving an open conjecture. As a bonus, the hybrid quadrature rules that we developed turned out to converge much faster in simulations than those in [2], which originally motivated our work.

A key open problem for kernel methods in probabilistic numerics is to establish protocols for the practical elicitation of kernel hyper-parameters. This is important because hyper-parameters directly affect the scale of the posterior over numerical error that we ultimately aim to interpret. Note that this problem applies equally to BQ, as well as to related quadrature methods [2, 11, 12, 20] and, more generally, in probabilistic numerics [26]. Previous work, such as [13], optimised hyper-parameters on a per-application basis. Our ongoing research seeks automatic and general methods for hyper-parameter elicitation that provide good frequentist coverage properties for posterior credible intervals, but we reserve the details for a future publication.

Acknowledgments

The authors are grateful for discussions with Simon Lacoste-Julien, Simo Särkkä, Arno Solin, Dino Sejdinovic, Tom Gunter and Mathias Cronjäger. FXB was supported by EPSRC [EP/L016710/1]. CJO was supported by EPSRC [EP/D002060/1].
MG was supported by EPSRC [EP/J016934/1], an EPSRC Established Career Fellowship, the EU grant [EU/259348] and a Royal Society Wolfson Research Merit Award.

References

[1] F. Bach. On the Equivalence between Quadrature Rules and Random Features. arXiv:1502.06800, 2015.
[2] F. Bach, S. Lacoste-Julien, and G. Obozinski. On the Equivalence between Herding and Conditional Gradient Algorithms. In Proceedings of the 29th International Conference on Machine Learning, pages 1359–1366, 2012.
[3] Y. Chen, L. Bornn, N. de Freitas, M. Eskelin, J. Fang, and M. Welling. Herded Gibbs Sampling. Journal of Machine Learning Research, 2015. To appear.
[4] Y. Chen, M. Welling, and A. Smola. Super-Samples from Kernel Herding. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, pages 109–116, 2010.
[5] P. Conrad, M. Girolami, S. Särkkä, A. Stuart, and K. Zygalakis. Probability Measures for Numerical Solutions of Differential Equations. arXiv:1506.04592, 2015.
[6] P. Diaconis. Bayesian Numerical Analysis. Statistical Decision Theory and Related Topics IV, pages 163–175, 1988.
[7] J. Dick and F. Pillichshammer. Digital Nets and Sequences - Discrepancy Theory and Quasi-Monte Carlo Integration. Cambridge University Press, 2010.
[8] J. C. Dunn. Convergence Rates for Conditional Gradient Sequences Generated by Implicit Step Length Rules. SIAM Journal on Control and Optimization, 18(5):473–487, 1980.
[9] M. Frank and P. Wolfe. An Algorithm for Quadratic Programming. Naval Research Logistics Quarterly, 3:95–110, 1956.
[10] D. Garber and E. Hazan. Faster Rates for the Frank-Wolfe Method over Strongly-Convex Sets. In Proceedings of the 32nd International Conference on Machine Learning, pages 541–549, 2015.
[11] Z.
Ghahramani and C. Rasmussen. Bayesian Monte Carlo. In Advances in Neural Information Processing Systems, pages 489–496, 2003.
[12] T. Gunter, R. Garnett, M. Osborne, P. Hennig, and S. Roberts. Sampling for Inference in Probabilistic Models with Fast Bayesian Quadrature. In Advances in Neural Information Processing Systems, 2014.
[13] J.B. Hamrick and T.L. Griffiths. Mental Rotation as Bayesian Quadrature. In NIPS 2013 Workshop on Bayesian Optimization in Theory and Practice, 2013.
[14] P. Hennig. Probabilistic Interpretation of Linear Solvers. SIAM Journal on Optimization, 25:234–260, 2015.
[15] P. Hennig, M. Osborne, and M. Girolami. Probabilistic Numerics and Uncertainty in Computations. Proceedings of the Royal Society A, 471(2179), 2015.
[16] F. Huszar and D. Duvenaud. Optimally-Weighted Herding is Bayesian Quadrature. In Uncertainty in Artificial Intelligence, pages 377–385, 2012.
[17] M. Jaggi. Revisiting Frank-Wolfe: Projection-Free Sparse Convex Optimization. In Proceedings of the 30th International Conference on Machine Learning, volume 28, pages 427–435, 2013.
[18] S. Lacoste-Julien, F. Lindsten, and F. Bach. Sequential Kernel Herding: Frank-Wolfe Optimization for Particle Filtering. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, pages 544–552, 2015.
[19] C.J. Oates, F. Dondelinger, N. Bayani, J. Korkola, J.W. Gray, and S. Mukherjee. Causal Network Inference using Biochemical Kinetics. Bioinformatics, 30(17):i468–i474, 2014.
[20] C.J. Oates, M. Girolami, and N. Chopin. Control Functionals for Monte Carlo Integration. arXiv:1410.2392, 2015.
[21] A. O'Hagan. Monte Carlo is Fundamentally Unsound. Journal of the Royal Statistical Society, Series D, 36(2):247–249, 1987.
[22] A. O'Hagan. Bayes-Hermite Quadrature.
Journal of Statistical Planning and Inference, 29:245–260, 1991.
[23] M. Osborne, R. Garnett, S. Roberts, C. Hart, S. Aigrain, and N. Gibson. Bayesian Quadrature for Ratios. In Proceedings of the 15th International Conference on Artificial Intelligence and Statistics, pages 832–840, 2012.
[24] A. B. Owen. A Constraint on Extensible Quadrature Rules. Numerische Mathematik, pages 1–8, 2015.
[25] S. Särkkä, J. Hartikainen, L. Svensson, and F. Sandblom. On the Relation between Gaussian Process Quadratures and Sigma-Point Methods. arXiv:1504.05994, 2015.
[26] M. Schober, D. Duvenaud, and P. Hennig. Probabilistic ODE Solvers with Runge-Kutta Means. In Advances in Neural Information Processing Systems 27, pages 739–747, 2014.
[27] B. Sriperumbudur, A. Gretton, K. Fukumizu, B. Schölkopf, and G. Lanckriet. Hilbert Space Embeddings and Metrics on Probability Measures. Journal of Machine Learning Research, 11:1517–1561, 2010.