{"title": "Convergence guarantees for kernel-based quadrature rules in misspecified settings", "book": "Advances in Neural Information Processing Systems", "page_first": 3288, "page_last": 3296, "abstract": "Kernel-based quadrature rules are becoming important in machine learning and statistics, as they achieve super-$\u00a5sqrt{n}$ convergence rates in numerical integration, and thus provide alternatives to Monte Carlo integration in challenging settings where integrands are expensive to evaluate or where integrands are high dimensional. These rules are based on the assumption that the integrand has a certain degree of smoothness, which is expressed as that the integrand belongs to a certain reproducing kernel Hilbert space (RKHS). However, this assumption can be violated in practice (e.g., when the integrand is a black box function), and no general theory has been established for the convergence of kernel quadratures in such misspecified settings. Our contribution is in proving that kernel quadratures can be consistent even when the integrand does not belong to the assumed RKHS, i.e., when the integrand is less smooth than assumed. 
Specifically, we derive convergence rates that depend on the (unknown) lesser smoothness of the integrand, where the degree of smoothness is expressed via powers of RKHSs or via Sobolev spaces.", "full_text": "Convergence guarantees for kernel-based quadrature rules in misspecified settings

Motonobu Kanagawa*, Bharath K Sriperumbudur†, Kenji Fukumizu*
*The Institute of Statistical Mathematics, Tokyo 190-8562, Japan
†Department of Statistics, Pennsylvania State University, University Park, PA 16802, USA
kanagawa@ism.ac.jp, bks18@psu.edu, fukumizu@ism.ac.jp

Abstract

Kernel-based quadrature rules are becoming important in machine learning and statistics, as they achieve super-√n convergence rates in numerical integration, and thus provide alternatives to Monte Carlo integration in challenging settings where integrands are expensive to evaluate or where integrands are high dimensional. These rules are based on the assumption that the integrand has a certain degree of smoothness, expressed as the assumption that the integrand belongs to a certain reproducing kernel Hilbert space (RKHS). However, this assumption can be violated in practice (e.g., when the integrand is a black box function), and no general theory has been established for the convergence of kernel quadratures in such misspecified settings. Our contribution is in proving that kernel quadratures can be consistent even when the integrand does not belong to the assumed RKHS, i.e., when the integrand is less smooth than assumed. Specifically, we derive convergence rates that depend on the (unknown) lesser smoothness of the integrand, where the degree of smoothness is expressed via powers of RKHSs or via Sobolev spaces.

1 Introduction
Numerical integration, or quadrature, is a fundamental task in the construction of various statistical and machine learning algorithms.
For instance, in Bayesian learning, numerical integration is generally required for the computation of the marginal likelihood in model selection, for the marginalization of parameters in fully Bayesian prediction, etc. [20]. It also offers flexibility to probabilistic modeling, since, e.g., it enables us to use a prior that is not conjugate with a likelihood function.
Let P be a (known) probability distribution on a measurable space X and f be an integrand on X. Suppose that the integral ∫ f(x)dP(x) has no closed form solution. One standard form of numerical integration is to approximate the integral as a weighted sum of function values f(X_1), ..., f(X_n) by appropriately choosing the points X_1, ..., X_n ∈ X and weights w_1, ..., w_n ∈ R:

∑_{i=1}^n w_i f(X_i) ≈ ∫ f(x)dP(x).   (1)

For example, the simplest Monte Carlo method generates the points X_1, ..., X_n as an i.i.d. sample from P and uses equal weights w_1 = ··· = w_n = 1/n. Convergence rates of such Monte Carlo methods are of the form n^{−1/2}, which can be slow for practical purposes. For instance, in situations where the evaluation of the integrand requires heavy computations, n should be small and Monte Carlo would perform poorly; such situations typically appear in modern scientific and engineering applications, and thus quadratures with faster convergence rates are desirable [18].
One way of achieving faster rates is to exploit one's prior knowledge or assumption about the integrand (e.g. the degree of smoothness) in the construction of a weighted point set {(w_i, X_i)}_{i=1}^n.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Reproducing kernel Hilbert spaces (RKHS) have been successfully used for this purpose, with examples being Quasi Monte Carlo (QMC) methods based on RKHSs [13] and Bayesian quadratures [19]; see e.g. [11, 6] and references therein.
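As a point of reference for the rates discussed above, the weighted-sum form (1) and the plain Monte Carlo rule can be sketched in a few lines. This is a toy illustration (the integrand sin and the uniform distribution on [0, 1] are arbitrary choices, not from the paper):

```python
import math
import random

def quadrature(points, weights, f):
    # weighted-sum approximation of an integral, the form of eq. (1)
    return sum(w * f(x) for w, x in zip(weights, points))

def monte_carlo(f, n, rng):
    # simplest Monte Carlo: i.i.d. uniform points on [0, 1], equal weights 1/n
    points = [rng.random() for _ in range(n)]
    return quadrature(points, [1.0 / n] * n, f)

rng = random.Random(0)
truth = 1.0 - math.cos(1.0)          # exact value of the integral of sin over [0, 1]
estimate = monte_carlo(math.sin, 10000, rng)
print(abs(estimate - truth))         # error of order n^{-1/2}
```

With n = 10000 points the error is of order 10^{-2} or smaller, consistent with the n^{−1/2} rate; the kernel-based rules discussed next aim to beat this.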
We will refer to such methods as kernel-based quadrature rules, or simply kernel quadratures, in this paper. A kernel quadrature assumes that the integrand f belongs to an RKHS consisting of smooth functions (such as a Sobolev space), and constructs the weighted points {(w_i, X_i)}_{i=1}^n so that the worst case error in that RKHS is small. Then an error rate of the form n^{−b}, b ≥ 1, which is much faster than the rates of Monte Carlo methods, is guaranteed, with b a constant representing the degree of smoothness of the RKHS (e.g., the order of differentiability). Because of this nice property, kernel quadratures have been studied extensively in recent years [7, 3, 5, 2, 17] and have started to find applications in machine learning and statistics [23, 14, 12, 6].
However, if the integrand f does not belong to the assumed RKHS (i.e. if f is less smooth than assumed), there is no known theoretical guarantee for a fast convergence rate, or even for the consistency, of kernel quadratures. Such misspecification is likely to happen if one does not have full knowledge of the integrand; such situations typically occur when the integrand is a black box function. As an illustrative example, consider the problem of illumination integration in computer graphics (see e.g. Sec. 5.2.4 of [6]). The task is to compute the total amount of light arriving at a camera in a virtual environment. This is solved by numerical integration with the integrand f(x) being the intensity of light arriving at the camera from a direction x (angle). The value f(x) is only given by simulation of the environment for each x, so f is a black box function. In such a situation, one's assumption on the integrand can be misspecified. Establishing convergence guarantees for such misspecified settings has been recognized as an important open problem in the literature [6, Section 6].
Contributions.
The main contribution of this paper is in providing general convergence guarantees for kernel-based quadrature rules in misspecified settings. Specifically, we make the following contributions:
• In Section 4, we prove that consistency can be guaranteed even when the integrand f does not belong to the assumed RKHS. Specifically, we derive a convergence rate of the form n^{−θb}, where 0 < θ ≤ 1 is a constant characterizing the (relative) smoothness of the integrand. In other words, the integration error decays at a speed depending on the (unknown) smoothness of the integrand. This guarantee is applicable to kernel quadratures that employ random points.
• We apply this result to QMC methods called lattice rules (with randomized points) and to the quadrature rule by Bach [2], for the setting where the RKHS is a Korobov space. We show that even when the integrand is less smooth than assumed, the error rate becomes the same as for the case when the (unknown) smoothness is known; namely, we show that these methods are adaptive to the unknown smoothness of the integrand.
• In Section 5, we provide guarantees for kernel quadratures with deterministic points, by focusing on specific cases where the RKHS is a Sobolev space W^r_2 of order r ∈ N (the order of differentiability). We prove that consistency can be guaranteed even if f ∈ W^s_2 with s ≤ r, i.e., when the integrand f belongs to a Sobolev space W^s_2 of lesser smoothness. We derive a convergence rate of the form n^{−bs/r}, where the ratio s/r determines the relative degree of smoothness.
• As an important consequence, we show that if weighted points {(w_i, X_i)}_{i=1}^n achieve the optimal rate in W^r_2, then they also achieve the optimal rate in W^s_2. In other words, to achieve the optimal rate for an integrand f belonging to W^s_2, one does not need to know the smoothness s of the integrand; one only needs to know an upper bound r with s ≤ r.

This paper is organized as follows. In Section 2, we describe kernel-based quadrature rules, and in Section 3 we formally state the goal and setting of our theoretical analysis. We present our contributions in Sections 4 and 5. Proofs are collected in the supplementary material.
Related work. Our work is close in spirit to [17], which discusses situations where the true integrand is smoother than assumed (complementary to ours) and proposes a control functional approach to make kernel quadratures adaptive to the (unknown) greater smoothness. We also note that there are certain quadratures which are adaptive to less smooth integrands [8, 9, 10]. On the other hand, our aim here is to provide general theoretical guarantees that are applicable to a wide class of kernel-based quadrature rules.

Notation. For x ∈ R^d, let {x} ∈ [0, 1]^d denote its fractional parts. For a probability distribution P on a measurable space X, let L2(P) be the Hilbert space of square-integrable functions with respect to P. If P is the Lebesgue measure on X ⊂ R^d, let L2(X) := L2(P), and further L2 := L2(R^d). Let C^s_0 := C^s_0(R^d) be the set of all functions on R^d that are continuously differentiable up to order s ∈ N and vanish at infinity.
Given a function f and weighted points {(w_i, X_i)}_{i=1}^n, let Pf := ∫ f(x)dP(x) and P_n f := ∑_{i=1}^n w_i f(X_i) denote the integral and its numerical approximation, respectively.

2 Kernel-based quadrature rules
Suppose that one has prior knowledge on certain properties of the integrand f (e.g. its order of differentiability). A kernel quadrature exploits this knowledge by expressing it as the assumption that f belongs to a certain RKHS H possessing those properties, and then constructing weighted points {(w_i, X_i)}_{i=1}^n so that the error of integration is small for every function in the RKHS. More precisely, it pursues the minimax strategy that aims to minimize the worst case error defined by

e_n(P; H) := sup_{f∈H: ∥f∥_H ≤ 1} |Pf − P_n f| = sup_{f∈H: ∥f∥_H ≤ 1} |∫ f(x)dP(x) − ∑_{i=1}^n w_i f(X_i)|,   (2)

where ∥·∥_H denotes the norm of H. The use of an RKHS is beneficial (compared to other function spaces) because it results in an analytic expression of the worst case error (2) in terms of the reproducing kernel. Namely, one can explicitly compute (2) in the construction of {(w_i, X_i)}_{i=1}^n as a criterion to be minimized. Below we describe this, as well as examples of kernel quadratures.

2.1 The worst case error in RKHS
Let X be a measurable space and H be an RKHS of functions on X with bounded reproducing kernel k : X × X → R.
By the reproducing property, it is easy to verify that Pf = ⟨f, m_P⟩_H and P_n f = ⟨f, m_{P_n}⟩_H for all f ∈ H, where ⟨·,·⟩_H denotes the inner product of H, and

m_P := ∫ k(·, x)dP(x) ∈ H,   m_{P_n} := ∑_{i=1}^n w_i k(·, X_i) ∈ H.

Therefore, the worst case error (2) can be written as the distance between m_P and m_{P_n}:

sup_{∥f∥_H ≤ 1} |Pf − P_n f| = sup_{∥f∥_H ≤ 1} ⟨f, m_P − m_{P_n}⟩_H = ∥m_P − m_{P_n}∥_H,   (3)

where ∥·∥_H denotes the norm of H, defined by ∥f∥_H = √⟨f, f⟩_H for f ∈ H. From (3), it is easy to see that for every f ∈ H, the integration error |P_n f − Pf| is bounded by the worst case error:

|P_n f − Pf| ≤ ∥f∥_H ∥m_P − m_{P_n}∥_H = ∥f∥_H e_n(P; H).

We refer the reader to [21, 11] for details of these derivations. Using the reproducing property of k, the r.h.s. of (3) can alternately be written as:

e_n(P; H) = ( ∫∫ k(x, x̃)dP(x)dP(x̃) − 2 ∑_{i=1}^n w_i ∫ k(x, X_i)dP(x) + ∑_{i=1}^n ∑_{j=1}^n w_i w_j k(X_i, X_j) )^{1/2}.   (4)

The integrals in (4) are known in closed form for many pairs of k and P; see e.g. Table 1 of [6]. For instance, if P is the uniform distribution on X = [0, 1]^d and k is the Korobov kernel described below, then ∫ k(y, x)dP(x) = 1 for all y ∈ X. To pursue the aforementioned minimax strategy, one can explicitly use the formula (4) to minimize the worst case error (2).
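For the Korobov kernel of order α = 1 (defined in Section 2.2) and P uniform on [0, 1], both integrals in (4) equal 1, so only the double sum needs to be evaluated. A minimal sketch of formula (4) under these choices (the equally spaced test points are an illustrative choice, not the paper's code):

```python
import math

def korobov1(x, y):
    # Korobov kernel of order alpha = 1, eq. (5): k(x, y) = 1 + 2 pi^2 B2({x - y}),
    # where B2(t) = t^2 - t + 1/6 is the second Bernoulli polynomial
    t = (x - y) % 1.0
    return 1.0 + 2.0 * math.pi ** 2 * (t * t - t + 1.0 / 6.0)

def worst_case_error(points, weights):
    # formula (4) for P uniform on [0, 1]: both kernel integrals equal 1,
    # so e_n^2 = 1 - 2 sum_i w_i + sum_{i,j} w_i w_j k(X_i, X_j)
    sq = 1.0 - 2.0 * sum(weights)
    sq += sum(wi * wj * korobov1(xi, xj)
              for wi, xi in zip(weights, points)
              for wj, xj in zip(weights, points))
    return math.sqrt(max(sq, 0.0))

n = 64
points = [i / n for i in range(n)]   # equally spaced points (a rank-1 lattice in d = 1)
weights = [1.0 / n] * n
print(worst_case_error(points, weights))  # closed form for this grid: pi / (sqrt(3) n)
```

For this equally spaced point set the double sum can be evaluated analytically, giving e_n = π/(√3 n), i.e. the O(n^{−1}) rate expected for α = 1.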
Often H is chosen as an RKHS consisting of smooth functions, where the degree of smoothness is what a user specifies; we describe this in the example below.

2.2 Examples of RKHSs: Korobov spaces
The setting X = [0, 1]^d is standard in the literature on numerical integration; see e.g. [11]. In this setting, Korobov spaces and Sobolev spaces have been widely used as RKHSs.¹ We describe the former here; for the latter, see Section 5.

¹Korobov spaces are also known as periodic Sobolev spaces in the literature [4, p. 318].

Korobov space on [0, 1]. The Korobov space W^α_Kor([0, 1]) of order α ∈ N is an RKHS whose kernel k_α is given by

k_α(x, y) := 1 + ((−1)^{α−1} (2π)^{2α} / (2α)!) B_{2α}({x − y}),   (5)

where B_{2α} denotes the 2α-th Bernoulli polynomial. W^α_Kor([0, 1]) consists of periodic functions on [0, 1] whose derivatives up to the (α−1)-th are absolutely continuous and whose α-th derivative belongs to L2([0, 1]) [16]. Therefore the order α represents the degree of smoothness of functions in W^α_Kor([0, 1]).
Korobov space on [0, 1]^d. For d ≥ 2, the kernel of the Korobov space is given as the product of one-dimensional kernels (5):

k_{α,d}(x, y) := k_α(x_1, y_1) ··· k_α(x_d, y_d),   x := (x_1, ..., x_d)^T, y := (y_1, ..., y_d)^T ∈ [0, 1]^d.   (6)

The induced Korobov space W^α_Kor([0, 1]^d) on [0, 1]^d is then the tensor product of one-dimensional Korobov spaces: W^α_Kor([0, 1]^d) := W^α_Kor([0, 1]) ⊗ ··· ⊗ W^α_Kor([0, 1]). Therefore it consists of functions having square-integrable mixed partial derivatives up to order α in each variable. This means that by using the kernel (6) in the computation of (4), one can make the assumption that the integrand f has smoothness of degree α in each variable.
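The kernels (5) and (6) can be sketched directly; this toy snippet hard-codes α = 1, for which B_2(t) = t² − t + 1/6, and the evaluation points are arbitrary:

```python
import math

def korobov_1d(x, y):
    # one-dimensional Korobov kernel (5) for alpha = 1: 1 + 2 pi^2 B2({x - y})
    t = (x - y) % 1.0
    return 1.0 + 2.0 * math.pi ** 2 * (t * t - t + 1.0 / 6.0)

def korobov_prod(x, y):
    # product kernel (6) on [0, 1]^d: smoothness of degree alpha in each coordinate
    out = 1.0
    for xk, yk in zip(x, y):
        out *= korobov_1d(xk, yk)
    return out

x, y = (0.2, 0.7), (0.5, 0.1)
print(korobov_prod(x, y))
```

The product structure mirrors the tensor-product construction of W^α_Kor([0, 1]^d): the d-dimensional kernel value factorizes over coordinates, and the kernel is symmetric because B_2(t) = B_2(1 − t).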
In other words, one can incorporate one's knowledge or belief about f into the construction of the weighted points {(w_i, X_i)} via the choice of α.

2.3 Examples of kernel-based quadrature rules
We briefly describe examples of kernel-based quadrature rules.
Quasi Monte Carlo (QMC). These methods typically focus on the setting where X = [0, 1]^d with P being the uniform distribution on [0, 1]^d, and employ equal weights w_1 = ··· = w_n = 1/n. Popular examples are lattice rules and digital nets/sequences. The points X_1, ..., X_n are selected in a deterministic way so that the worst case error (4) is as small as possible. Such deterministic points are then often randomized to obtain unbiased integral estimators, as we will explain in Section 4.2. For a review of these methods, see [11].
For instance, lattice rules generate X_1, ..., X_n in the following way (for simplicity assume n is prime). Let z ∈ {1, ..., n−1}^d be a generator vector. Then the points are defined as X_i = {iz/n} ∈ [0, 1]^d for i = 1, ..., n. Here z is selected so that the resulting worst case error (2) becomes as small as possible. The CBC (Component-By-Component) construction is a fast method that makes use of the formula (4) to achieve this; see Section 5 of [11] and references therein. Lattice rules applied to the Korobov space W^α_Kor([0, 1]^d) can achieve the rate e_n(P; W^α_Kor([0, 1]^d)) = O(n^{−α+ξ}) for the worst case error, with ξ > 0 arbitrarily small [11, Theorem 5.12].
Bayesian quadratures. These methods are applicable to general X and P, and employ non-uniform weights. The points X_1, ..., X_n are selected either deterministically or randomly. With the points fixed, the weights w_1, ..., w_n are obtained by minimizing (4), which can be done by solving a linear system of size n.
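The weight computation just described can be sketched for the Korobov kernel with α = 1 and P uniform on [0, 1], where the kernel-mean vector z_i = ∫ k(x, X_i)dP(x) is identically 1, so minimizing (4) over the weights amounts to solving Kw = 1 with K the kernel matrix. The random point set and the dependency-free solver below are illustrative choices, not the paper's code:

```python
import math
import random

def korobov1(x, y):
    # Korobov kernel of order 1, eq. (5); B2(t) = t^2 - t + 1/6
    t = (x - y) % 1.0
    return 1.0 + 2.0 * math.pi ** 2 * (t * t - t + 1.0 / 6.0)

def solve(A, b):
    # plain Gauss-Jordan elimination with partial pivoting (kept dependency-free)
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * b_ for a, b_ in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def bq_weights(points):
    # unconstrained Bayesian-quadrature weights: minimize (4) for fixed points;
    # here the kernel-mean vector is all ones, so the weights solve K w = 1
    K = [[korobov1(a, b) for b in points] for a in points]
    return solve(K, [1.0] * len(points))

rng = random.Random(1)
points = [rng.random() for _ in range(8)]
weights = bq_weights(points)
print(sum(weights))
```

A sanity check available in this special case: plugging Kw = 1 into (4) gives e_n² = 1 − Σ_i w_i, so the weight sum must lie in [0, 1].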
Such methods are called Bayesian quadratures, since the resulting estimate P_n f is exactly the posterior mean of the integral Pf given the "observations" {(X_i, f(X_i))}_{i=1}^n, with a Gaussian process prior on the integrand f with covariance kernel k. We refer to [6] for these methods.
For instance, the algorithm by Bach [2] proceeds as follows, for the case of H being a Korobov space W^α_Kor([0, 1]^d) and P being the uniform distribution on [0, 1]^d: (i) generate points X_1, ..., X_n independently from the uniform distribution on [0, 1]^d; (ii) compute weights w_1, ..., w_n by minimizing (4) under the constraint ∑_{i=1}^n w_i² ≤ 4/n. Bach [2] proved that this procedure gives the error rate e_n(P; W^α_Kor([0, 1]^d)) = O(n^{−α+ξ}) for ξ > 0 arbitrarily small.²

²Note that in [2], the degree of smoothness is expressed in terms of s := αd.

3 Setting and objective of theoretical analysis
We now formally state the setting and objective of our theoretical analysis in general form. Let P be a known distribution and H be an RKHS. Our starting point is that weighted points {(w_i, X_i)}_{i=1}^n have already been constructed for each n ∈ N by some quadrature rule³, and that these provide a consistent approximation of P in terms of the worst case error:

e_n(P; H) = ∥m_P − m_{P_n}∥_H = O(n^{−b})   (n → ∞),   (7)

where b > 0 is some constant. Here we do not specify the quadrature algorithm explicitly, in order to establish results applicable to a wide class of kernel quadratures simultaneously.
Let f be an integrand that is not included in the RKHS: f ∉ H. Namely, we consider a misspecified setting. Our aim is to derive convergence rates for the integration error

|P_n f − Pf| = |∑_{i=1}^n w_i f(X_i) − ∫ f(x)dP(x)|

based on the assumption (7). This will be done by assuming a certain regularity condition on f which expresses the (unknown) lesser smoothness of f. For example, this is the case when the weighted points are constructed by assuming a Korobov space of order α ∈ N, but the integrand f belongs to the Korobov space of order β < α: in this case, f is less smooth than assumed. As mentioned in Section 1, such misspecification is likely to happen if f is a black box function. But misspecification can also occur even when one has full knowledge of f. As explained in Section 2.1, the kernel k should be chosen so that the integrals in (4) allow analytic solutions w.r.t. P. Namely, the distribution P determines an available class of kernels (e.g. Gaussian kernels for a Gaussian distribution), and therefore the RKHS of a kernel from this class may not contain the integrand of interest. This situation can be seen in the application to random Fourier features [23], for example.

4 Analysis 1: General RKHS with random points
We first focus on kernel quadratures with random points. To this end, we need to introduce certain assumptions on (i) the construction of the weighted points {(w_i, X_i)}_{i=1}^n and (ii) the smoothness of the integrand f; we discuss these in Sections 4.1 and 4.2, respectively. In particular, we introduce the notion of powers of RKHSs [22] in Section 4.2, which enables us to characterize the (relative) smoothness of the integrand. We then state our main result in Section 4.3, and illustrate it with QMC lattice rules (with randomization) and the Bayesian quadrature by Bach [2] in Korobov RKHSs.

4.1 Assumption on random points X_1, ..., X_n
Assumption 1.
There exists a probability distribution Q on X satisfying the following properties: (i) P has a bounded density function w.r.t. Q; (ii) there is a constant D > 0, independent of n, such that

(E[(1/n) ∑_{i=1}^n g²(X_i)])^{1/2} ≤ D ∥g∥_{L2(Q)}   for all g ∈ L2(Q),   (8)

where E[·] denotes the expectation w.r.t. the joint distribution of X_1, ..., X_n.
Assumption 1 is fairly general, as it does not specify any particular distribution of the points X_1, ..., X_n, but just requires that the expectations over these points satisfy (8) for some distribution Q (note also that it allows the points to be dependent). For instance, consider the case where X_1, ..., X_n are independently generated from a user-specified distribution Q; in this case, Q serves as a proposal distribution. Then (8) holds for D = 1 with equality. Examples in this case include the Bayesian quadratures by Bach [2] and Briol et al. [6] with random points.
Assumption 1 is also satisfied by QMC methods that apply randomization to deterministic points, which is common in the literature [11, Sections 2.9 and 2.10]. Popular methods for randomization are the random shift and scrambling, both of which satisfy Assumption 1 for D = 1 with equality, where Q (= P) is the uniform distribution on X = [0, 1]^d. This is because, in general, randomization is applied to make an integral estimator unbiased: E[(1/n) ∑_{i=1}^n f(X_i)] = ∫_{[0,1]^d} f(x)dx [11, Section 2.9]. For instance, the random shift is done as follows. Let x_1, ..., x_n ∈ [0, 1]^d be deterministic points generated by a QMC method. Let Δ be a random sample from the uniform distribution on [0, 1]^d. Then each X_i is given as X_i := {x_i + Δ} ∈ [0, 1]^d. Therefore E[(1/n) ∑_{i=1}^n g²(X_i)] = ∫_{[0,1]^d} g²(x)dx = ∫ g²(x)dQ(x) for all g ∈ L2(Q), so (8) holds for D = 1 with equality.

³Note that here the weighted points should be written as {(w_i^{(n)}, X_i^{(n)})}_{i=1}^n, since they are constructed depending on the number of points n.
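The random shift of a rank-1 lattice described above can be sketched as follows; the lattice size n = 101 and generator vector z = (1, 27) are arbitrary illustrative choices:

```python
import math
import random

def lattice_points(n, z):
    # rank-1 lattice rule of Section 2.3: X_i = {i z / n} in [0, 1]^d
    return [tuple((i * zk / n) % 1.0 for zk in z) for i in range(n)]

def random_shift(points, rng):
    # random shift: X_i = {x_i + Delta} with a single uniform Delta in [0, 1]^d;
    # each shifted point is marginally uniform, so Assumption 1 holds with
    # Q uniform on [0, 1]^d and D = 1
    delta = [rng.random() for _ in range(len(points[0]))]
    return [tuple((xk + dk) % 1.0 for xk, dk in zip(x, delta)) for x in points]

rng = random.Random(0)
pts = random_shift(lattice_points(101, (1, 27)), rng)
# this zero-mean trigonometric polynomial has no frequency h with h.z = 0 mod 101,
# so the shifted lattice integrates it exactly (up to floating point)
est = sum(math.sin(2 * math.pi * x) * math.sin(2 * math.pi * y) for x, y in pts) / 101
print(est)
```

The shift preserves the lattice structure (all points move together modulo 1) while making each point marginally uniform, which is exactly what the unbiasedness argument above uses.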
However, we just write {(w_i, X_i)}_{i=1}^n for notational simplicity.

4.2 Assumption on the integrand via powers of RKHSs
To state our assumption on the integrand f, we need to introduce powers of RKHSs [22, Section 4]. Let 0 < θ ≤ 1 be a constant. First, with the distribution Q in Assumption 1, we require that the kernel satisfies

∫ k(x, x)dQ(x) < ∞.

For example, this is always satisfied if the kernel is bounded. We also assume that the support of Q is the entire X and that k is continuous. These conditions imply Mercer's theorem [22, Theorem 3.1 and Lemma 2.3], which guarantees the following expansion of the kernel k:

k(x, y) = ∑_{i=1}^∞ μ_i e_i(x)e_i(y),   x, y ∈ X,   (9)

where μ_1 ≥ μ_2 ≥ ··· > 0 and {e_i}_{i=1}^∞ is an orthonormal series in L2(Q); in particular, {μ_i^{1/2} e_i}_{i=1}^∞ forms an orthonormal basis of H. Here the convergence of the series in (9) is pointwise. Assume that ∑_{i=1}^∞ μ_i^θ e_i²(x) < ∞ holds for all x ∈ X. Then the θ-th power of the kernel k is the function k^θ : X × X → R defined by

k^θ(x, y) := ∑_{i=1}^∞ μ_i^θ e_i(x)e_i(y),   x, y ∈ X.   (10)

This is again a reproducing kernel [22, Proposition 4.2], and it defines an RKHS called the θ-th power of the RKHS H:

H^θ = { ∑_{i=1}^∞ a_i μ_i^{θ/2} e_i : ∑_{i=1}^∞ a_i² < ∞ }.

This is an intermediate space between L2(Q) and H, and the constant 0 < θ ≤ 1 determines how close H^θ is to H. For instance, if θ = 1 we have H^θ = H, and H^θ approaches L2(Q) as θ → +0. Indeed, H^θ is nested w.r.t. θ:

H = H^1 ⊂ H^θ ⊂ H^{θ′} ⊂ L2(Q),   for all 0 < θ′ < θ < 1.   (11)

In other words, H^θ gets larger as θ decreases. If H is an RKHS consisting of smooth functions, then H^θ contains less smooth functions than those in H; we will show this in the example below.
Assumption 2. The integrand f lies in H^θ for some 0 < θ ≤ 1.
We note that Assumption 2 is equivalent to assuming that f belongs to the interpolation space [L2(Q), H]_{θ,2}, or that f lies in the range of a power of a certain integral operator [22, Theorem 4.6].
Powers of tensor RKHSs. Let us mention the important case where the RKHS H is given as the tensor product of individual RKHSs H_1, ..., H_d on the spaces X_1, ..., X_d, i.e., H = H_1 ⊗ ··· ⊗ H_d and X = X_1 × ··· × X_d. In this case, if the distribution Q is the product of individual distributions Q_1, ..., Q_d on X_1, ..., X_d, it can easily be shown that the power RKHS H^θ is the tensor product of the individual power RKHSs H_i^θ:

H^θ = H_1^θ ⊗ ··· ⊗ H_d^θ.   (12)

Examples: Powers of Korobov spaces. Consider the Korobov space W^α_Kor([0, 1]^d) with Q being the uniform distribution on [0, 1]^d. The Korobov kernel (5) has the Mercer representation

k_α(x, y) = 1 + ∑_{i=1}^∞ (1/i^{2α}) [c_i(x)c_i(y) + s_i(x)s_i(y)],   (13)

where c_i(x) := √2 cos 2πix and s_i(x) := √2 sin 2πix. Note that c_0(x) := 1 and {c_i, s_i}_{i=1}^∞ constitute an orthonormal basis of L2([0, 1]).
From (10) and (13), the θ-th power of the Korobov kernel k_α is given by

k_α^θ(x, y) = 1 + ∑_{i=1}^∞ (1/i^{2αθ}) [c_i(x)c_i(y) + s_i(x)s_i(y)].

If αθ ∈ N, this is exactly the kernel k_{αθ} of the Korobov space W^{αθ}_Kor([0, 1]) of lower order αθ. In other words, W^{αθ}_Kor([0, 1]) is nothing but the θ-th power of W^α_Kor([0, 1]). From this and (12), we can also show that the θ-th power of W^α_Kor([0, 1]^d) is W^{αθ}_Kor([0, 1]^d) for d ≥ 2.

4.3 Result: Convergence rates for general RKHSs with random points
The following result guarantees the consistency of kernel quadratures for integrands satisfying Assumption 2, i.e., f ∈ H^θ.
Theorem 1. Let {(w_i, X_i)}_{i=1}^n be such that E[e_n(P; H)] = O(n^{−b}) for some b > 0 and E[∑_{i=1}^n w_i²] = O(n^{−2c}) for some 0 < c ≤ 1/2, as n → ∞. Assume also that {X_i}_{i=1}^n satisfies Assumption 1. Let 0 < θ ≤ 1 be a constant. Then for any f ∈ H^θ, we have

E[|P_n f − Pf|] = O(n^{−θb+(1/2−c)(1−θ)})   (n → ∞).   (14)

Remark 1. (a) The expectation in the assumption E[e_n(P; H)] = O(n^{−b}) is w.r.t. the joint distribution of the weighted points {(w_i, X_i)}_{i=1}^n.
(b) The assumption E[∑_{i=1}^n w_i²] = O(n^{−2c}) requires that each weight w_i decrease as n increases. For instance, if max_{i∈{1,...,n}} |w_i| = O(n^{−1}), we have c = 1/2. For QMC methods the weights are uniform, w_i = 1/n, so we always have c = 1/2. The quadrature rule by Bach [2] also satisfies c = 1/2; see Section 2.3.
(c) Let c = 1/2.
Then the rate in (14) becomes O(n^{−θb}), which shows that the integral estimator P_n f is consistent even when the integrand f does not belong to H (recall H ⊊ H^θ for θ < 1; see also (11)). The resulting rate O(n^{−θb}) is determined by the constant 0 < θ ≤ 1 of the assumption f ∈ H^θ, which characterizes the closeness of f to H.
(d) When θ = 1 (the well-specified case), irrespective of the value of c, the rate in (14) becomes O(n^{−b}), which recovers the rate of the worst case error E[e_n(P; H)] = O(n^{−b}).
Examples in Korobov spaces. Let us illustrate Theorem 1 in the setting described earlier. Let X = [0, 1]^d, H = W^α_Kor([0, 1]^d), and P be the uniform distribution on [0, 1]^d. Then H^θ = W^{αθ}_Kor([0, 1]^d), as discussed in Section 4.2. Consider the two methods discussed in Section 2.3: (i) the QMC lattice rules with randomization and (ii) the algorithm by Bach [2]. For both methods we have c = 1/2, and the distribution Q in Assumption 1 is uniform on [0, 1]^d in this setting. As mentioned before, these methods achieve the rate n^{−α+ξ} for arbitrarily small ξ > 0 in the well-specified setting: b = α − ξ in our notation.
Then the assumption f ∈ H^θ reads f ∈ W^{αθ}_Kor([0, 1]^d) for 0 < θ ≤ 1. For such an integrand f, we obtain the rate O(n^{−αθ+ξ}) in (14) with arbitrarily small ξ > 0. This is the same rate as for the well-specified case where W^{αθ}_Kor([0, 1]^d) was assumed in the construction of the weighted points. Namely, we have shown that these methods are adaptive to the unknown smoothness of the integrand.
For the algorithm by Bach [2], we conducted simulation experiments to support this observation, using code available from http://www.di.ens.fr/~fbach/quadrature.html. The setting is as described above with d = 1, and the weights are obtained without regularization as in [2]. The result is shown in Figure 1, where r (= α) denotes the assumed smoothness and s (= αθ) the (unknown) smoothness of an integrand. The straight lines are (asymptotic) upper bounds from Theorem 1 (slope −s, with the intercept fitted for n ≥ 24), and the corresponding solid lines are numerical results (both in log-log scale). Averages over 100 runs are shown. The result indeed shows the adaptability of the quadrature rule by Bach for the less smooth functions (i.e. s = 1, 2, 3). We observed similar results for the QMC lattice rules (reported in Appendix D of the supplement).

Figure 1: Simulation results

5 Analysis 2: Sobolev RKHS with deterministic points
In Section 4, we provided guarantees for methods that employ random points. However, that result does not apply to methods with deterministic points, such as (a) QMC methods without randomization, (b) Bayesian quadratures with deterministic points, and (c) kernel herding [7].
We aim here to provide guarantees for quadrature rules with deterministic points. To this end, we focus on the setting where X = R^d and H is a Sobolev space [1]. The Sobolev space W^r_2 of order r ∈ N is defined by

W^r_2 := {f ∈ L2 : D^α f ∈ L2 exists for all |α| ≤ r},
For r > d/2, this is an RKHS with the reproducing kernel k being the Matérn kernel; see Section 4.2.1 of [20] for the definition.

Our assumption on the integrand f is that it belongs to a Sobolev space W^s_2 of a lower order s ≤ r. Note that the order s represents the smoothness of f (the order of differentiability). Therefore the situation s < r means that f is less smooth than assumed; we consider the setting where W^r_2 was assumed for the construction of weighted points.

Rates under an assumption on weights. The first result in this section is based on the same assumption on weights as in Theorem 1.

Theorem 2. Let {(w_i, X_i)}_{i=1}^n be such that e_n(P; W^r_2) = O(n^{−b}) for some b > 0 and Σ_{i=1}^n w_i² = O(n^{−2c}) for some 0 < c ≤ 1/2, as n → ∞. Then for any f ∈ C^s_0 ∩ W^s_2 with s ≤ r, we have

|P_n f − P f| = O(n^{−bs/r + (1/2−c)(1−s/r)})   (n → ∞).   (15)

Remark 2. (a) Let θ := s/r. Then the rate in (15) is rewritten as O(n^{−θb + (1/2−c)(1−θ)}), which matches the rate of Theorem 1. In other words, Theorem 2 provides a deterministic version of Theorem 1 for the special case of Sobolev spaces.

(b) Theorem 2 can be applied to quadrature rules with equally-weighted deterministic points, such as QMC methods and kernel herding [7]. For these methods, we have c = 1/2, and so we obtain the rate O(n^{−sb/r}) in (15). The minimax optimal rate in this setting (i.e., c = 1/2) is given by n^{−b} with b = r/d [15]. For these choices of b and c, we obtain a rate of O(n^{−s/d}) in (15), which is exactly the optimal rate in W^s_2. This leads to an important consequence: the optimal rate O(n^{−s/d}) can be achieved for an integrand f ∈ W^s_2 without knowing the degree of smoothness s; one just needs to know an upper bound s ≤ r. Namely, any method of optimal rate in a Sobolev space is adaptive to lesser smoothness.

Rates under an assumption on separation radius. Theorems 1 and 2 require the assumption Σ_{i=1}^n w_i² = O(n^{−2c}). However, for some algorithms, the value of c may not be available.
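When the weights are available numerically for several sample sizes, the exponent c can at least be probed empirically by fitting the slope of log Σᵢ wᵢ² against log n. A minimal sketch (hypothetical input: equal weights wᵢ = 1/n, for which Σᵢ wᵢ² = 1/n and hence c = 1/2 exactly, matching the value stated above for equally-weighted methods):

```python
import numpy as np

def estimate_c(weight_vectors):
    """Fit sum_i w_i^2 ~ n^(-2c) by least squares in log-log scale."""
    ns = np.array([len(w) for w in weight_vectors], dtype=float)
    s2 = np.array([np.sum(np.asarray(w) ** 2) for w in weight_vectors])
    slope, _ = np.polyfit(np.log(ns), np.log(s2), 1)  # slope = -2c
    return -slope / 2.0

# Equal weights w_i = 1/n give sum_i w_i^2 = 1/n, so c should come out as 1/2.
weights = [np.full(n, 1.0 / n) for n in (10, 20, 40, 80, 160)]
c_hat = estimate_c(weights)
```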
For instance, this is the case for Bayesian quadratures that compute the weights without any constraints [6]; see Section 2.3. Here we present a preliminary result that does not rely on the assumption on the weights. To this end, we introduce a quantity called the separation radius:

q_n := min_{i≠j} ‖X_i − X_j‖.

In the result below, we assume that q_n does not decrease very quickly as n increases. Let diam(X_1, …, X_n) denote the diameter of the points.

Theorem 3. Let {(w_i, X_i)}_{i=1}^n be such that e_n(P; W^r_2) = O(n^{−b}) for some b > 0 as n → ∞, q_n ≥ C n^{−b/r} for some C > 0, and diam(X_1, …, X_n) ≤ 1. Then for any f ∈ C^s_0 ∩ W^s_2 with s ≤ r, we have

|P_n f − P f| = O(n^{−bs/r})   (n → ∞).   (16)

Consequences similar to those of Theorems 1 and 2 can be drawn for Theorem 3. In particular, the rate in (16) coincides with that of (15) with c = 1/2. The assumption q_n ≥ C n^{−b/r} can be verified when the points form equally-spaced grids in a compact subset of R^d. In this case, the separation radius satisfies q_n ≥ C n^{−1/d} for some C > 0. As above, the optimal rate for this setting is n^{−b} with b = r/d, which implies that the separation radius satisfies the assumption, since n^{−b/r} = n^{−1/d}.

6 Conclusions

Kernel quadratures are powerful tools for numerical integration. However, their convergence guarantees had not been established in situations where integrands are less smooth than assumed, which can happen in various situations in practice. In this paper, we have provided the first known theoretical guarantees for kernel quadratures in such misspecified settings.

Acknowledgments

We wish to thank the anonymous reviewers for valuable comments.
We also thank Chris Oates for fruitful discussions. This work has been supported in part by MEXT Grant-in-Aid for Scientific Research on Innovative Areas (25120012).

References

[1] R. A. Adams and J. J. F. Fournier. Sobolev Spaces. Academic Press, 2nd edition, 2003.
[2] F. Bach. On the equivalence between kernel quadrature rules and random feature expansions. Technical report, HAL-01118276v2, 2015.
[3] F. Bach, S. Lacoste-Julien, and G. Obozinski. On the equivalence between herding and conditional gradient algorithms. In Proc. ICML 2012, pages 1359–1366, 2012.
[4] A. Berlinet and C. Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer Academic Publishers, 2004.
[5] F-X. Briol, C. J. Oates, M. Girolami, and M. A. Osborne. Frank-Wolfe Bayesian quadrature: Probabilistic integration with theoretical guarantees. In Adv. NIPS 28, pages 1162–1170, 2015.
[6] F-X. Briol, C. J. Oates, M. Girolami, M. A. Osborne, and D. Sejdinovic. Probabilistic integration: A role for statisticians in numerical analysis? arXiv:1512.00933v4 [stat.ML], 2016.
[7] Y. Chen, M. Welling, and A. Smola. Super-samples from kernel herding. In Proc. UAI 2010, pages 109–116, 2010.
[8] J. Dick. Explicit constructions of quasi-Monte Carlo rules for the numerical integration of high-dimensional periodic functions. SIAM J. Numer. Anal., 45:2141–2176, 2007.
[9] J. Dick. Walsh spaces containing smooth functions and quasi-Monte Carlo rules of arbitrary high order. SIAM J. Numer. Anal., 46(3):1519–1553, 2008.
[10] J. Dick. Higher order scrambled digital nets achieve the optimal rate of the root mean square error for smooth integrands. The Annals of Statistics, 39(3):1372–1398, 2011.
[11] J. Dick, F. Y. Kuo, and I. H. Sloan. High dimensional numerical integration — the quasi-Monte Carlo way. Acta Numerica, 22:133–288, 2013.
[12] M. Gerber and N. Chopin. Sequential quasi Monte Carlo. Journal of the Royal Statistical Society, Series B, 77(3):509–579, 2015.
[13] F. J. Hickernell. A generalized discrepancy and quadrature error bound. Mathematics of Computation, 67(221):299–322, 1998.
[14] S. Lacoste-Julien, F. Lindsten, and F. Bach. Sequential kernel herding: Frank-Wolfe optimization for particle filtering. In Proc. AISTATS 2015, 2015.
[15] E. Novak. Deterministic and Stochastic Error Bounds in Numerical Analysis. Springer-Verlag, 1988.
[16] E. Novak and H. Woźniakowski. Tractability of Multivariate Problems, Vol. I: Linear Information. EMS, 2008.
[17] C. J. Oates and M. Girolami. Control functionals for quasi-Monte Carlo integration. In Proc. AISTATS 2016, 2016.
[18] C. J. Oates, M. Girolami, and N. Chopin. Control functionals for Monte Carlo integration. Journal of the Royal Statistical Society, Series B, 2017, to appear.
[19] A. O'Hagan. Bayes–Hermite quadrature. Journal of Statistical Planning and Inference, 29:245–260, 1991.
[20] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
[21] B. K. Sriperumbudur, A. Gretton, K. Fukumizu, B. Schölkopf, and G. R. G. Lanckriet. Hilbert space embeddings and metrics on probability measures. JMLR, 11:1517–1561, 2010.
[22] I. Steinwart and C. Scovel. Mercer's theorem on general domains: On the interaction between measures, kernels, and RKHSs. Constructive Approximation, 35:363–417, 2012.
[23] J. Yang, V. Sindhwani, H. Avron, and M. W. Mahoney. Quasi-Monte Carlo feature maps for shift-invariant kernels. In Proc. ICML 2014, 2014.