{"title": "Kernel quadrature with DPPs", "book": "Advances in Neural Information Processing Systems", "page_first": 12927, "page_last": 12937, "abstract": "We study quadrature rules for functions living in an RKHS, using nodes sampled from a projection determinantal point process (DPP). DPPs are parametrized by a kernel, and we use a truncated and saturated version of the RKHS kernel.
This natural link between the two kernels, along with DPP machinery, leads to relatively tight bounds on the quadrature error that depend on the spectrum of the RKHS kernel. Finally, we experimentally compare DPPs to existing kernel-based quadratures such as herding, Bayesian quadrature, or continuous leverage score sampling. Numerical results confirm the interest of DPPs, and even suggest faster rates than our bounds in particular cases.", "full_text": "Kernel quadrature with DPPs

Ayoub Belhadji, R\u00e9mi Bardenet, Pierre Chainais

Univ. Lille, CNRS, Centrale Lille, UMR 9189 - CRIStAL, Villeneuve d\u2019Ascq, France

{ayoub.belhadji, remi.bardenet, pierre.chainais}@univ-lille.fr

Abstract

We study quadrature rules for functions from an RKHS, using nodes sampled from a determinantal point process (DPP). DPPs are parametrized by a kernel, and we use a truncated and saturated version of the RKHS kernel. This link between the two kernels, along with DPP machinery, leads to relatively tight bounds on the quadrature error that depend on the spectrum of the RKHS kernel. Finally, we experimentally compare DPPs to existing kernel-based quadratures such as herding, Bayesian quadrature, or leverage score sampling. Numerical results confirm the interest of DPPs, and even suggest faster rates than our bounds in particular cases.

1 Introduction

Numerical integration [11] is an important tool for Bayesian methods [38] and model-based machine learning [32]. 
Formally, numerical integration consists in approximating

\int_X f(x) g(x) \, d\omega(x) \approx \sum_{j \in [N]} w_j f(x_j), \qquad (1)

where X is a topological space, d\u03c9 is a Borel probability measure on X , g is a square-integrable function, and f is a function belonging to a space to be specified. In the quadrature formula (1), the N points x1, . . . , xN \u2208 X are called the quadrature nodes, and w1, . . . , wN the corresponding weights. The accuracy of a quadrature rule is assessed by the quadrature error, i.e., the absolute difference between the left-hand side and the right-hand side of (1). Classical Monte Carlo algorithms, like importance sampling or Markov chain Monte Carlo [39], pick the nodes as either independent samples or a sample from a Markov chain on X , and all achieve a root mean square quadrature error in O(1/\sqrt{N}). Quasi-Monte Carlo quadrature [12] is based on deterministic, low-discrepancy sequences of nodes, and typical error rates for X = R^d are O(\log^d N / N). Recently, kernels have been used to derive quadrature rules such as herding [2, 9], Bayesian quadrature [19, 35], sophisticated control variates [28, 33], and leverage-score quadrature [1] under the assumption that f lies in an RKHS. The main theoretical advantage is that the resulting error rates are faster than classical Monte Carlo and adapt to the smoothness of f.

In this paper, we propose a new quadrature rule for functions in a given RKHS. Our nearest scientific neighbour is [1], but instead of sampling nodes independently, we leverage dependence and use a repulsive distribution called a projection determinantal point process (DPP), while the weights are obtained through a simple quadratic optimization problem. DPPs were originally introduced by [29] as probabilistic models for beams of fermions in quantum optics. 
Since then, DPPs have been\nthoroughly studied in random matrix theory [21], and have more recently been adopted in machine\nlearning [26] and Monte Carlo methods [3].\nIn practice, a projection DPP is de\ufb01ned through a reference measure d\u03c9 and a repulsion kernel K. In\nour approach, the repulsion kernel is a modi\ufb01cation of the underlying RKHS kernel. This ensures\nthat sampling is tractable, and, as we shall see, that the expected value of the quadrature error is\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fcontrolled by the decay of the eigenvalues of the integration operator associated to the measure d\u03c9.\nNote that quadratures based on projection DPPs have already been studied in the literature: implicitly\nin [22, Corollary 2.3] in the simple case where X = [0, 1] and d\u03c9 is the uniform measure, and in [3]\nfor [0, 1]d and more general measures. In the latter case, the quadrature error is asymptotically of\norder N\u22121/2\u22121/2d [3], with f essentially C1. In the current paper, we leverage the smoothness of the\nintegrand to improve the convergence rate of the quadrature in general spaces X .\nThis article is organized as follows. Section 2 reviews kernel-based quadrature. In Section 3, we recall\nsome basic properties of projection DPPs. Section 4 is devoted to the exposition of our main result,\nalong with a sketch of proof. We give precise pointers to the supplementary material for missing\ndetails. Finally, in Section 5 we illustrate our result and compare to related work using numerical\nsimulations, for the uniform measure in d = 1 and 2, and the Gaussian measure on R.\nNotation. Let X be a topological space equipped with a Borel measure d\u03c9 and assume that the\nsupport of d\u03c9 is X . 
Let L2(d\u03c9) be the Hilbert space of square integrable, real-valued functions\nde\ufb01ned on X , with the usual inner product denoted by (cid:104)\u00b7,\u00b7(cid:105)d\u03c9, and the associated norm by (cid:107).(cid:107)d\u03c9.\nLet k : X \u00d7 X \u2192 R+ be a symmetric and continuous function such that, for any \ufb01nite set of points\nin X , the matrix of pairwise kernel evaluations is positive semi-de\ufb01nite. Denote by F the associated\nreproducing kernel Hilbert space (RKHS) of real-valued functions [5]. We assume that x (cid:55)\u2192 k(x, x)\nis integrable with respect to the measure d\u03c9 so that F \u2282 L2(d\u03c9). De\ufb01ne the integral operator\n\n\u03a3f (\u00b7) =\n\nk(\u00b7, y)f (y)d\u03c9(y),\n\nf \u2208 L2(d\u03c9).\n\nX\n\n(2)\nBy construction, \u03a3 is self-adjoint, positive semi-de\ufb01nite, and trace-class [40]. For m \u2208 N, denote\nby em the m-th eigenfunction of \u03a3, normalized so that (cid:107)em(cid:107)d\u03c9 = 1 and \u03c3m the corresponding\neigenvalue. The integrability of the diagonal x (cid:55)\u2192 k(x, x) implies that F is compactly embedded\nin L2(d\u03c9), that is, the identity map IF : F \u2212\u2192 L2(d\u03c9) is compact; moreover, since d\u03c9 is of full\nsupport in X , IF is injective [44]. This implies a Mercer-type decomposition of k,\n\n(cid:90)\n\nk(x, y) =\n\n\u03c3mem(x)em(y),\n\n\u221a\n\nm\u2208N\u2217\nwhere N\u2217 = N (cid:114) {0} and the convergence is point-wise [45]. Moreover, for m \u2208 N\u2217, we write\nm)m\u2208N\u2217 is an orthonormal basis of F. Unless explicitly\neF\nm =\nstated, we assume that F is dense in L2(d\u03c9), so that (em)m\u2208N\u2217 is an orthonormal basis of L2(d\u03c9).\nm (cid:104)f, em(cid:105)2L2(d\u03c9) converges.\n\nFor more intuition, under these assumptions, f \u2208 F if and only if(cid:80)\n\n\u03c3mem. 
Since IF is injective [45], (eF\n\nm \u03c3\u22121\n\n(cid:88)\n\n2 Related work on kernel-based quadrature\nWhen the integrand f belongs to the RKHS F of kernel k [10], the quadrature error reads [41]\n\n(cid:90)\n\nX\n\nf (x)g(x)d\u03c9(x) \u2212 (cid:88)\n\nj\u2208[N ]\n\nwjf (xj) = (cid:104)f, \u00b5g \u2212 (cid:88)\n(cid:13)(cid:13)(cid:13)\u00b5g \u2212 (cid:88)\n\n\u2264 (cid:107)f(cid:107)F\n\nj\u2208[N ]\n\nwjk(xj, .)(cid:105)F\n\nwjk(xj, .)\n\nj\u2208[N ]\n\n(cid:13)(cid:13)(cid:13)F ,\n\nX g(x)k(x, .)d\u03c9(x) is the so-called mean element [13, 31]. A tight approximation of\nthe mean element by a linear combination of functions k(xj, .) thus guarantees low quadrature error.\nThe approaches described in this section differ by their choice of nodes and weights.\n\n(3)\n\n(4)\n\nwhere \u00b5g =(cid:82)\n\n2.1 Bayesian quadrature and the design of nodes\n\nBayesian Quadrature initially [27] considered a \ufb01xed set of nodes and put a Gaussian process prior\non the integrand f. Then, the weights were chosen to minimize the posterior variance of the integral\nof f. If the kernel of the Gaussian process is chosen to be k, this amounts to minimizing the RHS of\n(4). The case of the Gaussian reference measure was later investigated in detail [35], while parametric\nintegrands were considered in [30]. Rates of convergence were provided in [8] for speci\ufb01c kernels on\ncompact spaces, under a \ufb01ll-in condition [47] that encapsulates that the nodes must progressively \ufb01ll\nup the (compact) space.\n\n2\n\n\fFinding the weights that optimize the RHS of (4) for a \ufb01xed set of nodes is a relatively simple task,\nsee later Section 4.1, the cost of which can even be reduced using symmetries of the set of nodes\n[20, 24]. Jointly optimizing on the nodes and weights, however, is only possible in speci\ufb01c cases\n[6, 23]. In general, this corresponds to a non-convex problem with many local minima [16, 34].\nWhile [36] proposed to sample i.i.d. 
nodes from the reference measure d\u03c9, greedy minimization\napproaches have also been proposed [19, 34]. In particular, kernel herding [9] corresponds to uniform\nweights and greedily minimizing the RHS in (4). This leads to a fast rate in O(1/N ), but only\nwhen the integrand is in a \ufb01nite-dimensional RKHS. Kernel herding and similar forms of sequential\nBayesian quadrature are actually linked to the Frank-Wolfe algorithm [2, 7, 19]. Beside the dif\ufb01culty\nof proving fast convergence rates, these greedy approaches still require heuristics in practice.\n\n2.2 Leverage-score quadrature\n\nIn [1], the author proposed to sample the nodes (xj) i.i.d. from some proposal distribution q, and\nthen pick weights \u02c6w in (1) that solve the optimization problem\n\n(cid:13)(cid:13)(cid:13)\u00b5g \u2212 (cid:88)\n\nj\u2208[N ]\n\nmin\nw\u2208RN\n\nwj\n\nq(xj)1/2\n\nk(xj, .)\n\nfor some regularization parameter \u03bb > 0. Proposition 1 gives a bound on the resulting approximation\nerror of the mean element for a speci\ufb01c choice of proposal pdf, namely the leverage scores\n\n\u03bb(x) \u221d (cid:104)k(x, .), \u03a3\u22121/2(\u03a3 + \u03bbIL2(d\u03c9))\u22121\u03a3\u22121/2k(x, .)(cid:105)L2(d\u03c9) =\nq\u2217\n\n\u03c3m\n\nm\u2208N\n\n\u03c3m + \u03bb\n\nem(x)2.\n\n(6)\n\nProposition 1 (Proposition 2 in [1]). Let \u03b4 \u2208 [0, 1], and d\u03bb = Tr \u03a3(\u03a3 + \u03bbI)\u22121. Assume that\nN \u2265 5d\u03bb log(16d\u03bb/\u03b4), then\n\n(cid:18)\n\nP\n\n(cid:13)(cid:13)(cid:13)\u00b5g \u2212 (cid:88)\n\nj\u2208[N ]\n\nsup\n\n(cid:107)g(cid:107)d\u03c9\u22641\n\ninf\n\n(cid:107)w(cid:107)2\u2264 4\n\nN\n\nwj\n\nq\u03bb(xj)1/2\n\nk(xj, .)\n\n\u2264 4\u03bb\n\n\u2265 1 \u2212 \u03b4.\n\n(7)\n\nIn other words, Proposition 1 gives a uniform control on the approximation error \u00b5g by the subspace\nspanned by the k(xj, .) for g belonging to the unit ball of L2(d\u03c9), where the (xj) are sampled i.i.d.\n\u03bb. 
The required number of nodes is equal to O(d\u03bb log d\u03bb) for a given approximation error \u03bb.\nfrom q\u2217\nHowever, for \ufb01xed \u03bb, the approximation error in Proposition 1 does not go to zero when N increases.\nOne theoretical workaround is to make \u03bb = \u03bb(N ) decrease with N. However, the coupling of N and\n\u03bb through d\u03bb makes it very intricate to derive a convergence rate from Proposition 1. Moreover, the\noptimal density q\u2217\n\u03bb is in general only available as the limit (6), which makes sampling and evaluation\ndif\ufb01cult. Finally, we note that Proposition 1 highlights the fundamental role played by the spectral\ndecomposition of the operator \u03a3 in designing and analyzing kernel quadrature rules.\n\n3 Projection determinantal point processes\nLet N \u2208 N\u2217 and (\u03c8n)n\u2208[N ] an orthonormal family of L2(d\u03c9), and assume for simplicity that\nX \u2282 Rd and that d\u03c9 has density \u03c9 with respect to the Lebesgue measure. De\ufb01ne the repulsion kernel\n\n(cid:13)(cid:13)(cid:13)2\n\nF + \u03bbN(cid:107)w(cid:107)2\n2,\n(cid:88)\n\n(cid:19)\n\n(cid:13)(cid:13)(cid:13)2\n\nF\n\n(5)\n\n(8)\n\n(9)\n\nnot to be mistaken for the RKHS kernel k. One can show [18, Lemma 21] that\n\n(cid:88)\n\nn\u2208[N ]\n\nK(x, y) =\n\n\u03c8n(x)\u03c8n(y),\n\n1\nN !\n\nDet(K(xi, xj)i,j\u2208[N ])\n\n\u03c9(xi)\n\n(cid:89)\n\ni\u2208[N ]\n\nis a probability density over X N . When x1, . . . , xN have distribution (9), the set x = {x1, . . . xN}\nis said to be a projection DPP1 with reference measure d\u03c9 and kernel K. Note that the kernel K\nis a positive de\ufb01nite kernel so that the determinant in (9) is non-negative. Equation (9) is key to\n\n1In the \ufb01nite case, more common in ML, projection DPPs are also called elementary DPPs [26].\n\n3\n\n\funderstanding DPPs. First, loosely speaking, the probability of seeing a point of x in an in\ufb01nitesimal\nvolume around x1 is K(x1, x1)\u03c9(x1)dx1. 
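As a quick parenthetical illustration of the repulsion encoded by the determinant in (9), here is a minimal sketch that evaluates this density for the uniform measure on [0, 1], taking shifted Legendre polynomials as an illustrative orthonormal family (\u03c8n); this choice of basis is ours, not prescribed by the text. The density is positive for distinct nodes and collapses to zero when two nodes coincide.

```python
import numpy as np
from math import factorial
from scipy.special import eval_legendre

def psi(n, x):
    # Orthonormal shifted Legendre polynomials on [0, 1] w.r.t. the uniform measure.
    return np.sqrt(2 * n + 1) * eval_legendre(n, 2.0 * np.asarray(x) - 1.0)

def dpp_density(nodes):
    # Density (9) for the uniform reference measure (omega = 1 on [0, 1]):
    # (1/N!) * det[K(x_i, x_j)] with K(x, y) = sum_n psi_n(x) psi_n(y).
    N = len(nodes)
    Psi = np.stack([psi(n, nodes) for n in range(N)])  # N x N feature matrix
    K = Psi.T @ Psi                                     # det(K) = det(Psi)^2 >= 0
    return np.linalg.det(K) / factorial(N)

print(dpp_density(np.array([0.1, 0.5, 0.9])))        # positive for distinct nodes
print(dpp_density(np.array([0.1, 0.1, 0.9])))        # ~0: repeated nodes are repelled
```

Any other orthonormal family of L2(d\u03c9) would do; Legendre polynomials are only chosen here because their orthonormality on [0, 1] is explicit.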
Note that when d = 1 and (\u03c8n) are the family of\northonormal polynomials with respect to d\u03c9, this marginal probability is related to the optimal\nproposal q\u03bb in Section 2.2; see Appendix E.2. Second, the probability of simultaneously seeing a\npoint of x in an in\ufb01nitesimal volume around x1 and one around x2 is\n\n(cid:104)\n\nK(x1, x1) K(x2, x2)\u2212 K(x1, x2)2(cid:105)\n\n\u03c9(x1)\u03c9(x2) dx1dx2\n\n\u2264 [K(x1, x1)\u03c9(x1)dx1] [K(x2, x2)\u03c9(x2)dx2] .\n\nThe probability of co-occurrence is thus always smaller than that of a Poisson process with the same\nintensity. In this sense, a projection DPP with symmetric kernel is a repulsive distribution, and K\nencodes its repulsiveness.\nOne advantage of DPPs is that they can be sampled exactly. Because of the orthonormality of (\u03c8n),\none can write the chain rule for (9); see [18]. Sampling each conditional in turn, using e.g. rejection\nsampling [39], then yields an exact sampling algorithm. Rejection sampling aside, the cost of this\nalgorithm is cubic in N without further assumptions on the kernel. Simplifying assumptions can\ntake many forms. In particular, when d = 1, and \u03c9 is a Gaussian, gamma [14], or beta [25] pdf, and\n(\u03c8n) are the orthonormal polynomials with respect to \u03c9, the corresponding DPP can be sampled by\ntridiagonalizing a matrix with independent entries, which takes the cost to O(N 2) and bypasses the\nneed for rejection sampling. For further information on DPPs see [21, 43].\n\n4 Kernel quadrature with projection DPPs\n\nWe follow in the footsteps of [1], see Section 2.2, but using a projection DPP rather than independent\nsampling to obtain the nodes. In a nutshell, we consider nodes (xj)j\u2208[N ] that are drawn from the\nprojection DPP with reference measure d\u03c9 and repulsion kernel\n\nK(x, y) =\n\nen(x)en(y),\n\n(10)\n\n(cid:88)\n\nn\u2208[N ]\n\nwhere we recall that (en) are the normalized eigenfunctions of the integral operator \u03a3. 
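To make the chain-rule sampler of Section 3 concrete, here is a minimal sketch, assuming the uniform measure on [0, 1] and using shifted Legendre polynomials as a stand-in for the eigenfunctions (en) in (10), which are in general kernel-specific. Each conditional is sampled by naive rejection from a uniform proposal, using the crude bound sup_x K(x, x) <= N^2, valid for this basis; efficient samplers exploit more structure, as discussed in Section 3.

```python
import numpy as np
from scipy.special import eval_legendre

def features(x, N):
    # Orthonormal shifted Legendre basis on [0, 1]; an illustrative stand-in
    # for the Mercer eigenfunctions (e_n) of a given kernel k.
    return np.array([np.sqrt(2 * n + 1) * eval_legendre(n, 2 * x - 1)
                     for n in range(N)])

def sample_projection_dpp(N, rng):
    """Chain-rule (HKPV) sampler with naive uniform-proposal rejection."""
    nodes = []
    Q = np.zeros((0, N))  # orthonormal basis of the accepted feature vectors
    for _ in range(N):
        while True:
            x = rng.uniform()
            phi = features(x, N)
            resid = phi - Q.T @ (Q @ phi)   # project out accepted directions
            r2 = resid @ resid              # K(x,x) - k_t(x)^T K_t^{-1} k_t(x)
            if rng.uniform() < r2 / N**2:   # crude bound sup_x K(x,x) <= N^2
                break
        nodes.append(x)
        Q = np.vstack([Q, resid / np.sqrt(r2)])
    return np.array(nodes)

rng = np.random.default_rng(0)
print(np.sort(sample_projection_dpp(8, rng)))
```

The accepted nodes are typically well spread over [0, 1], reflecting the repulsion of the projection DPP; the rejection bound is loose, so this sketch is only practical for small N.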
The weights\nw are obtained by solving the optimization problem\n\nwhere\n\n(cid:107)\u00b5g \u2212 \u03a6w(cid:107)2F ,\n\nmin\nw\u2208RN\n\n\u03a6 : (wj)j\u2208[N ] (cid:55)\u2192 (cid:88)\n\nwjk(xj, .)\n\nj\u2208[N ]\n\n(11)\n\n(12)\n\nis the reconstruction operator2. In Section 4.1 we prove that (11) almost surely has a unique solution\n\u02c6w and state our main result, an upper bound on the expected approximation error (cid:107)\u00b5g \u2212 \u03a6 \u02c6w(cid:107)2F\nunder the proposed Projection DPP. Section 4.2 gives a sketch of the proof of this bound.\n\n4.1 Main result\n\nAssuming that nodes (xj)j\u2208[N ] are known, we \ufb01rst need to solve the optimization problem (11) that\nrelates to problem (5) without regularization (\u03bb = 0). Let x = (x1, . . . , xN ) \u2208 X N , then\n\n(cid:107)\u00b5g \u2212 \u03a6w(cid:107)2F = (cid:107)\u00b5g(cid:107)2F \u2212 2w\n\n(cid:124)\n\nK(x)w,\n\n\u00b5g(xj)j\u2208[N ] + w\n\n(13)\nwhere K(x) = (k(xi, xj))i,j\u2208[N ]. The right-hand side of (13) is quadratic in w, so that the\noptimization problem (11) admits a unique solution \u02c6w if and only if K(x) is invertible. In this case,\nthe solution is given by \u02c6w = K(x)\u22121\u00b5g(xj)j\u2208[N ]. A suf\ufb01cient condition for the invertibility of\nK(x) is given in the following proposition.\nProposition 2. Assume that the matrix E(x) = (ei(xj))i,j\u2208[N ] is invertible, then K(x) is invertible.\nThe proof of Proposition 2 is given in Appendix D.1. Since the pdf (9) of the projection DPP with\nkernel (10) is proportional to Det2 E(x), the following corollary immediately follows.\n\n(cid:124)\n\n2The reconstruction operator \u03a6 depends on the nodes xj, although our notation doesn\u2019t re\ufb02ect it for simplicity.\n\n4\n\n\fCorollary 1. Let x = {x1, . . . , xN} be a projection DPP with reference measure d\u03c9 and kernel\n(10). Then K(x) is a.s. 
invertible, so that (11) has unique solution \u02c6w = K(x)\u22121\u00b5g(xj)j\u2208[N ] a.s.\nWe now give our main result that uses nodes (xj)j\u2208[N ] drawn from a well-chosen projection DPP.\nTheorem 1. Let x = {x1, . . . , xN} be a projection DPP with reference measure d\u03c9 and kernel (10).\n|(cid:104)en, g(cid:105)d\u03c9|. Assume that (cid:107)g(cid:107)d\u03c9 \u2264 1\nLet \u02c6w be the unique solution to (11) and de\ufb01ne (cid:107)g(cid:107)d\u03c9,1 =\n\n(cid:88)\n\nand de\ufb01ne rN = (cid:80)\n\n\u03c3m, then\n\nm\u2265N +1\n\nn\u2208[N ]\n\n(cid:32)\n\n(cid:18) N rN\n\n(cid:19)(cid:96)(cid:33)\n\n\u03c31\n\nN(cid:88)\n\n(cid:96)=2\n\n\u03c31\n(cid:96)!2\n\n.\n\n(14)\n\nEDPP (cid:107)\u00b5g \u2212 \u03a6 \u02c6w(cid:107)2F \u2264 2\u03c3N +1 + 2(cid:107)g(cid:107)2\n\nd\u03c9,1\n\nN rN +\n\nIn particular, if N rN = o(1), then the right-hand side of (14) is N rN + o(N rN ). For example,\ntake X = [0, 1], d\u03c9 the uniform measure on X , and F the s-Sobolev space, then \u03c3m = m\u22122s [5].\nNow, if s > 1, the expected worst case quadrature error is bounded by N rN = O(N 2\u22122s) = o(1).\nAnother example is the case of the Gaussian measure on X = R, with the Gaussian kernel. In this\ncase \u03c3m = \u03b2\u03b1m with 0 < \u03b1 < 1 and \u03b2 > 0 [37] so that N rN = N \u03b2\nWe have assumed that F is dense in L2(d\u03c9) but Theorem 1 is valid also when F is \ufb01nite-dimensional.\nIn this case, denote N0 = dimF. Then, for n > N0, \u03c3n = 0 and rN0 = 0, so that (14) implies\n\n1\u2212\u03b1 \u03b1N +1 = o(1).\n\n(cid:107)\u00b5g \u2212 \u03a6 \u02c6w(cid:107)2F = 0 a.s.\n\n(15)\nThis compares favourably with herding, for instance, which comes with a rate in O( 1\nN ) for the\nquadrature based on herding with uniform weights [2, 9].\nThe constant (cid:107)g(cid:107)d\u03c9,1 in (14) is the (cid:96)1 norm of the coef\ufb01cients of projection of g onto Span(en)n\u2208[N ]\nin L2(d\u03c9). 
For example, for g = en, (cid:107)g(cid:107)d\u03c9,1 = 1 if n \u2208 [N ] and (cid:107)g(cid:107)d\u03c9,1 = 0 if n \u2265 N + 1. In the\nworst case, (cid:107)g(cid:107)d\u03c9,1 \u2264 \u221a\nN. Thus, we can obtain a uniform bound for (cid:107)g(cid:107)d\u03c9 \u2264 1 in\n\nN(cid:107)g(cid:107)d\u03c9 \u2264 \u221a\n\nthe spirit of Proposition 1, but with a supplementary factor N in the upper bound in (14).\n\n4.2 Bounding the approximation error under the DPP\n\nIn this section, we give the skeleton of the proof of Theorem 1, referring to the appendices for\ntechnical details. The proof is in two steps. First, we give an upper bound for the approximation error\n(cid:107)\u00b5g \u2212 \u03a6 \u02c6w(cid:107)2F that involves the maximal principal angle between the functional subspaces of F\n\nEF\nN = Span(eF\n\nn )n\u2208[N ]\n\nand\n\nT (x) = Span(k(xj, .))j\u2208[N ].\n\nDPPs allow closed form expressions for the expectation of trigonometric functions of such angles;\nsee [4] and Appendix E.1 for the geometric intuition behind the proof. The second step thus consists\nin developing the expectation of the bound under the DPP.\n\n4.2.1 Bounding the approximation error using principal angles\nLet x = (x1, . . . , xN ) \u2208 X N be such that Det E(x) (cid:54)= 0. By Proposition 2, K(x) is non singular\nand dimT (x) = N. The optimal approximation error writes\n\n(cid:107)\u00b5g \u2212 \u03a6 \u02c6w(cid:107)2F = (cid:107)\u00b5g \u2212 \u03a0T (x)\u00b5g(cid:107)2F ,\n\n(16)\n\nwhere \u03a0T (x) = \u03a6(\u03a6\u2217\u03a6)\u22121\u03a6\u2217 is the orthogonal projection onto T (x) with \u03a6\u2217 the dual3 of \u03a6.\nIn other words, (16) equates the approximation error to (cid:107)\u03a0T (x)\u22a5\u00b5g(cid:107)2F , where \u03a0T (x)\u22a5 is the orthog-\nonal projection onto T (x)\u22a5. Now we have the following lemma.\nLemma 1. 
Assume that (cid:107)g(cid:107)d\u03c9 \u2264 1 then (cid:107)\u03a3\u22121/2\u00b5g(cid:107)F \u2264 1 and\n\n(17)\n3For \u00b5 \u2208 F,\u03a6\u2217\u00b5 = (\u00b5(xj))j\u2208[N ]. \u03a6\u2217\u03a6 is an operator from RN to RN that can be identi\ufb01ed with K(x).\n\nd\u03c9,1 max\nn\u2208[N ]\n\n.\n\n\u03c3n(cid:107)\u03a0T (x)\u22a5 eF\n\nn (cid:107)2F\n\n\u03c3N +1 + (cid:107)g(cid:107)2\n\n(cid:18)\n(cid:107)\u03a0T (x)\u22a5 \u00b5g(cid:107)2F \u2264 2\n\n(cid:19)\n\n5\n\n\fn (cid:107)2F is the product\nNow, to upper bound the right-hand side of (17), we note that \u03c3n(cid:107)\u03a0T (x)\u22a5 eF\nn (cid:107)2F is the interpolation error of\nof two terms: \u03c3n is a decreasing function of n while (cid:107)\u03a0T (x)\u22a5 eF\nn , measured in the (cid:107).(cid:107)F norm. We can bound the latter interpolation error\nthe eigenfunction eF\nuniformly in n \u2208 [N ] using the geometric notion of maximal principal angle between T (x) and\nEF\nN = Span(eF\n\nn )n\u2208[N ]. This maximal principal angle is de\ufb01ned through its cosine\n\ncos2 \u03b8N (T (x),EF\n\nN ) =\n\ninf\n\n(cid:104)u, v(cid:105)F .\n\nu\u2208T (x),v\u2208EF\n(cid:107)u(cid:107)F =1,(cid:107)v(cid:107)F =1\nSimilarly, we can de\ufb01ne the N principal angles \u03b8n(T (x),EF\nsubspaces EF\nAppendix C.3 for more details about principal angles. Now, we have the following lemma.\nLemma 2. Let x = (x1, . . . , xN ) \u2208 X N such that Det E(x) (cid:54)= 0. Then\n\nN and T (x). These angles quantify the relative position of the two subspaces. See\n\n(cid:3) for n \u2208 [N ] between the\n\nN ) \u2208(cid:2)0, \u03c0\n\n2\n\nN\n\n(18)\n\n(cid:107)\u03a0T (x)\u22a5eF\n\nn (cid:107)2F \u2264\n\nmax\nn\u2208[N ]\n\n1\n\ncos2 \u03b8N (T (x),EF\nN )\n\n1\n\ncos2 \u03b8n(T (x),EF\nN )\n\n\u2212 1.\n\n(19)\n\n\u2212 1 \u2264 (cid:89)\n\nn\u2208[N ]\n\nTo sum up, we have so far bounded the approximation error by the geometric quantity in the right-hand\nside of (19). 
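The chain of inequalities in (19) can be checked numerically in a finite-dimensional Euclidean analogue, with two random subspaces standing in for T(x) and E_N^F; this is only a sketch of the geometry, not of the RKHS computation itself.

```python
import numpy as np
from scipy.linalg import orth, subspace_angles

rng = np.random.default_rng(0)
D, N = 20, 5
T = orth(rng.standard_normal((D, N)))  # orthonormal basis, stand-in for T(x)
E = orth(rng.standard_normal((D, N)))  # orthonormal basis, stand-in for E_N^F

angles = subspace_angles(T, E)         # principal angles, in descending order

# Squared norm of the orthogonal projection of each basis vector of E onto
# the orthogonal complement of T (the "interpolation error" of Lemma 2):
proj_err = np.linalg.norm(E - T @ (T.T @ E), axis=0) ** 2
lhs = proj_err.max()                               # max_n ||Pi_{T^perp} e_n||^2
mid = 1.0 / np.cos(angles[0]) ** 2 - 1.0           # 1/cos^2(theta_max) - 1
rhs = np.prod(1.0 / np.cos(angles) ** 2) - 1.0     # product bound in (19)
print(lhs, mid, rhs)  # lhs <= mid <= rhs
```

The product on the right-hand side is the quantity whose expectation is tractable under the DPP (Proposition 3), which is why the loose majorization is worth its price.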
Where projection DPPs shine is in taking expectations of such geometric quantities.\n\n4.2.2 Taking the expectation under the DPP\nThe analysis in Section 4.2.1 is valid whenever Det E(x) (cid:54)= 0. As seen in Corollary 1, this condition\nis satis\ufb01ed almost surely when x is drawn from the projection DPP of Theorem 1. Furthermore, the\nexpectation of the right-hand side of (19) can be written in terms of the eigenvalues of the kernel k.\nProposition 3. Let x be a projection DPP with reference measure d\u03c9 and kernel (10). Then,\n\n(cid:89)\n\nEDPP\n\nn\u2208[N ]\n\ncos2 \u03b8n\n\n(cid:88)\n\n(cid:19) =\n\n(cid:18)\n\n1\nT (x),EF\n\nN\n\nT\u2282N\u2217\n|T|=N\n\nn\u2208[N ]\n\n(cid:81)\n\u03c3t(cid:81)\n\nt\u2208T\n\n\u03c3n\n\n.\n\n(20)\n\nThe bound of Proposition 3, once reported in Lemma 2 and Lemma 1, already yields Theorem 1 in\nthe special case where \u03c31 = \u00b7\u00b7\u00b7 = \u03c3N . This seems a very restrictive condition, but next Proposition 4\nshows that we can always reduce the analysis to that case. In fact, let the kernel \u02dck be de\ufb01ned by\n\n\u02dck(x, y) =\n\n\u03c31en(x)en(y) +\n\n\u03c3nen(x)en(y) =\n\n\u02dc\u03c3nen(x)en(y),\n\n(21)\n\n(cid:88)\n\nn\u2208N\u2217\n\n(cid:88)\n\nn\u2208[N ]\n\n(cid:88)\n(cid:17)\n(cid:16)\u02dck(xj, .)\n\nn\u2265N +1\n\nand let \u02dcF be the corresponding RKHS. Then one has the following inequality.\nProposition 4. Let \u02dcT (x) = Span\n\u02dcT (x)\u22a5 in ( \u02dcF,(cid:104)., .(cid:105) \u02dcF ). 
Then,\n\nj\u2208[N ]\n\nand \u03a0 \u02dcT (x)\u22a5 the orthogonal projection onto\n\n\u2200n \u2208 [N ], \u03c3n(cid:107)\u03a0T (x)\u22a5 eF\n\nn (cid:107)2\nn (cid:107)2F \u2264 \u03c31(cid:107)\u03a0 \u02dcT (x)\u22a5e \u02dcF\n\u02dcF .\n\n(22)\n\nSimply put, capping the \ufb01rst eigenvalues of k yields a new kernel \u02dck that captures the interaction\nbetween the terms \u03c3n and (cid:107)\u03a0T (x)\u22a5eF\nn (cid:107)2F such that we only have to deal with the term (cid:107)\u03a0 \u02dcT (x)\u22a5 e \u02dcF\nn (cid:107)2\n\u02dcF .\nCombining Proposition 3 with Proposition 4 applied to the kernel \u02dck yields Theorem 1.\n\n4.3 Discussion\n\nWe have arbitrarily introduced a product in the right-hand side of (19), which is a rather loose\nmajorization. Our motivation is that the expected value of this symmetric quantity is tractable under\nthe DPP. Getting rid of the product could make the bound much tighter. Intuitively, taking the upper\nbound in (20) to the power 1/N results in a term in O(rN ) for the RKHS \u02dcF. Improving the bound in\n(20) would require a de-symmetrization by comparing the maximum of the 1/ cos2 \u03b8(cid:96)(T (x),EF\nN ) to\n\n6\n\n\ftheir geometric mean. An easier route than de-symmetrization could be to replace the product in (19)\nby a sum, but this is beyond the scope of this article.\nIn comparison with [1], we emphasize that the dependence of our bound on the eigenvalues of the\nkernel k, via rN , is explicit. This is in contrast with Proposition 1 that depends on the eigenvalues\nof \u03a3 through the degree of freedom d\u03bb so that the necessary number of samples N diverges when\n\u03bb \u2192 0. On the contrary, our quadrature requires a \ufb01nite number of points for \u03bb = 0. 
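To make the unregularized (\u03bb = 0) weight computation concrete, here is a minimal sketch for the periodic Sobolev kernel with s = 1 used later in Section 5, with g = 1, so that \u00b5g = 1 and, by (13), the optimal weights are w_hat = K(x)^{-1} 1 with squared worst-case error 1 - 1^T K(x)^{-1} 1. The closed form of the kernel via the Bernoulli polynomial B2 is an assumption of this sketch.

```python
import numpy as np

def k_sobolev1(x, y):
    # Periodic Sobolev kernel, s = 1: k(x, y) = 1 + sum_m cos(2*pi*m*(x-y))/m^2,
    # using the (assumed) Bernoulli-polynomial identity
    # sum_{m>=1} cos(2*pi*m*t)/m^2 = pi^2 * (t^2 - t + 1/6) for t in [0, 1].
    t = np.mod(x - y, 1.0)
    return 1.0 + np.pi**2 * (t**2 - t + 1.0 / 6.0)

def worst_case_sq_error(nodes):
    # g = 1, so mu_g = 1 and ||mu_g||_F^2 = 1; minimizing (13) over w gives
    # w_hat = K(x)^{-1} 1 and squared error 1 - 1^T K(x)^{-1} 1.
    K = k_sobolev1(nodes[:, None], nodes[None, :])
    ones = np.ones(len(nodes))
    return 1.0 - ones @ np.linalg.solve(K, ones)

for N in (10, 20, 40):
    grid = (np.arange(N) + 0.5) / N
    print(N, worst_case_sq_error(grid))
```

On the uniform grid the error decays roughly at the O(N^{-2s}) rate observed for UGBQ in Section 5; swapping in DPP or i.i.d. nodes only changes the `nodes` argument.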
It would be\ninteresting to extend the analysis of our quadrature in the regime \u03bb > 0.\n\n5 Numerical simulations\n\n5.1 The periodic Sobolev space and the Korobov space\nLet d\u03c9 be the uniform measure on X = [0, 1], and let the RKHS kernel be [5]\n\nks(x, y) = 1 +\n\n1\n\nm2s cos(2\u03c0m(x \u2212 y)),\n\n(cid:88)\n\nm\u2208N\u2217\n\nso that F = Fs is the Sobolev space of order s on [0, 1]. Note that ks can be expressed in closed\nform using Bernoulli polynomials [46]. We take g \u2261 1 in (1), so that the mean element \u00b5g \u2261 1. We\ncompare the following algorithms: (i) the quadrature rule DPPKQ we propose in Theorem 1, (ii)\nthe quadrature rule DPPUQ based on the same projection DPP but with uniform weights, implicitly\nstudied in [22], (iii) the kernel quadrature rule (5) of [1], which we denote LVSQ for leverage\nscore quadrature, with regularization parameter \u03bb \u2208 {0, 0.1, 0.2} (note that the optimal proposal is\n\u03bb \u2261 1), (iv) herding with uniform weights [2, 9], (v) sequential Bayesian quadrature (SBQ) [19]\nq\u2217\nwith regularization to avoid numerical instability, and (vi) Bayesian quadrature on the uniform grid\n(UGBQ). We take N \u2208 [5, 50]. Figures 1a and 1b show log-log plots of the worst case quadrature\nerror w.r.t. N, averaged over 50 samples for each point, for s \u2208 {1, 3}.\nWe observe that the approximation errors of all \ufb01rst four quadratures converge to 0 with different\nrates. Both UGBQ and DPPKQ converge to 0 with a rate of O(N\u22122s), which indicates that our\nO(N 2\u22122s) bound in Theorem 1 is not tight in the Sobolev case. Meanwhile, the rate of DPPUQ\nis O(N\u22122) across the three values of s: it does not adapt to the regularity of the integrands. This\ncorresponds to the CLT proven in [22]. LVSQ without regularization converges to 0 slightly slower\nthan O(N\u22122s). Augmenting \u03bb further slows down convergence. 
Herding converges at an empirical\nrate of O(N\u22122), which is faster than the rate O(N\u22121) predicted by the theoretical analysis in [2, 9].\nSBQ is the only one that seems to plateau for s = 3, although it consistently has the best performance\nfor low N. Overall, in the Sobolev case, DPPKQ and UGBQ have the best convergence rate. UGBQ\n\u2013 known to be optimal in this case [6] \u2013 has a better constant.\nNow, for a multidimensional example, consider the \u201cKorobov\" kernel ks de\ufb01ned on [0, 1]d by\n\n\u2200x, y \u2208 [0, 1]d, ks,d(x, y) =\n\nks(xi, yi).\n\n(23)\n\n(cid:89)\n\ni\u2208[d]\n\nWe still take g \u2261 1 in (1) so that \u00b5g \u2261 1. We compare (i) our DPPKQ, (ii) LVSQ without\nregularization (\u03bb = 0), (iii) the kernel quadrature based on the uniform grid UGBQ, (iv) the kernel\nquadrature SGBQ based on the sparse grid from [42], (v) the kernel quadrature based on the Halton\nsequence HaltonBQ [15]. We take N \u2208 [5, 1000] and s = 1. The results are shown in Figure 1c.\nThis time, UGBQ suffers from the dimension with a rate in O(N\u22122s/d), while DPPKQ, HaltonBQ\nand LVSQ (\u03bb = 0) all perform similarly well. They scale as O((log N )2s(d\u22121)N\u22122s), which is a\ntight upper bound on \u03c3N +1, see [1] and Appendix B. SGBQ seems to lag slightly behind with a rate\nO((log N )2(s+1)(d\u22121)N\u22122s) [17, 42].\n\n5.2 The Gaussian kernel\nWe now consider d\u03c9 to be the Gaussian measure on X = R along with the RKHS kernel k\u03b3(x, y) =\nexp[\u2212(x \u2212 y)2/2\u03b32], and again g \u2261 1. Figure 1d compares the empirical performance of DPPKQ\nto the theoretical bound of Theorem 1, herding, crude Monte Carlo with i.i.d. sampling from d\u03c9,\nand sequential Bayesian Quadrature, where we again average over 50 samples. 
We take N \u2208 [5, 50]\n\n7\n\n\f(a) Sobolev space, d = 1, s = 1\n\n(b) Sobolev space, d = 1, s = 3\n\n(c) Korobov space, d = 2, s = 1\n\n(d) Gaussian kernel, d = 1\n\nFigure 1: Squared error vs. number of nodes N for various kernels.\n\nand \u03b3 = 1\n2. Note that, this time, only the y-axis is on the log scale for better display, and that LVSQ\nis not plotted since we don\u2019t know how to sample from q\u03bb in (6) in this case. We observe that the\napproximation error of DPPKQ converges to 0 as O(\u03b1N ), while the discussion below Theorem 1 let\nus expect a slightly slower O(N \u03b1N ). Herding improves slightly upon Monte Carlo that converges as\nO(N\u22121). Similarly to Sobolev spaces, the convergence of sequential Bayesian quadrature plateaus\neven if it has the smallest error for small N. We also conclude that DPPKQ is a close runner-up to\nSBQ and de\ufb01nitely takes the lead for large enough N.\n\n6 Conclusion\n\nIn this article, we proposed a quadrature rule for functions living in a RKHS. The nodes are drawn\nfrom a DPP tailored to the RKHS kernel, while the weights are the solution to a tractable, non-\nregularized optimization problem. We proved that the expected value of the squared worst case\nerror is bounded by a quantity that depends on the eigenvalues of the integral operator associated\nto the RKHS kernel, thus preserving the natural feel and the generality of the bounds for kernel\nquadrature [1]. Key intermediate quantities further have clear geometric interpretations in the ambient\nRKHS. Experimental comparisons suggest that DPP quadrature favourably compares with existing\nkernel-based quadratures. 
In specific cases where an optimal quadrature is known, such as the uniform grid for 1D periodic Sobolev spaces, DPPKQ seems to have the optimal convergence rate. However, our generic error bound does not reflect this optimality in the Sobolev case, and must thus be sharpened.
We have discussed room for improvement in our proofs. Further work should also address exact sampling algorithms, which do not yet exist when the spectral decomposition of the integral operator is not known. Approximate algorithms would also suffice, as long as the error bound is preserved.

Acknowledgments

We acknowledge support from ANR grant BoB (ANR-16-CE23-0003) and r\u00e9gion Hauts-de-France. We also thank Adrien Hardy and the reviewers for their detailed and insightful comments.

[Figure 1 legends, fitted log-log slopes: (a) DPPKQ 1.9, DPPUQ 1.7, Herding 1.8, SBQ 1.8, LVSQ (\u03bb=0) 1.7, UGBQ 2.0; (b) DPPKQ 6.0, DPPUQ 2.0, Herding 2.1, SBQ 3.2, LVSQ (\u03bb=0) 4.8, UGBQ 6.0; panels (c) and (d) list the methods and reference rate curves.]

References

[1] F. Bach. On the equivalence between kernel quadrature rules and random feature expansions. The Journal of Machine Learning Research, 18(1):714\u2013751, 2017.

[2] F. Bach, S. Lacoste-Julien, and G. Obozinski. On the equivalence between herding and conditional gradient algorithms. In Proceedings of the 29th International Conference on Machine Learning, ICML\u201912, pages 1355\u20131362, 2012.

[3] R. Bardenet and A. Hardy. Monte Carlo with determinantal point processes. arXiv:1605.00361, May 2016.

[4] A. Belhadji, R. Bardenet, and P. Chainais. A determinantal point process for column subset selection. arXiv:1812.09771, 2018.

[5] A. Berlinet and C. 
Thomas-Agnan. Reproducing kernel Hilbert spaces in probability and statistics. Springer Science & Business Media, 2011.

[6] B. Bojanov. Uniqueness of the optimal nodes of quadrature formulae. Mathematics of Computation, 36(154):525–546, 1981.

[7] F. Briol, C. Oates, M. Girolami, and M. Osborne. Frank-Wolfe Bayesian quadrature: Probabilistic integration with theoretical guarantees. In Advances in Neural Information Processing Systems, pages 1162–1170, 2015.

[8] F. Briol, C. Oates, M. Girolami, M. Osborne, D. Sejdinovic, et al. Probabilistic integration: A role in statistical computation? Statistical Science, 34(1):1–22, 2019.

[9] Y. Chen, M. Welling, and A. Smola. Super-samples from kernel herding. In Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, UAI'10, pages 109–116, Arlington, Virginia, United States, 2010. AUAI Press.

[10] N. Cristianini and J. Shawe-Taylor. Kernel methods for pattern analysis. Cambridge University Press, 2004.

[11] P. J. Davis and P. Rabinowitz. Methods of numerical integration. Courier Corporation, 2nd edition, 2007.

[12] J. Dick and F. Pillichshammer. Digital nets and sequences: discrepancy theory and quasi-Monte Carlo integration. Cambridge University Press, 2010.

[13] J. Dick and F. Pillichshammer. Discrepancy theory and quasi-Monte Carlo integration. In A Panorama of Discrepancy Theory, pages 539–619. Springer, 2014.

[14] I. Dumitriu and A. Edelman. Matrix models for beta ensembles. Journal of Mathematical Physics, 43(11):5830–5847, 2002.

[15] J. Halton. Algorithm 247: Radical-inverse quasi-random point sequence. Communications of the ACM, 7(12):701–702, 1964.

[16] A. Hinrichs and J. Oettershagen. Optimal point sets for quasi-Monte Carlo integration of bivariate periodic functions with bounded mixed derivatives. In Monte Carlo and Quasi-Monte Carlo Methods, pages 385–405.
Springer, 2016.

[17] M. Holtz. Sparse grid quadrature in high dimensions with applications in finance and insurance. PhD thesis, University of Bonn, 2008.

[18] J. B. Hough, M. Krishnapur, Y. Peres, and B. Virág. Determinantal processes and independence. Probability Surveys, 2006.

[19] F. Huszár and D. Duvenaud. Optimally-weighted herding is Bayesian quadrature. In Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence, UAI'12, pages 377–386. AUAI Press, 2012.

[20] R. Jagadeeswaran and F. Hickernell. Fast automatic Bayesian cubature using lattice sampling. arXiv:1809.09803, 2018.

[21] K. Johansson. Random matrices and determinantal processes. In Mathematical Statistical Physics, Session LXXXIII: Lecture Notes of the Les Houches Summer School 2005, pages 1–56.

[22] K. Johansson. On random matrices from the compact classical groups. Annals of Mathematics, pages 519–545, 1997.

[23] T. Karvonen and S. Särkkä. Classical quadrature rules via Gaussian processes. In 2017 IEEE 27th International Workshop on Machine Learning for Signal Processing (MLSP), pages 1–6. IEEE, 2017.

[24] T. Karvonen, S. Särkkä, C. Oates, et al. Symmetry exploits for Bayesian cubature methods. arXiv:1809.10227, 2018.

[25] R. Killip and I. Nenciu. Matrix models for circular ensembles. International Mathematics Research Notices, 2004(50):2665, 2004.

[26] A. Kulesza and B. Taskar. Determinantal point processes for machine learning. Foundations and Trends® in Machine Learning, 5(2–3):123–286, 2012.

[27] F. Larkin. Gaussian measure in Hilbert space and applications in numerical analysis. The Rocky Mountain Journal of Mathematics, pages 379–421, 1972.

[28] Q. Liu and J. D. Lee. Black-box importance sampling. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.

[29] O. Macchi.
The coincidence approach to stochastic point processes. Advances in Applied Probability, 7:83–122, 1975.

[30] T. Minka. Deriving quadrature rules from Gaussian processes. Technical report, Statistics Department, Carnegie Mellon University, 2000.

[31] K. Muandet, K. Fukumizu, B. Sriperumbudur, B. Schölkopf, et al. Kernel mean embedding of distributions: A review and beyond. Foundations and Trends® in Machine Learning, 10(1–2):1–141, 2017.

[32] K. Murphy. Machine learning: a probabilistic perspective. MIT Press, 2012.

[33] C. J. Oates, M. Girolami, and N. Chopin. Control functionals for Monte Carlo integration. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):695–718, 2017.

[34] J. Oettershagen. Construction of optimal cubature algorithms with applications to econometrics and uncertainty quantification. PhD thesis, University of Bonn, 2017.

[35] A. O'Hagan. Bayes-Hermite quadrature. Journal of Statistical Planning and Inference, 29(3):245–260, 1991.

[36] C. Rasmussen and Z. Ghahramani. Bayesian Monte Carlo. In Advances in Neural Information Processing Systems 15, pages 489–496, Cambridge, MA, USA, 2003. MIT Press.

[37] C. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning. MIT Press, Cambridge, MA, USA, 2006.

[38] C. P. Robert. The Bayesian choice: from decision-theoretic foundations to computational implementation. Springer Science & Business Media, 2007.

[39] C. P. Robert and G. Casella. Monte Carlo statistical methods. Springer, 2004.

[40] B. Simon. Trace Ideals and Their Applications. American Mathematical Society, 2005.

[41] A. Smola, A. Gretton, L. Song, and B. Schölkopf. A Hilbert space embedding for distributions. In International Conference on Algorithmic Learning Theory, pages 13–31. Springer, 2007.

[42] S.
Smolyak. Quadrature and interpolation formulas for tensor products of certain classes of functions. In Doklady Akademii Nauk, volume 148, pages 1042–1045. Russian Academy of Sciences, 1963.

[43] A. Soshnikov. Determinantal random point fields. Russian Mathematical Surveys, 55:923–975, 2000.

[44] I. Steinwart and A. Christmann. Support Vector Machines. Springer Publishing Company, Incorporated, 1st edition, 2008.

[45] I. Steinwart and C. Scovel. Mercer's theorem on general domains: on the interaction between measures, kernels, and RKHSs. Constructive Approximation, 35(3):363–417, 2012.

[46] G. Wahba. Spline Models for Observational Data, volume 59. SIAM, 1990.

[47] H. Wendland. Scattered Data Approximation, volume 17. Cambridge University Press, 2004.