{"title": "Consistent Kernel Mean Estimation for Functions of Random Variables", "book": "Advances in Neural Information Processing Systems", "page_first": 1732, "page_last": 1740, "abstract": "We provide a theoretical foundation for non-parametric estimation of functions of random variables using kernel mean embeddings. We show that for any continuous function f, consistent estimators of the mean embedding of a random variable X lead to consistent estimators of the mean embedding of f(X). For Matern kernels and sufficiently smooth functions we also provide rates of convergence. Our results extend to functions of multiple random variables. If the variables are dependent, we require an estimator of the mean embedding of their joint distribution as a starting point; if they are independent, it is sufficient to have separate estimators of the mean embeddings of their marginal distributions. In either case, our results cover both mean embeddings based on i.i.d. samples as well as \"reduced set\" expansions in terms of dependent expansion points. The latter serves as a justification for using such expansions to limit memory resources when applying the approach as a basis for probabilistic programming.", "full_text": "Consistent Kernel Mean Estimation\nfor Functions of Random Variables\n\nCarl-Johann Simon-Gabriel\u21e4, Adam \u00b4Scibior\u21e4,\u2020, Ilya Tolstikhin, Bernhard Sch\u00f6lkopf\n\nDepartment of Empirical Inference, Max Planck Institute for Intelligent Systems\n\nSpemanstra\u00dfe 38, 72076 T\u00fcbingen, Germany\n\n\u21e4 joint \ufb01rst authors; \u2020 also with: Engineering Department, Cambridge University\n\ncjsimon@, adam.scibior@, ilya@, bs@tuebingen.mpg.de\n\nAbstract\n\nWe provide a theoretical foundation for non-parametric estimation of functions of\nrandom variables using kernel mean embeddings. 
We show that for any continuous function f, consistent estimators of the mean embedding of a random variable X lead to consistent estimators of the mean embedding of f(X). For Matérn kernels and sufficiently smooth functions we also provide rates of convergence. Our results extend to functions of multiple random variables. If the variables are dependent, we require an estimator of the mean embedding of their joint distribution as a starting point; if they are independent, it is sufficient to have separate estimators of the mean embeddings of their marginal distributions. In either case, our results cover both mean embeddings based on i.i.d. samples as well as "reduced set" expansions in terms of dependent expansion points. The latter serves as a justification for using such expansions to limit memory resources when applying the approach as a basis for probabilistic programming.

1 Introduction

A common task in probabilistic modelling is to compute the distribution of f(X), given a measurable function f and a random variable X. In fact, the earliest instances of this problem date back at least to Poisson (1837). Sometimes this can be done analytically. For example, if f is linear and X is Gaussian, that is f(x) = ax + b and X ∼ N(µ; σ), we have f(X) ∼ N(aµ + b; aσ). There exist various methods for obtaining such analytical expressions (Mathai, 1973), but outside a small subset of distributions and functions the formulae are either not available or too complicated to be practical.

An alternative to the analytical approach is numerical approximation, ideally implemented as a flexible software library. The need for such tools is recognised in the general programming languages community (McKinley, 2016), but no standards have been established so far.
The main challenge is in finding a good approximate representation for random variables.

Distributions on integers, for example, are usually represented as lists of (x_i, p(x_i)) pairs. For real valued distributions, integral transforms (Springer, 1979), mixtures of Gaussians (Milios, 2009), Laguerre polynomials (Williamson, 1989), and Chebyshev polynomials (Korzeń and Jaroszewicz, 2014) were proposed as convenient representations for numerical computation. For strings, probabilistic finite automata are often used. All those approaches have their merits, but they only work with a specific input type.

There is an alternative, based on Monte Carlo sampling (Kalos and Whitlock, 2008), which is to represent X by a (possibly weighted) sample {(x_i, w_i)}_{i=1}^n (with w_i ≥ 0). This representation has several advantages: (i) it works for any input type, (ii) the sample size controls the time-accuracy trade-off, and (iii) applying functions to random variables reduces to applying the functions pointwise to the sample, i.e., {(f(x_i), w_i)} represents f(X). Furthermore, expectations of functions of random variables can be estimated as E[f(X)] ≈ Σ_i w_i f(x_i) / Σ_i w_i, sometimes with guarantees for the convergence rate.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

The flexibility of this Monte Carlo approach comes at a cost: without further assumptions on the underlying input space X, it is hard to quantify the accuracy of this representation. For instance, given two samples of the same size, {(x_i, w_i)}_{i=1}^n and {(x'_i, w'_i)}_{i=1}^n, how can we tell which one is a better representation of X? More generally, how could we optimize a representation with predefined sample size?

There exists an alternative to the Monte Carlo approach, called Kernel Mean Embeddings (KME) (Berlinet and Thomas-Agnan, 2004; Smola et al., 2007).
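Before moving on to embeddings, the weighted-sample operations described above, pointwise application of f and the expectation estimate E[f(X)] ≈ Σ_i w_i f(x_i) / Σ_i w_i, can be sketched in a few lines of Python. This is our own minimal illustration, not code from the paper; the distribution, function and sample size are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Weighted-sample representation of X ~ N(0, 1): points x_i with weights w_i >= 0.
n = 1000
x = rng.normal(0.0, 1.0, size=n)
w = np.full(n, 1.0 / n)        # uniform weights here; any w_i >= 0 would do

f = np.exp                     # a measurable function f

# (iii) applying f reduces to applying it pointwise:
# {(f(x_i), w_i)} represents f(X)
fx = f(x)

# E[f(X)] ~= sum_i w_i f(x_i) / sum_i w_i
est = np.sum(w * fx) / np.sum(w)
# for X ~ N(0, 1), the true value E[exp(X)] is exp(1/2) ~ 1.65
```

With uniform weights this is the plain Monte Carlo estimator; the general weighted form is what the reduced set methods discussed later produce.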
It also represents random variables as samples, but additionally defines a notion of similarity between sample points. As a result, (i) it keeps all the advantages of the Monte Carlo scheme, (ii) it includes the Monte Carlo method as a special case, (iii) it overcomes its pitfalls described above, and (iv) it can be tailored to focus on different properties of X, depending on the user's needs and prior assumptions. The KME approach identifies both sample points and distributions with functions in an abstract Hilbert space. Internally the latter are still represented as weighted samples, but the weights can be negative and the straightforward Monte Carlo interpretation is no longer valid. Schölkopf et al. (2015) propose using KMEs as an approximate representation of random variables for the purpose of computing their functions. However, they only provide theoretical justification for it in rather idealised settings, which do not meet practical implementation requirements.

In this paper, we build on this work and provide general theoretical guarantees for the proposed estimators. Specifically, we prove statements of the form "if {(x_i, w_i)}_{i=1}^n provides a good estimate for the KME of X, then {(f(x_i), w_i)}_{i=1}^n provides a good estimate for the KME of f(X)". Importantly, our results do not assume joint independence of the observations x_i (and weights w_i). This makes them a powerful tool. For instance, imagine we are given data {(x_i, w_i)}_{i=1}^n from a random variable X that we need to compress. Then our theorems guarantee that, whatever compression algorithm we use, as long as the compressed representation {(x'_j, w'_j)}_{j=1}^n still provides a good estimate for the KME of X, the pointwise images {(f(x'_j), w'_j)}_{j=1}^n provide good estimates of the KME of f(X).

In the remainder of this section we first introduce KMEs and discuss their merits.
Then we explain why and how we extend the results of Schölkopf et al. (2015). Section 2 contains our main results. In Section 2.1 we show consistency of the relevant estimator in a general setting, and in Section 2.2 we provide finite sample guarantees when Matérn kernels are used. In Section 3 we show how our results apply to functions of multiple variables, both interdependent and independent. Section 4 concludes with a discussion.

1.1 Background on kernel mean embeddings

Let X be a measurable input space. We use a positive definite, bounded and measurable kernel k : X × X → R to represent random variables X ∼ P and weighted samples X̂ := {(x_i, w_i)}_{i=1}^n as two functions µ^k_X and µ̂^k_X in the corresponding Reproducing Kernel Hilbert Space (RKHS) H_k by defining

µ^k_X := ∫ k(x, ·) dP(x)   and   µ̂^k_X := Σ_i w_i k(x_i, ·).

These are guaranteed to exist, since we assume the kernel is bounded (Smola et al., 2007). When clear from the context, we omit the kernel k in the superscript. µ_X is called the KME of P, but we also refer to it as the KME of X. In this paper we focus on computing functions of random variables. For f : X → Z, where Z is a measurable space, and for a positive definite bounded kernel k_z : Z × Z → R, we also write

µ^{k_z}_{f(X)} := ∫ k_z(f(x), ·) dP(x)   and   µ̂^{k_z}_{f(X)} := Σ_i w_i k_z(f(x_i), ·).   (1)

The advantage of mapping random variables X and samples X̂ to functions in the RKHS is that we may now say that X̂ is a good approximation for X if the RKHS distance ∥µ̂_X − µ_X∥ is small. This distance depends on the choice of the kernel, and different kernels emphasise different information about X. For example, if on X := [a, b] ⊂ R we choose k(x, x') := x · x' + 1, then µ_X(x) = E_{X∼P}[X] · x + 1.
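The linear-kernel example above can be checked numerically. The sketch below is our own (only the kernel k(x, x') = x·x' + 1 comes from the text): two very different weighted samples with equal weighted means embed to the same RKHS function, so their RKHS distance, computed from Gram matrices, vanishes.

```python
import numpy as np

def k_lin(a, b):
    # the kernel k(x, x') = x * x' + 1 from the example above
    return np.outer(a, b) + 1.0

def rkhs_dist2(x, w, y, v):
    # squared RKHS distance between two empirical embeddings
    # ||sum_i w_i k(x_i, .) - sum_j v_j k(y_j, .)||^2, via Gram matrices
    return w @ k_lin(x, x) @ w - 2 * w @ k_lin(x, y) @ v + v @ k_lin(y, y) @ v

# two different samples, both with weighted mean 0 and total weight 1
x = np.array([-1.0, 1.0]);       w = np.array([0.5, 0.5])
y = np.array([-3.0, 0.0, 3.0]);  v = np.array([1/3, 1/3, 1/3])

d2 = rkhs_dist2(x, w, y, v)   # 0 up to rounding: this kernel keeps only the mean
```

The same Gram-matrix formula computes ∥µ̂_X − µ̂_Y∥² for any kernel; with a characteristic kernel the distance would separate these two samples.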
Thus any two distributions and/or samples with equal means are mapped to the same function in H_k, so the distance between them is zero. Therefore, using this particular k, we keep track only of the mean of the distributions. If instead we prefer to keep track of all first p moments, we may use the kernel k(x, x') := (x · x' + 1)^p. And if we do not want to lose any information at all, we should choose k such that µ^k is injective over all probability measures on X. Such kernels are called characteristic. For standard spaces, such as X = R^d, many widely used kernels were proven characteristic, such as Gaussian, Laplacian, and Matérn kernels (Sriperumbudur et al., 2010, 2011).

The Gaussian kernel k(x, x') := exp(−∥x − x'∥² / (2σ²)) may serve as another good illustration of the flexibility of this representation. Whatever the positive bandwidth σ² > 0, we do not lose any information about distributions, because k is characteristic. Nevertheless, if σ² grows, all distributions start looking the same, because their embeddings converge to a constant function 1. If, on the other hand, σ² becomes small, distributions look increasingly different and µ̂_X becomes a function with bumps of height w_i at every x_i. In the limit when σ² goes to zero, each point is only similar to itself, so µ̂_X reduces to the Monte Carlo method. Choosing σ² can be interpreted as controlling the degree of smoothing in the approximation.

1.2 Reduced set methods

An attractive feature when using KME estimators is the ability to reduce the number of expansion points (i.e., the size of the weighted sample) in a principled way. Specifically, if X̂' := {(x'_j, 1/N)}_{j=1}^N, the objective is to construct X̂ := {(x_i, w_i)}_{i=1}^n that minimises ∥µ̂_{X'} − µ̂_X∥ with n < N. Often the resulting x_i are mutually dependent and the w_i certainly depend on them.
The algorithms for constructing such expansions are known as reduced set methods and have been studied by the machine learning community (Schölkopf and Smola, 2002, Chapter 18).

Although reduced set methods provide significant efficiency gains, their application raises certain concerns when it comes to computing functions of random variables. Let P, Q be the distributions of X and f(X) respectively. If x'_j ∼ i.i.d. P, then f(x'_j) ∼ i.i.d. Q and so µ̂_{f(X')} = (1/N) Σ_j k(f(x'_j), ·) reduces to the commonly used √N-consistent empirical estimator of µ_{f(X)} (Smola et al., 2007). Unfortunately, this is not the case after applying reduced set methods, and it is not known under which conditions µ̂_{f(X)} is a consistent estimator for µ_{f(X)}.

Schölkopf et al. (2015) advocate the use of reduced expansion set methods to save computational resources. They also provide some reasoning why this should be the right thing to do for characteristic kernels, but as they state themselves, their rigorous analysis does not cover practical reduced set methods. Motivated by this and other concerns listed in Section 1.4, we provide a generalised analysis of the estimator µ̂_{f(X)}, where we make no assumptions on how the x_i and w_i were generated. Before doing that, however, we first illustrate how the need for reduced set methods naturally emerges on a concrete problem.

1.3 Illustration with functions of two random variables

Suppose that we want to estimate µ_{f(X,Y)} given i.i.d. samples X̂' = {(x'_i, 1/N)}_{i=1}^N and Ŷ' = {(y'_j, 1/N)}_{j=1}^N from two independent random variables X ∈ X and Y ∈ Y respectively. Let Q be the distribution of Z = f(X, Y).

The first option is to consider what we will call the diagonal estimator µ̂_1 := (1/N) Σ_{i=1}^N k_z(f(x'_i, y'_i), ·). Since f(x'_i, y'_i) ∼ i.i.d. Q, µ̂_1 is √N-consistent (Smola et al., 2007).
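The diagonal estimator µ̂_1 is straightforward to realise in code. The sketch below is ours (the distributions X ∼ N(3; 0.5), Y ∼ N(4; 0.5) mirror the experiment in this section; the Gaussian k_z and its bandwidth are our own choices): the estimated embedding is a function that can be evaluated at any point z.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 500
xs = rng.normal(3.0, 0.5, size=N)   # X ~ N(3; 0.5)
ys = rng.normal(4.0, 0.5, size=N)   # Y ~ N(4; 0.5), independent of X

f = lambda x, y: x * y              # f(X, Y) = X * Y

def k_z(a, b, sigma=1.0):
    # Gaussian kernel on the output space Z (bandwidth is an assumption)
    return np.exp(-(a - b) ** 2 / (2 * sigma ** 2))

def mu1(z):
    # diagonal estimator: mu_1(z) = (1/N) * sum_i k_z(f(x_i, y_i), z)
    return np.mean(k_z(f(xs, ys), z))

val = mu1(12.0)   # f(X, Y) concentrates around 3 * 4 = 12
```

Each pair (x_i, y_i) contributes one expansion point f(x_i, y_i) with weight 1/N, so the estimator uses only O(N) memory.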
Another option is to consider the U-statistic estimator µ̂_2 := (1/N²) Σ_{i,j=1}^N k_z(f(x'_i, y'_j), ·), which is also known to be √N-consistent. Experiments show that µ̂_2 is more accurate and has lower variance than µ̂_1 (see Figure 1). However, the U-statistic estimator µ̂_2 needs O(N²) memory rather than O(N). For this reason Schölkopf et al. (2015) propose to use a reduced set method both on X̂' and Ŷ' to get new samples X̂ = {(x_i, w_i)}_{i=1}^n and Ŷ = {(y_j, u_j)}_{j=1}^n of size n ≪ N, and then to estimate µ_{f(X,Y)} using µ̂_3 := Σ_{i,j=1}^n w_i u_j k_z(f(x_i, y_j), ·).

We ran experiments on synthetic data to show how accurately µ̂_1, µ̂_2 and µ̂_3 approximate µ_{f(X,Y)} with growing sample size N. We considered three basic arithmetic operations: multiplication X · Y, division X / Y, and exponentiation X^Y, with X ∼ N(3; 0.5) and Y ∼ N(4; 0.5). As the true embedding µ_{f(X,Y)} is unknown, we approximated it by a U-statistic estimator based on a large sample (125 points). For µ̂_3, we used the simplest possible reduced set method: we randomly sampled subsets of size n = 0.01 · N of the x_i, and optimized the weights w_i and u_i to best approximate µ̂_X and µ̂_Y. The results are summarised in Figure 1 and corroborate our expectations: (i) all estimators converge, (ii) µ̂_2 converges fastest and has the lowest variance, and (iii) µ̂_3 is worse than µ̂_2, but much better than the diagonal estimator µ̂_1. Note, moreover, that unlike the U-statistic estimator µ̂_2, the reduced set based estimator µ̂_3 can be used with a fixed storage budget even if we perform a sequence of function applications, a situation naturally appearing in the context of probabilistic programming.

Schölkopf et al.
(2015) prove the consistency of µ̂_3 only for a rather limited case, when the points of the reduced expansions {x_i}_{i=1}^n and {y_i}_{i=1}^n are i.i.d. copies of X and Y, respectively, and the weights {(w_i, u_i)}_{i=1}^n are constants. Using our new results we will prove in Section 3.1 the consistency of µ̂_3 under fairly general conditions, even in the case when both expansion points and weights are interdependent random variables.

Figure 1: Error of kernel mean estimators for basic arithmetic functions of two variables, X · Y, X / Y and X^Y, as a function of sample size N. The U-statistic estimator µ̂_2 works best, closely followed by the proposed estimator µ̂_3, which outperforms the diagonal estimator µ̂_1.

1.4 Other sources of non-i.i.d. samples

Although our discussion above focuses on reduced expansion set methods, there are other popular algorithms that produce KME expansions where the samples are not i.i.d. Here we briefly discuss several examples, emphasising that our selection is not comprehensive. They provide additional motivation for stating convergence guarantees in the most general setting possible.

An important notion in probability theory is that of a conditional distribution, which can also be represented using KME (Song et al., 2009). With this representation the standard laws of probability, such as the sum, product, and Bayes' rules, can be stated using KME (Fukumizu et al., 2013). Applying those rules results in KME estimators with strong dependencies between samples and their weights.

Another possibility is that even though i.i.d. samples are available, they may not produce the best estimator. Various approaches, such as kernel herding (Chen et al., 2010; Lacoste-Julien et al., 2015), attempt to produce a better KME estimator by actively generating pseudo-samples that are not i.i.d.
from the underlying distribution.

2 Main results

This section contains our main results regarding consistency and finite sample guarantees for the estimator µ̂_{f(X)} defined in (1). They are based on the convergence of µ̂_X and avoid simplifying assumptions about its structure.

2.1 Consistency

If k_x is c0-universal (see Sriperumbudur et al. (2011)), consistency of µ̂_{f(X)} can be shown in a rather general setting.

Theorem 1. Let X and Z be compact Hausdorff spaces equipped with their Borel σ-algebras, f : X → Z a continuous function, and k_x, k_z continuous kernels on X, Z respectively. Assume k_x is c0-universal and that there exists C such that Σ_i |w_i| ≤ C independently of n. Then the following holds: if µ̂^{k_x}_X → µ^{k_x}_X, then µ̂^{k_z}_{f(X)} → µ^{k_z}_{f(X)} as n → ∞.

Proof. Let P be the distribution of X and P̂_n = Σ_{i=1}^n w_i δ_{x_i}. Define a new kernel on X by k̃_x(x_1, x_2) := k_z(f(x_1), f(x_2)). X is compact and {P̂_n | n ∈ N} ∪ {P} is a bounded set (in total variation norm) of finite measures, because ∥P̂_n∥_TV = Σ_{i=1}^n |w_i| ≤ C. Furthermore, k_x is continuous and c0-universal. Using Corollary 52 of Simon-Gabriel and Schölkopf (2016), we conclude that µ̂^{k_x}_X → µ^{k_x}_X implies that P̂_n converges weakly to P. Now, k_z and f being continuous, so is k̃_x. Thus, if P̂_n converges weakly to P, then µ̂^{k̃_x}_X → µ^{k̃_x}_X (Simon-Gabriel and Schölkopf, 2016, Theorem 44, Points (i) and (iii)). Overall, µ̂^{k_x}_X → µ^{k_x}_X implies µ̂^{k̃_x}_X → µ^{k̃_x}_X. We conclude the proof by showing that convergence in H_{k̃_x} leads to convergence in H_{k_z}:

∥µ̂^{k_z}_{f(X)} − µ^{k_z}_{f(X)}∥²_{k_z} = ∥µ̂^{k̃_x}_X − µ^{k̃_x}_X∥²_{k̃_x} → 0.

For a detailed version of the above, see Appendix A.

The continuity assumption is rather unrestrictive.
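Theorem 1 can be illustrated numerically; the following is our own sanity check, not an experiment from the paper (the Gaussian kernel, distribution, f = tanh, sample sizes and the large-sample stand-in for µ_{f(X)} are all our choices). As the estimator µ̂_X improves with n, the induced estimator of µ_{f(X)}, obtained by mapping the sample through f, improves as well:

```python
import numpy as np

rng = np.random.default_rng(2)

def k(a, b, s=1.0):
    # Gaussian kernel Gram matrix between 1-d point sets a and b
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * s ** 2))

def mmd(x, w, y, v):
    # RKHS distance between two weighted-sample embeddings
    d2 = w @ k(x, x) @ w - 2 * w @ k(x, y) @ v + v @ k(y, y) @ v
    return np.sqrt(max(d2, 0.0))

f = np.tanh                              # a continuous f
ref = f(rng.normal(0.0, 1.0, 2000))      # large sample standing in for mu_{f(X)}
vref = np.full(ref.size, 1.0 / ref.size)

errs = []
for n in (50, 2000):
    x = rng.normal(0.0, 1.0, n)          # estimator of mu_X based on n points
    w = np.full(n, 1.0 / n)
    # error of the induced estimator of mu_{f(X)}: pointwise images f(x_i)
    errs.append(mmd(f(x), w, ref, vref))
```

The theorem covers far more than this i.i.d. toy case: any weighted expansion converging to µ_X, e.g. a reduced set expansion, yields the same conclusion.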
All kernels and functions defined on a discrete space are continuous with respect to the discrete topology, so the theorem applies in this case. For X = R^d, many kernels used in practice are continuous, including Gaussian, Laplacian, Matérn and other radial kernels. The slightly limiting factor of this theorem is that k_x must be c0-universal, which can often be tricky to verify. However, most standard kernels, including all radial, non-constant kernels, are c0-universal (see Sriperumbudur et al., 2011). The assumption that the input domain is compact is satisfied in most applications, since any measurements coming from physical sensors are contained in a bounded range. Finally, the assumption that Σ_i |w_i| ≤ C can be enforced, for instance, by applying a suitable regularization in reduced set methods.

2.2 Finite sample guarantees

Theorem 1 guarantees that the estimator µ̂_{f(X)} converges to µ_{f(X)} when µ̂_X converges to µ_X. However, it says nothing about the speed of convergence. In this section we provide a convergence rate when working with Matérn kernels, which are of the form

k^s_x(x, x') = (2^{1−s} / Γ(s)) ∥x − x'∥_2^{s − d/2} B_{d/2 − s}(∥x − x'∥_2),   (2)

where B_α is a modified Bessel function of the third kind (also known as a Macdonald function) of order α, Γ is the Gamma function, and s > d/2 is a smoothness parameter. The RKHS induced by k^s_x is the Sobolev space W^s_2(R^d) (Wendland, 2004, Theorem 6.13 & Chap. 10) containing s-times differentiable functions. The finite-sample bound of Theorem 2 is based on the analysis of Kanagawa et al. (2016), which requires the following assumptions:

Assumptions 1. Let X be a random variable over X = R^d with distribution P and let X̂ = {(x_i, w_i)}_{i=1}^n be random variables over X^n × R^n with joint distribution S.
There exists a probability distribution Q with full support on R^d and a bounded density, satisfying the following properties:

(i) P has a bounded density function w.r.t. Q;
(ii) there is a constant D > 0, independent of n, such that

E_S[ (1/n) Σ_{i=1}^n g²(x_i) ] ≤ D ∥g∥²_{L2(Q)}   for all g ∈ L2(Q).

These assumptions were shown to be fairly general and we refer to Kanagawa et al. (2016, Section 4.1) for various examples where they are met. Next we state the main result of this section.

Theorem 2. Let X = R^d, Z = R^{d'}, and f : X → Z an α-times differentiable function (α ∈ N+). Take s_1 > d/2 and s_2 > d' such that s_1, s_2/2 ∈ N+, and let k^{s_1}_x and k^{s_2}_z be Matérn kernels over X and Z respectively, as defined in (2). Assume X ∼ P and X̂ = {(x_i, w_i)}_{i=1}^n ∼ S satisfy Assumptions 1. Moreover, assume that P and the marginals of x_1, ..., x_n have a common compact support. Suppose that, for some constants b > 0 and 0 < c ≤ 1/2:

(i) E_S[ ∥µ̂_X − µ_X∥²_{k^{s_1}_x} ] = O(n^{−2b});
(ii) Σ_{i=1}^n w_i² = O(n^{−2c}) (with probability 1).

Let θ = min(s_2/(2 s_1), α/s_1, 1) and assume θb − (1/2 − c)(1 − θ) > 0. Then

E_S[ ∥µ̂_{f(X)} − µ_{f(X)}∥²_{k^{s_2}_z} ] = O( (log n)^{d'} n^{−2(θb − (1/2 − c)(1 − θ))} ).   (3)

Before we provide a short sketch of the proof, let us briefly comment on this result. As a benchmark, remember that when x_1, ..., x_n are i.i.d. observations from X and X̂ = {(x_i, 1/n)}_{i=1}^n, we get ∥µ̂_{f(X)} − µ_{f(X)}∥² = O_P(n^{−1}), which was recently shown to be a minimax optimal rate (Tolstikhin et al., 2016). How do we compare to this benchmark? In this case we have b = c = 1/2 and our rate is defined by θ. If f is smooth enough, say α > d/2 + 1, then by setting s_2 > 2 s_1 = 2α we recover the O(n^{−1}) rate up to an extra (log n)^{d'} factor.

However, Theorem 2 applies to much more general settings. Importantly, it makes no i.i.d.
assumptions on the data points and weights, allowing for complex interdependences. Instead, it asks the convergence of the estimator µ̂_X to the embedding µ_X to be sufficiently fast. On the downside, the upper bound is affected by the smoothness of f even in the i.i.d. setting: if α ≪ d/2 the rate becomes slower, as θ = α/s_1. Also, the rate depends on both d and d'. Whether these are artefacts of our proof remains an open question.

Proof. Here we sketch the main ideas of the proof and develop the details in Appendix C. Throughout the proof, C designates a constant that depends neither on the sample size n nor on the variable R (to be introduced); C may however change from line to line. We start by showing that

E_S[ ∥µ̂^{k_z}_{f(X)} − µ^{k_z}_{f(X)}∥²_{k_z} ] = (2π)^{−d'/2} ∫_Z E_S[ ([µ̂^h_{f(X)} − µ^h_{f(X)}](z))² ] dz,   (4)

where h is the Matérn kernel over Z with smoothness parameter s_2/2. Second, we upper bound the integrand by roughly imitating the proof idea of Theorem 1 from Kanagawa et al. (2016). This eventually yields

E_S[ ([µ̂^h_{f(X)} − µ^h_{f(X)}](z))² ] ≤ C n^{−2ν},   (5)

where ν := θb − (1/2 − c)(1 − θ). Unfortunately, this upper bound does not depend on z and cannot be integrated over the whole Z in (4).
Denoting by B_R the ball of radius R centred at the origin of Z, we thus decompose the integral in (4) as

∫_Z E_S[ ([µ̂^h_{f(X)} − µ^h_{f(X)}](z))² ] dz = ∫_{B_R} E_S[ ([µ̂^h_{f(X)} − µ^h_{f(X)}](z))² ] dz + ∫_{Z∖B_R} E_S[ ([µ̂^h_{f(X)} − µ^h_{f(X)}](z))² ] dz.

On B_R we upper bound the integral by (5) times the ball's volume, which grows like R^{d'}:

∫_{B_R} E_S[ ([µ̂^h_{f(X)} − µ^h_{f(X)}](z))² ] dz ≤ C R^{d'} n^{−2ν}.   (6)

On Z∖B_R, we upper bound the integral by a value that decreases with R, of the form

∫_{Z∖B_R} E_S[ ([µ̂^h_{f(X)} − µ^h_{f(X)}](z))² ] dz ≤ C n^{1−2c} (R − C')^{s_2−2} e^{−2(R−C')},   (7)

with C' > 0 being a constant smaller than R. In essence, this upper bound decreases with R because [µ̂^h_{f(X)} − µ^h_{f(X)}](z) decays at the same speed as h when ∥z∥ grows indefinitely. We are now left with two rates, (6) and (7), which respectively increase and decrease with growing R. We complete the proof by balancing the two terms, which results in setting R ≈ (log n)^{1/2}.

3 Functions of Multiple Arguments

The previous section applies to functions f of a single variable X. However, we can apply its results to functions of multiple variables if we take the argument X to be a tuple containing multiple values. In this section we discuss how to do this using two input variables from spaces X and Y, but the results also apply to more inputs. To be precise, our input space changes from X to X × Y, the input random variable from X to (X, Y), and the kernel on the input space from k_x to k_xy.

To apply our results from Section 2, all we need is a consistent estimator µ̂_{(X,Y)} of the joint embedding µ_{(X,Y)}. There are different ways to get such an estimator. One way is to sample (x'_i, y'_i) i.i.d.
from the joint distribution of (X, Y) and construct the usual empirical estimator, or approximate it using reduced set methods. Alternatively, we may want to construct µ̂_{(X,Y)} based only on consistent estimators of µ_X and µ_Y; for example, this is how µ̂_3 was defined in Section 1.3. Below we show that this can indeed be done if X and Y are independent.

3.1 Application to Section 1.3

Following Schölkopf et al. (2015), we consider two independent random variables X ∼ P_x and Y ∼ P_y. Their joint distribution is P_x ⊗ P_y. Consistent estimators of their embeddings are given by µ̂_X = Σ_{i=1}^n w_i k_x(x_i, ·) and µ̂_Y = Σ_{j=1}^n u_j k_y(y_j, ·). In this section we show that µ̂_{f(X,Y)} = Σ_{i,j=1}^n w_i u_j k_z(f(x_i, y_j), ·) is a consistent estimator of µ_{f(X,Y)}.

We choose a product kernel k_xy((x_1, y_1), (x_2, y_2)) = k_x(x_1, x_2) k_y(y_1, y_2), so the corresponding RKHS is a tensor product H_{k_xy} = H_{k_x} ⊗ H_{k_y} (Steinwart and Christmann, 2008, Lemma 4.6), and the mean embedding of the product random variable (X, Y) is the tensor product of the marginal mean embeddings: µ_{(X,Y)} = µ_X ⊗ µ_Y. With consistent estimators for the marginal embeddings we can estimate the joint embedding using their tensor product:

µ̂_{(X,Y)} = µ̂_X ⊗ µ̂_Y = Σ_{i,j=1}^n w_i u_j k_x(x_i, ·) ⊗ k_y(y_j, ·) = Σ_{i,j=1}^n w_i u_j k_xy((x_i, y_j), (·, ·)).

If the points are i.i.d. and w_i = u_i = 1/n, this reduces to the U-statistic estimator µ̂_2 from Section 1.3.

Lemma 3. Let (s_n)_n be any positive real sequence converging to zero. Suppose k_xy = k_x k_y is a product kernel, µ_{(X,Y)} = µ_X ⊗ µ_Y and µ̂_{(X,Y)} = µ̂_X ⊗ µ̂_Y. Then ∥µ̂_X − µ_X∥_{k_x} = O(s_n) and ∥µ̂_Y − µ_Y∥_{k_y} = O(s_n) together imply ∥µ̂_{(X,Y)} − µ_{(X,Y)}∥_{k_xy} = O(s_n).

Proof.
For a detailed expansion of the first inequality, see Appendix B:

∥µ̂_{(X,Y)} − µ_{(X,Y)}∥_{k_xy} ≤ ∥µ_X∥_{k_x} ∥µ̂_Y − µ_Y∥_{k_y} + ∥µ_Y∥_{k_y} ∥µ̂_X − µ_X∥_{k_x} + ∥µ̂_X − µ_X∥_{k_x} ∥µ̂_Y − µ_Y∥_{k_y} = O(s_n) + O(s_n) + O(s_n²) = O(s_n).

Corollary 4. If µ̂_X → µ_X and µ̂_Y → µ_Y as n → ∞, then µ̂_{(X,Y)} → µ_{(X,Y)}.

Together with the results from Section 2, this lets us reason about estimators resulting from applying functions to multiple independent random variables. Write

µ̂^{k_xy}_{XY} = Σ_{i,j=1}^n w_i u_j k_xy((x_i, y_j), ·) = Σ_{ℓ=1}^{n²} ω_ℓ k_xy(ξ_ℓ, ·),

where ℓ enumerates the (i, j) pairs, ξ_ℓ = (x_i, y_j) and ω_ℓ = w_i u_j. Now if µ̂^{k_x}_X → µ^{k_x}_X and µ̂^{k_y}_Y → µ^{k_y}_Y, then µ̂^{k_xy}_{XY} → µ^{k_xy}_{(X,Y)} (according to Corollary 4), and Theorem 1 shows that Σ_{i,j=1}^n w_i u_j k_z(f(x_i, y_j), ·) is consistent as well. Unfortunately, we cannot apply Theorem 2 to get the speed of convergence, because a product of Matérn kernels is no longer a Matérn kernel.

One downside of this overall approach is that the number of expansion points used for the estimation of the joint increases exponentially with the number of arguments of f. This can lead to prohibitively large computational costs, especially if the result of such an operation is used as an input to another function of multiple arguments. To alleviate this problem, we may use reduced expansion set methods before or after applying f, as we did for example in Section 1.2.

To conclude this section, let us summarize the implications of our results for the practical scenarios that should be distinguished.

•
If we have separate samples from two random variables X and Y, then our results justify how to provide an estimate of the mean embedding of f(X, Y), provided that X and Y are independent. The samples themselves need not be i.i.d.; we can also work with weighted samples computed, for instance, by a reduced set method.

• How about dependent random variables? For instance, imagine that Y = −X and f(X, Y) = X + Y. Clearly, in this case the distribution of f(X, Y) is a delta measure at 0, and there is no way to predict this from separate samples of X and Y. However, it should be stressed that our results (consistency and the finite sample bound) apply even to the case where X and Y are dependent. In that case, however, they require a consistent estimator of the joint embedding µ_{(X,Y)}.

• It is also sufficient to have a reduced set expansion of the embedding of the joint distribution. This setting may sound strange, but it potentially has significant applications. Imagine that one has a large database of user data, sampled from a joint distribution. If we expand the joint's embedding in terms of synthetic expansion points using a reduced set construction method, then we can pass these (weighted) synthetic expansion points on to a third party without revealing the original data. Using our results, the third party can nevertheless perform arbitrary continuous functional operations on the joint distribution in a consistent manner.

4 Conclusion and future work

This paper provides a theoretical foundation for using kernel mean embeddings as approximate representations of random variables in scenarios where we need to apply functions to those random variables. We show that for continuous functions f (including all functions on discrete domains), consistency of the mean embedding estimator of a random variable X implies consistency of the mean embedding estimator of f(X).
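As a compact summary of the two-variable construction from Section 3, the estimator of µ_{f(X,Y)} for independent X and Y can be formed from separate weighted expansions. This is our own minimal sketch (the random points, weights and f are placeholders; in practice the expansions would come from data or a reduced set method):

```python
import numpy as np

rng = np.random.default_rng(3)

# Weighted expansions for independent X and Y, e.g. output of a reduced set
# method; here we just use random points with normalized nonnegative weights.
n = 20
x = rng.normal(3.0, 0.5, n)
y = rng.normal(4.0, 0.5, n)
w = rng.random(n); w /= w.sum()
u = rng.random(n); u /= u.sum()

f = lambda a, b: a * b

# Estimator of mu_{f(X,Y)}: expansion points f(x_i, y_j) with product
# weights w_i * u_j, as in the definition of mu_3 in Section 1.3.
pts = f(x[:, None], y[None, :]).ravel()
wts = (w[:, None] * u[None, :]).ravel()
# n**2 expansion points: the exponential blow-up that motivates applying
# reduced set methods again after each function application
```

Note that the product weights sum to one, and the induced weighted mean of f factorizes, Σ_{ij} w_i u_j x_i y_j = (Σ_i w_i x_i)(Σ_j u_j y_j), reflecting the independence assumption.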
Furthermore, if the kernels are Matérn and the function f is sufficiently smooth, we provide bounds on the convergence rate. Importantly, our results apply beyond i.i.d. samples and cover estimators based on expansions with interdependent points and weights. One interesting future direction is to improve the finite-sample bounds and to extend them to general radial and/or translation-invariant kernels.

Our work is motivated by the field of probabilistic programming. Using our theoretical results, kernel mean embeddings can be used to generalize functional operations (which lie at the core of all programming languages) to distributions over data types in a principled manner, by applying the operations to the sample points or to approximate kernel expansions. This is in principle feasible for any data type, provided a suitable kernel function can be defined on it. We believe that the approach holds significant potential for future probabilistic programming systems.

Acknowledgements

We thank Krikamol Muandet for providing the code used to generate Figure 1, Paul Rubenstein, Motonobu Kanagawa and Bharath Sriperumbudur for very useful discussions, and our anonymous reviewers for their valuable feedback. Carl-Johann Simon-Gabriel is supported by a Google European Fellowship in Causal Inference.

References

R. A. Adams and J. J. F. Fournier. Sobolev Spaces. Academic Press, 2003.
C. Bennett and R. Sharpley. Interpolation of Operators. Pure and Applied Mathematics. Elsevier Science, 1988.
A. Berlinet and C. Thomas-Agnan. RKHS in Probability and Statistics. Springer, 2004.
Y. Chen, M. Welling, and A. Smola. Super-samples from kernel herding. In UAI, 2010.
K. Fukumizu, L. Song, and A. Gretton. Kernel Bayes' rule: Bayesian inference with positive definite kernels. Journal of Machine Learning Research, 14:3753–3783, 2013.
I. S. Gradshteyn and I. M. Ryzhik. Table of Integrals, Series, and Products.
Elsevier/Academic Press, Amsterdam, 2007. Edited by Alan Jeffrey and Daniel Zwillinger.
M. Kalos and P. Whitlock. Monte Carlo Methods. Wiley, 2008.
M. Kanagawa, B. K. Sriperumbudur, and K. Fukumizu. Convergence guarantees for kernel-based quadrature rules in misspecified settings. arXiv:1605.07254 [stat], 2016.
Y. Katznelson. An Introduction to Harmonic Analysis. Cambridge University Press, 2004.
M. Korzeń and S. Jaroszewicz. PaCAL: A Python package for arithmetic computations with random variables. Journal of Statistical Software, 57(10), 2014.
S. Lacoste-Julien, F. Lindsten, and F. Bach. Sequential kernel herding: Frank-Wolfe optimization for particle filtering. In Artificial Intelligence and Statistics, volume 38, pages 544–552, 2015.
A. Mathai. A review of the different techniques used for deriving the exact distributions of multivariate test criteria. Sankhyā: The Indian Journal of Statistics, Series A, pages 39–60, 1973.
K. McKinley. Programming the world of uncertain things (keynote). In ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pages 1–2, 2016.
D. Milios. Probability Distributions as Program Variables. PhD thesis, University of Edinburgh, 2009.
S. Poisson. Recherches sur la probabilité des jugements en matière criminelle et en matière civile, précédées des règles générales du calcul des probabilités. 1837.
B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2002.
B. Schölkopf, K. Muandet, K. Fukumizu, S. Harmeling, and J. Peters. Computing functions of random variables via reproducing kernel Hilbert space representations. Statistics and Computing, 25(4):755–766, 2015.
C. Scovel, D. Hush, I. Steinwart, and J. Theiler.
Radial kernels and their reproducing kernel Hilbert spaces. Journal of Complexity, 26, 2014.
C.-J. Simon-Gabriel and B. Schölkopf. Kernel distribution embeddings: Universal kernels, characteristic kernels and kernel metrics on distributions. Technical report, Max Planck Institute for Intelligent Systems, 2016.
A. Smola, A. Gretton, L. Song, and B. Schölkopf. A Hilbert space embedding for distributions. In ALT, 2007.
L. Song, J. Huang, A. Smola, and K. Fukumizu. Hilbert space embeddings of conditional distributions with applications to dynamical systems. In International Conference on Machine Learning, pages 1–8, 2009.
M. D. Springer. The Algebra of Random Variables. Wiley, 1979.
B. K. Sriperumbudur, A. Gretton, K. Fukumizu, B. Schölkopf, and G. R. Lanckriet. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11:1517–1561, 2010.
B. K. Sriperumbudur, K. Fukumizu, and G. R. G. Lanckriet. Universality, characteristic kernels and RKHS embedding of measures. Journal of Machine Learning Research, 12:2389–2410, 2011.
I. Steinwart and A. Christmann. Support Vector Machines. Information Science and Statistics. Springer, 2008.
I. Steinwart and C. Scovel. Mercer's theorem on general domains: On the interaction between measures, kernels, and RKHSs. Constructive Approximation, 35(3):363–417, 2012.
I. Tolstikhin, B. Sriperumbudur, and K. Muandet. Minimax estimation of kernel mean embeddings. arXiv:1602.04361 [math, stat], 2016.
H. Wendland. Scattered Data Approximation. Cambridge University Press, 2004.
R. Williamson. Probabilistic Arithmetic.
PhD thesis, University of Queensland, 1989.