{"title": "Kernel Mean Estimation via Spectral Filtering", "book": "Advances in Neural Information Processing Systems", "page_first": 1, "page_last": 9, "abstract": "The problem of estimating the kernel mean in a reproducing kernel Hilbert space (RKHS) is central to kernel methods in that it is used by classical approaches (e.g., when centering a kernel PCA matrix), and it also forms the core inference step of modern kernel methods (e.g., kernel-based non-parametric tests) that rely on embedding probability distributions in RKHSs. Previous work [1] has shown that shrinkage can help in constructing \u201cbetter\u201d estimators of the kernel mean than the empirical estimator. The present paper studies the consistency and admissibility of the estimators in [1], and proposes a wider class of shrinkage estimators that improve upon the empirical estimator by considering appropriate basis functions. Using the kernel PCA basis, we show that some of these estimators can be constructed using spectral filtering algorithms which are shown to be consistent under some technical assumptions. Our theoretical analysis also reveals a fundamental connection to the kernel-based supervised learning framework. The proposed estimators are simple to implement and perform well in practice.", "full_text": "Kernel Mean Estimation via Spectral Filtering\n\nKrikamol Muandet\nMPI-IS, T\u00a8ubingen\n\nkrikamol@tue.mpg.de\n\nBharath Sriperumbudur\nDept. of Statistics, PSU\n\nbks18@psu.edu\n\nBernhard Sch\u00a8olkopf\n\nMPI-IS, T\u00a8ubingen\nbs@tue.mpg.de\n\nAbstract\n\nThe problem of estimating the kernel mean in a reproducing kernel Hilbert space\n(RKHS) is central to kernel methods in that it is used by classical approaches (e.g.,\nwhen centering a kernel PCA matrix), and it also forms the core inference step of\nmodern kernel methods (e.g., kernel-based non-parametric tests) that rely on em-\nbedding probability distributions in RKHSs. 
Previous work [1] has shown that\nshrinkage can help in constructing \u201cbetter\u201d estimators of the kernel mean than the\nempirical estimator. The present paper studies the consistency and admissibility\nof the estimators in [1], and proposes a wider class of shrinkage estimators that\nimprove upon the empirical estimator by considering appropriate basis functions.\nUsing the kernel PCA basis, we show that some of these estimators can be con-\nstructed using spectral \ufb01ltering algorithms which are shown to be consistent under\nsome technical assumptions. Our theoretical analysis also reveals a fundamental\nconnection to the kernel-based supervised learning framework. The proposed es-\ntimators are simple to implement and perform well in practice.\n\n1\n\nIntroduction\n\nThe kernel mean or the mean element, which corresponds to the mean of the kernel function in a\nreproducing kernel Hilbert space (RKHS) computed w.r.t. some distribution P, has played a fun-\ndamental role as a basic building block of many kernel-based learning algorithms [2\u20134], and has\nrecently gained increasing attention through the notion of embedding distributions in an RKHS [5\u2013\n13]. Estimating the kernel mean remains an important problem as the underlying distribution P is\nusually unknown and we must rely entirely on the sample drawn according to P.\n\nfrom P, the most common\nGiven a random sample drawn independently and identically (i.i.d.)\nway to estimate the kernel mean is by replacing P by the empirical measure, Pn := 1\ni=1 \u03b4Xi\nwhere \u03b4x is a Dirac measure at x [5, 6]. Without any prior knowledge about P, the empirical\nestimator is possibly the best one can do. However, [1] showed that this estimator can be \u201cimproved\u201d\nby constructing a shrinkage estimator which is a combination of a model with low bias and high\nvariance, and a model with high bias but low variance. 
Interestingly, signi\ufb01cant improvement is\nin fact possible if the trade-off between these two models is chosen appropriately. The shrinkage\nestimator proposed in [1], which is motivated from the classical James-Stein shrinkage estimator\n[14] for the estimation of the mean of a normal distribution, is shown to have a smaller mean-squared\nerror than that of the empirical estimator. These \ufb01ndings provide some support for the conceptual\npremise that we might be somewhat pessimistic in using the empirical estimator of the kernel mean\nand there is abundant room for further progress.\n\nnPn\n\nIn this work, we adopt a spectral \ufb01ltering approach to obtain shrinkage estimators of kernel mean\nthat improve on the empirical estimator. The motivation behind our approach stems from the idea\npresented in [1] where the kernel mean estimation is reformulated as an empirical risk minimization\n(ERM) problem, with the shrinkage estimator being then obtained through penalized ERM. It is\nimportant to note that this motivation differs fundamentally from the typical supervised learning as\nthe goal of regularization here is to get the James-Stein-like shrinkage estimators [14] rather than\n\n1\n\n\fto prevent over\ufb01tting. By looking at regularization from a \ufb01lter function perspective, in this paper,\nwe show that a wide class of shrinkage estimators for kernel mean can be obtained and that these\nestimators are consistent for an appropriate choice of the regularization/shrinkage parameter.\n\nUnlike in earlier works [15\u201318] where the spectral \ufb01ltering approach has been used in supervised\nlearning problems, we here deal with unsupervised setting and only leverage spectral \ufb01ltering as a\nway to construct a shrinkage estimator of the kernel mean. One of the advantages of this approach\nis that it allows us to incorporate meaningful prior knowledge. 
The resultant estimators are char-\nacterized by the \ufb01lter function, which can be chosen according to the relevant prior knowledge.\nMoreover, the spectral \ufb01ltering gives rise to a broader interpretation of shrinkage through, for exam-\nple, the notion of early stopping and dimension reduction. Our estimators not only outperform the\nempirical estimator, but are also simple to implement and computationally ef\ufb01cient.\n\nThe paper is organized as follows. In Section 2, we introduce the problem of shrinkage estimation\nand present a new result that theoretically justi\ufb01es the shrinkage estimator over the empirical esti-\nmator for kernel mean, which improves on the work of [1] while removing some of its drawbacks.\nMotivated by this result, we consider a general class of shrinkage estimators obtained via spectral\n\ufb01ltering in Section 3 whose theoretical properties are presented in Section 4. The empirical perfor-\nmance of the proposed estimators are presented in Section 5. The missing proofs of the results are\ngiven in the supplementary material.\n\n2 Kernel mean shrinkage estimator\n\nIn this section, we present preliminaries on the problem of shrinkage estimation in the context of esti-\nmating the kernel mean [1] and then present a theoretical justi\ufb01cation (see Theorem 1) for shrinkage\nestimators that improves our understanding of the kernel mean estimation problem, while alleviating\nsome of the issues inherent in the estimator proposed in [1].\nPreliminaries: Let H be an RKHS of functions on a separable topological space X . The space H\nis endowed with inner product h\u00b7,\u00b7i, associated norm k \u00b7 k, and reproducing kernel k : X \u00d7 X \u2192 R,\nwhich we assume to be continuous and bounded, i.e., \u03ba := supx\u2208X pk(x, x) < \u221e. The kernel\nmean of some unknown distribution P on X and its empirical estimate\u2014we refer to this as kernel\nmean estimator (KME)\u2014from i.i.d. sample x1, . . . 
, xn are given by

µP := ∫X k(x, ·) dP(x)   and   µ̂P := (1/n) ∑i=1..n k(xi, ·),   (1)

respectively. As mentioned before, µ̂P is the “best” possible estimator of µP if nothing is known about P. However, depending on the information that is available about P, one can construct various estimators of µP that perform “better” than µ̂P. Usually, the performance measure that is used for comparison is the mean-squared error, though alternate measures can be used. Therefore, our main objective is to improve upon KME in terms of the mean-squared error, i.e., construct µ̃P such that EP‖µ̃P − µP‖² ≤ EP‖µ̂P − µP‖² for all P ∈ P, with strict inequality holding for at least one element in P, where P is a suitably large class of Borel probability measures on X. Such an estimator µ̃P is said to be admissible w.r.t. P. If P = M¹₊(X) is the set of all Borel probability measures on X, then a µ̃P satisfying the above conditions may not exist and, in that sense, µ̂P is possibly the best estimator of µP that one can have.

Admissibility of shrinkage estimator: To improve upon KME, motivated by the James-Stein estimator θ̃, [1] proposed a shrinkage estimator µ̂α := αf* + (1 − α)µ̂P, where α ∈ R is the shrinkage parameter that balances the low-bias, high-variance model (µ̂P) with the high-bias, low-variance model (f* ∈ H). Assuming for simplicity f* = 0, [1] showed that EP‖µ̂α − µP‖² < EP‖µ̂P − µP‖² if and only if α ∈ (0, 2∆/(∆ + ‖µP‖²)), where ∆ := EP‖µ̂P − µP‖². 
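As a concrete illustration (our own sketch, not code from the paper), the empirical estimator in (1) can be evaluated pointwise, and the simple shrinkage estimator µ̂α = αf* + (1 − α)µ̂P with shrinkage target f* = 0 is just a rescaling of it; the helper names below are ours:

```python
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    """Gaussian RBF kernel k(x, y) = exp(-||x - y||^2 / (2*sigma^2))."""
    return float(np.exp(-np.sum((np.asarray(x) - np.asarray(y)) ** 2) / (2 * sigma ** 2)))

def empirical_kernel_mean(X, t, sigma=1.0):
    """Evaluate mu_hat_P(t) = (1/n) * sum_i k(x_i, t), the empirical estimator in (1)."""
    return float(np.mean([rbf_kernel(x, t, sigma) for x in X]))

def shrinkage_kernel_mean(X, t, alpha, sigma=1.0):
    """mu_alpha = alpha*f* + (1 - alpha)*mu_hat_P, with shrinkage target f* = 0."""
    return (1.0 - alpha) * empirical_kernel_mean(X, t, sigma)

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # toy sample
t = np.array([0.0, 0.0])                            # evaluation point
mu_hat_t = empirical_kernel_mean(X, t)              # (1 + 2*exp(-1/2)) / 3
```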
While this is an interesting result, the resultant estimator µ̂α is strictly not a “statistical estimator” as it depends on quantities that need to be estimated, i.e., it depends on α whose choice requires the knowledge of µP, which is the quantity to be estimated. We would like to mention that [1] handles the general case with f* not necessarily zero, wherein the range for α then depends on f* as well; for the purposes of simplicity and ease of understanding, for the rest of this paper we assume f* = 0. Since µ̂α is not practically interesting, [1] resorted to the following representation of µP and µ̂P as solutions to the minimization problems [1, 19]:

µP = arg inf_{g∈H} ∫X ‖k(x, ·) − g‖² dP(x),   µ̂P = arg inf_{g∈H} (1/n) ∑i=1..n ‖k(xi, ·) − g‖²,   (2)

using which µ̂α is shown to be the solution to the regularized empirical risk minimization problem:

µ̌λ = arg inf_{g∈H} (1/n) ∑i=1..n ‖k(xi, ·) − g‖² + λ‖g‖²,   (3)

where λ > 0 and α := λ/(λ+1), i.e., µ̌λ = µ̂_{λ/(λ+1)}. It is interesting to note that, unlike in supervised learning (e.g., least squares regression), the empirical minimization problem in (2) is not ill-posed and therefore does not require a regularization term, although one is used in (3) to obtain a shrinkage estimator of µP. [1] then obtained a value for λ through cross-validation and used it to construct µ̂_{λ/(λ+1)} as an estimator of µP, which is then shown to perform empirically better than µ̂P. However, no theoretical guarantees, including the basic requirement of µ̂_{λ/(λ+1)} being consistent, are provided. 
In fact, because λ is data-dependent, the above mentioned result about the improved performance of µ̂α over a range of α does not hold, as such a result is proved assuming that α is a constant and does not depend on the data. While it is clear that the regularizer in (3) is not needed to make (2) well-posed, the role of λ is not clear from the point of view of µ̂_{λ/(λ+1)} being consistent and better than µ̂P. The following result provides a theoretical understanding of µ̂_{λ/(λ+1)} from these viewpoints.

Theorem 1. Let µ̌λ be constructed as in (3). Then the following hold.
(i) ‖µ̌λ − µP‖ → 0 as λ → 0 and n → ∞. In addition, if λ = n^{−β} for some β > 0, then ‖µ̌λ − µP‖ = OP(n^{−min{β,1/2}}).
(ii) For λ = cn^{−β} with c > 0 and β > 1, define Pc,β := {P ∈ M¹₊(X) : ‖µP‖² < A ∫ k(x, x) dP(x)}, where A := 2^{1/β}β / (2^{1/β}β + c^{1/β}(β − 1)^{(β−1)/β}). Then ∀ n and ∀ P ∈ Pc,β, we have EP‖µ̌λ − µP‖² < EP‖µ̂P − µP‖².

Remark. (i) Theorem 1(i) shows that µ̌λ is a consistent estimator of µP as long as λ → 0, and the convergence rate in probability of ‖µ̌λ − µP‖ is determined by the rate of convergence of λ to zero, with the best possible convergence rate being n^{−1/2}. Therefore, to attain a fast rate of convergence, it is instructive to choose λ such that λ√n → 0 as λ → 0 and n → ∞.

(ii) Suppose for some c > 0 and β > 1 we choose λ = cn^{−β}, which means the resultant estimator µ̌λ is a proper estimator as it does not depend on any unknown quantities. 
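Remark (ii) can be made concrete with a toy calculation (a sketch of ours, not from the paper): for the linear kernel k(x, y) = xy on R, the kernel mean reduces to the ordinary mean, both risks are available in closed form, and the data-free choice λ = cn^{−β} yields an improvement whenever the equivalent α = λ/(1+λ) falls in the range stated earlier. All names below are ours:

```python
import numpy as np

def risks(mu2, var, n, lam):
    """Closed-form MSEs for the linear kernel k(x, y) = x*y on R, where the kernel
    mean is the ordinary mean E[X] (mu2 = squared mean, var = variance):
      empirical:  E|mu_hat - mu|^2 = var/n
      shrinkage:  mu_check = mu_hat/(1+lam), so
                  E|mu_check - mu|^2 = var/(n*(1+lam)^2) + (lam/(1+lam))^2 * mu2
    """
    emp = var / n
    shr = var / (n * (1 + lam) ** 2) + (lam / (1 + lam)) ** 2 * mu2
    return emp, shr

n, mu2, var = 50, 1.0, 4.0
lam = 1.0 * n ** (-1.5)        # lambda = c*n^{-beta} with c = 1, beta = 1.5 (our toy values)
alpha = lam / (1 + lam)        # equivalent shrinkage parameter
delta = var / n                # Delta = E|mu_hat - mu|^2
emp, shr = risks(mu2, var, n, lam)
# improvement holds whenever 0 < alpha < 2*delta/(delta + mu2)
```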
Theorem 1(ii) shows\nthat for any n and P \u2208 Pc,\u03b2, \u02c7\u00b5\u03bb is a \u201cbetter\u201d estimator than \u02c6\u00b5P. Note that for any P \u2208 M 1\n+(X ),\nk\u00b5Pk2 = R R k(x, y) dP(x) dP(y) \u2264 (R pk(x, x) dP(x))2 \u2264 R k(x, x) dP(x). This means \u02c7\u00b5\u03bb\n+(X ) to Pc,\u03b2 which considers only those distributions for which\nis admissible if we restrict M 1\nk\u00b5Pk2/R k(x, x) dP(x) is strictly less than a constant, A < 1. It is obvious to note that if c is\nvery small or \u03b2 is very large, then A gets closer to one and \u02c7\u00b5\u03bb behaves almost like \u02c6\u00b5P, thereby\nmatching with our intuition.\n(iii) A nice interpretation for Pc,\u03b2 can be obtained as in Theorem 1(ii) when k is a translation in-\nvariant kernel on Rd. It can be shown that Pc,\u03b2 contains the class of all probability measures whose\ncharacteristic function has an L2 norm (and therefore is the set of square integrable probability den-\nsities if P has a density w.r.t. the Lebesgue measure) bounded by a constant that depends on c, \u03b2 and\nk (see \u00a72 in the supplementary material).\n(cid:4)\n3 Spectral kernel mean shrinkage estimator\n\nLet us return to the shrinkage estimator \u02c6\u00b5\u03b1 considered in [1], i.e., \u02c6\u00b5\u03b1 = \u03b1f \u2217 + (1 \u2212 \u03b1)\u02c6\u00b5P =\n\u03b1Pihf \u2217, eiiei + (1 \u2212 \u03b1)Pih\u02c6\u00b5P, eiiei, where (ei)i\u2208N are the countable orthonormal basis (ONB)\nof H\u2014countable ONB exist since H is separable which follows from X being separable and k\nbeing continuous [20, Lemma 4.33]. This estimator can be generalized by considering the shrinkage\nestimator \u02c6\u00b5\u03b1 := Pi \u03b1ihf \u2217, eiiei + Pi(1 \u2212 \u03b1i)h\u02c6\u00b5P, eiiei where \u03b1 := (\u03b11, \u03b12, . . .) \u2208 R\u221e is\na sequence of shrinkage parameters. 
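In finite dimensions this generalized estimator can be written down directly; the following sketch (ours, with f* = 0 as the shrinkage target and R² standing in for H) shrinks each coordinate of µ̂P by its own αi in a given ONB:

```python
import numpy as np

def componentwise_shrinkage(mu_hat, f_star, alphas, basis):
    """mu_alpha = sum_i alpha_i <f*, e_i> e_i + sum_i (1 - alpha_i) <mu_hat, e_i> e_i
    for an orthonormal basis given as the rows of `basis` (finite-dimensional sketch)."""
    out = np.zeros_like(mu_hat)
    for a, e in zip(alphas, basis):
        out += a * (f_star @ e) * e + (1 - a) * (mu_hat @ e) * e
    return out

basis = np.eye(2)                 # ONB of R^2 as a toy "RKHS"
mu_hat = np.array([2.0, 1.0])
f_star = np.zeros(2)              # shrink toward 0
alphas = [0.5, 0.1]               # a different shrinkage parameter per coordinate
mu_alpha = componentwise_shrinkage(mu_hat, f_star, alphas, basis)
# component-wise: [(1-0.5)*2, (1-0.1)*1] = [1.0, 0.9]
```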
If ∆α := EP‖µ̂α − µP‖² is the risk of this estimator, the following theorem gives an optimality condition on α for which ∆α < ∆.

Theorem 2. For some ONB (ei)i, ∆α − ∆ = ∑i(∆α,i − ∆i), where ∆α,i and ∆i denote the risk of the ith component of µ̂α and µ̂P, respectively. Then, ∆α,i − ∆i < 0 if

0 < αi < 2∆i / (∆i + (f*i − µi)²),   (4)

where f*i and µi denote the Fourier coefficients of f* and µP, respectively.

Figure 1: Geometric explanation of a shrinkage estimator when estimating a mean of a Gaussian distribution (left panel: uncorrelated isotropic Gaussian, X ∼ N(θ, I); right panel: correlated anisotropic Gaussian, X ∼ N(θ, Σ); in both panels θ̂ML = X is shrunk toward the target θ). For isotropic Gaussian, the level sets of the joint density of θ̂ML = X are hyperspheres. In this case, shrinkage has the same effect regardless of the direction. Shaded area represents those estimates that get closer to θ after shrinkage. For anisotropic Gaussian, the level sets are concentric ellipsoids, which makes the effect dependent on the direction of shrinkage.

The condition in (4) is a component-wise version of the condition given in [1, Theorem 1] for the class of estimators µ̂α := αf* + (1 − α)µ̂P, which may be expressed here by assuming that we have a constant shrinkage parameter αi = α for all i. Clearly, as the optimal range of αi may vary across coordinates, the class of estimators in [1] does not allow us to adjust αi accordingly. To understand why this property is important, let us consider the problem of estimating the mean of a Gaussian distribution illustrated in Figure 1. 
For correlated random variable X \u223c N (\u03b8, \u03a3), a natural choice\nof basis is the set of orthonormal eigenvectors which diagonalize the covariance matrix \u03a3 of X.\nClearly, the optimal range of \u03b1i depends on the corresponding eigenvalues. Allowing for different\nbasis (ei)i and shrinkage parameter \u03b1i opens up a wide range of strategies that can be used to\nconstruct \u201cbetter\u201d estimators.\n\nA natural strategy under this representation is as follows:\ni) we specify the ONB (ei)i and project\n\u02c6\u00b5P onto this basis. ii) we shrink each \u02c6\u00b5i independently according to a pre-de\ufb01ned shrinkage rule.\niii) the shrinkage estimate is reconstructed as a superposition of the resulting components. In other\nwords, an ideal shrinkage estimator can be de\ufb01ned formally as a non-linear mapping:\n\n\u02c6\u00b5P \u2212\u2192Xi\n\nh(\u03b1i)hf \u2217, eiiei +Xi\n\n(1 \u2212 h(\u03b1i))h\u02c6\u00b5P, eiiei\n\n(5)\n\nnPn\n\nwhere h : R \u2192 R is a shrinkage rule. Since we make no reference to any particular basis (ei)i, nor to\nany particular shrinkage rule h, a wide range of strategies can be adopted here. For example, we can\ni=1 xi and 1\u2212 h(\u03b1i) = 1/\u221a\u03b1i\nview whitening as a special case in which f \u2217 is the data average 1\nwhere \u03b1i and ei are the ith eigenvalue and eigenvector of the covariance matrix, respectively.\nInspired by Theorem 2, we adopt the spectral \ufb01ltering approach as one of the strategies to construct\nthe estimators of the form (5). To this end, owing to the regularization interpretation in (3), we\nconsider estimators of the formPn\ni=1 \u03b2ik(xi,\u00b7) for some \u03b2 \u2208 Rn\u2014looking for such an estimator\nis equivalent to learning a signed measure that is supported on (xi)n\ni=1 \u03b2ik(xi,\u00b7)\nis a minimizer of (3), \u03b2 should satisfy K\u03b2 = K1n where K is an n \u00d7 n Gram matrix and 1n =\n[1/n. . . . , 1/n]\u22a4. 
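For instance (a sketch of ours), this linear system can be solved with a Tikhonov-regularized inverse in place of K⁻¹, one instance of the regularized matrices discussed in this section; as λ → 0 the weights recover the empirical coefficients 1n:

```python
import numpy as np

def spectral_kmse_weights(K, lam):
    """beta(lam) = g_lam(K) K 1_n with the Tikhonov filter g_lam(K) = (K + lam*I)^{-1};
    as lam -> 0 (K invertible) this recovers the empirical coefficients 1_n."""
    n = K.shape[0]
    ones_n = np.full(n, 1.0 / n)                      # 1_n = [1/n, ..., 1/n]^T
    return np.linalg.solve(K + lam * np.eye(n), K @ ones_n)

X = np.array([0.0, 1.0, 2.0])                         # toy 1-D sample
K = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2)     # Gaussian RBF Gram matrix
beta = spectral_kmse_weights(K, lam=0.1)              # shrunk weights
beta0 = spectral_kmse_weights(K, lam=1e-12)           # ~ [1/3, 1/3, 1/3]
```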
Here the solution is trivially β = 1n, i.e., the coefficients of the standard estimator µ̂P, if K is invertible. Since K⁻¹ may not exist, and even if it exists its computation can be numerically unstable, the idea of spectral filtering—quite popular in the theory of inverse problems [15] and also used in kernel least squares [17]—is to replace K⁻¹ by a regularized matrix gλ(K) that approximates K⁻¹ as λ goes to zero. Note that, unlike in (3), the regularization is quite important here (i.e., for estimators of the form ∑i=1..n βik(xi, ·)), without which the linear system is underdetermined. Therefore, we propose the following class of estimators:

µ̂λ := ∑i=1..n βik(xi, ·)   with   β(λ) := gλ(K)K1n,   (6)

where gλ(·) is a filter function and λ is referred to as a shrinkage parameter. The matrix-valued function gλ(K) can be described by a scalar function gλ : [0, κ²] → R on the spectrum of K. That is, if K = UDU⊤ is the eigen-decomposition of K where D = diag(γ̃1, . . . , γ̃n), we have gλ(D) = diag(gλ(γ̃1), . . . , gλ(γ̃n)) and gλ(K) = Ugλ(D)U⊤. For example, the scalar filter function of Tikhonov regularization is gλ(γ) = 1/(γ + λ). In the sequel, we call this class of estimators a spectral kernel mean shrinkage estimator (Spectral-KMSE).

Table 1: Update equations for β and corresponding filter functions (a := K1n − Kβt−1).
Algorithm | Update Equation | Filter Function
L2 Boosting | βt ← βt−1 + ηa | g(γ) = η ∑i=0..t−1 (1 − ηγ)^i
Acc. L2 Boosting | βt ← βt−1 + ωt(βt−1 − βt−2) + (κt/n)a | g(γ) = pt(γ)
Iterated Tikhonov | (K + nλI)βi = 1n + nλβi−1 | g(γ) = ((γ + λ)^t − γ^t) / (λ(γ + λ)^t)
Truncated SVD | None | g(γ) = γ⁻¹ 1{γ ≥ λ}

Figure 2: Plot of g(γ)γ over γ ∈ [0, 1] for the Tikhonov, L2 Boosting, ν-method, iterated Tikhonov, and TSVD filters.

Proposition 3. The Spectral-KMSE satisfies µ̂λ = ∑i=1..n gλ(γ̃i)γ̃i⟨µ̂P, ṽi⟩ṽi, where (γ̃i, ṽi) are eigenvalue and eigenfunction pairs of the empirical covariance operator Ĉk : H → H defined as Ĉk = (1/n) ∑i=1..n k(·, xi) ⊗ k(·, xi).

By virtue of Proposition 3, if we choose 1 − h(γ̃) := gλ(γ̃)γ̃, the Spectral-KMSE is indeed in the form of (5) when f* = 0 and (ei)i is the kernel PCA (KPCA) basis, with the filter function gλ determining the shrinkage rule. Since by definition gλ(γ̃i) approaches the function 1/γ̃i as λ goes to 0, the function gλ(γ̃i)γ̃i approaches 1 (no shrinkage). As the value of λ increases, we have more shrinkage because the value of gλ(γ̃i)γ̃i deviates from 1, and the behavior of this deviation depends on the filter function gλ. For example, we can see that Proposition 3 generalizes Theorem 2 in [1], where the filter function is gλ(K) = (K + nλI)⁻¹, i.e., g(γ) = 1/(γ + λ). That is, we
That is, we\nhave g\u03bb(\u02dc\u03b3i)\u02dc\u03b3i = \u02dc\u03b3i/(\u02dc\u03b3i + \u03bb), implying that the effect of shrinkage is relatively larger in the low-\nvariance direction. In the following, we discuss well-known examples of spectral \ufb01ltering algorithms\nobtained by various choices of g\u03bb. Update equations for \u03b2(\u03bb) and corresponding \ufb01lter functions are\nsummarized in Table 1. Figure 2 illustrates the behavior of these \ufb01lter functions.\n\nL2 Boosting. This algorithm, also known as gradient descent or Landweber iteration, \ufb01nds a\nweight \u03b2 by performing a gradient descent iteratively. Thus, we can interpret early stopping as\nshrinkage and the reciprocal of iteration number as shrinkage parameter, i.e., \u03bb \u2248 1/t. The step-size\n\u03b7 does not play any role for shrinkage [16], so we use the \ufb01xed step-size \u03b7 = 1/\u03ba2 throughout.\n\nAccelerated L2 Boosting. This algorithm, also known as \u03bd-method, uses an accelerated gradient\ndescent step, which is faster than L2 Boosting because we only need \u221at iterations to get the same\nsolution as the L2 Boosting would get after t iterations. Consequently, we have \u03bb \u2248 1/t2.\nIterated Tikhonov. This algorithm can be viewed as a combination of Tikhonov regularization\nand gradient descent. Both parameters \u03bb and t play the role of shrinkage parameter.\n\nTruncated Singular Value Decomposition. This algorithm can be interpreted as a projection onto\nthe \ufb01rst principal components of the KPCA basis. Hence, we may interpret dimensionality reduction\nas shrinkage and the size of reduced dimension as shrinkage parameter. 
This approach has been used\nin [21] to improve the kernel mean estimation under the low-rank assumption.\n\nMost of the above spectral \ufb01ltering algorithms allow to compute the coef\ufb01cients \u03b2 without explicitly\ncomputing the eigen-decomposition of K, as we can see in Table 1, and some of which may have\nno natural interpretation in terms of regularized risk minimization. Lastly, an initialization of \u03b2\ncorresponds to the target of shrinkage. In this work, we assume that \u03b20 = 0 throughout.\n\n4 Theoretical properties of Spectral-KMSE\n\nThis section presents some theoretical properties for the proposed Spectral-KMSE in (6). To this\nend, we \ufb01rst present a regularization interpretation that is different from the one in (3) which involves\nlearning a smooth operator from H to H [22]. This will be helpful to investigate the consistency of\nthe Spectral-KMSE. Let us consider the following regularized risk minimization problem,\n\narg minF\u2208H\u2297H\n\nEX kk(X,\u00b7) \u2212 F[k(X,\u00b7)]k2\n\nH + \u03bbkFk2\n\nHS\n\n(7)\n\nwhere F is a Hilbert-Schmidt operator from H to H. Essentially, we are seeking a smooth operator\nF that maps k(x,\u00b7) to itself, where (7) is an instance of the regression framework in [22]. The\nformulation of shrinkage as the solution of a smooth operator regression, and the empirical solution\n(8) and in the lines below, were given in a personal communication by Arthur Gretton. It can be\n\n5\n\n\f\u03b3i\n\ni=1\n\nshown that the solution to (7) is given by F = Ck(Ck + \u03bbI)\u22121 where Ck : H \u2192 H is a covariance\noperator in H de\ufb01ned as Ck = R k(\u00b7, x) \u2297 k(\u00b7, x) dP(x) (see \u00a75 of the supplement for a proof).\nDe\ufb01ne \u00b5\u03bb := F\u00b5P = Ck(Ck + \u03bbI)\u22121\u00b5P. Since k is bounded, it is easy to verify that Ck is Hilbert-\nSchmidt and therefore compact. 
Hence by the Hilbert-Schmidt theorem, Ck =Pi \u03b3ih\u00b7, \u03c8ii\u03c8i where\n(\u03b3i)i\u2208N are the positive eigenvalues and (\u03c8i)i\u2208N are the corresponding eigenvectors that form an\nONB for the range space of Ck denoted as R(Ck). This implies \u00b5\u03bb can be decomposed as \u00b5\u03bb =\nP\u221e\n\u03b3i+\u03bbh\u00b5P, \u03c8ii\u03c8i. We can observe that the \ufb01lter function corresponding to the problem (7)\nis g\u03bb(\u03b3) = 1/(\u03b3 + \u03bb). By extending this approach to other \ufb01lter functions, we obtain \u00b5\u03bb =\nP\u221e\ni=1 \u03b3ig\u03bb(\u03b3i)h\u00b5P, \u03c8ii\u03c8i which is equivalent to \u00b5\u03bb = Ckg\u03bb(Ck)\u00b5P.\nSince Ck is a compact operator, the role of \ufb01lter function g\u03bb is to regularize the inverse of Ck.\nIn standard supervised setting, the explicit form of the solution is f\u03bb = g\u03bb(Lk)Lkf\u03c1 where Lk\nis the integral operator of kernel k acting in L2(X , \u03c1X ) and f\u03c1 is the expected solution given by\nf\u03c1(x) =RY y d\u03c1(y|x) [16]. It is interesting to see that \u00b5\u03bb admits a similar form to that of f\u03bb, but it is\nwritten in term of covariance operator Ck instead of the integral operator Lk. Moreover, the solution\nto (7) is also in a similar form to the regularized conditional embedding \u00b5Y |X = CY X (Ck + \u03bbI)\u22121\n[9]. This connection implies that the spectral \ufb01ltering may be applied more broadly to improve the\nestimation of conditional mean embedding, i.e., \u00b5Y |X = CY X g\u03bb(Ck).\nThe empirical counterpart of (7) is given by\n\narg min\n\nF\n\n1\nn\n\nnXi=1\n\nkk(xi,\u00b7) \u2212 F[k(xi,\u00b7)]k2\n\nH + \u03bbkFk2\n\nHS,\n\n(8)\n\n1\n\ni=1 k(xi,\u00b7) \u2297 k(xi,\u00b7) and \u02c6\u00b5P := 1\n\nK(K + \u03bbI)\u22121\u03a6 where \u03a6 = [k(x1,\u00b7), . . . 
, k(xn, ·)]⊤, resulting in µ̂λ = Fµ̂P = 1n⊤K(K + λI)⁻¹Φ, which matches with the one in (6) with gλ(K) = (K + λI)⁻¹. Note that this is exactly the F-KMSE proposed in [1]. Based on µλ, which depends on P, an empirical version of it can be obtained by replacing Ck and µP with their empirical estimators, leading to µ̃λ = Ĉkgλ(Ĉk)µ̂P. The following result shows that µ̂λ = µ̃λ, which means the Spectral-KMSE proposed in (6) is equivalent to solving (8).

Proposition 4. Let Ĉk and µ̂P be the sample counterparts of Ck and µP given by Ĉk := (1/n) ∑i=1..n k(xi, ·) ⊗ k(xi, ·) and µ̂P := (1/n) ∑i=1..n k(xi, ·), respectively. Then, we have that µ̃λ := Ĉkgλ(Ĉk)µ̂P = µ̂λ, where µ̂λ is defined in (6).

Having established a regularization interpretation for µ̂λ, it is of interest to study the consistency and convergence rate of µ̂λ similar to KMSE in Theorem 1. Our main goal here is to derive convergence rates for a broad class of algorithms given a set of sufficient conditions on the filter function gλ. We believe that for some algorithms it is possible to derive the best achievable bounds, which requires ad-hoc proofs for each algorithm. To this end, we provide a set of conditions that any admissible filter function gλ must satisfy.

Definition 1. 
A family of filter functions gλ : [0, κ²] → R, 0 < λ ≤ κ², is said to be admissible if there exist finite positive constants B, C, D, and η0 (all independent of λ) such that (C1) sup_{γ∈[0,κ²]} |γgλ(γ)| ≤ B, (C2) sup_{γ∈[0,κ²]} |rλ(γ)| ≤ C, and (C3) sup_{γ∈[0,κ²]} |rλ(γ)|γ^η ≤ Dλ^η, ∀ η ∈ (0, η0], hold, where rλ(γ) := 1 − γgλ(γ).

These conditions are quite standard in the theory of inverse problems [15, 23]. The constant η0 is called the qualification of gλ and is a crucial factor that determines the rate of convergence in inverse problems. As we will see below, the rate of convergence of µ̂λ depends on two factors: (a) the smoothness of µP, which is usually unknown as it depends on the unknown P, and (b) the qualification of gλ, which determines how well the smoothness of µP is captured by the spectral filter gλ.

Theorem 5. Suppose gλ is admissible in the sense of Definition 1. Let κ = sup_{x∈X} √k(x, x). If µP ∈ R(Ck^β) for some β > 0, then for any δ > 0, with probability at least 1 − 3e^{−δ},

‖µ̂λ − µP‖ ≤ (2κB + κB√(2δ))/√n + Dλ^{min{β,η0}}‖Ck^{−β}µP‖ + Cτ (2√2κ²√δ)^{min{1,β}} n^{−min{1/2,β/2}} ‖Ck^{−β}µP‖,

where R(A) denotes the range space of A and τ is some universal constant that does not depend on λ and n. 
Therefore, k\u02c6\u00b5\u03bb \u2212 \u00b5Pk = OP(n\u2212 min{1/2,\u03b2/2}) with \u03bb = o(n\u2212 min{1/2,\u03b2/2}\nmin{\u03b2,\u03b70 } ).\nTheorem 5 shows that the convergence rate depends on the smoothness of \u00b5P which is imposed\nthrough the range space condition that \u00b5P \u2208 R(C\u03b2\nk ) for some \u03b2 > 0. Note that this is in contrast\n\n6\n\n\fto the estimator in Theorem 1 which does not require any smoothness assumptions on \u00b5P. It can\nbe shown that the smoothness of \u00b5P increases with increase in \u03b2. This means, irrespective of the\nsmoothness of \u00b5P for \u03b2 > 1, the best possible convergence rate is n\u22121/2 which matches with that of\nKMSE in Theorem 1. While the quali\ufb01cation \u03b70 does not seem to directly affect the rates, it controls\nthe rate at which \u03bb converges to zero. For example, if g\u03bb(\u03b3) = 1/(\u03b3 + \u03bb) which corresponds to\nTikhonov regularization, it can be shown that \u03b70 = 1 which means for \u03b2 > 1, \u03bb = o(n\u22121/2)\nimplying that \u03bb cannot decay to zero slower than n\u22121/2. Ideally, one would require a larger \u03b70\n(preferably in\ufb01nity which is the case with truncated SVD) so that the convergence of \u03bb to zero can\nbe made arbitrarily slow if \u03b2 is large. This way, both \u03b2 and \u03b70 control the behavior of the estimator.\nIn fact, Theorem 5 provides a choice for \u03bb\u2014which is what we used in Theorem 1 to study the\nadmissibility of \u02c7\u00b5\u03bb to Pc,\u03b2\u2014to construct the Spectral-KMSE. However, this choice of \u03bb depends\non \u03b2 which is not known in practice (although \u03b70 is known as it is determined by the choice of g\u03bb).\nTherefore, \u03bb is usually learnt from data through cross-validation or through Lepski\u2019s method [24] for\nwhich guarantees similar to the one presented in Theorem 5 can be provided. 
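As a quick sanity check on the filter-admissibility conditions of Definition 1 (a numerical grid check of ours, not a proof), the Tikhonov filter gλ(γ) = 1/(γ + λ) satisfies (C1)–(C3) with B = C = D = 1 and qualification η0 = 1:

```python
import numpy as np

def tikhonov_checks(lam, kappa2=1.0, num=2001):
    """Grid check of (C1)-(C3) for g_lam(gamma) = 1/(gamma + lam) on [0, kappa2],
    with B = C = D = 1 and qualification eta0 = 1."""
    gammas = np.linspace(0.0, kappa2, num)
    g = 1.0 / (gammas + lam)
    r = 1.0 - gammas * g                         # residual r_lam(gamma) = lam/(gamma + lam)
    c1 = np.max(np.abs(gammas * g)) <= 1.0 + 1e-12          # (C1) with B = 1
    c2 = np.max(np.abs(r)) <= 1.0 + 1e-12                    # (C2) with C = 1
    c3 = all(np.max(np.abs(r) * gammas ** eta) <= lam ** eta + 1e-12
             for eta in (0.25, 0.5, 1.0))                    # (C3) with D = 1, eta in (0, 1]
    return c1, c2, c3
```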
However, irrespective of whether λ is chosen in a data-dependent or data-independent way, checking the admissibility of the Spectral-KMSE (similar to the analysis in Theorem 1) is very difficult, and we intend to consider it in future work.

5 Empirical studies

Synthetic data. Given an i.i.d. sample X = {x₁, x₂, …, x_n} from P with x_i ∈ ℝ^d, we evaluate different estimators using the loss function L(β, X, P) := ‖Σ_{i=1}^n β_i k(x_i, ·) − E_{x∼P}[k(x, ·)]‖²_H. The risk of an estimator is subsequently approximated by averaging over m independent copies of X. In this experiment, we set n = 50, d = 20, and m = 1000. Throughout, we use the Gaussian RBF kernel k(x, x′) = exp(−‖x − x′‖²/2σ²) whose bandwidth parameter is calculated using the median heuristic, i.e., σ² = median{‖x_i − x_j‖²}. To allow for an analytic calculation of the loss L(β, X, P), we assume that the distribution P is a d-dimensional mixture of Gaussians [1, 8]. Specifically, the data are generated as follows: x ∼ Σ_{i=1}^4 π_i N(θ_i, Σ_i) + ε, with θ_{ij} ∼ U(−10, 10), Σ_i ∼ W(3 × I_d, 7), and ε ∼ N(0, 0.2 × I_d), where U(a, b) and W(Σ₀, df) denote the uniform and Wishart distributions, respectively. As in [1], we set π = [0.05, 0.3, 0.4, 0.25].

A natural approach for choosing λ is cross-validation, which can be performed efficiently for iterative methods such as Landweber and accelerated Landweber. For these two algorithms, we evaluate the leave-one-out score and select β_t at the iteration t that minimizes this score (see, e.g., Figure 3(a)). Note that these methods have the built-in property of computing the whole regularization path efficiently.
Since each iteration of iterated Tikhonov is in fact equivalent to the F-KMSE, we set t = 3 for simplicity and use the efficient LOOCV procedure proposed in [1] to find λ at each iteration. Lastly, the truncation limit of TSVD can be identified efficiently by means of the generalized cross-validation (GCV) procedure [25]. To allow for an efficient calculation of the GCV score, we resort to the alternative loss function L(β) := ‖Kβ − K1_n‖².

Figure 3 reveals interesting aspects of the Spectral-KMSE. Firstly, as we can see in Figure 3(a), the number of iterations acts as a shrinkage parameter whose optimal value can be attained within just a few iterations. Moreover, these methods do not suffer from "over-shrinking" because λ → 0 as t → ∞. In other words, if the chosen t happens to be too large, the worst we can get is the standard empirical estimator. Secondly, Figure 3(b) demonstrates that both Landweber and accelerated Landweber are computationally more efficient than the F-KMSE. Lastly, Figure 3(c) suggests that the improvement of shrinkage estimators becomes increasingly pronounced in a high-dimensional setting. Interestingly, we can observe that most Spectral-KMSE algorithms outperform the S-KMSE, which supports our hypothesis on the importance of the geometric information of the RKHS mentioned in Section 3. In addition, although the TSVD still gains from shrinkage, the improvement is smaller than for the other algorithms. This highlights the importance of the filter functions and their associated parameters.

Figure 3: (a) For iterative algorithms, the number of iterations acts as shrinkage parameter. (b) The iterative algorithms such as Landweber and accelerated Landweber are more efficient than the F-KMSE. (c) A percentage of improvement w.r.t. the KME, i.e., 100 × (R − R_λ)/R, where R and R_λ denote the approximated risk of KME and KMSE, respectively. Most Spectral-KMSE algorithms outperform S-KMSE, which does not take into account the geometric information of the RKHS.

Real data. We apply the Spectral-KMSE to the density estimation problem via kernel mean matching [1, 26]. The datasets were taken from the UCI repository¹ and pre-processed by standardizing each feature. Then, we fit a mixture model Q = Σ_{j=1}^r π_j N(θ_j, σ_j² I) to the pre-processed dataset X := {x_i}_{i=1}^n by minimizing ‖μ_Q − μ̂_X‖² subject to the constraint Σ_{j=1}^r π_j = 1. Here μ_Q is the mean embedding of the mixture model Q and μ̂_X is the empirical mean embedding obtained from X.

¹http://archive.ics.uci.edu/ml/
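For the Gaussian RBF kernel the objective ‖μ_Q − μ̂_X‖² is available in closed form, since the RKHS inner product between the embeddings of two isotropic Gaussians is itself Gaussian-shaped (a point mass being the special case of zero variance). A hedged sketch of that computation; the helper names are ours, not from [1, 26]:

```python
import numpy as np

def rbf(x, y, sigma2):
    """Gaussian RBF kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma2))

def gauss_inner(m1, s1, m2, s2, sigma2):
    """<mu_N(m1, s1^2 I), mu_N(m2, s2^2 I)>_H for the Gaussian RBF kernel.
    A point mass is the special case s = 0, so this also gives <mu_N, k(x, .)>."""
    d = len(m1)
    v = sigma2 + s1 ** 2 + s2 ** 2
    return (sigma2 / v) ** (d / 2) * np.exp(-np.sum((m1 - m2) ** 2) / (2 * v))

def kmm_loss(pi, theta, s, X, beta, sigma2):
    """||mu_Q - hat mu_X||_H^2 for Q = sum_j pi_j N(theta_j, s_j^2 I)
    and hat mu_X = sum_i beta_i k(x_i, .)."""
    r, n = len(pi), len(X)
    qq = sum(pi[j] * pi[l] * gauss_inner(theta[j], s[j], theta[l], s[l], sigma2)
             for j in range(r) for l in range(r))
    qx = sum(pi[j] * beta[i] * gauss_inner(theta[j], s[j], X[i], 0.0, sigma2)
             for j in range(r) for i in range(n))
    xx = sum(beta[i] * beta[l] * rbf(X[i], X[l], sigma2)
             for i in range(n) for l in range(n))
    return qq - 2 * qx + xx
```

As a sanity check, placing point masses (s_j = 0) at the data points with π = β makes μ_Q equal to μ̂_X, so the loss vanishes; minimizing over (π, θ, σ) with different coefficient vectors β then yields the fitted mixtures compared below.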
Based on the different estimators of μ_X, we evaluate the resultant model Q by the negative log-likelihood score on the test data. The parameters (π_j, θ_j, σ_j²) are initialized by the best ones obtained from the K-means algorithm with 50 initializations. Throughout, we set r = 5 and use 25% of each dataset as a test set.

Table 2: The average negative log-likelihood evaluated on the test set. The results are obtained from 30 repetitions of the experiment. The boldface represents the statistically significant results.

Dataset      KME      S-KMSE   F-KMSE   Landweber  Acc Land  Iter Tik  TSVD
ionosphere   36.1769  36.1402  36.1554  36.1622    36.1204   36.1334   36.1442
glass        10.7855  10.7403  10.7541  10.7448    10.7099   10.9078   10.7791
bodyfat      18.1964  18.1158  18.1941  18.1810    18.1607   18.1267   18.1061
housing      14.3016  14.2195  14.1983  14.0409    14.2499   14.2868   14.3129
vowel        13.9253  13.8426  14.1368  13.8817    13.8337   13.8633   13.8375
svmguide2    28.1091  28.0546  27.9693  27.9640    28.1052   28.0417   28.1128
vehicle      18.5295  18.3693  18.3124  18.2547    18.4873   18.4128   18.3910
wine         16.7668  16.7548  16.6790  16.7457    16.7596   16.6954   16.5719
wdbc         35.1916  35.1814  35.1366  35.0023    35.1402   35.1881   35.1850

Table 2 reports the results on real data. In general, the mixture models Q obtained from the proposed shrinkage estimators tend to achieve a lower negative log-likelihood score than the one obtained from the standard empirical estimator. Moreover, we observe that the relative performance of the different filter functions varies across datasets, suggesting that, in addition to the potential gain from shrinkage, incorporating prior knowledge through the choice of filter function could lead to further improvement.

6 Conclusion

We have shown that several shrinkage strategies can be adopted to improve kernel mean estimation. This paper considers the spectral filtering approach as one such strategy.
Compared to previous work [1], our estimators take into account the specifics of kernel methods and meaningful prior knowledge through the choice of filter functions, resulting in a wider class of shrinkage estimators. The theoretical analysis also reveals a fundamental similarity to the standard supervised learning setting. Our estimators are simple to implement and work well in practice, as evidenced by the empirical results.

Acknowledgments

The first author thanks Ingo Steinwart for pointing out existing work along the lines of spectral filtering, and Arthur Gretton for suggesting the connection of shrinkage to the smooth operator framework. This work was carried out while the second author was a Research Fellow in the Statistical Laboratory, Department of Pure Mathematics and Mathematical Statistics at the University of Cambridge.

References

[1] K. Muandet, K. Fukumizu, B. Sriperumbudur, A. Gretton, and B. Schölkopf. "Kernel Mean Estimation and Stein Effect". In: ICML. 2014, pp. 10–18.
[2] B. Schölkopf, A. Smola, and K.-R. Müller. "Nonlinear Component Analysis as a Kernel Eigenvalue Problem". In: Neural Computation 10.5 (1998), pp. 1299–1319.
[3] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge, UK: Cambridge University Press, 2004.
[4] B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Cambridge, MA, USA: MIT Press, 2001.
[5] A. Berlinet and C. Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer Academic Publishers, 2004.
[6] A. Smola, A. Gretton, L. Song, and B. Schölkopf. "A Hilbert Space Embedding for Distributions". In: ALT. Springer-Verlag, 2007, pp. 13–31.
[7] A. Gretton, K. M. Borgwardt, M. Rasch, B. Schölkopf, and A. J. Smola. "A Kernel Method for the Two-Sample-Problem". In: NIPS. 2007.
[8] K. Muandet, K. Fukumizu, F. Dinuzzo, and B. Schölkopf. "Learning from Distributions via Support Measure Machines". In: NIPS. 2012, pp. 10–18.
[9] L. Song, J. Huang, A. Smola, and K. Fukumizu. "Hilbert Space Embeddings of Conditional Distributions with Applications to Dynamical Systems". In: ICML. 2009.
[10] K. Muandet, D. Balduzzi, and B. Schölkopf. "Domain Generalization via Invariant Feature Representation". In: ICML. 2013, pp. 10–18.
[11] K. Muandet and B. Schölkopf. "One-Class Support Measure Machines for Group Anomaly Detection". In: UAI. AUAI Press, 2013, pp. 449–458.
[12] B. K. Sriperumbudur, A. Gretton, K. Fukumizu, B. Schölkopf, and G. R. G. Lanckriet. "Hilbert Space Embeddings and Metrics on Probability Measures". In: JMLR 11 (2010), pp. 1517–1561.
[13] K. Fukumizu, L. Song, and A. Gretton. "Kernel Bayes' Rule: Bayesian Inference with Positive Definite Kernels". In: JMLR 14 (2013), pp. 3753–3783.
[14] C. M. Stein. "Estimation of the Mean of a Multivariate Normal Distribution". In: The Annals of Statistics 9.6 (1981), pp. 1135–1151.
[15] H. W. Engl, M. Hanke, and A. Neubauer. Regularization of Inverse Problems. Vol. 375. Mathematics and its Applications. Dordrecht: Kluwer Academic Publishers Group, 1996.
[16] E. D. Vito, L. Rosasco, and A. Verri. Spectral Methods for Regularization in Learning Theory. 2006.
[17] E. D. Vito, L. Rosasco, A. Caponnetto, U. D. Giovannini, and F. Odone. "Learning from Examples as an Inverse Problem". In: JMLR 6 (2005), pp. 883–904.
[18] L. Baldassarre, L. Rosasco, A. Barla, and A. Verri. "Vector Field Learning via Spectral Filtering". In: ECML/PKDD (1). Vol. 6321. Lecture Notes in Computer Science. Springer, 2010, pp. 56–71.
[19] J. Kim and C. D. Scott. "Robust Kernel Density Estimation". In: JMLR 13 (2012), pp. 2529–2565.
[20] I. Steinwart and A. Christmann. Support Vector Machines. New York: Springer, 2008.
[21] L. Song and B. Dai. "Robust Low Rank Kernel Embeddings of Multivariate Distributions". In: NIPS. 2013, pp. 3228–3236.
[22] S. Grünewälder, A. Gretton, and J. Shawe-Taylor. "Smooth Operators". In: ICML. Vol. 28. 2013, pp. 1184–1192.
[23] L. Lo Gerfo, L. Rosasco, F. Odone, E. D. Vito, and A. Verri. "Spectral Algorithms for Supervised Learning". In: Neural Computation 20.7 (2008), pp. 1873–1897.
[24] O. V. Lepski, E. Mammen, and V. G. Spokoiny. "Optimal Spatial Adaptation to Inhomogeneous Smoothness: An Approach Based on Kernel Estimates with Variable Bandwidth Selectors". In: Annals of Statistics 25 (1997), pp. 929–947.
[25] G. Golub, M. Heath, and G. Wahba. "Generalized Cross-Validation as a Method for Choosing a Good Ridge Parameter". In: Technometrics 21 (1979), pp. 215–223.
[26] L. Song, X. Zhang, A. Smola, A. Gretton, and B. Schölkopf. "Tailoring Density Estimation via Reproducing Kernel Moment Matching". In: ICML. 2008, pp. 992–999.