{"title": "Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions", "book": "Advances in Neural Information Processing Systems", "page_first": 1750, "page_last": 1758, "abstract": "", "full_text": "Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions\r\n\r\nBharath K. Sriperumbudur, Department of ECE, UC San Diego, La Jolla, USA, bharathsv@ucsd.edu\r\n\r\nKenji Fukumizu, The Institute of Statistical Mathematics, Tokyo, Japan, fukumizu@ism.ac.jp\r\n\r\nArthur Gretton, Carnegie Mellon University, MPI for Biological Cybernetics, arthur.gretton@gmail.com\r\n\r\nGert R. G. Lanckriet, Department of ECE, UC San Diego, La Jolla, USA, gert@ece.ucsd.edu\r\n\r\nBernhard Schölkopf, MPI for Biological Cybernetics, Tübingen, Germany, bs@tuebingen.mpg.de\r\n\r\nAbstract\r\nEmbeddings of probability measures into reproducing kernel Hilbert spaces have been proposed as a straightforward and practical means of representing and comparing probabilities. In particular, the distance between embeddings (the maximum mean discrepancy, or MMD) has several key advantages over many classical metrics on distributions, namely easy computability, fast convergence and low bias of finite sample estimates. An important requirement of the embedding RKHS is that it be characteristic: in this case, the MMD between two distributions is zero if and only if the distributions coincide. Three new results on the MMD are introduced in the present study. First, it is established that MMD corresponds to the optimal risk of a kernel classifier, thus forming a natural link between the distance between distributions and their ease of classification. An important consequence is that a kernel must be characteristic to guarantee classifiability between distributions in the RKHS. Second, the class of characteristic kernels is broadened to incorporate all strictly positive definite kernels: these include non-translation invariant kernels and kernels on non-compact domains. 
Third, a generalization of the MMD is proposed for families of kernels, as the supremum over MMDs on a class of kernels (for instance the Gaussian kernels with different bandwidths). This extension is necessary to obtain a single distance measure if a large selection or class of characteristic kernels is potentially appropriate. This generalization is reasonable, given that it corresponds to the problem of learning the kernel by minimizing the risk of the corresponding kernel classifier. The generalized MMD is shown to have consistent finite sample estimates, and its performance is demonstrated on a homogeneity testing example.\r\n\r\n1 Introduction\r\n\r\nKernel methods are broadly established as a useful way of constructing nonlinear algorithms from linear ones, by embedding points into higher dimensional reproducing kernel Hilbert spaces (RKHSs) [9]. A generalization of this idea is to embed probability distributions into RKHSs, giving us a linear method for dealing with higher order statistics [6, 12, 14]. More specifically, suppose we are given the set 𝒫 of all Borel probability measures defined on the topological space M, and the RKHS (H, k) of functions on M with k as its reproducing kernel (r.k.). For P ∈ 𝒫, denote by Pk := ∫_M k(·, x) dP(x). If k is measurable and bounded, then we may define the embedding of P in H as Pk ∈ H. The RKHS distance between two such mappings associated with P, Q ∈ 𝒫 is called the maximum mean discrepancy (MMD) [6, 14], and is written\r\n\r\nγ_k(P, Q) = ‖Pk - Qk‖_H. (1)\r\n\r\nWe say that k is characteristic [4, 14] if the mapping P ↦ Pk is injective, in which case (1) is zero if and only if P = Q, i.e., γ_k is a metric on 𝒫. An immediate application of the MMD is to problems of comparing distributions based on finite samples: examples include tests of homogeneity [6], independence [7], and conditional independence [4]. 
In this application domain, the question of whether k is characteristic is key: without this property, the algorithms can fail through inability to distinguish between particular distributions.\r\n\r\nCharacteristic kernels are important in binary classification: The problem of distinguishing distributions is strongly related to binary classification: indeed, one would expect easily distinguishable distributions to be easily classifiable (see footnote 1). The link between these two problems is especially direct in the case of the MMD: in Section 2, we show that γ_k is the negative of the optimal risk (corresponding to a linear loss function) associated with the Parzen window classifier [9, 11] (also called kernel classification rule [3, Chapter 10]), where the Parzen window turns out to be k. We also show that γ_k is an upper bound on the margin of a hard-margin support vector machine (SVM). The importance of using characteristic RKHSs is further underlined by this link: if the property does not hold, then there exist distributions that are unclassifiable in the RKHS H. We further strengthen this by showing that characteristic kernels are necessary (and sufficient under certain conditions) to achieve Bayes risk in the kernel-based classification algorithms.\r\n\r\nCharacterization of characteristic kernels: Given the centrality of the characteristic property to both RKHS classification and RKHS distribution testing, we should take particular care in establishing which kernels satisfy this requirement. Early results in this direction include [6], where k is shown to be characteristic on compact M if it is universal in the sense of Steinwart [15, Definition 4]; and [4, 5], which address the case of non-compact M, and show that k is characteristic if and only if H + ℝ is dense in the Banach space of p-power (p ≥ 1) integrable functions. The conditions in both these studies can be difficult to check and interpret, however, and the restriction of the first to compact M is limiting. 
In the case of translation invariant kernels, [14] proved the kernel to be characteristic if and only if the support of the Fourier transform of k is the entire ℝ^d, which is a much easier condition to verify. Similar sufficient conditions are obtained by [5] for translation invariant kernels on groups and semi-groups. In Section 3, we expand the class of characteristic kernels to include kernels that may or may not be translation invariant, with the introduction of a novel criterion: strictly positive definite kernels (see Definition 3) on M are characteristic.\r\n\r\nChoice of characteristic kernels: In expanding the families of allowable characteristic kernels, we have so far neglected the question of which characteristic kernel to choose. A practitioner asking by how much two samples differ does not want to receive a blizzard of answers for every conceivable kernel and bandwidth setting, but a single measure that satisfies some \"reasonable\" notion of distance across the family of kernels considered. Thus, in Section 4, we propose a generalization of the MMD, yielding a new distance measure between P and Q defined as\r\n\r\nγ(P, Q) = sup{γ_k(P, Q) : k ∈ K} = sup{‖Pk - Qk‖_H : k ∈ K}, (2)\r\n\r\nwhich is the maximal RKHS distance between P and Q over a family, K, of positive definite kernels. For example, K can be the family of Gaussian kernels on ℝ^d indexed by the bandwidth parameter. This distance measure is very natural in the light of our results on binary classification (in Section 2): most directly, this corresponds to the problem of learning the kernel by minimizing the risk of the associated Parzen-based classifier. As a less direct justification, we also increase the upper bound on the margin allowed for a hard margin SVM between the samples. To apply the generalized MMD in practice, we must ensure its empirical estimator is consistent. 
In our main result of Section 4, we provide an empirical estimate of γ(P, Q) based on finite samples, and show that many popular kernels like the Gaussian, Laplacian, and the entire Matérn class on ℝ^d yield consistent estimates of γ(P, Q). The proof is based on bounding the Rademacher chaos complexity of K, which can be understood as the U-process equivalent of Rademacher complexity [2]. Finally, in Section 5, we provide a simple experimental demonstration that the generalized MMD can be applied in practice to the problem of homogeneity testing. Specifically, we show that when two distributions differ on particular length scales, the kernel selected by the generalized MMD is appropriate to this difference, and the resulting hypothesis test outperforms the heuristic kernel choice employed in earlier studies [6]. The proofs of the results in Sections 2-4 are provided in the supplementary material.\r\n\r\nFootnote 1: There is a subtlety here, since unlike the problem of testing for differences in distributions, classification suffers from slow learning rates. See [3, Chapter 7] for details.\r\n\r\n2 Characteristic Kernels and Binary Classification\r\n\r\nOne of the most important applications of the maximum mean discrepancy is in nonparametric hypothesis testing [6, 7, 4], where the characteristic property of k is required to distinguish between probability measures. In the following, we show how MMD naturally appears in binary classification, with reference to the Parzen window classifier and hard-margin SVM. This motivates the need for characteristic k to guarantee that classes arising from different distributions can be classified by kernel-based algorithms. To this end, let us consider the binary classification problem with X being an M-valued random variable, Y being a {-1, +1}-valued random variable and the product space, M × {-1, +1}, being endowed with an induced Borel probability measure μ. 
A discriminant function, f, is a real valued measurable function on M, whose sign is used to make a classification decision. Given a loss function L : {-1, +1} × ℝ → ℝ, the goal is to choose an f that minimizes the risk associated with L, with the optimal L-risk being defined as\r\n\r\nR^L_F = inf_{f∈F} ∫_M L(y, f(x)) dμ(x, y) = inf_{f∈F} { ε ∫_M L_1(f) dP + (1 - ε) ∫_M L_{-1}(f) dQ }, (3)\r\n\r\nwhere F is the set of all measurable functions on M, L_1(α) := L(1, α), L_{-1}(α) := L(-1, α), P(X) := μ(X|Y = +1), Q(X) := μ(X|Y = -1), ε := μ(M, Y = +1). Here, P and Q represent the class-conditional distributions and ε is the prior distribution of class +1. Now, we present the result that relates γ_k to the optimal risk associated with the Parzen window classifier.\r\n\r\nTheorem 1 (γ_k and Parzen classification). Let L_1(α) = -α/ε and L_{-1}(α) = α/(1 - ε). Then, γ_k(P, Q) = -R^L_{F_k}, where F_k = {f : ‖f‖_H ≤ 1} and H is an RKHS with a measurable and bounded k. Suppose {(X_i, Y_i)}_{i=1}^N, X_i ∈ M, Y_i ∈ {-1, +1}, ∀ i, is a training sample drawn i.i.d. from μ and m = |{i : Y_i = 1}|. If f ∈ F_k is an empirical minimizer of (3) (where F is replaced by F_k in (3)), then\r\n\r\nsign(f(x)) = 1 if (1/m) Σ_{Y_i=1} k(x, X_i) > (1/(N - m)) Σ_{Y_i=-1} k(x, X_i), and sign(f(x)) = -1 if (1/m) Σ_{Y_i=1} k(x, X_i) ≤ (1/(N - m)) Σ_{Y_i=-1} k(x, X_i), (4)\r\n\r\nwhich is the Parzen window classifier.\r\n\r\nWith these losses, the objective in (3) reduces to ∫_M f dQ - ∫_M f dP, so minimizing it over the unit ball F_k gives -γ_k(P, Q). Theorem 1 thus shows that γ_k is the negative of the optimal L-risk (where L is the linear loss as defined in Theorem 1) associated with the Parzen window classifier. Therefore, if k is not characteristic, which means γ_k(P, Q) = 0 for some P ≠ Q, then R^L_{F_k} = 0, i.e., the risk is maximum (note that since 0 ≤ γ_k(P, Q) = -R^L_{F_k}, the maximum risk is zero). In other words, if k is characteristic, then the maximum risk is obtained only when P = Q. This motivates the importance of characteristic kernels in binary classification. 
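As a rough illustration of rule (4), the following Python sketch implements a Parzen window classifier on the real line. The Gaussian kernel, the bandwidth value and the toy samples are assumptions made for this example only; they are not choices taken from the paper.

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    # Assumed kernel for the illustration: k(x, y) = exp(-sigma * (x - y)^2)
    return math.exp(-sigma * (x - y) ** 2)

def parzen_classify(x, positives, negatives, kernel=gaussian_kernel):
    # Rule (4): compare the average kernel similarity of x to each class.
    pos = sum(kernel(x, xi) for xi in positives) / len(positives)
    neg = sum(kernel(x, xi) for xi in negatives) / len(negatives)
    # Ties are assigned to class -1, matching the non-strict case in (4).
    return 1 if pos > neg else -1

# Hypothetical training points for the two classes
positives = [1.8, 2.1, 2.4]
negatives = [-2.2, -1.9, -1.6]
labels = [parzen_classify(x, positives, negatives) for x in (2.0, -2.0)]
```

The decision depends on the data only through the two class means of k(x, .), which is why the classifier is tied to the distance between the mean embeddings.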
In the following, we provide another result which provides a similar motivation for the importance of characteristic kernels in binary classification, wherein we relate γ_k to the margin of a hard-margin SVM.\r\n\r\nTheorem 2 (γ_k and hard-margin SVM). Suppose {(X_i, Y_i)}_{i=1}^N, X_i ∈ M, Y_i ∈ {-1, +1}, ∀ i, is a training sample drawn i.i.d. from μ. Assuming the training sample is separable, let f_svm be the solution to the program, inf{‖f‖_H : Y_i f(X_i) ≥ 1, ∀ i}, where H is an RKHS with measurable and bounded k. If k is characteristic, then\r\n\r\n1/‖f_svm‖_H ≤ γ_k(P_m, Q_n)/2, (5)\r\n\r\nwhere P_m := (1/m) Σ_{Y_i=1} δ_{X_i}, Q_n := (1/n) Σ_{Y_i=-1} δ_{X_i}, m = |{i : Y_i = 1}| and n = N - m. δ_x represents the Dirac measure at x.\r\n\r\nTheorem 2 provides a bound on the margin of hard-margin SVM in terms of MMD. (5) shows that a smaller MMD between P_m and Q_n enforces a smaller margin (i.e., a less smooth classifier, f_svm, where smoothness is measured as ‖f_svm‖_H). We can observe that the bound in (5) may be loose if the number of support vectors is small. If k is not characteristic, then γ_k(P_m, Q_n) can be zero for P_m ≠ Q_n and therefore the margin is zero, which means even unlike distributions can become inseparable in this feature representation. Another justification of using characteristic kernels in kernel-based classification algorithms can be provided by studying the conditions on H for which the Bayes risk is realized for all μ. Steinwart and Christmann [16, Corollary 5.37] have shown that under certain conditions on L, the Bayes risk is achieved for all μ if and only if H is dense in L_p(M, η) for all η, where η = εP + (1 - ε)Q. Here, L_p(M, η) represents the Banach space of p-power integrable functions, where p ∈ [1, ∞) is dependent on the loss function, L. Denseness of H in L_p(M, η) implies H + ℝ is dense in L_p(M, η), which therefore yields that k is characteristic [4, 5]. 
On the other hand, if constant functions are included in H, then it is easy to show that the characteristic property of k is also sufficient to achieve the Bayes risk. As an example, it can be shown that characteristic kernels are necessary (and sufficient if constant functions are in H) for SVMs to achieve the Bayes risk [16, Example 5.40]. Therefore, the characteristic property of k is fundamental in kernel-based classification algorithms. Having shown how characteristic kernels play a role in kernel-based classification, in the following section, we provide a novel characterization for them.\r\n\r\n3 Novel Characterization for Characteristic Kernels\r\n\r\nA positive definite (pd) kernel, k, is said to be characteristic to 𝒫 if and only if γ_k(P, Q) = 0 ⇔ P = Q, ∀ P, Q ∈ 𝒫. The following result provides a novel characterization for characteristic kernels, which shows that strictly pd kernels are characteristic to 𝒫. An advantage of this characterization is that it holds for any arbitrary topological space M, unlike the earlier characterizations where a group structure on M is assumed [14, 5]. First, we define strictly pd kernels as follows.\r\n\r\nDefinition 3 (Strictly positive definite kernels). Let M be a topological space. A measurable and bounded kernel, k, is said to be strictly positive definite if and only if ∫_M ∫_M k(x, y) dμ(x) dμ(y) > 0 for all finite non-zero signed Borel measures, μ, defined on M.\r\n\r\nNote that the above definition is not equivalent to the usual definition of strictly pd kernels that involves finite sums [16, Definition 4.15]. The above definition is a generalization of integrally strictly positive definite functions [17, Section 6]: ∫∫ k(x, y) f(x) f(y) dx dy > 0 for all f ∈ L_2(ℝ^d), which is the strictly positive definiteness of the integral operator given by the kernel. Definition 3 is stronger than the finite sum definition as [16, Theorem 4.62] shows a kernel that is strictly pd in the finite sum sense but not in the integral sense. 
\r\n\r\nTheorem 4 (Strictly pd kernels are characteristic). If k is strictly positive definite on M, then k is characteristic to 𝒫.\r\n\r\nThe proof idea is to derive necessary and sufficient conditions for a kernel not to be characteristic. We show that choosing k to be strictly pd violates these conditions and k is therefore characteristic to 𝒫. Examples of strictly pd kernels on ℝ^d include exp(-σ‖x - y‖_2^2), σ > 0; exp(-σ‖x - y‖_1), σ > 0; (c^2 + ‖x - y‖_2^2)^{-β}, β > 0, c > 0; B_{2l+1}-splines etc. Note that k̃(x, y) = f(x) k(x, y) f(y) is a strictly pd kernel if k is strictly pd, where f : M → ℝ is a bounded continuous function. Therefore, translation-variant strictly pd kernels can be obtained by choosing k to be a translation invariant strictly pd kernel. A simple example of a translation-variant kernel that is a strictly pd kernel on compact sets of ℝ^d is k̃(x, y) = exp(σ x^T y), σ > 0, where we have chosen f(·) = exp(σ‖·‖_2^2/2) and k(x, y) = exp(-σ‖x - y‖_2^2/2), σ > 0. Therefore, k̃ is characteristic on compact sets of ℝ^d, which is the same result that follows from the universality of k̃ [15, Section 3, Example 1]. The following result in [10], which is based on the usual definition of strictly pd kernels, can be obtained as a corollary to Theorem 4.\r\n\r\nCorollary 5 ([10]). Let X = {x_i}_{i=1}^m ⊂ M, Y = {y_j}_{j=1}^n ⊂ M and assume that x_i ≠ x_j, y_i ≠ y_j, ∀ i ≠ j. Suppose k is strictly positive definite. Then Σ_{i=1}^m α_i k(·, x_i) = Σ_{j=1}^n β_j k(·, y_j) for some α_i, β_j ∈ ℝ\\{0} ⇒ X = Y.\r\n\r\nSuppose we choose α_i = 1/m, ∀ i and β_j = 1/n, ∀ j in Corollary 5. Then Σ_{i=1}^m α_i k(·, x_i) and Σ_{j=1}^n β_j k(·, y_j) represent the mean functions in H. Note that the Parzen classifier in (4) is a mean classifier (that separates the mean functions) in H, i.e., sign(⟨k(·, x), w⟩_H), where w = (1/m) Σ_{i=1}^m k(·, x_i) - (1/n) Σ_{i=1}^n k(·, y_i). Suppose k is strictly pd (more generally, suppose k is characteristic). 
Then, by Corollary 5, the normal vector, w, to the hyperplane in H passing through the origin is zero, i.e., the mean functions coincide (and are therefore not classifiable) if and only if X = Y.\r\n\r\n4 Generalizing the MMD for Classes of Characteristic Kernels\r\n\r\nThe discussion so far has been related to the characteristic property of k that makes γ_k a metric on 𝒫. We have seen that this characteristic property is of prime importance both in distribution testing, and to ensure classifiability of dissimilar distributions in the RKHS. We have not yet addressed how to choose among a selection/family of characteristic kernels, given a particular pair of distributions we wish to discriminate between. We introduce one approach to this problem in the present section. Let M = ℝ^d and k_σ(x, y) = exp(-σ‖x - y‖_2^2), σ ∈ ℝ_+, where σ represents the bandwidth parameter. {k_σ : σ ∈ ℝ_+} is the family of Gaussian kernels and {γ_{k_σ} : σ ∈ ℝ_+} is the family of MMDs indexed by the kernel parameter, σ. Note that k_σ is characteristic for any σ ∈ ℝ_{++} and therefore γ_{k_σ} is a metric on 𝒫 for any σ ∈ ℝ_{++}. However, in practice, one would prefer a single number that defines the distance between P and Q. The question therefore to be addressed is how to choose an appropriate σ. The choice of σ has important implications for the statistical behavior of γ_{k_σ}. Note that as σ → 0, k_σ → 1 and as σ → ∞, k_σ → 0 a.e., which means γ_{k_σ}(P, Q) → 0 as σ → 0 or σ → ∞ for all P, Q ∈ 𝒫 (this behavior is also exhibited by k_σ(x, y) = exp(-σ‖x - y‖_1) and k_σ(x, y) = σ^2/(σ^2 + ‖x - y‖_2^2), which are also characteristic). This means choosing sufficiently small or sufficiently large σ (depending on P and Q) makes γ_{k_σ}(P, Q) arbitrarily small. Therefore, σ has to be chosen appropriately in applications to effectively distinguish between P and Q. Presently, the applications involving MMD set σ heuristically [6, 7]. 
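The vanishing of the MMD at both bandwidth extremes can be checked in closed form in a simple case. The sketch below assumes P = N(0, s2) and Q = N(delta, s2) on the real line with the Gaussian kernel exp(-sigma (x - y)^2). Using the Gaussian integral E exp(-sigma Z^2) = (1 + 2 sigma v)^(-1/2) exp(-sigma d^2 / (1 + 2 sigma v)) for Z ~ N(d, v), the squared MMD becomes 2 (1 + 4 sigma s2)^(-1/2) (1 - exp(-sigma delta^2 / (1 + 4 sigma s2))). The function name, the mean shift and the bandwidth grid are hypothetical choices for the illustration.

```python
import math

def gaussian_mmd(sigma, delta=1.0, s2=1.0):
    # Population MMD between N(0, s2) and N(delta, s2) on the real line
    # for the kernel k(x, y) = exp(-sigma * (x - y)**2).
    scale = 1.0 + 4.0 * sigma * s2
    squared = 2.0 / math.sqrt(scale) * (1.0 - math.exp(-sigma * delta ** 2 / scale))
    return math.sqrt(squared)

# Bandwidths on a base-2 logarithmic grid, from very small to very large
sigmas = [2.0 ** e for e in range(-20, 21)]
values = [gaussian_mmd(s) for s in sigmas]

# The bandwidth on the grid at which the two Gaussians are best separated
best_sigma = max(sigmas, key=gaussian_mmd)
```

In this toy case the MMD is small at both ends of the grid and peaks at an intermediate bandwidth, so a poorly chosen bandwidth can make two clearly different distributions look nearly identical.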
To generalize the MMD to families of kernels, we propose the following modification to γ_k, which yields a pseudometric on 𝒫,\r\n\r\nγ(P, Q) = sup{γ_k(P, Q) : k ∈ K} = sup{‖Pk - Qk‖_H : k ∈ K}. (6)\r\n\r\nNote that γ is the maximal RKHS distance between P and Q over a family, K, of positive definite kernels. It is easy to check that if any k ∈ K is characteristic, then γ is a metric on 𝒫. Examples for K include: K_g := {e^{-σ‖x-y‖_2^2}, x, y ∈ ℝ^d : σ ∈ ℝ_+}; K_l := {e^{-σ‖x-y‖_1}, x, y ∈ ℝ^d : σ ∈ ℝ_+}; K_ψ := {e^{-σψ(x,y)}, x, y ∈ M : σ ∈ ℝ_+}, where ψ : M × M → ℝ is a negative definite kernel; K_rbf := {∫_0^∞ e^{-λ‖x-y‖_2^2} dμ_σ(λ), x, y ∈ ℝ^d, μ_σ ∈ M^+ : σ ∈ Σ ⊂ ℝ^d}, where M^+ is the set of all finite nonnegative Borel measures, μ_σ, on ℝ_+ that are not concentrated at zero, etc. The proposal of γ(P, Q) in (6) can be motivated by the connection that we have established in Section 2 between γ_k and the Parzen window classifier. Since the Parzen window classifier depends on the kernel, k, one can propose to learn the kernel as in support vector machines [8], wherein the kernel is chosen such that R^L_{F_k} in Theorem 1 is minimized over k ∈ K, i.e., inf_{k∈K} R^L_{F_k} = -sup_{k∈K} γ_k(P, Q) = -γ(P, Q). A similar motivation for γ can be provided based on (5) as learning the kernel in a hard-margin SVM by maximizing its margin.\r\n\r\nAt this point, we briefly discuss the issue of normalized vs. unnormalized kernel families, K, in (6). We say a translation-invariant kernel, k, on ℝ^d is normalized if ∫_M ψ(y) dy = c (some positive constant independent of the kernel parameter), where k(x, y) = ψ(x - y). K is a normalized kernel family if every kernel in K is normalized. If K is not normalized, we say it is unnormalized. For example, it is easy to see that K_g and K_l are unnormalized kernel families. Let us consider the normalized Gaussian family, K_g^n = {(σ/π)^{d/2} e^{-σ‖x-y‖_2^2}, x, y ∈ ℝ^d : σ ∈ [σ_0, ∞)}. It can be shown that for any k_σ, k_τ ∈ K_g^n, 0 < σ < τ < ∞, we have γ_{k_σ}(P, Q) ≥ γ_{k_τ}(P, Q), which means γ(P, Q) = γ_{σ_0}(P, Q). Therefore, the generalized MMD reduces to a single kernel MMD. 
A similar result also holds for the normalized inverse-quadratic kernel family, {√(2σ^2/π) (σ^2 + ‖x - y‖_2^2)^{-1}, x, y ∈ ℝ : σ ∈ [σ_0, ∞)}. These examples show that the generalized MMD definition is usually not very useful if K is a normalized kernel family. In addition, σ_0 should be chosen beforehand, which is equivalent to heuristically setting the kernel parameter in γ_k. Note that σ_0 cannot be zero because in the limiting case of σ → 0, the kernels approach a Dirac distribution, which means the limiting kernel is not bounded and therefore the definition of MMD in (1) does not hold. So, in this work, we consider unnormalized kernel families to render the definition of generalized MMD in (6) useful.\r\n\r\nTo use γ in statistical applications where P and Q are known only through i.i.d. samples {X_i}_{i=1}^m and {Y_i}_{i=1}^n respectively, we require its estimator γ(P_m, Q_n) to be consistent, where P_m and Q_n represent the empirical measures based on {X_i}_{i=1}^m and {Y_j}_{j=1}^n. For k measurable and bounded, [6, 12] have shown that γ_k(P_m, Q_n) is a √(mn/(m + n))-consistent estimator of γ_k(P, Q). The statistical consistency of γ(P_m, Q_n) is established in the following theorem, which uses tools from U-process theory [2, Chapters 3,5]. We begin with the following definition.\r\n\r\nDefinition 6 (Rademacher chaos). Let G be a class of functions on M × M and {ρ_i}_{i=1}^n be independent Rademacher random variables, i.e., Pr(ρ_i = 1) = Pr(ρ_i = -1) = 1/2. The homogeneous Rademacher chaos process of order two with respect to {ρ_i}_{i=1}^n is defined as {n^{-1} Σ_{i<j}^n ρ_i ρ_j g(x_i, x_j) : g ∈ G} for some {x_i}_{i=1}^n ⊂ M. The Rademacher chaos complexity over G is defined as\r\n\r\nU_n(G; {x_i}_{i=1}^n) := E_ρ sup_{g∈G} |(1/n) Σ_{i<j}^n ρ_i ρ_j g(x_i, x_j)|. (7)\r\n\r\nWe now provide the main result of the present section.\r\n\r\nTheorem 7 (Consistency of γ(P_m, Q_n)). Let every k ∈ K be measurable and bounded with ν := sup_{k∈K, x∈M} k(x, x) < ∞. 
Then, with probability at least 1 - δ, |γ(P_m, Q_n) - γ(P, Q)| ≤ A, where\r\n\r\nA = √(16 U_m(K; {X_i})/m) + √(16 U_n(K; {Y_i})/n) + (√(8ν) + √(36ν log(4/δ))) √(m + n)/√(mn). (8)\r\n\r\nFrom (8), it is clear that if U_m(K; {X_i}) = O_P(1) and U_n(K; {Y_i}) = O_Q(1), then γ(P_m, Q_n) → γ(P, Q) almost surely. The following result provides a bound on U_m(K; {X_i}) in terms of the entropy integral.\r\n\r\nLemma 8 (Entropy bound). For any K as in Theorem 7 with 0 ∈ K, there exists a universal constant C such that\r\n\r\nU_m(K; {X_i}_{i=1}^m) ≤ C ∫_0^ν log N(K, D, ε) dε, (9)\r\n\r\nwhere D(k_1, k_2) = (1/m) [Σ_{i<j}^m (k_1(X_i, X_j) - k_2(X_i, X_j))^2]^{1/2}. N(K, D, ε) represents the ε-covering number of K with respect to the metric D.\r\n\r\nAssuming K to be a VC-subgraph class, the following result, as a corollary to Lemma 8, provides an estimate of U_m(K; {X_i}_{i=1}^m). Before presenting the result, we first provide the definition of a VC-subgraph class.\r\n\r\nDefinition 9 (VC-subgraph class). The subgraph of a function g : M → ℝ is the subset of M × ℝ given by {(x, t) : t < g(x)}. A collection G of measurable functions on a sample space is called a VC-subgraph class, if the collection of all subgraphs of the functions in G forms a VC-class of sets (in M × ℝ).\r\n\r\nThe VC-index (also called the VC-dimension) of a VC-subgraph class, G, is the same as the pseudodimension of G. See [1, Definition 11.1] for details.\r\n\r\nCorollary 10 (U_m(K; {X_i}) for VC-subgraph, K). Suppose K is a VC-subgraph class with V(K) being the VC-index. Assume K satisfies the conditions in Theorem 7 and 0 ∈ K. Then\r\n\r\nU_m(K; {X_i}) ≤ Cν log(C_1 V(K)(16e^9)^{V(K)}), (10)\r\n\r\nfor some universal constants C and C_1.\r\n\r\nUsing (10) in (8), we have |γ(P_m, Q_n) - γ(P, Q)| = O_{P,Q}(√((m + n)/mn)) and by the Borel-Cantelli lemma, |γ(P_m, Q_n) - γ(P, Q)| → 0 almost surely. Now, the question reduces to which of the kernel classes, K, have V(K) < ∞. [18, Lemma 12] showed that V(K_g) = 1 (also see [19]) and U_m(K_rbf) ≤ C_2 U_m(K_g), where C_2 < ∞. It can be shown that V(K_ψ) = 1 and V(K_l) = 1. 
All these classes satisfy the conditions of Theorem 7 and Corollary 10 and therefore provide consistent estimates of γ(P, Q) for any P, Q ∈ 𝒫. Examples of kernels on ℝ^d that are covered by these classes include the Gaussian, Laplacian, inverse multiquadratics, Matérn class etc. Other choices for K that are popular in machine learning are the linear combination of kernels, K_lin := {k_λ = Σ_{i=1}^l λ_i k_i | k_λ is pd, Σ_{i=1}^l λ_i = 1} and K_con := {k_λ = Σ_{i=1}^l λ_i k_i | λ_i ≥ 0, Σ_{i=1}^l λ_i = 1}. [13, Lemma 7] have shown that V(K_con) ≤ V(K_lin) ≤ l. Therefore, instead of using a class based on a fixed, parameterized kernel, one can also use a finite linear combination of kernels to compute γ.\r\n\r\nSo far, we have presented the metric property and statistical consistency (of the empirical estimator) of γ. Now, the question is how to compute γ(P_m, Q_n) in practice. To show this, in the following, we present two examples.\r\n\r\nExample 11. Suppose K = K_g. Then, γ(P_m, Q_n) can be written as\r\n\r\nγ^2(P_m, Q_n) = sup_{σ∈ℝ_+} [ (1/m^2) Σ_{i,j=1}^m e^{-σ‖X_i - X_j‖_2^2} + (1/n^2) Σ_{i,j=1}^n e^{-σ‖Y_i - Y_j‖_2^2} - (2/(mn)) Σ_{i,j=1}^{m,n} e^{-σ‖X_i - Y_j‖_2^2} ]. (11)\r\n\r\nThe optimum σ* can be obtained by solving (11) and γ(P_m, Q_n) = ‖P_m k_{σ*} - Q_n k_{σ*}‖_H.\r\n\r\nExample 12. Suppose K = K_con. Then, γ(P_m, Q_n) becomes\r\n\r\nγ^2(P_m, Q_n) = sup_{k∈K_con} ‖P_m k - Q_n k‖_H^2 = sup_{k∈K_con} ∫∫ k d(P_m - Q_n) ⊗ (P_m - Q_n) = sup{λ^T a : λ^T 1 = 1, λ ⪰ 0}, (12)\r\n\r\nwhere we have replaced k by Σ_{i=1}^l λ_i k_i. Here λ = (λ_1, ..., λ_l) and (a)_i = ‖P_m k_i - Q_n k_i‖_H^2 = (1/m^2) Σ_{a,b=1}^m k_i(X_a, X_b) + (1/n^2) Σ_{a,b=1}^n k_i(Y_a, Y_b) - (2/(mn)) Σ_{a,b=1}^{m,n} k_i(X_a, Y_b). It is easy to see that γ^2(P_m, Q_n) = max_{1≤i≤l} (a)_i.\r\n\r\nSimilar examples can be provided for other K, where γ(P_m, Q_n) can be computed by solving a semidefinite program (K = K_lin) or by constrained gradient descent (K = K_l, K_rbf). 
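Examples 11 and 12 can be sketched numerically as follows. The supremum over the bandwidth in (11) is approximated here by a search over a finite grid, which is a simplification chosen for this illustration and not necessarily the optimization used by the authors; the sample sizes, the grid and all function names are likewise assumptions. Because the objective in (12) is linear in the weights, its supremum over the simplex equals the largest single-kernel value, which the last line computes.

```python
import math
import random

def mmd2(xs, ys, kernel):
    # Biased (V-statistic) estimate of the squared MMD: the bracket in (11)
    m, n = len(xs), len(ys)
    kxx = sum(kernel(a, b) for a in xs for b in xs) / (m * m)
    kyy = sum(kernel(a, b) for a in ys for b in ys) / (n * n)
    kxy = sum(kernel(a, b) for a in xs for b in ys) / (m * n)
    return kxx + kyy - 2.0 * kxy

def gaussian(sigma):
    # Member of the Gaussian family K_g with bandwidth parameter sigma
    return lambda a, b: math.exp(-sigma * (a - b) ** 2)

def generalized_mmd2(xs, ys, sigmas):
    # Grid approximation of the supremum over sigma in (11)
    scores = {s: mmd2(xs, ys, gaussian(s)) for s in sigmas}
    best = max(scores, key=scores.get)
    return scores[best], best

rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(60)]  # hypothetical sample from P
ys = [rng.gauss(1.0, 1.0) for _ in range(60)]  # hypothetical sample from Q
grid = [2.0 ** e for e in range(-6, 7)]

gamma2, sigma_star = generalized_mmd2(xs, ys, grid)

# Example 12 with the grid kernels as base kernels: the sup over convex
# weights is attained at a vertex, i.e. at the best single base kernel.
gamma2_con = max(mmd2(xs, ys, gaussian(s)) for s in grid)
```

A finer grid or a one-dimensional optimizer could replace the grid search; the quadratic cost in the sample size comes from the three double sums in (11).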
Finally, while the approach in (6) to generalizing γ_k is our focus in this paper, an alternative Bayesian strategy would be to define a non-negative finite measure λ over K, and to average γ_k over that measure, i.e., β(P, Q) := ∫_K γ_k(P, Q) dλ(k). This also yields a pseudometric on 𝒫. That said, β(P, Q) ≤ λ(K) γ(P, Q), ∀ P, Q, which means if P and Q can be distinguished by β, they can be distinguished by γ, but not vice-versa. In this sense, γ is stronger than β. One further complication with the Bayesian approach is in defining a sensible λ over K. Note that γ_{k_0} (single kernel MMD based on k_0) can be obtained by defining λ(k) = δ(k - k_0) in β(P, Q).\r\n\r\n5 Experiments\r\n\r\nIn this section, we present a benchmark experiment which illustrates that the generalized MMD proposed in Section 4 is preferable to the single kernel MMD where the kernel parameter is set heuristically. The experimental setup is as follows.\r\n\r\nLet p = N(0, σ_p^2), a normal distribution in ℝ with zero mean and variance σ_p^2. Let q be the perturbed version of p, given as q(x) = p(x)(1 + sin νx). Here p and q are the densities associated with P and Q respectively. It is easy to see that q differs from p at increasing frequencies with increasing ν. Let k(x, y) = exp(-(x - y)^2/σ). Now, the goal is that given random samples drawn i.i.d. from P and Q (with ν fixed), we would like to test H_0 : P = Q vs. H_1 : P ≠ Q. The idea is that as ν increases, it will be harder to distinguish between P and Q for a fixed sample size. Therefore, using this setup we can verify whether the adaptive bandwidth selection achieved by γ (as the test statistic) helps to distinguish between P and Q at higher ν compared to γ_k with a heuristic σ. To this end, using γ(P_m, Q_n) and γ_k(P_m, Q_n) (with various σ) as test statistics T_mn, we design a test that returns H_0 if T_mn ≤ c_mn, and H_1 otherwise. The problem therefore reduces to finding c_mn. 
c_mn is determined as the (1 - α) quantile of the asymptotic distribution of T_mn under H_0, which therefore fixes the type-I error (the probability of rejecting H_0 when it is true) to α. The consistency of this test under γ_k (for any fixed σ) is proved in [6]. A similar result can be shown for γ under some conditions on K. We skip the details here.\r\n\r\nIn our experiments, we set m = n = 1000, σ_p^2 = 10 and draw two sets of independent random samples from Q. The distribution of T_mn is estimated by bootstrapping on these samples (250 bootstrap iterations are performed) and the associated 95th percentile (we choose α = 0.05) is computed. Since the performance of the test is judged by its type-II error (the probability of accepting H_0 when H_1 is true), we draw a random sample, one each from P and Q, and test whether P = Q. This process is repeated 300 times, and estimates of type-I and type-II errors are obtained for both γ and γ_k. 14 different values for σ are considered on a logarithmic scale of base 2 with exponents (-3, -2, -1, 0, 1, 3/2, 2, 5/2, 3, 7/2, 4, 5, 6) along with the median distance between samples as one more choice. 5 different choices for ν are considered: (1/2, 3/4, 1, 5/4, 3/2).\r\n\r\nFigure 1: (a) Type-I and Type-II errors (in %) for γ for varying ν. (b,c) Type-I and type-II error (in %) for γ_k (with different σ) for varying ν. The dotted line in (c) corresponds to the median heuristic, which shows that its associated type-II error is very large at large ν. 
(d) Box plot of log σ grouped by ν, where σ is selected by γ. (e) Box plot of the median distance between points (which is also a choice for σ), grouped by ν. Refer to Section 5 for details.\r\n\r\nFigure 1(a) shows the estimated type-I and type-II errors using γ as the test statistic for varying ν. Note that the type-I error is close to its design value of 5%, while the type-II error is zero for all ν, which means γ distinguishes between P and Q for all perturbations. Figures 1(b,c) show the estimates of type-I and type-II errors using γ_k as the test statistic for different σ and ν. Figure 1(d) shows the box plot for log σ, grouped by ν, where σ is the bandwidth selected by γ. Figure 1(e) shows the box plot of the median distance between points (which is also a choice for σ), grouped by ν. From Figures 1(c) and (e), it is easy to see that the median heuristic exhibits high type-II error for ν = 3/2, while γ exhibits zero type-II error (from Figure 1(a)). Figure 1(c) also shows that heuristic choices of σ can result in high type-II errors. Intuitively, as ν increases (which means the characteristic function of Q differs from that of P at higher frequencies), a smaller σ is needed to detect these changes. The advantage of using γ is that it selects σ in a distribution-dependent fashion and its behavior in the box plot shown in Figure 1(d) matches the previously mentioned intuition about the behavior of σ with respect to ν. These results demonstrate the validity of using γ as a distance measure in applications.\r\n\r\n6 Conclusions\r\n\r\nIn this work, we have shown how MMD appears in binary classification, and thus that characteristic kernels are important in kernel-based classification algorithms. We have broadened the class of characteristic RKHSs to include those induced by strictly positive definite kernels (with particular application to kernels on non-compact domains, and/or kernels that are not translation invariant). 
We have further provided a convergent generalization of MMD over families of kernel functions, which becomes necessary even in considering relatively simple families of kernels (such as the Gaussian kernels parameterized by their bandwidth). The usefulness of the generalized MMD is illustrated experimentally with a two-sample testing problem.\r\n\r\nAcknowledgments\r\nThe authors thank anonymous reviewers for their constructive comments and especially the reviewer who pointed out the connection between characteristic kernels and the achievability of Bayes risk. B. K. S. was supported by the MPI for Biological Cybernetics, National Science Foundation (grant DMS-MSPA 0625409), the Fair Isaac Corporation and the University of California MICRO program. A. G. was supported by grants DARPA IPTO FA8750-09-1-0141, ONR MURI N000140710747, and ARO MURI W911NF0810242.\r\n\r\nReferences\r\n[1] M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, UK, 1999.\r\n[2] V. H. de la Peña and E. Giné. Decoupling: From Dependence to Independence. Springer-Verlag, NY, 1999.\r\n[3] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York, 1996.\r\n[4] K. Fukumizu, A. Gretton, X. Sun, and B. Schölkopf. Kernel measures of conditional dependence. In J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 489-496, Cambridge, MA, 2008. MIT Press.\r\n[5] K. Fukumizu, B. K. Sriperumbudur, A. Gretton, and B. Schölkopf. Characteristic kernels on groups and semigroups. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 473-480, 2009.\r\n[6] A. Gretton, K. M. Borgwardt, M. Rasch, B. Schölkopf, and A. Smola. A kernel method for the two sample problem. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 513-520. 
MIT Press, 2007.\r\n[7] A. Gretton, K. Fukumizu, C.-H. Teo, L. Song, B. Schölkopf, and A. Smola. A kernel statistical test of independence. In Advances in Neural Information Processing Systems 20, pages 585-592. MIT Press, 2008.\r\n[8] G. R. G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. I. Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5:27-72, 2004.\r\n[9] B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.\r\n[10] B. Schölkopf, B. K. Sriperumbudur, A. Gretton, and K. Fukumizu. RKHS representation of measures. In Learning Theory and Approximation Workshop, Oberwolfach, Germany, 2008.\r\n[11] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, UK, 2004.\r\n[12] A. J. Smola, A. Gretton, L. Song, and B. Schölkopf. A Hilbert space embedding for distributions. In Proc. 18th International Conference on Algorithmic Learning Theory, pages 13-31. Springer-Verlag, Berlin, Germany, 2007.\r\n[13] N. Srebro and S. Ben-David. Learning bounds for support vector machines with learned kernels. In G. Lugosi and H. U. Simon, editors, Proc. of the 19th Annual Conference on Learning Theory, pages 169-183, 2006.\r\n[14] B. K. Sriperumbudur, A. Gretton, K. Fukumizu, G. R. G. Lanckriet, and B. Schölkopf. Injective Hilbert space embeddings of probability measures. In R. Servedio and T. Zhang, editors, Proc. of the 21st Annual Conference on Learning Theory, pages 111-122, 2008.\r\n[15] I. Steinwart. On the influence of the kernel on the consistency of support vector machines. Journal of Machine Learning Research, 2:67-93, 2002.\r\n[16] I. Steinwart and A. Christmann. Support Vector Machines. Springer, 2008.\r\n[17] J. Stewart. Positive definite functions and generalizations, an historical survey. Rocky Mountain Journal of Mathematics, 6(3):409-433, 1976.\r\n[18] Y. Ying and C. Campbell. Generalization bounds for learning the kernel. In Proc. 
of the 22nd Annual Conference on Learning Theory, 2009.\r\n[19] Y. Ying and D. X. Zhou. Learnability of Gaussians with flexible variances. Journal of Machine Learning Research, 8:249-276, 2007.", "award": [], "sourceid": 3750, "authors": [{"given_name": "Kenji", "family_name": "Fukumizu", "institution": null}, {"given_name": "Arthur", "family_name": "Gretton", "institution": null}, {"given_name": "Gert", "family_name": "Lanckriet", "institution": null}, {"given_name": "Bernhard", "family_name": "Sch\u00f6lkopf", "institution": null}, {"given_name": "Bharath", "family_name": "Sriperumbudur", "institution": null}]}