{"title": "Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions", "book": "Advances in Neural Information Processing Systems", "page_first": 1750, "page_last": 1758, "abstract": "", "full_text": "Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions\r\n\r\nBharath K. Sriperumbudur, Department of ECE, UC San Diego, La Jolla, USA, bharathsv@ucsd.edu\r\n\r\nKenji Fukumizu, The Institute of Statistical Mathematics, Tokyo, Japan, fukumizu@ism.ac.jp\r\n\r\nArthur Gretton, Carnegie Mellon University / MPI for Biological Cybernetics, arthur.gretton@gmail.com\r\n\r\nGert R. G. Lanckriet, Department of ECE, UC San Diego, La Jolla, USA, gert@ece.ucsd.edu\r\n\r\nBernhard Schölkopf, MPI for Biological Cybernetics, Tübingen, Germany, bs@tuebingen.mpg.de\r\n\r\nAbstract\r\nEmbeddings of probability measures into reproducing kernel Hilbert spaces have been proposed as a straightforward and practical means of representing and comparing probabilities. In particular, the distance between embeddings (the maximum mean discrepancy, or MMD) has several key advantages over many classical metrics on distributions, namely easy computability, fast convergence and low bias of finite sample estimates. An important requirement of the embedding RKHS is that it be characteristic: in this case, the MMD between two distributions is zero if and only if the distributions coincide. Three new results on the MMD are introduced in the present study. First, it is established that MMD corresponds to the optimal risk of a kernel classifier, thus forming a natural link between the distance between distributions and their ease of classification. An important consequence is that a kernel must be characteristic to guarantee classifiability between distributions in the RKHS. Second, the class of characteristic kernels is broadened to incorporate all strictly positive definite kernels: these include non-translation invariant kernels and kernels on non-compact domains. 
Third, a generalization of the MMD is proposed for families of kernels, as the supremum over MMDs on a class of kernels (for instance the Gaussian kernels with different bandwidths). This extension is necessary to obtain a single distance measure if a large selection or class of characteristic kernels is potentially appropriate. This generalization is reasonable, given that it corresponds to the problem of learning the kernel by minimizing the risk of the corresponding kernel classifier. The generalized MMD is shown to have consistent finite sample estimates, and its performance is demonstrated on a homogeneity testing example.\r\n\r\n1 Introduction\r\n\r\nKernel methods are broadly established as a useful way of constructing nonlinear algorithms from linear ones, by embedding points into higher dimensional reproducing kernel Hilbert spaces (RKHSs) [9]. A generalization of this idea is to embed probability distributions into RKHSs, giving us a linear method for dealing with higher order statistics [6, 12, 14]. More specifically, suppose we are given the set P of all Borel probability measures defined on the topological space M, and the RKHS (H, k) of functions on M with k as its reproducing kernel (r.k.). For P ∈ P, denote by P_k := ∫_M k(·, x) dP(x). If k is measurable and bounded, then we may define the embedding of P in H as P_k ∈ H. The RKHS distance between two such mappings associated with P, Q ∈ P is called the maximum mean discrepancy (MMD) [6, 14], and is written\r\n\r\nγ_k(P, Q) = ‖P_k − Q_k‖_H.\r\n\r\n(1)\r\n\r\nWe say that k is characteristic [4, 14] if the mapping P ↦ P_k is injective, in which case (1) is zero if and only if P = Q, i.e., γ_k is a metric on P. An immediate application of the MMD is to problems of comparing distributions based on finite samples: examples include tests of homogeneity [6], independence [7], and conditional independence [4]. 
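As a hedged illustration of (1), the MMD admits the standard expansion γ_k(P, Q)² = E k(X, X′) − 2 E k(X, Y) + E k(Y, Y′) with X, X′ ~ P and Y, Y′ ~ Q [6], which yields a simple (biased) plug-in estimate from samples. The sketch below assumes a Gaussian kernel; the function names `gaussian_kernel` and `mmd` are ours, not the paper's:

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    """k(x, y) = exp(-sigma * ||x - y||^2): a bounded, characteristic kernel on R^d."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sigma * sq)

def mmd(X, Y, sigma=1.0):
    """Biased empirical estimate of gamma_k(P, Q) from samples X ~ P, Y ~ Q,
    via E k(X,X') - 2 E k(X,Y) + E k(Y,Y') with expectations replaced by means."""
    Kxx = gaussian_kernel(X, X, sigma)
    Kyy = gaussian_kernel(Y, Y, sigma)
    Kxy = gaussian_kernel(X, Y, sigma)
    return np.sqrt(max(Kxx.mean() - 2.0 * Kxy.mean() + Kyy.mean(), 0.0))
```

For identical samples the estimate is exactly zero, and it grows as the two samples become easier to tell apart, which is the behavior the characteristic property guarantees at the population level.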
In this application domain, the question of whether k is characteristic is key: without this property, the algorithms can fail through an inability to distinguish between particular distributions.\r\n\r\nCharacteristic kernels in binary classification: The problem of distinguishing distributions is strongly related to binary classification: indeed, one would expect easily distinguishable distributions to be easily classifiable.¹ The link between these two problems is especially direct in the case of the MMD: in Section 2, we show that γ_k is the negative of the optimal risk (corresponding to a linear loss function) associated with the Parzen window classifier [9, 11] (also called the kernel classification rule [3, Chapter 10]), where the Parzen window turns out to be k. We also show that γ_k is an upper bound on the margin of a hard-margin support vector machine (SVM). The importance of using characteristic RKHSs is further underlined by this link: if the property does not hold, then there exist distinct distributions that are unclassifiable in the RKHS H. We further strengthen this by showing that characteristic kernels are necessary (and sufficient under certain conditions) to achieve the Bayes risk in kernel-based classification algorithms.\r\n\r\nCharacterization of characteristic kernels: Given the centrality of the characteristic property to both RKHS classification and RKHS distribution testing, we should take particular care in establishing which kernels satisfy this requirement. Early results in this direction include [6], where k is shown to be characteristic on compact M if it is universal in the sense of Steinwart [15, Definition 4]; and [4, 5], which address the case of non-compact M, and show that k is characteristic if and only if H + R is dense in the Banach space of p-power (p ≥ 1) integrable functions. The conditions in both these studies can be difficult to check and interpret, however, and the restriction of the first to compact M is limiting. 
In the case of translation invariant kernels, [14] proved the kernel to be characteristic if and only if the support of the Fourier transform of k is the entire R^d, which is a much easier condition to verify. Similar sufficient conditions are obtained by [5] for translation invariant kernels on groups and semi-groups. In Section 3, we expand the class of characteristic kernels to include kernels that may or may not be translation invariant, with the introduction of a novel criterion: strictly positive definite kernels (see Definition 3) on M are characteristic.\r\n\r\nChoice of characteristic kernels: In expanding the families of allowable characteristic kernels, we have so far neglected the question of which characteristic kernel to choose. A practitioner asking by how much two samples differ does not want to receive a blizzard of answers for every conceivable kernel and bandwidth setting, but a single measure that satisfies some \"reasonable\" notion of distance across the family of kernels considered. Thus, in Section 4, we propose a generalization of the MMD, yielding a new distance measure between P and Q defined as\r\n\r\nγ(P, Q) = sup{γ_k(P, Q) : k ∈ K} = sup{‖P_k − Q_k‖_H : k ∈ K},\r\n\r\n(2)\r\n\r\nwhich is the maximal RKHS distance between P and Q over a family K of positive definite kernels. For example, K can be the family of Gaussian kernels on R^d indexed by the bandwidth parameter. This distance measure is very natural in the light of our results on binary classification (in Section 2): most directly, this corresponds to the problem of learning the kernel by minimizing the risk of the associated Parzen-based classifier. As a less direct justification, we also increase the upper bound on the margin allowed for a hard margin SVM between the samples. To apply the generalized MMD in practice, we must ensure its empirical estimator is consistent. 
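In practice, the supremum in (2) over a bandwidth-indexed family such as the Gaussian kernels can be approximated on finite samples by maximizing the empirical MMD over a grid of kernel parameters. The sketch below is our illustrative reading of this idea, not the paper's estimator; the grid and the names `mmd` and `generalized_mmd` are our choices:

```python
import numpy as np

def mmd(X, Y, sigma):
    """Biased empirical gamma_{k_sigma}(P, Q) for k_sigma(x, y) = exp(-sigma ||x - y||^2)."""
    def K(A, B):
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-sigma * sq)
    g2 = K(X, X).mean() - 2.0 * K(X, Y).mean() + K(Y, Y).mean()
    return np.sqrt(max(g2, 0.0))

def generalized_mmd(X, Y, sigmas=np.logspace(-3, 3, 25)):
    """Grid approximation to (2): the largest per-kernel MMD over the Gaussian family."""
    return max(mmd(X, Y, s) for s in sigmas)
```

By construction the grid maximum dominates the MMD of any single kernel on the grid, which is the sense in which a single reported number summarizes the whole family.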
In our main result of Section 4, we provide an empirical estimate of γ(P, Q) based on finite samples, and show that many popular kernels like the Gaussian, Laplacian, and the entire Matérn class on R^d yield consistent estimates of γ(P, Q). The proof is based on bounding the Rademacher chaos complexity of K, which can be understood as the U-process equivalent of Rademacher complexity [2]. Finally, in Section 5, we provide a simple experimental demonstration that the generalized MMD can be applied in practice to the problem of homogeneity testing. Specifically, we show that when two distributions differ on particular length scales, the kernel selected by the generalized MMD is appropriate to this difference, and the resulting hypothesis test outperforms the heuristic kernel choice employed in earlier studies [6]. The proofs of the results in Sections 2-4 are provided in the supplementary material.\r\n\r\n¹There is a subtlety here, since unlike the problem of testing for differences in distributions, classification suffers from slow learning rates. See [3, Chapter 7] for details.\r\n\r\n2 Characteristic Kernels and Binary Classification\r\n\r\nOne of the most important applications of the maximum mean discrepancy is in nonparametric hypothesis testing [6, 7, 4], where the characteristic property of k is required to distinguish between probability measures. In the following, we show how MMD naturally appears in binary classification, with reference to the Parzen window classifier and hard-margin SVM. This motivates the need for characteristic k to guarantee that classes arising from different distributions can be classified by kernel-based algorithms. To this end, let us consider the binary classification problem with X being an M-valued random variable, Y being a {−1, +1}-valued random variable and the product space, M × {−1, +1}, being endowed with an induced Borel probability measure μ. 
A discriminant function, f, is a real-valued measurable function on M, whose sign is used to make a classification decision. Given a loss function L : {−1, +1} × R → R, the goal is to choose an f that minimizes the risk associated with L, with the optimal L-risk being defined as\r\n\r\nR^L_F = inf_{f∈F} ∫_{M×{−1,+1}} L(y, f(x)) dμ(x, y) = inf_{f∈F} [ ε ∫_M L_1(f) dP + (1 − ε) ∫_M L_{−1}(f) dQ ],\r\n\r\n(3)\r\n\r\nwhere F is the set of all measurable functions on M, L_1(α) := L(1, α), L_{−1}(α) := L(−1, α), P(X) := μ(X | Y = +1), Q(X) := μ(X | Y = −1), ε := μ(M, Y = +1). Here, P and Q represent the class-conditional distributions and ε is the prior distribution of class +1. Now, we present the result that relates γ_k to the optimal risk associated with the Parzen window classifier.\r\n\r\nTheorem 1 (γ_k and Parzen classification). Let L_1(α) = −α/ε and L_{−1}(α) = α/(1 − ε). Then, γ_k(P, Q) = −R^L_{F_k}, where F_k = {f : ‖f‖_H ≤ 1} and H is an RKHS with a measurable and bounded k. Suppose {(X_i, Y_i)}^N_{i=1}, X_i ∈ M, Y_i ∈ {−1, +1}, ∀i, is a training sample drawn i.i.d. from μ and m = |{i : Y_i = 1}|. If f ∈ F_k is an empirical minimizer of (3) (where F is replaced by F_k in (3)), then\r\n\r\nsign(f(x)) = +1 if (1/m) Σ_{Y_i=1} k(x, X_i) > (1/(N − m)) Σ_{Y_i=−1} k(x, X_i), and −1 if (1/m) Σ_{Y_i=1} k(x, X_i) ≤ (1/(N − m)) Σ_{Y_i=−1} k(x, X_i),\r\n\r\n(4)\r\n\r\nwhich is the Parzen window classifier.\r\n\r\nTheorem 1 shows that γ_k is the negative of the optimal L-risk (where L is the linear loss as defined in Theorem 1) associated with the Parzen window classifier. Therefore, if k is not characteristic, which means γ_k(P, Q) = 0 for some P ≠ Q, then R^L_{F_k} = 0, i.e., the risk is maximum (note that since 0 ≤ γ_k(P, Q) = −R^L_{F_k}, the maximum risk is zero). In other words, if k is characteristic, then the maximum risk is obtained only when P = Q. This motivates the importance of characteristic kernels in binary classification. 
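The decision rule (4) is straightforward to implement: classify a point as +1 exactly when its mean kernel similarity to the positive training points exceeds its mean similarity to the negative ones. A minimal sketch, assuming a Gaussian window (the names `gaussian_kernel` and `parzen_classify` are ours):

```python
import numpy as np

def gaussian_kernel(x, X, sigma=1.0):
    """k(x, X_i) = exp(-sigma * ||x - X_i||^2) for every training point X_i."""
    return np.exp(-sigma * ((x - X) ** 2).sum(-1))

def parzen_classify(x, X_train, Y_train, sigma=1.0):
    """sign(f(x)) from (4): compare the class-conditional kernel means at x."""
    pos = gaussian_kernel(x, X_train[Y_train == 1], sigma).mean()
    neg = gaussian_kernel(x, X_train[Y_train == -1], sigma).mean()
    return 1 if pos > neg else -1
```

Note that the two means are exactly the evaluations at x of the empirical mean embeddings of the two classes, which is why the optimal risk of this rule is governed by the RKHS distance between those embeddings.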
In the following, we provide another result which provides a similar motivation for the importance of characteristic kernels in binary classification, wherein we relate γ_k to the margin of a hard-margin SVM.\r\n\r\nTheorem 2 (γ_k and hard-margin SVM). Suppose {(X_i, Y_i)}^N_{i=1}, X_i ∈ M, Y_i ∈ {−1, +1}, ∀i, is a training sample drawn i.i.d. from μ. Assuming the training sample is separable, let f_svm be the solution to the program, inf{‖f‖_H : Y_i f(X_i) ≥ 1, ∀i}, where H is an RKHS with measurable and bounded k. If k is characteristic, then\r\n\r\n1/‖f_svm‖_H ≤ γ_k(P_m, Q_n)/2,\r\n\r\n(5)\r\n\r\nwhere P_m := (1/m) Σ_{Y_i=1} δ_{X_i}, Q_n := (1/n) Σ_{Y_i=−1} δ_{X_i}, m = |{i : Y_i = 1}|, n = N − m, and δ_x represents the Dirac measure at x.\r\n\r\nTheorem 2 provides a bound on the margin of the hard-margin SVM in terms of the MMD. (5) shows that a smaller MMD between P_m and Q_n enforces a smaller margin (i.e., a less smooth classifier, f_svm, where smoothness is measured as ‖f_svm‖_H). We can observe that the bound in (5) may be loose if the number of support vectors is small. Suppose k is not characteristic; then γ_k(P_m, Q_n) can be zero for P_m ≠ Q_n and therefore the margin is zero, which means even unlike distributions can become inseparable in this feature representation. Another justification for using characteristic kernels in kernel-based classification algorithms can be provided by studying the conditions on H for which the Bayes risk is realized for all μ. Steinwart and Christmann [16, Corollary 5.37] have shown that under certain conditions on L, the Bayes risk is achieved for all μ if and only if H is dense in L^p(M, η) for all η, where η = εP + (1 − ε)Q. Here, L^p(M, η) represents the Banach space of p-power integrable functions, where p ∈ [1, ∞) is dependent on the loss function, L. Denseness of H in L^p(M, η) implies H + R is dense in L^p(M, η), which therefore yields that k is characteristic [4, 5]. 
On the other hand, if constant functions are included in H, then it is easy to show that the characteristic property of k is also sufficient to achieve the Bayes risk. As an example, it can be shown that characteristic kernels are necessary (and sufficient if constant functions are in H) for SVMs to achieve the Bayes risk [16, Example 5.40]. Therefore, the characteristic property of k is fundamental in kernel-based classification algorithms. Having shown how characteristic kernels play a role in kernel-based classification, in the following section we provide a novel characterization for them.\r\n\r\n3 Novel Characterization for Characteristic Kernels\r\n\r\nA positive definite (pd) kernel, k, is said to be characteristic to P if and only if γ_k(P, Q) = 0 ⇔ P = Q, ∀P, Q ∈ P. The following result provides a novel characterization for characteristic kernels, which shows that strictly pd kernels are characteristic to P. An advantage of this characterization is that it holds for any arbitrary topological space M, unlike the earlier characterizations where a group structure on M is assumed [14, 5]. First, we define strictly pd kernels as follows.\r\n\r\nDefinition 3 (Strictly positive definite kernels). Let M be a topological space. A measurable and bounded kernel, k, is said to be strictly positive definite if and only if ∫_M ∫_M k(x, y) dν(x) dν(y) > 0 for all finite non-zero signed Borel measures, ν, defined on M.\r\n\r\nNote that the above definition is not equivalent to the usual definition of strictly pd kernels that involves finite sums [16, Definition 4.15]. The above definition is a generalization of integrally strictly positive definite functions [17, Section 6]: ∫∫ k(x, y) f(x) f(y) dx dy > 0 for all f ∈ L²(R^d), which is the strict positive definiteness of the integral operator given by the kernel. Definition 3 is stronger than the finite-sum definition, as [16, Theorem 4.62] shows a kernel that is strictly pd in the finite-sum sense but not in the integral sense. 
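Definition 3 can be probed numerically in the special case of a discrete signed measure ν = Σ_i a_i δ_{x_i}, for which the double integral reduces to the finite sum Σ_{i,j} a_i a_j k(x_i, x_j) (i.e., the usual finite-sum notion that Definition 3 strengthens). A minimal sketch, assuming a Gaussian kernel; all names below are ours:

```python
import numpy as np

def quadratic_form(k, points, weights):
    """Integral in Definition 3 for nu = sum_i weights[i] * delta_{points[i]}:
    it reduces to the double sum  sum_{i,j} w_i w_j k(x_i, x_j)."""
    K = np.array([[k(x, y) for y in points] for x in points])
    return weights @ K @ weights

# Gaussian kernel, strictly pd in the sense of Definition 3.
k = lambda x, y: np.exp(-np.dot(x - y, x - y))

rng = np.random.default_rng(0)
points = rng.normal(size=(10, 3))   # distinct support points of nu
weights = rng.normal(size=10)       # mixed signs: a genuinely signed measure
val = quadratic_form(k, points, weights)
```

For any non-zero weight vector and distinct points, `val` is strictly positive for the Gaussian kernel, which is the discrete shadow of the integral condition; the point of Definition 3 is that positivity is required for all finite signed measures, not only the discrete ones.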
Theorem 4 (Strictly pd kernels are characteristic). If k is strictly positive definite on M, then k is characteristic to P.\r\n\r\nThe proof idea is to derive necessary and sufficient conditions for a kernel not to be characteristic. We show that choosing k to be strictly pd violates these conditions and k is therefore characteristic to P. Examples of strictly pd kernels on R^d include exp(−σ‖x − y‖₂²), σ > 0; exp(−σ‖x − y‖₁), σ > 0; (c² + ‖x − y‖₂²)^{−β}, β > 0, c > 0; B_{2l+1}-splines, etc. Note that k̃(x, y) = f(x) k(x, y) f(y) is a strictly pd kernel if k is strictly pd, where f : M → R is a bounded continuous function. Therefore, translation-variant strictly pd kernels can be obtained by choosing k to be a translation invariant strictly pd kernel. A simple example of a translation-variant kernel that is strictly pd on compact sets of R^d is k̃(x, y) = exp(σ xᵀy), σ > 0, where we have chosen f(·) = exp(σ‖·‖₂²/2) and k(x, y) = exp(−σ‖x − y‖₂²/2), σ > 0. Therefore, k̃ is characteristic on compact sets of R^d, which is the same result that follows from the universality of k̃ [15, Section 3, Example 1].\r\n\r\nThe following result in [10], which is based on the usual definition of strictly pd kernels, can be obtained as a corollary to Theorem 4.\r\n\r\nCorollary 5 ([10]). Let X = {x_i}^m_{i=1} ⊂ M, Y = {y_j}^n_{j=1} ⊂ M and assume that x_i ≠ x_j, y_i ≠ y_j, ∀i ≠ j. Suppose k is strictly positive definite. Then Σ^m_{i=1} α_i k(·, x_i) = Σ^n_{j=1} β_j k(·, y_j) for some α_i, β_j ∈ R\\{0} ⟹ X = Y.\r\n\r\nSuppose we choose α_i = 1/m, ∀i, and β_j = 1/n, ∀j, in Corollary 5. Then Σ^m_{i=1} α_i k(·, x_i) and Σ^n_{j=1} β_j k(·, y_j) represent the mean functions in H. Note that the Parzen classifier in (4) is a mean classifier (that separates the mean functions) in H, i.e., sign(⟨k(·, x), w⟩_H), where w = (1/m) Σ^m_{i=1} k(·, x_i) − (1/n) Σ^n_{i=1} k(·, y_i). Suppose k is strictly pd (more generally, suppose k is characteristic). 
Then, by Corollary 5, the normal vector w to the hyperplane in H passing through the origin is zero, i.e., the mean functions coincide (and are therefore not classifiable), if and only if X = Y.\r\n\r\n4 Generalizing the MMD for Classes of Characteristic Kernels\r\n\r\nThe discussion so far has been related to the characteristic property of k that makes γ_k a metric on P. We have seen that this characteristic property is of prime importance both in distribution testing, and to ensure classifiability of dissimilar distributions in the RKHS. We have not yet addressed how to choose among a selection/family of characteristic kernels, given a particular pair of distributions we wish to discriminate between. We introduce one approach to this problem in the present section.\r\n\r\nLet M = R^d and k_σ(x, y) = exp(−σ‖x − y‖₂²), σ ∈ R_+, where σ represents the bandwidth parameter. {k_σ : σ ∈ R_+} is the family of Gaussian kernels and {γ_{k_σ} : σ ∈ R_+} is the family of MMDs indexed by the kernel parameter, σ. Note that k_σ is characteristic for any σ ∈ R_{++} and therefore γ_{k_σ} is a metric on P for any σ ∈ R_{++}. However, in practice, one would prefer a single number that defines the distance between P and Q. The question therefore to be addressed is how to choose an appropriate σ. The choice of σ has important implications on the statistical aspect of γ_{k_σ}. Note that as σ → 0, k_σ → 1 and as σ → ∞, k_σ → 0 a.e., which means γ_{k_σ}(P, Q) → 0 as σ → 0 or σ → ∞, for all P, Q ∈ P (this behavior is also exhibited by k_σ(x, y) = exp(−σ‖x − y‖₁) and k_σ(x, y) = σ²/(σ² + ‖x − y‖₂²), which are also characteristic). This means choosing σ sufficiently small or sufficiently large (depending on P and Q) makes γ_{k_σ}(P, Q) arbitrarily small. Therefore, σ has to be chosen appropriately in applications to effectively distinguish between P and Q. Presently, the applications involving MMD set σ heuristically [6, 7].\r\n\r\nTo generalize the MMD to families of kernels, we propose the following modification to γ_k, which yields a pseudometric on P,\r\n\r\nγ(P, Q) = sup{γ_k(P, Q) : k ∈ K} = sup{‖P_k − Q_k‖_H : k ∈ K}. 
(6)\r\n\r\nNote that γ is the maximal RKHS distance between P and Q over a family K of positive definite kernels. It is easy to check that if any k ∈ K is characteristic, then γ is a metric on P. Examples for K include: K_g := {e^{−σ‖x−y‖₂²}, x, y ∈ R^d : σ ∈ R_+}; K_l := {e^{−σ‖x−y‖₁}, x, y ∈ R^d : σ ∈ R_+}; K_ψ := {e^{−σψ(x,y)}, x, y ∈ M : σ ∈ R_+}, where ψ : M × M → R is a negative definite kernel; K_rbf := {∫_0^∞ e^{−λ‖x−y‖₂²} dμ(λ), x, y ∈ R^d, μ ∈ M⁺_b}, where M⁺_b is the set of all finite nonnegative Borel measures, μ, on R_+ that are not concentrated at zero; etc.\r\n\r\nThe proposal of γ(P, Q) in (6) can be motivated by the connection that we have established in Section 2 between γ_k and the Parzen window classifier. Since the Parzen window classifier depends on the kernel, k, one can propose to learn the kernel as in support vector machines [8], wherein the kernel is chosen such that R^L_{F_k} in Theorem 1 is minimized over k ∈ K, i.e., inf_{k∈K} R^L_{F_k} = −sup_{k∈K} γ_k(P, Q) = −γ(P, Q). A similar motivation for γ can be provided based on (5), as learning the kernel in a hard-margin SVM by maximizing its margin.\r\n\r\nAt this point, we briefly discuss the issue of normalized vs. unnormalized kernel families, K, in (6). We say a translation-invariant kernel, k, on R^d is normalized if ∫_{R^d} φ(y) dy = c (some positive constant independent of the kernel parameter), where k(x, y) = φ(x − y). K is a normalized kernel family if every kernel in K is normalized. If K is not normalized, we say it is unnormalized. For example, it is easy to see that K_g and K_l are unnormalized kernel families. Let us consider the normalized Gaussian family, Kⁿ_g = {(πσ)^{−d/2} e^{−‖x−y‖₂²/σ}, x, y ∈ R^d : σ ∈ [σ_0, ∞)}. It can be shown that for any k_σ, k_τ ∈ Kⁿ_g, σ_0 < σ < τ < ∞, we have γ_{k_τ}(P, Q) ≤ γ_{k_σ}(P, Q), which means γ(P, Q) = γ_{k_{σ_0}}(P, Q). Therefore, the generalized MMD reduces to a single-kernel MMD. A similar result also holds for the normalized inverse-quadratic kernel family, {c(σ)(σ² + ‖x − y‖₂²)^{−1}, x, y ∈ R : σ ∈ [σ_0, ∞)}, where c(σ) is the normalizing constant. 
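The monotonicity claim for the normalized Gaussian family can be checked numerically; the parameterization (πσ)^{−d/2} e^{−‖x−y‖²/σ} used below is our reconstruction of Kⁿ_g, and the samples, grid, and function name are illustrative choices:

```python
import numpy as np

def normalized_mmd2(X, Y, sigma):
    """Biased empirical MMD^2 for the normalized Gaussian kernel
    k_sigma(x, y) = (pi * sigma)^(-d/2) * exp(-||x - y||^2 / sigma)."""
    d = X.shape[1]
    def K(A, B):
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return (np.pi * sigma) ** (-d / 2) * np.exp(-sq / sigma)
    return K(X, X).mean() - 2.0 * K(X, Y).mean() + K(Y, Y).mean()

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, (100, 2))
Y = rng.normal(1.0, 1.0, (100, 2))
# MMD^2 should be non-increasing in sigma over this normalized family,
# so the supremum over sigma in [sigma_0, inf) is attained at sigma_0.
vals = [normalized_mmd2(X, Y, s) for s in (0.5, 1.0, 2.0, 4.0)]
```

The decrease follows from the spectral picture: in this parameterization the kernel's Fourier weight e^{−σ‖ω‖²/4} shrinks pointwise as σ grows, so every frequency contributes less to the squared MMD.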
These examples show that the generalized MMD definition is usually not very useful if K is a normalized kernel family. In addition, σ_0 should be chosen beforehand, which is equivalent to heuristically setting the kernel parameter in k_σ. Note that σ_0 cannot be zero because in the limiting case of σ_0 → 0, the kernels approach a Dirac distribution, which means the limiting kernel is not bounded and therefore the definition of MMD in (1) does not hold. So, in this work, we consider unnormalized kernel families to render the definition of generalized MMD in (6) useful.\r\n\r\nTo use γ in statistical applications where P and Q are known only through i.i.d. samples {X_i}^m_{i=1} and {Y_j}^n_{j=1} respectively, we require its estimator γ(P_m, Q_n) to be consistent, where P_m and Q_n represent the empirical measures based on {X_i}^m_{i=1} and {Y_j}^n_{j=1}. For k measurable and bounded, [6, 12] have shown that γ_k(P_m, Q_n) is a √(mn/(m + n))-consistent estimator of γ_k(P, Q). The statistical consistency of γ(P_m, Q_n) is established in the following theorem, which uses tools from U-process theory [2, Chapters 3, 5]. We begin with the following definition.\r\n\r\nDefinition 6 (Rademacher chaos). Let G be a class of functions on M × M and {ε_i}^n_{i=1} be independent Rademacher random variables, i.e., Pr(ε_i = 1) = Pr(ε_i = −1) = 1/2. The homogeneous Rademacher chaos process of order two with respect to {ε_i}^n_{i=1} is defined as {n⁻¹ Σ_i