{"title": "Learning in Hilbert vs. Banach Spaces: A Measure Embedding Viewpoint", "book": "Advances in Neural Information Processing Systems", "page_first": 1773, "page_last": 1781, "abstract": "The goal of this paper is to investigate the advantages and disadvantages of learning in Banach spaces over Hilbert spaces. While many works have been carried out in generalizing Hilbert methods to Banach spaces, in this paper, we consider the simple problem of learning a Parzen window classifier in a reproducing kernel Banach space (RKBS)---which is closely related to the notion of embedding probability measures into an RKBS---in order to carefully understand its pros and cons over the Hilbert space classifier. We show that while this generalization yields richer distance measures on probabilities compared to its Hilbert space counterpart, it however suffers from serious computational drawback limiting its practical applicability, which therefore demonstrates the need for developing efficient learning algorithms in Banach spaces.", "full_text": "Learning in Hilbert vs. Banach Spaces: A Measure\n\nEmbedding Viewpoint\n\nBharath K. Sriperumbudur\n\nGatsby Unit\n\nUniversity College London\n\nKenji Fukumizu\n\nThe Institute of Statistical\n\nMathematics, Tokyo\n\nGert R. G. Lanckriet\n\nDept. of ECE\nUC San Diego\n\nbharath@gatsby.ucl.ac.uk\n\nfukumizu@ism.ac.jp\n\ngert@ece.ucsd.edu\n\nAbstract\n\nThe goal of this paper is to investigate the advantages and disadvantages of learn-\ning in Banach spaces over Hilbert spaces. While many works have been carried\nout in generalizing Hilbert methods to Banach spaces, in this paper, we consider\nthe simple problem of learning a Parzen window classi\ufb01er in a reproducing kernel\nBanach space (RKBS)\u2014which is closely related to the notion of embedding prob-\nability measures into an RKBS\u2014in order to carefully understand its pros and cons\nover the Hilbert space classi\ufb01er. 
We show that while this generalization yields richer distance measures on probabilities compared to its Hilbert space counterpart, it suffers from a serious computational drawback limiting its practical applicability, which therefore demonstrates the need for developing efficient learning algorithms in Banach spaces.

1 Introduction

Kernel methods have been popular in machine learning and pattern analysis for their superior performance on a wide spectrum of learning tasks. They are broadly established as an easy way to construct nonlinear algorithms from linear ones, by embedding data points into reproducing kernel Hilbert spaces (RKHSs) [1, 14, 15]. Over the last few years, the generalization of these techniques to Banach spaces has gained interest. This is because any two Hilbert spaces over a common scalar field with the same dimension are isometrically isomorphic, while Banach spaces provide more variety in geometric structures and norms that are potentially useful for learning and approximation.

To sample the literature, classification in Banach spaces, and more generally in metric spaces, was studied in [3, 22, 11, 5]. Minimizing a loss function subject to a regularization condition on a norm in a Banach space was studied by [3, 13, 24, 21], and online learning in Banach spaces was considered in [17]. While all these works have focused on theoretical generalizations of Hilbert space methods to Banach spaces, the practical viability and inherent computational issues associated with Banach space methods have so far not been highlighted. 
The goal of this paper is to study the advantages/disadvantages of learning in Banach spaces in comparison to Hilbert space methods, in particular, from the point of view of embedding probability measures into these spaces.

The concept of embedding probability measures into an RKHS [4, 6, 9, 16] provides a powerful and straightforward method to deal with high-order statistics of random variables. An immediate application of this notion is to problems of comparing distributions based on finite samples: examples include tests of homogeneity [9], independence [10], and conditional independence [7]. Formally, suppose we are given the set $\mathcal{P}(\mathcal{X})$ of all Borel probability measures defined on the topological space $\mathcal{X}$, and the RKHS $(\mathcal{H}, k)$ of functions on $\mathcal{X}$ with $k$ as its reproducing kernel (r.k.). If $k$ is measurable and bounded, then we can embed $\mathbb{P}$ in $\mathcal{H}$ as
$$\mathbb{P} \mapsto \int_\mathcal{X} k(\cdot, x)\, d\mathbb{P}(x). \qquad (1)$$
Given the embedding in (1), the RKHS distance between the embeddings of $\mathbb{P}$ and $\mathbb{Q}$ defines a pseudo-metric between $\mathbb{P}$ and $\mathbb{Q}$ as
$$\gamma_k(\mathbb{P}, \mathbb{Q}) := \left\| \int_\mathcal{X} k(\cdot, x)\, d\mathbb{P}(x) - \int_\mathcal{X} k(\cdot, x)\, d\mathbb{Q}(x) \right\|_\mathcal{H}. \qquad (2)$$
It is clear that when the embedding in (1) is injective, then $\mathbb{P}$ and $\mathbb{Q}$ can be distinguished based on their embeddings $\int_\mathcal{X} k(\cdot, x)\, d\mathbb{P}(x)$ and $\int_\mathcal{X} k(\cdot, x)\, d\mathbb{Q}(x)$. [18] related RKHS embeddings to the problem of binary classification by showing that $\gamma_k(\mathbb{P}, \mathbb{Q})$ is the negative of the optimal risk associated with the Parzen window classifier in $\mathcal{H}$. Extending this classifier to Banach spaces and studying the highlights/issues associated with this generalization will throw light on the same for more complex Banach space learning algorithms. 
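As a concrete numerical illustration of the RKHS embedding in (1) and the pseudo-metric in (2), the following minimal sketch estimates both from finite samples. The Gaussian kernel, the sample distributions and all function names are our illustrative choices, not part of the paper:

```python
import numpy as np

def gauss_kernel(a, b, sigma=1.0):
    """Gaussian RKHS kernel k(x, y) = exp(-(x - y)^2 / (2 sigma^2))."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-d2 / (2 * sigma ** 2))

def mean_embedding(sample, sigma=1.0):
    """Empirical version of (1): the function x -> (1/m) sum_i k(x, X_i)."""
    return lambda x: gauss_kernel(np.atleast_1d(x), sample, sigma).mean(axis=1)

def gamma_k(X, Y, sigma=1.0):
    """Empirical version of (2): RKHS distance between the two embeddings."""
    g2 = (gauss_kernel(X, X, sigma).mean()
          + gauss_kernel(Y, Y, sigma).mean()
          - 2 * gauss_kernel(X, Y, sigma).mean())
    return np.sqrt(max(g2, 0.0))

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=500)   # samples from P
Y = rng.normal(0.5, 1.0, size=500)   # samples from Q
mu_P = mean_embedding(X)
print(mu_P(0.0))          # value of the embedded measure, as a function, at 0
print(gamma_k(X, Y))      # > 0: the embedding separates the two samples
print(gamma_k(X, X))      # 0 for identical samples
```

The `gamma_k` expansion in terms of kernel evaluations follows by expanding the squared norm in (2) for empirical measures, which is exactly the closed form the paper later contrasts with the Banach space case.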
With this motivation, in this paper, we consider the generalization of the notion of RKHS embedding of probability measures to Banach spaces---in particular reproducing kernel Banach spaces (RKBSs) [24]---and then compare the properties of the RKBS embedding to its RKHS counterpart.

To derive RKHS based learning algorithms, it is essential to appeal to the Riesz representation theorem (as an RKHS is defined by the continuity of evaluation functionals), which establishes the existence of a reproducing kernel. This theorem hinges on the fact that a notion of inner product can be defined on Hilbert spaces. In this paper, as in [24], we deal with RKBSs that are uniformly Fréchet differentiable and uniformly convex (called s.i.p. RKBSs), as many Hilbert space arguments---most importantly the Riesz representation theorem---can be carried over to such spaces through the notion of semi-inner-product (s.i.p.) [12], which is a more general structure than an inner product. Based on Zhang et al. [24], who recently developed RKBS counterparts of RKHS based algorithms like regularization networks, support vector machines, kernel principal component analysis, etc., we provide a review of s.i.p. RKBSs in Section 3. We present our main contributions in Sections 4 and 5. In Section 4, first, we derive an RKBS embedding of $\mathbb{P}$ into $\mathcal{B}'$ as
$$\mathbb{P} \mapsto \int_\mathcal{X} K(\cdot, x)\, d\mathbb{P}(x), \qquad (3)$$
where $\mathcal{B}$ is an s.i.p. RKBS with $K$ as its reproducing kernel (r.k.) and $\mathcal{B}'$ is the topological dual of $\mathcal{B}$. Note that (3) is similar to (1), but more general than (1), as $K$ in (3) need not be positive definite (pd), in fact, not even symmetric (see Section 3; also see Examples 2 and 3). 
Based on (3), we define
$$\gamma_K(\mathbb{P}, \mathbb{Q}) := \left\| \int_\mathcal{X} K(\cdot, x)\, d\mathbb{P}(x) - \int_\mathcal{X} K(\cdot, x)\, d\mathbb{Q}(x) \right\|_{\mathcal{B}'},$$
a pseudo-metric on $\mathcal{P}(\mathcal{X})$, which we show to be the negative of the optimal risk associated with the Parzen window classifier in $\mathcal{B}'$. Second, we characterize the injectivity of (3) in Section 4.1, wherein we show that the characterizations obtained for the injectivity of (3) are similar to those obtained for (1) and coincide with the latter when $\mathcal{B}$ is an RKHS. Third, in Section 4.2, we consider the empirical estimation of $\gamma_K(\mathbb{P}, \mathbb{Q})$ based on finite random samples drawn i.i.d. from $\mathbb{P}$ and $\mathbb{Q}$ and study its consistency and rate of convergence. This is useful in applications like two-sample tests (and also in binary classification, as it relates to the consistency of the Parzen window classifier), where different $\mathbb{P}$ and $\mathbb{Q}$ are to be distinguished based on the finite samples drawn from them, and it is important that the estimator is consistent for the test to be meaningful. We show that the consistency and the rate of convergence of the estimator depend on the Rademacher type of $\mathcal{B}'$. This result coincides with the one obtained for $\gamma_k$ when $\mathcal{B}$ is an RKHS.

The above mentioned results, while similar to those obtained for RKHS embeddings, are significantly more general, as they apply to RKBSs, which subsume RKHSs. We can therefore expect to obtain "richer" metrics $\gamma_K$ than when being restricted to RKHSs (see Examples 1--3). On the other hand, one disadvantage of the RKBS framework is that $\gamma_K(\mathbb{P}, \mathbb{Q})$ cannot be computed in closed form, unlike $\gamma_k$ (see Section 4.3). Though this could seriously limit the practical impact of RKBS embeddings, in Section 5, we show that closed form expressions for $\gamma_K$ and its empirical estimator can be obtained for some non-trivial Banach spaces (see Examples 1--3). 
However, the critical drawback of the RKBS framework is that the computation of $\gamma_K$ and its empirical estimator is significantly more involved and expensive than in the RKHS framework, which means a simple kernel algorithm like a Parzen window classifier, when generalized to Banach spaces, suffers from a serious computational drawback, thereby limiting its practical impact. Given the advantages of learning in Banach spaces over Hilbert spaces, this work therefore demonstrates the need for the development of efficient algorithms in Banach spaces in order to make the problem of learning in Banach spaces worthwhile compared to its Hilbert space counterpart. The proofs of the results in Sections 4 and 5 are provided in the supplementary material.

2 Notation

We introduce some notation that is used throughout the paper. For a topological space $\mathcal{X}$, $C(\mathcal{X})$ (resp. $C_b(\mathcal{X})$) denotes the space of all continuous (resp. bounded continuous) functions on $\mathcal{X}$. For a locally compact Hausdorff space $\mathcal{X}$, $f \in C(\mathcal{X})$ is said to vanish at infinity if for every $\epsilon > 0$ the set $\{x : |f(x)| \ge \epsilon\}$ is compact. The class of all continuous $f$ on $\mathcal{X}$ which vanish at infinity is denoted as $C_0(\mathcal{X})$. For a Borel measure $\mu$ on $\mathcal{X}$, $L^p(\mathcal{X}, \mu)$ denotes the Banach space of $p$-power ($p \ge 1$) $\mu$-integrable functions. For a function $f$ defined on $\mathbb{R}^d$, $\hat{f}$ and $f^\vee$ denote the Fourier and inverse Fourier transforms of $f$. Since $\hat{f}$ and $f^\vee$ on $\mathbb{R}^d$ can be defined in the $L^1$, $L^2$ or more generally in distributional senses, they should be treated in the appropriate sense depending on the context. In the $L^1$ sense, the Fourier and inverse Fourier transforms of $f \in L^1(\mathbb{R}^d)$ are defined as $\hat{f}(y) = (2\pi)^{-d/2} \int_{\mathbb{R}^d} f(x)\, e^{-i\langle y, x\rangle}\, dx$ and $f^\vee(y) = (2\pi)^{-d/2} \int_{\mathbb{R}^d} f(x)\, e^{i\langle y, x\rangle}\, dx$, where $i$ denotes the imaginary unit $\sqrt{-1}$. 
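As a quick numerical sanity check of this unitary convention (our check, not part of the paper's development): under the normalization above, the standard Gaussian $e^{-x^2/2}$ is a fixed point of the Fourier transform, which a direct quadrature reproduces:

```python
import numpy as np

# Unitary Fourier transform (d = 1) by direct quadrature on a grid:
# f_hat(y) = (2*pi)^(-1/2) * integral f(x) exp(-i x y) dx.
x = np.linspace(-20, 20, 4001)
f = np.exp(-x**2 / 2)                      # standard Gaussian (unnormalized)

def ft(f_vals, grid, freq):
    dg = grid[1] - grid[0]
    return (2 * np.pi) ** -0.5 * np.sum(f_vals * np.exp(-1j * grid * freq)) * dg

y = np.linspace(-3, 3, 13)
f_hat = np.array([ft(f, x, t) for t in y])
# Under this convention exp(-x^2/2) is its own Fourier transform:
print(np.max(np.abs(f_hat.real - np.exp(-y**2 / 2))))  # ~0 (quadrature error only)
```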
$\phi_\mathbb{P} := \int_{\mathbb{R}^d} e^{i\langle \cdot, x\rangle}\, d\mathbb{P}(x)$ denotes the characteristic function of $\mathbb{P}$.

3 Preliminaries: Reproducing Kernel Banach Spaces

In this section, we briefly review the theory of RKBSs, which was recently studied by [24] in the context of learning in Banach spaces. Let $\mathcal{X}$ be a prescribed input space.

Definition 1 (Reproducing kernel Banach space). An RKBS $\mathcal{B}$ on $\mathcal{X}$ is a reflexive Banach space of functions on $\mathcal{X}$ such that its topological dual $\mathcal{B}'$ is isometric to a Banach space of functions on $\mathcal{X}$ and the point evaluations are continuous linear functionals on both $\mathcal{B}$ and $\mathcal{B}'$.

Note that if $\mathcal{B}$ is a Hilbert space, then the above definition of RKBS coincides with that of an RKHS. Let $(\cdot, \cdot)_\mathcal{B}$ be a bilinear form on $\mathcal{B} \times \mathcal{B}'$ wherein $(f, g^*)_\mathcal{B} := g^*(f)$, $f \in \mathcal{B}$, $g^* \in \mathcal{B}'$. Theorem 2 in [24] shows that if $\mathcal{B}$ is an RKBS on $\mathcal{X}$, then there exists a unique function $K : \mathcal{X} \times \mathcal{X} \to \mathbb{C}$, called the reproducing kernel (r.k.) of $\mathcal{B}$, such that the following hold:

(a1) $K(x, \cdot) \in \mathcal{B}$, $K(\cdot, x) \in \mathcal{B}'$, $x \in \mathcal{X}$,
(a2) $f(x) = (f, K(\cdot, x))_\mathcal{B}$, $f^*(x) = (K(x, \cdot), f^*)_\mathcal{B}$, $f \in \mathcal{B}$, $f^* \in \mathcal{B}'$, $x \in \mathcal{X}$.

Note that $K$ satisfies $K(x, y) = (K(x, \cdot), K(\cdot, y))_\mathcal{B}$ and therefore $K(\cdot, x)$ and $K(x, \cdot)$ are reproducing kernels for $\mathcal{B}$ and $\mathcal{B}'$ respectively. When $\mathcal{B}$ is an RKHS, $K$ is indeed the r.k. in the usual sense. Though an RKBS has exactly one r.k., different RKBSs may have the same r.k. (see Example 1), unlike an RKHS, where no two RKHSs can have the same r.k. (by the Moore-Aronszajn theorem [4]). Due to the lack of an inner product in $\mathcal{B}$ (unlike in an RKHS), it can be shown that the r.k. for a general RKBS can be any arbitrary function on $\mathcal{X} \times \mathcal{X}$ for a finite set $\mathcal{X}$ [24]. 
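In the Hilbert case, the reproducing property in (a2) reduces to $f(x) = \langle f, k(\cdot, x)\rangle_\mathcal{H}$, which can be verified numerically for functions in the span of kernel sections. A small sketch (the Gaussian kernel and all names are our illustrative choices):

```python
import numpy as np

def k(x, y, sigma=1.0):
    """Gaussian kernel; in the Hilbert case (a2) reads f(x) = <f, k(., x)>_H."""
    return np.exp(-(x - y)**2 / (2 * sigma**2))

rng = np.random.default_rng(1)
centers = rng.normal(size=6)      # x_1, ..., x_6
alpha = rng.normal(size=6)        # coefficients of f = sum_i alpha_i k(., x_i)

def f(x):
    return sum(a * k(x, c) for a, c in zip(alpha, centers))

# <f, k(., x)>_H = sum_i alpha_i <k(., x_i), k(., x)>_H = sum_i alpha_i k(x_i, x)
x0 = 0.3
inner = sum(a * k(c, x0) for a, c in zip(alpha, centers))
print(abs(f(x0) - inner))  # ~0: evaluation equals the inner product with k(., x0)
```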
In order to have a substitute for inner products in the Banach space setting, [24] considered RKBSs $\mathcal{B}$ that are uniformly Fréchet differentiable and uniformly convex (referred to as s.i.p. RKBSs), as this allows Hilbert space arguments to be carried over to $\mathcal{B}$---most importantly, an analogue to the Riesz representation theorem holds (see Theorem 3)---through the notion of semi-inner-product (s.i.p.) introduced by [12]. In the following, we first present results related to general s.i.p. spaces and then consider s.i.p. RKBSs.

Definition 2 (S.i.p. space). A Banach space $\mathcal{B}$ is said to be uniformly Fréchet differentiable if for all $f, g \in \mathcal{B}$, $\lim_{t \in \mathbb{R}, t \to 0} \frac{\|f + tg\|_\mathcal{B} - \|f\|_\mathcal{B}}{t}$ exists and the limit is approached uniformly for $f, g$ in the unit sphere of $\mathcal{B}$. $\mathcal{B}$ is said to be uniformly convex if for all $\epsilon > 0$, there exists a $\delta > 0$ such that $\|f + g\|_\mathcal{B} \le 2 - \delta$ for all $f, g \in \mathcal{B}$ with $\|f\|_\mathcal{B} = \|g\|_\mathcal{B} = 1$ and $\|f - g\|_\mathcal{B} \ge \epsilon$. $\mathcal{B}$ is called an s.i.p. space if it is both uniformly Fréchet differentiable and uniformly convex.

Note that uniform Fréchet differentiability and uniform convexity are properties of the norm associated with $\mathcal{B}$. [8, Theorem 3] has shown that if $\mathcal{B}$ is an s.i.p. space, then there exists a unique function $[\cdot, \cdot]_\mathcal{B} : \mathcal{B} \times \mathcal{B} \to \mathbb{C}$, called the semi-inner-product, such that for all $f, g, h \in \mathcal{B}$ and $\lambda \in \mathbb{C}$:

(a3) $[f + g, h]_\mathcal{B} = [f, h]_\mathcal{B} + [g, h]_\mathcal{B}$,
(a4) $[\lambda f, g]_\mathcal{B} = \lambda [f, g]_\mathcal{B}$, $[f, \lambda g]_\mathcal{B} = \bar{\lambda} [f, g]_\mathcal{B}$,
(a5) $[f, f]_\mathcal{B} =: \|f\|^2_\mathcal{B} > 0$ for $f \ne 0$,
(a6) (Cauchy-Schwarz) $|[f, g]_\mathcal{B}|^2 \le \|f\|^2_\mathcal{B} \|g\|^2_\mathcal{B}$,

and $\lim_{t \in \mathbb{R}, t \to 0} \frac{\|f + tg\|_\mathcal{B} - \|f\|_\mathcal{B}}{t} = \frac{\mathrm{Re}([g, f]_\mathcal{B})}{\|f\|_\mathcal{B}}$, $f, g \in \mathcal{B}$, $f \ne 0$, where $\mathrm{Re}(\alpha)$ and $\bar{\alpha}$ represent the real part and complex conjugate of a complex number $\alpha$. Note that the s.i.p. 
in general does not satisfy conjugate symmetry, $[f, g]_\mathcal{B} = \overline{[g, f]_\mathcal{B}}$ for all $f, g \in \mathcal{B}$, and therefore is not linear in the second argument, unless $\mathcal{B}$ is a Hilbert space, in which case the s.i.p. coincides with the inner product.

Suppose $\mathcal{B}$ is an s.i.p. space. Then for each $h \in \mathcal{B}$, $f \mapsto [f, h]_\mathcal{B}$ defines a continuous linear functional on $\mathcal{B}$, which can be identified with a unique element $h^* \in \mathcal{B}'$, called the dual function of $h$. By this definition of $h^*$, we have $h^*(f) = (f, h^*)_\mathcal{B} = [f, h]_\mathcal{B}$, $f, h \in \mathcal{B}$. Using the structure of the s.i.p., [8, Theorem 6] provided the following analogue in $\mathcal{B}$ to the Riesz representation theorem of Hilbert spaces.

Theorem 3 ([8]). Suppose $\mathcal{B}$ is an s.i.p. space. Then

(a7) (Riesz representation theorem) For each $g \in \mathcal{B}'$, there exists a unique $h \in \mathcal{B}$ such that $g = h^*$, i.e., $g(f) = [f, h]_\mathcal{B}$, $f \in \mathcal{B}$, and $\|g\|_{\mathcal{B}'} = \|h\|_\mathcal{B}$.
(a8) $\mathcal{B}'$ is an s.i.p. space with respect to the s.i.p. defined by $[h^*, f^*]_{\mathcal{B}'} := [f, h]_\mathcal{B}$, $f, h \in \mathcal{B}$, and $\|h^*\|_{\mathcal{B}'} := [h^*, h^*]^{1/2}_{\mathcal{B}'}$.

For more details on s.i.p. spaces, we refer the reader to [8]. A concrete example of an s.i.p. space, which will prove to be useful in Section 5, is as follows. Let $(\mathcal{X}, \mathscr{A}, \mu)$ be a measure space and $\mathcal{B} := L^p(\mathcal{X}, \mu)$ for some $p \in (1, +\infty)$. It is an s.i.p. space with dual $\mathcal{B}' := L^q(\mathcal{X}, \mu)$, where $q = \frac{p}{p-1}$. For each $f \in \mathcal{B}$, its dual element in $\mathcal{B}'$ is $f^* = \frac{f |f|^{p-2}}{\|f\|^{p-2}_{L^p(\mathcal{X}, \mu)}}$. Consequently, the semi-inner-product on $\mathcal{B}$ is
$$[f, g]_\mathcal{B} = g^*(f) = \frac{\int_\mathcal{X} f g |g|^{p-2}\, d\mu}{\|g\|^{p-2}_{L^p(\mathcal{X}, \mu)}}. \qquad (4)$$
Having introduced s.i.p. spaces, we now discuss the s.i.p. RKBS, which was studied by [24]. Using the Riesz representation for s.i.p. spaces (see (a7)), Theorem 9 in [24] shows that if $\mathcal{B}$ is an s.i.p. RKBS, then there exists a unique r.k. $K : \mathcal{X} \times \mathcal{X} \to \mathbb{C}$ and an s.i.p. 
kernel $G : \mathcal{X} \times \mathcal{X} \to \mathbb{C}$ such that:

(a9) $G(x, \cdot) \in \mathcal{B}$ for all $x \in \mathcal{X}$, $K(\cdot, x) = (G(x, \cdot))^*$, $x \in \mathcal{X}$,
(a10) $f(x) = [f, G(x, \cdot)]_\mathcal{B}$, $f^*(x) = [K(x, \cdot), f]_\mathcal{B}$ for all $f \in \mathcal{B}$, $x \in \mathcal{X}$.

It is clear that $G(x, y) = [G(x, \cdot), G(y, \cdot)]_\mathcal{B}$, $x, y \in \mathcal{X}$. Since the s.i.p. in general does not satisfy conjugate symmetry, $G$ need not be Hermitian nor pd [24, Section 4.3]. The r.k. $K$ and the s.i.p. kernel $G$ coincide when $\mathrm{span}\{G(x, \cdot) : x \in \mathcal{X}\}$ is dense in $\mathcal{B}$, which is the case when $\mathcal{B}$ is an RKHS [24, Theorems 2, 10 and 11]. This means that when $\mathcal{B}$ is an RKHS, the conditions (a9) and (a10) reduce to the well-known reproducing properties of an RKHS, with the s.i.p. reducing to an inner product.

4 RKBS Embedding of Probability Measures

In this section, we present our main contributions of deriving and analyzing the RKBS embedding of probability measures, which generalizes the theory of RKHS embeddings. First, we would like to remind the reader that the RKHS embedding in (1) can be derived by choosing $\mathcal{F} = \{f : \|f\|_\mathcal{H} \le 1\}$ in
$$\gamma_\mathcal{F}(\mathbb{P}, \mathbb{Q}) = \sup_{f \in \mathcal{F}} \left| \int_\mathcal{X} f\, d\mathbb{P} - \int_\mathcal{X} f\, d\mathbb{Q} \right|.$$
See [19, 20] for details. Similar to the RKHS case, in Theorem 4, we show that the RKBS embeddings can be obtained by choosing $\mathcal{F} = \{f : \|f\|_\mathcal{B} \le 1\}$ in $\gamma_\mathcal{F}(\mathbb{P}, \mathbb{Q})$. Interestingly, though $\mathcal{B}$ does not have an inner product, it can be seen that the structure of the semi-inner-product is sufficient to generate an embedding similar to (1).

Theorem 4. Let $\mathcal{B}$ be an s.i.p. RKBS defined on a measurable space $\mathcal{X}$ with $G$ as the s.i.p. kernel and $K$ as the reproducing kernel, with both $G$ and $K$ being measurable. Let $\mathcal{F} = \{f : \|f\|_\mathcal{B} \le 1\}$ and $G$ be bounded. 
Then
$$\gamma_K(\mathbb{P}, \mathbb{Q}) := \gamma_\mathcal{F}(\mathbb{P}, \mathbb{Q}) = \left\| \int_\mathcal{X} K(\cdot, x)\, d\mathbb{P}(x) - \int_\mathcal{X} K(\cdot, x)\, d\mathbb{Q}(x) \right\|_{\mathcal{B}'}. \qquad (5)$$
Based on Theorem 4, it is clear that $\mathbb{P}$ can be seen as being embedded into $\mathcal{B}'$ as $\mathbb{P} \mapsto \int_\mathcal{X} K(\cdot, x)\, d\mathbb{P}(x)$, and $\gamma_K(\mathbb{P}, \mathbb{Q})$ is the distance between the embeddings of $\mathbb{P}$ and $\mathbb{Q}$. Therefore, we arrive at an embedding which looks similar to (1) and coincides with (1) when $\mathcal{B}$ is an RKHS.

Given these embeddings, two questions that need to be answered for these embeddings to be practically useful are: (⋆) When is the embedding injective? and (⋆⋆) Can $\gamma_K(\mathbb{P}, \mathbb{Q})$ in (5) be estimated consistently and computed efficiently from finite random samples drawn i.i.d. from $\mathbb{P}$ and $\mathbb{Q}$? The significance of (⋆) is that if (3) is injective, then such an embedding can be used to differentiate between different $\mathbb{P}$ and $\mathbb{Q}$, which can then be used in applications like two-sample tests to differentiate between $\mathbb{P}$ and $\mathbb{Q}$ based on samples drawn i.i.d. from them, if the answer to (⋆⋆) is affirmative. These questions are answered in the following sections.

Before that, we show how these questions are important in binary classification. Following [18], it can be shown that $\gamma_K$ is the negative of the optimal risk associated with a Parzen window classifier in $\mathcal{B}'$ that separates the class-conditional distributions $\mathbb{P}$ and $\mathbb{Q}$ (refer to the supplementary material for details). This means that if (3) is not injective, then the maximum risk is attained for $\mathbb{P} \ne \mathbb{Q}$, i.e., distinct distributions are not classifiable. Therefore, the injectivity of (3) is of primal importance in applications. 
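In the Hilbert case, the Parzen window classifier mentioned above has a simple explicit form: a point is assigned to the class whose empirical kernel mean scores higher. A minimal sketch (the Gaussian kernel, the class-conditional distributions and all names are our illustrative choices):

```python
import numpy as np

def k(x, y, sigma=1.0):
    return np.exp(-(x[:, None] - y[None, :])**2 / (2 * sigma**2))

rng = np.random.default_rng(2)
X = rng.normal(-1.0, 1.0, 400)   # class-conditional samples from P
Y = rng.normal(+1.0, 1.0, 400)   # class-conditional samples from Q

def parzen_classify(t):
    """Sign of the empirical witness f(t) = mean_i k(t, X_i) - mean_j k(t, Y_j)."""
    t = np.atleast_1d(np.asarray(t, dtype=float))
    return np.where(k(t, X).mean(1) >= k(t, Y).mean(1), -1, +1)  # -1 -> P, +1 -> Q

print(parzen_classify([-2.0, 2.0]))   # points near each class mean get their label
```

The negative of the empirical risk of this rule is, up to normalization, the empirical version of the distance between the two embedded class-conditional distributions.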
In addition, the question in (⋆⋆) is critical as well, as it relates to the consistency of the Parzen window classifier.

4.1 When is (3) injective?

The following result provides various characterizations for the injectivity of (3), which are similar to (but more general than) those obtained for the injectivity of (1), and coincide with the latter when $\mathcal{B}$ is an RKHS.

Theorem 5 (Injectivity of $\gamma_K$). Suppose $\mathcal{B}$ is an s.i.p. RKBS defined on a topological space $\mathcal{X}$ with $K$ and $G$ as its r.k. and s.i.p. kernel respectively. Then the following hold:
(a) Let $\mathcal{X}$ be a Polish space that is also locally compact Hausdorff. Suppose $G$ is bounded and $K(x, \cdot) \in C_0(\mathcal{X})$ for all $x \in \mathcal{X}$. Then (3) is injective if $\mathcal{B}$ is dense in $C_0(\mathcal{X})$.
(b) Suppose the conditions in (a) hold. Then (3) is injective if $\mathcal{B}$ is dense in $L^p(\mathcal{X}, \mu)$ for any Borel probability measure $\mu$ on $\mathcal{X}$ and some $p \in [1, \infty)$.

Since it is not easy to check for the denseness of $\mathcal{B}$ in $C_0(\mathcal{X})$ or $L^p(\mathcal{X}, \mu)$, in Theorem 6, we present an easily checkable characterization for the injectivity of (3) when $K$ is bounded, continuous and translation invariant on $\mathbb{R}^d$. Note that Theorem 6 generalizes the characterization (see [19, 20]) for the injectivity of the RKHS embedding in (1).

Theorem 6 (Injectivity of $\gamma_K$ for translation invariant $K$). Let $\mathcal{X} = \mathbb{R}^d$. Suppose $K(x, y) = \psi(x - y)$, where $\psi : \mathbb{R}^d \to \mathbb{R}$ is of the form $\psi(x) = \int_{\mathbb{R}^d} e^{i\langle x, \omega\rangle}\, d\Lambda(\omega)$ and $\Lambda$ is a finite complex-valued Borel measure on $\mathbb{R}^d$. Then (3) is injective if $\mathrm{supp}(\Lambda) = \mathbb{R}^d$. In addition, if $K$ is symmetric, then the converse holds.

Remark 7. If $\psi$ in Theorem 6 is a real-valued pd function, then by Bochner's theorem, $\Lambda$ has to be real, nonnegative and symmetric, i.e., $\Lambda(d\omega) = \Lambda(-d\omega)$. Since $\psi$ need not be a pd function for $K$ to be a real, symmetric r.k. of $\mathcal{B}$, $\Lambda$ need not be nonnegative. 
More generally, if $\psi$ is a real-valued function on $\mathbb{R}^d$, then $\Lambda$ is conjugate symmetric, i.e., $\Lambda(d\omega) = \overline{\Lambda(-d\omega)}$. An example of a translation invariant, real and symmetric (but not pd) r.k. that satisfies the conditions of Theorem 6 can be obtained with $\psi(x) = (4x^6 + 9x^4 - 18x^2 + 15) \exp(-x^2)$. See Example 3 for more details.

4.2 Consistency Analysis

Consider a two-sample test, wherein given two sets of random samples, $\{X_j\}_{j=1}^m$ and $\{Y_j\}_{j=1}^n$, drawn i.i.d. from distributions $\mathbb{P}$ and $\mathbb{Q}$ respectively, it is required to test whether $\mathbb{P} = \mathbb{Q}$ or not. Given a metric $\gamma_K$ on $\mathcal{P}(\mathcal{X})$, the problem can equivalently be posed as testing whether $\gamma_K(\mathbb{P}, \mathbb{Q}) = 0$ or not, based on $\{X_j\}_{j=1}^m$ and $\{Y_j\}_{j=1}^n$, in which case $\gamma_K(\mathbb{P}, \mathbb{Q})$ is estimated based on these random samples. For the test to be meaningful, it is important that this estimate of $\gamma_K$ is consistent. [9] showed that $\gamma_K(\mathbb{P}_m, \mathbb{Q}_n)$ is a consistent estimator of $\gamma_K(\mathbb{P}, \mathbb{Q})$ when $\mathcal{B}$ is an RKHS, where $\mathbb{P}_m := \frac{1}{m} \sum_{j=1}^m \delta_{X_j}$, $\mathbb{Q}_n := \frac{1}{n} \sum_{j=1}^n \delta_{Y_j}$ and $\delta_x$ represents the Dirac measure at $x \in \mathcal{X}$. Theorem 9 generalizes the consistency result in [9] by showing that $\gamma_K(\mathbb{P}_m, \mathbb{Q}_n)$ is a consistent estimator of $\gamma_K(\mathbb{P}, \mathbb{Q})$, and the rate of convergence is $O(m^{(1-t)/t} + n^{(1-t)/t})$ if $\mathcal{B}'$ is of type $t$, $1 < t \le 2$. Before we present the result, we define the type of a Banach space $\mathcal{B}$ [2, p. 303].

Definition 8 (Rademacher type of $\mathcal{B}$). Let $1 \le t \le 2$. A Banach space $\mathcal{B}$ is said to be of $t$-Rademacher type (or, more shortly, of type $t$) if there exists a constant $C^*$ such that for any $N \ge 1$ and any $\{f_j\}_{j=1}^N \subset \mathcal{B}$: $\left(\mathbb{E}\left\| \sum_{j=1}^N \varrho_j f_j \right\|^t_\mathcal{B}\right)^{1/t} \le C^* \left(\sum_{j=1}^N \|f_j\|^t_\mathcal{B}\right)^{1/t}$, where $\{\varrho_j\}_{j=1}^N$ are i.i.d. Rademacher (symmetric $\pm 1$-valued) random variables.

Clearly, every Banach space is of type 1. Since having type $t'$ for $t' > t$ implies having type $t$, let us define $t^*(\mathcal{B}) := \sup\{t : \mathcal{B}$ has type $t\}$.

Theorem 9 (Consistency of $\gamma_K(\mathbb{P}_m, \mathbb{Q}_n)$). Let $\mathcal{B}$ be an s.i.p. RKBS. Assume $\nu := \sup\{\sqrt{G(x, x)} : x \in \mathcal{X}\} < \infty$. Fix $\delta \in (0, 1)$. Then with probability $1 - \delta$ over the choice of samples $\{X_j\}_{j=1}^m$ drawn i.i.d. from $\mathbb{P}$ and $\{Y_j\}_{j=1}^n$ drawn i.i.d. from $\mathbb{Q}$, we have
$$|\gamma_K(\mathbb{P}_m, \mathbb{Q}_n) - \gamma_K(\mathbb{P}, \mathbb{Q})| \le 2 C^* \nu \left(m^{\frac{1-t}{t}} + n^{\frac{1-t}{t}}\right) + \sqrt{18 \nu^2 \log(4/\delta)} \left(m^{-\frac{1}{2}} + n^{-\frac{1}{2}}\right),$$
where $t = t^*(\mathcal{B}')$ and $C^*$ is some universal constant.

It is clear from Theorem 9 that if $t^*(\mathcal{B}') \in (1, 2]$, then $\gamma_K(\mathbb{P}_m, \mathbb{Q}_n)$ is a consistent estimator of $\gamma_K(\mathbb{P}, \mathbb{Q})$. In addition, the best rate is obtained if $t^*(\mathcal{B}') = 2$, which is the case if $\mathcal{B}$ is an RKHS. In Section 5, we will provide examples of s.i.p. RKBSs that satisfy $t^*(\mathcal{B}') = 2$.

4.3 Computation of $\gamma_K(\mathbb{P}, \mathbb{Q})$

We now consider the problem of computing $\gamma_K(\mathbb{P}, \mathbb{Q})$ and $\gamma_K(\mathbb{P}_m, \mathbb{Q}_n)$. Define $\lambda^*_\mathbb{P} := \int_\mathcal{X} K(\cdot, x)\, d\mathbb{P}(x)$. Consider
$$\begin{aligned}
\gamma^2_K(\mathbb{P}, \mathbb{Q}) &= \|\lambda^*_\mathbb{P} - \lambda^*_\mathbb{Q}\|^2_{\mathcal{B}'} \overset{(a5)}{=} [\lambda^*_\mathbb{P} - \lambda^*_\mathbb{Q}, \lambda^*_\mathbb{P} - \lambda^*_\mathbb{Q}]_{\mathcal{B}'} \overset{(a3)}{=} [\lambda^*_\mathbb{P}, \lambda^*_\mathbb{P} - \lambda^*_\mathbb{Q}]_{\mathcal{B}'} - [\lambda^*_\mathbb{Q}, \lambda^*_\mathbb{P} - \lambda^*_\mathbb{Q}]_{\mathcal{B}'} \\
&= \left[\int_\mathcal{X} K(\cdot, x)\, d\mathbb{P}(x),\, \lambda^*_\mathbb{P} - \lambda^*_\mathbb{Q}\right]_{\mathcal{B}'} - \left[\int_\mathcal{X} K(\cdot, x)\, d\mathbb{Q}(x),\, \lambda^*_\mathbb{P} - \lambda^*_\mathbb{Q}\right]_{\mathcal{B}'} \\
&\overset{(*)}{=} \int_\mathcal{X} [K(\cdot, x), \lambda^*_\mathbb{P} - \lambda^*_\mathbb{Q}]_{\mathcal{B}'}\, d\mathbb{P}(x) - \int_\mathcal{X} [K(\cdot, x), \lambda^*_\mathbb{P} - \lambda^*_\mathbb{Q}]_{\mathcal{B}'}\, d\mathbb{Q}(x) \\
&= \int_\mathcal{X} \left[K(\cdot, x),\, \int_\mathcal{X} K(\cdot, y)\, d(\mathbb{P} - \mathbb{Q})(y)\right]_{\mathcal{B}'} d(\mathbb{P} - \mathbb{Q})(x), \qquad (6)
\end{aligned}$$
where $(*)$ is proved in the supplementary material. (6) is not reducible as the s.i.p. 
is not linear in the second argument unless $\mathcal{B}$ is a Hilbert space. This means that $\gamma_K(\mathbb{P}, \mathbb{Q})$ is not representable in terms of the kernel function $K(x, y)$, unlike in the case of $\mathcal{B}$ being an RKHS, in which case the s.i.p. in (6) reduces to an inner product, providing
$$\gamma^2_K(\mathbb{P}, \mathbb{Q}) = \iint_\mathcal{X} K(x, y)\, d(\mathbb{P} - \mathbb{Q})(x)\, d(\mathbb{P} - \mathbb{Q})(y).$$
Since this issue holds for any $\mathbb{P}, \mathbb{Q} \in \mathcal{P}(\mathcal{X})$, it also holds for $\mathbb{P}_m$ and $\mathbb{Q}_n$, which means $\gamma_K(\mathbb{P}_m, \mathbb{Q}_n)$ cannot be computed in closed form in terms of the kernel $K(x, y)$, unlike in the case of an RKHS, where $\gamma_K(\mathbb{P}_m, \mathbb{Q}_n)$ can be written as a simple V-statistic that depends only on $K(x, y)$ computed at $\{X_j\}_{j=1}^m$ and $\{Y_j\}_{j=1}^n$. This is one of the main drawbacks of the RKBS approach, where the s.i.p. structure does not allow closed form representations in terms of the kernel $K$ (also see [24], where regularization algorithms derived in RKBSs are not solvable, unlike in an RKHS), and therefore could limit its practical viability. However, in the following section, we present non-trivial examples of s.i.p. RKBSs for which $\gamma_K(\mathbb{P}, \mathbb{Q})$ and $\gamma_K(\mathbb{P}_m, \mathbb{Q}_n)$ can be obtained in closed form.

5 Concrete Examples of RKBS Embeddings

In this section, we present examples of RKBSs and then derive the corresponding $\gamma_K(\mathbb{P}, \mathbb{Q})$ and $\gamma_K(\mathbb{P}_m, \mathbb{Q}_n)$ in closed form. To elaborate, we present three examples that cover the spectrum: Example 1 deals with an RKBS (in fact a family of RKBSs induced by the same r.k.) whose r.k. is pd, Example 2 with an RKBS whose r.k. is not symmetric and therefore not pd, and Example 3 with an RKBS whose r.k. is symmetric but not pd. These examples show that the Banach space embeddings result in richer metrics on $\mathcal{P}(\mathcal{X})$ than those obtained through RKHS embeddings.

Example 1 ($K$ is positive definite). Let $\mu$ be a finite nonnegative Borel measure on $\mathbb{R}^d$. 
Then for any $1 < p < \infty$ with $q = \frac{p}{p-1}$,
$$\mathcal{B}^{pd}_p(\mathbb{R}^d) := \left\{ f_u(x) = \int_{\mathbb{R}^d} u(t)\, e^{i\langle x, t\rangle}\, d\mu(t) : u \in L^p(\mathbb{R}^d, \mu),\ x \in \mathbb{R}^d \right\} \qquad (7)$$
is an RKBS with $K(x, y) = G(x, y) = (\mu(\mathbb{R}^d))^{(p-2)/p} \int_{\mathbb{R}^d} e^{-i\langle x - y, t\rangle}\, d\mu(t)$ as the r.k. and
$$\gamma_K(\mathbb{P}, \mathbb{Q}) = \left\| \int_{\mathbb{R}^d} e^{i\langle x, \cdot\rangle}\, d(\mathbb{P} - \mathbb{Q})(x) \right\|_{L^q(\mathbb{R}^d, \mu)} = \|\phi_\mathbb{P} - \phi_\mathbb{Q}\|_{L^q(\mathbb{R}^d, \mu)}. \qquad (8)$$
First note that $K$ is a translation invariant pd kernel on $\mathbb{R}^d$, as it is the Fourier transform of a nonnegative finite Borel measure $\mu$, which follows from Bochner's theorem. Therefore, though the s.i.p. kernel and the r.k. of an RKBS need not be symmetric, the space in (7) is an interesting example of an RKBS which is induced by a pd kernel. In particular, it can be seen that many RKBSs ($\mathcal{B}^{pd}_p(\mathbb{R}^d)$ for any $1 < p < \infty$) have the same r.k. (ignoring the scaling factor, which can be made one for any $p$ by choosing $\mu$ to be a probability measure). Second, note that $\mathcal{B}^{pd}_p$ is an RKHS when $p = q = 2$, and therefore (8) generalizes $\gamma_k(\mathbb{P}, \mathbb{Q}) = \|\phi_\mathbb{P} - \phi_\mathbb{Q}\|_{L^2(\mathbb{R}^d, \mu)}$. By Theorem 6, it is clear that $\gamma_K$ in (8) is a metric on $\mathcal{P}(\mathbb{R}^d)$ if and only if $\mathrm{supp}(\mu) = \mathbb{R}^d$. Refer to the supplementary material for an interpretation of $\mathcal{B}^{pd}_p(\mathbb{R}^d)$ as a generalization of a Sobolev space [23, Chapter 10].

Example 2 ($K$ is not symmetric). Let $\mu$ be a finite nonnegative Borel measure such that its moment-generating function, i.e., $M_\mu(x) := \int_{\mathbb{R}^d} e^{\langle x, t\rangle}\, d\mu(t)$, exists. Then for any $1 < p < \infty$ with $q = \frac{p}{p-1}$,
$$\mathcal{B}^{ns}_p(\mathbb{R}^d) := \left\{ f_u(x) = \int_{\mathbb{R}^d} u(t)\, e^{\langle x, t\rangle}\, d\mu(t) : u \in L^p(\mathbb{R}^d, \mu),\ x \in \mathbb{R}^d \right\}$$
is an RKBS with $K(x, y) = G(x, y) = (M_\mu(qx))^{(p-2)/p}\, M_\mu(x(q - 1) + y)$ as the r.k. Suppose $\mathbb{P}$ and $\mathbb{Q}$ are such that $M_\mathbb{P}$ and $M_\mathbb{Q}$ exist. Then $\gamma_K(\mathbb{P}, \mathbb{Q}) = \| \int_{\mathbb{R}^d} e^{\langle x, \cdot\rangle}\, d(\mathbb{P} - \mathbb{Q})(x) \|_{L^q(\mathbb{R}^d, \mu)} = \|M_\mathbb{P} - M_\mathbb{Q}\|_{L^q(\mathbb{R}^d, \mu)}$, which is the weighted $L^q$ distance between the moment-generating functions of $\mathbb{P}$ and $\mathbb{Q}$. It is easy to see that if $\mathrm{supp}(\mu) = \mathbb{R}^d$, then $\gamma_K(\mathbb{P}, \mathbb{Q}) = 0 \Rightarrow M_\mathbb{P} = M_\mathbb{Q}$ a.e. $\Rightarrow \mathbb{P} = \mathbb{Q}$, which means $\gamma_K$ is a metric on $\mathcal{P}(\mathbb{R}^d)$. Note that $K$ is not symmetric (for $q \ne 2$) and therefore is not pd. When $p = q = 2$, $K(x, y) = M_\mu(x + y)$ is pd and $\mathcal{B}^{ns}_p(\mathbb{R}^d)$ is an RKHS.

Example 3 ($K$ is symmetric but not positive definite). Let $\psi(x) = A e^{-x^2} (4x^6 + 9x^4 - 18x^2 + 15)$ with $A := (1/243)(4\pi^2/25)^{1/6}$. Then
$$\mathcal{B}^{snpd}_{3/2}(\mathbb{R}) := \left\{ f_u(x) = \int_\mathbb{R} (x - t)^2\, e^{-\frac{3(x-t)^2}{2}}\, u(t)\, dt : u \in L^{3/2}(\mathbb{R}),\ x \in \mathbb{R} \right\}$$
is an RKBS with r.k. $K(x, y) = G(x, y) = \psi(x - y)$. Clearly, $\psi$ and therefore $K$ are not pd (though symmetric on $\mathbb{R}$), as $\hat{\psi}(x) = -\frac{e^{-x^2/4}}{34992\sqrt{2}}\, (x^6 - 39x^4 + 216x^2 - 324)$ is not nonnegative at every $x \in \mathbb{R}$. Refer to the supplementary material for the derivation of $K$ and $\hat{\psi}$. In addition, $\gamma_K(\mathbb{P}, \mathbb{Q}) = \| \int_\mathbb{R} \theta(\cdot - x)\, d(\mathbb{P} - \mathbb{Q})(x) \|_{L^3(\mathbb{R})} = \|(\hat{\theta}\, (\phi_\mathbb{P} - \phi_\mathbb{Q}))^\vee\|_{L^3(\mathbb{R})}$, where $\theta(t) = t^2 e^{-\frac{3}{2} t^2}$. Since $\mathrm{supp}(\hat{\theta}) = \mathbb{R}$, we have $\gamma_K(\mathbb{P}, \mathbb{Q}) = 0 \Rightarrow (\hat{\theta}\, (\phi_\mathbb{P} - \phi_\mathbb{Q}))^\vee = 0 \Rightarrow \hat{\theta}\, (\phi_\mathbb{P} - \phi_\mathbb{Q}) = 0 \Rightarrow \phi_\mathbb{P} = \phi_\mathbb{Q}$ a.e., which implies $\mathbb{P} = \mathbb{Q}$, and therefore $\gamma_K$ is a metric on $\mathcal{P}(\mathbb{R})$.

So far, we have presented different examples of RKBSs, wherein we have demonstrated the nature of the r.k., derived the Banach space embeddings in closed form and studied the conditions under which they are injective. These examples also show that the RKBS embeddings result in richer distance measures on probabilities compared to those obtained by the RKHS embeddings---an advantage gained by moving from Hilbert to Banach spaces. Now, we consider the problem of computing $\gamma_K(\mathbb{P}_m, \mathbb{Q}_n)$ in closed form and its consistency. 
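The claim in Example 3 that $\psi$ is not pd can be checked numerically: by Bochner's theorem a continuous pd function has a nonnegative Fourier transform, so it suffices to find a frequency where $\hat{\psi}$ is negative. A quadrature sketch (the constant $A$ is dropped since it does not affect the sign; the grid and tolerances are our choices):

```python
import numpy as np

x = np.linspace(-15, 15, 6001)
dx = x[1] - x[0]
psi = (4*x**6 + 9*x**4 - 18*x**2 + 15) * np.exp(-x**2)  # Example 3's psi, with A = 1

def psi_hat(w):
    # Unitary Fourier transform by direct quadrature; psi is real and even,
    # so the transform is real up to numerical error.
    return ((2*np.pi)**-0.5 * np.sum(psi * np.exp(-1j * x * w)) * dx).real

vals = np.array([psi_hat(w) for w in np.linspace(0.0, 8.0, 81)])
print(vals.min() < 0 < vals.max())  # True: psi_hat changes sign, so psi is not pd
```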
In Section 4.3, we showed that \u03b3K(Pm, Qn) does\nnot have a nice closed form expression unlike in the case of B being an RKHS. However, in the\nfollowing, we show that for K in Examples 1\u20133, \u03b3K(Pm, Qn) has a closed form expression for\ncertain choices of q. Let us consider the estimation of \u03b3K(P, Q):\n\n2 (R), x \u2208 R(cid:27)\n\n2 t2.\n\n3\n\n4\n\n\u03b3q\n\nK(Pm, Qn) =(cid:13)(cid:13)(cid:13)(cid:13)\nZX\n=ZX (cid:12)(cid:12)(cid:12)\n\nq\n\nb(x,\u00b7) d(Pm \u2212 Qn)(x)(cid:13)(cid:13)(cid:13)(cid:13)\nnXj=1\nmXj=1\n\nb(Xj, t) \u2212\n\n1\nn\n\n1\nm\n\nLq(X ,\u00b5)\n\nb(Yj, t)(cid:12)(cid:12)(cid:12)\n\n7\n\n=ZX (cid:12)(cid:12)(cid:12)ZX\n\nq\n\nd\u00b5(t),\n\nb(x, t) d(Pm \u2212 Qn)(x)(cid:12)(cid:12)(cid:12)\n\nq\n\nd\u00b5(t)\n\n(9)\n\n\fwhere b(x, t) = eihx,ti in Example 1, b(x, t) = ehx,ti in Example 2 and b(x, t) = \u03b8(x \u2212 t) with\nq = 3 and \u00b5 being the Lebesgue measure in Example 3. Since the duals of RKBSs considered\nin Examples 1\u20133 are of type min(q, 2) for 1 \u2264 q \u2264 \u221e [2, p. 304], by Theorem 9, \u03b3K(Pm, Qn)\n) for q \u2208\nestimates \u03b3K(P, Q) consistently at a convergence rate of O(m\n(1,\u221e), with the best rate of O(m\u22121/2 + n\u22121/2) attainable when q \u2208 [2,\u221e). This means for\nq \u2208 (2,\u221e), the same rate as attainable by the RKHS can be achieved. Now, the problem reduces\nto computing \u03b3K(Pm, Qn). Note that (9) cannot be computed in a closed form for all q\u2014see the\ndiscussion in the supplementary material about approximating \u03b3K(Pm, Qn). 
However, when q = 2, (9) can be computed very efficiently in closed form (in terms of K) as a V-statistic [9], given by

    γ_K^2(Pm, Qn) = (1/m²) Σ_{j,l=1}^m K(Xj, Xl) + (1/n²) Σ_{j,l=1}^n K(Yj, Yl) − (2/(mn)) Σ_{j=1}^m Σ_{l=1}^n K(Xj, Yl).        (10)

More generally, it can be shown that if q = 2s, s ∈ ℕ, then (9) reduces to

    γ_K^q(Pm, Qn) = ∫_X ··· ∫_X A(x1, . . . , xq) Π_{j=1}^q d(Pm − Qn)(xj),  where  A(x1, . . . , xq) := ∫_X Π_{j=1}^s b(x_{2j−1}, t) b(x_{2j}, t) dµ(t),        (11)

for which closed form computation is possible for appropriate choices of b and µ. Refer to the supplementary material for the derivation of (11). For b and µ as in Example 1, we have A(x1, . . . , xq) = (µ(ℝ^d))^{(2−p)/p} K(Σ_{j=1}^s x_{2j−1}, Σ_{j=1}^s x_{2j}), while for b and µ as in Example 2, we have A(x1, . . . , xq) = M_µ(Σ_{j=1}^q xj). By appropriately choosing θ and µ in Example 3, we can obtain a closed form expression for A(x1, . . . , xq), which is proved in the supplementary material. Note that choosing s = 1 in (11) results in (10). (11) shows that γ_K^q(Pm, Qn) can be computed in closed form in terms of A at a complexity of O(m^q), assuming m = n, which means the least complexity is obtained for q = 2. The above discussion shows that for appropriate choices of q, i.e., q ∈ (2, ∞), the RKBS embeddings in Examples 1–3 are useful in practice, as γK(Pm, Qn) is consistent and has a closed form expression.
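For q = 2, (10) is exactly the (biased) V-statistic estimate of the maximum mean discrepancy [9] and can be computed directly from three kernel matrices. A minimal sketch follows, using a Gaussian kernel purely as an illustrative pd choice (the paper's kernels K differ by example):

```python
import numpy as np

def gamma2_vstat(K, X, Y):
    """Equation (10): (1/m^2) sum_{j,l} K(X_j, X_l)
    + (1/n^2) sum_{j,l} K(Y_j, Y_l) - (2/(m n)) sum_{j,l} K(X_j, Y_l)."""
    m, n = len(X), len(Y)
    return K(X, X).sum() / m**2 + K(Y, Y).sum() / n**2 \
        - 2.0 * K(X, Y).sum() / (m * n)

def gaussian_kernel(A, B, sigma=1.0):
    # illustrative pd kernel K(x, y) = exp(-||x - y||^2 / (2 sigma^2))
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma**2))
```

This costs O(m²) kernel evaluations (for m = n), matching the O(m^q) count of (11) at q = 2, the cheapest case.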
However, the drawback of the RKBS framework is that the computation of γK(Pm, Qn) is more involved than that of its RKHS counterpart.

6 Conclusion & Discussion

With the motivation of studying the advantages and disadvantages of generalizing Hilbert space learning algorithms to Banach spaces, in this paper we generalized the notion of RKHS embedding of probability measures to Banach spaces, in particular to RKBSs that are uniformly Fréchet differentiable and uniformly convex; note that this is equivalent to generalizing an RKHS based Parzen window classifier to an RKBS. While we showed that most of the RKHS results, such as injectivity of the embedding and consistency of the Parzen window classifier, generalize nicely to RKBSs, yielding richer distance measures on probabilities, the generalized notion is less attractive in practice than its RKHS counterpart because of the computational disadvantage associated with it. Since most of the existing literature on generalizing kernel methods to Banach spaces deals with algorithms more complex than the simple Parzen window classifier considered in this paper, we believe that most of these algorithms may have limited practical applicability, though they are theoretically appealing. This therefore raises an important open problem: developing computationally efficient Banach space based learning algorithms.

Acknowledgments

The authors thank the anonymous reviewers for their constructive comments, which improved the presentation of the paper. Part of the work was done while B. K. S. was a Ph.D. student at UC San Diego. B. K. S. and G. R. G. L. acknowledge support from the National Science Foundation (grants DMS-MSPA 0625409 and IIS-1054960). K. F. was supported in part by JSPS KAKENHI (B) 22300098.

References
[1] N. Aronszajn. Theory of reproducing kernels. Trans. Amer. Math. Soc., 68:337–404, 1950.
[2] B.
Beauzamy. Introduction to Banach Spaces and their Geometry. North-Holland, The Netherlands, 1985.
[3] K. Bennett and E. Bredensteiner. Duality and geometry in SVM classifiers. In Proc. 17th International Conference on Machine Learning, pages 57–64, 2000.
[4] A. Berlinet and C. Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer Academic Publishers, London, UK, 2004.
[5] R. Der and D. Lee. Large-margin classification in Banach spaces. In JMLR Workshop and Conference Proceedings, volume 2, pages 91–98. AISTATS, 2007.
[6] K. Fukumizu, F. R. Bach, and M. I. Jordan. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research, 5:73–99, 2004.
[7] K. Fukumizu, A. Gretton, X. Sun, and B. Schölkopf. Kernel measures of conditional dependence. In J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 489–496, Cambridge, MA, 2008. MIT Press.
[8] J. R. Giles. Classes of semi-inner-product spaces. Trans. Amer. Math. Soc., 129:436–446, 1967.
[9] A. Gretton, K. M. Borgwardt, M. Rasch, B. Schölkopf, and A. Smola. A kernel method for the two sample problem. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 513–520. MIT Press, 2007.
[10] A. Gretton, K. Fukumizu, C. H. Teo, L. Song, B. Schölkopf, and A. J. Smola. A kernel statistical test of independence. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 585–592. MIT Press, 2008.
[11] M. Hein, O. Bousquet, and B. Schölkopf. Maximal margin classification for metric spaces. J. Comput. System Sci., 71:333–359, 2005.
[12] G. Lumer. Semi-inner-product spaces. Trans. Amer. Math. Soc., 100:29–43, 1961.
[13] C. A.
Micchelli and M. Pontil. A function representation for learning in Banach spaces. In Conference on Learning Theory, 2004.
[14] B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.
[15] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, UK, 2004.
[16] A. J. Smola, A. Gretton, L. Song, and B. Schölkopf. A Hilbert space embedding for distributions. In Proc. 18th International Conference on Algorithmic Learning Theory, pages 13–31. Springer-Verlag, Berlin, Germany, 2007.
[17] K. Sridharan and A. Tewari. Convex games in Banach spaces. In Conference on Learning Theory, 2010.
[18] B. K. Sriperumbudur, K. Fukumizu, A. Gretton, G. R. G. Lanckriet, and B. Schölkopf. Kernel choice and classifiability for RKHS embeddings of probability distributions. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 1750–1758. MIT Press, 2009.
[19] B. K. Sriperumbudur, A. Gretton, K. Fukumizu, G. R. G. Lanckriet, and B. Schölkopf. Injective Hilbert space embeddings of probability measures. In R. Servedio and T. Zhang, editors, Proc. of the 21st Annual Conference on Learning Theory, pages 111–122, 2008.
[20] B. K. Sriperumbudur, A. Gretton, K. Fukumizu, B. Schölkopf, and G. R. G. Lanckriet. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11:1517–1561, 2010.
[21] H. Tong, D.-R. Chen, and F. Yang. Least square regression with ℓp-coefficient regularization. Neural Computation, 22:3221–3235, 2010.
[22] U. von Luxburg and O. Bousquet. Distance-based classification with Lipschitz functions. Journal of Machine Learning Research, 5:669–695, 2004.
[23] H. Wendland. Scattered Data Approximation. Cambridge University Press, Cambridge, UK, 2005.
[24] H.
Zhang, Y. Xu, and J. Zhang. Reproducing kernel Banach spaces for machine learning. Journal of Machine Learning Research, 10:2741–2775, 2009.