{"title": "Semigroup Kernels on Finite Sets", "book": "Advances in Neural Information Processing Systems", "page_first": 329, "page_last": 336, "abstract": null, "full_text": " Semigroup Kernels on Finite Sets\n\n\n\n Marco Cuturi Jean-Philippe Vert\n Computational Biology Group Computational Biology Group\n Ecole des Mines de Paris Ecole des Mines de Paris\n 35 rue Saint Honore 35 rue Saint Honore\n 77300 Fontainebleau 77300 Fontainebleau\n marco.cuturi@ensmp.fr jean-philippe.vert@ensmp.fr\n\n\n\n\n Abstract\n\n Complex objects can often be conveniently represented by finite sets of\n simpler components, such as images by sets of patches or texts by bags\n of words. We study the class of positive definite (p.d.) kernels for two\n such objects that can be expressed as a function of the merger of their\n respective sets of components. We prove a general integral representa-\n tion of such kernels and present two particular examples. One of them\n leads to a kernel for sets of points living in a space endowed itself with a\n positive definite kernel. We provide experimental results on a benchmark\n experiment of handwritten digits image classification which illustrate the\n validity of the approach.\n\n\n1 Introduction\n\nSuppose we are to deal with complex (e.g non-vectorial) objects from a set Z on which\nwe wish to apply existing kernel methods [1] to perform tasks such as classification or\nregression. Assume furthermore that the latter objects can be meaningfully described by\nsmall components contained in a set X. Namely, we suppose that we can define an a\npriori mapping which maps any z Z into a finite unordered list of elements of X,\n (z) = [x1, x2, ..., xn], through a sampling process which may be exhaustive, heuristic or\nrandom both in the quantity of sampled components n and in the way those components\nare extracted. 
Comparing two such complex objects through the direct comparison of their
respective lists of components has attracted much attention recently, namely through the
definition of p.d. kernels on such Φ-lists. Most recent approaches to comparing two Φ-lists
involve the estimation of two distributions pz and pz' on X, within a parametric class of
models, that fit (e.g. in the maximum likelihood (ML) sense) Φ(z) and Φ(z') respectively,
seen as samples from laws on X, where each resulting law can be identified with z and z'
respectively. A kernel is then defined between pz and pz', as seen for example in [2]
with the Information Diffusion kernel, in [3] with the family of Mutual Information kernels,
or in [4] with the use of the Bhattacharyya affinity between pz and pz'. An alternative
and non-parametric approach to Φ-list comparison, which studies the subspaces generated by the
points of Φ(z) and Φ(z') in a feature space, was also proposed in [5], recalling elements
presented in kernel canonical correlation analysis [6].

We explore in this contribution a different direction for kernel design on lists, by studying the
class of kernels whose value on two lists is defined only through their
concatenation. This approach was already used in [7], where a particular kernel for strings
that only compares two strings through their concatenation is presented. In this paper, the
approach is extended to the more general and abstract setting of Φ-lists, but the motivation
remains the same as in [7]: if two Φ-lists are similar, e.g. in terms of the distribution of
the components they describe, then their concatenation will be more \"concentrated\" than if
they are very different, in which case it will look more like a union of two disjoint sets
of points. 
As a result, one can expect to get a relevant measure of similarity, and hence a
kernel, by studying properties of the concatenation of two lists, such as its concentration.

After an example of a valid kernel for lists seen as measures on the space of components
(Section 2), we provide a complete characterization of this class of kernels (Section 3)
by casting them in the context of semigroup kernels. This leads to the definition of a
second kernel based on exponential densities on X, which boils down, after a numerical
approximation, to the computation of the entropy of the maximum likelihood density (taken
in the considered exponential family) of the points enumerated by the lists of components. This
kernel is extended in Section 4 to points taken in a reproducing kernel Hilbert space defined
by a kernel κ on X, and is then tested on a problem of image classification, where images
are seen as bags of pixels and a non-linear kernel between pixels is used (Section 5).


2 The entropy kernel

As a warm-up, let us assume that the set X is measurable, e.g. X = R^d, and that to
any point x ∈ X we can associate a probability measure on X with density ν_x with
respect to a common base measure, with finite entropy h(ν) := −∫_X ν ln ν. Consider for
example a Gaussian distribution with mean x and fixed variance. A natural way to represent
an unordered list Φ(z) = [x1, x2, ..., xn] ∈ X^n is by the density ν = (1/n) Σ_{i=1}^n ν_{x_i}.
In that case, a p.d. kernel k between unordered lists Φ and Φ' that only depends on their
concatenation Φ(z) ∪ Φ(z') is equivalent to a p.d. kernel between the densities ν and ν' that
only depends on ν + ν'. Hence we are looking for a p.d. kernel on the set P of probability
densities of finite entropy of the form φ(ν, ν') = ψ(ν + ν'). An example of such a kernel is
provided in the following proposition. Recall that a negative definite (n.d.) kernel on a set X
is a symmetric function g : X² → R that satisfies Σ_{i,j=1}^n c_i c_j g(x_i, x_j) ≤ 0 for any
n ∈ N, (x1, ..., xn) ∈ X^n and (c1, ..., cn) ∈ R^n with Σ_{i=1}^n c_i = 0. A useful link
between p.d. and n.d. kernels is that g is n.d. if and only if exp(−t g) is p.d. for all t > 0
[8, Theorem 3.2.2].

Proposition 1. The function g : (ν, ν') ↦ h((ν + ν')/2) is negative definite on P², making
k_h(ν, ν') := e^{−t h((ν+ν')/2)} a p.d. kernel on P for any t > 0. We call k_h the entropy
kernel between two measures.

The entropy kernel is already a satisfactory answer to our initial motivation to look at the
merger of points. Observe that if ν_x is a probability density concentrated around x, then ν
can often be thought of as an estimate of the distribution of the points in Φ, and (ν + ν')/2
is an estimate of the distribution of the points enumerated in Φ ∪ Φ'. If the latter estimate
has a small entropy, we can guess that the points in Φ and Φ' are likely to have similar
distributions, which is exactly the similarity that is quantified by the entropy kernel.

Proof of Proposition 1. It is known that the real-valued function r : y ↦ −y ln y is n.d.
on R+ seen as a semigroup endowed with addition [8, Example 6.5.16]. As a consequence, the
function f ↦ r ∘ f is n.d. on P as a pointwise application of r, and so is its summation
over X, namely f ↦ h(f). For any real-valued n.d. kernel k and any real-valued function g,
the kernel (y, y') ↦ k(y, y') + g(y) + g(y') trivially remains negative definite; hence
h((f + f')/2) is n.d. through h((f + f')/2) = (1/2) h(f + f') + (ln 2 / 2)(|f| + |f'|),
yielding the positive definiteness of k_h.


3 Semigroups and integral representations of p.d. kernels on finite
  Radon measures

In order to generalize the example presented in the previous section, let us briefly recall
the concept of p.d. kernels on semigroups [8]. A nonempty set S is called an Abelian
(autoinvolutive) semigroup if it is equipped with an Abelian associative composition ∘
admitting a neutral element in S. 
A function ρ : S → R is called a positive definite
(resp. negative definite) function on the semigroup (S, ∘) if (s, t) ↦ ρ(s ∘ t) is a p.d. (resp.
n.d.) kernel on S × S.

The entropy kernel defined in Proposition 1 is therefore a p.d. function on the semigroup
of measures with finite entropy endowed with the usual addition. This can be generalized by
assuming that X is a Hausdorff space, which suffices to consider the set of finite Radon
measures M_+^b(X) [8]. For μ ∈ M_+^b(X), we note |μ| = μ(X) < +∞. For a Borel
measurable function f ∈ R^X, we note μ[f] = ∫_X f dμ. Endowed with the usual Abelian
addition between measures, (M_+^b(X), +) is an Abelian semigroup. The reason to consider
this semigroup is that there is a natural semigroup homomorphism between finite lists of
points and elements of M_+^b(X), given by Φ = [x1, ..., xn] ↦ μ = Σ_{i=1}^n μ_{x_i}, where
μ_x ∈ M_+^b(X) is an arbitrary finite measure associated with each x ∈ X. We discussed in
Section 2 the case where μ_x has a density, but more general measures are allowed, such as
μ_x = δ_x, the Dirac measure. Observe that when we talk about lists, it should be understood
that some objects might appear with some multiplicity, which should be taken into account
(especially when X is finite), making us consider weighted measures μ = Σ_{i=1}^n c_i δ_{x_i} in
the general case. We now state the main result of this section, which characterizes bounded
p.d. functions on the semigroup M_+^b(X).

Theorem 1. A bounded real-valued function φ on M_+^b(X) such that φ(0) = 1 is p.d. if
and only if it has an integral representation

 φ(μ) = ∫_{C+(X)} e^{−μ[f]} dω(f),

where ω is a uniquely determined positive Radon measure on C+(X), the space of non-
negative-valued continuous functions of R^X endowed with the topology of pointwise con-
vergence.

Proof (sketch). Endowed with the topology of weak convergence, M_+^b(X) is a Hausdorff
space [8, Proposition 2.3.2]. The general result of integral representation of bounded p.d.
functions [8, Theorem 4.2.8] therefore applies. 
It can be shown that bounded semicharacters
on M_+^b(X) are exactly the functions of the form μ ↦ exp(−μ[f]) where f ∈ C+(X), by
using the characterization of semicharacters on (R+, +) [8, Theorem 6.5.8] and the fact
that the atomic measures form a dense subset of M_+^b(X) [8, Theorem 2.3.5].

As a constructive application of this general representation theorem, let us consider the
case μ_x = δ_x and consider, as a subspace of C+(X), the linear span of N non-constant,
continuous, real-valued and linearly independent functions f1, ..., fN on X. As we will see
below, this is equivalent to considering a set of densities defined by an exponential model,
namely of the form p_θ(x) = exp( Σ_{j=1}^N θ_j f_j(x) − ψ(θ) ), where θ = (θ_j)_{j=1..N} ∈ Θ ⊂ R^N
is variable and ψ is a real-valued function defined on Θ to ensure the normalization of the
densities p_θ. Considering a prior ω on the parameter space Θ is equivalent to defining
a Radon measure taking positive values on the subset of C+(X) spanned by the functions
f1, ..., fN. We now have (see [9] for a geometric point of view):

Theorem 2. θ̂_μ being the ML parameter associated with μ and noting p_μ = p_{θ̂_μ},

 φ(μ) = e^{−|μ| h(p_μ)} ∫_Θ e^{−|μ| D(p_μ ∥ p_θ)} ω(dθ)

is a p.d. function on the semigroup of measures, where D(p ∥ q) = ∫_{supp(q)} p ln(p/q) is the
Kullback-Leibler divergence between p and q.

Although an exact calculation of the latter integral is feasible in certain cases (see [10, 7]),
an approximation can be computed using Laplace's method. If for example the prior ω
on the densities is taken to be Jeffreys' prior [9, p. 44], then the following approximation
holds:

 φ(μ) ≈ φ̃(μ) := e^{−|μ| h(p_μ)} ( 2π / |μ| )^{N/2}. (1)

The ML estimator being unaffected by the total weight |μ|, we have
φ̃(2μ) = φ̃(μ)² ( |μ| / (4π) )^{N/2}, which we use to renormalize our kernel on its diagonal:

 k(μ, μ') = [ e^{−|μ+μ'| h(p_{μ+μ'})} / e^{−|μ| h(p_μ) − |μ'| h(p_{μ'})} ] ( 2√(|μ| |μ'|) / (|μ| + |μ'|) )^{N/2}.

Two problems now call for a proper renormalization. First, if |μ| ≫ |μ'| (which would
be the case if Φ describes far more elements than Φ'), the entropy h(p_{μ+μ'}) will hardly take
into account the elements enumerated in Φ'. Second, the value taken by our p.d. function φ̃
decreases exponentially with |μ|, as can be seen in equation (1). This inconvenient scaling
behavior leads in practice to bad SVM classification results due to the diagonal dominance of
the Gram matrices produced by such kernels (see [11] for instance). Recall however that
the Laplace approximation can only be accurate when |μ| is large. To take this tradeoff
on the ideal range of |μ| into account, we rewrite the previous expression using a common width
parameter σ, after having applied a renormalization of μ and μ':

 k_σ(μ, μ') = k( μ / (σ|μ|), μ' / (σ|μ'|) ) = e^{−( 2 h(p_{μ''}) − h(p_μ) − h(p_{μ'}) ) / σ}, (2)

where μ'' = ( μ/|μ| + μ'/|μ'| ) / 2. σ should hence be big enough in practical applications to ensure the
consistency of Laplace's approximation, and thus positive definiteness, while small enough
to avoid diagonal dominance. We will from now on always suppose that our atomic measures are
normalized, meaning that their total weight Σ_{i=1}^n c_i sums up to 1.

Let us now review a practical case where X is R^k and some kind of Gaussianity among the
points makes sense. We can use k-dimensional normal distributions p_{m,Σ} ~ N(m, Σ)
(where Σ is a k × k p.d. matrix) to define our densities. The ML parameters of a measure
μ are in that case m_μ = Σ_{i=1}^n c_i x_i and Σ_μ = Σ_{i=1}^n c_i (x_i − m_μ)(x_i − m_μ)^T. Supposing
that the span of the n vectors x_i covers R^k yields non-degenerate covariance matrices.
This ensures the existence of the entropy of the ML estimates through the formula [12]:
h(p_{m,Σ}) = (1/2) ln((2πe)^k |Σ|). 
The value of the normalized kernel in (2) is then:

 k_σ(μ, μ') = ( √(|Σ_μ| |Σ_{μ'}|) / |Σ_{μ''}| )^{1/σ}.

This framework is however limited to vectorial data, for which the use of Gaussian laws
makes sense. An approach designed to bypass this double restriction is presented in the
next section, taking advantage of prior knowledge on the component space through
the use of a kernel κ.


4 A kernel defined through regularized covariance operators

Endowing X (now also assumed to be separable) with a p.d. kernel κ bounded on the di-
agonal, we make use in this section of its corresponding reproducing kernel Hilbert space
(RKHS; see [13] for a complete survey). This RKHS is denoted by Ξ, and its feature map
by φ : x ↦ κ(x, ·). Ξ is infinite-dimensional in the general case, preventing any sys-
tematic use of exponential densities on that feature space. We bypass this issue through
a generalization of the previous section, by still assuming some \"gaussianity\" among the
elements enumerated by the atomic measures μ, μ' and μ'', which, once mapped into the feature
space, are now functions. More precisely, our aim when dealing with Euclidean spaces
was to estimate finite-dimensional covariance matrices Σ_μ, Σ_{μ'}, Σ_{μ''} and compare them in
terms of their spectrum, or more precisely through their determinants. In this section we
use such finite samples to estimate, diagonalize and regularize three covariance operators
S^μ, S^{μ'}, S^{μ''} associated with each measure on Ξ, and compare them by measuring their re-
spective dispersions in a similar way. We note ξ* for the dual of ξ ∈ Ξ (namely the linear form
Ξ → R s.t. ξ*(ζ) = ⟨ξ, ζ⟩) and ||ξ||² = ξ*(ξ). Let (e_i)_{i∈N} be a complete orthonormal
base of Ξ (i.e. such that the closed span of (e_i)_{i∈N} is Ξ and e_i*(e_j) = δ_{ij}). 
Given a family of positive real
numbers (t_i)_{i∈N}, we note S_{t,e} the bilinear symmetric operator which maps ξ to S_{t,e} ξ,
where S_{t,e} = Σ_{i∈N} t_i e_i e_i*.

For an atomic measure μ, noting φ̃_i = φ(x_i) − μ[φ] its n centered points in Ξ, the
empirical covariance operator S^μ = Σ_{i=1}^n c_i φ̃_i φ̃_i* on Ξ can be described through such a
diagonal representation by finding its principal eigenfunctions, namely orthogonal func-
tions of Ξ which maximize the expected (w.r.t. μ) variance of the normalized dot-product
h_v(ξ) := ⟨v, ξ⟩ / ||v||, here defined for any v ∈ Ξ. Such functions can be obtained through the
following recursive maximizations:

 v_j = argmax_{v ⊥ {v_1,...,v_{j−1}}} var(h_v(μ)) = argmax_{v ⊥ {v_1,...,v_{j−1}}} (1/||v||²) Σ_{i=1}^n c_i ⟨v, φ̃_i⟩².

As in the framework of kernel PCA [1] (from which this calculus only differs by consider-
ing weighted points in the feature space), we have by the representer theorem [1] that all the
solutions of these successive maximizations lie in span({φ̃_i}_{i=1..n}). Thus for each v_j there
exists a vector α_j of R^n such that v_j = Σ_{i=1}^n α_{j,i} φ̃_i, with ||v_j||² = α_j^T K̃ α_j, where K̃ =
(I_n − 1_{n,n} Δ_c) K (I_n − Δ_c 1_{n,n}) is the centered Gram matrix of K = [κ(x_i, x_j)]_{1≤i,j≤n} of
the points taken in the support of μ, with 1_{n,n} being the n × n matrix composed of ones
and Δ_c the n × n diagonal matrix of the c_i coefficients. This latter formulation is however
ill-defined, since any α_j is determined up to the addition of any element of ker K̃. We
thus restrict our parameters α to lie in E := (ker K̃)^⊥ ⊂ R^n to consider functions of positive
squared norm, having now:

 λ_j = max_{α ∈ E, α ⊥ {α_1,...,α_{j−1}}} (α^T K̃ Δ_c K̃ α) / (α^T K̃ α) = var(h_{v_j}(μ)).

Noting r the number of positive eigenvalues λ_1 ≥ ... ≥ λ_r > 0 obtained in this way, we
choose a regularization parameter η > 0 to propose a regularization of S^μ as:

 S_η^{μ,v} = Σ_{i=1}^r (λ_i + η) v_i v_i* + Σ_{i>r} η v_i v_i*.

The entropy of a covariance operator S_{t,e} not being defined, we bypass this issue by consid-
ering the entropy of its marginal distribution on its first d eigenfunctions, namely introduc-
ing the quantity |S_{t,e}|_d = (d/2) ln(2πe) + (1/2) Σ_{i=1}^d ln t_i. 
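In practice, both the spectrum (λ_i) and the truncated entropy |S_{t,e}|_d can be computed from the weighted, centered Gram matrix alone. The following NumPy sketch illustrates this under the assumption that the weights c_i sum to 1; the function names and the numerical rank threshold are ours, not part of the paper:

```python
import numpy as np

def covariance_spectrum(K, c):
    """Positive eigenvalues (descending) of the empirical covariance
    operator S = sum_i c_i phitilde_i phitilde_i* in the RKHS, computed
    from the Gram matrix K = [kappa(x_i, x_j)] and the weights c."""
    n = len(c)
    P = np.eye(n) - np.ones((n, n)) @ np.diag(c)   # I - 1_{n,n} Delta_c
    Kt = P @ K @ P.T                               # centered Gram matrix K~
    s = np.sqrt(c)
    # Nonzero eigenvalues of Delta_c K~ equal those of the symmetric
    # matrix Delta_c^{1/2} K~ Delta_c^{1/2}.
    lam = np.linalg.eigvalsh(s[:, None] * Kt * s[None, :])
    return lam[lam > 1e-10][::-1]

def truncated_entropy(lam, eta, d):
    """|S_eta|_d = (d/2) ln(2*pi*e) + (1/2) sum_{i<=d} ln(zeta_i), with
    zeta_i = lambda_i + eta for i <= r and zeta_i = eta beyond r."""
    zeta = np.full(d, eta)
    r = min(len(lam), d)
    zeta[:r] = lam[:r] + eta
    return 0.5 * d * np.log(2 * np.pi * np.e) + 0.5 * np.log(zeta).sum()
```

With a linear base kernel K = X X^T, the returned spectrum coincides with the eigenvalues of the weighted covariance matrix of Section 3, which provides a simple sanity check of the construction.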
Let us now sum up these ideas and consider
three normalized measures μ, μ' and μ'' = (μ + μ')/2, which yield three different orthonormal
bases (v_i), (v'_i) and (v''_i) of Ξ and three different families of regularized weights
ζ = (λ_1 + η, ..., λ_r + η, η, η, ...), ζ' = (λ'_1 + η, ..., λ'_{r'} + η, η, η, ...) and
ζ'' = (λ''_1 + η, ..., λ''_{r''} + η, η, η, ...). Though working on different bases, those
respective first d directions allow us to express an approached form of kernel (2), limited to
different subspaces of Ξ of arbitrary size d ≥ r'' ≥ max(r, r'):

 k_{d,η}(μ, μ') = exp( −2 [ |S_η^{μ'',v''}|_d − ( |S_η^{μ,v}|_d + |S_η^{μ',v'}|_d ) / 2 ] )
               = [ ∏_{i=1}^{r} (1 + λ_i/η)^{1/2} ∏_{i=1}^{r'} (1 + λ'_i/η)^{1/2} ] / ∏_{i=1}^{r''} (1 + λ''_i/η).   (3)

The latter expression is independent of d, while letting d go to infinity lets every base on
which our entropies are computed span the entire space Ξ. Though this hint does not
establish a valid theoretical proof of the positive definiteness of this kernel, we use this final
formula for the following classification experiments.


5 Experiments

Following the previous work of [4], we conducted experiments on an extraction of
500 images (28 × 28 pixels) taken from the MNIST database of handwritten digits, with 50
images for each digit. To each image z we randomly associate a set Φ(z) of 25 to 30
pixels among the black points (intensity above 191 on a 0 to 255 scale) of the image,
X being {1, .., 28} × {1, .., 28} in this case. In all our experiments we truncated the spectra
to their first 12 eigenvalues, which always yielded positive definite Gram matrices in practice.
To define our RKHS Ξ we used both the linear kernel, κ_a((x1, y1), (x2, y2)) = (x1 x2 + y1 y2)/27²,
and the Gaussian kernel of width σ, namely κ_b((x1, y1), (x2, y2)) = e^{−((x1−x2)² + (y1−y2)²) / (27² · 2σ²)}. The
linear case boils down to the simple application presented at the end of Section 3, where we
fit bivariate Gaussian laws on our three measures and define similarity through variance
analysis. 
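Under stated assumptions (normalized uniform weights, point sets that span the plane), this linear case reduces to the Gaussian determinant formula of Section 3, k_σ(μ, μ') = (√(|Σ_μ| |Σ_{μ'}|) / |Σ_{μ''}|)^{1/σ}, and can be sketched as follows; the helper names are ours:

```python
import numpy as np

def ml_cov(X, c):
    # ML covariance Sigma_mu of a normalized weighted point set.
    m = c @ X
    Xc = X - m
    return Xc.T @ (Xc * c[:, None])

def gaussian_set_kernel(X1, X2, sigma=1.0):
    """k(mu, mu') = (sqrt(|S1| |S2|) / |S12|)^(1/sigma), where S12 is
    the ML covariance of the union of the two sets, weights halved."""
    c1 = np.full(len(X1), 1.0 / len(X1))
    c2 = np.full(len(X2), 1.0 / len(X2))
    S1 = ml_cov(X1, c1)
    S2 = ml_cov(X2, c2)
    S12 = ml_cov(np.vstack([X1, X2]), np.concatenate([c1, c2]) / 2.0)
    num = np.sqrt(np.linalg.det(S1) * np.linalg.det(S2))
    return (num / np.linalg.det(S12)) ** (1.0 / sigma)
```

By concavity of the log-determinant, the value always lies in (0, 1], with k(μ, μ) = 1, which matches the role of the kernel as a similarity measure.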
The resulting diagonal variances (Λ_{1,1}, Λ_{2,2}), (Λ'_{1,1}, Λ'_{2,2}) and (Λ''_{1,1}, Λ''_{2,2}) mea-
sure the dispersion of our data for each of the three measures, yielding a kernel value of
√(Λ_{1,1} Λ_{2,2} Λ'_{1,1} Λ'_{2,2}) / (Λ''_{1,1} Λ''_{2,2}), equal to 0.382 in the case shown in Figure 1. The linear kernel man-
ages a good discrimination between clearly defined digits such as 1 and 0, but fails at
doing so when considering numbers whose pixel distribution cannot be properly char-
acterized by ellipsoid-like shapes. Using instead the Gaussian kernel brings a
non-linear perspective to the previous approach, since it now maps all pixels onto Gaus-
sian bells, providing a much richer function class for Ξ.

[Figure 1 (images omitted): first eigenfunction of the three empirical measures μ_1, μ_0 and
(μ_1 + μ_0)/2, using the linear kernel (a) and the Gaussian kernel (b, with η = 0.01, σ = 0.1).
Below each image are the corresponding eigenvalues, which correspond to the variance captured
by each eigenfunction, the second eigenvalue being also displayed in the linear case:
(a) Λ_{1,1} = 0.0552, Λ_{2,2} = 0.0013; Λ'_{1,1} = 0.0441, Λ'_{2,2} = 0.0237; Λ''_{1,1} = 0.0497, Λ''_{2,2} = 0.0139.
(b) λ_1 = 0.276; λ'_1 = 0.168; λ''_1 = 0.184.]

In this case two parameters
require explicit tuning: σ (the width of κ_b) controls the range of the typical eigenvalues
found in the spectra of our regularized operators, whereas η acts as a scaling parame-
ter for those values, as can be seen in equation (3). An efficient choice can thus only
be defined on pairs of parameters, which made us use two ranges of values for η and σ,
based on preliminary attempts: η ∈ 10^{−2} × {0.1, 0.3, 0.5, 0.8, 1, 1.5, 2, 3, 5, 8, 10, 20} and
σ ∈ 10^{−1} × {0.5, 1, 1.2, 1.5, 1.8, 2, 2.5, 3}. For each kernel computed on the basis of an (η, σ)
couple, we used a balanced training fold of our dataset to train 10 binary SVM classifiers,
namely one for each digit versus all other 9 digits. 
The class of each remaining image of the
test fold was then predicted to be the one with the highest SVM score among the 10 pre-
viously trained binary SVMs. Splitting our data into test and training sets was done through
a 3-fold cross validation (roughly 332 training images and 168 for testing), averaging the
test error over 5 random fold splits of the original data. Those results were obtained using
the spider toolbox¹ and are graphically displayed in Figure 2. Note that the best testing errors
were reached using a σ value of 0.12 with an η parameter between 0.008 and 0.02, this error
being roughly 19.5%, with a standard deviation inferior to 1% in all the region correspond-
ing to an error lower than 22%. To illustrate the sensitivity of our method to the number of
sampled points in Φ, we show in the same figure the decrease of this error when the number
of sampled points ranges from 10 to 30, with independently chosen random points for each
computation. As in [4], we also compared our results to the standard RBF kernel on images
seen as vectors of {0, 1}^{27 × 27}, using a fixed number of 30 sampled points and the formula
k(z, z') = e^{−||z − z'||² / (30 · 2σ²)}. We obtained similar results with an optimal error rate of roughly
44.5% for σ ∈ {0.12, 0.15, 0.18}. 
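A minimal sketch of this baseline Gram matrix follows, assuming the squared distance is normalized by the 30 sampled points as in the formula above; the function name is ours:

```python
import numpy as np

def rbf_image_kernel(Z, sigma):
    """Baseline RBF kernel on binarized images seen as 0/1 vectors,
    k(z, z') = exp(-||z - z'||^2 / (30 * 2 * sigma^2))."""
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (30 * 2 * sigma ** 2))
```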
Our results did not improve when choosing different soft
margin C parameters, which we hence simply left at the default value chosen by the
spider toolbox.

 ¹See http://www.kyb.tuebingen.mpg.de/bs/people/spider/

[Figure 2 (plots omitted): (a) grey-level map of the averaged test error over the (η, σ) grid,
η ∈ 10^{−2} × {0.1, ..., 20} and σ ∈ {0.05, ..., 0.3}, showing regions of error below 22% and
below 19.5% around σ = 0.12; (b) averaged error rate, decreasing from roughly 50% to 20%,
as a function of the number of sampled points (10 to 30).]

Figure 2: (a) Average test error (displayed as a grey level) of different SVM handwritten
character recognition experiments using 500 images from the MNIST database (each seen
as a set of 25 to 30 randomly selected black pixels), carried out with 3-fold (2 for training, 1
for test) cross validations with 5 repeats, where the parameters η (regularization) and σ (width
of the Gaussian kernel) have been tuned to different values. (b) Curve of the same error
(with η = 0.01 and σ = 0.12 fixed) depending now on the size of the sets of randomly selected
black pixels for each image, this size varying between 10 and 30.

Acknowledgments

The authors would like to thank Francis Bach, Kenji Fukumizu and Jeremie Jakubowicz
for fruitful discussions, and Xavier Dupre for his help with the MNIST database.

References

 [1] B. Schölkopf and A.J. Smola. Learning with Kernels: Support Vector Machines, Regularization,
 Optimization, and Beyond. MIT Press, Cambridge, MA, 2002.

 [2] J. Lafferty and G. Lebanon. Information diffusion kernels. In Advances in Neural Information
 Processing Systems 14, Cambridge, MA, 2002. MIT Press.

 [3] M. Seeger. Covariance kernels from Bayesian generative models. In Advances in Neural Infor-
 mation Processing Systems 14, pages 905-912, Cambridge, MA, 2002. MIT Press.

 [4] R. Kondor and T. Jebara. A kernel between sets of vectors. In Machine Learning, Proceedings
 of the Twentieth International Conference (ICML 2003), pages 361-368. AAAI Press, 2003.

 [5] L. 
Wolf and A. Shashua. Learning over sets using kernel principal angles. Journal of Machine
 Learning Research, 4:913-931, 2003.

 [6] F. Bach and M. Jordan. Kernel independent component analysis. Journal of Machine Learning
 Research, 3:1-48, 2002.

 [7] M. Cuturi and J.-P. Vert. A mutual information kernel for sequences. In IEEE International
 Joint Conference on Neural Networks, 2004.

 [8] C. Berg, J.P.R. Christensen, and P. Ressel. Harmonic Analysis on Semigroups. Springer, 1984.

 [9] S. Amari and H. Nagaoka. Methods of Information Geometry. AMS vol. 191, 2001.

[10] F. M. J. Willems, Y. M. Shtarkov, and Tj. J. Tjalkens. The context-tree weighting method: basic
 properties. IEEE Transactions on Information Theory, pages 653-664, 1995.

[11] J.-P. Vert, H. Saigo, and T. Akutsu. Local alignment kernels for protein sequences. In
 B. Schölkopf, K. Tsuda, and J.-P. Vert, editors, Kernel Methods in Computational Biology.
 MIT Press, 2004.

[12] T. Cover and J. Thomas. Elements of Information Theory. Wiley & Sons, New York, 1991.

[13] N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical
 Society, 68:337-404, 1950.
", "award": [], "sourceid": 2709, "authors": [{"given_name": "Marco", "family_name": "Cuturi", "institution": null}, {"given_name": "Jean-philippe", "family_name": "Vert", "institution": null}]}