{"title": "The Laplacian PDF Distance: A Cost Function for Clustering in a Kernel Feature Space", "book": "Advances in Neural Information Processing Systems", "page_first": 625, "page_last": 632, "abstract": null, "full_text": " The Laplacian PDF Distance: A Cost\n Function for Clustering in a Kernel\n Feature Space\n\n\n\n Robert Jenssen1, Deniz Erdogmus2, Jose Principe2, Torbjrn Eltoft1\n\n 1Department of Physics, University of Troms, Norway\n 2Computational NeuroEngineering Laboratory, University of Florida, USA\n\n\n\n Abstract\n\n A new distance measure between probability density functions\n (pdfs) is introduced, which we refer to as the Laplacian pdf dis-\n tance. The Laplacian pdf distance exhibits a remarkable connec-\n tion to Mercer kernel based learning theory via the Parzen window\n technique for density estimation. In a kernel feature space defined\n by the eigenspectrum of the Laplacian data matrix, this pdf dis-\n tance is shown to measure the cosine of the angle between cluster\n mean vectors. The Laplacian data matrix, and hence its eigenspec-\n trum, can be obtained automatically based on the data at hand,\n by optimal Parzen window selection. We show that the Laplacian\n pdf distance has an interesting interpretation as a risk function\n connected to the probability of error.\n\n\n1 Introduction\n\nIn recent years, spectral clustering methods, i.e. data partitioning based on the\neigenspectrum of kernel matrices, have received a lot of attention [1, 2]. Some\nunresolved questions associated with these methods are for example that it is not\nalways clear which cost function that is being optimized and that is not clear how\nto construct a proper kernel matrix.\n\nIn this paper, we introduce a well-defined cost function for spectral clustering. This\ncost function is derived from a new information theoretic distance measure between\ncluster pdfs, named the Laplacian pdf distance. 
The information theoretic/spectral duality is established via the Parzen window methodology for density estimation. The resulting spectral clustering cost function measures the cosine of the angle between cluster mean vectors in a Mercer kernel feature space, where the feature space is determined by the eigenspectrum of the Laplacian matrix. A principled approach to spectral clustering would be to optimize this cost function in the feature space by assigning cluster memberships. Because of space limitations, we leave it to a future paper to present an actual clustering algorithm optimizing this cost function, and focus in this paper on the theoretical properties of the new measure.

Corresponding author. Phone: (+47) 776 46493. Email: robertj@phys.uit.no

An important by-product of the theory presented is that a method for learning the Mercer kernel matrix via optimal Parzen windowing is provided. This means that the Laplacian matrix, its eigenspectrum and hence the feature space mapping can be determined automatically. We illustrate this property by an example.

We also show that the Laplacian pdf distance has an interesting relationship to the probability of error.

In section 2, we briefly review kernel feature space theory. In section 3, we utilize the Parzen window technique for function approximation, in order to introduce the new Laplacian pdf distance; its properties are discussed in sections 4 and 5. Section 6 concludes the paper.

2 Kernel Feature Spaces

Mercer kernel-based learning algorithms [3] make use of the following idea: via a nonlinear mapping

    \Phi : \mathbb{R}^d \to F, \quad x \mapsto \Phi(x),   (1)

the data x_1, \ldots, x_N \in \mathbb{R}^d is mapped into a potentially much higher dimensional feature space F. For a given learning problem one now considers the same algorithm in F instead of in \mathbb{R}^d, that is, one works with \Phi(x_1), \ldots, \Phi(x_N) \in F. Consider a symmetric kernel function k(x, y). 
If k : C \times C \to \mathbb{R} is a continuous kernel of a positive integral operator in a Hilbert space L_2(C) on a compact set C \subset \mathbb{R}^d, i.e.

    \forall \psi \in L_2(C) : \int_C k(x, y) \psi(x) \psi(y) \, dx \, dy \geq 0,   (2)

then there exists a space F and a mapping \Phi : \mathbb{R}^d \to F, such that by Mercer's theorem [4]

    k(x, y) = \langle \Phi(x), \Phi(y) \rangle = \sum_{i=1}^{N_F} \lambda_i \phi_i(x) \phi_i(y),   (3)

where \langle \cdot, \cdot \rangle denotes an inner product, the \phi_i's are the orthonormal eigenfunctions of the kernel and N_F \leq \infty [3]. In this case

    \Phi(x) = [\sqrt{\lambda_1} \phi_1(x), \sqrt{\lambda_2} \phi_2(x), \ldots]^T   (4)

can potentially be realized.

In some cases, it may be desirable to realize this mapping. This issue has been addressed in [5]. Define the (N \times N) Gram matrix, K, also called the affinity, or kernel matrix, with elements K_{ij} = k(x_i, x_j), i, j = 1, \ldots, N. This matrix can be diagonalized as E^T K E = \Lambda, where the columns of E contain the eigenvectors of K and \Lambda is a diagonal matrix containing the non-negative eigenvalues \tilde{\lambda}_1, \ldots, \tilde{\lambda}_N, \tilde{\lambda}_1 \geq \cdots \geq \tilde{\lambda}_N. In [5], it was shown that the eigenfunctions and eigenvalues of (4) can be approximated as \phi_j(x_i) \approx \sqrt{N} e_{ji}, \lambda_j \approx \tilde{\lambda}_j / N, where e_{ji} denotes the ith element of the jth eigenvector. Hence, the mapping (4) can be approximated as

    \Phi(x_i) \approx [\sqrt{\tilde{\lambda}_1} e_{1i}, \ldots, \sqrt{\tilde{\lambda}_N} e_{Ni}]^T.   (5)

Thus, the mapping is based on the eigenspectrum of K. The feature space data set may be represented in matrix form as \Phi_{N \times N} = [\Phi(x_1), \ldots, \Phi(x_N)]. Hence, \Phi = \Lambda^{1/2} E^T. It may be desirable to truncate the mapping (5) to C dimensions. Thus, only the C first rows of \Phi are kept, yielding \hat{\Phi}. It is well-known that \hat{K} = \hat{\Phi}^T \hat{\Phi} is the best rank-C approximation to K wrt. the Frobenius norm [6].

The most widely used Mercer kernel is the radial-basis-function (RBF)

    k(x, y) = \exp\left( -\frac{||x - y||^2}{2\sigma^2} \right).   (6)

3 Function Approximation using Parzen Windowing

Parzen windowing is a kernel-based density estimation method, where the resulting density estimate is continuous and differentiable provided that the selected kernel is continuous and differentiable [7]. 
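To make the basic estimator concrete before it is defined formally below, the Parzen estimate of eq. (7), with the Gaussian window of eq. (8), can be sketched as follows (a minimal illustration; the function and variable names are our own):

```python
import numpy as np

def parzen_density(x, samples, sigma):
    """Parzen window estimate f_hat(x) = (1/N) sum_i W_{sigma^2}(x, x_i),
    using the d-dimensional Gaussian window of width sigma."""
    N, d = samples.shape
    diff = x - samples                                 # (N, d) differences x - x_i
    norm = (2.0 * np.pi * sigma**2) ** (d / 2.0)       # Gaussian normalization constant
    W = np.exp(-np.sum(diff**2, axis=1) / (2.0 * sigma**2)) / norm
    return W.mean()
```

For samples drawn from a standard normal, the estimate at the origin should approach (2\pi)^{-1/2} \approx 0.399 as N grows and the width shrinks.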
Given a set of iid samples {x_1, \ldots, x_N} drawn from the true density f(x), the Parzen window estimate for this distribution is [7]

    \hat{f}(x) = \frac{1}{N} \sum_{i=1}^{N} W_{\sigma^2}(x, x_i),   (7)

where W_{\sigma^2} is the Parzen window, or kernel, and \sigma^2 controls the width of the kernel. The Parzen window must integrate to one, and is typically chosen to be a pdf itself with mean x_i, such as the Gaussian kernel

    W_{\sigma^2}(x, x_i) = \frac{1}{(2\pi\sigma^2)^{d/2}} \exp\left( -\frac{||x - x_i||^2}{2\sigma^2} \right),   (8)

which we will assume in the rest of this paper. In the conclusion, we briefly discuss the use of other kernels.

Consider a function h(x) = v(x) f(x), for some function v(x). We propose to estimate h(x) by the following generalized Parzen estimator

    \hat{h}(x) = \frac{1}{N} \sum_{i=1}^{N} v(x_i) W_{\sigma^2}(x, x_i).   (9)

This estimator is asymptotically unbiased, which can be shown as follows

    E_f\left[ \frac{1}{N} \sum_{i=1}^{N} v(x_i) W_{\sigma^2}(x, x_i) \right] = \int v(z) f(z) W_{\sigma^2}(x, z) \, dz = [v(x) f(x)] * W_{\sigma^2}(x),   (10)

where E_f(\cdot) denotes expectation with respect to the density f(x). In the limit as N \to \infty and \sigma(N) \to 0, we have

    \lim_{N \to \infty, \, \sigma(N) \to 0} [v(x) f(x)] * W_{\sigma^2}(x) = v(x) f(x).   (11)

Of course, if v(x) = 1 \, \forall x, then (9) is nothing but the traditional Parzen estimator of h(x) = f(x). The estimator (9) is also asymptotically consistent provided that the kernel width \sigma(N) is annealed at a sufficiently slow rate. The proof will be presented in another paper.

Many approaches have been proposed in order to optimally determine the size of the Parzen window, given a finite sample data set. A simple selection rule was proposed by Silverman [8], using the mean integrated square error (MISE) between the estimated and the actual pdf as the optimality metric:

    \sigma_{opt} = \sigma_X \left[ 4 N^{-1} (2d + 1)^{-1} \right]^{\frac{1}{d+4}},   (12)

where d is the dimensionality of the data and \sigma_X^2 = d^{-1} \sum_i \Sigma_{X_{ii}}, where \Sigma_{X_{ii}} are the diagonal elements of the sample covariance matrix. 
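Silverman's rule (12) is straightforward to compute from a sample matrix; a minimal sketch, assuming the covariance convention stated above (the function name is ours):

```python
import numpy as np

def silverman_width(X):
    """MISE-optimal kernel size, eq. (12):
    sigma_opt = sigma_X * (4 N^{-1} (2d+1)^{-1})^{1/(d+4)},
    where sigma_X^2 averages the diagonal of the sample covariance matrix."""
    N, d = X.shape
    Sigma_X = np.cov(X, rowvar=False).reshape(d, d)   # sample covariance matrix
    sigma_X = np.sqrt(np.mean(np.diag(Sigma_X)))      # sigma_X^2 = d^{-1} sum_i Sigma_ii
    return sigma_X * (4.0 / (N * (2 * d + 1))) ** (1.0 / (d + 4))
```

Note that the rule scales linearly with the data: scaling X by a factor c scales sigma_X, and hence the returned width, by the same factor.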
More advanced approximations to the MISE solution also exist.

4 The Laplacian PDF Distance

Cost functions for clustering are often based on distance measures between pdfs. The goal is to assign memberships to the data patterns with respect to a set of clusters, such that the cost function is optimized.

Assume that a data set consists of two clusters. Associate the probability density function p(x) with one of the clusters, and the density q(x) with the other cluster. Let f(x) be the overall probability density function of the data set. Now define the f^{-1} weighted inner product between p(x) and q(x) as \langle p, q \rangle_f \equiv \int p(x) q(x) f^{-1}(x) \, dx. In such an inner product space, the Cauchy-Schwarz inequality holds, that is, \langle p, q \rangle_f^2 \leq \langle p, p \rangle_f \langle q, q \rangle_f. Based on this discussion, an information theoretic distance measure between the two pdfs can be expressed as

    D_L = -\log \frac{\langle p, q \rangle_f}{\sqrt{\langle p, p \rangle_f \langle q, q \rangle_f}} \geq 0.   (13)

We refer to this measure as the Laplacian pdf distance, for reasons that we discuss next. It can be seen that the distance D_L is zero if and only if the two densities are equal. It is non-negative, and increases as the overlap between the two pdfs decreases. However, it does not obey the triangle inequality, and is thus not a distance measure in the strict mathematical sense.

We will now show that the Laplacian pdf distance is also a cost function for clustering in a kernel feature space, using the generalized Parzen estimators discussed in the previous section. Since the logarithm is a monotonic function, we will derive the expression for the argument of the log in (13). This quantity will for simplicity be denoted by the letter \"L\" in equations.

Assume that we have available the iid data points {x_i}, i = 1, \ldots, N_1, drawn from p(x), which is the density of cluster C_1, and the iid {x_j}, j = 1, \ldots, N_2, drawn from q(x), the density of C_2. Let h(x) = f^{-1/2}(x) p(x) and g(x) = f^{-1/2}(x) q(x). Hence, we may write

    L = \frac{\int h(x) g(x) \, dx}{\sqrt{\int h^2(x) \, dx \int g^2(x) \, dx}}.   (14)

We estimate h(x) and g(x) by the generalized Parzen kernel estimators, as follows

    \hat{h}(x) = \frac{1}{N_1} \sum_{i=1}^{N_1} f^{-1/2}(x_i) W_{\sigma^2}(x, x_i), \quad \hat{g}(x) = \frac{1}{N_2} \sum_{j=1}^{N_2} f^{-1/2}(x_j) W_{\sigma^2}(x, x_j).   (15)

The approach taken is to substitute these estimators into (14), to obtain

    \int h(x) g(x) \, dx \approx \int \frac{1}{N_1} \sum_{i=1}^{N_1} f^{-1/2}(x_i) W_{\sigma^2}(x, x_i) \, \frac{1}{N_2} \sum_{j=1}^{N_2} f^{-1/2}(x_j) W_{\sigma^2}(x, x_j) \, dx
    = \frac{1}{N_1 N_2} \sum_{i,j=1}^{N_1, N_2} f^{-1/2}(x_i) f^{-1/2}(x_j) \int W_{\sigma^2}(x, x_i) W_{\sigma^2}(x, x_j) \, dx
    = \frac{1}{N_1 N_2} \sum_{i,j=1}^{N_1, N_2} f^{-1/2}(x_i) f^{-1/2}(x_j) W_{2\sigma^2}(x_i, x_j),   (16)

where in the last step, the convolution theorem for Gaussians has been employed. Similarly, we have

    \int h^2(x) \, dx \approx \frac{1}{N_1^2} \sum_{i,i'=1}^{N_1, N_1} f^{-1/2}(x_i) f^{-1/2}(x_{i'}) W_{2\sigma^2}(x_i, x_{i'}),   (17)

    \int g^2(x) \, dx \approx \frac{1}{N_2^2} \sum_{j,j'=1}^{N_2, N_2} f^{-1/2}(x_j) f^{-1/2}(x_{j'}) W_{2\sigma^2}(x_j, x_{j'}).   (18)

Now we define the matrix K_f, such that

    K_{f,ij} = K_f(x_i, x_j) = f^{-1/2}(x_i) f^{-1/2}(x_j) K(x_i, x_j),   (19)

where K(x_i, x_j) = W_{2\sigma^2}(x_i, x_j) for i, j = 1, \ldots, N and N = N_1 + N_2. As a consequence, (14) can be re-written as follows

    L = \frac{\sum_{i,j=1}^{N_1, N_2} K_f(x_i, x_j)}{\sqrt{\sum_{i,i'=1}^{N_1, N_1} K_f(x_i, x_{i'}) \sum_{j,j'=1}^{N_2, N_2} K_f(x_j, x_{j'})}}.   (20)

The key point of this paper is to note that the matrix K = K_{ij} = K(x_i, x_j), i, j = 1, \ldots, N, is the data affinity matrix, and that K(x_i, x_j) is a Gaussian RBF kernel function. Hence, it is also a kernel function that satisfies Mercer's theorem. Since K(x_i, x_j) satisfies Mercer's theorem, the following by definition holds [4]. For any set of examples {x_1, \ldots, x_N} and any set of real numbers \alpha_1, \ldots, \alpha_N

    \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j K(x_i, x_j) \geq 0,   (21)

in analogy to (3). 
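The sample expression (20) can be computed directly from a labeled data set. The following is a minimal sketch, not the authors' algorithm: it assumes the Gaussian window throughout, estimates f(x_i) by the plain Parzen estimator, and uses our own function names:

```python
import numpy as np

def gaussian_gram(X, sigma):
    """Gram matrix of the d-dimensional Gaussian window W_{sigma^2}(x_i, x_j)."""
    N, d = X.shape
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma**2)) / (2.0 * np.pi * sigma**2) ** (d / 2.0)

def laplacian_pdf_distance(X, labels, sigma):
    """Estimate D_L = -log L, with L built from cross- and within-cluster
    sums of K_f(x_i, x_j) = f^{-1/2}(x_i) f^{-1/2}(x_j) K(x_i, x_j),
    where K = W_{2 sigma^2} by the Gaussian convolution step in (16)."""
    f = gaussian_gram(X, sigma).mean(axis=1)       # Parzen estimate of f at each x_i
    K = gaussian_gram(X, np.sqrt(2.0) * sigma)     # W_{2 sigma^2}(x_i, x_j)
    Kf = K / np.sqrt(np.outer(f, f))               # eq. (19)
    c1, c2 = labels == 0, labels == 1
    num = Kf[np.ix_(c1, c2)].sum()                 # cross-cluster sum, numerator of (20)
    den = np.sqrt(Kf[np.ix_(c1, c1)].sum() * Kf[np.ix_(c2, c2)].sum())
    return -np.log(num / den)                      # D_L of eq. (13)
```

Since K_f is positive semi-definite, num/den is a cosine and the returned value is non-negative (up to round-off), growing as the clusters separate.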
Moreover, this means that

    \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j f^{-1/2}(x_i) f^{-1/2}(x_j) K(x_i, x_j) = \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j K_f(x_i, x_j) \geq 0,   (22)

hence K_f(x_i, x_j) is also a Mercer kernel.

Now, it is readily observed that the Laplacian pdf distance can be analyzed in terms of inner products in a Mercer kernel-based Hilbert feature space, since K_f(x_i, x_j) = \langle \Phi_f(x_i), \Phi_f(x_j) \rangle. Consequently, (20) can be written as follows

    L = \frac{\sum_{i,j=1}^{N_1, N_2} \langle \Phi_f(x_i), \Phi_f(x_j) \rangle}{\sqrt{\sum_{i,i'=1}^{N_1, N_1} \langle \Phi_f(x_i), \Phi_f(x_{i'}) \rangle \sum_{j,j'=1}^{N_2, N_2} \langle \Phi_f(x_j), \Phi_f(x_{j'}) \rangle}}
    = \frac{\left\langle \frac{1}{N_1} \sum_{i=1}^{N_1} \Phi_f(x_i), \frac{1}{N_2} \sum_{j=1}^{N_2} \Phi_f(x_j) \right\rangle}{\sqrt{\left\langle \frac{1}{N_1} \sum_{i=1}^{N_1} \Phi_f(x_i), \frac{1}{N_1} \sum_{i'=1}^{N_1} \Phi_f(x_{i'}) \right\rangle \left\langle \frac{1}{N_2} \sum_{j=1}^{N_2} \Phi_f(x_j), \frac{1}{N_2} \sum_{j'=1}^{N_2} \Phi_f(x_{j'}) \right\rangle}}
    = \frac{\langle m_{1f}, m_{2f} \rangle}{||m_{1f}|| \, ||m_{2f}||} = \cos \angle(m_{1f}, m_{2f}),   (23)

where m_{if} = \frac{1}{N_i} \sum_{l=1}^{N_i} \Phi_f(x_l), i = 1, 2, that is, the sample mean of the ith cluster in feature space.

This is a very interesting result. We started out with a distance measure between densities in the input space. By utilizing the Parzen window method, this distance measure turned out to have an equivalent expression as a measure of the distance between two clusters of data points in a Mercer kernel feature space. In the feature space, the distance that is measured is the cosine of the angle between the cluster mean vectors.

The actual mapping of a data point to the kernel feature space is given by the eigendecomposition of K_f, via (5). Let us examine this mapping in more detail. Note that f(x_i) can be estimated from the data by the traditional Parzen pdf estimator as follows

    \hat{f}(x_i) = \frac{1}{N} \sum_{l=1}^{N} W_{\sigma_f^2}(x_i, x_l) = d_i.   (24)

Define the matrix D = diag(d_1, \ldots, d_N). Then K_f can be expressed as

    K_f = D^{-1/2} K D^{-1/2}.   (25)

Quite interestingly, for \sigma_f^2 = 2\sigma^2, this is in fact the Laplacian data matrix.1

The above discussion explicitly connects the Parzen kernel and the Mercer kernel. Moreover, automatic procedures exist in the density estimation literature to optimally determine the Parzen kernel given a data set. Thus, the Mercer kernel is also determined by the same procedure. Therefore, the mapping by the Laplacian matrix to the kernel feature space can also be determined automatically. We regard this as a significant result in kernel based learning theory.

As an example, consider Fig. 1 (a), which shows a data set consisting of a ring with a dense cluster in the middle. The MISE kernel size is \sigma_{opt} = 0.16, and the Parzen pdf estimate is shown in Fig. 1 (b). The data mapping given by the corresponding Laplacian matrix is shown in Fig. 1 (c) (truncated to two dimensions for visualization purposes). It can be seen that the data is distributed along two lines radially from the origin, indicating that clustering based on the angular measure we have derived makes sense.

The above analysis can easily be extended to any number of pdfs/clusters. In the C-cluster case, we define the Laplacian pdf distance as

    L = \sum_{i=1}^{C-1} \sum_{j>i}^{C} \frac{\langle p_i, p_j \rangle_f}{\sqrt{\langle p_i, p_i \rangle_f \langle p_j, p_j \rangle_f}}.   (26)

In the kernel feature space, (26) corresponds to all cluster mean vectors being pairwise as orthogonal to each other as possible, for all possible unique pairs.

4.1 Connection to the Ng et al. [2] algorithm

Recently, Ng et al. [2] proposed to map the input data to a feature space determined by the eigenvectors corresponding to the C largest eigenvalues of the Laplacian matrix. In that space, the data was normalized to unit norm and clustered by the C-means algorithm. We have shown that the Laplacian pdf distance provides a

1It is a bit imprecise to refer to K_f as the Laplacian matrix, as readers familiar with spectral graph theory may recognize, since the definition of the Laplacian matrix is L = I - K_f. 
However, replacing K_f by L does not change the eigenvectors; it only changes the eigenvalues from \lambda_i to 1 - \lambda_i.

(a) Data set  (b) Parzen pdf estimate  (c) Feature space data

Figure 1: The kernel size is automatically determined (MISE), yielding the Parzen estimate (b) with the corresponding feature space mapping (c).

clustering cost function, measuring the cosine of the angle between cluster means, in a related kernel feature space, which in our case can be determined automatically. A more principled approach to clustering than that taken by Ng et al. is to optimize (23) in the feature space, instead of using C-means. However, because of the normalization of the data in the feature space, C-means can be interpreted as clustering the data based on an angular measure. This may explain some of the success of the Ng et al. algorithm; it achieves more or less the same goal as clustering based on the Laplacian distance would be expected to do. We will investigate this claim in our future work. Note that in our framework we may choose to use only the C largest eigenvalues/eigenvectors in the mapping, as discussed in section 2. Since we incorporate the eigenvalues in the mapping, in contrast to Ng et al., the actual mapping will in general be different in the two cases.

5 The Laplacian PDF distance as a risk function

We now give an analysis of the Laplacian pdf distance that may further motivate its use as a clustering cost function. Consider again the two cluster case. The overall data distribution can be expressed as f(x) = P_1 p(x) + P_2 q(x), where P_i, i = 1, 2, are the priors. Assume that the two clusters are well separated, such that for x_i \in C_1, f(x_i) \approx P_1 p(x_i), while for x_i \in C_2, f(x_i) \approx P_2 q(x_i). Let us examine the numerator of (14) in this case. It can be approximated as

    \int \frac{p(x) q(x)}{f(x)} \, dx \approx \int_{C_1} \frac{p(x) q(x)}{f(x)} \, dx + \int_{C_2} \frac{p(x) q(x)}{f(x)} \, dx \approx \frac{1}{P_1} \int_{C_1} q(x) \, dx + \frac{1}{P_2} \int_{C_2} p(x) \, dx.   (27)

By performing a similar calculation for the denominator of (14), it can be shown to be approximately equal to \frac{1}{\sqrt{P_1 P_2}}. Hence, the Laplacian pdf distance can be written as a risk function, given by

    L \approx \sqrt{P_1 P_2} \left[ \frac{1}{P_1} \int_{C_1} q(x) \, dx + \frac{1}{P_2} \int_{C_2} p(x) \, dx \right].   (28)

Note that if P_1 = P_2 = \frac{1}{2}, then L = 2 P_e, where P_e is the probability of error when assigning data points to the two clusters, that is

    P_e = P_1 \int_{C_1} q(x) \, dx + P_2 \int_{C_2} p(x) \, dx.   (29)

Thus, in this case, minimizing L is equivalent to minimizing P_e. However, in the case that P_1 \neq P_2, (28) has an even more interesting interpretation. In that situation, it can be seen that the two integrals in the expressions (28) and (29) are weighted exactly oppositely. For example, if P_1 is close to one, L \approx \int_{C_2} p(x) \, dx, while P_e \approx \int_{C_1} q(x) \, dx. Thus, the Laplacian pdf distance emphasizes clustering the most unlikely data points correctly. In many real world applications, this property may be crucial. For example, in medical applications, the most important points to classify correctly are often the least probable, such as detecting some rare disease in a group of patients.

6 Conclusions

We have introduced a new pdf distance measure that we refer to as the Laplacian pdf distance, and we have shown that it is in fact a clustering cost function in a kernel feature space determined by the eigenspectrum of the Laplacian data matrix. In our exposition, the Mercer kernel and the Parzen kernel are equivalent, making it possible to determine the Mercer kernel based on automatic selection procedures for the Parzen kernel. Hence, the Laplacian data matrix and its eigenspectrum can be determined automatically too. We have shown that the new pdf distance has an interesting property as a risk function.

The results we have derived can only be obtained analytically using Gaussian kernels. 
The same results may be obtained using other Mercer kernels, but this requires an additional approximation wrt. the expectation operator. This discussion is left for future work.

Acknowledgments. This work was partially supported by NSF grant ECS-0300340.

References

[1] Y. Weiss, \"Segmentation Using Eigenvectors: A Unifying View,\" in International Conference on Computer Vision, 1999, pp. 975–982.
[2] A. Y. Ng, M. Jordan, and Y. Weiss, \"On Spectral Clustering: Analysis and an Algorithm,\" in Advances in Neural Information Processing Systems 14, 2001, vol. 2, pp. 849–856.
[3] K. R. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf, \"An Introduction to Kernel-Based Learning Algorithms,\" IEEE Transactions on Neural Networks, vol. 12, no. 2, pp. 181–201, 2001.
[4] J. Mercer, \"Functions of Positive and Negative Type and their Connection with the Theory of Integral Equations,\" Philos. Trans. Roy. Soc. London, vol. A, pp. 415–446, 1909.
[5] C. Williams and M. Seeger, \"Using the Nyström Method to Speed Up Kernel Machines,\" in Advances in Neural Information Processing Systems 13, Vancouver, Canada, 2001, pp. 682–688.
[6] M. Brand and K. Huang, \"A Unifying Theorem for Spectral Embedding and Clustering,\" in Ninth Int'l Workshop on Artificial Intelligence and Statistics, Key West, Florida, USA, 2003.
[7] E. Parzen, \"On the Estimation of a Probability Density Function and the Mode,\" Ann. Math. Stat., vol. 32, pp. 1065–1076, 1962.
[8] B. W. Silverman, Density Estimation for Statistics and Data Analysis, Chapman and Hall, London, 1986.
", "award": [], "sourceid": 2685, "authors": [{"given_name": "Robert", "family_name": "Jenssen", "institution": null}, {"given_name": "Deniz", "family_name": "Erdogmus", "institution": null}, {"given_name": "Jose", "family_name": "Principe", "institution": null}, {"given_name": "Torbj\u00f8rn", "family_name": "Eltoft", "institution": null}]}