{"title": "Spectral Kernel Methods for Clustering", "book": "Advances in Neural Information Processing Systems", "page_first": 649, "page_last": 655, "abstract": null, "full_text": "Spectral Kernel Methods for Clustering \n\nNello Cristianini \nBIOwulf Technologies \nnello@support-vector.net \n\nJohn Shawe-Taylor \nJaz Kandola \nRoyal Holloway, University of London \n{john, jaz}@cs.rhul.ac.uk \n\nAbstract \n\nIn this paper we introduce new algorithms for unsupervised learning based on the use of a kernel matrix. All the information required by such algorithms is contained in the eigenvectors of the matrix or of closely related matrices. We use two different but related cost functions, the Alignment and the 'cut cost'. The first one is discussed in a companion paper [3], the second one is based on graph theoretic concepts. Both functions measure the level of clustering of a labeled dataset, or the correlation between data clusters and labels. We state the problem of unsupervised learning as assigning labels so as to optimize these cost functions. We show how the optimal solution can be approximated by slightly relaxing the corresponding optimization problem, and how this corresponds to using eigenvector information. The resulting simple algorithms are tested on real world data with positive results. \n\n1 Introduction \n\nKernel based learning provides a modular approach to learning system design [2]. A general algorithm can be selected for the appropriate task before being mapped onto a particular application through the choice of a problem specific kernel function. The kernel based method works by mapping data to a high dimensional feature space implicitly defined by the choice of the kernel function. The kernel function computes the inner product of the images of two inputs in the feature space. 
From a practitioner's viewpoint this function can also be regarded as a similarity measure and hence provides a natural way of incorporating domain knowledge about the problem into the bias of the system. \n\nOne important learning problem is that of dividing the data into classes according to a cost function together with their relative positions in the feature space. We can think of this as clustering in the kernel defined feature space, or non-linear clustering in the input space. \n\nIn this paper we introduce two novel kernel-based methods for clustering. They both assume that a kernel has been chosen and the kernel matrix constructed. The methods then make use of the matrix's eigenvectors, or of the eigenvectors of the closely related Laplacian matrix, in order to infer a label assignment that approximately optimizes one of two cost functions. See also [4] for use of spectral decompositions of the kernel matrix. The paper includes some analysis of the algorithms together with tests of the methods on real world data with encouraging results. \n\n2 Two partition cost measures \n\nAll the information needed to specify a clustering of a set of data is contained in the matrix Mij = (cluster(xi) == cluster(xj)), where (A == B) ∈ {-1, +1}. After a clustering is specified, one can measure its cost in many ways. We propose here two cost functions that are easy to compute and lead to efficient algorithms. \n\nLearning is possible when some collusion between input distribution and target exists, so that we can predict the target based on the input. Typically one would expect points with similar labels to be clustered and the clusters to be separated. This can be detected in two ways: either by measuring the amount of label-clustering or by measuring the correlation between such variables. In the first case, we need to measure how close points of the same class are to each other and how distant they are from points of different classes. 
In the second case, kernels can be regarded as oracles predicting whether two points are in the same class. The 'true' oracle is the one that knows the true matrix M. A measure of quality can be obtained by measuring the Pearson correlation coefficient between the kernel matrix K and the true M. Both approaches lead to the same quantity, known as the alignment [3]. \n\nWe will use the following definition of the inner product between matrices: (K1, K2)_F = Σ_{i,j=1}^m K1(xi, xj) K2(xi, xj). The index F refers to the Frobenius norm that corresponds to this inner product. \n\nDefinition 1 (Alignment) The (empirical) alignment of a kernel k1 with a kernel k2 with respect to the sample S is the quantity \n\nÂ(S, k1, k2) = (K1, K2)_F / sqrt((K1, K1)_F (K2, K2)_F), \n\nwhere Ki is the kernel matrix for the sample S using kernel ki. \n\nThis can also be viewed as the cosine of the angle between two bi-dimensional vectors K1 and K2 representing the Gram matrices. If we consider k2 = yy', where y is the vector of {-1, +1} labels for the sample, then with a slight abuse of notation \n\nÂ(S, k, y) = (K, yy')_F / sqrt((K, K)_F (yy', yy')_F) = (K, yy')_F / (m ||K||_F), since (yy', yy')_F = m^2. \n\nAnother measure of separation between classes is the average separation between two points in different classes, again normalised by the matrix norm. \n\nDefinition 2 (Cut Cost) The cut cost of a clustering is defined as \n\nC(S, k, y) = Σ_{ij: yi ≠ yj} k(xi, xj) / (m ||K||_F). \n\nThis quantity is motivated by a graph theoretic concept. If we consider the kernel matrix as the adjacency matrix of a fully connected weighted graph whose nodes are the data points, the cost of partitioning a graph is given by the total weight of the edges that one needs to cut or remove, and is exactly the numerator of the 'cut cost'. 
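The two definitions above translate directly into a few lines of code. The following is a minimal numpy sketch (our illustration, not the paper's code; the names `frobenius`, `alignment` and `cut_cost` are ours); it also checks numerically the identity relating the cut cost to a difference of alignments:

```python
import numpy as np

def frobenius(K1, K2):
    # <K1, K2>_F = sum_ij K1[i, j] * K2[i, j]
    return float(np.sum(K1 * K2))

def alignment(K1, K2):
    # Definition 1: cosine of the angle between the Gram matrices viewed as vectors.
    return frobenius(K1, K2) / np.sqrt(frobenius(K1, K1) * frobenius(K2, K2))

def cut_cost(K, y):
    # Definition 2: kernel weight between points with different labels,
    # normalised by m * ||K||_F.
    m = len(y)
    different = np.not_equal.outer(y, y)      # True where y_i != y_j
    return float(K[different].sum()) / (m * np.linalg.norm(K, 'fro'))

# Quick check of C(S,k,y) = 0.5 * (T(S,k) - A(S,k,y)) on a random linear kernel.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
K = X @ X.T
y = rng.choice([-1, 1], size=20)
A = alignment(K, np.outer(y, y))              # A(S, k, y)
T = alignment(K, np.ones((20, 20)))           # T(S, k) = A(S, k, j)
assert abs(cut_cost(K, y) - 0.5 * (T - A)) < 1e-10
```

Note that `alignment(K, np.outer(y, y))` reproduces Â(S, k, y), since (yy', yy')_F = m^2.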
Notice also the relation between alignment and cut cost: \n\nÂ(S, k, y) = Σ_{ij} k(xi, xj) / (m ||K||_F) - 2 C(S, k, y) = T(S, k) - 2 C(S, k, y), \n\nwhere T(S, k) = Â(S, k, j), for j the all ones vector. Among the appealing properties of the alignment is that this quantity is sharply concentrated around its mean, as proven in the companion paper [3]. This shows that the expected alignment can be reliably estimated from its empirical estimate Â(S, k, y). As the cut cost can be expressed as the difference of two alignments, \n\nC(S, k, y) = 0.5 (T(S, k) - Â(S, k, y)),   (1) \n\nit will be similarly concentrated around its expected value. \n\n3 Optimising the cost with spectral techniques \n\nIn this section we will introduce and test two related methods for clustering, as well as their extensions to transduction. The general problem we want to solve is to assign class-labels to datapoints so as to optimize one of the two cost functions given above. By equation (1) the optimal solution to both problems is identical for a fixed data set and kernel. The difference between the approaches is in the two approximation algorithms developed for the different cost functions. The approximation algorithms are obtained by relaxing the discrete problems of optimising over all possible labellings of a dataset to closely related continuous problems solved by eigenvalue decompositions. See [5] for use of eigenvectors in partitioning sparse matrices. \n\n3.1 Optimising the alignment \n\nTo optimise the alignment, the problem is to find the maximally aligned set of labels \n\nÂ*(S, k) = max_{y ∈ {-1,1}^m} Â(S, k, y) = max_{y ∈ {-1,1}^m} (K, yy')_F / (m sqrt((K, K)_F)). \n\nSince in this setting the kernel is fixed, maximising the alignment reduces to choosing y ∈ {-1,1}^m to maximise (K, yy')_F = y'Ky. 
If we allow y to be chosen from the larger set ℝ^m subject to the constraint ||y||^2 = m, we obtain an approximate maximum-alignment problem that can be solved efficiently. After solving the relaxed problem, we can obtain an approximate discrete solution by applying a suitable threshold to the entries of the vector y and taking the sign. Bounds will be given on the quality of the approximations. The solution of the approximate problem follows from the following theorem, which provides a variational characterization of the spectrum of symmetric matrices. \n\nTheorem 3 (Courant-Fischer Minimax Theorem) If M ∈ ℝ^{m×m} is symmetric, then for k = 1, ..., m, \n\nλk(M) = max_{dim(T)=k} min_{0≠v∈T} v'Mv / v'v = min_{dim(T)=m-k+1} max_{0≠v∈T} v'Mv / v'v. \n\nFor the largest eigenvalue the characterization reduces to a single maximisation, λmax = max_{0≠v∈ℝ^m} v'Kv / v'v, so the approximate alignment problem is solved by the first eigenvector and the maximal alignment is upper bounded by a multiple of the first eigenvalue. One can now transform the vector v into a vector in {-1, +1}^m by choosing the threshold θ that gives maximum alignment of y = sign(vmax - θ). By definition, the value of alignment Â(S, k, y) obtained by this y will be a lower bound of the optimal alignment, hence we have \n\nÂ(S, k, y) ≤ Â*(S, k) ≤ λmax / ||K||_F. \n\nOne can hence estimate the quality of a dichotomy by comparing its value with the upper bound. The absolute alignment tells us how specialized a kernel is on a given dataset: the higher this quantity, the more committed the kernel is to a specific dichotomy. \n\nThe first eigenvector can be calculated in many ways, for example by the Lanczos procedure, which is already effective for large datasets. 
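For moderate m, a dense eigendecomposition suffices. The sketch below (our illustration, not the paper's code; the function name is ours) computes the leading eigenvector of K, sweeps the threshold θ over its entries, and returns the best-aligned sign vector together with the upper bound λmax/||K||_F:

```python
import numpy as np

def align_by_thresholding(K):
    # Relaxed problem: max y'Ky subject to ||y||^2 = m is solved by the
    # eigenvector of the largest eigenvalue of K (Courant-Fischer).
    m = K.shape[0]
    norm_K = np.linalg.norm(K, 'fro')
    w, V = np.linalg.eigh(K)              # eigenvalues in ascending order
    v = V[:, -1]                          # eigenvector of the largest eigenvalue
    best_y, best_a = None, -np.inf
    for theta in np.sort(v):              # candidate thresholds: the entries of v
        y = np.where(v - theta >= 0, 1.0, -1.0)
        a = (y @ K @ y) / (m * norm_K)    # alignment of this dichotomy
        if a > best_a:
            best_a, best_y = a, y
    upper = w[-1] / norm_K                # lambda_max / ||K||_F
    return best_y, best_a, upper
```

The returned alignment is guaranteed to lie below the returned upper bound, mirroring the inequality Â(S, k, y) ≤ λmax/||K||_F above.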
Search engines like Google are based on estimating the first eigenvector of a matrix with dimensionality more than 10^9, so for very large datasets there are approximation techniques. \n\nWe applied the procedure outlined above to two datasets from the UCI repository. We preprocessed the data by normalising the input vectors in the kernel defined feature space and then centering them by shifting the origin (of the feature space) to their centre of gravity. This can be achieved by the following transformation of the kernel matrix, K ← K - m^{-1} jg' - m^{-1} gj' + m^{-2} (j'Kj) J, where j is the all ones vector, J the all ones matrix and g the vector of row sums of K. \n\nFigure 1: (a) Plot of alignment of the different eigenvectors with the labels, ordered by increasing eigenvalue. (b) Plot for Breast Cancer data (linear kernel) of λmax/||K||_F (straight line), Â(S, k, y) for y = sign(vmax - θi) (bottom curve), and the accuracy of y (middle curve) against threshold number i. \n\nThe first experiment applied the unsupervised technique to the Breast Cancer data with a linear kernel. Figure 1(a) shows the alignment of the different eigenvectors with the labels. The highest alignment is shown by the last eigenvector, corresponding to the largest eigenvalue. \n\nFor each value θi of the threshold, Figure 1(b) shows the upper bound λmax/||K||_F (straight line), the alignment Â(S, k, y) for y = sign(vmax - θi) (bottom curve), and the accuracy of y (middle curve). Notice that where actual alignment and upper bound on alignment get closest, we have confidence that we have partitioned our data well, and in fact the accuracy is also maximized. Notice also that the choice of the threshold corresponds to maintaining the correct proportion between positives and negatives. 
This suggests another possible threshold selection strategy, based on the availability of enough labeled points to give a good estimate of the proportion of positive points in the dataset. This is one way label information can be used to choose the threshold. At the end of the experiments we will describe another 'transduction' method. \n\nIt is a measure of how naturally the data separates that this procedure is able to optimise the split with an accuracy of approximately 97.29% by choosing the threshold that maximises the alignment (threshold number 435) but without making any use of the labels. \n\nIn Figure 2a we present the same results for the Gaussian kernel (σ = 6). In this case the accuracy obtained by optimising the alignment (threshold number 316) of the resulting dichotomy is less impressive, being only about 79.65%. Finally, Figure 2b shows the same results for the Ionosphere dataset. Here the accuracy of the split that optimises the alignment (threshold number 158) is approximately 71.37%. \n\nFigure 2: Plot for Breast Cancer data (Gaussian kernel) (a) and Ionosphere data (linear kernel) (b) of λmax/||K||_F (straight line), Â(S, k, y) for y = sign(vmax - θi) (bottom curve), and the accuracy of y (middle curve) against threshold number i. \n\nWe can also use the overall approach to adapt the kernel to the data. For example we can choose the kernel parameters so as to optimize λmax/||K||_F, then find the first eigenvector, choose a threshold to maximise the alignment, and output the corresponding y. \n\nThe cost to the alignment of changing a label yi is 2 Σ_j yj k(xi, xj)/||K||_F, so that if a point is isolated from the others, or if it is equally close to the two different classes, then changing its label will have only a very small effect. On the other hand, labels in strongly clustered points clearly contribute to the overall cost and changing their label will alter the alignment significantly. 
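The normalise-then-centre preprocessing used in the experiments above can be written compactly. This is a sketch of ours (not the paper's code), assuming a symmetric positive semi-definite kernel matrix with a strictly positive diagonal:

```python
import numpy as np

def normalise_and_centre(K):
    # Normalise feature vectors to unit length: K_ij / sqrt(K_ii * K_jj).
    d = np.sqrt(np.diag(K))
    K = K / np.outer(d, d)
    # Centre in feature space: K <- K - m^-1 j g' - m^-1 g j' + m^-2 (j'Kj) J,
    # with j the all ones vector and g the vector of row sums of K.
    m = K.shape[0]
    g = K.sum(axis=1)
    return K - np.add.outer(g, g) / m + K.sum() / m**2
```

After centering, every row of the transformed matrix sums to zero, reflecting the shift of the origin to the centre of gravity in feature space.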
\nThe method we have described can be viewed as projecting the data into a 1-dimensional space and finding a threshold. The projection also implicitly sorts the data so that points of the same class are nearby in the ordering. We discuss the problem in the 2-class case. We consider embedding the set into the real line, so as to satisfy a clustering criterion. The resulting kernel matrix should appear as a block diagonal matrix. \n\nThis problem has been addressed in the case of information retrieval in [1], and also applied to assembling sequences of DNA. In those cases, the eigenvectors of the Laplacian have been used, and the approach is called the Fiedler ordering. Although the Fiedler ordering could be used here as well, we present a variation based on the simple kernel matrix. \n\nLet the coordinate of the point xi on the real line be v(i). Consider the cost function Σ_{ij} v(i)v(j)K(i, j). It is maximized when points with high similarity have the same sign and high absolute value, and when points with different sign have low similarity. The choice of coordinates v that optimizes this cost is the first eigenvector, and hence by sorting the data according to the value of their entry in this eigenvector one can hope to find a good permutation that renders the kernel matrix block diagonal. Figure 3 shows the results of this heuristic applied to the Breast Cancer dataset. The grey level indicates the size of the kernel entry. The figure on the left is for the unsorted data, while that on the right shows the same plot after sorting. The sorted figure clearly shows the effectiveness of the method. 
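The sorting heuristic just described is a one-liner given the leading eigenvector. The following sketch (ours, not the paper's code) permutes the kernel matrix accordingly:

```python
import numpy as np

def sort_by_first_eigenvector(K):
    # Sort points by their entry in the leading eigenvector of K, so that
    # similar points become adjacent and K approaches block diagonal form.
    _, V = np.linalg.eigh(K)          # eigenvalues ascending; last column is leading
    order = np.argsort(V[:, -1])
    return K[np.ix_(order, order)], order
```

The global sign of the eigenvector is arbitrary, so the ordering may come out reversed; this does not affect the block structure.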
\n\n3.2 Optimising the cut-cost \n\nFor a fixed kernel matrix, minimising the cut-cost corresponds to minimising Σ_{yi ≠ yj} k(xi, xj), that is, the sum of the kernel entries between points of two different classes. Since we are dealing with normalized kernels, this also controls the expected distance between them. \n\nFigure 3: Gram matrix for cancer data, before and after permutation of data according to sorting order of first eigenvector of K. \n\nWe can express this quantity as \n\nΣ_{yi ≠ yj} Kij = 0.5 (Σ_{i,j} Kij - y'Ky) = 0.5 y'Ly, \n\nwhere L is the Laplacian matrix, defined as L = D - K, where D = diag(d1, ..., dm) with di = Σ_{j=1}^m k(xi, xj). One would like to find y ∈ {-1, +1}^m so as to minimize the cut cost subject to the division being even, but this problem is NP-hard. Following the same strategy as with the alignment we can impose a slightly looser constraint on y: \n\nmin y'Ly subject to y ∈ ℝ^m, Σi yi^2 = m, Σi yi = 0. \n\nSince zero is an eigenvalue of L with eigenvector j, the all ones vector, the problem is equivalent to finding the eigenvector of the smallest non-zero eigenvalue, λ = min_{0≠y⊥j} y'Ly / y'y. Hence, this eigenvalue λ provides a lower bound on the cut cost, \n\nmin_{y ∈ {-1,1}^m} C(S, k, y) ≥ λ / (2 ||K||_F). \n\nSo the eigenvector corresponding to the eigenvalue λ of the Laplacian can be used to obtain a good approximate split and λ gives a lower bound on the cut-cost. One can now threshold the entries of the eigenvector in order to obtain a vector with -1 and +1 entries. We again plot the lower bound, cut-cost, and error rate as a function of the threshold. \n\nWe applied the procedure to the Breast Cancer data with both linear and Gaussian kernels. The results are shown in Figure 4. 
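A minimal sketch of the Laplacian-based split (ours, not the paper's code). For simplicity it assumes the second-smallest eigenvalue of L is simple and thresholds the corresponding eigenvector at zero instead of sweeping all thresholds:

```python
import numpy as np

def laplacian_split(K):
    # L = D - K; the eigenvector of the smallest non-zero eigenvalue of L
    # solves the relaxed minimum cut-cost problem.
    L = np.diag(K.sum(axis=1)) - K
    w, V = np.linalg.eigh(L)          # eigenvalues ascending
    v = V[:, 1]                       # index 0 is the zero eigenvalue, eigenvector j
    y = np.where(v >= 0, 1, -1)
    lower = w[1] / (2 * np.linalg.norm(K, 'fro'))   # lambda / (2 ||K||_F)
    return y, lower
```

This is precisely the Fiedler vector of the weighted graph defined by K, so the variation of Section 3.1 and this method differ only in which matrix is decomposed.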
Now using the cut cost to select the best threshold for the linear kernel sets it at 378 with an accuracy of 67.86%, significantly worse than the results obtained by optimising the alignment. With the Gaussian kernel, on the other hand, the method selects threshold 312 with an accuracy of 80.31%, a slight improvement over the results obtained with this kernel by optimising the alignment. \n\nSo far we have presented algorithms that use unsupervised data. We now consider the situation where we are given a partially labelled dataset. This leads to a simple algorithm for transduction or semi-supervised learning. The idea that some labelled data might improve performance comes from observing Figure 4b, where the selection based on the cut-cost is clearly suboptimal. By incorporating some label information, it is hoped that we can obtain an improved threshold selection. \n\nFigure 4: Plot for Breast Cancer data using (a) linear kernel and (b) Gaussian kernel of C(S, k, y) and λ/(2||K||_F) (dashed curves), for y = sign(vmax - θi), and the error of y (solid curve) against threshold number i. \n\nLet z be the vector containing the known labels and 0 elsewhere. Set K^P = K + C0 zz', where C0 is a positive constant parameter. We now use the original matrix K to generate the eigenvector, but the matrix K^P when measuring the cut-cost of the classifications generated by different thresholds. Taking C0 = 1 we performed 5 random selections of 20% of the data and obtained a mean success rate of 85.56% (standard deviation 0.67%) for the Breast Cancer data with Gaussian kernel, a marked improvement over the 80.31% achieved with no label information. \n\n4 Conclusions \n\nThe paper has considered two partition costs: the first derived from the so-called alignment of a kernel to a label vector, and the second from the cut-cost of a label vector for a given kernel matrix. 
The two quantities are both optimised by the same labelling, but give rise to different approximation algorithms when the discrete constraint is removed from the labelling vector. It was shown how these relaxed problems are solved exactly using spectral techniques, hence leading to two distinct approximation algorithms through a post-processing phase that re-discretises the vector to create a labelling that is chosen to optimise the given criterion. \n\nExperiments are presented showing the performance of both of these clustering techniques with some very striking results. For the second algorithm we also gave one preliminary experiment with a transductive version that enables some labelled data to further refine the clustering. \n\nReferences \n\n[1] M. W. Berry, B. Hendrickson, and P. Raghavan. Sparse matrix reordering schemes for browsing hypertext. In The Mathematics of Numerical Analysis, pages 99-123. AMS, 1996. \n\n[2] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, 2000. See also the web site www.support-vector.net. \n\n[3] Nello Cristianini, Andre Elisseeff, John Shawe-Taylor, and Jaz Kandola. On kernel-target alignment. Submitted to Proceedings of Neural Information Processing Systems (NIPS), 2001. \n\n[4] Nello Cristianini, Huma Lodhi, and John Shawe-Taylor. Latent semantic kernels for feature selection. Technical Report NC-TR-00-080, NeuroCOLT Working Group, http://www.neurocolt.org, 2000. \n\n[5] A. Pothen, H. Simon, and K. Liou. Partitioning sparse matrices with eigenvectors of graphs. SIAM J. Matrix Anal., 11(3):430-452, 1990. \n", "award": [], "sourceid": 2002, "authors": [{"given_name": "Nello", "family_name": "Cristianini", "institution": null}, {"given_name": "John", "family_name": "Shawe-Taylor", "institution": null}, {"given_name": "Jaz", "family_name": "Kandola", "institution": null}]}