{"title": "On Kernel-Target Alignment", "book": "Advances in Neural Information Processing Systems", "page_first": 367, "page_last": 373, "abstract": null, "full_text": "On Kernel-Target Alignment \n\nNello Cristianini \nBIOwulf Technologies \nnello@support-vector.net \n\nJohn Shawe-Taylor \nRoyal Holloway, University of London \njohn@cs.rhul.ac.uk \n\nAndre Elisseeff \nBIOwulf Technologies \nandre@barnhilltechnologies.com \n\nJaz Kandola \nRoyal Holloway, University of London \njaz@cs.rhul.ac.uk \n\nAbstract \n\nWe introduce the notion of kernel-alignment, a measure of similarity between two kernel functions or between a kernel and a target function. This quantity captures the degree of agreement between a kernel and a given learning task, and has very natural interpretations in machine learning, leading also to simple algorithms for model selection and learning. We analyse its theoretical properties, proving that it is sharply concentrated around its expected value, and we discuss its relation with other standard measures of performance. Finally we describe some of the algorithms that can be obtained within this framework, giving experimental results showing that adapting the kernel to improve alignment on the labelled data significantly increases the alignment on the test set, giving improved classification accuracy. Hence, the approach provides a principled method of performing transduction. \n\nKeywords: Kernels, alignment, eigenvectors, eigenvalues, transduction \n\n1 Introduction \n\nKernel based learning algorithms [1] are modular systems formed by a general-purpose learning element and by a problem-specific kernel function. It is crucial for the performance of the system that the kernel function somehow fits the learning target, that is, that in the feature space the data distribution is somehow correlated to the label distribution. 
Several results exist showing that generalization takes place only when such a correlation exists (nofreelunch; luckiness), and many classic estimators of performance (e.g. the margin) can be understood as estimating this relation. In other words, selecting a kernel in this class of systems amounts to the classic feature and model selection problems in machine learning. \nMeasuring the similarity between two kernels, or the degree of agreement between a kernel and a given target function, is hence an important problem both for conceptual and for practical reasons. As an example, it is well known that one can obtain complex kernels by combining or manipulating simpler ones, but how can one predict whether the resulting kernel is better or worse than its components? \nWhat a kernel does is to virtually map data into a feature space, so that their relative positions in that space are what matters. The degree of clustering achieved in that space, and the relation between the clusters and the labeling to be learned, should be captured by such an estimator. \nAlternatively, one could regard kernels as 'oracles' or 'experts' giving their opinion on whether two given points belong to the same class or not. In this case, the correlation between experts (seen as random variables) should provide an indication of their similarity. \nWe will argue that - if one were in possession of this information - the ideal kernel for a classification target y(x) would be K(x, z) = y(x)y(z). One way of estimating the extent to which the kernel achieves the right clustering is to compare the sum of the within-class distances with the sum of the between-class distances. This will correspond to the alignment between the kernel and the ideal kernel y(x)y(z). By measuring the similarity of this ideal kernel with the kernel at hand - on the training set - one can assess the degree of fitness of such a kernel. 
The measure of similarity that we propose, 'kernel alignment', would in this way give a reliable estimate of its expected value, since it is sharply concentrated around its mean. \nIn this paper we motivate and introduce the notion of Alignment (Section 2); prove its concentration (Section 3); discuss its implications for the generalisation of a simple classifier (Section 4); deduce some simple algorithms to optimize it (Section 5); and finally report on some experiments (Section 6). \n\n2 Alignment \n\nGiven an (unlabelled) sample S = {x_1, ..., x_m}, we use the following inner product between Gram matrices: \n\n<K_1, K_2>_F = Σ_{i,j=1}^m K_1(x_i, x_j) K_2(x_i, x_j). \n\nDefinition 1 (Alignment) The (empirical) alignment of a kernel k_1 with a kernel k_2 with respect to the sample S is the quantity \n\nA(S, k_1, k_2) = <K_1, K_2>_F / sqrt(<K_1, K_1>_F <K_2, K_2>_F), \n\nwhere K_i is the kernel matrix for the sample S using kernel k_i. \nThis can also be viewed as the cosine of the angle between the two Gram matrices K_1 and K_2, regarded as vectors. If we consider K_2 = yy', where y is the vector of {-1, +1} labels for the sample, then \n\nA(S, K, yy') = <K, yy'>_F / sqrt(<K, K>_F <yy', yy'>_F) = <K, yy'>_F / (m sqrt(<K, K>_F)), \n\nsince <yy', yy'>_F = m^2. \nWe will occasionally omit the arguments K or y when these are understood from the context, or when y forms part of the sample. In the next section we will see how this definition provides us with a method for selecting kernel parameters and also for combining kernels. \n\n3 Concentration \n\nThe following theorem shows that the alignment is not too dependent on the training set S. This result is expressed in terms of 'concentration'. Concentration means that the probability of an empirical estimate deviating from its mean can be bounded as an exponentially decaying function of that deviation. 
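Before stating the theorem, note that the empirical alignment of Definition 1 is immediate to compute from the Gram matrices; a minimal numpy sketch (the function names are ours, not from the paper):

```python
import numpy as np

def alignment(K1, K2):
    """Empirical alignment A(S, k1, k2): the Frobenius inner product
    <K1, K2>_F = sum_ij K1_ij K2_ij, normalised to a cosine."""
    return np.sum(K1 * K2) / np.sqrt(np.sum(K1 * K1) * np.sum(K2 * K2))

def alignment_to_labels(K, y):
    """Alignment with the ideal kernel yy'; equals <K, yy'>_F / (m ||K||_F)."""
    return alignment(K, np.outer(y, y))

# sanity check: the ideal kernel itself has alignment exactly 1
y = np.array([1.0, 1.0, -1.0, -1.0])
K_ideal = np.outer(y, y)
print(alignment_to_labels(K_ideal, y))  # 1.0
```

The identity kernel K = I, by contrast, is only weakly aligned (1/sqrt(m)), matching the intuition that a kernel which clusters nothing carries little information about the target.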
\nThis will have a number of implications for the application and optimisation of the alignment. For example, if we optimise the alignment on a random sample, we can expect it to remain high on a second sample. Furthermore, we will show in the next section that if the expected value of the alignment is high, then there exist functions that generalise well. Hence, the result suggests that we can optimise the alignment on a training set and expect to keep high alignment, and hence good performance, on a test set. Our experiments will demonstrate that this is indeed the case. \nThe theorem makes use of the following result due to McDiarmid. Note that E_S is the expectation operator under the selection of the sample. \n\nTheorem 2 (McDiarmid [4]) Let X_1, ..., X_n be independent random variables taking values in a set A, and assume that f : A^n -> R satisfies, for 1 <= i <= n, \n\nsup_{x_1,...,x_n, x_i' ∈ A} |f(x_1, ..., x_n) - f(x_1, ..., x_{i-1}, x_i', x_{i+1}, ..., x_n)| <= c_i; \n\nthen for all ε > 0, \n\nP{|f(X_1, ..., X_n) - E[f(X_1, ..., X_n)]| >= ε} <= 2 exp(-2ε^2 / Σ_{i=1}^n c_i^2). \n\nTheorem 3 The sample based estimate of the alignment is concentrated around its expected value. For a kernel with feature vectors of norm 1, we have that \n\nP^m{S : |Â(S) - A(y)| >= ε} <= δ, where ε = C(S) sqrt(8 ln(2/δ) / m), (1) \n\nfor a non-trivial function C(S) and value A(y). \n\nProof: Let \n\nÂ_1(S) = (1/m^2) Σ_{i,j=1}^m y_i y_j k(x_i, x_j), Â_2(S) = (1/m^2) Σ_{i,j=1}^m k(x_i, x_j)^2, and A(y) = E_S[Â_1(S)] / sqrt(E_S[Â_2(S)]). \n\nFirst note that Â(S) = Â_1(S) / sqrt(Â_2(S)). Define A_1 = E_S[Â_1(S)] and A_2 = E_S[Â_2(S)]. We first make use of McDiarmid's theorem to show that the Â_i(S) are concentrated for i = 1, 2. Consider the training set S' = S \ {(x_i, y_i)} ∪ {(x_i', y_i')}. We must bound the difference \n\n|Â_j(S) - Â_j(S')| <= (2/m^2)(2(m - 1) + 1) < 4/m, \n\nfor j = 1, 2. 
Hence, we have c_i = 4/m for all i, and from an application of McDiarmid's theorem we obtain, for j = 1 and 2, \n\nP{|Â_j(S) - A_j| >= ε} <= 2 exp(-ε^2 m / 8). \n\nSetting ε = sqrt(8 ln(2/δ) / m), the right hand sides are less than or equal to δ/2. Hence, with probability at least 1 - δ, we have |Â_j(S) - A_j| < ε for j = 1, 2. But whenever these two inequalities hold, a direct calculation shows that |Â(S) - A(y)| = |Â_1(S)/sqrt(Â_2(S)) - A_1/sqrt(A_2)| <= C(S) ε, for a suitable function C(S) of Â_1(S) and Â_2(S), which proves the theorem. \n\nRemark. We could also define the true Alignment, based on the input distribution P, as follows: given functions f, g : X^2 -> R, we define <f, g>_P = ∫_{X^2} f(x, z) g(x, z) dP(x) dP(z). Then the alignment of a kernel k_1 with a kernel k_2 is the quantity \n\nA(k_1, k_2) = <k_1, k_2>_P / sqrt(<k_1, k_1>_P <k_2, k_2>_P). \n\nIt is then possible to prove that, asymptotically as m tends to infinity, the empirical alignment as defined above converges to the true alignment. However, if one wants to obtain unbiased convergence it is necessary to slightly modify its definition by removing the diagonal, since for finite samples the diagonal biases the expectation by receiving too large a weight. With this modification, A(y) in the statement of the theorem becomes the true alignment. For simplicity we prefer not to pursue this avenue further in this short article; we just note that the change is not significant. \n\n4 Generalization \n\nIn this section we consider the implications of high alignment for the generalisation of a classifier. By generalisation we mean the test error err(h) = P(h(x) != y). Our next observation relates the generalisation of a simple classification function to the value of the alignment. The function we consider is the expected Parzen window estimator h(x) = sign(f(x)) = sign(E_{(x',y')}[y' k(x', x)]). This corresponds to thresholding a linear function f in the feature space. We will show that if there is high alignment then this function will have good generalisation. Hence, by optimising the alignment we may expect Parzen window estimators to perform well. 
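In code, the empirical counterpart of this estimator simply replaces the expectation with an average over the training sample; a minimal sketch (the function name and the toy data are illustrative):

```python
import numpy as np

def parzen_window_predict(k, X_train, y_train, x):
    """Empirical Parzen window estimator: h(x) = sign( mean_i y_i k(x_i, x) ),
    the sample average standing in for the expectation E[y' k(x', x)]."""
    scores = np.array([yi * k(xi, x) for xi, yi in zip(X_train, y_train)])
    return np.sign(scores.mean())

# linear kernel on a trivially separable toy set
k = lambda a, b: float(np.dot(a, b))
X = [np.array([1.0, 0.0]), np.array([0.9, 0.1]),
     np.array([-1.0, 0.0]), np.array([-0.8, -0.2])]
y = [1, 1, -1, -1]
print(parzen_window_predict(k, X, y, np.array([0.7, 0.0])))  # 1.0
```

Note that no optimisation is involved: the classifier is determined entirely by the kernel and the labels, which is why its error can be tied so directly to the alignment.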
\nWe will demonstrate that this prediction does indeed hold good in experiments. \n\nTheorem 4 Given any δ > 0, with probability 1 - δ over a randomly drawn training set S, the generalisation error of the expected Parzen window estimator h(x) = sign(E_{(x',y')}[y' k(x', x)]) is bounded from above by \n\nerr(h) <= 1 - Â(S) + ε + (m sqrt(Â_2(S)))^{-1}, where ε = C(S) sqrt((8/m) ln(4/δ)). \n\nProof: (sketch) We assume throughout that the kernel has been normalised so that k(x, x) = 1 for all x. First observe that by Theorem 3, with probability greater than 1 - δ/2, |A(y) - Â(S)| <= ε. The result will follow if we show that, with probability greater than 1 - δ/2, the generalisation error of h can be upper bounded by 1 - A(y) + 1/(Cm), where C = sqrt(E_S[Â_2(S)]). Consider the quantity A(y) from Theorem 3. Since the diagonal terms satisfy k(x_i, x_i) = 1, \n\nA(y) = E_S[(1/m^2) Σ_{i,j=1}^m y_i y_j k(x_i, x_j)] / C = E_S[(1/(Cm^2)) Σ_{i != j} y_i y_j k(x_i, x_j)] + 1/(Cm). \n\nNow let a = P(h(x) = y). Since |f(x)| <= E_{(x',y')}[|k(x', x)|] <= 1, we have y f(x) <= 1 when h(x) = y and y f(x) <= 0 otherwise. Hence \n\nE_S[(1/(Cm^2)) Σ_{i != j} y_i y_j k(x_i, x_j)] <= 1 × a + 0 × (1 - a) = a, and err(h) = 1 - a <= 1 - A(y) + 1/(Cm). \n\nAn empirical estimate of the function f would be the Parzen window function. The expected margin of the empirical function is concentrated around the expected margin of the expected Parzen window. Hence, with high probability, we can bound the error of f in terms of the empirically estimated alignment Â(S). This is omitted due to lack of space. The concentration of f is considered in [3]. \n\n5 Algorithms \n\nThe concentration of the alignment can be directly used for tuning a kernel family to the particular task, or for selecting a kernel from a set, with no need for training. 
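For example, one could choose the width of a Gaussian kernel by computing, for each candidate width, the alignment of its Gram matrix with yy' on the labelled sample and keeping the maximiser - no classifier is trained at any point. A sketch of this idea (the candidate widths and the toy data are ours):

```python
import numpy as np

def gram(X, sigma):
    """Gaussian Gram matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def select_sigma(X, y, sigmas):
    """Return the width whose Gram matrix is most aligned with yy':
    model selection by alignment alone, with no training step."""
    yy = np.outer(y, y)
    def align(K):
        return np.sum(K * yy) / np.sqrt(np.sum(K * K) * np.sum(yy * yy))
    return max(sigmas, key=lambda s: align(gram(X, s)))

# two well-separated clusters: an intermediate width should win, since a tiny
# width gives K ~ I and a huge width gives K ~ all-ones, both poorly aligned
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.array([1.0] * 20 + [-1.0] * 20)
best = select_sigma(X, y, [0.01, 0.1, 1.0, 10.0, 100.0])
print(best)
```

By Theorem 3 the alignment measured this way on the labelled sample is, with high probability, close to its expectation, which is what licenses using it as a selection criterion.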
\nThe probability that the level of alignment observed on the training set will be out by more than ε from its expectation, for one of the kernels in a finite set N, is bounded by δ, where ε is given by equation (1) with δ replaced by δ/|N|, that is ε = C(S) sqrt((8/m)(ln|N| + ln(2/δ))), where |N| is the size of the set from which the kernel has been chosen. In fact we will select from an infinite family of kernels. Providing a uniform bound for such a class would require covering numbers and is beyond the scope of this paper. \nOne of the main consequences of the definition of kernel alignment is in providing a practical criterion for combining kernels. We will justify the intuitively appealing idea that two kernels that each have a certain alignment with a target, but are not aligned to each other, will give rise to a more aligned kernel combination. In particular we have that \n\nA(S, K_1 + K_2, yy') = (<K_1, yy'>_F + <K_2, yy'>_F) / (m ||K_1 + K_2||_F), with ||K_1 + K_2||_F <= ||K_1||_F + ||K_2||_F. \n\nThis shows that if two kernels with equal alignment to a given target y are also completely aligned to each other, then ||K_1 + K_2||_F = ||K_1||_F + ||K_2||_F and the alignment of the combined kernel remains the same. If on the other hand the kernels are not completely aligned, then ||K_1 + K_2||_F < ||K_1||_F + ||K_2||_F and the alignment of the combined kernel is correspondingly increased. \nTo illustrate the approach we will take to optimising the kernel, consider a kernel that can be written in the form k(x, x') = Σ_k μ_k y^k(x) y^k(x'), where all the y^k are orthogonal with respect to the inner product defined on the training set S, <y, y'>_S = Σ_{i=1}^m y_i y'_i. Assume further that one of them, y^t, is the true label vector. We can now evaluate the alignment as A(y) ~ μ_t / sqrt(Σ_k μ_k^2). In terms of the Gram matrix this is written as K_ij = Σ_k μ_k y_i^k y_j^k, where y_i^k is the i-th label of the k-th classification. This special case is approximated by the decomposition into eigenvectors of the kernel matrix, K = Σ_i λ_i v_i v_i', where v_i' denotes the transpose of v_i and v_i is the i-th eigenvector with eigenvalue λ_i. 
In other words, the more peaked the spectrum, the more aligned (specific) the kernel can be. \nIf by chance the eigenvector of the largest eigenvalue λ_1 corresponds to the target labeling, then we will give to that labeling a fraction λ_1 / sqrt(Σ_i λ_i^2) of the weight that we can allocate to different possible labelings. The larger the emphasis of the kernel on a given target, the higher its alignment. \nWe observed above that combining non-aligned kernels that are each aligned with the target yields a kernel that is more aligned to the target. Consider the base kernels K_i = v_i v_i', where the v_i are the eigenvectors of K, the kernel matrix for both labeled and unlabeled data. Instead of choosing only the most aligned ones, one could use a linear combination, with the weights proportional to their alignment (to the available labels): k = Σ_i f(a_i) v_i v_i', where a_i is the alignment of the kernel K_i, and f(a) is a monotonically increasing function (e.g. the identity or an exponential). Note that a recombination of these rank 1 kernels was made in the so-called latent semantic kernels [2]. The overall alignment of the new kernel with the labeled data should be increased, and the new kernel matrix is expected also to be more aligned to the unseen test labels (because of the concentration, and the assumption that the split was random). \nMoreover, in general one can set up an optimization problem aimed at finding the optimal α, that is, the parameters that maximize the alignment of the combined kernel with the available labels. Given K = Σ_i α_i v_i v_i', using the orthonormality of the v_i and the fact that <v v', u u'>_F = <v, u>^2, the alignment can be written as \n\nÂ(y) = <K, yy'>_F / (sqrt(<yy', yy'>_F) sqrt(Σ_{i,j} α_i α_j <v_i v_i', v_j v_j'>_F)) = Σ_i α_i <v_i, y>^2 / (m sqrt(Σ_i α_i^2)). \n\nHence we have the following optimization problem: \n\nmaximise W(α) = Σ_i α_i <v_i, y>^2 subject to Σ_i α_i^2 = 1. (2) \n\nSetting derivatives of the corresponding Lagrangian to zero we obtain <v_i, y>^2 - 2λα_i = 0, and hence α_i is proportional to 
<v_i, y>^2, giving the overall alignment Â(y) = sqrt(Σ_i <v_i, y>^4) / m. \nThis analysis suggests the following transduction algorithm. Given a partially labelled set of examples, optimise its alignment by adapting the full kernel matrix, recombining its rank one eigenmatrices v_i v_i' using the coefficients α_i determined by measuring the alignment between v_i and y on the labelled examples. Our results suggest that we should see a corresponding increase in the alignment on the unlabelled part of the set, and hence a reduction in test error when using a Parzen window estimator. Results of experiments testing these predictions are given in the next section. \n\n6 Experiments \n\nWe applied the transduction algorithm designed to take advantage of our results by optimizing alignment with the labeled part of the dataset, using the spectral method described above. All of the results are averaged over 20 random splits, with the standard deviation given in brackets. Table 1 shows the alignments of the Gram matrices to the label matrix for different sizes of training set. The index indicates the percentage of training points. The K matrices are before adaptation, while the G matrices are after optimisation of the alignment using equation (2). \n\nTable 1: Mean and associated standard deviation alignment values using a linear kernel on the Breast (left two columns) and Ionosphere (right two columns) data. \n\n     | Train Align   | Test Align    | Train Align   | Test Align \nK80  | 0.092 (0.029) | 0.076 (0.007) | 0.207 (0.020) | 0.240 (0.083) \nG80  | 0.228 (0.012) | 0.219 (0.041) | 0.240 (0.016) | 0.257 (0.059) \nK50  | 0.075 (0.016) | 0.084 (0.017) | 0.210 (0.031) | 0.216 (0.033) \nG50  | 0.242 (0.023) | 0.181 (0.043) | 0.257 (0.023) | 0.202 (0.015) \nK20  | 0.072 (0.022) | 0.081 (0.006) | 0.227 (0.057) | 0.210 (0.015) \nG20  | 0.273 (0.037) | 0.034 (0.046) | 0.326 (0.023) | 0.118 (0.017) 
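The adaptation step that produces the G matrices - recombining the rank one eigenmatrices with weights proportional to <v_i, y>^2 on the labelled part, then predicting with a Parzen window - can be sketched as follows (the toy data, the identity choice for f, and the helper names are ours):

```python
import numpy as np

def adapt_kernel(K, y_lab, lab):
    """Recombine the rank one eigenmatrices v_i v_i' of the full Gram matrix,
    weighting each by <v_i, y>^2 measured on the labelled indices only
    (the f(a) = a, i.e. identity, variant)."""
    _, V = np.linalg.eigh(K)              # columns of V are eigenvectors v_i
    weights = (V[lab, :].T @ y_lab) ** 2  # one weight per eigenvector
    return (V * weights) @ V.T            # sum_i w_i v_i v_i'

def transduce(K, y_lab, lab, unlab):
    """Label the unlabelled points with a Parzen window over the labelled
    ones, using the adapted Gram matrix."""
    G = adapt_kernel(K, y_lab, lab)
    return np.sign(G[np.ix_(unlab, lab)] @ y_lab)

# toy transduction problem: two Gaussian blobs, every other label hidden
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 0.3, (10, 2)), rng.normal(1, 0.3, (10, 2))])
y = np.array([1.0] * 10 + [-1.0] * 10)
K = X @ X.T                               # linear kernel on all 20 points
lab, unlab = np.arange(0, 20, 2), np.arange(1, 20, 2)
pred = transduce(K, y[lab], lab, unlab)
print(np.mean(pred == y[unlab]))
```

The eigendecomposition uses both labelled and unlabelled points, while the weights use only the labels that are available - this is what makes the procedure transductive.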
\nThe results on the left are for the Breast Cancer data using a linear kernel, while the results on the right are for the Ionosphere data. \nThe left two columns of Table 2 show the alignment values for the Breast Cancer data using a Gaussian kernel, together with the performance of an SVM classifier trained with the given Gram matrix in the third column. The right two columns show the performance of the Parzen window classifier on the test set for the Breast linear kernel (left column) and Ionosphere (right column). \n\nTable 2: Breast alignment (cols 1, 2) and SVM error for a Gaussian kernel (sigma = 6) (col 3); Parzen window error for Breast (col 4) and Ionosphere (col 5). \n\nThe results clearly show that optimising the alignment on the training set does indeed increase its value, in all but one case by more than the sum of the standard deviations. Furthermore, as predicted by the concentration, this improvement is maintained in the alignment measured on the test set with both linear and Gaussian kernels in all but one case (20% train with the linear kernel). The results for Ionosphere are less conclusive. Again as predicted by the theory, the larger the alignment, the better the performance that is obtained using the Parzen window estimator. The results of applying an SVM to the Breast Cancer data using a Gaussian kernel show a very slight improvement in the test error for both 80% and 50% training sets. \n\n7 Conclusions \n\nWe have introduced a measure of performance of a kernel machine that is much easier to analyse than standard measures (e.g. the margin) and that provides much simpler algorithms. We have discussed its statistical and geometrical properties, demonstrating that it is a well motivated and formally useful quantity. \nBy identifying that the ideal kernel matrix has a structure of the type yy', we have been able to transform a measure of similarity between kernels into a measure of fitness of a given kernel. 
The ease and reliability with which this quantity can be estimated, using only training set information prior to training, makes it an ideal tool for practical model selection. We have given preliminary experimental results that largely confirm the theoretical analysis and augur well for the use of this tool in more sophisticated model (kernel) selection applications. \n\nReferences \n\n[1] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, 2000. See also the web site www.support-vector.net. \n\n[2] Nello Cristianini, Huma Lodhi, and John Shawe-Taylor. Latent semantic kernels for feature selection. Technical Report NC-TR-00-080, NeuroCOLT Working Group, http://www.neurocolt.org, 2000. \n\n[3] L. Devroye, L. Gyorfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Number 31 in Applications of Mathematics. Springer, 1996. \n\n[4] C. McDiarmid. On the method of bounded differences. In Surveys in Combinatorics 1989, pages 148-188. Cambridge University Press, 1989. \n", "award": [], "sourceid": 1946, "authors": [{"given_name": "Nello", "family_name": "Cristianini", "institution": null}, {"given_name": "John", "family_name": "Shawe-Taylor", "institution": null}, {"given_name": "Andr\u00e9", "family_name": "Elisseeff", "institution": null}, {"given_name": "Jaz", "family_name": "Kandola", "institution": null}]}