{"title": "Dynamically Adapting Kernels in Support Vector Machines", "book": "Advances in Neural Information Processing Systems", "page_first": 204, "page_last": 210, "abstract": null, "full_text": "Dynamically Adapting Kernels in Support \n\nVector Machines \n\nN ello Cristianini \n\nDept. of Engineering Mathematics \n\nUniversity of Bristol, UK \n\nnello.cristianini@bristol.ac.uk \n\nColin Campbell \n\nDept. of Engineering Mathematics \n\nUniversity of Bristol, UK \n\nc.campbell@bristol.ac.uk \n\nJohn Shawe-Taylor \n\nDept. of Computer Science \n\nRoyal Holloway College \njohn@dcs.rhbnc.ac.uk \n\nAbstract \n\nThe kernel-parameter is one of the few tunable parameters in Sup(cid:173)\nport Vector machines, controlling the complexity of the resulting \nhypothesis. Its choice amounts to model selection and its value is \nusually found by means of a validation set. We present an algo(cid:173)\nrithm which can automatically perform model selection with little \nadditional computational cost and with no need of a validation set . \nIn this procedure model selection and learning are not separate, \nbut kernels are dynamically adjusted during the learning process \nto find the kernel parameter which provides the best possible upper \nbound on the generalisation error. Theoretical results motivating \nthe approach and experimental results confirming its validity are \npresented. \n\n1 \n\nIntroduction \n\nSupport Vector Machines (SVMs) are learning systems designed to automatically \ntrade-off accuracy and complexity by minimizing an upper bound on the general(cid:173)\nisation error provided by VC theory. In practice, however, SVMs still have a few \ntunable parameters which need to be determined in order to achieve the right bal(cid:173)\nance and the values of these are usually found by means of a validation set. 
One of the most important of these is the kernel-parameter, which implicitly defines the structure of the high-dimensional feature space where the maximal margin hyperplane is found. Too rich a feature space would cause the system to overfit the data; conversely, the system can be unable to separate the data if the kernels are too poor. Capacity control can therefore be performed by tuning the kernel parameter subject to the margin being maximized. For noisy datasets, yet another quantity needs to be set, namely the soft-margin parameter C. \n\nSVMs therefore display a remarkable dimensionality reduction for model selection. Systems such as neural networks need many different architectures to be tested, and decision trees are faced with a similar problem during the pruning phase. On the other hand, SVMs can shift from one model complexity to another by simply tuning a continuous parameter. \n\nGenerally, model selection for SVMs is still performed in the standard way: by learning different SVMs and testing them on a validation set in order to determine the optimal value of the kernel-parameter. This is expensive in terms of computing time and training data. In this paper we propose a different scheme which dynamically adjusts the kernel-parameter to explore the space of possible models at little additional computational cost compared to fixed-kernel learning. Furthermore, this approach only makes use of training-set information, so it is more efficient in a sample complexity sense. \n\nBefore proposing the model selection procedure we first prove a theoretical result, namely that the margin and the structural risk minimization (SRM) bound on the generalization error depend smoothly on the kernel parameter. 
This can be exploited by an algorithm which keeps the system close to maximal margin while the kernel parameter is changed smoothly. During this phase, the theoretical bound given by SRM theory can be computed. The best kernel-parameter is the one which gives the lowest possible bound. In section 4 we present experimental results showing that model selection can be efficiently performed using the proposed method (though we only consider Gaussian kernels in the simulations outlined). \n\n2 Support Vector Learning \n\nThe decision function implemented by SV machines can be written as: \n\nf(x) = sign( Σ_{i∈SV} y_i α_i K(x, x_i) - θ ) \n\nwhere the α_i are obtained by maximising the following Lagrangian (where m is the number of patterns): \n\nL = Σ_{i=1}^m α_i - (1/2) Σ_{i,j=1}^m α_i α_j y_i y_j K(x_i, x_j) \n\nwith respect to the α_i, subject to the constraints α_i ≥ 0 and \n\nΣ_{i=1}^m α_i y_i = 0 \n\nand where the functions K(x, x') are called kernels. The kernels provide an expression for dot-products in a high-dimensional feature space [1]: \n\nK(x, x') = ⟨Φ(x), Φ(x')⟩ \n\nand also implicitly define the nonlinear mapping Φ(x) of the training data into feature space, where they may be separated using the maximal margin hyperplane. A number of choices of kernel-function can be made, e.g. Gaussian kernels: \n\nK(x, x') = exp(-||x - x'||^2 / 2σ^2) \n\nThe following upper bound on the generalisation error ε can be proven from VC theory for hyperplanes in feature space [7, 9]: \n\nε ≤ R^2 / (m γ^2) \n\nwhere R is the radius of the smallest ball containing the training set, m the number of training points and γ the margin (cf. [2] for a complete survey of the generalization properties of SV machines). \n\nThe Lagrange multipliers α_i are usually found by means of a Quadratic Programming optimization routine, while the kernel-parameters are found using a validation set. 
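As a concrete illustration of the definitions above (a minimal sketch with our own helper names, not the authors' code), the Gaussian kernel and the SV decision function can be written in Python as:

```python
import numpy as np

def gaussian_kernel(x, x_prime, sigma):
    # K(x, x') = exp(-||x - x'||^2 / (2 sigma^2))
    return np.exp(-np.sum((np.asarray(x) - np.asarray(x_prime)) ** 2)
                  / (2.0 * sigma ** 2))

def sv_decision(x, sv_x, sv_y, alpha, theta, sigma):
    # f(x) = sign( sum_{i in SV} y_i alpha_i K(x, x_i) - theta )
    s = sum(a * y * gaussian_kernel(x, xi, sigma)
            for a, y, xi in zip(alpha, sv_y, sv_x))
    return 1.0 if s - theta >= 0 else -1.0
```

Note that K(x, x) = 1 for any σ, a property used in section 4 to bound the radius R by 1.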
As illustrated in Figure 1 there is a minimum of the generalisation error for that value of the kernel-parameter which gives the best trade-off between overfitting and the ability to find an efficient solution. \n\nFigure 1: Generalization error (y-axis) as a function of σ (x-axis) for the mirror symmetry problem (for Gaussian kernels with zero training error and maximal margin, m = 200, n = 30 and averaged over 10^5 examples). \n\n3 Automatic Model Order Selection \n\nWe now prove a theorem which shows that the margin of the optimal hyperplane is a smooth function of the kernel parameter, as is the upper bound on the generalisation error. First we state the Implicit Function Theorem. \n\nImplicit Function Theorem [10]: Let F(x, y) be a continuously differentiable function, \n\nF : U ⊆ ℝ × V ⊆ ℝ^p → ℝ^p \n\nand let (a, b) ∈ U × V be a solution to the equation F(x, y) = 0. Let the matrix of partial derivatives m_{i,j} = (∂F_i/∂y_j) w.r.t. y be full rank at (a, b). Then, near (a, b), there exists one and only one function y = g(x) such that F(x, g(x)) = 0, and this function is continuous. \n\nTheorem: The margin γ of SV machines depends smoothly on the kernel parameter σ. \n\nProof: Consider the function g : Σ ⊆ ℝ → Λ ⊆ ℝ^p, g : σ ↦ (α^0, λ), which given the data maps the choice of σ to the optimal parameters α^0 and Lagrange parameter λ of the SV machine with kernel matrix G_{ij} = y_i y_j K(σ; x_i, x_j). Let \n\nW_σ(α) = Σ_{i=1}^m α_i - (1/2) Σ_{i,j} α_i α_j y_i y_j K(σ; x_i, x_j) + λ (Σ_i y_i α_i) \n\nbe the functional that the SV machine maximizes. Fix a value of σ and let α^0(σ) be the corresponding solution of W_σ(α). Let I be the set of indices j for which α_j^0(σ) ≠ 0. 
\nWe may assume that the submatrix of G indexed by I is non-singular, since otherwise the maximal margin hyperplane could be expressed in terms of a subset of the indices. Now choose a maximal set of indices J containing I such that the corresponding submatrix of G is non-singular and all of the points indexed by J have margin 1. Now consider the function F(σ, α, λ)_i = (∂W_σ/∂α)_{j_i}, i ≥ 1, F(σ, α, λ)_0 = Σ_j y_j α_j in the neighbourhood of σ, where j_i is an enumeration of the elements of J, \n\n∂W_σ/∂α_j = 1 - y_j Σ_i α_i y_i K(σ; x_i, x_j) + λ y_j \n\nand F satisfies the equation F(σ, α^0(σ), λ(σ)) = 0 at the extremal points of W_σ(α). Then the SV function is the implicit function (α^0, λ) = g(σ), and is continuous (and unique) iff F is continuously differentiable and the matrix of partial derivatives w.r.t. (α, λ) is full rank. But this matrix H is given by \n\nH_{ij} = ∂F_i/∂α_{j_j} = y_{j_i} y_{j_j} K(σ; x_{j_i}, x_{j_j}) = H_{ji}, i, j ≥ 1, \n\nfor j_i, j_j ∈ J, which was non-degenerate by the definition of J, while \n\nH_{00} = ∂F_0/∂λ = 0 and H_{0j} = ∂F_0/∂α_{j_j} = y_{j_j} = ∂F_j/∂λ = H_{j0}, j ≥ 1. \n\nConsider any non-zero α satisfying Σ_j α_j y_j = 0, and any λ. We have \n\n(α, λ)^T H (α, λ) = α^T G α + 2λ α^T y = α^T G α > 0. \n\nHence the matrix H is non-singular for α satisfying the given linear constraint, and so, by the implicit function theorem, g is a continuous function of σ. The following is proven in [2]: \n\nγ^2 = ( Σ_{i=1}^m α_i^0 )^{-1} \n\nwhich shows that γ is a continuous function of σ. As the radius of the ball containing the points is also a continuous function of σ, and the generalization error bound has the form ε ≤ C R(σ)^2 ||α^0(σ)||_1 for some constant C, we have the following corollary. \n\nCorollary: The bound on the generalization error is smooth in σ. \n\nThis means that, when the margin is optimal, small variations in the kernel parameter will produce small variations in the margin (and in the bound on the generalisation error). 
Thus γ_σ ≈ γ_{σ+δσ}, and after updating σ the system will still be in an only slightly sub-optimal position. This suggests the following strategy, for Gaussian kernels for instance: \n\nKernel Selection Procedure \n\n1. Initialize σ to a very small value. \n2. Maximize the margin, then \n• Compute the SRM bound (or observe the validation error) \n• Increase the kernel parameter: σ ← σ + δσ \n3. Stop when a predetermined value of σ is reached, else repeat step 2. \n\nThis procedure takes advantage of the fact that for very small σ convergence is generally very rapid (overfitting the data, of course), and that once the system is near the equilibrium, few iterations will always be sufficient to move it back to the maximal margin situation. In other words, the system is brought to a maximal margin state in the beginning, when this is computationally very cheap, and then it is actively kept in that situation by continuously adjusting the α_i while the kernel-parameter is gradually increased. \n\nIn the next section we will experimentally investigate this procedure for real-life datasets. In the numerical simulations we have used the Kernel-Adatron (KA) algorithm recently developed by two of the authors [4], which can be used to train SV machines. We have chosen this algorithm because it can be regarded as a gradient ascent procedure for maximising the Kuhn-Tucker Lagrangian L. Thus the α_i for a sub-optimal state are close to those for the optimum, and so little computational effort will be needed to bring the system back to a maximal margin position: \n\nThe Kernel-Adatron Algorithm \n\n1. α_i = 1. \n2. FOR i = 1 TO m \n• z_i = Σ_{j=1}^m y_j α_j K(x_i, x_j) \n• γ_i = y_i z_i \n• δα_i = η(1 - γ_i) \n• IF (α_i + δα_i) ≤ 0 THEN α_i = 0 ELSE α_i ← α_i + δα_i \n• margin = (1/2)(min(z_i^+) - max(z_i^-)) \n(z_i^+ (z_i^-) = positively (negatively) labelled patterns) \n3. 
IF (margin = 1) THEN stop, ELSE go to step 2. \n\n4 Experimental Results \n\nIn this section we implement the above algorithm for real-life datasets and plot the upper bound given by VC theory and the generalization error as functions of σ. In order to compute the bound, ε ≤ R^2/(m γ^2), we need to estimate the radius R of the ball in feature space. In general this can be done explicitly by maximising the following Lagrangian w.r.t. the λ_i using convex quadratic programming routines: \n\nL = Σ_i λ_i K(x_i, x_i) - Σ_{i,j} λ_i λ_j K(x_i, x_j) \n\nsubject to the constraints Σ_i λ_i = 1 and λ_i ≥ 0. The radius is then given by the optimal value of this Lagrangian [3]: \n\nR^2 = Σ_i λ_i K(x_i, x_i) - Σ_{i,j} λ_i λ_j K(x_i, x_j) \n\nHowever, we can also get an upper bound for this quantity by noting that Gaussian kernels always map training points to the surface of a sphere of radius 1 centered on the origin of the feature space. This can be easily seen by noting that the distance of a point from the origin is its norm: \n\n||Φ(x)|| = √⟨Φ(x), Φ(x)⟩ = √K(x, x) = √exp(-||x - x||^2/2σ^2) = 1 \n\nIn Figure 2 we give both these bounds (the upper bound is Σ_i α_i/m) and the generalisation error (on a test set) for two standard datasets: the aspect-angle dependent sonar classification dataset of Gorman and Sejnowski [5] and the Wisconsin breast cancer dataset [8]. As we see from these plots there is little need for the additional computational cost of determining R from the above quadratic programming problem, at least for Gaussian kernels. In Fig. 3 we plot the bound Σ_i α_i/m and the generalisation error for 2 figures from a United States Postal Service dataset of handwritten digits [6]. In these, and other instances we have investigated, the minimum of the bound approximately coincides with the minimum of the generalisation error. This gives a good criterion for the most suitable choice of σ. 
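The experimental loop just described (sweep σ upward, re-maximise the margin starting from the previous α_i, and track the bound Σ_i α_i/m, valid since R = 1 for Gaussian kernels) can be sketched as follows. This is a simplified illustration under our own naming and parameter choices, using a batch-style update rather than the per-pattern loop of the listing above and omitting the bias θ, not the authors' actual code:

```python
import numpy as np

def gram(X, sigma):
    # Gaussian Gram matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2))
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def kernel_adatron(K, y, alpha, eta=0.05, max_sweeps=2000, tol=1e-3):
    # Gradient-ascent style updates on the alphas until the margin is ~1.
    for _ in range(max_sweeps):
        z = K @ (alpha * y)                          # z_i = sum_j y_j alpha_j K(x_i, x_j)
        alpha = np.maximum(0.0, alpha + eta * (1.0 - y * z))
        z = K @ (alpha * y)
        margin = 0.5 * (z[y == 1].min() - z[y == -1].max())
        if abs(margin - 1.0) < tol:
            break
    return alpha, margin

def select_sigma(X, y, sigmas, eta=0.05):
    # Sweep sigma, warm-starting each run from the previous alphas, and
    # keep the sigma with the lowest bound sum(alpha)/m (using R = 1).
    alpha = np.ones(len(y))
    best_bound, best_sigma = np.inf, None
    for sigma in sigmas:
        alpha, _ = kernel_adatron(gram(X, sigma), y, alpha, eta)
        bound = alpha.sum() / len(y)
        if bound < best_bound:
            best_bound, best_sigma = bound, sigma
    return best_sigma, best_bound
```

The warm start is what keeps the cost low: each new σ begins near the previous maximal margin solution, so only a few sweeps are needed to restore margin 1.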
Furthermore, this estimate for the best σ is derived solely from training data, without the need for an additional validation set. \n\nFigure 2: Generalisation error (solid curves) for the sonar classification (left Fig.) and Wisconsin breast cancer datasets (right Fig.). The upper curves (dotted) show the upper bounds from VC theory (for the top curves R = 1). \n\nStarting with a small σ-value we have observed that the margin can be maximised rapidly. Furthermore, the margin remains close to 1 if σ is incremented by a small amount. Consequently, we can study the performance of the system by traversing a range of σ-values, alternately incrementing σ then maximising the margin using the previous optimal set of α_i-values as a starting point. We have found that this procedure does not add a significant computational cost in general. For example, for the sonar classification dataset mentioned above and starting at σ = 0.1 with increments δσ = 0.1, it took 186 iterations to reach σ = 1.0 and 4895 to reach σ = 2.0, as against 110 and 2624 iterations for learning at these two σ-values alone. For a rough doubling of the learning time it is possible to determine a reasonable value of σ for good generalisation without use of a validation set. \n\nFigure 3: Generalisation error (solid curve) and upper bound from VC theory (dashed curve with R = 1) for digits 0 and 3 from the USPS dataset of handwritten digits. \n\n5 Conclusion \n\nWe have presented an algorithm which automatically learns the kernel parameter with little additional cost, both in a computational and a sample-complexity sense. 
\nModel selection takes place during the learning process itself, and experimental results are provided showing that this strategy gives a good estimate of the correct model complexity. \n\nReferences \n\n[1] Aizerman, M., Braverman, E., and Rozonoer, L. (1964). Theoretical Foundations of the Potential Function Method in Pattern Recognition Learning, Automation and Remote Control, 25:821-837. \n\n[2] Bartlett, P. & Shawe-Taylor, J. (1998). Generalization Performance of Support Vector Machines and Other Pattern Classifiers. In 'Advances in Kernel Methods - Support Vector Learning', Bernhard Schölkopf, Christopher J. C. Burges, and Alexander J. Smola (eds.), MIT Press, Cambridge, USA. \n\n[3] Burges, C. (1998). A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2:121-167. \n\n[4] Friess, T., Cristianini, N. & Campbell, C. (1998). The Kernel-Adatron Algorithm: a Fast and Simple Learning Procedure for Support Vector Machines. In Shavlik, J. (ed.), Machine Learning: Proceedings of the Fifteenth International Conference, Morgan Kaufmann Publishers, San Francisco, CA. \n\n[5] Gorman, R. P. & Sejnowski, T. J. (1988). Neural Networks 1:75-89. \n\n[6] LeCun, Y., Jackel, L. D., Bottou, L., Brunot, A., Cortes, C., Denker, J. S., Drucker, H., Guyon, I., Muller, U. A., Sackinger, E., Simard, P. & Vapnik, V. (1995). Comparison of learning algorithms for handwritten digit recognition. International Conference on Artificial Neural Networks, Fogelman, F. and Gallinari, P. (eds.), pp. 53-60. \n\n[7] Shawe-Taylor, J., Bartlett, P., Williamson, R. & Anthony, M. (1996). Structural Risk Minimization over Data-Dependent Hierarchies. NeuroCOLT Technical Report NC-TR-96-053 (ftp://ftp.dcs.rhbnc.ac.uk/pub/neurocolt/tech_reports). \n\n[8] Ster, B. & Dobnikar, A. (1996). Neural networks in medical diagnosis: comparison with other methods. In A. Bulsari et al. (ed.) 
Proceedings of the International Conference EANN '96, pp. 427-430. \n\n[9] Vapnik, V. (1995). The Nature of Statistical Learning Theory. Springer Verlag. \n\n[10] James, Robert C. (1966). Advanced Calculus. Belmont, Calif.: Wadsworth. \n\n\f", "award": [], "sourceid": 1485, "authors": [{"given_name": "Nello", "family_name": "Cristianini", "institution": null}, {"given_name": "Colin", "family_name": "Campbell", "institution": null}, {"given_name": "John", "family_name": "Shawe-Taylor", "institution": null}]}