{"title": "Incorporating Invariances in Non-Linear Support Vector Machines", "book": "Advances in Neural Information Processing Systems", "page_first": 609, "page_last": 616, "abstract": null, "full_text": "Incorporating Invariances in Nonlinear Support Vector Machines\n\nOlivier Chapelle\nolivier.chapelle@lip6.fr\nLIP6, Paris, France\nBiowulf Technologies\n\nBernhard Schölkopf\nbernhard.schoelkopf@tuebingen.mpg.de\nMax-Planck-Institute, Tübingen, Germany\nBiowulf Technologies\n\nAbstract\n\nThe choice of an SVM kernel corresponds to the choice of a representation of the data in a feature space and, to improve performance, it should therefore incorporate prior knowledge such as known transformation invariances. We propose a technique which extends earlier work and aims at incorporating invariances in nonlinear kernels. We show on a digit recognition task that the proposed approach is superior to the Virtual Support Vector method, which previously had been the method of choice.\n\n1 Introduction\n\nIn some classification tasks, a priori knowledge is available about the invariances related to the task. For instance, in image classification, we know that the label of a given image should not change after a small translation or rotation.\n\nMore generally, we assume we know a local transformation L_t depending on a parameter t (for instance, a vertical translation of t pixels) such that any point x should be considered equivalent to L_t x, the transformed point. Ideally, the output of the learned function should be constant when its inputs are transformed by the desired invariance. It has been shown [1] that one cannot find a non-trivial kernel which is globally invariant. 
For this reason, we consider here local invariances, and for this purpose we associate with each training point x_i a tangent vector dx_i,\n\ndx_i = lim_{t -> 0} (1/t) (L_t x_i - x_i) = d(L_t x_i)/dt |_{t=0}.\n\nIn practice dx_i can be computed either by finite differences or by differentiation. Note that generally one can consider more than one invariance transformation.\n\nA common way of introducing invariances in a learning system is to add the perturbed examples L_t x_i to the training set [7]. Those points are often called virtual examples. In the SVM framework, when applied only to the SVs, this leads to the Virtual Support Vector (VSV) method [10]. An alternative is to modify the cost function directly in order to take the tangent vectors into account. This has been successfully applied to neural networks [13] and to linear Support Vector Machines [11]. The aim of the present work is to extend these methods to the case of nonlinear SVMs, which will be achieved mainly by using the kernel PCA trick [12].\n\nThe paper is organized as follows. After introducing the basics of Support Vector Machines in section 2, we recall the method proposed in [11] to train invariant linear SVMs (section 3). In section 4, we show how to extend it to the nonlinear case, and finally experimental results are provided in section 5.\n\n2 Support Vector Learning\n\nWe introduce some standard notations for SVMs; for a complete description, see [15]. Let {(x_i, y_i)}_{1 <= i <= n} be a set of training examples with labels y_i in {-1, 1}. The SVM decision function takes the form f(x) = sign(sum_i alpha_i y_i K(x_i, x) + b), where the kernel K corresponds to a dot product in a high dimensional feature space H via a mapping Phi, K(x, y) = Phi(x) . Phi(y).\n\n3 Invariant Linear SVMs\n\nFor a linear SVM, the method of [11] enforces local invariance by minimizing, instead of the squared norm w^T w alone, the functional (1 - gamma) w^T w + gamma sum_i (w . dx_i)^2, where gamma in [0, 1) controls the trade-off between maximizing the margin and enforcing the invariances. This is equivalent to training a standard SVM on the preprocessed inputs C_gamma^{-1} x_i, with C_gamma = ((1 - gamma) I + gamma sum_i dx_i dx_i^T)^{1/2}.\n\n4 Extension to the Nonlinear Case\n\nTo extend this approach to nonlinear SVMs, one would like to perform the same preprocessing in the feature space, i.e. to use the matrix\n\nC_gamma = ( (1 - gamma) I + gamma sum_{i=1}^{n} dPhi(x_i) dPhi(x_i)^T )^{1/2}\n\n(5)\n\nand the new kernel function\n\nK~(x, y) = C_gamma^{-1} Phi(x) . C_gamma^{-1} Phi(y) = Phi(x)^T C_gamma^{-2} Phi(y)\n\n(6)\n\nHowever, due to the high dimension of the feature space, it is impossible to do this directly. We propose two different ways of overcoming this difficulty. 
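As a concrete illustration of the modified kernel (6), the following sketch (our own, not code from the paper) works in an explicit low-dimensional feature space, where Phi is taken to be the identity so that C_gamma^2 can be formed directly as a small matrix; the data, tangent vectors, and value of gamma are arbitrary choices.

```python
import numpy as np

def make_invariant_kernel(X, dX, gamma):
    """Modified kernel K~(x, y) = Phi(x)^T C_gamma^{-2} Phi(y) of eq. (6),
    with Phi = identity so that C_gamma^2 = (1 - gamma) I
    + gamma * sum_i dx_i dx_i^T is a small explicit matrix."""
    d = X.shape[1]
    C2 = (1 - gamma) * np.eye(d) + gamma * dX.T @ dX  # C_gamma^2, cf. eq. (5)
    C2_inv = np.linalg.inv(C2)
    return lambda x, y: x @ C2_inv @ y

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(5, 2))      # toy training points
dX = np.stack([-X[:, 1], X[:, 0]], axis=1)   # rotation tangent vectors

K_inv = make_invariant_kernel(X, dX, gamma=0.5)
# For gamma = 0, C_gamma is the identity and the ordinary
# dot product kernel is recovered.
```

In the truly nonlinear case this direct construction is impossible, which is exactly what motivates the two methods of sections 4.1 and 4.2.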
\n\n4.1 Decomposition of the tangent Gram matrix\n\nIn order to be able to compute the new kernel (6), we propose to diagonalize the matrix C_gamma (eq. (5)) using an approach similar to the kernel PCA trick [12]. That article showed how the feature space covariance matrix of a set of points can be diagonalized by computing the eigendecomposition of the Gram matrix of those points. Presently, instead of a set of training points {Phi(x_i)}, we have a set of tangent vectors {dPhi(x_i)} and a tangent covariance matrix (the right term of the sum in (5)). Let us introduce the Gram matrix K^t of the tangent vectors:\n\nK^t_ij = dPhi(x_i) . dPhi(x_j)\n = K(x_i + dx_i, x_j + dx_j) - K(x_i + dx_i, x_j) - K(x_i, x_j + dx_j) + K(x_i, x_j) (7)\n = dx_i^T (d^2 K(x_i, x_j) / (dx_i dx_j)) dx_j. (8)\n\nThis matrix K^t can be computed either by finite differences (equation (7)) or with the analytical derivative expression given by equation (8). Note that for a linear kernel, K(x, y) = x^T y, and (8) reads K^t_ij = dx_i^T dx_j, which is the standard dot product between the tangent vectors.\n\nWriting the eigendecomposition of K^t as K^t = U Lambda U^T and using the kernel PCA tools [12], one can show after some algebra (details in [2]) that the new kernel reads\n\nK~(x, y) = 1/(1 - gamma) ( K(x, y) + sum_{p=1}^{n} 1/lambda_p ( (1 - gamma)/(gamma lambda_p + 1 - gamma) - 1 ) ( sum_{i=1}^{n} U_ip dx_i^T dK(x_i, x)/dx_i ) ( sum_{i=1}^{n} U_ip dx_i^T dK(x_i, y)/dx_i ) ).\n\n4.2 The kernel PCA map\n\nA drawback of the previous approach appears when one wants to deal with multiple invariances (i.e. more than one tangent vector per training point). Indeed, it requires diagonalizing the matrix K^t (cf. eq. (7)), whose size is equal to the total number of tangent vectors. For this reason, we propose an alternative method. The idea is to use directly the so-called kernel PCA map, first introduced in [12] and extended in [14]. 
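The finite-difference expression (7) for the tangent Gram matrix is straightforward to implement; the sketch below (our own illustration, with arbitrary data) also checks the remark that, for a linear kernel, it reduces exactly to the dot product between tangent vectors.

```python
import numpy as np

def tangent_gram(K, X, dX):
    """Tangent Gram matrix K^t of eq. (7), by finite differences:
    K^t_ij = K(x_i+dx_i, x_j+dx_j) - K(x_i+dx_i, x_j)
             - K(x_i, x_j+dx_j) + K(x_i, x_j)."""
    n = len(X)
    Kt = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            Kt[i, j] = (K(X[i] + dX[i], X[j] + dX[j])
                        - K(X[i] + dX[i], X[j])
                        - K(X[i], X[j] + dX[j])
                        + K(X[i], X[j]))
    return Kt

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(6, 2))
dX = 0.1 * np.stack([-X[:, 1], X[:, 0]], axis=1)  # small rotation tangents

# For the linear kernel K(x, y) = x^T y, eq. (7) gives exactly
# K^t_ij = dx_i^T dx_j.
Kt_lin = tangent_gram(lambda x, y: x @ y, X, dX)
```

The same function applies unchanged to a nonlinear kernel such as an RBF, for which (7) is a finite-difference approximation of (8).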
\n\nThis map is based on the fact that even in a high dimensional feature space H, a training set {x_1, ..., x_n} of size n, when mapped to this feature space, spans a subspace E of H whose dimension is at most n. More precisely, if (v_1, ..., v_n) is an orthonormal basis of E with each v_p being a principal axis of {Phi(x_1), ..., Phi(x_n)}, the kernel PCA map psi: x -> R^n is defined coordinatewise as\n\npsi_p(x) = Phi(x) . v_p, 1 <= p <= n.\n\nEach principal direction has a linear expansion on the training points {Phi(x_i)} and the coefficients of this expansion are obtained using kernel PCA [12]. Writing the eigendecomposition of K as K = U Lambda U^T, with U an orthonormal matrix and Lambda a diagonal one, it turns out that the kernel PCA map reads\n\npsi(x) = Lambda^{-1/2} U^T k(x), (9)\n\nwhere k(x) = (K(x, x_1), ..., K(x, x_n))^T.\n\nNote that by definition, for all i and j, Phi(x_i) and Phi(x_j) lie in E and thus K(x_i, x_j) = Phi(x_i) . Phi(x_j) = psi(x_i) . psi(x_j). This reflects the fact that if we retain all principal components, kernel PCA is just a basis transform in E, leaving the dot product of training points invariant.\n\nAs a consequence, training a nonlinear SVM on {x_1, ..., x_n} is equivalent to training a linear SVM on {psi(x_1), ..., psi(x_n)}, and thus, thanks to the nonlinear mapping psi, we can work directly in the linear space E and use exactly the technique described for invariant linear SVMs (section 3). However, the invariance directions dPhi(x_i) do not necessarily belong to E. By projecting them onto E, some information might be lost. The hope is that this approximation will give a decision function similar to the exact one obtained in section 4.1.\n\nFinally, the proposed algorithm consists in training an invariant linear SVM as described in section 3 with training set {psi(x_1), ..., psi(x_n)} and with invariance directions {dpsi(x_1), ... 
, dpsi(x_n)}, where dpsi(x_i) = psi(x_i + dx_i) - psi(x_i), which can be expressed from equation (9) as\n\ndpsi(x_i) = Lambda^{-1/2} U^T (k(x_i + dx_i) - k(x_i)).\n\n4.3 Comparison with the VSV method\n\nOne might wonder what the difference is between enforcing an invariance and simply adding the virtual examples L_t x_i to the training set. Indeed, the two approaches are related and some equivalence can be shown [6].\n\nSo why not just add virtual examples? This is the idea of the Virtual Support Vector (VSV) method [10]. The reason is the following: if a training point x_i is far from the margin, adding the virtual example L_t x_i will not change the decision boundary, since neither of the points can become a support vector. Hence adding virtual examples in the SVM framework enforces invariance only around the decision boundary, which, as an aside, is the main reason why the virtual SV method only adds virtual examples generated from points that were support vectors in the earlier iteration.\n\nOne might argue that the points which are far from the decision boundary do not provide any information anyway. On the other hand, there is some merit in keeping not only the output label but also the real-valued output invariant under the transformation L_t. This can be justified by seeing the distance of a given point to the margin as an indication of its class-conditional probability [8]. It appears reasonable that an invariance transformation should not affect this probability too much.\n\n5 Experiments\n\nIn our experiments, we compared a standard SVM with several methods taking invariances into account: a standard SVM with virtual examples (cf. the VSV method [10]) [VSV], the invariant SVM described in section 4.1 [ISVM], and the invariant hyperplane in kernel PCA coordinates described in section 4.2 [IHKPCA].\n\nThe hybrid method described in [11] (see end of section 3) did not perform better than the VSV method and is not included in our experiments for this reason. 
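The kernel PCA map of equation (9) can be sketched as follows (our own illustration, with an arbitrary RBF kernel and random data); it also lets one check numerically that the map preserves dot products between training points, psi(x_i) . psi(x_j) = K(x_i, x_j).

```python
import numpy as np

def rbf(x, y, sigma=0.5):
    """Gaussian kernel (an arbitrary choice for this illustration)."""
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(8, 2))
K = np.array([[rbf(xi, xj) for xj in X] for xi in X])

# Eigendecomposition K = U Lambda U^T (K is positive definite here).
lam, U = np.linalg.eigh(K)

def psi(x):
    """Kernel PCA map of eq. (9): psi(x) = Lambda^{-1/2} U^T k(x),
    with k(x) = (K(x, x_1), ..., K(x, x_n))^T."""
    k_x = np.array([rbf(x, xi) for xi in X])
    return (U / np.sqrt(lam)).T @ k_x   # = Lambda^{-1/2} U^T k(x)
```

Stacking psi(x_i) into a matrix Psi, one gets Psi Psi^T = K on the training points, which is the dot-product-preserving property used above.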
\nNote that in the following experiments, each tangent vector dPhi(x_i) has been normalized by the average length sqrt( sum_i ||dPhi(x_i)||^2 / n ) in order to be scale independent.\n\n5.1 Toy problem\n\nThe toy problem we considered is the following: the training data has been generated uniformly from [-1, 1]^2. The true decision boundary is a circle centered at the origin: f(x) = sign(||x||^2 - 0.7).\n\nThe a priori knowledge we want to encode in this toy problem is local invariance under rotations. Therefore, the output of the decision function on a given training point x_i and on its image R(x_i, epsilon), obtained by a small rotation, should be as similar as possible. To each training point, we associate a tangent vector dx_i which is orthogonal to x_i.\n\nA training set of 30 points was generated and the experiments were repeated 100 times. A Gaussian kernel K(x, y) = exp(-||x - y||^2 / (2 sigma^2)) was chosen.\n\nThe results are summarized in figure 1. Adding virtual examples (VSV method) is already very useful, since it made the test error decrease from 6.25% to 3.87% (with the best choice of sigma). But the use of ISVM or IHKPCA yields even better performance. On this toy problem, the more the invariances are enforced (gamma -> 1), the better the performance (see right side of figure 1), reaching a test error of 1.11%.\n\nWhen comparing log sigma = 1.4 and log sigma = 0 (right side of figure 1), one notices that the decrease in the test error does not have the same speed. This is actually the dual of the phenomenon observed on the left side of this figure: for the same value of gamma, the test error tends to increase when sigma is larger. This analysis suggests that gamma needs to be adapted as a function of sigma. This can be done automatically by the gradient descent technique described in [3]. 
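The toy setup just described can be sketched as follows (our own illustration; the random seed is an arbitrary choice). The tangent vector of an infinitesimal rotation at x = (x1, x2) is (-x2, x1), which is indeed orthogonal to x.

```python
import numpy as np

# Toy problem: 30 points uniform in [-1, 1]^2, labels given by a circle
# of squared radius 0.7 centered at the origin (seed is arbitrary).
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(30, 2))
y = np.sign(np.sum(X ** 2, axis=1) - 0.7)

# Rotation invariance: d/dt R(t) x |_{t=0} = (-x2, x1), orthogonal to x.
dX = np.stack([-X[:, 1], X[:, 0]], axis=1)

def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian kernel K(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))
```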
\n\nFigure 1: Left: test error of the different learning algorithms plotted against the width sigma of the RBF kernel, with gamma fixed to 0.9. Right: test error of IHKPCA plotted against gamma, for different values of sigma (log sigma = -0.8, 0 and 1.4). The test errors are averaged over the 100 splits and the error bars correspond to the standard deviation of the means.\n\n5.2 Handwritten digit recognition\n\nAs a real world experiment, we tried to incorporate invariances into a handwritten digit recognition task. The USPS dataset has been used extensively in the past for this purpose, especially in the SVM community. It consists of 7291 training and 2007 test examples.\n\nAccording to [9], the best performance has been obtained with a polynomial kernel of degree 3, and all the experiments described in this section were performed using this kernel. The local transformations we considered are translations (horizontal and vertical). All the tangent vectors have been computed by a finite difference between the original digit and its 1-pixel translate.\n\nWe split the training set into 23 subsets of 317 training examples, after a random permutation of the training and test set. We also concentrated on a binary classification problem, namely separating digits 0 to 4 from digits 5 to 9. The gain in performance should also be valid for the multiclass case.\n\nFigure 2 compares ISVM, IHKPCA and VSV for different values of gamma. From those figures, it can be seen that the difference between ISVM (the original method) and IHKPCA (the approximation) is much larger than in the toy example. The difference to the toy example is probably due to the input dimensionality. 
In 2 dimensions, with an RBF kernel, the 30 examples of the toy problem \"almost span\" the whole feature space, whereas with 256 dimensions this is no longer the case.\n\nWhat is noteworthy in these experiments is that our proposed method is much better than the standard VSV. As explained in section 4.3, the reason for this might be that invariance is enforced around all training points and not only around support vectors. Note that what we call VSV here is a standard SVM with a double size training set containing the original data points and their translates.\n\nThe horizontal invariance yields larger improvements than the vertical one. One of the reasons might be that the digits in the USPS database are already centered vertically.\n\nFigure 2: Comparison of ISVM, IHKPCA and VSV on the USPS dataset; the left panel corresponds to a vertical translation (to the top) and the right panel to a horizontal translation (to the right). The left part of each plot (gamma = 0) corresponds to a standard SVM, whereas the right part (gamma -> 1) means that a lot of emphasis is put on the enforcement of the constraints. The test errors are averaged over the 23 splits and the error bars correspond to the standard deviation of the means.\n\n6 Conclusion\n\nWe have extended a method for constructing invariant hyperplanes to the nonlinear case. We have shown results that are superior to the virtual SV method. The latter has recently broken the record on the NIST database, which is the \"gold standard\" of handwritten digit benchmarks [5]; it therefore appears promising to also try the new system on that task. 
For this purpose, a large scale version of this method needs to be derived. The first idea we tried is to compute the kernel PCA map using only a subset of the training points. Encouraging results have been obtained on the 10-class USPS database (with the whole training set), but other methods are also currently under study.\n\nReferences\n\n[1] C. J. C. Burges. Geometry and invariance in kernel based methods. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods - Support Vector Learning. MIT Press, 1999.\n\n[2] O. Chapelle and B. Schölkopf. Incorporating invariances in nonlinear Support Vector Machines, 2001. Available at: www-connex.lip6.fr/~chapelle.\n\n[3] O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee. Choosing multiple parameters for support vector machines. Machine Learning, 46:131-159, 2002.\n\n[4] C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:273-297, 1995.\n\n[5] D. DeCoste and B. Schölkopf. Training invariant support vector machines. Machine Learning, 2001. In press.\n\n[6] T. K. Leen. From data distributions to regularization in invariant learning. In NIPS, volume 7. MIT Press, 1995.\n\n[7] P. Niyogi, T. Poggio, and F. Girosi. Incorporating prior information in machine learning by creating virtual examples. Proceedings of the IEEE, 86(11):2196-2209, November 1998.\n\n[8] J. Platt. Probabilities for support vector machines. In A. Smola, P. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers. MIT Press, Cambridge, MA, 2000.\n\n[9] B. Schölkopf, C. Burges, and V. Vapnik. Extracting support data for a given task. In U. M. Fayyad and R. Uthurusamy, editors, First International Conference on Knowledge Discovery & Data Mining. AAAI Press, 1995.\n\n[10] B. Schölkopf, C. Burges, and V. Vapnik. 
Incorporating invariances in support vector learning machines. In Artificial Neural Networks - ICANN'96, volume 1112 of Springer Lecture Notes in Computer Science, pages 47-52, Berlin, 1996.\n\n[11] B. Schölkopf, P. Y. Simard, A. J. Smola, and V. N. Vapnik. Prior knowledge in support vector kernels. In NIPS, volume 10. MIT Press, 1998.\n\n[12] B. Schölkopf, A. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299-1310, 1998.\n\n[13] P. Simard, Y. LeCun, J. Denker, and B. Victorri. Transformation invariance in pattern recognition, tangent distance and tangent propagation. In G. Orr and K. Müller, editors, Neural Networks: Tricks of the Trade. Springer, 1998.\n\n[14] K. Tsuda. Support vector classifier with asymmetric kernel function. In M. Verleysen, editor, Proceedings of ESANN'99, pages 183-188, 1999.\n\n[15] V. Vapnik. Statistical Learning Theory. John Wiley & Sons, 1998.\n", "award": [], "sourceid": 2024, "authors": [{"given_name": "Olivier", "family_name": "Chapelle", "institution": null}, {"given_name": "Bernhard", "family_name": "Sch\u00f6lkopf", "institution": null}]}