if C is not definite (cf. (11)). Alternatively, we can use the pseudoinverse below.

2 As an aside, note that our goal to build invariant SV machines has thus serendipitously provided us with an approach to an open problem in SV learning, namely that of scaling: in SV machines, there has so far been no way of automatically assigning different weights to different directions in input space - in a trained SV machine, the weights of the first layer (the SVs) form a subset of the training set. Choosing these Support Vectors from the training set only gives rather limited possibilities for appropriately dealing with different scales in different directions of input space.

644    B. Schölkopf, P. Simard, A. J. Smola and V. Vapnik

Figure 1: Kernel utilizing local correlations in images, corresponding to a dot product in a polynomial space which is spanned mainly by local correlations between pixels (see text).

The resulting kernel will be of order d1 · d2; however, it will not contain all possible correlations of d1 · d2 pixels.

5 EXPERIMENTAL RESULTS

In the experiments, we used a subset of the MNIST database of handwritten characters (Bottou et al., 1994), consisting of 5000 training examples and 10000 test examples at a resolution of 20x20 pixels, with entries in [-1, 1]. Using a linear SV machine (i.e. a separating hyperplane), we obtain a test error rate of 9.8% (training 10 binary classifiers, and using the maximum value of g (cf. (4)) for 10-class classification); by using a polynomial kernel of degree 4, this drops to 4.0%. In all of the following experiments, we used degree 4 kernels of various types. The number 4 was chosen as it can be written as a product of two integers; thus we could compare results to a kernel k^p_{d1,d2} with d1 = d2 = 2. For the considered classification task, results for higher polynomial degrees are very similar.
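The two-layer locality kernel of Figure 1 can be illustrated with a minimal sketch. This is our reading of the construction, not the exact kernel of Sec. 4: it takes a d1-th order polynomial dot product within each local receptive field and a d2-th order polynomial over the vector of patch outputs, yielding a kernel of order d1 · d2 dominated by local correlations. The function name, the square-patch sweep, and the omission of the pyramidal weighting within receptive fields are simplifications for illustration.

```python
import numpy as np

def local_correlation_kernel(x, y, d1=2, d2=2, patch=3):
    """Sketch of a locality kernel of order d1 * d2 (hypothetical helper).

    x, y: images as 2-D arrays with entries in [-1, 1].
    Layer 1: d1-th order products of pixels within each local
    receptive field (here: plain square patches, no pyramid weights).
    Layer 2: d2-th order products of the layer-1 outputs.
    Sums and powers of kernels are again kernels, so this defines a
    valid dot product in feature space.
    """
    h, w = x.shape
    total = 0.0
    for i in range(h - patch + 1):
        for j in range(w - patch + 1):
            xp = x[i:i + patch, j:j + patch].ravel()
            yp = y[i:i + patch, j:j + patch].ravel()
            total += np.dot(xp, yp) ** d1   # d1-th order within the patch
    return float(total ** d2)               # d2-th order across patches
```

Unlike the full homogeneous polynomial (x · y)^{d1 d2}, this kernel only contains products of pixels that share a receptive field at layer 1, which is the locality bias described above.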
In a series of experiments with a homogeneous polynomial kernel k(x, y) = (x · y)^4, using preprocessing with Gaussian smoothing kernels of standard deviation 0.1, 0.2, ..., 1.0, we obtained error rates which gradually increased from 4.0% to 4.3%; thus no improvement of this performance was possible by a simple smoothing operation. Applying the Virtual SV method (retraining the SV machine on translated SVs; Schölkopf, Burges, & Vapnik, 1996) to this problem results in an improved error rate of 2.8%. For training on the full 60000 pattern set, the Virtual SV performance is 0.8% (Schölkopf, 1997).

Invariant hyperplanes. Table 1 reports results obtained by preprocessing all patterns with B (cf. (10)), choosing different values of λ (cf. (11)). In the experiments, the patterns were first rescaled to have entries in [0, 1], then B was computed, using horizontal and vertical translations, and preprocessing was carried out; finally, the resulting patterns were scaled back again. This was done to ensure that patterns and derivatives lie in comparable regions of R^N (note that if the pattern background level is a constant -1, then its derivative is 0). The results show that even though (9) was derived for the linear case, it can lead to improvements in the nonlinear case (here, for a degree 4 polynomial), too.

Dimensionality reduction. The above [0, 1] scaling operation is affine rather than linear, hence the argument leading to (12) does not hold for this case. We thus only report results on dimensionality reduction for the case where the data is kept in [0, 1] scaling from the very beginning on. Dropping principal components which are less important leads to substantial improvements (Table 2); cf. the explanation following (12). The results in Table 2 are somewhat distorted by the fact that the polynomial kernel is not translation invariant, and performs poorly on the [0, 1] data, which becomes evident in the case where none of the principal components are discarded. Better results have been obtained using translation invariant kernels, e.g. Gaussian RBFs (Schölkopf, 1997).

Prior Knowledge in Support Vector Kernels    645

Table 1: Classification error rates for modifying the kernel k(x, y) = (x · y)^4 with the invariant hyperplane preprocessing matrix B_λ = C_λ^{-1/2}; cf. (10) and (11). Enforcing invariance with 0.1 < λ < 1 leads to improvements over the original performance (λ = 1).

    λ                  0.1    0.2    0.4    0.6
    error rate in %    4.2    3.8    3.6    3.8

Table 2: Dropping directions corresponding to small eigenvalues of C (cf. (12)) leads to substantial improvements. All results given are for the case λ = 0.4 (cf. Table 1); degree 4 homogeneous polynomial kernel.

    principal components discarded
    error rate in %

Kernels using local correlations. To exploit locality in images, we used a pyramidal receptive field kernel k^p_{d1,d2} with diameter p = 9 (cf. Sec. 4). For d1 = d2 = 2, we obtained an improved error rate of 3.1%; another degree 4 kernel with only local correlations (d1 = 4, d2 = 1) led to 3.4%. Albeit significantly better than the 4.0% for the degree 4 homogeneous polynomial (the error rates on the 10000 element test set have an accuracy of about 0.1%, cf. Bottou et al., 1994), this is still worse than the Virtual SV result of 2.8%. As the two methods, however, exploit different types of prior knowledge, it could be expected that combining them leads to still better performance; and indeed, this yielded the best performance of all (2.0%).

For the purpose of benchmarking, we also ran our system on the US Postal Service database of 7291 + 2007 handwritten digits at a resolution of 16 x 16.
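The invariant-hyperplane preprocessing and dimensionality reduction of Tables 1 and 2 can be sketched as follows. Equations (9)-(12) are not reproduced in this excerpt, so this is a sketch under stated assumptions: C is taken as the covariance of the tangent vectors (here, finite-difference translations), the regularization in (11) is read as C_λ = (1 - λ)C + λI so that λ = 1 recovers the original dot products, and the function and argument names are hypothetical.

```python
import numpy as np

def invariant_preprocessor(patterns, tangents, lam=0.4, n_drop=0):
    """Sketch of preprocessing by B_lam = C_lam^{-1/2} (cf. (10)-(12)).

    patterns: (n, N) array of training patterns.
    tangents: (n, N) array of tangent vectors, e.g. finite-difference
              image translations, one per pattern.
    Assumes C_lam = (1 - lam) * C + lam * I (our reading of (11)).
    Directions with the n_drop smallest eigenvalues of C_lam can be
    discarded, as in Table 2 (cf. (12)).
    """
    n, N = tangents.shape
    C = tangents.T @ tangents / n               # tangent covariance
    C_lam = (1.0 - lam) * C + lam * np.eye(N)   # regularize toward identity
    w, V = np.linalg.eigh(C_lam)                # eigenvalues ascending
    keep = slice(n_drop, None)                  # drop small-eigenvalue directions
    B = V[:, keep] * (w[keep] ** -0.5)          # whitening transform (up to rotation)
    return patterns @ B
```

Since the kernels used here depend on the data only through dot products, whitening up to an orthogonal rotation gives the same kernel values as the symmetric square root C_λ^{-1/2} itself; with λ = 1 the dot products are left unchanged, matching the λ = 1 baseline in Table 1.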
In that case, we obtained the following test error rates: SV with degree 4 polynomial kernel 4.2%, Virtual SV (same kernel) 3.5%, SV with k^p_{2,2} 3.6%, Virtual SV with k^p_{2,2} 3.0%. The latter compares favourably to almost all known results on that database, and is second only to a memory-based tangent-distance nearest neighbour classifier at 2.6% (Simard, LeCun, & Denker, 1993).

6 DISCUSSION

With its rather general class of admissible kernel functions, the SV algorithm provides ample possibilities for constructing task-specific kernels. We have considered an image classification task and used two forms of domain knowledge: first, pattern classes were required to be locally translationally invariant, and second, local correlations in the images were assumed to be more reliable than long-range correlations. The second requirement can be seen as a more general form of prior knowledge - it can be thought of as arising partially from the fact that patterns possess a whole variety of transformations; in object recognition, for instance, we have object rotations and deformations. Typically, these transformations are continuous, which implies that local relationships in an image are fairly stable, whereas global relationships are less reliable.

We have incorporated both types of domain knowledge into the SV algorithm by constructing appropriate kernel functions, leading to substantial improvements on the considered pattern recognition tasks. Our method for constructing kernels for transformation invariant SV machines, put forward to deal with the first type of domain knowledge, so far has only been applied in the linear case, which partially explains why it only led to moderate improvements (also, we so far only used translational invariance).
It is applicable for differentiable transformations - other types, e.g. mirror symmetry, have to be dealt with using other techniques, e.g. Virtual SVs (Schölkopf, Burges, & Vapnik, 1996). Its main advantages compared to the latter technique are that it does not slow down testing speed, and that using more invariances leaves training time almost unchanged. The proposed kernels respecting locality in images led to large improvements; they are applicable not only in image classification but in all cases where the relative importance of subsets of product features can be specified appropriately. They do, however, slow down both training and testing by a constant factor which depends on the specific kernel used.

Both described techniques should be directly applicable to other kernel-based methods such as SV regression (Vapnik, 1995) and kernel PCA (Schölkopf, Smola, & Müller, 1996). Future work will include the nonlinear case (cf. our remarks in Sec. 3), the incorporation of invariances other than translation, and the construction of kernels incorporating local feature extractors (e.g. edge detectors) different from the pyramids described in Sec. 4.

Acknowledgements. We thank Chris Burges and Léon Bottou for parts of the code and for helpful discussions, and Tony Bell for his remarks.

References

B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In D. Haussler, editor, Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pages 144-152, Pittsburgh, PA, 1992. ACM Press.

L. Bottou, C. Cortes, J. S. Denker, H. Drucker, I. Guyon, L. D. Jackel, Y. LeCun, U. A. Müller, E. Säckinger, P. Simard, and V. Vapnik. Comparison of classifier methods: a case study in handwritten digit recognition. In Proceedings of the 12th International Conference on Pattern Recognition and Neural Networks, Jerusalem, pages 77-87.
IEEE Computer Society Press, 1994.

C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:273-297, 1995.

B. Schölkopf. Support Vector Learning. R. Oldenbourg Verlag, Munich, 1997. ISBN 3-486-24632-1.

B. Schölkopf, C. Burges, and V. Vapnik. Incorporating invariances in support vector learning machines. In C. von der Malsburg, W. von Seelen, J. C. Vorbrüggen, and B. Sendhoff, editors, Artificial Neural Networks - ICANN'96, pages 47-52, Berlin, 1996a. Springer Lecture Notes in Computer Science, Vol. 1112.

B. Schölkopf, A. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Technical Report 44, Max-Planck-Institut für biologische Kybernetik, 1996b. In press (Neural Computation).

P. Simard, Y. LeCun, and J. Denker. Efficient pattern recognition using a new transformation distance. In S. J. Hanson, J. D. Cowan, and C. L. Giles, editors, Advances in Neural Information Processing Systems 5, pages 50-58, San Mateo, CA, 1993. Morgan Kaufmann.

P. Simard, B. Victorri, Y. LeCun, and J. Denker. Tangent prop - a formalism for specifying selected invariances in an adaptive network. In J. E. Moody, S. J. Hanson, and R. P. Lippmann, editors, Advances in Neural Information Processing Systems 4, pages 895-903, San Mateo, CA, 1992. Morgan Kaufmann.

V. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag, New York, 1995.

V. Vapnik and A. Chervonenkis. Theory of Pattern Recognition [in Russian]. Nauka, Moscow, 1974. (German translation: W. Wapnik & A. Tscherwonenkis, Theorie der Zeichenerkennung, Akademie-Verlag, Berlin, 1979).