\mathbf{P}_{XY}\left[y\left\langle \mathbf{x}, u\right\rangle \le 0\right] of the classifier u found by the perceptron algorithm is less than

\frac{1}{m-\kappa^*}\left(\ln\binom{m}{\kappa^*}+\ln(m)+\ln\left(\frac{1}{\delta}\right)\right)    (8)

The most intriguing feature of this result is that the mere existence of a large margin classifier u^* is sufficient to guarantee a small generalisation error for the solution u of the perceptron, although its attained margin \Gamma_z(u) is likely to be much smaller than \Gamma_z(u^*). It has long been argued that the attained margin \Gamma_z(u) itself is the crucial quantity controlling the generalisation error of u. In light of our new result, if there exists a consistent classifier u^* with large margin, then there also exists at least one classifier u of high sparsity that can efficiently be found using the perceptron algorithm. In fact, whenever the SVM appears to be theoretically justified by a large observed margin, every solution found by the perceptron algorithm has a small guaranteed generalisation error, mostly even smaller than current bounds on the generalisation error of SVMs. Note that for a given training sample z it is not unlikely that by permutation of z there exist \mathcal{O}\left(\binom{m}{\kappa^*}\right) many different consistent sparse classifiers u.

5  Impact on the Foundations of Support Vector Machines

Support vector machines owe their popularity mainly to their theoretical justification in learning theory. In particular, two arguments have been put forward to single out the solutions found by SVMs [14, p. 139]:

SVMs (optimal hyperplanes) can generalise because

1. the expectation of the data compression is large, and
2. the expectation of the margin is large.

The second reason is often justified by margin results (see [14, 12]) which bound the generalisation error of a classifier u in terms of its own attained margin \Gamma_z(u).
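The sparsity mechanism invoked above is easy to see in code: in the dual form of the perceptron, every update touches exactly one expansion coefficient, so the sparsity of the solution is bounded by the number of mistakes. The following is an illustrative sketch, not the authors' implementation; the toy data (two parallel lines of points) and the linear kernel are assumptions chosen so that Novikoff's theorem guarantees at most two mistakes.

```python
import numpy as np

def kernel_perceptron(X, y, kernel, max_epochs=100):
    """Dual-form perceptron: a mistake on example i increments
    alpha[i], so the sparsity ||alpha||_0 of the solution is at
    most the total number of mistakes kappa."""
    m = len(X)
    K = np.array([[kernel(a, b) for b in X] for a in X])
    alpha = np.zeros(m)
    mistakes = 0
    for _ in range(max_epochs):
        clean = True
        for i in range(m):
            # the decision function uses only points with alpha != 0
            if y[i] * np.dot(alpha * y, K[:, i]) <= 0:
                alpha[i] += 1.0   # update only on a mistake
                mistakes += 1
                clean = False
        if clean:                 # consistent on the training sample
            break
    return alpha, mistakes

# Two parallel lines of points: Novikoff's theorem gives at most
# (R / gamma)^2 = 2 mistakes here (R = sqrt(2), margin gamma = 1).
ts = np.linspace(-1, 1, 10)
X = np.array([[1.0, t] for t in ts] + [[-1.0, t] for t in ts])
y = np.array([1] * 10 + [-1] * 10)
alpha, mistakes = kernel_perceptron(X, y, np.dot)
print(np.count_nonzero(alpha), mistakes)  # sparsity <= mistakes <= 2
```

On this data the final classifier is consistent although only two of the twenty expansion coefficients are nonzero, which is exactly the compression effect the bound above exploits.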
If we require the slightly stronger condition that \kappa^* \le m/n, n \ge 4, then our bound (8) for solutions of perceptron learning can be upper bounded by

\frac{n}{(n-1)\,m}\left(\kappa^*\ln\left(\frac{em}{\kappa^*}\right)+\ln(m)+\ln\left(\frac{1}{\delta}\right)\right),

which has to be compared with the PAC margin bound (see [12, 5])

\frac{2}{m}\left(64\,\kappa^*\log_2\left(\frac{em}{8\kappa^*}\right)\log_2(32m)+\log_2(2m)+\log_2\left(\frac{1}{\delta}\right)\right).

Despite the fact that the former result also holds true for the margin \Gamma_z(u^*) (which could loosely be upper bounded by (5)),

• the PAC margin bound's decay (as a function of m) is slower by a \log_2(32m) factor,

• for any m and almost any \delta the margin bound given in Theorem 4 guarantees a smaller generalisation error,

• for example, using the empirical value \kappa^* \approx 600 (see [14, p. 153]) for the NIST handwritten digit recognition task and inserting it into the PAC margin bound, one would need the astronomically large number of m > 410 743 386 training examples to obtain the bound value of 0.112 given by (8) for the digit "0" (see Table 1).

digit              0      1      2      3      4      5      6      7      8      9
perceptron
  error (%)        0.2    0.2    0.4    0.4    0.4    0.4    0.4    0.5    0.6    0.7
  ||α||₀           740    643   1168   1512   1078   1277    823   1103   1856   1920
  mistakes         844    843   1345   1811   1222   1497    960   1323   2326   2367
  bound (%)        6.7    6.0    9.8   12.0    9.2   10.5    7.4    9.4   14.3   14.6
SVM
  error (%)        0.2    0.1    0.4    0.4    0.4    0.5    0.3    0.4    0.5    0.6
  ||α||₀          1379    989   1958   1900   1224   2024   1527   2064   2332   2765
  bound (%)       11.2    8.6   14.9   14.5   10.2   15.3   12.2   15.5   17.1   19.6

Table 1: Results of kernel perceptrons and SVMs on NIST (taken from [2, Table 3]). The kernel used was k(x, x') = (⟨x, x'⟩ + 1)^4 and m = 60000. For both algorithms we give the measured generalisation error (in %), the attained sparsity ||α||₀, and the bound value (in %, δ = 0.05) of (7).
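The scale of the gap between the two kinds of bound can be checked numerically. The sketch below follows the reconstructed formulas above, so its constants (in particular the assumed effective dimension 64 κ* in the PAC margin bound) are illustrative rather than the exact published expressions; the qualitative comparison at m = 60000 is the point.

```python
import math

def compression_bound(m, kappa, delta):
    # compression-style bound in the spirit of (8):
    # (ln C(m, kappa) + ln m + ln(1/delta)) / (m - kappa)
    return (math.log(math.comb(m, kappa)) + math.log(m)
            + math.log(1.0 / delta)) / (m - kappa)

def pac_margin_bound(m, kappa, delta):
    # PAC margin bound in the spirit of [12, 5], with an assumed
    # effective dimension of 64 * kappa
    return (2.0 / m) * (64 * kappa
                        * math.log2(math.e * m / (8 * kappa))
                        * math.log2(32 * m)
                        + math.log2(2 * m)
                        + math.log2(1.0 / delta))

m, kappa, delta = 60000, 600, 0.05   # kappa* ~ 600 (see [14, p. 153])
cb = compression_bound(m, kappa, delta)
pb = pac_margin_bound(m, kappa, delta)
# at m = 60000 the compression bound is non-trivial (< 1) while
# the PAC margin bound is vacuous (> 1)
print(cb < 1.0, pb > 1.0)
```

Under these assumed constants the compression-style bound is a few percent while the PAC margin bound exceeds one by two orders of magnitude, which is consistent with the remark that hundreds of millions of examples would be needed before the margin bound becomes competitive.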
\n\nWith regard to the first reason, it has been confirmed experimentally that SVMs find \nsolutions which are sparse in the expansion coefficients o. However, there cannot \nexist any distribution- free guarantee that the number of support vectors will in fact \nbe sma1l2 . In contrast, Theorem 2 gives an explicit bound on the sparsity in terms \n\nof the achievable margin ,z (0*). Furthermore, experimental results on the NIST \n\ndatasets show that the sparsity of solution found by the perceptron algorithm is \nconsistently (and often by a factor of two) greater than that of the SVM solution \n(see [2, Table 3] and Table 1). \n\n6 Conclusion \n\nthe perceptron algorithm -\n\nWe have shown that the generalisation error of a very simple and efficient learning \ncan be bounded by \nalgorithm for linear classifiers -\na quantity involving the margin of the classifier the SVM would have found on the \nsame training data using the same kernel. This result implies that the SVM solution \nis not at all singled out as being superior in terms of provable generalisation error. \nAlso, the result indicates that sparsity of the solution may be a more fundamental \nproperty than the size of the attained margin (since a large value of the latter \nimplies a large value of the former). \n\nOur analysis raises an interesting question: having chosen a good kernel, correspond(cid:173)\ning to a metric in which inter- class distances are great and intra- class distances are \nshort, in how far does it matter which consistent classifier we use? Experimental \n\n2Consider a distribution PXY on two parallel lines with support in the unit ball. Suppose \nthat their mutual distance is ../2. Then the number of support vectors equals the training \nset size whereas the perceptron algorithm never uses more than two points by Theorem 2. 
One could argue that it is the number of essential support vectors [13] that characterises the data compression of an SVM (and this number would also have been two in our example). Their determination, however, involves a combinatorial optimisation problem and can thus never be performed in practical applications.

results seem to indicate that a vast variety of heuristics for finding consistent classifiers, e.g. the kernel Fisher discriminant, linear programming machines, Bayes point machines, kernel PCA with a linear SVM, and sparse greedy matrix approximation, perform comparably (see http://www.kernel-machines.org/).

Acknowledgements

This work was done while TG and RH were visiting the ANU Canberra. They would like to thank Peter Bartlett and Jon Baxter for many interesting discussions. Furthermore, we would like to thank the anonymous reviewer, Olivier Bousquet and Matthias Seeger for very useful remarks on the paper.

References

[1] M. Aizerman, E. Braverman, and L. Rozonoer. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25:821–837, 1964.

[2] Y. Freund and R. E. Schapire. Large margin classification using the perceptron algorithm. Machine Learning, 1999.

[3] T. Friess, N. Cristianini, and C. Campbell. The Kernel-Adatron: A fast and simple learning procedure for Support Vector Machines. In Proceedings of the 15th International Conference on Machine Learning, pages 188–196, 1998.

[4] T. Graepel, R. Herbrich, and J. Shawe-Taylor. Generalisation error bounds for sparse linear classifiers. In Proceedings of the Thirteenth Annual Conference on Computational Learning Theory, pages 298–303, 2000. In press.

[5] R. Herbrich. Learning Linear Classifiers: Theory and Algorithms. PhD thesis, Technische Universität Berlin, 2000. Accepted for publication by MIT Press.

[6] R.
Herbrich and T. Graepel. A PAC-Bayesian margin bound for linear classifiers: Why SVMs work. In Advances in Neural Information Processing Systems 13, 2001.

[7] H. König. Eigenvalue Distribution of Compact Operators. Birkhäuser, Basel, 1986.

[8] N. Littlestone and M. Warmuth. Relating data compression and learnability. Technical report, University of California Santa Cruz, 1986.

[9] J. Mercer. Functions of positive and negative type and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society of London (A), 209:415–446, 1909.

[10] A. Novikoff. On convergence proofs for perceptrons. In Report at the Symposium on Mathematical Theory of Automata, pages 24–26, Polytechnic Institute of Brooklyn, 1962.

[11] F. Rosenblatt. Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Spartan Books, Washington D.C., 1962.

[12] J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5):1926–1940, 1998.

[13] V. Vapnik. Statistical Learning Theory. John Wiley and Sons, New York, 1998.

[14] V. Vapnik. The Nature of Statistical Learning Theory. Springer, second edition, 1999.

[15] G. Wahba. Support Vector Machines, Reproducing Kernel Hilbert Spaces and the randomized GACV. Technical report TR-NO-984, Department of Statistics, University of Wisconsin, Madison, 1997.