{"title": "From Margin to Sparsity", "book": "Advances in Neural Information Processing Systems", "page_first": 210, "page_last": 216, "abstract": null, "full_text": "From Margin To Sparsity \n\nThore Graepel, Ralf Herbrich \nComputer Science Department \nTechnical University of Berlin \nBerlin, Germany \n{guru,ralfh}@cs.tu-berlin.de \n\nRobert C. Williamson \nDepartment of Engineering \nAustralian National University \nCanberra, Australia \nBob.Williamson@anu.edu.au \n\nAbstract \n\nWe present an improvement of Novikoff's perceptron convergence theorem. Reinterpreting this mistake bound as a margin dependent sparsity guarantee allows us to give a PAC-style generalisation error bound for the classifier learned by the perceptron learning algorithm. The bound value crucially depends on the margin a support vector machine would achieve on the same data set using the same kernel. Ironically, the bound yields better guarantees than are currently available for the support vector solution itself. \n\n1 Introduction \n\nIn the last few years there has been a large controversy about the significance of the attained margin, i.e. the smallest real-valued output of a classifier before thresholding, as an indicator of generalisation performance. Results in the VC, PAC and luckiness frameworks seem to indicate that a large margin is a prerequisite for small generalisation error bounds (see [14, 12]). These results caused many researchers to focus on large margin methods such as the well-known support vector machine (SVM). On the other hand, the notion of sparsity is deemed important for generalisation, as can be seen from the popularity of Occam's razor-like arguments as well as compression considerations (see [8]). 
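A minimal numerical sketch of this notion of margin, the smallest real-valued output before thresholding, may help fix ideas; the toy data and weight vector below are illustrative only and not taken from the paper.

```python
import numpy as np

def attained_margin(w, X, y):
    # Smallest value of y_i * <w, x_i> over the training set, normalised
    # by ||w|| so that rescaling w leaves the margin unchanged.
    outputs = y * (X @ w) / np.linalg.norm(w)
    return outputs.min()

# Toy, linearly separable data (illustrative only).
X = np.array([[1.0, 1.0], [-1.0, -1.0]])
y = np.array([+1.0, -1.0])
w = np.array([1.0, 1.0])
print(attained_margin(w, X, y))  # positive iff w classifies all points correctly
```

The returned value is positive exactly when the classifier is consistent with the data, which is the sense in which a "large margin" strengthens mere consistency.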
\n\nIn this paper we reconcile the two notions by reinterpreting an improved version of Novikoff's well-known perceptron convergence theorem as a sparsity guarantee in dual space: the existence of large margin classifiers implies the existence of sparse consistent classifiers in dual space. Even better, this solution is easily found by the perceptron algorithm. By combining the perceptron mistake bound with a compression bound that originated from the work of Littlestone and Warmuth [8], we are able to provide a PAC-like generalisation error bound for the classifier found by the perceptron algorithm whose size is determined by the magnitude of the maximally achievable margin on the dataset. \n\nThe paper is structured as follows: after introducing the perceptron in dual variables in Section 2, we improve on Novikoff's perceptron convergence bound in Section 3. Our main result is presented in the subsequent section and its consequences for the theoretical foundation of SVMs are discussed in Section 5. \n\n2 (Dual) Kernel Perceptrons \n\nWe consider learning given m objects X = {x_1, ..., x_m} ∈ X^m and a set Y = {y_1, ..., y_m} ∈ Y^m drawn iid from a fixed distribution P_XY = P_Z over the space X × {-1, +1} = Z of input-output pairs. Our hypotheses are linear classifiers x ↦ sign(⟨w, φ(x)⟩) in some fixed feature space K ⊆ ℓ_2, where we assume that a mapping φ : X → K is chosen a priori¹. Given the features φ_i : X → ℝ, the classical (primal) perceptron algorithm aims at finding a weight vector w ∈ K consistent with the training data. Recently, Vapnik [14] and others, in their work on SVMs, have rediscovered that it may be advantageous to learn in the dual representation (see [1]), i.e. expanding the weight vector in terms of the training data \n\nw_α = Σ_{i=1}^m α_i φ(x_i) = Σ_{i=1}^m α_i x_i,    (1) \n\nand to learn the m expansion coefficients α ∈ ℝ^m rather than the components of w ∈ K. 
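Learning in this dual representation can be sketched in a few lines; the following is a minimal illustration (not the authors' code) using a Gaussian kernel and the mistake-driven update of Definition 1 below, on made-up data.

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    # An example Mercer kernel; any kernel satisfying (2) would do.
    return np.exp(-gamma * np.sum((x - z) ** 2))

def kernel_perceptron(X, y, kernel, eta=1.0, max_epochs=100):
    # Dual perceptron: learn expansion coefficients alpha instead of w.
    # The real-valued output at x_i is sum_j alpha_j k(x_j, x_i).
    m = len(X)
    K = np.array([[kernel(X[i], X[j]) for j in range(m)] for i in range(m)])
    alpha = np.zeros(m)
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(m):
            if y[i] * (alpha @ K[:, i]) <= 0:   # margin violation at x_i
                alpha[i] += eta * y[i]          # dual update: alpha_i += eta * y_i
                mistakes += 1
        if mistakes == 0:                       # consistent on all m points: stop
            break
    return alpha

# XOR-like toy data: not linearly separable in input space,
# but separable in the feature space induced by the RBF kernel.
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
y = np.array([+1.0, +1.0, -1.0, -1.0])
alpha = kernel_perceptron(X, y, rbf_kernel)
```

Note that the algorithm only ever touches the data through kernel evaluations, which is exactly why the choice of a symmetric kernel k satisfying (2) suffices.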
This is particularly useful if the dimensionality n = dim(K) of the feature space K is much greater than the number m of training points, or possibly infinite. This dual representation can be used for a rather wide class of learning algorithms, in particular if all we need for learning is the real-valued output ⟨w, x_i⟩_K of the classifier at the m training points x_1, ..., x_m (see [15]). Thus it suffices to choose a symmetric function k : X × X → ℝ, called a kernel, and to ensure that there exists a mapping φ_k : X → K such that \n\n∀x, x' ∈ X:   k(x, x') = ⟨φ_k(x), φ_k(x')⟩_K.    (2) \n\nA sufficient condition is given by Mercer's theorem. \n\nTheorem 1 (Mercer Kernel [9, 7]). Any symmetric function k ∈ L_∞(X × X) that is positive semidefinite, i.e. \n\n∀f ∈ L_2(X):   ∫_X ∫_X k(x, x') f(x) f(x') dx dx' ≥ 0, \n\nis called a Mercer kernel and has the following property: if ψ_i ∈ L_2(X) solve the eigenvalue problem ∫_X k(x, x') ψ_i(x') dx' = λ_i ψ_i(x) with ∫_X ψ_i²(x) dx = 1 and ∀i ≠ j: ∫_X ψ_i(x) ψ_j(x) dx = 0, then k can be expanded in a uniformly convergent series, i.e. \n\nk(x, x') = Σ_{i=1}^∞ λ_i ψ_i(x) ψ_i(x'). \n\nIn order to see that a Mercer kernel fulfils equation (2), consider the mapping \n\nφ_k(x) = (√λ_1 ψ_1(x), √λ_2 ψ_2(x), ...)    (3) \n\nwhose existence is ensured by the third property. Finally, the perceptron learning algorithm we are going to consider is described in the following definition. \n\nDefinition 1 (Perceptron Learning). The perceptron learning procedure with the fixed learning rate η ∈ ℝ₊ is as follows: \n\n1. Start in step zero, i.e. t = 0, with the vector α_t = 0. \n2. If there exists an index i ∈ {1, ..., m} such that y_i ⟨w_{α_t}, x_i⟩_K ≤ 0, then \n\n(α_{t+1})_i = (α_t)_i + η y_i   ⇔   w_{α_{t+1}} = w_{α_t} + η y_i x_i,    (4) \n\nand t ← t + 1. \n3. Stop if there is no i ∈ {1, ..., m} such that y_i ⟨w_{α_t}, x_i⟩_K ≤ 0. \n\n¹Sometimes we abbreviate φ(x) by x, always assuming φ is fixed. \n\nOther variants of this algorithm have been presented elsewhere (see [2, 3]). \n\n3 An Improvement of Novikoff's Theorem \n\nIn the early 60's, Novikoff [10] was able to give an upper bound on the number of mistakes made by the classical perceptron learning procedure. Two years later, this bound was generalised to feature spaces using Mercer kernels by Aizerman et al. [1]. The quantity determining the upper bound is the maximally achievable unnormalised margin max_{α∈ℝ^m} γ_Z(α), normalised by the total extent R(X) of the data in feature space, i.e. R(X) = max_{x_i∈X} ||x_i||_K. \n\nDefinition 2 (Unnormalised Margin). Given a training set Z = (X, Y) and a vector α ∈ ℝ^m, the unnormalised margin γ_Z(α) is given by \n\nγ_Z(α) = min_{(x_i,y_i)∈Z} y_i ⟨w_α, x_i⟩_K / ||w_α||_K. \n\nTheorem 2 (Novikoff's Perceptron Convergence Theorem [10, 1]). Let Z = (X, Y) be a training set of size m. Suppose that there exists a vector α* ∈ ℝ^m such that γ_Z(α*) > 0. Then the number of mistakes made by the perceptron algorithm in Definition 1 on Z is at most \n\n(R(X) / γ_Z(α*))². \n\nSurprisingly, this bound is highly influenced by the data point x_i ∈ X with the largest norm ||x_i||_K, although rescaling a data point would not change its classification. Let us consider rescaling the training set X before applying the perceptron algorithm. Then for the normalised training set we would have R(X_norm) = 1, and γ_Z(α) would change into the normalised margin Γ_Z(α) first advocated in [6]. \n\nDefinition 3 (Normalised Margin). Given a training set Z = (X, Y) and a vector α ∈ ℝ^m, the normalised margin Γ_Z(α) is given by \n\nΓ_Z(α) = min_{(x_i,y_i)∈Z} y_i ⟨w_α, x_i⟩_K / (||w_α||_K ||x_i||_K). \n\nBy definition, for all x_i ∈ X we have R(X) ≥ ||x_i||_K. 
Hence for any α ∈ ℝ^m and all (x_i, y_i) ∈ Z such that y_i ⟨w_α, x_i⟩_K > 0, \n\nR(X) ||w_α||_K / (y_i ⟨w_α, x_i⟩_K) ≥ ||x_i||_K ||w_α||_K / (y_i ⟨w_α, x_i⟩_K), \n\nwhich immediately implies for all Z = (X, Y) ∈ Z^m such that γ_Z(α) > 0 \n\nR(X) / γ_Z(α) ≥ 1 / Γ_Z(α).    (5) \n\nThus when normalising the data in feature space, i.e. \n\nk_norm(x, x') = k(x, x') / √(k(x, x) · k(x', x')), \n\nthe upper bound on the number of steps until convergence of the classical perceptron learning procedure of Rosenblatt [11] provably decreases and is given by the squared r.h.s. of (5). \n\nConsidering the form of the update rule (4), we observe that this result not only bounds the number of mistakes made during learning but also the number ||α||_0 of non-zero coefficients in the α vector. To be precise, for η = 1 it bounds the ℓ_1 norm ||α||_1 of the coefficient vector α which, in turn, bounds the zero norm ||α||_0 from above for all vectors with integer components. Theorem 2 thus establishes a relation between the existence of a large margin classifier w* and the sparseness of any solution found by the perceptron algorithm. \n\n4 Main Result \n\nIn order to exploit the guaranteed sparseness of the solution of a kernel perceptron, we make use of the following lemma to be found in [8, 4]. \n\nLemma 1 (Compression Lemma). Fix d ∈ {1, ..., m}. For any measure P_Z, the probability that m examples Z drawn iid according to P_Z will yield a classifier α(Z) learned by the perceptron algorithm with ||α(Z)||_0 = d whose generalisation error P_XY[Y ⟨w_{α(Z)}, φ(X)⟩_K ≤ 0] [...] the generalisation error of the classifier α found by the perceptron algorithm is less than \n\n(1 / (m − κ*)) (ln (m choose κ*) + ln(m) + ln(1/δ)),    (8) \n\nwhere κ* denotes the mistake bound (R(X)/γ_Z(α*))² of Theorem 2. \n\nThe most intriguing feature of this result is that the mere existence of a large margin classifier α* is sufficient to guarantee a small generalisation error for the solution α of the perceptron, although its attained margin γ_Z(α) is likely to be much smaller than γ_Z(α*). It has long been argued that the attained margin γ_Z(α) itself is the crucial quantity controlling the generalisation error of α. In light of our new result, if there exists a consistent classifier α* with large margin, we know that there also exists at least one classifier α with high sparsity that can efficiently be found using the perceptron algorithm. In fact, whenever the SVM appears to be theoretically justified by a large observed margin, every solution found by the perceptron algorithm has a small guaranteed generalisation error, mostly even smaller than current bounds on the generalisation error of SVMs. Note that for a given training sample Z it is not unlikely that by permutation of Z there exist O((m choose κ*)) many different consistent sparse classifiers α. \n\n5 Impact on the Foundations of Support Vector Machines \n\nSupport vector machines owe their popularity mainly to their theoretical justification in learning theory. In particular, two arguments have been put forward to single out the solutions found by SVMs [14, p. 139]: SVMs (optimal hyperplanes) can generalise because \n\n1. the expectation of the data compression is large. \n2. the expectation of the margin is large. \n\nThe second reason is often justified by margin results (see [14, 12]) which bound the generalisation error of a classifier α in terms of its own attained margin γ_Z(α). 
If we require the slightly stronger condition that κ* < m/n, n ≥ 4, then our bound (8) for solutions of perceptron learning can be upper bounded by \n\n(n / ((n − 1) m)) (κ* ln(em/κ*) + ln(m) + ln(1/δ)), \n\nwhich has to be compared with the PAC margin bound (see [12, 5]) \n\n(2/m) (64 κ* log₂(em/κ*) log₂(32m) + log₂(2m) + log₂(1/δ)). \n\nDespite the fact that the former result also holds true for the margin Γ_Z(α*) (which could loosely be upper bounded via (5)): \n\n• The PAC margin bound's decay (as a function of m) is slower by a log₂(32m) factor. \n• For any m and almost any δ, the margin bound given in Theorem 4 guarantees a smaller generalisation error. \n• For example, using the empirical value κ* ≈ 600 (see [14, p. 153]) for the NIST handwritten digit recognition task and inserting this value into the PAC margin bound, it would need the astronomically large number of m > 410,743,386 examples to obtain a bound value of 0.112 as obtained by (7) for the digit \"0\" (see Table 1). \n\ndigit                0      1      2      3      4      5      6      7      8      9 \nperceptron: error    0.2    0.2    0.4    0.4    0.4    0.4    0.4    0.5    0.6    0.7 \n    ||α||_0          740    643    1168   1512   1078   1277   823    1103   1856   1920 \n    mistakes         844    843    1345   1811   1222   1497   960    1323   2326   2367 \n    bound            6.7    6.0    9.8    12.0   9.2    10.5   7.4    9.4    14.3   14.6 \nSVM: error           0.2    0.1    0.4    0.4    0.4    0.5    0.3    0.4    0.5    0.6 \n    ||α||_0          1379   989    1958   1900   1224   2024   1527   2064   2332   2765 \n    bound            11.2   8.6    14.9   14.5   10.2   15.3   12.2   15.5   17.1   19.6 \n\nTable 1: Results of kernel perceptrons and SVMs on NIST (taken from [2, Table 3]). The kernel used was k(x, x') = (⟨x, x'⟩_X + 1)⁴ and m = 60000. For both algorithms we give the measured generalisation error (in %), the attained sparsity and the bound value (in %, δ = 0.05) of (7). 
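To reproduce the flavour of this comparison, one can evaluate a Littlestone-Warmuth-style compression bound and a PAC margin bound of the standard form at the NIST-scale values m = 60000, κ* = 600 and δ = 0.05. The exact constants in the two functions below are assumptions made here for illustration, not necessarily those of the cited bounds.

```python
from math import e, lgamma, log, log2

def log_binom(m, d):
    # ln C(m, d) via log-Gamma, numerically stable for large m.
    return lgamma(m + 1) - lgamma(d + 1) - lgamma(m - d + 1)

def compression_bound(m, d, delta):
    # Compression-style bound for a classifier reconstructed from d of the
    # m training examples (constants assumed for illustration).
    return (log_binom(m, d) + log(m) + log(1.0 / delta)) / (m - d)

def pac_margin_bound(m, kappa, delta):
    # A PAC margin bound of the standard form (constants assumed).
    return (2.0 / m) * (64 * kappa * log2(e * m / kappa) * log2(32 * m)
                        + log2(2 * m) + log2(1.0 / delta))

m, kappa, delta = 60000, 600, 0.05
print(compression_bound(m, kappa, delta))  # well below 1, i.e. non-trivial
print(pac_margin_bound(m, kappa, delta))   # far above 1, i.e. vacuous here
```

With these (assumed) constants the compression-style bound is already non-trivial at m = 60000, whereas the margin bound only becomes non-trivial at vastly larger sample sizes, in line with the comparison above.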
\n\nWith regard to the first reason, it has been confirmed experimentally that SVMs find \nsolutions which are sparse in the expansion coefficients o. However, there cannot \nexist any distribution- free guarantee that the number of support vectors will in fact \nbe sma1l2 . In contrast, Theorem 2 gives an explicit bound on the sparsity in terms \n\nof the achievable margin ,z (0*). Furthermore, experimental results on the NIST \n\ndatasets show that the sparsity of solution found by the perceptron algorithm is \nconsistently (and often by a factor of two) greater than that of the SVM solution \n(see [2, Table 3] and Table 1). \n\n6 Conclusion \n\nthe perceptron algorithm -\n\nWe have shown that the generalisation error of a very simple and efficient learning \ncan be bounded by \nalgorithm for linear classifiers -\na quantity involving the margin of the classifier the SVM would have found on the \nsame training data using the same kernel. This result implies that the SVM solution \nis not at all singled out as being superior in terms of provable generalisation error. \nAlso, the result indicates that sparsity of the solution may be a more fundamental \nproperty than the size of the attained margin (since a large value of the latter \nimplies a large value of the former). \n\nOur analysis raises an interesting question: having chosen a good kernel, correspond(cid:173)\ning to a metric in which inter- class distances are great and intra- class distances are \nshort, in how far does it matter which consistent classifier we use? Experimental \n\n2Consider a distribution PXY on two parallel lines with support in the unit ball. Suppose \nthat their mutual distance is ../2. Then the number of support vectors equals the training \nset size whereas the perceptron algorithm never uses more than two points by Theorem 2. 
\nOne could argue that it is the number of essential support vectors [13] that characterises \nthe data compression of an SVM (which would also have been two in our example). Their \ndetermination, however, involves a combinatorial optimisation problem and can thus never \nbe performed in practical applications. \n\n\fresults seem to indicate that a vast variety of heuristics for finding consistent clas(cid:173)\nsifiers, e.g. kernel Fisher discriminant, linear programming machines, Bayes point \nmachines, kernel PCA & linear SVM, sparse greedy matrix approximation perform \ncomparably (see http://www . kernel-machines. org/). \n\nAcknowledgements \n\nThis work was done while TG and RH were visiting the ANU Canberra. They \nwould like to thank Peter Bartlett and Jon Baxter for many interesting discussions. \nFurthermore, we would like to thank the anonymous reviewer, Olivier Bousquet and \nMatthias Seeger for very useful remarks on the paper. \n\nReferences \n\n[I] M. Aizerman, E. Braverman, and L. Rozonoer. Theoretical foundations of the po(cid:173)\n\ntential function method in pattern recognition learning. Automation and Remote \nControl, 25:821- 837, 1964. \n\n[2] Y. Freund and R. E. Schapire. Large margin classification using the perceptron \n\nalgorithm. Machine Learning, 1999. \n\n[3] T. Friess, N. Cristianini, and C. Campbell. The Kernel-Adatron: A fast and sim(cid:173)\n\nple learning procedure for Support Vector Machines. In Proceedings of the 15- th \nInternational Conference in Machine Learning, pages 188- 196, 1998. \n\n[4] T. Graepel, R. Herbrich, and J. Shawe-Taylor. Generalisation error bounds for sparse \n\nlinear classifiers. In Proceedings of the Thirteenth Annual Conference on Computa(cid:173)\ntional Learning Theory, pages 298- 303, 2000. in press. \n\n[5] R. Herbrich. Learning Linear Classifiers - Theory and Algorithms. PhD thesis, Tech(cid:173)\n\nnische Universitiit Berlin, 2000. accepted for publication by MIT Press. \n\n[6] R. 
Herbrich and T. Graepel. A PAC-Bayesian margin bound for linear classifiers: Why SVMs work. In Advances in Neural Information Processing Systems 13, 2001. \n[7] H. König. Eigenvalue Distribution of Compact Operators. Birkhäuser, Basel, 1986. \n[8] N. Littlestone and M. Warmuth. Relating data compression and learnability. Technical report, University of California Santa Cruz, 1986. \n[9] J. Mercer. Functions of positive and negative type and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society of London (A), 209:415-446, 1909. \n[10] A. Novikoff. On convergence proofs for perceptrons. In Report at the Symposium on Mathematical Theory of Automata, pages 24-26, Polytechnic Institute of Brooklyn, 1962. \n[11] F. Rosenblatt. Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Spartan Books, Washington D.C., 1962. \n[12] J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5):1926-1940, 1998. \n[13] V. Vapnik. Statistical Learning Theory. John Wiley and Sons, New York, 1998. \n[14] V. Vapnik. The Nature of Statistical Learning Theory. Springer, second edition, 1999. \n[15] G. Wahba. Support Vector Machines, Reproducing Kernel Hilbert Spaces and the randomized GACV. Technical report TR-NO-984, Department of Statistics, University of Wisconsin, Madison, 1997. \n", "award": [], "sourceid": 1870, "authors": [{"given_name": "Thore", "family_name": "Graepel", "institution": null}, {"given_name": "Ralf", "family_name": "Herbrich", "institution": null}, {"given_name": "Robert", "family_name": "Williamson", "institution": null}]}