{"title": "From Margin to Sparsity", "book": "Advances in Neural Information Processing Systems", "page_first": 210, "page_last": 216, "abstract": null, "full_text": "From Margin To Sparsity \n\nThore Graepel, Ralf Herbrich \nComputer Science Department \nTechnical University of Berlin \nBerlin, Germany \n{guru, ralfh}@cs.tu-berlin.de \n\nRobert C. Williamson \nDepartment of Engineering \nAustralian National University \nCanberra, Australia \nBob.Williamson@anu.edu.au \n\nAbstract \n\nWe present an improvement of Novikoff's perceptron convergence theorem. Reinterpreting this mistake bound as a margin dependent sparsity guarantee allows us to give a PAC-style generalisation error bound for the classifier learned by the perceptron learning algorithm. The bound value crucially depends on the margin a support vector machine would achieve on the same data set using the same kernel. Ironically, the bound yields better guarantees than are currently available for the support vector solution itself. \n\n1 Introduction \n\nIn the last few years there has been a large controversy about the significance of the attained margin, i.e. the smallest real valued output of a classifier before thresholding, as an indicator of generalisation performance. Results in the VC, PAC and luckiness frameworks seem to indicate that a large margin is a prerequisite for small generalisation error bounds (see [14, 12]). These results caused many researchers to focus on large margin methods such as the well known support vector machine (SVM). On the other hand, the notion of sparsity is deemed important for generalisation, as can be seen from the popularity of Occam's razor like arguments as well as compression considerations (see [8]).
\n\nIn this paper we reconcile the two notions by reinterpreting an improved version of Novikoff's well known perceptron convergence theorem as a sparsity guarantee in dual space: the existence of large margin classifiers implies the existence of sparse consistent classifiers in dual space. Even better, this solution is easily found by the perceptron algorithm. By combining the perceptron mistake bound with a compression bound that originated from the work of Littlestone and Warmuth [8] we are able to provide a PAC like generalisation error bound for the classifier found by the perceptron algorithm whose size is determined by the magnitude of the maximally achievable margin on the dataset. \n\nThe paper is structured as follows: after introducing the perceptron in dual variables in Section 2 we improve on Novikoff's perceptron convergence bound in Section 3. Our main result is presented in the subsequent section and its consequences for the theoretical foundation of SVMs are discussed in Section 5. \n\n2 (Dual) Kernel Perceptrons \n\nWe consider learning given $m$ objects $X = \{x_1, \ldots, x_m\} \in \mathcal{X}^m$ and a set $Y = \{y_1, \ldots, y_m\} \in \mathcal{Y}^m$ drawn iid from a fixed distribution $P_{XY} = P_Z$ over the space $\mathcal{X} \times \{-1, +1\} = \mathcal{Z}$ of input-output pairs. Our hypotheses are linear classifiers $x \mapsto \mathrm{sign}(\langle w, \phi(x) \rangle)$ in some fixed feature space $\mathcal{K} \subseteq \ell_2$ where we assume that a mapping $\phi : \mathcal{X} \to \mathcal{K}$ is chosen a priori¹. Given the features $\phi_i : \mathcal{X} \to \mathbb{R}$ the classical (primal) perceptron algorithm aims at finding a weight vector $w \in \mathcal{K}$ consistent with the training data. Recently, Vapnik [14] and others - in their work on SVMs - have rediscovered that it may be advantageous to learn in the dual representation (see [1]), i.e.
 expanding the weight vector in terms of the training data \n\n$$w_\alpha = \sum_{i=1}^m \alpha_i \phi(x_i) = \sum_{i=1}^m \alpha_i x_i, \quad (1)$$ \n\nand learn the $m$ expansion coefficients $\alpha \in \mathbb{R}^m$ rather than the components of $w \in \mathcal{K}$. This is particularly useful if the dimensionality $n = \dim(\mathcal{K})$ of the feature space $\mathcal{K}$ is much greater (or possibly infinite) than the number $m$ of training points. This dual representation can be used for a rather wide class of learning algorithms - in particular if all we need for learning is the real valued output $\langle w, x_i \rangle_\mathcal{K}$ of the classifier at the $m$ training points $x_1, \ldots, x_m$ (see [15]). Thus it suffices to choose a symmetric function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ called kernel and to ensure that there exists a mapping $\phi_k : \mathcal{X} \to \mathcal{K}$ such that \n\n$$\forall x, x' \in \mathcal{X}: \quad k(x, x') = \langle \phi_k(x), \phi_k(x') \rangle_\mathcal{K}. \quad (2)$$ \n\nA sufficient condition is given by Mercer's theorem. \n\nTheorem 1 (Mercer Kernel [9, 7]). Any symmetric function $k \in L_\infty(\mathcal{X} \times \mathcal{X})$ that is positive semidefinite, i.e. \n\n$$\forall f \in L_2(\mathcal{X}): \quad \int_\mathcal{X} \int_\mathcal{X} k(x, x')\, f(x)\, f(x')\, dx\, dx' \geq 0,$$ \n\nis called a Mercer kernel and has the following property: if $\psi_i \in L_2(\mathcal{X})$ solve the eigenvalue problem $\int_\mathcal{X} k(x, x')\, \psi_i(x')\, dx' = \lambda_i \psi_i(x)$ with $\int_\mathcal{X} \psi_i^2(x)\, dx = 1$ and $\forall i \neq j : \int_\mathcal{X} \psi_i(x)\, \psi_j(x)\, dx = 0$ then $k$ can be expanded in a uniformly convergent series, i.e. \n\n$$k(x, x') = \sum_{i=1}^\infty \lambda_i \psi_i(x)\, \psi_i(x').$$ \n\nIn order to see that a Mercer kernel fulfils equation (2) consider the mapping \n\n$$\phi_k(x) = \left( \sqrt{\lambda_1}\, \psi_1(x), \sqrt{\lambda_2}\, \psi_2(x), \ldots \right), \quad (3)$$ \n\nwhose existence is ensured by the third property. Finally, the perceptron learning algorithm we are going to consider is described in the following definition. \n\nDefinition 1 (Perceptron Learning). The perceptron learning procedure with the fixed learning rate $\eta \in \mathbb{R}^+$ is as follows: \n\n1.
 Start in step zero, i.e. $t = 0$, with the vector $\alpha_t = 0$. \n\n2. If there exists an index $i \in \{1, \ldots, m\}$ such that $y_i \langle w_{\alpha_t}, x_i \rangle_\mathcal{K} \leq 0$ then \n\n$$(\alpha_{t+1})_i = (\alpha_t)_i + \eta y_i \quad \Leftrightarrow \quad w_{\alpha_{t+1}} = w_{\alpha_t} + \eta y_i x_i, \quad (4)$$ \n\nand $t \leftarrow t + 1$. \n\n3. Stop, if there is no $i \in \{1, \ldots, m\}$ such that $y_i \langle w_{\alpha_t}, x_i \rangle_\mathcal{K} \leq 0$. \n\n¹Sometimes, we abbreviate $\phi(x)$ by $x$, always assuming $\phi$ is fixed. \n\nOther variants of this algorithm have been presented elsewhere (see [2, 3]). \n\n3 An Improvement of Novikoff's Theorem \n\nIn the early 60's Novikoff [10] was able to give an upper bound on the number of mistakes made by the classical perceptron learning procedure. Two years later, this bound was generalised to feature spaces using Mercer kernels by Aizerman et al. [1]. The quantity determining the upper bound is the maximally achievable unnormalised margin $\max_{\alpha \in \mathbb{R}^m} \gamma_Z(\alpha)$ normalised by the total extent $R(X)$ of the data in feature space, i.e. $R(X) = \max_{x_i \in X} \|x_i\|_\mathcal{K}$. \n\nDefinition 2 (Unnormalised Margin). Given a training set $Z = (X, Y)$ and a vector $\alpha \in \mathbb{R}^m$ the unnormalised margin $\gamma_Z(\alpha)$ is given by \n\n$$\gamma_Z(\alpha) = \min_{(x_i, y_i) \in Z} \frac{y_i \langle w_\alpha, x_i \rangle_\mathcal{K}}{\|w_\alpha\|_\mathcal{K}}.$$ \n\nTheorem 2 (Novikoff's Perceptron Convergence Theorem [10, 1]). Let $Z = (X, Y)$ be a training set of size $m$. Suppose that there exists a vector $\alpha^* \in \mathbb{R}^m$ such that $\gamma_Z(\alpha^*) > 0$. Then the number of mistakes made by the perceptron algorithm in Definition 1 on $Z$ is at most \n\n$$\left( \frac{R(X)}{\gamma_Z(\alpha^*)} \right)^2.$$ \n\nSurprisingly, this bound is highly influenced by the data point $x_i \in X$ with the largest norm $\|x_i\|_\mathcal{K}$, albeit rescaling of a data point would not change its classification. Let us consider rescaling of the training set $X$ before applying the perceptron algorithm.
 Then for the normalised training set we would have $R(X_{\mathrm{norm}}) = 1$ and $\gamma_Z(\alpha)$ would change into the normalised margin $\Gamma_Z(\alpha)$ first advocated in [6]. \n\nDefinition 3 (Normalised Margin). Given a training set $Z = (X, Y)$ and a vector $\alpha \in \mathbb{R}^m$ the normalised margin $\Gamma_Z(\alpha)$ is given by \n\n$$\Gamma_Z(\alpha) = \min_{(x_i, y_i) \in Z} \frac{y_i \langle w_\alpha, x_i \rangle_\mathcal{K}}{\|w_\alpha\|_\mathcal{K} \|x_i\|_\mathcal{K}}.$$ \n\nBy definition, for all $x_i \in X$ we have $R(X) \geq \|x_i\|_\mathcal{K}$. Hence for any $\alpha \in \mathbb{R}^m$ and all $(x_i, y_i) \in Z$ such that $y_i \langle w_\alpha, x_i \rangle_\mathcal{K} > 0$ \n\n$$\frac{R(X)\, \|w_\alpha\|_\mathcal{K}}{y_i \langle w_\alpha, x_i \rangle_\mathcal{K}} \geq \frac{\|x_i\|_\mathcal{K}\, \|w_\alpha\|_\mathcal{K}}{y_i \langle w_\alpha, x_i \rangle_\mathcal{K}},$$ \n\nwhich immediately implies for all $Z = (X, Y) \in \mathcal{Z}^m$ such that $\gamma_Z(\alpha) > 0$ \n\n$$\frac{R(X)}{\gamma_Z(\alpha)} \geq \frac{1}{\Gamma_Z(\alpha)}. \quad (5)$$ \n\nThus when normalising the data in feature space, i.e. \n\n$$k_{\mathrm{norm}}(x, x') = \frac{k(x, x')}{\sqrt{k(x, x) \cdot k(x', x')}},$$ \n\nthe upper bound on the number of steps until convergence of the classical perceptron learning procedure of Rosenblatt [11] is provably decreasing and is given by the squared r.h.s. of (5). \n\nConsidering the form of the update rule (4) we observe that this result not only bounds the number of mistakes made during learning but also the number $\|\alpha\|_0$ of non-zero coefficients in the $\alpha$ vector. To be precise, for $\eta = 1$ it bounds the $\ell_1$ norm $\|\alpha\|_1$ of the coefficient vector which, in turn, bounds the zero norm $\|\alpha\|_0$ from above for all vectors with integer components. Theorem 2 thus establishes a relation between the existence of a large margin classifier $w^*$ and the sparseness of any solution found by the perceptron algorithm. \n\n4 Main Result \n\nIn order to exploit the guaranteed sparseness of the solution of a kernel perceptron we make use of the following lemma to be found in [8, 4]. \n\nLemma 1 (Compression Lemma). Fix d ∈ {1, ...
, m}. For any measure $P_Z$, the probability that $m$ examples $Z$ drawn iid according to $P_Z$ will yield a classifier $\alpha(Z)$ learned by the perceptron algorithm with $\|\alpha(Z)\|_0 = d$ whose generalisation error $P_{XY}[Y \langle w_{\alpha(Z)}, \phi(X) \rangle_\mathcal{K} \leq 0]$ is greater than $\varepsilon$ is at most \n\n$$\binom{m}{d} (1 - \varepsilon)^{m-d}. \quad (6)$$ \n\nProof. Since we restrict the solution $\alpha(Z)$ with generalisation error greater than $\varepsilon$ only to use $d$ points $Z_d \subseteq Z$ but still to be consistent with the remaining set $Z \setminus Z_d$, this probability is at most $(1 - \varepsilon)^{m-d}$ for a fixed subset $Z_d$. The result follows by the union bound over all $\binom{m}{d}$ subsets $Z_d$. Intuitively, the consistency on the $m - d$ unused training points witnesses the small generalisation error with high probability. □ \n\nIf we set (6) to $\frac{\delta}{m}$ and solve for $\varepsilon$ we have that with probability at most $\frac{\delta}{m}$ over the random draw of the training set $Z$ the perceptron learning algorithm finds a vector $\alpha$ such that $\|\alpha\|_0 = d$ and whose generalisation error is greater than $\varepsilon(m, d) = \frac{1}{m-d} \left( \ln \binom{m}{d} + \ln m + \ln \frac{1}{\delta} \right)$. Thus by the union bound, if the perceptron algorithm converges, the probability that the generalisation error of its solution is greater than $\varepsilon(m, \|\alpha\|_0)$ is at most $\delta$. We have shown the following sparsity bound, also to be found in [4]. \n\nTheorem 3 (Generalisation Error Bound for Perceptrons). For any measure $P_Z$, with probability at least $1 - \delta$ over the random draw of the training set $Z$ of size $m$, if the perceptron learning algorithm converges to the vector $\alpha$ of coefficients then its generalisation error $P_{XY}[Y \langle w_{\alpha(Z)}, \phi(X) \rangle_\mathcal{K} \leq 0]$ is less than \n\n$$\frac{1}{m - \|\alpha\|_0} \left( \ln \binom{m}{\|\alpha\|_0} + \ln m + \ln \frac{1}{\delta} \right). \quad (7)$$ \n\nThis theorem in itself constitutes a powerful result and can easily be adapted to hold for a large class of learning algorithms including SVMs [4].
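To get a feel for the magnitude of (7), the bound is easy to evaluate numerically. The sketch below is our own illustration (not part of the paper): plain Python, with $\ln \binom{m}{d}$ computed via the log-Gamma function, plugging in the perceptron figures for digit "0" from Table 1 ($m = 60000$, $\|\alpha\|_0 = 740$, $\delta = 0.05$).

```python
from math import lgamma, log

def log_binom(m: int, d: int) -> float:
    # ln C(m, d) = ln m! - ln d! - ln (m - d)!, via the log-Gamma function
    return lgamma(m + 1) - lgamma(d + 1) - lgamma(m - d + 1)

def sparsity_bound(m: int, d: int, delta: float) -> float:
    # Theorem 3, eq. (7): (ln C(m, d) + ln m + ln(1/delta)) / (m - d)
    return (log_binom(m, d) + log(m) + log(1.0 / delta)) / (m - d)

# Digit "0" in Table 1: m = 60000 examples, ||alpha||_0 = 740, delta = 0.05
print(sparsity_bound(60000, 740, 0.05))  # approx. 0.067, i.e. the 6.7% of Table 1
```

Inserting the SVM sparsity instead, `sparsity_bound(60000, 1379, 0.05)` gives roughly 0.112, matching the 11.2% quoted for the SVM on digit "0".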
 This bound often outperforms margin bounds for practically relevant training set sizes, e.g. $m < 100\,000$. Combining Theorem 2 and Theorem 3 thus gives our main result. \n\nTheorem 4 (Margin Bound). For any measure $P_Z$, with probability at least $1 - \delta$ over the random draw of the training set $Z$ of size $m$, if there exists a vector $\alpha^*$ such that \n\n$$\kappa^* = \left\lceil \left( \frac{R(X)}{\gamma_Z(\alpha^*)} \right)^2 \right\rceil \leq m,$$ \n\nthen the generalisation error $P_{XY}[Y \langle w_{\alpha(Z)}, \phi(X) \rangle_\mathcal{K} \leq 0]$ of the classifier $\alpha$ found by the perceptron algorithm is less than \n\n$$\frac{1}{m - \kappa^*} \left( \ln \binom{m}{\kappa^*} + \ln m + \ln \frac{1}{\delta} \right). \quad (8)$$ \n\nThe most intriguing feature of this result is that the mere existence of a large margin classifier $\alpha^*$ is sufficient to guarantee a small generalisation error for the solution $\alpha$ of the perceptron although its attained margin $\gamma_Z(\alpha)$ is likely to be much smaller than $\gamma_Z(\alpha^*)$. It has long been argued that the attained margin $\gamma_Z(\alpha)$ itself is the crucial quantity controlling the generalisation error of $\alpha$. In light of our new result, if there exists a consistent classifier $\alpha^*$ with large margin we know that there also exists at least one classifier $\alpha$ with high sparsity that can efficiently be found using the perceptron algorithm. In fact, whenever the SVM appears to be theoretically justified by a large observed margin, every solution found by the perceptron algorithm has a small guaranteed generalisation error - mostly even smaller than current bounds on the generalisation error of SVMs. Note that for a given training sample $Z$ it is not unlikely that by permutation of $Z$ there exist $O\left(\binom{m}{\kappa^*}\right)$ many different consistent sparse classifiers $\alpha$. \n\n5 Impact on the Foundations of Support Vector Machines \n\nSupport vector machines owe their popularity mainly to their theoretical justification in learning theory.
 In particular, two arguments have been put forward to single out the solutions found by SVMs [14, p. 139]: SVMs (optimal hyperplanes) can generalise because \n\n1. the expectation of the data compression is large. \n2. the expectation of the margin is large. \n\nThe second reason is often justified by margin results (see [14, 12]) which bound the generalisation error of a classifier $\alpha$ in terms of its own attained margin $\Gamma_Z(\alpha)$. If we require the slightly stronger condition that $\kappa^* < \frac{m}{n}$, $n \geq 4$, then our bound (8) for solutions of perceptron learning can be upper bounded by \n\n$$\frac{n}{(n-1)\,m} \left( \kappa^* \ln \left( \frac{em}{\kappa^*} \right) + \ln m + \ln \frac{1}{\delta} \right),$$ \n\nwhich has to be compared with the PAC margin bound (see [12, 5]) \n\n$$\frac{2}{m} \left( 64 \kappa^* \log_2 \left( \frac{em}{\kappa^*} \right) \log_2(32m) + \log_2(2m) + \log_2 \frac{1}{\delta} \right).$$ \n\nDespite the fact that the former result also holds true for the margin $\Gamma_Z(\alpha^*)$ (which could loosely be upper bounded by (5)), \n\n• the PAC margin bound's decay (as a function of $m$) is slower by a $\log_2(32m)$ factor; \n\n• for any $m$ and almost any $\delta$ the margin bound given in Theorem 4 guarantees a smaller generalisation error. \n\n• For example, using the empirical value $\kappa^* \approx 600$ (see [14, p. 153]) in the NIST handwritten digit recognition task and inserting this value into the PAC margin bound, it would need the astronomically large number of $m > 410\,743\,386$ examples to obtain a bound value of 0.112 as obtained by (7) for the digit \"0\" (see Table 1). \n\nTable 1: Results of kernel perceptrons and SVMs on NIST (taken from [2, Table 3]). The kernel used was $k(x, x') = (\langle x, x' \rangle_\mathcal{X} + 1)^4$ and $m = 60000$. For both algorithms we give the measured generalisation error (in %), the attained sparsity and the bound value (in %, $\delta = 0.05$) of (7). \n\ndigit                 0     1     2     3     4     5     6     7     8     9 \nperceptron error (%)  0.2   0.2   0.4   0.4   0.4   0.4   0.4   0.5   0.6   0.7 \n  ||alpha||_0         740   643  1168  1512  1078  1277   823  1103  1856  1920 \n  mistakes            844   843  1345  1811  1222  1497   960  1323  2326  2367 \n  bound (%)           6.7   6.0   9.8  12.0   9.2  10.5   7.4   9.4  14.3  14.6 \nSVM error (%)         0.2   0.1   0.4   0.4   0.4   0.5   0.3   0.4   0.5   0.6 \n  ||alpha||_0        1379   989  1958  1900  1224  2024  1527  2064  2332  2765 \n  bound (%)          11.2   8.6  14.9  14.5  10.2  15.3  12.2  15.5  17.1  19.6 \n\nWith regard to the first reason, it has been confirmed experimentally that SVMs find solutions which are sparse in the expansion coefficients $\alpha$. However, there cannot exist any distribution-free guarantee that the number of support vectors will in fact be small². In contrast, Theorem 2 gives an explicit bound on the sparsity in terms of the achievable margin $\gamma_Z(\alpha^*)$. Furthermore, experimental results on the NIST datasets show that the sparsity of the solution found by the perceptron algorithm is consistently (and often by a factor of two) greater than that of the SVM solution (see [2, Table 3] and Table 1). \n\n6 Conclusion \n\nWe have shown that the generalisation error of a very simple and efficient learning algorithm for linear classifiers - the perceptron algorithm - can be bounded by a quantity involving the margin of the classifier the SVM would have found on the same training data using the same kernel. This result implies that the SVM solution is not at all singled out as being superior in terms of provable generalisation error. Also, the result indicates that sparsity of the solution may be a more fundamental property than the size of the attained margin (since a large value of the latter implies a large value of the former).
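The two ingredients of this conclusion, the dual perceptron of Definition 1 and the mistake bound of Theorem 2, are easy to sanity-check empirically. The sketch below is our own illustration (all names hypothetical, plain Python, linear kernel): it runs the dual update (4) on separable toy data and verifies that the number of updates stays below $(R(X)/\gamma_Z(\alpha))^2$ computed from the attained margin; since the attained margin is at most the best achievable one, this cap is weaker than, and implied by, Theorem 2.

```python
import random

def dual_perceptron(X, y, k, eta=1.0, max_epochs=1000):
    # Definition 1: learn expansion coefficients alpha; on a mistake at point i,
    # i.e. y_i <w_alpha, x_i> <= 0, update (alpha_{t+1})_i = (alpha_t)_i + eta*y_i.
    m = len(X)
    alpha = [0.0] * m
    updates = 0
    for _ in range(max_epochs):
        mistake_free = True
        for i in range(m):
            out = sum(alpha[j] * k(X[j], X[i]) for j in range(m))
            if y[i] * out <= 0:
                alpha[i] += eta * y[i]
                updates += 1
                mistake_free = False
        if mistake_free:  # step 3: consistent on all m points, stop
            break
    return alpha, updates

def linear_kernel(a, b):
    return sum(p * q for p, q in zip(a, b))

def unnormalised_margin(alpha, X, y, k):
    # Definition 2: gamma_Z(alpha) = min_i y_i <w_alpha, x_i> / ||w_alpha||
    m = len(X)
    w_norm = sum(alpha[i] * alpha[j] * k(X[i], X[j])
                 for i in range(m) for j in range(m)) ** 0.5
    return min(y[i] * sum(alpha[j] * k(X[j], X[i]) for j in range(m))
               for i in range(m)) / w_norm

# Separable toy data: two Gaussian clusters on opposite sides of the origin.
random.seed(0)
X = ([(random.gauss(2, 0.3), random.gauss(2, 0.3)) for _ in range(20)] +
     [(random.gauss(-2, 0.3), random.gauss(-2, 0.3)) for _ in range(20)])
y = [+1] * 20 + [-1] * 20

alpha, updates = dual_perceptron(X, y, linear_kernel)
R = max(linear_kernel(x, x) ** 0.5 for x in X)
gamma = unnormalised_margin(alpha, X, y, linear_kernel)
print(updates, "updates; cap from attained margin:", (R / gamma) ** 2)
```

Replacing `linear_kernel` by a Mercer kernel such as $k(x, x') = (\langle x, x' \rangle + 1)^4$ turns this into the kernel perceptron used for Table 1.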
\n\nOur analysis raises an interesting question: having chosen a good kernel, corresponding to a metric in which inter-class distances are great and intra-class distances are short, to what extent does it matter which consistent classifier we use? Experimental results seem to indicate that a vast variety of heuristics for finding consistent classifiers, e.g. kernel Fisher discriminant, linear programming machines, Bayes point machines, kernel PCA & linear SVM, sparse greedy matrix approximation, perform comparably (see http://www.kernel-machines.org/). \n\n²Consider a distribution $P_{XY}$ on two parallel lines with support in the unit ball. Suppose that their mutual distance is $\sqrt{2}$. Then the number of support vectors equals the training set size whereas the perceptron algorithm never uses more than two points by Theorem 2. One could argue that it is the number of essential support vectors [13] that characterises the data compression of an SVM (which would also have been two in our example). Their determination, however, involves a combinatorial optimisation problem and can thus never be performed in practical applications. \n\nAcknowledgements \n\nThis work was done while TG and RH were visiting the ANU Canberra. They would like to thank Peter Bartlett and Jon Baxter for many interesting discussions. Furthermore, we would like to thank the anonymous reviewer, Olivier Bousquet and Matthias Seeger for very useful remarks on the paper. \n\nReferences \n\n[1] M. Aizerman, E. Braverman, and L. Rozonoer. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25:821-837, 1964. \n\n[2] Y. Freund and R. E. Schapire. Large margin classification using the perceptron algorithm. Machine Learning, 1999. \n\n[3] T. Friess, N. Cristianini, and C.
 Campbell. The Kernel-Adatron: A fast and simple learning procedure for Support Vector Machines. In Proceedings of the 15th International Conference on Machine Learning, pages 188-196, 1998. \n\n[4] T. Graepel, R. Herbrich, and J. Shawe-Taylor. Generalisation error bounds for sparse linear classifiers. In Proceedings of the Thirteenth Annual Conference on Computational Learning Theory, pages 298-303, 2000. In press. \n\n[5] R. Herbrich. Learning Linear Classifiers - Theory and Algorithms. PhD thesis, Technische Universität Berlin, 2000. Accepted for publication by MIT Press. \n\n[6] R. Herbrich and T. Graepel. A PAC-Bayesian margin bound for linear classifiers: Why SVMs work. In Advances in Neural Information Processing Systems 13, 2001. \n\n[7] H. König. Eigenvalue Distribution of Compact Operators. Birkhäuser, Basel, 1986. \n\n[8] N. Littlestone and M. Warmuth. Relating data compression and learnability. Technical report, University of California Santa Cruz, 1986. \n\n[9] J. Mercer. Functions of positive and negative type and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society of London (A), 209:415-446, 1909. \n\n[10] A. Novikoff. On convergence proofs for perceptrons. In Report at the Symposium on Mathematical Theory of Automata, pages 24-26, Polytechnic Institute of Brooklyn, 1962. \n\n[11] F. Rosenblatt. Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Spartan Books, Washington D.C., 1962. \n\n[12] J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5):1926-1940, 1998. \n\n[13] V. Vapnik. Statistical Learning Theory.
 John Wiley and Sons, New York, 1998. \n\n[14] V. Vapnik. The Nature of Statistical Learning Theory. Springer, second edition, 1999. \n\n[15] G. Wahba. Support Vector Machines, Reproducing Kernel Hilbert Spaces and the randomized GACV. Technical report, Department of Statistics, University of Wisconsin, Madison, 1997. TR-NO-984.", "award": [], "sourceid": 1870, "authors": [{"given_name": "Thore", "family_name": "Graepel", "institution": null}, {"given_name": "Ralf", "family_name": "Herbrich", "institution": null}, {"given_name": "Robert", "family_name": "Williamson", "institution": null}]}