{"title": "Algorithmic Stability and Generalization Performance", "book": "Advances in Neural Information Processing Systems", "page_first": 196, "page_last": 202, "abstract": null, "full_text": "Algorithmic Stability and Generalization Performance

Olivier Bousquet
CMAP, Ecole Polytechnique
F-91128 Palaiseau cedex, FRANCE
bousquet@cmapx.polytechnique.fr

Andre Elisseeff*
Barnhill Technologies
6709 Waters Avenue
Savannah, GA 31406, USA
andre@barnhilltechnologies.com

Abstract

We present a novel way of obtaining PAC-style bounds on the generalization error of learning algorithms, explicitly using their stability properties. A stable learner is one for which the learned solution does not change much with small changes in the training set. The bounds we obtain do not depend on any measure of the complexity of the hypothesis space (e.g. VC dimension) but rather depend on how the learning algorithm searches this space, and can thus be applied even when the VC dimension is infinite. We demonstrate that regularization networks possess the required stability property and apply our method to obtain new bounds on their generalization performance.

1 Introduction

A key issue in computational learning theory is to bound the generalization error of learning algorithms. Until recently, most of the research in that area has focused on uniform a-priori bounds giving a guarantee that the difference between the training error and the test error is uniformly small for any hypothesis in a given class. These bounds are usually expressed in terms of combinatorial quantities such as VC-dimension. In the last few years, researchers have tried to use more refined quantities to either estimate the complexity of the search space (e.g. covering numbers [1]) or to use a posteriori information about the solution found by the algorithm (e.g. margin [11]). There exist other approaches such as observed VC dimension [12], but all are concerned with structural properties of the learning systems. In this paper we present a novel way of obtaining PAC bounds for specific algorithms, explicitly using their stability properties. The notion of stability, introduced by Devroye and Wagner [4] in the context of classification for the analysis of the leave-one-out error, and further refined by Kearns and Ron [8], is used here in the context of regression in order to get bounds on the empirical error rather than the leave-one-out error. This method has the nice advantage of providing bounds that do not depend on any complexity measure of the search space (e.g. VC-dimension or covering numbers) but rather on the way the algorithm searches this space. In that respect, our approach can be related to Freund's [7], where the estimated size of the subset of the hypothesis space actually searched by the algorithm is used to bound its generalization error. However, Freund's result depends on a complexity term which we do not have here, since we are not looking separately at the hypotheses considered by the algorithm and their risk.

* This work was done while the author was at Laboratoire ERIC, Universite Lumiere Lyon 2, 5 avenue Pierre Mendes-France, F-69676 Bron cedex, FRANCE.

The paper is structured as follows: the next section introduces the notations and the definition of stability used throughout the paper. Section 3 presents our main result as a PAC-like theorem. In Section 4 we prove that regularization networks are stable and apply the main result to obtain bounds on their generalization ability.
A discussion of the results is presented in Section 5.

2 Notations and Definitions

X and Y being respectively an input and an output space, we consider a learning set S = {z_1 = (x_1, y_1), ..., z_m = (x_m, y_m)} of size m in Z = X x Y, drawn i.i.d. from an unknown distribution D. A learning algorithm is a function A from Z^m into Y^X, mapping a learning set S onto a function f_S from X to Y. To avoid complex notations, we consider only deterministic algorithms. It is also assumed that the algorithm A is symmetric with respect to S, i.e. for any permutation of the elements of S, f_S yields the same result. Furthermore, we assume that all functions are measurable and all sets are countable, which does not limit the interest of the results presented here.

The empirical error of a function f measured on the training set S is:

R_m(f) = (1/m) sum_{i=1}^m c(f, z_i)

c : Y^X x (X x Y) -> R+ being a cost function. The risk or generalization error can be written as:

R(f) = E_{z~D} [c(f, z)]

The study we describe here intends to bound the difference between empirical and generalization error for specific algorithms. More precisely, our goal is to bound, for any eps > 0, the term

P_{S~D^m} [ |R_m(f_S) - R(f_S)| > eps ]    (1)

Usually, learning algorithms cannot output just any function in Y^X but rather pick a function f_S in a set F, a subset of Y^X, representing the structure, the architecture or the model. Classical VC theory deals with structural properties and aims at bounding the following quantity:

P_{S~D^m} [ sup_{f in F} |R_m(f) - R(f)| > eps ]    (2)

This applies to any algorithm using F as a hypothesis space, and a bound on this quantity directly implies a similar bound on (1). However, classical bounds require the VC dimension of F to be finite and do not use information about algorithmic properties.
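The quantities R_m(f_S) and R(f_S) defined above can be made concrete with a short simulation. The following sketch is purely illustrative and not from the paper: the squared-loss cost, the one-parameter least-squares learner, and the synthetic linear data are all assumed for the demo, and the risk R(f_S) is approximated by Monte Carlo on a large fresh sample drawn from the same distribution D.

```python
import numpy as np

rng = np.random.default_rng(0)

def cost(f, z):
    # squared-loss cost c(f, z) = (f(x) - y)^2  (an illustrative choice of c)
    x, y = z
    return (f(x) - y) ** 2

def learn(S):
    # least-squares fit f_S(x) = w * x: a deterministic, symmetric algorithm
    X = np.array([x for x, _ in S])
    Y = np.array([y for _, y in S])
    w = (X @ Y) / (X @ X)
    return lambda x: w * x

def sample(m):
    # z = (x, y) drawn i.i.d. from an (assumed) distribution D
    X = rng.normal(size=m)
    Y = 2.0 * X + 0.1 * rng.normal(size=m)
    return list(zip(X, Y))

m = 50
S = sample(m)
f_S = learn(S)

# empirical error R_m(f_S) = (1/m) sum_i c(f_S, z_i)
R_m = np.mean([cost(f_S, z) for z in S])
# risk R(f_S) = E_{z~D}[c(f_S, z)], estimated on a large fresh sample
R = np.mean([cost(f_S, z) for z in sample(100000)])
print(R_m, R, abs(R_m - R))
```

Here the gap |R_m(f_S) - R(f_S)| is small, which is the quantity that the bounds below control without any reference to the size of the hypothesis class.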
For a given set F, there exist many ways to search it, which may yield different performance. For instance, multilayer perceptrons can be learned by a simple backpropagation algorithm or combined with a weight decay procedure. The outcome of the algorithm belongs in both cases to the same set of functions, although their performance can be different.

VC theory was initially motivated by empirical risk minimization (ERM), in which case the uniform bounds on the quantity (2) give tight error bounds. Intuitively, the empirical risk minimization principle relies on a uniform law of large numbers. Because it is not known in advance what the minimum of the empirical risk will be, it is necessary to study the difference between empirical and generalization error for all possible functions in F. If, now, we do not consider this minimum but instead focus on the outcome of a learning algorithm A, we may then know a little bit more about what kind of functions will be obtained. This limits the possibilities and restricts the supremum over all the functions in F to the possible outcomes of the algorithm. An algorithm which always outputs the null function does not need to be studied by a uniform law of large numbers.

Let us introduce a notation for modified training sets: if S denotes the initial training set, S = {z_1, ..., z_{i-1}, z_i, z_{i+1}, ..., z_m}, then S^i denotes the training set after z_i has been replaced by a different training example z_i', that is S^i = {z_1, ..., z_{i-1}, z_i', z_{i+1}, ..., z_m}. Now, we define a notion of stability for regression.

Definition 1 (Uniform stability) Let S = {z_1, ..., z_m} be a training set, S^i be the training set where instance i has been replaced by z_i', and A a symmetric algorithm.
We say that A is beta-stable if the following holds:

for all S in Z^m, for all z_i', z in Z,   |c(f_S, z) - c(f_{S^i}, z)| <= beta    (3)

This condition expresses that for any possible training set S and any replacement example z_i', the difference in cost (measured on any instance z in Z) incurred by the learning algorithm when training on S and on S^i is smaller than some constant beta.

3 Main result

A stable algorithm, i.e. beta-stable with a small beta, has the property that replacing one element in its learning set does not change much its outcome. As a consequence, the empirical error, if thought of as a random variable, should have a small variance. Stable algorithms can then be good candidates for their empirical error to be close to their generalization error. This assertion is formulated in the following theorem:

Theorem 2 Let A be a beta-stable algorithm such that 0 <= c(f_S, z) <= M for all z in Z and all learning sets S. For all eps > 0 and any m >= 8M/eps^2, we have:

P_{S~D^m} [ |R_m(f_S) - R(f_S)| > eps ] <= (64 M m beta + 8 M^2) / (m eps^2)    (4)

and for any m >= 1,

P_{S~D^m} [ |R_m(f_S) - R(f_S)| > eps + beta ] <= 2 exp( - m eps^2 / (2 (m beta + M)^2) )    (5)

Notice that this theorem gives tight bounds when the stability beta is of the order of 1/m. It will be proved in the next section that regularization networks satisfy this requirement.

In order to prove Theorem 2, one has to study the random variable X = R(f_S) - R_m(f_S), which can be done using two different approaches. The first one (corresponding to the exponential inequality) uses a classical martingale inequality and is detailed below. The second one is a bit more technical and requires standard proof techniques such as symmetrization. Here we only briefly sketch this proof and refer the reader to [5] for more details.

Proof of inequality (5). We use the following theorem:

Theorem 3 (McDiarmid [9]) Let Y_1, ..., Y_n be n i.i.d. random variables taking values in a set A, and assume that F : A^n -> R satisfies, for 1 <= i <= n:

sup |F(y_1, ..., y_n) - F(y_1, ..., y_{i-1}, y_i', y_{i+1}, ..., y_n)| <= c_i

then

P [ |F(Y_1, ..., Y_n) - E[F(Y_1, ..., Y_n)]| >= eps ] <= 2 exp( - 2 eps^2 / sum_{i=1}^n c_i^2 )

In order to apply Theorem 3, we have to bound the expectation of X. We begin with a useful lemma:

Lemma 1 For any symmetric learning algorithm we have, for all 1 <= i <= m:

E_{S~D^m} [R(f_S) - R_m(f_S)] = E_{S, z_i' ~ D^{m+1}} [c(f_S, z_i') - c(f_{S^i}, z_i')]

Proof: Notice that

E_{S~D^m} [R_m(f_S)] = (1/m) sum_{j=1}^m E_{S~D^m} [c(f_S, z_j)] = E_{S~D^m} [c(f_S, z_i)],  for all i in {1, ..., m}

by symmetry and the i.i.d. assumption. Now, by simple renaming of z_i as z_i', we get

E_{S~D^m} [R_m(f_S)] = E_{S^i~D^m} [c(f_{S^i}, z_i')] = E_{S, z_i' ~ D^{m+1}} [c(f_{S^i}, z_i')]

which, with the observation that

E_{S~D^m} [R(f_S)] = E_{S, z_i' ~ D^{m+1}} [c(f_S, z_i')]

concludes the proof.

Using the above lemma and the fact that A is beta-stable, we easily get:

E_{S~D^m} [R(f_S) - R_m(f_S)] <= E_{S, z_i' ~ D^{m+1}} [beta] = beta

We now have to compute the constants c_i appearing in Theorem 3. We have

|R(f_S) - R(f_{S^i})| <= E_{z~D} [ |c(f_S, z) - c(f_{S^i}, z)| ] <= beta

and

|R_m(f_S) - R_m(f_{S^i})| <= (1/m) sum_{j != i} |c(f_S, z_j) - c(f_{S^i}, z_j)| + (1/m) |c(f_S, z_i) - c(f_{S^i}, z_i')| <= beta + 2M/m

so that each c_i is at most 2 beta + 2M/m. Theorem 3 applied to R(f_S) - R_m(f_S) then gives inequality (5).

Sketch of the proof of inequality (4). Recall Chebyshev's inequality:

P( |X| >= eps ) <= E[X^2] / eps^2    (6)

for any random variable X. In order to apply this inequality, we have to bound E[X^2]. This can be done with a reasoning similar to the one used for the expectation. Calculations are however more complex and we do not describe them here; for more details, see [5]. The result, which together with (6) gives inequality (4) and concludes the proof, is the following bound:
\n\nEu[X2]  ~ M2/m + 8M,B \n\n\f4  Stability of Regularization Networks \n\n4.1  Definitions \n\nRegularization  networks  have  been introduced  in  machine  learning  by  Poggio  and \nGirosi  [10].  The relationship between  these networks and the Support Vector  Ma(cid:173)\nchines,  as  well  as  their  Bayesian  interpretation,  make  them  very  attractive.  We \nconsider  a training set  S  =  {(Xl, Yl), ... ,(xm, Ym)}  with Xi  E IRd  and Yi  E IR,  that \nis  we  are  in  the  regression  setting.  The  regularization  network technique  consists \nin  finding  a  function  I  : IRd  -t  IR  in  a  space  H  which  minimizes  the  following \nfunctional: \n\n(7) \n\nwhere  II/II~ denotes the norm  in  the space  H.  In  this framework,  H  is  chosen to \nbe  a  reproducing kernel Hilbert  space  (rkhs),  which  is  basically a  functional  space \nendowed  with  a  dot  productl.  A  rkhs  is  defined  by  a  kernel function,  that  is  a \nsymmetric function  k  :  IRd  X  IRd  -t IR  which  we  will  assume  to  be  bounded  by  a \nconstant  K  in what follows2 .  In  particular the following  property will  hold: \n\n(8) \n\n4.2  Stability  study \n\nIn  this  section,  we  show  that  regularization  networks  are,  furthermore,  stable  as \nsoon as A is  not too small. \nTheorem 4  For regularization  networks  with Ilkli H  ~ K  and (f(x) - y)2  ~ M, \n\nand \n\n(\n\nR(fs) ~ Rm(fs) + 2M \n\n4K \n\n2)  In(2/6) \n\n2K2 \n\nA2  +  A + \n\nm \n\n( 64K  + 2) ...!... \n\nm6 \n\nA \n\n(9) \n\n(10) \n\nProof:  Let's denote by  Is the minimizer of C.  Let's define \n\nLet  lSi  be the minimizer of C i  and let  9 denote the difference  lSi  - Is.  By simple \nalgebra, we  have for  t  E  [0,1] \n\nC(fs) - C(fs + tg) = -- ~)Is(xj) - Yj)g(Xj)  - 2tA < Is,g > +t2 A(g) \n\n2t  m \n\nm  j=l \n\nlWe do  not  detail here  the properties of such a  space  and refer  the reader to [2,  3]  for \n\nadditional details. 
\n\n20nce  again we  do not give full  detail of the definition of appropriate kernel  functions \n\nand refer the reader to [3] . \n\n\fwhere A(g) which is not explicitly written here is the factor of t2 .  Similarly we have \n\n2t \nm  l)fSi(Xj) -Yj)g(Xj) \n\n#i \n\n+ 2t (fsi (xD  - YDg(xD  + 2tA <  fSi, 9 > +t2 Ai(g) \n\nm \n\nBy optimality,  we  have \n\nC(fs) - C(fs + tg)  :::;  0 and Ci(fsi) - Ci(fsi  - tg)  :::;  0 \n\nthus, summing those inequalities, dividing by tim and making t  -t 0,  we  get \n2 L g(Xj)2  - 2(fS(Xi) - Yi)g(Xi) + 2(fsi (xD  - YDg(x~) + 2mAIIgilir :::;  0 \n\n#i \nwhich  gives \n\nmAllgllir :::;  (fS(Xi)  - Yi)g(Xi)  - (fs(x~) - YDg(xD  :::;  2VMI\\;IIgliH \n\nusing  (8).  We  thus obtain \n\nand also \n\nIlfsi  - fsllH  :::;  2VMI\\;/(mA) \n\n(11) \n\nVx,y  I (fs(x) - y)2  - (fsi(X) - y)21:::;  2VMlfs(x) - fSi(X)1  :::;  4MI\\;/(mA) \n\nWe  thus  proved  that  the  minimization  of  C[f]  is  a  4:;>.'\" -stable  procedure  which \nallows to apply theorem 2. \n<>. \n\n4.3  Discussion \n\nThese inequalities are both of interest since the range where they are tight is differ(cid:173)\nent.  Indeed,  (10)  has a poor dependence on c5  which makes it deteriorate when high \nconfidence is  sought for.  However,  (9)  can give high confidence bounds but will  be \nlooser when  A is  small. \n\nMoreover,  results  exposed  by  Evgeniou  et  al.  [6]  indicate that the optimal  depen(cid:173)\ndence of A with m  is  obtained for  Am = 0 (In In m).  If we  plug this into the above \nbounds, we  can notice that (9)  does not converge as m  -t 00.  It may be conjectured \nthat the poor estimation of the variance coming from the martingale method in Mc(cid:173)\nDiarmid's inequality is  responsible for  this effect,  but a finer  analysis is  required to \nfully  understand this phenomenon. \nOne of the interests of these results is to provide a mean for choosing the A parameter \nby minimizing the right hand side  of the inequality.  
Usually, it is determined with a validation set: some of the data is not used during learning, and lambda is chosen such that the error of f_S over the validation set is minimized. The drawback of this approach is to reduce the amount of data available for learning.

5 Conclusion and future work

We have presented a new approach to get bounds on the generalization performance of learning algorithms which makes use of specific properties of these algorithms. The bounds we obtain do not depend on the complexity of the hypothesis class but on a measure of how stable the algorithm's output is with respect to changes in the training set.

Although this work has focused on regression, we believe that it can be extended to classification, in particular by making the stability requirement less demanding (e.g. stability on average instead of uniform stability). Future work will also aim at finding other algorithms that are stable or can be appropriately modified to exhibit the stability property. At last, a promising application of this work could be the model selection problem, where one has to tune parameters of the algorithms (e.g. lambda and parameters of the kernel for regularization networks). Instead of using cross-validation, one could measure how stability is influenced by the various parameters of interest and plug these measures into Theorem 2 to derive bounds on the generalization error.

Acknowledgments

We would like to thank G. Lugosi, S. Boucheron and O. Chapelle for interesting discussions on stability and concentration inequalities. Many thanks to A. Smola and to the anonymous reviewers who helped improve the readability.

References

[1] N. Alon, S. Ben-David, N. Cesa-Bianchi, and D. Haussler. Scale-sensitive dimensions, uniform convergence and learnability.
Journal of the ACM, 44(4):615-631, 1997.

[2] N. Aronszajn. Theory of reproducing kernels. Trans. Amer. Math. Soc., 68:337-404, 1950.

[3] M. Atteia. Hilbertian Kernels and Spline Functions. Studies in Computational Mathematics 4. North-Holland, 1992.

[4] L. P. Devroye and T. J. Wagner. Distribution-free performance bounds for potential function rules. IEEE Trans. on Information Theory, 25(5):202-207, 1979.

[5] A. Elisseeff. A study about algorithmic stability and its relation to generalization performances. Technical report, Laboratoire ERIC, Univ. Lyon 2, 2000.

[6] T. Evgeniou, M. Pontil, and T. Poggio. A unified framework for regularization networks and support vector machines. Technical Memo AIM-1654, Massachusetts Institute of Technology, Artificial Intelligence Laboratory, December 1999.

[7] Y. Freund. Self bounding learning algorithms. In Proceedings of the 11th Annual Conference on Computational Learning Theory (COLT-98), pages 247-258, New York, July 24-26 1998. ACM Press.

[8] M. Kearns and D. Ron. Algorithmic stability and sanity-check bounds for leave-one-out cross-validation. Neural Computation, 11(6):1427-1453, 1999.

[9] C. McDiarmid. On the method of bounded differences. In Surveys in Combinatorics, pages 148-188. Cambridge University Press, Cambridge, 1989.

[10] T. Poggio and F. Girosi. Regularization algorithms for learning that are equivalent to multilayer networks. Science, 247:978-982, 1990.

[11] J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony. A framework for structural risk minimization. In Proc. 9th Annu. Conf. on Comput. Learning Theory, pages 68-76. ACM Press, New York, NY, 1996.

[12] J. Shawe-Taylor and R. C. Williamson.
Generalization performance of classifiers in terms of observed covering numbers. In Paul Fischer and Hans Ulrich Simon, editors, Proceedings of the 4th European Conference on Computational Learning Theory (Eurocolt-99), volume 1572 of LNAI, pages 274-284, Berlin, March 29-31 1999. Springer.
", "award": [], "sourceid": 1854, "authors": [{"given_name": "Olivier", "family_name": "Bousquet", "institution": null}, {"given_name": "Andr\u00e9", "family_name": "Elisseeff", "institution": null}]}