{"title": "Hoeffding Races: Accelerating Model Selection Search for Classification and Function Approximation", "book": "Advances in Neural Information Processing Systems", "page_first": 59, "page_last": 66, "abstract": null, "full_text": "Hoeffding Races:  Accelerating Model \nSelection Search for  Classification and \n\nFunction  Approximation \n\nOded Maron \n\nArtificial Intelligence  Laboratory \n\nMassachusetts  Institute of Technology \n\nCambridge, MA  02139 \n\nAndrew W.  Moore \n\nRobotics Institute \n\nSchool of Computer Science \nCarnegie  Mellon  University \n\nPittsburgh,  PA  15213 \n\nAbstract \n\nSelecting  a good  model of a set  of input points by  cross  validation \nis  a  computationally intensive  process,  especially  if the  number of \npossible  models  or  the  number  of training  points  is  high.  Tech(cid:173)\nniques  such  as  gradient  descent  are  helpful  in  searching  through \nthe space of models,  but problems such  as  local minima, and more \nimportantly, lack  of a  distance  metric  between  various  models re(cid:173)\nduce  the applicability of these search  methods.  Hoeffding  Races  is \na  technique  for  finding  a  good  model  for  the  data by  quickly  dis(cid:173)\ncarding bad models, and concentrating the computational effort  at \ndifferentiating between  the better  ones.  This paper focuses  on  the \nspecial  case  of leave-one-out  cross  validation  applied  to  memory(cid:173)\nbased  learning  algorithms,  but  we  also  argue  that  it is  applicable \nto any  class  of model selection  problems. \n\n1 \n\nIntroduction \n\nModel  selection  addresses  \"high  level\"  decisions  about  how  best  to  tune  learning \nalgorithm architectures for  particular  tasks.  Such  decisions  include which function \napproximator to  use,  how  to  trade  smoothness  for  goodness  of fit  and  which  fea(cid:173)\ntures  are  relevant.  
The problem of automatically selecting a good model has been variously described as fitting a curve, learning a function, or trying to predict future instances of the problem. One can think of this as a search through the space of possible models with some criterion of "goodness" such as prediction accuracy, complexity of the model, or smoothness. In this paper, this criterion will be prediction accuracy. Let us examine two common ways of measuring accuracy: using a test set and leave-one-out cross validation (Wahba and Wold, 1975).

Figure 1: A space of models consisting of locally-weighted-regression models with different numbers of nearest neighbors used (x-axis: k nearest neighbors used; y-axis: cross-validation error). The global minimum is at one nearest neighbor, but a gradient descent algorithm would get stuck in local minima unless it happened to start in a model where k < 4.

• The test set method arbitrarily divides the data into a training set and a test set. The learner is trained on the training set, and is then queried with just the input vectors of the test set. The error for a particular point is the difference between the learner's prediction and the actual output vector.

• Leave-one-out cross validation trains the learner N times (where N is the number of points), each time omitting a different point. We attempt to predict each omitted point. The error for a particular point is the difference between the learner's prediction and the actual output vector.

The total error of either method is computed by averaging all the error instances.
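The leave-one-out procedure described above can be sketched in a few lines of code. The following is our own minimal illustration (the function names and the toy dataset are ours, not from the paper), using a one-nearest-neighbor learner as the "model":

```python
def one_nn_predict(train, query):
    """Predict the output of the training point whose input is nearest the query."""
    nearest = min(train, key=lambda p: abs(p[0] - query))
    return nearest[1]

def loo_cv_error(points):
    """Leave-one-out cross validation: hold out each point in turn,
    predict it from the remaining points, and average the absolute errors."""
    total = 0.0
    for i, (x, y) in enumerate(points):
        held_out = points[:i] + points[i + 1:]   # "cover up" point i
        total += abs(one_nn_predict(held_out, x) - y)
    return total / len(points)

data = [(0.0, 0.0), (1.0, 1.1), (2.0, 1.9), (3.0, 3.2)]
print(loo_cv_error(data))  # average leave-one-out error over the 4 points
```

Note that, as the paper observes for memory-based learners, each held-out prediction costs no more than an ordinary prediction: there is no retraining, only a lookup that skips the covered-up point.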
The obvious method of searching through a space of models, the brute force approach, finds the accuracy of every model and picks the best one. The time to find the accuracy (error rate) of a particular model is proportional to the size of the test set |TEST|, or the size of the training set in the case of cross validation. Suppose that the model space is discretized into a finite number of models |MODELS|; then the amount of work required is O(|MODELS| x |TEST|), which is expensive.

A popular way of dealing with this problem is gradient descent. This method can be applied to find the parameters (or weights) of a model. However, it cannot be used to find the structure (or architecture) of the model. There are two reasons for this. First, we have empirically noted many occasions on which the search space is peppered with local minima (Figure 1). Second, at the highest level we are selecting from a set of entirely distinct models, with no numeric parameters over which to hill-climb. For example, is a neural net with 100 hidden units closer to a neural net with 50 hidden units or to a memory-based model which uses 3 nearest neighbors? There is no viable answer to this question since we cannot impose a viable metric on this model space.

The algorithm we describe in this paper, Hoeffding Races, combines the robustness of brute force and the computational feasibility of hill climbing. We instantiated the algorithm by specifying the set of models to be memory-based algorithms (Stanfill and Waltz, 1986) (Atkeson and Reinkensmeyer, 1989) (Moore, 1992) and the method of finding the error to be leave-one-out cross validation.
We will discuss how to extend the algorithm to any set of models and to the test set method in the full paper. We chose memory-based algorithms since they go hand in hand with cross validation. Training is very cheap: simply keep all the points in memory, and all the algorithms of the various models can use the same memory. Finding the leave-one-out cross validation error at a point is as cheap as making a prediction: simply "cover up" that point in memory, then predict its value using the current model. For a discussion of how to generate various memory-based models, see (Moore et al., 1992).

2 Hoeffding Races

The algorithm was inspired by ideas from (Haussler, 1992) and (Kaelbling, 1990), and a similar idea appears in (Greiner and Jurisica, 1992). It derives its name from Hoeffding's formula (Hoeffding, 1963), which concerns our confidence in the sample mean of n independently drawn points x_1, ..., x_n. The probability of the estimated mean E_est = (1/n) * sum_{i=1}^{n} x_i being more than ε away from the true mean E_true after n independently drawn points is bounded by:

    Pr(|E_true - E_est| > ε) ≤ 2 exp(-2 n ε² / B²)

where B bounds the possible spread of point values.

We would like to say that with confidence 1 - δ, our estimate of the mean is within ε of the true mean; or in other words, Pr(|E_true - E_est| > ε) < δ. Combining the two equations and solving for ε gives us a bound on how close the estimated mean is to the true mean after n points with confidence 1 - δ:

    ε = sqrt( B² log(2/δ) / (2n) )

The algorithm starts with a collection of learning boxes. We call each model a learning box since we are treating the models as if they were black boxes.
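Hoeffding's bound translates directly into code. A small sketch (function name ours) of the confidence-interval half-width ε for a given spread B, failure probability δ, and sample count n:

```python
import math

def hoeffding_epsilon(B, delta, n):
    """Half-width of the confidence interval around the estimated mean:
    with probability at least 1 - delta, |E_true - E_est| <= epsilon
    after n points whose values span at most B."""
    return math.sqrt(B * B * math.log(2.0 / delta) / (2.0 * n))

# The interval shrinks as O(1/sqrt(n)): quadrupling n halves epsilon.
print(hoeffding_epsilon(1.0, 0.05, 100))
print(hoeffding_epsilon(1.0, 0.05, 400))
```

The O(1/sqrt(n)) shrinkage is what makes the race work: intervals tighten steadily with each test point, so clearly bad models are separated from good ones early.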
We are not looking at how complex or time-consuming each prediction is, just at the input and output of the box. Associated with each learning box are two pieces of information: a current estimate of its error rate and the number of points it has been tested upon so far. The algorithm also starts with a test set of size N. For leave-one-out cross validation, the test set is simply the training set.

Figure 2: An example where the best upper bound of learning box #2 eliminates learning boxes #1 and #5. The size of ε varies since each learning box has its own upper bound on its error range, B.

At each point in the algorithm, we randomly select a point from the test set. We compute the error at that point for all learning boxes, and update each learning box's estimate of its own total error rate. In addition, we use Hoeffding's bound to calculate how close the current estimate is to the true error for each learning box. We then eliminate those learning boxes whose best possible error (their lower bound) is still greater than the worst error of the best learning box (its upper bound); see Figure 2. The intervals get smaller as more points are tested, thereby "racing" the good learning boxes and eliminating the bad ones.

We repeat the algorithm until we are left with just one learning box, or until we run out of points. The algorithm can also be stopped once ε has reached a certain threshold.
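The elimination loop just described can be sketched as follows. This is our own minimal rendition, not the authors' implementation: each learning box is reduced to an error function, and for simplicity a single error bound B is shared by all boxes (the paper allows a separate B per box):

```python
import math
import random

def hoeffding_race(error_fns, test_points, B, delta):
    """Race a set of learning boxes (error_fns) over test_points.
    Boxes whose best possible error (mean - eps) exceeds the worst
    error of the best box (mean + eps) are eliminated."""
    alive = {i: [0.0, 0] for i in range(len(error_fns))}  # i -> [error sum, count]
    points = list(test_points)
    random.shuffle(points)                 # points are drawn in random order
    for x in points:
        if len(alive) == 1:
            break                          # a single winner remains
        for i, stats in alive.items():     # test every surviving box on x
            stats[0] += error_fns[i](x)
            stats[1] += 1
        n = next(iter(alive.values()))[1]  # all survivors share the same count
        eps = math.sqrt(B * B * math.log(2.0 / delta) / (2.0 * n))
        means = {i: s / c for i, (s, c) in alive.items()}
        best_upper = min(means.values()) + eps
        alive = {i: alive[i] for i in alive if means[i] - eps <= best_upper}
    return sorted(alive)

# Toy race: box 0 has error ~0.1, box 1 ~0.9; box 1 should be eliminated.
fns = [lambda x: 0.1, lambda x: 0.9]
print(hoeffding_race(fns, range(500), B=1.0, delta=0.01))
```

When the boxes are indistinguishable, no elimination ever fires and all of them survive the race, which matches the behavior reported for the POOL problem below.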
The algorithm returns a set of learning boxes whose error rates are insignificantly (to within ε) different after N test points.

3 Proof of Correctness

The careful reader will have noticed that the confidence δ given in the previous section is incorrect. In order to prove that the algorithm indeed returns a set of learning boxes which includes the best one, we'll need a more rigorous approach. We denote by Δ the probability that the algorithm eliminates what would have been the best learning box. The difference between Δ and δ which was glossed over in the previous section is that 1 - Δ is the confidence for the success of the entire algorithm, while 1 - δ is the confidence in Hoeffding's bound for one learning box during one iteration of the algorithm.

We would like to make a formal connection between Δ and δ. In order to do that, let us make the requirement of a correct algorithm more stringent. We'll say that the algorithm is correct if every learning box is within ε of its true error at every iteration of the algorithm. This requirement encompasses the weaker requirement that we don't eliminate the best learning box. An algorithm is correct with confidence Δ if Pr{all learning boxes are within ε on all iterations} ≥ 1 - Δ.

We'll now derive the relationship between δ and Δ by using the disjunctive probability inequality, which states that Pr{A ∨ B} ≤ Pr{A} + Pr{B}. Let's assume that we have n iterations (we have n points in our test set), and that we have m learning boxes (LB_1 ... LB_m).
By Hoeffding's inequality, we know that

    Pr{a particular LB is within ε on a particular iteration} ≥ 1 - δ

Flipping that around we get:

    Pr{a particular LB is wrong on a particular iteration} < δ

Using the disjunctive inequality we can say

    Pr{a particular LB is wrong on iteration 1 ∨
       a particular LB is wrong on iteration 2 ∨ ... ∨
       a particular LB is wrong on iteration n} ≤ δ · n

Let's rewrite this as:

    Pr{a particular LB is wrong on any iteration} ≤ δ · n

Now we do the same thing for all learning boxes:

    Pr{LB_1 is wrong on any iteration ∨
       LB_2 is wrong on any iteration ∨ ... ∨
       LB_m is wrong on any iteration} ≤ δ · n · m

or in other words:

    Pr{some LB is wrong in some iteration} ≤ δ · n · m

We flip this to get:

    Pr{all LBs are within ε on all iterations} ≥ 1 - δ · n · m

which is exactly what we meant by a correct algorithm with some confidence. Therefore, δ = Δ/(n · m). When we plug this into our expression for ε from the previous section, we find that we have only increased it by a constant factor. In other words, by pumping up ε, we have managed to ensure the correctness of this algorithm with confidence Δ. The new ε is expressed as:

    ε = sqrt( B² (log(2nm) - log(Δ)) / (2n) )

Table 1: Test problems

ROBOT: 10 input attributes, 5 outputs. Given an initial and a final description of a robot arm, learn the control needed in order to make the robot perform devil-sticking (Schaal and Atkeson, 1993).
PROTEIN: 3 inputs, output is a classification into one of three classes. This is the famous protein secondary structure database, with some preprocessing (Zhang et al., 1992).
ENERGY: Given solar radiation sensing, predict the cooling load for a building. This is taken from the Building Energy Predictor Shootout.
POWER: Market data for electricity generation pricing period class for the new United Kingdom Power Market.
POOL: The visually perceived mapping from pool table configurations to shot outcome for two-ball collisions (Moore, 1992).
DISCONT: An artificially constructed set of points with many discontinuities. Local models should outperform global ones.

Clearly this is an extremely pessimistic bound, and tighter proofs are possible (Omohundro, 1993).

4 Results

We ran Hoeffding Races on a wide variety of learning and prediction problems. Table 1 describes the problems, and Table 2 summarizes the results and compares them to brute force search.

For Table 2, all of the experiments were run using Δ = .01. The initial set of possible models was constructed from various memory-based algorithms: combinations of different numbers of nearest neighbors, different smoothing kernels, and locally constant vs. locally weighted regression. We compare the algorithms relative to the number of queries made, where a query is one learning box finding its error at one point. The brute force method makes |TEST| x |LEARNING BOXES| queries. Hoeffding Races eliminates bad learning boxes quickly, so it should make fewer queries.

5 Discussion

Hoeffding Races never does worse than brute force. It is least effective when all models perform equally well. For example, in the POOL problem, where there were 75 learning boxes left at the end of the race, the number of queries is only slightly smaller for Hoeffding Races than for brute force.
In the ROBOT problem, where there were only 6 learning boxes left, a significant reduction in the number of queries can be seen. Therefore, Hoeffding Races is most effective when there exists a subset of clear winners within the initial set of models. We can then search over a very broad set of models without much concern about the computational expense of a large initial set.

Table 2: Results of Brute Force vs. Hoeffding Races.

Problem  | points | initial # learning boxes | queries with Brute Force | queries with Hoeffding Races | learning boxes left
ROBOT    |   972  |  95 |  92340 |  15637 |  6
PROTEIN  |  4965  |  95 | 471675 | 349405 | 60
ENERGY   |  2444  | 189 | 461916 | 121400 | 40
POWER    |   210  |  95 |  19950 |  13119 | 48
POOL     |   259  |  95 |  24605 |  22095 | 75
DISCONT  |   500  |  95 |  47500 |  25144 | 29

Figure 3: The x-axis is the size of a set of initial learning boxes (chosen randomly) and the y-axis is the number of queries to find a good model for the ROBOT problem. The bottom line shows performance by the Hoeffding Race algorithm, and the top line by brute force.

Figure 3 demonstrates this. In all the cases we have tested, the learning box chosen by brute force is also contained in the set returned by Hoeffding Races. Therefore, there is no loss of performance accuracy.

The results described here show the performance improvement with relatively small problems. Preliminary results indicate that performance improvements will increase as the problems scale up.
In other words, as the number of test points and the number of learning boxes increase, the ratio of the number of queries made by brute force to the number of queries made by Hoeffding Races becomes larger. However, the cost of each query then becomes the main computational expense.

Acknowledgements

Thanks go to Chris Atkeson, Marina Meila, Greg Galperin, Holly Yanco, and Stephen Omohundro for helpful and stimulating discussions.

References

[Atkeson and Reinkensmeyer, 1989] C. G. Atkeson and D. J. Reinkensmeyer. Using associative content-addressable memories to control robots. In W. T. Miller, R. S. Sutton, and P. J. Werbos, editors, Neural Networks for Control. MIT Press, 1989.

[Greiner and Jurisica, 1992] R. Greiner and I. Jurisica. A statistical approach to solving the EBL utility problem. In Proceedings of the Tenth National Conference on Artificial Intelligence (AAAI-92). MIT Press, 1992.

[Haussler, 1992] D. Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100:78-150, 1992.

[Hoeffding, 1963] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:13-30, 1963.

[Kaelbling, 1990] L. P. Kaelbling. Learning in Embedded Systems. PhD thesis; Technical Report No. TR-90-04, Stanford University, Department of Computer Science, June 1990.

[Moore et al., 1992] A. W. Moore, D. J. Hill, and M. P. Johnson. An empirical investigation of brute force to choose features, smoothers and function approximators. In S. Hanson, S. Judd, and T. Petsche, editors, Computational Learning Theory and Natural Learning Systems, Volume 9.
MIT Press, 1992.

[Moore, 1992] A. W. Moore. Fast, robust adaptive control by learning only forward models. In J. E. Moody, S. J. Hanson, and R. P. Lippmann, editors, Advances in Neural Information Processing Systems 4. Morgan Kaufmann, April 1992.

[Omohundro, 1993] Stephen Omohundro. Private communication, 1993.

[Pollard, 1984] David Pollard. Convergence of Stochastic Processes. Springer-Verlag, 1984.

[Schaal and Atkeson, 1993] S. Schaal and C. G. Atkeson. Open loop stable control strategies for robot juggling. In Proceedings of the IEEE Conference on Robotics and Automation, May 1993.

[Stanfill and Waltz, 1986] C. Stanfill and D. Waltz. Toward memory-based reasoning. Communications of the ACM, 29(12):1213-1228, December 1986.

[Wahba and Wold, 1975] G. Wahba and S. Wold. A completely automatic French curve: Fitting spline functions by cross-validation. Communications in Statistics, 4(1), 1975.

[Zhang et al., 1992] X. Zhang, J. P. Mesirov, and D. L. Waltz. Hybrid system for protein secondary structure prediction. Journal of Molecular Biology, 225:1049-1063, 1992.
", "award": [], "sourceid": 841, "authors": [{"given_name": "Oded", "family_name": "Maron", "institution": null}, {"given_name": "Andrew", "family_name": "Moore", "institution": null}]}