{"title": "Incremental Learning and Selective Sampling via Parametric Optimization Framework for SVM", "book": "Advances in Neural Information Processing Systems", "page_first": 705, "page_last": 711, "abstract": null, "full_text": "Incremental Learning and  Selective \n\nSampling via Parametric  Optimization \n\nFramework for  SVM \n\nIBM T.  J.  Watson Research  Center \n\nIBM T.  J. Watson Research  Center \n\nKatya  Scheinberg \n\nkatyas@us.ibm.com \n\nShai Fine \n\nfshai@us.ibm.com \n\nAbstract \n\nWe propose a framework based on a parametric quadratic program(cid:173)\nming  (QP)  technique  to  solve  the  support vector  machine  (SVM) \ntraining problem.  This framework, can be specialized to obtain two \nSVM  optimization  methods.  The first  solves  the  fixed  bias  prob(cid:173)\nlem,  while  the  second  starts  with  an  optimal  solution  for  a  fixed \nbias problem and adjusts the bias until the optimal value is found. \nThe later method can be applied in conjunction with any other ex(cid:173)\nisting technique which obtains a fixed  bias solution.  Moreover, the \nsecond  method  can  also  be  used  independently  to  solve  the  com(cid:173)\nplete SVM training problem.  A combination of these two methods \nis  more  flexible  than  each  individual  method  and,  among  other \nthings,  produces an incremental algorithm which exactly  solve the \n1-Norm Soft  Margin  SVM  optimization problem.  Applying  Selec(cid:173)\ntive  Sampling techniques may further  boost convergence. \n\n1 \n\nIntroduction \n\nSVM  training  is  a  convex optimization problem which  scales  with  the training  set \nsize rather than the input dimension.  While this is usually considered to be a desired \nquality,  in  large  scale  problems  it  may  cause  training  to  be  impractical. \nThe \ncommon way to handle massive data applications is  to turn to active set methods, \nwhich  gradually  build  the  set  of active  constraints  by  feeding  a  generic  optimizer \nwith  small  scale  sub-problems.  Active  set  methods  guarantee  to  converge  to  the \nglobal solut ion,  however,  convergence  may  be  very  slow,  it  may require too  many \npasses  over  the  data set,  and  at  each  iteration  there's  an  implicit  computational \noverhead of the  actual  active  set  selection.  By  using  some  heuristics  and  caching \nmechanisms, one can,  in  practice, reduce this load significantly. \n\nAnother  common  practice  is  to  modify  the  SVM  optimization  problem  such  that \nit  wont  handle the bias  term  directly.  Instead, the bias is  either fixed  in  advance! \n(e.g. \n[4]).  The \nadvantage  is  that  the  resulting  dual  optimization  problem  does  not  contain  the \nlinear  constraint,  in  which  case  one  can  suggest  a  procedure  which  updates  only \n\n[6])  or  added  as  another  dimension  to  the  feature  space  (e.g. \n\nIThroughout this sequel we  will  refer  to such solution  as  the fixed  bias  solution. \n\n\fone Lagrange multiplier at a time.  Thus, an incremental approach, which efficiently \nupdates  an  existing  solution  given  a  new  training  point,  can  be  devised.  Though \nwidely  used,  the  solution  resulting  from  this  practice  has  inferior  generalization \nperformances and the number of SY tends to be much higher  [4]. 
To the best of our knowledge, the only incremental algorithm suggested so far to exactly solve the 1-Norm Soft Margin^2 optimization problem is the one described by Cauwenberghs and Poggio in [3]. This algorithm handles adiabatic increments by solving a system of linear equations that results from a parametric transcription of the KKT conditions. This approach is somewhat close to the one independently developed here, and we offer a more thorough comparison in the discussion section.

In this paper^3 we introduce two new methods derived from parametric QP techniques. The two methods are based on the same framework, which we call Parametric Optimization for Kernel methods (POKER), and are essentially the same methodology applied to somewhat different problems. The first method solves the fixed bias problem, while the second starts with an optimal solution for a fixed bias problem and adjusts the bias until the optimal value is found. Each of these methods can be used independently to solve the SVM training problem. The most interesting application, however, is alternating between the two methods to obtain a unique incremental algorithm. We will show how, using this approach, we can adjust the optimal solution as more data becomes available, and how applying Selective Sampling techniques may further boost the convergence rate.

Both our methods converge after a finite number of iterations. In principle, this number may be exponential in the training set size, n. However, since parametric QP methods are based on the well-known Simplex method for linear programming, a similar behavior is expected: though in theory the Simplex method is known to have exponential complexity, in practice it hardly ever displays exponential behavior. The per-iteration complexity is expected to be O(nl), where l is the number of active points at that iteration, with the exception of some rare cases in which the complexity is expected to be bounded by O(nl²).

2 Parametric QP for SVM

Any optimal solution to the 1-Norm Soft Margin SVM optimization problem must satisfy the Karush-Kuhn-Tucker (KKT) necessary and sufficient conditions:

  1.  α_i s_i = 0,            i = 1, ..., n
  2.  (c − α_i) ξ_i = 0,      i = 1, ..., n
  3.  y^T α = 0,                                        (1)
  4.  −Qα + by + s − ξ = −e,
  5.  0 ≤ α ≤ c,  s ≥ 0,  ξ ≥ 0,

where α ∈ R^n is the vector of Lagrange multipliers, b is the bias (a scalar), and s and ξ are the n-dimensional vectors of slack and surplus variables, respectively. y is the vector of labels, ±1, Q is the label-encoded kernel matrix, i.e. Q_ij = y_i y_j K(x_i, x_j), e is the vector of all 1's of length n, and c is the penalty associated with errors.

^2 A different incremental approach stems from a geometric interpretation of the primal problem: Keerthi et al. [7] were the first to suggest a nearest point batch algorithm, and Kowalczyk [8] provided the on-line version. They handled the inseparable case with the well-known transformation w → (w, √c ξ) and b → b, which establishes the equivalence between the Hard Margin and the 2-Norm Soft Margin optimization problems. Although the 1-Norm and the 2-Norm have been shown to yield equivalent generalization properties, it is often observed (cf. [7]) that the former method results in a smaller number of SVs. It is obvious from the above transformation that the 1-Norm Soft Margin is the most general SVM optimization problem.

^3 The detailed statements of the algorithms and the supporting lemmas were omitted due to space limitations, and can be found in [5].
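As a concrete reading of system (1), the sketch below (ours, with an explicit numerical tolerance as an assumption) tests whether a candidate (α, b, s, ξ) satisfies the KKT conditions:

```python
import numpy as np

# Check the KKT system (1) for the 1-Norm Soft Margin SVM.
# Q is the label-encoded kernel matrix, Q[i, j] = y_i y_j K(x_i, x_j).
# Returns True iff (alpha, b, s, xi) is optimal up to tol.
def kkt_satisfied(Q, y, alpha, b, s, xi, c, tol=1e-8):
    e = np.ones_like(alpha)
    return (
        np.all(np.abs(alpha * s) <= tol)                 # condition 1
        and np.all(np.abs((c - alpha) * xi) <= tol)      # condition 2
        and abs(y @ alpha) <= tol                        # condition 3
        and np.allclose(-Q @ alpha + b * y + s - xi, -e, atol=tol)  # condition 4
        and np.all(alpha >= -tol) and np.all(alpha <= c + tol)      # condition 5
        and np.all(s >= -tol) and np.all(xi >= -tol)
    )
```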
If we assume that the value of the bias is fixed to some predefined value b, then condition 3 disappears from the system (1) and condition 4 becomes

  −Qα + s − ξ = −e − by.                                (2)

Consider the following modified parametric system of KKT conditions

  α_i s_i = 0,              i = 1, ..., n
  (c − α_i) ξ_i = 0,        i = 1, ..., n
  −Qα + s − ξ = p + u(−e − yb − p),                     (3)
  0 ≤ α ≤ c,  s ≥ 0,  ξ ≥ 0,

for some vector p. It is easy to find p, α, s and ξ satisfying (3) for u = 0: for example, one may pick α = 0, s = e, ξ = 0 and p = −Qα + s. For u = 1 the system (3) reduces to the fixed bias system. Our fixed bias method starts at a solution to (3) for u = 0 and, by increasing u while updating α, s and ξ so that they satisfy (3), obtains the optimal solution for u = 1.

Similarly, we can obtain a solution to (1) by starting at a fixed bias solution and updating b, while maintaining α, s and ξ feasible for (2), until the optimal value of b is reached. The optimal value of the bias is recognized when the corresponding solution satisfies (1), namely α^T y = 0.

Both these methods are based on the same framework of adjusting a scalar parameter in the right hand side of a KKT system. In the next section we present the method for adjusting the bias (adjusting u in (3) is very similar, save for a few technical differences). An advantage of this special case is that it solves the original problem and can, in principle, be applied \"from scratch\".

3 Correcting a \"Fixed Bias\" Solution

Let (α(b), s(b), ξ(b)) be a fixed bias solution for a given b. The algorithm we present here is based on increasing (or decreasing) b monotonically, until the optimal b* is found, while updating and maintaining (α(b), s(b), ξ(b)).

Let us introduce some notation. For a given b and a fixed bias solution (α(b), s(b), ξ(b)), we partition the index set I = {1, ..., n} into three sets I_0(b), I_c(b) and I_s(b) in the following way: ∀i ∈ I_0(b): s_i(b) > 0 and α_i(b) = 0; ∀i ∈ I_c(b): ξ_i(b) > 0 and α_i(b) = c; and ∀i ∈ I_s(b): s_i(b) = ξ_i(b) = 0 and 0 ≤ α_i(b) ≤ c. It is easy to see that I_0(b) ∪ I_c(b) ∪ I_s(b) = I and I_0(b) ∩ I_c(b) = I_c(b) ∩ I_s(b) = I_0(b) ∩ I_s(b) = ∅. We call the partition (I_0(b), I_c(b), I_s(b)) the optimal partition for a given b, and we refer to I_s as the active set. Based on the partition (I_0, I_c, I_s) we define Q_ss (Q_cs, Q_sc, Q_cc, Q_0s, Q_00) as the submatrix of Q whose columns are the columns of Q indexed by the set I_s (I_c, I_s, I_c, I_0, I_0) and whose rows are the rows of Q indexed by I_s (I_s, I_c, I_c, I_s, I_0). We also define y_s (y_c, y_0) and α_s (α_c, α_0) as the subvectors of y and α whose entries are indexed by I_s (I_c, I_0). By e_s (e_c) we denote a vector of all ones of the appropriate size.
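The partition is mechanical to extract from a fixed-bias solution; a short sketch of ours follows, with a tolerance added to account for floating point arithmetic:

```python
import numpy as np

# Split indices into (I_0, I_c, I_s) as defined above:
# I_0: inactive points (s_i > 0, alpha_i = 0);
# I_c: points at the upper bound (xi_i > 0, alpha_i = c);
# I_s: the active set (s_i = xi_i = 0, 0 <= alpha_i <= c).
def optimal_partition(alpha, s, xi, c, tol=1e-8):
    idx = np.arange(len(alpha))
    I0 = idx[(s > tol) & (np.abs(alpha) <= tol)]
    Ic = idx[(xi > tol) & (np.abs(alpha - c) <= tol)]
    Is = np.setdiff1d(idx, np.union1d(I0, Ic))  # everything else is active
    return I0, Ic, Is
```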
Assume that we are given an initial guess^4 b⁰ < b*. To initiate the algorithm we assume that we know the optimal partition (I_0⁰, I_c⁰, I_s⁰) = (I_0(b⁰), I_c(b⁰), I_s(b⁰)) that corresponds to α⁰ = α(b⁰). We know that ∀i ∈ I_0: α_i = 0 and ∀i ∈ I_c: α_i = c. We also know that −Q_i α + y_i b = −1, ∀i ∈ I_s (here Q_i is the i-th row of Q). We can therefore write the set of active constraints as

  −Q_ss α_s − c Q_cs e_c + b y_s = −e_s.                (4)

If Q_ss is nonsingular (the nondegenerate case), then α_s depends linearly on the scalar b. Similarly, we can express s_0 and ξ_c as linear functions of b. If Q_ss is singular (the degenerate case), then the set of all possible solutions α_s changes linearly with b as long as the partition remains optimal. In either case, if 0 < α_s < c, s_0 > 0 and ξ_c > 0, then sufficiently small changes in b preserve these constraints. At each iteration b can increase until one of the four types of inequality constraints becomes active. Then the optimal partition is updated, new linear expressions of the active variables in terms of b are computed, and the algorithm iterates. We terminate when y^T α < 0, that is, when b > b*. The final iteration gives us the correct optimal active set and optimal partition; from these we can easily compute b* and α*.

A geometric interpretation of the algorithmic steps is that we are trying to move the separating hyperplane by increasing its bias while at the same time adjusting its orientation so that it stays optimal for the current bias. At each iteration we move the hyperplane until either a support vector is dropped from the support set, a support vector becomes violated, a violated point becomes a support vector, or an inactive point joins the support vector set.

The algorithm is guaranteed to terminate after finitely many iterations. At each iteration the algorithm covers an interval that corresponds to an optimal partition. The same partition cannot correspond to two different intervals, and the number of partitions is finite, hence so is the number of iterations (cf. [1, 9]). Per-iteration complexity depends on whether an iteration is degenerate or not. A nondegenerate iteration takes O(n|I_s|) + O(|I_s|³) arithmetic operations, while a degenerate iteration should in theory take O(n²|I_s|²) operations, but in practice it only takes^5 O(n|I_s|²). Note that degeneracy occurs when the active support vectors are linearly dependent; the larger the rank of the kernel matrix, the less likely such a situation is. The storage requirement of the algorithm is O(n) + O(|I_s|²).

^4 Whether b⁰ < b* can be determined by evaluating −y^T α(b⁰): if −y^T α(b⁰) > 0 then b⁰ < b*; otherwise b⁰ > b*, in which case the algorithm is essentially the same, save for obvious changes.

^5 This assumes solving such a problem by an interior point method.

4 Incremental Algorithm

Incremental and on-line algorithms are aimed at training problems for which the data becomes available in the course of training. Such an algorithm, when given an optimal solution for a training set of size n and m additional training points, has to efficiently find the optimal solution for the extended n + m training set.
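To illustrate the mechanics of a nondegenerate iteration of the adjustable bias method, here is a sketch (our paraphrase, not the paper's code): from (4), α_s is affine in b with direction Q_ss⁻¹ y_s, and a ratio test bounds how far b may increase before a box constraint on α_s becomes active. A full implementation would also test the s_0 > 0 and ξ_c > 0 constraints and update the partition at the blocking index.

```python
import numpy as np

# Largest increase of b that keeps 0 <= alpha_s <= c (nondegenerate case).
# From the active constraints (4), d(alpha_s)/db = Qss^{-1} y_s, so each
# coordinate caps the step through a ratio test.
def bias_step(Qss, ys, alpha_s, c, tol=1e-12):
    d = np.linalg.solve(Qss, ys)       # direction of alpha_s w.r.t. b
    steps = []
    for a_i, d_i in zip(alpha_s, d):
        if d_i > tol:                  # alpha_i grows: blocked at c
            steps.append((c - a_i) / d_i)
        elif d_i < -tol:               # alpha_i shrinks: blocked at 0
            steps.append(a_i / -d_i)
    delta_b = min(steps) if steps else np.inf
    return delta_b, d                  # caller updates b, alpha_s, partition
```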
Assume we have an optimal solution (α, b, s, ξ) for a given data set X of size n. For each new point that is added, we take the following actions: a new Lagrange multiplier α_{n+1} = 0 is added to the set of multipliers, and the distance to the margin is evaluated for this point. If the point is not violated, that is, if s_{n+1} = w^T x^{n+1} − y^{n+1} b − 1 > 0, then the new positive slack s_{n+1} is added to the set of slack variables. If the point is violated, then s_{n+1} = 1 is added to the set of slack variables. (Notice that at this point the condition w^T x^{n+1} + y^{n+1} b + s_{n+1} = −1 is violated.) A surplus variable ξ_{n+1} = 0 is also added to the set of surplus variables. The optimal partition is adjusted accordingly. The process is repeated for all the points that have to be added at the given step. If no violated points were encountered, then no further action is necessary: the current solution is optimal and the bias is unchanged. If at least one point is violated, then the new set (α, b, s, ξ) is not feasible for the KKT system (1) with the extended data set. However, it is easy to find p such that (α, b, s, ξ) is optimal for (3). Thus we can first apply the fixed bias algorithm to find a new solution, and then apply the adjustable bias algorithm to find the optimal solution to the new extended problem (see Figure 1).

Given a dataset <X, y>, a solution (α⁰, b⁰, s⁰, ξ⁰), and new points <x^{n+i}, y^{n+i}>, i = 1, ..., m:
0. Set p := −e − by, α_{n+i} := ξ_{n+i} := 0, s_{n+i} := (x^{n+i})^T w − y^{n+i} b − 1, i = 1, ..., m
1. If s_{n+i} ≤ 0, set p^{n+i} := −(x^{n+i})^T w + 1 and s_{n+i} := 1; else set p^{n+i} := −1 − b y^{n+i}
2. X := X ∪ {x^{n+1}, ..., x^{n+m}},  y := (y^1, ..., y^n, y^{n+1}, ..., y^{n+m})
3. If p ≠ −e − by, call POKERfixedbias(X, y, α, b, s, ξ, p)
4. Call POKERadjustbias(X, y, α, b, s, ξ)
5. If there are more data points, go to 0.

Figure 1: Outline of the incremental algorithm (AltPOKER)

In theory, adding even one point may force the algorithm to work as hard as if it were solving the problem \"from scratch\". In practice this virtually never happens: in our experiments, just a few iterations of the fixed bias and adjustable bias algorithms were sufficient to find the solution to the extended problem. Overall, the computational complexity of the incremental algorithm is expected to be O(n²).
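The loop of Figure 1 can be rendered compactly as below (our sketch; poker_fixed_bias and poker_adjust_bias stand in for the paper's two methods and are assumed to be supplied as black-box callables, as is their return signature):

```python
# Process one chunk of new points as in Figure 1 (illustrative sketch).
def alt_poker_chunk(X, y, alpha, b, s, xi, kernel, new_X, new_y,
                    poker_fixed_bias, poker_adjust_bias):
    p = [-1.0 - b * yi for yi in y]    # old points already satisfy the rhs
    for x_new, y_new in zip(new_X, new_y):
        # w^T x under the current expansion: sum_j alpha_j y_j K(x_j, x).
        wx = sum(a * yj * kernel(xj, x_new)
                 for a, yj, xj in zip(alpha, y, X))
        s_new = wx - y_new * b - 1.0   # slack of the new point
        if s_new <= 0.0:               # violated: record an infeasible rhs entry
            p.append(-wx + 1.0)
            s_new = 1.0
        else:                          # not violated: rhs entry matches the target
            p.append(-1.0 - b * y_new)
        X.append(x_new); y.append(y_new)
        alpha.append(0.0); s.append(s_new); xi.append(0.0)
    if any(pi != -1.0 - b * yi for pi, yi in zip(p, y)):
        alpha, b, s, xi = poker_fixed_bias(X, y, alpha, b, s, xi, p)
    return poker_adjust_bias(X, y, alpha, b, s, xi)
```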
5 Experiments

Convergence in Batch Mode: The most straightforward way to activate POKER in batch mode is to construct the trivial partition^6 and then apply the adjustable bias algorithm to get the optimal solution. We term this method SelfInit POKER. Note that the initial value of the bias is most likely far away from the global solution; as such, the results presented here should be regarded as a lower bound. We examined performance on a moderate size problem, the Abalone data set from the UCI Repository [2]. We fed the training algorithm with increasing subsets, up to the whole set (of size 4177). The gender encoding (male/female/infant) was mapped into {(1,0,0), (0,1,0), (0,0,1)}. Then the data was scaled to lie in the [−1, 1] interval. We demonstrate convergence for a polynomial kernel with increasing degree, which in this setting corresponds to the level of difficulty. However naive our implementation is, one can observe (see Figure 2) a linear convergence rate in the batch mode.

^6 Fixing the bias term to be large enough (positive or negative) and the Lagrange multipliers to 0 or c based on their class (negative/positive) membership.

Convergence in Incremental Mode: AltPOKER is the incremental algorithm described in Section 4. We examined its performance on the \"diabetes\" problem^7 that was used by Cauwenberghs and Poggio in [3] to test the performance of their algorithm. We demonstrate convergence for the RBF kernel with increasing penalty (\"C\"). Figure 3 demonstrates the advantage of the more flexible approach, which allows various increment sizes: using increments of only one point resulted in performance on a similar scale to that of Cauwenberghs and Poggio, but as the chunk size increases we observe a rapid improvement in the convergence rate.

^7 Available at http://bach.ece.jhu.edu/pub/gert/svm/incremental

[Figure 2: SelfInit POKER - Convergence in Batch mode (number of iterations vs. problem size, for polynomial kernels (<x,y>+1)^d of increasing degree). Figure 3: AltPOKER - Convergence in Incremental mode (number of iterations vs. chunk size, for C = 0.1, 1, 10, 25, 50, 75, 100).]

Selective Sampling: We can use the incremental algorithm, even when all the data is available in advance, to improve the overall efficiency. If one can select a small, representative subset of the data set, then one can use it for training, hoping that the majority of the data points are classified correctly using the initially sampled data^8. We applied selective sampling as a preprocess in incremental mode: at each meta-iteration, we ranked the points according to a predefined selection criterion and then picked just the top ones for the increment.

^8 This is different from a full-fledged Active Learning scheme, in which the data is not labeled but rather queried at selected points.

The following selection criteria were used in our experiments. Cls2W picks the point closest to the current hyperplane. This approach is inspired by active learning schemes which strive to halve the version space. However, the notion of a version space is more complex when the problem is inseparable. Thus, it is reasonable to adopt a greedy approach which selects the point that will cause the largest change in the value of the objective function.

While solving the optimization problem for all possible increments is impracticable, it may still be worthwhile to approximate the potential change. MaxSlk picks the most violating point. This corresponds to an upper bound estimate of the change in the objective, since the value of the slack (times c) is an upper bound on the feasibility gap. dObj performs only a few iterations of the adjustable bias algorithm and examines the change in the objective value. This is similar to the Strong Branching technique used in branch and bound methods for integer programming. Here it provides a lower bound estimate of the change in the objective value.

Although performing only a few iterations is much cheaper than converging to the optimal solution, this technique is still more demanding than the previous selection methods. Hence we first ranked the points using Cls2W (MaxSlk) and then applied dObj only to the top few. Table 1 presents the application of the above mentioned criteria to three different problems. The results clearly show the advantage of using the information obtained by the dObj estimate.
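For concreteness, the two cheap ranking criteria can be scored as follows (a sketch of ours; f_vals, the decision values of the candidate points under the current solution, and the scoring details are illustrative assumptions):

```python
import numpy as np

# Rank candidate points by a selection criterion (illustrative sketch).
# f_vals: decision values f(x) of the candidates under the current
# hyperplane; y: their labels in {-1, +1}. Returns indices, best first.
def rank_candidates(f_vals, y, criterion='MaxSlk'):
    margins = y * f_vals                 # >= 1 means the point is satisfied
    if criterion == 'Cls2W':             # closest point to the hyperplane
        scores = -np.abs(f_vals)
    elif criterion == 'MaxSlk':          # most violating point first
        scores = 1.0 - margins           # slack estimate; larger is worse
    else:
        raise ValueError(criterion)
    return np.argsort(-scores)           # descending by score
```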
                          Synthetic (10Kx2)      \"ionosphere\"         \"diabetes\"
  α*, |I_s|, |I_c|, |I_0|   400,  4, 11, 9985      8, 73,  1, 277       40, 20, 313, 243

  Selection Criteria        No. of iterations
  No Selection              234                    871                  3078
  MaxSlk                    112                    303                  3860
  MaxSlk+dObj                92                    269                  3184
  Cls2W                     128                    433                  2576
  Cls2W+dObj                116                    407                  2218

Table 1: The impact of Selective Sampling on the number of iterations of AltPOKER: synthetic data (10Kx2), \"ionosphere\" [2] and \"diabetes\" (columns ordered respectively). The top row gives, for each problem, α* and the sizes of the final partition sets |I_s|, |I_c| and |I_0|.

6 Conclusions and Discussion

We propose a new finitely convergent method that can be applied in both batch and incremental modes to solve the 1-Norm Soft Margin SVM problem. Assuming that the number of support vectors is small compared to the size of the data, the method is expected to perform O(n²) arithmetic operations, where n is the problem size. Applying Selective Sampling techniques may further boost convergence and reduce the computational load.

Our method was developed independently, but is somewhat similar to that of [3]. Our method, however, is more general: it can be applied to solve fixed bias problems as well as to obtain the optimal bias from a given fixed bias solution; it is not restricted to increments of size one, but can handle increments of arbitrary size; and it can be used to obtain an estimate of the drop in the value of the objective function, which is a useful selective sampling criterion.

Finally, it is possible to extend this method to produce a true on-line algorithm by assuming certain properties of the data. This re-introduces some very important applications of the on-line technology, such as active learning and various forms of adaptation. Pursuing this direction, with a special emphasis on massive data applications (e.g. speech related applications), is left for further study.
References

[1] A. B. Berkelaar, B. Jansen, K. Roos, and T. Terlaky. Sensitivity analysis in (degenerate) quadratic programming. Technical Report 96-26, Delft University, 1996.

[2] C. L. Blake and C. J. Merz. UCI repository of machine learning databases, 1998.

[3] G. Cauwenberghs and T. Poggio. Incremental and decremental support vector machine learning. In Advances in Neural Information Processing Systems 13, pages 409-415, 2001.

[4] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, 2000.

[5] S. Fine and K. Scheinberg. POKER: Parametric optimization framework for kernel methods. Technical report, IBM T. J. Watson Research Center, 2001. Submitted.

[6] T. T. Friess, N. Cristianini, and C. Campbell. The kernel-adatron algorithm: A fast and simple learning procedure for SVM. In Proc. of the 15th ICML, pages 188-196, 1998.

[7] S. S. Keerthi, S. K. Shevade, C. Bhattacharyya, and K. R. K. Murthy. A fast iterative nearest point algorithm for SVM classifier design. IEEE Trans. Neural Networks, 11:124-136, 2000.

[8] A. Kowalczyk. Maximal margin perceptron. In Advances in Large Margin Classifiers, pages 75-113. MIT Press, 2000.

[9] R. T. Rockafellar. Conjugate Duality and Optimization. SIAM, Philadelphia, 1974.
", "award": [], "sourceid": 1978, "authors": [{"given_name": "Shai", "family_name": "Fine", "institution": null}, {"given_name": "Katya", "family_name": "Scheinberg", "institution": null}]}