{"title": "Using Analytic QP and Sparseness to Speed Training of Support Vector Machines", "book": "Advances in Neural Information Processing Systems", "page_first": 557, "page_last": 563, "abstract": null, "full_text": "Using Analytic QP and Sparseness to Speed Training of Support Vector Machines\n\nJohn C. Platt\nMicrosoft Research\n1 Microsoft Way\nRedmond, WA 98052\njplatt@microsoft.com\n\nAbstract\n\nTraining a Support Vector Machine (SVM) requires the solution of a very large quadratic programming (QP) problem. This paper proposes an algorithm for training SVMs: Sequential Minimal Optimization, or SMO. SMO breaks the large QP problem into a series of smallest possible QP problems which are analytically solvable. Thus, SMO does not require a numerical QP library. SMO's computation time is dominated by evaluation of the kernel, hence kernel optimizations substantially quicken SMO. For the MNIST database, SMO is 1.7 times as fast as projected conjugate gradient (PCG) chunking, while for the UCI Adult database and linear SVMs, SMO can be 1500 times faster than the PCG chunking algorithm.\n\n1  INTRODUCTION\n\nIn the last few years, there has been a surge of interest in Support Vector Machines (SVMs) [1]. SVMs have empirically been shown to give good generalization performance on a wide variety of problems. However, the use of SVMs is still limited to a small group of researchers. One possible reason is that training algorithms for SVMs are slow, especially for large problems. Another explanation is that SVM training algorithms are complex, subtle, and sometimes difficult to implement. This paper describes a new SVM learning algorithm that is easy to implement, often faster, and has better scaling properties than the standard SVM training algorithm. The new SVM learning algorithm is called Sequential Minimal Optimization (or SMO). 
\n\n1.1  OVERVIEW OF SUPPORT VECTOR MACHINES\n\nA general non-linear SVM can be expressed as\n\n$$u = \\sum_i \\alpha_i y_i K(\\vec{x}_i, \\vec{x}) - b \\tag{1}$$\n\nwhere $u$ is the output of the SVM, $K$ is a kernel function which measures the similarity of a stored training example $\\vec{x}_i$ to the input $\\vec{x}$, $y_i \\in \\{-1, +1\\}$ is the desired output of the classifier, $b$ is a threshold, and $\\alpha_i$ are weights which blend the different kernels [1]. For linear SVMs, the kernel function $K$ is linear, hence equation (1) can be expressed as\n\n$$u = \\vec{w} \\cdot \\vec{x} - b \\tag{2}$$\n\nwhere $\\vec{w} = \\sum_i \\alpha_i y_i \\vec{x}_i$.\n\nTraining of an SVM consists of finding the $\\alpha_i$. The training is expressed as a minimization of a dual quadratic form:\n\n$$\\min_{\\vec{\\alpha}} \\Psi(\\vec{\\alpha}) = \\min_{\\vec{\\alpha}} \\frac{1}{2} \\sum_{i=1}^{N} \\sum_{j=1}^{N} y_i y_j K(\\vec{x}_i, \\vec{x}_j) \\alpha_i \\alpha_j - \\sum_{i=1}^{N} \\alpha_i \\tag{3}$$\n\nsubject to box constraints,\n\n$$0 \\le \\alpha_i \\le C, \\quad \\forall i, \\tag{4}$$\n\nand one linear equality constraint\n\n$$\\sum_{i=1}^{N} y_i \\alpha_i = 0. \\tag{5}$$\n\nThe $\\alpha_i$ are Lagrange multipliers of a primal quadratic programming (QP) problem: there is a one-to-one correspondence between each $\\alpha_i$ and each training example $\\vec{x}_i$.\n\nEquations (3-5) form a QP problem that the SMO algorithm will solve. The SMO algorithm will terminate when all of the Karush-Kuhn-Tucker (KKT) optimality conditions of the QP problem are fulfilled. These KKT conditions are particularly simple:\n\n$$\\alpha_i = 0 \\Rightarrow y_i u_i \\ge 1, \\qquad 0 < \\alpha_i < C \\Rightarrow y_i u_i = 1, \\qquad \\alpha_i = C \\Rightarrow y_i u_i \\le 1, \\tag{6}$$\n\nwhere $u_i$ is the output of the SVM for the $i$th training example.\n\n1.2  PREVIOUS METHODS FOR TRAINING SUPPORT VECTOR MACHINES\n\nDue to its immense size, the QP problem that arises from SVMs cannot be easily solved via standard QP techniques. The quadratic form in (3) involves a Hessian matrix of dimension equal to the number of training examples. This matrix cannot fit into 128 megabytes if there are more than 4000 training examples. 
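As a rough check of the memory figure above, the following sketch estimates the storage needed for a dense Hessian. It assumes 8-byte double-precision entries, full (non-symmetric) storage, and decimal megabytes; none of these details are stated in the text, and the function name hessian_bytes is hypothetical.

```python
def hessian_bytes(n_examples, bytes_per_entry=8):
    '''Bytes needed for a dense n-by-n Hessian (assumed 8-byte entries).'''
    return n_examples * n_examples * bytes_per_entry


# 4000 training examples give 16 million entries.
print(hessian_bytes(4000) / 1e6, 'megabytes')
```

Under these assumptions, 4000 examples already require 128 million bytes, so any larger training set exceeds the 128 megabyte budget.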
\n\nVapnik [9] describes a method to solve the SVM QP, which has since been known as \"chunking.\" Chunking relies on the fact that removing training examples with $\\alpha_i = 0$ does not change the solution. Chunking thus breaks down the large QP problem into a series of smaller QP sub-problems, whose object is to identify the training examples with non-zero $\\alpha_i$. Every QP sub-problem updates the subset of the $\\alpha_i$ that are associated with the sub-problem, while leaving the rest of the $\\alpha_i$ unchanged. The QP sub-problem consists of every non-zero $\\alpha_i$ from the previous sub-problem combined with the $M$ worst examples that violate the KKT conditions (6), for some $M$ [1]. At the last step, the entire set of non-zero $\\alpha_i$ has been identified, hence the last step solves the entire QP problem.\n\nChunking reduces the dimension of the matrix from the number of training examples to approximately the number of non-zero $\\alpha_i$. If standard QP techniques are used, chunking cannot handle large-scale training problems, because even this reduced matrix cannot fit into memory. Kaufman [3] has described a QP algorithm that does not require the storage of the entire Hessian.\n\n[Figure 1 image: two panels, each showing the square box $0 \\le \\alpha_1, \\alpha_2 \\le C$ crossed by a diagonal line. Left panel: $y_1 \\ne y_2 \\Rightarrow \\alpha_1 - \\alpha_2 = k$. Right panel: $y_1 = y_2 \\Rightarrow \\alpha_1 + \\alpha_2 = k$.]\n\nFigure 1: The Lagrange multipliers $\\alpha_1$ and $\\alpha_2$ must fulfill all of the constraints of the full problem. The inequality constraints cause the Lagrange multipliers to lie in the box. The linear equality constraint causes them to lie on a diagonal line.\n\nThe decomposition technique [6] is similar to chunking: decomposition breaks the large QP problem into smaller QP sub-problems. However, Osuna et al. [6] suggest keeping a 
\n\nfixed-size matrix for every sub-problem, deleting some examples and adding others which violate the KKT conditions. Using a fixed-size matrix allows SVMs to be trained on very large training sets. Joachims [2] suggests adding and subtracting examples according to heuristics for rapid convergence. However, until SMO, decomposition required the use of a numerical QP library, which can be costly or slow.\n\n2  SEQUENTIAL MINIMAL OPTIMIZATION\n\nSequential Minimal Optimization quickly solves the SVM QP problem without using numerical QP optimization steps at all. SMO decomposes the overall QP problem into fixed-size QP sub-problems, similar to the decomposition method [7].\n\nUnlike previous methods, however, SMO chooses to solve the smallest possible optimization problem at each step. For the standard SVM, the smallest possible optimization problem involves two elements of $\\vec{\\alpha}$, because the $\\alpha_i$ must obey one linear equality constraint. At each step, SMO chooses two $\\alpha_i$ to jointly optimize, finds the optimal values for these $\\alpha_i$, and updates the SVM to reflect these new values.\n\nThe advantage of SMO lies in the fact that solving for two $\\alpha_i$ can be done analytically. Thus, numerical QP optimization is avoided entirely. The inner loop of the algorithm can be expressed in a short amount of C code, rather than invoking an entire QP library routine.\n\nBy avoiding numerical QP, the computation time is shifted from QP to kernel evaluation. Kernel evaluation time can be dramatically reduced in certain common situations, e.g., when a linear SVM is used, or when the input data is sparse (mostly zero). The result of kernel evaluations can also be cached in memory [1].\n\nThere are two components to SMO: an analytic method for solving for the two $\\alpha_i$, and a heuristic for choosing which multipliers to optimize. 
Pseudo-code for the SMO algorithm can be found in [8, 7], along with the relationship to other optimization and machine learning algorithms.\n\n2.1  SOLVING FOR TWO LAGRANGE MULTIPLIERS\n\nTo solve for the two Lagrange multipliers $\\alpha_1$ and $\\alpha_2$, SMO first computes the constraints on these multipliers and then solves for the constrained minimum. For convenience, all quantities that refer to the first multiplier will have a subscript 1, while all quantities that refer to the second multiplier will have a subscript 2. Because there are only two multipliers, the constraints can easily be displayed in two dimensions (see Figure 1). The constrained minimum of the objective function must lie on a diagonal line segment.\n\nThe ends of the diagonal line segment can be expressed quite simply in terms of $\\alpha_2$. Let $s = y_1 y_2$. The following bounds apply to $\\alpha_2$:\n\n$$L = \\max\\left(0,\\; \\alpha_2 + s\\alpha_1 - \\tfrac{1}{2}(s+1)C\\right), \\qquad H = \\min\\left(C,\\; \\alpha_2 + s\\alpha_1 - \\tfrac{1}{2}(s-1)C\\right). \\tag{7}$$\n\nUnder normal circumstances, the objective function is positive definite, and there is a minimum along the direction of the linear equality constraint. In this case, SMO computes the minimum along the direction of the linear equality constraint:\n\n$$\\alpha_2^{\\text{new}} = \\alpha_2 + \\frac{y_2 (E_1 - E_2)}{K(\\vec{x}_1, \\vec{x}_1) + K(\\vec{x}_2, \\vec{x}_2) - 2K(\\vec{x}_1, \\vec{x}_2)}, \\tag{8}$$\n\nwhere $E_i = u_i - y_i$ is the error on the $i$th training example. As a next step, the constrained minimum is found by clipping $\\alpha_2^{\\text{new}}$ into the interval $[L, H]$. The value of $\\alpha_1$ is then computed from the new, clipped, $\\alpha_2$:\n\n$$\\alpha_1^{\\text{new}} = \\alpha_1 + s\\left(\\alpha_2 - \\alpha_2^{\\text{new,clipped}}\\right). \\tag{9}$$\n\nFor both linear and non-linear SVMs, the threshold $b$ is re-computed after each step, so that the KKT conditions are fulfilled for both optimized examples. 
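The analytic step described by equations (7)-(9) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name smo_pair_step and its argument names are hypothetical, the kernel values and the errors E_1 and E_2 are assumed to be precomputed by the caller, and the degenerate cases (L equal to H, or a non-positive denominator in equation (8)) are simply skipped by returning the inputs unchanged. The threshold update is omitted.

```python
def smo_pair_step(a1, a2, y1, y2, e1, e2, k11, k12, k22, c):
    '''One analytic update of two multipliers, following equations (7)-(9).'''
    s = y1 * y2
    # Equation (7): ends of the feasible diagonal segment, in terms of a2.
    low = max(0.0, a2 + s * a1 - 0.5 * (s + 1) * c)
    high = min(c, a2 + s * a1 - 0.5 * (s - 1) * c)
    # Denominator of equation (8): curvature along the constraint line.
    eta = k11 + k22 - 2.0 * k12
    if low >= high or eta <= 0.0:
        # Assumption of this sketch: skip degenerate cases entirely.
        return a1, a2
    # Equation (8): unconstrained minimum along the constraint line.
    a2_new = a2 + y2 * (e1 - e2) / eta
    # Clip into the interval [L, H].
    a2_new = min(max(a2_new, low), high)
    # Equation (9): move a1 so the linear equality constraint still holds.
    a1_new = a1 + s * (a2 - a2_new)
    return a1_new, a2_new


# Two opposite-class examples, identity kernel matrix, all multipliers zero,
# b = 0, so u_i = 0 and E_i = u_i - y_i.
print(smo_pair_step(0.0, 0.0, 1, -1, -1.0, 1.0, 1.0, 0.0, 1.0, 1.0))
```

In this toy case both multipliers move together along the diagonal, so y_1 a_1 + y_2 a_2 stays at zero; with a smaller C the same call would be clipped at the upper end of the segment.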
\n\n2.2  HEURISTICS FOR CHOOSING WHICH MULTIPLIERS TO OPTIMIZE\n\nIn order to speed convergence, SMO uses heuristics to choose which two Lagrange multipliers to jointly optimize.\n\nThere are two separate choice heuristics: one for $\\alpha_1$ and one for $\\alpha_2$. The choice of $\\alpha_1$ provides the outer loop of the SMO algorithm. If an example is found to violate the KKT conditions by the outer loop, it is eligible for optimization. The outer loop alternates single passes through the entire training set with multiple passes through the non-bound $\\alpha_i$ ($\\alpha_i \\notin \\{0, C\\}$). The multiple passes terminate when all of the non-bound examples obey the KKT conditions within $\\epsilon$. The entire SMO algorithm terminates when the entire training set obeys the KKT conditions within $\\epsilon$. Typically, $\\epsilon = 10^{-3}$.\n\nThe first choice heuristic concentrates the CPU time on the examples that are most likely to violate the KKT conditions, i.e., the non-bound subset. As the SMO algorithm progresses, $\\alpha_i$ that are at the bounds are likely to stay at the bounds, while $\\alpha_i$ that are not at the bounds will move as other examples are optimized.\n\nAs a further optimization, SMO uses the shrinking heuristic proposed in [2]. After the pass through the entire training set, shrinking finds examples which fulfill the KKT conditions more than the worst example failed the KKT conditions. Further passes through the training set ignore these already-fulfilled examples until a final pass at the end of training, which ensures that every example fulfills its KKT condition.\n\nOnce an $\\alpha_1$ is chosen, SMO chooses an $\\alpha_2$ to maximize the size of the step taken during joint optimization. SMO approximates the step size by the absolute value of the numerator in equation (8): $|E_1 - E_2|$. SMO keeps a cached error value $E$ for every non-bound example in the training set and then chooses an error to approximately maximize the step size. 
If $E_1$ is positive, SMO chooses an example with minimum error $E_2$. If $E_1$ is negative, SMO chooses an example with maximum error $E_2$.\n\n2.3  KERNEL OPTIMIZATIONS\n\nBecause the computation time for SMO is dominated by kernel evaluations, SMO can be accelerated by optimizing these kernel evaluations. Utilizing sparse inputs is a generally applicable kernel optimization. For commonly-used kernels, equations (1) and (2) can be dramatically sped up by exploiting the sparseness of the input. For example, a Gaussian kernel can be expressed as an exponential of a linear combination of sparse dot products. Sparsely storing the training set also achieves substantial reduction in memory consumption.\n\nExperiment | Kernel | Sparse inputs used | Kernel caching used | Training set size | Number of support vectors | C | % sparse inputs\nAdultLin | Linear | Y | mix | 11221 | 4158 | 0.05 | 89\nAdultLinD | Linear | N | mix | 11221 | 4158 | 0.05 | 0\nWebLin | Linear | Y | mix | 49749 | 1723 | 1 | 96\nWebLinD | Linear | N | mix | 49749 | 1723 | 1 | 0\nAdultGaussK | Gaussian | Y | Y | 11221 | 4206 | 1 | 89\nAdultGauss | Gaussian | Y | N | 11221 | 4206 | 1 | 89\nAdultGaussKD | Gaussian | N | Y | 11221 | 4206 | 1 | 0\nAdultGaussD | Gaussian | N | N | 11221 | 4206 | 1 | 0\nWebGaussK | Gaussian | Y | Y | 49749 | 4484 | 5 | 96\nWebGauss | Gaussian | Y | N | 49749 | 4484 | 5 | 96\nWebGaussKD | Gaussian | N | Y | 49749 | 4484 | 5 | 0\nWebGaussD | Gaussian | N | N | 49749 | 4484 | 5 | 0\nMNIST | Polynom. | Y | N | 60000 | 3450 | 100 | 81\n\nTable 1: Parameters for various experiments.\n\nTo compute a linear SVM, only a single weight vector needs to be stored, rather than all of the training examples that correspond to non-zero $\\alpha_i$. 
If the QP sub-problem succeeds, the stored weight vector is updated to reflect the new $\\alpha_i$ values.\n\n3  BENCHMARKING SMO\n\nThe SMO algorithm is tested against the standard chunking algorithm and against the decomposition method on a series of benchmarks. Both SMO and chunking are written in C++, using Microsoft's Visual C++ 6.0 compiler. Joachims' package SVMlight (version 2.01) with a default working set size of 10 is used to test the decomposition method. The CPU time of all algorithms is measured on an unloaded 266 MHz Pentium II processor running Windows NT 4.\n\nThe chunking algorithm uses the projected conjugate gradient (PCG) algorithm as its QP solver, as suggested by Burges [1]. All algorithms use sparse dot product code and kernel caching, as appropriate [1, 2]. Both SMO and chunking share folded linear SVM code.\n\nThe SMO algorithm is tested on three real-world data sets. The results of the experiments are shown in Tables 1 and 2. Further tests on artificial data sets can be found in [8, 7].\n\nThe first test set is the UCI Adult data set [5]. The SVM is given 14 attributes of a census form of a household and asked to predict whether that household has an income greater than $50,000. Out of the 14 attributes, eight are categorical and six are continuous. The six continuous attributes are discretized into quintiles, yielding a total of 123 binary attributes.\n\nThe second test set is text categorization: classifying whether a web page belongs to a category or not. Each web page is represented as 300 sparse binary keyword attributes.\n\nThe third test set is the MNIST database of handwritten digits, from AT&T Research Labs [4]. One classifier of MNIST, class 8, is trained. 
The inputs are 784-dimensional non-binary vectors and are stored as sparse vectors. A fifth-order polynomial kernel is used to match the AT&T accuracy results.\n\nThe Adult set and the Web set are trained both with linear SVMs and Gaussian SVMs with variance of 10. For the Adult and Web data sets, the $C$ parameter is chosen to optimize accuracy on a validation set. Experiments on the Adult and Web sets are performed with and without sparse inputs and with and without kernel caching, in order to determine the effect these kernel optimizations have on computation time. When a kernel cache is used, the cache size for SMO and SVMlight is 40 megabytes. The chunking algorithm always uses kernel caching: matrix values from the previous QP step are re-used. For the linear experiments, SMO does not use kernel caching, while SVMlight does.\n\nExperiment | SMO time (sec) | SVMlight time (sec) | Chunking time (sec) | SMO scaling exponent | SVMlight scaling exponent | Chunking scaling exponent\nAdultLin | 13.7 | 217.9 | 20711.3 | 1.8 | 2.1 | 3.1\nAdultLinD | 21.9 | n/a | 21141.1 | 1.0 | n/a | 3.0\nWebLin | 339.9 | 3980.8 | 17164.7 | 1.6 | 2.2 | 2.5\nWebLinD | 4589.1 | n/a | 17332.8 | 1.5 | n/a | 2.5\nAdultGaussK | 442.4 | 284.7 | 11910.6 | 2.0 | 2.0 | 2.9\nAdultGauss | 523.3 | 737.5 | n/a | 2.0 | 2.0 | n/a\nAdultGaussKD | 1433.0 | n/a | 14740.4 | 2.5 | n/a | 2.8\nAdultGaussD | 1810.2 | n/a | n/a | 2.0 | n/a | n/a\nWebGaussK | 2477.9 | 2949.5 | 23877.6 | 1.6 | 2.0 | 2.0\nWebGauss | 2538.0 | 6923.5 | n/a | 1.6 | 1.8 | n/a\nWebGaussKD | 23365.3 | n/a | 50371.9 | 2.6 | n/a | 2.0\nWebGaussD | 24758.0 | n/a | n/a | 1.6 | n/a | n/a\nMNIST | 19387.9 | 38452.3 | 33109.0 | n/a | n/a | n/a\n\nTable 2: Timings of algorithms on various data sets. 
\n\nIn Table 2, the scaling of each algorithm is measured as a function of the training set size, which is varied by taking random nested subsets of the full training set. A line is fitted to the log of the training time versus the log of the set size. The slope of the line is an empirical scaling exponent.\n\n4  CONCLUSIONS\n\nAs can be seen in Table 2, standard PCG chunking is slower than SMO for the data sets shown, even for dense inputs. Decomposition and SMO have the advantage, over standard PCG chunking, of ignoring the examples whose Lagrange multipliers are at $C$. This advantage is reflected in the scaling exponents for PCG chunking versus SMO and SVMlight. PCG chunking can be altered to have a similar property [3]. Notice that PCG chunking uses the same sparse dot product code and linear SVM folding code as SMO. However, these optimizations do not speed up PCG chunking due to the overhead of numerically solving large QP sub-problems.\n\nSMO and SVMlight are similar: they decompose the large QP problem into very small QP sub-problems. SMO decomposes into even smaller sub-problems: it uses analytical solutions of two-dimensional sub-problems, while SVMlight uses numerical QP to solve 10-dimensional sub-problems. The difference in timings between the two methods is partly due to the numerical QP overhead, but mostly due to the difference in heuristics and kernel optimizations. For example, SMO is faster than SVMlight by an order of magnitude on linear problems, due to linear SVM folding. However, SVMlight can also potentially use linear SVM folding. 
In these experiments, SMO uses a very simple least-recently-used kernel cache of Hessian rows, while SVMlight uses a more complex kernel cache and modifies its heuristics to utilize the kernel cache effectively [2]. Therefore, SMO does not benefit from the kernel cache at the largest problem sizes, while SVMlight speeds up by a factor of 2.5.\n\nUtilizing sparseness to compute kernels yields a large advantage for SMO due to the lack of heavy numerical QP overhead. For the sparse data sets shown, SMO can speed up by a factor of between 3 and 13, while PCG chunking only obtained a maximum speed-up of 2.1 times.\n\nThe MNIST experiments were performed without a kernel cache, because the MNIST data set takes up most of the memory of the benchmark machine. Due to sparse inputs, SMO is a factor of 1.7 faster than PCG chunking, even though none of the Lagrange multipliers are at $C$. On a machine with more memory, SVMlight would be as fast as or faster than SMO for MNIST, due to kernel caching.\n\nIn summary, SMO is a simple method for training support vector machines which does not require a numerical QP library. Because its CPU time is dominated by kernel evaluation, SMO can be dramatically quickened by the use of kernel optimizations, such as linear SVM folding and sparse dot products. SMO can be anywhere from 1.7 to 1500 times faster than the standard PCG chunking algorithm, depending on the data set.\n\nAcknowledgements\n\nThanks to Chris Burges for running data sets through his projected conjugate gradient code and for various helpful suggestions.\n\nReferences\n\n[1] C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2), 1998.\n\n[2] T. Joachims. Making large-scale SVM learning practical. In B. Sch\u00f6lkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods - Support Vector Learning, pages 169-184. 
MIT Press, 1998.\n\n[3] L. Kaufman. Solving the quadratic programming problem arising in support vector classification. In B. Sch\u00f6lkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods - Support Vector Learning, pages 147-168. MIT Press, 1998.\n\n[4] Y. LeCun. MNIST handwritten digit database. Available on the web at http://www.research.att.com/~yann/ocr/mnist/.\n\n[5] C. J. Merz and P. M. Murphy. UCI repository of machine learning databases, 1998. [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, Department of Information and Computer Science.\n\n[6] E. Osuna, R. Freund, and F. Girosi. Improved training algorithm for support vector machines. In Proc. IEEE Neural Networks in Signal Processing '97, 1997.\n\n[7] J. C. Platt. Fast training of SVMs using sequential minimal optimization. In B. Sch\u00f6lkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods - Support Vector Learning, pages 185-208. MIT Press, 1998.\n\n[8] J. C. Platt. Sequential minimal optimization: A fast algorithm for training support vector machines. Technical Report MSR-TR-98-14, Microsoft Research, 1998. Available at http://www.research.microsoft.com/~jplatt/smo.html.\n\n[9] V. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-Verlag, 1982.", "award": [], "sourceid": 1577, "authors": [{"given_name": "John", "family_name": "Platt", "institution": null}]}