{"title": "Uniqueness of the SVM Solution", "book": "Advances in Neural Information Processing Systems", "page_first": 223, "page_last": 229, "abstract": null, "full_text": "Uniqueness of the  SVM  Solution \n\nChristopher J .C. Burges \n\nDavid J.  Crisp \n\nAdvanced Technologies, \n\nBell Laboratories, \nLucent Technologies \nHolmdel,  New  Jersey \n\nburges@iucent.com \n\nCentre for  Sensor Signal and \n\nInformation Processing, \n\nDeptartment of Electrical Engineering, \nUniversity of Adelaide, South Australia \n\ndcrisp@eleceng.adelaide.edu.au \n\nAbstract \n\nWe  give  necessary  and  sufficient  conditions  for  uniqueness  of the \nsupport vector solution for the problems of pattern recognition and \nregression estimation, for a general class of cost functions.  We show \nthat if the solution is not unique, all support vectors are necessarily \nat  bound,  and  we  give  some  simple examples of non-unique solu(cid:173)\ntions.  We  note that uniqueness of the primal (dual)  solution does \nnot necessarily imply uniqueness of the dual (primal) solution.  We \nshow  how to compute the threshold b when the solution is  unique, \nbut when  all support vectors are at bound, in which case the usual \nmethod for  determining b does  not work. \n\n1 \n\nIntroduction \n\nSupport vector machines (SVMs)  have attracted wide interest as a means to imple(cid:173)\nment  structural risk minimization for  the problems of classification and regression \nestimation.  The fact  that training an SVM  amounts to solving a  convex quadratic \nprogramming problem means that the solution found  is global, and that if it is not \nunique,  then the  set  of global solutions is  itself convex;  furthermore,  if the objec(cid:173)\ntive  function  is  strictly convex,  the  solution is  guaranteed  to be  unique  [1]1.  For \nquadratic programming problems, convexity of the objective function  is equivalent \nto positive semi-definiteness of the Hessian, and strict convexity, to positive definite(cid:173)\nness  [1].  For  reference,  we  summarize the  basic uniqueness  result in the following \ntheorem, the proof of which  can be found  in  [1]: \n\nTheorem  1:  The  solution  to a  convex  programming problem,  for  which  the  ob(cid:173)\njective function  is  strictly convex,  is  unique.  Positive  definiteness  of the  Hessian \nimplies strict convexity of the objective function . \nNote that in general strict convexity of the objective function  does  not neccesarily \nimply  positive  definiteness  of the  Hessian.  Furthermore,  the  solution  can  still  be \nunique, even if the objective function is loosely convex (we will use the term \"loosely \nconvex\"  to mean convex but not strictly convex).  Thus the question of uniqueness \n\nIThis is  in  contrast with  the case  of neural  nets,  where local  minima of the objective \n\nfunction  can occur. \n\n\f224 \n\nC. J.  C.  Burges and D. J.  Crisp \n\nfor a convex programming problem for which the objective function is loosely convex \nis  one  that must  be  examined on  a  case  by  case  basis.  In  this  paper  we  will give \nnecessary  and  sufficient  conditions  for  the  support  vector  solution  to  be  unique, \neven  when  the  objective function  is  loosely  convex,  for  both  the  clasification and \nregression cases,  and for  a general class of cost function. 
One of the central features of the support vector method is the implicit mapping \Phi of the data z \in R^n to some feature space F, which is accomplished by replacing dot products between data points z_i, z_j, wherever they occur in the training and test algorithms, with a symmetric function K(z_i, z_j), which is itself an inner product in F [2]: K(z_i, z_j) = (\Phi(z_i), \Phi(z_j)) = (x_i, x_j), where we denote the mapped points in F by x = \Phi(z). In order for this to hold, the kernel function K must satisfy Mercer's positivity condition [3]. The algorithms then amount to constructing an optimal separating hyperplane in F, in the pattern recognition case, or fitting the data to a linear regression tube (with a suitable choice of loss function [4]) in the regression estimation case. Below, without loss of generality, we will work in the space F, whose dimension we denote by d_F. The conditions we will find for non-uniqueness of the solution will not depend explicitly on F or \Phi. \n\n
Most approaches to solving the support vector training problem employ the Wolfe dual, which we describe below. By uniqueness of the primal (dual) solution, we mean uniqueness of the set of primal (dual) variables at the solution. Notice that strict convexity of the primal objective function does not imply strict convexity of the dual objective function. For example, for the optimal hyperplane problem (the problem of finding the maximal separating hyperplane in input space, for the case of separable data), the primal objective function is strictly convex, but the dual objective function will be loosely convex whenever the number of training points exceeds the dimension of the data in input space. In that case, the dual Hessian H will necessarily be positive semidefinite, since H (or a submatrix of H, for the cases in which the cost function also contributes to the (block-diagonal) Hessian) is a Gram matrix of the training data, and some rows of the matrix will then necessarily be linearly dependent [5]^2. In the cases of support vector pattern recognition and regression estimation studied below, one of four cases can occur: (1) both primal and dual solutions are unique; (2) the primal solution is unique while the dual solution is not; (3) the dual is unique but the primal is not; (4) both solutions are not unique. Case (2) occurs when the unique primal solution has more than one expansion in terms of the dual variables. We will give an example of case (3) below. It is easy to construct trivial examples where case (1) holds, and based on the discussion below, it will be clear how to construct examples of (4). However, since the geometrical motivation and interpretation of SVMs rests on the primal variables, the theorems given below address uniqueness of the primal solution^3. \n\n
^2 Recall that a Gram matrix is a matrix whose ij'th element has the form (x_i, x_j) for some inner product (,), where x_i is an element of a vector space, and that the rank of a Gram matrix is the maximum number of linearly independent vectors x_i that appear in it [6]. \n\n
^3 Due to space constraints some proofs and other details will be omitted. Complete details will be given elsewhere. \n\n
2 The Case of Pattern Recognition \n\n
We consider a slightly generalized form of the problem given in [6], namely to minimize the objective function \n\n
  F = (1/2) \|w\|^2 + \sum_i C_i \xi_i^p   (1) \n\n
with constants p \in [1, \infty), C_i > 0, subject to constraints: \n\n
  y_i (w \cdot x_i + b) \geq 1 - \xi_i,   i = 1, ..., l   (2) \n
  \xi_i \geq 0,   i = 1, ..., l   (3) \n\n
where w is the vector of weights, b a scalar threshold, the \xi_i are positive slack variables which are introduced to handle the case of nonseparable data, the y_i are the polarities of the training samples (y_i \in {\pm 1}), the x_i are the images of the training samples in the space F under the mapping \Phi, the C_i determine how much errors are penalized (here we have allowed each pattern to have its own penalty), and the index i labels the l training patterns. The goal is then to find the values of the primal variables {w, b, \xi_i} that solve this problem. Most workers choose p = 1, since this results in a particularly simple dual formulation, but the problem is convex for any p \geq 1. We will not go into further details on support vector classification algorithms themselves here, but refer the interested reader to [3], [7]. Note that, at the solution, b is determined from w and the \xi_i by the Karush-Kuhn-Tucker (KKT) conditions (see below), but we include it in the definition of a solution for convenience. \n\n
Note that Theorem 1 gives an immediate proof that the solution to the optimal hyperplane problem is unique, since there the objective function is just (1/2) \|w\|^2, which is strictly convex, and the constraints (Eq. (2) with the \xi variables removed) are linear inequality constraints which therefore define a convex set^4. \n\n
^4 This is of course not a new result: see for example [3]. \n\n
For the discussion below we will need the dual formulation of this problem, for the case p = 1. It takes the following form: minimize (1/2) \sum_{ij} \alpha_i \alpha_j y_i y_j (x_i, x_j) - \sum_i \alpha_i subject to constraints: \n\n
  \eta_i \geq 0,  \alpha_i \geq 0   (4) \n
  \alpha_i + \eta_i = C_i   (5) \n
  \sum_i \alpha_i y_i = 0   (6) \n\n
and where the solution takes the form w = \sum_i \alpha_i y_i x_i, and the KKT conditions, which are satisfied at the solution, are \eta_i \xi_i = 0 and \alpha_i (y_i (w \cdot x_i + b) - 1 + \xi_i) = 0, where the \eta_i are Lagrange multipliers to enforce positivity of the \xi_i, and the \alpha_i are Lagrange multipliers to enforce the constraint (2). The \eta_i can be implicitly encapsulated in the condition 0 \leq \alpha_i \leq C_i, but we retain them to emphasize that the above equations imply that whenever \xi_i \neq 0, we must have \alpha_i = C_i. Note that, for a given solution, a support vector is defined to be any point x_i for which \alpha_i > 0. Now suppose we have some solution to the problem (1), (2), (3). Let N_1 denote the set {i : y_i = 1, w \cdot x_i + b < 1}, N_2 the set {i : y_i = -1, w \cdot x_i + b > -1}, N_3 the set {i : y_i = 1, w \cdot x_i + b = 1}, N_4 the set {i : y_i = -1, w \cdot x_i + b = -1}, N_5 the set {i : y_i = 1, w \cdot x_i + b > 1}, and N_6 the set {i : y_i = -1, w \cdot x_i + b < -1}. Then we have the following theorem: \n\n
Theorem 2: The solution to the soft-margin problem, (1), (2) and (3), is unique for p > 1. 
For p = 1, the solution is not unique if and only if at least one of the following two conditions holds: \n\n
  \sum_{i \in N_1} C_i = \sum_{i \in N_2 \cup N_4} C_i   (7) \n
  \sum_{i \in N_2} C_i = \sum_{i \in N_1 \cup N_3} C_i   (8) \n\n
Furthermore, whenever the solution is not unique, all solutions share the same w, any support vector x_i has Lagrange multiplier satisfying \alpha_i = C_i, and when (7) holds, then N_3 contains no support vectors, and when (8) holds, then N_4 contains no support vectors. \n\n
Proof: For the case p > 1, the objective function F is strictly convex, since a sum of strictly convex functions is a strictly convex function, and since the function g(v) = v^p, v \in R_+, is strictly convex for p > 1. Furthermore the constraints define a convex set, since any set of simultaneous linear inequality constraints defines a convex set. Hence by Theorem 1 the solution is unique. \n\n
For the case p = 1, define z to be that (d_F + l)-component vector with z_i = w_i, i = 1, ..., d_F, and z_{d_F + i} = \xi_i, i = 1, ..., l. In terms of the variables z, the problem is still a convex programming problem, and hence has the property that any solution is a global solution. Suppose that we have two solutions, z_1 and z_2. Then we can form the family of solutions z_t, where z_t \equiv (1 - t) z_1 + t z_2, and since the solutions are global, we have F(z_1) = F(z_2) = F(z_t). By expanding F(z_t) - F(z_1) = 0 in terms of z_1 and z_2 and differentiating twice with respect to t, we find that w_1 = w_2. Now given w and b, the \xi_i are completely determined by the KKT conditions. Thus the solution is not unique if and only if b is not unique. \n\n
Define \delta \equiv min{ min_{i \in N_1} \xi_i, min_{i \in N_6} (-1 - w \cdot x_i - b) }, and suppose that condition (7) holds. Then a different solution {w', b', \xi'} is given by w' = w, b' = b + \delta, and \xi'_i = \xi_i - \delta for all i \in N_1, \xi'_i = \xi_i + \delta for all i \in N_2 \cup N_4, all other \xi'_i = 0, since by construction F then remains the same, and the constraints (2), (3) are satisfied by the primed variables. Similarly, suppose that condition (8) holds. Define \delta \equiv min{ min_{i \in N_2} \xi_i, min_{i \in N_5} (w \cdot x_i + b - 1) }. Then a different solution {w', b', \xi'} is given by w' = w, b' = b - \delta, and \xi'_i = \xi_i - \delta for all i \in N_2, \xi'_i = \xi_i + \delta for all i \in N_1 \cup N_3, all other \xi'_i = 0, since again by construction F is unchanged and the constraints are still met. Thus the given conditions are sufficient for the solution to be non-unique. To show necessity, assume that the solution is not unique: then by the above argument, the solutions must differ by their values of b. Given a particular solution b, suppose that b + \delta, \delta > 0, is also a solution. Since the set of solutions is itself convex, b + \delta' will also correspond to a solution for all \delta' with 0 \leq \delta' \leq \delta. Given some b' = b + \delta', we can use the KKT conditions to compute all the \xi_i, and we can choose \delta' sufficiently small so that no \xi_i, i \in N_6, that was previously zero becomes nonzero. Then we find that in order that F remain the same, condition (7) must hold. If b - \delta, \delta > 0, is a solution, similar reasoning shows that condition (8) must hold. 
To show the final statement of the theorem, we use the equality constraint (6), together with the fact that, from the KKT conditions, all support vectors x_i with indices in N_1 \cup N_2 satisfy \alpha_i = C_i. Substituting (6) in (7) then gives \sum_{N_3} \alpha_i + \sum_{N_4} (C_i - \alpha_i) = 0, which implies the result, since all \alpha_i are non-negative. Similarly, substituting (6) in (8) gives \sum_{N_3} (C_i - \alpha_i) + \sum_{N_4} \alpha_i = 0, which again implies the result. □ \n\n
Corollary: For any solution which is not unique, letting S denote the set of indices of the corresponding set of support vectors, we must have \sum_{i \in S} C_i y_i = 0. Furthermore, if the number of data points is finite, then for at least one of the family of solutions, all support vectors have corresponding \xi_i \neq 0. \n\n
Note that it follows from the corollary that if the C_i are chosen such that there exists no subset T of the train data such that \sum_{i \in T} C_i y_i = 0, then the solution is guaranteed to be unique, even if p = 1. Furthermore this can be done by choosing all the C_i very close to some central value C, although the resulting solution can depend sensitively on the values chosen (see the example immediately below). Finally, note that if all C_i are equal, the theorem shows that a necessary condition for the solution to be non-unique is that the negative and positive polarity support vectors be equal in number. \n\n
A simple example of a non-unique solution, for the case p = 1, is given by a train set in one dimension with just two examples, {x_1 = 1, y_1 = 1} and {x_2 = -1, y_2 = -1}, with C_1 = C_2 \equiv C. It is straightforward to show analytically that for C \geq 1/2 the solution is unique, with w = 1, \xi_1 = \xi_2 = b = 0, and margin^5 equal to 2, while for C < 1/2 there is a family of solutions, with -1 + 2C \leq b \leq 1 - 2C, \xi_1 = 1 - b - 2C, \xi_2 = 1 + b - 2C, and margin 1/C. The case C < 1/2 corresponds to case (3) in Section 1 (dual unique but primal not), since the dual variables are uniquely specified by \alpha_1 = \alpha_2 = C. Note also that this family of solutions also satisfies the condition that any solution is smoothly deformable into another solution [7]. If C_1 > C_2, the solution becomes unique, and is quite different from the unique solution found when C_2 > C_1. When the C's are not equal, one can interpret what happens in terms of the mechanical analogy [8], with the central separating hyperplane sliding away from the point that exerts the higher force, until that point lies on the edge of the margin region. \n\n
^5 The margin is defined to be the distance between the two hyperplanes corresponding to equality in Eq. (2), namely 2/\|w\|, and the margin region is defined to be the set of points between the two hyperplanes. \n\n
Note that if the solution is not unique, the possible values of b fall on an interval of the real line: in this case a suitable choice would be one that minimizes an estimate of the Bayes error, where the SVM output densities are modeled using a validation set^6. Alternatively, requiring continuity with the cases p > 1, so that one would choose that value of b that would result by considering the family of solutions generated by different choices of p and taking the limit from above as p -> 1, would again result in a unique solution. \n\n
^6 This method was used to estimate b under similar circumstances in [8]. \n\n
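The following small numerical check is an addition to this extracted text and does not appear in the original paper; the helper name primal_cost and the particular value C = 0.3 are ours, chosen only for illustration. It evaluates the p = 1 primal objective (1) for the family of solutions described in the two-point example above and confirms that, for C < 1/2, every threshold b in [-1 + 2C, 1 - 2C] is feasible and attains the same cost. \n\n
import numpy as np \n\n
x = np.array([1.0, -1.0]) \n
y = np.array([1.0, -1.0]) \n\n
def primal_cost(w, b, C): \n
    # Slacks forced by y_i (w x_i + b) >= 1 - xi_i and xi_i >= 0 (Eqs. (2), (3)). \n
    xi = np.maximum(0.0, 1.0 - y * (w * x + b)) \n
    # p = 1 objective of Eq. (1) with C_1 = C_2 = C. \n
    return 0.5 * w ** 2 + C * xi.sum() \n\n
C = 0.3        # any C < 1/2 \n
w = 2 * C      # the normal shared by all solutions (alpha_1 = alpha_2 = C) \n
for b in np.linspace(-1 + 2 * C, 1 - 2 * C, 5): \n
    print(round(b, 2), primal_cost(w, b, C))   # the same cost for every such b \n\n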
3 The Case of Regression Estimation^7 \n\n
^7 The notation in this section only coincides with that used in Section 2 where convenient. \n\n
Here one has a set of l pairs {x_1, y_1}, {x_2, y_2}, ..., {x_l, y_l}, with x_i \in F and y_i \in R, and the goal is to estimate the unknown functional dependence f of the y on the x, where the function f is assumed to be related to the measurements {x_i, y_i} by y_i = f(x_i) + n_i, and where n_i represents noise. For details we refer the reader to [3], [9]. Again we generalize the original formulation [10], as follows: for some choice of positive error penalties C_i, C_i^*, and for positive \epsilon_i, minimize \n\n
  F = (1/2) \|w\|^2 + \sum_{i=1}^{l} (C_i \xi_i^p + C_i^* \xi_i^{*p})   (9) \n\n
with constant p \in [1, \infty), subject to constraints \n\n
  y_i - w \cdot x_i - b \leq \epsilon_i + \xi_i   (10) \n
  w \cdot x_i + b - y_i \leq \epsilon_i + \xi_i^*   (11) \n
  \xi_i^{(*)} \geq 0   (12) \n\n
where we have adopted the notation \xi_i^{(*)} \equiv {\xi_i, \xi_i^*} [9]. This formulation results in an \"\epsilon-insensitive\" loss function, that is, there is no penalty (\xi_i^{(*)} = 0) associated with point x_i if |y_i - w \cdot x_i - b| \leq \epsilon_i. Now let \beta, \beta^* be the Lagrange multipliers introduced to enforce the constraints (10), (11). The dual then gives \n\n
  \sum_i \beta_i = \sum_i \beta_i^*,   0 \leq \beta_i \leq C_i,   0 \leq \beta_i^* \leq C_i^*   (13) \n\n
which we will need below. For this formulation, we have the following \n\n
Theorem 3: For a given solution, define f(x_i, y_i) \equiv y_i - w \cdot x_i - b, and define N_1 to be the set of indices {i : f(x_i, y_i) > \epsilon_i}, N_2 the set {i : f(x_i, y_i) = \epsilon_i}, N_3 the set {i : f(x_i, y_i) = -\epsilon_i}, and N_4 the set {i : f(x_i, y_i) < -\epsilon_i}. Then the solution to (9) - (12) is unique for p > 1, and for p = 1 it is not unique if and only if at least one of the following two conditions holds: \n\n
  \sum_{i \in N_1 \cup N_2} C_i = \sum_{i \in N_4} C_i^*   (14) \n
  \sum_{i \in N_3 \cup N_4} C_i^* = \sum_{i \in N_1} C_i   (15) \n\n
Furthermore, whenever the solution is not unique, all solutions share the same w, and all support vectors are at bound (that is^8, either \beta_i = C_i or \beta_i^* = C_i^*), and when (14) holds, then N_3 contains no support vectors, and when (15) holds, then N_2 contains no support vectors. \n\n
^8 Recall that if \epsilon_i > 0, then \beta_i \beta_i^* = 0. \n\n
The theorem shows that in the non-unique case one will only be able to move the tube (and get another solution) if one does not change its normal w. A trivial example of a non-unique solution is when all the data fits inside the \epsilon-tube with room to spare, in which case, for all the solutions, the normal to the \epsilon-tube always lies along the y direction. Another example is when all C_i are equal, all data falls outside the tube, and there are the same number of points above the tube as below it. \n\n
4 Computing b when all SVs are at Bound \n\n
The threshold b in Eqs. (2), (10) and (11) is usually determined from that subset of the constraint equations which become equalities at the solution and for which the corresponding Lagrange multipliers are not at bound. \n\n
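As a brief added illustration of this standard procedure (it is not part of the original paper), for the pattern recognition case with a linear kernel and with the dual variables \alpha_i already computed; the helper name threshold_from_free_svs and the averaging over the free support vectors are our own choices: \n\n
import numpy as np \n\n
def threshold_from_free_svs(alpha, C, X, y, tol=1e-8): \n
    # w from the dual expansion w = sum_i alpha_i y_i x_i. \n
    w = (alpha * y) @ X \n
    # 'Free' support vectors, 0 < alpha_i < C_i, satisfy y_i (w . x_i + b) = 1 exactly. \n
    free = (alpha > tol) & (alpha < C - tol) \n
    if not np.any(free): \n
        return None  # every support vector is at bound; see the procedure described next \n
    # Each free support vector gives b = y_i - w . x_i; average for numerical stability. \n
    return float(np.mean(y[free] - X[free] @ w)) \n\n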
However, it may be that at the solution this subset is empty. In this section we consider the situation where the solution is unique, where we have solved the optimization problem and therefore know the values of all Lagrange multipliers, and hence know also w, and where we wish to find the unique value of b for this solution. Since the \xi_i^{(*)} are known once b is fixed, we can find b by finding that value which both minimizes the cost term in the primal Lagrangian and satisfies all the constraint equations. Let us consider the pattern recognition case first. Let S_+ (S_-) denote the set of indices of positive (negative) polarity support vectors. Also let V_+ (V_-) denote the set of indices of positive (negative) vectors which are not support vectors. It is straightforward to show that if \sum_{i \in S_-} C_i > \sum_{i \in S_+} C_i, then b = max{ max_{i \in S_-} (-1 - w \cdot x_i), max_{i \in V_+} (1 - w \cdot x_i) }, while if \sum_{i \in S_-} C_i < \sum_{i \in S_+} C_i, then b = min{ min_{i \in S_+} (1 - w \cdot x_i), min_{i \in V_-} (-1 - w \cdot x_i) }. Furthermore, if \sum_{i \in S_-} C_i = \sum_{i \in S_+} C_i, and if the solution is unique, then these two values coincide. \n\n
In the regression case, let us denote by S the set of indices of all support vectors, \bar{S} its complement, S_1 the set of indices for which \beta_i = C_i, and S_2 the set of indices for which \beta_i^* = C_i^*, so that S = S_1 \cup S_2 (note S_1 \cap S_2 = \emptyset). Then if \sum_{i \in S_2} C_i^* > \sum_{i \in S_1} C_i, the desired value of b is b = max{ max_{i \in S_2} (y_i - w \cdot x_i + \epsilon_i), max_{i \in \bar{S}} (y_i - w \cdot x_i - \epsilon_i) }, while if \sum_{i \in S_2} C_i^* < \sum_{i \in S_1} C_i, then b = min{ min_{i \in S_1} (y_i - w \cdot x_i - \epsilon_i), min_{i \in \bar{S}} (y_i - w \cdot x_i + \epsilon_i) }. Again, if the solution is unique, and if also \sum_{i \in S_2} C_i^* = \sum_{i \in S_1} C_i, then these two values coincide. \n\n
5 Discussion \n\n
We have shown that non-uniqueness of the SVM solution will be the exception rather than the rule: it will occur only when one can rigidly parallel transport the margin region without changing the total cost. If non-unique solutions are encountered, other techniques for finding the threshold, such as minimizing the Bayes error arising from a model of the SVM posteriors [8], will be needed. The method of proof in the above theorems is straightforward, and should be extendable to similar algorithms, for example Mangasarian's Generalized SVM [11]. In fact one can extend this result to any problem whose objective function consists of a sum of strictly convex and loosely convex functions: for example, it follows immediately that for the case of the \nu-SVM pattern recognition and regression estimation algorithms [12], with arbitrary convex costs, the value of the normal w will always be unique. \n\n
Acknowledgments \n\n
C. Burges wishes to thank W. Keasler, V. Lawrence and C. Nohl of Lucent Technologies for their support. \n\n
References \n\n
[1] R. Fletcher. Practical Methods of Optimization. John Wiley and Sons, Inc., 2nd edition, 1987. \n\n
[2] B. E. Boser, I. M. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In Fifth Annual Workshop on Computational Learning Theory, Pittsburgh, 1992. ACM. \n\n
[3] V. Vapnik. Statistical Learning Theory. 
John Wiley and Sons, Inc., New York, 1998. \n\n
[4] A.J. Smola and B. Schölkopf. On a kernel-based method for pattern recognition, regression, approximation and operator inversion. Algorithmica, 22:211-231, 1998. \n\n
[5] Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, 1985. \n\n
[6] C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:273-297, 1995. \n\n
[7] C.J.C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121-167, 1998. \n\n
[8] C.J.C. Burges and B. Schölkopf. Improving the accuracy and speed of support vector learning machines. In M. Mozer, M. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems 9, pages 375-381, Cambridge, MA, 1997. MIT Press. \n\n
[9] A. Smola and B. Schölkopf. A tutorial on support vector regression. Statistics and Computing, 1998. In press; also COLT Technical Report TR-1998-030. \n\n
[10] V. Vapnik, S. Golowich, and A. Smola. Support vector method for function approximation, regression estimation, and signal processing. Advances in Neural Information Processing Systems, 9:281-287, 1996. \n\n
[11] O.L. Mangasarian. Generalized support vector machines. Mathematical Programming Technical Report 98-14, University of Wisconsin, October 1998. \n\n
[12] B. Schölkopf, A. Smola, R. Williamson, and P. Bartlett. New support vector algorithms. NeuroCOLT2 Technical Report NC2-TR-1998-031, 1998. \n", "award": [], "sourceid": 1735, "authors": [{"given_name": "Christopher", "family_name": "Burges", "institution": null}, {"given_name": "David", "family_name": "Crisp", "institution": null}]}