{"title": "Minimax Probability Machine", "book": "Advances in Neural Information Processing Systems", "page_first": 801, "page_last": 807, "abstract": null, "full_text": "Minimax Probability Machine\n\nGert R.G. Lanckriet*\nDepartment of EECS\nUniversity of California, Berkeley\nBerkeley, CA 94720-1770\ngert@eecs.berkeley.edu\n\nLaurent El Ghaoui\nDepartment of EECS\nUniversity of California, Berkeley\nBerkeley, CA 94720-1770\nelghaoui@eecs.berkeley.edu\n\nChiranjib Bhattacharyya\nDepartment of EECS\nUniversity of California, Berkeley\nBerkeley, CA 94720-1776\nchiru@eecs.berkeley.edu\n\nMichael I. Jordan\nComputer Science and Statistics\nUniversity of California, Berkeley\nBerkeley, CA 94720-1776\njordan@cs.berkeley.edu\n\nAbstract\n\nWhen constructing a classifier, the probability of correct classification of future data points should be maximized. In the current paper this desideratum is translated in a very direct way into an optimization problem, which is solved using methods from convex optimization. We also show how to exploit Mercer kernels in this setting to obtain nonlinear decision boundaries. A worst-case bound on the probability of misclassification of future data is obtained explicitly.\n\n1 Introduction\n\nConsider the problem of choosing a linear discriminant by minimizing the probabilities that data vectors fall on the wrong side of the boundary. One way to attempt to achieve this is via a generative approach in which one makes distributional assumptions about the class-conditional densities and thereby estimates and controls the relevant probabilities.\n
The need to make distributional assumptions, however, casts doubt on the generality and validity of such an approach, and in discriminative solutions to classification problems it is common to attempt to dispense with class-conditional densities entirely.\n\nRather than avoiding any reference to class-conditional densities, it might be useful to attempt to control misclassification probabilities in a worst-case setting; that is, under all possible choices of class-conditional densities. Such a minimax approach could be viewed as providing an alternative justification for discriminative approaches. In this paper we show how such a minimax programme can be carried out in the setting of binary classification. Our approach involves exploiting the following powerful theorem due to Isii [6], as extended in recent work by Bertsimas and Sethuraman [2]:\n\n$$\\sup_{y \\sim (\\bar{y}, \\Sigma_y)} \\Pr\\{a^T y \\ge b\\} = \\frac{1}{1+d^2}, \\quad \\text{with} \\quad d^2 = \\inf_{a^T y \\ge b} (y - \\bar{y})^T \\Sigma_y^{-1} (y - \\bar{y}), \\tag{1}$$\n\nwhere $y$ is a random vector, where $a$ and $b$ are constants, and where the supremum is taken over all distributions having mean $\\bar{y}$ and covariance matrix $\\Sigma_y$. This theorem provides us with the ability to bound the probability of misclassifying a point, without making Gaussian or other specific distributional assumptions. We will show how to exploit this ability in the design of linear classifiers.\n\nOne of the appealing features of this formulation is that one obtains an explicit upper bound on the probability of misclassification of future data: $1/(1+\\kappa^2)$, where $\\kappa$ is the optimal value of the optimization problem derived in Section 2. A second appealing feature of this approach is that, as in linear discriminant analysis [7], it is possible to generalize the basic methodology, utilizing Mercer kernels and thereby forming nonlinear decision boundaries. We show how to do this in Section 3.\n\n* http://robotics.eecs.berkeley.edu/~gert/
\n\nThe paper is organized as follows: in Section 2 we present the minimax formulation for linear classifiers, while in Section 3 we deal with kernelizing the method. We present empirical results in Section 4.\n\n2 Maximum probabilistic decision hyperplane\n\nIn this section we present our minimax formulation for linear decision boundaries. Let $x$ and $y$ denote random vectors in a binary classification problem, with mean vectors and covariance matrices given by $x \\sim (\\bar{x}, \\Sigma_x)$ and $y \\sim (\\bar{y}, \\Sigma_y)$, respectively, where \"$\\sim$\" means that the random variable has the specified mean and covariance matrix but that the distribution is otherwise unconstrained. Note that $x, \\bar{x}, y, \\bar{y} \\in \\mathbb{R}^n$ and $\\Sigma_x, \\Sigma_y \\in \\mathbb{R}^{n \\times n}$.\n\nWe want to determine the hyperplane $a^T z = b$ ($a, z \\in \\mathbb{R}^n$ and $b \\in \\mathbb{R}$) that separates the two classes of points with maximal probability with respect to all distributions having these means and covariance matrices. This boils down to:\n\n$$\\max_{\\alpha, a, b} \\alpha \\quad \\text{s.t.} \\quad \\inf \\Pr\\{a^T x \\ge b\\} \\ge \\alpha, \\quad \\inf \\Pr\\{a^T y \\le b\\} \\ge \\alpha, \\tag{2}$$\n\nor,\n\n$$\\max_{\\alpha, a, b} \\alpha \\quad \\text{s.t.} \\quad 1 - \\alpha \\ge \\sup \\Pr\\{a^T x \\le b\\}, \\quad 1 - \\alpha \\ge \\sup \\Pr\\{a^T y \\ge b\\}. \\tag{3}$$\n\nConsider the second constraint in (3). Recall the result of Bertsimas and Sethuraman [2]:\n\n$$\\sup \\Pr\\{a^T y \\ge b\\} = \\frac{1}{1+d^2}, \\quad \\text{with} \\quad d^2 = \\inf_{a^T y \\ge b} (y - \\bar{y})^T \\Sigma_y^{-1} (y - \\bar{y}). \\tag{4}$$\n\nWe can write this as $d^2 = \\inf_{c^T w \\ge d} w^T w$, where $w = \\Sigma_y^{-1/2} (y - \\bar{y})$, $c^T = a^T \\Sigma_y^{1/2}$ and $d = b - a^T \\bar{y}$ (here the scalar $d$ is a separate quantity, not the square root of $d^2$ in (4)). To solve this, first notice that we can assume that $a^T \\bar{y} \\le b$ (i.e. $\\bar{y}$ is classified correctly by the decision hyperplane $a^T z = b$): indeed, otherwise we would find $d^2 = 0$ and thus $\\alpha = 0$ for that particular $a$ and $b$, which can never be an optimal value. So, $d > 0$. We then form the Lagrangian:\n\n$$\\mathcal{L}(w, \\lambda) = w^T w + \\lambda (d - c^T w), \\tag{5}$$\n\nwhich is to be maximized with respect to $\\lambda \\ge 0$ and minimized with respect to $w$.
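\nAt the optimum, $2w = \\lambda c$ and $d = c^T w$, so $\\lambda = 2d / (c^T c)$ and $w = d\\,c / (c^T c)$. This yields:\n\n$$d^2 = \\inf_{a^T y \\ge b} (y - \\bar{y})^T \\Sigma_y^{-1} (y - \\bar{y}) = \\frac{(b - a^T \\bar{y})^2}{a^T \\Sigma_y a}. \\tag{6}$$\n\nUsing (4), the second constraint in (3) becomes $1 - \\alpha \\ge 1/(1+d^2)$ or $d^2 \\ge \\alpha/(1-\\alpha)$. Taking (6) into account, this boils down to:\n\n$$b - a^T \\bar{y} \\ge \\kappa(\\alpha) \\sqrt{a^T \\Sigma_y a} \\quad \\text{where} \\quad \\kappa(\\alpha) = \\sqrt{\\frac{\\alpha}{1-\\alpha}}. \\tag{7}$$\n\nWe can handle the first constraint in (3) in a similar way (just write $a^T x \\le b$ as $-a^T x \\ge -b$ and apply the result (7) for the second constraint). The optimization problem (3) then becomes:\n\n$$\\max_{\\alpha, a, b} \\alpha \\quad \\text{s.t.} \\quad -b + a^T \\bar{x} \\ge \\kappa(\\alpha) \\sqrt{a^T \\Sigma_x a}, \\quad b - a^T \\bar{y} \\ge \\kappa(\\alpha) \\sqrt{a^T \\Sigma_y a}. \\tag{8}$$\n\nBecause $\\kappa(\\alpha)$ is a monotone increasing function of $\\alpha$, we can write this as:\n\n$$\\max_{\\kappa, a, b} \\kappa \\quad \\text{s.t.} \\quad -b + a^T \\bar{x} \\ge \\kappa \\sqrt{a^T \\Sigma_x a}, \\quad b - a^T \\bar{y} \\ge \\kappa \\sqrt{a^T \\Sigma_y a}. \\tag{9}$$\n\nFrom both constraints in (9), we get\n\n$$a^T \\bar{y} + \\kappa \\sqrt{a^T \\Sigma_y a} \\le b \\le a^T \\bar{x} - \\kappa \\sqrt{a^T \\Sigma_x a}, \\tag{10}$$\n\nwhich allows us to eliminate $b$ from (9):\n\n$$\\max_{\\kappa, a} \\kappa \\quad \\text{s.t.} \\quad a^T \\bar{y} + \\kappa \\sqrt{a^T \\Sigma_y a} \\le a^T \\bar{x} - \\kappa \\sqrt{a^T \\Sigma_x a}. \\tag{11}$$\n\nBecause we want to maximize $\\kappa$, the inequalities in (10) become equalities at the optimum. The optimal value of $b$ is thus given by\n\n$$b_* = a_*^T \\bar{x} - \\kappa_* \\sqrt{a_*^T \\Sigma_x a_*} = a_*^T \\bar{y} + \\kappa_* \\sqrt{a_*^T \\Sigma_y a_*}, \\tag{12}$$\n\nwhere $a_*$ and $\\kappa_*$ are the optimal values of $a$ and $\\kappa$ respectively. Rearranging the constraint in (11), we get:\n\n$$a^T (\\bar{x} - \\bar{y}) \\ge \\kappa \\left( \\sqrt{a^T \\Sigma_x a} + \\sqrt{a^T \\Sigma_y a} \\right). \\tag{13}$$\n\nThe above is positively homogeneous in $a$: if $a$ satisfies (13), $sa$ with $s \\in \\mathbb{R}_+$ also does. Furthermore, (13) implies $a^T (\\bar{x} - \\bar{y}) \\ge 0$. Thus, we can restrict $a$ to be such that $a^T (\\bar{x} - \\bar{y}) = 1$. The optimization problem (11) then becomes\n\n$$\\max_{\\kappa, a} \\kappa \\quad \\text{s.t.} \\quad \\frac{1}{\\kappa} \\ge \\sqrt{a^T \\Sigma_x a} + \\sqrt{a^T \\Sigma_y a}, \\quad a^T (\\bar{x} - \\bar{y}) = 1, \\tag{14}$$\n\nwhich allows us to eliminate $\\kappa$:\n\n$$\\min_a \\sqrt{a^T \\Sigma_x a} + \\sqrt{a^T \\Sigma_y a} \\quad \\text{s.t.} \\quad a^T (\\bar{x} - \\bar{y}) = 1, \\tag{15}$$\n\nor, equivalently\n\n$$\\min_a \\| \\Sigma_x^{1/2} a \\|_2 + \\| \\Sigma_y^{1/2} a \\|_2 \\quad \\text{s.t.} \\quad a^T (\\bar{x} - \\bar{y}) = 1. \\tag{16}$$\n\nThis is a convex optimization problem, more precisely a second order cone program (SOCP) [8,5].\n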
Furthermore, notice that we can write $a = a_0 + Fu$, where $u \\in \\mathbb{R}^{n-1}$, $a_0 = (\\bar{x} - \\bar{y}) / \\|\\bar{x} - \\bar{y}\\|_2^2$, and $F \\in \\mathbb{R}^{n \\times (n-1)}$ is an orthogonal matrix whose columns span the subspace of vectors orthogonal to $\\bar{x} - \\bar{y}$. Using this we can write (16) as an unconstrained SOCP:\n\n$$\\min_u \\| \\Sigma_x^{1/2} (a_0 + Fu) \\|_2 + \\| \\Sigma_y^{1/2} (a_0 + Fu) \\|_2. \\tag{17}$$\n\nWe can solve this problem in various ways, for example using interior-point methods for SOCP [8], which yield a worst-case complexity of $O(n^3)$. Of course, the first and second moments of $x, y$ must be estimated from data, using for example plug-in estimates $\\hat{\\bar{x}}, \\hat{\\bar{y}}, \\hat{\\Sigma}_x, \\hat{\\Sigma}_y$ for respectively $\\bar{x}, \\bar{y}, \\Sigma_x, \\Sigma_y$. This brings the total complexity to $O(l n^3)$, where $l$ is the number of data points. This is the same complexity as the quadratic programs one has to solve in support vector machines.\n\nIn our implementations, we took an iterative least-squares approach, which is based on the following form, equivalent to (17):\n\n$$\\min_{u, \\delta, \\epsilon} \\; \\delta + \\frac{1}{\\delta} \\| \\Sigma_x^{1/2} (a_0 + Fu) \\|_2^2 + \\epsilon + \\frac{1}{\\epsilon} \\| \\Sigma_y^{1/2} (a_0 + Fu) \\|_2^2. \\tag{18}$$\n\nAt iteration $k$, we first minimize with respect to $\\delta$ and $\\epsilon$ by setting $\\delta_k = \\| \\Sigma_x^{1/2} (a_0 + F u_{k-1}) \\|_2$ and $\\epsilon_k = \\| \\Sigma_y^{1/2} (a_0 + F u_{k-1}) \\|_2$. Then we minimize with respect to $u$ by solving a least squares problem in $u$ for $\\delta = \\delta_k$ and $\\epsilon = \\epsilon_k$, which gives us $u_k$. Because in both update steps the objective of this convex optimization problem will not increase, the iteration will converge to the global minimum of (17), $\\| \\Sigma_x^{1/2} (a_0 + F u_*) \\|_2 + \\| \\Sigma_y^{1/2} (a_0 + F u_*) \\|_2$, with $u_*$ an optimal value of $u$.\n\nWe then obtain $a_*$ as $a_0 + F u_*$ and $b_*$ from (12) with $\\kappa_* = 1 / \\left( \\sqrt{a_*^T \\Sigma_x a_*} + \\sqrt{a_*^T \\Sigma_y a_*} \\right)$. Classification of a new data point $z_{new}$ is done by evaluating $\\mathrm{sign}(a_*^T z_{new} - b_*)$: if this is $+1$, $z_{new}$ is classified as from class $x$, otherwise $z_{new}$ is classified as from class $y$.\n\nIt is interesting to see what happens if we make distributional assumptions; in particular, let us assume that $x \\sim \\mathcal{N}(\\bar{x}, \\Sigma_x)$ and $y \\sim \\mathcal{N}(\\bar{y}, \\Sigma_y)$.\n
This leads to the following optimization problem:\n\n$$\\max_{\\alpha, a, b} \\alpha \\quad \\text{s.t.} \\quad -b + a^T \\bar{x} \\ge \\Phi^{-1}(\\alpha) \\sqrt{a^T \\Sigma_x a}, \\quad b - a^T \\bar{y} \\ge \\Phi^{-1}(\\alpha) \\sqrt{a^T \\Sigma_y a}, \\tag{19}$$\n\nwhere $\\Phi(z)$ is the cumulative distribution function of the standard normal distribution. This has the same form as (8), but now with $\\kappa(\\alpha) = \\Phi^{-1}(\\alpha)$ instead of $\\kappa(\\alpha) = \\sqrt{\\alpha/(1-\\alpha)}$ (cf. a result by Chernoff [4]). We thus solve the same optimization problem ($\\alpha$ disappears from the optimization problem because $\\kappa(\\alpha)$ is monotone increasing) and find the same decision hyperplane $a^T z = b$. The difference lies in the value of $\\alpha$ associated with $\\kappa_*$: $\\alpha$ will be higher in this case, so the hyperplane will have a higher predicted probability of classifying future data correctly.\n\n3 Kernelization\n\nIn this section we describe the \"kernelization\" of the minimax approach described in the previous section. We seek to map the problem to a higher dimensional feature space $\\mathbb{R}^f$ via a mapping $\\varphi : \\mathbb{R}^n \\mapsto \\mathbb{R}^f$, such that a linear discriminant in the feature space corresponds to a nonlinear discriminant in the original space. To carry out this programme, we need to try to reformulate the minimax problem in terms of a kernel function $K(z_1, z_2) = \\varphi(z_1)^T \\varphi(z_2)$ satisfying Mercer's condition.\n\nLet the data be mapped as $x \\mapsto \\varphi(x) \\sim (\\overline{\\varphi(x)}, \\Sigma_{\\varphi(x)})$ and $y \\mapsto \\varphi(y) \\sim (\\overline{\\varphi(y)}, \\Sigma_{\\varphi(y)})$, where $\\{x_i\\}_{i=1}^{N_x}$ and $\\{y_i\\}_{i=1}^{N_y}$ are training data points in the classes corresponding to $x$ and $y$ respectively. The decision hyperplane in $\\mathbb{R}^f$ is then given by $a^T \\varphi(z) = b$ with $a, \\varphi(z) \\in \\mathbb{R}^f$ and $b \\in \\mathbb{R}$. In $\\mathbb{R}^f$, we need to solve the following optimization problem:\n\n$$\\min_a \\sqrt{a^T \\Sigma_{\\varphi(x)} a} + \\sqrt{a^T \\Sigma_{\\varphi(y)} a} \\quad \\text{s.t.} \\quad a^T (\\overline{\\varphi(x)} - \\overline{\\varphi(y)}) = 1, \\tag{20}$$\n\nwhere, as in (12), the optimal value of $b$ will be given by\n\n$$b_* = a_*^T \\overline{\\varphi(x)} - \\kappa_* \\sqrt{a_*^T \\Sigma_{\\varphi(x)} a_*} = a_*^T \\overline{\\varphi(y)} + \\kappa_* \\sqrt{a_*^T \\Sigma_{\\varphi(y)} a_*}, \\tag{21}$$\n\nwhere $a_*$ and $\\kappa_*$ are the optimal values of $a$ and $\\kappa$ respectively. However, we do not wish to solve the convex optimization problem in this form, because we want to avoid using $f$ or $\\varphi$ explicitly.\n\nIf $a$ has a component in $\\mathbb{R}^f$ which is orthogonal to the subspace spanned by $\\varphi(x_i)$, $i = 1, 2, \\ldots, N_x$ and $\\varphi(y_i)$, $i = 1, 2, \\ldots, N_y$, then that component will not affect the objective or the constraint in (20). This implies that we can write $a$ as\n\n$$a = \\sum_{i=1}^{N_x} \\alpha_i \\varphi(x_i) + \\sum_{j=1}^{N_y} \\beta_j \\varphi(y_j). \\tag{22}$$\n\nSubstituting expression (22) for $a$ and estimates $\\hat{\\overline{\\varphi(x)}} = \\frac{1}{N_x} \\sum_{i=1}^{N_x} \\varphi(x_i)$, $\\hat{\\overline{\\varphi(y)}} = \\frac{1}{N_y} \\sum_{i=1}^{N_y} \\varphi(y_i)$, $\\hat{\\Sigma}_{\\varphi(x)} = \\frac{1}{N_x} \\sum_{i=1}^{N_x} (\\varphi(x_i) - \\hat{\\overline{\\varphi(x)}})(\\varphi(x_i) - \\hat{\\overline{\\varphi(x)}})^T$ and $\\hat{\\Sigma}_{\\varphi(y)} = \\frac{1}{N_y} \\sum_{i=1}^{N_y} (\\varphi(y_i) - \\hat{\\overline{\\varphi(y)}})(\\varphi(y_i) - \\hat{\\overline{\\varphi(y)}})^T$ for the means and the covariance matrices in the objective and the constraint of the optimization problem (20), we see that both the objective and the constraints can be written in terms of the kernel function $K(z_1, z_2) = \\varphi(z_1)^T \\varphi(z_2)$. We obtain:\n\n$$\\min_\\gamma \\sqrt{\\frac{1}{N_x} \\gamma^T \\tilde{K}_x^T \\tilde{K}_x \\gamma} + \\sqrt{\\frac{1}{N_y} \\gamma^T \\tilde{K}_y^T \\tilde{K}_y \\gamma} \\quad \\text{s.t.} \\quad \\gamma^T (\\tilde{k}_x - \\tilde{k}_y) = 1, \\tag{23}$$\n\nwhere $\\gamma = [\\alpha_1 \\; \\alpha_2 \\; \\cdots \\; \\alpha_{N_x} \\; \\beta_1 \\; \\beta_2 \\; \\cdots \\; \\beta_{N_y}]^T$, $\\tilde{k}_x \\in \\mathbb{R}^{N_x + N_y}$ with $[\\tilde{k}_x]_i = \\frac{1}{N_x} \\sum_{j=1}^{N_x} K(x_j, z_i)$, $\\tilde{k}_y \\in \\mathbb{R}^{N_x + N_y}$ with $[\\tilde{k}_y]_i = \\frac{1}{N_y} \\sum_{j=1}^{N_y} K(y_j, z_i)$, $z_i = x_i$ for $i = 1, 2, \\ldots, N_x$ and $z_i = y_{i - N_x}$ for $i = N_x + 1, N_x + 2, \\ldots, N_x + N_y$. $\\tilde{K}$ is defined as:\n\n$$\\tilde{K} = \\begin{pmatrix} K_x - 1_{N_x} \\tilde{k}_x^T \\\\ K_y - 1_{N_y} \\tilde{k}_y^T \\end{pmatrix} = \\begin{pmatrix} \\tilde{K}_x \\\\ \\tilde{K}_y \\end{pmatrix}, \\tag{24}$$\n\nwhere $1_m$ is a column vector with ones of dimension $m$.\n
$K_x$ and $K_y$ contain respectively the first $N_x$ rows and the last $N_y$ rows of the Gram matrix $K$ (defined as $K_{ij} = \\varphi(z_i)^T \\varphi(z_j) = K(z_i, z_j)$). We can also write (23) as\n\n$$\\min_\\gamma \\left\\| \\frac{\\tilde{K}_x}{\\sqrt{N_x}} \\gamma \\right\\|_2 + \\left\\| \\frac{\\tilde{K}_y}{\\sqrt{N_y}} \\gamma \\right\\|_2 \\quad \\text{s.t.} \\quad \\gamma^T (\\tilde{k}_x - \\tilde{k}_y) = 1, \\tag{25}$$\n\nwhich is a second order cone program (SOCP) [5] that has the same form as the SOCP in (16) and can thus be solved in a similar way. Notice that, in this case, the optimizing variable is $\\gamma \\in \\mathbb{R}^{N_x + N_y}$ instead of $a \\in \\mathbb{R}^n$. Thus the dimension of the optimization problem increases, but the solution is more powerful because the kernelization corresponds to a more complex decision boundary in $\\mathbb{R}^n$.\n\nSimilarly, the optimal value $b_*$ of $b$ in (21) will then become\n\n$$b_* = \\gamma_*^T \\tilde{k}_x - \\kappa_* \\sqrt{\\frac{1}{N_x} \\gamma_*^T \\tilde{K}_x^T \\tilde{K}_x \\gamma_*} = \\gamma_*^T \\tilde{k}_y + \\kappa_* \\sqrt{\\frac{1}{N_y} \\gamma_*^T \\tilde{K}_y^T \\tilde{K}_y \\gamma_*}, \\tag{26}$$\n\nwhere $\\gamma_*$ and $\\kappa_*$ are the optimal values of $\\gamma$ and $\\kappa$ respectively. Once $\\gamma_*$ is known, we get $\\kappa_* = 1 / \\left( \\sqrt{\\frac{1}{N_x} \\gamma_*^T \\tilde{K}_x^T \\tilde{K}_x \\gamma_*} + \\sqrt{\\frac{1}{N_y} \\gamma_*^T \\tilde{K}_y^T \\tilde{K}_y \\gamma_*} \\right)$ and then $b_*$ from (26). Classification of a new data point $z_{new}$ is then done by evaluating $\\mathrm{sign}(a_*^T \\varphi(z_{new}) - b_*) = \\mathrm{sign}\\left( \\left( \\sum_{i=1}^{N_x + N_y} [\\gamma_*]_i K(z_i, z_{new}) \\right) - b_* \\right)$ (again only in terms of the kernel function): if this is $+1$, $z_{new}$ is classified as from class $x$, otherwise $z_{new}$ is classified as from class $y$.\n\n4 Experiments\n\nIn this section we report the results of experiments that we carried out to test our algorithmic approach. The validity of $1 - \\alpha$ as the worst case bound on the probability of misclassification of future data is checked, and we also assess the usefulness of the kernel trick in this setting. We compare linear kernels and Gaussian kernels.\n\nExperimental results on standard benchmark problems are summarized in Table 1. The Wisconsin breast cancer dataset contained 16 examples with missing values, which were not used.\n
The breast cancer, Pima diabetes, ionosphere and sonar data were obtained from the UCI repository. Data for the twonorm problem were generated as specified in [3]. Each dataset was randomly partitioned into 90% training and 10% test sets. The kernel parameter ($\\sigma$) for the Gaussian kernel ($e^{-\\|x-y\\|^2/\\sigma}$) was tuned using cross-validation over 20 random partitions. The reported results are the averages over 50 random partitions for both the linear kernel and the Gaussian kernel with $\\sigma$ chosen as above.\n\nThe results are comparable with those in the existing literature [3] and with those obtained with Support Vector Machines. Also, we notice that $\\alpha$ is indeed smaller than the test-set accuracy in all cases. Furthermore, $\\alpha$ is smaller for a linear decision boundary than for the nonlinear decision boundary obtained via the Gaussian kernel. This clearly shows that kernelizing the method leads to more powerful decision boundaries.\n\nTable 1: $\\alpha$ and test-set accuracy (TSA) compared to BPB (best performance in [3]) and to the performance of an SVM with linear kernel (SVML) and an SVM with Gaussian kernel (SVMG)\n\nDataset         Linear kernel       Gaussian kernel     BPB      SVML     SVMG\n                α        TSA        α        TSA\nTwonorm         80.2 %   96.0 %     83.6 %   97.2 %     96.3 %   95.6 %   97.4 %\nBreast cancer   84.4 %   97.2 %     92.7 %   97.3 %     96.8 %   92.6 %   98.5 %\nIonosphere      63.3 %   85.4 %     89.9 %   93.0 %     93.7 %   87.8 %   91.5 %\nPima diabetes   31.2 %   73.8 %     33.0 %   74.6 %     76.1 %   70.1 %   75.3 %\nSonar           62.4 %   75.1 %     87.1 %   89.8 %     -        75.9 %   86.7 %\n\n5 Conclusions\n\nThe problem of linear discrimination has a long and distinguished history. Many results on misclassification rates have been obtained by making distributional assumptions (e.g., Anderson and Bahadur [1]).\n
Our results, on the other hand, make use of recent work on moment problems and semidefinite optimization to obtain distribution-free results for linear discriminants. We have also shown how to exploit Mercer kernels to generalize our algorithm to nonlinear classification.\n\nThe computational complexity of our method is comparable to that of the quadratic program that one has to solve for the support vector machine (SVM). While we have used a simple iterative least-squares approach, we believe that there is much to gain from exploiting analogies to the SVM and developing specialized, more efficient optimization procedures for our algorithm, in particular tools that break the data into subsets. The extension towards large scale applications is a current focus of our research, as is the problem of developing a variant of our algorithm for multiway classification and function regression. Also, the statistical consequences of using plug-in estimates for the mean vectors and covariance matrices need to be investigated.\n\nAcknowledgements\n\nWe would like to acknowledge support from ONR MURI N00014-00-1-0637, from NSF grants IIS-9988642 and ECS-9983874 and from the Belgian American Educational Foundation.\n\nReferences\n\n[1] Anderson, T. W. and Bahadur, R. R. (1962) Classification into two multivariate Normal distributions with different covariance matrices. Annals of Mathematical Statistics 33(2): 420-431.\n\n[2] Bertsimas, D. and Sethuraman, J. (2000) Moment problems and semidefinite optimization. Handbook of Semidefinite Optimization 469-509, Kluwer Academic Publishers.\n\n[3] Breiman L. (1996) Arcing classifiers. Technical Report 460, Statistics Department, University of California, December 1997.\n\n[4] Chernoff H.\n
(1972) The selection of effective attributes for deciding between hypotheses using linear discriminant functions. In Frontiers of Pattern Recognition, (S. Watanabe, ed.), 55-60. New York: Academic Press.\n\n[5] Boyd, S. and Vandenberghe, L. (2001) Convex Optimization. Course notes for EE364, Stanford University. Available at http://www.stanford.edu/class/ee364.\n\n[6] Isii, K. (1963) On the sharpness of Chebyshev-type inequalities. Ann. Inst. Stat. Math. 14: 185-197.\n\n[7] Mika, S., Rätsch, G., Weston, J., Schölkopf, B., and Müller, K.-R. (1999) Fisher discriminant analysis with kernels. In Neural Networks for Signal Processing IX, 41-48, New York: IEEE Press.\n\n[8] Nesterov, Y. and Nemirovsky, A. (1994) Interior Point Polynomial Methods in Convex Programming: Theory and Applications. Philadelphia, PA: SIAM.\n", "award": [], "sourceid": 2036, "authors": [{"given_name": "Gert", "family_name": "Lanckriet", "institution": null}, {"given_name": "Laurent", "family_name": "Ghaoui", "institution": null}, {"given_name": "Chiranjib", "family_name": "Bhattacharyya", "institution": null}, {"given_name": "Michael", "family_name": "Jordan", "institution": null}]}