{"title": "Nonlinear Discriminant Analysis Using Kernel Functions", "book": "Advances in Neural Information Processing Systems", "page_first": 568, "page_last": 574, "abstract": null, "full_text": "Nonlinear  Discriminant  Analysis  using \n\nKernel  Functions \n\nUniversity of Bonn, Institut of Computer Science III \n\nRomerstrasse 164,  D-53117 Bonn,  Germany \n\nVolker  Roth &  Volker  Steinhage \n\n{roth,  steinhag}@cs.uni-bonn.de \n\nAbstract \n\nFishers linear discriminant analysis  (LDA)  is  a  classical multivari(cid:173)\nate technique both for  dimension reduction and classification.  The \ndata vectors are transformed into a low dimensional subspace such \nthat  the  class  centroids  are  spread  out  as  much  as  possible.  In \nthis subspace LDA  works  as  a  simple  prototype classifier  with lin(cid:173)\near decision  boundaries.  However,  in many applications the linear \nboundaries  do  not  adequately separate the  classes.  We  present  a \nnonlinear generalization of discriminant analysis that uses the ker(cid:173)\nnel trick of representing dot products by kernel functions.  The pre(cid:173)\nsented algorithm allows a simple formulation of the EM-algorithm \nin terms of kernel functions which leads to a unique concept for  un(cid:173)\nsupervised  mixture analysis,  supervised discriminant  analysis  and \nsemi-supervised discriminant analysis with partially unlabelled ob(cid:173)\nservations in feature spaces. \n\nIntroduction \n\n1 \nClassical linear discriminant analysis  (LDA)  projects N  data vectors that belong to \nc different  classes into a  (c - 1)-dimensional space in such way that the ratio of be(cid:173)\ntween group  scatter SB  and within group scatter Sw is maximized [1].  LDA formally \nconsists of an eigenvalue decomposition of Sv) S B  leading to the so called  canonical \nvariates  which contain the whole class specific information in a  (c - I)-dimensional \nsubspace.  The  canonical  variates  can be ordered  by  decreasing eigenvalue  size  in(cid:173)\ndicating  that  the  first  variates  contain  the  major  part  of the  information.  As  a \nconsequence,  this  procedure  allows  low  dimensional  representations  and  therefore \na  visualization  of the data.  Besides from  interpreting LDA only as  a  technique for \ndimensionality reduction,  it can also be seen as  a  multi-class classification method: \nthe set of linear discriminant functions define a partition of the projected space into \nregions that are identified with class membership.  A new observation x  is  assigned \nto the class  with centroid closest to x  in  the projected space. \nTo  overcome  the  limitation  of only  linear  decision  functions  some  attempts  have \nbeen  made  to incorporate nonlinearity into the  classical  algorithm.  HASTIE  et  al. \n[2]  introduced the so called model of Flexible  Discriminant Analysis:  LDA  is  refor(cid:173)\nmulated in the framework of linear regression estimation and a generalization of this \nmethod is  given by using nonlinear regression techniques.  The proposed regression \ntechniques implement the idea of using nonlinear mappings to transform the input \ndata into a  new space in which again a linear regression is  performed.  In real world \n\n\fNonlinear Discriminant Analysis Using Kernel Functions \n\n569 \n\napplications this approach has to deal  with  numerical problems due to the dimen(cid:173)\nsional explosion resulting from nonlinear mappings.  
To overcome the limitation of only linear decision functions, some attempts have been made to incorporate nonlinearity into the classical algorithm. Hastie et al. [2] introduced the so-called model of Flexible Discriminant Analysis: LDA is reformulated in the framework of linear regression estimation, and a generalization of this method is obtained by using nonlinear regression techniques. The proposed regression techniques implement the idea of using nonlinear mappings to transform the input data into a new space in which again a linear regression is performed. In real-world applications this approach has to deal with numerical problems due to the dimensional explosion resulting from nonlinear mappings.

In recent years, approaches that avoid such explicit mappings by using kernel functions have become popular. The main idea is to construct algorithms that only require dot products of pattern vectors, which can be computed efficiently in high-dimensional spaces. Examples of this type of algorithm are the Support Vector Machine [3] and Kernel Principal Component Analysis [4].

In this paper we show that it is possible to formulate classical linear regression, and therefore also linear discriminant analysis, exclusively in terms of dot products. Therefore, kernel methods can be used to construct a nonlinear variant of discriminant analysis. We call this technique Kernel Discriminant Analysis (KDA). Contrary to a similar approach that has been published recently [5], our algorithm is a real multi-class classifier and inherits from classical LDA the convenient property of data visualization.

2 Review of Linear Discriminant Analysis

Under the assumption of the data being centered (i.e. \sum_i x_i = 0), the scatter matrices S_B and S_W are defined by

S_B = \sum_{j=1}^{c} \frac{1}{n_j} \sum_{l,m=1}^{n_j} x_l^{(j)} (x_m^{(j)})^T,    (1)

S_W = \sum_{j=1}^{c} \sum_{l=1}^{n_j} \left( x_l^{(j)} - \frac{1}{n_j} \sum_{m=1}^{n_j} x_m^{(j)} \right) \left( x_l^{(j)} - \frac{1}{n_j} \sum_{m=1}^{n_j} x_m^{(j)} \right)^T,    (2)

where n_j is the number of patterns x_l^{(j)} that belong to class j. LDA chooses a transformation matrix V that maximizes the objective function

J(V) = \frac{|V^T S_B V|}{|V^T S_W V|}.    (3)

The columns of an optimal V are the generalized eigenvectors that correspond to the nonzero eigenvalues in S_B v_i = \lambda_i S_W v_i.

In [6] and [7] we have shown that the standard LDA algorithm can be restated exclusively in terms of dot products of input vectors. The final equation is an eigenvalue equation in terms of dot product matrices, which are of size N x N. Since the solution of high-dimensional generalized eigenvalue equations may cause numerical problems (N may be large in real-world applications), we present an improved algorithm that reformulates discriminant analysis as a regression problem. Moreover, this version allows a simple implementation of the EM-algorithm in feature spaces.

3 Linear regression analysis

In this section we give a brief review of linear regression analysis, which we use as a "building block" for LDA. The task of linear regression analysis is to approximate the regression function by a linear function

r(x) = E(Y | X = x) \approx c + x^T \beta    (4)

on the basis of a sample (y_1, x_1), ..., (y_N, x_N). Let now y denote the vector (y_1, ..., y_N)^T and X denote the data matrix whose rows are the input vectors. Using a quadratic loss function, the optimal parameters c and \beta are chosen to minimize the average squared residual

ASR = N^{-1} \left[ \| y - c 1_N - X\beta \|^2 + \beta^T \Omega \beta \right].    (5)

1_N denotes an N-vector of ones, and \Omega denotes a ridge-type penalty matrix \Omega = \epsilon I which penalizes the coefficients of \beta. Assuming the data to be centered, i.e. \sum_{i=1}^{N} x_i = 0, the parameters of the regression function are given by:

c = N^{-1} \sum_{i=1}^{N} y_i =: \mu_y,    \beta = (X^T X + \epsilon I)^{-1} X^T y.    (6)
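As a minimal sketch of equations (4)-(6) in code (our own illustration; the parameter name eps stands for the ridge constant \epsilon and its default value is arbitrary):

    import numpy as np

    def ridge_fit(X, y, eps=1e-2):
        # Centered ridge regression, eq. (6): c = mean(y), beta = (X'X + eps I)^{-1} X'y.
        x_mean = X.mean(axis=0)
        Xc = X - x_mean                     # centering, as assumed in the text
        c = y.mean()                        # intercept mu_y
        beta = np.linalg.solve(Xc.T @ Xc + eps * np.eye(X.shape[1]), Xc.T @ y)
        return c, beta, x_mean

    def ridge_predict(x, c, beta, x_mean):
        # Regression function r(x) = c + (x - x_mean)' beta, cf. eq. (4).
        return c + (x - x_mean) @ beta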
4 LDA by optimal scoring

In this section the LDA problem is linked to linear regression using the framework of penalized optimal scoring. We give an overview of the detailed derivation in [2] and [8]. Considering again the problem with c classes and N data vectors, the class memberships are represented by a categorical response variable g with c levels. It is useful to code the N responses in terms of the indicator matrix Z: Z_{ij} = 1 if the i-th data vector belongs to class j, and 0 otherwise. The point of optimal scoring is to turn categorical variables into quantitative ones by assigning scores to classes: the score vector \theta assigns the real number \theta_j to the j-th level of g. The vector Z\theta then represents a vector of scored training data and is regressed onto the data matrix X. The simultaneous estimation of scores and regression coefficients constitutes the optimal scoring problem: minimize the criterion

ASR(\theta, \beta) = N^{-1} \left[ \| Z\theta - X\beta \|^2 + \beta^T \Omega \beta \right]    (7)

under the constraint N^{-1} \| Z\theta \|^2 = 1. According to (6), for a given score \theta the minimizing \beta is given by

\beta_{OS} = (X^T X + \Omega)^{-1} X^T Z\theta,    (8)

and the partially minimized criterion becomes:

\min_\beta ASR(\theta, \beta) = 1 - N^{-1} \theta^T Z^T M(\Omega) Z\theta,    (9)

where M(\Omega) = X (X^T X + \Omega)^{-1} X^T denotes the regularized hat or smoother matrix. Minimizing (9) under the constraint N^{-1} \| Z\theta \|^2 = 1 can be performed by the following procedure (see the sketch below):

1. Choose an initial score matrix \Theta_0 satisfying the constraint N^{-1} \Theta_0^T Z^T Z \Theta_0 = I and set \Theta_0^* = Z \Theta_0.
2. Run a multi-response regression of \Theta_0^* onto X: \hat{\Theta}_0 = M(\Omega) \Theta_0^* = XB, where B is the matrix of regression coefficients.
3. Eigenanalyze \Theta_0^{*T} \hat{\Theta}_0 to obtain the optimal scores, and update the matrix of regression coefficients: B^* = BW, with W being the matrix of eigenvectors.

It can be shown that the final matrix B^* is, up to a diagonal scale matrix, equivalent to the matrix of LDA vectors, see [8].
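The three-step procedure can be sketched as follows. This is an illustrative NumPy translation under the centering assumption of section 3; the particular initial score matrix and all names are our own choices:

    import numpy as np

    def optimal_scoring_lda(X, Z, eps=1e-2):
        # Penalized optimal scoring (steps 1-3); returns B*, the LDA vectors up to scaling.
        N, c = Z.shape
        X = X - X.mean(axis=0)                       # centering, as in section 3
        counts = Z.sum(axis=0)                       # class sizes n_j; Z'Z = diag(n_j) for an indicator matrix
        Theta0 = np.sqrt(N) * np.diag(1.0 / np.sqrt(counts))   # step 1: N^{-1} Theta0' Z'Z Theta0 = I
        S0 = Z @ Theta0                              # scored responses Z Theta0
        B = np.linalg.solve(X.T @ X + eps * np.eye(X.shape[1]), X.T @ S0)   # step 2: ridge regression onto X
        S_hat = X @ B                                # = M(Omega) Z Theta0
        vals, W = np.linalg.eigh(S0.T @ S_hat)       # step 3: eigenanalysis of the fitted scores
        order = np.argsort(vals)[::-1]
        return B @ W[:, order[:c - 1]]               # rotate coefficients, drop the trivial constant score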
5 Ridge regression using only dot products

The penalty matrix \Omega in (5) assures that the penalized d x d covariance matrix \tilde{\Sigma} = X^T X + \epsilon I is a symmetric nonsingular matrix. Therefore, it has d eigenvectors e_i with corresponding positive eigenvalues \gamma_i such that the following equations hold:

\tilde{\Sigma} = \sum_{i=1}^{d} \gamma_i e_i e_i^T,    \tilde{\Sigma}^{-1} = \sum_{i=1}^{d} \frac{1}{\gamma_i} e_i e_i^T.    (10)

The first equation implies that the l leading eigenvectors e_i with eigenvalues \gamma_i > \epsilon have an expansion in terms of the input vectors. Note that l is the number of nonzero eigenvalues of the unpenalized covariance matrix X^T X. Together with (6), it follows for the general case, when the dimensionality d may exceed l, that \beta can be written as the sum of two terms: an expansion in terms of the vectors x_i with coefficients a_i and a similar expansion in terms of the remaining eigenvectors:

\beta = \sum_{i=1}^{N} a_i x_i + \sum_{j=l+1}^{d} \zeta_j e_j = X^T a + \sum_{j=l+1}^{d} \zeta_j e_j,    (11)

with a = (a_1, ..., a_N)^T. However, the last term can be dropped, since every eigenvector e_j, j = l+1, ..., d, is orthogonal to every vector x_i and does not influence the value of the regression function (4).

The problem of penalized linear regression can therefore be stated as minimizing

ASR(a) = N^{-1} \left[ \| y - X X^T a \|^2 + a^T X \Omega X^T a \right].    (12)

A stationary vector a is determined by

a = (X X^T + \Omega)^{-1} y.    (13)

Let now the dot product matrix K be defined by K_{ij} = x_i^T x_j, and let for a given test point x_t the dot product vector k_t be defined by k_t = X x_t. With this notation the regression function of a test point x_t reads

r(x_t) = \mu_y + k_t^T (K + \epsilon I)^{-1} y.    (14)

This equation requires only dot products and we can apply the kernel trick. The final equation (14), up to the constant term \mu_y, has also been found by Saunders et al. [9]. They restated ridge regression in dual variables and optimized the resulting criterion function with a Lagrange multiplier technique. Note that our derivation, which is a direct generalization of the standard linear regression formalism, leads in a natural way to a class of more general regression functions including the constant term.

6 LDA using only dot products

Setting \beta = X^T a as in (11) and using the notation of section 5, for a given score \theta the optimal vector a is given by:

a_{OS} = (X X^T + \Omega)^{-1} Z\theta.    (15)

Analogous to (9), the partially minimized criterion becomes:

\min_a ASR(\theta, a) = 1 - N^{-1} \theta^T Z^T \tilde{M}(\Omega) Z\theta,    (16)

with \tilde{M}(\Omega) = X X^T (X X^T + \Omega)^{-1} = K (K + \epsilon I)^{-1}.

To minimize (16) under the constraint N^{-1} \| Z\theta \|^2 = 1, the procedure described in section 4 can be used when M(\Omega) is substituted by \tilde{M}(\Omega). The matrix Y whose rows are the input vectors projected onto the column vectors of B^* is given by:

Y = X B^* = K (K + \epsilon I)^{-1} Z \Theta_0 W.    (17)

Note that again the dot product matrix K is all that is needed to calculate Y.

7 The kernel trick

The main idea of constructing nonlinear algorithms is to apply the linear methods not in the space of observations but in a feature space F that is related to the former by a nonlinear mapping \phi : x \mapsto \phi(x). Assuming that the mapped data are centered in F, i.e. \sum_{i=1}^{N} \phi(x_i) = 0, the presented algorithms remain formally unchanged if the dot product matrix K is computed in F: K_{ij} = (\phi(x_i) \cdot \phi(x_j)). As shown in [4], this assumption can be dropped by writing \tilde{\phi} instead of the mapping \phi:

\tilde{\phi}(x_i) := \phi(x_i) - \frac{1}{N} \sum_{n=1}^{N} \phi(x_n).

Computation of dot products in feature spaces can be done efficiently by using kernel functions k(x_i, x_j) [3]: for some choices of k there exists a mapping \phi into some feature space F such that k acts as a dot product in F. Among possible kernel functions there are e.g. Radial Basis Function (RBF) kernels of the form k(x, y) = \exp(-\| x - y \|^2 / c).
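Combining sections 4-7, one possible NumPy sketch of KDA on training data reads as follows. The explicit centering of K (as in kernel PCA [4]), the helper names and the default parameter values are our own choices, not prescribed by the text:

    import numpy as np

    def rbf_kernel(A, B, c=2.0):
        # k(x, y) = exp(-||x - y||^2 / c)
        sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
        return np.exp(-sq / c)

    def kda_fit(X, Z, c=2.0, eps=1.5):
        # Kernel discriminant analysis by optimal scoring, eqs. (15)-(17).
        N, n_cls = Z.shape
        K = rbf_kernel(X, X, c)
        J = np.full((N, N), 1.0 / N)
        K = K - J @ K - K @ J + J @ K @ J            # centering in feature space, cf. [4]
        counts = Z.sum(axis=0)
        Theta0 = np.sqrt(N) * np.diag(1.0 / np.sqrt(counts))   # initial scores as in section 4
        S0 = Z @ Theta0
        S_hat = K @ np.linalg.solve(K + eps * np.eye(N), S0)   # M~(Omega) Z Theta0 = K (K + eps I)^{-1} Z Theta0
        vals, W = np.linalg.eigh(S0.T @ S_hat)
        W = W[:, np.argsort(vals)[::-1][:n_cls - 1]]
        Y = S_hat @ W                                # projected training data, eq. (17)
        centroids = np.array([Y[Z[:, k] == 1].mean(axis=0) for k in range(n_cls)])
        return Y, centroids

A new point would be projected with the corresponding centered kernel vector against the training data and assigned to the nearest centroid, as in classical LDA.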
8 The EM-algorithm in feature spaces

LDA can be derived as the maximum likelihood method for normal populations with different means and common covariance matrix \Sigma (see [11]). Coding the class membership of the observations in the matrix Z as in section 4, LDA maximizes the (complete data) log-likelihood of the observations under this model.

This concept can be generalized to the case that only the group membership of N_c < N observations is known ([14], p. 679): the EM-algorithm provides a convenient method for maximizing the likelihood function with missing data.

E-step: set p_{ki} = Prob(x_i \in \text{class } k),

p_{ki} = \begin{cases} Z_{ik}, & \text{if the class membership of } x_i \text{ has been observed,} \\ \dfrac{\pi_k \phi_k(x_i)}{\sum_{k'=1}^{c} \pi_{k'} \phi_{k'}(x_i)}, & \text{otherwise,} \end{cases}
\qquad \phi_k(x_i) \propto \exp\left[ -\tfrac{1}{2} (x_i - \mu_k)^T \Sigma^{-1} (x_i - \mu_k) \right].

M-step: set

\pi_k = \frac{1}{N} \sum_{i=1}^{N} p_{ki}.

The idea behind this approach is that even an unclassified observation can be used for estimation if it is given a proper weight according to its posterior probability for class membership. The M-step can be seen as weighted mean and covariance maximum likelihood estimates in a weighted and augmented problem: we augment the data by replicating the N observations c times, with the l-th such replication having observation weights p_{li}. The maximization of the likelihood function can be achieved via a weighted and augmented LDA. It turns out that it is not necessary to explicitly replicate the observations and run a standard LDA: the optimal scoring version of LDA described in section 4 allows an implicit solution of the augmented problem that still uses only N observations. Instead of using a response indicator matrix Z, one uses a blurred response matrix \tilde{Z}, whose rows consist of the current class probabilities for each observation. At each M-step this \tilde{Z} is used in a multiple linear regression followed by an eigen-decomposition. A detailed derivation is given in [11]. Since we have shown that the optimal scoring problem can be solved in feature spaces using kernel functions, this is also the case for the whole EM-algorithm: the E-step requires only differences in Mahalanobis distances, which are supplied by KDA.

After iterated application of the E- and M-steps an observation is classified to the class k with highest probability p_k. This leads to a unified framework for pure mixture analysis (N_c = 0), pure discriminant analysis (N_c = N) and the semi-supervised models of discriminant analysis with partially unclassified observations (0 < N_c < N) in feature spaces.
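A single E-step of this scheme might be sketched as follows. This is our own illustrative realization, carried out on the KDA-projected coordinates with a pooled covariance, which is one way to obtain the required Mahalanobis distances; it is not code from the paper:

    import numpy as np

    def e_step(Y, Z_blurred, labelled, priors):
        # Responsibilities in the KDA-projected space from Gaussians with a common (pooled) covariance.
        N, q = Y.shape
        c = Z_blurred.shape[1]
        mu = (Z_blurred.T @ Y) / Z_blurred.sum(axis=0)[:, None]      # weighted class means
        Sigma = np.zeros((q, q))
        for k in range(c):
            D = Y - mu[k]
            Sigma += (Z_blurred[:, k][:, None] * D).T @ D            # weighted pooled covariance
        Sinv = np.linalg.inv(Sigma / N + 1e-8 * np.eye(q))
        # log of pi_k * exp(-1/2 * Mahalanobis distance), up to a common constant
        log_p = np.stack([np.log(priors[k]) - 0.5 * np.einsum('ij,jk,ik->i', Y - mu[k], Sinv, Y - mu[k])
                          for k in range(c)], axis=1)
        P = np.exp(log_p - log_p.max(axis=1, keepdims=True))
        P /= P.sum(axis=1, keepdims=True)
        P[labelled] = Z_blurred[labelled]                            # observed memberships stay fixed
        return P, P.mean(axis=0)                                     # blurred Z and updated priors pi_k

In a full run, the returned blurred matrix would replace Z in the kernel optimal-scoring step sketched above, and E- and M-steps would be alternated until the class probabilities stabilize.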
9 Experiments

Waveform data: We illustrate KDA on a popular simulated example, taken from [10], p. 49-55, and used in [2, 11]. It is a three-class problem with 21 variables. The learning set consisted of 100 observations per class. The test set was of size 1000. The results are given in Table 1.

Table 1: Results for waveform data. The values are averages over 10 simulations. The first four entries are taken from [11]. QDA: quadratic discriminant analysis, FDA: flexible discriminant analysis, MDA: mixture discriminant analysis.

Technique                               Training Error [%]   Test Error [%]
LDA                                     12.1 (0.6)           19.1 (0.6)
QDA                                      3.9 (0.4)           20.5 (0.6)
FDA (best model parameters)             10.0 (0.6)           19.1 (0.6)
MDA (best model parameters)             13.9 (0.5)           15.5 (0.5)
KDA (RBF kernel, σ = 2, ε = 1.5)        10.7 (0.6)           14.1 (0.7)

The Bayes risk for the problem is about 14% [10]. KDA outperforms the other nonlinear versions of discriminant analysis and reaches the Bayes rate within the error bounds, indicating that one cannot expect significant further improvement using other classifiers. Figure 1 demonstrates the data visualization property of KDA. Since for a 3-class problem the dimensionality of the projected space equals 2, the data can be visualized without any loss of information. In the left plot one can see the projected learning data and the class centroids, the right plot shows the test data and again the class centroids of the learning set.

Figure 1: Data visualization with KDA. Left: learning set, right: test set.

To demonstrate the effect of using unlabeled data for classification, we repeated the experiment with waveform data using only 20 labeled observations per class. We compared the classification results on a test set of size 300 using only the labeled data (error rate E_1) with the results of the EM-model, which considers the test data as incomplete measurements during an iterative maximization of the likelihood function (error rate E_2). Using an RBF kernel (σ = 250), we obtained the following mean error rates over 20 simulations: E_1 = 30.5 (3.6)%, E_2 = 17.1 (2.7)%. The classification performance could be drastically improved by including the unlabelled data in the learning process.

Object recognition: We tested KDA on the MPI Chair Database¹. It consists of 89 regularly spaced views from the upper viewing hemisphere of 25 different classes of chairs as a training set and 100 random views of each class as a test set. The available images are downscaled to 16 x 16 pixels. We did not use the additional 4 edge detection patterns for each view. Classification results are given in Table 2.

Table 2: Results for the MPI chair database.

Technique             Test Error [%]
KDA (poly. kernel)    2.1

For a comparison of the computational performance we also trained the SVM-light implementation (V 2.0) on the data [13]. In this experiment with 25 classes the KDA algorithm proved to be significantly faster than the SVM: using the RBF kernel, KDA was 3 times faster; with the polynomial kernel, KDA was 20 times faster than SVM-light.

¹ The database is available via ftp://ftp.mpik-tueb.mpg.de/pub/chair_dataset/

10 Discussion

In this paper we present a nonlinear version of classical linear discriminant analysis. The main idea is to map the input vectors into a high- or even infinite-dimensional feature space and to apply LDA in this enlarged space. Restating LDA in a way that only dot products of input vectors are needed makes it possible to use kernel representations of dot products. This overcomes numerical problems in high-dimensional feature spaces. We studied the classification performance of the KDA classifier on simulated waveform data and on the MPI chair database that has been widely used for benchmarking in the literature. For medium-sized problems, especially if the number of classes is high, the KDA algorithm proved to be significantly faster than an SVM while leading to the same classification performance.
From classical LDA the presented algorithm inherits the convenient property of data visualization, since it allows low-dimensional views of the data vectors. This makes an intuitive interpretation possible, which is helpful in many practical applications. The presented KDA algorithm can be used as the maximization step in an EM-algorithm in feature spaces. This makes it possible to include unlabeled observations in the learning process, which can improve classification results. Studying the performance of KDA for other classification problems, as well as a theoretical comparison of the optimization criteria used in the KDA and SVM algorithms, will be the subject of future work.

Acknowledgements

This work was supported by Deutsche Forschungsgemeinschaft, DFG. We profited heavily from discussions with Armin B. Cremers, John Held and Lothar Hermes.

References

[1] R. Duda and P. Hart, Pattern Classification and Scene Analysis. Wiley & Sons, 1973.
[2] T. Hastie, R. Tibshirani, and A. Buja, \"Flexible discriminant analysis by optimal scoring,\" JASA, vol. 89, pp. 1255-1270, 1994.
[3] V. N. Vapnik, Statistical Learning Theory. Wiley & Sons, 1998.
[4] B. Schölkopf, A. Smola, and K.-R. Müller, \"Nonlinear component analysis as a kernel eigenvalue problem,\" Neural Computation, vol. 10, no. 5, pp. 1299-1319, 1998.
[5] S. Mika, G. Rätsch, J. Weston, B. Schölkopf, and K.-R. Müller, \"Fisher discriminant analysis with kernels,\" in Neural Networks for Signal Processing IX (Y.-H. Hu, J. Larsen, E. Wilson, and S. Douglas, eds.), pp. 41-48, IEEE, 1999.
[6] V. Roth and V. Steinhage, \"Nonlinear discriminant analysis using kernel functions,\" Tech. Rep. IAI-TR-99-7, Department of Computer Science III, Bonn University, 1999.
[7] V. Roth, A. Pogoda, V. Steinhage, and S. Schröder, \"Pattern recognition combining feature- and pixel-based classification within a real world application,\" in Mustererkennung 1999 (W. Förstner, J. Buhmann, A. Faber, and P. Faber, eds.), Informatik aktuell, pp. 120-129, 21. DAGM Symposium, Bonn, Springer, 1999.
[8] T. Hastie, A. Buja, and R. Tibshirani, \"Penalized discriminant analysis,\" Annals of Statistics, vol. 23, pp. 73-102, 1995.
[9] C. Saunders, A. Gammerman, and V. Vovk, \"Ridge regression learning algorithm in dual variables,\" tech. rep., Royal Holloway, University of London, 1998.
[10] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees. Monterey, CA: Wadsworth and Brooks/Cole, 1984.
[11] T. Hastie and R. Tibshirani, \"Discriminant analysis by Gaussian mixtures,\" JRSSB, vol. 58, pp. 158-176, 1996.
[12] B. Schölkopf, Support Vector Learning. PhD thesis, 1997. R. Oldenbourg Verlag, Munich.
[13] T. Joachims, \"Making large-scale SVM learning practical,\" in Advances in Kernel Methods - Support Vector Learning (B. Schölkopf, C. Burges, and A. Smola, eds.), MIT Press, 1999.
[14] B. Flury, A First Course in Multivariate Statistics. Springer, 1997.
", "award": [], "sourceid": 1736, "authors": [{"given_name": "Volker", "family_name": "Roth", "institution": null}, {"given_name": "Volker", "family_name": "Steinhage", "institution": null}]}