{"title": "Adaptive Nearest Neighbor Classification Using Support Vector Machines", "book": "Advances in Neural Information Processing Systems", "page_first": 665, "page_last": 672, "abstract": null, "full_text": "Adaptive Nearest Neighbor Classification using Support Vector Machines

Carlotta Domeniconi, Dimitrios Gunopulos
Dept. of Computer Science, University of California, Riverside, CA 92521
{carlotta, dg}@cs.ucr.edu

Abstract

The nearest neighbor technique is a simple and appealing method to address classification problems. It relies on the assumption of locally constant class conditional probabilities. This assumption becomes invalid in high dimensions with a finite number of examples, due to the curse of dimensionality. We propose a technique that computes a locally flexible metric by means of Support Vector Machines (SVMs). The maximum margin boundary found by the SVM is used to determine the most discriminant direction over the query's neighborhood. This direction provides a local weighting scheme for input features. We present experimental evidence of classification performance improvement over the SVM algorithm alone and over a variety of adaptive learning schemes, using both simulated and real data sets.

1 Introduction

In a classification problem we are given J classes and l training observations. The training observations consist of n feature measurements x = (x_1, ..., x_n)^T ∈ ℜ^n and the known class labels j = 1, ..., J. The goal is to predict the class label of a given query q.

The K nearest neighbor classification method [4, 13, 16] is a simple and appealing approach to this problem: it finds the K nearest neighbors of q in the training set, and then predicts the class label of q as the most frequent one occurring among the K neighbors. It has been shown [5, 8] that the one nearest neighbor rule has an asymptotic error rate that is at most twice the Bayes error rate, independent of the distance metric used. The nearest neighbor rule becomes less appealing with finite training samples, however. This is due to the curse of dimensionality [2]. Severe bias can be introduced in the nearest neighbor rule in a high dimensional input feature space with finite samples. As such, the choice of a distance measure becomes crucial in determining the outcome of nearest neighbor classification. The commonly used Euclidean distance implies that the input space is isotropic, which is often invalid and generally undesirable in many practical applications.

Several techniques [9, 10, 7] have been proposed to try to minimize bias in high dimensions by using locally adaptive mechanisms. The "lazy learning" approach used by these methods, while appealing in many ways, requires a considerable amount of on-line computation, which makes it difficult for such techniques to scale up to large data sets. The feature weighting scheme they introduce, in fact, is query based and is applied on-line when the test point is presented to the "lazy learner".
In this paper we propose a locally adaptive metric classification method which, although still founded on a query based weighting mechanism, computes off-line the information relevant to define local weights.

Our technique uses support vector machines (SVMs) as a guide for the process of defining a locally flexible metric. SVMs have been successfully used as a classification tool in a variety of areas [11, 3, 14], and the maximum margin boundary they provide has been proved to be optimal in a structural risk minimization sense. The solid theoretical foundations that have inspired SVMs convey desirable computational and learning theoretic properties to the SVM's learning algorithm, and therefore SVMs are a natural choice for seeking local discriminant directions between classes.

The solution provided by SVMs allows us to determine locations in input space where class conditional probabilities are likely not to be constant, and guides the extraction of local information in such areas. This process produces highly stretched neighborhoods along boundary directions when the query is close to the boundary. As a result, the class conditional probabilities tend to be constant in the modified neighborhoods, whereby better classification performance can be achieved. The amount of elongation-constriction decays as the query moves farther from the boundary vicinity.

2 Feature Weighting

SVMs classify patterns according to sign(f(x)), where f(x) = Σ_{i=1}^{l} α_i y_i K(x_i, x) - b, K(x, y) = φ(x)^T · φ(y) is the kernel function, and φ: ℜ^n → ℜ^N is a mapping of the input vectors into a higher dimensional feature space. Here we assume x_i ∈ ℜ^n, i = 1, ..., l, and y_i ∈ {-1, 1}. Clearly, in the general case of a non-linear feature mapping φ, the SVM classifier gives a non-linear boundary f(x) = 0 in input space. The gradient vector n_d = ∇_d f, computed at any point d of the level curve f(x) = 0, gives the direction perpendicular to the decision boundary in input space at d. As such, the vector n_d identifies the orientation in input space along which the projected training data are well separated, locally over d's neighborhood. Therefore, the orientation given by n_d, and any orientation close to it, is highly informative for the classification task at hand, and we can use such information to define a local measure of feature relevance.

Let q be a query point whose class label we want to predict. Suppose q is close to the boundary, which is where class conditional probabilities become locally non-uniform, and therefore where estimation of local feature relevance becomes crucial. Let d be the closest point to q on the boundary f(x) = 0: d = argmin_p ||q - p||, subject to the constraint f(p) = 0. Then we know that the gradient n_d identifies a direction along which data points between classes are well separated.

As a consequence, the subspace spanned by the orientation n_d, locally at q, is likely to contain points having the same class label as q.
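For concreteness, both the decision function and the gradient n_d can be written directly from the formulas above for the radial basis kernel K(s, x) = exp(-γ||s - x||^2) used in our experiments. The following is a minimal NumPy sketch, not the SVM-light implementation used in the paper; the names sv, coef, b, and gamma stand for the support vectors s_i, the products α_i y_i, the bias, and the kernel width of an already trained SVM.

```python
import numpy as np

def decision_function(x, sv, coef, b, gamma):
    """f(x) = sum_i alpha_i * y_i * K(s_i, x) - b with the RBF kernel
    K(s, x) = exp(-gamma * ||s - x||^2).
    sv: (m, n) support vectors; coef: (m,) products alpha_i * y_i."""
    k = np.exp(-gamma * np.sum((sv - x) ** 2, axis=1))
    return np.dot(coef, k) - b

def gradient(x, sv, coef, gamma):
    """n_d = grad f evaluated at x; for the RBF kernel,
    d/dx exp(-gamma * ||s - x||^2) = 2 * gamma * (s - x) * exp(-gamma * ||s - x||^2)."""
    k = np.exp(-gamma * np.sum((sv - x) ** 2, axis=1))
    return 2.0 * gamma * np.dot(coef * k, sv - x)
```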
Therefore, when applying a nearest neighbor rule at q, we desire to stay close to q along the n_d direction, because that is where it is likely to find points similar to q in terms of class posterior probabilities. Distances should be constricted (large weight) along n_d and along directions close to it. The farther we move from the n_d direction, the less discriminant the corresponding orientation becomes. This means that class labels are likely not to change along those orientations, and distances should be elongated (small weight), thus including in q's neighborhood points which are likely to be similar to q in terms of the class conditional probabilities.

Formally, we can measure how close a direction t is to n_d by considering the dot product n_d · t. In particular, denoting with u_j the unit vector along input feature j, for j = 1, ..., n, we can define a measure of relevance for feature j, locally at q (and therefore at d), as R_j(q) ≡ |u_j · n_d| = |n_{d,j}|, where n_d = (n_{d,1}, ..., n_{d,n})^T.

This measure of feature relevance can be turned into a weighting scheme by means of the following exponential weighting: w_j(q) = exp(λ R_j(q)) / Σ_{i=1}^{n} exp(λ R_i(q)), where λ is a parameter that can be chosen to maximize (or minimize) the influence of R_j on w_j. When λ = 0 we have w_j = 1/n, thereby ignoring any difference between the R_j's. On the other hand, when λ is large, a change in R_j is exponentially reflected in w_j. The exponential weighting scheme conveys stability to the method by preventing neighborhoods from extending infinitely in any direction; this is achieved by avoiding zero weights, which would instead be allowed by linear or quadratic weightings. The resulting w_j can then be used as feature weights in the weighted distance computation D(x, y) = sqrt(Σ_{i=1}^{n} w_i (x_i - y_i)^2). These weights enable the neighborhood to elongate along less important feature dimensions and, at the same time, to constrict along the most influential ones. Note that the technique is query-based, because the weights depend on the query.

3 Local Flexible Metric Classification based on SVMs

To estimate the orientation of local boundaries, we move from the query point along the input axes at distances proportional to a given small step (whose initial value can be arbitrarily small, and which is doubled at each iteration until the boundary is crossed). We stop as soon as the boundary is crossed along an input axis i, i.e., when a point p_i is reached that satisfies the condition sign(f(q)) × sign(f(p_i)) = -1. Given p_i, we can get arbitrarily close to the boundary by moving in (arbitrarily) small steps along the segment that joins p_i to q.

Let us denote with d_i the intercepted point on the boundary along direction i. We then approximate n_d with the gradient vector n_{d_i} = ∇_{d_i} f, computed at d_i; a sketch of the resulting per-query procedure is given below.

We desire that the parameter λ in the exponential weighting scheme increases as the distance of q from the boundary decreases.
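The steps described so far (searching along the input axes for a boundary crossing, approaching the boundary along the segment joining p_i to q, taking the gradient at the intercepted point, and turning its components into relevance values and exponential weights for a weighted nearest neighbor rule) can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: f and grad are a decision function and its gradient such as those sketched in Section 2, λ (lam) is taken as a given input (how it is set from the support vectors is described next), and bisection is used here as one way of moving arbitrarily close to the boundary.

```python
import numpy as np

def boundary_point(q, f, step0=1e-3, max_doublings=30, tol=1e-6):
    """Search from q along each input axis, doubling the step until the sign of f
    changes; then bisect between q and the crossing point p_i to approach the
    boundary, and return the intercepted point d_i closest to q (or None)."""
    q = np.asarray(q, dtype=float)
    fq = np.sign(f(q))
    best, best_dist = None, np.inf
    for i in range(len(q)):
        for direction in (+1.0, -1.0):
            step, p = step0, None
            for _ in range(max_doublings):
                cand = q.copy()
                cand[i] += direction * step
                if np.sign(f(cand)) * fq == -1:   # boundary crossed along axis i
                    p = cand
                    break
                step *= 2.0
            if p is None:
                continue
            lo, hi = q.copy(), p                  # f changes sign between lo and hi
            while np.linalg.norm(hi - lo) > tol:
                mid = 0.5 * (lo + hi)
                if np.sign(f(mid)) * fq == -1:
                    hi = mid
                else:
                    lo = mid
            d_i = 0.5 * (lo + hi)
            dist = np.linalg.norm(d_i - q)
            if dist < best_dist:
                best, best_dist = d_i, dist
    return best

def feature_weights(q, f, grad, lam):
    """R_j(q) = |n_{d,j}| and w_j(q) = exp(lam*R_j) / sum_i exp(lam*R_i)."""
    d = boundary_point(q, f)
    if d is None:                                 # no crossing found: uniform weights
        return np.full(len(q), 1.0 / len(q))
    r = np.abs(grad(d))
    e = np.exp(lam * r)
    return e / e.sum()

def weighted_knn_label(q, X, y, w, k):
    """K-nearest-neighbor vote under D(x, q) = sqrt(sum_j w_j * (x_j - q_j)^2)."""
    dist = np.sqrt(((X - q) ** 2 * w).sum(axis=1))
    nn = y[np.argsort(dist)[:k]]
    vals, counts = np.unique(nn, return_counts=True)
    return vals[np.argmax(counts)]
```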
By using the knowledge that support vectors are mostly located around the boundary surface, we can estimate how close a query point q is to the boundary by computing its distance from the closest non-bounded support vector: B_q = min_{s_i} ||q - s_i||, where the minimum is taken over the non-bounded (0 < α_i < C) support vectors s_i. Following the same principle, in [1] the spatial resolution around the boundary is increased by enlarging volume elements locally in neighborhoods of support vectors. We can then achieve our goal by setting λ = D - B_q, where D is a constant input parameter of the algorithm. In our experiments we set D equal to the approximate average distance between the training points x_k and the boundary: D = (1/l) Σ_{x_k} min_{s_i} ||x_k - s_i||. If λ becomes negative, it is set to zero.

By doing so, the value of λ nicely adapts to each query point according to its location with respect to the boundary. The closer q is to the decision boundary, the stronger the effect of the R_j values on the distance computation.

We observe that this principled guideline for setting the parameters of our technique takes advantage of the sparseness of the solution provided by the SVM. In fact, for each query point q, in order to compute B_q we only need to consider the support vectors, whose number is typically small compared to the total number of training examples. Furthermore, the computation of D's value is carried out only once and off-line.

Input: decision boundary f(x) = 0 produced by an SVM; query point q and parameter K.
1. Compute the approximated closest point d_i to q on the boundary;
2. Compute the gradient vector n_{d_i} = ∇_{d_i} f;
3. Set feature relevance values R_j(q) = |n_{d_i,j}| for j = 1, ..., n;
4. Estimate the distance of q from the boundary as B_q = min_{s_i} ||q - s_i||;
5. Set λ = D - B_q, where D = (1/l) Σ_{x_k} min_{s_i} ||x_k - s_i||;
6. Set w_j(q) = exp(λ R_j(q)) / Σ_{i=1}^{n} exp(λ R_i(q)), for j = 1, ..., n;
7. Use the resulting w for K-nearest neighbor classification at the query point q.

Figure 1: The LFM-SVM algorithm.

The resulting local flexible metric technique based on SVMs (LFM-SVM) is summarized in Figure 1. The algorithm has only one adjustable tuning parameter, namely the number K of neighbors in the final nearest neighbor rule. This parameter is common to all nearest neighbor classification techniques.

4 Experimental Results

In the following we compare several classification methods using both simulated and real data: (1) The LFM-SVM algorithm described in Figure 1; SVM-light [12] with radial basis kernels is used to build the SVM classifier. (2) The RBF-SVM classifier with radial basis kernels; we used SVM-light [12], and set the value of γ in K(x_i, x) = exp(-γ||x_i - x||^2) equal to the optimal one determined via cross-validation. The value of C for the soft-margin classifier is also optimized via cross-validation. The output of this classifier is the input of LFM-SVM. (3) ADAMENN, the adaptive metric nearest neighbor technique [7], which
It \nuses  the  Chi-squared distance in  order to estimate  to which extent each  dimension \ncan  be  relied  on  to  predict  class  posterior  probabilities;  (4)  Machete  [9].  It is  a \nrecursive  partitioning  procedure,  in  which  the  input  variable  used  for  splitting  at \neach step  is  the  one  that  maximizes  the  estimated  local  relevance.  Such  relevance \nis measured in terms of the improvement in squared prediction error each feature is \ncapable to provide;  (5)  Scythe [9].  It is a generalization of the machete algorithm, in \nwhich the input variables influence each split in proportion to their estimated local \nrelevance;  (6)  DANN-discriminant  adaptive  nearest  neighbor  classification  [10].  It \nis  an adaptive  nearest  neighbor  classification method  based on linear  discriminant \nanalysis.  It computes  a  distance  metric  as  a  product  of properly weighted  within \nand between sum of squares matrices;  (7)  Simple K-NN method using the Euclidean \ndistance measure;  (8)  C4.5  decision  tree method  [15]. \nIn  all  the  experiments,  the  features  are first  normalized  over  the  training  data to \nhave zero mean and unit  variance, and the test data features  are normalized using \nthe  corresponding training  mean  and  variance.  Procedural  parameters  (including \n\n\fK) for  each method were  determined empirically through cross-validation. \n\n4.1  Experiments on Simulated Data \n\nFor all simulated data,  10 independent training samples of size  200  were generated. \nFor  each  of these,  an  additional  independent  test  sample  consisting  of  200  obser(cid:173)\nvations  was  generated.  These  test  data were  classified  by  each  competing method \nusing  the  respective  training  data  set.  Error  rates  computed  over  all  2,000  such \nclassifications  are reported in Table  1. \n\nThe  Problems.  (1)  Multi-Gaussians.  The  data  set  consists  of  n  =  2  input \nfeatures,  l  =  200  training data,  and  J  =  2 classes.  Each class  contains  two  spher(cid:173)\nical  bivariate  normal  subclasses,  having  standard  deviation  1.  The  mean  vectors \nfor  one  class  are  (-3/4, -3)  and  (3/4,3);  whereas  for  the  other  class  are  (3, -3) \nand  (-3,3).  For  each  class,  data are  evenly  drawn  from  each  of the  two  normal \nsubclasses.  The first  column  of Table  1  shows  the  results  for  this  problem.  The \nstandard  deviations  are:  0.17,  0.01,  0.01,  0.01,  0.01  0.01,  0.01  and  1.50,  respec(cid:173)\ntively.  (2)  Noisy-Gaussians.  The data for  this  problem  are  generated  as  in  the \nprevious  example,  but  augmented  with  four  predictors  having  independent  stan(cid:173)\ndard Gaussian distributions.  They serve  as  noise.  Results  are shown in  the second \ncolumn  of Table  1.  The standard deviations  are:  0.18,  0.01,  0.02,  0.01,  0.01,  0.01, \n0.01  and 1.60,  respectively. \n\nResults.  Table 1 shows that all methods have similar performances for  the Multi(cid:173)\nGaussians  problem,  with  C4.5  being  the  worst  performer.  When  the  noisy  pre(cid:173)\ndictors  are  added  to  the  problem  (NoisyGaussians),  we  observe  different  levels  of \ndeterioration in performance among the eight methods.  LFM-SVM shows the most \nrobust  behavior  in  presence  of  noise.  K-NN  is  instead  the  worst  performer.  
In Figure 2 we plot the performance of LFM-SVM and RBF-SVM as a function of an increasing number of noisy features (for the same Multi-Gaussians problem). The standard deviations for RBF-SVM (in order of increasing number of noisy features) are 0.01, 0.01, 0.03, 0.03, 0.03 and 0.03; the standard deviations for LFM-SVM are 0.17, 0.18, 0.2, 0.3, 0.3 and 0.3. The LFM-SVM technique shows a considerable improvement over RBF-SVM as the amount of noise increases.

Figure 2: Average error rates of LFM-SVM and RBF-SVM as a function of an increasing number of noisy predictors.

Table 1: Average classification error rates for simulated and real data.

            MultiGauss  NoisyGauss  Iris  Sonar  Liver  Vote  Breast   OQ  Pima
LFM-SVM        3.3         3.4       4.0   11.0   28.1   2.6    3.0    3.5  19.3
RBF-SVM        3.3         4.1       4.0   12.0   26.1   3.0    3.1    3.4  21.3
ADAMENN        3.4         4.1       3.0    9.1   30.7   3.0    3.2    3.1  20.4
Machete        3.4         4.3       5.0   21.2   27.5   3.4    3.5    7.4  20.4
Scythe         3.4         4.8       4.0   16.3   27.5   3.4    2.7    5.0  20.0
DANN           3.7         4.7       6.0    7.8   30.1   1.1    2.2    4.0  22.2
K-NN           3.3         7.0       6.0   12.5   32.5   3.0    2.7    5.4  24.2
C4.5           5.0         5.1       8.0   23.1   38.3   3.4    4.1    9.2  23.8

4.2 Experiments on Real Data

In our experiments we used seven different real data sets, all taken from the UCI Machine Learning Repository at http://www.cs.uci.edu/~mlearn/MLRepository.html. For a description of the data sets see [6]. For the Iris, Sonar, Liver and Vote data we perform leave-one-out cross-validation to measure performance, since the number of available data is limited for these data sets. For the Breast, OQ-letter and Pima data we randomly generated five independent training sets of size 200; for each of these, an additional independent test sample consisting of 200 observations was generated. Table 1 (columns 3-9) shows the cross-validated error rates for the eight methods under consideration on the seven real data sets. The standard deviation values are as follows. Breast data: 0.2, 0.2, 0.2, 0.2, 0.2, 0.9, 0.9 and 0.9, respectively. OQ data: 0.2, 0.2, 0.2, 0.3, 0.2, 1.1, 1.5 and 2.1, respectively. Pima data: 0.4, 0.4, 0.4, 0.4, 0.4, 2.4, 2.1 and 0.7, respectively.

Figure 3: Performance distributions for real data.

Results. Table 1 shows that LFM-SVM achieves the best performance on 2/7 of the real data sets; in one case it shows the second best performance, and in the remaining four its error rate is still quite close to the best one.
Following Friedman [9], we capture robustness by computing the ratio b_m of the error rate e_m of method m to the smallest error rate over all methods being compared on a particular example: b_m = e_m / min_{1≤k≤8} e_k.

Figure 3 plots the distribution of b_m for each method over the seven real data sets. The dark area represents the lower and upper quartiles of the distribution, which are separated by the median; the outer vertical lines show the entire range of values for the distribution. The spread of the error distribution for LFM-SVM is narrow and close to one. The results clearly demonstrate that LFM-SVM (and ADAMENN) obtained the most robust performance over these data sets.

The poor performance of the machete and C4.5 methods might be due to the greedy strategy they employ. Such a recursive peeling strategy removes, at each step, a subset of data points permanently from further consideration. As a result, changes in an early split, due to any variability in parameter estimates, can have a significant impact on later splits, thereby producing different terminal regions. This makes predictions highly sensitive to the sampling fluctuations associated with the random nature of the process that produces the training data, thus leading to high variance predictions. The scythe algorithm, by relaxing the winner-take-all splitting strategy of the machete algorithm, mitigates the greedy nature of the approach, and thereby achieves better performance.

In [10], the authors show that the metric employed by the DANN algorithm approximates the weighted Chi-squared distance, given that class densities are Gaussian and have the same covariance matrix. As a consequence, we may expect a degradation in performance when the data do not follow Gaussian distributions and are corrupted by noise, which is likely the case in real scenarios like the ones tested here.

We observe that the sparse solution given by SVMs provides LFM-SVM with principled guidelines for efficiently setting its input parameters. This is an important advantage over ADAMENN, which has six tunable input parameters. Furthermore, LFM-SVM speeds up the classification process since it applies the nearest neighbor rule only once, whereas ADAMENN applies it at each point within a region centered at the query. We also observe that the construction of the SVM for LFM-SVM is carried out off-line and only once, and there exist algorithmic and computational results which make SVM training practical also for large-scale problems [12].

LFM-SVM offers performance improvements over the RBF-SVM algorithm alone, for both the (noisy) simulated and the real data sets. The reason for this performance gain may lie in the effect of our local weighting scheme on the separability of classes, and therefore on the margin, as shown in [6]. Assigning large weights to input features close to the gradient direction, locally in neighborhoods of support vectors, corresponds to increasing the spatial resolution along those orientations, and therefore to improving the separability of classes. As a consequence, better classification results can be achieved, as demonstrated in our experiments.
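The robustness summary behind Figure 3 amounts to a simple column-wise computation over the error rates of Table 1. The following NumPy sketch is only illustrative; the array layout and function name are assumptions, not part of the paper.

```python
import numpy as np

def robustness_ratios(error_rates):
    """b_m = e_m / min_k e_k, computed per data set from a (methods x data sets)
    array of error rates; Figure 3 shows, for each method, the quartiles and the
    range of these ratios across data sets."""
    e = np.asarray(error_rates, dtype=float)
    b = e / e.min(axis=0)            # ratio to the best method on each data set
    q1, med, q3 = np.percentile(b, [25, 50, 75], axis=1)
    return b, q1, med, q3
```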
5 Related Work

In [1], Amari and Wu improve support vector machine classifiers by modifying kernel functions. A primary kernel is first used to obtain support vectors. The kernel is then modified in a data dependent way by using the support vectors: the factor that drives the transformation has larger values at positions close to support vectors. The modified kernel enlarges the spatial resolution around the boundary, so that the separability of classes is increased.

The resulting transformation depends on the distance of data points from the support vectors, and it is therefore a local transformation, but it is independent of the boundary's orientation in input space. Likewise, our transformation metric depends, through the factor λ, on the distance of the query point from the support vectors. Moreover, since we weight features, our metric is directional, and depends on the orientation of local boundaries in input space. This dependence is driven by our measure of feature relevance, which has the effect of increasing the spatial resolution along discriminant directions around the boundary.

6 Conclusions

We have described a locally adaptive metric classification method and demonstrated its efficacy through experimental results. The proposed technique offers performance improvements over the SVM alone, and has the potential of scaling up to large data sets. In fact, it speeds up the classification process by computing off-line the information relevant to define local weights, and by applying the nearest neighbor rule only once.

Acknowledgments

This research has been supported by the National Science Foundation under grants NSF CAREER Award 9984729 and NSF IIS-9907477, by the US Department of Defense, and by a research award from AT&T.

References

[1] S. Amari and S. Wu, "Improving support vector machine classifiers by modifying kernel functions", Neural Networks, 12, pp. 783-789, 1999.
[2] R.E. Bellman, Adaptive Control Processes. Princeton Univ. Press, 1961.
[3] M. Brown, W. Grundy, D. Lin, N. Cristianini, C. Sugnet, T. Furey, M. Ares, and D. Haussler, "Knowledge-based analysis of microarray gene expression data using support vector machines", Tech. Report, University of California, Santa Cruz, 1999.
[4] W.S. Cleveland and S.J. Devlin, "Locally Weighted Regression: An Approach to Regression Analysis by Local Fitting", J. Amer. Statist. Assoc., 83, pp. 596-610, 1988.
[5] T.M. Cover and P.E. Hart, "Nearest Neighbor Pattern Classification", IEEE Trans. on Information Theory, pp. 21-27, 1967.
[6] C. Domeniconi and D. Gunopulos, "Adaptive Nearest Neighbor Classification using Support Vector Machines", Tech. Report UCR-CSE-01-04, Dept. of Computer Science, University of California, Riverside, June 2001.
[7] C. Domeniconi, J. Peng, and D. Gunopulos, "An Adaptive Metric Machine for Pattern Classification", Advances in Neural Information Processing Systems, 2000.
[8] R.O. Duda and P.E. Hart, Pattern Classification and Scene Analysis. John Wiley & Sons, Inc., 1973.
[9] J.H. Friedman, "Flexible Metric Nearest Neighbor Classification", Tech. Report, Dept. of Statistics, Stanford University, 1994.
[10] T. Hastie and R. Tibshirani, "Discriminant Adaptive Nearest Neighbor Classification", IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 18, No. 6, pp. 607-615, 1996.
[11] T. Joachims, "Text categorization with support vector machines", Proc. of the European Conference on Machine Learning, 1998.
[12] T. Joachims, "Making large-scale SVM learning practical", Advances in Kernel Methods - Support Vector Learning, B. Schölkopf, C. Burges, and A. Smola (eds.), MIT Press, 1999. http://www-ai.cs.uni-dortmund.de/thorsten/svm_light.html
[13] D.G. Lowe, "Similarity Metric Learning for a Variable-Kernel Classifier", Neural Computation, 7(1):72-85, 1995.
[14] E. Osuna, R. Freund, and F. Girosi, "Training support vector machines: An application to face detection", Proc. of Computer Vision and Pattern Recognition, 1997.
[15] J.R. Quinlan, C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, Inc., 1993.
[16] C.J. Stone, Nonparametric regression and its applications (with discussion). Ann. Statist., 5, 595, 1977.
", "award": [], "sourceid": 2054, "authors": [{"given_name": "Carlotta", "family_name": "Domeniconi", "institution": null}, {"given_name": "Dimitrios", "family_name": "Gunopulos", "institution": null}]}