{"title": "Learning Prototype Models for Tangent Distance", "book": "Advances in Neural Information Processing Systems", "page_first": 999, "page_last": 1006, "abstract": null, "full_text": "Learning Prototype Models for Tangent Distance\n\nTrevor Hastie*\nStatistics Department\nSequoia Hall\nStanford University\nStanford, CA 94305\nemail: trevor@playfair.stanford.edu\n\nPatrice Simard\nAT&T Bell Laboratories\nCrawfords Corner Road\nHolmdel, NJ 07733\nemail: patrice@neural.att.com\n\nEduard S\u00e4ckinger\nAT&T Bell Laboratories\nCrawfords Corner Road\nHolmdel, NJ 07733\nemail: edi@neural.att.com\n\nAbstract\n\nSimard, LeCun & Denker (1993) showed that the performance of nearest-neighbor classification schemes for handwritten character recognition can be improved by incorporating invariance to specific transformations in the underlying distance metric, the so-called tangent distance. The resulting classifier, however, can be prohibitively slow and memory intensive because of the large number of prototypes that need to be stored and used in the distance comparisons. In this paper we develop rich models for representing large subsets of the prototypes. These models are either used singly per class, or as basic building blocks in conjunction with the K-means clustering algorithm.\n\n*This work was performed while Trevor Hastie was a member of the Statistics and Data Analysis Research Group, AT&T Bell Laboratories, Murray Hill, NJ 07974.\n\n1 INTRODUCTION\n\nLocal algorithms such as K-nearest neighbor (NN) perform well in pattern recognition, even though they often assume the simplest distance on the pattern space. It has recently been shown (Simard et al. 
1993) that the performance can be further improved by incorporating invariance to specific transformations in the underlying distance metric, the so-called tangent distance. The resulting classifier, however, can be prohibitively slow and memory intensive because of the large number of prototypes that need to be stored and used in the distance comparisons.\n\nIn this paper we address this problem for the tangent distance algorithm by developing rich models for representing large subsets of the prototypes. Our leading example of a prototype model is a low-dimensional (12) hyperplane defined by a point and a set of basis or tangent vectors. The components of these models are learned from the training set, chosen to minimize the average tangent distance from a subset of the training images; as such they are similar in flavor to the Singular Value Decomposition (SVD), which finds closest hyperplanes in Euclidean distance. These models are either used singly per class, or as basic building blocks in conjunction with K-means and LVQ. Our results show that not only are the models effective, but they also have meaningful interpretations. In handwritten character recognition, for instance, the main tangent vector learned for the digit \"2\" corresponds to addition/removal of the loop at the bottom left corner of the digit; for the 9, the fatness of the circle. We can therefore think of some of these learned tangent vectors as representing additional invariances derived from the training digits themselves. Each learned prototype model therefore represents very compactly a large number of prototypes of the training set. 
\n\n2 OVERVIEW OF TANGENT DISTANCE\n\nWhen we look at handwritten characters, we are easily able to allow for simple transformations such as rotations, small scalings, location shifts, and character thickness when identifying the character. Any reasonable automatic scheme should similarly be insensitive to such changes.\n\nSimard et al. (1993) finessed this problem by generating a parametrized 7-dimensional manifold for each image, where each parameter accounts for one such invariance. Consider a single invariance dimension: rotation. If we were to rotate the image by an angle $\\theta$ prior to digitization, we would see roughly the same picture, just slightly rotated. Our images are 16 x 16 grey-scale pixelmaps, which can be thought of as points in a 256-dimensional Euclidean space. The rotation operation traces out a smooth one-dimensional curve $X_i(\\theta)$ with $X_i(0) = X_i$, the image itself. Instead of measuring the distance between two images as $D(X_i, X_j) = \\|X_i - X_j\\|$ (for any norm $\\|\\cdot\\|$), the idea is to use instead the rotation-invariant $D_I(X_i, X_j) = \\min_{\\theta_i, \\theta_j} \\|X_i(\\theta_i) - X_j(\\theta_j)\\|$. Simard et al. (1993) used 7 dimensions of invariance, accounting for horizontal and vertical location and scale, rotation and shear, and character thickness.\n\nComputing the manifold exactly is impossible, given a digitized image, and would be impractical anyway. They approximated the manifold instead by its tangent plane at the image itself, leading to the tangent model $\\tilde{X}_i(\\theta) = X_i + T_i\\theta$, where the columns of $T_i$ are the tangent vectors at $X_i$, and the tangent distance $D_T(X_i, X_j) = \\min_{\\theta_i, \\theta_j} \\|\\tilde{X}_i(\\theta_i) - \\tilde{X}_j(\\theta_j)\\|$. Here we use $\\theta$ for the 7-dimensional parameter, and for convenience drop the tilde. The approximation is valid locally, and thus permits local transformations. 
 Non-local transformations are not interesting anyway (we don't want to flip 6s into 9s, or shrink all digits down to nothing). See S\u00e4ckinger (1992) for further details. If $\\|\\cdot\\|$ is the Euclidean norm, computing the tangent distance is a simple least-squares problem, with solution the square root of the residual sum-of-squares in the regression with response $X_i - X_j$ and predictors $(-T_i : T_j)$.\n\nSimard et al. (1993) used $D_T$ to drive a 1-NN classification rule, and achieved the best error rate so far, 2.6%, on the official test set (2007 examples) of the USPS database. Unfortunately, 1-NN is expensive, especially when the distance function is non-trivial to compute: for each new image classified, one has to compute the tangent distance to each of the training images, and then classify as the class of the closest. Our goal in this paper is to reduce the training set dramatically to a small set of prototype models; classification is then performed by finding the closest prototype.\n\n3 PROTOTYPE MODELS\n\nIn this section we explore some ideas for generalizing the concept of a mean or centroid for a set of images, taking into account the tangent families. Such a centroid model can be used on its own, or else as a building block in a K-means or LVQ algorithm at a higher level. We will interchangeably refer to the images as points (in 256-dimensional space). 
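As a rough illustration of the least-squares computation of tangent distance described in Section 2, the following sketch regresses the response X_i - X_j on the predictors (-T_i : T_j) and returns the root of the residual sum of squares. The function name, the argument names, and the assumption that the tangent bases are supplied as ready-made matrices are ours, not part of the original method description.

```python
import numpy as np

def tangent_distance(x_i, t_i, x_j, t_j):
    # x_i, x_j: image vectors of length d.
    # t_i, t_j: d x k matrices whose columns are tangent vectors (assumed given).
    # Minimize ||x_i + t_i a - x_j - t_j b|| over (a, b) by regressing
    # the response x_i - x_j on the predictors (-t_i : t_j).
    predictors = np.hstack([-t_i, t_j])
    response = x_i - x_j
    coef, *_ = np.linalg.lstsq(predictors, response, rcond=None)
    residual = response - predictors @ coef
    return float(np.sqrt(residual @ residual))

# Toy usage in 3 dimensions with one tangent vector per image.
x_a = np.array([0.0, 0.0, 1.0])
t_a = np.array([[1.0], [0.0], [0.0]])
x_b = np.array([5.0, 0.0, 0.0])
t_b = np.array([[0.0], [1.0], [0.0]])
d_tangent = tangent_distance(x_a, t_a, x_b, t_b)
d_euclid = float(np.linalg.norm(x_a - x_b))
```

In this toy case the tangent vector of the first point absorbs the whole offset along the first coordinate, so the tangent distance is much smaller than the Euclidean distance.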
\nThe centroid of a set of N points in d dimensions minimizes the average squared norm from the points:\n\n$$M = \\arg\\min_M \\sum_{i=1}^N \\|X_i - M\\|^2 \\qquad (1)$$\n\n3.1 TANGENT CENTROID\n\nOne could generalize this definition and ask for the point M that minimizes the average squared tangent distance:\n\n$$M_T = \\arg\\min_M \\sum_{i=1}^N D_T(X_i, M)^2 \\qquad (2)$$\n\nThis appears to be a difficult optimization problem, since computation of tangent distance requires not only the image M but also its tangent basis $T_M$. Thus the criterion to be minimized is\n\n$$\\sum_{i=1}^N \\min_{\\gamma_i, \\theta_i} \\|M + T(M)\\gamma_i - X_i(\\theta_i)\\|^2$$\n\nwhere T(M) produces the tangent basis from M. All but the location tangent vectors are nonlinear functionals of M, and even without this nonlinearity, the problem to be solved is a difficult inverse functional. Fortunately a simple iterative procedure is available where we iteratively average the closest points (in tangent distance) to the current guess.\n\nTangent Centroid Algorithm\n\nInitialize: Set $M = \\frac{1}{N} \\sum_{i=1}^N X_i$, let $T_M = T(M)$ be the derived set of tangent vectors, and $D = \\sum_i D_T(X_i, M)$. Denote the current tangent centroid (tangent family) by $M(\\gamma) = M + T_M\\gamma$.\n\nIterate:\n1. For each i find a $\\hat{\\gamma}_i$ and $\\hat{\\theta}_i$ that solves $\\|M + T_M\\gamma - X_i(\\theta)\\| = \\min_{\\gamma, \\theta}$.\n2. Set $M \\leftarrow \\frac{1}{N} \\sum_{i=1}^N (X_i(\\hat{\\theta}_i) - T_M\\hat{\\gamma}_i)$ and compute the new tangent subspace $T_M = T(M)$.\n3. Compute $D = \\sum_i D_T(X_i, M)$.\n\nUntil: D converges.\n\nNote that the first step in Iterate is available from the computations in the third step. The algorithm divides the parameters into two sets: M in the one, and then $T_M$, $\\gamma_i$ and $\\theta_i$ for each i in the other. It alternates between the two sets, although the computation of $T_M$ given M is not the solution of an optimization problem. 
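The Tangent Centroid Algorithm above might be sketched as follows. This is a minimal illustration under stated assumptions: tangent_fn is a hypothetical stand-in for the operator T that maps an image to its tangent basis (the text does not specify how T is computed), and the stopping rule with tol and max_iter is our own choice.

```python
import numpy as np

def tangent_centroid(images, tangent_fn, max_iter=20, tol=1e-6):
    # images: N x d array; tangent_fn(x) returns a d x k tangent basis at x
    # (tangent_fn is a hypothetical stand-in for the operator T in the text).
    images = np.asarray(images, dtype=float)
    bases = [tangent_fn(x) for x in images]
    m = images.mean(axis=0)            # Initialize: Euclidean centroid
    history = []                       # D evaluated before each update of m
    for _ in range(max_iter):
        t_m = tangent_fn(m)
        k = t_m.shape[1]
        closest, total = [], 0.0
        for x, t_x in zip(images, bases):
            # Minimize ||m + t_m g - x - t_x h|| over (g, h) by regressing
            # m - x on the predictors (-t_m : t_x).
            predictors = np.hstack([-t_m, t_x])
            coef, *_ = np.linalg.lstsq(predictors, m - x, rcond=None)
            g, h = coef[:k], coef[k:]
            moved = x + t_x @ h - t_m @ g   # X_i(theta_i) - T_M gamma_i
            total += np.linalg.norm(m - moved)
            closest.append(moved)
        history.append(total)
        if len(history) > 1 and abs(history[-2] - total) <= tol * max(total, 1.0):
            break
        m = np.mean(closest, axis=0)   # Step 2: average the closest points
    return m, history
```

With a constant tangent basis (for instance a single shift direction shared by all images) the procedure stops after two passes, since the fitted points already average to the current centroid.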
\nIt seems very hard to say anything precise about the convergence or behavior of this algorithm, since the tangent vectors depend on each iterate in a nonlinear way. Our experience has always been that it converges fairly rapidly (< 6 iterations). A potential drawback of this algorithm is that the $T_M$ are not learned, but are implicit in M.\n\n3.2 TANGENT SUBSPACE\n\nRather than define the model as a point and have it generate its own tangent subspace, we can include the subspace as part of the parametrization: $M(\\gamma) = M + V\\gamma$. Then we define this tangent subspace model as the minimizer of\n\n$$MS(M, V) = \\sum_{i=1}^N \\min_{\\gamma_i, \\theta_i} \\|M + V\\gamma_i - X_i(\\theta_i)\\|^2 \\qquad (3)$$\n\nover M and V. Note that V can have an arbitrary number $0 \\le r \\le 256$ of columns, although it does not make sense for r to be too large. An iterative algorithm similar to the tangent centroid algorithm is available, which hinges on the SVD for fitting affine subspaces to a set of points. We briefly review the SVD in this context.\n\nLet X be the N x 256 matrix with rows the vectors $X_i - \\bar{X}$, where $\\bar{X} = \\frac{1}{N} \\sum_{i=1}^N X_i$. Then $\\mathrm{SVD}(X) = UDV^T$ is a unique decomposition with $U_{N \\times R}$ and $V_{256 \\times R}$ the orthonormal left and right matrices of singular vectors, and R = rank(X). $D_{R \\times R}$ is a diagonal matrix of decreasing positive singular values. A pertinent property of the SVD is:\n\nConsider finding the closest affine, rank-r subspace to a set of points, or\n\n$$\\min_{M, V^{(r)}, \\{\\theta_i\\}} \\sum_{i=1}^N \\|X_i - M - V^{(r)}\\theta_i\\|^2$$\n\nwhere $V^{(r)}$ is 256 x r orthonormal. The solution is given by the SVD above, with $M = \\bar{X}$ and $V^{(r)}$ the first r columns of V, and the total squared distance $\\sum_{j=r+1}^R D_{jj}^2$. 
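The SVD property just stated can be checked numerically. The sketch below is illustrative only (the function name and return convention are assumptions): it fits the closest affine rank-r subspace to a set of points and reports the total squared distance as the sum of the squared trailing singular values.

```python
import numpy as np

def fit_affine_subspace(points, r):
    # points: N x d array. Returns the mean, a d x r orthonormal basis,
    # and the total squared distance of the points to the fitted subspace.
    mean = points.mean(axis=0)
    centered = points - mean
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:r].T                        # first r right singular vectors
    residual = float(np.sum(s[r:] ** 2))    # squared singular values beyond r
    return mean, basis, residual

# Four collinear points are fitted exactly by a rank-1 affine subspace.
line_pts = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 0.0],
                     [2.0, 2.0, 0.0], [3.0, 3.0, 0.0]])
line_mean, line_basis, line_resid = fit_affine_subspace(line_pts, 1)
```

For data that do not lie exactly in a rank-r subspace, the returned residual agrees with the sum of squared distances computed directly from the orthogonal projections.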
\n\nThe $V^{(r)}$ are also the largest r principal components or eigenvectors of the covariance matrix of the $X_i$. They give in sequence directions of maximum spread, and for a given digit class can be thought of as class-specific invariances.\n\nWe now present our tangent subspace algorithm for solving (3); for convenience we assume V is rank r for some chosen r, and drop the superscript.\n\nTangent Subspace Algorithm\n\nInitialize: Set $M = \\frac{1}{N} \\sum_{i=1}^N X_i$ and let V correspond to the first r right singular vectors of X. Set $D = \\sum_{j=r+1}^R D_{jj}^2$, and let the current tangent subspace model be $M(\\gamma) = M + V\\gamma$.\n\nIterate:\n1. For each i find that $\\hat{\\theta}_i$ which solves $\\|M(\\gamma) - X_i(\\theta)\\| = \\min_{\\gamma, \\theta}$.\n2. Set $M \\leftarrow \\frac{1}{N} \\sum_{i=1}^N X_i(\\hat{\\theta}_i)$ and replace the rows of X by $X_i(\\hat{\\theta}_i) - M$. Compute the SVD of X, and replace V by the first r right singular vectors.\n3. Compute $D = \\sum_{j=r+1}^R D_{jj}^2$.\n\nUntil: D converges.\n\nThe algorithm alternates between i) finding the closest point in the tangent subspace for each image to the current tangent subspace model, and ii) computing the SVD for these closest points. Each step of the alternation decreases the criterion, which is positive and hence converges to a stationary point of the criterion. In all our examples we found that 12 complete iterations were sufficient to achieve a relative convergence ratio of 0.001.\n\nOne advantage of this approach is that we need not restrict ourselves to a seven-dimensional V; indeed, we have found 12 dimensions has produced the best results. The basis vectors found for each class are interesting to view as images. Figure 1 shows some examples of the basis vectors found, and what kinds of invariances in the images they account for. 
These are digit-specific features; for example, a prominent basis vector for the family of 2s accounts for big versus small loops. Each of the examples shown accounts for a similar digit-specific invariance. None of these changes are accounted for by the 7-dimensional tangent models, which were chosen to be digit nonspecific.\n\nFigure 1: Each column corresponds to a particular tangent subspace basis vector for the given digit. The top image is the basis vector itself, and the remaining 3 images correspond to the 0.1, 0.5 and 0.9 quantiles for the projection indices for the training data for that basis vector, showing a range of image models for that basis, keeping all the others at 0.\n\n4 SUBSPACE MODELS AND K-MEANS CLUSTERING\n\nA natural extension of these single prototype-per-class models is to use them as centroid modules in a K-means algorithm. The extension is obvious, and space permits only a rough description. Given an initial partition of the images in a class into K sets, iterate the following two steps:\n\n1. Fit a separate prototype model to each of the subsets;\n2. Redefine the partition based on closest tangent distance to the prototypes found in step 1.\n\nIn a similar way the tangent centroid or subspace models can be used to seed LVQ algorithms (Kohonen 1989), but so far we have not had much experience with them.\n\n5 RESULTS\n\nTable 1 summarizes the results for some of these models. The first two lines correspond to an SVD model for the images fit by ordinary least squares rather than least tangent squares. The first line classifies using Euclidean distance to this model, the second using tangent distance. 
Line 3 fits a single 12-dimensional tangent subspace model per class, while lines 4 and 5 use 12-dimensional tangent subspaces as cluster centers within each class. We tried other dimensions in a variety of settings, but 12 seemed to be generally the best. Line 6 corresponds to the tangent centroid model used as the centroid in a 20-means cluster model per class; the performance compares with K=3 for the subspace model. Line 7 combines 4 and 6, and reduces the error even further. These limited experiments suggest that the tangent subspace model is preferable, since it is more compact and the algorithm for fitting it is on firmer theoretical grounds.\n\nTable 1: Test errors for a variety of situations. In all cases the training data were 7291 USPS handwritten digits, and the test data the \"official\" 2007 USPS test digits. Each entry describes the model used in each class, so for example in row 5 there are 5 models per class, hence 50 in all.\n\n   Prototype                 Metric     # Prototypes/Class  Error Rate\n0  1-NN                      Euclidean  ~700                0.053\n1  12 dim SVD subspace       Euclidean  1                   0.055\n2  12 dim SVD subspace       Tangent    1                   0.045\n3  12 dim Tangent subspace   Tangent    1                   0.041\n4  12 dim Tangent subspace   Tangent    3                   0.038\n5  12 dim Tangent subspace   Tangent    5                   0.038\n6  Tangent centroid          Tangent    20                  0.038\n7  (4) U (6)                 Tangent    23                  0.034\n8  1-NN                      Tangent    ~700                0.026\n\nFigure 2 shows some of the misclassified examples in the test set. Despite all the matching, it seems that Euclidean distance still fails us in the end in some of these cases. 
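The clustering scheme of Section 4, which underlies the multi-prototype rows of Table 1, alternates between fitting one subspace model per subset and reassigning each image to its closest model. The sketch below is a simplified stand-in rather than the procedure used for the reported results: it uses plain Euclidean distance to an SVD-fitted affine subspace in place of tangent distance, and all function names are hypothetical.

```python
import numpy as np

def fit_subspace(points, r):
    # Closest affine rank-r subspace (Euclidean sense) via the SVD.
    mean = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - mean, full_matrices=False)
    return mean, vt[:r].T

def subspace_distance(x, mean, basis):
    # Euclidean distance from x to the affine subspace mean + span(basis).
    centered = x - mean
    return np.linalg.norm(centered - basis @ (basis.T @ centered))

def fit_all(points, labels, r):
    # One subspace model per non-empty subset of the current partition.
    return {int(k): fit_subspace(points[labels == k], r)
            for k in np.unique(labels)}

def subspace_kmeans(points, labels, r, n_iter=10):
    labels = np.asarray(labels).copy()
    for _ in range(n_iter):
        models = fit_all(points, labels, r)            # step 1: fit prototypes
        keys = sorted(models)
        dists = np.array([[subspace_distance(x, *models[k]) for k in keys]
                          for x in points])
        new_labels = np.array(keys)[dists.argmin(axis=1)]  # step 2: reassign
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, fit_all(points, labels, r)
```

Replacing subspace_distance with a tangent distance to the subspace model, and fit_subspace with the tangent subspace algorithm, would bring the sketch closer to the setting described in the text.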
\n\n6 DISCUSSION\n\nGold, Mjolsness & Rangarajan (1994) independently had the idea of using \"domain specific\" distance measures to seed K-means clustering algorithms. Their setting was slightly different from ours, and they did not use subspace models. The idea of classifying points to the closest subspace is found in the work of Oja (1989), but of course not in the context of tangent distance.\n\nWe are using Euclidean distance in conjunction with tangent distance. Since neighboring pixels are correlated, one might expect that a metric that accounted for the correlation might do better. We tried several variants using Mahalanobis metrics in different ways, but with no success. We also tried to incorporate information about where the images project in the tangent subspace models into the classification rule. We thus computed two distances: 1) tangent distance to the subspace, and 2) Mahalanobis distance within the subspace to the centroid for the subspace. Again the best performance was attained by ignoring the latter distance.\n\nFigure 2: Some of the errors for the test set corresponding to line (3) of Table 1. Each case is displayed as a column of three images. The top is the true image, the middle the tangent projection of the true image onto the subspace model of its class, the bottom image the tangent projection of the image onto the winning class. The models are sufficiently rich to allow distortions that can fool Euclidean distance. [Panel labels: true digits 6, 2, 5, 2, 9, 4; predicted (projection) digits 0, 0, 8, 0, 4, 7.]\n\nIn conclusion, learning tangent centroid and subspace models is an effective way to reduce the number of prototypes (and thus the cost in speed and memory) at a slight expense in the performance. In the extreme case, as little as one 12-dimensional tangent subspace per class and the tangent distance is enough to outperform classification using ~700 prototypes per class and the Euclidean distance (4.1% versus 5.3% on the test data).\n\nReferences\n\nGold, S., Mjolsness, E. & Rangarajan, A. (1994), Clustering with a domain specific distance measure, in 'Advances in Neural Information Processing Systems', Morgan Kaufmann, San Mateo, CA.\n\nKohonen, T. (1989), Self-Organization and Associative Memory (3rd edition), Springer-Verlag, Berlin.\n\nOja, E. (1989), 'Neural networks, principal components, and subspaces', International Journal of Neural Systems 1(1), 61-68.\n\nS\u00e4ckinger, E. (1992), Recurrent networks for elastic matching in pattern recognition, Technical report, AT&T Bell Laboratories.\n\nSimard, P. Y., LeCun, Y. & Denker, J. (1993), Efficient pattern recognition using a new transformation distance, in 'Advances in Neural Information Processing Systems', Morgan Kaufmann, San Mateo, CA, pp. 50-58.\n", "award": [], "sourceid": 939, "authors": [{"given_name": "Trevor", "family_name": "Hastie", "institution": null}, {"given_name": "Patrice", "family_name": "Simard", "institution": null}]}