{"title": "A Comparison of Dynamic Reposing and Tangent Distance for Drug Activity Prediction", "book": "Advances in Neural Information Processing Systems", "page_first": 216, "page_last": 223, "abstract": null, "full_text": "A Comparison of Dynamic Reposing and Tangent Distance for Drug Activity Prediction\n\nThomas G. Dietterich\nArris Pharmaceutical Corporation and Oregon State University\nCorvallis, OR 97331-3202\n\nAjay N. Jain\nArris Pharmaceutical Corporation\n385 Oyster Point Blvd., Suite 3\nSouth San Francisco, CA 94080\n\nRichard H. Lathrop and Tomas Lozano-Perez\nArris Pharmaceutical Corporation and MIT Artificial Intelligence Laboratory\n545 Technology Square\nCambridge, MA 02139\n\nAbstract\n\nIn drug activity prediction (as in handwritten character recognition), the features extracted to describe a training example depend on the pose (location, orientation, etc.) of the example. In handwritten character recognition, one of the best techniques for addressing this problem is the tangent distance method of Simard, LeCun and Denker (1993). Jain et al. (1993a; 1993b) introduce a new technique, dynamic reposing, that also addresses this problem. Dynamic reposing iteratively learns a neural network and then reposes the examples in an effort to maximize the predicted output values. New models are trained and new poses computed until models and poses converge. This paper compares dynamic reposing to the tangent distance method on the task of predicting the biological activity of musk compounds. 
In a 20-fold cross-validation, dynamic reposing attains 91% correct compared to 79% for the tangent distance method, 75% for a neural network with standard poses, and 75% for the nearest neighbor method.\n\n1 INTRODUCTION\n\nThe task of drug activity prediction is to predict the activity of proposed drug compounds by learning from the observed activity of previously synthesized drug compounds. Accurate drug activity prediction can save substantial time and money by focusing the efforts of chemists and biologists on the synthesis and testing of compounds whose predicted activity is high. If the requirements for highly active binding can be displayed in three dimensions, chemists can work from such displays to design new compounds having high predicted activity.\n\nDrug molecules usually act by binding to localized sites on large receptor molecules or large enzyme molecules. One reasonable way to represent drug molecules is to capture the location of their surface in the (fixed) frame of reference of the (hypothesized) binding site. By learning constraints on the allowed location of the molecular surface (and important charged regions on the surface), a learning algorithm can form a model of the binding site that can yield accurate predictions and support drug design.\n\nThe training data for drug activity prediction consist of molecules (described by their structures, i.e., bond graphs) and measured binding activities. Two complications make it difficult to learn binding site models from such data.\n\nFirst, the bond graph does not uniquely determine the shape of the molecule. 
The bond graph can be viewed as specifying a (possibly cyclic) kinematic chain which may have several internal degrees of freedom (i.e., rotatable bonds). The conformations that the graph can adopt, when it is embedded in 3-space, can be assigned energies that depend on such intramolecular interactions as the Coulomb attraction, the van der Waals force, internal hydrogen bonds, and hydrophobic interactions. Algorithms exist for searching through the space of conformations to find local minima having low energy (these are called \"conformers\"). Even relatively rigid molecules may have tens or even hundreds of low-energy conformers. The training data do not indicate which of these conformers is the \"bioactive\" one, that is, the conformer that binds to the binding site and produces the observed binding activity.\n\nSecond, even if the bioactive conformer were known, the features describing the molecular surface change as the molecule rotates and translates (rigidly) in space, because they are measured in the frame of reference of the binding site.\n\nHence, if we consider feature space, each training example (bond graph) induces a family of 6-dimensional manifolds. Each manifold corresponds to one conformer as it rotates and translates (6 degrees of freedom) in space. For a classification task, a positive decision region for \"active\" molecules would be a region that intersects at least one manifold of each active molecule and no manifolds of any inactive molecules. Finding such a decision region is quite difficult, because the manifolds are difficult to compute.\n\nA similar \"feature manifold problem\" arises in handwritten character recognition. 
There, the training examples are labelled handwritten digits, the features are extracted by taking a digitized gray-scale picture, and the feature values depend on the rotation, translation, and zoom of the camera with respect to the character.\n\nWe can formalize this situation as follows. Let $x_i$, $i = 1, \\ldots, N$, be training examples (i.e., bond graphs or physical handwritten digits), and let $f(x_i)$ be the label associated with $x_i$ (i.e., the measured activity of the molecule or the identity of the handwritten digit). Suppose we extract $n$ real-valued features $V(x_i)$ to describe object $x_i$ and then employ, for example, a multilayer sigmoid network to approximate $f(x)$ by $\\hat{f}(x) = g(V(x))$. This is the ordinary supervised learning task.\n\nHowever, the feature manifold problem arises when the extracted features depend on the \"pose\" of the example. We will define the pose to be a vector $p$ of parameters that describe, for example, the rotation, translation, and conformation of a molecule or the rotation, translation, scale, and line thickness of a handwritten digit. In this case, the feature vector $V(x, p)$ depends on both the example and the pose.\n\nWithin the handwritten character recognition community, several techniques have been developed for dealing with the feature manifold problem. Three existing approaches are standardized poses, the tangent-prop method, and the tangent-distance method. Jain et al. (1993a, 1993b) describe a new method, dynamic reposing, that applies supervised learning simultaneously to discover the \"best\" pose $p_i^*$ of each training example $x_i$ and also to learn an approximation to the unknown function $f(x)$ as $\\hat{f}(x_i) = g(V(x_i, p_i^*))$. 
In this paper, we briefly review each of these methods and then compare the performance of standardized poses, tangent distance, and dynamic reposing on the problem of predicting the activity of musk molecules.\n\n2 FOUR APPROACHES TO THE FEATURE MANIFOLD PROBLEM\n\n2.1 STANDARDIZED POSES\n\nThe simplest approach is to select only one of the feature vectors $V(x_i, p_i)$ for each example by constructing a function, $p_i = S(x_i)$, that computes a standard pose for each object. Once $p_i$ is chosen for each example, we have the usual supervised learning task: each training example has a unique feature vector, and we can approximate $f$ by $\\hat{f}(x) = g(V(x, S(x)))$.\n\nThe difficulty is that $S$ can be very hard to design. In optical character recognition, $S$ typically works by computing some pose-invariant properties (e.g., principal axes of a circumscribing ellipse) of $x_i$ and then choosing $p_i$ to translate, rotate, and scale $x_i$ to give these properties standard values. Errors committed by OCR algorithms can often be traced to errors in the $S$ function, so that characters are incorrectly positioned for recognition.\n\nIn drug activity prediction, the standardizing function $S$ must guess which conformer is the bioactive conformer. This is exceedingly difficult to do without additional information (e.g., 3-D atom coordinates of the molecule bound in the binding site as determined by x-ray crystallography). In addition, $S$ must determine the orientation of the bioactive conformers within the binding site. This is also quite difficult: the bioactive conformers must be mutually aligned so that shared potential chemical interactions (e.g., hydrogen bond donors) are superimposed. 
\n\n2.2 TANGENT PROPAGATION\n\nThe tangent-prop approach (Simard, Victorri, LeCun, & Denker, 1992) also employs a standardizing function $S$, but it augments the learning procedure with the constraint that the output of the learned function $g(V(x, p))$ should be invariant with respect to slight changes in the poses of the examples:\n\n$$\\| \\nabla_p \\, g(V(x, p)) |_{p = S(x)} \\| = 0,$$\n\nwhere $\\| \\cdot \\|$ indicates the Euclidean norm. This constraint is incorporated by using the left-hand side as a regularizer during backpropagation training.\n\nTangent-prop can be viewed as a way of focusing the learning algorithm on those input features and hidden-unit features that are invariant with respect to slight changes in pose. Without the tangent-prop constraint, the learning algorithm may identify features that \"accidentally\" discriminate between classes. However, tangent-prop still assumes that the standard poses are correct. This is not a safe assumption in drug activity prediction.\n\n2.3 TANGENT DISTANCE\n\nThe tangent-distance approach (Simard, LeCun & Denker, 1993) is a variant of the nearest-neighbor algorithm that addresses the feature manifold problem. Ideally, the distance metric to employ for the nearest-neighbor algorithm with feature manifolds is the \"manifold distance\", the distance at the point of nearest approach between two manifolds:\n\n$$D_M(x_1, x_2) = \\min_{p_1, p_2} \\| V(x_1, p_1) - V(x_2, p_2) \\|.$$\n\nThis is very expensive to compute, however, because the manifolds can have highly nonlinear shapes in feature space, so the manifold distance can have many local minima.\n\nThe tangent distance is an approximation to the manifold distance. It is computed by approximating the manifold by a tangent plane in the vicinity of the standard poses. 
Let $J_i$ be the Jacobian matrix defined by $(J_i)_{jk} = \\partial V(x_i, p_i)_j / \\partial (p_i)_k$, which gives the plane tangent to the manifold of molecule $x_i$ at pose $p_i$. The tangent distance is defined as\n\n$$D_T(x_1, x_2) = \\min_{a, b} \\| [V(x_1, p_1) + J_1 a] - [V(x_2, p_2) + J_2 b] \\|,$$\n\nwhere $p_1 = S(x_1)$ and $p_2 = S(x_2)$. The column vectors $a$ and $b$ give the change in the pose required to minimize the distance between the tangent planes approximating the manifolds. The values of $a$ and $b$ minimizing the right-hand side can be computed fairly quickly via gradient descent (Simard, personal communication). In practice, only poses close to $S(x_1)$ and $S(x_2)$ are considered, but this provides more opportunity for objects belonging to the same class to adopt poses that make them more similar to each other.\n\nIn experiments with handwritten digits, Simard, LeCun, and Denker (1993) found that tangent distance gave the best performance of these three methods.\n\n2.4 DYNAMIC REPOSING\n\nAll of the preceding methods can be viewed as attempts to make the final predicted output $\\hat{f}(x)$ invariant with respect to changes in pose. Standard poses do this by not permitting poses to change. Tangent-prop adds a local invariance constraint. Tangent distance enforces a somewhat less local invariance constraint.\n\nIn dynamic reposing, we make $\\hat{f}$ invariant by defining it to be the maximum value (taken over all poses $p$) of an auxiliary function $g$:\n\n$$\\hat{f}(x) = \\max_p g(V(x, p)).$$\n\nThe function $g$ will be the function learned by the neural network.\n\nBefore we consider how $g$ is learned, let us first consider how it can be used to predict the activity of a new molecule $x'$. To compute $\\hat{f}(x')$, we must find the pose $p'^*$ that maximizes $g(V(x', p'))$. 
We can do this by performing a gradient ascent starting from the standard pose $S(x')$ and moving in the direction of the gradient of $g$ with respect to the pose: $\\nabla_{p'} g(V(x', p'))$.\n\nThis process has an important physical analog in drug activity prediction. If $x'$ is a new molecule and $g$ is a learned model of the binding site, then by varying the pose $p'$ we are imitating the process by which the molecule chooses a low-energy conformation and rotates and translates to \"dock\" with the binding site.\n\nIn handwritten character recognition, this would be the dual of a deformable template model: the template ($g$) is held fixed, while the example is deformed (by rotation, translation, and scaling) to find the best fit to the template.\n\nThe function $g$ is learned iteratively from a growing pool of feature vectors. Initially, the pool contains only the feature vectors for the standard poses of the training examples (actually, we start with one standard pose of each low-energy conformation of each training example). In iteration $j$, we apply backpropagation to learn hypothesis $g_j$ from selected feature vectors drawn from the pool. For each molecule, one feature vector is selected by performing a forward propagation (i.e., computing $g(V(x_i, p_i))$) of all feature vectors of that molecule and selecting the one giving the highest predicted activity for that molecule.\n\nAfter learning $g_j$, we then compute for each conformer the pose $p_i^{j+1}$ that maximizes $g_j(V(x_i, p))$:\n\n$$p_i^{j+1} = \\arg\\max_p g_j(V(x_i, p)).$$\n\nFrom the chemical perspective, we permit each of the molecules to \"dock\" to the current model $g_j$ of the binding site.\n\nThe feature vectors $V(x_i, p_i^{j+1})$ corresponding to these poses are added to the pool of poses, and a new hypothesis $g_{j+1}$ is learned. 
This process iterates until the poses cease to change. Note that this algorithm is analogous to the EM procedure (Redner & Walker, 1984) in that we accomplish the simultaneous optimization of $g$ and the poses $\\{p_i\\}$ by conducting a series of separate optimizations of $g$ (holding $\\{p_i\\}$ fixed) and $\\{p_i\\}$ (holding $g$ fixed).\n\nWe believe the power of dynamic reposing results from its ability to identify the features that are critical for discriminating active from inactive molecules. In the initial, standard poses, a learning algorithm is likely to find features that \"accidentally\" discriminate actives from inactives. However, during the reposing process, inactive molecules will be able to reorient themselves to resemble active molecules with respect to these features. In the next iteration, the learning algorithm is therefore forced to choose better features for discrimination.\n\nMoreover, during reposing, the active molecules are able to reorient themselves so that they become more similar to each other with respect to the features judged to be important in the previous iteration. In subsequent iterations, the learning algorithm can \"tighten\" its criteria for recognizing active molecules.\n\nIn the initial, standard poses, the molecules are posed so that they resemble each other along all features more or less equally. At convergence, the active molecules have changed pose so that they only resemble each other along the features important for discrimination.\n\n3 AN EXPERIMENTAL COMPARISON\n\n3.1 MUSK ACTIVITY PREDICTION\n\nWe compared dynamic reposing with the tangent distance and standard pose methods on the task of musk odor prediction. 
The problem of musk odor prediction has been the focus of many modeling efforts (e.g., Bersuker et al., 1991; Fehr et al., 1989; Narvaez, Lavine & Jurs, 1986). Musk odor is a specific and clearly identifiable sensation, although the mechanisms underlying it are poorly understood. Musk odor is determined almost entirely by steric (i.e., \"molecular shape\") effects (Ohloff, 1986). The addition or deletion of a single methyl group can convert an odorless compound into a strong musk. Musk molecules are similar in size and composition to many kinds of drug molecules.\n\nWe studied a set of 102 diverse structures that were collected from published studies (Narvaez, Lavine & Jurs, 1986; Bersuker et al., 1991; Ohloff, 1986; Fehr et al., 1989). The data set contained 39 aromatic, oxygen-containing molecules with musk odor and 63 homologs that lacked musk odor. Each molecule was conformationally searched to identify low-energy conformations. The final data set contained 6,953 conformations of the 102 molecules (for full details of this data set, see Jain et al., 1993a). Each of these conformations was placed into a starting pose via a hand-written $S$ function. We then applied nearest neighbor with Euclidean distance, nearest neighbor with the tangent distance, a feed-forward network without reposing, and a feed-forward network with the dynamic reposing method. For dynamic reposing, five iterations of reposing were sufficient for convergence. The time required to compute the tangent distances far exceeds the computation times of the other algorithms. 
To make the tangent distance computations feasible, we only computed the tangent distance for the 200 neighbors that were nearest in Euclidean distance. Experiments with a subset of the molecules showed that this heuristic introduced no error on that subset.\n\nTable 1: Results of 20-fold cross-validation on 102 musk molecules.\n\nMethod                                   Percent Correct\nNearest neighbor (Euclidean distance)    75\nNeural network (standard poses)          75\nNearest neighbor (tangent distance)      79\nNeural network (dynamic reposing)        91\n\nTable 2: Neural network cross-class predictions (percent correct). Columns are the four molecular classes in left-to-right order; the structure diagrams and class sizes N that headed the columns are not reproduced here.\n\nMethod              Class 1   Class 2   Class 3   Class 4\nStandard poses      85        76        74        57\nDynamic reposing    100       90        85        71\n\nTable 1 shows the results of a 20-fold cross-validation of all four methods. The tangent distance method does show improvement with respect to a standard neural network approach (and with respect to the standard nearest neighbor method). However, the dynamic reposing method outperforms the other methods substantially.\n\nAn important test for drug activity prediction methods is to predict the activity of molecules whose molecular structure (i.e., bond graph) is substantially different from the molecules in the training set. A weakness of many existing methods for drug activity prediction (Hansch & Fujita, 1964; Hansch, 1973) is that they rely on the assumption that all molecules in the training and test data sets share a common structural skeleton. Because our representation for molecules concerns itself only with the surface of the molecule, we should not suffer from this problem. 
Table 2 shows four structural classes of molecules and the results of \"class holdout\" experiments in which all molecules of a given class were excluded from the training set and then predicted. Cross-class predictions from standard poses are not particularly good. However, with dynamic reposing, we obtain much better cross-class predictions, with improvements in all four classes. This demonstrates the ability of dynamic reposing to identify the critical discriminating features. Note that the accuracy of the predictions generally is determined by the size of the training set (i.e., as more molecules are withheld, performance drops). The exception to this is the right-most class, where the local geometry of the oxygen atom is substantially different from the other three classes.\n\n4 CONCLUDING REMARKS\n\nThe \"feature manifold problem\" arises in many application tasks, including drug activity prediction and handwritten character recognition. A new method, dynamic reposing, exhibits performance superior to the best existing method, tangent distance, and to other standard methods on the problem of musk activity prediction. In addition to producing more accurate predictions, dynamic reposing results in a learned binding site model that can guide the design of new drug molecules. Jain et al. (1993a) show a method for visualizing the learned model in the context of a given molecule and demonstrate how the model can be applied to guide drug design. Jain et al. (1993b) compare the method to other state-of-the-art methods for drug activity prediction and show that feed-forward networks with dynamic reposing are substantially superior on two steroid binding tasks. 
The method is currently being applied at Arris Pharmaceutical Corporation to aid the development of new pharmaceutical compounds.\n\nAcknowledgements\n\nMany people made contributions to this project. The authors thank Barr Bauer, John Burns, David Chapman, Roger Critchlow, Brad Katz, Kimberle Koile, John Park, Mike Ross, Teresa Webster, and George Whitesides for their efforts.\n\nReferences\n\nBersuker, I. B., Dimoglo, A. S., Gorbachov, M. Yu., Vlad, P. F., Pesaro, M. (1991). New Journal of Chemistry, 15, 307.\nFehr, C., Galindo, J., Haubrichs, R., Perret, R. (1989). Helv. Chim. Acta, 72, 1537.\nHansch, C. (1973). In C. J. Cavallito (Ed.), Structure-Activity Relationships. Oxford: Pergamon.\nHansch, C., Fujita, T. (1964). J. Am. Chem. Soc., 86, 1616.\nJain, A. N., Dietterich, T. G., Lathrop, R. H., Chapman, D., Critchlow, R. E., Bauer, B. E., Webster, T. A., Lozano-Perez, T. (1993a). A shape-based method for molecular design with adaptive alignment and conformational selection. Submitted.\nJain, A., Koile, K., Bauer, B., Chapman, D. (1993b). Compass: A 3D QSAR method. Performance comparisons on a steroid benchmark. Submitted.\nNarvaez, J. N., Lavine, B. K., Jurs, P. C. (1986). Chemical Senses, 11, 145-156.\nOhloff, G. (1986). Chemistry of odor stimuli. Experientia, 42, 271.\nRedner, R. A., Walker, H. F. (1984). Mixture densities, maximum likelihood, and the EM algorithm. SIAM Review, 26 (2), 195-239.\nSimard, P., Victorri, B., Le Cun, Y., Denker, J. (1992). Tangent Prop: A formalism for specifying selected invariances in an adaptive network. In Moody, J. E., Hanson, S. J., Lippmann, R. P. (Eds.), Advances in Neural Information Processing Systems 4. San Mateo, CA: Morgan Kaufmann. 895-903.\nSimard, P., Le Cun, Y., Denker, J. (1993). 
Efficient pattern recognition using a new transformation distance. In Hanson, S. J., Cowan, J. D., Giles, C. L. (Eds.), Advances in Neural Information Processing Systems 5. San Mateo, CA: Morgan Kaufmann. 50-58.\n", "award": [], "sourceid": 781, "authors": [{"given_name": "Thomas", "family_name": "Dietterich", "institution": null}, {"given_name": "Ajay", "family_name": "Jain", "institution": null}, {"given_name": "Richard", "family_name": "Lathrop", "institution": null}, {"given_name": "Tom\u00e1s", "family_name": "Lozano-P\u00e9rez", "institution": null}]}