{"title": "Discovering Viewpoint-Invariant Relationships That Characterize Objects", "book": "Advances in Neural Information Processing Systems", "page_first": 299, "page_last": 305, "abstract": null, "full_text": "Discovering Viewpoint-Invariant Relationships That Characterize Objects\n\nRichard S. Zemel and Geoffrey E. Hinton\n\nDepartment of Computer Science\nUniversity of Toronto\nToronto, ONT M5S 1A4\n\nAbstract\n\nUsing an unsupervised learning procedure, a network is trained on an ensemble of images of the same two-dimensional object at different positions, orientations and sizes. Each half of the network \"sees\" one fragment of the object, and tries to produce as output a set of 4 parameters that have high mutual information with the 4 parameters output by the other half of the network. Given the ensemble of training patterns, the 4 parameters on which the two halves of the network can agree are the position, orientation, and size of the whole object, or some recoding of them. After training, the network can reject instances of other shapes by using the fact that the predictions made by its two halves disagree. If two competing networks are trained on an unlabelled mixture of images of two objects, they cluster the training cases on the basis of the objects' shapes, independently of the position, orientation, and size.\n\n1 INTRODUCTION\n\nA difficult problem for neural networks is to recognize objects independently of their position, orientation, or size. Models addressing this problem have generally achieved viewpoint-invariance either through a separate normalization procedure or by building translation- or rotation-invariance into the structure of the network.
\n\nThis problem becomes even more difficult if the network must learn to perform viewpoint-invariant recognition without any supervision signal that indicates the correct viewpoint, or which object is which during training.\n\nIn this paper, we describe a model that is trained on an ensemble of instances of the same object, in a variety of positions, orientations and sizes, and can then recognize new instances of that object. We also describe an extension to the model that allows it to learn to recognize two different objects through unsupervised training on an unlabelled mixture of images of the objects.\n\n2 THE VIEWPOINT CONSISTENCY CONSTRAINT\n\nAn important invariant in object recognition is the fixed spatial relationship between a rigid object and each of its component features. We assume that each feature has an intrinsic reference frame, which can be specified by its instantiation parameters, i.e., its position, orientation and size with respect to the image. For a rigid object and a particular feature of that object, there is a fixed viewpoint-independent transformation from the feature's reference frame to the object's. Given the instantiation parameters of the feature in an image, we can use the transformation to predict the object's instantiation parameters. The viewpoint consistency constraint (Lowe, 1987) states that all of the features belonging to the same rigid object should make consistent predictions of the object's instantiation parameters. This constraint has played an important role in many shape recognition systems (Roberts, 1965; Ballard, 1981; Hinton, 1981; Lowe, 1985).\n\n2.1 LEARNING THE CONSTRAINT: SUPERVISED\n\nA recognition system that learns this constraint is TRAFFIC (Zemel, Mozer and Hinton, 1989).
In TRAFFIC, the constraints on the spatial relations between features of an object are directly expressed in a connectionist network. For two-dimensional shapes, an object instantiation contains 4 degrees of freedom: (x, y)-position, orientation, and size. These parameter values, or some recoding of them, can be represented in a set of 4 real-valued instantiation units. The network has a modular structure, with units devoted to each object or object fragment to be recognized. In a recognition module, one layer of instantiation units represents the instantiation parameters of each of an object's features; these units connect to a set of units that represent the object's instantiation parameters as predicted by this feature; and these predictions are combined into a single object instantiation in another set of instantiation units. The set of weights connecting the instantiation units of the feature and its predicted instantiation for the object are meant to capture the fixed, linear reference frame transformation between the feature and the object. These weights are trained by showing various instantiations of the object, and the object's instantiation parameters act as the training signal for each of the features' predictions. Through this supervised procedure, the features of an object learn to predict the instantiation parameters for the object. Thus, when the features of the object are present in the image in the appropriate relationship, the predictions are consistent and this consistency can be used to decide that the object is present. Our simulations showed that TRAFFIC was able to learn to recognize constellations in realistic star-plot images.
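The fixed feature-to-object transformation can be illustrated as an explicit 2D similarity transform. This is only a minimal sketch: the function names and numbers below are ours, and TRAFFIC learns the transformation as linear weights on a recoded representation rather than applying trigonometry directly.

```python
import numpy as np

def predict_object(feat, rel):
    """Predict the object's instantiation (x, y, orientation, size) from a
    feature's image instantiation and the fixed feature-to-object transform
    `rel`, expressed in the feature's reference frame."""
    x, y, th, s = feat
    dx, dy, dth, ds = rel
    c, sn = np.cos(th), np.sin(th)
    return np.array([x + s * (c * dx - sn * dy),
                     y + s * (sn * dx + c * dy),
                     th + dth,
                     s * ds])

def place_feature(obj, rel):
    """Invert predict_object: where the feature must appear in the image
    for the object to have instantiation `obj`."""
    ox, oy, oth, osz = obj
    dx, dy, dth, ds = rel
    s, th = osz / ds, oth - dth
    c, sn = np.cos(th), np.sin(th)
    return np.array([ox - s * (c * dx - sn * dy),
                     oy - s * (sn * dx + c * dy),
                     th,
                     s])

obj = np.array([5.0, 1.0, 0.4, 1.5])            # one view of the object
rel_a, rel_b = (0.8, -0.3, 0.7, 2.0), (-0.5, 0.2, -1.1, 0.5)
feat_a, feat_b = place_feature(obj, rel_a), place_feature(obj, rel_b)

# both features' predictions agree, so the object is present
assert np.allclose(predict_object(feat_a, rel_a), predict_object(feat_b, rel_b))

# perturb one feature: the predictions now disagree, so reject
feat_b[2] += 0.3
assert not np.allclose(predict_object(feat_a, rel_a), predict_object(feat_b, rel_b))
```

Because the two `rel` transforms are fixed, any rigid motion or rescaling of the whole object leaves the agreement between the two predictions intact, which is exactly the consistency the constraint exploits.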
\n\n2.2 LEARNING THE CONSTRAINT: UNSUPERVISED\n\nThe goal of the current work is to use an unsupervised procedure to discover and use the consistency constraint.\n\n[Figure 1: A module with two halves that try to agree on their predictions. The input to each half is 100 intensity values (indicated by the areas of the black circles). Each half has 200 Gaussian radial basis units (constrained to be the same for the two halves) connected to 4 output units.]\n\nWe explore this idea using a framework similar to that of TRAFFIC, in which different features of an object are represented in different parts of the recognition module, and each part generates a prediction for the object's instantiation parameters. Figure 1 presents an example of the kind of task we would like to solve. The module has two halves. The rigid object in the image is very simple - it has two ends, each of which is composed of two Gaussian blobs of intensity. Each image in the training set contains one instance of the object. For now, we constrain the instantiation parameters of the object so that the left half of the image always contains one end of the object, and the right half the other end. This way, just based on the end of the object in the input image that it sees, each half of the module can always specify the position, orientation and size of the whole object.
The goal is that, after training, for any image containing this object, the output vectors of both halves of the module, a and b, should both represent the same instantiation parameters for the object.\n\nIn TRAFFIC, we could use the object's instantiation parameters as a training signal for both module halves, and the features would learn their relation to the object. Now, without providing a training signal, we would like the module to learn that what is consistent across the ensemble of images is the relation between the position, orientation, and size of each end of that object. The two halves of a module trained on a particular shape should produce consistent instantiation parameters for any instance of this object. If the features are related in a different way, then these predictions should disagree. If the module learns to do this through an unsupervised procedure, it has found a viewpoint-invariant spatial relationship that characterizes the object, and this relationship can be used to recognize it.\n\n3 THE IMAX LEARNING PROCEDURE\n\nWe describe a version of the IMAX learning procedure (Hinton and Becker, 1990) that allows a module to discover the 4 parameters that are consistent between the two halves of each image when it is presented with many different images of the same, rigid object in different positions, orientations and sizes. Because the training cases are all positive examples of the object, each half of the module tries to extract a vector of 4 parameters that significantly agrees with the 4 parameters extracted by the other half. Note that the two halves can agree on each instance by outputting zero on each case, but this agreement would not be significant.
To agree significantly, each output vector must vary from image to image, but the two output vectors must nevertheless be the same for each image. Under suitable Gaussian assumptions, the significance of the agreement between the two output vectors can be computed by comparing the variances across training cases of the parameters produced by the individual halves of the module with the variances of the differences of these parameters.\n\nWe assume that the two output vectors, a and b, are both noisy versions of the same underlying signal, the correct object instantiation parameters. If we assume that the noise is independent, additive, and Gaussian, the mutual information between the presumed underlying signal and the average of the noisy versions of that signal represented by a and b is:\n\nI(a; b) = 1/2 log( |Σ(a+b)| / |Σ(a-b)| )   (1)\n\nwhere |Σ(a+b)| is the determinant of the covariance matrix of the sum of a and b (see (Becker and Hinton, 1989) for details). We train a recognition module by setting its weights so as to maximize this objective function. By maximizing the determinant, we discourage the components of the vector a + b from being linearly dependent on one another, and thus assure that the network does not discover the same parameter four times.\n\n4 EXPERIMENTAL RESULTS\n\nUsing this objective function, we have experimented with different training sets, input representations and network architectures. We discuss two examples here.\n\nIn all of the experiments described, we fix the number of output units in each module to be 4, matching the underlying degrees of freedom in the object instantiation parameters.
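The objective in Equation 1 can be estimated directly from the sample covariances of the two halves' outputs. A minimal sketch with toy data (the function name, noise levels, and data are ours, not from the paper):

```python
import numpy as np

def imax_objective(a, b):
    """Mutual information objective of Equation 1, estimated from samples.
    a, b: (n_cases, 4) arrays of the two halves' output vectors."""
    sum_cov = np.cov((a + b).T)        # covariance of a + b across cases
    diff_cov = np.cov((a - b).T)       # covariance of a - b across cases
    _, logdet_s = np.linalg.slogdet(sum_cov)
    _, logdet_d = np.linalg.slogdet(diff_cov)
    return 0.5 * (logdet_s - logdet_d)  # 1/2 log(|cov(a+b)| / |cov(a-b)|)

rng = np.random.default_rng(0)
signal = rng.normal(size=(1000, 4))    # the "true" instantiation parameters
a = signal + 0.1 * rng.normal(size=(1000, 4))   # noisy version from one half
b = signal + 0.1 * rng.normal(size=(1000, 4))   # noisy version from the other
print(imax_objective(a, b))                         # high: significant agreement
print(imax_objective(a, rng.normal(size=(1000, 4))))  # near zero: no agreement
```

Using `slogdet` rather than `det` keeps the log-determinants numerically stable; the value is in nats here, whereas the paper reports bits.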
We are in effect telling the recognition module that there are 4 parameters worth extracting from the training ensemble. For some tasks there may be fewer than 4 parameters. For example, the same learning procedure should be able to capture the lower-dimensional constraints between the parts of objects that contain internal degrees of freedom in their shape (e.g., scissors), but we have not yet tested this.\n\nThe first set of experiments uses training images like Figure 1. The task requires an intermediate layer between the intensity values and the instantiation parameters vector. Each half of the module has 200 non-adaptive, radial basis units. The means of the RBFs are formed by randomly sampling the space of possible images of an end of the object; the variances are fixed. The output units are linear. We maximize the objective function I by adjusting the weights from the radial basis units to the output units, after each full sweep through the training set.\n\nThe optimization requires 20 sweeps of a conjugate gradient technique through 1000 training cases. Unfortunately, it is difficult to interpret the outputs of the module, since it finds a nonlinear transform of the object instantiation parameters. But the mutual information is quite high - about 7 bits. After training, the predictions made by the two halves are consistent on new images. We measure the consistency in the predictions for an image using a kind of generalized Z-score, which relates the difference between the predictions on a particular case (d_i) to the distribution of this difference across the training set:\n\nZ(d_i) = (d_i - d̄)^T Σ_d^{-1} (d_i - d̄)   (2)\n\nwhere d̄ and Σ_d are the mean and covariance of the differences across the training set. A low Z-score indicates a consistent match.
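The generalized Z-score of Equation 2 is a squared Mahalanobis distance of a case's prediction difference under the training-set distribution of differences. A minimal sketch (names and toy numbers are ours):

```python
import numpy as np

def fit_zscore(diffs):
    """Fit the training-set distribution of prediction differences
    (Equation 2).  diffs: (n_cases, 4) array of a - b over the training set.
    Returns a function mapping one difference vector to its Z-score."""
    mean = diffs.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(diffs.T))
    def z(d):
        # squared Mahalanobis distance of one difference vector
        v = d - mean
        return float(v @ inv_cov @ v)
    return z

rng = np.random.default_rng(0)
train_diffs = rng.normal(scale=0.1, size=(1000, 4))  # small, consistent differences
z = fit_zscore(train_diffs)
print(z(np.zeros(4)))      # small: consistent with the training distribution
print(z(np.full(4, 1.0)))  # large: the two halves disagree, so reject
```

A threshold on this score gives the rejection rule described in the text: instances whose two half-predictions differ more than the training cases did are rejected.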
After training, the module produces high Z-scores on images where the same two ends are present, but are in a different relationship than the object on which it was trained. In general, the Z-scores increase smoothly with the degree of perturbation in the relationship between the two ends, indicating that the module has learned the constraint.\n\nIn the second set of experiments, we remove an unrealistic constraint on our images - that one end of the object must always fall in one half of the image. Instead, we assume that there is a feature-extraction process that finds instances of simple features in the image and passes on to the module a set of parameters describing the position, orientation and spatial extent of each feature. This is a reasonable assumption, since low-level vision is generally good at providing accurate descriptions of simple features that are present in an image (such as edges and corners), and can also specify their locations.\n\nIn these experiments, the feature-extraction program finds instances of two features of the letter y - the upper u-shaped curve and the long vertical stroke with a curved tail. The recognition module then tries to extract consistent object instantiation parameters from these feature instantiation parameters by maximizing the same mutual information objective as before.\n\nThere are several advantages of this second scheme. The first set of training instances was artificially restricted by the requirement that one end must appear in the left half of the image, and the other in the right half. Now, since a separate process is analyzing the entire image to find a feature of a given type, we can use the entire space of possible instantiation parameters in the training set.
With the simpler architecture, we can efficiently handle more complex images. In addition, no hidden layer is necessary - the mapping from the features' instantiation parameters to the object's instantiation parameters is linear.\n\nUsing this scheme, only twelve sweeps through 1000 training cases are necessary to optimize the objective function. The speed-up is likely due to the fact that the input is already parameterized in an appropriate form for the extraction of the instantiation parameters. This method also produces robust recognition modules, which reject instances where the relationship between the two input vectors does not match the relationship in the training set. We test this robustness by adding noise of varying magnitudes separately to each component of the input vectors, and measuring the Z-scores of the output vectors. As expected, the agreement between the two outputs of a module degrades smoothly with added noise.\n\n5 COMPETITIVE IMAX\n\nWe are currently working on extending this idea to handle multiple shapes. The obvious way to do this using modules of the type described above is to force the modules to specialize by training each module separately on images of a particular shape, and then to recognize shapes by giving the image to each module and seeing which module achieves the lowest Z-score. However, this requires supervised training in which the images are labelled by the type of object they contain. We are exploring an entirely unsupervised method in which images are unlabelled, and every image is processed by many competing modules.\n\nEach competing module has a responsibility for each image that depends on the consistency between the two output vectors of the module.
The responsibilities are normalized so that, for each image, they sum to one. In computing the covariances for a particular module in Equation 1, we weight each training case by the module's responsibility for that case. We also compute an overall mixing proportion, π_m, for each module, which is just the average of its responsibilities. We extend the objective function I to multiple modules as follows:\n\nI^c = Σ_m π_m I_m   (3)\n\nWe could compute the relative responsibilities of modules by comparing their Z-scores, but this would lead to a recurrent relationship between the responsibilities and the weights within a module. To avoid this recurrence, we simply store the responsibility of each module for each training case. We optimize I^c by interleaving updates of the weights within each module with updates of the stored responsibilities. This learning is a sophisticated form of competitive learning. Rather than clustering together input vectors that are close to one another in the input space, the modules cluster together input vectors that share a common spatial relationship between their two halves.\n\nIn our experiments, we are using just two modules and an ensemble of images of two different shapes (either a g or a y in each image). We have found that the system can cluster the images with a little bootstrapping. We initially split the training set into g-images and y-images, and train up one module for several iterations on one set of images, and the other module on the other set. When we then use a new training set containing 500 images of each shape, and train both modules competitively on the full set, the system successfully learns to separate the images so that the modules each specialize in a particular shape.
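The responsibility weighting that drives this competition can be sketched as follows. The toy data, names, and the module interface in `competitive_objective` are our assumptions, not the paper's implementation:

```python
import numpy as np

def weighted_imax(a, b, resp):
    """Equation 1 with each training case weighted by this module's
    responsibility.  a, b: (n, 4) half-outputs; resp: (n,) nonnegative weights."""
    s_cov = np.cov((a + b).T, aweights=resp)
    d_cov = np.cov((a - b).T, aweights=resp)
    return 0.5 * (np.linalg.slogdet(s_cov)[1] - np.linalg.slogdet(d_cov)[1])

def competitive_objective(modules, images, resp):
    """Mixture objective: sum over modules of (mixing proportion) x (weighted MI).
    resp[m, i] is module m's stored responsibility for image i, normalized so
    resp.sum(axis=0) is all ones.  Each module maps images to its two
    half-outputs (a, b) - a hypothetical interface for this sketch."""
    pi = resp.mean(axis=1)                 # mixing proportion per module
    total = 0.0
    for m, module in enumerate(modules):
        a, b = module(images)
        total += pi[m] * weighted_imax(a, b, resp[m])
    return total

# toy demo: the first 500 cases obey one shape's relationship, the rest do not
rng = np.random.default_rng(0)
s = rng.normal(size=(1000, 4))
a = s + 0.1 * rng.normal(size=(1000, 4))
b = s + 0.1 * rng.normal(size=(1000, 4))
b[500:] = rng.normal(size=(500, 4))        # unrelated second shape
resp_own = np.r_[np.ones(500), np.full(500, 1e-3)]

print(weighted_imax(a, b, resp_own))       # high: module owns the consistent cases
print(weighted_imax(a, b, np.ones(1000)))  # lower: diluted by the other shape
```

Because a module's weighted objective rises when its responsibilities concentrate on the cases whose halves it can make agree, the modules specialize on shapes rather than on regions of the input space.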
After the bootstrapping, one module wins on 297 cases of one shape and 206 cases of the other shape. After further learning on the unlabelled mixture of shapes, it wins on 498 cases of one shape and 0 cases of the other.\n\nBy making another assumption, that the input images in the training set are temporally coherent, we should be able to eliminate the need for the bootstrapping procedure. If we assume that the training images come in runs of one class, and then another, as would be the case if they were a sequence of images of various moving objects, then for each module, we can attempt to maximize the mutual information between the responsibilities it assigns to consecutive training images. We can augment the combined objective function by adding this temporal coherence term onto the spatial coherence term, and our network should cluster the input set into different shapes while simultaneously learning how to recognize them.\n\nFinally, we plan to extend our model to become a more general recognition system. Since the learning is relatively fast, we should also be able to build a hierarchy of modules that could learn to recognize more complex objects.\n\nAcknowledgements\n\nWe thank Sue Becker and Steve Nowlan for helpful discussions. This research was supported by grants from the Ontario Information Technology Research Center, the Natural Sciences and Engineering Research Council, and Apple Computer, Inc. Hinton is the Noranda Fellow of the Canadian Institute for Advanced Research.\n\nReferences\n\nBallard, D. H. (1981). Generalizing the Hough transform to detect arbitrary shapes. Pattern Recognition, 13(2):111-122.\n\nBecker, S. and Hinton, G. E. (1989).
Spatial coherence as an internal teacher for a neural network. Technical Report CRG-TR-89-7, University of Toronto.\n\nHinton, G. E. (1981). A parallel computation that assigns canonical object-based frames of reference. In Proceedings of the 7th International Joint Conference on Artificial Intelligence, pages 683-685, Vancouver, BC, Canada.\n\nHinton, G. E. and Becker, S. (1990). An unsupervised learning procedure that discovers surfaces in random-dot stereograms. In Proceedings of the International Joint Conference on Neural Networks, volume 1, pages 218-222, Hillsdale, NJ. Erlbaum.\n\nLowe, D. G. (1985). Perceptual Organization and Visual Recognition. Kluwer Academic Publishers, Boston.\n\nLowe, D. G. (1987). The viewpoint consistency constraint. International Journal of Computer Vision, 1:57-72.\n\nRoberts, L. G. (1965). Machine perception of three-dimensional solids. In Tippett, J. T., editor, Optical and Electro-Optical Information Processing. MIT Press.\n\nZemel, R. S., Mozer, M. C., and Hinton, G. E. (1989). TRAFFIC: Object recognition using hierarchical reference frame transformations. In Touretzky, D. S., editor, Advances in Neural Information Processing Systems 2, pages 266-273. Morgan Kaufmann, San Mateo, CA.", "award": [], "sourceid": 329, "authors": [{"given_name": "Richard", "family_name": "Zemel", "institution": null}, {"given_name": "Geoffrey", "family_name": "Hinton", "institution": null}]}