{"title": "Just One View: Invariances in Inferotemporal Cell Tuning", "book": "Advances in Neural Information Processing Systems", "page_first": 215, "page_last": 221, "abstract": "", "full_text": "Just One View: \n\nInvariances in Inferotemporal Cell Thning \n\nMaximilian Riesenhuber \n\nTomaso Poggio \n\nCenter for Biological and Computational Learning and \n\nDepartment of Brain and Cognitive Sciences \n\nMassachusetts Institute of Techno)ogy, E25-201 \n\nCambridge, MA 02139 \n{max,tp }@ai.mit.edu \n\nAbstract \n\nIn  macaque  inferotemporal cortex  (IT),  neurons have  been found to re(cid:173)\nspond  selectively to complex  shapes  while  showing broad  tuning (\"in(cid:173)\nvariance\")  with  respect  to stimulus transformations such as  translation \nand  scale  changes  and  a  limited  tuning to rotation  in  depth.  Training \nmonkeys with novel,  paperclip-like objects, Logothetis et al. 9  could in(cid:173)\nvestigate whether these invariance properties are due to experience with \nexhaustively many transformed instances of an object or if there are mech(cid:173)\nanisms that allow the cells to show response invariance also to previously \nunseen instances of that object.  They found object-selective cells in an(cid:173)\nterior IT which exhibited limited invariance  to various transformations \nafter training with single object views.  While previous models accounted \nfor  the tuning of the  cells  for  rotations  in  depth and  for  their selectiv(cid:173)\nity  to a  specific  object relative to a population of distractor objects,14,1 \nthe model described here attempts to explain in a  biologically plausible \nway  the additional properties of translation  and  size  invariance.  Using \nthe  same  stimuli  as  in  the  experiment,  we  find  that model  IT neurons \nexhibit invariance properties which closely parallel those of real neurons. \nSimulations show that the model  is capable of unsupervised learning of \nview-tuned neurons. \n\nWe thank Peter Dayan, Marcus Dill, Shimon Edelman, Nikos Logothetis, Jonathan Mumick and \n\nRandy O'Reilly for useful discussions and comments. \n\n\f216 \n\n1  Introduction \n\nM.  RiesenhuberandT. Poggio \n\nNeurons in  macaque  inferotemporal cortex  (IT)  have  been  shown  to respond to  views of \ncomplex objects,8 such as faces or body parts, even  when the retinal image undergoes size \nchanges over several  octaves, is translated by several degrees of visual angle7  or rotated in \ndepth by a certain amount9  (see [13]  for a review). \n\nThese  findings  have  prompted  researchers  to  investigate  the  physiological  mechanisms \nunderlying  these  tuning  properties.  The  original  model 14  that  led  to  the  physiological \nexperiments  of Logothetis et al. 9  explains  the  behavioral  view  invariance  for  rotation  in \ndepth  through  the learning  and  memory  of a  few  example  views,  each  represented  by  a \nneuron  tuned  to that view.  Invariant recognition for translation and  scale  transformations \nhave  been  explained  either  as  a  result  of object-specific  learning4  or  as  a  result  of  a \nnormalization procedure  (\"shifter\") that  is  applied to any  image  and  hence  requires only \none object-view for recognition. 12 \nA problem with previous experiments has been that they did not illuminate the mechanism \nunderlying invariance since they employed objects (e.g.,  faces) with which the monkey was \nquite familiar,  having seen  them  numerous times  under  various transformations.  
A problem with previous experiments has been that they did not illuminate the mechanism underlying invariance, since they employed objects (e.g., faces) with which the monkey was quite familiar, having seen them numerous times under various transformations. Recent experiments by Logothetis et al. [9] addressed this question by training monkeys to recognize novel objects ("paperclips" and amoeba-like objects) with which the monkey had no previous visual experience. After training, responses of IT cells to transformed versions of the training stimuli and to distractors of the same type were collected. Since the views the monkeys were exposed to during training were tightly controlled, the paradigm made it possible to estimate the degree of invariance that can be extracted from just one object view.

In particular, Logothetis et al. [9] tested the cells' responses to rotation in depth, translation, and size changes. Defining "invariance" as yielding a higher response to test views than to distractor objects, they report [9, 10] an average rotation invariance over 30°, translation invariance over ±2°, and size invariance of up to ±1 octave around the training view.

These results establish that there are cells showing some degree of invariance even after training with just one object view, thereby arguing against a completely learning-dependent mechanism that requires visual experience with each transformed instance that is to be recognized. On the other hand, invariance is far from perfect but rather centered around the object views seen during training.

2 The Model

Studies of the visual areas in the ventral stream of the macaque visual system [8] show a tendency for cells higher up in the pathway (from V1 over V2 and V4 to posterior and anterior IT) to respond to increasingly complex objects and to show increasing invariance to transformations such as translation, size changes, or rotation in depth [13].

We tried to construct a model that explains the receptive field properties found in the experiment based on a simple feedforward architecture. Figure 1 shows a cartoon of the model: a retinal input pattern leads to excitation of a set of "V1" cells, abstracted in the figure as having derivative-of-Gaussian receptive field profiles. These "V1" cells are tuned to simple features and have relatively small receptive fields. While they could be cells from a variety of areas, e.g., V1 or V2 (cf. Discussion), for simplicity we label them as "V1" cells (see figure). Different cells differ in preferred feature, e.g., orientation, preferred spatial frequency (scale), and receptive field location. "V1" cells of the same type (i.e., having the same preferred stimulus, but different preferred scale and receptive field location) feed into the same neuron in an intermediate layer. These intermediate neurons could be complex cells in V1 or V2 or V4 or even posterior IT; we label them as "V4" cells, in the same spirit in which we labeled the neurons feeding into them as "V1" units. Thus, a "V4" cell receives inputs from "V1" cells over a large area and over different spatial scales ([8] reports an average receptive field size in V4 of 4.4° of visual angle, as opposed to about 1° in V1; for spatial frequency tuning, [3] report an average FWHM of 2.2 octaves, compared to 1.4 (foveally) to 1.8 octaves (parafoveally) in V1 [5]). These "V4" cells in turn feed into a layer of "IT" neurons, whose invariance properties are to be compared with the experimentally observed ones.

Figure 1: Cartoon of the model. See text for explanation.
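For concreteness, the sketch below shows one way such a "V1" front end could be realized. It is a minimal sketch under our own assumptions, not code from the paper: the helper name bar_filter and the width choice sigma = size/4 are ours, and only the bar features are shown (the corner features used later in the model would be built analogously).

```python
import numpy as np

def bar_filter(size, sigma, theta):
    """Oriented "bar" receptive field: second derivative of a Gaussian
    along one axis, rotated by theta, square-normalized to unit L2 norm."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    u = x * np.cos(theta) + y * np.sin(theta)       # along preferred axis
    v = -x * np.sin(theta) + y * np.cos(theta)
    g = np.exp(-(u ** 2 + v ** 2) / (2 * sigma ** 2))
    f = (u ** 2 / sigma ** 4 - 1 / sigma ** 2) * g  # d2/du2 of the Gaussian
    return f / np.linalg.norm(f)

# One filter per preferred orientation and receptive field size; the sizes
# of 7 to 19 pixels in steps of 2 match the simulations described below.
v1_filters = {(size, theta): bar_filter(size, size / 4.0, theta)
              for size in range(7, 21, 2)
              for theta in (0.0, np.pi / 2)}
```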
These \"V4\" cells in turn feed  into a layer \nof \"IT\" neurons,  whose  invariance properties are to be compared  with the experimentally \nobserved ones. \n\nRetina \n\nFigure 1: Cartoon of the model.  See text for explanation. \n\nA crucial element of the model is the mechanism an  intermediate neuron uses  to pool the \nactivities of its afferents.  From the computational point of view, the intermediate neurons \nshould  be  robust feature detectors,  i.e.,  measure  the presence of specific  features  without \nbeing confused  by  clutter and context in  the receptive field.  More detailed considerations \n(Riesenhuber and Poggio, in preparation) show that this cannot be achieved with a response \nfunction  that  just  summates  over  all  the  afferents  (cf.  Results). \nInstead,  intermediate \nneurons in  our model  perform a \"max\"  operation (akin  to  a \"Winner-Take-AII\") over all \ntheir afferents, i.e.,  the response of an intermediate neuron is determined by its most strongly \nexcited afferent.  This hypothesis appears  to  be compatible with recent data,15  that show \nthat when two stimuli (gratings of different contrast and  orientation) are  brought into the \nrecepti ve field  of a V 4 cell, the cell's response tends to be close to the stronger of the two \nindividual responses (instead of e.g.,  the sum as in a linear model). \n\n\f218 \n\nM.  Riesenhuber and T.  Poggio \n\nThus, the response function 0i  of an  intennediate neuron i  to stimulation with an  image v \nIS \n\n(1) \n\nwith Ai the set of afferents to neuron i, aU) the receptive field center of afferent j, v a(j) the \n(square-nonnalized) image patch centered at aU) that corresponds in size to the receptive \nfield, ~j  (also square-nonnalized) of afferent j  and \".\" the dot product operation. \n\nStudies  have  shown  that  V 4  neurons  respond  to  features  of \"intennediate\"  complexity \nsuch as  gratings, corners and crosses.8  In  V4 the receptive fields  are comparatively  large \n(4.4 0  of visual angle on  average8 ),  while the preferred stimuli are usually much smaller.3 \nInterestingly, cells respond independently of the location of the stimulus within the receptive \nfield.  Moreover,  average  V 4 receptive field  size is comparable to the range of translation \ninvariance of IT cells (:S  \u00b12\u00b0) observed  in the experiment.9  For afferent receptive fields \n~j,  we chose  features  similar to  the  ones found  for  V 4  cells in  the  visual  system: 8  bars \n(modeled as  second  derivatives of Gaussians)  in  two orientations,  and  \"corners\"  of four \ndifferent orientations and  two different degrees  of obtuseness.  This  yielded  a  total of lO \nintennediate neurons.  This set of features  was chosen  to give a compact and biologically \nplausible representation.  Each  intennediate cell received  input from  cells  with the same \ntype of preferred stimulus densely covering the visual field of 256  x  256 pixels (which thus \nwould correspond to  about 4.40  of visual  angle,  the average  receptive  field  size in V48 ), \nwith receptive  field  sizes of afferent cells ranging from 7 to 19 pixels in steps of 2 pixels. \nThe features  used  in  this paper represent the first  set of features  tried,  optimizing feature \nshapes might further improve the model's performance. 
Studies have shown that V4 neurons respond to features of "intermediate" complexity such as gratings, corners, and crosses [8]. In V4 the receptive fields are comparatively large (4.4° of visual angle on average [8]), while the preferred stimuli are usually much smaller [3]. Interestingly, cells respond independently of the location of the stimulus within the receptive field. Moreover, the average V4 receptive field size is comparable to the range of translation invariance of IT cells (≤ ±2°) observed in the experiment [9]. For the afferent receptive fields $\xi_j$, we chose features similar to the ones found for V4 cells in the visual system [8]: bars (modeled as second derivatives of Gaussians) in two orientations, and "corners" of four different orientations and two different degrees of obtuseness. This yielded a total of 10 intermediate neurons. This set of features was chosen to give a compact and biologically plausible representation. Each intermediate cell received input from cells with the same type of preferred stimulus densely covering the visual field of 256 × 256 pixels (which thus would correspond to about 4.4° of visual angle, the average receptive field size in V4 [8]), with receptive field sizes of afferent cells ranging from 7 to 19 pixels in steps of 2 pixels. The features used in this paper represent the first set of features tried; optimizing feature shapes might further improve the model's performance.

The response $t_j$ of top layer neuron $j$ with connecting weights $w_j$ to the intermediate layer was set to be a Gaussian, centered on $w_j$:

$$t_j = \frac{1}{2\pi\sigma^2} \exp\left( -\frac{\| o - w_j \|^2}{2\sigma^2} \right) \qquad (2)$$

where $o$ is the excitation of the intermediate layer and $\sigma^2$ the variance of the Gaussian, which was chosen based on the distribution of responses (for Section 3.1) or learned (for Section 3.2).
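In code, a view-tuned unit is then a single Gaussian radial basis function over the ten-dimensional vector of intermediate responses. A minimal sketch, assuming the normalization and width convention of Eq. (2) as reconstructed above:

```python
import numpy as np

def view_tuned_response(o, w, sigma):
    """Eq. (2): Gaussian tuning of a top layer ("IT") unit. `o` is the
    vector of intermediate-layer responses to the current image, `w` the
    stored weight vector, and `sigma` the width of the Gaussian."""
    o, w = np.asarray(o, float), np.asarray(w, float)
    d2 = float(np.sum((o - w) ** 2))
    return np.exp(-d2 / (2.0 * sigma ** 2)) / (2.0 * np.pi * sigma ** 2)
```

Setting w equal to the intermediate-layer excitation caused by a single training view is all the "wiring" that Section 3.1 below requires.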
The stimulus images were views of 21 randomly generated "paperclips" of the type used in the physiology experiment [9]. Distractors were 60 other paperclip images generated by the same method. Training size was 128 × 128 pixels.

3 Results

3.1 Invariance of Representation

In a first set of simulations we investigated whether the proposed model could indeed account for the observed invariance properties. Here we assumed that connection strengths from the intermediate layer cells to the top layer had already been learned by a separate process, allowing us to focus on the tolerance of the representation to the above-mentioned transformations and on the selectivity of the top layer cells.

To establish the tuning properties of view-tuned model neurons, the connections $w_j$ between the intermediate layer and top layer unit $j$ were set to be equal to the excitation $o_{\mathrm{training}}$ in the intermediate layer caused by the training view. Figure 2 shows the "tuning curve" for rotation in depth, and Fig. 3 the response to changes in stimulus size, of one such neuron. The neuron shows rotation invariance (i.e., it produces a higher response than to any distractor) over about 44°, and invariance to scale changes over the whole range tested. For translation (not shown), the neuron showed invariance over translations of ±96 pixels around the center in any direction, corresponding to ±1.7° of visual angle.

Figure 2: Responses of a sample top layer neuron to different views of the training stimulus and to distractors. The left plot shows the rotation tuning curve, with the training view (90° view) shown in the middle image over the plot. The neighboring images show the views of the paperclip at the borders of the rotation tuning curve, which are located where the response to the rotated clip falls below the response to the best distractor (shown in the plot on the right). The neuron exhibits broad rotation tuning over more than 40°.

Figure 3: Responses of the same top layer neuron as in Fig. 2 to scale changes of the training stimulus and to distractors. The left plot shows the size tuning curve, with the training size (128 × 128 pixels) shown in the middle image over the plot. The neighboring images show scaled versions of the paperclip. Other elements as in Fig. 2. The neuron exhibits scale invariance over more than 2 octaves.

The average invariance ranges for the 21 tested paperclips were 35° of rotation angle, 2.9 octaves of scale invariance, and ±1.8° of translation invariance. Comparing this to the experimentally observed [10] 30°, 2 octaves, and ±2°, respectively, shows very good agreement between the invariance properties of model neurons and experimental neurons.

3.2 Learning

In the previous section we assumed that the connections from the intermediate layer to a view-tuned neuron in the top layer were pre-set to appropriate values. In this section, we investigate whether the system allows unsupervised learning of view-tuned neurons.

Since biological plausibility of the learning algorithm was not our primary focus here, we chose a general, rather abstract learning algorithm, viz. a mixture-of-Gaussians model trained with the EM algorithm. Our model had four neurons in the top level; the stimuli were views of four paperclips, randomly selected from the 21 paperclips used in the previous experiments. For each clip, the stimulus set contained views from 17 different viewpoints, spanning 34° of viewpoint change. Also, each clip was included at 11 different scales in the stimulus set, covering a range of two octaves of scale change.

Connections $w_i$ and variances $\sigma_i$, $i = 1, \ldots, 4$, were initialized to random values at the beginning of training. After a few iterations of the EM algorithm (usually fewer than 30), a stationary state was reached in which each model neuron had become tuned to views of one paperclip: for each paperclip, all rotated and scaled views were mapped to (i.e., activated most strongly) the same model neuron, and views of different paperclips were mapped to different neurons. Hence, when the system is presented with multiple views of different objects, the receptive fields of top level neurons self-organize in such a way that different neurons become tuned to different objects.
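The sketch below shows a minimal EM fit of this kind, under assumptions the paper leaves open (isotropic components, uniform mixing proportions): each row of X is the intermediate-layer response vector for one stimulus view, and the learned means mu play the role of the weight vectors $w_i$.

```python
import numpy as np

def em_mixture(X, k=4, iters=30, seed=0):
    """Fit k isotropic Gaussians to the rows of X by EM (cf. Section 3.2).
    Returns the component means (the learned weights w_i), variances, and
    the soft assignment of each stimulus view to a component."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X[rng.choice(n, size=k, replace=False)]        # random init
    var = np.full(k, X.var() + 1e-6)
    for _ in range(iters):
        # E-step: responsibilities under isotropic Gaussian densities.
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)    # (n, k)
        logp = -d2 / (2 * var) - 0.5 * d * np.log(2 * np.pi * var)
        logp -= logp.max(axis=1, keepdims=True)
        resp = np.exp(logp)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate means and variances.
        nk = resp.sum(axis=0) + 1e-12
        mu = (resp.T @ X) / nk[:, None]
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
        var = (resp * d2).sum(axis=0) / (nk * d) + 1e-6
    return mu, var, resp
```

After convergence, assigning each view to its most responsible component (resp.argmax(axis=1)) reproduces the clustering reported above: all views of one clip map to one unit.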
4 Discussion

Object recognition is a difficult problem because objects must be recognized irrespective of position, size, viewpoint, and illumination. Computational models and engineering implementations have shown that most of the required invariances can be obtained by a relatively simple learning scheme based on a small set of example views [14, 17]. Quite sensibly, the visual system can also achieve some significant degree of scale and translation invariance from just one view. Our simulations show that the maximum response function is a key component in the performance of the model. Without it, i.e., with a direct convolution of the filters with the input images and a subsequent summation, invariance to rotation in depth and to translation both decrease significantly. Most dramatically, however, invariance to scale changes is abolished completely, due to the strong changes in afferent cell activity with changing stimulus size. Taking the maximum over the afferents, as in our model, always picks the best matching filter and hence produces a more stable response. We expect a maximum mechanism to be essential for recognition-in-context, a more difficult task and one much more common than the recognition of isolated objects studied here and in the related psychophysical and physiological experiments.
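A toy numerical illustration of this point (our construction, not an experiment from the paper): unit-norm 1-D Gaussian "stimuli" of varying width are correlated against a small bank of matched filters at several scales. The maximum over the bank stays pinned at 1, while the response of any single filter, and hence the summed response, changes with stimulus size.

```python
import numpy as np

def unit_gaussian(size, sigma):
    """Unit-norm 1-D Gaussian, standing in for a filter or a stimulus."""
    x = np.arange(size) - size // 2
    g = np.exp(-x ** 2 / (2.0 * sigma ** 2))
    return g / np.linalg.norm(g)

scales = (2, 4, 8, 16)
bank = [unit_gaussian(65, s) for s in scales]     # multi-scale afferents
for s in scales:
    stim = unit_gaussian(65, s)                   # "rescaled" stimulus
    r = [float(f @ stim) for f in bank]
    print(f"stimulus sigma={s:2d}:  max={max(r):.2f}  "
          f"sum={sum(r):.2f}  finest filter={r[0]:.2f}")
```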
The recognition of a specific paperclip object is a difficult, subordinate-level classification task. It is interesting that our model solves it well, with a performance closely resembling the physiological data on the same task. The model is more biologically plausible and complete than previous ones [14, 1], but it is still at the level of a plausibility proof rather than a detailed physiological model. It suggests a maximum-like response of intermediate cells as a key mechanism for explaining the properties of view-tuned IT cells, in addition to view-based representations (already described in [1, 9]).

Neurons in the intermediate layer currently use a very simple set of features. While this appears to be adequate for the class of paperclip objects, more complex filters might be necessary for more complex stimulus classes like faces. Consequently, future work will aim to improve the filtering step of the model and to test it on more real-world stimuli. One can imagine a hierarchy of cell layers, similar to the "S" and "C" layers in Fukushima's Neocognitron [6], in which progressively more complex features are synthesized from simple ones. The corner detectors in our model are likely candidates for such a scheme. We are currently investigating the feasibility of such a hierarchy of feature detectors.

The demonstration that unsupervised learning of view-tuned neurons is possible in this representation (which is not clear for related view-based models [14, 1]) shows that different views of one object tend to form distinct clusters in the response space of the intermediate neurons. The current learning algorithm, however, is not very plausible, and more realistic learning schemes have to be explored, as, for instance, in the attention-based model of Riesenhuber and Dayan [16], which incorporated a learning mechanism using bottom-up and top-down pathways. Combining the two approaches could also demonstrate how invariance over a wide range of transformations can be learned from several example views, as in the case of familiar stimuli. We also plan to simulate detailed physiological implementations of several aspects of the model, such as the maximum operation (for instance, comparing nonlinear dendritic interactions [11] with recurrent excitation and inhibition). As it is, the model can already be tested in additional physiological experiments, for instance involving partial occlusions.

References

[1] Bricolo, E, Poggio, T & Logothetis, N (1997). 3D object recognition: A model of view-tuned neurons. In Advances in Neural Information Processing Systems 9, 41-47. MIT Press.
[2] Bülthoff, H & Edelman, S (1992). Psychophysical support for a two-dimensional view interpolation theory of object recognition. Proc. Nat. Acad. Sci. USA 89, 60-64.
[3] Desimone, R & Schein, S (1987). Visual properties of neurons in area V4 of the macaque: Sensitivity to stimulus form. J. Neurophys. 57, 835-868.
[4] Foldiak, P (1991). Learning invariance from transformation sequences. Neural Computation 3, 194-200.
[5] Foster, KH, Gaska, JP, Nagler, M & Pollen, DA (1985). Spatial and temporal selectivity of neurones in visual cortical areas V1 and V2 of the macaque monkey. J. Physiol. 365, 331-363.
[6] Fukushima, K (1980). Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics 36, 193-202.
[7] Ito, M, Tamura, H, Fujita, I & Tanaka, K (1995). Size and position invariance of neuronal responses in monkey inferotemporal cortex. J. Neurophys. 73, 218-226.
[8] Kobatake, E & Tanaka, K (1995). Neuronal selectivities to complex object features in the ventral visual pathway of the macaque cerebral cortex. J. Neurophys. 71, 856-867.
[9] Logothetis, NK, Pauls, J & Poggio, T (1995). Shape representation in the inferior temporal cortex of monkeys. Current Biology 5, 552-563.
[10] Nikos Logothetis, personal communication.
[11] Mel, BW, Ruderman, DL & Archie, KA (1997). Translation-invariant orientation tuning in visual 'complex' cells could derive from intradendritic computations. Manuscript in preparation.
[12] Olshausen, BA, Anderson, CH & Van Essen, DC (1993). A neurobiological model of visual attention and invariant pattern recognition based on dynamic routing of information. J. Neurosci. 13, 4700-4719.
[13] Perrett, D & Oram, M (1993). Neurophysiology of shape processing. Image Vision Comput. 11, 317-333.
[14] Poggio, T & Edelman, S (1990). A network that learns to recognize 3D objects. Nature 343, 263-266.
[15] Reynolds, JH & Desimone, R (1997). Attention and contrast have similar effects on competitive interactions in macaque area V4. Soc. Neurosci. Abstr. 23, 302.
[16] Riesenhuber, M & Dayan, P (1997). Neural models for part-whole hierarchies. In Advances in Neural Information Processing Systems 9, 17-23. MIT Press.
[17] Ullman, S (1996). High-level vision: Object recognition and visual cognition. MIT Press.
", "award": [], "sourceid": 1374, "authors": [{"given_name": "Maximilian", "family_name": "Riesenhuber", "institution": null}, {"given_name": "Tomaso", "family_name": "Poggio", "institution": null}]}