{"title": "Filter Selection Model for Generating Visual Motion Signals", "book": "Advances in Neural Information Processing Systems", "page_first": 369, "page_last": 376, "abstract": null, "full_text": "Filter Selection Model for Generating Visual Motion Signals\n\nSteven J. Nowlan\u00b7\nCNL, The Salk Institute\nP.O. Box 85800, San Diego, CA 92186-5800\n\nTerrence J. Sejnowski\nCNL, The Salk Institute\nP.O. Box 85800, San Diego, CA 92186-5800\n\nAbstract\n\nNeurons in area MT of primate visual cortex encode the velocity of moving objects. We present a model of how MT cells aggregate responses from V1 to form such a velocity representation. Two different sets of units, with local receptive fields, receive inputs from motion energy filters. One set of units forms estimates of local motion, while the second set computes the utility of these estimates. Outputs from this second set of units \"gate\" the outputs from the first set through a gain control mechanism. This active process for selecting only a subset of local motion responses to integrate into more global responses distinguishes our model from previous models of velocity estimation. The model yields accurate velocity estimates in synthetic images containing multiple moving targets of varying size, luminance, and spatial frequency profile and deals well with a number of transparency phenomena.\n\n1  INTRODUCTION\n\nHumans, and primates in general, are very good at complex motion processing tasks such as tracking a moving target against a moving background under varying luminance. In order to accomplish such tasks, the visual system must integrate many local motion estimates from cells with limited spatial receptive fields and marked orientation selectivity. 
These local motion estimates are sensitive not just to the velocity of a visual target, but also to many other features of the target such as its spatial frequency profile or local edge orientation. As a result, the integration of these motion signals cannot be performed in a fixed manner, but must be a dynamic process dependent on the visual stimulus.\n\n\u00b7Current address: Synaptics Inc., 2698 Orchard Parkway, San Jose, CA 95134.\n\nAlthough cells with motion-sensitive responses are found in primary visual cortex (V1 in primates), mounting physiological evidence suggests that the integration of these responses to produce responses which are tuned primarily to the velocity of a visual target first occurs in primate visual area MT (Albright 1992, Maunsell and Newsome 1987). We propose a computational model for integrating local motion responses to estimate the velocity of objects in the visual scene. These velocity estimates may be used for eye tracking or other visuo-motor skills. Previous computational approaches to this problem (Grzywacz and Yuille 1990, Heeger 1987, Heeger 1992, Horn and Schunck 1981, Nagel 1987) have primarily focused on how to combine local motion responses into local velocity estimates at all points in an image (the velocity flow field). We propose that the integration of local motion measurements may be much simpler if one does not try to integrate across all of the local motion measurements but only a subset. Our model learns to estimate the velocity of visual targets by solving the problems of what to integrate and how to integrate in parallel. 
The trained model yields accurate velocity estimates from synthetic images containing multiple moving targets of varying size, luminance, and spatial frequency profile.\n\n2  THE MODEL\n\nThe model is implemented as a cascade of networks of locally connected units which has two parallel processing pathways (figure 1). All stages of the model are represented as \"layers\" of units with a roughly retinotopic organization. The figure schematically represents the activity in the model at one instant of time. Conceptually, it is easier to think of the model as computing evidence for particular velocities in an image rather than computing velocity directly. Processing in the model may be divided into 3 stages, to be described in more detail below. In the first stage, the input intensity image is converted into 36 local motion \"images\" (9 of which are shown in the figure) which represent the outputs of 36 motion energy filters from each region of the input image. In the second stage, the operations of integration and selection are performed in parallel. The integration pathway combines information from motion energy filters tuned to different directions and spatial and temporal frequencies to compute the local evidence in favor of a particular velocity. The selection pathway weights each region of the image according to the amount of evidence for a particular velocity that region contains. 
In the third stage, the global evidence for a visual target moving at a particular velocity $V_k(t)$ is computed as a sum over the product of the outputs of the integration and selection pathways:\n\n$$V_k(t) = \\sum_{x,y} I_k(x, y, t)\\, S_k(x, y, t) \\qquad (1)$$\n\nwhere $I_k(x, y, t)$ is the local evidence for velocity $k$ computed by the integration pathway from region $(x, y)$ at time $t$, and $S_k(x, y, t)$ is the weight assigned by the selection pathway to that region.\n\n[Figure 1 panel labels: Input (64 x 64); Motion Energy (4 directions, 9 types); Integration; Velocity (8 x 8)]\n\nFigure 1: Diagram of motion processing model. Processing proceeds from left to right in the model, but the integration and selection stages operate in parallel. Shading within the boxes indicates different levels of activity at each stage. The responses shown in the diagram are intended to be indicative of the responses at different stages of the model but do not represent actual responses from the model.\n\n2.1  LOCAL MOTION ESTIMATES\n\nThe first stage of processing is based on the motion energy model (Adelson and Bergen 1985, Watson and Ahumada 1985). This model relies on the observation that an intensity edge moving at a constant velocity produces a line at a particular orientation in space-time. This means that an oriented space-time filter will respond most strongly to objects moving at a particular velocity.[1] A motion energy filter uses the squared outputs of a quadrature pair (90\u00b0 out of phase) of oriented filters to produce a phase independent local velocity estimate. 
The motion energy model was selected as a biologically plausible model of motion processing in mammalian V1, based primarily on the similarity of responses of simple and complex cells in cat area V1 to the output of different stages of the motion energy model (Heeger 1992, Grzywacz and Yuille 1990, Emerson et al. 1987).\n\n[1] These filters actually respond most strongly to a narrow band of spatial frequencies (SF) and temporal frequencies (TF), which represent a range of velocities, v = TF/SF.\n\nThe particular filters used in our model had spatial responses similar to a two-dimensional Gabor filter, with the physiologically more plausible temporal responses suggested by Adelson and Bergen (1985). The motion energy layer was divided into a grid of 49 by 49 receptive field locations and at each grid location there were filters tuned to four different directions of motion (up, down, left, and right). For each direction of motion there were nine different filters representing combinations of three spatial and three temporal frequencies. The filter center frequency spacings were 1 octave spatially and 1.5 octaves temporally. The filter parameters and spacings were chosen to be physiologically realistic, and were fixed during training of the model. In addition, there was a correspondence between the size of the filter\n\n[Figure 2 panel labels: Motion Energy (49 x 49); Local Competition; 33 layers (8 x 8); Output (33 units)]\n\nFigure 2: Diagram of integration and selection processing stages. Different shadings for units in the integration and output pools correspond to different directions of motion. 
Only two of the selection layers are shown and the backgrounds of these layers are shaded to match their corresponding integration and output units. See text for description of architecture.\n\nreceptive fields and the spatial frequency tuning of the filters, with lower frequency filters having larger spatial extent to their receptive fields. This is also similar to what has been found in visual cortex (Maunsell and Newsome, 1987).\n\nThe input intensity image is first filtered with a difference of Gaussians filter which is a simplification of retinal processing and provides smoothing and contrast enhancement. Each motion energy filter is then convolved with the smoothed input image producing 36 motion energy responses at each location in the receptive field grid which serve as the input to the next stage of processing.\n\n2.2  INTEGRATION AND SELECTION\n\nThe integration and selection pathways are both implemented as locally connected networks with a single layer of weights. The integration pathway can be thought of as a layer of units organized into a grid of 8 by 8 receptive field locations (figure 2). Units at each receptive field location look at all 36 motion energy measurements from each location within a 9 by 9 region of the motion energy receptive field grid. Adjacent receptive field locations receive input from overlapping regions of the motion energy layer.\n\nAt each receptive field location in the integration layer there is a pool of 33 integration units (9 units in one of these pools are shown in figure 2). These units represent motion in 8 different directions with units representing four different speeds for each direction plus a central unit indicating no motion. 
These units form a log polar representation of the local velocity at that receptive field location, since as one moves out along any \"arm\" of the pool of units each unit represents a speed twice as large as the preceding unit in that arm. All of the integration pools share a common set of weights, so in the final trained model all compute the same function.\n\nThe activity of an integration unit (which lies between 0 and 1) represents the amount of local support for the corresponding velocity. Local competition between the units in each integration pool enforces the important constraint that each integration pool can only provide strong support for one velocity. The competition is enforced using a softmax non-linearity: if $I'_k(x, y, t)$ represents the net input to unit $k$ in one of the integration pools, the state of that unit is computed as\n\n$$I_k(x, y, t) = \\frac{e^{I'_k(x, y, t)}}{\\sum_j e^{I'_j(x, y, t)}}.$$\n\nNote that the summation is performed over all units within a single pool, all of which share the same $(x, y)$ receptive field location.\n\nThe output of the model is also represented by a pool of 33 units, organized in the same way as each pool of integration units. The state of each unit in the output pool represents the global evidence within the entire image supporting a particular velocity. The state of each of these output units $V_k(t)$ is computed as the weighted sum of the state of the corresponding integration unit in all 64 integration receptive field locations (equation (1)). The weights assigned to each receptive field location are given by the state of the corresponding selection unit (figure 2). 
Although the activity of output units can be treated as evidence for a particular velocity, the activity across the entire pool of units forms a distributed representation of a continuous range of velocities (i.e., activity split between two adjacent units represents a velocity between the optimal velocities of those two units).\n\nThe selection units are also organized into a grid of 8 by 8 receptive field locations which are in one to one correspondence with the integration receptive field locations (figure 2). However, it is convenient to think of the selection units as being organized not as a single layer of units but rather as 33 layers of units, one for each output unit. In each layer of selection units, there is one unit for each receptive field location. Two of the selection layers are shown in figure 2. The layer with the vertically shaded background corresponds to the output unit for upward motion (also shaded with vertical stripes) and states of units in this selection layer weight the states of upward motion units in each integration pool (again shaded vertically).\n\nThere is global competition among all of the units in each selection layer. Again this is implemented using a softmax non-linearity: if $S'_k(x, y, t)$ is the net input to a selection unit in layer $k$, the state of that unit is computed as\n\n$$S_k(x, y, t) = \\frac{e^{S'_k(x, y, t)}}{\\sum_{x', y'} e^{S'_k(x', y', t)}}.$$\n\nNote that unlike the integration case, the summation in this case is performed over all receptive field locations. This global competition enforces the second important constraint in the model, that the total amount of support for each velocity across the entire image cannot exceed one. 
This constraint, combined with the fact that the integration unit outputs can never exceed 1, ensures that the states of the output units are constrained to be between 0 and 1 and can be interpreted as the global support within the image for each velocity, as stated earlier.\n\nThe combination of global competition in the selection layers and local competition within the integration pools means that the only way to produce strong support for a particular output velocity is for the corresponding selection network to focus all its support on regions that strongly support that velocity. This allows the selection network to learn to estimate how useful information in different regions of an image is for predicting velocities within the visual scene. The weights of both the selection and integration networks are adapted in parallel as is discussed next.\n\n2.3  OBJECTIVE FUNCTION AND TRAINING\n\nThe outputs of the integration and selection networks in the final trained model are combined as in equation (1), so that the final outputs represent the global support for each velocity within the image. During training of the system however, the outputs of each pool of integration units are treated as if each were an independent estimate of support for a particular velocity. If a training image sequence contains an object moving at velocity $V_k$ then the target for the corresponding output unit is set to 1, otherwise it is set to 0. 
The system is then trained to maximize the likelihood of generating the targets:\n\n$$\\log L = \\sum_t \\sum_k \\log \\left( \\sum_{x,y} S_k(x, y, t) \\exp\\left[ -\\left( V_k - I_k(x, y, t) \\right)^2 \\right] \\right) \\qquad (2)$$\n\nTo optimize this objective, each integration output $I_k(x, y, t)$ is compared to the target $V_k$ directly, and the outputs closest to the target value are assigned the most responsibility for that target, and hence receive the largest error signal. At the same time, the selection network states are trained to try to estimate from the input alone (i.e., the local motion measurements) which integration outputs are most accurate. This interpretation of the system during training is identical to the interpretation given to the mixture of experts (Nowlan, 1990) and the same training procedure was used. Each pool of integration units functions like an expert network, and each layer of selection units functions like a gating network.\n\nThere are, however, two important differences between the current system and the mixture of experts. First, this system uses multiple gating networks rather than a single one, allowing the system to represent more than a single velocity within an image. Second, in the mixture of experts, each expert network has an independent set of weights and essentially learns to compute a different function (usually different functions of the same input). In the current model, each pool of integration units shares the same set of weights and is constrained to compute the same function. The effect of the training procedure in this system is to bias the computations of the integration pools to favor certain types of local image features (for example, the integration stage may only make reliable velocity estimates in regions of shear or discontinuities in velocity). 
The selection networks learn to identify which features the integration stage is looking for, and to weight most heavily the image regions which contain these kinds of features.\n\n3  RESULTS AND DISCUSSION\n\nThe system was trained using 500 image sequences containing 64 frames each. These training image sequences were generated by randomly selecting one or two visual targets for each sequence and moving these targets through randomly selected trajectories. The targets were rectangular patches that varied in size, texture, and intensity. The motion trajectories all began with the objects stationary and then one or both objects rapidly accelerated to constant velocities maintained for the remainder of the trajectory. Targets moved in one of 8 possible directions, at speeds ranging between 0 and 2.5 pixels per unit of time. In training sequences containing multiple targets, the targets were permitted to overlap (targets were assigned to different depth planes at random) and the upper target was treated as opaque in some cases and partially transparent in other cases. The system was trained using a conjugate gradient descent procedure until the response of the system on the training sequences deviated by less than 1% on average from the desired response.\n\nThe performance of the trained system was tested using a separate set of 50 test image sequences. These sequences contained 10 novel visual targets with random trajectories generated in the same manner as the training sequences. The responses on this test set remained within 2.5% of the desired response, with the largest errors occurring at the highest velocities. 
Several of these test sequences were designed so that targets contained edges oriented obliquely to the direction of motion, demonstrating the ability of the model to deal with aspects of the aperture problem. In addition, only small, transient increases in error were observed when two moving objects intersected, whether these objects were opaque or partially transparent.\n\nA more challenging test of the system was provided by presenting the system with \"plaid patterns\" consisting of two square wave gratings drifting in different directions (Adelson and Movshon, 1982). Human observers will sometimes see a single coherent motion corresponding to the intersection of constraints (IOC) direction of the two grating motions, and sometimes see the two grating motions separately, as one grating sliding through the other. The percept reported can be altered by changing the contrast of the regions where the two gratings intersect relative to the contrast of the grating itself (Stoner et al., 1990). We found that for most grating patterns the model reliably reported a single motion in the IOC direction, but by manipulating the intensity of the intersection regions it was possible to find regions where the model would report the motion of the two gratings separately. Coherent grating motion was reported when the model tended to select most strongly image regions corresponding to the intersections of the gratings, while two motions were reported when the regions between the grating intersections were strongly selected.\n\nWe also explored the response properties of selection and integration units in the trained model using drifting sinusoidal gratings. 
These stimuli were chosen because they have been used extensively in exploring the physiological response properties of visual motion neurons in cortical visual areas (Albright 1992, Maunsell and Newsome 1987). Integration units tended to be tuned to a fairly narrow band of velocities over a broad range of spatial frequencies, like many MT cells (Maunsell and Newsome, 1987). The selection units had quite different response properties. They responded primarily to velocity shear (neighboring regions of differing velocity) and to flicker (temporal frequency) rather than true velocity. Cells with many of these properties are also common in MT (Maunsell and Newsome, 1987). A final important difference between the integration and selection units is their response to whole field motion. Integration units tend to have responses which are somewhat enhanced by whole field motion in their preferred direction, while selection unit responses are generally suppressed by whole field motion. This difference is similar to the recent observation that area MT contains two classes of cell, one whose responses are suppressed by whole field motion, while responses of the second class are not suppressed (Born and Tootell, 1992).\n\nFinally, the model that we have proposed is built on the premise of an active mechanism for selecting subsets of unit responses to integrate over. While this is a common aspect of many accounts of attentional phenomena, we suggest that active selection may represent a fundamental aspect of cortical processing that occurs with many pre-attentive phenomena, such as motion processing.\n\nReferences\n\nAdelson, E. H. and Bergen, J. R. (1985) Spatiotemporal energy models for the perception of motion. J. Opt. Soc. Am. 
A, 2, 284-299.\n\nAdelson, E. H. and Movshon, J. A. (1982) Phenomenal coherence of moving visual patterns. Nature, 300, 523-525.\n\nAlbright, T. D. (1992) Form-cue invariant motion processing in primate visual cortex. Science, 255, 1141-1143.\n\nBorn, R. T. and Tootell, R. B. H. (1992) Segregation of global and local motion processing in primate middle temporal visual area. Nature, 357, 497-500.\n\nEmerson, R. C., Citron, M. C., Vaughn, W. J., and Klein, S. A. (1987) Nonlinear directionally selective subunits in complex cells of cat striate cortex. J. Neurophys., 58, 33-65.\n\nGrzywacz, N. M. and Yuille, A. L. (1990) A model for the estimate of local image velocity by cells in the visual cortex. Proc. R. Soc. Lond. B, 239, 129-161.\n\nHeeger, D. J. (1987) Model for the extraction of image flow. J. Opt. Soc. Am. A, 4, 1455-1471.\n\nHeeger, D. J. (1992) Normalization of cell responses in cat striate cortex. Visual Neuroscience, in press.\n\nHorn, B. K. P. and Schunck, B. G. (1981) Determining optical flow. Artificial Intelligence, 17, 185-203.\n\nMaunsell, J. H. R. and Newsome, W. T. (1987) Visual processing in monkey extrastriate cortex. Ann. Rev. Neurosci., 10, 363-401.\n\nNowlan, S. J. (1990) Competing experts: An experimental investigation of associative mixture models. Technical Report CRG-TR-90-5, Department of Computer Science, University of Toronto.\n\nNagel, H. H. (1987) On the estimation of optical flow: relations between different approaches and some new results. Artificial Intelligence, 33, 299-324.\n\nStoner, G. R., Albright, T. D., and Ramachandran, V. S. (1990) Transparency and coherence in human motion perception. Nature, 344, 153-155.\n\nWatson, A. B. and Ahumada, A. J. (1985) Model of human visual-motion sensing. J. Opt. Soc. Am. 
A, 2, 322-342.\n", "award": [], "sourceid": 679, "authors": [{"given_name": "Steven", "family_name": "Nowlan", "institution": null}, {"given_name": "Terrence", "family_name": "Sejnowski", "institution": null}]}