{"title": "A Neurodynamical Approach to Visual Attention", "book": "Advances in Neural Information Processing Systems", "page_first": 10, "page_last": 16, "abstract": null, "full_text": "A Neurodynamical Approach to Visual Attention \n\nGustavo Deco \nSiemens AG \n\nCorporate Technology \n\nNeural Computation, ZT IK 4 \n\nOtto-Hahn-Ring 6 \n\n81739 Munich, Germany \n\nGustavo.Deco@mchp.siemens.de \n\nJosefZihl \n\nInstitute of Psychology \n\nNeuropsychology \n\nLudwig-Maximilians-University Munich \n\nLeopoldstr.  13 \n\n80802 Munich, Germany \n\nAbstract \n\nThe psychophysical evidence for \"selective attention\" originates mainly \nfrom  visual search experiments. In this work, we formulate a hierarchi(cid:173)\ncal  system of interconnected modules consisting in  populations of neu(cid:173)\nrons  for  modeling  the  underlying  mechanisms  involved  in  selective \nvisual  attention.  We  demonstrate  that  our  neural  system  for  visual \nsearch  works  across the  visual  field  in  parallel  but due  to the  different \nintrinsic dynamics can show the two experimentally observed modes of \nvisual  attention,  namely:  the  serial  and  the  parallel  search  mode.  In \nother words, neither explicit model of a focus of attention nor saliencies \nmaps are  used.  The  focus of attention  appears as an  emergent property \nof the dynamic behavior of the system. The neural population dynamics \nare  handled in  the  framework  of the  mean-field  approximation.  Conse(cid:173)\nquently, the whole process can be expressed as a system of coupled dif(cid:173)\nferential equations. \n\n1 Introduction \nTraditional  theories of human  vision  considers two functionally  distinct stages of visual \nprocessing  [1].  The  first  stage,  termed  the  preattentive  stage,  implies  an  unlimited(cid:173)\ncapacity system capable of processing the  information contained in  the  entire visual  field \nin parallel. 
The second stage is termed the attentive or focal stage and is characterized by the serial processing of visual information corresponding to local spatial regions. This stage of processing is typically associated with a limited-capacity system which allocates its resources to a single particular location in visual space. The psychophysical experiments designed for testing this hypothesis consist of visual search tasks. In a visual search test the subject has to look at a display containing a frame filled with randomly positioned items in order to seek an a priori defined target item. All other items in a frame which are not the target are called distractors. The number of items in a frame is called the frame size. The relevant variable to be measured is the reaction time as a function of the frame size. In this context, the Feature Integration Theory assumes that the two stage processes operate sequentially [1]. The first, early preattentive stage runs in parallel over the complete visual field, extracting single primitive features without integrating them. The second, attentive stage has been likened to a spotlight. This metaphor alludes that attention is focally allocated to a local region of the visual field where stimuli are processed in more detail and passed to higher levels of processing, while in the other regions, not illuminated by the attentional spotlight, no further processing occurs. Computational models formulated in the framework of feature integration theory require the existence of a saliency or priority map for registering the potentially interesting areas of the retinal input, and a gating mechanism for reducing the amount of incoming visual information, so that the limited computational resources in the system are not overloaded. 
The priority map serves to represent topographically the relevance of different parts of the visual field, in order to have a mechanism for guiding the attentional focus onto salient regions of the retinal input. The focused area will be gated, such that only the information within it will be passed further to yet higher levels, concerned with object recognition and action. The disparity between these two stages of attentional visual processing sparked a vivid experimental dispute. Duncan and Humphreys [2] have postulated a hypothesis that integrates both attentional modes (parallel and serial) as instantiations of a common principle. This principle maintains that in both schemes a selection is made. In the serial focal scheme the selection acts in the spatial dimension, while in the parallel spread scheme the selection concentrates on feature dimensions, e.g. color. On the other hand, Duncan's attentional theory [3] proposed that after a first parallel search a competition is initiated, which ends up by accepting only one object, namely the target. Recently, several electrophysiological experiments have been performed which seem to support this hypothesis [4]. Chelazzi et al. [4] measured IT (inferotemporal) neurons in monkeys observing a display containing a target object (that the monkey had seen previously) and a distractor. They report a short period during which the neuron's response is enhanced. After this period the activity level of the neuron remains high if the target is the neuron's effective stimulus, and decays otherwise. The challenging question is therefore: is the linearly increasing reaction time observed in some visual search tests really due to a serial mechanism, or is there only parallel processing followed by a time-consuming dynamical latency? 
In other words, are priority maps and the spotlight paradigm really required, or can a neurodynamical approach explain the observed psychophysical experiments? Furthermore, it should be clarified whether the feature dimension search is achieved independently in each feature dimension or is done after integrating the involved feature dimensions. We study these questions in this paper from a computational perspective. We formulate a neurodynamical model consisting of interconnected populations of biological neurons specially designed for visual search tasks. We demonstrate that it is plausible to build a neural system for visual search which works across the visual field in parallel but, due to its intrinsic dynamics, can show the two experimentally observed modes of visual attention, namely the serial focal mode and the parallel mode spread over space. In other words, neither explicit serial focal search nor saliency maps need be assumed. The focus of attention is not included in the system but simply results after convergence of the dynamical behavior of the neural networks. The dynamics of the system can be interpreted as an intrinsic dynamical routing for binding features if top-down information is available. Our neurodynamical computational model requires an independent competition mechanism along each feature dimension for explaining the experimental data, implying the necessity of the independent character of the search in separated and not integrated feature dimensions. The neural population dynamics are handled in the framework of the mean-field approximation, yielding a system of coupled differential equations. \n\n2 Neurodynamical model \nWe extend with the present model the approach of Usher and Niebur [5], which is based on the experimental data of Chelazzi et al. [4], for explaining the results of visual search experiments. 
The hierarchical architecture of our system is shown in Figure 1. The input retina is given as a matrix of visual items. The location of each item at the retina is specified by two indices ij, meaning the position at row i and column j. The dimension of this matrix is SxS, i.e. the frame size is also SxS. The information is processed at each spatial location in parallel. Different feature maps extract, for the item at each position, the local values of the features. In the present work we hypothesize that selective attention is guided by an independent mechanism which corresponds to the independent search of each feature. Let us assume that each visual item can be defined by K features. Each feature k can adopt L(k) values; for example the feature color can have the values red or green (in this case L(color) = 2). For each feature map k there exist L(k) layers of neurons for characterizing the presence of each feature value. \n\nFigure 1: Hierarchical architecture of spiking neural modules for visual selective attention. Solid arrows denote excitatory connections and dotted arrows denote inhibitory connections. \n\nA cell assembly consisting of a population of fully connected excitatory integrate-and-fire spiking neurons (pyramidal cells) is allocated in each layer and for each item location, for encoding the presence of a specific feature value (e.g. the color red) at the corresponding position. This corresponds to a sparse distributed representation. The feature maps are topographically ordered, i.e. the receptive fields of the neurons belonging to the cell assembly ij at one of these maps are sensitive to the location ij at the retinal input. 
We further assume that the cell assemblies in layers corresponding to a feature dimension are mutually inhibitory. Inhibition is modeled, according to the constraint imposed by Dale's principle, by a different pool of inhibitory neurons. Each feature dimension therefore has an independent pool of inhibitory neurons. This accounts for the neurophysiological fact that the response of V4 neurons sensitive to a specific feature value is enhanced while the activity of the other neurons sensitive to other feature values is suppressed. A high-level map, also consisting of topographically ordered excitatory cell assemblies, is introduced for integrating the different feature dimensions at each item location, i.e. for binding the features of each item. These cell assemblies are also mutually inhibited through a common pool of inhibitory neurons. This layer corresponds to the modeling of IT neurons, which show location-specific enhancement of activity by suppression of the responses of the cell assemblies associated with other locations. This fact yields a dynamical formation of a focus of attention without explicitly assuming any spotlight. Top-down information, consisting of the feature values at each feature dimension of the target item, is fed into the system by including an extra excitatory input to the corresponding feature layers. The whole system analyzes the information at all locations in parallel. Larger reaction times correspond to slower dynamical convergence at all levels, i.e. at the feature map and integration map levels. \n\nInstead of solving the explicit set of integrate-and-fire neural equations, the adopted representation in terms of Hebbian cell assemblies impels us to adopt a dynamical theory whose dependent variables are the activation levels of the cell populations. 
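As a minimal sketch of the state variables this architecture implies (an implementation assumption for illustration, not the authors' code), the cell-assembly currents can be held in nested arrays indexed by feature dimension k, feature value l and retinal location ij, with one inhibitory-pool current per feature dimension and one for the high-level map:

```python
S = 5          # frame is S x S items
K = 3          # feature dimensions (e.g. color, size, position)
L = [2, 2, 2]  # feature values per dimension, L(k) = 2

def make_state():
    # One excitatory current per cell assembly: I_feat[k][l][i][j].
    I_feat = [[[[0.0 for _ in range(S)] for _ in range(S)]
               for _ in range(L[k])] for k in range(K)]
    # One inhibitory pool per feature dimension (independent competition
    # along each dimension, as the model requires).
    I_pool = [0.0] * K
    # High-level integrating map currents and their common inhibitory pool.
    I_high = [[0.0 for _ in range(S)] for _ in range(S)]
    I_pool_high = 0.0
    return I_feat, I_pool, I_high, I_pool_high
```

The names `I_feat`, `I_pool` and `make_state` are hypothetical; the point is only that the state of the whole system is a finite set of currents, one per assembly and pool, to be evolved by the mean-field equations.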
Assuming ergodic behavior [5] it is possible to derive the dynamic equations for the cell assembly activity levels by utilizing the mean-field approximation [5]. The essential idea consists in characterizing each cell assembly by means of its activity x and an input current I that is characteristic for all cells in the population, which satisfies \n\nx = F(I)   (1) \n\nwhere F is the response function that transforms current into discharge rates for an integrate-and-fire spiking neuron with deterministic input, membrane time constant τ and absolute refractory time T_r. The system of differential equations describing the dynamics of the feature maps is: \n\nτ ∂I_ijkl(t)/∂t = -I_ijkl(t) + a F(I_ijkl(t)) - b F(I^P_k(t)) + I_0 + I^F_ijkl + I^A_kl + ν \n\nτ_P ∂I^P_k(t)/∂t = -I^P_k(t) + c Σ_{i=1..S} Σ_{j=1..S} Σ_{l=1..L(k)} F(I_ijkl(t)) - d F(I^P_k(t)) \n\nwhere I_ijkl(t) is the input current for the population with receptive field at location ij of the feature map k that analyzes the feature value l, and I^P_k(t) is the current in the inhibitory pool bound to the feature map layers of feature dimension k. The frame size is S. The additive Gaussian noise ν has standard deviation 0.002. The synaptic time constants were τ = 5 msec for the excitatory populations and τ_P = 20 msec for the inhibitory pools. The synaptic weights chosen were a = 0.95, b = 0.8, c = 2 and d = 0.1. I_0 = 0.025 is a diffuse spontaneous background input, and I^F_ijkl is the sensory input to the cells in feature map k sensitive to the value l and with receptive fields at location ij of the retina. This input characterizes the presence of the respective feature value at the corresponding position: a value of 0.05 corresponds to the presence of the respective feature value and a value of 0 to its absence. 
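A single Euler step of the feature-map equations for one feature dimension can be sketched as follows. This is a sketch under stated assumptions: the gain function `F` below is a generic sigmoid stand-in, not the exact integrate-and-fire response function of the paper, and the Euler scheme and time step are our choices:

```python
import math
import random

# Parameters from the text.
TAU, TAU_P = 0.005, 0.020      # 5 ms excitatory, 20 ms inhibitory
A, B, C, D = 0.95, 0.8, 2.0, 0.1
I0 = 0.025                     # diffuse spontaneous background input
SIGMA = 0.002                  # std of the additive Gaussian noise
S, LK = 5, 2                   # frame side and L(k) = 2 feature values

def F(I):
    # Placeholder current-to-rate transform (smooth sigmoid). The paper
    # uses the exact gain of a deterministic integrate-and-fire neuron
    # with membrane constant tau and refractory time T_r.
    return 1.0 / (1.0 + math.exp(-(I - 0.5) / 0.1))

def step_feature_map(I, I_pool, I_F, I_A, dt=0.001, rng=random.Random(0)):
    """One Euler step for feature dimension k.
    I[l][i][j]: assembly currents, I_pool: inhibitory-pool current,
    I_F[l][i][j]: sensory input, I_A[l]: top-down target input."""
    new_I = [[[0.0] * S for _ in range(S)] for _ in range(LK)]
    for l in range(LK):
        for i in range(S):
            for j in range(S):
                dI = (-I[l][i][j] + A * F(I[l][i][j]) - B * F(I_pool)
                      + I0 + I_F[l][i][j] + I_A[l] + rng.gauss(0.0, SIGMA))
                new_I[l][i][j] = I[l][i][j] + (dt / TAU) * dI
    # Inhibitory pool sums the rates of every assembly in this dimension.
    total = sum(F(I[l][i][j]) for l in range(LK)
                for i in range(S) for j in range(S))
    new_pool = I_pool + (dt / TAU_P) * (-I_pool + C * total - D * F(I_pool))
    return new_I, new_pool
```

Iterating this step for every feature dimension, together with the analogous high-level-map step, integrates the whole coupled system.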
The top-down target information I^A_kl was equal to 0.005 for the layers coding the target properties and 0 otherwise. \n\nThe higher-level integrating assemblies are described by the following differential equation system: \n\nτ_H ∂I^H_ij(t)/∂t = -I^H_ij(t) + a F(I^H_ij(t)) - b F(I^PH(t)) + I_0 + w Σ_{k=1..K} Σ_{l=1..L(k)} F(I_ijkl(t)) + ν \n\nτ_PH ∂I^PH(t)/∂t = -I^PH(t) + c Σ_{i=1..S} Σ_{j=1..S} F(I^H_ij(t)) - d F(I^PH(t)) \n\nwhere I^H_ij(t) is the input current for the population with receptive field at location ij of the high-level integrating map, and I^PH(t) is the associated current in the inhibitory pool. The synaptic time constants were τ_H = 5 msec for the excitatory populations and τ_PH = 20 msec for the inhibitory pools. The synaptic weights chosen were a = 0.95, b = 0.8, w = 1, c = 1 and d = 0.1. \n\nThese systems of differential equations were integrated numerically until a convergence criterion was reached. This criterion was that the neurons in the high-level map are polarized, i.e. \n\nF(I^H_imaxjmax(t)) - [ Σ_{ij ≠ imaxjmax} F(I^H_ij(t)) ] / (S² - 1) > θ \n\nwhere the index imaxjmax denotes the cell assembly in the high-level map with maximal activity and the threshold θ was chosen equal to 0.1. The second term on the l.h.s. measures the mean distractor activity. At each feature dimension the fixed point of the dynamics is given by the activity of the cell assemblies at the layers sharing a common value with the target and corresponding to items having this value. 
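The polarization criterion above can be sketched as a small helper: the maximal high-level-map rate must exceed the mean rate of the remaining (distractor) assemblies by the threshold θ. The function name and the example rate matrix are hypothetical:

```python
def polarized(rates_high, theta=0.1):
    """Convergence criterion of the model: the maximal rate in the
    high-level map minus the mean of the other (S^2 - 1) rates must
    exceed theta (0.1 in the paper)."""
    flat = [r for row in rates_high for r in row]
    m = max(flat)
    mean_distractor = (sum(flat) - m) / (len(flat) - 1)
    return m - mean_distractor > theta
```

The simulated reaction time is then the integration time elapsed until `polarized` first returns True.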
For example, if the target is red, at the color map the activity at the green layer will be suppressed and the cell assemblies corresponding to red items will be enhanced. At the high-level map, the populations corresponding to the locations that are maximally activated in all feature dimensions will be enhanced by suppressing the others. In other words, the location whose feature values match the top-down target specification in all feature dimensions will be enhanced when the target is at this location. \n\n3 Simulations of visual search tasks \nIn this section we present results simulating the visual search experiments involving feature and conjunction search [1]. Let us define the different kinds of search tasks by a pair of numbers m and n, where m is the number of feature dimensions by which the distractors differ from the target and n is the number of feature dimensions by which the distractor groups simultaneously differ from the target. In other words, feature search corresponds to a 1,1-search; a standard conjunction search corresponds to a 2,1-search; a triple conjunction search can be a 3,1- or a 3,2-search if the target differs from all distractor groups in one or in two features, respectively. We assume that the items are defined by three feature dimensions (K = 3, e.g. color, size and position), each one having two values (L(k) = 2 for k = 1, 2, 3). At each frame size we repeat the experiment 100 times, each time with different randomly generated distractors and target. \n\nFigure 2: Search times for feature and conjunction searches obtained utilizing the presented model. \n\nWe plot as result the mean value T of the 100 simulated reaction times (in msec) as a function of the frame size. In Figure 2, the results obtained for 1,1-, 2,1-, 3,1- and 3,2-searches are shown. The slopes of the reaction time vs. frame size curves for all simulations are fully consistent with the existing experimental results [1]. The experimental work reports that in feature search (1,1) the target is detected in parallel across the visual field. Furthermore, the reaction times for standard conjunction search and triple conjunction search are linear functions of the frame size, whereby the slope of the triple conjunction search is steeper, or very flat, compared to the standard search (2,1), if the target differs from the distractors in one (3,1) or two features (3,2), respectively. In order to analyze more carefully the dynamical evolution of the system, we plot in Figure 3 the temporal evolution of the rate activity corresponding to the target and to the distractors at the high-level integrating map, and also separately for each feature dimension level, for a parallel (1,1-search) and a serial (3,1-search) visual task. The frame size used is 25. It is interesting to note that in the case of the 1,1-search the convergence times at all levels are very small and therefore this kind of search appears as a parallel search. In the case of the 3,1-search the latency of the dynamics takes more time and therefore this kind of search appears as a serial one, despite the underlying mechanisms being parallel. 
In this case (see Figure 3-c) the large competition present in each feature dimension delays the convergence of the dynamics at each feature dimension and therefore also at the high-level map. Note in Figure 3-c the slow suppression of the distractor activity that reflects the underlying competition. \n\nFigure 3: Activity levels during visual search experiments. (a) High-level-map rates for the target, F(I^H_imaxjmax(t)), and mean distractor activity. (b) Feature-level map rates for target and one distractor activity for the 1,1-search. There is one curve for each feature dimension (i.e. 3 for the target and 3 for the distractor). (c) The same as (b) but for the 3,1-search. \n\nReferences \n[1] Treisman, A. (1988) Features and objects: The fourteenth Bartlett memorial lecture. The Quarterly Journal of Experimental Psychology, 40A, 201-237. \n\n[2] Duncan, J. and Humphreys, G. (1989) Visual search and stimulus similarity. Psychological Review, 96, 433-458. \n\n[3] Duncan, J. (1980) The locus of interference in the perception of simultaneous stimuli. Psychological Review, 87, 272-300. \n\n[4] Chelazzi, L., Miller, E., Duncan, J. and Desimone, R. 
(1993) A neural basis for visual search in inferior temporal cortex. Nature (London), 363, 345-347. \n\n[5] Usher, M. and Niebur, E. (1996) Modeling the temporal dynamics of IT neurons in visual search: A mechanism for top-down selective attention. Journal of Cognitive Neuroscience, 8, 311-327. \n", "award": [], "sourceid": 1751, "authors": [{"given_name": "Gustavo", "family_name": "Deco", "institution": null}, {"given_name": "Josef", "family_name": "Zihl", "institution": null}]}