{"title": "Selective Attention for Handwritten Digit Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 771, "page_last": 777, "abstract": null, "full_text": "Selective Attention for  Handwritten \n\nDigit  Recognition \n\nEthem Alpaydm \n\nDepartment of Computer Engineering \n\nBogazi<1i  U ni versi ty \n\nIstanbul,  TR-SOS15 Turkey \n\nalpaydin@boun.edu.tr \n\nAbstract \n\nCompletely  parallel object  recognition  is  NP-complete.  Achieving \na  recognizer  with  feasible  complexity  requires  a  compromise  be(cid:173)\ntween parallel and sequential processing  where a  system selectively \nfocuses  on  parts  of a  given  image,  one  after  another.  Successive \nfixations  are  generated  to sample the  image and these  samples are \nprocessed  and abstracted  to  generate  a  temporal context  in which \nresults are integrated over time.  A computational model based on a \npartially recurrent feedforward network is proposed and made cred(cid:173)\nible  by  testing  on  the  real-world  problem  of recognition  of hand(cid:173)\nwritten  digits with encouraging results. \n\n1 \n\nINTRODUCTION \n\nFor all-parallel bottom-up recognition, allocating one separate unit for each possible \nfeature  combination, i.e., conjunctive encoding,  implies combinatorial explosion.  It \nhas  been  shown  that  completely  parallel,  bottom-up  visual  object  recognition  is \nNP-complete  (Tsotsos,  1990).  By  exchanging space  with  time, systems  with  much \nless  complexity may be  designed.  For  example,  to phone someone at the  press  of a \nbutton,  one  needs  107  buttons  on  the  phone;  the  sequential  alternative  is  to  have \n10  buttons on the  phone and press  one at a  time, seven  times. \n\nWe propose  recognition based  on selective  attention where  we  analyze only a  small \npart of the image in detail at each step, combining results in time.  
Noton and Stark's (1971) "scanpath" theory advocates that each object is internally represented as a feature-ring, a temporal sequence of the features extracted at each fixation and the positions or the motor commands for the eye movements in between. In this approach, there is an "eye" that looks at an image but which can really see only a small part of it. This part of the image that is examined in detail is the fovea.

[Figure 1: The block diagram of the implemented system. The pre-attentive level extracts a feature map F (r x 1) from the fovea; the attentive level computes a saliency map S (n x n) from the bitmap image M (n x n) and, through a winner-take-all followed by subsample-and-blur, an eye position map P (p x p); the associative level combines these through s hidden units into 10 class units whose softmaxed outputs give the class probabilities.]

The fovea's content is examined by the pre-attentive level where basic feature extraction takes place. The features thus extracted are fed to an associative part together with the current eye position. If the accumulated information is not sufficient for recognition, the eye is moved to another part of the image, making a saccade. To minimize recognition time, the number of saccades should be minimized. This is done through defining a criterion of being "interesting" or saliency and by fixating only at the most interesting. 
Thus successive fixations are generated to sample the image and these samples are processed and abstracted to generate a temporal context in which results are integrated over time. There is a large amount of literature on selective attention in neuroscience and psychology; for reviews see respectively (Posner and Petersen, 1990) and (Treisman, 1988). The point stressed in this paper is that the approach is also useful in engineering.

2 AN EXAMPLE SYSTEM FOR OCR

The structure of the implemented system for recognition of handwritten digits is given in Fig. 1.

We have an n x n binary image in which the fovea is m x m with m < n. To minimize recognition time, the system should only attend to the parts of the image that carry discriminative information. We define a criterion of being "interesting" or saliency which is applied to all image locations in parallel to generate a saliency map, S. The saliency measure should be chosen to draw attention to parts that have the highest information content. Here, the saliency criterion is a low-pass filter which roughly counts the number of on pixels in the corresponding m x m region of the input image M. As the strokes in handwritten digits are mostly one or two pixels wide, a count of the on pixels is a good measure of the discontinuity (and thus information). It is also simple to compute:

S_{ij} = \sum_{k=i-\lfloor m/2 \rfloor}^{i+\lfloor m/2 \rfloor} \sum_{l=j-\lfloor m/2 \rfloor}^{j+\lfloor m/2 \rfloor} M_{kl} N_2((i,j), (\lfloor m/6 \rfloor)^2 I),   i,j = 1...n

where N_2(\mu, \Sigma) is the bivariate normal with mean \mu and covariance \Sigma. Note that we want the convolution kernel to have effect up to \lfloor m/2 \rfloor and also that the normal is zero after \mu ± 3\sigma. 
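As a minimal, illustrative sketch of the saliency computation above (function names and the zero-padding at the borders are our assumptions, not the paper's code), the Gaussian low-pass filter that roughly counts the on pixels in each m x m window can be written as:

```python
import numpy as np

def gaussian_kernel(m, sigma):
    """(m x m) bivariate normal kernel N_2 with std sigma, centered on the window."""
    half = m // 2
    ax = np.arange(-half, half + 1)
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx**2 + yy**2) / (2.0 * sigma**2))
    return k / (2.0 * np.pi * sigma**2)

def saliency_map(M, m, sigma=1.0):
    """S_ij: Gaussian-weighted count of on pixels in the m x m window around (i, j)."""
    n = M.shape[0]
    half = m // 2
    K = gaussian_kernel(m, sigma)
    padded = np.pad(M.astype(float), half)      # zero-pad so border windows are defined
    S = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            S[i, j] = np.sum(padded[i:i + m, j:j + m] * K)
    return S
```

For a 16 x 16 image with a blob of on pixels, the most salient location falls on the blob's center, which is the behavior the fixation mechanism relies on.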
In our simulations where n is 16 and m is 5 (typical for digit recognition), σ ≈ 1. The location that is most salient is the position of the next fixation and as such defines the new center of the fovea. A location once attended to is no longer interesting; after each fixation, the saliency of all the locations that currently are in the scope of the fovea is set to 0 to inhibit another fixation there.

The attentive level thus controls the scope of the pre-attentive level. The maximum of the saliency map through a winner-take-all gives the eye position (i*, j*) at fixation t:

(i^*(t), j^*(t)) = \arg\max_{i,j} S_{ij}

By thus following the salient regions, we get an input-dependent emergent sequence in time.

Eye-Position Map

The eye position map, P, stores the position of the eye in the current fixation. It is p x p. p is chosen to be smaller than n for dimensionality reduction, decreasing complexity and introducing an effect of regularization (giving invariance to small translations). When p is a factor of n, computations are also simpler. We also blur the immediate neighbors for a smoother representation:

P(t) = blur(subsample(winner-take-all(S)))

Pre-Attentive Level: Feature Extraction

The pre-attentive level extracts detailed features from the fovea to generate a feature map. This information and the current eye position are passed to the associative system for recognition. There is a trade-off between the fovea size and the number of saccades required for recognition: As the operation in the pre-attentive level is carried out in parallel, to minimize complexity the features extracted there should not be many and the fovea should not be large: the fovea is where the expensive computation takes place. 
On the other hand, the fovea should be large enough to extract discriminative features and thus complete recognition in a small amount of time. The features to be extracted can be learned through a supervised method when feedback is available.

The m x m region symmetrically around (i*, j*) is extracted as the fovea I and is fed to the feature extractors. The r features extracted there are passed on to the associative level as the feature map, F. r is typically 4 to 8. U_g denotes the weights of feature g and F_g is the value of feature g that is found by convolving the fovea input with the feature weight vector (f(.) is the sigmoid function):

I_{ij}(t) = M_{i^*(t) - \lfloor m/2 \rfloor + i,\; j^*(t) - \lfloor m/2 \rfloor + j},   i,j = 1...m

F_g(t) = f\left(\sum_i \sum_j U_{gij} I_{ij}(t)\right),   g = 1...r

Associative Level: Classification

At each fixation, the associative level is fed the feature map from the pre-attentive level and the eye position map from the attentive level. As a number of fixations may be necessary to recognize an image, the associative system should have a short-term memory able to accumulate inputs coming through time. Learning similarly should be through time. When used for classification, the class units are organized so as to compete and during recognition the activations of the class units evolve till one class gets sufficiently active and suppresses the others. When a training set is available, a temporal supervised method can be used to train the associative level. Note that there may be more than one scanpath for each object and learning one sequence for each object fails. We see it as a task of accumulating two types of information through time: the "what" (features extracted) and the "where" (eye position).
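The attentive/pre-attentive steps above can be sketched as follows; this is an illustrative toy, not the paper's implementation, and the function names, zero-padding of the fovea, and the omission of the blur step are our assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fixate(S):
    """Winner-take-all: (i*, j*) = argmax_ij S_ij."""
    return np.unravel_index(np.argmax(S), S.shape)

def extract_fovea(M, i, j, m):
    """The m x m region of M symmetrically around (i, j), zero-padded at borders."""
    half = m // 2
    padded = np.pad(M.astype(float), half)
    return padded[i:i + m, j:j + m]

def feature_map(fovea, U):
    """F_g = f(sum_ij U_gij I_ij) for g = 1..r; U has shape (r, m, m)."""
    return sigmoid(U.reshape(U.shape[0], -1) @ fovea.ravel())

def eye_position_map(i, j, n, p):
    """Subsample the fixation point onto a p x p map (blurring omitted here)."""
    P = np.zeros((p, p))
    P[i * p // n, j * p // n] = 1.0
    return P
```

With n = 16, m = 5, r = 8 and p = 4 as in the paper, one fixation yields an 8-dimensional feature map and a 4 x 4 eye-position map, which together form the input to the associative level.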
\n\nThe  fovea  map,  F,  and  the  eye  position  map,  P,  are  concatenated  to  make  a \nr  + p  X  P  dimensional  input  that  is  fed  to  the  associative  level.  Here  we  use  an \nartificial  neural  network  with  one  hidden  layer of 8  units.  We  have  experimented \nwith  various  architectures  and  noticed  that  recurrency  at  the  output  layer  is  the \nbest.  There are  10  output units. \n\nf  (L VhgFg(t) + L  L  WhabPab(t)) ,  h =  1. .. s \n\ng ab  \n\nLTchHh + L  RckPk(t - 1),  c =  1. .. 10 \nh \nexp[Oc(t)] \n\nk \n\nLk exp[Ok(t)] \n\nwhere  P  denotes  the  \"softmax\"ed output probabilities (Bridle,  1990) and P(t - 1) \nare  the  values  in  the  preceding  fixation  (initially 0).  We  use  the  cross-entropy  as \nthe goodness  measure: \n\nC =  L \nt \n\n1 \nt L  Dk 10gPc(t), \n\nt ~ 1 \n\nc \n\nDc  is  the  required  output for  class  c.  Learning is  gradient-ascent  on  this  goodness \nmeasure.  The fraction  lit is to give more weight  to initial fixations than later ones. \nConnections  to the output  units are updated as follows  (11  is  the learning factor): \n\n\fSelective  Attention  for  Handwritten  Digit  Recognition \n\n775 \n\nNote that we  assume 8PIc(t -1)/8Rclc  =  o.  For the connections to the hidden units \nwe  have: \n\nc \n\nWe  can  back-propagate  one  step  more  to  train  the  feature  extractors.  Thus  the \nupdate  equations for  the connections  to feature  units are: \n\nCg(t)  = L Ch(t)Vhg \n\nh \n\nA  series  of fixations  are  made  until  one  of  the  class  units  is  sufficiently  active: \n3c, Pc > 8  (typically 0.99), or when the most salient point has a saliency less  than a \ncertain threshold  (this condition is rarely met after the first  few  epochs).  
Then the computed changes are summed up and the updates are made.

Backpropagation through time, where the recurrent connections are unfolded in time, did not work well in this task because, as explained before, for the same class there is more than one scanpath. The above-mentioned approach is like real-time recurrent learning (Williams and Zipser, 1989) where the partial derivatives in the previous time step are taken to be 0, thus ignoring this temporal dependence.

3 RESULTS AND DISCUSSION

We have experimented with various parameter settings and finally chose the architecture given above: When the input is 16 x 16 and there are 10 classes, the fovea is 5 x 5 with 8 features and there are 16 hidden units. There are 1,934 images for training, 946 for cross-validation and 943 for testing. Results are given in Table 1. It can be seen that by scanning less than half of the image, we get 80% generalization. In addition to the local high-resolution image provided by the fovea, a low-resolution image of the surrounding parafovea can be given to the associative level for better recognition. For example, we low-pass filtered and undersampled the original image to get a 4 x 4 image which we fed to the class units in addition to the attention-based hidden units. Success went up considerably and fewer fixations were necessary; compare rows 1 and 2 of the Table. The information provided by the 4 x 4 map is actually not much, as can be seen from row 3 of the table where only that is given as input. Thus the idea is that when we have a coarse input, looking only at a quarter of the image in detail is sufficient to get 93% accuracy. Both features (what) and eye positions (where) are necessary for good recognition.
\nWhen only one is used without the other, success is quite low as can be seen  in rows \n4 and 5.  In the  last  row,  we  see  the  performance of a  multi layer  percept ron  with \n10  hidden  units that does  all-parallel recognition. \n\nBeyond a  certain network size,  increasing the number of features  do not help much. \nDecreasing  8,  the  certainty  threshold,  decreases  the  number of fixations  necessary \n\n\f776 \n\nE. ALPAYDIN \n\nTable  1:  Results  of handwritten digit recognition  with selective  attention.  Values \ngiven  are  average  and  standard  deviation  of  10  independent  runs.  See  text  for \ncomments. \n\nNO  OF \n\nTEST \n\nPARAMS  SUCCESS \n\nTRAINING \n\nEPOCHS \n\nNO  OF \n\nFIXATIONS \n\nMETHOD \n\nSA  system \nSA+parafovea \nOnly parafovea \nOnly  what info \nOnly where  info \n\n878 \n1,038 \n170 \n622 \n440 \n\n79.7,  1.8 \n92.5,0.8 \n86.9,0.2 \n49.0,21.0 \n54.2,  1.4 \n\n74.5,  17.1 \n54.2,  10.2 \n52.3,8.2 \n66.6,  30.6 \n92.9,6.5 \n\n6.5,0.2 \n3.9,0.3 \n1.0,  0.0 \n7.5,0.1 \n7.6,0.0 \n\n1.0,0.0 \n\nMLP,  10  hiddens \n\n2,680 \n\n95.1,  0.6 \n\n13.5,4.1 \n\nwhich we  want,  but decreases  success  too which  we  don't.  Smaller foveas  decrease \nthe  number  of free  parameters  but  decrease  success  and  require  a  larger  number \nof fixations.  Similarly larger  foveas  decrease  the  number  of fixations  but  increase \ncomplexity. \n\nThe simple low-pass filter  used here  as a  saliency measure is  the simplest measure. \nPreviously it has been  used  by  Fukushima and Imagawa (1993) for  finding the next \ncharacter,  i.e.,  segmentation,  and  also  by  Olshausen  et  al.  (1992)  for  translation \ninvariance.  More robust  measures at the  expense  of more computations, are  possi(cid:173)\nble;  see  (Rimey and Brown,  1990;  Milanese  et  al.,  1993).  
Salient regions are those that are conspicuous, i.e., different from their surrounding, where there is a change in X where X can be brightness or color (edges), orientation (corners), time (motion), etc. It is also possible that top-down, task-dependent saliency measures be integrated to further minimize recognition time, implying a remembered explicit sequence analogous to skilled motor behaviour (probably gained after many repetitions).

Here a partially recurrent network is used for temporal processing. Hidden Markov Models as used in speech recognition are another possibility (Rimey and Brown, 1990; Hacisalihzade et al., 1992). They are probabilistic finite automata which can be trained to classify sequences, and one can have more than one model for an object.

It should be noted here that better approaches for the same problem exist (Le Cun et al., 1990). Here we advocate a computational model and make it plausible by testing it on a real-world problem. It is necessary for more complicated problems where an all-parallel approach would not work. For example, Le Cun et al.'s model for the same type of inputs has 2,578 free parameters. Here there are

(m x m + 1) x r + (r + p x p + 1) x s + (s + 1) x 10 + 10 x 10

free parameters, where the four terms count U, then V and W together, then T, then R. This makes 878 when m = 5, r = 8, s = 16 and p = 4. This is the main advantage of selective attention: the complexity of the system is heavily reduced at the expense of slower recognition, both in the overt form of attention through foveation and in its covert form, for binding features. For this latter type of attention, not discussed here, see (Ahmad, 1992). 
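The parameter count above can be checked directly; the variable names below are ours, the settings (m = 5, r = 8, s = 16, p = 4, 10 classes) are the paper's:

```python
# Free-parameter count of the selective-attention system.
m, r, p, s, classes = 5, 8, 4, 16, 10

U  = (m * m + 1) * r            # feature extractors (with bias)
VW = (r + p * p + 1) * s        # hidden layer: V and W plus bias
T  = (s + 1) * classes          # hidden-to-output connections
R  = classes * classes          # recurrent output connections
total = U + VW + T + R
print(total)                    # 878, as stated in the paper
```

This is roughly a third of the 2,578 parameters of the all-parallel model cited above.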
Also note that low-level feature extraction operations like those carried out in the pre-attentive level are local convolutions and are appropriate for parallel processing, e.g., on a SIMD machine. Higher-level operations require larger connectivity and are better carried out sequentially. Nature also seems to have taken this direction.

Acknowledgements

This work is supported by TÜBİTAK Grant EEEAG-143 and Boğaziçi University Research Funds 95HA108. Cenk Kaynak prepared the handwritten digit database based on the programs provided by NIST (Garris et al., 1994).

References

S. Ahmad. (1992) VISIT: A Neural Model of Covert Visual Attention. In J. Moody, S. Hanson, R. Lippmann (Eds.) Advances in Neural Information Processing Systems 4, 420-427. San Mateo, CA: Morgan Kaufmann.

J.S. Bridle. (1990) Probabilistic Interpretation of Feedforward Classification Network Outputs with Relationships to Statistical Pattern Recognition. In Neurocomputing, F. Fogelman-Soulie, J. Herault, Eds. Springer, Berlin, 227-236.

K. Fukushima, T. Imagawa. (1993) Recognition and Segmentation of Connected Characters with Selective Attention, Neural Networks, 6: 33-41.

M.D. Garris et al. (1994) NIST Form-Based Handprint Recognition System, NISTIR 5469, NIST Computer Systems Laboratory.

S.S. Hacisalihzade, L.W. Stark, J.S. Allen. (1992) Visual Perception and Sequences of Eye Movement Fixations: A Stochastic Modeling Approach, IEEE SMC, 22, 474-481.

Y. Le Cun et al. (1990) Handwritten Digit Recognition with a Back-Propagation Network. In D.S. Touretzky (Ed.) Advances in Neural Information Processing Systems 2, 396-404. San Mateo, CA: Morgan Kaufmann.

R. Milanese et al. 
(1994) Integration of Bottom-Up and Top-Down Cues for Visual Attention using Non-Linear Relaxation, IEEE Int'l Conf on CVPR, Seattle, WA, USA.

D. Noton, L. Stark. (1971) Eye Movements and Visual Perception, Scientific American, 224: 34-43.

B. Olshausen, C. Anderson, D. Van Essen. (1992) A Neural Model of Visual Attention and Invariant Pattern Recognition, CNS Memo 18, Caltech.

M.I. Posner, S.E. Petersen. (1990) The Attention System of the Human Brain, Ann. Rev. Neurosci., 13: 25-42.

R.D. Rimey, C.M. Brown. (1990) Selective Attention as Sequential Behaviour: Modelling Eye Movements with an Augmented Hidden Markov Model, TR-327, Computer Science, Univ of Rochester.

A. Treisman. (1988) Features and Objects, Quarterly Journ. of Exp. Psych., 40: 201-237.

J.K. Tsotsos. (1990) Analyzing Vision at the Complexity Level, Behav. and Brain Sci., 13: 423-469.

R.J. Williams, D. Zipser. (1989) A Learning Algorithm for Continually Running Fully Recurrent Neural Networks, Neural Computation, 1, 270-280.", "award": [], "sourceid": 1047, "authors": [{"given_name": "Ethem", "family_name": "Alpaydin", "institution": null}]}