{"title": "Correlates of Attention in a Model of Dynamic Visual Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 80, "page_last": 86, "abstract": "", "full_text": "Correlates of Attention in a Model of \n\nDynamic Visual Recognition* \n\nRajesh P. N. Rao \n\nDepartment of Computer Science \n\nUniversity of Rochester \nRochester, NY 14627 \n\nrao@cs.rochester.edu \n\nAbstract \n\nGiven a set of objects in the visual field, how does the visual system learn \nto attend to a particular object of interest while ignoring the rest? How are \nocclusions and background clutter so effortlessly discounted when \nrecognizing a familiar object? In this paper, we attempt to answer these \nquestions in the context of a Kalman filter-based model of visual recognition that \nhas previously proved useful in explaining certain neurophysiological \nphenomena such as endstopping and related extra-classical receptive field \neffects in the visual cortex. By using results from the field of robust statistics, \nwe describe an extension of the Kalman filter model that can handle multiple \nobjects in the visual field. The resulting robust Kalman filter model \ndemonstrates how certain forms of attention can be viewed as an emergent \nproperty of the interaction between top-down expectations and bottom-up \nsignals. The model also suggests functional interpretations of certain \nattention-related effects that have been observed in visual cortical neurons. \nExperimental results are provided to help demonstrate the ability of the model \nto perform robust segmentation and recognition of objects and image \nsequences in the presence of varying degrees of occlusions and clutter. \n\n1 INTRODUCTION \n\nThe human visual system possesses the remarkable ability to recognize objects despite the \npresence of distractors and occluders in the field of view. 
A popular suggestion is that an \"attentional \nspotlight\" mediates this ability to preferentially process a relevant object in a given \nscene (see [5, 9] for reviews). Numerous models have been proposed to simulate the control of \nthis \"focus of attention\" [10, 11, 15]. Unfortunately, the evidence for the existence of an \nexplicit neural mechanism implementing an attentional spotlight in the visual cortex is \ninconclusive. Thus, an important question is whether there are alternate neural mechanisms which \ndon't explicitly use a spotlight but whose effects can nevertheless be interpreted as attention. \nIn other words, can attention be viewed as an emergent property of a distributed network of \nneurons whose primary goal is visual recognition? \n\nIn this paper, we extend a previously proposed Kalman filter-based model of visual recognition \n[12, 13] to handle the case of multiple objects, occlusions, and clutter in the visual field. We \nprovide simulation results suggesting that certain forms of attention can be viewed as an \nemergent property of the interaction between bottom-up signals and top-down expectations during \nvisual recognition. The simulation results demonstrate how \"attention\" can be switched \nbetween different objects in a visual scene without using an explicit spotlight of attention. \n\n*This research was supported by NIH/PHS research grant 1-P41-RR09283. I am grateful to Dana \nBallard for many useful discussions and suggestions. Author's current address: The Salk Institute, CNL, \n10010 N. Torrey Pines Road, La Jolla, CA 92037. E-mail: rao@salk.edu. 
\n\n2 A KALMAN FILTER MODEL OF VISUAL RECOGNITION \n\nWe have previously introduced a hierarchical Kalman filter-based model of visual recognition \nand have shown how this model can be used to explain neurophysiological effects such as \nendstopping and neural response suppression during free-viewing of natural images [12, 13]. The \nKalman filter [7] is essentially a linear dynamical system that attempts to mimic the behavior \nof an observed natural process. At any time instant t, the filter assumes that the internal state \nof the given natural process can be represented as a k x 1 vector r(t). Although not directly \naccessible, this internal state vector is assumed to generate an n x 1 measurable and observable \noutput vector I(t) (for example, an image) according to: \n\n$I(t) = U r(t) + n(t)$    (1) \n\nwhere U is an n x k generative (or measurement) matrix, and n(t) is a Gaussian stochastic \nnoise process with mean zero and a covariance matrix given by $\\Sigma = E[n n^T]$ (E denotes the \nexpectation operator and T denotes transpose). \nIn order to specify how the internal state r changes with time, the Kalman filter assumes that \nthe process of interest can be modeled as a Gauss-Markov random process [1]. Thus, given \nthe state r(t-1) at time instant t-1, the next state r(t) is given by: \n\n$r(t) = V r(t-1) + m(t-1)$    (2) \n\nwhere V is the state transition (or prediction) matrix and m is white Gaussian noise with mean \n$\\bar{m} = E[m]$ and covariance $\\Pi = E[(m - \\bar{m})(m - \\bar{m})^T]$. \n\nGiven the generative model in Equation 1 and the dynamics in Equation 2, the goal is to \noptimally estimate the current internal state r(t) using only the measurable inputs I(t). 
An optimization \nfunction whose minimization yields an estimate of r is the weighted least-squares \ncriterion: \n\n$J = (I - U r)^T \\Sigma^{-1} (I - U r) + (r - \\bar{r})^T M^{-1} (r - \\bar{r})$    (3) \n\nwhere $\\bar{r}(t)$ is the mean of the state vector before measurement of the input data I(t) and \n$M = E[(r - \\bar{r})(r - \\bar{r})^T]$ is the corresponding covariance matrix. It is easy to show [1] that J is, \nup to a scale factor and additive constants, the sum of the negative log-likelihood of generating \nthe data I given the state r, and the negative log of the prior probability of the state r. Thus, \nminimizing J is equivalent to maximizing the posterior probability $p(r|I)$ of the state r given \nthe input data. \nThe optimization function J can be minimized by setting $\\partial J / \\partial r = 0$ and solving for the \nminimizing value $\\hat{r}$ of the state r (note that $\\hat{r}$ equals the mean of r after measurement of I). \nThe resultant Kalman filter equation is given by: \n\n$\\hat{r}(t) = \\bar{r}(t) + N(t) U^T \\Sigma(t)^{-1} (I(t) - U \\bar{r}(t))$    (4) \n$\\bar{r}(t) = V \\hat{r}(t-1) + \\bar{m}(t-1)$    (5) \n\nwhere $N(t) = (U^T \\Sigma(t)^{-1} U + M(t)^{-1})^{-1}$ is a \"normalization\" matrix that maintains the \ncovariance of the state r after measurement of I. The matrix M, which is the covariance before \nmeasurement of I, is updated as $M(t) = V N(t-1) V^T + \\Pi(t-1)$. Thus, the Kalman filter \npredicts one step into the future using Equation 5, obtains the next sensory input I(t), and then \ncorrects its prediction $\\bar{r}(t)$ using the sensory residual error $(I(t) - U \\bar{r}(t))$ and the Kalman \ngain $N(t) U^T \\Sigma(t)^{-1}$. This yields the corrected estimate $\\hat{r}(t)$ (Equation 4), which is then used \nto make the next state prediction $\\bar{r}(t+1)$. \nThe measurement (or generative) matrix U and the state transition (or prediction) matrix V \nused by the Kalman filter together encode an internal model of the observed dynamic \nprocess. As suggested in [13], it is possible to learn an internal model of the input dynamics \nfrom observed data. Let u and v denote the vectorized forms of the matrices U and V \nrespectively. 
For example, the n x k generative matrix U can be collapsed into an nk x 1 vector \n$u = [U^1 U^2 \\ldots U^n]^T$ where $U^i$ denotes the ith row of U. Note that (I - Ur) = (I - Ru) \nwhere R is the n x nk matrix given by: \n\n$R = \\begin{bmatrix} r^T & 0 & \\cdots & 0 \\\\ 0 & r^T & \\cdots & 0 \\\\ \\vdots & \\vdots & \\ddots & \\vdots \\\\ 0 & 0 & \\cdots & r^T \\end{bmatrix}$    (6) \n\nBy minimizing an optimization function similar to J [13], one can derive a Kalman filter-like \n\"learning rule\" for the generative matrix U: \n\n$\\hat{u}(t) = \\bar{u}(t) + N_u(t) R(t)^T \\Sigma(t)^{-1} (I(t) - R(t) \\bar{u}(t)) - \\alpha N_u(t) \\bar{u}(t)$    (7) \n\nwhere $\\bar{u}(t) = \\hat{u}(t-1)$, $N_u(t) = (N_u(t-1)^{-1} + R(t)^T \\Sigma(t)^{-1} R(t) + \\alpha I)^{-1}$, and I is the \nnk x nk identity matrix. The constant $\\alpha$ determines the decay rate of $\\bar{u}$. \nAs in the case of U, an estimate of the prediction matrix V can be obtained via the following \nlearning rule for v [13]: \n\n$\\hat{v}(t) = \\bar{v}(t) + N_v(t) \\hat{R}(t)^T M(t)^{-1} [r(t+1) - \\bar{r}(t+1)] - \\beta N_v(t) \\bar{v}(t)$    (8) \n\nwhere $\\bar{v}(t) = \\hat{v}(t-1)$, $N_v(t) = (N_v(t-1)^{-1} + \\hat{R}(t)^T M(t)^{-1} \\hat{R}(t) + \\beta I)^{-1}$ and $\\hat{R}$ is a k x k^2 \nmatrix analogous to R (Equation 6) but with $r^T = \\hat{r}^T$. The constant $\\beta$ determines the decay \nrate for $\\bar{v}$ while I denotes the k^2 x k^2 identity matrix. Note that in this case, the estimate of V is \ncorrected using the prediction residual error $(r(t+1) - \\bar{r}(t+1))$, which denotes the difference \nbetween the actual state and the predicted state. One unresolved issue is the specification of \nvalues for r(t) (comprising R(t)) in Equation 7 and r(t+1) in Equation 8. The \nExpectation-Maximization (EM) algorithm [4] suggests that in the case of static stimuli \n($\\bar{r}(t) = \\hat{r}(t-1)$), one may use $r(t) = \\hat{r}$, which is the converged optimal state estimate for the given static \ninput. In the case of dynamic stimuli, the EM algorithm prescribes $r(t) = \\hat{r}(t|N)$, which is \nthe optimal temporally smoothed state estimate [1] for time t ($\\le N$), given input data for each \nof the time instants 1, ..., N. Unfortunately, the smoothed estimate requires knowledge of \nfuture inputs and is computationally quite expensive. 
For the experimental results, we used \nthe on-line estimates $\\hat{r}(t)$ when updating the matrices U and V during training. \n\n3 ROBUST KALMAN FILTERING \n\nThe standard derivation of the Kalman filter minimizes Equation 3 but unfortunately does not \nspecify how the covariance $\\Sigma$ is to be obtained. A common choice is to use a constant matrix \nor even a constant scalar. Making $\\Sigma$ constant however reduces the Kalman filter estimates to \nstandard least-squares estimates, which are highly susceptible to outliers or gross errors, i.e. \ndata points that lie far away from the bulk of the observed or predicted data [6]. For example, \nin the case where I represents an input image, occlusions and clutter will cause many pixels in \nI to deviate significantly from corresponding pixels in the predicted image Ur. The problem \nof outliers can be tackled using robust estimation procedures [6] such as M-estimation, which \ninvolves minimizing a function of the form: \n\n$J' = \\sum_{i=1}^{n} \\rho(I_i - U^i r)$    (9) \n\nwhere $I_i$ and $U^i$ are the ith pixel and ith row of I and U respectively, and $\\rho$ is a function that \nincreases less rapidly than the square. This reduces the influence of large residual errors (which \ncorrespond to outliers) on the optimization of J', thereby \"rejecting\" the outliers. A special \ncase of the above function is the following weighted least squares criterion: \n\n$J' = (I - U r)^T S (I - U r)$    (10) \n\nwhere S is a diagonal matrix whose diagonal entries $S_{i,i}$ determine the weight accorded to the \ncorresponding pixel error $(I_i - U^i r)$. A simple but attractive choice for these weights is the \nnon-linear function given by $S_{i,i} = \\min\\{1, c/(I_i - U^i r)^2\\}$, where c is a threshold parameter. \nTo understand the behavior of this function, note that S effectively clips the ith summand in \nJ' (Equation 10 above) to a constant value c whenever the ith squared residual $(I_i - U^i r)^2$ \nexceeds the threshold c; otherwise, the summand is set equal to the squared residual. \n\n[Figure 1 block labels: Input I; Top-Down Prediction of Expected Input; Sensory Residual; Inhibition; Gating Matrix G; Feedforward Matrix U^T; Normalization N; Robust Kalman Filter Estimate; Prediction Matrix V; Predicted State; Feedback Matrix U] \n\nFigure 1: Recurrent Network Implementation of the Robust Kalman Filter. The gating matrix G is \na non-linear function of the current residual error between the input I and its top-down prediction $U \\bar{r}$. G \neffectively filters out any high residuals, thereby preventing outliers in input data I from influencing the \nrobust Kalman filter estimate $\\hat{r}$. Note that the entire filter can be implemented in a recurrent neural \nnetwork, with U, U^T, and V represented by the synaptic weights of neurons with linear activation functions \nand G being implemented by a set of threshold non-linear neurons with binary outputs. 
\n\nBy substituting $\\Sigma^{-1} = S$ in the optimization function J (Equation 3), we can rederive the \nfollowing robust Kalman filter equation: \n\n$\\hat{r}(t) = \\bar{r}(t) + N(t) U^T G(t) (I(t) - U \\bar{r}(t))$    (11) \n\nwhere $\\bar{r}(t) = V \\hat{r}(t-1) + \\bar{m}(t-1)$, $N(t) = (U^T G(t) U + M(t)^{-1})^{-1}$, $M(t) = V N(t-1) V^T + \\Pi(t-1)$, \nand G(t) is an n x n diagonal matrix whose diagonal entries at time instant \nt are given by: \n\n$G_{i,i} = 0$ if $(I_i(t) - U^i \\bar{r}(t))^2 > c(t)$, and $G_{i,i} = 1$ otherwise. \n\nG can be regarded as the sensory residual gain or \"gating\" matrix, which determines the \n(binary) gain on the various components of the incoming sensory residual error vector. By \neffectively filtering out any high residuals, G allows the Kalman filter to ignore the corresponding \noutliers in the input I, thereby enabling it to robustly estimate the state r. 
Figure 1 depicts an \nimplementation of the robust Kalman filter in the form of a recurrent network of linear and \nthreshold non-linear neurons. In particular, the feedforward, feedback and prediction neurons \npossess linear activation functions while the gating neurons implementing G compute binary \noutputs based on a threshold non-linearity. \n\n[Figure 2 panel labels: (a) Training Objects; (b) Input Image, Robust Estimate, Outliers; (c) Input Image, Robust Estimate 1, Outliers, Robust Estimate 2, Least Squares Estimate] \n\nFigure 2: Correlates of Attention during Static Recognition. (a) Images of size 105 x 65 used to train \na robust Kalman filter network. The generative matrix U was 6825 x 5. (b) Occlusions and background \nclutter are treated as outliers (white regions in the third image, depicting the diagonal of the gating matrix \nG). This allows the network to \"attend to\" and recognize the training object, as indicated by the accurate \nreconstruction (middle image) of the training image based on the final robust state estimate. (c) In the \nmore interesting case of the training objects occluding each other, the network converges to one of the \nobjects (the \"dominant\" one in the image - in this case, the object in the foreground). Once one object \nhas been recognized, the second object is attended to and recognized by taking the complement of the outliers \n(diagonal of G) and repeating the robust filtering process (third and fourth images). The fifth image is the \nimage reconstruction obtained from the standard (least squares derived) Kalman filter estimate, showing \nan inability to resolve or recognize either of the two objects. \n\n4 VISUAL ATTENTION IN A SIMULATED NETWORK \n\nThe gating matrix G allows the Kalman filter network to \"selectively attend\" to an object while \ntreating the remaining components of the sensory input as outliers. 
We demonstrate this \ncapability of the network using three different examples. In the first example, a network was trained \non static grayscale images of a pair of 3D objects (Figure 2 (a)). For learning static inputs, the \nprediction matrix V is unnecessary since we may use $\\bar{r}(t) = \\hat{r}(t-1)$ and M(t) = N(t-1). \nAfter training, the network was tested on images containing the training objects with varying \ndegrees of occlusion and clutter (Figure 2 (b) and (c)). The outlier threshold c was initialized \nto the mean plus k standard deviations of the current distribution of squared residual \nerrors $(I_i - U^i r)^2$. The value of k was gradually decreased during each iteration in order to \nallow the network to refine its robust estimate by gradually pruning away the outliers as it \nconverges to a single object estimate. After convergence, the diagonal of the matrix G contains \nzeros in the image locations containing the outliers and ones in the remaining locations. As \nshown in Figure 2 (b), the network was successful in recognizing the training object despite \nocclusion and background clutter. 
\nMore interestingly, the outliers (white) produce a crude segmentation of the occluder and \nbackground clutter, which can subsequently be used to focus \"attention\" on these previously \nignored objects and recover their identity. In particular, an outlier mask m can be defined by \ntaking the complement of the diagonal of G (i.e. $m_i = 1 - G_{i,i}$). By replacing the diagonal of \nG with m in Equation 11 (footnote 1) and repeating the estimation process, the network can \"attend to\" \nthe image region(s) that were previously ignored as outliers. Such a two-step serial \nrecognition process is depicted in Figure 2 (c). The network first recognizes the \"dominant\" object, \nwhich was generally observed to be the object occupying a larger area of the input image or \npossessing regions with higher contrast. The outlier mask m is subsequently used for \n\"switching attention\" and extracting the identity of the second object (lower arrow). Figure 3 shows \nexamples of attention during recognition of dynamic stimuli. In particular, Figure 3 (c) and \n(d) show how the same image sequence can be interpreted in two different ways depending on \nwhich part of the stimulus is \"attended to,\" which in turn depends on the initial priming input. 
\n\n1 Although not implemented here, this \"shifting of attentional focus\" can be automated using a model \nof neuronal fatigue and active decay (see, for example, [3]). \n\n[Figure 3 panel labels: Inputs, Predictions, Outliers; panels (a) to (d)] \n\nFigure 3: Correlates of Attention during Dynamic Recognition. (a) A network was trained on a cyclic \nimage sequence of gestures (top), each image of size 75 x 75, with U and V of size 5625 x 15 and 15 x 15 \nrespectively. The panels below show how the network can ignore various forms of occlusion and clutter \n(outliers), \"attending to\" the sequence of gestures that it has been trained on. The outlier threshold c \nwas computed as the mean plus 0.3 standard deviations of the current distribution of squared residual \nerrors. Results shown are those obtained after 5 cycles of exposure to the occluded images. (b) Three \nimage sequences used to train a network. (c) and (d) show the response of the network to ambiguous \nstimuli consisting of images containing both a horizontal and a vertical bar. Note that the network was \ntrained on a horizontal bar moving downwards and a vertical bar moving rightwards (see (b)) but not both \nsimultaneously. Given ambiguous input containing both these stimuli, the network interprets the input \ndifferently depending on the initial \"priming\" input. When the initial input is a vertical bar as in (c), the \nnetwork interprets the sequence as a vertical bar moving rightwards (with some minor artifacts due to \nthe other training sequences). On the other hand, when the initial input is a horizontal bar as in (d), the \nsequence is interpreted as a horizontal bar moving downwards, not paying \"attention\" to the extraneous \nvertical bars, which are now treated as outliers. \n\n5 CONCLUSIONS \n\nThe simulation results indicate that certain experimental observations that have previously \nbeen interpreted using the metaphor of an attentional spotlight can also arise as a result of \ncompetition and cooperation during visual recognition within networks of linear and non-linear \nneurons. Although not explicitly designed to simulate attention, the robust Kalman filter \nnetworks nevertheless display some of the essential characteristics of visual attention, such as \nthe preferential processing of a subset of the input signals and the consequent \"switching\" of \nattention to previously ignored stimuli. Given multiple objects or conflicting stimuli in their \nreceptive fields (Figures 2 and 3), the responses of the feedforward, feedback, and prediction \nneurons in the simulated network were modulated according to the current object being \n\"attended to.\" The modulation in responses was mediated by the non-linear gating neurons G, \ntaking into account both bottom-up signals as well as top-down feedback signals. 
This suggests a \nnetwork-level interpretation of similar forms of attentional response modulation in the primate \nvisual cortex [2, 8, 14], with the consequent prediction that the genesis of attentional \nmodulation in such cases may not necessarily lie within the recorded neurons themselves but within \nthe distributed circuitry that these neurons are an integral part of. \n\nReferences \n\n[1] A.E. Bryson and Y.-C. Ho. Applied Optimal Control. New York: John Wiley, 1975. \n\n[2] L. Chelazzi, E.K. Miller, J. Duncan, and R. Desimone. A neural basis for visual search \nin inferior temporal cortex. Nature, 363:345-347, 1993. \n\n[3] P. Dayan. An hierarchical model of visual rivalry. In M. Mozer, M. Jordan, and \nT. Petsche, editors, Advances in Neural Information Processing Systems 9, pages 48-54. \nCambridge, MA: MIT Press, 1997. \n\n[4] A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood from incomplete data \nvia the EM algorithm. J. Royal Statistical Society Series B, 39:1-38, 1977. \n\n[5] R. Desimone and J. Duncan. Neural mechanisms of selective visual attention. Annual \nReview of Neuroscience, 18:193-222, 1995. \n\n[6] P.J. Huber. Robust Statistics. New York: John Wiley, 1981. \n\n[7] R.E. Kalman. A new approach to linear filtering and prediction problems. Trans. ASME J. \nBasic Eng., 82:35-45, 1960. \n\n[8] J. Moran and R. Desimone. Selective attention gates visual processing in the extrastriate \ncortex. Science, 229:782-784, 1985. \n\n[9] W.T. Newsome. Spotlights, highlights and visual awareness. Current Biology, 6(4):357-360, \n1996. \n\n[10] E. Niebur and C. Koch. Control of selective visual attention: Modeling the \"where\" \npathway. In D. Touretzky, M. Mozer, and M. Hasselmo, editors, Advances in Neural \nInformation Processing Systems 8, pages 802-808. Cambridge, MA: MIT Press, 1996. \n\n[11] B.A. Olshausen, D.C. Van Essen, and C.H. Anderson. 
A neurobiological model of visual \nattention and invariant pattern recognition based on dynamic routing of information. \nJournal of Neuroscience, 13:4700-4719, 1993. \n\n[12] R.P.N. Rao and D.H. Ballard. The visual cortex as a hierarchical predictor. Technical \nReport 96.4, National Resource Laboratory for the Study of Brain and Behavior, \nDepartment of Computer Science, University of Rochester, September 1996. \n\n[13] R.P.N. Rao and D.H. Ballard. Dynamic model of visual recognition predicts neural \nresponse properties in the visual cortex. Neural Computation, 9(4):721-763, 1997. \n\n[14] S. Treue and J.H.R. Maunsell. Attentional modulation of visual motion processing in \ncortical areas MT and MST. Nature, 382:539-541, 1996. \n\n[15] J.K. Tsotsos, S.M. Culhane, W.Y.K. Wai, Y. Lai, N. Davis, and F. Nuflo. Modeling visual \nattention via selective tuning. Artificial Intelligence, 78:507-545, 1995. \n\n", "award": [], "sourceid": 1416, "authors": [{"given_name": "Rajesh", "family_name": "Rao", "institution": null}]}