{"title": "Learning Saccadic Eye Movements Using Multiscale Spatial Filters", "book": "Advances in Neural Information Processing Systems", "page_first": 893, "page_last": 900, "abstract": null, "full_text": "Learning Saccadic Eye Movements \n\nUsing Multiscale Spatial Filters \n\nRajesh P.N. Rao and Dana H. Ballard \n\nDepartment of Computer Science \n\nUniversity of Rochester \nRochester, NY 14627 \n\n{rao,dana}@cs.rochester.edu \n\nAbstract \n\nWe describe a framework for learning saccadic eye movements using a photometric representation of target points in natural scenes. The representation takes the form of a high-dimensional vector comprised of the responses of spatial filters at different orientations and scales. We first demonstrate the use of this response vector in the task of locating previously foveated points in a scene and subsequently use this property in a multisaccade strategy to derive an adaptive motor map for delivering accurate saccades. \n\n1 Introduction \n\nThere has been recent interest in the use of space-variant sensors in active vision systems for tasks such as visual search and object tracking [14]. Such sensors realize the simultaneous need for a wide field of view and good visual acuity. One popular class of space-variant sensors is formed by log-polar sensors, which have a small area of greatly increased resolution near the optical axis (the fovea) coupled with a peripheral region whose resolution falls off logarithmically as one moves radially outward. These sensors are inspired by similar structures found in the primate retina, where one finds both a peripheral region of gradually decreasing acuity and a circularly symmetric area centralis characterized by a greater density of receptors and a disproportionate representation in the optic nerve [3]. The peripheral region, though of low visual acuity, is more sensitive to light intensity and movement. 
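As an illustrative sketch of the kind of log-polar sampling just described (a generic mapping of our own, not the specific sensors cited in [14]; the function and parameter names are assumptions), the logarithmic falloff in resolution can be obtained by resampling a Cartesian image onto log-spaced rings:

```python
import numpy as np

def cartesian_to_logpolar(img, n_rings=32, n_wedges=64, r_min=2.0):
    # Resample a square grayscale image onto a log-polar grid: ring radii
    # are spaced logarithmically, so sampling density (resolution) falls
    # off logarithmically from the center (the fovea) to the periphery.
    h, w = img.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    r_max = min(cx, cy)
    radii = np.exp(np.linspace(np.log(r_min), np.log(r_max), n_rings))
    thetas = np.linspace(0.0, 2.0 * np.pi, n_wedges, endpoint=False)
    out = np.zeros((n_rings, n_wedges))
    for i, r in enumerate(radii):
        for j, t in enumerate(thetas):
            x = int(round(cx + r * np.cos(t)))   # nearest-neighbor sample
            y = int(round(cy + r * np.sin(t)))
            out[i, j] = img[min(max(y, 0), h - 1), min(max(x, 0), w - 1)]
    return out
```

In such a grid, a fixed number of samples per ring covers an exponentially growing circumference, which is what yields the wide field of view at low peripheral acuity.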
\nThe existence of a region optimized for discrimination and recognition surrounded by a region geared towards detection thus allows the image of an object of interest detected in the outer region to be placed on the more analytic center for closer scrutiny. Such a strategy, however, necessitates the existence of (a) methods to determine which location in the periphery to foveate next, and (b) fast gaze-shifting mechanisms to achieve this foveation. In the case of humans, the \"where-to-look-next\" issue is addressed both by bottom-up strategies, such as motion or salience cues from the periphery, and by top-down strategies, such as search for a particular form or color. Gaze shifting is accomplished via very rapid eye movements called saccades. Due to their high velocities, guidance through visual feedback is not possible; hence, saccadic movement is preprogrammed or ballistic: a pattern of muscle activation is calculated in advance that will direct the fovea almost exactly to the desired position [3]. \n\nIn this paper, we describe an iconic representation of scene points that facilitates top-down foveal targeting. The representation takes the form of a high-dimensional vector comprised of the responses of different-order Gaussian derivative filters, which are known to form the principal components of natural images [5], at a variety of orientations and scales. Such a representation has recently been shown to be useful for visual tasks ranging from texture segmentation [7] to object indexing using a sparse distributed memory [11]. We describe how this photometric representation of scene points can be used to locate previously foveated points when a log-polar sensor is being used. 
This property is then used in a simple learning strategy that makes use of multiple corrective saccades to adaptively form a retinotopic motor map similar in spirit to the one known to exist in the deep layers of the primate superior colliculus [13]. Our approach differs from previous strategies for learning motor maps (for instance, [12]) in that we use the visual modality to actively supply the necessary reinforcement signal required during the motor learning step (Section 3.2). \n\n2 The Multiscale Spatial Filter Representation \n\nIn the active vision framework, vision is seen as subserving a larger context of the encompassing behaviors that the agent is engaged in. For these behaviors, it is often possible to use temporary, iconic descriptions of the scene which are only relatively insensitive to variations in the view. Iconic scene descriptions can be obtained, for instance, by employing a bank of linear spatial filters at a variety of orientations and scales. In our approach, we use derivative-of-Gaussian filters since these are known to form the dominant eigenvectors of natural images [5] and can thus be expected to yield reliable results when used as basis functions for indexing.\u00b9 \n\nThe exact number of Gaussian derivative basis functions used is motivated by the need to make the representations invariant to rotations in the image plane (see [11] for more details). This invariance can be achieved by exploiting the property of steerability [4], which allows filter responses at arbitrary orientations to be synthesized from a finite set of basis filters. In particular, our implementation uses a minimal basis set of two first-order directional derivatives at 0\u00b0 and 90\u00b0, three second-order derivatives at 0\u00b0, 60\u00b0 and 120\u00b0, and four third-order derivatives oriented at 0\u00b0, 45\u00b0, 90\u00b0, and 135\u00b0. 
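The nine-filter basis set described above can be sketched as oriented Gaussian derivative kernels (an illustrative sketch, not the authors' implementation; the kernel normalization and the sigma value are our assumptions):

```python
import numpy as np

def gaussian_deriv_kernel(order, theta_deg, sigma=1.5, size=8):
    # Order-th directional derivative of a 2-D Gaussian along direction
    # theta, built from the Hermite-polynomial form of 1-D Gaussian
    # derivatives. Normalization here is illustrative only.
    half = (size - 1) / 2.0
    ys, xs = np.mgrid[-half:half + 1, -half:half + 1]
    t = np.deg2rad(theta_deg)
    u = xs * np.cos(t) + ys * np.sin(t)    # axis of differentiation
    v = -xs * np.sin(t) + ys * np.cos(t)
    g = np.exp(-(u ** 2 + v ** 2) / (2 * sigma ** 2))
    s2 = sigma ** 2
    if order == 1:
        h = -u / s2
    elif order == 2:
        h = (u ** 2 - s2) / s2 ** 2
    elif order == 3:
        h = (3 * s2 * u - u ** 3) / s2 ** 3
    else:
        raise ValueError(order)
    return h * g

# The minimal steerable basis from the text: 2 first-order, 3 second-order,
# and 4 third-order oriented filters -- nine kernels in all.
BASIS = ([(1, th) for th in (0, 90)] +
         [(2, th) for th in (0, 60, 120)] +
         [(3, th) for th in (0, 45, 90, 135)])
kernels = [gaussian_deriv_kernel(o, th) for o, th in BASIS]
```

Responses at any other orientation can then be synthesized as linear combinations of the responses of these basis filters, per the steerability property [4].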
\n\nThe response of an image patch I centered at (x0, y0) to a particular basis filter G_i^j can be obtained by convolving the image patch with the filter: \n\nr_{i,j}(x0, y0) = (G_i^j * I)(x0, y0) = \u222b\u222b G_i^j(x0 - x, y0 - y) I(x, y) dx dy   (1) \n\n\u00b9In addition, these filters are endorsed by recent physiological studies [15] which show that derivatives of Gaussians provide the best fit to primate cortical receptive field profiles among the different functions suggested in the literature. \n\nThe iconic representation for the local image patch centered at (x0, y0) is formed by combining into a single high-dimensional vector the responses from the nine basis filters, each (in our current implementation) at five different scales: \n\nr(x0, y0) = (r_{i,j,s}),   i = 1, 2, 3; j = 1, ..., i + 1; s = s_min, ..., s_max   (2) \n\nwhere i denotes the order of the filter, j indexes the filters within each order, and s the scale. \n\nThe use of multiple scales increases the perspicuity of the representation and allows interpolation strategies for scale invariance (see [9] for more details). The entire representation can be computed using only nine convolutions done at frame rate within a pipeline image processor with nine constant-size 8 x 8 kernels on a five-level octave-separated low-pass-filtered pyramid of the input image. \n\nThe 45-dimensional vector representation described above shares some of the favorable matching properties that accrue to high-dimensional vectors (cf. [6]). In particular, the distribution of distances between points in the 45-dimensional space of these vectors approximates a normal distribution; most of the points in the space lie at approximately the mean distance and are thus relatively uncorrelated with a given point [11]. 
As a result, the multiscale filter bank tends to generate almost unique location-indexed signatures of image regions which can tolerate considerable noise before they are confused with other image regions. \n\n2.1 Localization \n\nDenote the response vector from an image point as r_i and that from a previously foveated model point as r_m. Then one metric for describing the similarity between the two points is simply the square of the Euclidean distance (or the sum-of-squared-differences) between their response vectors, d_im = ||r_i - r_m||^2. The algorithm for locating model points in a new scene can then be described as follows: \n\n1. For the response vector representing a model point m, create a distance image I_m defined by \n\nI_m(x, y) = max[I_max - \u03b2 d_im, 0]   (3) \n\nwhere \u03b2 is a suitably chosen constant (this makes the best match the brightest point in I_m). \n\n2. Find the best match point (x_bm, y_bm) in the image using the relation \n\n(x_bm, y_bm) = arg max_{(x,y)} I_m(x, y)   (4) \n\nFigure 1 shows the use of the localization algorithm for targeting the optical axis of a uniform-resolution sensor in an example scene. \n\nFigure 1: Using response vectors to saccade to previously foveated positions. (a) Initial gaze point. (b) New gaze point. (c) To get back to the original point, the \"distance image\" is computed: the brightest spot represents the point whose response vector is closest to that of the original gaze point. (d) The location of the best match is marked, and an oculomotor command to that location can be executed to foveate that point. \n\n2.2 Extension to Space-Variant Sensing \n\nThe localization algorithm as presented above will obviously fail for sensors exhibiting nonuniform resolution characteristics. However, the multiscale structure of the response vectors can be effectively exploited to obtain a modified localization algorithm. Since decreasing radial resolution results in an effective reduction in scale (in addition to some other minor distortions) of previously foveated regions as they move towards the periphery, the filter responses previously occurring at larger scales now occur at smaller scales. Responses usually vary smoothly between scales; it is thus possible to establish a correspondence between the two response vectors of the same point on an object imaged at different scales by using a simple interpolate-and-compare scale matching strategy. That is, in addition to comparing an image response vector and a model response vector directly as outlined in the previous section, scale-interpolated versions of the image vector are also compared with the original model response vector. In the simplest case, interpolation amounts to shifting image response vectors by one scale: responses from a new image are compared with original model responses at the second, third, ... scales, then with model responses at the third, fourth, ... scales, and so on up to some threshold scale. This is illustrated in Figure 2 for two discrete movements of a simulated log-polar sensor. \n\n3 The Multisaccade Learning Strategy \n\nSince the high speed of saccades precludes visual guidance, advance knowledge of the precise motor command to be sent to the extraocular muscles for fixation of a desired retinal location is required. 
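The localization-with-scale-shifting procedure of Section 2.2 can be sketched as follows (our own illustration, not the authors' Datacube implementation; the array layout and parameter names are assumptions):

```python
import numpy as np

def localize(resp_map, r_model, n_filters=9, max_shift=2):
    # resp_map: (H, W, 45) array holding the 9 filter responses at each of
    # 5 scales for every pixel, ordered smallest scale first.
    # r_model: stored 45-D response vector of the previously foveated point.
    # The model vector is compared directly and at scale-shifted alignments
    # (image responses at scale s against model responses at scale s + k),
    # keeping the best match over all shifts.
    H, W, D = resp_map.shape
    best = np.full((H, W), np.inf)
    for k in range(max_shift + 1):
        m = r_model[k * n_filters:]              # model scales k+1 onward
        im = resp_map[:, :, :D - k * n_filters]  # image scales 1 .. 5-k
        d = np.sum((im - m) ** 2, axis=2)        # sum-of-squared-differences
        best = np.minimum(best, d)
    # Minimum distance corresponds to the brightest point of the distance
    # image of Eq. (3); return its pixel coordinates.
    return np.unravel_index(np.argmin(best), best.shape)
```

Dropping the first k blocks of nine model responses plays the role of the one-scale shift described in the text; a fuller implementation would interpolate between scales rather than shift whole blocks.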
Results from neurophysiological and psychophysical studies suggest that in humans this knowledge is acquired via learning: infants show a gradual increase in saccadic accuracy during their first year [1, 2], and adults can adapt to changes (caused, for example, by weakening of the eye muscles) in the interrelation between visual input and the saccades needed for centering. An adaptive mechanism for automatically learning the transfer function from retinal image space into motor space is also desirable in the context of active vision systems, since an autonomous calibration of the saccadic system would (a) avoid the need for manual calibration, which can sometimes be complicated, and (b) provide resilience amidst changing circumstances caused by, for instance, changes in the camera lens mechanisms or degradation of the motor apparatus. \n\nFigure 2: Using response vectors with a log-polar sensor. (a) through (c) represent a sequence of images (in Cartesian coordinates) obtained by movement of a simulated log-polar sensor from an original point (marked by '+') in the foveal region (indicated by a circle) towards the right. (d) depicts the process of interpolating (in this case, shifting) and matching response vectors of the same point as it moves towards the periphery of the sensor (positive responses are represented by proportional upward bars and negative ones by proportional downward bars, with the nine smallest-scale responses at the beginning and the nine largest ones at the end). \n\n3.1 Motor Maps \n\nIn primates, the superior colliculus (SC), a multilayered neuron complex located in the upper regions of the brain stem, is known to play a crucial role in saccade generation [13]. The upper layers of the SC contain a retinotopic sensory map with inputs from the retina, while the deeper layers contain a motor map approximately aligned with the sensory map. The motor map can be visualized as a topologically organized network of neurons which reacts to a local activation caused by an input signal with a vectorial output quantity that can be transcoded into a saccadic motor command. \n\nThe alignment of the sensory and motor maps suggests the following convenient strategy for foveation: an excitation in the sensory layer (signaling a foveal target) is transferred to the underlying neurons in the motor layer, which deliver the required saccade. In our framework, the excitation in the sensory layer before a goal-directed saccade corresponds to the brightest spot (most likely match) in the distance image (Figure 1 (c), for example). The formation of the sensory map can be achieved using Kohonen's well-known stochastic learning algorithm with a Gaussian input density function, as described in [12]. Our primary interest lies not in the formation of the sensory map but in the development of a learning algorithm that assigns appropriate motor vectors to each location in the corresponding retinotopically-organized motor map. In particular, our algorithm employs a visual reinforcement signal obtained using iconic scene representations to determine the error vector during the learning step. \n\n3.2 Learning the Motor Map \n\nOur multisaccade learning strategy is inspired by the following observations in [2]: During the first few weeks after birth, infants appear to fixate randomly. 
At about 3 months of age, infants are able to fixate stimuli, albeit with a number of corrective saccades of relatively large dispersion. There is, however, a gradual decrease in both the dispersion and the number of saccades required for foveation in subsequent months (Figure 3 (a) depicts a sample set of fixations). After the first year, saccades are generally accurate, requiring at most one corrective saccade.\u00b2 \n\nThe learning method begins by assigning random values to the motor vectors at each location. The response vector for the current fixation point is first stored and a random saccade is executed to a different point. The goal then is to refixate the original point with the help of the localization algorithm and a limited number of multiple corrective saccades. The algorithm keeps track of the motor vector with minimum error during each run and updates the motor vectors for the neighborhood around the original unit whenever an improvement is observed. The current run ends when either the original point is successfully foveated or the limit MAX on the number of allowable corrective saccades is exceeded. A more detailed outline of the algorithm is as follows: \n\n1. Initialize the motor map by assigning random values (within an appropriate range) to the saccadic motor vectors at each location. Align the optical axis of the sensor so that a suitable salient point falls on the fovea. Initialize the run number to t := 0. \n\n2. Store in memory the filter response vector of the point p currently in the center of the foveal region. Let t := t + 1. \n\n3. Execute a random saccade to move the fovea to a different location in the scene. \n\n4. Use the localization algorithm described in Section 2.2 and the stored response vector to find the location l of the previously foveated point in the current retinal image. 
Execute a saccade using the motor vector s_l stored at this location in the motor map. \n\n5. If the currently foveated region contains the original point p, return to 2 (s_l is accurate); otherwise, \n\n(a) Initialize the number of corrective saccades N := 0 and let s := s_l. \n\n(b) Determine the new location l' of p in the new image as in (4) and let e_min be the error vector, i.e. the vector from the foveal center to l', computed from the output of the localization algorithm. \n\n(c) Execute a saccade using the motor vector s_l' stored at l' and let e be the error vector (computed from the output of the localization algorithm) from the foveal center to the new location l'' of point p found as in 4. Let N := N + 1 and let s := s + s_l'. \n\n(d) If ||e|| < ||e_min||, then let e_min := e and update the motor vectors s_k for the units k given by the neighborhood function N(l, t) according to the well-known Kohonen rule: \n\ns_k := s_k + \u03b3(t)(s - s_k)   (5) \n\nwhere \u03b3(t) is an appropriate gain function (0 < \u03b3(t) < 1). \n\n(e) If the currently foveated region contains the original point p, return to 2; otherwise, if N < MAX, then determine the new location l' of p in the new image as in (4) and go to 5(c) (i.e. execute the next saccade); otherwise, return to 2. \n\n\u00b2Large saccades in adults are usually hypometric, i.e. they undershoot, necessitating a slightly slower corrective saccade. There is currently no universally accepted explanation for the need for such a two-step strategy. \n\n
Figure 3: (a) Successive saccades executed by a 3-month-old (left) and a 5-month-old (right) infant when presented with a single illuminated stimulus (adapted from [2]). (b) Graph showing the percentage of saccades that end directly in the fovea, plotted against the number of iterations of the learning algorithm for different values of MAX. (c) An enlarged portion of the same graph showing the points at which convergence was achieved. \n\nThe algorithm typically continues until convergence or the completion of a maximum number of runs. The gain term \u03b3(t) and the neighborhood N(l, t) for any location l are gradually decreased with increasing number of iterations t. \n\n4 Results and Discussion \n\nThe simulation results for learning a motor map comprising 961 units are shown in Figures 3 (b) and (c), which depict the variation in saccadic accuracy with the number of iterations of the algorithm for values of MAX (maximum number of corrective saccades) of 1, 5 and 10. From the graphs, it can be seen that, starting with an initially random assignment of vectors, the algorithm eventually assigns accurate saccadic vectors to all units. Fewer iterations seem to be required if more corrective saccades are allowed, but then each iteration itself takes more time. \n\nThe localization algorithm described in Section 2.1 has been implemented on a Datacube MaxVideo 200 pipeline image processing system and takes 1-2 seconds to locate points. Current work includes the integration of the multisaccade learning algorithm described above with the Datacube implementation and further evaluation of the learning algorithm. One possible drawback of the proposed algorithm is that for large retinal spaces, learning saccadic motor vectors for every retinal location can be time-consuming and, in some cases, even infeasible [1]. 
In order to address this problem, we have recently proposed a variation of the current learning algorithm which uses a sparse motor map in conjunction with distributed coding of the saccadic motor vectors. This organization bears some striking similarities to Kanerva's sparse distributed memory model [6] and is in concurrence with recent neurophysiological evidence [8] supporting a distributed population encoding of saccadic movements in the superior colliculus. We refer the interested reader to [10] for more details. \n\nAcknowledgments \n\nWe thank the NIPS*94 referees for their helpful comments. This work was supported by NSF research grant no. CDA-8822724, NIH/PHS research grant no. 1 R24 RR06853, and a grant from the Human Science Frontiers Program. \n\nReferences \n\n[1] Richard N. Aslin. Perception of visual direction in human infants. In C. Granlund, editor, Visual Perception and Cognition in Infancy, pages 91-118. Hillsdale, NJ: Lawrence Erlbaum Associates, 1993. \n\n[2] Gordon W. Bronson. The Scanning Patterns of Human Infants: Implications for Visual Learning. Norwood, NJ: Ablex, 1982. \n\n[3] Roger H.S. Carpenter. Movements of the Eyes. London: Pion, 1988. \n\n[4] William T. Freeman and Edward H. Adelson. The design and use of steerable filters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(9):891-906, September 1991. \n\n[5] Peter J.B. Hancock, Roland J. Baddeley, and Leslie S. Smith. The principal components of natural images. Network, 3:61-70, 1992. \n\n[6] Pentti Kanerva. Sparse Distributed Memory. Cambridge, MA: Bradford Books, 1988. \n\n[7] Jitendra Malik and Pietro Perona. A computational model of texture segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pages 326-332, June 1989. \n\n[8] James T. McIlwain. Distributed spatial coding in the superior colliculus: A review. 
Visual Neuroscience, 6:3-13, 1991. \n\n[9] Rajesh P.N. Rao and Dana H. Ballard. An active vision architecture based on iconic representations. Technical Report 548, Department of Computer Science, University of Rochester, 1995. \n\n[10] Rajesh P.N. Rao and Dana H. Ballard. A computational model for visual learning of saccadic eye movements. Technical Report 558, Department of Computer Science, University of Rochester, January 1995. \n\n[11] Rajesh P.N. Rao and Dana H. Ballard. Object indexing using an iconic sparse distributed memory. Technical Report 559, Department of Computer Science, University of Rochester, January 1995. \n\n[12] Helge Ritter, Thomas Martinetz, and Klaus Schulten. Neural Computation and Self-Organizing Maps: An Introduction. Reading, MA: Addison-Wesley, 1992. \n\n[13] David L. Sparks and Rosi Hartwich-Young. The deep layers of the superior colliculus. In R.H. Wurtz and M.E. Goldberg, editors, The Neurobiology of Saccadic Eye Movements, pages 213-255. Amsterdam: Elsevier, 1989. \n\n[14] Massimo Tistarelli and Giulio Sandini. Dynamic aspects in active vision. Computer Vision, Graphics, and Image Processing: Image Understanding, 56(1):108-129, 1992. \n\n[15] R.A. Young. The Gaussian derivative theory of spatial vision: Analysis of cortical cell receptive field line-weighting profiles. General Motors Research Publication GMR-4920, 1985.", "award": [], "sourceid": 923, "authors": [{"given_name": "Rajesh", "family_name": "Rao", "institution": null}, {"given_name": "Dana", "family_name": "Ballard", "institution": null}]}