{"title": "Using a Saliency Map for Active Spatial Selective Attention: Implementation & Initial Results", "book": "Advances in Neural Information Processing Systems", "page_first": 451, "page_last": 458, "abstract": null, "full_text": "Using a Saliency Map for Active Spatial Selective \n\nAttention: Implementation & Initial Results \n\nShumeet Baluja \nbaluja@cs.cmu.edu \n\nSchool of Computer Science \nCarnegie Mellon University \n\nPittsburgh, PA 15213 \n\nDean A. Pomerleau \n\npomerleau@cs.cmu.edu \n\nSchool of Computer Science \nCarnegie Mellon University \n\nPittsburgh, PA 15213 \n\nAbstract \n\nIn many vision based tasks, the ability to focus attention on the important \nportions of a scene is crucial for good performance on the tasks. In this paper \nwe present a simple method of achieving spatial selective attention through \nthe use of a saliency map. The saliency map indicates which regions of the \ninput retina are important for performing the task. The saliency map is cre(cid:173)\nated through predictive auto-encoding. The performance of this method is \ndemonstrated on two simple tasks which have multiple very strong distract(cid:173)\ning features in the input retina. Architectural extensions and application \ndirections for this model are presented. \n\n1 MOTIVATION \n\nMany real world tasks have the property that only a small fraction of the available input is \nimportant at any particular time. On some tasks this extra input can easily be ignored. \nNonetheless, often the similarity between the important input features and the irrelevant \nfeatures is great enough to interfere with task performance. Two examples of this phenom(cid:173)\nena are the famous \"cocktail party effect\", otherwise known as speech recognition in a \nnoisy environment, and image processing of a cluttered scene. In both cases, the extrane(cid:173)\nous information in the input signal can be easily confused with the important features, \nmaking the task much more difficult. 
The concrete real-world task which motivates this work is vision-based road following. In this domain, the goal is to control a robot vehicle by analyzing the scene ahead, and choosing a direction to travel based on the location of important features like lane markings and road edges. This is a difficult task, since the scene ahead is often cluttered with extraneous features such as other vehicles, pedestrians, trees, guardrails, crosswalks, road signs and many other objects that can appear on or around a roadway.[1] While we have had significant success on the road following task using simple feed-forward neural networks to transform images of the road ahead into steering commands for the vehicle [Pomerleau, 1993b], these methods fail when presented with cluttered environments like those encountered when driving in heavy traffic, or on city streets.

[1] For the general task of autonomous navigation, these extra features are extremely important, but for the restricted task of road following, which is the focus of this paper, these features are merely distractions. Although we are addressing the more general task using the techniques described here in combination with other methods, a description of these efforts is beyond the scope of this paper.

The obvious solution to this difficulty is to focus the attention of the processing system on only the relevant features by masking out the "noise". Because of the high degree of similarity between the relevant features and the noise, this filtering is often extremely difficult. Simultaneously learning to perform a task like road following and to filter out clutter in a scene is doubly difficult because of a chicken-and-egg problem: it is hard to learn which features to attend to before knowing how to perform the task, and it is hard to learn how to perform the task before knowing which features to attend to.
This paper describes a technique designed to solve this problem. It involves deriving a "saliency map" of the image from a neural network's internal representation which highlights regions of the scene considered to be important. This saliency map is used as feedback to focus the attention of the network's processing on subsequent images. This technique overcomes the chicken-and-egg problem by simultaneously learning to identify which aspects of the scene are important, and how to use them to perform a task.

2 THE SALIENCY MAP

A saliency map is designed to indicate which portions of the image are important for completing the required task. The trained network should accomplish two goals with the presentation of each image: first, to perform the given task using the inputs and the saliency map derived from the previous image; second, to predict the salient portions of the next image.

2.1 Implementation

The creation of the saliency map is similar to the technique of Input Reconstruction Reliability Estimation (IRRE) [Pomerleau, 1993]. IRRE attempts to predict the reliability of a network's output. The prediction is made by reconstructing the input image from linear transformations of the activations in the hidden layer, and comparing it with the actual image. IRRE works on the premise that the greater the similarity between the input image and the reconstructed input image, the more the internal representation has captured the important input features, and therefore the more reliable the network's response.

A similar method to IRRE can be used to create a saliency map. The saliency map should be determined by the important features in the current image for the task to be performed.
Because compressed representations of the important features in the current image are present in the activations of the hidden units, the saliency map is derived from these activations, as shown in Figure 1. It should be noted that the hidden units from which the saliency map is derived do not necessarily contain information similar to principal components (as is achieved through auto-encoder networks), since the relevant task may only require information about a small portion of the image. In the simple architecture depicted in Figure 1, the internal representation must contain information which can be transformed by a single layer of adjustable weights into a saliency map for the next image. If such a transformation is not possible, separate hidden layers, with input from the task-specific internal representations, could be employed to create the saliency map.

The saliency map is trained by using the next image of a time-sequential data set as the target image for the prediction, and applying standard error backpropagation on the differences between the next image and the predicted next image.

Figure 1: A simple architecture for using a saliency map. The input retina feeds the hidden units, which feed both the output units and a derived saliency map; the saliency map is fed back (delayed one time step) to highlight the relevant portion of the input. The dashed line represents "chilled connections", i.e. errors from these connections do not propagate back further to impact the activations of the hidden units. This architecture assumes that the target task contains information which will help determine the salient portions of the next frame.

The weights from the hidden units to the saliency map are adjusted using standard backpropagation, but the error terms are not propagated to the weights from the inputs to the hidden units. This ensures that the hidden representation developed by the network is determined only by the target task, and not by the task of prediction.

In the implementation used here, the feedback is to the input layer. The saliency map is created to be either the same size as the input layer, or scaled to the same size, so that it can influence the representation in a straightforward manner. The saliency map's values are scaled between 0.0 and 1.0, where 1.0 represents the areas in which the prediction matched the next image exactly. A value of 1.0 does not alter the activation of the input unit; a value of 0.0 turns off the activation. The exact construction of the saliency map is described in the next section, with the description of the experiments. The entire network is trained by standard backpropagation; in the experiments presented, no modifications to the standard training algorithm were needed to account for the feedback connections.

The training process for prediction is complicated by the potential presence of noise in the next image. The saliency map cannot "reconstruct" the noise in the next image, because it can only construct the portions of the next image which can be derived from the activation of the hidden units, which are task-specific. Therefore, the noise in the next image will not be constructed, and thereby will be de-emphasized in the next time step by the saliency map. The saliency map serves as a filter, which channels the processing to the important portions of the scene [Mozer, 1988].
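A minimal NumPy sketch of this training scheme follows. It is an illustration only, not the network used in the paper: the layer sizes, learning rate, and logistic activations are assumptions. The point is the "chilled connections": the prediction error adjusts only the hidden-to-saliency weights and never reaches the input-to-hidden weights, so the hidden representation is shaped by the target task alone.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Layer sizes are illustrative placeholders, not the paper's 30x32 retina.
n_in, n_hid, n_out = 16, 8, 4
W_ih = rng.normal(0, 0.5, (n_hid, n_in))   # input  -> hidden (task pathway)
W_ho = rng.normal(0, 0.5, (n_out, n_hid))  # hidden -> task output
W_hs = rng.normal(0, 0.5, (n_in, n_hid))   # hidden -> predicted saliency map

def train_step(x, y_task, next_image, lr=0.5):
    """One backprop step. The saliency head's error updates W_hs only:
    its gradient is 'chilled' and never reaches W_ih."""
    h = sigmoid(W_ih @ x)
    y = sigmoid(W_ho @ h)
    s = sigmoid(W_hs @ h)                   # predicted saliency for next frame

    d_y = (y - y_task) * y * (1 - y)        # task output error
    d_h = (W_ho.T @ d_y) * h * (1 - h)      # propagated into the hidden layer
    d_s = (s - next_image) * s * (1 - s)    # prediction error (stops at W_hs)

    W_ho[:] -= lr * np.outer(d_y, h)
    W_ih[:] -= lr * np.outer(d_h, x)        # note: no contribution from d_s
    W_hs[:] -= lr * np.outer(d_s, h)
    return y, s

# Train on one fixed pattern to show both heads learning.
x = rng.random(n_in)
y_target = np.array([1.0, 0.0, 0.0, 1.0])
nxt = rng.random(n_in)                      # stand-in for the next frame
for _ in range(500):
    y, s = train_step(x, y_target, nxt)
```

After training, `y` approximates the task target while `s` is the network's guess at the next frame, produced without ever letting the prediction error alter the task pathway.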
One of the key differences between the filtering employed in this study and that used in other focus-of-attention systems is that this filtering is based on expectations from multiple frames, rather than on the retinal activations from a single frame. An alternative neural model of visual attention, explored by [Olshausen et al., 1992], achieved focus of attention in single frames by using control neurons to dynamically modify synaptic strengths.

The saliency map may be used in two ways. It can either be used to highlight portions of the input retina or, when the hidden layer is connected in a retinal fashion using weight sharing, as in [LeCun et al., 1990], it can be used to highlight important spatial locations within the hidden layer itself. The difference is between highlighting the individual pixels from which the features are developed and highlighting the developed features. Discussion of the psychological evidence for both of these types of highlighting (in a single-frame, retinal-activation based context) is given in [Pashler and Badgio, 1985].

This network architecture shares several characteristics with a Jordan-style recurrent network [Jordan, 1986], in which the output from the network is used to influence the processing of subsequent input patterns in a temporal sequence. One important distinction is that the feedback in this architecture is spatially constrained: the saliency map represents the importance of local patches of the input, and can influence only the network's processing of corresponding regions of the input. The second distinction is that the outputs are not general task outputs; rather, they are specially constructed to predict the next image. The third distinction is in the form of this influence.
Instead of treating the feedback as additional input units, which contribute to the weighted sum for the network's hidden units, this architecture uses the saliency map as a gating mechanism, suppressing or emphasizing the processing of various regions of the layer to which it is fed back. In some respects, the influence of the saliency map is comparable to the gating network in the mixture of experts architecture [Jacobs et al., 1991]. Instead of gating between the outputs of multiple expert networks, in this architecture the saliency map is used to gate the activations of the input units within the same network.

3 THE EXPERIMENTS

In order to study the feasibility of the saliency map without introducing other extraneous factors, we have conducted experiments with the two simple tasks described below. Extensions of these ideas to larger problems are discussed in sections 4 & 5. The first experiment is representative of a typical machine vision task, in which the relevant features move very little in consecutive frames. With the method used here, the relevant features are automatically determined and tracked. However, if the relevant features were known a priori, a more traditional vision feature tracker, which begins the search for features within the vicinity of the features' locations in the previous frame, could also perform well. The second task is one in which the feature of interest moves in a discontinuous manner. A traditional feature tracker without exact knowledge of the feature's transition rules would be unable to track this feature in the presence of the noise introduced. The transition rules of the feature of interest are learned automatically through the use of the saliency map.

In the first task, there is a slightly tilted vertical line of high activation in a 30x32 input unit grid.
The width of the line is approximately 9 units, with the activation decaying with distance from the center of the line. The rest of the image does not have any activation. The task is to center a Gaussian of activation around the x-intercept of the line, 5 pixels above the top of the image. The output layer contains 50 units. In consecutive images, the line can make a small translational move and/or a small rotational move. Sample training examples are given in Figure 2. This task can be easily learned in the absence of noise. The task is made much harder when lines which have the same visual appearance as the real line (in everything except location and tilt) randomly appear in the image. In this case, it is vital that the network be able to distinguish between the real line and a noise line by using information gathered from previous image(s).

In the second task, a cross ("+") of size 5x5 appears in a 20x20 grid. There are 16 positions in which the cross can appear, as shown in Figure 2c. The location in which the cross appears is set according to the transition rules shown in Figure 2c. The object of this problem is to reproduce the cross in a smaller 10x10 grid, with the edges of the cross extended to the edges of the grid, as shown in Figure 2b. The task is complicated by the presence of randomly appearing noise. The noise is in the form of another cross which appears identical to the cross of interest. Again, in this task, it is vital for the network to be able to distinguish between the real cross and crosses which appear as noise. As in the first task, this is only possible with knowledge of the previous image(s).
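A sketch of how the second task's stimuli might be generated is given below. The 16 cross centers on a 4x4 lattice and the fixed-permutation transition rule are hypothetical stand-ins; the paper's actual positions and transition rules are those of Figure 2c, which is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

GRID, ARM = 20, 2                        # 20x20 retina, 5x5 cross (arm = 2)
# Hypothetical: 16 candidate centers on a 4x4 lattice.
CENTERS = [(r, c) for r in (3, 8, 13, 17) for c in (3, 8, 13, 17)]
NEXT_POS = rng.permutation(16)           # hypothetical transition rule

def draw_cross(img, center):
    r, c = center
    img[r - ARM:r + ARM + 1, c] = 1.0    # vertical bar
    img[r, c - ARM:c + ARM + 1] = 1.0    # horizontal bar

def make_frame(pos_idx, p_noise=0.5):
    """One input frame: the real cross at CENTERS[pos_idx], plus (with
    probability p_noise) an identical-looking noise cross at a random
    other position, as in the '1 noise' condition."""
    img = np.zeros((GRID, GRID))
    draw_cross(img, CENTERS[pos_idx])
    if rng.random() < p_noise:
        others = [i for i in range(16) if i != pos_idx]
        draw_cross(img, CENTERS[int(rng.choice(others))])
    return img

def next_position(pos_idx):
    """Deterministic move of the real cross between consecutive frames."""
    return int(NEXT_POS[pos_idx])
```

Because the noise cross is pixel-identical to the real one, a network seeing a single frame has no basis for telling them apart; only the deterministic transition rule, learned across frames, disambiguates them.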
Figure 2: (A) The first task: image (A1) with no distractions, image (A2) with one distracting feature. (B) The second task: image (B1) with no distractions, image (B2) with two distractions. (C) Transition rules for the second task.

3.1 Results

The two problems described above were attempted with networks trained both with and without noise. Each of the training sessions was also tested with and without the saliency map. Each type of network was trained for the same number of pattern presentations with the same training examples. The results are shown in Table 1, and represent the error accrued over 10,000 testing examples. For task 1, errors are reported in terms of the absolute difference between the peaks of the Gaussians produced in the output layer, summed over all of the testing examples (the maximum error per image is 49). In task 2, the errors are the sum of the absolute differences between the network's output and the target output, summed across all of the outputs and all of the testing examples.

When noise was added to the examples, it was added in the following manner (for both training and testing sets): in task 1, '1 noise' guarantees one noise element and '2 noise' guarantees two noise elements. In task 2, however, '1 noise' means that there is a 50% chance of a noise element occurring in the image, and '2 noise' means that there is a 50% chance of another noise element occurring, independently of the appearance of the first noise element. The positions of the noise elements are determined randomly.

The best performance in task 1 came from the cases in which there was no noise in testing or training, and no saliency map was used.
This is expected, as this task is not difficult when no noise is present. Surprisingly, in task 2, the best case was found with the saliency map, when training with noise and testing without noise. This performed even better than training without noise. Investigation into this result is currently underway.

In task 1, when training and testing without noise, the saliency map can hurt performance. If the predictions made by the saliency map are not correct, the inputs appear slightly distorted; therefore, the task to be learned by the network becomes more difficult. Nevertheless, the benefit of using a saliency map is apparent when the test set contains noise.

In task 2, the network without the saliency map, trained with noise and tested without noise, cannot perform well; the performance suffers further when noise is introduced into the testing set. The noise in the training prevents accurate learning. This is not the case when the saliency map is used (Table 1, task 2). When the training set contains noise, the network with the saliency map works better when tested both with and without noise.

Table 1: Summed Error over 10,000 Testing Examples

                                       Testing Set
                              Task 1                        Task 2
  Training Set          0 Noise  1 Noise  2 Noise    0 Noise  1 Noise  2 Noise
  0 Noise (Saliency)      12672    60926    82282      7174    94333   178883
  0 Noise (No Saliency)   10241    91812   103605      7104   133496   216884
  1 Noise (Saliency)      18696    26178    52668      4843    10427    94422
  1 Noise (No Saliency)   14336    80668    97433     31673   150078   227650

When the noise increased beyond the level of training, to 2 noise elements per image, the performance of networks trained both with and without the saliency map declined.
It is suspected that training the networks with increased noise will improve performance in the network trained with the saliency map. Nonetheless, given the amount of noise relative to the small size of the input layer, improvements in results may not be dramatic.

In Figure 3, a typical test run of the second task is shown. The figure shows the inputs, the predicted and actual outputs, and the predicted and actual saliency maps. The actual saliency map is simply a smaller version of the unfiltered next input image. The input size is 20x20, the outputs are 10x10, and the saliency map is 10x10. The saliency map is scaled to 20x20 when it is applied to the next inputs. Note that in the inputs to the network, one cross appears much brighter than the other; this is due to the suppression of the distracting cross by the saliency map. The particular use of the saliency map employed in this study proceeded as follows: the difference between the saliency map (derived from input image i) and input image i+1 was calculated. This difference image was scaled to the range 0.0 to 1.0. Each pixel was then passed through a sigmoid (alternatively, a hard-step function could have been used). This is the saliency map. The saliency map was multiplied by input image i+1, and the result was used as the input into the network. If the sigmoid is not used, features such as incomplete crosses sometimes appear in the input image. This happens because different portions of the cross may have slightly different saliency values associated with them, due to errors in prediction coupled with the scaling of the saliency map. Although the sigmoid helps to alleviate the problem, it does not eliminate it. This explains why training with no noise with a saliency map sometimes does not perform as well as training without a saliency map.
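The gating procedure just described can be sketched as follows. The upscaling by pixel replication and the sigmoid steepness are assumptions, as the paper does not specify them.

```python
import numpy as np

def gate_inputs(pred_saliency_10x10, next_image_20x20, steepness=10.0):
    """Sketch of the gating step: the 10x10 predicted saliency is upscaled
    to 20x20, its absolute difference from the next image is rescaled to
    [0, 1], pushed through a sigmoid so that well-predicted regions gate
    near 1.0, and the result multiplies the next input."""
    pred = np.kron(pred_saliency_10x10, np.ones((2, 2)))  # 10x10 -> 20x20
    diff = np.abs(pred - next_image_20x20)
    span = diff.max() - diff.min()
    if span > 0:
        diff = (diff - diff.min()) / span                 # rescale to [0, 1]
    mask = 1.0 / (1.0 + np.exp(-steepness * (0.5 - diff)))  # low diff -> ~1
    return mask * next_image_20x20, mask
```

Where prediction and image agree, the mask is near 1.0 and the input passes through unchanged; a feature the hidden units could not have predicted, such as a newly appearing noise cross, produces a large difference and is suppressed.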
4 ALTERNATIVE IMPLEMENTATIONS

An alternative method of implementing the saliency map is with standard additive connections. However, these connection types have several drawbacks in comparison with the multiplicative ones used in this study. First, the additive connections can drastically change the meaning of a hidden unit's activation by changing the sign of the activation. The saliency map is designed to indicate which regions are important for accomplishing the task based upon the features in the hidden representations; as little alteration of the important features as possible is desired. Second, if the saliency map is incorrect, and suggests an area which is not actually important, the additive connections will cause 'ghost' images to appear. These are activations which are caused only by the influence of the additive saliency map. The multiplicative saliency map, as implemented here, has neither of these problems.

Figure 3: A typical sequence of inputs and outputs in the second task, showing for consecutive frames the predicted saliency, target saliency, predicted output, target output, and inputs. Note that when two crosses are in the inputs, one is much brighter than the other. The "noise" cross is de-emphasized.

A second alternative, which is more closely related to a standard recurrent network [Jordan, 1986], is to use the saliency map as extra inputs into the network. The extra inputs serve to indicate the regions which are expected to be important. Rather than hand-coding the method of representing the importance of the regions to the network, as was done in this study, the network learns to use the extra inputs when necessary. Further, the saliency map serves as the predicted next input.
This is especially useful when the features of interest may have momentarily been partially obscured, or have entirely disappeared from the image. This implementation is currently being explored by the authors for use in an autonomous road lane-tracking system in which the lane markers are not always present in the input image.

5 CONCLUSIONS & FUTURE DIRECTIONS

These experiments have demonstrated that an artificial neural network can learn both to identify the portions of a visual scene which are important, and to use these important features to perform a task. The selective attention architecture we have developed uses two simple mechanisms: predictive auto-encoding to form a saliency map, and a constrained form of feedback to allow this saliency map to focus processing in the subsequent image.

There are at least four broad directions in which this research should be extended. The first is, as described here, using the saliency map as a method for automatically and actively focusing attention on important portions of the scene. Because of the spatial dependence of the task described in this paper, with the appropriate transformations, the output units could be directly fed back to the input layer to indicate saliency. Although this does not weaken the result, in terms of the benefit of using a saliency map, future work should also focus on problems which do not have this property, to determine how easily a saliency map can be constructed. Will the use of weight sharing be enough to develop the necessary spatially oriented feature detectors? Harder problems are those with target tasks which do not explicitly contain spatial saliency information.

An implementation problem which needs to be resolved arises in networks which contain more than a single hidden layer: from which layer should the saliency map be constructed?
The trade-off is that at the higher layers, the information contained is more task-specific. However, the higher layers may effectively extract the information required to perform the task without maintaining the information required for saliency map creation. The opposite holds in the lower layers; these may contain all of the information required, but may not provide enough discrimination to narrow the focus effectively.

The third area for research is an alternative use of the saliency map. ANNs have often been criticized for their uninterpretability and the lack of a mechanism to explain their performance. The saliency map provides a method for understanding, at a high level, which portions of the inputs the ANN finds most important.

Finally, the fourth direction for research is the incorporation of higher-level, or symbolic, knowledge. The saliency map provides a very intuitive and direct method for focusing the network's attention on specific portions of the image. The saliency map may prove to be a useful mechanism to allow other processes, including human users, to simply "point at" the portion of the image to which the network should be paying attention.

The next step in our research is to test the effectiveness of this technique on the main task of interest, autonomous road following. Fortunately, the first demonstration task employed in this paper shares several characteristics with road following. Both tasks require the network to track features which move over time in a cluttered image. Both tasks also require the network to produce an output that depends on the positions of the important features in the image. Because of these shared characteristics, we believe that similar performance improvements should be possible in the autonomous driving domain.

Acknowledgments

Shumeet Baluja is supported by a National Science Foundation Graduate Fellowship.
Support has come from "Perception for Outdoor Navigation" (contract number DACA76-89-C-0014, monitored by the US Army Topographic Engineering Center) and "Unmanned Ground Vehicle System" (contract number DAAE07-90-C-R059, monitored by TACOM). Support has also come from the National Highway Traffic Safety Administration under contract number DTNH22-93-C-07023. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing official policies, either expressed or implied, of the National Science Foundation, ARPA, or the U.S. Government.

References

Cottrell, G.W. & Munro, P. (1988) Principal Component Analysis of Images via Back-propagation. Proc. Soc. of Photo-Optical Instr. Eng., Cambridge, MA.

Jordan, M.I. (1989) Serial Order: A Parallel, Distributed Processing Approach. In Advances in Connectionist Theory: Speech, eds. J.L. Elman and D.E. Rumelhart. Hillsdale: Erlbaum.

Jacobs, R.A., Jordan, M.I., Nowlan, S.J. & Hinton, G.E. (1991) Adaptive Mixtures of Local Experts. Neural Computation, 3:1.

LeCun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W., and Jackel, L.D. (1989) Backpropagation Applied to Handwritten Zip Code Recognition. Neural Computation, 1, 541-551.

Mozer, M.C. (1988) A Connectionist Model of Selective Attention in Visual Perception. Technical Report CRG-TR-88-4, University of Toronto.

Pashler, H. & Badgio, P. (1985) Visual Attention and Stimulus Identification. Journal of Experimental Psychology: Human Perception and Performance, 11, 105-121.

Pomerleau, D.A. (1993) Input Reconstruction Reliability Estimation. In Giles, C.L., Hanson, S.J. and Cowan, J.D. (eds), Advances in Neural Information Processing Systems 5, CA: Morgan Kaufmann Publishers.

Pomerleau, D.A. (1993b) Neural Network Perception for Mobile Robot Guidance. Kluwer Academic Publishing.
Olshausen, B., Anderson, C., & Van Essen, D. (1992) A Neural Model of Visual Attention and Invariant Pattern Recognition. California Institute of Technology, CNS Program, memo 18.
", "award": [], "sourceid": 880, "authors": [{"given_name": "Shumeet", "family_name": "Baluja", "institution": null}, {"given_name": "Dean", "family_name": "Pomerleau", "institution": null}]}