{"title": "Contextual Modulation of Target Saliency", "book": "Advances in Neural Information Processing Systems", "page_first": 1303, "page_last": 1310, "abstract": "", "full_text": "Contextual Modulation of Target Saliency\n\nAntonio Torralba\n\nDept. of Brain and Cognitive Sciences\n\nMIT, Cambridge, MA 02139\n\ntorralba@ai. mit. edu\n\nAbstract\n\nThe most popular algorithms for object detection require the use of\nexhaustive spatial and scale search procedures. In such approaches,\nan object is defined by means of local features.\nfu this paper we\nshow that including contextual information in object detection pro(cid:173)\ncedures provides an efficient way of cutting down the need for\nexhaustive search. We present results with real images showing\nthat the proposed scheme is able to accurately predict likely object\nclasses, locations and sizes.\n\n1\n\nIntroduction\n\nAlthough there is growing evidence of the role of contextual information in human\nperception [1], research in computational vision is dominated by object-based rep(cid:173)\nIn real-world scenes, intrinsic object information is often\nresentations [5,9,10,15].\ndegraded due to occlusion, low contrast, and poor resolution. In such situations, the\nobject recognition problem based on intrinsic object representations is ill-posed. A\nmore comprehensive representation of an object should include contextual informa(cid:173)\ntion [11,13]: Obj. representatian == {intrisic obj. model, contextual obj. model}.\nIn this representation, an object is defined by 1) a model of the intrinsic proper(cid:173)\nties of the object and 2) a model of the typical contexts in which the object is\nimmersed. 
Here we show how incorporating contextual models can enhance target object saliency and provide an estimate of its likelihood and intrinsic properties.\n\n2 Target saliency and object likelihood\n\nImage information can be partitioned into two sets of features: local features, v_L, that are intrinsic to an object, and contextual features, v_C, which encode structural properties of the background. In a statistical framework, object detection requires evaluation of the likelihood function (target saliency function) P(O | v_L, v_C), which provides the probability of presence of the object O given a set of local and contextual measurements. O is the set of parameters that define an object immersed in a scene: O = {o_n, x, y, t}, with o_n = object class, (x, y) = location in image coordinates and t = object appearance parameters. By applying Bayes rule we can write:\n\nP(O | v_L, v_C) = (1/P(v_L | v_C)) P(v_L | O, v_C) P(O | v_C)    (1)\n\nThese three factors provide a simplified framework for representing three levels of attention guidance when looking for a target. The normalization factor, 1/P(v_L | v_C), does not depend on the target or task constraints, and is therefore a bottom-up factor. It provides a measure of how unlikely it is to find a set of local measurements v_L within the context v_C. We can define local saliency as S(x, y) = 1/P(v_L(x, y) | v_C). Saliency is large for unlikely features in a scene. The second factor, P(v_L | O, v_C), gives the likelihood of the local measurements v_L when the object is present at that location in a particular context. We can write P(v_L | O, v_C) ~ P(v_L | O), which is a convenient approximation when the aspect of the target object is fully determined by the parameters given by the description O. This factor represents the top-down knowledge of the target appearance and how it contributes to the search. Regions of the image with features unlikely to belong to the target object are vetoed.
Regions with attended features are enhanced. The third factor, the PDF P(O | v_C), provides context-based priors on object class, location and scale. It is of capital importance for ensuring reliable inferences in situations where the local image measurements v_L produce ambiguous interpretations. This factor does not depend on local measurements and target models [8,13]. Therefore, the term P(O | v_C) modulates the saliency of local image properties when looking for an object of the class o_n. Contextual priors become more evident if we apply Bayes rule successively in order to split the PDF P(O | v_C) into three factors that model three kinds of context priming on object search:\n\nP(O | v_C) = P(t | x, y, o_n, v_C) P(x, y | o_n, v_C) P(o_n | v_C)    (2)\n\nAccording to this decomposition of the PDF, the contextual modulation of target saliency is a function of three main factors:\n\nObject likelihood: P(o_n | v_C) provides the probability of presence of the object class o_n in the scene. If P(o_n | v_C) is very small, then object search need not be initiated (we do not need to look for cars in a living room).\n\nContextual control of focus of attention: P(x, y | o_n, v_C). This PDF gives the most likely locations for the presence of object o_n given context information, and it allocates computational resources to relevant scene regions.\n\nContextual selection of local target appearance: P(t | v_C, o_n). This gives the likely (prototypical) shapes (points of view, size, aspect ratio) of the object o_n in the context v_C. Here t = {sigma, p}, with sigma = scale and p = aspect ratio.
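As a toy illustration of Eqs. (1)-(2), the following sketch combines the three contextual factors with a local-feature likelihood on a small grid. Every array and probability below is an invented placeholder, not an output of the paper's model, and the scale/pose factor P(t | x, y, o_n, v_C) is folded into a constant:\n\n```python
import numpy as np

# Toy grids over image locations (x, y) for one target class o_n.
# All values are made up for illustration.
H, W = 8, 8
rng = np.random.default_rng(0)

p_vl_given_vc = rng.uniform(0.05, 1.0, (H, W))  # P(v_L | v_C): feature likelihood in context
p_vl_given_o = rng.uniform(0.0, 1.0, (H, W))    # P(v_L | O, v_C) ~ P(v_L | O): target match
p_on_given_vc = 0.7                             # P(o_n | v_C): object-class prior
p_xy_given_on_vc = rng.dirichlet(np.ones(H * W)).reshape(H, W)  # P(x, y | o_n, v_C)

# Eq. (1): saliency = bottom-up factor * top-down match * context prior,
# with the prior split as in Eq. (2) (scale/pose factor treated as constant).
saliency = (1.0 / p_vl_given_vc) * p_vl_given_o * (p_on_given_vc * p_xy_given_on_vc)

best = np.unravel_index(np.argmax(saliency), saliency.shape)
print('most salient location:', best)
```\n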
Other parameters describing the appearance of an object in an image can be added.\n\nThe image features most commonly used for describing local structures are the energy outputs of oriented band-pass filters, as they have been shown to be relevant for the task of object detection [9,10] and scene recognition [2,4,8,12]. Therefore, the local image representation at the spatial location x is given by the vector v_L(x) = {v(x, k)}_{k=1,N} with:\n\nv(x, k) = | sum_x' i(x') g_k(x - x') |    (3)\n\nFigure 1: Contextual object priming of four object categories (1-people, 2-furniture, 3-vehicles and 4-trees).\n\nwhere i(x) is the input image and g_k(x) are oriented band-pass filters defined by g_k(x) = e^(-||x||^2 / sigma_k^2) e^(2 pi j <f_k, x>). In such a representation [8], v(x, k) is the output magnitude at the location x of a complex Gabor filter tuned to the spatial frequency f_k. The variable k indexes filters tuned to different spatial frequencies and orientations.\n\nOn the other hand, contextual features have to summarize the structure of the whole image. It has been shown that a holistic low-dimensional encoding of the local image features conveys enough information for a semantic categorization of the scene/context [8] and can be used for contextual priming in object recognition tasks [13]. Such a representation can be achieved by decomposing the image features into the basis functions provided by PCA:\n\na_n = sum_x sum_k v(x, k) psi_n(x, k),    v(x, k) ~ sum_{n=1,N} a_n psi_n(x, k)    (4)\n\nWe propose to use the decomposition coefficients v_C = {a_n}_{n=1,N} as context features. The functions psi_n are the eigenfunctions of the covariance operator given by v(x, k).
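A minimal sketch of this feature pipeline: Gabor-magnitude local features v(x, k) computed in the Fourier domain, followed by a PCA giving holistic context coefficients a_n. The filter tuning values, image sizes and number of components below are illustrative choices, not the paper's settings:\n\n```python
import numpy as np

rng = np.random.default_rng(1)

def gabor_magnitudes(img, freqs):
    # v(x, k) = |i * g_k|(x): magnitude of the complex Gabor output,
    # computed as a Gaussian transfer function in the Fourier domain.
    H, W = img.shape
    fy = np.fft.fftfreq(H)[:, None]
    fx = np.fft.fftfreq(W)[None, :]
    out = []
    for f0y, f0x, sigma in freqs:
        G = np.exp(-((fy - f0y) ** 2 + (fx - f0x) ** 2) / (2 * sigma ** 2))
        out.append(np.abs(np.fft.ifft2(np.fft.fft2(img) * G)))
    return np.stack(out, axis=-1)  # shape (H, W, K)

# Three illustrative orientation/frequency tunings (f0y, f0x, bandwidth).
filters = [(0.0, 0.1, 0.05), (0.1, 0.0, 0.05), (0.1, 0.1, 0.05)]
imgs = [rng.standard_normal((32, 32)) for _ in range(20)]  # stand-in 'images'
V = np.stack([gabor_magnitudes(im, filters).ravel() for im in imgs])

# PCA: eigenfunctions psi_n of the feature covariance; a_n = <v, psi_n>.
Vc = V - V.mean(axis=0)
_, _, psi = np.linalg.svd(Vc, full_matrices=False)
N = 6  # the paper keeps N = 60 components; fewer here for the toy example
a = Vc @ psi[:N].T  # context features v_C = {a_n}, one row per image
print('context feature shape:', a.shape)
```\n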
By using only a reduced set of components (N = 60 for the rest of the paper), the coefficients {a_n}_{n=1,N} encode the main spectral characteristics of the scene with a coarse description of their spatial arrangement. In essence, {a_n}_{n=1,N} is a holistic representation, as all the regions of the image contribute to all the coefficients and objects are not encoded individually [8]. In the rest of the paper we show the efficacy of this set of features in context modeling for object detection tasks.\n\n3 Contextual object priming\n\nThe PDF P(o_n | v_C) gives the probability of presence of the object class o_n given contextual information. In other words, P(o_n | v_C) evaluates the consistency of the object o_n with the context v_C. For instance, a car has a high probability of presence in a highway scene but is inconsistent with an indoor environment. The goal of P(o_n | v_C) is to cut down the number of possible object categories to deal with before expending computational resources in the object recognition process. The learning of the PDF P(o_n | v_C) = P(v_C | o_n)P(o_n)/p(v_C), with p(v_C) = P(v_C | o_n)P(o_n) + P(v_C | ¬o_n)P(¬o_n), is done by approximating the in-class and out-of-class PDFs by a mixture of Gaussians:\n\nP(v_C | o_n) = sum_{i=1,L} b_{i,n} G(v_C; v_{i,n}, V_{i,n})    (5)\n\nFigure 2: Contextual control of focus of attention when the algorithm is looking for cars (upper row) or heads (bottom row).\n\nThe model parameters (b_{i,n}, v_{i,n}, V_{i,n}) for the object class o_n are obtained using the EM algorithm [3]. The learning requires the use of few Gaussian clusters (L = 2 provides very good performance). For the learning, the system is trained with a set of examples manually annotated with the presence/absence of four object categories (1-people, 2-furniture, 3-vehicles and 4-trees). Fig. 1 shows some typical results from the priming model on the four superordinate categories of objects defined. Note
that the probability function P(o_n | v_C) provides information about the probable presence of one object without scanning the picture. If P(o_n | v_C) > 1 - th, then we can predict that the target is present. On the other hand, if P(o_n | v_C) < th, we can predict that the object is likely to be absent before exploring the image.\n\nThe number of scenes in which the system may be able to take high-confidence decisions will depend on factors such as the strength of the relationship between the target object and its context, and the ability of v_C to efficiently characterize the context. Figure 1 shows some typical results from the priming model for a set of superordinate categories of objects. When forcing the model to take binary decisions in all the images (by selecting an acceptance threshold of th = 0.5), the presence/absence of the objects was correctly predicted by the model on 81% of the scenes of the test set. For each object category, high-confidence predictions (th = 0.1) were made in at least 50% of the tested scene pictures, and the presence/absence of each object class was correctly predicted by the model on 95% of those images. Therefore, for those images, we do not need to use local image analysis to decide about the presence/absence of the object.\n\n4 Contextual control of focus of attention\n\nOne of the strategies that biological visual systems use to deal with the analysis of real-world scenes is to focus attention (and, therefore, computational resources) onto the important image regions while neglecting others. Current computational models of visual attention (saliency maps and target detection) rely exclusively on local information or intrinsic object models [6,7,9,14,16]. The control of the focus of attention by contextual information that we propose
here is both task driven (looking for object o_n) and context driven (given global context information v_C). However, it does not include any model of the target object at this stage. In our framework, the problem of contextual control of the focus of attention involves the evaluation of the PDF P(x | o_n, v_C).\n\nFigure 3: Estimation results of object scale and pose based on contextual features.\n\nFor the learning, the joint PDF is modeled as a sum of Gaussian clusters. Each cluster is decomposed into the product of two Gaussians modeling respectively the distribution of object locations and the distribution of contextual features for each cluster:\n\nP(x, v_C | o_n) = sum_{i=1,L} b_{i,n} G(x; x_{i,n}, X_{i,n}) G(v_C; v_{i,n}, V_{i,n})    (6)\n\nThe training set used for the learning of the PDF P(x, v_C | o_n) is a subset of the pictures that contain the object o_n. The training data are {v_t}_{t=1,Nt} and {x_t}_{t=1,Nt}, where v_t are the contextual features of picture t of the training set and x_t is the location of object o_n in the image. The model parameters are obtained using the EM algorithm [3,13]. We used 1200 pictures for training and a separate set of 1200 pictures for testing.
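The Gaussian context priors of Sections 3 and 4 can be sketched as follows. For brevity this sketch fits a single Gaussian per class in closed form (the paper fits an L = 2 mixture with EM), and the synthetic 'context features' merely stand in for the real PCA coefficients v_C:\n\n```python
import numpy as np

rng = np.random.default_rng(2)
d = 4  # toy context-feature dimension (the paper uses N = 60)

# Synthetic scenes: context features for images containing the object
# vs. images without it, drawn from shifted Gaussians.
vc_in = rng.normal(1.0, 1.0, (200, d))
vc_out = rng.normal(-1.0, 1.0, (200, d))

def fit_gauss(X):
    # Closed-form Gaussian fit standing in for the paper's EM-trained mixture.
    mu = X.mean(axis=0)
    C = np.cov(X, rowvar=False)
    Cinv = np.linalg.inv(C)
    logdet = np.linalg.slogdet(C)[1]
    def logpdf(v):
        z = v - mu
        return -0.5 * (np.einsum('...i,ij,...j->...', z, Cinv, z)
                       + logdet + d * np.log(2 * np.pi))
    return logpdf

lp_in, lp_out = fit_gauss(vc_in), fit_gauss(vc_out)

def p_on_given_vc(vc, prior=0.5):
    # Bayes rule: P(o_n | v_C) = P(v_C | o_n) P(o_n) / p(v_C)
    lin = np.exp(lp_in(vc)) * prior
    lout = np.exp(lp_out(vc)) * (1 - prior)
    return lin / (lin + lout)

probe = np.stack([np.full(d, 1.0), np.full(d, -1.0)])
p = p_on_given_vc(probe)
print('P(o_n | v_C):', p)  # high for the present-like scene, low for the other
```\n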
The success of the PDF in narrowing the region of the focus of attention will depend on the consistency of the relationship between the object and the context. Fig. 2 shows several examples of images and the selected regions based on contextual features when looking for cars and faces. From the PDF P(x, v_C | o_n) we selected the region with the highest probability (33% of the image size on average); 87% of the heads present in the test pictures were inside the selected regions.\n\n5 Contextual selection of object appearance models\n\nOne major problem for computational approaches to object detection is the large variability in object appearance. The classical solution is to explore the space of possible shapes looking for the best match. The main sources of variability in object appearance are size, pose and intra-class shape variability (deformations, style, etc.). We show here that including contextual information can reduce at least the first two sources of variability. For instance, the expected size of people in an image differs greatly between an indoor environment and a perspective view of a street. Both environments produce different patterns of contextual features v_C [8]. For the second factor, pose, in the case of cars there is a strong relationship between the possible orientations of the object and the scene configuration. For instance, looking down a highway we expect to see the back of the cars, whereas in a street view, looking towards the buildings, lateral views of cars are more likely.\n\nThe expected scale and pose of the target object can be estimated by a regression procedure. The training database used for building the regression is a set of 1000 images in which the target object o_n is present. For each training image the target\n\nFigure 4: Selection of prototypical object appearances based on contextual cues.\n\nobject was selected by cropping a rectangular window.
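This regression can be sketched as a Gaussian-weighted average over context clusters: each cluster pairs a typical context vector with a typical target scale, and the estimate is the ratio of weighted sums. The cluster parameters below are invented for illustration; the paper learns them with EM:\n\n```python
import numpy as np

# Two hypothetical context clusters (e.g. 'close-up indoor' vs 'far street
# view'), each with a mean context vector, an isotropic spread, a mixture
# weight and a typical target scale in pixels. All values are made up.
centers = np.array([[2.0, 0.0], [-2.0, 0.0]])  # cluster means v_{i,n}
sigmas_ctx = np.array([1.0, 1.0])              # context spread per cluster
b = np.array([0.5, 0.5])                       # mixture weights b_{i,n}
sigma_obj = np.array([120.0, 15.0])            # typical target scale per cluster

def estimate_scale(vc):
    # G(v_C; v_i, V_i) up to a constant factor (constants cancel in the ratio);
    # the pose estimate would use the same weights with pose values p_{i,n}.
    d2 = ((vc - centers) ** 2).sum(axis=1)
    w = b * np.exp(-0.5 * d2 / sigmas_ctx ** 2)
    return (sigma_obj * w).sum() / w.sum()

print(estimate_scale(np.array([2.0, 0.0])))   # close-up context: large scale
print(estimate_scale(np.array([-2.0, 0.0])))  # far street-view context: small scale
```\n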
For faces and cars we define the scale sigma as the height of the selected window and the pose p as the ratio between the window dimensions (Δy/Δx). On average, this definition of pose provides a good estimate of the orientation for cars but not for heads. Here we used regression with a mixture of Gaussians for estimating the conditional PDFs between scale, pose and contextual features, P(sigma | v_C, o_n) and P(p | v_C, o_n). This yields the following regression procedures [3]:\n\nsigma_hat = [sum_i sigma_{i,n} b_{i,n} G(v_C; v_{i,n}, V_{i,n})] / [sum_i b_{i,n} G(v_C; v_{i,n}, V_{i,n})],    p_hat = [sum_i p_{i,n} b_{i,n} G(v_C; v_{i,n}, V_{i,n})] / [sum_i b_{i,n} G(v_C; v_{i,n}, V_{i,n})]    (7)\n\nThe results summarized in fig. 3 show that context is a strong cue for scale selection in the face detection task but less important for the car detection task. On the other hand, context introduces strong constraints on the prototypical points of view of cars but not at all for heads. Once the two parameters (pose and scale) have been estimated, we can build a prototypical model of the target object. In the case of a view-based object representation, the model of the object will consist of a collection of templates that correspond to the possible aspects of the target. For each image the system produces a collection of views, selected among a database of target examples, that have the scale and pose given by eqs. (7). Fig. 4 shows some results from this procedure. In the statistical framework, the object detection requires the evaluation of the function P(v_L | O, v_C). We can approximate\n\nFigure 5: Schematic layout of the model for object detection (here cars) by integration of contextual and local information.
The bottom example is an error in detection due to incorrect context identification.\n\nP(v_L | O, v_C) ~ P(v_L | o_n, sigma, p). Figs. 5 and 6 show the complete chain of operations and some detection results using a simple correlation technique between the image and the generated object models (100 exemplars) at only one scale. The last image of each row shows the total object likelihood obtained by multiplying the object saliency maps (obtained by the correlation) and the contextual control of the focus of attention. The result shows how the use of context helps reduce false alarms. This results in good detection performance despite the simplicity of the matching procedure used.\n\n6 Conclusion\n\nThe contextual schema of a scene provides the likelihood of presence, typical locations and appearances of objects within the scene. We have proposed a model for incorporating such contextual cues in the task of object detection. The main aspects of our approach are: 1) Progressive reduction of the window of focus of attention: the system reduces the size of the focus of attention by first integrating contextual information and then local information. 2) Inhibition of target-like patterns that are in inconsistent locations. 3) Faster detection of correctly scaled targets that have a pose in agreement with the context. 4) No requirement of parsing a scene into individual objects. Furthermore, once one object has been detected, it can introduce new contextual information for analyzing the rest of the scene.\n\nAcknowledgments\n\nThe author wishes to thank Dr. Pawan Sinha, Dr. Aude Oliva and Prof. Whitman Richards for fruitful discussions.\n\nReferences\n\n[1] Biederman, I., Mezzanotte, R.J., & Rabinowitz, J.C. (1982). Scene perception: detecting and judging objects undergoing relational violations.
Cognitive Psychology, 14:143-177.\n\nFigure 6: Schema for object detection (e.g. cars) integrating local and global information.\n\n[2] Carson, C., Belongie, S., Greenspan, H., & Malik, J. (1997). Region-based image querying. Proc. IEEE Workshop on Content-Based Access of Image and Video Libraries, pp. 42-49.\n[3] Gershenfeld, N. (1999). The nature of mathematical modeling. Cambridge University Press.\n[4] Gorkani, M. M., & Picard, R. W. (1994). Texture orientation for sorting photos 'at a glance'. Proc. Int. Conf. Pattern Recognition, Jerusalem, Vol. I: 459-464.\n[5] Heisele, B., Serre, T., Mukherjee, S., & Poggio, T. (2001). Feature reduction and hierarchy of classifiers for fast object detection in video images. Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR 2001), Kauai, Hawaii.\n[6] Itti, L., Koch, C., & Niebur, E. (1998). A model of saliency-based visual attention for rapid scene analysis. IEEE Trans. Pattern Analysis and Machine Intelligence, 20(11):1254-1259.\n[7] Moghaddam, B., & Pentland, A. (1997). Probabilistic visual learning for object representation. IEEE Trans. Pattern Analysis and Machine Intelligence, 19(7):696-710.\n[8] Oliva, A., & Torralba, A. (2001). Modeling the shape of the scene: a holistic representation of the spatial envelope. Int. Journal of Computer Vision, 42(3):145-175.\n[9] Rao, R.P.N., Zelinsky, G.J., Hayhoe, M.M., & Ballard, D.H. (1996). Modeling saccadic targeting in visual search. NIPS 8. Cambridge, MA: MIT Press.\n[10] Schiele, B., & Crowley, J. L. (2000). Recognition without correspondence using multidimensional receptive field histograms. Int. Journal of Computer Vision, 36(1):31-50.\n[11] Strat, T. M., & Fischler, M. A. (1991). Context-based vision: recognizing objects using information from both 2-D and 3-D imagery. IEEE Trans.
on Pattern Analysis and Machine Intelligence, 13(10):1050-1065.\n[12] Szummer, M., & Picard, R. W. (1998). Indoor-outdoor image classification. IEEE Int. Workshop on Content-Based Access of Image and Video Databases.\n[13] Torralba, A., & Sinha, P. (2001). Statistical context priming for object detection. Proc. IEEE Int. Conf. on Computer Vision.\n[14] Treisman, A., & Gelade, G. (1980). A feature-integration theory of attention. Cognitive Psychology, 12:97-136.\n[15] Viola, P., & Jones, M. (2001). Rapid object detection using a boosted cascade of simple features. Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR 2001), Kauai, Hawaii.\n[16] Wolfe, J. M. (1994). Guided search 2.0. A revised model of visual search. Psychonomic Bulletin and Review, 1:202-228.\n", "award": [], "sourceid": 2074, "authors": [{"given_name": "Antonio", "family_name": "Torralba", "institution": null}]}