{"title": "The discriminant center-surround hypothesis for bottom-up saliency", "book": "Advances in Neural Information Processing Systems", "page_first": 497, "page_last": 504, "abstract": "The classical hypothesis, that bottom-up saliency is a center-surround process, is combined with a more recent hypothesis that all saliency decisions are optimal in a decision-theoretic sense. The combined hypothesis is denoted as discriminant center-surround saliency, and the corresponding optimal saliency architecture is derived. This architecture equates the saliency of each image location to the discriminant power of a set of features with respect to the classification problem that opposes stimuli at center and surround, at that location. It is shown that the resulting saliency detector makes accurate quantitative predictions for various aspects of the psychophysics of human saliency, including non-linear properties beyond the reach of previous saliency models. Furthermore, it is shown that discriminant center-surround saliency can be easily generalized to various stimulus modalities (such as color, orientation and motion), and provides optimal solutions for many other saliency problems of interest for computer vision. Optimal solutions, under this hypothesis, are derived for a number of the former (including static natural images, dense motion fields, and even dynamic textures), and applied to a number of the latter (the prediction of human eye fixations, motion-based saliency in the presence of ego-motion, and motion-based saliency in the presence of highly dynamic backgrounds). 
As a result, discriminant saliency is shown to predict eye fixations better than previous models, and produce background subtraction algorithms that outperform the state-of-the-art in computer vision.", "full_text": "The discriminant center-surround hypothesis for\n\nbottom-up saliency\n\nDashan Gao\n\nVijay Mahadevan\n\nNuno Vasconcelos\n\nDepartment of Electrical and Computer Engineering\n\nUniversity of California, San Diego\n\n{dgao, vmahadev, nuno}@ucsd.edu\n\nAbstract\n\nThe classical hypothesis, that bottom-up saliency is a center-surround process, is\ncombined with a more recent hypothesis that all saliency decisions are optimal in\na decision-theoretic sense. The combined hypothesis is denoted as discriminant\ncenter-surround saliency, and the corresponding optimal saliency architecture is\nderived. This architecture equates the saliency of each image location to the dis-\ncriminant power of a set of features with respect to the classi\ufb01cation problem that\nopposes stimuli at center and surround, at that location. It is shown that the result-\ning saliency detector makes accurate quantitative predictions for various aspects\nof the psychophysics of human saliency, including non-linear properties beyond\nthe reach of previous saliency models. Furthermore, it is shown that discriminant\ncenter-surround saliency can be easily generalized to various stimulus modalities\n(such as color, orientation and motion), and provides optimal solutions for many\nother saliency problems of interest for computer vision. Optimal solutions, under\nthis hypothesis, are derived for a number of the former (including static natural\nimages, dense motion \ufb01elds, and even dynamic textures), and applied to a num-\nber of the latter (the prediction of human eye \ufb01xations, motion-based saliency in\nthe presence of ego-motion, and motion-based saliency in the presence of highly\ndynamic backgrounds). 
As a result, discriminant saliency is shown to predict eye\n\ufb01xations better than previous models, and produces background subtraction algo-\nrithms that outperform the state-of-the-art in computer vision.\n\n1 Introduction\n\nThe psychophysics of visual saliency and attention have been extensively studied during the last few\ndecades. As a result of these studies, it is now well known that saliency mechanisms exist for a\nnumber of classes of visual stimuli, including color, orientation, depth, and motion, among others.\nMore recently, there has been an increasing effort to introduce computational models for saliency.\nOne approach that has become quite popular, both in the biological and computer vision communi-\nties, is to equate saliency with center-surround differencing. It was initially proposed in [12], and\nhas since been applied to saliency detection in both static imagery and motion analysis, as well\nas to computer vision problems such as robotics or video compression. While difference-based\nmodeling is successful at replicating many observations from psychophysics, it has three signi\ufb01-\ncant limitations. First, it does not explain those observations in terms of fundamental computational\nprinciples for neural organization. For example, it implies that visual perception relies on a linear\nmeasure of similarity (difference between feature responses in center and surround). This is at odds\nwith well known properties of higher level human judgments of similarity, which tend not to be\nsymmetric or even compliant with Euclidean geometry [20]. Second, the psychophysics of saliency\noffers strong evidence for the existence of both non-linearities and asymmetries which are not eas-\nily reconciled with this model. 
Third, although the center-surround hypothesis intrinsically poses\n\n1\n\n\fsaliency as a classi\ufb01cation problem (of distinguishing center from surround), there is little basis on\nwhich to justify difference-based measures as optimal in a classi\ufb01cation sense. From an evolutionary\nperspective, this raises questions about the biological plausibility of the difference-based paradigm.\n\nAn alternative hypothesis is that all saliency decisions are optimal in a decision-theoretic sense.\nThis hypothesis has been denoted as discriminant saliency in [6], where it was somewhat narrowly\nproposed as the justi\ufb01cation for a top-down saliency algorithm. While this algorithm is of interest\nonly for object recognition, the hypothesis of decision theoretic optimality is much more general,\nand applicable to any form of center-surround saliency. This has motivated us to test its ability to\nexplain the psychophysics of human saliency, which is better documented for the bottom-up neural\npathway. We start from the combined hypothesis that 1) bottom-up saliency is based on center-\nsurround processing, and 2) this processing is optimal in a decision theoretic sense. In particular,\nit is hypothesized that, in the absence of high-level goals, the most salient locations of the visual\n\ufb01eld are those that enable the discrimination between center and surround with smallest expected\nprobability of error. This is referred to as the discriminant center-surround hypothesis and, by\nde\ufb01nition, produces saliency measures that are optimal in a classi\ufb01cation sense. It is also clearly\ntied to a larger principle for neural organization: that all perceptual mechanisms are optimal in a\ndecision-theoretic sense.\n\nIn this work, we present the results of an experimental evaluation of the plausibility of the discrim-\ninant center-surround hypothesis. 
Our study evaluates the ability of saliency algorithms that are\noptimal under this hypothesis to both\n\n\u2022 reproduce subject behavior in classical psychophysics experiments, and\n\u2022 solve saliency problems of practical signi\ufb01cance, with respect to a number of classes of visual stimuli.\n\nWe derive decision-theoretic optimal center-surround algorithms for a number of saliency problems,\nranging from static spatial saliency, to motion-based saliency in the presence of egomotion or even\ncomplex dynamic backgrounds. Regarding the ability to replicate psychophysics, the results of this\nstudy show that discriminant saliency not only replicates all anecdotal observations that can be ex-\nplained by linear models, such as that of [12], but can also make (surprisingly accurate) quantitative\npredictions for non-linear aspects of human saliency, which are beyond the reach of the existing\napproaches. With respect to practical saliency algorithms, the results show that discriminant saliency not\nonly is more accurate than difference-based methods in predicting human eye \ufb01xations, but actu-\nally produces background subtraction algorithms that outperform the state-of-the-art in computer\nvision. In particular, it is shown that, by simply modifying the probabilistic models employed in\nthe (decision-theoretic optimal) saliency measure - from well known models of natural image statis-\ntics, to the statistics of simple optical-\ufb02ow motion features, to more sophisticated dynamic texture\nmodels - it is possible to produce saliency detectors for either static or dynamic stimuli, which are\ninsensitive to background image variability due to texture, egomotion, or scene dynamics.\n\n2 Discriminant center-surround saliency\n\nA common hypothesis for bottom-up saliency is that the saliency of each location is determined by\nhow distinct the stimulus at the location is from the stimuli in its surround (e.g., [11]). 
This hypoth-\nesis is inspired by the ubiquity of \u201ccenter-surround\u201d mechanisms in the early stages of biological\nvision [10]. It can be combined with the hypothesis of decision-theoretic optimality, by de\ufb01ning a\nclassi\ufb01cation problem that equates\n\n\u2022 the class of interest, at location l, with the observed responses of a pre-de\ufb01ned set of features X within a neighborhood $W_l^1$ of l (the center), and\n\n\u2022 the null hypothesis with the responses within a surrounding window $W_l^0$ (the surround).\n\nThe saliency of location l is then equated with the power of the feature set X to discriminate\nbetween center and surround. Mathematically, the feature responses within the two windows are\ninterpreted as observations drawn from a random process $X(l) = (X_1(l), \ldots, X_d(l))$, of dimension\nd, conditioned on the state of a hidden random variable Y(l). The observed feature vector at any\nlocation j is denoted by $x(j) = (x_1(j), \ldots, x_d(j))$, and feature vectors x(j) such that $j \in W_l^c$, $c \in \{0, 1\}$, are drawn from class c (i.e., Y(l) = c), according to the conditional densities $P_{X(l)|Y(l)}(x|c)$.\nThe saliency of location l, S(l), is quanti\ufb01ed by the mutual information between the features, X, and\nthe class label, Y,\n\n$$S(l) = I_l(X; Y) = \sum_c \int p_{X(l),Y(l)}(x, c) \log \frac{p_{X(l),Y(l)}(x, c)}{p_{X(l)}(x)\, p_{Y(l)}(c)}\, dx. \quad (1)$$\n\nThe l subscript emphasizes the fact that the mutual information is de\ufb01ned locally, within $W_l$. The\nfunction S(l) is referred to as the saliency map.\n\n3 Discriminant saliency detection in static imagery\n\nSince human saliency has been most thoroughly studied in the domain of static stimuli, we \ufb01rst\nderive the optimal solution for discriminant saliency in this domain. 
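For a single feature channel, the local mutual information of (1) reduces to a comparison of the center and surround response distributions. The following sketch estimates it with histograms; this is only an illustration of the quantity (the bin count is an arbitrary choice, and the paper's actual implementation uses parametric generalized Gaussian models rather than histograms):

```python
import numpy as np

def mutual_information(center, surround, bins=32):
    """Histogram estimate of I(X; Y) for one feature channel, where Y is the
    binary center (1) / surround (0) label, as in Eq. (1)."""
    lo = min(center.min(), surround.min())
    hi = max(center.max(), surround.max())
    edges = np.linspace(lo, hi + 1e-9, bins + 1)
    p1 = np.histogram(center, edges)[0] / center.size      # P(X | Y=1)
    p0 = np.histogram(surround, edges)[0] / surround.size  # P(X | Y=0)
    pi1 = center.size / (center.size + surround.size)      # P(Y=1)
    px = pi1 * p1 + (1 - pi1) * p0                         # marginal P(X)
    mi = 0.0
    for pc, prior in ((p1, pi1), (p0, 1 - pi1)):
        nz = pc > 0                                        # skip empty bins
        mi += prior * np.sum(pc[nz] * np.log(pc[nz] / px[nz]))
    return mi
```

When center and surround responses are identically distributed the estimate is zero (the location is not salient); when their supports are disjoint it attains its maximum, the entropy of the class label Y.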
We then study the ability of\nthe discriminant center-surround saliency hypothesis to explain the fundamental properties of the\npsychophysics of pre-attentive vision.\n\n3.1 Feature decomposition\n\nThe building blocks of the static discriminant saliency detector are shown in Figure 1. The \ufb01rst\nstage, feature decomposition, follows the proposal of [11], which closely mimics the earliest stages\nof biological visual processing. The image to process is \ufb01rst subject to a feature decomposition into\nan intensity map and four broadly-tuned color channels, $I = (r + g + b)/3$, $R = \lfloor \tilde{r} - (\tilde{g} + \tilde{b})/2 \rfloor^+$,\n$G = \lfloor \tilde{g} - (\tilde{r} + \tilde{b})/2 \rfloor^+$, $B = \lfloor \tilde{b} - (\tilde{r} + \tilde{g})/2 \rfloor^+$, and $Y = \lfloor (\tilde{r} + \tilde{g})/2 - |\tilde{r} - \tilde{g}|/2 \rfloor^+$, where\n$\tilde{r} = r/I$, $\tilde{g} = g/I$, $\tilde{b} = b/I$, and $\lfloor x \rfloor^+ = \max(x, 0)$. The four color channels are, in turn, combined\ninto two color opponent channels, R \u2212 G for red/green and B \u2212 Y for blue/yellow opponency.\nThese and the intensity map are convolved with three Mexican hat wavelet \ufb01lters, centered at spatial\nfrequencies 0.02, 0.04 and 0.08 cycle/pixel, to generate nine feature channels. The feature space X\nconsists of these channels, plus a Gabor decomposition of the intensity map, implemented with a\ndictionary of zero-mean Gabor \ufb01lters at 3 spatial scales (centered at frequencies of 0.08, 0.16, and\n0.32 cycle/pixel) and 4 directions (evenly spread from 0 to \u03c0).\n\n3.2 Leveraging natural image statistics\n\nIn general, the computation of (1) is impractical, since it requires density estimates on a potentially\nhigh-dimensional feature space. This complexity can, however, be drastically reduced by exploiting\na well known statistical property of band-pass natural image features, e.g. 
Gabor or wavelet coef\ufb01-\ncients: that features of this type exhibit strongly consistent patterns of dependence (bow-tie shaped\nconditional distributions) across a very wide range of classes of natural imagery [2, 9, 21]. The\nconsistency of these feature dependencies suggests that they are, in general, not greatly informative\nabout the image class [21, 2] and, in the particular case of saliency, about whether the observed\nfeature vectors originate in the center or surround. Hence, (1) can usually be well approximated by\nthe sum of marginal mutual informations [21]1, i.e.,\n\n$$S(l) = \sum_{i=1}^{d} I_l(X_i; Y). \quad (2)$$\n\nSince (2) only requires estimates of marginal densities, it has signi\ufb01cantly less complexity than (1).\nThis complexity can, indeed, be further reduced by resorting to the well known fact that the marginal\ndensities are accurately modeled by a generalized Gaussian distribution (GGD) [13]. In this case, all\ncomputations have a simple closed form [4] and can be mapped into a neural network that replicates\nthe standard architecture of V1: a cascade of linear \ufb01ltering, divisive normalization, quadratic non-\nlinearity and spatial pooling [7].\n\n1Note that this approximation does not assume that the features are independently distributed, but simply\nthat their dependencies are not informative about the class.\n\n[Figure 1 block diagram: feature decomposition (intensity; color R/G, B/Y; orientation) \u2192 feature maps \u2192 feature saliency maps \u2192 saliency map]\n\nFigure 1: Bottom-up discriminant 
saliency detector.\n\nFigure 2: The nonlinearity of human saliency responses to orientation contrast [14] (a) is replicated\nby discriminant saliency (b), but not by the model of [11] (c).\n\n3.3 Consistency with psychophysics\nTo evaluate the consistency of discriminant saliency with psychophysics, we start by applying the\ndiscriminant saliency detector to a series of displays used in classical studies of visual attention [18,\n19, 14]2. In [7], we have shown that discriminant saliency reproduces the anecdotal properties of\nsaliency - percept of pop-out for single feature search, disregard of feature conjunctions, and search\nasymmetries for feature presence vs. absence - that have previously been shown possible to replicate\nwith linear saliency models [11]. Here, we focus on quantitative predictions of human performance,\nand compare the output of discriminant saliency with both human data and that of the difference-\nbased center-surround saliency model [11]3.\nThe \ufb01rst experiment tests the ability of the saliency models to predict a well known nonlinearity\nof human saliency. Nothdurft [14] has characterized the saliency of pop-out targets due to ori-\nentation contrast, by comparing the conspicuousness of orientation de\ufb01ned targets and luminance\nde\ufb01ned ones, and using luminance as a reference for relative target salience. He showed that the\nsaliency of a target increases with orientation contrast, but in a non-linear manner: 1) there exists a\nthreshold below which the effect of pop-out vanishes, and 2) above this threshold saliency increases\nwith contrast, saturating after some point. The results of this experiment are illustrated in Figure 2,\nwhich presents plots of saliency strength vs. orientation contrast for human subjects [14] (in (a)),\nfor discriminant saliency (in (b)), and for the difference-based model of [11] (in (c)). 
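Under the marginal approximation of (2), a saliency map can be sketched by sliding a center window and a surround window over each feature channel and summing the per-channel mutual informations. The code below is a minimal single-channel NumPy illustration; the window radii, stride, and histogram estimator are arbitrary choices for the sketch (the paper sets the surround to six times the center and uses GGD models rather than histograms):

```python
import numpy as np

def channel_mi(center, surround, bins=16):
    """Histogram estimate of the marginal mutual information I(X_i; Y)."""
    both = np.concatenate([center, surround])
    edges = np.linspace(both.min(), both.max() + 1e-9, bins + 1)
    p1 = np.histogram(center, edges)[0] / center.size      # P(X_i | center)
    p0 = np.histogram(surround, edges)[0] / surround.size  # P(X_i | surround)
    pi1 = center.size / both.size                          # P(Y = 1)
    px = pi1 * p1 + (1 - pi1) * p0                         # marginal P(X_i)
    mi = 0.0
    for pc, prior in ((p1, pi1), (p0, 1 - pi1)):
        nz = pc > 0
        mi += prior * np.sum(pc[nz] * np.log(pc[nz] / px[nz]))
    return mi

def saliency_map(channels, rc=4, rs=12, stride=4):
    """Sum of per-channel MIs, as in Eq. (2), between a (2*rc+1)-square center
    and the surrounding ring of a (2*rs+1)-square window, on a stride grid."""
    H, W = channels[0].shape
    S = np.zeros((H, W))
    for i in range(rs, H - rs, stride):
        for j in range(rs, W - rs, stride):
            for ch in channels:
                ctr = ch[i - rc:i + rc + 1, j - rc:j + rc + 1].ravel()
                win = ch[i - rs:i + rs + 1, j - rs:j + rs + 1].astype(float).copy()
                win[rs - rc:rs + rc + 1, rs - rc:rs + rc + 1] = np.nan  # exclude center
                S[i, j] += channel_mi(ctr, win[~np.isnan(win)])
    return S
```

On a toy channel that is uniform except for a small bright square, the map peaks where the center window covers the square and its surround does not, and is zero in homogeneous regions, which is the qualitative behavior the psychophysics experiments above probe.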
Note that discrim-\ninant saliency closely predicts the strong threshold and saturation effects characteristic of subject\nperformance, but the difference-based model shows no such compliance.\n\nThe second experiment tests the ability of the models to make accurate quantitative predictions of\nsearch asymmetries. It replicates the experiment designed by Treisman [19] to show that the asym-\nmetries of human saliency comply with Weber\u2019s law. Figure 3 (a) shows one example of the displays\nused in the experiment, where the central target (vertical bar) differs from distractors (a set of iden-\ntical vertical bars) only in length. Figure 3 (b) shows a scatter plot of the values of discriminant\nsaliency obtained across the set of displays. Each point corresponds to the saliency at the target\nlocation in one display, and the dashed line shows that, like human perception, discriminant saliency\nfollows Weber\u2019s law: target saliency is approximately linear in the ratio between the difference of\ntarget/distractor length (\u2206x) and distractor length (x). For comparison, Figure 3 (c) presents the cor-\nresponding scatter plot for the model of [11], which clearly does not replicate human performance.\n\n4 Applications of discriminant saliency\nWe have, so far, presented quantitative evidence in support of the hypothesis that pre-attentive vi-\nsion implements decision-theoretic center-surround saliency. This evidence is strengthened by the\n\n2For the computation of the discriminant saliency maps, we followed the common practice of psychophysics\nand physiology [18, 10]: the size of the center window was set to a value comparable to that of the display items,\nand the size of the surround window to six times that of the center. 
Informal experimentation has shown that\nthe saliency results are not substantively affected by variations around the parameter values adopted.\n\n3Results obtained with the MATLAB implementation available in [22].\n\nFigure 3: An example display (a) and performance of saliency detectors (discriminant saliency (b)\nand [11] (c)) on Weber\u2019s law experiment.\n\nFigure 4: Average ROC area, as a function of inter-subject ROC area, for the saliency algorithms\n(legend: discriminant saliency, Itti et al., Bruce et al.).\n\nSaliency model | Discriminant | Itti et al. [11] | Bruce et al. [1]\nROC area | 0.7694 | 0.7287 | 0.7547\n\nTable 1: ROC areas for different saliency models with respect to all human \ufb01xations.\n\nthe already mentioned one-to-one mapping between the discriminant saliency detector proposed above\nand the standard model for the neurophysiology of V1 [7]. Another interesting property of discrim-\ninant saliency is that its optimality is independent of the stimulus dimension under consideration, or\nof speci\ufb01c feature sets. In fact, (1) can be applied to any type of stimuli, and any type of features, as\nlong as it is possible to estimate the required probability distributions from the center and surround\nneighborhoods. This encouraged us to derive discriminant saliency detectors for various computer\nvision applications, ranging from the prediction of human eye \ufb01xations, to the detection of salient\nmoving objects, to background subtraction in the context of highly dynamic scenes. 
The outputs\nof these discriminant saliency detectors are next compared with either human performance or the\nstate-of-the-art in computer vision for each application.\n\n4.1 Prediction of eye \ufb01xations on natural images\nWe start by using the static discriminant saliency detector of the previous section to predict human\neye \ufb01xations. For this, the saliency maps were compared to the eye \ufb01xations of human subjects in\nan image viewing task. The experimental protocol was that of [1], using \ufb01xation data collected from\n20 subjects and 120 natural images. Under this protocol, all saliency maps are \ufb01rst quantized into\na binary mask that classi\ufb01es each image location as either a \ufb01xation or non-\ufb01xation [17]. Using\nthe measured human \ufb01xations as ground truth, a receiver operating characteristic (ROC) curve is\nthen generated by varying the quantization threshold. Perfect prediction corresponds to an ROC\narea (area under the ROC curve) of 1, while chance performance occurs at an area of 0.5. The\npredictions of discriminant saliency are compared to those of the methods of [11] and [1].\n\nTable 1 presents average ROC areas for all detectors, across the entire image set. It is clear that\ndiscriminant saliency achieves the best performance among the three detectors. For a more detailed\nanalysis, we also plot (in Figure 4) the ROC areas of the three detectors as a function of the \u201cinter-\nsubject\u201d ROC area (a measure of the consistency of eye movements among human subjects [8]), for\nthe \ufb01rst two \ufb01xations - which are more likely to be driven by bottom-up mechanisms than the later\nones [17]. Again, discriminant saliency exhibits the strongest correlation with human performance;\nthis happens at all levels of inter-subject consistency, and the difference is largest when the latter\nis strong. 
In this region, the performance of discriminant saliency (.85) is close to 90% of that of\nhumans (.95), while the other two detectors only achieve close to 85% (.81).\n\n4.2 Discriminant saliency on motion \ufb01elds\nSimilarly to the static case, center-surround discriminant saliency can produce motion-based\nsaliency maps if combined with motion features. We have implemented a simple motion-based de-\ntector by computing a dense motion vector map (optical \ufb02ow) between pairs of consecutive images,\nand using the magnitude of the motion vector at each location as the motion feature. The probability\ndistributions of this feature, within center and surround, were estimated with histograms, and the\nmotion saliency maps computed with (2).\n\nFigure 5: Optical \ufb02ow-based saliency in the presence of egomotion.\n\nDespite the simplicity of our motion representation, the discriminant saliency detector exhibits in-\nteresting performance. Figure 5 shows several frames (top row) from a video sequence, and their\ndiscriminant motion saliency maps (bottom row). The sequence depicts a leopard running in a grass-\nland, which is tracked by a moving camera. This results in signi\ufb01cant variability of the background,\ndue to egomotion, making the detection of foreground motion (the leopard) a non-trivial task. As shown\nin the saliency maps, discriminant saliency successfully disregards the egomotion component of the\noptical \ufb02ow, detecting the leopard as most salient.\n\n4.3 Discriminant saliency with dynamic backgrounds\nWhile the results of Figure 5 are probably within the reach of previously proposed saliency models,\nthey illustrate the \ufb02exibility of discriminant saliency. In this section we move to a domain where\ntraditional saliency algorithms almost invariably fail. This consists of videos of scenes with com-\nplex and dynamic backgrounds (e.g. water waves or tree leaves). 
In order to capture the motion\npatterns characteristic of these backgrounds it is necessary to rely on reasonably sophisticated prob-\nabilistic models, such as the dynamic texture model [5]. Such models are very dif\ufb01cult to \ufb01t into\nconventional (e.g. difference-based) saliency frameworks, but are naturally compatible with the discrim-\ninant saliency hypothesis. We next combine discriminant center-surround saliency with the dynamic\ntexture model, to produce a background-subtraction algorithm for scenes with complex background\ndynamics. While background subtraction is a classic problem in computer vision, there has been\nrelatively little progress for these types of scenes (e.g. see [15] for a review).\n\nA dynamic texture (DT) [5, 3] is an autoregressive, generative model for video. It models the spatial\ncomponent of the video and the underlying temporal dynamics as two stochastic processes. A video\nis represented as a time-evolving state process $x_t \in \mathbb{R}^n$, and the appearance of a frame $y_t \in \mathbb{R}^m$ is\na linear function of the current state vector with some observation noise. The system equations are\n\n$$x_t = A x_{t-1} + v_t, \qquad y_t = C x_t + w_t \quad (3)$$\n\nwhere $A \in \mathbb{R}^{n \times n}$ is the state transition matrix and $C \in \mathbb{R}^{m \times n}$ is the observation matrix. The state and\nobservation noise are given by $v_t \sim_{iid} \mathcal{N}(0, Q)$ and $w_t \sim_{iid} \mathcal{N}(0, R)$, respectively. Finally, the\ninitial condition is distributed as $x_1 \sim \mathcal{N}(\mu, S)$. Given a sequence of images, the parameters of the\ndynamic texture can be learned for the center and surround regions at each image location, enabling\na probabilistic description of the video, with which the mutual information of (2) can be evaluated.\n\nWe applied the dynamic texture-based discriminant saliency (DTDS) detector to three video se-\nquences containing objects moving in water. The \ufb01rst (Water-Bottle from [23]) depicts a bottle\n\ufb02oating in water which is hit by rain drops, as shown in Figure 7(a). 
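As a sketch of the generative model in (3), the following simulates a toy dynamic texture; all dimensions, noise covariances, and the spectral-radius normalization are arbitrary illustrative choices, and learning A, C, Q, R from a video patch (as needed for the center and surround models) is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, T = 5, 64, 50            # state dim, pixels per frame, frames (toy sizes)

A = rng.standard_normal((n, n))
A *= 0.95 / np.abs(np.linalg.eigvals(A)).max()   # rescale for a stable transition
C = rng.standard_normal((m, n))                  # observation matrix
Q = 0.1 * np.eye(n)                              # state noise covariance
R = 0.01 * np.eye(m)                             # observation noise covariance

x = rng.standard_normal(n)                       # x_1 ~ N(mu, S), here mu = 0, S = I
frames = []
for t in range(T):
    frames.append(C @ x + rng.multivariate_normal(np.zeros(m), R))  # y_t = C x_t + w_t
    x = A @ x + rng.multivariate_normal(np.zeros(n), Q)             # x_t = A x_{t-1} + v_t
Y = np.stack(frames)            # (T, m): each row is a vectorized frame
```

Fitting one such model to the center window and another to the surround at each location supplies the conditional densities with which the mutual information of (2) is evaluated.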
The second and third, Boat and\nSurfer, are composed of boats/surfers moving in water, and are shown in Figures 8(a) and 9(a). These\nsequences are more challenging, since the micro-texture of the water surface is superimposed on a\nlower frequency sweeping wave (Surfer) and interspersed with high frequency components due to\nturbulent wakes (created by the boat, surfer, and crest of the sweeping wave). Figures 7(b), 8(b)\nand 9(b) show the saliency maps produced by discriminant saliency for the three sequences. The\nDTDS detector performs surprisingly well, in all cases, at detecting the foreground objects while ig-\nnoring the movements of the background. In fact, the DTDS detector is close to an ideal background-\nsubtraction algorithm for these scenes.\n\nFigure 6: Performance of background subtraction algorithms (ROC curves: detection rate vs. false\npositive rate, discriminant saliency vs. GMM) on: (a) Water-Bottle, (b) Boat, and (c) Surfer.\n\nFigure 7: Results on Bottle: (a) original; (b) discriminant saliency with DT; and (c) GMM model of [16, 24].\n\nFor comparison, we present the output of a state-of-the-art background subtraction algorithm, a\nGaussian mixture model (GMM) [16, 24]. As can be seen in Figures 7(c), 8(c) and 9(c), the resulting\nforeground detection is very noisy, and cannot adapt to the highly dynamic nature of the water\nsurface. 
Note, in particular, that the waves produced by the boat and surfer, as well as the sweeping\nwave crest, create serious dif\ufb01culties for this algorithm. Unlike the saliency maps of DTDS, the\nresulting foreground maps would be dif\ufb01cult to analyze by subsequent vision (e.g. object tracking)\nmodules. To produce a quantitative comparison of the saliency maps, these were thresholded over a\nlarge range of values. The results were compared with ground-truth foreground masks, and an ROC\ncurve produced for each algorithm. The results are shown in Figure 6, where it is clear that while\nDTDS tends to do well on these videos, the GMM-based background model does fairly poorly.\nReferences\n[1] N. D. Bruce and J. K. Tsotsos. Saliency based on information maximization. In Proc. NIPS, 2005.\n[2] R. Buccigrossi and E. Simoncelli. Image compression via joint statistical characterization in the wavelet domain. IEEE Transactions on Image Processing, 8:1688\u20131701, 1999.\n[3] A. B. Chan and N. Vasconcelos. Modeling, clustering, and segmenting video with mixtures of dynamic textures. IEEE Trans. PAMI, in press.\n[4] M. N. Do and M. Vetterli. Wavelet-based texture retrieval using generalized Gaussian density and Kullback-Leibler distance. IEEE Trans. Image Processing, 11(2):146\u2013158, 2002.\n[5] G. Doretto, A. Chiuso, Y. N. Wu, and S. Soatto. Dynamic textures. Int. J. Comput. Vis., 51, 2003.\n[6] D. Gao and N. Vasconcelos. Discriminant saliency for visual recognition from cluttered scenes. In Proc. NIPS, pages 481\u2013488, 2004.\n[7] D. Gao and N. Vasconcelos. Decision-theoretic saliency: computational principle, biological plausibility, and implications for neurophysiology and psychophysics. Submitted to Neural Computation, 2007.\n[8] J. Harel, C. Koch, and P. Perona. Graph-based visual saliency. In Proc. NIPS, 2006.\n[9] J. Huang and D. Mumford. Statistics of Natural Images and Models. In Proc. IEEE Conf. CVPR, 1999.\n[10] D. H. Hubel and T. N. 
Wiesel. Receptive \ufb01elds and functional architecture in two nonstriate visual areas (18 and 19) of the cat. J. Neurophysiol., 28:229\u2013289, 1965.\n\nFigure 8: Results on Boats: (a) original; (b) discriminant saliency with DT; and (c) GMM model of [16, 24].\n\nFigure 9: Results on Surfer: (a) original; (b) discriminant saliency with DT; and (c) GMM model of [16, 24].\n\n[11] L. Itti and C. Koch. A saliency-based search mechanism for overt and covert shifts of visual attention. Vision Research, 40:1489\u20131506, 2000.\n[12] L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Trans. PAMI, 20(11), 1998.\n[13] S. G. Mallat. A theory for multiresolution signal decomposition: The wavelet representation. IEEE Trans. PAMI, 11(7):674\u2013693, 1989.\n[14] H. C. Nothdurft. The conspicuousness of orientation and motion contrast. Spat. Vis., 7, 1993.\n[15] Y. Sheikh and M. Shah. Bayesian modeling of dynamic scenes for object detection. IEEE Trans. PAMI, 27(11):1778\u201392, 2005.\n[16] C. Stauffer and W. Grimson. Adaptive background mixture models for real-time tracking. In CVPR, pages 246\u201352, 1999.\n[17] B. W. Tatler, R. J. Baddeley, and I. D. Gilchrist. Visual correlates of \ufb01xation selection: effects of scale and time. Vision Research, 45:643\u2013659, 2005.\n[18] A. Treisman and G. Gelade. A feature-integration theory of attention. Cognit. Psych., 12, 1980.\n[19] A. Treisman and S. Gormican. Feature analysis in early vision: Evidence from search asymmetries. Psychological Review, 95:14\u201358, 1988.\n[20] A. Tversky. Features of similarity. Psychol. Rev., 84, 1977.\n[21] N. Vasconcelos. Scalable discriminant feature selection for image retrieval. In CVPR, 2004.\n[22] D. Walther and C. Koch. Modeling attention to salient proto-objects. Neural Networks, 19, 2006.\n[23] J. Zhong and S. Sclaroff. 
Segmenting foreground objects from a dynamic textured background via a robust Kalman \ufb01lter. In ICCV, 2003.\n[24] Z. Zivkovic. Improved adaptive Gaussian mixture model for background subtraction. In ICPR, 2004.\n", "award": [], "sourceid": 874, "authors": [{"given_name": "Dashan", "family_name": "Gao", "institution": null}, {"given_name": "Vijay", "family_name": "Mahadevan", "institution": null}, {"given_name": "Nuno", "family_name": "Vasconcelos", "institution": null}]}