{"title": "A biologically plausible network for the computation of orientation dominance", "book": "Advances in Neural Information Processing Systems", "page_first": 1723, "page_last": 1731, "abstract": "The determination of dominant orientation at a given image location is formulated as a decision-theoretic question. This leads to a novel measure for the dominance of a given orientation $\\theta$, which is similar to that used by SIFT. It is then shown that the new measure can be computed with a network that implements the sequence of operations of the standard neurophysiological model of V1. The measure can thus be seen as a biologically plausible version of SIFT, and is denoted as bioSIFT. The network units are shown to exhibit trademark properties of V1 neurons, such as cross-orientation suppression, sparseness and independence. The connection between SIFT and biological vision provides a justification for the success of SIFT-like features and reinforces the importance of contrast normalization in computer vision. We illustrate this by replacing the Gabor units of an HMAX network with the new bioSIFT units. This is shown to lead to significant gains for classification tasks, leading to state-of-the-art performance among biologically inspired network models and performance competitive with the best non-biological object recognition systems.", "full_text": "A biologically plausible network for the computation\n\nof orientation dominance\n\nKritika Muralidharan\n\nNuno Vasconcelos\n\nStatistical Visual Computing Laboratory\n\nUniversity of California San Diego\n\nStatistical Visual Computing Laboratory\n\nUniversity of California San Diego\n\nLa Jolla, CA 92039\n\nkrmurali@ucsd.edu\n\nLa Jolla, CA 92039\n\nnuno@ece.ucsd.edu\n\nAbstract\n\nThe determination of dominant orientation at a given image location is formulated\nas a decision-theoretic question. 
This leads to a novel measure for the domi-\nnance of a given orientation \u03b8, which is similar to that used by SIFT. It is then\nshown that the new measure can be computed with a network that implements\nthe sequence of operations of the standard neurophysiological model of V1. The\nmeasure can thus be seen as a biologically plausible version of SIFT, and is de-\nnoted as bioSIFT. The network units are shown to exhibit trademark properties of\nV1 neurons, such as cross-orientation suppression, sparseness and independence.\nThe connection between SIFT and biological vision provides a justi\ufb01cation for\nthe success of SIFT-like features and reinforces the importance of contrast nor-\nmalization in computer vision. We illustrate this by replacing the Gabor units of\nan HMAX network with the new bioSIFT units. This is shown to lead to signif-\nicant gains for classi\ufb01cation tasks, leading to state-of-the-art performance among\nbiologically inspired network models and performance competitive with the best\nnon-biological object recognition systems.\n\n1\n\nIntroduction\n\nIn the past decade, computer vision research in object recognition has \ufb01rmly established the ef\ufb01cacy\nof representing images as collections of local descriptors of edge orientation. These descriptors are\nusually based on histograms of dominant orientation, for example, the edge orientation histograms\nof [1], the SIFT descriptor of [2], or the HOG features of [3]. SIFT, in particular, could be considered\ntoday\u2019s default (low-level) representation for object recognition, adopted by hundreds of computer\nvision papers. The SIFT descriptor is heavily inspired by known computations of the early visual\ncortex [2], but has no formal detailed connection to computational neuroscience. Interestingly, a\nparallel, and equally important but seemingly unrelated, development has taken place in this area in\nthe recent past. 
After many decades of modeling simple cells as linear filters plus "some" nonlinearity [4], neuroscientists have developed a much firmer understanding of their non-linear behavior. One property that has always appeared essential to the robustness of biological vision is the ability of individual cells to adapt their dynamic range to the strength of the visual stimulus. This adaptation appears as early as in the retina [5], is prevalent throughout the visual cortex [6], and seems responsible for the remarkable ability of the visual system to adapt to lighting variations. Within the last decade, it has been explained by the implementation of gain control in individual neurons, through the divisive normalization of their responses by those of their neighbors [7, 8]. Again, hundreds of papers have been written on divisive normalization, and its consequences for visual processing. Today, there appears to be little dispute about its role as a component of the standard neurophysiological model of early vision [9].

In this work, we establish a formal connection between these two developments. This connection is inspired by recent work on the link between the computations of the standard model and the basic operations of statistical decision theory [10]. We start by formulating the central motivating question for descriptors such as SIFT or HOG, how to represent locally dominant image orientation, as a decision-theoretic problem. An orientation $\\theta$ is defined as dominant, at a location $l$ of the visual field, if the Gabor response of orientation $\\theta$ at $l$, $x_\\theta(l)$, is both large and distinct from those of other orientations. An optimal statistical test is then derived to determine if $x_\\theta(l)$ is distinct from the responses of remaining orientations. The core of this test is the posterior probability of orientation of the visual stimulus at $l$, given $x_\\theta(l)$.
The dominance of orientation \u03b8, within a neighborhood R, is\nthen de\ufb01ned as the expected strength of responses x\u03b8(l), in R, which are distinct. This is shown to\nbe a sum of the response amplitudes |x\u03b8(l)| across R, with each location weighted by the posterior\nprobability that it contains stimulus of orientation \u03b8.\nThe resulting representation of orientation is similar to that of SIFT, which assigns each point to a\ndominant orientation and integrates responses over R. The main difference is that a location could\ncontribute to more than one orientation, since the expected strength relies on a soft assignment\nof locations to orientations, according to their posterior orientation probability. Exploiting known\nproperties of natural image statistics, and the framework of [10], we then show that this measure of\norientation dominance can be computed with the sequence of operations of the standard neurophys-\niological model: simple cells composed of a linear \ufb01lter, divisive normalization, and a saturating\nnon-linearity, and complex cells that implement spatial pooling. The proposed measure of orien-\ntation dominance can then be seen as a biologically plausible version of that used by SIFT, and is\ndenoted by bioSIFT. BioSIFT units are shown to exhibit the trademark properties of V1 neurons:\ntheir responses are closely \ufb01t by the Naka-Rushton equation [11], and they exhibit an inhibitory be-\nhavior, known as cross-orientation suppression, which is ubiquitous in V1 [12]. We note, however,\nthat our goal is not to provide an alternative to SIFT. 
On the contrary, the formal connection between\n\ufb01ndings from computer vision and neuroscience provides additional justi\ufb01cation to both the success\nof SIFT in computer vision, and the importance of divisive normalization in the visual cortex, as\nwell as its connection to the determination of orientation dominance.\nThe main practical bene\ufb01t of bioSIFT is to improve the performance of biologically plausible recog-\nnition networks, whose performance it brings close to the level of the state of the art in computer\nvision. In the process of doing this, it points to the importance of divisive normalization in vision.\nWhile such normalization tends to be justi\ufb01ed as a means to increase robustness to variations of\nillumination, a hypothesis that we do not dispute, it appears to make a tremendous difference even\nwhen such variations do not hold. We illustrate these points through object recognition experiments\nwith HMAX networks [13]. It is shown that the simple replacement of Gabor \ufb01lter responses with\nthe normalized orientation descriptors of bioSIFT produces very signi\ufb01cant gains in recognition ac-\ncuracy. These gains hold for standard datasets, such as Caltech101, where lighting variations are not\na substantial nuisance. This points to the alternative hypothesis that the fundamental role of contrast\nnormalization is to determine orientation dominance. The hypothesis is substantiated by the fact that\nthe bioSIFT enhanced HMAX network substantially outperforms the previous best results in the lit-\nerature of biologically-inspired recognition networks [14, 15]. 
While these networks implement a number of operations similar to those of bioSIFT, including the use of contrast normalized units, they do not have a precise functional justification (such as the determination of orientation dominance), lack a well-defined optimality criterion, and do not have a rigorous statistical interpretation. The importance of these properties is further illustrated by experiments on a dataset composed exclusively of natural scenes [16], which (unlike Caltech) fully matches the assumptions under which bioSIFT is optimal (natural image statistics). On this dataset, the HMAX network with the bioSIFT features has performance identical to that of very recent state-of-the-art computer vision methods.

2 The bioSIFT Features

We start by describing the implementation of the bioSIFT network in detail. We lay out the computations, establish their conformity with the standard neurophysiological model, and analyze the statistical meaning of the computed features.

2.1 Motivation

Various authors have argued that perceptual systems compute optimal decisions tuned to the statistics of natural stimuli [17, 18, 19]. The ubiquity of orientation processing in visual cortex suggests that the estimation of local orientation is important for tasks such as object recognition. This is reinforced by the success, in computer vision, of algorithms based on SIFT or SIFT-like descriptors. While the classical view was that the brain simply performs a linear decomposition into orientation channels, through Gabor filtering, SIFT representations emphasize the estimation of dominant orientation. The latter is a very non-linear operation, involving the comparison of response strength across orientation channels, and requires inter-channel normalization. In SIFT, this is performed implicitly, by combining the computation of gradients with some post-processing heuristics.
More formal estimates of dominant orientation can be obtained by formulating the problem in decision-theoretic terms, and deriving optimal decision rules for its solution. For this, we assume that the visual system infers dominant orientation from a set of visual features $x \\in \\mathbb{R}^M$, which measure stimulus amplitude at each orientation. In this work, we assume these features to be the set of responses $X_i = I \\circ G_i$ of the stimulus $I$ to a bank of Gabor filters $G_i$. Here, $G_i$ is the filter of the $i$th orientation, and $\\circ$ denotes convolution. In principle, determining whether there is a dominant orientation requires the joint inspection of all feature channels $X_i$. Statistically, this implies modeling the joint feature distribution and is intractable for low-level vision.

A more tractable question is whether the $i$th channel responses, $X_i$, are distinct from those of the other channels, $X_j$, $j \\neq i$. Letting $\\theta$ denote the channel orientation, i.e. $P_{X|\\theta}(x|i) = P_{X_i}(x)$, this question can be posed as a classification problem with two hypotheses of label $Y \\in \\{0, 1\\}$, where

• $Y = 1$ if the $i$th channel responses are distinct, i.e. $P(X = x, \\theta = i) \\neq P(X = x, \\theta \\neq i)$,
• $Y = 0$ otherwise, i.e. $P(X = x, \\theta = i) = P(X = x, \\theta \\neq i)$.

This problem has class-conditional densities

$$P(X = x, \\theta = i|Y = 1) = P(X = x, \\theta = i) = P_{X|\\theta}(x|i) P_\\theta(i)$$
$$P(X = x, \\theta = i|Y = 0) = P(X = x, \\theta \\neq i) = \\sum_{j \\neq i} P_{X|\\theta}(x|j) P_\\theta(j)$$

and the posterior probability of the 'distinct' hypothesis given an observation from channel $i$ is

$$P(Y = 1|X = x, \\theta = i) = \\frac{P_{X|\\theta}(x|i) P_\\theta(i)}{\\sum_j P_{X|\\theta}(x|j) P_\\theta(j)} = P_{\\theta|X}(i|x) \\quad (1)$$

where we have assumed that $P_Y(0) = P_Y(1) = 1/2$.
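As a concrete illustration, the posterior of (1) under a uniform prior can be sketched in a few lines of numpy. This is a sketch under assumptions, not the paper's implementation: the per-channel densities are modeled with the GGD introduced in Section 3.1, and the scales in `alphas` are hypothetical stand-ins for learned channel statistics.

```python
import numpy as np
from math import gamma

def ggd_density(x, alpha, beta=0.5):
    # Generalized Gaussian density with scale alpha and shape beta
    # (the channel model adopted in Section 3.1).
    return beta / (2.0 * alpha * gamma(1.0 / beta)) * np.exp(-(np.abs(x) / alpha) ** beta)

def orientation_posterior(x, alphas, beta=0.5):
    # Posterior P(theta = i | x) of (1), with uniform prior P_theta(i) = 1/M
    # and each channel i modeled by a GGD of (hypothetical) scale alphas[i].
    likelihoods = np.array([ggd_density(x, a, beta) for a in alphas])
    return likelihoods / likelihoods.sum()
```

For example, `orientation_posterior(5.0, [1.0, 0.2])` assigns most of the posterior mass to the first channel, whose larger scale makes a large response more likely; a response is then declared distinct when its posterior reaches 1/2.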
Given the response $x_i(l)$ of $X_i$ at location $l \\in R$, the minimum probability of error (MPE) decision rule is to declare it distinct when

$$P_{\\theta|X}(i|x_i(l)) = \\frac{P_{X_i}(x_i(l)) P_\\theta(i)}{\\sum_j P_{X_j}(x_i(l)) P_\\theta(j)} \\geq \\frac{1}{2}. \\quad (2)$$

While this test determines if the responses of $X_i$ are distinct from those of $X_{j \\neq i}$, it does not determine if $X_i$ is dominant: $X_i$ could be distinct because it is the only feature that does not respond to the stimulus in $R$. The second question is to determine if the responses of $X_i$ are both distinct and large. This requires a new random variable

$$S(x_i) = \\begin{cases} |x_i|, & \\text{if } Y = 1 \\\\ 0, & \\text{if } Y = 0 \\end{cases} \\quad (3)$$

which measures the strength (absolute value) of the distinct responses. The expected strength of distinct responses in $R$ is then

$$E_{Y,X|\\theta}[S(X)|\\theta = i] = \\int |x| P_{Y|X,\\theta}(1|x, i) P_{X|\\theta}(x|i) \\, dx \\quad (4)$$
$$= \\int |x| P_{\\theta|X}(i|x) P_{X_i}(x) \\, dx. \\quad (5)$$

The empirical estimate of (5) from the sample $x_i(l)$, $l \\in R$, is

$$\\widehat{S(X_i)}_R = \\frac{1}{|R|} \\sum_l |x_i(l)| P_{\\theta|X}(i|x_i(l)). \\quad (6)$$

Figure 1: bioSIFT computations for a given orientation $\\theta$: (a) an image, (b) response of the Gabor filter of orientation $\\theta$, (c) posterior probability map for orientation $\\theta$, (d) orientation dominance measure for channel $\\theta$; (e), (f), (g), (h) the image, Gabor response, posterior probability, and dominance measure of the same channel for a contrast-reduced version of the image.

This measure of the dominance of the $i$th orientation is a sum of the response amplitudes $|x_i(l)|$ across $R$, with each location weighted by the posterior probability that it contains stimulus of that orientation. It is similar to the measure used by SIFT, which assigns each point to a dominant orientation and integrates responses over $R$.
The main difference is that a location could contribute\nto more than one orientation, since the expected strength relies on a soft assignment of locations to\norientations, according to their posterior orientation probability.\nFigure 1 illustrates the computations of (6) for the image shown in a). The response of a Gabor \ufb01lter\nof orientation \u03b8 = 3\u03c0/4 is shown in b), and the orientation probability map P\u03b8|X(i|xi) in c). Note\nthat these probabilities are much smaller than the Gabor responses in the body of the star\ufb01sh, where\nthe image is textured but there is no signi\ufb01cant structure of orientation \u03b8. On the other hand, they are\nlargest for the locations where the orientation is dominant. Figure 1 d) shows the \ufb01nal dominance\nmeasure. The combined multiplication by the Gabor responses and averaging over R magni\ufb01es the\nresponses where the orientation is dominant, suppressing the details due to texture or noise. This\ncan be seen by comparing b) and d). Overall, (6) is large when the ith orientation responses are\n1) distinct from those of other channels and 2) large. It is small when they are either indistinct or\nsmall. One interesting property is that it penalizes large responses of Xi that are not informative of\nthe presence of stimuli with orientation i. Hence, increasing the stimulus contrast does not increase\n(cid:92)S(Xi)R when responses xi(l) cannot be con\ufb01dently assigned to the ith orientation. This can be\nseen in Figure 1 f) and h), where the Gabor response and dominance measure are shown for a low-\ncontrast replica of the image of a). While the Gabor responses at low (f) and high (b) contrasts are\nsubstantially different, the dominance measure (d and h) stays almost constant. It follows that (6)\nimplements contrast normalization, a topic to which we will return in later sections. 
It is worth noting that such normalization is accomplished without modeling joint distributions of response across orientations. On the contrary, all quantities in (6) are scalar.

3 Biological plausibility

In this section we study the biological plausibility of the orientation dominance measure of (6).

3.1 Natural image statistics

Extensive research on the statistics of natural images has shown that the responses of bandpass features follow the generalized Gaussian distribution (GGD)

$$P_X(x; \\alpha, \\beta) = \\frac{\\beta}{2 \\alpha \\Gamma(1/\\beta)} \\exp\\left(-\\left(\\frac{|x|}{\\alpha}\\right)^\\beta\\right) \\quad (7)$$

where $\\Gamma(z) = \\int_0^\\infty e^{-t} t^{z-1} \\, dt$, $t > 0$, is the Gamma function, $\\alpha$ is a scale and $\\beta$ a shape parameter. The biological plausibility of statistical inference for GGD stimuli was extensively studied in [10]. This work shows that various fundamental computations in statistics can indeed be computed biologically when a maximum a posteriori (MAP) estimate is adopted for $\\alpha^\\beta$, using a conjugate (Gamma) prior. This MAP estimate is

$$\\alpha_{MAP} = \\left[\\frac{\\beta}{n + \\eta}\\left(\\sum_{j=1}^n |x(j)|^\\beta + \\nu\\right)\\right]^{1/\\beta} \\quad (8)$$

where $\\nu$ and $\\eta$ are the prior hyperparameters, and $x(j)$ a sample of training points. As is usual in Bayesian inference, the hyperparameter values are important when the sample is too small to enable reliable inference.

Figure 2: One channel of the bioSIFT network. The large dashed box implements the computations of the simple cell, and the small one those of the complex cell. The simple cell computes the contribution of channel $i$ to the expected value of the dominant response at pixel $x$, indicated by a filled box. Spatial pooling by the complex cell determines the channel's contribution to the expected value of the dominant response within the pooling neighborhood.
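The MAP estimate of (8) is a one-liner in numpy; a minimal sketch, with the hyperparameter values used in the experiments as defaults:

```python
import numpy as np

def map_alpha(x, beta=0.5, nu=1e-3, eta=1.0):
    # MAP estimate (8) of the GGD scale alpha from a sample x of channel
    # responses, with conjugate (Gamma) prior hyperparameters nu and eta.
    n = x.size
    return ((beta / (n + eta)) * (np.sum(np.abs(x) ** beta) + nu)) ** (1.0 / beta)
```

For reasonably sized samples the prior terms are negligible and the estimate is driven by the data term $\sum_j |x(j)|^\beta$, which is why the estimates are insensitive to the exact hyperparameter values.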
This is not the case for the current work, where the estimates remain constant over a substantial range of their values. Hence, we simply set $\\nu = 10^{-3}$ and $\\eta = 1$ in all experiments. For natural images, the value of $\\beta$ is quite stable. We use $\\beta = 0.5$ (determined by fitting the GGD to a large set of images) in our experiments.

3.2 Biological computations

To derive a biologically plausible form of (6) we start by assuming that $P_\\theta(i) = \\frac{1}{M}$. This is mostly for simplicity; the discussion could be generalized to account for any prior distribution of orientations. Under this assumption, using (1),

$$P_{\\theta|X}(\\theta = j|x) = \\frac{P_{X|\\theta}(x|\\theta = j)}{\\sum_k P_{X|\\theta}(x|\\theta = k)} = \\frac{P_{X_j}(x)}{\\sum_k P_{X_k}(x)} \\quad (9)$$

and

$$\\widehat{S(X_i)}_R \\propto \\sum_{l \\in R} |x_i(l)| \\, \\psi_i\\left[\\log P_{X_1}(x_i(l)), \\ldots, \\log P_{X_M}(x_i(l))\\right] \\quad (10)$$

where $\\psi_k$ is the classical softmax activation function

$$\\psi_k(q_1, \\ldots, q_n) = \\frac{\\exp(q_k)}{\\sum_{j=1}^n \\exp(q_j)}, \\quad (11)$$

$q_j$ the log-likelihood (up to constants that cancel in (11))

$$q_j = \\log P_{X_j}(x_i(l)) = -\\phi(x_i(l); \\theta_j) - K_j \\quad (12)$$

and, from (7) with the MAP estimate of $\\alpha^\\beta$ from (8) and the responses in $R$ as training sample,

$$\\phi(x; \\theta_k) = \\frac{|x|^\\beta}{\\xi_k}; \\quad \\xi_k = \\frac{\\beta}{|R| + \\eta}\\left(\\sum_{l \\in R} |x_k(l)|^\\beta + \\nu\\right); \\quad K_j = \\log \\alpha_j = \\frac{1}{\\beta} \\log \\xi_j. \\quad (13)$$
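The computations of (9)-(13) for a single pooling region can be summarized in a short numpy sketch. This is a simplified illustration, not the authors' code: `x` is a hypothetical array of Gabor responses, one row per orientation channel, one column per location of $R$.

```python
import numpy as np

def biosift_dominance(x, beta=0.5, nu=1e-3, eta=1.0):
    # x: (M, N) Gabor responses, M orientation channels, N locations in R.
    # Returns the M-vector of dominance measures S(X_i)_R, up to a constant.
    M, N = x.shape
    ax = np.abs(x) ** beta
    # (13): xi_k = alpha_k^beta, learned from the responses in R (MAP estimate)
    xi = beta / (N + eta) * (ax.sum(axis=1) + nu)                    # shape (M,)
    # (12): q[k, i, l] = log P_{X_k}(x_i(l)), up to constants that cancel in (11)
    q = -ax[None, :, :] / xi[:, None, None] - (np.log(xi) / beta)[:, None, None]
    # (11): softmax over the model index k (stabilized by subtracting the max)
    e = np.exp(q - q.max(axis=0, keepdims=True))
    post = e / e.sum(axis=0, keepdims=True)
    w = post[np.arange(M), np.arange(M), :]          # psi_i, i.e. P(theta=i | x_i(l))
    # (10): complex-cell pooling of |x_i(l)| weighted by the posterior
    return (np.abs(x) * w).mean(axis=1)
```

Running this on synthetic responses where one channel is much stronger than the others, the dominance measure peaks on that channel while the weakly responding, texture-like channels are suppressed.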
Figure 3: (a) COS in real neurons (from [12]), and (b) in bioSIFT features; (c) contrast response of bioSIFT features and the corresponding Naka-Rushton fit; (d) distributions of Gabor and bioSIFT amplitudes; (e) example of orientation selectivity; (f) sample image and maximum bioSIFT response at each location; (g, h) conditional histograms of adjacent channels for Gabor (g) and bioSIFT (h) features.

The computations of (11)-(13) are those performed by simple cells in the standard neurophysiological model of V1. A bank of linear filters is applied at each location $l$ of the field of view. This produces the Gabor responses $x_i(l)$. Each response $x_i(l)$ is divisively normalized by the sum of responses in the neighborhood $R$, for each orientation channel $k$, using (13). Notice that this implies that the conditional distribution of responses of a channel is learned locally, from the sample of responses in $R$. Altogether, (12) implements the computations of a divisively normalized simple cell. Finally, the softmax $\\psi_k$ is a multi-way sigmoidal non-linearity which replicates the well-known saturating behavior of simple cells. The computation of the orientation dominance measure by (10) then corresponds to a complex cell, which pools the simple cell responses in $R$, modulated by the magnitude of the underlying Gabor responses. This produces each channel's contribution to the bioSIFT descriptor.
A graphical description of the network is presented in Figure 2.

3.3 Naka-Rushton fit

In addition to replicating the standard model of V1, the biological plausibility of the bioSIFT features can be substantiated by checking if they reproduce well-established properties of neuronal responses. One characteristic property of neural responses of monkey and cat V1 is the tightness with which they can be fit by the Naka-Rushton equation [11]. The equation describes the average response to a sinusoidal grating of contrast $c$ as

$$R = R_{max} \\frac{c^q}{c_{50}^q + c^q} \\quad (14)$$

where $R_{max}$ is the maximum mean response and $c_{50}$ is the semi-saturation contrast, i.e. the contrast at which the response is half the saturation value. The parameter $q$, which determines the steepness of the curve, is remarkably stable for V1 neurons, where it takes values around 2 [20]. The fit between the contrast response of a bioSIFT unit and the Naka-Rushton function was determined, using the procedure of [11], and is shown in Figure 3 c). As in biology, the Naka-Rushton model fits the bioSIFT data quite well. Over multiple trials, the $q$ parameter for the best fitting curve is stable and stays in the interval (1.7, 2.1).

3.4 Inhibitory effects

It is well known that V1 neurons have a characteristic inhibitory behavior, known as cross-orientation suppression (COS) [12, 7, 21]. This suppression is observed by measuring the response of a neuron, tuned to an orientation $\\theta$, to a sinusoidal grating of orthogonal orientation ($\\theta \\pm 90^\\circ$). When presented by itself, the grating barely evokes a response from the neuron. However, if superimposed with a grating of another orientation, it significantly reduces the response of the neuron to the latter.
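The contrast response of (14) is straightforward to evaluate numerically. A minimal sketch (the parameter values below are hypothetical); fitting $R_{max}$, $c_{50}$ and $q$ to measured unit responses could then be done with a generic least-squares fitter such as `scipy.optimize.curve_fit`:

```python
import numpy as np

def naka_rushton(c, r_max, c50, q=2.0):
    # Mean response (14) to a grating of contrast c:
    # R = Rmax * c^q / (c50^q + c^q)
    cq = np.asarray(c, dtype=float) ** q
    return r_max * cq / (c50 ** q + cq)
```

By construction, the response at $c = c_{50}$ is exactly half of $R_{max}$, and the curve saturates toward $R_{max}$ as the contrast grows.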
To test if the bioSIFT features exhibit COS, we repeated the set of experiments reported in [12]. These consist of measuring a simple cell response to a set of sinusoidal plaids obtained by summing 1) a test grating oriented along the cell's preferred orientation, and 2) a mask grating of orthogonal orientation. The test and the mask have the same frequency as the cell's Gabor filter. The cell response is recorded as a function of the contrast of the gratings. Figure 3 a) shows the results reported in [12], for a real neuron. The stimuli are shown on the left and the neuron's response on the right. Note the suppression of the latter when the mask contrast increases. The response of the bioSIFT simple cell, shown in Figure 3 b), is identical to that of the neuron.

From a functional point of view, the great advantage of COS is the resulting increase in selectivity of the orientation channels. This is illustrated in Figure 3 (e). The figure shows the results of an experiment that measured the response of 12 Gabor filters of orientation in $[0^\\circ, 180^\\circ]$ to a horizontal grating. While both the first and twelfth Gabor filters have relatively large responses to this stimulus, the twelfth channel of bioSIFT is strongly suppressed. When combined with the contrast invariance of Figure 1, this leads to a representation with strong orientation discrimination and robustness to lighting variations.
An example of this is shown in Figure 3 (f), which shows the value of the dominance measure for the most dominant orientation at each image location (in "split screen" with the original image). Note how the bioSIFT features capture information about dominant orientation and object shape, suppressing uninformative or noisy pixels.

3.5 Independence and sparseness

Barlow [18] argued that the goal of sensory systems is to reduce redundancy, so as to produce statistically independent responses. A known property of the responses of bandpass features to natural images is a consistent pattern of higher order dependence, characterized by bow-tie shaped conditional distributions between feature pairs. This pattern is depicted in Figure 3 g), which shows the histogram of responses of a Gabor feature, conditioned on the response of the co-located feature of an adjacent orientation channel. Simoncelli [22] showed that divisively normalizing linear filter responses reduces these higher-order dependencies, making the features independent. As can be seen from (10), (12), and (13), the bioSIFT network divisively normalizes each Gabor response by the sum, across the spatial neighborhood $R$, of responses from each of the Gabor orientations (11). It is thus not surprising that, as shown in Figure 3 h), the conditional histograms of bioSIFT features are zero outside a small horizontal band around the horizontal axis. This implies that they are independent (knowledge of the value of one feature does not modify the distribution of responses of the other). This is a consistent observation across bioSIFT feature pairs.

Another important, and extensively researched, property of V1 responses is their sparseness. Channel sparseness is closely related to independence across channels. Sparse representations have several important advantages, such as increased generalization ability and energy efficiency of neural decision-making circuits.
Given the discussion above, it is not surprising that the contrast normalization inherent to the bioSIFT representation also makes it more sparse. This is shown in Figure 3 d), which compares the sparseness of the responses of both a Gabor filter and a bioSIFT unit to a natural image. It is worth noting that these properties have not been exploited in the SIFT literature itself. For example, independence could lead to more efficient implementations of SIFT-based recognizers than the standard visual words approach, which requires an expensive quantization of SIFT features with respect to a large codebook. We leave this as a topic for future research.

4 Experimental Evaluation

In this section, we report on experiments designed to evaluate the benefits, for recognition, of the connections between SIFT and the standard neurophysiological model.

4.1 Biologically inspired object recognition

Biologically motivated networks for object recognition have recently been the subject of substantial research [13, 23, 14, 15]. To evaluate the impact of adding bioSIFT features to these networks, we considered the HMAX network of [13], which mimics the structure of the visual cortex as a cascade of alternating simple and complex cell layers. The first layer encodes the input image as a set of complex cell responses, and the second layer measures the distance between these responses and a set of learned prototypes. The vector of these distances is then classified with a linear SVM.

Caltech-101 (30 training images/category):
Base HMAX [13]: 42
+ enhancements [23]: 56
Pinto et al. [14]: 65
Jarrett et al. [15]: 65.5
Lazebnik et al. [16]: 64.6 ± 0.8
Zhang et al. [24]: 66.2 ± 0.5
NBNN [25]: 70.4
Yang et al. [26]: 73.2 ± 0.5
base bioSIFT HMAX: 54.5
+ enhancements: 69.3 ± 0.3

Scene Classification Database (performance):
Fei-Fei et al. [27]: 65.2
Lazebnik et al. [16]: 81.4 ± 0.5
Yang et al. [26]: 80.3 ± 0.9
Kernel Codebooks [28]: 76.7 ± 0.4
HMAX with bioSIFT: 80.1 ± 0.6

Figure 4: Classification results on Caltech-101 (left) and the Scene Classification Database (right).

For this evaluation, each unit of the first layer was replaced by a bioSIFT unit, implemented as in Figure 2. The experimental setup is similar to that of [23]: multi-class classification on Caltech101 (with the size of the images reduced so that their height is 140) using 30 images/object for training and at most 50 for testing. The baseline accuracy of [13] was 42%. The work of [23] introduced several enhancements that were shown to considerably improve this baseline. Two of these enhancements, sparsification and inhibition, were along the lines of the contributions discussed in this work. Others, such as limiting receptive fields to restrict invariance, and discriminant selection of prototypes, could also be combined with bioSIFT. The base performance of the network with bioSIFT (54.5%) is superior to that of all comparable extensions of [23] (49%). This can be attributed to the fact that those extensions are mostly heuristic, while those now proposed have a more sound decision-theoretical basis. In fact, the simple addition of bioSIFT features to the HMAX network outperforms all extensions of [23] up to the prototype selection stage (54%). When bioSIFT is complemented with limited C2 invariance and prototype selection the performance improves to 69%, which is better than all results from [23]. In fact, the HMAX network with bioSIFT outperforms the state-of-the-art^1 performance (65.5%) for biologically inspired networks [15]. This improvement is interesting, given that these networks also implement most of the operations of the bioSIFT unit (filtering, normalization, pooling, saturation, etc.).
The main difference is that this is done without a clear functional justification, optimality criteria, or statistical interpretation. As a result, the sequence of operations is not the same, there is no guarantee that normalization provides optimal estimates of orientation dominance, or even that it corresponds to optimal statistical learning, as in (8).

4.2 Natural scene classification

When compared to the state of the art from the computer vision literature, the HMAX+bioSIFT network does not fare as well. Most notably, it has worse performance than the method of Yang et al. [26], which holds the current best results for this dataset (single descriptor methods). This is explained by two main reasons. The first is that the networks are not equivalent. Yang et al. rely on a sparse coding representation in layer 2, which is likely to be more effective than the simple Gaussian units of HMAX. This problem could be eliminated by combining bioSIFT with the same sparse representation, something that we have not attempted. A second reason is that bioSIFT is not exactly optimal for Caltech, because this dataset contains various classes with many non-natural images. To avoid this problem, we have also evaluated the bioSIFT features on the scene classification task of [16]. Using the same HMAX setup, a simple linear classifier and 3000 layer 2 units, the network achieves a classification performance of 80.1% (see Figure 4). This is a substantial improvement, since these results are nearly identical to those of Yang et al. [26], and better than many of those of other methods from the computer vision literature. Overall, these results suggest that orientation dominance is an important property for visual recognition.
In particular, the improved performance of the bioSIFT units cannot be explained by the importance of contrast normalization alone: contrast is not a major nuisance for the datasets considered, normalization is also implemented by the other networks, bioSIFT is not optimized to normalize contrast, and it is unlikely that contrast variations would be more of an issue on Caltech than on the natural scene dataset.

¹ [14] reports 65%, but for a network with a much larger number of units (SVM dimension) than is used by all other networks. Our implementation of their network with comparable parameters achieved only 42%.

References
[1] W. T. Freeman and M. Roth, "Orientation histograms for hand gesture recognition," in IEEE Intl. Workshop on Automatic Face and Gesture Recognition, 1995.
[2] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," IJCV, vol. 60(2), pp. 91–110, 2004.
[3] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Proc. IEEE Conf. CVPR, 2005.
[4] D. H. Hubel and T. N. Wiesel, "Receptive fields, binocular interaction, and functional architecture in the cat's visual cortex," Journal of Physiology, vol. 160, 1962.
[5] R. Shapley and J. D. Victor, "The contrast gain control of the cat retina," Vision Research, vol. 19, pp. 431–434, 1979.
[6] S. E. Palmer, Vision Science: Photons to Phenomenology. The MIT Press, 1999.
[7] D. Heeger, "Normalization of cell responses in cat striate cortex," Visual Neuroscience, vol. 9, 1992.
[8] M. Carandini, D. J. Heeger, and J. A. Movshon, "Linearity and normalization in simple cells of macaque primary visual cortex," Journal of Neuroscience, vol. 17, pp. 8621–8644, 1997.
[9] M. Carandini, J. B. Demb, V. Mante, D. J. Tolhurst, Y. Dan, B. A. Olshausen, J. L. Gallant, and N. C. Rust, "Do we know what the early visual system does?," Journal of Neuroscience, vol. 25, 2005.
[10] D. Gao and N. Vasconcelos, "Decision-theoretic saliency: computational principles, biological plausibility, and implications for neurophysiology and psychophysics," Neural Computation, vol. 21, 2009.
[11] M. Chirimuuta and D. J. Tolhurst, "Does a Bayesian model of V1 contrast coding offer a neurophysiological account of contrast discrimination?," Vision Research, vol. 45, pp. 2943–2959, 2005.
[12] M. Carandini, Receptive Fields and Suppressive Fields in the Early Visual System. MIT Press, 2004.
[13] T. Serre, L. Wolf, and T. Poggio, "Object recognition with features inspired by visual cortex," in Proc. IEEE Conf. CVPR, 2005.
[14] N. Pinto, D. Cox, and J. DiCarlo, "Why is real-world visual object recognition hard?," PLoS Computational Biology, 2008.
[15] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun, "What is the best multi-stage architecture for object recognition?," in Proc. IEEE International Conference on Computer Vision, 2009.
[16] S. Lazebnik, C. Schmid, and J. Ponce, "Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories," in Proc. IEEE Conf. CVPR, 2006.
[17] F. Attneave, "Informational aspects of visual perception," Psychological Review, vol. 61, pp. 183–193, 1954.
[18] H. B. Barlow, "Redundancy reduction revisited," Network: Computation in Neural Systems, vol. 12, 2001.
[19] D. C. Knill and W. Richards, Perception as Bayesian Inference. Cambridge University Press, 1996.
[20] D. G. Albrecht and D. B. Hamilton, "Striate cortex of monkey and cat: contrast response function," Journal of Neurophysiology, vol. 48, pp. 217–237, 1982.
[21] M. C. Morrone, D. C. Burr, and L. Maffei, "Functional implications of cross-orientation inhibition of cortical visual cells I. Neurophysiological evidence," Proc. Royal Society London B, vol. 216, pp. 335–354, 1982.
[22] M. J. Wainwright, O. Schwartz, and E. P. Simoncelli, "Natural image statistics and divisive normalization: Modeling nonlinearities and adaptation in cortical neurons," in Probabilistic Models of the Brain: Perception and Neural Function, pp. 203–222, MIT Press, 2002.
[23] J. Mutch and D. Lowe, "Object class recognition and localization using sparse features with limited receptive fields," IJCV, vol. 80, pp. 45–57, 2008.
[24] H. Zhang, A. Berg, M. Maire, and J. Malik, "SVM-KNN: Discriminative nearest neighbor classification for visual category recognition," in Proc. IEEE Conf. CVPR, 2006.
[25] O. Boiman, E. Shechtman, and M. Irani, "In defense of nearest-neighbor based image classification," in Proc. IEEE Conf. CVPR, 2008.
[26] J. Yang, K. Yu, Y. Gong, and T. Huang, "Linear spatial pyramid matching using sparse coding for image classification," in Proc. IEEE Conf. CVPR, 2009.
[27] L. Fei-Fei and P. Perona, "A Bayesian hierarchical model for learning natural scene categories," in Proc. IEEE Conf. CVPR, 2005.
[28] J. C. van Gemert, J. M. Geusebroek, C. J. Veenman, and A. W. M. Smeulders, "Kernel codebooks for scene categorisation," in Proc. ECCV, 2008.