{"title": "Statistical Models of Linear and Nonlinear Contextual Interactions in Early Visual Processing", "book": "Advances in Neural Information Processing Systems", "page_first": 369, "page_last": 377, "abstract": "A central hypothesis about early visual processing is that it represents inputs in a coordinate system matched to the statistics of natural scenes. Simple versions of this lead to Gabor-like receptive fields and divisive gain modulation from local surrounds; these have led to influential neural and psychological models of visual processing. However, these accounts are based on an incomplete view of the visual context surrounding each point. Here, we consider an approximate model of linear and non-linear correlations between the responses of spatially distributed Gabor-like receptive fields, which, when trained on an ensemble of natural scenes, unifies a range of spatial context effects. The full model accounts for neural surround data in primary visual cortex (V1), provides a statistical foundation for perceptual phenomena associated with Lis (2002) hypothesis that V1 builds a saliency map, and fits data on the tilt illusion.", "full_text": "Statistical Models of Linear and Non\u2013linear\n\nContextual Interactions in Early Visual Processing\n\nRuben Coen\u2013Cagli\n\nAECOM\n\nBronx, NY 10461\n\nrcagli@aecom.yu.edu\n\nPeter Dayan\nGCNU, UCL\n\n17 Queen Square, LONDON\n\ndayan@gatsby.ucl.ac.uk\n\nOdelia Schwartz\n\nAECOM\n\nBronx, NY 10461\n\noschwart@aecom.yu.edu\n\nAbstract\n\nA central hypothesis about early visual processing is that it represents inputs in a\ncoordinate system matched to the statistics of natural scenes. Simple versions of\nthis lead to Gabor\u2013like receptive \ufb01elds and divisive gain modulation from local\nsurrounds; these have led to in\ufb02uential neural and psychological models of visual\nprocessing. However, these accounts are based on an incomplete view of the visual\ncontext surrounding each point. 
Here, we consider an approximate model of linear\nand non\u2013linear correlations between the responses of spatially distributed Gabor-\nlike receptive \ufb01elds, which, when trained on an ensemble of natural scenes, uni\ufb01es\na range of spatial context effects. The full model accounts for neural surround\ndata in primary visual cortex (V1), provides a statistical foundation for perceptual\nphenomena associated with Li\u2019s (2002) hypothesis that V1 builds a saliency map,\nand \ufb01ts data on the tilt illusion.\n\n1\n\nIntroduction\n\nThat visual input at a given point is greatly in\ufb02uenced by its spatial context is manifest in a host\nof neural and perceptual effects (see, e.g., [1, 2]). For instance, stimuli surrounding the so-called\nclassical receptive \ufb01eld (RF) lead to striking nonlinearities in the responses of visual neurons [3, 4];\nspatial context results in intriguing perceptual illusions, such as the misjudgment of a center stimulus\nattribute in the presence of a surrounding stimulus [5\u20137]; it also plays a critical role in determining\nthe salience of points in visual space, for instance controlling pop-out, contour integration, texture\nsegmentation [8\u201310] and more generally locations where statistical homogeneity of the input breaks\ndown [1]. Contextual effects are widespread across sensory systems, neural areas, and stimulus\nattributes \u2014 making them an attractive target for computational modeling.\nThere are various mechanistic treatments of extra-classical RF effects (e.g.,[11\u201313]) and contour\nintegration [14], and V1\u2019s suggested role in computing salience has been realized in a large-scale\ndynamical model [1, 15]. There are also normative approaches to salience (e.g., [16\u201319]) with links\nto V1. However, these have not substantially encompassed neurophysiological data or indeed made\nconnections with the perceptual literature on contour integration and the tilt illusion. 
Our aim is to build a principled model based on scene statistics that can ultimately account for, and therefore unify, the whole set of contextual effects above.

Much seminal work has been done in the last two decades on learning linear filters from first principles, based on the statistics of natural images (see e.g. [20]). However, contextual effects emerge from the interactions among multiple filters; therefore here we address the much less well studied issue of the learned, statistical basis of the coordination of the group of filters: the scene-dependent, linear and non-linear interactions among them. We focus on recent advances in models of scene statistics, using a Gaussian Scale Mixture generative model (GSM; [21-23]) that captures the joint dependencies (e.g., [24-31]) between the activations of Gabor-like filters to natural scenes. The GSM captures the dependencies via two components: (i) covariance in underlying Gaussian variables, which accounts for linear correlations in the activations of filters; and (ii) a shared mixer variable, which accounts for the non-linear correlations in the magnitudes of the filter activations.

As yet, the GSM has not been applied to the wide range of contextual phenomena discussed above. This is partly because linear correlations, which appear important to capture phenomena such as contour integration, have largely been ignored outside image processing (e.g., [23]). In addition, although the mixer variable of the GSM is closely related to bottom-up models of divisive normalization in cortex [32, 33], the assignment problem of grouping filters that share a common mixer for a given scene has yet to receive a computationally and neurobiologically realistic solution. Recent work has shown that incorporating a simple, predetermined solution to the assignment problem in a GSM could capture the tilt illusion [34]. 
Nevertheless, the approach has not been studied in a more realistic model with Gabor-like filters and with assignments learned from natural scenes. Further, the implications of assignment for cortical V1 data and salience have not been explored.

In this paper we extend the GSM model to learn both assignments and linear covariance (section 2). We then apply the model to contextual neural V1 data, noting its link to the tilt illusion (section 3); and then to perceptual salience examples (section 4). In the discussion (section 5), we also describe the relationship between our GSM model and other recent scene statistics approaches (e.g., [31, 35]).

2 Methods

A recent focus in natural image statistics has been the joint conditional histograms of the activations of pairs of oriented linear filters (throughout the paper, filters come from the first level of a steerable pyramid with 4 orientations [36]). When filter pairs are proximal in space, these histograms have a characteristic bowtie shape: the variance of one filter depends on the magnitude of activation of the other. It has been shown [22] that this form of dependency can be captured by a class of generative model known as the Gaussian Scale Mixture (GSM), which assumes that the linear filter activations x = vg are random variables defined as the product of two other random variables: a multivariate Gaussian g, and a (positive) scalar v which scales the variance of all the Gaussian components. Here, we address two additional properties of natural scenes. First, in addition to the variance dependency, filters which are close enough in space and feature space are linearly dependent, as shown by the tilt of the bowtie in fig. 1b. In order for the GSM to capture this effect, the multivariate Gaussian must be endowed with a non-diagonal covariance matrix. 
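Both dependencies, and the role of the non-diagonal covariance, can be illustrated by sampling from such a GSM. This is a minimal sketch rather than the trained model: it assumes a Rayleigh-distributed mixer (as adopted for the full model below) and an arbitrary covariance of 0.6 between two filters:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_gsm(n_samples, cov, scale=1.0):
    """Draw activations x = v * g from a Gaussian Scale Mixture:
    g ~ N(0, cov) (a non-diagonal cov gives linear correlations),
    v ~ Rayleigh(scale), one shared mixer per sample (gives the
    nonlinear variance dependency behind the bowtie histograms)."""
    d = cov.shape[0]
    g = rng.multivariate_normal(np.zeros(d), cov, size=n_samples)
    v = rng.rayleigh(scale=scale, size=(n_samples, 1))
    return v * g

# Two filters whose Gaussian components are correlated (tilted bowtie).
cov = np.array([[1.0, 0.6],
                [0.6, 1.0]])
x = sample_gsm(100_000, cov)

# Linear dependency: the raw activations are correlated.
lin_corr = np.corrcoef(x[:, 0], x[:, 1])[0, 1]

# Variance dependency: the variance of x2 is larger where |x1| is large,
# because both filters share the same mixer v.
big = np.abs(x[:, 0]) > np.quantile(np.abs(x[:, 0]), 0.9)
var_ratio = x[big, 1].var() / x[~big, 1].var()
print("linear correlation:", lin_corr, "variance ratio:", var_ratio)
```

With a diagonal covariance the correlation vanishes but the variance dependency remains: the shared mixer alone produces the bowtie, while the covariance tilts it.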
This matrix can be approximated by the sample covariance matrix of the filter activations or learned directly [23]; here, we learn it by maximizing the likelihood of the observed data. The second issue is that filter dependencies differ across image patches, implying that there is no fixed relationship between mixers and filters [28]. The general issue of learning multiple pools of filters, each assigned to a different mixer on a patch-dependent basis, has been addressed in recent work [30], but using a computationally and biologically impracticable scheme [37] which allowed for arbitrary pooling.

We consider an approximation to the assignment problem, by allowing a group of surround filters either to share or not to share the same mixer with a target filter. While this is clearly an oversimplified model of natural images, here we aimed for a reasonable balance between the complexity of the model and the biological plausibility of the computations involved.

2.1 The generative model

The basic repeating unit of our simplified model involves center and surround groups of filters: we use $n_c$ to denote the number of center filters, and $x_c$ their activations; similarly, we use $n_s$ and $x_s$ for the surround; finally, we define $n_{cs} := n_c + n_s$ and $x := (x_c^1, \ldots, x_c^{n_c}, x_s^1, \ldots, x_s^{n_s})^\top$. We consider a single assignment choice as to whether the center group's mixer variable $v_c$ is (case $\xi_1$), or is not (case $\xi_2$), shared with the surround, which in the latter case would have its own mixer variable $v_s$. Thus, there are 2 configurations, or competing models, which are themselves combined (i.e., a mixture of GSMs; see also [35]). The graphical models of the two configurations are shown in Fig. 1a. We show this from the perspective of the center group, since in the implementation we will be reporting model neuron responses in the center location given the contextual surround.

Defining the Gaussian components as $g := (g_c^1, \ldots, g_c^{n_c}, g_s^1, \ldots, g_s^{n_s})^\top$, and assuming the mixers are independent and the pools are independent given the mixers, the mixture distribution is:

$$p(x) = p(\xi_1)\,p(x \mid \xi_1) + p(\xi_2)\,p(x \mid \xi_2) \qquad (1)$$
$$p(x \mid \xi_1) = \int dv_c\, p(v_c)\, p(x \mid v_c, \xi_1) \qquad (2)$$
$$p(x \mid \xi_2) = \int dv_c\, p(v_c)\, p(x_c \mid v_c, \xi_2) \int dv_s\, p(v_s)\, p(x_s \mid v_s, \xi_2) \qquad (3)$$

We assume a Rayleigh prior distribution on the mixers, and covariance matrix $\Sigma_{cs}$ for the Gaussian components for $\xi_1$, and $\Sigma_c$ and $\Sigma_s$ for center and surround, respectively, for $\xi_2$. The integrals in eqs. (2,3) can then be solved analytically:

$$p(x \mid \xi_1) = \frac{\det(\Sigma_{cs}^{-1})^{1/2}}{(2\pi)^{n_{cs}/2}}\; \frac{\mathcal{B}(1 - n_{cs}/2;\, \lambda_{cs})}{\lambda_{cs}^{\,n_{cs}/2 - 1}} \qquad (4)$$
$$p(x \mid \xi_2) = \frac{\det(\Sigma_{c}^{-1})^{1/2}}{(2\pi)^{n_c/2}}\; \frac{\mathcal{B}(1 - n_c/2;\, \lambda_c)}{\lambda_c^{\,n_c/2 - 1}}\; \frac{\det(\Sigma_{s}^{-1})^{1/2}}{(2\pi)^{n_s/2}}\; \frac{\mathcal{B}(1 - n_s/2;\, \lambda_s)}{\lambda_s^{\,n_s/2 - 1}} \qquad (5)$$

where $\mathcal{B}$ is the modified Bessel function of the second kind, and $\lambda_{cs} := \sqrt{x^\top \Sigma_{cs}^{-1} x}$ (with $\lambda_c$ and $\lambda_s$ defined analogously on $x_c$ and $x_s$).

2.2 Learning

The parameters to be estimated are the covariance matrices ($\Sigma_{cs}$, $\Sigma_c$, $\Sigma_s$) and the prior probability ($k$) that center and surround share the same pool; we use a Generalized Expectation Maximization algorithm, specifically Multi Cycle EM [38], where a full EM cycle is divided into three subcycles, each involving a full E-step and a partial M-step performed only on one covariance matrix.

E-step: In the E-step we compute an estimate, $Q$, of the posterior distribution over the assignment variable, given the filter activations and the previous estimates of the parameters, namely $k^{old}$ and $\Theta^{old} := \{\Sigma_c, \Sigma_s, \Sigma_{cs}\}$. This is obtained via Bayes rule:

$$Q(\xi_1) = p(\xi_1 \mid x, \Theta^{old}) \propto k^{old}\, p(x \mid \xi_1, \Theta^{old}) \qquad (6)$$
$$Q(\xi_2) = p(\xi_2 \mid x, \Theta^{old}) \propto (1 - k^{old})\, p(x \mid \xi_2, \Theta^{old}) \qquad (7)$$

M-step: In the M-step we increase the complete-data log likelihood, namely:

$$f = Q(\xi_1) \log\left[k\, p(x \mid \xi_1, \Theta)\right] + Q(\xi_2) \log\left[(1 - k)\, p(x \mid \xi_2, \Theta)\right] \qquad (8)$$

Solving $\partial f / \partial k = 0$, we obtain $k^\star := \arg\max_k [f] = Q(\xi_1)$. The other terms cannot be solved analytically, and a numerical procedure must be adopted to maximize $f$ w.r.t. the covariance matrices. This requires an explicit form for the gradient:

$$\frac{\partial f}{\partial \Sigma_{cs}^{-1}} = Q(\xi_1) \left( \frac{\Sigma_{cs}}{2} - \frac{1}{2 \lambda_{cs}}\, \frac{\mathcal{B}(-n_{cs}/2;\, \lambda_{cs})}{\mathcal{B}(1 - n_{cs}/2;\, \lambda_{cs})}\, x x^\top \right) \qquad (9)$$

Similar expressions hold for the other partial derivatives. In practice, we add the constraint that the covariances of the surround filters are spatially symmetric.

2.3 Inference: patch-by-patch assignment and model neural unit

Upon convergence of EM, the covariance matrices and the prior $k$ over the assignment are found. Then, for a new image patch, the probability $p(\xi_1 \mid x)$ that the surround shares a common mixer with the center is inferred. The output of the center group is taken to be the estimate (for the present, we consider just the mean) of the Gaussian component $E[g_c \mid x]$, which we take to be our model neural unit response. 
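Because the likelihoods (4) and (5) are closed-form, the posterior over the assignment, used in the E-step (eqs. (6), (7)) and again at inference time as p(ξ1 | x), can be computed directly. A minimal sketch, assuming a unit-scale Rayleigh prior and omitting numerical safeguards (SciPy's `kv` is the modified Bessel function of the second kind):

```python
import numpy as np
from scipy.special import kv  # modified Bessel function of the second kind

def gsm_likelihood(x, cov_inv):
    """p(x) for one pool sharing a single Rayleigh mixer, eqs. (4)-(5):
    det(cov_inv)^(1/2) / (2*pi)^(n/2) * B(1 - n/2; lam) / lam^(n/2 - 1),
    with lam = sqrt(x' cov_inv x)."""
    n = x.size
    lam = np.sqrt(x @ cov_inv @ x)
    return (np.sqrt(np.linalg.det(cov_inv)) / (2 * np.pi) ** (n / 2)
            * kv(1 - n / 2, lam) / lam ** (n / 2 - 1))

def posterior_assignment(xc, xs, cov_cs_inv, cov_c_inv, cov_s_inv, k):
    """Q(xi_1) = p(center and surround share a mixer | x), eqs. (6)-(7)."""
    x = np.concatenate([xc, xs])
    p1 = k * gsm_likelihood(x, cov_cs_inv)                        # shared mixer
    p2 = (1 - k) * gsm_likelihood(xc, cov_c_inv) * gsm_likelihood(xs, cov_s_inv)
    return p1 / (p1 + p2)

# Hypothetical example: one center and one surround filter, identity
# covariances, prior k = 0.5.
q1 = posterior_assignment(np.array([1.0]), np.array([0.9]),
                          np.eye(2), np.eye(1), np.eye(1), 0.5)
print("p(xi_1 | x) =", q1)
```

Since eqs. (6)-(7) only require the two likelihoods up to a shared normalizer, any constants common to both configurations cancel in the ratio.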
To estimate the normalized response of the center filter, we need to compute the following expected value under the full model:

$$E[g_c \mid x] = \int dg_c\, g_c\, p(g_c \mid x) = p(\xi_1 \mid x)\, E[g_c \mid x, \xi_1] + p(\xi_2 \mid x)\, E[g_c \mid x_c, \xi_2] \qquad (10)$$

The r.h.s., obtained by a straightforward calculation applying Bayes rule and the conditional independence of $x_s$ from $x_c, g_c$ given $\xi_2$, is the sum of the expected values of $g_c$ in the two configurations, weighted by their posterior probabilities.

Figure 1: (a) Graphical model for the two components of the mixture of GSMs, where the center filter is ($\xi_1$; left) or is not ($\xi_2$; right) normalized by the surround filters; (b) joint conditional histogram of the activations of two linear filters, showing the typical bowtie shape due to the variance dependency, as well as a tilt due to linear dependencies between the two filters; (c) marginal distribution of linear activations in black, estimated Gaussian component in blue, and ideal Gaussian in red. The estimated distribution is closer to a Gaussian than that of the original filter.

The explicit form for the estimate of the i-th component (corresponding in the implementation to a given orientation and phase) of $g_c$ under $\xi_1$ is:

$$E[g_c^i \mid x, \xi_1] = \mathrm{sign}(x_c^i)\, \sqrt{|x_c^i|}\, \sqrt{\frac{|x_c^i|}{\lambda_{cs}}}\; \frac{\mathcal{B}(1/2 - n_{cs}/2;\, \lambda_{cs})}{\mathcal{B}(1 - n_{cs}/2;\, \lambda_{cs})} \qquad (11)$$

and a similar expression holds under $\xi_2$, replacing the subscript cs by c. Note that in either configuration, the mixer variable's effect is a form of divisive normalization or gain control through $\lambda$ (including, for stability, as in [30], an additive constant set to 1 for the $\lambda$ values; we omit the formulae to save space). Under $\xi_1$, but not $\xi_2$, this division is influenced by the surround¹. 
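In code, eq. (11) and the mixture (10) read as follows. This is a sketch assuming a unit-scale Rayleigh prior; the additive constant of 1 on the λ values follows the stability remark in the text, and `q1` = p(ξ1 | x) is the inferred assignment probability:

```python
import numpy as np
from scipy.special import kv  # modified Bessel function of the second kind

def lam(x, cov_inv, eps=1.0):
    # Pooled divisive signal; the additive constant (1, as in the text)
    # keeps the estimate stable for small activations.
    return np.sqrt(x @ cov_inv @ x) + eps

def gc_conditional(xc_i, pooled, cov_inv):
    """E[g_c^i | x, xi] (eq. 11): the center activation divisively
    normalized by lam, with a Bessel-function correction."""
    n = pooled.size
    l = lam(pooled, cov_inv)
    ratio = kv(0.5 - n / 2, l) / kv(1 - n / 2, l)
    return np.sign(xc_i) * np.sqrt(abs(xc_i)) * np.sqrt(abs(xc_i) / l) * ratio

def model_response(xc_i, xc, xs, cov_cs_inv, cov_c_inv, q1):
    """Eq. (10): posterior-weighted mixture of the two configurations.
    Under xi_1 the surround joins the normalization pool; under xi_2
    only the center filters normalize."""
    x = np.concatenate([xc, xs])
    return (q1 * gc_conditional(xc_i, x, cov_cs_inv)
            + (1 - q1) * gc_conditional(xc_i, xc, cov_c_inv))

# Hypothetical numbers: a strong surround suppresses the center estimate
# when it is assigned to the pool (q1 = 1) relative to when it is not.
r_shared = model_response(1.0, np.array([1.0]), np.array([2.0, 2.0]),
                          np.eye(3), np.eye(1), q1=1.0)
r_alone = model_response(1.0, np.array([1.0]), np.array([2.0, 2.0]),
                         np.eye(3), np.eye(1), q1=0.0)
print(r_shared, r_alone)
```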
Note also that, due to the presence of the inverse covariance matrix in $\lambda_{cs} = \sqrt{x^\top \Sigma_{cs}^{-1} x}$, the gain control signal is reduced when there is strong covariance, which in turn enhances the neural unit response.

3 Cortical neurophysiology simulations

To simulate neurophysiological experiments, we consider the following filter configuration: 3 × 3 spatial positions separated by 6 pixels, 2 phases (quadrature pair), and one orientation (vertical), plus 3 additional orientations in the central position to allow for cross-orientation gain control. We first learn the parameters of the model for 25000 patches from an ensemble of 5 standard scenes (Einstein, Goldhill, and so on). We take as our model neuron the absolute value of the complex activation composed by the non-linear responses (eq. (11)) of the two phases of the central vertical filter.

We characterize the neuron's basic properties with a procedure that is common in physiology experiments focusing on contextual non-linear modulation. First, we measure the so-called area summation curve, namely the response to gratings that are optimal in orientation and spatial frequency, as a function of size. Cat and monkey experiments have shown striking non-linearities, with the peak response at low contrasts occurring at significantly larger diameters than at high contrasts (Figure 2a). We obtain the same behavior in the model (Figure 2b; see also [33]). This behavior is due to the assignment: for small grating sizes, center and surround have a higher posterior probability of sharing a mixer at high contrast than at low contrast, and therefore the surround exerts stronger gain control. In a reduced model with no assignment, we obtain a much weaker effect (Figure 2c).

We then assess the modulatory effect of a surround grating on a fixed, optimally-oriented central grating, as a function of their relative orientations (Figure 3a). 
As is common, we determine the spatial extent of the center and surround stimuli based on the area summation curves (see [4]). The model simulations (Figure 3b), as in the data, exhibit the most reduced responses when the center and surround have similar orientation (but note the "blip" when they are exactly equal in Figures 3a and 3b, which arises in the model from the covariance of the Gaussian; see also [31]). In addition, as the orientation difference between center and surround grows, the response increases and then decreases, an effect that arises from the assignments.

¹ To ensure that the mixer follows the same distribution under $\xi_1$ and $\xi_2$, after training with natural images we rescale $v_c$ (and therefore $g_c$) so that their values span the same range in both configurations; since the assignment is made at a higher level in the hierarchy, such a rescaling is equivalent to downstream normalization processes that make the estimates of $g_c$ comparable under $\xi_1$ and $\xi_2$.

Figure 2: Area summation curves show the normalized firing rate of a neuron in response to optimal gratings of increasing size. (a) a V1 neuron, after [4]; (b) the model neuron described in Sec. 3; (c) a reduced model assuming that the surround filters are always in the gain pool of the center filter.

Figure 3: Orientation tuning of the surround. (a) and (b): normalized firing rate in response to a stimulus composed of an optimal central grating surrounded by an annular grating of varying orientation, for (a) a V1 neuron, after [3]; and (b) the model neuron described in Sec. 3. (c) Probability that the surround normalizes the center as a function of the relative orientation of the annular grating.

In the model simulations, we find that the strength of this behavior depends on contrast, being larger at low contrasts, an effect of which there are hints in the neurophysiology experimental data (using contrasts of 0.2 to 0.4) but which has yet to be systematically explored. Figure 3c shows the posterior assignment probability for the same two contrasts as in figure 3b, as a function of the surround orientation. These remain close to 1 at all orientations at high contrast, but fall off more rapidly at low contrast.

Note that a previous GSM population model assumed (but did not learn) the form of fall-off of the posterior weights shown in figure 3c, and showed that it is a basis for explaining the so-called direct and indirect biases in the tilt illusion; i.e., repulsion and attraction in the perception of a center stimulus orientation in the presence of a surround stimulus [34]. Figure 4 compares the GSM model of [34], designed with parameters matched to perceptual data, to the result of our learned model. The qualitative shape (although not the quantitative strength) of the effects is similar.

4 Salience popout and contour integration simulations

To address perceptual salience effects, we need a population model of oriented units. We consider one distinct group of filters – arranged as for the model neuron in Sec. 3 – for each of four orientations (0, 45, 90, 135 deg, sampling more coarsely than [1]). We compute the non-linear response of each model neuron as in Sec. 3, and take the maximum across the four orientations as the population output, as in standard population decoding. This is performed at each pixel of the input image, and the result is interpreted as a saliency map.

We first consider the popout of a target that differs from a background of distractors by a single feature (e.g. [8]), in our case orientation. 
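The read-out just described can be sketched end-to-end on a toy popout display. This is a schematic illustration only: the learned filters and assignment machinery are replaced by one-hot orientation channels and a fixed 3 × 3 same-orientation surround pool, which is enough to show why an orientation singleton is less suppressed than its parallel neighbors:

```python
import numpy as np

# Toy display: a 9x9 field of horizontal bars (channel 0) with a single
# vertical target (channel 2) at the center.
H = W = 9
orientation = np.zeros((H, W), dtype=int)
orientation[4, 4] = 2

# One-hot response maps for the four orientation channels (0, 45, 90, 135 deg).
channels = np.stack([(orientation == c).astype(float) for c in range(4)])

def surround_sum(m):
    """Sum of the 8 neighbors (3x3 pool minus the center), zero-padded."""
    p = np.pad(m, 1)
    s = sum(p[1 + di:1 + di + m.shape[0], 1 + dj:1 + dj + m.shape[1]]
            for di in (-1, 0, 1) for dj in (-1, 0, 1))
    return s - m

# Divisive normalization by same-channel activity in the surround,
# then the pixelwise max across channels as the saliency map.
normalized = channels / (1.0 + np.stack([surround_sum(ch) for ch in channels]))
saliency = normalized.max(axis=0)

print("target:", saliency[4, 4], "background:", saliency[2, 2])
```

The target keeps its full response (its neighbors drive a different channel), while each background bar is divided by the activity of eight parallel neighbors; the singleton therefore pops out, in the spirit of the iso-orientation suppression of [1].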
Input image and output saliency map (the brighter, the more salient) are shown in Fig. 5. As in [1], the target pops out since it is less suppressed by its own, orthogonally-oriented neighbors than the surround bars are by their parallel ones; here, this emerges straight from normative inference. [8] quantified relative target saliency above detection threshold as a function of the difference in orientation between target and distractors, using luminance (Fig. 5b).

Figure 4: The tilt illusion. (a) Comparison of the learned GSM model (black, solid line with filled squares) with the GSM model in [34] (blue, solid line; parameters set to account for the illusion data of [39]), and with the model in [34] with parameters modified to match the learned model (blue, dashed line). The response of each neuron in the population is plotted as a function of the difference between the surround stimulus orientation and the preferred center stimulus orientation. We assume all oriented neurons have identical properties to the learned vertical neuron (i.e., ignoring the oblique effect). The model of [34] includes idealized tuning curves. The learned model is as in the previous section, but with filters of narrower orientation tuning (because of denser sampling of 16 orientations in the pyramid), which results in an earlier point on the x axis of maximal response. Model simulations are normalized to a maximum of 1. (b) Simulations of the tilt illusion using the model in [34], based on parameters matched to the learned model (dashed line) versus parameters matched to the data of [39] (solid line).

Fig. 5c plots saliency from the model; it exhibits non-linear saturation for large orientation contrast, an effect that not all saliency models capture (see [17] for discussion). The shape of the saturation is different for neural (Fig. 2a-b) versus perceptual (Fig. 
5b-c) data, in both experiment and model; for the latter, this arises from differences in stimuli (gratings versus bars, and how the center and surround extents were determined).

The second class of saliency effects involves collinear facilitation. One example is the so-called border effect, shown in figure 6a: one side of the border, whose individual bars are collinear, is more salient than the other (e.g. [1], but see also [40]). The middle and right plots in figure 6a depict the saliency map for the full model and a reduced model that uses a diagonal covariance matrix. Notice that the reduced model also shows an enhancement of the collinear side of the border versus the parallel one, due to the partial overlap of the linear receptive fields; but, as explained in Sec. 2.3, the higher covariance between collinear filters in the full model strengthens the effect. To quantify the difference, we also report the ratio between the salience values on the collinear and parallel sides of the border, after subtracting the saliency value of the homogeneous regions: the lower value for the reduced model (1.28, versus 1.74 for the full model) shows that the full model enhances the collinear relative to the parallel side. The ratio for the full model increases if we rescale the off-diagonal terms of the covariance matrix relative to the diagonal (2.1 for a scaling factor of 1.5; 2.73 for a factor of 2). Rescaling would come from more reasonably dense spatial sampling. Fig. 6b provides another, stronger example of collinear facilitation.

5 Discussion

We have extended a standard GSM generative model of scene statistics to encompass contextual effects. We modeled the covariance between the Gaussian components associated with neighboring locations, and suggested a simple, approximate process for choosing whether or not to pool such locations under the same mixer. 
Using parameters learned from natural scenes, we showed that this model provides a promising account of neurophysiological data on area summation and center-surround orientation contrast, and of perceptual data on the saliency of image elements. This form of model has previously been applied to the tilt illusion [34], but had just assumed the assignments of figure 3c in order to account for the indirect tilt illusion. Here, this emerged from first principles. This model therefore unifies a wealth of data and ideas about contextual visual processing. To our knowledge, there have been only a few previous attempts of this sort; one notable example is the extensive salience work of [17]; here we go further in terms of simulating neural non-linearities, and making connections with the contour integration and illusion literature: phenomena that have previously been addressed only individually, if at all.

Figure 5: (a) An example of the stimulus and saliency map computed by the model. (b) Perceptual data reproduced after [8], and (c) model output, of the saliency of the central bar as a function of the orientation contrast between center and surround.

Figure 6: (a) Border effect: the collinear side of the border is more salient than the parallel one; the center plot is the saliency map for the full model, the right plot for a reduced model with diagonal covariance matrix. (b) Another example of collinear facilitation: the center row of bars is more salient, relative to the background, when the bars are collinear (left) rather than parallel (right). In both (a) and (b), Col/Par is the ratio between the salience values on the collinear and parallel sides of the border, after subtracting the saliency value of the homogeneous regions.

Our model is closely related to a number of suggestions in the literature. 
Previous bottom-up models of divisive normalization, which were the original inspiration for the application by [22] of the GSM, can account for some neural non-linearities by learning divisive weights instead of assignments (e.g., [33]). However, they do not incorporate linear correlations, and they fix the divisive weights a priori rather than on an image-by-image basis as in our model. Non-parametric statistical alternatives to divisive normalization, e.g. non-linear ICA [41], have also been proposed, but have been applied only to the orientation masking nonlinearity, therefore not addressing spatial context. There are also various top-down models based on related principles. Compared with previous GSM modelling [30], we have built a more computationally straightforward, and neurobiologically credible, approximate assignment mechanism. Other recent generative statistical models that capture the statistical dependencies of the filters in slightly different ways (notably [31, 35]) might also be able to encompass the data we have presented here. However, [35] has been applied only to the image processing domain, and the model of [31] has not been tied to the perceptual phenomena we have considered, nor to contrast data. There are also quantitative differences between the models, including issues of soft versus hard assignment (see discussion in [30]); the assumption about the link to data (here we adopted the mean of the Gaussian component of the GSM, which incorporates an explicit gain control, in contrast to the approach in [31]); and the richness of assignment versus approximation in the various models (here we have purposely taken an approximate version of a full assignment model).

There are also many models devoted to saliency. 
We showed that our assignment process, and the\nnormalization that results, is a good match for (and thus a normative justi\ufb01cation of) at least some of\nthe results that [1, 15] captured in a dynamical realization of the V1 saliency hypothesis. However,\nour model achieves suppression in regions of statistical homogeneity divisively rather than subtrac-\ntively. The covariance between the Gaussian components captures some aspects of the long range\nexcitatory effects in that model, which permit contour integration. However, some of the collinear\nfacilitation arises just from receptive \ufb01eld overlap; and the structure of the covariance in natural\nscenes seems rather impoverished compared with that implied by the association \ufb01eld [42], and\nmerits further examination with higher order statistics (see also [10, 26]). Note also that dynamical\nmodels have not previously been applied to the same range of data (such as the tilt illusion).\nOpen theoretical issues include quantifying carefully the effect of the rather coarse assignment ap-\nproximation, as well as the differences between the learned model and the idealized population\nmodel of the tilt illusion [34]. Other important issues include characterizing the nature and effect of\nuncertainty in the distributions of g and v rather than just the mean. This is critical to characterize\npsychophysical results on contrast detection in the face of noise and also orientation acuity, and also\nraises the issues aired by [31] as to how neural responses convey uncertainties. Open experimental\nissues include a range of other contextual effects as to salience, contour integration, and even per-\nceptual crowding. Contextual effects are equally present at multiple levels of neural processing. An\nimportant future generalization would be to higher neural areas, and to mid and high level vision\n(which themselves exhibit gain-control related phenomena, see e.g. [43]). 
More generally, context\nis pervasive in time as well as space. The parallels are underexplored, and so pressing.\n\nAcknowledgements\n\nThis work was funded by the Alfred P. Sloan Foundation (OS); and The Gatsby Charitable Founda-\ntion, the BBSRC, the EPSRC and the Wellcome Trust (PD). We are very grateful to Adam Kohn,\nJoshua Solomon, Adam Sanborn, and Li Zhaoping for discussion.\n\nReferences\n\n[1] Z. Li. A saliency map in primary visual cortex. Trends Cogn Sci, 6(1):9\u201316, 2002.\n[2] P. Series, J. Lorenceau, and Y. Fr\u00b4egnac. The \u201dsilent\u201d surround of v1 receptive \ufb01elds: theory and experi-\n\nments. J Physiol Paris, 97(4-6):453\u2013474, 2003.\n\n[3] H. E. Jones, K. L. Grieve, W. Wang, and A. M. Sillito. Surround suppression in primate v1. J Neurophys-\n\niol, 86(4):2011\u20132028, 2001.\n\n[4] J. R. Cavanaugh, W. Bair, and J. A. Movshon. Selectivity and spatial distribution of signals from the\n\nreceptive \ufb01eld surround in macaque v1 neurons. J Neurophysiol, 88(5):2547\u20132556, 2002.\n\n[5] J. J. Gibson and M. Radner. Adaptation, after-effect, and contrast in the perception of tilted lines. Journal\n\nof Experimental Psychology, 20:553\u2013569, 1937.\n\n[6] C. W. Clifford, P. Wenderoth, and B. Spehar. A functional angle on some after-effects in cortical vision.\n\nProc R Soc Lond B Biol Sci, 1454:1705\u20131710, 2000.\n\n[7] J. A. Solomon and M. J. Morgan. Stochastic re-calibration: contextual effects on perceived tilt. Proc Biol\n\nSci, 273(1601):2681\u20132686, 2006.\n\n1993.\n\n[8] H.C. Nothdurft. The conspicuousness of orientation and motion contrast. Spatial Vision, 7(4):341\u2013363,\n\n[9] D. J. Field, A. Hayes, and R. F. Hess. Contour integration by the human visual system: evidence for a\n\nlocal \u201dassociation \ufb01eld\u201d. Vision Res, 33(2):173\u2013193, 1993.\n\n[10] W. S. Geisler, J. S. Perry, B. J. Super, and D. P. Gallogly. Edge co-occurrence in natural images predicts\n\ncontour grouping performance. 
Vision Res, 41(6):711–724, 2001.
[11] L. Schwabe, K. Obermayer, A. Angelucci, and P. C. Bressloff. The role of feedback in shaping the extra-classical receptive field of cortical neurons: a recurrent network model. J Neurosci, 26(36):9117–9129, 2006.
[12] J. Wielaard and P. Sajda. Extraclassical receptive field phenomena and short-range connectivity in V1. Cereb Cortex, 16(11):1531–1545, 2006.
[13] T. J. Sullivan and V. R. de Sa. A model of surround suppression through cortical feedback. Neural Netw, 19(5):564–572, 2006.
[14] T.N. Mundhenk and L. Itti. Computational modeling and exploration of contour integration for visual saliency. Biological Cybernetics, 93(3):188–212, 2005.
[15] Z. Li. Visual segmentation by contextual influences via intracortical interactions in primary visual cortex. Network: Computation in Neural Systems, 10(2):187–212, 1999.
[16] L. Itti and C. Koch. A saliency-based search mechanism for overt and covert shifts of visual attention. Vision Research, 40(10-12):1489–1506, 2000.
[17] D. Gao, V. Mahadevan, and N. Vasconselos. On the plausibility of the discriminant center-surround hypothesis for visual saliency. Journal of Vision, 8(7)(13):1–18, 2008.
[18] L. Zhang, M.H. Tong, T. Marks, H. Shan, and G.W. Cottrell. SUN: A Bayesian framework for saliency using natural statistics. Journal of Vision, 8(7)(32):1–20, 2008.
[19] N.D.B. Bruce and J.K. Tsotsos. Saliency, attention, and visual search: An information theoretic approach. Journal of Vision, 9(3)(5):1–24, 2009.
[20] A. Hyvärinen, J. Hurri, and P.O. Hoyer. Natural Image Statistics. Springer, 2009.
[21] D. Andrews and C. Mallows. Scale mixtures of normal distributions. J. Royal Stat. Soc., 36:99–102, 1974.
[22] M. J. Wainwright, E. P. Simoncelli, and A. S. Willsky.
Random cascades on wavelet trees and their use in modeling and analyzing natural imagery. Applied and Computational Harmonic Analysis, 11(1):89–123, 2001.
[23] J. Portilla, V. Strela, M. Wainwright, and E. P. Simoncelli. Image denoising using a scale mixture of Gaussians in the wavelet domain. IEEE Trans Image Processing, 12(11):1338–1351, 2003.
[24] C. Zetzsche, B. Wegmann, and E. Barth. Nonlinear aspects of primary vision: Entropy reduction beyond decorrelation. In Int'l Symposium, Society for Information Display, volume XXIV, pages 933–936, 1993.
[25] E. P. Simoncelli. Statistical models for images: Compression, restoration and synthesis. In Proc 31st Asilomar Conf on Signals, Systems and Computers, pages 673–678, Pacific Grove, CA, 1997. IEEE Computer Society.
[26] P. Hoyer and A. Hyvärinen. A multi-layer sparse coding network learns contour coding from natural images. Vision Research, 42(12):1593–1605, 2002.
[27] A. Hyvärinen, J. Hurri, and J. Väyrynen. Bubbles: a unifying framework for low-level statistical properties of natural image sequences. Journal of the Optical Society of America A, 20:1237–1252, 2003.
[28] Y. Karklin and M. S. Lewicki. A hierarchical Bayesian model for learning nonlinear statistical regularities in nonstationary natural signals. Neural Computation, 17:397–423, 2005.
[29] S. Osindero, M. Welling, and G. E. Hinton. Topographic product models applied to natural scene statistics. Neural Computation, 18(2):381–414, 2006.
[30] O. Schwartz, T. J. Sejnowski, and P. Dayan. Soft mixer assignment in a hierarchical generative model of natural scene statistics. Neural Comput, 18(11):2680–2718, 2006.
[31] Y. Karklin and M.S. Lewicki. Emergence of complex cell properties by learning to generalize in natural scenes. Nature, 457(7225):83–86, 2009.
[32] D. J. Heeger. Normalization of cell responses in cat striate cortex.
Visual Neuroscience, 9:181–198, 1992.
[33] O. Schwartz and E. P. Simoncelli. Natural signal statistics and sensory gain control. Nature Neuroscience, 4(8):819–825, 2001.
[34] O. Schwartz, T.J. Sejnowski, and P. Dayan. Perceptual organization in the tilt illusion. Journal of Vision, 9(4)(19):1–20, 2009.
[35] J.A. Guerrero-Colon, E.P. Simoncelli, and J. Portilla. Image denoising using mixtures of Gaussian scale mixtures. In Proc 15th IEEE Int'l Conf on Image Proc, pages 565–568, 2008.
[36] E. P. Simoncelli, W. T. Freeman, E. H. Adelson, and D. J. Heeger. Shiftable multi-scale transforms. IEEE Trans Information Theory, 38(2):587–607, 1992.
[37] C. K. I. Williams and N. J. Adams. Dynamic trees. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, Adv. Neural Information Processing Systems, volume 11, pages 634–640, Cambridge, MA, 1999. MIT Press.
[38] X.L. Meng and D.B. Rubin. Maximum likelihood estimation via the ECM algorithm: A general framework. Biometrika, 80(2):267–278, 1993.
[39] E. Goddard, C.W.G. Clifford, and S.G. Solomon. Centre-surround effects on perceived orientation in complex images. Vision Research, 48:1374–1382, 2008.
[40] A.V. Popple and Z. Li. Testing a V1 model: perceptual biases and saliency effects. Journal of Vision, 1(3):148, 2001.
[41] J. Malo and J. Gutiérrez. V1 non-linear properties emerge from local-to-global non-linear ICA. Network: Comp. Neur. Syst., 17(1):85–102, 2006.
[42] D. J. Field. Relations between the statistics of natural images and the response properties of cortical cells. J. Opt. Soc. Am. A, 4(12):2379–2394, 1987.
[43] Q. Li and Z. Wang. General-purpose reduced-reference image quality assessment based on perceptually and statistically motivated image representation. In Proc 15th IEEE Int'l Conf on Image Proc, pages 1192–1195, 2008.