{"title": "A Model of Early Visual Processing", "book": "Advances in Neural Information Processing Systems", "page_first": 173, "page_last": 179, "abstract": "", "full_text": "A Model of Early Visual Processing \n\nLaurent Itti, Jochen Braun, Dale K. Lee and Christof Koch \n\n{itti, achim, jjwen, koch}@klab.caltech.edu \n\nComputation & Neural Systems, MSC 139-74 \n\nCalifornia Institute of Technology, Pasadena, CA 91125, U.S.A. \n\nAbstract \n\nWe propose a model for early visual processing in primates. The \nmodel consists of a population of linear spatial filters which interact \nthrough non-linear excitatory and inhibitory pooling. Statistical \nestimation theory is then used to derive human psychophysical \nthresholds from the responses of the entire population of units. The \nmodel is able to reproduce human thresholds for contrast and orientation \ndiscrimination tasks, and to predict contrast thresholds in \nthe presence of masks of varying orientation and spatial frequency. \n\n1 INTRODUCTION \n\nA remarkably wide range of human visual thresholds for spatial patterns appears to \nbe determined by the earliest stages of visual processing, namely, orientation- and \nspatial frequency-tuned visual filters and their interactions [18, 19, 3, 22, 9]. Here we \nconsider the possibility of quantitatively relating arbitrary spatial vision thresholds \nto a single computational model. The success of such a unified account should \nreveal the extent to which human spatial vision indeed reflects one particular stage \nof processing. Another motivation for this work is the controversy over the neural \ncircuits that generate orientation and spatial frequency tuning in striate cortical \nneurons [13, 8, 2]. We think it is likely that behaviorally defined visual filters \nand their interactions reveal at least some of the characteristics of the underlying \nneural circuitry. 
Two specific problems are addressed: (i) what is the minimal set \nof model components necessary to account for human spatial vision, and (ii) is there \na general decision strategy which relates model responses to behavioral thresholds \nand which obviates case-by-case assumptions about the decision strategy in different \nbehavioral situations? To investigate these questions, we propose a computational \nmodel articulated around three main stages: first, a population of bandpass linear \nfilters extracts visual features from a stimulus; second, linear filters interact through \nnon-linear excitatory and inhibitory pooling; third, a noise model and decision \nstrategy are assumed in order to relate the model's output to psychophysical data. \n\n2 MODEL \n\nWe assume spatial visual filters tuned for a variety of orientations θ ∈ Θ and \nspatial periods λ ∈ Λ. The filters have overlapping receptive fields in visual space. \nQuadrature filter pairs, F^{even}_{λ,θ} and F^{odd}_{λ,θ}, are used to compute a phase-independent \nlinear energy response, E_{λ,θ}, to a visual stimulus S. A small constant background \nactivity, ε, is added to the linear energy responses: \n\nE_{λ,θ} = √((F^{even}_{λ,θ} * S)^2 + (F^{odd}_{λ,θ} * S)^2) + ε \n\nFilters have separable Gaussian tuning curves in orientation and spatial frequency. \nTheir corresponding shape in visual space is close to that of Gabor filters, although \nnot separable along spatial dimensions. \n\n2.1 Pooling: self excitation and divisive inhibition \n\nA model based on linear filters alone would not correctly account for the non-linear \nresponse characteristics to stimulus contrast which have been observed psychophysically \n[19]. Several models have consequently introduced a non-linear transducer \nstage following each linear unit [19]. A more appealing possibility is to assume a \nnon-linear pooling stage [6, 21, 3, 22]. 
In this study, we propose a pooling strategy \ninspired by Heeger's model for gain control in cat area V1 [5, 6]. The pooled response \nR_{λ,θ} of a unit tuned for (λ, θ) is computed from the linear energy responses \nof the entire population: \n\nR_{λ,θ} = E_{λ,θ}^γ / (S^δ + Σ_{λ',θ'} W_{λ,θ}(λ', θ') E_{λ',θ'}^δ) + η    (1) \n\nwhere the sum is taken over the entire population, W_{λ,θ} is a two-dimensional \nGaussian weighting function centered around (λ, θ), and η is a background activity. \nThe numerator in Eq. 1 represents a non-linear self-excitation term. The denominator \nrepresents a divisive inhibitory term which depends not only on the activity \nof the unit (λ, θ) of interest, but also on the responses of other units. We shall see \nin Section 3 that, in contrast to Heeger's model for electrophysiological data in \nwhich all units contribute equally to the pool, it is necessary to assume that only a \nsubpopulation of units with tuning close to (λ, θ) contribute to the pool in order to \naccount for psychophysical data. Also, we assume γ > δ to obtain a power law for \nhigh contrasts [7], as opposed to Heeger's physiological model in which γ = δ = 2 \nto account for neuronal response saturation at high contrasts. \n\nSeveral interesting properties result from this pooling model. First, a sigmoidal \ntransducer function - in agreement with contrast discrimination psychophysics - is \nnaturally obtained through pooling and thus need not be introduced post-hoc. The \ntransducer slope for high contrasts is determined by γ - δ, the location of its inflexion \npoint by S, and the slope at this point by the absolute value of γ (and δ). Second, the \ntuning curves of the pooled units for orientation and spatial period do not depend \non stimulus contrast, in agreement with physiological and psychophysical evidence \n[14]. 
In comparison, a model which assumes a non-linear transducer but no pooling \nexhibits sharper tuning curves for lower contrasts. Full contrast independence of \nthe tuning is achieved only when all units participate in the inhibitory pool; when \nonly sub-populations participate in the pool, some contrast dependence remains. \n\n2.2 Noise model: Poisson^α \n\nIt is necessary to assume the presence of noise in the system in order to be able to \nderive psychophysical performance from the responses of the population of pooled \nunits. The deterministic response of each unit then represents the mean of a randomly \ndistributed \"neuronal\" response which varies from trial to trial in a simulated \npsychophysical experiment. \n\nExisting models usually assume constant noise variance in order to simplify the \nsubsequent decision stage [18]. Using the decision strategy presented below, it is \nhowever possible to derive psychophysical performance with a noise model whose \nvariance increases with mean activity, in agreement with electrophysiology [16]. \nIn what follows, Poisson^α noise will be assumed and approximated by a Gaussian \nrandom variable with variance = mean^α (α is a constant close to unity). \n\n2.3 Decision strategy \n\nWe use tools from statistical estimation theory to compute the system's behavioral \nresponse based on the responses of the population of pooled units. Similar tools \nhave been used by Seung and Sompolinsky [12] under the simplifying assumption of \npurely Poisson noise and for the particular task of orientation discrimination in the \nlimit of an infinite population of oriented units. Here, we extend this framework \nto the more general case in which any stimulus attribute may differ between the \ntwo stimulus presentations to be discriminated by the model. 
Let's assume that we \nwant to estimate psychophysical performance at discriminating between two stimuli \nwhich differ by the value of a stimulus parameter ζ (e.g. contrast, orientation, \nspatial period). \n\nThe central assumption of our decision strategy is that the brain implements an \nunbiased efficient statistic T(R; ζ), which is an estimator of the parameter ζ based \non the population response R = {R_{λ,θ}; λ ∈ Λ, θ ∈ Θ}. The efficient statistic is \nthe one which, among all possible estimators of ζ, has the property of minimum \nvariance in the estimated value of ζ. Although we are not suggesting any putative \nneuronal correlate for T, it is important to note that the assumption of an efficient \nstatistic does not require T to be prohibitively complex; for instance, a maximum \nlikelihood estimator proposed in the decision stage of several existing models is \nasymptotically (with respect to the number of observations) an efficient statistic. \n\nBecause T is efficient, it achieves the Cramér-Rao bound [1]. Consequently, when \nthe number of observations (i.e. simulated psychophysical trials) is large, \n\nE[T] = ζ   and   var[T] = 1/J(ζ) \n\nwhere E[.] is the mean over all observations, var[.] the variance, and J(ζ) is the \nFisher information. The Fisher information can be computed using the noise model \nassumption and tuning properties of the pooled units: for a random variable X \nwith probability density f(x; ζ), it is given by [1]: \n\nJ(ζ) = E[(∂/∂ζ ln f(X; ζ))^2] \n\nFor our Poisson^α noise model and assuming that different pooled units are independent \n[15], this translates into: \n\nOne unit R_{λ,θ}: J_{λ,θ}(ζ) = (∂R_{λ,θ}/∂ζ)^2 (1/R_{λ,θ}^α + α^2/(2 R_{λ,θ}^2)) \nAll independent units: J(ζ) = Σ_{λ,θ} J_{λ,θ}(ζ) \n\nThe Fisher information computed for each pooled unit and three types of stimulus \nparameters ζ is shown in Figure 1. 
This figure demonstrates the importance of \nusing information from all units in the population rather than from only one unit \noptimally tuned for the stimulus: although the unit carrying the most information \nabout contrast is the one optimally tuned to the stimulus pattern, more information \nabout orientation or spatial frequency is carried by units which are tuned to flanking \norientations and spatial periods and whose tuning curves have maximum slope for \nthe stimulus rather than maximum absolute sensitivity. In our implementation, \nthe derivatives of pooled responses used in the expression of Fisher information are \ncomputed numerically. \n\n[Figure 1 panel labels: orientation, spatial frequency] \n\nFigure 1: Fisher information computed for contrast, orientation and spatial frequency. \nEach node in the three-dimensional meshes represents the Fisher information for the corresponding \npooled unit (λ, θ) in a model with 30 orientations and 4 scales. Arrows indicate \nthe unit (λ, θ) optimally tuned to the stimulus. The total Fisher information in the population \nis the sum of the information for all units. \n\nUsing the estimate of ζ and its variance from the Fisher information, it is possible \nto derive psychophysical performance for a discrimination task between two \nstimuli with parameters ζ_1 ≈ ζ_2 using standard ideal observer signal discrimination \ntechniques [4]. For such discrimination, we use the Central Limit Theorem (in the \nlimit of large number of trials) to model the noisy responses of the system as two \nGaussians with means ζ_1 and ζ_2, and variances σ_1^2 = 1/J(ζ_1) and σ_2^2 = 1/J(ζ_2) \nrespectively. A decision criterion D is chosen to minimize the overall probability of \nerror; since in our case σ_1 ≠ 
\nσ_2 in general, we derive a slightly more complicated \nexpression for performance P at a Yes/No (one alternative forced choice) task than \nwhat is commonly used with models assuming constant noise [18]: \n\nD = (ζ_2 σ_1^2 - ζ_1 σ_2^2 - σ_1 σ_2 √((ζ_1 - ζ_2)^2 + 2(σ_1^2 - σ_2^2) log(σ_1/σ_2))) / (σ_1^2 - σ_2^2) \n\nP = 1/2 + (1/4) erf((ζ_2 - D) / (σ_2 √2)) + (1/4) erf((D - ζ_1) / (σ_1 √2)) \n\nwhere erf is the Normal error function. The expression for D extends by continuity \nto D = (ζ_1 + ζ_2)/2 when σ_1 = σ_2. This decision strategy provides a unified, task-independent \nframework for the computation of psychophysical performance from the \ndeterministic responses of the pooled units. This strategy can easily be extended to \nallow the model to perform discrimination tasks with respect to additional stimulus \nparameters, under exactly the same theoretical assumptions. \n\n3 RESULTS \n\n3.1 Model calibration \n\nThe parameters of the model were automatically adjusted to fit human psychophysical \nthresholds measured in our laboratory [17] for contrast and orientation discrimination \ntasks (Figure 2). The model used in this experiment consisted of 60 \norientations evenly distributed between 0 and 180 deg. One spatial scale at 4 cycles \nper degree (cpd) was sufficient to account for the data. A multidimensional simplex \nmethod with simulated annealing overhead was used to determine the best fit of \nthe model to the data [10]. The free parameters adjusted during the automatic \nfits were: the noise level α, the pooling exponents γ and δ, the inhibitory pooling \nconstant S, and the background firing rates, ε and η. 
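As an illustration of the decision stage of Section 2.3, the criterion D and the Yes/No performance P can be evaluated directly from two parameter estimates and their standard deviations. The sketch below is a minimal one and not the implementation used for the fits: the function names decision_criterion and percent_correct are hypothetical, the two stimuli are assumed to be equally likely, and the equal-variance case is handled separately with the midpoint of the two means, which is the value the general expression approaches by continuity.

```python
import math

def decision_criterion(z1, z2, s1, s2):
    # Criterion D separating N(z1, s1^2) from N(z2, s2^2), with z1 below z2.
    if math.isclose(s1, s2):
        # Equal variances: the general formula is 0/0, use its limit.
        return 0.5 * (z1 + z2)
    # (s1^2 - s2^2) and log(s1 / s2) share a sign, so the root is real.
    root = math.sqrt((z1 - z2) ** 2
                     + 2.0 * (s1 ** 2 - s2 ** 2) * math.log(s1 / s2))
    return (z2 * s1 ** 2 - z1 * s2 ** 2 - s1 * s2 * root) / (s1 ** 2 - s2 ** 2)

def percent_correct(z1, z2, s1, s2):
    # Probability of a correct Yes/No response with criterion D.
    d = decision_criterion(z1, z2, s1, s2)
    return (0.5
            + 0.25 * math.erf((z2 - d) / (s2 * math.sqrt(2.0)))
            + 0.25 * math.erf((d - z1) / (s1 * math.sqrt(2.0))))
```

With equal standard deviations, percent_correct(0.0, 1.0, 1.0, 1.0) reduces to the constant-noise case, the cumulative Normal evaluated at half the separation of the means (about 0.69).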
\n\nThe error function minimized by the fitting algorithm was a weighted average of \nthree constraints: 1) least-square error with the contrast discrimination data in \nFigure 2.a; 2) least-square error with the orientation discrimination data in Figure 2.b; \n3) because the data was sparse in the \"dip-shaped\" region of the curve \nin Figure 2.a, and unreliable due to the limited contrast resolution of the display \nused for the psychophysics, we added a constraint favoring a more \npronounced \"dip\", as has been observed by several other groups [11, 19, 22]. \n\n[Figure 2 panels. Data fits used for model calibration: (a) x-axis mask contrast, (b) x-axis stimulus contrast. Transducer function: (c) x-axis stimulus contrast. Orientation tuning: (d) x-axis stimulus orientation (deg).] \n\nFigure 2: The model (solid lines) was calibrated using data from two psychophysical \nexperiments: (a) discrimination between a pedestal contrast (a.α) and the same pedestal \nplus an increment contrast (a.β); (b) discrimination between two orientations near vertical \n(b.α and b.β). After calibration, the transducer function of each pooled unit (c) correctly \nexhibits an accelerating non-linearity near threshold (contrast ≈ 1%) and compressive \nnon-linearity for high contrasts (Weber's law). We can see in (d) that pooling among \nunits with similar tuning properties sharpens their tuning curves. Model parameters were: \nα ≈ 0.75, γ ≈ 4, δ ≈ 3.5, ε ≈ 1%, η ≈ 1.7 Hz, S such that transducer inflexion point is \nat 4x detection threshold contrast, orientation tuning FWHM=68deg (full width at half \nmaximum), orientation pooling FWHM=40deg. 
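The transducer behaviour summarized in the caption of Figure 2 can be checked with a minimal single-unit sketch of Eq. 1. Several simplifications are assumptions made here for illustration only: the inhibitory pool defaults to the unit itself with unit weight (the model uses a Gaussian-weighted sum over similarly tuned units), the background activity eta is set to zero, and the value s = 0.05 for the constant S is hypothetical. The exponents gamma = 4 and delta = 3.5 follow the calibrated values listed in the caption. The function names pooled_response and log_log_slope are likewise hypothetical.

```python
import math

def pooled_response(energy, gamma=4.0, delta=3.5, s=0.05, eta=0.0, pool=None):
    # Eq. 1 for one unit: E^gamma / (S^delta + sum_j w_j * E_j^delta) + eta.
    if pool is None:
        # Assumption: the unit is alone in its pool, with weight 1.
        pool = [(1.0, energy)]
    inhibition = s ** delta + sum(w * e ** delta for w, e in pool)
    return energy ** gamma / inhibition + eta

def log_log_slope(energy, gamma=4.0, delta=3.5, s=0.05, h=1e-3):
    # Central-difference estimate of d log R / d log E (default pool, eta = 0).
    lo = pooled_response(energy * math.exp(-h), gamma, delta, s)
    hi = pooled_response(energy * math.exp(h), gamma, delta, s)
    return (math.log(hi) - math.log(lo)) / (2.0 * h)
```

Under these assumptions, log_log_slope(1e-4) is close to gamma (the accelerating regime well below S), while log_log_slope(1.0) is close to gamma - delta = 0.5 (the compressive power law at high contrast). Passing a larger pool, for example pool=[(1.0, 0.5), (1.0, 0.5)], increases the divisive term and lowers the response.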
\n\nTwo remaining parameters are the orientation tuning width, σ_θ, of the filters and \nthe width, σ_{Wθ}, of the pool. It was not possible from the data in Figure 2 alone \nto unambiguously determine these parameters. However, for any given σ_θ, σ_{Wθ} \nis uniquely determined by the following two qualitative constraints: first, a small \npool size is not desirable because it yields contrast-dependent orientation tuning; \nit however appears from the data in Figure 2.b that this tuning should not vary \nmuch over a wide range of contrasts. The second constraint is qualitatively derived \nfrom Figure 3.a: for large pool sizes, the model predicted significant interference \nbetween mask and test patterns even for large orientation differences. Such interference \nwas not observed in the data for orientation differences larger than 45deg. \nIt consequently seems that a partial inhibitory pool, composed only of a fraction of \nthe population of oriented filters with tuning similar to the central excitatory unit, \naccounts best for the psychophysical data. Finally, σ_θ was fixed so as to yield a \ncorrect qualitative curve shape for Figure 3.a. \n\n3.2 Predictions \n\nWe used complex stimuli from masking experiments to test the predictive value \nof the model (Figure 3). Although it was necessary to use some of the qualitative \nproperties of the data seen in Figure 3.a to calibrate the model as detailed \nabove, the calibrated model correctly produced a quantitative fit of this data. The \ncalibrated model also correctly predicted the complex data of Figure 3.b. \n\n[Figure 3 panels: (a) x-axis mask orientation (deg); (b) x-axis mask spatial freq. (cpd)] 
\n\nFigure 3: Prediction of psychophysical contrast thresholds in the presence of an oblique \nmask. The mask was a 50%-contrast stochastic oriented pattern (α), and the superimposed \ntest pattern was a sixth-derivative of Gaussian bar (β). In (a), threshold elevation \n(i.e. ratio of threshold in the presence of mask to threshold in the absence of mask) was \nmeasured for varying mask orientation, for mask and test patterns at 4 cycles per degree \n(cpd). In (b), orientation difference between test and mask was fixed to 15deg, and threshold \nelevation was measured as a function of mask spatial frequency. Solid lines represent \nmodel predictions, and dashed lines represent unity threshold elevation. \n\n4 DISCUSSION AND CONCLUSION \n\nWe have developed a model of early visual processing in humans which accounts for \na wide range of measured spatial vision thresholds and which predicts behavioral \nthresholds for a potentially unlimited number of spatial discriminations. In addition \nto orientation- and spatial-frequency-tuned units, we have found it necessary to \nassume two types of interactions between such units: (i) non-linear self-excitation \nof each unit and (ii) divisive normalization of each unit response relative to the \nresponses of similarly tuned units. All model parameters are constrained by psychophysical \ndata and an automatic fitting procedure consistently converged to the \nsame parameter set regardless of the initial position in parameter space. \n\nOur two main contributions are the small number of model components and the unified, \ntask-independent decision strategy. Rather than making different assumptions \nabout the decision strategy in different behavioral tasks, we combine the information \ncontained in the responses of all model units in a manner that is optimal for \nany behavioral task. 
We suggest that human observers adopt a similarly optimal \ndecision procedure as they become familiar with a particular task (\"task set\"). Although \nhere we apply this decision strategy only to the discrimination of stimulus \ncontrast, orientation, and spatial frequency, it can readily be generalized to arbitrary \ndiscriminations such as, for example, the discrimination of vernier targets. \n\nSo far we have considered only situations in which the same decision strategy is \noptimal for every stimulus presentation. We are now studying situations in which \nthe optimal decision strategy varies unpredictably from trial to trial (\"decision \nuncertainty\"). An example is a situation in which the observer attempts to detect an \nincrease in either the spatial frequency or the contrast of a stimulus. In this way, we \nhope to learn the extent to which our model reflects the decision strategy adopted \nby human observers in an even wider range of situations. We have also assumed \nthat the model's units were independent, which is not strictly true in biological \nsystems (although the main source of correlation between neurons is the overlap \nbetween their respective tuning curves, which is accounted for in the model). The \nmathematical developments necessary to account for fixed or variable covariance \nbetween units are currently under study. \n\nIn contrast to other models of early visual processing [5, 6], we find that the psychophysical \ndata is consistent only with interactions between similarly tuned units \n(e.g., \"near-orientation inhibition\"), not with interactions between units of very different \ntuning (e.g., \"cross-orientation inhibition\"). Although such partial pooling \ndoes not render tuning functions completely contrast-independent, an additional degree \nof contrast-independence could be provided by pooling across different spatial \nlocations. 
This issue is currently under investigation. \n\nIn conclusion, we have developed a model based on self-excitation of each unit, \ndivisive normalization [5, 6] between similarly tuned units, and an ideal observer \ndecision strategy. It was able to reproduce a wide range of human visual thresholds. \nThe fact that such a simple and idealized model can account quantitatively for \na wide range of psychophysical observations greatly strengthens the notion that \nspatial vision thresholds reflect processing at one particular neuroanatomical level. \n\nAcknowledgments: This work was supported by NSF-Engineering Research Center \n(ERC), NIMH, ONR, and the Sloan Center for Theoretical Neurobiology. \n\nReferences \n\n[1] Cover TM, Thomas JA. Elem Info Theo, Wiley & Sons, 1991 \n[2] Ferster D, Chung S, Wheat H. Nature 1996;380(6571):249-52 \n[3] Foley JM. J Opt Soc A 1994;11(6):1710-9 \n[4] Green DM, Swets JA. Signal Detectability and Psychophys. Wiley & Sons, 1966 \n[5] Heeger DJ. Comput Models of Vis Processing, MIT Press, 1991 \n[6] Heeger DJ. Vis Neurosci 1992;9:181-97 \n[7] Nachmias J, Sansbury RV. Vis Res 1974;14:1039-42 \n[8] Nelson S, Toth L, Sheth B, Sur M. Science 1994;265(5173):774-77 \n[9] Perona P, Malik J. J Opt Soc A 1990;7(5):923-32 \n[10] Press WH, Teukolsky SA, et al. Num Rec in C. Cambridge University Press, 1992 \n[11] Ross J, Speed HD. Proc R Soc B 1991;246:61-9 \n[12] Seung HS, Sompolinsky H. Proc Natl Acad Sci USA 1993;90:10749-53 \n[13] Sillito AM. Progr Brain Res 1992;90:349-84 \n[14] Skottun BC, Bradley A, Sclar G et al. J Neurophys 1987;57(3):773-86 \n[15] Snippe HP, Koenderink JJ. Biol Cybern 1992;67:183-90 \n[16] Teich MC, Turcott RG, Siegel RM. IEEE Eng Med Biol 1996;Sept-Oct,79-87 \n[17] Wen J, Koch C, Braun J. Proc ARVO 1997;5457 \n[18] Wilson HR, Bergen JR. Vis Res 1979;19:19-32 \n[19] Wilson HR. Biol Cybern 1980;38:171-8 \n[20] Wilson HR, McFarlane DK, Phillips GC. Vis Res 1983;23:873-82 \n[21] Wilson HR, Humanski R. 
Vis Res 1993;33(8):1133-50 \n[22] Zenger B, Sagi D. Vis Res 1996;36(16):2497-2513 \n", "award": [], "sourceid": 1471, "authors": [{"given_name": "Laurent", "family_name": "Itti", "institution": null}, {"given_name": "Jochen", "family_name": "Braun", "institution": null}, {"given_name": "Dale", "family_name": "Lee", "institution": null}, {"given_name": "Christof", "family_name": "Koch", "institution": null}]}