{"title": "Divisive Normalization, Line Attractor Networks and Ideal Observers", "book": "Advances in Neural Information Processing Systems", "page_first": 104, "page_last": 110, "abstract": null, "full_text": "Divisive Normalization, Line Attractor \n\nNetworks and Ideal Observers \n\nSophie Denevel Alexandre Pougetl, and P.E. Latham2 \n\n1 Georgetown Institute for Computational and Cognitive Sciences, \n\nGeorgetown University, Washington, DC 20007-2197 \n\n2Dpt of Neurobiology, UCLA, Los Angeles, CA 90095-1763, U.S.A. \n\nAbstract \n\nGain control by divisive inhibition, a.k.a. divisive normalization, \nhas been proposed to be a general mechanism throughout the vi(cid:173)\nsual cortex. We explore in this study the statistical properties \nof this normalization in the presence of noise. Using simulations, \nwe show that divisive normalization is a close approximation to a \nmaximum likelihood estimator, which, in the context of population \ncoding, is the same as an ideal observer. We also demonstrate ana(cid:173)\nlytically that this is a general property of a large class of nonlinear \nrecurrent networks with line attractors. Our work suggests that \ndivisive normalization plays a critical role in noise filtering, and \nthat every cortical layer may be an ideal observer of the activity in \nthe preceding layer. \n\nInformation processing in the cortex is often formalized as a sequence of a linear \nstages followed by a nonlinearity. In the visual cortex, the nonlinearity is best de(cid:173)\nscribed by squaring combined with a divisive pooling of local activities. The divisive \npart of the nonlinearity has been extensively studied by Heeger and colleagues [1], \nand several authors have explored the role of this normalization in the computation \nof high order visual features such as orientation of edges or first and second order \nmotion[ 4]. We show in this paper that divisive normalization can also playa role in \nnoise filtering. 
More specifically, we demonstrate through simulations that networks implementing this normalization come close to performing maximum likelihood estimation. We then demonstrate analytically that the ability to perform maximum likelihood estimation, and thus to efficiently extract information from a population of noisy neurons, is a property exhibited by a large class of networks. \n\nMaximum likelihood estimation is a framework commonly used in the theory of ideal observers. A recent example comes from the work of Itti et al., 1998, who have shown that it is possible to account for the behavior of human subjects in simple discrimination tasks. Their model comprised two distinct stages: 1) a network which models the noisy response of neurons with tuning curves to orientation and spatial frequency combined with divisive normalization, and 2) an ideal observer (a maximum likelihood estimator) to read out the population activity of the network. \n\nOur work suggests that there is no need to distinguish between these two stages, since, as we will show, divisive normalization comes close to providing a maximum likelihood estimate. More generally, we propose that there may not be any single part of the cortex that acts as an ideal observer for patterns of activity in sensory areas but, instead, that each cortical layer acts as an ideal observer of the activity in the preceding layer. \n\n1 The network \n\nOur network is a simplified model of a cortical hypercolumn for spatial frequency and orientation. It consists of a two-dimensional array of units in which each unit is indexed by its preferred orientation, θ_i, and spatial frequency, λ_j. \n\n1.1 LGN model \n\nUnits in the cortical layer are assumed to receive direct inputs from the lateral geniculate nucleus (LGN). 
Here we do not model the LGN explicitly, but focus instead on the pooled LGN input onto each cortical unit. The input to each unit is denoted a_ij. We distinguish between the mean pooled LGN input, f_ij(θ, λ), as a function of orientation, θ, and spatial frequency, λ, and the noise distribution around this mean, P(a_ij | θ, λ). \n\nIn response to a stimulus of orientation, θ, spatial frequency, λ, and contrast, C, the mean LGN input onto unit ij is a circular Gaussian with a small amount of spontaneous activity, ν: \n\nf_ij(θ, λ) = KC exp( (cos(λ − λ_j) − 1)/σ_λ² + (cos(θ − θ_i) − 1)/σ_θ² ) + ν,   (1) \n\nwhere K is a constant. Note that spatial frequency is treated as a periodic variable; this was done for convenience only and should have negligible effects on our results as long as we keep λ far from 2πn, n an integer. \n\nOn any given trial the LGN input to cortical unit ij, a_ij, is sampled from a Gaussian noise distribution with variance σ_ij²: \n\nP(a_ij | θ, λ) ∝ exp( −(a_ij − f_ij(θ, λ))² / 2σ_ij² ).   (2) \n\nIn our simulations, the variance of the noise was either kept fixed (σ_ij² = σ²) or set to the mean activity (σ_ij² = f_ij(θ, λ)). The latter is more consistent with the noise that has been measured experimentally in the cortex. We show in figure 1-A an example of a noisy LGN pattern of activity. \n\n1.2 Cortical Model: Divisive Normalization \n\nActivities in the cortical layer are updated over time according to equation 3 below. \n\nFigure 1: A- LGN input (bottom) and stable hill in the cortical network after relaxation (top). The position of the stable hill can be used to estimate orientation (θ̂) and spatial frequency (λ̂). 
B- Inverse of the variance of the network estimate for orientation, using Gaussian noise with variance equal to the mean, as a function of contrast and number of iterations (0, dashed; 1, diamond; 2, circle; and 3, square). The continuous curve corresponds to the theoretical upper bound on the inverse of the variance (i.e. an ideal observer). C- Gain curve for contrast for the cortical units after 1, 2 and 3 iterations. \n\no_ij(t+1) = u_ij(t)² / ( S + μ Σ_kl u_kl(t)² ),  with  u_ij(t) = Σ_kl w_ij,kl o_kl(t),   (3) \n\nwhere {w_ij,kl} are the filtering weights, o_ij(t) is the activity of unit ij at time t, S is a constant, and μ is what we call the divisive inhibition weight. The filtering weights implement a two-dimensional Gaussian filter: \n\nw_ij,kl = w_i−k,j−l = K_w exp( (cos[2π(i − k)/P] − 1)/σ_wθ² + (cos[2π(j − l)/P] − 1)/σ_wλ² ),   (4) \n\nwhere K_w is a constant, σ_wθ and σ_wλ control the width of the filtering weights, and there are P² units. \n\nOn each iteration the activity is filtered by the weights, squared, and then normalized by the total local activity. Divisive normalization per se only involves the squaring and division by local activity. We have added the filtering weights to obtain a local pooling of activity between cells with similar preferred orientations and spatial frequencies. This pooling can easily be implemented with cortical lateral connections, and it is reasonable to think that such a pooling takes place in the cortex. \n\n2 Simulation Results \n\nOur simulations consist of iterating equation 3 with initial conditions determined by the presentation orientation and spatial frequency. The initial conditions are chosen as follows: For a given presentation angle, θ0, and spatial frequency, λ0, determine the mean cortical activity, f_ij(θ0, λ0), via equation 1. Then generate the actual cortical activity, {a_ij}, by sampling from the distribution given in equation 2. 
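As a concrete illustration, the input model (equations 1 and 2) and the sampling of the initial conditions can be sketched in a few lines of Python. This is a minimal sketch, not the authors' code: the function names and the parameter defaults (grid size P, spontaneous activity ν) are illustrative choices.

```python
import numpy as np

def mean_lgn_input(theta0, lam0, P=20, K=74.0, C=1.0,
                   sigma_theta=1/np.sqrt(8), sigma_lam=1/np.sqrt(8), nu=0.1):
    """Mean pooled LGN input f_ij(theta0, lam0), as in equation 1.

    Units tile preferred orientations theta_i and spatial frequencies
    lam_j; both variables are treated as periodic (circular Gaussian).
    """
    theta_i = 2*np.pi*np.arange(P)/P           # preferred orientations
    lam_j = 2*np.pi*np.arange(P)/P             # preferred spatial frequencies
    th, lm = np.meshgrid(theta_i, lam_j, indexing="ij")
    return K*C*np.exp((np.cos(lam0 - lm) - 1)/sigma_lam**2
                      + (np.cos(theta0 - th) - 1)/sigma_theta**2) + nu

def noisy_lgn_input(theta0, lam0, fixed_variance=None, rng=None, **kw):
    """Sample a_ij from Gaussian noise around the mean, as in equation 2.

    If fixed_variance is None, the variance is set to the mean activity,
    the case closer to cortical noise measurements.
    """
    rng = np.random.default_rng() if rng is None else rng
    f = mean_lgn_input(theta0, lam0, **kw)
    var = np.full_like(f, fixed_variance) if fixed_variance is not None else f
    return f + rng.standard_normal(f.shape)*np.sqrt(var)
```

The sampled array a_ij then serves directly as the initial activity of the cortical network, o_ij(t = 0) = a_ij.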
This serves as our set of initial conditions: o_ij(t = 0) = a_ij. \n\nIterating equation 3 with the above initial conditions, we found that for very low contrast the activity of all cortical units decayed to zero. Above some contrast threshold, however, the activities converged to a smooth stable hill (see figure 1-A for an example with parameters σ_wθ = σ_wλ = σ_θ = σ_λ = 1/√8, K = 74, C = 1, μ = 0.01). The width of the hill is controlled by the width of the filtering weights. Its peak, on the other hand, depends on the orientation and spatial frequency of the LGN input, θ0 and λ0. The peak can thus be used to estimate these quantities (see figure 1-A). To compute the position of the final hill, we used a population vector estimator [3], although any unbiased estimator would work as well. In all cases we looked at, the network produced an unbiased estimate of θ0 and λ0. \n\nIn our simulations we adjusted σ_wθ and σ_wλ so that the stable hill had the same profile as the mean LGN input (equation 1). As a result, the tuning curves of the cortical units match the tuning curves specified by the pooled LGN input. For this case, we found that the estimate obtained from the network has a variance close to the theoretical minimum, known as the Cramér-Rao bound [3]. For Gaussian noise of fixed variance, the variance of the estimate was 16.6% above this bound, compared to 3833% for the population vector applied directly to the LGN input. In a 1-D network (orientation alone), these numbers go to 12.9% for the network versus 613% for the population vector. For Gaussian noise with variance proportional to the mean, the network was 8.8% above the bound, compared to 722% for the population vector applied directly to the input. These numbers are respectively 9% and 108% for the 1-D network. 
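The filter-square-normalize iteration used in these simulations can be sketched as follows. This is a minimal sketch under stated assumptions, not the exact simulation code: the constants S and μ and the kernel widths are illustrative, the circular Gaussian kernel follows the separable form of equation 4, and the circular convolution by the weights is computed with FFTs.

```python
import numpy as np

def make_weights(P=20, Kw=1.0, sw_theta=0.3, sw_lam=0.3):
    """Separable circular Gaussian filtering weights w_{i-k, j-l}."""
    d = 2*np.pi*np.arange(P)/P
    wt = np.exp((np.cos(d) - 1)/sw_theta**2)
    wl = np.exp((np.cos(d) - 1)/sw_lam**2)
    return Kw*np.outer(wt, wl)                # 2-D kernel on the periodic grid

def normalization_step(o, w, S=0.1, mu=0.01):
    """One update: filter the activities, square, divide by pooled activity.

    The product of 2-D FFTs implements the circular convolution of the
    activity pattern with the kernel; the division is the divisive
    normalization by the total squared activity.
    """
    u = np.real(np.fft.ifft2(np.fft.fft2(w)*np.fft.fft2(o)))
    u2 = u**2
    return u2/(S + mu*u2.sum())
```

Iterating `normalization_step` two or three times on a noisy input pattern smooths it into a hill whose peak can be read out as the estimate of θ0 and λ0.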
The network is therefore a close approximation to a maximum likelihood estimator, i.e., it is close to being an ideal observer of the LGN activity with respect to orientation and spatial frequency. \n\nAs long as the contrast, C, was suprathreshold, large variations in contrast did not affect our results (figure 1-B). However, the tuning of the network units to contrast after reaching the stable state was found to follow a step function whereas, for real neurons, the curves are better described by a sigmoid [2]. Improved agreement with experiment was achieved by taking only 2-3 iterations, at which point the performance of the network is close to optimal (figure 1-B) and the tuning curves to contrast are more realistic and closer to sigmoids (figure 1-C). Therefore, reaching a stable state is not required for optimal performance, and in fact leads to contrast tuning curves that are inconsistent with experiment. \n\n3 Mathematical Analysis \n\nWe first prove that line attractor networks with sufficiently small noise are close approximations to a maximum likelihood estimator. We then show how this result applies to our simulations with divisive normalization. \n\n3.1 General Case: Line Attractor Networks \n\nLet o_n be the activity vector (denoted by bold type) at discrete time, n, for a set of P interconnected units. We consider a one-dimensional network, i.e., only one feature is encoded; generalization to multidimensional networks is straightforward. A generic mapping for this network may be written \n\no_{n+1} = H(o_n),   (5) \n\nwhere H is a nonlinear function. We assume that this mapping admits a line attractor, which we denote G(θ), for which G(θ) = H(G(θ)), where θ is a continuous variable. 
Let the initial state of the network be a function of the presentation parameter, θ0, plus noise, \n\no_0 = F(θ0) + N,   (6) \n\nwhere F(θ0) is the function used to generate the data (in our simulations this would correspond to the mean LGN input, equation 1). Iterating the mapping, equation 5, leads eventually to a point on the line attractor. Consequently, as n → ∞, o_n → G(θ). The parameter θ provides an estimate of θ0. \n\nTo determine how well the network does we need to find δθ ≡ θ − θ0 as a function of the noise, N, then average over the noise to compute the mean and variance of δθ. Because the mapping, equation 5, is nonlinear, this cannot be done exactly. For small noise, however, we can take a perturbative approach and expand around a point on the attractor. For line attractors there is no general method for choosing which point on the attractor to expand around. Our approach will be to expand around an arbitrary point, G(θ), and choose θ by requiring that the quadratic terms be finite. Keeping terms up to quadratic order, equation 6 may be written \n\no_n = G(θ) + δo_n,   (7) \n\nδo_n = J^n · δo_0 + (1/2) Σ_{m=0}^{n−1} (J^m · δo_0) · H″ · (J^m · δo_0),   (8) \n\nwhere J(θ) ≡ [∂H(G(θ))/∂G(θ)]^T is the Jacobian (the superscript T denotes the transpose), H″ is the Hessian of H evaluated at G(θ), and a \"·\" represents the standard dot product. \n\nBecause the mapping, equation 5, admits a line attractor, J has one eigenvalue equal to 1 and all others less than 1. Denote the eigenvector with eigenvalue 1 as v and its adjoint as v†: J · v = v and J^T · v† = v†. It is not hard to show that v = ∂G(θ)/∂θ, up to a multiplicative constant. Since J has an eigenvalue equal to 1, to avoid the quadratic term in equation 8 approaching infinity as n → ∞ we require that \n\nlim_{n→∞} J^n · δo_0 = 0.   (9) \n\n(Footnote 1: The line attractor is, in fact, an idealization; for P units the set of attractors associated with equation 5 consists of P isolated points. However, for P large, the attractors are spaced closely enough that they may be considered a line.) \n\nThis equation has an important consequence: it implies that, to linear order, lim_{n→∞} δo_n = 0 (see equation 8), which in turn implies that o_∞ = G(θ), which, finally, implies that θ̂ = θ. Consequently we can find the network estimator of θ0, θ̂, by computing θ. We now turn to that task. \n\nIt is straightforward to show that J^∞ = v v†. Combining this expression for J with equation 9, using equation 7 to express δo_0 in terms of o_0 and G(θ), and, finally, using equation 6 to express o_0 in terms of the initial mean activity, F(θ0), and the noise, N, we find that \n\nv†(θ) · [F(θ0) − G(θ) + N] = 0.   (10) \n\nUsing θ0 = θ − δθ and expanding F(θ0) to first order in δθ then yields \n\nδθ = v†(θ) · [N + F(θ) − G(θ)] / ( v†(θ) · F′(θ) ).   (11) \n\nAs long as v† is orthogonal to F(θ) − G(θ), ⟨δθ⟩ = 0 and the estimator is unbiased. This must be checked on a case by case basis, but for the circularly symmetric networks we considered, orthogonality is satisfied. \n\nWe can now calculate the variance of the network estimate, ⟨δθ²⟩. Assuming v† · [F(θ) − G(θ)] = 0, equation 11 implies that \n\n⟨δθ²⟩ = (v† · R · v†) / [v† · F′]²,   (12) \n\nwhere a prime denotes a derivative with respect to θ and R is the covariance matrix of the noise, R = ⟨NN⟩. The network is equivalent to maximum likelihood when this variance is equal to the Cramér-Rao bound [3], ⟨δθ²⟩_CR. If the noise, N, is Gaussian with a covariance matrix independent of θ, this bound is equal to: \n\n⟨δθ²⟩_CR = 1 / (F′ · R⁻¹ · F′).   (13) \n\nFor independent Gaussian noise of fixed variance, σ², and zero covariance, the variance of the network estimate, equation 12, becomes σ² / (|F′|² cos²μ), where μ is the angle between v† and F′. The Cramér-Rao bound, on the other hand, is equal to σ² / |F′|². These expressions differ only by the factor cos²μ, which is 1 if F′ ∝ v†. In addition, it is close to 1 for networks that have identical input and output tuning curves, F(θ) = G(θ), and a Jacobian, J, that is nearly symmetric, so that v† ≈ v (recall that v = G′). If these last two conditions are satisfied, the network comes close to being a maximum likelihood estimator. \n\n3.2 Application to Divisive Normalization \n\nDivisive normalization is a particular example of the general case considered above. For simplicity, in our simulations we chose the input and output tuning curves to be equal (F = G in the above notation), which led to a value of 0.87 for cos²μ (evaluated numerically). This predicted a variance 15% above the Cramér-Rao bound for independent Gaussian noise with fixed variance, consistent with the 16% we obtained in our simulations. The network also handles fairly well other noise distributions, such as Gaussian noise with variance proportional to the mean, as illustrated by our simulations. \n\n4 Conclusions \n\nWe have recently shown that a subclass of line attractor networks can be used as maximum likelihood estimators [3]. This paper extends this conclusion to a much wider class of networks, namely, any network that admits a line (or, by straightforward extension of the above analysis, a higher dimensional) attractor. This is true in particular for networks using divisive normalization, a normalization which is thought to match quite closely the nonlinearity found in the primary visual cortex and MT. 
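The relation between the network variance (equation 12) and the Cramér-Rao bound (equation 13) can be checked numerically: for independent noise of fixed variance, R = σ²I, the ratio of the two is exactly 1/cos²μ. The sketch below verifies this identity with arbitrary random vectors standing in for F′ and v† (these stand-ins are illustrative, not the network's actual tuning curve or eigenvector):

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2 = 0.5
Fp = rng.standard_normal(50)          # stand-in for F'(theta)
vd = rng.standard_normal(50)          # stand-in for the adjoint eigenvector v†
R = sigma2*np.eye(50)                 # independent noise of fixed variance

# Network estimator variance, equation 12: (v† · R · v†) / (v† · F')².
var_net = vd @ R @ vd / (vd @ Fp)**2

# Cramér-Rao bound, equation 13: 1 / (F' · R⁻¹ · F').
var_cr = 1.0/(Fp @ np.linalg.inv(R) @ Fp)

# The two differ only by cos²μ, with μ the angle between v† and F'.
cos2_mu = (vd @ Fp)**2/((vd @ vd)*(Fp @ Fp))
assert np.isclose(var_net, var_cr/cos2_mu)
```

Since cos²μ ≤ 1, the network variance can never fall below the bound; it reaches the bound exactly when v† is parallel to F′.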
\n\nAlthough our analysis relies on the existence of an attractor, this is not a requirement for obtaining near optimal noise filtering. As we have seen, 2-3 iterations are enough to achieve asymptotic performance (except at contrasts barely above threshold). What matters most is that our network implements a sequence of low-pass filtering to filter out the noise, followed by a squaring nonlinearity to compensate for the widening of the tuning curve due to the low-pass filter, and a normalization to weaken the contrast dependence. It is likely that this process would still clean up noise efficiently in the first 2-3 iterations even if activity decayed to zero eventually, that is to say, even if the hills of activity were not stable states. This would allow us to apply our approach to other types of networks, including those lacking circular symmetry and networks with continuously clamped inputs. \n\nTo conclude, we propose that each cortical layer may read out the activity in the preceding layer in an optimal way thanks to the nonlinear pooling properties of divisive normalization, and, as a result, may behave like an ideal observer. It is therefore possible that the ability to read out neuronal codes in the sensory cortices in an optimal way may not be confined to a few areas like the parietal or frontal cortex, but may instead be a general property of every cortical layer. \n\nReferences \n\n[1] D. Heeger. Normalization of cell responses in cat striate cortex. Visual Neuroscience, 9:181-197, 1992. \n\n[2] L. Itti, C. Koch, and J. Braun. A quantitative model for human spatial vision threshold on the basis of non-linear interactions among spatial filters. In R. Lippman, J. Moody, and D. Touretzky, editors, Advances in Neural Information Processing Systems, volume 11. Morgan-Kaufmann, San Mateo, 1998. \n\n[3] A. Pouget, K. Zhang, S. Deneve, and P. Latham. Statistically efficient estimation using population coding. Neural Computation, 10:373-401, 1998. \n\n[4] E. Simoncelli and D. Heeger. A model of neuronal responses in visual area MT. Vision Research, 38(5):743-761, 1998. ", "award": [], "sourceid": 1536, "authors": [{"given_name": "Sophie", "family_name": "Den\u00e8ve", "institution": null}, {"given_name": "Alexandre", "family_name": "Pouget", "institution": null}, {"given_name": "Peter", "family_name": "Latham", "institution": null}]}