{"title": "Neural Implementation of Bayesian Inference in Population Codes", "book": "Advances in Neural Information Processing Systems", "page_first": 317, "page_last": 323, "abstract": null, "full_text": "Neural Implementation of Bayesian \n\nInference in Population Codes \n\nSi Wu \n\nComputer Science Department \n\nSheffield University, UK \n\nShun-ichi Amari \n\nLab. for Mathematic Neuroscience, \n\nRIKEN Brain Science Institute, JAPAN \n\nAbstract \n\nThis study investigates a population decoding paradigm, in which \nthe estimation of stimulus in the previous step is used as prior \nknowledge for consecutive decoding. We analyze the decoding accu(cid:173)\nracy of such a Bayesian decoder (Maximum a Posteriori Estimate), \nand show that it can be implemented by a biologically plausible \nrecurrent network, where the prior knowledge of stimulus is con(cid:173)\nveyed by the change in recurrent interactions as a result of Hebbian \nlearning. \n\n1 \n\nIntroduction \n\nInformation in the brain is not processed by a single neuron, but rather by a popu(cid:173)\nlation of them. Such a coding strategy is called population coding. It is conceivable \nthat population coding has advantage of being robust to the fluctuation in a single \nneuron's activity. However, people argue that population coding may have other \ncomputationally desirable properties. One such property is to provide a framework \nfor encoding complex objects by using basis functions [1]. This is inspired by the \nrecent progresses in nonlinear function approximation, such as, sparse coding, over(cid:173)\ncomplete representation and kernel regression. These methods are efficient and show \nsome interesting neuron-like behaviors [2,3]. It is reasonable to think that similar \nstrategies are used in the brain under the support of population codes. However, \nto confirm this idea, a general suspicion has to be clarified: can the brain perform \nsuch complex statistic inference? 
An important step towards answering this question was made by Pouget and co-authors [4,5]. They showed that Maximum Likelihood (ML) inference, which is usually thought to be complex, can be implemented by a biologically plausible recurrent network using the idea of a line attractor. \n\nML is a special case of Bayesian inference in which the stimulus is (or is assumed to be) uniformly distributed. When there is prior knowledge of the stimulus distribution, the Maximum a Posteriori (MAP) estimate has better performance. Zhang et al. have successfully applied MAP to reconstructing a rat's position in a maze from the activity of hippocampal place cells [6]. In their method, the prior knowledge is the rat's position in the previous time step, which restricts the variability of the rat's position in the current step under a continuity constraint. It turns out that MAP has much better performance than other decoding methods, and overcomes the inefficiency of ML when information is not sufficient (when the rat stops running). \n\nThis result implies that MAP may be used by the nervous system. So far, MAP has mainly been studied in the literature as a mathematical tool for reconstructing data, though its potential neural implementation was pointed out by [1,6]. \n\nIn the present study, we show concretely how to implement MAP in a biologically plausible way. The same kind of recurrent network used for achieving ML is employed [4,5]. The decoding process consists of two steps. In the first step, when there is no prior knowledge of the stimulus, the network implements ML. Its estimate is subsequently used to form the prior distribution of the stimulus for consecutive decoding, which we assume to be a Gaussian function whose mean is that estimate. It turns out that this prior knowledge can be conveyed naturally by the change in the recurrent interactions according to the Hebbian learning rule.
This is an interesting finding and suggests a new role for Hebbian learning. In the second step, with the changed interactions, the network implements MAP. The decoding accuracy of MAP and the optimal form of the Gaussian prior are also analyzed in this paper. \n\n2 MAP in Population Codes \n\nLet us consider a standard population coding paradigm. There are N neurons coding for a stimulus x. The population activity is denoted by r = {r_i}. Here r_i is the response of the ith neuron, which is given by \n\nr_i = f_i(x) + ε_i, (1) \n\nwhere f_i(x) is the tuning function and ε_i is random noise. \n\nThe encoding process of a population code is specified by the conditional probability Q(r|x) (i.e., the noise model). Decoding is to infer the value of x from the observed r. \n\nWe consider a general Bayesian inference in a population code, which estimates the stimulus by maximizing the log posterior distribution ln P(x|r), i.e., \n\nx̂ = argmax_x ln P(x|r) = argmax_x [ln P(r|x) + ln P(x)], (2) \n\nwhere P(r|x) is the likelihood function. It can be equal to or different from the true encoding model Q(r|x), depending on the available information about the encoding process [7]. P(x) is the distribution of x, representing the prior knowledge. This method is also called Maximum a Posteriori (MAP). When the distribution of x is uniform, or is assumed to be uniform (when there is no prior knowledge), MAP is equivalent to ML. \n\nMAP could be used in the information processing of the brain on several occasions. Let us consider the following scenario: a stimulus is decoded in multiple steps. This happens when the same stimulus is presented over multiple steps, or when, during a single presentation, neural signals are sampled many times. In both cases, the brain successively gains a rough estimate of the stimulus in each decoding step, which can serve as the prior knowledge for further decoding. It is therefore natural to use MAP in this situation.
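As a minimal numerical sketch of eq. (2), the MAP estimate can be computed by a grid search over the log posterior. This is illustrative only: the Gaussian tuning curves, the independent Gaussian noise model, and all parameter values below are assumptions made for this example, not the paper's simulation.

```python
import numpy as np

def tuning(c, x, a=1.0):
    # Gaussian tuning curve of a neuron with preferred stimulus c
    return np.exp(-(c - x) ** 2 / (2 * a ** 2)) / (np.sqrt(2 * np.pi) * a)

def map_decode(r, centers, x_grid, sigma, x_prev=None, tau=None):
    # log-likelihood for independent Gaussian noise of variance sigma^2
    logp = np.array([-np.sum((r - tuning(centers, x)) ** 2) / (2 * sigma ** 2)
                     for x in x_grid])
    if x_prev is not None:
        # Gaussian log-prior centered on the previous estimate (cf. eq. 3)
        logp = logp - (x_grid - x_prev) ** 2 / (2 * tau ** 2)
    return x_grid[np.argmax(logp)]  # eq. (2): maximize the log posterior

rng = np.random.default_rng(0)
centers = np.linspace(-3, 3, 101)
x_grid = np.linspace(-1, 1, 2001)
sigma, x_true = 0.1, 0.0

r1 = tuning(centers, x_true) + sigma * rng.standard_normal(centers.size)
x1 = map_decode(r1, centers, x_grid, sigma)                      # step 1: ML
r2 = tuning(centers, x_true) + sigma * rng.standard_normal(centers.size)
x2 = map_decode(r2, centers, x_grid, sigma, x_prev=x1, tau=0.1)  # step 2: MAP
```

With x_prev set, the prior term pulls the estimate toward the previous one; with x_prev left unset the decoder reduces to ML, mirroring the uniform-prior case above.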
Experiencing slightly different stimuli in consecutive steps, as studied in [6], or more generally a stimulus that changes slowly with time (the multiple-step scheme being its discretized approximation), is a similar scenario. For simplicity, we consider only an unchanged stimulus in the present study. \n\n2.1 The Performance of MAP \n\nLet us analyze the performance of MAP. Some notation is introduced first. Denote by x̂_t a particular estimate of the stimulus in the tth step, and by σ_t² the corresponding variance. The prior distribution of x in the (t+1)th step is assumed to be a Gaussian with mean x̂_t, i.e., \n\nP(x|x̂_t) = (1/(√(2π) τ_t)) exp(-(x - x̂_t)²/2τ_t²), (3) \n\nwhere the parameter τ_t reflects the estimator's confidence in x̂_t; its optimal value will be calculated later. \n\nThe posterior distribution of x in the (t+1)th step is given by \n\nP(x|r) = P(r|x) P(x|x̂_t) / P(r), (4) \n\nand the solution of MAP is obtained by solving \n\n∇ln P(x̂_{t+1}|r) = ∇ln P(r|x̂_{t+1}) - (x̂_{t+1} - x̂_t)/τ_t² = 0. (5) \n\nWe calculate the decoding accuracies iteratively. In the first decoding step, since there is no prior knowledge of x, ML is used, whose decoding accuracy is known to be [7] \n\nσ_1² = <(∇ln P(r|x))²> / <-∇∇ln P(r|x)>², (6) \n\nwhere the bracket <·> denotes averaging over Q(r|x). \n\nNote that, to obtain the above result, we have assumed that ML is asymptotically or quasi-asymptotically (when an unfaithful model is used) efficient [7]. This covers the cases in which neural responses are independent, weakly correlated, uniformly correlated, or correlated with strength proportional to the firing rate (multiplicative correlation), or in which the fluctuations in neural responses are sufficiently small. In other, strongly correlated cases, ML has been proved to be non-Fisherian, i.e., its decoding error follows a Cauchy-type distribution whose variance diverges.
Decoding accuracy can no longer be quantified by variance in such situations (for details, please refer to [8]). \n\nNow we calculate the decoding error in the second step. Suppose x̂_2 is close enough to x. Expanding ∇ln P(r|x̂_2) around x in eq. (5), we obtain \n\n∇ln P(r|x) + ∇∇ln P(r|x)(x̂_2 - x) - (x̂_2 - x̂_1)/τ_1² = 0. (7) \n\nThe random variable x̂_1 can be decomposed as x̂_1 = x + ε_1, where ε_1 is a random number with Gaussian distribution of zero mean and variance σ_1². \n\nUsing the notation ε_1, we have \n\nx̂_2 - x = (∇ln P(r|x) + ε_1/τ_1²) / (1/τ_1² - ∇∇ln P(r|x)). (8) \n\nFor the correlation cases considered in the present study (i.e., those ensuring that ML is asymptotically or quasi-asymptotically efficient), -∇∇ln P(r|x) can be approximated by a (positive) constant according to the law of large numbers [7,8]. Therefore, we can define a constant \n\nα = τ_1² (-∇∇ln P(r|x)), (9) \n\nand a random variable \n\nR = ∇ln P(r|x) / (-∇∇ln P(r|x)). (10) \n\nObviously R has a Gaussian distribution of zero mean and variance σ_1². \n\nUsing the notations α and R, we get \n\nx̂_2 - x = (αR + ε_1) / (1 + α), (11) \n\nwhose variance is calculated to be \n\nσ_2² = σ_1² (1 + α²)/(1 + α)². (12) \n\nSince (1 + α²)/(1 + α)² ≤ 1 holds for any positive α, the decoding accuracy in the second step is always improved. It is not difficult to check that its minimum value, \n\nσ_2² = σ_1²/2, (13) \n\nis attained when α = 1, i.e., when the optimal value of τ_1 is \n\nτ_1² = 1/(-∇∇ln P(r|x)). (14) \n\nWhen a faithful model is used, -∇∇ln Q(r|x) is the Fisher information; τ_1² then equals the variance of the decoding error. This is understandable. \n\nFollowing the same procedure, it can be proved that the optimal decoding accuracy in the tth step is σ_t² = σ_1²/t, achieved when the width of the Gaussian prior is τ_t² = τ_1²/t.
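Eq. (12), and the optimum at α = 1, can be checked by a small Monte Carlo simulation of eq. (11). This is a sketch under the stated assumptions only: R and ε_1 are drawn as independent Gaussians of variance σ_1², and the sample size is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma1_sq = 1.0
n = 200_000
results = {}

for alpha in (0.5, 1.0, 2.0):
    # eq. (11): x2 - x = (alpha R + eps1) / (1 + alpha)
    R = rng.normal(0.0, np.sqrt(sigma1_sq), n)
    eps1 = rng.normal(0.0, np.sqrt(sigma1_sq), n)
    err = (alpha * R + eps1) / (1 + alpha)
    # eq. (12): predicted variance of the second-step error
    predicted = sigma1_sq * (1 + alpha ** 2) / (1 + alpha) ** 2
    results[alpha] = (err.var(), predicted)
```

The empirical variance matches eq. (12) for each α, and is smallest (σ_1²/2) at α = 1, in line with eqs. (13) and (14).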
\nIt is interesting to see that the above multiple-step decoding procedure, when the optimal values of τ_t are used, achieves the same decoding accuracy as a one-step ML using all N × t signals. This is the best any estimator can achieve. However, multiple-step decoding is not a trivial replacement for one-step ML, and has many advantages. One of them is saving memory, since only N signals and the value of the previous estimate are stored in each step. Moreover, when a slowly changing stimulus is concerned, multiple-step decoding outperforms one-step ML through its balance between adaptation and memory. These properties are valuable when information is processed in the brain. \n\n3 Network Implementation of MAP \n\nIn this section, we investigate how to implement MAP by a recurrent network. A two-step decoding is studied. Without loss of generality, we consider N → ∞ and perform the calculation in the continuous limit. \n\nThe network we consider is a fully connected one-dimensional homogeneous neural field, in which c denotes the position coordinate, i.e., the neurons' preferred stimuli. The tuning function of the neuron with preferred stimulus c is \n\nf_c(x) = (1/(√(2π) a)) exp(-(c - x)²/2a²). (15) \n\nFor simplicity, we consider an encoding process in which the fluctuations in neurons' responses are independent Gaussian noise (more general correlated cases can be handled similarly), that is, \n\nQ(r|x) = (1/Z) exp(-(ρ/2σ²) ∫ (r_c - f_c(x))² dc), (16) \n\nwhere ρ is the neuron density and Z is the normalization factor. A faithful model is used in both decoding steps, i.e., P(r|x) = Q(r|x) (again, generalization to the case P(r|x) ≠ Q(r|x) is straightforward). \n\nFor the above model setting, the solution of ML in the first step is calculated to be \n\nx̂_1 = argmax_x ∫ r_c f_c(x) dc, (17) \n\nwhere the condition ∫ f_c²(x) dc = const has been used.
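The step from the Gaussian likelihood (16) to the template-matching form (17) relies on ∫ f_c(x)² dc being independent of x, so that minimizing the squared error is the same as maximizing the overlap with the tuning function. This can be verified numerically; the grid, noise level, and stimulus value below are assumptions of this sketch.

```python
import numpy as np

c = np.linspace(-6, 6, 1201)
dc = c[1] - c[0]
a = 1.0

def f(x):
    # tuning function of eq. (15)
    return np.exp(-(c - x) ** 2 / (2 * a ** 2)) / (np.sqrt(2 * np.pi) * a)

# int f_c(x)^2 dc is independent of x (for x well inside the field)
norms = [np.sum(f(x) ** 2) * dc for x in (-1.0, 0.0, 1.0)]

# hence argmin int (r_c - f_c(x))^2 dc coincides with argmax int r_c f_c(x) dc
rng = np.random.default_rng(4)
r = f(0.3) + 0.02 * rng.standard_normal(c.size)
xg = np.linspace(-2, 2, 801)
sq = [np.sum((r - f(x)) ** 2) * dc for x in xg]    # squared-error criterion
ov = [np.sum(r * f(x)) * dc for x in xg]           # overlap criterion, eq. (17)
x_sq, x_ov = xg[np.argmin(sq)], xg[np.argmax(ov)]
```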
\nThe solution of MAP in the second step is \n\nx̂_2 = argmax_x ∫ r_c f_c(x) dc - (x - x̂_1)²/2τ_1². (18) \n\nCompared with eq. (17), eq. (18) has one more term, corresponding to the contribution of the prior distribution. \n\nWe now study how to realize eqs. (17) and (18) with a recurrent network. Following the idea of Pouget et al. [4,5], the following network dynamics is constructed. Let U_c denote the (average) internal state of the neuron at c, and W_{c,c'} the recurrent connection weight from neurons at c to those at c'. The dynamics of neural excitation is governed by \n\nτ dU_c/dt = -U_c + ∫ W_{c,c'} O_{c'} dc' + I_c, (19) \n\nwhere \n\nO_c = U_c² / (1 + μ ∫ U_{c'}² dc') (20) \n\nis the activity of the neuron at c and I_c is the external input arriving at c. \n\nThe recurrent interactions are chosen to be \n\nW_{c,c'} = exp(-(c - c')²/2a²), (21) \n\nwhich ensures that when there is no external input (I_c = 0), the network is neutrally stable on a line attractor, \n\nO_c(z) = D exp(-(c - z)²/2a²), ∀z, (22) \n\nwhere the parameter D is a constant and can be determined easily. Note that the line attractor has the same shape as the tuning function. This is crucial, since it allows the network to perform template matching using the tuning function, just as ML and MAP do. \n\nWhen a sufficiently small input I_c is added, the network is no longer neutrally stable on the line attractor. It can be proved that the steady state of the network has approximately the same shape as eq. (22) (the deviation is of second order in the magnitude of I_c), whereas its steady position on the line attractor (i.e., the network estimate) is determined by maximizing the overlap between I_c and O_c(z) [4,9]. \n\nThus, if I_c = εr_c in the first step (an instant input triggering the network to start at O_c(t = 0) = r_c, as used in [5], gives the same result), where ε is a sufficiently small number, the network estimate is given by \n\nẑ_1 = argmax_z ∫ r_c O_c(z) dc, (23) \n\nwhich has the same value as the solution of ML (see eq. (17)). We say that the network implements ML. \n\nTo implement MAP in the second step, it is critical to identify a neural mechanism which can 'transmit' the prior knowledge obtained in the first step to the second one. We find that this is naturally done by Hebbian learning. \n\nAfter the first decoding step, the recurrent interactions change by a small amount according to the Hebbian rule; their new values are \n\nW'_{c,c'} = W_{c,c'} + η O_c(ẑ_1) O_{c'}(ẑ_1), (24) \n\nwhere η is a small positive number representing the Hebbian learning rate, and O_c(ẑ_1) is the neuron activity in the first step. \n\nWith the new recurrent interactions, the net input from other neurons to the one at c becomes \n\n∫ W'_{c,c'} O_{c'} dc' = ∫ W_{c,c'} O_{c'} dc' + η O_c(ẑ_1) ∫ O_{c'}(ẑ_1) O_{c'} dc' ≈ ∫ W_{c,c'} O_{c'} dc' + ν O_c(ẑ_1), (25) \n\nwhere ν is a small constant. To get the last approximation, the following facts have been used: 1) the initial state of the neurons in the second step is O_c(ẑ_1); 2) the neuron activity O_c during the second step lies between O_c(ẑ_1) and O_c(ẑ_2), where ẑ_2 is the position of the steady state; 3) (ẑ_1 - ẑ_2)²/2a² ≪ 1, since neurons are widely tuned as seen in data (a is large) and consecutive estimates are close enough. These facts ensure that the approximation ∫ O_{c'}(ẑ_1) O_{c'} dc' ≈ const is good enough. \n\nSubstituting eq. (25) into (19), we see that the network dynamics in the second step, compared with the first, amounts to modifying the input I_c to I'_c = ε(r_c + λ O_c(ẑ_1)), where λ is a constant and can be determined easily. \n\nThus, the network estimate in the second step is determined by maximizing the overlap between I'_c and O_c(z), which gives \n\nẑ_2 = argmax_z ∫ r_c O_c(z) dc + λ ∫ O_c(ẑ_1) O_c(z) dc. (26) \n\nThe first term on the right-hand side is known to achieve ML.
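The two readouts of eqs. (23) and (26) can be sketched directly as template matching over the line-attractor profile. This is illustrative code only: the profile normalization D = 1, the noise level, and the value of λ are assumptions of this example, the relaxation dynamics of eqs. (19)-(20) is not simulated, and the prior term of eq. (26) is added by hand rather than arising from the Hebbian weight change of eq. (24).

```python
import numpy as np

c = np.linspace(-3, 3, 601)
dc = c[1] - c[0]
a = 1.0

def O(z):
    # line-attractor profile, same shape as the tuning function (eq. 22), D = 1
    return np.exp(-(c - z) ** 2 / (2 * a ** 2))

z_grid = np.linspace(-1, 1, 401)
rng = np.random.default_rng(3)
x_true = 0.0
r = O(x_true) + 0.2 * rng.standard_normal(c.size)

# eq. (23): first-step (ML) readout by maximizing the overlap with r
overlap = np.array([np.sum(r * O(z)) * dc for z in z_grid])
z1 = z_grid[np.argmax(overlap)]

# eq. (26): second-step (MAP) readout; the extra term lam * int O(z1) O(z) dc
# plays the role of the Gaussian prior centered on z1
lam = 0.5
r2 = O(x_true) + 0.2 * rng.standard_normal(c.size)
obj = np.array([np.sum(r2 * O(z)) * dc + lam * np.sum(O(z1) * O(z)) * dc
                for z in z_grid])
z2 = z_grid[np.argmax(obj)]
```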
Let us now see the contribution of the second term, which can be transformed as \n\n∫ O_c(ẑ_1) O_c(z) dc = B exp(-(ẑ_1 - z)²/4a²) ≈ -B(z - ẑ_1)²/4a² + terms not depending on z, (27) \n\nwhere B is a constant. Again, in the above calculation, (ẑ_1 - z)²/4a² ≪ 1 is used, by the same argument discussed above. \n\nComparing eqs. (18) and (27), we see that the second term plays the same role as the prior knowledge in MAP. Thus, the network indeed implements MAP. The value of λ (or the Hebbian learning rate) can be adjusted to match the optimal choice of τ_1². \n\nThe above result is confirmed by a simulation experiment (Table 1), carried out with 101 neurons uniformly distributed in the region [-3, 3] and the true stimulus at 0. It shows that the estimate of the network agrees well with MAP. \n\nTable 1: Comparing the decoding accuracies of the network and MAP for different values of a (the corresponding values of τ_1² and λ are adjusted). The parameters are a = 1, μ = 0.5 and σ² = 0.01. The data are obtained over 100 trials. \n\n4 Conclusion and Discussion \n\nIn summary, we have investigated how to implement MAP using a biologically plausible recurrent network. A two-step decoding paradigm was studied. In the first step, when there is no prior knowledge, the network implements ML, whose estimate is subsequently used to form the prior distribution of the stimulus for consecutive decoding. In the second step, the network implements MAP. \n\nThe line attractor and Hebbian learning are the two critical elements for implementing MAP. The former enables the network to perform template matching using the tuning function, just as ML and MAP do. The latter provides the mechanism that conveys the prior knowledge obtained in the first step to the second one.
Though the results in this paper may depend quantitatively on the formulation of the models, it is reasonable to believe that they hold qualitatively, as both Hebbian learning and line attractors are biologically plausible. The line attractor arises from the translation invariance of the network interactions, and has been shown to be involved in several neural computations [10-12]. We expect that the essential idea of Bayesian inference, utilizing previous knowledge for successive decoding, is used in the information processing of the brain. \n\nWe have also analyzed the decoding accuracy of MAP in a population code and the optimal form of the Gaussian prior. In the present study, the stimulus is kept fixed during consecutive decodings. A generalization to the case of a stimulus changing slowly over time is straightforward. \n\nReferences \n\n[1] A. Pouget, P. Dayan & R. Zemel. Nature Reviews Neuroscience, 1, 125-132, 2000. \n\n[2] B. Olshausen & D. Field. Nature, 381, 607-609, 1996. \n\n[3] T. Poggio & F. Girosi. Neural Computation, 10, 1445-1454, 1998. \n\n[4] A. Pouget & K. Zhang. NIPS, 9, 1997. \n\n[5] S. Deneve, P. E. Latham & A. Pouget. Nature Neuroscience, 2, 740-745, 1999. \n\n[6] K. Zhang, I. Ginzburg, B. McNaughton & T. Sejnowski. J. Neurophysiol., 79, 1017-1044, 1998. \n\n[7] S. Wu, H. Nakahara & S. Amari. Neural Computation, 13, 775-798, 2001. \n\n[8] S. Wu, S. Amari & H. Nakahara. CNS*01 (to appear). \n\n[9] S. Wu, S. Amari & H. Nakahara. Neural Computation (in press). \n\n[10] S. Amari. Biological Cybernetics, 27, 77-87, 1977. \n\n[11] K. Zhang. J. Neurosci., 16, 2112-2126, 1996. \n\n[12] H. Seung. Proc. Natl. Acad. Sci. USA, 93, 13339-13344, 1996. \n", "award": [], "sourceid": 1988, "authors": [{"given_name": "Si", "family_name": "Wu", "institution": null}, {"given_name": "Shun-ichi", "family_name": "Amari", "institution": null}]}