{"title": "Temporally Dependent Plasticity: An Information Theoretic Account", "book": "Advances in Neural Information Processing Systems", "page_first": 110, "page_last": 116, "abstract": null, "full_text": "Temporally Dependent Plasticity: An Information Theoretic Account \n\nGal Chechik and Naftali Tishby \n\nSchool of Computer Science and Engineering and the Interdisciplinary Center for Neural Computation, The Hebrew University, Jerusalem, Israel \n\n{ggal,tishby}@cs.huji.ac.il \n\nAbstract \n\nThe paradigm of Hebbian learning has recently received a novel interpretation with the discovery of synaptic plasticity that depends on the relative timing of pre- and postsynaptic spikes. This paper derives a temporally dependent learning rule from the basic principle of mutual information maximization and studies its relation to the experimentally observed plasticity. We find that a supervised spike-dependent learning rule sharing a similar structure with the experimentally observed plasticity increases mutual information to a stable near-optimal level. Moreover, the analysis reveals how the temporal structure of time-dependent learning rules is determined by the temporal filter applied by neurons over their inputs. These results suggest experimental predictions as to the dependence of the learning rule on neuronal biophysical parameters. \n\n1 Introduction \n\nHebbian plasticity, the major paradigm for learning in computational neuroscience, was until a few years ago interpreted as learning by correlated neuronal activity. A series of studies have recently shown that changes in synaptic efficacies depend strongly on the relative timing of the pre- and postsynaptic spikes: the efficacy of a synapse between two excitatory neurons increases when the presynaptic spike precedes the postsynaptic one, but decreases otherwise [1-6]. 
The magnitude of these synaptic changes decays roughly exponentially as a function of the time difference between pre- and postsynaptic spikes, with a time constant of a few tens of milliseconds (results vary between studies, especially with regard to the synaptic depression component; compare e.g. [4] and [6]). \n\nWhat could be the computational role of this delicate type of plasticity, sometimes termed spike-timing dependent plasticity (STDP)? Several authors have suggested answers to this question by modeling STDP and studying its effects on synaptic, neural and network dynamics. Importantly, STDP embodies an inherent competition between incoming inputs, and was shown to result in normalization of total incoming synaptic strength [7], to maintain the irregularity of neuronal firing [8, 9], and to lead to the emergence of synchronous subpopulation firing in recurrent networks [10]. It may also play an important role in sequence learning [11, 12]. The dynamics of synaptic efficacies under the operation of STDP strongly depends on whether STDP is implemented additively (independent of the baseline synaptic value) or multiplicatively (where the change is proportional to the synaptic efficacy) [13]. \n\n*Work supported in part by a Human Frontier Science Project (HFSP) grant RG 0133/1998. \n\nThis paper takes a different approach to the study of spike-dependent learning rules: while the above studies model STDP and study the model properties, we start by deriving a spike-dependent learning rule from first principles within a simple rate model and then compare it with the experimentally observed STDP. To derive our learning rule, we consider the principle of mutual information maximization. This idea, known as the Infomax principle [14], states that the goal of a neural network's learning procedure is to maximize the mutual information between its output and input. 
The current paper applies Infomax to a leaky integrator neuron with spiking inputs. The derivation suggests computational insights into the dependence of the temporal characteristics of STDP on biophysical parameters and shows that STDP may serve to maximize mutual information in a network of spiking neurons. \n\n2 The Model \n\nWe study a network with N input neurons S_1..S_N firing spike trains, and a single output (target) neuron Y. At any point in time, the target neuron accumulates its inputs with some temporal filter F, due to voltage attenuation or the synaptic transfer function \n\nY(t) = Σ_{i=1}^{N} W_i X_i(t),   X_i(t) = ∫^t F(t − t') S_i(t') dt'    (1) \n\nwhere W_i is the synaptic efficacy between the ith input neuron and the target neuron, S_i(t) = Σ_{t_spike} δ(t − t_spike) is the i-th spike train and τ is the membrane time constant. The filter F may be used to consider general synaptic transfer functions and voltage decay effects, but is set here, as an example, to an exponential filter F_τ(x) = exp(−x/τ). The learning goal is to set the synaptic weights W such that M + 1 uncorrelated patterns of input activity ξ^η (η = 0..M) may be discriminated using the output. Each pattern determines the firing rates of the input neurons; thus S is a noisy realization of ξ due to the stochasticity of the point process. The input patterns are presented for periods of length T (on the order of tens of milliseconds). At each period, a pattern ξ^η is randomly chosen for presentation with probability q_η, where most of the patterns are rare (Σ_{η=1}^{M} q_η ≪ 1) but ξ^0 is abundant and may be thought of as a background noisy pattern. It should be stressed that in our model information is coded in the non-stationary rates that underlie the input spike trains. As these rates are not observable, any learning must depend on the observable input spikes that realize those underlying rates. 
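To make the model concrete, the following is a small simulation sketch of this setup. All numbers are illustrative (loosely following the simulation parameters reported with Figure 1), and the variable names (`rates`, `kernel`, etc.) are our own, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters: N inputs, a T-ms presentation period discretized
# into 1-ms bins, membrane time constant tau (all values are assumptions).
N, T, dt, tau = 1000, 20.0, 1.0, 20.0

# A pattern xi fixes the input rates: here 10% of neurons at 40 Hz, rest at 10 Hz.
rates = np.where(rng.random(N) < 0.1, 40.0, 10.0)

# S is a noisy (Poisson) realization of xi: spike with prob rate*dt/1000 per bin.
steps = int(T / dt)
spikes = rng.random((steps, N)) < rates * dt / 1000.0

# Exponential filter F_tau applied over past spikes: a discretized version of
# X_i = sum_{t'} exp((t' - T)/tau) S_i(t'), weighting recent spikes most.
times = (np.arange(steps) + 1) * dt
kernel = np.exp((times - T) / tau)
X = kernel @ spikes

W = 0.1 * rng.random(N)   # positive synaptic efficacies (toy initialization)
Y = float(W @ X)          # leaky-integrator output at the end of the period
```

The filtered inputs `X` are exactly the quantities the learning rules below operate on; a different choice of `F` (e.g. a synaptic transfer kernel) would simply change `kernel`.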
\n\n3 Mutual Information Maximization \n\nLet us focus on a single presentation period (omitting the notation of t), and look at the value of Y at the end of this period, Y = Σ_{i=1}^{N} W_i X_i, with X_i ≡ ∫_{−T}^{0} e^{t'/τ} S_i(t') dt'. Denoting by f(Y) the p.d.f. of Y, the input-output mutual information [15] in this network is defined by \n\nI(Y; η) = h(Y) − h(Y|η),   h(Y) = −∫_{−∞}^{∞} f(y) log(f(y)) dy    (2) \n\nwhere h(Y) is the differential entropy of the Y distribution, and h(Y|η) is the differential entropy given that the network is presented with a known input pattern. This mutual information measures how easy it is to decide which input pattern η was presented to the network by observing the network's output Y. \n\nTo calculate the conditional entropy h(Y|η) we use the assumption that the input neurons fire independently and their number is large; thus the input of the target neuron when the network is presented with the pattern ξ^η is normally distributed, f(Y|η) = N(μ_η, σ_η²), with mean μ_η = ⟨W X^η⟩ and variance σ_η² = ⟨(W X^η)(W X^η)^T⟩ − ⟨W X^η⟩². The brackets denote averaging over the possible realizations of the inputs X^η when the network is presented with the pattern ξ^η. To calculate the entropy of Y we note that f(Y) is a mixture of Gaussians, each resulting from the presentation of an input pattern, and use the assumption Σ_{η=1}^{M} q_η ≪ 1 to approximate the entropy. The details of this derivation are omitted due to space considerations and will be presented elsewhere. Differentiating the mutual information with regard to W_i we obtain \n\n∂I(Y;η)/∂W_i = Σ_{η=1}^{M} q_η ( Cov(Y, X_i^η) K_η + E(X_i^η) K'_η ) − Σ_{η=1}^{M} q_η ( Cov(Y, X_i^0) K_η^0 + E(X_i^0) K'_η )    (3) \n\nwith \n\nK_η = 1/σ_η − 1/σ_0,   K_η^0 = ((μ_η − μ_0)² + σ_η² − σ_0²)/σ_0⁴,   K'_η = (μ_η − μ_0)/σ_0². 
\n\nwhere E(X_i^η) is the expected value of X_i as averaged over presentations of the ξ^η pattern. The general form of this complex gradient is simplified in the following sections, together with a discussion of its use for biological learning. \n\nThe derived gradient may be used for a gradient ascent learning rule by repeatedly calculating the distribution moments μ_η, σ_η that depend on W, and updating the weights according to ΔW_i = λ ∂I(Y;η)/∂W_i. This learning rule climbs along the gradient and is bound to converge to a local maximum of the mutual information. Figure 1A plots the mutual information during the operation of the learning rule, showing that the network indeed reaches a (possibly local) mutual information maximum. Figure 1B depicts the changes in the output distribution during learning, showing that it splits into two segregated bumps: one that corresponds to the ξ^0 pattern and another that corresponds to the rest of the patterns. \n\n4 Learning In A Biological System \n\nAiming to obtain a spike-dependent biologically feasible learning rule that maximizes mutual information, we now turn to approximate the analytical rule derived above by a rule that can be implemented in biology. To this end, four steps are taken, where each step corresponds to a biological constraint and its solution. \n\nFirst, biological synapses are limited either to excitatory or inhibitory regimes. Since information is believed to be coded in the activity of excitatory neurons, we limit the weights W to positive values. \n\nSecondly, the K terms are global functions of the weights and input distributions, since they depend on the distribution moments μ_η, σ_η. To avoid this problem we approximate the learning rule by replacing {K_η, K_η^0, K'_η} with constants {λ_η, λ_0, λ'}. These constants are set to optimal values, but remain fixed once they are set. 
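The gradient ascent procedure of section 3 and the positive-weight constraint above can be illustrated with a small numerical sketch. Here I(Y;η) of Eq. 2 is evaluated on a fixed grid over the Gaussian-mixture output density, and a finite-difference gradient stands in for the analytic form of Eq. 3; all sizes, priors and moments are toy values of our own, not the simulation of Figure 1:

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 20, 5
q = np.r_[0.9, np.full(M, 0.02)]      # abundant background xi^0 plus rare patterns
Ex = rng.random((M + 1, N))           # toy E(X_i^eta); row 0 is the background
Vx = 0.1 * Ex                         # independent inputs -> diagonal covariances

y_grid = np.linspace(-10.0, 40.0, 10000)   # fixed grid covering all output bumps
dy = y_grid[1] - y_grid[0]

def mutual_info(W):
    """I(Y; eta) = h(Y) - h(Y|eta) (Eq. 2), with f(Y|eta) = N(mu_eta, sigma_eta^2)."""
    mu, s2 = Ex @ W, Vx @ (W * W)
    comps = (np.exp(-(y_grid - mu[:, None]) ** 2 / (2 * s2[:, None]))
             / np.sqrt(2 * np.pi * s2[:, None]))
    f = q @ comps                                     # mixture density f(y)
    h_Y = -np.sum(f * np.log(f + 1e-300)) * dy        # differential entropy h(Y)
    h_Y_eta = np.sum(q * 0.5 * np.log(2 * np.pi * np.e * s2))   # h(Y|eta)
    return float(h_Y - h_Y_eta)

W = np.full(N, 0.5)
I0 = mutual_info(W)
lr, eps, eye = 0.2, 1e-3, np.eye(N)
for _ in range(40):                    # DW_i = lambda * dI/dW_i, by central differences
    g = np.array([(mutual_info(W + eps * eye[i]) - mutual_info(W - eps * eye[i]))
                  / (2 * eps) for i in range(N)])
    W = np.clip(W + lr * g, 1e-6, None)   # biological constraint: positive weights
```

After the loop, `mutual_info(W)` exceeds its initial value: the ascent reweights the inputs that best separate the rare patterns from the background, within the positive-weight regime.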
We have found numerically that high performance (to be demonstrated in section 5) may be obtained for a wide regime of these constants. \n\nFigure 1: Mutual information and output distribution along learning with the gradient ascent learning rule (Eq. 3). All patterns were constructed by setting 10% of the input neurons to fire Poisson spike trains at 40 Hz, while the rest fire at 10 Hz. Poisson spike trains were simulated by discretizing time into 1 millisecond bins. Simulation parameters: λ = 1, M = 100, N = 1000, q_0 = 0.9, q_η = 0.001, τ = 20 msec. A. Input-output mutual information. B. Output distribution after 100, 150, 200 and 300 learning steps. Outputs segregate into two distinct bumps: one corresponds to the presentation of the ξ^0 pattern and the other corresponds to the rest of the patterns. \n\nThirdly, summation over patterns embodies a 'batch' mode of learning, requiring very large memory to average over multiple presentations. To implement an online learning rule, we replace summation over patterns by pattern-triggered learning. One should note that the analytical derivation yielded that summation is performed over the rare patterns only (Eq. 3); thus pattern-triggered learning is naturally implemented by restricting learning to presentations of rare patterns¹. \n\nFourthly, the learning rule explicitly depends on E(X) and Cov(Y, X), which are not observables of the model. We thus replace them by performing stochastic weighted averaging over spikes to yield a spike-dependent learning rule. 
In the case of inhomogeneous Poisson spike trains where input neurons fire independently, the covariance terms obey Cov(Y, X_i) = W_i E_{τ/2}(X_i), where E_τ(X_i) = ∫_{−∞}^{t} e^{(t'−t)/τ} E(S_i(t')) dt'. The expectations E(X_i^η) may be simply estimated by weighted averaging of the observed spikes that precede the learning moment. Estimating E(X_i^0) is more difficult because, as stated above, learning should be triggered by the rare patterns only. Thus, ξ^0 spikes should have an effect only when a rare pattern ξ^η is presented. A possible solution is to use the fact that ξ^0 is highly frequent (and therefore spikes in the vicinity of a ξ^η presentation are with high probability ξ^0 spikes) to average over spikes following a ξ^η presentation for background activity estimation. These spikes can be temporally weighted in many ways: from the computational point of view it is beneficial to weigh spikes uniformly along time, but this may require a long \"memory\" and is biologically improbable. We thus refrain from suggesting a specific weighting for background spikes, and obtain the following rule, that is activated only when one of the rare patterns (ξ^η, η = 1..M) is presented \n\nΔW_i = λ' [ λ_η ( E_τ(X_i^η) + W_i E_{τ/2}(X_i^η) ) − λ_0 ( h_1(S_i) + W_i h_2(S_i) ) ]    (4) \n\nwhere h_{1,2}(S(t')) denote the temporal weighting of ξ^0 spikes. It should be noted that this learning rule uses rare pattern presentations as an external (\"supervised\") learning signal. The general form of this learning rule and its performance are discussed in the next section. \n\n¹In fact, learning rules where learning is also triggered by the presentation of the background pattern explicitly depend on the prior probabilities q_η, and thus are not robust to fluctuations in q_η. Since such fluctuations strongly reduce the mutual information obtained by these rules, we conclude that pattern-triggered learning should be triggered by the rare patterns only. 
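A rule of this general shape can be sketched directly on observed spike times. The exact form of Eq. 4 and of the background weighting h_{1,2} are not uniquely fixed by the text, so this is our reconstruction: potentiation from presynaptic spikes preceding the learning signal, weighted by the filters F_τ and F_{τ/2}; depression from spikes following it, weighted uniformly as one admissible choice. The λ values mirror those reported with Figure 2:

```python
import numpy as np

def pattern_triggered_update(W, spike_trains, t_learn, tau=20.0,
                             lam=0.1, lam_eta=0.15, lam_0=0.05, t_bg=20.0):
    """One pattern-triggered update at time t_learn (ms), applied only when a
    rare pattern is presented. Potentiation: spikes before t_learn, weighted by
    F_tau (estimating E(X_i)) and F_{tau/2} (estimating Cov(Y, X_i) via
    W_i E_{tau/2}(X_i)). Depression: spikes in the t_bg ms after t_learn,
    treated as background (xi^0) activity and weighted uniformly here -- the
    derivation leaves this weighting unconstrained."""
    dW = np.zeros_like(W)
    for i, ts in enumerate(spike_trains):
        ts = np.asarray(ts, dtype=float)
        pre = t_learn - ts[ts <= t_learn]             # time since earlier spikes
        post = ts[(ts > t_learn) & (ts <= t_learn + t_bg)]
        pot = np.exp(-pre / tau).sum() + W[i] * np.exp(-pre / (tau / 2)).sum()
        dep = (1.0 + W[i]) * post.size / t_bg         # crude background estimate
        dW[i] = lam * (lam_eta * pot - lam_0 * dep)
    return dW

# A spike just before the learning signal potentiates its synapse;
# a synapse with spikes only after the signal is depressed.
dW = pattern_triggered_update(np.array([0.5, 0.5]),
                              [[99.0], [105.0]], t_learn=100.0)
```

Whatever the precise weighting chosen for the background spikes, the asymmetry of this sketch (potentiation before the learning signal, depression after) is the feature compared against experimental STDP in the next section.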
\n\n5 Analyzing The Biologically Feasible Rule \n\n5.1 Comparing performance \n\nWe have obtained a new spike-dependent learning rule, implementable in a biological system, that approximates an information maximization learning rule. But how good are these approximations? Does learning with the biologically feasible learning rule increase mutual information, and to what level? The curves in figure 2A compare the mutual information of the learning rule of Eq. 3 with that of Eq. 4, as traced in simulation of the learning process. Apparently, the approximated learning rule achieves fairly good performance compared to the optimal rule, and most of the reduction in performance is due to limiting the weights to positive values. \n\n5.2 Interpreting the learning rule structure \n\nThe general form of the learning rule of Eq. 4 is pictorially presented in figure 2B, to allow us to inspect the main features of its structure. First, synaptic potentiation is temporally weighted in a manner that is determined by the same filter F that the neuron applies over its inputs, but learning should apply an average of F and F² (∫^t F(t − t') S(t') dt' and ∫^t F²(t − t') S(t') dt'). The relative weighting of these two components was numerically estimated by simulating the optimal rule of Eq. 3 and was found to be of the same order of magnitude. Second, in our model synaptic depression is targeted at learning the underlying structure of the background activity. Our analysis does not restrict the temporal weighting of the depression curve. \n\nA major difference between the obtained rule and the experimentally observed learning rule is that in our rule learning is triggered by an external learning signal that corresponds to the presentation of rare patterns, while in the experimentally observed rule learning is triggered by the postsynaptic spike. The possible role of the postsynaptic spike is discussed in the following section. 
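The potentiation weighting just described, an average of F and F², can be written down explicitly for the exponential filter F_τ: it is a sum of two exponentials with time constants τ and τ/2. A minimal sketch (the relative weight `w` of the faster component is our assumption; the text only states the two components are of the same order of magnitude):

```python
import numpy as np

def potentiation_window(delta_t, tau=20.0, w=1.0):
    """Potentiation curve (cf. Fig. 2B): sum of two exponentials with time
    constants tau and tau/2, as a function of delta_t = t(learning) - t(pre) >= 0.
    w is the assumed relative weight of the tau/2 (i.e. F^2) component."""
    delta_t = np.asarray(delta_t, dtype=float)
    return np.exp(-delta_t / tau) + w * np.exp(-delta_t / (tau / 2))

# A larger membrane constant tau widens the potentiation time window --
# the experimental prediction drawn from this analysis.
narrow = potentiation_window(30.0, tau=10.0)
wide = potentiation_window(30.0, tau=40.0)
```

Because both time constants are tied to the neuron's input filter, measuring this window against the membrane (or synaptic) time constant of the same cell is a direct test of the model.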
\n\n6 Unsupervised Learning \n\nBy now we have considered a learning scenario that used external information, telling whether the presented pattern is the background pattern or not, to decide whether learning should take place. When such a learning signal is missing, it is tempting to use the postsynaptic spike (signaling the presence of an interesting input pattern) as a learning signal. This yields a learning procedure as in Eq. 4, except this time learning is triggered by postsynaptic spikes instead of an external signal. The resulting learning rule is similar to previous models of the experimentally observed STDP, as in [9, 13, 16]. However, this mechanism will effectively serve learning only if the postsynaptic spikes co-occur with the presentation of a rare pattern. Such co-occurrence may be achieved by supplying short learning signals at the presence of the interesting patterns (e.g. by attentional mechanisms increasing neuronal excitability). This will induce learning such that later postsynaptic spikes will be triggered by the rare pattern presentation. These issues await further investigation. \n\nFigure 2: A. Comparing optimal (Eq. 3) and approximated (Eq. 4) learning rules. 10% of the input neurons of ξ^η (η > 0) were set to fire at 40 Hz, while the rest fire at 5 Hz. ξ^0-neurons fire at 8 Hz, yielding a similar average input as the ξ^η patterns. The learning rates ratio for Eq. 4 was numerically searched for the optimal value, yielding λ_η = 0.15, λ_0 = 0.05 for the arbitrary choice λ' = 0.1. Rest of the parameters as in Fig. 1, except M = 20, N = 2000. B. A pictorial representation of Eq. 
4, plotting ΔW as a function of the time difference between the learning signal time t and the input spike time t_spike. The potentiation curve (solid line) is the sum of two exponentials with time constants τ and τ/2 (dashed lines). The depression curve is not constrained by our derivation; thus several examples are shown (dot-dashed lines). \n\n7 Discussion \n\nIn the framework of information maximization, we have derived a spike-dependent learning rule for a leaky integrator neuron. This learning rule achieves near-optimal mutual information and can in principle be implemented in biological neurons. The analytical derivation of this rule allows us to obtain insight into the learning rules observed experimentally in various preparations. \n\nThe most fundamental result is that time-dependent learning stems from the time-dependency of the neuronal output on its inputs. In our model this is embodied in the filter F which a neuron applies over its input spike trains. This filter is determined by the biophysical parameters of the neuron, namely its membrane leak, synaptic transfer functions and dendritic arbor structure. Our model thus yields direct experimental predictions for the way the temporal characteristics of the potentiation learning curve are determined by the neuronal biophysical parameters. Namely, cells with larger membrane constants should exhibit longer synaptic potentiation time windows. Interestingly, the time window observed for STDP potentiation indeed fits the time window of an AMPA channel and is also in agreement with cortical membrane time constants, as predicted by the current analysis [4, 6]. \n\nSeveral features of the theoretically derived rule may have similar functions in the experimentally observed rule: In our model synaptic weakening is targeted to learn the structure of the background activity. 
Both synaptic depression and potentiation in our model should be triggered by rare pattern presentations to allow near-optimal mutual information. In addition, synaptic changes should depend on the synaptic baseline value in a sub-linear manner. The experimental results in this regard are still unclear, but theoretical investigations show that this weight dependency has a large effect on network dynamics [13]. \n\nWhile the learning rule presented in Equation 4 assumes independent firing of the input neurons, our derivation actually holds for a wider class of inputs. In the case of correlated inputs, however, the learning rule involves cross-synaptic terms, which may be difficult to compute by biological neurons. As STDP is highly sensitive to synchronous inputs, it remains a most interesting question to investigate biologically feasible approximations to an Infomax rule for time-structured and synchronous inputs. \n\nReferences \n\n[1] W.B. Levy and D. Steward. Temporal contiguity requirements for long-term associative potentiation/depression in the hippocampus. Neuroscience, 8:791-797, 1983. \n\n[2] D. Debanne, B.H. Gahwiler, and S.M. Thompson. Asynchronous pre- and postsynaptic activity induces associative long-term depression in area CA1 of the rat hippocampus in vitro. Proc. Natl. Acad. Sci., 91:1148-1152, 1994. \n\n[3] H. Markram, J. Lubke, M. Frotscher, and B. Sakmann. Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science, 275(5297):213-215, 1997. \n\n[4] L. Zhang, H.W. Tao, C.E. Holt, W.A. Harris, and M-m. Poo. A critical window for cooperation and competition among developing retinotectal synapses. Nature, 395(3):37-44, 1998. \n\n[5] Q. Bi and M-m. Poo. Precise spike timing determines the direction and extent of synaptic modifications in cultured hippocampal neurons. J. Neurosci., 18:10464-10472, 1999. \n\n[6] D.E. Feldman. 
Timing-based LTP and LTD at vertical inputs to layer II/III pyramidal cells in rat barrel cortex. Neuron, 27:45-56, 2000. \n\n[7] R. Kempter, W. Gerstner, and J.L. van Hemmen. Hebbian learning and spiking neurons. Phys. Rev. E, 59(4):4498-4514, 1999. \n\n[8] L.F. Abbott and S. Song. Temporally asymmetric Hebbian learning, spike timing and neural response variability. In S.A. Solla and D.A. Cohen, editors, Advances in Neural Information Processing Systems 11, pages 69-75. MIT Press, 1999. \n\n[9] S. Song, K.D. Miller, and L.F. Abbott. Competitive Hebbian learning through spike-timing dependent synaptic plasticity. Nature Neuroscience, pages 919-926, 2000. \n\n[10] D. Horn, N. Levy, I. Meilijson, and E. Ruppin. Distributed synchrony of spiking neurons in a Hebbian cell assembly. In S.A. Solla, T.K. Leen, and K.-R. Muller, editors, Advances in Neural Information Processing Systems 12, pages 129-135, 2000. \n\n[11] M.R. Mehta, M. Quirk, and M. Wilson. From hippocampus to V1: Effect of LTP on spatio-temporal dynamics of receptive fields. In J.M. Bower, editor, Computational Neuroscience: Trends in Research 1999. Elsevier, 1999. \n\n[12] R. Rao and T. Sejnowski. Predictive sequence learning in recurrent neocortical circuits. In S.A. Solla, T.K. Leen, and K.-R. Muller, editors, Advances in Neural Information Processing Systems 12, pages 164-170. MIT Press, 2000. \n\n[13] J. Rubin, D. Lee, and H. Sompolinsky. Equilibrium properties of temporally asymmetric Hebbian plasticity. Phys. Rev. Lett., in press, 2000. \n\n[14] R. Linsker. Self-organization in a perceptual network. Computer, 21(3):105-117, 1988. \n\n[15] C.E. Shannon. A mathematical theory of communication. Bell Syst. Tech. J., 27:379-423, 1948. \n\n[16] R. Kempter, W. Gerstner, and J.L. van Hemmen. Intrinsic stabilization of output rates by spike-time dependent Hebbian learning. Submitted, 2000. 
\n\n\f", "award": [], "sourceid": 1882, "authors": [{"given_name": "Gal", "family_name": "Chechik", "institution": null}, {"given_name": "Naftali", "family_name": "Tishby", "institution": null}]}