{"title": "Information Bottleneck Optimization and Independent Component Extraction with Spiking Neurons", "book": "Advances in Neural Information Processing Systems", "page_first": 713, "page_last": 720, "abstract": null, "full_text": "Information Bottleneck Optimization and\n\nIndependent Component Extraction with Spiking\n\nNeurons\n\nStefan Klamp\ufb02, Robert Legenstein, Wolfgang Maass\n\nInstitute for Theoretical Computer Science\n\nGraz University of Technology\n\n{klampfl,legi,maass}@igi.tugraz.at\n\nA-8010 Graz, Austria\n\nAbstract\n\nThe extraction of statistically independent components from high-dimensional\nmulti-sensory input streams is assumed to be an essential component of sensory\nprocessing in the brain. Such independent component analysis (or blind source\nseparation) could provide a less redundant representation of information about the\nexternal world. Another powerful processing strategy is to extract preferentially\nthose components from high-dimensional input streams that are related to other\ninformation sources, such as internal predictions or proprioceptive feedback. This\nstrategy allows the optimization of internal representation according to the infor-\nmation bottleneck method. However, concrete learning rules that implement these\ngeneral unsupervised learning principles for spiking neurons are still missing. We\nshow how both information bottleneck optimization and the extraction of inde-\npendent components can in principle be implemented with stochastically spiking\nneurons with refractoriness. The new learning rule that achieves this is derived\nfrom abstract information optimization principles.\n\n1 Introduction\n\nThe Information Bottleneck (IB) approach and independent component analysis (ICA) have both\nattracted substantial interest as general principles for unsupervised learning [1, 2]. 
A hope has been that they might also help us to understand strategies for unsupervised learning in biological systems. However, it has turned out to be quite difficult to establish links between known learning algorithms that have been derived from these general principles and learning rules that could possibly be implemented by synaptic plasticity of a spiking neuron. Fortunately, in a simpler context a direct link between an abstract information-theoretic optimization goal and a rule for synaptic plasticity has recently been established [3]. The resulting rule for the change of synaptic weights in [3] maximizes the mutual information between pre- and postsynaptic spike trains, under the constraint that the postsynaptic firing rate stays close to some target firing rate. We show in this article that this approach can be extended to situations where simultaneously the mutual information between the postsynaptic spike train of the neuron and other signals (such as, for example, the spike trains of other neurons) has to be minimized (Figure 1). This opens the door to the exploration of learning rules for information bottleneck analysis and independent component extraction with spiking neurons that would be optimal from a theoretical perspective.\nWe review in section 2 the neuron model and learning rule from [3]. We show in section 3 how this learning rule can be extended so that it not only maximizes mutual information with some given spike trains and keeps the output firing rate within a desired range, but simultaneously minimizes mutual information with other spike trains, or other time-varying signals. Applications to infor-\n\nFigure 1: Different learning situations analyzed in this article. A In an information bottleneck task the learning neuron (neuron 1) wants to maximize the mutual information between its output Y^K_1 and the activity of one or several target neurons Y^K_2, Y^K_3, . . . 
(which can be functions of the inputs X^K and/or other external signals), while at the same time keeping the mutual information between the inputs X^K and the output Y^K_1 as low as possible (and its firing rate within a desired range). Thus the neuron should learn to extract from its high-dimensional input those aspects that are related to these target signals. This setup is discussed in sections 3 and 4. B Two neurons receiving the same inputs X^K from a common set of presynaptic neurons both learn to maximize information transmission, and simultaneously to keep their outputs Y^K_1 and Y^K_2 statistically independent. Such extraction of independent components from the input is described in section 5.\n\nmation bottleneck tasks are discussed in section 4. In section 5 we show that a modification of this learning rule allows a spiking neuron to extract information from its input spike trains that is independent from the component extracted by another neuron.\n\n2 Neuron model and a basic learning rule\n\nWe use the model from [3], which is a stochastically spiking neuron model with refractoriness, where the probability of firing in each time step depends on the current membrane potential and the time since the last output spike. It is convenient to formulate the model in discrete time with step size Δt. The total membrane potential of a neuron i in time step t_k = kΔt is given by\n\nu_i(t_k) = u_r + Σ_{j=1}^{N} Σ_{n=1}^{k} w_{ij} ε(t_k − t_n) x^n_j,   (1)\n\nwhere u_r = −70mV is the resting potential and w_{ij} is the weight of synapse j (j = 1, . . . , N). An input spike train at synapse j up to the k-th time step is described by a sequence X^k_j = (x^1_j, x^2_j, . . . , x^k_j) of zeros (no spike) and ones (spike); each presynaptic spike at time t_n (x^n_j = 1) evokes a postsynaptic potential (PSP) with exponentially decaying time course ε(t − t_n) with time constant τ_m = 10ms. 
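As an illustration, the discrete-time membrane potential (1) with an exponentially decaying PSP kernel can be sketched in a few lines of Python. This is our own minimal sketch; the variable names, the unitless weight scale, and the array layout are our choices, not code from the paper:

```python
import numpy as np

# Sketch of Eq. (1): u_i(t_k) = u_r + sum_j sum_n w_ij * eps(t_k - t_n) * x_j^n,
# with an exponentially decaying PSP kernel eps.  Names and scales are ours.
DT = 0.001        # time step Delta_t = 1 ms
TAU_M = 0.010     # PSP time constant tau_m = 10 ms
U_REST = -70e-3   # resting potential u_r = -70 mV

def psp_kernel(s):
    """Exponentially decaying PSP; causal in the spike time (s >= 0)."""
    return np.exp(-s / TAU_M) if s >= 0 else 0.0

def membrane_potential(k, w, x):
    """u(t_k) for one neuron; w: (N,) weights [V], x: (K, N) 0/1 spike array."""
    u = U_REST
    for n in range(k + 1):            # sum over all past time steps
        s = (k - n) * DT
        u += np.dot(w, x[n]) * psp_kernel(s)
    return u
```

With all-zero inputs the potential stays at u_r; each presynaptic spike adds a weighted, exponentially decaying PSP.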
The probability ρ^k_i of firing of neuron i in each time step t_k is given by\n\nρ^k_i = 1 − exp[−g(u_i(t_k)) R_i(t_k) Δt] ≈ g(u_i(t_k)) R_i(t_k) Δt,   (2)\n\nwhere g(u) = r_0 log{1 + exp[(u − u_0)/Δu]} is a smooth increasing function of the membrane potential u (u_0 = −65mV, Δu = 2mV, r_0 = 11Hz). The approximation is valid for sufficiently small Δt (ρ^k_i ≪ 1). The refractory variable R_i(t) = (t − ˆt_i − τ_abs)^2 / [τ_refr^2 + (t − ˆt_i − τ_abs)^2] · Θ(t − ˆt_i − τ_abs) assumes values in [0, 1] and depends on the last firing time ˆt_i of neuron i (absolute refractory period τ_abs = 3ms, relative refractory time τ_refr = 10ms). The Heaviside step function Θ takes a value of 1 for non-negative arguments and 0 otherwise.\nThis model from [3] is a special case of the spike-response model, and with a refractory variable R(t) that depends only on the time since the last postsynaptic event it has renewal properties [4].\n\nThe output of neuron i at the k-th time step is denoted by a variable y^k_i that assumes the value 1 if a postsynaptic spike occurred and 0 otherwise. A specific spike train up to the k-th time step is written as Y^k_i = (y^1_i, y^2_i, . . . , y^k_i). The information transmission between an ensemble of input spike trains X^K and the output spike train Y^K_i can be quantified by the mutual information¹ [5]
I(X^K; Y^K_i) = Σ_{X^K, Y^K_i} P(X^K, Y^K_i) log [P(Y^K_i | X^K) / P(Y^K_i)].   (3)\n\nThe idea in [3] was to maximize the quantity I(X^K; Y^K_i) − γ D_KL(P(Y^K_i) || ˜P(Y^K_i)), where D_KL(P(Y^K_i) || ˜P(Y^K_i)) = Σ_{Y^K_i} P(Y^K_i) log(P(Y^K_i) / ˜P(Y^K_i)) denotes the Kullback-Leibler divergence [5], imposing the additional constraint that the firing statistics P(Y_i) of the neuron should stay as close as possible to a target distribution ˜P(Y_i). This distribution was chosen to be that of a constant target firing rate ˜g, accounting for homeostatic processes. An online learning rule performing gradient ascent on this quantity was derived for the weights w_ij of neuron i, with Δw^k_ij denoting the weight change during the k-th time step:\n\nΔw^k_ij / Δt = α C^k_ij B^k_i(γ),   (4)\n\nwhich consists of the "correlation term" C^k_ij and the "postsynaptic term" B^k_i [3]. The term C^k_ij measures coincidences between postsynaptic spikes at neuron i and PSPs generated by presynaptic action potentials arriving at synapse j,\n\nC^k_ij = C^{k−1}_ij (1 − Δt/τ_C) + (y^k_i − ρ^k_i) [g′(u_i(t_k)) / g(u_i(t_k))] Σ_{n=1}^{k} ε(t_k − t_n) x^n_j,   (5)\n\nin an exponential time window with time constant τ_C = 1s and g′(u_i(t_k)) denoting the derivative of g with respect to u. The term\n\nB^k_i(γ) = (y^k_i / Δt) log[(g(u_i(t_k)) / ¯g_i(t_k)) (˜g / ¯g_i(t_k))^γ] − (1 − y^k_i) R_i(t_k) [g(u_i(t_k)) − (1 + γ) ¯g_i(t_k) + γ ˜g]   (6)\n\ncompares the current firing rate g(u_i(t_k)) with its average firing rate² ¯g_i(t_k), and simultaneously the running average ¯g_i(t_k) with the constant target rate ˜g. The argument indicates that this term also depends on the optimization parameter γ.\n\n3 Learning rule for multi-neuron interactions\n\nWe extend the learning rule presented in the previous section to a more complex scenario, where the mutual information between the output spike train Y^K_1 of the learning neuron (neuron 1) and some target spike trains Y^K_l (l > 1) has to be maximized, while simultaneously minimizing the mutual information between the inputs X^K and the output Y^K_1. Obviously, this is the generic IB scenario applied to spiking neurons (see Figure 1A). A learning rule for extracting independent components with spiking neurons (see section 5) can be derived in a similar manner.\nFor simplicity, we consider the case of an IB optimization for only one target spike train Y^K_2, and derive an update rule for the synaptic weights w_1j of neuron 1. The quantity to maximize is therefore\n\nL = −I(X^K; Y^K_1) + β I(Y^K_1; Y^K_2) − γ D_KL(P(Y^K_1) || ˜P(Y^K_1)),   (7)\n\nwhere β and γ are optimization constants. 
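To make the structure of the basic rule concrete, here is a minimal Python sketch of one update step of (4)-(6) for a single neuron, with the refractory variable set to R = 1 for brevity. The scaffolding (parameter values, function and variable names) is our own, not code from the paper:

```python
import numpy as np

# Minimal online sketch of the gain function of model (2) and one step of the
# basic rule (4)-(6), with refractoriness R = 1 for brevity.  Names are ours.
DT, TAU_C = 1e-3, 1.0
U0, DU, R0, G_TARGET = -65e-3, 2e-3, 11.0, 30.0
ALPHA, GAMMA = 1e-4, 10.0

def g(u):
    """Smooth gain function g(u) = r0 * log(1 + exp((u - u0)/du))."""
    return R0 * np.log1p(np.exp((u - U0) / DU))

def g_prime(u):
    """Derivative of g with respect to u."""
    e = np.exp((u - U0) / DU)
    return R0 * e / ((1.0 + e) * DU)

def basic_update(w, psp, u, y, rho, C, g_bar):
    """One step of rule (4): dw = alpha * C * B(gamma) * dt.
    psp is the running PSP sum at the synapse; g_bar the running average rate."""
    C = C * (1 - DT / TAU_C) + (y - rho) * (g_prime(u) / g(u)) * psp      # Eq. (5)
    B = (y / DT) * np.log((g(u) / g_bar) * (G_TARGET / g_bar) ** GAMMA) \
        - (1 - y) * (g(u) - (1 + GAMMA) * g_bar + GAMMA * G_TARGET)        # Eq. (6), R = 1
    return w + ALPHA * C * B * DT, C
```

In a simulation loop, `psp` and `g_bar` would be maintained as the running PSP trace and the exponential average of the firing rate, respectively.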
To maximize this objective function, we derive the weight change Δw^k_1j during the k-th time step by gradient ascent on (7), assuming that the weights w_1j can change between some bounds 0 ≤ w_1j ≤ w_max (we assume w_max = 1 throughout this paper).\n\n¹We use boldface letters (X^k) to distinguish random variables from specific realizations (X^k).\n²The rate ¯g_i(t_k) = ⟨g(u_i(t_k))⟩_{X^k|Y^{k−1}_i} denotes an expectation of the firing rate over the input distribution given the postsynaptic history and is implemented as a running average with an exponential time window (with a time constant of 10ms).\n\nNote that all three terms of (7) implicitly depend on w_1j because the output distribution P(Y^K_1) changes if we modify the weights w_1j. Since the first and the last term of (7) have already been considered (up to the sign) in [3], we will concentrate here on the middle term L_12 := β I(Y^K_1; Y^K_2) and denote the contribution of the gradient of L_12 to the total weight change Δw^k_1j in the k-th time step by Δ˜w^k_1j.\nIn order to get an expression for the weight change in a specific time step t_k we write the probabilities P(Y^K_i) and P(Y^K_1, Y^K_2) occurring in (7) as products over individual time bins, i.e., P(Y^K_i) = Π_{k=1}^{K} P(y^k_i | Y^{k−1}_i) and P(Y^K_1, Y^K_2) = Π_{k=1}^{K} P(y^k_1, y^k_2 | Y^{k−1}_1, Y^{k−1}_2), according to the chain rule of information theory [5]. Consequently, we rewrite L_12 as a sum over the contributions of the individual time bins, L_12 = Σ_{k=1}^{K} ΔL^k_12, with\n\nΔL^k_12 = ⟨β log [P(y^k_1, y^k_2 | Y^{k−1}_1, Y^{k−1}_2) / (P(y^k_1 | Y^{k−1}_1) P(y^k_2 | Y^{k−1}_2))]⟩_{X^k, Y^k_1, Y^k_2}.   (8)\n\nThe weight change Δ˜w^k_1j is then proportional to the gradient of this expression with respect to the weights w_1j, i.e., Δ˜w^k_1j = α (∂ΔL^k_12 / ∂w_1j), with some learning rate α > 0. The evaluation of the gradient yields Δ˜w^k_1j = α ⟨C^k_1j β F^k_12⟩_{X^k, Y^k_1, Y^k_2}, with a correlation term C^k_1j as in (5) and a term\n\nF^k_12 = y^k_1 y^k_2 log[¯g_12(t_k) / (¯g_1(t_k) ¯g_2(t_k))] − y^k_1 (1 − y^k_2) R_2(t_k) Δt [¯g_12(t_k)/¯g_1(t_k) − ¯g_2(t_k)] − (1 − y^k_1) y^k_2 R_1(t_k) Δt [¯g_12(t_k)/¯g_2(t_k) − ¯g_1(t_k)] + (1 − y^k_1)(1 − y^k_2) R_1(t_k) R_2(t_k) (Δt)^2 [¯g_12(t_k) − ¯g_1(t_k) ¯g_2(t_k)].   (9)\n\nHere, ¯g_i(t_k) = ⟨g(u_i(t_k))⟩_{X^k|Y^{k−1}_i} denotes the average firing rate of neuron i and ¯g_12(t_k) = ⟨g(u_1(t_k)) g(u_2(t_k))⟩_{X^k|Y^{k−1}_1, Y^{k−1}_2} denotes the average product of firing rates of both neurons. 
Both quantities are implemented online as running exponential averages with a time constant of 10s.\nUnder the assumption of a small learning rate α we can approximate the expectation ⟨·⟩_{X^k, Y^k_1, Y^k_2} by averaging over a single long trial. Considering now all three terms in (7) we finally arrive at an online rule for maximizing (7),\n\nΔw^k_1j / Δt = −α C^k_1j [B^k_1(−γ) − β Δt B^k_12],   (10)\n\nwhich consists of a term C^k_1j sensitive to correlations between the output of the neuron and its presynaptic input at synapse j ("correlation term") and terms B^k_1 and B^k_12 that characterize the postsynaptic state of the neuron ("postsynaptic terms"). Note that the argument of B^k_1 is different from (4) because some of the terms of the objective function (7) have a different sign. In order to compensate the effect of a small Δt, the constant β has to be large enough for the term B^k_12 to have an influence on the weight change.\nThe factors C^k_1j and B^k_1 were described in the previous section. In addition, our learning rule contains an extra term B^k_12 = F^k_12 / (Δt)^2 that is sensitive to the statistical dependence between the output spike train of the neuron and the target. It is given by\n\nB^k_12 = (y^k_1 y^k_2 / (Δt)^2) log[¯g_12(t_k) / (¯g_1(t_k) ¯g_2(t_k))] − (y^k_1 / Δt)(1 − y^k_2) R_2(t_k) [¯g_12(t_k)/¯g_1(t_k) − ¯g_2(t_k)] − (y^k_2 / Δt)(1 − y^k_1) R_1(t_k) [¯g_12(t_k)/¯g_2(t_k) − ¯g_1(t_k)] + (1 − y^k_1)(1 − y^k_2) R_1(t_k) R_2(t_k) [¯g_12(t_k) − ¯g_1(t_k) ¯g_2(t_k)].   (11)\n\nThis term basically compares the average product of firing rates ¯g_12 (which corresponds to the joint probability of spiking) with the product of average firing rates ¯g_1 ¯g_2 (representing the probability of independent spiking). In this way, it measures the momentary mutual information between the output of the neuron and the target spike train.\n\nFor a simplified neuron model without refractoriness (R(t) = 1), the update rule (4) resembles the BCM rule [6], as shown in [3]. With the objective function (7) to maximize, we expect an "anti-Hebbian BCM" rule with another term accounting for statistical dependencies between Y^K_1 and Y^K_2. Since there is no refractoriness, the postsynaptic rate ν_1(t_k) is given directly by the current value of g(u(t_k)), and the update rule (10) reduces to the rate model³\n\nΔw^k_1j / Δt = −α ν^{pre,k}_j f(ν^k_1) { log[(ν^k_1 / ¯ν^k_1)(¯ν^k_1 / ˜g)^γ] − β Δt [ν^k_2 log(¯ν^k_12 / (¯ν^k_1 ¯ν^k_2)) − ¯ν^k_2 (¯ν^k_12 / (¯ν^k_1 ¯ν^k_2) − 1)] },   (12)\n\nwhere the presynaptic rate at synapse j at time t_k is denoted by ν^{pre,k}_j = a Σ_{n=1}^{k} ε(t_k − t_n) x^n_j, with a in units (Vs)^{−1}. The values ¯ν^k_1, ¯ν^k_2, and ¯ν^k_12 are running averages of the output rate ν^k_1, the rate of the target signal ν^k_2, and of the product of these values, ν^k_1 ν^k_2, respectively. The function f(ν^k_1) = g′(g^{−1}(ν^k_1))/a is proportional to the derivative of g with respect to u, evaluated at the current membrane potential. The first term in the curly brackets accounts for the homeostatic process (similar to the BCM rule, see [3]), whereas the second term reinforces dependencies between Y^K_1 and Y^K_2. Note that this term is zero if the rates of the two neurons are independent.\nIt is interesting to note that if we rewrite the simplified rate-based learning rule (12) in the following way,\n\nΔw^k_1j / Δt = −α ν^{pre,k}_j Φ(ν^k_1, ν^k_2),   (13)\n\nwe can view it as an extension of the classical Bienenstock-Cooper-Munro (BCM) rule [6] with a two-dimensional synaptic modification function Φ(ν^k_1, ν^k_2). Here, values of Φ > 0 produce LTD whereas values of Φ < 0 produce LTP. These regimes are separated by a sliding threshold; however, in contrast to the original BCM rule this threshold does not only depend on the running average of the postsynaptic rate ¯ν^k_1, but also on the current values of ν^k_2 and ¯ν^k_2.\n\n4 Application to Information Bottleneck Optimization\n\nWe use a setup as in Figure 1A where we want to maximize the information which the output Y^K_1 of a learning neuron conveys about two target signals Y^K_2 and Y^K_3. If the target signals are statistically independent from each other we can optimize the mutual information to each target signal separately. This leads to an update rule\n\nΔw^k_1j / Δt = −α C^k_1j [B^k_1(−γ) − β Δt (B^k_12 + B^k_13)],   (14)\n\nwhere B^k_12 and B^k_13 are the postsynaptic terms (11) sensitive to the statistical dependence between the output and target signals 1 and 2, respectively. 
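The shape of the simplified rate-based rule can be sketched as follows. This is our own illustrative implementation of a Φ-style two-dimensional modification function in the spirit of (12)/(13); the constants and names are chosen by us, not taken from the paper:

```python
import numpy as np

# Sketch of a two-dimensional BCM-style modification function Phi(nu1, nu2)
# in the spirit of Eqs. (12)/(13).  Constants and names are our own choices.
ALPHA, BETA, GAMMA = 1e-3, 1e2, 10.0
G_TARGET = 30.0   # target rate g~ [Hz]
DT = 1e-3

def phi(nu1, nu2, nu1_bar, nu2_bar, nu12_bar, f1=1.0):
    """Phi > 0 -> LTD, Phi < 0 -> LTP.  Vanishes when the rates are
    independent (nu12_bar == nu1_bar*nu2_bar) and nu1 sits at the target."""
    homeostatic = np.log((nu1 / nu1_bar) * (nu1_bar / G_TARGET) ** GAMMA)
    ratio = nu12_bar / (nu1_bar * nu2_bar)
    dependence = BETA * DT * (nu2 * np.log(ratio) - nu2_bar * (ratio - 1.0))
    return f1 * (homeostatic - dependence)

def weight_update(w, nu_pre, nu1, nu2, nu1_bar, nu2_bar, nu12_bar):
    """One anti-Hebbian step dw = -alpha * nu_pre * Phi * dt, clipped to [0, w_max=1]."""
    dw = -ALPHA * nu_pre * phi(nu1, nu2, nu1_bar, nu2_bar, nu12_bar) * DT
    return np.clip(w + dw, 0.0, 1.0)
```

As the docstring notes, Φ vanishes when the running product ¯ν_12 equals ¯ν_1 ¯ν_2 (independent rates) and the output rate sits at the homeostatic target ˜g.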
We choose ˜g = 30Hz for the target firing rate, and we use discrete time with Δt = 1ms.\nIn this experiment we demonstrate that it is possible to consider two very different kinds of target signals: one target spike train has a similar rate modulation as one part of the input, while the other target spike train has a high spike-spike correlation with another part of the input. The learning neuron receives input at 100 synapses, which are divided into 4 groups of 25 inputs each. The first two input groups consist of rate-modulated Poisson spike trains⁴ (Figure 2A). Spike trains from the remaining groups 3 and 4 are correlated with a coefficient of 0.5 within each group; however, spike trains from different groups are uncorrelated. Correlated spike trains are generated by the procedure described in [7].\nThe first target signal is chosen to have the same rate modulation as the inputs from group 1, except that Gaussian random noise is superimposed with a standard deviation of 2Hz. The second target spike train is correlated with inputs from group 3 (with a coefficient of 0.5), but uncorrelated to inputs from group 4. Furthermore, both target signals are silent during random intervals: at each\n\n³In the absence of refractoriness we use an alternative gain function g_alt(u) = [1/g_max + 1/g(u)]^{−1} in order to pose an upper limit of g_max = 100Hz on the postsynaptic firing rate.\n\nFigure 2: Performance of the spike-based learning rule (10) for the IB task. A Modulation of input rates to input groups 1 and 2. B Evolution of weights during 60 minutes of learning (bright: strong synapses, w_ij ≈ 1; dark: depressed synapses, w_ij ≈ 0). Weights are initialized randomly between 0.10 and 0.12, α = 10^{−4}, β = 2 · 10^3, γ = 50. C Output rate and rate of target signal 1 during 5 seconds after learning. 
D Evolution of the average mutual information per time bin (solid line, left scale) between input and output and the Kullback-Leibler divergence per time bin (dashed line, right scale) as a function of time. Averages are calculated over segments of 1 minute. E Evolution of the average mutual information per time bin between output and both target spike trains as a function of time. F Trace of the correlation between output rate and rate of target signal 1 (solid line) and the spike-spike correlation (dashed line) between the output and target spike train 2 during learning. Correlation coefficients are calculated every 10 seconds.\n\ntime step, each target signal is independently set to 0 with a certain probability (10^{−5}) and remains silent for a duration chosen from a Gaussian distribution with mean 5s and SD 1s (minimum duration is 1s). Hence this experiment tests whether learning works even if the target signals are not available all of the time.\nFigure 2 shows that strong weights evolve for the first and third group of synapses, whereas the efficacies for the remaining inputs are depressed. Both groups with growing weights are correlated with one of the target signals; therefore, the mutual information between output and target spike trains increases. Since spike-spike correlations convey more information than rate modulations, synaptic efficacies develop more strongly to group 3 (the group with spike-spike correlations). This results in an initial decrease in correlation with the rate-modulated target to the benefit of higher correlation with the second target. However, after about 30 minutes, when the weights become stable, the correlations as well as the mutual information quantities stay roughly constant.\nAn application of the simplified rule (12) to the same task is shown in Figure 3, where it can be seen that strong weights close to w_max are developed for the rate-modulated input. 
To some extent, weights grow also for the inputs with spike-spike correlations in order to reach the constant target firing rate ˜g. In contrast to the spike-based rule, the simplified rule is not able to detect spike-spike correlations between output and target spike trains.\n\n⁴The rate of the first 25 inputs is modulated by a Gaussian white-noise signal with mean 20Hz that has been low-pass filtered with a cut-off frequency of 5Hz. Synapses 26 to 50 receive a rate that has a constant value of 2Hz, except that a burst is initiated at each time step with a probability of 0.0005. Thus there is a burst on average every 2s. The duration of a burst is chosen from a Gaussian distribution with mean 0.5s and SD 0.2s; the minimum duration is chosen to be 0.1s. During a burst the rate is set to 50Hz. In the simulations we use discrete time with Δt = 1ms.\n\nFigure 3: Performance of the simplified update rule (12) for the IB task. A Evolution of weights during 30 minutes of learning (bright: strong synapses, w_ij ≈ 1; dark: depressed synapses, w_ij ≈ 0). Weights are initialized randomly between 0.10 and 0.12, α = 10^{−3}, β = 10^4, γ = 10. B Evolution of the average mutual information per time bin (solid line, left scale) between input and output and the Kullback-Leibler divergence per time bin (dashed line, right scale) as a function of time. Averages are calculated over segments of 1 minute. C Trace of the correlation between output rate and target rate during learning. 
Correlation coefficients are calculated every 10 seconds.\n\n5 Extracting Independent Components\n\nWith a slight modification in the objective function (7) the learning rule allows us to extract statistically independent components from an ensemble of input spike trains. We consider two neurons receiving the same input at their synapses (see Figure 1B). For both neurons i = 1, 2 we maximize information transmission under the constraint that their outputs stay as statistically independent from each other as possible. That is, we maximize\n\n˜L_i = I(X^K; Y^K_i) − β I(Y^K_1; Y^K_2) − γ D_KL(P(Y^K_i) || ˜P(Y^K_i)).   (15)\n\nSince the same terms (up to the sign) are optimized in (7) and (15), we can derive a gradient ascent rule for the weights w_ij of neuron i analogously to section 3:\n\nΔw^k_ij / Δt = α C^k_ij [B^k_i(γ) − β Δt B^k_12].   (16)\n\nFigure 4 shows the results of an experiment where two neurons receive the same Poisson input with a rate of 20Hz at their 100 synapses. The input is divided into two groups of 40 spike trains each, such that synapses 1 to 40 and 41 to 80 receive correlated input with a correlation coefficient of 0.5 within each group; however, any spike trains belonging to different input groups are uncorrelated. The remaining 20 synapses receive uncorrelated Poisson input. Neuron 1 develops weights close to the maximal efficacy w_max = 1 for one of the groups of synapses that receives correlated input (group 2 in this case), whereas those for the other correlated group (group 1) as well as those for the uncorrelated group (group 3) stay low. Neuron 2 develops strong weights to the other correlated group of synapses (group 1), whereas the efficacies of the second correlated group (group 2) remain depressed, thereby trying to produce a statistically independent output. 
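For readers who want to reproduce such inputs, correlated Poisson spike trains with a prescribed pairwise correlation can be generated by thinning a shared "mother" train. This is one standard construction and our own sketch; the paper itself uses the procedure of [7], which may differ in detail:

```python
import numpy as np

# Generate a group of Poisson spike trains with pairwise correlation ~ c by
# thinning a shared mother train of rate r/c.  This is one common construction
# (not necessarily the exact procedure of Gutig et al. [7]).
# Requires rate_hz * dt < c so that the mother-train bin probability stays < 1.
def correlated_poisson_group(n_trains, rate_hz, c, duration_s, dt=1e-3, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    n_steps = int(duration_s / dt)
    p = rate_hz * dt                        # spike probability per bin and train
    mother = rng.random(n_steps) < p / c    # shared source with rate r/c
    trains = np.empty((n_trains, n_steps), dtype=bool)
    for i in range(n_trains):
        keep = rng.random(n_steps) < c      # each train keeps a mother spike with prob. c
        trains[i] = mother & keep
    return trains
```

Each train then has rate r, and any two trains within the group have a per-bin count correlation of approximately c (for small spike probability per bin).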
For both neurons the mutual information is maximized and the target output distribution of a constant firing rate of 30Hz is approached well. After an initial increase in the mutual information and in the correlation between the outputs, when the weights of both neurons start to grow simultaneously, the amounts of information and correlation drop as both neurons develop strong efficacies to different parts of the input.\n\n6 Discussion\n\nInformation Bottleneck (IB) and Independent Component Analysis (ICA) have been proposed as general principles for unsupervised learning in lower cortical areas; however, learning rules that can implement these principles with spiking neurons have been missing. In this article we have derived, from information-theoretic principles, learning rules which enable a stochastically spiking neuron to solve these tasks. These learning rules are optimal from the perspective of information theory, but they are not local in the sense that they use only information that is available at a single\n\nFigure 4: Extracting independent components. A,B Evolution of weights during 30 minutes of learning for both postsynaptic neurons (red: strong synapses, w_ij ≈ 1; blue: depressed synapses, w_ij ≈ 0). Weights are initialized randomly between 0.10 and 0.12, α = 10^{−3}, β = 100, γ = 10. C Evolution of the average mutual information per time bin between both output spike trains as a function of time. D,E Evolution of the average mutual information per time bin (solid line, left scale) between input and output and the Kullback-Leibler divergence per time bin for both neurons (dashed line, right scale) as a function of time. 
Averages are calculated over segments of 1 minute. F Trace of the correlation between both output spike trains during learning. Correlation coefficients are calculated every 10 seconds.\n\nsynapse without an auxiliary network of interneurons or other biological processes. Rather, they tell us what type of information would have to be ideally provided by such an auxiliary network, and how the synapse should change its efficacy in order to approximate a theoretically optimal learning rule.\n\nAcknowledgments\n\nWe would like to thank Wulfram Gerstner and Jean-Pascal Pfister for helpful discussions. This paper was written under partial support by the Austrian Science Fund FWF, # S9102-N13 and # P17229-N04, and was also supported by PASCAL, project # IST2002-506778, and FACETS, project # 15879, of the European Union.\n\nReferences\n\n[1] N. Tishby, F. C. Pereira, and W. Bialek. The information bottleneck method. In Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing, pages 368-377, 1999.\n[2] A. Hyvärinen, J. Karhunen, and E. Oja. Independent Component Analysis. Wiley, New York, 2001.\n[3] T. Toyoizumi, J.-P. Pfister, K. Aihara, and W. Gerstner. Generalized Bienenstock-Cooper-Munro rule for spiking neurons that maximizes information transmission. Proc. Natl. Acad. Sci. USA, 102:5239-5244, 2005.\n[4] W. Gerstner and W. M. Kistler. Spiking Neuron Models. Cambridge University Press, Cambridge, 2002.\n[5] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley, New York, 1991.\n[6] E. L. Bienenstock, L. N. Cooper, and P. W. Munro. Theory for the development of neuron selectivity: orientation specificity and binocular interaction in visual cortex. J. Neurosci., 2(1):32-48, 1982.\n[7] R. Gütig, R. Aharonov, S. Rotter, and H. Sompolinsky. Learning input correlations through non-linear temporally asymmetric Hebbian plasticity. 
J. Neurosci., 23:3697-3714, 2003.\n", "award": [], "sourceid": 3070, "authors": [{"given_name": "Stefan", "family_name": "Klampfl", "institution": null}, {"given_name": "Wolfgang", "family_name": "Maass", "institution": null}, {"given_name": "Robert", "family_name": "Legenstein", "institution": null}]}