{"title": "Simplified Rules and Theoretical Analysis for Information Bottleneck Optimization and PCA with Spiking Neurons", "book": "Advances in Neural Information Processing Systems", "page_first": 193, "page_last": 200, "abstract": "We show that under suitable assumptions (primarily linearization) a simple and perspicuous online learning rule for Information Bottleneck optimization with spiking neurons can be derived. This rule performs on common benchmark tasks as well as a rather complex rule that has previously been proposed \\cite{KlampflETAL:07b}. Furthermore, the transparency of this new learning rule makes a theoretical analysis of its convergence properties feasible. A variation of this learning rule (with sign changes) provides a theoretically founded method for performing Principal Component Analysis {(PCA)} with spiking neurons. By applying this rule to an ensemble of neurons, different principal components of the input can be extracted. In addition, it is possible to preferentially extract those principal components from incoming signals $X$ that are related or are not related to some additional target signal $Y_T$. In a biological interpretation, this target signal $Y_T$ (also called relevance variable) could represent proprioceptive feedback, input from other sensory modalities, or top-down signals.", "full_text": "Simpli\ufb01ed Rules and Theoretical Analysis for\n\nInformation Bottleneck Optimization and PCA with\n\nSpiking Neurons\n\nLars Buesing, Wolfgang Maass\n\nInstitute for Theoretical Computer Science\n\nGraz University of Technology\n\nA-8010 Graz, Austria\n\n{lars,maass}@igi.tu-graz.at\n\nAbstract\n\nWe show that under suitable assumptions (primarily linearization) a simple and\nperspicuous online learning rule for Information Bottleneck optimization with\nspiking neurons can be derived. This rule performs on common benchmark tasks\nas well as a rather complex rule that has previously been proposed [1]. 
Furthermore, the transparency of this new learning rule makes a theoretical analysis of its convergence properties feasible. A variation of this learning rule (with sign changes) provides a theoretically founded method for performing Principal Component Analysis (PCA) with spiking neurons. By applying this rule to an ensemble of neurons, different principal components of the input can be extracted. In addition, it is possible to preferentially extract those principal components from incoming signals $X$ that are related or are not related to some additional target signal $Y_T$. In a biological interpretation, this target signal $Y_T$ (also called relevance variable) could represent proprioceptive feedback, input from other sensory modalities, or top-down signals.\n\n1 Introduction\n\nThe Information Bottleneck (IB) approach [2] allows the investigation of learning algorithms for unsupervised and semi-supervised learning on the basis of clear optimality principles from information theory. Two types of time-varying inputs $X$ and $Y_T$ are considered. The learning goal is to learn a transformation from $X$ into another signal $Y$ that extracts only those components from $X$ that are related to the relevance signal $Y_T$. In a more global biological interpretation $X$ might represent for example some sensory input, and $Y$ the output of the first processing stage for $X$ in the cortex. In this article $Y$ will simply be the spike output of a neuron that receives the spike trains $X$ as inputs. The starting point for our analysis is the first learning rule for IB optimization for this setup, which has recently been proposed in [1], [3]. Unfortunately, this learning rule is complicated, restricted to discrete time, and no theoretical analysis of its behavior is feasible. Any online learning rule for IB optimization has to make a number of simplifying assumptions, since true IB optimization can only be carried out in an offline setting. 
We show here that with a slightly different set of assumptions than those made in [1] and [3], one arrives at a drastically simpler and intuitively perspicuous online learning rule for IB optimization with spiking neurons. The learning rule in [1] was derived by maximizing the objective function$^1$ $L_0$:\n\n$$L_0 = -I(X, Y) + \\beta I(Y, Y_T) - \\gamma D_{KL}(P(Y) \\| P(\\tilde{Y})), \\qquad (1)$$\n\n$^1$ The term $D_{KL}(P(Y) \\| P(\\tilde{Y}))$ denotes the Kullback-Leibler divergence between the distribution $P(Y)$ and a target distribution $P(\\tilde{Y})$. This term ensures that the weights remain bounded; it is briefly discussed in [4].\n\nwhere $I(\\cdot, \\cdot)$ denotes the mutual information between its arguments and $\\beta$ is a positive trade-off factor. The target signal $Y_T$ was assumed to be given by a spike train. The learning rule from [1] (see [3] for a detailed interpretation) is quite involved and requires numerous auxiliary definitions (hence we cannot repeat it in this abstract). Furthermore, it can only be formulated in discrete time (step size $\\Delta t$) for reasons we want to outline briefly: In the limit $\\Delta t \\to 0$ the essential contribution to the learning rule, which stems from maximizing the mutual information $I(Y, Y_T)$ between output and target signal, vanishes. This difficulty is rooted in a rather technical assumption, made in appendix A.4 in [3], concerning the expectation value $\\rho^k$ at time step $k$ of the neural firing probability $\\rho$, given the information about the postsynaptic spikes and the target signal spikes up to the preceding time step $k - 1$ (see our detailed discussion in [4])$^2$. 
The restriction to discrete time prevents the application of powerful analytical methods like the Fokker-Planck equation, which requires continuous time, for analyzing the dynamics of the learning rule.\n\nIn section 2 of this paper, we propose a much simpler learning rule for IB optimization with spiking neurons, which can also be formulated in continuous time. In contrast to [3], we approximate the critical term $\\rho^k$ with a linear estimator, under the assumption that $X$ and $Y_T$ are positively correlated. Further simplifications in comparison to [3] are achieved by considering a simpler neuron model (the linear Poisson neuron, see [5]). However, we show through computer simulation in [4] that the resulting simple learning rule performs equally well for the more complex neuron model with refractoriness from [1] - [5]. The learning rule presented here can be analyzed by means of the drift function of the corresponding Fokker-Planck equation. The theoretical results are outlined in section 3, followed by the consideration of a concrete IB optimization task in section 4. A link between the presented learning rule and Principal Component Analysis (PCA) is established in section 5. A more detailed comparison of the learning rule presented here and the one of [3] as well as results of extensive computer tests on common benchmark tasks can be found in [4].\n\n2 Neuron model and learning rule for IB optimization\n\nWe consider a linear Poisson neuron with $N$ synapses of weights $w = (w_1, \\dots, w_N)$. It is driven by the input $X$, consisting of $N$ spike trains $X_j(t) = \\sum_i \\delta(t - t_j^i)$, $j \\in \\{1, \\dots, N\\}$, where $t_j^i$ denotes the time of the $i$\u2019th spike at synapse $j$. The membrane potential $u(t)$ of the neuron at time $t$ is given by the weighted sum of the presynaptic activities $\\nu(t) = (\\nu_1(t), \\dots, \\nu_N(t))$:\n\n$$u(t) = \\sum_{j=1}^{N} w_j \\nu_j(t), \\qquad \\nu_j(t) = \\int_{-\\infty}^{t} \\epsilon(t - s) X_j(s)\\, ds. \\qquad (2)$$\n\nThe kernel $\\epsilon(\\cdot)$ models the EPSP of a single spike (in simulations $\\epsilon(t)$ was chosen to be a decaying exponential with a time constant of $\\tau_m = 10$ ms). The postsynaptic neuron spikes at time $t$ with the probability density $g(t)$:\n\n$$g(t) = \\frac{u(t)}{u_0},$$\n\nwith $u_0$ being a normalization constant. The postsynaptic spike train is denoted as $Y(t) = \\sum_i \\delta(t - t_f^i)$, with the firing times $t_f^i$.\n\nWe now consider the IB task described in general in [2], which consists of maximizing the objective function $L_{IB}$, in the context of spiking neurons. As in [6], we introduce a further term $L_3$ into the objective function that reflects the higher metabolic costs for the neuron to maintain strong synapses, a natural, simple choice being $L_3 = -\\lambda \\sum_j w_j^2$. Thus the complete objective function $L$ to maximize is:\n\n$$L = L_{IB} + L_3 = -I(X, Y) + \\beta I(Y_T, Y) - \\lambda \\sum_{j=1}^{N} w_j^2. \\qquad (3)$$\n\n$^2$ The remedy, proposed in section 3.1 in [3], of replacing the mutual information $I(Y, Y_T)$ in $L_0$ by an information rate $I(Y, Y_T)/\\Delta t$ does not solve this problem, as the term $I(Y, Y_T)/\\Delta t$ diverges in the continuous time limit.\n\nThe objective function $L$ differs slightly from $L_0$ given in (1), which was optimized in [3]; this change turned out to be advantageous for the PCA learning rule given in section 5, without significantly changing the characteristics of the IB learning rule.\n\nThe online learning rule governing the change of the weights $w_j(t)$ at time $t$ is obtained by a gradient ascent of the objective function $L$:\n\n$$\\frac{d}{dt} w_j(t) = \\alpha \\frac{\\partial L}{\\partial w_j}.$$\n\nFor small learning rates $\\alpha$ and under the assumption that the presynaptic input $X$ and the target signal $Y_T$ are stationary processes, the following learning rule can be derived:\n\n$$\\frac{d}{dt} w_j(t) = \\alpha \\frac{Y(t) \\nu_j(t)}{u(t) \\bar{u}(t)} \\left( -\\left( u(t) - \\bar{u}(t) \\right) + \\beta \\left( F[Y_T](t) - \\overline{F[Y_T]}(t) \\right) \\right) - \\alpha \\lambda w_j(t), \\qquad (4)$$\n\nwhere the operator $\\overline{(\\cdot)}$ denotes the low-pass filter with a time constant $\\tau_C$ (in simulations $\\tau_C = 3$ s), i.e. for a function $f$:\n\n$$\\bar{f}(t) = \\frac{1}{\\tau_C} \\int_{-\\infty}^{t} \\exp\\left( -\\frac{t - s}{\\tau_C} \\right) f(s)\\, ds. \\qquad (5)$$\n\nThe operator $F[Y_T](t)$ appearing in (4) is equal to the expectation value of the membrane potential $\\langle u(t) \\rangle_{X|Y_T} = E[u(t)|Y_T]$, given the observations $(Y_T(\\tau) | \\tau \\in \\mathbb{R})$ of the relevance signal; $F$ is thus closely linked to estimation and filtering theory. For a known joint distribution of the processes $X$ and $Y_T$, the operator $F$ could in principle be calculated exactly, but it is not clear how this quantity can be estimated in an online process; thus we look for a simple approximation to $F$. Under the above assumptions, $F$ is time invariant and can be approximated by a Volterra series (for details see [4]):\n\n$$\\langle u(t) \\rangle_{X|Y_T} = F[Y_T](t) = \\sum_{n=0}^{\\infty} \\int_{\\mathbb{R}} \\cdots \\int_{\\mathbb{R}} \\kappa_n(t - t_1, \\dots, t - t_n) \\prod_{i=1}^{n} Y_T(t_i)\\, dt_i. \\qquad (6)$$\n\nIn this article, we concentrate on the situation where $F$ can be well approximated by its linearization $F_1[Y_T](t)$, corresponding to a linear estimator of $\\langle u(t) \\rangle_{X|Y_T}$. For $F_1[Y_T](t)$ we make the following ansatz:\n\n$$F[Y_T](t) \\approx F_1[Y_T](t) = c \\cdot u_T(t) = c \\int_{\\mathbb{R}} \\kappa_1(t - t_1) Y_T(t_1)\\, dt_1. \\qquad (7)$$\n\nAccording to (7), $F$ is approximated by the convolution $u_T(t)$ of the relevance signal $Y_T$ with the kernel $\\kappa_1$, scaled by a suitable prefactor $c$. Assuming positively correlated $X$ and $Y_T$, $\\kappa_1(t)$ is chosen to be a non-anticipating decaying exponential $\\exp(-t/\\tau_0) \\Theta(t)$ with a time constant $\\tau_0$ (in simulations $\\tau_0 = 100$ ms), where $\\Theta(t)$ is the Heaviside step function. 
This choice is motivated by the standard models for the impact of neuromodulators (see [7]); thus such a kernel may be implemented in a realistic biological mechanism. It turned out that the choice of $\\tau_0$ was not critical; it could be varied over a decade ranging from 10 ms to 100 ms. The prefactor $c$ appearing in (7) can be determined from the fact that $F_1$ is the optimal linear estimator of the form given in (7), leading to:\n\n$$c = \\frac{\\langle u_T(t), u(t) \\rangle}{\\langle u_T(t), u_T(t) \\rangle}.$$\n\nThe quantity $c$ can be estimated online in the following way:\n\n$$\\frac{d}{dt} c(t) = \\left( u_T(t) - \\bar{u}_T(t) \\right) \\left[ \\left( u(t) - \\bar{u}(t) \\right) - c(t) \\left( u_T(t) - \\bar{u}_T(t) \\right) \\right].$$\n\nUsing the above definitions, the resulting learning rule is given by (in vector notation):\n\n$$\\frac{d}{dt} w(t) = \\alpha \\frac{Y(t) \\nu(t)}{u(t) \\bar{u}(t)} \\left[ -\\left( u(t) - \\bar{u}(t) \\right) + c(t) \\beta \\left( u_T(t) - \\bar{u}_T(t) \\right) \\right] - \\alpha \\lambda w(t). \\qquad (8)$$\n\nEquation (8) will be called the spike-based learning rule, as the postsynaptic spike train $Y(t)$ explicitly appears. An accompanying rate-based learning rule can also be derived:\n\n$$\\frac{d}{dt} w(t) = \\alpha \\frac{\\nu(t)}{u_0 \\bar{u}(t)} \\left[ -\\left( u(t) - \\bar{u}(t) \\right) + c(t) \\beta \\left( u_T(t) - \\bar{u}_T(t) \\right) \\right] - \\alpha \\lambda w(t). \\qquad (9)$$\n\n3 Analytical results\n\nThe learning rules (8) and (9) are stochastic differential equations for the weights $w_j$ driven by the processes $Y(\\cdot)$, $\\nu_j(\\cdot)$ and $u_T(\\cdot)$, of which the last two are assumed to be stationary with the means $\\langle \\nu_j(t) \\rangle = \\nu_0$ and $\\langle u_T(t) \\rangle = u_{T,0}$ respectively. 
The evolution of the solutions $w(t)$ to (8) and (9) may be studied via a Master equation for the probability distribution of the weights $p(w, t)$ (see [8]). For small learning rates $\\alpha$, the stationary distribution $p(w)$ sharply peaks$^3$ at the roots of the drift function $A(w)$ of the corresponding Fokker-Planck equation (the detailed derivation is given in [4]). Thus, for $\\alpha \\ll 1$, the temporal evolution of the learning rules (8) and (9) may be studied via the deterministic differential equation:\n\n$$\\frac{d}{dt} \\hat{w} = A(\\hat{w}) = \\alpha \\frac{1}{\\nu_0 u_0 z} \\left( -C^0 + \\beta C^1 \\right) \\hat{w} - \\alpha \\lambda \\hat{w}, \\qquad (10)$$\n\n$$z = \\sum_{j=1}^{N} \\hat{w}_j, \\qquad (11)$$\n\nwhere $z$ is the total weight. The matrix $C = -C^0 + \\beta C^1$ (with the elements $C_{ij}$) has two contributions. $C^0$ is the covariance matrix of the input and the matrix $C^1$ quantifies the covariance between the activities $\\nu_j$ and the trace $u_T$:\n\n$$C^0_{ij} = \\langle \\nu_i(t), \\nu_j(t) \\rangle, \\qquad C^1_{ij} = \\frac{\\langle \\nu_i(t), u_T(t) \\rangle \\langle u_T(t), \\nu_j(t) \\rangle}{\\langle u_T(t), u_T(t) \\rangle}.$$\n\nNow the critical points $w^*$ of the dynamics of (10) are investigated. These critical points, if asymptotically stable, determine the peaks of the stationary distribution $p(w)$ of the weights $w$; we therefore expect the solutions of the stochastic equations to fluctuate around these fixed points $w^*$. 
If $\\beta$ and $\\lambda$ are much larger than one, the term containing the matrix $C^0$ can be neglected and equation (10) has a unique stable fixed point $w^*$:\n\n$$w^* \\propto C^T, \\qquad C^T_i = \\langle \\nu_i(t), u_T(t) \\rangle.$$\n\nUnder this assumption the maximal mutual information between the target signal $Y_T(t)$ and the output of the neuron $Y(t)$ is obtained by a weight vector $w = w^*$ that is parallel to the covariance vector $C^T$.\n\nIn general, the critical points of equation (10) depend on the eigenvalue spectrum of the symmetric matrix $C$: If all eigenvalues are negative, the weight vector $\\hat{w}$ decays to the lower hard bound 0. In case of at least one positive eigenvalue (which exists if $\\beta$ is chosen large enough), there is a unique stable fixed point $w^*$:\n\n$$w^* = \\frac{\\mu}{\\lambda u_0 \\nu_0 \\bar{b}}\\, b, \\qquad \\bar{b} := \\sum_{i=1}^{N} b_i. \\qquad (12)$$\n\nThe vector $b$ appearing in (12) is the eigenvector of $C$ corresponding to the largest eigenvalue $\\mu$. Thus, a stationary unimodal$^4$ distribution $p(w)$ of the weights $w$ is predicted, which is centered around the value $w^*$.\n\n$^3$ It can be shown that the diffusion term in the FP equation scales like $O(\\alpha)$, i.e. for small learning rates $\\alpha$, fluctuations tend to zero and the dynamics can be approximated by the differential equation (10).\n\n$^4$ Note that $p(w)$ denotes the distribution of the weight vector, not the distribution of a single weight $p(w_j)$.\n\n4 A concrete example for IB optimization\n\nA special scenario of interest that often appears in the literature (see for example [1], [9] and [10]) is the following: The synapses, and subsequently the input spike trains, form $M$ different subgroups $G_l$, $l \\in \\{1, \\dots, M\\}$, of the same size $N/M \\in \\mathbb{N}$. The spike trains $X_j$ and $X_k$, $j \\neq k$, are statistically independent if they belong to different subgroups; within a subgroup there is a homogeneous covariance term $C^0_{jk} = c_l$, $j \\neq k$ for $j, k \\in G_l$, which can be due either to spike-spike correlations or correlations in rate modulations. The covariance between the target signal $Y_T$ and the spike trains $X_j$ is homogeneous among a subgroup.\n\nFigure 1: A The basic setup for the Information Bottleneck optimization. B-D Numerical and analytical results for the IB optimization task described in section 4. The temporal evolution of the average weights $\\tilde{w}_l = \\frac{M}{N} \\sum_{j \\in G_l} w_j$ of the four different synaptic subgroups $G_l$ is shown. B The performance of the spike-based rule (8). The highest trajectory corresponds to $\\tilde{w}_1$; it stays close to its analytically predicted fixed point value obtained from (12), which is visualized by the upper dashed line. The trajectory just below belongs to $\\tilde{w}_3$, for which the fixed point value is also plotted as a dashed line. The other two trajectories $\\tilde{w}_2$ and $\\tilde{w}_4$ decay and eventually fluctuate above the predicted value of zero. C The performance of the rate-based rule (9); results are analogous to the ones of the spike-based rule. D Simulation of the deterministic equation (10).\n\nAs a numerical example, we consider in figure 1 a modification of the IB task presented in figure 2 of [1]. The $N = 100$ synapses form $M = 4$ subgroups $G_l = \\{25(l - 1) + 1, \\dots, 25l\\}$, $l \\in \\{1, \\dots, 4\\}$. Synapses in $G_1$ receive Poisson spike trains of constant rate $\\nu_0 = 20$ Hz, which are mutually spike-spike correlated with a correlation coefficient$^5$ of 0.5. The same holds for the spike trains of $G_2$. Spike trains for $G_3$ and $G_4$ are uncorrelated Poisson trains with a common rate modulation, which is equal to low-pass filtered white noise (cut-off frequency 5 Hz) with mean $\\nu_0$ and standard deviation (SD) $\\sigma = \\nu_0/2$. The rate modulations for $G_3$ and $G_4$ are however independent (though identically distributed). 
Two spike trains for different synapse subgroups are statistically independent. The target signal $Y_T$ was chosen to be the sum of two Poisson trains. The first is of constant rate $\\nu_0$ and has spike-spike correlations with $G_1$ of coefficient 0.5; the second is a Poisson spike train with the same rate modulation as the spike trains of $G_3$ superimposed by additional white noise of SD 2 Hz. Furthermore, the target signal was turned off during random intervals$^6$. The resulting evolution of the weights is shown in figure 1, illustrating the performance of the spike-based rule (8) as well as of the rate-based rule (9). As expected, the weights of $G_1$ and $G_3$ are potentiated as $Y_T$ has mutual information with the corresponding part of the input. The synapses of $G_2$ and $G_4$ are depressed. The analytical result for the stable fixed point $w^*$ obtained from (12) is shown as dashed lines and is in good agreement with the numerical results. Furthermore, the trajectory of the solution $\\hat{w}(t)$ to the deterministic equation (10) is plotted.\n\nThe presented concrete IB task was slightly changed from the one presented in [1], because for the setting used here, the largest eigenvalue $\\mu$ of $C$ and its corresponding eigenvector $b$ can be calculated analytically. The simulation results for the original setting in [1] can also be reproduced with the simpler rules (8) and (9) (not shown).\n\n$^5$ Spike-spike correlated Poisson spike trains were generated according to the method outlined in [9].\n\n$^6$ These intervals of silence were modeled as random telegraph noise with a time constant of 200 ms and an overall probability of silence of 0.5.\n\n5 Relevance-modulated PCA with spiking neurons\n\nThe presented learning rules (8) and (9) exhibit a close relation to Principal Component Analysis (PCA). A learning rule which enables the linear Poisson neuron to extract principal components from the input $X(\\cdot)$ 
can be derived by maximizing the following objective function:\n\n$$L_{PCA} = -L_{IB} - \\lambda \\sum_{j=1}^{N} w_j^2 = +I(X, Y) - \\beta I(Y_T, Y) - \\lambda \\sum_{j=1}^{N} w_j^2, \\qquad (13)$$\n\nwhich just differs from (3) by a change of sign in front of $L_{IB}$. The resulting learning rule is in close analogy to (8):\n\n$$\\frac{d}{dt} w(t) = \\alpha \\frac{Y(t) \\nu(t)}{u(t) \\bar{u}(t)} \\left[ \\left( u(t) - \\bar{u}(t) \\right) - c(t) \\beta \\left( u_T(t) - \\bar{u}_T(t) \\right) \\right] - \\alpha \\lambda w(t). \\qquad (14)$$\n\nThe corresponding rate-based version can also be derived. Without the trace $u_T(\\cdot)$ of the target signal, it can be seen that the solution $\\hat{w}(t)$ of the deterministic equation corresponding to (14) (which is of the same form as (10) with the obvious sign changes) converges to an eigenvector of the covariance matrix $C^0$. Thus, for $\\beta = 0$ we expect the learning rule (14) to perform PCA for small learning rates $\\alpha$. The rule (14) without the relevance signal is comparable to other PCA rules, e.g. the covariance rule (see [11]) for non-spiking neurons.\n\nThe side information given by the relevance signal $Y_T(\\cdot)$ can be used to extract specific principal components from the input; thus we call this paradigm relevance-modulated PCA. Before we consider a concrete example for relevance-modulated PCA, we want to point out a further application of the learning rule (14).\n\nThe target signal $Y_T$ can also be used to extract different components from the input with different neurons (see figure 2). Consider $m$ neurons receiving the same input $X$. These neurons have the outputs $Y_1(\\cdot), \\dots, Y_m(\\cdot)$, target signals $Y_T^1(\\cdot), \\dots, Y_T^m(\\cdot)$ and weight vectors $w_1(t), \\dots, w_m(t)$, the latter evolving according to (14). In order to prevent all weight vectors from converging towards the same eigenvector of $C^0$ (the principal component), the target signal $Y_T^i$ for neuron $i$ is chosen to be the sum of all output spike trains except $Y_i$:\n\n$$Y_T^i(t) = \\sum_{j=1,\\, j \\neq i}^{m} Y_j(t). \\qquad (15)$$\n\nIf one weight vector $w_i(t)$ is already close to the eigenvector $e_k$ of $C^0$, then by means of (15), the basins of attraction of $e_k$ for the other weight vectors $w_j(t)$, $j \\neq i$, are reduced (or even vanish, depending on the value of $\\beta$). It is therefore less likely (or impossible) that they also converge to $e_k$. In practice, this setup is sufficiently robust if only a small number ($\\leq 4$) of different components is to be extracted and if the differences between the eigenvalues $\\lambda_i$ of these principal components are not too big$^7$. For the PCA learning rule, the time constant $\\tau_0$ of the kernel $\\kappa_1$ (see (7)) had to be chosen smaller than for the IB tasks in order to obtain good performance; we used $\\tau_0 = 10$ ms in simulations. This is in the range of time constants for IPSPs. Hence, the signals $Y_T^i$ could probably be implemented via lateral inhibition.\n\nThe learning rule considered in [3] displayed a close relation to Independent Component Analysis (ICA). Because of the linear neuron model used here and the linearization of further terms in the derivation, the resulting learning rule (14) performs PCA instead of ICA.\n\nThe results of a numerical example are shown in figure 2. For the regular PCA experiment, the $m = 3$ neurons receive the same input $X$ and their weights change according to (14). The weights and input spike trains are grouped into four subgroups $G_1, \\dots, G_4$, as for the IB optimization discussed in section 4. The only difference is that all groups (except for $G_4$) receive spike-spike correlated Poisson spike trains with a correlation coefficient for the groups $G_1$, $G_2$, $G_3$ of 0.5, 0.45, 0.4 respectively. Group $G_4$ receives uncorrelated Poisson spike trains. As can be seen in figure 2 D to F, the different neurons specialize on different principal components corresponding to potentiated synaptic subgroups $G_1$, $G_2$ and $G_3$ respectively. Without the relevance signals $Y_T^i(\\cdot)$, all neurons tend to specialize on the principal component corresponding to $G_1$ (not shown).\n\n$^7$ Note that the input $X$ may well exhibit a much larger number of principal components. However, it is only possible to extract a limited number of them by different neurons at the same time.\n\nFigure 2: A The basic setup for the PCA task: The $m$ different neurons receive the same input $X$ and are expected to extract different principal components of it. B-F The temporal evolution of the average subgroup weights $\\tilde{w}_l = \\frac{1}{25} \\sum_{j \\in G_l} w_j$ for the groups $G_1$ (black solid line), $G_2$ (light gray solid line) and $G_3$ (dotted line). B-C Results for the relevance-modulated PCA task: neuron 1 (fig. B) specializes on $G_2$ and neuron 2 (fig. C) on subgroup $G_3$. D-F Results for the regular PCA task: neuron 1 (fig. D) specializes on $G_1$, neuron 2 (fig. E) on $G_2$ and neuron 3 (fig. F) on $G_3$.\n\nAs a concrete example for relevance-modulated PCA, we consider the above setup with slight modifications: Now we want $m = 2$ neurons to extract the components $G_2$ and $G_3$ from the input $X$, and not the principal component $G_1$. This is achieved with an additional relevance signal $Y_T^0$, which is the same for both neurons and has spike-spike correlations with $G_2$ and $G_3$ of 0.45 and 0.4. We add the term $\\gamma I(Y, Y_T^0)$ to the objective function (13), where $\\gamma$ is a positive trade-off factor. The resulting learning rule has exactly the same structure as (14), with an additional term due to $Y_T^0$. The numerical results are presented in figure 2 B and C, showing that it is possible in this setup to explicitly select the principal components that are extracted (or not extracted) by the neurons.\n\n6 Discussion\n\nWe have introduced and analyzed a simple and perspicuous rule that enables spiking neurons to perform IB optimization in an online manner. Our simulations show that this rule works as well as the substantially more complex learning rule that had previously been proposed in [3]. It also performs well for more realistic neuron models as indicated in [4]. We have shown that the convergence properties of our simplified IB rule can be analyzed with the help of the Fokker-Planck equation (alternatively one may also use the theoretical framework described in A.2 in [12] for its analysis). The investigation of the weight vectors to which this rule converges reveals interesting relationships to PCA. Apparently, very little is known about learning rules that enable spiking neurons to extract multiple principal components from an input stream (a discussion of a basic learning rule performing PCA is given in chapter 11.2.4 of [5]). We have demonstrated both analytically and through simulations that a slight variation of our new learning rule performs PCA. Our derivation of this rule within the IB framework opens the door to new variations of PCA where preferentially those components are extracted from a high dimensional input stream that are, or are not, related to some external relevance variable. We expect that a further investigation of such methods will shed light on the unknown principles of unsupervised and semi-supervised learning that might shape and constantly retune the output of lower cortical areas to intermediate and higher cortical areas. 
The learning rule that we have proposed might in principle be able to extract from high-dimensional sensory input streams $X$ those components that are related to other sensory modalities or to internal expectations and goals.\n\nQuantitative biological data on the precise way in which relevance signals $Y_T$ (such as for example dopamine) might reach neurons in the cortex and modulate their synaptic plasticity are still missing. But it is fair to assume that these signals reach the synapse in a low-pass filtered form of the type $u_T$ that we have assumed for our learning rules. From that perspective one can view the learning rules that we have derived (in contrast to the rules proposed in [3]) as local learning rules.\n\nAcknowledgments\n\nWritten under partial support by the Austrian Science Fund FWF, project # P17229, project # S9102 and project # FP6-015879 (FACETS) of the European Union.\n\nReferences\n\n[1] S. Klampfl, R. A. Legenstein, and W. Maass. Information bottleneck optimization and independent component extraction with spiking neurons. In Proc. of NIPS 2006, Advances in Neural Information Processing Systems, volume 19. MIT Press, 2007.\n\n[2] N. Tishby, F. C. Pereira, and W. Bialek. The information bottleneck method. In Proceedings of the 37-th Annual Allerton Conference on Communication, Control and Computing, pages 368\u2013377, 1999.\n\n[3] S. Klampfl, R. Legenstein, and W. Maass. Spiking neurons can learn to solve information bottleneck problems and to extract independent components. Neural Computation, 2007. in press.\n\n[4] L. Buesing and W. Maass. journal version. 2007. in preparation.\n\n[5] W. Gerstner and W. M. Kistler. Spiking Neuron Models. Cambridge University Press, Cambridge, 2002.\n\n[6] Taro Toyoizumi, Jean-Pascal Pfister, Kazuyuki Aihara, and Wulfram Gerstner. Optimality Model of Unsupervised Spike-Timing Dependent Plasticity: Synaptic Memory and Weight Distribution. 
Neural Computation, 19(3):639\u2013671, 2007.\n\n[7] Eugene M. Izhikevich. Solving the Distal Reward Problem through Linkage of STDP and Dopamine Signaling. Cereb. Cortex, page bhl152, 2007.\n\n[8] H. Risken. The Fokker-Planck Equation. Springer, 3rd edition, 1996.\n\n[9] R. G\u00fctig, R. Aharonov, S. Rotter, and H. Sompolinsky. Learning input correlations through non-linear temporally asymmetric Hebbian plasticity. Journal of Neurosci., 23:3697\u20133714, 2003.\n\n[10] H. Meffin, J. Besson, A. N. Burkitt, and D. B. Grayden. Learning the structure of correlated synaptic subgroups using stable and competitive spike-timing-dependent plasticity. Physical Review E, 73, 2006.\n\n[11] T. J. Sejnowski and G. Tesauro. The Hebb rule for synaptic plasticity: algorithms and implementations. In J. H. Byrne and W. O. Berry, editors, Neural Models of Plasticity, pages 94\u2013103. Academic Press, 1989.\n\n[12] N. Intrator and L. N. Cooper. Objective function formulation of the BCM theory of visual cortical plasticity: statistical connections, stability conditions. Neural Networks, 5:3\u201317, 1992.\n", "award": [], "sourceid": 416, "authors": [{"given_name": "Lars", "family_name": "Buesing", "institution": null}, {"given_name": "Wolfgang", "family_name": "Maass", "institution": null}]}