{"title": "Synaptic Sampling: A Bayesian Approach to Neural Network Plasticity and Rewiring", "book": "Advances in Neural Information Processing Systems", "page_first": 370, "page_last": 378, "abstract": "We reexamine in this article the conceptual and mathematical framework for understanding the organization of plasticity in spiking neural networks. We propose that inherent stochasticity enables synaptic plasticity to carry out probabilistic inference by sampling from a posterior distribution of synaptic parameters. This view provides a viable alternative to existing models that propose convergence of synaptic weights to maximum likelihood parameters. It explains how priors on weight distributions and connection probabilities can be merged optimally with learned experience. In simulations we show that our model for synaptic plasticity allows spiking neural networks to compensate continuously for unforeseen disturbances. Furthermore it provides a normative mathematical framework to better understand the permanent variability and rewiring observed in brain networks.", "full_text": "Synaptic Sampling: A Bayesian Approach to\n\nNeural Network Plasticity and Rewiring\n\nDavid Kappel1\n\nStefan Habenschuss1\n\nRobert Legenstein\n\nWolfgang Maass\n\nInstitute for Theoretical Computer Science\n\nGraz University of Technology\n\nA-8010 Graz, Austria\n\n[kappel, habenschuss, legi, maass]@igi.tugraz.at\n\nAbstract\n\nWe reexamine in this article the conceptual and mathematical framework for un-\nderstanding the organization of plasticity in spiking neural networks. We propose\nthat inherent stochasticity enables synaptic plasticity to carry out probabilistic in-\nference by sampling from a posterior distribution of synaptic parameters. This\nview provides a viable alternative to existing models that propose convergence of\nsynaptic weights to maximum likelihood parameters. 
It explains how priors on weight distributions and connection probabilities can be merged optimally with learned experience. In simulations we show that our model for synaptic plasticity allows spiking neural networks to compensate continuously for unforeseen disturbances. Furthermore it provides a normative mathematical framework to better understand the permanent variability and rewiring observed in brain networks.\n\n1 Introduction\n\nIn the 19th century, Helmholtz proposed that perception could be understood as unconscious inference [1]. This insight has recently (re)gained considerable attention in models of Bayesian inference in neural networks [2]. The hallmark of this theory is the assumption that the activity z of neuronal networks can be viewed as an internal model for hidden variables in the outside world that give rise to sensory experiences x. This hidden state z is usually assumed to be represented by the activity of neurons in the network. A network N of stochastically firing neurons is modeled in this framework by a probability distribution pN(x, z | θ) that describes the probabilistic relationships between a set of N inputs x = (x1, . . . , xN) and corresponding network responses z = (z1, . . . , zN), where θ denotes the vector of network parameters that shape this distribution, e.g., via synaptic weights and network connectivity. The likelihood pN(x | θ) = ∑z pN(x, z | θ) of the actually occurring inputs x under the resulting internal model can then be viewed as a measure for the agreement between this internal model (which carries out \u201cpredictive coding\u201d [3]) and its environment (which generates x).\nThe goal of network learning is usually described in this probabilistic generative framework as finding parameter values θ* that maximize this agreement, or equivalently the likelihood of the inputs x (maximum likelihood learning): θ* = argmaxθ pN(x | θ). Locally optimal estimates of θ* can be determined by gradient ascent on the data likelihood pN(x | θ), which led to many previous models of network plasticity [4, 5, 6]. While these models learn point estimates of locally optimal parameters θ*, theoretical considerations for artificial neural networks suggest that it is advantageous to learn full posterior distributions p*(θ) over parameters. This full Bayesian treatment of learning allows structural parameter priors to be integrated in a Bayes-optimal way and promises better generalization of the acquired knowledge to new inputs [7, 8]. The problem how such posterior distributions could be learned by brain networks has been highlighted in [2] as an important future challenge in computational neuroscience.\n\n1 These authors contributed equally.\n\nFigure 1: Illustration of synaptic sampling for two parameters θ = {θ1, θ2} of a neural network N. A: 3D plot of an example likelihood function. For a fixed set of inputs x it assigns a probability density (amplitude on z-axis) to each parameter setting θ. The likelihood function is defined by the underlying neural network N. B: Example for a prior that prefers small values for θ. 
C: The posterior that results as product of the prior (B) and the likelihood (A). D: A single trajectory of synaptic sampling from the posterior (C), starting at the black dot. The parameter vector θ fluctuates between different solutions; the visited values cluster near local optima (red triangles). E: Cartoon illustrating the dynamic forces (plasticity rule (2)) that enable the network to sample from the posterior distribution p*(θ | x) in (D).\n\nHere we introduce a possible solution to this problem. We present a new theoretical framework for analyzing and understanding local plasticity mechanisms of networks of neurons as stochastic processes that generate specific distributions p*(θ) of network parameters θ over which these parameters fluctuate. We call this new theoretical framework synaptic sampling. We use it here to analyze and model unsupervised learning and rewiring in spiking neural networks. In Section 3 we show that the synaptic sampling hypothesis also provides a unified framework for structural and synaptic plasticity, which are both integrated here into a single learning rule. This model captures salient features of the permanent rewiring and fluctuation of synaptic efficacies observed in the cortex [9, 10]. In computer simulations, we demonstrate another advantage of the synaptic sampling framework: it endows neural circuits with an inherent robustness against perturbations [11].\n\n2 Learning a posterior distribution through stochastic synaptic plasticity\n\nIn our learning framework we assume that not only a neural network N as described above, but also a prior pS(θ) for its parameters θ = (θ1, . . . , θM) is given. This prior pS can encode both structural constraints (such as sparse connectivity) and structural rules (e.g., a heavy-tailed distribution of synaptic weights). 
Then the goal of network learning becomes:\n\nlearn the posterior distribution  p*(θ | x) = (1/Z) pS(θ) · pN(x | θ) ,   (1)\n\nwith normalizing constant Z. A key insight (see Fig. 1 for an illustration) is that stochastic local plasticity rules for the parameters θi enable a network to achieve the learning goal (1): the distribution of network parameters θ will converge after a while to the posterior distribution (1) – and produce samples from it – if each network parameter θi obeys the dynamics\n\ndθi = ( b(θi) ∂/∂θi log pS(θ) + b(θi) ∂/∂θi log pN(x | θ) + T b′(θi) ) dt + √(2 T b(θi)) dWi ,   (2)\n\nfor i = 1, . . . , M and b′(θi) = ∂/∂θi b(θi). The stochastic term dWi describes infinitesimal stochastic increments and decrements of a Wiener process Wi, where process increments over time t − s are normally distributed with zero mean and variance t − s, i.e. Wi(t) − Wi(s) ∼ NORMAL(0, t − s) [12].\nThe dynamics (2) extend previous models of Bayesian learning via sampling [13, 14] by including a temperature T > 0 and a sampling-speed parameter b(θi) > 0 that can depend on the current value of θi without changing the stationary distribution. For example, the sampling speed of a synaptic weight can be slowed down if it reaches very high or very low values.\nThe temperature parameter T can be used to scale the diffusion term (i.e., the noise). The resulting stationary distribution of θ is proportional to p*(θ)^(1/T), so that the dynamics of the stochastic process can be described by the energy landscape (1/T) log p*(θ). For high values of T this energy landscape is flattened, i.e., the main modes of p*(θ) become less pronounced. For T = 1 we arrive at the learning goal (1). 
For T → 0 the dynamics of θ approaches a deterministic process and converges to the next local maximum of p*(θ). Thus the learning process approximates for low values of T maximum a posteriori (MAP) inference [8]. The result is formalized in the following theorem:\n\nTheorem 1. Let p(x, θ) be a strictly positive, continuous probability distribution over continuous or discrete states x and continuous parameters θ = (θ1, . . . , θM), twice continuously differentiable with respect to θ. Let b(θ) be a strictly positive, twice continuously differentiable function. Then the set of stochastic differential equations (2) leaves the distribution p*(θ) invariant:\n\np*(θ) ≡ (1/Z′) p*(θ | x)^(1/T) ,   (3)\n\nwith Z′ = ∫ p*(θ | x)^(1/T) dθ. Furthermore, p*(θ) is the unique stationary distribution of (2).\n\nProof: First, note that the first two terms in the drift term of Eq. (2) can be written as\n\nb(θi) ∂/∂θi log pS(θ) + b(θi) ∂/∂θi log pN(x | θ) = b(θi) ∂/∂θi log p(θi | x, θ\\i) ,\n\nwhere θ\\i denotes the vector of parameters excluding parameter θi. Hence, the dynamics (2) can be written in terms of an Itô stochastic differential equation with drift Ai(θ) and diffusion Bi(θ):\n\ndθi = ( b(θi) ∂/∂θi log p(θi | x, θ\\i) + T b′(θi) ) dt + √(2 T b(θi)) dWi ,   (4)\n\nwith drift Ai(θ) = b(θi) ∂/∂θi log p(θi | x, θ\\i) + T b′(θi) and diffusion Bi(θ) = 2 T b(θi). This describes the stochastic dynamics of each parameter over time. For the stationary distribution we are interested in the dynamics of the distribution of parameters. Eq. (4) translates into the following Fokker-Planck equation, which determines the temporal dynamics of the distribution pFP(θ, t) over network parameters θ at time t (see [12]):\n\nd/dt pFP(θ, t) = ∑i [ −∂/∂θi ( Ai(θ) pFP(θ, t) ) + (1/2) ∂²/∂θi² ( Bi(θ) pFP(θ, t) ) ] .   (5)\n\nPlugging the presumed stationary distribution p*(θ) into the right-hand side of Eq. (5), one obtains\n\nd/dt pFP(θ, t) = ∑i [ −∂/∂θi ( Ai(θ) p*(θ) ) + (1/2) ∂²/∂θi² ( Bi(θ) p*(θ) ) ]\n= ∑i [ −∂/∂θi ( b(θi) p*(θ) ∂/∂θi log p(θi | x, θ\\i) ) + ∂/∂θi ( T b(θi) p*(θ) ∂/∂θi log p*(θ) ) ] ,\n\nwhich, by inserting for p*(θ) the assumed stationary distribution (3) and using T ∂/∂θi log p*(θ) = ∂/∂θi ( log p(θ\\i | x) + log p(θi | x, θ\\i) ) = ∂/∂θi log p(θi | x, θ\\i), becomes\n\nd/dt pFP(θ, t) = ∑i [ −∂/∂θi ( b(θi) p*(θ) ∂/∂θi log p(θi | x, θ\\i) ) + ∂/∂θi ( b(θi) p*(θ) ∂/∂θi log p(θi | x, θ\\i) ) ] = 0 .\n\nThis proves that p*(θ) is a stationary distribution of the parameter sampling dynamics (4). Under the assumption that b(θi) is strictly positive, this stationary distribution is also unique. If the matrix of diffusion coefficients is invertible, and the potential conditions are satisfied (see Section 3.7.2 in [12] for details), the stationary distribution can be obtained (uniquely) by simple integration. 
Since the matrix of diffusion coefficients B is diagonal in our model (B = diag(B1(θ), . . . , BM(θ))), B is trivially invertible since all elements, i.e. all Bi(θ), are positive. Convergence and uniqueness of the stationary distribution follows then for strictly positive b(θi) (see Section 5.3.3 in [12]). □\n\n2.1 Online synaptic sampling\n\nFor sequences of N inputs x = (x1, . . . , xN), the weight update rule (2) depends on all inputs, such that synapses have to keep track of the whole set of all network inputs for the exact dynamics (batch learning). In an online scenario, we assume that only the current network input xn is available. According to the dynamics (2), synaptic plasticity rules have to compute the log likelihood derivative ∂/∂θi log pN(x | θ). We assume that every τx time units a different input xn is presented to the network and that the inputs x1, . . . , xN are visited repeatedly in a fixed regular order. Under the assumption that the input patterns are statistically independent the likelihood pN(x | θ) becomes\n\npN(x | θ) = pN(x1, . . . , xN | θ) = ∏_{n=1}^{N} pN(xn | θ) ,   (6)\n\ni.e., each network input xn can be explained as being drawn individually from pN(xn | θ), independently from other inputs. The derivative of the log likelihood in (2) is then given by ∂/∂θi log pN(x | θ) = ∑_{n=1}^{N} ∂/∂θi log pN(xn | θ). This \u201cbatch\u201d dynamics does not map readily onto a network implementation because the weight update requires at any time knowledge of all inputs x1, . . . , xN. We provide here an online approximation for small sampling speeds. To obtain an online learning rule, we consider the parameter dynamics\n\ndθi = ( b(θi) ∂/∂θi log pS(θ) + N b(θi) ∂/∂θi log pN(xn | θ) + T b′(θi) ) dt + √(2 T b(θi)) dWi .   (7)\n\nAs in the batch learning setting, we assume that each input xn is presented for a time interval of τx. Although convergence to the correct posterior distribution cannot be guaranteed theoretically for this online rule, we show that it is a reasonable approximation to the batch rule. Integrating the parameter changes (7) over one full presentation of the data x, i.e., starting from t = 0 with some initial parameter values θ(0) up to time t = N τx, we obtain for slow sampling speeds (N τx b(θi) ≪ 1)\n\nθi(N τx) − θi(0) ≈ N τx ( b(θi(0)) ∂/∂θi log pS(θ(0)) + b(θi(0)) ∑_{n=1}^{N} ∂/∂θi log pN(xn | θ(0)) + T b′(θi(0)) ) + √(2 T b(θi(0))) (Wi(N τx) − Wi(0)) .   (8)\n\nThis is also what one obtains when integrating the batch rule (2) for N τx time units (for slow b(θi)). Hence, for slow enough b(θi), (7) is a good approximation of optimal weight sampling.\nIn the presence of hidden variables z, maximum likelihood learning cannot be applied directly, since the state of the hidden variables is not known from the observed data. The expectation maximization algorithm [8] can be used to overcome this problem. We adopt this approach here. In the online setting, when pattern xn is applied to the network, it responds with network state zn according to pN(zn | xn, θ), where the current network parameters are used in this inference process. 
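The online dynamics (7) can be simulated with a simple Euler-Maruyama discretization. The following minimal sketch substitutes a scalar toy model (Gaussian prior, unit-variance Gaussian likelihood) for the paper's spiking network, so that the sampled distribution can be compared with the exact conjugate posterior; all function names, the toy likelihood, and the step-size choices are our own illustration, not part of the paper.

```python
import numpy as np

def online_synaptic_sampling(data, mu0=0.0, sigma0=1.0, b=1.0, dt=1e-3,
                             T=1.0, steps=200_000, burn_in=20_000, seed=0):
    """Euler-Maruyama discretization of the online dynamics (7) for a single
    parameter theta, with Gaussian prior p_S = N(mu0, sigma0^2) and a toy
    Gaussian likelihood p_N(x_n | theta) = N(x_n; theta, 1)."""
    rng = np.random.default_rng(seed)
    N = len(data)
    theta, samples = 0.0, []
    for t in range(steps):
        x_n = data[t % N]                        # one input presented at a time
        grad_prior = (mu0 - theta) / sigma0**2   # d/dtheta log p_S(theta)
        grad_lik = x_n - theta                   # d/dtheta log p_N(x_n | theta)
        # rule (7): prior gradient plus N times the current likelihood
        # gradient; b is constant here, so the T b'(theta) term vanishes
        theta += b * (grad_prior + N * grad_lik) * dt
        theta += np.sqrt(2.0 * T * b * dt) * rng.standard_normal()
        if t >= burn_in:
            samples.append(theta)
    return np.asarray(samples)

data = [1.0, 2.0, 0.5, 1.5]
samples = online_synaptic_sampling(data)
# exact posterior for this conjugate toy model: precision 1/sigma0^2 + N = 5,
# mean (mu0/sigma0^2 + sum(data))/5 = 1.0, variance 1/5 = 0.2
print(samples.mean(), samples.var())
```

With these settings the empirical mean and variance of the samples come close to the analytic posterior values 1.0 and 0.2, illustrating for T = 1 the stationary distribution asserted by Theorem 1.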
The parameters are updated in parallel according to the dynamics (8) for the given values of xn and zn.\n\n3 Synaptic sampling for network rewiring\n\nIn this section we present a simple model to describe permanent network rewiring using the dynamics (2). Experimental studies have provided a wealth of information about the stochastic rewiring in the brain (see e.g. [9, 10]). They demonstrate that the volume of a substantial fraction of dendritic spines varies continuously over time, and that all the time new spines and synaptic connections are formed and existing ones are eliminated. We show that these experimental data on spine motility can be understood as special cases of synaptic sampling. To arrive at a concrete model we use the following assumptions about dynamic network rewiring:\n\n1. In accordance with experimental studies [10], we require that spine sizes have a multiplicative dynamics, i.e., that the amount of change within some given time window is proportional to the current size of the spine.\n\n2. We assume here for simplicity that there is a single parameter θi for each potential synaptic connection i.\n\nThe second requirement can be met by encoding the state of the synapse in an abstract form that represents synaptic connectivity and synaptic efficacy in a single parameter θi. We define that negative values of θi represent a current disconnection and positive values represent a functional synaptic connection (we focus on excitatory connections). The distance of the current value of θi from zero indicates how likely it is that the synapse will soon reconnect (for negative values) or withdraw (for positive values). 
In addition the synaptic parameter θi encodes for positive values the synaptic efficacy wi, i.e., the resulting EPSP amplitudes, by a simple mapping wi = f(θi).\nThe first assumption, which requires multiplicative synaptic dynamics, supports an exponential function f in our model, in accordance with previous models of spine motility [10]. Thus, we assume in the following that the efficacy wi of synapse i is given by\n\nwi = exp(θi − θ0) .   (9)\n\nNote that for a large enough offset θ0, negative parameter values θi (which model a non-functional synaptic connection) are automatically mapped onto a tiny region close to zero in the w-space, so that retracted spines have essentially zero synaptic efficacy. In addition we use a Gaussian prior pS(θi) = NORMAL(θi | µ, σ), with mean µ and variance σ² over synaptic parameters. In the simulations we used µ = 0.5, σ = 1 and θ0 = 3. A prior of this form allows a simple regularization mechanism to be included in the learning scheme, which prefers sparse solutions (i.e. solutions with small parameters) [8]. Together with the exponential mapping (9) this prior induces a heavy-tailed prior distribution over synaptic weights wi. The network therefore learns solutions where only the most relevant synapses are much larger than zero.\nThe general rule for online synaptic sampling (7) for the exponential mapping wi = exp(θi − θ0) and the Gaussian prior becomes (for constant small learning rate b ≪ 1 and unit temperature T = 1)\n\ndθi = b ( (1/σ²)(µ − θi) + N wi ∂/∂wi log pN(xn | w) ) dt + √(2b) dWi .   (10)\n\nIn Eq. (10) the multiplicative synaptic dynamics becomes explicit. The gradient ∂/∂wi log pN(xn | w), i.e., the activity-dependent contribution to synaptic plasticity, is weighted by wi. 
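When the activity-dependent term of (10) is switched off (as for a retracted synapse, discussed next), the parameter follows an Ornstein-Uhlenbeck process around the prior mean. A minimal numerical sketch of this regime follows; the prior parameters µ = 0.5, σ = 1, θ0 = 3 are the simulation values given in the text, while the step size, step count and seed are our own choices.

```python
import numpy as np

def retracted_synapse_dynamics(mu=0.5, sigma=1.0, b=1e-3,
                               steps=500_000, seed=1):
    """Simulate dynamics (10) with the activity-dependent likelihood term
    set to zero: d theta = b (mu - theta)/sigma^2 dt + sqrt(2b) dW.
    This is the Ornstein-Uhlenbeck regime of a disconnected synapse."""
    rng = np.random.default_rng(seed)
    theta = mu
    thetas = np.empty(steps)
    for t in range(steps):
        theta += b * (mu - theta) / sigma**2               # drift to prior mean
        theta += np.sqrt(2.0 * b) * rng.standard_normal()  # Wiener increment
        thetas[t] = theta
    return thetas

thetas = retracted_synapse_dynamics()
w = np.exp(thetas - 3.0)        # efficacy mapping (9); tiny while theta < 0
connected = thetas > 0          # positive theta = functional connection
rewiring_events = int((np.diff(connected.astype(int)) != 0).sum())
```

With these settings the parameter spends most of its time above zero but repeatedly retracts and reconnects, qualitatively matching the spontaneous spine turnover described in the text.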
Hence, for negative values of θi (non-functional synaptic connection), the activities of the pre- and post-synaptic neurons have negligible impact on the dynamics of the synapse. Assuming a large enough θ0, retracted synapses therefore evolve solely according to the prior pS(θ) and the random fluctuations dWi. For large values of θi the opposite is the case: the influence of the prior ∂/∂θi log pS(θ) and the Wiener process dWi become negligible, and the dynamics is dominated by the activity-dependent likelihood term.\nIf the activity-dependent second term in Eq. (10) (that tries to maximize the likelihood) is small (e.g., because θi is small or parameters are near a mode of the likelihood) then Eq. (10) implements an Ornstein-Uhlenbeck process. This prediction of our model is consistent with a previous analysis which showed that an Ornstein-Uhlenbeck process is a viable model for synaptic spine motility [10].\n\n3.1 Spiking network model\n\nThrough the use of parameters θ which determine both synaptic connectivity and synaptic weights, the synaptic sampling framework provides a unified model for structural and synaptic plasticity. Eq. (10) describes the stochastic dynamics of the synaptic parameters θi. In this section we analyze the resulting rewiring dynamics and structural plasticity by applying the synaptic sampling framework to networks of spiking neurons. Here, we used winner-take-all (WTA) networks to learn a simple sensory integration task and show that learning with synaptic sampling in such networks is inherently robust to perturbations.\nFor the WTA we adapted the model described in detail in [15]. Briefly, the WTA neurons were modeled as stochastic spike response neurons with a firing rate that depends exponentially on the membrane voltage [16, 17]. 
The membrane potential uk(t) of neuron k at time t is given by\n\nuk(t) = ∑i wki xi(t) + βk(t) ,   (11)\n\nwhere xi(t) denotes the (unweighted) input from input neuron i, wki denotes the efficacy of the synapse from input neuron i, and βk(t) denotes a homeostatic adaptation current (see below). The input xi(t) models the (additive) excitatory postsynaptic current from neuron i. In our simulations we used a double-exponential kernel with time constants τm = 20 ms and τs = 2 ms [18]. The instantaneous firing rate ρk(t) of network neuron k depends exponentially on the membrane potential and is subject to divisive lateral inhibition Ilat(t) (described below): ρk(t) = (ρnet / Ilat(t)) exp(uk(t)), where ρnet = 100 Hz scales the firing rate of neurons [16]. Spike trains were then drawn from independent Poisson processes with instantaneous rate ρk(t) for each neuron. Divisive inhibition [19] between the K neurons in the WTA network was implemented in an idealized form [6], Ilat(t) = ∑_{l=1}^{K} exp(ul(t)). In addition, each output spike caused a slow depressing current, giving rise to the adaptation current βk(t). This implements a slow homeostatic mechanism that regulates the output rate of individual neurons (see [20] for details).\nThe WTA network defined above implicitly defines a generative model [21]. Inputs xn are assumed to be generated in dependence on the value of a hidden multinomial random variable hn that can take on K possible values 1, . . . , K. Each neuron k in the WTA circuit corresponds to one value k of this hidden variable. One obtains the probability of an input vector for a given hidden cause as pN(xn | hn = k, w) = ∏i POISSON(x_i^n | α e^{wki}), with a scaling parameter α > 0. In other words, the synaptic weight wki encodes (in log-space) the firing rate of input neuron i, given that the hidden cause is k. 
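The WTA inference step and the likelihood-gradient update that follows from this generative model can be condensed into a few lines. The sketch below works with rates rather than spike trains, so the normalized exp(uk) plays the role of the instantaneous firing probabilities; the matrix shapes, example numbers, and function names are our own illustration.

```python
import numpy as np

def wta_posterior(x, W):
    """Softmax of the membrane potentials u_k = sum_i w_ki x_i (Eq. (11),
    adaptation current omitted): with divisive inhibition
    I_lat = sum_l exp(u_l), neuron k fires at a rate proportional to
    exp(u_k) / I_lat = p(h = k | x, w)."""
    u = W @ x
    e = np.exp(u - u.max())          # subtract max for numerical stability
    return e / e.sum()

def likelihood_gradient(W, x, k, alpha=1.0):
    """Likelihood-gradient rule (12) for a spike of neuron k:
    d/dw_ki log p_N(x | w) is approximately x_i - alpha * exp(w_ki)."""
    return x - alpha * np.exp(W[k])

# two hidden causes, three input channels; weights encode input rates in log-space
W = np.array([[2.0, 0.1, 0.1],
              [0.1, 0.1, 2.0]])
x = np.array([1.0, 0.2, 0.0])        # EPSP trace resembling cause k = 0
p = wta_posterior(x, W)              # posterior over hidden causes, sums to 1
g = likelihood_gradient(W, x, k=0)   # weight change if neuron 0 spikes
```

Here the first neuron receives the larger posterior probability, since the input pattern matches its weight vector; in the full model this gradient is the activity-dependent term that enters the sampling dynamics (10).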
The network implements inference in this generative model, i.e., for a given input xn, the firing rate of network neuron zk is proportional to the posterior probability p(hn = k | xn, w) of the corresponding hidden cause. Online maximum likelihood learning is realized through the synaptic update rule (see [21]), which realizes here the second term of Eq. (10):\n\n∂/∂wki log pN(xn | w) ≈ Sk(t) (xi(t) − α e^{wki}) ,   (12)\n\nwhere Sk(t) denotes the spike train of the kth neuron and xi(t) denotes the weight-normalized value of the sum of EPSPs from presynaptic neuron i at time t in response to pattern xn.\n\n3.2 Simulation results\n\nHere, we consider a network that allows us to study the self-organization of connections between hidden neurons. Additional details on this experiment and further analyses of the synaptic sampling model can be found in [22].\nThe architecture of the network is illustrated in Fig. 2A. It consists of eight WTA circuits with arbitrary excitatory synaptic connections between neurons within the same or different ones of these WTA circuits. Two populations of \u201cauditory\u201d and \u201cvisual\u201d input neurons xA and xV project onto corresponding populations zA and zV of hidden neurons (each consisting of four WTA circuits with K = 10 neurons, see lower panel of Fig. 2A). The hidden neuron populations receive exclusively auditory (zA, 770 neurons) or visual inputs (zV, 784 neurons) and in addition, arbitrary lateral excitatory connections between all hidden neurons are allowed. This network models multi-modal sensory integration and association in a simplified manner [15].\nBiological neural networks are astonishingly robust against perturbations and lesions [11]. To investigate the inherent compensation capability of synaptic sampling we applied two lesions to the network within a learning session of 8 hours (of equivalent biological time). 
The network was trained by repeatedly drawing random instances of spoken and written digits of the same type (digit 1 or 2, taken from MNIST, and 7 utterances of speaker 1 from TI 46) and simultaneously presenting Poisson spiking representations of these input patterns to the network. Fig. 2A shows example firing rates for one spoken/written input pair. Input spikes were randomly drawn according to these rates. Firing rates of visual input neurons were kept fixed throughout the duration of the auditory stimulus.\nIn the first lesion we removed all neurons (16 out of 40) that became tuned for digit 2 in the preceding learning. The reconstruction performance of the network was measured through the capability of a linear readout neuron, which received input only from zV. During these test trials only the auditory stimulus was presented (the remaining 3 utterances of speaker 1 were used as test set) and visual input neurons were clamped to 1 Hz background noise. The lesion significantly impaired the performance of the network in stimulus reconstruction, but the network was able to recover from the lesion after about one hour of continuing network plasticity (see Fig. 2C).\nIn the second lesion all synaptic connections between hidden neurons that were present after recovery from the first lesion were removed and not allowed to regrow (2936 synapses in total). After about two hours of continuing synaptic sampling 294 new synaptic connections between hidden neurons emerged. These connections made it again possible to infer the auditory stimulus from the activity of the remaining 24 hidden neurons in the population zV (in the absence of input from the population xV). The classification performance was around 75% (see bottom of Fig. 2C).\n\nFigure 2: Inherent compensation for network perturbations. A: Illustration of the network architecture: A recurrent spiking neural network received simultaneously spoken and handwritten spiking representations of the same digit. B: First three PCA components of the temporal evolution of a subset of the network parameters θ. After each lesion the network parameters migrate to a new manifold. C: The generative reconstruction performance of the \u201cvisual\u201d neurons zV for the test case when only an auditory stimulus is presented was tracked throughout the whole learning session (colors of learning phases as in (B)). After each lesion the performance strongly degrades, but reliably recovers. Learning with zero temperature (dashed yellow) or with approximate HMM learning [15] (dashed purple) performed significantly worse. Insets at the top show the synaptic weights of neurons in zV at 4 time points projected back into the input space. Network diagrams in the middle show ongoing network rewiring for synaptic connections between the hidden neurons. Each arrow indicates a functional connection between two neurons (only a 1% randomly drawn subset is shown). The neuron whose parameters are tracked in (C) is highlighted in red. Numbers under the network diagrams show the total number of functional connections between hidden neurons at the time point.\n\nIn Fig. 2B we track the temporal evolution of a subset θ′ of network parameters (35 parameters θi associated with the potential synaptic connections of the neuron marked in red in the middle of Fig. 2C from or to other hidden neurons, excluding those that were removed at lesion 2 and not allowed to regrow). The first three PCA components of this 35-dimensional parameter vector are shown. The vector θ′ fluctuates first within one region of the parameter space while probing
The vector \u03b8(cid:48) \ufb02uctuates \ufb01rst within one region of the parameter space while probing\n\n7\n\n\fdifferent solutions to the learning problem, e.g., high probability regions of the posterior distribution\n(blue trace). Each lesions induced a fast switch to a different region (red,green), accompanied by\na recovery of the visual stimulus reconstruction performance (see Fig. 2C). The network therefore\ncompensates for perturbations by exploring new parameter spaces.\nWithout the noise and the prior the same performance could not be reached for this experiment.\nFig. 2C shows the result for the approximate HMM learning [15], which is a deterministic learning\napproach (without a prior). Using this approach the network was able to learn representations of the\nhandwritten and spoken digits. However, these representation and the associations between them\nwere not as distinctive as for synaptic sampling and the classi\ufb01cation performance was signi\ufb01cantly\nworse (only \ufb01rst learning phase shown). We also evaluated this experiment with a deterministic\nversion of synaptic sampling (T = 0). Here, the stochasticity inherent to the WTA circuit was\nsuf\ufb01cient to overcome the \ufb01rst lesion. However, the performance was worse in the last learning phase\n(after removing all active lateral synapses). In this situation, the random exploration of the parameter\nspace that is inherent to synaptic sampling signi\ufb01cantly enhanced the speed of the recovery.\n\n4 Discussion\n\nWe have shown that stochasticity may provide an important function for network plasticity. It en-\nables networks to sample parameters from the posterior distribution that represents attractive combi-\nnations of structural constraints and rules (such as sparse connectivity and heavy-tailed distributions\nof synaptic weights) and a good \ufb01t to empirical evidence (e.g., sensory inputs). 
The resulting rules for synaptic plasticity contain a prior distribution over parameters. Potential functional benefits of priors (on emergent selectivity of neurons) have recently been demonstrated in [23] for a restricted Boltzmann machine.\nThe mathematical framework that we have presented provides a normative model for evaluating empirically found stochastic dynamics of network parameters, and for relating specific properties of this \u201cnoise\u201d to functional aspects of network learning. Some systematic dependencies of changes in synaptic weights (for the same pairing of pre- and postsynaptic activity) on their current values had already been reported in [24, 25, 26]. These can be modeled as the impact of priors in our framework.\nModels of learning via sampling from a posterior distribution have been previously studied in machine learning [13, 14] and the underlying theoretical principles are well known in physics (see e.g. Section 5.3 of [27]). The theoretical framework provided in this paper extends these previous models for learning by introducing the temperature parameter T and by allowing the sampling speed to be controlled in dependence of the current parameter setting through b(θi). Furthermore, our model combines for the first time automatic rewiring in neural networks with Bayesian inference via sampling. The functional consequences of these mechanisms are further explored in [22].\nThe postulate that networks should learn posterior distributions of parameters, rather than maximum likelihood values, had been proposed for artificial neural networks [7, 8], since such organization of learning promises better generalization capability to new examples. The open problem of how such posterior distributions could be learned by networks of neurons in the brain, in a way that is consistent with experimental data, has been highlighted in [2] as a key challenge for computational neuroscience. 
We have presented here a model whose primary innovation is to view experimentally found trial-to-trial variability and ongoing fluctuations of parameters no longer as a nuisance, but as a functionally important component of the organization of network learning. This model may lead to a better understanding of such noise and seeming imperfections in the brain. It might also provide an important step towards developing algorithms for upcoming technologies implementing analog spiking hardware, which employ noise and variability as a computational resource [28, 29].

Acknowledgments

Written under partial support of the European Union project #604102 The Human Brain Project (HBP) and CHIST-ERA ERA-Net (Project FWF #I753-N23, PNEUMA).
We would like to thank Seth Grant, Christopher Harvey, Jason MacLean and Simon Rumpel for helpful comments.

References

[1] Hatfield G. Perception as unconscious inference. In: Perception and the Physical World: Psychological and Philosophical Issues in Perception. Wiley; 2002. p. 115–143.
[2] Pouget A, Beck JM, Ma WJ, Latham PE. Probabilistic brains: knowns and unknowns. Nature Neuroscience. 2013;16(9):1170–1178.
[3] Winkler I, Denham S, Mill R, Böhm TM, Bendixen A. Multistability in auditory stream segregation: a predictive coding view. Phil Trans R Soc B: Biol Sci. 2012;367(1591):1001–1012.
[4] Brea J, Senn W, Pfister JP. Sequence learning with hidden units in spiking neural networks. In: NIPS. vol. 24; 2011. p. 1422–1430.
[5] Rezende DJ, Gerstner W. Stochastic variational learning in recurrent spiking networks. Frontiers in Computational Neuroscience. 2014;8:38.
[6] Nessler B, Pfeiffer M, Maass W. STDP enables spiking neurons to detect hidden causes of their inputs. In: NIPS. vol. 22; 2009. p. 1357–1365.
[7] MacKay DJ. Bayesian interpolation. Neural Computation. 1992;4(3):415–447.
[8] Bishop CM.
Pattern Recognition and Machine Learning. New York: Springer; 2006.
[9] Holtmaat AJ, Trachtenberg JT, Wilbrecht L, Shepherd GM, Zhang X, Knott GW, et al. Transient and persistent dendritic spines in the neocortex in vivo. Neuron. 2005;45:279–291.
[10] Loewenstein Y, Kuras A, Rumpel S. Multiplicative dynamics underlie the emergence of the log-normal distribution of spine sizes in the neocortex in vivo. J Neurosci. 2011;31(26):9481–9488.
[11] Marder E. Variability, compensation and modulation in neurons and circuits. PNAS. 2011;108(3):15542–15548.
[12] Gardiner CW. Handbook of Stochastic Methods. 3rd ed. Springer; 2004.
[13] Welling M, Teh YW. Bayesian learning via stochastic gradient Langevin dynamics. In: Proceedings of the 28th International Conference on Machine Learning (ICML-11); 2011. p. 681–688.
[14] Sato I, Nakagawa H. Approximation analysis of stochastic gradient Langevin dynamics by using Fokker-Planck equation and Ito process. In: NIPS; 2014. p. 982–990.
[15] Kappel D, Nessler B, Maass W. STDP installs in winner-take-all circuits an online approximation to hidden Markov model learning. PLoS Comp Biol. 2014;10(3):e1003511.
[16] Jolivet R, Rauch A, Lüscher H, Gerstner W. Predicting spike timing of neocortical pyramidal neurons by simple threshold models. J Comp Neurosci. 2006;21:35–49.
[17] Mensi S, Naud R, Gerstner W. From stochastic nonlinear integrate-and-fire to generalized linear models. In: NIPS. vol. 24; 2011. p. 1377–1385.
[18] Gerstner W, Kistler WM. Spiking Neuron Models. Cambridge University Press; 2002.
[19] Carandini M. From circuits to behavior: a bridge too far? Nature Neurosci. 2012;15(4):507–509.
[20] Habenschuss S, Bill J, Nessler B. Homeostatic plasticity in Bayesian spiking networks as Expectation Maximization with posterior constraints. In: NIPS. vol. 25; 2012. p. 782–790.
[21] Habenschuss S, Puhr H, Maass W.
Emergence of optimal decoding of population codes through STDP. Neural Computation. 2013;25:1–37.
[22] Kappel D, Habenschuss S, Legenstein R, Maass W. Network plasticity as Bayesian inference. PLoS Comp Biol. 2015;11(11):e1004485.
[23] Xiong H, Szedmak S, Rodríguez-Sánchez A, Piater J. Towards sparsity and selectivity: Bayesian learning of restricted Boltzmann machine for early visual features. In: ICANN; 2014. p. 419–426.
[24] Bi GQ, Poo MM. Synaptic modifications in cultured hippocampal neurons: dependence on spike timing, synaptic strength, and postsynaptic cell type. J Neurosci. 1998;18(24):10464–10472.
[25] Sjöström PJ, Turrigiano GG, Nelson SB. Rate, timing, and cooperativity jointly determine cortical synaptic plasticity. Neuron. 2001;32(6):1149–1164.
[26] Montgomery JM, Pavlidis P, Madison DV. Pair recordings reveal all-silent synaptic connections and the postsynaptic expression of long-term potentiation. Neuron. 2001;29(3):691–701.
[27] Kennedy AD. The Hybrid Monte Carlo algorithm on parallel computers. Parallel Computing. 1999;25(10):1311–1339.
[28] Schemmel J, Gruebl A, Meier K, Mueller E. Implementing synaptic plasticity in a VLSI spiking neural network model. In: IJCNN; 2006. p. 1–6.
[29] Bill J, Legenstein R. A compound memristive synapse model for statistical learning through STDP in spiking neural networks. Frontiers in Neuroscience. 2014;8:412.