{"title": "Neurons as Monte Carlo Samplers: Bayesian Inference and Learning in Spiking Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 1943, "page_last": 1951, "abstract": "We propose a two-layer spiking network capable of performing approximate inference and learning for a hidden Markov model. The lower layer sensory neurons detect noisy measurements of hidden world states. The higher layer neurons with recurrent connections infer a posterior distribution over world states from spike trains generated by sensory neurons. We show how such a neuronal network with synaptic plasticity can implement a form of Bayesian inference similar to Monte Carlo methods such as particle filtering. Each spike in the population of inference neurons represents a sample of a particular hidden world state. The spiking activity across the neural population approximates the posterior distribution of hidden state. The model provides a functional explanation for the Poisson-like noise commonly observed in cortical responses. Uncertainties in spike times provide the necessary variability for sampling during inference. Unlike previous models, the hidden world state is not observed by the sensory neurons, and the temporal dynamics of the hidden state is unknown. We demonstrate how this network can sequentially learn the hidden Markov model using a spike-timing dependent Hebbian learning rule and achieve power-law convergence rates.", "full_text": "Neurons as Monte Carlo Samplers: Bayesian\nInference and Learning in Spiking Networks\n\nYanping Huang\n\nUniversity of Washington\nhuangyp@cs.uw.edu\n\nRajesh P.N. Rao\n\nUniversity of Washington\n\nrao@cs.uw.edu\n\nAbstract\n\nWe propose a spiking network model capable of performing both approximate\ninference and learning for any hidden Markov model. The lower layer sensory\nneurons detect noisy measurements of hidden world states. 
The higher layer neu-\nrons with recurrent connections infer a posterior distribution over world states\nfrom spike trains generated by sensory neurons. We show how such a neuronal\nnetwork with synaptic plasticity can implement a form of Bayesian inference sim-\nilar to Monte Carlo methods such as particle filtering. Each spike in the population\nof inference neurons represents a sample of a particular hidden world state. The\nspiking activity across the neural population approximates the posterior distribu-\ntion of the hidden state. The model provides a functional explanation for the Poisson-\nlike noise commonly observed in cortical responses. Uncertainties in spike times\nprovide the necessary variability for sampling during inference. Unlike previous\nmodels, the hidden world state is not observed by the sensory neurons, and the\ntemporal dynamics of the hidden state are unknown. We demonstrate how such\nnetworks can sequentially learn hidden Markov models using a spike-timing de-\npendent Hebbian learning rule and achieve power-law convergence rates.\n\n1 Introduction\n\nHumans are able to routinely estimate unknown world states from ambiguous and noisy stimuli,\nand anticipate upcoming events by learning the temporal dynamics of relevant states of the world\nfrom incomplete knowledge of the environment. For example, when facing an approaching tennis\nball, a player must not only estimate the current position of the ball, but also predict its trajectory\nby inferring the ball\u2019s velocity and acceleration before deciding on the next stroke. Tasks such as\nthese can be modeled using a hidden Markov model (HMM), where the relevant states of the world\nare latent variables X related to sensory observations Z via a likelihood model (determined by the\nemission probabilities). The latent states themselves evolve over time in a Markovian manner, the\ndynamics being governed by the transition probabilities. 
In these tasks, the optimal way of combin-\ning such noisy sensory information is to use Bayesian inference, where the level of uncertainty for\neach possible state is represented as a probability distribution [1]. Behavioral and neuropsychophys-\nical experiments [2, 3, 4] have suggested that the brain may indeed maintain such a representation\nand employ Bayesian inference and learning in a great variety of tasks in perception, sensori-motor\nintegration, and sensory adaptation. However, it remains an open question how the brain can se-\nquentially infer the hidden state and learn the dynamics of the environment from the noisy sensory\nobservations.\nSeveral models based on populations of neurons have been proposed to represent probability dis-\ntributions [5, 6, 7, 8]. These models typically assume a static world state X. To get around this\nlimitation, firing-rate models [9, 10] have been proposed that use responses in populations of neu-\nrons to represent the time-varying posterior distributions of arbitrary hidden Markov models with\ndiscrete states. For continuous state spaces, similar models based on line attractor networks [11]\nhave been introduced for implementing the Kalman filter, which assumes all distributions are Gaus-\nsian and the dynamics are linear. Bobrowski et al. [12] proposed a spiking network model that can\ncompute the optimal posterior distribution in continuous time. The limitation of these models is that\nmodel parameters (the emission and transition probabilities) are assumed to be known a priori. 
Deneve [13, 14] proposed a model for inference and learning based on the dynamics of a single neuron.\nHowever, the maximum number of world states in her model is limited to two.\nIn this paper, we explore a neural implementation of HMMs in networks of spiking neurons that\nperform approximate Bayesian inference similar to the Monte Carlo method of particle filtering [15].\nWe show how the time-varying posterior distribution P (Xt|Z1:t) can be directly represented by\nmean spike counts in sub-populations of neurons. Each model neuron in the neuron population\nbehaves as a coincidence detector, and each spike is viewed as a Monte Carlo sample of a particular\nworld state. At each time step, the probability of a spike in one neuron is shown to approximate\nthe posterior probability of the preferred state encoded by the neuron. Nearby neurons within the\nsame sub-population (analogous to a cortical column) encode the same preferred state. The model\nthus provides a concrete neural implementation of sampling ideas previously suggested in [16, 17,\n18, 19, 20]. In addition, we demonstrate how a spike-timing based Hebbian learning rule in our\nnetwork can implement an online version of the Expectation-Maximization (EM) algorithm to learn\nthe emission and transition matrices of HMMs.\n\n2 Review of Hidden Markov Models\n\nFor clarity of notation, we briefly review the equations behind a discrete-time \u201cgrid-based\u201d Bayesian\nfilter for a hidden Markov model. Let the hidden state be {Xk \u2208 X, k \u2208 N} with dynamics\nXk+1 | (Xk = x') \u223c f (x|x'), where f (x|x') is the transition probability density, X is the discrete\nstate space of Xk, N is the set of time steps, and \u201c\u223c\u201d denotes \u201cdistributed according to\u201d. We focus\non estimating Xk by constructing its posterior distribution, based only on noisy measurements or\nobservations {Zk} \u2208 Z where Z can be discrete or continuous. 
The observations {Zk} are conditionally independent\ngiven {Xk} and are governed by the emission probabilities Zk | (Xk = x) \u223c g(z|x).\nThe posterior probability $P(X_k = i \\mid Z_{1:k}) = \\omega^i_{k|k}$ may be updated in two stages: a prediction stage\n(Eq 1) and a measurement update (or correction) stage (Eq 2):\n\n$$P(X_{k+1} = i \\mid Z_{1:k}) = \\omega^i_{k+1|k} = \\sum_{j=1}^{X} \\omega^j_{k|k}\\, f(x_i|x_j), \\qquad (1)$$\n\n$$P(X_{k+1} = i \\mid Z_{1:k+1}) = \\omega^i_{k+1|k+1} = \\frac{\\omega^i_{k+1|k}\\, g(Z_{k+1}|x_i)}{\\sum_{j=1}^{X} \\omega^j_{k+1|k}\\, g(Z_{k+1}|x_j)}. \\qquad (2)$$\n\nThis process is repeated for each time step. The two recursive equations above are the foundation\nfor any exact or approximate solution to Bayesian filtering, including well-known examples such as\nKalman filtering when the original continuous state space has been discretized into X bins.\n\n3 Neural Network Model\n\nWe now describe the two-layer spiking neural network model we use (depicted in the central panel of\nFigure 1(a)). The noisy observation Zk is not directly observed by the network, but sensed through\nan array of Z sensory neurons. The lower layer consists of this array of sensory neurons, each of\nwhich is activated at time k if the observation Zk is in its receptive field. The higher layer\nconsists of an array of inference neurons, whose activities can be defined as:\n\n$$s(k) = \\mathrm{sgn}(a(k) \\times b(k)) \\qquad (3)$$\n\nwhere s(k) describes the binary response of an inference neuron at time k, and the sign function\nsgn(x) = 1 only when x > 0. a(k) represents the sum of the neuron\u2019s recurrent inputs, which is\ndetermined by the recurrent weight matrix W among the inference neurons and the population re-\nsponses $s_{k-1}$ from the previous time step. 
b(k) represents the sum of feedforward inputs, which is\ndetermined by the feedforward weight matrix M as well as the activities in sensory neurons.\nNote that Equation 3 defines the output of an abstract inference neuron which acts as a coincidence\ndetector and fires if and only if both recurrent and sensory inputs are received. In the supplementary\nmaterials, we show that this abstract model neuron can be implemented using the standard leaky-\nintegrate-and-fire (LIF) neurons used to model cortical neurons.\n\nFigure 1: (a) Spiking network model for sequential Monte Carlo Bayesian inference. (b) Graphical\nrepresentation of spike distribution propagation.\n\n3.1 Neural Representation of Probability Distributions\n\nSimilar to the idea of grid-based filtering, we first divide the inference neurons into X sub-\npopulations, $s = \\{s^i_l,\\ i = 1, \\ldots, X,\\ l = 1, \\ldots, L\\}$. We have $s^i_l(k) = 1$ if there is a spike in the\nl-th neuron of the i-th sub-population at time step k. Each sub-population of L neurons shares the\nsame preferred world state, there being X such sub-populations, one for each of the X preferred\nstates. One can, for example, view such a neuronal sub-population as a cortical column, within\nwhich neurons encode similar features [21].\nFigure 1(a) illustrates how our neural network encodes a simple hidden Markov model with X =\nZ = {1, . . . , 100}. The hidden state Xk = 50 is static and P (Zk|Xk) is normally distributed. The network\nutilizes 10,000 neurons for the Monte Carlo approximation, with each state preferred by a sub-\npopulation of 100 neurons. At time k, the network observes Zk, and the corresponding sensory\nneuron whose receptive field contains Zk is activated and sends inputs to the inference neurons.\nCombined with recurrent inputs from the previous time step, the responses in the inference neurons\nare updated at each time step. 
As shown in the raster plot of Figure 1(a), the spikes across the entire\ninference layer population form a Monte-Carlo approximation to the current posterior distribution:\n\n$$n^i_{k|k} := \\sum_{l=1}^{L} s^i_l(k) \\propto \\omega^i_{k|k} \\qquad (4)$$\n\nwhere $n^i_{k|k}$ is the number of spiking neurons in the i-th sub-population at time k, which can also be\nregarded as the instantaneous firing rate for sub-population i. $N_k = \\sum_{i=1}^{X} n^i_{k|k}$ is the total spike\ncount in the inference layer population. The set $\\{n^i_{k|k}\\}$ represents the un-normalized conditional\nprobabilities of Xk, so that $\\hat{P}(X_k = i \\mid Z_{1:k}) = \\omega^i_{k|k} = n^i_{k|k}/N_k$.\n\n3.2 Bayesian Inference with Stochastic Synaptic Transmission\n\nIn this section, we assume the network is given the model parameters of an HMM and there is no\nlearning in connection weights in the network. To implement the prediction Eq 1 in a spiking\nnetwork, we initialize the recurrent connections between the inference neurons as the transition\nprobabilities: Wij = f (xj|xi)/CW , where CW is a scaling constant. We will discuss how our\nnetwork learns the HMM parameters from random initial synaptic weights in section 4.\nWe define the recurrent weight Wij to be the synaptic release probability between the i-th neuron\nsub-population and the j-th neuron sub-population in the inference layer. Each neuron that spikes\nat time step k will randomly evoke, with probability Wij, one recurrent excitatory post-synaptic\npotential (EPSP) at time step k + 1, after some network delay. We define the number of recurrent\nEPSPs received by neuron l in the j-th sub-population as $a^j_l$. Thus, $a^j_l$ is the sum of $N_k$ independent\n(but not identically distributed) Bernoulli trials:\n\n$$a^j_l(k+1) = \\sum_{i=1}^{X} \\sum_{l'=1}^{L} \\epsilon^i_{l'}\\, s^i_{l'}(k), \\quad \\forall l = 1, \\ldots, L, \\qquad (5)$$\n\nwhere $P(\\epsilon^i_{l'} = 1) = W_{ij}$ and $P(\\epsilon^i_{l'} = 0) = 1 - W_{ij}$. The sum $a^j_l$ follows the so-called \u201cPoisson\nbinomial\u201d distribution [22] and in the limit approaches the Poisson distribution:\n\n$$P(a^j_l(k+1) \\geq 1) \\simeq \\sum_i W_{ij}\\, n^i_{k|k} = \\frac{N_k}{C_W}\\, \\omega^j_{k+1|k} \\qquad (6)$$\n\nThe detailed analysis of the distribution of $a^i_l$ and the proof of equation 6 are provided in the supple-\nmentary materials.\nThe definition of the model neuron in Eq 3 indicates that recurrent inputs alone are not strong enough\nto make the inference neurons fire \u2013 these inputs leave the neurons partially activated. We can\nview these partially activated neurons as the proposed samples drawn from the prediction density\nP (Xk+1|Xk). Letting $n^j_{k+1|k}$ be the number of proposed samples in the j-th sub-population, we have\n\n$$E[n^j_{k+1|k} \\mid \\{n^i_{k|k}\\}] = L \\sum_{i=1}^{X} W_{ij}\\, n^i_{k|k} = L \\frac{N_k}{C_W}\\, \\omega^j_{k+1|k} \\propto \\mathrm{Var}[n^j_{k+1|k} \\mid \\{n^i_{k|k}\\}] \\qquad (7)$$\n\nThus, the prediction probability in equation 1 is represented by the expected number of neurons that\nreceive recurrent inputs.\nWhen a new observation Zk+1 is received, the network will correct the prediction distribution based\non the current observation. Similar to rejection sampling used in sequential Monte Carlo algo-\nrithms [15], these proposed samples are accepted with a probability proportional to the observation\nlikelihood P (Zk+1|Xk+1). We assume for simplicity that receptive fields of sensory neurons do not\noverlap with each other (in the supplementary materials, we discuss the more general overlapping\ncase). Again we define the feedforward weight Mij to be the synaptic release probability between\nsensory neuron i and inference neurons in the j-th sub-population. 
A spiking sensory neuron i\ncauses an EPSP in a neuron in the j-th sub-population with probability Mij, which is initialized\nproportional to the likelihood:\n\n$$P(b^i_l(k+1) \\geq 1) = g(Z_{k+1}|x_i)/C_M \\qquad (8)$$\n\nwhere CM is a scaling constant such that Mij = g(Zk+1 = zi | xj)/CM .\nFinally, an inference neuron fires a spike at time k + 1 if and only if it receives both recurrent and\nsensory inputs. The corresponding firing probability is then the product of the probabilities of the\ntwo inputs:\n\n$$P(s^i_l(k+1) = 1) = P(a^i_l(k+1) \\geq 1)\\, P(b^i_l(k+1) \\geq 1).$$\n\nLet $n^i_{k+1|k+1} = \\sum_{l=1}^{L} s^i_l(k+1)$ be the number of spikes in the i-th sub-population at time k + 1. We\nhave\n\n$$E[n^i_{k+1|k+1} \\mid \\{n^i_{k|k}\\}] = L \\frac{N_k}{C_W C_M}\\, P(Z_{k+1}|Z_{1:k})\\, \\omega^i_{k+1|k+1} \\qquad (9)$$\n\n$$\\mathrm{Var}[n^i_{k+1|k+1} \\mid \\{n^i_{k|k}\\}] \\simeq L \\frac{N_k}{C_W C_M}\\, g(Z_{k+1}|x_i)\\, \\omega^i_{k+1|k} \\qquad (10)$$\n\nEquation 9 ensures that the expected spike distribution at time k + 1 is a Monte Carlo approximation\nto the updated posterior probability P (Xk+1|Z1:k+1). It also determines how many neurons are\nactivated at time k + 1. To keep the number of spikes at different time steps relatively constant, the\nscaling constants CM , CW and the number of neurons L could be of the same order of magnitude:\nfor example, CW = L = 10 \u2217 N1 and CM (k + 1) = 10 \u2217 Nk/N1, resulting in a form of divisive\ninhibition [23]. If the overall neural activity is weak at time k, then the global inhibition regulating\nM is decreased to allow more spikes at time k + 1. Moreover, approximations in equations 6 and\n10 become exact when $N_k^2 / C_W^2 \\to 0$.\n\n3.3 Filtering Examples\n\nFigure 1(b) illustrates how the model network implements Bayesian inference with spike samples.\nThe top three rows of circles in the left panel in Figure 1(b) represent the neural activities in the\ninference neurons, approximating respectively the prior, prediction, and posterior distributions in\nthe right panel. 
At time k, spikes (shown as filled circles) in the posterior population represent the\ndistribution P (Xk|Z1:k). With recurrent weights W \u221d f (Xk+1|Xk), spiking neurons send EPSPs\nto their neighbors and make them partially activated (shown as half-filled circles in the second row).\nThe distribution of partially activated neurons is a Monte-Carlo approximation to the prediction\ndistribution P (Xk+1|Z1:k). When a new observation Zk+1 arrives, the sensory neuron (filled circles\nin the bottom row) whose receptive field contains Zk+1 is activated, and sends feedforward EPSPs to\nthe inference neurons using synaptic weights M \u221d g(Z|X). The inference neurons at time k + 1 fire\nonly if they receive both recurrent and feedforward inputs. With the firing probability proportional to\nthe product of prediction probability P (Xk+1|Z1:k) and observation likelihood g(Zk+1|Xk+1), the\nspike distribution at time k + 1 (filled circles in the third row) again represents the updated posterior\nP (Xk+1|Z1:k+1).\n\nFigure 2: Filtering results for uni-modal (a) and bi-modal posterior distributions ((b) and (c) - see\ntext for details).\n\nWe further tested the filtering results of the proposed neural network with two other example HMMs.\nThe first example is the classic stochastic volatility model, where X = Z = R. The transition model\nof the hidden volatility variable is f (Xk+1|Xk) = N (0.91Xk, 1.0), and the emission model of the\nobserved price given volatility is g(Zk|Xk) = N (0, 0.25 exp(Xk)). The posterior distribution of\nthis model is uni-modal. In simulation we divided X into 100 bins and used an initial spike count N1 = 1000.\nWe plotted the expected volatility with estimated standard deviation from the population posterior\ndistribution in Figure 2(a). We found that the neural network does indeed produce a reasonable\nestimate of volatility and a plausible confidence interval. 
The second example tests the network\u2019s\nability to approximate bi-modal posterior distributions by comparing the time varying population\nposterior distribution with the true one using heat maps (Figures 2(b) and 2(c)). The vertical axis\nrepresents the hidden state and the horizontal axis represents time steps. The magnitude of the\nprobability is represented by the color. In this example, X = {1, . . . , 8} and there are 20 time steps.\n\n3.4 Convergence Results and Poisson Variability\n\nIn this section, we discuss some convergence results for Bayesian filtering using the proposed spik-\ning network and show that our population estimator of the posterior probability is a consistent one. Let\n$\\hat{P}^i_k = n^i_{k|k}/N_k$ be the population estimator of the true posterior probability P (Xk = i|Z1:k) at time k.\nSuppose the true distribution is known only at initial time k = 1: $\\hat{P}^i_1 = \\omega^i_{1|1}$. We would like to\ninvestigate how the mean and variance of $\\hat{P}^i_k$ vary over time. We derived the updating equations for\nmean and variance (see supplementary materials) and found two implications. First, the variance of\nneural response is roughly proportional to the mean. Thus, rather than representing noise, Poisson\nvariability in the model occurs as a natural consequence of sampling and sparse coding. Second, the\nvariance $\\mathrm{Var}[\\hat{P}^j_k] \\propto 1/N_1$. Therefore $\\mathrm{Var}[\\hat{P}^j_k] \\to 0$ as $N_1 \\to \\infty$, showing that $\\hat{P}^j_k$ is a consistent\nestimator of $\\omega^j_{k|k}$. We tested the above two predictions using numerical experiments on arbitrary\nHMMs, where we choose X = {1, 2, . . . 20}, Zk \u223c N (Xk, 5), the transition matrix f (xj|xi) first\nuniformly drawn from [0, 1], and then normalized to ensure $\\sum_j f(x_j|x_i) = 1$.\nIn Figures 3(a-c), each data point represents $\\mathrm{Var}[\\hat{P}^j_k]$ along the vertical axis and $E[\\hat{P}^j_k] - E^2[\\hat{P}^j_k]$\nalong the horizontal axis, calculated over 100 trials with the same random transition matrix f, and\nk = 1, . . . 10, j = 1, . . . 20. The solid lines represent a least squares power law fit to the data:\n$\\mathrm{Var}[\\hat{P}^j_k] = C_V \\ast (E[\\hat{P}^j_k] - E^2[\\hat{P}^j_k])^{C_E}$. For 100 different random transition matrices f, the means\nof the exponential term CE were 1.2863, 1.13, and 1.037, with standard deviations 0.13, 0.08, and\n0.03 respectively, for N1 = 100 and X = 4, 20, and 100. The mean of CE continues to approach\n1 when X is increased, as shown in figure 3(d). Since $\\mathrm{Var}[\\hat{P}^j_k] \\propto (E[\\hat{P}^j_k] - E^2[\\hat{P}^j_k])$ implies\n$\\mathrm{Var}[n^j_{k|k}] \\propto E[n^j_{k|k}]$ (see supplementary material for derivation), these results verify the Poisson\nvariability prediction of our neural network.\n\nFigure 3: Variance versus Mean of estimator for different initial spike counts\n\nThe term CV represents the scaling constant for the variance. Figure 3(e) shows that the mean of\nCV over 100 different transition matrices f (over 100 different trials with the same f) is inversely\nproportional to initial spike count N1, with power law fit $C_V = 1.77 N_1^{-0.9245}$. This indicates that\nthe variance of $\\hat{P}^j_k$ converges to 0 if N1 \u2192 \u221e. The bias between estimated and true posterior\nprobability can be calculated as:\n\n$$\\mathrm{bias}(f) = \\frac{1}{XK} \\sum_{i=1}^{X} \\sum_{k=1}^{K} (E[\\hat{P}^i_k] - \\omega^i_{k|k})^2$$\n\nThe relationship between the mean of the bias (over 100 different f) and the initial count N1 is shown\nin figure 3(f). We also have an inverse proportionality between bias and N1. Therefore, as the figure\nshows, for arbitrary f, the estimator $\\hat{P}^j_k$ is a consistent estimator of $\\omega^j_{k|k}$.\n\n4 On-line parameter learning\n\nIn the previous section, we assumed that the model parameters, i.e., the transition probabilities\nf (Xk+1|Xk) and the emission probabilities g(Zk|Xk), are known. In this section, we describe how\nthese parameters \u03b8 = {f, g} can be learned from noisy observations {Zk}. Traditional methods\nto estimate model parameters are based on the Expectation-Maximization (EM) algorithm, which\nmaximizes the (log) likelihood of the unknown parameters log P\u03b8(Z1:k) given a set of observations\ncollected previously. However, such an \u201coff-line\u201d approach is biologically implausible because (1)\nit requires animals to store all of the observations before learning, and (2) evolutionary pressures\ndictate that animals update their belief over \u03b8 sequentially any time a new measurement becomes\navailable.\nWe therefore propose an on-line estimation method where observations are used for updating pa-\nrameters as they become available and then discarded. We would like to find the parameters \u03b8 that\nmaximize the log likelihood: $\\log P_\\theta(Z_{1:k}) = \\sum_{t=1}^{k} \\log P_\\theta(Z_t \\mid Z_{1:t-1})$. Our approach is based on re-\ncursively calculating the sufficient statistics of \u03b8 using stochastic approximation algorithms and the\nMonte Carlo method, and employs an online EM algorithm obtained by approximating the expected\nsufficient statistic $\\hat{T}(\\theta_k)$ using the stochastic approximation (or Robbins-Monro) procedure. Based\non the detailed derivations described in the supplementary materials, we obtain a Hebbian learning\nrule for updating the synaptic weights based on the pre-synaptic and post-synaptic activities:\n\n$$M^k_{ij} = \\gamma_k \\frac{n^j_{k|k}}{N_k} \\times \\frac{\\tilde{n}_i(k)}{\\sum_i \\tilde{n}_i(k)} + \\left(1 - \\gamma_k \\frac{n^j_{k|k}}{N_k}\\right) \\times M^{k-1}_{ij} \\quad \\text{when } n^j_{k|k} > 0, \\qquad (11)$$\n\n$$W^k_{ij} = \\gamma_k \\frac{n^i_{k-1|k-1}}{N_{k-1}} \\times \\frac{n^j_{k|k}}{N_k} + \\left(1 - \\gamma_k \\frac{n^i_{k-1|k-1}}{N_{k-1}}\\right) \\times W^{k-1}_{ij} \\quad \\text{when } n^i_{k-1|k-1} > 0, \\qquad (12)$$\n\nwhere $\\tilde{n}_i(k)$ is the number of pre-synaptic spikes in the i-th sub-population of sensory neurons at\ntime k, and $\\gamma_k$ is the learning rate.\n\nFigure 4: Performance of the Hebbian Learning Rules.\n\nLearning both emission and transition probability matrices at the same time using the online EM\nalgorithm with stochastic approximation is in general very difficult because there are many local\nminima in the likelihood function. To verify the correctness of our learning algorithms individually,\nwe first divide the learning process into two phases. 
The first phase involves learning the emission\nprobability g when the hidden world state is stationary, i.e., Wij = fij = \u03b4ij. This corresponds to\nlearning the observation model of static objects at the center of gaze before learning the dynamics\nf of objects. After an observation model g is learned, we relax the stationarity constraint, and allow\nthe spiking network to update the recurrent weights W to learn the arbitrary transition probability f.\nFigure 4 illustrates the performance of learning rules (11) and (12) for a discrete HMM with X = 4\nand Z = 12. X and Z values are spaced equally apart: $X \\in \\{1, \\ldots, 4\\}$ and $Z \\in \\{\\frac{2}{3}, 1, \\frac{4}{3}, \\ldots, 4\\frac{1}{3}\\}$.\nThe transition probability matrix f then involves 4\u00d74 = 16 parameters and the emission probability\nmatrix g involves 12 \u00d7 4 = 48 parameters.\nIn Figure 4(a), we examine the performance of learning rule (11) for the feedforward weights\n$M^k$, with fixed transition matrix. The true emission probability matrix has the form $g_{\\cdot j} \\sim N(x_j, \\sigma_Z^2)$.\nThe solid blue curve shows the mean square error (Frobenius norm) $\\|M^k - g\\|_F = \\sqrt{\\sum_{ij} (M^k_{ij} - g_{ij})^2}$\nbetween the learned feedforward weights $M^k$ and the true emission probabil-\nity matrix g over trials with different g. The dotted lines show \u00b1 1 standard deviation for MSE\nbased on 10 different trials. \u03c3Z varied from trial to trial and was drawn uniformly between 0.2\nand 0.4, representing different levels of observation noise. The initial spike distribution was uni-\nform, $n^i_{0|0} = n^j_{0|0}, \\forall i, j = 1, \\ldots, X$, and the initial estimate $M^0_{i,j} = 1/Z$. The learning rate was set\nto $\\gamma_k = 1/k$, although a small constant learning rate such as $\\gamma_k = 10^{-5}$ also gives rise to similar\nlearning results. A notable feature in Figure 4(a) is that the average MSE exhibits a fast power-\nlaw decrease. The red solid line in Figure 4(a) represents the power-law fit to the average MSE:\n$MSE(k) \\propto k^{-1.1}$. Furthermore, the standard deviation of MSE approaches zero as k grows large.\nFigure 4(a) thus shows the asymptotic convergence of equation (11) irrespective of the \u03c3Z of the\ntrue emission matrix g.\nWe next examined the performance of learning rule (12) for the recurrent weights $W^k$, given the\nlearned emission probability matrix g (the true transition probabilities f are unknown to the net-\nwork). The initial estimator was $W^0_{ij} = 1/X$. Similarly, performance was evaluated by calculating the\nmean square error $\\|W^k - f\\|_F = \\sqrt{\\sum_{ij} (W^k_{ij} - f_{ij})^2}$ between the learned recurrent weight $W^k$\nand the true f. Different randomly chosen transition matrices f were tested. When \u03c3Z = 0.04, the\nobservation noise is 0.04/(1/3) = 12% of the separation between two observed states. Hidden state identi-\nfication in this case is relatively easy. The red solid line in figure 4(b) represents the power-law fit to\nthe average MSE: $MSE(k) \\propto k^{-0.36}$. Similar convergence results can still be obtained for higher\n\u03c3Z, e.g., \u03c3Z = 0.4 (figure 4(c)). In this case, hidden state identification is much more difficult as\nthe observation noise is now 1.2 times the separation between two observed states. This difficulty\nis reflected in a slower asymptotic convergence rate, with a power-law fit $MSE(k) \\propto k^{-0.21}$, as\nindicated by the red solid line in figure 4(c).\nFinally, we show the results for learning both emission and transition matrices simultaneously in\nfigure 4(d,e). 
In this experiment, the true emission and transition matrices are deterministic, and the\nweight matrices are initialized as the sum of the true one and a uniformly random one: $W^0_{ij} \\propto f_{ij} + \\epsilon$\nand $M^0_{ij} \\propto g_{ij} + \\epsilon$, where $\\epsilon$ is uniformly distributed noise between 0 and 1/NX. Although\nthe asymptotic convergence rate for this case is much slower, it still exhibits the desired power-law\nconvergence in both $MSE_W(k) \\propto k^{-0.02}$ and $MSE_M(k) \\propto k^{-0.08}$ over 100 trials starting with\ndifferent initial weight matrices.\n\n5 Discussion\n\nOur model suggests that, contrary to the commonly held view, variability in spiking does not re-\nflect \u201cnoise\u201d in the nervous system but captures the animal\u2019s uncertainty about the outside world.\nThis suggestion is similar to some previous models [17, 19, 20], including models linking firing rate\nvariability to probabilistic representations [16, 8] but differs in the emphasis on spike-based repre-\nsentations, time-varying inputs, and learning. In our model, a probability distribution over a finite\nsample space is represented by spike counts in neural sub-populations. Treating spikes as random\nsamples requires that neurons in a pool of identical cells fire independently. This hypothesis is sup-\nported by recent experimental findings [21] that nearby neurons with similar orientation tuning and\ncommon inputs show little or no correlation in activity. Our model offers a functional explanation\nfor the existence of such decorrelated neuronal activity in the cortex.\nUnlike many previous models of cortical computation, our model treats synaptic transmission be-\ntween neurons as a stochastic process rather than a deterministic event. This acknowledges the\ninherent stochastic nature of neurotransmitter release and binding. 
Synapses between neurons usu-\nally have only a small number of vesicles available and a limited number of post-synaptic receptors\nnear the release sites. Recent physiological studies [24] have shown that only 3 NMDA receptors\nopen on average per release during synaptic transmission. These observations lend support to the\nview espoused by the model that synapses should be treated as probabilistic computational units\nrather than as simple scalar parameters as assumed in traditional neural network models.\nThe model for learning we have proposed builds on prior work on online learning [25, 26]. The\nonline algorithm used in our model for estimating HMM parameters involves three levels of approx-\nimation. The first level involves performing a stochastic approximation to estimate the expected\ncomplete-data sufficient statistics over the joint distribution of all hidden states and observations.\nCappe and Moulines [26] showed that under some mild conditions, such an approximation produces\na consistent, asymptotically efficient estimator of the true parameters. The second approximation\ncomes from the use of filtered rather than smoothed posterior distributions. Although the conver-\ngence reported in the methods section is encouraging, a rigorous proof of convergence remains to\nbe shown. The asymptotic convergence rate using only the filtered distribution is about one third\nthe convergence rate obtained for the algorithms in [25] and [26], where the smoothed distribution\nis used. The third approximation results from Monte-Carlo sampling of the posterior distribution.\nAs discussed in the methods section, the Monte Carlo approximation converges in the limit of large\nnumbers of particles (spikes).\n\nReferences\n[1] R.S. Zemel, Q.J.M. Huys, R. Natarajan, and P. Dayan. Probabilistic computation in spiking populations. Advances in Neural Information Processing Systems, 17:1609\u20131616, 2005.\n[2] D. Knill and W. Richards. 
Perception as Bayesian inference. Cambridge University Press, 1996.\n[3] K. Kording and D. Wolpert. Bayesian integration in sensorimotor learning. Nature, 427:244\u2013247, 2004.\n[4] K. Doya, S. Ishii, A. Pouget, and R.P.N. Rao. Bayesian Brain: Probabilistic Approaches to Neural Coding. Cambridge, MA: MIT Press, 2007.\n[5] K. Zhang, I. Ginzburg, B.L. McNaughton, and T.J. Sejnowski. Interpreting neuronal population activity by reconstruction: A unified framework with application to hippocampal place cells. Journal of Neuroscience, 16(22), 1998.\n[6] R.S. Zemel and P. Dayan. Distributional population codes and multiple motion models. Advances in Neural Information Processing Systems, 11, 1999.\n[7] S. Wu, D. Chen, M. Niranjan, and S.I. Amari. Sequential Bayesian decoding within a population of neurons. Neural Computation, 15, 2003.\n[8] W.J. Ma, J.M. Beck, P.E. Latham, and A. Pouget. Bayesian inference with probabilistic population codes. Nature Neuroscience, 9(11):1432\u20131438, 2006.\n[9] R.P.N. Rao. Bayesian computation in recurrent neural circuits. Neural Computation, 16(1):1\u201338, 2004.\n[10] J.M. Beck and A. Pouget. Exact inferences in a neural implementation of a hidden Markov model. Neural Computation, 19(5):1344\u20131361, 2007.\n[11] R.C. Wilson and L.H. Finkel. A neural implementation of the Kalman filter. Advances in Neural Information Processing Systems, 22:2062\u20132070, 2009.\n[12] O. Bobrowski, R. Meir, and Y. Eldar. Bayesian filtering in spiking neural networks: noise adaptation and multisensory integration. Neural Computation, 21(5):1277\u20131320, 2009.\n[13] S. Deneve. Bayesian spiking neurons I: Inference. Neural Computation, 20:91\u2013117, 2008.\n[14] S. Deneve. Bayesian spiking neurons II: Learning. Neural Computation, 20:118\u2013145, 2008.\n[15] A. Doucet, N. de Freitas, and N. Gordon. Sequential Monte Carlo methods in practice. Springer-Verlag, 2001.\n[16] P.O. Hoyer, A. 
Hyv\u00e4rinen. Interpreting neural response variability as Monte Carlo sampling of the posterior. Advances in Neural Information Processing Systems, 15, 2002.\n[17] M.G. Paulin. Evolution of the cerebellum as a neuronal machine for Bayesian state estimation. J. Neural Eng., 2:S219\u2013S234, 2005.\n[18] N.D. Daw and A.C. Courville. The pigeon as particle filter. Advances in Neural Information Processing Systems, 19, 2007.\n[19] L. Buesing, J. Bill, B. Nessler, and W. Maass. Neural dynamics as sampling: A model for stochastic computation in recurrent networks of spiking neurons. PLoS Comput Biol, 7(11), 2011.\n[20] P. Berkes, G. Orban, M. Lengyel, and J. Fiser. Spontaneous cortical activity reveals hallmarks of an optimal internal model of the environment. Science, 331(6013), 2011.\n[21] A.S. Ecker, P. Berens, G.A. Keliris, M. Bethge, N.K. Logothetis, and A.S. Tolias. Decorrelated neuronal firing in cortical microcircuits. Science, 327(5965):584\u2013587, 2010.\n[22] J.L. Hodges, Jr. and L. Le Cam. The Poisson approximation to the Poisson binomial distribution. The Annals of Mathematical Statistics, 31(3):737\u2013740, 1960.\n[23] F.S. Chance and L.F. Abbott. Divisive inhibition in recurrent networks. Network, 11:119\u2013129, 2000.\n[24] E.A. Nimchinsky, R. Yasuda, T.G. Oertner, and K. Svoboda. The number of glutamate receptors opened by synaptic stimulation in single hippocampal spines. J Neurosci, 24:2054\u20132064, 2004.\n[25] G. Mongillo and S. Deneve. Online learning with hidden Markov models. Neural Computation, 20:1706\u20131716, 2008.\n[26] O. Cappe and E. Moulines. Online EM algorithm for latent data models, 2009.\n", "award": [], "sourceid": 1070, "authors": [{"given_name": "Yanping", "family_name": "Huang", "institution": "University of Washington"}, {"given_name": "Rajesh", "family_name": "Rao", "institution": "University of Washington"}]}