{"title": "Complex Inference in Neural Circuits with Probabilistic Population Codes and Topic Models", "book": "Advances in Neural Information Processing Systems", "page_first": 3059, "page_last": 3067, "abstract": "Recent experiments have demonstrated that humans and animals typically reason probabilistically about their environment. This ability requires a neural code that represents probability distributions and neural circuits that are capable of implementing the operations of probabilistic inference. The proposed probabilistic population coding (PPC) framework provides a statistically efficient neural representation of probability distributions that is both broadly consistent with physiological measurements and capable of implementing some of the basic operations of probabilistic inference in a biologically plausible way. However, these experiments and the corresponding neural models have largely focused on simple (tractable) probabilistic computations such as cue combination, coordinate transformations, and decision making. As a result it remains unclear how to generalize this framework to more complex probabilistic computations. Here we address this short coming by showing that a very general approximate inference algorithm known as Variational Bayesian Expectation Maximization can be implemented within the linear PPC framework. We apply this approach to a generic problem faced by any given layer of cortex, namely the identification of latent causes of complex mixtures of spikes. We identify a formal equivalent between this spike pattern demixing problem and topic models used for document classification, in particular Latent Dirichlet Allocation (LDA). We then construct a neural network implementation of variational inference and learning for LDA that utilizes a linear PPC. This network relies critically on two non-linear operations: divisive normalization and super-linear facilitation, both of which are ubiquitously observed in neural circuits. 
We also demonstrate how online learning can be achieved using a variation of Hebb's rule and describe an extension of this work which allows us to deal with time-varying and correlated latent causes.", "full_text": "Complex Inference in Neural Circuits with Probabilistic Population Codes and Topic Models

Jeff Beck
Department of Brain and Cognitive Sciences
University of Rochester
jbeck@bcs.rochester.edu

Katherine Heller
Department of Statistical Science
Duke University
kheller@stat.duke.edu

Alexandre Pouget
Department of Neuroscience
University of Geneva
Alexandre.Pouget@unige.ch

Abstract

Recent experiments have demonstrated that humans and animals typically reason probabilistically about their environment. This ability requires a neural code that represents probability distributions and neural circuits that are capable of implementing the operations of probabilistic inference. The proposed probabilistic population coding (PPC) framework provides a statistically efficient neural representation of probability distributions that is both broadly consistent with physiological measurements and capable of implementing some of the basic operations of probabilistic inference in a biologically plausible way. However, these experiments and the corresponding neural models have largely focused on simple (tractable) probabilistic computations such as cue combination, coordinate transformations, and decision making. As a result it remains unclear how to generalize this framework to more complex probabilistic computations. Here we address this shortcoming by showing that a very general approximate inference algorithm known as Variational Bayesian Expectation Maximization can be naturally implemented within the linear PPC framework. We apply this approach to a generic problem faced by any given layer of cortex, namely the identification of latent causes of complex mixtures of spikes.
We identify a formal equivalence between this spike pattern demixing problem and topic models used for document classification, in particular Latent Dirichlet Allocation (LDA). We then construct a neural network implementation of variational inference and learning for LDA that utilizes a linear PPC. This network relies critically on two non-linear operations: divisive normalization and super-linear facilitation, both of which are ubiquitously observed in neural circuits. We also demonstrate how online learning can be achieved using a variation of Hebb's rule and describe an extension of this work which allows us to deal with time-varying and correlated latent causes.

1 Introduction to Probabilistic Inference in Cortex

Probabilistic (Bayesian) reasoning provides a coherent and, in many ways, optimal framework for dealing with complex problems in an uncertain world. It is, therefore, somewhat reassuring that behavioural experiments reliably demonstrate that humans and animals behave in a manner consistent with optimal probabilistic reasoning when performing a wide variety of perceptual [1, 2, 3], motor [4, 5, 6], and cognitive tasks [7]. This remarkable ability requires a neural code that represents probability distribution functions of task-relevant stimuli rather than just single values. While there are many ways to represent functions, Bayes' rule tells us that when it comes to probability distribution functions, there is only one statistically optimal way to do it.
More precisely, Bayes' rule states that any pattern of activity, r, that efficiently represents a probability distribution over some task-relevant quantity s must satisfy the relationship p(s|r) ∝ p(r|s)p(s), where p(r|s) is the stimulus-conditioned likelihood function that specifies the form of neural variability, p(s) gives the prior belief regarding the stimulus, and p(s|r) gives the posterior distribution over values of the stimulus s given the representation r. Of course, it is unlikely that the nervous system consistently achieves this level of optimality. Nonetheless, Bayes' rule suggests the existence of a link between neural variability as characterized by the likelihood function p(r|s) and the state of belief of a mature statistical learning machine such as the brain.

The so-called Probabilistic Population Coding (or PPC) framework [8, 9, 10] takes this link seriously by proposing that the function encoded by a pattern of neural activity r is, in fact, the likelihood function p(r|s). When this is the case, the precise form of the neural variability informs the nature of the neural code. For example, the exponential family of statistical models with linear sufficient statistics has been shown to be flexible enough to model the first- and second-order statistics of in vivo recordings in awake behaving monkeys [9, 11, 12] and anesthetized cats [13]. When the likelihood function is modeled in this way, the log posterior probability over the stimulus is linearly encoded by neural activity, i.e.,

log p(s|r) = h(s) · r − log Z(r)   (1)

Here, the stimulus-dependent kernel, h(s), is a vector of functions of s, the dot represents a standard dot product, and Z(r) is the partition function which serves to normalize the posterior.
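As a numerical aside not found in the paper, the convenience of the log-linear code in Eq. 1 can be illustrated with a hypothetical kernel h(s) = (s, s²): in this basis an activity vector encodes a Gaussian likelihood through its natural parameters, and adding the activity vectors of two independent observations multiplies the encoded likelihoods. All tuning choices below are illustrative assumptions.

```python
import numpy as np

# Hypothetical linear PPC over a 1-D stimulus: with kernel h(s) spanning
# {s, s^2} (exponential family, linear sufficient statistics), the encoded
# posterior is Gaussian and r holds its natural parameters.
s = np.linspace(-10, 10, 2001)          # discretized stimulus values
h = np.stack([s, s**2])                 # kernel h(s), shape (2, len(s))

def posterior(r):
    # decode p(s|r) proportional to exp(h(s) . r) on the grid (Eq. 1);
    # the partition function Z(r) is computed numerically
    log_p = h.T @ r
    log_p -= log_p.max()                # numerical stability
    p = np.exp(log_p)
    return p / p.sum()

def encode(mu, var):
    # activity encoding a Gaussian likelihood N(mu, var):
    # natural parameters (mu/var, -1/(2 var)) in the {s, s^2} basis
    return np.array([mu / var, -0.5 / var])

r1, r2 = encode(2.0, 4.0), encode(-1.0, 1.0)

# Evidence integration is just addition of neural activities:
p12 = posterior(r1 + r2)
mu12 = (s * p12).sum()                  # precision-weighted mean = -0.4
```

Combining the two cues by summing activities recovers the precision-weighted average (2.0/4 + (−1.0)/1)/(1/4 + 1/1) = −0.4, the optimal cue-combination answer.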
This log-linear form for a posterior distribution is highly computationally convenient and allows for evidence integration to be implemented via linear operations on neural activity [14, 8].

Proponents of this kind of linear PPC have demonstrated how to build biologically plausible neural networks capable of implementing the operations of probabilistic inference that are needed to optimally perform the behavioural tasks listed above. These include linear PPC implementations of cue combination [8], evidence integration over time, maximum likelihood and maximum a posteriori estimation [9], coordinate transformation/auditory localization [10], object tracking/Kalman filtering [10], explaining away [10], and visual search [15]. Moreover, each of these neural computations has required only a single recurrently connected layer of neurons that is capable of just two non-linear operations: coincidence detection and divisive normalization, both of which are widely observed in cortex [16, 17].

Unfortunately, this research program has been a piecemeal effort that has largely proceeded by building neural networks designed to deal with particular problems. As a result, there have been no proposals for a general principle by which neural network implementations of linear PPCs might be generated and no suggestions regarding how to deal with complex (intractable) problems of probabilistic inference.

In this work, we will partially address this shortcoming by showing that the Variational Bayesian Expectation Maximization (VBEM) algorithm provides a general scheme for approximate inference and learning with linear PPCs. In section 2, we briefly review the VBEM algorithm and show how it naturally leads to a linear PPC representation of the posterior as well as constraints on the neural network dynamics which build that PPC representation.
Because this section describes the VB-PPC approach rather abstractly, the remainder of the paper is dedicated to concrete applications. As a motivating example, we consider the problem of inferring the concentrations of odors in an olfactory scene from a complex pattern of spikes in a population of olfactory receptor neurons (ORNs). In section 3, we argue that this requires solving a spike pattern demixing problem which is indicative of the generic problem faced by many layers of cortex. We then show that this demixing problem is equivalent to the problem addressed by a class of models for text documents known as probabilistic topic models, in particular Latent Dirichlet Allocation or LDA [18].

In section 4, we apply the VB-PPC approach to build a neural network implementation of probabilistic inference and learning for LDA. This derivation shows that causal inference with linear PPCs also critically relies on divisive normalization. This result suggests that this particular non-linearity may be involved in very general and fundamental probabilistic computation, rather than simply playing a role in gain modulation. In this section, we also show how this formulation allows for a probabilistic treatment of learning and show that a simple variation of Hebb's rule can implement Bayesian learning in neural circuits.

We conclude this work by generalizing this approach to time-varying inputs by introducing the Dynamic Document Model (DDM), which can infer short-term fluctuations in the concentrations of individual topics/odors and can be used to model foraging and other tracking tasks.

2 Variational Bayesian Inference with linear Probabilistic Population Codes

Variational Bayesian (VB) inference refers to a class of deterministic methods for approximating the intractable integrals which arise in the context of probabilistic reasoning.
Properly implemented, it can provide a fast alternative to sampling-based methods of inference such as MCMC [19]. Generically, the goal of any Bayesian inference algorithm is to infer a posterior distribution over behaviourally relevant latent variables Z given observations X and a generative model which specifies the joint distribution p(X, Θ, Z). This task is confounded by the fact that the generative model includes latent parameters Θ which must be marginalized out, i.e. we wish to compute

p(Z|X) ∝ ∫ p(X, Θ, Z) dΘ   (2)

When the number of latent parameters is large this integral can be quite unwieldy. The VB algorithms simplify this marginalization by approximating the complex joint distribution over behaviourally relevant latents and parameters, p(Θ, Z|X), with a distribution q(Θ, Z) for which integrals of this form are easier to deal with in some sense. There is some art to choosing the particular form for the approximating distribution to make the above integral tractable; however, a factorized approximation is common, i.e. q(Θ, Z) = qΘ(Θ)qZ(Z).

Regardless, for any given observation X, the approximate posterior is found by minimizing the Kullback-Leibler divergence between q(Θ, Z) and p(Θ, Z|X).
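As a concrete illustration of such a factorized approximation (a standard textbook example, not a model from this paper), consider VB inference for a Gaussian with unknown mean μ and precision λ under conjugate priors: alternating expected-value updates of q(μ) and q(λ) descend the KL divergence, exactly the coordinate scheme described next. All hyperparameter values here are illustrative assumptions.

```python
import numpy as np

# Toy VBEM coordinate ascent: data x_i ~ N(mu, 1/lam), priors
# mu ~ N(mu0, 1/(beta0*lam)) and lam ~ Gamma(a0, b0). The factorized
# posterior q(mu) q(lam) is refined by alternating expectations.
rng = np.random.default_rng(0)
x = rng.normal(2.0, 0.5, size=2000)
N, xbar, xsq = len(x), x.mean(), (x**2).sum()

mu0, beta0, a0, b0 = 0.0, 1.0, 1.0, 1.0
E_lam = 1.0                                   # initial guess for <lam>
for _ in range(100):
    # update q(mu) = N(mu_N, 1/lam_N) given the current <lam>
    mu_N = (beta0 * mu0 + N * xbar) / (beta0 + N)
    lam_N = (beta0 + N) * E_lam
    # update q(lam) = Gamma(a_N, b_N) given <mu> and <mu^2>
    E_mu, E_mu2 = mu_N, mu_N**2 + 1.0 / lam_N
    a_N = a0 + 0.5 * (N + 1)
    b_N = b0 + 0.5 * (xsq - 2 * E_mu * N * xbar + N * E_mu2
                      + beta0 * (E_mu2 - 2 * E_mu * mu0 + mu0**2))
    E_lam = a_N / b_N

posterior_mean = mu_N          # close to the sample mean
posterior_precision = E_lam    # close to 1 / Var(x)
```

With ample data the approximate posterior concentrates on the sample statistics, and each update is a closed-form expectation, which is what makes the scheme attractive for neural implementation.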
When a factorized posterior is assumed, the Variational Bayesian Expectation Maximization (VBEM) algorithm finds a local minimum of the KL divergence by iteratively updating qΘ(Θ) and qZ(Z) according to the scheme

log q^n_Θ(Θ) ∼ ⟨log p(X, Θ, Z)⟩_{q^n_Z(Z)}  and  log q^{n+1}_Z(Z) ∼ ⟨log p(X, Θ, Z)⟩_{q^n_Θ(Θ)}   (3)

Here the brackets indicate an expected value taken with respect to the subscripted probability distribution function and the tilde indicates equality up to a constant which is independent of Θ and Z. The key property to note here is that the approximate posterior which results from this procedure is in an exponential family form and is therefore representable by a linear PPC (Eq. 1). This feature allows for the straightforward construction of networks which implement the VBEM algorithm with linear PPCs in the following way. If r^n_Θ and r^n_Z are patterns of activity that use a linear PPC representation of the relevant posteriors, then

log q^n_Θ(Θ) ∼ hΘ(Θ) · r^n_Θ  and  log q^{n+1}_Z(Z) ∼ hZ(Z) · r^{n+1}_Z   (4)

Here the stimulus-dependent kernels hZ(Z) and hΘ(Θ) are chosen so that their outer product results in a basis that spans the function space on Z × Θ given by log p(X, Θ, Z) for every X. This choice guarantees that there exist functions fΘ(X, r^n_Z) and fZ(X, r^n_Θ) such that

r^n_Θ = fΘ(X, r^n_Z)  and  r^{n+1}_Z = fZ(X, r^n_Θ)   (5)

satisfy Eq. 3. When this is the case, simply iterating the discrete dynamical system described by Eq. 5 until convergence will find the VBEM approximation to the posterior. This is one way to build a neural network implementation of the VB algorithm. However, it's not the only way. In general, any dynamical system which has stable fixed points in common with Eq.
5 can also be said to implement the VBEM algorithm. In the example below we will take advantage of this flexibility in order to build biologically plausible neural network implementations.

Figure 1: (Left) Each cause (e.g. coffee) in isolation results in a pattern of neural activity (top). When multiple causes contribute to a scene this results in an overall pattern of neural activity which is a mixture of these patterns weighted by the intensities (bottom). (Right) The resulting pattern can be represented by a raster, where each spike is colored by its corresponding latent cause.

3 Probabilistic Topic Models for Spike Train Demixing

Consider the problem of odor identification depicted in Fig. 1. A typical mammalian olfactory system consists of a few hundred different types of olfactory receptor neurons (ORNs), each of which responds to a wide range of volatile chemicals. This results in a highly distributed code for each odor. Since a typical olfactory scene consists of many different odors at different concentrations, the pattern of ORN spike trains represents a complex mixture. Described in this way, it is easy to see that the problem faced by early olfactory cortex can be described as the task of demixing spike trains to infer latent causes (odor intensities).

In many ways this olfactory problem is a generic problem faced by each cortical layer as it tries to make sense of the activity of the neurons in the layer below. The input patterns of activity consist of spikes (or spike counts) labeled by the axons which deliver them and summarized by a histogram which indicates how many spikes come from each input neuron. Of course, just because a spike came from a particular neuron does not mean that it had a particular cause, just as any particular ORN spike could have been caused by any one of a large number of volatile chemicals.
Like olfactory\ncodes, cortical codes are often distributed and multiple latent causes can be present at the same time.\nRegardless, this spike or histogram demixing problem is formally equivalent to a class of demixing\nproblems which arise in the context of probabilistic topic models used for document modeling. A\nsimple but successful example of this kind of topic model is called Latent Dirichlet Allocation (LDA)\n[18]. LDA assumes that word order in documents is irrelevant and, therefore, models documents\nas histograms of word counts. It also assumes that there are K topics and that each of these topics\nappears in different proportions in each document, e.g. 80% of the words in a document might be\nconcerned with coffee and 20% with strawberries. Words from a given topic are themselves drawn\nfrom a distribution over words associated with that topic, e.g. when talking about coffee you have a\n5% chance of using the word \u2019bitter\u2019. The goal of LDA is to infer both the distribution over topics\ndiscussed in each document and the distribution of words associated with each topic. We can map\nthe generative model for LDA onto the task of spike demixing in cortex by letting topics become\nlatent causes or odors, words become neurons, word occurrences become spikes, word distributions\nassociated with each topic become patterns of neural activity associated with each cause, and different\ndocuments become the observed patterns of neural activity on different trials. This equivalence is\nmade explicit in Fig. 2 which describes the standard generative model for LDA applied to documents\non the left and mixtures of spikes on the right.\n\n4 LDA Inference and Network Implementation\n\nIn this section we will apply the VB-PPC formulation to build a biologically plausible network\ncapable of approximating probabilistic inference for spike pattern demixing. 
For simplicity, we will use the equivalent Gamma-Poisson formulation of LDA which directly models word and topic counts rather than topic assignments.

1. For each topic k = 1, . . . , K,
   (a) Distribution over words: βk ∼ Dirichlet(η0)
2. For document d = 1, . . . , D,
   (a) Distribution over topics: θd ∼ Dirichlet(α0)
   (b) For word m = 1, . . . , Ωd
      i. Topic assignment: zd,m ∼ Multinomial(θd)
      ii. Word assignment: ωd,m ∼ Multinomial(βz_{d,m})

1. For latent cause k = 1, . . . , K,
   (a) Pattern of neural activity: βk ∼ Dirichlet(η0)
2. For scene d = 1, . . . , D,
   (a) Relative intensity of each cause: θd ∼ Dirichlet(α0)
   (b) For spike m = 1, . . . , Ωd
      i. Cause assignment: zd,m ∼ Multinomial(θd)
      ii. Neuron assignment: ωd,m ∼ Multinomial(βz_{d,m})

Figure 2: (Left) The LDA generative model in the context of document modeling. (Right) The corresponding LDA generative model mapped onto the problem of spike demixing. Text-related attributes on the left, in red, have been replaced with neural attributes on the right, in green.

Specifically, we define Rd,j to be the number of times neuron j fires during trial d. Similarly, we let Nd,j,k be the number of times a spike in neuron j comes from cause k in trial d. These new variables play the roles of the cause and neuron assignment variables, zd,m and ωd,m, by simply counting them up. If we let cd,k be an un-normalized intensity of cause k such that θd,k = cd,k / Σk' cd,k', then the generative model

Rd,j = Σk Nd,j,k,    Nd,j,k ∼ Poisson(βj,k cd,k),    cd,k ∼ Gamma(α0_k, C⁻¹)   (6)

is equivalent to the topic models described above.
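As an aside, the Gamma-Poisson generative model of Eq. 6 is easy to simulate forward; the sketch below draws spike-count histograms from it with illustrative sizes and hyperparameters (these particular values are our assumptions, not the paper's).

```python
import numpy as np

# Forward samples from the Gamma-Poisson formulation of LDA (Eq. 6).
rng = np.random.default_rng(1)
K, J, D = 3, 50, 10            # latent causes, neurons, trials/scenes
eta0, alpha0, C = 0.5, 1.0, 100.0

# beta[:, k]: pattern of neural activity for cause k (each column sums to 1)
beta = rng.dirichlet(np.full(J, eta0), size=K).T        # shape (J, K)

# c[d, k]: un-normalized intensity of cause k on trial d,
# Gamma with rate 1/C, i.e. scale C, so C sets the expected spike budget
c = rng.gamma(shape=alpha0, scale=C, size=(D, K))

# N[d, j, k] ~ Poisson(beta_{j,k} * c_{d,k}); R sums out the cause label
N = rng.poisson(beta[None, :, :] * c[:, None, :])
R = N.sum(axis=2)              # observed spike-count histograms, shape (D, J)
```

Only R is observable; the inference problem of the paper is to recover the posterior over beta and c from R alone.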
Here the parameter C is a scale parameter which sets the expected total number of spikes from the population on each trial. Note that the problem of inferring the βj,k and cd,k is a non-negative matrix factorization problem similar to that considered by Lee and Seung [20]. The primary difference is that, here, we are attempting to infer a probability distribution over these quantities rather than maximum likelihood estimates. See supplement for details. Following the prescription laid out in section 2, we approximate the posterior over latent variables given a set of input patterns, Rd, d = 1, . . . , D, with a factorized distribution of the form qN(N)qc(c)qβ(β). This results in marginal posterior distributions q(β:,k|η:,k), q(cd,k|αd,k, C⁻¹ + 1), and q(Nd,j,:|log pd,j,:, Rd,j), which are Dirichlet, Gamma, and Multinomial respectively. Here, the parameters η:,k, αd,k, and log pd,j,: are the natural parameters of these distributions. The VBEM update algorithm yields update rules for these parameters which are summarized in Fig.
3, Algorithm 1.

Algorithm 1: Batch VB updates
1: while ηj,k not converged do
2:   for d = 1, · · · , D do
3:     while pd,j,k, αd,k not converged do
4:       αd,k → α0 + Σj Rd,j pd,j,k
5:       pd,j,k → exp(ψ(ηj,k) − ψ(η̄k)) exp ψ(αd,k) / Σi exp(ψ(ηj,i) − ψ(η̄i)) exp ψ(αd,i)
6:     end while
7:   end for
8:   ηj,k = η0 + Σd Rd,j pd,j,k
9: end while

Algorithm 2: Online VB updates
1: for d = 1, · · · , D do
2:   reinitialize pj,k, αk ∀ j, k
3:   while pj,k, αk not converged do
4:     αk → α0 + Σj Rd,j pj,k
5:     pj,k → exp(ψ(ηj,k) − ψ(η̄k)) exp ψ(αk) / Σi exp(ψ(ηj,i) − ψ(η̄i)) exp ψ(αi)
6:   end while
7:   ηj,k → (1 − dt)ηj,k + dt(η0 + Rd,j pj,k)
8: end for

Figure 3: Here η̄k = Σj ηj,k and ψ(x) is the digamma function, so that exp ψ(x) is a smoothed threshold linear function.

Before we move on to the neural network implementation, note that this standard formulation of variational inference for LDA utilizes a batch learning scheme that is not biologically plausible. Fortunately, an online version of this variational algorithm was recently proposed and shown to give superior results when compared to the batch learning algorithm [21]. This algorithm replaces the sum over d in the update equation for ηj,k with an incremental update based upon only the most recently observed pattern of spikes. See Fig. 3, Algorithm 2.

4.1 Neural Network Implementation

Recall that the goal was to build a neural network that implements the VBEM algorithm for the underlying latent causes of a mixture of spikes using a neural code that represents the posterior distribution via a linear PPC.
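For reference, the online updates of Fig. 3, Algorithm 2, can be sketched in a few lines of numpy; the sizes, hyperparameters, and data below are illustrative placeholders, not values from the paper.

```python
import numpy as np
from scipy.special import digamma

# Online VB updates for the Gamma-Poisson LDA model (Fig. 3, Algorithm 2).
# R is a (D, J) array of spike-count histograms; K is the number of causes.
def online_vb(R, K, eta0=0.5, alpha0=1.0, dt=0.1, inner_iters=50):
    D, J = R.shape
    eta = np.full((J, K), eta0) + 0.01 * np.random.rand(J, K)  # break symmetry
    for d in range(D):
        p = np.full((J, K), 1.0 / K)      # reinitialize p_{j,k}
        alpha = np.full(K, alpha0)        # reinitialize alpha_k
        for _ in range(inner_iters):
            alpha = alpha0 + R[d] @ p                           # line 4
            w = np.exp(digamma(eta) - digamma(eta.sum(0)))      # exp(psi(eta_jk) - psi(etabar_k))
            p = w * np.exp(digamma(alpha))                      # line 5, numerator
            p /= p.sum(axis=1, keepdims=True)                   # normalize over causes i
        eta = (1 - dt) * eta + dt * (eta0 + R[d][:, None] * p)  # line 7
    return eta, alpha, p
```

The normalization of p over causes in line 5 is the divisive operation that the network implementation below realizes with dendritically targeted inhibition.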
A linear PPC represents the natural parameters of a posterior distribution via a linear operation on neural activity. Since the primary quantity of interest here is the posterior distribution over odor concentrations, qc(c|α), this means that we need a pattern of activity rα which is linearly related to the αk's in the equations above. One way to accomplish this is to simply assume that the firing rates of output neurons are equal to the positive-valued αk parameters.

Fig. 4 depicts the overall network architecture. Input patterns of activity, R, are transmitted to the synapses of a population of output neurons which represent the αk's. The output activity is pooled to form an un-normalized prediction of the activity of each input neuron, R̄j, given the output layer's current state of belief about the latent causes of the Rj. The activity at each synapse targeted by input neuron j is then inhibited divisively by this prediction. This results in a dendrite that reports to the soma a quantity, N̄j,k, which represents the fraction of unexplained spikes from input neuron j that could be explained by latent cause k. A continuous-time dynamical system with this feature and the property that it shares its fixed points with the LDA algorithm is given by

(d/dt) N̄j,k = wj,k Rj − R̄j N̄j,k   (7)

(d/dt) αk = exp(ψ(η̄k)) (α0 − αk) + exp(ψ(αk)) Σj N̄j,k   (8)

where R̄j = Σk wj,k exp(ψ(αk)) and wj,k = exp(ψ(ηj,k)). Note that, despite its form, it is Eq. 7 which implements the required divisive normalization operation since, in the steady state, N̄j,k = wj,k Rj / R̄j.

Regardless, this network has a variety of interesting properties that align well with biology.
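These dynamics can be integrated directly; the sketch below Euler-integrates Eqs. 7-8 for fixed synaptic weights, with illustrative sizes, weights, and inputs that are our assumptions rather than simulation settings from the paper.

```python
import numpy as np
from scipy.special import digamma

# Euler integration of the network dynamics (Eqs. 7-8), no learning.
rng = np.random.default_rng(2)
J, K = 20, 3
eta = rng.gamma(2.0, 1.0, size=(J, K))        # fixed eta_{j,k}
w = np.exp(digamma(eta))                      # w_{j,k} = exp(psi(eta_{j,k}))
R = rng.poisson(5.0, size=J).astype(float)    # input spike counts R_j
alpha0, dt = 1.0, 0.01

alpha = np.ones(K)
Nbar = np.zeros((J, K))
for _ in range(20000):
    Rbar = w @ np.exp(digamma(alpha))                          # prediction Rbar_j
    dN = w * R[:, None] - Rbar[:, None] * Nbar                 # Eq. 7
    da = (np.exp(digamma(eta.sum(0))) * (alpha0 - alpha)
          + np.exp(digamma(alpha)) * Nbar.sum(0))              # Eq. 8
    Nbar += dt * dN
    alpha += dt * da

# at steady state Eq. 7 reduces to divisive normalization,
# Nbar_{j,k} = w_{j,k} R_j / Rbar_j, so this residual should be near zero
residual = np.abs(Nbar - w * R[:, None] / Rbar[:, None]).max()
```

After convergence the network's synaptic variables settle on the divisively normalized values, confirming that the ODEs share their fixed point with the VB updates.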
It predicts that a balance of excitation and inhibition is maintained in the dendrites via divisive normalization and that the role of inhibitory neurons is to predict the input spikes which target individual dendrites. It also predicts superlinear facilitation. Specifically, the final term on the right of Eq. 8 indicates that more active cells will be more sensitive to their dendritic inputs. Alternatively, this could be implemented via recurrent excitation at the population level. In either case, this is the mechanism by which the network implements a sparse prior on topic concentrations, and it stands in stark contrast to winner-take-all mechanisms which rely on competitive mutual inhibition. Additionally, the η̄k in Eq. 8 represents a cell-wide 'leak' parameter indicating that the total leak should be roughly proportional to the sum total weight of the synapses which drive the neuron. This predicts that cells that are highly sensitive to input should also decay back to baseline more quickly. This implementation also predicts Hebbian learning of synaptic weights. To observe this fact, note that the online update rule for the ηj,k parameters can be implemented by simply correlating the activity at each synapse, N̄j,k, with the activity at the soma, αk, via the equation

τL (d/dt) wj,k = exp(ψ(η̄k)) (η0 − 1/2 − wj,k) + N̄j,k exp ψ(αk)   (9)

where τL is a long time constant for learning and we have used the fact that exp(ψ(ηj,k)) ≈ ηj,k − 1/2 for ηj,k > 1. For a detailed derivation see the supplementary material.

5 Dynamic Document Model

LDA is a rather simple generative model that makes several unrealistic assumptions about mixtures of sensory and cortical spikes. In particular, it assumes both that there are no correlations between the intensities of latent causes and that there are no correlations between the intensities of latent causes in temporally adjacent trials or scenes. This makes LDA a rather poor computational model for a task like olfactory foraging, which requires the animal to track the rise and fall of odor intensities as it navigates its environment. We can model this more complicated task by replacing the static cause or odor intensity parameters with dynamic odor intensity parameters whose behavior is governed by an exponentiated Ornstein-Uhlenbeck process with drift and diffusion matrices given by (Λ and ΣD). We call this variant of LDA the Dynamic Document Model (DDM) as it could be used to model smooth changes in the distribution of topics over the course of a single document.

Figure 4: The LDA network model. Dendritically targeted inhibition is pooled from the activity of all neurons in the output layer and acts divisively.

Figure 5: The DDM network model also includes recurrent connections which target the soma with both a linear excitatory signal and an inhibitory signal that also takes the form of a divisive normalization.

5.1 DDM Model

Thus the generative model for the DDM is as follows:

1. For latent cause k = 1, . . . , K,
   (a) Cause distribution over spikes: βk ∼ Dirichlet(η0)
2. For scene t = 1, . . . , T,
   (a) Log intensity of causes: c(t) ∼ Normal(Λc(t − 1), ΣD)
   (b) Number of spikes in neuron j resulting from cause k: Nj,k(t) ∼ Poisson(βj,k exp ck(t))
   (c) Number of spikes in neuron j: Rj(t) = Σk Nj,k(t)

This model bears many similarities to the Correlated and Dynamic topic models [22], but models dynamics over a short time scale, where the dynamic relationship (Λ, ΣD) is important.

5.2 Network Implementation

Once again the quantity of interest is the current distribution of latent causes, p(c(t)|R(τ), τ = 0..T). If no spikes occur then no evidence is presented and posterior inference over c(t) is simply given by an undriven Kalman filter with parameters (Λ, ΣD). A recurrent neural network which uses a linear PPC to encode a posterior that evolves according to a Kalman filter has the property that neural responses are linearly related to the inverse covariance matrix of the posterior as well as to that inverse covariance matrix times the posterior mean. In the absence of evidence, it is easy to show that these quantities must evolve according to recurrent dynamics which implement divisive normalization [10]. Thus, the patterns of neural activity which linearly encode them must do so as well. When a new spike arrives, optimal inference is no longer possible and a variational approximation must be utilized. As is shown in the supplement, this variational approximation is similar to the variational approximation used for LDA. As a result, a network which can divisively inhibit its synapses is able to implement approximate Bayesian inference. Curiously, this implies that the addition of spatial and temporal correlations to the latent causes adds very little complexity to the VB-PPC network implementation of probabilistic inference.
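As an aside, the DDM generative process of section 5.1 can also be simulated directly; the parameters below (Λ, ΣD, and the population sizes) are illustrative choices of ours, not values from the paper.

```python
import numpy as np

# Forward samples from the DDM generative model (section 5.1).
rng = np.random.default_rng(3)
K, J, T = 3, 40, 200
eta0 = 0.5
Lam = 0.95 * np.eye(K)             # drift: slow decay of log intensities
Sigma_D = 0.05 * np.eye(K)         # diffusion covariance
L = np.linalg.cholesky(Sigma_D)

beta = rng.dirichlet(np.full(J, eta0), size=K).T    # (J, K) cause patterns

c = np.zeros((T, K))               # log intensities (exponentiated OU process)
R = np.zeros((T, J), dtype=int)
for t in range(1, T):
    c[t] = Lam @ c[t - 1] + L @ rng.standard_normal(K)
    N = rng.poisson(beta * np.exp(c[t])[None, :])   # N_{j,k}(t), shape (J, K)
    R[t] = N.sum(axis=1)                            # observed counts R_j(t)
```

The smooth trajectories of c(t) are what the Kalman-filter-like recurrent dynamics of the network track between spikes.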
All that is required is an additional inhibitory population which targets the somata in the output population. See Fig. 5.

Figure 6: (Left) Neural network approximation to the natural parameters of the posterior distribution over topics (the α's) as a function of the VBEM estimate of those same parameters for a variety of 'documents'. (Center) Same as left, but for the natural parameters of the DDM (i.e. the entries of the matrix Σ⁻¹(t) and of Σ⁻¹μ(t) for the distribution over log topic intensities). (Right) Three example traces for cause intensity in the DDM. Black shows true concentration; blue and red (indistinguishable) show MAP estimates for the network and VBEM algorithms.

6 Experimental Results

We compared the PPC neural network implementations of variational inference with the standard VBEM algorithm. This comparison is necessary because the two algorithms are not guaranteed to converge to the same solution due to the fact that we only required that the neural network dynamics have the same fixed points as the standard VBEM algorithm. As a result, it is possible for the two algorithms to converge to different local minima of the KL divergence. For the network implementation of LDA we find good agreement between the neural network and VBEM estimates of the natural parameters of the posterior. See Fig. 6 (left), which shows the two algorithms' estimates of the shape parameter of the posterior distribution over topic (odor) concentrations (a quantity which is proportional to the expected concentration). This agreement, however, is not perfect, especially when posterior predicted concentrations are low.
In part, this is due to the fact that we are presenting the network with difficult inference problems for which the true posterior distribution over topics (odors) is highly correlated and multimodal. As a result, the objective function (KL divergence) is littered with local minima. Additionally, the discrete iterations of the VBEM algorithm can take very large steps in the space of natural parameters, while the neural network implementation cannot. In contrast, the network implementation of the DDM is in much better agreement with the VBEM estimates. See Fig. 6 (right). This is because the smooth temporal dynamics of the topics eliminate the need for the VBEM algorithm to take large steps; as a result, the smooth network dynamics are better able to accurately track the VBEM algorithm's output. For simulation details please see the supplement.

7 Discussion and Conclusion

In this work we presented a general framework for inference and learning with linear probabilistic population codes. This framework takes advantage of the fact that the Variational Bayesian Expectation Maximization algorithm generates approximate posterior distributions which are in an exponential family form. This is precisely the form needed in order to make probability distributions representable by a linear PPC. We then outlined a general means by which one can build a neural network implementation of the VB algorithm using this kind of neural code. We applied this VB-PPC framework to generate a biologically plausible neural network for spike train demixing. We chose this problem because it has many of the features of the canonical problem faced by nearly every layer of cortex, namely that of inferring the latent causes of complex mixtures of spike trains in the layer below.
Curiously, this very complicated problem of probabilistic inference and learning ended up having a remarkably simple network solution, requiring only that neurons be capable of implementing divisive normalization via dendritically targeted inhibition and superlinear facilitation. Moreover, we showed that extending this approach to the more complex dynamic case in which latent causes change in intensity over time does not substantially increase the complexity of the neural circuit. Finally, we would like to note that, while we utilized a rate coding scheme for our linear PPC, the basic equations would still apply to any spike-based log probability code such as that considered by Boerlin and Deneve [23].

References

[1] Daniel Kersten, Pascal Mamassian, and Alan Yuille. Object perception as Bayesian inference. Annual Review of Psychology, 55:271–304, January 2004.

[2] Marc O. Ernst and Martin S. Banks. Humans integrate visual and haptic information in a statistically optimal fashion. Nature, 415(6870):429–33, 2002.

[3] Yair Weiss, Eero P. Simoncelli, and Edward H. Adelson. Motion illusions as optimal percepts. Nature Neuroscience, 5(6):598–604, 2002.

[4] P. N. Sabes. The planning and control of reaching movements. Current Opinion in Neurobiology, 10(6):740–6, 2000.

[5] Konrad P. Körding and Daniel M. Wolpert. Bayesian integration in sensorimotor learning. Nature, 427(6971):244–7, 2004.

[6] Emanuel Todorov. Optimality principles in sensorimotor control.
Nature Neuroscience, 7(9):907–15, 2004.

[7] Ernő Téglás, Edward Vul, Vittorio Girotto, Michel Gonzalez, Joshua B. Tenenbaum, and Luca L. Bonatti. Pure reasoning in 12-month-old infants as probabilistic inference. Science, 332(6033):1054–9, 2011.

[8] W. J. Ma, J. M. Beck, P. E. Latham, and A. Pouget. Bayesian inference with probabilistic population codes. Nature Neuroscience, 2006.

[9] Jeffrey M. Beck, Wei Ji Ma, Roozbeh Kiani, Tim Hanks, Anne K. Churchland, Jamie Roitman, Michael N. Shadlen, Peter E. Latham, and Alexandre Pouget. Probabilistic population codes for Bayesian decision making. Neuron, 60(6):1142–52, 2008.

[10] J. M. Beck, P. E. Latham, and A. Pouget. Marginalization in neural circuits with divisive normalization. Journal of Neuroscience, 31(43):15310–15319, 2011.

[11] Tianming Yang and Michael N. Shadlen. Probabilistic reasoning by neurons. Nature, 447(7148):1075–80, 2007.

[12] R. H. S. Carpenter and M. L. L. Williams. Neural computation of log likelihood in control of saccadic eye movements. Nature, 1995.

[13] Arnulf B. A. Graf, Adam Kohn, Mehrdad Jazayeri, and J. Anthony Movshon. Decoding the activity of neuronal populations in macaque primary visual cortex. Nature Neuroscience, 14(2):239–45, 2011.

[14] H. B. Barlow. Pattern recognition and the responses of sensory neurons. Annals of the New York Academy of Sciences, 1969.

[15] Wei Ji Ma, Vidhya Navalpakkam, Jeffrey M. Beck, Ronald Van Den Berg, and Alexandre Pouget. Behavior and neural basis of near-optimal visual search. Nature Neuroscience, 2011.

[16] D. J. Heeger. Normalization of cell responses in cat striate cortex. Visual Neuroscience, 9, 1992.

[17] M. Carandini, D. J. Heeger, and J. A. Movshon. Linearity and normalization in simple cells of the macaque primary visual cortex. Journal of Neuroscience, 17(21):8621–44, 1997.

[18] D.
Blei, A. Ng, and M. Jordan. Latent Dirichlet Allocation. JMLR, 2003.

[19] M. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD thesis, Gatsby Unit, UCL, 2003.

[20] D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–91, 1999.

[21] M. Hoffman, D. Blei, and F. Bach. Online learning for Latent Dirichlet Allocation. In NIPS, 2010.

[22] D. Blei and J. Lafferty. Dynamic topic models. In ICML, 2006.

[23] M. Boerlin and S. Deneve. Spike-based population coding and working memory. PLOS Computational Biology, 2011.