{"title": "A joint maximum-entropy model for binary neural population patterns and continuous signals", "book": "Advances in Neural Information Processing Systems", "page_first": 620, "page_last": 628, "abstract": "Second-order maximum-entropy models have recently gained much interest for describing the statistics of binary spike trains. Here, we extend this approach to take continuous stimuli into account as well. By constraining the joint second-order statistics, we obtain a joint Gaussian-Boltzmann distribution of continuous stimuli and binary neural firing patterns, for which we also compute marginal and conditional distributions. This model has the same computational complexity as pure binary models and fitting it to data is a convex problem. We show that the model can be seen as an extension to the classical spike-triggered average/covariance analysis and can be used as a non-linear method for extracting features which a neural population is sensitive to. Further, by calculating the posterior distribution of stimuli given an observed neural response, the model can be used to decode stimuli and yields a natural spike-train metric. Therefore, extending the framework of maximum-entropy models to continuous variables allows us to gain novel insights into the relationship between the firing patterns of neural ensembles and the stimuli they are processing.", "full_text": "A joint maximum-entropy model for binary neural\n\npopulation patterns and continuous signals\n\nSebastian Gerwinn\n\nPhilipp Berens\n\nMatthias Bethge\n\nMPI for Biological Cybernetics\n\nand University of T\u00a8ubingen\n\nComputational Vision and Neuroscience\n\nSpemannstrasse 41, 72076 T\u00a8ubingen, Germany\n\n{firstname.surname}@tuebingen.mpg.de\n\nAbstract\n\nSecond-order maximum-entropy models have recently gained much interest for\ndescribing the statistics of binary spike trains. Here, we extend this approach to\ntake continuous stimuli into account as well. 
By constraining the joint second-order statistics, we obtain a joint Gaussian-Boltzmann distribution of continuous stimuli and binary neural firing patterns, for which we also compute marginal and conditional distributions. This model has the same computational complexity as pure binary models and fitting it to data is a convex problem. We show that the model can be seen as an extension of classical spike-triggered average/covariance analysis and can be used as a non-linear method for extracting features to which a neural population is sensitive. Further, by calculating the posterior distribution of stimuli given an observed neural response, the model can be used to decode stimuli and yields a natural spike-train metric. Therefore, extending the framework of maximum-entropy models to continuous variables allows us to gain novel insights into the relationship between the firing patterns of neural ensembles and the stimuli they are processing.

1 Introduction

Recent technical advances in systems neuroscience allow us to monitor the activity of increasingly large neural ensembles simultaneously (e.g. [5, 21]). Understanding how such ensembles process sensory information and perform the complex computations underlying successful behavior requires not only collecting massive amounts of data, but also suitable statistical models for data analysis. What degree of detail to incorporate into such a model involves a trade-off between the question of interest and mathematical tractability: complex multi-compartmental models [8] allow inference about the underlying biophysical processes, but their applicability to neural populations is limited.
The generalized linear model [15], on the other hand, is tractable even for large ensembles and provides a phenomenological description of the data.

Recently, several groups have used binary maximum-entropy models incorporating pairwise correlations to model neural activity in large populations of neurons on short time scales [19, 22, 7, 25]. These models have two important features: (1) Since they only require measuring the mean activity of individual neurons and correlations in pairs of neurons, they can be estimated from moderate amounts of data. (2) They seem to capture the essential structure of neural population activity at these timescales even in networks of up to a hundred neurons [21]. Although the generality of these findings has been subject to debate [3, 18], pairwise maximum-entropy and related models [12] are an important tool for the description of neural population activity [23, 17].

To find features to which a neuron is sensitive, the spike-triggered average and spike-triggered covariance are commonly used techniques [20, 16]. They correspond to fitting a Gaussian distribution to the spike-triggered ensemble. If one has access to multi-neuron recordings, a straightforward extension of this approach is to fit a different Gaussian distribution to each binary population pattern. In statistics, the corresponding model is known as the location model [14, 10, 9]. To estimate this model, one has to observe sufficient amounts of data for each population pattern. As the number of possible binary patterns grows exponentially with the number of neurons, it is desirable to include regularization constraints in order to make parameter estimation tractable.

Here, we extend the framework of pairwise maximum-entropy modeling to a joint model for binary and continuous variables.
This allows us to analyze the functional connectivity structure of a neural population at the same time as its relationship with further continuous signals of interest. In particular, this approach makes it possible to include a stimulus as a continuous variable in the framework of maximum-entropy modeling. In this way, we can study the stimulus dependence of binary neural population activity in a regularized and rigorous manner. For example, we can use the model to extract non-linear stimulus features to which a population of neurons is sensitive, while taking the binary nature of spike trains into account. We discuss the relationship of the obtained features with classical approaches such as the spike-triggered average (STA) and spike-triggered covariance (STC). In addition, we show how the model can be used to perform spike-by-spike decoding and yields a natural spike-train metric [24, 2]. We start with a derivation of the model and a discussion of its features.

2 Model

In this section we derive the maximum-entropy model for joint continuous and binary data with second-order constraints and describe its basic properties. We write x for continuous variables and b for binary variables. Having observed the joint mean \mu and joint covariance C, we want to find a distribution p_{ME} which achieves the maximal entropy among all distributions with these observed moments.
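Concretely, the observed moments that constrain the model can be estimated from paired samples of continuous and binary observations. A minimal numpy sketch (the data, dimensions and variable names are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: N paired observations of a 2-dim continuous signal x and a
# 3-neuron binary pattern b (values purely illustrative).
N = 1000
x = rng.normal(size=(N, 2))
b = (rng.random(size=(N, 3)) < 0.3).astype(float)

# Joint vector z = (x, b); its mean and covariance are the only
# statistics the maximum-entropy model is constrained to reproduce.
z = np.hstack([x, b])
mu = z.mean(axis=0)
C = np.cov(z, rowvar=False, bias=True)

# Blocks of C in the notation of the text
C_xx, C_xb, C_bb = C[:2, :2], C[:2, 2:], C[2:, 2:]
```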
Since we model continuous and binary variables jointly, we define the entropy as a mixed discrete and differential entropy:

H[p] = -\sum_b \int p(x, b) \log p(x, b) \, dx

Formally, we require p_{ME} to satisfy the following constraints:

E[x] = \mu_x,    E[b] = \mu_b,
E[x x^\top] = C_{xx} + \mu_x \mu_x^\top,    E[b b^\top] = C_{bb} + \mu_b \mu_b^\top,    (1)
E[x b^\top] = C_{xb} + \mu_x \mu_b^\top,    E[b x^\top] = C_{bx} + \mu_b \mu_x^\top = (C_{xb} + \mu_x \mu_b^\top)^\top,

where the expectations are taken over p_{ME}. C_{xx}, C_{xb} and C_{bb} are blocks of the observed covariance matrix corresponding to the respective subsets of variables. This problem can be solved analytically using the Lagrange formalism, which leads to a maximum-entropy distribution of Boltzmann type:

p_{ME}(x, b | \Lambda, \lambda) = \frac{1}{Z(\Lambda, \lambda)} \exp(Q(x, b | \Lambda, \lambda)),

Q(x, b | \Lambda, \lambda) = \frac{1}{2} \binom{x}{b}^\top \Lambda \binom{x}{b} + \lambda^\top \binom{x}{b},    (2)

Z(\Lambda, \lambda) = \sum_b \int \exp(Q(x, b | \Lambda, \lambda)) \, dx,

where \Lambda and \lambda are chosen such that the resulting distribution fulfills the constraints in equation (1), as we discuss below. Before we compute marginal and conditional distributions in this model, we explore its basic properties.
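For a small number of binary units, the partition function Z can be evaluated exactly: for each pattern b the integral over x is a Gaussian integral with a closed-form solution, and the remaining sum over b is explicit. A sketch under these assumptions (function name and setup are illustrative; \Lambda_{xx} must be negative definite):

```python
import numpy as np
from itertools import product

def log_partition(Lam, lam, n_x):
    """Log of Z(Lam, lam): x is integrated out analytically for each
    binary pattern b, then the 2**n_b patterns are summed explicitly."""
    Lxx = Lam[:n_x, :n_x]              # must be negative definite
    Lxb = Lam[:n_x, n_x:]
    Lbb = Lam[n_x:, n_x:]
    lx, lb = lam[:n_x], lam[n_x:]
    S = np.linalg.inv(-Lxx)            # covariance of p(x | b)
    n_b = Lam.shape[0] - n_x
    # Effective quadratic/linear terms on b after the Gaussian integral
    A = Lbb + Lxb.T @ S @ Lxb
    a = lb + Lxb.T @ S @ lx
    const = (0.5 * n_x * np.log(2.0 * np.pi)
             - 0.5 * np.linalg.slogdet(-Lxx)[1]
             + 0.5 * lx @ S @ lx)
    patterns = np.array(list(product([0, 1], repeat=n_b)), dtype=float)
    logs = 0.5 * np.einsum('ij,jk,ik->i', patterns, A, patterns) + patterns @ a
    return const + np.log(np.sum(np.exp(logs - logs.max()))) + logs.max()
```

In the fully decoupled case (no x-b or b-b couplings) the result factorizes into a Gaussian normalizer times independent Bernoulli normalizers, which gives a quick sanity check of the implementation.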
First, we note that the joint distribution can be factorized in the following way:

p_{ME}(x, b | \Lambda, \lambda) = p_{ME}(x | b, \Lambda, \lambda) \, p_{ME}(b | \Lambda, \lambda)    (3)

The conditional density p_{ME}(x | b, \Lambda, \lambda) is a Normal distribution, given by:

p_{ME}(x | b, \Lambda, \lambda) \propto \exp\left( \frac{1}{2} x^\top \Lambda_{xx} x + x^\top (\lambda_x + \Lambda_{xb} b) \right) \propto N(x | \mu_{x|b}, \Sigma), \quad \text{with } \mu_{x|b} = \Sigma (\lambda_x + \Lambda_{xb} b), \; \Sigma = (-\Lambda_{xx})^{-1}    (4)

Here, \Lambda_{xx}, \Lambda_{xb} and \Lambda_{bx} are the blocks of \Lambda, and \lambda_x is the block of \lambda, corresponding to x and b, respectively. While the mean of this Normal distribution depends on b, the covariance matrix is independent of the specific binary state. The marginal probability p_{ME}(b | \Lambda, \lambda) is given by:

Z(\Lambda, \lambda) \, p_{ME}(b | \Lambda, \lambda) = \exp\left( \frac{1}{2} b^\top \Lambda_{bb} b + b^\top \lambda_b \right) \int \exp\left( \frac{1}{2} x^\top \Lambda_{xx} x + x^\top (\lambda_x + \Lambda_{xb} b) \right) dx
= (2\pi)^{n/2} \, |-\Lambda_{xx}|^{-1/2} \exp\left( \frac{1}{2} b^\top \left( \Lambda_{bb} + \Lambda_{xb}^\top (-\Lambda_{xx})^{-1} \Lambda_{xb} \right) b + b^\top \left( \lambda_b + \Lambda_{xb}^\top (-\Lambda_{xx})^{-1} \lambda_x \right) + \frac{1}{2} \lambda_x^\top (-\Lambda_{xx})^{-1} \lambda_x \right),    (5)

where n is the dimensionality of x. To evaluate the maximum-entropy distribution, we need to compute the partition function, which follows from the previous equation by summing over b:

Z(\Lambda, \lambda) = (2\pi)^{n/2} \, |-\Lambda_{xx}|^{-1/2} \sum_b \exp\left( \frac{1}{2} b^\top \left( \Lambda_{bb} + \Lambda_{xb}^\top (-\Lambda_{xx})^{-1} \Lambda_{xb} \right) b + b^\top \left( \lambda_b + \Lambda_{xb}^\top (-\Lambda_{xx})^{-1} \lambda_x \right) + \frac{1}{2} \lambda_x^\top (-\Lambda_{xx})^{-1} \lambda_x \right)    (6)

Next, we compute the
marginal distribution with respect to x. From equations (5) and (4), we find that p_{ME}(x | \Lambda, \lambda) is a mixture of Gaussians, where each Gaussian of equation (4) is weighted by the corresponding p_{ME}(b | \Lambda, \lambda). While all mixture components have the same covariance, the different weighting terms affect each component's influence on the marginal covariance of x. Finally, we also compute the conditional density p_{ME}(b | x, \Lambda, \lambda), which is given by:

p_{ME}(b | x, \Lambda, \lambda) = \frac{1}{Z'} \exp\left( \frac{1}{2} b^\top \Lambda_{bb} b + b^\top (\lambda_b + \Lambda_{bx} x) \right), \quad Z' = \sum_b \exp\left( \frac{1}{2} b^\top \Lambda_{bb} b + b^\top (\lambda_b + \Lambda_{bx} x) \right)    (7)

Note that the distribution of the binary variables given the continuous variables is again of Boltzmann type.

Parameter fitting. To find suitable parameters for given data, we employ a maximum-likelihood approach [1, 11], where we find the optimal parameters via gradient ascent:

l(\Lambda, \lambda) = \log p(\{x^{(n)}, b^{(n)}\}_{n=1}^N | \Lambda, \lambda) = \sum_n Q(x^{(n)}, b^{(n)} | \Lambda, \lambda) - N \log Z(\Lambda, \lambda)    (8)

\Rightarrow \nabla_\Lambda l = N \left[ \left\langle \binom{x}{b} \binom{x}{b}^\top \right\rangle_{\text{data}} - \left\langle \binom{x}{b} \binom{x}{b}^\top \right\rangle_{p_{ME}} \right], \quad \nabla_\lambda l = N \left[ \left\langle \binom{x}{b} \right\rangle_{\text{data}} - \left\langle \binom{x}{b} \right\rangle_{p_{ME}} \right]

To calculate the moments over the model distribution p_{ME}, we make use of the above factorization:

\langle x x^\top \rangle = \langle \langle x x^\top | b \rangle \rangle_b = (-\Lambda_{xx})^{-1} + \langle \mu_{x|b} \mu_{x|b}^\top \rangle_b, \quad \langle x b^\top \rangle = \langle \mu_{x|b} b^\top \rangle_b = \langle b x^\top \rangle^\top, \quad \langle x \rangle = \langle \mu_{x|b} \rangle_b    (9)

FIGURE 1: Illustration of different parameter settings. A: independent binary and continuous variables. B: correlations (0.4) between the variables. C: changing the mean of the binary variable (here: 0.7) corresponds to changing the weighting of the Gaussians; correlations are 0.4. Blue lines indicate p(x|b = 1) and green ones p(x|b = 0).

Hence, the only average we actually need to evaluate numerically is the one over the binary variables. Unfortunately, we cannot directly set the parameters for the continuous part, as they depend on those for the binary part. However, since the above equations can be evaluated analytically, the difficult part is finding the parameters for the binary variables. In particular, if the number of binary variables is large, calculating the partition function can become infeasible. To some extent, this can be remedied by the use of specialized Monte-Carlo algorithms [4].

2.1 Example

In order to gain intuition into the properties of the model, we illustrate it in a simple one-dimensional case. From equation (4) for the conditional mean of the continuous variable, we expect the distance between the conditional means \mu_{x|b} to increase as the correlation between the continuous and binary variables increases. We see that this is indeed the case: While the conditional Gaussians p(x|b = 1) and p(x|b = 0) are identical if x and b are uncorrelated (figure 1A), a correlation between x and b shifts them away from the unconditional mean (figure 1B). Also, the weight assigned to each of the two Gaussians can be changed.
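The fitting procedure described above (analytic averages over x, explicit enumeration of the binary patterns, gradient steps on the gap between observed and model moments) can be sketched for a small population as follows. All names, step sizes and iteration counts are illustrative, and the sketch assumes \Lambda_{xx} stays negative definite during the ascent:

```python
import numpy as np
from itertools import product

def model_moments(Lam, lam, n_x):
    """<z> and <z z^T> for z = (x, b) under the model, using
    p(x, b) = N(x | mu_{x|b}, S) p(b) and an explicit sum over b."""
    Lxx, Lxb, Lbb = Lam[:n_x, :n_x], Lam[:n_x, n_x:], Lam[n_x:, n_x:]
    lx, lb = lam[:n_x], lam[n_x:]
    S = np.linalg.inv(-Lxx)                     # shared conditional covariance
    n_b = Lam.shape[0] - n_x
    bs = np.array(list(product([0, 1], repeat=n_b)), dtype=float)
    logw = (0.5 * np.einsum('ij,jk,ik->i', bs, Lbb + Lxb.T @ S @ Lxb, bs)
            + bs @ (lb + Lxb.T @ S @ lx))
    w = np.exp(logw - logw.max())
    w /= w.sum()                                # p(b) for every pattern
    mu = (lx + bs @ Lxb.T) @ S                  # rows: mu_{x|b} = S (lx + Lxb b)
    Ex, Eb = w @ mu, w @ bs
    Exx = S + np.einsum('i,ij,ik->jk', w, mu, mu)
    Exb = np.einsum('i,ij,ik->jk', w, mu, bs)
    Ebb = np.einsum('i,ij,ik->jk', w, bs, bs)
    return np.concatenate([Ex, Eb]), np.block([[Exx, Exb], [Exb.T, Ebb]])

def fit(m1_data, m2_data, n_x, steps=20000, eta=0.1):
    """Ascent on the log-likelihood: the gradient direction is the gap
    between observed and model moments, so we simply follow that gap."""
    d = m1_data.size
    Lam = np.zeros((d, d))
    Lam[:n_x, :n_x] = -np.eye(n_x)              # start with Lam_xx negative definite
    lam = np.zeros(d)
    for _ in range(steps):
        m1, m2 = model_moments(Lam, lam, n_x)
        Lam += eta * (m2_data - m2)             # symmetric moment-gap update
        lam += eta * (m1_data - m1)
    return Lam, lam
```

Because the problem is convex, plain fixed-step ascent on the moment gap suffices for such toy sizes; for larger binary populations the enumeration over patterns would have to be replaced by sampling.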
While in figures 1A and 1B b has a symmetric mean of 0.5, a non-symmetric mean leads to an asymmetry in the weighting of the two Gaussians, as illustrated in figure 1C.

2.2 Comparison with other models for the joint modeling of binary and continuous data

There are two models in the literature for the joint distribution of continuous and binary variables, which we list in the following and compare to the model derived in this paper.

Location model. The location model (LM) [14, 10, 9] uses the same factorization as above, p(x, b) = p(x|b)p(b). However, the distribution of the binary variables p(b) is not of Boltzmann type but a general multinomial distribution and therefore has more degrees of freedom. The conditional distribution p(x|b) is assumed to be Gaussian with moments (\mu_b, \Sigma_b), both of which can depend on the conditional state b. Thus, fitting the LM usually requires much more data, since the moments have to be estimated for every possible binary state. The location model can also be seen as a maximum-entropy model in the sense that it is the distribution with maximal entropy among all distributions with the given conditional moments. As fitting this model in its general form is prone to overfitting, various ad hoc constraints have been proposed; see [9] for details.

Partially dichotomized Gaussian model. Another simple possibility to obtain a joint distribution of continuous and binary variables is to take a multivariate (latent) Gaussian distribution over all variables and then dichotomize those components which should represent the binary variables. That is, a binary variable b_i is set to 1 if the underlying Gaussian variable is greater than 0, and to 0 if it is smaller than 0. This model is known as the partially dichotomized Gaussian (PDG) [6]. Importantly, the marginal distribution over the continuous variables is always Gaussian and not a mixture as in our model.
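The PDG construction can be sketched in a few lines; the latent covariance below is purely illustrative. By construction, the continuous marginal is exactly Gaussian, in contrast to the mixture marginal of the maximum-entropy model:

```python
import numpy as np

rng = np.random.default_rng(1)

# Latent 3-dim Gaussian; the covariance values are purely illustrative.
cov = np.array([[1.0, 0.5, 0.3],
                [0.5, 1.0, 0.4],
                [0.3, 0.4, 1.0]])
z = rng.multivariate_normal(np.zeros(3), cov, size=50000)

x = z[:, 0]                       # continuous part: marginal is exactly N(0, 1)
b = (z[:, 1:] > 0).astype(int)    # dichotomized parts: binary variables
```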
The reason for this is that all marginals of a Gaussian distribution are again Gaussian.

FIGURE 2: Illustration of the binary encoding with box-type tuning curves. A: shows the marginal distribution over stimuli. The true underlying stimulus distribution is a uniform distribution over the interval (-0.5, 0.5) and is plotted in shaded gray; the mixture-of-Gaussians approximation of the MaxEnt model is plotted in black. Each neuron has a tuning curve consisting of a superposition of box functions. B: shows the tuning curve of the first neuron. This is equivalent to the conditional distribution when conditioning on the first bit, which indicates whether the stimulus is in the right part of the interval. The tuning curve is a superposition of 5 box functions. The true tuning curve is plotted in shaded gray, whereas the MaxEnt approximation is plotted in black. C: shows the tuning curve of the neuron with index 2. D: covariance between the continuous and binary variables as a function of the index of the binary variable. This is the same as the STA of each neuron (see also equation (10)). E: shows the conditional distribution when conditioning on both variables (0, 2) being one. This corresponds to the product of the tuning curves.

3 Applications

3.1 Spike triggering and feature extraction

Spike triggering is a common technique for finding features to which a single neuron is sensitive. The presented model can be seen as an extension in the following sense. Suppose that we have observed samples (x_n, b_n) from a population responding to a stimulus.
The spike-triggered average (STA) of a neuron i is then defined as

STA_i = \frac{\sum_n x_n b_n^i}{\sum_n b_n^i} = \frac{E[x b_i]}{r_i},    (10)

where r_i = \frac{\sum_n b_n^i}{N} = p(b_i = 1) is the firing rate of the i-th neuron, i.e. the fraction of ones within the sample. Note that the moment E[x b_i] is one of the constraints we require for the maximum-entropy model, and therefore the STA is included in the model.

In addition, the model also has similarities to the spike-triggered covariance (STC) [20, 16]. STC denotes the distribution or, more precisely, the covariance of the stimuli that evoked a spiking response. Usually, this covariance is then compared to the total covariance over the entire stimulus distribution. In the joint maximum-entropy model, we have access to a similar distribution, namely the conditional distribution p(x | b_i = 1), which is a compact description of the spike-triggered distribution. Note that p(x | b_i = 1) can be highly non-Gaussian, as all neurons j \neq i are marginalized out; this is why the current model is an extension of spike triggering. Additionally, we can also trigger or condition not on a single neuron but on any response pattern B_S of a sub-population S. The resulting p(x | B_S) with B_S = \{b : b_i = B_i \; \forall i \in S\} is then also a mixture of Gaussians with 2^n components, where n is the number of unspecified neurons j \notin S. As illustrated above (see figure 1B), correlations between neurons and stimuli lead to a separation of the individual Gaussians. Hence, stimulus correlations of other neurons j \neq i in the distribution p(x, b_{j \neq i} | b_i = 1) would have the same effect on the spike-triggered distribution of neuron i. Correlations within this distribution also imply that there are correlations between neuron j and neuron i. Thus, stimulus as well as noise correlations cause deviations of the conditional p(x | B_S) from a single Gaussian. Therefore, the full conditional distribution p(x | B_S) in general contains more information about the features which trigger this sub-population to evoke the specified response pattern than the conditional mean, i.e. the STA.

FIGURE 3: Illustration of a spike-by-spike decoding scheme. The MaxEnt model was fit to data from two deterministic integrate-and-fire models and can then be used for decoding spikes generated by these two independent deterministic models. The two green arrows correspond to the weights of a two-pixel receptive field for each of the two neurons. The two-dimensional stimulus was drawn from two independent Gamma distributions. The resulting spike trains were discretized into 5 time bins, each 200 ms long. A spike train to a particular stimulus (x†, marked by a cross) is decoded. In A) the marginal distribution of the continuous variables is shown. In B) the posterior when conditioning on the first temporal half of the response to that stimulus is shown. Finally, in C) the conditional distribution when conditioning on the full observed binary pattern is plotted.

We demonstrate the capabilities of this approach by considering the following encoding. As stimulus, we consider one continuous real-valued variable that is drawn uniformly from the interval [-0.5, 0.5]. It is mapped to a binary population response in the following way. Each neuron i has a square-wave tuning function:

b_i(x) = \Theta(\sin(2\pi (i + 1) x)),

where \Theta is the Heaviside function. In this way, the response of a neuron is set to 1 if its tuning function is positive and 0 otherwise. The first neuron (index 0) distinguishes the left and the right part of the entire interval. The (i + 1)-st neuron subsequently distinguishes left from right in the sub-intervals of the i-th neuron.
That is, the response of the second neuron is 1 whenever the stimulus is in the right part of one of the intervals [-0.5, 0] and [0, 0.5]. These tuning curves can also be thought of as a mapping into a non-linear feature space in which the neuron again acts linearly. Although the data-generating process is not contained in our model class, we were able to extract the tuning curves, as shown in figure 2. Note that for this example neither the STA nor an STC analysis alone would provide any insight into the feature selectivity of the neurons, in particular for the neurons with multi-modal tuning curves (the ones with higher indices in the above example). However, the tuning curves could be reconstructed with any kind of density estimation, given the STA.

3.2 Spike-by-spike decoding

Since we have a simple expression for the conditional distribution p(x | b, \Lambda, \lambda) (see equation (4)), we can use the model to analyze the decoding performance of a neural population. To illustrate this, we sampled spike trains from two leaky integrate-and-fire neurons for 1 second and discretized the resulting spike trains into 5 bins of 200 ms length each. In each trial, we used a constant two-dimensional stimulus, which was drawn from two independent Gamma distributions with shape

FIGURE 4: Illustration of the conditional probability p(b | x) for the example in figure 3. In 4A, for every binary pattern the corresponding probability is plotted for the given stimulus from figure 3, where the brightness of each square indicates its probability. For the given stimulus, the actual response pattern used for figure 3 is marked with a circle.
Each pattern b is split into two halves by the contributions of the two neurons (32 possible patterns per neuron); response patterns of the first neuron are shown on the x-axis, while response patterns of the second neuron are shown on the y-axis. In 4B we plotted, for each pattern b, its probability under the two conditional distributions p(b | x†) and p(b | x‡) against each other, with x† = (0.85, 0.72) and x‡ = (1.5, 1.5).

parameter 3 and scale parameter 0.3. For each LIF neuron, this two-dimensional stimulus was then projected onto the one-dimensional subspace spanned by its receptive field and used as input current. Hence, there are 10 binary variables, 5 for each neuron's spike train, and 2 continuous variables for the stimulus to be modeled. We drew 5 · 10^6 samples, calculated the second-order moments of the joint stimulus and response vectors, and fitted our maximum-entropy model to these moments. The obtained distribution is shown in figure 3. In 3A, we show the marginal distribution of the stimuli, which is a mixture of 2^10 Gaussians. The receptive fields of the two neurons are indicated by green arrows. To illustrate the decoding process, we sampled a stimulus and a corresponding response r, from which we try to reconstruct the stimulus. In 3B, we show the conditional distribution when conditioning on the first half of the response. Finally, in 3C, the complete posterior is shown when conditioning on the full response. From A to C, the posterior becomes more and more concentrated around the true stimulus. Although there is no neural noise in the encoding model, the reconstruction is not perfect.
This is due to the regularization properties of the maximum-entropy approach.

3.3 Stimulus dependence of firing patterns

While previous studies on the structure of neuronal firing patterns in the retina have compared how well second-order maximum-entropy models fit the empirically observed distributions under different stimulation conditions [19, 22], the stimulus has never been explicitly taken into account in the model. In the proposed framework, we have access to p(b | x), so we can explicitly study how the pattern distribution of a neural population depends on the stimulus. We illustrate this by continuing the example of figure 3. First, we show how the individual firing probabilities depend on x (figure 4A). Note that although the encoding process in the previous example was noiseless, that is, for every given stimulus there is only one response pattern, the conditional distribution p(b | x) is not a delta-function but dispersed around the expected response. This is due to the second-order approximation to the encoding model. Further, it turns out that under the model a spike in the bin immediately after a spike is very unlikely, which captures the property of the leaky integrator. Also, we can compare how p(b | x) changes for different values of x. This is illustrated in figure 4B.

3.4 Spike train metric

Oftentimes, it is desirable to measure distances between spike trains [24]. One problem, however, is that not every spike is of equal importance. That is, a spike train differing in only one spike might nevertheless represent a completely different stimulus. Therefore, Ahmadian et al. [2] suggested measuring the distance between spike trains as the difference of the stimuli when reconstructed based on the one or the other spike train. If the population is noisy, we want to measure the difference of the reconstructed stimuli on average.
To this end, we need access to the posterior distribution when conditioning on a particular spike train or binary pattern. Using the maximum-entropy model, we can define the following spike metric:

d(b_1, b_2) = D_{KL}\left[ p_{ME}(x | b_1) \,||\, p_{ME}(x | b_2) \right] = \frac{1}{2} \left( \mu_{x|b_1} - \mu_{x|b_2} \right)^\top (-\Lambda_{xx}) \left( \mu_{x|b_1} - \mu_{x|b_2} \right)    (11)

Here, D_{KL} denotes the Kullback-Leibler divergence between the posterior densities; note that -\Lambda_{xx} = \Sigma^{-1} is the inverse of the shared posterior covariance from equation (4). Equation (11) is symmetric in b_1 and b_2; for other types of posterior distributions, the Jensen-Shannon divergence might be used instead to obtain a symmetric expression. As an example, we consider the induced metric for the encoding model of figure 2. The metric induced by the square-wave tuning functions of section 3.1 is relatively simple. When conditioning on a particular population response, the conditional distribution p(x|b) is always a Gaussian with approximately the width of the smallest wavelength. Flipping a neuron's response within this pattern corresponds to shifting the conditional distribution. Suppose we have observed a population response consisting of only ones. This results in a Gaussian posterior distribution with its mean in the middle of the rightmost interval (0.5 - 1/1024, 0.5). Flipping the response of the "low-frequency" neuron, that is, the one shown in figure 2B, shifts the mean of the posterior to the middle of the sub-interval (-1/1024, 0), whereas flipping the "high-frequency" neuron, the one which indicates left or right within the smallest possible sub-interval, shifts the mean just by the width of this smallest interval to the left. Flipping the responses of single neurons within this population can thus result in posterior distributions which look quite different in terms of the Kullback-Leibler divergence.
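A sketch of this metric, assuming the block parameters \Lambda_{xx} (negative definite) and \Lambda_{xb} of a fitted model are given; the \lambda_x term in the conditional means cancels in the difference, so only the couplings to the flipped bits matter:

```python
import numpy as np

def spike_metric(b1, b2, Lxx, Lxb):
    """d(b1, b2): KL divergence between the Gaussian posteriors p(x|b1)
    and p(x|b2), which share the covariance S = (-Lxx)^{-1} and differ
    only in their means mu_{x|b} = S (lambda_x + Lxb b); the lambda_x
    contribution cancels in the mean difference."""
    S = np.linalg.inv(-Lxx)                               # shared posterior covariance
    dmu = S @ (Lxb @ (np.asarray(b1, float) - np.asarray(b2, float)))
    return 0.5 * dmu @ (-Lxx) @ dmu                       # Mahalanobis form of the KL
```

As expected from equation (11), the expression is symmetric in the two patterns and zero for identical patterns.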
In particular, there is an ordering in terms of the frequency of the neurons with respect to the proposed metric.

4 Conclusion

We have presented a maximum-entropy model based on the joint second-order statistics of continuous-valued variables and binary neural responses. This allows us to extend the maximum-entropy approach [19] for analyzing neural data to incorporate other variables of interest, such as continuous-valued stimuli. Alternatively, additional neurophysiological signals such as local field potentials [13] can be taken into account to study their relation to the joint firing patterns of local neural ensembles. We have demonstrated four applications of this approach: (1) it allows us to extract the features a (sub-)population of neurons is sensitive to, (2) we can use it for spike-by-spike decoding, (3) we can assess the impact of stimuli on the distribution of population patterns, and (4) it yields a natural spike-train metric.

We have shown that the joint maximum-entropy model can be learned in a convex fashion, although high-dimensional binary patterns might require the use of efficient sampling techniques. Because of the maximum-entropy approach, the resulting distribution is well regularized and does not require any ad hoc restrictions or regularity assumptions as have been proposed for related models [9]. Analogous to a Boltzmann machine with hidden variables, it is possible to further add hidden binary nodes to the model. This allows us to take higher-order correlations into account as well, although we stay essentially within the second-order framework. Fortunately, the learning scheme for fitting the modified model to observed data remains almost unchanged: the only difference is that the moments have to be averaged over the non-observed binary variables as well. In this way, the model can also be used as a clustering algorithm if we marginalize over all binary variables.
The resulting mixture-of-Gaussians model will consist of 2^N components, where N is the number of hidden binary variables. Unfortunately, convexity cannot be guaranteed if the model contains hidden nodes. In a similar fashion, we could also add hidden continuous variables, for example to model unobserved common inputs. In contrast to hidden binary nodes, this does not lead to increased model complexity: averaging over hidden continuous variables corresponds to integrating out each Gaussian within the mixture, which results in another Gaussian. Also, the restriction that all covariance matrices in the mixture must be equal still holds, because each Gaussian is integrated in the same way.

Acknowledgments. We would like to thank J. Macke and J. Cotton for discussions and feedback on the manuscript. This work is supported by the German Ministry of Education, Science, Research and Technology through the Bernstein award to MB (BMBF; FKZ: 01GQ0601), the Werner-Reichardt Centre for Integrative Neuroscience Tübingen, and the Max Planck Society.

References

[1] D.H. Ackley, G.E. Hinton, and T.J. Sejnowski. A learning algorithm for Boltzmann machines. Cognitive Science, 9:147–169, 1985.

[2] Y. Ahmadian, J. Pillow, J. Shlens, E. Simoncelli, E.J. Chichilnisky, and L. Paninski. A decoder-based spike train metric for analyzing the neural code in the retina. In Frontiers in Systems Neuroscience. Conference Abstract: Computational and systems neuroscience, 2009.

[3] M. Bethge and P. Berens. Near-maximum entropy models for binary neural representations of natural images. In Proceedings of the Twenty-First Annual Conference on Neural Information Processing Systems, volume 20, pages 97–104, Cambridge, MA, 2008. MIT Press.

[4] Tamara Broderick, Miroslav Dudik, Gasper Tkacik, Robert E. Schapire, and William Bialek. Faster solutions of the inverse pairwise Ising problem. arXiv, q-bio.QM:0712.2437, Dec 2007.

[5] G.
Buzsaki. Large-scale recording of neuronal ensembles. Nature Neuroscience, 7(5):446–451, 2004.
[6] D.R. Cox and Nanny Wermuth. Likelihood factorizations for mixed discrete and continuous variables. Scandinavian Journal of Statistics, 26(2):209–220, June 1999.
[7] A. Tang et al. A maximum entropy model applied to spatial and temporal correlations from cortical networks in vitro. J. Neurosci., 28(2):505–518, 2008.
[8] Q.J.M. Huys, M.B. Ahrens, and L. Paninski. Efficient estimation of detailed single-neuron models. Journal of Neurophysiology, 96(2):872, 2006.
[9] W. Krzanowski. The location model for mixtures of categorical and continuous variables. Journal of Classification, 10(1):25–49, 1993.
[10] S.L. Lauritzen and N. Wermuth. Graphical models for associations between variables, some of which are qualitative and some quantitative. The Annals of Statistics, 17(1):31–57, March 1989.
[11] D.J.C. MacKay. Information Theory, Inference and Learning Algorithms. Cambridge U. Press, 2003.
[12] J.H. Macke, P. Berens, A.S. Ecker, A.S. Tolias, and M. Bethge. Generating spike trains with specified correlation coefficients. Neural Computation, 21(2):1–27, 2009.
[13] Marcelo A. Montemurro, Malte J. Rasch, Yusuke Murayama, Nikos K. Logothetis, and Stefano Panzeri. Phase-of-firing coding of natural visual stimuli in primary visual cortex. Current Biology, 18:375–380, March 2008.
[14] I. Olkin and R.F. Tate. Multivariate correlation models with mixed discrete and continuous variables. The Annals of Mathematical Statistics, 32(2):448–465, June 1961.
[15] J.W. Pillow, J. Shlens, L. Paninski, A. Sher, A.M. Litke, E.J. Chichilnisky, and E.P. Simoncelli. Spatio-temporal correlations and visual signalling in a complete neuronal population. Nature, 454(7207):995–999, 2008.
[16] J.W. Pillow and E.P. Simoncelli.
Dimensionality reduction in neural models: an information-theoretic generalization of spike-triggered average and covariance analysis. Journal of Vision, 6(4):414–428, 2006.
[17] Y. Roudi, E. Aurell, and J.A. Hertz. Statistical physics of pairwise probability models. Frontiers in Computational Neuroscience, 2009.
[18] Yasser Roudi, Sheila Nirenberg, and Peter E. Latham. Pairwise maximum entropy models for studying large biological systems: when they can work and when they can't. PLoS Comput Biol, 5(5), 2009.
[19] Elad Schneidman, Michael J. Berry, Ronen Segev, and William Bialek. Weak pairwise correlations imply strongly correlated network states in a neural population. Nature, 440(7087):1007–1012, April 2006.
[20] O. Schwartz, E.J. Chichilnisky, and E.P. Simoncelli. Characterizing neural gain control using spike-triggered covariance. In Advances in Neural Information Processing Systems 14: Proceedings of the 2002 [sic] Conference, page 269. MIT Press, 2002.
[21] J. Shlens, G.D. Field, J.L. Gauthier, M. Greschner, A. Sher, A.M. Litke, and E.J. Chichilnisky. The structure of large-scale synchronized firing in primate retina. Journal of Neuroscience, 29(15):5022, 2009.
[22] Jonathon Shlens, Greg D. Field, Jeffrey L. Gauthier, Matthew I. Grivich, Dumitru Petrusca, Alexander Sher, Alan M. Litke, and E.J. Chichilnisky. The structure of multi-neuron firing patterns in primate retina. J. Neurosci., 26(32):8254–8266, August 2006.
[23] Jonathon Shlens, Fred Rieke, and E.J. Chichilnisky. Synchronized firing in the retina. Current Opinion in Neurobiology, 18(4):396–402, August 2008.
[24] J.D. Victor and K.P. Purpura. Metric-space analysis of spike trains: theory, algorithms and application. Network: Computation in Neural Systems, 8(2):127–164, 1997.
[25] Shan Yu, Debin Huang, Wolf Singer, and Danko Nikolic. A small world of neuronal synchrony.
Cereb. Cortex, 18(12):2891–2901, April 2008.", "award": [], "sourceid": 232, "authors": [{"given_name": "Sebastian", "family_name": "Gerwinn", "institution": null}, {"given_name": "Philipp", "family_name": "Berens", "institution": null}, {"given_name": "Matthias", "family_name": "Bethge", "institution": null}]}