{"title": "Spike train entropy-rate estimation using hierarchical Dirichlet process priors", "book": "Advances in Neural Information Processing Systems", "page_first": 2076, "page_last": 2084, "abstract": "Entropy rate quantifies the amount of disorder in a stochastic process.  For spiking neurons, the entropy rate places an upper bound on the rate at which the spike train can convey stimulus information, and a large literature has focused on the problem of estimating entropy rate from spike train data.  Here we present Bayes Least Squares and Empirical Bayesian entropy rate estimators for binary spike trains using Hierarchical Dirichlet Process (HDP) priors.  Our estimator leverages the fact that the entropy rate of an ergodic Markov Chain with known transition probabilities can be calculated analytically, and many stochastic processes that are non-Markovian can still be well approximated by Markov processes of sufficient depth.  Choosing an appropriate depth of Markov model presents challenges due to possibly long time dependencies and short data sequences: a deeper model can better account for long time-dependencies, but is more difficult to infer from limited data. Our approach mitigates this difficulty by using a hierarchical prior to share statistical power across Markov chains of different depths.   We present both a fully Bayesian and empirical Bayes entropy rate estimator based on this model, and demonstrate their performance on simulated and real neural spike train data.", "full_text": "Spike train entropy-rate estimation using hierarchical\n\nDirichlet process priors\n\nKarin Knudson\n\nDepartment of Mathematics\n\nkknudson@math.utexas.edu\n\nJonathan W. Pillow\n\nCenter for Perceptual Systems\n\nDepartments of Psychology & Neuroscience\n\nThe University of Texas at Austin\npillow@mail.utexas.edu\n\nAbstract\n\nEntropy rate quanti\ufb01es the amount of disorder in a stochastic process. 
For spiking\nneurons, the entropy rate places an upper bound on the rate at which the spike train\ncan convey stimulus information, and a large literature has focused on the prob-\nlem of estimating entropy rate from spike train data. Here we present Bayes least\nsquares and empirical Bayesian entropy rate estimators for binary spike trains us-\ning hierarchical Dirichlet process (HDP) priors. Our estimator leverages the fact\nthat the entropy rate of an ergodic Markov Chain with known transition prob-\nabilities can be calculated analytically, and many stochastic processes that are\nnon-Markovian can still be well approximated by Markov processes of suf\ufb01cient\ndepth. Choosing an appropriate depth of Markov model presents challenges due\nto possibly long time dependencies and short data sequences: a deeper model can\nbetter account for long time dependencies, but is more dif\ufb01cult to infer from lim-\nited data. Our approach mitigates this dif\ufb01culty by using a hierarchical prior to\nshare statistical power across Markov chains of different depths. We present both\na fully Bayesian and empirical Bayes entropy rate estimator based on this model,\nand demonstrate their performance on simulated and real neural spike train data.\n\n1\n\nIntroduction\n\nThe problem of characterizing the statistical properties of a spiking neuron is quite general, but two\ninteresting questions one might ask are: (1) what kind of time dependencies are present? and (2) how\nmuch information is the neuron transmitting? With regard to the second question, information theory\nprovides quanti\ufb01cations of the amount of information transmitted by a signal without reference to\nassumptions about how the information is represented or used. 
The entropy rate is of interest as a measure of uncertainty per unit time, an upper bound on the rate of information transmission, and an intermediate step in computing the mutual information rate between stimulus and neural response. Unfortunately, accurate entropy rate estimation is difficult, and estimates from limited data are often severely biased. We present a Bayesian method for estimating entropy rates from binary data that uses hierarchical Dirichlet process (HDP) priors to reduce this bias. Our method proceeds by modeling the source of the data as a Markov chain, and then using the fact that the entropy rate of a Markov chain is a deterministic function of its transition probabilities. Fitting the model yields parameters relevant to both questions (1) and (2) above: we obtain both an approximation of the underlying stochastic process as a Markov chain, and an estimate of the entropy rate of the process. For binary data, the HDP reduces to a hierarchy of beta priors, where the prior probability over g, the probability of the next symbol given a long history, is a beta distribution centered on the probability of that symbol given a truncated, one-symbol-shorter, history. The posterior over symbols given a certain history is thus “smoothed” by the probability over symbols given a shorter history. This smoothing is a key feature of the model.\n\nThe structure of the paper is as follows. In Section 2, we present definitions and challenges involved in entropy rate estimation, and discuss existing estimators. In Section 3, we discuss Markov models and their relationship to entropy rate. In Sections 4 and 5, we present two Bayesian estimates of entropy rate using the HDP prior, one involving a direct calculation of the posterior mean transition probabilities of a Markov model, the other using Markov chain Monte Carlo methods to sample from the posterior distribution of the entropy rate. 
In Section 6 we compare the HDP entropy rate estimators to existing entropy rate estimators, including the context tree weighting entropy rate estimator from [1], the string-parsing method from [2], and finite-length block entropy rate estimators that make use of the entropy estimators of Nemenman, Shafee and Bialek [3] and Miller and Madow [4]. We evaluate the results for simulated and real neural data.\n\n2 Entropy Rate Estimation\n\nIn information theory, the entropy of a random variable is a measure of the variable’s average unpredictability. The entropy of a discrete random variable X with possible values {x1, ..., xn} is\n\nH(X) = − ∑_{i=1}^{n} p(xi) log p(xi)    (1)\n\nEntropy can be measured in either bits or nats, depending on whether we use base 2 or base e for the logarithm. Here, all logarithms will be base 2, and all entropies will be given in bits. While entropy is a property of a random variable, entropy rate is a property of a stochastic process, such as a time series, and quantifies the amount of uncertainty per symbol. The neural and simulated data considered here will be binary sequences representing the spike train of a neuron, where each symbol represents either the presence of a spike in a bin (1) or the absence of a spike (0). We view the data as a sample path from an underlying stochastic process. To evaluate the average uncertainty of each new symbol (0 or 1) given the previous symbols - or the amount of new information per symbol - we would like to compute the entropy rate of the process.\n\nFor a stochastic process {Xi}_{i=1}^∞, the entropy of the random vector (X1, ..., Xk) grows with k; we are interested in how it grows. 
If we define the block entropy Hk to be the entropy of the distribution of length-k sequences of symbols, Hk = H(Xi+1, ..., Xi+k), then the entropy rate of a stochastic process {Xi}_{i=1}^∞ is defined by\n\nh = lim_{k→∞} (1/k) Hk    (2)\n\nwhen the limit exists (which, for stationary stochastic processes, it must). There are two other definitions for entropy rate, which are equivalent to the first for stationary processes:\n\nh = lim_{k→∞} Hk+1 − Hk    (3)\n\nh = lim_{k→∞} H(Xi+1|Xi, Xi−1, ..., Xi−k)    (4)\n\nWe now briefly review existing entropy rate estimators, to which we will compare our results.\n\n2.1 Block Entropy Rate Estimators\n\nSince much work has been done to accurately estimate entropy from data, Equations (2) and (3) suggest a simple entropy rate estimator, which consists of choosing first a block size k and then a suitable entropy estimator with which to estimate Hk. A simple such estimator is the “plugin” entropy estimator, which approximates the probability of each length-k block (x1, ..., xk) by the proportion of total length-k blocks observed that are equal to (x1, ..., xk). For binary data there are 2^k possible length-k blocks. When N denotes the data length and ci the number of observations of each block in the data, we have:\n\nĤplugin = ∑_{i=1}^{2^k} −(ci/N) log(ci/N)    (5)\n\nfrom which we can immediately estimate the entropy rate with hplugin,k = Ĥplugin/k, for some appropriately chosen k (the subject of “appropriate choice” will be taken up in more detail later). We would expect that using better block entropy estimators would yield better entropy rate estimators, and so we also consider two other block-based entropy rate estimators. 
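The plugin estimate of Equation 5 is straightforward to compute. Below is a minimal Python sketch (function and variable names are our own; blocks are counted with overlapping windows, so the number of blocks is N − k + 1 rather than N, a negligible difference for long sequences):

```python
import numpy as np

def plugin_entropy_rate(bits, k):
    """Plugin block entropy rate (Eq. 5): estimate H_k from empirical
    block frequencies, then divide by the block length k."""
    # Count occurrences of each length-k block (overlapping windows).
    counts = {}
    for i in range(len(bits) - k + 1):
        block = tuple(bits[i:i + k])
        counts[block] = counts.get(block, 0) + 1
    n = sum(counts.values())
    probs = np.array(list(counts.values()), dtype=float) / n
    H_k = -np.sum(probs * np.log2(probs))  # block entropy in bits
    return H_k / k

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=100000)
h = plugin_entropy_rate(x, 5)   # near 1 bit/symbol for a long fair-coin sequence
```

For abundant data this behaves well; the bias problems discussed next appear when the number of possible blocks 2^k is not small relative to the data length.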
The first uses the Bayesian entropy estimator ĤNSB from Nemenman, Shafee and Bialek [3], which gives a Bayesian least squares estimate for entropy given a mixture-of-Dirichlet prior. The second uses the Miller and Madow estimator [4], which gives a first-order correction to the (often significantly biased) plugin entropy estimator of Equation 5:\n\nĤMM = ∑_{i=1}^{2^k} −(ci/N) log(ci/N) + ((A − 1)/(2N)) log(e)    (6)\n\nwhere A is the size of the alphabet of symbols (A = 2 for the binary data sequences presently considered). For a given k, we obtain entropy rate estimators hNSB,k = ĤNSB/k and hMM,k = ĤMM/k by applying the entropy estimators from [3] and [4] respectively to the empirical distribution of the length-k blocks.\n\nWhile we can improve the accuracy of these block entropy rate estimates by choosing a better entropy estimator, choosing the block size k remains a challenge. If we choose k to be small, we miss long time dependencies in the data and tend to overestimate the entropy; intuitively, the time series will seem more unpredictable than it actually is, because we are ignoring long-time dependencies. On the other hand, as we consider larger k, limited data leads to underestimates of the entropy rate. See the plots of hplugin, hNSB, and hMM in Figure 2d for an instance of this effect of block size on entropy rate estimates. We might hope that in between the overestimates of entropy rate for short blocks and the underestimates for longer blocks, there is some “plateau” region where the entropy rate stays relatively constant with respect to block size, which we could use as a heuristic to select the proper block length [1]. 
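The plateau heuristic amounts to sweeping the block size and inspecting the estimated rate. A sketch of such a sweep with a Miller-Madow-style correction follows; note one assumption on our part: Equation 6 as printed uses the symbol alphabet size A, whereas this sketch corrects the length-k block entropy using the number of distinct blocks actually observed, a common variant when the "alphabet" is the set of blocks.

```python
import numpy as np

def block_probs(bits, k):
    """Empirical probabilities of the length-k blocks (overlapping windows)."""
    counts = {}
    for i in range(len(bits) - k + 1):
        b = tuple(bits[i:i + k])
        counts[b] = counts.get(b, 0) + 1
    c = np.array(list(counts.values()), dtype=float)
    return c / c.sum(), c.sum()

def miller_madow_rate(bits, k):
    """Block entropy rate with a first-order (Miller-Madow-style) bias
    correction. Assumption: we use the number of distinct observed blocks
    in the correction term, not the symbol alphabet size of Eq. 6."""
    p, n = block_probs(bits, k)
    H = -np.sum(p * np.log2(p)) + (len(p) - 1) / (2 * n) * np.log2(np.e)
    return H / k

# Sweep block sizes, looking for a plateau in the estimated rate.
rng = np.random.default_rng(1)
x = rng.integers(0, 2, size=2000)
sweep = {k: miller_madow_rate(x, k) for k in range(1, 9)}
```

For short sequences such a sweep typically shows the overestimate-then-underestimate pattern described above rather than a clean plateau.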
Unfortunately, the block entropy rate at this plateau may still be biased, and for data sequences that are short with respect to their time dependencies, there may be no discernible plateau at all ([1], Figure 1).\n\n2.2 Other Entropy Rate Estimators\n\nNot all existing techniques for entropy rate estimation involve an explicit choice of block length. The estimator from [2], for example, parses the full string of symbols in the data by starting from the first symbol, and sequentially removing and counting as a “phrase” the shortest substring that has not yet appeared. When M is the number of distinct phrases counted in this way, we obtain the estimator hLZ = (M/N) log N, which is free from any explicit block length parameters.\n\nA fixed block length model like the ones described in the previous section uses the entropy of the distribution of all the blocks of some length - e.g. all the blocks in the terminal nodes of a context tree like the one in Figure 1a. In the context tree weighting (CTW) framework of [1], the authors instead use a minimum description length criterion to weight different tree topologies, which have within the same tree terminal nodes corresponding to blocks of different lengths. They use this weighting to generate Monte Carlo samples and approximate the integral ∫ h(θ)p(θ|T, data)p(T|data) dθ dT, in which T represents the tree topology, and θ represents the transition probabilities associated with the terminal nodes of the tree.\n\nIn our approach, the HDP prior combined with a Markov model of our data will be a key tool in overcoming some of the difficulties of choosing a block length appropriately for entropy rate estimation. 
It will allow us to choose a block length that is large enough to capture possibly important long time dependencies, while easing the difficulty of estimating the properties of these long time dependencies from short data.\n\nFigure 1: A depth-3 hierarchical Dirichlet prior for binary data\n\n3 Markov Models\n\nThe usefulness of approximating our data source with a Markov model comes from (1) the flexibility of Markov models, including their ability to approximate well even many processes that are not truly Markovian, and (2) the fact that for a Markov chain with known transition probabilities the entropy rate need not be estimated but is in fact a deterministic function of the transition probabilities.\n\nA Markov chain is a sequence of random variables that has the property that the probability of the next state depends only on the present state, and not on any previous states. That is, P(Xi+1|Xi, ..., X1) = P(Xi+1|Xi). Note that this property does not mean that for a binary sequence the probability of each 0 or 1 depends only on the previous 0 or 1, because we consider the state variables to be strings of symbols of length k rather than individual 0s and 1s. Thus we will discuss “depth-k” Markov models, where the probability of the next state depends only on the previous k symbols, or what we will call the length-k context of the symbol. With a binary alphabet, there are 2^k states the chain can take, and from each state s, transitions are possible only to two other states. (So, for example, 110 can transition to state 101 or state 100, but not to any other state.) 
Because only two transitions are possible from each state, the transition probability distribution from each state s is completely specified by only one parameter, which we denote gs, the probability of observing a 1 given the context s.\n\nThe entropy rate of an ergodic Markov chain with finite state set A is given by:\n\nh = ∑_{s∈A} p(s)H(x|s),    (7)\n\nwhere p(s) is the stationary probability associated with state s, and H(x|s) is the entropy of the distribution of possible transitions from state s. The vector of stationary state probabilities p(s) for all s is computed as a left eigenvector of the transition matrix T:\n\np(s)T = p(s),  ∑_s p(s) = 1    (8)\n\nSince each row of the transition matrix T contains only two non-zero entries, gs and 1 − gs, p(s) can be calculated relatively quickly. With Equations 7 and 8, h can be calculated analytically from the vector of all 2^k transition probabilities {gs}. A Bayesian estimator of entropy rate based on a Markov model of order k is given by\n\nĥBayes = ∫ h(g)p(g|data) dg    (9)\n\nwhere g = {gs : |s| = k}, h is the deterministic function of g given by Equations 7 and 8, and p(g|data) ∝ p(data|g)p(g) given some appropriate prior over g.\n\nModeling a time series as a Markov chain requires a choice of the depth of that chain, so we have not avoided the depth selection problem yet. What will actually mitigate the difficulty here is the use of hierarchical Dirichlet process priors.\n\n4 Hierarchical Dirichlet Process priors\n\nWe describe a hierarchical beta prior, a special case of the hierarchical Dirichlet process (HDP), which was presented in [5] and applied to problems of natural language processing in [6] and [7].\n\nThe true entropy rate h = lim_{k→∞} Hk/k captures time dependencies of infinite depth. Therefore, to calculate the estimate ĥBayes in Equation 9 we would like to choose some large k. 
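The analytic calculation of Equations 7 and 8 can be sketched directly for a depth-k binary chain. In the sketch below (names and the integer state encoding are our own choices), states are integers 0..2^k − 1 encoding the last k symbols, and g[s] = p(next symbol is 1 | state s):

```python
import numpy as np

def markov_entropy_rate(g):
    """Entropy rate (Eq. 7) of a depth-k binary Markov chain from its
    2**k transition probabilities g[s] = p(1 | state s). On emitting
    bit b, state s moves to ((s << 1) | b) mod 2**k."""
    m = len(g)                      # m = 2**k states
    mask = m - 1
    T = np.zeros((m, m))
    for s in range(m):
        T[s, ((s << 1) | 1) & mask] = g[s]        # emit a 1
        T[s, (s << 1) & mask] = 1 - g[s]          # emit a 0
    # Stationary distribution: left eigenvector of T for eigenvalue 1 (Eq. 8).
    w, v = np.linalg.eig(T.T)
    p = np.real(v[:, np.argmin(np.abs(w - 1))])
    p = p / p.sum()
    # Per-state transition entropy H(x|s) in bits; 0 log 0 treated as 0.
    gs = np.asarray(g, dtype=float)
    with np.errstate(divide="ignore", invalid="ignore"):
        Hs = -(gs * np.log2(gs) + (1 - gs) * np.log2(1 - gs))
    Hs = np.nan_to_num(Hs)
    return float(p @ Hs)

# A chain whose transitions ignore the state, g = 1/2 everywhere,
# has entropy rate 1 bit/symbol (depth-3 example).
h_fair = markov_entropy_rate([0.5] * 8)
```

This makes explicit that, once g is known (or sampled from a posterior), no further estimation is needed to obtain h.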
However, it is difficult to estimate transition probabilities for long blocks with short data sequences, so choosing large k may lead to inaccurate posterior estimates for the transition probabilities g. In particular, shorter data sequences may not even contain observations of all possible symbol sequences of a given length.\n\nThis motivates our use of hierarchical priors as follows. Suppose we have a data sequence in which the subsequence 0011 is never observed. Then we would not expect to have a very good estimate for g0011; however, we could improve this by using the assumption that, a priori, g0011 should be similar to g011. That is, the probability of observing a 1 after the context sequence 0011 should be similar to that of seeing a 1 after 011, since it might be reasonable to assume that context symbols from the more distant past matter less. Thus we choose for our prior:\n\ngs|gs′ ∼ Beta(α|s|gs′, α|s|(1 − gs′))    (10)\n\nwhere s′ denotes the context s with the earliest symbol removed. This choice gives the prior distribution of gs mean gs′, as desired. We continue constructing the prior with gs′|gs′′ ∼ Beta(α|s′|gs′′, α|s′|(1 − gs′′)) and so on, until g[] ∼ Beta(α0p∅, α0(1 − p∅)), where g[] is the probability of a spike given no context information and p∅ is a hyperparameter reflecting our prior belief about the probability of a spike. This hierarchy gives our prior the tree structure shown in Figure 1. A priori, the distribution of each transition probability is centered around the transition probability from a one-symbol-shorter block of symbols. As long as the assumption that more distant contextual symbols matter less actually holds (at least to some degree), this structure allows the sharing of statistical information across different contextual depths. 
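To see what the prior of Equation 10 expresses, one can draw a full set of transition probabilities from it, top-down from g[] to the depth-k contexts. A sketch follows; the function name, the per-level α schedule passed in as a list, and the string encoding of contexts (leftmost character is the earliest symbol) are all our own conventions:

```python
import numpy as np

def sample_prior_g(k, alphas, p_empty=0.5, rng=None):
    """Draw transition probabilities {g_s} from the hierarchical beta
    prior (Eq. 10), from the root g_[] down to the depth-k contexts.
    alphas[d] is the concentration parameter at context depth d."""
    rng = np.random.default_rng(rng)
    g = {"": rng.beta(alphas[0] * p_empty, alphas[0] * (1 - p_empty))}
    for depth in range(1, k + 1):
        a = alphas[depth]
        for i in range(2 ** depth):
            s = format(i, f"0{depth}b")
            parent = g[s[1:]]          # one-symbol-shorter context s'
            g[s] = rng.beta(a * parent, a * (1 - parent))
    return {s: v for s, v in g.items() if len(s) == k}

# With large alphas, contexts sharing a suffix get similar probabilities.
g = sample_prior_g(3, alphas=[10, 10, 10, 10], rng=0)
```

Draws from this prior exhibit exactly the suffix-sharing structure the text describes: contexts that differ only in their earliest symbols tend to receive similar transition probabilities.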
We can obtain reasonable estimates for the transition probabilities from long blocks of symbols, even from data that is so short that we may have few (or no) observations of each of these long blocks of symbols.\n\nWe could use any number of distributions with mean gs′ to center the prior distribution of gs at gs′; we use beta distributions because they are conjugate to the likelihood. The α|s| are concentration parameters which control how concentrated the distribution is about its mean, and they can also be estimated from the data. We assume that there is one value of α for each level in the hierarchy, but one could also fix α to be constant throughout all levels, or let it vary within each level.\n\nThis hierarchy of beta distributions is a special case of the hierarchical Dirichlet process. A Dirichlet process (DP) is a stochastic process whose sample paths are each probability distributions. Formally, if G is a finite measure on a set S, then X ∼ DP(α, G) if for any finite measurable partition (A1, ..., An) of the sample space we have that (X(A1), ..., X(An)) ∼ Dirichlet(αG(A1), ..., αG(An)). Thus for a partition into only two sets, the Dirichlet process reduces to a beta distribution, which is why, when we specialize the HDP to binary data, we obtain a hierarchical beta distribution. In [5] the authors present a hierarchy of DPs where the base measure for each DP is again a DP. 
In our case, for example, we have G011 = {g011, 1 − g011} ∼ DP(α3, G11), or more generally, Gs ∼ DP(α|s|, Gs′).\n\n5 Empirical Bayesian Estimator\n\nOne can generate a sequence from an HDP by drawing each subsequent symbol from the transition probability distribution associated with its context, which is given recursively by [6]:\n\np(1|s) = (cs1 + α|s| p(1|s′)) / (α|s| + cs)   if s ≠ ∅\np(1|∅) = (c1 + α0 p∅) / (α0 + N)    (11)\n\nwhere N is the length of the data string, p∅ is a hyperparameter representing the a priori probability of observing a 1 given no contextual information, cs1 is the number of times the symbol sequence s followed by a 1 was observed, and cs is the number of times the symbol sequence s was observed.\n\nWe can calculate the posterior predictive distribution ĝpr, which is specified by the 2^k values {gs = p(1|s) : |s| = k}, by using counts c from the data and performing the above recursive calculation to estimate gs for each of the 2^k states s. Given the estimated Markov transition probabilities ĝpr we then have an empirical Bayesian entropy rate estimate via Equations 7 and 8. We denote this estimator hempHDP. Note that while ĝpr is the posterior mean of the transition probabilities, the entropy rate estimator hempHDP is no longer a fully Bayesian estimate, and is not equivalent to the ĥBayes of Equation 9. We thus lose some clarity and the ability to easily compute Bayesian confidence intervals. However, we gain a good deal of computational efficiency, because calculating hempHDP from ĝpr involves only one eigenvector computation, instead of the many needed for the MC approximation to the integral in Equation 9. 
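The recursion of Equation 11 can be implemented directly from context counts. A minimal sketch (function names are ours; contexts are strings with the earliest symbol leftmost, and counts use overlapping windows):

```python
def hdp_predictive(bits, k, alphas, p_empty=0.5):
    """Empirical Bayes predictive probabilities p(1|s) for all depth-k
    contexts s, via the recursion of Eq. 11. alphas[d] is the
    concentration parameter used at context depth d."""
    x = "".join(map(str, bits))
    N = len(x)

    def count(s):    # c_s: occurrences of the context s in the data
        return sum(1 for i in range(N - len(s) + 1) if x.startswith(s, i))

    def count1(s):   # c_{s1}: occurrences of s followed by a 1
        return sum(1 for i in range(N - len(s))
                   if x.startswith(s, i) and x[i + len(s)] == "1")

    def p1(s):       # Eq. 11, recursing to the shorter context s' = s[1:]
        if s == "":
            return (count1("") + alphas[0] * p_empty) / (alphas[0] + N)
        a = alphas[len(s)]
        return (count1(s) + a * p1(s[1:])) / (a + count(s))

    return {format(i, f"0{k}b"): p1(format(i, f"0{k}b"))
            for i in range(2 ** k)}

# Contexts never observed fall back to their shorter-context estimates.
g_hat = hdp_predictive([1] * 50, k=2, alphas=[1.0, 1.0, 1.0])
```

Plugging the resulting 2^k values into the analytic entropy rate calculation of Equations 7 and 8 then gives hempHDP.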
We present a fully Bayesian estimate next.\n\n6 Fully Bayesian Estimator\n\nHere we return to the Bayes least squares estimator ĥBayes of Equation 9. The integral is not analytically tractable, but we can approximate it using Markov chain Monte Carlo techniques. We use Gibbs sampling to simulate NMC samples g(i) ∼ g|data from the posterior distribution and then calculate h(i) from each g(i) via Equations 7 and 8 to obtain the Bayesian estimate:\n\nhHDP = (1/NMC) ∑_{i=1}^{NMC} h(i)    (12)\n\nTo perform the Gibbs sampling, we need the posterior conditional probabilities of each gs. Because the parameters of the model have the structure of a tree, each gs for |s| < k is conditionally independent from all but its immediate ancestor in the tree, gs′, and its two descendants, g0s and g1s. We have:\n\np(gs|gs′, g0s, g1s, α|s|, α|s|+1) ∝ Beta(gs; α|s|gs′, α|s|(1 − gs′)) Beta(g0s; α|s|+1gs, α|s|+1(1 − gs)) Beta(g1s; α|s|+1gs, α|s|+1(1 − gs))    (13)\n\nand we can compute these probabilities on a discrete grid, since each is one dimensional, and then sample the posterior gs via this grid. We used a uniform grid of 100 points on the interval [0, 1] for our computation. For the transition probabilities from the bottom level of the tree {gs : |s| = k}, the conjugacy of the beta distributions with the binomial likelihood function gives the posterior conditional of gs a recognizable form: p(gs|gs′, data) = Beta(αkgs′ + cs1, αk(1 − gs′) + cs0).\n\nIn the HDP model we may treat each α as a fixed hyperparameter, but it is also straightforward to set a prior over each α and then sample α along with the other model parameters with each pass of the Gibbs sampler. 
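For an internal node, the grid evaluation of Equation 13 and the corresponding draw of gs look like the following sketch (a single Gibbs update; the function and parameter names are ours, we work in log space for numerical stability, and we place the grid at bin midpoints to avoid the endpoints 0 and 1):

```python
import math
import numpy as np

lgamma = np.vectorize(math.lgamma)

def beta_logpdf(x, a, b):
    """Log density of Beta(a, b) evaluated at x (elementwise)."""
    return ((a - 1) * np.log(x) + (b - 1) * np.log(1 - x)
            + lgamma(a + b) - lgamma(a) - lgamma(b))

def gibbs_update_internal(g_parent, g0, g1, a_here, a_below, rng):
    """One Gibbs draw of an internal g_s from Eq. 13 on a discrete grid
    (the paper uses a uniform grid of 100 points on [0, 1])."""
    grid = (np.arange(100) + 0.5) / 100
    logp = (beta_logpdf(grid, a_here * g_parent, a_here * (1 - g_parent))
            + beta_logpdf(g0, a_below * grid, a_below * (1 - grid))
            + beta_logpdf(g1, a_below * grid, a_below * (1 - grid)))
    p = np.exp(logp - logp.max())   # normalize in a stable way
    return rng.choice(grid, p=p / p.sum())

rng = np.random.default_rng(2)
draw = gibbs_update_internal(0.3, 0.25, 0.35, a_here=20.0, a_below=20.0,
                             rng=rng)
```

A full sweep of the sampler applies this update to every internal node, the conjugate beta update to the leaves, and (optionally) a grid update for each α.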
The full posterior conditional for αi with a uniform prior is (from Bayes’ theorem):\n\np(αi | {gs, g0s, g1s : |s| = i − 1}) ∝ ∏_{s:|s|=i−1} (g1s g0s)^{αi gs − 1} ((1 − g1s)(1 − g0s))^{αi(1 − gs) − 1} / Beta(αi gs, αi(1 − gs))^2    (14)\n\nwhere Beta(·, ·) in the denominator denotes the beta function. We sampled α by computing the probabilities above on a grid of values spanning the range [1, 2000]. This upper bound on α is rather arbitrary, but we verified that increasing the range for α had little effect on the entropy rate estimate, at least for the ranges and block sizes considered.\n\nIn some applications, the Markov transition probabilities g, and not just the entropy rate, may be of interest as a description of the time dependencies present in the data. The Gibbs sampler above yields samples from the distribution g|data, and averaging these NMC samples yields a Bayes least squares estimator of the transition probabilities, ĝgibbsHDP. Note that this estimate is closely related to the estimate ĝpr from the previous section; with more MC samples, ĝgibbsHDP converges to the posterior mean ĝpr (when the α are fixed rather than sampled, to match the fixed α per level used in Equation 11).\n\n7 Results\n\nWe applied the model to both simulated data with a known entropy rate and to neural data, where the entropy rate is unknown. We examine the accuracy of the fully Bayesian and empirical Bayesian entropy rate estimators hHDP and hempHDP, and compare them with the entropy rate estimators hplugin, hNSB, hMM, hLZ [2], and hCTW [1], which are described in Section 2. We also consider estimates of the Markov transition probabilities g produced by both inference methods.\n\nFigure 2: Comparison of estimated (a) transition probability and (b,c,d) entropy rate for data simulated from a Markov model of depth 5. In (a) and (d), data sets are 500 symbols long. 
The block-based and HDP estimators in (b) and (c) use block size k = 8. In (b,c,d) results were averaged over 5 data sequences, and (c) plots the average absolute value of the difference between true and estimated entropy rates.\n\n7.1 Simulation\n\nWe considered data simulated from a Markov model with transition probabilities set so that transition probabilities from states with similar suffixes are similar (i.e. the process actually does have the property that more distant context symbols matter less than more recent ones in determining transitions). We used a depth-5 Markov model, whose true transition probabilities are shown in black in Figure 2a, where each of the 32 points on the x axis represents the probability that the next symbol is a 1 given the specified 5-symbol context.\n\nIn Figure 2a we compare HDP estimates of the transition probabilities of this simulated data to the plugin estimator of transition probabilities ĝs = cs1/cs calculated from a 500-symbol sequence. (The other estimators do not include calculating transition probabilities as an intermediate step, and so cannot be included here.) With a series of 500 symbols, we do not expect enough observations of each of the possible transitions to adequately estimate the 2^k transition probabilities, even for rather modest depths such as k = 5. And indeed, the “plugin” estimates of transition probabilities do not match the true transition probabilities well. On the other hand, the transition probabilities estimated using the HDP prior show the kind of “smoothing” the prior was meant to encourage, where states corresponding to contexts with the same suffixes have similar estimated transition probabilities.\n\nLastly, we plot the convergence of the entropy rate estimators with increasing length of the data sequence, and the associated error, in Figures 2b,c. 
If the true depth of the model is no larger than the depth k considered in the estimators, all the estimators considered should converge. We see in Figure 2c that the HDP-based entropy rate estimates converge quickly with increasing data, relative to the other estimators.\n\nThe motivation of the hierarchical prior was to allow observations of transitions from shorter contexts to inform estimates of transitions from longer contexts. This, it was hoped, would mitigate the drop-off with larger block size seen in block-entropy based entropy rate estimators. Figure 2d indicates that for simulated data this is indeed the case, although we do see some bias in the fully Bayesian entropy rate estimator for large block lengths. The empirical Bayes and fully Bayesian entropy rate estimators with HDP priors produce estimates that are close to the true entropy rate across a wider range of block sizes.\n\n7.2 Neural Data\n\nWe applied the same analysis to neural spike train data collected from primate retinal ganglion cells stimulated with binary full-field movies refreshed at 100 Hz [8]. In this case, the true transition probabilities are unknown (and indeed the process may not be exactly Markovian). However, we calculate the plugin transition probabilities from a longer data sequence (167,000 bins) so that the estimates are approximately converged (black trace in Figure 3a), and note that transition probabilities from contexts with the same most-recent context symbols do appear to be similar. 
Thus the estimated transition probabilities reflect the idea that more distant context cues matter less, and the smoothing of the HDP prior appears to be appropriate for this neural data.\n\nFigure 3: Comparison of estimated (a) transition probability and (b,c,d) entropy rate for neural data. The ‘converged’ estimates are calculated from 700s of data with 4ms bins (167,000 symbols). In (a) and (d), training data sequences are 500 symbols (2s) long. The block-based and HDP estimators in (b) and (c) use block size k = 8. In (b,c,d), results were averaged over 5 data sequences sampled randomly from the full dataset.\n\nThe true entropy rate is also unknown, but again we estimate it using the plugin estimator on a large data set. We again note the relatively fast convergence of hHDP and hempHDP in Figures 3b,c, and the long plateau of the estimators in Figure 3d, indicating the relative stability of the HDP entropy rate estimators with respect to the choice of model depth.\n\n8 Discussion\n\nWe have presented two estimators of the entropy rate of a spike train or arbitrary binary sequence. The true entropy rate of a stochastic process involves consideration of infinitely long time dependencies. 
To make entropy rate estimation tractable, one can try to fix a maximum depth of time dependencies to be considered, but it is difficult to choose an appropriate depth that is large enough to take into account long time dependencies and small enough, relative to the data at hand, to avoid a severe downward bias of the estimate. We have approached this problem by modeling the data as a Markov chain and estimating transition probabilities using a hierarchical prior that links transition probabilities from longer contexts to transition probabilities from shorter contexts. This allowed us to choose a large depth even in the presence of limited data, since the structure of the prior allowed observations of transitions from shorter contexts (of which we have many instances in the data) to inform estimates of transitions from longer contexts (of which we may have only a few instances). We presented both a fully Bayesian estimator, which allows for Bayesian confidence intervals, and an empirical Bayesian estimator, which provides computational efficiency. Both estimators show excellent performance on simulated and neural data in terms of their robustness to the choice of model depth, their accuracy on short data sequences, and their convergence with increasing data. Both methods of entropy rate estimation also yield estimates of the transition probabilities when the data is modeled as a Markov chain, parameters which may be of interest in their own right as descriptions of the statistical structure and time dependencies in a spike train. Our results indicate that tools from modern Bayesian nonparametric statistics hold great promise for revealing the structure of neural spike trains despite the challenges of limited data.\n\nAcknowledgments\n\nWe thank V. J. Uzzell and E. J. Chichilnisky for retinal data. 
This work was supported by a Sloan Research Fellowship, a McKnight Scholar’s Award, and NSF CAREER Award IIS-1150186.\n\nReferences\n\n[1] Matthew B. Kennel, Jonathon Shlens, Henry D. I. Abarbanel, and E. J. Chichilnisky. Estimating entropy rates with Bayesian confidence intervals. Neural Computation, 17(7):1531–1576, 2005.\n\n[2] Abraham Lempel and Jacob Ziv. On the complexity of finite sequences. IEEE Transactions on Information Theory, 22(1):75–81, 1976.\n\n[3] Ilya Nemenman, Fariel Shafee, and William Bialek. Entropy and inference, revisited. arXiv preprint physics/0108025, 2001.\n\n[4] George Armitage Miller and William Gregory Madow. On the Maximum Likelihood Estimate of the Shannon-Wiener Measure of Information. Operational Applications Laboratory, Air Force Cambridge Research Center, Air Research and Development Command, Bolling Air Force Base, 1954.\n\n[5] Yee Whye Teh, Michael I. Jordan, Matthew J. Beal, and David M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476), 2006.\n\n[6] Yee Whye Teh. A hierarchical Bayesian language model based on Pitman-Yor processes. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, pages 985–992. 
Association for Computational Linguistics, 2006.\n\n[7] Frank Wood, Cédric Archambeau, Jan Gasthaus, Lancelot James, and Yee Whye Teh. A stochastic memoizer for sequence data. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1129–1136. ACM, 2009.\n\n[8] V. J. Uzzell and E. J. Chichilnisky. Precision of spike trains in primate retinal ganglion cells. Journal of Neurophysiology, 92:780–789, 2004.\n", "award": [], "sourceid": 1037, "authors": [{"given_name": "Karin", "family_name": "Knudson", "institution": "UT Austin"}, {"given_name": "Jonathan", "family_name": "Pillow", "institution": "UT Austin"}]}