{"title": "Active learning of neural response functions with Gaussian processes", "book": "Advances in Neural Information Processing Systems", "page_first": 2043, "page_last": 2051, "abstract": "A sizable literature has focused on the problem of estimating a low-dimensional feature space capturing a neuron's stimulus sensitivity. However, comparatively little work has addressed the problem of estimating the nonlinear function from feature space to a neuron's output spike rate. Here, we use a Gaussian process (GP) prior over the infinite-dimensional space of nonlinear functions to obtain Bayesian estimates of the \"nonlinearity\" in the linear-nonlinear-Poisson (LNP) encoding model. This offers flexibility, robustness, and computational tractability compared to traditional methods (e.g., parametric forms, histograms, cubic splines). Most importantly, we develop a framework for optimal experimental design based on uncertainty sampling. This involves adaptively selecting stimuli to characterize the nonlinearity with as little experimental data as possible, and relies on a method for rapidly updating hyperparameters using the Laplace approximation. We apply these methods to data from color-tuned neurons in macaque V1. We estimate nonlinearities in the 3D space of cone contrasts, which reveal that V1 combines cone inputs in a highly nonlinear manner. With simulated experiments, we show that optimal design substantially reduces the amount of data required to estimate this nonlinear combination rule.", "full_text": "Active learning of neural response functions\n\nwith Gaussian processes\n\nMijung Park\n\nElectrical and Computer Engineering\n\nThe University of Texas at Austin\nmjpark@mail.utexas.edu\n\nGreg Horwitz\n\nDepartments of Physiology and Biophysics\n\nThe University of Washington\n\nghorwitz@uw.edu\n\nJonathan W. 
Pillow\n\nDepartments of Psychology and Neurobiology\n\nThe University of Texas at Austin\npillow@mail.utexas.edu\n\nAbstract\n\nA sizeable literature has focused on the problem of estimating a low-dimensional\nfeature space for a neuron\u2019s stimulus sensitivity. However, comparatively little\nwork has addressed the problem of estimating the nonlinear function from feature\nspace to spike rate. Here, we use a Gaussian process (GP) prior over the in\ufb01nite-\ndimensional space of nonlinear functions to obtain Bayesian estimates of the \u201cnon-\nlinearity\u201d in the linear-nonlinear-Poisson (LNP) encoding model. This approach\noffers increased \ufb02exibility, robustness, and computational tractability compared\nto traditional methods (e.g., parametric forms, histograms, cubic splines). We\nthen develop a framework for optimal experimental design under the GP-Poisson\nmodel using uncertainty sampling. This involves adaptively selecting stimuli ac-\ncording to an information-theoretic criterion, with the goal of characterizing the\nnonlinearity with as little experimental data as possible. Our framework relies on\na method for rapidly updating hyperparameters under a Gaussian approximation\nto the posterior. We apply these methods to neural data from a color-tuned sim-\nple cell in macaque V1, characterizing its nonlinear response function in the 3D\nspace of cone contrasts. We \ufb01nd that it combines cone inputs in a highly nonlinear\nmanner. 
With simulated experiments, we show that optimal design substantially\nreduces the amount of data required to estimate these nonlinear combination rules.\n\n1\n\nIntroduction\n\nOne of the central problems in systems neuroscience is to understand how neural spike responses\nconvey information about environmental stimuli, which is often called the neural coding problem.\nOne approach to this problem is to build an explicit encoding model of the stimulus-conditional\nresponse distribution p(r|x), where r is a (scalar) spike count elicited in response to a (vector) stim-\nulus x. The popular linear-nonlinear-Poisson (LNP) model characterizes this encoding relationship\nin terms of a cascade of stages: (1) linear dimensionality reduction using a bank of \ufb01lters or receptive\n\ufb01elds; (2) a nonlinear function from \ufb01lter outputs to spike rate; and (3) an inhomogeneous Poisson\nspiking process [1].\nWhile a sizable literature [2\u201310] has addressed the problem of estimating the linear front end to this\nmodel, the nonlinear stage has received comparatively less attention. Most prior work has focused\non: simple parametric forms [6, 9, 11]; non-parametric methods that do not scale easily to high\n\n1\n\n\fFigure 1: Encoding model schematic. The nonlinear function f converts an input vector x to a\nscalar, which g then transforms to a non-negative spike rate \u03bb = g(f (x)). 
The spike response r is a\nPoisson random variable with mean λ.\n\ndimensions (e.g., histograms, splines) [7, 12]; or nonlinearities defined by a sum or product of 1D\nnonlinear functions [10, 13].\nIn this paper, we use a Gaussian process (GP) to provide a flexible, computationally tractable model\nof the multi-dimensional neural response nonlinearity f(x), where x is a vector in feature space.\nIntuitively, a GP defines a probability distribution over the infinite-dimensional space of functions\nby specifying a Gaussian distribution over its finite-dimensional marginals (i.e., the probability over\nthe function values at any finite collection of points), with hyperparameters that control the\nfunction's variability and smoothness [14]. Although exact inference under a model with GP prior and\nPoisson observations is analytically intractable, a variety of approximate and sampling-based\ninference methods have been developed [15, 16]. Our work builds on a substantial literature in\nneuroscience that has used GP-based models to decode spike trains [17–19], estimate spatial receptive\nfields [20, 21], infer continuous spike rates from spike trains [22–24], infer common inputs [25], and\nextract low-dimensional latent variables from multi-neuron spiking activity [26, 27].\nWe focus on data from trial-based experiments where stimulus-response pairs (x, r) are sparse in the\nspace of possible stimuli. We use a fixed inverse link function g to transform f(x) to a non-negative\nspike rate, which ensures the posterior over f is log-concave [6, 20]. This log-concavity justifies a\nGaussian approximation to the posterior, which we use to perform rapid empirical Bayes estimation\nof hyperparameters [5, 28]. Our main contribution is an algorithm for optimal experimental design,\nwhich allows f to be characterized quickly and accurately from limited data [29, 30]. 
The method\nrelies on uncertainty sampling [31], which involves selecting the stimulus x for which g(f(x)) is\nmaximally uncertain given the data collected in the experiment so far. We apply our methods to\nthe nonlinear color-tuning properties of macaque V1 neurons. We show that the GP-Poisson model\nprovides a flexible, tractable model for these responses, and that optimal design can substantially\nreduce the number of stimuli required to characterize them.\n\n2 GP-Poisson neural encoding model\n\n2.1 Encoding model (likelihood)\n\nWe begin by defining a probabilistic encoding model for the neural response. Let ri be an observed\nneural response (the spike count in some time interval T) on the i'th trial given the input stimulus\nxi. Here, we will assume that x is a D-dimensional vector in the moderately low-dimensional neural\nfeature space to which the neuron is sensitive, the output of the “L” stage in the LNP model.\nUnder the encoding model (Fig. 1), an input vector xi passes through a nonlinear function f, whose\nreal-valued output is transformed to a positive spike rate through a (fixed) function g. The spike\nresponse is a Poisson random variable with mean g(f(x)), so the conditional probability of a\nstimulus-response pair is Poisson:\n\np(ri | xi, f) = (1/ri!) λi^ri e^(−λi),    λi = g(f(xi)).    (1)\n\nFor a complete dataset, the log-likelihood is:\n\nL(f) = log p(r | X, f) = rᵀ log(g(f)) − 1ᵀ g(f) + const,    (2)\n\n2\n\n\fwhere r = (r1, . . . , rN)ᵀ is a vector of spike responses, 1 is a vector of ones, and f =\n(f(x1), . . . , f(xN))ᵀ is shorthand for the vector defined by evaluating f at the points in X =\n{x1, . . . , xN}. 
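For concreteness, the likelihood computation in (eq. 2) can be sketched in a few lines of numpy (an illustrative sketch, not the code used in the paper), using the softplus inverse link g(f) = log(1 + exp(f)) that we fix below:

```python
import numpy as np

def softplus(f):
    """Inverse link g(f) = log(1 + exp(f)), computed stably via logaddexp."""
    return np.logaddexp(0.0, f)

def log_likelihood(r, f):
    """L(f) = r^T log(g(f)) - 1^T g(f) + const (eq. 2); the log(r_i!) terms
    are constant in f and therefore dropped."""
    lam = softplus(f)
    return r @ np.log(lam) - lam.sum()
```

Because g is convex and log-concave, this function is concave in f, which is what licenses the numerical MAP optimization used throughout.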
Note that although f is an infinite-dimensional object in the space of functions, the\nlikelihood only depends on the value of f at the points in X.\nIn this paper, we fix the inverse-link function to g(f) = log(1 + exp(f)), which has the nice\nproperty that it grows linearly for large f and decays gracefully to zero for negative f. This allows\nus to place a Gaussian prior on f without allocating probability mass to negative spike rates, and\nobviates the need for constrained optimization of f (but see [22] for a highly efficient solution). Most\nimportantly, for any g that is simultaneously convex and log-concave¹, the log-likelihood L(f) is\nconcave in f, meaning it is free of non-global local extrema [6, 20]. Combining L with a log-concave\nprior (as we do in the next section) ensures the log-posterior is also concave.\n\n2.2 Gaussian Process prior\n\nGaussian processes (GPs) allow us to define a probability distribution over the infinite-dimensional\nspace of functions by specifying a Gaussian distribution over a function's finite-dimensional\nmarginals (i.e., the probability over the function values at any finite collection of points). The\nhyperparameters defining this prior are a mean µf and a kernel function k(xi, xj) that specifies the\ncovariance between function values f(xi) and f(xj) for any pair of input points xi and xj. Thus,\nthe GP prior over the function values f is given by\n\np(f) = N(f | µf 1, K) = |2πK|^(−1/2) exp(−½ (f − µf 1)ᵀ K⁻¹ (f − µf 1)),    (3)\n\nwhere K is a covariance matrix whose i, j'th entry is Kij = k(xi, xj). Generally, the kernel\ncontrols the prior smoothness of f by determining how quickly the correlation between nearby\nfunction values falls off as a function of distance. (See [14] for a general treatment.) Here, we use a\nGaussian kernel, since neural response nonlinearities are expected to be smooth in general:\n\nk(xi, xj) = ρ exp(−||xi − xj||² / (2τ)),    (4)\n\nwhere hyperparameters ρ and τ control the marginal variance and smoothness scale, respectively.\nThe GP therefore has three total hyperparameters, θ = {µf, ρ, τ}, which set the prior mean and\ncovariance matrix over f for any collection of points in X.\n\n2.3 MAP inference for f\n\nThe maximum a posteriori (MAP) estimate can be obtained by numerically maximizing the posterior\nfor f. From Bayes' rule, the log-posterior is simply the sum of the log-likelihood (eq. 2) and log-prior\n(eq. 3) plus a constant:\n\nlog p(f | r, X, θ) = rᵀ log(g(f)) − 1ᵀ g(f) − ½ (f − µf)ᵀ K⁻¹ (f − µf) + const.    (5)\n\nAs noted above, this posterior has a unique maximum fmap so long as g is convex and log-concave.\nHowever, the solution vector fmap defined this way contains only the function values at the points\nin the training set X. How do we find the MAP estimate of f at other points not in our training set?\nThe GP prior provides a simple analytic formula for the maximum of the joint marginal containing\nthe training data and any new point f∗ = f(x∗), for a new stimulus x∗. We have\n\np(f∗, f | x∗, r, X, θ) = p(f∗ | f, θ) p(f | r, X, θ) = N(f∗ | µ∗, σ∗²) p(f | r, X, θ),    (6)\n\nwhere, from the GP prior, µ∗ = µf + k∗ᵀ K⁻¹ (f − µf) and σ∗² = k(x∗, x∗) − k∗ᵀ K⁻¹ k∗ are\nthe (f-dependent) mean and variance of f∗, and row vector k∗ = (k(x1, x∗), . . . , k(xN, x∗)). 
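The kernel (eq. 4) and the conditional moments (µ∗, σ∗²) just defined can be sketched as follows (an illustrative numpy sketch, not the authors' implementation; the jitter term is an added numerical safeguard for the ill-conditioning discussed below):

```python
import numpy as np

def gauss_kernel(A, B, rho, tau):
    """k(x_i, x_j) = rho * exp(-||x_i - x_j||^2 / (2 tau))  (eq. 4)."""
    sqdist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return rho * np.exp(-sqdist / (2.0 * tau))

def gp_conditional(x_star, X, f, mu_f, rho, tau, jitter=1e-9):
    """mu_* = mu_f + k_*^T K^{-1}(f - mu_f) and
    sigma_*^2 = k(x_*, x_*) - k_*^T K^{-1} k_* for f(x_*) given f at X."""
    K = gauss_kernel(X, X, rho, tau) + jitter * np.eye(len(X))
    k_star = gauss_kernel(x_star[None, :], X, rho, tau)[0]
    Kinv_ks = np.linalg.solve(K, k_star)
    mu_star = mu_f + Kinv_ks @ (f - mu_f)
    var_star = rho - k_star @ Kinv_ks   # k(x*, x*) = rho at zero distance
    return mu_star, var_star
```

At a training point the conditional mean reproduces the stored function value and the variance collapses toward zero, as expected of a noiseless GP interpolant.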
This\nfactorization arises from the fact that f∗ is conditionally independent of the data given the value\nof the function at X. Clearly, this posterior marginal (eq. 6) is maximized when f∗ = µ∗ and\nf = fmap.² Thus, for any collection of novel points X∗, the MAP estimate for f(X∗) is given by\nthe mean of the conditional distribution over f∗ given fmap:\n\np(f(X∗) | X∗, fmap, θ) = N(µf + K∗ K⁻¹ (fmap − µf), K∗∗ − K∗ K⁻¹ K∗ᵀ),    (7)\n\nwhere K∗il = k(x∗i, xl) and K∗∗ij = k(x∗i, x∗j).\n\n¹Such functions must grow monotonically at least linearly and at most exponentially [6]. Examples include\nthe exponential, half-rectified linear, and log(1 + exp(f))^p for p ≥ 1.\n²Note that this is not necessarily identical to the marginal MAP estimate of f∗ | x∗, r, X, θ, which requires\nmaximizing (eq. 6) integrated with respect to f.\n\n3\n\n\fIn practice, the prior covariance matrix K is often ill-conditioned when datapoints in X are closely\nspaced and the smoothing hyperparameter τ is large, making it impossible to numerically compute\nK⁻¹. When the number of points is not too large (N < 1000), we can address this by performing a\nsingular value decomposition (SVD) of K and keeping only the singular vectors with singular value\nabove some threshold. This results in a lower-dimensional numerical optimization problem, since\nwe only have to search the space spanned by the singular vectors of K. We discuss strategies for\nscaling to larger datasets in the Discussion.\n\n2.4 Efficient evidence optimization for θ\n\nThe hyperparameters θ = {µf, ρ, τ} that control the GP prior have a major influence on the shape\nof the inferred nonlinearity, particularly in high dimensions and when data is scarce. 
A theoretically\nattractive and computationally efficient approach for setting θ is to maximize the evidence p(r | θ, X),\nalso known as the marginal likelihood, a general approach known as empirical Bayes [5, 14, 28, 32].\nHere we describe a method for rapid evidence maximization that we will exploit to design an active\nlearning algorithm in Section 3.\nThe evidence can be computed by integrating the product of the likelihood and prior with respect to\nf, but can also be obtained by solving for the (often neglected) denominator term in Bayes' rule:\n\np(r | θ) = ∫ p(r | f) p(f | θ) df = p(r | f) p(f | θ) / p(f | r, θ),    (8)\n\nwhere we have dropped conditioning on X for notational convenience. For the GP-Poisson model\nhere, this integral is not tractable analytically, but we can approximate it as follows. We begin with\na well-known Gaussian approximation to the posterior known as the Laplace approximation, which\ncomes from a 2nd-order Taylor expansion of the log-posterior around its maximum [28]:\n\np(f | r, θ) ≈ N(f | fmap, Λ),    Λ⁻¹ = H + K⁻¹,    (9)\n\nwhere H = −∂²L(f)/∂f² is the Hessian (second derivative matrix) of the negative log-likelihood (eq. 2),\nevaluated at fmap, and K⁻¹ is the inverse prior covariance (eq. 3). This approximation is reasonable\ngiven that the posterior is guaranteed to be unimodal and log-concave. Plugging it into the\ndenominator in (eq. 8) gives us a formula for evaluating the approximate evidence,\n\np(r | θ) ≈ exp(L(f)) N(f | µf, K) / N(f | fmap, Λ),    (10)\n\nwhich we evaluate at f = fmap, since the Laplace approximation is most accurate there [20, 33].\nThe hyperparameters θ directly affect the prior mean and covariance (µf, K), as well as the\nposterior mean and covariance (fmap, Λ), all of which are essential for evaluating the evidence. 
Finding\nfmap and Λ given θ requires numerical optimization of log p(f | r, θ), which is computationally\nexpensive to perform for each search step in θ. To overcome this difficulty, we decompose the posterior\nmoments (fmap, Λ) into terms that depend on θ and terms that do not, via a Gaussian approximation\nto the likelihood. The logic here is that a Gaussian posterior and prior imply a likelihood function\nproportional to a Gaussian, which in turn allows prior and posterior moments to be computed\nanalytically for each θ. This trick is similar to that of the EP algorithm [34]: we divide a Gaussian\ncomponent out of the Gaussian posterior and approximate the remainder as Gaussian. The resulting\nmoments are H = Λ⁻¹ − K⁻¹ for the likelihood inverse-covariance (which is the Hessian of the\nnegative log-likelihood from eq. 9), and m = H⁻¹(Λ⁻¹ fmap − K⁻¹ µf) for the likelihood mean, which\ncomes from the standard formula for the product of two Gaussians.\nOur algorithm for evidence optimization proceeds as follows: (1) given the current hyperparameters\nθᵢ, numerically maximize the posterior and form the Laplace approximation N(fmapᵢ, Λᵢ); (2)\ncompute the Gaussian “potential” N(mᵢ, Hᵢ) underlying the likelihood, given the current values of\n(fmapᵢ, Λᵢ, θᵢ), as described above; (3) find θᵢ₊₁ by maximizing the log-evidence, which is:\n\nE(θ) = rᵀ log(g(fmap)) − 1ᵀ g(fmap) − ½ log |K Hᵢ + I| − ½ (fmap − µf)ᵀ K⁻¹ (fmap − µf),    (11)\n\nwhere fmap and Λ are updated using Hᵢ and mᵢ obtained in step (2), i.e., fmap = Λ(Hᵢ mᵢ +\nK⁻¹ µf) and Λ = (Hᵢ + K⁻¹)⁻¹. 
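The decomposition above can be sketched as follows (an illustrative numpy sketch of the update, not the authors' code): given (fmap, Λ) at the current θ, we extract the θ-independent potential (m, H), and for any candidate prior (µf, K) we recover the new posterior moments in closed form:

```python
import numpy as np

def likelihood_potential(fmap, Lam, K, mu_f):
    """Divide the prior out of the Laplace posterior:
    H = Lam^{-1} - K^{-1},  m = H^{-1}(Lam^{-1} fmap - K^{-1} mu_f)."""
    Lam_inv = np.linalg.inv(Lam)
    K_inv = np.linalg.inv(K)
    H = Lam_inv - K_inv
    m = np.linalg.solve(H, Lam_inv @ fmap - K_inv @ (mu_f * np.ones_like(fmap)))
    return m, H

def posterior_moments(m, H, K, mu_f):
    """Recover fmap = Lam (H m + K^{-1} mu_f) and Lam = (H + K^{-1})^{-1}
    for a candidate prior (mu_f, K), with no numerical optimization."""
    K_inv = np.linalg.inv(K)
    Lam = np.linalg.inv(H + K_inv)
    fmap = Lam @ (H @ m + K_inv @ (mu_f * np.ones_like(m)))
    return fmap, Lam
```

The two functions are exact inverses of one another when the prior is held fixed, which is what makes the evidence search over θ cheap.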
Note that this significantly expedites evidence optimization,\nsince we do not have to numerically optimize fmap for each θ.\n\n4\n\n\fFigure 2: Comparison of random and optimal design in a simulated experiment with a 1D\nnonlinearity. The true nonlinear response function g(f(x)) is in gray, the posterior mean is in black solid, the\n95% confidence interval is in black dotted, and stimuli are in blue dots. A (top): Random design: responses\nwere measured with 20 (left) and 100 (right) additional stimuli, with stimuli sampled uniformly over\nthe interval shown on the x axis. A (bottom): Optimal design: responses were measured with the same\nnumbers of additional stimuli selected by uncertainty sampling (see text). B: Mean square error as\na function of the number of stimulus-response pairs. The optimal design achieved half the error rate\nof the random design experiment.\n\n3 Optimal design: uncertainty sampling\n\nSo far, we have introduced an efficient algorithm for estimating the nonlinearity f and\nhyperparameters θ for an LNP encoding model under a GP prior. Here we introduce a method for adaptively\nselecting stimuli during an experiment (often referred to as active learning or optimal experimental\ndesign) to minimize the amount of data required to estimate f [29]. The basic idea is that we\nshould select stimuli that maximize the expected information gained about the model parameters.\nThis information gain of course depends on the posterior distribution over the parameters given the\ndata collected so far. Uncertainty sampling [31] is an algorithm that is appropriate when the model\nparameters and stimulus space are in a 1-1 correspondence. 
It involves selecting the stimulus x\nfor which the posterior over the parameter f(x) has highest entropy, which in the case of a Gaussian\nposterior corresponds to the highest posterior variance.\nHere we alter the algorithm slightly to select stimuli for which we are most uncertain about the spike\nrate g(f(x)), not (as stated above) the stimuli where we are most uncertain about our underlying\nfunction f(x). The rationale for this approach is that we are generally more interested in the\nneuron's spike rate as a function of the stimulus (which involves the inverse link function g) than in\nthe parameters we have used to define that function. Moreover, for any link function that maps R to\nthe positive reals R+, as required for Poisson models, we will have unavoidable uncertainty about\nnegative values of f, which will not be overcome by sampling small (integer) spike-count responses.\nOur strategy therefore focuses on uncertainty in the expected spike rate rather than uncertainty in f.\nOur method proceeds as follows. Given the data observed up to a certain time in the experiment,\nwe define a grid of (evenly-spaced) points {x∗j} as candidate next stimuli. For each point, we\ncompute the posterior uncertainty γj about the spike rate g(f(x∗j)) using the delta method, i.e.,\nγj = g′(f(x∗j)) σj, where σj is the posterior standard deviation (square root of the posterior variance)\nat f(x∗j) and g′ is the derivative of g with respect to its argument. The stimulus for trial\nt + 1, given all data observed up to time t, is selected randomly from the set:\n\nxt+1 ∈ {x∗j | γj ≥ γi ∀i},    (12)\n\nthat is, the set of all stimuli for which the uncertainty γ is maximal. To find {σj} at each candidate point,\nwe must first update θ and fmap. 
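The selection rule (eq. 12) reduces to a few lines (an illustrative numpy sketch, assuming the posterior means and standard deviations at the candidate grid have already been computed):

```python
import numpy as np

def g_prime(f):
    """Derivative of the softplus link g(f) = log(1 + exp(f)):
    g'(f) is the logistic function."""
    return 1.0 / (1.0 + np.exp(-np.asarray(f)))

def select_next_stimulus(f_mean, f_sd, rng=None):
    """Delta-method uncertainty gamma_j = g'(f(x_j*)) sigma_j (eq. 12);
    the next stimulus index is drawn at random from the argmax set."""
    gamma = g_prime(f_mean) * np.asarray(f_sd)
    ties = np.flatnonzero(np.isclose(gamma, gamma.max()))
    rng = np.random.default_rng() if rng is None else rng
    return int(rng.choice(ties))
```

Note how g′ weights the posterior spread: candidates whose posterior mass sits at large negative f (where the rate is pinned near zero) are discounted, exactly the behavior motivated above.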
After each trial, we update fmap by numerically optimizing the\nposterior, then update the hyperparameters using (eq. 11), and then numerically re-compute fmap\nand Λ given the new θ. The method is summarized in Algorithm 1, and runtimes are shown in Fig. 5.\n\n5\n\n\fAlgorithm 1 Optimal design for nonlinearity estimation under a GP-Poisson model\n\n1. given the current data Dt = {x1, ..., xt, r1, ..., rt}, the posterior mode fmapt, and\nhyperparameters θt, compute the posterior mean and standard deviation (f∗map, σ∗) at a grid of\ncandidate stimulus locations {x∗}.\n\n2. select the element of {x∗} for which γ∗ = g′(f∗map) σ∗ is maximal\n3. present the selected xt+1 and record the neural response rt+1\n4. find fmapt+1 | Dt+1, θt; update θt+1 by maximizing the evidence; find fmapt+1 | Dt+1, θt+1\n\n4 Simulations\n\nWe tested our method in simulation using a 1-dimensional feature space, where it is easy to visualize\nthe nonlinearity and the uncertainty of our estimates (Fig. 2). The stimulus space was taken to be\nthe range [0, 100], the true f was a sinusoid, and spike responses were simulated as Poisson with\nrate g(f(x)). We compared the estimate of g(f(x)) obtained using optimal design to the estimate\nobtained with “random sampling”, stimuli drawn uniformly from the stimulus range.\nFig. 2 shows the estimates of g(f(x)) after 20 and 100 trials using each method, along with the\nmarginal posterior standard deviation, which provides a ±2 SD Bayesian confidence interval for the\nestimate. 
The optimal design method effectively decreased the high variance in the middle (near 50)\nbecause it drew more samples where uncertainty about the spike rate was higher (due to the fact that\nvariance increases with mean for Poisson neurons). The estimates using random sampling (A, top)\nwere less accurate because random sampling drew more points in the tails, where the variance was\noriginally lower than in the center. We also examined the errors in each method as a function of the\nnumber of data points. We drew each number of data points 100 times and computed the average error\nbetween the estimate and the true g(f(x)). As shown in (B), uncertainty sampling achieved roughly half\nthe error rate of random sampling after 20 datapoints.\n\n5 Experiments\n\nFigure 3: Raw experimental data: stimuli in 3D cone-contrast\nspace (above) and recorded spike counts (below)\nduring the first 60 experimental trials. Several (3-6)\nstimulus staircases along different directions in color space\nwere randomly interleaved to avoid the effects of adaptation;\na color direction is defined as the relative proportions\nof L, M, and S cone contrasts, with [0 0 0] corresponding\nto a neutral gray (zero-contrast) stimulus. In\neach color direction, contrast was actively titrated with\nthe aim of evoking a response of 29 spikes/sec. This\nsampling procedure permitted a broad survey of the stimulus\nspace, with the objective that many stimuli evoked\na statistically reliable but non-saturating response. In all,\n677 stimuli in 65 color directions were presented for this\nneuron.\n\nWe recorded from a V1 neuron in an awake, fixating rhesus monkey while Gabor patterns with varying\ncolor and contrast were presented at the receptive field. The orientation and spatial frequency of the\nGabor were fixed at the neuron's preferred values, and the Gabor drifted at 3 Hz for 667 ms per presentation. 
Contrast\nwas varied using multiple interleaved staircases along different axes in color space, and spikes were\ncounted during a 557 ms window beginning 100 ms after the stimulus appeared. The staircase design\nwas used because the experiments were carried out prior to formulating the optimal design methods\ndescribed in this paper. However, we will analyze them here for a “simulated optimal design\nexperiment”, where we choose stimuli sequentially from the list of stimuli that were actually presented\nduring the experiment, in an order determined by our information-theoretic criterion. See the Fig. 3\ncaption for more details of the experimental recording.\n\n6\n\n\fFigure 4: One- and two-dimensional conditional “slices” through the 3D nonlinearity of a V1 simple\ncell in cone contrast space. A: 1D conditionals showing spike rate as a function of L, M, and S\ncone contrast, respectively, with the other cone contrasts fixed to zero. Traces show the posterior mean\nand ±2 SD credible interval given all datapoints (solid and dotted gray), and the posterior mean\ngiven only 150 data points selected randomly (black) or by optimal design (red), carried out by\ndrawing a subset of the data points actually collected during the experiment. Note that even with\nonly 1/4 of the data, the optimal design estimate is nearly identical to the estimate obtained from all 677\ndatapoints. B: 2D conditionals on M and L (first row), S and L (second row), and M and S (third row)\ncones, respectively, with the other cone contrast set to zero. 
2D conditionals using optimal design\nsampling (middle column) with 150 data points are much closer to the 2D conditionals using all data\n(right column) than those from a random sub-sampling of 150 points (left column).\n\nWe \ufb01rst used the entire dataset (677 stimulus-response pairs) to \ufb01nd the posterior maximum fmap,\nwith hyperparameters set by maximizing evidence (sequential optimization of fmap and \u03b8 (eq. 11)\nuntil convergence). Fig. 4 shows 1D and 2D conditional slices through the estimated 3D nonlinearity\ng(f (x)), with contour plots constructed using the MAP estimate of f on a \ufb01ne grid of points. The\ncontours for a neuron with linear summation of cone contrasts followed by an output nonlinearity\n(i.e., as assumed by the standard model of V1 simple cells) would consist of straight lines. The\ncurvature observed in contour plots (Fig. 4B) indicates that cone contrasts are summed together in a\nhighly nonlinear fashion, especially for L and M cones (top).\nWe then performed a simulated optimal design experiment by selecting from the 677 stimulus-\nresponse pairs collected during the experiment, and re-ordering them greedily according to the\nuncertainty sampling algorithm described above. We compared the estimate obtained using only\n1/4 of the data (150 points) with an estimate obtained if we had randomly sub-sampled 150 data\npoints from the dataset (Fig. 4). Using only 150 data points, the conditionals of the estimate using\nuncertainty sampling were almost identical to those using all data (677 points).\nAlthough our software implementation of the optimal design method was crude (using Matlab\u2019s\nfminunc twice to \ufb01nd fmap and fmincon once to optimize the hyperparameters during each\ninter-trial interval), the speed was more than adequate for the experimental data collected (Fig. 5,\nA) using a machine with an Intel 3.33GHz XEON processor. 
The largest bottleneck by far was\ncomputing the eigendecomposition of K for each search step for θ. We will briefly discuss how to\nimprove the speed of our algorithm in the Discussion.\nLastly, we added a recursive filter h to the model (Fig. 1) to incorporate the effects of spike history\non the neuron's response, allowing us to account for the possible effects of adaptation on the spike\ncounts obtained. We computed the maximum a posteriori (MAP) estimate for h under a temporal\n\n7\n\n\fFigure 5: Comparison of run time and error of the optimal design method using simulated experiments\nby resampling experimental data. A: The run time for uncertainty sampling (including the posterior\nupdate and the evidence optimization) as a function of the number of data points observed. (The grid\nof “candidate” stimuli {x∗} was the subset of stimuli in the experimental dataset not yet selected,\nbut the speed was not noticeably affected by scaling to much larger sets of candidate stimuli.) The\nblack dotted line shows the mean intertrial interval of 677 ms. B: The mean squared error between\nthe estimate obtained using each sampling method and that obtained using the full dataset. Note\nthat the error of uncertainty sampling with 150 points is even lower than that from random sampling\nwith 300 data points. C: Estimated response-history filter h, which describes how recent spiking\ninfluences the neuron's spike rate. 
This neuron shows a self-excitatory influence on the time scale of\n25 s, with self-suppression on a longer scale of approximately 1 minute.\n\nsmoothing prior (Fig. 5). It shows that the neuron's response has a mild dependence on its recent\nspike history, with a self-exciting effect of spikes within the last 25 s. We evaluated the performance\nof the augmented model by holding out a random 10% of the data for cross-validation. Prediction\nperformance on test data was more accurate by an average of 0.2 spikes per trial in predicted spike\ncount, a 4 percent reduction in cross-validation error compared to the original model.\n\n6 Discussion\n\nWe have developed an algorithm for optimal experimental design, which allows the nonlinearity in\na cascade neural encoding model to be characterized quickly and accurately from limited data. The\nmethod relies on a fast technique for updating the hyperparameters using a Gaussian factorization of\nthe Laplace approximation to the posterior, which removes the need to numerically recompute the\nMAP estimate as we optimize the hyperparameters. We described a method for optimal experimental\ndesign, based on uncertainty sampling, to reduce the number of stimuli required to estimate such\nresponse functions. We applied our method to the nonlinear color-tuning properties of macaque\nV1 neurons and showed that the GP-Poisson model provides a flexible, tractable model for these\nresponses, and that optimal design can substantially reduce the number of stimuli required to\ncharacterize them. One additional virtue of the GP-Poisson model is that conditionals and marginals\nof the high-dimensional nonlinearity are straightforward to compute, making it easy to visualize their\nlower-dimensional slices and projections (as we have done in Fig. 4). We added a history term to the LNP\nmodel in order to incorporate the effects of recent spike history on the spike rate (Fig. 5), which\nprovided a very slight improvement in prediction accuracy. 
We expect the ability to incorporate dependencies on spike history to be important for the success of optimal design experiments, especially with neurons that exhibit strong spike-rate adaptation [30].

One potential criticism of our approach is that uncertainty sampling in unbounded spaces is known to "run away from the data", repeatedly selecting stimuli that are far from previous measurements. We wish to point out that in neural applications the stimulus space is always bounded (e.g., by the gamut of the monitor), and in our case, stimuli at the corners of the space are actually helpful for initializing estimates of the range and smoothness of the function.

In future work, we will improve the speed of the algorithm for use in real-time neurophysiology experiments, using analytic first and second derivatives for evidence optimization and exploring approximate methods for sparse GP inference [35]. We will also examine kernel functions with a more tractable matrix inverse [20], and test other information-theoretic data selection criteria for response function estimation [36].

References

[1] E. P. Simoncelli, J. W. Pillow, L. Paninski, and O. Schwartz. The Cognitive Neurosciences, III, chapter 23, pages 327-338. MIT Press, Cambridge, MA, October 2004.
[2] R. R. de Ruyter van Steveninck and W. Bialek. Proc. R. Soc. Lond. B, 234:379-414, 1988.
[3] E. J. Chichilnisky. Network: Computation in Neural Systems, 12:199-213, 2001.
[4] F. Theunissen, S. David, N. Singh, A. Hsu, W. Vinje, and J. Gallant. Network: Computation in Neural Systems, 12:289-316, 2001.
[5] M. Sahani and J. Linden. NIPS, 15, 2003.
[6] L. Paninski.
Network: Computation in Neural Systems, 15:243-262, 2004.
[7] Tatyana Sharpee, Nicole C. Rust, and William Bialek. Neural Comput, 16(2):223-250, Feb 2004.
[8] O. Schwartz, J. W. Pillow, N. C. Rust, and E. P. Simoncelli. Journal of Vision, 6(4):484-507, 2006.
[9] J. W. Pillow and E. P. Simoncelli. Journal of Vision, 6(4):414-428, 2006.
[10] Misha B. Ahrens, Jennifer F. Linden, and Maneesh Sahani. J Neurosci, 28(8):1929-1942, Feb 2008.
[11] Nicole C. Rust, Odelia Schwartz, J. Anthony Movshon, and Eero P. Simoncelli. Neuron, 46(6):945-956, Jun 2005.
[12] I. DiMatteo, C. Genovese, and R. Kass. Biometrika, 88:1055-1073, 2001.
[13] S. F. Martins, L. A. Sousa, and J. C. Martins. IEEE International Conference on Image Processing (ICIP 2007), volume 3, pages III-309. IEEE, 2007.
[14] Carl Rasmussen and Chris Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
[15] Liam Paninski, Yashar Ahmadian, Daniel Gil Ferreira, Shinsuke Koyama, Kamiar Rahnama Rad, Michael Vidne, Joshua Vogelstein, and Wei Wu. J Comput Neurosci, Aug 2009.
[16] Jarno Vanhatalo, Ville Pietiläinen, and Aki Vehtari. Statistics in Medicine, 29(15):1580-1607, July 2010.
[17] E. Brown, L. Frank, D. Tang, M. Quirk, and M. Wilson. Journal of Neuroscience, 18:7411-7425, 1998.
[18] W. Wu, Y. Gao, E. Bienenstock, J. P. Donoghue, and M. J. Black. Neural Computation, 18(1):80-118, 2006.
[19] Y. Ahmadian, J. W. Pillow, and L. Paninski. Neural Comput, 23(1):46-96, Jan 2011.
[20] K. R. Rad and L. Paninski. Network: Computation in Neural Systems, 21(3-4):142-168, 2010.
[21] Jakob H. Macke, Sebastian Gerwinn, Leonard E. White, Matthias Kaschube, and Matthias Bethge. Neuroimage, 56(2):570-581, May 2011.
[22] John P. Cunningham, Krishna V. Shenoy, and Maneesh Sahani.
Proceedings of the 25th International Conference on Machine Learning (ICML '08), pages 192-199, New York, NY, USA, 2008. ACM.
[23] R. P. Adams, I. Murray, and D. J. C. MacKay. Proceedings of the 26th Annual International Conference on Machine Learning. ACM, New York, NY, USA, 2009.
[24] Todd P. Coleman and Sridevi S. Sarma. Neural Computation, 22(8):2002-2030, 2010.
[25] J. E. Kulkarni and L. Paninski. Network: Computation in Neural Systems, 18(4):375-407, 2007.
[26] A. C. Smith and E. N. Brown. Neural Computation, 15(5):965-991, 2003.
[27] B. M. Yu, J. P. Cunningham, G. Santhanam, S. I. Ryu, K. V. Shenoy, and M. Sahani. Journal of Neurophysiology, 102(1):614, 2009.
[28] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, New York, 2006.
[29] D. MacKay. Neural Computation, 4:589-603, 1992.
[30] J. Lewi, R. Butera, and L. Paninski. Neural Computation, 21(3):619-687, 2009.
[31] David D. Lewis and William A. Gale. Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, pages 3-12. Springer-Verlag, 1994.
[32] G. Casella. American Statistician, pages 83-87, 1985.
[33] J. W. Pillow, Y. Ahmadian, and L. Paninski. Neural Comput, 23(1):1-45, Jan 2011.
[34] T. P. Minka. UAI '01: Proceedings of the 17th Conference in Uncertainty in Artificial Intelligence, pages 362-369, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc.
[35] E. Snelson and Z. Ghahramani. Advances in Neural Information Processing Systems, 18:1257, 2006.
[36] Andreas Krause, Ajit Singh, and Carlos Guestrin. J. Mach. Learn. Res., 9:235-284, June 2008.