{"title": "Bayesian active learning with localized priors for fast receptive field characterization", "book": "Advances in Neural Information Processing Systems", "page_first": 2348, "page_last": 2356, "abstract": "Active learning can substantially improve the yield of neurophysiology experiments by adaptively selecting stimuli to probe a neuron's receptive field (RF) in real time. Bayesian active learning methods maintain a posterior distribution over the RF, and select stimuli to maximally reduce posterior entropy on each time step.  However, existing methods tend to rely on simple Gaussian priors, and do not exploit uncertainty at the level of hyperparameters when determining an optimal stimulus.  This uncertainty can play a substantial role in RF characterization, particularly when RFs are smooth, sparse, or local in space and time.  In this paper, we describe a novel framework for active learning under hierarchical, conditionally Gaussian priors.  Our algorithm uses sequential Markov Chain Monte Carlo sampling (''particle filtering'' with MCMC) over hyperparameters to construct a mixture-of-Gaussians representation of the RF posterior, and selects optimal stimuli using an approximate infomax criterion.  The core elements of this algorithm are parallelizable, making it computationally efficient for real-time experiments.  We apply our algorithm to simulated and real neural data, and show that it can provide highly accurate receptive field estimates from very limited data, even with a small number of hyperparameter samples.", "full_text": "Bayesian active learning with localized priors\n\nfor fast receptive \ufb01eld characterization\n\nMijung Park\n\nElectrical and Computer Engineering\n\nThe University of Texas at Austin\nmjpark@mail.utexas.edu\n\nJonathan W. 
Pillow\n\nCenter For Perceptual Systems\nThe University of Texas at Austin\npillow@mail.utexas.edu\n\nAbstract\n\nActive learning methods can dramatically improve the yield of neurophysiology\nexperiments by adaptively selecting stimuli to probe a neuron\u2019s receptive field\n(RF). Bayesian active learning methods specify a posterior distribution over the\nRF given the data collected so far in the experiment, and select a stimulus on\neach time step that maximally reduces posterior uncertainty. However, existing\nmethods tend to employ simple Gaussian priors over the RF and do not exploit\nuncertainty at the level of hyperparameters. Incorporating this uncertainty can\nsubstantially speed up active learning, particularly when RFs are smooth, sparse,\nor local in space and time. Here we describe a novel framework for active learning\nunder hierarchical, conditionally Gaussian priors. Our algorithm uses sequential\nMarkov Chain Monte Carlo sampling (\u201cparticle filtering\u201d with MCMC) to construct\na mixture-of-Gaussians representation of the RF posterior, and selects optimal\nstimuli using an approximate infomax criterion. The core elements of this\nalgorithm are parallelizable, making it computationally efficient for real-time\nexperiments. We apply our algorithm to simulated and real neural data, and show\nthat it can provide highly accurate receptive field estimates from very limited data,\neven with a small number of hyperparameter samples.\n\n1 Introduction\n\nNeurophysiology experiments are costly and time-consuming. Data are limited by an animal\u2019s\nwillingness to perform a task (in awake experiments) and the difficulty of maintaining stable neural\nrecordings. 
This motivates the use of active learning, known in statistics as \u201coptimal experimental\ndesign\u201d, to improve experiments using adaptive stimulus selection in closed-loop experiments.\nThese methods are especially powerful for models with many parameters, where traditional methods\ntypically require large amounts of data.\nIn Bayesian active learning, the basic idea is to define a statistical model of the neural response,\nthen carry out experiments to efficiently characterize the model parameters [1\u20136]. (See Fig. 1A).\nTypically, this begins with a (weakly- or non-informative) prior distribution, which expresses our\nuncertainty about these parameters before the start of the experiment. Then, recorded data (i.e.,\nstimulus-response pairs) provide likelihood terms that we combine with the prior to obtain a posterior\ndistribution. This posterior reflects our beliefs about the parameters given the data collected so\nfar in the experiment. We then select a stimulus for the next trial that maximizes some measure of\nutility (e.g., expected reduction in entropy, mean-squared error, classification error, etc.), integrated\nwith respect to the current posterior.\nIn this paper, we focus on the problem of receptive field (RF) characterization from extracellularly\nrecorded spike train data. The receptive field is a linear filter that describes how the neuron integrates\nits input (e.g., light) over space and time; it can be equated with the linear term in a generalized linear\nmodel (GLM) of the neural response [7]. Typically, RFs are high-dimensional (with 10s to 100s of\nparameters, depending on the choice of input domain), making them an attractive target for active\nlearning methods. 
Our paper builds on prior work from Lewi et al [6], a seminal paper that describes\nactive learning for RFs under a conditionally Poisson point process model.\nHere we show that a sophisticated choice of prior distribution can lead to substantial improvements\nin active learning. Speci\ufb01cally, we develop a method for learning under a class of hierarchical,\nconditionally Gaussian priors that have been recently developed for RF estimation [8, 9]. These pri-\nors \ufb02exibly encode a preference for smooth, sparse, and/or localized structure, which are common\nfeatures of real neural RFs. In \ufb01xed datasets (\u201cpassive learning\u201d), the associated estimators give sub-\nstantial improvements over both maximum likelihood and standard lasso/ridge-regression shrinkage\nestimators, but they have not yet been incorporated into frameworks for active learning.\nActive learning with a non-Gaussian prior poses several major challenges, however, since the poste-\nrior is non-Gaussian, and requisite posterior expectations are much harder to compute. We address\nthese challenges by exploiting a conditionally Gaussian representation of the prior (and posterior)\nusing sampling at the level of the hyperparameters. We demonstrate our method using the Automatic\nLocality Determination (ALD) prior introduced in [9], where hyperparameters control the locality\nof the RF in space-time and frequency. The resulting algorithm outperforms previous active learning\nmethods on real and simulated neural data, even under various forms of model mismatch.\nThe paper is organized as follows. In Sec. 2, we formally de\ufb01ne the Bayesian active learning prob-\nlem and review the algorithm of [6], to which we will compare our results. In Sec. 3, we describe\na hierarchical response model, and in Sec. 4 describe the localized RF prior that we will employ\nfor active learning. In Sec. 5, we describe a new active learning method for conditionally Gaussian\npriors. In Sec. 
6, we show results of simulated experiments with simulated and real neural data.\n\n2 Bayesian active learning\n\nBayesian active learning (or \u201cexperimental design\u201d) provides a model-based framework for selecting\noptimal stimuli or experiments. A Bayesian active learning method has three basic ingredients:\n(1) an observation model (likelihood) p(y|x, k), specifying the conditional probability of a scalar\nresponse y given vector stimulus x and parameter vector k; (2) a prior p(k) over the parameters\nof interest; and (3) a loss or utility function U, which characterizes the desirability of a stimulus-response\npair (x, y) under the current posterior over k. The optimal stimulus x is the one that\nmaximizes the expected utility E_{y|x}[U(x, y)], meaning the utility averaged over the distribution of\n(as yet) unobserved y|x.\nOne popular choice of utility function is the mutual information between (x, y) and the parameters\nk. This is commonly known as information-theoretic or infomax learning [10]. It is equivalent to\npicking the stimulus on each trial that minimizes the expected posterior entropy.\nLet D_t = {x_i, y_i}_{i=1}^t denote the data collected up to time step t in the experiment. Under infomax\nlearning, the optimal stimulus at time step t + 1 is:\n\nx_{t+1} = arg max_x E_{y|x,D_t}[I(y, k|x, D_t)] = arg min_x E_{y|x,D_t}[H(k|x, y, D_t)],    (1)\n\nwhere H(k|x, y, D_t) = \u2212\u222b p(k|x, y, D_t) log p(k|x, y, D_t) dk denotes the posterior entropy of k,\nand p(y|x, D_t) = \u222b p(y|x, k) p(k|D_t) dk is the predictive distribution over response y given stimulus\nx and data D_t. The mutual information provided by (y, x) about k, denoted by I(y, k|x, D_t), is\nsimply the difference between the prior and posterior entropy.\n\n2.1 Method of Lewi, Butera & Paninski 2009\n\nLewi et al [6] developed a Bayesian active learning framework for RF characterization in closed-loop\nneurophysiology experiments, which we henceforth refer to as \u201cLewi-09\u201d. 
This method employs a\nconditionally Poisson generalized linear model (GLM) of the neural spike response:\n\n\u03bb_t = g(k^T x_t),    y_t \u223c Poiss(\u03bb_t),    (2)\n\nwhere g is a nonlinear function that ensures non-negative spike rate \u03bb_t.\n\nFigure 1: (A) Schematic of Bayesian active learning for neurophysiology experiments. For each\npresented stimulus x and recorded response y (upper right), we update the posterior over receptive\nfield k (bottom), then select the stimulus that maximizes expected information gain (upper left).\n(B) Graphical model for the non-hierarchical RF model used by Lewi-09. It assumes a Gaussian\nprior p(k) and Poisson likelihood p(y_t|x_t, k). (C) Graphical model for the hierarchical RF model\nused here, with a hyper-prior p_\u03b8(\u03b8) over hyper-parameters and conditionally Gaussian prior p(k|\u03b8)\nover the RF. For simplicity and speed, we assume a Gaussian likelihood for p(y_t|x_t, k), though all\nexamples in the manuscript involved real neural data or simulations from a Poisson GLM.\n\nThe Lewi-09 method assumes a Gaussian prior over k, which leads to a (non-Gaussian) posterior\ngiven by the product of Poisson likelihood and Gaussian prior. (See Fig. 1B). Neither the predictive\ndistribution p(y|x, D_t) nor the posterior entropy H(k|x, y, D_t) can be computed in closed form.\nHowever, the log-concavity of the posterior (guaranteed for suitable choice of g [11]) motivates a\ntractable and accurate Gaussian approximation to the posterior, which provides a concise analytic\nformula for posterior entropy [12, 13].\nThe key contributions of Lewi-09 include fast methods for updating the Gaussian approximation to\nthe posterior and for selecting the stimulus (subject to a maximum-power constraint) that maximizes\nexpected information gain. The Lewi-09 algorithm yields substantial improvement in characterization\nperformance relative to randomized iid (e.g., \u201cwhite noise\u201d) stimulus selection. 
Below, we will\nbenchmark the performance of our method against this algorithm.\n\n3 Hierarchical RF models\n\nHere we seek to extend the work of Lewi et al to incorporate non-Gaussian priors in a hierarchical\nreceptive field model. (See Fig. 1C). Intuitively, a good prior can improve active learning by reducing\nthe prior entropy, i.e., the effective size of the parameter space to be searched. The drawback of\nmore sophisticated priors is that they may complicate the problem of computing and optimizing the\nposterior expectations needed for active learning.\nTo focus more straightforwardly on the role of the prior distribution, we employ a simple linear-Gaussian\nmodel of the neural response:\n\ny_t = k^T x_t + \u03b5_t,    \u03b5_t \u223c N(0, \u03c3^2),    (3)\n\nwhere \u03b5_t is iid zero-mean Gaussian noise with variance \u03c3^2. We then place a hierarchical, conditionally\nGaussian prior on k:\n\nk | \u03b8 \u223c N(0, C_\u03b8),    (4)\n\u03b8 \u223c p_\u03b8,    (5)\n\nwhere C_\u03b8 is a prior covariance matrix that depends on hyperparameters \u03b8. These hyperparameters\nin turn have a hyper-prior p_\u03b8. We will specify the functional form of C_\u03b8 in the next section.\nIn this setup, the effective prior over k is a mixture-of-Gaussians, obtained by marginalizing over \u03b8:\n\np(k) = \u222b p(k|\u03b8) p_\u03b8(\u03b8) d\u03b8 = \u222b N(0, C_\u03b8) p_\u03b8(\u03b8) d\u03b8.    (6)\n\nGiven data X = (x_1, . . . , x_t)^T and Y = (y_1, . . . , y_t)^T, the posterior also takes the form of a\nmixture-of-Gaussians:\n\np(k|X, Y) = \u222b p(k|X, Y, \u03b8) p(\u03b8|X, Y) d\u03b8,    (7)\n\nwhere the conditional posterior given \u03b8 is the Gaussian\n\np(k|X, Y, \u03b8) = N(\u00b5_\u03b8, \u039b_\u03b8),    \u00b5_\u03b8 = (1/\u03c3^2) \u039b_\u03b8 X^T Y,    \u039b_\u03b8 = ((1/\u03c3^2) X^T X + C_\u03b8^{\u22121})^{\u22121},    (8)\n\nand the mixing weights are given by the marginal posterior,\n\np(\u03b8|X, Y) \u221d p(Y|X, \u03b8) p_\u03b8(\u03b8),    (9)\n\nwhich we will only need up to a constant of proportionality. The marginal likelihood or evidence\np(Y|X, \u03b8) is the marginal probability of the data given the hyperparameters, and has a closed form\nfor the linear Gaussian model:\n\np(Y|X, \u03b8) = |2\u03c0\u039b_\u03b8|^{1/2} / (|2\u03c0\u03c3^2 I|^{1/2} |2\u03c0C_\u03b8|^{1/2}) exp[ (1/2) (\u00b5_\u03b8^T \u039b_\u03b8^{\u22121} \u00b5_\u03b8 \u2212 m^T L^{\u22121} m) ],    (10)\n\nwhere L = \u03c3^2 (X^T X)^{\u22121} and m = (1/\u03c3^2) L X^T Y.\nSeveral authors have pointed out that active learning confers no benefit over fixed-design experiments\nin linear-Gaussian models with Gaussian priors, due to the fact that the posterior covariance\nis response-independent [1, 6]. That is, an optimal design (one that minimizes the final posterior\nentropy) can be planned out entirely in advance of the experiment. However, this does not hold\nfor linear-Gaussian models with non-Gaussian priors, such as those considered here. The posterior\ndistribution in such models is data-dependent via the marginal posterior\u2019s dependence on Y (eq. 9).\nThus, active learning is warranted even for linear-Gaussian responses, as we will demonstrate\nempirically below.\n\n4 Automatic Locality Determination (ALD) prior\n\nIn this paper, we employ a flexible RF model underlying the so-called automatic locality determination\n(ALD) estimator [9] (see footnote 1). The key justification for the ALD prior is the observation that most neural\nRFs tend to be localized in both space-time and spatio-temporal frequency. Locality in space-time\nrefers to the fact that (e.g., visual) neurons integrate input over a limited domain in time and space;\nlocality in frequency refers to the band-pass (or smooth / low pass) character of most neural RFs.\nThe ALD prior encodes these tendencies in the parametric form of the covariance matrix C_\u03b8, where\nhyperparameters \u03b8 control the support of both the RF and its Fourier transform.\nThe hyperparameters for the ALD prior are \u03b8 = (\u03c1, \u03bd_s, \u03bd_f, M_s, M_f)^T, where \u03c1 is a \u201cridge\u201d parameter\nthat determines the overall amplitude of the covariance; \u03bd_s and \u03bd_f are length-D vectors that\nspecify the center of the RF support in space-time and frequency, respectively (where D is the degree\nof the RF tensor; see footnote 2); and M_s and M_f are D \u00d7 D positive definite matrices that describe an elliptical\n(Gaussian) region of support for the RF in space-time and frequency, respectively. In practice, we\nwill also include the additive noise variance \u03c3^2 (eq. 3) as a hyperparameter, since it plays a similar\nrole to C in determining the posterior and evidence. 
Thus, for the (D = 2) examples considered\nhere, there are 12 hyperparameters, including scalars \u03c3^2 and \u03c1, two hyperparameters each for \u03bd_s and\n\u03bd_f, and three each for symmetric matrices M_s and M_f.\nNote that although the conditional ALD prior over k|\u03b8 assigns high prior probability to smooth\nand sparse RFs for some settings of \u03b8, for other settings (i.e., where M_s and M_f describe elliptical\nregions large enough to cover the entire RF) the conditional prior corresponds to a simple ridge\nprior and imposes no such structure. We place a flat prior over \u03b8 so that no strong prior beliefs about\nspatial locality or bandpass frequency characteristics are imposed a priori. However, as data from a\nneuron with a truly localized RF accumulates, the support of the marginal posterior p(\u03b8|D_t) shrinks\ndown on regions that favor a localized RF, shrinking the posterior entropy over k far more quickly\nthan is achievable with methods based on Gaussian priors.\n\n1 \u201cAutomatic\u201d refers to the fact that in [9], the model was used for empirical Bayes inference, i.e., MAP\ninference after maximizing the evidence for \u03b8. Here, we perform fully Bayesian inference under the\nassociated model.\n\n2 E.g., a space\u00d7space\u00d7time RF has degree D = 3.\n\n5 Bayesian active learning with ALD\n\nTo perform active learning under the ALD model, we need two basic ingredients: (1) an efficient\nmethod for representing and updating the posterior p(k|D_t) as data come in during the experiment;\nand (2) an efficient algorithm for computing and maximizing the expected information gain given a\nstimulus x. We will describe each of these in turn below.\n\n5.1 Posterior updating via sequential Markov Chain Monte Carlo\n\nTo represent the ALD posterior over k given data, we will rely on the conditionally Gaussian\nrepresentation of the posterior (eq. 
7) using particles {\u03b8_i}_{i=1,...,N} sampled from the marginal posterior,\n\u03b8_i \u223c P(\u03b8|D_t) (eq. 9). The posterior will then be approximated as:\n\np(k|D_t) \u2248 (1/N) \u2211_i p(k|D_t, \u03b8_i),    (11)\n\nwhere each distribution p(k|D_t, \u03b8_i) is Gaussian with \u03b8_i-dependent mean and covariance (eq. 8).\nMarkov Chain Monte Carlo (MCMC) is a popular method for sampling from distributions known\nonly up to a normalizing constant. In cases where the target distribution evolves over time by\naccumulating more data, however, MCMC samplers are often impractical due to the time required for\nconvergence (i.e., \u201cburning in\u201d). To reduce the computational burden, we use a sequential sampling\nalgorithm to update the samples of the hyperparameters at each time step, based on the samples\ndrawn at the previous time step. The main idea of our algorithm is adopted from the resample-move\nparticle filter, which involves generating initial particles; resampling particles according to incoming\ndata; then performing MCMC moves to avoid degeneracy in particles [14]. 
The details are as\nfollows.\nInitialization: On the first time step, generate initial hyperparameter samples {\u03b8_i} from the hyper-prior\np_\u03b8, which we take to be flat over a broad range in \u03b8.\nResampling: Given a new stimulus/response pair {x, y} at time t, resample the existing particles\naccording to the importance weights:\n\np(y_t | \u03b8_i^{(t)}, D_{t\u22121}, x_t) = N(y_t | \u00b5_i^T x_t, x_t^T \u039b_i x_t + \u03c3_i^2),    (12)\n\nwhere (\u00b5_i, \u039b_i) denote the mean and covariance of the Gaussian component attached to particle \u03b8_i.\nThis ensures the posterior evolves according to:\n\np(\u03b8_i^{(t)} | D_t) \u221d p(y_t | \u03b8_i^{(t)}, D_{t\u22121}, x_t) p(\u03b8_i^{(t)} | D_{t\u22121}).    (13)\n\nMCMC Move: Propagate particles via Metropolis-Hastings (MH), with multivariate Gaussian proposals\ncentered on the current particle \u03b8_i of the Markov chain: \u03b8^* \u223c N(\u03b8_i, \u0393), where \u0393 is a diagonal\nmatrix with diagonal entries given by the variance of the particles at the end of time step t\u22121. Accept\nthe proposal with probability min(1, \u03b1), where \u03b1 = q(\u03b8^*)/q(\u03b8_i), with q(\u03b8_i) = p(\u03b8_i|D_t). Repeat MCMC\nmoves until computational or time budget has expired.\nThe main bottleneck of this scheme is the updating of conditional posterior mean \u00b5_i and covariance\n\u039b_i for each particle \u03b8_i, since this requires inversion of a d \u00d7 d matrix. (Note that, unlike Lewi-09,\nthese are not rank-one updates due to the fact that C_{\u03b8_i} changes after each \u03b8_i move). This cost\nis independent of the amount of data, linear in the number of particles, and scales as O(d^3) in\nRF dimensionality d. 
However, particle updates can be performed efficiently in parallel on GPUs or\nmachines with multi-core processors, since the particles do not interact except for stimulus selection,\nwhich we describe below.\n\nFigure 2: Simulated experiment. (A) Angular error in estimates of a simulated RF (20 \u00d7 20 pixels,\nshown in inset) vs. number of stimuli, for Lewi-09 method (blue), the ALD-based active learning\nmethod using 10 (pink) or 100 (red) particles, and the ALD-based passive learning method (black).\nTrue responses were simulated from a Poisson-GLM neuron. Traces show average over 20 independent\nrepetitions. (B) RF estimates obtained by each method after 200, 400, and 1000 trials. Red\nnumbers below indicate angular error (deg).\n\n5.2 Optimal Stimulus Selection\nGiven the posterior over k at time t, represented by a mixture of Gaussians attached to particles {\u03b8_i}\nsampled from the marginal posterior, our task is to determine the maximally informative stimulus to\npresent at time t + 1. Although the entropy of a mixture-of-Gaussians has no analytic form, we can\ncompute the exact posterior covariance via the formula:\n\n\u02dc\u039b_t = (1/N) \u2211_{i=1}^N (\u039b_i + \u00b5_i \u00b5_i^T) \u2212 \u02dc\u00b5_t \u02dc\u00b5_t^T,    (14)\n\nwhere \u02dc\u00b5_t = (1/N) \u2211_i \u00b5_i is the full posterior mean. This leads to an upper bound on posterior\nentropy, since a Gaussian is the maximum-entropy distribution for fixed covariance. 
We then take\nthe next stimulus to be the maximum-variance eigenvector of the posterior covariance, which is the\nmost informative stimulus under a Gaussian posterior and Gaussian noise model, subject to a power\nconstraint on stimuli [6].\nAlthough this selection criterion is heuristic, since it is not guaranteed to maximize mutual information\nunder the true posterior, it is intuitively reasonable since it selects the stimulus direction along\nwhich the current posterior is maximally uncertain. Conceptually, directions of large posterior variance\ncan arise in two different ways: (1) directions of large variance for all covariances \u039b_i, meaning\nthat all particles assign high posterior uncertainty over k|D_t in the direction of x; or (2) directions in\nwhich the means \u00b5_i are highly dispersed, meaning the particles disagree about the mean of k|D_t in\nthe direction of x. In either scenario, selecting a stimulus proportional to the dominant eigenvector\nis heuristically justified by the fact that it will reduce collective uncertainty in particle covariances or\ncause particle means to converge by narrowing of the marginal posterior. We show that the method\nperforms well in practice for both real and simulated data (Section 6). We summarize the complete\nmethod in Algorithm 1.\n\nAlgorithm 1 Sequential active learning under conditionally Gaussian models\n\nGiven particles {\u03b8_i} from p(\u03b8|D_t), which define the posterior as P(k|D_t) = (1/N) \u2211_i N(\u00b5_i, \u039b_i),\n1. Compute the posterior covariance \u02dc\u039b_t from {(\u00b5_i, \u039b_i)} (eq. 14).\n2. Select optimal stimulus x_{t+1} as the maximal eigenvector of \u02dc\u039b_t.\n3. Measure response y_{t+1}.\n4. Resample particles {\u03b8_i} with the weights {N(y_{t+1} | \u00b5_i^T x_{t+1}, x_{t+1}^T \u039b_i x_{t+1} + \u03c3_i^2)}.\n5. Perform MH sampling of p(\u03b8|D_{t+1}), starting from resampled particles.\nrepeat\n\n[Figure 2 graphic: panel A plots angle difference in degrees vs. # trials for Lewi-09, ALD10, ALD100, and Passive-ALD; panel B shows the true filter and the estimates after 200, 400, and 1000 trials. Angular errors as extracted: 62.82, 51.54, 44.94; 57.29, 40.69, 36.65; 43.34, 35.90, 28.98.]\n\nFigure 3: Additional simulated examples comparing Lewi-09 and ALD-based active learning. Responses were\nsimulated from a GLM-Poisson model with three different true 400-pixel RFs (left column): (A) a Gabor filter\n(shown previously in [6]); (B) a center-surround RF, typical in retinal ganglion cells; (C) a relatively\nnon-localized grid-cell RF. Middle and right columns show RF estimates after 400 trials of active learning\nunder each method, with average angular error (over 20 independent repeats) shown beneath in red.\n\n6 Results\n\nSimulated Data: We tested the performance of our algorithm using data simulated from a Poisson-GLM\nneuron with a 20 \u00d7 20 pixel Gabor filter and an exponential nonlinearity (See Fig. 2). This is\nthe response model assumed by the Lewi-09 method, and therefore substantially mismatched to the\nlinear-Gaussian model assumed by our method.\nFor the Lewi-09 method, we used a diagonal prior covariance with amplitude set by maximizing\nmarginal likelihood for a small dataset. We compared two versions of the ALD-based algorithm\n(with 10 and 100 hyperparameter particles, respectively) to examine the relationship between\nperformance and fidelity of the posterior representation. 
To quantify the performance, we used the\nangular difference (in degrees) between the true and estimated RF.\nFig 2A shows the angular difference between the true RF and estimates obtained by Lewi-09 and\nthe ALD-based method, as a function of the number of trials. The ALD estimate exhibits more\nrapid convergence, and performs noticeably better with 100 than with 10 particles (ALD100 vs.\nALD10), indicating that accurately preserving uncertainty over the hyperparameters is beneficial to\nperformance. We also show the performance of ALD inference under passive learning (iid random\nstimulus selection), which indicates that the improvement in our method is not simply due to the use\nof an improved RF estimator. Fig 2B shows the estimates obtained by each method after 200, 400,\nand 1000 trials. Note that the estimate with 100 hyperparameter samples is almost indistinguishable\nfrom the true filter after 200 trials, which is substantially lower than the dimensionality of the filter\nitself (d = 400).\nFig. 3 shows a performance comparison using three additional 2-dimensional receptive fields, to\nshow that performance improves across a variety of different RF shapes. The filters included: (A)\na Gabor filter similar to that used in [6]; (B) a retina-like center-surround receptive field; (C) a\ngrid-cell receptive field with multiple modes. As before, noisy responses were simulated from a\nPoisson-GLM. For the grid-cell example, the filter is not strongly localized in space, yet the ALD-based\nestimate substantially outperforms Lewi-09 due to its sensitivity to localized components in\nfrequency. Thus, the ALD-based method converges more quickly despite the mismatch between the\nmodel used to simulate data and the model assumed for active learning.\nNeural Data: We also tested our method with an off-line analysis of real neural data from a simple\ncell recorded in primate V1 (published in [15]). 
The stimulus consisted of 1D spatiotemporal\nwhite noise (\u201cflickering bars\u201d), with 16 spatial bars on each frame, aligned with the cell\u2019s preferred\norientation. We took the RF to have 16 time bins, resulting in a 256-dimensional parameter space\nfor the RF. We performed simulated active learning by extracting the raw stimuli from 46 minutes\nof experimental data. On each trial, we then computed the expected information gain from presenting\neach of these stimuli (blind to the neuron\u2019s actual response to each stimulus). We used ALD-based\nactive learning with 10 hyperparameter particles, and examined performance of both algorithms for\n960 trials (selecting from \u2248 276,000 possible stimuli on each trial).\n\n[Figure 3 graphic: columns labeled true filter, Lewi-09, and ALD10, with rows (A), (B), and (C). Angle differences as extracted, one pair per row: 60.68, 37.82; 62.82, 42.57; 60.32, 50.73.]\n\nFigure 4: Comparison of active learning methods in a simulated experiment with real neural data\nfrom a primate V1 simple cell. (Original data recorded in response to white noise \u201cflickering bars\u201d\nstimuli, see [15]). (A): Average angular difference between the MLE (inset, computed from an entire\n46-minute dataset) and the estimates obtained by active learning, as a function of the amount of data.\nWe simulated active learning via an offline analysis of the fixed dataset, where methods had access\nto possible stimuli but not responses. (B): RF estimates after 10 and 30 seconds of data. Note that\nthe ALD-based estimate has smaller error with 10 seconds of data than Lewi-09 with 30 seconds of\ndata. (C): Average entropy of hyperparameter particles as a function of t, showing rapid narrowing\nof marginal posterior.\n\nFig 4A shows the average angular difference between the maximum likelihood estimate (computed\nwith the entire dataset) and the estimate obtained by each active learning method, as a function of\nthe number of stimuli. 
The ALD-based method reduces the angular difference by 45 degrees with\nonly 160 stimuli, although the \ufb01lter dimensionality of the RF for this example is 256. The Lewi-09\nmethod requires four times more data to achieve the same accuracy. Fig 4B shows estimates after\n160 and 480 stimuli. We also examined the average entropy of the hyperparameter particles as a\nfunction of the amount of data used. Fig. 4C shows that the entropy of the marginal posterior over\nhyperparameters falls rapidly during the \ufb01rst 150 trials of active learning.\nThe main bottleneck of the algorithm is eigendecomposition of the posterior covariance \u02dc\u039b, which\ntook 30ms for a 256 \u00d7 256 matrix on a 2 \u00d7 2.66 GHz Quad-Core Intel Xeon Mac Pro. Updating\nimportance weights and resampling 10 particles took 4ms, and a single step of MH resampling for\neach particle took 5ms. In total, it took <60 ms to compute the optimal stimulus in each trial using a\nnon-optimized implementation of our algorithm, indicating that our methods should be fast enough\nfor use in real-time neurophysiology experiments.\n\n7 Discussion\n\nWe have developed a Bayesian active learning method for neural RFs under hierarchical response\nmodels with conditionally Gaussian priors. To take account of uncertainty at the level of hyperpa-\nrameters, we developed an approximate information-theoretic criterion for selecting optimal stimuli\nunder a mixture-of-Gaussians posterior. We applied this framework using a prior designed to capture\nsmooth and localized RF structure. The resulting method showed clear advantages over traditional\ndesigns that do not exploit structured prior knowledge. We have contrasted our method with that\nof Lewi et al [6], which employs a more \ufb02exible and accurate model of the neural response, but\na less \ufb02exible model of the RF prior. 
A natural future direction therefore will be to combine the\nPoisson-GLM likelihood and ALD prior, which will combine the benefits of a more accurate neural\nresponse model and a flexible (low-entropy) prior for neural receptive fields, while incurring only a\nsmall increase in computational cost.\n\n[Figure 4 graphic: panel A plots avg angle difference vs. # of stimuli for ALD and Lewi-09, with the ML estimate (46 min.) as inset; panel B shows Lewi-09 and ALD estimates after 160 and 480 stimuli; panel C is plotted against # of stimuli. Angles as extracted: 55.0, 42.5, 45.1, 47.2.]\n\nAcknowledgments\n\nWe thank N. C. Rust and J. A. Movshon for V1 data, and several anonymous reviewers for helpful\nadvice on the original manuscript. This work was supported by a Sloan Research Fellowship,\nMcKnight Scholar\u2019s Award, and NSF CAREER Award IIS-1150186 (JP).\n\nReferences\n[1] D. J. C. MacKay. Information-based objective functions for active data selection. Neural Computation, 4(4):590\u2013604, 1992.\n\n[2] K. Chaloner and I. Verdinelli. Bayesian experimental design: a review. Statistical Science, 10:273\u2013304, 1995.\n\n[3] D. A. Cohn, Z. Ghahramani, and M. I. Jordan. Active learning with statistical models. J. Artif. Intell. Res. (JAIR), 4:129\u2013145, 1996.\n\n[4] A. Watson and D. Pelli. QUEST: a Bayesian adaptive psychophysical method. Perception and Psychophysics, 33:113\u2013120, 1983.\n\n[5] L. Paninski. Asymptotic theory of information-theoretic experimental design. Neural Computation, 17(7):1480\u20131507, 2005.\n\n[6] J. Lewi, R. Butera, and L. Paninski. Sequential optimal design of neurophysiology experiments. Neural Computation, 21(3):619\u2013687, 2009.\n\n[7] W. Truccolo, U. T. Eden, M. R. Fellows, J. P. Donoghue, and E. N. Brown. A point process framework for relating neural spiking activity to spiking history, neural ensemble and extrinsic covariate effects. J. Neurophysiol, 93(2):1074\u20131089, 2005.\n\n[8] M. Sahani and J. Linden. 
Evidence optimization techniques for estimating stimulus-response functions. NIPS, 15, 2003.\n\n[9] M. Park and J. W. Pillow. Receptive field inference with localized priors. PLoS Comput Biol, 7(10):e1002219, 2011.\n\n[10] N. Houlsby, F. Huszar, Z. Ghahramani, and M. Lengyel. Bayesian active learning for classification and preference learning. CoRR, abs/1112.5745, 2011.\n\n[11] L. Paninski. Maximum likelihood estimation of cascade point-process neural encoding models. Network: Computation in Neural Systems, 15:243\u2013262, 2004.\n\n[12] R. Kass and A. Raftery. Bayes factors. Journal of the American Statistical Association, 90:773\u2013795, 1995.\n\n[13] J. W. Pillow, Y. Ahmadian, and L. Paninski. Model-based decoding, information estimation, and change-point detection techniques for multineuron spike trains. Neural Comput, 23(1):1\u201345, Jan 2011.\n\n[14] W. R. Gilks and C. Berzuini. Following a moving target \u2013 Monte Carlo inference for dynamic Bayesian models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(1):127\u2013146, 2001.\n\n[15] N. C. Rust, O. Schwartz, J. A. Movshon, and E. P. Simoncelli. Spatiotemporal elements of macaque V1 receptive fields. Neuron, 46(6):945\u2013956, 2005.\n", "award": [], "sourceid": 1144, "authors": [{"given_name": "Mijung", "family_name": "Park", "institution": null}, {"given_name": "Jonathan", "family_name": "Pillow", "institution": null}]}