{"title": "Neural Implementation of Hierarchical Bayesian Inference by Importance Sampling", "book": "Advances in Neural Information Processing Systems", "page_first": 1669, "page_last": 1677, "abstract": "The goal of perception is to infer the hidden states in the hierarchical process by which sensory data are generated. Human behavior is consistent with the optimal statistical solution to this problem in many tasks, including cue combination and orientation detection. Understanding the neural mechanisms underlying this behavior is of particular importance, since probabilistic computations are notoriously challenging. Here we propose a simple mechanism for Bayesian inference which involves averaging over a few feature detection neurons which fire at a rate determined by their similarity to a sensory stimulus. This mechanism is based on a Monte Carlo method known as importance sampling, commonly used in computer science and statistics. Moreover, a simple extension to recursive importance sampling can be used to perform hierarchical Bayesian inference. We identify a scheme for implementing importance sampling with spiking neurons, and show that this scheme can account for human behavior in cue combination and oblique effect.", "full_text": "Neural Implementation of Hierarchical Bayesian\n\nInference by Importance Sampling\n\nLei Shi\n\nHelen Wills Neuroscience Institute\nUniversity of California, Berkeley\n\nBerkeley, CA 94720\n\nlshi@berkeley.edu\n\nThomas L. Grif\ufb01ths\n\nDepartment of Psychology\n\nUniversity of California, Berkeley\n\nBerkeley, CA 94720\n\ntom griffiths@berkeley.edu\n\nAbstract\n\nThe goal of perception is to infer the hidden states in the hierarchical process by\nwhich sensory data are generated. Human behavior is consistent with the opti-\nmal statistical solution to this problem in many tasks, including cue combination\nand orientation detection. 
Understanding the neural mechanisms underlying this behavior is of particular importance, since probabilistic computations are notoriously challenging. Here we propose a simple mechanism for Bayesian inference which involves averaging over a few feature detection neurons which fire at a rate determined by their similarity to a sensory stimulus. This mechanism is based on a Monte Carlo method known as importance sampling, commonly used in computer science and statistics. Moreover, a simple extension to recursive importance sampling can be used to perform hierarchical Bayesian inference. We identify a scheme for implementing importance sampling with spiking neurons, and show that this scheme can account for human behavior in cue combination and the oblique effect.

1 Introduction

Living creatures occupy an environment full of uncertainty due to noisy sensory inputs, incomplete observations, and hidden variables. One of the goals of the nervous system is to infer the states of the world given these limited data and make decisions accordingly. This task involves combining prior knowledge with current data [1], and integrating cues from multiple sensory modalities [2]. Studies of human psychophysics and animal behavior suggest that the brain is capable of solving these problems in a way that is consistent with optimal Bayesian statistical inference [1, 2, 3, 4]. Moreover, complex brain functions such as visual information processing involve multiple brain areas [5]. Hierarchical Bayesian inference has been proposed as a computational framework for modeling such processes [6]. Identifying neural mechanisms that could support hierarchical Bayesian inference is important, since probabilistic computations can be extremely challenging. 
Just representing and updating distributions over large numbers of hypotheses is computationally expensive.

Much effort has recently been devoted to proposing possible mechanisms based on known neuronal properties. One prominent approach to explaining how the brain uses population activities for probabilistic computations is the "Bayesian decoding" framework [7]. In this framework, it is assumed that the firing rate of a population of neurons, r, can be converted to a probability distribution over stimuli, p(s|r), by applying Bayesian inference, where the likelihood p(r|s) reflects the probability of that firing pattern given the stimulus s. A firing pattern thus encodes a distribution over stimuli, which can be recovered through Bayesian decoding. The problem of performing probabilistic computations then reduces to identifying a set of operations on firing rates r that result in probabilistically correct operations on the resulting distributions p(s|r). For example, [8] showed that when the likelihood p(r|s) is an exponential family distribution with linear sufficient statistics, adding two sets of firing rates is equivalent to multiplying probability distributions.

In this paper, we take a different approach, allowing a population of neurons to encode a probability distribution directly. Rather than relying on a separate decoding operation, we assume that the activity of each neuron translates directly to the weight given to the optimal stimulus for that neuron in the corresponding probability distribution. We show how this scheme can be used to perform Bayesian inference, and how simple extensions of this basic idea make it possible to combine sources of information and to propagate uncertainty through multiple layers of random variables. 
In particular, we focus on one Monte Carlo method, namely importance sampling with the prior as a surrogate, and show how recursive importance sampling approximates hierarchical Bayesian inference.

2 Bayesian inference and importance sampling

Given a noisy observation x, we can recover the true stimulus x* by using Bayes' rule to compute the posterior distribution

p(x*|x) = p(x*) p(x|x*) / ∫ p(x*) p(x|x*) dx*        (1)

where p(x*) is the prior distribution over stimulus values, and p(x|x*) is the likelihood, indicating the probability of the observation x if the true stimulus value is x*. A good guess for the value of x* is the expectation of x* given x. In general, we are often interested in the expectation of some function f(x*) over the posterior distribution p(x*|x), E[f(x*)|x]. The choice of f(x*) depends on the task. For example, in noise reduction where x* itself is of interest, we can take f(x*) = x*. However, evaluating expectations over the posterior distribution can be challenging: it requires computing a posterior distribution and often a multidimensional integration. The expectation E[f(x*)|x] can be approximated using a Monte Carlo method known as importance sampling. 
In its general form, importance sampling approximates the expectation by using a set of samples from some surrogate distribution q(x*) and assigning those samples weights proportional to the ratio p(x*|x)/q(x*):

E[f(x*)|x] = ∫ f(x*) [p(x*|x)/q(x*)] q(x*) dx* ≈ (1/M) Σ_{i=1..M} f(x_i*) p(x_i*|x)/q(x_i*),   x_i* ~ q(x*)        (2)

If we choose q(x*) to be the prior p(x*), the weights reduce to the likelihood p(x|x_i*), giving

E[f(x*)|x] ≈ (1/M) Σ_{i=1..M} f(x_i*) p(x_i*|x)/p(x_i*) = (1/M) Σ_{i=1..M} f(x_i*) p(x, x_i*)/[p(x_i*) p(x)] = (1/M) Σ_{i=1..M} f(x_i*) p(x|x_i*)/p(x) ≈ Σ_i f(x_i*) p(x|x_i*) / Σ_j p(x|x_j*),   x_i* ~ p(x*)        (3)

where the last step uses p(x) = ∫ p(x|x*) p(x*) dx* ≈ (1/M) Σ_j p(x|x_j*).

Thus, importance sampling provides a simple and efficient way to perform Bayesian inference, approximating the posterior distribution with samples from the prior weighted by the likelihood. Recent work has also suggested that importance sampling might provide a psychological mechanism for performing probabilistic inference, drawing on its connection to exemplar models [9].

3 Possible neural implementations of importance sampling

The key components of an importance sampler can be realized in the brain if: 1) there are feature detection neurons with preferred stimulus tuning curves proportional to the likelihood p(x|x_i*); 2) the frequency of these feature detection neurons is determined by the prior p(x*); and 3) divisive normalization can be realized by some biological mechanism. 
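The prior-as-surrogate estimator in Eq. 3 is easy to check numerically. A minimal sketch, assuming a conjugate Gaussian prior and Gaussian likelihood (all parameter values here are hypothetical, chosen only so that the exact posterior mean is available in closed form for comparison):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: prior x* ~ N(mu0, s0^2), observation x | x* ~ N(x*, sn^2).
mu0, s0 = 0.0, 2.0   # prior mean and standard deviation
sn = 1.0             # observation noise standard deviation
x = 1.5              # the noisy observation

# Exact conjugate posterior mean of x* given x, for comparison.
post_var = 1.0 / (1.0/s0**2 + 1.0/sn**2)
post_mean = post_var * (mu0/s0**2 + x/sn**2)

# Eq. 3: samples from the prior, weighted by the likelihood p(x | x_i*).
M = 200000
xi = rng.normal(mu0, s0, M)              # x_i* ~ p(x*)
w = np.exp(-(x - xi)**2 / (2*sn**2))     # proportional to p(x | x_i*)
est = np.sum(xi * w) / np.sum(w)         # E[x* | x] with f(x*) = x*

print(post_mean, est)  # the two agree up to Monte Carlo error
```

With a large sample the self-normalized estimate converges to the exact posterior mean; with only a handful of "neurons" it remains a reasonable, if noisier, approximation.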
In this section, we first describe a radial basis function network implementing importance sampling, then discuss the feasibility of the three assumptions mentioned above. The model is then extended to networks of spiking neurons.

3.1 Radial basis function (RBF) networks

Figure 1: Importance sampler realized by radial basis function network. For details see Section 3.1.

Radial basis function (RBF) networks are a multi-layer neural network architecture in which the hidden units are parameterized by locations in a latent space x_i*. On presentation of a stimulus x, these hidden units are activated according to a function that depends only on the distance ||x − x_i*||, e.g., exp(−||x − x_i*||²/2σ²), similar to the tuning curve of a neuron. RBF networks are popular because they have a simple structure with a clear interpretation and are easy to train. Using RBF networks to model the brain is not a new idea – similar models have been proposed for pattern recognition [10] and as psychological accounts of human category learning [11].

Implementing importance sampling with RBF networks is straightforward. An RBF neuron is recruited for a stimulus value x_i* drawn from the prior (Fig. 1). The neuron's synapses are organized so that its tuning curve is proportional to p(x|x_i*). For a Gaussian likelihood, the peak firing rate would be reached at the preferred stimulus x = x_i* and would diminish as ||x − x_i*|| increases. The ith RBF neuron makes a synaptic connection to output neuron j with strength f_j(x_i*), where f_j is a function of interest. The output units also receive input from an inhibitory neuron that sums over all RBF neurons' activities. Such an RBF network produces output exactly in the form of Eq. 3, with the activation of the output units corresponding to E[f_j(x*)|x].

Training RBF networks is practical for neural implementation. 
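The network of Fig. 1 can be sketched in a few lines, again under hypothetical Gaussian assumptions: hidden units sit at samples from the prior, fire according to a Gaussian tuning curve, and two output units (computing f_1(x*) = x* and f_2(x*) = x*², i.e., posterior mean and second moment) share a single divisive-normalization pool:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical parameters: prior N(0, 2^2), Gaussian tuning curves of width 1.
mu0, s0, sn = 0.0, 2.0, 1.0
n_units = 20000
centers = rng.normal(mu0, s0, n_units)   # RBF neurons recruited at x_i* ~ p(x*)

def rbf_outputs(x):
    a = np.exp(-(x - centers)**2 / (2*sn**2))  # tuning-curve activations
    z = a.sum()                                # inhibitory (normalizing) unit
    out1 = np.sum(centers * a) / z             # output unit with f1(x*) = x*
    out2 = np.sum(centers**2 * a) / z          # output unit with f2(x*) = x*^2
    return out1, out2

m1, m2 = rbf_outputs(1.5)
print("posterior mean ≈", m1, " posterior variance ≈", m2 - m1**2)
```

Note that the same normalizing sum serves every output unit, which is why a single inhibitory neuron suffices in Fig. 1 regardless of how many functions f_j are read out.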
Unlike the multi-layer perceptron, which usually requires global training of the weights, RBF networks are typically trained in two stages. First, the radial basis functions are determined using unsupervised learning, and then weights to the outputs are learned using supervised methods. The first stage is even easier in our formulation, because RBF neurons simply represent samples from the prior, independent of the second stage later in development. Moreover, the performance of RBF networks is relatively insensitive to the precise form of the radial basis functions [12], providing some robustness to differences between the Bayesian likelihood p(x|x_i*) and the activation function in the network. RBF networks also produce sparse coding, because localized radial basis likelihood functions mean only a few units will be significantly activated for a given input x.

3.2 Tuning curves, priors and divisive normalization

We now examine the neural correlates of the three components in the RBF model. First, responses of cortical neurons to stimuli are often characterized by receptive fields and tuning curves, where receptive fields specify the domain within a stimulus feature space that modifies a neuron's response, and tuning curves detail how a neuron's responses change with different feature values. A typical tuning curve (like orientation tuning in V1 simple cells) has a bell shape that peaks at the neuron's preferred stimulus parameter and diminishes as the parameter diverges from it. These neurons effectively measure the likelihood p(x|x_i*), where x_i* is the preferred stimulus.

Second, importance sampling requires neurons with preferred stimuli x_i* to appear with frequency proportional to the prior distribution p(x*). This can be realized if the number of neurons representing x* is roughly proportional to p(x*). While systematic study of the distribution of neurons over their preferred stimuli is technically challenging, there are cases where this assumption seems to hold. For example, research on the "oblique effect" supports the idea that the distribution of orientation tuning curves in V1 is proportional to the prior. Electrophysiology [13], optical imaging [14] and fMRI studies [15] have found that there are more V1 neurons tuned to cardinal orientations than to oblique orientations. These findings are in agreement with the prior distribution of orientations of lines in the visual environment. Other evidence comes from motor areas. Repetitive stimulation of a finger expands its corresponding cortical representation in somatosensory areas [16], suggesting more neurons are recruited to represent this stimulus. Alternatively, recruiting neurons x_i* according to the prior distribution can be implemented by modulating feature detection neurons' firing rates. This strategy also seems to be used by the brain: studies in parietal cortex [17] and superior colliculus [18] show that increased prior probability at a particular location results in stronger firing for neurons with receptive fields at that location.

Third, divisive normalization is a critical component in many neural models, notably in the study of attention modulation [19, 20]. 
It has been suggested that biophysical mechanisms such as shunting inhibition and synaptic depression might account for normalization and gain control [10, 21, 22]. Moreover, local interneurons [23] act as modulators for pooled inhibitory inputs and are good candidates for performing normalization. Our study makes no specific claims about the underlying biophysical processes, but gains support from the literature suggesting that there are plausible neural mechanisms for performing divisive normalization.

3.3 Importance sampling by Poisson spiking neurons

Neurons communicate mostly by spikes rather than continuous membrane potential signals. Poisson spiking neurons play an important role in other analyses of systems for representing probabilities [8]. Poisson spiking neurons can also be used to perform importance sampling if we have an ensemble of neurons with firing rates λ_i proportional to p(x|x_i*), with the values of x_i* drawn from the prior.

To show this we need a property of Poisson distributions: if y_i ~ Poisson(λ_i) and Y = Σ_i y_i, then Y ~ Poisson(Σ_i λ_i) and (y_1, y_2, . . . , y_m | Y = n) ~ Multinomial(n, λ_i/Σ_j λ_j). This further implies that E(y_i/Y | Y = n) = λ_i/Σ_j λ_j. Assume a neuron tuned to stimulus x_i* emits spikes r_i ~ Poisson(c · p(x|x_i*)), where c is any positive constant. An average of a function f(x_i*) using the number of spikes produced by the corresponding neurons yields Σ_i f(x_i*) r_i/Σ_j r_j, whose expectation is

E[Σ_i f(x_i*) r_i/Σ_j r_j] = Σ_i f(x_i*) E[r_i/Σ_j r_j] = Σ_i f(x_i*) cλ_i/Σ_j cλ_j = Σ_i f(x_i*) p(x|x_i*)/Σ_j p(x|x_j*)        (4)

which is thus an unbiased estimator of the importance sampling approximation to the posterior expectation. The variance of this estimator decreases as population activity n = Σ_i r_i increases, because var[r_i/n] ~ 1/n. 
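Eq. 4 can be sketched numerically under the same hypothetical Gaussian setup as before: replacing each neuron's analog activation with a Poisson spike count at rate proportional to p(x|x_i*) leaves the estimator unchanged in expectation, and with a high total spike count the two estimates are nearly identical:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical setup: prior N(0, 2^2), Gaussian likelihood of width 1, x = 1.5.
mu0, s0, sn = 0.0, 2.0, 1.0
x = 1.5
centers = rng.normal(mu0, s0, 1000)            # x_i* ~ p(x*)
lam = np.exp(-(x - centers)**2 / (2*sn**2))    # rates proportional to p(x | x_i*)

# Analog importance-sampling estimate (Eq. 3) with f(x*) = x*.
analog = np.sum(centers * lam) / np.sum(lam)

# Spiking estimate (Eq. 4): r_i ~ Poisson(c * p(x | x_i*)).
c = 50.0                                       # gain; sets total activity n
r = rng.poisson(c * lam)                       # spike counts r_i
spiking = np.sum(centers * r) / np.sum(r)

print(analog, spiking)  # close for a large total spike count
```

Raising the gain c (or pooling over a longer time window) increases n and shrinks the gap between the spiking and analog estimates, illustrating the var[r_i/n] ~ 1/n behavior.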
Thus, Poisson spiking neurons, if plugged into an RBF network, can perform importance sampling and give similar results to "neurons" with analog output, as we confirm later in the paper through simulations.

4 Hierarchical Bayesian inference and multi-layer importance sampling

Inference tasks solved by the brain often involve more than one random variable, with complex dependency structures between those variables. For example, visual information processing in primates involves dozens of subcortical areas that interconnect in a hierarchical structure containing two major pathways [5]. Hierarchical Bayesian inference has been proposed as a solution to this problem, with particle filtering and belief propagation as possible algorithms implemented by the brain [6]. However, few studies have proposed neural models that are capable of performing hierarchical Bayesian inference (although see [24]). We show how a multi-layer neural network can perform such computations using importance samplers (Fig. 1) as building blocks.

4.1 Generative models and hierarchical Bayesian inference

Generative models describe the causal process by which data are generated, assigning a probability distribution to each step in that process. To understand brain function, it is often helpful to identify the generative model that determines how stimuli to the brain Sx are generated. The brain then has to reverse the generative model to recover the latent variables expressed in the data (see Fig. 2). The direction of inference is thus the opposite of the direction in which the data are generated.

Figure 2: A hierarchical Bayesian model. 
The generative model specifies how each variable is generated (in circles), while inference reverses this process (in boxes). Sx is the stimulus presented to the nervous system, while X, Y, and Z are latent variables at increasing levels of abstraction.

In the case of a hierarchical Bayesian model, as shown in Fig. 2, the quantity of interest is the posterior expectation of some function f(z) of a high-level latent variable Z given stimulus Sx, E[f(z)|Sx] = ∫ f(z) p(z|Sx) dz. After repeatedly using the importance sampling trick, this hierarchical Bayesian inference problem can be decomposed into three importance samplers, with values x_i*, y_j*, and z_k* drawn from the prior:

E[f(z)|Sx] = ∫∫∫ f(z) p(z|y) p(y|x) p(x|Sx) dx dy dz
≈ ∫∫ f(z) p(z|y) [Σ_i p(y|x_i*) p(Sx|x_i*) / Σ_i p(Sx|x_i*)] dy dz,   x_i* ~ p(x)
≈ ∫ f(z) Σ_i [Σ_j p(z|y_j*) p(x_i*|y_j*) / Σ_j p(x_i*|y_j*)] [p(Sx|x_i*) / Σ_i p(Sx|x_i*)] dz,   y_j* ~ p(y)
≈ Σ_k f(z_k*) Σ_j [p(y_j*|z_k*) / Σ_k p(y_j*|z_k*)] Σ_i [p(x_i*|y_j*) / Σ_j p(x_i*|y_j*)] [p(Sx|x_i*) / Σ_i p(Sx|x_i*)],   z_k* ~ p(z)        (5)

This result relies on recursively applying importance sampling to the integral, with each recursion resulting in an approximation to the posterior distribution of another random variable. This recursive importance sampling scheme can be used in a variety of graphical models. For example, tracking a stimulus over time is a natural extension where an additional observation is added at each level of the generative model. We evaluate this scheme in several generative models in Section 5.

4.2 Neural implementation of the multi-layer importance sampler

The decomposition of hierarchical inference into recursive importance sampling (Eq. 5) gives rise to a multi-layer neural network implementation (see Fig. 3a). The input layer X is similar to that in Fig. 1, composed of feature detection neurons with output proportional to the likelihood p(Sx|x_i*). Their output, after presynaptic normalization, is fed into a layer corresponding to the Y variables, with synaptic weights p(x_i*|y_j*)/Σ_j p(x_i*|y_j*). The response of neuron y_j*, summing over synaptic inputs, approximates p(y_j*|Sx). Similarly, the response of z_k* approximates p(z_k*|Sx), and the activities of these neurons are pooled to compute E[f(z)|Sx]. Note that, at each level, x_i*, y_j*, and z_k* are sampled from prior distributions. Posterior expectations involving any random variable can be computed because the neuron activities at each level approximate the posterior density. A single pool of neurons can also feed activation to multiple higher levels. Using the visual system as an example (Fig. 3b), such a multi-layer importance sampling scheme could be used to account for hierarchical inference in divergent pathways by projecting a set of V2 cells to both MT and V4 areas with corresponding synaptic weights.

Figure 3: a) Multi-layer importance sampler for hierarchical Bayesian inference. b) Possible implementation in dorsal-ventral visual inference pathways, with multiple higher levels receiving input from one lower level. 
Note that the arrows in the figure indicate the direction of inference, which is opposite to the direction of the generative model.

5 Simulations

In this section we examine how well the mechanisms introduced in the previous sections account for human behavioral data for two perceptual phenomena: cue combination and the oblique effect.

5.1 Haptic-visual cue combination

When sensory cues come from multiple modalities, the nervous system is able to combine those cues optimally in the way dictated by Bayesian statistics [2]. Fig. 4a shows the setup of an experiment where a subject measures the height of a bar through haptic and visual inputs. The object's visual input is manipulated so that the visual cues can be inconsistent with the haptic cues and the visual noise can be adjusted to different levels; i.e., the visual cue follows xV ~ N(SV, σV²) and the haptic cue follows xH ~ N(SH, σH²), where SV, SH, σV², and σH² are controlled parameters. The upper panel of Fig. 4d shows the percentage of trials in which participants report that the comparison stimulus (consistent visual/haptic cues from 45-65mm) is larger than the standard stimulus (inconsistent visual/haptic cues, SV = 60mm and SH = 50mm). As visual noise increases, the haptic input receives larger weight in decision making and the percentage curve shifts towards SH, consistent with Bayesian statistics.

Several studies have suggested that this form of cue combination could be implemented by population coding [2, 8]. In particular, [8] made an interesting observation that, for Poisson-like spiking neurons, summing the firing activities of two populations is the optimal strategy. 
The model of [8] operates within the Bayesian decoding framework and requires the network to be constructed so that the two populations have exactly the same number of neurons and a precise one-to-one connection between them, with each connected pair of neurons having exactly the same tuning curves. We present an alternative solution based on importance sampling that encodes the probability distribution by a population of neurons directly.

The importance sampling solution approximates the posterior expectation of the bar's height x_C* given SV and SH. Sensory inputs are channeled in through xV and xH (Fig. 4b). Because sensory input varies in a small range (45-65mm in [2]), we assume the priors p(xC), p(xV) and p(xH) are uniform. It is straightforward to approximate the posterior p(xV|SV) using importance sampling:

p(xV = x_V*|SV) = E[1(xV = x_V*)|SV] ≈ p(SV|x_V*)/Σ_i p(SV|x_{V,i}*) ≈ r_V/Σ_i r_{V,i},   x_{V,i}* ~ p(xV)        (6)

where r_{V,i} ~ Poisson[c · p(SV|x_{V,i}*)] is the number of spikes emitted by neuron x_{V,i}*. A similar strategy applies to p(xH|SH). The posterior p(xC|SV, SH), however, is not trivial, since multiplication of spike trains is needed:

p(xC = x_C*|SV, SH) = ∫∫ 1(xC = x_C*) p(xC|xV, xH) p(xV|SV) p(xH|SH) dxV dxH
≈ Σ_i Σ_j p(xC = x_C*|x_{V,i}*, x_{H,j}*) [r_{V,i}/Σ_i r_{V,i}] [r_{H,j}/Σ_j r_{H,j}]        (7)

Figure 4: (a) Experimental setup [2]. (b) Generative model. SV and SH are the sensory stimuli, XV and XH the values along the visual and haptic dimensions, and XC the combined estimate of object height. (c) Illustration of importance sampling using two sensory arrays {x_{V,i}*}, {x_{H,j}*}. The transparent ellipses indicate the tuning curves of high-level neurons centered on values x_{C,k}* over xV and xH. The big ellipse represents the manipulated input with inconsistent sensory input and different variance structure. Bars at the center of opaque ellipses indicate the relative firing rates of xC neurons, proportional to p(x_{C,k}*|SV, SH). (d) Human data and simulation results.

Fortunately, the experiment provides an important constraint, namely that subjects were not aware of the manipulation of the visual input. Thus, the values x_{C,k}* employed in the computation are sampled from normal perceptual conditions, namely consistent visual and haptic inputs (xV = xH) and normal variance structure (transparent ellipses in Fig. 4c, on the diagonal). Therefore, the random variables {xV, xH} effectively become one variable x_{V,H}, and values of x_{V,H,i}* are composed of samples drawn from xV and xH independently. Applying importance sampling,

p(xC = x_C*|SV, SH) ≈ [Σ_i p(x_{V,i}*|x_C*) r_{V,i} + Σ_j p(x_{H,j}*|x_C*) r_{H,j}] / [Σ_i r_{V,i} + Σ_j r_{H,j}]        (8)

E[x_C*|SV, SH] ≈ Σ_k x_{C,k}* r_{C,k} / Σ_k r_{C,k}        (9)

where r_{C,k} ~ Poisson(c · p(x_{C,k}*|SV, SH)) and x_{C,k}* ~ p(xC). Compared with Eq. 6, the inputs x_{V,i}* and x_{H,j}* are treated as coming from one population in Eq. 8. 
r_{V,i} and r_{H,j} are weighted differently only because of different observation noise. Eq. 9 is applicable for manipulated sensory input (in Fig. 4c, the ellipse off the diagonal). The simulation results (for an average of 500 trials) are shown in the lower panel of Fig. 4d, compared with human data in the upper panel. Two parameters, the noise levels σV and σH, are optimized to fit within-modality discrimination data (see [2], Fig. 3a). {x_{V,i}*}, {x_{H,j}*} and {x_{C,k}*} consist of 20 independently drawn examples each, and the total firing rate of each set of neurons is limited to 30. The simulations produce a close match to human behavior.

5.2 The oblique effect

The oblique effect describes the phenomenon that people show greater sensitivity to bars with horizontal or vertical (0°/90°) orientations than to "oblique" orientations. Fig. 5a shows an experimental setup where subjects exhibited higher sensitivity in detecting the direction of rotation of a bar when the reference bar to which it was compared was in one of these cardinal orientations. Fig. 5b shows the generative model for this detection problem. The top-level binary variable D randomly chooses a direction of rotation. Conditioned on D, the amplitude of rotation ∆θ is generated from a truncated normal distribution (N_T(D), restricted to ∆θ > 0 if D = 1 and ∆θ < 0 otherwise). When combined with the angle of the reference bar r (shaded in the graphical model, since it is known), ∆θ generates the orientation of a test bar θ, and θ further generates the observation Sθ, both from normal distributions with variances σθ² and σSθ² respectively.

Figure 5: (a) Orientation detection experiment. The oblique effect is shown in the lower panel: greater sensitivity to orientations near the cardinal directions. (b) Generative model. (c) The oblique effect emerges from our model, but depends on having the correct prior p(θ).

The oblique effect has been shown to be closely related to the number of V1 neurons tuned to different orientations [25]. Many studies have found more V1 neurons tuned to cardinal orientations than to other orientations [13, 14, 15]. Moreover, the uneven distribution of feature detection neurons is consistent with the idea that these neurons might be sampled proportional to the prior: more horizontal and vertical segments exist in the natural visual environment of humans.

Importance sampling provides a direct test of the hypothesis that the preferential distribution of V1 neurons around 0°/90° can cause the oblique effect, which becomes the question of whether the oblique effect depends on the use of a prior p(θ) with this distribution. The quantity of interest is:

p(D = 1|Sθ, r) ≈ Σ_i Σ_{j'} [p(θ_i*|∆θ_{j'}*, r)/Σ_j p(θ_i*|∆θ_j*, r)] [p(Sθ|θ_i*)/Σ_i p(Sθ|θ_i*)]        (10)

where j' indexes all ∆θ* > 0. If p(D = 1|Sθ, r) > 0.5, then we should assign D = 1. Fig. 
5c shows that detection sensitivity is uncorrelated with orientation if we take a uniform prior p(θ), but exhibits the oblique effect under a prior that prefers cardinal directions. In both cases, 40 neurons are used to represent each of ∆θ_i* and θ_i*, and results are averaged over 100 trials. Sensitivity is measured by the percentage correct in inference. Due to the qualitative nature of this simulation, model parameters are not tuned to fit the experimental data.

6 Conclusion

Understanding how the brain solves the problem of Bayesian inference is a significant challenge for computational neuroscience. In this paper, we have explored the potential of a class of solutions that draw on ideas from computer science, statistics, and psychology. We have shown that a small number of feature detection neurons whose tuning curves represent a small set of typical examples from sensory experience is sufficient to perform some basic forms of Bayesian inference. Moreover, our theoretical analysis shows that this mechanism corresponds to a Monte Carlo sampling method, i.e. importance sampling. The basic idea behind this approach – storing examples and activating them based on similarity – is at the heart of a variety of psychological models, and is straightforward to implement either in traditional neural network architectures like radial basis function networks, circuits of Poisson spiking neurons, or associative memory models. The nervous system is constantly reorganizing to capture the ever-changing structure of our environment. 
Components of the importance sampler, such as the tuning curves and their synaptic strengths, need to be updated to match the distributions in the environment. Understanding how the brain might solve this daunting problem is a key question for future research.

Acknowledgments. Supported by the Air Force Office of Scientific Research (grant FA9550-07-1-0351).

[Figure 5: (a) Oblique effect: relative detection sensitivity across orientations 0°–180°, adapted from Furmanski & Engel (2000). (b) Generative model ("clockwise or counterclockwise?"): p(D = 1) = p(D = −1) = 0.5; Δθ | D ~ N_T(D)(0, σ²_Δθ); θ | Δθ, r ~ N(Δθ + r, σ²_θ); S_θ | θ ~ N(θ, σ²_S); prior on θ either Uni([0, π]) or (1 − k)/2 [N(0, σ²_θ) + N(π/2, σ²_θ)] + k Uni([0, π]). (c) Oblique effect and prior.]

References

[1] K. Körding and D. M. Wolpert. Bayesian integration in sensorimotor learning. Nature, 427:244–247, 2004.

[2] M. O. Ernst and M. S. Banks. Humans integrate visual and haptic information in a statistically optimal fashion. Nature, 415(6870):429–433, 2002.

[3] A. Stocker and E. Simoncelli. A Bayesian model of conditioned perception. In J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 1409–1416. MIT Press, Cambridge, MA, 2008.

[4] A. P. Blaisdell, K. Sawa, K. J. Leising, and M. R. Waldmann. Causal reasoning in rats. Science, 311(5763):1020–1022, 2006.

[5] D. C. Van Essen, C. H. Anderson, and D. J. Felleman. Information processing in the primate visual system: an integrated systems perspective. Science, 255(5043):419–423, 1992.

[6] T. S. Lee and D. Mumford. Hierarchical Bayesian inference in the visual cortex. Journal of the Optical Society of America A, 20(7):1434–1448, 2003.

[7] R. S. Zemel, P. Dayan, and A. Pouget. Probabilistic interpretation of population codes. Neural Computation, 10(2):403–430, 1998.

[8] W. J. Ma, J. M. Beck, P. E. Latham, and A. Pouget. Bayesian inference with probabilistic population codes. Nature Neuroscience, 9(11):1432–1438, 2006.

[9] L. Shi, N. H. Feldman, and T. L. Griffiths. Performing Bayesian inference with exemplar models. In Proceedings of the 30th Annual Conference of the Cognitive Science Society, 2008.

[10] M. Kouh and T. Poggio. A canonical neural circuit for cortical nonlinear operations. Neural Computation, 20(6):1427–1451, 2008.

[11] J. K. Kruschke. ALCOVE: An exemplar-based connectionist model of category learning. Psychological Review, 99:22–44, 1992.

[12] M. J. D. Powell. Radial basis functions for multivariable interpolation: a review. Clarendon Press, New York, NY, USA, 1987.

[13] R. L. De Valois, E. W. Yund, and N. Hepler. The orientation and direction selectivity of cells in macaque visual cortex. Vision Research, 22(5):531–544, 1982.

[14] D. M. Coppola, L. E. White, D. Fitzpatrick, and D. Purves. Unequal representation of cardinal and oblique contours in ferret visual cortex. Proceedings of the National Academy of Sciences USA, 95(5):2621–2623, 1998.

[15] C. S. Furmanski and S. A. Engel. An oblique effect in human primary visual cortex. Nature Neuroscience, 3(6):535–536, 2000.

[16] A. Hodzic, R. Veit, A. A. Karim, M. Erb, and B. Godde. Improvement and decline in tactile discrimination behavior after cortical plasticity induced by passive tactile coactivation. Journal of Neuroscience, 24(2):442–446, 2004.

[17] M. L. Platt and P. W. Glimcher. Neural correlates of decision variables in parietal cortex. Nature, 400:233–238, 1999.

[18] M. A. Basso and R. H. Wurtz. Modulation of neuronal activity by target uncertainty. Nature, 389(6646):66–69, 1997.

[19] J. H. Reynolds and D. J. Heeger. The normalization model of attention. Neuron, 61(2):168–185, 2009.

[20] J. Lee and J. H. R. Maunsell. A normalization model of attentional modulation of single unit responses. PLoS ONE, 4(2):e4651, 2009.

[21] S. J. Mitchell and R. A. Silver. Shunting inhibition modulates neuronal gain during synaptic excitation. Neuron, 38(3):433–445, 2003.

[22] J. S. Rothman, L. Cathala, V. Steuber, and R. A. Silver. Synaptic depression enables neuronal gain control. Nature, 457(7232):1015–1018, 2009.

[23] H. Markram, M. Toledo-Rodriguez, Y. Wang, A. Gupta, G. Silberberg, and C. Wu. Interneurons of the neocortical inhibitory system. Nature Reviews Neuroscience, 5(10):793–807, 2004.

[24] K. Friston. Hierarchical models in the brain. PLoS Computational Biology, 4(11):e1000211, 2008.

[25] G. A. Orban, E. Vandenbussche, and R. Vogels. Human orientation discrimination tested with long stimuli. Vision Research, 24(2):121–128, 1984.