{"title": "Nonparametric inference of prior probabilities from Bayes-optimal behavior", "book": "Advances in Neural Information Processing Systems", "page_first": 1067, "page_last": 1074, "abstract": null, "full_text": "Nonparametric inference of prior probabilities from Bayes-optimal behavior

Liam Paninski*
Department of Statistics, Columbia University
liam@stat.columbia.edu; http://www.stat.columbia.edu/~liam

Abstract

We discuss a method for obtaining a subject's a priori beliefs from his/her behavior in a psychophysics context, under the assumption that the behavior is (nearly) optimal from a Bayesian perspective. The method is nonparametric in the sense that we do not assume that the prior belongs to any fixed class of distributions (e.g., Gaussian). Despite this increased generality, the method is relatively simple to implement, being based in the simplest case on a linear programming algorithm, and more generally on a straightforward maximum likelihood or maximum a posteriori formulation, which turns out to be a convex optimization problem (with no non-global local maxima) in many important cases. In addition, we develop methods for analyzing the uncertainty of these estimates. We demonstrate the accuracy of the method in a simple simulated coin-flipping setting; in particular, the method is able to precisely track the evolution of the subject's posterior distribution as more and more data are observed. We close by briefly discussing an interesting connection to recent models of neural population coding.

Introduction

Bayesian methods have become quite popular in psychophysics and neuroscience (1–5); in particular, a recent trend has been to interpret observed biases in perception and/or behavior as optimal, in a Bayesian (average) sense, under ecologically-determined prior distributions on the stimuli or behavioral contexts under study.
For example, (2) interpret visual motion illusions in terms of a prior weighted towards slow, smooth movements of objects in space.

In an experimental context, it is clearly desirable to empirically estimate the prior the subject is operating under; the idea would then be to compare these experimental estimates of the subject's prior with the ecological prior he or she "should" have been using. Conversely, such an approach would have the potential to establish that the subject is not behaving Bayes-optimally under any prior, but rather is in fact using a different, non-Bayesian strategy. Such tools would also be quite useful in studies of learning and generalization, in which we would like to track the time course of a subject's adaptation to an experimentally-chosen prior distribution (5). Past estimates of the subject's prior have been rather qualitative, and/or limited to simple parametric families (e.g., the width of a Gaussian may be fit to the experimental data, but the actual Gaussian identity of the prior is not examined systematically).

*We thank N. Daw, P. Hoyer, S. Inati, K. Koerding, I. Nemenman, E. Simoncelli, A. Stocker, and D. Wolpert for helpful suggestions, and in particular P. Dayan for pointing out the connection to neural population coding models. This work was supported by funding from the Howard Hughes Medical Institute, Gatsby Charitable Trust, and by a Royal Society International Fellowship.

We present a more quantitative method here. We first discuss the method in the general case of an arbitrarily-chosen loss function (the "cost" which we assume the subject is attempting to minimize, on average), then examine a few important special cases (e.g., mean-square and mean-absolute error) in which the technique may be simplified somewhat.
The algorithms for determining the subject's prior distributions turn out to be surprisingly quick and easy to code: the basic idea is that each observed stimulus-response pair provides a set of constraints on what the actual prior could be. In the simplest case, these constraints are linear, and the resulting algorithm is simply a version of linear programming, for which very efficient algorithms exist. More generally, the constraints are probabilistic, and we discuss likelihood-based methods for combining these noisy constraints (and in particular when the resulting maximum likelihood, or maximum a posteriori, problem can be solved efficiently via ascent methods, without fear of getting trapped in non-global local maxima). Finally, we discuss Bayesian methods for representing the uncertainty in our estimates.

We should point out that related problems have appeared in the statistics literature, particularly under the subject of elicitation of expert opinion (6–8); in the machine learning literature, most recently in the area of "inverse reinforcement learning" (9); and in the economics/game theory literature on utility learning (10). The experimental economics literature in particular is quite vast (where the relevance to gambling, price setting, etc. is discussed at length, particularly in settings in which "rational" — expected utility-maximizing — behavior seems to break down); see, e.g., Wakker's recent bibliography (www1.fee.uva.nl/creed/wakker/refs/rfrncs.htm) for further references. Finally, it is worth noting that the question of determining a subject's (or more precisely, an opponent's) priors in a gambling context — in particular, in the binary case of whether or not an opponent will accept a bet, given a fixed table of outcomes vs.
payoffs — has received attention going back to the foundations of decision theory, most prominently in the discussions of de Finetti and Savage. Nevertheless, we are unaware of any previous application of similar techniques (both for estimating a subject's true prior and for analyzing the uncertainty associated with these estimates) in the psychophysical or neuroscience literature.

General case

Our technique for determining the subject's prior is based on several assumptions (some of which will be relaxed below). To begin, we assume that the subject is behaving optimally in a Bayesian sense. To be precise, we have four ingredients: a prior distribution on some hidden parameter θ; observed input (stimulus) data, dependent in some probabilistic way on θ; the subject's corresponding output estimates of the underlying θ, given the input data; and finally a loss function D(., .) that penalizes bad estimates of θ. The fundamental assumption is that, on each trial i, the subject chooses the estimate θ̂i of the underlying parameter, given data xi, to minimize the posterior average error

∫ p(θ|xi) D(θ̂i, θ) dθ ∼ ∫ p(θ) p(xi|θ) D(θ̂i, θ) dθ,   (1)

where p(θ) is the prior on hidden parameters (the unknown object the experimenter is trying to estimate), and p(xi|θ) is the likelihood of data xi given θ. In the visual motion example, for instance, θ could be the true underlying velocity of an object moving through space, the observed data xi could be a short, noise-contaminated movie of the object's motion, and the subject would be asked to estimate the true motion θ given the data xi and any prior conceptions, p(θ), of how one expects objects to move. Note that we have also implicitly assumed, in this simplest case, that both the loss D(., .)
and likelihood functions p(xi|θ) are known, both to the subject and to the experimenter (perhaps from a preceding set of "learning" trials).

So how can the experimenter actually estimate p(θ), given the likelihoods p(x|θ), the loss function D(., .), and some set of data {xi} with corresponding estimates {θ̂i} minimizing the posterior expected loss (1)? This turns out to be a linear programming problem (11), for which very efficient algorithms exist (e.g., "linprog.m" in Matlab). To see why, first note that the right hand side of expression (1) is linear in the prior p(θ). Second, we have a large collection of linear constraints on p(θ): we know that

p(θ) ≥ 0   ∀θ   (2)

∫ p(θ) dθ = 1   (3)

∫ p(θ) p(xi|θ) [D(θ̂i, θ) − D(z, θ)] dθ ≤ 0   ∀z   (4)

where (2-3) are satisfied by any proper prior distribution and (4) is the minimizer condition (1) expressed in slightly different language. (See also (10), who noted the same linear programming structure in an application to cost function estimation, rather than the prior estimation examined here.)

The solution to the linear programming problem defined by (2-4) is not necessarily unique; the feasible set is an intersection of half-spaces, which is convex in general.
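To make the structure of (2-4) concrete, here is a minimal numerical sketch, assuming a discretized θ grid, squared-error loss, and an illustrative simulated binomial experiment; scipy's linprog stands in for Matlab's linprog.m, and all grid sizes and parameter values are our own illustrative choices, not part of the analysis above:

```python
import numpy as np
from scipy.optimize import linprog
from scipy.stats import binom

# Discretize theta on a grid; the prior p becomes a vector of probability masses.
theta = np.linspace(0.01, 0.99, 50)

# Illustrative stand-in experiment: binomial likelihoods, and a simulated subject
# whose estimates are exact posterior means under a (hidden) true prior.
true_prior = np.exp(-((theta - 0.3) ** 2) / 0.02)
true_prior /= true_prior.sum()

A_ub, b_ub = [], []
rng = np.random.default_rng(0)
for n in [5, 10, 20, 40]:
    t = rng.binomial(n, 0.3)
    lik = binom.pmf(t, n, theta)
    post = lik * true_prior
    post /= post.sum()
    theta_hat = post @ theta  # subject's Bayes (posterior-mean) estimate
    # Constraint (4) under squared-error loss, for a grid of alternatives z:
    # integral of p(th) p(x_i|th) [ (th_hat - th)^2 - (z - th)^2 ] dth <= 0
    for z in np.linspace(0.05, 0.95, 19):
        A_ub.append(lik * ((theta_hat - theta) ** 2 - (z - theta) ** 2))
        b_ub.append(0.0)

# Constraints (2)-(3): p >= 0 via bounds, and sum(p) = 1 via an equality row.
res = linprog(c=np.zeros_like(theta), A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              A_eq=np.ones((1, theta.size)), b_eq=[1.0], bounds=(0, None))
print(res.status)  # status 0: some prior consistent with all constraints exists
```

With a zero objective this is a pure feasibility check; any concave regularizer (entropy, smoothness) can then be maximized over the same constraint set to select a unique estimate.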
To come up with a unique solution, we could maximize a concave "regularizing" function on this convex set; possible such functions include, e.g., the entropy of p(θ), or its negative mean-square derivative (this function is strictly concave on the space of all functions whose integral is held fixed, as is the case here given constraint (3)). More generally, if we have some prior information on the form of the priors the subject might be using, and this information can be expressed in the "energy" form

P[p(θ)] ∼ e^{q[p(θ)]}

for a concave functional q[.], we could use the log of this "prior on priors" P. An alternative solution would be to modify constraint (4) to

∫ p(θ) p(xi|θ) [D(θ̂i, θ) − D(z, θ)] dθ ≤ −ϵ   ∀z,

where we then adjust the slack variable ϵ until the constraint set shrinks to a single point. This leads directly to another linear programming problem (where we want to make the linear function ϵ as large as possible, under the above constraints). Note that for this last approach to work — for the linear programming problem to have a solution — we need the set defined by the constraints (2-4) to be compact; this basically means that the constraint set (4) needs to be sufficiently rich, which, in turn, means that sufficient data (or sufficiently strong prior constraints) are required. We will return to this point below.

Finally, what if our primary assumption is not met? That is, what if subjects are not quite behaving optimally with respect to p(θ)? It is possible to detect this situation in the above framework, for example if the slack variable ϵ above is found to be negative. However, a different, more probabilistic viewpoint can be taken.
Assume the value of the choice θ̂i is optimal up to some "comparison" noise, that is,

∫ p(θ) p(xi|θ) [D(θ̂i, θ) − D(z, θ)] dθ ≤ σηi(z)   ∀z,

with ηi(z) a random variable of scale σ > 0 (assume η to be i.i.d. for now, although this may be generalized). If we assume this decision noise η has a log-concave density (i.e., the log of the density is a concave function; e.g., Gaussian, or exponential), then so does its integral (12), and the resulting maximum likelihood problem has no non-global local maxima and is therefore solvable by ascent methods. To see this, write the log-likelihood of (p, σ) given data {xi, θ̂i} as

L_{xi,θ̂i}(p, σ) = Σ log ∫_{−∞}^{ui(z)} dp(η),

with the sum taken over the set of all the constraints in (4), and

ui(z) ≡ (1/σ) ∫ p(θ) p(xi|θ) [D(θ̂i, θ) − D(z, θ)] dθ.

L is a sum of concave functions of the ui, and hence is concave itself, and has no non-global local maxima in these variables; since σ and p are linearly related through ui (and (p, σ) live in a convex set), L has no non-global local maxima in (p, σ), either. Once again, this maximum likelihood problem may be regularized by prior information¹, maximizing the a posteriori likelihood L(p) − q[p] instead of L(p); this problem is similarly tractable by ascent methods, by the concavity of −q[.]
(note that this "soft-constraint" problem reduces exactly to the "hard"-constraint problem (4) as the noise σ → 0)².

Note that the estimated value of the noise scale σ plays a similar role to that of the slack variable ϵ, above, with the difference that ϵ can be much more sensitive to the worst trial (that is, the trial on which the subject behaves most suboptimally); we can use either of these slack variables to go back and ask how close to optimal the subjects' performance actually was — large values of σ, for example, imply sub-optimal performance. An additional interesting idea is to use the computed value of η as a kind of outlier test; a large η implies the trial was particularly suboptimal.

Special cases

Maximum a posteriori estimation: The maximum a posteriori (MAP) estimator corresponds to the Hamming distance loss function,

D(i, j) = 1(i ≠ j);

this implies that the constraints (4) have the simple form

p(θ̂i) − p(z) L(θ̂i, z) ≥ 0,

with L(θ̂i, z) defined as the largest observed likelihood ratio for θ̂i and z, that is,

L(θ̂i, z) ≡ max_{xi} p(xi|z) / p(xi|θ̂i),

¹Overfitting here is a symptom of the fact that in some cases — particularly when few data samples have been observed — many priors (even highly implausible priors) can explain the observed data fairly well; in this case, it is often quite useful to penalize these "implausible" priors, thus effectively regularizing our estimates. Similar observations have appeared in the context of medical applications of Markov random field methods (13).

²Another possible application of this regularization idea is as follows.
We may incorporate improper priors — that is, priors which may not integrate to unity (such priors frequently arise in the analysis of reparameterization-invariant decision procedures, for example) — without any major conceptual modification in our analysis, simply by removing the normalization constraint (3). However, a problem arises: the zero measure, p(θ) ≡ 0, will always trivially satisfy the remaining constraints (2) and (4). This problem could potentially be ameliorated by introducing a convex regularizing term (or equivalently, a log-concave prior) on the total mass ∫ p(θ) dθ.

with the maximum taken over all xi which led to the estimate θ̂i. This setup is perhaps most appropriate for a two-alternative forced choice situation, where the problem is one of classification or discrimination, not estimation.

Mean-square and absolute-error regression: Our discussion assumes an even simpler form when the loss function D(., .) is taken to be squared error, D(x, y) = (x − y)², or absolute error, D(x, y) = |x − y|. In this case it is convenient to work with a slightly different noise model than the classification noise discussed above; instead, we may model the subject's responses as optimal plus estimation noise. For squared error, the optimal θ̂i is known to be uniquely defined as the conditional mean of θ given xi.
Thus we may replace the collection of linear inequality constraints (4) with a much smaller set of linear equalities (a single equality per trial, instead of one inequality per trial per z):

∫ p(xi|θ) (θ − θ̂i) p(θ) dθ = σηi;   (5)

the corresponding likelihood, again, has no non-global local maxima if η has a log-concave density. In the simplest case of Gaussian η, the maximum likelihood problem may be solved by standard nonnegative least squares (e.g., "lsqnonneg" or "quadprog" in Matlab). In the absolute-error case, the optimal θ̂i is given by the conditional median of θ given xi (although recall that the median is not necessarily unique here); thus, the inequality constraints (4) may again be replaced by equalities which are linear in p(θ):

∫_{−∞}^{θ̂i} p(θ) p(xi|θ) dθ − ∫_{θ̂i}^{∞} p(θ) p(xi|θ) dθ = σηi;

again, for Gaussian η this may be solved via standard nonnegative regression, albeit with a different constraint matrix. In each case, ηi retains its utility as an outlier score.

A worked example: learning the fairness of a coin

In this section we work through a concrete example, to show how to put the ideas discussed above into practice. We take perhaps the simplest possible example, for clarity: the subject observes some number N of independent, identically distributed coin flips, and on each trial i tells us his/her probability of observing tails on the next trial, given that t = t(i) tails were observed in the first i trials³.
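Anticipating the details below, the whole mean-square pipeline of equation (5) can be sketched for this coin paradigm in a few lines. The grid, the bimodal "subject" prior, the penalty weights, and the use of scipy's nnls in place of lsqnonneg/quadprog are all illustrative assumptions, and the simulated subject's reports are taken to be noiseless posterior means:

```python
import numpy as np
from scipy.optimize import nnls
from scipy.stats import binom

rng = np.random.default_rng(1)
theta = np.linspace(0.01, 0.99, 60)  # grid over p_tails

# Hidden "subject" prior: bimodal, i.e. the subject expects unfair coins.
prior = np.exp(-((theta - 0.2) ** 2) / 0.01) + np.exp(-((theta - 0.8) ** 2) / 0.01)
prior /= prior.sum()

# Simulate N flips of a fair coin; the subject reports exact posterior means.
N = 150
flips = rng.binomial(1, 0.5, size=N)
rows = []
for i in range(1, N + 1):
    t = flips[:i].sum()
    lik = binom.pmf(t, i, theta)
    post = lik * prior / (lik * prior).sum()
    theta_hat = post @ theta
    # Row of A from equality (5): integral of p(x_i|th)(th - th_hat) p(th) dth = 0
    rows.append(lik * (theta - theta_hat))
A = np.array(rows)

# Softly enforce normalization sum(p) = 1 and smoothness (first differences),
# then solve the nonnegative least-squares problem.
D = np.diff(np.eye(theta.size), axis=0)
A_aug = np.vstack([A, 10.0 * np.ones((1, theta.size)), 0.05 * D])
b_aug = np.concatenate([np.zeros(N), [10.0], np.zeros(D.shape[0])])
p_hat, _ = nnls(A_aug, b_aug)
```

How sharply p_hat pins down the bimodal shape depends on how informative the binomial likelihoods are across the grid; the constraint residual ||A p_hat||, however, should be near zero, since the true prior satisfies (5) exactly in this noiseless simulation.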
Here the likelihood functions p(xi|θ) take the standard binomial form

p(t(i)|p_tails) = C(i, t) p_tails^t (1 − p_tails)^{i−t},

with C(i, t) the binomial coefficient (note that it is reasonable to assume that these likelihoods are known to the subject, at least approximately, due to the ubiquity of binomial data).

Under our assumptions, the subject's estimates p̂_tails,i are given as the posterior mean of p_tails given the number of tails observed up to trial i. This puts us directly in the mean-square framework of equation (5); we assume Gaussian estimation noise η, and construct a regression matrix A of N rows, with the i-th row given by p(t(i)|p_tails)(p_tails − p̂_tails,i). To regularize our estimates, we add a small square-difference penalty of the form q[p(θ)] = ∫ |dp(θ)/dθ|² dθ. Finally, we estimate

p̂(θ) = arg min_{p ≥ 0; ∫₀¹ p(θ)dθ = 1} ||Ap||²₂ + ϵ q[p],

for ϵ ≈ 10⁻⁷; this estimate is equivalent to MAP estimation under a (weak) Gaussian prior on the function p(θ) (truncated so that p(θ) ≥ 0), and is computed using quadprog.m.

³We note in passing that this simple binomial paradigm has potential applications to ideal-observer analysis of classical neuroscientific tasks (e.g., synaptic release detection, or photon counting in retina) in addition to potential applications in psychophysics.

Figure 1: Learning the fairness of a coin (numerical simulation). Top panel: true prior distribution on coin fairness. The bimodal nature of this prior indicates that the subject expects coins to be unfair (skewed towards heads, p_tails < .5, or tails, p_tails > .5) more often than fair (p_tails = .5). Second: observed data. Open circles indicate the fraction of observed tails t(i)/i as a function of trial number i (the maximum likelihood estimate, MLE, of the fairness and a minimal sufficient statistic for this problem); + symbols indicate the subject's estimate of the coin's fairness, assumed to correspond to the posterior mean of the fairness under the subject's prior. Note the systematic deviations of the subject's estimate from the MLE; these deviations shrink as i increases and the strength of the prior relative to the likelihood term decreases. Third: binomial likelihood terms C(i, t) p_tails^t (1 − p_tails)^{i−t}. Color of trace corresponds to trial number i, as indicated in the previous panel (traces are normalized for clarity). Fourth: estimate of the prior given 150 trials. Black trace indicates the true prior (as in top panel); red indicates the estimate ±1 posterior standard error (computed via importance sampling). Bottom: tracking the evolution of the posterior. Black traces indicate the subject's true posterior after observing 0 (thin trace), 50 (medium trace), and 100 (thick trace) sample coin flips; as more data are observed, the subject becomes more and more confident about the true fairness of the coin (p = .5), and the posteriors match the likelihood terms (c.f. third panel) more closely. Red traces indicate the estimated posterior given the full 150 or just the last 100 or 50 trials, respectively (errorbars omitted for visibility).
Note that the procedure tracks the evolution of the subject's posterior quite accurately, given relatively few trials.

To place Bayesian confidence intervals around our estimate, we sample from the corresponding (truncated) Gaussian posterior distribution on p(θ) (via importance sampling with a suitably shifted, rescaled truncated Gaussian proposal density; similar methods are applicable more generally in the non-Gaussian case via the usual posterior approximation techniques, e.g., the Laplace approximation). Figs. 1-2 demonstrate the accuracy of the estimated p̂(θ); in particular, the bottom panels show that the method accurately tracks the evolution of the model subjects' posteriors as an increasing amount of data is observed.

Figure 2: Learning an unfair coin (p_tails = .25). Conventions as in Fig. 1.

Connection to neural population coding

It is interesting to note a connection to the neural population coding model studied in (14) (with more recent work reviewed in (15)). The basic idea is that neural populations encode not just stimuli, but probability distributions over stimuli (where the distribution describes the uncertainty in the state of the encoded object).
Here the experimentally observed data are neural firing rates, which provide constraints on the underlying encoded "prior" distribution in terms of the individual tuning function of each cell in the observed population. The simplest model is as follows: the observed spikes ni from the i-th cell are Poisson-distributed, with rate a nonlinear function of a linear functional of some prior distribution,

ni ∼ Poiss( g( ∫ p(θ) f(xi, θ) dθ ) ),

where the kernel f is considered as the cell's "tuning function"; the log-concavity of the likelihood of p is preserved for any nonlinearity g that is convex and log-concave, a class including linear rectifiers, exponentials, and power laws (and studied more extensively in (16)). Alternately, a simplified model is often used, e.g.:

ni ∼ q( (ni − ∫ p(θ) f(xi, θ) dθ) / σ ),

with q a log-concave density (typically Gaussian) to preserve the concavity of the log-likelihood; in this case, the scale σ of the noise does not vary with the mean firing rate, as it does in the Poisson model. In both cases, the observed firing rates act as constraints oriented linearly with respect to p; in the latter case, the noise scale σ sets the strength, or confidence, of each such constraint (2, 3).
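A minimal decoding sketch of the simplified (Gaussian-noise) model above, assuming Gaussian tuning functions and nonnegative least squares as the estimator; the cell count, tuning widths, and noise scale are illustrative choices, and the Poisson variant would instead maximize the corresponding concave log-likelihood by ascent:

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(2)
theta = np.linspace(-1, 1, 80)  # grid over the encoded stimulus variable

# Encoded distribution over theta (the object the population represents).
p_true = np.exp(-((theta - 0.3) ** 2) / 0.02)
p_true /= p_true.sum()

# Gaussian tuning functions f(x_i, theta) for 30 cells with preferred stimuli x_i.
centers = np.linspace(-1, 1, 30)
F = np.exp(-((centers[:, None] - theta[None, :]) ** 2) / 0.1)

# Simplified model: observed rates are linear functionals of p plus Gaussian noise.
rates = F @ p_true + 0.001 * rng.standard_normal(centers.size)

# Each observed rate is one linear constraint on p; decode with nonnegative
# least squares, softly enforcing normalization via an extra row of ones.
A = np.vstack([F, np.ones((1, theta.size))])
b = np.concatenate([rates, [1.0]])
p_hat, _ = nnls(A, b)
```

The fit reproduces the observed rates almost exactly; with only 30 cells and 80 grid points the decoded p_hat is of course not unique, which is precisely where the regularizers discussed earlier become useful.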
Thus, under this framework, given the simultaneously recorded activity of many cells {ni} and some model for the tuning functions f(xi, θ), we can infer p(θ) (and represent the uncertainty in these estimates) using methods quite similar to those developed above.

Directions

The obvious open avenue for future research (aside from application to experimental data) is to relax the assumptions: that the likelihood and cost function are both known, and that the data are observed directly (without any noise). It seems fair to conjecture that the subject can learn the likelihood and cost functions given enough data, but one would like to test this directly, e.g., by estimating D(., .) and p together, perhaps under restrictions on the form of D(., .). As emphasized above, the utility estimation problem has received a great deal of attention, and it is plausible to expect that the methods proposed here for estimating the prior might be combined with previously-studied methods for utility elicitation and estimation. It is also interesting to consider these elicitation methods in the context of experimental design (8, 17, 18), in which we might actively seek stimuli xi that maximally constrain the possible form of the prior and/or cost function.

References

1. D. Knill, W. Richards, eds., Perception as Bayesian Inference (Cambridge University Press, 1996).
2. Y. Weiss, E. Simoncelli, E. Adelson, Nature Neuroscience 5, 598 (2002).
3. Y. Weiss, D. Fleet, Statistical Theories of the Cortex (MIT Press, 2002), chap. Velocity likelihoods in biological and machine vision, pp. 77-96.
4. D. Kersten, P. Mamassian, A. Yuille, Annual Review of Psychology 55, 271 (2004).
5. K. Koerding, D. Wolpert, Nature 427, 244 (2004).
6. R. Hogarth, Journal of the American Statistical Association 70, 271 (1975).
7. J. Oakley, A. O'Hagan, Biometrika under review (2003).
8. P. Garthwaite, J. Kadane, A. O'Hagan, Handbook of Statistics (2004), chap. Elicitation.
9. A. Ng, S. Russell, ICML-17 (2000).
10. J. Blythe, AAAI02 (2002).
11. G. Strang, Linear Algebra and Its Applications (Harcourt Brace, New York, 1988).
12. Y. Rinott, Annals of Probability 4, 1020 (1976).
13. M. Henrion, et al., Why is diagnosis using belief networks insensitive to imprecision in probabilities?, Tech. Rep. SMI-96-0637, Stanford (1996).
14. R. Zemel, P. Dayan, A. Pouget, Neural Computation 10, 403 (1998).
15. A. Pouget, P. Dayan, R. Zemel, Annual Reviews of Neuroscience 26, 381 (2003).
16. L. Paninski, Network: Computation in Neural Systems 15, 243 (2004).
17. K. Chaloner, I. Verdinelli, Statistical Science 10, 273 (1995).
18. L. Paninski, Advances in Neural Information Processing Systems 16 (2003).
", "award": [], "sourceid": 2900, "authors": [{"given_name": "Liam", "family_name": "Paninski", "institution": null}]}