{"title": "Demystifying excessively volatile human learning: A Bayesian persistent prior and a neural approximation", "book": "Advances in Neural Information Processing Systems", "page_first": 2781, "page_last": 2790, "abstract": "Understanding how humans and animals learn about statistical regularities in stable and volatile environments, and utilize these regularities to make predictions and decisions, is an important problem in neuroscience and psychology. Using a Bayesian modeling framework, specifically the Dynamic Belief Model (DBM), it has previously been shown that humans tend to make the {\\it default} assumption that environmental statistics undergo abrupt, unsignaled changes, even when environmental statistics are actually stable. Because exact Bayesian inference in this setting, an example of switching state space models, is computationally intense, a number of approximately Bayesian and heuristic algorithms have been proposed to account for learning/prediction in the brain. Here, we examine a neurally plausible algorithm, a special case of leaky integration dynamics we denote as EXP (for exponential filtering), that is significantly simpler than all previously suggested algorithms except for the delta-learning rule, and which far outperforms the delta rule in approximating Bayesian prediction performance. We derive the theoretical relationship between DBM and EXP, and show that EXP gains computational efficiency by foregoing the representation of inferential uncertainty (as does the delta rule), but that it nevertheless achieves near-Bayesian performance due to its ability to incorporate a \"persistent prior\" influence unique to DBM and absent from the other algorithms. Furthermore, we show that EXP is comparable to DBM but better than all other models in reproducing human behavior in a visual search task, suggesting that human learning and prediction also incorporates an element of persistent prior. 
More broadly, our work demonstrates that when observations are information-poor, detecting changes or modulating the learning rate is both {\\it difficult} and (thus) {\\it unnecessary} for making Bayes-optimal predictions.", "full_text": "Demystifying excessively volatile human learning: A Bayesian persistent prior and a neural approximation

Chaitanya K. Ryali
Department of Computer Science and Engineering
University of California San Diego
9500 Gilman Drive La Jolla, CA 92093
rckrishn@eng.ucsd.edu

Gautam Reddy
Department of Physics
University of California San Diego
9500 Gilman Drive La Jolla, CA 92093
gnallama@physics.ucsd.edu

Angela J. Yu
Department of Cognitive Science
University of California San Diego
9500 Gilman Drive La Jolla, CA 92093
ajyu@ucsd.edu

Abstract

Understanding how humans and animals learn about statistical regularities in stable and volatile environments, and utilize these regularities to make predictions and decisions, is an important problem in neuroscience and psychology. Using a Bayesian modeling framework, specifically the Dynamic Belief Model (DBM), it has previously been shown that humans tend to make the default assumption that environmental statistics undergo abrupt, unsignaled changes, even when environmental statistics are actually stable. Because exact Bayesian inference in this setting, an example of switching state space models, is computationally intensive, a number of approximately Bayesian and heuristic algorithms have been proposed to account for learning/prediction in the brain. Here, we examine a neurally plausible algorithm, a special case of leaky integration dynamics we denote as EXP (for exponential filtering), that is significantly simpler than all previously suggested algorithms except for the delta-learning rule, and which far outperforms the delta rule in approximating Bayesian prediction performance.
We derive the theoretical relationship between DBM and EXP, and show that EXP gains computational efficiency by foregoing the representation of inferential uncertainty (as does the delta rule), but that it nevertheless achieves near-Bayesian performance due to its ability to incorporate a "persistent prior" influence unique to DBM and absent from the other algorithms. Furthermore, we show that EXP is comparable to DBM but better than all other models in reproducing human behavior in a visual search task, suggesting that human learning and prediction also incorporates an element of persistent prior. More broadly, our work demonstrates that when observations are information-poor, detecting changes or modulating the learning rate is both difficult and (thus) unnecessary for making Bayes-optimal predictions.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Introduction

Understanding how humans and animals make future predictions based on changing environmental statistics is an important problem in both neuroscience and psychology [1, 2, 3, 4, 5, 6]. Intriguingly, even when environmental statistics are stable, Bayesian models of human learning and prediction suggest a default human tendency to assume that statistical contingencies undergo abrupt, unsignaled changes, also known as "change points" [7, 8, 9, 10, 11, 12].
The behavioral consequence of this is that humans all too readily discard long-term knowledge in favor of recent, unexpected observations, leading to excessively volatile learning and prediction. It has been suggested that this default assumption of non-stationarity helps the brain to adapt when the environment is truly volatile [7]. Here, we propose another reason why a default assumption of volatility is difficult to overcome: one's ability to discern whether unexpected outcomes arise from change points or simply noise is fundamentally limited when the observations are very noisy (e.g. when they are binary as opposed to real-valued) [7]. We focus on the categorical (including binary) case, in which the information-poor observations (in a predictive information sense [13]) make the detection of change points and the estimation of hidden variables particularly difficult.

Previously, Bayesian and Bayes-inspired models of varying complexity have been suggested to capture human learning and prediction behavior while making implicit [3, 7] or explicit [1, 5] predictions among categorical choices. The most complex of these is exact Bayes [1, 3, 5, 7], such as the Dynamic Belief Model (DBM) [7], a hidden Markov model that assumes the observations to be drawn from a Bernoulli (if binary) [7] or categorical (if more than two outcomes) [11] distribution, whose parameters undergo abrupt, unsignaled changes from time to time. An alternative Bayesian model is the Fixed Belief Model (FBM) [7], which assumes environmental statistics to be fixed over time (no change points). It has been found that DBM captures human behavior better than FBM, even though the latter more veridically captures the experimental design in a variety of tasks, e.g. 2-alternative forced choice [7], inhibitory control [9, 10], multi-armed bandit [12, 14], and visual search [11].
However, exact learning/prediction in DBM is computationally intensive, given that it is an example of switching state space models [15]. Consequently, several approximate and heuristic learning rules have also been proposed [1, 16, 17], all of which make some claim to neural plausibility and probabilistic interpretation. Separately, very simple, non-probabilistic forms of learning rules have also been used to model online learning in the brain. We explore two of them here: (1) a delta-learning rule [18, 19], also known as Q-learning or reinforcement learning (RL) in the neuroscience literature [5], and (2) a variant of exponential filtering (EXP) [2, 7], equivalent to a particular form of leaky-integrating neuronal dynamics [7].

Although all of the algorithms described above have been used to model sequential learning and prediction in the brain, there has been little theoretical analysis of the statistical relationship among them, or a systematic validation comparing them on the same set of behavioral data. In this work, we present just such a theoretical analysis and human data comparison [11].

The rest of the paper is organized as follows. In section 1, we will formally describe how the different algorithms learn online from binary data and make predictions about upcoming data. In section 2, we will present a theoretical analysis of the various algorithms and their relationships to each other. In section 2.5, we will extend the results to m-ary data. In section 3, we will compare model performance in terms of their ability to predict human behavior in a visual search task [11]. In section 4, we will discuss implications, links to related work, and future work.

1 Learning Models

In this section, we formally describe the learning models: the first two are principled Bayesian models, while the latter two are simple, mechanistic algorithms commonly used in neuroscience and psychology. Here, we assume the observations $x_t$ are binary.
In a later section, we will show that our results easily generalize to the m-ary case.

1.1 Dynamic Belief Model (DBM)

The Dynamic Belief Model (DBM) is a hidden Markov model that assumes the observations are drawn from a Bernoulli distribution whose rate parameter undergoes unsignaled changes with probability $1-\alpha$ at each time step.

Generative Model. The hidden variable $\gamma_t$ denotes the probability of $x_t = 1$ and has a Markovian dependence on $\gamma_{t-1}$:

$$p(\gamma_t = \gamma \mid \gamma_{t-1}) = \alpha\,\delta(\gamma - \gamma_{t-1}) + (1-\alpha)\,p_0(\gamma), \quad (1)$$

i.e., $\gamma_t$ remains the same ($\gamma_t = \gamma_{t-1}$) with a fixed probability $\alpha$, and is redrawn from the prior $p_0(\gamma) = \mathrm{Beta}(\gamma; a, b)$ with probability $1-\alpha$.

Recognition model. The prior $p(\gamma_t \mid x_{1:t-1})$ and the posterior $p(\gamma_t \mid x_{1:t})$ are recursively computed:

$$p(\gamma_t = \gamma \mid x_{1:t-1}) = \alpha\,p(\gamma_{t-1} = \gamma \mid x_{1:t-1}) + (1-\alpha)\,p_0(\gamma_t = \gamma), \quad (2)$$
$$p(\gamma_t \mid x_{1:t}) \propto p(x_t \mid \gamma_t)\,p(\gamma_t \mid x_{1:t-1}). \quad (3)$$

Prediction. The predictive probability for trial $t+1$, given the past observations $x_{1:t}$, is computed as

$$P_{\mathrm{DBM},t+1} \triangleq P(x_{t+1} = 1 \mid x_{1:t}) = \int \gamma\, p(\gamma_{t+1} = \gamma \mid x_{1:t})\, d\gamma = E_{p(\gamma_{t+1} \mid x_{1:t})}[\gamma], \quad (4)$$

and has an implicit marginalization over every possible timing of the most recent change point. In practice, one can either marginalize over the timing of the last change point, or discretize the belief state (posterior distribution over $\gamma_t$). Thus, the computation of the predictive probabilities is computationally and representationally expensive.

1.2 Fixed Belief Model (FBM)

FBM is a special case of the DBM with no change points, i.e. $\alpha = 1$. It is simply a beta-Bernoulli process.
The posterior and predictive probabilities are:

$$p(\gamma \mid x_{1:t}) \propto P(x_{1:t} \mid \gamma)\, p(\gamma) = \gamma^{\sum x_\tau + a - 1}(1-\gamma)^{\sum \bar{x}_\tau + b - 1}; \qquad P_{\mathrm{FBM},t+1} \triangleq \frac{a + \sum x_\tau}{a + b + t}, \quad (5)$$

where $\bar{x}_\tau \triangleq 1 - x_\tau$.

1.3 Exponential Filtering (EXP)

EXP is a simple algorithm that linearly sums past observations, while exponentially discounting into the past [2], to predict the probability of encountering different outcomes on the next trial [7]:

$$P_{\mathrm{EXP},t+1} \triangleq P_{\mathrm{EXP}}(x_{t+1} = 1 \mid x_{1:t}) = C + \eta\beta \sum_{\tau=0}^{t-1} \beta^\tau x_{t-\tau} = C(1-\beta) + \eta\beta x_t + \beta P_{\mathrm{EXP},t}, \quad (6)$$

where the parameters $(C, \eta, \beta)$ are constrained as $0 \le C, \eta \le 1$, $0 \le \beta < 1$, $C + \frac{\eta\beta}{1-\beta} < 1$. This model was introduced in relation to DBM [7], inspired by related work showing that monkeys' choices when tracking reward biases that undergo change points are discounted in an approximately exponential fashion [2]. The last expression in Eq. 6 shows how it can be implemented by correctly tuned leaky integration dynamics (in a single neuron!) [7]: the first term is a constant bias, the second "feedforward" term depends on the current input ($\eta\beta$ specifies the weight on the input), and the third "recurrent" term depends on the previous state ($\beta$ specifies the weight of the recurrent term).

1.4 Delta-Learning Rule (RL)

The delta-learning rule, a form of simple Q-learning or reinforcement learning (RL) [19], is commonly used for online learning in both neuroscience [18, 5] and machine learning [19]. Here, we adapt it to estimate predictive probabilities (Eq. 7). Note that this version of RL is similar in form to EXP.
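To make the four learners of section 1 concrete, here is a minimal NumPy sketch (our own illustration, not the authors' code): DBM is implemented on a discretized belief grid over $\gamma$, as suggested under Eq. 4, FBM by beta-Bernoulli counting (Eq. 5), and EXP and RL by their one-line recursions (Eqs. 6, 7). Function names, grid resolution, and initializations are our own illustrative choices.

```python
import numpy as np

def dbm_predict(x, alpha=0.7, a=1.0, b=1.0, n_grid=200):
    """DBM (Eqs. 2-4): exact Bayes on a discretized gamma grid."""
    grid = np.linspace(0.5 / n_grid, 1 - 0.5 / n_grid, n_grid)
    p0 = grid ** (a - 1) * (1 - grid) ** (b - 1)
    p0 /= p0.sum()                          # discretized Beta(a, b) prior
    belief = p0.copy()                      # p(gamma_t | x_{1:t-1})
    preds = np.empty(len(x))
    for t, xt in enumerate(x):
        preds[t] = grid @ belief            # Eq. 4: predictive mean
        post = (grid if xt else 1 - grid) * belief
        post /= post.sum()                  # Eq. 3: Bayesian update
        belief = alpha * post + (1 - alpha) * p0  # Eq. 2: persistent prior
    return preds

def fbm_predict(x, a=1.0, b=1.0):
    """FBM (Eq. 5): P_{t+1} = (a + #1s so far) / (a + b + t)."""
    ones = np.concatenate(([0.0], np.cumsum(x)[:-1]))
    return (a + ones) / (a + b + np.arange(len(x)))

def exp_predict(x, C, eta, beta):
    """EXP (Eq. 6): P_{t+1} = C(1 - beta) + eta*beta*x_t + beta*P_t."""
    P, preds = C, np.empty(len(x))          # empty-history filter value is C
    for t, xt in enumerate(x):
        preds[t] = P
        P = C * (1 - beta) + eta * beta * xt + beta * P
    return preds

def rl_predict(x, eps, P_init=0.5):
    """Delta rule (Eq. 7): P_{t+1} = eps*x_t + (1 - eps)*P_t."""
    P, preds = P_init, np.empty(len(x))
    for t, xt in enumerate(x):
        preds[t] = P
        P = eps * xt + (1 - eps) * P
    return preds
```

Using the parameter mapping derived later in the paper ($\beta = \alpha\frac{a+b}{a+b+1}$, $\eta = \frac{1}{a+b}$, $C = \frac{(1-\alpha)P_0}{1-\beta}$), the EXP output tracks the exact DBM prediction closely on arbitrary binary sequences when $\alpha$ is moderate.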
It has a feedforward term and a recurrent term, one parameter to trade off between the two, and no bias term.

$$P_{\mathrm{RL},t+1} \triangleq \epsilon x_t + (1-\epsilon)P_{\mathrm{RL},t}. \quad (7)$$

2 Relationship Among the Models

In this section, we analyze the relationship among the models. We will first show that while DBM online prediction can be viewed as a delta-like learning rule with an adaptive gain, EXP, with a constant learning rate, can nevertheless approximate DBM well under certain conditions. We will also show when and why EXP outperforms RL, as well as how the parameters of EXP can be tuned online in a neurally plausible manner. Finally, we will analyze the parameter regime under which the DBM ≈ EXP approximation breaks down.

2.1 DBM Prediction as an Adaptive Delta Rule

The exact, nonlinear Bayesian update rule for the predictive probability $P_{\mathrm{DBM},t+1}$, denoted as $P_{t+1}$ in this section for concision, may also be written as:

$$P_{t+1} = (1-\alpha)\langle\gamma\rangle_{p_0(\gamma)} + \alpha x_t \frac{Q_t - P_t^2}{P_t(1-P_t)} + \alpha P_t \frac{P_t - Q_t}{P_t(1-P_t)} \quad (8)$$
$$= (1-\alpha)P_0 + \alpha x_t G_t + \alpha P_t(1-G_t) = (1-\alpha)P_0 + \alpha(P_t + G_t(x_t - P_t)), \quad (9)$$

where $Q_t \triangleq E_{p(\gamma_t \mid x_{1:t-1})}[\gamma^2]$, $P_0 \triangleq E_{p_0(\gamma)}[\gamma]$ and $G_t \triangleq \frac{Q_t - P_t^2}{P_t(1-P_t)} = \frac{\mathrm{var}(\gamma_t \mid x_{1:t-1})}{\mathrm{var}(\mathrm{Bern}(P_t))}$. The form in Eq. (9) is reminiscent of the delta rule: $G_t$ ($0 \le G_t \le 1$, for any binary sequence $x_{1:t}$) acts like an adaptive learning rate, governing the trade-off between new data $x_t$ and the previous predictive mean, $P_t$; an additional parameter $\alpha$ governs the trade-off between this combined prediction and a constant bias $P_0$, which inserts a persistent prior influence due to the recurring probability of $\gamma$ being re-sampled.

Intuitively, $G_t$ is modulated by how "surprising" recent observations are.
Surprising recent observations, i.e. those inducing large prediction errors, could indicate a switch in environmental statistics, prompting an increase in the learning rate. However, categorical data are information-poor, making prompt detection of a true change in the environment difficult. This suggests that the Bayesian update rule for predicting future outcomes can be simplified by approximating $G_t$ with an appropriate constant. These intuitions are formalized in the following theorem.

Theorem 1. The adaptive learning rate $G_t$ has the following property:

$$1 - G_t = (1 - G) + \alpha c_\alpha(-a\bar{x}_{t-1} + b x_{t-1}) + O(\alpha^2), \quad (10)$$

where $G = \frac{1}{a+b+1}$ and $c_\alpha = \frac{(a^2 - b^2)}{ab(a+b+1)^2(a+b+2)}$. Approximating $G_t$ by $G$ yields a linear update rule for the predictive probability $P_{t+1}$, correct to $O(\alpha^2)$:

$$P_{t+1} = (1-\alpha)P_0 + \alpha(G x_t + P_t(1 - G)) + O(\alpha^2). \quad (11)$$

Proof. We rewrite the update rule (9) for the predictive probability $P_{t+1}$ as:

$$P_{t+1} = (1-\alpha)P_0 + \alpha(P_t + G_t(x_t - P_t)) = (1-\alpha)P_0 + \alpha L_t,$$

where $L_t \triangleq x_t G_t + P_t(1 - G_t)$. Analogous to the update rule for $P_t$ (Eq. 8), $Q_t$ has the update rule:

$$Q_{t+1} = (1-\alpha)Q_0 + \alpha x_t \frac{R_t - Q_t P_t}{P_t(1-P_t)} + \alpha Q_t \frac{Q_t - R_t}{Q_t(1-P_t)}, \quad (12)$$

where $R_t \triangleq E_{p(\gamma_t \mid x_{1:t-1})}[\gamma^3]$.
Next, we make $O(\alpha^2)$ approximations to the denominator $P_t(1-P_t)$ and the numerator $(P_t - Q_t)$ of $1 - G_t$:

$$P_t(1-P_t) = P_0\bar{P}_0 + \alpha[\bar{P}_0 L_{t-1} + P_0\bar{L}_{t-1} - 2P_0\bar{P}_0] + \alpha^2(P_0 - L_{t-1})(\bar{P}_0 - \bar{L}_{t-1}) \stackrel{(*)}{=} P_0\bar{P}_0 + \alpha\frac{(a-b)}{(a+b)^2(a+b+1)}(-a\bar{x}_{t-1} + b x_{t-1}) + O(\alpha^2), \quad (13)$$

$$P_t - Q_t \stackrel{(*)}{=} (P_0 - Q_0) - \alpha\frac{(a-b)}{(a+b)(a+b+1)(a+b+2)}(-a\bar{x}_{t-1} + b x_{t-1}) + O(\alpha^2), \quad (14)$$

where $\bar{P}_0 = 1 - P_0$, $\bar{L}_{t-1} = 1 - L_{t-1}$, and $(*)$ follows by setting $P_{t-1} = P_0 + O(\alpha)$, $Q_{t-1} = Q_0 + O(\alpha)$, $R_{t-1} = R_0 + O(\alpha)$. Upon substituting the approximations (13), (14) for $P_t(1-P_t)$ and $(P_t - Q_t)$, and using $(l_0 + \alpha l_1 + O(\alpha^2))^{-1} = l_0^{-1} - \alpha l_1 l_0^{-2} + O(\alpha^2)$, the $O(\alpha^2)$ approximation (10) for $G_t$ directly follows.

Setting $G_t = G + O(\alpha)$ in (9) gives (11), the linear update rule for the predictive probability $P_{t+1}$ correct to $O(\alpha^2)$.

Based on the theorem, $G_t$ can be approximated as a constant with $O(\alpha)$ error, or as a linear function of the last observation with $O(\alpha^2)$ error; the corresponding linear update rule has either $O(\alpha^2)$ or $O(\alpha^3)$ error, respectively. Furthermore, $|c_\alpha(-a\bar{x}_{t-1} + b x_{t-1})|$ can be shown to be upper bounded by the small number 0.062 (proof omitted) for $a, b \ge 1$, so replacing $G_t$ by $G$ should work well in practice.

As a corollary to the theorem, for a uniform prior $a = b = 1$, the $O(\alpha)$ term in $1 - G_t$ is exactly zero, so that replacing $G_t$ by $G$ incurs only $O(\alpha^3)$ error. In many behavioral tasks (e.g.
2-alternative forced choice), a uniform prior is a reasonable choice; we employ a uniform prior for all simulations in the paper. All these results imply that the approximations are particularly accurate when $\alpha$ is relatively small.

On a separate note, the proof, and thus the approximation, does not make any specific generative assumptions about the sequence $x_{1:t}$, and is therefore valid for arbitrary binary sequences. In other words, the constant-$G_t$ approximation is valid for arbitrary environments and does not depend on whether humans truly have mis-specified generative assumptions, as has previously been suggested [7, 11, 12].

2.2 Relationship of EXP to DBM and RL

We define EXP using Eq. 11. We will show that while two critical features of DBM – exponential discounting of past observations and "persistent" influence of the prior – are captured by EXP, only the former is captured by RL. Moreover, in volatile environments (relatively small $\alpha$, which appears to be the default assumption for humans, see sec. 4), EXP will be shown to be especially effective at approximating DBM, while also enjoying a particular advantage over RL.

Eq. 11 shows how the parameters of EXP are related to those of DBM: $\beta = \alpha\frac{a+b}{a+b+1}$, $\eta = \frac{1}{a+b}$, $C = \frac{(1-\alpha)P_0}{1-\beta}$. In other words, the exponential discount parameter of EXP, $\beta$, is proportional to the volatility parameter $\alpha$ in DBM (for a uniform prior, $\beta \approx \frac{2}{3}\alpha$, matching a previous conjecture [7]), and the constant bias, $(1-\alpha)P_0$, is proportional to the prior mean $P_0$ and thus injects a persistent additive influence of the prior. In a set of simulations with $\alpha = 0.7$ (similar to values found in humans, see sec. 4), we regress $P_{\mathrm{DBM},t+1}$ against past observations $x_t, x_{t-1}, \ldots$
, and find that our analytical EXP approximation closely matches both DBM and the best freely fitted EXP (best linear estimator) (Fig. 1a). We also see that this excellent performance is underpinned by an approximately constant learning rate $G_t$ that is quite insensitive to the timing of true change points (Fig. 1b). Indeed, EXP approximates DBM equally well whether or not there is a switch on the last time step (Fig. 1c).

We can gain additional intuition about DBM (and EXP) by noticing that the parameter $C$ is the lower bound on $P_t$, determined by the stability of the environment $\alpha$ and the prior $p_0(\gamma)$. This lower bound is attained asymptotically in the limit of observing an infinite sequence of 0's. Similarly, the upper bound in the limit of observing infinitely many 1's is $C + \frac{\eta\beta}{1-\beta}$ (see the last ten trials in Fig. 1a). This bounded behavior is characteristic of DBM, and well captured by appropriately parameterized EXP.

Like EXP, RL has an element of exponential discounting, as we can write $P_{\mathrm{RL},t+1} = \sum_{\tau=0}^{t-1} \epsilon(1-\epsilon)^\tau x_{t-\tau}$. However, RL has no analytic setting for its free parameter $\epsilon$, and parameter fitting yields a discount behavior different from DBM and EXP (Fig. 1d). An even bigger problem is that RL cannot capture a persistent prior influence, due to the lack of a bias term. In a small-$\alpha$ environment, the persistent prior influence is especially critical (Eq. 9), and EXP enjoys a particular advantage over RL (Fig. 1a). This pattern also translates to the behaviorally more relevant measure of predictive accuracy (Fig. 1e), which assumes the observer makes a binary outcome prediction (by taking the max) based on the predictive probabilities.

It is worth noting that even in this regime of relatively frequent switches, prediction is not trivial, in that it depends sensitively on the data and not only the prior (Fig.
1a;d); a prediction algorithm that relies only on the prior performs poorly (Fig. 1e). Indeed, smaller $\alpha$ makes DBM and EXP especially sensitive to local statistics (recent data), since they are more willing to discard long-term statistics due to the stronger belief that environmental statistics can drastically change at any time.

2.3 Adapting to Volatility

Humans appear to be able to adapt their choice behavior according to the changing volatility ($1-\alpha$) of the environment [5]. Exact Bayesian inference is computationally intensive. However, the EXP approximation to DBM, denoted $\hat{P}_t$, permits a simple, principled update rule for $\alpha$ via stochastic gradient descent

Figure 1: Simulation results: validity of the EXP approximation. Data generated from DBM ($\alpha = 0.7$); (a-d): $m = 2$ (binary data), $p_0(\gamma) = \mathrm{Beta}(1, 1)$. (a) Exact and approximate predictive probabilities (of observing 1) for an example sequence of synthetic data (1's depicted by blue dots, 0's not shown). (b) Exact $G_t$ and approximate $G$ learning rates for an example sequence; black dots denote true change points. (c) Approximate predictive probability $\hat{P}_t$ (EXP) versus exact $P_t$ (DBM), following no change point (blue) or a change point (red). (d) DBM dependence on previous observations (blue: linear regression coefficients) is approximately exponential (green), and well approximated by EXP (red). Fitted RL yields a very different exponential curve (purple). (e) Predictive accuracy (fraction of correct predictions): DBM ≈ EXP ≈ EXP fitted > RL > FBM. (f) Analogous to (a) but for $m = 4$ and $p_0(\gamma) = \mathrm{Dir}(1, 1, 1, 1)$.
Different colors represent the four outcomes.

[7]:

$$\hat{\alpha} \leftarrow \hat{\alpha} + \epsilon(x_t - \hat{P}_t)\frac{d\hat{P}_t}{d\hat{\alpha}}; \qquad \frac{d\hat{P}_t}{d\hat{\alpha}} = \hat{P}_{t-1} + G(x_t - \hat{P}_{t-1}) - P_0. \quad (15)$$

2.4 Breakdown of DBM ≈ EXP

For $\alpha \approx 1$, EXP is not a good approximation to the exact-Bayes predictive probabilities (Fig. 2a). Indeed, for a stable FBM environment ($\alpha = 1$), $G_t = \frac{1}{a+b+t}$, which is clearly not constant. However, fitting EXP's parameters freely still performs close to DBM (Fig. 2a), while the deviation between our analytical approximation of the discount parameter $\beta$ and the best-fitting $\beta$ grows as a function of $\alpha$ (Fig. 2c). For larger $\alpha$, even though $G_t$ increases more after a true change point, because real change points are rare, their influence is minor relative to the stable value of $G_t$ in between change points. In any case, in terms of the behaviorally more relevant predictive accuracy measure (analogous to Fig. 1e), EXP still approximates DBM well (Fig. 2b). Interestingly, fitted RL also approximates DBM well (Fig. 2a;b), since the persistent prior influence in Eq. 11 is more negligible. This makes a broader point about prediction in stable but noisy environments: simple, cheap prediction algorithms can perform well relative to complex models, since each data point contains little information and there are many consecutive opportunities to learn from the data.

2.5 Generalization to m-ary Data

DBM and EXP easily extend to m-ary data. We assume a Dirichlet prior $p_0(\gamma) = \mathrm{Dir}(a)$, where $a = (a_1, \ldots, a_m)$, $a_k \ge 1$. We say $x^{(i)}_t = 1$ if the observation on trial $t$ is category $i$. Denoting $P^{(i)}_{t+1} \triangleq P(x^{(i)}_{t+1} = 1 \mid x^{(i)}_{1:t})$ and $a_{-i} \triangleq \sum_{k \ne i} a_k$, the following corollary is easy to show (proof omitted):

Corollary 1.
The adaptive learning rate $G_t$ has the following property:

$$1 - G_t = (1 - G) + \alpha c^{(i)}_\alpha(-a_i\bar{x}^{(i)}_{t-1} + a_{-i}x^{(i)}_{t-1}) + O(\alpha^2), \quad (16)$$

where $G = \frac{1}{\sum_k a_k + 1}$ and $c_\alpha = \frac{(a_i^2 - a_{-i}^2)}{a_i a_{-i}(\sum_k a_k + 1)^2(\sum_k a_k + 2)}$. Approximating $G_t$ by $G$ yields a linear update rule for the predictive probability $P^{(i)}_{t+1}$, correct to $O(\alpha^2)$:

$$P^{(i)}_{t+1} = (1-\alpha)P^{(i)}_0 + \alpha(G x^{(i)}_t + P^{(i)}_t(1 - G)) + O(\alpha^2). \quad (17)$$

Though we maintain approximate updates for each $P^{(i)}_t$ separately, it is easily shown by induction that normalization is preserved, $\sum_{i=1}^m P^{(i)}_t = 1$.

Figure 2: Simulation results: large $\alpha$. DBM parameters: $m = 2$, $p_0(\gamma) = \mathrm{Beta}(1, 1)$. (a) Exact and approximate predictive probabilities, analogous to Fig. 1a; $\alpha = 0.95$. (b) Predictive accuracy, analogous to Fig. 1e, on the scale of $1/(1-\alpha)$ instead of $\alpha$ to better visualize performance for large $\alpha$. (c) Comparison of the analytical approximation of $\beta$ versus the best fitted $\beta$, as a function of $\alpha$; the fitted $\beta$ deviates from our approximation for larger $\alpha$.
Since identical bounds on the coefficient of the $O(\alpha)$ term in $G_t$ hold for $m > 2$, the quality of the approximation will be the same as in the binary case, so that in volatile environments, near-Bayesian prediction can be achieved simply by using $m - 1$ separate linear-exponential filters with no recurrent or complex interactions among the alternatives. We will use this novel m-EXP model in section 3 to model human data.

3 Case Study: Visual Search Task

We evaluate the models by comparing them to human behavior in a visual search task [11]. The objective of the task is to find the target among three stimuli (the target is a random-dot patch moving in the direction opposite to the two distractor patches, see Fig. 3a). The location of the target on each trial is drawn independently from a fixed distribution (1/13, 3/13, 9/13). We collapse the spatial configuration and refer to the patches corresponding to prior probabilities 1/13, 3/13, and 9/13 as patches 1, 3 and 9, respectively. The spatial configuration is fixed within a block (90 trials per block), and counter-balanced across blocks for each subject. Eye movements are tracked; we analyze only the first-fixation location here, as an indication of which location a subject currently perceives as most likely to contain the target. Subsequent fixations are much more complex, being "contaminated" by sensory and motor processes [20]. Subjects are given feedback on the true target location on each trial. The data, from 11 subjects, are from [11].

3.1 Model Fitting

Learning of environmental statistics by the participants is modeled using each of DBM, EXP, RL and FBM. DBM and FBM both assume an uninformative prior $p_0(\gamma) = \mathrm{Dir}(\gamma; 1, 1, 1)$. Since the actual spatial configuration is fixed over a block, FBM is the correct generative model.
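As a concrete sketch of the m-EXP learner used in this fit (Eq. 17), one linear-exponential filter per category suffices, and normalization is preserved automatically; this is our own illustrative implementation, with the function name and default parameter values chosen for the example rather than taken from the paper.

```python
import numpy as np

def m_exp_predict(seq, alpha=0.7, a=(1.0, 1.0, 1.0)):
    """m-EXP (Eq. 17): one linear-exponential filter per category.
    seq: iterable of observed category indices in {0, ..., m-1}."""
    a = np.asarray(a, dtype=float)
    m = len(a)
    P0 = a / a.sum()                 # Dirichlet prior means
    G = 1.0 / (a.sum() + 1.0)        # constant learning rate (Corollary 1)
    P = P0.copy()
    preds = np.empty((len(seq), m))
    for t, i in enumerate(seq):
        preds[t] = P                 # prediction for trial t, before feedback
        onehot = np.zeros(m)
        onehot[i] = 1.0              # x_t^{(i)} indicator of the observed category
        P = (1 - alpha) * P0 + alpha * (G * onehot + (1 - G) * P)  # Eq. 17
    return preds
```

With the uninformative prior $a = (1, 1, 1)$ used here, $G = 1/4$, and each row of the returned array is a proper probability vector over the three target locations, which can then be mapped to choice fractions by the decision rule described in the text.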
The probability of the first-fixation choice (choice fraction) $q_{t,i}$ at time $t$ is modeled by a polynomial softmax [4, 11] as

$$q_{t,i} = \frac{(P^{(i)}_{t+1})^\beta}{\sum_i (P^{(i)}_{t+1})^\beta}.$$

We fit the learning and decision-making models at an individual level by maximizing the likelihood of first-fixation choices (averaged over trials). Each of DBM, EXP and RL has one free parameter ($\alpha$ for DBM/EXP, $\epsilon$ for RL), while FBM has none. The learning rate $G$ in EXP is set to $\frac{1}{\sum_k a_k + 1} = \frac{1}{4}$ according to the main theorem.

3.2 Results

As shown in [11], the aggregate choice statistics appear to correspond to matching [21] but belie the more complex temporal patterns in choice behavior. In Fig. 3b, note that when the previous target was 1 or 3, the first-fixation choice fractions on the next trial show a much higher choice fraction for 1 or 3, respectively. This bar graph is re-plotted in a different representation in Fig. 3c, where each choice distribution is represented by a point in the 2D probability simplex (2D because the three probabilities add up to 1), affine-transformed to achieve symmetry across the three choices. We see that, in comparison to the case when the last target was location 9, human choice fractions on the current trial are pulled toward 1 or 3, when the last target was 1 or 3, respectively. DBM and EXP are biased to a similar extent. However, FBM, which asymptotically ignores the last data point, shows very little variation in average choice distribution as a function of the last target location. RL, which shares the exponential discounting element of DBM and EXP but not the persistent prior component, exhibits some influence of the last target location, but not as much as humans/DBM/EXP.
Note that all model results are on held-out data, and therefore independent of model complexity.

Figure 3: Model comparison to human data in the visual search task. (a) Schematic of the task. (b) Human choice fractions conditioned on the last target location. (c) Model-predicted choice fractions and human choice fractions, on an affine transformation of the probability simplex. Last target location: 1 - △, 3 - ◦, 9 - □. Model predictions are based on the actual sequences of held-out stimuli that subjects experienced, in 6-fold maximum-likelihood cross-validation. Error bars = SEM over subjects.

4 Discussion

We have shown that the DBM-like human learning/prediction found in previous studies can be implemented by appropriately tuned leaky-integrating neuronal dynamics (EXP). While we derived an analytical form for the appropriate EXP parameters for volatile environments, we have also shown that even for less volatile environments, where our analytical approximation does not hold, the empirically fitted EXP still achieves near-Bayes performance. This leaves open the possibility that the brain may utilize EXP-like learning for quite a large range of possible volatility, via feedback-driven incremental tuning. In any case, in previous tasks where human behavior has been shown to be fitted well by DBM, the fitted $\alpha$ ranges between 0.7 and 0.8 [7, 11, 12, 9, 10] – in the range where our analytically derived EXP would perform very close to Bayes-optimal.

Our work demystifies human learning [7, 14, 11, 12, 9, 10] by decomposing DBM into two simple mechanistic components. We showed that EXP approximates DBM well in all but extremely stable environments, and does so via both exponential discounting of past observations and a persistent influence of a prior bias that is injected on every trial.
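The feedback-driven incremental tuning mentioned above can be sketched directly from Eq. 15. The following is our own illustrative implementation, not the authors' code: the step size, initialization, and clipping of $\alpha$ to $[0, 1]$ are assumptions, and the gradient is truncated as in Eq. 15, treating the previous prediction as a constant with respect to $\alpha$.

```python
import numpy as np

def adapt_alpha(x, alpha=0.5, step=0.05, a=1.0, b=1.0):
    """Online tuning of the stability parameter alpha by the stochastic
    gradient rule of Eq. 15, run over a binary sequence x."""
    P0 = a / (a + b)              # prior mean
    G = 1.0 / (a + b + 1.0)       # constant learning rate from Theorem 1
    P = P0                        # current prediction \hat{P}_t
    dP = 0.0                      # truncated gradient d\hat{P}_t / d\alpha
    for xt in x:
        alpha += step * (xt - P) * dP      # Eq. 15: prediction error x gradient
        alpha = min(max(alpha, 0.0), 1.0)  # keep alpha a valid probability
        dP = P + G * (xt - P) - P0         # gradient for the next prediction
        P = (1 - alpha) * P0 + alpha * (G * xt + (1 - G) * P)  # EXP (Eq. 11)
    return alpha
```

The exact indexing of the gradient term in Eq. 15 admits more than one reading; here we update $\alpha$ with the error on the current observation against the prediction made before seeing it, which keeps the rule purely online and neurally plausible in the same sense as EXP itself.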
Our work shows that when observations are information-poor, detecting changes or modulating the learning rate explicitly or implicitly (e.g. by discretizing the belief state space [7] or averaging over possible change point times [22, 17]) is both difficult and (thus) unnecessary for making Bayes-optimal predictions. In practice, m-ary DBM is typically implemented via discretization of the belief state space [7, 11, 9, 10], which has a computational and representational complexity of O(e^{km}) per observation, where k depends on the fineness of the discretization, while the near-exact Bayesian approximation EXP is only O(m).

We found that DBM and EXP both explain human choice behavior in a visual search task [11] better than RL, which has exponential discounting but no persistent prior influence, and FBM, which has neither. This is broadly consonant with our related finding that DBM not only provides a better trial-by-trial account of human choices in a multi-armed bandit task, but is also able to recover a systematic underestimation in human prior reward rate expectation – this "pessimism bias" is incompletely captured by RL and FBM [14]. Together, these findings suggest that a more comprehensive comparison of these models in their ability to capture diverse behavioral patterns is needed in the future.

Note that we are not suggesting that humans can only do prediction with a constant learning rate. In "information-abundant" settings, change-points are relatively easy to detect, and their detection is critical for Bayes-optimal learning and prediction.
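To make the O(m) per-observation cost of EXP concrete, here is a minimal sketch of a leaky-integration update of the form described above: each categorical observation is mixed in with a constant gain (exponential discounting), and a fixed prior is re-injected on every trial (the persistent prior). The specific gain and prior-weight values below are illustrative placeholders, not the constants derived in our main theorem.

```python
import numpy as np

def exp_update(p, x_onehot, gain, prior, prior_weight):
    """One EXP-style step over m categories: O(m) work per observation.

    Exponential discounting of past observations, plus a persistent
    prior injected on every trial (illustrative constants).
    """
    p = (1.0 - gain) * p + gain * x_onehot              # leaky integration
    p = (1.0 - prior_weight) * p + prior_weight * prior  # persistent prior
    return p / p.sum()                                   # valid distribution

m = 3
prior = np.full(m, 1.0 / m)  # uniform prior over m outcomes
p = prior.copy()
rng = np.random.default_rng(0)
for _ in range(100):
    x = np.eye(m)[rng.integers(m)]  # one-hot categorical observation
    p = exp_update(p, x, gain=0.3, prior=prior, prior_weight=0.1)
```

Note that, in contrast, a discretized DBM must update a full belief grid over the m-dimensional probability simplex on every trial; the sketch above touches only m numbers.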
We have done separate simulations (data not shown) showing that, in comparison to binary or categorical data, when the mutual information between hidden state and observations is high, Bayesian detection of change points can be highly accurate, and the corresponding "learning rate" of its equivalent leaky-integrating update equation significantly increases after detecting such a change. In these scenarios, the EXP approximation with a constant learning rate would clearly do a poor job. Indeed, there is evidence that in information-abundant settings, the human learning rate may be modulated by uncertainty [16], and subjects are able to detect change points and report uncertainty [23]. It is quite possible that different parts of the brain implement different kinds of learning/prediction algorithms. Different approximations may come into play depending on information abundance, or on whether the task explicitly necessitates the representation of uncertainty (e.g. in [23]).

Given how well EXP does as an approximate recognition model for DBM, even though it does not detect change points or modify its learning rate in response to detected change points, there might exist a generative model for which EXP would be an exact Bayesian recognition model. In particular, one might consider a model that assumes the underlying real-valued hidden variable undergoes persistent stochastic changes with constant noise characteristics, such as a Gaussian process, which then gives rise to noisy binary or categorical observations. Finally, we note that our approximation technique does not preclude an approximation in which the learning rate is modulated from trial to trial. In fact, a proof technique similar to the one used here may be utilized to derive an approximate update rule for higher-order moments of p(γ_{t+1}|x_{1:t}) (proof not shown), which could serve as a neurally plausible approximation to confidence in information-abundant settings.
Whether such an approximation can account for human reported confidence as in [23] is a worthy line of inquiry for future work.

Acknowledgments

We thank He Huang for assistance with data collection, Samer Sabri for helpful input with the writing, and the anonymous reviewers for helpful comments. This work was in part funded by an NSF CRCNS grant (BCS-1309346) to AJY.

References

[1] Yu, A. J. & Dayan, P. Expected and unexpected uncertainty: ACh and NE in the neocortex. In Advances in Neural Information Processing Systems, 173–180 (2003).

[2] Sugrue, L. P., Corrado, G. S. & Newsome, W. T. Matching behavior and the representation of value in the parietal cortex. Science 304, 1782–1787 (2004).

[3] Yu, A. J. & Dayan, P. Uncertainty, neuromodulation, and attention. Neuron 46, 681–692 (2005).

[4] Daw, N. D., O'Doherty, J. P., Dayan, P., Seymour, B. & Dolan, R. J. Cortical substrates for exploratory decisions in humans. Nature 441, 876 (2006).

[5] Behrens, T. E., Woolrich, M. W., Walton, M. E. & Rushworth, M. F. Learning the value of information in an uncertain world. Nature Neuroscience 10, 1214 (2007).

[6] Nassar, M. R. et al. Rational regulation of learning dynamics by pupil-linked arousal systems. Nature Neuroscience 15, 1040 (2012).

[7] Yu, A. J. & Cohen, J. D. Sequential effects: Superstition or rational behavior? In Advances in Neural Information Processing Systems, 1873–1880 (2009).

[8] Wilder, M., Jones, M. & Mozer, M. C. Sequential effects reflect parallel learning of multiple environmental regularities. In Advances in Neural Information Processing Systems, 2053–2061 (2009).

[9] Ma, N. & Yu, A. J. Statistical learning and adaptive decision-making underlie human response time variability in inhibitory control. Frontiers in Psychology 6, 1046 (2015).

[10] Ide, J. S., Shenoy, P., Yu, A. J. & Li, C.-S. R.
Bayesian prediction and evaluation in the anterior cingulate cortex. Journal of Neuroscience 33, 2039–2047 (2013).

[11] Yu, A. J. & Huang, H. Maximizing masquerading as matching in human visual search choice behavior. Decision 1, 275–287 (2014).

[12] Zhang, S. & Yu, A. J. Forgetful Bayes and myopic planning: Human learning and decision-making in a bandit setting. In Advances in Neural Information Processing Systems, 2607–2615 (2013).

[13] Bialek, W., Nemenman, I. & Tishby, N. Predictability, complexity, and learning. Neural Computation 13, 2409–2463 (2001).

[14] Guo, D. & Yu, A. J. Why so gloomy? A Bayesian explanation of human pessimism bias in the multi-armed bandit task. In Advances in Neural Information Processing Systems (2018).

[15] Ghahramani, Z. & Hinton, G. E. Variational learning for switching state-space models. Neural Computation 12, 831–864 (2000).

[16] Nassar, M. R., Wilson, R. C., Heasly, B. & Gold, J. I. An approximately Bayesian delta-rule model explains the dynamics of belief updating in a changing environment. Journal of Neuroscience 30, 12366–12378 (2010).

[17] Wilson, R. C., Nassar, M. R. & Gold, J. I. A mixture of delta-rules approximation to Bayesian inference in change-point problems. PLoS Computational Biology 9, e1003150 (2013).

[18] Rescorla, R. A. & Wagner, A. R. A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and nonreinforcement. Classical Conditioning II: Current Research and Theory 2, 64–99 (1972).

[19] Sutton, R. S. & Barto, A. G. Reinforcement Learning: An Introduction (MIT Press, 1998).

[20] Ahmad, S., Huang, H. & Yu, A. J. Cost-sensitive Bayesian control policy in human active sensing. Frontiers in Human Neuroscience 8 (2014).

[21] Herrnstein, R. J. Relative and absolute strength of response as a function of frequency of reinforcement.
Journal of the Experimental Analysis of Behavior 4, 267–272 (1961).

[22] Adams, R. P. & MacKay, D. J. C. Bayesian online changepoint detection. arXiv:0710.3742 [stat] (2007).

[23] Meyniel, F., Schlunegger, D. & Dehaene, S. The sense of confidence during probabilistic learning: A normative account. PLoS Computational Biology 11, e1004305 (2015).