{"title": "Sequential effects: Superstition or rational behavior?", "book": "Advances in Neural Information Processing Systems", "page_first": 1873, "page_last": 1880, "abstract": "In a variety of behavioral tasks, subjects exhibit an automatic and apparently sub-optimal sequential effect: they respond more rapidly and accurately to a stimulus if it reinforces a local pattern in stimulus history, such as a string of repetitions or alternations, compared to when it violates such a pattern. This is often the case even if the local trends arise by chance in the context of a randomized design, such that stimulus history has no predictive power. In this work, we use a normative Bayesian framework to examine the hypothesis that such idiosyncrasies may reflect the inadvertent engagement of fundamental mechanisms critical for adapting to changing statistics in the natural environment. We show that prior belief in non-stationarity can induce experimentally observed sequential effects in an otherwise Bayes-optimal algorithm. The Bayesian algorithm is shown to be well approximated by linear-exponential filtering of past observations, a feature also apparent in the behavioral data. We derive an explicit relationship between the parameters and computations of the exact Bayesian algorithm and those of the approximate linear-exponential filter. Since the latter is equivalent to a leaky-integration process, a commonly used model of neuronal dynamics underlying perceptual decision-making and trial-to-trial dependencies, our model provides a principled account of why such dynamics are useful. We also show that near-optimal tuning of the leaky-integration process is possible, using stochastic gradient descent based only on the noisy binary inputs. 
This is a proof of concept that not only can neurons implement near-optimal prediction based on standard neuronal dynamics, but that they can also learn to tune the processing parameters without explicitly representing probabilities.", "full_text": "Sequential effects: Superstition or rational behavior?

Angela J. Yu
Department of Cognitive Science
University of California, San Diego
ajyu@ucsd.edu

Jonathan D. Cohen
Department of Psychology
Princeton University
jdc@princeton.edu

Abstract

In a variety of behavioral tasks, subjects exhibit an automatic and apparently sub-optimal sequential effect: they respond more rapidly and accurately to a stimulus if it reinforces a local pattern in stimulus history, such as a string of repetitions or alternations, compared to when it violates such a pattern. This is often the case even if the local trends arise by chance in the context of a randomized design, such that stimulus history has no real predictive power. In this work, we use a normative Bayesian framework to examine the hypothesis that such idiosyncrasies may reflect the inadvertent engagement of mechanisms critical for adapting to a changing environment. We show that prior belief in non-stationarity can induce experimentally observed sequential effects in an otherwise Bayes-optimal algorithm. The Bayesian algorithm is shown to be well approximated by linear-exponential filtering of past observations, a feature also apparent in the behavioral data. We derive an explicit relationship between the parameters and computations of the exact Bayesian algorithm and those of the approximate linear-exponential filter. Since the latter is equivalent to a leaky-integration process, a commonly used model of neuronal dynamics underlying perceptual decision-making and trial-to-trial dependencies, our model provides a principled account of why such dynamics are useful.
We also show that parameter-tuning of the leaky-integration process is possible, using stochastic gradient descent based only on the noisy binary inputs. This is a proof of concept that not only can neurons implement near-optimal prediction based on standard neuronal dynamics, but that they can also learn to tune the processing parameters without explicitly representing probabilities.

1 Introduction

One common error human subjects make in statistical inference is that they detect hidden patterns and causes in genuinely random data. Superstitious behavior, or the inappropriate linking of stimuli or actions with consequences, can often arise in such situations, something also observed in non-human subjects [1, 2]. A common example in psychology experiments is that, despite a randomized experimental design that deliberately de-correlates stimuli from trial to trial, subjects pick up transient patterns such as runs of repetitions and alternations: their responses are facilitated when a stimulus continues to follow a local pattern, and impeded when such a pattern is violated [3]. It has been observed in numerous experiments [3-5] that subjects respond more accurately and rapidly if a trial is consistent with the recent pattern (e.g. AAAA followed by A, BABA followed by B) than if it is inconsistent (e.g. AAAA followed by B, BABA followed by A). This sequential effect is more prominent when the preceding run has lasted longer. Figure 1a shows reaction time (RT) data from one such experiment [5]. Error rates follow a similar pattern, reflecting a true expectancy-based effect rather than a shift in the RT-accuracy trade-off.

A natural interpretation of these results is that local patterns lead subjects to expect a stimulus, whether explicitly or implicitly.
They readily respond when a subsequent stimulus extends the local pattern, and are "surprised" and respond less rapidly and accurately when a subsequent stimulus violates the pattern. When such local patterns persist longer, the subjects have greater confidence in

Figure 1: Bayesian modeling of sequential effects. (a) Median reaction time (RT) from Cho et al (2002), in which subjects were required to discriminate a small "o" from a large "O" using button-presses, as affected by the recent history of stimuli. Along the abscissa are all possible four-trial sub-sequences, in terms of repetitions (R) and alternations (A). Each sequence, read from top to bottom, proceeds from the earliest stimulus progressively toward the present stimulus. As the effects were symmetric across the two stimulus types, A and B, each bin contains data from a pair of conditions (e.g. RRAR can be AAABB or BBBAA). RT is fastest when a pattern is reinforced (RRR followed by R, or AAA followed by A); it is slowest when an "established" pattern is violated (RRR followed by A, or AAA followed by R).
(b) Assuming RT decreases with predicted stimulus probability (i.e. RT increases with 1 − P(x_t|x_{t−1}), where x_t is the actual stimulus seen), the FBM would predict much weaker sequential effects in the second half (blue: 720 simulated trials) than in the first half (red: 840 trials). (c) The DBM predicts persistently strong sequential effects in both the first half (red: 840 trials) and the second half (blue: 720 trials). Inset shows the prior over γ used; the same prior was also used for the FBM in (b). α = .77. (d) Sequential effects in the behavioral data were equally strong in the first half (red: 7 blocks of 120 trials each) and the second half (blue: 6 blocks of 120 trials each). The green dashed line shows a linear transformation from the DBM prediction in probability space of (c) into RT space. The fit is very good given the error bars (SEM) in the data.

the pattern, and are therefore more surprised and more strongly affected when the pattern is violated. While such a strategy seems plausible, it is also sub-optimal. The experimental design consists of randomized stimuli; all runs of repetitions or alternations are therefore spurious, and any behavioral tendencies driven by such patterns are useless. However, compared to artificial experimental settings, truly random sequential events may be rare in the natural environment, where the laws of physics and biology dictate that both external entities and the observer's viewpoint undergo mostly continuous transformations, leading to statistical regularities that persist over time on characteristic timescales.
The brain may be primed to extract such statistical regularities, leading to what appears to be superstitious behavior in an artificially randomized experimental setting.

In section 2, we use Bayesian probability theory to build formally rigorous models for predicting stimuli based on previous observations, and compare models of differing complexity to subjects' actual behavior. Our analyses imply that subjects assume the statistical contingencies in the task to persist over several trials, but to be non-stationary on a longer time-scale, as opposed to being unknown but fixed throughout the experiment. We are also interested in understanding how the computations necessary for prediction and learning can be implemented by the neural hardware. In section 3, we show that the Bayes-optimal learning and prediction algorithm is well approximated by a linear filter that weighs past observations exponentially, a computationally simpler algorithm that also seems to fit human behavior. Such an exponential linear filter can be implemented by standard models of neuronal dynamics. We derive an explicit relationship between the assumed rate of change in the world and the time constant of the optimal exponential linear filter. Finally, in section 4, we show that meta-learning about the rate of change in the world can be implemented by stochastic gradient descent, and compare this algorithm with exact Bayesian learning.

2 Bayesian prediction in fixed and changing worlds

One simple internal model that subjects may have about the nature of the stimulus sequence in a 2-alternative forced choice (2AFC) task is that the statistical contingencies in the task remain fixed throughout the experiment.
Specifically, they may believe that the experiment is designed such that there is a fixed probability γ, throughout the experiment, of encountering a repetition (x_t = 1) on any given trial t (and thus probability 1 − γ of seeing an alternation, x_t = 0). What they would then learn

Figure 2: Bayesian inference assuming fixed and changing Bernoulli parameters. (a) Graphical model for the FBM. γ ∈ [0, 1], x_t ∈ {0, 1}. The numbers in circles show example values for the variables. (b) Graphical model for the DBM. p(γ_t|γ_{t−1}) = αδ(γ_t − γ_{t−1}) + (1 − α)p_0(γ_t), where we assume the prior p_0 to be a Beta distribution. The numbers in circles show example values for the variables. (c) Grayscale shows the evolution of the posterior probability mass over γ for the FBM (darker color indicates concentration of mass), given a sequence of truly random (P(x_t) = .5) binary data (blue dots). The mean of the distribution, in cyan, is also the predicted stimulus probability: P(x_t = 1|x_{t−1}) = ⟨γ|x_{t−1}⟩. (d) Evolution of the posterior probability mass for the DBM (grayscale) and the predictive probability P(x_t = 1|x_{t−1}) (cyan); they perpetually fluctuate with transient runs of repetitions or alternations.

about the task over the time course of the experiment is the appropriate value of γ. We call this the Fixed Belief Model (FBM). Bayes' Rule tells us how to compute the posterior:

p(γ|x_t) ∝ P(x_t|γ) p(γ) = γ^{r_t+a−1} (1 − γ)^{t−r_t+b−1}

where r_t denotes the number of repetitions observed so far (up to t), x_t is the set of binary observations (x_1, . . . , x_t), and the prior distribution p(γ) is assumed to be a beta distribution: p(γ) = p_0(γ) = Beta(a, b).
The predicted probability of seeing a repetition on the next trial is the mean of this posterior distribution: P(x_{t+1} = 1|x_t) = ∫ γ p(γ|x_t) dγ = ⟨γ|x_t⟩.

A more complex internal model that subjects may entertain is that the relative frequency of repetition (versus alternation) can undergo discrete changes at unsignaled times during the experimental session, such that repetitions are more prominent at times, and alternations more prominent at others. We call this the Dynamic Belief Model (DBM), in which γ_t has a Markovian dependence on γ_{t−1}: with probability α, γ_t = γ_{t−1}, and with probability 1 − α, γ_t is redrawn from a fixed distribution p_0(γ_t) (the same Beta distribution as for the prior). The observation x_t is still assumed to be drawn from a Bernoulli process with rate parameter γ_t. The stimulus predictive probability is now the mean of the iterative prior, P(x_t = 1|x_{t−1}) = ⟨γ_t|x_{t−1}⟩, where

p(γ_t = γ|x_{t−1}) = α p(γ_{t−1} = γ|x_{t−1}) + (1 − α) p_0(γ_t = γ)
p(γ_t|x_t) ∝ P(x_t|γ_t) p(γ_t|x_{t−1})

Figures 2a,b illustrate the two graphical models. Figures 2c,d demonstrate how the two models respond differently to the exact same sequence of truly random binary observations (γ = .5). While inference in the FBM leads to a less variable and more accurate estimate of the underlying bias as the number of samples increases, inference in the DBM is perpetually driven by local transients. Relating back to the experimental data, since RT is known to lengthen with reduced stimulus expectancy, we plot the probability of not observing the current stimulus for each type of 5-stimulus sequence in Figure 1 for (b) the FBM and (c) the DBM.
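Both models can be sketched in a few lines. The following is an illustrative grid implementation, not the authors' code; the Beta parameters, grid size, and session length are arbitrary choices made here for concreteness:

```python
import numpy as np

def fbm_predict(x, a=1.0, b=1.0):
    """Fixed Belief Model: predictive P(x_t = 1 | x_{1:t-1}) under a Beta(a, b) prior.
    The posterior over gamma stays Beta(a + r_t, b + t - r_t), so the predictive
    mean has a closed form."""
    preds, r, n = [], 0, 0
    for xt in x:
        preds.append((a + r) / (a + b + n))   # posterior mean of gamma before seeing x_t
        r += xt
        n += 1
    return np.array(preds)

def dbm_predict(x, alpha=0.77, a=1.0, b=1.0, ngrid=200):
    """Dynamic Belief Model: grid approximation to the posterior over gamma_t.
    Between trials, gamma_t stays put with probability alpha and is redrawn
    from the Beta(a, b) prior p0 with probability 1 - alpha."""
    g = (np.arange(ngrid) + 0.5) / ngrid
    p0 = g**(a - 1) * (1 - g)**(b - 1)
    p0 /= p0.sum()
    belief, preds = p0.copy(), []
    for xt in x:
        prior_t = alpha * belief + (1 - alpha) * p0   # iterative prior p(gamma_t | x_{t-1})
        preds.append(float(g @ prior_t))              # predictive P(x_t = 1 | x_{t-1})
        like = g if xt == 1 else 1 - g
        belief = like * prior_t
        belief /= belief.sum()
    return np.array(preds)

rng = np.random.default_rng(0)
x = (rng.random(1000) < 0.5).astype(int)   # truly random Bernoulli(0.5) sequence
fbm, dbm = fbm_predict(x), dbm_predict(x)
# FBM predictions settle near .5, while DBM predictions keep fluctuating with local runs
print(fbm[-200:].std(), dbm[-200:].std())
```

Run on a random sequence, the FBM's predictive probability stabilizes while the DBM's continues to chase transient runs, which is exactly the contrast between Figures 2c and 2d.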
Comparing the first half of a simulated experimental session (red) with the second half (blue), matched to the number of trials for each subject, we see that sequential effects significantly diminish in the FBM but persist in the DBM. A re-analysis of the experimental data (Figure 1d) shows that sequential effects also persist in human behavior, confirming that Bayesian prediction based on a (Markovian) changeable world can account for the behavioral data, while prediction based on a fixed world cannot. In Figure 1d, the green dashed line shows that a linear transformation of the DBM sequential effect (from Figure 1c) is quite a good fit to the behavioral data. It is also worth noting that in the behavioral data there is a slight overall preference (shorter RT) for repetition trials. This is easily captured by the DBM by assuming p_0(γ_t) to be skewed toward repetitions (see Figure 1c inset). The same skewed prior cannot produce a bias in the FBM, however, because the prior only figures into Bayesian inference once at the outset, and is very quickly overwhelmed by the accumulating observations.

Figure 3: Exponential discounting is a good descriptive and normative model.
(a) For each of the six subjects, we regressed RR on repetition trials against past observations, RR ≈ C + b_1 x_{t−1} + b_2 x_{t−2} + . . ., where x_τ is assigned 0 if trial τ was a repetition and 1 if it was an alternation, the idea being that recent repetition trials should increase the expectation of a repetition and thereby increase RR on a repetition trial, while recent alternations should decrease that expectation and decrease RR. Separately, we also regressed RR on alternation trials against past observations (assigning 0 to alternation trials and 1 to repetitions). The two sets of coefficients did not differ significantly and were averaged together (red: average across subjects; error bars: SEM). The blue line shows the best exponential fit to these coefficients. (b) We regressed P_t obtained from exact Bayesian DBM inference against past observations, and obtained a set of average coefficients (red); blue is the best exponential fit. (c) For different values of α, we repeat the process in (b) and obtain the best exponential decay parameter β (blue). The optimal β closely tracks the 2/3 rule for a large range of values of α. β is .57 in (a), so α = .77 was used to generate (b). (d) Both the optimal exponential fit (red) and the 2/3 rule (blue) approximate the true Bayesian P_t well (green dashed line shows a perfect match). α = .77. For smaller values of α, the fit is even better; for larger α, the exponential approximation deteriorates (not shown). (e) For repetition trials, the greater the predicted probability of seeing a repetition (x_t = 1), the faster the RT, whether trials are categorized by Bayesian predictive probabilities (red: α = .77, p_0 = Beta(1.6, 1.3)) or by linear exponential filtering (blue). For alternation trials, RTs increase with increasing predicted probability of seeing a repetition.
Inset: for biases b ∈ [.2, .8], the log prior ratio (the shift in the initial starting point, and therefore the change in the distance to the decision boundary) is approximately linear.

3 Exponential filtering both normative and descriptive

While Bayes' Rule tells us in theory what the computations ought to be, the neural hardware may only implement a simpler approximation. One potential approximation is suggested by related work showing that monkeys' choices, when tracking reward contingencies that change at unsignaled times, depend linearly on previous observations, discounted approximately exponentially into the past [6]. That task explicitly examines subjects' ability to track unsignaled statistical regularities, much like the kind we hypothesize to be engaged inadvertently in sequential effects.

First, we regressed the subjects' reward rate (RR) against past observations and saw that the linear coefficients decay approximately exponentially into the past (Figure 3a). We define reward rate as mean accuracy/mean RT, averaged across subjects; we thus take into account effects in both RT and accuracy as a function of past experiences. We next examined whether there is also an element of exponential discounting embedded in the DBM inference algorithm. Linear regression of the predictive probability P_t ≜ P(x_t = 1|x_{t−1}), which should correlate positively with RR (since it correlates positively with accuracy and negatively with RT), against previous observations x_{t−1}, x_{t−2}, . .
. yields coefficients that also decay exponentially into the past (Figure 3b): P_t ≈ C + η Σ_{τ=1}^{t−1} β^τ x_{t−τ}. Linear exponential filtering thus appears to be both a good descriptive model of behavior, and a good normative model approximating Bayesian inference.

An obvious question is how this linear exponential filter relates to exact Bayesian inference, in particular how the rate of decay relates to the assumed rate of change in the world (parameterized by α). We first note that the linear exponential filter has an equivalent iterative form:

P_t ≜ P(x_t = 1|x_{t−1}) = C + η Σ_{τ=1}^{t−1} β^τ x_{t−τ} = C(1 − β) + ηβ x_{t−1} + β P_{t−1} .

We then note that the nonlinear Bayesian update rule can also be written as:

P_{t+1} = (1/2)(1 − α) + x_t α (K_t − P_t²)/(P_t − P_t²) + α P_t (1 − K_t/P_t)/(1 − P_t) ≈ (1/2)(1 − α) + (1/3)α x_t + (2/3)α P_t    (1)

where K_t ≜ ⟨γ_t²|x_{t−1}⟩, and we approximate P_t by its mean value ⟨P_t⟩ = 1/2, and K_t by its mean value ⟨K_t⟩ = 1/3. These expected values are obtained by expanding P_t and K_t in their iterative forms and assuming ⟨P_t⟩ = ⟨P_{t−1}⟩ and ⟨K_t⟩ = ⟨K_{t−1}⟩, and also assuming that p_0 is the uniform distribution.
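Both identities above can be checked numerically. The following sketch is our own, not the authors' code: the constants C and η are arbitrary, and the exact DBM is re-implemented on a grid with a uniform p_0. It verifies the sum/iterative equivalence and the quality of the mean-field approximation in Equation 1:

```python
import numpy as np

def dbm_exact(x, alpha, ngrid=400):
    """Exact DBM predictive probabilities on a gamma grid, uniform p0."""
    g = (np.arange(ngrid) + 0.5) / ngrid
    p0 = np.full(ngrid, 1.0 / ngrid)
    belief, preds = p0.copy(), []
    for xt in x:
        prior_t = alpha * belief + (1 - alpha) * p0
        preds.append(float(g @ prior_t))
        like = g if xt == 1 else 1 - g
        belief = like * prior_t
        belief /= belief.sum()
    return np.array(preds)

def dbm_linear(x, alpha):
    """Mean-field approximation, Equation 1:
    P_{t+1} ~= (1/2)(1 - alpha) + (1/3) alpha x_t + (2/3) alpha P_t."""
    P, preds = 0.5, []
    for xt in x:
        preds.append(P)
        P = 0.5 * (1 - alpha) + alpha * xt / 3 + 2 * alpha * P / 3
    return np.array(preds)

rng = np.random.default_rng(1)
x = (rng.random(2000) < 0.5).astype(int)
alpha = 0.6
exact, approx = dbm_exact(x, alpha), dbm_linear(x, alpha)
print(np.abs(exact - approx).mean())   # small for moderate alpha

# Sum form vs. iterative form of the linear exponential filter (arbitrary C, eta):
C, eta, beta = 0.2, 0.3, 2 * alpha / 3
t = len(x)
P_sum = C + eta * sum(beta**tau * x[t - 1 - tau] for tau in range(1, t))
P_it = C                                # empty sum: P_1 = C
for xt in x[:-1]:
    P_it = C * (1 - beta) + eta * beta * xt + beta * P_it
print(P_sum, P_it)                      # the two forms agree
```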
We verified numerically (data not shown) that this mean approximation is quite good for a large range of α (though it gets progressively worse as α → 1, probably because the equilibrium assumptions deviate farther from reality as changes become increasingly rare).

Notably, our calculations imply β ≈ (2/3)α, which makes intuitive sense: slower changes should result in a longer integration time window, whereas faster changes should result in a shorter memory. Figure 3c shows that the best numerically obtained β (found by fitting an exponential to the linear regression coefficients) for different values of α (blue) is well approximated by the 2/3 rule (black dashed line). For the behavioral data in Figure 3a, β was found to be .57, which implies α = .77; the simulated data in Figure 3b were in fact obtained by assuming α = .77, hence the remarkably good fit between data and model. Figure 3d shows that the reconstructed P_t based on the numerically optimal linear exponential filter (red) and on the 2/3 rule (blue) both track the true Bayesian P_t very well.

In the previous section, we saw that exact Bayesian inference for the DBM is a good model of behavioral data. In this section, we saw that linear exponential filtering also seems to capture the data well. To compare which of the two better explains the data, we need a more detailed account of how stimulus history-dependent probabilities translate into reaction times. A growing body of psychological [7] and physiological [8] data supports the notion that some form of evidence integration up to a fixed threshold underlies binary perceptual decision making, which both optimizes an accuracy-RT trade-off [9] and seems to be implemented in some form by cortical neurons [8].
The idealized, continuous-time version of this, the drift-diffusion model (DDM), has a well characterized mean stopping time [10], T_d = (z/A) tanh(Az/c²), where A and c are the mean and standard deviation of the unit-time fluctuation, and z is the distance between the starting point and the decision boundary. The vertical axis for the DDM is in units of the log posterior ratio log[P(s_0|x_t)/P(s_1|x_t)]. An unbiased (uniform) prior over s implies a stochastic trajectory that begins at 0 and drifts until it hits one of the two boundaries ±z. When the prior is biased at b ≠ .5, it has an additive effect in the log posterior ratio space and moves the starting point to log[b/(1 − b)]. For the relevant range of b (.2 to .8), the shift in starting point is approximately linear in b (Figure 3e inset), so that the new distance to the boundary is approximately z + kb. Thus, the new mean decision time is T_d(b) = ((z + kb)/A) tanh(A(z + kb)/c²). Typically in DDM models of decision-making, the signal-to-noise ratio is small, i.e. A ≪ c, such that tanh is highly linear in the relevant range. We therefore have T_d(b) ≈ z²/c² + (2zk/c²) b, implying that the change in mean decision time is linear in the bias b, in units of probability.

This linear relationship between RT and b was already borne out by the good fit between sequential effects in the behavioral data and in the DBM in Figure 1d. To examine this more closely, we run the exact Bayesian DBM algorithm and the linear exponential filter on the actual sequences of stimuli observed by the subjects, and plot median RT against predicted stimulus probabilities. In Figure 3e, we see that for both the exact Bayesian (red) and exponential (blue) algorithms, RTs decrease on repetition trials as the predicted probability of a repetition increases; conversely, RTs increase on alternation trials as the predicted probability of a repetition increases (and the predicted probability of an alternation therefore decreases).
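The linearization of the mean decision time can be sanity-checked numerically. The parameter values below are illustrative, chosen only so that A ≪ c as the argument assumes; they are not values from the paper:

```python
import numpy as np

# Mean DDM decision time with a biased starting point,
# T_d(b) = ((z + k*b)/A) * tanh(A*(z + k*b)/c**2),
# compared against its linearization z**2/c**2 + (2*z*k/c**2)*b.
A, c, z, k = 0.05, 1.0, 1.0, 0.5   # low signal-to-noise: A << c

def td(b):
    d = z + k * b                   # effective distance to boundary
    return (d / A) * np.tanh(A * d / c**2)

b = np.linspace(-0.3, 0.3, 13)      # shifts around the unbiased starting point
linear = z**2 / c**2 + (2 * z * k / c**2) * b
print(np.abs(td(b) - linear).max())  # deviation from linearity stays small
```

Over this range the residual is dominated by the dropped quadratic term (kb)²/c², which is what makes the RT effect of the prior bias approximately linear.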
For both Bayesian inference and linear exponential filtering, the relationship between RT and stimulus probability is approximately linear. The linear fit in fact appears better for the exponential algorithm than for exact Bayesian inference, which, conditioned on the DDM being an appropriate model for binary decision making, implies that the former may be a better model of sequential adaptation than exact Bayesian inference. Further experimentation is underway to examine this prediction more carefully.

Another implication of the SPRT or DDM formulation of perceptual decision-making is that an incorrect prior bias, such as that induced by sequential effects in a randomized stimulus sequence, incurs a net cost in accuracy (even though the RT effects wash out due to the linear dependence on prior bias). The error rate with a bias x_0 in the starting point is 1/(1 + e^{2za}) − (1 − e^{−2ax_0})/(e^{2az} − e^{−2az}) [10], implying that the error rate rises monotonically with bias in either direction. This is a quantitative characterization of our claim that extraneous prior bias, such as that due to sequential effects, induces suboptimality in decision-making.

Figure 4: Meta-learning about the rate of change.
(a) Graphical model for exact Bayesian learning. Numbers are example values for the variables. (b) Mean of the posterior p(α|x_t) as a function of timesteps, averaged over 30 sessions of simulated data, each set generated from a different true value of α (see legend; color-coded dashed lines indicate the true α). Inset shows the prior over α, p(α) = Beta(17, 3). The time-course of learning is not especially sensitive to the exact form of the prior (not shown). (c) Stochastic gradient descent with a learning rate of .01 produces estimates of α (thick lines; width denotes SEM) that converge to the true values of α (dashed lines). The initial estimate of α, before seeing any data, is .9. Learning is based on 50 sessions of 5000 trials for each value of α. (d) Marginal posterior distributions over α (top panel) and γ_t (bottom panel) on a sample run, where probability mass is color-coded: brighter color is more mass.

4 Neural implementation and learning

So far, we have seen that exponential discounting of the past not only approximates exact Bayesian inference, but fits human behavioral data. We now note that it has the additional appealing property of being equivalent to standard models of neuronal dynamics. This is because the iterative form of the linear exponential filter in Equation 1 has a similar form to a large class of leaky-integration neuronal models, which have been used extensively to model perceptual decision-making on a relatively fast time-scale [8, 11-15], as well as trial-to-trial interactions on a slower time-scale [16-20]. It is also related to the concept of the eligibility trace in reinforcement learning [21], which is important for the temporal credit assignment problem of relating outcomes to the states or actions that were responsible for them.
Here, we provided the computational rationale for this exponential discounting of the past: it approximates Bayesian inference under DBM-like assumptions.

Viewed as a leaky-integrating neuronal process, the parameters of Equation 1 have the following semantics: (1/2)(1 − α) can be thought of as a constant bias, (1/3)α x_{t−1} as the feed-forward input, and (2/3)α P_{t−1} as the leaky recurrent term. Equation 1 suggests that neurons utilizing a standard form of integration dynamics can implement near-optimal Bayesian prediction under the non-stationarity assumption, as long as the relative contributions of the different terms are set appropriately. A natural question to ask next is how neurons can learn to set the weights appropriately. We first note that x_t is a sample from the distribution P(x_t|x_{t−1}). Since P(x_t|x_{t−1}) has the approximate linear form in Equation 1, with dependence on a single parameter α, learning about near-optimal predictions can potentially be achieved by estimating the value of α from the stochastic samples x_1, x_2, . . .. We implement a stochastic gradient descent algorithm, in which α̂ is adjusted incrementally on each trial in the direction of the gradient, which should bring α̂ closer to the true α:

α̂_t = α̂_{t−1} + ε (x_t − P̂_t) dP̂_t/dα

where α̂_t is the estimate of α after observing x_t, and P̂_t is the estimate of P_t using the estimate α̂_{t−1} (before seeing x_t). Figure 4c shows that learning via the binary samples is indeed possible: for different true values of α (dashed lines) that generated different data sets, stochastic gradient descent produced estimates α̂ that converge to the true values, or close to them (thick lines; widths denote SEM estimated from 50 sessions of learning).
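The update rule can be sketched as follows, using the linear form of Equation 1 to propagate both the prediction and its derivative with respect to α. This is an illustrative re-implementation, not the authors' code; the learning rate, session length, number of sessions, and true α are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(2)

def generate_dbm(alpha, T):
    """Sample a binary sequence from the DBM generative model (uniform p0)."""
    gam, x = rng.random(), np.empty(T, dtype=int)
    for t in range(T):
        if rng.random() > alpha:
            gam = rng.random()         # gamma_t redrawn with probability 1 - alpha
        x[t] = rng.random() < gam      # Bernoulli(gamma_t) observation
    return x

def sgd_alpha(x, a0=0.9, lr=0.01):
    """Stochastic gradient learning of alpha from the binary inputs alone."""
    a, P, dP = a0, 0.5, 0.0
    for xt in x:
        a = min(max(a + lr * (xt - P) * dP, 0.0), 1.0)   # gradient step on prediction error
        # differentiate the recursion P' = (1/2)(1-a) + (1/3) a x + (2/3) a P w.r.t. a
        dP = -0.5 + xt / 3 + 2 * P / 3 + 2 * a * dP / 3
        P = 0.5 * (1 - a) + a * xt / 3 + 2 * a * P / 3
    return a

true_alpha = 0.4
estimates = [sgd_alpha(generate_dbm(true_alpha, 5000)) for _ in range(10)]
print(np.mean(estimates))   # drifts from the .9 initialization toward the true alpha
```

Note that the derivative dP_t/dα is itself computed by a leaky recursion of the same form as the prediction, which is what makes the open question below, of whether neural machinery could compute it, a natural one.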
A key challenge for future work is to clarify whether and how the gradient, dP_t/dα, can be computed by neural machinery (perhaps approximately).

For comparison, we also implement the exact Bayesian learning algorithm, which augments the DBM architecture by representing α as a hidden variable instead of a fixed known parameter:

p(α, γ_t|x_t) ∝ p(α|x_{t−1}) P(x_t|γ_t) p(γ_t|α, x_{t−1}) .

Figure 4a illustrates this augmented model graphically. Figure 4b shows the evolution of the mean of the posterior distribution over α, ⟨α|x_t⟩. Based on sets of 30 sessions of 5000 trials, generated from each of four different true values of α, the mean value of α under the posterior distribution tends toward the true α over time. The prior we assume for α is a beta distribution (Beta(17, 3), shown in the inset of Figure 4b).

Compared to exact Bayesian learning, stochastic gradient descent has a similar learning rate. But larger values of α (e.g. α = .6) tend to be under-estimated, possibly because the analytical approximation for β under-estimates the optimal value for larger α. For data generated from a fixed Bernoulli process with rate .5, an equivalently appropriate model is the DBM with α = 0; stochastic gradient descent produced estimates of α (thick red line) that converge to 0 on the order of 50000 trials (details not shown). Figure 4d shows that the posterior inference about α and γ_t undergoes distinct phases when the true α = 0 and there is no correlation between one timestep and the next. There is an initial phase in which the marginal posterior mass for α tends toward high values, while the marginal posterior mass for γ_t fluctuates around .5. Note that this combination is an alternative, equally valid generative model for a completely randomized sequence of inputs.
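The augmented model is straightforward to sketch on a joint grid over (α, γ_t). The Beta(17, 3) prior over α follows the text; the grid resolutions, seed, and session length are arbitrary choices in this illustrative re-implementation:

```python
import numpy as np

# Joint posterior update: p(alpha, gamma_t | x_t) is proportional to
# p(alpha | x_{t-1}) * P(x_t | gamma_t) * p(gamma_t | alpha, x_{t-1}).
na, ng = 50, 100
alphas = (np.arange(na) + 0.5) / na
g = (np.arange(ng) + 0.5) / ng
p0g = np.full(ng, 1.0 / ng)                 # uniform p0 over gamma
pa = alphas**16 * (1 - alphas)**2           # Beta(17, 3) prior over alpha (unnormalized)
pa /= pa.sum()
joint = pa[:, None] * p0g[None, :]          # initial joint p(alpha, gamma)

rng = np.random.default_rng(3)
x = (rng.random(2000) < 0.5).astype(int)    # true alpha = 0: iid Bernoulli(.5) data
for xt in x:
    # per-alpha transition: gamma stays with prob. alpha, is redrawn with prob. 1 - alpha
    cond = joint / joint.sum(axis=1, keepdims=True)       # p(gamma_{t-1} | alpha, x_{t-1})
    cond = alphas[:, None] * cond + (1 - alphas[:, None]) * p0g
    prior = joint.sum(axis=1, keepdims=True) * cond       # joint prior p(alpha, gamma_t | x_{t-1})
    like = g if xt == 1 else 1 - g
    joint = prior * like[None, :]
    joint /= joint.sum()

post_alpha = joint.sum(axis=1)              # marginal posterior over alpha
print(float(alphas @ post_alpha))           # posterior mean of alpha
```

Tracking the two marginals of `joint` over trials reproduces the kind of picture in Figure 4d: a joint hypothesis over how fast the world changes and what its current bias is.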
However, this joint state is unstable: α tends toward 0 while the posterior over γ_t becomes broad and fluctuates wildly. This is because, as the inferred α gets smaller, past observations carry almost no information about γ_t; thus the marginal posterior over γ_t tends to be broad (high uncertainty) and fluctuates with each data point. α can only decrease slowly, because so little information about the hidden variables is obtained from each data point. For instance, it is very difficult to infer, from what is believed to be an essentially random sequence, whether the underlying Bernoulli rate really tends to change once every 1.15 trials or once every 1.16 trials. This may explain why subjects show no diminished sequential effects over the course of a few hundred trials (Figure 1d). While the stochastic gradient results demonstrate that, in principle, the correct value of α can be learned from the sequence of binary observations x_1, x_2, ..., further work is required to demonstrate whether and how neurons could implement the stochastic gradient algorithm or an alternative learning algorithm.

5 Discussion

Humans and other animals constantly have to adapt their behavioral strategies in response to changing environments: growth or shrinkage in food supplies, development of new threats and opportunities, gross changes in weather patterns, etc. Accurate tracking of such changes allows animals to adapt their behavior in a timely fashion. Subjects have been observed to readily alter their behavioral strategy in response to recent trends in stimulus statistics, even when such trends are spurious. While such behavior is sub-optimal for certain behavioral experiments, which interleave stimuli randomly or pseudo-randomly, it is appropriate for environments in which changes do take place on a slow timescale.
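For concreteness, the exact Bayesian learning algorithm described above, which treats α as a hidden variable with joint posterior p(α, γ_t|x_t), can be sketched as a discretized (grid-based) update. This is a minimal illustration, not the authors' implementation; the function name, grid size, and uniform reset prior over γ are our assumptions, while the Beta(17, 3) prior on α follows the text:

```python
def bayes_learn_alpha(xs, n_grid=40):
    """Grid-based joint posterior over (alpha, gamma_t), using
    p(alpha, gamma_t | x_t) ∝ p(alpha | x_{t-1}) P(x_t | gamma_t) p(gamma_t | alpha, x_{t-1}),
    where gamma persists with prob. alpha and is otherwise redrawn uniformly.
    Returns the posterior mean of alpha after the last observation."""
    # midpoint grids over (0, 1) for alpha and gamma
    alphas = [(i + 0.5) / n_grid for i in range(n_grid)]
    gammas = [(j + 0.5) / n_grid for j in range(n_grid)]
    # Beta(17, 3) prior on alpha (unnormalized density a^16 (1-a)^2), uniform on gamma
    w = [a ** 16 * (1 - a) ** 2 for a in alphas]
    z = sum(w) * n_grid
    joint = [[w[i] / z for _ in range(n_grid)] for i in range(n_grid)]
    for x in xs:
        new = []
        for i, a in enumerate(alphas):
            marg = sum(joint[i])               # p(alpha_i | x_{1:t-1})
            row = []
            for j, g in enumerate(gammas):
                # transition: gamma stays w.p. a, else uniform reset; then Bernoulli likelihood
                pred = a * joint[i][j] + (1 - a) * marg / n_grid
                row.append(pred * (g if x == 1 else 1.0 - g))
            new.append(row)
        total = sum(sum(r) for r in new)       # renormalize the joint posterior
        joint = [[v / total for v in r] for r in new]
    return sum(alphas[i] * sum(joint[i]) for i in range(n_grid))
```

Marginalizing the joint over γ at each step gives the posterior over α whose mean is plotted in Figure 4b; the same grid also yields the broad, fluctuating marginal over γ_t described above when the inferred α is small.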
It has been observed, in tasks where statistical contingencies undergo occasional and unsignaled changes, that monkeys weigh past observations linearly, but with coefficients that decay into the past, in choosing between options [6]. We showed that human subjects behave very similarly in 2AFC tasks with randomized design, and that such discounting gives rise to the frequently observed sequential effects found in such tasks [5]. We showed that such exponential discounting approximates optimal Bayesian inference under assumptions of statistical non-stationarity, and derived an analytical, approximate relationship between the parameters of the optimal linear exponential filter and the statistical assumptions about the environment. We also showed how such computations can be implemented by leaky integrating neuronal dynamics, and how the optimal tuning of the leaky integration process can be achieved without explicit representation of probabilities.

Our work provides a normative account of why exponential discounting is observed in both stationary and non-stationary environments, and how it may be implemented neurally. The relevant neural mechanisms seem to be engaged both in tasks in which the environmental contingencies are truly changing at unsignaled times, and in tasks in which the underlying statistics are stationary but chance patterns masquerade as changing statistics (as seen in sequential effects). This work bridges and generalizes previous descriptive accounts of behavioral choice under non-stationary task conditions [6], as well as mechanistic models of how neuronal dynamics give rise to trial-to-trial interactions such as priming or sequential effects [5, 13, 18–20]. Based on the relationship we derived between the rate of behavioral discounting and the subjects' implicit assumptions about the rate of environmental changes, we were able to "reverse-engineer" the subjects' internal assumptions.
Subjects appear to assume α = .77, corresponding to a change about once every four trials. This may have implications for understanding why working memory has the observed capacity of 4-7 items.

In a recent human fMRI study [22], subjects appeared to have different learning rates in two phases of slower and faster changes, but notably the first phase contained no changes, while the second phase contained frequent ones. This is a potential confound, as it has been observed that adaptive responses change significantly upon the first switch but then settle into a more stable regime [23]. It is also worth noting that different levels of sequential effects/adaptive response appear to take place at different time-scales [4, 23], and different neural areas seem to be engaged in processing different types of temporal patterns [24]. In the context of our model, this may imply that sequential adaptation happens at different levels of processing (e.g. sensory, cognitive, motor), and that their different time-scales reflect different characteristic rates of change at these different levels. A related point is that the brain need not have an explicit representation of the rate of environmental change, which is implicitly encoded in the "leakiness" of neuronal integration over time. This is consistent with the observation of sequential effects even when subjects are explicitly told that the stimuli are random [4]. An alternative explanation is that subjects do not have complete faith in the experimenter's instructions [25]. Further work is needed to clarify these issues.

We used both a computationally optimal Bayesian learning algorithm and a simpler stochastic gradient descent algorithm to learn the rate of change (1 − α). Both algorithms were especially slow at learning in the case α = 0, which corresponds to truly randomized inputs.
This implies that completely random statistics are difficult to internalize when the observer is searching over a much larger hypothesis space, containing many possible models of statistical regularity that can change over time. This is consistent with previous work [26] showing that discerning "randomness" from binary observations may require surprisingly many samples when statistical regularities are presumed to change over time. Although this earlier work used a different model for what kinds of statistical regularities are allowed and how they change over time (temporally causal and Markovian in ours, an acausal correlation function in theirs), as well as a different inference task (on-line in our setting, off-line in theirs), the underlying principles and conclusions are similar: it is very difficult to discriminate a truly randomized sequence, which by chance would contain runs of repetitions and alternations, from one that has changing biases for repetitions and alternations over time.

References
[1] Skinner, B F (1948). J. Exp. Psychol. 38: 168-72.
[2] Ecott, C L & Critchfield, T S (2004). J. App. Beh. Analysis 37: 249-65.
[3] Laming, D R J (1968). Information Theory of Choice-Reaction Times, Academic Press, London.
[4] Soetens, E, Boer, L C, & Hueting, J E (1985). JEP: HPP 11: 598-616.
[5] Cho, R, et al (2002). Cognitive, Affective, & Behavioral Neurosci. 2: 283-99.
[6] Sugrue, L P, Corrado, G S, & Newsome, W T (2004). Science 304: 1782-7.
[7] Smith, P L & Ratcliff, R (2004). Trends Neurosci. 27: 161-8.
[8] Gold, J I & Shadlen, M N (2002). Neuron 36: 299-308.
[9] Wald, A & Wolfowitz, J (1948). Ann. Math. Statist. 19: 326-39.
[10] Bogacz, R, et al (2006). Psychological Review 113: 700-65.
[11] Cook, E P & Maunsell, J H R (2002). Nat. Neurosci. 5: 985-94.
[12] Grice, G R (1972). Perception & Psychophysics 12: 103-7.
[13] McClelland, J L. Attention & Performance XIV: 655-88.
MIT Press.
[14] Smith, P L (1995). Psychol. Rev. 102: 567-93.
[15] Yu, A J (2007). Adv. in Neur. Info. Proc. Systems 19: 1545-52.
[16] Dayan, P & Yu, A J (2003). IETE J. Research 49: 171-81.
[17] Kim, C & Myung, I J (1995). 17th Ann. Meeting of Cog. Sci. Soc.: 472-7.
[18] Mozer, M C, Colagrosso, M D, & Huber, D E (2002). Adv. in Neur. Info. Proc. Systems 14: 51-57.
[19] Mozer, M C, Kinoshita, S, & Shettel, M (2007). Integrated Models of Cog. Sys.: 180-93.
[20] Simen, P, Cohen, J D, & Holmes, P (2006). Neur. Netw. 19: 1013-26.
[21] Sutton, R S & Barto, A G (1998). Reinforcement Learning: An Introduction, MIT Press.
[22] Behrens, T E J, Woolrich, M W, Walton, M E, & Rushworth, M F S (2007). Nat. Neurosci. 10: 1214-21.
[23] Kording, K P, Tenenbaum, J B, & Shadmehr, R (2007). Nat. Neurosci. 10: 779-86.
[24] Huettel, S A, Mack, P B, & McCarthy, G (2002). Nat. Neurosci. 5: 485-90.
[25] Hertwig, R & Ortmann, A (2001). Behavioral & Brain Sciences 24: 383-403.
[26] Bialek, W (2005). Preprint q-bio.NC/0508044, Princeton University.