{"title": "Why so gloomy? A Bayesian explanation of human pessimism bias in the multi-armed bandit task", "book": "Advances in Neural Information Processing Systems", "page_first": 5176, "page_last": 5185, "abstract": "How humans make repeated choices among options with imperfectly known reward outcomes is an important problem in psychology and neuroscience. This is often studied using multi-armed bandits, which is also frequently studied in machine learning. We present data from a human stationary bandit experiment, in which we vary the average abundance and variability of reward availability (mean and variance of reward rate distributions). Surprisingly, we find subjects significantly underestimate prior mean of reward rates -- based on their self-report, at the end of a game, on their reward expectation of non-chosen arms. Previously, human learning in the bandit task was found to be well captured by a Bayesian ideal learning model, the Dynamic Belief Model (DBM), albeit under an incorrect generative assumption of the temporal structure - humans assume reward rates can change over time even though they are actually fixed. We find that the \"pessimism bias\" in the bandit task is well captured by the prior mean of DBM when fitted to human choices; but it is poorly captured by the prior mean of the Fixed Belief Model (FBM), an alternative Bayesian model that (correctly) assumes reward rates to be constants. This pessimism bias is also incompletely captured by a simple reinforcement learning model (RL) commonly used in neuroscience and psychology, in terms of fitted initial Q-values. While it seems sub-optimal, and thus mysterious, that humans have an underestimated prior reward expectation, our simulations show that an underestimated prior mean helps to maximize long-term gain, if the observer assumes volatility when reward rates are stable and utilizes a softmax decision policy instead of the optimal one (obtainable by dynamic programming). 
This raises the intriguing possibility that the brain underestimates reward rates to compensate for the incorrect non-stationarity assumption in the generative model and a simplified decision policy.", "full_text": "Why so gloomy? A Bayesian explanation of human\n\npessimism bias in the multi-armed bandit task\n\nDalin Guo\n\nDepartment of Cognitive Science\nUniversity of California San Diego\n\nLa Jolla, CA 92093\ndag082@ucsd.edu\n\nAngela J. Yu\n\nDepartment of Cognitive Science\nUniversity of California San Diego\n\nLa Jolla, CA 92093\najyu@ucsd.edu\n\nAbstract\n\nHow humans make repeated choices among options with imperfectly known reward\noutcomes is an important problem in psychology and neuroscience. This is often\nstudied using multi-armed bandits, which are also frequently studied in machine\nlearning. We present data from a human stationary bandit experiment, in which\nwe vary the average abundance and variability of reward availability (mean and\nvariance of the reward rate distribution). Surprisingly, we \ufb01nd subjects signi\ufb01cantly\nunderestimate the prior mean of reward rates \u2013 based on their self-report on their\nreward expectation of non-chosen arms at the end of a game. Previously, human\nlearning in the bandit task was found to be well captured by a Bayesian ideal\nlearning model, the Dynamic Belief Model (DBM), albeit under an incorrect\ngenerative assumption of the temporal structure \u2013 humans assume reward rates can\nchange over time even though they are truly \ufb01xed. We \ufb01nd that the \u201cpessimism\nbias\u201d in the bandit task is well captured by the prior mean of DBM when \ufb01tted\nto human choices; but it is poorly captured by the prior mean of the Fixed Belief\nModel (FBM), an alternative Bayesian model that (correctly) assumes reward\nrates to be constants. 
This pessimism bias is also incompletely captured by a\nsimple reinforcement learning model (RL) commonly used in neuroscience and\npsychology, in terms of \ufb01tted initial Q-values. While it seems sub-optimal, and\nthus mysterious, that humans have an underestimated prior reward expectation, our\nsimulations show that an underestimated prior mean helps to maximize long-term\ngain, if the observer assumes volatility when reward rates are stable, and utilizes\na softmax decision policy instead of the optimal one (obtainable by dynamic\nprogramming). This raises the intriguing possibility that the brain underestimates\nreward rates to compensate for the incorrect non-stationarity assumption in the\ngenerative model and a simpli\ufb01ed decision policy.\n\n1\n\nIntroduction\n\nHumans and animals frequently have to make choices among options with imperfectly known\noutcomes. This is often studied using the multi-armed bandit task [1, 2, 3], in which the agent\nrepeatedly chooses among bandit arms with \ufb01xed but unknown reward probabilities. In a bandit\nsetting, only the outcome of the chosen arm is observed in a given trial. The decision-maker learns\nhow rewarding an arm is by choosing it and observing whether it produces a reward; thus, each choice\npits exploitation against exploration, since it affects not only the immediate reward outcome but also\nthe longer-term information gain. Previously, it has been shown that human learning in the bandit\ntask is well captured by a Bayesian ideal learning model [4], the Dynamic Belief Model (DBM) [5],\nwhich assumes the reward rate distribution to undergo occasional and abrupt changes. While DBM\nassumes non-stationarity, an alternative Bayesian model, the Fixed Belief Model (FBM), assumes the\nenvironmental statistics to be \ufb01xed, which is usually consistent with the experimental setting.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\f
Previous\nstudies have shown that DBM predicts trial-wise human behavior better than FBM in a variety of\nbehavioral tasks (2-alternative forced choice [5], inhibitory control [6, 7], and visual search [8]),\nincluding the bandit task [4] \u2013 this occurs despite the task statistics being \ufb01xed during the task. While\nit has been argued that a default assumption of high volatility helps the observer to better adapt to\ntruly volatile natural environments [5], and computationally true change-points are dif\ufb01cult to discern\ngiven the noisy binary/categorical observations in these tasks [9], it has nevertheless remained rather\nmysterious why humans would persist in making this assumption contrary to observed environmental\nstatistics. In this work, we tackle this problem using a revised version of the classical bandit task.\nIn our experiment, we vary average reward abundance and variability to form four different reward\nenvironments. Previous multi-armed bandit studies using binary reward outcomes have typically\nutilized a neutral reward environment [3, 4, 10, 11, 12], i.e. the mean of the reward rates of all the\noptions across games/blocks is 0.5. Here, we manipulate and partially inform the participants of the\ntrue generative prior distribution of reward rates in four different environments: high/low abundance,\nhigh/low variance. Notably, the information provided about reward availability is not speci\ufb01c to\nthe arms in a deterministic way as in some previous studies [13], but rather independently affects all\narms within the environment. Our goal is to examine how humans adapt their decision-making to\ndifferent reward environments. Particularly, we focus on whether human participants have veridical\nprior beliefs about reward rates. 
To gain greater computational insight into human learning and\ndecision making, we compare several previously proposed models in their ability to capture the\nhuman trial-by-trial choices as well as self-report data.\nSpeci\ufb01cally, we consider two Bayesian learning models, DBM and FBM [5], as well as a simple\nreinforcement learning model (RL) \u2013 the delta rule [14], all coupled with a softmax decision policy.\nBecause FBM (correctly) assumes the reward structure to remain \ufb01xed during a game, it updates\nthe posterior mean by weighing new observations with decreasing coef\ufb01cients, as the variance of\nthe posterior distribution decreases over time. In contrast, by assuming the reward rates to have a\n(small) \ufb01xed probability of being redrawn from a prior distribution on each trial, DBM continuously\nupdates the posterior reward rate distribution by exponentially forgetting past observations, and\ninjecting a \ufb01xed prior bias [5, 9]. FBM can be viewed as a special case of DBM, whereby the\nprobability of redrawing from the prior distribution is zero on each trial. RL has been widely used in\nthe neuroscience literature [3, 15, 16], and dopamine has been suggested to encode the prediction\nerror incorporated in the RL model [15, 17]. DBM is related to RL in that the stability parameter\nof DBM also controls the exponential weights as the learning rate parameter of RL does, but the two\nmodels are not mathematically equivalent. In particular, RL has no means of injecting a prior\nbias on each trial, as DBM does [5, 9]. 
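The contrast between the two kinds of update just described can be sketched in a few lines of code (an illustrative sketch, not the authors' implementation; the grid resolution, gamma, and learning-rate values below are arbitrary placeholders):

```python
import numpy as np

def dbm_update(belief, grid, prior, reward, gamma):
    """One DBM step for the chosen arm: Bayes update of the grid-discretized
    posterior over reward rates, then mixing with the prior (the per-trial
    probability of a change is 1 - gamma), which injects the prior bias."""
    lik = grid if reward == 1 else 1.0 - grid   # Bernoulli likelihood
    post = lik * belief
    post /= post.sum()
    return gamma * post + (1.0 - gamma) * prior  # predictive for next trial

def rl_update(q, reward, eps):
    """Delta rule: constant learning rate, no per-trial prior injection."""
    return q + eps * (reward - q)

# Example: a low prior mean, then a string of rewards on the chosen arm
grid = np.linspace(0.01, 0.99, 99)
prior = grid**(2 - 1) * (1 - grid)**(4 - 1)     # Beta(2, 4), mean 1/3
prior /= prior.sum()

belief, q = prior.copy(), 1 / 3
for r in [1, 1, 1]:
    belief = dbm_update(belief, grid, prior, r, gamma=0.7)
    q = rl_update(q, r, eps=0.3)
print((belief * grid).sum(), q)  # DBM's mean stays pulled toward the prior
```

Because DBM re-injects the prior on every trial, its mean estimate after the same observations remains closer to the prior mean than the delta rule's estimate.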
For the decision policy, we employ a variant of the softmax\npolicy, which is popular in psychology and neuroscience, and has been frequently used to model\nhuman behavior in the bandit task [1, 3, 11, 12, 18] \u2013 our variant is polynomial rather than the more\ncommon exponential form, such that a polynomial exponent of one corresponds to exact \u201cmatching\u201d\n[8], whereas the exponential form has no setting equivalent to matching.\nIn the following, we \ufb01rst describe the experiment and some model-free data analyses (Sec. 2), then\npresent the model and related analyses (Sec. 3), and \ufb01nally discuss the implications of our work and\npotential future directions (Sec. 4).\n\n2 Experiment\n\nExperimental design. We recruited 107 UCSD students to participate in a four-armed, binary-outcome\n(success/fail) bandit task, with 200 games in total played by each participant. Each game\ncontains 15 trials, i.e. 15 decisions in sequence, choosing among the same four options, where reward\nrates are \ufb01xed throughout the 15 trials. On each trial, a participant is shown four arms along with their\nprevious choice and reward histories in the game (Fig. 1A); the chosen arm produces a reward or\nfailure based on its (hidden) reward probability. Thirty-two participants were required to report their\nestimate of the reward rates of the never-chosen arms at the end of each game. Participants received\ncourse credits and $0.05 for every point earned in \ufb01ve randomly chosen games (amounts paid ranged\nfrom $1.15 to $4.90, with an average of $1.99).\nWe separated the 200 games into four consecutive environments (sessions), with 50 games each, and\nprovided a clear indication that the participant was entering a new environment. Each environment\n\n\fFigure 1: Experimental design. (A) Experimental interface. 
The total number of attempts so far, total\nreward, current environment, and the cumulative reward of each option are shown to the subjects on\neach trial during the experiment. The four panels correspond to the four options (arms). A green\ncircle represents a success (1 point), and a red circle represents a failure (0 point). (B) An example of\nthe \ufb01shing report: the numbers represent the total number of \ufb01sh caught out of 10 attempts at each of\nthe 20 random locations in the environment.\n\ncorresponds to a setting of high/low abundance and high/low variance. For any game within a given\nenvironment, the reward rates of the four arms are identically and independently pre-sampled from\none of four Beta distributions: Beta(4, 2), Beta(2, 4), Beta(30, 15) and Beta(15, 30). These Beta\ndistributions represent the prior distribution of reward rates, where the high/low mean (0.67/0.33) and\nhigh/low standard deviation (0.18/0.07) of the distributions correspond to high/low abundance and\nhigh/low variability of the environment respectively. The order of the four environments, as well as\nthe order of the pre-sampled reward rates in each environment, are randomized for each subject.\nWe portrayed the task as an ice \ufb01shing contest, where the four arms represent four \ufb01shing holes.\nParticipants are informed that the different camps (games) they \ufb01sh from reside on four different\nlakes (environments) that vary in (a) overall abundance of \ufb01sh, and (b) variability of \ufb01sh abundance\nacross locations. Each environment is presented along with the lake\u2019s \ufb01shing conditions (high/low\nabundance, high/low variance) and samples from the reward distribution (a \ufb01shing report showing the\nnumber of \ufb01sh caught out of ten attempts at 20 random locations in the lake (Fig. 1B)).\nResults. The reported reward rates of the never-chosen arms are shown in Fig. 2A. 
Human subjects\nreported estimates of reward rate signi\ufb01cantly lower than the true generative prior mean (p < .001),\nexcept in the low-abundance, low-variance environment (p = 0.2973). The average reported estimates\nacross the four reward environments are not signi\ufb01cantly different (F (3, 91) = 1.78, p = 0.1570),\nnor across different mean, variance or the interaction (mean: F (1, 91) = 2.93, p = 0.0902, variance:\nF (1, 91) = 0.23, p = 0.6316, interaction: F (1, 91) = 1.77, p = 0.1870). This result indicates that\nhumans do not alter their prior belief about the reward rates even when provided with both explicit\n(verbal) and implicit (sampled) information about the reward statistics of the current environment.\nIn spite of systematically underestimating expected rewards, our participants appear to perform\nrelatively well in the task. The actual total reward accrued by the subjects is only slightly lower than\nthat of the optimal algorithm utilizing the correct Bayesian inference and the dynamic-programming-based\noptimal decision policy (Fig. 2B); humans also perform signi\ufb01cantly better than the chance level\nattained by a random policy (p < .001, see Fig. 2B), which is equal to the generative prior mean of\nthe reward rates. Thus, participants experience empirical reward rates higher than the generative prior\nmean (since they earned more than the random policy); nevertheless, they signi\ufb01cantly underestimate\nthe mean reward rates.\n\n3 Models\n\nHow do humans achieve relatively good performance with an \u201cirrationally\u201d low expectation of reward\nrates? We attempt to gain some insight into human learning and decision-making processes in the bandit\ntask via computational modeling. We consider three learning models, DBM, FBM, and RL, coupled\nwith a softmax decision policy. In the following, we \ufb01rst formally describe the models (Sec. 3.1), then\ncompare their ability to explain/predict the data (Sec. 
3.2), and \ufb01nally present simulation results\n\n\fFigure 2: Error bars: s.e.m. across participants or validation runs. (+M, -V) denotes high mean\n(abundance), low variance, and so on. (A) Reported reward rate estimates by human subjects (orange),\nand \ufb01tted prior mean of DBM (blue), FBM (purple), and RL (green). Dotted lines: the true generative\nprior mean (0.67/0.33 for high/low abundance environments). (B) Reward rates earned by human\nsubjects (orange), and expected reward rate of the optimal policy (brown) and a random choice policy\n(gray). (C) Averaged per-trial likelihood of 10-fold cross validation of three learning models. Dotted\nline: the chance level (0.25). (D) Fitted softmax b and DBM \u03b3 parameters.\n\nto gain additional insights into what our experimental \ufb01ndings may imply about human cognition\n(Sec. 3.3, 3.4).\n\n3.1 Model description\n\nWe denote the reward rate of arm k at time t as \u03b8^t_k, k \u2208 {1, 2, 3, 4}, 1 \u2264 t \u2264 15, and \u03b8^t = [\u03b8^t_1, \u03b8^t_2, \u03b8^t_3, \u03b8^t_4]. We denote the reward outcome at time t as R_t \u2208 {0, 1}, and R^t = [R_1, R_2, . . . , R_t]. We denote the decision at time t as D_t, D_t \u2208 {1, 2, 3, 4}, and D^t = [D_1, D_2, . . . , D_t].\nDynamic belief model (DBM). As with the actual experimental design, DBM assumes that the\nbinary reward (1: reward, 0: no reward) distribution of the chosen arm is Bernoulli. Unlike the actual\nexperimental design, where reward rates of a game are \ufb01xed, DBM assumes that the reward rate\n(Bernoulli rate parameter) for each arm undergoes discrete, un-signaled changes independently with a\nper-trial probability of 1 \u2212 \u03b3, 0 \u2264 \u03b3 \u2264 1. The reward rate at time t remains the same with probability\n\u03b3, and is re-sampled from the prior distribution p_0(\u03b8) with probability 1 \u2212 \u03b3. 
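The generative process that DBM assumes (each arm's reward rate occasionally redrawn from the prior) can be simulated directly. A minimal sketch, with Beta(2, 4) standing in for the low-abundance, high-variance prior and an arbitrary placeholder value for gamma:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.8          # per-trial probability that a reward rate survives
T, K = 15, 4         # trials per game, number of arms

# Prior over reward rates: Beta(2, 4) (mean 1/3, as in the low-abundance,
# high-variance environment)
theta = rng.beta(2, 4, size=K)
for t in range(T):
    # With probability 1 - gamma, each arm's rate is independently redrawn
    redraw = rng.random(K) > gamma
    theta = np.where(redraw, rng.beta(2, 4, size=K), theta)
    # Bernoulli outcome for each arm (in the task only the chosen arm's
    # outcome is observed; all four are drawn here for illustration)
    rewards = rng.random(K) < theta
```

Note that in the actual experiment the reward rates never change within a game; the redraw step above is what DBM assumes, not what the environment does.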
The observed data has the distribution P(R_t = 1 | \u03b8^t, D_t = k) = \u03b8^t_k.\nThe hidden variable dynamics of DBM is\n\np(\u03b8^t_k = \u03b8 | \u03b8^{t-1}_k) = \u03b3\u03b4(\u03b8^{t-1}_k \u2212 \u03b8) + (1 \u2212 \u03b3)p_0(\u03b8),   (1)\n\nwhere \u03b4(x) is the Dirac delta function, and p_0(\u03b8) is the assumed prior distribution.\nThe posterior reward rate distribution given the reward outcomes up to time t can be computed\niteratively via Bayes\u2019 rule as\n\np(\u03b8^t_k | R^t, D^t) \u221d p(R_t | \u03b8^t_k) p(\u03b8^t_k | R^{t-1}, D^{t-1}), if D_t = k.   (2)\n\nOnly the posterior distribution of the chosen arm is updated with the new observation, while the\nposterior distributions of the other arms are the same as the predictive reward rate distribution (see\nbelow), i.e. p(\u03b8^t_k | R^t, D^t) = p(\u03b8^t_k | R^{t-1}, D^{t-1}), if D_t \u2260 k.\nThe predictive reward rate distribution at time t given the reward outcomes up to time t \u2212 1 is a\nweighted sum of the posterior probability and the prior probability:\n\np(\u03b8^t_k = \u03b8 | R^{t-1}, D^{t-1}) = \u03b3p(\u03b8^{t-1}_k = \u03b8 | R^{t-1}, D^{t-1}) + (1 \u2212 \u03b3)p_0(\u03b8), for all k.   (3)\n\nThe expected (mean predicted) reward rate of arm k at trial t is \u02c6\u03b8^t_k = E[\u03b8^t_k | R^{t-1}, D^{t-1}]. DBM can\nbe well approximated by an exponential \ufb01lter [5]; thus \u03b3 is also related to the length of the integration\nwindow as well as the exponential decay rate.\nFixed belief model (FBM). FBM assumes that the statistical environment remains \ufb01xed throughout\nthe game, e.g. the reward outcomes are Bernoulli samples generated from a \ufb01xed rate parameter \u03b8. It\ncan be viewed as a special case of DBM with \u03b3 = 1.\nReinforcement learning (RL). 
The update rule of a simple and commonly used reinforcement\nlearning model is\n\n\u02c6\u03b8^t_k = \u02c6\u03b8^{t-1}_k + \u03b5(R_t \u2212 \u02c6\u03b8^{t-1}_k), if D_t = k,   (4)\n\n\fwith an initial value \u02c6\u03b8^0_k = \u03b8_0, and 0 \u2264 \u03b5 \u2264 1. \u03b5 is the learning parameter that controls the exponential\ndecay of the coef\ufb01cients associated with the previous observations. In the multi-armed bandit task,\nonly the chosen arm is updated, while the other arms remain the same, i.e. \u02c6\u03b8^t_k = \u02c6\u03b8^{t-1}_k, if D_t \u2260 k.\nSoftmax decision policy. We use a version of the softmax decision policy that assumes the choice\nprobabilities among the options to be normalized polynomial functions of the estimated expected\nreward rates [8], with a parameter b:\n\np(D_t = k) = (\u02c6\u03b8^t_k)^b / \u2211_{i=1}^{K} (\u02c6\u03b8^t_i)^b,   (5)\n\nwhere K is the total number of options (K = 4), b \u2265 0. When b = 0, the choice is at random, i.e.\np(D_t = k) = 1/K for all options. When b = 1, it is exact matching [8]. When b \u2192 \u221e, the most\nrewarding option is always chosen. By varying the b parameter, the softmax policy is able to capture\nmore or less \u201cnoisy\u201d choice behavior. However, we note that softmax ascribes unexplained variance\nin choice behavior entirely to \u201cnoise\u201d, when subjects may indeed employ a much more strategic policy\nwhose learning and decision components are poorly captured by the model(s) under consideration.\nThus, a smaller \ufb01tted b does not imply subjects are necessarily more noisy or care about rewards less; it\nmay simply mean that the model is less good at capturing subjects\u2019 internal processes.\nOptimal policy. The multi-armed bandit problem can be viewed as a Markov decision process,\nwhere the state variable is the posterior belief after making each observation. 
The optimal solution to\nthe problem considered here can be computed numerically via dynamic programming [4, 19], where\nthe optimal learning model is FBM with the correct prior distribution. Previously, it has been shown\nthat human behavior does not follow the optimal policy [4]; nevertheless, it is a useful model to\nconsider in order to assess the performance of human subjects and the various other models in terms\nof maximal expected total reward.\n\n3.2 Model comparison\n\nHere, we compare the three learning models to human behavior, in order to identify the best (of those\nconsidered) formal description of the underlying psychological processes.\nWe \ufb01rst evaluate how well the three learning models \ufb01t human data. We perform 10-fold cross-\nvalidation to avoid over\ufb01tting for comparison, since the models have different numbers of free\nparameters. We use per-trial likelihood as the evaluation metric, calculated as exp(log L/N ), where\nL is the maximum likelihood of the data, and N is the total number of data points. The per-trial likelihood\ncan also be interpreted as the trial-by-trial predictive accuracy (i.e., on average, how likely is it\nthat the model will choose the same arm the human participant chose), so we will also refer to this\nmeasurement as predictive accuracy. We \ufb01t the prior weight (\u03b1 + \u03b2, related to precision) at the group\nlevel. We \ufb01t the prior mean (\u03b1/(\u03b1 + \u03b2)), DBM \u03b3, RL \u03b5, and softmax b parameters at the individual level,\nand separately for the four reward environments. Based on 10-fold cross-validation comparing per-condition \ufb01tting and common-parameter \ufb01tting, this \ufb01tting strategy predicts subjects\u2019 choices better\nthan variants that \ufb01t a shared parameter across participants or environments; the cross-validation also\nmitigates the over-\ufb01tting issue. 
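As a concrete illustration of the evaluation metric: the per-trial likelihood exp(log L/N) is the geometric mean of the per-trial choice probabilities, so a chance-level model over four arms scores 0.25 regardless of the number of trials (a minimal sketch with a made-up N, not the paper's data):

```python
import math

def per_trial_likelihood(log_likelihood, n_trials):
    """Geometric-mean likelihood per trial: exp(log L / N). Interpretable as
    the average trial-by-trial probability assigned to the observed choices."""
    return math.exp(log_likelihood / n_trials)

# A chance-level model over K = 4 arms assigns probability 0.25 on every
# trial, so its per-trial likelihood is 0.25 for any N (hypothetical N here).
n = 1000
chance = per_trial_likelihood(n * math.log(0.25), n)
print(chance)  # ~0.25, the chance level marked in Fig. 2C
```

This is why 0.25 is the natural chance baseline against which the held-out per-trial likelihoods in Fig. 2C are compared.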
Moreover, when the DBM \u03b3 and softmax b are held \ufb01xed\nacross the four conditions for each participant, we \ufb01nd the same pattern of results as when those\nparameters are allowed to vary across conditions (results not shown).\nFig. 2C shows the held-out per-trial likelihood for DBM, FBM, and RL, averaged across ten runs of\ncross-validation. DBM achieves signi\ufb01cantly higher per-trial likelihood than FBM (p < .001) and RL\n(p < .001) based on paired t-tests, i.e., predicting human behavior better than the other two models.\nThe predictive accuracies of the three \ufb01tted models on the whole dataset are 0.4182 (DBM), 0.4038 (RL),\nand 0.3856 (FBM). DBM also achieves lower BIC and AIC values than RL or FBM, in spite of\nincurring a penalty for the additional parameter. This result corroborates previous \ufb01ndings [4] that\nhumans assume non-stationarity by default in the multi-armed bandit task, even though the reward\nstructure is truly stationary.\nNext, we examine how well the learning models can recover the underestimation effect observed\nin human participants. The reported estimates concern the arm(s) that participants never chose in\na game, and thus re\ufb02ect their belief about the mean reward rate before any observation, i.e., they are\nmathematically equivalent to the prior mean (DBM & FBM) or the initial value (RL). For simplicity, we will refer to\nthem all as the prior mean. Fig. 2A shows the average \ufb01tted prior mean of the models. 
FBM recovers prior mean values that are well correlated with the true generative prior means (r = +.96, p < 0.05),\nand signi\ufb01cantly different in the four environments (F (3, 424) = 13.47, p < .001). The recovered\nprior means for RL are also signi\ufb01cantly different in the four environments (F (3, 424) = 4.21, p <\n0.01). In contrast, the recovered prior means for DBM are not signi\ufb01cantly different in the four\nenvironments (F (3, 424) = 0.91, p = 0.4350), just like human estimates (Fig. 2A). DBM also\nrecovers prior mean values in the low-abundance, high-variance environment slightly lower than in\nother environments, similar to human reports. In summary, DBM allows for better recovery of human\ninternal prior beliefs of reward expectation than FBM or RL.\n\n\fTable 1: Inferred RL learning rates\n\nModel   (+M, +V)         (+M, -V)         (-M, +V)         (-M, -V)\nRL      0.35 (SD=0.02)   0.39 (SD=0.02)   0.22 (SD=0.02)   0.25 (SD=0.02)\n\nTable 2: Inferred softmax b\n\nModel   (+M, +V)          (+M, -V)          (-M, +V)         (-M, -V)\nRL      6.94 (SD=0.54)    5.41 (SD=0.48)    5.72 (SD=0.51)   5.07 (SD=0.51)\nFBM     12.28 (SD=0.58)   10.52 (SD=0.54)   6.25 (SD=0.42)   5.68 (SD=0.37)\n\nTaking DBM as the best model for learning, we can then examine the other \ufb01tted learning and\ndecision-making parameters. A higher softmax b corresponds to a more myopic, less exploratory, and\nless stochastic choice behavior. A lower DBM \u03b3 corresponds to a higher change rate and a shorter\nintegration window of the exponential weights [5]. The prior weight of DBM is \ufb01tted to six, which is\nequivalent to six pseudo-observations before the task; it is also the same as the true prior weight in the\nexperimental design for the high variance environments. Fig. 2D shows the \ufb01tted DBM \u03b3 and softmax\nb in the four reward environments. 
In high abundance environments, softmax b is \ufb01tted to larger values,\nwhile DBM \u03b3 is \ufb01tted to lower values, than in the low abundance environments (four paired t-tests,\np < .01 for all). They do not vary signi\ufb01cantly across low- and high-variance environments (four\npaired t-tests, p > .05 for all). The \ufb01tted DBM \u03b3 values imply that human participants behave as if they\nbelieve the reward rates change on average approximately once every three trials in high-abundance\nenvironments, and once every four trials in low-abundance environments (the mean change interval is\n1/(1-\u03b3)).\n\n3.3 Simulation results: shift rate facing an unexpected loss\n\nWe simulated the models under the same reward rates as in the human experiment with the parameters\n\ufb01tted to human data, averaged across participants. The \ufb01tted DBM \u03b3 and the \ufb01tted softmax b for\nDBM are shown in Fig. 2D. The \ufb01tted learning rate of RL is shown in Table 1, and the \ufb01tted\nsoftmax b for FBM and RL are shown in Table 2.\nTo gain some insight into how DBM behaves differently than FBM and RL, and thus what it\nimplies about human psychological processes, we consider the empirical/simulated probability of the\nparticipants/model switching away from a \u201cwinning\u201d arm after it suddenly produces a loss (Fig. 3A).\nSince DBM assumes reward rates can change at any time, a string of wins followed by a loss indicates a\nhigh probability of the arm switching to a low reward rate, especially with a low abundance prior\nbelief, where a switch is likely to result in a new reward rate that is also low. On the other hand, since\nFBM assumes reward rates to be stable, it depends more on long-term statistics to estimate an arm\u2019s\nreward rate. Given observations of many wins, which lead to a relatively high reward rate estimate as\nwell as a relatively low uncertainty, a single loss should still induce in FBM a high probability of\nsticking with the same arm. 
RL can adjust its reward estimate according to unexpected observations,\nbut is much slower than DBM in doing so, since it has a constant learning rate that is not increased\nwhen there is high posterior probability of a recent change point (as when a string of wins is followed\nby a loss); RL also cannot persistently encode prior information about overall reward abundance in\nthe environment when a change occurs (e.g. with a low-abundance belief, the new reward rate after a\nchange point is likely to be low). We would thus expect RL to also shift less frequently than DBM in\nthis scenario. Fig. 3A shows that the simulated shift rates of the three models (the probability of a model\nshifting away from the previously chosen arm) exactly follow the pattern of behavior described above.\n\n\fFigure 3: (A) Probability of shifting to other arms after a failure preceded by three consecutive\nsuccesses on the same arm. Error bar shows the s.e.m. across participants/simulation runs. (B)\nReward rates achieved in the high variance environment with low abundance and with (C) high abundance\nby different models: DBM (blue), FBM (purple), RL (green) and optimal policy (brown). The\ndiamond symbols represent the actual reward per trial earned by human subjects (y-axis) vs. the \ufb01tted\nprior mean (x-axis) of the three models. Vertical dotted lines: true generative prior mean.\n\nHuman subjects\u2019 shift rates are closest to what DBM predicts, which is what we would expect\nfrom the fact that overall DBM has been found to \ufb01t human data the best.\n\n3.4 Simulation results: understanding human reward underestimation\n\nFinally, we try to understand why humans might exhibit a \u201cpessimism bias\u201d in their reward rate\nexpectation. Fig. 3B,C shows the simulated average earned reward per trial of the various models as\na function of the assumed prior mean. 
DBM, FBM, and RL are simulated with the average parameters\nthat are \ufb01tted to human data, except we allow the prior mean parameter to vary in each case. The\noptimal policy is computed with different prior means and the correct prior variance. The per-trial\nearned reward rates are calculated from the simulation of the models/optimal policy under the same\nreward rates as the human experiment. We focus on the high variance environments, since the model\nperformance is relatively insensitive to the assumed prior mean in low variance environments (not\nshown).\nFirstly, consider the diamond symbols in Fig. 3B,C: the combination of human subjects\u2019 actual\naverage per-trial earned reward (y-axis) and the \ufb01tted prior mean for each of the three models (x-axis,\ncolor-coded) is very close to DBM\u2019s joint predictions of the two quantities (blue lines), but very far\naway from the joint predictions of FBM (purple line) and RL (green line). This result\nprovides additional evidence that DBM can predict and capture human performance better than the\nother two models.\nMore interestingly, while the optimal policy (brown) achieves the highest earned reward when it\nassumes the correct prior (as expected), FBM coupled with softmax achieves its maximum reward at\na prior mean much lower than the true generative mean. Given that FBM is the correct generative\nmodel, this implies that one way to compensate for using the sub-optimal softmax policy, instead of\nthe optimal (dynamic programming-derived) policy, is to somewhat underestimate the prior mean. In\naddition, DBM achieves maximal earned reward with an assumed prior mean even lower than FBM\u2019s,\nimplying that even more prior reward rate underestimation is needed to compensate for assuming\nenvironmental volatility (when the environment is truly stable). 
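The kind of simulation involved can be sketched as follows (not the authors' code: a DBM learner with the polynomial softmax plays stationary games whose true rates are drawn from Beta(2, 4), while its assumed prior mean is varied; gamma, b, the prior weight, and the number of games are placeholder values loosely based on the fitted values reported above):

```python
import numpy as np

rng = np.random.default_rng(1)
grid = np.linspace(0.01, 0.99, 99)

def beta_pdf_grid(a, b):
    """Discretized Beta(a, b) density, normalized over the grid."""
    p = grid**(a - 1) * (1 - grid)**(b - 1)
    return p / p.sum()

def play_game(prior_mean, gamma=0.8, b=6.0, weight=6.0, T=15, K=4):
    """One stationary game: true rates ~ Beta(2, 4); the DBM learner assumes
    volatility (gamma < 1) and an assumed prior with the given mean."""
    theta_true = rng.beta(2, 4, size=K)
    prior = beta_pdf_grid(prior_mean * weight, (1 - prior_mean) * weight)
    beliefs = np.tile(prior, (K, 1))
    total = 0.0
    for _ in range(T):
        means = beliefs @ grid
        p = means**b / (means**b).sum()           # polynomial softmax
        k = rng.choice(K, p=p)
        r = float(rng.random() < theta_true[k])
        total += r
        lik = grid if r else 1 - grid             # Bayes update, chosen arm
        beliefs[k] = lik * beliefs[k]
        beliefs[k] /= beliefs[k].sum()
        beliefs = gamma * beliefs + (1 - gamma) * prior  # DBM forgetting
    return total / T

# Earned reward as a function of the assumed prior mean (true mean is 1/3)
for m in [0.1, 1/3, 0.6]:
    print(m, np.mean([play_game(m) for _ in range(200)]))
```

Sweeping the assumed prior mean in this way is what produces curves like those in Fig. 3B: for a volatility-assuming learner with softmax, the peak can lie below the true generative mean.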
We note that human participants do\nnot assume a prior mean that optimizes the earning of reward (blue diamonds are far from the peak\nof the blue lines) \u2013 this may re\ufb02ect a compromise between optimizing reward earned and truthfully\nrepresenting environmental statistics.\n\n4 Discussion\n\nOur results show that humans underestimate the expected reward rates (a pessimism bias), and this\nunderestimation is only recoverable by the prior mean of DBM. DBM is also found to be better than\nFBM or RL at predicting human behavior in terms of trial-by-trial choice and actual rewards earned.\nOur results provide further evidence that humans underestimate the stability of the environment,\ni.e. assuming the environment to be non-stationary when the real setting is stationary. This default\nnon-stationarity belief might be bene\ufb01cial in real world scenarios in the long run, where the behavioral\nenvironment can be volatile [5]. It is worth noting that the best earned per-trial reward rates achievable\nby DBM and FBM are quite close. In other words, as long as a softmax policy is being used, there is\nno disadvantage to incorrectly assuming environmental volatility, as long as the assumed prior mean\nparameter is appropriately tuned. Participants\u2019 actual assumed prior mean (if DBM is correct) is not\nat the mode of the simulated performance curve, which lies at an even lower prior mean than the one\nsubjects appear to adopt. This may re\ufb02ect a tension between accurately internalizing environmental\nstatistics and assuming a statistical prior that achieves better outcomes.\nHumans often have mis-speci\ufb01ed beliefs, even though most theories predict optimal behavior when\nenvironmental statistics are correctly internalized. Humans have been found to overestimate their\nabilities and underestimate the probabilities of negative events in the environment [20]. 
Our result might seem to contradict these earlier findings; however, having a lower prior expectation is not necessarily in conflict with an otherwise optimistic bias. We find that human participants earn relatively high reward rates while reporting low expectations for the unchosen arm(s). It is possible that they are optimistic about their ability to succeed in an overall hostile environment (even though they over-estimate its hostility).

Several previous studies [13, 21, 22] found a human tendency to sometimes under-estimate and sometimes over-estimate environmental reward availability in various behavioral tasks. Major differences in task design complicate direct comparisons among the studies. However, our empirical finding of human reward under-estimation is broadly consistent with the notion that humans do not always veridically represent reward statistics. More importantly, we propose a novel mechanism/principle for why this is the case: reward under-estimation is a compensatory measure for assuming volatility, which is useful for coping with non-stationary environments, and for utilizing a cheap decision policy such as softmax. Separately, our work implies that systematic errors may creep into investigators' estimates of subjects' reward expectations when an incorrect learning model is assumed (e.g. assuming subjects believe the reward statistics to be stationary when they actually assume volatility). For future work, it would be interesting to also include an individual measure of trait optimism (e.g. LOT-R), to see if the form of pessimism we observe correlates with optimism scores.

One limitation of this study is that human reports might be unreliable, or biased by the experimental design. For example, one might object to our “pessimism bias” interpretation of lower expected reward rates for unchosen alternatives by attributing it to a confirmation bias [23] or a sampling bias [24].
That is, subjects may report especially low reward estimates for unchosen options, either because they retroactively view discarded options as more undesirable (confirmation bias), or because they failed to choose the unchosen options precisely because they had decided these were less valuable to begin with for some reason (regardless of the reason, this scenario introduces a sampling bias over the unchosen arms). However, both of these explanations fail to account for the larger reward under-estimation effect observed in high-abundance environments compared to low-abundance environments, which is also predicted by DBM. Moreover, DBM predicts human choice better than the other models on a trial-by-trial basis, lending further evidence that the reward under-estimation effect is real.

Another modeling limitation is that we only include softmax as the decision policy. A previous study found that Knowledge Gradient (KG), which approximates the relative action values of exploiting and exploring to deterministically choose the best option on each trial [25], was the best model among several decision policies [4] (not including softmax). However, in a later study, softmax was found to explain human data better than KG [11]. This is not entirely surprising, because even though KG is simpler than the optimal policy, it is still significantly more complicated than softmax. Moreover, even if humans use a model similar to KG, they may not give the optimal weight to the exploratory knowledge gain on each trial, or may not use a deterministic decision policy. We therefore also conceive a “soft-KG” policy, which adds a weighted knowledge gain to the immediate reward rates in the softmax policy. Based on ten-fold cross-validation, the held-out average per-trial likelihoods of softmax and soft-KG in explaining human data are not significantly different (p = 0.0693), averaging 0.4248 and 0.4250 respectively.
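The softmax and soft-KG policies just described can be sketched as follows (our own illustration: the per-arm knowledge-gain term here is a generic stand-in, e.g. a posterior-uncertainty bonus, not the exact KG computation of the cited work; function names are ours).

```python
import math

def softmax_probs(values, beta):
    """P(choose i) proportional to exp(beta * values[i]), numerically stabilized."""
    m = max(values)
    e = [math.exp(beta * (v - m)) for v in values]
    z = sum(e)
    return [x / z for x in e]

def soft_kg_probs(means, gains, w, beta):
    """'Soft-KG' sketch: softmax over immediate reward estimates plus a
    weighted per-arm knowledge-gain bonus. `gains` is a stand-in for the
    true KG term (e.g. the posterior std of each arm's reward rate)."""
    return softmax_probs([m + w * g for m, g in zip(means, gains)], beta)

# Example: arm 1 looks slightly worse but is much less well known, so a
# positive knowledge-gain weight w shifts choice probability toward it.
p_plain = softmax_probs([0.6, 0.55], beta=10.0)
p_soft = soft_kg_probs([0.6, 0.55], gains=[0.02, 0.15], w=1.0, beta=10.0)
```

With w = 0 the bonus vanishes and soft-KG reduces exactly to plain softmax, which is why the two policies can be hard to discriminate on held-out likelihood.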
Compared to softmax, 44 subjects have a higher, 44 a lower, and 19 an equal held-out likelihood under soft-KG. Besides KG, there is a rich literature on the decision-making component [26, 27], as opposed to the learning component that we focus on here; for simplicity, we only included the softmax policy. Future studies with greater statistical power may be able to discriminate between softmax and soft-KG, or other decision policies, but this is beyond the scope and intention of this work.

While this study is primarily focused on modeling human behavior in the bandit task, it may have interesting implications for the study of bandit problems in machine learning as well. For example, our study suggests that human learning and decision making are sub-optimal in various ways: assuming an incorrect prior mean, and assuming environmental statistics to be non-stationary when they are in fact stable. However, our analyses also show that these sub-optimalities may be combined to achieve better performance than one might expect, as they partly compensate for each other. Given that these sub-optimal assumptions also confer certain computational advantages (e.g. softmax is computationally much simpler than the optimal policy, and DBM can handle a much broader range of temporal statistics than FBM), understanding how these algorithms fit together in humans may, in the future, yield better algorithms for machine learning applications as well.

Acknowledgments

We thank Shunan Zhang, Henry Qiu, Alvita Tran, Joseph Schilz, and the numerous undergraduate research assistants who helped with the data collection. We thank Samer Sabri for helpful input on the writing. This work was funded in part by an NSF CRCNS grant (BCS-1309346) to AJY.

References

[1] N D Daw, J P O'Doherty, P Dayan, B Seymour, and R J Dolan. Cortical substrates for exploratory decisions in humans.
Nature, 441(7095):876–9, 2006.

[2] J D Cohen, S M McClure, and A J Yu. Should I stay or should I go? How the human brain manages the tradeoff between exploitation and exploration. Philos Trans R Soc Lond B Biol Sci, 362(1481):933–42, 2007.

[3] T Schönberg, N D Daw, D Joel, and J P O'Doherty. Reinforcement learning signals in the human striatum distinguish learners from nonlearners during reward-based decision making. Journal of Neuroscience, 27(47):12860–12867, 2007.

[4] S Zhang and A J Yu. Forgetful Bayes and myopic planning: Human learning and decision-making in a bandit setting. Advances in Neural Information Processing Systems, 26, 2013.

[5] A J Yu and J D Cohen. Sequential effects: Superstition or rational behavior? Advances in Neural Information Processing Systems, 21:1873–80, 2009.

[6] P Shenoy, R Rao, and A J Yu. A rational decision making framework for inhibitory control. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, volume 23, pages 2146–54. MIT Press, Cambridge, MA, 2010.

[7] J S Ide, P Shenoy, A J Yu*, and C-S R Li*. Bayesian prediction and evaluation in the anterior cingulate cortex. Journal of Neuroscience, 33:2039–2047, 2013. *Co-senior authors.

[8] A J Yu and H Huang. Maximizing masquerading as matching: Statistical learning and decision-making in choice behavior. Decision, 1(4):275–287, 2014.

[9] C Ryali, G Reddy, and A J Yu. Demystifying excessively volatile human learning: A Bayesian persistent prior and a neural approximation. Advances in Neural Information Processing Systems, 2018.

[10] M Steyvers, M D Lee, and E J Wagenmakers. A Bayesian analysis of human decision-making on bandit problems. J Math Psychol, 53:168–79, 2009.

[11] K M Harlé, S Zhang, M Schiff, S Mackey, M P Paulus*, and A J Yu*.
Altered statistical learning and decision-making in methamphetamine dependence: evidence from a two-armed bandit task. Frontiers in Psychology, 6(1910), 2015. *Co-senior authors.

[12] K M Harlé, D Guo, S Zhang, M P Paulus, and A J Yu. Anhedonia and anxiety underlying depressive symptomatology have distinct effects on reward-based decision-making. PLoS One, 12(10):e0186473, 2017.

[13] S J Gershman and Y Niv. Novelty and inductive generalization in human reinforcement learning. Topics in Cognitive Science, 7(3):391–415, 2015.

[14] R A Rescorla and A R Wagner. A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and nonreinforcement. In A H Black and W F Prokasy, editors, Classical Conditioning II: Current Research and Theory, pages 64–99. Appleton-Century-Crofts, New York, 1972.

[15] W Schultz. Predictive reward signal of dopamine neurons. J. Neurophysiol., 80:1–27, 1998.

[16] T E J Behrens, M W Woolrich, M E Walton, and M F S Rushworth. Learning the value of information in an uncertain world. Nature Neurosci, 10(9):1214–21, 2007.

[17] P R Montague, P Dayan, and T J Sejnowski. A framework for mesencephalic dopamine systems based on predictive Hebbian learning. Journal of Neuroscience, 16:1936–1947, 1996.

[18] I C Dezza, A J Yu, A Cleeremans, and A William. Learning the value of information and reward over time when solving exploration-exploitation problems. Scientific Reports, 7(1):16919, 2017.

[19] B B Averbeck. Theory of choice in bandit, information sampling and foraging tasks. PLoS Computational Biology, 11(3):e1004164, 2015.

[20] T Sharot. The optimism bias. Current Biology, 21(23):R941–R945, 2011.

[21] A Stankevicius, Q J M Huys, A Kalra, and P Seriès. Optimism as a prior belief about the probability of future reward. PLoS Computational Biology, 10(5):e1003605, 2014.

[22] M Bateson.
Optimistic and pessimistic biases: a primer for behavioural ecologists. Current Opinion in Behavioral Sciences, 12:115–121, 2016.

[23] R S Nickerson. The production and perception of randomness. Psychol. Rev., 109:330–57, 2002.

[24] R A Berk. An introduction to sample selection bias in sociological data. American Sociological Review, pages 386–398, 1983.

[25] P Frazier and A J Yu. Sequential hypothesis testing under stochastic deadlines. Advances in Neural Information Processing Systems, 20, 2008.

[26] S J Gershman. Deconstructing the human algorithms for exploration. Cognition, 173:34–42, 2018.

[27] M Speekenbrink and E Konstantinidis. Uncertainty and exploration in a restless bandit problem. Topics in Cognitive Science, 7(2):351–367, 2015.