{"title": "Learning under uncertainty: a comparison between R-W and Bayesian approach", "book": "Advances in Neural Information Processing Systems", "page_first": 2730, "page_last": 2738, "abstract": "Accurately differentiating between what are truly unpredictably random and systematic changes that occur at random can have profound effect on affect and cognition. To examine the underlying computational principles that guide different learning behavior in an uncertain environment, we compared an R-W model and a Bayesian approach in a visual search task with different volatility levels. Both R-W model and the Bayesian approach reflected an individual's estimation of the environmental volatility, and there is a strong correlation between the learning rate in R-W model and the belief of stationarity in the Bayesian approach in different volatility conditions. In a low volatility condition, R-W model indicates that learning rate positively correlates with lose-shift rate, but not choice optimality (inverted U shape). The Bayesian approach indicates that the belief of environmental stationarity positively correlates with choice optimality, but not lose-shift rate (inverted U shape). In addition, we showed that comparing to Expert learners, individuals with high lose-shift rate (sub-optimal learners) had significantly higher learning rate estimated from R-W model and lower belief of stationarity from the Bayesian model.", "full_text": "Learning under uncertainty: a comparison between\n\nR-W and Bayesian approach\n\nHe Huang\n\nLaureate Institute for Brain Research\n\nTulsa, OK, 74133\n\ncrane081@gmail.com\n\nMartin Paulus\n\nLaureate Institute for Brain Research\n\nTulsa, OK, 74133\n\nmpaulus@laureateinstitute.org\n\nAbstract\n\nAccurately differentiating between what are truly unpredictably random and sys-\ntematic changes that occur at random can have profound effect on affect and\ncognition. 
To examine the underlying computational principles that guide different learning behaviors in an uncertain environment, we compared an R-W model and a Bayesian approach in a visual search task with different volatility levels. Both the R-W model and the Bayesian approach reflected an individual's estimation of the environmental volatility, and there is a strong correlation between the learning rate in the R-W model and the belief of stationarity in the Bayesian approach across volatility conditions. In a low volatility condition, the R-W model indicates that the learning rate positively correlates with lose-shift rate, but not with choice optimality (inverted U shape). The Bayesian approach indicates that the belief of environmental stationarity positively correlates with choice optimality, but not with lose-shift rate (inverted U shape). In addition, we showed that, compared to Expert learners, individuals with a high lose-shift rate (sub-optimal learners) had a significantly higher learning rate estimated from the R-W model and a lower belief of stationarity from the Bayesian model.\n\n1 Introduction\n\nLearning and using environmental statistics in choice selection under uncertainty is a fundamental survival skill. It has been shown that, in tasks with embedded environmental statistics, subjects use a sub-optimal heuristic Win-Stay-Lose-Shift (WSLS) strategy (Lee et al. 2011), as well as strategies that can be interpreted using a reinforcement learning (RL) model (Behrens et al. 2007) or a Bayesian inference model (Mathys et al. 2014; Yu et al. 2014). A value-based, model-free RL model assumes subjects learn the values of chosen options using a prediction error that is scaled by a learning rate (Rescorla and Wagner, 1972; Sutton and Barto, 1998). This learning rate can be used to measure an individual's reaction to environmental volatility (Browning et al. 2015). 
A higher learning rate is usually associated with a more volatile environment, and a lower learning rate with a relatively stable situation. Different from the traditional (model-free) RL model, the Bayesian approach assumes subjects make decisions by learning the reward probability distribution of all options based on Bayes' rule, i.e., sequentially updating the posterior probability by combining the prior knowledge and the new observation (likelihood function) over time. To examine how environmental volatility may influence this inference process, Yu & Cohen 2009 proposed a dynamic belief model (DBM) that assumes subjects update their belief of the environmental statistics by balancing between the prior belief and the belief of environmental stationarity, such that a belief of high stationarity leads to a relatively fixed belief of the environmental statistics, and vice versa.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nThough formulated under different assumptions (Gershman 2015), the two approaches share similar characteristics. First, both the learning rate in the RL model and the belief of stationarity in DBM reflect an individual's estimation of the environmental volatility. In a highly volatile environment, one is expected to have a high learning rate estimated by the RL model and a belief of low stationarity estimated by DBM. Second, though standard RL only updates the chosen option's value while DBM updates the posterior probability of all options, assuming a maximization decision rule (Blakely, Starin, & Poling, 1988), both models lead to qualitatively similar choice preferences. That is, the most often rewarded choice will have an increasing value in the RL model and an increasing reward probability in DBM, while the often-unrewarded choice will have a decreasing value in the RL model and a decreasing reward probability in DBM. 
Third, they can both explain the Win-Stay strategy. That is, under the maximization assumption of choosing the option with the maximum value in the RL model, a rewarded option (Win) will reinforce this choice (i.e. it remains the option with the maximum value) and thus will be chosen again (Stay). Similarly, choosing the option with the maximum reward probability in DBM, a rewarded option (Win) will also reinforce this choice (i.e. it remains the option with the maximum reward probability) and thus will be chosen again (Stay).\nWhile the two approaches share the characteristics mentioned above and have shown strong evidence in explaining subjects' overall choices in previous studies, it is unclear how they differ in explaining other behavioral measures in tasks with changing reward contingency, such as decision optimality, i.e., the percentage of trials in which one chooses the most likely rewarded option, and lose-shift rate, i.e., the tendency to follow the last target if the current choice is not rewarded. In a task with changing reward contingency (e.g., 80%:20% to 20%:80%), decision optimality relies on a proper estimation of the environmental volatility, i.e., how frequently change points occur, and on using a proper strategy, i.e., staying with the most likely option and ignoring the noise before change points (i.e. not switching to the option with the lower reward rate when it appears as the target). Thus it is important to know how the parameter in each model (learning rate vs. the belief of stationarity) affects decision optimality in tasks with different volatility. On the other hand, lose-shift can be explained as a heuristic decision policy that is used to reduce a cognitively difficult problem (Kahneman & Frederick, 2002), or as an artifact of learning that can be interpreted in a principled fashion (using RL: Worthy et al. 2014; using Bayesian inference: Bonawitz et al. 2014). 
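The two behavioral measures just defined, decision optimality and lose-shift rate, can be computed from a session's choice and target sequences as in the following Python sketch (the helper name and signature are ours, for illustration; this is not code from the paper):

```python
import numpy as np

def behavioral_measures(choices, targets, optimal):
    """Decision optimality and lose-shift rate as defined in the text.

    choices: the option chosen on each trial.
    targets: the rewarded (target) location on each trial; a choice is
        rewarded iff it matches the target.
    optimal: the most likely rewarded location on each trial (known to the
        experimenter, not the subject).
    """
    choices, targets, optimal = map(np.asarray, (choices, targets, optimal))
    # optimality: fraction of trials on which the most likely option was chosen
    optimality = float(np.mean(choices == optimal))
    # lose-shift: after an unrewarded trial, does the next choice follow the
    # last target?
    lose = choices[:-1] != targets[:-1]
    shift_to_target = choices[1:] == targets[:-1]
    lose_shift = float(shift_to_target[lose].mean()) if lose.any() else float("nan")
    return optimality, lose_shift

# toy example: trials 1 and 3 are losses, and both are followed by a shift
# to the last target
opt, ls = behavioral_measures([0, 1, 1, 2], [1, 1, 2, 2], [1, 1, 2, 2])
```

In this toy run `opt` is 0.5 (the most likely location is chosen on two of four trials) and `ls` is 1.0 (both losses are followed by the previous target).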
Intuitively, when experiencing a loss in the current trial, in a high volatility environment where change points frequently occur, one may tend to shift to the last target, while in a stable environment with fixed reward rates, one may tend to stay with the option with the higher reward rate. That is, the frequency of using the lose-shift strategy should depend on how frequently the environment changes. Thus it is also important to examine how the parameter in each model (learning rate vs. the belief of stationarity) affects lose-shift rate under different volatility conditions.\nHowever, so far little is known about how a model-free RL model and a Bayesian model differ in explaining decision optimality and lose-shift in tasks with different levels of volatility. In addition, it is unclear whether the parameters in each model can capture individual differences in learning; for example, whether they can provide a satisfactory explanation of individuals who always choose the same option while disregarding feedback information (No Learning), individuals who always choose the most likely rewarded option (Expert), and individuals who always use the heuristic win-stay-lose-shift strategy. Here we address the first question by investigating the relationship of decision optimality and lose-shift rate with parameters estimated from a Rescorla-Wagner model (R-W) and a Bayesian model in three volatility conditions (Fig 1a) in a visual search task (Yu et al. 2014): 1) stable, where the reward contingency at the three locations remains the same (relative reward frequency at the three locations: 1:3:9), 2) low volatility, where the reward contingency at the three locations changes (e.g. from 1:3:9 to 9:1:3) based on N(30, 1) (i.e. on average change points occur every 30 trials), and 3) high volatility, where the reward contingency changes based on N(10, 1) (i.e. on average change points occur every 10 trials). 
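The three conditions just described can be simulated as follows. This is an illustrative Python sketch under our own assumptions (targets drawn from the normalized 1:3:9 frequencies, contingency reshuffled by random permutation at change points), not the original task code:

```python
import numpy as np

def generate_targets(n_trials, ratios=(1, 3, 9), mean_interval=None, rng=None):
    """Simulate a target sequence for one run of the visual search task.

    ratios: relative reward frequency at the three locations (1:3:9).
    mean_interval: None for the stable condition; otherwise intervals between
        change points are drawn from N(mean_interval, 1), as in the text.
    """
    rng = np.random.default_rng(rng)
    probs = np.asarray(ratios, dtype=float) / sum(ratios)
    draw_interval = lambda: max(1, int(round(float(rng.normal(mean_interval, 1)))))
    next_change = n_trials + 1 if mean_interval is None else draw_interval()
    targets = np.empty(n_trials, dtype=int)
    for t in range(n_trials):
        if t == next_change:
            probs = rng.permutation(probs)   # e.g. 1:3:9 -> 9:1:3
            next_change = t + draw_interval()
        targets[t] = rng.choice(3, p=probs)
    return targets

stable = generate_targets(90)                 # no change points
low = generate_targets(90, mean_interval=30)  # changes ~every 30 trials
high = generate_targets(90, mean_interval=10) # changes ~every 10 trials
```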
For the second question, we will examine how the two models differ in explaining three types of behavior: No Learning, Expert, and WSLS (Fig 1b).\n\nFigure 1: Example of volatility conditions and behavioral types in a visual search task. a. Example of three volatility conditions. b. Example of three behavioral types. Colors indicate target location in each trial. WSLS: Win-stay-Lose-shift (follow last target). Expert: always choose the most likely location. No Learning: always choose the same location.\n\n2 Value-based RL model\n\nAssuming a constant learning rate, the Rescorla-Wagner model takes the following form (Rescorla and Wagner, 1972; Sutton and Barto, 1998):\n\nV_i^{t+1} = V_i^t + η(R^t − V_i^t)   (1)\n\nwhere η is the learning rate, and R^t is the reward feedback (0 - no reward, 1 - reward) for the chosen option i in trial t. In this paper, we assume subjects use a softmax decision rule for all models, as follows:\n\np(i) = e^{βV_i} / Σ_j e^{βV_j}   (2)\n\nwhere β is the inverse decision temperature parameter, which measures the degree to which subjects use the estimated values in choosing among the three options (i.e. a large β approximates a 'maximization' strategy). This model has two free parameters {η, β} for each subject.\n\n2.1 Simulation in three volatility conditions\n\nThe learning rate is expected to increase as the volatility increases. To show this, we simulated three volatility conditions (Stable, Low and High volatility, Fig 2ab); the results are summarized in Table 1. For each condition, we simulated 100 runs (90 trials per run) of agents' choices with η ranging from 0 to 1 in increments of 0.1 and fixed β = 20. As shown in Fig 2a, decision optimality in a stable and a low volatility environment has an inverted U shape as a function of the learning rate η. 
It is not surprising: in those conditions, where one should rely more on the long-term statistics, a learning rate that is too high makes subjects shift more due to recent experience (Fig 2b), which adversely influences decision optimality. On the other hand, in a high volatility environment, decision optimality has a linear correlation with the learning rate, suggesting that a higher learning rate leads to better performance. In fact, the optimal learning rate increases as the environmental volatility increases (i.e. the peak of the inverted U shifts to the right). Across all volatility conditions, lose-shift rate increases as the learning rate increases (Fig 2b), except for learning rate = 0: a zero learning rate means subjects make random choices, so the lose-shift rate will be close to 1/3.\n\n2.2 Simulation of three behavioral types\n\nTo examine if the learning rate can be used to explain different types of learning behavior, we simulated three types of behavior (No Learning, Expert and WSLS, Fig 1b) in a low volatility condition. In particular, we simulated 60 runs (90 trials per run) of target sequences with a relative reward frequency 1:3:9 that changes based on N(30, 1), and generated the three types of behavior for each run.\n\nTable 1: R-W model: Influence of learning rate η\n\nCondition       | Decision optimality                            | Lose-shift rate\nStable          | Inverted U shape, η_optimal = low              | Positive linear\nLow volatility  | Inverted U shape, η_optimal = medium           | Positive linear\nHigh volatility | Positive linear relationship, η_optimal = high | Positive linear\n\n
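The R-W simulations above (Equations 1 and 2) can be reproduced in outline with the following Python sketch; the agent and helper names are ours, and this is an illustrative re-implementation, not the original simulation code:

```python
import numpy as np

def softmax(v, beta):
    """Equation 2: p(i) proportional to exp(beta * V_i)."""
    e = np.exp(beta * (v - np.max(v)))   # shift by max for numerical stability
    return e / e.sum()

def rw_agent(targets, eta, beta=20.0, rng=None):
    """Simulate one run of a Rescorla-Wagner agent (Equations 1-2).

    targets: rewarded location on each trial; the choice is rewarded iff it
        matches the target. Returns the agent's choice sequence.
    """
    rng = np.random.default_rng(rng)
    v = np.zeros(3)                       # one value per location
    choices = np.empty(len(targets), dtype=int)
    for t, target in enumerate(targets):
        c = rng.choice(3, p=softmax(v, beta))
        r = 1.0 if c == target else 0.0
        v[c] += eta * (r - v[c])          # Equation 1: update chosen option only
        choices[t] = c
    return choices
```

With η = 0 the values never move off zero and choices stay uniform, which is why the simulated lose-shift rate sits near 1/3 at learning rate 0.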
For each simulated behavior type, the R-W model was fitted using Maximum Likelihood Estimation with η ranging from 0 to 1 in increments of .025 and β = 20. Based on what we have shown in 2.1, in a low volatility condition where decision optimality has an inverted U shape as a function of the learning rate, individuals who perform poorly are expected to have a low learning rate, and individuals who use the heuristic WSLS strategy are expected to have a high learning rate. We confirmed this in simulation (Fig 2c): agents that make the same choice over time (No Learning) have the lowest learning rate, indicating their choices are little influenced by the reward feedback. Expert agents have a medium learning rate, indicating the effect of long-term statistics. Agents that strictly follow WSLS have the highest learning rate, indicating their choices are heavily impacted by recent experience. Results for the learning rate estimation of the three behavioral types in the stable and high volatility conditions can be seen in Supplementary Figure S1.\n\nFigure 2: R-W simulation. a. Percentage of trials in which agents chose the optimal choice (the most likely location) as a function of learning rate in three volatility conditions. b. Lose-shift rate as a function of learning rate in three volatility conditions. c. Learning rate estimation of three simulated behavior types in the low volatility condition. Error bars indicate standard error of the mean across simulation runs.\n\n3 A Bayesian approach\n\nHere we compare the above R-W model to a dynamic belief model based on a Bayesian hidden Markov model, where we assume subjects make decisions by using the inferred posterior target probability s_t, based on the inferred hidden reward probabilities γ_t and the reward contingency mapping b_t (Equation 3). 
To examine the influence of volatility, we assume (γ_t, b_t) has probability α of remaining the same as in the last trial, and probability 1 − α of being drawn from the prior distribution p0(γ_t, b_t) (Equation 4). Here α represents an individual's estimation of environmental stationarity, which contrasts with the learning rate η in the R-W model. For model details, please refer to Yu & Huang 2014.\n\nP(s_t | γ_t, b_t) =\n  (1/3, 1/3, 1/3)   if b_t = 1\n  (γ_h, γ_m, γ_l)   if b_t = 2\n  (γ_h, γ_l, γ_m)   if b_t = 3\n  (γ_m, γ_h, γ_l)   if b_t = 4\n  (γ_m, γ_l, γ_h)   if b_t = 5\n  (γ_l, γ_h, γ_m)   if b_t = 6\n  (γ_l, γ_m, γ_h)   if b_t = 7   (3)\n\nP(γ_t, b_t | s_{t−1}) = α P(γ_{t−1}, b_{t−1} | s_{t−1}) + (1 − α) p0(γ_t, b_t)   (4)\n\nwhere γ_t is the hidden reward probability, b_t is the mapping of the reward probabilities to the options, and s_{t−1} is the target history from trial 1 to trial t − 1. We also use the softmax decision rule here (Equation 2), so this model also has two free parameters, {α, β}.\n\n3.1 Simulation in three volatility conditions\n\nThe belief of stationarity α is expected to decrease as the volatility increases, as subjects are expected to depend more on recent trials to predict the next outcome (Yu & Huang 2014). 
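Equations 3 and 4 can be sketched in Python as below. For illustration we fix the reward rates at (9/13, 3/13, 1/13), matching the task's 1:3:9 ratio, and infer only the mapping b_t; the full model of Yu & Huang 2014 also infers γ, so this is a simplified sketch, not the original model code:

```python
import numpy as np

# Equation 3: the seven reward contingency mappings. b = 1 is uniform;
# b = 2..7 are the six assignments of (gamma_h, gamma_m, gamma_l) to the
# three locations. Rates fixed at 9/13, 3/13, 1/13 (our simplification).
GH, GM, GL = 9 / 13, 3 / 13, 1 / 13
MAPPINGS = np.array([
    [1 / 3, 1 / 3, 1 / 3],
    [GH, GM, GL], [GH, GL, GM],
    [GM, GH, GL], [GM, GL, GH],
    [GL, GH, GM], [GL, GM, GH],
])

def dbm_predict(targets, alpha):
    """Sequential DBM inference over the mapping b (Equation 4).

    Returns the predictive probability of each location on every trial,
    computed before that trial's target is observed.
    """
    p0 = np.full(7, 1 / 7)               # prior over the seven mappings
    belief = p0.copy()
    preds = np.empty((len(targets), 3))
    for t, s in enumerate(targets):
        belief = alpha * belief + (1 - alpha) * p0   # Equation 4
        preds[t] = belief @ MAPPINGS                 # predictive target distribution
        belief = belief * MAPPINGS[:, s]             # Bayes: weight by likelihood of s
        belief /= belief.sum()
    return preds
```

With α = 1 this reduces to a fixed belief model that steadily concentrates on the true mapping; with α = 0 the belief is reset to the prior every trial and the prediction stays uniform.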
We have shown this in three simulated conditions under different volatility (Fig 3ab); the results are summarized in Table 2. For each simulated condition (stable, low and high volatility), we simulated 100 runs (90 trials per run) of agents' choices with α ranging from 0 to 1 and fixed β = 20. As shown in Fig 3a, in the stable condition, decision optimality increases as α increases, indicating that a fixed belief model (i.e. no change of the environmental statistics) is optimal in this condition. In the other two, volatile environments, decision optimality also increases as α increases, but both drop as α approaches 1. This is reasonable, as in volatile environments a belief of high stationarity is no longer optimal. On the other hand, the lose-shift rate in all conditions (Fig 3b) has an inverted U shape as a function of α, where α = 0 leads to a random lose-shift rate (1/3), and α = 1 (fixed belief model) leads to the minimal lose-shift rate.\n\n3.2 Simulation of three behavioral types\n\nTo examine if an individual's belief of environmental stationarity can be used to explain different types of learning behavior, we fit DBM using Maximum Likelihood Estimation on the simulated behavioral data from 2.2. DBM results suggest that WSLS has a significantly lower belief of stationarity compared to Expert behavior (Fig 3c), which is consistent with the higher volatility estimation reflected by a higher learning rate than Expert in the R-W model (Fig 2c). Simulation results also suggest that No Learning agents have a significantly lower belief of stationarity than Expert learners, but do not differ from WSLS. However, the comparison of model accuracy between R-W and DBM (Fig 3d) shows that DBM outperforms R-W in predicting Expert and WSLS behavior (p = .000), but it does not perform as well on No Learning behavior, where R-W has significantly better performance (p = .000). 
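Both models are fit to each choice sequence by maximum likelihood; for the R-W model this amounts to a one-dimensional grid search over η (step .025, β fixed at 20, as in 2.2). A minimal sketch, with function names of our own, might look like:

```python
import numpy as np

def rw_log_likelihood(choices, targets, eta, beta=20.0):
    """Log-likelihood of a choice sequence under the R-W model (Eqs. 1-2)."""
    v = np.zeros(3)
    ll = 0.0
    for c, target in zip(choices, targets):
        p = np.exp(beta * (v - np.max(v)))   # softmax, Equation 2
        p /= p.sum()
        ll += np.log(p[c])                   # probability of the observed choice
        r = 1.0 if c == target else 0.0
        v[c] += eta * (r - v[c])             # Equation 1
    return ll

def fit_eta(choices, targets, grid=np.arange(0.0, 1.0001, 0.025)):
    """Grid-search maximum-likelihood estimate of the learning rate."""
    return max(grid, key=lambda eta: rw_log_likelihood(choices, targets, eta))
```

Model accuracy, used to compare R-W and DBM, is then the fraction of trials on which the fitted model's most probable choice matches the observed one.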
Model accuracy is measured as the percentage of trials in which the model correctly predicted subjects' choices. Thus further investigation is needed to examine the validity of using DBM to explain poor learners' choice behavior in this task. Results for α in the stable and high volatility conditions can be seen in Supplementary Figure S2.\n\nFigure 3: DBM simulation. a. Percentage of trials in which agents chose the optimal choice (the most likely location) as a function of α in three volatility conditions. b. Lose-shift rate as a function of α in three volatility conditions. c. α estimation of three types of behavior in the low volatility condition (N(30, 1)). d. Model performance comparison between R-W and DBM in the low volatility condition.\n\nTable 2: DBM: Influence of stationarity belief α\n\nCondition       | Decision optimality                         | Lose-shift rate\nStable          | Positive linear relationship, α_optimal = 1 | Inverted U shape\nLow volatility  | Inverted U shape, α_optimal = high          | Inverted U shape\nHigh volatility | Inverted U shape, α_optimal = medium        | Inverted U shape\n\n4 Experiment\n\nWe applied the above models to two sets of data in a visual search task: (1) a stable condition with no change points (from Yu & Huang, 2014) and (2) a low volatility condition with change 
points based on N(30, 1). For both data sets, we fitted an R-W model and a DBM for each subject, and compared the learning rate η in R-W and the estimate of stochasticity (1 − α) in DBM, as well as how they correlate with decision optimality and lose-shift rate. For (2), we also looked at how the model parameters differ in explaining No Learning, Expert and WSLS behavior.\n\n4.1 Results\n\n4.1.1 Stable condition\n\nIn a visual search task with relative reward frequency 1:3:9 but no change points, we found a significant correlation between η estimated from the R-W model and 1 − α from DBM (r² = .84, p = .0001, Fig 4a), which is consistent with the hypothesis that both the learning rate in the R-W model and the belief of stochasticity in the Bayesian approach reflect subjects' estimation of environmental volatility. We also examined the relationship of decision optimality (optimal choice%) (Fig 4b) and lose-shift rate (Fig 4c) with η and 1 − α respectively. As shown in Fig 4b, in this stable condition, decision optimality decreases both as the learning rate increases and as the belief of stochasticity increases, which is consistent with Fig 2a (red, for η ≥ .1) and Fig 3a (red). For lose-shift rate, there is a significant positive relationship between lose-shift% and η, as shown previously in Fig 2b (red), and an inverted U shape in 1 − α, as suggested in Fig 3b (red). There are no significant differences between the prediction accuracy of the R-W model and DBM (R-W: .81 +/- .03, DBM: .81 +/- .03) or the inverse decision parameters (p > .05).\n\nFigure 4: Stable condition: R-W vs. DBM. a. Relationship between learning rate η in the R-W model and 1 − α in DBM. b. Optimal choice% as a function of η and 1 − α. c. 
Lose-shift% as a function of η and 1 − α.\n\n4.1.2 Low volatility condition\n\nIn a visual search task with relative reward frequency 1:3:9 and changes of the reward contingency based on N(30, 1) (3 blocks of 90 trials/block), we looked at the correlation between the model parameters, their correlations with decision optimality and lose-shift rate, and how the model parameters differ in explaining different types of behavior.\n\nSubjects (N = 207) were grouped into poor learners (optimal choice% < .5, n = 63), good learners (.5 ≤ optimal choice% ≤ 9/13, n = 108) and expert learners (optimal choice% > 9/13, n = 36) based on their performance (percentage of trials in which search started from the most likely rewarded location). Consistent with what we have shown previously (Fig 3d, No Learning), the R-W model outperformed DBM in poor learners (p = .000). As in the stable condition (Fig 4a), among good and expert learners there is a significant positive correlation between η and 1 − α (Fig 5b, r² = .35, p = .000). The relationship between decision optimality and lose-shift% is shown in Fig 5c: in this task, where change points occur with relatively low frequency (N(30, 1)), lose-shift% has an inverted U shape as a function of optimal choice%, indicating that a high lose-shift rate does not necessarily lead to better performance.\n\nFigure 5: a. Prediction accuracy in poor, good and expert learners. b. Correlation between η from R-W and 1 − α from DBM. c. 
Correlation between optimal choice% and lose-shift%.\n\nNext, we looked at how each model parameter correlates with decision optimality and lose-shift rate. Decision optimality (Fig 6ab), consistent with the simulation results, has an inverted U shape as a function of the learning rate η in the R-W model (Fig 6a), while it is positively correlated with α in DBM (Fig 6b). Lose-shift rate (Fig 6cd), also consistent with the simulation results, is positively correlated with η in R-W (Fig 6c), while having an inverted U shape as a function of α in DBM (Fig 6d).\n\nFigure 6: Decision optimality and lose-shift rate. a. Optimal choice% as a function of η in the R-W model. b. Optimal choice% as a function of α in DBM. c. Lose-shift% as a function of η in the R-W model. d. Lose-shift% as a function of α in DBM.\n\nIn addition, we examined how poor learners, expert learners and individuals with a high lose-shift rate (LS, lose-shift% > .5 and optimal choice% < 9/13, n = 51) differ in model parameters (Figure 7). Consistent with what we have shown (Fig 2c), the three behavioral types had significantly different learning rates (one-way ANOVA, p = .000), and each group differed significantly from each other (p = .000 for t tests across groups), with poor learners having the lowest learning rate and subjects with a high lose-shift rate the highest (Fig 7a). The belief of stationarity from DBM also confirmed what we have shown (Fig 3c): expert subjects had a significantly higher belief of stationarity (one-way ANOVA, p = .003; p = .004 for the t test compared to Poor subjects and p = .000 compared to LS subjects). 
It also suggested that poor learners did not differ from LS subjects (p > .05), though DBM had a lower accuracy in predicting poor learners' choices (Fig 5a). No significant difference in the inverse decision parameter β was found between R-W and DBM for expert and LS subjects (p > .05), but the DBM estimate was significantly lower in poor learners (Supplementary Figure S3).\n\nFigure 7: Parameter estimation for different behavioral types. a. Learning rate in the R-W model. b. Belief of stationarity in DBM.\n\n5 Discussion\n\nIn this paper we compared an R-W model and a Bayesian model in a visual search task across different volatility conditions, and examined parameter differences for different types of learning behavior. We have shown in simulation that both the learning rate η estimated from R-W and the belief of stochasticity 1 − α estimated from DBM increase with increasing volatility, and confirmed that the two are highly correlated in behavioral data (Fig 4a and Fig 5b). 
This suggests that both models are able to reflect an individual's estimation of environmental volatility. We have also shown in simulation that the R-W model can differentiate the No Learning, Expert and WSLS behavioral types by (increasing) learning rate, and DBM can differentiate the Expert and WSLS behavioral types by (increasing) belief of stochasticity, and confirmed this with behavioral data in a low volatility condition. A few other things to note here:\nCorrelation between decision optimality and lose-shift rate. Here we have provided a model-based explanation of the use of the lose-shift strategy and how it relates to decision optimality. 1) The R-W model suggests that, across different levels of environmental volatility, the frequency of using lose-shift is positively correlated with the learning rate (Fig 2b). However, decision optimality is NOT positively correlated with lose-shift rate across conditions. 2) DBM suggests that, across different levels of environmental volatility, there is an inverted U shape relationship between the frequency of using lose-shift and one's belief of stationarity (Fig 3b), and a close-to-linear relationship between decision optimality and the belief of stationarity in a low volatility environment (Fig 6b).\nImplications for model selection. We have shown that both models have comparable prediction accuracy for individuals with good performance, but the R-W model is better at explaining poor learners' choices. There are several possible reasons: 1) the Bayesian model assumes subjects use the feedback information to update the posterior probability of the target reward distribution; for 'poor' learners who did not use the feedback information, this assumption is no longer appropriate. 2) the R-W model assumes subjects only update the chosen option's value, so error trials may have less influence (especially in the early stages, with a low learning rate). 
That is, a zero-value option will remain at 0 if not rewarded, and the highest-value option will remain the highest-value option even if not rewarded. Therefore, the R-W model may capture poor learners' search pattern better with a low learning rate.\nFuture directions. For future work, we will modify the current R-W model with a dynamic learning rate that changes based on the value estimates, and modify the current DBM with a parameter that controls how much feedback information is used in updating the posterior belief and a hyper-parameter that models the dynamics of α.\n\nAcknowledgements\n\nWe thank Angela Yu for sharing the data in Yu et al. 2014, and for allowing us to use it in this paper.\n\nReferences\n\n[1] Lee, M. D., Zhang, S., Munro, M., & Steyvers, M. (2011). Psychological models of human and optimal performance in bandit problems. Cognitive Systems Research, 12(2), 164-174.\n[2] Behrens, T. E., Woolrich, M. W., Walton, M. E., & Rushworth, M. F. (2007). Learning the value of information in an uncertain world. Nature Neuroscience, 10(9), 1214-1221.\n[3] Mathys, C. D., Lomakina, E. I., Daunizeau, J., Iglesias, S., Brodersen, K. H., Friston, K. J., & Stephan, K. E. (2014). Uncertainty in perception and the Hierarchical Gaussian Filter. Front Hum Neurosci, 8.\n[4] Yu, A. J., & Huang, H. (2014). Maximizing masquerading as matching in human visual search choice behavior. Decision, 1(4), 275.\n[5] Rescorla, R. A., & Wagner, A. R. (1972). A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and nonreinforcement. Classical Conditioning II: Current Research and Theory, 2, 64-99.\n[6] Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. 
MIT Press.\n[7] Browning, M., Behrens, T. E., Jocham, G., O'Reilly, J. X., & Bishop, S. J. (2015). Anxious individuals have difficulty learning the causal statistics of aversive environments. Nature Neuroscience, 18(4), 590-596.\n[8] Yu, A. J., & Cohen, J. D. (2009). Sequential effects: superstition or rational behavior? In Advances in Neural Information Processing Systems (pp. 1873-1880).\n[9] Gershman, S. J. (2015). A unifying probabilistic view of associative learning. PLoS Comput Biol, 11(11), e1004567.\n[10] Blakely, E., Starin, S., & Poling, A. (1988). Human performance under sequences of fixed-ratio schedules: Effects of ratio size and magnitude of reinforcement. The Psychological Record, 38(1), 111.\n[11] Kahneman, D., & Frederick, S. (2002). Representativeness revisited: Attribute substitution in intuitive judgment. Heuristics and Biases: The Psychology of Intuitive Judgment, 49.\n[12] Worthy, D. A., & Maddox, W. T. (2014). A comparison model of reinforcement-learning and win-stay-lose-shift decision-making processes: A tribute to W. K. Estes. Journal of Mathematical Psychology, 59, 41-49.\n[13] Bonawitz, E., Denison, S., Gopnik, A., & Griffiths, T. L. (2014). Win-Stay, Lose-Sample: A simple sequential algorithm for approximating Bayesian inference. Cognitive Psychology, 74, 35-65.\n", "award": [], "sourceid": 1395, "authors": [{"given_name": "He", "family_name": "Huang", "institution": "LIBR"}, {"given_name": "Martin", "family_name": "Paulus", "institution": "LIBR"}]}