{"title": "A Bayesian Framework for Modeling Confidence in Perceptual Decision Making", "book": "Advances in Neural Information Processing Systems", "page_first": 2413, "page_last": 2421, "abstract": "The degree of confidence in one's choice or decision is a critical aspect of perceptual decision making. Attempts to quantify a decision maker's confidence by measuring accuracy in a task have yielded limited success because confidence and accuracy are typically not equal. In this paper, we introduce a Bayesian framework to model confidence in perceptual decision making. We show that this model, based on partially observable Markov decision processes (POMDPs), is able to predict confidence of a decision maker based only on the data available to the experimenter. We test our model on two experiments on confidence-based decision making involving the well-known random dots motion discrimination task. In both experiments, we show that our model's predictions closely match experimental data. Additionally, our model is also consistent with other phenomena such as the hard-easy effect in perceptual decision making.", "full_text": "A Bayesian Framework for Modeling Con\ufb01dence in\n\nPerceptual Decision Making\n\nKoosha Khalvati, Rajesh P. N. Rao\n\nDepartment of Computer Science and Engineering\n\nUniversity of Washington\n\n{koosha, rao}@cs.washington.edu\n\nSeattle, WA 98195\n\nAbstract\n\nThe degree of con\ufb01dence in one\u2019s choice or decision is a critical aspect of per-\nceptual decision making. Attempts to quantify a decision maker\u2019s con\ufb01dence by\nmeasuring accuracy in a task have yielded limited success because con\ufb01dence and\naccuracy are typically not equal. In this paper, we introduce a Bayesian framework\nto model con\ufb01dence in perceptual decision making. 
We show that this model,\nbased on partially observable Markov decision processes (POMDPs), is able to\npredict con\ufb01dence of a decision maker based only on the data available to the\nexperimenter. We test our model on two experiments on con\ufb01dence-based deci-\nsion making involving the well-known random dots motion discrimination task.\nIn both experiments, we show that our model\u2019s predictions closely match exper-\nimental data. Additionally, our model is also consistent with other phenomena\nsuch as the hard-easy effect in perceptual decision making.\n\n1\n\nIntroduction\n\nThe brain is faced with the persistent challenge of decision making under uncertainty due to noise in\nthe sensory inputs and perceptual ambiguity . A mechanism for self-assessment of one\u2019s decisions\nis therefore crucial for evaluating the uncertainty in one\u2019s decisions. This kind of decision making,\ncalled perceptual decision making, and the associated self-assessment, called con\ufb01dence, have re-\nceived considerable attention in decision making experiments in recent years [9, 10, 12, 13]. One\npossible way of estimating the con\ufb01dence of a decision maker is to assume that it is equal to the\naccuracy (or performance) on the task. However, the decision maker\u2019s belief about the chance of\nsuccess and accuracy need not be equal because the decision maker may not have access to infor-\nmation that the experimenter has access to [3]. For example, in the well-known task of random dots\nmotion discrimination [18], on each trial, the experimenter knows the dif\ufb01culty of the task (coher-\nence or motion strength of the dots), but not the decision maker [3, 13]. In this case, when the data is\nbinned based on dif\ufb01culty of the task, the accuracy is not equal to decision con\ufb01dence. 
An alternative way to estimate the subject's confidence is to use auxiliary tasks such as post-decision wagering [15] or asking the decision maker to estimate confidence explicitly [12]. These methods, however, only provide an indirect window into the subject's confidence and are not always applicable.\nIn this paper, we explain how a model of decision making based on Partially Observable Markov Decision Processes (POMDPs) [16, 5] can be used to estimate a decision maker's confidence from experimental data. POMDPs provide a unifying Bayesian framework for modeling several important aspects of perceptual decision making, including evidence accumulation via Bayesian updates, the role of priors, and the costs and rewards of actions. One advantage of the POMDP model over other models is that it can incorporate various types of uncertainty in computing the optimal decision making strategy.\n\nThis research was supported by NSF grants EEC-1028725 and 1318733, and ONR grant N000141310817.\n\nDrift-diffusion and race models are able to handle uncertainty in probability updates [3] but not the costs and rewards of actions. Furthermore, these models originated as descriptive models of observed data, whereas the POMDP approach is fundamentally normative, prescribing the optimal policy for any task requiring decision making under uncertainty. In addition, the POMDP model can capture the temporal dynamics of a task. Time has been shown to play a crucial role in decision making, especially in decision confidence [12, 4]. POMDPs have previously been used to model evidence accumulation and understand the role of priors [16, 5, 6]. 
To our knowledge, this is the first time that they are being applied to model confidence and to explain experimental data on confidence-based decision making tasks.\nIn the following sections, we introduce some basic concepts in perceptual decision making and show how a POMDP can model decision confidence. We then explore the model's predictions for two well-known experiments in perceptual decision making involving confidence: (1) a fixed-duration motion discrimination task with post-decision wagering [13], and (2) a reaction-time motion discrimination task with confidence report [12]. Our results show that the predictions of the POMDP model closely match experimental data. The model's predictions are also consistent with the \u201chard-easy\u201d effect in decision making, involving over-confidence in hard trials and under-confidence in easy ones [7].\n\n2 Accuracy, Belief and Confidence in Perceptual Decision Making\n\nConsider perceptual decision making tasks in which the subject has to guess the hidden state of the environment correctly to get a reward. Any guess other than the correct state usually leads to no reward. The decision maker has been trained on the task, and wants to obtain the maximum possible reward. Since the state is hidden, the decision maker must use one or more observations to estimate the state. For example, the state could be one of two biased coins, one biased toward heads and the other toward tails. On each trial, the experimenter picks one of these coins randomly and flips it. The decision maker only sees the result, heads or tails, and must guess which coin has been picked. If she guesses correctly, she gets a reward immediately. If she fails, she gets nothing. In this context, Accuracy is defined as the number of correct guesses divided by the total number of trials. 
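The biased-coin example above can be made concrete with a short enumeration. The sketch below is purely illustrative (the 0.7 bias and the state names are assumed values, not taken from any experiment); it computes the accuracy of a decision maker who always guesses the more probable coin:

```python
# Two biased coins: state "h" is a coin with P(Heads) = 0.7, state "t"
# is a coin with P(Heads) = 0.3.  Each trial: a coin is picked uniformly
# at random, flipped once, and the decision maker guesses which coin it
# was from the single outcome.

P_S = {"h": 0.5, "t": 0.5}                      # prior over states
P_Z_given_S = {"h": {"H": 0.7, "T": 0.3},       # observation model
               "t": {"H": 0.3, "T": 0.7}}

def map_guess(z):
    """Guess the state with the highest posterior given observation z."""
    post = {s: P_Z_given_S[s][z] * P_S[s] for s in P_S}
    return max(post, key=post.get)

# Accuracy = sum over states and observations of
# P(state) * P(obs | state) * 1[guess(obs) == state]
accuracy = sum(P_S[s] * P_Z_given_S[s][z]
               for s in P_S for z in ("H", "T")
               if map_guess(z) == s)
print(accuracy)  # 0.7: guessing the MAP coin is right 70% of the time
```

Since a single flip carries limited evidence, the best achievable accuracy here equals the coin's bias itself.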
In a single trial, if A represents the action (or choice) of the decision maker, and S and Z denote the state and observation respectively, then Accuracy for the choice s with observation z is the probability P (A = as|S = s, Z = z), where as represents the action of the decision maker, i.e. choosing s, and s is the true state. This Accuracy can be measured by the experimenter. However, from the decision maker's perspective, her chance of success in a trial is given by the probability of s being the correct state, given observation z: P (S = s|Z = z). We call this probability the decision maker's belief. After choosing an action, for example as, the confidence for this choice is the probability P (S = s|A = as, Z = z). According to Bayes' theorem:\n\nP (A|S, Z)P (S|Z) = P (S|A, Z)P (A|Z).  (1)\n\nAs the goal of our decision maker is to maximize her reward, she picks the most probable state. This means that on observing z she picks as* where s* is the most probable state, i.e. s* = arg max P (S = s|Z = z). Therefore, P (A|Z = z) is equal to 1 for as* and 0 for the rest of the actions. As a result, accuracy is 1 for the most probable state and 0 for the rest. Also, P (S|A, Z) is equal to P (S|Z) for the most probable state. This means that, given observation z, Accuracy is equal to the confidence on the most probable state. Also, this confidence is equal to the belief of the most probable state. As confidence cannot be defined on actions not performed, one could consider confidence on the most probable state only, implying that accuracy, confidence, and belief are all equal given observation z:\n\n\u03a3s P (A = as|S = s, Z)P (S = s|Z) = P (S = s*|A = as*, Z) = P (S = s*|Z).1  (2)\n\nAll of the above equalities, however, depend on the ability of the decision maker to compute P (S|Z). According to Bayes' theorem, P (S|Z) = P (Z|S)P (S)/P (Z) (P (Z) \u2260 0). 
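The identity in equation (1), and the collapse of the left-hand sum in equation (2) onto the most probable state, can be checked numerically. The belief values below are assumed illustrative numbers, not data from the paper:

```python
# Belief over states given a fixed observation z (illustrative numbers).
belief = {"s1": 0.8, "s2": 0.2}              # P(S = s | Z = z)

# Reward-maximizing policy: deterministically choose the most probable
# state s*, so P(A = a_s | Z = z) is 1 for s = s* and 0 otherwise.
s_star = max(belief, key=belief.get)
P_A_given_Z = {s: 1.0 if s == s_star else 0.0 for s in belief}

# Because the action depends only on z, P(A = a_s | S = s, Z = z) equals
# P(A = a_s | Z = z).  The left-hand side of equation (2) then collapses
# to the probability of a correct choice given z, i.e. the accuracy:
lhs = sum(P_A_given_Z[s] * belief[s] for s in belief)  # sum_s P(a_s|s,z) P(s|z)

# Confidence on the chosen state: P(S = s* | A = a_s*, Z = z).  The
# deterministic action carries no information beyond z, so this is just
# the belief of the most probable state.
confidence = belief[s_star]

print(lhs, confidence)   # both 0.8: accuracy = confidence = belief on s*
```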
If the decision maker has the perfect observation model P (Z|S), she could compute P (S|Z) by estimating P (S) and P (Z) beforehand: by counting the total number of occurrences of each state without considering any observations, and the total number of occurrences of observation z, respectively. Therefore, accuracy and confidence are equal if the decision maker has the true model for observations. Sometimes, however, the decision maker does not even have access to Z. For example, in the motion discrimination task, if the data is binned based on difficulty (i.e., motion strength), the decision maker cannot estimate P (S|difficulty) because she does not know the difficulty of each trial. As a result, accuracy and confidence are not equal.\n\n1 In the case that there are multiple states with maximum probability, Accuracy is the sum of the confidence values on those states.\n\nIn the general case, the decision maker can utilize multiple observations over time, and perform an action on each time step. For example, in the coin toss problem, the decision maker could request a flip multiple times to gather more information. If she requests a flip two times and then guesses the state to be the coin biased toward heads, her actions would be Sample, Sample, Choose heads. She also has two observations (likely to be two Heads). 
In the general case, the state of the environment can also change after each action.2 In this case, the relationship between accuracy and confidence at time t, after a history Ht of actions and observations ht = a0, z1, a1, ..., zt-1, at-1, is:\n\nP (At|St, Ht)P (St|Ht) = P (St|At, Ht)P (At|Ht).  (3)\n\nWith the same reasoning as above, accuracy and confidence are equal if and only if the decision maker has access to all the observations and has the true model of the task.\n\n3 The POMDP Model\n\nPartially Observable Markov Decision Processes (POMDPs) provide a mathematical framework for decision making under uncertainty in autonomous agents [8]. A POMDP is formally a tuple (S, A, Z, T, O, R, \u03b3) with the following description: S is a finite set of states of the environment, A is a finite set of possible actions, and Z is a finite set of possible observations. T is a transition function defined as T : S \u00d7 S \u00d7 A \u2192 [0, 1], which determines P (s|s', a), the probability of going from a state s' to another state s after performing a particular action a. O is an observation function defined as O : Z \u00d7 A \u00d7 S \u2192 [0, 1], which determines P (z|a, s), the probability of observing z after performing an action a and ending in a particular state s. R is the reward function, defined as R : S \u00d7 A \u2192 R, determining the reward received by performing an action in a particular state. \u03b3 is the discount factor, which is always between 0 and 1, and determines how much rewards in the future are discounted compared to current rewards.\nIn a POMDP, the goal is to find a sequence of actions that maximizes the expected discounted reward, E[\u03a3t \u03b3^t R(st, at)]. The states are not fully observable and the agent must rely on its observations to choose actions. At time t, we have a history of actions and observations: ht = a0, z1, a1, ..., zt-1, at-1. 
The belief state [1] at time t is the posterior probability over states given this history and the prior probability b0 over states: bt = P (st|ht, b0). As the system is Markovian, the belief state captures the sufficient statistics for the history of states and actions [19], and it is possible to obtain bt+1 using only bt, at and zt+1:\n\nbt+1(s) \u221d O(s, at, zt+1) \u03a3s' T (s', s, at) bt(s').  (4)\n\nGiven this definition of belief, the goal of the agent is to find a sequence of actions to maximize the expected reward \u03a3t \u03b3^t R(bt, at). The actions are picked based on the belief state, and the resulting mapping from belief states to actions is called a policy (denoted by \u03c0), which is a probability distribution over actions \u03c0(bt) : P (At|bt). The policy which maximizes \u03a3t \u03b3^t R(bt, at) is called the optimal policy, \u03c0*. It can be shown that there is always a deterministic optimal policy, allowing the agent to always choose one action for each bt [20]. As a result, we may use a function \u03c0* : B \u2192 A, where B is the space of all possible beliefs. There has been considerable progress in recent years in fast \u201cPOMDP-solvers\u201d which find near-optimal policies for POMDPs [14, 17, 11].\n\n2 In traditional perceptual decision making tasks such as the random dots task, the state does not usually change. However, our model is equally applicable to this situation.\n\n3.1 Modeling Decision Making with POMDPs\n\nResults from experiments and theoretical models indicate that in many perceptual decision making tasks, if the previous task state is revealed, the history beyond this state does not exert a noticeable influence on decisions [2], suggesting that the Markov assumption and the notion of belief state are applicable to perceptual decision making. 
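The belief update in equation (4) is straightforward to implement. The sketch below is a generic implementation with a made-up two-state model for illustration (the state names, observation symbols, and the 0.8 reliability are assumptions, not parameters from the paper):

```python
# Belief state update, equation (4):
#   b_{t+1}(s) is proportional to O(s, a_t, z_{t+1}) * sum_{s'} T(s', s, a_t) * b_t(s')

def belief_update(b, a, z, T, O, states):
    """One Bayesian belief update for a POMDP."""
    new_b = {}
    for s in states:
        predicted = sum(T[(s_prev, s, a)] * b[s_prev] for s_prev in states)
        new_b[s] = O[(z, a, s)] * predicted
    norm = sum(new_b.values())          # P(z_{t+1} | h_t, a_t)
    return {s: v / norm for s, v in new_b.items()}

# Illustrative static two-state model (the state does not change, as in
# the random dots task): identity transitions, noisy binary observations.
states = ["left", "right"]
T = {(sp, s, "Sample"): 1.0 if sp == s else 0.0 for sp in states for s in states}
O = {("zl", "Sample", "left"): 0.8, ("zr", "Sample", "left"): 0.2,
     ("zl", "Sample", "right"): 0.2, ("zr", "Sample", "right"): 0.8}

b0 = {"left": 0.5, "right": 0.5}
b1 = belief_update(b0, "Sample", "zl", T, O, states)
print(b1)   # evidence for 'left' raises its belief from 0.5 to 0.8
```

Repeated application of this update to a stream of observations is exactly the evidence-accumulation process described in the text.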
Additionally, since the POMDP model aims to maximize\nthe expected reward, the problem of guessing the correct state in perceptual decision making can be\nconverted to a reward maximization problem by simply setting the reward for the correct guess to\n1 and the reward for all other actions to 0. The POMDP model also allows other costs in decision\nmaking to be taken into account, e.g., the cost of sampling, that the brain may utilize for metabolic\nor other evolutionarily-driven reasons. Finally, as there is only one correct hidden state in each\ntrial, the policy is deterministic (choosing the most probable state), consistent with the POMDP\nmodel. All these facts mean that we could model the perceptual decision making with the POMDP\nframework. In the cases where all observations and the true environment model are available to\nthe decision maker, the belief state in the POMDP is equal to both accuracy and con\ufb01dence as\ndiscussed above. When some information is hidden from the decision maker, one can use a POMDP\nwith that information to model accuracy and another POMDP without that information to model the\ncon\ufb01dence. If this hidden information is independent of time, we can model the difference with the\ninitial belief state, b0, i.e., we use two similar POMDPs to model accuracy and con\ufb01dence but with\ndifferent initial belief states. In the well-known motion discrimination experiment, it is common to\nbin the data based on the dif\ufb01culty of the task. This dif\ufb01culty is hidden to the decision maker and\nalso independent of the time. As a result, the con\ufb01dence can be calculated by the same POMDP that\nmodels accuracy but with different initial belief state. This case is discussed in the next section.\n\n4 Experiments and Results\n\nWe investigate the applicability of the POMDP model in the context of two well-known tasks in\nperceptual decision making. 
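Before describing the tasks, the two-POMDP idea from the previous section can be illustrated with a toy binned-coherence example. Everything numeric below (two coherence levels and their observation reliabilities) is an assumption for illustration, not a fit to the paper's data:

```python
# Toy binned-coherence example: states are (direction, coherence) pairs.
# Observation "R" favors direction "right"; its reliability depends on
# the (hidden) coherence of the trial.
P_R = {("right", "hi"): 0.9, ("left", "hi"): 0.1,   # P(z = "R" | state)
       ("right", "lo"): 0.6, ("left", "lo"): 0.4}

def posterior_right(b0):
    """P(direction = right) after observing z = "R" once, from prior b0."""
    post = {s: b0[s] * P_R[s] for s in b0}
    norm = sum(post.values())
    return sum(v / norm for s, v in post.items() if s[0] == "right")

# Experimenter's POMDP: the coherence of this trial ("hi") is known.
b0_exp = {("right", "hi"): 0.5, ("left", "hi"): 0.5,
          ("right", "lo"): 0.0, ("left", "lo"): 0.0}
# Decision maker's POMDP: same model, but coherence is hidden, so the
# initial belief is uniform over all (direction, coherence) states.
b0_dm = {s: 0.25 for s in b0_exp}

print(round(posterior_right(b0_exp), 6))  # 0.9  -> accuracy in the "hi" bin
print(round(posterior_right(b0_dm), 6))   # 0.75 -> the decision maker's confidence
```

The same observation yields different posteriors under the two initial beliefs, which is why accuracy and confidence diverge when trials are binned by difficulty.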
The first is a fixed-duration motion discrimination task with a \u201csure option,\u201d presented in [13]. In this task, a movie of randomly moving dots is shown to a monkey for a fixed duration. After a delay period, the monkey must correctly choose the direction of motion (left or right) of the majority of the dots to obtain a reward. In half of the trials, a third choice also becomes available, the \u201csure option,\u201d which always leads to a reward, though the reward is less than the reward for guessing the direction correctly. Intuitively, if the monkey wants to maximize reward, it should go for the sure choice only when it is very uncertain about the direction of the dots. The second task is a reaction-time motion discrimination task in humans, studied in [12]. In this task, the subject observes the random dots motion stimuli but must determine the direction of motion (in this case, up or down) of the majority of the dots as fast and as accurately as possible (rather than observing for a fixed duration). In addition to their decision regarding direction, subjects indicated their confidence in their decision on a horizontal bar stimulus, where pointing nearer to the left end meant less confidence and nearer to the right end meant more confidence. 
In both tasks, the difficulty of the task is governed by a parameter known as \u201ccoherence\u201d (or \u201cmotion strength\u201d), defined as the percentage of dots moving in the same direction from frame to frame in a given trial. In the experiments, the coherence value for a given trial was chosen to be one of the following: 0.0%, 3.2%, 6.4%, 12.8%, 25.6%, 51.2%.\n\n4.1 Fixed Duration Task as a POMDP\n\nThe direction and the coherence of the moving dots comprise the states of the environment. In addition, the actions which are available to the subject depend on the stage of the trial, namely, random dots display, wait period, choosing the direction or the sure choice, or choosing only the direction. As a result, the stage of the trial is also a part of the state of the POMDP. As the transitions between these stages depend on time, we incorporate discretized time as a part of the state. Considering the data, we define a new state for each constant \u2206t, each direction, each coherence, and each stage (when there is intersection between stages). We use dummy states to enforce the delay period of waiting and a terminal state, which indicates termination of the trial:\n\nS = { (direction, coherence, stage, time), waiting states, terminal }\n\nThe actions are Sample, Wait, Left, Right, and Sure. The transition function models the passage of time and stages. The observation function models evidence accumulation only in the random dots display stage and with the action Sample. The observations received in each \u2206t are governed by the number of dots moving in the same direction. We model the observations as normally distributed\n\nFigure 1: Experimental accuracy of the decision maker for each coherence is shown in (a). This plot is from [13]. The curves with empty circles and dashed lines are the trials where the sure option was not given to the subject. 
The curves with solid circles and solid lines are the trials where the sure option was shown, but waived by the decision maker. (b) shows the accuracy curves for the POMDP model fit to the experimental accuracy data from trials where the sure option was not given.\n\naround a mean related to the coherence and the direction as follows:\n\nO((d, c, display, t), Sample) = N (\u00b5d,c, \u03c3d,c).\n\nThe reward for choosing the correct direction is set to 1, and the other rewards were set relative to this reward. The sure option was set to a positive reward less than 1, while the cost of sampling and receiving a new observation was set to a negative \u201creward\u201d value. To model the unavailability of some actions in some states, we set their resultant rewards to a large negative number to preclude the decision making agent from picking these actions. The discount factor models how much more immediate reward is worth relative to future rewards. In the fixed-duration task, the subject does not have the option of terminating the trial early to get reward sooner, and therefore we used a discount factor of 1 for this task.\n\n4.2 Predicting the Confidence in the Fixed Duration Task\n\nAs mentioned before, confidence and accuracy are equal to each other when the same amount of information is available to the experimenter and the decision maker. Therefore, they can be modeled by the same POMDP. However, the two are not equal when we look at a specific coherence (difficulty), i.e., when the data is binned based on coherence, because the coherence in each trial is not revealed to the decision maker. Figure 1a shows the accuracy vs. stimulus duration, binned based on coherence. The confidence is not equal to the accuracy in this plot. However, we can predict the decision maker's confidence only from accuracy data. This time, we use two POMDPs, one for the experimenter and one for the decision maker. 
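A sketch of the two initial belief states used in this section follows; the helper names are mine, but the construction (0.5 on each direction of the known coherence for the experimenter, uniform over all non-zero coherences for the decision maker) follows the description in the text:

```python
# Initial belief states for the two POMDPs in the fixed-duration task.
# States are (direction, coherence) pairs at time 0; the coherence
# levels are those used in the experiments.
coherences = [0.0, 3.2, 6.4, 12.8, 25.6, 51.2]   # % coherence levels
states = [(d, c) for d in ("left", "right") for c in coherences]

def b0_experimenter(c):
    """Data binned on coherence c: 0.5 on each direction of c, 0 elsewhere."""
    return {s: (0.5 if s[1] == c else 0.0) for s in states}

def b0_decision_maker():
    """Coherence hidden: uniform over all initial states, except that the
    subject does not represent a true 0% coherence state."""
    known = [s for s in states if s[1] != 0.0]
    return {s: (1.0 / len(known) if s in known else 0.0) for s in states}

b_exp, b_dm = b0_experimenter(12.8), b0_decision_maker()
assert abs(sum(b_exp.values()) - 1.0) < 1e-12
assert abs(sum(b_dm.values()) - 1.0) < 1e-12
assert all(b_dm[(d, 0.0)] == 0.0 for d in ("left", "right"))
```

Running the same belief update under these two priors is all that separates the accuracy model from the confidence model.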
At time t, bt of the experimenter's POMDP can be related to accuracy, and bt of the decision maker's to confidence. These two POMDPs have the same model parameters, but different initial belief states. This is because the subject knows the environment model but does not have access to the coherence in each trial.\nFirst, we find the set of parameters for the experimenter's POMDP that reproduces the accuracy curves in the experiment for each coherence. We only use data from the trials where the sure option is not given, i.e., the dashed curves in figure 1a. As the data is binned based on the coherence, and coherence is observable to the experimenter, the initial belief state of the experimenter's POMDP for coherence c is as follows: .5 for each of the two possible initial states (at time = 0), and 0 for the rest. Fitting the POMDP to the accuracy data yields the mean and variance for each observation function and the cost for sampling. Figure 1b shows the accuracy curves based on the experimenter's POMDP.\nNow, we can apply the parameters obtained from fitting the accuracy data (the experimenter's POMDP) to the decision maker's POMDP to predict her confidence. The decision maker does not know the coherence in each single trial. Therefore, the initial belief state should be a uniform distribution over all initial states (all coherences, not only the coherence of that trial). Also, neural data from experiments and post-decision wagering experiments suggest that the decision maker does not recognize the existence of a true zero coherence state (coherence = 0%) [13]. Therefore, the initial\n\nFigure 2: The confidence predicted by the POMDP fit to the observed accuracy data in the fixed-duration experiment is shown in (a). (b) shows accuracy and confidence in one plot, demonstrating that they are not equal for this task. 
Curves with solid lines show the confidence (same curves as (a)) and the ones with dashed lines show the accuracy (same as Figure 1b).\n\nprobability of 0 coherence states is set to 0. Figure 2a shows the POMDP predictions regarding the subject's belief. Figure 2b confirms that the predicted confidence and accuracy are not equal.\n\nFigure 3: Experimental post-decision wagering results (plot (a)) and the wagering predicted by our model (plot (b)). Plot (a) is from [13].\n\nTo test our prediction about the confidence of the decision maker, we use experimental data from post-decision wagering in this experiment. If the reward for the sure option is rsure, then the decision maker chooses it if and only if b(right) rright < rsure and b(left) rleft < rsure, where b(direction) is the sum of the belief states of all states in that direction. Since rsure cannot be obtained from the fit to the accuracy data, we choose a value for rsure which makes the prediction of the confidence consistent with the wagering data shown in Figure 3a. We found that if rsure is approximately two-thirds the value of the reward for the correct direction choice, the POMDP model's prediction matches experimental data (Figure 3b). A possible objection is that the free parameter rsure was used to fit the data. Although rsure is needed to fit the exact probabilities, we found that any reasonable value for rsure generates the same trend of wagering. In general, the effect of rsure is to shift the plots vertically. The most important phenomenon here is the relatively small gap between hard trials and easy trials in Figure 3b. Figure 4a shows what this wagering data would look like if the decision maker knew the coherence in each trial and confidence was equal to the accuracy. 
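The wagering rule just described is simple to state in code. In this sketch the function name and the unit direction rewards are mine; the r_sure value of two-thirds follows the value reported above:

```python
# Post-decision wagering rule: take the sure option iff the expected
# reward of either direction choice is below r_sure, i.e.
#   b(right) * r_right < r_sure  and  b(left) * r_left < r_sure.

def choose(b_left, b_right, r_sure, r_left=1.0, r_right=1.0):
    """Return the reward-maximizing action among left, right, and sure."""
    if b_right * r_right < r_sure and b_left * r_left < r_sure:
        return "sure"
    return "right" if b_right * r_right >= b_left * r_left else "left"

# With r_sure about two-thirds of the correct-choice reward, an
# uncertain belief takes the sure option while a confident one waives it:
print(choose(0.5, 0.5, r_sure=2/3))   # sure
print(choose(0.1, 0.9, r_sure=2/3))   # right
```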
The difference between these two plots (figure 3b and figure 4a), together with figure 2b, which shows the confidence and the accuracy together, confirms the POMDP model's ability to explain the hard-easy effect [7], wherein the decision maker is overconfident in hard trials and underconfident in easy ones.\nAnother way of testing the predictions about the confidence is to verify whether the POMDP predicts the correct accuracy in the trials where the decision maker waives the sure option. Figure 4b shows that the results from the POMDP closely match the experimental data, both in post-decision wagering and in accuracy improvement. Our methods are presented in more detail in the supplementary document.\n\n4.3 Reaction Time Task\n\nThe POMDP for the reaction time task is similar to that of the fixed duration task. The most important components of the state are again direction and coherence. We also need some dummy states for the waiting period between the decision command from the decision maker and reward delivery. However, the passage of stages and time are not modeled. The absence of time in the state representation does not mean that time is not modeled in the framework. Tracking time is a very important component of any POMDP, especially when the discount factor is less than one.\n\nFigure 4: (a) shows what post-decision wagering would look like if the accuracy and the confidence were equal. (b) shows the accuracy predicted by the POMDP model in the trials where the sure option is shown but waived (solid lines), and also in the trials where it is not shown (dashed lines). For comparison, see experimental data in figure 1a.\n
The actions for this task are Sample, Wait, Up and Down (the latter two indicating the choice for the direction of motion). The transition model and the observation model are similar to those for the fixed duration task:\n\nO((d, c), Sample) = N (\u00b5d,c, \u03c3d,c)\n\nS = { (direction, coherence), waiting states, terminal }\n\nThe reward for choosing the correct direction is 1, and the reward for sampling is a small negative value, adjusted relative to the reward of the correct choice. As the subject controls the termination of the trial, the discount factor is less than 1. In this task, the subjects were explicitly advised to terminate the task as soon as they discovered the direction. Therefore, there is an incentive for the subject to terminate the trial sooner. While the sampling cost is constant during the experiment, the discount factor makes the decision making strategy dependent on time. A discount factor less than 1 means that as time passes, the effective value of the rewards decreases. Also, in a general reaction time task, the discount factor connects the trials to each other. While models usually assume each single trial is independent of the others, trials are actually dependent when the decision maker has control over trial termination. Specifically, the decision maker has a motivation to terminate each trial quickly to get the reward and proceed to the next one. Moreover, when one is very uncertain about the outcome of a trial, it may be prudent to terminate the trial sooner with the expectation that the next trial may be easier.\n\n4.4 Predicting the Confidence in the Reaction Time Task\n\nLike the fixed duration task, we want to predict the decision maker's confidence for a specific coherence. To achieve this, we use the same technique, i.e., two POMDPs with the same model and different initial belief states. 
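The effect of a discount factor below 1 on trial termination can be seen in a one-step lookahead, a deliberate simplification of the full POMDP policy. The observation reliability 0.8, sampling cost 0.01, and gamma = 0.95 are assumed numbers for illustration, not fitted parameters:

```python
# One-step lookahead for the reaction-time setting: stop now and choose
# the most probable direction, or pay the sampling cost and discount the
# (myopically estimated) future value by gamma.
GAMMA, SAMPLE_COST, P_CORRECT_OBS = 0.95, 0.01, 0.8  # assumed values

def stop_value(b_up):
    return max(b_up, 1.0 - b_up)            # reward 1 for a correct choice

def sample_value(b_up):
    """Value of one more Sample, evaluated with a stop on the next step."""
    value = 0.0
    for o_up, o_down in ((P_CORRECT_OBS, 1 - P_CORRECT_OBS),
                         (1 - P_CORRECT_OBS, P_CORRECT_OBS)):
        p_z = b_up * o_up + (1 - b_up) * o_down   # marginal P(z)
        b_next = b_up * o_up / p_z                # Bayes update
        value += p_z * stop_value(b_next)
    return -SAMPLE_COST + GAMMA * value

# Uncertain belief: one more sample is worth the cost and the discount.
print(sample_value(0.5) > stop_value(0.5))    # True
# Confident belief: discounting makes further sampling unattractive.
print(sample_value(0.95) > stop_value(0.95))  # False
```

Lowering gamma or raising the sampling cost shifts the crossover point toward earlier termination, which is the time dependence described above.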
The control of the subject over the termination of the trial makes estimating the confidence more difficult in the reaction time task. As the subject decides based on her own belief, not accuracy, the relationship between accuracy and reaction time, binned based on difficulty, is very noisy in comparison to the fixed duration task (plots of this relationship are illustrated in the supplementary materials of [12]). Therefore, we fit the experimenter's POMDP to two other plots: reaction time vs. motion strength (coherence), and accuracy vs. motion strength (coherence). The first subject (S1) of the original experiment was picked for this analysis because the behavior of this subject was consistent with the behavior of the majority of subjects [12]. Figures 5a and 5b show the experimental data from [12]. Figures 5c and 5d show the results from the POMDP model fit to the experimental data. As in the previous task, the initial belief state of the POMDP for a coherence c is .5 for each direction of c, and 0 for the rest.\n\nFigure 5: (a) and (b) show accuracy vs. motion strength and reaction time vs. motion strength plots from the reaction-time random dots experiments in [12]. (c) and (d) show the results from the POMDP model.\n\nFigure 6: (a) illustrates the confidence reported by the human subject from [12]. (b) shows the confidence predicted by the POMDP model.\n\nAll the free parameters of the POMDP were extracted from this fit. Again, as in the fixed duration task, we assume that the decision maker knows the environment model, but does not know the coherence of each trial or the existence of 0% coherence. Figure 6a shows the reported confidence from the experiments and figure 6b shows the prediction of our POMDP model for the belief of the decision maker. 
Although this report is not on a probability scale and a quantitative comparison is therefore not possible, the general trends in these plots are similar. The two become almost identical if one maps the report bar to the probability range.\nIn both tasks, we assume that the decision maker has a nearly perfect model of the environment, apart from using 5 different coherences instead of 6 (the zero coherence state is assumed not known). This assumption is not necessarily true. Although the decision maker understands that the difficulty of the trials is not constant, she might not know the exact number of coherences. For example, she may divide trials into three categories for each direction: easy, normal, and hard. However, these differences do not significantly change the belief, because the observations are generated by the true model, not the decision maker's model. We tested this hypothesis in our experiments. Although using a separate decision maker's model makes the predictions closer to the real data, we used the true (experimenter's) model to avoid overfitting the data.\n\n5 Conclusions\n\nOur results present, to our knowledge, the first supporting evidence for the utility of a Bayesian reward optimization framework based on POMDPs for modeling confidence judgements in subjects engaged in perceptual decision making. We showed that the predictions of the POMDP model are consistent with results on decision confidence in both primate and human decision making tasks, encompassing fixed-duration and reaction-time paradigms. Unlike traditional descriptive models such as drift-diffusion or race models, the POMDP model is normative and is derived from Bayesian and reward optimization principles. Additionally, unlike the traditional models, it allows one to model optimal decision making across trials using the concept of a discount factor. 
Important directions for future research include leveraging the ability of the POMDP framework to model intra-trial probabilistic state transitions, and exploring predictions of the POMDP model for decision-making experiments with more sophisticated reward/cost functions.

References

[1] Karl J. Astrom. Optimal control of Markov decision processes with incomplete state estimation. Journal of Mathematical Analysis and Applications, pages 174-205, 1965.

[2] Jan Drugowitsch, Ruben Moreno-Bote, Anne K. Churchland, Michael N. Shadlen, and Alexandre Pouget. The cost of accumulating evidence in perceptual decision making. The Journal of Neuroscience, 32(11):3612-3628, 2012.

[3] Jan Drugowitsch, Ruben Moreno-Bote, and Alexandre Pouget. Relation between belief and performance in perceptual decision making. PLoS ONE, 9(5):e96511, 2014.

[4] Timothy D. Hanks, Mark E. Mazurek, Roozbeh Kiani, Elisabeth Hopp, and Michael N. Shadlen. Elapsed decision time affects the weighting of prior probability in a perceptual decision task. Journal of Neuroscience, 31(17):6339-6352, 2011.

[5] Yanping Huang, Abram L. Friesen, Timothy D. Hanks, Michael N. Shadlen, and Rajesh P. N. Rao. How prior probability influences decision making: A unifying probabilistic model. In Proceedings of the Twenty-Sixth Annual Conference on Neural Information Processing Systems (NIPS), pages 1277-1285, 2012.

[6] Yanping Huang and Rajesh P. N. Rao. Reward optimization in the primate brain: A probabilistic model of decision making under uncertainty. PLoS ONE, 8(1):e53344, 2013.

[7] Peter Juslin, Henrik Olsson, and Mats Bjorkman. Brunswikian and Thurstonian origins of bias in probability assessment: On the interpretation of stochastic components of judgment. Journal of Behavioral Decision Making, 10(3):189-209, 1997.

[8] Leslie Pack Kaelbling, Michael L. Littman, and Anthony R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1-2):99-134, 1998.

[9] Adam Kepecs and Zachary F. Mainen. A computational framework for the study of confidence in humans and animals. Philosophical Transactions of the Royal Society B: Biological Sciences, 367(1594):1322-1337, 2012.

[10] Adam Kepecs, Naoshige Uchida, Hatim A. Zariwala, and Zachary F. Mainen. Neural correlates, computation and behavioural impact of decision confidence. Nature, 455(7210):227-231, 2008.

[11] Koosha Khalvati and Alan K. Mackworth. A fast pairwise heuristic for planning under uncertainty. In Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence, pages 187-193, 2013.

[12] Roozbeh Kiani, Leah Corthell, and Michael N. Shadlen. Choice certainty is informed by both evidence and decision time. Neuron, 84(6):1329-1342, 2014.

[13] Roozbeh Kiani and Michael N. Shadlen. Representation of confidence associated with a decision by neurons in the parietal cortex. Science, 324(5928):759-764, 2009.

[14] Hanna Kurniawati, David Hsu, and Wee Sun Lee. SARSOP: Efficient point-based POMDP planning by approximating optimally reachable belief spaces. In Proceedings of Robotics: Science and Systems IV, 2008.

[15] Navindra Persaud, Peter McLeod, and Alan Cowey. Post-decision wagering objectively measures awareness. Nature Neuroscience, 10(2):257-261, 2007.

[16] Rajesh P. N. Rao. Decision making under uncertainty: a neural model based on partially observable Markov decision processes. Frontiers in Computational Neuroscience, 4, 2010.

[17] Stéphane Ross, Joelle Pineau, Sébastien Paquet, and Brahim Chaib-draa. Online planning algorithms for POMDPs. Journal of Artificial Intelligence Research, 32(1), 2008.

[18] Michael N. Shadlen and William T. Newsome. Motion perception: seeing and deciding. Proceedings of the National Academy of Sciences of the United States of America, 93(2):628-633, 1996.

[19] Richard D. Smallwood and Edward J. Sondik. The optimal control of partially observable Markov processes over a finite horizon. Operations Research, 21(5):1071-1088, 1973.

[20] Edward J. Sondik. The optimal control of partially observable Markov processes over the infinite horizon: Discounted costs. Operations Research, 26(2):282-304, 1978.