{"title": "Optimal decision-making with time-varying evidence reliability", "book": "Advances in Neural Information Processing Systems", "page_first": 748, "page_last": 756, "abstract": "Previous theoretical and experimental work on optimal decision-making was restricted to the artificial setting of a reliability of the momentary sensory evidence that remained constant within single trials. The work presented here describes the computation and characterization of optimal decision-making in the more realistic case of an evidence reliability that varies across time even within a trial. It shows that, in this case, the optimal behavior is determined by a bound in the decision maker's belief that depends only on the current, but not the past, reliability. We furthermore demonstrate that simpler heuristics fail to match the optimal performance for certain characteristics of the process that determines the time-course of this reliability, causing a drop in reward rate by more than 50%.", "full_text": "Optimal decision-making\n\nwith time-varying evidence reliability\n\nJan Drugowitsch1\n1D\u00b4ept. des Neurosciences Fondamentales\n\nRub\u00b4en Moreno-Bote2\n\nUniversit\u00b4e de Gen`eve\n\nCH-1211 Gen`eve 4, Switzerland\n\njdrugo@gmail.com,\n\nalexandre.pouget@unige.ch\n\nAlexandre Pouget1\n2Research Unit, Parc Sanitari\n\nSant Joan de D\u00b4eu and\nUniversity of Barcelona\n08950 Barcelona, Spain\nrmoreno@fsjd.org\n\nAbstract\n\nPrevious theoretical and experimental work on optimal decision-making was re-\nstricted to the arti\ufb01cial setting of a reliability of the momentary sensory evidence\nthat remained constant within single trials. The work presented here describes the\ncomputation and characterization of optimal decision-making in the more realistic\ncase of an evidence reliability that varies across time even within a trial. 
It shows that, in this case, the optimal behavior is determined by a bound in the decision maker's belief that depends only on the current, but not the past, reliability. We furthermore demonstrate that simpler heuristics fail to match the optimal performance for certain characteristics of the process that determines the time-course of this reliability, causing a drop in reward rate by more than 50%.

1 Introduction

Optimal decision-making constitutes making optimal use of sensory information to maximize one's overall reward, given the current task contingencies. Examples of decision-making are the decision to cross the road based on the percept of incoming traffic, or the decision of an eagle to dive for prey based on uncertain information about the prey's presence and location. Any kind of decision-making based on sensory information requires some temporal accumulation of this information, which makes such accumulation the first integral component of decision-making. Accumulating evidence for a longer duration causes higher certainty about the stimulus, but comes at the cost of spending more time to commit to a decision. Thus, the second integral component of such decision-making is to decide when enough information has been accumulated to commit to a decision. Previous work has established that, if the reliability of the momentary evidence is constant within a trial but might vary across trials, optimal decision-making can be implemented by a class of models known as diffusion models [1, 2, 3]. Furthermore, it has been shown that the behavior of humans and other animals at least qualitatively follows that predicted by such diffusion models [4, 5, 6, 3]. Our work significantly extends this work by moving from the rather artificial case of constant evidence reliability to allowing the reliability of evidence to change within single trials. 
Based on a principled formulation of this problem, we describe optimal decision-making with time-varying evidence reliability. Furthermore, a comparison to simpler decision-making heuristics demonstrates when such heuristics fail to feature comparable performance. In particular, we derive Bayes-optimal evidence accumulation for our task setup, and compute the optimal policy for such cases by dynamic programming. To do so, we borrow concepts from continuous-time stochastic control to keep the computational complexity linear in the process space size (rather than quadratic for the naïve approach). Finally, we characterize how the optimal policy depends on parameters that determine the evidence reliability time-course, and show that simpler, heuristic policies fail to match the optimal performance for particular sub-regions of this parameter space.

2 Perceptual decision-making with time-varying reliability

Within a single trial, the decision maker's task is to identify the state of a binary hidden variable, z ∈ {−1, 1} (with units s⁻¹, if time is measured in seconds), based on a stream of momentary evidence dx(t), t ≥ 0. This momentary evidence provides uncertain information about z by

dx = z dt + (1/√τ(t)) dW,  where  dτ = η(µ − τ) dt + σ√(2η/µ) √τ dB,  (1)

where dW and dB are independent Wiener processes. In the above, τ(t) controls how informative the momentary evidence dx(t) is about z, such that τ(t) is the reliability of this momentary evidence. We assume its time-course to be described by the Cox-Ingersoll-Ross (CIR) process (τ(t) in Eq. (1)) [7]. Despite the simplicity of this model and its low number of parameters, it is sufficiently flexible in modeling how the evidence reliability changes with time, and ensures that τ ≥ 0, always1. 
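To make the reliability process concrete, the following is a minimal simulation sketch of the CIR dynamics in Eq. (1), drawing τ(0) from the stationary gamma distribution. The function name, the Euler-Maruyama discretization, and the truncation of τ at zero inside the square root are our own illustrative choices, not part of the paper:

```python
import numpy as np

def simulate_cir(mu=0.4, sigma=0.2, eta=1.0, T=2.0, dt=1e-3, seed=0):
    """Euler-Maruyama sketch of d tau = eta (mu - tau) dt + sigma sqrt(2 eta / mu) sqrt(tau) dB.
    Clipping tau at 0 (full truncation) is an assumption of this sketch."""
    rng = np.random.default_rng(seed)
    n = int(T / dt)
    tau = np.empty(n + 1)
    # tau(0) from the stationary gamma distribution: shape mu^2/sigma^2, scale sigma^2/mu
    tau[0] = rng.gamma(shape=mu**2 / sigma**2, scale=sigma**2 / mu)
    vol = sigma * np.sqrt(2.0 * eta / mu)
    for i in range(n):
        drift = eta * (mu - tau[i])
        diff = vol * np.sqrt(max(tau[i], 0.0)) * rng.normal(0.0, np.sqrt(dt))
        tau[i + 1] = max(tau[i] + drift * dt + diff, 0.0)
    return tau
```

With these parameters the path mean of a long simulation should hover near µ, and the path never goes negative.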
The process is parameterized by the mean reliability, µ, its variance, σ², and its speed of change, η, all of which we assume to be known to the decision maker. At the beginning of each trial, at t = 0, τ(0) is drawn from the process' steady-state distribution, which is gamma with shape µ²/σ² and scale σ²/µ [7]. It can be shown that, upon observing some momentary evidence, τ(t) can be immediately estimated with infinite precision, such that it is known for all t ≥ 0 (see supplement).

Optimal decision-making requires in each trial computing the posterior over z, given all evidence dx0:t from trial onset to some time t. Assuming a uniform prior over z's, this posterior is given by

g(t) ≡ p(z = 1 | dx0:t) = 1 / (1 + e^(−2X(t))),  where  X(t) = ∫₀ᵗ τ(s) dx(s),  (2)

(this has already been established in [8]; see supplement for derivation). Thus, at time t, the decision maker's belief g(t) that z = 1 is the sigmoid of the accumulated, reliability-weighted momentary evidence up until that time.

We consider two possible tasks. In the ER task, the decision maker is faced with a single trial in which correct (incorrect) decisions are rewarded by r+ (r−), and the accumulation of evidence comes at a constant cost (for example, attentional effort) of c per unit time. The decision maker's aim is then to maximize her expected reward, ER, including the cost for accumulating evidence. In the RR task, we consider a long sequence of trials, separated on average by the inter-trial interval ti, which might be extended by the penalty time tp for wrong decisions. Maximizing reward in such a sequence equals maximizing the reward rate, RR, per unit time [9]. 
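The accumulation rule of Eq. (2) can be sketched together with the generative model of Eq. (1): the decision maker weights each momentary-evidence increment by the current reliability and passes the sum through a sigmoid. The discretization and function name below are ours:

```python
import numpy as np

def belief_trajectory(z=1.0, mu=0.4, sigma=0.2, eta=1.0, T=2.0, dt=1e-3, seed=1):
    """Sketch of Eq. (2): X(t) = integral of tau dx, g(t) = 1/(1 + exp(-2 X(t))).
    Uses a simple Euler discretization of Eq. (1); all numerical choices are illustrative."""
    rng = np.random.default_rng(seed)
    n = int(T / dt)
    tau = rng.gamma(mu**2 / sigma**2, sigma**2 / mu)   # tau(0) from the steady state
    vol = sigma * np.sqrt(2.0 * eta / mu)
    X = np.zeros(n + 1)
    for i in range(n):
        dx = z * dt + np.sqrt(dt / max(tau, 1e-12)) * rng.normal()  # dx = z dt + dW/sqrt(tau)
        X[i + 1] = X[i] + tau * dx                                  # reliability-weighted sum
        tau = max(tau + eta * (mu - tau) * dt
                  + vol * np.sqrt(max(tau, 0.0)) * rng.normal(0.0, np.sqrt(dt)), 0.0)
    return 1.0 / (1.0 + np.exp(-2.0 * X))  # belief g(t)
```

The belief starts at g(0) = 1/2 (the uniform prior) and stays strictly inside (0, 1).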
The objective function for either task is thus given by

ER(PC, DT) = PC r+ + (1 − PC) r− − c DT,  RR(PC, DT) = ER(PC, DT) / (DT + ti + (1 − PC) tp),  (3)

where PC is the probability of performing a correct decision, and DT is the expected decision time. For notational convenience we assume r+ = 1 and r− = 0. The work can be easily generalized to any choice of r+ and r−.

3 Finding the optimal policy by Dynamic Programming

3.1 Dynamic Programming formulation

Focusing first on the ER task of maximizing the expected reward in a single trial, the optimal policy can be described by bounds in belief2 at gθ(τ) and 1 − gθ(τ) as functions of the current reliability, τ. Once either of these bounds is crossed, the decision maker chooses z = 1 (for gθ(τ)) or z = −1 (for 1 − gθ(τ)). The bounds are found by solving Bellman's equation [10, 9],

V(g, τ) = max{ Vd(g), ⟨V(g + δg, τ + δτ)⟩p(δg,δτ|g,τ) − c δt },  (4)

where Vd(g) = max{g, 1 − g}. Here, the value function V(g, τ) denotes the expected return for the current state (g, τ) (i.e. holding belief g, and current reliability τ), which is the expected reward at

1We restrict ourselves to µ > σ, in which case τ(t) > 0 (excluding τ = 0) is guaranteed for all t ≥ 0.
2The subscript ·θ indicates the relation to the optimal decision bound θ.

Figure 1: Finding the optimal policy by dynamic programming. (a) illustrates the approach for the ER task. Here, Vd(g) and Vc(g, τ) denote the expected return for immediate decisions and that for continuing to accumulate evidence, respectively. 
(b) shows the same approach for RR tasks, in which, in an outer loop, the reward rate ρ is found by root finding.

this state within a trial, given that optimal choices are performed in all future states. The right-hand side of Bellman's equation is the maximum of the expected returns for either making a decision immediately, or continuing to accumulate more evidence and deciding later. When deciding immediately, one expects reward g (or 1 − g) when choosing z = 1 (or z = −1), such that the expected return for this choice is Vd(g). Continuing to accumulate evidence for another small time step δt comes at cost c δt, but promises the future expected return ⟨V(g + δg, τ + δτ)⟩p(δg,δτ|g,τ), as expressed by the second term in max{·, ·} in Eq. (4). Given a V(g, τ) that satisfies Bellman's equation, it is easy to see that the optimal policy is to accumulate evidence until the expected return for doing so is exceeded by that for making immediate decisions. The belief g at which this happens differs for different reliabilities τ, such that the optimal policy is determined by a bound in belief, gθ(τ), that depends on the current reliability.

We find the solution to Bellman's equation itself by value iteration on a discretized (g, τ)-space, as illustrated in Fig. 1(a). Value iteration is based on a sequence of value functions V0(g, τ), V1(g, τ), ..., where Vn(g, τ) is given by the solution to the right-hand side of Eq. (4) with ⟨V(g + δg, τ + δτ)⟩ based on the previous value function Vn−1(g, τ). With n → ∞, this procedure guarantees convergence to the solution of Eq. (4). In practice, we terminate value iteration once maxg,τ |Vn(g, τ) − Vn−1(g, τ)| drops below a pre-defined threshold. 
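The value-iteration loop itself is short once an expectation operator is available. Below is a minimal, runnable sketch for the ER task on a one-dimensional belief grid; the expectation ⟨V(g + δg)⟩ is stood in for by a simple neighbor average (a placeholder for the PDE solver described later), and the function name and all numerical choices are ours:

```python
import numpy as np

def value_iteration_er(c=0.1, dt=0.005, tol=1e-6, n=201):
    """Value-iteration skeleton for Eq. (4) on a belief grid.
    The neighbor average below is a placeholder expectation, chosen only
    so the Bellman update loop is self-contained and runnable."""
    g = np.linspace(0.0, 1.0, n)
    Vd = np.maximum(g, 1.0 - g)           # return for an immediate decision
    V = Vd.copy()
    while True:
        EV = V.copy()
        EV[1:-1] = 0.5 * (V[:-2] + V[2:]) # placeholder for <V(g + dg)>
        Vn = np.maximum(Vd, EV - c * dt)  # Bellman update, Eq. (4)
        Vn[0], Vn[-1] = Vd[0], Vd[-1]     # decisions are imminent at g in {0, 1}
        if np.max(np.abs(Vn - V)) < tol * dt:
            return g, Vn, Vd
        V = Vn
```

The converged V exceeds Vd only inside a continuation region around g = 1/2; its edges mark the decision bounds (here reliability-independent, since the sketch drops the τ dimension).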
The only remaining difficulty is how to compute the expected future return ⟨V(·, ·)⟩ on the discretized (g, τ)-space, which we describe in more detail in the next section.

The RR task, in which the aim is to maximize the reward rate, requires the use of average-reward Dynamic Programming [9, 11], based on the average-adjusted expected return, Ṽ(g, τ). If ρ denotes the reward rate (avg. reward per unit time, RR in Eq. (3)), this expected return penalizes the passage of some time δt by −ρδt, and can be interpreted as how much better or worse the current state is than the average. It is relative to an arbitrary baseline, such that adding a constant to this return for all states does not change the resulting policy [11]. We remove this additional degree of freedom by fixing the average Ṽ(·, ·) at the beginning of a trial (where g = 1/2) to ⟨Ṽ(1/2, τ)⟩p(τ) = 0, where the expectation is with respect to the steady-state distribution of τ. Overall, this leads to Bellman's equation,

Ṽ(g, τ) = max{ Ṽd(g), ⟨Ṽ(g + δg, τ + δτ)⟩p(δg,δτ|g,τ) − (c + ρ) δt },  (5)

with the average-adjusted expected return for immediate decisions given by

Ṽd(g) = max{ g − ρ(ti + (1 − g)tp), 1 − g − ρ(ti + g tp) }.  (6)

The latter results from a decision being followed by the inter-trial interval ti and an eventual penalty time tp for incorrect choices, after which the average-adjusted expected return is ⟨Ṽ(1/2, τ)⟩ = 0, as previously chosen. The value function is again computed by value iteration, assuming a known ρ. 
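Two pieces of the RR machinery translate directly into code: the immediate-decision return of Eq. (6), and the outer loop that adjusts the assumed ρ. The timing parameters and the placeholder for the inner value-iteration solve are our own illustrative choices:

```python
import numpy as np

def Vd_tilde(g, rho, t_i=2.0, t_p=1.0):
    """Eq. (6): average-adjusted return for an immediate decision at belief g.
    t_i = 2, t_p = 1 are example values, not from the paper."""
    g = np.asarray(g, dtype=float)
    choose_pos = g - rho * (t_i + (1.0 - g) * t_p)   # choose z = 1
    choose_neg = (1.0 - g) - rho * (t_i + g * t_p)   # choose z = -1
    return np.maximum(choose_pos, choose_neg)

def find_reward_rate(start_state_value, lo=0.0, hi=1.0, tol=1e-6):
    """Outer loop of Fig. 1(b) as plain bisection.  `start_state_value(rho)`
    stands in for a full value-iteration solve returning <V~(1/2, tau)>,
    which decreases in rho; the consistency condition sets it to zero."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if start_state_value(mid) > 0.0:
            lo = mid   # rho too low: trial start still beats the average
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Note that Ṽd is symmetric around g = 1/2, mirroring the symmetry of the two choices.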
The correct ρ itself is found in an outer loop, by root finding on the consistency condition ⟨Ṽ(1/2, τ)⟩ = 0, as illustrated in Fig. 1(b).

3.2 Finding ⟨V(g + δg, τ + δτ)⟩ as the solution to a PDE

Performing value iteration on Eq. (4) requires computing the expectation ⟨V(g + δg, τ + δτ)⟩p(δg,δτ|g,τ) on a discretized (g, τ) space. Naïvely, we could perform the required integration by the rectangle method or related methods, but this has several disadvantages. First, the method scales quadratically in the size of the (g, τ) space. Second, with δt → 0, p(δg, δτ|g, τ) becomes singular, such that a small time discretization requires an even smaller state discretization. Third, it requires explicit computation of p(δg, δτ|g, τ), which might be cumbersome.

Instead, we borrow methods from stochastic optimal control [12] to find the expectation as the solution to a partial differential equation (PDE). To do so, we link V(g, τ) to ⟨V(g + δg, τ + δτ)⟩ by considering how g and τ evolve from some time t to time t + δt. 
Defining u(g, τ, t) ≡ V(g, τ) and u(g, τ, t + δt) ≡ ⟨V(g + δg, τ + δτ)⟩, and replacing this expectation by its second-order Taylor expansion around (g, τ), we find that, with δt → 0, we have

∂u/∂t = (⟨dg⟩/dt) ∂u/∂g + (⟨dτ⟩/dt) ∂u/∂τ + (⟨dg²⟩/(2dt)) ∂²u/∂g² + (⟨dτ²⟩/(2dt)) ∂²u/∂τ² + (⟨dg dτ⟩/dt) ∂²u/∂g∂τ,  (7)

with all expectations implicitly conditional on g and τ. If we approximate the partial derivatives with respect to g and τ by their central finite differences, and denote unkj ≡ u(gk, τj, t) and un+1kj ≡ u(gk, τj, t + δt) (gk and τj are the discretized state nodes), applying the Crank-Nicolson method [13] to the above PDE results in the linear system

Ln+1 un+1 = Ln un,  (8)

where both Ln and Ln+1 are sparse matrices, and the u's are vectors that contain all ukj. Computing ⟨V(g + δg, τ + δτ)⟩ now amounts to solving the above linear system with respect to un+1. As the process on g and τ enters Eq. (7) only through its infinitesimal moments, this approach neither requires explicit computation of p(δg, δτ|g, τ) nor suffers from singularities in this density. It still scales quadratically with the state space discretization, but we achieve linear scaling by switching from the Crank-Nicolson to the Alternating Direction Implicit (ADI) method [13] (see supplement for details). This method splits the computation into two steps of size δt/2, in each of which the partial derivatives are only implicit with respect to one of the two state space dimensions. 
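For reference, a tridiagonal system of the kind produced by such one-dimensional implicit half-steps can be solved in linear time with the Thomas algorithm. This sketch solves a generic system; the actual coefficients would come from discretizing Eq. (7) and are not built here:

```python
import numpy as np

def thomas_solve(a, b, c, d):
    """O(n) solve of a tridiagonal system: a = sub-, b = main-, c = super-
    diagonal (a[0] and c[-1] unused), d = right-hand side."""
    n = len(b)
    cp = np.empty(n)
    dp = np.empty(n)
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):           # forward elimination
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = np.empty(n)
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):  # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x
```

This linear-time solve, applied once per grid row or column, is what keeps each ADI half-step cheap.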
This results in a tri-diagonal structure of the linear system, and an associated reduction of the computational complexity, while preserving the numerical robustness of the Crank-Nicolson method [13].

The PDE approach requires us to specify how V (and thus u) behaves at the boundaries, g ∈ {0, 1} and τ ∈ {0, ∞}. Beliefs g ∈ {0, 1} imply complete certainty about the latent variable z, such that a decision is imminent. This implies that, at these beliefs, we have V(g, τ) = Vd(g) for all τ. With τ → ∞, the reliability of the momentary evidence becomes overwhelming, such that the latent variable z is again immediately known, resulting in V(g, τ) → Vd(1) (= Vd(0)) for all g. For τ = 0, the infinitesimal moments are ⟨dg⟩ = ⟨dg²⟩ = ⟨dτ²⟩ = 0, and ⟨dτ⟩ = ηµ dt, such that g remains unchanged and τ drifts deterministically towards positive values. Thus, there is no leakage of V towards τ < 0, which makes this lower boundary well-defined.

4 Results

We first provide an example of an optimal policy and how it shapes behavior, followed by how different parameters of the process on the evidence reliability τ and different task parameters influence the shape of the optimal bound gθ(τ). Then, we compare the performance of these bounds to the performance that can be achieved by simple heuristics, like the diffusion model with a constant bound, or a bound in belief independent of τ.

In all cases, we computed the optimal bounds by dynamic programming on a 200 × 200 grid on (g, τ), using δt = 0.005. g spanned its whole [0, 1] range, and τ ranged from 0 to twice the 99th percentile of its steady-state distribution. 
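The grid just described can be constructed in a few lines. To stay dependency-free, this sketch estimates the 99th percentile of the stationary gamma distribution from Monte Carlo samples rather than an exact quantile function; the function name is ours:

```python
import numpy as np

def make_grid(mu=0.4, sigma=0.2, n=200, seed=0):
    """Discretization for value iteration: g covers [0, 1]; tau runs from 0
    to twice the 99th percentile of its stationary gamma distribution
    (shape mu^2/sigma^2, scale sigma^2/mu), estimated here by sampling."""
    rng = np.random.default_rng(seed)
    samples = rng.gamma(mu**2 / sigma**2, sigma**2 / mu, size=200_000)
    tau_max = 2.0 * np.quantile(samples, 0.99)
    g_grid = np.linspace(0.0, 1.0, n)
    tau_grid = np.linspace(0.0, tau_max, n)
    return g_grid, tau_grid
```

For the default µ = 0.4, σ = 0.2 the stationary distribution is gamma(4, 0.1), whose 99th percentile is close to 1, so the τ grid extends to roughly 2.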
We used maxg,τ |Vn(g, τ) − Vn−1(g, τ)| ≤ 10⁻³ δt as the convergence criterion for value iteration.

4.1 Decision-making with reliability-dependent bounds

Figure 2(a) shows one example of an optimal policy (black lines) for an ER task with an evidence accumulation cost of c = 0.1 and τ-process parameters µ = 0.4, σ = 0.2, and η = 1. This policy can be understood as follows. At the beginning of each trial, the decision maker starts at

Figure 2: Decision-making with the optimal policy. (a) shows the optimal bounds, at gθ(τ) and 1 − gθ(τ) (black), and an example trajectory (grey). The dashed curve shows the steady-state distribution of the τ-process. (b) shows the τ-component (evidence reliability) of this example trajectory over time. Even though not a jump-diffusion process, the CIR process can feature jump-like transitions (here at around 1 s). (c) shows the g-component (belief) of this trajectory over time (grey), and how the change in evidence reliability changes the bounds on this belief (black). Note that the bound fluctuates rapidly due to the rapid fluctuation of τ, even though the bound itself is continuous in τ.

g(0) = 1/2 and some τ(0) drawn from the steady-state distribution over τ's (dashed curve in Fig. 2(a)). When accumulating evidence, the decision maker's belief g(t) starts diffusing and drifting towards either 1 or 0, following the dynamics described in Eqs. (1) and (2). At the same time, the reliability τ(t) changes according to the CIR process, Eq. (1) (Fig. 2(b)). In combination, this leads to a two-dimensional trajectory in the (g, τ) space (Fig. 2(a), grey line). A decision is reached once this trajectory reaches either gθ(τ) or 1 − gθ(τ) (Fig. 2(a), black lines). 
In belief space, this corresponds to a bound that changes with the current reliability. For the example trajectory in Fig. 2, this reliability jumps to higher values after around 1 s (Fig. 2(b)), which leads to a corresponding jump of the bound to higher levels of confidence (black line in Fig. 2(c)).

In general, the optimal bound is an increasing function of τ. Thus, the larger the current reliability of the momentary evidence, the more sense it makes to accumulate evidence to a higher level of confidence before committing to a choice. This is because a low evidence reliability implies that, at least in the close future, this reliability will remain low, such that it does not make sense to pay the cost for accumulating evidence without the associated gain in choice accuracy. A higher evidence reliability implies that high levels of confidence, and the associated choice accuracy, are reached more quickly, and thus at a lower cost. This also indicates that a decision bound increasing in τ does not imply that high-reliability evidence will lead to slower choices. In fact, the opposite is true, as a faster move towards higher confidence for high reliability causes faster decisions in such cases.

4.2 Optimal bounds for different reliability/task parameters

To see how different parameters of the CIR process on the reliability influence the optimal decision bound, we compared bounds where one of its parameters is systematically varied. In all cases, we assumed an ER task with c = 0.1, and default CIR process parameters µ = 0.4, σ = 0.2, η = 2. Figure 3(a) shows how the bound differs for different means µ of the CIR process. A lower mean implies that, on average, the task will be harder, such that more evidence needs to be accumulated to reach the same level of performance. This accumulation comes at a cost, such that the optimal policy is to stop accumulating earlier in harder tasks. 
This causes lower decision bounds for smaller µ. Fig. 3(b) shows that the optimal bound only very weakly depends on the standard deviation σ of the reliability process. This standard deviation determines how far τ can deviate from its mean, µ. The weak dependence of the bound on this parameter shows that it is not that important to which degree τ fluctuates, as long as it fluctuates with the same speed, η. This speed has a strong influence on the optimal bound, as shown in Fig. 3(c). For a slowly changing τ (low η), the current τ is likely to remain the same in the future, such that the optimal bound strongly depends on τ. For a rapidly changing τ, in contrast, the current τ does not provide much information about future reliabilities, such that the optimal bound features only a very weak dependence on the current evidence reliability.

Similar observations can be made for changes in task parameters. Figure 3(d) illustrates that a larger cost c generally causes lower bounds, as it pays less to accumulate evidence. In RR tasks, the

Figure 3: Optimal bounds for different reliability process / task parameters. In the top row, we vary (a) the mean µ, (b) the standard deviation σ, or (c) the speed η of the CIR process that describes the reliability time-course. In the bottom row, we vary (d) the momentary cost c in an ER task, and, in an RR task, (e) the inter-trial interval ti, or (f) the penalty time tp. In all panels, solid lines show optimal bounds, and dashed lines show steady-state densities of τ (vertically re-scaled).

inter-trial timing also plays an important role. 
If the inter-trial interval ti is long, performing well in single trials is more important, as there are fewer opportunities per unit time to gather reward. In fact, for ti → ∞, the optimal bound in RR tasks becomes equivalent to that of an ER task [3]. For short ti's, in contrast, quick, uninformed decisions are better, as many of them can be performed in quick succession, and they are bound to be correct in at least half of the trials. This is reflected in optimal bounds that are significantly lower for shorter ti's (Fig. 3(e)). A larger penalty time, tp, in contrast, causes a rise in the optimal bound (Fig. 3(f)), as it is better to make slower, more accurate decisions if incorrect decisions are penalized by longer waits between consecutive trials.

4.3 Performance comparison with alternative heuristics

As the previous examples have shown, the optimal policy is, due to its two-dimensional nature, not only hard to compute but might also be hard to implement. For these reasons we investigated whether simpler, one-dimensional heuristics were able to achieve comparable performance. We focused on two heuristics in particular. First, we considered standard diffusion models [1, 2] that trigger decisions as soon as the accumulated evidence, x(t) (Eq. (1)), not weighted by τ, reaches one of the time-invariant bounds at xθ and −xθ. These models have been shown to feature optimal performance when the evidence reliability is constant within single trials [2, 3], and electrophysiological recordings have provided support for their implementation in neural substrate [14, 15]. Diffusion models use the unweighted x(t) in Eq. (1) and thus do not perform Bayes-optimal inference if the evidence reliability varies within single trials. For this reason, we considered a second heuristic that performs Bayes-optimal inference by Eq. (2), with time-invariant bounds Xθ and −Xθ on X(t). 
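Both heuristic bounds can be expressed in belief space via Eq. (2): a time-invariant bound Xθ on X(t) is the constant belief bound 1/(1 + e^(−2Xθ)), and, when the reliability is fixed within a trial so that X(t) = τ x(t), a diffusion-model bound xθ maps to a τ-dependent sigmoid. A small sketch (function name is ours):

```python
import numpy as np

def diffusion_bound_in_belief(x_theta, tau):
    """For a within-trial fixed reliability tau, X(t) = tau x(t), so a
    constant diffusion-model bound x_theta corresponds to the belief bound
    g = 1 / (1 + exp(-2 tau x_theta)), a sigmoid rising with tau."""
    return 1.0 / (1.0 + np.exp(-2.0 * np.asarray(tau) * x_theta))
```

At τ = 0 the mapped bound sits at g = 1/2 and it increases monotonically with τ, which is why, for slowly changing reliability, a tuned diffusion bound can track the rising optimal bound.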
This heuristic deviates from the optimal policy only by not taking into account the bound's dependence on the current reliability, τ.

We compared the performance of the optimal bound with the two heuristics exhaustively by discretizing a subspace of all possible reliability process parameters. The comparison is shown only for the ER task with accumulation cost c = 0.1, but we observed qualitatively similar results for other accumulation costs, and for RR tasks with various combinations of c, ti and tp. For a fair comparison, we tuned, for each set of reliability process parameters, the bound of each of the heuristics such that it maximized the associated ER / RR. This optimization was performed by the Subplex algorithm [16] in the NLopt toolkit [17], where the ER / RR was found by Monte Carlo simulations.

Figure 4: Expected reward comparison between optimal bound and heuristics. (a) shows the reward rate difference (white = no difference, dark green = optimal bound ≥ 2× higher expected reward) between the optimal bound and the diffusion model for different τ-process parameters. The process SD is shown as a fraction of the mean (e.g. µ = 1.4, σ̃ = 0.8 implies σ = 1.4 × 0.8 = 1.12). (b) The optimal bound (black, for η = 0 independent of µ and σ) and effective tuned diffusion model bounds (blue, dotted curves) for speed η = 0 and two different mean / SD combinations (blue, dotted rectangles in (a)). 
The dashed curves show the associated τ steady-state distributions. (c) same as (a), but comparing the optimal bound to a constant bound on belief. (d) The optimal bounds (solid curves) and tuned constant bounds (dotted curves) for different η and the same µ / σ combination (red rectangles in (c)). The dashed curve shows the steady-state distribution of τ.

4.3.1 Comparison to diffusion models

Figure 4(a) shows that for very slow process speeds (e.g. η = 0), the diffusion model performance is comparable to that of the optimal bound found by dynamic programming. At higher speeds (e.g. η = 16), however, diffusion models are no match for the optimal bound anymore. Their performance degrades most strongly when the reliability SD is large and close to the reliability's mean (dark green area for η = 16, large σ̃, in Fig. 4(a)). This pattern can be explained as follows. In the extreme case of η = 0, the evidence reliability remains unchanged within single trials. Then, by Eq. (2), we have X(t) = τ x(t), such that a constant bound xθ on x(t) corresponds to a τ-dependent bound Xθ = τ xθ on X(t). Mapped into belief by Eq. (2), this results in a sigmoidal bound that closely follows the similarly rising optimal bound. Figure 4(b) illustrates that, depending on the steady-state distribution of τ, the tuned diffusion model bound focuses on approximating different regions of the optimal bound.

For a non-stationary evidence reliability, η > 0, the relation between X(t) and x(t) changes for different trajectories of τ(t). In this case, the diffusion model bounds cannot be directly related to a bound in X(t) (or, equivalently, in belief g(t)). As a result, the effective diffusion model bound in belief fluctuates strongly, causing possibly strong deviations from the optimal bound. This is illustrated in Fig. 
4(a) by a significant loss in performance for larger process speeds. This loss is most pronounced for large spreads of τ (i.e. a large σ). For small spreads, in contrast, τ(t) remains mostly stationary, which is again well approximated by a stationary τ whose associated optimal policy is well captured by a diffusion model bound. To summarize, diffusion models approximate the optimal bound well as long as the reliability within single trials is close to stationary. As soon as this reliability starts to fluctuate significantly within single trials (e.g. large η and σ), the performance of diffusion models deteriorates.

4.3.2 Comparison to a bound that does not depend on evidence reliability

In contrast to diffusion models, a heuristic, constant bound in belief (i.e. either in X(t) or g(t)), as used in [8], causes a drop in performance for slow rather than fast changes of the evidence reliability. This is illustrated in Fig. 4(c), where the performance loss is largest for η = 0 and large σ, and drops with an increase in η and µ, and with a decrease in σ.

Figure 4(d) shows why this performance loss is particularly pronounced for slow changes in evidence reliability (i.e. low η). As can be seen, the optimal bound becomes flatter as a function of τ when the process speed η increases. As previously mentioned, for large η, this is due to the current reliability providing little information about future reliability. 
As a consequence, the optimal bound is in these cases well approximated by a constant bound in belief that completely ignores the current reliability. For smaller η, the optimal bound becomes more strongly dependent on the current reliability τ, such that a constant bound provides a worse approximation, and thus a larger loss in performance.

The dependence of the performance loss on the mean µ and standard deviation σ of the steady-state reliability arises similarly. As has been shown in Fig. 3(a), a larger mean reliability µ causes the optimal bound to become flatter as a function of the current reliability, such that a constant bound approximation performs better for larger µ, as confirmed in Fig. 4(c). The smaller performance loss for smaller spreads of τ (i.e. smaller σ) is not explained by a change in the optimal bound, which is mostly independent of the exact value of σ (Fig. 3(b)). Instead, it arises from the constant bound focusing its approximation on regions of the optimal bound where the steady-state distribution of τ has high density (dashed curves in Fig. 3(b)). The size of this region shrinks with shrinking σ, thus improving the approximation of the optimal bound by a constant, and the associated performance of this approximation. Overall, a constant bound in belief features competitive performance compared to the optimal bound if the evidence reliability changes rapidly (large η), if the task is generally easy (large µ), and if the reliability does not fluctuate strongly within single trials (small σ). For widely and slowly changing evidence reliability τ in difficult tasks, in contrast, a constant bound in belief provides a poor approximation to the optimal bound.

5 Discussion

Our work offers the following contributions. 
First, it pushes the boundaries of the theory of optimal human and animal decision-making by moving towards more realistic tasks in which the reliability changes over time within single trials. Second, it shows how to derive the optimal policy while avoiding the methodological caveats that have plagued previous, related approaches [3]. Third, it demonstrates that optimal behavior is achieved by a bound on the decision maker's belief that depends on the current evidence reliability. Fourth, it explains how the shape of the bound depends on task contingencies and the parameters that determine how the evidence reliability changes with time (in contrast to, e.g., [18], where the utilized heuristic policy is independent of the τ process). Fifth, it shows that alternative decision-making heuristics can match the optimal bound's performance only for a particular subset of these parameters, outside of which their performance deteriorates.

As derived in Eq. (2), optimal evidence accumulation with time-varying reliability is achieved by weighting the momentary evidence by its current reliability [8]. Previous work has shown that humans and other animals optimally accumulate evidence if its reliability remains constant within a trial [5, 3], or changes with a known time-course [8]. It remains to be clarified if humans and other animals can optimally accumulate evidence if the time-course of its reliability is not known in advance. They are able to estimate this reliability on a trial-by-trial basis [19, 20], but how quickly this estimate is formed remains unclear. In this respect, our model predicts that access to the momentary evidence is sufficient to estimate its reliability immediately and with high precision.
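As a concrete illustration of the scheme just discussed — weighting each piece of momentary evidence by its current reliability (Eq. (2)) and comparing the accumulated total against a reliability-dependent bound — the decision process can be sketched in a few lines. This is a minimal sketch under our own discretization: the function name `simulate_decision`, the step-wise loop, and the symmetric stopping rule are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def simulate_decision(dx, tau, bound):
    """Sketch of reliability-weighted evidence accumulation with a
    decision bound that depends only on the current reliability.

    dx    -- sequence of momentary evidence increments
    tau   -- sequence of their associated reliabilities
    bound -- callable mapping the current reliability to a positive
             bound on the accumulated evidence X(t)

    All names and the discrete-time stopping rule are illustrative.
    """
    X = 0.0
    for t, (dx_t, tau_t) in enumerate(zip(dx, tau)):
        X += tau_t * dx_t              # weight evidence by current reliability
        if abs(X) >= bound(tau_t):     # bound depends only on the current tau
            return np.sign(X), t       # choice and decision step
    return np.sign(X), len(dx)         # no bound crossing: forced choice
```

With a constant `bound` the sketch reduces to a standard diffusion-model stopping rule, whereas a τ-dependent `bound` implements the reliability-sensitive policy characterized above.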
This immediate-estimation property arises from the Wiener process being only an approximation of physical realism. Further work will extend our approach to processes in which this reliability is not known with absolute certainty, and which can feature jumps. We do not expect such process modifications to induce qualitative changes to our predictions.

Our theory predicts that, for optimal decision-making, the decision bounds need to be a function of the current evidence reliability, one that depends on the parameters describing the reliability time-course. This prediction can be used to guide the design of experiments that test if humans and other animals are optimal in the increasingly realistic scenarios addressed in this work. While we do not expect our quantitative prediction to be a perfect match to the observed behavior, we expect decision makers to qualitatively change their decision strategies according to the optimal strategy for different reliability process parameters. Having shown in which cases simpler heuristics fail to match the optimal performance allows us to focus on such cases to validate our theory.

References

[1] Roger Ratcliff. A theory of memory retrieval. Psychological Review, 85(2):59-108, 1978.

[2] Rafal Bogacz, Eric Brown, Jeff Moehlis, Philip J. Holmes, and Jonathan D. Cohen. The physics of optimal decision making: A formal analysis of models of performance in two-alternative forced-choice tasks. Psychological Review, 113(4):700-765, 2006.

[3] Jan Drugowitsch, Rubén Moreno-Bote, Anne K. Churchland, Michael N. Shadlen, and Alexandre Pouget. The cost of accumulating evidence in perceptual decision making. The Journal of Neuroscience, 32(11):3612-3628, 2012.

[4] John Palmer, Alexander C. Huk, and Michael N. Shadlen. The effect of stimulus strength on the speed and accuracy of a perceptual decision. Journal of Vision, 5:376-404, 2005.

[5] Roozbeh Kiani, Timothy D. Hanks, and Michael N.
Shadlen. Bounded integration in parietal cortex underlies decisions even when viewing duration is dictated by the environment. The Journal of Neuroscience, 28(12):3017-3029, 2008.

[6] Rafal Bogacz, Peter T. Hu, Philip J. Holmes, and Jonathan D. Cohen. Do humans produce the speed-accuracy trade-off that maximizes reward rate? The Quarterly Journal of Experimental Psychology, 63(5):863-891, 2010.

[7] John C. Cox, Jonathan E. Ingersoll Jr., and Stephen A. Ross. A theory of the term structure of interest rates. Econometrica, 53(2):385-408, 1985.

[8] Jan Drugowitsch, Gregory C. DeAngelis, Eliana M. Klier, Dora E. Angelaki, and Alexandre Pouget. Optimal multisensory decision-making in a reaction-time task. eLife, 2014. doi: 10.7554/eLife.03005.

[9] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley Series in Probability and Statistics. John Wiley & Sons, Inc., 2005.

[10] Richard E. Bellman. Dynamic Programming. Princeton University Press, 1957.

[11] Sridhar Mahadevan. Average reward reinforcement learning: Foundations, algorithms, and empirical results. Machine Learning, 22:159-195, 1996.

[12] Wendell H. Fleming and Raymond W. Rishel. Deterministic and Stochastic Optimal Control. Stochastic Modelling and Applied Probability. Springer-Verlag, 1975.

[13] William H. Press, Saul A. Teukolsky, William T. Vetterling, and Brian P. Flannery. Numerical Recipes: The Art of Scientific Computing. Cambridge University Press, 3rd edition, 2007.

[14] Jamie D. Roitman and Michael N. Shadlen. Response of neurons in the lateral intraparietal area during a combined visual discrimination reaction time task. The Journal of Neuroscience, 22(21):9475-9489, 2002.

[15] Mark E. Mazurek, Jamie D. Roitman, Jochen Ditterich, and Michael N. Shadlen. A role for neural integrators in perceptual decision making.
Cerebral Cortex, 13:1257-1269, 2003.

[16] Thomas Harvey Rowan. Functional Stability Analysis of Numerical Algorithms. PhD thesis, Department of Computer Sciences, University of Texas at Austin, 1990.

[17] Steven G. Johnson. The NLopt nonlinear-optimization package. URL http://ab-initio.mit.edu/nlopt.

[18] Sophie Deneve. Making decisions with unknown sensory reliability. Frontiers in Neuroscience, 6(75), 2012. ISSN 1662-453X. doi: 10.3389/fnins.2012.00075.

[19] Marc O. Ernst and Martin S. Banks. Humans integrate visual and haptic information in a statistically optimal fashion. Nature, 415:429-433, 2002.

[20] Christopher R. Fetsch, Amanda H. Turner, Gregory C. DeAngelis, and Dora E. Angelaki. Dynamic reweighting of visual and vestibular cues during self-motion perception. The Journal of Neuroscience, 29(49):15601-15612, 2009.