{"title": "Goal-directed decision making in prefrontal cortex: a computational framework", "book": "Advances in Neural Information Processing Systems", "page_first": 169, "page_last": 176, "abstract": "Research in animal learning and behavioral neuroscience has distinguished between two forms of action control: a habit-based form, which relies on stored action values, and a goal-directed form, which forecasts and compares action outcomes based on a model of the environment. While habit-based control has been the subject of extensive computational research, the computational principles underlying goal-directed control in animals have so far received less attention. In the present paper, we advance a computational framework for goal-directed control in animals and humans. We take three empirically motivated points as founding premises: (1) Neurons in dorsolateral prefrontal cortex represent action policies, (2) Neurons in orbitofrontal cortex represent rewards, and (3) Neural computation, across domains, can be appropriately understood as performing structured probabilistic inference. On a purely computational level, the resulting account relates closely to previous work using Bayesian inference to solve Markov decision problems, but extends this work by introducing a new algorithm, which provably converges on optimal plans. 
On a cognitive and neuroscientific level, the theory provides a unifying framework for several different forms of goal-directed action selection, placing emphasis on a novel form, within which orbitofrontal reward representations directly drive policy selection.", "full_text": "Goal-directed decision making in prefrontal cortex: A computational framework\n\nMatthew Botvinick\nPrinceton Neuroscience Institute and Department of Psychology, Princeton University\nPrinceton, NJ 08540\nmatthewb@princeton.edu\n\nJames An\nComputer Science Department, Princeton University\nPrinceton, NJ 08540\nan@princeton.edu\n\nAbstract\n\nResearch in animal learning and behavioral neuroscience has distinguished between two forms of action control: a habit-based form, which relies on stored action values, and a goal-directed form, which forecasts and compares action outcomes based on a model of the environment. While habit-based control has been the subject of extensive computational research, the computational principles underlying goal-directed control in animals have so far received less attention. In the present paper, we advance a computational framework for goal-directed control in animals and humans. We take three empirically motivated points as founding premises: (1) Neurons in dorsolateral prefrontal cortex represent action policies, (2) Neurons in orbitofrontal cortex represent rewards, and (3) Neural computation, across domains, can be appropriately understood as performing structured probabilistic inference. 
On a purely computational level, the resulting account relates closely to previous work using Bayesian inference to solve Markov decision problems, but extends this work by introducing a new algorithm, which provably converges on optimal plans. On a cognitive and neuroscientific level, the theory provides a unifying framework for several different forms of goal-directed action selection, placing emphasis on a novel form, within which orbitofrontal reward representations directly drive policy selection.\n\n1 Goal-directed action control\n\nIn the study of human and animal behavior, it is a long-standing idea that reward-based decision making may rely on two qualitatively different mechanisms. In habit-based decision making, stimuli elicit reflex-like responses, shaped by past reinforcement [1]. In goal-directed or purposive decision making, on the other hand, actions are selected based on a prospective consideration of possible outcomes and future lines of action [2]. Over the past twenty years or so, the attention of cognitive neuroscientists and computationally minded psychologists has tended to focus on habit-based control, due in large part to interest in potential links between dopaminergic function and temporal-difference algorithms for reinforcement learning. 
However, a resurgence of interest in purposive action selection is now being driven by innovations in animal behavior research, which have yielded powerful new behavioral assays [3], and revealed specific effects of focal neural damage on goal-directed behavior [4].\n\nIn discussing some of the relevant data, Daw, Niv and Dayan [5] recently pointed out the close relationship between purposive decision making, as understood in the behavioral sciences, and model-based methods for the solution of Markov decision problems (MDPs), where action policies are derived from a joint analysis of a transition function (a mapping from states and actions to outcomes) and a reward function (a mapping from states to rewards). Beyond this important insight, little work has yet been done to characterize the computations underlying goal-directed action selection (though see [6, 7]). As discussed below, a great deal of evidence indicates that purposive action selection depends critically on a particular region of the brain, the prefrontal cortex. However, it remains a critical, and quite open, question what the relevant computations within this part of the brain might be.\n\nOf course, the basic computational problem of formulating an optimal policy given a model of an MDP has been extensively studied, and there is no shortage of algorithms one might consider as potentially relevant to prefrontal function (e.g., value iteration, policy iteration, backward induction, linear programming, and others). However, from a cognitive and neuroscientific perspective, there is one approach to solving MDPs that seems particularly appealing to consider. In particular, several researchers have suggested methods for solving MDPs through probabilistic inference [8-12]. 
The interest of this idea, in the present context, derives from a recent movement toward framing human and animal information processing, as well as the underlying neural computations, in terms of structured probabilistic inference [13, 14]. Given this perspective, it is inviting to consider whether goal-directed action selection, and the neural mechanisms that underlie it, might be understood in those same terms.\n\nOne challenge in investigating this possibility is that previous research furnishes no \u2018off-the-shelf\u2019 algorithm for solving MDPs through probabilistic inference that both provably yields optimal policies and aligns with what is known about action selection in the brain. We endeavor here to start filling in that gap. In the following section, we introduce an account of how goal-directed action selection can be performed based on probabilistic inference, within a network whose components map grossly onto specific brain structures. As part of this account, we introduce a new algorithm for solving MDPs through Bayesian inference, along with a convergence proof. We then present results from a set of simulations illustrating how the framework would account for a variety of behavioral phenomena that are thought to involve purposive action selection.\n\n2 Computational model\n\nAs noted earlier, the prefrontal cortex (PFC) is believed to play a pivotal role in purposive behavior. This is indicated by a broad association between prefrontal lesions and impairments in goal-directed action in both humans (see [15]) and animals [4]. Single-unit recording and other data suggest that different sectors of PFC make distinct contributions. In particular, neurons in dorsolateral prefrontal cortex (DLPFC) appear to encode task-specific mappings from stimuli to responses (e.g., [16]): \u201ctask representations,\u201d in the language of psychology, or \u201cpolicies\u201d in the language of dynamic programming. 
Although there is some understanding of how policy representations in DLPFC may guide action execution [15], little is yet known about how these representations are themselves selected. Our most basic proposal is that DLPFC policy representations are selected in a prospective, model-based fashion, leveraging information about action-outcome contingencies (i.e., the transition function) and about the incentive value associated with specific outcomes or states (the reward function). There is extensive evidence to suggest that state-reward associations are represented in another area of the PFC, the orbitofrontal cortex (OFC) [17, 18]. As for the transition function, although it is clear that the brain contains detailed representations of action-outcome associations [19], their anatomical localization is not yet entirely clear. However, some evidence suggests that the environmental effects of simple actions may be represented in inferior fronto-parietal cortex [20], and there is also evidence suggesting that medial temporal structures may be important in forecasting action outcomes [21].\n\nAs detailed in the next section, our model assumes that policy representations in DLPFC, reward representations in OFC, and representations of states and actions in other brain regions, are coordinated within a network structure that represents their causal or statistical interdependencies, and that policy selection occurs, within this network, through a process of probabilistic inference.\n\n2.1 Architecture\n\nThe implementation takes the form of a directed graphical model [22], with the layout shown in Figure 1. Each node represents a discrete random variable. State variables (s), representing the set of m possible world states, serve the role played by parietal and medial temporal cortices in representing action outcomes. 
Action variables (a), representing the set of available actions, play the role of high-level cortical motor areas involved in the programming of action sequences. Policy variables (\pi), each representing the set of all deterministic policies associated with a specific state, capture the representational role of DLPFC. Local and global utility variables, described further below, capture the role of OFC in representing incentive value. A separate set of nodes is included for each discrete time-step up to the planning horizon.\n\nFig 1. Left: Single-step decision. Right: Sequential decision. Each time-slice includes a set of m policy nodes.\n\nThe conditional probabilities associated with each variable are represented in tabular form. State probabilities are based on the state and action variables in the preceding time-step, and thus encode the transition function. Action probabilities depend on the current state and its associated policy variable. Utilities depend only on the current state. Rather than representing reward magnitude as a continuous variable, we adopt an approach introduced by [23], representing reward through the posterior probability of a binary variable (u). States associated with large positive reward raise p(u) (i.e., p(u=1|s)) near to one; states associated with large negative rewards reduce p(u) to near zero. In the simulations reported below, we used a simple linear transformation to map from scalar reward values to p(u):\n\np(u | s_i) = \frac{1}{2} \left( \frac{R(s_i)}{r_{max}} + 1 \right), \quad r_{max} \equiv \max_j R(s_j)   (1)\n\nIn situations involving sequential actions, expected returns from different time-steps must be integrated into a global representation of expected value. In order to accomplish this, we employ a technique proposed by [8], introducing a \u201cglobal\u201d utility variable (uG). 
Like u, this is a binary random variable, but associated with a posterior probability determined as:1\n\np(u_G) = \frac{1}{N} \sum_i p(u_i)   (2)\n\nwhere N is the number of u nodes. The network as a whole embodies a generative model for instrumental action. The basic idea is to use this model as a substrate for probabilistic inference, in order to arrive at optimal policies. There are three general methods for accomplishing this, which correspond to three forms of query. First, a desired outcome state can be identified, by treating one of the state variables (as well as the initial state variable) as observed (see [9] for an application of this approach). Second, the expected return for specific plans can be evaluated and compared by conditioning on specific sets of values over the policy nodes (see [5, 21]). However, our focus here is on a less obvious possibility, which is to condition directly on the utility variable uG, as explained next.\n\n2.2 Policy selection by probabilistic inference: an iterative algorithm\n\nCooper [23] introduced the idea of inferring optimal decisions in influence diagrams by converting utility nodes into binary random variables and then conditioning on these variables. Although this technique has been adopted in some more recent work [9, 12], we are aware of no application that guarantees optimal decisions, in the expected-reward sense, in multi-step tasks. We introduce here a simple algorithm that does furnish such a guarantee. The procedure is as follows: (1) Initialize the policy nodes with any set of non-deterministic priors. 
(2) Treating the initial state and uG as observed variables (uG = 1),2 use standard belief propagation (or a comparable algorithm) to infer the posterior distributions over all policy nodes. (3) Set the prior distributions over the policy nodes to the values (posteriors) obtained in step 2. (4) Go to step 2. The next two sections present proofs of monotonicity and convergence for this algorithm.\n\n1 Note that temporal discounting can be incorporated into the framework through minimal modifications to Equation 2.\n\n2 In the single-action situation, where there is only one u node, it is this variable that is treated as observed (u = 1).\n\n2.2.1 Monotonicity\n\nWe show first that, at each policy node, the probability associated with the optimal policy will rise on every iteration. Define \pi^* as follows:\n\np(u_G | \pi^*, \pi^+) > p(u_G | \pi, \pi^+), \quad \forall \pi \neq \pi^*   (3)\n\nwhere \pi^+ is the current set of probability distributions at all policy nodes on subsequent time-steps. (Note that we assume here, for simplicity, that there is a unique optimal policy.) The objective is to establish that\n\np_t(\pi^*) > p_{t-1}(\pi^*)   (4)\n\nwhere t indexes processing iterations. The dynamics of the network entail that\n\np_t(\pi) = p_{t-1}(\pi | u_G)   (5)\n\nwhere \pi represents any value (i.e., policy) of the decision node being considered. Substituting this into (4) gives\n\np_{t-1}(\pi^* | u_G) > p_{t-1}(\pi^*)   (6)\n\nFrom this point on the focus is on a single iteration, which permits us to omit the relevant subscripts. Applying Bayes\u2019 law to (6) yields\n\n\frac{p(u_G | \pi^*) p(\pi^*)}{\sum_{\pi} p(u_G | \pi) p(\pi)} > p(\pi^*)   (7)\n\nCanceling, and bringing the denominator up, this becomes\n\np(u_G | \pi^*) > \sum_{\pi} p(u_G | \pi) p(\pi)   (8)\n\nRewriting the left hand side, we obtain\n\n\sum_{\pi} p(u_G | \pi^*) p(\pi) > \sum_{\pi} p(u_G | \pi) p(\pi)   (9)\n\nSubtracting and further rearranging:\n\n\sum_{\pi} [p(u_G | \pi^*) - p(u_G | \pi)] p(\pi) > 0   (10)\n\n[p(u_G | \pi^*) - p(u_G | \pi^*)] p(\pi^*) + \sum_{\pi \neq \pi^*} [p(u_G | \pi^*) - p(u_G | \pi)] p(\pi) > 0   (11)\n\n\sum_{\pi \neq \pi^*} [p(u_G | \pi^*) - p(u_G | \pi)] p(\pi) > 0   (12)\n\nNote that this last inequality (12) follows from the definition of \pi^*.\n\nRemark: Of course, the identity of \pi^* depends on \pi^+. In particular, the policy \pi^* will only be part of a globally optimal plan if the set of choices \pi^+ is optimal. Fortunately, this requirement is guaranteed to be met, as long as no upper bound is placed on the number of processing cycles. Recalling that we are considering only finite-horizon problems, note that for policies leading to states with no successors, \pi^+ is empty. Thus \pi^* at the relevant policy nodes is fixed, and is guaranteed to be part of the optimal policy. 
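To make the iteration concrete: in the single-step case, conditioning on u = 1 and adopting the posterior as the new prior reduces to repeatedly reweighting each policy's probability by p(u = 1 | policy) and renormalizing. The following Python sketch (our illustration, not code from the paper) applies this update to the binary lever choice analyzed in Section 3.1, with rewards r = 2 and r = 1 mapped to p(u = 1 | policy) via Equation 1:

```python
# Single-step iterative policy inference (toy sketch of the Section 2.2 procedure).
# Rewards follow the binary-choice simulation: left lever r=2, right lever r=1.
rewards = {"press_left": 2.0, "press_right": 1.0}
r_max = max(rewards.values())

# Equation 1: linear map from scalar reward to p(u=1 | policy).
p_u = {pi: 0.5 * (r / r_max + 1.0) for pi, r in rewards.items()}

# Step 1: initialize the policy node with a non-deterministic prior.
prior = {pi: 1.0 / len(rewards) for pi in rewards}

history = []
for _ in range(25):
    # Steps 2-3: condition on u=1 (Bayes' rule), then adopt the posterior as the new prior.
    z = sum(p_u[pi] * prior[pi] for pi in prior)
    prior = {pi: p_u[pi] * prior[pi] / z for pi in prior}
    history.append(prior["press_left"])

# The probability of the optimal policy rises on every iteration, toward 1.
print(round(history[0], 3), round(history[-1], 3))
```

Each pass multiplies the odds of the optimal policy by p(u | press_left) / p(u | press_right) > 1, which is the monotone, geometric convergence established above.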
The proof above shows that p(\pi^*) will rise continually. Once it reaches a maximum, \pi^* at immediately preceding decisions will perforce fit with the globally optimal policy. The process works backward, in the fashion of backward induction.\n\n2.2.2 Convergence\n\nContinuing with the same notation, we show now that\n\n\lim_{t \to \infty} p_t(\pi^* | u_G) = 1   (13)\n\nNote that, if we apply Bayes\u2019 law recursively,\n\np_t(\pi' | u_G) = \frac{p(u_G | \pi') p_t(\pi')}{p_t(u_G)} = \frac{p(u_G | \pi')^2 p_{t-1}(\pi')}{p_t(u_G) p_{t-1}(u_G)} = \frac{p(u_G | \pi')^3 p_{t-2}(\pi')}{p_t(u_G) p_{t-1}(u_G) p_{t-2}(u_G)} = \ldots   (14)\n\nThus,\n\np_1(\pi' | u_G) = \frac{p(u_G | \pi') p_1(\pi')}{p_1(u_G)}, \quad p_2(\pi' | u_G) = \frac{p(u_G | \pi')^2 p_1(\pi')}{p_2(u_G) p_1(u_G)}, \quad p_3(\pi' | u_G) = \frac{p(u_G | \pi')^3 p_1(\pi')}{p_3(u_G) p_2(u_G) p_1(u_G)},   (15)\n\nand so forth. Thus, what we wish to prove is\n\np_1(\pi^*) \prod_{t=1}^{\infty} \frac{p(u_G | \pi^*)}{p_t(u_G)} = 1   (16)\n\nor, rearranging,\n\n\prod_{t=1}^{\infty} \frac{p_t(u_G)}{p(u_G | \pi^*)} = p_1(\pi^*)   (17)\n\nNote that, given the stipulated relationship between p(\pi) on each processing iteration and p(\pi | u_G) on the previous iteration,\n\np_t(u_G) = \sum_{\pi} p(u_G | \pi) p_t(\pi) = \sum_{\pi} p(u_G | \pi) p_{t-1}(\pi | u_G) = \sum_{\pi} \frac{p(u_G | \pi)^2 p_{t-1}(\pi)}{p_{t-1}(u_G)} = \sum_{\pi} \frac{p(u_G | \pi)^3 p_{t-2}(\pi)}{p_{t-1}(u_G) p_{t-2}(u_G)} = \sum_{\pi} \frac{p(u_G | \pi)^4 p_{t-3}(\pi)}{p_{t-1}(u_G) p_{t-2}(u_G) p_{t-3}(u_G)} = \ldots   (18)\n\nWith this in mind, we can rewrite the left hand side product in (17) as follows:\n\n\frac{p_1(u_G)}{p(u_G | \pi^*)} \cdot \frac{\sum_{\pi} p(u_G | \pi)^2 p_1(\pi)}{p(u_G | \pi^*) p_1(u_G)} \cdot \frac{\sum_{\pi} p(u_G | \pi)^3 p_1(\pi)}{p(u_G | \pi^*) p_2(u_G) p_1(u_G)} \cdot \frac{\sum_{\pi} p(u_G | \pi)^4 p_1(\pi)}{p(u_G | \pi^*) p_3(u_G) p_2(u_G) p_1(u_G)} \cdots   (19)\n\nNote that, given (18), the numerator in each factor of (19) cancels with the denominator in the subsequent factor, leaving only p(u_G | \pi^*) in that denominator. The expression can thus be rewritten as\n\n\lim_{t \to \infty} \sum_{\pi} \left( \frac{p(u_G | \pi)}{p(u_G | \pi^*)} \right)^t p_1(\pi)   (20)\n\nThe objective is then to show that the above equals p_1(\pi^*). It proceeds directly from the definition of \pi^* that, for all \pi other than \pi^*,\n\n\frac{p(u_G | \pi)}{p(u_G | \pi^*)} < 1   (21)\n\nThus, all but one of the terms in the sum above approach zero, and the remaining term equals p_1(\pi^*). Thus,\n\n\lim_{t \to \infty} \sum_{\pi} \left( \frac{p(u_G | \pi)}{p(u_G | \pi^*)} \right)^t p_1(\pi) = p_1(\pi^*)   (22)\n\n3 Simulations\n\n3.1 Binary choice\n\nWe begin with a simulation of a simple incentive choice situation. Here, an animal faces two levers. Pressing the left lever reliably yields a preferred food (r = 2), the right a less preferred food (r = 1). Representing these contingencies in a network structured as in Fig. 1 (left) and employing the iterative algorithm described in section 2.2 yields the results in Figure 2A. Shown here are the posterior probabilities for the policies press left and press right, along with the marginal value of p(u = 1) under these posteriors (labeled EV for expected value). The dashed horizontal line indicates the expected value for the optimal plan, to which the model clearly converges.\n\nA key empirical assay for purposive behavior involves outcome devaluation. 
Here, actions yielding a previously valued outcome are abandoned after the incentive value of the outcome is reduced, for example by pairing with an aversive event (e.g., [4]). To simulate this within the binary choice scenario just described, we reduced to zero the reward value of the food yielded by the left lever (fL), by making the appropriate change to p(u|fL). This yielded a reversal in lever choice (Fig. 2B).\n\nAnother signature of purposive actions is that they are abandoned when their causal connection with rewarding outcomes is removed (contingency degradation, see [4]). We simulated this by starting with the model from Fig. 2A and changing conditional probabilities at s for t=2 to reflect a decoupling of the left action from the fL outcome. The resulting behavior is shown in Fig. 2C.\n\nFig 2. Simulation results, binary choice.\n\n3.2 Stochastic outcomes\n\nA critical aspect of the present modeling paradigm is that it yields reward-maximizing choices in stochastic domains, a property that distinguishes it from some other recent approaches using graphical models to do planning (e.g., [9]). To illustrate, we used the architecture in Figure 1 (left) to simulate a choice between two fair coins. A \u2018left\u2019 coin yields $1 for heads, $0 for tails; a \u2018right\u2019 coin $2 for heads but for tails a $3 loss. As illustrated in Fig. 2D, the model maximizes expected value by opting for the left coin.\n\nFig 3. Simulation results, two-step sequential choice.\n\n3.3 Sequential decision\n\nHere, we adopt the two-step T-maze scenario used by [24] (Fig. 3A). Representing the task contingencies in a graphical model based on the template from Fig 1 (right), and using the reward values indicated in Fig. 
3A, yields the choice behavior shown in Figure 3B. Following [24], a shift in motivational state from hunger to thirst can be represented in the graphical model by changing the reward function (R(cheese) = 2, R(X) = 0, R(water) = 4, R(carrots) = 1). Imposing this change at the level of the u variables yields the choice behavior shown in Fig. 3C. The model can also be used to simulate effort-based decision making. Starting with the scenario in Fig. 2A, we simulated the insertion of an effort-demanding scalable barrier at S2 (R(S2) = -2) by making appropriate changes to p(u|s). The resulting behavior is shown in Fig. 3D.\n\nA famous empirical demonstration of purposive control involves detour behavior. Using a maze like the one shown in Fig. 4A, with a food reward placed at s5, Tolman [2] found that rats reacted to a barrier at location A by taking the upper route, but to a barrier at B by taking the longer lower route. We simulated this experiment by representing the corresponding transition and reward functions in a graphical model of the form shown in Fig. 1 (right),3 representing the insertion of barriers by appropriate changes to the transition function. The resulting choice behavior at the critical juncture s2 is shown in Fig. 4.\n\nFig 4. Simulation results, detour behavior. B: No barrier. C: Barrier at A. D: Barrier at B.\n\nAnother classic empirical demonstration involves latent learning. Blodgett [25] allowed rats to explore the maze shown in Fig. 5. Later insertion of a food reward at s13 was followed immediately by dramatic reductions in the running time, reflecting a reduction in entries into blind alleys. We simulated this effect in a model based on the template in Fig. 1 (right), representing the maze layout via an appropriate transition function. In the absence of a reward at s13, random choices occurred at each intersection. However, setting R(s13) = 1 resulted in the set of choices indicated by the heavier arrows in Fig. 
5.\n\nFig 5. Latent learning.\n\n4 Relation to previous work\n\nInitial proposals for how to solve decision problems through probabilistic inference in graphical models, including the idea of encoding reward as the posterior probability of a random utility variable, were put forth by Cooper [23]. Related ideas were presented by Shachter and Peot [12], including the use of nodes that integrate information from multiple utility nodes. More recently, Attias [11] and Verma and Rao [9] have used graphical models to solve shortest-path problems, leveraging probabilistic representations of rewards, though not in a way that guaranteed convergence on optimal (reward-maximizing) plans. More closely related to the present research is work by Toussaint and Storkey [10], employing the EM algorithm. The iterative approach we have introduced here has a certain resemblance to the EM procedure, which becomes evident if one views the policy variables in our models as parameters on the mapping from states to actions. It seems possible that there may be a formal equivalence between the algorithm we have proposed and the one reported by [10].\n\nAs a cognitive and neuroscientific proposal, the present work bears a close relation to recent work by Hasselmo [6], addressing the prefrontal computations underlying goal-directed action selection (see also [7]). The present efforts are tied more closely to normative principles of decision-making, whereas the work in [6] is tied more closely to the details of neural circuitry. 
In this respect, the two approaches may prove complementary, and it will be interesting to further consider their interrelations.\n\n3 In this simulation and the next, the set of states associated with each state node was limited to the set of reachable states for the relevant time-step, assuming an initial state of s1.\n\nAcknowledgments\n\nThanks to Andrew Ledvina, David Blei, Yael Niv, Nathaniel Daw, and Francisco Pereira for useful comments.\n\nReferences\n\n[1] Hull, C.L., Principles of Behavior. 1943, New York: Appleton-Century.\n\n[2] Tolman, E.C., Purposive Behavior in Animals and Men. 1932, New York: Century.\n\n[3] Dickinson, A., Actions and habits: the development of behavioral autonomy. Philosophical Transactions of the Royal Society (London), Series B, 1985. 308: p. 67-78.\n\n[4] Balleine, B.W. and A. Dickinson, Goal-directed instrumental action: contingency and incentive learning and their cortical substrates. Neuropharmacology, 1998. 37: p. 407-419.\n\n[5] Daw, N.D., Y. Niv, and P. Dayan, Uncertainty-based competition between prefrontal and striatal systems for behavioral control. Nature Neuroscience, 2005. 8: p. 1704-1711.\n\n[6] Hasselmo, M.E., A model of prefrontal cortical mechanisms for goal-directed behavior. Journal of Cognitive Neuroscience, 2005. 17: p. 1115-1129.\n\n[7] Schmajuk, N.A. and A.D. Thieme, Purposive behavior and cognitive mapping. A neural network model. Biological Cybernetics, 1992. 67: p. 165-174.\n\n[8] Tatman, J.A. and R.D. Shachter, Dynamic programming and influence diagrams. IEEE Transactions on Systems, Man and Cybernetics, 1990. 20: p. 365-379.\n\n[9] Verma, D. and R.P.N. Rao. Planning and acting in uncertain environments using probabilistic inference. in IEEE/RSJ International Conference on Intelligent Robots and Systems. 2006.\n\n[10] Toussaint, M. and A. Storkey. Probabilistic inference for solving discrete and continuous state Markov decision processes. 
in Proceedings of the 23rd International Conference on Machine Learning. 2006. Pittsburgh, PA.\n\n[11] Attias, H. Planning by probabilistic inference. in Proceedings of the 9th Int. Workshop on Artificial Intelligence and Statistics. 2003.\n\n[12] Shachter, R.D. and M.A. Peot. Decision making using probabilistic inference methods. in Uncertainty in Artificial Intelligence: Proceedings of the Eighth Conference (1992). 1992. Stanford University: M. Kaufmann.\n\n[13] Chater, N., J.B. Tenenbaum, and A. Yuille, Probabilistic models of cognition: conceptual foundations. Trends in Cognitive Sciences, 2006. 10(7): p. 287-291.\n\n[14] Doya, K., et al., eds. The Bayesian Brain: Probabilistic Approaches to Neural Coding. 2006, MIT Press: Cambridge, MA.\n\n[15] Miller, E.K. and J.D. Cohen, An integrative theory of prefrontal cortex function. Annual Review of Neuroscience, 2001. 24: p. 167-202.\n\n[16] Asaad, W.F., G. Rainer, and E.K. Miller, Task-specific neural activity in the primate prefrontal cortex. Journal of Neurophysiology, 2000. 84: p. 451-459.\n\n[17] Rolls, E.T., The functions of the orbitofrontal cortex. Brain and Cognition, 2004. 55: p. 11-29.\n\n[18] Padoa-Schioppa, C. and J.A. Assad, Neurons in the orbitofrontal cortex encode economic value. Nature, 2006. 441: p. 223-226.\n\n[19] Gopnik, A., et al., A theory of causal learning in children: causal maps and Bayes nets. Psychological Review, 2004. 111: p. 1-31.\n\n[20] Hamilton, A.F.d.C. and S.T. Grafton, Action outcomes are represented in human inferior frontoparietal cortex. Cerebral Cortex, 2008. 18: p. 1160-1168.\n\n[21] Johnson, A., M.A.A. van der Meer, and D.A. Redish, Integrating hippocampus and striatum in decision-making. Current Opinion in Neurobiology, 2008. 17: p. 692-697.\n\n[22] Jensen, F.V., Bayesian Networks and Decision Graphs. 2001, New York: Springer Verlag.\n\n[23] Cooper, G.F. A method for using belief networks as influence diagrams. 
in Fourth Workshop on Uncertainty in Artificial Intelligence. 1988. University of Minnesota, Minneapolis.\n\n[24] Niv, Y., D. Joel, and P. Dayan, A normative perspective on motivation. Trends in Cognitive Sciences, 2006. 10: p. 375-381.\n\n[25] Blodgett, H.C., The effect of the introduction of reward upon the maze performance of rats. University of California Publications in Psychology, 1929. 4: p. 113-134.\n", "award": [], "sourceid": 34, "authors": [{"given_name": "Matthew", "family_name": "Botvinick", "institution": null}, {"given_name": "James", "family_name": "An", "institution": null}]}