{"title": "Learning to Use Working Memory in Partially Observable Environments through Dopaminergic Reinforcement", "book": "Advances in Neural Information Processing Systems", "page_first": 1689, "page_last": 1696, "abstract": "Working memory is a central topic of cognitive neuroscience because it is critical for solving real world problems in which information from multiple temporally distant sources must be combined to generate appropriate behavior. However, an often neglected fact is that learning to use working memory effectively is itself a difficult problem. The Gating\" framework is a collection of psychological models that show how dopamine can train the basal ganglia and prefrontal cortex to form useful working memory representations in certain types of problems. We bring together gating with ideas from machine learning about using finite memory systems in more general problems. Thus we present a normative Gating model that learns, by online temporal difference methods, to use working memory to maximize discounted future rewards in general partially observable settings. The model successfully solves a benchmark working memory problem, and exhibits limitations similar to those observed in human experiments. Moreover, the model introduces a concise, normative definition of high level cognitive concepts such as working memory and cognitive control in terms of maximizing discounted future rewards.\"", "full_text": " \n\n \n\nLearning to use Working Memory in Partially \n\nObservable Environments through \n\nDopaminergic Reinforcement \n\nMichael T. Todd, Yael Niv, Jonathan D. Cohen \n\nDepartment of Psychology & Princeton Neuroscience Institute \n\nPrinceton University, Princeton, NJ 08544 \n\n{mttodd,yael,jdc}@princeton.edu \n\nAbstract \n\nWorking memory is a central topic of cognitive neuroscience because it is \ncritical for solving real-world problems in which information from multiple \ntemporally  distant  sources  must  be  combined  to  generate  appropriate \nbehavior. However, an often neglected fact is that learning to use working \nmemory effectively is itself a difficult problem. The Gating framework [1-\n4]  is  a  collection  of  psychological  models  that  show  how  dopamine  can \ntrain the basal ganglia and prefrontal cortex to form useful working memory \nrepresentations in certain types of problems. We unite Gating with machine \nlearning  theory  concerning  the  general  problem  of  memory-based  optimal \ncontrol [5-6]. We present a normative model that learns, by online temporal \ndifference methods, to use working memory to maximize discounted future \nreward  in  partially  observable  settings.  The  model  successfully  solves  a \nbenchmark  working  memory  problem,  and  exhibits  limitations  similar  to \nthose observed in humans. Our purpose is to introduce a concise, normative \ndefinition  of  high  level  cognitive  concepts  such  as  working  memory  and \ncognitive control in terms of maximizing discounted future rewards.  \nIntroduction \n\n1 \nWorking  memory  is  loosely  defined  in  cognitive  neuroscience  as  information  that  is  (1) \ninternally maintained on a temporary or short term basis, and (2) required for tasks in which \nimmediate  observations  cannot  be  mapped  to  correct  actions.  It  is  widely  assumed  that \nprefrontal cortex (PFC) plays a role in maintaining and updating working memory. 
However, relatively little is known about how PFC develops useful working memory representations for a new task. Furthermore, current work focuses on describing the structure and limitations of working memory, but does not ask why, or in what general class of tasks, it is necessary. Borrowing from the theory of optimal control in partially observable Markov decision problems (POMDPs), we frame the psychological concept of working memory as an internal state representation, developed and employed to maximize future reward in partially observable environments. We combine computational insights from POMDPs and neurobiologically plausible models from cognitive neuroscience to suggest a simple reinforcement learning (RL) model of working memory function that can be implemented through dopaminergic training of the basal ganglia and PFC.

The Gating framework is a series of cognitive neuroscience models developed to explain how dopaminergic RL signals can shape useful working memory representations [1-4]. Computationally, this framework models working memory as a collection of past observations, each of which can occasionally be replaced with the current observation, and addresses the problem of learning when to update each memory element versus maintaining it. In the original Gating model [1-2] the PFC contained a unitary working memory representation that was updated whenever a phasic dopamine (DA) burst occurred (e.g., due to unexpected reward or novelty). That model was the first to connect working memory and RL via the temporal difference (TD) model of DA firing [7-8], and thus to suggest how working memory might serve a normative purpose. However, that model had limited computational flexibility due to the unitary nature of the working memory (i.e., a single-observation memory controlled by a scalar DA signal). More recent work [3-4] has partially repositioned the Gating framework within the Actor/Critic model of mesostriatal RL [9-10], positing memory updating as but another cortical action controlled by the dorsal striatal "actor." This architecture increased computational flexibility by introducing multiple working memory elements, corresponding to multiple corticostriatal loops, that could be quasi-independently updated. However, that model combined a number of components (including supervised and unsupervised learning, and complex neural network dynamics), making it difficult to understand the relationship between simple RL mechanisms and working memory function. Moreover, because the model used the Rescorla-Wagner-like PVLV algorithm [4] rather than TD [7-8] as the model of phasic DA bursts, the model's behavior and working memory representations were not directly shaped by standard normative criteria for RL models (i.e., discounted future reward or reward per unit time).

We present a new Gating model, synthesizing the mesostriatal Actor/Critic architecture of [4] with a normative POMDP framework, and reducing the Gating model to a four-parameter, pure RL model in the process.
This produces a model very similar to previous machine learning work on "model-free" approximate POMDP solvers [5-6], which attempt to form good solutions without explicit knowledge of the environment's structure or dynamics. That is, we model working memory as a discrete memory system (a collection of recent observations) rather than a continuous "belief state" (an inferred probability distribution over hidden states). In some environments this may permit only an approximate solution. However, the strength of such a system is that it requires very little prior knowledge, and is thus potentially useful for animals, who must learn effective behavior and memory-management policies in completely novel environments (i.e., in the absence of a "world model"). Therefore, we retain the computational flexibility of the more recent Gating models [3-4], while re-establishing the goal of defining working memory in normative terms [1-2].

To illustrate the strengths and limitations of the model, we apply it to two representative working-memory tasks. The first is the 12-AX task proposed as a Gating benchmark in [4]. Contrary to previous claims that TD learning is not sufficient to solve this task, we show that with an eligibility trace (i.e., TD($\lambda$) with $0 < \lambda < 1$), the model can achieve optimal behavior. The second task highlights important limitations of the model. Since our model is a POMDP solver and POMDPs are, in general, intractable (i.e., solution algorithms require an infeasible number of computations), it is clear that our model must ultimately fail to achieve optimal performance as environments increase even to moderate complexity. However, human working memory also exhibits sharp limitations. We apply our model to an implicit artificial grammar learning task [11] and show that it indeed fails in ways reminiscent of human performance. Moreover, simulating this task with increased working memory capacity reveals diminishing returns as capacity increases beyond a small number, suggesting that the "magic number" limited working memory capacity found in humans [12] might in fact be optimal from a learning standpoint.

2 Model Architecture

As with working memory tasks, a POMDP does not admit an optimal behavior policy based only on the current observation. Instead, the optimal policy generally depends on some combination of memory as well as the current observation. Although the type of memory required varies across POMDPs, in certain cases a finite memory system is a sufficient basis for an optimal policy. Peshkin, Meuleau, and Kaelbling [6] used an external finite memory device (e.g., a shopping list) to improve the performance of RL in a model-free POMDP setting. Their model's "state" variable consisted of the current observation augmented by the memory device. An augmented action space, consisting of both memory actions and motor actions, allowed the model to learn effective memory-management and motor policies simultaneously. We integrate this approach with the Gating model, altering the semantics so that the external memory device becomes internal working memory (presumed to be supported in PFC), and altering the Gating model so that the role of working memory is explicitly to support optimal behavior (in terms of discounted future reward) in a POMDP.
Table 1: Pseudocode of one trial of the model, based on the Actor/Critic architecture with eligibility traces. Following [13], we substitute the critic's state-value prediction error for Williams's $(r - b)$ term [14]. We describe here a single gating actor, but it is straightforward to generalize to an array of independent gating actors as we use in our simulations. $\gamma$ = discount rate; $\lambda$ = eligibility trace decay rate; $\alpha$ = learning rate. In all simulations, $\gamma = 0.94$, $\alpha = 0.1$.

1. Choose motor action, $a_t$, and gating action, $g_t$, for the current state, $s_t$, according to a softmax over motor and gating action preferences, $u$ and $v$, respectively:
   $a_t \leftarrow \mathrm{Softmax}(u; s_t)$, $\quad g_t \leftarrow \mathrm{Softmax}(v; s_t)$
2. Update motor and gating action eligibility traces, $e^M$ and $e^G$, respectively (update shown for the motor action eligibility trace; the gating action trace is analogous):
   $e^M(s,a) \leftarrow \begin{cases} 1 - \Pr(a|s), & s = s_t,\ a = a_t \\ -\Pr(a|s), & s = s_t,\ a \neq a_t \\ \gamma \lambda \, e^M(s,a), & s \neq s_t \end{cases} \quad \forall s, a$
3. Update the (hidden) environment state, $\sigma$, with the motor action; get the next reward, $r$, and observation, $o$:
   $\sigma_{t+1} \leftarrow \mathrm{Environment}(a_t, \sigma_t)$; $\quad r, o \leftarrow \mathrm{Environment}(\sigma_{t+1})$
4. Update the internal state based on the previous state, gating action, and new observation:
   $s_{t+1} \leftarrow \langle o, g_t, s_t \rangle$
5. Compute the state-value prediction error, $\delta_t$, based on the critic's state-value approximation, $V(s)$:
   $\delta_t \leftarrow r + \gamma V(s_{t+1}) - V(s_t)$
6. Update the state-value eligibility traces, $e^V$:
   $e^V(s) \leftarrow \begin{cases} \gamma \lambda \, e^V(s) + 1, & s = s_t \\ \gamma \lambda \, e^V(s), & s \neq s_t \end{cases} \quad \forall s$
7. Update state values: $V(s) \leftarrow V(s) + \alpha \, \delta_t \, e^V(s), \ \forall s$
8. Update motor action preferences: $u(s,a) \leftarrow u(s,a) + \alpha \, \delta_t \, e^M(s,a), \ \forall s, a$
9. Update gating action preferences: $v(s,g) \leftarrow v(s,g) + \alpha \, \delta_t \, e^G(s,g), \ \forall s, g$
10. Next trial: $s_t \leftarrow s_{t+1}$
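To make the loop in Table 1 concrete, the following is a minimal Python sketch under stated assumptions: a single memory element and gating actor, a toy cue-then-blank environment of our own invention (ToyEnv), and $\lambda = 0.9$ as an arbitrary mid-range trace parameter. Only $\gamma = 0.94$ and $\alpha = 0.1$ are taken from Table 1; everything else is illustrative scaffolding, not the paper's code.

```python
# Minimal, runnable sketch of one-trial Actor/Critic learning with a gating
# action (Table 1), restricted to a single memory element. ToyEnv is our own
# toy cue-recall task, not one of the paper's simulations.
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
GAMMA, LAM, ALPHA = 0.94, 0.9, 0.1            # gamma, alpha from Table 1; lambda assumed
MOTOR, GATE = ["L", "R"], ["maintain", "update"]

V = defaultdict(float)                        # critic's state values V(s)
u = defaultdict(float)                        # motor action preferences u(s, a)
v = defaultdict(float)                        # gating action preferences v(s, g)
eV, eM, eG = defaultdict(float), defaultdict(float), defaultdict(float)

def softmax(prefs, s, acts):
    """Sample an action index from a softmax over tabular preferences."""
    p = np.exp(np.array([prefs[(s, a)] for a in acts]))
    p /= p.sum()
    return rng.choice(len(acts), p=p), p

class ToyEnv:
    """Shows a cue (A or B), then a blank; reward at the blank step depends on
    the cue, so the agent must gate the cue into memory to respond well."""
    def reset(self):
        self.cue, self.t = ("A" if rng.random() < 0.5 else "B"), 0
        return self.cue
    def step(self, a):
        self.t += 1
        if self.t == 1:
            return 0.0, "blank"
        r = 1.0 if (self.cue == "A") == (a == "L") else 0.0   # A->L, B->R
        return r, self.reset()                 # next episode folded into the stream

env = ToyEnv()
s = (env.reset(), "empty")                     # internal state = (observation, memory)
for _ in range(20000):
    ai, pa = softmax(u, s, MOTOR)              # a_t <- Softmax(u; s_t)
    gi, pg = softmax(v, s, GATE)               # g_t <- Softmax(v; s_t)
    for e, acts, i, p in ((eM, MOTOR, ai, pa), (eG, GATE, gi, pg)):
        for k in list(e):
            e[k] *= GAMMA * LAM                # decay traces where s != s_t
        for j, a in enumerate(acts):           # replacing trace at s_t:
            e[(s, a)] = float(j == i) - p[j]   # 1 - Pr(a|s) or -Pr(a|s)
    r, o = env.step(MOTOR[ai])                 # hidden state moves inside env
    mem = o if GATE[gi] == "update" else s[1]  # apply the gating action
    s_next = (o, mem)                          # s_{t+1} <- <o, g_t, s_t>
    delta = r + GAMMA * V[s_next] - V[s]       # prediction error (phasic DA)
    for k in list(eV):
        eV[k] *= GAMMA * LAM
    eV[s] += 1.0                               # value trace at s_t
    for k in list(eV): V[k] += ALPHA * delta * eV[k]
    for k in list(eM): u[k] += ALPHA * delta * eM[k]
    for k in list(eG): v[k] += ALPHA * delta * eG[k]
    s = s_next
```

In this toy problem the gating actor should come to favor storing the cue and maintaining it through the blank step, since only that memory-management policy makes the reward predictable from the internal state.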
Like [6], the key difference between our model and standard RL methods is that our state variable includes controlled memory elements (i.e., working memory), which augment the current observation. The action space is similarly augmented to include memory, or gating, actions, and the model learns by trial and error how to update its working memory (to resolve hidden states when such resolution leads to greater rewards) as well as its motor policy. The task for our model, then, is to learn a working memory policy such that the current internal state (i.e., memory and current observation) admits an optimal behavioral policy.

Our model (Table 1) consists of a critic, a motor actor, and several gating actors. As in the standard Actor/Critic architecture, the critic learns to evaluate (internal) states and, based on the ongoing temporal difference of these values, generates at each time step a prediction error (PE) signal (thought to correspond to phasic bursts and dips in DA [8]). The PE is used to train the critic's state values and the policies of the actors. The motor actor also fulfills the usual role, choosing actions to send to the environment based on its policy and the current internal state. Finally, gating actors correspond one-to-one with each memory element. At each time point, each gating actor independently chooses (via a policy based on the internal state) whether to (1) maintain its element's memory for another time step, or (2) replace (update) its element's memory with the current observation.

To remain aligned with the Actor/Critic online learning framework of mesostriatal RL [9-10], learning in our model is based on REINFORCE [14] modified for expected discounted future reward [13], rather than the Monte-Carlo policy learning algorithm in [6] (which is more suitable for offline, episodic learning). Furthermore, because it has been shown that eligibility traces are particularly useful when applying TD to POMDPs (e.g., [15-16]), we used TD($\lambda$), taking the characteristic eligibilities of the REINFORCE algorithm [14] as the impulse function for a replacing eligibility trace [17]. For simplicity of exposition and interpretation, we used tabular policy and state-value representations throughout.

3 Benchmark Performance and Psychological Data

We now describe the model's performance on the 12-AX task proposed as a benchmark for Gating models [4]. We then turn to a comparison of the model's behavior against actual psychological data.

3.1 12-AX Performance

The 12-AX task was used in [4] to illustrate the problem of learning a task in which correct behavior depends on multiple previous observations. In the task (Figure 1C), subjects are presented with a sequence of observations drawn from the set {1, 2, A, B, C, X, Y, Z}. They gain rewards by responding L or R according to the following rules. Respond R if (1) the current observation is an X, the last observation from the set {A, B, C} was an A, and the last observation from the set {1, 2} was a 1; or (2) the current observation is a Y, the last observation from the set {A, B, C} was a B, and the last observation from the set {1, 2} was a 2. Respond L otherwise. In our implementation, reward is 1 for correct responses when the current observation is X or Y, 0.25 for all other correct responses, and 0 for incorrect responses.
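Since the response rule above is stated exactly, it can be transcribed directly into code; a sketch follows. The correct_response and reward functions implement the quoted rule and payoffs; the random observation stream and the random stand-in for the model's responses are our own illustrative scaffolding (the paper does not specify the sequence-generating process at this point).

```python
# The 12-AX response rule and reward structure, transcribed from the text.
import random

SYMBOLS = list("12ABCXYZ")

def correct_response(history, obs):
    """Return 'R' or 'L' for the current observation given the history so far."""
    last_num = next((h for h in reversed(history) if h in "12"), None)
    last_letter = next((h for h in reversed(history) if h in "ABC"), None)
    if obs == "X" and last_letter == "A" and last_num == "1":
        return "R"
    if obs == "Y" and last_letter == "B" and last_num == "2":
        return "R"
    return "L"

def reward(history, obs, action):
    """1 for correct responses to X or Y, 0.25 for other correct responses, else 0."""
    if action != correct_response(history, obs):
        return 0.0
    return 1.0 if obs in "XY" else 0.25

# Illustrative usage on a random observation stream.
history = []
for _ in range(20):
    obs = random.choice(SYMBOLS)
    action = random.choice("LR")       # stand-in for the model's choice
    r = reward(history, obs, action)
    history.append(obs)
```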
We modeled this task using two memory elements, the minimum theoretically necessary for optimal performance. The results (Figure 1A,B) show that our TD($\lambda$) Gating model can indeed achieve optimal 12-AX performance. The results also demonstrate the reliance of the model on the eligibility trace parameter, $\lambda$, with best performance at high intermediate values of $\lambda$. When $\lambda = 0$, the model finds a suboptimal policy that is only slightly better than the optimal policy for a model without working memory. With $\lambda = 1$ performance is even worse, as can be expected for an online policy improvement method with non-decaying traces (a point of comparison with [6] to which we will return in the Discussion). These results are consistent with previous work showing that TD(0) performs poorly in partially observable (non-Markovian) settings [15], whereas TD($\lambda$) (without memory) with $\lambda \approx 0.9$ performs best [16]. Indeed, early in training, as our model learns to convert a POMDP to an MDP via its working memory, the internal state dynamics are not Markovian, and thus an eligibility trace is necessary.

Figure 1: 12-AX: Average performance over 40 training runs, each consisting of $2 \times 10^7$ time steps. (A) As indicated by reward rate over the last $10^5$ time steps, the model learns an optimal policy when the eligibility trace parameter, $\lambda$, is between zero and one. (B) The time required for the model to reach 300 consecutive correct trials increases rapidly as $\lambda$ decreases. (C) Sample sequence of the 12-AX task.

3.2 Psychological data

We are the first to interpret the Gating framework (and the use of working memory) as an attempt to solve POMDPs. This brings a large body of theoretical work to bear on the properties of Gating models. Importantly, it implies that, as task complexity increases, both the Gating model and humans must fail to find optimal solutions in reasonable time frames, due to the generally intractable nature of POMDPs.
Given this inescapable conclusion, it is interesting to compare model failures to corresponding human failures: a pattern of failures matching human data would provide support for our model. In this subsection we describe a simulation of artificial grammar learning [11], and then offer an account of the pervasive "magic number" observations concerning limits of working memory capacity (e.g., [12]).

In artificial grammar learning, subjects see a seemingly random sequence of observations, and are instructed to mimic each observation as quickly as possible (or to predict the next observation) with a corresponding action. Unknown to the subjects, the observation sequence is generated by a stochastic process called a "grammar" (Figure 2A). Artificial grammar tasks constitute POMDPs: the (recent) observation history can predict the next observation better than the current observation alone, so optimal performance requires subjects to remember information distilled from the history. Although subjects typically report no knowledge of the underlying structure, after training their reaction times (RTs) reveal implicit structural knowledge. Specifically, RTs become significantly faster for "grammatical" as compared to "ungrammatical" observations (see Figure 2).

Figure 2: (A) Artificial grammar from [11]. Starting from node 0, the grammar generates a continuing sequence of observations. All nodes with two transitions (edges) make either transition with p = 0.5. Edge labels mark grammatical observations. At each transition, the grammatical observation is replaced with a random, ungrammatical, observation with p = 0.15. The task is to predict the next observation at each time point. (B) The model shows a gradual increase in sensitivity to sequences of length 2 and 3, but not length 4, replicating the human data. Sensitivity is measured as the probability of choosing the grammatical action for the true state, minus the probability of choosing the grammatical action for the aliased state; 0 indicates complete aliasing, 1 complete resolution. (C) Model performance (reward rate) averaged over training runs with variable numbers of time steps shows diminishing returns as the number of memory elements increases.

Cleeremans and McClelland [11] examined the limits of subjects' capacity to detect grammar structure. The grammar they used is shown in Figure 2A. They found that, although subjects grew increasingly sensitive to sequences of length two and three throughout training (as measured by transient RT increases following ungrammatical observations), they remained insensitive, even after 60,000 time steps of training, to sequences of length four. This presumably reflected a failure of subjects' implicit working memory learning mechanisms, and was confirmed in a second experiment [11]. We replicated these results, as shown in Figure 2B. To simulate the task, we gave the model two memory elements (results were no different with three elements), and reward 1 for each correct prediction. We tested the model's ability to resolve states based on previous observations by contrasting its behavior across pairs of observation sequences that differed only in the first observation. State resolution based on sequences of length two, three, and four was represented by VS versus XS (leading to predictions Q vs. V/P, respectively), SQX versus XQX (S/Q vs. P/T), and XTVX versus PTVX (S/Q vs. P/T), respectively.
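The environment side of this simulation is a small finite-state machine. The sketch below shows the two generation mechanics stated in the Figure 2 caption (a p = 0.5 choice at two-edge nodes, and a p = 0.15 random symbol replacement); the TRANSITIONS table itself is a hypothetical stand-in, since the exact edges of the grammar in [11] are given only graphically in Figure 2A and are not reproduced here.

```python
# Generic generator for a finite-state grammar of the kind in Figure 2A.
# The edge structure below is illustrative only, NOT the grammar from [11].
import random

SYMBOLS = ["T", "S", "X", "V", "P", "Q"]
# node -> list of (emitted symbol, next node); two entries = p = 0.5 each
TRANSITIONS = {
    0: [("T", 1), ("P", 2)],
    1: [("S", 1), ("X", 3)],
    2: [("V", 2), ("Q", 3)],
    3: [("X", 0), ("V", 4)],
    4: [("S", 0)],
}
NOISE = 0.15                               # p of an ungrammatical replacement

def generate(n_steps, node=0):
    """Yield an observation stream; each grammatical symbol may be replaced."""
    for _ in range(n_steps):
        sym, node = random.choice(TRANSITIONS[node])
        if random.random() < NOISE:
            sym = random.choice(SYMBOLS)   # ungrammatical observation
        yield sym

stream = list(generate(100))
```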
In this task, optimal use of information from sequences of length four or more proved impossible for the model and, apparently, for humans. To understand this limitation intuitively, consider a problem of two hidden states, 1 and 2, with optimal actions L and R, respectively. The states are preceded by identical observation sequences of length $n$. However, at $n + 1$ time steps in the past, observation A precedes state 1, whereas observation B precedes state 2. The probability that A/B are held in memory for the required $n + 1$ time steps decreases geometrically with $n$, and thus the probability of resolving states 1 and 2 decreases geometrically. Because the agent cannot resolve state 1 from state 2, it can never learn the appropriate 1-L, 2-R action preferences even if it explores those actions, a more insidious problem than an RL agent faces in a fully observable setting. As a result, the model cannot reinforce optimal gating policies, eventually learning an internal state space and dynamics that fail to reflect the true environment. The problem is that credit assignment (i.e., learning a mapping from working memory to actions) is only useful inasmuch as the internal state corresponds to the true hidden state of the POMDP, leading to a "chicken-and-egg" problem.

Given the preceding argument, one obvious modification that might lead to improved performance is to increase the number of memory elements. As the number of memory elements increases, the probability that the model remembers observation A for the required amount of time approaches one. However, this strategy introduces the curse of dimensionality, due to the rapidly increasing size of the internal state space.

This intuitive analysis suggests a normative explanation for the famous "magic number" limitation observed in human working memory capacity, thought to be about four independent elements (e.g., [12]). We demonstrate this idea by again simulating the artificial grammar task, this time averaging performance over a range of training times (1 to 10 million time steps) to capture the idea that humans may practice novel tasks for a typical, but variable, amount of time. Indeed, the averaged results show diminishing returns of increasing memory elements (Figure 2C). This simulation used tabular (rather than more neurally plausible) representations and a highly simplified model, so the exact number of policy parameters and state values to be estimated, time steps, and working memory elements is somewhat arbitrary in relation to human learning. Still, the model's qualitative behavior (evidenced by the shape of the resulting curve and the order of magnitude of the optimal number of working memory elements) is surprisingly reminiscent of human behavior. Based on this we suggest that the limitation on working memory capacity may be due to a limitation on learning rather than on storage: it may be impractical to learn to utilize more than a very small number (i.e., smaller than 10) of independent working memory elements, due to the curse of dimensionality.
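The two sides of this tradeoff can be put into rough numbers. The sketch below assumes that each gating actor maintains its element with probability p per step (so a cue survives n steps with probability p^n), that elements behave independently, and that each of K elements can hold any of m observations or be empty, giving a tabular internal state space of size m(m + 1)^K. All of the specific values are illustrative assumptions, not quantities from the simulations.

```python
# Back-of-the-envelope numbers for the retention vs. state-space tradeoff.
p, n, m = 0.5, 4, 8      # assumed per-step maintenance prob., lag, observation-set size

for K in range(1, 9):    # number of independent memory elements
    hold_one = p ** n                      # one element keeps the cue for n steps
    hold_any = 1 - (1 - hold_one) ** K     # at least one of K elements keeps it
    states = m * (m + 1) ** K              # current observation x memory contents
    print(f"K={K}: P(cue retained)={hold_any:.3f}, |internal states|={states:,}")
```

Retention probability climbs toward one with K while the table the agent must learn grows exponentially, which is the tension the text describes.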
4 Discussion

We have presented a psychological model that suggests that dopaminergic PE signals can implicitly shape working memory representations in PFC. Our model synthesizes recent advances in the Gating literature [4] with normative RL theory regarding model-free, finite memory solutions to POMDPs [6]. We showed that the model learns to behave optimally in the benchmark 12-AX task. We also related the model's computational limitations to known limitations of human working memory [11-12].

4.1 Relation to other theoretical work

Other recent work in neural RL has argued that the brain applies memory-based POMDP solution mechanisms to the real-world problems faced by animals [17-20]. That work primarily considers model-based mechanisms, in which the temporary memory is a continuous belief state, and assumes that a function of cerebral cortex is to learn the required world model, and specifically that PFC should represent temporary goal- or policy-related information necessary for optimal POMDP behavior. The model that we present here is related to that line of thinking, demonstrating a model-free, rather than model-based, mechanism for learning to store policy-related information in PFC. Different learning systems may form different types of working memory representations. Future work may investigate the relationship between implicit learning (as in this Gating model) and model-free POMDP solutions, versus other kinds of learning and model-based POMDP solutions. Irrespective of the POMDP framework, other work has assumed that there exists a gating policy that controls task-relevant working memory updating in PFC (e.g., [21]). The present work further develops a model of how this policy can be learned.

It is interesting to compare our model to previous work on model-free POMDP solutions. McCallum first emphasized the importance of learning utile distinctions [5], or learning to resolve two hidden states only if they have different optimal actions. This is an emphasis that our model shares, at least in spirit. Humans must of course be extremely flexible in their behavior. Therefore there is an inherent tension between the need to focus cognitive resources on learning the immediate task, and the need to form a basis of general task knowledge [3]. It would be interesting for future work to explore how closely the working memory representations learned by our model align to McCallum's utile (and less generalizable) distinctions as opposed to more generalizable representations of the underlying hidden structure of the world, or whether our model could be modified to incorporate a mixture of both kinds of knowledge, depending on some exploration/exploitation parameter.

Our model most closely follows the Gating model described in [4] and the theoretical model described in [6]. Our model is clearly more abstract and less biologically detailed than [4]. However, our intent was to ask whether the important insights and capabilities of that model could be captured using a four-parameter, pure RL model with a clear normative basis. Accordingly, we have shown that such a model is comparably equipped to simulate a range of psychological phenomena. Our model also makes equally testable (albeit different) predictions about the neural DA signal. Relative to [6], our model places biological and psychological concerns at the forefront, eliminating the episodic memory requirements of the Monte-Carlo algorithm.
It is perhaps interesting, vis-à-vis [6], that our model performed so poorly when $\lambda = 1$, as this produces a nearly Monte-Carlo scheme. The difference was likely due to our model's online learning (i.e., we updated the policy at each time step rather than at the ends of episodes), which invalidates the Monte-Carlo approach. Thus it might be said that our model is a uniquely psychological variant of that previous architecture.

4.2 Implications for Working Memory and Cognitive Control

Subjects in cognitive control experiments typically face situations in which correct behavior is indeterminate given only the immediate observation. Working memory is often thought of as the repository of temporary information that augments the immediate observation to permit correct behavior, sometimes called goals, context, task set, or decision categories. These concepts are difficult to define. Here we have proposed a formal theoretical definition for the cognitive control and working memory constructs. Due to the importance of temporally distant goals and of information that is not immediately observable, the canonical cognitive control environment is well captured by a POMDP. Working memory is then the temporary information, defined and updated by a memory control policy, that the animal uses to solve these POMDPs. Model-based research might identify working memory with continuous belief states, whereas our model-free framework identifies working memory with a discrete collection of recent observations. These may correspond to the products of different learning systems, but the outcome is the same in either case: cognitive control is defined as an animal's memory-based POMDP solver, and working memory is defined as the information, derived from recent history, that the solver requires.

4.3 Psychological and neural validity

Although the intractability of solving a POMDP means that all models such as the one we present here must ultimately fail to find an optimal solution in a practical amount of time (if at all), the particular manifestation of computational limitations in our model aligns qualitatively with that observed in humans. Working memory, the psychological construct that the Gating model addresses, is famously limited (see [12] for a review). Beyond canonical working memory capacity limitations, other work has shown subtler limitations arising in learning contexts (e.g., [11]). The results that we presented here are promising, but it remains for future work to more fully explore the relation between the failures exhibited by this model and those exhibited by humans.

In conclusion, we have shown that the Gating framework provides a connection between high-level cognitive concepts such as working memory and cognitive control, systems neuroscience, and current neural RL theory. The framework's trial-and-error method for solving POMDPs gives rise to particular limitations that are reminiscent of observed psychological limits. It remains for future work to further investigate the model's ability to capture a range of specific psychological and neural phenomena.
Our hope is that this link between working memory and POMDPs will be fruitful in generating new insights, and in suggesting further experimental and theoretical work.

Acknowledgments

We thank Peter Dayan, Randy O'Reilly, and Michael Frank for productive discussions, and three anonymous reviewers for helpful comments. This work was supported by NIH grant 5R01MH052864 (MT & JDC) and a Human Frontiers Science Program Fellowship (YN).

References

[1] Braver, T. S., & Cohen, J. D. (1999). Dopamine, cognitive control, and schizophrenia: The gating model. In J. A. Reggia, E. Ruppin, & D. Glanzman (Eds.), Progress in Brain Research (pp. 327-349). Amsterdam, North-Holland: Elsevier Science.
[2] Braver, T. S., & Cohen, J. D. (2000). On the control of control: The role of dopamine in regulating prefrontal function and working memory. In S. Monsell & J. S. Driver (Eds.), Control of Cognitive Processes: Attention and Performance XVIII (pp. 713-737). Cambridge, MA: MIT Press.
[3] Rougier, N. P., Noelle, D., Braver, T., Cohen, J., & O'Reilly, R. (2005). Prefrontal cortex and flexible cognitive control: Rules without symbols. Proceedings of the National Academy of Sciences, 102(20), 7338-7343.
[4] O'Reilly, R. C., & Frank, M. J. (2006). Making working memory work: A computational model of learning in the prefrontal cortex and basal ganglia. Neural Computation, 18, 283-328.
[5] McCallum, A. (1995). Instance-based utile distinctions for reinforcement learning with hidden state. International Conference on Machine Learning (pp. 387-395).
[6] Peshkin, L., Meuleau, N., & Kaelbling, L. (1999). Learning policies with external memory. Sixteenth International Conference on Machine Learning (pp. 307-314).
[7] Montague, P. R., Dayan, P., & Sejnowski, T. J. (1996). A framework for mesencephalic dopamine systems based on predictive Hebbian learning. The Journal of Neuroscience, 16(5), 1936-1947.
[8] Schultz, W., Dayan, P., & Montague, P. R. (1997). A neural substrate of prediction and reward. Science, 275, 1593-1599.
[9] Houk, J., Adams, J., & Barto, A. (1995). A model of how the basal ganglia generate and use neural signals that predict reinforcement. In J. Houk, J. Davis, & D. Beiser (Eds.), Models of Information Processing in the Basal Ganglia. Cambridge, MA: MIT Press.
[10] Joel, D., Niv, Y., & Ruppin, E. (2002). Actor-critic models of the basal ganglia: New anatomical and computational perspectives. Neural Networks, 15, 535-547.
[11] Cleeremans, A., & McClelland, J. (1991). Learning the structure of event sequences. Journal of Experimental Psychology: General, 120(3), 235-253.
[12] Cowan, N. (2000). The magical number 4 in short-term memory: A reconsideration of mental storage capacity. Behavioral and Brain Sciences, 24, 87-114.
[13] Dayan, P., & Abbott, L. (2001). Theoretical Neuroscience. Cambridge, MA: MIT Press.
[14] Williams, R. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8, 229-256.
[15] Singh, S., Jaakkola, T., & Jordan, M. I. (1994). Learning without state-estimation in partially observable Markovian decision processes. Eleventh International Conference on Machine Learning (pp. 284-292).
[16] Loch, J., & Singh, S. (1998). Using eligibility traces to find the best memoryless policy in partially observable Markov decision processes. Fifteenth International Conference on Machine Learning (pp. 323-331).
[17] Sutton, R., & Barto, A. (1998). Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press.
[18] Daw, N., Courville, A., & Touretzky, D. (2006). Representation and timing in theories of the dopamine system. Neural Computation, 18, 1637-1677.
[19] Samejima, K., & Doya, K. (2007). Multiple representations of belief states and action values in corticobasal ganglia loops. Annals of the New York Academy of Sciences, 213-228.
[20] Yoshida, W., & Ishii, S. (2006). Resolution of uncertainty in prefrontal cortex. Neuron, 50, 781-789.
[21] Dayan, P. (2007). Bilinearity, rules, and prefrontal cortex. Frontiers in Computational Neuroscience, 1, 1-14.
", "award": [], "sourceid": 93, "authors": [{"given_name": "Michael", "family_name": "Todd", "institution": null}, {"given_name": "Yael", "family_name": "Niv", "institution": null}, {"given_name": "Jonathan", "family_name": "Cohen", "institution": null}]}