{"title": "Feudal Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 271, "page_last": 278, "abstract": null, "full_text": "Feudal Reinforcement Learning \n\nPeter Dayan \n\nCNL \n\nThe Salk Institute \n\nPO Box 85800 \n\nSan Diego CA 92186-5800, USA \ndayan@helmholtz.sdsc.edu \n\nGeoffrey E Hinton \n\nDepartment of Computer Science \n\nUniversity of Toronto \n\n6 Kings College Road, Toronto, \n\nCanada M5S 1A4 \n\nhinton@ai.toronto.edu \n\nAbstract \n\nOne way to speed up reinforcement learning is to enable learning to \nhappen simultaneously at multiple resolutions in space and time. \nThis paper shows how to create a Q-learning managerial hierarchy \nin which high level managers learn how to set tasks to their sub-managers \nwho, in turn, learn how to satisfy them. Sub-managers \nneed not initially understand their managers' commands. They \nsimply learn to maximise their reinforcement in the context of the \ncurrent command. \nWe illustrate the system using a simple maze task. As the system \nlearns how to get around, satisfying commands at the multiple \nlevels, it explores more efficiently than standard, flat, Q-learning \nand builds a more comprehensive map. \n\n1 INTRODUCTION \n\nStraightforward reinforcement learning has been quite successful at some relatively \ncomplex tasks like playing backgammon (Tesauro, 1992). However, the \nlearning time does not scale well with the number of parameters. 
For agents solving \nrewarded Markovian decision tasks by learning dynamic programming value \nfunctions, some of the main bottlenecks (Singh, 1992b) are temporal resolution - \nexpanding the unit of learning from the smallest possible step in the task, division-and-conquest \n- finding smaller subtasks that are easier to solve, exploration, and \nstructural generalisation - generalisation of the value function between different locations. \nThese are obviously related - for instance, altering the temporal resolution \ncan have a dramatic effect on exploration. \nConsider a control hierarchy in which managers have sub-managers, who work \nfor them, and super-managers, for whom they work. If the hierarchy is strict \nin the sense that managers control exactly the sub-managers at the level below \nthem and only the very lowest level managers can actually act in the world, then \nintermediate level managers have essentially two instruments of control over their \nsub-managers at any time - they can choose amongst them and they can set them \nsub-tasks. These sub-tasks can be incorporated into the state of the sub-managers \nso that they in turn can choose their own sub-sub-tasks and sub-sub-managers to \nexecute them based on the task selection at the higher level. \nAn appropriate hierarchy can address the first three bottlenecks. Higher level \nmanagers should sustain a larger grain of temporal resolution, since they leave the \nsub-sub-managers to do the actual work. Exploration for actions leading to rewards \ncan be more efficient since it can be done non-uniformly - high level managers can \ndecide that reward is best found in some other region of the state space and send \nthe agent there directly, without forcing it to explore in detail on the way. 
\nSingh (1992a) has studied the case in which a manager picks one of its sub-managers \nrather than setting tasks. He used the degree of accuracy of the Q-values of sub-managerial \nQ-learners (Watkins, 1989) to train a gating system (Jacobs, Jordan, \nNowlan & Hinton, 1991) to choose the one that matches best in each state. Here \nwe study the converse case, in which there is only one possible sub-manager active \nat any level, and so the only choice a manager has is over the tasks it sets. Such \nsystems have been previously considered (Hinton, 1987; Watkins, 1989). \nThe next section considers how such a strict hierarchical scheme can learn to choose \nappropriate tasks at each level, section 3 describes a maze learning example for \nwhich the hierarchy emerges naturally as a multi-grid division of the space in \nwhich the agent moves, and section 4 draws some conclusions. \n\n2 FEUDAL CONTROL \n\nWe sought to build a system that mirrored the hierarchical aspects of a feudal \nfiefdom, since this is one extreme for models of control. Managers are given \nabsolute power over their sub-managers - they can set them tasks and reward \nand punish them entirely as they see fit. However managers ultimately have to \nsatisfy their own super-managers, or face punishment themselves - and so there is \nrecursive reinforcement and selection until the whole system satisfies the goal of the \nhighest level manager. This can all be made to happen without the sub-managers \ninitially \"understanding\" the sub-tasks they are set. Every component just acts to \nmaximise its expected reinforcement, so after learning, the meaning it attaches to a \nspecification of a sub-task consists of the way in which that specification influences \nits choice of sub-sub-managers and sub-sub-tasks. 
Two principles are key: \nReward Hiding: Managers must reward sub-managers for doing their bidding \nwhether or not this satisfies the commands of the super-managers. Sub-managers \nshould just learn to obey their managers and leave it up to them to determine what \nit is best to do at the next level up. So if a sub-manager fails to achieve the sub-goal \nset by its manager it is not rewarded, even if its actions result in the satisfaction \nof the manager's own goal. Conversely, if a sub-manager achieves the sub-goal it \nis given it is rewarded, even if this does not lead to satisfaction of the manager's \nown goal. This allows the sub-manager to learn to achieve sub-goals even when \nthe manager was mistaken in setting these sub-goals. So in the early stages of \nlearning, low-level managers can become quite competent at achieving low-level \ngoals even if the highest level goal has never been satisfied. \nInformation Hiding: Managers only need to know the state of the system at the \ngranularity of their own choices of tasks. Indeed, allowing some decision making \nto take place at a coarser grain is one of the main goals of the hierarchical decomposition. \nInformation is hidden both downwards - sub-managers do not know \nthe task the super-manager has set the manager - and upwards - a super-manager \ndoes not know what choices its manager has made to satisfy its command. However \nmanagers do need to know the satisfaction conditions for the tasks they set \nand some measure of the actual cost to the system for achieving them using the \nsub-managers and tasks it picked on any particular occasion. \nFor the special case to be considered here, in which managers are given no choice \nof which sub-manager to use in a given state, their choice of a task is very similar \nto that of an action for a standard Q-learning system. 
If the task is completed \nsuccessfully, the cost is determined by the super-manager according to how well (eg \nhow quickly, or indeed whether) the manager satisfied its super-tasks. Depending \non how its own task is accomplished, the manager rewards or punishes the sub-manager \nresponsible. When a manager chooses an action, control is passed to the \nsub-manager and is only returned when the state changes at the managerial level. \n\n3 THE MAZE TASK \n\nTo illustrate this feudal system, consider a standard maze task (Barto, Sutton & \nWatkins, 1989) in which the agent has to learn to find an initially unknown goal. \nThe grid is split up at successively finer grains (see figure 1) and managers are \nassigned to separable parts of the maze at each level. So, for instance, the level 1 \nmanager of area 1-(1,1) sets the tasks for and reinforcement given to the level 2 \nmanagers for areas 2-(1,1), 2-(1,2), 2-(2,1) and 2-(2,2). The successive separation \ninto quarters is fairly arbitrary - however if the regions at high levels did not cover \ncontiguous areas at lower levels, then the system would not perform very well. \nAt all times, the agent is effectively performing an action at every level. There are \nfive actions, NSEW and *, available to the managers at all levels other than the first \nand last. NSEW represent the standard geographical moves and * is a special action \nthat non-hierarchical systems do not require. It specifies that lower level managers \nshould search for the goal within the confines of the current larger state instead of \ntrying to move to another region of the space at the same level. At the top level, \nthe only possible action is *; at the lowest level, only the geographical moves are \nallowed, since the agent cannot search at a finer granularity than it can move. 
\nEach manager maintains Q values (Watkins, 1989; Barto, Bradtke & Singh, 1992) \nover the actions it instructs its sub-managers to perform, based on the location of \nthe agent at the subordinate level of detail and the command it has received from \nabove. So, for instance, if the agent currently occupies 3-(6,6), and the instruction \nfrom the level 0 manager is to move South, then the 1-(2,2) manager decides upon \nan action based on the Q values for NSEW giving the total length of the path to \neither 2-(3,2) or 2-(4,2). The action the 1-(2,2) manager chooses is communicated \none level down the hierarchy and becomes part of the state determining the level 2 \nQ values. \n\nFigure 1: The Grid Task. This shows how the maze is divided up at \ndifferent levels in the hierarchy. The 'u' shape is the barrier, and the shaded square \nis the goal. Each high level state is divided into four low level ones at every step. \n\nWhen the agent starts, actions at successively lower levels are selected using the \nstandard Q-learning softmax method and the agent moves according to the finest \ngrain action (at level 3 here). \n\nFigure 2: Learning Performance. F-Q shows the performance of the feudal architecture \nand S-Q of the standard Q-learning architecture. [Plot: Steps to Goal, on a logarithmic axis from 1e+02 to 1e+04, against Iterations from 0 to 500, with curves for F-Q Task 1, F-Q Task 2, S-Q Task 1 and S-Q Task 2.] \n\nThe Q values at every level at which this causes 
\na state transition are updated according to the length of path at that level, if the \nstate transition is what was ordered at all lower levels. This restriction comes from \nthe constraint that super-managers should only learn from the fruits of the honest \nlabour of sub-managers, ie only if they obey their managers. \nFigure 2 shows how the system performs compared with standard, one-step, Q-learning, \nfirst in finding a goal in a maze similar to that in figure 1, only having \n32x32 squares, and second in finding the goal after it is subsequently moved. Points \non the graph are averages of the number of steps it takes the agent to reach the goal \nacross all possible testing locations, after the given number of learning iterations. \nLittle effort was made to optimise the learning parameters, so care is necessary in \ninterpreting the results. \nFor the first task the feudal system is initially slower, but after a while, it learns \nmuch more quickly how to navigate to the goal. The early sloth is due to the \nfact that many low level actions are wasted, since they do not implement desired \nhigher level behaviour and the system has to learn not to try impossible actions or \n* in inappropriate places. The late speed comes from the feudal system's superior \nexploratory behaviour. If it decides at a high level that the goal is in one part of the \nmaze, then it has the capacity to specify large scale actions at that level to take it \nthere. This is the same advantage that Singh's (1992b) variable temporal resolution \nsystem garners, although this is over a single task rather than explicitly composite \nsub-tasks. Tests on mazes of different sizes suggested that the number of iterations \nafter which the advantage of exploration outweighs the disadvantage of wasted \nactions gets less as the complexity of the task increases. \nA similar pattern emerges for the second task. 
Low level Q values embody an \nimplicit knowledge of how to get around the maze, and so the feudal system can \nexplore efficiently once it (slowly) learns not to search in the original place. \n\nFigure 3: The Learned Actions. The area of the boxes and the radius of the central \ncircle give the probabilities of taking action NSEW and * respectively. \n\nFigure 3 shows the probabilities of each move at each location once the agent has \nlearnt to find the goal at 3-(3,3). The length of the NSEW bars and the radius \nof the central circle are proportional to the probability of selecting actions NSEW \nor * respectively, and action choice flows from top to bottom. For instance, the \nprobability of choosing action S at state 2-(1,3) is the sum of the products of the \nprobabilities of choosing actions NSEW and * at state 1-(1,2) and the probabilities, \nconditional on this higher level selection, of choosing action S at state 2-(1,3). Apart \nfrom the right hand side of the barrier, the actions are generally correct - however \nthere are examples of sub-optimal behaviour caused by the decomposition of the \nspace, eg the system decides to move North at 3-(8,5) despite it being more felicitous \nto move South. \nCloser investigation of the course of learning reveals that, as might be expected from \nthe restrictions in updating the Q values, the system initially learns in a completely \nbottom-up manner. However after a while, it learns appropriate actions at the \nhighest levels, and so top-down learning happens too. This generally beneficial \neffect arises because there are far fewer states at coarse resolutions, and so it is \neasier for the agent to calculate what to do. 
\n\n4 DISCUSSION \n\nThe feudal architecture partially addresses one of the major concerns in reinforcement \nlearning about how to divide a single task up into sub-tasks at multiple levels. \nA demonstration was given of how this can be done separately from choosing between \ndifferent possible sub-managers at a given level. \nIt depends on there being a plausible managerial system, preferably based on a \nnatural hierarchical division of the available state space. For some tasks it can be \nvery inefficient, since it forces each sub-manager to learn how to satisfy all the \nsub-tasks set by its manager, whether or not those sub-tasks are appropriate. It \nis therefore more likely to be useful in environments in which the set tasks can \nchange. Managers need not necessarily know in advance the consequences of their \nactions. They could learn, in a self-supervised manner, information about the state \ntransitions that they have experienced. These observed next states can be used as \ngoals for their sub-managers - consistency in providing rewards for appropriate \ntransitions is the only requirement. \nAlthough the system gains power through hiding information, which reduces \nthe size of the state spaces that must be searched, such a step also introduces \ninefficiencies. In some cases, if a sub-manager only knew the super-task of its \nsuper-manager then it could bypass its manager with advantage. However the \nreductio of this would lead to each sub-manager having as large a state space as \nthe whole problem, negating the intent of the feudal architecture. A more serious \nconcern is that the non-Markovian nature of the task at the higher levels (the future \ncourse of the agent is determined by more detailed information than just the high \nlevel states) can render the problem insoluble. 
Moore and Atkeson's (1993) system \nfor detecting such cases and choosing finer resolutions accordingly should integrate \nwell with the feudal system. \nFor the maze task, the feudal system learns much more about how to navigate than \nthe standard Q-learning system. Whereas the latter is completely concentrated on \na particular target, the former knows how to execute arbitrary high level moves \nefficiently, even ones that are not used to find the current goal such as going East \nfrom one quarter of the space 1-(2,2) to another 1-(1,2). This is why exploration \ncan be more efficient. It doesn't require a map of the space, or even a model of \nstate × action → next state to be learned explicitly. \nJameson (1992) independently studied a system with some similarities to the feudal \narchitecture. In one case, a high level agent learned on the basis of external \nreinforcement to provide on a slow timescale direct commands (like reference trajectories) \nto a low level agent - which learned to obey it based on reinforcement \nproportional to the square trajectory error. In another, low and high level agents \nreceived the same reinforcement from the world, but the former was additionally \ntasked on making its prediction of future reinforcement significantly dependent \non the output of the latter. Both systems learned very effectively to balance an \nupended pole for long periods. They share the notion of hierarchical structure \nwith the feudal architecture, but the notion of control is somewhat different. \nMulti-resolution methods have long been studied as ways of speeding up dynamic \nprogramming (see Morin, 1978, for numerous examples and references). Standard \nmethods focus effectively on having a single task at every level and just having \ncoarser and finer representations of the value function. 
However, here we have \nstudied a slightly different problem in which managers have the flexibility to specify \ndifferent tasks which the sub-managers have to learn how to satisfy. This is more \ncomplicated, but also more powerful. \nFrom a psychological perspective, we have replaced a system in which there is a \nsingle external reinforcement schedule with a system in which the rat's mind is \ncomposed of a hierarchy of little Skinners. \n\nAcknowledgements \n\nWe are most grateful to Andrew Moore, Mark Ring, Jürgen Schmidhuber, Satinder \nSingh, Sebastian Thrun and Ron Williams for helpful discussions. This work \nwas supported by SERC, the Howard Hughes Medical Institute and the Canadian \nInstitute for Advanced Research (CIAR). GEH is the Noranda fellow of the CIAR. \n\nReferences \n\n[1] Barto, AG, Bradtke, SJ & Singh, SP (1991). Real-Time Learning and Control using Asynchronous \nDynamic Programming. COINS technical report 91-57. Amherst: University of \nMassachusetts. \n\n[2] Barto, AG, Sutton, RS & Watkins, CJCH (1989). Learning and sequential decision \nmaking. In M Gabriel & J Moore, editors, Learning and Computational Neuroscience: \nFoundations of Adaptive Networks. Cambridge, MA: MIT Press, Bradford Books. \n\n[3] Hinton, GE (1987). Connectionist Learning Procedures. Technical Report CMU-CS-87-115, \nDepartment of Computer Science, Carnegie-Mellon University. \n\n[4] Jacobs, RA, Jordan, MI, Nowlan, SJ & Hinton, GE (1991). Adaptive mixtures of local experts. \nNeural Computation, 3, pp 79-87. \n\n[5] Jameson, JW (1992). Reinforcement control with hierarchical backpropagated adaptive \ncritics. Submitted to Neural Networks. \n\n[6] Moore, AW & Atkeson, CC (1993). Memory-based reinforcement learning: efficient \ncomputation with prioritized sweeping. In SJ Hanson, CL Giles & JD Cowan, editors, \nAdvances in Neural Information Processing Systems 5. San Mateo, CA: Morgan Kaufmann. 
\n[7] Morin, TL (1978). Computational advances in dynamic programming. In ML Puterman, \neditor, Dynamic Programming and its Applications. New York: Academic Press. \n\n[8] Moore, AW (1991). Variable resolution dynamic programming: Efficiently learning \naction maps in multivariate real-valued state spaces. Proceedings of the Eighth Machine \nLearning Workshop. San Mateo, CA: Morgan Kaufmann. \n\n[9] Singh, SP (1992a). Transfer of learning by composing solutions for elemental sequential \ntasks. Machine Learning, 8, pp 323-340. \n\n[10] Singh, SP (1992b). Scaling reinforcement learning algorithms by learning variable temporal \nresolution models. Submitted to Machine Learning. \n\n[11] Tesauro, G (1992). Practical issues in temporal difference learning. Machine Learning, 8, \npp 257-278. \n\n[12] Watkins, CJCH (1989). Learning from Delayed Rewards. PhD Thesis. University of Cambridge, \nEngland. \n", "award": [], "sourceid": 714, "authors": [{"given_name": "Peter", "family_name": "Dayan", "institution": null}, {"given_name": "Geoffrey", "family_name": "Hinton", "institution": null}]}