{"title": "Programmable Reinforcement Learning Agents", "book": "Advances in Neural Information Processing Systems", "page_first": 1019, "page_last": 1025, "abstract": null, "full_text": "Programmable Reinforcement Learning Agents \n\nDavid Andre and Stuart J. Russell \n\nComputer Science Division, UC Berkeley, CA 94702 \n\n{dandre,russell}@cs.berkeley.edu \n\nAbstract \n\nWe present an expressive agent design language for reinforcement learning that allows the user to constrain the policies considered by the learning process. The language includes standard features such as parameterized subroutines, temporary interrupts, aborts, and memory variables, but also allows for unspecified choices in the agent program. For learning that which isn't specified, we present provably convergent learning algorithms. We demonstrate by example that agent programs written in the language are concise as well as modular. This facilitates state abstraction and the transferability of learned skills. \n\n1 Introduction \nThe field of reinforcement learning has recently adopted the idea that the application of prior knowledge may allow much faster learning and may indeed be essential if real-world environments are to be addressed. For learning behaviors, the most obvious form of prior knowledge provides a partial description of desired behaviors. Several languages for partial descriptions have been proposed, including Hierarchical Abstract Machines (HAMs) [8], semi-Markov options [12], and the MAXQ framework [4]. \n\nThis paper describes extensions to the HAM language that substantially increase its expressive power, using constructs borrowed from programming languages. Obviously, increasing expressiveness makes it easier for the user to supply whatever prior knowledge is available, and to do so more concisely. 
(Consider, for example, the difference between wiring up Boolean circuits and writing Java programs.) More importantly, the availability of an expressive language allows the agent to learn and generalize behavioral abstractions that would be far more difficult to learn in a less expressive language. For example, the ability to specify parameterized behaviors allows multiple behaviors such as WalkEast, WalkNorth, WalkWest, WalkSouth to be combined into a single behavior Walk(d), where d is a direction parameter. Furthermore, if a behavior is appropriately parameterized, decisions within the behavior can be made independently of the \"calling context\" (the hierarchy of tasks within which the behavior is being executed). This is crucial in allowing behaviors to be learned and reused as general skills. \n\nOur extended language includes parameters, interrupts, aborts (i.e., interrupts without resumption), and local state variables. Interrupts and aborts in particular are very important in physical behaviors (more so than in computation) and are crucial in allowing for modularity in behavioral descriptions. These features are all common in robot programming languages [2, 3, 5]; the key element of our approach is that behaviors need only be partially described; reinforcement learning does the rest. \n\nTo tie our extended language to existing reinforcement learning algorithms, we utilize Parr and Russell's [8] notion of the joint semi-Markov decision process (SMDP) created when a HAM is composed with an environment (modeled as an MDP). The joint SMDP state space consists of the cross-product of the machine states in the HAM and the states in the original MDP; the dynamics are created by the application of the HAM in the MDP. 
Parr and Russell showed that an optimal solution to the joint SMDP is both learnable and yields an optimal solution to the original MDP in the class of policies expressed by the HAM (so-called hierarchical optimality). Furthermore, Parr and Russell showed that the joint SMDP can be reduced to an equivalent SMDP with a state space consisting only of the states where the HAM does not specify an action, which reduces the complexity of the SMDP problem that must be solved. We show that these results hold for our extended language of Programmable HAMs (PHAMs). \n\nTo demonstrate the usefulness of the new language, we show a small, complete program for a complex environment that would require a much larger program in previous formalisms. We also show experimental results verifying the convergence of the learning process for our language. \n\n2 Background \nAn MDP is a 4-tuple, (S, A, T, R), where S is a set of states, A is a set of actions, T is a probabilistic transition function mapping S × A × S → [0, 1], and R is a reward function mapping S × A × S to the reals. In this paper, we focus on infinite-horizon MDPs with a discount factor β. A solution to an MDP is an optimal policy π* that maps from S → A and achieves maximum expected discounted reward for the agent. An SMDP (semi-Markov decision process) allows for actions that take more than one time step. T is modified to be a mapping from S, A, S, N → [0, 1], where N is the natural numbers; i.e., it specifies a distribution over both output states and action durations. R is then a mapping from S, A, S, N to the reals. The discount factor, β, is generalized to be a function, β(s, a), that represents the expected discount factor when action a is taken in state s. Our definitions follow those common in the literature [9, 6, 4]. \n\nThe HAM language [8] provides for partial specification of agent programs. 
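As a concrete illustration of the MDP formulation above (this sketch is not from the paper; the function and variable names are ours), an MDP (S, A, T, R) with discount factor β can be solved by simple value iteration:

```python
# Minimal MDP sketch: S and A are index sets, T[s][a] is a list of
# (next_state, probability) pairs, and R maps (s, a, s') to a reward.
# Illustrative only; the paper defines the formalism, not this code.

def value_iteration(S, A, T, R, beta, sweeps=100):
    V = {s: 0.0 for s in S}
    for _ in range(sweeps):
        V = {
            s: max(
                sum(p * (R(s, a, s2) + beta * V[s2]) for s2, p in T[s][a])
                for a in A
            )
            for s in S
        }
    return V

# Two-state example: the only action keeps the agent in place;
# state 0 yields reward 1 per step, state 1 yields nothing.
S, A = [0, 1], ["stay"]
T = {0: {"stay": [(0, 1.0)]}, 1: {"stay": [(1, 1.0)]}}
R = lambda s, a, s2: 1.0 if s == 0 else 0.0
V = value_iteration(S, A, T, R, beta=0.5)
# V[0] approaches 1/(1 - 0.5) = 2; V[1] stays at 0.
```

The SMDP generalization replaces the scalar β with an expected per-action discount β(s, a), which is exactly the quantity the HAMQ-learning rule in Section 4 accumulates between choice points.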
A HAM program consists of a set of partially specified Moore machines. Transitions in each machine may depend stochastically on (features of) the environment state, and the outputs of each machine are primitive actions or nonrecursive invocations of other machines. The states in each machine can be of four types: {start, stop, action, choice}. Each machine has a single distinguished start state and may have one or more distinguished stop states. When a machine is invoked, control starts at the start state; stop states return control back to the calling machine. An action state executes an action. A call state invokes another machine as a subroutine. A choice state may have several possible next states; after learning, the choice is reduced to a single next state. \n\n3 Programmable HAMs \n\nConsider the problem of creating a HAM program for the Deliver-Patrol domain presented in Figure 1, which has 38,400 states. In addition to delivering mail and picking up occasional additional rewards while patrolling (both of which require efficient navigation and safe maneuvering), the robot must keep its battery charged (lest it be stranded) and its camera lens clean (lest it crash). It must also decide whether to move quickly (incurring collision risk) or slowly (delaying reward), depending on circumstances. \n\nBecause all the 5 × 5 \"rooms\" are similar, one can write a \"traverse the room\" HAM routine that works in all rooms, but a different routine is needed for each direction (north-south, south-north, east-west, etc.). Such redundancy suggests the need for a \"traverse the room\" routine that is parameterized by the desired direction. \n\nConsider also the fact that the robot should clean its camera lens whenever it gets dirty. \n\nFigure 1: (a) The Deliver-Patrol world. Mail appears at M and must be delivered to the appropriate location. Additional rewards appear sporadically at A, B, C, and D. The robot's battery may be recharged at R. The robot is penalized for colliding with walls and \"furniture\" (small circles). (b) Three of the PHAMs in the partial specification for the Deliver-Patrol world. Right-facing half-circles are start states, left-facing half-circles are stop states, hexagons are call states, ovals are primitive actions, and squares are choice points. z1 and z2 are memory variables. When arguments to call states are in braces, then the choice is over the arguments to pass to the subroutine. The Root() PHAM specifies an interrupt to clean the camera lens whenever it gets dirty; the Work() PHAM interrupts its patrolling whenever there is mail to be delivered. \n\nFigure 2: (a) A room in the Deliver-Patrol domain. The arrows in the drawing of the room indicate the behavior specified by the p() transition function in ToDoor(dest,sp). Two arrows indicate a \"fast\" move (jN, jS, jE, jW), whereas a single arrow indicates a slow move (N, S, E, W). (b) The ToDoor(dest,sp) and Move(dir) PHAMs. \n\nFigure 3: The remainder of the PHAMs for the Deliver-Patrol domain. Nav(dest,sp) leaves route choices to be learned through experience. Similarly, Patrol() does not specify the sequence of locations to check. \n\nIn the HAM language, this conditional action must be inserted after every state in every HAM. 
An interrupt mechanism with appropriate scoping would obviate the need for such widespread mutilation. \n\nThe PHAM language provides exactly these additional capabilities. We give here an informal summary of the language features that enable concise agent programs to be written. The 9 PHAMs for the Deliver-Patrol domain are presented in Figure 1(b), Figure 2(b), and Figure 3. The corresponding HAM program requires 63 machines, many of which have significantly more states than their PHAM counterparts. \n\nThe PHAM language adds several structured programming constructs to the HAM language. To enable this, we introduce two additional types of states in the PHAM: internal states, which execute an internal computational action (such as setting memory variables to a function of the current state), and null states, which have no direct effect and are used for computational convenience. \n\nParameterization is key for expressing concise agent specifications, as can be seen in the Deliver-Patrol task. Subroutines take a number of parameters, θ1, θ2, ..., θk, the values of which must be filled in by the calling subroutine (and can depend on any function of the machine, parameter, memory, and environment state). In Figure 2(b), the subroutine Move(dir) is shown. The dir parameter is supplied by the NavRoom subroutine. The ToDoor(dest,speed) subroutine is for navigating a single room of the agent's building. The p() is a transition function that stores a parameterized policy for getting to each door. The policy for (N, J) (representing the North door, going fast) is shown in Figure 2(a). Note that by using parameters, the control for navigating a room is quite modular, and is written once, instead of once for each direction and speed. \n\nAborts and interrupts allow for modular agent specification. 
As well as the camera-lens interrupt described earlier, the robot needs to abort its current activity if the battery is low and should interrupt its patrolling activity if mail arrives for delivery. The PHAM language allows abort conditions to be specified at the point where a subroutine is invoked within a calling routine; those conditions are in force until the subroutine exits. For each abort condition, an \"abort handler\" state is specified within the calling routine, to which control returns if the condition becomes true. (For interrupts, normal execution is resumed once the handler completes.) Graphically, aborts are depicted as labelled dotted lines (e.g., in the DoAll() PHAM in Figure 3), and interrupts are shown as labelled dashed lines with arrows on both ends (e.g., in the Work() PHAM in Figure 1(b)). \n\nMemory variables are a feature of nearly every programming language. Some previous research has been done on using memory variables in reinforcement learning in partially observable domains [10]. For an example of memory use in our language, examine the DoDelivery subroutine in Figure 1(b), where z2 is set to another memory value (set in Nav(dest,sp)). z2 is then passed as a variable to the Nav subroutine. Computational functions such as dest in the Nav(dest,sp) subroutine are restricted to be recursive functions taking effectively zero time. A PHAM is assumed to have a finite number of memory variables, z1, ..., zn, which can be combined to yield the memory state, Z. Each memory variable has a finite domain D(zi). The agent can set memory variables by using internal states, which are computational action states with actions of the form (set z1 ψ(m, θ, x, z)), where ψ(m, θ, x, z) is a function taking the machine, parameter, environment, and memory state as parameters. 
The transition function, parameter-setting functions, and choice functions take the memory state into account as well. \n\n4 Theoretical Results \nOur results mirror those obtained in [9]. In summary (see also Figure 4): The composition H ∘ M of a PHAM H with the underlying MDP M is defined using the cross product of states in H and M. This composition is in fact an SMDP. Furthermore, solutions to H ∘ M yield optimal policies for the original MDP, among those policies expressed by the PHAM. Finally, H ∘ M may be reduced to an equivalent SMDP whose states are just the choice points, i.e., the joint states where the machine state is a choice state. See [1] for the proofs. \n\nDefinition 1 (Programmable Hierarchical Abstract Machines: PHAMs) A PHAM is a tuple H = (μ, Θ, δ, ρ, Σ, I, μI, A, μA, Z, Ψ), where μ is the set of machine states in H, Θ is the space of possible parameter settings, δ is the transition function, mapping μ × Θ × Z × X × μ to [0, 1], ρ is a mapping from μ × Θ × Z × X × Θ to [0, 1] and expresses the parameter choice function, Σ maps from μ × Θ × Z × X to subsets of μ and expresses the allowed choices at choice states, I(m) returns the interrupt condition at a call state, μI(m) specifies the handler of an interrupt, A(m) returns the abort condition at a call state, μA(m) specifies the handler of an abort, Z is the set of possible memory configurations, and Ψ(m) is a complex function expressing which computational internal function is used at internal states, and to which memory variable the result is assigned. \n\nTheorem 1 For any MDP M and any PHAM H, the operation of H in M induces a joint SMDP, called H ∘ M. If π is an optimal solution for H ∘ M, then the primitive actions specified by π constitute an optimal policy for M among those consistent with H. \n\nThe state space of H ∘ M may be enormous. 
As is illustrated in Figure 4, however, we can obtain significant further savings, just as in [9]. First, not all pairs of PHAM and MDP states will be reachable from the initial state; second, the complexity of the induced SMDP is solely determined by the number of reachable choice points. \n\nTheorem 2 For any MDP M and PHAM H, let C be the set of choice points in H ∘ M. There exists an SMDP called reduce(H ∘ M) with states C such that the optimal policy for reduce(H ∘ M) corresponds to an optimal policy for M among those consistent with H. \n\nThe reduced SMDP can be solved by offline, model-based techniques using the method given in [9] for constructing the reduced model. Alternatively, and much more simply, we can solve it using online, model-free HAMQ-learning [8], which learns directly in the reduced state space of choice points. Starting from a choice state ω where the agent takes action a, the agent keeps track of the reward rtot and discount βtot accumulated on the way to the next choice point, ω'. On each step, the agent encounters reward ri and discount βi (note that βi is 1 exactly when the agent transitions only in the PHAM and not in the MDP, since no environment time elapses on such steps), and updates the totals as follows: \n\nrtot ← rtot + βtot ri;  βtot ← βtot βi . \n\nThe agent maintains a Q-table, Q(ω, a), indexed by choice state and action. When the agent gets to the next choice state, ω', it updates the Q-table as follows: \n\nQ(ω, a) ← (1 − α) Q(ω, a) + α [rtot + βtot max_u Q(ω', u)] . \n\nWe have the following theorem. \n\nTheorem 3 For a PHAM H and an MDP M, HAMQ-learning will converge to an optimal policy for reduce(H ∘ M), with probability 1, with appropriate restrictions on the learning rate. \n\n5 Expressiveness of the PHAM language \nAs shown by Parr [9], the HAM language is at least as expressive as some existing action languages, including options [12] and full-β models [11]. 
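The HAMQ-learning bookkeeping described in Section 4 can be sketched in a few lines (this is our illustrative code, not the authors' implementation; all names are hypothetical): rewards and discounts are accumulated step by step between choice points, and the Q-table is updated only on arrival at the next choice point.

```python
# Sketch of one HAMQ-learning update between two choice points.
# steps: list of (r_i, beta_i) pairs encountered after taking action a
# in choice state w and before reaching choice state w2.

def hamq_update(Q, w, a, steps, w2, actions, alpha):
    r_tot, b_tot = 0.0, 1.0
    for r_i, beta_i in steps:
        r_tot += b_tot * r_i   # r_tot <- r_tot + b_tot * r_i
        b_tot *= beta_i        # b_tot <- b_tot * beta_i
    best_next = max(Q[(w2, u)] for u in actions)
    Q[(w, a)] = (1 - alpha) * Q[(w, a)] + alpha * (r_tot + b_tot * best_next)

# Tiny example: two choice states, two actions, zero-initialized Q-table.
actions = ["a0", "a1"]
Q = {(w, u): 0.0 for w in ["w", "w2"] for u in actions}
hamq_update(Q, "w", "a0", [(1.0, 0.5), (2.0, 0.5)], "w2", actions, alpha=1.0)
# r_tot = 1 + 0.5*2 = 2 and b_tot = 0.25, so Q[("w", "a0")] becomes 2.0.
```

Because only choice points index the Q-table, the table's size is governed by the reduced SMDP of Theorem 2 rather than by the full joint state space.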
The PHAM language is substantially more expressive than HAMs. As mentioned earlier, the Deliver-Patrol PHAM program has 9 machines whereas the HAM program requires 63. In general, the additional number of states required to express a PHAM as a pure HAM is |D(Z) × C × Θ|, where D(Z) is the memory state space, C is the set of possible abort/interrupt contexts, and Θ is the total parameter space. We also developed a PHAM program for the 3,700-state maze world used by Parr and Russell [8]. The HAM used in their experiments had 37 machines; the PHAM program requires only 7. \n\nFigure 4: A schematic illustration of the formal results. (1) The top two diagrams are of a PHAM fragment with 1 choice state and 3 action states (of which one, labelled d, is the start state). The MDP has 4 states, and action d always leads to state 1 or 4. The composition, H ∘ M, is shown in (2). Note that there are no incoming arcs to the states <c, 2> or <c, 3>. In (3), reduce(H ∘ M) is shown. There are only 2 states in the reduced SMDP because there are no incoming arcs to the states <c, 2> or <c, 3>. \n\nFigure 5: Learning curves for the Deliver-Patrol domain, averaged over 25 runs. X-axis: number of primitive steps. Y-axis: value of the policy measured by ten 5,000-step trials. PHAM-hard refers to the PHAMs given in this paper. PHAM-easy refers to a more complete PHAM, leaving unspecified only the speed of travel for each activity. \n\nWith respect to the induced choice points, the Deliver-Patrol PHAM induces 7,816 choice points in the joint SMDP, compared with 38,400 in the original MDP. 
Furthermore, only 15,800 Q-values must be learned, compared with 307,200 for flat Q-learning. Figure 5 shows empirical results for the Deliver-Patrol problem, indicating that Q-learning with a suitable PHAM program is far faster than flat Q-learning. (Parr and Russell observed similar results for the maze world, where HAMQ-learning finds a good policy in 270,000 iterations compared to 9,000,000 for flat Q-learning.) Note that equivalent HAM and PHAM programs yield identical reductions in the number of choice points and identical speedups in Q-learning. Thus, one might argue that PHAMs do not offer any advantage over HAMs, as they can express the same set of behaviors. However, this would be akin to arguing that the Java programming language offers nothing over Boolean circuits. Ease of expression and the ability to utilize greater modularity can greatly ease the task of coding reinforcement learning agents that take advantage of prior knowledge. \n\nAn interesting feature of PHAMs was observed in the Deliver-Patrol domain. The initial PHAM program was constructed on the assumption that the agent should patrol among A, B, C, D unless there is mail to be delivered. However, the specific rewards are such that the optimal behavior is to loiter in the mail room until mail arrives, thereby avoiding costly delays in mail delivery. The PHAM-Q learning agents learned this optimal behavior by \"retargeting\" the Nav routine to stay in the mail room rather than go to the specified destination. This example demonstrates the difference between constraining behavior through structure and constraining behavior through subgoals: the former method may give the agent greater flexibility but may yield \"surprising\" results. In another experiment, we constrained the PHAM further to prevent loitering. 
As expected, the agent learned a suboptimal policy in which Nav had the intended meaning of travelling to a specified destination. This experience suggests a natural debugging cycle in which the agent designer may examine learned behaviors and adjust the PHAM program accordingly. \n\nThe additional features of the PHAM language allow direct expression of programs from other formalisms that are not easily expressed using HAMs. For example, programs in Dietterich's MAXQ language [4] are written easily as PHAMs, but not as HAMs, because the MAXQ language allows parameters. The language of teleo-reactive (TR) programs [7, 2] relies on a prioritized set of condition-action rules to achieve a goal. Each action can itself be another TR program. The TR architecture can be implemented directly in PHAMs using the abort mechanism [1]. \n\n6 Future work \nOur long-term goal in this project is to enable true cross-task learning of skilled behavior. This requires state abstraction in order to learn choices within PHAMs that are applicable in large classes of circumstances rather than just to each invocation instance separately. Dietterich [4] has derived conditions under which state abstraction can be done within his MAXQ framework without sacrificing recursive optimality (a weaker form of optimality than hierarchical optimality). We have developed a similar set of conditions, based on a new form of value function decomposition, such that PHAM learning maintains hierarchical optimality. This decomposition critically depends on the modularity of the programs introduced by the language extensions presented in this paper. \n\nRecently, we have added recursion and complex data structures to the PHAM language, incorporating it into a standard programming language (Lisp). This provides the PHAM programmer with a very powerful set of tools for creating adaptive agents. \n\nReferences \n[1] D. Andre. Programmable HAMs. 
www.cs.berkeley.edu/~dandre/pham.ps, 2000. \n[2] S. Benson and N. Nilsson. Reacting, planning and learning in an autonomous agent. In K. Furukawa, D. Michie, and S. Muggleton, editors, Machine Intelligence 14, 1995. \n[3] G. Berry and G. Gonthier. The Esterel synchronous programming language: Design, semantics, implementation. Science of Computer Programming, 19(2):87-152, 1992. \n[4] T. G. Dietterich. State abstraction in MAXQ hierarchical RL. In NIPS 12, 2000. \n[5] R. J. Firby. Modularity issues in reactive planning. In AIPS 96, pages 78-85. AAAI Press, 1996. \n[6] L. P. Kaelbling, M. L. Littman, and A. W. Moore. Reinforcement learning: A survey. JAIR, 4:237-285, 1996. \n[7] N. J. Nilsson. Teleo-reactive programs for agent control. JAIR, 1:139-158, 1994. \n[8] R. Parr and S. J. Russell. Reinforcement learning with hierarchies of machines. In NIPS 10, 1998. \n[9] R. Parr. Hierarchical Control and Learning for MDPs. PhD thesis, UC Berkeley, 1998. \n[10] L. Peshkin, N. Meuleau, and L. Kaelbling. Learning policies with external memory. In ICML, 1999. \n[11] R. Sutton. Temporal abstraction in reinforcement learning. In ICML, 1995. \n[12] R. Sutton, D. Precup, and S. Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1):181-211, February 1999. \n", "award": [], "sourceid": 1936, "authors": [{"given_name": "David", "family_name": "Andre", "institution": null}, {"given_name": "Stuart", "family_name": "Russell", "institution": null}]}