Nuttapong Chentanez, Andrew Barto, Satinder Singh
Psychologists call behavior intrinsically motivated when it is engaged in for its own sake rather than as a step toward solving a specific problem of clear practical value. But what we learn during intrinsically motivated behavior is essential for our development as competent autonomous en- tities able to efficiently solve a wide range of practical problems as they arise. In this paper we present initial results from a computational study of intrinsically motivated reinforcement learning aimed at allowing arti- ficial agents to construct and extend hierarchies of reusable skills that are needed for competent autonomy.
Psychologists distinguish between extrinsic motivation, which means being moved to do something because of some specific rewarding outcome, and intrinsic motivation, which refers to being moved to do something because it is inherently enjoyable. Intrinsic motiva- tion leads organisms to engage in exploration, play, and other behavior driven by curiosity in the absence of explicit reward. These activities favor the development of broad com- petence rather than being directed to more externally-directed goals (e.g., ref. ). In contrast, machine learning algorithms are typically applied to single problems and so do not cope flexibly with new problems as they arise over extended periods of time.
Although the acquisition of competence may not be driven by specific problems, this com- petence is routinely enlisted to solve many different specific problems over the agent's lifetime. The skills making up general competence act as the "building blocks" out of which an agent can form solutions to new problems as they arise. Instead of facing each new challenge by trying to create a solution out of low-level primitives, it can focus on combining and adjusting its higher-level skills. In animals, this greatly increases the effi- ciency of learning to solve new problems, and our main objective is to achieve a similar efficiency in our machine learning algorithms and architectures.
This paper presents an elaboration of the reinforcement learning (RL) framework  that encompasses the autonomous development of skill hierarchies through intrinsically mo- tivated reinforcement learning. We illustrate its ability to allow an agent to learn broad competence in a simple "playroom" environment. In a related paper , we provide more extensive background for this approach, whereas here the focus is more on algorithmic details.
Lack of space prevents us from providing a comprehensive background to the many ideas to which our approach is connected. Many researchers have argued for this kind of devel-
opmental approach in which an agent undergoes an extended developmental period dur- ing which collections of reusable skills are autonomously learned that will be useful for a wide range of later challenges (e.g., [4, 13]). The previous machine learning research most closely related is that of Schmidhuber (e.g., ) on confidence-based curiosity and the ideas of exploration and shaping bonuses [6, 10], although our definition of intrinsic re- ward differs from these. The most direct inspiration behind the experiment reported in this paper, comes from neuroscience. The neuromodulator dopamine has long been associated with reward learning . Recent studies [2, 3] have focused on the idea that dopamine not only plays a critical role in the extrinsic motivational control of behaviors aimed at harvest- ing explicit rewards, but also in the intrinsic motivational control of behaviors associated with novelty and exploration. For instance, salient, novel sensory stimuli inspire the same sort of phasic activity of dopamine cells as unpredicted rewards. However, this activation extinguishes more or less quickly as the stimuli become familiar. This may underlie the fact that novelty itself has rewarding characteristics . These connections are key components of our approach to intrinsically motivated RL.
2 Reinforcement Learning of Skills
According to the "standard" view of RL (e.g., ) the agent-environment interaction is envisioned as the interaction between a controller (the agent) and the controlled system (the environment), with a specialized reward signal coming from a "critic" in the environment that evaluates (usually with a scalar reward value) the agent's behavior (Fig. 1A). The agent learns to improve its skill in controlling the environment in the sense of learning how to increase the total amount of reward it receives over time from the critic.
External Environment Environment Actions Sensations Critic Internal Environment Critic Rewards Actions States Rewards Decisions States Agent Agent "Organism" A B Figure 1: Agent-Environment Interaction in RL. A: The usual view. B: An elaboration.
Sutton and Barto  point out that one should not identify this RL agent with an entire animal or robot. An an animal's reward signals are determined by processes within its brain that monitor not only external state but also the animal's internal state. The critic is in an animal's head. Fig. 1B makes this more explicit by "factoring" the environment of Fig. 1A into an external environment and an internal environment, the latter of which contains the critic which determines primary reward. This scheme still includes cases in which reward is essentially an external stimulus (e.g., a pat on the head or a word of praise). These are simply stimuli transduced by the internal environment so as to generate the appropriate level of primary reward.
The usual practice in applying RL algorithms is to formulate the problem one wants the agent to learn how to solve (e.g., win at backgammon) and define a reward function spe- cially tailored for this problem (e.g., reward = 1 on a win, reward = 0 on a loss). Sometimes considerable ingenuity is required to craft an appropriate reward function. The point of departure for our approach is to note that the internal environment contains, among other things, the organism's motivational system, which needs to be a sophisticated system that
should not have to be redesigned for different problems. Handcrafting a different special- purpose motivational system (as in the usual RL practice) should be largely unnecessary.
Skills--Autonomous mental development should result in a collection of reusable skills. But what do we mean by a skill? Our approach to skills builds on the theory of options . Briefly, an option is something like a subroutine. It consists of 1) an option policy that directs the agent's behavior for a subset of the environment states, 2) an initiation set con- sisting of all the states in which the option can be initiated, and 3) a termination condition, which specifies the conditions under which the option terminates. It is important to note that an option is not a sequence of actions; it is a closed-loop control rule, meaning that it is responsive to on-going state changes. Furthermore, because options can invoke other options as actions, hierarchical skills and algorithms for learning them naturally emerge from the conception of skills as options. Theoretically, when options are added to the set of admissible agent actions, the usual Markov decision process (MDP) formulation of RL extends to semi-Markov decision processes (SMDPs), with the one-step actions now be- coming the "primitive actions." All of the theory and algorithms applicable to SMDPs can be appropriated for decision making and learning with options .
Two components of the the options framework are especially important for our approach: 1. Option Models: An option model is a probabilistic description of the effects of executing an option. As a function of an environment state where the option is initiated, it gives the probability with which the option will terminate at any other state, and it gives the total amount of reward expected over the option's execution. Option models can be learned from experience (usually only approximately) using standard methods. Option models allow stochastic planning methods to be extended to handle planning at higher levels of abstraction. 2. Intra-option Learning Methods: These methods allow the policies of many options to be updated simultaneously during an agent's interaction with the environment. If an option could have produced a primitive action in a given state, its policy can be updated on the basis of the observed consequences even though it was not directing the agent's behavior at the time.
In most of the work with options, the set of options must be provided by the system designer. While an option's policy can be improved through learning, each option has to be prede- fined by providing its initiation set, termination condition, and the reward function that evaluates its performance. Many researchers have recognized the desirability of automati- cally creating options, and several approaches have recently been proposed (e.g., ). For the most part, these methods extract options from the learning system's attempts to solve a particular problem, whereas our approach creates options outside of the context of solving any particular problem.
Developing Hierarchical Collections of Skills--Children accumulate skills while they engage in intrinsically motivated behavior, e.g., while at play. When they notice that some- thing they can do reliably results in an interesting consequence, they remember this in a form that will allow them to bring this consequence about if they wish to do so at a future time when they think it might contribute to a specific goal. Moreover, they improve the efficiency with which they bring about this interesting consequence with repetition, before they become bored and move on to something else. We claim that the concepts of an option and an option model are exactly appropriate to model this type of behavior. Indeed, one of our main contributions is a (preliminary) demonstration of this claim.
3 Intrinsically Motivated RL
Our main departure from the usual application of RL is that our agent maintains a knowl- edge base of skills that it learns using intrinsic rewards. In most other regards, our ex- tended RL framework is based on putting together learning and planning algorithms for
Loop forever Current state st, current primitive action at, current option ot, extrinsic reward ret, intrinsic reward rit
Obtain next state st+1 //-- Deal with special case if next state is salient If st+1 is a salient event e If option for e, oe, does not exist in O (skill-KB) Create option oe in skill-KB; Add st to Ioe // initialize initiation set Set oe (st+1) = 1 // set termination probability //-- set intrinsic reward value ri = [1 - P oe (s t+1 t+1|st)] // is a constant multiplier else ri = 0 t+1 //-- Update all option models For each option o = oe in skill-KB (O) If st+1 Io, then add st to Io // grow initiation set If at is greedy action for o in state st //-- update option transition probability model P o(x|st) [(1 - o(st+1)P o(x|st+1) + o(st+1)st+1x] //-- update option reward model Ro(st) [re + (1 - o(s t t+1))Ro(st+1)] //-- Q-learning update of behavior action-value function QB(st, at) [re + ri + max t t aAO QB (st+1, a)] //-- SMDP-planning update of behavior action-value function For each option o in skill-KB QB(st, o) [Ro(st) + P o(x|s xS t) maxaAO QB (x, a)] //-- Update option action-value functions For each option o O such that st Io Qo(st, at) [re + (o(s t t+1) terminal value for option o) +(1 - o(st+1)) maxaAO Qo(st+1, a)] For each option o O such that st Io and o = o Qo(st, o ) Ro (st) + P o (x|s xS t)[o(x) terminal val for option o +((1 - o(x)) maxaAO Qo(x, a))] Choose at+1 using -greedy policy w.r.to QB // -- Choose next action //-- Determine next extrinsic reward Set ret+1 to the extrinsic reward for transition st, at st+1 Set st st+1; at at+1; re re ri t t+1; rit t+1
Figure 2: Learning Algorithm. Extrinsic reward is denoted re while intrinsic reward is denoted ri. Equations of the form x [y] are short for x (1-)x+[y]. The behavior action value function QB is updated using a combination of Q-learning and SMDP planning. Throughout is a discount factor and is the step-size. The option action value functions Qo are updated using intra-option Q-learning. Note that the intrinsic reward is only used in updating QB and not any of the Qo.
options . Behavior The agent behaves in its environment according to an -greedy policy with re- spect to an action-value function QB that is learned using a mix of Q-learning and SMDP planning as described in Fig. 2. Initially only the primitive actions are available to the agent. Over time, skills represented internally as options and their models also become available to the agent as action choices. Thus, QB maps states s and actions a (both primitive and options) to the expected long-term utility of taking that action a in state s. Salient Events In our current implementation we assume that the agent has intrinsic or hardwired notions of interesting or "salient" events in its environment. For example, in the playroom environment we present shortly, the agent finds changes in light and sound intensity to be salient. These are intended to be independent of any specific task and likely to be applicable to many environments. Reward In addition to the usual extrinsic rewards there are occasional intrinsic rewards generated by the agent's critic (see Fig. 1B). In this implementation, the agent's intrinsic reward is generated in a way suggested by the novelty response of dopamine neurons. The intrinsic reward for each salient event is proportional to the error in the prediction of the salient event according to the learned option model for that event (see Fig. 2 for detail). Skill-KB The agent maintains a knowledge base of skills that it has learned in its environ- ment. Initially this may be empty. The first time a salient event occurs, say light turned on, structures to learn an option that achieves that salient event (turn-light-on option) are created in the skill-KB. In addition, structures to learn an option model are also created. So for option o, Qo maps states s and actions a (again, both primitive and options) to the long-term utility of taking action a in state s. The option for a salient event terminates with probability one in any state that achieves that event and never terminates in any other state. The initiation set, Io, for an option o is incrementally expanded to includes states that lead to states in the current initiation set. Learning The details of the learning algorithm are presented in Fig. 2.
4 Playroom Domain: Empirical Results
We implemented intrinsically motivated RL (of Fig. 2) in a simple artificial "playroom" domain shown in Fig. 3A. In the playroom are a number of objects: a light switch, a ball, a bell, two movable blocks that are also buttons for turning music on and off, as well as a toy monkey that can make sounds. The agent has an eye, a hand, and a visual marker (seen as a cross hair in the figure). The agent's sensors tell it what objects (if any) are under the eye, hand and marker. At any time step, the agent has the following actions available to it: 1) move eye to hand, 2) move eye to marker, 3) move eye one step north, south, east or west, 4) move eye to random object, 5) move hand to eye, and 6) move marker to eye. In addition, if both the eye and and hand are on some object, then natural operations suggested by the object become available, e.g., if both the hand and the eye are on the light switch, then the action of flicking the light switch becomes available, and if both the hand and eye are on the ball, then the action of kicking the ball becomes available (which when pushed, moves in a straight line to the marker).
The objects in the playroom all have potentially interesting characteristics. The bell rings once and moves to a random adjacent square if the ball is kicked into it. The light switch controls the lighting in the room. The colors of any of the blocks in the room are only visible if the light is on, otherwise they appear similarly gray. The blue block if pressed turns music on, while the red block if pressed turns music off. Either block can be pushed and as a result moves to a random adjacent square. The toy monkey makes frightened sounds if simultaneously the room is dark and the music is on and the bell is rung. These objects were designed to have varying degrees of difficulty to engage. For example, to get the monkey to cry out requires the agent to do the following sequence of actions: 1) get its eye to the light switch, 2) move hand to eye, 3) push the light switch to turn the light on, 4) find the blue block with its eye, 5) move the hand to the eye, 6) press the blue block to turn
music on, 7) find the light switch with its eye, 8) move hand to eye, 9) press light switch to turn light off, 10) find the bell with its eye, 11) move the marker to the eye, 12) find the ball with its eye, 13) move its hand to the ball, and 14) kick the ball to make the bell ring. Notice that if the agent has already learned how to turn the light on and off, how to turn music on, and how to make the bell ring, then those learned skills would be of obvious use in simplifying this process of engaging the toy monkey.
A B C Performance of Learned Options Effect of Intrinsically Motivated Learning 10000 120
100 8000 80 Sound On Extrinsic Reward Only 6000 60 Light On Music On 4000 40 Toy Monkey On Intrinsic & Extrinsic Rewards 20 2000 0 Average # of Actions to Salient Event 0 0.5 1 1.5 2 2.5 0 Number of steps between extrinsic rewards 0 100 200 300 400 500 600 7 Number of Actions x 10 Number of extrinsic rewards
Figure 3: A. Playroom domain. B. Speed of learning of various skills. C. The effect of intrinsically motivated learning when extrinsic reward is present. See text for details
For this simple example, changes in light and sound intensity are considered salient by the playroom agent. Because the initial action value function, QB, is uninformative, the agent starts by exploring its environment randomly. Each first encounter with a salient event initiates the learning of an option and an option model for that salient event. For example, the first time the agent happens to turn the light on, it initiates the data structures necessary for learning and storing the light-on option. As the agent moves around the environment, all the options (initiated so far) and their models are simultaneously updated using intra-option learning.
As shown in Fig. 2, the intrinsic reward is used to update QB. As a result, when the agent encounters an unpredicted salient event a few times, its updated action value function drives it to repeatedly attempt to achieve that salient event. There are two interesting side effects of this: 1) as the agent tries to repeatedly achieve the salient event, learning improves both its policy for doing so and its option-model that predicts the salient event, and 2) as its option policy and option model improve, the intrinsic reward diminishes and the agent gets "bored" with the associated salient event and moves on. Of course, the option policy and model become accurate in states the agent encounters frequently. Occasionally, the agent encounters the salient event in a state (set of sensor readings) that it has not encountered before, and it generates intrinsic reward again (it is "surprised").
A summary of results is presented in Fig. 4. Each panel of the figure is for a distinct salient event. The graph in each panel shows both the time steps at which the event occurs as well as the intrinsic reward associated by the agent to each occurrence. Each occurrence is denoted by a vertical bar whose height denotes the amount of associated intrinsic reward. Note that as one goes from top to bottom in this figure, the salient events become harder to achieve and, in fact, become more hierarchical. Indeed, the lowest one for turning on the monkey noise (Non) needs light on, music on, light off, sound on in sequence. A number of interesting results can be observed in this figure. First note that the salient events that are simpler to achieve occur earlier in time. For example, Lon (light turning on) and Loff (light turning off) are the simplest salient events, and the agent makes these happen quite early. The agent tries them a large number of times before getting bored and moving on to other salient events. The reward obtained for each of these events diminishes after repeated exposure to the event. Thus, automatically, the skill of achieving the simpler events are learned before those for the more complex events.
Figure 4: Results from the playroom domain. Each panel depicts the occurrences of salient events as well as the associated intrinsic rewards. See text for details.
Of course, the events keep happening despite their diminished capacity to reward because they are needed to achieve the more complex events. Consequently, the agent continues to turn the light on and off even after it has learned this skill because this is a step along the way toward turning on the music, as well as along the way toward turning on the monkey noise. Finally note that the more complex skills are learned relatively quickly once the required sub-skills are in place, as one can see by the few rewards the agent receives for them. The agent is able to bootstrap and build upon the options it has already learned for the simpler events. We confirmed the hierarchical nature of the learned options by inspecting the greedy policies for the more complex options like Non and Noff. The fact that all the options are successfully learned is also seen in Fig. 3B in which we show how long it takes to bring about the events at different points in the agent's experience (there is an upper cutoff of 120 steps). This figure also shows that the simpler skills are learned earlier than the more complex ones.
An agent having a collection of skills learned through intrinsic reward can learn a wide variety of extrinsically rewarded tasks more easily than an agent lacking these skills. To illustrate, we looked at a playroom task in which extrinsic reward was available only if the agent succeeded in making the monkey cry out. This requires the 14 steps described above. This is difficult for an agent to learn if only the extrinsic reward is available, but much easier if the agent can use intrinsic reward to learn a collection of skills, some of which are relevant to the overall task. Fig. 3C compares the performance of two agents in this task. Each starts out with no knowledge of task, but one employs the intrinsic reward mechanism we have discussed above. The extrinsic reward is always available, but only when the monkey cries out. The figure, which shows the average of 100 repetitions of the experiment, clearly shows the advantage of learning with intrinsic reward.
Discussion One of the key aspects of the Playroom example was that intrinsic reward was generated only by unexpected salient events. But this is only one of the simplest
possibilities and has many limitations. It cannot account for what makes many forms of exploration and manipulation "interesting." In the future, we intend to implement compu- tational analogs of other forms of intrinsic motivation as suggested in the psychological, statistical, and neuroscience literatures.
Despite the "toy" nature of this domain, these results are among the most sophisticated we have seen involving intrinsically motivated learning. Moreover, they were achieved quite directly by combining a collection of existing RL algorithms for learning options and option-models with a simple notion of intrinsic reward. The idea of intrinsic motivation for artificial agents is certainly not new, but we hope to have shown that the elaboration of the formal RL framework in the direction we have pursued, together with the use of recently- developed hierarchical RL algorithms, provides a fruitful basis for developing competently autonomous agents.
Acknowledgement Satinder Singh and Nuttapong Chentanez were funded by NSF grant CCF 0432027 and by a grant from DARPA's IPTO program. Andrew Barto was funded by NSF grant CCF 0432143 and by a grant from DARPA's IPTO program.