{"title": "Clustering via Dirichlet Process Mixture Models for Portable Skill Discovery", "book": "Advances in Neural Information Processing Systems", "page_first": 1818, "page_last": 1826, "abstract": "Skill discovery algorithms in reinforcement learning typically identify single states or regions in state space that correspond to task-specific subgoals. However, such methods do not directly address the question of how many distinct skills are appropriate for solving the tasks that the agent faces. This can be highly inefficient when many identified subgoals correspond to the same underlying skill, but are all used individually as skill goals. Furthermore, skills created in this manner are often only transferable to tasks that share identical state spaces, since corresponding subgoals across tasks are not merged into a single skill goal. We show that these problems can be overcome by clustering subgoal data defined in an agent-space and using the resulting clusters as templates for skill termination conditions. Clustering via a Dirichlet process mixture model is used to discover a minimal, sufficient collection of portable skills.", "full_text": "Clustering via Dirichlet Process Mixture Models for\n\nPortable Skill Discovery\n\nScott Niekum\n\nAndrew G. Barto\n\nDepartment of Computer Science\n\nUniversity of Massachusetts Amherst\n{sniekum,barto}@cs.umass.edu\n\nAmherst, MA 01003\n\nAbstract\n\nSkill discovery algorithms in reinforcement learning typically identify single\nstates or regions in state space that correspond to task-speci\ufb01c subgoals. However,\nsuch methods do not directly address the question of how many distinct skills are\nappropriate for solving the tasks that the agent faces. This can be highly inef-\n\ufb01cient when many identi\ufb01ed subgoals correspond to the same underlying skill,\nbut are all used individually as skill goals. 
Furthermore, skills created in this\nmanner are often only transferable to tasks that share identical state spaces, since\ncorresponding subgoals across tasks are not merged into a single skill goal. We\nshow that these problems can be overcome by clustering subgoal data de\ufb01ned in\nan agent-space and using the resulting clusters as templates for skill termination\nconditions. Clustering via a Dirichlet process mixture model is used to discover a\nminimal, suf\ufb01cient collection of portable skills.\n\n1\n\nIntroduction\n\nReinforcement learning (RL) is often used to solve single tasks for which it is tractable to learn a\ngood policy with minimal initial knowledge. However, many real-world problems cannot be solved\nin this fashion, motivating recent research on transfer and hierarchical RL methods that allow knowl-\nedge to be generalized to new problems and encapsulated in modular skills. Although skills have\nbeen shown to improve agent learning performance [2], representational power [10], and adaptation\nto non-stationarity [3], to the best of our knowledge, current methods lack the ability to automatically\ndiscover skills that are transferable to related state spaces and novel tasks, especially in continuous\ndomains.\nSkill discovery algorithms in reinforcement learning typically identify single states or regions in\nstate space that correspond to task-speci\ufb01c subgoals. However, such methods do not directly address\nthe question of how many distinct skills are appropriate for solving the tasks that the agent faces.\nThis can be highly inef\ufb01cient when many identi\ufb01ed subgoals correspond to the same underlying\nskill, but are all used individually as skill goals. For example, opening a door ought to be the same\nskill whether an agent is one inch or two inches away from the door, or whether the door is red\nor blue; making each possible con\ufb01guration a separate skill would be unwise. 
Furthermore, skills created in this manner are often only transferable to tasks that share identical state spaces, since corresponding subgoals across tasks are not merged into a single skill goal.
We show that these problems can be overcome by collecting subgoal data from a series of tasks and clustering it in an agent-space [9], a shared feature space across multiple tasks. The resulting clusters generalize subgoals within and across tasks and can be used as templates for portable skill termination conditions. Clustering also allows the creation of skill termination conditions in a data-driven way that makes minimal assumptions and can be tailored to the domain through a careful choice of clustering algorithm. Additionally, this framework extends the utility of single-state subgoal discovery algorithms to continuous domains, in which the agent may never see the same state twice. We argue that clustering based on a Dirichlet process mixture model is appropriate in the general case when little is known about the nature or number of skills needed in a domain. 
Experiments in a continuous domain demonstrate the utility of this approach and illustrate how it may be useful even when traditional subgoal discovery methods are infeasible.

2 Background and Related Work

2.1 Reinforcement learning

The RL paradigm [20] usually models a problem faced by the agent as a Markov decision process (MDP), expressed as M = ⟨S, A, P, R⟩, where S is the set of environment states the agent can observe, A is the set of actions that the agent can execute, P(s, a, s′) is the probability that the environment transitions to s′ ∈ S when action a ∈ A is taken in state s ∈ S, and R(s, a, s′) is the expected scalar reward given to the agent when the environment transitions to state s′ from s after the agent takes action a.

2.2 Options

The options framework [19] models skills as temporally extended actions that can be invoked like primitive actions. An option o consists of an option policy πo : S × A → [0, 1], giving the probability of taking action a in state s, an initiation set Io ⊆ S, giving the set of states from which the option can be invoked, and a termination condition βo : S → [0, 1], giving the probability that option execution will terminate upon reaching state s. In this paper, termination conditions are binary, so that we can define a termination set of states, To ⊆ S, in which option execution always terminates.

2.3 Agent-spaces

To facilitate option transfer across multiple tasks, Konidaris and Barto [9] propose separating problems into two representations. The first is a problem-space representation, which is Markov for the current task being faced by the agent, but may change across tasks; this is the typical formulation of a problem in RL. The second is an agent-space representation, which is identical across all tasks to be faced by the agent, but may not be Markov for any particular task. 
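As a concrete illustration of the two ideas above, the sketch below models an option whose binary termination condition is a predicate over agent-space observations. This is a hypothetical minimal container, not the paper's implementation; all names are illustrative.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

Obs = Sequence[float]  # an agent-space observation vector

# Minimal option container in the style of the options framework:
# a policy pi_o, an initiation test I_o, and a binary termination
# condition beta_o induced by a termination-set predicate T_o.
@dataclass
class Option:
    pi: Callable[[Obs], Sequence[float]]        # action probabilities given an observation
    initiation: Callable[[Obs], bool]           # I_o: may the option be invoked here?
    in_termination_set: Callable[[Obs], bool]   # membership test for T_o

    def beta(self, obs: Obs) -> float:
        # Binary termination: probability 1 inside T_o, 0 elsewhere.
        return 1.0 if self.in_termination_set(obs) else 0.0
```

For example, an option that should terminate within 1 unit of some landmark, with agent-space observation (dx, dy) relative to it, would use `in_termination_set=lambda o: o[0]**2 + o[1]**2 <= 1.0`.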
An agent-space is often a set of agent-centric features, like a robot's sensor readings, that are present and retain semantics across tasks. If the agent represents its top-level policy in a task-specific problem-space but represents its options in an agent-space, the task at hand will always be Markov while allowing the options to transfer between tasks.
Agent-spaces enable the transfer of an option's policy between tasks, but are based on the assumption that this policy was learned under an option termination set that is portable; the termination set must accurately reflect how the goal of the skill varies across tasks. Previous work using agent-spaces has produced portable option policies when the termination sets were hand-coded; our contribution is the automatic discovery of portable termination sets, so that such skills can be acquired autonomously.

2.4 Subgoal discovery and skill creation

The simplest subgoal discovery algorithms analyze reward statistics or state visitation frequencies to discover subgoal states [3]. Graph-based algorithms [18, 11] search for "bottleneck" states on state transition graphs via clustering and other types of analysis. Algorithms based on intrinsic motivation have included novelty metrics [17] and hand-coded salience functions [2]. Skill chaining [10] discovers subgoals by "chaining" together options, in which the termination set of one option is the empirically determined initiation set of the next option in the chain. HASSLE [1] clusters similar regions of state space to identify single-task subgoals. All of these methods compute subgoals that may be inefficient or non-portable if used alone as skill targets, but that can be used as data for our algorithm to find portable options.
Other algorithms analyze tasks to create skills directly, rather than search for subgoals. 
VISA [7] creates skills to control factored state variables in tasks with sparse causal graphs. PolicyBlocks [15] looks for policy similarities that can be used as templates for skills. The SKILLS algorithm [21] attempts to minimize description length of policies while preserving a performance metric. However, these methods only exhibit transfer to identical state spaces and often rely on discrete state representations. Related work has also used clustering to determine which of a set of MDPs an agent is currently facing, but does not address the need for skills within a single MDP [22].

2.5 Dirichlet process mixture models

Many popular clustering algorithms require the number of data clusters to be known a priori or use heuristics to choose an approximate number. By contrast, Dirichlet process mixture models (DPMMs) provide a non-parametric Bayesian framework to describe distributions over mixture models with an infinite number of mixture components. A Dirichlet process (DP), parameterized by a base distribution G0 and a concentration parameter α, is used as a prior over the distribution G of mixture components. For data points X, mixture component parameters θ, and a parameterized distribution F, the DPMM can be written as [13]:

G | α, G0 ∼ DP(α, G0)
θi | G ∼ G
xi | θi ∼ F(θi).

One type of DPMM can be implemented as an infinite Gaussian mixture model (IGMM) in which all parameters are inferred from the data [16]. Gibbs sampling is used to generate samples from the posterior distribution of the IGMM and adaptive rejection sampling [4] is used for the probabilities which are not in a standard form. After a "burn-in" period, unbiased samples from the posterior distribution of the IGMM can be drawn from the Gibbs sampler. 
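For intuition about the DP prior over cluster assignments, it can be simulated via its Chinese-restaurant-process predictive form: each item joins an existing cluster with probability proportional to that cluster's size, or opens a new cluster with probability proportional to α. The sketch below is the standard construction, not the paper's code.

```python
import random

def crp_partition(n, alpha, rng):
    """Sample a partition of n items from the Chinese restaurant process,
    the predictive form of the Dirichlet process prior over cluster
    assignments. Larger alpha makes new clusters more likely."""
    counts = []       # counts[k] = number of items currently in cluster k
    assignments = []  # assignments[i] = cluster index of item i
    for i in range(n):
        # Unnormalized weights: existing clusters by size, a new cluster by alpha.
        weights = counts + [alpha]
        r = rng.random() * (i + alpha)  # total mass = sum(counts) + alpha = i + alpha
        acc = 0.0
        for k, w in enumerate(weights):
            acc += w
            if r <= acc:
                break
        if k == len(counts):
            counts.append(1)  # open a new cluster
        else:
            counts[k] += 1
        assignments.append(k)
    return assignments, counts
```

In the full DPMM, Gibbs sampling combines these CRP-style reassignment probabilities with the data likelihood F(xi | θk) of each candidate cluster.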
A hard clustering can be found by drawing many such samples and using the sample with the highest joint likelihood of the class indicator variables. We use a modified IGMM implementation written by M. Mandel 1.

3 Latent Skill Discovery

To aid thinking about our algorithm, subgoals can be viewed as samples from the termination sets of latent options that are implicitly defined by the distribution of tasks, the chosen subgoal discovery algorithm, and the agent definition. Specifically, we define the latent options as those whose termination sets contain all of the sampled subgoal data and that maximize the expected discounted cumulative reward when used by a particular agent on a distribution of tasks (assuming optimal option policies given the termination sets). When many such maximizing sets exist, we assume that the latent options are one particular set from amongst these choices; for discussion, the particular choice does not matter, but it is important to have a single set.
Therefore, our goal is to recover the termination sets of the latent options from the sampled subgoal data; these can be used to construct a library of options that approximate the latent options and have the following desirable properties:

• Recall: The termination sets of the library options should contain a maximal portion of the termination sets of the latent options.
• Precision: The termination sets of the library options should contain minimal regions that are not in the termination sets of the latent options.
• Separability: The termination set of each library option should be entirely contained within the termination set of some single latent option.
• Minimality: A minimal number of options should be defined, while still meeting the above criteria. 
Ideally, this will be equal to the number of latent options.

Most of these properties are straightforward, but the importance of separability should be emphasized. Imagine an agent that faces a distribution of tasks with several latent options that need to be sequenced in various ways for each task. If a clustering breaks each latent option termination set into two options (minimality is violated, but separability is preserved), some exploration inefficiency may be introduced, but each option will reliably terminate in a skill-appropriate state. However, if a clustering combines the termination sets of two latent options into that of a single library option, the library option becomes unreliable; when the functionality of a single latent option is needed, the combined option may exhibit behavior corresponding to either.

1Source code can be found at http://mr-pc.org/work/

We cannot reason directly about latent options since we do not know what they are a priori, so we must estimate them with respect to the above constraints from sampled subgoal data alone. We assume that subgoal samples corresponding to the same latent option form a contiguous region on some manifold, which is reflected in the problem representation. If they do not, then our method cannot cluster and find skills; we view this as a failing of the representation and not of our methodology.
Under this assumption, clustering of sampled subgoals can be used to approximate latent option termination sets. We propose a method of converting clusters parameterized by Gaussians into termination sets that respect the recall and precision properties. 
Knowing the number of skills a priori or discovering the appropriate number of clusters from the data satisfies the minimality property. Separability is more complicated, but can be satisfied by any method that can handle overlapping clusters without merging them and that is not inherently biased toward a small number of skills. Methods like spectral clustering [14] that rely on point-wise distance metrics cannot easily handle cluster overlap and are unsuitable for this sort of task. In the general case where little is known about the number and nature of the latent options, IGMM-based clustering is an attractive choice, as it can model any number of clusters of arbitrary complexity; when clusters have a complex shape, an IGMM may over-segment the data, but this still produces separable options.

4 Algorithm

We present a general algorithm to discover latent options when using any particular subgoal discovery method and clustering algorithm. Note that some subgoal discovery methods discover state regions, rather than single states; in such cases, sampling techniques or a clustering algorithm such as NPClu [5] that can handle non-point data must be used. We then describe a specific implementation of the general algorithm that is used in our experiments.

4.1 General algorithm

Given an agent A, task distribution τ, subgoal discovery algorithm D, and clustering algorithm C:
1. Compute a set of sample agent-space subgoals X = {x1, x2, ..., xn}, where X = D(A, τ).
2. Cluster the subgoals X into clusters with parameters θ = {θ1, θ2, ..., θk}, where θ = C(X). If the clustering method is parametric, then the elements of θ are cluster parameters; otherwise they are data point assignments to clusters.
3. Define option termination sets T1, T2, ..., Tk, where Ti = M(θi), and M is a mapping from elements of θ to termination set definitions.
4. Instantiate and train options O1, O2, ..., Ok using T1, T2, ..., Tk as termination sets.

4.2 Experimental implementation

We now present an example implementation of the general algorithm that is used in our experiments. So as not to confound error from our clustering method with error introduced by a subgoal discovery algorithm, we use a hand-coded binary salience function; the main contribution of this work is the clustering strategy that enables generalization and transfer, so we are not concerned with the details of any particular subgoal discovery algorithm. This also demonstrates the possible utility of our approach, even when automatic subgoal discovery is inappropriate or infeasible. More details on this are presented in the following sections.
First, a distribution of tasks and an RL agent are defined. We allow the agent to solve tasks drawn from this distribution while collecting subgoal state samples every time the salience function is triggered. This continues until 10,000 subgoal state samples are collected. These points are then clustered using one of two different clustering methods. Gaussian expectation-maximization (E-M), for which we must provide the number of clusters a priori, provides an approximate upper bound on the performance of any clustering method based on a Gaussian mixture model. We compare this to IGMM-based clustering that must discover the number of clusters automatically. E-M is used as a baseline metric to separate error caused by not knowing the number of clusters a priori from error caused by using a Gaussian mixture model. Since E-M can get stuck in local minima, we run it 10 times and choose the clustering with the highest log-likelihood. 
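The restart strategy above can be illustrated with a toy 1-D Gaussian E-M fitter (a minimal sketch, not the paper's actual implementation): run E-M several times from random initializations and keep the fit with the highest log-likelihood.

```python
import numpy as np

def em_gmm_1d(x, k, iters=100, rng=None):
    """Plain E-M for a k-component 1-D Gaussian mixture.
    Returns (component means, log-likelihood of the fitted model)."""
    rng = rng or np.random.default_rng()
    mu = rng.choice(x, size=k, replace=False)  # initialize means from the data
    var = np.full(k, x.var())
    w = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: responsibilities, stabilized by subtracting the row max.
        logp = (-0.5 * (x[:, None] - mu) ** 2 / var
                - 0.5 * np.log(2 * np.pi * var) + np.log(w))
        r = np.exp(logp - logp.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted updates of means, variances, and mixing weights.
        nk = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-9
        w = nk / len(x)
    # Final log-likelihood via log-sum-exp over components.
    logp = (-0.5 * (x[:, None] - mu) ** 2 / var
            - 0.5 * np.log(2 * np.pi * var) + np.log(w))
    m = logp.max(axis=1, keepdims=True)
    ll = float((m[:, 0] + np.log(np.exp(logp - m).sum(axis=1))).sum())
    return mu, ll

def best_of_restarts(x, k, restarts=10, seed=0):
    # E-M is sensitive to initialization, so run several restarts
    # and keep the solution with the highest log-likelihood.
    rng = np.random.default_rng(seed)
    return max((em_gmm_1d(x, k, rng=rng) for _ in range(restarts)),
               key=lambda fit: fit[1])
```

Any standard GMM fitter can stand in for `em_gmm_1d`; only the keep-the-best-log-likelihood selection matters here.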
For the IGMM-based clustering, we let the Gibbs sampler burn in for 10,000 samples and then collect an additional 10,000 samples, from which we choose the sample with the highest joint likelihood of the class indicator variables as defined by Rasmussen [16].
We now must define a mapping function M that maps our clusters to termination sets. Both of our clustering methods return a list of K sets of Gaussian means µ and covariances Σ. We would like to choose a ridge on each Gaussian to be the cluster's termination set boundary; thus, we use Mahalanobis distance from each cluster mean, where

Di^Mahalanobis(x) = √((x − µi)^T Σi^−1 (x − µi)),

and the termination set Ti is defined as:

Ti(x) = 1 if Di^Mahalanobis(x) ≤ εi, and 0 otherwise,

where εi is a threshold. An appropriate value for each εi is found automatically by calculating the maximum Di^Mahalanobis(x) of any of the subgoal state points x assigned to the ith cluster. This makes each εi just large enough so that all the subgoal state data points assigned to the ith cluster are within the εi-Mahalanobis distance of that cluster mean, satisfying both our recall and precision conditions. Note that some states can be contained in multiple termination sets.
Using these termination sets, we create options that are given to the agent for a 100-episode "gestation period", during which the agent can learn option policies using off-policy learning, but cannot invoke the options. 
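A sketch of this mapping M, with the threshold set from the cluster's own assigned points exactly as described (helper names are hypothetical):

```python
import numpy as np

def make_termination_set(points, mu, cov):
    """Build a binary termination test from a Gaussian cluster (mu, cov).

    epsilon is the largest Mahalanobis distance among the cluster's own
    subgoal samples, so every assigned sample terminates (recall) while
    little extra volume is included (precision)."""
    prec = np.linalg.inv(cov)
    def mahalanobis(x):
        diff = np.asarray(x, dtype=float) - mu
        return float(np.sqrt(diff @ prec @ diff))
    eps = max(mahalanobis(p) for p in points)
    def terminates(x):
        return mahalanobis(x) <= eps
    return terminates
```

Because each cluster yields its own test, a state near two overlapping clusters can satisfy several termination sets at once, as noted above.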
After this period, the options can be invoked from any state.\n\n5 Experiments\n\n5.1 Light-Chain domain\n\nWe test the various implementations of our algorithm on a continuous domain similar to the Light-\nworld domain [9], designed to provide intuition about the capabilities of our skill discovery method.\nIn our version, the Light-Chain domain, an agent is placed in a 10\u00d710 room that contains a primary\nbeacon, a secondary beacon, and a goal beacon placed in random locations. If the agent moves\nwithin 1 unit of the primary beacon, the beacon becomes \u201cactivated\u201d for 30 time steps. Similarly,\nif the agent moves within 1 unit of the secondary beacon while the primary beacon is activated, it\nalso becomes activated for 30 time steps. The goal of the task is for the agent to move within 1 unit\nof the goal beacon while the secondary beacon is activated, upon which it receives a reward of 100,\nending the episode. In all other states, the agent receives a reward of \u22121. Additionally, each beacon\nemits a uniquely colored light\u2014either red, green, or blue\u2014that is selected randomly for each task.\nFigure 1 shows two instances of the Light-Chain domain with different beacon locations and light\ncolor assignments.\nThere are four actions available to the agent in every state: move north, south, east, or west. The\nactions are stochastic, moving the agent between 0.9 and 1.1 units (uniformly distributed) in the\nspeci\ufb01ed direction. In the case of an action that would move an agent through a wall, the agent\nsimply moves up to the wall and stops. The problem-space for this domain is 4-dimensional: The\nx-position of the agent, the y-position of the agent, and two boolean variables denoting whether or\nnot the primary and secondary beacons are activated, respectively. The agent-space is 6-dimensional\nand de\ufb01ned by RGB range sensors that the agent is equipped with. 
Three of the sensors describe the north/south distance of the agent from each of the three colored lights (0 if the agent is at the light, positive values for being north of it, and negative values for being south of it). The other three sensors are identical, but measure east/west distance. Since the beacon color associations change with every task, a portable top-level policy cannot be learned in agent space, but portable agent-space options can be learned that reliably direct the agent toward each of the lights.

Figure 1: Two instances of the Light-Chain domain. The numbers 1–3 indicate the primary, secondary, and goal beacons respectively, while color signifies the light color each beacon emits. Notice that both beacon placement and color associations change between tasks.

The agent's salience function is defined as:

salient(t) = 1 if, at time t, a beacon became activated for the first time in this episode, and 0 otherwise.

Our algorithm clusters subgoal state data to create option termination conditions that generalize properly within a task and across tasks. In the Light-Chain domain, there are three latent options, one corresponding to each light color. Generalization within a task requires each option to terminate in any state within a 1 unit radius of its corresponding light color. However, if the agent only sees one task, all such states will be within some small fixed range of the other two lights; a termination set built from such data would not transfer to another task, since the relative positions of the lights would change. Thus, generalization across tasks requires each option to terminate when it is close to the proper light, regardless of the observed positions of the other two lights. When provided with data from many tasks, our algorithm can discover these relationships between agent-space variables and use them to define portable options. 
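For concreteness, the 6-D agent-space observation can be computed as below. This is a hypothetical helper, and the sign convention (y increasing northward, x increasing eastward, so offsets are positive when the agent is north or east of a light) is an assumption.

```python
# Hypothetical helper computing the 6-D agent-space observation for the
# Light-Chain domain: signed north/south and east/west offsets from the
# agent to the red, green, and blue lights.
def agent_space_obs(agent_xy, lights):
    """lights: dict mapping color -> (x, y) beacon position.
    Returns (ns_red, ns_green, ns_blue, ew_red, ew_green, ew_blue)."""
    ax, ay = agent_xy
    ns = [ay - lights[c][1] for c in ("red", "green", "blue")]  # + if north of the light
    ew = [ax - lights[c][0] for c in ("red", "green", "blue")]  # + if east of the light
    return tuple(ns + ew)
```

Because these features keep their meaning as beacon positions and color assignments change, subgoal samples gathered across many tasks live in one shared space and can be clustered together.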
These options can then be used in each task, although in a\ndifferent order for each, based on that task\u2019s color associations with the beacons.\nAlthough we provide a broad subgoal (activate beacons) to the agent through the salience function,\nour algorithm does the work of discovering how many ways there are to accomplish these subgoals\n(three\u2014one for each light color) and how to achieve each of these (get within 1 unit of that light). In\neach instance of the task, it is unknown which light color will correspond to each beacon. Therefore,\nit is not possible to de\ufb01ne a skill that reliably guides the agent to a particular beacon (e.g. the primary\nbeacon) and is portable across tasks. Instead, our algorithm discovers skills to navigate to particular\nlights, leading the agent to beacons by proxy. Note that this number of skills is independent of the\nnumber of beacons; if there were four possible colors of light, but only three beacons, four skills\nwould be created so that the agent could perform well when presented with any three of the four\ncolors in a given task. Similarly, such a setup can be used in other tasks where a broad subgoal is\nknown, but the different means and number of ways of achieving it are unknown a priori.\n\n5.2 Experimental structure\n\nTwo different agent types were used in our experiments: agents with and without options. The\nparameters for each agent type were optimized separately via a grid search. Top-level policies were\nlearned using \u0001-greedy SARSA(\u03bb) (\u03b1 = 0.001, \u03b3 = 0.99, \u03bb = 0.7, \u0001 = 0.1 without options,\n\u03b1 = 0.0005, \u03b3 = 0.99, \u03bb = 0.9, \u0001 = 0.1 with options) and the state-action value function was\nrepresented with a linear function approximator using the third order Fourier basis [8]. 
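The Fourier basis construction referenced here can be sketched as follows: the full basis of order n over states rescaled to [0, 1]^d has one feature cos(π c · x) for every coefficient vector c in {0, ..., n}^d, so a third-order basis in the 4-D problem space yields 4^4 = 256 features. (The "independent" variant used for option policies keeps only coefficient vectors with at most one nonzero entry, scaling linearly rather than exponentially in dimension.)

```python
import itertools
import math

def fourier_basis(order, dim):
    """Full Fourier basis over [0, 1]^dim: one feature
    cos(pi * c . x) for each c in {0, ..., order}^dim."""
    coeffs = list(itertools.product(range(order + 1), repeat=dim))
    def phi(x):
        # Evaluate every basis function at the (rescaled) state x.
        return [math.cos(math.pi * sum(ci * xi for ci, xi in zip(c, x)))
                for c in coeffs]
    return phi
```

A linear value-function approximator is then a dot product of learned weights with `phi(x)`.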
Option policies were learned off-policy (with an option reward of 1000 when in a terminating state), using Q(λ) (α = 0.000025, γ = 0.99, λ = 0.9) and the fifth order independent Fourier basis.
For the agents that discover options, we used the procedure outlined in the previous section to collect subgoal state samples and learn option policies. We compared these agents to an agent with perfect, hand-coded termination sets (each option terminated within 1 unit of a particular light) that followed the same learning procedure, but without the subgoal discovery step. After option policies were learned for 100 episodes, they were frozen and agent performance was measured for 10 episodes in each of 1,000 novel tasks, with a maximum episode length of 5,000 steps and a maximum option execution time of 50 steps. After each task, the top-level policy was reset, but the option policies were kept constant. We compared performance of the agents using options to that of an agent without options, tested under the same conditions. This entire experiment was repeated 10 times.

(a) Proj. onto Green-N/S and Green-E/W (b) Proj. onto Green-N/S and Blue-N/S
Figure 2: IGMM clusterings of 6-dimensional subgoal data projected onto 2 dimensions at a time for visualization.

6 Results

Figure 2 shows an IGMM-based clustering (only 1,000 points shown for readability), in which the original data points are projected onto 2 of the 6 agent-space dimensions at a time for visualization purposes, where cluster assignment is denoted with unique markers. It can be seen that three clusters (the intuitively optimal number) have been found. In 2(a), the data is projected onto the green north/south and green east/west dimensions. A central circular cluster is apparent, containing subgoals triggered by being near the green light. In 2(b), the north/south dimensions of two different light colors are compared. 
Here, there are two long clusters that each have a small variance with\nrespect to one color and a large variance with respect to the other. These \ufb01ndings correspond to our\nintuitive notion of skills in this domain, in which an option should terminate when it is close to a\nparticular light color, regardless of the positions of the other two lights. Note that these clusters actu-\nally overlap in 6 dimensions, not just in the projected view, since the activation radii of the beacons\ncan occasionally overlap, depending on their placement.\nFigure 3(a) compares the cumulative time it takes to solve 10 episodes for agents with no options,\nIGMM options, E-M options (with three clusters), and options with perfect, hand-coded termination\nsets. As expected, in all cases, options provide a signi\ufb01cant learning advantage when facing a novel\ntask. The agent using E-M options performs only slightly worse than the agent using perfect, hand-\ncoded options, showing that clustering effectively discovers options in this domain and that very little\nerror is introduced by using a Gaussian mixture model. Possibly more surprisingly, the agent using\nIGMM options performs equally as well as the agent using E-M options (making the lines dif\ufb01cult\nto distinguish in the graph), demonstrating that estimating the number of clusters automatically is\nfeasible in this domain and introduces negligible error. In fact, the IGMM-based clustering \ufb01nds\nthree clusters in all 10 trials of the experiment.\nFigure 3(b) shows the performance of agents using E-M options where the number of pre-speci\ufb01ed\nclusters varies. As expected, the agent with three options (the intuitively optimal number of skills\nin this domain) performs the best, but the agents using \ufb01ve and six options still retain a signi\ufb01cant\nadvantage over an agent with no options. 
Most notably, when fewer than the optimal number of options are used, the agent actually performs worse than the baseline agent with no options. This confirms our intuition that option separability is more important than minimality. Thus, it seems that E-M may be effective if the designer can come up with a good approximation of the number of latent options, but it is critical to overestimate this number.

(a) Comparative performance of agents (b) E-M with varying numbers of clusters
Figure 3: Agent performance in Light-Chain domain with 95% confidence intervals

7 Discussion and Conclusions

We have demonstrated a general method for clustering agent-space subgoal data to form the termination sets of portable skills in the options framework. This method works in both discrete and continuous domains and can be used with any choice of subgoal discovery and clustering algorithms. Our analysis of the Light-Chain domain suggests that if the number of latent options is approximately known a priori, clustering algorithms like E-M can perform well. However, in the general case, IGMM-based clustering is able to discover an appropriate number of options automatically without sacrificing performance.
The collection and analysis of subgoal state samples can be computationally expensive, but this is a one-time cost. Our method is most relevant when a distribution of tasks is known ahead of time and we can spend computational time up front to improve agent performance on new tasks to be faced later, drawn from the same distribution. 
This can be bene\ufb01cial when an agent will\nhave to face a large number of related tasks, like in DRAM memory access scheduling [6], or\nfor problems where fast learning and adaptation to non-stationarity is critical, such as automatic\nanesthesia administration [12].\nIn domains where traditional subgoal discovery algorithms fail or are too computationally expensive,\nit may be possible to de\ufb01ne a salience function that speci\ufb01es useful subgoals, while still allowing\nthe clustering algorithm to decide how many skills are appropriate. For example, it is desirable to\ncapture the queen in chess, but it may be bene\ufb01cial to have several skills that result in different types\nof board con\ufb01gurations after taking the queen, rather than a single monolithic skill. Such a setup is\nadvantageous when a broad subgoal is known a priori, but the various means and number of ways\nin which the subgoal might be accomplished are unknown, as in our Light-Chain experiment. This\nextends the possibility of skill discovery to a class of domains in which it may have previously been\nintractable.\nAn agent with a library of appropriate portable options ought to be able to learn novel tasks faster\nthan an agent without options. However, as this library grows, the number of available actions actu-\nally increases and agent performance may begin to decline. This counter-intuitive notion, commonly\nknown as the utility problem, reveals a fundamental problem with using skills outside the context of\nhierarchies. For skill discovery to be useful in larger problems, future work will have to address ba-\nsic questions about how to automatically construct appropriate skill hierarchies that allow the agent\nto explore in simpler, more abstract action spaces as it gains more skills and competency.\n\nAcknowledgments\n\nWe would like to thank Philip Thomas and George Konidaris for useful discussions. Scott Niekum\nand Andrew G. 
Barto were supported in part by the AFOSR under grant FA9550-08-1-0418.

[Figure 3 plots: x-axis Episodes (1–10), y-axis Average cumulative steps to goal (over episodes); legends: No options, IGMM term sets, E-M term sets, Perfect term sets (panel a); No options, E-M with 2, 3, 5, and 6 clusters (panel b)]

References

[1] Bram Bakker and Jürgen Schmidhuber. Hierarchical reinforcement learning based on subgoal discovery and subpolicy specialization. In Proc. of the 8th Conference on Intelligent Autonomous Systems, pages 438–445, 2004.

[2] A. G. Barto, S. Singh, and N. Chentanez. Intrinsically motivated learning of hierarchical collections of skills. In Proc. of the International Conference on Developmental Learning, pages 112–119, 2004.

[3] Bruce L. Digney. Learning hierarchical control structures for multiple tasks and changing environments. In Proc. of the 5th Conference on the Simulation of Adaptive Behavior. MIT Press, 1998.

[4] W. R. Gilks and P. Wild. Adaptive rejection sampling for Gibbs sampling. Journal of the Royal Statistical Society, Series C, 41(2):337–348, 1992.

[5] M. Halkidi and M. Vazirgiannis. NPClu: An approach for clustering spatially extended objects. Intell. Data Anal., 12:587–606, December 2008.

[6] Engin Ipek, Onur Mutlu, Jose F. Martinez, and Rich Caruana. Self-optimizing memory controllers: A reinforcement learning approach. In Proc. of the International Symposium on Computer Architecture, pages 39–50, 2008.

[7] Anders Jonsson and Andrew Barto. Causal graph based decomposition of factored MDPs. J. Mach. Learn. Res., 7:2259–2301, December 2006.

[8] G. D. Konidaris, S. Osentoski, and P. S. Thomas. Value function approximation in reinforcement learning using the Fourier basis.
In Proceedings of the Twenty-Fifth Conference on Artificial Intelligence, 2011.

[9] George Konidaris and Andrew G. Barto. Building portable options: Skill transfer in reinforcement learning. In Proc. of the 20th International Joint Conference on Artificial Intelligence, pages 895–900, 2007.

[10] George Konidaris and Andrew G. Barto. Skill discovery in continuous reinforcement learning domains using skill chaining. In Advances in Neural Information Processing Systems 22, pages 1015–1023, 2009.

[11] Amy McGovern and Andrew G. Barto. Automatic discovery of subgoals in reinforcement learning using diverse density. In ICML, pages 361–368, 2001.

[12] Brett Moore, Periklis Panousis, Vivek Kulkarni, Larry Pyeatt, and Anthony Doufas. Reinforcement learning for closed-loop propofol anesthesia: A human volunteer study. In Innovative Applications of Artificial Intelligence, 2010.

[13] R. M. Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9(2):249–265, 2000.

[14] Andrew Y. Ng, Michael I. Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems, pages 849–856. MIT Press, 2001.

[15] Marc Pickett and Andrew G. Barto. PolicyBlocks: An algorithm for creating useful macro-actions in reinforcement learning. In ICML, pages 506–513, 2002.

[16] Carl Edward Rasmussen. The infinite Gaussian mixture model. In Advances in Neural Information Processing Systems 12, pages 554–560. MIT Press, 2000.

[17] Özgür Şimşek and Andrew G. Barto. Using relative novelty to identify useful temporal abstractions in reinforcement learning. In Proc. of the Twenty-First International Conference on Machine Learning, pages 751–758, 2004.

[18] Özgür Şimşek and Andrew G. Barto.
Skill characterization based on betweenness. In NIPS, pages 1497–1504, 2008.

[19] Richard Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112:181–211, 1999.

[20] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

[21] Sebastian Thrun and Anton Schwartz. Finding structure in reinforcement learning. In Advances in Neural Information Processing Systems 7, pages 385–392. MIT Press, 1995.

[22] Aaron Wilson, Alan Fern, Soumya Ray, and Prasad Tadepalli. Multi-task reinforcement learning: A hierarchical Bayesian approach. In ICML '07: Proceedings of the 24th International Conference on Machine Learning, page 1015. ACM Press, 2007.
", "award": [], "sourceid": 1033, "authors": [{"given_name": "Scott", "family_name": "Niekum", "institution": null}, {"given_name": "Andrew", "family_name": "Barto", "institution": null}]}