{"title": "Active Exploration for Learning Symbolic Representations", "book": "Advances in Neural Information Processing Systems", "page_first": 5009, "page_last": 5019, "abstract": "We introduce an online active exploration algorithm for data-efficiently learning an abstract symbolic model of an environment. Our algorithm is divided into two parts: the first part quickly generates an intermediate Bayesian symbolic model from the data that the agent has collected so far, which the agent can then use along with the second part to guide its future exploration towards regions of the state space that the model is uncertain about. We show that our algorithm outperforms random and greedy exploration policies on two different computer game domains. The first domain is an Asteroids-inspired game with complex dynamics but basic logical structure. The second is the Treasure Game, with simpler dynamics but more complex logical structure.", "full_text": "Active Exploration for Learning\n\nSymbolic Representations\n\nGarrett Andersen\n\nPROWLER.io\n\nCambridge, United Kingdom\ngarrett@prowler.io\n\nGeorge Konidaris\n\nDepartment of Computer Science\n\nBrown University\n\ngdk@cs.brown.edu\n\nAbstract\n\nWe introduce an online active exploration algorithm for data-ef\ufb01ciently learning\nan abstract symbolic model of an environment. Our algorithm is divided into two\nparts: the \ufb01rst part quickly generates an intermediate Bayesian symbolic model\nfrom the data that the agent has collected so far, which the agent can then use along\nwith the second part to guide its future exploration towards regions of the state\nspace that the model is uncertain about. We show that our algorithm outperforms\nrandom and greedy exploration policies on two different computer game domains.\nThe \ufb01rst domain is an Asteroids-inspired game with complex dynamics but basic\nlogical structure. 
The second is the Treasure Game, with simpler dynamics but\nmore complex logical structure.\n\n1\n\nIntroduction\n\nMuch work has been done in arti\ufb01cial intelligence and robotics on how high-level state abstractions\ncan be used to signi\ufb01cantly improve planning [19]. However, building these abstractions is dif\ufb01cult,\nand consequently, they are typically hand-crafted [15, 13, 7, 4, 5, 6, 20, 9].\nA major open question is then the problem of abstraction: how can an intelligent agent learn high-\nlevel models that can be used to improve decision making, using only noisy observations from its\nhigh-dimensional sensor and actuation spaces? Recent work [11, 12] has shown how to automatically\ngenerate symbolic representations suitable for planning in high-dimensional, continuous domains.\nThis work is based on the hierarchical reinforcement learning framework [1], where the agent has\naccess to high-level skills that abstract away the low-level details of control. The agent then learns\nrepresentations for the (potentially abstract) effect of using these skills. For instance, opening a door\nis a high-level skill, while knowing that opening a door typically allows one to enter a building would\nbe part of the representation for this skill. The key result of that work was that the symbols required to\ndetermine the probability of a plan succeeding are directly determined by characteristics of the skills\navailable to an agent. The agent can learn these symbols autonomously by exploring the environment,\nwhich removes the need to hand-design symbolic representations of the world.\nIt is therefore possible to learn the symbols by naively collecting samples from the environment,\nfor example by random exploration. 
However, in an online setting the agent should be able to use\nits previously collected data to compute an exploration policy which leads to better data efficiency.\nWe introduce such an algorithm, which is divided into two parts: the first part quickly generates\nan intermediate Bayesian symbolic model from the data that the agent has collected so far, while\nthe second part uses the model plus Monte-Carlo tree search to guide the agent\u2019s future exploration\ntowards regions of the state space that the model is uncertain about. We show that our algorithm is\nsignificantly more data-efficient than more naive methods in two different computer game domains.\nThe first domain is an Asteroids-inspired game with complex dynamics but basic logical structure.\nThe second is the Treasure Game, with simpler dynamics but more complex logical structure.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\f2 Background\n\nAs a motivating example, imagine deciding the route you are going to take to the grocery store;\ninstead of planning over the various sequences of muscle contractions that you would use to complete\nthe trip, you would consider a small number of high-level alternatives such as whether to take one\nroute or another. You would also avoid considering how your exact low-level state affected your\ndecision making, and instead use an abstract (symbolic) representation of your state with components\nsuch as whether you are at home or at work, whether you have to get gas, whether there is traffic, etc.\nThis simplification reduces computational complexity, and allows for increased generalization over\npast experiences. 
In the following sections, we introduce the frameworks that we use to represent the\nagent\u2019s high-level skills, and symbolic models for those skills.\n\n2.1 Semi-Markov Decision Processes\n\nWe assume that the agent\u2019s environment can be described by a semi-Markov decision process\n(SMDP), given by a tuple D = (S, O, R, P, \u03b3), where S \u2286 Rd is a d-dimensional continuous state\nspace, O(s) returns a set of temporally extended actions, or options [19] available in state s \u2208 S,\nR(s(cid:48), t, s, o) and P (s(cid:48), t | s, o) are the reward received and probability of termination in state s(cid:48) \u2208 S\nafter t time steps following the execution of option o \u2208 O(s) in state s \u2208 S, and \u03b3 \u2208 (0, 1] is a\ndiscount factor. In this paper, we are not concerned with the time taken to execute o, so we use\n\nP (s(cid:48) | s, o) =(cid:82) P (s(cid:48), t | s, o)dt.\n\nAn option o is given by three components: \u03c0o, the option policy that is executed when the option is\ninvoked, Io, the initiation set consisting of the states where the option can be executed from, and\n\u03b2o(s) \u2192 [0, 1], the termination condition, which returns the probability that the option will terminate\nupon reaching state s. Learning models for the initiation set, rewards, and transitions for each option,\nallows the agent to reason about the effect of its actions in the environment. To learn these option\nmodels, the agent has the ability to collect observations of the forms (s, O(s)) when entering a state\ns and (s, o, s(cid:48), r, t) upon executing option o from s.\n\n2.2 Abstract Representations for Planning\n\nWe are speci\ufb01cally interested in learning option models which allow the agent to easily evaluate the\nsuccess probability of plans. A plan is a sequence of options to be executed from some starting state,\nand it succeeds if and only if it is able to be run to completion (regardless of the reward). 
Thus, a plan\n{o1, o2, ..., on} with starting state s succeeds if and only if s \u2208 Io1 and the termination state of each\noption (except for the last) lies in the initiation set of the following option, i.e. s' \u223c P (s' | s, o1) \u2208 Io2,\ns'' \u223c P (s'' | s', o2) \u2208 Io3, and so on.\nRecent work [11, 12] has shown how to automatically generate a symbolic representation that supports\nsuch queries, and is therefore suitable for planning. This work is based on the idea of a probabilistic\nsymbol, a compact representation of a distribution over infinitely many continuous, low-level states.\nFor example, a probabilistic symbol could be used to classify whether or not the agent is currently in\nfront of a door, or one could be used to represent the state that the agent would find itself in after\nexecuting its \u2018open the door\u2019 option. In both cases, using probabilistic symbols also allows the agent\nto be uncertain about its state.\nThe following two probabilistic symbols are provably sufficient for evaluating the success probability\nof any plan [12]: the probabilistic precondition, Pre(o) = P (s \u2208 Io), which expresses the probability\nthat an option o can be executed from each state s \u2208 S, and the probabilistic image operator,\n\nIm(o, Z) = [\u222bS P (s' | s, o)Z(s)P (Io | s)ds] / [\u222bS Z(s)P (Io | s)ds],\n\nwhich represents the distribution over termination states if an option o is executed from a distribution\nover starting states Z. These symbols can be used to compute the probability that each successive\noption in the plan can be executed, and these probabilities can then be multiplied to compute the\noverall success probability of the plan; see Figure 1 for a visual demonstration of a plan of length 2.\nSubgoal Options Unfortunately, it is difficult to model Im(o, Z) for arbitrary options, so we focus\non restricted types of options. 
A subgoal option [17] is a special class of option where the distribution\nover termination states (referred to as the subgoal) is independent of the distribution over starting\n\n2\n\n\fFigure 1: Determining the probability that a plan consisting of two options can be executed from\na starting distribution Z0. (a): Z0 is contained in Pre(o1), so o1 can de\ufb01nitely be executed. (b):\nExecuting o1 from Z0 leads to distribution over states Im(o1, Z0). (c): Im(o1, Z0) is not completely\ncontained in Pre(o2), so the probability of being able to execute o2 is less than 1. Note that Pre is a\nset and Im is a distribution, and the agent typically has uncertain models for them.\n\nstates that it was executed from, e.g. if you make the decision to walk to your kitchen, the end result\nwill be the same regardless of where you started from.\nFor subgoal options, the image operator can be replaced with the effects distribution: E\ufb00(o) =\nIm(o, Z),\u2200Z(S), the resulting distribution over states after executing o from any start distribution\nZ(S). Planning with a set of subgoal options is simple because for each ordered pair of options oi and\noj, it is possible to compute and store G(oi, oj), the probability that oj can be executed immediately\n\nafter executing oi: G(oi, oj) =(cid:82)\n\nS Pre(oj, s)E\ufb00(oi)(s)ds.\n\nWe use the following two generalizations of subgoal options: abstract subgoal options model the\nmore general case where executing an option leads to a subgoal for a subset of the state variables\n(called the mask), leaving the rest unchanged. For example, walking to the kitchen leaves the amount\nof gas in your car unchanged. More formally, the state vector can be partitioned into two parts\ns = [a, b], such that executing o leaves the agent in state s(cid:48) = [a, b(cid:48)], where P (b(cid:48)) is independent of\nthe distribution over starting states. 
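Planning with G can be made concrete with a toy discrete sketch. In the snippet below, symbols are indexed 0..N-1, Pre[o] is an array of per-symbol execution probabilities, and Eff[o] is a categorical effects distribution; the helper names `transition_matrix` and `plan_success` are ours, not the paper's (the paper defines only G itself):

```python
import numpy as np

def transition_matrix(pre, eff):
    """G(oi, oj) = sum over symbols s of Eff(oi)(s) * Pre(oj, s):
    the probability that oj is executable immediately after oi terminates."""
    n = len(eff)
    G = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            G[i, j] = float(np.dot(eff[i], pre[j]))
    return G

def plan_success(plan, start_symbol, pre, G):
    """Probability that a plan (a list of option indices) runs to completion
    from a known starting symbol, under the subgoal-option assumption."""
    p = pre[plan[0]][start_symbol]      # the first option must be executable
    for oi, oj in zip(plan, plan[1:]):  # then each successor option, from the
        p *= G[oi, oj]                  # predecessor's subgoal distribution
    return p
```

Because each subgoal wipes out dependence on the starting state, a plan's success probability factors into the first precondition times a product of G entries, which is what makes the pairwise table sufficient for planning.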
The second generalization is the (abstract) partitioned subgoal\noption, which can be partitioned into a \ufb01nite number of (abstract) subgoal options. For instance, an\noption for opening doors is not a subgoal option because there are many doors in the world, however\nit can be partitioned into a set of subgoal options, with one for every door.\nThe subgoal (and abstract subgoal) assumptions propose that the exact state from which option\nexecution starts does not really affect the options that can be executed next. This is somewhat\nrestrictive and often does not hold for options as given, but can hold for options once they have been\npartitioned. Additionally, the assumptions need only hold approximately in practice.\n\n3 Online Active Symbol Acquisition\n\nPrevious approaches for learning symbolic models from data [11, 12] used random exploration.\nHowever, real world data from high-level skills is very expensive to collect, so it is important to use a\nmore data-ef\ufb01cient approach. In this section, we introduce a new method for learning abstract models\ndata-ef\ufb01ciently. Our approach maintains a distribution over symbolic models which is updated\nafter every new observation. This distribution is used to choose the sequence of options that in\nexpectation maximally reduces the amount of uncertainty in the posterior distribution over models.\nOur approach has two components: an active exploration algorithm which takes as input a distribution\nover symbolic models and returns the next option to execute, and an algorithm for quickly building\na distribution over symbolic models from data. The second component is an improvement upon\nprevious approaches in that it returns a distribution and is fast enough to be updated online, both of\nwhich we require.\n\n3.1 Fast Construction of a Distribution over Symbolic Option Models\n\nNow we show how to construct a more general model than G that can be used for planning with\nabstract partitioned subgoal options. 
The advantages of our approach versus previous methods are\nthat our algorithm is much faster, and the resulting model is Bayesian, both of which are necessary\nfor the active exploration algorithm introduced in the next section.\nRecall that the agent can collect observations of the forms (s, o, s') upon executing option o from s,\nand (s, O(s)) when entering a state s, where O(s) is the set of available options in state s. Given\na sequence of observations of this form, the first step of our approach is to find the factors [12],\npartitions of state variables that always change together in the observed data. For example, consider a\nrobot which has options for moving to the nearest table and picking up a glass on an adjacent table.\nMoving to a table changes the x and y coordinates of the robot without changing the joint angles of\nthe robot\u2019s arms, while picking up a glass does the opposite. Thus, the x and y coordinates and the\narm joint angles of the robot belong to different factors. Splitting the state space into factors reduces\nthe number of potential masks (see end of Section 2.2) because we assume that if state variables i\nand j always change together in the observations, then this will always occur, e.g. we assume that\nmoving to the table will never move the robot\u2019s arms.1\nFinding the Factors Compute the set of observed masks M from the (s, o, s') observations: each\nobservation\u2019s mask is the subset of state variables that differ substantially between s and s'. Since\nwe work in continuous, stochastic domains, we must detect the difference between minor random\nnoise (independent of the action) and a substantial change in a state variable caused by action\nexecution. In principle this requires modeling action-independent and action-dependent differences,\nand distinguishing between them, but this is difficult to implement. 
Fortunately we have found that in\npractice allowing some noise and having a simple threshold is often effective, even in more noisy and\ncomplex domains. For each state variable i, let Mi \u2286 M be the subset of the observed masks that\ncontain i. Two state variables i and j belong to the same factor f \u2208 F if and only if Mi = Mj. Each\nfactor f is given by a set of state variables and thus corresponds to a subspace Sf . The factors are\nupdated after every new observation.\nf be the projection of S\u2217 onto the\nLet S\u2217 be the set of states that the agent has observed and let S\u2217\nsubspace Sf for some factor f, e.g. in the previous example there is a S\u2217\nf which consists of the set\nof observed robot (x, y) coordinates. It is important to note that the agent\u2019s observations come only\nfrom executing partitioned abstract subgoal options. This means that S\u2217\nf consists only of abstract\nsubgoals, because for each s \u2208 S\u2217, sf was either unchanged from the previous state, or changed to\nanother abstract subgoal. In the robot example, all (x, y) observations must be adjacent to a table\nbecause the robot can only execute an option that terminates with it adjacent to a table or one that\ndoes not change its (x, y) coordinates. Thus, the states in S\u2217 can be imagined as a collection of\nabstract subgoals for each of the factors. Our next step is to build a set of symbols for each factor to\nrepresent its abstract subgoals, which we do using unsupervised clustering.\nFinding the Symbols For each factor f \u2208 F , we \ufb01nd the set of symbols Z f by clustering S\u2217\nf . Let\nZ f (sf ) be the corresponding symbol for state s and factor f. 
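The mask-and-factor computation can be sketched as follows (a minimal version assuming a hand-picked noise threshold `eps`; `observed_mask` and `find_factors` are our names, not the paper's):

```python
import numpy as np
from collections import defaultdict

def observed_mask(s, s_next, eps=1e-3):
    """A variable is in an observation's mask if it changed by more than eps."""
    diff = np.abs(np.asarray(s_next) - np.asarray(s))
    return frozenset(np.flatnonzero(diff > eps).tolist())

def find_factors(transitions, n_vars, eps=1e-3):
    """Group variables into factors: i and j share a factor iff they appear
    in exactly the same set of observed masks (M_i == M_j)."""
    masks = {observed_mask(s, s2, eps) for s, _, s2 in transitions}
    M = {i: frozenset(m for m in masks if i in m) for i in range(n_vars)}
    factors = defaultdict(list)
    for i in range(n_vars):
        factors[M[i]].append(i)
    return list(factors.values())
```

In the robot example, the (x, y) coordinates would appear only in the masks of movement options and the joint angles only in the masks of manipulation options, so the two groups end up in separate factors.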
We then map the observed states s \u2208 S\u2217\nto their corresponding symbolic states sd = {Z f (sf ),\u2200f \u2208 F}, and the observations (s, O(s)) and\n(s, o, s(cid:48)) to (sd, O(s)) and (sd, o, s(cid:48)d), respectively.\nIn the robot example, the (x, y) observations would be clustered around tables that the robot could\ntravel to, so there would be a symbol corresponding to each table.\nWe want to build our models within the symbolic state space Sd. Thus we de\ufb01ne the symbolic\nprecondition, Pre(o, sd), which returns the probability that the agent can execute an option from\nsome symbolic state, and the symbolic effects distribution for a subgoal option o, E\ufb00 (o), maps to a\nsubgoal distribution over symbolic states. For example, the robot\u2019s \u2018move to the nearest table\u2019 option\nmaps the robot\u2019s current (x, y) symbol to the one which corresponds to the nearest table.\nThe next step is to partition the options into abstract subgoal options (in the symbolic state space),\ne.g. we want to partition the \u2018move to the nearest table\u2019 option in the symbolic state space so that the\nsymbolic states in each partition have the same nearest table.\nPartitioning the Options For each option o, we initialize the partitioning P o so that each symbolic\nstate starts in its own partition. 
We use independent Bayesian sparse Dirichlet-categorical models [18]\nfor the symbolic effects distribution of each option partition.2 We then perform Bayesian Hierarchical\nClustering [8] to merge partitions which have similar symbolic effects distributions.3\n\n1The factors assumption is not strictly necessary as we can assign each state variable to its own factor.\nHowever, using this uncompressed representation can lead to an exponential increase in the size of the symbolic\nstate space and a corresponding increase in the sample complexity of learning the symbolic models.\n\n2We use sparse Dirichlet-categorical models because there are a combinatorial number of possible symbolic\n\nstate transitions, but we expect that each partition has non-zero probability for only a small number of them.\n\n3We use the closed form solutions for Dirichlet-multinomial models provided by the paper.\n\n4\n\n\fAlgorithm 1 Fast Construction of a Distribution over Symbolic Option Models\n1: Find the set of observed masks M.\n2: Find the factors F .\n3: \u2200f \u2208 F , \ufb01nd the set of symbols Z f .\n4: Map the observed states s \u2208 S\u2217 to symbolic states sd \u2208 S\u2217d.\n5: Map the observations (s, O(s)) and (s, o, s(cid:48)) to (sd, O(s)) and (sd, o, s(cid:48)d).\n6: \u2200o \u2208 O, initialize P o and perform Bayesian Hierarchical Clustering on it.\n7: \u2200o \u2208 O, \ufb01nd Ao and F o\u2217 .\n\na, but has yet to actually execute it from any sd \u2208 Sd\n\nThere is a special case where the agent has observed that an option o was available in some symbolic\nstates Sd\na. These are not included in the Bayesian\nHierarchical Clustering, instead we have a special prior for the partition of o that they belong to. After\ncompleting the merge step, the agent has a partitioning P o for each option o. 
Our prior is that with\nprobability qo,4 each sd \u2208 Sd\na belongs to the partition po \u2208 P o which contains the symbolic states\nmost similar to sd, and with probability 1 \u2212 qo each sd belongs to its own partition. To determine the\npartition which is most similar to some symbolic state, we \ufb01rst \ufb01nd Ao, the smallest subset of factors\nwhich can still be used to correctly classify P o. We then map each sd \u2208 Sd\na to the most similar\npartition by trying to match sd masked by Ao with a masked symbolic state already in one of the\npartitions. If there is no match, sd is placed in its own partition.\nOur \ufb01nal consideration is how to model the symbolic preconditions. The main concern is that many\nfactors are often irrelevant for determining if some option can be executed. For example, whether or\nnot you have keys in your pocket does not affect whether you can put on your shoe.\nModeling the Symbolic Preconditions Given an option o and subset of factors F o, let Sd\nF o be the\nsymbolic state space projected onto F o. We use independent Bayesian Beta-Bernoulli models for the\nsymbolic precondition of o in each masked symbolic state sd\nF o. For each option o, we use\nBayesian model selection to \ufb01nd the the subset of factors F o\u2217 which maximizes the likelihood of the\nsymbolic precondition models.\nThe \ufb01nal result is a distribution over symbolic option models H, which consists of the combined sets\nof independent symbolic precondition models {Pre(o, sd\nF o\u2217 );\u2200o \u2208 O,\u2200sd\nF o\u2217 } and independent\nsymbolic effects distribution models {E\ufb00 (o, po);\u2200o \u2208 O,\u2200po \u2208 P o}.\nThe complete procedure is given in Algorithm 1. A symbolic option model h \u223c H can be sampled\nby drawing parameters for each of the Bernoulli and categorical distributions from the corresponding\nBeta and sparse Dirichlet distributions, and drawing outcomes for each qo. 
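Drawing one such model h \u223c H can be sketched as below, assuming we track Beta counts for each precondition and sparse Dirichlet counts over only the outcomes observed so far for each partition's effects distribution (function names and prior values are our choices, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_precondition(successes, failures, prior=1.0):
    # Beta-Bernoulli posterior: Beta(prior + successes, prior + failures)
    return rng.beta(prior + successes, prior + failures)

def sample_effects(outcome_counts, alpha=0.5):
    # Sparse Dirichlet-categorical posterior: a Dirichlet over only the
    # symbolic outcomes that have actually been observed for this partition
    outcomes = list(outcome_counts)
    probs = rng.dirichlet([alpha + outcome_counts[o] for o in outcomes])
    return dict(zip(outcomes, probs))
```

Restricting the Dirichlet to observed outcomes is what keeps sampling tractable despite the combinatorial number of possible symbolic transitions.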
It is also possible to\nconsider distributions over other parts of the model such as the symbolic state space and/or a more\ncomplicated one for the option partitionings, which we leave for future work.\n\nF o \u2208 Sd\n\nF o\u2217 \u2208 Sd\n\n3.2 Optimal Exploration\n\nIn the previous section we have shown how to ef\ufb01ciently compute a distribution over symbolic option\nmodels H. Recall that the ultimate purpose of H is to compute the success probabilities of plans\n(see Section 2.2). Thus, the quality of H is determined by the accuracy of its predicted plan success\nprobabilities, and ef\ufb01ciently learning H corresponds to selecting the sequence of observations which\nmaximizes the expected accuracy of H. However, it is dif\ufb01cult to calculate the expected accuracy of\nH over all possible plans, so we de\ufb01ne a proxy measure to optimize which is intended to represent\nthe amount of uncertainty in H. In this section, we introduce our proxy measure, followed by an\nalgorithm for \ufb01nding the exploration policy which optimizes it. The algorithm operates in an online\nmanner, building H from the data collected so far, using H to select an option to execute, updating\nH with the new observation, and so on.\nFirst we de\ufb01ne the standard deviation \u03c3H, the quantity we use to represent the amount of uncertainty\nin H. 
To define the standard deviation, we also need to define the distance and mean.\n\n4This is a user-specified parameter.\n\n5\n\n\fWe define the distance K from h2 \u2208 H to h1 \u2208 H to be the sum of the Kullback-Leibler (KL)\ndivergences5 between their individual symbolic effects distributions plus the sum of the KL divergences\nbetween their individual symbolic precondition distributions:6\n\nK(h1, h2) = \u2211_{o \u2208 O} [ \u2211_{sd_{F o\u2217} \u2208 Sd_{F o\u2217}} DKL(Pre_{h1}(o, sd_{F o\u2217}) || Pre_{h2}(o, sd_{F o\u2217})) + \u2211_{po \u2208 P o} DKL(Eff_{h1}(o, po) || Eff_{h2}(o, po)) ].\n\nWe define the mean, E[H], to be the symbolic option model such that each Bernoulli symbolic\nprecondition and categorical symbolic effects distribution is equal to the mean of the corresponding\nBeta or sparse Dirichlet distribution:\n\n\u2200o \u2208 O, \u2200po \u2208 P o: Eff_{E[H]}(o, po) = E_{h\u223cH}[Eff_h(o, po)],\n\u2200o \u2208 O, \u2200sd_{F o\u2217} \u2208 Sd_{F o\u2217}: Pre_{E[H]}(o, sd_{F o\u2217}) = E_{h\u223cH}[Pre_h(o, sd_{F o\u2217})].\n\nThe standard deviation \u03c3H is then simply: \u03c3H = E_{h\u223cH}[K(h, E[H])]. This represents the expected\namount of information which is lost if E[H] is used to approximate H.\nNow we define the optimal exploration policy for the agent, which aims to maximize the expected\nreduction in \u03c3H after H is updated with new observations. Let H(w) be the posterior distribution\nover symbolic models when H is updated with symbolic observations w (the partitioning is not\nupdated, only the symbolic effects distribution and symbolic precondition models), and let W (H, i, \u03c0)\nbe the distribution over symbolic observations drawn from the posterior of H if the agent follows\npolicy \u03c0 for i steps. 
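As a toy illustration of \u03c3H for a single effects distribution with a Dirichlet posterior (all names here are ours; the real quantity sums KL terms over every precondition and partition model):

```python
import numpy as np

def kl_categorical(p, q, eps=1e-12):
    """KL divergence between two categorical distributions."""
    p, q = np.asarray(p), np.asarray(q)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def sigma_H(dirichlet_counts, n_samples=1000, seed=0):
    """Monte-Carlo estimate of sigma_H = E_{h~H}[K(h, E[H])] for a model
    with one categorical effects distribution under a Dirichlet posterior."""
    rng = np.random.default_rng(seed)
    alphas = np.asarray(dirichlet_counts, dtype=float)
    mean = alphas / alphas.sum()                 # E[H]: the posterior mean
    samples = rng.dirichlet(alphas, size=n_samples)
    return float(np.mean([kl_categorical(h, mean) for h in samples]))
```

As expected, the estimate shrinks as the posterior concentrates: a Dirichlet(1, 1) posterior yields a much larger \u03c3H than a Dirichlet(100, 100) posterior over the same two outcomes.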
We define the optimal exploration policy \u03c0\u2217 as:\n\n\u03c0\u2217 = argmax_\u03c0 { \u03c3H \u2212 E_{w\u223cW (H,i,\u03c0)}[\u03c3H(w)] }.\n\nFor the convenience of our algorithm, we rewrite the second term by switching the order of the\nexpectations: E_{w\u223cW (H,i,\u03c0)}[E_{h\u223cH(w)}[K(h, E[H(w)])]] = E_{h\u223cH, w\u223cW (h,i,\u03c0)}[K(h, E[H(w)])].\nNote that the objective function is non-Markovian because H is continuously updated with the\nagent\u2019s new observations, which changes \u03c3H. This means that \u03c0\u2217 is non-stationary, so Algorithm 2\napproximates \u03c0\u2217 in an online manner using Monte-Carlo tree search (MCTS) [3] with the UCT tree\npolicy [10]. \u03c0T is the combined tree and rollout policy for MCTS, given tree T .\nThere is a special case when the agent simulates the observation of a previously unobserved transition,\nwhich can occur under the sparse Dirichlet-categorical model. In this case, the amount of information\ngained is very large, and furthermore, the agent is likely to transition to a novel symbolic state. Rather\nthan modeling the unexplored state space, if an unobserved transition is encountered during\nan MCTS update, the update immediately terminates with a large bonus to the score, an approach similar to\nthat of the R-max algorithm [2]. The form of the bonus is -zg, where g is the depth at which the update\nterminated and z is a constant. The bonus reflects the opportunity cost of not experiencing something\nnovel as quickly as possible, and in practice it tends to dominate (as it should).\n\n4 The Asteroids Domain\n\nThe Asteroids domain is shown in Figure 2a and was implemented using the physics simulator pybox2d.\nThe agent controls a ship by either applying a thrust in the direction it is facing or applying a torque\nin either direction. 
The goal of the agent is to be able to navigate the environment without colliding\nwith any of the four \u201casteroids.\u201d The agent\u2019s starting location is next to asteroid 1. The agent is given\nthe following 6 options (see Appendix A for additional details):\n\n1. move-counterclockwise and move-clockwise: the ship moves from the current face it is\nadjacent to, to the midpoint of the face which is counterclockwise/clockwise on the same\nasteroid from the current face. Only available if the ship is at an asteroid.\n\n5The KL divergence has previously been used in other active exploration scenarios [16, 14].\n6Similarly to other active exploration papers, we define the distance to depend only on the transition models\nand not the reward models.\n\n6\n\n\fAlgorithm 2 Optimal Exploration\nInput: Number of remaining option executions i.\n1: while i \u2265 0 do\n2: Build H from observations (Algorithm 1).\n3: Initialize tree T for MCTS.\n4: while number updates < threshold do\n5: Sample a symbolic model h \u223c H.\n6: Do an MCTS update of T with dynamics given by h.\n7: Terminate current update if depth g is \u2265 i, or unobserved transition is encountered.\n8: Store simulated observations w \u223c W (h, g, \u03c0T ).\n9: Score = K(h, E[H]) \u2212 K(h, E[H(w)]) \u2212 zg.\n10: end while\n11: return most visited child of root node.\n12: Execute corresponding option; Update observations; i--.\n13: end while\n\n2. move-to-asteroid-1, move-to-asteroid-2, move-to-asteroid-3, and move-to-asteroid-4:\nthe ship moves to the midpoint of the closest face of asteroid 1-4 to which it has an\nunobstructed path. 
Only available if the ship is not already at the asteroid and an unobstructed\npath to some face exists.\n\nExploring with these options results in only one factor (for the entire state space), with symbols\ncorresponding to each of the 35 asteroid faces as shown in Figure 2a.\n\n(a)\n\n(b)\n\nFigure 2: (a): The Asteroids Domain, and the 35 symbols which can be encountered while exploring\nwith the provided options. (b): The Treasure Game domain. Although the game screen is drawn\nusing large image tiles, sprite movement is at the pixel level.\n\nResults We tested the performance of three exploration algorithms: random, greedy, and our\nalgorithm. For the greedy algorithm, the agent first computes the symbolic state space using steps\n1-5 of Algorithm 1, and then chooses the option with the lowest execution count from its current\nsymbolic state. The hyperparameter settings that we use for our algorithm are given in Appendix A.\nFigures 3a, 3b, and 3c show the fraction of time that the agent spends exploring asteroids 1, 3,\nand 4, respectively. The random and greedy policies have difficulty escaping asteroid 1, and are rarely\nable to reach asteroid 4. On the other hand, our algorithm allocates its time much more proportionally.\nFigure 3d shows the number of symbolic transitions that the agent has not observed (out of 115\npossible).7 As we discussed in Section 3, the number of unobserved symbolic transitions is a good\nrepresentation of the amount of information that the models are missing from the environment.\nOur algorithm significantly outperforms random and greedy exploration. Note that these results use\nan uninformative prior, and the performance of our algorithm could be significantly improved by\n\n7We used Algorithm 1 to build symbolic models from the data gathered by each exploration algorithm.\n\n7\n\n\f(a)\n\n(c)\n\n(b)\n\n(d)\n\nFigure 3: Simulation results for the Asteroids domain. 
Each bar represents the average of 100 runs.\nThe error bars represent a 99% con\ufb01dence interval for the mean. (a), (b), (c): The fraction of time that\nthe agent spends on asteroids 1, 3, and 4, respectively. The greedy and random exploration policies\nspend signi\ufb01cantly more time than our algorithm exploring asteroid 1 and signi\ufb01cantly less time\nexploring asteroids 3 and 4. (d): The number of symbolic transitions that the agent has not observed\n(out of 115 possible). The greedy and random policies require 2-3 times as many option executions\nto match the performance of our algorithm.\n\nstarting with more information about the environment. To try to give additional intuition, in Appendix\nA we show heatmaps of the (x, y) coordinates visited by each of the exploration algorithms.\n\n5 The Treasure Game Domain\nThe Treasure Game [12], shown in Figure 2b, features an agent in a 2D, 528 \u00d7 528 pixel video-game\nlike world, whose goal is to obtain treasure and return to its starting position on a ladder at the top of\nthe screen. The 9-dimensional state space is given by the x and y positions of the agent, key, and\ntreasure, the angles of the two handles, and the state of the lock.\nThe agent is given 9 options: go-left, go-right, up-ladder, down-ladder, jump-left, jump-right, down-\nright, down-left, and interact. See Appendix A for a more detailed description of the options and\nthe environment dynamics. Given these options, the 7 factors with their corresponding number of\nsymbols are: player-x, 10; player-y, 9; handle1-angle, 2; handle2-angle, 2; key-x and key-y, 3;\nbolt-locked, 2; and goldcoin-x and goldcoin-y, 2.\nResults We tested the performance of the same three algorithms: random, greedy, and our algorithm.\nFigure 4a shows the fraction of time that the agent spends without having the key and with the lock\nstill locked. Figures 4b and 4c show the number of times that the agent obtains the key and treasure,\nrespectively. 
Figure 4d shows the number of unobserved symbolic transitions (out of 240 possible).\nAgain, our algorithm performs significantly better than random and greedy exploration.\n\nFigure 4: Simulation results for the Treasure Game domain. Each bar represents the average of 100\nruns. The error bars represent a 99% confidence interval for the mean. (a): The fraction of time\nthat the agent spends without having the key and with the lock still locked. The greedy and random\nexploration policies spend significantly more time in this condition than our algorithm. (b), (c): The\naverage number of times that the agent obtains the key and treasure, respectively. Our algorithm\nobtains both the key and the treasure significantly more frequently than the greedy and random\nexploration policies. There is a discrepancy between the number of times that the agent obtains the\nkey and the treasure because there are more symbolic states where the agent can try the option that\nleads to a reset than where it can try the option that leads to obtaining the treasure. (d): The number\nof symbolic transitions that the agent has not observed (out of 240 possible). The greedy and random\npolicies require 2–3 times as many option executions to match the performance of our algorithm.\n\nThe data from our algorithm has much better coverage, and thus leads to more accurate symbolic models.
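The unobserved-transition count reported in Figures 3d and 4d can be maintained with simple bookkeeping. The sketch below is our own illustration, not the authors' implementation: the class name, the state and option labels, and the way the total is supplied (115 for Asteroids, 240 for the Treasure Game, per the text) are assumptions; enumerating that total in practice depends on the symbolic model built by Algorithm 1.

```python
class TransitionCoverage:
    """Track which (symbolic state, option, next symbolic state) triples
    have been observed; the gap to the known total is a rough measure of
    how much information the learned model is still missing.

    Illustrative sketch only: how the total number of possible symbolic
    transitions is enumerated depends on the domain and learned symbols.
    """

    def __init__(self, total_transitions):
        self.total = total_transitions
        self.seen = set()  # observed (state, option, next_state) triples

    def record(self, state, option, next_state):
        # Duplicates are absorbed by the set, so re-executing a known
        # transition does not change the coverage count.
        self.seen.add((state, option, next_state))

    def unobserved(self):
        return self.total - len(self.seen)


# Hypothetical Treasure Game labels, for illustration only.
coverage = TransitionCoverage(total_transitions=240)
coverage.record("no-key-locked", "go-left", "no-key-locked")
coverage.record("no-key-locked", "interact", "has-key-locked")
coverage.record("no-key-locked", "go-left", "no-key-locked")  # repeat
```

Because observations are stored in a set, repeated executions of the same transition do not reduce the unobserved count further, which matches the metric's intent as a coverage measure rather than a visit counter.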
For instance, Figure 4c shows that random and greedy exploration did not obtain the treasure\nafter 200 executions; without that data, the agent would not know that it should have a symbol that\ncorresponds to possessing the treasure.\n\n6 Conclusion\n\nWe have introduced a two-part algorithm for data-efficiently learning an abstract symbolic\nrepresentation of an environment that is suitable for planning with high-level skills. The first part of the\nalgorithm quickly generates an intermediate Bayesian symbolic model directly from data. The second\npart guides the agent's exploration towards areas of the environment that the model is uncertain about.\nThis algorithm is useful when the cost of data collection is high, as is the case in most real-world\nartificial intelligence applications. Our results show that the algorithm is significantly more\ndata-efficient than naive exploration policies.\n\n7 Acknowledgements\n\nThis research was supported in part by the National Institutes of Health under award number\nR01MH109177. The U.S. Government is authorized to reproduce and distribute reprints for\nGovernmental purposes notwithstanding any copyright notation thereon. The content is solely the\nresponsibility of the authors and does not necessarily represent the official views of the National\nInstitutes of Health.\n\nReferences\n\n[1] A.G. Barto and S. Mahadevan. Recent advances in hierarchical reinforcement learning.
Discrete Event Dynamic Systems, 13(4):341–379, 2003.\n\n[2] R.I. Brafman and M. Tennenholtz. R-max – a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3(Oct):213–231, 2002.\n\n[3] C.B. Browne, E. Powley, D. Whitehouse, S.M. Lucas, P.I. Cowling, P. Rohlfshagen, S. Tavener, D. Perez, S. Samothrakis, and S. Colton. A survey of Monte-Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1):1–43, 2012.\n\n[4] S. Cambon, R. Alami, and F. Gravot. A hybrid approach to intricate motion, manipulation and task planning. International Journal of Robotics Research, 28(1):104–126, 2009.\n\n[5] J. Choi and E. Amir. Combining planning and motion planning. In Proceedings of the IEEE International Conference on Robotics and Automation, pages 4374–4380, 2009.\n\n[6] C. Dornhege, M. Gissler, M. Teschner, and B. Nebel. Integrating symbolic and geometric planning for mobile manipulation. In IEEE International Workshop on Safety, Security and Rescue Robotics, November 2009.\n\n[7] E. Gat. On three-layer architectures. In D. Kortenkamp, R.P. Bonnasso, and R. Murphy, editors, Artificial Intelligence and Mobile Robots. AAAI Press, 1998.\n\n[8] K.A. Heller and Z. Ghahramani. Bayesian hierarchical clustering. In Proceedings of the 22nd International Conference on Machine Learning, pages 297–304. ACM, 2005.\n\n[9] L. Kaelbling and T. Lozano-Pérez. Hierarchical planning in the Now. In Proceedings of the IEEE Conference on Robotics and Automation, 2011.\n\n[10] L. Kocsis and C. Szepesvári. Bandit based Monte-Carlo planning. In Machine Learning: ECML 2006, pages 282–293. Springer, 2006.\n\n[11] G.D. Konidaris, L.P. Kaelbling, and T. Lozano-Perez. Constructing symbolic representations for high-level planning.
In Proceedings of the Twenty-Eighth Conference on Artificial Intelligence, pages 1932–1940, 2014.\n\n[12] G.D. Konidaris, L.P. Kaelbling, and T. Lozano-Perez. Symbol acquisition for probabilistic high-level planning. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, pages 3619–3627, 2015.\n\n[13] C. Malcolm and T. Smithers. Symbol grounding via a hybrid architecture in an autonomous assembly system. Robotics and Autonomous Systems, 6(1-2):123–144, 1990.\n\n[14] S.A. Mobin, J.A. Arnemann, and F. Sommer. Information-based learning by agents in unbounded state spaces. In Advances in Neural Information Processing Systems, pages 3023–3031, 2014.\n\n[15] N.J. Nilsson. Shakey the robot. Technical report, SRI International, April 1984.\n\n[16] L. Orseau, T. Lattimore, and M. Hutter. Universal knowledge-seeking agents for stochastic environments. In International Conference on Algorithmic Learning Theory, pages 158–172. Springer, 2013.\n\n[17] D. Precup. Temporal Abstraction in Reinforcement Learning. PhD thesis, Department of Computer Science, University of Massachusetts Amherst, 2000.\n\n[18] N. Friedman and Y. Singer. Efficient Bayesian parameter estimation in large discrete domains. In Advances in Neural Information Processing Systems 11: Proceedings of the 1998 Conference, volume 11, page 417. MIT Press, 1999.\n\n[19] R.S. Sutton, D. Precup, and S.P. Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181–211, 1999.\n\n[20] J. Wolfe, B. Marthi, and S.J. Russell. Combined task and motion planning for mobile manipulation.
In International Conference on Automated Planning and Scheduling, 2010.