{"title": "Blending Autonomous Exploration and Apprenticeship Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 2258, "page_last": 2266, "abstract": "We present theoretical and empirical results for a framework that combines the benefits of apprenticeship and autonomous reinforcement learning. Our approach modifies an existing apprenticeship learning framework that relies on teacher demonstrations and does not necessarily explore the environment. The first change is replacing previously used Mistake Bound model learners with a recently proposed framework that melds the KWIK and Mistake Bound supervised learning protocols. The second change is introducing a communication of expected utility from the student to the teacher. The resulting system only uses teacher traces when the agent needs to learn concepts it cannot efficiently learn on its own.", "full_text": "Blending Autonomous Exploration and\n\nApprenticeship Learning\n\nThomas J. Walsh\n\nCenter for Educational\nTesting and Evaluation\nUniversity of Kansas\nLawrence, KS 66045\ntwalsh@ku.edu\n\nDaniel Hewlett\n\nClayton T. Morrison\n\n{dhewlett@cs,clayton@sista}.arizona.edu\n\nSchool of Information:\n\nScience, Technology and Arts\n\nUniversity of Arizona\n\nTucson, AZ 85721\n\nAbstract\n\nWe present theoretical and empirical results for a framework that combines the\nbene\ufb01ts of apprenticeship and autonomous reinforcement learning. Our approach\nmodi\ufb01es an existing apprenticeship learning framework that relies on teacher\ndemonstrations and does not necessarily explore the environment. The \ufb01rst change\nis replacing previously used Mistake Bound model learners with a recently pro-\nposed framework that melds the KWIK and Mistake Bound supervised learning\nprotocols. The second change is introducing a communication of expected util-\nity from the student to the teacher. 
The resulting system only uses teacher traces when the agent needs to learn concepts it cannot efficiently learn on its own.\n\n1 Introduction\n\nAs problem domains become more complex, human guidance becomes increasingly necessary to improve agent performance. For instance, apprenticeship learning, where teachers demonstrate behaviors for agents to follow, has been used to train agents to control complicated systems such as helicopters [1]. However, most work on this topic burdens the teacher with demonstrating even the simplest nuances of a task. By contrast, in autonomous reinforcement learning [2] a number of domain classes can be efficiently learned by an actively exploring agent, although this class is provably smaller than those learnable with the help of a teacher [3].\nThus the field seems to be largely bifurcated. Either agents learn autonomously and eschew the larger learning capacity from teacher interaction, or the agent overburdens the teacher by not exploring simple concepts it could garner on its own. Intuitively, this seems like a false choice, as human teachers often use demonstration but also let students explore parts of the domain on their own. We show how to build a provably efficient learning system that balances teacher demonstrations and autonomous exploration. Specifically, our protocol and algorithms cause a teacher to only step in when its advice will be significantly more helpful than autonomous exploration by the agent.\nWe extend a previously proposed apprenticeship learning protocol [3] where a learning agent and teacher take turns running trajectories. This version of apprenticeship learning is fundamentally different from Inverse Reinforcement Learning [4] and imitation learning [5] because our agents are allowed to enact better policies than their teachers and observe reward signals. 
In this setting, the number of times the teacher outperforms the student was proven to be related to the learnability of the domain class in a mistake bound predictor (MBP) framework.\nOur work modifies previous apprenticeship learning efforts in two ways. First, we will show that replacing the MBP framework with a different learning architecture called KWIK-MBP (based on a similar recently proposed protocol [6]) indicates areas where the agent should autonomously explore, and melds autonomous and apprenticeship learning. However, this change alone is not sufficient to keep the teacher from intervening when an agent is capable of learning on its own. Hence, we introduce a communication of the agent\u2019s expected utility, which provides enough information for the teacher to decide whether or not to provide a trace (a property not shared by any of the previous efforts). Furthermore, we show the number of such interactions grows only with the MBP portion of the KWIK-MBP bound. We then discuss how to relax the communication requirement when the teacher observes the student for many episodes. This gives us the first apprenticeship learning framework where a teacher only shows demonstrations when they are needed for efficient learning, and gracefully blends autonomous exploration and apprenticeship learning.\n\n2 Background\n\nThe main focus of this paper is blending KWIK autonomous exploration strategies [7] and apprenticeship learning techniques [3], utilizing a framework for measuring mistakes and uncertainty based on KWIK-MB [6]. 
We begin by reviewing results relating the learnability of domain parameters in a supervised setting to the efficiency of model-based RL agents.\n\n2.1 MDPs and KWIK Autonomous Learning\nWe will consider environments modeled as a Markov Decision Process (MDP) [2] \u27e8S, A, T, R, \u03b3\u27e9, with states and actions S and A, transition function T : S, A \u21a6 Pr[S], rewards R : S, A \u21a6 \u211d, and discount factor \u03b3 \u2208 [0, 1). The value of a state under policy \u03c0 : S \u21a6 A is V\u03c0(s) = R(s, \u03c0(s)) + \u03b3 \u2211_{s\u2032\u2208S} T(s, \u03c0(s), s\u2032)V\u03c0(s\u2032) and the optimal policy \u03c0\u2217 satisfies \u2200\u03c0 V\u03c0\u2217 \u2265 V\u03c0.\n\nIn model-based reinforcement learning, recent advancements [7] have linked the efficient learnability of T and R in the KWIK (\u201cKnows What It Knows\u201d) framework for supervised learning with PAC-MDP behavior [8]. Formally, KWIK learning is:\nDefinition 1. A hypothesis class H : X \u21a6 Y is KWIK learnable with parameters \u03b5 and \u03b4 if the following holds. For each (adversarial) input xt the learner predicts yt \u2208 Y or \u201cI don\u2019t know\u201d (\u22a5). With probability (1 \u2212 \u03b4), (1) when yt \u2260 \u22a5, ||yt \u2212 E[h(xt)]|| < \u03b5, and (2) the total number of \u22a5 predictions is bounded by a polynomial function of (|H|, 1/\u03b5, 1/\u03b4).\nIntuitively, KWIK caps the number of times the agent will admit uncertainty in its predictions. Prior work [7] showed that if the transition and reward functions (T and R) of an MDP are KWIK learnable, then a PAC-MDP agent (which takes only a polynomial number of suboptimal steps with high probability) can be constructed for autonomous exploration. The mechanism for this construction is an optimistic interpretation of the learned model. Specifically, KWIK-learners LT and LR are built for T and R and the agent replaces any \u22a5 predictions with transitions to a trap state with reward Rmax, causing the agent to explore these uncertain regions. This exploration requires only a polynomial (with respect to the domain parameters) number of suboptimal steps, thus the link from KWIK to PAC-MDP. While the class of functions that is KWIK learnable includes tabular and factored MDPs, it does not cover many larger dynamics classes (such as STRIPS rules with conjunctions for pre-conditions) that are efficiently learnable in the apprenticeship setting.\n\n2.2 Apprenticeship Learning with Mistake Bound Predictor\n\nWe now describe an existing apprenticeship learning framework [3], which we will be modifying throughout this paper. In that protocol, an agent is presented with a start state s0 and is asked to take actions according to its current policy \u03c0A, until a horizon H or a terminal state is reached. After each of these episodes, a teacher is allowed to (but may choose not to) demonstrate their own policy \u03c0T starting from s0. The learning agent is able to fully observe each transition and reward received both in its own trajectories as well as those of the teacher, who may be able to provide highly informative samples. 
For example, in an environment with n bits representing a combination lock that can only be opened with a single setting of the bits, the teacher can demonstrate the combination in a single trace, while an autonomous agent could spend 2^n steps trying to open it.\nAlso in that work, the authors describe a measure of sample complexity called PAC-MDP-Trace (analogous to PAC-MDP from above) that measures (with probability 1 \u2212 \u03b4) the number of episodes where V\u03c0A(s0) < V\u03c0T(s0) \u2212 \u03b5, that is where the expected value of the agent\u2019s policy is significantly worse than the expected value of the teacher\u2019s policy (VA and VT for short). A result analogous to the KWIK to PAC-MDP result was shown connecting a supervised framework called Mistake Bound Predictor (MBP) to PAC-MDP-Trace behavior. MBP extends the classic mistake bound learning framework [9] to handle data with noisy labels, or more specifically:\nDefinition 2. A hypothesis class H : X \u21a6 Y is Mistake Bound Predictor (MBP) learnable with parameters \u03b5 and \u03b4 if the following holds. For each adversarial input xt, the learner predicts yt \u2208 Y. If ||E[h\u2217(xt)] \u2212 yt|| > \u03b5, then the agent has made a mistake. The number of mistakes must be bounded by a polynomial over (1/\u03b5, 1/\u03b4, |H|) with probability (1 \u2212 \u03b4).\n\nAn agent using MBP learners LT and LR for the MDP model components will be PAC-MDP-Trace. The conversion mirrors the KWIK to PAC-MDP connection described earlier, except that the interpretation of the model is strict, and often pessimistic (sometimes resulting in an underestimate of the value function). For instance, if the transition function is based on a conjunction (e.g. our combination lock), the MBP learners default to predicting \u201cfalse\u201d where the data is incomplete, leading an agent to think its action will not work in those situations. 
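As an illustration of this pessimistic default, a minimal MBP-style learner for a monotone conjunction over n bits (the combination-lock case) can be sketched as follows. This is our own illustrative code under assumed interfaces, not the implementation from [3]:

```python
class MBPConjunctionLearner:
    """Mistake-bound learner for a monotone conjunction over n bits.

    Pessimistic default: until a positive example is seen, every input is
    predicted to fail. Each mistake on a positive example removes at least
    one literal, so total mistakes are bounded by n + 1.
    """

    def __init__(self, n):
        self.n = n
        self.hypothesis = None  # bit indices still believed to be in the conjunction

    def predict(self, x):
        if self.hypothesis is None:
            return False  # no positive data yet: predict the action fails
        return all(x[i] for i in self.hypothesis)

    def update(self, x, label):
        # Only positive examples shrink the hypothesis; negatives are already
        # consistent with the pessimistic default for a monotone conjunction.
        if label:
            pos = {i for i in range(self.n) if x[i]}
            self.hypothesis = pos if self.hypothesis is None else self.hypothesis & pos
```

A single teacher trace that opens the lock supplies the one positive example such a learner needs, while an autonomous agent defaulting to "false" would never be drawn to try the combination on its own.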
Such interpretations would be catastrophic in the autonomous case (where the agent would fail to explore such areas), but are permissible in apprenticeship learning where teacher traces will provide the missing data.\nNotice that under a criterion where the number of teacher traces is to be minimized, MBP learning may overburden the teacher. For example, in a simple flat MDP, an MBP-Agent picks actions that maximize utility in the part of the state space that has been exposed by the teacher, never exploring, so the number of teacher traces scales with |S||A|. But a flat MDP is autonomously (KWIK) learnable, so no traces should be required. Ideally an agent would explore the state space where it can learn efficiently, and only rely on the teacher for difficult-to-learn concepts (like conjunctions).\n\n3 Teaching by Demonstration with Mixed Interpretations\n\nWe now introduce a different criterion with the goal of minimizing teacher traces while not forcing the agent to explore exponentially long.\nDefinition 3. A Teacher Interaction (TI) bound for a student-teacher pair is the number of episodes where the teacher provides a trace to the agent that guarantees (with probability 1 \u2212 \u03b4) that the number of agent steps between each trace (or after the last one) where VA(s0) < VT(s0) \u2212 \u03b5 is polynomial in 1/\u03b5, 1/\u03b4, and the domain parameters.\n\nA good TI bound minimizes the teacher traces needed to achieve good behavior, but only requires the suboptimal exploration steps to be polynomially bounded, not minimized. This reflects our judgement that teacher interactions are far more costly than autonomous agent steps, so as long as the latter are reasonably constrained, we should seek to minimize the former. The relationship between TI and PAC-MDP-Trace is the following:\nTheorem 1. 
The TI bound for a domain class and learning algorithm is upper-bounded by the PAC-MDP-Trace bound for the same domain/algorithm with the same \u03b5 and \u03b4 parameters.\nProof. A PAC-MDP-Trace bound quantifies (with probability 1 \u2212 \u03b4) the worst-case number of episodes where the student performs worse than the teacher, specifically where VA(s0) < VT(s0) \u2212 \u03b5. Suppose an environment existed with a PAC-MDP-Trace bound of B1 and a TI bound of B2 > B1. This would mean the domain was learnable with at most B1 teacher traces. But this is a contradiction because no more traces are needed to keep the autonomous exploration steps polynomial.\n\n3.1 The KWIK-MBP Protocol\n\nWe would like to describe a supervised learning framework (like KWIK or MBP) that can quantify the number of changes made to a model through exploration and teacher demonstrations. Here, we propose such a model based on the recent KWIK-MB protocol [6], which we extend below to cover stochastic labels (KWIK-MBP).\nDefinition 4. A hypothesis class H : X \u21a6 Y is KWIK-MBP with parameters \u03b5 and \u03b4 under the following conditions. For each (adversarial) input xt the learner must predict yt \u2208 Y or \u22a5. With probability (1 \u2212 \u03b4), the number of \u22a5 predictions must be bounded by a polynomial K over \u27e8|H|, 1/\u03b5, 1/\u03b4\u27e9 and the number of mistakes (by Definition 2) must be bounded by a polynomial M over \u27e8|H|, 1/\u03b5, 1/\u03b4\u27e9.\n\nAlgorithm 1 KWIK-MBP-Agent with Value Communication\n1: The agent A knows \u03b5, \u03b4, S, A, H and planner P.\n2: The teacher T has policy \u03c0T with expected value VT\n3: Initialize KWIK-MBP learners LT and LR to ensure \u03b5/k value accuracy w.h.p. for k \u2265 2\n4: for each episode do\n5: s0 = Environment.startState\n6: A calculates the value function UA of \u03c0A from \u02c6S, A, \u02c6T and \u02c6R (see construction below).\n7: A communicates its expected utility UA(s0) on this episode to T\n8: if VT(s0) \u2212 ((k\u22121)/k)\u03b5 > UA(s0) then\n9: T provides a trace \u03c4 starting from s0.\n10: \u2200\u27e8s, a, r, s\u2032\u27e9 Update LT(s, a, s\u2032) and LR(s, a, r)\n11: while episode not finished and t < H do\n12: \u02c6S = S \u222a {Smax}, the Rmax trap state\n13: \u02c6R = LR(s, a) or Rmax if LR(s, a) = \u22a5\n14: \u02c6T = LT(s, a) or Smax if LT(s, a) = \u22a5.\n15: at = P.getPlan(st, \u02c6S, \u02c6T, \u02c6R).\n16: \u27e8rt, st+1\u27e9 = E.executeAct(at)\n17: LT.Update(st, at, st+1); LR.Update(st, at, rt)\n\nKWIK-MB was originally designed for a situation where mistakes are more costly than \u22a5 predictions. So mistakes are minimized while \u22a5 predictions are only bounded. This is analogous to our TI criterion (traces minimized with exploration bounded) so we now examine a KWIK-MBP learner in the apprenticeship setting.\n\n3.2 Mixing Optimism and Pessimism\n\nAlgorithm 1 (KWIK-MBP-Agent) shows an apprenticeship learning agent built over KWIK-MBP learners LT and LR. Both of these model learners are instantiated to ensure the learned value function will have \u03b5/k accuracy for k \u2265 2 (for reasons discussed in the main theorem), which can be done by setting \u03b5R = \u03b5(1\u2212\u03b3)/16k and \u03b5T = \u03b5(1\u2212\u03b3)\u00b2/(16k\u03b3Vmax) (details follow the same form as standard connections between model learners and value function accuracy, for example in Theorem 3 from [7]). 
When planning with the subsequent model, the agent constructs a \u201cmixed\u201d interpretation, trusting the learner\u2019s predictions where mistakes might be made, but replacing (lines 13-14) all \u22a5 predictions from LR with a reward of Rmax and any \u22a5 predictions from LT with transitions to the Rmax trap state Smax. This has the effect of drawing the agent to explore explicitly uncertain regions (\u22a5) and to either explore on its own or rely on the teacher for areas where a mistake might be made. For instance, in the experiments in Figure 2 (left), discussed in depth later, a KWIK-MBP agent only requires traces for learning the pre-conditions in a noisy blocks world but uses autonomous exploration to discover the noise probabilities.\n\n4 Teaching by Demonstration with Explicit Communication\n\nThus far we have not discussed communication from the student to the teacher in KWIK-MBP-Agent (line 7). We now show that this communication is vital in keeping the TI bound small.\nExample 1. Suppose there was no communication in Algorithm 1 and the teacher provided a trace when \u03c0A was suboptimal. Consider a domain where the pre-conditions of actions are governed by a disjunction over the n state factors (if the disjunction fails, the action fails). Disjunctions can be learned with M = n/3 mistakes and K = 3n/2 \u2212 3M \u22a5 predictions [6]. However, that algorithm defaults to predicting \u201ctrue\u201d and only learns from negative examples. This optimistic interpretation means the agent will expect success, and can learn autonomously. However, the teacher will provide a trace to the agent since it sees it performing suboptimally during exploration. Such traces are unnecessary and uninformative (their positive examples are useless to LT).\n\nThis illustrates the need for student communication to give some indication of its internal model to the teacher. 
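The mixed interpretation of Section 3.2 can be sketched as follows. This is our own illustrative code: the dictionary-based tabular model, the names mixed_model/SMAX, and the convention that learners return None for \u22a5 are all assumptions, not details from the paper:

```python
RMAX = 1.0
SMAX = "s_max"  # absorbing trap state that yields RMAX forever

def mixed_model(states, actions, L_T, L_R):
    """Build the planned-over model (T_hat, R_hat) from model learners.

    L_T(s, a) -> dict of next-state probabilities, or None for "I don't know"
    L_R(s, a) -> expected reward, or None for "I don't know"
    Known predictions (where mistakes might still occur) are trusted;
    unknowns are replaced optimistically, drawing the planner toward
    explicitly uncertain regions.
    """
    T_hat, R_hat = {}, {}
    for s in states:
        for a in actions:
            r = L_R(s, a)
            R_hat[(s, a)] = RMAX if r is None else r
            t = L_T(s, a)
            T_hat[(s, a)] = {SMAX: 1.0} if t is None else t
    for a in actions:  # the trap state self-loops with maximal reward
        T_hat[(SMAX, a)] = {SMAX: 1.0}
        R_hat[(SMAX, a)] = RMAX
    return T_hat, R_hat
```

Any standard MDP planner run on (T_hat, R_hat) then prefers actions whose outcomes are still unknown, while mistaken-but-confident predictions are left for the teacher (or the agent's own experience) to correct.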
The protocol in Algorithm 1 captures this intuition by providing a channel (line 7) where the student communicates its expected utility UA. The teacher then only shows a trace to a pessimistic agent (line 8), but will \u201cstand back\u201d and let an over-confident student learn from its own mistakes. We note that there are many other possible forms of this communication, such as announcing the probability of reaching a goal or an equivalence query [10] type model, where the student exposes its entire internal model to the teacher. We focus here on the communication of utility, which is general enough for MDP domains but has low communication overhead.\n\n4.1 Theoretical Properties\n\nThe proof of the algorithm\u2019s TI bound appears below and is illustrated in Figure 1, but intuitively we show that if we force the student to (w.h.p.) learn an \u03b5/k-accurate value function for k \u2265 2 then we can guarantee traces where UA < VT \u2212 \u03b5/k will be helpful, but are not needed until UA is reported below VT \u2212 ((k\u22121)/k)\u03b5, at which point UA alone cannot guarantee that VA is within \u03b5 of VT and so a trace must be given. Because traces are only given when the student undervalues a potential policy, the number of traces is related only to the MBP portion of the KWIK-MBP bound, and more specifically to the number of pessimistic mistakes, defined as:\nDefinition 5. A mistake is pessimistic if and only if it causes some policy \u03c0 to be undervalued in the agent\u2019s model, that is in our case U\u03c0 < V\u03c0 \u2212 \u03b5/k.\nNote that by the construction of our model, KWIK-learnable parameters (\u22a5 replaced by Rmax-style interpretations) never result in such pessimistic mistakes. We can now state the following:\nTheorem 2. 
Algorithm 1 with KWIK-MBP learners will have a TI bound that is polynomial in 1/\u03b5, 1/\u03b4, 1/(1\u2212\u03b3) and P, where P is the number of pessimistic mistakes (P \u2264 M) made by LT and LR.\n\nFigure 1: The areas for UA and VA corresponding to the cases in the main theorem. In all cases UT \u2264 UA and when k = 2 the two dashed lines collapse together.\n\nProof. The proof stems from an expansion of the Explore-Explain-Exploit Lemma from [3]. That original lemma categorized the three possible outcomes of an episode in an apprenticeship learning setting where the teacher always gives a trace and with LT and LR built to learn V within \u03b5/2. The three possibilities for an episode were (1) exploration, when the agent\u2019s value estimate of \u03c0A is inaccurate, ||VA \u2212 UA|| > \u03b5/2, (2) exploitation, when the agent\u2019s prediction of its own return is accurate (||UA \u2212 VA|| \u2264 \u03b5/2) and the agent is near-optimal with respect to the teacher (VA \u2265 VT \u2212 \u03b5), and (3) explanation, when ||VA \u2212 UA|| \u2264 \u03b5/2, but VA < VT \u2212 \u03b5. Because both (1) and (3) provide samples to LT and LR, the number of times they can occur is bounded (in the original lemma) by the MBP bound on those learners, and in both cases a relevant sample is produced with high probability due to the simulation lemma (cf. Lemma 9 of [7]), which states that two different value returns from two MDPs imply that, with high probability, their parameters must be different.\nWe need to extend the lemma to cover our change in protocol (the teacher may not step in on every episode) and in evaluation criterion (TI bound instead of PAC-MDP-Trace). Specifically, we need to show: (i) The number of steps between traces where VA < VT \u2212 \u03b5 is polynomially bounded. (ii) Only a polynomial number of traces are given, and they are all guaranteed to improve some parameter in the agent\u2019s model with high probability. 
(iii) Only pessimistic mistakes (Definition 5) cause a teacher intervention. Note that properties (i) and (ii) imply that VA < VT \u2212 \u03b5 for only a polynomial number of episodes and correspond directly to the TI criterion from Definition 3. We now consider Algorithm 1 according to these properties in all of the cases from the original explore-exploit-explain lemma.\nWe begin with the Explain case, where VA < VT \u2212 \u03b5 and ||UA \u2212 VA|| \u2264 \u03b5/k. Combining these inequalities, we know UA < VT \u2212 \u03b5(k\u22121)/k, so a trace will definitely be provided. Since UT \u2264 UA (UT is the value of \u03c0T in the student\u2019s model and UA was optimal) we have at least UT < VT \u2212 \u03b5/k and the simulation lemma implies the trace will (with high probability) be helpful. Since there are a limited number of such mistakes (because LR and LT are KWIK-MBP learners) we have satisfied property (ii). Property (iii) is true because both \u03c0T and \u03c0A are undervalued.\nWe now consider the Exploit case, where VA \u2265 VT \u2212 \u03b5 and ||UA \u2212 VA|| \u2264 \u03b5/k. There are two possible situations here, because UA can either be larger or smaller than VT \u2212 \u03b5(k\u22121)/k. If UA \u2265 VT \u2212 \u03b5(k\u22121)/k then no trace is given, but the agent\u2019s policy is near optimal so property (i) is not violated. If UA < VT \u2212 \u03b5(k\u22121)/k, then a trace is given, even in this exploit case, because the teacher does not know VA and cannot distinguish this case from the \u201cexplain\u201d case above. 
However, this trace will still be helpful, because UT \u2264 UA, so at least UT < VT \u2212 \u03b5/k (satisfying iii), and again by the simulation lemma, the trace will help us learn a parameter and there are a limited number of such mistakes, so (ii) holds.\nFinally, we have the Explore case, where ||UA \u2212 VA|| > \u03b5/k. In that case, the agent\u2019s own experience will help it learn a parameter, but in terms of traces we have the following cases:\nUA \u2265 VT \u2212 \u03b5(k\u22121)/k and VA > UA + \u03b5/k. In this case no trace is given but we have VA > VT \u2212 \u03b5, so property (i) holds.\nUA \u2265 VT \u2212 \u03b5(k\u22121)/k and UA > VA + \u03b5/k. No trace is given here, but this is the classical exploration case (UA is optimistic, as in KWIK learning). Since UA and VA are sufficiently separated, the agent\u2019s own experience will provide a useful sample, and because all parameters are polynomially learnable, property (i) is satisfied.\nUA < VT \u2212 \u03b5(k\u22121)/k and either VA > UA + \u03b5/k or UA > VA + \u03b5/k. In either case, a trace will be provided but UT \u2264 UA so at least UT < VT \u2212 \u03b5/k and the trace will be helpful (satisfying property (ii)). Pessimistic mistakes are causing the trace (property iii) since \u03c0T is undervalued.\n\nOur result improves on previous results by attempting to minimize the number of traces while reasonably bounding exploration. The result also generalizes earlier apprenticeship learning results on \u03b5/2-accurate learners [3] to \u03b5/k-accuracy, while ensuring a more practical and stronger bound (TI instead of PAC-MDP-Trace). The choice of k in this situation is somewhat complicated. Larger k requires more accuracy of the learned model, but decreases the size of the \u201cbottom region\u201d above where a limited number of traces may be given to an already near-optimal agent. 
So increasing k can either increase or decrease the number of traces, depending on the exact problem instance.\n\n4.2 Experiments\n\nWe now present experiments in two domains. The first domain is a blocks world with dynamics based on stochastic STRIPS operators, a \u22121 step cost, and a goal of stacking the blocks. That is, the environment state is described as a set of grounded relations (e.g. On(a, b)) and actions are described by relational (with variables) operators that have conjunctive pre-conditions that must hold for the action to execute (e.g. putDown(X, To) cannot execute unless the agent is holding X and To is clear and a block). If the pre-conditions hold, then one of a set of possible effects (pairs of Add and Delete lists), chosen based on a probability distribution over effects, will change the current state. The actions in our blocks world are two versions of pickup(X, From) and two versions of putDown(X, To), with one version being \u201creliable\u201d, producing the expected result 80% of the time and otherwise doing nothing. The other version of each action has the probabilities reversed. The literals in the effects of the STRIPS operators (the Add and Delete lists) are given to the learning agents, but the pre-conditions and the probabilities of the effects need to be learned. This is an interesting case because the effect probabilities can be learned autonomously while the conjunctive pre-conditions (of sizes 3 and 4) require teacher input (like our combination lock example).\nFigure 2, column 1, shows KWIK, MBP, and KWIK-MBP agents as trained by a teacher who uses unreliable actions half the time. The KWIK learner never receives traces (since its expected utility, shown in 1a, is always high), but spends an amount of time exponential in the number of literals exploring the potential pre-conditions of actions (1b). In contrast, the MBP and KWIK-MBP agents use the first trace to learn the pre-conditions. 
The proportion of trials (out of 30) that the MBP and KWIK-MBP learners received teacher traces across episodes is shown in the bar graphs 1c and 1d of Fig. 2. The MBP learner continues to get traces for several episodes afterwards, using them to help learn the probabilities well after the pre-conditions are learned. This probability learning could be accomplished autonomously, but the MBP pessimistic value function prevents such exploration in this case. By contrast, KWIK-MBP receives 1 trace to learn the pre-conditions, and then explores the probabilities on its own. KWIK-MBP actually learns the probabilities faster than MBP because it targets areas it does not know about rather than relying on potentially redundant teacher samples. However, in rare cases KWIK-MBP receives additional traces; in fact there were two exceptions in the 30 trials, indicated by \u2217\u2019s at episodes 5 and 19 in 1d. The reason for this is that sometimes the learner may be unlucky and construct an inaccurate value estimate and the teacher then steps in and provides a trace.\nThe second domain is a variant of \u201cWumpus World\u201d with 5 locations in a chain, an agent who can move, fire arrows (unlimited supply) or pick berries (also unlimited), and a wumpus moving randomly. The domain is represented by a Dynamic Bayes Net (DBN) based on these factors and the reward is represented as a linear combination of the factor values (\u22125 for a live wumpus and +2 for picking a berry). The action effects are noisy, especially the probability of killing the wumpus, which depends on the exact (not just relative) locations of the agent, wumpus, and whether the wumpus is dead yet (three parent factors in the DBN). 
While the reward function is KWIK learnable through linear regression [7] and though DBN CPTs with small parent sizes are also KWIK learnable, the high connectivity of this particular DBN makes autonomous exploration of all the parent-value configurations prohibitive. Because of this, in our KWIK-MBP implementation, we combined a KWIK linear regression learner for LR with an MBP learner for LT that is given the DBN structure and learns the parameters from experience, but when entries in the conditional probability tables are the result of only a few data points, the learner predicts no change for this factor, which was generally a pessimistic outcome. We constructed an \u201coptimal hunting\u201d teacher that finds the best combination of locations to shoot the wumpus from/at, but ignores the berries. We concentrate on the ability of our algorithm to find a better policy than the teacher (i.e., learning to pick berries), while staying close enough to the teacher\u2019s traces that it can hunt the wumpus effectively.\nFigure 2, column 2, presents the results from this experiment. In plot 2a we see the predicted values of the three learners, while plot 2b shows their performance. The KWIK learner starts with high UA that gradually descends (in 2a), but without traces the agent spends most of its time exploring fruitlessly (the very slowly inclining slope of 2b). The MBP agent learns to hunt from the teacher and quickly achieves good behavior, but rarely learns to pick berries (only gaining experience on the reward of berries if it ends up in a completely unknown state and picks berries at random many times). The KWIK-MBP learner starts with high expected utility and explores the structure of just the reward function, discovering berries but not the proper location combinations for killing the wumpus. Its UA thus initially drops precipitously as it thinks all it can do is collect berries. 
Once\nthis crosses the teacher\u2019s threshold, the teacher steps in with a number of traces showing the best\nway to hunt the wumpus\u2014this is seen in plot 2d with the small bump in the proportion of trials\nwith traces, starting at episode 2 and declining roughly linearly until episode 10. The KWIK-MBP\nstudent is then able to \ufb01ll in the CPTs with information from the teacher and reach an optimal policy\nthat kills the wumpus and picks berries, avoiding both the over- and under-exploration of the KWIK\nand MBP agents. This increased overall performance is seen in plot 2b as KWIK-MBP\u2019s average\ncumulative reward surpasses MBP between episodes 5 and 10 .\n\nFigure 2: A plot matrix with rows (a) value predic-\ntions UA(s0), (b) average undiscounted cumulative\nreward and (c and d) the proportion of trials where\nMBP and KWIK-MBP received teacher traces. The\nleft column is Blocks World and the right a modi\ufb01ed\nWumpus World. Red corresponds to KWIK, blue to\nMBP, and black to KWIK-MBP.\n\n7\n\n051015202500.51MBPPr(Trace)051015202500.51Pr(Trace)EpisodesKWIK\u2212MBP0510152025\u221210\u22128\u22126\u22124\u221220Predicted ValuesBlocks World KWIKMBPKWIK\u2212MBP0510152025\u221216\u221214\u221212\u221210\u22128\u22126\u22124Avg Undiscounted Reward05101520253000.51MBPPr(Trace)05101520253000.51Pr(Trace)EpisodesKWIK\u2212MBP051015202530\u2212100\u2212500Predicted ValuesWumpus World KWIKMBPKWIK\u2212MBP051015202530\u221260\u221250\u221240\u221230\u221220\u2212100Avg Undiscounted RewardPr(Trace)Avg Cumulative RewardPr(Trace)Predicted ValuesEpisodesEpisodesMBPMBPKWIK-MBPKWIK-MBP**051015202500.51MBPPr(Trace)051015202500.51Pr(Trace)EpisodesKWIK\u2212MBP0510152025\u221210\u22128\u22126\u22124\u221220Predicted ValuesBlocks World KWIKMBPKWIK\u2212MBP0510152025\u221216\u221214\u221212\u221210\u22128\u22126\u22124Avg Undiscounted Reward05101520253000.51MBPPr(Trace)05101520253000.51Pr(Trace)EpisodesKWIK\u2212MBP051015202530\u2212100\u2212500Predicted ValuesWumpus 
5 Inferring Student Aptitude

We now describe a method for a teacher to infer the student's aptitude by using long periods without teacher intervention as observation phases. This interaction protocol extends Algorithm 1, but instead of using direct communication, the teacher allows the student to run some number of trajectories m from a fixed start state and then decides whether or not to show a trace.
We would like to show that the length m of each observation phase can be polynomially bounded while the system as a whole still maintains a good TI bound. We show below that such an m exists and is related to the PAC-MDP bound for a portion of the environment we call the zone of tractable exploration (ZTE). The ZTE (inspired by the zone of proximal development [11]) is the area of an MDP in which an agent with background knowledge B and model learners LT and LR can act with only a polynomial number of suboptimal steps, as judged within that area alone. Combining the ZTE, B, LT and LR induces a learning sub-problem in which the agent must learn to act as well as possible without the teacher's help.
Remark 1.
If the learning agent is KWIK-MBP and the observation phase has length m = A1 + A2, where A1 is the PAC-MDP bound for the ZTE and A2 is the number of trials, all starting from s0, needed to estimate VA(s0) (denoted V̂A) within accuracy ε/k for k ≥ 4, and the teacher only steps in when V̂A < VT − (k−1)ε/k, then the resulting interaction has a TI bound equivalent to the earlier one, although the student must wait m trials to get a trace from the teacher.
A1 trials are necessary because the agent may need to explore all the ⊥ or optimistic mistakes within the ZTE, and each episode might contain only one of the A1 suboptimal steps. Since each trajectory with a fixed policy results in an i.i.d. sample with mean VA, A2 can be polynomially bounded using a Chernoff bound [12]. Note that we require k ≥ 4 here (a stricter requirement than earlier). This is because we have errors of ||VA − V̂A|| ≤ ε/k and ||UA − VA|| ≤ ε/k, so V̂A needs to be at least 3ε/k below VT to ensure UA < VT − ε/k, and therefore that traces are helpful. But V̂A may also overestimate VA, leading to an extra ε/k slack term, and hence k ≥ 4.

6 Related Work and Conclusions

Our teaching protocol extends early apprenticeship learning work for linear MDPs [1], which showed that a polynomial number of upfront traces followed by greedy (not explicitly exploring) trajectories could achieve good behavior. Our protocol is similar to a recent "practice/critique" interaction [13] in which a teacher observed an agent and then labeled individual actions as "good" or "bad", but the teacher did not provide demonstrations in that work. Our setting differs from inverse reinforcement learning [4, 5] because our student can act better than the teacher, does not know the dynamics, and observes rewards.
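As an illustration of the observation-phase test in Remark 1, consider the following sketch. The function names, the use of a Hoeffding-style bound to size A2, and the assumption that returns lie in [0, v_max] are our own; this is not the authors' implementation.

```python
import math

def episodes_to_estimate(eps, k, delta, v_max):
    """Hoeffding/Chernoff-style bound: number of trials (A2) from s0 so the
    mean observed return estimates VA(s0) within eps/k with probability at
    least 1 - delta, assuming each return is bounded in [0, v_max]."""
    return math.ceil((v_max * k / eps) ** 2 * math.log(2.0 / delta) / 2.0)

def teacher_should_intervene(observed_returns, v_teacher, eps, k=4):
    """Teacher-side test from Remark 1: estimate the student's value from
    its observed returns and step in with a trace only when the estimate
    falls more than (k-1)/k * eps below the teacher's value VT."""
    assert k >= 4, "Remark 1 requires k >= 4"
    v_hat = sum(observed_returns) / len(observed_returns)
    return v_hat < v_teacher - (k - 1) * eps / k
```

For example, a student whose observed returns average far below VT triggers a trace, while one matching the teacher does not; the k ≥ 4 requirement leaves ε/k of slack for both estimation error and the gap between UA and VA.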
Studies have also examined humans providing shaping rewards as feedback to agents, rather than demonstrations as in our technique [14, 15].
Some works have taken a heuristic approach to mixing autonomous learning and teacher-provided trajectories, for instance in robot reinforcement learning domains [16] and for bootstrapping classifiers [17]. Many such approaches give all the teacher data at the beginning, while in our teaching protocol the teacher steps in only selectively, and our theoretical results ensure it does so only when its advice will have a significant effect.
We have shown how to use an extension of the KWIK-MB [6] framework (now KWIK-MBP) as the basis for model-based RL agents in the apprenticeship paradigm. These agents have a "mixed" interpretation of their learned models that admits a degree of autonomous exploration. Furthermore, introducing a communication channel from the student to the teacher, and having the teacher give traces only when VT is significantly better than UA, guarantees that the teacher will only provide demonstrations that teach concepts the agent could not tractably learn on its own, which has clear benefits when demonstrations are far more costly than exploration steps.

Acknowledgments

We thank Michael Littman and Lihong Li for discussions and DARPA-27001328 for funding.

References

[1] Pieter Abbeel and Andrew Y. Ng. Exploration and apprenticeship learning in reinforcement learning. In ICML, 2005.
[2] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.
[3] Thomas J. Walsh, Kaushik Subramanian, Michael L. Littman, and Carlos Diuk. Generalizing apprenticeship learning across hypothesis classes. In ICML, 2010.
[4] Pieter Abbeel and Andrew Y. Ng. Apprenticeship learning via inverse reinforcement learning. In ICML, 2004.
[5] Nathan Ratliff, David Silver, and J.
Bagnell. Learning to search: Functional gradient techniques for imitation learning. Autonomous Robots, 27:25–53, 2009.
[6] Amin Sayedi, Morteza Zadimoghaddam, and Avrim Blum. Trading off mistakes and don't-know predictions. In NIPS, 2010.
[7] Lihong Li, Michael L. Littman, Thomas J. Walsh, and Alexander L. Strehl. Knows what it knows: A framework for self-aware learning. Machine Learning, 82(3):399–443, 2011.
[8] Alexander L. Strehl, Lihong Li, and Michael L. Littman. Reinforcement learning in finite MDPs: PAC analysis. Journal of Machine Learning Research, 10:2413–2444, 2009.
[9] Nick Littlestone. Learning quickly when irrelevant attributes abound. Machine Learning, 2:285–318, 1988.
[10] Dana Angluin. Queries and concept learning. Machine Learning, 2(4):319–342, 1988.
[11] Lev Vygotsky. Interaction between learning and development. In Mind in Society. Harvard University Press, Cambridge, MA, 1978.
[12] Michael J. Kearns, Yishay Mansour, and Andrew Y. Ng. Approximate planning in large POMDPs via reusable trajectories. In NIPS, 1999.
[13] Kshitij Judah, Saikat Roy, Alan Fern, and Thomas G. Dietterich. Reinforcement learning via practice and critique advice. In AAAI, 2010.
[14] W. Bradley Knox and Peter Stone. Combining manual feedback with subsequent MDP reward signals for reinforcement learning. In AAMAS, 2010.
[15] Andrea Lockerd Thomaz and Cynthia Breazeal. Teachable robots: Understanding human teaching behavior to build more effective robot learners. Artificial Intelligence, 172(6-7):716–737, 2008.
[16] William D. Smart and Leslie Pack Kaelbling. Effective reinforcement learning for mobile robots. In ICRA, 2002.
[17] Sonia Chernova and Manuela Veloso. Interactive policy learning through confidence-based autonomy.
Journal of Artificial Intelligence Research, 34(1):1–25, 2009.