{"title": "A Reduction from Apprenticeship Learning to Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 2253, "page_last": 2261, "abstract": "We provide new theoretical results for apprenticeship learning, a variant of reinforcement learning in which the true reward function is unknown, and the goal is to perform well relative to an observed expert. We study a common approach to learning from expert demonstrations: using a classification algorithm to learn to imitate the expert's behavior. Although this straightforward learning strategy is widely-used in practice, it has been subject to very little formal analysis. We prove that, if the learned classifier has error rate $\\epsilon$, the difference between the value of the apprentice's policy and the expert's policy is $O(\\sqrt{\\epsilon})$. Further, we prove that this difference is only $O(\\epsilon)$ when the expert's policy is close to optimal. This latter result has an important practical consequence: Not only does imitating a near-optimal expert result in a better policy, but far fewer demonstrations are required to successfully imitate such an expert. This suggests an opportunity for substantial savings whenever the expert is known to be good, but demonstrations are expensive or difficult to obtain.", "full_text": "A Reduction from Apprenticeship Learning to Classification

Umar Syed*
Department of Computer and Information Science
University of Pennsylvania
Philadelphia, PA 19104
usyed@cis.upenn.edu

Robert E. Schapire
Department of Computer Science
Princeton University
Princeton, NJ 08540
schapire@cs.princeton.edu

Abstract

We provide new theoretical results for apprenticeship learning, a variant of reinforcement learning in which the true reward function is unknown, and the goal is to perform well relative to an observed expert. 
We study a common approach\nto learning from expert demonstrations: using a classi\ufb01cation algorithm to learn\nto imitate the expert\u2019s behavior. Although this straightforward learning strategy\nis widely-used in practice, it has been subject to very little formal analysis. We\nprove that, if the learned classi\ufb01er has error rate \u01eb, the difference between the\nvalue of the apprentice\u2019s policy and the expert\u2019s policy is O(\u221a\u01eb). Further, we\nprove that this difference is only O(\u01eb) when the expert\u2019s policy is close to optimal.\nThis latter result has an important practical consequence: Not only does imitating\na near-optimal expert result in a better policy, but far fewer demonstrations are\nrequired to successfully imitate such an expert. This suggests an opportunity for\nsubstantial savings whenever the expert is known to be good, but demonstrations\nare expensive or dif\ufb01cult to obtain.\n\n1\n\nIntroduction\n\nApprenticeship learning is a variant of reinforcement learning, \ufb01rst introduced by Abbeel & Ng [1]\n(see also [2, 3, 4, 5, 6]), designed to address the dif\ufb01culty of correctly specifying the reward function\nin many reinforcement learning problems. The basic idea underlying apprenticeship learning is that\na learning agent, called the apprentice, is able to observe another agent, called the expert, behaving\nin a Markov Decision Process (MDP). The goal of the apprentice is to learn a policy that is at least\nas good as the expert\u2019s policy, relative to an unknown reward function. This is a weaker requirement\nthan the usual goal in reinforcement learning, which is to \ufb01nd a policy that maximizes reward.\nThe development of the apprenticeship learning framework was prompted by the observation that,\nalthough reward functions are often dif\ufb01cult to specify, demonstrations of good behavior by an\nexpert are usually available. 
Therefore, by observing such an expert, one can infer information about the true reward function without needing to specify it.

Existing apprenticeship learning algorithms have a number of limitations. For one, they typically assume that the true reward function can be expressed as a linear combination of a set of known features. However, there may be cases where the apprentice is unwilling or unable to assume that the rewards have this structure. Additionally, most formulations of apprenticeship learning are actually harder than reinforcement learning; apprenticeship learning algorithms typically invoke reinforcement learning algorithms as subroutines, and their performance guarantees depend strongly on the quality of these subroutines. Consequently, these apprenticeship learning algorithms suffer from the same challenges of large state spaces, exploration vs. exploitation trade-offs, etc., as reinforcement learning algorithms. This fact is somewhat contrary to the intuition that demonstrations from an expert — especially a good expert — should make the problem easier, not harder.

*Work done while the author was a student at Princeton University.

Another approach to using expert demonstrations that has received attention primarily in the empirical literature is to passively imitate the expert using a classification algorithm (see [7, Section 4] for a comprehensive survey). Classification is the most well-studied machine learning problem, and it is sensible to leverage our knowledge about this "easier" problem in order to solve a more "difficult" one. However, there has been little formal analysis of this straightforward learning strategy (the main recent example is Ross & Bagnell [8], discussed below). 
In this paper, we consider a setting in which an apprentice uses a classification algorithm to passively imitate an observed expert in an MDP, and we bound the difference between the value of the apprentice's policy and the value of the expert's policy in terms of the accuracy of the learned classifier. Put differently, we show that apprenticeship learning can be reduced to classification. The idea of reducing one learning problem to another was first proposed by Zadrozny & Langford [9].

Our main contributions in this paper are a pair of theoretical results. First, we show that the difference between the value of the apprentice's policy and the expert's policy is O(√ε),¹ where ε ∈ (0, 1] is the error of the learned classifier. Secondly, and perhaps more interestingly, we extend our first result to prove that the difference in policy values is only O(ε) when the expert's policy is close to optimal. Of course, if one could perfectly imitate the expert, then naturally a near-optimal expert policy is preferred. But our result implies something further: that near-optimal experts are actually easier to imitate, in the sense that fewer demonstrations are required to achieve the same performance guarantee. This has important practical consequences. If one is certain a priori that the expert is demonstrating good behavior, then our result implies that many fewer demonstrations need to be collected than if this were not the case. This can yield substantial savings when expert demonstrations are expensive or difficult to obtain.

2 Related Work

Several authors have reduced reinforcement learning to simpler problems. Bagnell et al. [10] described an algorithm for constructing a good nonstationary policy from a sequence of good "one-step" policies. 
These policies are only concerned with maximizing reward collected in a single time step, and are learned with the help of observations from an expert. Langford & Zadrozny [11] reduced reinforcement learning to a sequence of classification problems (see also Blatt & Hero [12]), but these problems have an unusual structure, and the authors are only able to provide a small amount of guidance as to how data for these problems can be collected. Kakade & Langford [13] reduced reinforcement learning to regression, but required additional assumptions about how easily a learning algorithm can access the entire state space. Importantly, all this work makes the standard reinforcement learning assumptions that the true rewards are known, and that a learning algorithm is able to interact directly with the environment. In this paper we are interested in settings where the reward function is not known, and where the learning algorithm is limited to passively observing an expert. Concurrently to this work, Ross & Bagnell [8] have described an approach to reducing imitation learning to classification, and some of their analysis resembles ours. However, their framework requires somewhat more than passive observation of the expert, and is focused on improving the sensitivity of the reduction to the horizon length, not the classification error. They also assume that the expert follows a deterministic policy, an assumption we do not make.

3 Preliminaries

We consider a finite-horizon MDP, with horizon H. We will allow the state space S to be infinite, but assume that the action space A is finite. 
Let \u03b1 be the initial state distribution, and \u03b8 the transition\nfunction, where \u03b8(s, a,\u00b7) speci\ufb01es the next-state distribution from state s \u2208 S under action a \u2208 A.\nThe only assumption we make about the unknown reward function R is that 0 \u2264 R(s) \u2264 Rmax for\nall states s \u2208 S, where Rmax is a \ufb01nite upper bound on the reward of any state.\n\n1The big-O notation is concealing a polynomial dependence on other problem parameters. We give exact\n\nbounds in the body of the paper.\n\n2\n\n\fWe introduce some notation and de\ufb01nitions regarding policies. A policy \u03c0 is stationary if it is a\nmapping from states to distributions over actions. In this case, \u03c0(s, a) denotes the probability of\ntaking action a in state s. Let \u03a0 be the set of all stationary policies. A policy \u03c0 is nonstationary if it\nbelongs to the set \u03a0H = \u03a0 \u00d7 \u00b7\u00b7\u00b7 (H times)\u00b7\u00b7\u00b7 \u00d7 \u03a0 . In this case, \u03c0t(s, a) denotes the probability of\ntaking action a in state s at time t. Also, if \u03c0 is nonstationary, then \u03c0t refers to the stationary policy\nthat is equal to the tth component of \u03c0. A (stationary or nonstationary) policy \u03c0 is deterministic if\neach one of its action distributions is concentrated on a single action. If a deterministic policy \u03c0 is\nstationary, then \u03c0(s) is the action taken in state s, and if \u03c0 is nonstationary, the \u03c0t(s) is the action\ntaken in state s at time t.\n\nWe de\ufb01ne the value function V \u03c0\nmanner:\n\nt (s) for a nonstationary policy \u03c0 at time t as follows in the usual\n\nV \u03c0\n\nt (s) , E\" H\nXt\u2032=t\n\nR(st\u2032)(cid:12)(cid:12)(cid:12)\n\nst = s, at\u2032 \u223c \u03c0t\u2032 (st\u2032 ,\u00b7), st\u2032+1 \u223c \u03b8(st\u2032 , at\u2032 ,\u00b7)# .\n\nSo V \u03c0\nt (s) is the expected cumulative reward for following policy \u03c0 when starting at state s and time\nstep t. 
Note that there are several value functions per nonstationary policy, one for each time step t. The value of a policy is defined to be V(π) ≜ E[V^π_1(s) | s ∼ α(·)], and an optimal policy π* is one that satisfies π* ≜ arg max_π V(π).

We write π^E to denote the (possibly nonstationary) expert policy, and V^E_t(s) as an abbreviation for V^{π^E}_t(s). Our goal is to find a nonstationary apprentice policy π^A such that V(π^A) ≥ V(π^E). Note that the values of these policies are with respect to the unknown reward function.

Let D^π_t be the distribution on state-action pairs at time t under policy π. In other words, a sample (s, a) is drawn from D^π_t by first drawing s_1 ∼ α(·), then following policy π for time steps 1 through t, which generates a trajectory (s_1, a_1, . . . , s_t, a_t), and then letting (s, a) = (s_t, a_t). We write D^E_t as an abbreviation for D^{π^E}_t. In a minor abuse of notation, we write s ∼ D^π_t to mean: draw state-action pair (s, a) ∼ D^π_t, and discard a.

4 Details and Justification of the Reduction

Our goal is to reduce apprenticeship learning to classification, so let us describe exactly how this reduction is defined, and also justify the utility of such a reduction.

In a classification problem, a learning algorithm is given a training set ⟨(x_1, y_1), . . . , (x_m, y_m)⟩, where each labeled example (x_i, y_i) ∈ X × Y is drawn independently from a distribution D on X × Y. Here X is the example space and Y is the finite set of labels. The learning algorithm is also given the definition of a hypothesis class H, which is a set of functions mapping X to Y. 
The objective of the learning algorithm is to find a hypothesis h ∈ H such that the error Pr_{(x,y)∼D}(h(x) ≠ y) is small.

For our purposes, the hypothesis class H is said to be PAC-learnable if there exists a learning algorithm A such that, whenever A is given a training set of size m = poly(1/δ, 1/ε), the algorithm runs for poly(1/δ, 1/ε) steps and outputs a hypothesis ĥ ∈ H such that, with probability at least 1 − δ, we have Pr_{(x,y)∼D}(ĥ(x) ≠ y) ≤ ε*_{H,D} + ε. Here ε*_{H,D} = inf_{h∈H} Pr_{(x,y)∼D}(h(x) ≠ y) is the error of the best hypothesis in H. The expression poly(1/δ, 1/ε) will typically also depend on other quantities, such as the number of labels |Y| and the VC-dimension of H [14], but this dependence is not germane to our discussion.

The existence of PAC-learnable hypothesis classes is the reason that reducing apprenticeship learning to classification is a sensible endeavor. Suppose that the apprentice observes m independent trajectories from the expert's policy π^E, where the ith trajectory is a sequence (s^i_1, a^i_1, . . . , s^i_H, a^i_H). The key is to note that each (s^i_t, a^i_t) can be viewed as an independent sample from the distribution D^E_t. Now consider a PAC-learnable hypothesis class H, where H contains a set of functions mapping the state space S to the finite action space A. If m = poly(H/δ, 1/ε), then for each time step t, the apprentice can use a PAC learning algorithm for H to learn a hypothesis ĥ_t ∈ H such that, with probability at least 1 − δ/H, we have Pr_{(s,a)∼D^E_t}(ĥ_t(s) ≠ a) ≤ ε*_{H,D^E_t} + ε. And by the union
If each \u01eb\u2217\nnatural choice for the apprentice\u2019s policy \u03c0A is to set \u03c0A\nclassi\ufb01ers to imitate the behavior of the expert.\n\n+ \u01eb is small, then a\nt = \u02c6ht for all t. This policy uses the learned\n\nH,DE\nt\n\nIn light of the preceding discussion, throughout the remainder of this paper we make the following\nassumption about the apprentice\u2019s policy.\nAssumption 1. The apprentice policy \u03c0A is a deterministic policy\nPr(s,a)\u223cDE\n\nt (s) 6= a) \u2264 \u01eb for some \u01eb > 0 and all time steps t.\n\nsatis\ufb01es\n\n(\u03c0A\n\nthat\n\nt\n\nAs we have shown, an apprentice policy satisfying Assumption 1 with small \u01eb can be found with\nhigh probability, provided that expert\u2019s policy is well-approximated by a PAC-learnable hypothesis\nclass and that the apprentice is given enough trajectories from the expert. A reasonable intuition is\nthat the value of the policy \u03c0A in Assumption 1 is nearly as high as the value of the policy \u03c0E; the\nremainder of this paper is devoted to con\ufb01rming this intuition.\n\n5 Guarantee for Any Expert\n\nIf the error rate \u01eb in Assumption 1 is small, then the apprentice\u2019s policy \u03c0A closely imitates the\nexpert\u2019s policy \u03c0E, and we might hope that this implies that V (\u03c0A) is not much less than V (\u03c0E).\nThis is indeed the case, as the next theorem shows.\n\nTheorem 1. If Assumption 1 holds, then V (\u03c0A) \u2265 V (\u03c0E) \u2212 2\u221a\u01ebH 2Rmax.\n\nIn a typical classi\ufb01cation problem, it is assumed that the training and test examples are drawn from\nthe same distribution. The main challenge in proving Theorem 1 is that this assumption does not hold\nfor the classi\ufb01cation problems to which we have reduced the apprenticeship learning problem. 
This is because, although each state-action pair (s^i_t, a^i_t) appearing in an expert trajectory is distributed according to D^E_t, a state-action pair (s_t, a_t) visited by the apprentice's policy may not follow this distribution, since the behavior of the apprentice prior to time step t may not exactly match the expert's behavior. So our strategy for proving Theorem 1 will be to show that these differences do not cause the value of the apprentice policy to degrade too much relative to the value of the expert's policy.

Before proceeding, we will show that Assumption 1 implies a condition that is, for our purposes, more convenient.

Lemma 1. Let π̂ be a deterministic nonstationary policy. If Pr_{(s,a)∼D^E_t}(π̂_t(s) ≠ a) ≤ ε, then for all ε1 ∈ (0, 1] we have Pr_{s∼D^E_t}(π^E_t(s, π̂_t(s)) ≥ 1 − ε1) ≥ 1 − ε/ε1.

Proof. Fix any ε1 ∈ (0, 1], and suppose for contradiction that Pr_{s∼D^E_t}(π^E_t(s, π̂_t(s)) ≥ 1 − ε1) < 1 − ε/ε1. Say that a state s is good if π^E_t(s, π̂_t(s)) ≥ 1 − ε1, and that s is bad otherwise. Then

    Pr_{(s,a)∼D^E_t}(π̂_t(s) = a)
      = Pr_{s∼D^E_t}(s is good) · Pr_{(s,a)∼D^E_t}(π̂_t(s) = a | s is good)
        + Pr_{s∼D^E_t}(s is bad) · Pr_{(s,a)∼D^E_t}(π̂_t(s) = a | s is bad)
      ≤ Pr_{s∼D^E_t}(s is good) · 1 + (1 − Pr_{s∼D^E_t}(s is good)) · (1 − ε1)
      = 1 − ε1(1 − Pr_{s∼D^E_t}(s is good))
      < 1 − ε

where the first inequality holds because Pr_{(s,a)∼D^E_t}(π̂_t(s) = a | s is bad) ≤ 1 − ε1, and the second inequality holds because Pr_{s∼D^E_t}(s is good) < 1 − ε/ε1. This chain of inequalities clearly contradicts the assumption of the lemma.

The next two lemmas are the main tools used to prove Theorem 1. In the proofs of these lemmas, we write sa to denote a trajectory, where sa = (s̄_1, ā_1, . . . , s̄_H, ā_H) ∈ (S × A)^H. Also, let dP_π denote the probability measure induced on trajectories by following policy π, and let R(sa) = Σ_{t=1}^{H} R(s̄_t) denote the sum of the rewards of the states in trajectory sa. Importantly, using these definitions we have

    V(π) = ∫_{sa} R(sa) dP_π.

The next lemma proves that if a deterministic policy "almost" agrees with the expert's policy π^E in every state and time step, then its value is not much worse than the value of π^E.

Lemma 2. Let π̂ be a deterministic nonstationary policy. If for all states s and time steps t we have π^E_t(s, π̂_t(s)) ≥ 1 − ε, then V(π̂) ≥ V(π^E) − εH²Rmax.

Proof. 
Say a trajectory sa is good if it is "consistent" with π̂ — that is, π̂_t(s̄_t) = ā_t for all time steps t — and that sa is bad otherwise. We have

    V(π^E) = ∫_{sa} R(sa) dP_{π^E}
           = ∫_{sa good} R(sa) dP_{π^E} + ∫_{sa bad} R(sa) dP_{π^E}
           ≤ ∫_{sa good} R(sa) dP_{π^E} + εH²Rmax
           ≤ ∫_{sa good} R(sa) dP_{π̂} + εH²Rmax
           = V(π̂) + εH²Rmax

where the first inequality holds because, by the union bound, P_{π^E} assigns at most an εH fraction of its measure to bad trajectories, and the maximum reward of a trajectory is HRmax. The second inequality holds because good trajectories are assigned at least as much measure by P_{π̂} as by P_{π^E}, because π̂ is deterministic.

The next lemma proves a slightly different statement than Lemma 2: If a policy exactly agrees with the expert's policy π^E in "almost" every state and time step, then its value is not much worse than the value of π^E.

Lemma 3. Let π̂ be a nonstationary policy. If for all time steps t we have Pr_{s∼D^E_t}(π̂_t(s, ·) = π^E_t(s, ·)) ≥ 1 − ε, then V(π̂) ≥ V(π^E) − εH²Rmax.

Proof. Say a trajectory sa is good if π^E_t(s̄_t, ·) = π̂_t(s̄_t, ·) for all time steps t, and that sa is bad otherwise. We have

    V(π̂) = ∫_{sa} R(sa) dP_{π̂}
          = ∫_{sa good} R(sa) dP_{π̂} + ∫_{sa bad} R(sa) dP_{π̂}
          = ∫_{sa good} R(sa) dP_{π^E} + ∫_{sa bad} R(sa) dP_{π̂}
          = ∫_{sa} R(sa) dP_{π^E} − ∫_{sa bad} R(sa) dP_{π^E} + ∫_{sa bad} R(sa) dP_{π̂}
          ≥ V(π^E) − εH²Rmax + ∫_{sa bad} R(sa) dP_{π̂}
          ≥ V(π^E) − εH²Rmax.

The first inequality holds because, by the union bound, P_{π^E} assigns at most an εH fraction of its measure to bad trajectories, and the maximum reward of a trajectory is HRmax. The second inequality holds by our assumption that all rewards are nonnegative.

We are now ready to combine the previous lemmas and prove Theorem 1.

Proof of Theorem 1. Since the apprentice's policy π^A satisfies Assumption 1, by Lemma 1 we can choose any ε1 ∈ (0, 1] and have

    Pr_{s∼D^E_t}(π^E_t(s, π^A_t(s)) ≥ 1 − ε1) ≥ 1 − ε/ε1.

Now construct a "dummy" policy π̂ as follows: For all time steps t, let π̂_t(s, ·) = π^E_t(s, ·) for any state s where π^E_t(s, π^A_t(s)) ≥ 1 − ε1. On all other states, let π̂_t(s, π^A_t(s)) = 1. By Lemma 2,

    V(π^A) ≥ V(π̂) − ε1·H²Rmax,

and by Lemma 3,

    V(π̂) ≥ V(π^E) − (ε/ε1)·H²Rmax.

Combining these inequalities yields

    V(π^A) ≥ V(π^E) − (ε1 + ε/ε1)·H²Rmax.

Since ε1 was chosen arbitrarily, we set ε1 = √ε, which maximizes this lower bound.

6 Guarantee for Good Expert

Theorem 1 makes no assumptions about the value of the expert's policy. 
However, in many cases it may be reasonable to assume that the expert is following a near-optimal policy (indeed, if she is not, then we should question the decision to select her as an expert). The next theorem shows that the dependence of V(π^A) on the classification error ε is significantly better when the expert is following a near-optimal policy.

Theorem 2. If Assumption 1 holds, then V(π^A) ≥ V(π^E) − (4εH³Rmax + Δ_{π^E}), where Δ_{π^E} ≜ V(π*) − V(π^E) is the suboptimality of the expert's policy π^E.

Note that the bound in Theorem 2 varies with ε and not with √ε. We can interpret this bound as follows: If our goal is to learn an apprentice policy whose value is within Δ_{π^E} of the expert policy's value, we can double our progress towards that goal by halving the classification error rate. On the other hand, Theorem 1 suggests that the error rate must be reduced by a factor of four.

To see why a near-optimal expert policy should yield a weaker dependence on ε, consider an expert policy π^E that is an optimal policy, but in every state s ∈ S selects one of two actions a^s_1 and a^s_2 uniformly at random. A deterministic apprentice policy π^A that closely imitates the expert will either set π^A(s) = a^s_1 or π^A(s) = a^s_2, but in either case the classification error will not be less than 1/2. However, since π^E is optimal, both actions a^s_1 and a^s_2 must be optimal actions for state s, and so the apprentice policy π^A will be optimal as well.

Our strategy for proving Theorem 2 is to replace Lemma 2 with a different result — namely, Lemma 6 below — that has a much weaker dependence on the classification error ε when Δ_{π^E} is small. To help us prove Lemma 6, we will first need to define several useful policies. 
The next several definitions will be with respect to an arbitrary nonstationary base policy π^B; in the proof of Theorem 2, we will make a particular choice for the base policy.

Fix a deterministic nonstationary policy π^{B,ε} that satisfies

    π^B_t(s, π^{B,ε}_t(s)) ≥ 1 − ε

for some ε ∈ (0, 1] and all states s and time steps t. Such a policy always exists by letting ε = 1, but if ε is close to zero, then π^{B,ε} is a deterministic policy that "almost" agrees with π^B in every state and time step. Of course, depending on the choice of π^B, a policy π^{B,ε} may not exist for small ε, but let us set aside that concern for the moment; in the proof of Theorem 2, the base policy π^B will be chosen so that ε can be as small as we like.

Having thus defined π^{B,ε}, we define π^{B\ε} as follows: For all states s ∈ S and time steps t, if π^B_t(s, π^{B,ε}_t(s)) < 1, then let

    π^{B\ε}_t(s, a) = 0                                                    if π^{B,ε}_t(s) = a
    π^{B\ε}_t(s, a) = π^B_t(s, a) / Σ_{a′ ≠ π^{B,ε}_t(s)} π^B_t(s, a′)     otherwise

for all actions a ∈ A, and otherwise let π^{B\ε}_t(s, a) = 1/|A| for all a ∈ A. In other words, in each state s and time step t, the distribution π^{B\ε}_t(s, ·) is obtained by proportionally redistributing the probability assigned to action π^{B,ε}_t(s) by the distribution π^B_t(s, ·) to all other actions. The case where π^B_t(s, ·) assigns all probability to action π^{B,ε}_t(s) is treated specially, but as will be clear from the proof of Lemma 4, it is actually immaterial how the distribution π^{B\ε}_t(s, ·) is defined in these cases; we choose the uniform distribution for definiteness.

Let π^{B+} be a deterministic policy defined by

    π^{B+}_t(s) = arg max_a E[ V^{π^B}_{t+1}(s′) | s′ ∼ θ(s, a, ·) ]

for all states s ∈ S and time steps t. In other words, π^{B+}_t(s) is the best action in state s at time t, assuming that the policy π^B is followed thereafter.

The next definition requires the use of mixed policies. A mixed policy consists of a finite set of deterministic nonstationary policies, along with a distribution over those policies; the mixed policy is followed by drawing a single policy according to the distribution in the initial time step, and following that policy exclusively thereafter. More formally, a mixed policy is defined by a set of ordered pairs {(π^i, λ(i))}_{i=1}^{N} for some finite N, where each component policy π^i is a deterministic nonstationary policy, Σ_{i=1}^{N} λ(i) = 1 and λ(i) ≥ 0 for all i ∈ [N].

We define a mixed policy π̃^{B,ε,+} as follows: For each component policy π^i and each time step t, either π^i_t = π^{B,ε}_t or π^i_t = π^{B+}_t. There is one component policy for each possible choice; this yields N = 2^H component policies. And the probability λ(i) assigned to each component policy π^i is λ(i) = (1 − ε)^{k(i)} ε^{H−k(i)}, where k(i) is the number of time steps t for which π^i_t = π^{B,ε}_t.

Having established these definitions, we are now ready to prove several lemmas that will help us prove Theorem 2.

Lemma 4. V(π̃^{B,ε,+}) ≥ V(π^B).

Proof. The proof will be by backwards induction on t. 
Clearly V^{π̃^{B,ε,+}}_H(s) = V^{π^B}_H(s) for all states s, since the value function V^π_H for any policy π depends only on the reward function R. Now suppose for induction that V^{π̃^{B,ε,+}}_{t+1}(s) ≥ V^{π^B}_{t+1}(s) for all states s. Then for all states s

    V^{π̃^{B,ε,+}}_t(s)
      = R(s) + E[ V^{π̃^{B,ε,+}}_{t+1}(s′) | a′ ∼ π̃^{B,ε,+}_t(s, ·), s′ ∼ θ(s, a′, ·) ]
      ≥ R(s) + E[ V^{π^B}_{t+1}(s′) | a′ ∼ π̃^{B,ε,+}_t(s, ·), s′ ∼ θ(s, a′, ·) ]
      = R(s) + (1 − ε) E[ V^{π^B}_{t+1}(s′) | s′ ∼ θ(s, π^{B,ε}_t(s), ·) ]
               + ε E[ V^{π^B}_{t+1}(s′) | s′ ∼ θ(s, π^{B+}_t(s), ·) ]
      ≥ R(s) + π^B_t(s, π^{B,ε}_t(s)) · E[ V^{π^B}_{t+1}(s′) | s′ ∼ θ(s, π^{B,ε}_t(s), ·) ]
               + (1 − π^B_t(s, π^{B,ε}_t(s))) · E[ V^{π^B}_{t+1}(s′) | s′ ∼ θ(s, π^{B+}_t(s), ·) ]
      ≥ R(s) + π^B_t(s, π^{B,ε}_t(s)) · E[ V^{π^B}_{t+1}(s′) | s′ ∼ θ(s, π^{B,ε}_t(s), ·) ]
               + (1 − π^B_t(s, π^{B,ε}_t(s))) · E[ V^{π^B}_{t+1}(s′) | a′ ∼ π^{B\ε}_t(s, ·), s′ ∼ θ(s, a′, ·) ]
      = R(s) + E[ V^{π^B}_{t+1}(s′) | a′ ∼ π^B_t(s, ·), s′ ∼ θ(s, a′, ·) ]
      = V^{π^B}_t(s).

The first equality holds for all policies π, and follows straightforwardly from the definition of V^π_t. The rest of the derivation uses, in order: the inductive hypothesis; the definition of π̃^{B,ε,+}; the property of π^{B,ε} and the fact that π^{B+}_t(s) is the best action with respect to V^{π^B}_{t+1}; the fact that π^{B+}_t(s) is the best action with respect to V^{π^B}_{t+1}; the definition of π^{B\ε}; the definition of V^{π^B}_t(s).

Lemma 5. V(π̃^{B,ε,+}) ≤ (1 − εH)V(π^{B,ε}) + εHV(π*).

Proof. Since π̃^{B,ε,+} is a mixed policy, by the linearity of expectation we have

    V(π̃^{B,ε,+}) = Σ_{i=1}^{N} λ(i) V(π^i)

where each π^i is a component policy of π̃^{B,ε,+} and λ(i) is its associated probability. Therefore

    V(π̃^{B,ε,+}) = Σ_i λ(i) V(π^i)
                  ≤ (1 − ε)^H V(π^{B,ε}) + (1 − (1 − ε)^H) V(π*)
                  ≤ (1 − εH) V(π^{B,ε}) + εH V(π*).

Here we used the fact that probability (1 − ε)^H ≥ 1 − εH is assigned to a component policy that is identical to π^{B,ε}, and the value of any component policy is at most V(π*).

Lemma 6. 
If \u01eb < 1\n\nH , then V (\u03c0B,\u01eb) \u2265 V (\u03c0B) \u2212 \u01ebH\n\n1\u2212\u01ebH \u2206\u03c0B .\n\nProof. Combining Lemmas 4 and 5 yields\n\nAnd via algebraic manipulation we have\n\n(1 \u2212 \u01ebH)V (\u03c0B,\u01eb) + \u01ebHV (\u03c0\u2217) \u2265 V (\u03c0B).\n\n(1 \u2212 \u01ebH)V (\u03c0B,\u01eb) + \u01ebHV (\u03c0\u2217) \u2265 V (\u03c0B)\n\n\u21d2 (1 \u2212 \u01ebH)V (\u03c0B,\u01eb) \u2265 (1 \u2212 \u01ebH)V (\u03c0B) + \u01ebHV (\u03c0B) \u2212 \u01ebHV (\u03c0\u2217)\n\u21d2 (1 \u2212 \u01ebH)V (\u03c0B,\u01eb) \u2265 (1 \u2212 \u01ebH)V (\u03c0B) \u2212 \u01ebH\u2206\u03c0B\n\u21d2 V (\u03c0B,\u01eb) \u2265 V (\u03c0B) \u2212\n\n\u2206\u03c0B .\n\n\u01ebH\n\n1 \u2212 \u01ebH\n\nIn the last line, we were able to divide by (1 \u2212 \u01ebH) without changing the direction of the inequality\nbecause of our assumption that \u01eb < 1\nH .\n\nWe are now ready to combine the previous lemmas and prove Theorem 2.\n\nProof of Theorem 2. Since the apprentice\u2019s policy \u03c0A satis\ufb01es Assumption 1, by Lemma 1 we can\nchoose any \u01eb1 \u2208 (0, 1\n\nH ) and have\n\nPrs\u223cDE\n\nt (s, \u03c0A\n\nt (cid:0)\u03c0E\n\nt (s)) \u2265 1 \u2212 \u01eb1(cid:1) \u2265 1 \u2212 \u01eb\n\n\u01eb1\n\n.\n\nAs in the proof of Theorem 1, let us construct a \u201cdummy\u201d policy \u02c6\u03c0 as follows: For all time steps\nt, let \u02c6\u03c0t(s,\u00b7) = \u03c0E\nt (s)) \u2265 1 \u2212 \u01eb1. On all other states, let\n\u02c6\u03c0t(s, \u03c0A\n\nt (s,\u00b7) for any state s where \u03c0E\n\nt (s)) = 1. 
By Lemma 3 we have

V(π̂) ≥ V(π^E) − (ε/ε_1) H^2 R_max.    (1)

Substituting V(π^E) = V(π∗) − ∆_{π^E} and V(π̂) = V(π∗) − ∆_{π̂} and rearranging yields

∆_{π̂} ≤ ∆_{π^E} + (ε/ε_1) H^2 R_max.    (2)

Now observe that, if we set the base policy π^B = π̂, then by definition π^A is a valid choice for π^{B,ε_1}. And since ε_1 < 1/H we have

V(π^A) ≥ V(π̂) − (ε_1 H / (1 − ε_1 H)) ∆_{π̂}
  ≥ V(π̂) − (ε_1 H / (1 − ε_1 H)) (∆_{π^E} + (ε/ε_1) H^2 R_max)
  ≥ V(π^E) − (ε/ε_1) H^2 R_max − (ε_1 H / (1 − ε_1 H)) (∆_{π^E} + (ε/ε_1) H^2 R_max)    (3)

where we used Lemma 6, (2) and (1), in that order. Letting ε_1 = 1/(2H) proves the theorem.

References

[1] Pieter Abbeel and Andrew Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the 21st International Conference on Machine Learning, 2004.

[2] Pieter Abbeel and Andrew Y. Ng. Exploration and apprenticeship learning in reinforcement learning. In Proceedings of the 22nd International Conference on Machine Learning, 2005.

[3] Nathan D. Ratliff, J. Andrew Bagnell, and Martin A. Zinkevich. Maximum margin planning. In Proceedings of the 23rd International Conference on Machine Learning, 2006.

[4] Umar Syed and Robert E. Schapire. A game-theoretic approach to apprenticeship learning. In Advances in Neural Information Processing Systems 20, 2008.

[5] J. Zico Kolter, Pieter Abbeel, and Andrew Y. Ng. Hierarchical apprenticeship learning with application to quadruped locomotion. In Advances in Neural Information Processing Systems 20, 2008.

[6] Umar Syed and Robert E. Schapire. Apprenticeship learning using linear programming. In Proceedings of the 25th International Conference on Machine Learning, 2008.

[7] Brenna D. Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5):469–483, 2009.

[8] Stéphane Ross and J. Andrew Bagnell. Efficient reductions for imitation learning. In AISTATS, 2010.

[9] Bianca Zadrozny, John Langford, and Naoki Abe. Cost-sensitive learning by cost-proportionate example weighting. In Proceedings of the Third IEEE International Conference on Data Mining, 2003.

[10] J. Andrew Bagnell, Sham Kakade, Andrew Y. Ng, and Jeff Schneider. Policy search by dynamic programming. In Advances in Neural Information Processing Systems 15, 2003.

[11] John Langford and Bianca Zadrozny. Relating reinforcement learning performance to classification performance. In Proceedings of the 22nd International Conference on Machine Learning, 2005.

[12] Doron Blatt and Alfred Hero. From weighted classification to policy search. In Advances in Neural Information Processing Systems 18, pages 139–146, 2006.

[13] Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In Proceedings of the 19th International Conference on Machine Learning, 2002.

[14] V. N. Vapnik and A. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and Its Applications, 16:264–280, 1971.