{"title": "Imitation Learning by Coaching", "book": "Advances in Neural Information Processing Systems", "page_first": 3149, "page_last": 3157, "abstract": "Imitation Learning has been shown to be successful in solving many challenging real-world problems. Some recent approaches give strong performance guarantees by training the policy iteratively. However, it is important to note that these guarantees depend on how well the policy we found can imitate the oracle on the training data. When there is a substantial difference between the oracle's ability and the learner's policy space, we may fail to find a policy that has low error on the training set. In such cases, we propose to use a coach that demonstrates easy-to-learn actions for the learner and gradually approaches the oracle. By a reduction of learning by demonstration to online learning, we prove that coaching can yield a lower regret bound than using the oracle. We apply our algorithm to a novel cost-sensitive dynamic feature selection problem, a hard decision problem that considers a user-specified accuracy-cost trade-off. Experimental results on UCI datasets show that our method outperforms state-of-the-art imitation learning methods in dynamic feature selection and two static feature selection methods.", "full_text": "Imitation Learning by Coaching\n\nHe He Hal Daum\u00e9 III\n\nDepartment of Computer Science\n\nUniversity of Maryland\nCollege Park, MD 20740\n\n{hhe,hal}@cs.umd.edu\n\nJason Eisner\n\nDepartment of Computer Science\n\nJohns Hopkins University\n\nBaltimore, MD 21218\njason@cs.jhu.edu\n\nAbstract\n\nImitation Learning has been shown to be successful in solving many challenging\nreal-world problems. Some recent approaches give strong performance guaran-\ntees by training the policy iteratively. However, it is important to note that these\nguarantees depend on how well the policy we found can imitate the oracle on the\ntraining data. 
When there is a substantial difference between the oracle's ability and the learner's policy space, we may fail to find a policy that has low error on the training set. In such cases, we propose to use a coach that demonstrates easy-to-learn actions for the learner and gradually approaches the oracle. By a reduction of learning by demonstration to online learning, we prove that coaching can yield a lower regret bound than using the oracle. We apply our algorithm to cost-sensitive dynamic feature selection, a hard decision problem that considers a user-specified accuracy-cost trade-off. Experimental results on UCI datasets show that our method outperforms state-of-the-art imitation learning methods in dynamic feature selection and two static feature selection methods.

1 Introduction

Imitation learning has been successfully applied to a variety of applications [1, 2]. The standard approach is to use supervised learning algorithms and minimize a surrogate loss with respect to an oracle. However, this method ignores the difference between the distributions of states induced by executing the oracle's policy and the learner's, and thus incurs a loss that grows quadratically in the task horizon T. A recent approach called Dataset Aggregation [3] (DAgger) yields a loss linear in T by iteratively training the policy in states induced by all previously learned policies. Its theoretical guarantees are relative to the performance of the policy that best mimics the oracle on the training data. In difficult decision-making problems, however, it can be hard to find a good policy that has a low training error, since the oracle's policy may reside in a space that is not imitable in the learner's policy space. 
For instance, the task loss function can be highly non-convex in the learner's parameter space and very different from the surrogate loss.

When the optimal action is hard to achieve, we propose to coach the learner with easy-to-learn actions and let it gradually approach the oracle (Section 3). A coach trains the learner iteratively in a fashion similar to DAgger. At each iteration it demonstrates actions that the learner's current policy prefers and that have a small task loss. The coach becomes harsher, showing more oracle actions, as the learner makes progress. Intuitively, this allows the learner to move towards a better action without much effort. Thus our algorithm approaches the best action gradually instead of aiming at an impractical goal from the beginning. We analyze our algorithm by a reduction to online learning and show that our approach achieves a lower regret bound than DAgger, which uses the oracle action (Section 3.1). Our method is also related to direct loss minimization [4] for structured prediction and to methods of selecting oracle translations in machine translation [5, 6] (Section 5).

Our approach is motivated by a formulation of budgeted learning as a sequential decision-making problem [7, 8] (Section 4). In this setting, features are acquired at a cost, such as computation time or experiment expense. In dynamic feature selection, we would like to sequentially select a subset of features for each instance at test time according to a user-specified accuracy-cost trade-off. Experimental results show that coaching has a more stable training curve and achieves lower task loss than state-of-the-art imitation learning algorithms.

Our major contribution is a meta-algorithm for hard imitation learning tasks where the available policy space is not adequate for imitating the oracle. 
Our main theoretical result is Theorem 4, which states that coaching, as a smooth transition from the learner to the oracle, has a lower regret bound than using only the oracle.

2 Background

In a sequential decision-making problem, we have a set of states S, a set of actions A and a policy space Π. An agent follows a policy π : S → A that determines which action to take in a given state. After taking action a in state s, the environment responds with some immediate loss L(s, a). We assume L(s, a) is bounded in [0, 1]. The agent is then taken to the next state s' according to the transition probability P(s' | s, a). We denote by d_π^t the state distribution at time t after executing π from time 1 to t−1, and by d_π the average distribution of states over T steps. The T-step expected loss of π is then J(π) = Σ_{t=1}^T E_{s∼d_π^t}[L(s, π(s))] = T E_{s∼d_π}[L(s, π(s))]. A trajectory is a complete sequence of ⟨s, a, L(s, a)⟩ tuples from the starting state to a goal state. Our goal is to learn a policy π ∈ Π that minimizes the task loss J(π). We assume that Π is a closed, bounded and non-empty convex set in Euclidean space; a policy π can be parameterized by a vector w ∈ R^d.

In imitation learning, we define an oracle that executes policy π* and demonstrates the action a*_s = argmin_{a∈A} L(s, a) in state s. The learner only attempts to imitate the oracle's behavior, without any notion of the task loss function. 
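The quantities above can be made concrete with a small Monte-Carlo rollout. The toy decision problem below is purely illustrative (its states, losses, and transitions are our own assumptions, not the paper's):

```python
import random

T = 5                                 # task horizon
STATES = (0, 1, 2)

def loss(s, a):
    # immediate task loss L(s, a), bounded in [0, 1]
    return 0.0 if a == s % 2 else 1.0

def step(rng, s, a):
    # stochastic transition P(s' | s, a)
    return rng.choice([(s + a) % 3, (s + a + 1) % 3])

def J(policy, episodes=200, seed=0):
    """Monte-Carlo estimate of J(pi) = sum_{t=1}^T E_{s ~ d^t_pi}[L(s, pi(s))]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(episodes):
        s = 0                         # starting state
        for _ in range(T):
            a = policy(s)
            total += loss(s, a)
            s = step(rng, s, a)
    return total / episodes

oracle = lambda s: s % 2              # a*_s = argmin_a L(s, a)
print(J(oracle))                      # 0.0: the oracle incurs no task loss
```

A policy that always disagrees with the oracle incurs loss 1 at every step, i.e. J = T, illustrating how J(π) accumulates over the horizon.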
Thus minimizing the task loss is reduced to minimizing a surrogate loss with respect to the oracle's policy.

2.1 Imitation by Classification

A typical approach to imitation learning is to use the oracle's trajectories as supervised data and learn a policy (a multiclass classifier) that predicts the oracle action under the distribution of states induced by running the oracle's policy. At each step t, we collect a training example (s_t, π*(s_t)), where π*(s_t) is the oracle's action (class label) in state s_t. Let ℓ(s, π, π*(s)) denote the surrogate loss of executing π in state s with respect to π*(s). This can be any convex loss function used for training the classifier, for example, the hinge loss in SVMs. Using any standard supervised learning algorithm, we can learn a policy

π̂ = argmin_{π∈Π} E_{s∼d_π*}[ℓ(s, π, π*(s))].   (1)

We then bound J(π̂) based on how well the learner imitates the oracle. Assuming ℓ(s, π, π*(s)) is an upper bound on the 0-1 loss and L(s, a) is bounded in [0, 1], Ross and Bagnell [9] have shown that:

Theorem 1. Let E_{s∼d_π*}[ℓ(s, π̂, π*(s))] = ε. Then J(π̂) ≤ J(π*) + T²ε.

One drawback of the supervised approach is that it ignores the fact that the state distribution is different for the oracle and the learner. When the learner cannot mimic the oracle perfectly (i.e. a classification error occurs), the wrong action will change the following state distribution. Thus the learned policy is not able to handle situations where the learner follows a wrong path that is never chosen by the oracle, hence the quadratically increasing loss. 
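Eq. (1) reduces imitation to ordinary classification on oracle-visited states. The sketch below makes this concrete with a hypothetical linear oracle and a perceptron standing in for a generic convex-surrogate learner (all concrete numbers are illustrative assumptions):

```python
import random

rng = random.Random(0)

def oracle_action(s):
    # hypothetical oracle pi*: a linear decision rule unknown to the learner
    return 1 if s[0] - s[1] > 0 else -1

# Collect D = {(s, pi*(s))}: states labeled with oracle actions.
# A margin of 0.2 keeps this toy problem cleanly separable.
D = []
while len(D) < 200:
    s = (rng.uniform(-1, 1), rng.uniform(-1, 1))
    if abs(s[0] - s[1]) > 0.2:
        D.append((s, oracle_action(s)))

# Minimize a surrogate loss via classification: a perceptron update
# stands in for any convex-loss learner (e.g. hinge loss / SVM).
w = [0.0, 0.0]
for _ in range(150):
    for s, a_star in D:
        if (1 if w[0] * s[0] + w[1] * s[1] > 0 else -1) != a_star:
            w[0] += a_star * s[0]
            w[1] += a_star * s[1]

policy = lambda s: 1 if w[0] * s[0] + w[1] * s[1] > 0 else -1
errors = sum(policy(s) != a for s, a in D)
print(errors)   # 0: the learned policy imitates the oracle on D
```

Here the training error plays the role of ε in Theorem 1; the theorem's point is that even a small ε can translate into a T²ε gap at execution time.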
In fact, in the worst case, performance can approach random guessing, even for arbitrarily small ε [10].

Ross et al. [3] generalized Theorem 1 to any policy that has ε surrogate loss under its own state distribution, i.e. E_{s∼d_π}[ℓ(s, π, π*(s))] = ε. Let Q_t^{π'}(s, π) denote the t-step loss of executing π in the initial state and then running π'. We have the following:

Theorem 2. If Q_{T−t+1}^{π*}(s, a) − Q_{T−t+1}^{π*}(s, π*) ≤ u for all actions a and t ∈ {1, 2, . . . , T}, then J(π) ≤ J(π*) + uTε.

It basically says that when π chooses a different action from π* at time step t, if the cumulative cost due to this error is bounded by u, then the relative task loss is O(uT).

2.2 Dataset Aggregation

The above problem of insufficient exploration can be alleviated by iteratively learning a policy trained under states visited by both the oracle and the learner. For example, during training one can use a "mixture oracle" that at times takes an action given by the previously learned policy [11]. Alternatively, at each iteration one can learn a policy from trajectories generated by all previous policies [3].

In its simplest form, the Dataset Aggregation (DAgger) algorithm [3] works as follows. Let s_π denote a state visited by executing π. In the first iteration, we collect a training set D_1 = {(s_π*, π*(s_π*))} from the oracle (π_1 = π*) and learn a policy π_2. This is the same as the supervised approach to imitation. 
In iteration i, we collect trajectories by executing the previous policy π_i and form the training set D_i by labeling s_{π_i} with the oracle action π*(s_{π_i}); π_{i+1} is then learned on D_1 ∪ . . . ∪ D_i. Intuitively, this enables the learner to make up for past failures to mimic the oracle. Thus we can obtain a policy that performs well under its own induced state distribution.

2.3 Reduction to Online Learning

Let ℓ_i(π) = E_{s∼d_{π_i}}[ℓ(s, π, π*(s))] denote the expected surrogate loss of executing π in states distributed according to d_{π_i}. In an online learning setting, in iteration i an algorithm executes policy π_i and observes loss ℓ_i(π_i). It then provides a different policy π_{i+1} in the next iteration and observes ℓ_{i+1}(π_{i+1}). A no-regret algorithm guarantees that in N iterations

(1/N) Σ_{i=1}^N ℓ_i(π_i) − min_{π∈Π} (1/N) Σ_{i=1}^N ℓ_i(π) ≤ γ_N   (2)

and lim_{N→∞} γ_N = 0.

Assuming a strongly convex loss function, Follow-The-Leader is a simple no-regret algorithm. In each iteration it picks the policy that works best so far: π_{i+1} = argmin_{π∈Π} Σ_{j=1}^i ℓ_j(π). Similarly, in DAgger at iteration i we choose the policy that has the minimum surrogate loss on all previous data. Thus it can be interpreted as Follow-The-Leader where trajectories collected in each iteration are treated as one online-learning example.

Assume that ℓ(s, π, π*(s)) is a strongly convex loss in π upper bounding the 0-1 loss. We denote the sequence of learned policies π_1, π_2, . . . , π_N by π_{1:N}. Let ε_N = min_{π∈Π} (1/N) Σ_{i=1}^N E_{s∼d_{π_i}}[ℓ(s, π, π*(s))] be the minimum loss we can achieve in the policy space Π. In the infinite-sample-per-iteration case, following the proofs in [3] we have:

Theorem 3. For DAgger, if N is O(uT log T) and Q_{T−t+1}^{π*}(s, π) − Q_{T−t+1}^{π*}(s, π*) ≤ u, there exists a policy π ∈ π_{1:N} s.t. J(π) ≤ J(π*) + uTε_N + O(1).

This theorem holds for any no-regret online learning algorithm and can be generalized to the finite sample case as well.

3 Imitation by Coaching

An oracle can be hard to imitate in two ways. First, the learner's policy space is far from the space in which the oracle policy lies, meaning that the learner has only limited learning ability. Second, the environment information known by the oracle cannot be sufficiently inferred from the state, meaning that the learner does not have access to good learning resources. In the online learning setting, a too-good oracle may result in adversarially varying loss functions over iterations from the learner's perspective. This may cause drastic changes during policy updating. These difficulties result in a substantial gap between the oracle's performance and the best performance achievable in the policy space Π (i.e. a large ε_N in Theorem 3).

Algorithm 1 DAgger by Coaching
  Initialize D ← ∅
  Initialize π_1 ← π*
  for i = 1 to N do
    Sample T-step trajectories using π_i
    Collect coaching dataset D_i = {(s_{π_i}, argmax_{a∈A} λ_i · score_{π_i}(s_{π_i}, a) − L(s_{π_i}, a))}
    Aggregate datasets D ← D ∪ D_i
    Train policy π_{i+1} on D
  end for
  Return best π_i evaluated on validation set

To address this problem, we define a coach in place of the oracle. 
To better instruct the learner, a coach should demonstrate actions that are not much worse than the oracle action but are easier to achieve within the learner's ability. The lower an action's task loss is, the closer it is to the oracle action. The higher an action is ranked by the learner's current policy, the more it is preferred by the learner, and thus the easier it is to learn. Therefore, similar to [6], we define a hope action that combines the task loss and the score of the learner's current policy. Let score_{π_i}(s, a) be a measure of how likely π_i is to choose action a in state s. We define π̃_i by

π̃_i(s) = argmax_{a∈A} λ_i · score_{π_i}(s, a) − L(s, a)   (3)

where λ_i is a nonnegative parameter specifying how close the coach is to the oracle. In the first iteration, we set λ_1 = 0, as the learner has not learned any model yet. Algorithm 1 shows the training process. Our intuition is that when the learner has difficulty performing the optimal action, the coach should lower the goal properly and let the learner gradually achieve the original goal in a more stable way.

3.1 Theoretical Analysis

Let ℓ̃_i(π) = E_{s∼d_{π_i}}[ℓ(s, π, π̃_i(s))] denote the expected surrogate loss with respect to π̃_i. We denote by ε̃_N = (1/N) min_{π∈Π} Σ_{i=1}^N ℓ̃_i(π) the minimum loss of the best policy in hindsight with respect to hope actions. The main result of this paper is the following theorem:

Theorem 4. For DAgger with coaching, if N is O(uT log T) and Q_{T−t+1}^{π*}(s, π) − Q_{T−t+1}^{π*}(s, π*) ≤ u, there exists a policy π ∈ π_{1:N} s.t. J(π) ≤ J(π*) + uTε̃_N + O(1).

It is important to note that both the DAgger theorem and the coaching theorem provide a relative guarantee. 
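As an aside, Algorithm 1 together with the hope action of Eq. (3) can be sketched end-to-end on a toy one-dimensional problem. Everything concrete here (the scalar state, the linear score, the λ schedule, the perceptron-style trainer) is an illustrative assumption, not the paper's implementation:

```python
import math
import random

rng = random.Random(0)
ACTIONS = (0, 1)

def task_loss(s, a):
    # hypothetical task loss L(s, a): the oracle action is 1 iff s > 0
    return 0.0 if a == (1 if s > 0 else 0) else 1.0

def score(w, s, a):
    # score_pi(s, a): a linear model over a signed state feature
    return w * s * (1.0 if a == 1 else -1.0)

def hope_action(w, s, lam):
    # Eq. (3): argmax_a  lambda_i * score_pi(s, a) - L(s, a)
    return max(ACTIONS, key=lambda a: lam * score(w, s, a) - task_loss(s, a))

D, w, lam = [], 0.0, 0.0          # lambda_1 = 0: the coach starts as the oracle
for i in range(10):               # coaching iterations (Algorithm 1)
    states = [rng.uniform(-1, 1) for _ in range(50)]   # states under pi_i
    D += [(s, hope_action(w, s, lam)) for s in states] # label with the coach
    for s, a in D:                # train pi_{i+1} on the aggregated dataset
        if (1 if w * s > 0 else 0) != a:
            w += (1.0 if a == 1 else -1.0) * s
    lam = math.exp(-(i + 1))      # decay lambda: the coach approaches the oracle

policy = lambda s: 1 if w * s > 0 else 0
print(sum(task_loss(s, policy(s)) for s in (-0.5, 0.2, 0.9)))
```

On this easy problem the coach and the oracle quickly coincide; the construction only pays off when, as in the paper, the oracle action is hard to realize in the learner's policy space.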
They depend on whether we can find a policy that has small training error in each Follow-The-Leader step. In practice, however, for hard learning tasks DAgger may fail to find such a good policy. Through coaching, we can always adjust λ to create a more learnable oracle policy space, and thus obtain a relatively good policy that has small training error, at the price of running a few more iterations.

To prove this theorem, we first derive a regret bound for coaching, and then follow the proofs for DAgger.

We consider a policy π parameterized by a vector w ∈ R^d. Let φ : S × A → R^d be a feature map describing the state. The predicted action is

â_{π,s} = argmax_{a∈A} w^T φ(s, a)   (4)

and the hope action is

ã_{π,s} = argmax_{a∈A} λ · w^T φ(s, a) − L(s, a).   (5)

We assume that the loss function ℓ is a convex upper bound of the 0-1 loss. Further, it can be written as ℓ(s, π, π*(s)) = f(w^T φ(s, π(s)), π*(s)) for a function f : R → R, and the feature vectors are bounded: ‖φ(s, a)‖_2 ≤ R. We assume that f is twice differentiable and convex in w^T φ(s, π(s)), which is common for most loss functions used by supervised classification methods.

It has been shown that given a strongly convex loss function ℓ, Follow-The-Leader has O(log N) regret [12, 13]. More specifically, given the above assumptions we have:

Theorem 5. Let D = max_{w_1, w_2} ‖w_1 − w_2‖_2 be the diameter of the convex parameter set. 
For some b, m > 0, assume that for all w, we have |f′(w^T φ(s, a))| ≤ b and |f′′(w^T φ(s, a))| ≥ m. Then Follow-The-Leader on the functions ℓ_i has the following regret:

Σ_{i=1}^N ℓ_i(π_i) − min_{π∈Π} Σ_{i=1}^N ℓ_i(π) ≤ Σ_{i=1}^N [ℓ_i(π_i) − ℓ_i(π_{i+1})] ≤ (2nb²/m)[log(DRmN/b) + 1]

To analyze the regret using the surrogate loss with respect to hope actions, we use the following lemma:

Lemma 1. Σ_{i=1}^N ℓ̃_i(π_{i+1}) ≤ min_{π∈Π} Σ_{i=1}^N ℓ̃_i(π).

Proof. We prove this inductively. When N = 1, by Follow-The-Leader we have π_2 = argmin_{π∈Π} ℓ̃_1(π), thus ℓ̃_1(π_2) = min_{π∈Π} ℓ̃_1(π). Assume correctness for N − 1; then

Σ_{i=1}^N ℓ̃_i(π_{i+1}) = Σ_{i=1}^{N−1} ℓ̃_i(π_{i+1}) + ℓ̃_N(π_{N+1}) ≤ min_{π∈Π} Σ_{i=1}^{N−1} ℓ̃_i(π) + ℓ̃_N(π_{N+1})   (inductive assumption)
≤ Σ_{i=1}^{N−1} ℓ̃_i(π_{N+1}) + ℓ̃_N(π_{N+1}) = min_{π∈Π} Σ_{i=1}^N ℓ̃_i(π).

The last equality is due to the fact that π_{N+1} = argmin_{π∈Π} Σ_{i=1}^N ℓ̃_i(π).

To see how learning from π̃_i allows us to approach π*, we derive a bound on the regret Σ_{i=1}^N ℓ_i(π_i) − min_{π∈Π} Σ_{i=1}^N ℓ̃_i(π):

Theorem 6. 
Assume that w_i is upper bounded by C, i.e. for all i, ‖w_i‖_2 ≤ C, that ‖φ(s, a)‖_2 ≤ R, and that |L(s, a) − L(s, a′)| ≥ ε for any two actions a, a′ ∈ A. Assume λ_i is non-increasing, and define n_λ as the largest n < N such that λ_n ≥ ε/(2RC). Let ℓ_max be an upper bound on the loss, i.e. for all i, ℓ_i(s, π_i, π*(s)) ≤ ℓ_max. We have

Σ_{i=1}^N ℓ_i(π_i) − min_{π∈Π} Σ_{i=1}^N ℓ̃_i(π) ≤ 2ℓ_max n_λ + (2nb²/m)[log(DRmN/b) + 1]

Proof. Given Lemma 1, we only need to bound the RHS, which can be written as

(Σ_{i=1}^N ℓ_i(π_i) − ℓ̃_i(π_i)) + (Σ_{i=1}^N ℓ̃_i(π_i) − ℓ̃_i(π_{i+1})).   (6)

To bound the first term, we consider a binary action space A = {1, −1} for clarity. The proof can be extended to the general case in a straightforward manner.

Note that in states where a*_s = ã_{π,s}, we have ℓ(s, π, π*(s)) = ℓ(s, π, π̃(s)). Thus we only need to consider situations where a*_s ≠ ã_{π,s}:

ℓ_i(π_i) − ℓ̃_i(π_i) = E_{s∼d_{π_i}}[(ℓ_i(s, π_i, −1) − ℓ_i(s, π_i, 1)) 1{s : ã_{π_i,s} = 1, a*_s = −1}]
+ E_{s∼d_{π_i}}[(ℓ_i(s, π_i, 1) − ℓ_i(s, π_i, −1)) 1{s : ã_{π_i,s} = −1, a*_s = 1}]

In the binary case, we define ΔL(s) = L(s, 1) − L(s, −1) and Δφ(s) = φ(s, 1) − φ(s, −1).

Case 1: ã_{π_i,s} = 1 and a*_s = −1. Here ã_{π_i,s} = 1 implies λ_i w_i^T Δφ(s) ≥ ΔL(s), and a*_s = −1 implies ΔL(s) > 0. Together we have ΔL(s) ∈ (0, λ_i w_i^T Δφ(s)]. From this we know that w_i^T Δφ(s) ≥ 0 since λ_i > 0, which implies â_{π_i,s} = 1. Therefore we have

p(a*_s = −1, ã_{π_i,s} = 1, â_{π_i,s} = 1) = p(ã_{π_i,s} = 1 | a*_s = −1, â_{π_i,s} = 1) p(â_{π_i,s} = 1) p(a*_s = −1)
≤ p(λ_i ≥ ΔL(s)/(w_i^T Δφ(s))) · p(w_i^T Δφ(s) ≥ 0) · p(ΔL(s) > 0)
≤ p(λ_i ≥ ε/(2RC)) · 1 · 1 = p(λ_i ≥ ε/(2RC))

Since n_λ is the largest n < N such that λ_n ≥ ε/(2RC) and λ_i is non-increasing, we have

Σ_{i=1}^N E_{s∼d_{π_i}}[(ℓ_i(s, π_i, −1) − ℓ_i(s, π_i, 1)) 1{s : ã_{π_i,s} = 1, a*_s = −1}] ≤ ℓ_max n_λ

For example, let λ_i decrease exponentially, e.g. λ_i = λ_0 e^{−i}. If λ_0 < εe^N/(2RC), then n_λ = ⌈log(2λ_0RC/ε)⌉.

Case 2: ã_{π_i,s} = −1 and a*_s = 1. 
This is symmetrical to Case 1, and similar arguments yield the same bound.

In the online learning setting, imitating the coach means observing the loss ℓ̃_i(π_i) and learning a policy π_{i+1} = argmin_{π∈Π} Σ_{j=1}^i ℓ̃_j(π) at iteration i. This is indeed equivalent to Follow-The-Leader, except that we have replaced the loss function. Thus Theorem 5 gives the bound for the second term.

Equation 6 is then bounded by 2ℓ_max n_λ + (2nb²/m)[log(DRmN/b) + 1].

Now we can prove Theorem 4. Considering the best policy in π_{1:N}, we have

min_{π∈π_{1:N}} E_{s∼d_π}[ℓ(s, π, π*(s))] ≤ (1/N) Σ_{i=1}^N E_{s∼d_{π_i}}[ℓ(s, π_i, π*(s))] ≤ ε̃_N + 2ℓ_max n_λ/N + (2nb²/(mN))[log(DRmN/b) + 1]

When N is Ω(T log T), the regret is O(1/T). Applying Theorem 2 completes the proof.

4 Experiments

We apply imitation learning to a novel dynamic feature selection problem. We consider the setting where a pretrained model (data classifier) on a complete feature set is given and each feature has a known cost. At test time, we would like to dynamically select a subset of features for each instance and be able to explicitly specify the accuracy-cost trade-off. This can be naturally framed as a sequential decision-making problem. The state includes all features selected so far. The action space includes the set of non-selected features and a stop action. At each time step, the policy decides whether to stop acquiring features and make a prediction, and if not, which feature(s) to purchase next. Achieving an accuracy-cost trade-off corresponds to finding the optimal policy minimizing a loss function. 
We define the loss function as a combination of accuracy and cost:

L(s, a) = α · cost(s) − margin(a)   (7)

where margin(a) denotes the margin of classifying the instance after action a; cost(s) denotes the user-defined cost of all selected features in the current state s; and α is a user-specified trade-off parameter. Since we consider feature selection for each single instance here, the average margin reflects accuracy on the whole dataset.

Figure 1: 1(a) shows reward versus cost of DAgger and Coaching over 15 iterations on the digit dataset with α = 0.5. 1(b) to 1(d) show accuracy versus cost on the three datasets (1(b) radar, 1(c) digit, 1(d) segmentation). For DAgger and Coaching, we show results when α = 0, 0.1, 0.25, 0.5, 1.0, 1.5, 2.

4.1 Dynamic Feature Selection by Imitation Learning

Ideally, an oracle should lead to a subset of features having the maximum reward. However, the state space is too large to search exhaustively for the optimal subset of features. In addition, the oracle action may not be unique, since the optimal subset of features does not have to be selected in a fixed order. We address this problem by using a forward-selection oracle. Given a state s, the oracle iterates through the action space and calculates each action's loss; it then chooses the action that leads to the minimum immediate loss in the current state. We define φ(s_t, a) as a concatenation of the current feature vector and a meta-feature vector that provides information about previous classification results and cost.

In most cases, our oracle can achieve high accuracy with rather small cost. 
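The forward-selection oracle can be read as a greedy one-step lookahead over the loss in Eq. (7). A minimal sketch, where the costs, the margin gains, and the additive margin model itself are hypothetical stand-ins for a real classifier:

```python
ALPHA = 0.5
COSTS = {"f1": 0.2, "f2": 0.5, "f3": 0.1}     # user-defined feature costs
GAINS = {"f1": 0.3, "f2": 1.2, "f3": 0.04}    # margin gain per feature (toy)

def margin(selected):
    # stand-in for the classifier margin after acquiring `selected` features
    return sum(GAINS[f] for f in selected)

def loss(selected, action):
    # one reading of Eq. (7): L(s, a) = alpha * cost - margin after taking a
    new = selected | ({action} if action != "stop" else set())
    return ALPHA * sum(COSTS[f] for f in new) - margin(new)

def forward_oracle(features):
    """Greedily take the action with minimum immediate loss until stop wins."""
    selected = set()
    while True:
        candidates = ["stop"] + [f for f in features if f not in selected]
        best = min(candidates, key=lambda a: loss(selected, a))
        if best == "stop":
            return selected
        selected.add(best)

print(sorted(forward_oracle(COSTS)))   # cost-efficient features win: ['f1', 'f2']
```

With these toy numbers the oracle buys the two cost-efficient features and stops before the weak, overpriced one, mirroring the accuracy-cost trade-off that α controls.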
Consider a linear classifier: since the oracle already knows the correct class label of an instance, it can simply choose, for example, a feature with a positive weight to correctly classify a positive instance. In addition, at the start state, even when φ(s_0, a) is almost the same for all instances, the oracle may tend to choose features that favor the instance's class. This makes the oracle's behavior very hard to imitate. In the next section we show that in this case coaching achieves better results than using an oracle.

4.2 Experimental Results

We perform experiments on three UCI datasets (radar signal, digit recognition, image segmentation). Random costs are assigned to features. We first compare the learning curves of DAgger and Coaching over 15 iterations on the digit dataset with α = 0.5 in Figure 1(a). We can see that DAgger makes a big improvement in the second iteration, while Coaching takes smaller steps but achieves higher reward gradually. In addition, the reward of Coaching changes smoothly and grows stably, which means coaching avoids drastic changes of the policy.

To test the effect of dynamic selection, we compare our results with DAgger and with two static feature selection baselines that sequentially add features according to a ranked list. The first baseline (denoted by Forward) ranks features according to the standard forward feature selection algorithm, without any notion of the cost. 
The second baseline (denoted by |w|/cost) uses a cost-sensitive ranking scheme based on |w|/cost, the weight of a feature divided by its cost. Features with high scores are therefore expected to be cost-efficient. We give the results in Figures 1(b) to 1(d). To get results of our dynamic feature selection algorithm at different costs, we set α in the loss function to 0.0, 0.1, 0.25, 0.5, 1.0, 1.5, 2.0 and use the best policy evaluated on the development set for each α. For coaching, we set λ_2 = 1 and decay it by a factor of e^{−1} in each iteration. First, we can see that dynamically selecting features for each instance significantly improves the accuracy at a small cost. Sometimes, it even achieves higher accuracy than using all features. Second, we notice that there is a substantial gap between the learned policy's performance and the oracle's; however, in almost all settings Coaching achieves higher reward, i.e. higher accuracy at a lower cost, as shown in the figures. Through coaching, we can reduce the gap by taking small steps towards the oracle. However, the learned policy is still much worse than the oracle's policy. This is because coaching is still inherently limited by the insufficient policy space, which could be addressed by using expensive kernels and nonlinear policies.

5 Related Work

The idea of using hope actions is similar to what Chiang et al. [6] and Liang et al. [5] have used for selecting oracle translations in machine translation. They maximized a linear combination of the BLEU score (similar to the negative task loss in our case) and the model score to find good translations that are easier to train against. More recently, McAllester et al. 
[4] defined the direct label, which combines model score and task loss, from a different view: they showed that using a perceptron-like training method and updating towards the direct label is equivalent to performing gradient descent on the task loss.

Coaching is also similar to proximal methods in online learning [14, 15]. These avoid large changes during updating and achieve the original goal gradually. In proximal regularization, we want the updated parameter vector to stay close to the previous one. Do et al. [14] showed that solving the original learning problem through a sequence of modified optimization tasks whose objectives have greater curvature can achieve a lower regret bound.

6 Conclusion and Future Work

In this paper, we consider the situation in imitation learning where an oracle's performance is far from what is achievable in the learner's policy space. We propose a coaching algorithm that lets the learner target easier goals first and gradually approach the oracle. We show theoretically that coaching yields a lower regret bound, and we observe its benefits empirically. In the future, we are interested in formally defining the hardness of a problem so that we know exactly in which cases coaching is more suitable than DAgger. Another direction is to develop a similar coaching process in online convex optimization by optimizing a sequence of approximating functions. We are also interested in applying coaching to more complex structured prediction problems in natural language processing and computer vision.

References

[1] P. Abbeel and A. Y. Ng. Apprenticeship learning via inverse reinforcement learning. In ICML, 2004.

[2] B. D. Argall, S. Chernova, M. Veloso, and B. Browning. A survey of robot learning from demonstration. 2009.

[3] Stéphane Ross, Geoffrey J. Gordon, and J. Andrew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In AISTATS, 2011.

[4] D. 
McAllester, T. Hazan, and J. Keshet. Direct loss minimization for structured prediction. In NIPS, 2010.

[5] P. Liang, A. Bouchard-Côté, D. Klein, and B. Taskar. An end-to-end discriminative approach to machine translation. In ACL, 2006.

[6] D. Chiang, Y. Marton, and P. Resnik. Online large-margin training of syntactic and structural translation features. In EMNLP, 2008.

[7] D. Benbouzid, R. Busa-Fekete, and B. Kégl. Fast classification using space decision DAGs. In ICML, 2012.

[8] G. Dulac-Arnold, L. Denoyer, P. Preux, and P. Gallinari. Datum-wise classification: a sequential approach to sparsity. In ECML, 2011.

[9] Stéphane Ross and J. Andrew Bagnell. Efficient reductions for imitation learning. In AISTATS, 2010.

[10] Kääriäinen. Lower bounds for reductions. In Atomic Learning Workshop, 2006.

[11] Hal Daumé III, John Langford, and Daniel Marcu. Search-based structured prediction. Machine Learning Journal (MLJ), 2009.

[12] Elad Hazan, Adam Kalai, Satyen Kale, and Amit Agarwal. Logarithmic regret algorithms for online convex optimization. In COLT, pages 499-513, 2006.

[13] Sham M. Kakade and Shai Shalev-Shwartz. Mind the duality gap: Logarithmic regret algorithms for online optimization. In NIPS, 2008.

[14] C. B. Do, Q. Le, and C. S. Foo. Proximal regularization for online and batch learning. In ICML, 2009.

[15] H. Brendan McMahan. Follow-the-regularized-leader and mirror descent: Equivalence theorems and L1 regularization. JMLR, 15:525-533, 2011.
", "award": [], "sourceid": 1449, "authors": [{"given_name": "He", "family_name": "He", "institution": null}, {"given_name": "Jason", "family_name": "Eisner", "institution": null}, {"given_name": "Hal", "family_name": "Daume", "institution": null}]}