{"title": "Mo' States Mo' Problems: Emergency Stop Mechanisms from Observation", "book": "Advances in Neural Information Processing Systems", "page_first": 15182, "page_last": 15192, "abstract": "In many environments, only a relatively small subset of the complete state space is necessary in order to accomplish a given task. We develop a simple technique using emergency stops (e-stops) to exploit this phenomenon. Using e-stops significantly improves sample complexity by reducing the amount of required exploration, while retaining a performance bound that efficiently trades off the rate of convergence with a small asymptotic sub-optimality gap. We analyze the regret behavior of e-stops and present empirical results in discrete and continuous settings demonstrating that our reset mechanism can provide order-of-magnitude speedups on top of existing reinforcement learning methods.", "full_text": "Mo' States Mo' Problems: Emergency Stop Mechanisms from Observation

Samuel Ainsworth, Matt Barnes, Siddhartha Srinivasa
School of Computer Science and Engineering, University of Washington
{skainswo,mbarnes,siddh}@cs.washington.edu

Abstract

In many environments, only a relatively small subset of the complete state space is necessary in order to accomplish a given task. We develop a simple technique using emergency stops (e-stops) to exploit this phenomenon. Using e-stops significantly improves sample complexity by reducing the amount of required exploration, while retaining a performance bound that efficiently trades off the rate of convergence with a small asymptotic sub-optimality gap. 
We analyze the regret behavior of e-stops and present empirical results in discrete and continuous settings demonstrating that our reset mechanism can provide order-of-magnitude speedups on top of existing reinforcement learning methods.

1 Introduction

In this paper, we consider the problem of determining when, along a training roll-out, feedback from the environment is no longer beneficial and an intervention such as resetting the agent to the initial state distribution is warranted. We show that such interventions can naturally trade off a small sub-optimality gap for a dramatic decrease in sample complexity. In particular, we focus on the reinforcement learning setting in which the agent has access to a reward signal in addition to either (a) an expert supervisor triggering the e-stop mechanism in real time or (b) expert state-only demonstrations used to "learn" an automatic e-stop trigger. Both settings fall into the same framework.

Evidence already suggests that simple, manually designed heuristic resets can dramatically improve training time. For example, the classic pole-balancing problem originally introduced in Widrow and Smith [25] prematurely terminates an episode and resets to an initial distribution whenever the pole exceeds some fixed angle off-vertical. More subtly, such manually designed reset rules are hard-coded into many popular OpenAI gym environments [7].

Some recent approaches have demonstrated empirical success in learning when to intervene, whether in the form of resetting, collecting expert feedback, or falling back to a safe policy [8, 16, 19, 14]. We specifically study reset mechanisms which are more natural for human operators to provide – in the form of large red buttons, for example – and thus perhaps less noisy than action or value feedback [5]. Further, we show how to build automatic reset mechanisms from state-only observations, which are often widely available, e.g. in the form of videos [24].

The key idea of our method is to build a support set related to the expert's state-visitation probabilities, and to terminate the episode with a large penalty when the agent leaves this set, as visualized in Fig. 1. This support set defines a modified MDP and can either be constructed implicitly via an expert supervisor triggering e-stops in real time or constructed a priori from observation-only roll-outs of an expert policy. As we will show, using a support set explicitly restricts exploration to a smaller state space while maintaining guarantees on the learner's performance. 

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: A robot is tasked with reaching a goal in a cluttered environment. Our method allows incorporating e-stop interventions into any reinforcement learning algorithm. The grey support set Ŝ may either be implicit (from a supervisor) or, if available, explicitly constructed from demonstrations. (Legend: e-stop; learner π; expert πe; support set Ŝ.)
We emphasize that our technique for incorporating observations applies to any reinforcement learning algorithm in either continuous or discrete domains.

The contributions and organization of the remainder of the paper are as follows.

• We provide a general framework for incorporating arbitrary emergency stop (e-stop) interventions from a supervisor into any reinforcement learning algorithm, using the notion of support sets, in Section 4.
• We present methods and analysis for building support sets from observations in Section 5, allowing for the creation of automatic e-stop devices.
• In Section 6 we empirically demonstrate on benchmark discrete and continuous domains that our reset mechanism allows us to naturally trade off a small asymptotic sub-optimality gap for significantly improved convergence rates with any reinforcement learning method.
• Finally, in Section 7, we generalize the concept of support sets to a spectrum of set types and discuss their respective tradeoffs.

2 Related Work

The problem of learning when to intervene has been studied in several contexts and generally falls under the framework of safe reinforcement learning [10] or reducing expert feedback [16]. Richter and Roy [19] use an auto-encoder as an anomaly detector to determine when a high-dimensional state is anomalous, and revert to a safe policy. Laskey et al. [16] use a one-class SVM as an anomaly detector, but instead for the purpose of reducing the amount of imitation learning feedback during DAGGER training [20]. Garcia and Fernández [9] perturb a baseline policy and request action feedback if the current state exceeds a minimum distance from any demonstration. Geramifard et al. [11] assume access to a function which indicates whether a state is safe, and determine the risk of the current state by Monte Carlo roll-outs. Similarly, "shielding" [3] uses a manually specified safety constraint and a coarse, conservative abstraction of the dynamics to prevent an agent from violating the safety constraint. Eysenbach et al. [8] learn a second "soft reset" policy (in addition to the standard "hard" reset) which prevents the agent from entering nearly non-reversible states and returns the agent to an initial state. Hard resets are required whenever the soft reset policy fails to terminate in a manually defined set of safe states S_reset. Our method can be seen as learning S_reset from observation. Their method trades off hard resets for soft resets, whereas ours learns when to perform the hard resets.

The general problem of Learning from Demonstration (LfD) has been studied in a variety of contexts. In inverse reinforcement learning, Abbeel and Ng [1] assume access to state-only trajectory demonstrations and attempt to learn an unknown reward function. In imitation learning, Ross et al. [20] study the distribution mismatch problem of behavior cloning and propose DAGGER, which collects action feedback at states visited by the current policy. GAIL addresses the problem of imitating a set of fixed trajectories by minimizing the Jensen-Shannon divergence between the policies' average state-action distributions [13]. This reduces to optimizing a GAN-style minimax objective with a reinforcement learning update (the generator) and a divergence estimator (the discriminator).

The setting most similar to ours is Reinforcement Learning with Expert Demonstrations (RLED), where we observe both the expert's states and actions in addition to a reward function. 
Abbeel and Ng [2] use state-action trajectory demonstrations to initialize a model-based RL algorithm, which eliminates the need for explicit exploration and can avoid visiting all of the state-action space. Smart and Kaelbling [22] bootstrap Q-values from expert state-action demonstrations. Maire and Bulitko [17] initialize any value-function-based RL algorithm using the shortest observed path from each state to the goal, and generalize these results to unvisited states via a graph Laplacian. Nagabandi et al. [18] learn a model from state-action demonstrations and use model-predictive control to initialize a model-free RL agent via behavior cloning. It is possible to extend our method to RLED by constructing a support superset based on state-action pairs, as described in Section 7. Thus, our method and many RLED methods are complementary. For example, DQfD [12] allows pre-training the policy from the state-action demonstrations, whereas ours reduces exploration during the on-policy learning phase.

Most existing techniques for bootstrapping RL are not applicable to our setting because they (a) require state-action observations, (b) require online expert feedback, (c) solve a reinforcement learning problem in the original state space, incurring the same complexity as simply solving the original RL problem, or (d) provide no guarantees, even in the tabular setting. Further, since our method is equivalent to a one-time modification of the underlying MDP, it can be used to improve any existing reinforcement learning algorithm and may be combined with other bootstrapping methods.

3 Problem setup

Let M = ⟨S, A, P, R, H, ρ₀⟩ be a finite-horizon, episodic Markov decision process (MDP), where S is a set of states, A is a set of actions, P : S × A → Δ(S) is the transition probability distribution (Δ(·) denotes the probability simplex over a space), R : S × A × S → [0, 1] is the reward function, H is the time horizon, and ρ₀ ∈ Δ(S) is the distribution of the initial state s₀. Let π ∈ Π : ℕ_{1:H} × S → Δ(A) be our learner's policy and πe the potentially sub-optimal expert policy (for now, assume the realizable setting, πe ∈ Π).

The state distribution of policy π at time t is defined recursively as

  ρ^{t+1}_π(s) = Σ_{s_t, a_t} ρ^t_π(s_t) π(a_t | s_t, t) P(s_t, a_t, s),    (1)
  ρ⁰_π(s) ≡ ρ₀(s).    (2)

The expected sum of rewards over a single episode is defined as

  J(π) = H · E_{s ∼ ρ_π, a ∼ π(s), s′ ∼ P(s,a)} R(s, a, s′),

where ρ_π denotes the average state distribution, ρ_π = (1/H) Σ_{t=0}^{H−1} ρ^t_π.

Our objective is to learn a policy π which minimizes a notion of external regret over K episodes. Let T = KH denote the total number of time steps elapsed, (r₁, …, r_T) the sequence of rewards generated by running algorithm A in M, and R_T = Σ_{t=1}^{T} r_t the cumulative reward. The T-step expected regret of A in M compared to the expert is defined as

  Regret_A^M(T) := E^{πe}_M [R_T] − E^A_M [R_T].    (3)

Typical regret bounds in the discrete setting are some polynomial of the relevant quantities |S|, |A|, T, and H. We assume we are given such an RL algorithm. Later, we assume access to either a supervisor who can provide e-stop feedback or a set of demonstration roll-outs D = {τ⁽¹⁾, …, τ⁽ⁿ⁾} of an expert policy πe in M, and show how these can affect the regret bounds. 
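As a concrete illustration of the recursion in Eqs. (1)-(2) and the definition of J(π), the following sketch computes the time-indexed state distributions and episode return of a tabular policy with NumPy. The array layout and function names are our own illustrative choices, not part of the paper.

```python
import numpy as np

def state_distributions(P, pi, rho0, H):
    """Compute rho^t_pi for t = 0..H-1 via the recursion in Eqs. (1)-(2).

    P:    (S, A, S) transition tensor, P[s, a, s'] = P(s, a, s')
    pi:   (H, S, A) time-dependent policy, pi[t, s, a] = pi(a | s, t)
    rho0: (S,) initial state distribution rho_0
    """
    S = P.shape[0]
    rhos = np.empty((H, S))
    rhos[0] = rho0  # rho^0_pi = rho_0, Eq. (2)
    for t in range(H - 1):
        # rho^{t+1}(s) = sum_{s_t, a_t} rho^t(s_t) pi(a_t|s_t, t) P(s_t, a_t, s)
        sa = rhos[t][:, None] * pi[t]            # joint state-action occupancy at time t
        rhos[t + 1] = np.einsum("sa,sap->p", sa, P)
    return rhos

def expected_return(P, R, pi, rho0, H):
    """J(pi): expected sum of rewards over one episode; R has shape (S, A, S)."""
    rhos = state_distributions(P, pi, rho0, H)
    J = 0.0
    for t in range(H):
        sa = rhos[t][:, None] * pi[t]
        J += np.einsum("sa,sap,sap->", sa, P, R)  # E[R(s, a, s')] at time t
    return J
```

Summing the per-step expectations over t = 0..H−1 is equivalent to the H · E_{s∼ρ_π}[·] form above, since ρ_π is the average of the ρ^t_π.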
In particular, we are interested in using D to decrease the effective size of the state space S, thereby reducing the amount of required exploration when learning in M.

4 Incorporating e-stop interventions

In the simplest setting, we have access to an external supervisor who provides minimal online feedback in the form of an e-stop device, triggered whenever the agent visits states outside of some to-be-determined set Ŝ ⊆ S. For example, if a mobile robot is navigating across a cluttered room as in Fig. 1, a reasonable policy will rarely collide with objects or navigate into other rooms unrelated to the current task, and the supervisor may trigger the e-stop device if the robot exhibits either of those behaviors. The analysis for other support types (e.g. state-action, time-dependent, visitation count) is similar, and their trade-offs are discussed in Section 7.

4.1 The sample complexity and asymptotic sub-optimality trade-off

We argue that for many practical MDPs, ρ_πe is near-zero in much of the state space, and constructing an appropriate Ŝ enables efficiently trading off asymptotic sub-optimality for a potentially significantly improved convergence rate. Given some reinforcement learning algorithm A, we proceed by running A on a newly constructed "e-stop" MDP M̂ = ⟨Ŝ, A, P_Ŝ, R_Ŝ, H, ρ₀⟩. Intuitively, whenever the current policy leaves Ŝ, the e-stop prematurely terminates the current episode with no further reward (the maximum penalty). The new transition and reward functions are defined as

  P_Ŝ(s_t, a_t, s_{t+1}) = P(s_t, a_t, s_{t+1})        if s_{t+1} ∈ Ŝ,
  P_Ŝ(s_t, a_t, s_term)  = Σ_{s′ ∉ Ŝ} P(s_t, a_t, s′),
  P_Ŝ(s_t, a_t, s_{t+1}) = 0                            otherwise;
  R_Ŝ(s_t, a_t, s_{t+1}) = R(s_t, a_t, s_{t+1})        if s_{t+1} ∈ Ŝ,
  R_Ŝ(s_t, a_t, s_{t+1}) = 0                            otherwise,    (4)

where s_term is an absorbing state with no reward. A similar idea was discussed for the imitation learning problem in [21].

The key trade-off we attempt to balance is between asymptotic sub-optimality and reinforcement learning regret:

  Regret(T) ≤ ⌈T/H⌉ [J(π*) − J(π̂*)]  +  E^{π̂*}_M̂ [R_T] − E^A_M̂ [R_T],    (5)

where the first term is the asymptotic sub-optimality, the second is the learning regret, and π* and π̂* are the optimal policies in M and M̂, respectively (proof in Appendix A). The first term is due to the approximation error introduced when constructing M̂, and depends entirely on our choice of Ŝ. The second term is the familiar reinforcement learning regret; e.g., Azar et al. [4] recently proved an upper regret bound of √(H|S||A|T) + H²|S|²|A|. We refer the reader to Kakade et al. [15] for an overview of state-of-the-art regret bounds in episodic MDPs.

Our focus is primarily on the first term, which in turn decreases the learning regret of the second term via |Ŝ| (typically quadratically). This forms the basis for the key performance trade-off. We introduce bounds for the first term under various conditions, which inform our proposed methods. By intelligently modifying M through e-stop interventions, we can decrease the required exploration and allow for the early termination of uninformative, low-reward trajectories. Note that the reinforcement learning complexity of A is now independent of S, and instead depends on Ŝ according to the same polynomial factors. Depending on the MDP and expert policy, this set may be significantly smaller than the full set. In return, we pay a worst-case asymptotic sub-optimality penalty.

An additional benefit of the e-stop MDP M̂ not captured in Eq. (5) is the ability to terminate trajectories upon entering state s_term. 
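The construction in Eq. (4) can be realized in practice as a thin environment wrapper: run any off-the-shelf agent, and whenever the next state leaves Ŝ, end the episode with zero reward, mimicking the transition to the absorbing state s_term. A minimal sketch, assuming a Gym-style `reset()`/`step()` interface returning the classic 4-tuple; the class and predicate names are our own:

```python
class EStopWrapper:
    """Sketch of the e-stop MDP of Eq. (4) as a wrapper around any
    Gym-style environment (duck-typed: only reset()/step() are assumed).

    `in_support(state)` is a user-supplied predicate for membership in the
    support set S-hat. Leaving the set terminates the episode with zero
    reward for that transition, i.e. entering the absorbing state s_term.
    """

    def __init__(self, env, in_support):
        self.env = env
        self.in_support = in_support

    def reset(self):
        return self.env.reset()

    def step(self, action):
        state, reward, done, info = self.env.step(action)
        if not self.in_support(state):
            # E-stop triggered: maximum penalty (no reward), episode over.
            reward = 0.0
            done = True
            info = dict(info, estop=True)
        return state, reward, done, info
```

Because the wrapper is a one-time modification of the environment, it composes with any reinforcement learning algorithm unchanged.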
No learning is required for this state due to our perfect knowledge of its rewards and dynamics.

4.2 Perfect e-stops

To begin, consider an idealized setting, Ŝ = {s | h(s) > 0} (or some superset thereof), where h(s) is the probability that πe visits state s at any point during an episode. Then the modified MDP M̂ has an optimal policy which achieves at least the same reward as πe on the true MDP M.

Theorem 4.1. Suppose M̂ is an e-stop variant of M such that Ŝ = {s | h(s) > 0}, where h(s) denotes the probability of hitting state s in a roll-out of πe. Let π̂* = argmax_{π∈Π} J_M̂(π) be the optimal policy in M̂. Then J(π̂*) ≥ J(πe).

In other words, and not surprisingly, if the expert policy never visits a state, then we pay no penalty for removing it. (Note that we could equivalently have selected Ŝ = {s | ρ_πe(s) > 0}, since ρ_πe(s) > 0 if and only if h(s) > 0.)

Algorithm 1 Resetting based on demonstrator trajectories
1: procedure LEARNEDESTOP(M, A, πe, n, ξ)
2:   Roll out multiple trajectories from πe: D ← [s₁⁽¹⁾, …, s_H⁽¹⁾], …, [s₁⁽ⁿ⁾, …, s_H⁽ⁿ⁾]
3:   Estimate the hitting probabilities: ĥ(s) = (1/n) Σᵢ I{s ∈ τ⁽ⁱ⁾}. (Or ρ̂ in continuous settings.)
4:   Construct the smallest Ŝ allowed by the Σ_{s∈S\Ŝ} ĥ(s) ≤ ξ constraint.
5:   Add e-stops, resulting in a modified MDP M̂, with P_Ŝ(s, a, s′), R_Ŝ(s, a, s′) given by Eq. (4)
6:   return A(M̂)

4.3 Imperfect e-stops

In a more realistic setting, consider what happens when we "remove" (i.e. s ∉ Ŝ) states that have low but non-zero probability of visitation under πe. This can happen by "accident" if the supervisor interventions are noisy or we incorrectly estimate the visitation probability to be zero. Alternatively, this can be done intentionally to trade off asymptotic performance for better sample complexity, in which case we remove states with known low but non-zero visitation probability.

Theorem 4.2. Consider M̂, an e-stop variation on MDP M, with state spaces Ŝ and S, respectively. Given an expert policy πe, let h(s) denote the probability of visiting state s at least once in an episode roll-out of policy πe in M. Then

  J(πe) − J(π̂*) ≤ H Σ_{s∈S\Ŝ} h(s),    (6)

where π̂* is the optimal policy in M̂. Naturally, if we satisfy some "allowance" ξ such that Σ_{s∈S\Ŝ} h(s) ≤ ξ, then J(πe) − J(π̂*) ≤ ξH.

Corollary 4.2.1. Recall that ρ_πe(s) denotes the average state distribution following actions from πe, ρ_πe(s) = (1/H) Σ_{t=0}^{H−1} ρ^t_πe(s). Then

  J(πe) − J(π̂*) ≤ ρ_πe(S \ Ŝ) H².    (7)

In other words, removing states with non-zero hitting probability introduces error into the policy π̂* according to the visitation probabilities h.

Remark. The primary slack in these bounds is due to upper-bounding the expected cumulative reward of a given state trajectory by H. Although this bound is necessary in the worst case, it is worth noting that performance is much stronger in practice. In non-adversarial settings, the expected cumulative reward of a state sequence τ is correlated with the visitation probabilities of the states along its path: very low-reward trajectories tend to have low visitation probabilities, assuming sensible expert policies. 
We opted against making any assumptions about the correlation between h(s) and the value function V(s), so this remains an interesting direction for future work.

5 Learning from observation

In the previous section, we considered how to incorporate general e-stop interventions, which could take the form of an expert supervisor or some other learned e-stop device. Here, we propose and analyze a method for building such a learned e-stop trigger from state observations of an expert demonstrator. This is especially relevant for domains where action observations are unavailable (e.g. videos).

Consider the setting where we observe n roll-outs τ⁽¹⁾, …, τ⁽ⁿ⁾ of a demonstrator policy πe in M. We can estimate the hitting probability h(s) empirically as ĥ(s) = (1/n) Σᵢ I{s ∈ τ⁽ⁱ⁾}. Theorem 4.2 then suggests constructing Ŝ by removing the states of S with the lowest ĥ(s) values, as long as is allowed by the Σ_{s∈S\Ŝ} ĥ(s) ≤ ξ constraint. In other words, we should attempt to remove as many states as possible while respecting our "budget" ξ. The procedure is summarized in Algorithm 1. In practice, implementing Algorithm 1 is even simpler: pick Ŝ, take any off-the-shelf implementation, and simply end training roll-outs whenever the state leaves Ŝ.

Figure 2: Left: Value iteration results with varying portions of the state space replaced with e-stops. Color denotes the portion of states that have been replaced. Note that significant performance improvements may be realized before the optimal policy reward is meaningfully affected. Middle: Q-learning results with and without the e-stop mechanism. Right: Actor-critic results with and without the e-stop mechanism. Both plots show results across 100 trials. We observe that e-stopping produces drastic improvements in sample efficiency while introducing only a small sub-optimality gap.

Theorem 5.1. The e-stop MDP M̂ with states Ŝ in Algorithm 1 has asymptotic sub-optimality

  J(πe) − J(π̂*) ≤ (ξ + ε)H    (8)

with probability at least 1 − |S| e^{−2ε²n/|S|²}, for any ε > 0. Here ξ denotes our approximate state-removal "allowance," where we satisfy Σ_{s∈S\Ŝ} ĥ(s) ≤ ξ in our construction of M̂ as in Theorem 4.2.

As expected, there exists a tradeoff between the number of trajectories collected n, the state-removal allowance ξ, and the asymptotic sub-optimality gap J(πe) − J(π̂*). In practice we find performance to be fairly robust to n, as well as to the quality of the expert policy. See Section 6.1 for experimental results measuring the impact of each of these variables.

Note that although this analysis applies only to the discrete setting, the same method can be extended to the continuous case by estimating and thresholding on ρ_πe(s) in place of h(s), as implied by Corollary 4.2.1. In Section 6.2 we provide empirical results in continuous domains.

6 Empirical study

6.1 Discrete environments

We evaluate LEARNEDESTOP on a modified FrozenLake-v0 environment from the OpenAI gym. This environment is highly stochastic: for example, taking a left action moves the character up, left, or down, each with probability 1/3. To illustrate our ability to evade states that are low-value but non-terminating, we additionally allow the agent to "escape" the holes in the map and follow the usual dynamics with probability 0.01. As in the original problem, the goal state is terminal and the agent receives a reward of 1 upon reaching the goal and 0 elsewhere. 
To encourage the agent to reach the goal quickly, we use a discount factor of γ = 0.99.

Across all of our experiments, we observe that algorithms modified with our e-stop mechanism are far more sample-efficient thanks to our ability to abandon episodes that do not match the behavior of the expert. We witnessed these benefits across both planning and reinforcement learning algorithms, and with both tabular and policy-gradient-based techniques.

Although replacing states with e-stops introduces a small sub-optimality gap, practical users need not despair: any policy trained in a constrained e-stop environment is portable to the full environment. Therefore, using e-stops to warm-start learning on the full environment may provide a "best of both worlds" scenario. Annealing this process could have a comparable effect.

Value iteration. To elucidate the relationship between the size of the support set Ŝ and the sub-optimality gap J(πe) − J(π̂*), we run value iteration on e-stop environments with progressively more e-stop states. First, the optimal policy with respect to the full environment is computed and treated as the expert policy πe. Next, we calculate ρ_πe(s) for all states. By progressively thresholding on ρ_πe(s), we produce sparser and sparser e-stop variants of the original environment. The results of value iteration on each of these variants are shown in Fig. 2 (left). Lines are colored according to the portion of states removed from the original environment, with darker lines indicating more aggressive pruning.

Figure 3: Left: E-stop results based on sub-optimal expert policies. Right: The number of expert trajectories used to construct Ŝ vs. the final performance in the e-stop environment. E-stop results appear quite robust to poor experts and limited demonstrations.

As expected, we observe a tradeoff: decreasing the size of Ŝ introduces sub-optimality but speeds up convergence. Once pruning becomes too aggressive, it begins to remove states crucial to reaching the goal, and J(πe) − J(π̂*) is more severely impacted as a result.

RL results. To evaluate the potential of e-stops for accelerating reinforcement learning methods, we ran LEARNEDESTOP from Algorithm 1 with the optimal policy as πe. Half of the states with the lowest hitting probabilities were replaced with e-stops. Finally, we ran classic RL algorithms on the resulting e-stop MDP. Fig. 2 (middle) presents our Q-learning results, demonstrating that removing half of the states has a minor effect on asymptotic performance but dramatically improves the convergence rate. We also found the e-stop technique to be an effective means of accelerating policy gradient methods. Fig. 2 (right) presents results using one-step actor-critic with a tabular value-function critic [23]. In both cases, we witnessed drastic speedups with the use of e-stops relative to running on the full environment.

Expert sub-optimality. The bounds presented in Section 4.3 are all in terms of J(πe) − J(π̂*), prompting the question: to what extent is e-stop performance dependent on the quality of the expert, J(πe)? Is it possible to exceed the performance of πe as in Theorem 4.1, even with "imperfect" e-stops? To address these questions, we artificially created sub-optimal policies by adding noise to the optimal policy's Q-function. Next, we used these sub-optimal policies to construct e-stop MDPs and calculated J(π̂*). As shown in Fig. 3 (left), e-stop performance is quite robust to expert quality. Ultimately, we only need to capture a "good enough" set of states for the e-stop policy to succeed.

Estimation error. 
The sole source of error in Algorithm 1 comes from the estimation of hitting probabilities via a finite set of n expert roll-outs. Theorem 5.1 suggests that the probability of failure in empirically building an e-stop MDP decays exponentially in the number of roll-outs, n. We test the relationship between n and J(π̂*) experimentally in Fig. 3 (right) and find that, in this particular case, it is possible to construct very good e-stop MDPs with as few as 10 expert roll-outs.

6.2 Continuous environments

To experimentally evaluate the power of e-stops in continuous domains, we took two classic continuous control problems, inverted pendulum control and the HalfCheetah-v3 environment from the OpenAI gym [7], and evaluated the performance of a deep reinforcement learning algorithm in the original environments as well as in modified versions of the environments with e-stops.

Although e-stops are more amenable to analysis in discrete MDPs, nothing fundamentally prevents them from being applied to continuous environments in a principled fashion. The notion of state-hitting probabilities h(s) is meaningless in continuous spaces, but the stationary (infinite-horizon) or average-state (finite-horizon) distribution ρ_πe(s) is well-defined, and many techniques exist for density estimation in continuous spaces. Applying these techniques along with Corollary 4.2.1 provides fairly solid theoretical grounds for using e-stops in continuous problems.

Figure 4: Left: DDPG results on the pendulum environment. Right: Results on the HalfCheetah-v3 environment from the OpenAI gym. All experiments were repeated with 48 different random seeds. Note that in both cases e-stop agents converged much more quickly and with lower variance than their full-environment counterparts.

For the sake of simplicity, we implemented e-stops as min/max bounds on state values. For each environment, we trained a number of policies in the full environments, measured their performance, and calculated e-stop min/max bounds based on roll-outs of the resulting best policy. We found that even an approach as simple as this can be surprisingly effective at improving sample complexity and stabilizing the learning process.

Inverted pendulum. In this environment, the agent is tasked with balancing an inverted pendulum from a random semi-upright starting position and velocity, applying rotational torque to the pendulum to control its movement. We found that without any intervention, the agent would initially just spin the pendulum as quickly as possible and only eventually learn to actually balance it. The e-stop version of the problem was not tempted by this strange behavior: the agent quickly learned to keep the rotational velocity at reasonable levels, and therefore converged far faster and more reliably, as shown in Fig. 4 (left).

Half cheetah. Fig. 4 (right) shows results on the HalfCheetah-v3 environment, in which the agent is tasked with controlling a 2-dimensional cheetah model to run as quickly as possible. Again, the e-stop agents converged much more quickly and reliably, to solutions meaningfully superior to the DDPG policies trained on the full environment. Many policies trained without e-stop interventions ended up trapped in local minima, e.g. flipping the cheetah over and scooting instead of running. Because e-stops eliminated these states altogether, policies trained in the e-stop regime consistently outperformed policies trained in the standard environment.

Broadly, we found training with e-stops to be far faster and more robust than without. In our experiments, we considered support sets Ŝ to be axis-aligned boxes in the state space. 
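An axis-aligned-box support set of this kind can be fit directly from expert roll-outs: take per-dimension min/max bounds over all observed expert states and e-stop whenever any coordinate leaves its range. A minimal sketch of that construction; the function names and the optional `margin` slack parameter are our own illustrative choices:

```python
import numpy as np

def fit_box(expert_states, margin=0.0):
    """Per-dimension min/max bounds over all expert states: one way to
    realize the axis-aligned-box support sets described above. `margin`
    optionally loosens the box (our own addition, not from the paper)."""
    states = np.asarray(expert_states, dtype=float)  # shape (N, d)
    lo = states.min(axis=0) - margin
    hi = states.max(axis=0) + margin
    return lo, hi

def in_box(state, lo, hi):
    """E-stop trigger: False as soon as any coordinate leaves its bounds."""
    s = np.asarray(state, dtype=float)
    return bool(np.all(s >= lo) and np.all(s <= hi))
```

During training, one simply ends the episode with no further reward whenever `in_box` returns False, exactly as with the discrete support sets of Section 4.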
It stands to reason that further gains could be squeezed out of this framework by estimating ρ_πe(s) more carefully and triggering e-stops whenever our estimate ρ̂_πe(s) falls below some threshold. In general, results will certainly depend on the structure of the support set used and the parameterization of the state space, but our results suggest that there is promise in the tasteful application of e-stops to continuous RL problems.

7 Types of support sets and their tradeoffs

In the previous sections, we proposed reset mechanisms based on a continuous or discrete state support set Ŝ. In this section, we describe alternative support set constructions and their respective trade-offs.

At the most basic level, consider the sequence of sets S¹_πe, …, S^H_πe defined by S^t_πe = {s | ρ^t_πe(s) > ε} for some small ε. Note that the single set Ŝ we considered in Section 4.2 is the union of these sets when ε = 0. The advantage of using a sequence of time-dependent support sets is that the union S_πe may significantly over-support the expert's state distribution at any given time and fail to reset when it is desirable to do so, i.e. s_t ∈ S_πe but s_t ∉ S^t_πe for some t > 0. The downside of time-dependent sets is that they increase the memory complexity from O(|S|) to O(|S|H). Further, if the state distributions ρ¹_πe, …, ρ^H_πe are similar, then using their union effectively increases the number of demonstrations by a factor of H.

To illustrate a practical scenario where it is advantageous to use time-dependent sets, we revisit the example in Fig. 1, where an agent navigates from a start state s₀. 
S_{π_e} does not prevent π from remaining at s_0 for the duration of the episode, as π_e is initialized at this state and thus s_0 ∈ S_{π_e}. Clearly, this type of behavior is undesirable, as it does not move the agent towards its goal. However, the time-dependent sets would trigger an intervention after only a couple of time steps, since s_0 ∈ S^0_{π_e} but s_0 ∉ S^t_{π_e} for some t > 0.

Finally, we propose an alternative support set based on visitation counts, which balances the trade-offs of the two previous constructions S_{π_e} and {S^1_{π_e}, …, S^H_{π_e}}. Let s_f ∈ ℕ^{|S|}_0 be an auxiliary state, which denotes the number of visits to each state. Let f(s) = \sum_{t=1}^{H} \mathbf{1}\{s \in S^t_{\pi_e}\} be the visitation count to state s by the demonstrator. The modified MDP in this setting is defined by

$$P^{\hat{S}}(s, s_f, a, s', s'_f) = \begin{cases} P(s, a, s') & \text{if } s_f \le f(s),\; s'_f = s_f + e_s \\ 1 & \text{if } s_f > f(s),\; s' = s_{\text{term}},\; s'_f = s_f + e_s \\ 0 & \text{else} \end{cases}$$

$$R^{\hat{S}}(s, s_f, a, s') = \begin{cases} R(s, a, s') & \text{if } s_f \le f(s) \\ 0 & \text{else} \end{cases} \tag{9}$$

where e_s is the one-hot vector for state s. In other words, we terminate the episode with no further reward whenever the agent visits a state more often than the demonstrator. The mechanism in Eq. (9) has memory requirements independent of H yet fixes some of the over-support issues in S_{π_e}. The optimal policy in this MDP achieves at least as much cumulative reward as π_e (by extending Theorem 4.1) and can be extended to the imperfect e-stop setting in Section 4.3.

We leave exploration of these and other potential e-stop constructions to future work.

8 Conclusions

We introduced a general framework for incorporating e-stop interventions into any reinforcement learning algorithm, and proposed a method for learning such e-stop triggers from state-only observations.
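The count-based mechanism of Eq. (9) can be sketched for a tabular MDP as follows. This is a minimal illustration with hypothetical names; for simplicity it estimates f(s) directly from raw demonstrator visit counts rather than from the sets S^t_{π_e}.

```python
from collections import Counter

def demo_visit_counts(demo_trajectories):
    """Estimate f(s): the number of time steps the demonstrator spends in
    each state, aggregated over all demonstration trajectories."""
    f = Counter()
    for trajectory in demo_trajectories:
        f.update(trajectory)
    return f

class CountBasedEStop:
    """Tracks the auxiliary visit-count state s_f and signals a transition
    to the terminal state (cutting off all further reward) once the agent
    has visited some state more often than the demonstrator did."""

    def __init__(self, f):
        self.f = f
        self.visits = Counter()  # s_f: per-episode visit counts

    def reset(self):
        """Clear s_f at the start of each episode."""
        self.visits.clear()

    def step(self, state):
        """Returns True when the e-stop fires, i.e. s_f(state) > f(state)."""
        self.visits[state] += 1
        return self.visits[state] > self.f[state]
```

Note that states the demonstrator never visits have f(s) = 0, so a single visit to any state outside the demonstrator's support immediately triggers the reset, which is exactly the over-support fix described above.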
Our key insight is that only a small support set of states may be necessary to operate effectively towards some goal, and we contribute a set of bounds that relate the performance of an agent trained in this smaller support set to the performance of the expert policy. Tuning the size of the support set allows us to efficiently trade off an asymptotic sub-optimality gap for significantly lower sample complexity.

Empirical results on discrete and continuous environments demonstrate significantly faster convergence on a variety of problems and only a small asymptotic sub-optimality gap, if any at all. We argue this trade-off is beneficial in problems where environment interactions are expensive and we are less concerned with achieving no-regret guarantees than with small, finite-sample performance. Further, such a trade-off may be beneficial during initial experimentation and for bootstrapping policies in larger state spaces. For example, we are particularly excited about graduated learning processes that could increase the size of the support set over time.

In larger, high-dimensional state spaces, it would be interesting and relatively straightforward to apply anomaly detectors such as one-class SVMs [16] or auto-encoders [19] within our framework to implicitly construct the support set. Our bounds capture the reduction in exploration due to reducing the state space size, which could be further tightened by incorporating our a priori knowledge of s_term and the ability to terminate trajectories early.

9 Acknowledgements

The authors would like to thank The Notorious B.I.G. and Justin Fu for their contributions to music, pop culture, and our implementation of DDPG.
This work was (partially) funded by the National Science Foundation TRIPODS+X:RES (#A135918), National Institute of Health R01 (#R01EB019335), National Science Foundation CPS (#1544797), National Science Foundation NRI (#1637748), the Office of Naval Research, the RCTA, Amazon, and Honda Research Institute USA.

References

[1] Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the 21st International Conference on Machine Learning, 2004.

[2] Pieter Abbeel and Andrew Y Ng. Exploration and apprenticeship learning in reinforcement learning. In Proceedings of the 22nd International Conference on Machine Learning, 2005.

[3] Mohammed Alshiekh, Roderick Bloem, Rüdiger Ehlers, Bettina Könighofer, Scott Niekum, and Ufuk Topcu. Safe reinforcement learning via shielding. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[4] Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, 2017.

[5] J Andrew Bagnell. An invitation to imitation. Technical report, Carnegie Mellon University, 2015.

[6] James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, and Skye Wanderman-Milne. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.

[7] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym, 2016.

[8] Benjamin Eysenbach, Shixiang Gu, Julian Ibarz, and Sergey Levine. Leave no trace: learning to reset for safe and autonomous reinforcement learning, 2018.

[9] Javier García and Fernando Fernández. Safe exploration of state and action spaces in reinforcement learning.
Journal of Artificial Intelligence Research, 45:515–564, 2012.

[10] Javier García and Fernando Fernández. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437–1480, 2015.

[11] Alborz Geramifard, Joshua Redding, Nicholas Roy, and Jonathan P How. UAV cooperative control with stochastic risk models. In Proceedings of the 2011 American Control Conference, pages 3393–3398. IEEE, 2011.

[12] Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Dan Horgan, John Quan, Andrew Sendonaris, and Ian Osband. Deep Q-learning from demonstrations. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[13] Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, 2016.

[14] Gregory Kahn, Adam Villaflor, Vitchyr Pong, Pieter Abbeel, and Sergey Levine. Uncertainty-aware reinforcement learning for collision avoidance. arXiv preprint, 2017.

[15] Sham Kakade, Mengdi Wang, and Lin F Yang. Variance reduction methods for sublinear reinforcement learning. arXiv:1802.09184, 2018.

[16] Michael Laskey, Sam Staszak, Wesley Yu-Shu Hsieh, Jeffrey Mahler, Florian T Pokorny, Anca D Dragan, and Ken Goldberg. SHIV: Reducing supervisor burden in DAgger using support vectors for efficient learning from demonstrations in high dimensional state spaces. In International Conference on Robotics and Automation, 2016.

[17] Frederic Maire and Vadim Bulitko. Apprenticeship learning for initial value functions in reinforcement learning. In IJCAI Workshop on Planning and Learning in A Priori Unknown or Dynamic Domains, 2005.

[18] Anusha Nagabandi, Gregory Kahn, Ronald S. Fearing, and Sergey Levine. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning.
In International Conference on Robotics and Automation, 2018.

[19] Charles Richter and Nicholas Roy. Safe visual navigation via deep learning and novelty detection. In Robotics: Science and Systems, 2017.

[20] Stéphane Ross, Geoffrey J Gordon, and J Andrew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In 14th International Conference on Artificial Intelligence and Statistics, 2011.

[21] Stéphane Ross, Narek Melik-Barkhudarov, Kumar Shaurya Shankar, Andreas Wendel, Debadeepta Dey, J Andrew Bagnell, and Martial Hebert. Learning monocular reactive UAV control in cluttered natural environments. In IEEE International Conference on Robotics and Automation, 2013.

[22] William D Smart and Leslie Pack Kaelbling. Practical reinforcement learning in continuous spaces. In Proceedings of the 17th International Conference on Machine Learning, 2000.

[23] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. The MIT Press, 2018.

[24] Faraz Torabi, Garrett Warnell, and Peter Stone. Generative adversarial imitation from observation. arXiv:1807.06158, 2018.

[25] Bernard Widrow and Fred W Smith. Pattern-recognizing control systems. Computer and Information Sciences, pages 288–317, 1964.