{"title": "A Lyapunov-based Approach to Safe Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 8092, "page_last": 8101, "abstract": "In many real-world reinforcement learning (RL) problems, besides optimizing the main objective function, an agent must concurrently avoid violating a number of constraints. In particular, besides optimizing performance, it is crucial to guarantee the safety of an agent during training as well as deployment (e.g., a robot should avoid taking actions, exploratory or not, which irrevocably harm its hardware). To incorporate safety in RL, we derive algorithms under the framework of constrained Markov decision processes (CMDPs), an extension of the standard Markov decision processes (MDPs) augmented with constraints on expected cumulative costs. Our approach hinges on a novel Lyapunov method. We define and present a method for constructing Lyapunov functions, which provide an effective way to guarantee the global safety of a behavior policy during training via a set of local linear constraints. Leveraging these theoretical underpinnings, we show how to use the Lyapunov approach to systematically transform dynamic programming (DP) and RL algorithms into their safe counterparts. To illustrate their effectiveness, we evaluate these algorithms in several CMDP planning and decision-making tasks on a safety benchmark domain. 
Our results show that our proposed method significantly outperforms existing baselines in balancing constraint satisfaction and performance.", "full_text": "A Lyapunov-based Approach to Safe Reinforcement Learning\n\nYinlam Chow (DeepMind) yinlamchow@google.com\nOfir Nachum (Google Brain) ofirnachum@google.com\nEdgar Duenez-Guzman (DeepMind) duenez@google.com\nMohammad Ghavamzadeh (Facebook AI Research) mgh@fb.com\n\nAbstract\n\nIn many real-world reinforcement learning (RL) problems, besides optimizing the main objective function, an agent must concurrently avoid violating a number of constraints. In particular, besides optimizing performance, it is crucial to guarantee the safety of an agent during training as well as deployment (e.g., a robot should avoid taking actions, exploratory or not, which irrevocably harm its hardware). To incorporate safety in RL, we derive algorithms under the framework of constrained Markov decision processes (CMDPs), an extension of the standard Markov decision processes (MDPs) augmented with constraints on expected cumulative costs. Our approach hinges on a novel Lyapunov method. We define and present a method for constructing Lyapunov functions, which provide an effective way to guarantee the global safety of a behavior policy during training via a set of local linear constraints. Leveraging these theoretical underpinnings, we show how to use the Lyapunov approach to systematically transform dynamic programming (DP) and RL algorithms into their safe counterparts. To illustrate their effectiveness, we evaluate these algorithms in several CMDP planning and decision-making tasks on a safety benchmark domain. 
Our results show that our proposed method significantly outperforms existing baselines in balancing constraint satisfaction and performance.\n\n1 Introduction\n\nReinforcement learning (RL) has shown exceptional successes in a variety of domains such as video games [25] and recommender systems [40], where the main goal is to optimize a single return. However, in many real-world problems, besides optimizing the main objective (the return), there can exist several conflicting constraints that make RL challenging. In particular, besides optimizing performance, it is crucial to guarantee the safety of an agent in deployment [5, 32, 33], as well as during training [2]. For example, a robot should avoid taking actions which irrevocably harm its hardware, and a recommender system must avoid presenting harmful or offensive items to users.\n\nSequential decision-making in non-deterministic environments has been extensively studied in the literature under the framework of Markov decision processes (MDPs). To incorporate safety into the RL process, we are particularly interested in deriving algorithms in the context of constrained Markov decision processes (CMDPs), an extension of MDPs with expected cumulative constraint costs. The additional constraint component of CMDPs increases flexibility in modeling problems with trajectory-based constraints, when compared with other approaches that customize immediate costs in MDPs to handle constraints [34]. As shown in numerous applications from robot motion planning [30, 26, 11], resource allocation [24, 18], and financial engineering [1, 41], it is more natural to define safety over the whole trajectory, instead of over particular state-action pairs. 
Under this framework, we call an agent's behavior policy safe if it satisfies the cumulative cost constraints of the CMDP.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\nDespite the capabilities of CMDPs, they have not been very popular in RL. One main reason is that, although optimal policies of finite CMDPs are Markov and stationary, and with known models the CMDP can be solved using linear programming (LP) [3], it is unclear how to extend this algorithm to handle cases when the model is unknown, or when the state and action spaces are large or continuous. A well-known approach to solving CMDPs is the Lagrangian method [4, 15], which augments the standard expected reward objective with a penalty on constraint violation. With a fixed Lagrange multiplier, one can use standard dynamic programming (DP) or RL algorithms to solve for an optimal policy. With a learnable Lagrange multiplier, one must solve the resulting saddle-point problem. However, several studies [21] showed that iteratively solving the saddle point is apt to run into numerical stability issues. More importantly, the Lagrangian policy is only safe asymptotically and makes little guarantee with regard to the safety of the behavior policy during each training iteration.\n\nMotivated by these observations, several recent works have derived surrogate algorithms for solving CMDPs, which transform the original constraint into a more conservative one that yields an easier problem to solve. A straightforward approach is to replace the cumulative constraint cost with a conservative stepwise surrogate constraint [9] that only depends on the current state-action pair. Since this surrogate constraint can be easily embedded into the admissible control set, this formulation can be modeled by an MDP with a restricted set of admissible actions. 
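To make the Lagrangian method discussed above concrete, here is a minimal sketch (not the paper's algorithm) of value iteration on the scalarized cost c + λ·d for a small tabular CMDP. For simplicity it uses a discount factor rather than the paper's transient, first-hitting-time setting, and the toy transition kernel and costs are hypothetical:

```python
import numpy as np

def lagrangian_value_iteration(P, c, d, lam, gamma=0.95, iters=500):
    """Value iteration on the scalarized cost c + lam * d.

    P: (S, A, S) transition kernel; c, d: (S, A) objective/constraint costs.
    With a fixed Lagrange multiplier lam, the CMDP reduces to a standard MDP.
    """
    S, A = c.shape
    V = np.zeros(S)
    for _ in range(iters):
        Q = c + lam * d + gamma * (P @ V)   # (S, A) state-action values
        V = Q.min(axis=1)                   # greedy, cost-minimizing backup
    return V, Q.argmin(axis=1)              # value and greedy policy

# Toy 2-state, 2-action CMDP (hypothetical numbers):
P = np.zeros((2, 2, 2))
P[:, 0, 0] = 1.0                            # action 0 always goes to state 0
P[:, 1, 1] = 1.0                            # action 1 always goes to state 1
c = np.array([[1.0, 0.2], [1.0, 0.2]])      # action 1 is cheaper ...
d = np.array([[0.0, 1.0], [0.0, 1.0]])      # ... but incurs constraint cost
V0, pi0 = lagrangian_value_iteration(P, c, d, lam=0.0)   # ignores safety
V2, pi2 = lagrangian_value_iteration(P, c, d, lam=2.0)   # penalizes violation
```

Sweeping λ trades return against constraint cost; as the text notes, making λ learnable turns this into a saddle-point problem with no per-iteration safety guarantee.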
Another surrogate algorithm was proposed by [14], in which the algorithm first computes a uniform super-martingale constraint value function surrogate w.r.t. all policies, and then finds a CMDP-feasible policy by optimizing the surrogate problem using the lexicographical ordering method [39]. These methods are advantageous in the sense that (i) there are RL algorithms available to handle the surrogate problems (for example, see [12] for the step-wise surrogate and [27] for the super-martingale surrogate), and (ii) the policy returned by these methods is safe, even during training. However, the main drawback of these approaches is their conservativeness; characterizing the sub-optimality of the corresponding solution policy also remains a challenging task. On the other hand, recently in policy gradient, [2] proposed the constrained policy optimization (CPO) method, which extends trust-region policy optimization (TRPO) to handle the CMDP constraints. While this algorithm is scalable and its policy is safe during training, applying this methodology to more general RL algorithms (those not in the family of proximal PG algorithms) is quite non-trivial.\n\nLyapunov functions have been extensively used in control theory to analyze the stability of dynamic systems [20, 28]. A Lyapunov function is a type of scalar potential function that keeps track of the energy that a system continually dissipates. Besides modeling physical energy, Lyapunov functions can also represent abstract quantities, such as the steady-state performance of a Markov process [16]. In many fields, Lyapunov functions provide a powerful paradigm for translating global properties of a system to local ones and vice versa. Using Lyapunov functions in RL was first studied by [31], where Lyapunov functions were used to guarantee closed-loop stability of an agent. 
Recently, [6] used Lyapunov functions to guarantee a model-based RL agent's ability to re-enter an "attraction region" during exploration. However, no previous work has used Lyapunov approaches to explicitly model constraints in a CMDP. Furthermore, one major drawback of these approaches is that the Lyapunov functions are hand-crafted, and there are no principled guidelines for designing Lyapunov functions that can guarantee the agent's performance.\n\nThe contribution of this paper is four-fold. First, we formulate the problem of safe RL as a CMDP and propose a novel Lyapunov approach to solve it. While the main challenge of other Lyapunov-based methods is to design a Lyapunov function candidate, we propose an LP-based algorithm to construct Lyapunov functions w.r.t. generic CMDP constraints. We also show that our method is guaranteed to always return a feasible policy and, under certain technical assumptions, achieves optimality. Second, leveraging the theoretical underpinnings of the Lyapunov approach, we present two safe DP algorithms, safe policy iteration (SPI) and safe value iteration (SVI), and analyze the feasibility and performance of these algorithms. Third, to handle unknown environments and large state/action spaces, we develop two scalable safe RL algorithms: (i) safe DQN, an off-policy fitted Q-iteration method, and (ii) safe DPI, an approximate policy iteration method. Fourth, to illustrate the effectiveness of these algorithms, we evaluate them in several tasks on a benchmark 2D planning problem and show that they outperform common baselines in terms of balancing performance and constraint satisfaction.\n\n2 Preliminaries\n\nWe consider RL problems in which the agent's interaction with the system is modeled as a Markov decision process (MDP). 
An MDP is a tuple (X, A, c, P, x0), where X = X′ ∪ {x_Term} is the state space, with transient state space X′ and terminal state x_Term; A is the action space; c(x, a) ∈ [0, C_max] is the immediate cost function (negative reward); P(·|x, a) is the transition probability distribution; and x0 ∈ X′ is the initial state. Our results easily generalize to random initial states and random costs, but for simplicity we focus on the case of deterministic initial state and immediate cost. In the more general setting where cumulative constraints are taken into account, we define a constrained Markov decision process (CMDP), which extends the MDP model by introducing additional costs and associated constraints. A CMDP is defined by (X, A, c, d, P, x0, d0), where the components X, A, c, P, x0 are the same as in the unconstrained MDP; d(x) ∈ [0, D_max] is the immediate constraint cost; and d0 ∈ ℝ≥0 is an upper bound on the expected cumulative (through time) constraint cost. To formalize the optimization problem associated with CMDPs, let Δ be the set of Markov stationary policies, i.e., Δ(x) = { π(·|x) : X → ℝ≥0 : Σ_a π(a|x) = 1 }, for any state x ∈ X. Also let T* be a random variable corresponding to the first-hitting time of the terminal state x_Term induced by policy π. In this paper, we follow the standard notion of transient MDPs and assume that the first-hitting time is uniformly bounded by an upper bound T for any stationary policy [10]. This assumption implies that every stationary policy is proper [7], i.e., its induced Markov chain has an absorbing property (see [13] for an example). While this assumption may seem restrictive, it is a standard one in stochastic shortest path problems for showing that the Bellman operator is a contraction. 
Its justification follows from the fact that sample trajectories collected in most RL algorithms have a finite stopping time (also known as a time-out); in general, this assumption may also be relaxed in cases where a discount factor γ < 1 is applied to future costs. For notational convenience, at each state x ∈ X′, we define the generic Bellman operator w.r.t. a policy π ∈ Δ and a generic cost function h:\n\nT_{π,h}[V](x) = Σ_a π(a|x) [ h(x, a) + Σ_{x′∈X′} P(x′|x, a) V(x′) ].\n\nGiven a policy π ∈ Δ and an initial state x0, the cost function is defined as C_π(x0) := E[ Σ_{t=0}^{T*−1} c(x_t, a_t) | x0, π ], and the safety constraint is D_π(x0) ≤ d0, where the safety constraint function is given by D_π(x0) := E[ Σ_{t=0}^{T*−1} d(x_t) | x0, π ]. In general, the CMDP problem we wish to solve is given as follows:\n\nProblem OPT: Given an initial state x0 and a threshold d0, solve min_{π∈Δ} { C_π(x0) : D_π(x0) ≤ d0 }. If there is a non-empty solution, the optimal policy is denoted by π*.\n\nUnder the transient CMDP assumption, Theorem 8.1 in [3] shows that if the feasibility set is non-empty, then there exists an optimal policy in the class of stationary Markovian policies Δ. To motivate the CMDP formulation studied in this paper, in Appendix A we include two real-world examples of modeling safety using (i) a reachability constraint, and (ii) a constraint that limits the agent's visits to undesirable states. 
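As a concrete illustration of C_π, D_π, and the OPT feasibility check defined above, the following sketch evaluates a fixed policy on a toy transient CMDP by backward induction over the horizon bound T; the chain, costs, and threshold are hypothetical:

```python
import numpy as np

def policy_cost_to_go(P, cost, pi, terminal, horizon):
    """Evaluate E[sum of costs until absorption | x0, pi] for a transient MDP.

    P: (S, A, S) kernel; cost: (S, A); pi: (S, A) action probabilities;
    terminal: boolean mask of absorbing, zero-cost states;
    horizon: upper bound T on the first-hitting time T*.
    """
    P_pi = np.einsum('sax,sa->sx', P, pi)   # state-to-state kernel under pi
    c_pi = (cost * pi).sum(axis=1)          # expected one-step cost under pi
    c_pi[terminal] = 0.0                    # terminal state costs nothing
    V = np.zeros(P.shape[0])
    for _ in range(horizon):                # backward induction over T steps
        V = c_pi + P_pi @ V
        V[terminal] = 0.0
    return V

# Hypothetical 3-state chain; state 2 is the terminal state.
P = np.zeros((3, 2, 3))
P[0, 0, 1] = P[0, 1, 2] = 1.0               # from 0: a0 -> 1, a1 -> 2
P[1, :, 2] = 1.0                            # from 1: both actions -> terminal
P[2, :, 2] = 1.0                            # terminal is absorbing
c = np.array([[1.0, 3.0], [1.0, 1.0], [0.0, 0.0]])   # objective cost
d = np.array([[0.0, 1.0], [0.0, 0.0], [0.0, 0.0]])   # constraint cost
pi = np.full((3, 2), 0.5)                   # uniform policy
terminal = np.array([False, False, True])
C = policy_cost_to_go(P, c, pi, terminal, horizon=10)   # C_pi per state
D = policy_cost_to_go(P, d, pi, terminal, horizon=10)   # D_pi per state
feasible = D[0] <= 1.0                      # the OPT check D_pi(x0) <= d0
```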
Recently, there have been a number of works on CMDP algorithms; their details can be found in Appendix B.\n\n3 A Lyapunov Approach to Solve CMDPs\n\nIn this section, we develop a novel methodology for solving CMDPs using the Lyapunov approach. To start, without loss of generality we assume access to a baseline feasible policy of the OPT problem, namely π_B ∈ Δ.¹ We define a non-empty² set of Lyapunov functions w.r.t. the initial state x0 ∈ X and constraint threshold d0 as\n\nL_{π_B}(x0, d0) = { L : X → ℝ≥0 : T_{π_B,d}[L](x) ≤ L(x), ∀x ∈ X′; L(x) = 0, ∀x ∈ X \ X′; L(x0) ≤ d0 }.   (1)\n\nFor any arbitrary Lyapunov function L ∈ L_{π_B}(x0, d0), we denote by F_L(x) = { π(·|x) ∈ Δ : T_{π,d}[L](x) ≤ L(x) } the set of L-induced Markov stationary policies. Since T_{π,d} is a contraction mapping [7], any L-induced policy π has the following property: D_π(x) = lim_{k→∞} T^k_{π,d}[L](x) ≤ L(x), ∀x ∈ X′. Together with the property L(x0) ≤ d0, this further implies that any L-induced policy is a feasible policy of the OPT problem. However, in general the set F_L(x) does not necessarily contain an optimal policy of the OPT problem, and our main contribution is to design a Lyapunov function (w.r.t. 
a baseline policy) that provides this guarantee. In other words, our main goal is to construct a Lyapunov function L ∈ L_{π_B}(x0, d0) such that\n\nL(x) ≥ T_{π*,d}[L](x), ∀x ∈ X′; L(x0) ≤ d0.   (2)\n\n¹One example of π_B is a policy that minimizes the constraint, i.e., π_B(·|x) ∈ arg min_{π∈Δ(x)} D_π(x).\n²To see this, note that the constraint cost function D_{π_B}(x) is a valid Lyapunov function, i.e., D_{π_B}(x0) ≤ d0; D_{π_B}(x) = 0, ∀x ∈ X \ X′; and D_{π_B}(x) = T_{π_B,d}[D_{π_B}](x) = E[ Σ_{t=0}^{T*−1} d(x_t) | π_B, x ], ∀x ∈ X′.\n\nBefore getting into the main results, we consider the following important technical lemma, which states that with appropriate cost-shaping, one can always transform the constraint value function D_{π*}(x) w.r.t. the optimal policy π* into a Lyapunov function that is induced by π_B, i.e., L_ε(x) ∈ L_{π_B}(x0, d0). The proof of this lemma can be found in Appendix C.1.\n\nLemma 1. There exists an auxiliary constraint cost ε : X′ → ℝ such that a Lyapunov function is given by L_ε(x) = E[ Σ_{t=0}^{T*−1} (d(x_t) + ε(x_t)) | π_B, x ], ∀x ∈ X′, and L_ε(x) = 0, ∀x ∈ X \ X′. Moreover, L_ε is equal to the constraint value function w.r.t. π*, i.e., L_ε(x) = D_{π*}(x).\n\nFrom the structure of L_ε, one can see that the auxiliary constraint cost function ε is uniformly bounded by ε*(x) := 2 T D_max D_TV(π*‖π_B)(x),³ i.e., ε(x) ∈ [−ε*(x), ε*(x)] for any x ∈ X′. However, in general it is unclear how to construct such a cost-shaping term ε without explicitly knowing π* a priori. Rather, inspired by this result, we use the bound ε* to propose the Lyapunov function candidate L_{ε*}. Immediately from its definition, this function has the following properties:\n\nL_{ε*}(x) ≥ T_{π_B,d}[L_{ε*}](x), L_{ε*}(x) ≥ max{ D_{π*}(x), D_{π_B}(x) } ≥ 0, ∀x ∈ X′.   (3)\n\nThe first property is due to the facts that (i) ε* is a non-negative cost function, and (ii) T_{π_B,d+ε*} is a contraction mapping, which by the fixed-point theorem [7] implies L_{ε*}(x) = T_{π_B,d+ε*}[L_{ε*}](x) ≥ T_{π_B,d}[L_{ε*}](x), ∀x ∈ X′. For the second property, from the above inequality one concludes that the Lyapunov function L_{ε*} is a uniform upper bound on the constraint cost w.r.t. the baseline policy, i.e., L_{ε*}(x) ≥ D_{π_B}(x), because the constraint cost D_{π_B}(x) is the unique solution to the fixed-point equation T_{π_B,d}[V](x) = V(x), x ∈ X′. On the other hand, by construction, ε*(x) is an upper bound on the cost-shaping term ε(x); therefore, Lemma 1 implies that L_{ε*} is also a uniform upper bound on the constraint cost w.r.t. the optimal policy π*, i.e., L_{ε*}(x) ≥ D_{π*}(x).\n\nTo show that L_{ε*} is a Lyapunov function that satisfies (2), we propose the following condition, which enforces that the baseline policy π_B is sufficiently close to an optimal policy π*.\n\nAssumption 1. 
The feasible baseline policy π_B satisfies the condition max_{x∈X′} ε*(x) ≤ D_max · min{ (d0 − D_{π_B}(x0)) / (T D_max), (T D_max − D) / (T D_max + D) }, where D = max_{x∈X′} max_π D_π(x).\n\nThis condition characterizes the maximum allowable distance between π_B and π* such that the set of L_{ε*}-induced policies contains an optimal policy. To formalize this claim, we have the following main result, showing that L_{ε*} ∈ L_{π_B}(x0, d0) and that the set of policies F_{L_{ε*}} contains an optimal policy.\n\nTheorem 1. Suppose the baseline policy π_B satisfies Assumption 1. Then, on top of the properties in (3), the Lyapunov function candidate L_{ε*} also satisfies the properties in (2), and thus its induced feasible set of policies F_{L_{ε*}} contains an optimal policy.\n\nThe proof of this theorem is given in Appendix C.2. Suppose the distance between the baseline and optimal policies can be estimated effectively. Using the above result, one can then immediately determine whether the set of L_{ε*}-induced policies contains an optimal policy. Equipped with the set of L_{ε*}-induced feasible policies, consider the following safe Bellman operator:\n\nT[V](x) = min_{π∈F_{L_{ε*}}(x)} T_{π,c}[V](x) if x ∈ X′, and T[V](x) = 0 otherwise.   (4)\n\nUsing standard analysis of Bellman operators, one can show that T is a monotonic contraction operator (see Appendix C.3 for the proof). This further implies that the solution of the fixed-point equation T[V](x) = V(x), ∀x ∈ X, is unique. Let V* be such a value function. The following theorem shows that under Assumption 1, V*(x0) is a solution to the OPT problem.\n\nTheorem 2. Suppose that the baseline policy π_B satisfies Assumption 1. 
Then the fixed-point solution at x = x0, i.e., V*(x0), is equal to the solution of the OPT problem. Furthermore, an optimal policy can be constructed by π*(·|x) ∈ arg min_{π∈F_{L_{ε*}}(x)} T_{π,c}[V*](x), ∀x ∈ X′.\n\nThe proof of this theorem can be found in Appendix C.4. This shows that under Assumption 1, an optimal policy of the OPT problem can be found using standard DP algorithms. Note that verifying whether π_B satisfies this assumption is still challenging, because one requires a good estimate of D_TV(π*‖π_B). Yet, to the best of our knowledge, this is the first result that connects the optimality of CMDPs to Bellman's principle of optimality. Another key observation is that, in practice, we will explore ways of approximating ε* via bootstrapping, and we empirically show that this approach achieves good performance while guaranteeing safety at each iteration. In particular, in the next section we illustrate how to systematically construct a Lyapunov function using an LP in both planning and RL scenarios (when the model is unknown and/or we use function approximation), in order to guarantee safety during learning.\n\n³The total variation distance is given by D_TV(π*‖π_B)(x) = (1/2) Σ_{a∈A} |π_B(a|x) − π*(a|x)|.\n\n4 Safe Reinforcement Learning Using Lyapunov Functions\n\nMotivated by the challenge of computing a Lyapunov function L_{ε*} whose induced set of policies contains π*, in this section we approximate ε* with an auxiliary constraint cost ε̃, chosen as the largest auxiliary cost that satisfies the Lyapunov condition L_ε̃(x) ≥ T_{π_B,d}[L_ε̃](x), ∀x ∈ X′, and the safety condition L_ε̃(x0) ≤ d0. The larger the ε̃, the larger the set of policies F_{L_ε̃}. Thus, by choosing the largest such auxiliary cost, we hope to have a better chance of including the optimal policy π* in the set of feasible policies. So, we consider the following LP problem:\n\nε̃ ∈ arg max_{ε:X′→ℝ≥0} { Σ_{x∈X′} ε(x) : d0 − D_{π_B}(x0) ≥ 1(x0)⊤ (I − {P(x′|x, π_B)}_{x,x′∈X′})^{−1} ε }.   (5)\n\nHere 1(x0) represents a one-hot vector whose non-zero element is located at x = x0. Whenever π_B is a feasible policy, problem (5) always has a non-empty solution.⁴ Furthermore, note that 1(x0)⊤ (I − {P(x′|x, π_B)}_{x,x′∈X′})^{−1} 1(x) represents the total visiting probability E[ Σ_{t=0}^{T*−1} 1{x_t = x} | x0, π_B ] from the initial state x0 to any state x ∈ X′, which is a non-negative quantity. Therefore, using the extreme-point argument in LP [23], one can conclude that the maximizer of problem (5) is an indicator function whose non-zero element is located at the state x̄ with the minimum total visiting probability from x0, i.e., ε̃(x) = (d0 − D_{π_B}(x0)) · 1{x = x̄} / E[ Σ_{t=0}^{T*−1} 1{x_t = x̄} | x0, π_B ], where x̄ ∈ arg min_{x∈X′} E[ Σ_{t=0}^{T*−1} 1{x_t = x} | x0, π_B ]. On the other hand, suppose that we further restrict ε̃(x) to be a constant function, i.e., ε̃(x) = ε̃, ∀x ∈ X′. Then one can show that the maximizer is given by ε̃(x) = (d0 − D_{π_B}(x0)) / E[T* | x0, π_B], ∀x ∈ X′, where 1(x0)⊤ (I − {P(x′|x, π_B)}_{x,x′∈X′})^{−1} [1, …, 1]⊤ = E[T* | x0, π_B] is the expected stopping time of the transient MDP. In cases where computing the expected stopping time is expensive, one reasonable approximation is to replace the denominator of ε̃ with the upper bound T.\n\nUsing this Lyapunov function L_ε̃, we propose the safe policy iteration (SPI) method in Algorithm 1, in which the Lyapunov function is updated via bootstrapping, i.e., at each iteration L_ε̃ is recomputed using (5) w.r.t. the current baseline policy. Properties of SPI are summarized in the following proposition.\n\nAlgorithm 1 Safe Policy Iteration (SPI)\nInput: initial feasible policy π0;\nfor k = 0, 1, 2, … do\n  Step 0: With π_B = π_k, evaluate the Lyapunov function L_{ε_k}, where ε_k is a solution of (5);\n  Step 1: Evaluate the cost value function V_{π_k}(x) = C_{π_k}(x);\n  Step 2: Update the policy by solving π_{k+1}(·|x) ∈ arg min_{π∈F_{L_{ε_k}}(x)} T_{π,c}[V_{π_k}](x), ∀x ∈ X′;\nend for\nReturn final policy π_{k*}\n\nProposition 1. 
Algorithm 1 has the following properties: (i) Consistent feasibility: if the current policy π_k is feasible, then the updated policy π_{k+1} is also feasible, i.e., D_{π_k}(x0) ≤ d0 implies D_{π_{k+1}}(x0) ≤ d0. (ii) Monotonic policy improvement: the cumulative cost induced by π_{k+1} is lower than or equal to that induced by π_k, i.e., C_{π_{k+1}}(x) ≤ C_{π_k}(x), ∀x ∈ X′. (iii) Convergence: if we add a strictly concave regularizer to the optimization problem (5) and a strictly convex regularizer to the policy optimization step, then the policy sequence asymptotically converges.⁵\n\nThe proof of this proposition is given in Appendix C.5, and the sub-optimality performance bound of SPI can be found in Appendix C.6. Analogous to SPI, we also propose a safe value iteration (SVI), in which the Lyapunov function estimate is updated at every iteration via bootstrapping, using the current optimal value estimate. Details of SVI are given in Algorithm 2, and its properties are summarized in the following proposition, whose proof is given in Appendix C.7.\n\n⁴This is due to the fact that d0 − D_{π_B}(x0) ≥ 0, and thus ε̃(x) = 0 is a feasible solution.\n⁵The strict concavity of the objective function is mainly for tie-breaking. One standard example is an entropy regularizer with a small regularization weight.\n\nProposition 2. Algorithm 2 has: (i) Consistent feasibility and (ii) Convergence.\n\nTo justify the notion of bootstrapping in both SVI and SPI: the Lyapunov function is updated based on the best baseline policy, i.e., the policy that is feasible and so far has the lowest cumulative cost. Once the current baseline policy π_k is sufficiently close to an optimal policy π*, then by Theorem 1 one may conclude that the L_ε̃-induced set of policies contains an optimal policy. 
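The policy optimization step shared by SPI and SVI reduces to one small LP per state. The sketch below solves a single such sub-problem: minimize the expected objective Q-values over the action simplex subject to a Lyapunov (safety) constraint of the form used to define F_L. The Q-values and budget here are hypothetical:

```python
import numpy as np
from scipy.optimize import linprog

def safe_policy_step(q_cost, q_lyap, budget):
    """One per-state LP sub-problem of the safe policy update.

    Minimize pi . q_cost over the simplex subject to pi . q_lyap <= budget,
    where q_lyap(a) plays the role of d(x) + sum_x' P(x'|x,a) L(x') and
    budget plays the role of L(x).
    """
    n = len(q_cost)
    res = linprog(
        c=q_cost,
        A_ub=[q_lyap], b_ub=[budget],        # Lyapunov (safety) constraint
        A_eq=[np.ones(n)], b_eq=[1.0],       # probabilities sum to one
        bounds=[(0.0, 1.0)] * n,
    )
    assert res.success
    return res.x                             # action distribution at state x

# Hypothetical state: action 0 is cheap but unsafe, action 1 safe but costly.
q_cost = np.array([1.0, 4.0])
q_lyap = np.array([3.0, 1.0])
pi_x = safe_policy_step(q_cost, q_lyap, budget=2.0)
```

Note how the constraint forces a randomized policy: the cheap action alone would violate the Lyapunov budget, so the optimum mixes the two actions.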
Although these algorithms do not have optimality guarantees, empirically they often return a near-optimal policy. At each iteration, the policy optimization step in SPI and SVI requires solving |X′| LP sub-problems, each with |A| + 2 constraints and an |A|-dimensional decision variable. Collectively, at each iteration the complexity is O(|X′||A|^2(|A| + 2)). While in the worst case SVI converges in K = O(T) steps [7] and SPI converges in K = O(|X′||A| T log T) steps [38], in practice K is much smaller than |X′||A|. Therefore, even with the additional complexity of policy evaluation in SPI, which is O(T|X′|^2), or of the Q-function update in SVI, which is O(|A|^2|X′|^2), the complexity of these methods is O(K|X′||A|^3 + K|X′|^2|A|^2), which in practice is much lower than that of the dual LP method, whose complexity is O(|X′|^3|A|^3) (see Appendix B for more details).\n\nAlgorithm 2 Safe Value Iteration (SVI)\nInput: initial Q-function Q0; initial Lyapunov function L_{ε_0} w.r.t. the auxiliary cost function ε_0(x) = 0;\nfor k = 0, 1, 2, … do\n  Step 0: Compute the Q-function Q_{k+1}(x, a) = c(x, a) + Σ_{x′} P(x′|x, a) min_{π∈F_{L_{ε_k}}(x′)} π(·|x′)⊤ Q_k(x′,·), and the policy π_k(·|x) ∈ arg min_{π∈F_{L_{ε_k}}(x)} π(·|x)⊤ Q_k(x,·);\n  Step 1: With π_B = π_k, construct the Lyapunov function L_{ε_{k+1}}, where ε_{k+1} is a solution to (5);\nend for\nReturn final policy π_{k*}\n\n4.1 Lyapunov-based Safe RL Algorithms\n\nTo improve the scalability of SVI and SPI, we develop two off-policy safe RL algorithms, namely safe DQN and safe DPI, which replace the value and policy updates in safe DP with function approximations. 
Their pseudo-codes can be found in Appendix D. Before going into their details, we first introduce the policy distillation method, which will be used later in the safe RL algorithms.\n\nPolicy Distillation: Consider the following LP problem for policy optimization in SVI and SPI:\n\nπ′(·|x) ∈ arg min_{π∈Δ} { π(·|x)⊤ Q(x,·) : (π(·|x) − π_B(·|x))⊤ Q_L(x,·) ≤ ε̃′(x) },   (6)\n\nwhere Q_L(x, a) = d(x) + ε̃′(x) + Σ_{x′} P(x′|x, a) L_ε̃′(x′) is the state-action Lyapunov function. When the state space is large (or continuous), we use function approximation. Consider a parameterized policy π_φ with weights φ. Utilizing the distillation concept [36], after computing the optimal action probabilities w.r.t. a batch of M states, the policy π_φ is updated by solving φ* ∈ arg min_φ (1/M) Σ_{m=1}^{M} Σ_{t=0}^{T−1} D_JSD(π_φ(·|x_{t,m}) ‖ π′(·|x_{t,m})), where D_JSD is the Jensen-Shannon divergence. Pseudo-code of the distillation step is given in Algorithm 3 in Appendix D.\n\nSafe Q-learning (SDQN): Here we sample an off-policy mini-batch of states, actions, costs, and next-states from the replay buffer, and use it to update the value function estimates by minimizing the MSE losses of the Bellman residuals. We first construct the state-action Lyapunov function estimate Q̂_L(x, a; θ_D, θ_T) = Q̂_D(x, a; θ_D) + ε̃′ · Q̂_T(x, a; θ_T) by learning the constraint value network Q̂_D and the stopping-time value network Q̂_T. 
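As a minimal numeric sketch of this construction, the state-action Lyapunov estimate combines the outputs of the two learned networks, with the scalar auxiliary cost ε̃′ computed from the constraint-value and stopping-time estimates at the initial state as in the text. All numbers below are hypothetical placeholders for network outputs:

```python
import numpy as np

def lyapunov_q(q_d, q_t, eps_tilde):
    """State-action Lyapunov estimate Q_L = Q_D + eps_tilde * Q_T.

    q_d: learned constraint-cost Q-values; q_t: learned stopping-time
    Q-values; eps_tilde: scalar auxiliary cost (a bootstrapped estimate).
    """
    return q_d + eps_tilde * q_t

def eps_from_estimates(q_d0, q_t0, pi0, d0):
    """Constant auxiliary cost eps = (d0 - pi . Q_D(x0)) / (pi . Q_T(x0))."""
    return (d0 - pi0 @ q_d0) / (pi0 @ q_t0)

# Hypothetical estimates at the initial state x0 with two actions:
pi0 = np.array([0.5, 0.5])                 # current baseline policy at x0
q_d0 = np.array([1.0, 3.0])                # estimated cumulative constraint cost
q_t0 = np.array([4.0, 4.0])                # estimated stopping time
eps = eps_from_estimates(q_d0, q_t0, pi0, d0=4.0)
q_l = lyapunov_q(q_d0, q_t0, eps)          # Lyapunov values fed to (6)
```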
With a current baseline policy $\pi_k$, one can use function approximation to approximate the auxiliary constraint cost (which is the solution to (5)) by $\tilde{\epsilon}'(x) = \tilde{\epsilon}' = \big(d_0 - \pi_k(\cdot|x_0)^\top \widehat{Q}_D(x_0,\cdot;\theta_D)\big) \big/ \pi_k(\cdot|x_0)^\top \widehat{Q}_T(x_0,\cdot;\theta_T)$. Equipped with the Lyapunov function, at each iteration one can perform a standard DQN update, except that the optimal action probabilities are computed by solving (6). Details of SDQN are given in Algorithm 4 in Appendix D.
Safe Policy Improvement (SDPI): Similar to SDQN, in this algorithm we first sample an off-policy mini-batch from the replay buffer and use it to update the value-function estimates (w.r.t. the objective, constraint, and stopping-time estimate) by minimizing MSE losses. Different from SDQN, in SDPI the value estimation is done using policy evaluation, which means that the objective Q-function is trained to minimize the Bellman residual w.r.t. actions generated by the current policy $\pi_k$, instead of the greedy actions. Using the same construction as in SDQN for the auxiliary cost $\tilde{\epsilon}'$ and state-action Lyapunov function $\widehat{Q}_L$, we then perform a policy improvement step by computing a set of greedy action probabilities from (6) and constructing an updated policy $\pi_{k+1}$ using policy distillation. Assuming both value and policy approximations have low error, SDPI inherits several interesting properties of SPI, such as maintaining safety during training and monotonically improving the policy. To improve learning stability, instead of the full policy update, one can further consider a partial update $\pi_{k+1} = (1-\alpha)\pi_k + \alpha\pi'$, where $\alpha \in (0,1)$ is a mixing constant that controls safety and exploration [2, 19]. Details of SDPI are summarized in Algorithm 5 in Appendix D.

Figure 1: Results of various planning algorithms on the grid-world environment with obstacles, with the x-axis showing the obstacle density. From the leftmost column, the first figure illustrates the 2D planning domain example (ρ = 0.25). The second and third figures show the average return and the average cumulative constraint cost of the CMDP methods, respectively. The fourth figure displays all the methods used in the experiment. The shaded regions indicate the 80% confidence intervals. Clearly, the safe DP algorithms compute policies that are safe and have good performance.

5 Experiments
Motivated by the safety issues of RL in [22], we validate our safe RL algorithms using a stochastic 2D grid-world motion planning problem. In this domain, an agent (e.g., a robotic vehicle) starts in a safe region and its objective is to travel to a given destination. At each time step, the agent can move to any of its four neighboring states. Due to sensing and control noise, however, with probability δ a move to a random neighboring state occurs instead. To account for fuel usage, the stage-wise cost of each move until reaching the destination is 1, while the reward achieved for reaching the destination is 1000. Thus, we would like the agent to reach the destination in the shortest possible number of moves. In between the starting and destination points, there are a number of obstacles that the agent may pass through but should avoid for safety; each time the agent hits an obstacle it incurs a constraint cost of 1. Thus, in the CMDP setting, the agent's goal is to reach the destination in the shortest possible number of moves, while hitting the obstacles at most d0 times.
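The dynamics just described can be sketched as a minimal CMDP simulator. This is an illustrative sketch, not the code used in the experiments: the class name `GridWorld` and its interface are hypothetical, the per-move fuel cost of 1 is encoded as a reward of −1 (with the reward of 1000 granted at the goal), and the obstacle layout is supplied by the caller.

```python
import random

class GridWorld:
    """Minimal sketch of the stochastic grid-world CMDP described above."""
    MOVES = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}

    def __init__(self, size=25, obstacles=frozenset(), goal=(0, 0),
                 start=(24, 24), delta=0.05):
        self.size, self.obstacles, self.goal = size, obstacles, goal
        self.pos, self.delta = start, delta

    def step(self, action):
        # With probability delta, sensing/control noise replaces the chosen
        # move with a random one.
        if random.random() < self.delta:
            action = random.choice(list(self.MOVES))
        dx, dy = self.MOVES[action]
        x = min(max(self.pos[0] + dx, 0), self.size - 1)
        y = min(max(self.pos[1] + dy, 0), self.size - 1)
        self.pos = (x, y)
        done = self.pos == self.goal
        reward = 1000.0 if done else -1.0           # fuel cost of 1 per move
        constraint_cost = 1.0 if self.pos in self.obstacles else 0.0
        return self.pos, reward, constraint_cost, done
```

A CMDP solver would then seek a policy maximizing cumulative reward subject to the expected cumulative `constraint_cost` staying at or below d0.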
For demonstration purposes, we choose a 25 × 25 grid-world (see Figure 1) with a total of 625 states. We also have a density ratio ρ ∈ (0, 1) that sets the obstacle-to-terrain ratio. When ρ is close to 0, the problem is nearly obstacle-free, and when ρ is close to 1, the problem becomes more challenging. In the normal problem setting, we choose a density ρ = 0.3, an error probability δ = 0.05, a constraint threshold d0 = 5, and a maximum horizon of 200 steps. The initial state is located at (24, 24) and the goal is placed at (0, α), where α ∈ [0, 24] is a uniform random variable. To account for statistical significance, the results of each experiment are averaged over 20 trials.
CMDP Planning: In this task, we have explicit knowledge of the reward function and transition probability. The main goal is to compare our safe DP algorithms (SPI and SVI) with the following common CMDP baseline methods: (i) Step-wise Surrogate, (ii) Super-martingale Surrogate, (iii) Lagrangian, and (iv) Dual LP. Since the methods in (i) and (ii) are surrogate algorithms, we evaluate them with both value iteration and policy iteration. To illustrate the level of sub-optimality, we also compare the returns and constraint costs of these methods with baselines generated by maximizing return or minimizing constraint cost in two separate MDPs. The main objective here is to illustrate that, without using function approximation, the safe DP algorithms are less conservative than the other surrogate methods, more numerically stable than the Lagrangian method, and more computationally efficient than the Dual LP method (see Appendix F).
Figure 1 presents the results on returns and cumulative constraint costs of the aforementioned CMDP methods over a spectrum of ρ values, ranging from 0 to 0.5.
In each method, the initial policy is a conservative baseline policy πB that minimizes the constraint cost. The empirical results indicate that although the policies generated by the four surrogate algorithms are feasible, they do not achieve significant policy improvement, i.e., their return values are close to that of the initial baseline policy. Over all density settings, the SPI algorithm consistently computes a solution that is feasible and has good performance. The policy returned by SVI is always feasible and has near-optimal performance when the obstacle density is low. However, due to numerical instability, its performance degrades as ρ grows. Similarly, the Lagrangian methods return a near-optimal solution over most settings, but due to numerical issues their solutions start to violate the constraint as ρ grows.

Figure 2: Results of various RL algorithms on the grid-world environment with obstacles, with the x-axis in thousands of episodes. Panels (left to right): discrete obs, d0 = 5; discrete obs, d0 = 1; image obs, d0 = 5; image obs, d0 = 1; the two rows show rewards (top) and constraints (bottom). We include runs using discrete observations (a one-hot encoding of the agent's position) and image observations (showing the entire RGB 2D map of the world). We discover that the Lyapunov-based approaches can perform safe learning, despite the fact that the model of the environment is not known and that deep function approximation is necessary.

Safe Reinforcement Learning: Here we present the results of the RL algorithms on this safety task. We evaluate their learning performance on two variants: one in which the observation is a one-hot encoding of the agent's location, and the other in which the observation is the 2D image representation of the grid map. In each of these, we evaluate performance when d0 = 1 and d0 = 5.
We compare our proposed safe RL algorithms, SDPI and SDQN, with their unconstrained counterparts, DPI and DQN, as well as with the Lagrangian approach to safe RL, in which the Lagrange multiplier is optimized via an extensive grid search. Details of the experimental setup are given in Appendix F. To make the tasks more challenging, we initialize the RL algorithms with a randomized baseline policy.
Figure 2 shows the results of these methods across all task variants. We observe that SDPI and SDQN can adequately solve the tasks and achieve good return performance (similar to that of DQN and DPI in some cases), while guaranteeing safety. Another interesting observation about the SDQN and SDPI algorithms is that, once the algorithm finds a safe policy, all subsequently updated policies remain safe throughout training. On the contrary, the Lagrangian approach often achieves worse rewards and is more apt to violate the constraints during training,⁶ and its performance is very sensitive to the initial conditions. Furthermore, in some cases (in the experiment with d0 = 5 and discrete observations) the Lagrangian method cannot guarantee safety throughout training.
6 Conclusion
In this paper, we formulated the problem of safe RL as a CMDP and proposed a novel Lyapunov approach to solve CMDPs. We also derived an effective LP-based method to generate Lyapunov functions, such that the corresponding algorithm guarantees feasibility and optimality under certain conditions. Leveraging these theoretical underpinnings, we showed how Lyapunov approaches can be used to transform DP (and RL) algorithms into their safe counterparts that require only straightforward modifications in the algorithm implementations. We empirically validated our theoretical findings, using the Lyapunov approach to guarantee safety and robust learning in RL.
In general, our work represents a step forward in deploying RL to real-world problems in which guaranteeing safety is of paramount importance. Future research will focus on two directions. From the algorithmic perspective, one major extension is to apply the Lyapunov approach to policy gradient algorithms and compare its performance with CPO in continuous-action problems. On the practical side, future work includes evaluating the Lyapunov-based RL algorithms on several real-world testbeds.

⁶In Appendix F, we also report results for the Lagrangian method in which the Lagrange multiplier is learned using a gradient ascent method [10]; we observe similar (or even worse) behavior.

References
[1] N. Abe, P. Melville, C. Pendus, C. Reddy, D. Jensen, V. Thomas, J. Bennett, G. Anderson, B. Cooley, M. Kowalczyk, et al. Optimizing debt collections using constrained reinforcement learning. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 75–84, 2010.

[2] J. Achiam, D. Held, A. Tamar, and P. Abbeel. Constrained policy optimization. In International Conference on Machine Learning, 2017.

[3] E. Altman. Constrained Markov Decision Processes, volume 7. CRC Press, 1999.

[4] E. Altman. Constrained Markov decision processes with total cost criteria: Lagrangian approach and dual linear program. Mathematical Methods of Operations Research, 48(3):387–417, 1998.

[5] D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Mané. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016.

[6] F. Berkenkamp, M. Turchetta, A. Schoellig, and A. Krause.
Safe model-based reinforcement learning with stability guarantees. In Advances in Neural Information Processing Systems, pages 908–919, 2017.

[7] D. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, Belmont, MA, 1995.

[8] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[9] M. El Chamie, Y. Yu, and B. Açıkmeşe. Convex synthesis of randomized policies for controlled Markov chains with density safety upper bound constraints. In American Control Conference, pages 6290–6295, 2016.

[10] Y. Chow, M. Ghavamzadeh, L. Janson, and M. Pavone. Risk-constrained reinforcement learning with percentile risk criteria. arXiv preprint arXiv:1512.01629, 2015.

[11] Y. Chow, M. Pavone, B. Sadler, and S. Carpin. Trading safety versus performance: Rapid deployment of robotic swarms with robust performance constraints. Journal of Dynamic Systems, Measurement, and Control, 137(3), 2015.

[12] G. Dalal, K. Dvijotham, M. Vecerik, T. Hester, C. Paduraru, and Y. Tassa. Safe exploration in continuous action spaces. arXiv preprint arXiv:1801.08757, 2018.

[13] R. Fruit and A. Lazaric. Exploration–exploitation in MDPs with options. In AISTATS, 2017.

[14] Z. Gábor and Z. Kalmár. Multi-criteria reinforcement learning. In International Conference on Machine Learning, 1998.

[15] P. Geibel and F. Wysotzki. Risk-sensitive reinforcement learning applied to control under constraints. Journal of Artificial Intelligence Research, 24:81–108, 2005.

[16] P. Glynn, A. Zeevi, et al. Bounding stationary expectations of Markov processes. In Markov Processes and Related Topics: A Festschrift for Thomas G. Kurtz, pages 195–214. Institute of Mathematical Statistics, 2008.

[17] S. Gu, T. Lillicrap, I. Sutskever, and S. Levine. Continuous deep Q-learning with model-based acceleration.
In International Conference on Machine Learning, pages 2829–2838, 2016.

[18] S. Junges, N. Jansen, C. Dehnert, U. Topcu, and J. Katoen. Safety-constrained reinforcement learning for MDPs. In International Conference on Tools and Algorithms for the Construction and Analysis of Systems, pages 130–146, 2016.

[19] S. Kakade and J. Langford. Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning, pages 267–274, 2002.

[20] H. Khalil. Nonlinear Systems. Prentice-Hall, New Jersey, 2nd edition, 1996.

[21] J. Lee, I. Panageas, G. Piliouras, M. Simchowitz, M. Jordan, and B. Recht. First-order methods almost always avoid saddle points. arXiv preprint arXiv:1710.07406, 2017.

[22] J. Leike, M. Martic, V. Krakovna, P. Ortega, T. Everitt, A. Lefrancq, L. Orseau, and S. Legg. AI safety gridworlds. arXiv preprint arXiv:1711.09883, 2017.

[23] D. Luenberger, Y. Ye, et al. Linear and Nonlinear Programming, volume 2. Springer, 1984.

[24] N. Mastronarde and M. van der Schaar. Fast reinforcement learning for energy-efficient wireless communication. IEEE Transactions on Signal Processing, 59(12):6262–6266, 2011.

[25] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

[26] T. Moldovan and P. Abbeel. Safe exploration in Markov decision processes. arXiv preprint arXiv:1205.4810, 2012.

[27] H. Mossalam, Y. Assael, D. Roijers, and S. Whiteson. Multi-objective deep reinforcement learning. arXiv preprint arXiv:1610.02707, 2016.

[28] M. Neely. Stochastic network optimization with application to communication and queueing systems. Synthesis Lectures on Communication Networks, 3(1):1–211, 2010.

[29] G. Neu, A. Jonsson, and V. Gómez.
A unified view of entropy-regularized Markov decision processes. arXiv preprint arXiv:1705.07798, 2017.

[30] M. Ono, M. Pavone, Y. Kuwata, and J. Balaram. Chance-constrained dynamic programming with application to risk-aware robotic space exploration. Autonomous Robots, 39(4):555–571, 2015.

[31] T. Perkins and A. Barto. Lyapunov design for safe reinforcement learning. Journal of Machine Learning Research, 3:803–832, 2002.

[32] M. Pirotta, M. Restelli, and L. Bascetta. Adaptive step-size for policy gradient methods. In Advances in Neural Information Processing Systems, pages 1394–1402, 2013.

[33] M. Pirotta, M. Restelli, A. Pecorino, and D. Calandriello. Safe policy iteration. In International Conference on Machine Learning, pages 307–315, 2013.

[34] K. Regan and C. Boutilier. Regret-based reward elicitation for Markov decision processes. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 444–451, 2009.

[35] D. Roijers, P. Vamplew, S. Whiteson, and R. Dazeley. A survey of multi-objective sequential decision-making. Journal of Artificial Intelligence Research, 48:67–113, 2013.

[36] A. Rusu, S. Colmenarejo, C. Gulcehre, G. Desjardins, J. Kirkpatrick, R. Pascanu, V. Mnih, K. Kavukcuoglu, and R. Hadsell. Policy distillation. arXiv preprint arXiv:1511.06295, 2015.

[37] T. Schaul, J. Quan, I. Antonoglou, and D. Silver. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015.

[38] B. Scherrer. Performance bounds for λ policy iteration and application to the game of Tetris. Journal of Machine Learning Research, 14(Apr):1181–1227, 2013.

[39] M. Schmitt and L. Martignon. On the complexity of learning lexicographic strategies. Journal of Machine Learning Research, 7:55–83, 2006.

[40] G. Shani, D. Heckerman, and R. Brafman. An MDP-based recommender system.
Journal of Machine Learning Research, 6:1265–1295, 2005.

[41] A. Tamar, D. Di Castro, and S. Mannor. Policy gradients with variance related risk criteria. In International Conference on Machine Learning, 2012.

[42] H. van Hasselt. Double Q-learning. In Advances in Neural Information Processing Systems, pages 2613–2621, 2010.