{"title": "Reinforcement Learning with Convex Constraints", "book": "Advances in Neural Information Processing Systems", "page_first": 14093, "page_last": 14102, "abstract": "In standard reinforcement learning (RL), a learning agent seeks to optimize the overall reward. However, many key aspects of a desired behavior are more naturally expressed as constraints. For instance, the designer may want to limit the use of unsafe actions, increase the diversity of trajectories to enable exploration, or approximate expert trajectories when rewards are sparse. In this paper, we propose an algorithmic scheme that can handle a wide class of constraints in RL tasks: specifically, any constraints that require expected values of some vector measurements (such as the use of an action) to lie in a convex set. This captures previously studied constraints (such as safety and proximity to an expert), but also enables new classes of constraints (such as diversity). Our approach comes with rigorous theoretical guarantees and only relies on the ability to approximately solve standard RL tasks. As a result, it can be easily adapted to work with any model-free or model-based RL. In our experiments, we show that it matches previous algorithms that enforce safety via constraints, but can also enforce new properties that these algorithms do not incorporate, such as diversity.", "full_text": "Reinforcement Learning with Convex Constraints\n\nSobhan Miryoose\ufb01\nPrinceton University\n\nmiryoosefi@cs.princeton.edu\n\nKiant\u00e9 Brantley\n\nUniversity of Maryland\nkdbrant@cs.umd.edu\n\nHal Daum\u00e9 III\n\nMicrosoft Research\n\nUniversity of Maryland\n\nme@hal3.name\n\nMiroslav Dud\u00edk\nMicrosoft Research\n\nmdudik@microsoft.com\n\nRobert E. Schapire\nMicrosoft Research\n\nschapire@microsoft.com\n\nAbstract\n\nIn standard reinforcement learning (RL), a learning agent seeks to optimize the\noverall reward. 
However, many key aspects of a desired behavior are more natu-\nrally expressed as constraints. For instance, the designer may want to limit the use\nof unsafe actions, increase the diversity of trajectories to enable exploration, or\napproximate expert trajectories when rewards are sparse. In this paper, we propose\nan algorithmic scheme that can handle a wide class of constraints in RL tasks,\nspeci\ufb01cally, any constraints that require expected values of some vector measure-\nments (such as the use of an action) to lie in a convex set. This captures previously\nstudied constraints (such as safety and proximity to an expert), but also enables\nnew classes of constraints (such as diversity). Our approach comes with rigorous\ntheoretical guarantees and only relies on the ability to approximately solve standard\nRL tasks. As a result, it can be easily adapted to work with any model-free or\nmodel-based RL algorithm. In our experiments, we show that it matches previous\nalgorithms that enforce safety via constraints, but can also enforce new properties\nthat these algorithms cannot incorporate, such as diversity.\n\n1\n\nIntroduction\n\nReinforcement learning (RL) typically considers the problem of learning to optimize the behavior of\nan agent in an unknown environment against a single scalar reward function. For simple tasks, this\ncan be suf\ufb01cient, but for complex tasks, boiling down the learning goal into a single scalar reward\ncan be challenging. Moreover, a scalar reward might not be a natural formalism for stating certain\nlearning objectives, such as safety desires (\u201cavoid dangerous situations\u201d) or exploration suggestions\n(\u201cmaintain a distribution over visited states that is as close to uniform as possible\u201d). 
In these settings,\nit is much more natural to de\ufb01ne the learning goal in terms of a vector of measurements over the\nbehavior of the agent, and to learn a policy whose measurement vector is inside a target set (\u00a72).\nWe derive an algorithm, approachability-based policy optimization (APPROPO, pronounced like\n\u201capropos\u201d), for solving such problems (\u00a73). Given a Markov decision process with vector-valued\nmeasurements (\u00a72), and a target constraint set, APPROPO learns a stochastic policy whose expected\nmeasurements fall in that target set (akin to Blackwell approachability in single-turn games, Blackwell,\n1956). We derive our algorithm from a game-theoretic perspective, leveraging recent results in online\nconvex optimization. APPROPO is implemented as a reduction to any off-the-shelf reinforcement\nlearning algorithm that can return an approximately optimal policy, and so can be used in conjunction\nwith the algorithms that are the most appropriate for any given domain.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fOur approach builds on prior work for reinforcement learning under constraints, such as the formalism\nof constrained Markov decision processes (CMDPs) introduced by Altman (1999). In CMDPs, the\nagent\u2019s goal is to maximize reward while satisfying some linear constraints over auxiliary costs (akin\nto our measurements). Altman (1999) gave an LP-based approach when the CMDP is fully known, and\nmore recently, model-free approaches have been developed for CMDPs in high-dimensional settings.\nFor instance, Achiam et al.\u2019s (2017) constrained policy optimization (CPO) focuses on safe exploration\nand seeks to ensure approximate constraint satisfaction during the learning process. Tessler et al.\u2019s\n(2019) reward constrained policy optimization (RCPO) follows a two-timescale primal-dual approach,\ngiving guarantees for the convergence to a \ufb01xed point. 
Le et al. (2019) describe a batch off-policy algorithm with PAC-style guarantees for CMDPs using a similar game-theoretic formulation to ours. While all of these works are only applicable to orthant constraints, our algorithm can work with arbitrary convex constraints. This enables APPROPO to incorporate previously studied constraint types, such as inequality constraints that represent safety or that keep the policy's behavior close to that of an expert (Syed and Schapire, 2008), as well as constraints like the aforementioned exploration suggestion, implemented as an entropy constraint on the policy's state visitation vector. The entropy of the visitation vector was recently studied as the objective by Hazan et al. (2018), who gave an algorithm capable of maximizing a concave function (e.g., entropy) over such vectors. However, it is not clear whether their approach can be adapted to the convex constraints setting studied here.\n\nOur main contributions are: (1) a new algorithm, APPROPO, for solving reinforcement learning problems with arbitrary convex constraints; (2) a rigorous theoretical analysis that demonstrates that it can achieve sublinear regret under mild assumptions (§3); and (3) a preliminary experimental comparison with RCPO (Tessler et al., 2019), showing that our algorithm is competitive with RCPO on orthant constraints, while also handling a diversity constraint (§4).\n\n2 Setup and preliminaries: Defining the feasibility problem\n\nWe begin with a description of our learning setting. A vector-valued Markov decision process is a tuple M = (S, A, β, Ps, Pz), where S is the set of states, A is the set of actions, and β is the initial-state distribution. Each episode starts by drawing an initial state s_0 from the distribution β. Then in each step i = 0, 1, 2, . . . , the agent observes its current state s_i and takes action a_i ∈ A, causing the environment to move to the next state s_{i+1} ∼ Ps(· | s_i, a_i). 
The episode ends after a certain number of steps (called the horizon) or when a terminal state is reached. However, in our setting, instead of receiving a scalar reward, the agent observes a d-dimensional measurement vector z_i ∈ R^d, which, like s_{i+1}, is dependent on both the current state s_i and the action a_i, that is, z_i ∼ Pz(· | s_i, a_i). (Although not explicit in our setting, reward could be incorporated in the measurement vector.)\n\nTypically, actions are selected according to a (stationary) policy π so that a_i ∼ π(s_i), where π maps states to distributions over actions. We assume we are working with policies from some candidate space Π. For simplicity of presentation, we assume this space is finite, though possibly extremely large. For instance, if S and A are finite, then Π might consist of all deterministic policies. (Our results hold also when Π is infinite with minor technical adjustments.)\n\nOur aim is to control the MDP so that measurements satisfy some constraints. For any policy π, we define the long-term measurement z(π) as the expected sum of discounted measurements:\n\nz(π) ≜ E[ Σ_{i=0}^∞ γ^i z_i | π ]   (1)\n\nfor some discount factor γ ∈ [0, 1), and where expectation is over the random process described above (including randomness inherent in π).\n\nLater, we will also find it useful to consider mixed policies µ, which are distributions over finitely many stationary policies. The space of all such mixed policies over Π is denoted Δ(Π). To execute a mixed policy µ, before taking any actions, a single policy π is randomly selected according to µ; then all actions henceforth are chosen from π, for the entire episode. 
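As a concrete illustration of the discounted sum in Eq. (1), the long-term measurement of a single sampled trajectory can be computed as follows (a minimal NumPy sketch; the function name and the trajectory-as-list interface are our own assumptions, not part of the paper):

```python
import numpy as np

def discounted_measurement(measurements, gamma):
    """Discounted sum  sum_i gamma^i z_i  over one trajectory.

    `measurements` is the list of d-dimensional vectors z_0, z_1, ...
    observed during one episode; averaging this quantity over many
    episodes estimates z(pi) from Eq. (1).
    """
    z = np.zeros_like(np.asarray(measurements[0], dtype=float))
    for i, zi in enumerate(measurements):
        z += (gamma ** i) * np.asarray(zi, dtype=float)
    return z
```
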
The long-term measurement of a mixed policy, z(µ), is defined accordingly:\n\nz(µ) ≜ E_{π∼µ}[ z(π) ] = Σ_π µ(π) z(π).   (2)\n\nOur learning problem, called the feasibility problem, is specified by a convex target set C. The goal is to find a mixed policy µ whose long-term measurements lie in the set C:\n\nFeasibility Problem: Find µ ∈ Δ(Π) such that z(µ) ∈ C.   (3)\n\nFor instance, in our experiments (§4) we consider a grid-world environment where the measurements include the distance traveled, an indicator of hitting a rock, and indicators of visiting various locations on the grid. The feasibility goal is to achieve at most a certain trajectory length while keeping the probability of hitting the rock below a threshold for safety reasons, and maintaining a distribution over visited states close to the uniform distribution to enable exploration. We can potentially also handle settings where the goal is to maximize one measurement (e.g., "reward") subject to others by performing a binary search over the maximum attainable value of the reward (see §3.4).\n\n3 Approach, algorithm, and analysis\n\nBefore giving details of our approach, we overview the main ideas, which, to a large degree, follow the work of Abernethy et al. (2011), who considered the problem of solving two-player games; we extend these results to solve our feasibility problem (3).\n\nAlthough feasibility is our main focus, we actually solve the stronger problem of finding a mixed policy µ that minimizes the Euclidean distance between z(µ) and C, meaning the Euclidean distance between z(µ) and its closest point in C. That is, we want to solve\n\nmin_{µ∈Δ(Π)} dist(z(µ), C)   (4)\n\nwhere dist denotes the Euclidean distance between a point and a set.\n\nOur main idea is to take a game-theoretic approach, formulating this problem as a game and solving it. 
Specifically, suppose we can express the distance function in Eq. (4) as a maximization of the form\n\ndist(z(µ), C) = max_{λ∈Λ} λ · z(µ)   (5)\n\nfor some convex, compact set Λ.1 Then Eq. (4) becomes\n\nmin_{µ∈Δ(Π)} max_{λ∈Λ} λ · z(µ).   (6)\n\nThis min-max form immediately evokes interpretation as a two-person zero-sum game: the first player chooses a mixed policy µ, the second player responds with a vector λ, and λ · z(µ) is the amount that the first player is then required to pay to the second player. Assuming this game satisfies certain conditions, the final payout under the optimal play, called the value of the game, is the same even when the order of the players is reversed:\n\nmax_{λ∈Λ} min_{µ∈Δ(Π)} λ · z(µ).   (7)\n\nNote that the policy µ we are seeking is the solution of this game, that is, the policy realizing the minimum in Eq. (6). Therefore, to find that policy, we can apply general techniques for solving a game, namely, to let a no-regret learning algorithm play the game repeatedly against a best-response player. When played in this way, it can be shown that the averages of their plays converge to the solution of the game (details in §3.1).\n\nIn our case, we can use a no-regret algorithm for the λ-player, and best response for the µ-player. Importantly, in our context, computing best response turns out to be an especially convenient task. Given λ, best response means finding the mixed policy µ minimizing λ · z(µ). As we show below, this can be solved by treating the problem as a standard reinforcement learning task where in each step i, the agent accrues a scalar reward r_i = −λ · z_i. We refer to any algorithm for solving the problem of scalar reward maximization as the best-response oracle. 
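The scalarization handed to the best-response oracle is just a (negated) dot product; as a one-line sketch (the function name is our own, hypothetical):

```python
import numpy as np

def scalar_reward(lmbda, z):
    # Reward handed to the standard RL algorithm: r = -lambda . z.
    # Maximizing the discounted sum of this reward is the same as
    # minimizing lambda . z(pi).
    return -float(np.dot(lmbda, z))
```
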
During the run of our algorithm, we invoke this oracle for different vectors λ corresponding to different definitions of a scalar reward.\n\n1Note that the distance between a point and a set is defined as a minimization of the distance function over all points in the set C, but here we require that it be rewritten as a maximization of a linear function over some other set Λ. We will show how to achieve this in §3.2.\n\nAlgorithm 1 Solving a game with repeated play\n1: input concave-convex function g : Λ × U → R, online learning algorithm LEARNER\n2: for t = 1 to T do\n3:   LEARNER makes a decision λ_t ∈ Λ\n4:   u_t ← argmin_{u∈U} g(λ_t, u)\n5:   LEARNER observes loss function ℓ_t(λ) = −g(λ, u_t)\n6: end for\n7: return λ̄ = (1/T) Σ_{t=1}^T λ_t and ū = (1/T) Σ_{t=1}^T u_t\n\nAlthough the oracle is only capable of solving RL tasks with a scalar reward, our algorithm can leverage this capability to solve the multi-dimensional feasibility (or distance minimization) problem. In the remainder of this section, we provide the details of our approach, leading to our main algorithm and its analysis, and conclude with a discussion of steps for making a practical implementation. We begin by discussing game-playing techniques in general, which we then apply to our setting.\n\n3.1 Solving zero-sum games using online learning\n\nAt the core of our approach, we use the general technique of Freund and Schapire (1999) for solving a game by repeatedly playing a no-regret online learning algorithm against best response. For this purpose, we first briefly review the framework of online convex optimization, which we will soon use for one of the players: At time t = 1, . . . , T, the learner makes a decision λ_t ∈ Λ, the environment reveals a convex loss function ℓ_t : Λ → R, and the learner incurs loss ℓ_t(λ_t). 
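The repeated-play template of Algorithm 1, specialized to a bilinear payout g(λ, u) = λ · u with online gradient descent as the learner, can be sketched as follows (a simplified sketch under our own interface assumptions: `best_response` and `project` stand in for the u-player's oracle and the Euclidean projection onto Λ):

```python
import numpy as np

def play_repeated_game(best_response, project, dim, eta, T):
    """Algorithm-1-style repeated play for g(lam, u) = lam . u.

    The lambda-player runs OGD on the loss l_t(lam) = -lam . u_t (so the
    descent step moves along +u_t); the u-player plays best response to
    each lam_t. Returns the averaged plays (lam_bar, u_bar).
    """
    lam = np.zeros(dim)
    lam_sum = np.zeros(dim)
    u_sum = np.zeros(dim)
    for _ in range(T):
        u = np.asarray(best_response(lam), dtype=float)
        lam_sum += lam
        u_sum += u
        lam = project(lam + eta * u)
    return lam_sum / T, u_sum / T
```

For instance, with Λ = [−1, 1] and U = [1, 2], best response is u = 1 for λ ≥ 0 and u = 2 otherwise, and the averaged plays drift toward the optimal strategies of this tiny game.
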
The learner seeks to achieve small regret, the gap between its loss and the best in hindsight:\n\nRegret_T ≜ [ Σ_{t=1}^T ℓ_t(λ_t) ] − min_{λ∈Λ} [ Σ_{t=1}^T ℓ_t(λ) ].   (8)\n\nAn online learning algorithm is no-regret if Regret_T = o(T), meaning its average loss approaches the best in hindsight. An example of such an algorithm is online gradient descent (OGD) of Zinkevich (2003) (see Appendix A). If the Euclidean diameter of Λ is at most D, and ‖∇ℓ_t(λ)‖ ≤ G for any t and λ ∈ Λ, then the regret of OGD is at most DG√T.\n\nNow consider a two-player zero-sum game in which two players select, respectively, λ ∈ Λ and u ∈ U, resulting in a payout of g(λ, u) from the u-player to the λ-player. The λ-player wants to maximize this quantity and the u-player wants to minimize it. Assuming g is concave in λ and convex in u, if both spaces Λ and U are convex and compact, then the minimax theorem (von Neumann, 1928; Sion, 1958) implies that\n\nmax_{λ∈Λ} min_{u∈U} g(λ, u) = min_{u∈U} max_{λ∈Λ} g(λ, u).   (9)\n\nThis means that the λ-player has an "optimal" strategy which realizes the maximum on the left and guarantees payoff of at least the value of the game, i.e., the value given by this expression; a similar statement holds for the u-player.\n\nWe can solve this game (find these optimal strategies) by playing it repeatedly. We use a no-regret online learner as the λ-player. At each time t = 1, . . . , T, the learner chooses λ_t ∈ Λ. In response, the u-player, who in this setting is permitted knowledge of λ_t, selects u_t to minimize the payout, that is, u_t = argmin_{u∈U} g(λ_t, u). This is called best response. The online learning algorithm is then updated by setting its loss function to be ℓ_t(λ) = −g(λ, u_t). (See Algorithm 1.) As stated in Theorem 3.1, λ̄ and ū, the averages of the players' decisions, converge to the solution of the game (see Appendix B for the proof).\n\nTheorem 3.1. Let v be the value of the game in Eq. (9) and let Regret_T be the regret of the λ-player. Then for λ̄ and ū we have\n\nmin_{u∈U} g(λ̄, u) ≥ v − δ  and  max_{λ∈Λ} g(λ, ū) ≤ v + δ,  where δ = (1/T) Regret_T.   (10)\n\n3.2 Algorithm and main result\n\nWe can now apply this game-playing framework to the approach outlined at the beginning of this section. First, we show how to write distance as a maximization, as in Eq. (5). For now, we assume that our target set C is a convex cone, that is, closed under summation and also multiplication by non-negative scalars (we will remove this assumption in §3.3). With this assumption, we can apply the following lemma (Lemma 13 of Abernethy et al., 2011), in which distance to a convex cone C ⊆ R^d is written as a maximization over a dual convex cone C° called the polar cone:\n\nC° ≜ { λ : λ · x ≤ 0 for all x ∈ C }.   (11)\n\nLemma 3.2. For a convex cone C ⊆ R^d and any point x ∈ R^d,\n\ndist(x, C) = max_{λ∈C°∩B} λ · x,   (12)\n\nwhere B ≜ { x : ‖x‖ ≤ 1 } is the Euclidean ball of radius 1 at the origin.\n\nThus, Eq. (5) is immediately achieved by setting Λ = C° ∩ B, so the distance minimization problem (4) can be cast as the min-max problem (6). This is a special case of the zero-sum game (9), with U = { z(µ) : µ ∈ Δ(Π) } and g(λ, u) = λ · u, which can be solved with Algorithm 1. Note that the set U is convex and compact, because it is a linear transformation of the convex and compact set Δ(Π). We will see below that the best responses u_t in Algorithm 1 can be expressed as z(π_t) for some π_t ∈ Π, and so Algorithm 1 returns\n\nū = (1/T) Σ_{t=1}^T z(π_t) = z( (1/T) Σ_{t=1}^T π_t ),\n\nwhich is exactly the long-term measurement vector of the mixed policy µ̄ = (1/T) Σ_{t=1}^T π_t. For this mixed policy, Theorem 3.1 immediately implies\n\ndist(z(µ̄), C) ≤ min_{µ∈Δ(Π)} dist(z(µ), C) + (1/T) Regret_T.   (13)\n\nIf the problem is feasible, then min_{µ∈Δ(Π)} dist(z(µ), C) = 0, and since Regret_T = o(T), our long-term measurement z(µ̄) converges to the target set and solves the feasibility problem (3). It remains to specify how to implement the no-regret learner for the λ-player and best response for the u-player. We discuss these next, beginning with the latter.\n\nThe best-response player, for a given λ, aims to minimize λ · z(µ) over mixed policies µ, but since this objective is linear in the mixture weights µ(π) (see Eq. 2), it suffices to minimize λ · z(π) over stationary policies π ∈ Π. The key point, as already mentioned, is that this is the same as finding a policy that maximizes long-term reward in a standard reinforcement learning task if we define the scalar reward to be r_i = −λ · z_i. This is because the reward of a policy π is given by\n\nRλ(π) ≜ E[ Σ_{i=0}^∞ γ^i r_i | π ] = E[ Σ_{i=0}^∞ γ^i (−λ · z_i) | π ] = −λ · E[ Σ_{i=0}^∞ γ^i z_i | π ] = −λ · z(π).   (14)\n\nTherefore, maximizing Rλ(π), as in standard RL, is equivalent to minimizing λ · z(π). Thus, best response can be implemented using any one of the many well-studied RL algorithms that maximize a scalar reward. We refer to such an RL algorithm as the best-response oracle. For robustness, we allow this oracle to return an approximately optimal policy.\n\nBest-response oracle: BESTRESPONSE(λ).\nGiven λ ∈ R^d, return a policy π ∈ Π that satisfies Rλ(π) ≥ max_{π'∈Π} Rλ(π') − ε0, where Rλ(π) is the long-term reward of policy π with scalar reward defined as r = −λ · z.\n\nFor the λ-player, we do our analysis using online gradient descent (Zinkevich, 2003), an effective no-regret learner. For its update, OGD needs the gradient of the loss function ℓ_t(λ) = −λ · z(π_t), which is just −z(π_t). 
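This gradient only requires an estimate of z(π_t); a Monte Carlo sketch of such an estimate (with a hypothetical `rollout` hook returning one episode's measurement vectors):

```python
import numpy as np

def estimate_z(rollout, policy, gamma, n_traj):
    """Average the discounted measurement sum over n_traj independent
    episodes generated with `policy` -- a sketch of the EST oracle."""
    total = None
    for _ in range(n_traj):
        acc = None
        for i, zi in enumerate(rollout(policy)):
            term = (gamma ** i) * np.asarray(zi, dtype=float)
            acc = term if acc is None else acc + term
        total = acc if total is None else total + acc
    return total / n_traj
```
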
With access to the MDP, z(π) can be estimated simply by generating multiple trajectories using π and averaging the observed measurements. We formalize this by assuming access to an estimation oracle for estimating z(π).\n\nAlgorithm 2 APPROPO\n1: input projection oracle ΓC(·) for target set C which is a convex cone, best-response oracle BESTRESPONSE(·), estimation oracle EST(·), step size η, number of iterations T\n2: define Λ ≜ C° ∩ B, and its projection operator ΓΛ(x) ≜ (x − ΓC(x)) / max{1, ‖x − ΓC(x)‖}\n3: initialize λ_1 arbitrarily in Λ\n4: for t = 1 to T do\n5:   Compute an approximately optimal policy for standard RL with scalar reward r = −λ_t · z: π_t ← BESTRESPONSE(λ_t)\n6:   Call the estimation oracle to approximate the long-term measurement for π_t: ẑ_t ← EST(π_t)\n7:   Update λ_t using online gradient descent with the loss function ℓ_t(λ) = −λ · ẑ_t: λ_{t+1} ← ΓΛ(λ_t + η ẑ_t)\n8: end for\n9: return µ̄, a uniform mixture over π_1, . . . , π_T\n\nEstimation oracle: EST(π).\nGiven policy π, return ẑ satisfying ‖ẑ − z(π)‖ ≤ ε1.\n\nOGD also requires projection to the set Λ = C° ∩ B. In fact, if we can simply project onto the target set C, which is more natural, then it is possible to also project onto Λ. Consider an arbitrary x and denote its projection onto C as ΓC(x). Then the projection of x onto the polar cone C° is ΓC°(x) = x − ΓC(x) (Ingram and Marsh, 1991). Given the projection ΓC°(x) and further projecting onto B, we obtain ΓΛ(x) = (x − ΓC(x)) / max{1, ‖x − ΓC(x)‖} (because Dykstra's projection algorithm converges to this point after two steps, Boyle and Dykstra, 1986). 
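In code, this two-step projection onto Λ = C° ∩ B is short (a sketch; `project_to_C` stands for whatever projection oracle the target set admits):

```python
import numpy as np

def project_to_lambda(x, project_to_C):
    # Gamma_Lambda(x) = (x - Gamma_C(x)) / max{1, ||x - Gamma_C(x)||}:
    # first project onto the polar cone of C, then scale the result
    # into the unit ball if it falls outside.
    x = np.asarray(x, dtype=float)
    v = x - np.asarray(project_to_C(x), dtype=float)
    return v / max(1.0, float(np.linalg.norm(v)))
```

For example, if C is the nonnegative orthant, its projection is a coordinatewise clamp at zero.
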
Therefore, it suffices to require access to a projection oracle for C:\n\nProjection oracle: ΓC(x) = argmin_{x'∈C} ‖x − x'‖.\n\nPulling these ideas together and plugging into Algorithm 1, we obtain our main algorithm, called APPROPO (Algorithm 2), for approachability-based policy optimization. The algorithm provably yields a mixed policy that approximately minimizes distance to the set C, as shown in Theorem 3.3 (proved in Appendix C).\n\nTheorem 3.3. Assume that C is a convex cone and for all measurements we have ‖z‖ ≤ B. Suppose we run Algorithm 2 for T rounds with η = T^{-1/2} (B/(1−γ) + ε1)^{-1}. Then\n\ndist(z(µ̄), C) ≤ min_{µ∈Δ(Π)} dist(z(µ), C) + (B/(1−γ) + ε1) T^{-1/2} + ε0 + 2ε1,   (15)\n\nwhere µ̄ is the mixed policy returned by the algorithm.\n\nWhen the goal is to solve the feasibility problem (3) rather than the stronger distance minimization (4), we can make use of a weaker reinforcement learning oracle, which only needs to find a policy that is "good enough" in the sense of providing long-term reward above some threshold:\n\nPositive-response oracle: POSRESPONSE(λ).\nGiven λ ∈ R^d, return π ∈ Π that satisfies Rλ(π) ≥ −ε0 if max_{π'∈Π} Rλ(π') ≥ 0 (and arbitrary π otherwise), where Rλ(π) is the long-term reward of π with scalar reward r = −λ · z.\n\nWhen the problem is feasible, it can be shown that there must exist π ∈ Π with Rλ(π) ≥ 0, and furthermore, that ℓ_t(λ_t) ≥ −(ε0 + ε1) (from Lemma C.1 in Appendix C). This means, if the goal is feasibility, we can modify Algorithm 2, replacing BESTRESPONSE with POSRESPONSE, and adding a test at the end of each iteration to report infeasibility if ℓ_t(λ_t) < −(ε0 + ε1). The pseudocode is provided in Algorithm 4 in Appendix D along with the proof of the following convergence bound:\n\nTheorem 3.4. Assume that C is a convex cone and for all measurements we have ‖z‖ ≤ B. Suppose we run Algorithm 4 for T rounds with η = T^{-1/2} (B/(1−γ) + ε1)^{-1}. Then either the algorithm reports infeasibility or returns µ̄ such that\n\ndist(z(µ̄), C) ≤ (B/(1−γ) + ε1) T^{-1/2} + ε0 + 2ε1.   (16)\n\n3.3 Removing the cone assumption\n\nOur results so far have assumed the target set C is a convex cone. If instead C is an arbitrary convex, compact set, we can use the technique of Abernethy et al. (2011) and apply our algorithm to a specific convex cone C̃ constructed from C to obtain a solution with provable guarantees.\n\nIn more detail, given a compact, convex target set C ⊆ R^d, we augment every vector in C with a new coordinate held fixed to some value κ > 0, and then let C̃ be its conic hull. Thus,\n\nC̃ = cone(C × {κ}),  where cone(X) = { αx | x ∈ X, α ≥ 0 }.   (17)\n\nGiven our original vector-valued MDP M = (S, A, β, Ps, Pz), we define a new MDP M' = (S, A, β, Ps, P'_{z'}) with (d + 1)-dimensional measurement z' ∈ R^{d+1}, defined (and generated) by\n\nz_i ∼ Pz(· | s_i, a_i),  z'_i = z_i ⊕ ⟨(1 − γ)κ⟩,   (18)\n\nwhere ⊕ denotes vector concatenation. Writing the long-term measurement for M and M' as z and z' respectively, we have z'(π) = z(π) ⊕ ⟨κ⟩ for any policy π ∈ Π, and similarly for any mixed policy µ.\n\nThe main idea is to apply the algorithms described above to the modified MDP M' using the cone C̃ as target set. For an appropriate choice of κ > 0, we show that the resulting mixed policy will approximately minimize distance to C for the original MDP M. This is a consequence of the following lemma, an extension of Lemma 14 of Abernethy et al. (2011), which shows that distances are largely preserved in a controllable way under this construction. The proof is in Appendix E.\n\nLemma 3.5. Consider a compact, convex set C in R^d and x ∈ R^d. For any δ > 0, let C̃ = cone(C × {κ}), where κ = max_{x∈C} ‖x‖ / √(2δ). Then dist(x, C) ≤ (1 + δ) dist(x ⊕ ⟨κ⟩, C̃).\n\nCorollary 3.6. Assume that C is a convex, compact set and for all measurements we have ‖z‖ ≤ B. Then by putting η = T^{-1/2} ((B + κ)/(1−γ) + ε1)^{-1} and running Algorithm 2 for T rounds with M' as the MDP and C̃ as the target set, the mixed policy µ̄ returned by the algorithm satisfies\n\ndist(z(µ̄), C) ≤ (1 + δ) ( min_{µ∈Δ(Π)} dist(z(µ), C) + ((B + κ)/(1−γ) + ε1) T^{-1/2} + ε0 + 2ε1 ),   (19)\n\nwhere κ = max_{x∈C} ‖x‖ / √(2δ) for an arbitrary δ > 0. Similarly for Algorithm 4, we either have\n\ndist(z(µ̄), C) ≤ (1 + δ) ( ((B + κ)/(1−γ) + ε1) T^{-1/2} + ε0 + 2ε1 )   (20)\n\nor the algorithm reports infeasibility.\n\n3.4 Practical implementation of the positive response and estimation oracles\n\nWe next briefly describe a few techniques for the practical implementation of our algorithm. As discussed in §3.2, when our aim is to solve a feasibility problem, we only need access to a positive response oracle. 
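Such an oracle can be wrapped around an ordinary episodic RL learner; a minimal sketch of that wrapper (the `run_episode` and `improve` hooks into the learner are our own assumptions, not an interface from the paper):

```python
from collections import deque
import numpy as np

def positive_response(run_episode, improve, lmbda, gamma, n, eps,
                      max_episodes):
    """Run RL updates until the trailing average of the last n episode
    rewards under r = -lambda . z rises above -eps; the averaged
    measurements then double as the estimate z_hat."""
    rewards = deque(maxlen=n)
    zs = deque(maxlen=n)
    for _ in range(max_episodes):
        z = None
        for i, zi in enumerate(run_episode()):   # one episode's z_i's
            term = (gamma ** i) * np.asarray(zi, dtype=float)
            z = term if z is None else z + term
        rewards.append(-float(np.dot(lmbda, z)))
        zs.append(z)
        improve()                                # one RL update step
        if len(rewards) == n and np.mean(rewards) > -eps:
            return np.mean(zs, axis=0)
    return None                                  # "empirically infeasible"
```
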
In episodic environments, it is straightforward to use any standard iterative RL approach as a positive response oracle: As the RL algorithm runs, we track its accrued rewards, and when the trailing average of the last n trajectory-level rewards goes above some level −ε, we return the current policy (possibly specified implicitly as a Q-function).2 Furthermore, the average of the measurement vectors z collected over the last n trajectories can serve as the estimate ẑ_t of the long-term measurement required by the algorithm, side-stepping the need for an additional estimation oracle. The hyperparameters ε and n influence the oracle quality; specifically, assuming that the rewards are bounded and the overall number of trajectories until the oracle terminates is at most polynomial in n, we have ε0 = ε + O(√((log n)/n)) and ε1 = O(√((log n)/n)). In principle, we could use Theorem 3.4 to select a value T at which to stop; in practice, we run until the running average of the measurements ẑ_t gets within a small distance of the target set C. If the RL algorithm runs for too long without achieving non-negative rewards, we stop and declare that the underlying problem is "empirically infeasible." (Actual infeasibility would hold if it is truly not possible to reach non-negative expected reward.)\n\n2This assumes that the last n trajectories accurately estimate the performance of the final iterate. If that is not the case, the oracle can instead return the mixture of the policies corresponding to the last n iterates.\n\nFigure 1: Left: The Mars rover environment. The agent starts in top-left and needs to reach the goal in bottom-right while avoiding rocks. Middle, Right: Visitation probabilities of APPROPO (middle) and APPROPO with a diversity constraint (right) at 12k samples. 
Both plots are based on a single run.\n\nAn important mechanism to further speed up our algorithm is to maintain a "cache" of all the policies returned by the positive response oracle so far. Each of the cached policies π is stored with the estimate ẑ(π) ≈ z(π) of its expected measurement vector, based on its last n iterations (as above). In each outer-loop iteration of our algorithm, we first check if the cache contains a policy that already achieves a reward at least −ε under the new λ; this can be determined from the cached ẑ(π) since the reward is just a linear function of the measurement vector. If such a policy is found, we return it, alongside ẑ(π), instead of calling the oracle. Otherwise, we pick the policy from the cache with the largest reward (below −ε by assumption) and use it to warm-start the RL algorithm implementing the oracle. The cache can be initialized with a few random policies (as we do in our experiments), effectively implementing randomized weight initialization.\n\nThe cache interacts well with a straightforward binary-search scheme that can be used when the goal is to maximize some reward (possibly subject to additional constraints), rather than only satisfy a set of constraints. The feasibility problems corresponding to iterates of binary search only differ in the constraint values, but use the same measurements, so the same cache can be reused across all iterations.\n\nRunning time. Note that APPROPO spends the bulk of its running time executing the best-response oracle. It additionally performs updates of λ, but these tend to be orders of magnitude cheaper than any per-episode (or per-transition) updates within the oracle. 
For example, in our experiments (§4), the dimension of λ is either 2 or 66 (without or with the diversity constraint, respectively), whereas the policies π trained by the oracle are two-layer networks described by 8,704 floating-point numbers.\n\n4 Experiments\n\nWe next evaluate the performance of APPROPO and demonstrate its ability to handle a variety of constraints. For simplicity, we focus on the feasibility version (Algorithm 4 in Appendix D). We compare APPROPO with the RCPO approach of Tessler et al. (2019), which adapts policy gradient, specifically, advantage actor-critic (A2C) (Mnih et al., 2016), to find a fixed point of the Lagrangian of the constrained policy optimization problem. RCPO maintains and updates a vector of Lagrange multipliers, which is then used to derive a reward for A2C. The vector of Lagrange multipliers serves a similar role as our λ, and the overall structure of RCPO is similar to APPROPO, so RCPO is a natural baseline for a comparison. Unlike APPROPO, RCPO only allows orthant constraints and it seeks to maximize reward, whereas APPROPO solves the feasibility problem.\n\nFor a fair comparison, APPROPO uses A2C as a positive-response oracle, with the same hyperparameters as used in RCPO. Online learning in the outer loop of APPROPO was implemented via online gradient descent with momentum. Both RCPO and APPROPO have an outer-loop learning rate parameter, which we tuned over a grid of values 10^i with integer i (see Appendix F for the details). Here, we report the results with the best learning rate for each method.\n\nWe ran our experiments on a small version of the Mars rover grid-world environment, used previously for the evaluation of RCPO (Tessler et al., 2019). In this environment, depicted in Figure 1 (left), the agent must move from the starting position to the goal without crashing into rocks. The episode terminates when a rock or the goal is reached, or after 300 steps. 
The environment is stochastic: with probability 0.05, the agent's action is perturbed to a random action. The agent receives a small negative reward each time step and zero for terminating, with discount factor γ = 0.99. We used the same safety constraint as Tessler et al. (2019): ensure that the (discounted) probability of hitting a rock is at most a fixed threshold (set to 0.2). RCPO seeks to maximize reward subject to this constraint.

Figure 2: Left: The performance of the algorithms as a function of the number of samples (steps in the environment), showing average and standard deviation over 25 runs. The vertical axes correspond to the three constraints, with thresholds shown as a dashed line; for reward (middle) this is a lower bound; for the others it is an upper bound. Right: Each point in the scatter plot represents the reward and the probability of failure obtained by the policy learned by the algorithm at the specified number of samples. The grey region is the target set. Different points represent different random runs.

APPROPO solves a feasibility problem with the same safety constraint, and an additional constraint requiring that the reward be at least −0.17 (this is slightly lower than the final reward achieved by RCPO). We also experimented with including the exploration suggestion as a "diversity constraint," requiring that the Euclidean distance between our visitation probability vector (across the cells of the grid) and the uniform distribution over the upper-right triangle cells of the grid (excluding rocks) be at most 0.12.³

In Figure 2 (left), we show how the probability of failure, the average reward, and the distance to the uniform distribution over the upper triangle vary as a function of the number of samples seen by each algorithm. Both variants of our algorithm are able to satisfy the safety constraints and reach a similar reward as RCPO with a similar number of samples (around 8k samples).
Furthermore, including the diversity constraint, which RCPO is not capable of enforcing, allowed our method to reach a more diverse policy, as depicted in both Figure 2 (bottom-left) and Figure 1 (right).

5 Conclusion

In this paper, we introduced APPROPO, an algorithm for solving reinforcement learning problems with arbitrary convex constraints. APPROPO can combine any no-regret online learner with any standard RL algorithm that optimizes a scalar reward. Theoretically, we showed that for the specific case of online gradient descent, APPROPO learns to approach the constraint set at a rate of 1/√T, with an additive non-vanishing term that measures the optimality gap of the reinforcement learner. Experimentally, we demonstrated that APPROPO can be applied with well-known RL algorithms for discrete domains (like actor-critic), and achieves similar performance as RCPO (Tessler et al., 2019), while being able to satisfy additional types of constraints. In sum, this yields a theoretically justified, practical algorithm for solving the approachability problem in reinforcement learning.

³This number ensures that APPROPO without the diversity constraint does not satisfy it automatically.

References

Abernethy, J., Bartlett, P. L., and Hazan, E. (2011). Blackwell approachability and no-regret learning are equivalent. In Proceedings of the 24th Annual Conference on Learning Theory, pages 27–46.

Achiam, J., Held, D., Tamar, A., and Abbeel, P. (2017). Constrained policy optimization. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 22–31.

Altman, E. (1999). Constrained Markov Decision Processes, volume 7. CRC Press.

Blackwell, D. (1956). An analog of the minimax theorem for vector payoffs. Pacific Journal of Mathematics, 6(1):1–8.

Boyle, J. P. and Dykstra, R. L. (1986).
A method for finding projections onto the intersection of convex sets in Hilbert spaces. In Advances in Order Restricted Statistical Inference, pages 28–47. Springer.

Freund, Y. and Schapire, R. E. (1999). Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29(1–2):79–103.

Hazan, E., Kakade, S. M., Singh, K., and Van Soest, A. (2018). Provably efficient maximum entropy exploration. arXiv preprint arXiv:1812.02690.

Ingram, J. M. and Marsh, M. (1991). Projections onto convex cones in Hilbert space. Journal of Approximation Theory, 64(3):343–350.

Le, H. M., Voloshin, C., and Yue, Y. (2019). Batch policy learning under constraints. arXiv preprint arXiv:1903.08738.

Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. (2016). Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937.

Sion, M. (1958). On general minimax theorems. Pacific Journal of Mathematics, 8(1):171–176.

Syed, U. and Schapire, R. E. (2008). A game-theoretic approach to apprenticeship learning. In Advances in Neural Information Processing Systems (NeurIPS).

Tessler, C., Mankowitz, D. J., and Mannor, S. (2019). Reward constrained policy optimization. In International Conference on Learning Representations.

von Neumann, J. (1928). Zur Theorie der Gesellschaftsspiele. Mathematische Annalen, 100:295–320.

Zinkevich, M. (2003). Online convex programming and generalized infinitesimal gradient ascent.
In Proceedings of the International Conference on Machine Learning (ICML).