{"title": "Trust Region-Guided Proximal Policy Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 626, "page_last": 636, "abstract": "Proximal policy optimization (PPO) is one of the most popular deep reinforcement learning (RL) methods, achieving state-of-the-art performance across a wide range of challenging tasks. However, as a model-free RL method, the success of PPO relies heavily on the effectiveness of its exploratory policy search. In this paper, we give an in-depth analysis on the exploration behavior of PPO, and show that PPO is prone to suffer from the risk of lack of exploration especially under the case of bad initialization, which may lead to the failure of training or being trapped in bad local optima. To address these issues, we proposed a novel policy optimization method, named Trust Region-Guided PPO (TRGPPO), which adaptively adjusts the clipping range within the trust region. We formally show that this method not only improves the exploration ability within the trust region but enjoys a better performance bound compared to the original PPO as well. Extensive experiments verify the advantage of the proposed method.", "full_text": "Trust Region-Guided Proximal Policy Optimization\n\nYuhui Wang , Hao He , Xiaoyang Tan , Yaozhong Gan\n\nCollege of Computer Science and Technology,Nanjing University of Aeronautics and Astronautics\n\nMIIT Key Laboratory of Pattern Analysis and Machine Intelligence\n\nCollaborative Innovation Center of Novel Software Technology and Industrialization\n\n{y.wang, hugo, x.tan, yzgancn}@nuaa.edu.cn\n\nAbstract\n\nProximal policy optimization (PPO) is one of the most popular deep reinforcement\nlearning (RL) methods, achieving state-of-the-art performance across a wide range\nof challenging tasks. However, as a model-free RL method, the success of PPO\nrelies heavily on the effectiveness of its exploratory policy search. 
In this paper, we give an in-depth analysis of the exploration behavior of PPO and show that PPO is prone to insufficient exploration, especially under bad initialization, which may lead to failure of training or entrapment in bad local optima. To address these issues, we propose a novel policy optimization method, named Trust Region-Guided PPO (TRGPPO), which adaptively adjusts the clipping range within the trust region. We formally show that this method not only improves the exploration ability within the trust region but also enjoys a better performance bound compared to the original PPO. Extensive experiments verify the advantage of the proposed method.\n\n1 Introduction\n\nDeep model-free reinforcement learning has achieved great successes in recent years, notably in video games [11], board games [19], robotics [10], and challenging control tasks [17, 5]. Among others, policy gradient (PG) methods are commonly used model-free policy search algorithms [14]. However, a first-order optimizer is not very accurate in curved areas of the objective, and one can become overconfident and make bad moves that ruin the progress of the training. Trust region policy optimization (TRPO) [16] and proximal policy optimization (PPO) [18] are two representative methods that address this issue. To ensure stable learning, both methods impose a constraint on the difference between the new policy and the old one, but with different policy metrics.\n\nIn particular, TRPO uses a divergence between the policy distributions (total variation divergence or KL divergence), whereas PPO uses a probability ratio between the two policies1. The divergence metric is theoretically justified, as optimizing the policy within the divergence constraint (named the trust region) leads to guaranteed monotonic performance improvement.
Nevertheless, the complicated second-order optimization involved in TRPO makes it computationally inefficient and difficult to scale up to large-scale problems. PPO significantly reduces the complexity by adopting a clipping mechanism, which allows it to use first-order optimization. PPO has proven to be very effective in dealing with a wide range of challenging tasks while being simple to implement and tune.\n\nHowever, how the underlying metric adopted for the policy constraint influences the behavior of the algorithm is not well understood. It is natural to expect that different metrics will yield RL algorithms with different exploration behaviors. In this paper, we give an in-depth analysis of the\n\n1There is also a variant of PPO which uses a KL divergence penalty. In this paper we refer to the variant that clips the probability ratio as PPO by default, as it performs better in practice.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nexploration behavior of PPO, and show that the ratio-based metric of PPO tends to continuously weaken the likelihood of choosing an action in the future if that action is not preferred by the current policy. As a result, PPO is prone to insufficient exploration, especially under bad initialization, which may lead to failure of training or entrapment in bad local optima.\n\nTo address these issues, we propose an enhanced PPO method, named Trust Region-Guided PPO (TRGPPO), which is theoretically justified by improved exploration ability and a better performance bound compared to the original PPO. In particular, TRGPPO constructs a connection between the ratio-based metric and the trust region-based one, such that the resulting ratio clipping mechanism allows the constraints imposed on the less preferred actions to be relaxed.
This effectively encourages the policy to explore more on potentially valuable actions, no matter whether they were preferred by the previous policies or not. Meanwhile, the ranges of the new ratio-based constraints are kept within the trust region, so the stability of learning is not harmed. Extensive results on several benchmark tasks show that the proposed method significantly improves both policy performance and sample efficiency. Source code is available at https://github.com/wangyuhuix/TRGPPO.\n\n2 Related Work\n\nMany researchers have tried to improve proximal policy learning from different perspectives. Chen et al. presented a so-called \u201cadaptive clipping mechanism\u201d for PPO [3]. Their method adaptively adjusts the scale of the policy gradient according to the significance of the state-action pair; it does not alter the clipping mechanism of PPO, whereas our method adopts a new adaptive clipping mechanism. Fakoor et al. used proximal learning with a penalty on the KL divergence to utilize off-policy data, which can effectively reduce the sample complexity [6]. In our previous work, we introduced trust region-based clipping to improve the boundedness of the policy in PPO [22]. In this work, we instead use a trust region-based criterion to guide the adjustment of the clipping range, which requires additional computation but is more flexible and interpretable.\n\nSeveral methods have been proposed to improve exploration in recent research. Osband et al. tried to conduct consistent exploration using a posterior sampling method [12]. Fortunato et al. presented a method named NoisyNet which improves exploration by generating perturbations of the network weights [7].
Another popular algorithm is the soft actor-critic method (SAC) [9], which maximizes expected reward and entropy simultaneously.\n\n3 Preliminaries\n\nA Markov Decision Process (MDP) is described by the tuple (S, A, T, c, ρ1, γ). S and A are the state space and action space; T : S × A × S → R is the transition probability distribution; c : S × A → R is the reward function; ρ1 is the distribution of the initial state s1; and γ ∈ (0, 1) is the discount factor. The return is the accumulated discounted reward from timestep t onwards, R^γ_t = Σ_{k=0}^∞ γ^k c(s_{t+k}, a_{t+k}). The performance of a policy π is defined as η(π) = E_{s∼ρ_π, a∼π}[c(s, a)], where ρ_π(s) = (1 − γ) Σ_{t=1}^∞ γ^{t−1} ρ^π_t(s) and ρ^π_t is the density function of the state at time t. Policy gradient methods [20] update the policy by the following surrogate performance objective: L_{π_old}(π) = E_{s∼ρ_{π_old}, a∼π_old}[(π(a|s)/π_old(a|s)) A_{π_old}(s, a)] + η(π_old), where π(a|s)/π_old(a|s) is the probability ratio between the new policy π and the old policy π_old, and A_π(s, a) = E[R^γ_t | s_t = s, a_t = a; π] − E[R^γ_t | s_t = s; π] is the advantage value function of policy π. Let D^s_KL(π_old, π) ≜ D_KL(π_old(·|s) || π(·|s)). Schulman et al. [16] derived the following performance bound:\n\nTheorem 1. Define C = max_{s,a} |A_{π_old}(s, a)| · 4γ/(1 − γ)^2 and M_{π_old}(π) = L_{π_old}(π) − C max_{s∈S} D^s_KL(π_old, π).
We have η(π) ≥ M_{π_old}(π) and η(π_old) = M_{π_old}(π_old).\n\nThis theorem implies that maximizing M_{π_old}(π) guarantees non-decreasing performance of the new policy π. To take larger steps in a robust way, TRPO optimizes L_{π_old}(π) under the constraint max_{s∈S} D^s_KL(π_old, π) ≤ δ, which is called the trust region.\n\n4 The Exploration Behavior of PPO\n\nIn this section we first give a brief review of PPO and then show how PPO suffers from an exploration issue when the initial policy is sufficiently far from the optimal one.\n\nPPO imposes the policy constraint through a clipped surrogate objective function:\n\nL^CLIP_{π_old}(π) = E[min((π(a|s)/π_old(a|s)) A_{π_old}(s, a), clip(π(a|s)/π_old(a|s), l_{s,a}, u_{s,a}) A_{π_old}(s, a))]   (1)\n\nwhere l_{s,a} ∈ (0, 1) and u_{s,a} ∈ (1, +∞) are called the lower and upper clipping ranges on state-action (s, a). The probability ratio π(a|s)/π_old(a|s) is clipped once it is out of (l_{s,a}, u_{s,a}). Therefore, such a clipping mechanism can be considered as a constraint on the policy with a ratio-based metric, i.e., l_{s,a} ≤ π(a|s)/π_old(a|s) ≤ u_{s,a}, which can be rewritten as −π_old(a|s)(1 − l_{s,a}) ≤ π(a|s) − π_old(a|s) ≤ π_old(a|s)(u_{s,a} − 1). We call (L^l_{π_old}(s, a), U^u_{π_old}(s, a)) ≜ (−π_old(a|s)(1 − l_{s,a}), π_old(a|s)(u_{s,a} − 1)) the feasible variation range of policy π w.r.t.
π_old on state-action (s, a) with the clipping range setting (l, u), which measures the allowable change of policy π on state-action (s, a).\n\nNote that the original PPO adopts a constant clipping range, i.e., l_{s,a} = 1 − ε, u_{s,a} = 1 + ε for all (s, a) [18]. The corresponding feasible variation range is (L^{1−ε}_{π_old}(s, a), U^{1+ε}_{π_old}(s, a)) = (−π_old(a|s)ε, π_old(a|s)ε). As can be seen, given an optimal action a_opt and a sub-optimal one a_subopt on state s, if π_old(a_opt|s) < π_old(a_subopt|s), then |(L^{1−ε}_{π_old}(s, a_opt), U^{1+ε}_{π_old}(s, a_opt))| < |(L^{1−ε}_{π_old}(s, a_subopt), U^{1+ε}_{π_old}(s, a_subopt))|. This means that the allowable change of the likelihood of the optimal action, i.e., π(a_opt|s), is smaller than that of π(a_subopt|s). Since π(a_opt|s) and π(a_subopt|s) are in a zero-sum competition, such an unequal restriction may continuously weaken the likelihood of the optimal action and leave the policy trapped in a local optimum. We now give a formal illustration.\n\nAlgorithm 1 Simplified Policy Iteration with PPO\n1: Initialize a policy π_0, t ← 0.\n2: repeat\n3:   Sample an action â_t ∼ π_t.\n4:   Get the new policy π_{t+1} by optimizing the empirical surrogate objective function of PPO based on â_t:\n\n     π̂_{t+1}(a) = π_t(a) u_a,                                if a = â_t and c(â_t) > 0;\n     π̂_{t+1}(a) = π_t(a) l_a,                                if a = â_t and c(â_t) < 0;\n     π̂_{t+1}(a) = π_t(a) − π_t(â_t)(u_{â_t} − 1)/(|A| − 1),   if a ≠ â_t and c(â_t) > 0;\n     π̂_{t+1}(a) = π_t(a) + π_t(â_t)(1 − l_{â_t})/(|A| − 1),   if a ≠ â_t and c(â_t) < 0;\n     π̂_{t+1}(a) = π_t(a),                                    if c(â_t) = 0.   (2)\n\n5:   π_{t+1} = Normalize(π̂_{t+1})2.
t ← t + 1.\n6: until π_t converges\n\nWe investigate the exploration behavior of PPO in the discrete-armed bandit problem, where there are no state transitions and the action space is discrete. The objective function of PPO in this problem is L^CLIP_{π_old}(π) = E[min((π(a)/π_old(a)) c(a), clip(π(a)/π_old(a), l_a, u_a) c(a))]. Let A+ ≜ {a ∈ A | c(a) > 0} and A− ≜ {a ∈ A | c(a) < 0} denote the actions which have positive and negative reward respectively, and let A_subopt = A+/{a_opt} denote the set of sub-optimal actions, where a_opt = argmax_a c(a) denotes the optimal action3 and a_subopt ∈ A_subopt a sub-optimal one. Let us consider a simplified online policy iteration algorithm with PPO. As presented in Algorithm 1, the algorithm iteratively samples an action â_t based on the old policy π_old at each step and obtains a new policy π_new.\n\nWe measure the exploration ability by the expected distance between the learned policy π_t and the optimal policy π* after t steps of learning, i.e., Δ_{π_0,t} ≜ E_{π_t}[‖π_t − π*‖_∞ | π_0], where π*(a_opt) = 1 and π*(a) = 0 for a ≠ a_opt; π_0 is the initial policy, and π_t is a stochastic element of the policy space which depends on the previously sampled actions {a_{t'}}_{t'=1}^{t−1} (see eq. (2)).\n\n2π̂_{t+1} may violate the probability rules, e.g., Σ_a π̂_{t+1}(a) > 1. Thus we need to apply a normalization operation to rectify it. To simplify the analysis, we assume that π_{t+1} = π̂_{t+1}.\n3We assume that there is only one optimal action.
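The update in eq. (2) is simple to simulate directly. Below is a minimal sketch (function and variable names are ours) of one step of Algorithm 1 with the constant clipping range l_a = 1 − ε, u_a = 1 + ε, under the simplifying assumption π_{t+1} = π̂_{t+1} of footnote 2; the numbers follow the three-armed bandit of Example 1 in this section.

```python
# One step of the simplified PPO policy iteration (eq. (2)) on a
# discrete-armed bandit, with constant clipping range l = 1 - eps,
# u = 1 + eps and (per footnote 2) no normalization step.

def ppo_bandit_update(pi, rewards, sampled, eps=0.2):
    """Return pi_hat_{t+1} after sampling action `sampled` from pi_t."""
    n = len(pi)
    new_pi = list(pi)
    c = rewards[sampled]
    if c > 0:  # ratio pushed up to the clipping bound u = 1 + eps
        new_pi[sampled] = pi[sampled] * (1.0 + eps)
        for a in range(n):
            if a != sampled:
                new_pi[a] = pi[a] - pi[sampled] * eps / (n - 1)
    elif c < 0:  # ratio pushed down to the clipping bound l = 1 - eps
        new_pi[sampled] = pi[sampled] * (1.0 - eps)
        for a in range(n):
            if a != sampled:
                new_pi[a] = pi[a] + pi[sampled] * eps / (n - 1)
    return new_pi

# Three-armed bandit of Example 1: a_opt = 0, a_subopt = 1, a_worst = 2.
rewards = [1.0, 0.5, -50.0]
pi0 = [0.2, 0.6, 0.2]
pi1 = ppo_bandit_update(pi0, rewards, sampled=1)  # draw the sub-optimal arm
# pi(a_opt) drops from 0.2 to 0.14 while pi(a_subopt) grows to 0.72.
```

Note that this update conserves total probability mass (the gain π_t(â_t)(u_{â_t} − 1) is spread uniformly over the other |A| − 1 actions), although individual entries can leave [0, 1] after many steps, which is why the normalization in step 5 is needed in general; a single draw of the sub-optimal arm already moves mass away from a_opt.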
Note that a smaller Δ_{π_0,t} means better exploration ability, as the policy is closer to the optimal one. We now derive the exact form of Δ_{π_0,t}.\n\nLemma 1. Δ_{π_0,t} ≜ E_{π_t}[‖π_t − π*‖_∞ | π_0] = 1 − E_{π_t}[π_t(a_opt) | π_0].\nLemma 2. E_{π_{t+1}}[π_{t+1}(a) | π_0] = E_{π_t}[E_{π_{t+1}}[π_{t+1}(a) | π_t] | π_0].\n\nWe provide all the proofs in Appendix E. Lemma 1 implies that we can obtain the exploration ability Δ_{π_0,t} by computing the expected likelihood of the optimal action a_opt, i.e., E_{π_t}[π_t(a_opt) | π_0], and Lemma 2 shows an iterative way to compute it. By eq. (2), for an action a which satisfies c(a) > 0, we have\n\nE_{π_{t+1}}[π_{t+1}(a) | π_t] = π_t(a) + [π_t^2(a)(u_a − 1) − Σ_{a+ ∈ A+/{a}} (π_t^2(a+)/(|A| − 1))(u_{a+} − 1) + Σ_{a− ∈ A−} (π_t^2(a−)/(|A| − 1))(1 − l_{a−})]   (3)\n\nThis equation provides an explicit account of the case in which the likelihood of action a decreases: if the bracketed term on the RHS of eq. (3) is negative, then the likelihood of action a decreases in expectation. This means that the initialization of the policy π_0 profoundly affects the future policy π_t. We now show that if the policy π_0 is initialized badly, π(a_opt) may be decreased continuously. Formally, for PPO, we have the following theorem:
Theorem 2. Given an initial policy π_0, if π_0^2(a_opt) · |A| < Σ_{a_subopt ∈ A_subopt} π_0^2(a_subopt) − Σ_{a− ∈ A−} π_0^2(a−), then we have\n\n(i) Σ_{a_subopt ∈ A_subopt} π_0(a_subopt) < E_{π^PPO_1}[Σ_{a_subopt ∈ A_subopt} π^PPO_1(a_subopt) | π_0] < ··· < E_{π^PPO_t}[Σ_{a_subopt ∈ A_subopt} π^PPO_t(a_subopt) | π_0];\n(ii) π_0(a_opt) > E_{π^PPO_1}[π^PPO_1(a_opt) | π_0] > ··· > E_{π^PPO_t}[π^PPO_t(a_opt) | π_0];\n(iii) Δ_{π_0,0} < Δ^PPO_{π_0,1} < ··· < Δ^PPO_{π_0,t}.\n\nConclusions (i) and (ii) imply that if the optimal action a_opt is relatively less preferred than the sub-optimal action a_subopt by the initial policy, then the preference for the optimal action keeps decreasing while that for the sub-optimal action keeps increasing. This is because the feasible variation of the probability of the sub-optimal action π(a_subopt) is larger than that of the optimal one π(a_opt), and increasing the probability of the former diminishes the latter. Conclusion (iii) implies that the policy of PPO is expected to diverge from the optimal one (in terms of the infinity metric). We give a simple example below.\n\nExample 1. Consider a three-armed bandit problem with reward function c(a_opt) = 1, c(a_subopt) = 0.5, c(a_worst) = −50. The initial policy is π_0(a_opt) = 0.2, π_0(a_subopt) = 0.6, π_0(a_worst) = 0.2. The hyperparameter of PPO is ε = 0.2. We have Δ^PPO_{π_0,0} = 0.8, Δ^PPO_{π_0,1} = 0.824, . . .
, Δ^PPO_{π_0,6} ≈ 0.999, which means the policy diverges from the optimal one.\n\nNote that the case in which the optimal action a_opt is relatively less preferred by the initial policy can be avoided in a discrete action space, where we can use the uniform distribution as the initial policy. However, such a case can hardly be avoided in a high-dimensional action space, where the policy is possibly initialized far from the optimal one. We ran Example 1 and a continuous-armed bandit problem with random initialization for multiple trials; about 30% of the trials were trapped in local optima. See Section 6.1 for more detail.\n\nIn summary, PPO with a constant clipping range can suffer from an exploration issue when the policy is initialized badly. However, eq. (3) suggests a way to address this issue: enlarge the clipping range (l_a, u_a) when the probability π_old(a) under the old policy is small.\n\n5 Method\n\n5.1 Trust Region-Guided PPO\n\nIn the previous section, we concluded that the constant clipping range of PPO can lead to an exploration issue. We now consider how to adaptively adjust the clipping range to improve the exploration behavior of PPO. The new clipping range (l^δ_{s,a}, u^δ_{s,a}), where δ is a hyperparameter, is set as follows:\n\nl^δ_{s,a} = min_π {π(a|s)/π_old(a|s) : D^s_KL(π_old, π) ≤ δ},  u^δ_{s,a} = max_π {π(a|s)/π_old(a|s) : D^s_KL(π_old, π) ≤ δ}   (4)\n\nTo ensure that the new adaptive clipping range is not over-strict, an additional truncation operation is attached: l^{δ,ε}_{s,a} = min(l^δ_{s,a}, 1 − ε), u^{δ,ε}_{s,a} = max(u^δ
This setting of clipping range setting\ns,a = max(u\u03b4\ncould be motivated from the following perspectives.\nFirst, the clipping range is related to the policy metric of constraint. Both TRPO and PPO imposes\na constraint on the difference between the new policy and the old one. TRPO uses the divergence\nmetric of the distribution, i.e., Ds\ntheoretically-justi\ufb01ed according to Theorem 1. Whereas PPO uses a ratio-based metric on each action,\ni.e., 1\u2212 \u0001 \u2264 \u03c0(a|s)\n\u03c0old(a|s) \u2264 1 + \u0001 for all a \u2208 A and s \u2208 S. The divergence-based metric is averaged over\nthe action space while the ratio-based one is an element-wise one on each action point. If the policy\nis restricted within a region with the ratio-based metric, then it is also constrained within a region\nwith divergence-based one, but not vice versa. Thus the probability ratio-based metric constraint is\nsomewhat more strict than the divergence-based one. Our method connects these two underlying\nmetrics \u2212 adopts the probability ratio-based constraint while getting closer to the divergence metric.\nSecond, a different underlying metric of the policy difference may result in different algorithm\nbehavior. In the previous section, we have concluded that PPO\u2019s metric with constant clipping range\ncould lead to an exploration issue, due to that it imposes a relatively strict constraint on actions which\nare not preferred by the old policy. Therefore, we wish to relax such constraint by enlarging the\nupper clipping range while reducing the lower clipping range. Fig. 1a shows the clipping range of\nTRGPPO and PPO. For TRGPPO (blue curve), as \u03c0old(a|s) gets smaller, the upper clipping range\nincreases while the lower one decreases, which means the constraint is relatively relaxed as \u03c0old(a|s)\ngets smaller. This mechanism could encourage the agent to explore more on the potential valuable\nactions which are not preferred by the old policy. 
We will theoretically show in Section 5.2 that the exploration behavior under this new clipping range is better than that under the constant one.\n\nLast but not least, although the clipping ranges are enlarged, this does not harm the stability of learning, as the ranges are kept within the trust region. We will show in Section 5.3 that this new setting of the clipping range does not enlarge the policy divergence and yields a better performance bound compared to PPO.\n\nTRGPPO adopts the same algorithm procedure as PPO, except that it needs an additional computation of the adaptive clipping range. We now present methods to compute the adaptive clipping range defined in (4) efficiently. For a discrete action space, by using the KKT conditions, problem (4) is transformed into solving the following equation w.r.t. X:\n\ng(π_old(a|s), X) ≜ (1 − π_old(a|s)) log((1 − π_old(a|s))/(1 − π_old(a|s)X)) − π_old(a|s) log X = δ   (5)\n\nwhich has two solutions: one for l^δ_{s,a}, which lies within (0, 1), and another for u^δ_{s,a}, which lies within (1, +∞). We use MINPACK's HYBRD and HYBRJ routines [15] as the solver. To accelerate this computation procedure, we adopt two additional measures. First, we train a deep neural network (DNN) which takes π_old(a|s) and δ as input and approximately outputs the initial solution.\n\nFigure 1: (a) and (b) plot the clipping range and the feasible variation range under different π_old(a|s) for a discrete action space task. (c) plots the clipping range under different a for a continuous action space task; the black curve plots the density of π_old(a|s) = N(a|0, 1).
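For illustration, eq. (5) can also be solved with plain bisection instead of MINPACK's HYBRD/HYBRJ: g(p, ·) decreases from +∞ to g(p, 1) = 0 on (0, 1) and increases from 0 to +∞ on (1, 1/p), so each root can be bracketed directly. The following sketch (names are ours; it trades the speed of the paper's solver for having no dependencies) computes (l^δ_{s,a}, u^δ_{s,a}):

```python
import math

def g(p, x):
    """LHS of eq. (5): the KL divergence reached when the ratio
    pi(a|s)/pi_old(a|s) equals x and the remaining probability mass
    is rescaled uniformly. Domain: 0 < x < 1/p."""
    return (1.0 - p) * math.log((1.0 - p) / (1.0 - p * x)) - p * math.log(x)

def clip_range(p, delta, iters=200):
    """Solve g(p, X) = delta for both roots of eq. (5) by bisection.
    Assumes delta is moderate so both roots lie inside the brackets."""
    # Lower root in (0, 1): g decreases from +inf to g(p, 1) = 0.
    lo, hi = 1e-12, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g(p, mid) > delta:
            lo = mid
        else:
            hi = mid
    l = 0.5 * (lo + hi)
    # Upper root in (1, 1/p): g increases from 0 to +inf.
    lo, hi = 1.0, (1.0 - 1e-9) / p
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g(p, mid) > delta:
            hi = mid
        else:
            lo = mid
    u = 0.5 * (lo + hi)
    return l, u

# Example: old action probability 0.2, trust region delta = 0.05.
l, u = clip_range(0.2, 0.05)  # l < 1 < u
```

Consistent with Lemma 3 in Section 5.2 below, decreasing p widens the computed range on both sides.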
Note that the solution of (5) depends only on the probability π_old(a|s) and the hyperparameter δ, and is not affected by the dimension of the action space. Thus it is possible to train one DNN for all discrete action space tasks in advance. Second, with fixed δ, we discretize the probability space and save all the solutions in advance. With these two acceleration measures, the clipping range computation procedure requires only an additional 4% of the wall-clock computation time of the original policy learning. See Appendix B.3 for more detail.\n\nFor continuous action space tasks, we make several transformations to render the problem independent of the dimension of the action space, which makes it tractable to apply the two acceleration measures above. See Appendix B.2 for more detail.\n\n5.2 Exploration Behavior\n\nIn this section, we first give the property of the clipping range of TRGPPO which affects the exploration behavior (as discussed in Section 4); then we compare TRGPPO and PPO with respect to their exploration behavior.\n\nLemma 3. For TRGPPO with hyperparameter δ, we have du^δ_{s,a}/dπ_old(a|s) < 0 and dl^δ_{s,a}/dπ_old(a|s) > 0.\n\nThis result implies that the upper clipping range becomes larger as the preference π_old(a|s) for the action under the old policy approaches zero, while the lower clipping range behaves in the opposite way. This means that the constraints are relaxed on actions which are not preferred by the old policy, so that the policy is encouraged to explore more on potentially valuable actions, no matter whether they were preferred by the previous policies or not.\n\nWe now give a formal comparison of the exploration behavior. As mentioned in Section 4, we measure the exploration ability by the expected distance between the learned policy π_t and the optimal policy π* after t steps of learning, i.e., Δ_{π_0,t}
Smaller \u2206\u03c00,t means the\noptimal policy \u03c0\u2217 after t-step learning, i.e., \u2206\u03c00,t\nwhile that of\nbetter exploration ability. The exploration ability of TRGPPO is denoted as \u2206TRGPPO\n\u03c00,t . By eq. (3) and Lemma 3, we get the following conclusion.\nPPO is denoted as \u2206PPO\nTheorem 3. For TRGPPO with hyperparameter (\u03b4, \u0001) and PPO with same \u0001.\ng(maxa\u2208Asubopt \u03c0t(a), 1 + \u0001) for all t, then we have \u2206TRGPPO\nThis theorem implies that our TRGPPO has better exploration ability than PPO, with proper setting\nof the hyperparameter \u03b4.\n\n\u03c00,t for any t.\n\n\u2264 \u2206PPO\n\nIf \u03b4 \u2264\n\n\u03c00,t\n\n\u03c00,t\n\n5.3 Policy Divergence and Lower Performance Bound\n\n(cid:104) \u03c0(at|st)\n\n(cid:105)\n\n1\nT\n\nt=1\n\n\u03c0old(at|st) At\n\nnew \u2208 \u03a0PPO\n\n(cid:80)T\nTo investigate how TRGPPO and PPO perform in practical, let us consider an empirical version\nof lower performance bound: \u02c6M\u03c0old (\u03c0) = \u02c6L\u03c0old (\u03c0) \u2212 C maxt Dst\nKL (\u03c0old, \u03c0) , where \u02c6L\u03c0old (\u03c0) =\n+ \u02c6\u03b7\u03c0old, st \u223c \u03c1\u03c0old , at \u223c \u03c0old(\u00b7|st) are the sampled states and actions,\nwhere we assume si (cid:54)= sj for any i (cid:54)= j, At is the estimated value of A\u03c0old (st, at), \u02c6\u03b7\u03c0old is the\nestimated performance of old policy \u03c0old.\nLet \u03a0PPO\nnew denote the set of all the optimal solutions of the empirical surrogate objective function of\nPPO, and let \u03c0PPO\nnew denote the optimal solution which achieve minimum KL divergence\nover all optimal solutions, i.e., Dst\nnew under all st.\nThis problem can be formalized as \u03c0PPO\nKL(\u03c0old, \u03c0)).\nNote that \u03c0(\u00b7|st) is a conditional probability and the optimal solution on different states are\nindependent from each other. 
Thus the problem can be optimized by independently solving min_{π(·|s_t) ∈ {π(·|s_t) : π ∈ Π^PPO_new}} D_KL(π_old(·|s_t), π(·|s_t)) for each state s_t; the final π^PPO_new is obtained by integrating these independent optimal solutions π^PPO_new(·|s_t) over the different states s_t. Similarly, π^TRGPPO_new is the solution of TRGPPO, defined analogously to π^PPO_new. Please refer to Appendix E for more detail.\n\nTo analyse TRGPPO and PPO in a comparable way, we introduce a variant of TRGPPO: the hyperparameter δ of TRGPPO in eq. (4) is set adaptively by ε. That is, δ = max((1 − p+) log((1 − p+)/(1 − p+(1 + ε))) − p+ log(1 + ε), (1 − p−) log((1 − p−)/(1 − p−(1 − ε))) − p− log(1 − ε)), where p+ = max_{t: A_t > 0} π_old(a_t|s_t) and p− = max_{t: A_t < 0} π_old(a_t|s_t). One may note that this formula has a similar form to eq. (5). In fact, if TRGPPO and PPO share the same ε, then they have the same KL divergence theoretically. We conclude the comparison between TRGPPO and PPO with the following theorem.\n\nTheorem 4.
Assume that max_t D^{s_t}_KL(π_old, π^PPO_new) < +∞ for all t. If TRGPPO and PPO have the same hyperparameter ε, we have:\n(i) u^δ_{s_t,a_t} ≥ 1 + ε and l^δ_{s_t,a_t} ≤ 1 − ε for all (s_t, a_t);\n(ii) max_t D^{s_t}_KL(π_old, π^TRGPPO_new) = max_t D^{s_t}_KL(π_old, π^PPO_new);\n(iii) M̂_{π_old}(π^TRGPPO_new) ≥ M̂_{π_old}(π^PPO_new). In particular, if there exists at least one (s_t, a_t) such that π_old(a_t|s_t) ≠ max_{t̂: A_t̂ < 0} π_old(a_t̂|s_t̂) and π_old(a_t|s_t) ≠ max_{t̂: A_t̂ > 0} π_old(a_t̂|s_t̂), then M̂_{π_old}(π^TRGPPO_new) > M̂_{π_old}(π^PPO_new).\n\nConclusion (i) implies that TRGPPO enlarges the clipping ranges compared to PPO and accordingly allows larger updates of the policy. Meanwhile, the maximum KL divergence is retained, which means that TRGPPO does not harm the stability of PPO theoretically. Conclusion (iii) implies that TRGPPO has a better empirical performance bound.\n\n6 Experiment\n\nWe conducted experiments to answer the following questions: (1) Does PPO suffer from a lack-of-exploration issue? (2) Can our TRGPPO relieve the exploration issue and improve sample efficiency compared to PPO? (3) Does our TRGPPO maintain the stable learning property of PPO? To answer these questions, we first evaluate the algorithms on two simple bandit problems and then compare them on high-dimensional benchmark tasks.\n\n6.1 Didactic Example: Bandit Problems\n\nWe first evaluate the algorithms on the bandit problems. In the continuous-armed bandit problem, the reward is 0.5 for a ∈ (1, 2), 1 for a ∈ (2.5, 5), and 0 otherwise. We use a Gaussian policy. The discrete-armed bandit problem is defined in Section 4. We use a Gibbs policy π(a) ∝ exp(θ_a), where the parameter θ is initialized randomly from N(0, 1).
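The two didactic problems are straightforward to reproduce. Below is a sketch of the continuous-armed reward function and a Gibbs (softmax) policy as just described; the function names are ours.

```python
import math
import random

# Didactic bandit setups of Section 6.1 (names are ours).

def continuous_bandit_reward(a):
    """Continuous-armed bandit: reward 0.5 on (1, 2), 1 on (2.5, 5), else 0."""
    if 1.0 < a < 2.0:
        return 0.5
    if 2.5 < a < 5.0:
        return 1.0
    return 0.0

def gibbs_policy(theta):
    """Gibbs policy pi(a) proportional to exp(theta_a) over a discrete action set."""
    z = [math.exp(t) for t in theta]
    total = sum(z)
    return [v / total for v in z]

theta = [random.gauss(0.0, 1.0) for _ in range(3)]  # theta initialized from N(0, 1)
pi = gibbs_policy(theta)  # a valid distribution over the three arms
```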
We also consider the vanilla policy gradient (PG) method as a comparison. Each algorithm was run for 1000 iterations with 10 random seeds.\n\nFig. 2 plots the performance during the training process. PPO gets trapped in local optima in 30% and 20% of all trials on the discrete and continuous cases respectively, while our TRGPPO finds the optimal solution in almost all trials. For the continuous-armed problem, we also tried other types of parametrized policies, such as Beta and mixture-of-Gaussians policies, and these behave similarly to the Gaussian policy. In the discrete-armed problem, we find that when the policy is initialized at a local optimum, PPO can easily get trapped in it. Notably, since vanilla PG can also find the optimal action, it can be inferred that the exploration issue mainly derives from the ratio-based clipping with a constant clipping range.\n\nFigure 2: The performance on discrete and continuous-armed bandit problems during the training process.\n\n6.2 Evaluation on Benchmark Tasks\n\nWe evaluate the algorithms on benchmark tasks implemented in OpenAI Gym [2], simulated by MuJoCo [21] and the Arcade Learning Environment [1]. For continuous control tasks, we evaluate the algorithms on 6 benchmark tasks. All tasks were run for 1 million timesteps, except that the Humanoid task was run for 20 million timesteps.
The trained policies are evaluated every 2048 sampled timesteps. The experiments on discrete control tasks are detailed in Appendix C.

Figure 3: Episode rewards during the training process; the shaded area indicates half the standard deviation over 10 random seeds.

Table 1: Timesteps needed to hit a threshold within 1 million timesteps (except Humanoid, with 20 million) and rewards averaged over the last 40% of episodes during the training process.

(a) Timesteps to hit threshold (×10³)

Task        | Threshold | TRGPPO | PPO   | PPO-penalty | SAC
Humanoid    | 5000      | 4653   | 7241  | 13096.0     | 343.0
Reacher     | -5        | 201    | 178.0 | 301.0       | 265
Swimmer     | 90        | 353.0  | 564   | 507.0       | /
HalfCheetah | 3000      | 117    | 148   | 220.0       | 53.0
Hopper      | 3000      | 168.0  | 267   | 188.0       | 209
Walker2d    | 3000      | 269.0  | 454   | 393.0       | 610

(b) Averaged rewards

Task        | TRGPPO | PPO    | PPO-penalty | SAC
Humanoid    | 7074.9 | 6620.9 | 3612.3      | 6535.9
Reacher     | -7.9   | -6.7   | -6.8        | -17.2
Swimmer     | 101.9  | 100.1  | 94.1        | 49
HalfCheetah | 4986.1 | 4600.2 | 4868.3      | 9987.1
Hopper      | 3200.5 | 2848.9 | 3018.7      | 3020.7
Walker2d    | 3886.8 | 3276.2 | 3524        | 2570

For our TRGPPO, the trust region coefficient δ is adaptively set by tuning ε (see Appendix B.4 for more detail). We set ε = 0.2, the same as PPO. The following algorithms were considered in the comparison. (a) PPO: we used ε = 0.2 as recommended by [18]. (b) PPO-entropy: PPO with an explicit entropy regularization term βE_s[H(π_old(·|s), π(·|s))], where β = 0.01. (c) PPO-0.6: PPO with a larger clipping range, ε = 0.6. (d) PPO-penalty: a variant of PPO which imposes a penalty on the KL divergence and adaptively adjusts the penalty coefficient [18]. (e) SAC: Soft Actor-Critic, a state-of-the-art off-policy RL algorithm [9].
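As a reference for what these clipping-range variants change, PPO's clipped surrogate L^CLIP = E[min(r·A, clip(r, 1−ε, 1+ε)·A)] with a constant range ε can be sketched as below. This is a minimal NumPy sketch; the per-sample bounds in the second function only stand in schematically for TRGPPO's trust-region-derived ranges, which are not computed here.

```python
import numpy as np

def ppo_clip_objective(ratio, adv, eps=0.2):
    """PPO's clipped surrogate with a constant clipping range eps
    (eps = 0.2 as recommended in [18]); ratio is pi_new / pi_old."""
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.mean(np.minimum(ratio * adv, clipped * adv))

def clipped_surrogate_adaptive(ratio, adv, lower, upper):
    """The same surrogate with per-sample bounds (l, u) taken as given,
    schematically standing in for TRGPPO's adaptive ranges."""
    clipped = np.clip(ratio, lower, upper)
    return np.mean(np.minimum(ratio * adv, clipped * adv))

ratio = np.array([0.5, 1.0, 1.5])
adv = np.array([1.0, -1.0, 2.0])
# With eps = 0.2, the ratios 0.5 and 1.5 are clipped to 0.8 and 1.2, so the
# per-sample terms are min([0.5, -1.0, 3.0], [0.8, -1.0, 2.4]) = [0.5, -1.0, 2.4].
print(ppo_clip_objective(ratio, adv))  # -> mean of [0.5, -1.0, 2.4] ≈ 0.633
```

With per-sample bounds wider than 1 ± ε, larger ratio updates survive clipping while the objective keeps the same form, which is the property conclusion (i) above refers to.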
Both TRGPPO and PPO adopt exactly the same implementation and hyperparameters, based on OpenAI Baselines [4], except for the clipping range. This ensures that the differences are due to algorithmic changes rather than implementation or hyperparameters. For SAC, we adopt the implementation provided in [9].

Sample Efficiency: Table 1(a) lists the timesteps required by each algorithm to hit a prescribed threshold within 1 million timesteps, and Figure 3 shows episode rewards during the training process. The thresholds for all tasks were chosen according to [23]. As can be seen in Table 1, TRGPPO requires only about 3/5 of the timesteps of PPO on 4 tasks, the exceptions being HalfCheetah and Reacher.

Performance/Exploration: Table 1(b) lists the rewards averaged over the last 40% of episodes during the training process. TRGPPO outperforms the original PPO on almost all tasks except Reacher. Fig. 4a shows the policy entropy during the training process; the policy entropy of TRGPPO is clearly higher than that of PPO. These results imply that our TRGPPO method maintains a level of entropy learning and encourages the policy to explore more.

4 '/' means that the method did not reach the reward threshold within the required timesteps on all the seeds.

Figure 4: (a) shows the policy entropy during the training process. (b) shows the statistics of the computed upper clipping ranges over all samples.
(c) shows the KL divergence during the training process.

The Clipping Ranges and Policy Divergence: Fig. 4b shows the statistics of the upper clipping ranges of TRGPPO and PPO. Most of the resulting adaptive clipping ranges of TRGPPO are much larger than those of PPO. Nevertheless, our method has KL divergences similar to PPO's (see Fig. 4c). In contrast, arbitrarily enlarging the clipping range (PPO-0.6) does not enjoy this property and fails on most tasks.

Training Time: Within one million timesteps, the training wall-clock time for our TRGPPO is 25 min; for PPO, 24 min; for SAC, 182 min (see Appendix B.3 for the details of evaluation). TRGPPO does not require much more computation time than PPO does.

Comparison with a State-of-the-art Method: TRGPPO achieves higher reward than SAC on 5 tasks but is not as good on HalfCheetah, and TRGPPO is not as sample efficient as SAC on HalfCheetah and Humanoid. This may be because TRGPPO is an on-policy algorithm while SAC is an off-policy one. However, TRGPPO is much more computationally efficient (25 min vs. 182 min). In addition, SAC tuned hyperparameters specifically for each task in the original authors' implementation, whereas our TRGPPO uses the same hyperparameters across different tasks.

7 Conclusion

In this paper, we improve the original PPO with an adaptive clipping mechanism driven by a trust region-guided criterion. Our TRGPPO method improves PPO with more exploration and better sample efficiency, is competitive with several state-of-the-art methods, and maintains the stable learning property and simplicity of PPO.

To our knowledge, this is the first work to reveal the effect of the metric of the policy constraint on the exploration behavior of policy learning. Recent works have been devoted to introducing inductive bias to guide the policy's behavior, e.g., maximum entropy learning [24, 8] and curiosity-driven methods [13].
In this sense, our adaptive clipping mechanism is a novel alternative approach to incorporating prior knowledge to achieve fast and stable policy learning. We hope it will inspire future work on investigating more well-defined policy metrics to guide efficient learning behavior.

Acknowledgement

This work is partially supported by the National Science Foundation of China (61976115, 61672280, 61732006), the AI+ Project of NUAA (56XZA18009), and the Postgraduate Research & Practice Innovation Program of Jiangsu Province (KYCX19_0195). We would also like to thank Yao Li, Weida Li, Xin Jin, as well as the anonymous reviewers, for offering thoughtful comments and helpful advice on earlier versions of this work.

References

[1] Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.

[2] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas
Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym, 2016.

[3] Gang Chen, Yiming Peng, and Mengjie Zhang. An adaptive clipping approach for proximal policy optimization. CoRR, abs/1804.06461, 2018.

[4] Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, and Yuhuai Wu. OpenAI Baselines. https://github.com/openai/baselines, 2017.

[5] Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement learning for continuous control. International Conference on Machine Learning, pages 1329–1338, 2016.

[6] Rasool Fakoor, Pratik Chaudhari, and Alexander J Smola. P3O: Policy-on policy-off policy optimization. In Uncertainty in Artificial Intelligence, 2019.

[7] Meire Fortunato, Mohammad Gheshlaghi Azar, Bilal Piot, Jacob Menick, Matteo Hessel, Ian Osband, Alex Graves, Volodymyr Mnih, Remi Munos, Demis Hassabis, Olivier Pietquin, Charles Blundell, and Shane Legg. Noisy networks for exploration. International Conference on Learning Representations, 2018.

[8] Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. International Conference on Machine Learning, pages 1352–1361, 2017.

[9] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. International Conference on Machine Learning, pages 1856–1865, 2018.

[10] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies.
Journal of Machine Learning Research, 17(1):1334–1373, 2016.

[11] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

[12] Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep exploration via bootstrapped DQN. Advances in Neural Information Processing Systems, pages 4026–4034, 2016.

[13] Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In International Conference on Machine Learning (ICML), 2017.

[14] Jan Peters and Stefan Schaal. Reinforcement learning of motor skills with policy gradients. Neural Networks, 21(4):682–697, 2008.

[15] Michael JD Powell. A hybrid method for nonlinear equations. Numerical Methods for Nonlinear Algebraic Equations, 1970.

[16] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.

[17] John Schulman, Philipp Moritz, Sergey Levine, Michael I. Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. International Conference on Learning Representations, 2016.

[18] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[19] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815, 2017.

[20] Richard S Sutton, David A. McAllester, Satinder P.
Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems, pages 1057–1063, 2000.

[21] E Todorov, T Erez, and Y Tassa. MuJoCo: A physics engine for model-based control. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033, 2012.

[22] Yuhui Wang, Hao He, Xiaoyang Tan, and Yaozhong Gan. Truly proximal policy optimization. In Uncertainty in Artificial Intelligence, 2019.

[23] Yuhuai Wu, Elman Mansimov, Roger B. Grosse, Shun Liao, and Jimmy Ba. Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation. Advances in Neural Information Processing Systems, pages 5279–5288, 2017.

[24] Brian D. Ziebart, J. Andrew Bagnell, and Anind K. Dey. Modeling interaction via the principle of maximum causal entropy. In ICML, 2010.