NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID:326
Title:Trust Region-Guided Proximal Policy Optimization

Reviewer 1

The paper demonstrates a shortcoming of the PPO algorithm, in which a policy that prefers suboptimal actions can diverge over time because of the constant clipping mechanism. Instead, the paper proposes an adaptive clipping mechanism that selects the clipping range based on the current policy, giving more chance of exploration to actions that are not preferred by the current policy. The paper demonstrates that the proposed method achieves a better objective lower bound than PPO while maintaining the same KL divergence between policies. Experimental results compare the proposed method to several baselines in different settings.

The paper provides an elegant treatment of the problem at hand and has the potential to guide future research. The results are promising but are not very impressive unless the training time is included. I have a number of comments though.

My important concerns are regarding evaluation:
- Is there a reason that you are choosing the average of the top 10 rewards? It seems more natural to use average rewards over all episodes. A related question is why you are choosing only 7 random seeds for Figure 3.
- Please compare with adaptive regularization/constraint baselines such as the adaptive KL regularization version of PPO [20].
- Please add more details (perhaps in the supplementary) about your policy class as well as baseline hyper-parameters to make the results reproducible.

Other comments:
- Introduction: Lines 23-24 are not very accurate since there is a version of PPO that uses a KL penalty without clipping [20].
- L113: Lemma 2 is not clear to me. Are you sure the LHS is not E_{\pi_{t+1}}[\pi_{t+1}(a) | \pi_0]?
- Algorithm 1: It seems that t <- t+1 should be at the end.
- L166: while gets --> while getting.
- L235: How can we guarantee that \pi^{new} exists given that it has to satisfy multiple constraints?

========================================================

I thank the authors for their response. Based on the rebuttal, I have changed my score to 7.
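For readers unfamiliar with the constant clipping mechanism discussed in this review, the standard PPO clipped surrogate can be sketched as follows. This is a minimal illustration of the well-known PPO objective, not code from the paper under review; the function name `ppo_clipped_objective` and the default `eps=0.2` are assumptions chosen for the example.

```python
import numpy as np

def ppo_clipped_objective(ratio, advantage, eps=0.2):
    """Per-sample PPO clipped surrogate with a constant clipping range.

    ratio     : pi_new(a|s) / pi_old(a|s) for each sample
    advantage : advantage estimate for each sample
    eps       : constant clipping parameter (hypothetical default)
    """
    ratio = np.asarray(ratio, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    # The same interval [1 - eps, 1 + eps] is used for every state-action pair.
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    # Taking the minimum removes any incentive to move the ratio
    # outside the interval in the direction favoured by the advantage.
    return np.minimum(ratio * advantage, clipped * advantage)

# A good action (positive advantage) whose ratio already exceeds 1 + eps
# contributes the same value as one sitting exactly at 1 + eps.
print(ppo_clipped_objective([1.5, 1.2], [1.0, 1.0]))
```

Because the interval is the same for all actions, an action with old probability 0.01 stops being rewarded for further increases once its probability reaches about 0.012 in a single update, which is the kind of limited exploration the review refers to.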

Reviewer 2

The paper proposes to adapt the clipping procedure of Proximal Policy Optimization (PPO) such that the lower and upper bounds are no longer constant for all states. The authors show that constant bounds cause convergence to suboptimal policies if the initial policy is initialized poorly (e.g. the probability of choosing optimal actions is small). As an alternative, the authors propose to compute state-action specific lower and upper bounds that are inside the trust region with respect to the previous policy. If the previous policy assigns a small probability to a given action, the lower and upper bounds do not need to be very tight, allowing for less aggressive clipping. The adapted version of PPO, which the authors call TRGPPO, has provably better performance bounds than PPO and is validated empirically in several experiments.

Since PPO is frequently used in deep reinforcement learning, correcting the suboptimal behavior of PPO seems like a relevant contribution. I think the paper properly motivates the theory, and the derivations appear correct, with the exception of one derivation that I do not follow (see below). The empirical validation is mostly as expected, but there are cases for which the approximate version of PPO outperforms TRGPPO, which is a bit puzzling.

Page 3: "We now give a formal illustration": This sentence appears misplaced, especially since it is immediately followed by Algorithm 1.

In the supplementary material, Equation (4) correctly states the equations on the Lagrange multipliers, but I fail to see how these equations are transformed into Equation (5).

"Table 1(b) lists the averaged top 10 episode rewards": this is a very crude presentation of the empirical results that does not account for variation in performance during learning. A given algorithm may have a performance peak early during learning and then suffer from catastrophic forgetting. I would much prefer to see the learning curves of the different algorithms, to help infer whether learning was stable over time.

Do you have any ideas why TRGPPO performs worse than PPO in Humanoid? This seems to contradict your theoretical results.

POST-REBUTTAL: The rebuttal mostly confirmed my initial understanding: the proposed version of PPO is guaranteed to improve over the original PPO algorithm, but still does not come with any convergence guarantees, not even to an approximately optimal solution. I appreciate the effort to explain how Eq (4) is transformed into Eq (5).
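The observation in this review that low-probability actions admit looser bounds can be illustrated numerically. The sketch below is a simplification and not the authors' implementation: it assumes a discrete action space, measures the trust region as KL(old || new), and assumes that when one action's probability changes from p to q the remaining actions are rescaled proportionally, so that the divergence reduces to a KL between two Bernoulli distributions. The names `binary_kl` and `ratio_range` and the value `delta=0.01` are hypothetical.

```python
import math

def binary_kl(p, q):
    """KL(Bernoulli(p) || Bernoulli(q)) for 0 < p, q < 1."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def ratio_range(p, delta, iters=100):
    """Smallest and largest ratio q / p with binary_kl(p, q) <= delta.

    Found by bisection; binary_kl(p, q) grows as q moves away from p
    in either direction, so each side has a single crossing point.
    """
    tiny = 1e-12
    # Upper bound: search q in [p, 1).
    lo, hi = p, 1.0 - tiny
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if binary_kl(p, mid) <= delta:
            lo = mid
        else:
            hi = mid
    upper = lo / p
    # Lower bound: search q in (0, p].
    lo, hi = tiny, p
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if binary_kl(p, mid) <= delta:
            hi = mid
        else:
            lo = mid
    lower = hi / p
    return lower, upper

# Compare the admissible ratio range for a rarely chosen action
# and a frequently chosen one under the same KL budget.
for p in (0.01, 0.5):
    print(p, ratio_range(p, delta=0.01))
```

Under these assumptions the admissible ratio interval is wider for the action with old probability 0.01 than for the one with probability 0.5, which matches the intuition described above.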

Reviewer 3

Overall I think this paper presents an interesting idea for improving the exploration and stability of PPO. The idea is very well presented, and the authors include both rigorous theoretical analysis and rich empirical experiments.

Pro:
1. The idea for this paper is really well presented. The structure of the paper is well organized and the authors include simple examples and illustrations to help make the argument easily understandable. Starting by analyzing the shortcomings of PPO, the authors naturally introduce the improvement and thus make the paper easy to read.
2. The authors provide rigorous theoretical justification for the shortcomings of PPO and how the improvements in TRGPPO fix the problem. The authors also include intuitive explanations for all the theoretical results, which makes the results easy to understand.
3. This paper also includes rich empirical evidence for the proposed algorithm. Besides reward performance, the authors also present analysis for some important metrics of the algorithm, which agree with the theoretical results.

Con:
1. Some mathematical formulae in the paper could be better formatted. In particular, the expressions in Theorem 1 and Theorem 2 could be aligned.
2. While the paper includes a comparison to PPO with a single large clipping rate in Section 6, it would be interesting to compare to the performance of PPO with an optimal constant clipping rate.

The idea in this paper is well presented and thoroughly investigated. Overall I think the idea is novel and the contribution is significant. Despite minor flaws, I recommend publication of this paper.