NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID:782
Title:Distributional Policy Optimization: An Alternative Approach for Continuous Control

Reviewer 1

This paper proposes a distributional policy optimization (DPO) framework and its practical implementation, generative actor-critic (GAC), which belongs to the family of off-policy actor-critic methods. Policy gradient methods, which are currently dominant in continuous control problems, are prone to local optima, so it is valuable to propose a method that addresses this problem fundamentally. Overall, the paper is well written and the proposed algorithm seems novel and sound.

- In Algorithm 1, it is not clear what (s,a) and s are for Q(s,a) and V(s). Do they stand for 'every' state-action pair and state, or only for the state-action pairs visited by the current policy \pi_k'? If the latter, it seems that DPO might not converge to the global optimum. For example, suppose that the initial policy is the Dirac delta distribution \pi_0(a|s) = \delta_{0}(a), and the (deterministic) transition function is f(s,a) = s + a. Then, under \pi_0, only the initial state can be visited, so the value functions and the policy remain the same and are never updated. Assumptions about the behavior policy are not mentioned in the paper.
- L109: can can -> can
- In L114, what does 'as \pi is updated on the fast timescale, it can be optimized using supervised learning techniques' mean specifically? Please elaborate on the relationship between supervised learning and the fast timescale.
- In DPO, the delayed policy \pi' is updated by a convex combination of two 'probability distributions', while in GAC the delayed actor is updated by a convex combination of the 'parameters' of the two probability distributions. Therefore, DPO and GAC are not perfectly aligned.
- For the actor, why did you choose implicit quantile networks, given that GAC does not require 'quantile' estimation? It seems that a conditional GAN or a conditional VAE would also be possible.
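The counter-example raised for Algorithm 1 can be checked with a few lines of code. The sketch below is only an illustration of the reviewer's argument, not part of the paper: the helper names (`rollout`, `dirac_zero_policy`, `additive_transition`), the initial state 1.5, and the horizon are all hypothetical choices.

```python
def rollout(initial_state, policy, transition, horizon):
    """Return the list of states visited when following `policy` for `horizon` steps."""
    states = [initial_state]
    state = initial_state
    for _ in range(horizon):
        action = policy(state)
        state = transition(state, action)
        states.append(state)
    return states


def dirac_zero_policy(state):
    # pi_0(a|s) = delta_0(a): the policy always outputs the action 0.
    return 0.0


def additive_transition(state, action):
    # Deterministic dynamics f(s, a) = s + a from the review's example.
    return state + action


# Hypothetical initial state and horizon, chosen only for illustration.
visited = set(rollout(1.5, dirac_zero_policy, additive_transition, horizon=50))
print(visited)
```

Under these assumptions the set of visited states contains only the initial state, so any update restricted to visited state-action pairs never sees another state or a non-zero action.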
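The mismatch between mixing distributions (DPO) and mixing parameters (GAC) can be seen in a toy case. The sketch below assumes one-dimensional Gaussian policies with hand-picked means, purely for illustration; GAC's actor is a neural network, so this is an analogy rather than the paper's update rule, and every function name here is hypothetical.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of N(mean, std^2) evaluated at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def mix_distributions(x, alpha, mean_a, mean_b, std=1.0):
    # Convex combination of the two densities (DPO-style update).
    return (1.0 - alpha) * gaussian_pdf(x, mean_a, std) + alpha * gaussian_pdf(x, mean_b, std)


def mix_parameters(x, alpha, mean_a, mean_b, std=1.0):
    # Convex combination of the two means, then a single Gaussian (parameter averaging).
    mean = (1.0 - alpha) * mean_a + alpha * mean_b
    return gaussian_pdf(x, mean, std)


# Illustrative values: two policies centred at -2 and +2, mixed with alpha = 0.5.
density_mixture = mix_distributions(0.0, 0.5, -2.0, 2.0)
density_averaged = mix_parameters(0.0, 0.5, -2.0, 2.0)
print(density_mixture, density_averaged)
```

With these values the mixture is bimodal and places little mass at 0, whereas the parameter-averaged Gaussian peaks exactly at 0, so the two updates yield different policies.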

Reviewer 2

The idea proposed in the paper is very interesting. Policy gradient methods are very popular nowadays, and this paper proposes a method to address one of their weaknesses. The paper is clearly written and the figures are helpful for the reader. However, the DPO procedure would have benefited from being a bit clearer: for example, why is the delayed actor present in this general procedure? It seems to me that it mainly helps convergence and is not a requirement of the method. Besides, although valid theoretically, a three-timescale algorithm seems hard to implement in practice.

Post-rebuttal update: I appreciate the effort the authors put into their rebuttal. I will, however, keep my score at 7: I vote for accepting this submission, but would not be upset if it were rejected (the clarity of the paper could benefit from restructuring, and I am not a huge fan of the three-timescale procedure, as it should be quite hard to find the right parameters to make it converge).

Reviewer 3

This paper presents the limitations of policy-gradient-based methods, which need an explicit p.d.f. over actions in continuous control, and proves that a Gaussian policy cannot converge to the optimum under some conditions. It then introduces the DPO framework, which can converge to an optimal solution without requiring the underlying p.d.f. and thus without the limitation of a parametric distribution. The paper also presents a practical algorithm, GAC, that applies Quantile Regression and Autoregressive Implicit Quantile Networks, which can represent arbitrary distributions. GAC achieves good results in continuous control tasks, some of them better than the policy gradient baselines, and it has the same efficiency but requires more computation.

Minor issues: The description of ‘the sub-optimality of uni-modal policies’ (from line 73) is confusing. Which part of Figure 1a corresponds to ‘the predefined set of policies’? In line 76, ‘this set is convex in the parameter µ’ seems to mean ‘this set is convex in the parameter space Θ’, and what does ‘it is not convex in the set Π’ mean? What is the definition of ‘(1−α)δ_{µ_1} + αδ_{µ_2}’? The condition at the end of line 96 does not seem to be written properly. Should the right-hand side of the equation in line 99 (also line 442) be 1−2ε, with ε < 1/3?
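On the question about the convex combination of two delta policies, one plausible reading is sketched below. It assumes that δ_µ denotes the Dirac measure at the action µ and that Π is the set of all probability distributions over actions; this is an interpretation, not the paper's stated definition, and the symbols \pi_\alpha and A are introduced here only for illustration.

```latex
% Assumption: \delta_{\mu} is the Dirac measure at \mu, and A is any measurable set of actions.
\[
  \pi_\alpha = (1-\alpha)\,\delta_{\mu_1} + \alpha\,\delta_{\mu_2},
  \qquad
  \pi_\alpha(A) = (1-\alpha)\,\mathbf{1}[\mu_1 \in A] + \alpha\,\mathbf{1}[\mu_2 \in A],
  \quad \alpha \in [0,1].
\]
% For \mu_1 \neq \mu_2 and 0 < \alpha < 1, \pi_\alpha has two atoms,
% so it is not equal to \delta_{\mu} for any single \mu.
```

Under this reading, the set of delta policies would be closed under interpolation of the parameter µ but not under mixing of the distributions themselves, which may be what ‘not convex in the set Π’ is intended to express.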