Paper ID: | 5415 |
---|---|
Title: | Policy Continuation with Hindsight Inverse Dynamics |

QUALITY: I like the approach, but I have one fundamental problem in understanding the method. The policy is a mapping S x G -> A, while the inverse dynamics map (S x G) x (S x G) -> A. How are pi and theta equal, and how are they parametrized? In the classical sense, learning the inverse dynamics is a problem about the environment rather than the policy, if I am not mistaken. If so, why is it important to relabel data to learn them? Also, the combination of PCHID with PPO is not fully sound (as noted in the paper), but this could be solved by simply exchanging PPO for an off-policy algorithm, for example TD3. Nonetheless, the experimental section shows interesting results, with much improved performance and sample complexity over standard DQN, HER and PPO.
CLARITY: The paper is well written and structured; only a few sentences and expressions need proofreading. Regarding my difficulty in understanding the difference between a policy and inverse dynamics, a few more words may be necessary to better link these two functions. In the grid world setting, k-step solvability is straightforward to understand and implement, while in continuous action spaces it is less so. Can you comment a bit more on how to determine k-step solvability in continuous action domains?
ORIGINALITY: The paper introduces some fundamental concepts and combines them into a novel type of algorithm.
SIGNIFICANCE: Improving sample complexity is of high importance to the field, and the results look promising.
Post rebuttal: The authors clarified my main question on the connection between inverse dynamics and the policy. The other questions were also answered appropriately. Therefore I will keep my current rating and increase my confidence score.
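One possible reading of the link between the policy and the inverse dynamics raised above is that, once the goal is relabeled in hindsight as the state actually reached, a one-step inverse dynamics sample has the same input signature as the policy, namely a (state, goal) pair. The sketch below only illustrates that reading. The function name, the tuple-based grid states, and the assumption that the goal space equals the state space (identity goal mapping) are all hypothetical and not taken from the paper.

```python
from typing import List, Tuple

State = Tuple[int, int]
Action = str
Transition = Tuple[State, Action, State]


def one_step_hindsight_samples(
    trajectory: List[Transition],
) -> List[Tuple[Tuple[State, State], Action]]:
    """Turn (s_t, a_t, s_{t+1}) transitions into ((s_t, g), a_t) pairs.

    Assumption: the goal is the full next state (g = s_{t+1}), so each
    pair can be used as a supervised example for a policy pi(s, g) -> a.
    """
    samples = []
    for state, action, next_state in trajectory:
        # Skip transitions that did not move the agent (e.g. hitting a wall),
        # since "reach the state you are already in" carries no action signal.
        if next_state != state:
            samples.append(((state, next_state), action))
    return samples


# Hypothetical three-step rollout; the last action bumps into a wall.
trajectory = [
    ((0, 0), "right", (1, 0)),
    ((1, 0), "up", (1, 1)),
    ((1, 1), "up", (1, 1)),
]
samples = one_step_hindsight_samples(trajectory)
```

Under this reading, relabeling matters because it is what turns environment transitions into goal-conditioned training pairs for the policy.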

The proposed method, which extends hindsight experience replay with k-step inverse dynamics learning, is original and significant. The paper has a limitation: it assumes a deterministic environment and a goal space that is part of the state space and on which a test function can be defined. Still, many challenging scenarios satisfy these assumptions. For example, the original hindsight experience replay made similar assumptions, as this paper does, yet it showed its potential for challenging continuous control tasks. The paper is clearly written and the intuition is easy to understand. I have minor questions about the experimental setting, especially about the choice of the maximum K. In theory, the maximum K should correspond to the maximum number of steps required to reach the goal, but the grid world experiment used a maximum K of 5, and the maximum K is not specified for the OpenAI Fetch environment. Since choosing the maximum K seems important for this algorithm, it might be necessary to state the maximum K that was used and to explain how a small K still provides an improvement.
After author response: I increased my rating. The only concern I mentioned in the review was about the effect of K in different scenarios. The author response effectively addressed this concern by 1) providing an intuitive explanation of how a model with small K can still improve performance, and 2) showing an additional experiment controlling K in a continuous control environment. I believe the explanation and experimental result would be useful for understanding an important characteristic of the algorithm: the effect of K. Therefore, I would recommend including the results in the final version.

Overall, the paper is clearly written. The theory and methods are well developed, and the results are discussed in detail with an ablation study. However, I do have a question about the significance of the approach. If I understand correctly, the proposed method, policy continuation with hindsight inverse dynamics (PCHID), is a continuous version of dynamic programming. The difference is that the table is replaced with a function approximator, so PCHID can handle continuous state-action spaces. Have the authors tried to solve the GridWorld task using simple dynamic programming as a comparison? I also have a concern regarding the algorithm. The proposed algorithm validates whether a state is k-step reachable using a TEST routine. Because a function approximator is used, a failure of the approximator to yield a solution does not guarantee that the given state is not k-step reachable. I think additional theoretical analysis is needed to address this issue.
Update: The authors have replied and addressed my main comments: the DP perspective of PCHID, and the false negatives in the TEST routine.
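The tabular comparison suggested above could look roughly like the following sketch, which computes the exact number of steps to the goal for every cell of a small deterministic grid and reads off the set of k-step solvable states. It is not the authors' setup. The grid size, wall layout, four-action move set, and function names are assumptions made for illustration, and a breadth-first search from the goal stands in for the dynamic programming table, which is valid here only because every move is reversible.

```python
from collections import deque

# Assumed four-connected move set for a deterministic grid world.
MOVES = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}


def steps_to_goal(width, height, walls, goal):
    """Return {cell: minimum number of steps to reach `goal`}.

    The search expands outward from the goal. Because moves are reversible,
    the distance from the goal equals the distance to the goal.
    Cells that cannot reach the goal are absent from the result.
    """
    dist = {goal: 0}
    queue = deque([goal])
    while queue:
        x, y = queue.popleft()
        for dx, dy in MOVES.values():
            nxt = (x + dx, y + dy)
            inside = 0 <= nxt[0] < width and 0 <= nxt[1] < height
            if inside and nxt not in walls and nxt not in dist:
                dist[nxt] = dist[(x, y)] + 1
                queue.append(nxt)
    return dist


def k_step_solvable(dist, k):
    """Cells from which the goal is reachable in at most k steps."""
    return {cell for cell, d in dist.items() if d <= k}


# Hypothetical 3x3 grid with one wall in the centre and the goal in a corner.
dist = steps_to_goal(3, 3, walls={(1, 1)}, goal=(2, 2))
one_step = k_step_solvable(dist, 1)
```

In such a table a cell is k-step solvable exactly when its stored distance is at most k, so the table has no false negatives, unlike a TEST routine that relies on a learned approximator.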