NeurIPS 2019
Sun Dec 8th – Sat Dec 14th, 2019, Vancouver Convention Center
Paper ID: 7508
Title:RUDDER: Return Decomposition for Delayed Rewards

Reviewer 1

Originality: 4) The work is highly original and I believe it could lead to novel RL algorithms in the future.
Quality: 5) The artificial tasks are well designed to highlight the strength of RUDDER, i.e., the agent must be able to attribute reward through time in order to solve the task. 6) I would like to see comments comparing RUDDER to generalized advantage estimation (GAE), which can also be interpreted as reward shaping. 7) I would recommend using PPO in combination with GAE as a baseline. I don't think it will substantially change the results, but it would be a fairer comparison, and the two methods are often coupled.
Clarity: 8) The paper is clear and the appendix looks exhaustive. I think the experiments could be reproduced using the paper.
Conclusion: 9) I think this work is a very valuable contribution to the field. The main caveat, for which I do not recommend acceptance at the moment, is that the work omits to mention and use the GAE baseline. I would like to hear the authors' comments and would be happy to increase my score if they provide sufficient explanation for not including this baseline.
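For context on the baseline requested above: GAE computes advantage estimates as exponentially weighted sums of TD errors, which is why it can be read as a form of reward shaping with the value function acting as a potential. A minimal sketch (the function name and the toy episode are illustrative, not taken from the paper):

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation:
    A_t = sum_l (gamma*lam)^l * delta_{t+l},
    where delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).
    `values` has length T + 1 (it includes a bootstrap value
    for the state after the final step)."""
    T = len(rewards)
    deltas = (np.asarray(rewards, dtype=float)
              + gamma * np.asarray(values[1:], dtype=float)
              - np.asarray(values[:-1], dtype=float))
    advantages = np.zeros(T)
    running = 0.0
    # Accumulate discounted TD errors backwards through the episode.
    for t in reversed(range(T)):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages

# Toy delayed-reward episode: reward only at the last step, zero value estimates.
adv = gae_advantages(rewards=[0.0, 0.0, 1.0], values=[0.0, 0.0, 0.0, 0.0])
```

With a zero value function, the delayed terminal reward decays backwards by a factor of gamma*lam per step, illustrating how GAE still leaves most of the credit near the end of the episode.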

Reviewer 2

It's an interesting idea supported by extensive theoretical derivations. On the artificial tasks in particular, I could see that reward redistribution would indeed help the agent learn the task more quickly. On Atari games things are much more complicated, as the scenes are complex and contribution analysis may fail to capture essential scene elements and actions. Though generally positive, I have a few critical comments:
1. The experimental results were not carefully analyzed. Since the results of contribution analysis can be visualized easily (e.g., which frames contribute most to changes in predicted rewards), such visualizations would help identify when RUDDER works and when it fails.
2. Although theoretically the optimal policies should be unchanged after reward redistribution, RUDDER led to worse performance on a significant portion of Atari games. Closer analysis is welcome; for example, the authors could use the visualization technique suggested above.
3. The authors should discuss the limitations of RUDDER in more depth. Apparently it is not suited to all tasks; there is a trade-off between quicker learning of Q-values and errors introduced by contribution analysis.
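The visualization asked for in comment 1 follows directly from how the redistribution is computed: the redistributed reward at step t is the difference between the return predictor's outputs on consecutive prefixes of the sequence, so the steps with the largest differences mark the salient frames. A hedged sketch, where `toy_predictor` is a made-up stand-in for the paper's trained return predictor:

```python
def redistribute(return_predictor, episode):
    """Redistribute the episodic return over time steps as differences of
    consecutive return predictions: r_t = g(s_{0..t}) - g(s_{0..t-1})."""
    preds = [return_predictor(episode[: t + 1]) for t in range(len(episode))]
    redistributed = [preds[0]] + [preds[t] - preds[t - 1]
                                  for t in range(1, len(preds))]
    return redistributed, preds

# Illustrative stand-in predictor (not the actual model): the predicted
# return jumps as soon as the key event appears in the prefix.
toy_predictor = lambda prefix: 1.0 if "key" in prefix else 0.0
r, p = redistribute(toy_predictor, ["start", "key", "door", "end"])
```

Here the whole return lands on the "key" step, and plotting these per-step differences over real frames would give exactly the kind of failure-mode diagnostic the comment suggests. Note that by construction the redistributed rewards sum to the final return prediction.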

Reviewer 3

POST-REBUTTAL COMMENT: **** Thank you for the extensive feedback, and congrats on really great work. ****

Summary: This work proposes a method for converting an MDP with delayed rewards into a decision process with only immediate rewards, which makes the resulting process easier to learn. The reward redistribution method is proven to preserve optimal policies and to reduce the expected future reward to zero. This is achieved by redistributing the delayed rewards to the salient state-action events (where saliency is determined by contribution analysis methods). Extensive experiments in both toy domains and the suite of Atari games demonstrate the method's improvements on delayed-reward tasks, as well as the shortcomings of MC and TD methods on these types of tasks.

Comments: I felt the work presented in the paper is outstanding. There are numerous contributions that could conceivably stand on their own (resulting in an extremely large appendix!). The mathematics is rigorous and extremely detailed (which impacts the readability of the main paper somewhat), and there is none of the handwaving justification one might normally find in an ML paper. Aside from the main theoretical work, the experiments are also extensive. I liked the use of toy domains to illustrate the idea and the shortcomings of other approaches, as well as the fact that statistical significance tests were employed. Running experiments across 52 Atari games for millions of frames is a significant undertaking, but the results are hard to judge given that RUDDER was sometimes worse than the baseline. However, they do show that RUDDER offers improved performance on delayed-reward games, and the Atari experiments confirm that it can be applied to very high-dimensional problems. One concern I have is that, judging from the appendix, implementing the method on Atari was a significant engineering feat, which may hinder its adoption by others. I think adoption will likely depend on the quality of the released code. Overall, the paper is original, theoretically justified, scales to large domains, and provides a strong foundation for future work.

Some additional questions and comments for the authors: On line 82 it is stated that "we use contribution analysis instead of sensitivity analysis, and (ii) we use the whole state-action sequence to predict its associated return". Point (i) seems to be ill-motivated here. It is true that contribution analysis is a different approach, but it would be helpful to motivate why it is better than sensitivity analysis. Intuitively, if a small perturbation to some input dramatically affects the output, does it not hold that this input is relevant to the output? If this is not the case, I think at least some explanation would be appropriate to add. Also, the second point seems to be more of a restriction in practice, since you need to collect full episodes. I think the paper is significantly hindered by the limit on space. For instance, I think the mean and variance analysis (S3) is a significant contribution on its own, but it is primarily buried in the appendix. Space permitting, it would be useful to add a bit more discussion of these results in the main paper. Figure S6 is very useful for explanatory purposes, and I think the paper would benefit from having it in the main paper (again, space depending).

Questions: 1. In the experiments, RUDDER was placed on top of PPO, but PPO is an on-policy method, I think. Would the use of the "lessons" buffer not introduce bias? Or does it not matter in practice? 2. After redistributing rewards optimally, am I correct in saying that there is now no bias in TD updates, and the variance in MC is completely minimised? 3. What are the last two columns in Table S6?

Minor comments: 1. L43: "proper reward redistribution" is a vague term at that point in the paper. Similarly, L70 talks about "optimal" reward redistributions, but this concept is only introduced on L133, which is confusing for the reader. 2. The paragraph starting at L285 can probably be merged into the descriptions of the tasks on page 7 to save space. 3. I do not think the equation in Theorem 3 is referenced, in which case the (2) label (which cannot fit on the line) is not needed.