NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 4261
Title: A Unified Bellman Optimality Principle Combining Reward Maximization and Empowerment

Reviewer 1

Originality: 5) This work generalizes a number of previous works that combine RL and information-theoretic methods.

Quality: 6) The algorithm and the theoretical results are sound. 7) In certain environments, the results reported for SAC are much lower than those in the original paper, e.g. Half-Cheetah ~4000 vs. ~8000. I recommend investigating the source of this discrepancy.

Clarity: 8) The paper is clear and the results could be replicated from the information provided. Some of the derivations could have been made clearer; see improvements.

Significance: 9) The work generalizes a number of past works, so it is likely that other researchers will build on it.

Conclusion: 10) The method and the theoretical contributions are sound, but given certain anomalies in the experiments, I do not recommend acceptance at this time.

Reviewer 2

AFTER REBUTTAL
==============
I thank the authors for the clarifications given in the author response, especially regarding 1-step vs. multi-step empowerment. I keep my score and recommend that the paper be accepted.

Originality
-----------
The paper seems to be the first to combine reward maximization and 1-step empowerment inside a single objective function. A slightly more explicit placement of the presented work in the broader context of research on empowerment, in the Introduction or Motivation section, would be beneficial for the reader.

Quality
-------
Theoretical developments and proofs were checked selectively and did not reveal significant flaws. The basic premise of using empowerment to improve reward maximization may be somewhat questionable. Especially given that empowerment is expensive to compute, perhaps a better argument would be to consider a multi-task setting, where task transfer can benefit from the agent being initially empowered. Adding a paragraph detailing the drawbacks of the proposed approach would be beneficial. For example, learning forward and inverse models may introduce bias and hinder the performance of model-free methods. Was such an effect observed in the experiments?

Clarity
-------
The paper is written clearly and structured well.

Significance
------------
The main contribution of the paper is on the theoretical side. An important question that the paper does not address is how much is lost by considering only 1-step empowerment. Since the experiments were carried out in a single-task RL setting, the benefit of using empowerment was not so clear (see Fig. 2). In general, the whole line of argumentation in the paper could be slightly adapted to better motivate the combination of reward with empowerment. It seems more plausible to expect gains in a multi-task setting.

Reviewer 3

This is a simple paper with a straightforward proposal, execution, and presentation of results. The idea of combining rewards and empowerment has been up for grabs for some time, and thus it is by itself neither particularly surprising nor controversial. (Ironically, the idea is conceptually at odds with the original, explicit proposal to do away with rewards altogether and replace them with a universal objective; that is why empowerment was invented in the first place.)

I don't have many remarks, just some minor details.

- Section 2.2 ("Empowerment") could be simpler, mainly in notation. Equation (2) is slightly confusing at first, as \pi_empower appears explicitly in the denominator inside the log, but only implicitly in the numerator. The lemmas that follow are trivial.
- Similarly, Section 4.1 ("Existence of Unique...") could also be simpler. In fact, I was surprised that this hadn't been shown before. It's nice to see it spelled out.
- Section 4.3 with the grid-world example: this example is not very illuminating.
- Section 5.2 ("Experiments with Deep Function Approximators"): The experimental results did not look very convincing to me. Empowerment seems either to add little or even to deteriorate the policies found by SAC. Perhaps the MuJoCo domains were not great for showcasing the method.

In general, while I did enjoy the paper, I felt that the significance of the results was modest, firstly because the ideas are not surprising, and secondly because I did not feel that they made much of a difference to the state of the art.

PS: The bibliography is surprisingly complete. Were all the references woven into the text? I felt they weren't.

*** POST REBUTTAL COMMENTS ***
I would like to thank the authors for addressing all the comments. I went through the experimental results again and decided to increase the score.