Review for NeurIPS paper: BAIL: Best-Action Imitation Learning for Batch Deep Reinforcement Learning

NeurIPS 2020

Review 1

Summary and Contributions: ---post author response--- Thank you for the response! The clarifications to the table have improved my understanding of the results. While I think that the results are strong, the discussion section is jumbled and unclear, and the intuition behind some of the design decisions is lacking, which gives an 'ad hoc' impression. These points are adequately addressed in the response, and I will increase my score to a 6 assuming the authors will add these clarifications to the final text and make the experimental results section clearer. ===== This work proposes a batch deep RL algorithm called BAIL. It essentially trains a policy using imitation learning on state-action pairs selected by comparing their (Monte Carlo) returns with what the authors define as the upper envelope of the data. The upper envelope is given by value function parameters such that, roughly, the sum of the differences between the value of each state $s_i$ and the sum of discounted rewards from $s_i$ to the end of the episode is minimized. BAIL works for domains that have deterministic transition functions. An empirical evaluation compares the performance of BAIL against other batch deep RL algorithms. On a particular set of domains, BAIL has higher performance than the baselines.
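To make the selection step in this summary concrete, here is a minimal sketch, not the authors' implementation. It assumes the upper-envelope values have already been fitted and are positive, and it ranks pairs by the ratio of return to envelope value. The function name `select_best_actions`, the ratio rule, and the `keep_fraction` parameter are illustrative assumptions that do not appear in the reviews.

```python
def select_best_actions(returns, envelope_values, keep_fraction=0.25):
    """Return sorted indices of the pairs whose return is closest to the envelope.

    returns[i]         -- Monte Carlo return G_i observed from state s_i
    envelope_values[i] -- fitted upper-envelope value V(s_i), assumed positive
    keep_fraction      -- hypothetical share of the batch to keep
    """
    if len(returns) != len(envelope_values):
        raise ValueError("returns and envelope_values must have equal length")
    # Ratio G_i / V(s_i): values near 1 mean the observed return is close
    # to the best return the envelope says is achievable from s_i.
    ratios = [g / v for g, v in zip(returns, envelope_values)]
    n_keep = max(1, int(len(ratios) * keep_fraction))
    ranked = sorted(range(len(ratios)), key=lambda i: ratios[i], reverse=True)
    return sorted(ranked[:n_keep])


if __name__ == "__main__":
    picked = select_best_actions([1.0, 4.0, 9.0, 2.0], [2.0, 4.0, 10.0, 8.0], 0.5)
    print(picked)
```

The selected indices would then define the training set for ordinary behavioral cloning.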

Strengths: The algorithm seems novel and is indeed straightforward and simple, as the authors claim. Intuitively, it makes sense to perform imitation learning on state-action pairs selected from the data for their high return. The evaluation over five MuJoCo environments shows that BAIL is in general higher-performing. Some rows in Table 1 are a bit misleading, as not all bolded entries correspond to the algorithm with the highest performance.

Weaknesses: There is little intuition or buildup to the two BAIL versions introduced by the paper; the reader has no sense of why selection of the best actions is performed in these two different ways. I think that a more thorough behavior analysis of BAIL would strengthen this work. For instance, since the algorithm seems dependent on the upper envelope of the data, additional experiments (synthetic or otherwise) with poor training examples (perhaps decreasing the quality of non-expert data) would be ideal to see how robust BAIL is in these cases. The execution-batch experiments are a step toward this idea. Again, the table presenting these results seems misleading. In 6 or 7 of the rows, BAIL does not have the best performance, yet its entry is highlighted. From the execution experiments provided, BAIL does not seem to perform competitively in this case. The question of "how close to expert does the training data have to be for BAIL to perform competitively" seems natural to ask for an algorithm that seems to mostly depend on high-quality state-action pairs. Perhaps I am misunderstanding the results, but it does not seem this is being answered well in the experimental section. BAIL is not claimed to work when environment transitions are non-deterministic.

Correctness: The empirical methodology seems fine. The authors provide reasons for most of their experiment decisions and (to me) nothing seems out-of-place.

Clarity: The paper is readable, but not very clear in some places, specifically in the explanation of the theorems and of the optimization problem for the upper envelope. I think that the related work section was nicely written. Also, the main body does not mention that the theorems are proved, even though the proofs are in the appendix. This adds to the confusion of the paper but can be easily fixed by stating that the proofs are in the appendix.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: Line 127: the authors say the Monte Carlo return G_i is the sum of discounted returns from state s_i to the end of the episode, but then define G_i to actually be the sum of the discounted rewards.
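The distinction raised here can be illustrated with a short sketch of a Monte Carlo return computed as a sum of discounted rewards, G_i = r_i + gamma * r_{i+1} + ... up to the end of the episode. The function name `discounted_returns` and the default `gamma` are assumptions chosen for illustration, not taken from the paper.

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute G_i = sum_{t >= i} gamma**(t - i) * rewards[t] for every step i.

    rewards -- per-step rewards of a single episode, in time order
    gamma   -- discount factor (illustrative default)
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each return reuses the one that follows it:
    # G_i = r_i + gamma * G_{i+1}.
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


if __name__ == "__main__":
    print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))
```

Each entry is built from rewards only, which is the definition the reviewer says the paper actually uses.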

Review 2

Summary and Contributions: This paper introduces BAIL, a new batch (also known as offline) RL algorithm. The algorithm uses an interesting approach of estimating the upper envelope of the data. The paper performs a large number of experiments comparing it to recent offline RL algorithms. Update: I've read your response, and my review stays the same. Solid work.

Strengths: I think the work is novel. The presentation is clear. The experimental evaluation is strong. And the results are convincing. I am fairly familiar with the literature in this area, and I think this work is solid.

Weaknesses: I think the results would be even more convincing if there were experiments on harder domains.

Correctness: The claims seem correct to me.

Clarity: The paper is well written.

Relation to Prior Work: The paper clearly discussed how it differs from previous contributions.

Reproducibility: Yes

Additional Feedback:

Review 3

Summary and Contributions: The authors introduce BAIL, a batch RL algorithm which estimates an upper bound on the value function and then uses this upper bound to filter out the best trajectories for imitation learning. The authors evaluate their approach on standard MuJoCo benchmark tasks and show that it compares favorably to other recent batch RL approaches as well as to behavioral cloning.

Strengths: The proposed method is simple to implement and straight-forward. The evaluation shows that it has the potential to be an easy way to achieve better performance without requiring additional environment interactions.

Weaknesses: The approach is at times ad-hoc and some choices could be better motivated. The evaluation isn’t entirely clear.

Correctness: The proposed approach is best seen as a heuristic. While it is intuitive that it can lead to good results when the given dataset is good, no guarantees are made that this is the case. The evaluation is relatively extensive and shows that such benefits can be observed.

Clarity: Most parts of the paper are relatively easy to follow and the method is easy to understand; however, the paper would benefit from an additional pass for spelling mistakes and stylistic faux pas. It was unclear to me which results related to which method of collecting batches. Given the nature of the method, I was interested to see how well the agent would learn from sub-optimal training data, which the execution batches in Section 5.1.2 might answer; however, I could not discern where the execution batches are being used.

Relation to Prior Work: The authors draw a connection to and evaluate against the recent body of work in BatchRL. The idea of using imitation learning on observed roll-outs that are better than the current estimate of the value has been explored before in Self-imitation Learning (Oh et al., 2018), albeit in a different context. The general idea of filtering an imitation learning training set to only include good episodes can also be found in the literature, and has for example been used in AlphaStar (Vinyals et al., 2019).

Reproducibility: Yes

Additional Feedback: In Section 4.1, a lot of time is spent on describing the notion of the upper envelope of the value function. This definition seems to be crucially dependent on the l2-regularization term, yet the authors do not use the l2-regularization and replace it with an early stopping criterion. It would be good to know why this is the case and how well the method would perform if the l2-regularization were actually used. Due to the high-variance nature of the value estimate, it seems to me that the method might require a larger amount of high-quality training data when compared to other BatchRL approaches. As far as I can see, there is no evaluation with regard to that criterion. Theorem 4.1 seems gratuitous: the method is overall more of a heuristic, and the theorem does not show anything surprising or lead to any guarantees for the algorithm.
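For readers without the paper at hand, the l2-regularized upper-envelope problem discussed in this comment can be written roughly as follows. This is a sketch reconstructed from the descriptions in these reviews, not a quotation of the paper: the squared form of the gap, the regularization weight $\lambda$, the batch size $m$, and the parameter symbol $\phi$ are assumptions.

```latex
$$
\min_{\phi} \; \sum_{i=1}^{m} \bigl[ V_{\phi}(s_i) - G_i \bigr]^2 + \lambda \lVert \phi \rVert^2
\quad \text{subject to} \quad V_{\phi}(s_i) \ge G_i, \qquad i = 1, \dots, m
$$
```

Under this reading, the constraint keeps $V_{\phi}$ above every observed return, while the $\lambda$ term controls how tightly the envelope hugs the data; the reviewer's question is what happens when that term is replaced by early stopping.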

Review 4

Summary and Contributions: The authors present a new batch RL algorithm based on the newly introduced notion of an upper envelope for returns, together with best-action selection.

Strengths:
* Solid experimental results, and a good range of domains tested
* Good theoretical analysis of the method

Weaknesses: * The computational analysis looks promising, but would be more convincing if the authors demonstrated that the implementations of each of the baseline algorithms are efficient.

Correctness: Seems correct.

Clarity: Clearly written

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: I would encourage the authors to give the reader more intuition on why BAIL performs so much better than other methods. **POST-REBUTTAL** After having read the authors' response, I am inclined to keep my score.