NeurIPS 2019
Sun Dec 8th – Sat Dec 14th, 2019, Vancouver Convention Center
Reviewer 1
*** Update after reading the author response: Thank you for your feedback. I was happy with your inclusion of experiments on manipulation tasks, and agree they're convincing. I was also happy with your explanation of GAIfO vs. GAIL vs. your algorithm, and with your discussion of Sun et al. 2019. Your decision to release code also helps allay any fears I had about reproducibility. I have changed my score to an 8 to reflect these improvements. ***

Original Review:

Well-written paper. Easy to follow the logic and thoughts of the authors. Fairly interesting line of thought; I feel like I learned something by reading it. It's certainly of interest to the IRL community at this conference. I'm not sure the contributions would be of high interest outside the IRL community (for example, to the wider planning, RL, or inverse optimal control communities), but that's probably not much of an issue.

Without the code, it is difficult for me to evaluate their GAIL baseline, which can be difficult to get correct. This gives me pause, because in my personal experience with these algorithms GAIL does not usually care much whether it receives the actions or not. Not a large demerit, as I'm assuming the code will be released, but it does make things difficult to fully evaluate in their present state.

"Provably Efficient Imitation Learning from Observation Alone" should probably be cited.

I found myself wanting to skip over the experiments in Section 5.1. I understand the point that was being made, but the discussion just couldn't hold my interest, even though I read it several times. I think the environment feels contrived.

I did check the math behind Theorem 1 in the appendix. From a quick glance, everything seemed reasonable.

Figure 2 indicates that the introduced algorithm does better than BC and GAIL on a variety of locomotion tasks. I do realize these benchmarks are fairly standard, but are they really the best choice of environments here? It seems like learning from observations would be more useful on robotic manipulation tasks, where it's often easier to provide demonstrations via VR or even human demonstration, but less easy to collect actions. In MuJoCo environments, you always know the ground-truth actions anyway, so pretending you don't have them comes off as somewhat artificial.

I don't see it stated in the paper, but it seems like the locomotion demonstrations are provided via something like PPO? Although this is of course standard, it always seems funny to me when learning-from-demonstration papers use RL to provide expert demonstrations. It's been shown that RL is a lot easier to optimize against than actual human demonstrations, or even solutions provided via model-based methods. I do appreciate that the authors inherited this problem directly from GAIL and the related literature, but it's just hard for me to get excited about IRL results that don't have a human in the loop, or at least some harder manipulation environments. These locomotion benchmarks are an odd choice for imitation learning.

I think when we're evaluating this work, we have to ask ourselves two key questions:
1. Does this work provide a significant improvement or insight over GAIL?
2. Does this work provide a significant improvement or insight over "Generative Adversarial Imitation from Observation" and "Imitation Learning from Video by Leveraging Proprioception"?

As for 1: Table 2 suggests that in practice the algorithm gets results similar to GAIL's. The authors do suggest that the inclusion of actions is important for GAIL's performance, but I am not so sure. I would need to see a much more detailed analysis of GAIL's performance with and without actions to really make a decision about this. I think the analysis is interesting and the ideas are novel, so the paper can get by on that. However, I do have serious concerns that in practice the improvement over GAIL is rather small and that no one will use this algorithm as an actual replacement for GAIL, since, again, most people I know who use GAIL already exclude actions. The correction terms that account for this do seem to provide a gain, but with these locomotion tasks being a narrow slice of what's out there, it's hard for me to feel completely confident.

As for 2: I do think the existence of prior art considering this problem is a demerit for this paper. However, neither of those references considers the relationship between inverse dynamics models and GAIL, so I think the paper is okay on this dimension.
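To be concrete about the with/without-actions ablation I have in mind: the only difference is what the discriminator is fed. A minimal sketch of what I mean (purely illustrative, not the authors' code; the architecture and names are my own):

import torch
import torch.nn as nn

class GAILDiscriminator(nn.Module):
    # Scores a pair: (state, action) for standard GAIL,
    # or (state, next_state) for the observation-only variant.
    def __init__(self, state_dim, pair_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + pair_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, other):
        # 'other' is either the action a or the next state s'
        return self.net(torch.cat([state, other], dim=-1))

# GAIL with actions:     D = GAILDiscriminator(state_dim, action_dim); D(s, a)
# GAIL without actions:  D = GAILDiscriminator(state_dim, state_dim);  D(s, s_next)

If the paper reported this head-to-head comparison with everything else held fixed, it would be much easier to judge how much of the reported gain actually comes from the correction terms.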
Reviewer 2
This work shows that the difference between the state-action occupancy of LfD and the state-transition occupancy of LfO can be characterized by the difference between the inverse dynamics models of the expert policy and the agent's policy. The paper then proposes an algorithm for learning from observations by minimizing an upper bound on the inverse dynamics disagreement. The method is evaluated on a number of benchmark tasks.

I think the paper is generally well written and presents the derivation of IDDM in a fairly clear manner. The overall concept of inverse dynamics disagreement is also quite interesting. There are some minor typos and instances of awkward phrasing, but these can be fixed with some additional polishing.

The experiments show promising results on a diverse set of tasks. IDDM compares pretty favourably to a number of previous methods, though the improvements appear to be fairly marginal on many of the tasks. But overall, I am in favour of accepting this work. I have only some small concerns regarding some of the experiments, but they should in no way be a barrier to acceptance.

In the performance statistics reported in Table 2, the performance of the handcrafted reward function (DeepMimic) seems unusually low, particularly for the ant. In the original paper, the method was able to reproduce very challenging skills with much more complex simulated agents. The tasks in this work are substantially simpler, so one would expect that the hand-designed reward function, with some tuning, should be able to closely reproduce the demonstrations.
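To spell out the relationship being summarized above (my own paraphrase from memory, so the paper's exact notation may differ slightly): factoring the joint occupancy \rho(s, a, s') two ways gives

  D_{KL}(\rho_\pi(s,a) \| \rho_E(s,a))
    = D_{KL}(\rho_\pi(s,s') \| \rho_E(s,s'))
    + \mathbb{E}_{(s,s') \sim \rho_\pi}[ D_{KL}(\rho_\pi(a \mid s,s') \| \rho_E(a \mid s,s')) ],

where the last term is the inverse dynamics disagreement. Since it is non-negative, the LfO divergence is always a lower bound on the LfD divergence, and minimizing an upper bound on that residual term is how the proposed method closes the remaining gap.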
Reviewer 3
The authors study Learning from Observation (LfO) inverse reinforcement learning (imitation based on states alone, as opposed to states and actions as in LfD). They attempt to build on the GAIfO algorithm to achieve increased performance. To do so, they derive an expression for the performance gap between LfO and LfD. They then derive an upper bound on this expression that they add to the loss function to help close this optimization gap. They report significantly improved performance on a wide range of tasks against a wide range of alternative approaches, including GAIfO.

Notes:

Eq. 5: The explanation here is a little confusing. Firstly, the equation implies the divergence of LfD is always greater than that of LfO, since the inverse dynamics disagreement is always non-negative. This seems counterintuitive and could do with more explanation. Is the idea supposed to be that the LfO KL divergence could be zero, but the problem would still not be completely solved? If so, this bears clarification. Additionally, the sentence "Eq 5 is not equal to zero by nature" is misleading, since there are cases when the IDD is zero (such as when \pi = E). In fact, one such case is highlighted in Corollary 1. This should be restated with more specificity and clarity.

Eq. 10: Does this mean that this approach only works for deterministic state transitions? Or does the approach still more or less work, but with an additional source of error? Given the rarity of deterministic transition functions in real-world systems, it should be stated much more clearly that you are restricting your scope to deterministic systems. If the approach can still be employed effectively on non-deterministic systems without significant performance loss, then that should be stated and defended more clearly.

Learning curves: There is no comparison against GAIL in terms of 'number of interactions with the environment'. This seems conspicuously absent. It isn't strictly necessary, since the method has already been shown to be superior to the others in its category, but it's odd that it's not here. There are also no learning curves over 'number of demonstrations'. Since the main motivation for LfO over LfD is availability of data, it seems odd that the authors only report learning curves for 'interactions with the environment' and not 'number of demonstrations'. It would be especially interesting to know at what point, if any, IDDM with more demonstrations begins to outperform GAIL with few demonstrations. GAIfO reported this; I see no reason it shouldn't be reported here as well.

Originality: The paper derives a novel bound on the LfO-LfD gap and optimizes it, achieving state-of-the-art results. The proof, bound, and algorithm are all novel and interesting. Although the authors build on an existing method and propose only a small change to the learning objective, the change they propose has significant impact and is very well motivated.

Quality: The theoretical work is great, and the proofs seem fine. The experimental work is not quite as well documented. Although it may not seem this way at first glance, the method somewhat increases complexity by adding another neural network to approximate the mutual information term in the loss function, making the approach possibly more unwieldy than GAIfO. Coupled with the fact that code was not included, these results may be difficult to replicate. Experiments are generally presented well and show good results, although there are a few experiments/plots that should have been included that were not.
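On that last point, my understanding is that the extra network is a statistics network used to estimate the mutual-information term, roughly in the spirit of a Donsker-Varadhan (MINE-style) estimator. A sketch of what such an estimator looks like (purely illustrative, not the authors' code; all names and sizes are mine):

import math
import torch
import torch.nn as nn

class MIEstimator(nn.Module):
    # Donsker-Varadhan lower bound on I(X; Y) using a small statistics network T(x, y).
    def __init__(self, x_dim, y_dim, hidden=128):
        super().__init__()
        self.T = nn.Sequential(
            nn.Linear(x_dim + y_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x, y):
        # Joint pairs (x_i, y_i) vs. shuffled pairs (x_i, y_perm(i)) from the marginals.
        t_joint = self.T(torch.cat([x, y], dim=-1)).squeeze(-1)
        y_shuffled = y[torch.randperm(y.size(0))]
        t_marginal = self.T(torch.cat([x, y_shuffled], dim=-1)).squeeze(-1)
        # I(X; Y) >= E_joint[T] - log E_marginal[exp(T)]
        return t_joint.mean() - (torch.logsumexp(t_marginal, dim=0) - math.log(t_marginal.size(0)))

Presumably this estimate is maximized with respect to T and then plugged into the imitation objective, which is where the extra moving parts (and the potential unwieldiness relative to GAIfO) come from.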
Clarity: For the most part, the paper is clearly written and very easy to follow. The supplemental material is also very easy to follow and well above the expected quality of writing. However, it should have been pointed out much more clearly that, as written, the approach presented only works in deterministic environments.

Significance: The bound they prove is interesting and useful. Not only do the authors highlight a fundamental gap in this problem, but they also show an effective way to contend with that gap and open up the possibility of new approaches to closing it in the future.