NeurIPS 2019
Sun Dec 8 through Sat Dec 14, 2019, at the Vancouver Convention Center
Paper ID: 3848
Title: On the Correctness and Sample Complexity of Inverse Reinforcement Learning

Reviewer 1


Originality: The work stems from an original idea, but a fuller acknowledgment of the last 20 years of work on IRL would have provided a more honest basis for situating the contribution.

Quality: The paper is technically sound and the methodology behind the results is well-founded; however, it is unclear how the theorems apply to the presented work. It would be interesting to see their broader impact by analyzing how other proposed IRL algorithms are affected by these theorems and how this relates to the reported experimental results. Many of the proofs could have been left to the appendix, leaving more space for experiments and discussion.

Clarity: The paper is clearly written and reads well. It adequately informs the reader about some of the limitations of the method. A more in-depth introduction of the F matrix with respect to value iteration would make the paper more self-contained.

Significance: The theorems presented in the paper seem important for characterizing the number of samples needed to achieve a given bound in the finite-state-space IRL case, but it is unclear how this impacts current state-of-the-art algorithms.

Reviewer 2


The submission improves the theoretical support for the IRL problem of Ng & Russell (2000) by providing a sample complexity result and a sufficient condition for the correctness of the IRL problem. The results are mainly based on a geometric analysis of the IRL problem, which contributes in two ways: 1. it provides the sufficient condition; 2. it transforms the problem into an L_1-regularized SVM problem, whose theoretical properties have already been thoroughly studied (see the illustrative sketch after this review).

To better evaluate the contributions of the submission, I have the following two questions:

1. How strong is Definition 4.1 for the IRL problem? (While I accept that Definition 4.1 is weak enough for L_1-regularized methods, why is this true for IRL problems?) In other words, by only considering the Regime 3 case, how much do we lose?

2. It seems that the formulation here is a kind of constrained formulation, especially compared with maximum-entropy IRL methods [1], which are more often used in practice. Can the analysis in the submission therefore contribute to, or be generalized to, more practical methods?

[1] Finn, Chelsea, Sergey Levine, and Pieter Abbeel. "Guided Cost Learning: Deep Inverse Optimal Control via Policy Optimization." International Conference on Machine Learning, 2016.

===================================================

The authors' rebuttal is clear and convincing. I would like to increase my score to 6.
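[Editorial note] For readers unfamiliar with the connection, below is a minimal sketch of how an L_1 penalty can enter a linear IRL formulation. It is based on the linear-programming formulation of Ng & Russell (2000), not on the submission's exact SVM reduction; the random transition matrices, the parameter values, and the use of cvxpy are illustrative assumptions, not details taken from the paper.

import numpy as np
import cvxpy as cp

# Illustrative sizes and data (not from the paper): n states, k actions,
# action 0 is the expert's action in every state.
n, k, gamma, r_max, lam = 5, 3, 0.9, 1.0, 1.0
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n), size=(k, n))       # P[a][s] = distribution over next states

R = cp.Variable(n)                                # state-only reward to be recovered
t = cp.Variable(n)                                # auxiliary variables for the inner min
M = np.linalg.inv(np.eye(n) - gamma * P[0])       # (I - gamma * P_{a1})^{-1}

constraints = [cp.abs(R) <= r_max]
for a in range(1, k):
    gap = (P[0] - P[a]) @ M                       # advantage of the expert action over action a
    constraints += [gap @ R >= 0, t <= gap @ R]   # expert action must be (weakly) optimal

# Maximize the total margin while penalizing ||R||_1, as in Ng & Russell's LP.
cp.Problem(cp.Maximize(cp.sum(t) - lam * cp.norm(R, 1)), constraints).solve()
print(np.round(R.value, 3))

The point of the sketch is only to show where an L_1 regularizer enters such a formulation; the submission's actual SVM reduction and its sample-based version differ in detail.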

Reviewer 3


Summary: This paper first proposes a geometric perspective on the IRL problem. Based on this observation, an L_1-regularized SVM formulation is proposed. A theoretical analysis of the sample complexity provides formal guarantees for the algorithm, and the empirical performance is also tested.

Comments:

Originality: The theoretical analyses in this paper, including the geometric perspective, the SVM formulation, and the sample complexity analysis, are technically straightforward to some degree.

Significance: Although the technique in this paper is not that 'fancy', the work is important and valuable. The proposed algorithm is the first with formal guarantees for the IRL problem.

Quality: I think the theoretical analysis is fine. Empirical performance is also evaluated in the experimental section.

Clarity: The paper is well-organized and easy to follow.

Other minor comments: Please go through the paper again to make the notation consistent; e.g., in Eq. (2.1), the capital N is not defined beforehand.

=========

Update after author response: The authors did not address my primary concern in detail, i.e., the technical challenges and novelty. However, the theoretical contribution of their work is sufficiently significant for acceptance, so I maintain my score.

Reviewer 4


This work introduces a geometric analysis of the inverse reinforcement learning (IRL) problem and proposes a formal guarantee for the optimality of the reward function obtained from empirical data. The authors also provide the sample complexity of their proposed L_1-regularized Support Vector Machine formulation. In general, this is an interesting work with a significant contribution to the theoretical side of the inverse reinforcement learning problem. However, a few concerns need to be addressed.

Major:

1. The paper does not define the problem as a stand-alone question in the field. The problem formulation relies heavily on the previous work of Ng & Russell (2000) and is written only as a follow-up to that work. As a result, the reader needs to go back and forth between the two papers to understand the terminology and the contribution of this work. The authors need to clearly define the problem, provide the motivation, describe previously proposed solutions, and clearly state their contribution. In particular, Section 2 ("Preliminaries"), which introduces Equation 2.1, is far from self-explanatory.

2. There needs to be more discussion of previous work in inverse reinforcement learning. Many different techniques have been proposed in the past few years (Arora and Doshi, 2018), none of which are discussed or compared against here, which casts doubt on the novelty of the paper's contribution to the field.

3. While the theoretical aspect of the work is discussed at length, the practical aspect is not adequately addressed. In particular, only three similar experiments with small state and action spaces were used (n, k = 5; n, k = 7; n, k = 10). The authors need to extend their analysis to compare the two approaches in larger-scale environments with more realistic tasks (e.g., grid worlds, Atari games, navigation, etc.).

Minor:

a. The value function in Section 2 ("Preliminaries") is not defined correctly. Since \pi: S -> A, the term R(\pi(s)) is defined on the action space, whereas the reward function is assumed to be defined on the state space. Similarly, \pi(\pi(s)) is not meaningful because the policy is defined on the state space, not the action space. The notation should be changed to resolve this confusion (see the reference definition after this list).

b. All equations need to be numbered.

c. In Equation 2.1, what is N? This should be n.

d. Equation 4.1 should be a minimization rather than a maximization.
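[Editorial note] For reference regarding minor comment (a), a standard Bellman definition of the policy value function with a state-only reward, which appears to be what the reviewer has in mind, is given below; the notation P_a(s, s') for transition probabilities is assumed, following Ng & Russell (2000).

\[ V^{\pi}(s) = R(s) + \gamma \sum_{s' \in S} P_{\pi(s)}(s, s')\, V^{\pi}(s') \quad \text{for all } s \in S. \]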