NIPS 2018
Sun Dec 2nd through Sat the 8th, 2018 at Palais des Congrès de Montréal
Paper ID: 2577 Learning Task Specifications from Demonstrations

Reviewer 1

** update **

I have increased my score for this paper, since the authors' response to the issues raised by the reviewers seems very satisfactory to me. However, the authors promise several improvements, and my score cannot be based on promises alone, so it increases by only one. Finally, the practical motivations of this little-explored field are well justified, so I also encourage the authors to apply their work to more complex problems, which should lead to interesting results.

** summary **

This paper presents an algorithm that, from demonstrations (sequences of states and actions) and a set of candidate specifications (sets of trajectories), infers the specifications governing the agent's actions and visited states. These specifications can be seen as non-Markovian reward functions. This work is thus related to inverse reinforcement learning (IRL), which aims to infer an agent's reward function by observing its successive states and actions. By defining the probability of a trajectory given a specification (using the maximum entropy principle), the development leads to a posterior distribution over specifications. Two algorithms result from this and are used to test the approach on the system presented in the introduction (which motivates the paper).

** quality **

Although its introduction is interesting, this paper is quite hard to follow, especially from page 4 onward. The steps should be better justified or motivated by intuition or pedagogical explanations of the ongoing development. Also, some proofs appear only in the supplementary material and not in the paper, making the paper less self-contained. At least proof sketches should be provided so that the reader can study the main arguments.

** clarity **

As mentioned above, clarity is not the strength of this paper. This is also illustrated by the detailed comments that follow.

page 3:
< Sec > Section
The Markov property is already present in Definition 2.1.
< no \phi =^{def} 1 - \phi(\xi) > no \phi(\xi) =^{def} 1 - \phi(\xi)

page 4:
In Equation 2, it seems that a probability of \xi (the trajectory probability given by the transition functions) is multiplied by another probability of \xi (given by the exponential function). There is some confusion in the notation between X and \xi.

page 6:
< the following simple algorithm. > the following simple algorithm (Algorithm 2).

page 8:
< queries,\phi^{hat} > queries, \phi^{hat}
< assignments.\footnote{...} > assignments\footnote{...}.
The conclusion should be a Section.

** originality **

To the best of my knowledge, this paper is the first work on inferring specifications from demonstrations in the MDP framework. The authors have done a good bibliographical job of positioning their study in the relevant literature. I suggest the authors read (Drougard, "Exploiting Imprecise Information Sources in Sequential Decision Making Problems under Uncertainty.", 2015), which adapts classical MDP models to Qualitative Possibility Theory. This theory is very close to logic (and could therefore easily accommodate the specification framework) while retaining properties similar to Probability Theory (which allows planning with the same tools).

** significance **

Although the idea and the algorithms developed are really interesting, I doubt that the theoretical and applied contributions are worthy of NIPS. I remain cautious, however, because the lack of clarity, the absence of proof sketches, and the short review period prevented me from checking the developments in detail.

Reviewer 2

This paper proposes a method for learning specifications (binary non-Markovian rewards defined as predicates over trajectories) from demonstrations. The motivation is to address some problems with reward function design, specifically to provide a better way to combine rewards for different subtasks. The authors formally specify the problem of inferring specifications from demonstrations, define an algorithm inspired by Maximum Entropy IRL, and present theoretical results and experiments.

The paper is interesting, clearly written, and relatively well motivated. Overall, I think the method has strong theoretical grounding but weak experimental support. The experiment section does not contain all the experiments and details needed to back up the claims of the paper (and neither does the supplementary material).

They test MaxEnt IRL and the proposed specification inference algorithm on a gridworld with non-Markovian dependencies, where the agent needs to enter a drying tile after stepping into water before going to the goal. Figure 1 shows rewards inferred by MaxEnt IRL for the two subtasks and the whole task, and the text describes the behavior of agents optimizing these subtask rewards or various combinations of them, but does not mention how an agent optimizing the reward inferred for the whole task (Figure 1b) would behave. The paper claims that the rewards learned using IRL would lead to unsafe behavior, but does not include the corresponding experimental results. The paper shows results for specifications learned by the proposed algorithm, but not for the agent behavior that results from optimizing these specifications. No details are given on the type of agent used in these experiments. The paper states that "we demonstrate how learning specifications can help avoid common reward hacking bugs", but there are no experimental results for an agent using the learned specifications to support this claim.
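To make the review's terminology concrete: a "specification" here is a Boolean predicate over an entire trajectory rather than a per-state reward. The sketch below is not the authors' implementation; it is a minimal illustration of the kind of non-Markovian task the reviewers describe (visit a drying tile after stepping into water, before reaching the goal), with invented state names (`"water"`, `"dry"`, `"goal"`) and a hypothetical negation helper matching the `no \phi(\xi) = 1 - \phi(\xi)` notation discussed by Reviewer 1.

```python
def drying_spec(trajectory):
    """Hypothetical specification: True iff whenever the agent steps into
    'water', it visits a 'dry' tile before reaching 'goal'. A Markovian
    (single-state) reward cannot express this history dependence."""
    wet = False
    for state in trajectory:
        if state == "water":
            wet = True
        elif state == "dry":
            wet = False
        elif state == "goal":
            return not wet  # unsafe if still wet at the goal
    return True  # goal never reached: trivially satisfied here

def negate_spec(phi):
    """Complement specification, i.e. no phi(xi) = 1 - phi(xi)."""
    return lambda trajectory: not phi(trajectory)
```

For example, `drying_spec(["start", "water", "dry", "goal"])` is satisfied, while `drying_spec(["start", "water", "goal"])` is not; a scalar reward summed over states would struggle to distinguish these without extra state augmentation, which is the paper's motivation for treating specifications as first-class objects.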
I also think the authors are using a nonstandard definition of "reward hacking", which includes any agent behavior that results from optimizing a poorly designed reward function, while the standard definition (given in Amodei et al., 2016) refers to the agent finding a loophole in an otherwise well-designed reward function. It would be good to see experimental results on environments designed to test for reward hacking, such as the "boat race" environment from the AI Safety Gridworlds suite (Leike et al., 2017).

UPDATE after reading the author feedback:
- I'm happy to see that most of the missing details for the experiment have been provided.
- The argument for why this method would help avoid reward hacking in the boat race scenario seems convincing. I would still like to see experimental results on this environment, which should be easy to obtain.
- Overall, the rebuttal has addressed most of my concerns, so I have increased my score.