NIPS 2016
Mon Dec 5th through Sun the 11th, 2016 at Centre Convencions Internacional Barcelona
Paper ID: 1506 Showing versus doing: Teaching by demonstration

### Reviewer 1

#### Summary

This well-written paper presents a new approach to learning from demonstrations, introducing an algorithm for picking the best trajectories for demonstrations as well as for learning reward functions from such teaching trajectories. The authors show that the teaching algorithm qualitatively matches human behaviour and that the algorithms lead to better learning from demonstrations in general.

#### Qualitative Assessment

The paper provides a new and interesting approach to the problem of learning from demonstration. It effectively tackles both sides of the question laid out, developing algorithms both for better teaching and for better learning (assuming a teacher). The algorithms are validated in a pair of experiments with humans and through simulations. Below are some suggestions for improvement:

1. I found it hard to track exactly which model was being used in which experiment/simulation. For example, I don’t see the pedagogical IRL model being used in any of the experiments. As far as I can tell, the first experiment showed the difference between the standard planning model and the pedagogical model. The second experiment also showed the difference and then used standard IRL to infer reward functions from both generated trajectories and human demonstrations (do or show).
2. The general picture of the approach is quite clear, but the details of the algorithms, from the equations and the algorithm table, were not sufficiently clear. Algorithm 1 is not really discussed or unpacked in the text, and the equations have multiple undefined terms whose functionality was not obvious (especially Eq. 2 and the subsequent unnumbered one). What are d and h in Eq. 2? How does the parameter alpha change things? Is it the same parameter in the two equations?
3. For all the statistics, the means could use a measure of variance (e.g., a confidence interval or SEM). Why are some of the t-tests one-sided and others two-sided? There does not seem to be an obvious logic. Why are the degrees of freedom fractional and different from one test to the next?
4. The design of experiment 2 was hard to discern from the figure alone. I think the experiment display was the coloured blocks and the hidden reward function was the first column, but it took several reads to work that out. A couple more basic introductory sentences would help. Was the reward 10 points or 5 points (it seems to be stated differently in two places)?
5. The general discussion was very thin and made surprisingly little effort to establish the contribution of this paper to the larger literature on teaching by demonstration, inverse reinforcement learning, or even its wider impact on pedagogy at large.

#### Confidence in this Review

2-Confident (read it all; understood it all reasonably well)

### Reviewer 2

#### Summary

A model is presented for how a sequential decision maker could behave if the task is to inform a learner about the underlying reward function. In two simple navigation experiments the model is compared to human behavior.

#### Qualitative Assessment

I find it very interesting to model how a rational agent or human could behave when teaching. But the paper does not fully succeed in convincing me that the proposed model is a good one. This is mostly because the comparison with human data is rather qualitative and focuses on only a few aspects.

1. For example, figure 1 gives the impression that the model and humans act very similarly in the standard planning (doing) condition. Isn't this just an effect of showing only the 2 most probable trajectories of the model? I guess the total probability of a (swerving) trajectory going through the diagonal square between start and goal is larger than the total probability of the shown trajectories, i.e. P(up-right-up-right) + P(up-right-right-up) + P(right-up-right-up) + P(right-up-up-right) > P(up-up-right-right) + P(right-right-up-up). I don't have a good intuition for the model distribution in the showing condition, but I wouldn't be surprised to find similar discrepancies. A comparison of the trajectory probabilities in a P-P plot could be revealing.
2. It feels unsatisfactory that a prior favoring short sequences needs to be used to match the data in experiment 2 (lines 218-220). In fact, I think this is another weakness of the model. How would the values in table 1 change without this extra assumption?
3. I didn't find all parameter values. What are the model parameters for task 1? What lambda was chosen for the Boltzmann policy? But more importantly: how were the parameters chosen? Maximum likelihood estimates?
4. An answer to this point may be beyond the scope of this work, but it may be interesting to think about. It is mentioned (lines 104-106) that "the examples [...] should maximally disambiguate the concept being taught from other possible concepts". How is disambiguation measured? How can disambiguation be maximized? Could there be an information-theoretic approach to these questions? Something like: the teacher chooses samples that maximally reduce the entropy of the assumed posterior of the student. Does the proposed model do that?

Minor points:
• line 88: The optimal policy is deterministic. Hence I'm a bit confused by "the stochastic optimal policy". Is the Boltzmann policy defined above meant?
• What are d and h in equation 2?
• line 108: "to calculate this ..." What is meant by "this"?
• Algorithm 1: Require should also include epsilon. Does line 1 initialize the set of policies to an empty set? Are the policies in line 4 added to this set? Does calculateActionValues return the Q* defined in line 75? What is M in line 6? How should p_min be chosen? Why is p_min needed anyway?
• Experiment 2: Is the reward 10 points (line 178) or 5 points (line 196)?
• Experiment 2: Is 0A the condition where all tiles are dangerous? Why are the likelihoods so much larger for 0A? Is it reasonable to average over likelihoods that differ by more than an order of magnitude (0A vs 2A-C)?
• The text and formulas should be carefully checked for typos (e.g. line 10 in Algorithm 1: delta > epsilon; line 217: 1^-6).
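The entropy-reduction criterion suggested in point 4 can be made concrete with a small sketch. Everything below is invented for illustration (a toy concept space, uniform likelihoods over consistent examples): the teacher, who knows the true concept, picks the consistent example that leaves the student's Bayesian posterior with the lowest entropy.

```python
import math

def entropy(dist):
    """Shannon entropy (in nats) of a probability list."""
    return -sum(p * math.log(p) for p in dist if p > 0)

def student_posterior(prior, likelihood, example):
    """Student's Bayesian posterior over concepts after seeing one example."""
    unnorm = [prior[c] * likelihood[c][example] for c in range(len(prior))]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def most_disambiguating_example(true_concept, prior, likelihood, examples):
    """Among examples consistent with the true concept, pick the one that
    leaves the student's posterior with the lowest entropy."""
    candidates = [x for x in examples if likelihood[true_concept][x] > 0]
    return min(candidates,
               key=lambda x: entropy(student_posterior(prior, likelihood, x)))

# Toy setup: two concepts, three examples; 'b' is ambiguous between the
# concepts, while 'c' uniquely identifies concept 1.
prior = [0.5, 0.5]
likelihood = [
    {'a': 0.5, 'b': 0.5, 'c': 0.0},   # concept 0 generates a or b
    {'a': 0.0, 'b': 0.5, 'c': 0.5},   # concept 1 generates b or c
]
print(most_disambiguating_example(1, prior, likelihood, ['a', 'b', 'c']))  # prints c
```

Whether the paper's pedagogical model implements exactly this objective is the reviewer's open question; the sketch only shows that the criterion is easy to operationalize.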

#### Confidence in this Review

2-Confident (read it all; understood it all reasonably well)

### Reviewer 3

#### Summary

This paper addresses the subject of teaching by demonstration. In particular, it asks the following question: When teaching from demonstration, is there a better way to form trajectories than following the optimal policy? In addressing this question, the authors combine two earlier strands of research in the literature, inverse reinforcement learning and Bayesian pedagogy, to introduce a new model called pedagogical inverse reinforcement learning. They perform two experiments with human subjects on two simple, deterministic gridworld problems, using a tabular and a feature-based reward function. The authors observe that people differ in the trajectories they produce when performing the task for themselves (to maximize return) and when demonstrating the task to others (so that they can maximize return as well).

#### Confidence in this Review

2-Confident (read it all; understood it all reasonably well)

### Reviewer 4

#### Summary

This is a nice paper that expands on a Bayesian model of teaching introduced in a 2014 cognitive psychology paper. The main innovation here is to distinguish between doing a task and teaching it by demonstration (showing). In the original paper, the model was used to predict examples used for teaching purposes in three different settings. In this paper, the model is used to compare human performance with the predictions of either an MDP solver or a pedagogical model, when the humans are doing a task versus trying to show another person what the goal is in a simple grid world. The model results predict the human behavior fairly well. In a second experiment, there are features in the grid world that may signal penalties (colored tiles), and the human demonstrations take suboptimal paths to the goal that are nevertheless informative about which tiles are dangerous and which are safe. Again, the MDP solver predicts the human behavior better when doing the task than when demonstrating it, and the pedagogical model predicts the human behavior better when demonstrating the task than when simply doing it. Finally, they show that a state-of-the-art inverse reinforcement learning model (MLIRL) learns a more accurate representation of the environment (which tiles are safe vs. dangerous) from either the human or the model demonstrations than from the human or model simply doing the task.

#### Qualitative Assessment

Specific comments/typos/wording:
• Citations not in reference list: Lyons et al. 2007; Abbeel et al., 2007.
• In references, Babes -> Babes-Vroman.
• Algorithm 1 is not apparently referenced or explained in the text. It should be pushed to the supplementary material with a detailed explanation. I note that, as written, lines 11-14 will never be reached: you set delta = infinity, and then the while loop runs as long as delta is less than epsilon. Furthermore, on line 1 you set PI to the empty set and never add to it. Thus line 5 would lead to a null j, since you calculate j such that there exists a pi element of PI such that…. I assume you want to add pi to PI after line 4. Those are just the obvious bugs; without more explanation, this algorithm is relatively impenetrable.
• Equation 2 needs more explanation. As written, it is unclear why it is correct. Basically, it says that x is proportional to (xy)^alpha, which is nonsensical unless alpha = 1.
• line 142: won -> win
• line 148: As -> that, as
• lines 152-154: “as predicted by the model” appears twice. Also, you are describing the statistics for “&”, which you just described in the previous sentence. One of these must be “%”.
• The model does not account for the swerving behavior for the “#” goal. How would you change it to account for this behavior?
• Table 1: you haven’t told us what the “0A” or “1B” conditions are, and these are the conditions most strongly predicted by the model. It would behoove you to exhibit these, as well as to provide some explanation as to why they are so strongly predicted, especially 0A. I understand that it is the relative predictions that matter, but these numbers are so wildly different that it would be good to know why.
• Lines 237-240: the prediction is the opposite of what you say it is (Show: M=0.82, Do: M=0.87). Again, here, you refer to reward functions that aren’t explained.
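The two control-flow bugs flagged above (a while condition that can never be entered, and a policy set that is never populated) can be seen in a minimal sketch. The subroutine names below only stand in for the paper's (their real definitions are not given in the review); the point is purely the loop structure: delta starts at infinity so the loop must test delta > epsilon, and each newly computed policy must be appended to the set.

```python
import math

def iterate_policies(epsilon, calculate_policy, value_of):
    """Sketch of the corrected loop structure for an Algorithm-1-style
    iteration. calculate_policy and value_of are hypothetical stand-ins."""
    policies = []              # PI starts empty...
    delta = math.inf           # ...and delta = infinity, so the loop
    prev_value = None
    while delta > epsilon:     # must test delta > epsilon, not delta < epsilon
        pi = calculate_policy(policies)
        policies.append(pi)    # the missing "add pi to PI" step
        value = value_of(pi)
        delta = math.inf if prev_value is None else abs(value - prev_value)
        prev_value = value
    return policies

# Toy demo with a converging value sequence, just to show the loop terminates:
policies = iterate_policies(
    epsilon=0.01,
    calculate_policy=lambda pis: len(pis),   # stand-in "policy" = its index
    value_of=lambda pi: 1.0 / (pi + 1),      # successive deltas shrink to zero
)
print(len(policies))
```

With the original condition (delta < epsilon) the body would never execute, and with an empty, never-updated PI the lookup over its elements would fail, exactly as the review describes.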

#### Confidence in this Review

2-Confident (read it all; understood it all reasonably well)

### Reviewer 5

#### Summary

In inverse reinforcement learning (IRL), an agent learns a reward function by observing the actions of another agent. Based on the fact that a human teacher modifies their behavior when demonstrating a task to a student in order to facilitate learning, the paper looks at two different strategies for the demonstrating system to teach the observing one: doing (where the teaching system behaves the same way with or without a student) and showing (where the teaching system modifies its behavior in order to facilitate learning by the student). A formal framework based on a Bayesian analysis is developed in the first part of the paper, and two experiments with human participants aim to give it empirical support. The two experiments show that, in a teaching situation, human participants change their behavior in a way that might be consistent with the formal framework and that eases learning of the value function by an agent using IRL.

#### Qualitative Assessment

The paper tackles a very interesting topic, as a vast literature in psychology indeed shows that humans do change their behavior depending on what they intend to communicate to another person, notably in a teaching situation. I do not have the mathematical skills to fully appreciate the framework the authors are proposing, but as far as I can tell, it looks promising. The problem is how to test it, as it is supposed to be not only a method to boost IRL but also a model of human behavior, which seems to be the goal of the authors.

I am not totally convinced by Experiment 1, notably because the statistics do not look appropriate to me. First, it is not clear what the dependent variable is. Intuitively, I would say it is the proportion of trials on which a participant used a given strategy, but sentences like "more people took the outside route in the show condition" got me confused: either this does not correspond to the way the analyses were run, or I am missing something. In both cases, more details are necessary. Moreover, especially in these days where p-hacking is a concern, the smallest number of statistical analyses should be run. In this case, a 2x3 ANOVA of condition (doing vs. showing) and goal (#, %, &) would have been the appropriate analysis, eventually followed by the t-tests the authors report in the paper. I am not sure this analysis would show any main effect of condition or an interaction of the two factors. Finally, I am not sure I understand why no difference between the conditions is predicted for the # target. More information should be given about this.

Overall, I am not sure experiment 1 is worth putting in the paper, as experiment 2 basically makes the same point but with a much more interesting task and a better statistical argument. Yet, I have two remarks. First, the reward function maps shown in Figure 2 are not the easiest way to see whether or not showing has led to better IRL than doing. Maybe showing the amount of reward collected by an agent following those maps (perhaps relative to an agent following the optimal strategy) would make that more immediately visible. Second, the authors conclude that the actions of the participants in the showing condition are better explained by the pedagogical model. Could it just be because the behavior becomes more random-like (increased entropy) in the teaching condition? Would any model leading to more random-like behavior account as well for the data (for instance, a straightforward RL algorithm using a softmax action selection rule whose temperature parameter is increased in the showing condition), or is this predictive advantage specific to the pedagogical model? This might provide a better null hypothesis. Also, it is not clear how the authors chose the free parameters of the pedagogical model, nor what some of these parameters are (like l_max or p_min: they are not mentioned elsewhere in the paper).
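The null model proposed in the second remark is easy to sketch: the same (here, invented) action values, but a softmax rule whose inverse temperature beta is lowered in the showing condition, yielding a flatter, higher-entropy policy. All Q-values and beta settings below are illustrative assumptions, not taken from the paper.

```python
import math

def softmax_policy(q_values, beta):
    """Action probabilities under a Boltzmann/softmax rule with
    inverse temperature beta (larger beta = sharper, greedier policy)."""
    exps = [math.exp(beta * q) for q in q_values]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(dist):
    """Shannon entropy (in nats) of a probability list."""
    return -sum(p * math.log(p) for p in dist if p > 0)

q = [1.0, 0.5, 0.0]                     # hypothetical action values
doing = softmax_policy(q, beta=5.0)     # sharp: mostly picks the best action
showing = softmax_policy(q, beta=1.0)   # flat: more random-like behavior

print(entropy(doing) < entropy(showing))  # prints True
```

If such a model fit the showing data comparably well, the pedagogical model's advantage could reflect mere extra randomness rather than teaching per se, which is the comparison being requested.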

#### Confidence in this Review

2-Confident (read it all; understood it all reasonably well)