{"title": "Showing versus doing: Teaching by demonstration", "book": "Advances in Neural Information Processing Systems", "page_first": 3027, "page_last": 3035, "abstract": "People often learn from others' demonstrations, and classic inverse reinforcement learning (IRL) algorithms have brought us closer to realizing this capacity in machines. In contrast, teaching by demonstration has been less well studied computationally. Here, we develop a novel Bayesian model for teaching by demonstration. Stark differences arise when demonstrators are intentionally teaching a task versus simply performing a task. In two experiments, we show that human participants systematically modify their teaching behavior consistent with the predictions of our model. Further, we show that even standard IRL algorithms benefit when learning from behaviors that are intentionally pedagogical. We conclude by discussing IRL algorithms that can take advantage of intentional pedagogy.", "full_text": "Showing versus Doing: Teaching by Demonstration\n\nDepartment of Cognitive, Linguistic, and Psychological Sciences\n\nMark K Ho\n\nBrown University\n\nProvidence, RI 02912\nmark_ho@brown.edu\n\nMichael L. Littman\n\nDepartment of Computer Science\n\nBrown University\n\nProvidence, RI 02912\n\nmlittman@cs.brown.edu\n\nFiery Cushman\n\nDepartment of Psychology\n\nHarvard University\n\nCambridge, MA 02138\n\ncushman@fas.harvard.edu\n\nJames MacGlashan\n\nDepartment of Computer Science\n\nBrown University\n\nProvidence, RI 02912\n\njames_macglashan@brown.edu\n\nJoseph L. Austerweil\n\nDepartment of Psychology\n\nUniversity of Wisconsin-Madison\n\nMadison, WI 53706\n\nausterweil@wisc.edu\n\nAbstract\n\nPeople often learn from others\u2019 demonstrations, and inverse reinforcement learning\n(IRL) techniques have realized this capacity in machines. In contrast, teaching\nby demonstration has been less well studied computationally. Here, we develop\na Bayesian model for teaching by demonstration. 
Stark differences arise when\ndemonstrators are intentionally teaching (i.e. showing) a task versus simply performing\n(i.e. doing) a task. In two experiments, we show that human participants\nmodify their teaching behavior consistent with the predictions of our model. Further,\nwe show that even standard IRL algorithms benefit when learning from\nshowing versus doing.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n1 Introduction\n\nIs there a difference between doing something and showing someone else how to do something?\nConsider cooking a chicken. To cook one for dinner, you would do it in the most efficient way\npossible while avoiding contaminating other foods. But, what if you wanted to teach a completely\nna\u00efve observer how to prepare poultry? In that case, you might take pains to emphasize certain\naspects of the process. For example, by ensuring the observer sees you wash your hands thoroughly\nafter handling the uncooked chicken, you signal that it is undesirable (and perhaps even dangerous)\nfor other ingredients to come in contact with raw meat. More broadly, how could an agent show\nanother agent how to do a task, and, in doing so, teach about its underlying reward structure?\nTo model showing, we draw on psychological research on learning and teaching concepts by example.\nPeople are good at this. For instance, when a teacher signals their pedagogical intentions, children\nmore frequently imitate actions and learn abstract functional representations [6, 7]. Recent work\nhas formalized concept teaching as a form of recursive social inference, where a teacher chooses\nan example that best conveys a concept to a learner, who assumes that the teacher is choosing in\nthis manner [14]. The key insight from these models is that helpful teachers do not merely select\nprobable examples of a concept, but rather choose examples that best disambiguate a concept from\nother candidate concepts. 
This approach allows for more effective, and more efficient, teaching and\nlearning of concepts from examples.\nWe can extend these ideas to explain showing behavior. Although recent work has examined user-assisted\nteaching [8], identified legible motor behavior in human-machine coordination [9], and\nanalyzed reward coordination in game-theoretic terms [11], previous work has yet to successfully\nmodel how people naturally teach reward functions by demonstration. Moreover, in Inverse Reinforcement\nLearning (IRL), in which an observer attempts to infer the reward function that an expert\n(human or artificial) is maximizing, it is typically assumed that experts are only doing the task and not\nintentionally showing how to do the task. This raises two related questions: First, how does a person\nshowing how to do a task differ from them just doing it? And second, are standard IRL algorithms\nable to benefit from human attempts to show how to do a task?\nIn this paper, we investigate these questions. To do so, we formulate a computational model of\nshowing that applies Bayesian models of teaching by example to the reward function learning setting.\nWe contrast this pedagogical model with a model of doing: standard optimal planning in Markov\nDecision Processes. The pedagogical model predicts several systematic differences from the standard\nplanning model, and we test whether human participants reproduce these distinctive patterns. For\ninstance, the pedagogical model chooses paths to a goal that best disambiguate which goal is being\npursued (Experiment 1). Similarly, when teaching feature-based reward functions, the model will\nprioritize trajectories that better signal the reward value of state features or even perform trajectories\nthat would be inefficient for an agent simply doing the task (Experiment 2). 
Finally, to determine\nwhether showing is indeed better than doing, we train a standard IRL algorithm with our model\ntrajectories and human trajectories.\n\n2 A Bayesian Model of Teaching by Demonstration\n\nOur model draws on two approaches: IRL [2] and Bayesian models of teaching by example [14].\nThe first of these, IRL and the related concept of inverse planning, have been used to model people\u2019s\ntheory of mind, or the capacity to infer another agent\u2019s unobservable beliefs and/or desires through\ntheir observed behavior [5]. The second, Bayesian models of pedagogy, prescribe how a teacher\nshould use examples to communicate a concept to an ideal learner. Our model of teaching by\ndemonstration, called Pedagogical Inverse Reinforcement Learning, merges these two approaches\ntogether by treating a teacher\u2019s demonstration trajectories as communicative acts that signal the\nreward function that an observer should learn.\n\n2.1 Learning from an Expert\u2019s Actions\n\n2.1.1 Markov Decision Processes\n\nAn agent that plans to maximize a reward function can be modeled as the solution to a Markov\nDecision Process (MDP). An MDP is defined by the tuple \u27e8S, A, T, R, \u03b3\u27e9: a set of states in the\nworld S; a set of actions for each state A(s); a transition function that maps states and actions to\nnext states, T : S \u00d7 A \u2192 S (in this work we assume all transitions are deterministic, but this can\nbe generalized to probabilistic transitions); a reward function that maps states to scalar rewards,\nR : S \u2192 \u211d; and a discount factor \u03b3 \u2208 [0, 1]. Solutions to an MDP are stochastic policies that\nmap states to distributions over actions, \u03c0 : S \u2192 P(A(s)). 
Given a policy, we define the expected\ncumulative discounted reward, or value, V^\u03c0(s), at each state associated with following that policy:\n\nV^{\\pi}(s) = \\mathbb{E}_{\\pi}\\left[ \\sum_{k=0}^{\\infty} \\gamma^{k} r_{t+k+1} \\mid s_t = s \\right]. (1)\n\nIn particular, the optimal policy for an MDP yields the optimal value function, V^*, which is the value\nfunction that has the maximal value for every state (V^*(s) = max_\u03c0 V^\u03c0(s), \u2200s \u2208 S). The optimal\npolicy also defines an optimal state-action value function, Q^*(s, a) = E_\u03c0[r_{t+1} + \u03b3 V^*(s_{t+1}) | s_t =\ns, a_t = a].\n\nAlgorithm 1 Pedagogical Trajectory Algorithm\nRequire: starting states s, reward functions {R_1, R_2, ..., R_N}, transition function T, maximum\nshowing trajectory depth lmax, minimum hypothetical doing probability pmin, teacher maximization\nparameter \u03b1, discount factor \u03b3.\n1: \u03a0 \u2190 \u2205\n2: for i = 1 to N do\n3:   Q_i = calculateActionValues(s, R_i, T, \u03b3)\n4:   \u03c0_i = softmax(Q_i, \u03bb)\n5:   \u03a0.add(\u03c0_i)\n6: Calculate j = {j : s_1 \u2208 s, length(j) \u2264 lmax, and \u2203\u03c0 \u2208 \u03a0 s.t. \u220f_{(s_i, a_i) \u2208 j} \u03c0(a_i | s_i) > pmin}.\n7: Construct hypothetical doing probability distribution PDoing(j | R) as an N x M array.\n8: PObserving(R | j) = PDoing(j | R) P(R) / \u2211_{R'} PDoing(j | R') P(R')\n9: PShowing(j | R) = PObserving(R | j)^\u03b1 / \u2211_{j'} PObserving(R | j')^\u03b1\n10: return PShowing(j | R)\n\n2.1.2 Inverse Reinforcement Learning (IRL)\n\nIn the Reinforcement Learning setting, an agent takes actions in an MDP and receives rewards, which\nallow it to eventually learn the optimal policy [15]. We thus assume that an expert who knows the\nreward function and is doing a task selects an action a_t in a state s_t according to a Boltzmann policy,\nwhich is a standard soft-maximization of the action-values:\n\nP_{\\text{Doing}}(a_t \\mid s_t, R) = \\frac{\\exp\\{Q^{*}(s_t, a_t)/\\lambda\\}}{\\sum_{a' \\in A(s_t)} \\exp\\{Q^{*}(s_t, a')/\\lambda\\}}. (2)\n\n\u03bb > 0 is a temperature parameter (as \u03bb \u2192 0, the expert selects the optimal action with\nprobability 1; as \u03bb \u2192 \u221e, the expert selects actions uniformly randomly).\nIn the IRL setting, an observer sees a trajectory of an expert executing an optimal policy,\nj = {(s_1, a_1), (s_2, a_2), ..., (s_k, a_k)}, and infers the reward function R that the expert is maximizing.\nGiven that an agent\u2019s policy is stationary and Markovian, the probability of the trajectory\ngiven a reward function is just the product of the individual action probabilities, PDoing(j | R) =\n\u220f_t PDoing(a_t | s_t, R). From a Bayesian perspective [13], the observer is computing a posterior\nprobability over possible reward functions R:\n\nP_{\\text{Observing}}(R \\mid j) = \\frac{P_{\\text{Doing}}(j \\mid R) P(R)}{\\sum_{R'} P_{\\text{Doing}}(j \\mid R') P(R')}. (3)\n\nHere, we always assume that P(R) is uniform.\n\n2.2 Bayesian Pedagogy\n\nIRL typically assumes that the demonstrator is executing the stochastic optimal policy for a reward\nfunction. But is this the best way to teach a reward function? Bayesian models of pedagogy and\ncommunicative intent have shown that choosing an example to teach a concept differs from simply\nsampling from that concept [14, 10]. These models all treat the teacher\u2019s choice of a datum, d, as\nmaximizing the probability a learner will infer a target concept, h:\n\nP_{\\text{Teacher}}(d \\mid h) = \\frac{P_{\\text{Learner}}(h \\mid d)^{\\alpha}}{\\sum_{d'} P_{\\text{Learner}}(h \\mid d')^{\\alpha}}. (4)\n\n\u03b1 is the teacher\u2019s softmax parameter. As \u03b1 \u2192 0, the teacher chooses uniformly randomly; as \u03b1 \u2192 \u221e,\nthe teacher chooses d that maximally causes the learner to infer a target concept h; when \u03b1 = 1, the\nteacher is \u201cprobability matching\u201d.\nThe teaching distribution describes how examples can be effectively chosen to teach a concept. For\ninstance, consider teaching the concept of \u201ceven numbers\u201d. The sets {2, 2, 2} and {2, 18, 202} are\nboth examples of even numbers. Indeed, given finite options with replacement, they both have the\nsame probability of being randomly chosen as sets of examples. But {2, 18, 202} is clearly better\nfor helpful teaching since a na\u00efve learner shown {2, 2, 2} would probably infer that \u201ceven numbers\u201d\nmeans \u201cthe number 2\u201d. This illustrates an important aspect of successful teaching by example: that\nexamples should not only be consistent with the concept being taught, but should also maximally\ndisambiguate the concept being taught from other possible concepts.\n\n2.3 Pedagogical Inverse Reinforcement Learning\n\nTo define a model of teaching by demonstration, we treat the teacher\u2019s trajectories in a reinforcement-learning\nproblem as a \u201ccommunicative act\u201d for the learner\u2019s benefit. Thus, an effective teacher will\nmodify its demonstrations when showing and not simply doing a task. As in Equation 4, we can\ndefine a teacher that selects trajectories that best convey the reward function:\n\nP_{\\text{Showing}}(j \\mid R) = \\frac{P_{\\text{Observing}}(R \\mid j)^{\\alpha}}{\\sum_{j'} P_{\\text{Observing}}(R \\mid j')^{\\alpha}}. (5)\n\nIn other words, showing depends on a demonstrator\u2019s inferences about an observer\u2019s inferences about\ndoing.\nThis model provides quantitative and qualitative predictions for how agents will show and teach how\nto do a task given they know its true reward function. 
Since humans are the paradigm teachers and a\npotential source of expert knowledge for artificial agents, we tested how well our model describes\nhuman teaching. In Experiment 1, we had people teach simple goal-based reward functions in a\ndiscrete MDP. Even though in these cases entering a goal is already highly diagnostic, different paths\nof different lengths are better for showing, which is reflected in human behavior. In Experiment\n2, people taught more complex feature-based reward functions by demonstration. In both studies,\npeople\u2019s behavior matched the qualitative predictions of our models.\n\n3 Experiment 1: Teaching Goal-based Reward Functions\n\nConsider a grid with three possible terminal goals as shown in Figure 1. If an agent\u2019s goal is &, it\ncould take a number of routes. For instance, it could move all the way right and then move upwards\ntowards the & (right-then-up) or first move upwards and then towards the right (up-then-right). But,\nwhat if the agent is not just doing the task, but also attempting to show it to an observer trying to learn\nthe goal location?\nWhen the goal is &, our pedagogical model predicts that up-then-right is the more probable trajectory\nbecause it is more disambiguating. Up-then-right better indicates that the intended goal is & than\nright-then-up because right-then-up has more actions consistent with the goal being #. We have\nincluded an analytic proof of why this is the case for a simpler setting in the supplementary materials.\nAdditionally, our pedagogical model makes the prediction that when trajectory length costs are\nnegligible, agents will engage in repetitive, inefficient behaviors that gesture towards one goal\nlocation over others. This \u201clooping\u201d behavior results when an agent can return to a state with an\naction that has high signaling value by taking actions that have a low signaling \u201ccost\u201d (i.e. they do not\nsignal something other than the true goal). 
Figure 1d shows an example of such a looping trajectory.\nIn Experiment 1, we tested whether people\u2019s showing behavior reflected the pedagogical model when\nreward functions are goal-based. If so, this would indicate that people choose the disambiguating\npath to a goal when showing.\n\n3.1 Experimental Design\n\nSixty Amazon Mechanical Turk participants performed the task in Figure 1. One was excluded\ndue to missing data. All participants completed a learning block in which they had to find the\nreward location without being told. Afterwards, they were either placed in a Do condition or a Show\ncondition. Participants in Do were told they would win a bonus based on the number of rewards\n(correct goals) they reached and were shown the text, \u201cThe reward is at location X\u201d, where X was\none of the three symbols %, #, or &. Those in Show were told they would win a bonus based on how\nwell a randomly matched partner who was shown their responses (and did not know the location of\nthe reward) did on the task. On each round of Show, participants were shown text saying \u201cShow your\npartner that the reward is at location X\u201d. All participants were given the same sequence of trials in\nwhich the reward locations were <%, &, #, &, %, #, %, #, &>.\n\nFigure 1: Experiment 1: Model predictions and participant trajectories for 3 trials when the goal is\n(a) &, (b) %, and (c) #. Model trajectories are the two with the highest probability (\u03bb = 2, \u03b1 = 1.0,\npmin = 10^\u22126, lmax = 4). Yellow numbers are counts of trajectories with the labeled tile as the\npenultimate state. (d) An example of looping behavior predicted by the model when % is the goal.\n\n3.2 Results\n\nAs predicted, Show participants tended to choose paths that disambiguated their goal as compared to\nDo participants. We coded the number of responses on & and % trials that were \u201cshowing\u201d trajectories\nbased on how they entered the goal (i.e. 
out of 3 for each goal). On & trials, entering from the left,\nand on % trials, entering from above were coded as \u201cshowing\u201d. We ran a 2x2 ANOVA with Show vs\nDo as a between-subjects factor and goal (% vs &) as a repeated measure. There was a main effect\nof condition (F(1, 57) = 16.17, p < .001; Show: M = 1.82, S.E. = 0.17; Do: M = 1.05, S.E. = 0.17) as\nwell as a main effect of goal (F(1, 57) = 4.77, p < .05; %-goal: M = 1.73, S.E. = 0.18; &-goal: M\n= 1.15, S.E. = 0.16). There was no interaction (F(1, 57) = 0.98, p = 0.32).\nThe model does not predict any difference between conditions for the # (lower right) goal. However,\na visual analysis suggested that more Show participants took a \u201cswerving\u201d path to reach #. This observation\nwas confirmed by looking at trials where # was the goal and comparing the number of swerving\ntrials, which was defined as making more than one change in direction (Show: M = 0.83, Do: M =\n0.26; two-sided t-test: t(44.2) = 2.18, p = 0.03). Although not predicted by the model, participants\nmay swerve to better signal their intention to move \u2018directly\u2019 towards the goal.\n\n3.3 Discussion\n\nReaching a goal is sufficient to indicate its location, but participants still chose paths that better\ndisambiguated their intended goal. Overall, these results indicate that people are sensitive to the\ndistinction between doing and showing, consistent with our computational framework.\n\n4 Experiment 2: Teaching Feature-based Reward Functions\n\nExperiment 1 showed that people choose disambiguating plans even when entering the goal makes\nthis seemingly unnecessary. However, one might expect richer showing behavior when teaching more\ncomplex reward functions. Thus, for Experiment 2, we developed a paradigm in which showing\nhow to do a task, as opposed to merely doing a task, makes a difference for how well the underlying\nreward function is learned. 
In particular, we focused on teaching feature-based reward functions that\nallow an agent to generalize what it has learned in one situation to a new situation. People often\nuse feature-based representations for generalization [3], and feature-based reward functions have\nbeen used extensively in reinforcement learning (e.g. [1]). We used a colored-tile grid task shown in\nFigure 2 to study teaching feature-based reward functions. White tiles are always \u201csafe\u201d (reward of 0),\nwhile yellow tiles are always terminal states that reward 10 points. The remaining 3 tile types\u2013orange,\npurple, and cyan\u2013are each either \u201csafe\u201d or \u201cdangerous\u201d (reward of \u22122). The rewards associated with\nthe three tile types are independent, and nothing about the tiles themselves signals that they are safe or\ndangerous.\n\nFigure 2: Experiment 2 results. (a) Column labels are reward function codes. They refer to which\ntiles were safe (o) and which were dangerous (x) with the ordering . Row 1:\nUnderlying reward functions that participants either did or showed; Row 2: Do participant trajectories\nwith visible tile colors; Row 3: Show participant trajectories; Row 4: Mean reward function learned\nfrom Do trajectories by Maximum-Likelihood Inverse Reinforcement Learning (MLIRL) [4, 12];\nRow 5: Mean reward function learned from Show trajectories by MLIRL. (b) Mean distance between\nlearned and true reward function weights for human-trained and model-trained MLIRL. For the\nmodels, MLIRL results for the top two ranked demonstration trajectories are shown.\n\nA standard planning algorithm will reach the terminal state in the most efficient and optimal manner.\nOur pedagogical model, however, predicts that an agent who is showing the task will engage in\nspecific behaviors that best disambiguate the true reward function. 
For instance, the pedagogical\nmodel is more likely to take a roundabout path that leads through all the safe tile types, choose\nto remain on a safe colored tile rather than go on the white tiles, or even loop repeatedly between\nmultiple safe tile-types. All of these types of behaviors send strong signals to the learner about which\ntiles are safe as well as which tiles are dangerous.\n\n4.1 Experimental Design\n\nSixty participants did a feature-based reward teaching task; two were excluded due to missing data.\nIn the first phase, all participants were given a learning-applying task. In the learning rounds, they\ninteracted with the grid shown in Figure 2 while receiving feedback on which tiles won or lost points.\nSafe tiles were worth 0 points, dangerous tiles were worth -2 points, and the terminal goal tile was\nworth 5 points. They also won an additional 5 points for each round completed for a total of 10\npoints. Each point was worth 2 cents of bonus. After each learning round, an applying round occurred\nin which they applied what they just learned about the tiles without receiving feedback in a new\ngrid configuration. They all played 8 pairs of learning and applying rounds corresponding to the\n8 possible assignments of \u201csafe\u201d and \u201cdangerous\u201d to the 3 tile types, and order was randomized\nbetween participants.\n\nFigure 3: Experiment 2 normalized median model fits.\n\nAs in Experiment 1, participants were then split into Do or Show conditions with no feedback.\nDo participants were told which colors were safe and won points for performing the task. Show\nparticipants still won points and were told which types were safe. They were also told that their\nbehavior would be shown to another person who would apply what they learned from watching the\nparticipant\u2019s behavior to a separate grid. The points won would be added to the demonstrator\u2019s bonus.\n\n4.2 Results\n\nResponses matched model predictions. 
Do participants simply took efficient routes, whereas Show\nparticipants took paths that signaled tile reward values. In particular, Show participants took paths\nthat led through multiple safe tile types, remained on safe colored tiles when safe non-colored tiles\nwere available, and looped at the boundaries of differently colored safe tiles.\n\n4.2.1 Model-based Analysis\n\nTo determine how well the two models predicted human behaviors globally, we fit separate models for\neach reward function and condition combination. We found parameters that had the highest median\nlikelihood out of the set of participant trajectories in a given reward function-condition combination.\nSince some participants used extremely long trajectories (e.g. >25 steps) and we wanted to include\nan analysis of all the data, we calculated best-fitting state-action policies. For the standard planner, it\nis straightforward to calculate a Boltzmann policy for a reward function given \u03bb.\nFor the pedagogical model, we first need to specify an initial model of doing and a distribution\nover a finite set of trajectories. We determine this initial set of trajectories and their probabilities\nusing three parameters: \u03bb, the softmax parameter for a hypothetical \u201cdoing\u201d agent that the model\nassumes the learner believes it is observing; lmax, the maximum trajectory length; and pmin, the\nminimum probability for a trajectory under the hypothetical doing agent. The pedagogical model\nthen uses an \u03b1 parameter that determines the degree to which the teacher is maximizing. 
State-action\nprobabilities are calculated from a distribution over trajectories using the equation\n\nP(a \\mid s, R) = \\sum_{j} P(a \\mid s, j) P(j \\mid R), \\quad \\text{where} \\quad P(a \\mid s, j) = \\frac{|\\{(s_t, a_t) \\in j : s_t = s, a_t = a\\}|}{|\\{(s_t, a_t) \\in j : s_t = s\\}|}.\n\nWe fit parameter values that produced the maximum median likelihood for each model for each\nreward function and condition combination. These parameters are reported in the supplementary\nmaterials. The normalized median fit for each of these models is plotted in Figure 3. As shown\nin the figure, the standard planning model better captures behavior in the Do condition, while the\npedagogical model better captures behavior in the Show condition. Importantly, even when the\nstandard planning model could have a high \u03bb and behave more randomly, the pedagogical model\nbetter fits the Show condition. This indicates that showing is not simply random behavior.\n\n4.2.2 Behavioral Analyses\n\nWe additionally analyzed specific behavioral differences between the Do and Show conditions\npredicted by the models. When showing a task, people visit a greater variety of safe tiles, visit tile\ntypes that the learner has uncertainty about (i.e. the colored tiles), and more frequently revisit states\nor \u201cloop\u201d in a manner that leads to better signaling. We found that all three of these behaviors were\nmore likely to occur in the Show condition than in the Do condition.\nTo measure the variety of tiles visited, we calculated the entropy of the frequency distribution over\ncolored-tile visits by round by participant. Average entropy was higher for Show (Show: M = 0.50,\nSE = 0.03; Do: M = 0.39, SE = 0.03; two-sided t-test: t(54.9) = \u22123.27, p < 0.01). When analyzing\ntime spent on colored as opposed to un-colored tiles, we calculated the proportion of visits to colored\ntiles after the first colored tile had been visited. 
Again, this measure was higher for Show (Show:\nM = 0.87, SE = 0.01; Do: M = 0.82, SE = 0.01; two-sided t-test: t(55.6) = \u22123.14, p < .01).\nFinally, we calculated the number of times states were revisited in the two conditions\u2013an indicator of\n\u201clooping\u201d\u2013and found that participants revisited states more in Show compared to Do (Show: M = 1.38,\nSE = 0.22; Do: M = 0.10, SE = 0.03; two-sided t-test: t(28.3) = \u22122.82, p < .01). There was no\ndifference between conditions in the total rewards won (two-sided t-test: t(46.2) = .026, p = 0.80).\n\n4.3 Teaching Maximum-Likelihood IRL\n\nOne reason to investigate showing is its potential for training artificial agents. Our pedagogical model\nmakes assumptions about the learner, but it may be that pedagogical trajectories are better even\nfor training off-the-shelf IRL algorithms. For instance, Maximum Likelihood IRL (MLIRL) is a\nstate-of-the-art IRL algorithm for inferring feature-based reward functions [4, 12]. Importantly, unlike\nthe discrete reward function space our showing model assumes, MLIRL estimates the maximum\nlikelihood reward function over a space of continuous feature weights using gradient ascent.\nTo test this possibility, we input human and model trajectories into MLIRL. We constrained non-goal feature\nweights to be non-positive. Overall, the algorithm was able to learn the true reward function better\nfrom showing than from doing trajectories, whether produced by the models or by participants (Figure 2).\n\n4.3.1 Discussion\n\nWhen learning a feature-based reward function from demonstration, it matters if the demonstrator\nis showing or doing. In this experiment, we showed that our model of pedagogical reasoning over\ntrajectories captures how people show how to do a task. 
When showing as opposed to simply doing,\ndemonstrators are more likely to visit a variety of states to show that they are safe, stay on otherwise\nambiguously safe tiles, and also engage in \u201clooping\u201d behavior to signal information about the tiles.\nMoreover, this type of teaching is even better at training standard IRL algorithms like MLIRL.\n\n5 General Discussion\n\nWe have presented a model of showing as Bayesian teaching. Our model makes accurate quantitative\nand qualitative predictions about human showing behavior, as demonstrated in two experiments.\nExperiment 1 showed that people modify their behavior to signal information about goals, while\nExperiment 2 investigated how people teach feature-based reward functions. Finally, we showed that\neven standard IRL algorithms benefit from showing as opposed to merely doing.\nThis provides a basis for future study into intentional teaching by demonstration. Future research\nmust explore showing in settings with even richer state features and whether more savvy observers\ncan leverage a showing agent\u2019s pedagogical intent for even better learning.\n\nAcknowledgments\n\nMKH was supported by the NSF GRFP under Grant No. DGE-1058262. JLA and MLL were\nsupported by DARPA SIMPLEX program Grant No. 14-46-FP-097. FC was supported by grant\nN00014-14-1-0800 from the Office of Naval Research.\n\nReferences\n\n[1] P. Abbeel and A. Y. Ng. Apprenticeship Learning via Inverse Reinforcement Learning. In\nProceedings of the Twenty-first International Conference on Machine Learning, ICML \u201904,\npages 1\u2013, New York, NY, USA, 2004. ACM.\n\n[2] B. D. Argall, S. Chernova, M. Veloso, and B. Browning. A survey of robot learning from\ndemonstration. Robotics and Autonomous Systems, 57(5):469\u2013483, May 2009.\n\n[3] J. L. Austerweil and T. L. Griffiths. A nonparametric Bayesian framework for constructing\nflexible feature representations. 
Psychological Review, 120(4):817\u2013851, 2013.\n\n[4] M. Babes, V. Marivate, K. Subramanian, and M. L. Littman. Apprenticeship learning about\nmultiple intentions. In Proceedings of the 28th International Conference on Machine Learning\n(ICML-11), pages 897\u2013904, 2011.\n\n[5] C. L. Baker, R. Saxe, and J. B. Tenenbaum. Action understanding as inverse planning. Cognition,\n113(3):329\u2013349, Dec. 2009.\n\n[6] D. Buchsbaum, A. Gopnik, T. L. Griffiths, and P. Shafto. Children\u2019s imitation of causal action\nsequences is influenced by statistical and pedagogical evidence. Cognition, 120(3):331\u2013340,\nSept. 2011.\n\n[7] L. P. Butler and E. M. Markman. Preschoolers use pedagogical cues to guide radical reorganization\nof category knowledge. Cognition, 130(1):116\u2013127, Jan. 2014.\n\n[8] M. Cakmak and M. Lopes. Algorithmic and Human Teaching of Sequential Decision Tasks. In\nAAAI Conference on Artificial Intelligence (AAAI-12), July 2012.\n\n[9] A. D. Dragan, K. C. T. Lee, and S. S. Srinivasa. Legibility and predictability of robot motion.\nIn 2013 8th ACM/IEEE International Conference on Human-Robot Interaction (HRI), pages\n301\u2013308, Mar. 2013.\n\n[10] M. C. Frank and N. D. Goodman. Predicting Pragmatic Reasoning in Language Games. Science,\n336(6084):998\u2013998, May 2012.\n\n[11] D. Hadfield-Menell, A. Dragan, P. Abbeel, and S. Russell. Cooperative Inverse Reinforcement\nLearning. In D. D. Lee, M. Sugiyama, U. von Luxburg, and I. Guyon, editors, Advances in\nNeural Information Processing Systems, volume 28. The MIT Press, 2016.\n\n[12] J. MacGlashan and M. L. Littman. Between imitation and intention learning. In Proceedings\nof the 24th International Conference on Artificial Intelligence, pages 3692\u20133698. AAAI Press,\n2015.\n\n[13] D. Ramachandran and E. Amir. Bayesian Inverse Reinforcement Learning. 
In Proceedings of\nthe 20th International Joint Conference on Artificial Intelligence, IJCAI\u201907, pages 2586\u20132591,\nSan Francisco, CA, USA, 2007. Morgan Kaufmann Publishers Inc.\n\n[14] P. Shafto, N. D. Goodman, and T. L. Griffiths. A rational account of pedagogical reasoning:\nTeaching by, and learning from, examples. Cognitive Psychology, 71:55\u201389, June 2014.\n\n[15] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.", "award": [], "sourceid": 1506, "authors": [{"given_name": "Mark", "family_name": "Ho", "institution": "Brown University"}, {"given_name": "Michael", "family_name": "Littman", "institution": "Brown University"}, {"given_name": "James", "family_name": "MacGlashan", "institution": "Brown University"}, {"given_name": "Fiery", "family_name": "Cushman", "institution": "Harvard University"}, {"given_name": "Joseph", "family_name": "Austerweil", "institution": "University of Wisconsin-Madison"}]}