{"title": "Policy Shaping: Integrating Human Feedback with Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 2625, "page_last": 2633, "abstract": "A long term goal of Interactive Reinforcement Learning is to incorporate non-expert human feedback to solve complex tasks. State-of-the-art methods have approached this problem by mapping human information to reward and value signals to indicate preferences and then iterating over them to compute the necessary control policy. In this paper we argue for an alternate, more effective characterization of human feedback: Policy Shaping. We introduce Advise, a Bayesian approach that attempts to maximize the information gained from human feedback by utilizing it as direct labels on the policy. We compare Advise to state-of-the-art approaches and highlight scenarios where it outperforms them and importantly is robust to infrequent and inconsistent human feedback.", "full_text": "Policy Shaping: Integrating Human Feedback\n\nwith Reinforcement Learning\n\nShane Griffith, Kaushik Subramanian, Jonathan Scholz, Charles L. Isbell, and Andrea Thomaz\n\nGeorgia Institute of Technology, Atlanta, GA 30332, USA\n\n{sgriffith7, kausubbu, jkscholz}@gatech.edu,\n\nCollege of Computing\n\n{isbell, athomaz}@cc.gatech.edu\n\nAbstract\n\nA long term goal of Interactive Reinforcement Learning is to incorporate non-\nexpert human feedback to solve complex tasks. Some state-of-the-art methods\nhave approached this problem by mapping human information to rewards and val-\nues and iterating over them to compute better control policies. In this paper we\nargue for an alternate, more effective characterization of human feedback: Policy\nShaping. 
We introduce Advise, a Bayesian approach that attempts to maximize\nthe information gained from human feedback by utilizing it as direct policy labels.\nWe compare Advise to state-of-the-art approaches and show that it can outperform\nthem and is robust to infrequent and inconsistent human feedback.\n\n1 Introduction\nA long\u2013term goal of machine learning is to create systems that can be interactively trained or guided\nby non-expert end-users. This paper focuses speci\ufb01cally on integrating human feedback with Re-\ninforcement Learning. One way to address this problem is to treat human feedback as a shaping\nreward [1\u20135]. Yet, recent papers have observed that a more effective use of human feedback is as\ndirect information about policies [6, 7]. Most techniques for learning from human feedback still,\nhowever, convert feedback signals into a reward or a value. In this paper we introduce Policy Shap-\ning, which formalizes the meaning of human feedback as policy feedback, and demonstrates how to\nuse it directly as policy advice. We also introduce Advise, an algorithm for estimating a human\u2019s\nBayes optimal feedback policy and a technique for combining this with the policy formed from the\nagent\u2019s direct experience in the environment (Bayesian Q-Learning).\n\nWe validate our approach using a series of experiments. These experiments use a simulated human\nteacher and allow us to systematically test performance under a variety of conditions of infrequent\nand inconsistent feedback. The results demonstrate two advantages of Advise: 1) it is able to outper-\nform state of the art techniques for integrating human feedback with Reinforcement Learning; and\n2) by formalizing human feedback, we avoid ad hoc parameter settings and are robust to infrequent\nand inconsistent feedback.\n2 Reinforcement Learning\nReinforcement Learning (RL) de\ufb01nes a class of algorithms for solving problems modeled as a\nMarkov Decision Process (MDP). 
An MDP is specified by the tuple (S, A, T, R, γ), which defines the set of possible world states, S, the set of actions available to the agent in each state, A, the transition function T : S × A → Pr[S], a reward function R : S × A → ℝ, and a discount factor 0 ≤ γ ≤ 1. The goal of a Reinforcement Learning algorithm is to identify a policy, π : S → A, which maximizes the expected reward from the environment. The reward function thus acts as the single source of information that tells an agent which policy is best for the MDP.

This paper uses an implementation of the Bayesian Q-learning (BQL) Reinforcement Learning algorithm [8], which is based on Watkins' Q-learning [9]. Q-learning is one way to find an optimal policy from the environment reward signal. The policy for the whole state space is iteratively refined by dynamically updating a table of Q-values. A specific Q-value, Q[s, a], represents a point estimate of the long-term expected discounted reward for taking action a in state s.

Rather than keep a point estimate of the long-term discounted reward for each state-action pair, Bayesian Q-learning maintains parameters that specify a normal distribution with unknown mean and precision for each Q-value. This representation has the advantage that it approximates the agent's uncertainty in the optimality of each action, which makes the problem of optimizing the exploration/exploitation trade-off straightforward. Because the Normal-Gamma (NG) distribution is the conjugate prior for the normal distribution, the mean and the precision are estimated using an NG distribution with hyperparameters ⟨μ0, λ, α, β⟩ maintained for each state-action pair (s, a). These values are updated each time an agent performs an action a in state s, accumulates reward r, and transitions to a new state s′. Details on how these parameters are updated can be found in [8].
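As an illustrative sketch (not the exact update rules of [8]; function names and the sample count are assumptions), the Normal-Gamma parameters can drive action selection by sampling a Q-value for each action and counting how often each action attains the argmax:

```python
import numpy as np

def sample_q(mu0, lam, alpha, beta, rng):
    """One draw of Q(s, a) from its Normal-Gamma posterior <mu0, lam, alpha, beta>."""
    tau = rng.gamma(alpha, 1.0 / beta)                # precision ~ Gamma(alpha, rate beta)
    return rng.normal(mu0, 1.0 / np.sqrt(lam * tau))  # mean ~ N(mu0, 1 / (lam * tau))

def prob_optimal(ng_params, n_samples=1000, seed=0):
    """Estimate P(a is optimal) for each action in a state by Monte Carlo counting."""
    rng = np.random.default_rng(seed)
    wins = np.zeros(len(ng_params))
    for _ in range(n_samples):
        draws = [sample_q(mu0, lam, alpha, beta, rng)
                 for (mu0, lam, alpha, beta) in ng_params]
        wins[np.argmax(draws)] += 1
    return wins / n_samples  # this vector defines pi_R(s, .)
```

An action whose posterior mean is clearly higher wins most of the sampled comparisons, so the counts approximate the probability each action is optimal.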
Because BQL is known to under-explore, β is updated as shown in [10] using an additional tuning parameter θ.
The NG distribution for each Q-value can be used to estimate the probability that each action a ∈ A_s in a state s is optimal, which defines a policy, πR, used for action selection. The optimal action can be estimated by sampling each Q̂(s, a) and taking the argmax. A large number of samples can be used to approximate the probability that an action is optimal by simply counting the number of times each action has the highest sampled Q-value [8].

3 Related Work
A key feature of Reinforcement Learning is the use of a reward signal. The reward signal can be modified to suit the addition of a new information source (this is known as reward shaping [11]). This is the most common way human feedback has been applied to RL [1–5]. However, several difficulties arise when integrating human feedback signals that may be infrequent or occasionally inconsistent with the optimal policy, violating the necessary and sufficient condition that a shaping function be potential-based [11]. Another difficulty is the ambiguity of translating a statement like "yes, that's right" or "no, that's wrong" into a reward. Typically, past attempts have been a manual process, yielding ad hoc approximations for specific domains. Researchers have also extended reward shaping to account for idiosyncrasies in human input, for example by adding a drift parameter to account for the human tendency to give less feedback over time [1, 12].

Advances in recent work sidestep some of these issues by showing that human feedback can instead be used as policy feedback. For example, Thomaz and Breazeal [6] added an UNDO function to the negative feedback signal, which forced an agent to backtrack to the previous state after its value update.
Work by Knox and Stone [7, 13] has shown that a general improvement to learning from human feedback is possible if it is used to directly modify the action selection mechanism of the Reinforcement Learning algorithm. Although both approaches use human feedback to modify an agent's exploration policy, they still treat human feedback as either a reward or a value. In our work, we assume human feedback is not an evaluative reward, but a label on the optimality of actions. The human's feedback thus makes a direct statement about the policy itself, rather than influencing the policy through a reward.

In other work, rather than supplying reward-shaping input, the human provides demonstrations of the optimal policy. Several papers have shown how the policy information in human demonstrations can be used for inverse optimal control [14, 15], to seed an agent's exploration [16, 17], and in some cases entirely in place of exploration [18, 19]. Our work similarly focuses on people's knowledge of the policy, but instead of requiring demonstrations we want to allow people to simply critique the agent's behavior ("that was right/wrong").

Our position that human feedback be used as direct policy advice is related to work in transfer learning [20, 21], in which an agent learns with "advice" about how it should behave. That advice is expressed as first-order logic rules and is supplied offline, rather than interactively during learning. Our approach requires only very high-level feedback (right/wrong), provided interactively.

4 Policy Shaping
In this section, we formulate human feedback as policy advice, and derive a Bayes optimal algorithm for converting that feedback into a policy. We also describe how to combine the feedback policy with the policy of an underlying Reinforcement Learning algorithm.
We call our approach Advise.

4.1 Model Parameters
We assume a scenario where the agent has access to communication from a human during its learning process. In addition to receiving environmental reward, the agent may receive a "right"/"wrong" label after performing an action. In related work, these labels are converted into shaping rewards (e.g., "right" becomes +1 and "wrong" −1), which are then used to modify Q-values, or to bias action selection. In contrast, we use this label directly to infer what the human believes is the optimal policy in the labeled state.

Using feedback in this way is not a trivial matter of pruning actions from the search tree. Feedback can be both inconsistent with the optimal policy and sparsely provided. Here, we assume a human providing feedback knows the right answer, but noise in the feedback channel introduces inconsistencies between what the human intends to communicate and what the agent observes. Thus, feedback is consistent with the optimal policy with probability C, where 0 < C < 1.¹
We also assume that a human watching an agent learn may not provide feedback after every single action; thus feedback arrives with likelihood L, where 0 < L < 1. In the event feedback is received, it is interpreted as a comment on the optimality of the action just performed. The issue of credit assignment that naturally arises when learning from real human feedback is left for future work (see [13] for an implementation of credit assignment in a different framework for learning from human feedback).

4.2 Estimating a Policy from Feedback
The human may know any number of different optimal actions in a state; even so, the probability that an action, a, in a particular state, s, is optimal is independent of what labels were provided to the other actions.
Consequently, the probability that s, a is optimal can be computed using only the "right" and "wrong" labels associated with it. We define Δs,a to be the difference between the number of "right" and "wrong" labels. The probability that s, a is optimal can then be obtained using the binomial distribution as:

    C^Δs,a / (C^Δs,a + (1 − C)^Δs,a).    (1)

Although many different actions may be optimal in a given state, we will assume for this paper that the human knows only one optimal action, which is the one they intend to communicate. In that case, an action, a, is optimal in state s only if no other action is optimal (i.e., whether it is optimal now also depends on the labels given to the other actions in the state). More formally:

    C^Δs,a (1 − C)^(Σ_{j≠a} Δs,j).    (2)

We take Equation 2 to be the probability of performing s, a according to the feedback policy, πF (i.e., the value of πF(s, a)). This is the Bayes optimal feedback policy given the "right" and "wrong" labels seen, the value of C, and the assumption that only one action is optimal per state. It is obtained by applying Bayes' rule in conjunction with the binomial distribution and enforcing the independence conditions that arise from the single-optimal-action assumption. A detailed derivation of these results is available in Appendix Sections A.1 and A.2.

4.3 Reconciling Policy Information from Multiple Sources
Because the use of Advise assumes an underlying Reinforcement Learning algorithm will also be used (e.g., here we use BQL), the policies derived from multiple information sources must be reconciled.
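To make the feedback policy concrete, the following sketch (helper names are assumptions, not the authors' code) computes πF from the per-action label counts Δs,a via Equation 2 and reconciles it with an underlying policy πR by elementwise multiplication:

```python
import numpy as np

def feedback_policy(delta, C):
    """Advise (Eq. 2): delta[a] is (#"right" - #"wrong") labels seen for action a;
    C is the assumed feedback consistency. Returns a normalized pi_F(s, .)."""
    delta = np.asarray(delta, dtype=float)
    # Unnormalized Eq. 2: C^delta_a * (1 - C)^(sum of delta_j over j != a)
    scores = C ** delta * (1.0 - C) ** (delta.sum() - delta)
    return scores / scores.sum()

def shaped_policy(pi_R, pi_F):
    """Reconcile the two sources: pi proportional to pi_R x pi_F."""
    p = np.asarray(pi_R) * np.asarray(pi_F)
    return p / p.sum()
```

Note that with C = 0.5 the feedback policy is uniform: labels from a teacher who is right only half the time carry no information.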
Although there is a chance, 1 − C, that the human errs whenever feedback is provided, given sufficient time, a feedback likelihood L > 0.0 and a feedback consistency C ≠ 0.5 mean the total amount of information received from the human should be enough for the agent to choose the optimal policy with probability 1.0. Of course, an agent will also be learning on its own at the same time and therefore may converge to its own optimal policy much sooner than it learns the human's policy. Before an agent is completely confident in either policy, however, it has to determine what action to perform using the policy information each provides.

¹Note that the consistency of feedback is not the same as the human's or the agent's confidence that the feedback is correct.

Figure 1: A snapshot of each domain used for the experiments. Pac-Man consisted of a 5x5 grid world with the yellow Pac-Man avatar, two white food pellets, and a blue ghost. Frogger consisted of a 4x4 grid world with the green Frogger avatar, two red cars, and two blue water hazards.

We combine the policies from multiple information sources by multiplying them together: π ∝ πR × πF. Multiplying distributions together is the Bayes optimal method for combining probabilities from (conditionally) independent sources [22], and has been used to solve other machine learning problems as well (e.g., [23]). Note that BQL can only approximately estimate the uncertainty that each action is optimal from the environment reward signal. Rather than use a different combination method to compensate for the fact that BQL converges too quickly, we introduce the exploration tuning parameter, θ, from [10], which can be manually tuned until BQL performs close to optimally.

5 Experimental Setup
We evaluate our approach using two game domains, Pac-Man and Frogger (see Fig.
1).

5.1 Pac-Man
Pac-Man consists of a 2-D grid with food, walls, ghosts, and the Pac-Man avatar. The goal is to eat all the food pellets while avoiding moving ghosts (+500). Points are also awarded for each food pellet eaten (+10). Points are taken away as time passes (-1) and for losing the game (-500). Our experiments used a 5 × 5 grid with two food pellets and one ghost. The action set consisted of the four primary Cartesian directions. The state representation included Pac-Man's position, the position and orientation of the ghost, and the presence of food pellets.

5.2 Frogger
Frogger consists of a 2-D map with moving cars, water hazards, and the Frogger avatar. The goal is to cross the road without being run over or jumping into a water hazard (+500). Points are lost as time passes (-1), for hopping into a water hazard (-500), and for being run over (-500). Our experiments used a 4 × 4 grid with two water hazards and two cars. Each car drives one space per time step, in one direction only; its placement and direction of motion are randomly determined at the start and do not change. As a car disappears off the end of the map, it reemerges at the beginning of the road and continues to move in the same direction. Each lane was limited to one car. The action set consisted of the four primary Cartesian directions and a stay-in-place action. The state representation included Frogger's position and the positions of the two cars.

5.3 Constructing an Oracle
We used a simulated oracle in place of human feedback because this allows us to systematically vary the parameters of feedback likelihood, L, and consistency, C, and to test different learning settings in which human feedback is less than ideal. The oracle was created before the experiments by hand-labeling the optimal actions in each state.
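A simulated teacher of this kind can be sketched as follows (function and parameter names are illustrative assumptions; L and C are the likelihood and consistency defined in Section 4.1, and the hand-built labeling supplies the optimal action):

```python
import random

def oracle_feedback(optimal_action, taken_action, L, C, rng=None):
    """Simulated teacher: comments on the action just taken with likelihood L;
    the label agrees with the truth with probability C (consistency).
    Returns +1 ("right"), -1 ("wrong"), or None (silence)."""
    rng = rng or random.Random(0)
    if rng.random() >= L:
        return None  # teacher stays silent this step
    correct = taken_action == optimal_action
    label = 1 if correct else -1
    return label if rng.random() < C else -label  # flip the label when inconsistent
```

Sweeping L and C in this sketch reproduces the idea behind the ideal, reduced-frequency, reduced-consistency, and moderate feedback settings used in the experiments.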
For states with multiple optimal actions, a small negative reward (-10) was added to the environment reward signal of the extra optimal state-action pairs to preserve the assumption that only one action is optimal in each state.

6 Experiments
6.1 A Comparison to the State of the Art
In this evaluation we compare Policy Shaping with Advise to the more traditional Reward Shaping, as well as to recent Interactive Reinforcement Learning techniques. Knox and Stone [7, 13] tried eight different strategies for combining feedback with an environmental reward signal and found that two strategies, Action Biasing and Control Sharing, consistently produced the best results. Both of these methods use human feedback rewards to modify the policy, rather than shape the MDP reward function. Thus, they still convert human feedback to a value but recognize that the information contained in that value is policy information. As will be seen, Advise performs similarly to these state-of-the-art methods, but is more robust to a noisy signal from the human and to other parameter changes.

                       Ideal Case            Reduced Consistency    Reduced Frequency     Moderate Case
                       (L = 1.0, C = 1.0)    (L = 1.0, C = 0.55)    (L = 0.1, C = 1.0)    (L = 0.5, C = 0.8)
                       Pac-Man    Frogger    Pac-Man     Frogger    Pac-Man    Frogger    Pac-Man     Frogger
BQL + Action Biasing   0.58±0.02  0.16±0.05  -0.33±0.17   0.05±0.06  0.16±0.04  0.04±0.06   0.25±0.04  0.09±0.06
BQL + Control Sharing  0.34±0.03  0.07±0.06  -2.87±0.12  -0.32±0.13  0.01±0.12  0.02±0.07  -0.18±0.19  0.01±0.07
BQL + Reward Shaping   0.54±0.02  0.11±0.07  -0.47±0.30   0.00±0.08  0.14±0.04  0.03±0.07   0.17±0.12  0.05±0.07
BQL + Advise           0.77±0.02  0.45±0.04  -0.01±0.11   0.02±0.07  0.21±0.05  0.16±0.06   0.13±0.08  0.22±0.06

Table 1: Comparing the learning rates of BQL + Advise to BQL + Action Biasing, BQL + Control Sharing, and BQL + Reward Shaping for four different combinations of feedback likelihood, L, and consistency, C, across two domains. Each entry is the average and standard deviation of the cumulative reward over 300 episodes, expressed as a percentage of the maximum possible cumulative reward for the domain, relative to the BQL baseline. Negative values indicate performance worse than the baseline.

Action Biasing uses human feedback to bias the action selection mechanism of the underlying RL algorithm. Positive and negative feedback is converted to a reward rh and −rh, respectively. A table of values, H[s, a], stores the accumulated feedback signal for s, a. The modified action selection mechanism is given as argmax_a Q̂(s, a) + B[s, a] · H[s, a], where Q̂(s, a) is the estimate of the long-term expected discounted reward for s, a from BQL, and B[s, a] controls the influence of feedback on learning. The value of B[s, a] is incremented by a constant b when feedback is received for s, a, and is decayed by a constant d at all other time steps.

Control Sharing instead modifies the action selection mechanism directly by adding a transition between 1) the action that gains the agent the maximum known reward according to feedback, and 2) the policy produced using the original action selection method. The transition is defined by the probability P(a = argmax_a H[s, a]) = min(B[s, a], 1.0). An agent transfers control to the feedback policy as feedback is received, and begins to switch control back to the underlying RL algorithm as B[s, a] decays.
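A minimal sketch of these two baselines, under assumed array shapes for a single state and the b, d influence update just described (names are illustrative):

```python
import numpy as np

def action_biasing(q_hat, H, B):
    """Action Biasing: argmax_a  q_hat[a] + B[a] * H[a]  for the current state."""
    return int(np.argmax(q_hat + B * H))

def control_sharing(rl_action, H, B, rng):
    """Control Sharing: with probability min(B[a*], 1) take the feedback-preferred
    action a* = argmax_a H[a]; otherwise defer to the underlying RL policy."""
    a_star = int(np.argmax(H))
    return a_star if rng.random() < min(B[a_star], 1.0) else rl_action

def update_influence(B, feedback_action=None, b=1.0, d=0.001):
    """B[a] is incremented by b when feedback arrives for a, decayed by d otherwise."""
    B = B.copy()
    if feedback_action is not None:
        B[feedback_action] += b
    else:
        B = np.maximum(B - d, 0.0)
    return B
```

The sketch makes the dependence on hand-tuned constants explicit: rh enters through the magnitudes stored in H, and b, d govern how long feedback keeps its influence.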
Although feedback is initially interpreted as a reward, Control Sharing does not use that information, and thus it is unaffected if the value of rh is changed.
Reward Shaping, the traditional approach to learning from feedback, works by modifying the MDP reward. Feedback is first converted into a reward, rh or −rh. The modified MDP reward function is R′(s, a) ← R(s, a) + B[s, a] · H[s, a]. The values of B[s, a] and H[s, a] are updated as above.
The parameters of each method were manually tuned before the experiments to maximize learning performance. We initialized the BQL hyperparameters to ⟨μ0 = 0, λ = 0.01, α = 1000, β = 0.0000⟩ for every state-action pair, which resulted in random initial Q-values. We set the BQL exploration parameter θ = 0.5 for Pac-Man and θ = 0.0001 for Frogger. We used a discount factor of γ = 0.99. Action Biasing, Control Sharing, and Reward Shaping used a feedback influence of b = 1 and a decay factor of d = 0.001. We set rh = 100 for Action Biasing in both domains. For Reward Shaping we set rh = 100 in Pac-Man and rh = 1 in Frogger.²
We compared the methods using four different combinations of feedback likelihood, L, and consistency, C, in Pac-Man and Frogger, for a total of eight experiments. Table 1 summarizes the quantitative results. Fig. 2 shows the learning curves for four cases.
In the ideal case of frequent and correct feedback (L = 1.0; C = 1.0), we see in Fig. 2 that Advise does much better than the other methods early in the learning process. A human reward that does not match both the feedback consistency and the domain may fail to eliminate unnecessary exploration and can produce learning rates similar to or worse than the baseline. Advise avoided these issues by not converting feedback into a reward.

The remaining three graphs in Fig.
2 show one example from each of the non-ideal conditions that we tested: reduced feedback consistency (L = 1.0; C = 0.55), reduced frequency (L = 0.1; C = 1.0), and a case that we call moderate (L = 0.5; C = 0.8).

²We used the conversion rh = 1, 10, 100, or 1000 that maximized MDP reward in the ideal case to also evaluate the three cases of non-ideal feedback.

Figure 2: Learning curves for each method in four different cases (panels: Frogger – Ideal Case (L = 1.0; C = 1.0); Frogger – Reduced Consistency (L = 1.0; C = 0.55); Pac-Man – Reduced Frequency (L = 0.1; C = 1.0); Pac-Man – Moderate Case (L = 0.5; C = 0.8)). Each line is the average with standard error bars of 500 separate runs to a duration of 300 episodes. The Bayesian Q-learning baseline (blue) is shown for reference.

Action Biasing and Reward Shaping³ performed comparably to Advise in two cases. Action Biasing does better than Advise in one case in part because the feedback likelihood is high enough to counter Action Biasing's overly influential feedback policy.
This gives the agent an extra push toward the goal without becoming detrimental to learning (e.g., by causing loops). In its current form, Advise makes no assumptions about the likelihood that the human will provide feedback.
The cumulative reward numbers in Table 1 show that Advise always performed near or above the BQL baseline, which indicates robustness to reduced feedback frequency and consistency. In contrast, Action Biasing, Control Sharing, and Reward Shaping blocked learning progress in several cases with reduced consistency (the most extreme example is seen in column 3 of Table 1). Control Sharing performed worse than the baseline in three cases; Action Biasing and Reward Shaping each performed worse than the baseline in one case.
Thus, having a prior estimate of the feedback consistency (the value of C) allows Advise to balance what it learns from the human appropriately against its own learned policy. We could have provided the known value of C to the other methods, but doing so would not have helped set rh, b, or d. These parameters had to be tuned because they only loosely correspond to C. We manually selected their values in the ideal case, and then used these same settings for the other cases. However, different values of rh, b, and d may produce better results in the cases with reduced L or C. We tested this in our next experiment.

6.2 How the Reward Parameter Affects Action Biasing
In contrast to Advise, Action Biasing and Control Sharing do not use an explicit model of the feedback consistency. The optimal values of rh, b, and d for learning with consistent feedback may be the wrong values to use for learning with inconsistent feedback. Here, we test how Action Biasing performs with a range of values of rh for the case of moderate feedback (L = 0.5 and C = 0.8), and for the case of reduced consistency (L = 1.0 and C = 0.55). Control Sharing was left out of this evaluation because changing rh did not affect its learning rate.
Reward Shaping was left out of this evaluation due to the problems mentioned in Section 6.1. The conversion from feedback into reward was set to either rh = 500 or 1000; using rh = 0 is equivalent to the BQL baseline.
The results in Fig. 3 show that a large value of rh is appropriate for more consistent feedback, while a small value of rh is best for reduced consistency. This is clearest in Pac-Man, where a reward of rh = 1000 led to better-than-baseline learning performance in the moderate feedback case, but decreased learning rates to dramatically below the baseline in the reduced consistency case. A reward of zero produced the best results in the reduced consistency case. Therefore, the right rh depends on the feedback consistency.

³The results with Reward Shaping are misleading because it can end up in infinite loops when feedback is infrequent or inconsistent with the optimal policy. In Frogger we had this problem for rh > 1.0, which forced us to use rh = 1.0. This was not a problem in Pac-Man because the ghost can drive Pac-Man around the map; instead of roaming the map on its own, Pac-Man oscillated between adjacent cells until the ghost approached.

Figure 3: How different feedback reward values affected BQL + Action Biasing. Each line shows the average and standard error of 500 learning curves over a duration of 300 episodes. Reward values of rh = 0, 500, and 1000 were used for the experiments. Results were computed for the moderate feedback case (L = 0.5; C = 0.8) and the reduced consistency case (L = 1.0; C = 0.55).

This experiment also shows that the best value of rh is somewhat robust to a slightly reduced consistency. A value of either rh = 500 or 1000, in addition to rh = 100 (see Fig. 2.d), can produce good results with moderate feedback in both Pac-Man and Frogger. The use of the human influence parameter B[s, a] to modulate the value of rh is presumably meant to help make Action Biasing more robust to reduced consistency. The value of B[s, a] is, however, increased by b whenever feedback is received, and reduced by d over time; b and d are more a function of the domain than of the information in the accumulated feedback.
Our next experiment demonstrates why this is bad for Interactive Reinforcement Learning.

6.3 How Domain Size Affects Learning
Action Biasing, Control Sharing, and Reward Shaping use a 'human influence' parameter, B[s, a], that is a function of the domain size more than of the amount of information in the accumulated feedback. To show this, we held the parameter values constant and tested how the algorithms performed in a larger domain. Frogger was increased to a 6 × 6 grid with four cars (see Fig. 4). An oracle was created automatically by running BQL to 50,000 episodes 500 times, and then choosing, for each state, the action with the highest value. The oracle provided moderate feedback (L = 0.5; C = 0.8) for the 33,360 different states that were identified in this process.
Figure 4 shows the results. Whereas Advise still has a learning curve above the BQL baseline (as it did in the smaller Frogger domain; see the last column in Table 1), Action Biasing, Control Sharing, and Reward Shaping all had a negligible effect on learning, performing very similarly to the BQL baseline. In order for those methods to perform as well as they did in the smaller version of Frogger, the value of B[s, a] needs to be set higher and decayed more slowly by manually finding new values for b and d. Thus, like rh, the optimal values of b and d depend on both the domain and the quality of feedback. In contrast, the estimated feedback consistency, Ĉ, used by Advise depends only on the true feedback consistency, C. For comparison, we next show how sensitive Advise is to a suboptimal estimate of C.

6.4 Using an Inaccurate Estimate of Feedback Consistency
Interacting with a real human means that in most cases Advise will not have an exact estimate, Ĉ, of the true feedback consistency, C. It is presumably possible to identify a value for Ĉ that is close to the true value. Any deviation from the true value, however, may be detrimental to learning.
This experiment shows how an inaccurate estimate of C affects the learning rate of Advise. Feedback was generated with likelihood L = 0.5 and a true consistency of C = 0.8. The estimated consistency was either Ĉ = 1.0, 0.8, or 0.55.
The results are shown in Fig. 5. In both Pac-Man and Frogger, using Ĉ = 0.55 reduced the effectiveness of Advise. The learning curves are similar to the baseline BQL learning curves because using an estimate of C near 0.5 is equivalent to not using feedback at all. In general, values of Ĉ below C decreased the possible gains from feedback. In contrast, using an overestimate of C boosted learning rates for these particular domains and this case of feedback quality. In general, however, overestimating C can lead to a suboptimal policy, especially if feedback is provided very infrequently. Therefore, it is desirable to use as close an overestimate of the true value, C, as possible.

7 Discussion
Overall, our experiments indicate that it is useful to interpret feedback as a direct comment on the optimality of an action, without converting it into a reward or a value. Advise was able to outperform tuned versions of Action Biasing, Control Sharing, and Reward Shaping.
The performance of Action Biasing and Control Sharing was not as good as that of Advise in many cases (as shown in Table 1) because they use feedback as policy information only after it has been converted into a reward.

Figure 5: The effect of over- and underestimating the true feedback consistency, C, on BQL + Advise in the case of moderate feedback (L = 0.5, C = 0.8). A line shows the average and standard error of 500 learning curves over a duration of 300 episodes. (Panels: Pac-Man and Frogger; axes: Average Reward vs. Number of Episodes; estimated Ĉ ∈ {1.0, 0.8, 0.55}.)

Figure 4: The larger Frogger domain and the corresponding learning results for the case of moderate feedback (L = 0.5; C = 0.8). Each line shows the average and standard error of 160 learning curves over a duration of 50,000 episodes. (Axes: Average Reward vs. Number of Episodes; curves: BQL, BQL + A.B., BQL + C.S., BQL + R.S., BQL + Advise.)

Action Biasing, Control Sharing, and Reward Shaping suffer because their use of ‘human influence’ parameters is disconnected from the amount of information in the accumulated feedback. Although b and d were empirically optimized before the experiments, the optimal values of those parameters depend on the convergence time of the underlying RL algorithm. If the size of the domain increased, for example, B[s, a] would have to be decayed more slowly because the number of episodes required for BQL to converge would increase.
Otherwise, Action Biasing, Control Sharing, and Reward Shaping would have a negligible effect on learning. Control Sharing is especially sensitive to how well the value of the feedback influence parameter, B[s, a], approximates the amount of information in both policies. Its performance bottomed out in some cases with infrequent and inconsistent feedback because B[s, a] overestimated the amount of information in the feedback policy. However, even if B[s, a] is set in proportion to the exact probability of the correctness of each policy (i.e., calculated using Advise), Control Sharing does not allow an agent to simultaneously utilize information from both sources.

Advise has only one input parameter, the estimated feedback consistency, Ĉ, in contrast to the three parameters those methods require. Ĉ is a fundamental parameter that depends only on the true feedback consistency, C, and does not change if the domain size is increased. When it has the right value for Ĉ, Advise represents the exact amount of information in the accumulated feedback in each state, and then combines it with the BQL policy using an amount of influence equivalent to the amount of information in each policy. These advantages help make Advise robust to infrequent and inconsistent feedback, and fare well with an inaccurate estimate of C.

A primary direction for future work is to investigate how to estimate Ĉ during learning, because a static model of C may be insufficient for learning from real humans. An alternative approach is to compute Ĉ online as a human interacts with an agent. We are also interested in addressing other aspects of human feedback, such as errors in credit assignment. A good place to start is the approach described in [13], which is based on using gamma distributions. Another direction is to investigate Advise for knowledge transfer in a sequence of reinforcement learning tasks (cf. [24]).
With these extensions, Advise may be especially suitable for learning from humans in real-world settings.

8 Conclusion
This paper defined the Policy Shaping paradigm for integrating feedback with Reinforcement Learning. We introduced Advise, which tries to maximize the utility of feedback using a Bayesian approach to learning. Advise produced results on par with or better than the current state-of-the-art Interactive Reinforcement Learning techniques, remained effective in scenarios where those approaches fail, and demonstrated robustness to infrequent and inconsistent feedback. With these advancements, this paper may help make learning from human feedback an increasingly viable option for intelligent systems.

Acknowledgments

The first author was partly supported by a National Science Foundation Graduate Research Fellowship. This research is funded by the Office of Naval Research under grant N00014-14-1-0003.

References
[1] C. L. Isbell, C. Shelton, M. Kearns, S. Singh, and P. Stone, “A social reinforcement learning agent,” in Proc. of the 5th Intl. Conf. on Autonomous Agents, pp. 377–384, 2001.
[2] H. S. Chang, “Reinforcement learning with supervision by combining multiple learnings and expert advices,” in Proc. of the American Control Conference, 2006.
[3] W. B. Knox and P. Stone, “TAMER: Training an agent manually via evaluative reinforcement,” in Proc. of the 7th IEEE ICDL, pp. 292–297, 2008.
[4] A. Tenorio-Gonzalez, E. Morales, and L. Villaseñor-Pineda, “Dynamic reward shaping: training a robot by voice,” in Advances in Artificial Intelligence–IBERAMIA, pp. 483–492, 2010.
[5] P. M. Pilarski, M. R. Dawson, T. Degris, F. Fahimi, J. P. Carey, and R. S. Sutton, “Online human training of a myoelectric prosthesis controller via actor-critic reinforcement learning,” in Proc. of the IEEE ICORR, pp.
1–7, 2011.
[6] A. L. Thomaz and C. Breazeal, “Teachable robots: Understanding human teaching behavior to build more effective robot learners,” Artificial Intelligence, vol. 172, no. 6-7, pp. 716–737, 2008.
[7] W. B. Knox and P. Stone, “Combining manual feedback with subsequent MDP reward signals for reinforcement learning,” in Proc. of the 9th Intl. Conf. on AAMAS, pp. 5–12, 2010.
[8] R. Dearden, N. Friedman, and S. Russell, “Bayesian Q-learning,” in Proc. of the 15th AAAI, pp. 761–768, 1998.
[9] C. Watkins and P. Dayan, “Q-learning: Technical note,” Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.
[10] T. Matthews, S. D. Ramchurn, and G. Chalkiadakis, “Competing with humans at fantasy football: Team formation in large partially-observable domains,” in Proc. of the 26th AAAI, pp. 1394–1400, 2012.
[11] A. Y. Ng, D. Harada, and S. Russell, “Policy invariance under reward transformations: Theory and application to reward shaping,” in Proc. of the 16th ICML, pp. 341–348, 1999.
[12] C. L. Isbell, M. Kearns, S. Singh, C. R. Shelton, P. Stone, and D. Kormann, “Cobot in LambdaMOO: An Adaptive Social Statistics Agent,” JAAMAS, vol. 13, no. 3, pp. 327–354, 2006.
[13] W. B. Knox and P. Stone, “Reinforcement learning from simultaneous human and MDP reward,” in Proc. of the 11th Intl. Conf. on AAMAS, pp. 475–482, 2012.
[14] A. Y. Ng and S. Russell, “Algorithms for inverse reinforcement learning,” in Proc. of the 17th ICML, 2000.
[15] P. Abbeel and A. Y. Ng, “Apprenticeship learning via inverse reinforcement learning,” in Proc. of the 21st ICML, 2004.
[16] C. Atkeson and S. Schaal, “Learning tasks from a single demonstration,” in Proc. of the IEEE ICRA, pp. 1706–1712, 1997.
[17] M. Taylor, H. B. Suay, and S.
Chernova, “Integrating reinforcement learning with human demonstrations of varying ability,” in Proc. of the Intl. Conf. on AAMAS, pp. 617–624, 2011.
[18] L. P. Kaelbling, M. L. Littman, and A. W. Moore, “Reinforcement learning: A survey,” JAIR, vol. 4, pp. 237–285, 1996.
[19] W. D. Smart and L. P. Kaelbling, “Effective reinforcement learning for mobile robots,” 2002.
[20] R. Maclin and J. W. Shavlik, “Creating advice-taking reinforcement learners,” Machine Learning, vol. 22, no. 1-3, pp. 251–281, 1996.
[21] L. Torrey, J. Shavlik, T. Walker, and R. Maclin, “Transfer learning via advice taking,” in Advances in Machine Learning I, Studies in Computational Intelligence (J. Koronacki, S. Wirzchon, Z. Ras, and J. Kacprzyk, eds.), vol. 262, pp. 147–170, Springer Berlin Heidelberg, 2010.
[22] C. Bailer-Jones and K. Smith, “Combining probabilities.” GAIA-C8-TN-MPIA-CBJ-053, 2011.
[23] M. L. Littman, G. A. Keim, and N. Shazeer, “A probabilistic approach to solving crossword puzzles,” Artificial Intelligence, vol. 134, no. 1-2, pp. 23–55, 2002.
[24] G. Konidaris and A. Barto, “Autonomous shaping: Knowledge transfer in reinforcement learning,” in Proc. of the 23rd ICML, pp. 489–496, 2006.