{"title": "Balancing Multiple Sources of Reward in Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1082, "page_last": 1088, "abstract": null, "full_text": "Balancing Multiple Sources of Reward in Reinforcement Learning \n\nChristian R. Shelton \nArtificial Intelligence Lab \nMassachusetts Institute of Technology \nCambridge, MA 02139 \ncshelton@ai.mit.edu \n\nAbstract \n\nFor many problems which would be natural for reinforcement learning, the reward signal is not a single scalar value but has multiple scalar components. Examples of such problems include agents with multiple goals and agents with multiple users. Creating a single reward value by combining the multiple components can throw away vital information and can lead to incorrect solutions. We describe the multiple-reward-source problem and discuss the difficulties of applying traditional reinforcement learning. We then present a new algorithm for finding a solution and results on simulated environments. \n\n1 Introduction \n\nIn the traditional reinforcement learning framework, the learning agent is given a single scalar value of reward at each time step. The goal is for the agent to optimize the sum of these rewards over time (the return). For many applications, however, more information is available. \n\nConsider the case of a home entertainment system designed to sense which residents are currently in the room and automatically select a television program to suit their tastes. We might construct the reward signal to be the total number of people paying attention to the system. However, a reward signal of 2 ignores important information about which two users are watching. The users of the system change as people leave and enter the room. We could, in theory, learn the relationship among the users present, who is watching, and the reward. 
In general, it is better to use the domain knowledge we have instead of requiring the system to learn it: we know which users are contributing to the reward and that only present users can contribute. \n\nIn other cases, the multiple sources aren't users but goals. For elevator scheduling we might be trading off people serviced per minute against average waiting time. For financial portfolio management, we might be weighing profit against risk. In these cases, we may wish to change the weighting over time. To keep from having to relearn the solution from scratch each time the weighting is changed, we need to keep track of which rewards to attribute to which goals. \n\nThere is a separate difficulty if the rewards are not designed functions of the state but rather are given by other agents or people in the environment. Consider the entertainment system above, but where every resident has a dial with which they can give the system feedback or reward. The rewards are incomparable: one user may decide to reward the system with values twice as large as those of another, which should not result in that user having twice the control over the entertainment. This is not limited to scalings but includes any other monotonic transform of the returns. If the users of the system know they are training it, they will employ all kinds of reward strategies to try to steer the system to the desired behavior [2]. By keeping track of the sources of the rewards, we will derive an algorithm to overcome these difficulties. \n\n1.1 Related Work \n\nThe work presented here is related to recent work on multiagent reinforcement learning [1, 4, 5, 7] in that multiple reward signals are present and game theory provides a solution. This work is different in that it attacks a simpler problem where the computation is consolidated on a single agent. 
Work on multiple goals (see [3, 8] as examples) is also related but assumes either that the returns of the goals are to be linearly combined into an overall value function or that only one goal is to be solved at a time. \n\n1.2 Problem Setup \n\nWe will be working with partially observable environments with discrete actions and discrete observations. We make no assumptions about the world model and thus do not use belief states. x(t) and a(t) are the observation and action, respectively, at time t. We consider only reactive policies (although the observations could be expanded to include history). π(x, a) is the policy, the probability the agent will take action a when observing x. At each time step, the agent receives a set of rewards (one for each source in the environment); r_s(t) is the reward at time t from source s. We use the average reward formulation, and so \n\nR_s^π = lim_{n→∞} (1/n) E[r_s(1) + r_s(2) + ... + r_s(n) | π] \n\nis the expected return from source s for following policy π. It is this return that we want to maximize for each source. \n\nWe will also assume that the algorithm knows the set of sources present at each time step. Sources which are not present provide a constant reward, regardless of the state or action, which we will assume to be zero. All sums over sources will be assumed to be taken over only the present sources. \n\nThe goal is to produce an algorithm that constructs a policy based on previous experience and the sources present. The agent's experience takes the form of prior interactions with the world: each experience is a sequence of observation, action, and reward triplets for a particular run of a particular policy. \n\n2 Balancing Multiple Rewards \n\n2.1 Policy Votes \n\nIf rewards are not directly comparable, we need to find a property of the sources which is comparable and a metric to optimize. 
We begin by noting that we want to limit the amount of control any given source has over the behavior of the agent. To that end, we construct the policy as the average of a set of votes, one for each source present. The votes for a source must sum to 1 and must all be non-negative (thus giving each source an equal \"say\" in the agent's policy). We will first consider restricting the rewards from a given source to only affect the votes for that source. \n\nThe form for the policy is therefore \n\nπ(x, a) = ( Σ_s α_s(x) v_s(x, a) ) / ( Σ_s α_s(x) ) (1) \n\nwhere, for each present source s, Σ_x α_s(x) = 1, α_s(x) ≥ 0 for all x, Σ_a v_s(x, a) = 1 for all x, and v_s(x, a) ≥ 0 for all x and a. We have broken apart the vote from a source into two parts, α and v. α_s(x) is how much effort source s is putting into affecting the policy for observation x. v_s(x, a) is the vote by source s for the policy for observation x. Mathematically this is the same as constructing a single vote (v'_s(x, a) = α_s(x) v_s(x, a)), but we find α and v to be more interpretable. \n\nWe have constrained the total effort and vote any one source can apply. Unfortunately, these votes are not quite the correct parameters for our policy: they are not invariant to the other sources present. To illustrate this, consider the example of a single state with two actions, two sources, and a learning agent with the voting method from above. If s_1 prefers only a_1 and s_2 likes an equal mix of a_1 and a_2, the agent will learn a vote of (1, 0) for s_1, and s_2 can reward the agent to cause it to learn a vote of (0, 1) for s_2, resulting in a policy of (0.5, 0.5). Whether this is the correct final policy depends on the problem definition. However, the real problem arises when we consider what happens if s_1 is removed. The policy reverts to (0, 1), which is far from (0.5, 0.5), the desired policy of s_2 (the only present source). Clearly, the learned votes for s_2 are meaningless when s_1 is not present. 
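The vote-averaged policy of equation 1, and the single-state example above, can be sketched in a few lines. This is a minimal illustration; the dictionary layout and names (alpha, v, present) are our own choices, not from the paper.

```python
# Minimal sketch of the vote-averaged policy (equation 1). The data layout
# and names (alpha, v, present) are illustrative, not from the paper.

def policy(alpha, v, present, x, actions):
    # pi(x, a) = sum_s alpha_s(x) v_s(x, a) / sum_s alpha_s(x), present sources only
    total = sum(alpha[s][x] for s in present)
    if total == 0.0:
        # no present source puts effort on x: fall back to a uniform policy
        return {a: 1.0 / len(actions) for a in actions}
    return {a: sum(alpha[s][x] * v[s][x][a] for s in present) / total
            for a in actions}

# The single-state example from the text: s1 votes (1, 0), s2 votes (0, 1).
alpha = {'s1': {0: 1.0}, 's2': {0: 1.0}}
v = {'s1': {0: {'a1': 1.0, 'a2': 0.0}},
     's2': {0: {'a1': 0.0, 'a2': 1.0}}}
print(policy(alpha, v, ['s1', 's2'], 0, ['a1', 'a2']))  # {'a1': 0.5, 'a2': 0.5}
print(policy(alpha, v, ['s2'], 0, ['a1', 'a2']))        # {'a1': 0.0, 'a2': 1.0}
```

With both sources present the votes average to the uniform policy; dropping s1 shifts the policy to s2's learned vote of (0, 1), mirroring the problem described above.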
\n\nThus, while the voting scheme does limit the control each present source has over the agent, it does not provide a description of the source's preferences which would allow for the removal or addition (or reweighting) of sources. \n\n2.2 Returns as Preferences \n\nWhile rewards (or returns) are not comparable across sources, they are comparable within a source. In particular, we know that if R_s^{π_1} > R_s^{π_2}, then source s prefers policy π_1 to policy π_2. We do not know how to weigh that preference against a different source's preference, so an explicit tradeoff is still impossible, but we can limit (using the voting scheme of equation 1) how much one source's preference can override another's. \n\nWe allow a source's preference for a change to prevail insofar as its votes are sufficient to effect the change in the presence of the other sources' votes. We have a type of general-sum game (letting the sources be the players, in game-theory jargon). The value to source s' of the set of all sources' votes is R_{s'}^π, where π is the function of the votes defined in equation 1. Each source s' would like to set its particular votes, α_{s'}(x) and v_{s'}(x, a), to maximize its value (or return). Our algorithm will set each source's vote in this way, thus ensuring that no source could do better by \"lying\" about its true reward function. \n\nIn game theory, a \"solution\" to such a game is called a Nash Equilibrium [6], a point at which each player (source) is playing (voting) its best response to the other players. At a Nash Equilibrium, no single player can change its play and achieve a gain. Because the votes are real-valued, we are looking for the equilibrium of a continuous game. We will derive a fictitious-play algorithm to find an equilibrium for this game. 
\n\n3 Multiple Reward Source Algorithm \n\n3.1 Return Parameterization \n\nIn order to apply the ideas of the previous section, we must find a method for finding a Nash Equilibrium. To do that, we will pick a parametric form for R̂_s^π (the estimate of the return): linear in the KL-divergence between a target vote and π. Letting a_s, b_s, β_s(x), and ρ_s(x, a) be the parameters of R̂_s^π, \n\nR̂_s^π = -a_s Σ_x β_s(x) Σ_a ρ_s(x, a) log( ρ_s(x, a) / π(x, a) ) + b_s (2) \n\nwhere a_s ≥ 0, β_s(x) ≥ 0, Σ_x β_s(x) = 1, ρ_s(x, a) ≥ 0, and Σ_a ρ_s(x, a) = 1. Just as α_s(x) was the amount of vote source s was putting towards the policy for observation x, β_s(x) is the importance for source s of the policy for observation x. And, while v_s(x, a) was the policy vote for observation x for source s, ρ_s(x, a) is the preferred policy for observation x for source s. The constants a_s and b_s allow for scaling and translation of the return. \n\nIf we let ρ'_s(x, a) = a_s β_s(x) ρ_s(x, a), then, given experiences of different policies and their empirical returns, we can estimate ρ'_s(x, a) using linear least-squares. Imposing the constraints just involves finding the normal least-squares fit with the constraint that all ρ'_s(x, a) be non-negative. From ρ'_s(x, a) we can calculate a_s = Σ_{x,a} ρ'_s(x, a), β_s(x) = (1/a_s) Σ_a ρ'_s(x, a), and ρ_s(x, a) = ρ'_s(x, a) / Σ_{a'} ρ'_s(x, a'). We now have a method for solving for R̂_s^π given experience. We now need to find a way to compute the agent's policy. \n\n3.2 Best Response Algorithm \n\nTo produce an algorithm for finding a Nash Equilibrium, let us first derive an algorithm for finding the best response for source s to a set of votes. We need to find the set of α_s(x) and v_s(x, a) that satisfy the constraints on the votes and maximize equation 2, which is the same as minimizing \n\n-Σ_x β_s(x) Σ_a ρ_s(x, a) log( ( Σ_{s'} α_{s'}(x) v_{s'}(x, a) ) / ( Σ_{s'} α_{s'}(x) ) ) (3) \n\nover α_s(x) and v_s(x, a) for given s, because the other terms depend on neither α_s(x) nor v_s(x, a). \n\nTo minimize equation 3, let's first fix the α-values and optimize v_s(x, a). We will ignore the non-negativity constraints on v_s(x, a) and just impose the constraint that Σ_a v_s(x, a) = 1. The solution, whose derivation is simple and omitted due to space, is \n\nv_s(x, a) = ( ρ_s(x, a) Σ_{s'} α_{s'}(x) - Σ_{s'≠s} α_{s'}(x) v_{s'}(x, a) ) / α_s(x). (4) \n\nWe impose the non-negativity constraints by setting to zero any v_s(x, a) which are negative and renormalizing. \n\nUnfortunately, we have not been able to find such a nice solution for α_s(x). Instead, we use gradient descent to optimize equation 3, using the gradient \n\n∂/∂α_s(x) [eq. 3] = -( β_s(x) / Σ_{s'} α_{s'}(x) ) ( Σ_a ρ_s(x, a) v_s(x, a) / π(x, a) - 1 ). (5) \n\nWe constrain the gradient to fit the constraints. \n\nWe can find the best response for source s by iterating between the two steps above. First we initialize α_s(x) = β_s(x) for all x. We then solve for a new set of v_s(x, a) with equation 4. Using those v-values, we take a descent step on α_s(x) using the gradient of equation 5. We keep repeating until the solution converges (reducing the step size each iteration), which usually takes only a few tens of steps. 
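The two-step best-response iteration of section 3.2 can be sketched as follows. This is a sketch under assumptions: it uses the closed-form v-step (equation 4) and a projected gradient step on α (equation 5) as reconstructed above; the array layout, the clipping/renormalization projection, and the step-size schedule are our own illustrative choices.

```python
import numpy as np

# Sketch of the best-response iteration of section 3.2 for one source s.
# rho[x, a] and beta[x] are s's return parameters; c[x, a] and a_other[x]
# hold the other sources' fixed alpha-weighted votes and total effort.
# The projection details and step-size schedule are illustrative choices.

def best_response(rho, beta, c, a_other, steps=50, eta=0.1):
    alpha = beta.copy()                     # initialize alpha_s(x) = beta_s(x)
    for _ in range(steps):
        A = alpha + a_other                 # total effort at each observation
        # v-step (equation 4), then clip negatives and renormalize
        v = (rho * A[:, None] - c) / np.maximum(alpha, 1e-12)[:, None]
        v = np.maximum(v, 0.0)
        v = v / np.maximum(v.sum(axis=1, keepdims=True), 1e-12)
        # alpha-step: descend the gradient of equation 3 (equation 5),
        # then project back onto the non-negative simplex
        pi = (alpha[:, None] * v + c) / A[:, None]
        grad = (beta / A) * ((rho * v / np.maximum(pi, 1e-12)).sum(axis=1) - 1.0)
        alpha = np.maximum(alpha + eta * grad, 0.0)
        alpha = alpha / max(alpha.sum(), 1e-12)
        eta = eta * 0.95                    # reduce the step size each iteration
    return alpha, v

# One observation, two actions: s prefers action 0 against a single other
# source that votes (0.5, 0.5) with unit effort; s's best response puts all
# of its vote on action 0.
alpha, v = best_response(np.array([[1.0, 0.0]]), np.array([1.0]),
                         np.array([[0.5, 0.5]]), np.array([1.0]))
```

In the example, the resulting vote is v = (1, 0) with all of α on the single observation, so the mixed policy becomes (0.75, 0.25) rather than the other source's preferred (0.5, 0.5).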
\n\nFigure 1: Load-unload problem. The right side is the state diagram; cargo is loaded in state 1, and delivery to a boxed state results in reward from the source associated with that state. The left side is the solution found: for state 5, from left to right, are shown the ρ-values, the v-values, and the policy. \n\nFigure 2: Transfer of the load-unload solution: plots of the same values as in figure 1 but with the left source absent. No additional learning was allowed (the left-side plots are the same). The votes, however, change, and thus so does the final policy. \n\n3.3 Nash Equilibrium Algorithm \n\nTo find a Nash Equilibrium, we start with α_s(x) = β_s(x) and v_s(x, a) = ρ_s(x, a) and iterate to an equilibrium by repeatedly finding the best response for each source and simultaneously replacing the old solution with the new best responses. To prevent oscillation, whenever the change in α_s(x) v_s(x, a) grows from one step to the next, we replace the old solution with one halfway between the old and new solutions and continue the iteration. \n\n4 Example Results \n\nIn all of these examples we used the same learning scheme. We ran the algorithm for a series of epochs. At each epoch, we calculated π using the Nash Equilibrium algorithm. With probability ε, we replace π with one chosen uniformly over the simplex of conditional distributions; this ensures some exploration. We follow π for a fixed number of time steps and record the average reward for each source. We add these average rewards and the empirical estimate of the policy followed as data to the least-squares estimate of the returns. 
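The halfway-damping rule of section 3.3 can be sketched as a generic damped fixed-point loop. Representing all of the sources' votes as one flat array, and damping whenever the update fails to shrink, are our own simplifications of the simultaneous best-response scheme described there.

```python
import numpy as np

# Generic sketch of the damped simultaneous update of section 3.3: step(x)
# returns every source's best response to the current votes x (flattened into
# one array, an illustrative representation). When the update size fails to
# shrink, we move only halfway toward the new point, as in the paper.

def damped_fixed_point(step, x0, max_iters=200, tol=1e-8):
    x = x0
    prev_change = np.inf
    for _ in range(max_iters):
        x_new = step(x)
        change = np.abs(x_new - x).sum()
        if change >= prev_change:           # oscillating: take the halfway point
            x_new = 0.5 * (x + x_new)
            change = np.abs(x_new - x).sum()
        if change < tol:
            return x_new
        x, prev_change = x_new, change
    return x

# A pure best-response map can cycle: x -> 1 - x flips forever, but the
# damped iteration settles on the fixed point 0.5.
result = damped_fixed_point(lambda x: 1.0 - x, np.array([0.0]))
```

Undamped, the toy map above oscillates between 0 and 1 indefinitely; the halving step collapses the oscillation onto the equilibrium.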
\nWe then repeat for the next epoch. \n\n4.1 Multiple Delivery Load-Unload Problem \n\nWe extend the classic load-unload problem to multiple receivers. The observation state diagram is shown in figure 1. The hidden state is whether the agent is currently carrying cargo. Whenever the agent enters the top state (state 1), cargo is placed on the agent. Whenever the agent arrives in any of the boxed states while carrying cargo, the cargo is removed and the agent receives reward. For each boxed state, there is one reward source who rewards only for deliveries to that state (a reward of 1 for a delivery and 0 for all other time steps). In state 5, the agent has the choice of four actions, each of which moves the agent to the corresponding state without error. Since the agent can observe neither whether it has cargo nor its history, the optimal policy for state 5 is stochastic. \n\nFigure 3: One-way door state diagram. At every state there are two actions (right and left) available to the agent. In states 1, 9, 10, and 15, where there is only a single outgoing edge, both actions follow the same edge. With probability 0.1, an action will actually follow the other edge. Source 1 rewards entering state 1, whereas source 2 rewards entering state 9. \n\nFigure 4: One-way door solution. From left to right: the sources' ideal policies, the votes, and the final agent's policy. Light bars are for states for which both actions lead to the same state. \n\nThe algorithm set all α- and β-values to 0 for states other than state 5. We started ε at 0.5 and reduced it to 0.1 by the end of the run. We ran for 300 epochs of 200 iterations, by which point the algorithm consistently settled on the solution shown in figure 1. 
For each source, the algorithm found the best solution of randomly picking between the load state and the source's delivery state (as shown by the ρ-values). The votes are heavily weighted towards the delivery actions to overcome the other sources' preferences, resulting in an approximately uniform policy. The important point is that, without additional learning, the policy can be changed if the left source leaves: the learned α- and ρ-values are kept the same, but the Nash Equilibrium is different, resulting in the policy in figure 2. \n\n4.2 One-way Door Problem \n\nIn this case we consider the environment shown in figure 3. From each state the agent can move to the left or right except in states 1, 9, 10, and 15, where there is only one possible action. We can think of states 1 and 9 as one-way doors: once the agent enters state 1 or 9, it may not pass back through except by going around through state 5. Source 1 gives reward when the agent passes through state 1. Source 2 gives reward when the agent passes through state 9. Actions fail (move in the opposite direction from the one intended) 0.1 of the time. \n\nWe ran the learning scheme for 1000 epochs of 100 iterations, starting ε at 0.5 and reducing it to 0.015 by the last epoch. The algorithm consistently converged to the solution shown in figure 4. Source 1 considers the left-side states (2-5 and 11-12) the most important, while source 2 considers the right-side states (5-8 and 13-14) the most important. The ideal policies captured by the ρ-values show that source 1 wants the agent to move left and source 2 wants the agent to move right for the upper states (2-8), while the sources agree that for the lower states (11-14) the agent should move towards state 5. The votes reflect this preference and agreement. Both sources spend most of their vote on state 5, the state they both feel is important and on which they disagree. 
On the other states (those for which only one source has a strong opinion or on which the sources agree), they do not need to spend much of their vote. The resulting policy is the natural one: in state 5, the agent randomly picks a direction, after which it moves around the chosen loop quickly to return to state 5. Just as in the load-unload problem, if we remove one source, the agent automatically adapts to the ideal policy for the remaining source (with only one source, s_0, present, π(x, a) = ρ_{s_0}(x, a)). \n\nEstimating the optimal policies separately and then taking the mixture of these two policies would produce a far worse result. For states 2-8, the sources have differing opinions, so the mixture model would produce a uniform policy in those states; the agent would spend most of its time near state 5. Constructing a reward signal that is the sum of the sources' rewards does not lead to a good solution either: the agent will find that circling either the left or the right loop is optimal and will have no incentive to ever travel along the other loop. \n\n5 Conclusions \n\nIt is difficult to conceive of a method for providing a single reward signal that would result in the solution shown in figure 4 and still automatically change when one of the reward sources was removed. The biggest improvement in the algorithm will come from changing the form of the R̂_s^π estimator. For problems in which there is a single best solution, the KL-divergence measure seems to work well. However, we would like to be able to extend the load-unload result to the situation where the agent has a memory bit. In this case, the returns as a function of π are bimodal (due to the symmetry in the interpretation of the bit). In general, allowing each source's preference to be modelled in a more complex manner could help extend these results. 
\n\nAcknowledgments \n\nWe would like to thank Charles Isbell, Tommi Jaakkola, Leslie Kaelbling, Michael Kearns, Satinder Singh, and Peter Stone for their discussions and comments. \n\nThis report describes research done within CBCL in the Department of Brain and Cognitive Sciences and in the AI Lab at MIT. This research is sponsored by grants from ONR contract Nos. N00014-93-1-3085 & N00014-95-1-0600 and NSF contract Nos. IIS-9800032 & DMS-9872936. Additional support was provided by: AT&T, Central Research Institute of Electric Power Industry, Eastman Kodak Company, Daimler-Chrysler, Digital Equipment Corporation, Honda R&D Co., Ltd., NEC Fund, Nippon Telegraph & Telephone, and Siemens Corporate Research, Inc. \n\nReferences \n\n[1] J. Hu and M. P. Wellman. Multiagent reinforcement learning: Theoretical framework and an algorithm. In Proc. of the 15th International Conf. on Machine Learning, pages 242-250, 1998. \n\n[2] C. L. Isbell, C. R. Shelton, M. Kearns, S. Singh, and P. Stone. A social reinforcement learning agent. 2000. Submitted to Autonomous Agents 2001. \n\n[3] J. Karlsson. Learning to Solve Multiple Goals. PhD thesis, University of Rochester, 1997. \n\n[4] M. Kearns, Y. Mansour, and S. Singh. Fast planning in stochastic games. In Proc. of the 16th Conference on Uncertainty in Artificial Intelligence, 2000. \n\n[5] M. L. Littman. Markov games as a framework for multi-agent reinforcement learning. In Proc. of the 11th International Conference on Machine Learning, pages 157-163, 1994. \n\n[6] G. Owen. Game Theory. Academic Press, UK, 1995. \n\n[7] S. Singh, M. Kearns, and Y. Mansour. Nash convergence of gradient dynamics in general-sum games. In Proc. of the 16th Conference on Uncertainty in Artificial Intelligence, 2000. \n\n[8] S. P. Singh. The efficient learning of multiple task sequences. In NIPS, volume 4, 1992. 
", "award": [], "sourceid": 1831, "authors": [{"given_name": "Christian", "family_name": "Shelton", "institution": null}]}