{"title": "Reward Mapping for Transfer in Long-Lived Agents", "book": "Advances in Neural Information Processing Systems", "page_first": 2130, "page_last": 2138, "abstract": "We consider how to transfer knowledge from previous tasks to a current task in long-lived and bounded agents that must solve a sequence of MDPs over a finite lifetime.  A novel aspect of our transfer approach is that we reuse reward functions.   While this may seem counterintuitive, we build on the insight of recent work on the optimal rewards problem that guiding an agent's behavior with reward functions other than the task-specifying reward function can help overcome computational  bounds of  the agent.    Specifically, we use good guidance reward functions learned on previous tasks in the sequence to incrementally train a reward mapping function that maps task-specifying reward functions into good initial guidance reward functions for subsequent tasks. We demonstrate that our approach can substantially improve the agent's performance relative to other approaches, including an approach that transfers policies.", "full_text": "Reward Mapping for Transfer in Long-Lived Agents\n\nXiaoxiao Guo\n\nComputer Science and Eng.\n\nUniversity of Michigan\nguoxiao@umich.edu\n\nSatinder Singh\n\nComputer Science and Eng.\n\nUniversity of Michigan\nbaveja@umich.edu\n\nRichard Lewis\n\nDepartment of Psychology\n\nUniversity of Michigan\nrickl@umich.edu\n\nAbstract\n\nWe consider how to transfer knowledge from previous tasks (MDPs) to a cur-\nrent task in long-lived and bounded agents that must solve a sequence of tasks\nover a \ufb01nite lifetime. A novel aspect of our transfer approach is that we reuse\nreward functions. While this may seem counterintuitive, we build on the insight\nof recent work on the optimal rewards problem that guiding an agent\u2019s behav-\nior with reward functions other than the task-specifying reward function can help\novercome computational bounds of the agent. 
Speci\ufb01cally, we use good guid-\nance reward functions learned on previous tasks in the sequence to incrementally\ntrain a reward mapping function that maps task-specifying reward functions into\ngood initial guidance reward functions for subsequent tasks. We demonstrate that\nour approach can substantially improve the agent\u2019s performance relative to other\napproaches, including an approach that transfers policies.\n\n1\n\nIntroduction\n\nWe consider agents that live for a long time in a sequential decision-making environment. While\nmany different interpretations are possible for the notion of long-lived, here we consider agents\nthat have to solve a sequence of tasks over a continuous lifetime. Thus, our problem is closely\nrelated to that of transfer learning in sequential decision-making, which can be thought of as a\nproblem faced by agents that have to solve a set of tasks. Transfer learning [18] has explored the\nreuse across tasks of many different components of a reinforcement learning (RL) architecture,\nincluding value functions [16, 5, 8], policies [9, 20], and models of the environment [1, 17]. Other\ntransfer approaches have considered parameter transfer [19], selective reuse of sample trajectories\nfrom previous tasks [7], as well as reuse of learned abstract representations such as options [12, 6].\nA novel aspect of our transfer approach in long-lived agents is that we will reuse reward functions.\nAt \ufb01rst blush, it may seem odd to consider using a reward function different from the one specifying\nthe current task in the sequence (indeed, in most RL research rewards are considered an immutable\npart of the task description). 
But there is now considerable work on designing good reward functions,\nincluding reward-shaping [10], inverse RL [11], optimal rewards [13] and preference-elicitation [3].\nIn this work, we specifically build on the insight of the optimal rewards problem (ORP; described in\nmore detail in the next section) that guiding an agent\u2019s behavior with reward functions other than the\ntask-specifying reward function can help overcome computational bounds in the agent architecture.\nWe base our work on an algorithm from Sorg et al. [14] that learns good guidance reward functions\nincrementally in a single-task setting.\nOur main contribution in this paper is a new approach to transfer in long-lived agents in which we\nuse good guidance reward functions learned on previous tasks in the sequence to incrementally train\na reward mapping function that maps task-specifying reward functions into good initial guidance\nreward functions for subsequent tasks. We demonstrate that our approach can substantially improve\na long-lived agent\u2019s performance relative to other approaches, first on an illustrative grid world\ndomain, and second on a networking domain from prior work [9] on the reuse of policies for transfer.\nIn the grid world domain only the task-specifying reward function changes with tasks, while in the\nnetworking domain both the reward function and the state transition function change with tasks.\n\n2 Background: Optimal Rewards for Bounded Agents in Single Tasks\n\nWe consider sequential decision-making environments formulated as controlled Markov processes\n(CMPs); these are defined via a state space S, an action space A, and a transition function T that\ndetermines a distribution over next states given a current state and action. A task in such a CMP is\ndefined via a reward function R that maps state-action pairs to scalar values. 
The objective of the\nagent in a task is to execute the optimal policy, i.e., to choose actions in such a way as to optimize\nutility de\ufb01ned as the expected value of cumulative reward over some lifetime. A CMP and reward\nfunction together de\ufb01ne a Markov decision process or MDP; hence tasks in this paper are MDPs.\nThere are many approaches to planning an optimal policy in MDPs. Here we will use UCT [4] which\nincrementally plans the action to take in the current state. It simulates a number of trajectories from\nthe current state up to some maximum depth, choosing actions at each point based on the sum of an\nestimated action-value that encourages exploitation and a reward bonus that encourages exploration.\nIt has theoretical guarantees of convergence and works well in practice on a variety of large-scale\nplanning problems. We use UCT in this paper because it is one of the state of the art algorithms in\nRL planning and because there exists a good optimal reward \ufb01nding algorithm for it [14].\nOptimal Rewards Problem (ORP).\nIn almost all of RL research, the reward function is consid-\nered part of the task speci\ufb01cation and thus unchangeable. The optimal reward framework of Singh\net al. [13] stems from the observation that a reward function plays two roles simultaneously in RL\nproblems. The \ufb01rst role is that of evaluation in that the task-specifying reward function is used by\nthe agent designer to evaluate the actual behavior of the agent. The second is that of guidance in that\nthe reward function is also used by the RL algorithm implemented by the agent to determine its be-\nhavior (e.g., via Q-learning [21] or UCT planning [4]). The optimal rewards problem separates these\ntwo roles into two separate reward functions, the task-specifying objective reward function used to\nevaluate performance, and an internal reward function used to guide agent behavior. 
Given a CMP\n$M$, an objective reward function $R^o$, an agent $A$ parameterized by an internal reward function, and\na space of possible internal reward functions $\\mathcal{R}$, an optimal internal reward function $R^{i*}$ is defined\nas follows (throughout, superscript $o$ will denote objective evaluation quantities and superscript $i$\nwill denote internal quantities):\n\n$$R^{i*} = \\arg\\max_{R^i \\in \\mathcal{R}} \\mathbb{E}_{h \\sim \\langle A(R^i), M \\rangle} \\left\\{ U^o(h) \\right\\},$$\n\nwhere $A(R^i)$ is the agent with internal reward function $R^i$, $h \\sim \\langle A(R^i), M \\rangle$ is a random history\n(trajectory of alternating states and actions) obtained by the interaction of agent $A(R^i)$ with CMP\n$M$, and $U^o(h)$ is the objective utility (as specified by $R^o$) to the agent designer of interaction history\n$h$. The optimal internal reward function will depend on the agent $A$\u2019s architecture and its limitations,\nand this distinguishes ORP from other reward-design approaches such as inverse-RL. When would\nthe optimal internal reward function be different from the objective reward function? If an agent is\nunbounded in its capabilities with respect to the CMP then the objective reward function is always an\noptimal internal reward function. More crucially though, in the realistic setting of bounded agents,\noptimal internal reward functions may be quite different from objective reward functions. Singh\net al. [13] and Sorg et al. [14] provide many examples and some theory of when a good choice of\ninternal reward can mitigate agent bounds, including bounds corresponding to limited lifetime to\nlearn [13], limited memory [14], and limited resources for planning (the specific bound of interest\nin this paper).\nPGRD: Solving the ORP on-line while planning. Computing $R^{i*}$ can be computationally\nnon-trivial. 
We will use Sorg et al.\u2019s [14, 15] policy gradient reward design (PGRD) method that is based\non the insight that any planning algorithm can be viewed as procedurally translating the internal\nreward function $R^i$ into behavior; that is, $R^i$ are indirect parameters of the agent\u2019s policy. PGRD\ncheaply computes the gradient of the objective utility with respect to the $R^i$ parameters through UCT\nplanning. Specifically, it takes a simulation model of the CMP and an objective reward function and\nuses UCT to simultaneously plan actions with respect to the current internal reward function as well\nas to update the internal reward function in the direction of the gradient of the objective utility for\nuse in the next planning step.\n\n(a) Conventional Agent\n\n(b) Non-transfer ORP Agent\n\n(c) Reward Mapping Transfer ORP Agent\n\n(d) Sequential Transfer ORP Agent\n\nFigure 1: The four agent types compared in this paper. In each figure, time flows from left to right. The\nsequence of objective reward parameters and task durations for n tasks are shown in the environment portion\nof each figure. In figures (b-d) the agent portion of the figure is further split into a critic-agent and an\nactor-agent; figure (a) does not have this split because it is the conventional agent. The critic-agent translates the\nobjective reward parameters $\\theta^o$ into the internal reward parameters $\\theta^i$. The actor-agent is a UCT agent in all\nour implementations. The critic-agent component varies across the figures and is crucial to understanding the\ndifferences among the agents (see text for detailed descriptions).\n\n3 Four Agent Architectures for the Long-Lived Agent Problem\n\nLong-Lived Agent\u2019s Objective Utility. 
We will consider the case where objective rewards are\nlinear functions of objective reward features. Formally, the $j$th task is defined by objective reward\nfunction $R^o_j(s, a) = \\theta^o_j \\cdot \\psi^o(s, a)$, where $\\theta^o_j$ is the parameter vector for the $j$th task, $\\psi^o$ are the\ntask-independent objective reward features of state and action, and \u2018$\\cdot$\u2019 denotes the inner product. Note\nthat the features are constant across tasks while the parameters vary. The $j$th task lasts for $t_j$ time\nsteps. Given some agent $A$, the expected objective utility achieved for a particular task sequence\n$\\{\\theta^o_j, t_j\\}_{j=1}^{K}$ is $\\mathbb{E}_{h \\sim \\langle A, M \\rangle} \\left\\{ \\sum_{j=1}^{K} U^{\\theta^o_j}(h_j) \\right\\}$, where for ease of exposition we denote the history\nduring task $j$ simply as $h_j$. In general, there may be a distribution over task sequences, and the\nexpected objective utility would then be a further expectation over such a distribution.\nIn some transfer or other long-lived agent research, the emphasis is on learning in that the agent is\nassumed to lack complete knowledge of the CMP and the task specifications. Our emphasis here\nis on planning in that the agent is assumed to know the CMP perfectly as well as the task\nspecifications as they change. If the agent were unbounded in planning capacity, there would be nothing\ninteresting left to consider because the agent could simply find the optimal policy for each new task\nand execute it. 
What makes our problem interesting therefore is that our UCT-based planning agent\nis computationally limited: the depth and number of trajectories feasible are small enough (relative\nto the size of the CMP) that it cannot find near-optimal actions. This sets up the potential for both\nthe use of the ORP and of transfer across tasks. Note that basic UCT does use a reward function but\ndoes not use an initial value function or policy and hence changing a reward function is a natural\nand consequential way to influence UCT. While non-trivial modifications of UCT could allow use of\nvalue functions and/or policies, we do not consider them here. 
In addition, in our setting a model of\nthe CMP is available to the agent and so there is no scope for transfer by reuse of model knowledge.\nThus, our reuse of reward functions may well be the most consequential option available in UCT.\nNext we discuss four different agent architectures represented graphically in Figure 1, starting with\na conventional agent that ignores both the potential of transfer and that of ORP, followed by three\ndifferent agents that exploit one or both of them.\nConventional Agent. Figure 1(a) shows the baseline conventional UCT-based agent that ignores\nthe possibility of transfer and treats each task separately. It also ignores ORP and treats each task\u2019s\nobjective reward as the internal reward for UCT planning during that task.\nThe remaining three agents will all consider the ORP, and share the following details: The space of\ninternal reward functions R is the space of all linear functions of internal reward features \u03c8i(s, a),\ni.e., R(s, a) = {\u03b8 \u00b7 \u03c8i(s, a)}\u03b8\u2208\u0398, where \u0398 is the space of possible parameters \u03b8 (in this paper all\nfinite vectors). Note that the internal reward features \u03c8i and the objective reward features \u03c8o do not\nhave to be identical.\nNon-Transfer ORP Agent. Figure 1(b) shows the non-transfer agent that ignores the possibility of\ntransfer but exploits ORP. It initializes the internal reward function to the objective reward function\nof each new task as it starts and then uses PGRD to adapt the internal reward function while acting\nin that task. Nothing is transferred across task boundaries. This agent was designed to help separate\nthe contributions of ORP and transfer to performance gains.\nReward-Mapping-Transfer ORP Agent.\nFigure 1(c) shows the reward-mapping agent that\nincorporates our main new idea. It exploits both transfer and ORP by incrementally learning a reward\nmapping function. 
A reward mapping function $f$ maps objective reward function parameters to\ninternal reward function parameters: $\\forall j,\\ \\theta^i_j = f(\\theta^o_j)$. The reward mapping function is used to initialize\nthe internal reward function at the beginning of each new task. PGRD is used to continually adapt\nthe initialized internal reward function throughout each task.\nThe reward mapping function is incrementally trained as follows: when task $j$ ends, the objective\nreward function parameters $\\theta^o_j$ and the adapted internal reward function parameters $\\hat{\\theta}^i_j$ are used as\nan input-output pair to update the reward mapping function. In our work, we use nonparametric\nkernel-regression to learn the reward mapping function. Pseudocode for a general reward mapping\nagent is presented in Algorithm 1.\nSequential-Transfer ORP Agent. Figure 1(d) shows the sequential-transfer agent. It also exploits\nboth transfer and ORP. However, it does not use a reward mapping function but instead\ncontinually updates the internal reward function across task boundaries using PGRD. The internal reward\nfunction at the end of a task becomes the initial internal reward function at the start of the next task,\nachieving a simple form of sequential transfer.\n\n4 Empirical Evaluation\n\nThe four agent architectures are compared to demonstrate that the reward mapping approach can\nsubstantially improve the bounded agent\u2019s performance, first on an illustrative grid world domain,\nand second on a network routing domain from prior work [9] on the transfer of policies.\n\n4.1 Food-and-Shelter Domain\n\nThe purposes of the experiments in this domain are (1) to systematically explore the relative benefits\nof the use of ORP, and of transfer (with and without the use of the reward-mapping function), each\nin isolation and together, (2) to explore the sensitivity and dependence of these relative benefits on\nparameters of the long-lived setting such as mean duration of 
tasks, and (3) to visualize what is\nlearned by the reward mapping function.\n\nAlgorithm 1 General pseudocode for Reward Mapping Agent (Figure 1(c))\nInput: $\\{\\theta^o_j, t_j\\}_{j=1}^{k}$, where $j$ is the task indicator, $t_j$ is the task duration, and $\\theta^o_j$ are the objective reward\nfunction parameters specifying task $j$.\n\nfor t = 1, 2, 3, ... do\n    if a new task j starts then\n        obtain current objective reward parameters $\\theta^o_j$\n        compute: $\\theta^i_j = f(\\theta^o_j)$\n        initialize the internal reward function using $\\theta^i_j$\n    end if\n    $a_t$ := planning($s_t$; $\\theta^i_j$)    (select action using UCT guided by reward function $\\theta^i_j$)\n    ($s_{t+1}$, $r_{t+1}$) := takeAction($s_t$, $a_t$)\n    $\\theta^i$ := updateInternalRewardFunction($\\theta^i$, $s_t$, $a_t$, $s_{t+1}$, $r_{t+1}$)    (via PGRD)\n    if current task ends then\n        obtain current internal reward parameters as $\\hat{\\theta}^i_j$\n        update reward mapping function f using training pair ($\\theta^o_j$, $\\hat{\\theta}^i_j$)\n    end if\nend for\n\n(a) Food-and-Shelter Domain.\n\n(b) Network Routing Domain.\n\nFigure 2: Domains used in empirical evaluation; the network routing domain comes from [9].\n\nThe environment is a simple 3 by 3 maze with three left-to-right corridors. Thick black lines indicate\nimpassable walls. The position of the shelter and possible positions of food are shown in Figure 2.\nDynamics. The shelter breaks down with a probability of 0.1 at each time step. Once the shelter\nis broken, it remains broken until repaired by the agent. Food appears at the rightmost column of\none of the three corridors and can be eaten by the agent when the agent is at the same location as\nthe food. When food is eaten, new food reappears in a different corridor. 
The agent can move in\nfour cardinal directions, and every movement action has a probability of 0.1 of resulting in movement\nin a random direction; if the direction is blocked by a wall or the boundary, the action results in no\nmovement. The agent eats food and repairs shelter automatically whenever collocated with food and\nshelter respectively. The discount factor is \u03b3 = 0.95.\nState. A state is a tuple (l, f, h), where l is the location of the agent, f is the location of the food,\nand h indicates whether the shelter is broken.\nObjective Reward Function. At each time step, the agent receives a positive reward of e (the\neat-bonus) for eating food and a negative reward of b (the broken-cost) if the shelter is broken. Thus,\nthe objective reward function\u2019s parameters are $\\theta^o_j = (e_j, b_j)$, where $e_j \\in [0, 1]$ and $b_j \\in [-1, 0]$.\nDifferent tasks will require the agent to behave in different ways. For example, if $(e_j, b_j) = (1, 0)$,\nthe agent should explore the maze to eat more food. If $(e_j, b_j) = (0, -1)$, the agent should remain at\nthe shelter\u2019s location in order to repair the shelter as it breaks.\nSpace of Internal Reward Functions. The internal reward function is $R^i_j(s) = R^o_j(s) + \\theta^i_j \\psi^i(s)$,\nwhere $R^o_j(s)$ is the objective reward function, $\\psi^i(s) = 1 - \\frac{1}{n_l(s)}$ is the inverse recency feature\nand $n_l(s)$ is the number of time steps since the agent\u2019s last visit to the location in state $s$. 
Since\nthere is exactly one internal reward parameter, $\\theta^i_j$ is a scalar. A positive $\\theta^i_j$ encourages the agent to\nvisit locations not visited recently, and a negative $\\theta^i_j$ encourages the agent to visit locations visited\nrecently.\n\nFigure 3: (Left) Performance of four agents in food-and-shelter domain at three different mean task durations.\n(Middle and Right) Comparing performance while accounting for computational overhead of learning and using\nthe reward mapping function. See text for details.\n\nResults: Performance advantage of reward mapping. 100 sequences of 200 tasks were\ngenerated, with Poisson distributions for task durations, and with objective reward function parameters\nsampled uniformly from their ranges. The agents used UCT with depth 2 and 500 trajectories; the\nconventional agent is thereby bounded as evidenced in its poor performance (see Figure 3).\nThe left panel in Figure 3 shows average objective reward per\ntime step (with standard error bars). There are three sets of four\nbars each where each bar within a set is for a different\narchitecture (see legend), and each set is for a different mean task\nduration (50, 200, and 500 from left to right). For each task\nduration the reward mapping agent does best and the\nconventional agent does the worst. These results demonstrate that\ntransfer helps performance and that transfer via the new reward\nmapping approach can substantially improve a bounded\nlong-lived agent\u2019s performance relative to transfer via the competing\nmethod of sequential transfer. As task durations get longer the\nratio of the reward-mapping agent\u2019s performance to the\nnon-transfer agent\u2019s performance gets smaller, though it remains > 1\n(by visually taking the ratio of the corresponding bars). 
This\nis expected because the longer the task duration the more time\nPGRD has to adapt to the task, and thus the less the better\ninitialization provided by the reward mapping function matters.\nIn addition, the sequential transfer agent does better than the\nnon-transfer agent for the shortest task duration of 50 while the\nsituation reverses for the longest task duration of 500. This is\nintuitive and significant, for the following reason. 
Recall that the initialization\nof the internal reward function from the final internal reward\nfunction of the previous task can hurt performance in the\nsequential transfer setting if the current task requires quite\ndifferent behavior from the previous, but it can help if two\nsuccessive tasks are similar. Correcting the internal reward function\ncould cost a large number of steps. These effects are exacerbated by longer task durations because\nthe agent then has longer to adapt its internal reward function to each task.\nIn general, as task\nduration increases, the non-transfer agent improves but the sequential transfer agent worsens.\n\nFigure 4: Reward mapping function visualization. Top: Optimal mapping (panel title: Optimal Internal Reward for UCT).\nBottom: Mapping found by the Reward Mapping agent after 50 tasks (panel title: Reward Mapping learned after 50 tasks).\nAxes in both panels: broken cost and eat bonus.\n\nResults: Performance Comparison considering computational overhead.\nThe above results\nignore the computational overhead incurred by learning and using the reward mapping function.\nThe two rightmost plots in the bottom row of Figure 3 show the average objective reward per time\nstep as a function of milliseconds per decision for the four agent architectures for a range of depth\n{1, . . . , 6}, and trajectory-count {200, 300, . . . , 600} parameters for UCT. 
The plots show that for\nthe entire range of time-per-decision, the best performing agents are reward-mapping agents; in\nother words, it is not better to spend the overhead time of the reward-mapping on additional UCT\nsearch. This can be seen by observing that the highest dot at any vertical column on the x-axis\nbelongs to the reward mapping agent. 
Thus, the overhead of the reward mapping function in the\nreward mapping agent is insigni\ufb01cant relative to the computational cost of UCT (this last cost is all\nthe conventional agent incurs).\nResults: Reward mapping visualization. Using a \ufb01xed set of tasks (as described above) with\nmean duration of 500, we estimated the optimal internal reward parameter (the coef\ufb01cient of the\ninverse-recency feature) for UCT by a brute-force grid search. The optimal internal reward parame-\nter is visualized as a function of the two parameters of the objective reward function (broken cost and\neat bonus) in Figure 4, top. Negative coef\ufb01cients (light color squares) for inverse-recency feature\ndiscourage exploration while positive coef\ufb01cients (dark color squares) encourage exploration. As\nwould be expected the top right corner (high penalty for broken shelter and low reward for eating)\ndiscourages exploration while the bottom left corner (high reward for eating and low cost for broken\nshelter) encourages exploration. Figure 4, bottom, visualizes the learned reward mapping function\nafter training on 50 tasks. There is a clearly similar pattern to the optimal mapping in the upper\ngraph, though it has not captured the \ufb01ner details.\n\n4.2 Network Routing Domain\n\nThe purposes of the following experiments are to (1) compare performance of our agents to a com-\npeting policy transfer method [9] from a closely related setting on a networking application domain\nde\ufb01ned by the competing method; (2) demonstrate that our reward mapping and other agents can be\nextended to a multi-agent setting as required by this domain; and (3) demonstrate that the reward-\nmapping approach can be extended to handle task changes that involve changes to the transition\nfunction as well as objective reward.\nThe network routing domain [9] (see Figure 2(b)) is de\ufb01ned from the following components. (1) A\nset of routers, or nodes. 
Every router has a queue to store packets. In our experiments, all queues\nare of size three. (2) A set of links between two routers. All links are bidirectional and full-duplex,\nand every link has a weight (uniformly sampled from {1,2,3}) to indicate the cost of transmitting a\npacket. (3) A set of active packets. Every packet is a tuple (source, destination, alive-time), where\nsource is the node which generated the packet, destination is the node that the packet is sent to, and\nalive-time is the time period that the packet has existed in the network. When a packet is delivered\nto its destination node, the alive-time is the end-to-end delay. (4) A set of packet generators. Every\nnode has a packet generator that specifies a stochastic method to generate packets. (5) A set of\npower consumption functions. Every node\u2019s power consumption at time t is the number of packets\nin its queue multiplied by a scalar parameter sampled uniformly in the range [0, 0.5].\nActions, dynamics, and states. Every node makes its routing decision separately and has its own\naction space (its actions determine which neighbor the first packet in the queue is sent to). If multiple\npackets reach the same node simultaneously, they are inserted into the queue in random order.\nPackets that arrive after the queue is full cause network congestion and result in packet loss. The global\nstate at time t consists of the contents of all queues at all nodes at t.\nTransition function. In a departure from the original definition of the routing domain, we\nparameterize the transition function to allow a comparison of agents\u2019 performance when transition functions\nchange. Originally, the state transition function in the routing problem was determined by the fixed\nnetwork topology and by the parameters of the packet generators that determined among other things\nthe destination of packets. 
In our modification, nodes in the network are partitioned into three groups\n(G1, G2, and G3) and the probabilities that the destination of a packet belongs to each group of nodes\n(pG1, pG2, and pG3) are parameters we manipulate to change the state transition function.\nObjective reward function. The objective reward function is a linear combination of three objective\nreward features: the delay measured as the sum of the inverse end-to-end delay of all packets received\nat all nodes at time t, the loss measured as the number of lost packets at time t, and power measured\nas the sum of the power consumption of all nodes at time t. The weights of these three features are\nthe parameters of the objective reward function. The weight for the delay feature lies in (0, 1), while the\nweights for both loss and power lie in (\u22120.2, 0); different choices of these weights correspond to\ndifferent objective reward functions.\nInternal reward function. The internal reward function for the agent at node $k$ is\n$R^i_{j,k}(s, a) = R^o_j(s, a) + \\theta^i_{j,k} \\cdot \\psi^i_k(s, a)$, where $R^o_j(s, a)$ is the objective reward function and $\\psi^i_k(s, a)$ is a binary feature\nvector with one binary feature for each (packet destination, action) pair. It sets the bits corresponding\nto the destination of the first packet in node $k$\u2019s queue at state $s$ and action $a$ to 1; all other bits are\nset to 0. The internal reward features are capable of representing arbitrary policies (and thus we also\nimplemented classical policy gradient with these features using OLPOMDP [2] but found it to be\nfar slower than the use of PGRD with UCT and hence don\u2019t present those results here).\nExtension of Reward Mapping Agent to handle transition function changes. 
The parameters\ndescribing the transition function are concatenated with the parameters defining the objective reward\nfunction and used as input to the reward mapping function (whose output remains the initial internal\nreward function).\nHandling Multi-Agency. Every node\u2019s agent observes the full state of the environment. All agents\nmake decisions independently at each time step. Nodes do not know other nodes\u2019 policies, but can\nobserve how the other nodes have acted in the past and use the empirical counts of past actions to\nsample other nodes\u2019 actions accordingly during UCT planning.\nCompeting policy transfer method. The competing\npolicy transfer agent from [9] reuses policy knowledge\nacross tasks based on a model-based average-reward\nRL algorithm. Their method keeps a library of policies\nderived from previous tasks and for each new task\nchooses an appropriate policy from the library and then\nimproves the initial policy with experience. Their policy\nselection criterion was designed for the case when\nonly the linear reward parameters change. However,\nin our experiments, tasks could differ in three different\nways: (1) only reward functions change, (2) only transition\nfunctions change, and (3) both reward functions\nand transition functions change. Their policy selection\ncriterion is applied to cases (1) and (3). For case (2),\nwhen only transition functions change, their method is\nmodified to select the library policy whose transition\nfunction parameters are closest to the new transition\nfunction parameters.\nResults: Performance advantage of Reward Mapping Agent.\nThree sets of 100 task sequences were generated: one in which the tasks differed\nin objective reward function only, another in which they differed in state transition function only,\nand a third in which they differed in both. 
Figure 5 compares the average objective reward per time\nstep for all four agents defined above as well as the competing policy transfer agent on the three sets.\nIn all cases, the reward-mapping agent works best and the conventional agent worst. The competing\npolicy transfer agent is second best when only the reward function changes, which is just the setting for\nwhich it was designed.\n\nFigure 5: Performance on the network routing\ndomain. (Left) tasks differ in objective reward\nfunctions (R) only. (Middle) tasks differ\nin transition function (T) only. (Right) tasks\ndiffer in both objective reward and transition\n(R and T) functions. See text for details.\n[Plot omitted. x-axis groups: R only, T only, R and T; y-axis: avg. objective reward per time step (0 to 0.4); legend: Reward Mapping, Sequential Transfer, Non-Transfer, Conventional, Policy Transfer.]\n\n5 Conclusion and Discussion\n\nReward functions are a particularly consequential locus for knowledge transfer; reward functions\nspecify what the agent is to do but not how, and can thus transfer across changes in the environment\ndynamics (transition function) unlike previously explored loci for knowledge transfer such as value\nfunctions or policies or models. Building on work on the optimal reward problem for single-task\nsettings, our main algorithmic contribution for our long-lived agent setting is to take good guidance\nreward functions found for previous objective rewards and learn a mapping used to effectively\ninitialize the guidance reward function for subsequent tasks. We demonstrated that our reward mapping\napproach can outperform alternative approaches; current and future work is focused on greater\ntheoretical understanding of the general conditions under which this is true.\nAcknowledgments. This work was supported by NSF grant IIS-1148668. Any opinions, findings,\nconclusions, or recommendations expressed here are those of the authors and do not necessarily\nreflect the views of the sponsors.\n\nReferences\n[1] Christopher G. 
Atkeson and Juan Carlos Santamaria. A comparison of direct and model-based reinforcement\nlearning. In International Conference on Robotics and Automation, pages 3557\u20133564, 1997.\n\n[2] Peter L Bartlett and Jonathan Baxter. Stochastic optimization of controlled partially observable Markov\ndecision processes. In Proceedings of the 39th IEEE Conference on Decision and Control, volume 1,\npages 124\u2013129, 2000.\n\n[3] Urszula Chajewska, Daphne Koller, and Ronald Parr. Making rational decisions using adaptive utility\nelicitation. In Proceedings of the Seventeenth National Conference on Artificial Intelligence, pages 363\u2013369,\n2000.\n\n[4] Levente Kocsis and Csaba Szepesv\u00e1ri. Bandit based Monte-Carlo planning. In Machine Learning: ECML,\npages 282\u2013293. Springer, 2006.\n\n[5] George Konidaris and Andrew Barto. Autonomous shaping: Knowledge transfer in reinforcement\nlearning. In Proceedings of the 23rd International Conference on Machine Learning, pages 489\u2013496, 2006.\n\n[6] George Konidaris and Andrew G Barto. Building portable options: Skill transfer in reinforcement\nlearning. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, volume 2, pages\n895\u2013900, 2007.\n\n[7] Alessandro Lazaric, Marcello Restelli, and Andrea Bonarini. Transfer of samples in batch reinforcement\nlearning. In Proceedings of the 25th International Conference on Machine Learning, pages 544\u2013551,\n2008.\n\n[8] Yaxin Liu and Peter Stone. Value-function-based transfer for reinforcement learning using structure mapping.\nIn Proceedings of the Twenty-First National Conference on Artificial Intelligence, volume 21(1),\npage 415, 2006.\n\n[9] Sriraam Natarajan and Prasad Tadepalli. Dynamic preferences in multi-criteria reinforcement learning.\nIn Proceedings of the 22nd International Conference on Machine Learning, 2005.\n\n[10] Andrew Y. Ng, Daishi Harada, and Stuart Russell. 
Policy invariance under reward transformations: Theory\nand application to reward shaping. In Proceedings of the Sixteenth International Conference on\nMachine Learning, pages 278\u2013287, 1999.\n\n[11] Andrew Y Ng and Stuart Russell. Algorithms for inverse reinforcement learning. In Proceedings of the\nSeventeenth International Conference on Machine Learning, pages 663\u2013670, 2000.\n\n[12] Theodore J Perkins and Doina Precup. Using options for knowledge transfer in reinforcement learning.\nUniversity of Massachusetts, Amherst, MA, USA, Tech. Rep, 1999.\n\n[13] Satinder Singh, Richard L Lewis, Andrew G Barto, and Jonathan Sorg. Intrinsically motivated reinforcement\nlearning: An evolutionary perspective. IEEE Transactions on Autonomous Mental Development,\n2(2):70\u201382, 2010.\n\n[14] Jonathan Sorg, Satinder Singh, and Richard L Lewis. Reward design via online gradient ascent. Advances\nin Neural Information Processing Systems, 23, 2010.\n\n[15] Jonathan Sorg, Satinder Singh, and Richard L Lewis. Optimal rewards versus leaf-evaluation heuristics\nin planning agents. In Proceedings of the Twenty-Fifth Conference on Artificial Intelligence, 2011.\n\n[16] Fumihide Tanaka and Masayuki Yamamura. Multitask reinforcement learning on the distribution of MDPs.\nIn Proceedings IEEE International Symposium on Computational Intelligence in Robotics and Automation,\nvolume 3, pages 1108\u20131113, 2003.\n\n[17] Matthew E Taylor, Nicholas K Jong, and Peter Stone. Transferring instances for model-based reinforcement\nlearning. In Machine Learning and Knowledge Discovery in Databases, pages 488\u2013505. 2008.\n\n[18] Matthew E Taylor and Peter Stone. Transfer learning for reinforcement learning domains: A survey. The\nJournal of Machine Learning Research, 10:1633\u20131685, 2009.\n\n[19] Matthew E Taylor, Shimon Whiteson, and Peter Stone. Transfer via inter-task mappings in policy search\nreinforcement learning. 
In Proceedings of the 6th International Joint Conference on Autonomous Agents\nand Multiagent Systems, page 37, 2007.\n\n[20] Lisa Torrey and Jude Shavlik. Policy transfer via Markov logic networks. In Inductive Logic Programming,\npages 234\u2013248. Springer, 2010.\n\n[21] Christopher JCH Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3-4):279\u2013292, 1992.\n", "award": [], "sourceid": 1049, "authors": [{"given_name": "Xiaoxiao", "family_name": "Guo", "institution": "University of Michigan"}, {"given_name": "Satinder", "family_name": "Singh", "institution": "University of Michigan"}, {"given_name": "Richard", "family_name": "Lewis", "institution": "University of Michigan"}]}