{"title": "The Efficient Learning of Multiple Task Sequences", "book": "Advances in Neural Information Processing Systems", "page_first": 251, "page_last": 258, "abstract": null, "full_text": "The Efficient Learning of Multiple Task \n\nSequences \n\nSatinder P. Singh \n\nDepartment of Computer Science \n\nUniversity of Massachusetts \n\nAmherst, MA 01003 \n\nAbstract \n\nI present a modular network architecture and a learning algorithm based \non incremental dynamic programming that allows a single learning agent \nto learn to solve multiple Markovian decision tasks (MDTs) with signif(cid:173)\nicant transfer of learning across the tasks. I consider a class of MDTs, \ncalled composite tasks, formed by temporally concatenating a number of \nsimpler, elemental MDTs. The architecture is trained on a set of compos(cid:173)\nite and elemental MDTs. The temporal structure of a composite task is \nassumed to be unknown and the architecture learns to produce a tempo(cid:173)\nral decomposition. It is shown that under certain conditions the solution \nof a composite MDT can be constructed by computationally inexpensive \nmodifications of the solutions of its constituent elemental MDTs. \n\n1 \n\nINTRODUCTION \n\nMost applications of domain independent learning algorithms have focussed on \nlearning single tasks. Building more sophisticated learning agents that operate in \ncomplex environments will require handling multiple tasks/goals (Singh, 1992). Re(cid:173)\nsearch effort on the scaling problem has concentrated on discovering faster learning \nalgorithms, and while that will certainly help, techniques that allow transfer of \nlearning across tasks will be indispensable for building autonomous learning agents \nthat have to learn to solve multiple tasks . In this paper I consider a learning agent \nthat interacts with an external, finite-state, discrete-time, stochastic dynamical en(cid:173)\nvironment and faces multiple sequences of Markovian decision tasks (MDTs). 
\n\n251 \n\n\f252 \n\nSingh \n\nEach MDT requires the agent to execute a sequence of actions to control the envi(cid:173)\nronment, either to bring it to a desired state or to traverse a desired state trajectory \nover time. Let S be the finite set of states and A be the finite set of actions available \nto the agent.l At each time step t, the agent observes the system's current state \nZt E S and executes action at E A. As a result, the agent receives a payoff with \nexpected value R(zt, at) E R and the system makes a transition to state Zt+l E S \nwith probability P:r:t:r:t+l (at). The agent's goal is to learn an optimal closed loop \ncontrol policy, i.e., a function assigning actions to states, that maximizes the agent's \nobjective. The objective used in this paper is J = E~o -yt R(zt, at), i.e., the sum \nof the payoffs over an infinite horizon. The discount factor, 0 ~ \"Y ~ I, allows \nfuture payoff to be weighted less than more immediate payoff. Throughout this \npaper, I will assume that the learning agent does not have access to a model of the \nenvironment. Reinforcement learning algorithms such as Sutton's (1988) temporal \ndifference algorithm and Watkins's (1989) Q-Iearning algorithm can be used to learn \nto solve single MDTs (also see Barto et al., 1991). \nI consider compositionally-structured MDTs because they allow the possibility of \nsharing knowledge across the many tasks that have common subtasks. In general, \nthere may be n elemental MDTs labeled TI , T2 , \u2022\u2022\u2022 , Tn. Elemental MDTs cannot be \ndecomposed into simpler subtasks. Compo8ite MDTs, labeled GI , G2 , \u2022\u2022\u2022 , Gm , are \nproduced by temporally concatenating a number of elemental MDTs. For example, \nG; = [T(j, I)T(j, 2) ... T(j, k)] is composite task j made up of k elemental tasks that \nhave to be performed in the order listed. For 1 $ i $ k, T(j, i) E {TI' T2 , \u2022\u2022\u2022 , Tn} is \nthe itk elemental task in the list for task G;. 
The sequence of elemental tasks in a composite task will be referred to as the decomposition of the composite task; the decomposition is assumed to be unknown to the learning agent. \n\nCompositional learning involves solving a composite task by learning to compose the solutions of the elemental tasks in its decomposition. It is to be emphasized that, given the short-term, evaluative nature of the payoff from the environment (often the agent gets informative payoff only at the completion of the composite task), the task of discovering the decomposition of a composite task is formidable. In this paper I propose a compositional learning scheme in which separate modules learn to solve the elemental tasks, and a task-sensitive gating module solves composite tasks by learning to compose the appropriate elemental modules over time. \n\n2 ELEMENTAL AND COMPOSITE TASKS \n\nAll elemental tasks are MDTs that share the same state set S, action set A, and environment dynamics. The payoff function for each elemental task T_i, 1 ≤ i ≤ n, is R_i(x, a) = Σ_{y∈S} P_{xy}(a) r_i(y) − c(x, a), where r_i(y) is a positive reward associated with the state y resulting from executing action a in state x for task T_i, and c(x, a) is the positive cost of executing action a in state x. I assume that r_i(x) = 0 if x is not the desired final state for T_i. Thus, the elemental tasks share the same cost function but have their own reward functions. \n\n1 The extension to the case where different sets of actions are available in different states is straightforward. \n\nA composite task is not itself an MDT because the payoff is a function of both the state and the current elemental task, instead of the state alone. Formally, the new state set2 for a composite task, S', is formed by augmenting the elements of set S by n bits, one for each elemental task. 
For each x' ∈ S', the projected state x ∈ S is defined as the state obtained by removing the augmenting bits from x'. The environment dynamics and cost function, c, for a composite task are defined by assigning to each x' ∈ S' and a ∈ A the transition probabilities and cost assigned to the projected state x ∈ S and a ∈ A. The reward function for composite task C_j, r_j, is defined as follows. r_j(x') ≥ 0 if the following are all true: i) the projected state x is the final state for some elemental task in the decomposition of C_j, say task T_i, ii) the augmenting bits of x' corresponding to elemental tasks appearing before and including subtask T_i in the decomposition of C_j are one, and iii) the rest of the augmenting bits are zero; r_j(x') = 0 everywhere else. \n\n3 COMPOSITIONAL Q-LEARNING \n\nFollowing Watkins (1989), I define the Q-value, Q(x, a), for x ∈ S and a ∈ A, as the expected return on taking action a in state x under the condition that an optimal policy is followed thereafter. Given the Q-values, a greedy policy that in each state selects an action with the highest associated Q-value is optimal. Q-learning works as follows. On executing action a in state x at time t, the resulting payoff and next state are used to update the estimate of the Q-value at time t, Q_t(x, a): \n\nQ_{t+1}(x, a) = (1 − α_t) Q_t(x, a) + α_t [R(x, a) + γ max_{a'∈A} Q_t(y, a')],   (1) \n\nwhere y is the state at time t + 1, and α_t is the value of a positive learning-rate parameter at time t. Watkins and Dayan (1992) prove that under certain conditions on the sequence {α_t}, if every state-action pair is updated infinitely often using Equation 1, Q_t converges to the true Q-values asymptotically. \n\nCompositional Q-learning (CQ-learning) is a method for constructing the Q-values of a composite task from the Q-values of the elemental tasks in its decomposition. 
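The update in Equation 1 can be sketched as a one-step tabular Q-learning rule. This is a minimal sketch, not the paper's connectionist implementation; the states, actions, and numeric values below are hypothetical illustrations.

```python
# Minimal tabular sketch of the Q-learning update of Equation 1.
# States, actions, and numbers here are hypothetical illustrations.
from collections import defaultdict

Q = defaultdict(float)            # Q[(state, action)], initialized to 0.0

def q_update(x, a, payoff, y, actions, alpha=0.1, gamma=1.0):
    '''Apply Equation 1 once, after taking action a in state x,
    receiving payoff R(x, a), and landing in next state y.'''
    target = payoff + gamma * max(Q[(y, ap)] for ap in actions)
    Q[(x, a)] = (1.0 - alpha) * Q[(x, a)] + alpha * target

actions = ['N', 'S', 'E', 'W']
q_update('s0', 'N', -0.05, 's1', actions)   # a costly non-goal step
```

With all estimates initialized to zero, a single update moves Q(s0, N) a fraction alpha toward the observed payoff, as Equation 1 prescribes.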
\nLet Q_{T_i}(x, a) be the Q-value of (x, a), x ∈ S and a ∈ A, for elemental task T_i, and let Q^{C_j}_{T_i}(x', a) be the Q-value of (x', a), for x' ∈ S' and a ∈ A, for task T_i when performed as part of the composite task C_j = [T(j,1) ... T(j,k)]. Assume T_i = T(j, l). Note that the superscript on Q refers to the task and the subscript refers to the elemental task currently being performed. The absence of a superscript implies that the task is elemental. \n\nConsider a set of undiscounted (γ = 1) MDTs that have compositional structure and satisfy the following conditions: \n(A1) Each elemental task has a single desired final state. \n(A2) For all elemental and composite tasks, the expected value of undiscounted return for an optimal policy is bounded both from above and below for all states. \n(A3) The cost associated with each state-action pair is independent of the task being accomplished. \n(A4) For each elemental task T_i, the reward function r_i is zero for all states except the desired final state for that task. For each composite task C_j, the reward function r_j is zero for all states except possibly the final states of the elemental tasks in its decomposition (Section 2). \n\nThen, for any elemental task T_i and for all composite tasks C_j containing elemental task T_i, the following holds: \n\nQ^{C_j}_{T_i}(x', a) = Q_{T_i}(x, a) + K(C_j, T(j, l)),   (2) \n\nfor all x' ∈ S' and a ∈ A, where x ∈ S is the projected state, and K(C_j, T(j, l)) is a function of the composite task C_j and subtask T(j, l), where T_i = T(j, l). Note that K(C_j, T(j, l)) is independent of the state and the action. \n\n2 The theory developed in this paper does not depend on the particular extension of S chosen, as long as the appropriate connection between the new states and the elements of S can be made. 
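Equation 2 can be sketched directly: the composite-task Q-values are the elemental Q-values shifted by an offset K that depends on neither the state nor the action, so the greedy action in any projected state is unchanged. The tables and the value of K below are hypothetical toy values.

```python
# Toy sketch of Equation 2: Q^{Cj}_{Ti}(x', a) = Q_{Ti}(x, a) + K(Cj, T(j, l)).
# Q_elem and K are hypothetical illustrations, not learned values.

Q_elem = {('s0', 'E'): 0.8, ('s0', 'W'): 0.2}    # Q_{Ti}(x, a)
K = -0.3                                          # no dependence on x or a

def project(x_aug):
    '''Strip the augmenting bits, recovering the projected state x in S.'''
    return x_aug[0]

def q_composite(x_aug, a):
    '''Q-value of (x', a) for Ti performed inside composite task Cj.'''
    return Q_elem[(project(x_aug), a)] + K

x_aug = ('s0', (1, 0, 0))                         # projected state plus n bits
greedy_elem = max(['E', 'W'], key=lambda a: Q_elem[('s0', a)])
greedy_comp = max(['E', 'W'], key=lambda a: q_composite(x_aug, a))
```

Because K is a constant shift over all state-action pairs, greedy_elem and greedy_comp agree; only the values of K for the different subtasks remain to be learned.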
Thus, given solutions of the elemental tasks, learning the solution of a composite task with n elemental tasks requires learning only the values of the function K for the n different subtasks. A proof of Equation 2 is given in Singh (1992). \n\nFigure 1: The CQ-Learning Architecture (CQ-L). This figure is adapted from Jacobs et al. (1991). See text for details. \n\nEquation 2 is based on the assumption that the decomposition of the composite tasks is known. In the next section, I present a modular architecture and learning algorithm that simultaneously discovers the decomposition of a composite task and implements Equation 2. \n\n4 CQ-L: CQ-LEARNING ARCHITECTURE \n\nJacobs (1991) developed a modular connectionist architecture that performs task decomposition. Jacobs's gating architecture consists of several expert networks and a gating network that has an output for each expert network. The architecture has been used to learn multiple non-sequential tasks within the supervised learning paradigm. I extend the modular network architecture to a CQ-Learning architecture (Figure 1), called CQ-L, that can learn multiple compositionally-structured sequential tasks even when training information required for supervised learning is not available. \n\nTable 1: Tasks. Tasks T1, T2, and T3 are elemental tasks; tasks C1, C2, and C3 are composite tasks. The last column describes the compositional structure of the tasks. \n\nLabel   Command   Description                    Decomposition \nT1      000001    Visit A                        T1 \nT2      000010    Visit B                        T2 \nT3      000100    Visit C                        T3 \nC1      001000    Visit A and then C             T1 T3 \nC2      010000    Visit B and then C             T2 T3 \nC3      100000    Visit A, then B, and then C    T1 T2 T3 
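One way to picture the gating mechanism is as a stochastic selector over expert (Q-network) outputs, with the gating network's outputs treated as selection scores and a bias term added to the chosen expert's output. This is a hypothetical sketch, much simpler than the connectionist version; the numeric values are illustrative only.

```python
# Hypothetical sketch of the gating mechanism: the gating network produces
# one score per expert Q-network; a stochastic switch picks one expert, and
# the architecture's output is that expert's output plus a bias term.
import math
import random

def softmax(scores):
    '''Turn raw gate scores into selection probabilities.'''
    exps = [math.exp(s) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def cq_output(expert_outputs, gate_scores, bias, rng=random):
    probs = softmax(gate_scores)                      # selection probabilities
    i = rng.choices(range(len(probs)), weights=probs)[0]
    return expert_outputs[i] + bias, i

q_hat, chosen = cq_output([0.4, 0.9, 0.1], [0.0, 3.0, -1.0], bias=0.2)
```

Under this picture, experts that the gate scores highly are selected (and hence trained) more often, which is what lets a single expert come to own a single elemental task.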
CQ-L combines CQ-learning and the gating architecture to achieve transfer of learning by \"sharing\" the solutions of elemental tasks across multiple composite tasks. Only a very brief description of CQ-L is provided in this paper; details are given in Singh (1992). \n\nIn CQ-L the expert networks are Q-learning networks that learn to approximate the Q-values for the elemental tasks. The Q-networks receive as input both the current state and the current action. The gating and bias networks (Figure 1) receive as input the augmenting bits and the task command used to encode the current task being performed by the architecture. The stochastic switch in Figure 1 selects one Q-network at each time step. CQ-L's output, Q, is the output of the selected Q-network added to the output of the bias network. \n\nThe learning rules used to train the network perform gradient ascent in the log likelihood, L(t), of generating the estimate of the desired Q-value at time t, denoted D(t), and are given below: \n\nq_j(t+1) = q_j(t) + α_Q ∂ log L(t)/∂ q_j(t), \ns_i(t+1) = s_i(t) + α_g ∂ log L(t)/∂ s_i(t), and \nb(t+1) = b(t) + α_b (D(t) − Q(t)), \n\nwhere q_j is the output of the jth Q-network, s_i is the ith output of the gating network, b is the output of the bias network, and α_Q, α_b, and α_g are learning-rate parameters. The backpropagation algorithm (e.g., Rumelhart et al., 1986) was used to update the weights in the networks. See Singh (1992) for details. \n\n5 NAVIGATION TASK \n\nTo illustrate the utility of CQ-L, I use a navigational test bed similar to the one used by Bachrach (1991) that simulates a planar robot that can translate simultaneously and independently in both the x and y directions. It can move one radius in any direction on each time step. The robot has 8 distance sensors and 8 gray-scale sensors evenly placed around its perimeter. \n\nFigure 2: Navigation Testbed. See text for details. 
These 16 values constitute the state vector. Figure 2 shows a display created by the navigation simulator. The bottom portion of the figure shows the robot's environment as seen from above. The upper panel shows the robot's state vector. Three different goal locations, A, B, and C, are marked on the test bed. The set of tasks on which the robot is trained is shown in Table 1. The elemental tasks require the robot to go to the given goal location from a random starting location in minimum time. The composite tasks require the robot to go to a goal location via a designated sequence of subgoal locations. \n\nTask commands were represented by standard unit basis vectors (Table 1), and thus the architecture could not \"parse\" the task command to determine the decomposition of a composite task. Each Q-network was a feedforward connectionist network with a single hidden layer containing 128 radial basis units. The bias and gating networks were also feedforward nets with a single hidden layer containing sigmoid units. For all x ∈ S ∪ S' and a ∈ A, c(x, a) = −0.05. r_i(x) = 1.0 only if x is the desired final state of elemental task T_i, or if x ∈ S' is the final state of composite task C_i; r_i(x) = 0.0 in all other states. Thus, for composite tasks no intermediate payoff for successful completion of subtasks was provided. \n\n6 SIMULATION RESULTS \n\nIn the simulation described below, the performance of CQ-L is compared to the performance of a \"one-for-one\" architecture that implements the \"learn-each-task-separately\" strategy. The one-for-one architecture has a pre-assigned distinct network for each task, which prevents transfer of learning. Each network of the one-for-one architecture was provided with the augmented state. 
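The payoff structure used on the test bed can be sketched as follows. This is a minimal sketch; the goal test `is_final` is a hypothetical stand-in for the simulator's check that the desired final state of the current task has been reached.

```python
# Sketch of the test-bed payoff: every action costs 0.05, and a reward of
# 1.0 arrives only at the desired final state of the current task.
# `is_final` is a hypothetical stand-in for the simulator's goal test.

COST = 0.05
REWARD = 1.0

def payoff(x, a, y, is_final):
    '''Payoff realized on the transition x --a--> y for the current task.'''
    r = REWARD if is_final(y) else 0.0
    return r - COST

at_goal = lambda y: y == 'C'       # e.g., an elemental task: visit location C
```

Because the cost is charged on every step and the reward only at the end, no intermediate payoff signals the completion of subtasks within a composite task.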
\n\nTrW NIJ1rioer (for T .. k A) \n\n, ... \n\n-\n, . \n.. \nt\u00b7\u00b7 \n'I \n\u2022 \n1 \n\n8. \nI \n\noo \n\nCOA. \n\n-\n-- - ONE-FOA.oNE \n, \n\n\" \n' \n\n0 \n0 \n\n, \no', \nf .... , \n\n' ' \n~ ,~ I', \n'; V \\ \n.'t . \n, \n, ,1,1 \nI,' \n.. \nI \n\n... \n... \n1-\nt-\n'I \n\u2022 \n1-\n.. \n\nI \n\n,-\n\n, \n\nTrial Nurrber (for T .. k [AB)) \n\n-\n\nC<>L \n\n------\n-\n\n, \nTil .. Number (fer TMk [ABC)) \n\nFigure 3: Learning Curves for Multiple tasks. \n\nBoth CQ-L and the one-for-one architecture were separately trained on the six \ntasks T 1 , T2, T3 , C lI C2 , and C3 until they could perform the six tasks optimally. \nCQ-L contained three Q-networks, and the one-for-one architecture contained six \nQ-networks. For each trial, the starting state of the robot and the task identity \nwere chosen randomly. A trial ended when the robot reached the desired final state \nor when there was a time-out. The time-out period was 100 for the elemental tasks, \n200 for C1 and C2 , and 500 for task C3 \u2022 The graphs in Figure 3 show the number \nof actions executed per trial. Separate statistics were accumulated for each task. \n\nThe rightmost graph shows the performance of the two architectures on elemental \ntask TI. Not surprisingly, the one-for-one architecture performs better because \nit does not have the overhead of figuring out which Q-network to train for task \nT1 . The middle graph shows the performance on task C I and shows that the CQ(cid:173)\nL architecture is able to perform better than the one-for-one architecture for a \ncomposite task containing just two elemental tasks. The leftmost graph shows the \nresults for composite task C3 and illustrates the main point of this paper. The one(cid:173)\nfor-one architecture is unable to learn the task, in fact it is unable to perform the \ntask more than a couple of times due to the low probability of randomly performing \nthe correct task sequence. 
\n\nThis simulation shows that CQ-L is able to learn the decomposition of a composite \ntask and that compositional learning, due to transfer of training across tasks, can \nbe faster than learning each composite task separately. More importantly, CQ-L \nis able to learn to solve composite tasks that cannot be solved using traditional \nschemes. \n\n7 DISCUSSION \n\nLearning to solve MDTs with large state sets is difficult due to the sparseness of the \nevaluative information and the low probability that a randomly selected sequence \nof actions will be optimal. Learning the long sequences of actions required to solve \nsuch tasks can be accelerated considerably if the agent has prior knowledge of useful \nsubsequences. Such subsequences can be learned through experience in learning to \n\n\f258 \n\nSingh \n\nsolve other tasks. In this paper, I define a class of MOTs, called composite MOTs, \nthat are structured as the temporal concatenation of simpler MOTs, called elemen(cid:173)\ntal MOTs. I present CQ-L, an architecture that combines the Q-Iearning algorithm \nof Watkins (1989) and the modular architecture of Jacobs et al. (1991) to achieve \ntransfer of learning by sharing the solutions of elemental tasks across multiple com(cid:173)\nposite tasks. Given a set of composite and elemental MOTs, the sequence in which \nthe learning agent receives training experiences on the different tasks determines the \nrelative advantage of CQ-L over other architectures that learn the tasks separately. \nThe simulation reported in Section 6 demonstrates that it is possible to train CQ-L \non intermixed trials of elemental and composite tasks. Nevertheless, the ability of \nCQ-L to scale well to complex sets of tasks will depend on the choice of the training \nsequence. \n\nAcknowledgements \n\nThis work was supported by the Air Force Office of Scientific Research, Bolling \nAFB, under Grant AFOSR-89-0526 and by the National Science Foundation under \nGrant ECS-8912623. 
I am very grateful to Andrew Barto for his extensive help in formulating these ideas and preparing this paper. \n\nReferences \n\nJ. R. Bachrach. (1991) A connectionist learning control architecture for navigation. In R. P. Lippmann, J. E. Moody, and D. S. Touretzky, editors, Advances in Neural Information Processing Systems 3, pages 457-463, San Mateo, CA. Morgan Kaufmann. \nA. G. Barto, S. J. Bradtke, and S. P. Singh. (1991) Real-time learning and control using asynchronous dynamic programming. Technical Report 91-57, University of Massachusetts, Amherst, MA. Submitted to AI Journal. \nR. A. Jacobs. (1990) Task decomposition through competition in a modular connectionist architecture. PhD thesis, COINS Dept., Univ. of Massachusetts, Amherst, Mass., U.S.A. \nR. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton. (1991) Adaptive mixtures of local experts. Neural Computation, 3(1). \nD. E. Rumelhart, G. E. Hinton, and R. J. Williams. (1986) Learning internal representations by error propagation. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1: Foundations. Bradford Books/MIT Press, Cambridge, MA. \nS. P. Singh. (1992) Transfer of learning by composing solutions for elemental sequential tasks. Machine Learning. \nR. S. Sutton. (1988) Learning to predict by the methods of temporal differences. Machine Learning, 3:9-44. \nC. J. C. H. Watkins. (1989) Learning from Delayed Rewards. PhD thesis, Cambridge Univ., Cambridge, England. \nC. J. C. H. Watkins and P. Dayan. (1992) Q-learning. Machine Learning. \n", "award": [], "sourceid": 569, "authors": [{"given_name": "Satinder", "family_name": "Singh", "institution": null}]}