{"title": "How to Dynamically Merge Markov Decision Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 1057, "page_last": 1063, "abstract": "", "full_text": "How to Dynamically Merge Markov Decision Processes \n\nSatinder Singh \nDepartment of Computer Science \nUniversity of Colorado \nBoulder, CO 80309-0430 \nbaveja@cs.colorado.edu \n\nDavid Cohn \nAdaptive Systems Group \nHarlequin, Inc. \nMenlo Park, CA 94025 \ncohn@harlequin.com \n\nAbstract \n\nWe are frequently called upon to perform multiple tasks that compete for our attention and resources. Often we know the optimal solution to each task in isolation; in this paper, we describe how this knowledge can be exploited to efficiently find good solutions for doing the tasks in parallel. We formulate this problem as that of dynamically merging multiple Markov decision processes (MDPs) into a composite MDP, and present a new theoretically-sound dynamic programming algorithm for finding an optimal policy for the composite MDP. We analyze various aspects of our algorithm and illustrate its use on a simple merging problem. \n\nEvery day, we are faced with the problem of doing multiple tasks in parallel, each of which competes for our attention and resources. If we are running a job shop, we must decide which machines to allocate to which jobs, and in what order, so that no jobs miss their deadlines. If we are a mail delivery robot, we must find the intended recipients of the mail while simultaneously avoiding fixed obstacles (such as walls) and mobile obstacles (such as people), and still manage to keep ourselves sufficiently charged up. \n\nFrequently we know how to perform each task in isolation; this paper considers how we can take the information we have about the individual tasks and combine it to efficiently find an optimal solution for doing the entire set of tasks in parallel. 
More importantly, we describe a theoretically-sound algorithm for doing this merging dynamically; new tasks (such as a new job arrival at a job shop) can be assimilated online into the solution being found for the ongoing set of simultaneous tasks. \n\n1 The Merging Framework \n\nMany decision-making tasks in control and operations research are naturally formulated as Markov decision processes (MDPs) (e.g., Bertsekas & Tsitsiklis, 1996). Here we define MDPs and then formulate what it means to have multiple simultaneous MDPs. \n\n1.1 Markov decision processes (MDPs) \n\nAn MDP is defined via its state set S, action set A, transition probability matrices P, and payoff matrices R. On executing action a in state s, the probability of transiting to state s' is denoted P^a(ss') and the expected payoff associated with that transition is denoted R^a(ss'). We assume throughout that the payoffs are non-negative for all transitions. A policy assigns an action to each state of the MDP. The value of a state under a policy is the expected value of the discounted sum of payoffs obtained when the policy is followed starting in that state. The objective is to find an optimal policy, one that maximizes the value of every state. The optimal value of state s, V*(s), is its value under the optimal policy. \n\nThe optimal value function is the solution to the Bellman optimality equations: for all s ∈ S, V*(s) = max_{a∈A} Σ_{s'} P^a(ss')[R^a(ss') + γV*(s')], where the discount factor 0 ≤ γ < 1 makes future payoffs less valuable than more immediate payoffs (e.g., Bertsekas & Tsitsiklis, 1996). It is known that the optimal policy π* can be determined from V* as follows: π*(s) = argmax_{a∈A} Σ_{s'} P^a(ss')[R^a(ss') + γV*(s')]. Therefore solving an MDP is tantamount to computing its optimal value function. 
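These backup and policy-extraction rules can be made concrete with a short pure-Python sketch; the two-state MDP below, with its transition and payoff tables, is our own invented example (none of it comes from the paper):

```python
# Sketch: computing V* by Bellman backups and recovering
# pi*(s) = argmax_a sum_{s'} P^a(ss')[R^a(ss') + gamma * V*(s')].
# The tiny two-state MDP below is a made-up example, not from the paper.

GAMMA = 0.9
STATES = [0, 1]
ACTIONS = [0, 1]
# P[a][s][s'] and R[a][s][s']: action 0 stays put, action 1 switches states.
P = {0: {0: {0: 1.0, 1: 0.0}, 1: {0: 0.0, 1: 1.0}},
     1: {0: {0: 0.0, 1: 1.0}, 1: {0: 1.0, 1: 0.0}}}
R = {0: {0: {0: 0.0, 1: 0.0}, 1: {0: 0.0, 1: 1.0}},  # staying in state 1 pays 1
     1: {0: {0: 0.0, 1: 0.5}, 1: {0: 0.5, 1: 0.0}}}  # each switch pays 0.5

def backup(V, s, a):
    """One-step lookahead value of action a in state s."""
    return sum(P[a][s][s2] * (R[a][s][s2] + GAMMA * V[s2]) for s2 in STATES)

def value_iteration(n_iters=1000):
    """Iterate the Bellman optimality backup until V approximates V*."""
    V = {s: 0.0 for s in STATES}
    for _ in range(n_iters):
        V = {s: max(backup(V, s, a) for a in ACTIONS) for s in STATES}
    return V

def greedy_policy(V):
    """pi*(s) = argmax_a backup(V*, s, a)."""
    return {s: max(ACTIONS, key=lambda a: backup(V, s, a)) for s in STATES}

V_star = value_iteration()
pi_star = greedy_policy(V_star)
```

With γ = 0.9, the self-loop payoff of 1 in state 1 gives V*(1) = 1/(1 - γ) = 10, and the greedy rule recovers π*(0) = 1 (switch into the rewarding state) and π*(1) = 0 (stay there).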
\n\n1.2 Solving MDPs via Value Iteration \n\nGiven a model (S, A, P, R) of an MDP, value iteration (e.g., Bertsekas & Tsitsiklis, 1996) can be used to determine the optimal value function. Starting with an initial guess, V_0, iterate for all s: V_{k+1}(s) = max_{a∈A} Σ_{s'∈S} P^a(ss')[R^a(ss') + γV_k(s')]. It is known that max_{s∈S} |V_{k+1}(s) - V*(s)| ≤ γ max_{s∈S} |V_k(s) - V*(s)|, and therefore V_k converges to V* as k goes to infinity. Note that a Q-value (Watkins, 1989) based version of value iteration, and of our algorithm presented below, is also easily defined. \n\n1.3 Multiple Simultaneous MDPs \n\nThe notion of an optimal policy is well defined for a single task represented as an MDP. If, however, we have multiple tasks to do in parallel, each with its own state, action, transition probability, and payoff spaces, optimal behavior is not automatically defined. We will assume that payoffs sum across the MDPs, which means we want to select actions for each MDP at every time step so as to maximize the expected discounted value of this summed payoff over time. If actions can be chosen independently for each MDP, then the solution to this \"composite\" MDP is obvious: do what's optimal for each MDP. More typically, choosing an action for one MDP constrains what actions can be chosen for the others. In a job shop, for example, actions correspond to assignments of resources, and the same physical resource may not be assigned to more than one job simultaneously. \n\nFormally, we can define a composite MDP as a set of N MDPs {M^i}_{i=1}^N. We will use superscripts to distinguish the component MDPs, e.g., S^i, A^i, P^i, and R^i are the state, action, transition probability, and payoff parameters of MDP M^i. The state space of the composite MDP, S, is the cross product of the state spaces of the component MDPs, i.e., S = S^1 × S^2 × ... × S^N. 
The constraints on actions imply that the action set of the composite MDP, A, is some proper subset of the cross product of the N component action spaces. The transition probabilities and the payoffs of the composite MDP are factorial because the following decompositions hold: for all s, s' ∈ S and a ∈ A, P^a(ss') = ∏_{i=1}^N P^{a^i}(s^i s^i') and R^a(ss') = Σ_{i=1}^N R^{a^i}(s^i s^i'). Singh (1997) has previously studied such factorial MDPs, but only for the case of a fixed set of components. \n\nThe optimal value function of a composite MDP is well defined, and satisfies the following Bellman equation: for all s ∈ S, \n\nV(s) = max_{a∈A} Σ_{s'∈S} ∏_{i=1}^N P^{a^i}(s^i s^i') [ Σ_{i=1}^N R^{a^i}(s^i s^i') + γV(s') ].   (1) \n\nNote that the Bellman equation for a composite MDP assumes an identical discount factor across component MDPs and is not defined otherwise. \n\n1.4 The Dynamic Merging Problem \n\nGiven a composite MDP, and the optimal solution (e.g., the optimal value function) for each of its component MDPs, we would like to efficiently compute the optimal solution for the composite MDP. More generally, we would like to compute the optimal composite policy given only bounds on the value functions of the component MDPs (the motivation for this more general version will become clear in the next section). To the best of our knowledge, the dynamic merging question has not been studied before. \n\nNote that the traditional treatment of problems such as job-shop scheduling would formulate them as nonstationary MDPs (however, see Zhang and Dietterich, 1995 for another learning approach). This normally requires augmenting the state space to include a \"time\" component which indexes all possible state spaces that could arise (e.g., Bertsekas, 1995). This is inefficient, and potentially infeasible unless we know in advance all combinations of possible tasks we will be required to solve. 
One contribution of this paper is the observation that this type of nonstationary problem can be reformulated as one of dynamically merging (individually) stationary MDPs. \n\n1.4.1 The naive greedy policy is suboptimal \n\nGiven bounds on the value functions of the component MDPs, one heuristic composite policy is that of selecting actions according to a one-step greedy rule: \n\nπ(s) = argmax_{a∈A} Σ_{s'} ∏_{i=1}^N P^{a^i}(s^i s^i') [ Σ_{i=1}^N (R^{a^i}(s^i s^i') + γX^i(s^i')) ], \n\nwhere X^i is the upper or lower bound of the value function, or the mean of the bounds. It is fairly easy, however, to demonstrate that these policies are substantially suboptimal in many common situations (see Section 3). \n\n2 Dynamic Merging Algorithm \n\nConsider merging N MDPs; job-shop scheduling presents a special case of merging a single new MDP with an old composite MDP consisting of several factor MDPs. One obvious approach to finding the optimal composite policy would be to directly perform value iteration in the composite state and action space. A more efficient approach would make use of the solutions (bounds on optimal value functions) of the existing components; below we describe an algorithm for doing this. \n\nOur algorithm will assume that we know the optimal values, or more generally, upper and lower bounds on the optimal values of the states in each component MDP. We use the symbols L and U for the lower and upper bounds; if the optimal value function for the ith factor MDP is available, then L^i = U^i = V^{*,i}.¹ \n\nOur algorithm uses the bounds for the component MDPs to compute bounds on the values of composite states as needed, and then incrementally updates and narrows these initial bounds using a form of value iteration that allows pruning of actions that are not competitive, that is, actions whose bounded values are strictly dominated by the bounded value of some other action. 
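The dominance test that drives this pruning can be sketched in a few lines; the action names and interval values below are hypothetical, chosen only to illustrate the rule:

```python
# Sketch: pruning actions whose bounded values are strictly dominated.
# An action a is discarded when its upper bound falls below the best
# lower bound over all actions. The intervals are hypothetical.

def prune_dominated(bounds):
    """bounds: {action: (lower, upper)}. Return the competitive subset."""
    best_lower = max(lo for lo, hi in bounds.values())
    return {a: (lo, hi) for a, (lo, hi) in bounds.items() if hi >= best_lower}

bounds = {"north": (4.0, 9.0),   # overlaps the best action: keep
          "south": (1.0, 3.5),   # upper bound below best lower bound: prune
          "east":  (6.0, 8.0)}   # supplies the best lower bound
competitive = prune_dominated(bounds)
```

Here "south" is discarded because even its upper bound (3.5) falls below the best lower bound (6.0); the surviving actions still overlap and must be narrowed by further backups.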
\n\nInitial State: The initial composite state s_0 is composed from the start states of all the factor MDPs. In practice (e.g., in job-shop scheduling) the initial composite state is composed of the start state of the new job and whatever the current state of the set of old jobs is. Our algorithm exploits the initial state by only updating states that can occur from the initial state under competitive actions. \n\nInitial Value Step: When we need the value of a composite state s for the first time, we compute upper and lower bounds on its optimal value as follows: L(s) = max_{i=1}^N L^i(s^i), and U(s) = Σ_{i=1}^N U^i(s^i). \n\nInitial Update Step: We dynamically allocate upper and lower bound storage space for composite states as we first update them. We also create the initial set of competitive actions for s when we first update its value, as A(s) = A. As successive backups narrow the upper and lower bounds of successor states, some actions will no longer be competitive, and will be eliminated from further consideration. \n\nModified Value Iteration Algorithm: At step t, if the state to be updated is s_t: \n\nL_{t+1}(s_t) = max_{a∈A_t(s_t)} Σ_{s'} P^a(s_t s')[R^a(s_t s') + γL_t(s')] \n\nU_{t+1}(s_t) = max_{a∈A_t(s_t)} Σ_{s'} P^a(s_t s')[R^a(s_t s') + γU_t(s')] \n\nA_{t+1}(s_t) = {a ∈ A_t(s_t) : Σ_{s'} P^a(s_t s')[R^a(s_t s') + γU_t(s')] ≥ max_{b∈A_t(s_t)} Σ_{s'} P^b(s_t s')[R^b(s_t s') + γL_t(s')]} \n\ns_{t+1} = s_0 if s^i is terminal for all s^i ∈ s_t; otherwise some s' ∈ S such that P^a(s_t s') > 0 for some a ∈ A_{t+1}(s_t). \n\nThe algorithm terminates when only one competitive action remains for each state, or when the ranges of all competitive actions for each state are bounded by an indifference parameter ε. \n\nTo elaborate, the upper and lower bounds on the value of a composite state are backed up using a form of Equation 1. 
The set of actions that are considered competitive in that state is culled by eliminating any action whose bounded value is strictly dominated by the bounded value of some other action in A_t(s_t). The next state to be updated is chosen randomly from all the states that have non-zero probability of occurring under some action in A_{t+1}(s_t); or, if s_t is the terminal state of all component MDPs, then s_{t+1} is the start state again. \n\nA significant advantage of using these bounds is that we can prune actions whose upper bounds are worse than the best lower bound. Only states resulting from remaining competitive actions are backed up. When only one competitive action remains, the optimal policy for that state is known, regardless of whether its upper and lower bounds have converged. \n\nAnother important aspect of our algorithm is that it focuses the backups on states that are reachable under currently competitive actions from the start state. The combined effect of updating only states that are reachable from the start state, and among those only the ones reachable under currently competitive actions, can lead to significant computational savings. This is particularly critical in scheduling, where jobs proceed in a more or less feedforward fashion, and the composite start state when a new job comes in can eliminate a large portion of the composite state space. Ideas based on Kaelbling's (1990) interval-estimation algorithm and Moore & Atkeson's (1993) prioritized sweeping algorithm could also be combined into our algorithm. \n\n¹ Recall that unsuperscripted quantities refer to the composite MDP while superscripted quantities refer to component MDPs. Also, A is the set of actions that are available to the composite MDP after taking into account the constraints on picking actions simultaneously for the factor MDPs. 
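As a sketch of this bookkeeping, the fragment below seeds bounds for a hypothetical two-job composite state, performs one backup of L and U, and culls the competitive set. The model, the action names (serve-1, serve-2), and all the numbers are our inventions, not the paper's:

```python
# Sketch of the merging algorithm's bound bookkeeping: composite-state bounds
# are seeded from component bounds (L = max, U = sum), backed up with the
# Bellman operator, and actions are culled by interval dominance.
# The toy model and all numbers below are hypothetical.

GAMMA = 0.9

# Component bounds: L_comp[i][s_i] <= V*,i(s_i) <= U_comp[i][s_i].
L_comp = [{0: 2.0, 1: 0.0}, {0: 6.0, 1: 0.0}]
U_comp = [{0: 2.5, 1: 0.0}, {0: 6.5, 1: 0.0}]

def seed_bounds(s):
    """Initial bounds for a composite state s = (s_1, ..., s_N)."""
    L = max(L_comp[i][si] for i, si in enumerate(s))
    U = sum(U_comp[i][si] for i, si in enumerate(s))
    return L, U

# Toy composite model: from state (0, 0), "serve-1" advances job 1 and
# "serve-2" advances job 2, deterministically, with one expected payoff each.
P = {((0, 0), "serve-1"): {(1, 0): 1.0},
     ((0, 0), "serve-2"): {(0, 1): 1.0}}
R = {((0, 0), "serve-1"): 2.0, ((0, 0), "serve-2"): 3.0}

# Seed bounds for successor states on first touch.
L_tbl, U_tbl = {}, {}
for succ in [(1, 0), (0, 1)]:
    L_tbl[succ], U_tbl[succ] = seed_bounds(succ)

def backup(s, a, table):
    """sum_{s'} P^a(ss') [R^a(s,a) + gamma * bound(s')], R as expected payoff."""
    return sum(p * (R[(s, a)] + GAMMA * table[s2])
               for s2, p in P[(s, a)].items())

s = (0, 0)
actions = ["serve-1", "serve-2"]
L_new = max(backup(s, a, L_tbl) for a in actions)   # new lower bound for s
U_new = max(backup(s, a, U_tbl) for a in actions)   # new upper bound for s
# Keep only actions whose optimistic value beats the best pessimistic value.
competitive = [a for a in actions if backup(s, a, U_tbl) >= L_new]
```

In this toy instance serve-2 is pruned: its optimistic backup (5.25) cannot beat the pessimistic backup of serve-1 (7.4), so, as in the algorithm above, it never needs to be reconsidered.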
\n\nThe algorithm has a number of desirable \"anytime\" characteristics: if we have to pick an action in state s_0 before the algorithm has converged (while multiple competitive actions remain), we pick the action with the highest lower bound. If a new MDP arrives before the algorithm converges, it can be accommodated dynamically using whatever lower and upper bounds exist at the time it arrives. \n\n2.1 Theoretical Analysis \n\nIn this section we analyze various aspects of our algorithm. \n\nUpper Bound Calculation: For any composite state, the sum of the optimal values of the component states is an upper bound on the optimal value of the composite state, i.e., V*(s = s^1, s^2, ..., s^N) ≤ Σ_{i=1}^N V^{*,i}(s^i). \n\nIf there were no constraints among the actions of the factor MDPs, then V*(s) would equal Σ_{i=1}^N V^{*,i}(s^i) because of the additive payoffs across MDPs. The presence of constraints implies that the sum is an upper bound. Because V^{*,i}(s^i) ≤ U^i(s^i), the result follows. \n\nLower Bound Calculation: For any composite state, the maximum of the optimal values of the component states is a lower bound on the optimal value of the composite state, i.e., V*(s = s^1, s^2, ..., s^N) ≥ max_{i=1}^N V^{*,i}(s^i). \n\nTo see this for an arbitrary composite state s, let the MDP that has the largest component optimal value for state s always choose its component-optimal action first, and then assign actions to the other MDPs so as to respect the action constraints encoded in set A. This guarantees at least the value promised by that MDP because the payoffs are all non-negative. Because V^{*,i}(s^i) ≥ L^i(s^i), the result follows. \n\nPruning of Actions: For any composite state, if the upper bound for any composite action a is lower than the lower bound for some other composite action, then action a cannot be optimal; action a can then safely be discarded from the max in value iteration. 
Once discarded from the competitive set, an action never needs to be reconsidered. Our algorithm maintains the upper and lower bound status of U and L as it updates them. The result follows. \n\nConvergence: Given enough time, our algorithm converges to the optimal policy and optimal value function for the set of composite states reachable from the start state under the optimal policy. \n\nIf every state were updated infinitely often, value iteration would converge to the optimal solution for the composite problem independent of the initial guess V_0. The difference between standard value iteration and our algorithm is that we discard actions and do not update states that are not on the path from the start state under the continually pruned competitive actions. The actions we discard in a state are guaranteed not to be optimal and therefore cannot have any effect on the value of that state. Also, states that are reachable only under discarded actions are automatically irrelevant to performing optimally from the start state. \n\n3 An Example: Avoiding Predators and Eating Food \n\nWe illustrate the use of the merging algorithm on a simple avoid-predator-and-eat-food problem, depicted in Figure 1a. The component MDPs are the avoid-predator task and the eat-food task; the composite MDP must solve these problems simultaneously. In isolation, the tasks avoid-predator and eat-food are fairly easy to learn. The state space of each task is of size n^4: 625 states in the case illustrated. Using value iteration, the optimal solutions to both component tasks can be learned in approximately 1000 backups. Directly solving the composite problem requires n^6 states (15625 in our case), and roughly 1 million backups to converge. \n\nFigure 1b compares the performance of several solutions to the avoid-predator-and-eat-food task. 
The opt-predator and opt-food curves show the performance of value iteration on the two component tasks in isolation; both converge quickly to their optima. While it requires no further backups, the greedy algorithm of Section 1.4.1 falls short of optimal performance. Our merging algorithm, when initialized with solutions for the component tasks (5000 backups each), converges quickly to the optimal solution. Value iteration directly on the composite state space also finds the optimal solution, but requires 4-5 times as many backups. Note that value iteration in the composite state space also updated states along trajectories through the state space (as in Barto et al.'s, 1995, RTDP algorithm), just as in our merging algorithm, only without the benefit of the value function bounds and the pruning of non-competitive actions. \n\n4 Conclusion \n\nThe ability to perform multiple decision-making tasks simultaneously, and even to incorporate new tasks dynamically into ongoing previous tasks, is of obvious interest to both cognitive science and engineering. Using the framework of MDPs for individual decision-making tasks, we have reformulated the above problem as that of dynamically merging MDPs. We have presented a modified value iteration algorithm for dynamically merging MDPs, proved its convergence, and illustrated its use on a simple merging task. \n\nAs future work, we intend to apply our merging algorithm to a real-world job-shop scheduling problem, extend the algorithm into the framework of semi-Markov decision processes, and explore the performance of the algorithm in the case where a model of the MDPs is not available. 
\n\nFigure 1: a) Our agent (A) roams an n by n grid. It gets a payoff of 0.5 for every time step it avoids the predator (P), and earns a payoff of 1.0 for every piece of food (f) it finds. The agent moves two steps for every step P makes, and P always moves directly toward A. When food is found, it reappears at a random location on the next time step. On every time step, A has a 10% chance of ignoring its policy and making a random move. b) The mean payoff of different learning strategies vs. number of backups. The bottom two lines show that when trained on either task in isolation, a learner reaches the optimal payoff for that task in fewer than 5000 backups. The greedy approach makes no further backups, but performs well below optimal. The optimal composite solution, trained ab initio, requires nearly 1 million backups. Our algorithm begins with the 5000-backup solutions for the individual tasks, and converges to the optimum 4-5 times more quickly than the ab initio solution. \n\nAcknowledgements \n\nSatinder Singh was supported by NSF grant IIS-9711753. \n\nReferences \n\nBarto, A. G., Bradtke, S. J., & Singh, S. (1995). Learning to act using real-time dynamic programming. Artificial Intelligence, 72, 81-138. \n\nBertsekas, D. P. (1995). Dynamic Programming and Optimal Control. Belmont, MA: Athena Scientific. \n\nBertsekas, D. P. & Tsitsiklis, J. N. (1996). Neuro-Dynamic Programming. Belmont, MA: Athena Scientific. \n\nKaelbling, L. P. (1990). Learning in Embedded Systems. PhD thesis, Stanford University, Department of Computer Science, Stanford, CA. Technical Report TR-90-04. \n\nMoore, A. W. & Atkeson, C. G. (1993). Prioritized sweeping: Reinforcement learning with less data and less real time. Machine Learning, 13(1). \n\nSingh, S. (1997). Reinforcement learning in factorial environments. Submitted. \n\nWatkins, C. J. C. H. (1989). 
Learning from Delayed Rewards. PhD thesis, Cambridge University, Cambridge, England. \n\nZhang, W. & Dietterich, T. G. (1995). High-performance job-shop scheduling with a time-delay TD(λ) network. In Advances in Neural Information Processing Systems 8. MIT Press. \n", "award": [], "sourceid": 1420, "authors": [{"given_name": "Satinder", "family_name": "Singh", "institution": null}, {"given_name": "David", "family_name": "Cohn", "institution": null}]}