{"title": "Reinforcement Learning for Mixed Open-loop and Closed-loop Control", "book": "Advances in Neural Information Processing Systems", "page_first": 1026, "page_last": 1032, "abstract": null, "full_text": "Reinforcement Learning for Mixed \nOpen-loop and Closed-loop Control \n\nEric A. Hansen, Andrew G. Barto, and Shlorno Zilberstein \n\nDepartment of Computer Science \n\nUniversity of Massachusetts \n\nAmherst, MA 01003 \n\n{hansen.barto.shlomo }10 \n\n.1 \"\\10 \n\n~ :>;:>;0 \n\nX WWNNNO \n\nI ~ WWO \n\nC) NWO \n\nI~ wwwo \n\nIII WNwn \n\n17 NNNO \n\nII WW() \n\n11\\ WNNNO \n\n~ NNWO \n\n12 WWWO \n\n1<) WWNO \n\n6 NNNO \n\nL' NNO \n\n20 WWWNO \n\n( h) \n\nFigure 2: (a) Grid world with numbered states (b) Optimal policy \n\nWe use the notation 9 and h in our statement of the pruning rule to emphasize \nits relationship to pruning in heuristic search. If we regard the root of a tree of \nmemory states as the start state and the memory state that corresponds to the \nbest open-loop action sequence as the goal state, then 9 can be regarded as the \ncost-to-arrive function and the value of perfect information h can be regarded as an \nupper bound on the cost-to-go function. \n\n4 Example \n\nWe describe a simple example to illustrate the extent of pruning possible using this \nrule. Imagine that a \"robot\" must find its way to a goal location in the upper \nleft-hand corner of the grid shown in Figure 2a. Each cell of the grid corresponds \nto a state, with the states numbered for convenient reference. The robot has five \ncontrol actions; it can move north, east, south, or west, one cell at a time, or it \ncan stop. The problem ends when the robot stops. If it stops in the goal state it \nreceives a reward of 100, otherwise it receives no reward. The robot must execute \na sequence of actions to reach the goal state, but its move actions are stochastic. If \nthe robot attempts to move in a particular direction, it succeeds with probability \no.s. 
With probability 0.05 it moves in a direction 90 degrees off to one side of its \nintended direction, with probability 0.05 it moves in a direction 90 degrees off to the \nother side, and with probability 0.1 it does not move at all. If the robot's movement \nwould take it outside the grid, it remains in the same cell. Because its progress is \nuncertain, the robot must interleave sensing and control actions to keep track of its \nlocation. The reward for sensing is -1 (i.e., a cost of 1) and for each move action it \nis -4. To optimize expected total reward, the robot must find its way to the goal \nwhile minimizing the combined cost of moving and sensing. \n\nFigure 2b shows the optimal open-loop sequence of actions for each observable state. \nIf the bound on the length of an open-loop sequence of control actions is five, the \nnumber of possible memory states for this problem is over 64,000, a number that \ngrows explosively as the length bound is increased (to over 16 million when the \nbound is nine). Using the pruning rule, Q-learning must explore just under 1,000 \nmemory states (and no deeper than nine levels in any tree) to converge to an optimal \npolicy, even when there is no bound on the interval between sensing actions. \n\n5 Conclusion \n\nWe have described an extension of Q-learning for MDPs with sensing costs and \na rule for limiting exploration that makes it possible for Q-learning to converge \nto an optimal policy despite exploring only a fraction of the possible memory states. As \nalready pointed out, the problem we have formalized is a partially observable MDP, \nalthough one that is restricted by the assumption that sensing provides perfect \ninformation. 
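The dynamics and costs of the grid-world example above can be sketched in code as follows. This is a minimal illustrative sketch: the grid dimensions, coordinate layout, and function names are assumptions, since the paper specifies only the stochastic move model and the rewards.

```python
import random

# Sketch of the Section 4 grid world (layout assumed: 4 rows x 5 columns
# with the goal at the upper-left cell; the paper gives only the stochastic
# move model and the costs, so the dimensions here are illustrative).
ROWS, COLS = 4, 5
GOAL = (0, 0)
MOVES = {'N': (-1, 0), 'S': (1, 0), 'E': (0, 1), 'W': (0, -1)}
PERP = {'N': 'EW', 'S': 'EW', 'E': 'NS', 'W': 'NS'}  # 90-degree slips

def step(state, action, rng=random):
    """One move: 0.8 intended direction, 0.05 each perpendicular slip, 0.1 stay."""
    r = rng.random()
    if r < 0.8:
        d = action
    elif r < 0.85:
        d = PERP[action][0]
    elif r < 0.9:
        d = PERP[action][1]
    else:
        return state                      # robot fails to move
    nr, nc = state[0] + MOVES[d][0], state[1] + MOVES[d][1]
    if 0 <= nr < ROWS and 0 <= nc < COLS:
        return (nr, nc)
    return state                          # move would leave the grid: stay put

def run_open_loop(state, actions, rng=random):
    """Execute an open-loop move sequence, then sense; return (state, reward)."""
    reward = 0
    for a in actions:
        state = step(state, a, rng)
        reward -= 4                       # each move action costs 4
    reward -= 1                           # the sensing action costs 1
    return state, reward
```

Stopping in the goal cell would add the +100 terminal reward; the trade-off an optimal controller faces is that longer open-loop sequences save sensing costs but compound the movement uncertainty.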
An interesting direction in which to pursue this work would be to \nexplore its relationship to work on RL for partially observable MDPs, which has so \nfar focused on the problem of sensor uncertainty and hidden state. Because some \nof this work also makes use of tree representations of the state space and of learned \nstate-action values (e.g., McCallum, 1995), it may be that a similar pruning rule \ncan constrain exploration for such problems. \n\nAcknowledgements \n\nSupport for this work was provided in part by the National Science Foundation under \ngrants ECS-9214866 and IRI-9409827 and in part by Rome Laboratory, USAF, \nunder grant F30602-95-1-0012. \n\nReferences \n\nBarto, A.G.; Bradtke, S.J.; & Singh, S.P. (1995) Learning to act using real-time \ndynamic programming. Artificial Intelligence 72(1/2):81-138. \n\nHansen, E.A. (1997) Markov decision processes with observation costs. University \nof Massachusetts at Amherst, Computer Science Technical Report 97-01. \n\nMcCallum, R.A. (1995) Instance-based utile distinctions for reinforcement learning \nwith hidden state. In Proc. 12th Int. Machine Learning Conf. Morgan Kaufmann. \n\nMonahan, G.E. (1982) A survey of partially observable Markov decision processes: \nTheory, models, and algorithms. Management Science 28:1-16. \n\nTan, M. (1991) Cost-sensitive reinforcement learning for adaptive classification and \ncontrol. In Proc. 9th Nat. Conf. on Artificial Intelligence. AAAI Press/MIT Press. \n\nWatkins, C.J.C.H. (1989) Learning from delayed rewards. Ph.D. Thesis, University \nof Cambridge, England. \n\nWatkins, C.J.C.H. & Dayan, P. (1992) Technical note: Q-learning. Machine Learning \n8(3/4):279-292. \n\nWhitehead, S.D. & Lin, L.-J. (1995) Reinforcement learning of non-Markov decision \nprocesses. Artificial Intelligence 73:271-306. \n\nAppendix \n\nProof of theorem: Consider an MDP with a state set that consists only of the \nmemory states that are not pruned. 
We call it a \"pruned MDP\" to distinguish \nit from the original MDP, for which the state set consists of all possible memory \nstates. Because the pruned MDP is a finite state and action MDP, Q-learning with \npruning converges with probability one. What we must show is that the state-action \nvalues to which it converges include every state-action pair visited by an optimal \ncontroller for the original MDP, and that for each of these state-action pairs the \nlearned state-action value is equal to the optimal state-action value for the original \nMDP. \n\nLet Q̂ and V̂ denote the values that are learned by Q-learning when its exploration is \nlimited by the pruning rule, and let Q and V denote value functions that are optimal \nwhen the state set of the MDP includes all possible memory states. Because an MDP \nhas an optimal stationary policy and each control action causes a deterministic \ntransition to a subsequent memory state, there is an optimal path through each \ntree of memory states. The learned value of the root state of each tree is optimal if \nand only if the learned value of each memory state along this path is also optimal. \nTherefore, to show that Q-learning with pruning converges to an optimal state-action \nvalue function, it is sufficient to show that V̂(x) = V(x) for every observable state \nx. Our proof is by induction on the number of control actions that can be taken \nbetween one sensing action and the next. We use the fact that if Q-learning has \nconverged, then ĝ(xa_1...a_i) = g(xa_1...a_i) and ĥ(xa_1...a_i) = γ^i Σ_y p(x, a_1...a_i, y) V̂(y) for \nevery memory state xa_1...a_i. 
\nFirst note that if ĝ(xa_1) + γ r(xa_1, o) + ĥ(xa_1) > V̂(x), that is, if V̂ for some \nobservable state x can be improved by exploring a path of a single control action \nfollowed by sensing, then it is contradictory to suppose Q-learning with pruning has \nconverged, because single-depth memory states in a tree are never pruned. Now \nmake the inductive hypothesis that Q-learning with pruning has not converged if V̂ \ncan be improved for some observable state by exploring a path of fewer than k control \nactions before sensing. We show that it has not converged if V̂ can be improved for \nsome observable state by exploring a path of k control actions before sensing. \n\nSuppose V̂ for some observable state x can be improved by exploring a path that \nconsists of taking the sequence of control actions a_1...a_k before sensing, that is, \n\nĝ(xa_1...a_k) + γ^k r(xa_1...a_k, o) + ĥ(xa_1...a_k) > V̂(x). \n\nSince only pruning can prevent improvement in this case, let xa_1...a_i be the memory \nstate at which application of the pruning rule prevents xa_1...a_k from being explored. \nBecause the tree has been pruned at this node, V̂(x) ≥ ĝ(xa_1...a_i) + ĥ(xa_1...a_i), and \nso \n\nĝ(xa_1...a_k) + γ^k r(xa_1...a_k, o) + ĥ(xa_1...a_k) > ĝ(xa_1...a_i) + ĥ(xa_1...a_i). \n\nWe can expand this inequality as follows: \n\nĝ(xa_1...a_i) + γ^i Σ_y p(x, a_1...a_i, y) [ĝ(ya_{i+1}...a_k) + γ^{k-i} r(ya_{i+1}...a_k, o) + ĥ(ya_{i+1}...a_k)] \n> ĝ(xa_1...a_i) + ĥ(xa_1...a_i). \n\nSimplification and expansion of ĥ yields \n\nΣ_y p(x, a_1...a_i, y) [ĝ(ya_{i+1}...a_k) + γ^{k-i} r(ya_{i+1}...a_k, o) + γ^{k-i} Σ_z p(y, a_{i+1}...a_k, z) V̂(z)] \n> Σ_y p(x, a_1...a_i, y) V̂(y). \n\nTherefore, there is some observable state y such that \n\nĝ(ya_{i+1}...a_k) + γ^{k-i} r(ya_{i+1}...a_k, o) + γ^{k-i} Σ_z p(y, a_{i+1}...a_k, z) V̂(z) > V̂(y). \n\nBecause the value of observable state y can be improved by taking fewer than k \ncontrol actions before sensing, by the inductive hypothesis Q-learning has not yet \nconverged. 
□ \n\nThe proof provides insight into how pruning works. If a state-action pair along \nsome optimal path is temporarily pruned, it must be possible to improve the value \nof some observable state by exploring a shorter path of memory states that has \nnot been pruned. The resulting improvement of the value function changes the \nthreshold for pruning, so the state-action pair that was formerly pruned may no \nlonger be, making further improvement of the learned value function possible. \n", "award": [], "sourceid": 1278, "authors": [{"given_name": "Eric", "family_name": "Hansen", "institution": null}, {"given_name": "Andrew", "family_name": "Barto", "institution": null}, {"given_name": "Shlomo", "family_name": "Zilberstein", "institution": null}]}