{"title": "Reinforcement Learning with Soft State Aggregation", "book": "Advances in Neural Information Processing Systems", "page_first": 361, "page_last": 368, "abstract": null, "full_text": "Reinforcement Learning with Soft State Aggregation \n\nSatinder P. Singh (singh@psyche.mit.edu), Tommi Jaakkola (tommi@psyche.mit.edu), Michael I. Jordan (jordan@psyche.mit.edu) \n\nDept. of Brain & Cognitive Sciences (E-10), M.I.T., Cambridge, MA 02139 \n\nAbstract \n\nIt is widely accepted that the use of more compact representations than lookup tables is crucial to scaling reinforcement learning (RL) algorithms to real-world problems. Unfortunately, almost all of the theory of reinforcement learning assumes lookup table representations. In this paper we address the pressing issue of combining function approximation and RL, and present 1) a function approximator based on a simple extension to state aggregation (a commonly used form of compact representation), namely soft state aggregation, 2) a theory of convergence for RL with arbitrary, but fixed, soft state aggregation, 3) a novel intuitive understanding of the effect of state aggregation on online RL, and 4) a new heuristic adaptive state aggregation algorithm that finds improved compact representations by exploiting the non-discrete nature of soft state aggregation. Preliminary empirical results are also presented. \n\n1 INTRODUCTION \n\nThe strong theory of convergence available for reinforcement learning algorithms (e.g., Dayan & Sejnowski, 1994; Watkins & Dayan, 1992; Jaakkola, Jordan & Singh, 1994; Tsitsiklis, 1994) makes them attractive as a basis for building learning control architectures to solve a wide variety of search, planning, and control problems. Unfortunately, almost all of the convergence results assume lookup table representations for value functions (see Sutton, 1988; Dayan, 1992; Bradtke, 1993; and Vanroy & Tsitsiklis, personal communication; for exceptions). It is widely accepted that the use of more compact representations than lookup tables is crucial to scaling RL algorithms to real-world problems. \n\nIn this paper we address the pressing issue of combining function approximation and RL, and present 1) a function approximator based on a simple extension to state aggregation (a commonly used form of compact representation, e.g., Moore, 1991), namely soft state aggregation, 2) a theory of convergence for RL with arbitrary, but fixed, soft state aggregation, 3) a novel intuitive understanding of the effect of state aggregation on online RL, and 4) a new heuristic adaptive state aggregation algorithm that finds improved compact representations by exploiting the non-discrete nature of soft state aggregation. Preliminary empirical results are also presented. \n\nProblem Definition and Notation: We consider the problem of solving large Markovian decision processes (MDPs) using RL algorithms and compact function approximation. We use the following notation: S for the state space, A for the action space, P^a(s, s') for the transition probabilities, R^a(s) for the payoffs, and γ for the discount factor. The objective is to maximize the expected, infinite-horizon, discounted sum of payoffs. \n\n1.1 FUNCTION APPROXIMATION: SOFT STATE CLUSTERS \n\nIn this section we describe a new function approximator (FA) for RL. In Section 3 we will analyze it theoretically and present convergence results. The FA maps the state space S into M > 0 aggregates or clusters from cluster space X. Typically, M << |S|. We allow soft clustering, where each state s belongs to cluster x with probability P(x|s); these are called the clustering probabilities. This allows each state s to belong to several clusters. 
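The soft clustering scheme just described can be sketched in a few lines of code. This is a minimal illustration with made-up numbers (the 4-state, 2-cluster P(x|s) matrix and function names are assumptions, not from the paper): each state belongs to every cluster with some probability, and a cluster's value generalizes to states in proportion to those probabilities.

```python
import random

# Hypothetical clustering probabilities P(x|s) for 4 states, 2 clusters.
P_x_given_s = [
    [0.9, 0.1],   # state 0 lies mostly in cluster 0
    [0.7, 0.3],
    [0.3, 0.7],
    [0.1, 0.9],   # state 3 lies mostly in cluster 1
]

def sample_cluster(s, rng):
    """Draw x ~ P(.|s): the soft assignment used during learning."""
    u, acc = rng.random(), 0.0
    for x, p in enumerate(P_x_given_s[s]):
        acc += p
        if u < acc:
            return x
    return len(P_x_given_s[s]) - 1

def state_value(s, V_cluster):
    """V(s) = sum_x P(x|s) V(x): cluster values generalize to states."""
    return sum(p * v for p, v in zip(P_x_given_s[s], V_cluster))

rng = random.Random(0)
v = state_value(0, [1.0, 2.0])   # 0.9*1.0 + 0.1*2.0 = 1.1
```

With hard aggregation each row of P(x|s) would be a one-hot vector; the soft version keeps the rows strictly positive, which is what the adaptive algorithm of Section 4 later exploits.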
An interesting special case is the usual state aggregation, where each state belongs to exactly one cluster. The theoretical model is that the agent can observe the underlying state but can only update a value function for the clusters. The value of a cluster generalizes to all states in proportion to the clustering probabilities. Throughout we use the symbols x and y to represent individual clusters and the symbols s and s' to represent individual states. \n\n2 A GENERAL CONVERGENCE THEOREM \n\nAn online RL algorithm essentially sees a sequence of quadruples, <s_t, a_t, s_{t+1}, r_t>, representing a transition from current state s_t to next state s_{t+1} on current action a_t with an associated payoff r_t. We will first prove a general convergence theorem for Q-learning (Watkins & Dayan, 1992) applied to a sequence of quadruples that may or may not be generated by a Markov process (Bertsekas, 1987). This is required because the RL problem at the level of the clusters may be non-Markovian. Conceptually, the sequence of quadruples can be thought of as being produced by some process that is allowed to modify the sequence of quadruples produced by a Markov process, e.g., by mapping states to clusters. In Section 3 we will specialize the following theorem to provide specific results for our function approximator. \n\nConsider any stochastic process that generates a sequence of random quadruples, Ψ = {<x_i, a_i, y_i, r_i>}_i, where x_i, y_i ∈ Y, a_i ∈ A, and r_i is a bounded real number. Note that x_{i+1} does not have to be equal to y_i. Let |Y| and |A| be finite, and define the indicator variables \n\nχ_i(x, a, y) = 1 when Ψ_i = <x, a, y, ·> (for any r), and 0 otherwise, \n\nand \n\nχ_i(x, a) = 1 when Ψ_i = <x, a, ·, ·> (for any y and any r), and 0 otherwise. \n\nDefine 
\n\nP,a ,( \nIJx,y -\n, \n\n) _ 2:t-iXk(x,a,y) \n2:1=i n(x, a) \n\n, \n\nand \n\nTheorem 1: If V( > 0, 3ME < 00, such that for all i ~ 0, for all x, y E Y, and \nfor all a E A, the following conditions characterize the infinite sequence W: with \nprobability 1 - (, \n\nIPi~i+M.(X,y)-pa(x,y)1 < ( \nIRi,i+MJX) - Ra(x)1 < (, \n\nand \n\n(1) \nwhere for all x, a, and y, with probability one Pg,oo(x, y) = pa(x, y), and Rg,oo(x) = \nRa(x). Then, online Q-learning applied to such a sequence will converge with \nprobability one to the solution of the following system of equations: Vx E Y, and \nVa E A, \n\nQ(x, a) = Ra(x) + 'Y \" \n\nL.J \nyEY \n\npa(x, y) maxQ(y, a') \n\na'EA \n\n(2) \n\n,\u00ab . ) \n\n,< \u2022 \n\nProof: Consider the semi-batch version of Q-learning that collects the changes \nto the value function for M steps before making the change. By assumption, for \nany (, making ME large enough will ensure that with probability 1 - (, the sample \nquantities for the ith batch, Pti+M \n(x, y) and Ri i+M (i) (x) are within ( of the \nasymptotic quantities. In Appendix A we prove that the semi-batch version of Q-\nlearning outlined above converges to the solution of Equation 2 with probability one. \nThe semi-batch proof can be extended to online Q-learning by using the analysis \ndeveloped in Theorem 3 of Jaakkola et al. (1994). In brief, it can be shown that \nthe difference caused by the online updating vanishes in the limit thereby forcing \nsemi-batch Q-learning and online Q-learning to be equal asymptotically. The use \nof the analysis in Theorem 3 from Jaakkola et al. (1994) requires that the learning \nalex) ( ) _ 1 uniformly w.p.l.; ME(k) is \nrate parameters CY are such that \nthe kth batch of size ME' If CYt(x) is non-increasing in addition to satisfying the \nconventional Q-learning conditions, then it will also meet the above requirement. 
\n□ \n\nTheorem 1 provides the most general convergence result available for Q-learning (and TD(0)); it shows that for an arbitrary quadruple sequence satisfying the ergodicity conditions given in Equation 1, Q-learning will converge to the solution of the MDP constructed with the limiting probabilities (P^a_{0,∞}) and payoffs (R^a_{0,∞}). Theorem 1 combines and generalizes the results on hard state aggregation and value iteration presented in Vanroy & Tsitsiklis (personal communication), and on partially observable MDPs in Singh et al. (1994). \n\n3 RL AND SOFT STATE AGGREGATION \n\nIn this section we apply Theorem 1 to provide convergence results for two cases: 1) using Q-learning and our FA to solve MDPs, and 2) using Sutton's (1988) TD(0) and our FA to determine the value function for a fixed policy. As is usual in online RL, we continue to assume that the transition probabilities and the payoff function of the MDP are unknown to the learning agent. Furthermore, being online, such algorithms cannot sample states in arbitrary order. In this section, the clustering probabilities P(x|s) are assumed to be fixed. \n\nCase 1: Q-learning and Fixed Soft State Aggregation \n\nBecause of function approximation, the domain of the learned Q-value function is constrained to be X x A (X is cluster space). This section develops a \"Bellman equation\" (e.g., Bertsekas, 1987) for Q-learning at the level of the cluster space. We assume that the agent follows a stationary stochastic policy π that assigns a non-zero probability to executing every action in every state. Furthermore, we assume that the Markov chain under policy π is ergodic. Such a policy π is a persistently exciting policy. Under the above conditions p^π(s|x) = P(x|s)p^π(s) / Σ_{s'} P(x|s')p^π(s'), where for all s, p^π(s) is the steady-state probability of being in state s. 
\nCorollary 1: Q-learning with soft state aggregation applied to an MDP while following a persistently exciting policy π will converge with probability one to the solution of the following system of equations: ∀(x, a) ∈ (X x A), \n\nQ(x, a) = Σ_s p^π(s|x) [R^a(s) + γ Σ_{y∈X} P^a(s, y) max_{a'∈A} Q(y, a')], (3) \n\nwhere P^a(s, y) = Σ_{s'} P^a(s, s')P(y|s'). The Q-value function for the state space can then be constructed via Q(s, a) = Σ_x P(x|s)Q(x, a) for all (s, a). \n\nProof: It can be shown that the sequence of quadruples produced by following policy π and independently mapping the current state s to a cluster x with probability P(x|s) satisfies the conditions of Theorem 1. Also, it can be shown that the limiting quantities are P^a_{0,∞}(x, y) = Σ_s p^π(s|x)P^a(s, y) and R^a_{0,∞}(x) = Σ_s p^π(s|x)R^a(s). \n\nNote that the Q-values found by clustering are dependent on the sampling policy π, unlike in the lookup table case. \n\nCase 2: TD(0) and Fixed Soft State Aggregation \n\nWe present separate results for TD(0) because it forms the basis for policy-iteration-like methods for solving Markov control problems (e.g., Barto, Sutton & Anderson, 1983) - a fact that we will use in the next section to derive adaptive state aggregation methods. As before, because of function approximation, the domain of the learned value function is constrained to be the cluster space X. \n\nCorollary 2: TD(0) with soft state aggregation applied to an MDP while following a policy π will converge with probability one to the solution of the following system of equations: ∀x ∈ X, \n\nV(x) = Σ_s p^π(s|x) [R^π(s) + γ Σ_{y∈X} P^π(s, y) V(y)], (4) \n\nwhere P^π(s, y) = Σ_{s'} P^π(s, s')P(y|s'), and where again, as in Q-learning, the value function for the state space can be constructed via V(s) = Σ_x P(x|s)V(x) for all s. \n\nProof: Corollary 1 implies Corollary 2 because TD(0) is a special case of Q-learning for MDPs with a single (possibly randomized) action in each state. Equation 4 provides a \"Bellman equation\" for TD(0) at the level of the cluster space. 
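The learning loop of Case 1 can be sketched concretely: observe state s, sample a cluster x ~ P(x|s), and run ordinary Q-learning on the resulting cluster-level quadruples. The 4-state chain, clustering probabilities, and step-size schedule below are illustrative assumptions, not the paper's experiments.

```python
import random

rng = random.Random(0)
gamma = 0.9
n_states, n_clusters, n_actions = 4, 2, 2
P_x_given_s = [[0.9, 0.1], [0.7, 0.3], [0.3, 0.7], [0.1, 0.9]]

def step(s, a):
    """Toy MDP: action 1 drifts right, action 0 drifts left; payoff 1
    for arriving in the right-most state."""
    move = 1 if rng.random() < 0.8 else 0
    s2 = min(n_states - 1, s + move) if a == 1 else max(0, s - move)
    return s2, (1.0 if s2 == n_states - 1 else 0.0)

def sample_cluster(s):
    """x ~ P(.|s): the agent updates clusters, not states."""
    return 0 if rng.random() < P_x_given_s[s][0] else 1

Q = [[0.0] * n_actions for _ in range(n_clusters)]
s = 0
for t in range(1, 50001):
    a = rng.randrange(n_actions)       # persistently exciting (uniform) policy
    s2, r = step(s, a)
    x, y = sample_cluster(s), sample_cluster(s2)
    alpha = 1.0 / (1.0 + t / 1000.0)   # non-increasing step size
    Q[x][a] += alpha * (r + gamma * max(Q[y]) - Q[x][a])
    s = s2

def Q_state(s, a):
    """Q(s,a) = sum_x P(x|s) Q(x,a): cluster values generalize to states."""
    return sum(p * Q[x][a] for x, p in enumerate(P_x_given_s[s]))
```

Cluster 1 covers the states near the rewarding right end, so its learned value for the rightward action ends up larger than cluster 0's, and the reconstructed Q(s, a) orders the states accordingly.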
□ \n\n4 ADAPTIVE STATE AGGREGATION \n\nIn previous sections we restricted attention to a function approximator with a fixed compact representation. How might one adapt the compact representation online in order to get better approximations of value functions? This section presents a novel heuristic adaptive algorithm that improves the compact representation by finding good clustering probabilities given an a priori fixed number of clusters. Note that for arbitrary clustering, while Corollaries 1 and 2 show that RL will find solutions with zero Bellman error in cluster space, the associated Bellman error in the state space will not be zero in general. Good clustering is therefore naturally defined in terms of reducing the Bellman error for the states of the MDP. \n\nLet the clustering probabilities be parametrized as P(x|s; θ) = e^{θ(x,s)} / Σ_y e^{θ(y,s)}, where θ(x, s) is the weight between state s and cluster x. Then the Bellman error at state s given parameter θ (a matrix) is \n\nJ(s|θ) = V(s|θ) - [R^π(s) + γ Σ_{s'} P^π(s, s')V(s'|θ)] = Σ_x P(x|s; θ)V(x|θ) - [R^π(s) + γ Σ_{s'} P^π(s, s') Σ_x P(x|s'; θ)V(x|θ)]. \n\nAdaptive State Aggregation (ASA) Algorithm: \n\nStep 1: Compute V(x|θ) for all x ∈ X using the TD(0) algorithm. \nStep 2: Let Δθ = -α ∂J²(θ)/∂θ. Go to Step 1. \n\nStep 2 tries to minimize the Bellman error for the states while holding the cluster values fixed to those computed in Step 1. We have \n\n∂J²(s|θ)/∂θ(y, s) = 2J(s|θ) [P(y|s; θ)(1 - γP^π(s, s))(V(y|θ) - V(s|θ))]. \n\nThe Bellman error J(s|θ) cannot be computed directly because the transition probabilities P^π(s, s') are unknown. However, it can be estimated by averaging the sample Bellman error. 
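One Step 2 update can be sketched as follows, assuming the softmax parametrization of P(x|s; θ) and the gradient formula above, with the positive factor (1 - γP^π(s, s)) absorbed into the step size; the 2-cluster, 2-state numbers are illustrative.

```python
import math

def cluster_probs(theta, s):
    """P(.|s; theta) for theta indexed as theta[cluster][state]."""
    z = [math.exp(theta[x][s]) for x in range(len(theta))]
    tot = sum(z)
    return [w / tot for w in z]

def asa_step(theta, s, bellman_err, V_cluster, V_state, lr):
    """theta(y,s) <- theta(y,s) - lr * 2 J(s) P(y|s) (V(y) - V(s))."""
    probs = cluster_probs(theta, s)
    for y in range(len(theta)):
        grad = 2.0 * bellman_err * probs[y] * (V_cluster[y] - V_state)
        theta[y][s] -= lr * grad

# One step on a toy problem: state 0 over-estimates its target (J > 0),
# so the step shifts probability mass away from the high-valued cluster,
# lowering V(0|theta).
theta = [[0.0, 0.0], [0.0, 0.0]]
V_cluster = [0.0, 1.0]
V0 = sum(p * v for p, v in zip(cluster_probs(theta, 0), V_cluster))  # 0.5
asa_step(theta, 0, bellman_err=0.5, V_cluster=V_cluster, V_state=V0, lr=0.1)
```

Note that with this softmax parametrization, ∂V(s|θ)/∂θ(y, s) = P(y|s; θ)(V(y|θ) - V(s|θ)), which is exactly the (V(y|θ) - V(s|θ)) factor appearing in the gradient formula of the text.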
P(y|s; θ) is known, and (1 - γP^π(s, s)) is always positive and independent of y, and can therefore be absorbed into the step-size α. The quantities V(y|θ) and V(s|θ) are available at the end of Step 1. In practice, Step 1 is only carried out partially before Step 2 is implemented. Partial evaluation works well because the changes in the clustering probabilities at Step 2 are small, and because the final V(x|θ) from the previous Step 1 is used to initialize the computation of V(x|θ) at the next Step 1. \n\nFigure 1: Adaptive State Clustering. Squared Bellman error in state space versus iterations of ASA, with curves for 2, 4, 10, and 20 clusters. See text for explanation. \n\nFigure 1 presents preliminary empirical results for the ASA algorithm. It plots the squared Bellman error summed over the state space as a function of the number of iterations of the ASA algorithm with constant step-size α. It shows error curves for 2, 4, 10 and 20 clusters, averaged over ten runs of randomly constructed 20-state Markov chains. Figure 1 shows that ASA is able to adapt the clustering probabilities to reduce the Bellman error in state space, and, as expected, the more clusters the smaller the asymptotic Bellman error. In future work we plan to test the policy iteration version of the adaptive soft aggregation algorithm on Markov control problems. \n\n5 SUMMARY AND FUTURE WORK \n\nDoing RL on aggregated states is potentially very advantageous because the value of each cluster generalizes across all states in proportion to the clustering probabilities. The same generalization is also potentially perilous because it can interfere with the contraction-based convergence of RL algorithms (see Yee, 1992, for a discussion). 
\nThis paper resolves this debate for the case of soft state aggregation by defining a set of Bellman equations (3 and 4) for the control and policy evaluation problems in the non-Markovian cluster space, and by proving that Q-learning and TD(0), respectively, solve them with probability one. Theorem 1 presents a general convergence result that was applied to state aggregation in this paper, but it is also a generalization of the results on hidden state presented in Singh et al. (1994), and may be applicable to other novel problems. It supports the intuitive picture that if a non-Markovian sequence of state transitions and payoffs is ergodic in the sense of Equation 1, then RL algorithms will converge w.p.1 to the solution of an MDP constructed with the limiting transition probabilities and payoffs. \n\nWe also presented a new algorithm, ASA, for adapting compact representations, which takes advantage of the soft state aggregation proposed here to do gradient descent in clustering probability space to minimize the squared Bellman error in the state space. We demonstrated on simple examples that ASA is able to adapt the clustering probabilities to dramatically reduce the Bellman error in state space. In future work we plan to extend the convergence theory presented here to discretizations of continuous-state MDPs, and to further test the ASA algorithms. \n\nA Convergence of semi-batch Q-learning (Theorem 1) \n\nConsider a semi-batch algorithm that collects the changes to the Q-value function for M steps before making the change to the Q-value function. 
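The semi-batch scheme can be sketched directly: accumulate the batch counts and payoffs for M steps, then apply the collected change at once. The two-state chain, payoffs, batch size, and constant step size below are illustrative toy choices, not the paper's; with a single action the update is also the TD(0) special case.

```python
import random

rng = random.Random(2)
gamma, M, alpha = 0.9, 50, 0.005
states, actions = (0, 1), (0,)
Q = {(x, a): 0.0 for x in states for a in actions}

def quad_stream(n):
    """Quadruples <x, a, y, r> from a 2-state chain; payoff 1 in state 0."""
    s = 0
    for _ in range(n):
        p_right = 0.3 if s == 0 else 0.7
        s2 = 1 if rng.random() < p_right else 0
        yield (s, 0, s2, float(s == 0))
        s = s2

# Batch accumulators: R_k(x), M_k(x,a), and M_k(x,a,y) from the appendix.
R_k = {k: 0.0 for k in Q}
N_k = {k: 0 for k in Q}
T_k = {(x, a, y): 0 for x in states for a in actions for y in states}

for i, (x, a, y, r) in enumerate(quad_stream(50000), 1):
    R_k[(x, a)] += r
    N_k[(x, a)] += 1
    T_k[(x, a, y)] += 1
    if i % M == 0:  # end of batch: apply the collected change at once
        for (x2, a2), n in N_k.items():
            if n == 0:
                continue
            target = R_k[(x2, a2)] / n + gamma * sum(
                (T_k[(x2, a2, y2)] / n) * max(Q[(y2, b)] for b in actions)
                for y2 in states)
            # Q_{k+1} = (1 - M_k a)Q_k + M_k a [batch sample target]
            Q[(x2, a2)] += n * alpha * (target - Q[(x2, a2)])
        R_k = {k: 0.0 for k in Q}
        N_k = {k: 0 for k in Q}
        T_k = {k: 0 for k in T_k}

# Solving Equation 2 for this chain gives Q(0) ~ 5.78 and Q(1) ~ 4.22.
```

Because the batch estimates of the transition and payoff averages concentrate as M grows, the iterates track the fixed point of Equation 2, which is the content of the appendix argument.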
Let \n\nR_k^a(x) = Σ_{i=(k-1)M}^{kM} r_i χ_i(x, a); M_k(x, a) = Σ_{i=(k-1)M}^{kM} χ_i(x, a); and M_k(x, a, y) = Σ_{i=(k-1)M}^{kM} χ_i(x, a, y). \n\nThen the Q-value of (x, a) after the kth batch is given by: \n\nQ_{k+1}(x, a) = (1 - M_k(x, a)α_k(x, a))Q_k(x, a) + M_k(x, a)α_k(x, a) [ R_k^a(x)/M_k(x, a) + γ Σ_{y∈Y} (M_k(x, a, y)/M_k(x, a)) max_{a'} Q_k(y, a') ]. \n\nLet Q̄ be the solution to Equation 2. Define \n\nF_k(x, a) = R_k^a(x)/M_k(x, a) + γ Σ_{y∈Y} (M_k(x, a, y)/M_k(x, a)) max_{a'} Q_k(y, a') - Q̄(x, a); \n\nthen, with V_k(x) = max_a Q_k(x, a) and V̄(x) = max_a Q̄(x, a), \n\nF_k(x, a) = γ Σ_y (M_k(x, a, y)/M_k(x, a)) [V_k(y) - V̄(y)] + (R_k^a(x)/M_k(x, a) - R^a_{0,∞}(x)) + γ Σ_y (M_k(x, a, y)/M_k(x, a) - P^a_{0,∞}(x, y)) V̄(y). \n\nThe quantity F_k(x, a) can be bounded by \n\n||F_k(x, a)|| ≤ γ||V_k - V̄|| + ||R_k^a(x)/M_k(x, a) - R^a_{0,∞}(x)|| + γ||Σ_y (M_k(x, a, y)/M_k(x, a) - P^a_{0,∞}(x, y))V̄(y)|| ≤ γ||V_k - V̄|| + Cε_k, \n\nwhere ε_k is the larger of |R_k^a(x)/M_k(x, a) - R^a_{0,∞}(x)| and |Σ_y (M_k(x, a, y)/M_k(x, a) - P^a_{0,∞}(x, y))V̄(y)|. By assumption, for any ε > 0, ∃M_ε < ∞ such that ε_k < ε with probability 1 - ε. The variance of F_k(x, a) can also be shown to be bounded because the variance of the sample probabilities is bounded (everything else is similar to standard Q-learning for MDPs). Therefore, by Theorem 1 of Jaakkola et al. (1994), for any ε > 0, with probability (1 - ε), Q_k(x, a) → Q_∞(x, a), where |Q_∞(x, a) - Q̄(x, a)| ≤ Cε. Therefore, semi-batch Q-learning converges with probability one. □ \n\nAcknowledgements \n\nThis project was supported in part by a grant from the McDonnell-Pew Foundation, by a grant from ATR Human Information Processing Research Laboratories, and by a grant from Siemens Corporation. Michael I. 
Jordan is an NSF Presidential Young Investigator. \n\nReferences \n\nA. G. Barto, R. S. Sutton, & C. W. Anderson. (1983) Neuronlike elements that can solve difficult learning control problems. IEEE SMC, 13:835-846. \n\nD. P. Bertsekas. (1987) Dynamic Programming: Deterministic and Stochastic Models. Prentice-Hall. \n\nS. J. Bradtke. (1993) Reinforcement learning applied to linear quadratic regulation. In Advances in Neural Information Processing Systems 5, pages 295-302. \n\nP. Dayan. (1992) The convergence of TD(λ) for general λ. Machine Learning, 8(3/4):341-362. \n\nP. Dayan & T. J. Sejnowski. (1994) TD(λ) converges with probability 1. Machine Learning, 13(3). \n\nT. Jaakkola, M. I. Jordan, & S. P. Singh. (1994) On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6(6):1185-1201. \n\nA. W. Moore. (1991) Variable resolution dynamic programming: Efficiently learning action maps in multivariate real-valued state-spaces. In Machine Learning: Proceedings of the Eighth International Workshop, pages 333-337. \n\nS. P. Singh, T. Jaakkola, & M. I. Jordan. (1994) Learning without state-estimation in partially observable Markovian decision processes. In Machine Learning: Proceedings of the Eleventh International Conference, pages 284-292. \n\nR. S. Sutton. (1988) Learning to predict by the methods of temporal differences. Machine Learning, 3:9-44. \n\nJ. Tsitsiklis. (1994) Asynchronous stochastic approximation and Q-learning. Machine Learning, 16(3):185-202. \n\nB. Vanroy & J. Tsitsiklis. (personal communication) \n\nC. J. C. H. Watkins & P. Dayan. (1992) Q-learning. Machine Learning, 8(3/4):279-292. \n\nR. C. Yee. (1992) Abstraction in control learning. COINS Technical Report 92-16, Department of Computer and Information Science, University of Massachusetts, Amherst, MA 01003. A dissertation proposal. 
\n\n\f", "award": [], "sourceid": 981, "authors": [{"given_name": "Satinder", "family_name": "Singh", "institution": null}, {"given_name": "Tommi", "family_name": "Jaakkola", "institution": null}, {"given_name": "Michael", "family_name": "Jordan", "institution": null}]}