Generalized Prioritized Sweeping
Advances in Neural Information Processing Systems, pp. 1001-1007

David Andre, Nir Friedman, Ronald Parr

Computer Science Division, 387 Soda Hall
University of California, Berkeley, CA 94720
{dandre,nir,parr}@cs.berkeley.edu

Abstract

Prioritized sweeping is a model-based reinforcement learning method that attempts to focus an agent's limited computational resources to achieve a good estimate of the value of environment states. To choose effectively where to spend a costly planning step, classic prioritized sweeping uses a simple heuristic to focus computation on the states that are likely to have the largest errors. In this paper, we introduce generalized prioritized sweeping, a principled method for generating such estimates in a representation-specific manner. This allows us to extend prioritized sweeping beyond an explicit, state-based representation to deal with compact representations that are necessary for dealing with large state spaces. We apply this method to generalized model approximators (such as Bayesian networks), and describe preliminary experiments that compare our approach with classical prioritized sweeping.

1 Introduction

In reinforcement learning, there is a tradeoff between spending time acting in the environment and spending time planning what actions are best. Model-free methods take one extreme on this question: the agent updates only the state most recently visited. On the other end of the spectrum lie classical dynamic programming methods that reevaluate the utility of every state in the environment after every experiment.
Prioritized sweeping (PS) [6] provides a middle ground in that only the most "important" states are updated, according to a priority metric that attempts to measure the anticipated size of the update for each state. Roughly speaking, PS interleaves performing actions in the environment with propagating the values of states. After updating the value of state s, PS examines all states t from which the agent might reach s in one step and assigns them priority based on the expected size of the change in their value.

A crucial desideratum for reinforcement learning is the ability to scale up to complex domains. For this, we need to use compact (or generalizing) representations of the model and the value function. While it is possible to apply PS in the presence of such representations (e.g., see [1]), we claim that classic PS is ill-suited in this case. With a generalizing model, a single experience may affect our estimation of the dynamics of many other states. Thus, we might want to update the value of states that are similar, in some appropriate sense, to s, since we have a new estimate of the system dynamics at these states. Note that some of these states might never have been reached before, and standard PS will not assign them a priority at all.

In this paper, we present generalized prioritized sweeping (GenPS), a method that uses a formal principle to understand and extend PS so that it can deal with parametric representations for both the model and the value function. If GenPS is used with an explicit state-space model and value function representation, an algorithm similar to the original (classic) PS results. When a model approximator (such as a dynamic Bayesian network [2]) is used, the resulting algorithm prioritizes the states of the environment using the generalizations inherent in the model representation.
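To make the classic PS bookkeeping concrete, here is a minimal sketch of the priority-assignment step described above. The predecessor map, dict-based priority table, and triple-keyed transition probabilities are illustrative assumptions, not the authors' implementation:

```python
def ps_assign_priorities(s, delta_v, p_hat, predecessors, actions, priority):
    """Classic PS step: after V(s) changes by delta_v, every state t that can
    reach s in one step gets a priority reflecting the expected size of the
    change in its own value."""
    for t in predecessors[s]:
        change = max(p_hat.get((t, a, s), 0.0) for a in actions) * abs(delta_v)
        # Classic PS combines an old priority with a new one by taking the max
        priority[t] = max(priority.get(t, 0.0), change)
    return priority
```

States are then popped in order of priority for their own value updates, which may in turn enqueue their predecessors.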
2 The Basic Principle

We assume the reader is familiar with the basic concepts of Markov Decision Processes (MDPs); see, for example, [5]. We use the following notation: an MDP is a 4-tuple (S, A, p, r), where S is a set of states, A is a set of actions, p(t | s, a) is a transition model that captures the probability of reaching state t after we execute action a at state s, and r(s) is a reward function mapping S into real-valued rewards. In this paper, we focus on infinite-horizon MDPs with a discount factor γ. The agent's aim is to maximize the expected discounted total reward it will receive. Reinforcement learning procedures attempt to achieve this objective when the agent does not know p and r.

A standard problem in model-based reinforcement learning is one of balancing between planning (i.e., choosing a policy) and execution. Ideally, the agent would compute the optimal value function for its model of the environment each time the model changes. This scheme is unrealistic since finding the optimal policy for a given model is computationally non-trivial. Fortunately, we can approximate this scheme if we notice that the approximate model changes only slightly at each step. Thus, we can assume that the value function from the previous model can be easily "repaired" to reflect these changes. This approach was pursued in the DYNA [7] framework, where after the execution of an action, the agent updates its model of the environment, and then performs some bounded number of value-propagation steps to update its approximation of the value function. Each value-propagation step locally enforces the Bellman equation by setting V(s) ← max_{a ∈ A} Q(s, a), where Q(s, a) = r̂(s) + γ Σ_{s' ∈ S} p̂(s' | s, a) V(s'); here p̂(s' | s, a) and r̂(s) are the agent's approximation of the MDP, and V is the agent's approximation of the value function.

This raises the question of which states should be updated.
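A single value-propagation step as defined above can be sketched as follows; the dictionary-based p̂, r̂, and value table are illustrative assumptions:

```python
def value_propagation(s, V, p_hat, r_hat, gamma, actions):
    """Locally enforce the Bellman equation:
    V(s) <- max_a [ r_hat(s) + gamma * sum_t p_hat(t | s, a) * V(t) ]."""
    V[s] = max(
        r_hat[s] + gamma * sum(prob * V[t] for t, prob in p_hat[(s, a)].items())
        for a in actions
    )
    return V[s]
```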
In this paper we propose the following general principle:

GenPS Principle: Update states where the approximation of the value function will change the most. That is, update the states with the largest Bellman error, E(s) = |V(s) − max_{a ∈ A} Q(s, a)|.

The motivation for this principle is straightforward. The maximum Bellman error can be used to bound the maximum difference between the current value function, V(s), and the optimal value function, V*(s) [9]. This difference bounds the policy loss, the difference between the expected discounted reward received under the agent's current policy and the expected discounted reward received under the optimal policy.

To carry out this principle we have to recognize when the Bellman error at a state changes. This can happen at two different stages. First, after the agent updates its model of the world, new discrepancies between V(s) and max_a Q(s, a) might be introduced, which can increase the Bellman error at s. Second, after the agent performs some value propagations, V is changed, which may introduce new discrepancies.

We assume that the agent maintains a value function and a model that are parameterized by θ_V and θ_M. (We will sometimes refer to the vector that concatenates these vectors together into a single, larger vector simply as θ.) When the agent observes a transition from state s to s' under action a, the agent updates its environment model by adjusting some of the parameters in θ_M. When performing value propagations, the agent updates V by updating parameters in θ_V. A change in any of these parameters may change the Bellman error at other states in the model. We want to recognize these states without explicitly computing the Bellman error at each one. Formally, we wish to estimate the change in error, |ΔE(s)|, due to the most recent change Δθ in the parameters.
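The quantity driving the principle is easy to compute directly when Q is available; a one-line sketch (Q as a callable is an assumption for illustration):

```python
def bellman_error(s, V, Q, actions):
    """E(s) = |V(s) - max_a Q(s, a)|: how far V(s) is from its one-step backup."""
    return abs(V[s] - max(Q(s, a) for a in actions))
```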
We propose approximating |ΔE(s)| by using the gradient of the right-hand side of the Bellman equation (i.e., max_a Q(s, a)). Thus, we have |ΔE(s)| ≈ |∇ max_a Q(s, a) · Δθ|, which estimates the change in the Bellman error at state s as a function of the change in Q(s, a). The above still requires us to differentiate a max, which is not differentiable. In general, we want to overestimate the change, to avoid "starving" states with non-negligible error. Thus, we use the following upper bound: |∇(max_a Q(s, a)) · Δθ| ≤ max_a |∇Q(s, a) · Δθ|.

We now define the generalized prioritized sweeping procedure. The procedure maintains a priority queue that assigns to each state s a priority, pri(s). After making some changes, we can reassign priorities by computing an approximation of the change in the value function. Ideally, this is done using a procedure that implements the following steps:

procedure update-priorities(Δθ)
  for all s ∈ S: pri(s) ← pri(s) + max_a |∇Q(s, a) · Δθ|

Note that when the above procedure updates the priority for a state that has an existing priority, the priorities are added together. This ensures that the priority being kept is an overestimate of the priority of each state, and thus the procedure will eventually visit all states that require updating. Also, in practice we would not want to reconsider the priority of all states after an update (we return to this issue below).

Using this procedure, we can now state the general learning procedure:

procedure GenPS()
  loop
    perform an action in the environment
    update the model; let Δθ be the change in θ
    call update-priorities(Δθ)
    while there is available computation time
      let s_max = arg max_s pri(s)
      perform value propagation for V(s_max); let Δθ be the change in θ
      call update-priorities(Δθ)
      pri(s_max) ← |V(s_max) − max_a Q(s_max, a)|¹

Note that the GenPS procedure does not determine how actions are selected.
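The priority bookkeeping above can be sketched as follows. The gradient callback, the restriction to a sparse set of affected states, and the dict-based queue are assumptions for illustration (a real implementation would use a heap):

```python
def update_priorities(priority, affected_states, actions, grad_q_dot_dtheta):
    """GenPS: pri(s) <- pri(s) + max_a |grad Q(s, a) . delta_theta|.
    Adding (rather than taking a max) keeps each priority an overestimate."""
    for s in affected_states:
        bump = max(abs(grad_q_dot_dtheta(s, a)) for a in actions)
        priority[s] = priority.get(s, 0.0) + bump
    return priority

def pop_max(priority):
    """Select s_max = argmax_s pri(s) for the next value-propagation step."""
    s_max = max(priority, key=priority.get)
    priority.pop(s_max)
    return s_max
```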
This issue, which involves the problem of exploration, is orthogonal to our main topic. Standard approaches, such as those described in [5, 6, 7], can be used with our procedure.

This abstract description specifies neither how to update the model, nor how to update the value function in the value-propagation steps. Both of these depend on the choices made in the corresponding representation of the model and the value function. Moreover, it is clear that in problems that involve a large state space, we cannot afford to recompute the priority of every state in update-priorities. However, we can simplify this computation by exploiting sparseness in the model, and in the worst case we may resort to approximate methods for finding the states that receive high priority after each change.

3 Explicit, State-based Representation

In this section we briefly describe the instantiation of the generalized procedure when the rewards, values, and transition probabilities are explicitly modeled using lookup tables. In this representation, for each state s, we store the expected reward at s, denoted by θ_r(s), and the estimated value at s, denoted by θ_V(s); for each action a and state t we store the number of times the execution of a at s led to state t, denoted N_{s,a,t}. From these transition counts we can reconstruct the transition probabilities p̂(t | s, a) = (N_{s,a,t} + N⁰_{s,a,t}) / Σ_{t'} (N_{s,a,t'} + N⁰_{s,a,t'}), where the N⁰_{s,a,t} are fictional counts that capture our prior information about the system's dynamics.² After each step in the world, these reward and probability parameters are updated in the straightforward manner.

1 In general, this will assign the state a new priority of 0, unless there is a self-loop. In this case it will be easy to compute the new Bellman error as a by-product of the value-propagation step.
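The count-based estimate above can be sketched as follows, assuming (for illustration) a uniform fictional count per successor state:

```python
from collections import defaultdict

class TabularModel:
    """Transition counts plus fictional (Dirichlet prior) counts N0."""

    def __init__(self, states, n0=1.0):
        self.states = states
        self.n0 = n0                    # uniform fictional count per successor
        self.N = defaultdict(float)     # N[(s, a, t)] = observed transitions

    def observe(self, s, a, t):
        self.N[(s, a, t)] += 1.0

    def p_hat(self, t, s, a):
        # p_hat(t | s, a) = (N + N0) / sum_t' (N + N0)
        total = sum(self.N[(s, a, t2)] + self.n0 for t2 in self.states)
        return (self.N[(s, a, t)] + self.n0) / total
```

Before any data arrives, the estimate falls back to the prior (uniform here); each observation shifts probability mass toward the observed successor.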
Value-propagation steps in this representation set θ_V(t) to the right-hand side of the Bellman equation.

To apply the GenPS procedure we need to derive the gradient of the Bellman equation for two situations: (a) after a single step in the environment, and (b) after a value update. In case (a), the model changes after the agent performs action a at state s and reaches t (a transition s → t under a). In this case, it is easy to verify that ∇Q(s, a) · Δθ = Δθ_r(s) + γ (V(t) − Σ_{t'} p̂(t' | s, a) V(t')) / (Σ_{t'} (N_{s,a,t'} + N⁰_{s,a,t'})), and that ∇Q(s', a') · Δθ = 0 if s' ≠ s or a' ≠ a. Thus, s is the only state whose priority changes.

In case (b), the value function changes after updating the value of a state t. In this case, ∇Q(s, a) · Δθ = γ p̂(t | s, a) Δθ_V(t). It is easy to see that this is nonzero only if t is reachable from s. In both cases, it is straightforward to locate the states where the Bellman error might have changed, and the computation of the new priority is more efficient than computing the Bellman error.³

Now we can relate GenPS to standard prioritized sweeping. The PS procedure has the general form of this application of GenPS, with three minor differences. First, after performing a transition s → t under a in the environment, PS immediately performs a value propagation for state s, while GenPS increments the priority of s. Second, after performing a value propagation for state t, PS updates the priority of states s that can reach t with the value max_a p̂(t | s, a) · Δθ_V(t). The priority assigned by GenPS is the same quantity multiplied by γ. Since PS does not introduce priorities after model changes, this multiplicative constant does not change the order of states in the queue. Third, GenPS uses addition to combine the old priority of a state with a new one, which ensures that the priority is indeed an upper bound. In contrast, PS uses max to combine priorities.
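For case (b), the states whose priority changes are exactly the one-step predecessors of t; a sketch of the resulting priority increments (the predecessor map and triple-keyed p̂ are assumptions for illustration):

```python
def priorities_after_value_update(t, delta_v_t, p_hat, gamma, predecessors, actions):
    """GenPS case (b): grad Q(s, a) . delta_theta = gamma * p_hat(t | s, a) * delta_theta_V(t),
    so only states s that can reach t in one step get a nonzero increment."""
    increments = {}
    for s in predecessors[t]:
        increments[s] = gamma * max(p_hat.get((s, a, t), 0.0)
                                    for a in actions) * abs(delta_v_t)
    return increments
```

Dropping the factor gamma recovers the priority that classic PS would assign, which is why the two procedures order the queue identically after value updates.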
This discussion shows that PS can be thought of as a special case of GenPS in which the agent uses an explicit, state-based representation. As we show in the next section, when the agent uses more compact representations, we get procedures where the prioritization strategy is quite different from that used in PS. Thus, we claim that classic PS is desirable primarily when explicit representations are used.

4 Factored Representation

We now examine a compact representation of p(s' | s, a) that is based on dynamic Bayesian networks (DBNs) [2]. DBNs have been combined with reinforcement learning before in [8], where they were used primarily as a means of getting better generalization while learning. We will show that they can also be used with prioritized sweeping to focus the agent's attention on groups of states that are affected as the agent refines its environment model.

We start by assuming that the environment state is described by a set of random variables, X_1, ..., X_n. For now, we assume that each variable can take values from a finite set Val(X_i). An assignment of values x_1, ..., x_n to these variables describes a particular environment state. Similarly, we assume that the agent's action is described by random variables A_1, ..., A_k. To model the system dynamics, we have to represent the probability of transitions s → t under action a, where s and t are two assignments to X_1, ..., X_n and a is an assignment to A_1, ..., A_k. To simplify the discussion, we denote by Y_1, ..., Y_n the agent's state after

2 Formally, we are using multinomial Dirichlet priors. See, for example, [4] for an introduction to these Bayesian methods.

3 Although the expression for ∇Q(s, a) · Δθ involves a summation over all states, it can be computed efficiently. To see this, note that the summation is essentially the old value of Q(s, a) (minus the immediate reward), which can be retained in memory.
the action is executed (e.g., the state t). Thus, p(t | s, a) is represented as a conditional probability P(Y_1, ..., Y_n | X_1, ..., X_n, A_1, ..., A_k).

A DBN model for such a conditional distribution consists of two components. The first is a directed acyclic graph where each vertex is labeled by a random variable and in which the vertices labeled X_1, ..., X_n and A_1, ..., A_k are roots. This graph specifies the factorization of the conditional distribution:

P(Y_1, ..., Y_n | X_1, ..., X_n, A_1, ..., A_k) = Π_{i=1}^{n} P(Y_i | Pa_i),   (1)

where Pa_i are the parents of Y_i in the graph. The second component of the DBN model is a description of the conditional probabilities P(Y_i | Pa_i). Together, these two components describe a unique conditional distribution. The simplest representation of P(Y_i | Pa_i) is a table that contains a parameter θ_{i,y,z} = P(Y_i = y | Pa_i = z) for each possible combination of y ∈ Val(Y_i) and z ∈ Val(Pa_i) (note that z is a joint assignment to several random variables). It is easy to see that the "density" of the DBN graph determines the number of parameters needed. In particular, a complete graph, to which we cannot add an arc without violating the constraints, is equivalent to a state-based representation in terms of the number of parameters needed. On the other hand, a sparse graph requires few parameters.

In this paper, we assume that the learner is supplied with the DBN structure and only has to learn the conditional probability entries. It is often easy to assess structure information from experts even when precise probabilities are not available. As in the state-based representation, we learn the parameters using Dirichlet priors for each multinomial distribution [4].
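Equation (1) can be sketched as a product over conditional probability tables. The dict-based CPT layout and variable naming below are assumptions for illustration:

```python
def factored_transition_prob(y, x, a, parents, cpt):
    """p(t | s, a) = prod_i P(Y_i = y_i | Pa_i = z_i), per Equation (1).
    y maps variable index i -> value y_i; x and a map variable names to values;
    parents[i] lists the names of Pa_i (pre-state and action variables);
    cpt[i][(y_i, z_i)] = P(Y_i = y_i | Pa_i = z_i)."""
    env = {**x, **a}  # joint assignment to the X's and A's
    prob = 1.0
    for i, y_i in y.items():
        z_i = tuple(env[name] for name in parents[i])
        prob *= cpt[i][(y_i, z_i)]
    return prob
```

With a sparse graph, each factor conditions on only a few parents, which is exactly where the savings in parameters comes from.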
In this method, we assess the conditional probability θ_{i,y,z} using prior knowledge and the frequency of transitions observed in the past where Y_i = y among those transitions where Pa_i = z. Learning amounts to keeping counts N_{i,y,z} that record the number of transitions where Y_i = y and Pa_i = z for each variable Y_i and values y ∈ Val(Y_i) and z ∈ Val(Pa_i). Our prior knowledge is represented by fictional counts N⁰_{i,y,z}. Then we estimate probabilities using the formula θ̂_{i,y,z} = (N_{i,y,z} + N⁰_{i,y,z}) / N_{i,·,z}, where N_{i,·,z} = Σ_{y'} (N_{i,y',z} + N⁰_{i,y',z}).

We now identify which states should be reconsidered after we update the DBN parameters. Recall that this requires estimating the term ∇Q(s, a) · Δθ. Since Δθ is sparse, after making the transition s* → t* under a*, we have that ∇Q(s, a) · Δθ = Σ_i (∂Q(s, a)/∂θ_{i,y*_i,z*_i}) Δθ_{i,y*_i,z*_i}, where y*_i and z*_i are the assignments to Y_i and Pa_i, respectively, in the transition s* → t* under a*. (Recall that s*, a*, and t* jointly assign values to all the variables in the DBN.)

We say that a transition s → t under a is consistent with an assignment X = x for a vector of random variables X, denoted (s, a, t) ⊨ (X = x), if X is assigned the value x in that transition. We also need a similar notion for a partial description of a transition. We say that s and a are consistent with X = x, denoted (s, a, ·) ⊨ (X = x), if there is a t such that (s, a, t) ⊨ (X = x).

Using this notation, we can show that if (s, a, ·) ⊨ (Pa_i = z*_i), then

∂Q(s, a)/∂θ_{i,y*_i,z*_i} = (γ / N_{i,·,z*_i}) [ (1/θ̂_{i,y*_i,z*_i}) Σ_{t : (s,a,t) ⊨ (Y_i = y*_i)} p̂(t | s, a) V(t) − Σ_{t} p̂(t | s, a) V(t) ]

and if s, a are inconsistent with Pa_i = z*_i, then ∂Q(s, a)/∂θ_{i,y*_i,z*_i} = 0.

This expression shows that if s is similar to s* in that both agree on the values they assign to the parents of some Y_i (i.e., (s, a*) is consistent with z*_i), then the priority of s would change after we update the model.
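The consistency test (s, a, ·) ⊨ (Pa_i = z*_i) is what picks out the group of states whose priority must be recomputed after a model update; a sketch, with all data layouts (states and actions as name-to-value dicts) assumed for illustration:

```python
def consistent_with(s, a, partial_assignment):
    """(s, a, .) |= (X = x): s and a agree with the partial assignment on every
    variable they determine (parents here are pre-state or action variables)."""
    env = {**s, **a}
    return all(env[var] == val for var, val in partial_assignment.items()
               if var in env)

def states_to_reprioritize(candidate_states, a_star, parent_assignments):
    """After observing s* -> t* under a*, a state s needs its priority updated
    iff (s, a*, .) is consistent with Pa_i = z*_i for some variable Y_i."""
    return [s for s in candidate_states
            if any(consistent_with(s, a_star, z) for z in parent_assignments)]
```

States that have never been visited can still be selected here, which is exactly the generalization that classic PS misses.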
The magnitude of the priority change depends upon both the similarity of s and s* (i.e., how many of the terms in ∇Q(s, a) · Δθ will be non-zero) and the value of the states that can be reached from s.

[Figure: experimental learning curves; legend: PS, PS+fact]