{"title": "Convergence of Indirect Adaptive Asynchronous Value Iteration Algorithms", "book": "Advances in Neural Information Processing Systems", "page_first": 695, "page_last": 702, "abstract": null, "full_text": "Convergence of Indirect Adaptive Asynchronous Value Iteration Algorithms\n\nVijaykumar Gullapalli\nDepartment of Computer Science\nUniversity of Massachusetts\nAmherst, MA 01003\nvijay@cs.umass.edu\n\nAndrew G. Barto\nDepartment of Computer Science\nUniversity of Massachusetts\nAmherst, MA 01003\nbarto@cs.umass.edu\n\nAbstract\n\nReinforcement learning methods based on approximating dynamic programming (DP) are receiving increased attention due to their utility in forming reactive control policies for systems embedded in dynamic environments. Environments are usually modeled as controlled Markov processes, but when the environment model is not known a priori, adaptive methods are necessary. Adaptive control methods are often classified as being direct or indirect. Direct methods adapt the control policy directly from experience, whereas indirect methods adapt a model of the controlled process and compute control policies based on the latest model. Our focus in this paper is on indirect adaptive DP-based methods. We present a convergence result for indirect adaptive asynchronous value iteration algorithms for the case in which a look-up table is used to store the value function. Our result implies convergence of several existing reinforcement learning algorithms such as adaptive real-time dynamic programming (ARTDP) (Barto, Bradtke, & Singh, 1993) and prioritized sweeping (Moore & Atkeson, 1993). 
Although the emphasis of researchers studying DP-based reinforcement learning has been on direct adaptive methods such as Q-Learning (Watkins, 1989) and methods using TD algorithms (Sutton, 1988), it is not clear that these direct methods are preferable in practice to indirect methods such as those analyzed in this paper.\n\n1 INTRODUCTION\n\nReinforcement learning methods based on approximating dynamic programming (DP) are receiving increased attention due to their utility in forming reactive control policies for systems embedded in dynamic environments. In most of this work, learning tasks are formulated as Markovian Decision Problems (MDPs) in which the environment is modeled as a controlled Markov process. For each observed environmental state, the agent consults a policy to select an action, which when executed causes a probabilistic transition to a successor state. State transitions generate rewards, and the agent's goal is to form a policy that maximizes the expected value of a measure of the long-term reward for operating in the environment. (Equivalent formulations minimize a measure of the long-term cost of operating in the environment.) Artificial neural networks are often used to store value functions produced by these algorithms (e.g., (Tesauro, 1992)).\n\nRecent advances in reinforcement learning theory have shown that asynchronous value iteration provides an important link between reinforcement learning algorithms and classical DP methods for value iteration (VI) (Barto, Bradtke, & Singh, 1993). Whereas conventional VI algorithms use repeated exhaustive \"sweeps\" of the MDP's state set to update the value function, asynchronous VI can achieve the same result without proceeding in systematic sweeps (Bertsekas & Tsitsiklis, 1989). 
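To make the sweep-free update concrete, here is a minimal sketch (ours, not code from the paper) of asynchronous VI on an invented two-state, two-action MDP; the transition probabilities P, rewards R, and discount factor below are assumptions made up purely for illustration:

```python
import random

# Hypothetical MDP, invented for illustration: P[a][i][j] is the
# probability of moving from state i to state j under action a, and
# R[i][a] is the expected reward for executing action a in state i.
P = {0: [[0.9, 0.1], [0.2, 0.8]],
     1: [[0.5, 0.5], [0.4, 0.6]]}
R = [[1.0, 0.0], [0.0, 2.0]]
gamma = 0.9
n_states, n_actions = 2, 2

def backup(V, i):
    # One value iteration update at a single state i, using the true model.
    return max(R[i][a] + gamma * sum(P[a][i][j] * V[j] for j in range(n_states))
               for a in range(n_actions))

random.seed(0)
V = [0.0] * n_states
for t in range(20000):
    i = random.randrange(n_states)   # states visited in arbitrary order, no sweeps
    V[i] = backup(V, i)
```

Because each backup is a contraction toward the optimal value function, V converges so long as every state keeps being selected, regardless of the order in which states are chosen.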
If the state ordering of an asynchronous VI computation is determined by state sequences generated during real or simulated interaction of a controller with the Markov process, the result is an algorithm called Real-Time DP (RTDP) (Barto, Bradtke, & Singh, 1993). Its convergence to optimal value functions in several kinds of problems follows from the convergence properties of asynchronous VI (Barto, Bradtke, & Singh, 1993).\n\n2 MDPS WITH INCOMPLETE INFORMATION\n\nBecause asynchronous VI employs a basic update operation that involves computing the expected value of the next state for all possible actions, it requires a complete and accurate model of the MDP in the form of state-transition probabilities and expected transition rewards. This is also true for the use of asynchronous VI in RTDP. Therefore, when state-transition probabilities and expected transition rewards are not completely known, asynchronous VI is not directly applicable. Problems such as these, which are called MDPs with incomplete information,1 require more complex adaptive algorithms for their solution. An indirect adaptive method works by identifying the underlying MDP via estimates of state-transition probabilities and expected transition rewards, whereas a direct adaptive method (e.g., Q-Learning (Watkins, 1989)) adapts the policy or the value function without forming an explicit model of the MDP through system identification.\n\nIn this paper, we prove a convergence theorem for a set of algorithms we call indirect adaptive asynchronous VI algorithms. These are indirect adaptive algorithms that result from simply substituting current estimates of transition probabilities and expected transition rewards (produced by some concurrently executing identification\n\n1 These problems should not be confused with MDPs with incomplete state information, i.e., partially observable MDPs. 
algorithm) for their actual values in the asynchronous value iteration computation. We show that under certain conditions, indirect adaptive asynchronous VI algorithms converge with probability one to the optimal value function. Moreover, we use our result to infer convergence of two existing DP-based reinforcement learning algorithms, adaptive real-time dynamic programming (ARTDP) (Barto, Bradtke, & Singh, 1993), and prioritized sweeping (Moore & Atkeson, 1993).\n\n3 CONVERGENCE OF INDIRECT ADAPTIVE ASYNCHRONOUS VI\n\nIndirect adaptive asynchronous VI algorithms are produced from non-adaptive algorithms by substituting a current approximate model of the MDP for the true model in the asynchronous value iteration computations. An indirect adaptive algorithm can be expected to converge only if the corresponding non-adaptive algorithm, with the true model used in the place of each approximate model, converges. We therefore restrict attention to indirect adaptive asynchronous VI algorithms that correspond in this way to convergent non-adaptive algorithms. We prove the following theorem:\n\nTheorem 1 For any finite state, finite action MDP with an infinite-horizon discounted performance measure, any indirect adaptive asynchronous VI algorithm (for which the corresponding non-adaptive algorithm converges) converges to the optimal value function with probability one if\n1) the conditions for convergence of the non-adaptive algorithm are met,\n2) in the limit, every action is executed from every state infinitely often, and\n3) the estimates of the state-transition probabilities and the expected transition rewards remain bounded and converge in the limit to their true values with probability one.\n\nProof The proof is given in Appendix A.2. 
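To illustrate the construction behind the theorem, the following sketch (ours, not the paper's; the two-state MDP, softmax temperature, and step count are invented assumptions) interleaves maximum-likelihood identification with asynchronous VI backups that use the current estimates in place of the true model:

```python
import math
import random

# Hypothetical true MDP, used only to simulate transitions; the agent
# never reads P or R directly -- it learns estimates from experience.
P = {0: [[0.9, 0.1], [0.2, 0.8]],
     1: [[0.5, 0.5], [0.4, 0.6]]}
R = [[1.0, 0.0], [0.0, 2.0]]
gamma, n_states, n_actions, tau = 0.9, 2, 2, 2.0

# Counts and sums for the maximum-likelihood model estimates.
n_sa = [[0] * n_actions for _ in range(n_states)]                      # visits to (i, a)
n_saj = [[[0] * n_states for _ in range(n_actions)] for _ in range(n_states)]
r_sum = [[0.0] * n_actions for _ in range(n_states)]
V = [0.0] * n_states

def p_hat(i, a, j):
    # Estimated transition probability; uniform before any data arrives.
    return n_saj[i][a][j] / n_sa[i][a] if n_sa[i][a] else 1.0 / n_states

def r_hat(i, a):
    # Estimated expected reward; zero before any data arrives.
    return r_sum[i][a] / n_sa[i][a] if n_sa[i][a] else 0.0

def q_hat(i, a):
    return r_hat(i, a) + gamma * sum(p_hat(i, a, j) * V[j] for j in range(n_states))

random.seed(0)
x = 0
for t in range(200000):
    # Gibbs (softmax) action selection keeps every action's selection
    # probability non-zero, in the spirit of condition 2 of the theorem.
    prefs = [math.exp(q_hat(x, a) / tau) for a in range(n_actions)]
    a = random.choices(range(n_actions), weights=prefs)[0]
    j = random.choices(range(n_states), weights=P[a][x])[0]
    # Update the maximum-likelihood estimates from the observed transition ...
    n_sa[x][a] += 1; n_saj[x][a][j] += 1; r_sum[x][a] += R[x][a]
    # ... and perform an asynchronous VI backup at the current state,
    # with the estimated model substituted for the true one.
    V[x] = max(q_hat(x, b) for b in range(n_actions))
    x = j
```

As the estimates converge to the true model (condition 3), the learned V tracks the optimal value function of the estimated MDP, which in turn approaches the true optimal value function.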
\n\n4 DISCUSSION\n\nCondition 2 of the theorem, which is also required by direct adaptive methods to ensure convergence, is usually unavoidable. It is typically ensured by using a stochastic policy. For example, we can use the Gibbs distribution method for selecting actions used by Watkins (1989) and others. Given condition 2, condition 3 is easily satisfied by most identification methods. In particular, the simple maximum-likelihood identification method (see Appendix A.1, items 6 and 7) converges to the true model with probability one under this condition.\n\nOur result is valid only for the special case in which the value function is explicitly stored in a look-up table. The case in which general function approximators such as neural networks are used requires further analysis.\n\nFinally, an important issue not addressed in this paper is the trade-off between system identification and control. To ensure convergence of the model, all actions have to be executed infinitely often in every state. On the other hand, on-line control objectives are best served by executing the action in each state that is optimal according to the current value function (i.e., by using the certainty-equivalence optimal policy). This issue has received considerable attention from control theorists (see, for example, (Kumar, 1985), and the references therein). Although we do not address this issue in this paper, for a specific estimation method, it may be possible to determine an action selection scheme that makes the best trade-off between identification and control.\n\n5 EXAMPLES OF INDIRECT ADAPTIVE ASYNCHRONOUS VI\n\nOne example of an indirect adaptive asynchronous VI algorithm is ARTDP (Barto, Bradtke, & Singh, 1993) with maximum-likelihood identification. 
In this algorithm, a randomized policy is used to ensure that every action has a non-zero probability of being executed in each state. The following theorem for ARTDP follows directly from our result and the corresponding theorem for RTDP in (Barto, Bradtke, & Singh, 1993):\n\nTheorem 2 For any discounted MDP and any initial value function, trial-based2 ARTDP converges with probability one.\n\nAs a special case of the above theorem, we can obtain the result that in similar problems the prioritized sweeping algorithm of Moore and Atkeson (Moore & Atkeson, 1993) converges to the optimal value function. This is because prioritized sweeping is a special case of ARTDP in which states are selected for value updates based on their priority and the processing time available. A state's priority reflects the utility of performing an update for that state, and hence prioritized sweeping can improve the efficiency of asynchronous VI. A similar algorithm, Queue-Dyna (Peng & Williams, 1992), can also be shown to converge to the optimal value function using a simple extension of our result.\n\n6 CONCLUSIONS\n\nWe have shown convergence of indirect adaptive asynchronous value iteration under fairly general conditions. This result implies the convergence of several existing DP-based reinforcement learning algorithms. Moreover, we have discussed possible extensions to our result. Our result is a step toward a better understanding of indirect adaptive DP-based reinforcement learning methods. There are several promising directions for future work.\n\nOne is to analyze the trade-off between model estimation and control mentioned earlier to determine optimal methods for action selection and to integrate our work with existing results on adaptive methods for MDPs (Kumar, 1985). 
Second, analysis is needed for the case in which a function approximation method, such as a neural network, is used instead of a look-up table to store the value function. A third possible direction is to analyze indirect adaptive versions of more general DP-based algorithms that combine asynchronous policy iteration with asynchronous policy evaluation. Several non-adaptive algorithms of this nature have been proposed recently (e.g., (Williams & Baird, 1993; Singh & Gullapalli)).\n\nFinally, it will be useful to examine the relative efficacies of direct and indirect adaptive methods for solving MDPs with incomplete information. Although the emphasis of researchers studying DP-based reinforcement learning has been on direct adaptive methods such as Q-Learning and methods using TD algorithms, it is not clear that these direct methods are preferable in practice to indirect methods such as the ones discussed here. For example, Moore and Atkeson (1993) report several experiments in which prioritized sweeping significantly outperforms Q-Learning in terms of the computation time and the number of observations required for convergence. More research is needed to characterize circumstances for which the various reinforcement learning methods are best suited.\n\n2 As in (Barto, Bradtke, & Singh, 1993), by trial-based execution of an algorithm we mean its use in an infinite series of trials such that every state is selected infinitely often to be the start state of a trial.\n\nAPPENDIX\n\nA.1 NOTATION\n\n1. Time steps are denoted t = 1, 2, ..., and z_t denotes the last state observed before time t. z_t belongs to a finite state set S = {1, 2, ..., n}.\n\n2. Actions in a state are selected according to a policy π, where π(i) ∈ A, a finite set of actions, for 1 ≤ i ≤ n.\n\n3. 
The probability of making a transition from state i to state j on executing action a is p^a(i, j).\n\n4. The expected reward from executing action a in state i is r(i, a). The reward received at time t is denoted r_t(z_t, a_t).\n\n5. 0 ≤ γ < 1 is the discount factor.\n\n6. Let p̂^a_t(i, j) denote the estimate at time t of the probability of transition from state i to j on executing action a ∈ A. Several different methods can be used for estimating p̂^a_t(i, j). For example, if n^a_t(i, j) is the observed number of times before time step t that execution of action a when the system was in state i was followed by a transition to state j, and n^a_t(i) = Σ_{j∈S} n^a_t(i, j) is the number of times action a was executed in state i before time step t, then, for 1 ≤ i ≤ n and for all a ∈ A, the maximum-likelihood state-transition probability estimates at time t are\n\np̂^a_t(i, j) = n^a_t(i, j) / n^a_t(i),  1 ≤ j ≤ n.\n\nNote that the maximum-likelihood estimates converge to their true values with probability one if n^a_t(i) → ∞ as t → ∞, i.e., every action is executed from every state infinitely often.\nLet p^a(i) = [p^a(i, 1), ..., p^a(i, n)] ∈ [0, 1]^n, and similarly, p̂^a_t(i) = [p̂^a_t(i, 1), ..., p̂^a_t(i, n)] ∈ [0, 1]^n. We will denote the |S| × |A| matrix of transition probabilities associated with state i by P(i) and its estimate at time t by P̂_t(i). Finally, P denotes the vector of matrices [P(1), ..., P(n)], and P̂_t denotes the vector [P̂_t(1), ..., P̂_t(n)].\n\n7. Let r̂_t(i, a) denote the estimate at time t of the expected reward r(i, a), and let r̂_t denote all the |S| × |A| estimates at time t. Again, if maximum-likelihood estimation is used,\n\nr̂_t(i, a) = ( Σ_{k=1}^{t} r_k(z_k, a_k) δ_{i,a}(z_k, a_k) ) / n^a_t(i),\n\nwhere δ_{i,a}: S × A → {0, 1} is the indicator function for the state-action pair (i, a).\n\n8. 
V̂*_t denotes the optimal value function for the MDP defined by the estimates P̂_t and r̂_t of P and r at time t. Thus, for all i ∈ S,\n\nV̂*_t(i) = max_{a∈A} { r̂_t(i, a) + γ Σ_{j∈S} p̂^a_t(i, j) V̂*_t(j) }.\n\nSimilarly, V* denotes the optimal value function for the MDP defined by P and r.\n\n9. B_t ⊆ S is the subset of states whose values are updated at time t. Usually, at least z_t ∈ B_t.\n\nA.2 PROOF OF THEOREM 1\n\nIn indirect adaptive asynchronous VI algorithms, the estimates of the MDP parameters at time step t, P̂_t and r̂_t, are used in place of the true parameters, P and r, in the asynchronous VI computations at time t. Hence the value function is updated at time t as\n\nV_{t+1}(i) = max_{a∈A} { r̂_t(i, a) + γ Σ_{j∈S} p̂^a_t(i, j) V_t(j) } if i ∈ B_t,\nV_{t+1}(i) = V_t(i) otherwise,\n\nwhere B_t ⊆ S is the subset of states whose values are updated at time t.\n\nFirst note that because P̂_t and r̂_t are assumed to be bounded for all t, V_t is also bounded for all t. Next, because the optimal value function given the model P̂_t and r̂_t, namely V̂*_t, is a continuous function of the estimates P̂_t and r̂_t, convergence of these estimates w.p. 1 to their true values implies that\n\nV̂*_t → V* w.p. 1,\n\nwhere V* is the optimal value function for the original MDP. The convergence w.p. 1 of V̂*_t to V* implies that given an ε > 0 there exists an integer T > 0 such that for all t ≥ T,\n\n‖V̂*_t − V*‖ < (1 − γ)ε / (2γ)  w.p. 1.  (1)\n\nHere, ‖·‖ can be any norm on R^n, although we will use the l∞ or max norm.\n\nIn algorithms based on asynchronous VI, the values of only the states in B_t ⊆ S are updated at time t, although the value of each state is updated infinitely often. For an arbitrary z ∈ S, let us define the infinite subsequence {t^z_k}_{k=0}^∞ to be the times when the value of state z gets updated. Further, let us only consider updates at, or after, time T, where T is from equation (1) above, so that t^z_0 ≥ T for all z ∈ S. 
\n\nBy the nature of the VI computation we have, for each t ≥ 1,\n\n|V_{t+1}(i) − V̂*_t(i)| ≤ γ ‖V_t − V̂*_t‖ if i ∈ B_t.  (2)\n\nUsing inequality (2), we can get a bound for |V_{t^z_k + 1}(z) − V̂*_{t^z_k}(z)| as\n\n|V_{t^z_k + 1}(z) − V̂*_{t^z_k}(z)| ≤ γ^{k+1} ‖V_{t^z_0} − V̂*_{t^z_0}‖ + (1 − γ^k)ε  w.p. 1.  (3)\n\nWe can verify that the bound in (3) is correct through induction. The bound is clearly valid for k = 0. Assuming it is valid for k, we show that it is valid for k + 1:\n\n|V_{t^z_{k+1} + 1}(z) − V̂*_{t^z_{k+1}}(z)|\n ≤ γ ‖V_{t^z_{k+1}} − V̂*_{t^z_{k+1}}‖\n ≤ γ ( ‖V_{t^z_{k+1}} − V̂*_{t^z_k}‖ + ‖V̂*_{t^z_k} − V̂*_{t^z_{k+1}}‖ )\n ≤ γ |V_{t^z_k + 1}(z) − V̂*_{t^z_k}(z)| + γ ( (1 − γ)ε / γ )  w.p. 1\n = γ |V_{t^z_k + 1}(z) − V̂*_{t^z_k}(z)| + (1 − γ)ε\n ≤ γ ( γ^{k+1} ‖V_{t^z_0} − V̂*_{t^z_0}‖ + (1 − γ^k)ε ) + (1 − γ)ε  w.p. 1\n = γ^{k+2} ‖V_{t^z_0} − V̂*_{t^z_0}‖ + (1 − γ^{k+1})ε.\n\nTaking the limit as k → ∞ in equation (3) and observing that for each z, lim_{k→∞} V̂*_{t^z_k}(z) = V*(z) w.p. 1, we obtain\n\nlim_{k→∞} |V_{t^z_k + 1}(z) − V*(z)| ≤ ε  w.p. 1.\n\nSince ε and z are arbitrary, this implies that V_t → V* w.p. 1.\n\nAcknowledgements\n\nWe gratefully acknowledge the significant contribution of Peter Dayan, who pointed out that a restrictive condition for convergence in an earlier version of our result was actually unnecessary. This work has also benefited from several discussions with Satinder Singh. We would also like to thank Chuck Anderson for his timely help in preparing this material for presentation at the conference. This material is based upon work supported by funding provided to A. Barto by the AFOSR, Bolling AFB, under Grant AFOSR-F49620-93-1-0269 and by the NSF under Grant ECS-92-14866.\n\nReferences\n\n[1] A.G. Barto, S.J. Bradtke, and S.P. Singh. 
Learning to act using real-time dynamic programming. Technical Report 93-02, University of Massachusetts, Amherst, MA, 1993.\n\n[2] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Prentice-Hall, Englewood Cliffs, NJ, 1989.\n\n[3] P. R. Kumar. A survey of some results in stochastic adaptive control. SIAM Journal of Control and Optimization, 23(3):329-380, May 1985.\n\n[4] A. W. Moore and C. G. Atkeson. Memory-based reinforcement learning: Efficient computation with prioritized sweeping. In S. J. Hanson, J. D. Cowan, and C. L. Giles, editors, Advances in Neural Information Processing Systems 5, pages 263-270, San Mateo, CA, 1993. Morgan Kaufmann Publishers.\n\n[5] J. Peng and R. J. Williams. Efficient learning and planning within the Dyna framework. In Proceedings of the Second International Conference on Simulation of Adaptive Behavior, Honolulu, HI, 1992.\n\n[6] S. P. Singh and V. Gullapalli. Asynchronous modified policy iteration with single-sided updates. (Under review).\n\n[7] R. S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3:9-44, 1988.\n\n[8] G. J. Tesauro. Practical issues in temporal difference learning. Machine Learning, 8(3/4):257-277, May 1992.\n\n[9] C. J. C. H. Watkins. Learning from delayed rewards. PhD thesis, Cambridge University, Cambridge, England, 1989.\n\n[10] R. J. Williams and L. C. Baird. Analysis of some incremental variants of policy iteration: First steps toward understanding actor-critic learning systems. Technical Report NU-CCS-93-11, Northeastern University, College of Computer Science, Boston, MA 02115, September 1993.\n", "award": [], "sourceid": 773, "authors": [{"given_name": "Vijaykumar", "family_name": "Gullapalli", "institution": null}, {"given_name": "Andrew", "family_name": "Barto", "institution": null}]}