{"title": "Convergence of Stochastic Iterative Dynamic Programming Algorithms", "book": "Advances in Neural Information Processing Systems", "page_first": 703, "page_last": 710, "abstract": null, "full_text": "Convergence of Stochastic Iterative Dynamic Programming Algorithms \n\nTommi Jaakkola* \n\nMichael I. Jordan \n\nSatinder P. Singh \n\nDepartment of Brain and Cognitive Sciences \nMassachusetts Institute of Technology \nCambridge, MA 02139 \n\nAbstract \n\nIncreasing attention has recently been paid to algorithms based on dynamic programming (DP) due to the suitability of DP for learning problems involving control. In stochastic environments where the system being controlled is only incompletely known, however, a unifying theoretical account of these methods has been missing. In this paper we relate DP-based learning algorithms to the powerful techniques of stochastic approximation via a new convergence theorem, enabling us to establish a class of convergent algorithms to which both TD(λ) and Q-learning belong. \n\n1 INTRODUCTION \n\nLearning to predict the future and to find an optimal way of controlling it are the basic goals of learning systems that interact with their environment. A variety of algorithms are currently being studied for the purposes of prediction and control in incompletely specified, stochastic environments. Here we consider learning algorithms defined in Markov environments. There are actions or controls (u) available to the learner that affect both the state transition probabilities and the probability distribution for the immediate, state-dependent costs c_i(u) incurred by the learner. Let p_ij(u) denote the probability of a transition to state j when control u is executed in state i. The learning problem is to predict the expected cost of a 
* E-mail: tommi@psyche.mit.edu \n\nfixed policy p (a function from states to actions), or to obtain the optimal policy (p*) that minimizes the expected cost of interacting with the environment. \n\nIf the learner were allowed to know the transition probabilities as well as the immediate costs, the control problem could be solved directly by Dynamic Programming (see e.g., Bertsekas, 1987). However, when the underlying system is only incompletely known, algorithms such as Q-learning (Watkins, 1989) for prediction and control, and TD(λ) (Sutton, 1988) for prediction, are needed. \n\nOne of the central problems in developing a theoretical understanding of these algorithms is to characterize their convergence; that is, to establish under what conditions they are ultimately able to obtain correct predictions or optimal control policies. The stochastic nature of these algorithms immediately suggests the use of stochastic approximation theory to obtain convergence results. However, there exist no directly applicable stochastic approximation techniques for problems involving the maximum norm, which plays a crucial role in learning algorithms based on DP. \n\nIn this paper, we extend Dvoretzky's (1956) formulation of the classical Robbins-Monro (1951) stochastic approximation theory to obtain a class of converging processes involving the maximum norm. In addition, we show that Q-learning and both the on-line and batch versions of TD(λ) are realizations of this new class. This approach keeps the convergence proofs simple and does not rely on constructions specific to particular algorithms. Several other authors have recently presented results that are similar to those presented here: Dayan and Sejnowski (1993) for TD(λ), Peng and Williams (1993) for TD(λ), and Tsitsiklis (1993) for Q-learning. 
Our results appear to be closest to those of Tsitsiklis (1993). \n\n2 Q-LEARNING \n\nThe Q-learning algorithm produces values, the \"Q-values\", by which an optimal action can be determined at any state. The algorithm is based on DP, rewriting Bellman's equation such that there is a value assigned to every state-action pair instead of only to a state. Thus the Q-values satisfy \n\nQ(s, u) = c̄_s(u) + γ Σ_{s'} p_{ss'}(u) max_{u'} Q(s', u')   (1) \n\nwhere c̄ denotes the mean of c. The solution to this equation can be obtained by updating the Q-values iteratively, an approach known as the value iteration method. In the learning problem the values for the mean of c and for the transition probabilities are unknown. However, the observable quantity \n\nc_{s_t}(u_t) + γ max_u Q(s_{t+1}, u)   (2) \n\nwhere s_t and u_t are the state of the system and the action taken at time t, respectively, is an unbiased estimate of the update used in value iteration. The Q-learning algorithm is a relaxation method that uses this estimate iteratively to update the current Q-values (see below). \n\nThe Q-learning algorithm converges mainly due to the contraction property of the value iteration operator. \n\n2.1 CONVERGENCE OF Q-LEARNING \n\nOur proof is based on the observation that the Q-learning algorithm can be viewed as a stochastic process to which techniques of stochastic approximation are generally applicable. Due to the lack of a formulation of stochastic approximation for the maximum norm, however, we need to slightly extend the standard results. This is accomplished by the following theorem, the proof of which can be found in Jaakkola et al. (1993). \n\nTheorem 1  A random iterative process Δ_{n+1}(x) = (1 - α_n(x)) Δ_n(x) + β_n(x) F_n(x) converges to zero w.p.1 under the following assumptions: \n\n1) The state space is finite. 
\n2) Σ_n α_n(x) = ∞, Σ_n α_n^2(x) < ∞, Σ_n β_n(x) = ∞, Σ_n β_n^2(x) < ∞, and E{β_n(x) | P_n} ≤ E{α_n(x) | P_n} uniformly w.p.1. \n\n3) || E{F_n(x) | P_n} ||_W ≤ γ || Δ_n ||_W, where γ ∈ (0, 1). \n\n4) Var{F_n(x) | P_n} ≤ C(1 + || Δ_n ||_W)^2, where C is some constant. \n\nHere P_n = {Δ_n, Δ_{n-1}, ..., F_{n-1}, ..., α_{n-1}, ..., β_{n-1}, ...} stands for the past at step n. F_n(x), α_n(x) and β_n(x) are allowed to depend on the past insofar as the above conditions remain valid. The notation || · ||_W refers to some weighted maximum norm. \n\nIn applying the theorem, the Δ_n process will generally represent the difference between a stochastic process of interest and some optimal value (e.g., the optimal value function). The formulation of the theorem therefore requires knowledge to be available about the optimal solution to the learning problem before it can be applied to any algorithm whose convergence is to be verified. In the case of Q-learning the required knowledge is available through the theory of DP and Bellman's equation in particular. \n\nThe convergence of the Q-learning algorithm now follows easily by relating the algorithm to the converging stochastic process defined by Theorem 1.¹ \n\nTheorem 2  The Q-learning algorithm given by \n\nQ_{t+1}(s_t, u_t) = (1 - α_t(s_t, u_t)) Q_t(s_t, u_t) + α_t(s_t, u_t) [c_{s_t}(u_t) + γ V_t(s_{t+1})], \n\nwhere V_t(s) = max_u Q_t(s, u) as in (2), converges to the optimal Q*(s, u) values if \n\n1) The state and action spaces are finite. \n\n2) Σ_t α_t(s, u) = ∞ and Σ_t α_t^2(s, u) < ∞ uniformly w.p.1. \n\n3) Var{c_s(u)} is bounded. \n\n4) If γ = 1, all policies lead to a cost-free terminal state w.p.1. \n\n¹ We note that the theorem is more powerful than is needed to prove the convergence of Q-learning. Its generality, however, allows it to be applied to other algorithms as well (see the following section on TD(λ)). \n\nProof. 
By subtracting Q*(s, u) from both sides of the learning rule and by defining Δ_t(s, u) = Q_t(s, u) - Q*(s, u) together with \n\nF_t(s, u) = c_s(u) + γ V_t(s_{t+1}) - Q*(s, u),   (3) \n\nthe Q-learning algorithm can be seen to have the form of the process in Theorem 1 with β_t(s, u) = α_t(s, u). \n\nTo verify that F_t(s, u) has the required properties we begin by showing that it is a contraction mapping with respect to some maximum norm. This is done by relating F_t to the DP value iteration operator for the same Markov chain. More specifically, \n\nmax_u |E{F_t(i, u)}| ≤ γ max_u Σ_j p_ij(u) max_v |Q_t(j, v) - Q*(j, v)| = γ max_u Σ_j p_ij(u) V_Δ(j) = T(V_Δ)(i), \n\nwhere we have used the notation V_Δ(j) = max_v |Q_t(j, v) - Q*(j, v)| and T is the DP value iteration operator for the case where the costs associated with each state are zero. If γ < 1 the contraction property of E{F_t(i, u)} can be obtained by bounding Σ_j p_ij(u) V_Δ(j) by max_j V_Δ(j) and then including the γ factor. When the future costs are not discounted (γ = 1) but the chain is absorbing and all policies lead to the terminal state w.p.1, there still exists a weighted maximum norm with respect to which T is a contraction mapping (see e.g. Bertsekas & Tsitsiklis, 1989), thereby forcing the contraction of E{F_t(i, u)}. The variance of F_t(s, u) given the past is within the bounds of Theorem 1 as it depends on Q_t(s, u) at most linearly and the variance of c_s(u) is bounded. \n\nNote that the proof covers both the on-line and batch versions. □ \n\n3 THE TD(λ) ALGORITHM \n\nTD(λ) (Sutton, 1988) is also a DP-based learning algorithm that is naturally defined in a Markov environment. Unlike Q-learning, however, TD does not involve decision-making tasks but rather predictions about the future costs of an evolving system. 
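Before turning to TD(λ), the relaxation update of Theorem 2 can be illustrated numerically. The following is a minimal Python sketch under invented assumptions: a two-state, two-action MDP with made-up transition probabilities and deterministic costs, a cost-minimizing convention (min over actions in place of the max printed in (1)), and learning rates α_t(s, u) = 1/(number of visits to (s, u)), which satisfy condition 2 of Theorem 2. The reference Q* is obtained by value iteration.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (all numbers invented for illustration).
# Costs are minimized, so the DP operator takes min over actions.
rng = np.random.default_rng(0)
gamma = 0.5
P = np.array([[[0.8, 0.2], [0.2, 0.8]],
              [[0.5, 0.5], [0.9, 0.1]]])   # P[s, u, s'] transition probabilities
c = np.array([[1.0, 2.0], [0.5, 1.5]])     # mean immediate costs c[s, u]

# Reference solution: value iteration on the Q-values (a contraction).
Q_star = np.zeros((2, 2))
for _ in range(500):
    Q_star = c + gamma * P @ Q_star.min(axis=1)

# Q-learning: relax toward the sampled one-step estimate, cf. (2).
Q = np.zeros((2, 2))
visits = np.zeros((2, 2))
s = 0
for _ in range(100_000):
    u = int(rng.integers(2))               # exploratory random policy
    s_next = int(rng.choice(2, p=P[s, u]))
    visits[s, u] += 1
    alpha = 1.0 / visits[s, u]             # sum alpha = inf, sum alpha^2 < inf
    target = c[s, u] + gamma * Q[s_next].min()
    Q[s, u] += alpha * (target - Q[s, u])
    s = s_next

print(np.abs(Q - Q_star).max())            # maximum deviation from Q*
```

With the 1/n rates the iterate is simply an empirical average of the bootstrapped one-step targets; any schedule satisfying condition 2 of Theorem 2 would serve equally well in this sketch.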
TD(λ) converges to the same predictions as a version of Q-learning in which there is only one action available at each state, but the algorithms are derived from slightly different grounds and their behavioral differences are not well understood. \n\nThe algorithm is based on the estimates \n\nV_t^λ(i) = (1 - λ) Σ_{n=1}^∞ λ^{n-1} V_t^{(n)}(i)   (4) \n\nwhere V_t^{(n)}(i) are n-step look-ahead predictions. The expected values of the V_t^λ(i) are strictly better estimates of the correct predictions than the V_t(i) are (see Jaakkola et al., 1993), and the update equation of the algorithm \n\nV_{t+1}(i_t) = V_t(i_t) + α_t [V_t^λ(i_t) - V_t(i_t)]   (5) \n\ncan be written in a practical recursive form as is seen below. The convergence of the algorithm is mainly due to the statistical properties of the V_t^λ(i) estimates. \n\n3.1 CONVERGENCE OF TD(λ) \n\nAs we are interested in strong forms of convergence we need to impose some new constraints, but due to the generality of the approach we can dispense with some others. Specifically, the learning rate parameters α_n are replaced by α_n(i), which satisfy Σ_n α_n(i) = ∞ and Σ_n α_n^2(i) < ∞ uniformly w.p.1. These parameters allow asynchronous updating and they can, in general, be random variables. The convergence of the algorithm is guaranteed by the following theorem, which is an application of Theorem 1. \n\nTheorem 3  For any finite absorbing Markov chain, for any distribution of starting states with no inaccessible states, and for any distributions of the costs with finite variances, the TD(λ) algorithm given by \n\n1) \n\nV_{n+1}(i) = V_n(i) + α_n(i) Σ_{t=1}^m [c_{i_t} + γ V_n(i_{t+1}) - V_n(i_t)] Σ_{k=1}^t (γλ)^{t-k} χ_i(k), \n\nwhere m is the length of the sequence and χ_i(k) indicates whether state i was visited at time k, with \n\nΣ_n α_n(i) = ∞ and Σ_n α_n^2(i) < ∞ uniformly w.p.1, 
\n\n2) \n\nV_{t+1}(i) = V_t(i) + α_t(i) [c_{i_t} + γ V_t(i_{t+1}) - V_t(i_t)] Σ_{k=1}^t (γλ)^{t-k} χ_i(k), \n\nwith Σ_t α_t(i) = ∞ and Σ_t α_t^2(i) < ∞ uniformly w.p.1 and, within sequences, α_t(i)/max_{t∈S} α_t(i) → 1 uniformly w.p.1, \n\nconverges to the optimal predictions w.p.1 provided γ, λ ∈ [0, 1] with γλ < 1. \n\nProof for (1). We use here a slightly different form for the learning rule (cf. the previous section): \n\nV_{n+1}(i) = V_n(i) + α_n(i) [G_n(i) - (m(i)/E{m(i)}) V_n(i)], \n\nG_n(i) = (1/E{m(i)}) Σ_{k=1}^{m(i)} V_n^λ(i; k), \n\nwhere V_n^λ(i; k) is an estimate calculated at the kth occurrence of state i in a sequence and for mathematical convenience we have made the transformation α_n(i) → E{m(i)} α_n(i), where m(i) is the number of times state i was visited during the sequence. \n\nTo apply Theorem 1 we subtract V*(i), the optimal predictions, from both sides of the learning equation. By identifying α_n(i) := α_n(i) m(i)/E{m(i)}, β_n(i) := α_n(i), and F_n(i) := G_n(i) - V*(i) m(i)/E{m(i)}, we need to show that these satisfy the conditions of Theorem 1. For α_n(i) and β_n(i) this is obvious. We begin here by showing that F_n(i) indeed is a contraction mapping. To this end, \n\nmax_i |E{F_n(i) | V_n}| = max_i (1/E{m(i)}) |E{ (V_n^λ(i; 1) - V*(i)) + (V_n^λ(i; 2) - V*(i)) + ··· | V_n }|, \n\nwhich can be bounded above by using the relation \n\n|E{V_n^λ(i; k) - V*(i) | V_n}| ≤ E{ |E{V_n^λ(i; k) - V*(i) | m(i) ≥ k, V_n}| θ(m(i) - k) | V_n } ≤ P{m(i) ≥ k} |E{V_n^λ(i) - V*(i) | V_n}| ≤ γ P{m(i) ≥ k} max_i |V_n(i) - V*(i)|, \n\nwhere θ(x) = 0 if x < 0 and 1 otherwise. Here we have also used the fact that V_n^λ(i) is a contraction mapping independent of possible discounting. 
As Σ_k P{m(i) ≥ k} = E{m(i)}, we finally get \n\nmax_i |E{F_n(i) | V_n}| ≤ γ max_i |V_n(i) - V*(i)|. \n\nThe variance of F_n(i) can be seen to be bounded by E{m^4} max_i |V_n(i)|^2. For any absorbing Markov chain the convergence to the terminal state is geometric and thus for every finite k, E{m^k} ≤ C(k), implying that the variance of F_n(i) is within the bounds of Theorem 1. As Theorem 1 is now applicable we can conclude that the batch version of TD(λ) converges to the optimal predictions w.p.1. □ \n\nProof for (2). The proof for the on-line version is achieved by showing that the effect of the on-line updating vanishes in the limit, thereby forcing the two versions to be equal asymptotically. We view the on-line version as a batch algorithm in which the updates are made after each complete sequence but are made in such a manner as to be equal to those made on-line. \n\nDefine G'_n(i) = G_n(i) + Ĝ_n(i) to be a new batch estimate taking into account the on-line updating within sequences. Here G_n(i) is the batch estimate with the desired properties (see the proof for (1)) and Ĝ_n(i) is the difference between the two. We take the new batch learning parameters to be the maxima over a sequence, that is, α_n(i) = max_{t∈S} α_t(i). As all the α_t(i) satisfy the required conditions uniformly w.p.1, these new learning parameters satisfy them as well. \n\nTo analyze the new batch algorithm we divide it into three parallel processes: the batch TD(λ) with α_n(i) as learning rate parameters, the difference between this and the new batch estimate, and the change in the value function due to the updates made on-line. Under the conditions of the TD(λ) convergence theorem, rigorous upper bounds can be derived for the latter two processes (see Jaakkola et al., 1993). 
These results enable us to write \n\n|| E{G'_n - V*} || ≤ || E{G_n - V*} || + || Ĝ_n || ≤ (γ' + C_n^1) || V_n - V* || + C_n^2, \n\nwhere C_n^1 and C_n^2 go to zero w.p.1. This implies that for any ε > 0 and || V_n - V* || ≥ ε there exists γ_c < 1 such that \n\n|| E{G'_n - V*} || ≤ γ_c || V_n - V* || \n\nfor n large enough. This is the required contraction property of Theorem 1. In addition, it can readily be checked that the variance of the new estimate falls under the conditions of Theorem 1. \n\nTheorem 1 now guarantees that for any ε the value function in the on-line algorithm converges w.p.1 into some ε-bounded region of V*, and therefore the algorithm itself converges to V* w.p.1. □ \n\n4 CONCLUSIONS \n\nIn this paper we have extended results from stochastic approximation theory to cover asynchronous relaxation processes that have a contraction property with respect to some maximum norm (Theorem 1). This new class of converging iterative processes is shown to include both the Q-learning and TD(λ) algorithms in either their on-line or batch versions. We note that the convergence of the on-line version of TD(λ) has not been shown previously. We also wish to emphasize the simplicity of our results. The convergence proofs for Q-learning and TD(λ) utilize only high-level statistical properties of the estimates used in these algorithms and do not rely on constructions specific to the algorithms. Our approach also sheds additional light on the similarities between Q-learning and TD(λ). \n\nAlthough Theorem 1 is readily applicable to DP-based learning schemes, the theory of Dynamic Programming is important only for its characterization of the optimal solution and for the contraction property needed in applying the theorem. The theorem can be applied to iterative algorithms of different types as well. 
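The kind of process covered by Theorem 1 is easy to exercise numerically. The sketch below is a toy instance with invented quantities: it iterates Δ_{n+1}(x) = (1 - α_n(x)) Δ_n(x) + β_n(x) F_n(x) with β_n = α_n = 1/n, where the conditional mean of F_n is γ times a cyclic shift of Δ_n (a maximum-norm contraction, γ < 1) plus zero-mean bounded-variance noise, so conditions 1 through 4 hold and Δ_n is driven to zero.

```python
import numpy as np

# Toy instance of the Theorem 1 process (all numbers hypothetical):
#   Delta_{n+1}(x) = (1 - a_n) Delta_n(x) + a_n F_n(x)
# with E{F_n(x) | past} = gamma * Delta_n(x'), a contraction in the
# maximum norm since gamma < 1, plus zero-mean noise.
rng = np.random.default_rng(1)
gamma = 0.5
Delta = np.array([5.0, -3.0, 4.0])
start = np.abs(Delta).max()
for n in range(1, 20_001):
    a = 1.0 / n                                   # sum a = inf, sum a^2 < inf
    mean_part = gamma * np.roll(Delta, 1)         # contraction via cyclic shift
    F = mean_part + rng.normal(0.0, 0.5, size=3)  # bounded-variance noise
    Delta = (1.0 - a) * Delta + a * F             # here beta_n = alpha_n

print(np.abs(Delta).max())                        # shrinks toward zero
```

Swapping the cyclic shift for any other mapping whose conditional mean stays within γ of Δ_n in maximum norm leaves the conclusion unchanged, which is the sense in which the theorem is indifferent to the details of the underlying algorithm.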
\n\nFinally we note that Theorem 1 can be extended to cover processes that do not show the usual contraction property, thereby increasing its applicability to algorithms of possibly more practical importance. \n\nReferences \n\nBertsekas, D. P. (1987). Dynamic Programming: Deterministic and Stochastic Models. Englewood Cliffs, NJ: Prentice-Hall. \n\nBertsekas, D. P., & Tsitsiklis, J. N. (1989). Parallel and Distributed Computation: Numerical Methods. Englewood Cliffs, NJ: Prentice-Hall. \n\nDayan, P. (1992). The convergence of TD(λ) for general λ. Machine Learning, 8, 341-362. \n\nDayan, P., & Sejnowski, T. J. (1993). TD(λ) converges with probability 1. CNL, The Salk Institute, San Diego, CA. \n\nDvoretzky, A. (1956). On stochastic approximation. Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability. University of California Press. \n\nJaakkola, T., Jordan, M. I., & Singh, S. P. (1993). On the convergence of stochastic iterative dynamic programming algorithms. Submitted to Neural Computation. \n\nPeng, J., & Williams, R. J. (1993). TD(λ) converges with probability 1. Department of Computer Science preprint, Northeastern University. \n\nRobbins, H., & Monro, S. (1951). A stochastic approximation method. Annals of Mathematical Statistics, 22, 400-407. \n\nSutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3, 9-44. \n\nTsitsiklis, J. N. (1993). Asynchronous stochastic approximation and Q-learning. Submitted to Machine Learning. \n\nWatkins, C. J. C. H. (1989). Learning from delayed rewards. PhD thesis, University of Cambridge, England. \n\nWatkins, C. J. C. H., & Dayan, P. (1992). Q-learning. Machine Learning, 8, 279-292. 
", "award": [], "sourceid": 764, "authors": [{"given_name": "Tommi", "family_name": "Jaakkola", "institution": null}, {"given_name": "Michael", "family_name": "Jordan", "institution": null}, {"given_name": "Satinder", "family_name": "Singh", "institution": null}]}