{"title": "Stable Linear Approximations to Dynamic Programming for Stochastic Control Problems with Local Transitions", "book": "Advances in Neural Information Processing Systems", "page_first": 1045, "page_last": 1051, "abstract": null, "full_text": "Stable Linear Approximations to Dynamic Programming for Stochastic Control Problems with Local Transitions\n\nBenjamin Van Roy and John N. Tsitsiklis\nLaboratory for Information and Decision Systems\nMassachusetts Institute of Technology\nCambridge, MA 02139\ne-mail: bvr@mit.edu, jnt@mit.edu\n\nAbstract\n\nWe consider the solution to large stochastic control problems by means of methods that rely on compact representations and a variant of the value iteration algorithm to compute approximate cost-to-go functions. While such methods are known to be unstable in general, we identify a new class of problems for which convergence, as well as graceful error bounds, are guaranteed. This class involves linear parameterizations of the cost-to-go function together with an assumption that the dynamic programming operator is a contraction with respect to the Euclidean norm when applied to functions in the parameterized class. We provide a special case where this assumption is satisfied, which relies on the locality of transitions in a state space. Other cases will be discussed in a full length version of this paper.\n\n1 INTRODUCTION\n\nNeural networks are well established in the domains of pattern recognition and function approximation, where their properties and training algorithms have been well studied. Recently, however, there have been some successful applications of neural networks in a totally different context: that of sequential decision making under uncertainty (stochastic control). 
\n\nStochastic control problems have been studied extensively in the operations research and control theory literature for a long time, using the methodology of dynamic programming [Bertsekas, 1995]. In dynamic programming, the most important object is the cost-to-go (or value) function, which evaluates the expected future cost to be incurred, as a function of the current state of a system. Such functions can be used to guide control decisions.\n\nDynamic programming provides a variety of methods for computing cost-to-go functions. Unfortunately, dynamic programming is computationally intractable in the context of many stochastic control problems that arise in practice. This is because a cost-to-go value is computed and stored for each state, and due to the curse of dimensionality, the number of states grows exponentially with the number of variables involved.\n\nDue to the limited applicability of dynamic programming, practitioners often rely on ad hoc heuristic strategies when dealing with stochastic control problems. Several recent success stories, most notably the celebrated Backgammon player of Tesauro (1992), suggest that neural networks can help in overcoming this limitation. In these applications, neural networks are used as compact representations that approximate cost-to-go functions using far fewer parameters than states. This approach offers the possibility of a systematic and practical methodology for addressing complex stochastic control problems.\n\nDespite the success of neural networks in dynamic programming, the algorithms used to tune parameters are poorly understood. Even when used to tune the parameters of linear approximators, algorithms employed in practice can be unstable [Boyan and Moore, 1995; Gordon, 1995; Tsitsiklis and Van Roy, 1994]. 
\n\nSome recent research has focused on establishing classes of algorithms and compact representations that guarantee stability and graceful error bounds. Tsitsiklis and Van Roy (1994) prove results involving algorithms that employ feature extraction and interpolative architectures. Gordon (1995) proves similar results concerning a closely related class of compact representations called averagers. However, there remains a huge gap between these simple approximation schemes that guarantee reasonable behavior and the complex neural network architectures employed in practice.\n\nIn this paper, we motivate an algorithm for tuning the parameters of linear compact representations, prove its convergence when used in conjunction with a class of approximation architectures, and establish error bounds. Such architectures are not captured by previous results. However, the results in this paper rely on additional assumptions. In particular, we restrict attention to Markov decision problems for which the dynamic programming operator is a contraction with respect to the Euclidean norm when applied to functions in the parameterized class. Though this assumption on the combination of compact representation and Markov decision problem appears restrictive, it is actually satisfied by several cases of practical interest. In this paper, we discuss one special case which employs affine approximations over a state space, and relies on the locality of transitions. Other cases will be discussed in a full length version of this paper.\n\n2 MARKOV DECISION PROBLEMS\n\nWe consider infinite horizon, discounted Markov decision problems defined on a finite state space $S = \\{1, \\ldots, n\\}$ [Bertsekas, 1995]. 
For every state $i \\in S$, there is a finite set $U(i)$ of possible control actions, and for each pair $i, j \\in S$ of states and control action $u \\in U(i)$ there is a probability $p_{ij}(u)$ of a transition from state $i$ to state $j$ given that action $u$ is applied. Furthermore, for every state $i$ and control action $u \\in U(i)$, there is a random variable $c_{iu}$ which represents the one-stage cost if action $u$ is applied at state $i$.\n\nLet $\\beta \\in [0, 1)$ be a discount factor. Since the state spaces we consider in this paper are finite, we choose to think of cost-to-go functions mapping states to cost-to-go values in terms of cost-to-go vectors whose components are the cost-to-go values of various states. The optimal cost-to-go vector $V^* \\in \\mathbb{R}^n$ is the unique solution to Bellman's equation:\n\n$$V_i^* = \\min_{u \\in U(i)} \\Big( E[c_{iu}] + \\beta \\sum_{j \\in S} p_{ij}(u) V_j^* \\Big), \\qquad \\forall i \\in S. \\tag{1}$$\n\nIf the optimal cost-to-go vector is known, optimal decisions can be made at any state $i$ as follows:\n\n$$u^* = \\arg\\min_{u \\in U(i)} \\Big( E[c_{iu}] + \\beta \\sum_{j \\in S} p_{ij}(u) V_j^* \\Big), \\qquad \\forall i \\in S.$$\n\nThere are several algorithms for computing $V^*$, but we only discuss the value iteration algorithm, which forms the basis of the approximation algorithm to be considered later on. We start with some notation. We define the dynamic programming operator as the mapping $T : \\mathbb{R}^n \\mapsto \\mathbb{R}^n$ with components $T_i : \\mathbb{R}^n \\mapsto \\mathbb{R}$ defined by\n\n$$T_i(V) = \\min_{u \\in U(i)} \\Big( E[c_{iu}] + \\beta \\sum_{j \\in S} p_{ij}(u) V_j \\Big), \\qquad \\forall i \\in S. \\tag{2}$$\n\nIt is well known and easy to prove that $T$ is a maximum norm contraction. In particular,\n\n$$\\|T(V) - T(V')\\|_\\infty \\le \\beta \\|V - V'\\|_\\infty, \\qquad \\forall V, V' \\in \\mathbb{R}^n.$$\n\nThe value iteration algorithm is described by\n\n$$V(t + 1) = T(V(t)),$$\n\nwhere $V(0)$ is an arbitrary vector in $\\mathbb{R}^n$ used to initialize the algorithm. It is easy to see that the sequence $\\{V(t)\\}$ converges to $V^*$, since $T$ is a contraction. 
\n\n3 APPROXIMATIONS TO DYNAMIC PROGRAMMING\n\nClassical dynamic programming algorithms such as value iteration require that we maintain and update a vector $V$ of dimension $n$. This is essentially impossible when $n$ is extremely large, as is the norm in practical applications. We set out to overcome this limitation by using compact representations to approximate cost-to-go vectors. In this section, we develop a formal framework for compact representations, describe an algorithm for tuning the parameters of linear compact representations, and prove a theorem concerning the convergence properties of this algorithm.\n\n3.1 COMPACT REPRESENTATIONS\n\nA compact representation (or approximation architecture) can be thought of as a scheme for recording a high-dimensional cost-to-go vector $V \\in \\mathbb{R}^n$ using a lower-dimensional parameter vector $w \\in \\mathbb{R}^m$ ($m \\ll n$). Such a scheme can be described by a mapping $\\tilde{V} : \\mathbb{R}^m \\mapsto \\mathbb{R}^n$ which to any given parameter vector $w \\in \\mathbb{R}^m$ associates a cost-to-go vector $\\tilde{V}(w)$. In particular, each component $\\tilde{V}_i(w)$ of the mapping is the $i$th component of a cost-to-go vector represented by the parameter vector $w$. Note that, although we may wish to represent an arbitrary vector $V \\in \\mathbb{R}^n$, such a scheme allows for exact representation only of those vectors $V$ which happen to lie in the range of $\\tilde{V}$.\n\nIn this paper, we are concerned exclusively with linear compact representations of the form $\\tilde{V}(w) = Mw$, where $M \\in \\mathbb{R}^{n \\times m}$ is a fixed matrix representing our choice of approximation architecture. In particular, we have $\\tilde{V}_i(w) = M_i w$, where $M_i$ (a row vector) is the $i$th row of the matrix $M$.\n\n3.2 A STOCHASTIC APPROXIMATION SCHEME\n\nOnce an appropriate compact representation is chosen, the next step is to generate a parameter vector $w$ such that $\\tilde{V}(w)$ approximates $V^*$. One possible objective is to minimize squared error of the form $\\|Mw - V^*\\|_2^2$. 
If we were given a fixed set of $N$ samples $\\{(i_1, V^*_{i_1}), (i_2, V^*_{i_2}), \\ldots, (i_N, V^*_{i_N})\\}$ of an optimal cost-to-go vector $V^*$, it seems natural to choose a parameter vector $w$ that minimizes $\\sum_{j=1}^{N} (M_{i_j} w - V^*_{i_j})^2$. On the other hand, if we can actively sample as many data pairs as we want, one at a time, we might consider an iterative algorithm which generates a sequence of parameter vectors $\\{w(t)\\}$ that converges to the desired parameter vector. One such algorithm works as follows: choose an initial guess $w(0)$, then for each $t \\in \\{0, 1, \\ldots\\}$ sample a state $i(t)$ from a uniform distribution over the state space and apply the iteration\n\n$$w(t + 1) = w(t) - \\alpha(t) \\big( M_{i(t)} w(t) - V^*_{i(t)} \\big) M_{i(t)}^T, \\tag{3}$$\n\nwhere $\\{\\alpha(t)\\}$ is a sequence of diminishing step sizes and the superscript $T$ denotes a transpose. Such an approximation scheme conforms to the spirit of traditional function approximation: the algorithm is the common stochastic gradient descent method. However, as discussed in the introduction, we do not have access to such samples of the optimal cost-to-go vector. We therefore need more sophisticated methods for tuning parameters.\n\nOne possibility involves the use of an algorithm similar to that of Equation 3, replacing samples of $V^*_{i(t)}$ with $T_{i(t)}(Mw(t))$. This might be justified by the fact that $T(V)$ can be viewed as an improved approximation to $V^*$, relative to $V$. The modified algorithm takes on the form\n\n$$w(t + 1) = w(t) - \\alpha(t) \\big( M_{i(t)} w(t) - T_{i(t)}(Mw(t)) \\big) M_{i(t)}^T. \\tag{4}$$\n\nIntuitively, at each time $t$ this algorithm treats $T(Mw(t))$ as a \"target\" and takes a steepest descent step as if the goal were to find a $w$ that would minimize $\\|Mw - T(Mw(t))\\|_2^2$. Such an algorithm is closely related to the TD(0) algorithm of Sutton (1988). Unfortunately, as pointed out in Tsitsiklis and Van Roy (1994), such a scheme can produce a diverging sequence $\\{w(t)\\}$ of weight vectors even when there exists a parameter vector $w^*$ that makes the approximation error $V^* - Mw^*$ zero at every state. 
However, as we will show in the remainder of this paper, under certain assumptions, such an algorithm converges.\n\n3.3 MAIN CONVERGENCE RESULT\n\nOur first assumption concerning the step size sequence $\\{\\alpha(t)\\}$ is standard to stochastic approximation and is required for the upcoming theorem.\n\nAssumption 1 Each step size $\\alpha(t)$ is chosen prior to the generation of $i(t)$, and the sequence satisfies $\\sum_{t=0}^{\\infty} \\alpha(t) = \\infty$ and $\\sum_{t=0}^{\\infty} \\alpha^2(t) < \\infty$.\n\nOur second assumption requires that $T : \\mathbb{R}^n \\mapsto \\mathbb{R}^n$ be a contraction with respect to the Euclidean norm, at least when it operates on value functions that can be represented in the form $Mw$, for some $w$. This assumption is not always satisfied, but it appears to hold in some situations of interest, one of which is to be discussed in Section 4.\n\nAssumption 2 There exists some $\\beta' \\in [0, 1)$ such that\n\n$$\\|T(Mw) - T(Mw')\\|_2 \\le \\beta' \\|Mw - Mw'\\|_2, \\qquad \\forall w, w' \\in \\mathbb{R}^m.$$\n\nThe following theorem characterizes the stability and error bounds associated with the algorithm when the Markov decision problem satisfies the necessary criteria.\n\nTheorem 1 Let Assumptions 1 and 2 hold, and assume that $M$ has full column rank. Let $\\Pi = M(M^T M)^{-1} M^T$ denote the projection matrix onto the subspace $X = \\{Mw \\mid w \\in \\mathbb{R}^m\\}$. Then,\n(a) With probability 1, the sequence $w(t)$ converges to $w^*$, the unique vector that solves\n\n$$Mw^* = \\Pi T(Mw^*).$$\n\n(b) Let $V^*$ be the optimal cost-to-go vector. The following error bound holds:\n\n$$\\|Mw^* - V^*\\|_2 \\le \\frac{(1 + \\beta)\\sqrt{n}}{1 - \\beta'} \\|\\Pi V^* - V^*\\|_\\infty.$$\n\n3.4 OVERVIEW OF PROOF\n\nDue to space limitations, we only provide an overview of the proof of Theorem 1. Let $s : \\mathbb{R}^m \\mapsto \\mathbb{R}^m$ be defined by\n\n$$s(w) = E\\big[ (M_i w - T_i(Mw)) M_i^T \\big],$$\n\nwhere the expectation is taken over $i$ uniformly distributed among $\\{1, \\ldots, n\\}$. Hence,\n\n$$E[w(t + 1) \\mid w(t), \\alpha(t)] = w(t) - \\alpha(t) s(w(t)),$$\n\nwhere the expectation is taken over $i(t)$. 
We can rewrite $s$ as\n\n$$s(w) = \\frac{1}{n} \\big( M^T M w - M^T T(Mw) \\big),$$\n\nand it can be thought of as a vector field over $\\mathbb{R}^m$. If the sequence $\\{w(t)\\}$ converges to some $w$, then $s(w)$ must be zero, and we have\n\n$$M^T M w = M^T T(Mw), \\qquad Mw = \\Pi T(Mw).$$\n\nNote that\n\n$$\\|\\Pi T(Mw) - \\Pi T(Mw')\\|_2 \\le \\beta' \\|Mw - Mw'\\|_2, \\qquad \\forall w, w' \\in \\mathbb{R}^m,$$\n\ndue to Assumption 2 and the fact that projection is a nonexpansion of the Euclidean norm. It follows that $\\Pi T(\\cdot)$ has a unique fixed point $w^* \\in \\mathbb{R}^m$, and this point uniquely satisfies\n\n$$Mw^* = \\Pi T(Mw^*).$$\n\nWe can further establish the desired error bound:\n\n$$\\begin{aligned} \\|Mw^* - V^*\\|_2 &\\le \\|Mw^* - \\Pi T(\\Pi V^*)\\|_2 + \\|\\Pi T(\\Pi V^*) - \\Pi V^*\\|_2 + \\|\\Pi V^* - V^*\\|_2 \\\\\\\\ &\\le \\beta' \\|Mw^* - V^*\\|_2 + \\|T(\\Pi V^*) - V^*\\|_2 + \\|\\Pi V^* - V^*\\|_2 \\\\\\\\ &\\le \\beta' \\|Mw^* - V^*\\|_2 + (1 + \\beta)\\sqrt{n} \\|\\Pi V^* - V^*\\|_\\infty, \\end{aligned}$$\n\nand it follows that\n\n$$\\|Mw^* - V^*\\|_2 \\le \\frac{(1 + \\beta)\\sqrt{n}}{1 - \\beta'} \\|\\Pi V^* - V^*\\|_\\infty.$$\n\nConsider the potential function $U(w) = \\frac{1}{2} \\|w - w^*\\|_2^2$. We will establish that $(\\nabla U(w))^T s(w) \\ge \\gamma U(w)$, for some $\\gamma > 0$, and we are therefore dealing with a \"pseudogradient algorithm\" whose convergence follows from standard results on stochastic approximation [Polyak and Tsypkin, 1972]. This is done as follows:\n\n$$\\begin{aligned} (\\nabla U(w))^T s(w) &= \\frac{1}{n} (w - w^*)^T M^T \\big( Mw - T(Mw) \\big) \\\\\\\\ &= \\frac{1}{n} (w - w^*)^T M^T \\big( Mw - \\Pi T(Mw) - (I - \\Pi) T(Mw) \\big) \\\\\\\\ &= \\frac{1}{n} (Mw - Mw^*)^T \\big( Mw - \\Pi T(Mw) \\big), \\end{aligned}$$\n\nwhere the last equality follows because $M^T \\Pi = M^T$. Using the contraction assumption on $T$ and the nonexpansion property of projection mappings, we have\n\n$$\\|\\Pi T(Mw) - Mw^*\\|_2 = \\|\\Pi T(Mw) - \\Pi T(Mw^*)\\|_2 \\le \\beta' \\|Mw - Mw^*\\|_2,$$\n\nand applying the Cauchy-Schwarz inequality, we obtain\n\n$$\\begin{aligned} (\\nabla U(w))^T s(w) &\\ge \\frac{1}{n} \\big( \\|Mw - Mw^*\\|_2^2 - \\|Mw - Mw^*\\|_2 \\|Mw^* - \\Pi T(Mw)\\|_2 \\big) \\\\\\\\ &\\ge \\frac{1}{n} (1 - \\beta') \\|Mw - Mw^*\\|_2^2. \\end{aligned}$$\n\nSince $M$ has full column rank, it follows that $(\\nabla U(w))^T s(w) \\ge \\gamma U(w)$, for some fixed $\\gamma > 0$, and the proof is complete.\n\n4 EXAMPLE: LOCAL TRANSITIONS ON GRIDS\n\nTheorem 1 leads us to the next question: are there some interesting cases for which Assumption 2 is satisfied? 
We describe a particular example here that relies on properties of Markov decision problems that naturally arise in some practical situations.\n\nWhen we encounter real Markov decision problems we often interpret the states in some meaningful way, associating more information with a state than an index value. For example, in the context of a queuing network, where each state is one possible queue configuration, we might think of the state as a vector in which each component records the current length of a particular queue in the network. Hence, if there are $d$ queues and each queue can hold up to $k$ customers, our state space is a finite grid $\\mathbb{Z}_k^d$ (i.e., the set of vectors with integer components each in the range $\\{0, \\ldots, k-1\\}$).\n\nConsider a state space where each state $i \\in \\{1, \\ldots, n\\}$ is associated to a point $x^i \\in \\mathbb{Z}_k^d$ ($n = k^d$), as in the queuing example. We might expect that individual transitions between states in such a state space are local. That is, if we are at a state $x^i$ the next visited state $x^j$ is probably close to $x^i$ in terms of Euclidean distance. For instance, we would not expect the configuration of a queuing network to change drastically in a second. This is because one customer is served at a time, so a queue that is full cannot suddenly become empty.\n\nNote that the number of states in a state space of the form $\\mathbb{Z}_k^d$ grows exponentially with $d$. Consequently, classical dynamic programming algorithms such as value iteration quickly become impractical. To efficiently generate an approximation to the cost-to-go vector, we might consider tuning the parameters $w \\in \\mathbb{R}^d$ and $a \\in \\mathbb{R}$ of an affine approximation $\\tilde{V}_i(w, a) = w^T x^i + a$ using the algorithm presented in the previous section. 
It is possible to show that, under the following assumption concerning the state space topology and locality of transitions, Assumption 2 holds with $\\beta' = \\sqrt{\\beta^2 + \\frac{3}{k-3}}$, and thus Theorem 1 characterizes convergence properties of the algorithm.\n\nAssumption 3 The Markov decision problem has state space $S = \\{1, \\ldots, k^d\\}$, and each state $i$ is uniquely associated with a vector $x^i \\in \\mathbb{Z}_k^d$ with $k \\ge 6(1 - \\beta^2)^{-1} + 3$. Any pair $x^i, x^j \\in \\mathbb{Z}_k^d$ of consecutively visited states either are identical or have exactly one unequal component, which differs by one.\n\nWhile this assumption may seem restrictive, it is only one example. There are many more candidate examples, involving other approximation architectures and particular classes of Markov decision problems, which are currently under investigation.\n\n5 CONCLUSIONS\n\nWe have proven a new theorem that establishes convergence properties of an algorithm for generating linear approximations to cost-to-go functions for dynamic programming. This theorem applies whenever the dynamic programming operator for a Markov decision problem is a contraction with respect to the Euclidean norm when applied to vectors in the parameterized class. In this paper, we have described one example in which such a condition holds. More examples of practical interest will be discussed in a forthcoming full length version of this paper.\n\nAcknowledgments\n\nThis research was supported by the NSF under grant ECS 9216531, by EPRI under contract 8030-10, and by the ARO.\n\nReferences\n\nBertsekas, D. P. (1995) Dynamic Programming and Optimal Control. Athena Scientific, Belmont, MA.\nBoyan, J. A. & Moore, A. W. (1995) Generalization in Reinforcement Learning: Safely Approximating the Value Function. In J. D. Cowan, G. Tesauro, and D. 
Touretzky, editors, Advances in Neural Information Processing Systems 7. Morgan Kaufmann.\nGordon, G. J. (1995) Stable Function Approximation in Dynamic Programming. Technical Report: CMU-CS-95-103, Carnegie Mellon University.\nPolyak, B. T. & Tsypkin, Y. Z. (1972) Pseudogradient Adaptation and Training Algorithms. Avtomatika i Telemekhanika, 3:45-68.\nSutton, R. S. (1988) Learning to Predict by the Method of Temporal Differences. Machine Learning, 3:9-44.\nTesauro, G. (1992) Practical Issues in Temporal Difference Learning. Machine Learning, 8:257-277.\nTsitsiklis, J. & Van Roy, B. (1994) Feature-Based Methods for Large Scale Dynamic Programming. Technical Report: LIDS-P-2277, Laboratory for Information and Decision Systems, Massachusetts Institute of Technology. Also to appear in Machine Learning.\n", "award": [], "sourceid": 1038, "authors": [{"given_name": "Benjamin", "family_name": "Van Roy", "institution": null}, {"given_name": "John", "family_name": "Tsitsiklis", "institution": null}]}