{"title": "Stable LInear Approximations to Dynamic Programming for Stochastic Control Problems with Local Transitions", "book": "Advances in Neural Information Processing Systems", "page_first": 1045, "page_last": 1051, "abstract": null, "full_text": "Stable  Linear  Approximations to \n\nDynamic  Programming for  Stochastic \n\nControl  Problems with Local  Transitions \n\nBenjamin Van Roy and John N.  Tsitsiklis \nLaboratory for  Information and Decision Systems \n\nMassachusetts Institute of Technology \n\nCambridge,  MA  02139 \n\ne-mail:  bvr@mit.edu, jnt@mit.edu \n\nAbstract \n\nWe  consider  the  solution  to  large  stochastic  control  problems  by \nmeans of methods that rely on compact representations and a vari(cid:173)\nant of the value iteration algorithm to compute approximate cost(cid:173)\nto-go functions.  While  such methods are known  to be unstable in \ngeneral, we identify a new class of problems for  which convergence, \nas  well  as  graceful  error  bounds,  are  guaranteed.  This  class  in(cid:173)\nvolves linear parameterizations of the cost-to- go function  together \nwith  an assumption  that the  dynamic  programming operator is  a \ncontraction  with  respect  to  the  Euclidean  norm  when  applied  to \nfunctions  in  the  parameterized  class.  We  provide  a  special  case \nwhere  this  assumption  is  satisfied,  which  relies  on  the  locality of \ntransitions in a  state space.  Other cases will  be discussed  in a  full \nlength version of this paper. \n\n1 \n\nINTRODUCTION \n\nNeural  networks  are  well  established  in  the  domains  of  pattern  recognition  and \nfunction  approximation,  where  their  properties and  training algorithms  have  been \nwell  studied.  Recently,  however,  there  have  been  some  successful  applications  of \nneural  networks in  a  totally different  context - that of sequential  decision  making \nunder uncertainty  (stochastic control). 
\n\nStochastic control problems have been studied extensively in the operations research \nand  control  theory  literature for  a  long  time,  using  the  methodology  of  dynamic \nIn  dynamic  programming,  the  most  important \nprogramming  [Bertsekas,  1995]. \nobject  is  the  cost-to-go  (or  value)  junction,  which  evaluates  the  expected  future \n\n\f1046 \n\nB. V.  ROY, 1.  N. TSITSIKLIS \n\ncost  to  be  incurred, as  a  function  of the current state of a  system.  Such  functions \ncan be  used  to guide  control decisions. \n\nDynamic  programming  provides  a  variety  of  methods  for  computing  cost-to- go \nfunctions.  Unfortunately,  dynamic  programming is  computationally  intractable  in \nthe  context  of  many  stochastic  control  problems  that  arise  in  practice.  This  is \nbecause  a  cost-to-go value  is  computed  and stored for  each  state,  and  due  to  the \ncurse of dimensionality,  the number of states grows exponentially with the  number \nof variables involved. \n\nDue  to the  limited  applicability of dynamic programming, practitioners often rely \non  ad  hoc heuristic strategies when  dealing with stochastic control  problems.  Sev(cid:173)\neral  recent  success  stories  - most  notably,  the  celebrated  Backgammon  player  of \nTesauro  (1992)  - suggest  that neural networks can help  in  overcoming this limita(cid:173)\ntion.  In  these  applications,  neural  networks  are  used  as  compact  representations \nthat approximate cost- to-go functions using far fewer parameters than states.  This \napproach  offers  the  possibility  of a  systematic  and  practical  methodology  for  ad(cid:173)\ndressing complex stochastic control problems. \n\nDespite  the  success  of neural  networks  in  dynamic  programming,  the  algorithms \nused  to  tune  parameters are  poorly understood.  
Even when used to tune the parameters of linear approximators, algorithms employed in practice can be unstable [Boyan and Moore, 1995; Gordon, 1995; Tsitsiklis and Van Roy, 1994].

Some recent research has focused on establishing classes of algorithms and compact representations that guarantee stability and graceful error bounds. Tsitsiklis and Van Roy (1994) prove results involving algorithms that employ feature extraction and interpolative architectures. Gordon (1995) proves similar results concerning a closely related class of compact representations called averagers. However, there remains a huge gap between these simple approximation schemes that guarantee reasonable behavior and the complex neural network architectures employed in practice.

In this paper, we motivate an algorithm for tuning the parameters of linear compact representations, prove its convergence when used in conjunction with a class of approximation architectures, and establish error bounds. Such architectures are not captured by previous results. However, the results in this paper rely on additional assumptions. In particular, we restrict attention to Markov decision problems for which the dynamic programming operator is a contraction with respect to the Euclidean norm when applied to functions in the parameterized class. Though this assumption on the combination of compact representation and Markov decision problem appears restrictive, it is actually satisfied by several cases of practical interest. In this paper, we discuss one special case which employs affine approximations over a state space, and relies on the locality of transitions. Other cases will be discussed in a full-length version of this paper.
\n\n2  MARKOV DECISION PROBLEMS \n\nWe  consider  infinite  horizon,  discounted  Markov  decision  problems  defined  on  a \nfinite  state space  S = {I, .. . , n}  [Bertsekas,  1995].  For  every  state i  E  S,  there  is \na  finite  set  U(i) of possible control actions,  and for  each  pair i,j E  S of states and \ncontrol action u  E  U (i)  there is  a  probability Pij (u)  of a  transition from  state i  to \nstate  j  given  that  action  u  is  applied.  Furthermore,  for  every  state  i  and  control \naction u  E  U (i),  there is  a  random variable  Ciu  which represents the one-stage cost \nif action u  is  applied at state i. \nLet f3  E  [0,1)  be a  discount factor.  Since the state spaces we  consider in  this paper \n\n\fStable  Linear Approximations  Programming for  Stochastic  Control  Problems \n\n1047 \n\nare finite,  we  choose to think of cost-to-go functions  mapping states to cost- to-go \nvalues  in  terms  of cost-to-go  vectors  whose  components  are  the  cost-to-go  values \nof various states.  The optimal cost-to-go vector V*  E !Rn is  the unique  solution to \nBellman's equation: \n\nVi*=  min.  (E[CiU]+.BLPij(U)Vj*), \n\nViES. \n\n(1) \n\nuEU(t) \n\njES \n\nIf the  optimal  cost-to-go vector  is  known,  optimal  decisions  can  be  made  at  any \nstate i  as follows: \n\nu*=arg  min.  (E[CiU]+.BLPij(U)l--j*), \n\nViES. \n\nuEU(t) \n\njES \n\nThere are several algorithms for computing V*  but we  only discuss the value  itera(cid:173)\ntion algorithm which forms  the basis of the approximation algorithm to be consid(cid:173)\nered later on.  We  start with  some  notation.  We  define  the  dynamic  programming \noperator as the mapping T  : !Rn r-t  !Rn with components Ti  : !Rn r-t !R  defined  by \n\nTi(V)  =  min.  (E[CiU]+.BLPij(U)Vj ), \n\nViES. \n\n(2) \n\nuEU(t) \n\njES \n\nIt  is  well  known  and  easy  to  prove  that  T  is  a  maximum  norm  contraction.  
In particular,

$\|T(V) - T(V')\|_\infty \le \beta \|V - V'\|_\infty, \quad \forall V, V' \in \Re^n.$

The value iteration algorithm is described by

$V(t+1) = T(V(t)),$

where $V(0)$ is an arbitrary vector in $\Re^n$ used to initialize the algorithm. It is easy to see that the sequence $\{V(t)\}$ converges to $V^*$, since $T$ is a contraction.

3 APPROXIMATIONS TO DYNAMIC PROGRAMMING

Classical dynamic programming algorithms such as value iteration require that we maintain and update a vector $V$ of dimension $n$. This is essentially impossible when $n$ is extremely large, as is the norm in practical applications. We set out to overcome this limitation by using compact representations to approximate cost-to-go vectors. In this section, we develop a formal framework for compact representations, describe an algorithm for tuning the parameters of linear compact representations, and prove a theorem concerning the convergence properties of this algorithm.

3.1 COMPACT REPRESENTATIONS

A compact representation (or approximation architecture) can be thought of as a scheme for recording a high-dimensional cost-to-go vector $V \in \Re^n$ using a lower-dimensional parameter vector $w \in \Re^m$ ($m \ll n$). Such a scheme can be described by a mapping $\tilde{V} : \Re^m \mapsto \Re^n$ which to any given parameter vector $w \in \Re^m$ associates a cost-to-go vector $\tilde{V}(w)$. In particular, each component $\tilde{V}_i(w)$ of the mapping is the $i$th component of a cost-to-go vector represented by the parameter vector $w$. Note that, although we may wish to represent an arbitrary vector $V \in \Re^n$, such a scheme allows for exact representation only of those vectors $V$ which happen to lie in the range of $\tilde{V}$.

In this paper, we are concerned exclusively with linear compact representations of the form $\tilde{V}(w) = Mw$, where $M \in \Re^{n \times m}$ is a fixed matrix representing our choice of approximation architecture.
In particular, we have $\tilde{V}_i(w) = M_i w$, where $M_i$ (a row vector) is the $i$th row of the matrix $M$.

3.2 A STOCHASTIC APPROXIMATION SCHEME

Once an appropriate compact representation is chosen, the next step is to generate a parameter vector $w$ such that $\tilde{V}(w)$ approximates $V^*$. One possible objective is to minimize squared error of the form $\|Mw - V^*\|_2^2$. If we were given a fixed set of $N$ samples $\{(i_1, V_{i_1}^*), (i_2, V_{i_2}^*), \ldots, (i_N, V_{i_N}^*)\}$ of an optimal cost-to-go vector $V^*$, it seems natural to choose a parameter vector $w$ that minimizes $\sum_{j=1}^N (M_{i_j} w - V_{i_j}^*)^2$. On the other hand, if we can actively sample as many data pairs as we want, one at a time, we might consider an iterative algorithm which generates a sequence of parameter vectors $\{w(t)\}$ that converges to the desired parameter vector. One such algorithm works as follows: choose an initial guess $w(0)$, then for each $t \in \{0, 1, \ldots\}$ sample a state $i(t)$ from a uniform distribution over the state space and apply the iteration

$w(t+1) = w(t) - \alpha(t) \big( M_{i(t)} w(t) - V_{i(t)}^* \big) M_{i(t)}^T, \qquad (3)$

where $\{\alpha(t)\}$ is a sequence of diminishing step sizes and the superscript $T$ denotes a transpose. Such an approximation scheme conforms to the spirit of traditional function approximation - the algorithm is the common stochastic gradient descent method. However, as discussed in the introduction, we do not have access to such samples of the optimal cost-to-go vector. We therefore need more sophisticated methods for tuning parameters.

One possibility involves the use of an algorithm similar to that of Equation 3, replacing the samples $V_{i(t)}^*$ with $T_{i(t)}(Mw(t))$. This might be justified by the fact that $T(V)$ can be viewed as an improved approximation to $V^*$, relative to $V$.
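
To make the sampled-state iteration (3) concrete before modifying it, here is a minimal Python sketch. The setup is our own illustrative assumption, not from the paper: we take the steepest-descent update $w(t+1) = w(t) - \alpha(t)(M_{i(t)} w(t) - V_{i(t)}^*) M_{i(t)}^T$ and, purely for illustration, construct a representable target $V^* = M w_{\rm true}$ so that exact samples are available and convergence is easy to verify.

```python
import numpy as np

# Sketch of iteration (3): stochastic gradient descent on squared error,
# sampling one state uniformly per step. M, w_true, and the step-size
# schedule are illustrative choices, not taken from the paper.
rng = np.random.default_rng(0)

n, m = 50, 3                        # n states, m parameters (m << n)
M = rng.standard_normal((n, m))     # fixed linear architecture (full column rank)
w_true = np.array([1.0, -2.0, 0.5])
V_star = M @ w_true                 # stand-in for the optimal cost-to-go vector

w = np.zeros(m)
for t in range(20000):
    i = rng.integers(n)                   # sample a state uniformly
    alpha = 0.1 / (1.0 + t / 1000.0)      # diminishing step sizes (cf. Assumption 1)
    # steepest descent step on (M_i w - V*_i)^2 / 2
    w -= alpha * (M[i] @ w - V_star[i]) * M[i]

print(np.round(w, 3))   # approaches w_true
```

In practice exact samples of $V^*$ are unavailable; the same loop with the target $V_{i}^*$ replaced by $T_i(Mw(t))$ gives the modified algorithm studied next.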
The modified algorithm takes on the form

$w(t+1) = w(t) - \alpha(t) \big( M_{i(t)} w(t) - T_{i(t)}(Mw(t)) \big) M_{i(t)}^T. \qquad (4)$

Intuitively, at each time $t$ this algorithm treats $T(Mw(t))$ as a "target" and takes a steepest descent step as if the goal were to find a $w$ that would minimize $\|Mw - T(Mw(t))\|_2^2$. Such an algorithm is closely related to the TD(0) algorithm of Sutton (1988). Unfortunately, as pointed out in Tsitsiklis and Van Roy (1994), such a scheme can produce a diverging sequence $\{w(t)\}$ of weight vectors even when there exists a parameter vector $w^*$ that makes the approximation error $V^* - Mw^*$ zero at every state. However, as we will show in the remainder of this paper, under certain assumptions, such an algorithm converges.

3.3 MAIN CONVERGENCE RESULT

Our first assumption concerning the step size sequence $\{\alpha(t)\}$ is standard to stochastic approximation and is required for the upcoming theorem.

Assumption 1 Each step size $\alpha(t)$ is chosen prior to the generation of $i(t)$, and the sequence satisfies $\sum_{t=0}^\infty \alpha(t) = \infty$ and $\sum_{t=0}^\infty \alpha^2(t) < \infty$.

Our second assumption requires that $T : \Re^n \mapsto \Re^n$ be a contraction with respect to the Euclidean norm, at least when it operates on value functions that can be represented in the form $Mw$, for some $w$. This assumption is not always satisfied, but it appears to hold in some situations of interest, one of which is to be discussed in Section 4.

Assumption 2 There exists some $\beta' \in [0, 1)$ such that

$\|T(Mw) - T(Mw')\|_2 \le \beta' \|Mw - Mw'\|_2, \quad \forall w, w' \in \Re^m.$

The following theorem characterizes the stability and error bounds associated with the algorithm when the Markov decision problem satisfies the necessary criteria.

Theorem 1 Let Assumptions 1 and 2 hold, and assume that $M$ has full column rank.
Let $\Pi = M(M^T M)^{-1} M^T$ denote the projection matrix onto the subspace $X = \{Mw \mid w \in \Re^m\}$. Then,

(a) With probability 1, the sequence $w(t)$ converges to $w^*$, the unique vector that solves

$Mw^* = \Pi T(Mw^*).$

(b) Let $V^*$ be the optimal cost-to-go vector. The following error bound holds:

$\|Mw^* - V^*\|_2 \le \frac{(1 + \beta)\sqrt{n}}{1 - \beta'} \|\Pi V^* - V^*\|_\infty.$

3.4 OVERVIEW OF PROOF

Due to space limitations, we only provide an overview of the proof of Theorem 1. Let $s : \Re^m \mapsto \Re^m$ be defined by

$s(w) = E\big[ \big( M_i w - T_i(Mw) \big) M_i^T \big],$

where the expectation is taken over $i$ uniformly distributed among $\{1, \ldots, n\}$. Hence,

$E[w(t+1) \mid w(t), \alpha(t)] = w(t) - \alpha(t) s(w(t)),$

where the expectation is taken over $i(t)$. We can rewrite $s$ as

$s(w) = \frac{1}{n} \big( M^T M w - M^T T(Mw) \big),$

and it can be thought of as a vector field over $\Re^m$. If the sequence $\{w(t)\}$ converges to some $\bar{w}$, then $s(\bar{w})$ must be zero, and we have

$M^T M \bar{w} = M^T T(M\bar{w}), \quad \text{i.e.,} \quad M\bar{w} = \Pi T(M\bar{w}).$

Note that

$\|\Pi T(Mw) - \Pi T(Mw')\|_2 \le \beta' \|Mw - Mw'\|_2, \quad \forall w, w' \in \Re^m,$

due to Assumption 2 and the fact that projection is a nonexpansion of the Euclidean norm. It follows that $\Pi T(\cdot)$ has a unique fixed point $Mw^*$, with $w^* \in \Re^m$ unique since $M$ has full column rank, and this point uniquely satisfies

$Mw^* = \Pi T(Mw^*).$

We can further establish the desired error bound:

$\|Mw^* - V^*\|_2 \le \|Mw^* - \Pi T(\Pi V^*)\|_2 + \|\Pi T(\Pi V^*) - \Pi V^*\|_2 + \|\Pi V^* - V^*\|_2 \le \beta' \|Mw^* - V^*\|_2 + \|T(\Pi V^*) - V^*\|_2 + \|\Pi V^* - V^*\|_2 \le \beta' \|Mw^* - V^*\|_2 + (1 + \beta)\sqrt{n} \|\Pi V^* - V^*\|_\infty,$

and it follows that

$\|Mw^* - V^*\|_2 \le \frac{(1 + \beta)\sqrt{n}}{1 - \beta'} \|\Pi V^* - V^*\|_\infty.$

Consider the potential function $U(w) = \frac{1}{2} \|w - w^*\|_2^2$. We will establish that $(\nabla U(w))^T s(w) \ge \gamma U(w)$, for some $\gamma > 0$, and we are therefore dealing with a "pseudogradient algorithm" whose convergence follows from standard results on stochastic approximation [Polyak and Tsypkin, 1972].
This is done as follows:

$(\nabla U(w))^T s(w) = \frac{1}{n} (w - w^*)^T M^T \big( Mw - T(Mw) \big) = \frac{1}{n} (w - w^*)^T M^T \big( Mw - \Pi T(Mw) - (I - \Pi) T(Mw) \big) = \frac{1}{n} (Mw - Mw^*)^T \big( Mw - \Pi T(Mw) \big),$

where the last equality follows because $M^T \Pi = M^T$. Using the contraction assumption on $T$ and the nonexpansion property of projection mappings, we have

$\|\Pi T(Mw) - Mw^*\|_2 = \|\Pi T(Mw) - \Pi T(Mw^*)\|_2 \le \beta' \|Mw - Mw^*\|_2,$

and applying the Cauchy-Schwarz inequality, we obtain

$(\nabla U(w))^T s(w) \ge \frac{1}{n} \big( \|Mw - Mw^*\|_2^2 - \|Mw - Mw^*\|_2 \|Mw^* - \Pi T(Mw)\|_2 \big) \ge \frac{1}{n} (1 - \beta') \|Mw - Mw^*\|_2^2.$

Since $M$ has full column rank, it follows that $(\nabla U(w))^T s(w) \ge \gamma U(w)$, for some fixed $\gamma > 0$, and the proof is complete.

4 EXAMPLE: LOCAL TRANSITIONS ON GRIDS

Theorem 1 leads us to the next question: are there some interesting cases for which Assumption 2 is satisfied? We describe a particular example here that relies on properties of Markov decision problems that naturally arise in some practical situations.

When we encounter real Markov decision problems we often interpret the states in some meaningful way, associating more information with a state than an index value. For example, in the context of a queuing network, where each state is one possible queue configuration, we might think of the state as a vector in which each component records the current length of a particular queue in the network. Hence, if there are $d$ queues and each queue can hold up to $k$ customers, our state space is a finite grid $Z_k^d$ (i.e., the set of vectors with integer components each in the range $\{0, \ldots, k-1\}$).

Consider a state space where each state $i \in \{1, \ldots, n\}$ is associated with a point $x^i \in Z_k^d$ ($n = k^d$), as in the queuing example. We might expect that individual transitions between states in such a state space are local.
That is, if we are at a state $x^i$, the next visited state $x^j$ is probably close to $x^i$ in terms of Euclidean distance. For instance, we would not expect the configuration of a queuing network to change drastically in a second. This is because one customer is served at a time, so a queue that is full cannot suddenly become empty.

Note that the number of states in a state space of the form $Z_k^d$ grows exponentially with $d$. Consequently, classical dynamic programming algorithms such as value iteration quickly become impractical. To efficiently generate an approximation to the cost-to-go vector, we might consider tuning the parameters $w \in \Re^d$ and $a \in \Re$ of an affine approximation $\tilde{V}_i(w, a) = w^T x^i + a$ using the algorithm presented in the previous section. It is possible to show that, under the following assumption concerning the state space topology and locality of transitions, Assumption 2 holds with $\beta' = \sqrt{\beta^2 + \frac{6}{k-3}}$, and thus Theorem 1 characterizes convergence properties of the algorithm.

Assumption 3 The Markov decision problem has state space $S = \{1, \ldots, k^d\}$, and each state $i$ is uniquely associated with a vector $x^i \in Z_k^d$, with $k \ge 6(1 - \beta^2)^{-1} + 3$. Any pair $x^i, x^j \in Z_k^d$ of consecutively visited states either are identical or have exactly one unequal component, which differs by one.

While this assumption may seem restrictive, it is only one example. There are many more candidate examples, involving other approximation architectures and particular classes of Markov decision problems, which are currently under investigation.
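
As a concrete illustration of this setting, the following Python sketch runs the stochastic iteration with targets $T_i(Mw(t))$ on a two-dimensional grid with local transitions (each transition changes at most one coordinate by one) and an affine architecture. The cost structure, dynamics, and step sizes are illustrative choices of ours, not taken from the paper.

```python
import numpy as np

# Hypothetical instance: d = 2 "queues", each of length 0..k-1; the one-stage
# cost is the total queue length, and each action deterministically moves to a
# neighboring configuration (or stays put), so transitions are local.
rng = np.random.default_rng(1)

d, k, beta = 2, 16, 0.6
states = [(a, b) for a in range(k) for b in range(k)]
n = len(states)
index = {x: i for i, x in enumerate(states)}
# affine architecture: features [x/k, 1]; scaling keeps rows M_i small so
# larger step sizes remain stable (it spans the same affine class as [x, 1])
M = np.array([[x[0] / k, x[1] / k, 1.0] for x in states])

def neighbors(x):
    """Local transitions: stay, or change exactly one coordinate by one."""
    out = [x]
    for c in range(d):
        for delta in (-1, 1):
            y = list(x)
            y[c] += delta
            if 0 <= y[c] < k:
                out.append(tuple(y))
    return out

def T_i(w, i):
    """i-th component of T applied to Mw: one-stage cost plus best discounted successor."""
    x = states[i]
    return sum(x) + beta * min(M[index[y]] @ w for y in neighbors(x))

# stochastic iteration (4): sample a state uniformly, step w toward the target T_i(Mw)
w = np.zeros(3)
for t in range(60000):
    i = rng.integers(n)
    alpha = 0.1 / (1.0 + t / 2000.0)          # diminishing step sizes
    w -= alpha * (M[i] @ w - T_i(w, i)) * M[i]

print(np.round(w, 2))   # weights of the fitted affine cost-to-go approximation
```

Here $k = 16$ comfortably exceeds the bound $6(1 - \beta^2)^{-1} + 3 \approx 12.4$ of Assumption 3 for $\beta = 0.6$, so Theorem 1 suggests the iterates should hover near the fixed point of the projected operator, i.e., $Mw \approx \Pi T(Mw)$.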
\n\n5  CONCLUSIONS \n\nWe  have  proven  a  new  theorem  that  establishes  convergence  properties  of  an  al(cid:173)\ngorithm for  generating linear  approximations  to cost-to-go functions  for  dynamic \nprogramming.  This theorem applies whenever the dynamic programming operator \nfor  a  Markov decision problem is  a  contraction with respect  to the Euclidean norm \nwhen applied to vectors in the parameterized class.  In this paper, we have described \none  example in  which  such  a  condition  holds.  More examples of practical interest \nwill  be discussed in  a  forthcoming full  length version of this paper. \n\nAcknowledgments \n\nThis research was supported by the NSF under grant ECS 9216531, by EPRI under \ncontract 8030-10, and by  the ARO. \n\nReferences \n\nBertsekas,  D. P.  (1995)  Dynamic  Programming  and  Optimal  Control.  Athena Sci(cid:173)\nentific,  Belmont,  MA. \nBoyan,  J.  A.  &  Moore,  A.  W.  (1995)  Generalization  in  Reinforcement  Learning: \nSafely  Approximating  the  Value  Function.  In  J.  D.  Cowan,  G.  Tesauro,  and  D. \nTouretzky, editors,  Advances in Neural  Information Processing  Systems  7.  Morgan \nKaufmann. \nGordon,  G.  J.  (1995)  Stable  Function  Approximation  in  Dynamic  Programming. \nTechnical Report:  CMU-CS-95-103, Carnegie Mellon  University. \nPolyak,  B.  T.  &  Tsypkin,  Y.  Z.,  (1972)  Pseudogradient  Adaptation  and  Training \nAlgorithms.  A vtomatika  i  Telemekhanika,  3:45-68. \nSutton,  R.  S.  (1988)  Learning to  Predict  by  the  Method  of Temporal  Differences. \nMachine  Learning,  3:9-44. \n\nTesauro,  G.  (1992)  Practical  Issues  in  Temporal  Difference  Learning.  Machine \nLearning,  8:257-277. \nTsitsiklis, J. &  Van Roy, B.  (1994) Feature-Based Methods for  Large Scale Dynamic \nProgramming.  Technical  Report:  LIDS-P-2277,  Laboratory  for  Information  and \nDecision Systems, Massachusetts Institute of Technology.  Also to appear in Machine \nLearning. 
\n\n\f", "award": [], "sourceid": 1038, "authors": [{"given_name": "Benjamin", "family_name": "Van Roy", "institution": null}, {"given_name": "John", "family_name": "Tsitsiklis", "institution": null}]}