{"title": "Reinforcement Learning in Markovian and Non-Markovian Environments", "book": "Advances in Neural Information Processing Systems", "page_first": 500, "page_last": 506, "abstract": null, "full_text": "Reinforcement Learning in Markovian and Non-Markovian Environments \n\nJürgen Schmidhuber \nInstitut für Informatik \nTechnische Universität München \nArcisstr. 21, 8000 München 2, Germany \nschmidhu@tumult.informatik.tu-muenchen.de \n\nAbstract \n\nThis work addresses three problems with reinforcement learning and adaptive neuro-control: 1. Non-Markovian interfaces between learner and environment. 2. On-line learning based on system realization. 3. Vector-valued adaptive critics. An algorithm is described which is based on system realization and on two interacting fully recurrent continually running networks which may learn in parallel. Problems with parallel learning are attacked by 'adaptive randomness'. It is also described how interacting model/controller systems can be combined with vector-valued 'adaptive critics' (previous critics have been scalar). \n\n1 INTRODUCTION \n\nAt a given time, an agent with a non-Markovian interface to its environment cannot derive an optimal next action by considering its current input only. The algorithm described below differs from previous reinforcement algorithms in at least some of the following issues: it has a potential for on-line learning and non-Markovian environments; it is local in time, and in principle it allows arbitrary time lags between actions and ulterior consequences; it does not depend on anything like episode boundaries; it allows vector-valued reinforcement; it is based on two interacting fully recurrent continually running networks; and it tries to construct a full environmental model, thus providing complete 'credit assignment paths' into the past. 
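The non-Markovian difficulty can be made concrete with a tiny hypothetical sketch (illustrative code, not from the paper; the names and task sizes are made up, though a flip-flop task of this kind appears in the experiments below): the correct output is the most recent set/reset command, so a memoryless policy reading only the current input cannot act optimally, while a single bit of internal state restores the Markov property.

```python
import random

# Hypothetical flip-flop task: inputs are 'set', 'reset', or 'none'; the
# correct output at each step is the most recent set/reset command, so the
# current input alone (a non-Markovian interface) does not determine it.
def flipflop_accuracy(policy, steps=200):
    state, correct = 0, 0
    for _ in range(steps):
        x = random.choice(['set', 'reset', 'none'])
        if x == 'set':
            state = 1
        elif x == 'reset':
            state = 0
        if policy(x) == state:
            correct += 1
    return correct / steps

class RecurrentPolicy:
    # One bit of internal state makes the interface Markovian again.
    def __init__(self):
        self.mem = 0
    def __call__(self, x):
        if x == 'set':
            self.mem = 1
        elif x == 'reset':
            self.mem = 0
        return self.mem

random.seed(0)
memoryless = lambda x: {'set': 1, 'reset': 0, 'none': 0}[x]
print(flipflop_accuracy(RecurrentPolicy()))   # 1.0: internal state suffices
print(flipflop_accuracy(memoryless))          # below 1.0: guesses on 'none'
```

The recurrent policy is perfect because its internal bit always equals the task state; the memoryless policy must guess whenever the input is uninformative.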
\n\nWe dedicate one or more conventional input units (called pain and pleasure units) to reporting the actual reinforcement to a fully recurrent control network. Pain and pleasure input units have time-invariant desired values. \n\nWe employ the IID-Algorithm (Robinson and Fallside, 1987) for training a fully recurrent model network to model the relationships between environmental inputs, output actions of an agent, and corresponding pain or pleasure. The model network (e.g. (Werbos, 1987)(Jordan, 1988)(Robinson and Fallside, 1989)) in turn allows the system to compute controller gradients for 'minimizing pain' and 'maximizing pleasure'. Since reinforcement gradients depend on 'credit assignment paths' leading 'backwards through the environment', the model network should not only predict the pain and pleasure units but also the other input units. \nThe quantity to be minimized by the model network is Σ_t Σ_i (y_i(t) - y_i_pred(t))^2, where y_i(t) is the activation of the ith input unit at time t, and y_i_pred(t) is the model's prediction of the activation of the ith input unit at time t. The quantity to be minimized by the controller is Σ_t Σ_i (c_i - r_i(t))^2, where r_i(t) is the activation of the ith pain or pleasure input unit at time t and c_i is its desired activation for all times. t ranges over all (discrete) time steps. Weights are changed at each time step. This relieves dependence on 'episode boundaries'. Here the assumption is that the learning rates are small enough to avoid instabilities (Williams and Zipser, 1989). \n\nThere are two versions of the algorithm: the sequential version and the parallel version. With the sequential version, the model network is first trained by providing it with randomly chosen examples of sequences of interactions between controller and environment. 
Then the model's weights are fixed to their current values, and the controller begins to learn. With the parallel version, both the controller and the model learn concurrently. One advantage of the parallel version is that the model network focuses only on those parts of the environmental dynamics with which the controller typically is confronted. Another advantage is the applicability to changing environments. Some disadvantages of the parallel version are listed next. \n\n1. Imperfect model networks. The model which is used to compute gradient information for the controller may be wrong. However, if we assume that the model network always finds a zero-point of its error function, then over time we can expect the control network to perform gradient descent according to a perfect model of the visible parts of the real world. 1.A: The assumption that the model network can always find a zero-point of its error function is not valid in the general case. One of the reasons is the old problem of local minima, for which this paper does not suggest any solutions. 1.B: (Jordan, 1988) notes that a model network does not need to be perfect to allow increasing performance of the control network. \n\n2. Instabilities. One source of instability could arise if the model network 'forgets' information about the environmental dynamics because the activities of the controller push it into a new sub-domain, such that the weights responsible for the old well-modeled sub-domain become over-written. \n\n3. Deadlock. Even if the model's predictions are perfect for all actions executed by the controller, this does not imply that the algorithm will always behave as desired. Let us assume that the controller enters a local minimum relative to the current state of an imperfect model network. 
This relative minimum might cause the controller to execute the same action again and again (in a certain spatio-temporal context), while the model does not get a chance to learn something about the consequences of alternative actions (this is the deadlock). \n\nThe sequential version lacks the flavor of on-line learning and is bound to fail as soon as the environment changes significantly. We will introduce 'adaptive randomness' for the controller outputs to attack problems of the parallel version. \n\n2 THE ALGORITHM \n\nThe sequential version of the algorithm can be obtained in a straightforward manner from the description of the parallel version below. At every time step, the parallel version performs essentially the same operations: \nIn step 1 of the main loop of the algorithm, actions to be performed in the external world are computed. These actions are based on both current and previous inputs and outputs. For all new activations, the corresponding derivatives with respect to all controller weights are updated. In step 2 actions are executed in the external world, and the effects of the current action and/or previous actions may become visible. In step 3 the model network sees the last input and the current output of the controller at the same time. The model network tries to predict the new input without seeing it. Again the relevant gradient information is computed. In step 4 the model network is updated in order to better predict the input (including pleasure and pain) for the controller. The weights of the control network are updated in order to minimize the cumulative differences between desired and actual activations of the pain and pleasure units. 'Teacher forcing' (Williams and Zipser, 1989) is used in the model network (although there is no teacher besides the environment). 
The partial derivatives of the controller's inputs with respect to the controller's weights are approximated by the partial derivatives of the corresponding predictions generated by the model network. \n\nNotation (the reader may find it convenient to compare with (Williams and Zipser, 1989)): G is the set of all non-input units of the control network, A is the set of its output units, I is the set of its 'normal' input units, P is the set of its pain and pleasure units, M is the set of all units of the model network, O is the set of its output units, O_P ⊆ O is the set of all units that predict pain or pleasure, W_M is the set of variables for the weights of the model network, W_C is the set of variables for the weights of the control network, y_k_new is the variable for the updated activation of the kth unit from M ∪ G ∪ I ∪ P, y_k_old is the variable for the last value of y_k_new, and w_ij is the variable for the weight of the directed connection from unit j to unit i. δ_ik is the Kronecker delta, which is 1 for i = k and 0 otherwise; p^k_ij_new is the variable which gives the current (approximated) value of ∂y_k_new/∂w_ij, and p^k_ij_old is the variable which gives the last value of p^k_ij_new. If k ∈ P then c_k is k's desired activation for all times; if k ∈ I ∪ P, then k_pred is the unit from O which predicts k. α_C is the learning rate for the control network, α_M is the learning rate for the model network. |I ∪ P| = |O|, |O_P| = |P|. Each unit in I ∪ P ∪ A has one forward connection to each unit in M ∪ G, each unit in M is connected to each other unit in M, and each unit in G is connected to each other unit in G. Each weight variable of a connection leading to a unit in M is said to belong to W_M, and each weight variable of a connection leading to a unit in G is said to belong to W_C. For each weight w_ij ∈ W_M there are p^k_ij values for all k ∈ M; for each weight w_ij ∈ W_C there are p^k_ij values for all k ∈ M ∪ G ∪ I ∪ P. 
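The p-values in this notation are real-time recurrent learning (RTRL) style sensitivity estimates. The following schematic sketch (illustrative sizes and variable names, not the paper's code; a single generic recurrent net rather than the model/controller pair) shows the bookkeeping: p[k, i, j] approximates ∂y_k_new/∂w_ij and is rebuilt each step from the previous p-values, the logistic derivative, and a Kronecker-delta term δ_ik y_old[j].

```python
import numpy as np

# Schematic RTRL-style bookkeeping for one fully recurrent net with n units.
# p[k, i, j] approximates the derivative of unit k's activation with respect
# to weight w_ij; sizes here are made up for illustration.
n = 5
rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=(n, n))
y_old = np.zeros(n)
p_old = np.zeros((n, n, n))

def step(x, y_old, p_old):
    net = w @ y_old + x                    # external input enters additively
    y_new = 1.0 / (1.0 + np.exp(-net))     # logistic activation
    deriv = y_new * (1.0 - y_new)          # logistic derivative
    # sum over l of w_kl * p_old[l, i, j] ...
    p_new = np.einsum('kl,lij->kij', w, p_old)
    for k in range(n):
        p_new[k, k, :] += y_old            # ... plus delta_ik * y_old[j]
    p_new = deriv[:, None, None] * p_new
    return y_new, p_new

y_new, p_new = step(rng.normal(size=n), y_old, p_old)
print(p_new.shape)                          # (5, 5, 5)
```

The O(n^3) size of p per weight matrix is why the algorithm, as noted below, is local in time but not in space.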
The parallel version of the algorithm works as follows: \n\nINITIALIZATION: \n∀ w_ij ∈ W_M ∪ W_C: w_ij ← random; ∀ possible k: p^k_ij_old ← 0, p^k_ij_new ← 0. \n∀ k ∈ M ∪ G: y_k_old ← 0, y_k_new ← 0. \n∀ k ∈ I ∪ P: set y_k_old according to the current environment, y_k_new ← 0. \n\nUNTIL TERMINATION CRITERION IS REACHED: \n\n1. ∀ i ∈ G: y_i_new ← 1 / (1 + e^(-Σ_j w_ij y_j_old)). \n∀ w_ij ∈ W_C, k ∈ G: p^k_ij_new ← y_k_new (1 - y_k_new)(Σ_l w_kl p^l_ij_old + δ_ik y_j_old). \n∀ k ∈ G: y_k_old ← y_k_new; ∀ w_ij ∈ W_C: p^k_ij_old ← p^k_ij_new. \n\n2. Execute all actions based on activations of units in A. Update the environment. \n∀ i ∈ I ∪ P: set y_i_new according to the environment. \n\n3. ∀ i ∈ M: y_i_new ← 1 / (1 + e^(-Σ_j w_ij y_j_old)). \n∀ w_ij ∈ W_M ∪ W_C, k ∈ M: p^k_ij_new ← y_k_new (1 - y_k_new)(Σ_l w_kl p^l_ij_old + δ_ik y_j_old). \n∀ k ∈ M: y_k_old ← y_k_new; ∀ w_ij ∈ W_C ∪ W_M: p^k_ij_old ← p^k_ij_new. \n\n4. ∀ w_ij ∈ W_M: w_ij ← w_ij + α_M Σ_{k ∈ I ∪ P} (y_k_new - y_k_pred_old) p^{k_pred}_ij_old. \n∀ w_ij ∈ W_C: w_ij ← w_ij + α_C Σ_{k ∈ P} (c_k - y_k_new) p^{k_pred}_ij_new. \n∀ k ∈ I ∪ P: y_k_old ← y_k_new, y_k_pred_old ← y_k_new; ∀ w_ij ∈ W_M: p^{k_pred}_ij_old ← 0; ∀ w_ij ∈ W_C: p^k_ij_old ← p^{k_pred}_ij_new. \n\nThe algorithm is local in time, but not in space. The computation complexity per time step is O(|W_M ∪ W_C| |M| |M ∪ I ∪ P ∪ A| + |W_C| |G| |I ∪ P ∪ G|). In what follows we describe some useful extensions of the scheme. \n\n1. More network ticks than environmental ticks. For highly 'non-linear' environments the algorithm has to be modified in a trivial manner such that the involved networks perform more than one (but not more than three) iterations of step 1 and step 3 at each time step. (4-layer operations in principle can produce an arbitrary approximation of any desired mapping.) \n\n2. Adaptive randomness. 
Explicit explorative random search capabilities can be introduced by probabilistic controller outputs and 'gradient descent through random number generators' (Williams, 1988). We adjust both the mean and the variance of the controller actions. In the context of the IID algorithm, this works as follows: A probabilistic output unit k consists of a conventional unit k_μ which acts as a mean generator and a conventional unit k_σ which acts as a variance generator. At a given time, the probabilistic output y_k_new is computed by y_k_new = y_k_μ_new + z y_k_σ_new, where z is distributed e.g. according to the normal distribution. The corresponding p^k_ij_new must then be updated according to the following rule: \n\n∀ w_ij: p^k_ij_new ← p^{k_μ}_ij_new + ((y_k_new - y_k_μ_new) / y_k_σ_new) p^{k_σ}_ij_new. \n\nA more sophisticated strategy to improve the model network is to introduce 'adaptive curiosity and boredom'. The principle of adaptive curiosity for model-building neural controllers (Schmidhuber, 1990a) says: Spend additional reinforcement whenever there is a mismatch between the expectations of the model network and reality. \n\n3. Perfect models. Sometimes one can gain a 'perfect' model by constructing an appropriate mathematical description of the environmental dynamics. This saves the time needed to train the model. However, additional external knowledge is required. For instance, the description of the environment might be in the form of differential or difference equations. In the context of the algorithm above, this means introducing new p^η_ij variables for each w_ij ∈ W_C and each relevant state variable η(t) of the dynamical environment. The new variables serve to accumulate the values of ∂η(t)/∂w_ij. This can be done in exactly the same cumulative manner as with the activations of the model network above. \n\n4. Augmenting the algorithm by TD-methods. 
The following ideas are not limited to recurrent nets, but are also relevant for feed-forward controllers in Markovian environments. \nIt is possible to augment model-building algorithms with an 'adaptive critic' method. To simplify the discussion, let us assume that there are no pleasure units, just pain units. The algorithm's goal is to minimize cumulative pain. We introduce the TD-principle (Sutton, 1988) by changing the error function of the units in O_P: At a given time t, the contribution of each unit k_pred ∈ O_P to the model network's error is y_k_pred(t) - γ y_k_pred(t+1) - y_k(t+1), where y_i(t) is the activation of unit i at time t, and 0 < γ < 1 is a discount factor for avoiding predictions of infinite sums. Thus O_P is trained to predict the sum of all (discounted) future pain vectors and becomes a vector-valued adaptive critic. (This affects the first ∀-loop in step 4.) \nThe controller's goal is to minimize the absolute value of M's pain predictions. Thus, the contribution of time t to the error function of the controller now becomes Σ_{k_pred ∈ O_P} (y_k_pred(t))^2. This affects the second ∀-loop in step 4 of the algorithm. \nNote that it is not a state which is evaluated by the adaptive critic component, but a combination of a state and an action. This makes the approach similar to (Jordan and Jacobs, 1990). (Schmidhuber, 1990a) shows how a recurrent model/controller combination can be used for look-ahead planning without using TD-methods. \n\n3 EXPERIMENTS \n\nThe following experiments were conducted by the TUM students Josef Hochreiter and Klaus Bergner. See (Schmidhuber, 1990a) and (Schmidhuber, 1990b) for the full details. \n1. Evolution of a flip-flop by reinforcement learning. A controller K had to learn to behave like a flip-flop as described in (Williams and Zipser, 1989). The main difficulty (the one which makes this different from the supervised approach as described in (Williams and Zipser, 1989)) was that there was no teacher for K's (probabilistic) output units. Instead, the system had to generate alternative outputs in a variety of spatio-temporal contexts, and to build a model of the often 'painful' consequences. K's only goal information was the activation of a pain input unit whenever it produced an incorrect output. With |C| = 3, |M| = 4, α_C = 0.1 and α_M = 1.0, 20 out of 30 test runs with the parallel version required less than 1,000,000 time steps to produce an acceptable solution. \n\nWhy does it take much more time to solve the reinforcement flip-flop problem than the corresponding supervised flip-flop problem? One answer is: With supervised learning the controller gradient is given to the system, while with reinforcement learning the gradient has to be discovered by the system. \n\n2. 'Non-Markovian' pole balancing. A cart-pole system was modeled by the same differential equations used for a related balancing task which is described in (Anderson, 1986). In contrast to previous pole balancing tasks, however, no information about temporal derivatives of cart position and pole angle was provided. (Similar experiments are mentioned in (Piche, 1990).) \n\nIn our experiments the cart-pole system would not stabilize indefinitely. However, significant performance improvement was obtained. The best results were achieved by using a 'perfect model' as described above: Before learning, the average time until failure was about 25 time steps. Within a few hundred trials one could observe trials with more than 1000 time steps balancing time. 'Friendly' initial conditions could lead to balancing times of more than 3000 time steps. \n\n3. 
\n'Markovian' pole balancing with a vector-valued adaptive critic. The adaptive critic extension described above does not need a non-Markovian environment to demonstrate advantages over previous adaptive critics: A four-dimensional adaptive critic was tested on the pole balancing task described in (Anderson, 1986). The critic component had four output units for predicting four different kinds of 'pain': two for bumps against the two edges of the track and two for pole crashes. \n\nNone of five conducted test runs took more than 750 failures to achieve the first trial with more than 30000 time steps. (The longest run reported by (Anderson, 1986) took about 29000 time steps; more than 7000 failures had to be experienced to achieve that result.) \n\n4 SOME LIMITATIONS OF THE APPROACHES \n\n1. The recurrent network algorithms are not local in space. \n\n2. As with all gradient descent algorithms there is the problem of local minima. This paper does not offer any solutions to this problem. \n\n3. More severe limitations of the algorithm are inherent problems of the concepts of 'gradient descent through time' and adaptive critics. Neither gradient descent nor adaptive critics are practical when there are long time lags between actions and ultimate consequences. For this reason, first steps are made in (Schmidhuber, 1990c) towards adaptive sub-goal generators and adaptive 'causality detectors'. \n\nAcknowledgements \n\nI wish to thank Josef Hochreiter and Klaus Bergner, who conducted the experiments. This work was supported by a scholarship from SIEMENS AG. \n\nReferences \n\nAnderson, C. W. (1986). Learning and Problem Solving with Multilayer Connectionist Systems. PhD thesis, University of Massachusetts, Dept. of Comp. and Inf. Sci. \n\nJordan, M. I. (1988). Supervised learning and systems with excess degrees of freedom. Technical Report COINS TR 88-27, MIT. \n\nJordan, M. I. 
and Jacobs, R. A. (1990). Learning to control an unstable system with forward modeling. In Proc. of the 1990 Connectionist Models Summer School, in press. San Mateo, CA: Morgan Kaufmann. \n\nPiche, S. W. (1990). Draft: First order gradient descent training of adaptive discrete time dynamic networks. Technical report, Dept. of Electrical Engineering, Stanford University. \n\nRobinson, A. J. and Fallside, F. (1987). The utility driven dynamic error propagation network. Technical Report CUED/F-INFENG/TR.1, Cambridge University Engineering Department. \n\nRobinson, T. and Fallside, F. (1989). Dynamic reinforcement driven error propagation networks with application to game playing. In Proceedings of the 11th Conference of the Cognitive Science Society, Ann Arbor, pages 836-843. \n\nSchmidhuber, J. H. (1990a). Making the world differentiable: On using fully recurrent self-supervised neural networks for dynamic reinforcement learning and planning in non-stationary environments. Technical Report FKI-126-90 (revised), Institut für Informatik, Technische Universität München. (Revised and extended version of an earlier report from February.) \n\nSchmidhuber, J. H. (1990b). Networks adjusting networks. Technical Report FKI-125-90 (revised), Institut für Informatik, Technische Universität München. (Revised and extended version of an earlier report from February.) \n\nSchmidhuber, J. H. (1990c). Towards compositional learning with dynamic neural networks. Technical Report FKI-129-90, Institut für Informatik, Technische Universität München. \n\nSutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3:9-44. \n\nWerbos, P. J. (1987). Building and understanding adaptive systems: A statistical/numerical approach to factory automation and brain research. IEEE Transactions on Systems, Man, and Cybernetics, 17. \n\nWilliams, R. J. (1988). On the use of backpropagation in associative reinforcement learning. In IEEE International Conference on Neural Networks, San Diego, volume 2, pages 263-270. \n\nWilliams, R. J. and Zipser, D. (1989). Experimental analysis of the real-time recurrent learning algorithm. Connection Science, 1(1):87-111. \n", "award": [], "sourceid": 393, "authors": [{"given_name": "J\u00fcrgen", "family_name": "Schmidhuber", "institution": null}]}