{"title": "An Environment Model for Nonstationary Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 987, "page_last": 993, "abstract": null, "full_text": "An Environment Model for N onstationary \n\nReinforcement Learning \n\nSamuel P. M. Choi \npmchoi~cs.ust.hk \n\nDit-Yan Yeung \n\ndyyeung~cs.ust.hk \n\nNevin L. Zhang \nlzhang~cs.ust.hk \n\nDepartment of Computer Science, Hong Kong University of Science and Technology \n\nClear Water Bay, Kowloon, Hong Kong \n\nAbstract \n\nReinforcement learning in nonstationary environments is generally \nregarded as an important and yet difficult problem. This paper \npartially addresses the problem by formalizing a subclass of nonsta(cid:173)\ntionary environments. The environment model, called hidden-mode \nMarkov decision process (HM-MDP), assumes that environmental \nchanges are always confined to a small number of hidden modes. \nA mode basically indexes a Markov decision process (MDP) and \nevolves with time according to a Markov chain. While HM-MDP \nis a special case of partially observable Markov decision processes \n(POMDP), modeling an HM-MDP environment via the more gen(cid:173)\neral POMDP model unnecessarily increases the problem complex(cid:173)\nity. A variant of the Baum-Welch algorithm is developed for model \nlearning requiring less data and time. \n\n1 \n\nIntroduction \n\nReinforcement Learning (RL) [7] is a learning paradigm based upon the framework \nof Markov decision process (MDP). Traditional RL research assumes that environ(cid:173)\nment dynamics (i.e., MDP parameters) are always fixed (Le., stationary). This \nassumption, however, is not realistic in many real-world applications. In elevator \ncontrol [3], for instance, the passenger arrival and departure rates can vary signifi(cid:173)\ncantly over one day, and should not be modeled by a fixed MDP. \n\nNonetheless, RL in nonstationary environments is regarded as a difficult problem. 
\nIn fact, it is an impossible task if there is no regularity in the ways environment \ndynamics change. Hence, some degree of regularity must be assumed. Typically, \nnonstationary environments are presummed to change slowly enough such that on(cid:173)\nline RL algorithms can be employed to keep track the changes. The online approach \nis memory less in the sense that even if the environment ever revert to the previously \nlearned dynamics, learning must still need to be started all over again. \n\n\f988 \n\nS. P. M Choi, D.-y' Yeung and N. L. Zhang \n\n1.1 Our Proposed Model \n\nThis paper proposes a formal model [1] for the nonstationary environments that \nrepeats their dynamics in certain ways. Our model is inspired by the observations \nfrom the real-world nonstationary tasks with the following properties: \nProperty 1. Environmental changes are confined to a small number of modes, \nwhich are stationary environments with distinct dynamics. The environment is in \nexactly one of these modes at any given time. This concept of modes seems to be \napplicable to many real-world tasks. In an elevator control problem, for example, \nthe system might operate in a morning-rush-hour mode, an evening-rush-hour mode \nand a non-rush-hour mode. One can also imagine similar modes for other control \ntasks, such as traffic control and dynamic channel allocation [6]. \nProperty 2. Unlike states, modes cannot be directly observed; the current mode \ncan only be estimated according to the past state transitions. It is analogous to the \nelevator control example in that the passenger arrival rate and pattern can only be \ninferred through the occurrence of pick-up and drop-off requests. \nProperty 3. Mode transitions are stochastic events and are independent of the \ncontrol system's responses. 
In the elevator control problem, for instance, the events that change the current mode of the environment could be an emergency meeting in the administrative office, or a tea break for the staff on the 10th floor. Obviously, the elevator's response has no control over the occurrence of these events.

Property 4. Mode transitions are relatively infrequent. In other words, a mode is likely to persist for some time before switching to another one. In the emergency meeting example, employees on different floors take time to arrive at the administrative office, and thus would generate a similar traffic pattern (drop-off requests on the same floor) for some period of time.

Property 5. The number of states is often substantially larger than the number of modes. This is a common property of many real-world applications. In the elevator example, the state space comprises all possible combinations of elevator positions, pick-up and drop-off requests, and hence would be huge. The mode space, on the other hand, could be small. For instance, an elevator control system could simply use the three modes described above to approximate reality.

Based on these properties, an environment model is proposed by introducing a mode variable to capture environmental changes. Each mode specifies an MDP and hence completely determines the current state transition function and reward function (Property 1). A mode, however, is not directly observable (Property 2), and evolves with time according to a Markov process (Property 3). The model is therefore called a hidden-mode model. Note that our model does not impose any constraint to satisfy Properties 4 and 5. In other words, the hidden-mode model can work for environments without these two properties. Nevertheless, as will be shown later, these properties can improve learning in practice. 
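Properties 1 through 4 can be illustrated with a small simulation: a hidden mode evolves by a Markov chain, independent of the agent's actions, while each state transition is drawn from the current mode's MDP. The sizes and probabilities below are hypothetical toy values, not taken from the paper:

```python
import random

def simulate(X, Y, mode0, state0, actions, steps, seed=0):
    """Simulate a hidden-mode environment (a minimal sketch).

    X[m][n]       : mode transition probability (hidden Markov chain)
    Y[m][s][a][t] : state transition probability under mode m, action a
    The agent observes only the state sequence, never the mode.
    """
    rng = random.Random(seed)

    def draw(probs):
        # Sample an index from a discrete distribution.
        r, acc = rng.random(), 0.0
        for i, p in enumerate(probs):
            acc += p
            if r < acc:
                return i
        return len(probs) - 1

    mode, state, visible = mode0, state0, [state0]
    for t in range(steps):
        state = draw(Y[mode][state][actions[t]])  # observable (Property 2)
        mode = draw(X[mode])                      # hidden, action-independent (Property 3)
        visible.append(state)
    return visible

# Two hypothetical modes over two states and one action:
X = [[0.9, 0.1], [0.1, 0.9]]                      # modes persist (Property 4)
Y = [  # mode 0 prefers state 0; mode 1 prefers state 1
    [[[0.8, 0.2]], [[0.8, 0.2]]],
    [[[0.2, 0.8]], [[0.2, 0.8]]],
]
trace = simulate(X, Y, mode0=0, state0=0, actions=[0] * 20, steps=20)
print(len(trace))  # 21 observed states
```

Only `trace` is visible to the learner; the mode sequence must be inferred from the pattern of state transitions, which is exactly the estimation problem Property 2 describes.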
\n\n1.2 Related Work \n\nOur hidden-mode model is related to a non stationary model proposed by Dayan and \nSejnowski [4]. Although our model is more restrictive in terms of representational \npower, it involves much fewer parameters and is thus easier to learn. Besides, other \nthan the number of possible modes, we do not assume any other knowledge about \n\n\fAn Environment Model for Nonstationary Reinforcement Learning \n\n989 \n\nthe way environment dynamics change. Dayan and Sejnowski, on the other hand, \nassume that one knows precisely how the environment dynamics change. \n\nThe hidden-mode model can also be viewed as a special case of the hidden-state \nmodel, or partially observable Markov decision process (POMDP). As will be shown \nlater, a hidden-mode model can always be represented by a hidden-state model \nthrough state augmentation. Nevertheless, modeling a hidden-mode environment \nvia a hidden-state model will unnecessarily increase the problem complexity. In this \npaper, the conversion from the former to the latter is also briefly discussed. \n\n1.3 Our Focus \n\nThere are two approaches for RL. Model-based RL first acquires an environment \nmodel and then, from which, an optimal policy is derived. Model-free RL, on the \ncontrary, learns an optimal policy directly through its interaction with the envi(cid:173)\nronment. This paper is concerned with the first part of the model-based approach, \ni.e., how a hidden-mode model can be learned from experience. We will address the \npolicy learning problem in a separate paper. \n\n2 Hidden-Mode Markov Decision Processes \n\nThis section presents our hidden-mode model. Basically, a hidden-mode model is \ndefined as a finite set of MDPs that share the same state space and action space, with \npossibly different transition functions and reward functions. The MDPs correspond \nto different modes in which a system operates. States are completely observable \nand their transitions are governed by an MDP. 
In contrast, modes are not directly observable and their transitions are governed by a Markov chain. We refer to such a process as a hidden-mode Markov decision process (HM-MDP). An example of an HM-MDP is shown in Figure 1(a).

[Figure 1: An HM-MDP. (a) A 3-mode, 4-state, 1-action HM-MDP. (b) The evolution of an HM-MDP over time, mode, action and state; the arcs indicate dependencies between the variables.]

Formally, an HM-MDP is an 8-tuple (Q, S, A, X, Y, R, Π, Ψ), where Q, S and A represent the sets of modes, states and actions respectively; the mode transition function X maps mode m to mode n with a fixed probability x_mn; the state transition function Y defines the transition probability y_m(s, a, s') from state s to state s' given mode m and action a; the stochastic reward function R returns rewards with mean value r_m(s, a); Π and Ψ denote the prior probabilities of the modes and of the states respectively. The evolution of modes and states over time is depicted in Figure 1(b).

HM-MDP is a subclass of POMDP. In other words, the former can be reformulated as a special case of the latter. Specifically, one may take an ordered pair of any mode and observable state in the HM-MDP as a hidden state in the POMDP, and any observable state of the former as an observation of the latter. Suppose the observable states s and s' are in modes m and n respectively. These two HM-MDP states together with their corresponding modes form two hidden states (m, s) and (n, s') for the POMDP counterpart. The transition probability from (m, s) to (n, s') under action a is then simply the mode transition probability x_mn multiplied by the state transition probability y_m(s, a, s'). For an M-mode, N-state, K-action HM-MDP, the equivalent POMDP thus has N observations and MN hidden states. 
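This state-augmentation construction can be sketched directly; the data layout and the tiny 2-mode, 2-state, 1-action model below are illustrative assumptions, not an implementation from the paper:

```python
def hmmdp_to_pomdp(X, Y):
    """Build the equivalent POMDP transition function from HM-MDP parameters.

    X[m][n]        : mode transition probability x_mn
    Y[m][s][a][s2] : state transition probability y_m(s, a, s2)
    Returns P with P[(m, s)][a][(n, s2)] = x_mn * y_m(s, a, s2),
    a transition function over the M*N augmented hidden states.
    """
    M, N, K = len(X), len(Y[0]), len(Y[0][0])
    P = {}
    for m in range(M):
        for s in range(N):
            P[(m, s)] = [
                {(n, s2): X[m][n] * Y[m][s][a][s2]
                 for n in range(M) for s2 in range(N)}
                for a in range(K)
            ]
    return P

# Illustrative 2-mode, 2-state, 1-action HM-MDP:
X = [[0.9, 0.1], [0.2, 0.8]]
Y = [[[[0.7, 0.3]], [[0.4, 0.6]]],
     [[[0.1, 0.9]], [[0.5, 0.5]]]]
P = hmmdp_to_pomdp(X, Y)
# Each augmented hidden state still has a proper outgoing distribution,
# since sum_n x_mn = 1 and sum_s2 y_m(s, a, s2) = 1:
print(round(sum(P[(0, 0)][0].values()), 10))  # 1.0
```

The dictionary `P` has MN keys, matching the MN hidden states of the equivalent POMDP described above.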
Since most state transition probabilities are collapsed into mode transition probabilities through parameter sharing, the number of parameters in an HM-MDP (N²MK + M²) is much smaller than that of its corresponding POMDP (M²N²K).

3 Learning a Hidden-Mode Model

There are two ways to learn a hidden-mode model: one may learn either an HM-MDP directly, or an equivalent POMDP instead. POMDP models can be learned via a variant of the Baum-Welch algorithm [2]. This POMDP Baum-Welch algorithm requires Θ(M²N²T) time and Θ(M²N²K) storage for learning an M-mode, N-state, K-action HM-MDP, given T data items.

A similar idea can be applied to the learning of an HM-MDP. Intuitively, one can estimate the model parameters from the expected counts of the mode transitions, computed via a set of auxiliary variables. The major difference from the original algorithm is that consecutive state transitions, rather than observations, are considered. Additional effort is thus needed for handling the boundary cases. This HM-MDP Baum-Welch algorithm is described in Figure 2.

4 Empirical Studies

This section empirically examines the POMDP Baum-Welch¹ and HM-MDP Baum-Welch algorithms. Experiments based on various randomly generated models and some real-world environments were conducted, and the results are quite consistent. For illustration, a simple traffic control problem is presented. In this problem, one direction of a two-way road is blocked, and cars from the two directions (left and right) are forced to share the remaining lane. To coordinate the traffic, two traffic lights equipped with sensors are set up. The system has two possible actions: signal the cars from the left, or the cars from the right, to pass. For simplicity, we assume discrete time steps and uniform speed for the cars. 
\n\nThe system has 8 possible states; they correspond to the combinations of whether \nthere are cars waiting on the left and the right directions, and the stop signal position \nin the previous time step. There are 3 traffic modes. The first one has cars waiting \non the left and the right directions with probabilities 0.3 and 0.1 respectively. In the \nsecond mode, these probabilities are reversed. For the last one, both probabilities \nare 0.3. In addition, the mode transition probability is 0.1. A cost of -1.0 results if \n\nlChrisman's algorithm also attempts to learn a minimal possible number of states. Our \n\npaper concerns only with learning the model parameters. \n\n\fAn Environment Model for Nonstationary Reinforcement Learning \n\n991 \n\nGiven a collection of data and an initial model parameter vector 0. \nrepeat \n\n0=0 \nCompute forward variables (Xt. \n\n(Xl (i) = 1/;$1 \n(X2(i) = 1I\"i 1/;$1 Yi(SI, al,S2) \n(Xt+l(j) = L:iEQ (Xt(i) Xii Yi(St,at,St+l) \n\nCompute backward variables (3t . \n\n(3T(i) = 1 \n(3t(i) = LiEQXii Yi(St,at,St+I) (3t+I(j) \n(31(i) = L:iEQ 1I\"j Yi(sl , al,s2) (32(j) \n\n\"Ii E Q \n\"Ii E Q \n\"Ii E Q \n\n\"Ii E Q \n\"Ii E Q \n\"Ii E Q \n\n\"I i, j E Q \n\n\"Ii E Q \n\nCompute the new model parameter 0. \n\n_ .. _ L;-2 {. (i,i) \nXl] - ~T . \nL....t=1 \"'Yt (I) \n\n8(a, b) = {01 a = b \n\naf.b \n\n1Ti = \"Yl (i) \n\nuntil maxi I Oi - OJ I < to \n\nFigure 2: HM-MDP Baum-Welch Algorithm \n\na car waits on either side. \n\nThe experiments were run with the same initial model for data sets of various sizes. \nThe algorithms iterated until the maximum change of the model parameters was \nless than a threshold of 0.0001. The experiment was repeated for 20 times with \ndifferent random seeds in order to compute the median. Then the learned models \nwere compared in their POMDP forms using the Kullback-Leibler (KL) distance \n[5], and the total CPU running time on a SUN Ultra I workstation was measured. 
\nFigure 3 (a) and (b) report the results. \n\nGenerally speaking, both algorithms learn a more accurate environment model as \nthe data size increases (Figure 3 (a)). This result is expected as both algorithms \nare statistically-based, and hence their performance relies largely on the data size. \nWhen the training data size is very small , both algorithms perform poorly. However, \nas the data size increases, HM-MDP Baum-Welch improves substantially faster than \nPOMDP Baum-Welch. It is because an HM-MDP in general consists of fewer free \n\n\f992 \n\n\" \n\no ~~~~~~~~~~~~~~ \n\n)000 \n\n:1500 \n\noKIOO \n\n.&500 \n\n5000 \n\no \n\n!SOO \n\n1000 \n\n1501) \n\n2000 \n\n2&00 \n\ns. P M Choi. D.-y' Yeung and N. L. Zhang \n\n'0000 \n\n/---.-.-.. -.... -----.-.. ~ ... --.... --... \n\n............\n\n...... \n\n-.----.\".~ .. \n\n10500L--'OOOJ...--'...\"500-2000~~ .... ,,---:\"_':::\"--=_.,.,.......-:-\"\"':':-:--, .... \n\n...,,...,---:-!,OOO \n\nWndow9tD \n\nWIndowSiz. \n\n(a) Error in transition function \n\n(b) Required learning time \n\nFigure 3: Empirical results on model learning \n\nparameters than its POMDP counterpart. \n\nHM-MDP Baum-Welch also runs much faster than POMDP Baum-Welch (Figure 3 \n(b)). It holds in general for the same reason discussed above. Note that compu(cid:173)\ntational time is not necessarily monotonically increasing with the data size. It is \nbecause the total computation depends not only on the data size, but also on the \nnumber of iterations executed. From our experiments, we noticed that the number \nof iterations tends to decrease as the data size increases. \n\nLarger models have also been tested. While HM-MDP Baum-Welch is able to learn \nmodels with several hundred states and a few modes, POMDP Baum-Welch was \nunable to complete the learning in a reasonable time. Additional experimental \nresults can be found in [1]. 
\n\n5 Discussions and Future Work \n\nThe usefulness of a model depends on the validity of the assumptions made. We \nnow discuss the assumptions of HM-MDP, and shed some light on its applicability \nto real-world nonstationary tasks. Some possible extensions are also discussed. \n\nModeling a nonstationary environment as a number of distinct MDPs. \nMDP is a flexible framework that has been widely adopted in various applications. \nModeling nonstationary environments by distinct MDPs is a natural extension to \nthose tasks. Comparing to POMDP, our model is more comprehensive: each MDP \nnaturally describes a mode of the environment. Moreover, this formulation facili(cid:173)\ntates the incorporation of prior knowledge into the model initialization step. \n\nStates are directly observable while modes are not. While completely ob(cid:173)\nservable states are helpful to infer the current mode, it is also possible to extend the \nmodel to allow partially observable states. In this case, the extended model would \nbe equivalent in representational power to a POMDP. This could be proved easily \nby showing the reformulation of the two models in both directions. \n\n\fAn Environment Model for Nonstationary Reinforcement Learning \n\n993 \n\nMode changes are independent of the agent's responses. This property \nmay not always hold for all real-world tasks. In some applications, the agent's \nactions might affect the state as well as the environment mode. In that case, an \nMDP should be used to govern the mode transition process. \n\nMode transitions are relatively infrequent. This is a property that generally \nholds in many applications. Our model, however, is not limited by this condition. \nWe have tried to apply our model-learning algorithms to problems in which this \nproperty does not hold. We find that our model still outperforms POMDP, although \nthe required data size is typically larger for both models. 
\n\nNumber of states is substantially larger than the number of modes. This \nis the key property that significantly reduces the number of parameters in HM-MDP \ncompared to that in POMDP. In practice, introduction of a few modes is sufficient \nfor boosting the system performance. More modes might only help little. Thus a \ntrade-off between performance and response time must be decided. \n\nThere are additional issues that need to be addressed. First, an efficient algorithm \nfor policy learning is required. Although in principle it can be achieved indirectly \nvia any POMDP algorithm, a more efficient algorithm based on the model-based \napproach is possible. We will address this issue in a separate paper. Next, the \nnumber of modes is currently assumed to be known. We are now investigating how \nto remove this limitation. Finally, the exploration-exploitation issue is currently \nignored. In our future work, we will address this important issue and apply our \nmodel to real-world nonstationary tasks. \n\nReferences \n\n[1] S. P. M. Choi, D. Y. Yeung, and N. L. Zhang. Hidden-mode Markov decision \nIn IJCAI 99 Workshop on Neural, Symbolic, and Reinforcement \n\nprocesses. \nMethods for Sequence Learnin9, 1999. \n\n[2] L. Chrisman. Reinforcement learning with perceptual aliasing: The perceptual \n\ndistinctions approach. In AAAI-92, 1992. \n\n[3] R. H. Crites and A. G. Barto. Improving elevator performance using reinforce(cid:173)\n\nment learning. In D. Touretzky, M. Mozer, and M. Hasselmo, editors, Advances \nin Neural Information Processing Systems 8, 1996. \n\n[4] P. Dayan and T. J. Sejnowski. Exploration bonuses and dual control. Machine \n\nLearning, 25(1):5- 22, Oct. 1996. \n\n[5J S. Kullback. Information Theory and Statistics. Wiley, New York, NY, USA, \n\n1959. \n\n[6] S. Singh and D. P. Bertsekas. Reinforcement learning for dynamic channel \nallocation in cellular telephone systems. In Advances in Neural Information \nProcessing Systems 9, 1997. 
\n\n[7] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. The \n\nMIT Press, 1998. \n\n\f", "award": [], "sourceid": 1665, "authors": [{"given_name": "Samuel", "family_name": "Choi", "institution": null}, {"given_name": "Dit-Yan", "family_name": "Yeung", "institution": null}, {"given_name": "Nevin", "family_name": "Zhang", "institution": null}]}