{"title": "Reinforcement Learning Based on On-Line EM Algorithm", "book": "Advances in Neural Information Processing Systems", "page_first": 1052, "page_last": 1058, "abstract": null, "full_text": "Reinforcement Learning based on \n\nOn-line EM Algorithm \n\nMasa-aki Sato t \n\nt ATR Human Information Processing Research Laboratories \n\nSeika, Kyoto 619-0288, Japan \n\nmasaaki@hip.atr.co.jp \n\nShin Ishii +t \n\ntNara Institute of Science and Technology \n\nIkoma, Nara 630-0101, Japan \n\nishii@is.aist-nara.ac.jp \n\nAbstract \n\nIn this article, we propose a new reinforcement learning (RL) \nmethod based on an actor-critic architecture. The actor and \nthe critic are approximated by Normalized Gaussian Networks \n(NGnet), which are networks of local linear regression units. The \nNGnet is trained by the on-line EM algorithm proposed in our pre(cid:173)\nvious paper. We apply our RL method to the task of swinging-up \nand stabilizing a single pendulum and the task of balancing a dou(cid:173)\nble pendulum near the upright position. The experimental results \nshow that our RL method can be applied to optimal control prob(cid:173)\nlems having continuous state/action spaces and that the method \nachieves good control with a small number of trial-and-errors. \n\n1 \n\nINTRODUCTION \n\nReinforcement learning (RL) methods (Barto et al., 1990) have been successfully \napplied to various Markov decision problems having finite state/action spaces, such \nas the backgammon game (Tesauro, 1992) and a complex task in a dynamic envi(cid:173)\nronment (Lin, 1992). On the other hand, applications to continuous state/action \nproblems (Werbos, 1990; Doya, 1996; Sofge & White, 1992) are much more difficult \nthan the finite state/action cases. Good function approximation methods and fast \nlearning algorithms are crucial for successful applications. \nIn this article, we propose a new RL method that has the above-mentioned two \nfeatures. 
This method is based on an actor-critic architecture (Barto et al., 1983), although the detailed implementations of the actor and the critic are quite different from those in the original actor-critic model. The actor and the critic in our method estimate a policy and a Q-function, respectively, and are approximated by Normalized Gaussian Networks (NGnet) (Moody & Darken, 1989). The NGnet is a network of local linear regression units. The model softly partitions the input space by using normalized Gaussian functions, and each local unit linearly approximates the output within its partition. As pointed out by Sutton (1996), local models such as the NGnet are more suitable than global models such as multi-layered perceptrons for avoiding serious learning interference in on-line RL processes. The NGnet is trained by the on-line EM algorithm proposed in our previous paper (Sato & Ishii, 1998). It was shown that this on-line EM algorithm is faster than a gradient descent algorithm. In the on-line EM algorithm, the positions of the local units can be adjusted according to the input and output data distribution. Moreover, unit creation and unit deletion are performed according to the data distribution. Therefore, the model can be adapted to dynamic environments in which the input and output data distribution changes with time (Sato & Ishii, 1998).

We have applied the new RL method to optimal control problems for deterministic nonlinear dynamical systems. The first experiment is the task of swinging up and stabilizing a single pendulum with a limited torque (Doya, 1996). The second experiment is the task of balancing a double pendulum where a torque is applied only to the first pendulum. Our RL method based on the on-line EM algorithm demonstrated good performance in these experiments.
2 NGNET AND ON-LINE EM ALGORITHM

In this section, we review the on-line EM algorithm for the NGnet proposed in our previous paper (Sato & Ishii, 1998). The NGnet (Moody & Darken, 1989), which transforms an N-dimensional input vector x into a D-dimensional output vector y, is defined by the following equations:

  y = \sum_{i=1}^{M} \frac{G_i(x)}{\sum_{j=1}^{M} G_j(x)} (W_i x + b_i)   (1a)

  G_i(x) = (2\pi)^{-N/2} |\Sigma_i|^{-1/2} \exp\left[ -\frac{1}{2} (x - \mu_i)' \Sigma_i^{-1} (x - \mu_i) \right].   (1b)

M denotes the number of units, and the prime (') denotes a transpose. G_i(x) is an N-dimensional Gaussian function, which has an N-dimensional center \mu_i and an (N x N)-dimensional covariance matrix \Sigma_i. W_i and b_i are a (D x N)-dimensional linear regression matrix and a D-dimensional bias vector, respectively. Subsequently, we use the notations \tilde{W}_i \equiv (W_i, b_i) and \tilde{x}' \equiv (x', 1).

The NGnet can be interpreted as a stochastic model, in which a pair of an input and an output, (x, y), is a stochastic event. For each event, a unit index i \in {1, ..., M} is assumed to be selected, which is regarded as a hidden variable. The stochastic model is defined by the probability distribution for a triplet (x, y, i), which is called a complete event:

  P(x, y, i | \theta) = (2\pi)^{-(D+N)/2} \sigma_i^{-D} |\Sigma_i|^{-1/2} M^{-1} \exp\left[ -\frac{1}{2} (x - \mu_i)' \Sigma_i^{-1} (x - \mu_i) - \frac{1}{2\sigma_i^2} |y - \tilde{W}_i \tilde{x}|^2 \right].   (2)

Here, \theta \equiv {\mu_i, \Sigma_i, \sigma_i^2, \tilde{W}_i | i = 1, ..., M} is the set of model parameters. We can easily prove that the expectation value of the output y for a given input x, i.e., E[y|x] \equiv \int y P(y|x, \theta) dy, is identical to equation (1). Namely, the probability distribution (2) provides a stochastic model for the NGnet.

From a set of T events (observed data) (X, Y) \equiv {(x(t), y(t)) | t = 1, ..., T}, the model parameter \theta of the stochastic model (2) can be determined by the maximum likelihood estimation method, in particular, by the EM algorithm (Dempster et al., 1977). The EM algorithm repeats the following E- and M-steps.
E (Estimation) step: Let \bar{\theta} be the present estimator. By using \bar{\theta}, the posterior probability that the i-th unit is selected for (x(t), y(t)) is given as

  P(i | x(t), y(t), \bar{\theta}) = P(x(t), y(t), i | \bar{\theta}) \Big/ \sum_{j=1}^{M} P(x(t), y(t), j | \bar{\theta}).   (3)

M (Maximization) step: Using the posterior probability (3), the expected log-likelihood L(\theta | \bar{\theta}, X, Y) for the complete events is defined by

  L(\theta | \bar{\theta}, X, Y) = \sum_{t=1}^{T} \sum_{i=1}^{M} P(i | x(t), y(t), \bar{\theta}) \log P(x(t), y(t), i | \theta).   (4)

Since an increase of L(\theta | \bar{\theta}, X, Y) implies an increase of the log-likelihood for the observed data (X, Y) (Dempster et al., 1977), L(\theta | \bar{\theta}, X, Y) is maximized with respect to \theta. A solution of the necessary condition \partial L / \partial \theta = 0 is given by (Xu et al., 1995)

  \mu_i = \langle x \rangle_i(T) / \langle 1 \rangle_i(T)   (5a)
  \Sigma_i^{-1} = \left[ \langle x x' \rangle_i(T) / \langle 1 \rangle_i(T) - \mu_i \mu_i' \right]^{-1}   (5b)
  \tilde{W}_i = \langle y \tilde{x}' \rangle_i(T) \left[ \langle \tilde{x} \tilde{x}' \rangle_i(T) \right]^{-1}   (5c)
  \sigma_i^2 = \frac{1}{D} \left[ \langle |y|^2 \rangle_i(T) - \mathrm{Tr}\left( \tilde{W}_i \langle \tilde{x} y' \rangle_i(T) \right) \right] / \langle 1 \rangle_i(T),   (5d)

where \langle \cdot \rangle_i denotes a weighted mean with respect to the posterior probability (3), defined by

  \langle f(x, y) \rangle_i(T) \equiv \frac{1}{T} \sum_{t=1}^{T} f(x(t), y(t)) P(i | x(t), y(t), \bar{\theta}).   (6)

The EM algorithm introduced above is based on batch learning (Xu et al., 1995), namely, the parameters are updated after seeing all of the observed data. We introduce here an on-line version (Sato & Ishii, 1998) of the EM algorithm. Let \theta(t) be the estimator after the t-th observed datum (x(t), y(t)). In this on-line EM algorithm, the weighted mean (6) is replaced by

  \langle\langle f(x, y) \rangle\rangle_i(T) \equiv \eta(T) \sum_{t=1}^{T} \left( \prod_{s=t+1}^{T} \lambda(s) \right) f(x(t), y(t)) P(i | x(t), y(t), \theta(t-1)).   (7)

The parameter \lambda(t) \in [0, 1] is a discount factor, which is introduced for forgetting the effect of earlier inaccurate estimators. \eta(T) \equiv (\sum_{t=1}^{T} \prod_{s=t+1}^{T} \lambda(s))^{-1} is a normalization coefficient, and it is iteratively calculated by \eta(t) = (1 + \lambda(t) / \eta(t-1))^{-1}.
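As a quick numerical check (not part of the paper), the recursion \eta(t) = (1 + \lambda(t)/\eta(t-1))^{-1} can be verified against the closed-form normalizer above; the discount schedule below follows equation (11) with illustrative constants a and b.

```python
# Sanity check: the recursive update for eta(t) reproduces the closed form
# eta(T) = ( sum_{t=1}^{T} prod_{s=t+1}^{T} lambda(s) )^{-1}.
# The schedule lam(t) follows eq. (11); a = 0.1, b = 1.0 are illustrative.

def lam(t, a=0.1, b=1.0):
    return 1.0 - (1.0 - a) / (a * t + b)

def eta_recursive(T):
    eta = 1.0  # eta(1) = 1: the t = 1 term of the sum is an empty product
    for t in range(2, T + 1):
        eta = 1.0 / (1.0 + lam(t) / eta)
    return eta

def eta_closed_form(T):
    total = 0.0
    for t in range(1, T + 1):
        prod = 1.0
        for s in range(t + 1, T + 1):
            prod *= lam(s)
        total += prod
    return 1.0 / total

for T in (1, 5, 50):
    assert abs(eta_recursive(T) - eta_closed_form(T)) < 1e-12
```

The recursion lets the normalizer be maintained in O(1) work per datum, which is what makes the step-wise update below practical.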
The modified weighted mean \langle\langle \cdot \rangle\rangle_i can be obtained by the step-wise equation

  \langle\langle f(x, y) \rangle\rangle_i(t) = \langle\langle f(x, y) \rangle\rangle_i(t-1) + \eta(t) \left[ f(x(t), y(t)) P_i(t) - \langle\langle f(x, y) \rangle\rangle_i(t-1) \right],   (8)

where P_i(t) \equiv P(i | x(t), y(t), \theta(t-1)). Using the modified weighted mean, the new parameters are obtained by the following equations:

  \Lambda_i(t) = \frac{1}{1 - \eta(t)} \left[ \Lambda_i(t-1) - \frac{P_i(t) \Lambda_i(t-1) \tilde{x}(t) \tilde{x}'(t) \Lambda_i(t-1)}{(1/\eta(t) - 1) + P_i(t) \tilde{x}'(t) \Lambda_i(t-1) \tilde{x}(t)} \right]   (9a)
  \mu_i(t) = \langle\langle x \rangle\rangle_i(t) / \langle\langle 1 \rangle\rangle_i(t)   (9b)
  \tilde{W}_i(t) = \tilde{W}_i(t-1) + \eta(t) P_i(t) \left( y(t) - \tilde{W}_i(t-1) \tilde{x}(t) \right) \tilde{x}'(t) \Lambda_i(t)   (9c)
  \sigma_i^2(t) = \frac{1}{D} \left[ \langle\langle |y|^2 \rangle\rangle_i(t) - \mathrm{Tr}\left( \tilde{W}_i(t) \langle\langle \tilde{x} y' \rangle\rangle_i(t) \right) \right] / \langle\langle 1 \rangle\rangle_i(t),   (9d)

where \Lambda_i(t) \equiv \left[ \langle\langle \tilde{x} \tilde{x}' \rangle\rangle_i(t) \right]^{-1}.   (10)

It can be proved that this on-line EM algorithm is equivalent to the stochastic approximation for finding the maximum likelihood estimator, if the time course of the discount factor \lambda(t) satisfies

  \lambda(t) \xrightarrow{t \to \infty} 1 - (1 - a) / (a t + b),   (11)

where a (1 > a > 0) and b are constants (Sato & Ishii, 1998).

We also employ dynamic unit manipulation mechanisms in order to efficiently allocate the units (Sato & Ishii, 1998). The probability P(x(t), y(t), i | \theta(t-1)) indicates how probable it is that the i-th unit produces the datum (x(t), y(t)) with the present parameter \theta(t-1). If this probability is less than some threshold value for every unit, a new unit is produced to account for the new datum. The weighted mean \langle\langle 1 \rangle\rangle_i(t) indicates how much the i-th unit has been used to account for the data until t. If this mean becomes less than some threshold value, the unit is deleted.

In order to deal with a singular input distribution, a regularization for \Sigma_i^{-1}(t) is introduced as follows.
  \Sigma_i^{-1}(t) = \left[ \left( \langle\langle x x' \rangle\rangle_i(t) - \mu_i(t) \mu_i'(t) \langle\langle 1 \rangle\rangle_i(t) + \alpha \langle\langle \Delta_i^2 \rangle\rangle_i(t) I_N \right) / \langle\langle 1 \rangle\rangle_i(t) \right]^{-1}   (12a)
  \langle\langle \Delta_i^2 \rangle\rangle_i(t) = \left( \langle\langle |x|^2 \rangle\rangle_i(t) - |\mu_i(t)|^2 \langle\langle 1 \rangle\rangle_i(t) \right) / N,   (12b)

where I_N is the (N x N)-dimensional identity matrix and \alpha is a small constant. The corresponding \Lambda_i(t) can be calculated in an on-line manner using an equation similar to (9a) (Sato & Ishii, 1998).

3 REINFORCEMENT LEARNING

In this section, we propose a new RL method based on the on-line EM algorithm described in the previous section. In the following, we consider optimal control problems for deterministic nonlinear dynamical systems having continuous state/action spaces. It is assumed that there is no knowledge of the controlled system. An actor-critic architecture (Barto et al., 1983) is used for the learning system. In the original actor-critic model, the actor and the critic approximated the probability of each action and the value function, respectively, and were trained by using the TD-error. The actor and the critic in our RL method are different from those in the original model, as explained later.

For the current state x_c(t) of the controlled system, the actor outputs a control signal (action) u(t), which is given by the policy function \Omega(\cdot), i.e., u(t) = \Omega(x_c(t)). The controlled system changes its state to x_c(t+1) after receiving the control signal u(t). Subsequently, a reward r(x_c(t), u(t)) is given to the learning system. The objective of the learning system is to find the optimal policy function that maximizes the discounted future return defined by

  V(x_c) \equiv \sum_{t=0}^{\infty} \gamma^t r(x_c(t), \Omega(x_c(t))) \Big|_{x_c(0) = x_c},   (13)

where 0 < \gamma < 1 is a discount factor. V(x_c), which is called the value function, is defined for the current policy function \Omega(\cdot) employed by the actor.
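As a small illustration of definition (13) (not from the paper), the discounted return can be accumulated along a finite reward sequence; the rewards and the value of \gamma below are arbitrary choices.

```python
# Discounted future return V = sum_t gamma^t * r(t), as in eq. (13).
# In the paper the rewards come from following the actor's policy Omega
# starting at x_c; here a short hand-picked reward sequence stands in.

def discounted_return(rewards, gamma):
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total

v = discounted_return([1.0, 0.0, 1.0], gamma=0.5)
assert abs(v - 1.25) < 1e-12  # 1 + 0.5^2 * 1
```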
The Q-function is defined by

  Q(x_c, u) \equiv \sum_{s=t}^{\infty} \gamma^{s-t} r(x_c(s), u(s)),   (14)

where x_c(t) = x_c and u(t) = u are assumed, and u(s) = \Omega(x_c(s)) for s > t. The value function can be obtained from the Q-function:

  V(x_c) = Q(x_c, \Omega(x_c)).   (15)

The Q-function should satisfy the consistency condition

  Q(x_c(t), u(t)) = \gamma Q(x_c(t+1), \Omega(x_c(t+1))) + r(x_c(t), u(t)).   (16)

In our RL method, the policy function and the Q-function are approximated by NGnets, which are called the actor-network and the critic-network, respectively. In the learning phase, a stochastic actor is necessary in order to explore a better policy. For this purpose, we employ the stochastic model defined by (2), corresponding to the actor-network. A stochastic action is generated in the following way. A unit index i is selected randomly according to the conditional probability P(i | x_c) for a given state x_c. Subsequently, an action u is generated randomly according to the conditional probability P(u | x_c, i) for the given x_c and the selected i. The value function can be defined for either the stochastic policy or the deterministic policy. Since the controlled system is deterministic, we use the value function defined for the deterministic policy, which is given by the actor-network.

The learning process proceeds as follows. For the current state x_c(t), a stochastic action u(t) is generated by the stochastic model corresponding to the current actor-network. At the next time step, the learning system gets the next state x_c(t+1) and the reward r(x_c(t), u(t)). The critic-network is trained by the on-line EM algorithm. The input to the critic-network is (x_c(t), u(t)). The target output is given by the right-hand side of (16), where the Q-function and the deterministic policy function \Omega(\cdot) are calculated using the current critic-network and the current actor-network, respectively. The actor-network is also trained by the on-line EM algorithm. The input to the actor-network is x_c(t).
The target output is given by using the gradient of the critic-network (Sofge & White, 1992):

  \Omega(x_c(t)) + \epsilon \left. \frac{\partial Q(x_c(t), u)}{\partial u} \right|_{u = \Omega(x_c(t))},   (17)

where the Q-function and the deterministic policy function \Omega(\cdot) are calculated using the modified critic-network and the current actor-network, respectively, and \epsilon is a small constant. This target output gives a better action than the current deterministic action \Omega(x_c(t)), i.e., one that increases the Q-function value for the current state x_c(t).

In the above learning scheme, the critic-network and the actor-network are updated concurrently. One can also consider another learning scheme. In this scheme, the learning system tries to control the controlled system for a given period of time by using the fixed actor-network. During this period, the critic-network is trained to estimate the Q-function for the fixed actor-network, and the state trajectory is saved. At the next stage, the actor-network is trained along the saved trajectory using the critic-network modified in the first stage.

4 EXPERIMENTS

The first experiment is the task of swinging up and stabilizing a single pendulum with a limited torque (Doya, 1996). The state of the pendulum is represented by x_c = (\varphi, \dot{\varphi}), where \varphi and \dot{\varphi} denote the angle from the upright position and the angular velocity of the pendulum, respectively. The reward r(x_c(t), u(t)) is assumed to be given by f(x_c(t+1)), where

  f(x_c) = \exp\left( -\dot{\varphi}^2 / (2 v_1^2) - \varphi^2 / (2 v_2^2) \right).   (18)

v_1 and v_2 are constants. The reward (18) encourages the pendulum to stay high. After releasing the pendulum from a vicinity of the upright position, the control and the learning process of the actor-critic network are conducted for 7 seconds. This constitutes a single episode. The reinforcement learning is done by repeating these episodes.
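The shaping reward (18) can be sketched as follows; the width constants v_1 and v_2 are given illustrative values here, since the paper does not restate them, and the function simply peaks at the upright, motionless state.

```python
import math

# Reward of eq. (18): largest when the pendulum is upright (phi = 0) and
# still (phi_dot = 0). The widths v1, v2 are illustrative placeholders.

def reward(phi, phi_dot, v1=1.0, v2=1.0):
    return math.exp(-phi_dot**2 / (2.0 * v1**2) - phi**2 / (2.0 * v2**2))

assert abs(reward(0.0, 0.0) - 1.0) < 1e-12        # maximum at upright rest
assert reward(math.pi, 0.0) < reward(0.1, 0.0)    # lower when hanging down
```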
After 40 episodes, the system is able to make the pendulum achieve the upright position from almost every initial state. Even from a low initial position, the system swings the pendulum several times and stabilizes it at the upright position. Figure 1 shows a control process, i.e., a stroboscopic time-series of the pendulum, using the deterministic policy after training. In our previous experiment, in which both the actor- and critic-networks were NGnets with fixed centers trained by the gradient descent algorithm, good control was obtained only after about 2000 episodes. Therefore, our new RL method obtains good control much faster than the method based on the gradient descent algorithm.

The second experiment is the task of balancing a double pendulum near the upright position. A torque is applied only to the first pendulum. The state of the pendulum is represented by x_c = (\varphi_1, \varphi_2, \dot{\varphi}_1, \dot{\varphi}_2), where \varphi_1 and \varphi_2 are the first pendulum's angle from the upright direction and the second pendulum's angle from the first pendulum's direction, respectively, and \dot{\varphi}_1 (\dot{\varphi}_2) is the angular velocity of the first (second) pendulum. The reward is given by the height of the second pendulum's end above the lowest position. After 40 episodes, the system is able to stabilize the double pendulum. Figure 2 shows the control process using the deterministic policy after training. The upper two panels show stroboscopic time-series of the pendulum. The dashed, dotted, and solid lines in the bottom panel denote \varphi_1/\pi, \varphi_2/\pi, and the control signal u produced by the actor-network, respectively. After a transient period, the pendulum is successfully controlled to stay near the upright position.

The numbers of units in the actor- (critic-) networks after training are 50 (109) and 96 (121) for the single and double pendulum cases, respectively.
The RL method using center-fixed NGnets trained by the gradient descent algorithm employed 441 (= 21^2) actor units and 18,081 (= 21^2 x 41) critic units for the single pendulum task. For the double pendulum task, this scheme did not work even when 14,641 (= 11^4) actor units and 161,051 (= 11^5) critic units were prepared. In contrast, the numbers of units in the NGnets trained by the on-line EM algorithm scale moderately as the input dimension increases.

5 CONCLUSION

In this article, we proposed a new RL method based on the on-line EM algorithm. We showed that our RL method can be applied to the task of swinging up and stabilizing a single pendulum and to the task of balancing a double pendulum near the upright position. The number of trial-and-errors needed to achieve good control was found to be very small in both tasks. In order to apply an RL method to continuous state/action problems, good function approximation methods and fast learning algorithms are crucial. The experimental results showed that our RL method has both features.

References

Barto, A. G., Sutton, R. S., & Anderson, C. W. (1983). IEEE Transactions on Systems, Man, and Cybernetics, 13, 834-846.
Barto, A. G., Sutton, R. S., & Watkins, C. J. C. H. (1990). Learning and Computational Neuroscience: Foundations of Adaptive Networks (pp. 539-602), MIT Press.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Journal of the Royal Statistical Society B, 39, 1-22.
Doya, K. (1996). Advances in Neural Information Processing Systems 8 (pp. 1073-1079), MIT Press.
Lin, L. J. (1992). Machine Learning, 8, 293-321.
Moody, J., & Darken, C. J. (1989). Neural Computation, 1, 281-294.
Sato, M., & Ishii, S. (1998). ATR Technical Report, TR-H-243, ATR.
Sofge, D. A., & White, D. A. (1992). Handbook of Intelligent Control (pp. 259-282), Van Nostrand Reinhold.
Sutton, R. S. (1996). Advances in Neural Information Processing Systems 8 (pp. 1038-1044), MIT Press.
Tesauro, G. J. (1992). Machine Learning, 8, 257-278.
Werbos, P. J. (1990). Neural Networks for Control (pp. 67-95), MIT Press.
Xu, L., Jordan, M. I., & Hinton, G. E. (1995). Advances in Neural Information Processing Systems 7 (pp. 633-640), MIT Press.

[Figure 1: Time sequence of the inverted pendulum — stroboscopic time-series under the learned deterministic policy.]