{"title": "Hybrid Reinforcement Learning and Its Application to Biped Robot Control", "book": "Advances in Neural Information Processing Systems", "page_first": 1071, "page_last": 1077, "abstract": "", "full_text": "Hybrid reinforcement learning and its application to biped robot control

Satoshi Yamada, Akira Watanabe, Michio Nakashima
{yamada, watanabe, naka}@bio.crl.melco.co.jp
Advanced Technology R&D Center
Mitsubishi Electric Corporation
Amagasaki, Hyogo 661-0001, Japan

Abstract

A learning system composed of linear control modules, reinforcement learning modules and selection modules (a hybrid reinforcement learning system) is proposed for the fast learning of real-world control problems. The selection modules choose one appropriate control module dependent on the state. This hybrid learning system was applied to the control of a stilt-type biped robot. It learned the control on a sloped floor more quickly than the usual reinforcement learning because it did not need to learn the control on a flat floor, where the linear control module can control the robot. When it was trained by a 2-step learning (during the first learning step, the selection module was trained by a training procedure controlled only by the linear controller), it learned the control more quickly. The average number of trials (about 50) is so small that the learning system is applicable to real robot control.

1 Introduction

Reinforcement learning has the ability to solve general control problems because it learns behavior through trial-and-error interactions with a dynamic environment. It has been applied to many problems, e.g., pole balancing [1], backgammon [2], manipulator control [3], and biped robot control [4]. However, reinforcement learning has rarely been applied to real robot control because it requires too many trials to learn the control, even for simple problems.
\nFor the fast learning of real-world control problems, we propose a new learning sys(cid:173)\ntem which is a combination of a known controller and reinforcement learning. It is \ncalled the hybrid reinforcement learning system. One example of a known controller \nis a linear controller obtained by linear approximation. The hybrid learning system \n\n\f1072 \n\nS. Yamada, A. Watanabe and M. Nakashima \n\nwill learn the control more quickly than usual reinforcement learning because it does \nnot need to learn the control in the state where the known controller can control \nthe object. \nA stilt-type biped walking robot was used to test the hybrid reinforcement learning \nsystem. A real robot walked stably on a flat floor when controlled by a linear \ncontroller [5]. Robot motions could be approximated by linear differential equations. \nIn this study, we will describe hybrid reinforcement learning of the control of the \nbiped robot model on a sloped floor, where the linear controller cannot control the \nrobot. \n\n2 Biped Robot \n\na) \n\nb) \n\npitch axis \n\nFigure 1: Stilt-type biped robot. a) a photograph of a real biped robot, b) a model \nstructure of the biped robot. Ul, U2, U3 denote torques. \n\nFigure I-a shows a stilt-type biped robot [5J. It has no knee or ankle, has 1 m \nlegs and weighs 33 kg. It is modeled by 3 rigid bodies as shown in Figure I-b. \nBy assuming that motions around a roll axis and those around a pitch axis are \nindependent, 5-dimensional differential equations in a single supporting phase were \nobtained. Motions of the real biped robot were simulated by the combination of \nthese equations and conditions at a leg exchange period. If angles are approximately \nzero, these equations can be approximated by linear equations. The following linear \ncontroller is obtained from the linear equations. 
The biped robot will walk if the angles of the free leg are controlled by a position-derivative (PD) controller whose desired angles are calculated as follows:

    φ = θ + α + β
    ψ = θ + 2α
    ζ = -Aη̇ + δ,    A = √(l/g)        (1)

where α, β, δ, and g are the desired angle between the body and the leg (7°), a constant to make up for the loss caused by a leg exchange (1.3°), a constant corresponding to walking speed, and the gravitational acceleration (9.8 m s⁻²), respectively; l denotes the leg length (1 m).

The linear controller controlled walking of the real biped robot on a flat floor [5]. However, it failed to control walking on a slope (Figure 2). In this study, the objective of the learning system was to control walking on the sloped floor shown in Figure 2-a.

Figure 2: Biped robot motion on a sloped floor controlled by the linear controller. a) the shape of the floor, b) changes in angular velocity, height of the free leg's tip, and robot position; the robot falls down.

3 Hybrid Reinforcement Learning

Figure 3: Hybrid reinforcement learning system.

We propose a hybrid reinforcement learning system to learn the control quickly. The hybrid reinforcement learning system shown in Figure 3 is composed of a linear control module, a reinforcement learning module, and a selection module.
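As a rough numerical illustration of the desired-angle computation in Eq. (1): the function name and defaults below are illustrative, angles are handled in radians, and the form of the gain A = √(l/g) (with l the 1 m leg length) is an assumption, not the authors' code.

```python
import math

def desired_angles(theta, eta_dot, alpha=math.radians(7.0),
                   beta=math.radians(1.3), delta=0.05,
                   leg_length=1.0, g=9.8):
    """Sketch of Eq. (1): desired free-leg angles for the PD controller.

    theta:   supporting-leg angle [rad]
    eta_dot: supporting-leg angular velocity [rad/s]
    """
    A = math.sqrt(leg_length / g)    # assumed form of the gain A
    phi_d = theta + alpha + beta     # body-leg angle plus leg-exchange compensation
    psi_d = theta + 2.0 * alpha
    zeta_d = -A * eta_dot + delta    # walking-speed constant minus velocity feedback
    return phi_d, psi_d, zeta_d
```

In the hybrid system described below, the fixed gain A in the third equation is replaced by a learned value k.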
The reinforcement learning module and the selection module select an action and a module dependent on their respective Q-values. This learning system is similar to the modular reinforcement learning system proposed by Tham [6], which was based on hierarchical mixtures of experts (HME) [7]. In the hybrid learning system, the selection module is trained by Q-learning.

To combine the reinforcement learning with the linear controller described in (1), the output of the reinforcement learning module is set to k in the adaptable equation for ζ, ζ = -kη̇ + δ. The angle and the angular velocity of the supporting leg at the leg exchange period (η, η̇) are used as inputs. The k value is kept constant until the next leg exchange. The reinforcement learning module is trained by \"Q-sarsa\" learning [8]. Q-values are calculated by CMAC neural networks [9], [10].

The Q-values for action k (Q_c(x, k)) and those for the selection of module s (Q_s(x, s)) are calculated as follows:

    Q_c(x, k) = Σ_{m,i} w_c(k, m, i, t) y(m, i, t)
    Q_s(x, s) = Σ_{m,i} w_s(s, m, i, t) y(m, i, t),        (2)

where w_c(k, m, i, t) and w_s(s, m, i, t) denote synaptic strengths and y(m, i, t) represents the neurons' outputs in the CMAC networks at time t.

Modules were selected and actions performed according to the ε-greedy policy [8] with ε = 0.

The temporal difference (TD) error for the reinforcement learning module (ê_c(t)) is calculated by

    ê_c(t) = r(t) + γ Q_c(x(t+1), per(t+1)) - Q_c(x(t), per(t)),   if sel(t) = rein and sel(t+1) = rein
    ê_c(t) = r(t) + γ Q_s(x(t+1), sel(t+1)) - Q_c(x(t), per(t)),   if sel(t) = rein and sel(t+1) = lin        (3)

where r(t), per(t), sel(t), lin and rein denote the reinforcement signal (r(t) = -1 if the robot falls down, 0 otherwise), performed actions, selected modules, the linear control module and the reinforcement learning module, respectively.
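A CMAC value function of the kind in Eq. (2) can be sketched as follows. The grid range, the array layout, and the tiling-offset scheme are illustrative assumptions; only the structure (10 offset tilings, 12 intervals per dimension, Q as a sum of one active weight per tiling over binary features) follows the text.

```python
import numpy as np

N_TILINGS, N_INTERVALS, N_ACTIONS = 10, 12, 5   # values from the text
# one weight per (action, tiling, cell) triple
w_c = np.zeros((N_ACTIONS, N_TILINGS,
                N_INTERVALS, N_INTERVALS, N_INTERVALS))

def active_cells(x, lo=-1.0, hi=1.0):
    """Index of the single active cell in each tiling for a 3-D state x."""
    x = np.asarray(x, dtype=float)
    cells = []
    for m in range(N_TILINGS):
        # each tiling is the same grid shifted by a fraction of one interval
        offset = (m / N_TILINGS) * (hi - lo) / N_INTERVALS
        idx = np.clip(((x - lo + offset) / (hi - lo)
                       * N_INTERVALS).astype(int), 0, N_INTERVALS - 1)
        cells.append((m, *idx))
    return cells

def q_c(x, k):
    """Q_c(x, k): sum of the active weights over all tilings (Eq. 2)."""
    return sum(w_c[(k,) + cell] for cell in active_cells(x))
```

Q_s(x, s) for the selection module would be computed the same way from its own weight table w_s.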
\nTD error (f t (t)) calculated by Q .. (x, s) is considered to be a sum of TD error caused \nby the reinforcement learning module and that by the selection module. TD error \n(f .. {t)) used in the selection-module's learning is calculated as follows: \n\nf .. {t) = ft(t) - fe(t) \n\n= r{t) + ,Q .. {x(t + 1), sel(t + 1)) - Q .. (x{t), sel{t)) - fe(t), \n\n(4) \n\nwhere, denotes a discount factor. \nThe reinforcement learning module used replacing eligibility traces {e c (k, m, i, t)) \n[11]. Synaptic strengths are updated as follows: \n\nwe(k, m, i, t + 1) \nw .. (s,m,i ,t + 1) \n\nee(k, m, i, t) = \n\nwc{k, m, i, t) + Qefc{t)ee{k, m, i, t)/nt \n{ w .. (s , m, i, t) + Q .. f .. (t)y(m , i , t)/nt \nw .. (s,m,i,t) \n1 \no \n>.ec(k, m, i, t - 1) otherwise \n\n{\n\ns = sel(t) \notherwise \n\nk=per(t),y(m,i,t)=l \nk ::f: per(t), y(m, i, t) = 1 \n\n(5) \n\nwhere Qe, Q .. , >. and nt are a learning constant for the reinforcement learning module, \nthat for the selection module, decay rates and the number of tHings, respectively. \n\nIn this study, the CMAC used 10 tHings. Each of the three dimensions was di(cid:173)\nvided into 12 intervals. The reinforcement learning module had 5 actions (k = \n0, A/2, A, 3A/2, 2A). The parameter values were Q .. = 0.2, Q e = 0.4, >. = 0.3, \n, = 0.9 and 6 = 0.05. Each run consisted of a sequence of trials, where each \ntrial began with robot state of position=O, _5\u00b0 < () < -2.5\u00b0,1 .5\u00b0 < \"I < 3\u00b0, cp = \n()+~, 'I/J = cp+~, ( = \"1+ 2\u00b0,9 = cp = \"j; = iJ = ( = 0, and ended with a failure signals \nindicating robot's falling down. Runs were terminated if the number of walking \nsteps of three consecutive trials exceeded 100. All results reported are an average \nof 50 runs. \n\n\fHybrid Reinforcement Learning for Biped Robot Control \n\n1075 \n\n'\" ft' 80 \n'\" bll \n~ 60 \nC; \n~ 40 \n'-o \no \nZ 20 \n\n.... -... ----................................... . \n\n. \n. \n. \n. \n\n. \n. \n. \n. \n. 
\n\n\u00b7 \n\u00b7 \n\u00b7 \n\u00b7 \n\u00b7 \n\u00b7 \n\no~~~~~~~~~~ \n\no \n\njO \n\n100 \n\nTrials \n\nIjO \n\n200 \n\nFigure 4: Learning profiles for control of walking on the sloped floor. (0) hybrid \nreinforcement learning, (0) 2-step hybrid reinforcement learning, (\\7) reinforcement \nlearning and (6) HME-type modular reinforcement learning \n\n4 Results \n\nWalking control on the sloped floor (Figure 2-a) was first trained by the usual re(cid:173)\ninforcement learning. The usual reinforcement learning system needed many trials \nfor successful termination (about 800, see Figure 4(\\7)). Because the usual rein(cid:173)\nforcement learning system must learn the control for each input, it requires many \ntrials. \nFigure 4(0) also shows the learning curve for the hybrid reinforcement learning. \nThe hybrid system learned the control more quickly than the usual reinforcement \nlearning (about 190 trials). Because it has a higher probability of succeeding on the \nflat floor, it learned the control quickly. On the other hand, HME-type modular \nreinforcement learning [6] required many trials to learn the control (Figure 4(6)). \n\n45\u00b0's ,--- - - - - - - - - - -- -------, \n\n~A A A A A AA A I h AA Ab A AA A AA A A A A A A A A A A ~ \nVYVV vvrW{VYvvvVYVVlJm rv Y VvVVl \n\n() -45\u00b0,s ' - - - - - - - - - - - - - - - - - - - ' -\n\nAngular \nVel~ity \n\no \n\nTime(s) \n\nFree Leg's Tip ~ \n\n20 \n. \"hfrl}f~iM\u00a3f11tfWfNWV\\{\\/YVVVy1 \n\nHeight 0: Oem \\ \nRObotP~i::[ ~ 1\u00b0 \n\n-2cm \n\n2 \n\n-1m \n\no \n\nTime(s) \n\n20 \n\nFigure 5: Biped robot motion controlled by the network trained by the 2-step hybrid \nreinforcement learning. \n\n\f1076 \n\nS. Yamada, A. Watanabe and M Nakashima \n\nIn order to improve the learning rate, a 2-step learning was examined. The \n2-step learning is proposed to separate the selection-module learning from the \nreinforcement-learning-module learning. 
In the 2-step hybrid reinforcement learning, the selection module was first trained by a special training procedure in which the robot was controlled only by the linear control module. The network was then trained by the hybrid reinforcement learning. The 2-step hybrid reinforcement learning learned the control more quickly than the 1-step hybrid reinforcement learning (Figure 4(□)). The average number of trials was about 50. The hybrid learning system may therefore be applicable to the real biped robot.

Figure 5 shows the biped robot motion controlled by the trained network. On the slope, the free leg's lifting was magnified irregularly (see the changes in the height of the free leg's tip in Figure 5) in order to prevent the reduction of the amplitude of the walking rhythm. On the upper flat floor, the robot was again controlled stably by the linear control module.

Figure 6: Dependence of (a) the learning rate and (b) the selection ratio of the linear control module on the initial synaptic strength values (w_s(rein, m, i, 0)). (a) learning rate of (○) the hybrid reinforcement learning and (□) the 2-step hybrid reinforcement learning. The learning rate is defined as the inverse of the number of trials at which the average walking steps exceed 70. (b) the ratio of linear-control-module selection. Circles represent the selection ratio of the linear control module when controlled by the network trained by the hybrid reinforcement learning; rectangles represent that trained by the 2-step hybrid reinforcement learning.
Open symbols represent the selection ratio on the flat floor; closed symbols represent that on the slope.

The dependence of the learning characteristics on the initial synaptic strengths for the reinforcement-learning-module selection (w_s(rein, m, i, 0)) was considered (the other initial synaptic strengths were 0). If the initial values of w_s(rein, m, i, t) (w_s(rein, m, i, 0)) are negative, the Q-values for the reinforcement-learning-module selection (Q_s(x, rein)) are smaller than Q_s(x, lin), and the linear control module is therefore selected for all states at the beginning of the learning. In the case of the 2-step learning, if the w_s(rein, m, i, 0) are given appropriate negative values, the reinforcement learning module is selected only around failure states, where Q_s(x, lin) is trained in the first learning step, and the linear control module is selected otherwise at the beginning of the second learning step. Because the reinforcement learning module only requires training around failure states under the above condition, the 2-step hybrid system is expected to learn the control quickly. Figure 6-a shows the dependence of the learning rate on the initial synaptic strength values. The 2-step hybrid reinforcement learning had a higher learning rate when the w_s(rein, m, i, 0) were appropriate negative values (-0.01 to -0.005). The trained system selected the linear control module on the flat floor (more than 80%) and selected both modules on the slope (see Figure 6-b) when the w_s(rein, m, i, 0) were negative.

Three trials were required in the first learning step of the 2-step hybrid reinforcement learning; this is the number of trials the learning system needs to learn the Q-value function around the failure states.

5 Conclusion

We proposed the hybrid reinforcement learning system, which learned the biped robot control quickly.
The number of trials for successful termination in the 2-step hybrid reinforcement learning was so small that the hybrid system is applicable to the real biped robot. Although the control of the real biped robot was not learned in this study, it is expected to be learned quickly by the 2-step hybrid reinforcement learning. A learning system for real robot control can be easily constructed and should be trained quickly by the hybrid reinforcement learning system.

References

[1] Barto, A. G., Sutton, R. S. and Anderson, C. W.: Neuronlike adaptive elements that can solve difficult learning control problems, IEEE Trans. Sys. Man Cybern., Vol. SMC-13, pp. 834-846 (1983).

[2] Tesauro, G.: TD-Gammon, a self-teaching backgammon program, achieves master-level play, Neural Computation, Vol. 6, pp. 215-219 (1994).

[3] Gullapalli, V., Franklin, J. A. and Benbrahim, H.: Acquiring robot skills via reinforcement learning, IEEE Control Systems, Vol. 14, No. 1, pp. 13-24 (1994).

[4] Miller, W. T.: Real-time neural network control of a biped walking robot, IEEE Control Systems, Vol. 14, pp. 41-48 (1994).

[5] Watanabe, A., Inoue, M. and Yamada, S.: Development of a stilts-type biped robot stabilized by inertial sensors (in Japanese), in Proceedings of the 14th Annual Conference of RSJ, pp. 195-196 (1996).

[6] Tham, C. K.: Reinforcement learning of multiple tasks using a hierarchical CMAC architecture, Robotics and Autonomous Systems, Vol. 15, pp. 247-274 (1995).

[7] Jordan, M. I. and Jacobs, R. A.: Hierarchical mixtures of experts and the EM algorithm, Neural Computation, Vol. 6, pp. 181-214 (1994).

[8] Sutton, R. S.: Generalization in reinforcement learning: successful examples using sparse coarse coding, Advances in NIPS, Vol. 8, pp. 1038-1044 (1996).

[9] Albus, J. S.: A new approach to manipulator control: The cerebellar model articulation controller (CMAC), Transactions of the ASME, J.
Dynamic Systems, Measurement, and Control, pp. 220-227 (1975).

[10] Albus, J. S.: Data storage in the cerebellar model articulation controller (CMAC), Transactions of the ASME, J. Dynamic Systems, Measurement, and Control, pp. 228-233 (1975).

[11] Singh, S. P. and Sutton, R. S.: Reinforcement learning with replacing eligibility traces, Machine Learning, Vol. 22, pp. 123-158 (1996).
", "award": [], "sourceid": 1434, "authors": [{"given_name": "Satoshi", "family_name": "Yamada", "institution": null}, {"given_name": "Akira", "family_name": "Watanabe", "institution": null}, {"given_name": "Michio", "family_name": "Nakashima", "institution": null}]}