Jun Morimoto, Kenji Doya
This paper proposes a new reinforcement learning (RL) paradigm that explicitly takes into account input disturbance as well as modeling errors. The use of environmental models in RL is quite popular for both off-line learning by simulation and on-line action planning. However, the difference between the model and the real environment can lead to unpredictable, often unwanted results. Based on the theory of H∞ control, we consider a differential game in which a 'disturbing' agent (disturber) tries to make the worst possible disturbance while a 'control' agent (actor) tries to make the best control input. The problem is formulated as finding a min-max solution of a value function that takes into account the norm of the output deviation and the norm of the disturbance. We derive on-line learning algorithms for estimating the value function and for calculating the worst disturbance and the best control with reference to the value function. We tested the paradigm, which we call "Robust Reinforcement Learning (RRL)," on an inverted pendulum task. In the linear domain, the policy and the value function learned by the on-line algorithms coincided with those derived analytically by linear H∞ theory. For a fully nonlinear swing-up task, control by RRL achieved robust performance against changes in the pendulum weight and friction, while a standard RL controller could not deal with such environmental changes.
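To make the min-max formulation concrete, the following is a minimal sketch for a scalar linear system, not the paper's algorithm. It assumes dynamics dx/dt = a*x + b*u + d*w and a cost integral of q*x^2 + r*u^2 - gamma^2*w^2 that the actor u minimizes and the disturber w maximizes. The function name, the parameter names, and the cost-minimizing sign convention are all illustrative assumptions and may differ from the conventions used in the paper. With a quadratic value function V(x) = p*x^2, the game reduces to a scalar Riccati equation whose positive root gives both the best control gain and the worst disturbance gain.

```python
import math


def hinf_scalar_gains(a, b, d, q, r, gamma):
    """Scalar H-infinity game for dx/dt = a*x + b*u + d*w (illustrative sketch).

    Assumed cost: integral of q*x^2 + r*u^2 - gamma^2*w^2,
    minimized over u (actor) and maximized over w (disturber).
    With V(x) = p*x^2, stationarity in u and w gives
        u = -(b*p/r) * x,   w = (d*p/gamma^2) * x,
    and p must satisfy  q + 2*a*p - c*p^2 = 0,  c = b^2/r - d^2/gamma^2.
    """
    c = b * b / r - d * d / (gamma * gamma)
    if c <= 0:
        # This sketch only covers the case where control authority
        # outweighs the disturbance at the chosen gamma.
        raise ValueError("sketch requires b^2/r > d^2/gamma^2")
    # Positive root of c*p^2 - 2*a*p - q = 0.
    p = (a + math.sqrt(a * a + c * q)) / c
    k_u = -b * p / r               # best control gain:      u = k_u * x
    k_w = d * p / (gamma * gamma)  # worst disturbance gain: w = k_w * x
    return p, k_u, k_w


if __name__ == "__main__":
    # Unstable plant (a > 0) with hypothetical weights.
    p, k_u, k_w = hinf_scalar_gains(a=1.0, b=1.0, d=1.0, q=1.0, r=1.0, gamma=2.0)
    print("p =", p, "k_u =", k_u, "k_w =", k_w)
    # Closed-loop pole under both the best control and the worst disturbance.
    print("closed-loop pole =", 1.0 + 1.0 * k_u + 1.0 * k_w)
```

In this sketch, letting gamma grow very large removes the disturber's influence and recovers the ordinary linear-quadratic solution. A finite gamma yields a larger p and a stronger feedback gain, which is the sense in which the min-max solution is more conservative.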