Reinforcement Learning for Continuous Stochastic Control Problems

Part of Advances in Neural Information Processing Systems 10 (NIPS 1997)


Authors

Rémi Munos, Paul Bourgine

Abstract

This paper is concerned with the problem of Reinforcement Learning (RL) for continuous state space and time stochastic control problems. We state the Hamilton-Jacobi-Bellman equation satisfied by the value function and use a Finite-Difference method for designing a convergent approximation scheme. Then we propose an RL algorithm based on this scheme and prove its convergence to the optimal solution.

1 Introduction to RL in the continuous, stochastic case

The objective of RL is to find, thanks to a reinforcement signal, an optimal strategy for solving a dynamical control problem. Here we study the continuous time, continuous state-space stochastic case, which covers a wide variety of control problems including target, viability, and optimization problems (see [FS93], [KP95]), for which a formalism is the following. The evolution of the current state $x(t) \in O$ (the state space, with $O$ an open subset of $\mathbb{R}^d$) depends on the control $u(t) \in U$ (a compact subset) through a stochastic differential equation, called the state dynamics:

$$dx = f(x(t), u(t))\,dt + \sigma(x(t), u(t))\,dw \qquad (1)$$

where $f$ is the local drift and $\sigma\,dw$ (with $w$ a Brownian motion of dimension $r$ and $\sigma$ a $d \times r$ matrix) is the stochastic part of the diffusion process, which appears for several reasons such as lack of precision, noisy influence, or random fluctuations. For an initial state $x$ and control $u(t)$, equation (1) leads to an infinity of possible trajectories $x(t)$. For some trajectory $x(t)$ (see figure 1), let $\tau$ be its exit time from $O$ (with the convention that if $x(t)$ always stays in $O$, then $\tau = \infty$). Then we define the functional $J$ of initial state $x$ and control $u(.)$ as the expectation, over all trajectories, of the discounted cumulative reinforcement:

$$J(x; u(.)) = E_{x, u(.)}\left\{ \int_0^\tau \gamma^t\, r(x(t), u(t))\,dt + \gamma^\tau R(x(\tau)) \right\}$$
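To make this formalism concrete, the following is a minimal sketch (not part of the paper) of how $J$ can be approximated by simulation: an Euler-Maruyama discretization of the state dynamics (1) together with a Monte Carlo average of the discounted cumulative reinforcement, with $\gamma$ the discount factor, $r$ the current reinforcement and $R$ the boundary reinforcement received at the exit time $\tau$. The function names (`simulate_trajectory`, `estimate_J`), the fixed time step `dt`, and the truncation horizon `t_max` are hypothetical choices for this illustration; the paper itself relies on a Finite-Difference approximation scheme, not on this simulation.

```python
import numpy as np

def simulate_trajectory(x0, policy, f, sigma, r, R, inside, gamma,
                        dt=1e-3, t_max=10.0, rng=None):
    """One Euler-Maruyama rollout of dx = f(x,u) dt + sigma(x,u) dw,
    returning the discounted cumulative reinforcement of that trajectory."""
    rng = rng if rng is not None else np.random.default_rng()
    x = np.asarray(x0, dtype=float)
    t, total = 0.0, 0.0
    while t < t_max:
        if not inside(x):                       # exit time tau reached
            return total + gamma ** t * R(x)    # boundary reinforcement
        u = policy(x)
        total += gamma ** t * r(x, u) * dt      # running reinforcement
        # Brownian increment dw ~ N(0, dt * I_r); sigma(x, u) is a d x r matrix
        dw = rng.normal(scale=np.sqrt(dt), size=sigma(x, u).shape[1])
        x = x + f(x, u) * dt + sigma(x, u) @ dw
        t += dt
    return total   # trajectory stayed in O up to the horizon (tau = infinity)

def estimate_J(x0, policy, f, sigma, r, R, inside, gamma, n_traj=1000):
    """Monte Carlo estimate of J(x; u(.)) as an average over trajectories."""
    samples = [simulate_trajectory(x0, policy, f, sigma, r, R, inside, gamma)
               for _ in range(n_traj)]
    return float(np.mean(samples))
```

Note that the convention $\tau = \infty$ is handled here by truncating trajectories at a finite horizon, which is a reasonable approximation for $\gamma < 1$ since the discount factor $\gamma^t$ vanishes as $t$ grows.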