Part of Advances in Neural Information Processing Systems 8 (NIPS 1995)
A continuous-time, continuous-state version of the temporal difference (TD) algorithm is derived in order to facilitate the application of reinforcement learning to real-world control tasks and neurobiological modeling. An optimal nonlinear feedback control law is also derived using the derivatives of the value function. The performance of the algorithms was tested on the task of swinging up a pendulum with limited torque. Both the "critic" that specifies the paths to the upright position and the "actor" that works as a nonlinear feedback controller were successfully implemented by radial basis function (RBF) networks.
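As a rough illustration of the setting described above, the following minimal sketch simulates a torque-limited pendulum and evaluates a discretized continuous-time TD error. It rests on several assumptions that the abstract does not state: the value function is taken to be exponentially discounted with time constant `tau`, so that the TD error has the form δ(t) = r(t) − V(t)/τ + dV/dt; the time derivative is approximated by a backward Euler difference; and the function names, physical parameters, and the convention that `theta = 0` is the upright position are all hypothetical choices made for this example.

```python
import math


def pendulum_step(theta, omega, torque, dt=0.02, m=1.0, l=1.0, g=9.8,
                  mu=0.01, max_torque=5.0):
    """One Euler step of a damped pendulum (illustrative parameters).

    Assumed convention: theta = 0 is upright, so gravity pushes the
    pendulum away from theta = 0.
    """
    # Limited torque: clip the command to [-max_torque, max_torque].
    u = max(-max_torque, min(max_torque, torque))
    # m * l^2 * theta'' = -mu * theta' + m * g * l * sin(theta) + u
    alpha = (-mu * omega + m * g * l * math.sin(theta) + u) / (m * l * l)
    omega = omega + alpha * dt
    theta = theta + omega * dt
    return theta, omega


def td_error(r, v_now, v_prev, dt, tau=1.0):
    """Backward-Euler approximation of delta = r - V/tau + dV/dt.

    Assumes an exponentially discounted value function with time
    constant tau; this form is an assumption of the sketch.
    """
    return r - v_now / tau + (v_now - v_prev) / dt


if __name__ == "__main__":
    # Hanging down at rest, push with more torque than the limit allows.
    theta, omega = math.pi, 0.0
    for _ in range(5):
        theta, omega = pendulum_step(theta, omega, torque=100.0)
    print(theta, omega)
```

In this sketch, a critic whose value estimate is consistent with the reward yields a zero TD error: with constant reward `r` and constant value `V = tau * r`, `td_error` returns 0. A learned critic (for example an RBF network, as in the abstract) would adjust its weights to drive this error toward zero along trajectories.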