Part of Advances in Neural Information Processing Systems 15 (NIPS 2002)
Theodore Perkins, Doina Precup
We study a new, model-free form of approximate policy iteration which uses Sarsa updates with linear state-action value function approximation for policy evaluation, and a "policy improvement operator" to generate a new policy based on the learned state-action values. We prove that if the policy improvement operator produces ε-soft policies and is Lipschitz continuous in the action values, with a constant that is not too large, then the approximate policy iteration algorithm converges to a unique solution from any initial policy. To our knowledge, this is the first convergence result for any form of approximate policy iteration under similar computational-resource assumptions.
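The sketch below illustrates the kind of algorithm the abstract describes: Sarsa with linear state-action value approximation for policy evaluation, alternated with a policy improvement operator that is ε-soft and Lipschitz continuous in the action values (here, a softmax mixed with the uniform distribution, whose temperature controls the Lipschitz constant). The toy MDP, feature map, step sizes, and function names are illustrative assumptions for readability, not the paper's exact construction or convergence conditions.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- A tiny random MDP, used only to make the sketch runnable ---
n_states, n_actions = 5, 2
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] is a next-state distribution
R = rng.normal(size=(n_states, n_actions))                        # expected rewards
gamma = 0.9

# --- Linear state-action value approximation: Q(s, a) = w . phi(s, a) ---
def phi(s, a):
    """One-hot (tabular) features; any fixed linear features would do."""
    f = np.zeros(n_states * n_actions)
    f[s * n_actions + a] = 1.0
    return f

def soft_improvement(q_values, epsilon=0.1, tau=1.0):
    """Policy improvement operator: epsilon-soft and Lipschitz in the action values.
    A softmax mixed with the uniform distribution; larger tau means a smaller Lipschitz constant."""
    z = np.exp((q_values - q_values.max()) / tau)
    softmax = z / z.sum()
    return (1 - epsilon) * softmax + epsilon / len(q_values)

def approximate_policy_iteration(n_iterations=50, n_sarsa_steps=5000, alpha=0.1):
    w = np.zeros(n_states * n_actions)
    policy = np.full((n_states, n_actions), 1.0 / n_actions)  # start from the uniform policy
    for _ in range(n_iterations):
        # Policy evaluation: Sarsa updates under the current (fixed) policy.
        s = rng.integers(n_states)
        a = rng.choice(n_actions, p=policy[s])
        for _ in range(n_sarsa_steps):
            s_next = rng.choice(n_states, p=P[s, a])
            a_next = rng.choice(n_actions, p=policy[s_next])
            td_error = R[s, a] + gamma * w @ phi(s_next, a_next) - w @ phi(s, a)
            w += alpha * td_error * phi(s, a)
            s, a = s_next, a_next
        # Policy improvement: apply the soft, Lipschitz operator to the learned values.
        q = w.reshape(n_states, n_actions)
        policy = np.array([soft_improvement(q[s]) for s in range(n_states)])
    return policy, w

if __name__ == "__main__":
    pi, w = approximate_policy_iteration()
    print("Final policy:\n", pi)
```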