Improving Policies without Measuring Merits

Part of Advances in Neural Information Processing Systems 8 (NIPS 1995)


Authors

Peter Dayan, Satinder Singh

Abstract

Performing policy iteration in dynamic programming should only require knowledge of relative rather than absolute measures of the utility of actions (Werbos, 1991), which Baird (1993) calls the advantages of actions at states. Nevertheless, most existing methods in dynamic programming (including Baird's) compute some form of absolute utility function. For smooth problems, advantages satisfy two differential consistency conditions (including the requirement that they be free of curl), and we show that enforcing these can lead to appropriate policy improvement solely in terms of advantages.
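For reference, and using notation not drawn from the abstract itself, the advantage Baird (1993) refers to is conventionally defined relative to the state-action value \(Q^{\pi}\) and state value \(V^{\pi}\) of a policy \(\pi\):

\[
A^{\pi}(x, a) \;=\; Q^{\pi}(x, a) \;-\; V^{\pi}(x),
\qquad
V^{\pi}(x) \;=\; Q^{\pi}\bigl(x, \pi(x)\bigr).
\]

Policy improvement then depends only on these relative quantities: a greedy update \(\pi'(x) = \arg\max_{a} A^{\pi}(x, a)\) selects the same actions as maximizing \(Q^{\pi}\), since subtracting \(V^{\pi}(x)\) shifts all actions at a state by the same constant. This is a standard definition given for orientation; the paper's contribution, per the abstract, is obtaining such improvement from advantages alone via the stated consistency conditions, without computing an absolute utility function.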

1