Improving Policies without Measuring Merits

Part of Advances in Neural Information Processing Systems 8 (NIPS 1995)


Authors

Peter Dayan, Satinder Singh

Abstract

Performing policy iteration in dynamic programming should only require knowledge of relative rather than absolute measures of the utility of actions (Werbos, 1991), which Baird (1993) calls the advantages of actions at states. Nevertheless, most existing methods in dynamic programming (including Baird's) compute some form of absolute utility function. For smooth problems, advantages satisfy two differential consistency conditions (including the requirement that they be free of curl), and we show that enforcing these can lead to appropriate policy improvement solely in terms of advantages.
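For reference, and using notation not drawn from the abstract itself, the advantage Baird (1993) refers to is conventionally defined relative to the state-action value \(Q^{\pi}\) and state value \(V^{\pi}\) of a policy \(\pi\):

\[
A^{\pi}(x, a) \;=\; Q^{\pi}(x, a) \;-\; V^{\pi}(x),
\qquad
V^{\pi}(x) \;=\; Q^{\pi}\bigl(x, \pi(x)\bigr).
\]

Policy improvement then depends only on these relative quantities: a greedy update \(\pi'(x) = \arg\max_{a} A^{\pi}(x, a)\) selects the same actions as maximizing \(Q^{\pi}\), since subtracting \(V^{\pi}(x)\) shifts all actions at a state by the same constant. This is a standard definition given for orientation; the paper's contribution, per the abstract, is obtaining such improvement from advantages alone via the stated consistency conditions, without computing an absolute utility function.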

1