Part of Advances in Neural Information Processing Systems 10 (NIPS 1997)
A new policy iteration algorithm for partially observable Markov decision processes is presented that is simpler and more efficient than an earlier policy iteration algorithm of Sondik (1971,1978). The key simplification is representation of a policy as a finite-state controller. This representation makes policy evaluation straightforward. The pa(cid:173) per's contribution is to show that the dynamic-programming update used in the policy improvement step can be interpreted as the trans(cid:173) formation of a finite-state controller into an improved finite-state con(cid:173) troller. The new algorithm consistently outperforms value iteration as an approach to solving infinite-horizon problems.