Sham M. Kakade
We provide a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space. Although gradient methods cannot make large changes in the values of the parameters, we show that the natural gradient is moving toward choosing a greedy optimal action rather than just a better action. These greedy optimal actions are those that would be chosen under one improvement step of policy iteration with approximate, compatible value functions, as defined by Sutton et al. We then show drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris.
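To make the contrast between the ordinary and the natural gradient concrete, the following is a minimal sketch and not the paper's experimental setup. It assumes a single-state MDP (a bandit) with a tabular softmax policy, and it preconditions the ordinary gradient with the Fisher information matrix of the policy. The function names, the reward vector, and the parameter values are hypothetical choices made for illustration. Because the softmax parameterization is redundant, the Fisher matrix is singular here, so the sketch takes the minimum-norm least-squares solution instead of a plain inverse.

```python
import numpy as np


def softmax(theta):
    """Numerically stable softmax over action preferences."""
    z = np.exp(theta - theta.max())
    return z / z.sum()


def vanilla_and_natural_gradient(theta, rewards):
    """Return (vanilla, natural) gradients of the expected reward.

    Assumption: one state, policy pi(a) = softmax(theta)[a], and
    expected reward eta(theta) = sum_a pi(a) * rewards[a].
    """
    pi = softmax(theta)
    # Row a of `score` is grad_theta log pi(a) = e_a - pi.
    score = np.eye(len(theta)) - pi
    # Ordinary gradient: sum_a pi(a) * rewards[a] * score[a].
    vanilla = score.T @ (pi * rewards)
    # Fisher information: sum_a pi(a) * score[a] score[a]^T.
    fisher = score.T @ (pi[:, None] * score)
    # The Fisher matrix is singular for softmax, so solve
    # fisher @ x = vanilla in the minimum-norm least-squares sense.
    natural, *_ = np.linalg.lstsq(fisher, vanilla, rcond=1e-10)
    return vanilla, natural


# Hypothetical example: the current policy strongly prefers action 0,
# while action 1 has the highest reward.
theta = np.array([2.0, 0.0, -1.0])
rewards = np.array([1.0, 2.0, 0.5])
vanilla, natural = vanilla_and_natural_gradient(theta, rewards)
print("vanilla gradient:", vanilla)
print("natural gradient:", natural)
```

Under these assumptions the natural gradient equals the reward vector minus its mean, whatever the current policy is, so its largest component always belongs to the greedy action. The ordinary gradient is additionally scaled by the current action probabilities, which shrinks the update for actions the policy rarely selects.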