Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)
Bruno Scherrer, Boris Lesner
We consider infinite-horizon stationary γ-discounted Markov Decision Processes, for which it is known that there exists a stationary optimal policy. Using Value and Policy Iteration with some error ϵ at each iteration, it is well-known that one can compute stationary policies that are $\frac{2\gamma{(1-\gamma)^2}\epsilon−optimal.Afterarguingthatthisguaranteeistight,wedevelopvariationsofValueandPolicyIterationforcomputingnon−stationarypoliciesthatcanbeupto\frac{2\gamma}{1-\gamma}\epsilon−optimal,whichconstitutesasignificantimprovementintheusualsituationwhen\gammaiscloseto1$. Surprisingly, this shows that the problem of computing near-optimal non-stationary policies'' is much simpler than that of computing near-optimal stationary policies''.