Part of Advances in Neural Information Processing Systems 37 (NeurIPS 2024) Main Conference Track
Yujia Jin, Ishani Karmarkar, Aaron Sidford, Jiayi Wang
We provide faster randomized algorithms for computing an $\epsilon$-optimal policy in a discounted Markov decision process with $A_{\text{tot}}$ state-action pairs, bounded rewards, and discount factor $\gamma$. We provide an $\widetilde{O}(A_{\text{tot}}[(1-\gamma)^{-3}\epsilon^{-2} + (1-\gamma)^{-2}])$-time algorithm in the sampling setting, where the probability transition matrix is unknown but accessible through a generative model that can be queried in $\widetilde{O}(1)$ time, and an $\widetilde{O}(s + A_{\text{tot}}(1-\gamma)^{-2})$-time algorithm in the offline setting, where the probability transition matrix is known and $s$-sparse. These results improve upon the prior state of the art, which either ran in $\widetilde{O}(A_{\text{tot}}[(1-\gamma)^{-3}\epsilon^{-2} + (1-\gamma)^{-3}])$ time [Sidford, Wang, Wu, Ye 2018] in the sampling setting, in $\widetilde{O}(s + A_{\text{tot}}(1-\gamma)^{-3})$ time [Sidford, Wang, Wu, Yang, Ye 2018] in the offline setting, or in time at least quadratic in the number of states using interior-point methods for linear programming. We achieve our results by building upon prior stochastic variance-reduced value iteration methods [Sidford, Wang, Wu, Yang, Ye 2018]. Specifically, we provide a variant that carefully truncates the progress of its iterates to reduce the variance of the new variance-reduced sampling procedures that we introduce to implement its steps. Our method is essentially model-free and can be implemented in $\widetilde{O}(A_{\text{tot}})$ space when given generative-model access. Consequently, our results take a step toward closing the sample-complexity gap between model-free and model-based methods.
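To make the high-level approach concrete, the following is a minimal Python sketch of variance-reduced value iteration with truncated iterates in the sampling setting. Everything here is an illustrative placeholder rather than the paper's procedure: the `sample_next_state` generative-model interface, the sample sizes, and the additive truncation rule are assumptions for exposition, whereas the paper chooses these quantities carefully to obtain the stated runtimes.

```python
import numpy as np

def truncated_vr_value_iteration(
    sample_next_state,    # assumed generative-model interface: (s, a) -> next-state index
    rewards,              # rewards[s, a], assumed bounded in [0, 1]
    gamma,                # discount factor
    num_epochs=5,
    inner_iters=20,
    anchor_samples=400,   # placeholder: many samples per (s, a) at the anchor
    inner_samples=10,     # placeholder: few samples per (s, a) for the correction
    trunc=0.1,            # placeholder cap on per-iterate progress past the anchor
):
    """Illustrative sketch of truncated, variance-reduced value iteration.

    Sample sizes and the truncation rule are placeholders, not the
    tuned quantities from the paper.
    """
    n_states, n_actions = rewards.shape
    v = np.zeros(n_states)

    for _ in range(num_epochs):
        v0 = v.copy()
        # Expensive, high-accuracy estimate of (P v0)(s, a) at the anchor point.
        pv0 = np.zeros((n_states, n_actions))
        for s in range(n_states):
            for a in range(n_actions):
                nxt = [sample_next_state(s, a) for _ in range(anchor_samples)]
                pv0[s, a] = np.mean(v0[nxt])

        for _ in range(inner_iters):
            q = np.zeros((n_states, n_actions))
            for s in range(n_states):
                for a in range(n_actions):
                    # Cheap estimate of P(v - v0): low variance when v stays
                    # close to the anchor v0, which truncation enforces.
                    nxt = [sample_next_state(s, a) for _ in range(inner_samples)]
                    corr = np.mean(v[nxt] - v0[nxt])
                    q[s, a] = rewards[s, a] + gamma * (pv0[s, a] + corr)
            v_new = q.max(axis=1)
            # Truncate progress: cap how far each iterate moves past the anchor,
            # controlling the variance of subsequent inner-loop estimates.
            v = np.minimum(v_new, v0 + trunc)

    greedy_policy = np.argmax(rewards + gamma * pv0, axis=1)  # rough policy readout
    return v, greedy_policy
```

A toy invocation, wrapping a small random MDP with known transitions as a generative model, might look as follows; here `P`, `R`, and `sampler` are hypothetical names for the demo only:

```python
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(5), size=(5, 2))   # P[s, a] is a distribution over next states
R = rng.random((5, 2))
sampler = lambda s, a: rng.choice(5, p=P[s, a])
values, policy = truncated_vr_value_iteration(sampler, R, gamma=0.9)
```

The sketch reflects the two ingredients the abstract names: variance reduction (an expensive estimate of $P v_0$ at an anchor, corrected by cheap estimates of $P(v - v_0)$) and truncation of iterate progress, which keeps $\|v - v_0\|_\infty$ small so the cheap correction remains low-variance.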