Simultaneously Learning Stochastic and Adversarial Episodic MDPs with Known Transition

Part of Advances in Neural Information Processing Systems 33 pre-proceedings (NeurIPS 2020)

Bibtex »Paper »Supplemental »

Bibtek download is not availble in the pre-proceeding


Authors

Tiancheng Jin, Haipeng Luo

Abstract

<p>This work studies the problem of learning episodic Markov Decision Processes with known transition and bandit feedback. We develop the first algorithm with a ``best-of-both-worlds'' guarantee: it achieves O(log T) regret when the losses are stochastic, and simultaneously enjoys worst-case robustness with \tilde{O}(\sqrt{T}) regret even when the losses are adversarial, where T is the number of episodes. More generally, it achieves \tilde{O}(\sqrt{C}) regret in an intermediate setting where the losses are corrupted by a total amount of C. Our algorithm is based on the Follow-the-Regularized-Leader method from Zimin and Neu (2013), with a novel hybrid regularizer inspired by recent works of Zimmert et al. (2019a, 2019b) for the special case of multi-armed bandits. Crucially, our regularizer admits a non-diagonal Hessian with a highly complicated inverse. Analyzing such a regularizer and deriving a particular self-bounding regret guarantee is our key technical contribution and might be of independent interest.</p>