
Dynamic Regret of Adversarial Linear Mixture MDPs

Part of Advances in Neural Information Processing Systems 36 (NeurIPS 2023) Main Conference Track


Authors

Long-Fei Li, Peng Zhao, Zhi-Hua Zhou

Abstract

We study reinforcement learning in episodic inhomogeneous MDPs with adversarial full-information rewards and an unknown transition kernel. We consider linear mixture MDPs, whose transition kernel is a linear mixture model, and choose the \emph{dynamic regret} as the performance measure. Denoting by $d$ the dimension of the feature mapping, $H$ the horizon, $K$ the number of episodes, and $P_T$ the non-stationarity measure, we propose a novel algorithm that enjoys an $\widetilde{\mathcal{O}}\big(\sqrt{d^2 H^3 K} + \sqrt{H^4 (K + P_T)(1 + P_T)}\big)$ dynamic regret under the condition that $P_T$ is known, which improves the previously best-known dynamic regret for adversarial linear mixture MDPs and adversarial tabular MDPs. We also establish an $\Omega\big(\sqrt{d^2 H^3 K} + \sqrt{H K (H + P_T)}\big)$ lower bound, indicating that our algorithm is \emph{optimal} in $K$ and $P_T$. Furthermore, when the non-stationarity measure $P_T$ is unknown, we design an online ensemble algorithm with a meta-base structure, which is proved to achieve an $\widetilde{\mathcal{O}}\big(\sqrt{d^2 H^3 K} + \sqrt{H^4 (K + P_T)(1 + P_T)} + H^2 S_T\big)$ dynamic regret, where $S_T$ is the expected switching number of the best base-learner. The result can be optimal under certain regimes.
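As a brief orientation, the following is a sketch of how the quantities referenced above are typically defined in the dynamic-regret literature for adversarial MDPs; the symbols $\phi$, $\theta_h$, $\pi^c_k$, and $s_{k,1}$ are illustrative notation, and the paper's precise definitions may differ in details.

% Linear mixture MDP: the transition kernel is a linear combination of a
% known feature mapping with an unknown stage-dependent parameter vector.
\[
  \mathbb{P}_h(s' \mid s, a) = \big\langle \phi(s' \mid s, a), \theta_h \big\rangle,
  \qquad \phi(s' \mid s, a) \in \mathbb{R}^d .
\]

% Dynamic regret: performance is measured against an arbitrary sequence of
% comparator policies \pi^c_1, \ldots, \pi^c_K, one per episode.
\[
  \text{D-Regret}(K)
  = \sum_{k=1}^{K} \Big( V_{k,1}^{\pi^c_k}(s_{k,1}) - V_{k,1}^{\pi_k}(s_{k,1}) \Big).
\]

% Non-stationarity measure: the path length of the comparator sequence.
\[
  P_T = \sum_{k=2}^{K} \sum_{h=1}^{H}
        \big\| \pi^c_{k,h} - \pi^c_{k-1,h} \big\|_{1,\infty}.
\]

Under these conventions, $P_T = 0$ recovers the standard static regret against a single fixed comparator policy, while larger $P_T$ allows the comparator sequence to track a changing environment.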