Part of Advances in Neural Information Processing Systems 35 (NeurIPS 2022) Main Conference Track
Jaekyeom Kim, Seohong Park, Gunhee Kim
For zero-shot transfer in reinforcement learning where the reward function varies between different tasks, the successor features framework has been one of the popular approaches. However, in this framework, the transfer to new target tasks with generalized policy improvement (GPI) relies on only the source successor features  or additional successor features obtained from the function approximators’ generalization to novel inputs . The goal of this work is to improve the transfer by more tightly bounding the value approximation errors of successor features on the new target tasks. Given a set of source tasks with their successor features, we present lower and upper bounds on the optimal values for novel task vectors that are expressible as linear combinations of source task vectors. Based on the bounds, we propose constrained GPI as a simple test-time approach that can improve transfer by constraining action-value approximation errors on new target tasks. Through experiments in the Scavenger and Reacher environment with state observations as well as the DeepMind Lab environment with visual observations, we show that the proposed constrained GPI significantly outperforms the prior GPI’s transfer performance. Our code and additional information are available at https://jaekyeom.github.io/projects/cgpi/.