Part of Advances in Neural Information Processing Systems 36 (NeurIPS 2023) Main Conference Track
Evelyn Xiao-Yue Gong, Mark Sellke
We study pure exploration with infinitely many bandit arms generated i.i.d.\ from an unknown distribution. Our goal is to efficiently select a single high-quality arm whose average reward is, with probability $1-\delta$, within $\varepsilon$ of being among the top $\eta$-fraction of arms; this is a natural adaptation of the classical PAC guarantee for infinite action sets. We consider both the fixed confidence and fixed budget settings, aiming respectively for optimal \emph{expected} and \emph{fixed} sample complexity.

For fixed confidence, we give an algorithm with expected sample complexity $O\!\left(\frac{\log(1/\eta)\log(1/\delta)}{\eta\varepsilon^2}\right)$. This is optimal except for the $\log(1/\eta)$ factor, and the $\delta$-dependence closes a quadratic gap in the literature. For fixed budget, we show the asymptotically optimal sample complexity as $\delta\to 0$ is $c^{-1}\log(1/\delta)\big(\log\log(1/\delta)\big)^2$ to leading order; equivalently, the optimal failure probability with exactly $N$ samples decays as $\exp\!\big(-(1\pm o(1))\frac{cN}{\log^2 N}\big)$. The value of $c$ depends explicitly on the problem parameters (including the unknown arm distribution) through a certain Fisher information distance. Even the strictly super-linear dependence on $\log(1/\delta)$ was not known and resolves a question of Grossman-Moshkovitz (FOCS 2015).
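To make the selection goal concrete, it can be stated as follows (a minimal formalization in our own notation, which is not necessarily the paper's: we write $G$ for the unknown distribution of arm means and $\hat a$ for the arm returned by the algorithm). Writing
\[
  q_\eta \;=\; \inf\big\{ x : \Pr_{\mu \sim G}(\mu \ge x) \le \eta \big\}
\]
for the threshold above which at most an $\eta$-fraction of arm means lie, the $(\varepsilon,\eta,\delta)$-PAC requirement is that the mean $\mu_{\hat a}$ of the returned arm satisfies
\[
  \Pr\big( \mu_{\hat a} \ge q_\eta - \varepsilon \big) \;\ge\; 1 - \delta .
\]
The two forms of the fixed budget result are consistent: at $N = c^{-1}\log(1/\delta)\big(\log\log(1/\delta)\big)^2$ one has $\log N = (1+o(1))\log\log(1/\delta)$, so $cN/\log^2 N = (1+o(1))\log(1/\delta)$, i.e.\ the failure probability $\exp\!\big(-(1\pm o(1))\frac{cN}{\log^2 N}\big)$ is indeed $\delta^{1\pm o(1)}$.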