Part of Advances in Neural Information Processing Systems 32 (NeurIPS 2019)
Andrew Bennett, Nathan Kallus
Evaluating novel contextual bandit policies using logged data is crucial in applications where exploration is costly, such as medicine. But it usually relies on the assumption of no unobserved confounders, which is bound to fail in practice. We study the question of policy evaluation when we instead have proxies for the latent confounders and develop an importance weighting method that avoids fitting a latent outcome regression model. Surprisingly, we show that there exist no single set of weights that give unbiased evaluation regardless of outcome model, unlike the case with no unobserved confounders where density ratios are sufficient. Instead, we propose an adversarial objective and weights that minimize it, ensuring sufficient balance in the latent confounders regardless of outcome model. We develop theory characterizing the consistency of our method and tractable algorithms for it. Empirical results validate the power of our method when confounders are latent.