Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
The paper studies the important problem of off-policy policy evaluation in long-horizon MDPs. The setting focuses on small-state, large-action problems. A novel estimator is proposed, and its finite-sample statistical properties are studied. Empirical results show the method is useful, especially in partially observable problems. Reviewers feel the experimental section could be strengthened (e.g., by using more domains). Furthermore, the assumption that the state space is small limits the significance and applicability of this work. On the other hand, the reviewers all agree the approach is novel and useful in some cases.