The authors present a model-based method (MOPO) that uses an ensemble of models to provide an uncertainty estimate over the next-state distribution. This estimate can be used to avoid uncertain regions of the state space, which yields a lower bound on the return in the true MDP. The reviewers are unanimously positive about the paper. They highlight the theoretical soundness of the motivation, the novelty of the analysis of the penalised policy, and the thoroughness of the empirical evaluation. The reviewers also note certain limitations, such as the need for an analytic representation of the next-state distribution and for a way to quantify disagreement. The looseness of the bound was also mentioned. The most important concern, however, is the gap between the theoretical results and the practical implementation. Nevertheless, the reviewers deemed the empirical strength and the novelty of the theoretical results to outweigh these limitations, and I’m happy to go along with the reviewers and recommend the paper for acceptance. Several of the reviewers’ questions were answered in the rebuttal. Please update the final version of the paper to include the additional results and clarifications.