All reviewers agree that this submission is above the acceptance threshold, and they concur that decoupling text generation from policy learning during RL is a compelling and interesting idea. I also recommend acceptance, with two notes: 1) the reviewers raised a number of questions that were addressed in the author response, most of which are already covered in the supplementary material, so I would advise the authors to incorporate these points into the main manuscript; 2) I see your method as a way to deal with language drift more generally. A couple of recent papers look into dealing with language drift. For example, Lee et al. (2019) counter language drift through image grounding, while Lazaridou et al. (2020) and Lu et al. (2020) also decouple generation and policy learning: the former through reranking of language-model samples using the RL reward, and the latter through distillation, so that the RL signal never disrupts the core language knowledge. Is any of these methods superior to the others? We don't know, but it is perhaps an interesting question for this paper to raise.

Lee, Jason, Kyunghyun Cho, and Douwe Kiela. "Countering Language Drift via Visual Grounding." arXiv preprint arXiv:1909.04499 (2019).

Lazaridou, Angeliki, Anna Potapenko, and Olivier Tieleman. "Multi-agent Communication meets Natural Language: Synergies between Functional and Structural Language Learning." arXiv preprint arXiv:2005.07064 (2020).

Lu, Yuchen, et al. "Countering Language Drift with Seeded Iterated Learning." arXiv preprint arXiv:2003.12694 (2020).