The paper proposes a model-based approach to solving POMDPs with high-dimensional observation spaces. The problem is tackled by jointly learning the dynamics of the POMDP and the optimal policy via maximum likelihood, using an "RL as inference" style objective. In more detail, latent-space transitions are predicted by an inference model trained to maximise an evidence lower bound.

The reviewers are mostly positive about the paper. They highlight the theoretical soundness of the approach, the quality of the writing, the empirical set-up, and the usefulness of the ablations. Some criticisms were also raised. For example, is the method applicable to 'strongly partially observable' environments? One reviewer suggested that the justification of the approach could be made clearer in the paper and that an ablation could be added (on the necessity of AISOC / MaxEnt). Another reviewer questioned the novelty of the training procedure. Finally, R1 strongly felt that the SOTA claim was not (or no longer) substantiated.

The authors' rebuttal addressed many of these criticisms. The authors agree that the method is not designed to solve strongly partially observable problems, and they agree to remove the SOTA claim. Most reviewers found that the strengths of the paper outweigh its limitations. The rebuttal furthermore addressed the points raised by the more critical reviewer in greater detail. In personal communication, this reviewer confirmed that they are now happy to accept the paper, but they have been unable to update their review in CMT.

I am therefore happy to recommend the paper for acceptance. I ask the authors to update the final version of the paper in light of the reviews. In particular, I would like the authors to incorporate the points on the "POMDP claims" and "SOTA claims" discussed in the author response, and to clarify the "overly optimistic / risk seeking" aspect.
Note: it is not necessary for the authors to contact the AC with inquiries summarising the reviews so far.