Creating models that can perceive a rich 3D world from 2D inputs and then imagine new 3D words is both extremely difficult and important. Reviewers agreed that the model is novel, yet has many limitations. Visual fidelity is low as is resolution. The number of objects that can be rendered is small. At the same time, reviewers agreed that it outperforms prior models. Reviewers encourage the authors to move comparison with state of the art models to the main paper. The primary concern of reviewers was centered around the limitations of this sub-field as a whole, that one cannot learn from real images and generate realistic-looking images. It will take time to build models of sufficient refinement to do this, but as the reviewers point out, the work presented here provides both a conceptual advance (having a latent 3D representation) and a practical advance in generating more complex videos.