Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Many studies have explored physics simulation as a cognitive model. In such work, physics engines are usually employed as a model of human cognition of physical tasks, with the perception part of the task often abstracted away. In parallel, data-driven models have frequently been used to learn to parse raw visual inputs to detect or locate objects, often without any explicit model of the physical world. This paper tries to bridge these two fields to build a complete model of how humans perceive certain physical scenarios, from raw pixels to expectations over objects. While all of the parts employed in the proposed "pipeline" are based on previous work, their arrangement into this unified framework is new, as are the human and model results on the new dataset the authors also present. The paper is clear and well written in general. However, I do think it would benefit from a clearer statement of its goal. Is the purpose of the proposed framework to be a model of human cognition (for example, the results with ellipsoids might suggest that humans perform some sort of simple shape processing at early stages of visual processing)? Or is it to propose a possible new route for improvements in computer vision? (And if so, what path do the authors see for this type of architecture to contribute to better visual models? Or is surprise detection intrinsically valuable?) Some further clarifications that would be useful:
- How fast does the whole pipeline run? How does it compare to the baselines used?
- Are the results in Figure 5 different if you calculate accuracy per person and then average the accuracies (instead of averaging scores across people and then calculating accuracies)?
- Are there qualitative visualizations (as in Figure 4) for the predictions of the baseline models? These could be interesting in the supplement, if the results are not completely trivial.
- The GAN model uses a surprise score based on the discriminator. Are the results similar if an L2 loss like the one used for the LSTM is used?
- Isn't it a bit ad hoc that the surprise values for the "rare" events are hardcoded at values seemingly unrelated to their already hand-picked probabilities? (Supplement line 113, r values.) Is there a specific reason for these values?
In sum, this paper presents an interesting combination of object detection with data-driven models and probabilistic dynamics modeling to construct a complete pipeline for implausibility detection. Though the tasks employed are somewhat specific, and modeling surprise directly does not seem to have immediate practical applications, the framework points in an interesting direction of more integrated modeling and presents an interesting comparison to human results. The paper would probably benefit from a more explicit discussion of its modeling results in comparison to the human data, and the possible inferences about human cognition.
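The question above about Figure 5 hinges on the order of aggregation: averaging scores across people before thresholding can give a different accuracy than computing each person's accuracy first. A minimal sketch with made-up scores (all values here are invented for illustration, not taken from the paper):

```python
import numpy as np

# Hypothetical surprise scores from 3 raters on 4 trials;
# ground truth: first two trials implausible (1), last two plausible (0).
scores = np.array([
    [0.9, 0.2, 0.4, 0.1],   # rater A
    [0.6, 0.8, 0.9, 0.3],   # rater B
    [0.7, 0.9, 0.1, 0.2],   # rater C
])
labels = np.array([1, 1, 0, 0])
threshold = 0.5

# Option 1: average scores across people, then compute accuracy once.
mean_scores = scores.mean(axis=0)
acc_pooled = ((mean_scores > threshold) == labels).mean()

# Option 2: compute each person's accuracy, then average the accuracies.
per_person = ((scores > threshold) == labels).mean(axis=1)
acc_per_person = per_person.mean()

print(acc_pooled, acc_per_person)  # 1.0 vs ~0.83 — the two orders disagree
```

Pooling first lets raters' noisy scores cancel before thresholding, so it can report higher accuracy than the average individual achieves; that is why the distinction matters for Figure 5.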
To summarize, the authors present a model to discriminate physically plausible from implausible scenes in occlusion-based intuitive physics setups. Images form the input to a visual derenderer that decomposes the scene into objects with physical and deliberately coarse visual attributes to facilitate generalization. The inferred physical object states are fed into a non-learned physics engine (Bullet), which predicts future physical object states. Multiple future beliefs are generated by repeatedly perturbing the input and rerunning the physics engine. A particle filter is then used to combine, track, and update beliefs about future object states. Finally, the beliefs are compared against actual observations to generate a measure of the system's surprise. Experiments on expectation-violation setups based on classical developmental psychology experiments show that the model is reliably able to distinguish physically plausible from implausible scenes and that its predictions align with human judgments of physical plausibility.
Originality: Visual object-centric derenderers, particle filters, and expectation-violation datasets have been proposed before. These pieces have been put together in a nice way, though.
Quality: Very well written, clearly organized paper with nicely executed experiments and convincing results.
Clarity: Overall the paper is clear and well written. However, all limitations and failure cases have been put into the supplement, which is misleading to the reader; these should be moved to the main paper.
Significance: The results are significant to the cognitive science community, but the significance of this method in a more general sense for other tasks is questionable, as several architectural pieces have been optimized to work only on the presented setup.
In its current state, this paper is a borderline weak reject for me, but I would like to accept it if my concerns are addressed (see improvements section).
Detailed comment: (116-117) How important is this distinction, and why do occluders need to be modeled separately from other objects? What happens if you don't do this?
EDIT: After reading the authors' response, most of my concerns have been addressed, and I am changing my rating from weak reject (score of 5) to weak accept (score of 6).
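The perturb-rollout-compare loop summarized in this review can be sketched in a few lines. This is only an illustrative toy, not the authors' implementation: the function names, the noise and particle-count parameters, and the constant-velocity dynamics standing in for the Bullet engine are all invented here.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(state, steps=10):
    """Stand-in for the physics engine (the paper uses Bullet):
    a toy constant-velocity rollout over state = [position, velocity]."""
    pos, vel = state
    return np.array([pos + vel * steps, vel])

def surprise(observed_state, inferred_state, n_particles=50, noise=0.05):
    """Perturb the inferred state, roll out many futures ("beliefs"),
    and score how far the observation falls from the belief distribution."""
    particles = inferred_state + rng.normal(0.0, noise, (n_particles, 2))
    futures = np.array([simulate(p) for p in particles])
    # Surprise as distance from the mean predicted state, scaled by spread.
    mean, std = futures.mean(axis=0), futures.std(axis=0) + 1e-6
    return np.abs((observed_state - mean) / std).max()

inferred = np.array([0.0, 1.0])                  # derendered object: x, v
plausible = simulate(inferred)                   # observation obeys physics
implausible = plausible + np.array([5.0, 0.0])   # object "teleports"

assert surprise(implausible, inferred) > surprise(plausible, inferred)
```

The key property, which the toy reproduces, is that an observation far from the cloud of simulated futures yields high surprise, while one consistent with the rollouts does not; the full model additionally tracks and reweights these beliefs with a particle filter across frames.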
I think this is a high-quality paper. The model proposed is fairly straightforward, and it is unclear whether any particular design decision represents an engineering contribution refined by tests on these datasets, but it is cognitively motivated at many steps. I really appreciate the train-test split design meant to better mimic human subject experiments -- pushing past standard ways of validating in machine learning is a very useful thing to do. The baselines are useful comparisons, and I have no sense that they are weak. I do wish I had a better sense of whether the surprise metrics on the baselines are reasonable -- looking at the supplementary material, it was unclear to me whether there might be some fairer surprise comparison for any of these. The human subject comparison is careful and a very welcome contribution. From an originality standpoint, the model builds on a variety of object-centric forward models, but it is clearly different from these in a way that lends it to the sorts of expectation-violation experiments that it tests. I should emphasize that I consider the train-test split design to be quite novel as well -- some recent works have moved towards a more dev-psych-inspired train/test split, but it is far from the norm and I think a very welcome addition. The paper is very clear, with very good structure that allows readers to delve into details at different levels. I quickly knew where everything was and could refer back quickly when I needed to. Put together, I think this is a contribution of some significance. Object-centric representations have matured over the past several years, but this is the first example to my knowledge that actually follows through with an expectation-violation comparison test (with a great train-test split!) like those that have inspired these sorts of models. It contributes to this virtuous cycle between AI and dev psych in a clear way.