Reviews: PHYRE: A New Benchmark for Physical Reasoning

Reviewer 1

I generally like this paper. The task is compelling and the benchmark is well thought out. Experiments are well done and well reported. I think it should make a good contribution and help drive more good work, much as CLEVR has. At the same time, I have a few concerns (see Improvements below). These could be addressed in a revision and I would be interested to hear from the authors in rebuttal on these points.

Reviewer 2

I like this paper, as it presents a carefully designed benchmark for an emerging area. The initial evaluation on the benchmark also suggests future research directions.

Some issues remain. First, the dataset has minimal visual complexity. It'd be great to include objects of various shapes and textures, and scenes with different lighting conditions. Note that without realistic rendering and texture, which indicate physical object properties, humans (and therefore models) may not be able to solve these tasks as well.

Evaluations can be improved in two ways.
- For all baselines, especially those that don't work well with raw visual input (contextual bandits and policy learners), it'd make sense to use states (object and obstacle position, velocity) as input. This additional set of experiments will help to decouple visual perception from physical reasoning and highlight the merits of the dataset.
- As mentioned in the discussion, it's important to include some model-based RL/planning baselines. There are many papers on differentiable physics engines (e.g., interaction networks, neural physics engines). I wonder how they perform on these tasks.

I failed to understand Fig 4. I'm wondering if the authors can explain what it's about and why we care about it.

The authors indicated in the reproducibility checklist that links to source code and the dataset have been included, but I cannot find them.

Reviewer 3

Here are some concerns I have for this paper:
- The dataset is designed to focus on "physical reasoning", which is a very broad concept. It would be better to know which aspects of physical reasoning are tested in this dataset ([40] is a good example) and how the experiment templates are designed.
- Following the previous point, since there is no description of the physics tested for different templates, it is difficult to draw conclusions from the results of cross-template experiments. Cross-template experiments are meaningful if they are testing similar physical reasoning processes (e.g., there could be different templates testing gravity understanding).
- The selection of evaluated methods should be justified. I understand that the authors' intention is to evaluate methods that are not hand-coded with physics rules, but it seems that more appropriate baseline methods should be able to make some predictions. For example, a baseline could be a simple CNN future state prediction module combined with a success prediction module.
- More of a suggestion than a criticism: it might be interesting to see human performance on this benchmark.

===========================================

I have read the authors' feedback. Although I still feel that the tasks could be more clearly defined, I am fine if the paper is accepted.

Paper ID:	2793
Title:	PHYRE: A New Benchmark for Physical Reasoning
