Summary and Contributions: The paper proposes a model for learning scene representations that produces object-centric representations without direct supervision from object labels or segmentation maps, relying instead on supervision from reconstructing RGB, depth, and normals. The model first extracts visual features with a ConvRNN in order to obtain long-range connections. A graph representation is then constructed hierarchically: each convolutional position starts as a node, and the graph is pooled based on its edge structure. Edges are established by thresholding affinity functions computed from the node features. The nodes at the current layer are clustered with a label propagation algorithm, and the resulting clusters form the nodes of the next layer. Features are pooled by taking the first two moments over multiple subsets of each node's support and applying a graph convolution. Two decoders without learnable parameters render the output, so the node attributes carry the same meaning as the output.
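To make the pooling step described above concrete, here is a minimal sketch of my understanding of it (illustrative only; the function names and details are my own, not the authors' code, and I use a simple synchronous label-propagation variant over a thresholded adjacency matrix):

```python
import numpy as np

def label_propagation(affinity, n_iters=10):
    """Cluster nodes by propagating integer labels over a binary
    affinity (adjacency) matrix; a toy stand-in for the paper's
    graph-pooling step."""
    n = affinity.shape[0]
    labels = np.arange(n)  # each node starts in its own cluster
    for _ in range(n_iters):
        for i in range(n):
            neighbors = np.nonzero(affinity[i])[0]
            if len(neighbors) == 0:
                continue
            # adopt the most common label among neighbors
            vals, counts = np.unique(labels[neighbors], return_counts=True)
            labels[i] = vals[np.argmax(counts)]
    return labels

def pool_moments(features, labels):
    """Summarize each cluster by the first two moments (mean and
    variance) of its member-node features."""
    pooled = []
    for c in np.unique(labels):
        members = features[labels == c]
        pooled.append(np.concatenate([members.mean(0), members.var(0)]))
    return np.stack(pooled)
```

In the paper the moments are additionally taken over multiple subsets of each node's support and followed by a graph convolution; the sketch above only shows the basic cluster-then-summarize idea.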
Strengths: An overall good method with sound design choices. Using a graph representation to obtain object-centric and hierarchical representations is a good direction. The method achieves good results on the proposed datasets, surpassing powerful baselines. The efficiency of the model is a big plus: it needs up to two orders of magnitude fewer training epochs. The principled design of the affinity functions is appreciated, as is the ablation study.
Weaknesses: The main weak point of the paper is the presentation. It is very hard to follow: it is very dense yet misses key details in the main text, relying too much on the appendix. One possible reason is over-formalisation; for example, the paragraph at lines 176-184 could be shortened. Details should be left to the supplemental, but the main text should present the main ideas more clearly.

Some parts need more detail, particularly the training procedure. When are the affinity functions optimised, as opposed to the main model learned by the QSR and QTR losses? Are they jointly trained? Does the gradient of the affinity functions propagate back through the node features into the ConvRNN?

It would also have been useful to experiment on datasets already used in the literature, such as CLEVR6 or Multi-dSprites. Even though they may be easier, a discussion of their difficulty compared to the proposed datasets would be useful.

====================================================== Post Rebuttal: ======================================================

I thank the authors for their rebuttal. My main issue with the paper was the clarity of the presentation, and the authors seem keen on improving this aspect. I also appreciate the experiments and analysis on the other datasets used in the literature, the zero-shot transfer being particularly interesting. The changes and additions from the rebuttal would improve the paper, and thus I will increase my score.
Correctness: The claims of the model are sound and the validation is correct.
Clarity: The paper needs rewriting for clarity. Its current form is detrimental to presenting its contributions; key parts should be explained in a succinct and clear manner.
Relation to Prior Work: More discussion relating this work to current work on graph representations with GNNs should be added.
Additional Feedback: Because the node attributes have the same meaning as the output, the model is more interpretable and easy to manipulate. A discussion of the representational power of the output would be welcome. How does it compare to the representations obtained when a decoder has learnable parameters? Could such a representation be transferred or used in a reasoning task? Does the model obtain a representation that can be transferred and could generalise to other tasks? Although easily fixable, there are several places where standard methods or concepts are not cited, for example the VAE or graph convolution.
Summary and Contributions: The paper presents physical scene graphs, a self-supervised visual intuitive-physics method that automatically reasons about the objects in a scene via incremental segmentation and motion cues. Contributions include a learnable graph pooling operator, graph vectorization for summarization, graph rendering, an interesting approach to object tracking (via tracked edges) and its use for self-supervision, evidence of the need for non-binary edge affinities, and a camera projection loss.
Strengths:
- A massive amount of work and contributions presented in a minimal amount of space. Super-useful contributions that will inspire future work in different visual domains involving GNNs.
- Excellent evaluation rigor.
- Tests on real data.
- Good analysis of errors and mistakes in output (L277-L279).
- Good description of the actual system in the supplementary.
Weaknesses:
- It would be useful to learn how many graph layers are needed to accomplish this general task, e.g. the minimum required value of L without significant loss of performance. Maybe I missed it, but this seems a valuable piece of information in this dawn of GNNs.
- The need for the 6D input in L141 seems not to have been ablated?
- Real data only indoors?
- (Would be far easier to read if it weren't limited to 8 pages; 60% of the paper seems to be in the supplementary... suboptimal choice of venue?)
Correctness: Yes; plenty of ablations, a mix of synthetic and real datasets, and comparisons to previous methods make the validation strong.
Clarity: Yes, albeit very, very compact.
Relation to Prior Work: Yes, excellently.
Additional Feedback: L37: Whilst I might agree with the overall statement about the limitations of works similar to  and , I'd argue that IODINE works with a changing number of objects, even if that number is fixed per scene. There are other methods that do learn from videos in an unsupervised way, e.g. OP3, mentioned later in the paper (L261), and also "Taking Visual Motion Prediction To New Heightfields, Sébastien Ehrhardt, Aron Monszpart, Niloy Mitra, Andrea Vedaldi, ACCV 2018". I think L37 can be reformulated to better reflect the state of science today, whilst still supporting the need for the paper at hand.

L258: Self-supervision is mentioned in L258 together with depth and normal maps being available in the datasets, whilst L266 says only RGB was used as input. It would be useful to clarify what role depth and normals play in the sentence in L258 in relation to self-supervision.
Summary and Contributions: I have read the rebuttal, which resolved some further comments, and I maintain my strong support for this paper. This work introduces the idea of physical scene graphs (PSGs) to represent arrangements of objects in realistic scenes as well as their hierarchical decomposition into parts. The authors define PSGs and propose a network architecture and a number of graph operations that extract PSGs from real-world images. The proposed model outperforms the current SOTA for scene segmentation and allows for semantic manipulation of the learned representations.
Strengths: The paper is very clear and well written. The authors propose a novel approach towards scene understanding that is computationally and algorithmically inspired by human cognition. The theoretical exposition as well as the empirical evaluation are extremely thorough. This is an important contribution to the NeurIPS community because it proposes a complex new framework for physical scene understanding along with an empirical demonstration of how this kind of model can be trained and how it improves performance on standard benchmarks.
Weaknesses: The paper is extremely dense and requires considerable background in multiple fields. The exposition of the background and methods takes up more than half of the paper, introducing a large number of design choices, algorithms and inductive biases. It is difficult to understand the significance of each of these specifications, their motivations, and how they are implemented in code (e.g. I had to read the whole supplementary material to find out that it is implemented in TensorFlow and trained end-to-end). It might be more inclusive for the general ML reader to make the first half of the paper more high-level, introducing the concepts with rough hints at how they can be engineered/implemented, accompanied by a long and well-structured supplementary text that gives a tutorial on the novel concepts proposed.
Correctness: In disentanglement research, randomness across model seeds has considerable effects. Here, in Supp. 320-321, the authors state that performance did not vary much across seeds. Could you expand this claim or show some results to support it?

Why is the ConvRNN unrolled 3 times? That means there is an equivalent feedforward model with 15 layers. I understand that feedback is biologically inspired, but is it really necessary here, since the same computation could be performed by a deeper feedforward architecture?

Did the authors do any benchmarking of how long training and inference take in the proposed model? Can they comment on efficient GPU implementations that allow for varying numbers of nodes in each PSG?
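To spell out the unrolling argument, here is a toy sketch (my own illustration, not the authors' code; the block is a linear map + ReLU standing in for a conv stage) contrasting a recurrent block applied T times with shared weights against a feedforward stack of T distinct blocks: the two have the same computational depth, but the recurrent version has T times fewer parameters, which is one possible answer to the question above.

```python
import numpy as np

rng = np.random.default_rng(0)

def block(x, W):
    """One processing stage: linear map + ReLU (a stand-in for a conv block)."""
    return np.maximum(0.0, x @ W)

def recurrent_forward(x, W, T):
    """Apply the SAME weights T times (weight sharing across unroll steps)."""
    h = x
    for _ in range(T):
        h = block(h, W)
    return h

def feedforward_forward(x, Ws):
    """Apply a DISTINCT weight matrix at every depth."""
    h = x
    for W in Ws:
        h = block(h, W)
    return h

x = rng.normal(size=(1, 8))
W_shared = rng.normal(size=(8, 8))
Ws = [rng.normal(size=(8, 8)) for _ in range(3)]

h_rec = recurrent_forward(x, W_shared, T=3)  # depth 3, one weight matrix
h_ff = feedforward_forward(x, Ws)            # depth 3, three weight matrices
```

The feedforward stack is strictly more expressive (it need not share weights), so the question of whether the recurrence buys anything beyond parameter efficiency stands.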
Clarity: The paper is extremely well written. While I suggest that the methodological exposition can be made more concise and readable, I hope that the style and clarity of presented thoughts will remain at this exceptionally high level.
Relation to Prior Work: I would like to see how this work relates to Kulkarni et al.'s 2015 inverse-graphics models, as well as the various capsule networks proposed by Hinton et al. Moreover, it would be nice to see how this work relates to recent advances in the learning of disentangled representations (e.g. Locatello et al. 2018), especially given the manipulation experiments in figure 4.
Additional Feedback:
Main:
- Figure 1 is much too small. I appreciate the attempt at a simple cartoon-style illustration of the whole framework, but this is too dense and colourful to be informative.
- 116: I am not familiar with, and did not find an explanation of, the big circle notation; can you explain?

Supplementary:
- 31: shown +to
- 163: where is Delta(v) defined?
- 340: the -> they
Summary and Contributions: This paper proposes Physical Scene Graphs to represent scenes hierarchically, and PSGNet, which learns to estimate PSGs from visual inputs in a self-supervised manner. Experiments show the validity of the proposed methods and their generalization to real-world images. Detailed ablation studies illustrate the importance of each component of the PSGNet architecture.
Strengths: The hierarchical nodes of a PSG can be grounded in pixels, from object subparts to object groupings, which is a natural representation for scenes. The proposed self-supervised learning of such a hierarchical representation and its generalization to real-world images are novel, and the experiments are sufficient.
Weaknesses: Some concerns:
1. Previous works on scene graphs model the relations between objects, such as support relations, occlusion, or spatial order. Although hierarchical, a PSG only has "edges that represent within-object bonds that hold object parts together", which limits the expressiveness of the graph structure.
2. The affinity functions inspired by the 4 perceptual grouping principles encourage grouping by similarity along many dimensions. However, they may fail to capture the details of texture in images, especially real-world images, as depicted in Figure 2.
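To illustrate concern 2, here is a toy sketch (my own, with invented names; the paper's actual affinity functions are learned) of a threshold-based similarity affinity of the kind the paper builds edges from:

```python
import numpy as np

def feature_affinity(feats, threshold):
    """Toy similarity-based grouping affinity: connect every pair of
    nodes whose feature vectors lie within `threshold` of each other,
    then zero the diagonal (no self-edges)."""
    d = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=-1)
    A = (d < threshold).astype(int)
    np.fill_diagonal(A, 0)
    return A
```

The concern is visible here: any hard threshold on feature distance will merge two regions whose pooled features are close, even when their fine-grained texture differs, so texture detail below the feature resolution cannot keep them apart.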
Correctness: The node attributes represent features such as object position, surface normals, shape, and texture properties, which makes the "physical" part of the scene graph weak. The render-and-compare methodology is typical in self-supervised/unsupervised scene representation learning.
Clarity: The PSGNet architecture section has too many distinct parts and techniques to cover, especially the graph construction part; it would be better to further highlight the key points. The experiments are clear and the ablation study is well discussed.
Relation to Prior Work: Yes.
Additional Feedback: -----Post Rebuttal------ The authors addressed my concerns well, and the rebuttal makes the work clearer. I hope to see the corresponding revised version.