Summary and Contributions: The authors take on the task of causal discovery from video data. Their goal is to "discovery the structural dependencies among environmental and object variables" and infer "the type and strength of interactions that have a causal effect on behavior of the dynamical system". Specifically, they claim contributions toward "one-shot discovery of unseen causal mechanisms in new environments from partially observed visual data in a continuous state space."
Strengths: The paper addresses ambitious goals. The evaluation attempts to examine important research questions that are clearly outlined at the beginning of Section 3. Among these questions is one about generalization ("How well can the model extrapolate to graphs of different sizes that are not seen during training?"), which is an important consideration.
Weaknesses: The ambitious goals of the introduction are not equaled by the empirical results presented later in the paper (see below). It is unclear why the authors have focused on structural causal models (SCMs), rather than some representation that is more suited to the types of models they are attempting to infer from video. Specifically, there are a variety of simulation models that are appropriate for discrete physical systems and for deformable surfaces such as cloth. These would seem a more reasonable target for learning, although SCMs provide a set of existing theory and algorithms. The authors should make clear why they focus on SCMs.
Correctness: The correctness of the authors' claims is very difficult to evaluate. They provide two highly specific experimental contexts (a physical system with multiple interacting rigid objects and a deformable fabric garment). For several reasons, these results are very difficult to interpret. First, the results do not compare the proposed system against logical alternatives, such as impaired (ablated) versions of the authors' system, baselines, or alternative approaches. Second, the selected environments are few (two) and it is unclear how they were selected. Thus, readers don't have a good sense of whether these problems are particularly challenging or particularly easy. Third, the results focus on relatively esoteric measures that are relevant to the authors' methods (e.g., keypoint detection and discovery of the causal summary graph) rather than measures relevant to the task itself (e.g., mean-squared error over the entire surface of the cloth garments).
Clarity: The authors frequently use expansive language to describe their contributions, when that language is not supported by the specific technical work reported in the paper. For example, in the Conclusion (lines 293-4), the authors state that their method "understands" causal relationships between constituting components. Better terms would be "learns", "identifies", or "constructs". Avoid using citations as nouns. For example, on line 119, the authors state that "we leverage the technique developed in ." It is unclear what the citation refers to until readers consult the bibliography. Instead, say: "we leverage the technique developed by Kulkarni et al. (2019)."
Relation to Prior Work: The paper discusses a set of relevant work.
Additional Feedback: In their statement of broader impacts, the authors should address the potential application of their work to large-scale video surveillance, an increasing threat to individual political and social freedom throughout the world.
Summary and Contributions: The authors describe a model for video prediction via keypoint detection and causal graph discovery, and evaluate it on image sequences from 1) a balls-in-motion domain and 2) a fabric domain. Both datasets are derived from a generative process, rather than from real images. The authors show that their method is effective, exhibits some out-of-distribution generalisation, and can carry out a form of counterfactual inference.
Strengths: The work seems to me a good development of the keypoint detection work of Kulkarni et al. (2019). Unsupervised discovery of causal structure and confounding latent variables is a difficult problem, and the paper constitutes clear progress. The evaluation is convincing, albeit in a somewhat artificial setting.
Weaknesses: It would have been nice to see comparisons with more baselines. I appreciate that this is not necessarily possible with the causal graph discovery, since there is (I believe) little comparable work. But comparison would have been possible for prediction on the pixel level.
Correctness: The construction of the model looks sound to me, and the evaluation seems to have been conducted well.
Clarity: The paper is well written and clear.
Relation to Prior Work: Yes, although I'm not sufficiently familiar with the literature to be sure nothing is missing.
Additional Feedback: One thing I didn't understand that could be made clearer is what actions are carried out during the generative process. In the fabric domain, from the videos supplied, it appears that parts of the fabric are pulled at random. I can't see how it's possible for the model to make predictions many time steps into the future if this is what's happening. So maybe I'm misunderstanding something. Nothing in the supplementary material makes this clearer, so maybe the authors could explain.

TYPOS:
Line 91: “one-short” should presumably be “one-shot”
Line 133: “generates” -> “generate”
Line 157: “constitutional” -> “constituent”
Line 198: “one-short” -> “one-shot”
Line 220: “Discovery the” -> “Discovery of the”
Line 233: “subsequence” -> “subsequent”?

POST-REBUTTAL COMMENTS: I have read the rebuttal and the other reviews, and it still seems like a good paper to me, so I haven't changed my score.
Summary and Contributions: This paper presents a novel framework for automatically discovering physical dynamics from videos. The proposed framework first performs unsupervised keypoint detection, then trains GNNs to predict the causal relations between those keypoints and learns a model for predicting their motion. The experiments on simulated datasets show that the proposed approach is capable of extracting the correct keypoints and predicting the dynamics.
Strengths: - This work introduces relational models (GNNs) to the visual dynamic prediction task; - The authors did an extensive analysis of the experimental results, which showed the strength of the presented framework; - The paper is well written, and the details of the implementation are very clear.
Weaknesses: - The presented framework is a straightforward combination of several existing state-of-the-art systems; - The keypoint extraction and relational model learning are separated. As a result, the accuracy of the system depends heavily on the quality of keypoint detection, and there seems to be no way for the dynamics-fitting module to feed back to the keypoint detection module when something goes wrong; - In the experiments, the proposed system is not compared with any existing baseline approaches.
Correctness: The proposed approach is solid and works on the tested tasks; the result appears to be correct.
Clarity: This paper is well written and easy to understand. The content in the supplementary is also very helpful.
Relation to Prior Work: The authors have covered a broad range of related work. However, the experiment section lacks comparisons between the presented approach and that related work.
Additional Feedback: ---- After rebuttal ---- My major concerns about this work are the lack of comparisons and the potential instability of the non-end-to-end training algorithm. In their rebuttal, the authors have included one more baseline approach and answered my second question, and I agree with them that it is difficult to train an end-to-end hybrid model in such a domain.
Summary and Contributions: The authors propose a method to discover causal relationships in a simple physical system from video. The method works in a number of steps: discovering keypoints from video, inferring a causal summary graph from a short movement sequence of these keypoints, then predicting the future movement of these keypoints. In addition to this future prediction, the model supports counterfactual predictions. The authors claim their main contribution is one-shot discovery of unseen causal mechanisms. As I understand it, this means that they are able to detect the links/springs between masses (or between a reduced-order representation of a cloth) from a few frames of video.
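To make concrete how I read the three-stage pipeline (keypoint detection, summary-graph inference, motion prediction), here is a rough sketch. All function names, the correlation-based edge heuristic, and the constant-velocity predictor are my own illustrative stand-ins, not the authors' method:

```python
import numpy as np

def detect_keypoints(frame, k=4):
    # Toy stand-in for unsupervised keypoint detection: take the k
    # brightest pixels of a grayscale frame as (row, col) "keypoints".
    idx = np.argsort(frame.ravel())[-k:]
    return np.stack(np.unravel_index(idx, frame.shape), axis=1).astype(float)

def infer_summary_graph(trajectories, threshold=0.9):
    # Toy stand-in for causal summary-graph inference: draw an edge
    # between two keypoints when their x-velocities are strongly
    # correlated over the short observed sequence.
    # trajectories: array of shape (T, k, 2).
    T, k, _ = trajectories.shape
    v = trajectories[1:] - trajectories[:-1]      # velocities, (T-1, k, 2)
    flat = v.reshape(T - 1, k * 2)
    graph = np.zeros((k, k), dtype=bool)
    for i in range(k):
        for j in range(k):
            if i == j:
                continue
            c = np.corrcoef(flat[:, 2 * i], flat[:, 2 * j])[0, 1]
            graph[i, j] = abs(c) > threshold
    return graph

def predict_next(trajectories):
    # Toy dynamics model: constant-velocity extrapolation per keypoint.
    return 2 * trajectories[-1] - trajectories[-2]
```

The actual system replaces each stage with a learned module (a keypoint detector in the style of Kulkarni et al. (2019), a GNN for the graph, and a learned dynamics model), but the data flow is the same: frames to keypoints, keypoints to graph, graph plus keypoints to future positions.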
Strengths: The paper is easy to read and the approach to assembling different learning mechanisms (keypoint detection, graph NN) seems sound.
Weaknesses: The approach proposed in this paper seems very specific to the problem they define, namely of tracking masses that are possibly connected by rigid links or springs. As such, it provides few insights that seem transferable to other problems, even within the limited field of video understanding or prediction from video. Moreover, the simulations they use are relatively simplistic and it is unclear how well the method would work for more realistic videos of the same phenomenon.
Correctness: The proposed empirical evaluation seems hard to reproduce or extend in future work, as the authors are not making video-frame predictions; rather, they are predicting the future positions of the keypoints they extracted, which seem to correspond closely enough to the ground-truth keypoints available in their simulation. This fact is indeed surprising, in particular for the case of cloth simulation, given that the keypoints are extracted using the unsupervised technique described in 2.1.
Clarity: The paper is reasonably well written and easy to follow.
Relation to Prior Work: The prior work seems to be well covered and extensive, although I do not have deep enough knowledge of this field to validate that. However, from Section 4 it was not easy for me to situate this work with respect to previous work.
Additional Feedback: UPDATE: After reviewing the authors' feedback and the discussion with other reviewers, my main concern still stands: the paper is very specific to the problem introduced here, with few generalizable lessons. My score remains unchanged.