This paper received four reviews, with two reviewers recommending acceptance and two recommending rejection. The AC ensured that the reviewer pool was consistent with the mix of topics covered by the paper, namely machine learning, computer vision, and causal inference. Given the mixed reviews and the continuing disagreement between reviewers after the end of the discussion phase, the paper received particular attention and was discussed between the AC and a Senior AC (SAC).
Several strengths were identified: (i) unsupervised estimation of a structural causal model (SCM) from visual data; (ii) a novel unsupervised estimation of causal graphs (a structural causal model) based on prior work on unsupervised keypoint estimation combined with graph networks; (iii) evaluation on two different types of dynamical data, namely classical mass-spring systems and soft fabrics. Ground-truth (GT) SCMs are available for the former, but not the latter.
The reviewers and ACs also identified weaknesses, which were mostly of an experimental nature: (i) the main paper lacks crucial information on the processes generating the data, and while some more information is given in the supplementary material, this description is still not sufficient, in particular for the fabrics dataset; (ii) the data is of limited complexity, and the underlying hyper-parameters are limited to the number of balls in the mass-spring system; (iii) two reviewers criticized a lack of generalization (this point was addressed in the rebuttal, but the response did not convince some of the reviewers). The ACs concur that generalization is not tested in the fabrics scenario (e.g. beyond fabric types), but do not agree with the very general points on generalization raised by some reviewers.
After discussion, the AC and SAC judged that the experimental validation was sufficient and that its limitations were outweighed by the novelty of the method.
While the AC and SAC acknowledge the validity of the points raised by the two more critical reviewers, they also judge that the paper is partially motivated by the estimation of an SCM from visual data (as also pointed out in the rebuttal), and not just by the resolution of a particular problem in ML. They therefore recommend acceptance.