Reviews: On the Transfer of Inductive Bias from Simulation to the Real World: a New Disentanglement Dataset

ORIGINALITY: Although the paper introduces no new methodology, the dataset is, to my knowledge, the first of its kind, and the design, construction, and use of the robotic camera rig to generate it are novel and unusual in the field. This kind of work is innovative and shows an attention to experimental design that is sometimes lacking in machine learning.

QUALITY: All parts of the paper (dataset design, camera design and construction, dataset construction, and empirical studies) are well executed. I have no complaints at this time.

CLARITY: The paper is well written and fairly easy to understand across all parts. It covers a lot of ground and, as such, must relegate many details to the appendix: robot schematics, model hyperparameters, and detailed results, in particular. They are nonetheless included and described in detail, enabling reproducibility.

SIGNIFICANCE: As disentanglement is not my area of expertise, it is hard to say for sure how the community will respond. In my experience, however, the appetite among researchers for clean, detailed, easy-to-use datasets, especially with ground-truth labels, is enormous, and I expect this dataset will be widely used and will become a de facto standard. I wouldn't be surprised if it picks up a hundred citations within a few years, if not faster.

Here is a laundry list of questions for the authors (I don't have a ton):

* It seems unfortunate that the "3Dprinted-realistic" dataset has a lower resolution (256 x 256) than the real-world data (512 x 512). Is this a limitation of the creation process using Autodesk? (The text indicates that at full resolution, the differences between real and simulated images are more obvious.)
* When transferring from lower-resolution (e.g., synthetic) images, is it necessary to downsample the higher-resolution (e.g., real) images? If so, how might this impact performance?
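
To make the downsampling question concrete, here is a minimal sketch of one way a 512 x 512 real image could be reduced to 256 x 256 to match the synthetic data. The function name `downsample_by_2` and the choice of 2 x 2 average pooling are assumptions made for illustration; the paper may use a different resampling method.

```python
import numpy as np

def downsample_by_2(image):
    """Average each non-overlapping 2x2 pixel block.

    image: array of shape (H, W, C) with even H and W.
    Returns an array of shape (H // 2, W // 2, C).
    Assumption: simple box (average) filtering, not the authors' pipeline.
    """
    h, w, c = image.shape
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    # Split each spatial axis into (blocks, 2) and average over the 2s.
    return image.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

# Hypothetical stand-in for a 512 x 512 RGB real-world image.
real_image = np.arange(512 * 512 * 3, dtype=np.float64).reshape(512, 512, 3)
small_image = downsample_by_2(real_image)
```

Average pooling discards high-frequency detail, which is one route by which downsampling could affect transfer performance.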

This paper addresses a really important issue in the disentanglement literature: the gap between toy datasets and the real world. As someone who has also worked in this area, I think this is indeed something that deserves a lot more attention. I do think that using a 3D dataset that was actually recorded, rather than rendered with software, to test disentanglement is a very valuable contribution. Aside from the contributions, this is a very well-written paper. The authors do a great job of explaining the motivations and the protocol for designing the dataset and experiments.

While I greatly appreciate the effort that the authors put into designing this dataset, as well as into highlighting the importance of the issues in the current disentanglement literature, I have some concerns about the contribution of the paper, which I would appreciate the authors commenting on.

1) As the authors point out, most of the research on disentanglement has focused on developing new algorithms to achieve it. I think at some point, instead of asking the question “how can we do disentanglement?”, we need to start asking the question “is disentanglement helpful (for downstream tasks)?”. It seems that everyone takes for granted that a disentangled representation will achieve better generalization performance on any task, and so almost all papers bother only to evaluate the disentanglement itself. A good example of a paper that touches on this point is [1]. From this perspective, what I would have liked to see from a new real-world disentanglement dataset is a series of real-world tasks associated with it, so that instead of evaluating disentanglement based on the one-to-one correspondence between latent codes and generative factors, we evaluate based on the performance on the real-world tasks (which disentanglement supposedly aids).
While the proposed dataset has some interesting new factors compared to previous datasets, such as background color and camera angle, the main difference from previous datasets still seems to be the difficulty of the “image” itself. In the real world, I would like to believe that we are not solely interested in “achieving” disentanglement, but would like to use it to perform well on tasks associated with some real-world data. I understand this may be considered somewhat out of scope for this paper, but I would still appreciate it if the authors could comment on it.

2) On page 4, the authors state that [2] shows that training on low-resolution images results in instability, so that the random seed and hyperparameters matter more than the model. Could the authors clarify this? I’m not sure that the reason necessarily has to do with the images being low resolution; it may simply be the nature of these datasets (or of the methods themselves).

3) What architecture is used to train on the new dataset?

4) Could the authors comment on the difficulty of disentanglement (or reconstruction) for each individual feature? Which features are harder to disentangle? I would also be interested to see which features cause the largest difference in image space. In other words, if I take some data from this dataset and keep all the features the same, but change one feature by 1 unit, what would be the L2 difference between the corresponding images?

Refs:
[1] Sjoerd van Steenkiste, Francesco Locatello, Jürgen Schmidhuber, and Olivier Bachem. Are disentangled representations helpful for abstract visual reasoning? arXiv preprint arXiv:1905.12506, 2019.
[2] Francesco Locatello, Stefan Bauer, Mario Lucic, Sylvain Gelly, Bernhard Schölkopf, and Olivier Bachem. Challenging common assumptions in the unsupervised learning of disentangled representations. arXiv preprint arXiv:1811.12359, 2018.
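
The per-feature L2 comparison requested in point 4 could be computed along the following lines. This is a minimal sketch, assuming the dataset is available as an image array with a matching array of integer factor indices; the function name `mean_l2_per_factor` and the toy data are hypothetical and are not taken from the paper.

```python
import numpy as np

def mean_l2_per_factor(images, factors):
    """Mean L2 distance between image pairs that differ by one unit
    in a single factor while all other factors are held fixed.

    images:  (N, H, W, C) array.
    factors: (N, K) integer array; row i holds the factor indices of image i.
    Returns a dict mapping factor index k to the mean L2 distance.
    """
    lookup = {tuple(row): i for i, row in enumerate(factors.tolist())}
    flat = images.reshape(len(images), -1).astype(np.float64)
    result = {}
    for k in range(factors.shape[1]):
        dists = []
        for combo, i in lookup.items():
            neighbor = list(combo)
            neighbor[k] += 1  # step factor k by one unit
            j = lookup.get(tuple(neighbor))
            if j is not None:
                dists.append(np.linalg.norm(flat[i] - flat[j]))
        result[k] = float(np.mean(dists)) if dists else float("nan")
    return result

# Toy data (assumption): 4x4 single-channel images whose brightness is
# 1.0 * factor0 + 0.1 * factor1, so factor 0 changes the image far more.
grid = [(a, b) for a in range(3) for b in range(4)]
toy_factors = np.array(grid)
toy_images = np.stack([np.full((4, 4, 1), a * 1.0 + b * 0.1) for a, b in grid])
toy_result = mean_l2_per_factor(toy_images, toy_factors)
```

A table of these per-factor averages would show directly which factors dominate pixel space and could be set against the per-factor disentanglement scores.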

Paper ID:	9203
Title:	On the Transfer of Inductive Bias from Simulation to the Real World: a New Disentanglement Dataset
