{"title": "On the Transfer of Inductive Bias from Simulation to the Real World: a New Disentanglement Dataset", "book": "Advances in Neural Information Processing Systems", "page_first": 15740, "page_last": 15751, "abstract": "Learning meaningful and compact representations with disentangled semantic aspects is considered to be of key importance in representation learning. Since real-world data is notoriously costly to collect, many recent state-of-the-art disentanglement models have heavily relied on synthetic toy data-sets. In this paper, we propose a novel data-set which consists of over 1 million images of physical 3D objects with seven factors of variation, such as object color, shape, size and position. In order to be able to control all the factors of variation precisely, we built an experimental platform where the objects are being moved by a robotic arm. In addition, we provide two more datasets which consist of simulations of the experimental setup. These datasets provide for the first time the possibility to systematically investigate how well different disentanglement methods perform on real data in comparison to simulation, and how simulated data can be leveraged to build better representations of the real world. 
We provide a first experimental study of these questions and our results indicate that learned models transfer poorly, but that model and hyperparameter selection is an effective means of transferring information to the real world.", "full_text": "On the Transfer of Inductive Bias\nfrom Simulation to the Real World:\na New Disentanglement Dataset\n\nMuhammad Waleed Gondal1*†, Manuel Wüthrich1*, Đorđe Miladinović2, Francesco Locatello1,2, Martin Breidt3, Valentin Volchkov1, Joel Akpo1, Olivier Bachem4, Bernhard Schölkopf1, Stefan Bauer1†\n\n1Max Planck Institute for Intelligent Systems\n2Department of Computer Science, ETH Zurich\n3Max Planck Institute for Biological Cybernetics\n4Google Research, Brain Team\n\nAbstract\n\nLearning meaningful and compact representations with disentangled semantic aspects is considered to be of key importance in representation learning. Since real-world data is notoriously costly to collect, many recent state-of-the-art disentanglement models have heavily relied on synthetic toy datasets. In this paper, we propose a novel dataset which consists of over one million images of physical 3D objects with seven factors of variation, such as object color, shape, size and position. In order to be able to control all the factors of variation precisely, we built an experimental platform where the objects are being moved by a robotic arm. In addition, we provide two more datasets which consist of simulations of the experimental setup. These datasets provide for the first time the possibility to systematically investigate how well different disentanglement methods perform on real data in comparison to simulation, and how simulated data can be leveraged to build better representations of the real world.
We provide a first experimental study of these questions and our results indicate that learned models transfer poorly, but that model and hyperparameter selection is an effective means of transferring information to the real world.\n\n*These authors contributed equally.\n†Correspondence to: waleed.gondal@tue.mpg.de, stefan.bauer@inf.ethz.ch\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n1 Introduction\n\nIn representation learning it is commonly assumed that a high-dimensional observation X is generated from low-dimensional factors of variation G. The goal is usually to revert this process by searching for a latent embedding Z which replicates the underlying generative factors G, e.g. shape, size or color. Learning well-disentangled representations of complex sensory data has been identified as one of the key challenges in the quest for artificial intelligence (AI) [2, 45, 31, 3, 48, 29, 54], since they should contain all the information present in the observations in a compact and interpretable structure [2, 26, 8] while being independent from the task at hand [15, 33].\nDisentangled representations may be useful for (semi-)supervised learning of downstream tasks, transfer and few-shot learning [2, 49, 39]. Further, such representations allow us to filter out nuisance factors [27], to perform interventions and to answer counterfactual questions [44, 50, 45]. First applications of algorithms for learning disentangled representations have been to visual concept learning, sequence modeling, curiosity-based exploration and domain adaptation in reinforcement learning [51, 30, 42, 20, 22, 34, 54]. The research community is in general agreement on the importance of this paradigm and much progress has been made in the past years, particularly on the algorithmic level [e.g. 18, 24], fundamental understanding [e.g. 17, 52] and experimental evaluation [38].
However, research has thus far focused on synthetic toy datasets. The main motivation for using synthetic datasets is that they are cheap, easy to generate, and the independent generative factors can be controlled precisely. However, real-world recordings exhibit imperfections such as chromatic aberrations in cameras and complex surface properties of objects (e.g. reflectance, radiance and irradiance), making transfer learning from synthetic to real data a nontrivial task. Despite the growing importance of the field and the potential societal impact in the medical domain or fair decision making [e.g. 6, 10, 37], the performance of state-of-the-art disentanglement learning on real-world data is unknown.\n\nFigure 1: Datasets: In the top row, samples from previously published datasets are shown from left to right: dSprites, Noisy-dSprites, Scream-dSprites, 3dshapes, Cars3D and SmallNORB. In the second row (again left to right) we provide examples of the newly collected dataset: simple simulated data, highly realistic simulated data and real-world data.\n\nTo bridge the gap between simulation and the physical world, we built a recording platform which allows us to investigate the following research questions: (i) How well do unsupervised state-of-the-art algorithms transfer from rendered images to physical recordings? (ii) How much does this transfer depend on the quality of the simulation? (iii) Can we learn representations on low-resolution recordings, at the current state-of-the-art of 64 × 64 images, and transfer them to high-quality images? (iv) How much supervision is necessary to encode the necessary inductive biases? (v) Are the confounding and distortions of real-world recordings beneficial for learning disentangled representations? (vi) Can we disentangle causal mechanisms [44, 28, 29, 45] in the data-generating process? 
(vii) Are disentangled representations useful for solving real-world downstream tasks?\nWhile answering all of the above questions is beyond the scope of this paper, our key contributions can be summarized as follows:\n\n• We introduce the first real-world 3D dataset recorded in a controlled environment, defined by 7 factors of variation: object color, object shape, object size, camera height, background color and two degrees of freedom of motion of a robotic arm. The dataset is made publicly available3.\n• We provide synthetic images produced by computer graphics with two levels of realism. Since the robot arm and the objects are printed from a 3D template, we can ensure a close similarity between the realistic renderings and the real-world recordings.\n• The collected dataset of physical 3D objects consists of over one million images, and each of the two simulated datasets contains the same number of images as well.\n• We investigate the role of inductive bias, the transfer of different hyperparameter settings between the different simulations and the real world, and the requirements on the quality of the simulation for a successful transfer.\n\n3https://github.com/rr-learning/disentanglement_dataset\n\n2 Background and Related Work\n\nWe assume a set of observations of a (potentially high-dimensional) random variable X which is generated by K unobserved causes of variation (generative factors) G = [G1, . . . , GK] (i.e., G → X) that do not cause each other. These latent factors represent elementary ingredients to the causal mechanism generating X [44, 45]. The elementary ingredients Gi, i = 1, . . . , K of the causal process work on their own and are changeable without affecting others, reflecting the independent mechanisms assumption [49]. 
However, for some of the factors a hierarchical structure may exist, for which this may only hold true when the hierarchical structure is seen as a whole, as one component. The graphical model corresponding to this framework, adapted from [52], is depicted in figure 2. The hierarchical structure of the factors GK−1 and GK might represent one compositional process, e.g. connected joints of a robot arm. The most commonly accepted understanding of disentanglement [2] is that each learned feature in Z should capture one factor of variation in G.\n\nFigure 2: Graphical model, where G = (G1, G2, . . . , GK) are the generative factors (color, shape, size, ...) and X the recorded images. The aim of disentangled representation learning is to learn variables Zi that capture the independent mechanisms Gi.\n\nCurrent state-of-the-art disentanglement approaches use the framework of variational auto-encoders (VAEs) [25]. The (high-dimensional) observations x are modelled as being generated from some latent features z with chosen prior p(z) according to the probabilistic model pθ(x|z)p(z). The generative model pθ(x|z) as well as the proxy posterior q(z|x) can be represented by neural networks, which are optimized by maximizing the variational lower bound (ELBO) of log p(x1, . . . , xN):\n\nLVAE = Σ_{i=1}^N E_{q(z|x(i))}[log pθ(x(i)|z)] − DKL(q(z|x(i)) ‖ p(z))\n\nSince the above objective does not enforce any structure on the latent space, except for some similarity to the typically chosen isotropic Gaussian prior p(z), various proposals for structure-imposing regularization have been made; they either use some form of supervision [e.g. 43, 4, 35, 40, 9] or are completely unsupervised [e.g. 19, 24, 7, 27, 13]. [19] proposed the β-VAE, which penalizes the Kullback-Leibler divergence (KL) term in the VAE objective more strongly and thereby encourages similarity to the factorized prior distribution. Others used techniques to encourage statistical independence between the different components in Z, e.g., FactorVAE [24] or β-TCVAE [7], while DIP-VAE proposed to encourage factorization of the inferred prior q(z) = ∫ q(z|x)p(x) dx. For other related work we refer to the detailed descriptions in the recent empirical study [38].\n\n2.1 Established Datasets for the Unsupervised Learning of Disentangled Representations\n\nReal-world data is costly to generate and ground truth is often not available since significant confounding may exist. To bypass this limitation, many recent state-of-the-art disentanglement models [55, 24, 7, 18, 8] have heavily relied on synthetic toy datasets, trying to solve a simplified version of the problem in the hope that the conclusions drawn might likewise be valid for real-world settings. A quantitative summary of the most widely used datasets for learning disentangled representations is provided in table 1.\n\nDataset Descriptions: For quantitative analysis, dSprites4 is the most commonly used dataset. This synthetic dataset [18] contains binary 2D images of hearts, ellipses and squares in low resolution.\n\n4https://github.com/deepmind/dsprites-dataset\n\nDataset | Factors of Variation | Resolution | # of Images | 3D | Real-World\ndSprites | 5 | 64 × 64 | 737,280 | ✗ | ✗\nNoisy-dSprites | 5 | 64 × 64 | 737,280 | ✗ | ✗\nScream-dSprites | 5 | 64 × 64 | 737,280 | ✗ | ✗\nSmallNORB | 5 | 128 × 128 | 48,600 | ✓ | ✗\nCars3D | 3 | 64 × 64 | 17,568 | ✓ | ✗\n3dshapes | 6 | 64 × 64 | 480,000 | ✓ | ✗\nMPI3D-toy | 7 | 64 × 64 | 1,036,800 | ✓ | ✗\nMPI3D-realistic | 7 | 256 × 256 | 1,036,800 | ✓ | ✗\nMPI3D-real | 7 | 512 × 512 | 1,036,800 | ✓ | ✓\n\nTable 1: Summary of the properties of different datasets. 
The newly contributed datasets are emphasized.\n\nIn Color-dSprites the shapes are colored with a random color, Noisy-dSprites considers white-colored shapes on a noisy background, and in Scream-dSprites the background is replaced with a random patch in a random color shade extracted from the famous The Scream painting [41]. The dSprites shape is embedded into the image by inverting the color of its pixels. The SmallNORB5 dataset contains images of 50 toys belonging to 5 generic categories: four-legged animals, human figures, airplanes, trucks, and cars. The objects were imaged by two cameras under 6 lighting conditions, 9 elevations (30 to 70 degrees, every 5 degrees), and 18 azimuths (0 to 340 degrees, every 20 degrees) [32]. For Cars3D6, 199 CAD models from [14] were used to generate 64 × 64 color renderings from 24 rotation angles, each offset by 15 degrees [46]. Recently, 3dshapes was made publicly available7, a dataset of 3D shapes procedurally generated from 6 ground-truth independent latent factors: floor colour, wall colour, object colour, scale, shape and orientation [24].\n\n3 Bridging the Gap Between Simulation and the Real World: A Novel Dataset\n\nWhile other real-world recordings, e.g. CelebA [36], exist, they offer only qualitative evaluations. A more controlled dataset is needed to quantitatively investigate the effects of inductive biases, sample complexity and the interplay of simulations and the real world.\n\n3.1 Controlled Recording Setup\n\nFigure 3: Renderings of the developed robotic platform. Left: a view from a 30° angle from the top (note that one panel in front and the top panels have been removed such that the interior of the platform is visible; during recordings, the platform is entirely closed). Middle: the robotic arm carrying a red cube (the entire cage is hidden). 
Right: frontal view without the black shielding (note the three cameras at different heights).\n\n5https://cs.nyu.edu/~ylclab/data/norb-v1.0-small/\n6http://www.scottreed.info/files/nips2015-analogy-data.tar.gz\n7https://github.com/deepmind/3dshapes-dataset/\n\nIn order to record a controlled dataset of physical 3D objects, we built the mechanical platform illustrated in figure 3. It consists of three cameras mounted at different heights, a robotic manipulator carrying a 3D-printed object (which can be swapped) in the center of the platform, and a rotating table at the bottom. The platform is shielded with black sheets from all sides to avoid any intrusion of external factors (e.g. light), and the whole environment is relatively uniformly illuminated by three light sources installed within the platform.\n\n3.1.1 Factors of Variation\n\nThe generative factors of variation G mentioned in section 2 are listed in the following for our recording setup.\n\nObject Color: All objects have one of six different colors: red (255, 0, 0), green (0, 255, 0), blue (0, 0, 255), white (255, 255, 255), olive (210, 210, 80) and brown (153, 76, 0) (see figure 4).\n\nFigure 4: We show all the object colors while maintaining the other factors constant.\n\nFigure 5: We show all object shapes while maintaining all other factors constant.\n\nObject Shape: There are objects of six different shapes in the dataset: a cylinder, a hexagonal prism, a cube, a sphere, a pyramid with square base and a cone. All objects exhibit rotational symmetries about some axes; however, the kinematics of the robot are such that these axes never align with the degrees of freedom of the robot. 
This is important because it ensures that the robot's degrees of freedom are observable given the images.\n\nObject Size: There are objects of two different sizes in the dataset, categorized as large (roughly 65 mm in diameter) and small (roughly 45 mm in diameter).\n\nFigure 6: We show the two object sizes while maintaining all other factors constant.\n\nCamera Height: The dataset is recorded with three cameras mounted at three different heights (see figure 7 on the right), which represents another factor of variation in the images.\n\nFigure 7: Three images on the left: we vary the background color. Three images on the right: we vary the camera height.\n\nBackground Color: The rotating table (see figure 7) allows us to change the background color. Note that for all images in the dataset we orient the table in such a way that only one background color is visible at a time. The colors are: sea green, salmon and purple.\n\nDegrees of Freedom of the Robotic Arm: Each object is mounted on the tip of the manipulator shown in figure 3. This manipulator has two degrees of freedom: a rotation about a vertical axis at the base and a second rotation about a horizontal axis. We move each joint in a range of 180° in 40 equal steps (see figure 8 and figure 9). Note that these two factors of variation are independent, just like all other factors (i.e. we record all possible combinations between the two).\n\nFigure 8: Motion along the first DOF while maintaining the other factors constant. Note that in total we record 40 steps; here we only show 6 due to space constraints.\n\nFigure 9: Motion along the second DOF while maintaining the other factors constant. Note that in total we record 40 steps; here we only show 6 due to space constraints.\n\n3.2 Simulated Data\n\nIn addition to the real-world dataset, we recorded two simulated datasets of the same setup, hence all factors of variation are identical across the three datasets. 
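Since all three datasets exhaustively record the same factor grid, each image can be addressed by a tuple of factor values. A minimal indexing sketch (the factor ordering and the function names are illustrative assumptions, not the dataset's documented layout):

```python
import numpy as np

# Assumed factor ordering: color, shape, size, camera height,
# background color, first DOF, second DOF.
FACTOR_SIZES = (6, 6, 2, 3, 3, 40, 40)

def factors_to_index(factors):
    """Map a tuple of factor values to a flat image index (row-major)."""
    return int(np.ravel_multi_index(factors, FACTOR_SIZES))

def index_to_factors(index):
    """Inverse mapping: flat image index back to a tuple of factor values."""
    return tuple(int(v) for v in np.unravel_index(index, FACTOR_SIZES))

# The grid size matches the reported dataset size of 1,036,800 images.
assert int(np.prod(FACTOR_SIZES)) == 1_036_800
```

The product 6 × 6 × 2 × 3 × 3 × 40 × 40 reproduces the image count reported in table 1 for each of the three MPI3D datasets.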
One of the simulated datasets is designed\nto be as realistic as possible and the synthetic images are visually practically indistinguishable from\nreal images (see \ufb01gure 1 middle). For the second simulated dataset we used a deliberately simpli\ufb01ed\nmodel (see \ufb01gure 1 left), which allows to investigate transfer from simpli\ufb01ed models to real data.\nThe synthetic data was generated using Autodesk 3ds Max(2018). Most parts of the scene were\nimported from SolidWorks CAD \ufb01les that were originally used to construct the experimental stage\nincluding the manipulator and 3D printing of the test objects. The surface shaders are based on\nAutodesk Physical material with hand-tuned shading parameters, based on full resolution reference\nimages. The camera poses were initialized from the CAD data and then manually \ufb01ne-tuned using\nreference images. The realistic synthetic images were obtained using the Autodesk Raytracer (ART)\nwith three rectangular light sources, mimicking the LED panels. The simpli\ufb01ed images were rendered\nwith the Quicksilver hardware renderer.\n\n4 First Experimental Evaluations of (unsupervised) Disentanglement\n\nMethods on Real-World Data\n\nSome \ufb01elds have been able to narrow the gap between simulation and reality [56, 5, 23], which has\nled to remarkable achievements (e.g. for in-hand manipulation [1]). In contrast, for disentanglement\nmethods this gap has not been bridged yet, state-of-the-art algorithms seem to have dif\ufb01culties to\ntransfer learned representations even between toy datasets [38]. The proposed dataset will enable\nthe community to systematically investigate how such transfer of information between simulations\nwith different degrees of realism and real data can be achieved. 
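The methods compared in the following experiments are all VAE variants that optimize regularized versions of the ELBO from section 2. As a minimal illustration, a β-VAE-style loss for a Gaussian posterior and Bernoulli decoder (a NumPy sketch for exposition; the actual experiments use disentanglement_lib):

```python
import numpy as np

def beta_vae_loss(x, x_recon, mu, log_var, beta=4.0):
    """Negative ELBO with the KL term scaled by beta, averaged over the batch.

    x, x_recon: arrays of shape (batch, dims) with values in [0, 1];
    x_recon are Bernoulli means. mu, log_var: parameters of the
    Gaussian posterior q(z|x), shape (batch, latent_dims).
    """
    eps = 1e-8
    # Bernoulli reconstruction log-likelihood, summed over pixels.
    rec = np.sum(x * np.log(x_recon + eps)
                 + (1 - x) * np.log(1 - x_recon + eps), axis=1)
    # KL(q(z|x) || N(0, I)) in closed form, summed over latent dims.
    kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=1)
    return np.mean(-rec + beta * kl)
```

With beta = 1 this reduces to the plain (negative) VAE objective; larger beta penalizes the KL term more strongly, as in [19].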
In the following we present a first experimental study in this direction.\n\n4.1 Experimental Protocol\n\nWe apply all the disentanglement methods (β-VAE, FactorVAE, β-TCVAE, DIP-VAE-I, DIP-VAE-II, AnnealedVAE) which were used in a recent large-scale study [38] to our three datasets. Due to space constraints, the methods are abbreviated with the numbers zero to five in the plots, in the same order. We use disentanglement_lib and we evaluate on the same scores as [38]. In all the experiments, we used images with resolution 64 × 64. This resolution is used in the recent large-scale evaluations and by state-of-the-art disentanglement learning algorithms [38]. Each of the six methods is trained on each of the three datasets with five different hyperparameter settings (see table 2 in the appendix for details) and with three different random seeds, leading to a total of 270 models. Each model is trained for 300,000 iterations on Tesla V100 GPUs. Details about the evaluation metrics can be found in appendix C.\n\n4.2 Experimental Results\n\nFigure 10: Reconstruction scores of different methods (0=β-VAE, 1=FactorVAE, 2=β-TCVAE, 3=DIP-VAE-I, 4=DIP-VAE-II, 5=AnnealedVAE) on the realistic synthetic dataset, the toy synthetic dataset and the real dataset.\n\nReconstruction Across Datasets: Figure 10 shows that there is a difference in reconstruction score across datasets: the score is the lowest on real data, followed by the realistic simulated dataset (R) and the simple toy (T) images. This indicates that there is a significant difference in the distribution of the real data compared to the simulated data, and that it is harder to learn a representation of the real data than of the simulated data. 
However, the relative behaviour of different methods seems to be similar across all three datasets, which indicates that, despite the differences, the simulated data may be useful for model selection.\n\nFigure 11: The Mutual Information Gap (MIG) scores attained by different methods for the following evaluations (from left to right): (a) trained and evaluated on synthetic realistic, (b) trained and evaluated on synthetic toy, (c) trained and evaluated on real, (d) trained on synthetic realistic and evaluated on real, (e) trained on synthetic toy and evaluated on real. The variance is due to different hyperparameters and seeds.\n\nDirect Transfer of Representations: In figure 11 we show the Mutual Information Gap (MIG) scores attained by different methods for different evaluations. The same plots for different metrics look qualitatively similar (see figure 22 in the appendix). Given the high variance, it is difficult to make conclusive statements. However, it seems quite clear that all methods perform significantly better when they are trained and evaluated on the same dataset (three plots on the left). Direct transfer of learned representations from simulated to real data (two plots on the right) seems to work rather poorly.\n\nTransfer of Hyperparameters: We have seen that transferring representations directly from simulated to real data seems to work poorly. However, it may be possible to instead transfer information at a higher level, such as the choice of the method and its hyperparameters as an inductive bias.\nIn order to quantitatively evaluate whether such a transfer is possible, we pick the model (including hyperparameters) which performs best in simulation (according to a metric chosen at random), and we compute the probability of this model outperforming (according to a metric and seed chosen randomly) a model chosen at random. 
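This selection procedure can be sketched as follows (an illustrative sketch; the score arrays stand in for per-model metric values on the two datasets):

```python
import numpy as np

def prob_best_sim_beats_random(sim_scores, real_scores, trials=10_000, seed=0):
    """Estimate how often the model that scores best in simulation
    outperforms a randomly chosen model on real data.

    sim_scores, real_scores: arrays of shape (n_models,), one score per
    model/hyperparameter configuration on each dataset.
    """
    rng = np.random.default_rng(seed)
    best = int(np.argmax(sim_scores))  # model selection in simulation
    random_picks = rng.integers(0, len(real_scores), size=trials)
    # Fraction of random draws that the simulation-selected model beats.
    return float(np.mean(real_scores[best] > real_scores[random_picks]))
```

The `np.argmax` call implements the "best in simulation" selection; fixing the random seed makes the Monte Carlo estimate reproducible.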
If no\ntransfer is possible, we would expect this probability to be 50%.\nHowever, we \ufb01nd that model selection from realistic simulated ren-\nderings (R) outperforms random model selection 72% of the time\nwhile transferring the model from the simpler synthetic images (T)\nto real-world data even beats random selection 78% of the time.\nThis \ufb01nding is con\ufb01rmed by \ufb01gure 12, where we show the rank-\ncorrelation of the performance of models (including hyperparame-\nters) trained on one dataset with the performance of these models\ntrained on another dataset. The performance of a model trained on\nsome dataset seems to be highly correlated with the performance of that model trained on any other\ndataset. In \ufb01gure 12 we use the DCI disentanglement metric as a score, however, qualitatively similar\nresults can be observed using most of the disentanglement metrics (see \ufb01gure 25 in the appendix).\nSummary These results indicate that the simulated and the real data distribution have some sim-\nilarities, and that these similarities can be exploited through model and hyperparameter selection.\nSurprisingly, it seems that the transfer of models from the synthetic toy dataset may work even better\nthan the transfer from the realistic synthetic dataset.\n\nFigure 12: Rank-correlation\nof the DCI disentanglement\nscores of different models\n(including hyperparameters)\nacross different data sets.\n\n5 Conclusions\n\nDespite the intended applications of disentangled representation learning algorithms to real data in\n\ufb01elds such as robotics, healthcare and fair decision making [e.g. 6, 10, 20], state-of-the-art approaches\nhave only been systematically evaluated on synthetic toy datasets. Our work effectively complements\nrelated efforts [e.g. 
38] to address current challenges of representation learning, offering the possibility\nof investigating the role of inductive biases, sample complexity, transfer learning and the use of labels\nusing real-world images.\nA key aspect of our datasets is that we provide rendered images of increasing complexity for the\nsame setup used to capture the real-world recordings. The different recordings offer the possibility of\ninvestigating the question if disentangled representations can be transferred from simulation to the\nreal world and how the transferability depends on the degree of realism of the simulation. Beyond the\nevaluation of representation learning algorithms, the proposed dataset can likewise be used for other\ntasks such as 3D reconstruction and scene rendering [12] or learning compositional visual concepts\n[21]. Furthermore, we are planning to use the novel experimental setup for recording objects with\nmore complicated shapes and textures under more dif\ufb01cult conditions, such as dependence among\ndifferent factors.\n\nAcknowledgments\nThis research was partially supported by the Max Planck ETH Center for Learning Systems and\nGoogle Cloud. We thank Alexander Neitz and Arash Mehrjou for useful discussions. We would also\nlike to thank Felix Grimminger, Ludovic Righetti, Stefan Schaal, Julian Viereck and Felix Widmaier\nwhose work served as a starting point for the development of the robotic platform in the present paper.\n\n8\n\n\fReferences\n[1] Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafal Jozefowicz, Bob McGrew, Jakub\nPachocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, et al. Learning dexterous\nin-hand manipulation. arXiv preprint arXiv:1808.00177, 2018.\n\n[2] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and\nnew perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798\u2013\n1828, 2013.\n\n[3] Yoshua Bengio, Yann LeCun, et al. 
Scaling learning algorithms towards ai. Large-scale kernel\n\nmachines, 34(5):1\u201341, 2007.\n\n[4] Diane Bouchacourt, Ryota Tomioka, and Sebastian Nowozin. Multi-level variational autoen-\n\ncoder: Learning disentangled representations from grouped observations. In AAAI, 2018.\n\n[5] Konstantinos Bousmalis, George Trigeorgis, Nathan Silberman, Dilip Krishnan, and Dumitru\nErhan. Domain separation networks. In Advances in Neural Information Processing Systems,\npages 343\u2013351, 2016.\n\n[6] Agisilaos Chartsias, Thomas Joyce, Giorgos Papanastasiou, Scott Semple, Michelle Williams,\nDavid Newby, Rohan Dharmakumar, and Sotirios A Tsaftaris. Factorised spatial representation\nlearning: application in semi-supervised myocardial segmentation. In International Conference\non Medical Image Computing and Computer-Assisted Intervention, pages 490\u2013498. Springer,\n2018.\n\n[7] Tian Qi Chen, Xuechen Li, Roger B Grosse, and David K Duvenaud. Isolating sources of\ndisentanglement in variational autoencoders. In Advances in Neural Information Processing\nSystems, pages 2610\u20132620, 2018.\n\n[8] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan:\nInterpretable representation learning by information maximizing generative adversarial nets. In\nAdvances in neural information processing systems, pages 2172\u20132180, 2016.\n\n[9] Brian Cheung, Jesse A Livezey, Arjun K Bansal, and Bruno A Olshausen. Discovering hidden\nfactors of variation in deep networks. In Workshop at International Conference on Learning\nRepresentations, 2015.\n\n[10] E. Creager, D. Madras, J-H. Jacobson, M. Weis, K. Swersky, T. Pitassi, and R. Zemel. Flexibly\n\nfair representation learning by disentanglement. In ICML, page to appear, 2019.\n\n[11] Cian Eastwood and Christopher K. I. Williams. A framework for the quantitative evaluation of\ndisentangled representations. 
In International Conference on Learning Representations, 2018.\n[12] SM Ali Eslami, Danilo Jimenez Rezende, Frederic Besse, Fabio Viola, Ari S Morcos, Marta\nGarnelo, Avraham Ruderman, Andrei A Rusu, Ivo Danihelka, Karol Gregor, et al. Neural scene\nrepresentation and rendering. Science, 360(6394):1204\u20131210, 2018.\n\n[13] Babak Esmaeili, Hao Wu, Sarthak Jain, Siddharth Narayanaswamy, Brooks Paige, and\narXiv preprint\n\nJan-Willem Van de Meent. Hierarchical disentangled representations.\narXiv:1804.02086, 2018.\n\n[14] Sanja Fidler, Sven Dickinson, and Raquel Urtasun. 3d object detection and viewpoint estimation\nwith a deformable 3d cuboid model. In Advances in neural information processing systems,\npages 611\u2013619, 2012.\n\n[15] Ian Goodfellow, Honglak Lee, Quoc V Le, Andrew Saxe, and Andrew Y Ng. Measuring\ninvariances in deep networks. In Advances in neural information processing systems, pages\n646\u2013654, 2009.\n\n[16] David Ha and J\u00fcrgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2018.\n[17] Irina Higgins, David Amos, David Pfau, Sebastien Racaniere, Loic Matthey, Danilo Rezende,\nand Alexander Lerchner. Towards a de\ufb01nition of disentangled representations. arXiv preprint\narXiv:1812.02230, 2018.\n\n9\n\n\f[18] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick,\nShakir Mohamed, and Alexander Lerchner. beta-vae: Learning basic visual concepts with a\nconstrained variational framework. In International Conference on Learning Representations,\n2017.\n\n[19] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick,\nShakir Mohamed, and Alexander Lerchner. beta-vae: Learning basic visual concepts with a\nconstrained variational framework. 
In International Conference on Learning Representations,\n2017.\n\n[20] Irina Higgins, Arka Pal, Andrei Rusu, Loic Matthey, Christopher Burgess, Alexander Pritzel,\nMatthew Botvinick, Charles Blundell, and Alexander Lerchner. Darla: Improving zero-shot\ntransfer in reinforcement learning. In Proceedings of the 34th International Conference on\nMachine Learning-Volume 70, pages 1480\u20131490. JMLR. org, 2017.\n\n[21] Irina Higgins, Nicolas Sonnerat, Loic Matthey, Arka Pal, Christopher P Burgess, Matko\nBosnjak, Murray Shanahan, Matthew Botvinick, Demis Hassabis, and Alexander Lerchner.\nScan: Learning hierarchical compositional visual concepts. arXiv preprint arXiv:1707.03389,\n2017.\n\n[22] Irina Higgins, Nicolas Sonnerat, Loic Matthey, Arka Pal, Christopher P Burgess, Matko\nBo\u0161njak, Murray Shanahan, Matthew Botvinick, Demis Hassabis, and Alexander Lerchner.\nScan: Learning hierarchical compositional visual concepts. In International Conference on\nLearning Representations, 2018.\n\n[23] Stephen James, Paul Wohlhart, Mrinal Kalakrishnan, Dmitry Kalashnikov, Alex Irpan, Julian\nIbarz, Sergey Levine, Raia Hadsell, and Konstantinos Bousmalis. Sim-to-real via sim-to-sim:\nData-ef\ufb01cient robotic grasping via randomized-to-canonical adaptation networks. arXiv preprint\narXiv:1812.07252, 2018.\n\n[24] Hyunjik Kim and Andriy Mnih. Disentangling by factorising. arXiv preprint arXiv:1802.05983,\n\n2018.\n\n[25] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. In International\n\nConference on Learning Representations, 2014.\n\n[26] Tejas D Kulkarni, William F Whitney, Pushmeet Kohli, and Josh Tenenbaum. Deep convolu-\ntional inverse graphics network. In Advances in neural information processing systems, pages\n2539\u20132547, 2015.\n\n[27] Abhishek Kumar, Prasanna Sattigeri, and Avinash Balakrishnan. Variational inference of\ndisentangled latent concepts from unlabeled observations. 
In International Conference on Learning Representations, 2017.

[28] Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.

[29] Brenden M Lake, Tomer D Ullman, Joshua B Tenenbaum, and Samuel J Gershman. Building machines that learn and think like people. Behavioral and Brain Sciences, 40, 2017.

[30] Adrien Laversanne-Finot, Alexandre Pere, and Pierre-Yves Oudeyer. Curiosity driven exploration of learned disentangled goal spaces. In Conference on Robot Learning, pages 487–504, 2018.

[31] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436, 2015.

[32] Yann LeCun, Fu Jie Huang, Leon Bottou, et al. Learning methods for generic object recognition with invariance to pose and lighting. In CVPR (2), pages 97–104, 2004.

[33] Karel Lenc and Andrea Vedaldi. Understanding image representations by measuring their equivariance and equivalence. In IEEE Conference on Computer Vision and Pattern Recognition, pages 991–999, 2015.

[34] Timothée Lesort, Natalia Díaz-Rodríguez, Jean-François Goudou, and David Filliat. State representation learning for control: An overview. Neural Networks, 2018.

[35] Yen-Cheng Liu, Yu-Ying Yeh, Tzu-Chien Fu, Wei-Chen Chiu, Sheng-De Wang, and Yu-Chiang Frank Wang. Detach and adapt: Learning cross-domain disentangled deep representation. arXiv preprint arXiv:1705.01314, 2017.

[36] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, pages 3730–3738, 2015.

[37] Francesco Locatello, Gabriele Abbati, Tom Rainforth, Stefan Bauer, Bernhard Schölkopf, and Olivier Bachem.
On the fairness of disentangled representations. arXiv preprint arXiv:1905.13662, 2019.

[38] Francesco Locatello, Stefan Bauer, Mario Lucic, Gunnar Raetsch, Sylvain Gelly, Bernhard Schölkopf, and Olivier Bachem. Challenging common assumptions in the unsupervised learning of disentangled representations. In Proceedings of the 36th International Conference on Machine Learning. PMLR, 2019.

[39] Francesco Locatello, Michael Tschannen, Stefan Bauer, Gunnar Rätsch, Bernhard Schölkopf, and Olivier Bachem. Disentangling factors of variation using few labels. arXiv preprint arXiv:1905.01258, 2019.

[40] Michael F Mathieu, Junbo Jake Zhao, Junbo Zhao, Aditya Ramesh, Pablo Sprechmann, and Yann LeCun. Disentangling factors of variation in deep representation using adversarial training. In Advances in Neural Information Processing Systems, pages 5040–5048, 2016.

[41] Edvard Munch. The Scream, 1893.

[42] Ashvin V Nair, Vitchyr Pong, Murtaza Dalal, Shikhar Bahl, Steven Lin, and Sergey Levine. Visual reinforcement learning with imagined goals. In Advances in Neural Information Processing Systems, pages 9209–9220, 2018.

[43] Siddharth Narayanaswamy, T Brooks Paige, Jan-Willem Van de Meent, Alban Desmaison, Noah Goodman, Pushmeet Kohli, Frank Wood, and Philip Torr. Learning disentangled representations with semi-supervised deep generative models. In Advances in Neural Information Processing Systems, pages 5925–5935, 2017.

[44] Judea Pearl. Causality. Cambridge University Press, 2009.

[45] Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Elements of Causal Inference: Foundations and Learning Algorithms. MIT Press, 2017.

[46] Scott E Reed, Yi Zhang, Yuting Zhang, and Honglak Lee. Deep visual analogy-making. In Advances in Neural Information Processing Systems, pages 1252–1260, 2015.

[47] Karl Ridgeway and Michael C Mozer. Learning deep disentangled embeddings with the f-statistic loss.
In Advances in Neural Information Processing Systems, pages 185–194, 2018.

[48] Jürgen Schmidhuber. Learning factorial codes by predictability minimization. Neural Computation, 4(6):863–879, 1992.

[49] Bernhard Schölkopf, Dominik Janzing, Jonas Peters, Eleni Sgouritsa, Kun Zhang, and Joris Mooij. On causal and anticausal learning. In International Conference on Machine Learning, pages 1255–1262, 2012.

[50] P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. Springer-Verlag, 1993 (2nd edition MIT Press, 2000).

[51] Xander Steenbrugge, Sam Leroux, Tim Verbelen, and Bart Dhoedt. Improving generalization for abstract reasoning tasks using disentangled feature representations. In Workshop on Relational Representation Learning at Neural Information Processing Systems, 2018.

[52] Raphael Suter, Djordje Miladinovic, Bernhard Schölkopf, and Stefan Bauer. Robustly disentangled causal mechanisms: Validating deep representations for interventional robustness. In Proceedings of the 36th International Conference on Machine Learning. PMLR, 2019.

[53] Jakub M Tomczak and Max Welling. VAE with a VampPrior. arXiv preprint arXiv:1705.07120, 2017.

[54] Sjoerd van Steenkiste, Francesco Locatello, Jürgen Schmidhuber, and Olivier Bachem. Are disentangled representations helpful for abstract visual reasoning? arXiv preprint arXiv:1905.12506, 2019.

[55] Nicholas Watters, Loic Matthey, Christopher P Burgess, and Alexander Lerchner. Spatial broadcast decoder: A simple architecture for learning disentangled representations in VAEs. arXiv preprint arXiv:1901.07017, 2019.

[56] Jingwei Zhang, Lei Tai, Peng Yun, Yufeng Xiong, Ming Liu, Joschka Boedecker, and Wolfram Burgard. VR-Goggles for robots: Real-to-sim domain adaptation for visual control.
IEEE Robotics and Automation Letters, 4(2):1148–1155, 2019.