{"title": "Are Disentangled Representations Helpful for Abstract Visual Reasoning?", "book": "Advances in Neural Information Processing Systems", "page_first": 14245, "page_last": 14258, "abstract": "A disentangled representation encodes information about the salient factors of variation in the data independently. Although it is often argued that this representational format is useful in learning to solve many real-world down-stream tasks, there is little empirical evidence that supports this claim. In this paper, we conduct a large-scale study that investigates whether disentangled representations are more suitable for abstract reasoning tasks. Using two new tasks similar to Raven's Progressive Matrices, we evaluate the usefulness of the representations learned by 360 state-of-the-art unsupervised disentanglement models. Based on these representations, we train 3600 abstract reasoning models and observe that disentangled representations do in fact lead to better down-stream performance. In particular, they enable quicker learning using fewer samples.", "full_text": "Are Disentangled Representations Helpful for\n\nAbstract Visual Reasoning?\n\nSjoerd van Steenkiste\nIDSIA, USI, SUPSI\nsjoerd@idsia.ch\n\nFrancesco Locatello\nETH Zurich, MPI-IS\nlocatelf@ethz.ch\n\nJ\u00fcrgen Schmidhuber\n\nIDSIA, USI, SUPSI, NNAISENSE\n\njuergen@idsia.ch\n\nOlivier Bachem\n\nGoogle Research, Brain Team\n\nbachem@google.com\n\nAbstract\n\nA disentangled representation encodes information about the salient factors of\nvariation in the data independently. Although it is often argued that this repre-\nsentational format is useful in learning to solve many real-world down-stream\ntasks, there is little empirical evidence that supports this claim. In this paper, we\nconduct a large-scale study that investigates whether disentangled representations\nare more suitable for abstract reasoning tasks. 
Using two new tasks similar to\nRaven\u2019s Progressive Matrices, we evaluate the usefulness of the representations\nlearned by 360 state-of-the-art unsupervised disentanglement models. Based on\nthese representations, we train 3600 abstract reasoning models and observe that\ndisentangled representations do in fact lead to better down-stream performance. In\nparticular, they enable quicker learning using fewer samples.\n\n1\n\nIntroduction\n\nLearning good representations of high-dimensional sensory data is of fundamental importance to\nArti\ufb01cial Intelligence [4, 3, 6, 49, 7, 69, 67, 50, 59, 73]. In the supervised case, the quality of a\nrepresentation is often expressed through the ability to solve the corresponding down-stream task.\nHowever, in order to leverage vast amounts of unlabeled data, we require a set of desiderata that\napply to more general real-world settings.\nFollowing the successes in learning distributed representations that ef\ufb01ciently encode the content\nof high-dimensional sensory data [45, 56, 76], recent work has focused on learning representations\nthat are disentangled [6, 69, 68, 73, 71, 26, 27, 42, 10, 63, 16, 52, 53, 48, 9, 51]. A disentangled\nrepresentation captures information about the salient (or explanatory) factors of variation in the\ndata, isolating information about each speci\ufb01c factor in only a few dimensions. Although the\nprecise circumstances that give rise to disentanglement are still being debated, the core concept of a\nlocal correspondence between data-generative factors and learned latent codes is generally agreed\nupon [16, 26, 52, 63, 71].\nDisentanglement is mostly about how information is encoded in the representation, and it is often\nargued that a representation that is disentangled is desirable in learning to solve challenging real-world\ndown-stream tasks [6, 73, 59, 7, 26, 68]. 
Indeed, in a disentangled representation, information about\nan individual factor value can be readily accessed and is robust to changes in the input that do not\naffect this factor. Hence, learning to solve a down-stream task from a disentangled representation\nis expected to require fewer samples and be easier in general [68, 6, 28, 29, 59]. Real-world\ngenerative processes are also often based on latent spaces that factorize. In this case, a disentangled\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\frepresentation that captures this product space is expected to help in generalizing systematically in\nthis regard [18, 22, 59].\nSeveral of these purported bene\ufb01ts can be traced back to empirical evidence presented in the recent\nliterature. Disentangled representations have been found to be more sample-ef\ufb01cient [29], less\nsensitive to nuisance variables [55], and better in terms of (systematic) generalization [1, 16, 28,\n35, 70]. However, in other cases it is less clear whether the observed bene\ufb01ts are actually due to\ndisentanglement [48]. Indeed, while these results are generally encouraging, a systematic evaluation\non a complex down-stream task of a wide variety of disentangled representations obtained by training\ndifferent models, using different hyper-parameters and data sets, appears to be lacking.\n\nContributions\nIn this work, we conduct a large-scale evaluation1 of disentangled representations\nto systematically evaluate some of these purported bene\ufb01ts. 
Rather than focusing on a simple single\nfactor classi\ufb01cation task, we evaluate the usefulness of disentangled representations on abstract visual\nreasoning tasks that challenge the current capabilities of state-of-the-art deep neural networks [30, 65].\nOur key contributions include:\n\u2022 We create two new visual abstract reasoning tasks similar to Raven\u2019s Progressive Matrices [61]\nbased on two disentanglement data sets: dSprites [27], and 3dshapes [42]. A key design property\nof these tasks is that they are hard to solve based on statistical co-occurrences and require reasoning\nabout the relations between different objects.\n\n\u2022 We train 360 unsupervised disentanglement models spanning four different disentanglement\napproaches on the individual images of these two data sets and extract their representations. We\nthen train 3600 Wild Relation Networks [65] that use these disentangled representations to perform\nabstract reasoning and measure their accuracy at various stages of training.\n\n\u2022 We evaluate the usefulness of disentangled representations by comparing the accuracy of these\nabstract reasoning models to the degree of disentanglement of the representations (measured using\n\ufb01ve different disentanglement metrics). We observe compelling evidence that more disentangled\nrepresentations yield better sample-ef\ufb01ciency in learning to solve the considered abstract visual\nreasoning tasks. In this regard our results are complementary to a recent prior study of disentangled\nrepresentations that did not \ufb01nd evidence of increased sample ef\ufb01ciency on a much simpler\ndown-stream task [52].\n\n2 Background and Related Work on Learning Disentangled Representations\n\nDespite an increasing interest in learning disentangled representations, a precise de\ufb01nition is still\na topic of debate [16, 26, 52, 63]. In recent work, Eastwood et al. [16] and Ridgeway et al. 
[63]\nput forth three criteria of disentangled representations: modularity, compactness, and explicitness.\nModularity implies that each code in a learned representation is associated with only one factor of\nvariation in the environment, while compactness ensures that information regarding a single factor\nis represented using only one or a few codes. Combined, modularity and compactness suggest that a\ndisentangled representation implements a one-to-one mapping between salient factors of variation\nin the environment and the learned codes. Finally, a disentangled representation is often assumed\nto be explicit, in that the mapping between factors and learned codes can be implemented with a\nsimple (i.e. linear) model. While modularity is commonly agreed upon, compactness is a point of\ncontention. Ridgeway et al. [63] argue that some features (e.g. the rotation of an object) are best\ndescribed with multiple codes, although this is essentially not compact. The recent work by Higgins\net al. [26] suggests an alternative view that may resolve these different perspectives in the future.\n\nMetrics Multiple metrics have been proposed that leverage the ground-truth generative factors\nof variation in the data to measure disentanglement in learned representations. In recent work,\nLocatello et al. [52] studied several of these metrics, which we will adopt for our purposes in this\nwork: the BetaVAE score [27], the FactorVAE score [42], the Mutual Information Gap (MIG) [10],\nthe disentanglement score from Eastwood et al. [16] referred to as the DCI Disentanglement score,\nand the Separated Attribute Predictability (SAP) score [48].\n\n1Reproducing these experiments requires approximately 2.73 GPU years (NVIDIA P100).\n\n\fThe BetaVAE score, FactorVAE score, and DCI Disentanglement score focus primarily on modularity.\nThe \ufb01rst two assess this property through interventions, i.e. 
by keeping one factor \ufb01xed and varying all\nothers, while the DCI Disentanglement score estimates this property from the relative importance\nassigned to each feature by a random forest regressor in predicting the factor values. The SAP score\nand MIG are mostly focused on compactness. The SAP score reports the difference between the top\ntwo most predictive latent codes of a given factor, while MIG reports the difference between the top\ntwo latent variables with the highest mutual information with a given factor.\nThe degree of explicitness captured by any of the disentanglement metrics remains unclear. In\nprior work, it was found that there is a positive correlation between disentanglement metrics and\ndown-stream performance on single factor classi\ufb01cation [52]. However, it is not obvious whether\ndisentangled representations are useful for down-stream performance per se, or if the correlation is\ndriven by the explicitness captured in the scores. In particular, the DCI Disentanglement score and\nthe SAP score compute disentanglement by training a classi\ufb01er on the representation. The former\nuses a random forest regressor to determine the relative importance of each feature, and the latter\nconsiders the gap in prediction accuracy of a support vector machine trained on each feature in\nthe representation. MIG is based on the matrix of pairwise mutual information between factors\nof variation and dimensions of the representation, which also relates to the explicitness of the\nrepresentation. On the other hand, the BetaVAE and FactorVAE scores predict the index of a \ufb01xed\nfactor of variation and not the exact value.\nWe note that current disentanglement metrics each require access to the ground-truth factors of\nvariation, which may hinder the practical feasibility of learning disentangled representations. Here\nour goal is to assess the usefulness of disentangled representations more generally (i.e. 
assuming it is\npossible to obtain them), which can be veri\ufb01ed independently.\n\nMethods Several methods have been proposed to learn disentangled representations. Here we are\ninterested in evaluating the bene\ufb01ts of disentangled representations that have been learned through\nunsupervised learning. In order to control for potential confounding factors that may arise in using\na single model, we use the representations learned from four state-of-the-art approaches from the\nliterature: \u03b2-VAE [27], FactorVAE [42], \u03b2-TCVAE [10], and DIP-VAE [48]. A similar choice of\nmodels was used in a recent study by Locatello et al. [52].\nUsing notation from Tschannen et al. [73], we can view all of these models as Auto-Encoders that\nare trained with the regularized variational objective of the form:\n\nEp(x)[Eq\u03c6(z|x)[\u2212 log p\u03b8(x|z)]] + \u03bb1Ep(x)[R1(q\u03c6(z|x))] + \u03bb2R2(q\u03c6(z)).\n\n(1)\n\nThe output of the encoder that parametrizes q\u03c6(z|x) yields the representation. Regularization serves\nto control the information \ufb02ow through the bottleneck induced by the encoder, while different\nregularizers primarily vary in the notion of disentanglement that they induce. 
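As a concrete illustration of the regularized objective in Eq. (1), the following is a minimal numpy sketch of its \u03b2-VAE instantiation (R1 the KL term with \u03bb1 = \u03b2, \u03bb2 = 0), assuming a diagonal-Gaussian posterior, a standard-normal prior, and a Bernoulli decoder over binarized images. This is an illustrative sketch, not the authors' implementation (their code is released as part of disentanglement_lib).

```python
import numpy as np

def kl_diag_gaussian(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dims."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=-1)

def beta_vae_loss(x, x_recon, mu, logvar, beta=4.0):
    """Per-example objective of Eq. (1) with the beta-VAE choice:
    R1 = KL[q(z|x) || p(z)], lambda1 = beta, lambda2 = 0.
    A Bernoulli reconstruction log-likelihood is assumed for binary images."""
    eps = 1e-7  # numerical stability for the log terms
    recon_nll = -np.sum(
        x * np.log(x_recon + eps) + (1 - x) * np.log(1 - x_recon + eps), axis=-1
    )
    return recon_nll + beta * kl_diag_gaussian(mu, logvar)

# toy check: a posterior exactly matching the prior (mu=0, logvar=0) has zero KL
mu, logvar = np.zeros((2, 10)), np.zeros((2, 10))
assert np.allclose(kl_diag_gaussian(mu, logvar), 0.0)
```

Setting beta=1 recovers the plain VAE objective; beta > 1 strengthens the bottleneck, which is the mechanism the methods below vary.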
\u03b2-VAE restricts the\ncapacity of the information bottleneck by penalizing the KL-divergence, using \u03b2 = \u03bb1 > 1 with\nR1(q\u03c6(z|x)) := DKL[q\u03c6(z|x)||p(z)], and \u03bb2 = 0; FactorVAE penalizes the Total Correlation [77] of\nthe latent variables via adversarial training, using \u03bb1 = 0 and \u03bb2 = 1 with R2(q\u03c6(z)) := T C(q\u03c6(z));\n\u03b2-TCVAE also penalizes the Total Correlation but estimates its value via a biased Monte Carlo\nestimator; and \ufb01nally DIP-VAE penalizes a mismatch in moments between the aggregated posterior\nand a factorized prior, using \u03bb1 = 0 and \u03bb2 \u2265 1 with R2(q\u03c6(z)) := ||Covq\u03c6(z) \u2212 I||2F .\n\nOther Related Works Learning disentangled representations is similar in spirit to non-linear\nICA, although it relies primarily on (architectural) inductive biases and different degrees of supervision [13, 2, 39, 36, 37, 38, 25, 33, 32]. Due to the initial poor performance of purely unsupervised\nmethods, the \ufb01eld initially focused on semi-supervised [62, 11, 57, 58, 44, 46] and weakly supervised\napproaches [31, 12, 40, 21, 78, 20, 15, 35, 80, 54, 47, 64, 8]. In this paper, we consider the setup of the\nrecent unsupervised methods [27, 26, 48, 42, 9, 52, 71, 10]. Finally, while this paper focuses on evaluating the bene\ufb01ts of disentangled features, these are complementary to recent work that focuses on\nthe unsupervised \u201cdisentangling\u201d of images into compositional primitives given by object-like representations [17, 23, 24, 22, 60, 74, 75]. Disentangling pose, style, or motion from content are classical\nvision tasks that have been studied with different degrees of supervision [72, 79, 80, 34, 19, 14, 21, 36].\n\n\fFigure 1: Examples of RPM-like abstract visual reasoning tasks using dSprites (left) and 3dshapes\n(right). 
The correct answer and additional samples are available in Figure 17 in Appendix C.\n\n3 Abstract Visual Reasoning Tasks for Disentangled Representations\n\nIn this work we evaluate the purported bene\ufb01ts of disentangled representations on abstract visual\nreasoning tasks. Abstract reasoning tasks require a learner to infer abstract relationships between\nmultiple entities (i.e. objects in images) and re-apply this knowledge in newly encountered settings [41]. Humans are known to excel at this task, as is evident from experiments with simple visual\nIQ tests such as Raven\u2019s Progressive Matrices (RPMs) [61]. An RPM consists of several context\npanels organized in multiple sequences, with one sequence being incomplete. The task consists of\ncompleting the \ufb01nal sequence by choosing from a given set of answer panels. Choosing the correct\nanswer panel requires one to infer the relationships between the panels in the complete context\nsequences, and apply this knowledge to the remaining partial sequence.\nIn recent work, Santoro et al. [65] evaluated the abstract reasoning capabilities of deep neural\nnetworks on this task. Using a data set of RPM-like matrices, they found that standard deep neural\nnetwork architectures struggle at abstract visual reasoning under different training and generalization\nregimes. Their results indicate that it is dif\ufb01cult to solve these tasks by relying purely on super\ufb01cial\nimage statistics, and that they can only be solved ef\ufb01ciently through abstract visual reasoning. This\nmakes this setting particularly appealing for investigating the bene\ufb01ts of disentangled representations.\n\nGenerating RPM-like Matrices Rather than evaluating disentangled representations on the Procedurally Generated Matrices (PGM) dataset from Barrett et al. [65], we construct two new abstract\nRPM-like visual reasoning datasets based on two existing datasets for disentangled representation\nlearning. 
Our motivation for this is twofold: First, it is not clear what a ground-truth disentangled representation should look like for the PGM dataset, while the two existing disentanglement data sets include\nthe ground-truth factors of variation. Second, in using established data sets for disentanglement, we\ncan reuse hyper-parameter ranges that have proven successful. We note that our study is substantially\ndifferent from recent work by Steenbrugge et al. [70], who evaluate the representation of a single trained\n\u03b2-VAE [27] on the original PGM data set.\nTo construct the abstract reasoning tasks, we use the ground-truth generative model of the dSprites [27]\nand 3dshapes [42] data sets with the following changes2: For dSprites, we ignore the orientation\nfeature for the abstract reasoning tasks as certain objects such as squares and ellipses exhibit rotational\nsymmetries. To compensate, we add background color (5 different shades of gray linearly spaced\nbetween white and black) and object color (6 different colors linearly spaced in HUSL hue space)\nas two new factors of variation. Similarly, for the abstract reasoning tasks (but not when learning\nrepresentations), we only consider three different values for the scale of the object (instead of 6) and\nonly four values for the x and y position (instead of 32). For 3dshapes, we retain all of the original\nfactors but only consider four different values for scale and azimuth (out of 8 and 16) for the abstract\nreasoning tasks. We refer to Figure 7 in Appendix B for samples from these data sets.\nFor the modi\ufb01ed dSprites and 3dshapes, we now create corresponding abstract reasoning tasks. The\nkey idea is that one is given a 3 \u00d7 3 matrix of context image panels with the bottom right image panel\nmissing, as well as a set of six potential answer panels (see Figure 1 for an example). 
One then has to\ninfer which of the answers \ufb01ts in the missing panel of the 3 \u00d7 3 matrix based on relations between\n2These were implemented to ensure that humans can visually distinguish between the different values of\n\neach factor of variation.\n\n4\n\n\fimage panels in the rows of the 3 \u00d7 3 matrices. Due to the categorical nature of ground-truth factors\nin the underlying data sets, we focus on the AND relationship in which one or more factor values are\nequal across a sequence of context panels [65].\nWe generate instances of the abstract reasoning tasks in the following way: First, we uniformly\nsample whether 1, 2, or 3 ground-truth factors are \ufb01xed across rows in the instance to be generated.\nSecond, we uniformly sample without replacement the set of underlying factors in the underlying\ngenerative model that should be kept constant. Third, we uniformly sample a factor value from the\nground-truth model for each of the three rows and for each of the \ufb01xed factors3. Fourth, for all other\nground-truth factors we also sample 3 \u00d7 3 matrices of factor values from the ground-truth model with\nthe single constraint that the factor values are not allowed to be constant across the \ufb01rst two rows (in\nthat case we sample a new set of values). After this we have ground-truth factor values for each of\nthe 9 panels in the correct solution to the abstract reasoning task, and we can sample corresponding\nimages from the ground-truth model. To generate dif\ufb01cult alternative answers, we take the factor\nvalues of the correct answer panel and randomly resample the non-\ufb01xed factors as well as a random\n\ufb01xed factor until the factor values no longer satisfy the relations in the original abstract reasoning\ntask. We repeat this process to obtain \ufb01ve incorrect answers and \ufb01nally insert the correct answer in a\nrandom position. 
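The four-step sampling procedure above can be sketched as follows. This is a simplified illustration: the factor names and sizes are hypothetical (not the exact modified data sets), and both image rendering and the generation of the five incorrect answers are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical factor specification (illustrative names and sizes only)
FACTOR_SIZES = {"shape": 3, "scale": 3, "x_pos": 4, "y_pos": 4, "obj_color": 6}

def sample_rpm_instance():
    names = list(FACTOR_SIZES)
    # Step 1: how many factors are held fixed across each row (1, 2, or 3).
    num_fixed = rng.integers(1, 4)
    # Step 2: which factors are held fixed, sampled without replacement.
    fixed = list(rng.choice(names, size=num_fixed, replace=False))
    solution = np.zeros((3, 3, len(names)), dtype=int)
    for i, name in enumerate(names):
        if name in fixed:
            # Step 3: one value per row, constant across that row's panels.
            for row in range(3):
                solution[row, :, i] = rng.integers(FACTOR_SIZES[name])
        else:
            # Step 4: resample the 3x3 matrix of values until the factor is
            # not constant across the first two (complete) context rows.
            while True:
                vals = rng.integers(FACTOR_SIZES[name], size=(3, 3))
                if len(set(vals[:2].ravel())) > 1:
                    break
            solution[:, :, i] = vals
    return solution, fixed

solution, fixed = sample_rpm_instance()
for i, name in enumerate(FACTOR_SIZES):
    if name in fixed:  # fixed factors really are constant within every row
        assert all(len(set(solution[r, :, i])) == 1 for r in range(3))
```

The returned 3 x 3 x num_factors array holds the ground-truth factor values of the correct solution, from which the nine panel images would then be rendered.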
Examples of the resulting abstract reasoning tasks can be seen in Figure 1 as well\nas in Figures 18 and 19 in Appendix C.\n\nModels We make use of the Wild Relation Network (WReN) to solve the abstract visual\nreasoning tasks [65]. It incorporates relational structure, and was introduced in prior work speci\ufb01cally\nfor such tasks. The WReN is evaluated for each answer panel a \u2208 A = {a1, ..., a6} in relation to all\nthe context panels C = {c1, ..., c8} as follows:\n\nWReN(a, C) = f\u03c6(\u2211e1,e2\u2208E g\u03b8(e1, e2)) , E = {CNN(c1), ..., CNN(c8)} \u222a {CNN(a)}\n\n(2)\n\nFirst, an embedding is computed for each panel using a deep Convolutional Neural Network (CNN);\nthese embeddings serve as input to a Relation Network (RN) module [66]. The Relation Network reasons about\nthe different relationships between the context and answer panels, and outputs a score. The answer\npanel a \u2208 A with the highest score is chosen as the \ufb01nal output.\nThe Relation Network implements a suitable inductive bias for (relational) reasoning [5]. It separates\nthe reasoning process into two stages. First, g\u03b8 is applied to all pairs of panel embeddings to consider\nrelations between the answer panel and each of the context panels, as well as relations among the context\npanels. Weight-sharing of g\u03b8 between the panel-embedding pairs makes it dif\ufb01cult to over\ufb01t to the\nimage statistics of the individual panels. Finally, f\u03c6 produces a score for the given answer panel in\nrelation to the context panels by globally considering the different relations between the panels as a\nwhole. Note that using the same WReN for all answer panels ensures that each answer panel is\nsubject to the same reasoning process.\n\n4 Experiments\n\n4.1 Learning Disentangled Representations\n\nWe train \u03b2-VAE [27], FactorVAE [42], \u03b2-TCVAE [10], and DIP-VAE [48] on the panels from the\nmodi\ufb01ed dSprites and 3dshapes data sets4. 
For \u03b2-VAE we consider two variations: the standard\nversion using a \ufb01xed \u03b2, and a version trained with the controlled capacity increase presented by\nBurgess et al. [9]. Similarly, for DIP-VAE we consider both the DIP-VAE-I and DIP-VAE-II variations\nof the proposed regularizer [48]. For each of these methods, we consider six different values for\ntheir (main) hyper-parameter and \ufb01ve different random seeds. The remaining experimental details are\npresented in Appendix A.\nAfter training, we end up with 360 encoders, whose outputs are expected to cover a wide variety\nof representational formats with which to encode information in the images. Figures 9\nand 10 in the Appendix show histograms of the reconstruction errors obtained after training, and\n\n3Note that different rows may have different values.\n4Code is made available as part of disentanglement_lib at https://git.io/JelEv.\n\n\fthe scores that various disentanglement metrics assigned to the corresponding representations. The\nreconstructions are mostly good (see also Figure 7), which con\ufb01rms that the learned representations\ntend to accurately capture the image content. Correspondingly, we expect any observed difference\nin down-stream performance when using these representations to be primarily the result of how\ninformation is encoded. In terms of the scores of the various disentanglement metrics, we observe a\nwide range of values. This suggests that, going by different de\ufb01nitions of disentanglement, there are\nlarge differences in the quality of the learned representations.\n\n4.2 Abstract Visual Reasoning\n\nWe train different WReN models where we control for two potential confounding factors: the\nrepresentation produced by a speci\ufb01c model used to embed the input images, as well as the hyper-parameters of the WReN model. For hyper-parameters, we use a random search space as speci\ufb01ed in\nAppendix A. 
We use the following training protocol: We train each of these models using a batch\nsize of 32 for 100K iterations, where each mini-batch consists of newly generated random instances\nof the abstract reasoning tasks. Similarly, every 1000 iterations, we evaluate the accuracy on 100\nmini-batches of fresh samples. We note that this corresponds to the statistical optimization setting,\nsidestepping the need to investigate the impact of empirical risk minimization and over\ufb01tting5.\n\n4.2.1 Initial Study\n\nFirst, we trained a set of baseline models to assess the overall complexity of the abstract reasoning\ntask. We consider three types of representations: (i) CNN representations which are learned from\nscratch (with the same architecture as in the disentanglement models), yielding the standard WReN, (ii)\npre-trained frozen representations based on a random selection of the pre-trained disentanglement\nmodels, and (iii) directly using the ground-truth factors of variation (both one-hot encoded and integer\nencoded). We train 30 different models for each of these approaches and data sets with different\nrandom seeds and different draws from the search space over hyper-parameter values.\nAn overview of the training behaviour and the accuracies achieved can be seen in Figures 2 and 11\n(Appendix B). We observe that the standard WReN\nmodel struggles to obtain good results on average,\neven after having seen many different samples at\n100K steps. This is due to the fact that training from\nscratch is hard and runs may get stuck in local minima\nwhere they predict each of the answers with equal\nprobabilities. Given the pre-training and the exposure to additional unsupervised samples, it is not\nsurprising that the learned representations from the\ndisentanglement models perform better. The WReN\nmodels that are given the true factors also perform\nwell, after only a few steps of training. 
We\nalso observe that different runs exhibit a signi\ufb01cant\nspread, which motivates why we analyze the average\naccuracy across many runs in the next section.\nIt appears that dSprites is the harder task, with models\nreaching an average score of 80%, while reaching an\naverage of 90% on 3dshapes. Finally, we note that\nmost learning progress takes place in the \ufb01rst 20K\nsteps, and thus expect the bene\ufb01ts of disentangled representations to be most clear in this regime.\n\nFigure 2: Average down-stream accuracy of\nbaselines, and models using pre-trained representations on dSprites. Shaded area indicates\nmin and max accuracy.\n\n4.2.2 Evaluating Disentangled Representations\n\nBased on the results from the initial study, we train a full set of WReN models in the following manner:\nWe \ufb01rst sample a set of 10 hyper-parameter con\ufb01gurations from our search space and then train\nWReN models using these con\ufb01gurations for each of the 360 representations from the disentanglement\n\n5Note that the state space of the data generating distribution is very large: 10^6 factor combinations per panel\n\nand 14 panels for each instance yield more than 10^144 potential instances (minus invalid con\ufb01gurations).\n\n\fFigure 3: Rank correlation between various metrics and down-stream accuracy of the abstract visual\nreasoning models throughout training (i.e. for different number of samples).\n\nmodels. We then compare the average down-stream training accuracy of WReN with the BetaVAE\nscore, the FactorVAE score, MIG, the DCI Disentanglement score, and the Reconstruction error\nobtained by the decoder on the unsupervised learning task. 
As a sanity check, we also compare with\nthe accuracy of a Gradient Boosted Tree (GBT10000) ensemble and a Logistic Regressor (LR10000)\non single factor classi\ufb01cation (averaged across factors) as measured on 10K samples. As expected, we\nobserve a positive correlation between the performance of the WReN and the classi\ufb01ers (see Figure 3).\n\nDifferences in Disentanglement Metrics Figure 3 displays the rank correlation (Spearman) be-\ntween these metrics and the down-stream classi\ufb01cation accuracy, evaluated after training for 1K, 2K,\n5K, 10K, 20K, 50K, and 100K steps. If we focus on the disentanglement metrics, several interesting\nobservations can be made. In the few-sample regime (up to 20K steps) and across both data sets\nit can be seen that both the BetaVAE score, and the FactorVAE score are highly correlated with\ndown-stream accuracy. The DCI Disentanglement score is correlated slightly less, while the MIG and\nSAP score exhibit a relatively weak correlation.\nThese differences between the different disentanglement metrics are perhaps not surprising, as they\nare also re\ufb02ected in their overall correlation (see Figure 8 in Appendix B). Note that the BetaVAE\nscore, and the FactorVAE score directly measure the effect of intervention, i.e. what happens to\nthe representation if all factors but one are varied, which is expected to be bene\ufb01cial in ef\ufb01ciently\ncomparing the content of two representations as required for this task. Similarly, it may be that MIG\nand SAP score have a more dif\ufb01cult time in differentiating representations that are only partially\ndisentangled. Finally, we note that the best performing metrics on this task are mostly measuring\nmodularity, as opposed to compactness. 
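For reference, the reported correlations are Spearman rank correlations. The following self-contained sketch (with illustrative numbers, not values from our experiments) shows how such a correlation between metric scores and down-stream accuracies can be computed; ties are not handled and would require average ranks.

```python
import numpy as np

def spearman_rank_correlation(a, b):
    """Spearman's rho: the Pearson correlation of rank-transformed values.
    Assumes all values are distinct (no tie handling)."""
    ranks = lambda v: np.argsort(np.argsort(np.asarray(v)))
    ra, rb = ranks(a).astype(float), ranks(b).astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    return float(ra @ rb / np.sqrt((ra @ ra) * (rb @ rb)))

# toy example: a monotone relationship gives rho = 1 even when it is non-linear
metric_scores = [0.1, 0.4, 0.2, 0.9, 0.6]
downstream_acc = [0.50, 0.71, 0.55, 0.93, 0.80]  # illustrative numbers
assert abs(spearman_rank_correlation(metric_scores, downstream_acc) - 1.0) < 1e-9
```

Because only ranks matter, this statistic is insensitive to the different scales of the disentanglement metrics being compared.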
A more detailed overview of the correlation between the\nvarious metrics and down-stream accuracy can be seen in Figures 12 and 13 in Appendix B.\n\nDisentangled Representations in the Few-Sample Regime\nIf we compare the correlation of the\ndisentanglement metric with the highest correlation (FactorVAE) to that of the Reconstruction error\nin the few-sample regime, then we \ufb01nd that disentanglement correlates much better with down-stream\naccuracy. Indeed, while low Reconstruction error indicates that all information is available in the\nrepresentation (to reconstruct the image) it makes no assumptions about how this information is\nencoded. We observe strong evidence that disentangled representations yield better down-stream\naccuracy using relatively few samples, and we therefore conclude that they are indeed more sample\nef\ufb01cient compared to entangled representations in this regard.\nFigure 4 demonstrates the down-stream accuracy of the WReNs throughout training, binned into\nquartiles according to their degree of being disentangled as measured by the FactorVAE score\n(left), and in terms of Reconstruction error (right). It can be seen that representations that are more\ndisentangled give rise to better relative performance consistently throughout all phases of training. 
\fFigure 4: Down-stream accuracy of the WReN models throughout training, binned in quartiles based\non the values assigned by the FactorVAE score (left), and Reconstruction error (right).\n\nIf we group models according to their Reconstruction error, then we \ufb01nd that this (reversed) ordering is\nmuch less pronounced. An overview for all other metrics can be seen in Figures 14 and 15.\n\nDisentangled Representations in the Many-Sample Regime\nIn the many-sample regime (i.e.\nwhen training for 100K steps on batches of randomly drawn instances in Figure 3) we \ufb01nd that there\nis no longer a strong correlation between the scores assigned by the various disentanglement metrics\nand down-stream performance. This is perhaps not surprising as neural networks are general function\napproximators that, given access to enough labeled samples, are expected to overcome potential\ndif\ufb01culties in using entangled representations. The observation that Reconstruction error correlates\nmuch more strongly with down-stream accuracy in this regime further con\ufb01rms that this is the case.\nA similar observation can be made if we look at the\ndifference in down-stream accuracy between the top\nand bottom half of the models according to each\nmetric in Figures 5 and 16 (Appendix B). For all\ndisentanglement metrics, larger positive differences\nare observed in the few-sample regime that gradually\nreduce as more samples are observed. 
Meanwhile, the gap for Reconstruction error gradually increases upon seeing additional samples.

Differences in terms of Final Accuracy
In our final analysis we consider the rank correlation between down-stream accuracy and the various metrics, split according to the models' final accuracy. Figure 6 shows the rank correlation for the worst-performing fifty percent of the models after 100K steps (top), and for the best-performing fifty percent (bottom). While these results should be interpreted with care as the split depends on the final accuracy, we still observe interesting results: It can be seen that disentanglement (i.e. the FactorVAE score) remains strongly correlated with down-stream performance for both splits in the few-sample regime. At the same time, the benefit of lower Reconstruction error appears to be limited to the worst 50% of models. This is intuitive, as when the Reconstruction error is too high there may not be enough information present to solve the down-stream tasks. However, regarding the top-performing models (best 50%), it appears that the relative gains from further reducing reconstruction error are of limited use.

Figure 5: Difference in down-stream accuracy between top 50% and bottom 50%, according to various metrics on dSprites.

Figure 6: Rank correlation between various metrics and down-stream accuracy of the abstract visual reasoning models throughout training (i.e. for different number of samples).
The results in the top row are based on the worst 50% of the models (according to final accuracy), and those in the bottom row on the best 50% of the models. Columns correspond to different data sets.

5 Conclusion

In this work we investigated whether disentangled representations allow one to learn good models for non-trivial down-stream tasks with fewer samples. We created two abstract visual reasoning tasks based on existing data sets for which the ground-truth factors of variation are known. We trained a diverse set of 360 disentanglement models based on four state-of-the-art disentanglement approaches and evaluated their representations using 3600 abstract reasoning models. We observed compelling evidence that more disentangled representations are more sample-efficient in the considered down-stream learning task. We draw three main conclusions from these results: First, they provide concrete motivation for pursuing disentanglement as a property of learned representations in the unsupervised case.
Second, we still observed differences between disentanglement metrics, which should motivate further work in understanding what different properties they capture. None of the metrics achieved perfect correlation in the few-sample regime, which also suggests that it is not yet fully understood what makes one representation better than another in terms of learning. Third, it might be useful to extend the methodology in this study to other complex down-stream tasks, or to include an investigation of other purported benefits of disentangled representations.

Acknowledgments

The authors thank Adam Santoro, Josip Djolonga, Paulo Rauber and the anonymous reviewers for helpful discussions and comments. This research was partially supported by the Max Planck ETH Center for Learning Systems, a Google Ph.D. Fellowship (to Francesco Locatello), and the Swiss National Science Foundation (grant 200021_165675/1 to Jürgen Schmidhuber).
This work was\npartially done while Francesco Locatello was at Google Research.\n\nReferences\n[1] Alessandro Achille, Tom Eccles, Loic Matthey, Chris Burgess, Nicholas Watters, Alexander\nLerchner, and Irina Higgins. Life-long disentangled representation learning with cross-domain\nlatent homologies. In Advances in Neural Information Processing Systems, pages 9873\u20139883,\n2018.\n\n[2] Francis Bach and Michael Jordan. Kernel independent component analysis. Journal of Machine\n\nLearning Research, 3(7):1\u201348, 2002.\n\n[3] H. B. Barlow. Unsupervised learning. Neural Computation, 1(3):295\u2013311, 1989.\n[4] H. B. Barlow, T. P. Kaushal, and G. J. Mitchison. Finding minimum entropy codes. Neural\n\nComputation, 1(3):412\u2013423, 1989.\n\n[5] Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius\nZambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan\nFaulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint\narXiv:1806.01261, 2018.\n\n[6] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and\nnew perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798\u2013\n1828, 2013.\n\n[7] Yoshua Bengio, Yann LeCun, et al. Scaling learning algorithms towards AI. Large-scale Kernel\n\nMachines, 34(5):1\u201341, 2007.\n\n[8] Diane Bouchacourt, Ryota Tomioka, and Sebastian Nowozin. Multi-level variational autoen-\ncoder: Learning disentangled representations from grouped observations. In AAAI Conference\non Arti\ufb01cial Intelligence, 2018.\n\n[9] Christopher P Burgess, Irina Higgins, Arka Pal, Loic Matthey, Nick Watters, Guillaume Des-\njardins, and Alexander Lerchner. Understanding disentangling in \u03b2-vae. 
Neural Information\nProcessing Systems (NIPS) Workshop on Learning Disentangled Representations: From Percep-\ntion to Control, 2017.\n\n[10] Tian Qi Chen, Xuechen Li, Roger B Grosse, and David K Duvenaud.\n\nIsolating sources\nof disentanglement in vaes. In Advances in Neural Information Processing Systems, pages\n2610\u20132620, 2018.\n\n[11] Brian Cheung, Jesse A Livezey, Arjun K Bansal, and Bruno A Olshausen. Discovering hidden\n\nfactors of variation in deep networks. arXiv preprint arXiv:1412.6583, 2014.\n\n[12] Taco Cohen and Max Welling. Learning the irreducible representations of commutative lie\n\ngroups. In International Conference on Machine Learning, 2014.\n\n[13] Pierre Comon. Independent component analysis, a new concept? Signal Processing, 36(3):287\u2013\n\n314, 1994.\n\n[14] Zhiwei Deng, Rajitha Navarathna, Peter Carr, Stephan Mandt, Yisong Yue, Iain Matthews, and\nGreg Mori. Factorized variational autoencoders for modeling audience reactions to movies. In\nIEEE Conference on Computer Vision and Pattern Recognition, 2017.\n\n[15] Emily L Denton and Vighnesh Birodkar. Unsupervised learning of disentangled representations\n\nfrom video. In Advances in Neural Information Processing Systems, 2017.\n\n[16] Cian Eastwood and Christopher K. I. Williams. A framework for the quantitative evaluation of\ndisentangled representations. In International Conference on Learning Representations, 2018.\n[17] SM Ali Eslami, Nicolas Heess, Theophane Weber, Yuval Tassa, David Szepesvari, Geoffrey E\nHinton, et al. Attend, infer, repeat: Fast scene understanding with generative models. In\nAdvances in Neural Information Processing Systems, pages 3225\u20133233, 2016.\n\n10\n\n\f[18] Babak Esmaeili, Hao Wu, Sarthak Jain, Alican Bozkurt, N Siddharth, Brooks Paige, Dana H\nBrooks, Jennifer Dy, and Jan-Willem Meent. Structured disentangled representations. 
In The\n22nd International Conference on Arti\ufb01cial Intelligence and Statistics, pages 2525\u20132534, 2019.\n[19] Vincent Fortuin, Matthias H\u00fcser, Francesco Locatello, Heiko Strathmann, and Gunnar R\u00e4tsch.\nDeep self-organization: Interpretable discrete representation learning on time series. In Interna-\ntional Conference on Learning Representations, 2019.\n\n[20] Marco Fraccaro, Simon Kamronn, Ulrich Paquet, and Ole Winther. A disentangled recognition\nand nonlinear dynamics model for unsupervised learning. In Advances in Neural Information\nProcessing Systems, 2017.\n\n[21] Ross Goroshin, Michael F Mathieu, and Yann LeCun. Learning to linearize under uncertainty.\n\nIn Advances in Neural Information Processing Systems, 2015.\n\n[22] Klaus Greff, Rapha\u00ebl Lopez Kaufmann, Rishab Kabra, Nick Watters, Chris Burgess, Daniel\nZoran, Loic Matthey, Matthew Botvinick, and Alexander Lerchner. Multi-object representation\nlearning with iterative variational inference. In Proceedings of the 36th International Conference\non Machine Learning-Volume 97, 2019.\n\n[23] Klaus Greff, Antti Rasmus, Mathias Berglund, Tele Hao, Harri Valpola, and J\u00fcrgen Schmidhuber.\nTagger: Deep unsupervised perceptual grouping. In Advances in Neural Information Processing\nSystems, pages 4484\u20134492, 2016.\n\n[24] Klaus Greff, Sjoerd van Steenkiste, and J\u00fcrgen Schmidhuber. Neural expectation maximization.\n\nIn Advances in Neural Information Processing Systems, pages 6691\u20136701, 2017.\n\n[25] Luigi Gresele, Paul K. Rubenstein, Arash Mehrjou, Francesco Locatello, and Bernhard\nSch\u00f6lkopf. The incomplete rosetta stone problem: Identi\ufb01ability results for multi-view nonlinear\nica. In Conference on Uncertainty in Arti\ufb01cial Intelligence (UAI), 2019.\n\n[26] Irina Higgins, David Amos, David Pfau, Sebastien Racaniere, Loic Matthey, Danilo Rezende,\nand Alexander Lerchner. Towards a de\ufb01nition of disentangled representations. 
arXiv preprint\narXiv:1812.02230, 2018.\n\n[27] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick,\nShakir Mohamed, and Alexander Lerchner. beta-vae: Learning basic visual concepts with a\nconstrained variational framework. In International Conference on Learning Representations,\n2017.\n\n[28] Irina Higgins, Arka Pal, Andrei Rusu, Loic Matthey, Christopher Burgess, Alexander Pritzel,\nMatthew Botvinick, Charles Blundell, and Alexander Lerchner. Darla: Improving zero-shot\ntransfer in reinforcement learning. In Proceedings of the 34th International Conference on\nMachine Learning-Volume 70, pages 1480\u20131490. JMLR. org, 2017.\n\n[29] Irina Higgins, Nicolas Sonnerat, Loic Matthey, Arka Pal, Christopher P Burgess, Matko Bo\u0161njak,\nMurray Shanahan, Matthew Botvinick, Demis Hassabis, and Alexander Lerchner. SCAN:\nLearning hierarchical compositional visual concepts. In International Conference on Learning\nRepresentations, 2018.\n\n[30] Felix Hill, Adam Santoro, David Barrett, Ari Morcos, and Timothy Lillicrap. Learning to make\nanalogies by contrasting abstract relational structure. In International Conference on Learning\nRepresentations, 2019.\n\n[31] Geoffrey E Hinton, Alex Krizhevsky, and Sida D Wang. Transforming auto-encoders. In\n\nInternational Conference on Arti\ufb01cial Neural Networks, 2011.\n\n[32] S. Hochreiter and J. Schmidhuber. Feature extraction through LOCOCODE. Neural Computa-\n\ntion, 11(3):679\u2013714, 1999.\n\n[33] S. Hochreiter and J. Schmidhuber. Nonlinear ICA through low-complexity autoencoders. In\nProceedings of the 1999 IEEE International Symposium on Circuits ans Systems (ISCAS\u201999),\nvolume 5, pages 53\u201356. IEEE, 1999.\n\n[34] Jun-Ting Hsieh, Bingbin Liu, De-An Huang, Li F Fei-Fei, and Juan Carlos Niebles. Learning\nto decompose and disentangle representations for video prediction. 
In Advances in Neural Information Processing Systems, 2018.

[35] Wei-Ning Hsu, Yu Zhang, and James Glass. Unsupervised learning of disentangled and interpretable representations from sequential data. In Advances in Neural Information Processing Systems, pages 1878–1889, 2017.

[36] Aapo Hyvarinen and Hiroshi Morioka. Unsupervised feature extraction by time-contrastive learning and nonlinear ICA. In Advances in Neural Information Processing Systems, 2016.

[37] Aapo Hyvärinen and Petteri Pajunen. Nonlinear independent component analysis: Existence and uniqueness results. Neural Networks, 1999.

[38] Aapo Hyvarinen, Hiroaki Sasaki, and Richard E Turner. Nonlinear ICA using auxiliary variables and generalized contrastive learning. In International Conference on Artificial Intelligence and Statistics, 2019.

[39] Christian Jutten and Juha Karhunen. Advances in nonlinear blind source separation. In International Symposium on Independent Component Analysis and Blind Signal Separation, pages 245–256, 2003.

[40] Theofanis Karaletsos, Serge Belongie, and Gunnar Rätsch. Bayesian representation learning with oracle constraints. In International Conference on Learning Representations, 2016.

[41] Charles Kemp and Joshua B Tenenbaum. The discovery of structural form. Proceedings of the National Academy of Sciences, 105(31):10687–10692, 2008.

[42] Hyunjik Kim and Andriy Mnih. Disentangling by factorising. In International Conference on Machine Learning, pages 2654–2663, 2018.

[43] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.

[44] Diederik P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, 2014.

[45] Diederik P Kingma and Max Welling.
Auto-encoding variational bayes. In International\n\nConference on Learning Representations, 2014.\n\n[46] Jack Klys, Jake Snell, and Richard Zemel. Learning latent subspaces in variational autoencoders.\n\nIn Advances in Neural Information Processing Systems, 2018.\n\n[47] Tejas D Kulkarni, William F Whitney, Pushmeet Kohli, and Josh Tenenbaum. Deep convo-\nlutional inverse graphics network. In Advances in Neural Information Processing Systems,\n2015.\n\n[48] Abhishek Kumar, Prasanna Sattigeri, and Avinash Balakrishnan. Variational inference of\ndisentangled latent concepts from unlabeled observations. In International Conference on\nLearning Representations, 2018.\n\n[49] Brenden M Lake, Tomer D Ullman, Joshua B Tenenbaum, and Samuel J Gershman. Building\n\nmachines that learn and think like people. Behavioral and brain sciences, 40, 2017.\n\n[50] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436,\n\n2015.\n\n[51] Francesco Locatello, Gabriele Abbati, Thomas Rainforth, Stefan Bauer, Bernhard Sch\u00f6lkopf,\nand Olivier Bachem. On the fairness of disentangled representations. In Advances in Neural\nInformation Processing Systems 32, pages 14584\u201314597, 2019.\n\n[52] Francesco Locatello, Stefan Bauer, Mario Lucic, Sylvain Gelly, Bernhard Sch\u00f6lkopf, and\nOlivier Bachem. Challenging common assumptions in the unsupervised learning of disentangled\nrepresentations. In Proceedings of the 36th International Conference on Machine Learning-\nVolume 97, 2018.\n\n[53] Francesco Locatello, Michael Tschannen, Stefan Bauer, Gunnar R\u00e4tsch, Bernhard Sch\u00f6lkopf,\nand Olivier Bachem. Disentangling factors of variation using few labels. arXiv preprint\narXiv:1905.01258, 2019.\n\n[54] Francesco Locatello, Damien Vincent, Ilya Tolstikhin, Gunnar R\u00e4tsch, Sylvain Gelly, and\nBernhard Sch\u00f6lkopf. 
Competitive training of mixtures of independent deep generative models. International Conference on Learning Representations, Workshop Track, 2018.

[55] Romain Lopez, Jeffrey Regier, Michael I Jordan, and Nir Yosef. Information constraints on auto-encoding variational bayes. In Advances in Neural Information Processing Systems, pages 6114–6125, 2018.

[56] William Lotter, Gabriel Kreiman, and David Cox. Deep predictive coding networks for video prediction and unsupervised learning. In International Conference on Learning Representations, 2017.

[57] Michael F Mathieu, Junbo J Zhao, Aditya Ramesh, Pablo Sprechmann, and Yann LeCun. Disentangling factors of variation in deep representation using adversarial training. In Advances in Neural Information Processing Systems, 2016.

[58] Siddharth Narayanaswamy, T Brooks Paige, Jan-Willem Van de Meent, Alban Desmaison, Noah Goodman, Pushmeet Kohli, Frank Wood, and Philip Torr. Learning disentangled representations with semi-supervised deep generative models. In Advances in Neural Information Processing Systems, 2017.

[59] Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Elements of Causal Inference - Foundations and Learning Algorithms. Adaptive Computation and Machine Learning Series. MIT Press, 2017.

[60] D Raposo, A Santoro, DGT Barrett, R Pascanu, T Lillicrap, and P Battaglia. Discovering objects and their relations from entangled scene representations. International Conference on Learning Representations, Workshop Track, 2017.

[61] John C Raven. Standardization of progressive matrices, 1938. British Journal of Medical Psychology, 19(1):137–150, 1941.

[62] Scott Reed, Kihyuk Sohn, Yuting Zhang, and Honglak Lee. Learning to disentangle factors of variation with manifold interaction. In International Conference on Machine Learning, 2014.

[63] Karl Ridgeway and Michael C Mozer. Learning deep disentangled embeddings with the f-statistic loss.
In Advances in Neural Information Processing Systems, pages 185\u2013194, 2018.\n[64] Adri\u00e0 Ruiz, Oriol Martinez, Xavier Binefa, and Jakob Verbeek. Learning disentangled repre-\nsentations with reference-based variational autoencoders. arXiv preprint arXiv:1901.08534,\n2019.\n\n[65] Adam Santoro, Felix Hill, David Barrett, Ari Morcos, and Timothy Lillicrap. Measuring\nabstract reasoning in neural networks. In International Conference on Machine Learning, pages\n4477\u20134486, 2018.\n\n[66] Adam Santoro, David Raposo, David G Barrett, Mateusz Malinowski, Razvan Pascanu, Peter\nBattaglia, and Timothy Lillicrap. A simple neural network module for relational reasoning. In\nAdvances in neural information processing systems, pages 4967\u20134976, 2017.\n\n[67] J. Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85\u2013117,\n\n2015. Published online 2014; 888 references; based on TR arXiv:1404.7828 [cs.NE].\n\n[68] J. Schmidhuber, M. Eldracher, and B. Foltin. Semilinear predictability minimization produces\n\nwell-known feature detectors. Neural Computation, 8(4):773\u2013786, 1996.\n\n[69] J\u00fcrgen Schmidhuber. Learning factorial codes by predictability minimization. Neural Computa-\n\ntion, 4(6):863\u2013879, 1992.\n\n[70] Xander Steenbrugge, Sam Leroux, Tim Verbelen, and Bart Dhoedt. Improving generalization\nfor abstract reasoning tasks using disentangled feature representations. Neural Information\nProcessing Systems (NeurIPS) Workshop on Relational Representation Learning, Montr\u00e9al,\nCanada., 2018.\n\n[71] Raphael Suter, Djordje Miladinovic, Bernhard Sch\u00f6lkopf, and Stefan Bauer. Robustly disen-\ntangled causal mechanisms: Validating deep representations for interventional robustness. In\nInternational Conference on Machine Learning, pages 6056\u20136065, 2019.\n\n[72] Joshua B Tenenbaum and William T Freeman. 
Separating style and content with bilinear models. Neural Computation, 12(6):1247–1283, 2000.

[73] Michael Tschannen, Olivier Bachem, and Mario Lucic. Recent advances in autoencoder-based representation learning. Neural Information Processing Systems (NeurIPS) Workshop on Bayesian Deep Learning, Montreal, Canada, 2018.

[74] Sjoerd van Steenkiste, Michael Chang, Klaus Greff, and Jürgen Schmidhuber. Relational neural expectation maximization: Unsupervised discovery of objects and their interactions. In International Conference on Learning Representations, 2018.

[75] Sjoerd van Steenkiste, Karol Kurach, and Sylvain Gelly. A case for object compositionality in deep generative models of images. Neural Information Processing Systems (NeurIPS) Workshop on Modeling the Physical World: Learning, Perception, and Control, 2018.

[76] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pages 1096–1103. ACM, 2008.

[77] Satosi Watanabe. Information theoretical analysis of multivariate correlation. IBM Journal of Research and Development, 4(1):66–82, 1960.

[78] William F Whitney, Michael Chang, Tejas Kulkarni, and Joshua B Tenenbaum. Understanding visual concepts with continuation learning. International Conference on Learning Representations, Workshop Track, 2016.

[79] Jimei Yang, Scott E Reed, Ming-Hsuan Yang, and Honglak Lee. Weakly-supervised disentangling with recurrent transformations for 3D view synthesis. In Advances in Neural Information Processing Systems, 2015.

[80] Li Yingzhen and Stephan Mandt.
Disentangled sequential autoencoder. In International Conference on Machine Learning, 2018.