{"title": "Life-Long Disentangled Representation Learning with Cross-Domain Latent Homologies", "book": "Advances in Neural Information Processing Systems", "page_first": 9873, "page_last": 9883, "abstract": "Intelligent behaviour in the real-world requires the ability to acquire new knowledge from an ongoing sequence of experiences while preserving and reusing past knowledge. We propose a novel algorithm for unsupervised representation learning from piece-wise stationary visual data: Variational Autoencoder with Shared Embeddings (VASE). Based on the Minimum Description Length principle, VASE automatically detects shifts in the data distribution and allocates spare representational capacity to new knowledge, while simultaneously protecting previously learnt representations from catastrophic forgetting. Our approach encourages the learnt representations to be disentangled, which imparts a number of desirable properties: VASE can deal sensibly with ambiguous inputs, it can enhance its own representations through imagination-based exploration, and most importantly, it exhibits semantically meaningful sharing of latents between different datasets. Compared to baselines with entangled representations, our approach is able to reason beyond surface-level statistics and perform semantically meaningful cross-domain inference.", "full_text": "Life-Long Disentangled Representation Learning with Cross-Domain Latent Homologies\n\nAlessandro Achille, Tom Eccles, Loic Matthey, Christopher P Burgess, Nick Watters, Alexander Lerchner, Irina Higgins\n\nUCLA, DeepMind\nachille@cs.ucla.edu, {eccles,lmatthey,cpburgess,nwatters,lerchner,irinah}@google.com\n\nAbstract\n\nIntelligent behaviour in the real-world requires the ability to acquire new knowledge from an ongoing sequence of experiences while preserving and reusing past knowledge. 
We propose a novel algorithm for unsupervised representation learning from piece-wise stationary visual data: Variational Autoencoder with Shared Embeddings (VASE). Based on the Minimum Description Length principle, VASE automatically detects shifts in the data distribution and allocates spare representational capacity to new knowledge, while simultaneously protecting previously learnt representations from catastrophic forgetting. Our approach encourages the learnt representations to be disentangled, which imparts a number of desirable properties: VASE can deal sensibly with ambiguous inputs, it can enhance its own representations through imagination-based exploration, and most importantly, it exhibits semantically meaningful sharing of latents between different datasets. Compared to baselines with entangled representations, our approach is able to reason beyond surface-level statistics and perform semantically meaningful cross-domain inference.\n\n1 Introduction\n\nA critical feature of biological intelligence is its capacity for life-long learning [10] – the ability to acquire new knowledge from a sequence of experiences to solve progressively more tasks, while maintaining performance on previous ones. This, however, remains a serious challenge for current deep learning approaches. While current methods are able to outperform humans on many individual problems [53, 37, 20], these algorithms suffer from catastrophic forgetting [14, 34, 35, 43, 17]. Training on a new task or environment can be enough to degrade their performance from super-human to chance level [47]. Another critical aspect of life-long learning is the ability to sensibly reuse previously learnt representations in new domains (positive transfer). For example, knowing that strawberries and bananas are not edible when they are green could be useful when deciding whether to eat a green peach in the future. 
Finding semantic homologies between visually distinctive domains can remove the need to learn from scratch on every new environment and hence help with data efficiency – another major drawback of current deep learning approaches [16, 30].\n\nBut how can an algorithm maximise the informativeness of the representation it learns on one domain for positive transfer on other domains without knowing a priori what experiences are to come? One approach might be to capture the important structure of the current environment in a maximally compact way (to preserve capacity for future learning). Such learning is likely to result in positive transfer if future training domains share some structural similarity with the old ones. This is a reasonable expectation to have for most natural (non-adversarial) tasks and environments, since they tend to adhere to the structure of the real world (e.g. relate to objects and their properties) governed by the consistent rules of chemistry or physics. A similar motivation underlies the Minimum Description Length (MDL) principle [45] and disentangled representation learning [8].\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\nFigure 1: A: Schematic representation of the life-long learning data distribution. Each dataset/environment corresponds to a cluster s. Data samples x constituting each cluster can be described by a local set of coordinates (data generative factors z_n). Different clusters may share some data generative factors. B: VASE model architecture. C: Schematic of the “dreaming” feedback loop. We use a snapshot of the model with the old parameters (φ_old, θ_old) to generate an imaginary batch of data x_old for a previously experienced dataset s_old. 
While learning in the current environment, we ensure that the representation is still consistent on the hallucinated “dream” data, and can reconstruct it (see red dashed lines).\n\nRecent state of the art approaches to unsupervised disentangled representation learning [21, 9, 25, 29] use a modified Variational AutoEncoder (VAE) [27, 44] framework to learn a representation of the data generative factors. These approaches, however, only work on independent and identically distributed (IID) data from a single visual domain. This paper extends this line of work to life-long learning from piece-wise stationary data, exploiting this setting to learn shared representations across domains where applicable. The proposed Variational Autoencoder with Shared Embeddings (VASE, see fig. 1B) automatically detects shifts in the training data distribution and uses this information to allocate spare latent capacity to novel dataset-specific disentangled representations, while reusing previously acquired representations where applicable. We use latent masking and a generative “dreaming” feedback loop (similar to [42, 51, 50, 5]) to avoid catastrophic forgetting. Our approach outperforms [42], the only other VAE-based approach to life-long learning we are aware of. Furthermore, we demonstrate that the pressure to disentangle endows VASE with a number of useful properties: 1) dealing sensibly with ambiguous inputs; 2) learning richer representations through imagination-based exploration; 3) performing semantically meaningful cross-domain inference by ignoring irrelevant aspects of surface-level appearance.\n\n2 Related work\n\nThe existing approaches to continual learning can be broadly separated into three categories: data-, architecture- or weights-based. 
The data-based approaches augment the training data on a new task with the data collected from the previous tasks, allowing for simultaneous multi-task learning on IID data [11, 46, 43, 34, 15]. The architecture-based approaches dynamically augment the network with new task-specific modules, which often share intermediate representations to encourage positive transfer [47, 40, 48, 49]. Both of these types of approaches, however, are inefficient in terms of the memory requirements once the number of tasks becomes large. The weights-based approaches do not require data or model augmentation. Instead, they prevent catastrophic forgetting by slowing down learning in the weights that are deemed to be important for the previously learnt tasks [28, 55, 39]. This is a promising direction; however, its application is limited by the fact that it typically uses knowledge of the task presentation schedule to update the loss function after each switch in the data distribution.\n\nMost of the continual learning literature, including all of the approaches discussed above, has been developed in task-based settings, where representations are learnt implicitly. While deep networks learn well in such settings [1, 52], this often comes at a cost of reduced positive transfer. This is because the implicitly learnt representations often overfit to the training task by discarding information that is irrelevant to the current task but may be required for solving future tasks [1, 2, 3, 52, 22]. The acquisition of useful representations of complex high-dimensional data without task-based overfitting is a core goal of unsupervised learning. Past work [2, 4, 21] has demonstrated the usefulness of information-theoretic methods in such settings. These approaches can broadly be seen as efficient implementations of the Minimum Description Length (MDL) principle for unsupervised learning [45, 18]. 
The representations learnt through such methods have been shown to help in transfer scenarios and with data efficiency for policy learning in the Reinforcement Learning (RL) context [22]. These approaches, however, do not immediately generalise to non-stationary data. Indeed, life-long unsupervised representation learning is relatively under-developed [51, 50, 39]. The majority of recent work in this direction has concentrated on implicit generative models [51, 50], or non-parametric approaches [36]. Since these approaches do not possess an inference mechanism, they are unlikely to be useful for subsequent task or policy learning. Furthermore, none of the existing approaches explicitly investigate meaningful sharing of latent representations between environments.\n\n3 Framework\n\n3.1 Problem formalisation\n\nWe assume that there is an a priori unknown set S = {s_1, s_2, ..., s_K} of K environments which, between them, share a set Z = {z_1, z_2, ..., z_N} of N independent data generative factors. We assume z ∼ N(0, I). Since we aim to model piece-wise stationary data, it is reasonable to assume s ∼ Cat(π_{1,...,K}), where π_k is the probability of observing environment s_k. Two environments may use the same generative factors but render them differently, or they may use a different subset of factors altogether. Given an environment s, and an environment-dependent subset Z^s ⊆ Z of the ground truth generative factors, it is possible to synthesise a dataset of images x^s ∼ p(·|z^s, s). In order to keep track of which subset of the N data generative factors is used by each environment s to generate images x^s, we introduce an environment-dependent mask a^s with dimensionality |a| = N, where a^s_n = 1 if z_n ∈ Z^s and zero otherwise. A similar masking has also been used by [24] to enforce disentanglement in a single environment, but assuming additional side-information about the generative factors. 
Hence, we assume a^s ∼ Bern(ω^s_{1,...,N}), where ω^s_n is the probability that factor z_n is used in environment s. This leads to the following generative process (where “⊙” is element-wise multiplication):\n\ns ∼ Cat(π_{1,...,K}),  z ∼ N(0, I),  a^s ∼ Bern(ω^s_{1,...,N}),  z^s = a^s ⊙ z,  x^s ∼ p(·|z^s, s)  (1)\n\nIntuitively, we assume that the piece-wise stationary observed data x can be split into clusters (environments s) (note evidence for similar experience clustering from the animal literature [6]). Each cluster has a set of standard coordinate axes (a subset of the generative factors z chosen by the latent mask a^s) that can be used to parametrise the data in that cluster (fig. 1A). Given a sequence x = (x^{s_1}, x^{s_2}, ...) of datasets generated according to the process in eq. (1), where s_k ∼ p(s) is the k-th sample of the environment, the aim of life-long representation learning can be seen as estimating the full set of generative factors Z ≈ ∪_k q(z^{s_k}|x^{s_k}) from the environment-specific subsets of z inferred on each stationary data cluster x^{s_k}. Henceforth, we will drop the subscript k for simplicity of notation.\n\n3.2 Inferring the data generative factors\n\nObservations x^s cannot contain information about the generative factors z_n that are not relevant for the environment s. Hence, we use the following form for representing the data generative factors:\n\nq_φ(z^s|x^s) = a^s ⊙ N(μ_φ(x), σ_φ(x)) + (1 − a^s) ⊙ N(0, I).  (2)\n\nNote that μ and σ in eq. (2) depend only on the data x and not on the environment s. This is important to ensure that the semantic meaning of each latent dimension z_n remains consistent for different environments s. 
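As a concrete illustration, the generative process in eq. (1) can be sketched in a few lines of NumPy. The renderer p(·|z^s, s) is unknown in practice, so a `render` stand-in (our own placeholder, not part of the model) is passed in:

```python
import numpy as np

rng = np.random.default_rng(0)

K, N = 3, 8                       # number of environments, number of generative factors
pi = np.full(K, 1.0 / K)          # p(s) = Cat(pi_1..K)
omega = rng.uniform(size=(K, N))  # omega[s, n]: probability that factor z_n is used in env s

def sample(render):
    """One draw from the generative process in eq. (1).

    `render` stands in for the unknown environment-dependent renderer p(x | z^s, s).
    """
    s = rng.choice(K, p=pi)        # s ~ Cat(pi)
    z = rng.standard_normal(N)     # z ~ N(0, I)
    a = rng.binomial(1, omega[s])  # a^s ~ Bern(omega^s)
    z_s = a * z                    # z^s = a^s ⊙ z (element-wise product)
    x = render(z_s, s)             # x^s ~ p(· | z^s, s)
    return s, a, z_s, x

# toy renderer: the "image" is just the masked latents plus pixel noise
s, a, z_s, x = sample(lambda z_s, s: z_s + 0.01 * rng.standard_normal(N))
```

Note how dimensions with a^s_n = 0 are zeroed out of z^s, so the rendered x^s carries no information about them, which is what motivates the masked posterior of eq. (2).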
We model the representation q(z^s|x^s) of the data generative factors as a product of independent normal distributions to match the assumed prior p(z) = N(0, I).\n\nIn order to encourage the representation q(z^s|x^s) to be semantically meaningful, we encourage it to capture the generative factors of variation within the data x^s by following the MDL principle. We aim to find a representation z^s that minimises the reconstruction error of the input data x^s conditioned on z^s under a constraint on the quantity of information in z^s. This leads to the following loss function:\n\nL_MDL(φ, θ) = E_{z^s ∼ q_φ(·|x^s)}[−log p_θ(x|z^s, s)] + γ |KL(q_φ(z^s|x^s) ‖ p(z)) − C|,  (3)\n\nwhere the first term is the reconstruction error, the KL term is the representation capacity, and C is its target. The loss in eq. (3) is closely related to the β-VAE [21] objective L = E_{z ∼ q_φ(·|x)}[−log p_θ(x|z)] + β KL(q_φ(z|x) ‖ p(z)), which uses a Lagrangian to limit the latent bottleneck capacity, rather than an explicit target C. It was shown that optimising the β-VAE objective helps with learning a more semantically meaningful disentangled representation q(z|x) of the data generative factors [21]. However, [9] showed that progressively increasing the target capacity C in eq. (3) throughout training further improves the disentanglement results reported in [21], while simultaneously producing sharper reconstructions. Progressive increase of the representational capacity also seems intuitively better suited to continual learning, where new information is introduced in a sequential manner. Hence, VASE optimises the objective function in eq. (3) over a sequence of datasets x^s. This, however, requires a way to infer s and a^s, as discussed next.\n\n3.3 Inferring the latent mask\n\nGiven a dataset x^s, we want to infer which latent dimensions z_n were used in its generative process (see eq. (1)). This serves multiple purposes: 1) it helps identify the environment s (see next section); 2) it helps ignore latent factors z_n that encode useful information in some environment but are not used in the current environment s, in order to prevent retraining and subsequent catastrophic forgetting; and 3) it promotes latent sharing between environments. Remember that eq. (3) indirectly optimises for E_{x^s}[q(z^s|x^s)] ≈ p(z) after training on a dataset s. If a new dataset uses the same generative factors as x^s, then the marginal behaviour of the corresponding latent dimensions z_n will not change. On the other hand, if a latent dimension encodes a data generative factor that is irrelevant to the new dataset, then it will start behaving atypically and stray away from the prior. We capture this intuition by defining the atypicality score α_n for each latent dimension z_n on a batch of data x^s_batch:\n\nα_n = KL( E_{x^s_batch}[q(z^s_n|x^s_batch)] ‖ p(z_n) ).  (4)\n\nThe atypical components are unlikely to be relevant to the current environment, so we mask them out:\n\na^s_n = 1 if α_n < δ, and 0 otherwise,  (5)\n\nwhere δ is a threshold hyperparameter (see appendices A.2 and A.3 for more details). Note that the uninformative latent dimensions z_n that have not yet learnt to represent any data generative factors, i.e. q(z_n|x^s) = p(z_n), are automatically unmasked in this setup (as they will have α_n ≈ 0). This allows them to be available as spare latent capacity to learn new generative factors when exposed to a new dataset. Fig. 2 (bottom third panel) shows the sharp changes in α_n at dataset boundaries during training.\n\n3.4 Inferring the environment\n\nGiven the generative process introduced in eq. (1), it may be tempting to treat the environment s as a discrete latent variable and learn it through amortised variational inference. However, we found that in the continual learning scenario this is not a viable strategy. 
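A minimal sketch of the atypicality test of eqs. (4)–(5), under the simplifying assumption that the aggregated batch posterior is summarised by a moment-matched diagonal Gaussian (the paper's exact estimator is deferred to appendix A.2):

```python
import numpy as np

def atypicality(z_batch):
    """alpha_n = KL( E_x[q(z_n|x)] || p(z_n) ) with p(z_n) = N(0, 1), eq. (4).

    The aggregated posterior over the batch is approximated here by fitting a
    Gaussian to posterior samples z_batch of shape (batch, n_latents).
    """
    mu = z_batch.mean(axis=0)
    var = z_batch.var(axis=0) + 1e-8
    # per-dimension KL( N(mu, var) || N(0, 1) )
    return 0.5 * (var + mu ** 2 - 1.0 - np.log(var))

def latent_mask(z_batch, delta=0.5):
    """a^s_n = 1 if alpha_n < delta else 0, eq. (5); delta is a threshold."""
    return (atypicality(z_batch) < delta).astype(int)

rng = np.random.default_rng(0)
typical = rng.standard_normal((512, 4))               # behaves like the prior -> kept
atypical = 3.0 + 0.1 * rng.standard_normal((512, 2))  # strays from the prior -> masked out
mask = latent_mask(np.concatenate([typical, atypical], axis=1))  # [1, 1, 1, 1, 0, 0]
```

Dimensions whose batch marginal still matches N(0, 1) (including as-yet-unused spare capacity) get α_n ≈ 0 and stay unmasked, while dimensions that drift from the prior are cut out, exactly the behaviour described above.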
Parametric learning is slow, yet we have to infer each new data cluster s extremely fast to avoid catastrophic forgetting. Hence, we opt for a fast non-parametric meta-algorithm motivated by the following intuition. Having already experienced r datasets during life-long learning, there are two choices when it comes to inferring the current one s: it is either a new dataset s_{r+1}, or it is one of the r datasets encountered in the past. Intuitively, one way to check for the former is to see whether the current data x^s seems likely under any of the previously seen environments. This condition on its own is not sufficient though. First, it is possible that environment s uses a subset of the generative factors used by another environment, Z^s ⊆ Z^t, in which case environment t will explain the data x^s well, yet it will be an incorrect inference. Hence, we have to ensure that the subset of the relevant generative factors z^s inferred for the current data x^s according to section 3.3 matches that of the candidate past dataset t. Second, finding a past dataset that matches the current one on the subset of the relevant generative factors without checking the reconstruction accuracy is not sufficient. For example, an environment with a moving square should not be classified as being the same as the environment with a moving triangle, despite the two environments sharing the same generative factors (the object position). 
Hence the reconstruction error should be involved in the inference.\n\nGiven the considerations above, we infer the environment s for a batch x^s_batch according to:\n\ns = ŝ, if E_{z^ŝ}[−log p_θ(x^s_batch|z^ŝ, ŝ)] ≤ κ L_ŝ and a^s = a^ŝ; otherwise s = s_{r+1},  (6)\n\nwhere ŝ = argmax_s q(s|x^s_batch) is the output of an auxiliary classifier trained to infer the most likely previously experienced environment ŝ given the current batch x^s_batch, L_ŝ is the average reconstruction error observed for the environment ŝ when it was last experienced, and κ is a threshold hyperparameter (see appendix A.2 for details).\n\n3.5 Preventing catastrophic forgetting\n\nSo far we have discussed how VASE integrates knowledge from the current environment into its representation q(z|x), but we haven’t yet discussed how we ensure that past knowledge is not forgotten in the process. Most standard approaches to preventing catastrophic forgetting discussed in section 2 are either not applicable to a variational context, or do not scale well due to memory requirements. However, thanks to learning a generative model of the observed environments, we can prevent catastrophic forgetting by periodically hallucinating (i.e. generating samples) from past environments using a snapshot of VASE, and making sure that the current version of VASE is still able to model these samples. A similar “dreaming” feedback loop was used in [42, 51, 50, 5].\n\nMore formally, we follow the generative process in eq. (1) to create a batch of samples x_old ∼ p_{θ_old}(·|z, s_old) using a snapshot of VASE with parameters (φ_old, θ_old) (see fig. 1C). 
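The non-parametric assignment rule of eq. (6) can be sketched as follows; the classifier probabilities, reconstruction error, and mask are stand-ins for the components described above, and the function names are ours, not the paper's:

```python
import numpy as np

def infer_environment(s_probs, recon_err, current_mask, masks, past_errors, kappa):
    """Sketch of eq. (6): assign a batch to a previously seen environment or a new one.

    s_probs      -- q(s | x_batch), auxiliary-classifier output over the r past environments
    recon_err    -- E_z[-log p(x_batch | z, s_hat)] under the candidate environment s_hat
    current_mask -- latent mask a^s inferred for the batch (section 3.3)
    masks        -- stored masks a^s for the r past environments
    past_errors  -- average reconstruction errors L_s recorded for the past environments
    kappa        -- threshold hyperparameter
    """
    r = len(past_errors)
    if r == 0:
        return 0  # nothing seen yet: open the first cluster
    s_hat = int(np.argmax(s_probs))  # most likely past environment
    if recon_err <= kappa * past_errors[s_hat] and np.array_equal(current_mask, masks[s_hat]):
        return s_hat                 # both tests pass: same environment as before
    return r                         # otherwise allocate a new cluster s_{r+1}

# batch looks like past environment 0 and reuses its latent mask -> returns 0
env = infer_environment(np.array([0.9, 0.1]), recon_err=1.0,
                        current_mask=np.array([1, 0]),
                        masks=[np.array([1, 0]), np.array([0, 1])],
                        past_errors=[1.2, 0.8], kappa=1.1)
```

Both guards of eq. (6) appear explicitly: the reconstruction-error test rejects a visually different environment that merely shares factors, and the mask-equality test rejects an environment that only explains a subset of the factors.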
We then update the current version of VASE according to the following objective (writing the snapshot parameters with the subscript “old”):\n\nL_past(φ, θ) = E_{z, s_old, x_old}[ D[q_φ(z|x_old), q_{φ_old}(z|x_old)] + D[p_θ(x|z, s_old), p_{θ_old}(x_old|z, s_old)] ],  (7)\n\nwhere the first term is the encoder proximity, the second is the decoder proximity, and D is a distance between two distributions. For the decoder, which is a product of Bernoulli random variables, we use the KL divergence as the distance D. For the Gaussian encoder, we tried both the KL divergence and the Wasserstein distance W = ‖μ_0 − μ_1‖² + ‖Σ_0^{1/2} − Σ_1^{1/2}‖². We did not observe significant differences between the two distance metrics for the majority of the hyperparameter settings. However, we found the gradients of the Wasserstein distance W to be better behaved. Hence, we use the Wasserstein distance for the encoder and the KL distance for the decoder in all experiments. The snapshot parameters get synced to the current trainable parameters, φ_old ← φ and θ_old ← θ, every τ training steps, where τ is a hyperparameter. The expectation over environments s_old and latents z in eq. (7) is done using Monte Carlo sampling (see appendix A.2 for details).\n\n3.6 Model summary\n\nTo summarise, we train our model using a meta-algorithm with both parametric and non-parametric components. The latter is needed to quickly associate new experiences to an appropriate cluster, so that learning can happen inside the current experience cluster, without disrupting unrelated clusters. We initialise the latent representation z to have at least as many dimensions as the total number of the data generative factors, |z| ≥ |Z| = N, and the softmax layer of the auxiliary environment classifier to be at least as large as the number of datasets, |S| = K. As we observe the sequence of training data, we detect changes in the environment and dynamically update the internal estimate of r ≤ K datasets experienced so far according to eq. (6). 
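A sketch of one Monte Carlo sample of the consistency term in eq. (7), assuming a diagonal Gaussian encoder and a Bernoulli-product decoder; the encoder/decoder callables are stand-ins for the trained networks:

```python
import numpy as np

def wasserstein2_diag(mu0, var0, mu1, var1):
    # W = ||mu0 - mu1||^2 + ||var0^(1/2) - var1^(1/2)||^2 for diagonal Gaussians
    return np.sum((mu0 - mu1) ** 2) + np.sum((np.sqrt(var0) - np.sqrt(var1)) ** 2)

def kl_bernoulli(p, q, eps=1e-7):
    # KL between products of Bernoullis with mean vectors p and q
    p = np.clip(p, eps, 1 - eps)
    q = np.clip(q, eps, 1 - eps)
    return np.sum(p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q)))

def dreaming_loss(encoder, encoder_old, decoder, decoder_old, z, s_old, x_old):
    """One sample of eq. (7): keep the current encoder and decoder close to the
    snapshot on hallucinated data x_old from a past environment s_old.

    encoder(x) -> (mu, var) of q(z|x); decoder(z, s) -> Bernoulli means of p(x|z, s).
    """
    mu, var = encoder(x_old)
    mu_o, var_o = encoder_old(x_old)
    enc_term = wasserstein2_diag(mu, var, mu_o, var_o)                 # encoder proximity
    dec_term = kl_bernoulli(decoder(z, s_old), decoder_old(z, s_old))  # decoder proximity
    return enc_term + dec_term
```

With identical current and snapshot networks the loss is exactly zero, so the term only penalises drift away from what the snapshot already knows about past environments.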
We then train VASE by minimising the following objective function:\n\nL(φ, θ) = E_{z^s ∼ q_φ(·|x^s)}[−log p_θ(x|z^s, s)] + γ |KL(q_φ(z^s|x^s) ‖ p(z)) − C| + E_{z, s_old, x_old}[ D[q_φ(z|x_old), q_{φ_old}(z|x_old)] + D[p_θ(x|z, s_old), p_{θ_old}(x_old|z, s_old)] ],  (8)\n\nwhere the first two terms implement MDL on the current data and the last term is the “dreaming” feedback on past data.\n\n4 Experiments\n\nContinual learning with disentangled shared latents First, we qualitatively assess whether VASE is able to learn good representations in a continual learning setup. We use a sequence of three datasets: (1) a moving version of Fashion-MNIST [54] (shortened to moving Fashion), (2) MNIST [31], and (3) a moving version of MNIST (moving MNIST). During training we expect VASE to detect shifts in the data distribution and dynamically create new experience clusters s, learn a disentangled representation of each environment without forgetting past environments, and share disentangled factors between environments in a semantically meaningful way. Fig. 2 (top) compares the performance of VASE to that of Controlled Capacity Increase-VAE (CCI-VAE) [9], a model for disentangled representation learning with the same architecture as VASE but without the modifications introduced in this paper to allow for continual learning. It can be seen that unlike VASE, CCI-VAE forgot moving Fashion at the end of the training sequence. Both models were able to disentangle position from object identity; however, only VASE was able to meaningfully share latents between the different datasets – the two positional latents are active for the two moving datasets but not for the static MNIST. VASE also has moving Fashion- and MNIST-specific latents, while CCI-VAE shares all latents between all datasets. VASE used only 8/24 latent dimensions at the end of training. The rest remained as spare capacity for learning on future datasets.\n\nFigure 2: We compare VASE to a CCI-VAE baseline. Both are trained on a sequence of three datasets: moving Fashion-MNIST (moving Fashion) → MNIST → moving MNIST. Top: latent traversals at the end of training seeded with samples from the three datasets. The value of each latent z_n is traversed between -2 and 2 one at a time, and the corresponding reconstructions are shown. Rows correspond to latent dimensions z_n, columns correspond to the traversal values. Latent use progression throughout training is demonstrated in colour. Bottom: performance of MNIST and Fashion object classifiers and a position regressor trained on the latent space z throughout training. Note the relative stability of the curves for VASE compared to the baseline. The atypicality profile shows the values of α_n through training (different colours indicate different latent dimensions), with the threshold δ indicated by the dashed black line.\n\nFigure 3: Latent traversals (A) and classification accuracy (B) (both as in fig. 2) for VASE trained on a sequence of moving MNIST → Fashion → inverse Fashion → MNIST → moving Fashion. See fig. 8 for larger traversals.\n\nLearning representations for tasks We train object identity classifiers (one each for moving Fashion and MNIST) and an object position regressor on top of the latent representation z ∼ q(z|x) at regular intervals throughout the continual learning sequence. Good accuracy on these measures would indicate that at the point of measurement, the latent representation z contained dataset-relevant information, and hence could be useful, e.g. for subsequent policy learning in RL agents. Figure 2 (bottom) shows that both VASE and CCI-VAE learn progressively more informative latent representations when exposed to each dataset s, as evidenced by the increasing classification accuracy and decreasing mean squared error (MSE) measures within each stage of training. However, with CCI-VAE, the accuracy and MSE measures degrade sharply once a domain shift occurs. 
This is not\nthe case for VASE, which retains a relatively stable representation.\n\nAblation study Here we perform a full ablation study to test the importance of the proposed\ncomponents for unsupervised life-long representation learning: 1) regularisation towards disentangled\nrepresentations (section 3.2), 2) latent masking (section 3.3 - A), 3) environment clustering (section 3.4\n- S), and 4) \u201cdreaming\u201d feedback loop (section 3.5 - D). We use the constraint capacity loss in eq. (3)\nfor the disentangled experiments, and the standard VAE loss [27, 44] for the entangled experiments\n\n6\n\nMNISTFashion MNISTInverted Fashion MNISTMoving Fashion MNISTMoving MNISTLatents (ordered)AddedUnusedUnusedUnusedAddedAddedAddedReusedReusedReusedReusedReusedReusedFrozenFrozenFrozenFrozenFrozenObject ID Classification AccuracyAB\fDISENTANGLED\n\nENTANGLED\n\nPOSITION MSE\n\nCHANGE (*1E-4)\n\nPOSITION MSE\n\nCHANGE (*1E-4)\n\nOBJECT ID ACCURACY\nCHANGE (%) MIN (*1E-4)\nMAX (%)\n3.5 (\u00b10.05)\n-15.2 (\u00b12.8)\n88.6 (\u00b10.4)\n3.4 (\u00b10.05)\n-13.9 (\u00b11.9)\n88.9 (\u00b10.5)\n3.3 (\u00b10.04)\n-14.4 (\u00b11.9)\n88.6 (\u00b10.3)\n86.7 (\u00b11.9)\n-24.5 (\u00b11.0)\n3.3 (\u00b10.04)\n87.1 (\u00b11.8)\n-28.1 (\u00b10.08)\n3.3 (\u00b10.04)\n86.3 (\u00b12.5)\n-25.2 (\u00b10.5)\n3.3 (\u00b10.04)\n3.4 (\u00b10.05)\n-12.9 (\u00b11.9)\n88.3 (\u00b10.3)\n-5.4 (\u00b10.3)\n88.6 (\u00b10.4)\n3.2 (\u00b10.03)\n\nABLATION\n-\nS\nD\nA\nSA\nDA\nSD\nSD-[42]\nVASE (SDA)\nTable 1: Average change in classi\ufb01cation accuracy/MSE and maximum/minimum average accuracy/MSE when\ntraining an object/position classi\ufb01er/regressor on top of the learnt representation on the moving Fashion ! MNIST\n! moving MNIST sequence. We do a full ablation study of VASE, where D - dreaming feedback loop, S - cluster\ninference q(s|xs), and A - atypicality based latent mask as inference. 
We compare two versions of our model - one\nthat is encouraged to learn a disentangled representation through the capacity increase regularisation in eq. (3), and\nan entangled VAE baseline ( = 1). The unablated disentangled version of VASE (SDA) has the best performance.\n\nOBJECT ID ACCURACY\nCHANGE (%) MIN (*1E-4)\nMAX (%)\n4.2 (\u00b10.7)\n-12.1 (\u00b10.8)\n91.8 (\u00b10.4)\n4.5 (\u00b10.8)\n-12.2 (\u00b10.03)\n91.7 (\u00b10.4)\n4.3 (\u00b10.7)\n-12.4 (\u00b10.7)\n91.8 (\u00b10.4)\n88.6 (\u00b10.3)\n-19.7 (\u00b10.5)\n4.5 (\u00b10.7)\n89.9 (\u00b11.3)\n-18.3 (\u00b10.4)\n4.8 (\u00b10.7)\n88.8 (\u00b10.3)\n-19.4 (\u00b10.4)\n4.6 (\u00b10.7)\n4.3 (\u00b10.5)\n-11.7 (\u00b10.6)\n91.4 (\u00b10.3)\n91.9 (\u00b10.1)\n-11.6 (\u00b11.1)\n4.7 (\u00b10.8)\n91.5 (\u00b10.1)\n-6.5 (\u00b10.7)\n4.2 (\u00b10.4)\n\n24.8 (\u00b113.5)\n22.5 (\u00b112.2)\n21.4 (\u00b14.9)\n67.6 (\u00b1107.0)\n78.9 (\u00b1109.0)\n72.2 (\u00b190.0)\n20.0 (\u00b13.5)\n3.0 (\u00b10.2)\n\n-\n\n10.5 (\u00b12.6)\n10.9 (\u00b13.1)\n11.7 (\u00b13.2)\n47.1 (\u00b126.2)\n41.8 (\u00b120.6)\n40.2 (\u00b119.2)\n11.6 (\u00b11.9)\n10.2 (\u00b11.8)\n3.9 (\u00b11.1)\n\n-\n\n-\n\n-\n\nFigure 4: A Cross-domain reconstructions on NatLab (outdoors) or EDE (indoors) DM Lab levels. The disentan-\ngled VASE \ufb01nds semantic homologies between the two datasets (e.g. cacti ! red objects). The entangled VASE\nonly maps lower level statistics. B Cross-domain reconstructions of samples from moving Fashion into each of the\n\ufb01ve training datasets: moving MNIST (1) ! Fashion (2) ! inverse Fashion (3) ! MNIST (4) ! moving Fashion\n(5).\n\n[21]. For each condition we report the average change in the classi\ufb01cation metrics reported above,\nand the average maximum values achieved (see appendix A.6 for details). Table 1 shows that the\nunablated VASE (SDA) has the best performance. 
Note that the entangled baselines perform worse than the disentangled equivalents, and that the capacity constraint of the CCI-VAE framework does not significantly affect the maximal classification accuracy compared to the VAE. It is also worth noting that VASE outperforms the “SD-[42]” condition, which is similar to the only other VAE-based approach to continual learning that we are aware of, see [42]. The difference between the SD and SD-[42] conditions is that the latter also disables the decoder proximity term in eq. (7) to match the model setup in [42]. The only difference between the SD-[42] and the [42] approaches is that we do not use variational inference to learn the value of s, opting for a classification-based heuristic instead. This difference is motivated by [42]’s aim to compute a valid variational lower-bound in a life-long setting, while our aim is to learn semantically shared factors. Hence, we sacrifice the probabilistic framework for a better performing heuristic (see section 3.4 for more details). Our SD-[42] also does not use the Information Gain regularizer of [42], since it would not change the performance of the heuristic. We have also trained VASE on longer sequences of datasets (moving MNIST → Fashion → inverse Fashion → MNIST → moving Fashion) and found similar levels of performance (see fig. 3).\n\nSemantic transfer Here we test whether VASE can learn more sophisticated cross-domain latent homologies than the positional latents on the moving MNIST and Fashion datasets described above. Hence, we trained VASE on a sequence of two visually challenging DMLab-30¹ [7] datasets: the Exploit Deferred Effects (EDE) environment and a randomized version of the Natural Labyrinth (NatLab) environment (Varying Map Randomized). 
While being visually very distinct (one being indoors and the other outdoors), the two datasets share many data generative factors that have to do with the 3D geometry of the world (e.g. horizon, walls/terrain, objects/cacti) and the agent’s movements (first person optic flow). Hence, the two domains share many semantically related factors z, but these are rendered into very different visuals x. We compared cross-domain reconstructions of VASE and an equivalent entangled VAE (β = 1) baseline. \n\n¹ https://github.com/deepmind/lab/tree/master/game_scripts/levels/contributed/dmlab30#dmlab-30\n\nFigure 5: A: Top: Ambiguous input examples created by using different interpolation weights between samples from CelebA and Fashion, and corresponding inferred parameters μ (y axis) and σ (light colour range) of q(z|x); red corresponds to Fashion-specific latents, blue to CelebA-specific latents. Middle: Reconstruction samples p_θ(x^s|z^s, s) for different levels of ambiguity conditioned on either dataset. Bottom: Inferred q(s = CelebA) given different levels of input ambiguity (x axis) and different numbers of ambiguous vs real data samples (y axis) for the two datasets. VASE deals well with ambiguity, shows context-dependent categorical perception and uncertainty within its inferred representation parameters. B: Imagination-based exploration allows VASE to imagine the possibility of moving MNIST digits during static MNIST training by using position latents acquired on moving Fashion. This helps it learn a moving MNIST classifier during static MNIST training without ever seeing real translations of MNIST digits.\n\n
The reconstructions were produced by first inferring a latent representation based on a batch from one domain, e.g. z_NatLab ∼ q(·|x_NatLab), and then reconstructing it conditioned on the other domain: x_Rec ∼ p_θ(·|z_NatLab, s_EDE). Figure 4A shows that VASE discovered the latent homologies between the two domains, while the entangled baseline failed to do so. VASE learnt the semantic equivalence between the cacti in NatLab and the red objects in EDE, the brown fog corresponding to the edge of the NatLab world and the walls in EDE (top leftmost reconstruction), and the horizon lines in both domains. The entangled baseline, on the other hand, seemed to rely on the surface-level pixel statistics and hence struggled to produce meaningful cross-domain reconstructions, attempting to match the texture rather than the semantics of the other domain. Figure 4B demonstrates that VASE also learns to share semantically meaningful factors in the more challenging 5-dataset cross-domain reconstruction task (see Figure 8 and Figure 9 for more details). Note how the position inferred from the moving Fashion dataset is re-used when reconstructing the image as a moving MNIST digit. Furthermore, the clothing type inferred from the moving Fashion is largely shared with the static Fashion and the inverted Fashion datasets. This is not always perfect, however, which highlights one of the limits of our approach. Since the algorithm has only visual information to work with, the "semantics" can sometimes become entangled with the shallow visual statistics – e.g. the clothing categories in the normal and the inverted Fashion are sometimes confused due to spurious pixel-level similarities.
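The cross-domain reconstruction probe just described can be sketched as follows. This is a minimal illustration only, not the authors' implementation: `encoder` and `decoder` are hypothetical toy stand-ins for the learnt inference network q(z|x) and the dataset-conditioned decoder p_θ(x|z, s).

```python
import numpy as np

# Toy stand-ins for the learnt networks q(z|x) and p(x|z, s). In VASE these
# are trained neural networks; here they are fixed linear maps, one decoder
# rendering per dataset label s, purely to illustrate the latent swap.
rng = np.random.default_rng(0)
W_enc = rng.standard_normal((4, 16))            # image (16-d) -> latent mean (4-d)
W_dec = {"NatLab": rng.standard_normal((16, 4)),
         "EDE": rng.standard_normal((16, 4))}   # one rendering per dataset label s

def encoder(x):
    return x @ W_enc.T                          # mean of q(z | x)

def decoder(z, s):
    return z @ W_dec[s].T                       # mean of p(x | z, s)

def cross_domain_reconstruct(x_src, s_target):
    """Infer shared latents from a source-domain batch, then render them
    through the decoder conditioned on the *other* dataset's label."""
    z = encoder(x_src)                          # z ~ q(. | x_NatLab)
    return decoder(z, s_target)                 # x_rec ~ p(. | z, s_EDE)

x_natlab = rng.random((8, 16))                  # a batch of (flattened) NatLab frames
x_as_ede = cross_domain_reconstruct(x_natlab, "EDE")
```

The key point is that only the dataset label s changes between inference and reconstruction; the latents z carry the shared semantics across domains.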
The addition of multi-sensory information, or the ability to interact with the environment, may help alleviate this problem by integrating different sensory modalities and/or affordances of the environment into the semantic representation.

Dealing with ambiguity Natural stimuli are often ambiguous and may be interpreted differently based on contextual clues. Examples of such processes are common, e.g. visual illusions like the Necker cube [38], and may be driven by the functional organisation and the heavy top-down influences within the ventral visual stream of the brain [19, 41]. To evaluate the ability of VASE to deal with ambiguous inputs based on the context, we train it on a CelebA [33] → inverse Fashion sequence, and test it using ambiguous linear interpolations between samples from the two datasets (fig. 5A, first row). To measure the effects of ambiguity, we varied the interpolation weights between the two datasets. To measure the effects of context, we presented the ambiguous samples in a batch with real samples from one of the training datasets, varying the relative proportions of the two. Figure 5A (bottom) shows the inferred probability of interpreting the ambiguous samples as CelebA, q(s = CelebA|x).
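The test batches used in this ambiguity experiment can be constructed as in the sketch below. The helper name, array shapes, and random data are illustrative assumptions; the real inputs are images from CelebA and inverse Fashion.

```python
import numpy as np

def make_ambiguity_batch(x_a, x_b, interp_weight, n_ambiguous, x_context, batch_size=64):
    """Build a test batch mixing ambiguous interpolations with real context
    samples. `interp_weight` controls ambiguity (1.0 = pure dataset A,
    0.0 = pure dataset B); `n_ambiguous` controls the ambiguous/real ratio,
    with the remaining slots filled from `x_context`."""
    ambiguous = interp_weight * x_a[:n_ambiguous] + (1.0 - interp_weight) * x_b[:n_ambiguous]
    context = x_context[:batch_size - n_ambiguous]
    return np.concatenate([ambiguous, context], axis=0)

rng = np.random.default_rng(0)
x_celeba = rng.random((64, 32, 32))    # placeholder CelebA images
x_fashion = rng.random((64, 32, 32))   # placeholder inverse-Fashion images

# 56% CelebA / 44% Fashion interpolations, presented among real CelebA frames.
batch = make_ambiguity_batch(x_celeba, x_fashion, interp_weight=0.56,
                             n_ambiguous=16, x_context=x_celeba)
```

Sweeping `interp_weight` probes the effect of ambiguity, while sweeping `n_ambiguous` (and the choice of `x_context`) probes the effect of context, matching the two axes of Figure 5A (bottom).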
VASE shows a sharp boundary between interpreting input samples as Fashion or CelebA despite smooth changes in input ambiguity. Such categorical perception is also characteristic of biological intelligence [12, 13, 32]. The decision boundary for categorical perception is affected by the context in which the ambiguous samples are presented. VASE also represents its uncertainty about the ambiguous inputs by increasing the inferred variance of the relevant latent dimensions (fig. 5A, second row).

Imagination-driven exploration If we learn a factor of variation in a past environment (e.g., that objects can move), it may be reasonable to hypothesise that it may also be applicable in the current environment, even if it is not directly observed (e.g. in an environment with static objects). Given the ability to act on an environment, we may then try to realise an imagined configuration to test whether our hypothesis is correct (e.g. try to move the static objects), resulting in a form of imagination-driven exploration. In Appendix A.4 we show how such exploration can be implemented using VASE. Figure 5B shows that in a moving Fashion → MNIST → moving MNIST life-long learning setup, VASE is able to imagine and learn the concept of "moving MNIST digits" before actually experiencing it in the moving MNIST training condition.

5 Conclusions

We have introduced VASE, a novel approach to life-long unsupervised representation learning that builds on recent work on disentangled factor learning [21, 9] by introducing several new key components. Unlike other approaches to continual learning, our algorithm does not require maintaining a replay buffer of past datasets, or changing the loss function after each dataset switch. In fact, it does not require any a priori knowledge of the dataset presentation sequence, since these changes in data distribution are automatically inferred.
We have demonstrated that VASE can learn a disentangled\nrepresentation of a sequence of datasets. It does so without experiencing catastrophic forgetting\nand by dynamically allocating spare capacity to represent new information. It resolves ambiguity\nin a manner that is analogous to the categorical perception characteristic of biological intelligence.\nMost importantly, VASE allows for semantically meaningful sharing of latents between different\ndatasets, which enables it to perform cross-domain inference and imagination-driven exploration.\nTaken together, these properties make VASE a promising algorithm for learning representations that\nare conducive to subsequent robust and data-ef\ufb01cient RL policy learning.\n\nAcknowledgements\n\nWe thank Shakir Mohamed and James Kirkpatrick for useful discussions and feedback.\n\nReferences\n[1] A. Achille and S. Soatto. Emergence of Invariance and Disentangling in Deep Representations. Proceedings\n\nof the ICML Workshop on Principled Approaches to Deep Learning, 2017.\n\n[2] A. Achille and S. Soatto. Information dropout: Learning optimal representations through noisy computation.\n\nIEEE Transactions on Pattern Analysis and Machine Intelligence, PP(99):1\u20131, 2018.\n\n[3] A. Achille and S. Soatto. A separation principle for control in the age of deep learning. Annual Review\n\nof Control, Robotics, and Autonomous Systems, 1(1):null, 2018.\n\n[4] A. A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy. Deep variational information bottleneck. arXiv preprint\n\narXiv:1612.00410, 2016.\n\n[5] B. Ans and S. Rousset. Avoiding catastrophic forgetting by coupling two reverberating neural networks.\n\nComptes Rendus de l\u2019Acad\u00e9mie des Sciences - Series III - Sciences de la Vie, 320(12):989\u2013997, 1997.\n\n[6] A. Auchter, L. K. Cormack, Y. Niv, F. Gonzalez-Lima, and M. H. Mon\ufb01ls. Reconsolidation-extinction\ninteractions in fear memory attenuation: the role of inter-trial interval variability. 
Frontiers in Behavioral Neuroscience, 11:2, 2017.
[7] C. Beattie, J. Z. Leibo, D. Teplyashin, T. Ward, M. Wainwright, H. Küttler, A. Lefrancq, S. Green, V. Valdés, A. Sadik, J. Schrittwieser, K. Anderson, S. York, M. Cant, A. Cain, A. Bolton, S. Gaffney, H. King, D. Hassabis, S. Legg, and S. Petersen. DeepMind Lab. arXiv preprint arXiv:1612.03801, 2016.
[8] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.
[9] C. P. Burgess, I. Higgins, A. Pal, L. Matthey, N. Watters, G. Desjardins, and A. Lerchner. Understanding disentangling in β-VAE. NIPS Workshop on Learning Disentangled Features, 2017.
[10] J. Cichon and W.-B. Gan. Branch-specific dendritic Ca2+ spikes cause persistent synaptic plasticity. Nature, 520(7546):180–185, 2015.
[11] L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, S. Legg, and K. Kavukcuoglu. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. arXiv, 2018.
[12] N. L. Etcoff and J. J. Magee. Categorical perception of facial expressions. Cognition, 44:227–240, 1992.
[13] D. J. Freedman, M. Riesenhuber, T. Poggio, and E. K. Miller. Categorical representation of visual stimuli in the primate prefrontal cortex. Science, 291:312–316, 2001.
[14] R. M. French. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3(4):128–135, 1999.
[15] T. Furlanello, J. Zhao, A. M. Saxe, L. Itti, and B. S. Tjan. Active long term memory networks. arXiv preprint arXiv:1606.02355, 2016.
[16] M. Garnelo, K. Arulkumaran, and M. Shanahan. Towards deep symbolic reinforcement learning. arXiv preprint arXiv:1609.05518, 2016.
[17] I. J. Goodfellow, M. Mirza, D. Xiao, A. Courville, and Y. Bengio.
An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv, 2013.
[18] P. D. Grünwald. The Minimum Description Length Principle. MIT Press, 2007.
[19] B. Gulyas, D. Ottoson, and P. E. Roland. Functional Organisation of the Human Visual Cortex. Wenner–Gren International Series, 1993.
[20] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. ICCV, 2015.
[21] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner. β-VAE: Learning basic visual concepts with a constrained variational framework. ICLR, 2017.
[22] I. Higgins, A. Pal, A. Rusu, L. Matthey, C. Burgess, A. Pritzel, M. Botvinick, C. Blundell, and A. Lerchner. DARLA: Improving zero-shot transfer in reinforcement learning. ICML, 2017.
[23] M. Jaderberg, K. Simonyan, A. Zisserman, et al. Spatial transformer networks. In Advances in Neural Information Processing Systems, pages 2017–2025, 2015.
[24] T. Karaletsos, S. Belongie, and G. Rätsch. Bayesian representation learning with oracle constraints. ICLR, 2016.
[25] H. Kim and A. Mnih. Disentangling by factorising. arXiv, 2017.
[26] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. ICLR, 2015.
[27] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. ICLR, 2014.
[28] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, D. Hassabis, C. Clopath, D. Kumaran, and R. Hadsell. Overcoming catastrophic forgetting in neural networks. PNAS, 114(13):3521–3526, 2017.
[29] A. Kumar, P. Sattigeri, and A. Balakrishnan. Variational inference of disentangled latent concepts from unlabeled observations. ICLR, 2018.
[30] B. M. Lake, T. D. Ullman, J. B. Tenenbaum, and S. J. Gershman. Building machines that learn and think like people.
Behavioral and Brain Sciences, pages 1–101, 2016.
[31] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[32] Y. Liu and B. Jagadeesh. Neural selectivity in anterior inferotemporal cortex for morphed photographic images during behavioral classification or fixation. Journal of Neurophysiology, 100:966–982, 2008.
[33] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. ICCV, 2015.
[34] J. L. McClelland, B. L. McNaughton, and R. C. O'Reilly. Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. Psychological Review, 102(3):419, 1995.
[35] M. McCloskey and N. J. Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. The Psychology of Learning and Motivation, 24(92):109–165, 1989.
[36] K. Milan, J. Veness, J. Kirkpatrick, D. Hassabis, A. Koop, and M. Bowling. The forget-me-not process. NIPS, 2016.
[37] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
[38] L. Necker. Observations on some remarkable optical phaenomena seen in Switzerland; and on an optical phaenomenon which occurs on viewing a figure of a crystal or geometrical solid. London and Edinburgh Philosophical Magazine and Journal of Science, 1(5):329–337, 1832.
[39] C. V. Nguyen, Y. Li, T. D. Bui, and R. E. Turner. Variational continual learning. ICLR, 2018.
[40] E. Parisotto, J. L. Ba, and R. Salakhutdinov.
Actor-mimic: Deep multitask and transfer reinforcement learning. ICLR, 2015.
[41] A. Przybyszewski. Vision: Does top-down processing help us to see? Current Biology, 8:135–139, 1998.
[42] J. Ramapuram, M. Gregorova, and A. Kalousis. Lifelong generative modeling. arXiv preprint arXiv:1705.09847, 2017.
[43] R. Ratcliff. Connectionist models of recognition memory: constraints imposed by learning and forgetting functions. Psychological Review, 97(2):285, 1990.
[44] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. ICML, 32(2):1278–1286, 2014.
[45] J. Rissanen. Modeling by shortest data description. Automatica, 14(5):465–471, 1978.
[46] A. Robins. Catastrophic forgetting, rehearsal and pseudorehearsal. Connection Science, 7(2):123–146, 1995.
[47] A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell. Progressive neural networks. arXiv, 2016.
[48] P. Ruvolo and E. Eaton. ELLA: An efficient lifelong learning algorithm. ICML, 2013.
[49] J. Schwarz, J. Luketina, W. M. Czarnecki, A. Grabska-Barwinska, Y. W. Teh, R. Pascanu, and R. Hadsell. Progress & compress: A scalable framework for continual learning. ICML, 2018.
[50] A. Seff, A. Beatson, D. Suo, and H. Liu. Continual learning in generative adversarial nets. NIPS, 2017.
[51] H. Shin, J. K. Lee, J. Kim, and J. Kim. Continual learning with deep generative replay. NIPS, 2017.
[52] R. Shwartz-Ziv and N. Tishby. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810, 2017.
[53] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis.
Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
[54] H. Xiao, K. Rasul, and R. Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv, 2017.
[55] F. Zenke, B. Poole, and S. Ganguli. Continual learning through synaptic intelligence. ICML, 2017.