{"title": "Variational Mixture-of-Experts Autoencoders for Multi-Modal Deep Generative Models", "book": "Advances in Neural Information Processing Systems", "page_first": 15718, "page_last": 15729, "abstract": "Learning generative models that span multiple data modalities, such as vision and language, is often motivated by the desire to learn more useful, generalisable representations that faithfully capture common underlying factors between the modalities. In this work, we characterise successful learning of such models as the fulfilment of four criteria: i) implicit latent decomposition into shared and private subspaces, ii) coherent joint generation over all modalities, iii) coherent cross-generation across individual modalities, and iv) improved model learning for individual modalities through multi-modal integration. Here, we propose a mixture-of-experts multi-modal variational autoencoder (MMVAE) for learning of generative models on different sets of modalities, including a challenging image <-> language dataset, and demonstrate its ability to satisfy all four criteria, both qualitatively and quantitatively.", "full_text": "Variational Mixture-of-Experts Autoencoders\n\nfor Multi-Modal Deep Generative Models\n\nYuge Shi\u2217\n\nN. Siddharth\u2217\n\nDepartment of Engineering Science\n\nUniversity of Oxford\n\n{yshi, nsid}@robots.ox.ac.uk\n\nBrooks Paige\n\nAlan Turing Institute &\nUniversity of Cambridge\nbpaige@turing.ac.uk\n\nPhilip H.S. Torr\n\nDepartment of Engineering Science\n\nUniversity of Oxford\n\nphilip.torr@eng.ox.ac.uk\n\nAbstract\n\nLearning generative models that span multiple data modalities, such as vision and language, is\noften motivated by the desire to learn more useful, generalisable representations that faithfully\ncapture common underlying factors between the modalities. 
In this work, we characterise successful learning of such models as the fulfilment of four criteria: i) implicit latent decomposition into shared and private subspaces, ii) coherent joint generation over all modalities, iii) coherent cross-generation across individual modalities, and iv) improved model learning for individual modalities through multi-modal integration. Here, we propose a mixture-of-experts multimodal variational autoencoder (MMVAE) to learn generative models on different sets of modalities, including a challenging image ↔ language dataset, and demonstrate its ability to satisfy all four criteria, both qualitatively and quantitatively. Code, data, and models are provided at this url.

1 Introduction

Human learning in the real world involves a multitude of perspectives of the same underlying phenomena, such as perception of the same environment through visual observation, linguistic description, or physical interaction. Given the lack of explicit labels available for observations in the real world, observing across modalities can provide important information in the form of correlations between the observations. Studies have provided evidence that the brain jointly embeds information across different modalities (Quiroga et al., 2009; Stein et al., 2009), that such integration benefits reasoning and understanding through expression along these modalities (Bauer and Johnson-Laird, 1993; Fan et al., 2018), and that it further facilitates information transfer between them (Yildirim, 2014). We take inspiration from this to design algorithms that handle such multi-modal observations, while being capable of a similar breadth of behaviour. Figure 1 shows an example of such a situation, where an abstract notion of a bird is perceived through both visual observation as well as linguistic description. 
A thorough understanding of what a bird is involves understanding not just the characteristics of its visual and linguistic features individually, but also how they relate to each other (Barsalou, 2008; Siskind, 1994). Moreover, demonstrating such understanding involves being able to visualise or discriminate birds against other things, or describe birds' attributes. Crucially, this process involves flow of information both ways: from observations to representations and vice versa.

Figure 1: A schematic for multi-modal perception. [An abstract bird is perceived through Modality 1 (Vision; see, draw/select) and Modality 2 (Language; hear, describe; e.g. "A red-throated hummingbird in flight."), both feeding into understanding.]

∗Equal contribution

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

[Figure 2, left panel: graphical models over a shared latent z, with encoders φ1, φ2, decoders θ1, θ2, observations x1, x2 and generations x̂1, x̂2, illustrating shared vs. private information for (a) Latent Factorisation, (b) Joint Generation, (c) Cross Generation, and (d) Synergy.]

Model                              (a) Latent       (b) Joint     (c) Cross     (d) Synergy
                                   Factorisation    Generation    Generation
RBM-CCA (Ngiam et al., 2011)       x                x             ✓             ✓
SBA (Silberer and Lapata, 2014)    ✓                –             –             –
GRL (Ganin and Lempitsky, 2015)    –                –             –             –
DAN (Long et al., 2015)            –                –             –             –
DSN (Bousmalis et al., 2016)       x                x             ✓             –
DMAE (Mukherjee et al., 2017)      x                x             ✓             x
TELBO (Vedantam et al., 2018)      x                x             ✓             x
JMVAE (Suzuki et al., 2017)        x                x             ✓             ✓
MVAE (Wu and Goodman, 2018)        x                x             ✓             ✓
UNIT (Liu et al., 2017)            x                x             ✓             x
MFM (Tsai et al., 2019)            ✓                x             x             ✓
Ours                               ✓                ✓             ✓             ✓

Figure 2: [Left] The four criteria for multi-modal generative models: (a) latent factorisation, (b) coherent joint generation, (c) coherent cross generation, and (d) synergy. [Right] A characterisation of recent work that explores multiple modalities, including our own, in terms of the specified criteria. See § 2 for further details.

With this in mind, when designing algorithms that imitate the human learning process, we seek a generative model that is able to jointly embed and generate multi-modal observations, learning concepts through the association of multiple modalities and feedback from their reconstructions. While the variational autoencoder (VAE) (Kingma and Welling, 2014) fits such a description well, truly capturing the range of behaviour and abilities exhibited by humans from multi-modal observation requires enforcing particular characteristics on the framework itself. Although there have been a range of approaches that broadly tackle the issue of multi-modal generative modelling (cf. § 2), they fall short of expressing a more complete range of expected behaviour in this setting. We hence posit four criteria that a multi-modal generative model should satisfy (cf. Figure 2[left]):

Latent Factorisation: the latent space implicitly factors into subspaces that capture the shared and private aspects of the given modalities. 
This aspect is important from the perspective of downstream tasks, where better decomposed representations (Lipton, 2018; Mathieu et al., 2019) are more amenable for use on a wider variety of tasks.

Coherent Joint Generation: generations in different modalities stemming from the same latent value exhibit coherence in terms of the shared aspects of the latent. For example, in the schematic in Figure 1, this could manifest through the generated image and description always matching semantically; that is, the description is true of the image.

Coherent Cross Generation: the model can generate data in one modality conditioned on data observed in a different modality, such that the underlying commonality between them is preserved. Again taking Figure 1 as an example, for a given description, one should be able to generate images that are semantically consistent with the description, and vice versa.

Synergy: the quality of the generative model for any single modality is improved through representations learnt across multi-modal observations, as opposed to just the single modality itself. That is, observing both the image and description should lead to more specificity in the generation of images (and descriptions) than when either is taken alone.

To this end, we propose the MMVAE, a multi-modal VAE that uses a mixture-of-experts (MOE) variational posterior over the individual modalities to learn a multi-modal generative model that satisfies the above four criteria. While it shares some characteristics with the most recent work on multi-modal generative models (cf. § 2), it nonetheless differs from them in two important ways. First, we are interested in situations where 1) observations across multiple modalities are always presented during training, and 2) the trained model can handle missing modalities at test time. 
Second, our experiments use a many-to-many multi-modal mapping scenario, which provides a greater challenge than the typically used one-to-one image ↔ image (colourisation or edge/outline detection) and image ↔ attribute (image classification) transformations. To the best of our knowledge, we are also the first to explore image ↔ language transformation in the multi-modal VAE setting. Figure 2[right] summarises relevant work (cf. § 2) and identifies whether they satisfy our proposed criteria.

2 Related Work

Cross-modal generation Prior approaches to generative modelling with multi-modal data have predominantly targeted only cross-modal generation. Given data from two domains x1 and x2, they learn the conditional generative model p(x1 | x2), where the conditioning modality x2 and generation modality x1 are typically not interchangeable. This is commonly seen in conditional VAE methods for attribute→image or image→caption generation (Pandey and Dukkipati, 2017; Pu et al., 2016; Sohn et al., 2015; Yan et al., 2016), as well as generative adversarial network (GAN)-based models for cross-domain image-to-image translation (Ledig et al., 2017; Li and Wand, 2016; Liu et al., 2019; Taigman et al., 2017; Wang and Gupta, 2016). In recent years, there have been a few approaches involving both VAEs and GANs that enable cross-modal generation both ways (Wang et al., 2016; Zhu et al., 2017a,b), but they ignore learning a common embedding between the modalities, instead treating the different cross-generations as independent but composable transforms. 
Curiously, some GAN cross-generation models appear to avoid learning abstractions of the data, choosing instead to hide the actual input directly in the high-frequency components of the output (Chu et al., 2017).

Domain adaptation The related sub-field of domain adaptation explores learning joint embeddings of multi-modal observations that generalise across the different modalities for both classification (Long et al., 2015, 2016; Tzeng et al., 2014) and generation (Bousmalis et al., 2016) tasks. Regarding approaches that go beyond just cross-modal generation to models that learn common embedding or projection spaces, Ngiam et al. (2011) were the first to employ an autoencoder-based architecture to learn a joint representation between modalities, using their RBM-CCA model. Silberer and Lapata (2014) followed this with a stacked autoencoder architecture to jointly embed the visual and textual representations of nouns. Tian and Engel (2019) consider an intermediary embedding space to transform between independent VAEs, applying constraints on the closeness of embeddings and their individual classification performance on labels.

Joint models Yet another class of approaches attempts to explicitly model the joint distribution over latents and data. Suzuki et al. (2017) introduced the joint multimodal VAE (JMVAE), which learns a shared representation with a joint encoder qΦ(z | x1, x2). To handle missing data at test time, two unimodal encoders qΦ(z | x1) and qΦ(z | x2) are trained to match qΦ(z | x1, x2) with a KL constraint between them. Vedantam et al. 
(2018)'s multimodal VAE, TELBO (triple ELBO), also deals with missing data at test time by explicitly defining multimodal and unimodal inference networks. However, differing from the JMVAE, they facilitate the convergence between the unimodal encoding distributions and the joint distribution using a two-step training regime that first fits the joint encoding, and subsequently fits the unimodal encodings holding the joint fixed. Tsai et al. (2019) propose MFM (multimodal factorisation model), which also explicitly defines a joint network qΦ(z | x1:M) on top of the unimodal encoders, seeking to infer missing modalities using the observed modalities.

We argue that these approaches are less than ideal, as they typically only target one of the proposed criteria (e.g. (one-way) cross-generation), often require additional modelling components and inference steps (JMVAE, TELBO, MFM), ignore the latent representation structures induced, and largely only target observations within a particular domain, typically vision ↔ vision (Liu et al., 2017). More recently, Wu and Goodman (2018) introduced the MVAE, a marked improvement over previous approaches, proposing to model the joint posterior as a product of experts (POE) over the marginal posteriors, enabling cross-modal generation at test time without requiring additional inference networks and multi-stage training regimes. While already a significant step forward, we observe that the POE factorisation does not appear to be practically suited for multi-modal learning, likely due to the precision miscalibration of experts; see § 3 for a more detailed explanation. We also observe that latent-variable mixture models have previously been applied to generative models targeting multi-modal topic-modelling (Barnard et al., 2003; Blei and Jordan, 2003). 
Although differing from our formulation in many ways, these approaches nonetheless indicate the suitability of mixture models for learning from multi-modal data.

3 Methods

Background We employ a VAE (Kingma and Welling, 2014) to learn a multi-modal generative model over modalities m = 1, . . . , M of the form p_\Theta(z, x_{1:M}) = p(z) \prod_{m=1}^{M} p_{\theta_m}(x_m \mid z), with the likelihoods p_{\theta_m}(x_m \mid z) parametrised by deep neural networks (decoders) with parameters \Theta = \{\theta_1, \dots, \theta_M\}. The objective of training VAEs is to maximise the marginal likelihood of the data p_\Theta(x_{1:M}). However, computing the evidence is intractable as it requires knowledge of the true joint posterior p_\Theta(z \mid x_{1:M}). To tackle this, we approximate the true unknown posterior by a variational posterior q_\Phi(z \mid x_{1:M}), which now allows optimising an evidence lower bound (ELBO) through stochastic gradient descent (SGD), with the ELBO defined as

\mathcal{L}_{\text{ELBO}}(x_{1:M}) = \mathbb{E}_{z \sim q_\Phi(z \mid x_{1:M})} \left[ \log \frac{p_\Theta(z, x_{1:M})}{q_\Phi(z \mid x_{1:M})} \right]    (1)

The importance weighted autoencoder (IWAE) (Burda et al., 2015) computes a tighter lower bound through appropriate weighting of a multi-sample estimator, as

\mathcal{L}_{\text{IWAE}}(x_{1:M}) = \mathbb{E}_{z^{1:K} \sim q_\Phi(z \mid x_{1:M})} \left[ \log \frac{1}{K} \sum_{k=1}^{K} \frac{p_\Theta(z^k, x_{1:M})}{q_\Phi(z^k \mid x_{1:M})} \right]    (2)

Beyond the targeting of a tighter bound, we further prefer the IWAE estimator since the variational posteriors it estimates tend to have higher entropy (see Appendix D). 
This is actually beneficial in the multi-modal learning scenario, as each posterior q_{\phi_m}(z \mid x_m) is encouraged to assign high probability to regions beyond just those which characterise its own modality m.

The mixture of experts (MOE) joint variational posterior Given the objective, a crucial question remains: how should we learn the variational joint posterior q_\Phi(z \mid x_{1:M})? One immediately obvious approach is to train a single encoder network that takes all modalities x_{1:M} as input to explicitly parametrise the joint posterior. However, as described in § 2, this approach requires all modalities to be present at all times, thus making cross-modal generation difficult. We propose instead to factorise the joint variational posterior as a combination of unimodal posteriors, using a mixture of experts (MOE), i.e. q_\Phi(z \mid x_{1:M}) = \sum_m \alpha_m \cdot q_{\phi_m}(z \mid x_m), where \alpha_m = 1/M, assuming the different modalities are of comparable complexity (as per the motivation in § 1).

MoE vs. PoE An alternative choice is to factorise the joint variational posterior as a product of experts (POE), i.e. q_\Phi(z \mid x_{1:M}) = \prod_m q_{\phi_m}(z \mid x_m), as seen in the MVAE (Wu and Goodman, 2018). When employing POE, each expert holds the power of veto, in the sense that the joint distribution will have low density for a given set of observations if just one of the marginal posteriors has low density. In the case of Gaussian experts, as is typically assumed2, experts with greater precision will have more influence over the combined prediction than experts with lower precision. When the precisions are miscalibrated, as is likely in learning with SGD due to differences in the complexity of input modalities or in initialisation conditions, overconfident predictions by one expert, implying a potentially biased mean prediction overall, can be detrimental to the whole model. 
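To make the PoE failure mode concrete, here is a minimal numerical sketch with two univariate Gaussian experts (all values are illustrative, not from the paper):

```python
import numpy as np

# Two univariate Gaussian "experts" over z: expert 2 is overconfident
# (high precision) with a biased mean.
mu = np.array([0.0, 4.0])
sigma = np.array([1.0, 0.1])

# PoE: a product of Gaussians is Gaussian with a precision-weighted mean,
# so the overconfident expert dominates the combined prediction.
prec = 1.0 / sigma**2
poe_mean = np.sum(prec * mu) / np.sum(prec)

# MoE: an even mixture spreads density over both experts; its mean is the
# plain average of the experts' means, and no single expert can veto.
moe_mean = np.mean(mu)

print(poe_mean)  # ≈ 3.96: dragged almost entirely to the overconfident expert
print(moe_mean)  # 2.0: each expert keeps equal weight
```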
This can be undesirable for learning factored latent representations across modalities. By contrast, MOE does not suffer from potentially overconfident experts, since it effectively takes a vote amongst the experts, and spreads its density over all the individual experts. This characteristic makes it better suited to latent factorisation, being sensitive to information across all the individual modalities. Moreover, Wu and Goodman (2018) noted that POE does not work well when observations across all modalities are always presented during training, requiring artificial subsampling of the observations to ensure that the individual modalities are learnt faithfully. As evidence, we show empirically in § 4 that the POE factorisation does not satisfy all the criteria we outline in § 1.

The MOE-multimodal VAE (MMVAE) objective With the MOE joint posterior, we can extend \mathcal{L}_{\text{IWAE}} in (2) to multiple modalities by employing stratified sampling (Robert and Casella, 2013) to average over the M modalities:

\mathcal{L}^{\text{MOE}}_{\text{IWAE}}(x_{1:M}) = \frac{1}{M} \sum_{m=1}^{M} \mathbb{E}_{z_m^{1:K} \sim q_{\phi_m}(z \mid x_m)} \left[ \log \frac{1}{K} \sum_{k=1}^{K} \frac{p_\Theta(z_m^k, x_{1:M})}{q_\Phi(z_m^k \mid x_{1:M})} \right],    (3)

which has the effect of weighing the gradients of samples from different modalities equally while still estimating tight bounds for each individual term. Note that although an even tighter bound can be computed by weighting the contribution of each modality differently, in proportion to its contribution to the marginal likelihood, doing so can lead to undesirable modality dominance similar to that in the POE case. 
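As a toy illustration of estimating the stratified objective in (3), the following sketch uses a scalar latent with Gaussian components standing in for the paper's Laplace distributions and deep encoders/decoders; all values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_normal(x, mu, sigma):
    # log density of N(mu, sigma^2), vectorised over x
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

# Toy stand-in for the MMVAE objective: M = 2 modalities, scalar latent z,
# Gaussian prior p(z) = N(0, 1) and Gaussian likelihoods p(x_m | z) = N(z, 1).
x = np.array([0.5, -0.3])                                    # one datum per modality
q_mu, q_sigma = np.array([0.4, -0.2]), np.array([0.8, 0.9])  # unimodal posteriors

def moe_iwae(K=50):
    M = len(x)
    total = 0.0
    for m in range(M):                               # stratify over modalities
        z = rng.normal(q_mu[m], q_sigma[m], size=K)  # z_m^{1:K} ~ q_{phi_m}(z | x_m)
        log_p = log_normal(z, 0.0, 1.0)              # log p(z)
        for j in range(M):                           # + sum_j log p(x_j | z)
            log_p += log_normal(x[j], z, 1.0)
        # log of the MoE joint posterior q_Phi(z | x_{1:M}), an even mixture
        log_q = np.logaddexp(log_normal(z, q_mu[0], q_sigma[0]),
                             log_normal(z, q_mu[1], q_sigma[1])) - np.log(M)
        # log (1/K) sum_k p(z_k, x) / q(z_k | x), computed stably
        total += np.logaddexp.reduce(log_p - log_q) - np.log(K)
    return total / M

print(moe_iwae())  # a stochastic lower bound on log p(x_1, x_2)
```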
See Appendix A for further details and results.

2 Training POE models in general can be intractable (Hinton, 2002) due to the required normalisation, but becomes analytic when the experts are Gaussian.

It is easy to show that \mathcal{L}^{\text{MOE}}_{\text{IWAE}}(x_{1:M}) is still a tighter lower bound than the standard M-modality ELBO using linearity of expectations, as

\mathcal{L}_{\text{ELBO}}(x_{1:M}) = \mathbb{E}_{q_\Phi(z \mid x_{1:M})} \left[ \log \frac{p_\Theta(z, x_{1:M})}{q_\Phi(z \mid x_{1:M})} \right] = \frac{1}{M} \sum_{m=1}^{M} \mathbb{E}_{z_m \sim q_{\phi_m}(z \mid x_m)} \left[ \log \frac{p_\Theta(z_m, x_{1:M})}{q_\Phi(z_m \mid x_{1:M})} \right]

\le \frac{1}{M} \sum_{m=1}^{M} \mathbb{E}_{z_m^{1:K} \sim q_{\phi_m}(z \mid x_m)} \left[ \log \frac{1}{K} \sum_{k=1}^{K} \frac{p_\Theta(z_m^k, x_{1:M})}{q_\Phi(z_m^k \mid x_{1:M})} \right] = \mathcal{L}^{\text{MOE}}_{\text{IWAE}}(x_{1:M}).

In actually computing the gradients for the objectives in Equations (3) and (5), we employ the DReG IWAE estimator of Tucker et al. (2019), avoiding issues with the quality of the estimator for large K as discovered by Rainforth et al. (2018) (cf. Appendix C).

From a computational perspective, the MOE objectives incur some overhead over the POE objective, due to the fact that each modality provides samples from its own encoding distribution q_{\phi_m}(z \mid x_m) to be evaluated with the joint generative model p_\Theta(z, x_{1:M}), needing M^2 passes over the respective decoders in total. The real cost of this added complexity can however be minimal, since a) the number of modalities one can simultaneously process is typically quite small, and b) the additional computation can be efficiently vectorised. However, if this cost should be deemed prohibitively large, one can in fact trade off the tightness of the estimator for linear time complexity in the number of modalities M, employing a multi-modal importance sampling scheme on the standard ELBO. 
We discuss this in further detail in Appendix B.

4 Experiments

To evaluate our model, we constructed two multi-modal scenarios to conduct experiments on. The first experiment involves many-to-many image ↔ image transforms on matching digits between the MNIST and street-view house numbers (SVHN) datasets. This experiment was designed to separate perceptual complexity (i.e. colour, style, size) from conceptual complexity (i.e. digits) using relatively simple image domains. The second experiment involves a highly challenging image ↔ language task on the Caltech-UCSD Birds (CUB) dataset, more complicated than the typical image ↔ attribute transformations employed in prior work. We choose this dataset as it matches our original motivation in tackling multi-modal perception in a manner similar to how humans perceive and learn about the world. For each of these experiments, we provide both qualitative and quantitative analyses of the extent to which our model satisfies the four proposed criteria, which, to reiterate, are i) implicit latent decomposition, ii) coherent joint generation over all modalities, iii) coherent cross-generation across individual modalities, and iv) improved model learning for individual modalities through multi-modal integration. Source code for all models and experiments is available at https://github.com/iffsid/mmvae.

4.1 Common Details

Across experiments, we employ Laplace priors and posteriors, constraining their scaling across the D dimensions to sum to D. These design choices better encourage the learning of axis-aligned representations by breaking the rotationally-invariant nature of the standard isotropic Gaussian prior (Mathieu et al., 2019). For learning, we use the Adam optimiser (Kingma and Ba, 2014) with AMSGrad (Reddi et al., 2018), with a learning rate of 0.001. Details of the architectures used are provided in Appendix F. 
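The scale constraint above can be sketched as follows; this is a minimal interpretation, and the exact parametrisation used in the released code is an assumption on our part:

```python
import numpy as np

def constrain_scale(raw_scale):
    """Rescale a positive Laplace scale vector so its D entries sum to D.

    A sketch of the constraint described above (scales summing to D across
    the D latent dimensions); not necessarily the paper's implementation.
    """
    raw_scale = np.asarray(raw_scale, dtype=float)
    D = raw_scale.shape[-1]
    return raw_scale * D / raw_scale.sum(axis=-1, keepdims=True)

scale = constrain_scale([0.2, 1.0, 3.0, 0.8])  # illustrative raw scales, D = 4
print(scale.sum())  # 4.0, i.e. equal to D
```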
All numerical results were averaged over 5 independently trained models. Data and pre-trained models from our experiments are also available at https://github.com/iffsid/mmvae.

4.2 MNIST-SVHN

Dataset: As mentioned before, we design this experiment in order to probe conceptual complexity separately from perceptual complexity. We construct a dataset of pairs of MNIST and SVHN such that each pair depicts the same digit class. Each instance of a digit class (in either dataset) is randomly paired with 20 instances of the same digit class from the other dataset. As shown in Figure 3, although the data domains are fairly well known, effectively capturing the digit class can be a challenging task due to the variety of styles and colours presented across both datasets. Here, we use CNNs for SVHN and MLPs for MNIST, with a 20-d latent space.

Figure 3: Example data.

Figure 4: Qualitative evaluation of both our MMVAE model and MVAE from Wu and Goodman (2018). [Panels: MMVAE (ours) vs. MVAE (Wu and Goodman, 2018); rows: Generations, MNIST →∗, SVHN →∗.] Generations (top row) for each modality. Note both the quality of generations and the extent to which corresponding generations in MNIST and SVHN match on digits for MMVAE vs. MVAE, satisfying the coherent generation criteria. Reconstructions and cross-generations for MNIST (middle row) and SVHN (bottom row). Again note the extent to which cross generations capture the underlying digit effectively for MMVAE.

Qualitative Results: Figure 4 shows a qualitative comparison between MMVAE trained with the MOE objective in (3) and the MVAE model of Wu and Goodman (2018). The MVAE was trained using the authors' publicly available code3, following their recommended training regime. 
We generate from the model p_\Theta(z, x_{1:M}) by taking R = 64 samples from the prior z ∼ p(z), each of which is used to take N = 9 samples from the likelihood of each modality m as x_m ∼ p_\Theta(x_m | z). Note the quality of the MMVAE model, both at coherent joint generation (top row), where corresponding samples for the same z match in their digits, and at coherent cross-generation (middle and bottom rows). To show that it is truly the MOE factorisation that impacts the learning, rather than our particular choice of model architecture or the IWAE objective, we explore performance on adopting only MOE, directly in the codebase of Wu and Goodman (2018), keeping all other aspects fixed, in Appendix E. Results indicate that the MVAE with the MOE posterior does appear to do better, especially at cross-modal generation, than with the POE.

We subsequently analyse the structure of the latent space by traversing each dimension independently, as shown in Figure 5. Here, for each modality m, we encode datapoint x_m through its respective encoder q_{\phi_m}(z | x_m) to obtain the mean embedding \mu_m. Then, we perturb the embedding value \mu_m^d along each dimension d linearly in the range (\mu_m^d - 5\sigma_0^d, \mu_m^d + 5\sigma_0^d), where \sigma_0^d is the learnt standard deviation for dimension d in the prior p(z). Note the extent to which particular dimensions affect only a single modality, whereas other dimensions affect both, indicating a degree of latent factorisation. 
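The traversal procedure can be sketched as follows; dimension count and values are illustrative, and the decoding step is omitted:

```python
import numpy as np

def traverse_dimension(mu, sigma0, d, steps=11):
    """Linearly perturb latent dimension d of a mean embedding mu over
    (mu[d] - 5*sigma0[d], mu[d] + 5*sigma0[d]), keeping all other
    dimensions fixed; returns one latent vector per step."""
    zs = np.tile(mu, (steps, 1))
    zs[:, d] = np.linspace(mu[d] - 5 * sigma0[d], mu[d] + 5 * sigma0[d], steps)
    return zs

mu = np.zeros(20)          # mean embedding from an encoder (illustrative)
sigma0 = np.full(20, 0.5)  # learnt prior standard deviations (illustrative)
zs = traverse_dimension(mu, sigma0, d=3)
print(zs.shape)  # (11, 20); each row would then be decoded by every p(x_m | z)
```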
Also shown is a plot of the per-dimension Kullback-Leibler divergence (KL) between each posterior and the prior, as well as the symmetric KL between the two posteriors, to indicate which dimensions encode information from which posterior, if any.

Figure 5: Per-dimension latent traversals for a pair of datapoints, indicating dimensions that affect only SVHN, only MNIST, and both MNIST & SVHN.

3 https://github.com/mhw32/multimodal-vae-public

Quantitative Results: To quantify the extent to which the latent space factorises from multi-modal observations, we employ a simple linear classifier on the latent representations, as we have no a-priori reason to believe that the representations factorise in an axis-aligned manner. If a linear digit classifier can extract the digit information from the shared latent space, it strongly indicates the presence of a linear subspace that has factored as desired. We train digit classifiers for a) MMVAE, b) MVAE, with the posterior computed from either single-modality or multi-modality inputs, and c) a single-VAE that takes input from one modality only, plotting results in Table 1. Comparing the results of the first column to the last in Table 1, we find MMVAE's latent space provides significantly better accuracy over the single-VAE. For the MVAE, due to its POE formulation, it appears that the MNIST modality dominates, obtaining high accuracy for MNIST digit classification (95.7%) but low for SVHN (9.1%). When given both inputs, accuracy for SVHN improves significantly while that for MNIST decreases slightly. 
Note that the accuracy for MVAE is higher when only the MNIST data is presented compared to when both modalities are available, indicating that the presence of the extra modality does not further inform the model on the classification of digits.

Table 1: Digit classification accuracy (%) of latent variables in different models.

        MMVAE   MVAE (single)   MVAE (both)   single-VAE
MNIST   94.9    95.7            91.3          85.3
SVHN    90.1    9.1             68.0          20.7

We also quantify the coherence of joint generations and cross-modal generations. To do so, we employ off-the-shelf digit classifiers trained on the original MNIST and SVHN datasets on the generation results, and compute a) for joint generation, how often the digits of generations in the two modalities match, and b) for cross-modal generation, how often the digit generated in one modality matches its input from the other modality. Results in Table 2 indicate that for joint generation, the classifiers predict the same digit class 42.1% of the time, and for cross-generation, 86.4% (MNIST → SVHN) and 69.1% (SVHN → MNIST). 
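The digit-matching metric can be sketched as follows, with hypothetical classifier outputs standing in for the off-the-shelf MNIST/SVHN classifiers:

```python
import numpy as np

def matching_rate(digits_a, digits_b):
    """Fraction of paired generations whose predicted digit classes agree.

    digits_a / digits_b are predictions from pre-trained unimodal digit
    classifiers on the two generated modalities (hypothetical values below)."""
    digits_a, digits_b = np.asarray(digits_a), np.asarray(digits_b)
    return float(np.mean(digits_a == digits_b))

# Illustrative classifier outputs for five jointly generated MNIST/SVHN pairs:
mnist_preds = [3, 7, 1, 0, 9]
svhn_preds  = [3, 7, 4, 0, 9]
print(matching_rate(mnist_preds, svhn_preds))  # 0.8
```

The same function covers cross-generation coherence by comparing the classifier's prediction on the generated modality against the input's digit label.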
Computing these metrics for the MVAE yields accuracy close to chance, suggesting that the coherence between modalities is not preserved when considering generation.

Table 2: Probability of digit matching (%) for joint and cross generation.

        Joint   Cross (M→S)   Cross (S→M)
MMVAE   42.1    86.4          69.1
MVAE    12.7    9.5           9.3

We finally compute the marginal likelihoods4 of the joint generative model p_\Theta(x_{1:M}), and each of the individual generative models p_{\theta_m}(x_m), using both the joint variational posterior q_\Phi(z | x_{1:M}) and the single variational posterior q_{\phi_m}(z | x_m), as shown in Table 3.

Table 3: Evaluating the different log likelihoods for different arrangements of MNIST and SVHN.

                        log p(xm, xn)   log p(xm | xm, xn)   log p(xm | xm)   log p(xm | xn)
m = MNIST,   MMVAE      6261.40         868.76               868.37           628.31
n = SVHN     MVAE       2961.80         −176.68              −107.46          −778.20
m = SVHN,    MMVAE      6261.40         3441.01              3441.01          2337.56
n = MNIST    MVAE       2961.80         3395.12              3536.86          −12747.50

We observe that the MMVAE model yields higher likelihoods, consistent with employing the IWAE estimator. Interestingly, we observe that p(xm | xm, xn) ≥ p(xm | xm) for MMVAE, whereas for MVAE we consistently find p(xm | xm, xn) < p(xm | xm). This serves to highlight that the MMVAE model is able to effectively utilise information jointly across multiple modalities, whereas the MVAE model potentially suffers from overdominant encoders and an ill-suited sub-sampled training scheme to accommodate data always being present across modalities at train time.

4.3 CUB Image-Captions

Dataset: Encouraged by the results from our previous experiment, we consider a multi-modal experiment more in line with our original motivation. 
We employ the images and captions from the Caltech-UCSD Birds (CUB) dataset (Wah et al., 2011), containing 11,788 photos of birds in natural scenes, each annotated with 10 fine-grained captions describing the bird's appearance characteristics, collected through Amazon Mechanical Turk (AMT). As shown in Figure 6, the images are quite detailed and the descriptions fairly complex, involving the composition of various attributes.

Figure 6: Example data. [Sample captions: "a blue bird with gray primaries and secondaries and white breast and throat"; "the bird has a white body, black wings, and webbed orange feet".]

For the image data, rather than generating directly in image space, we instead generate in the feature space of a pre-trained ResNet-101 (He et al., 2016), in order to avoid issues with blurry generations for complex image data (Zhao et al., 2017). For generations and reconstructions, we simply perform a nearest-neighbour lookup in feature space using the Euclidean distance on the generated or reconstructed feature. For the language data, we employ a CNN encoder and decoder following Kalchbrenner et al. (2014); Massiceti et al. (2018b); Pham et al. (2016), learning an embedding for words in the process. We use 128-dimensional latents with a Laplace likelihood on image features and a Categorical likelihood for captions.

4 We compute a 1000-sample estimate using (5) here.

Figure 7: Qualitative evaluation of the MMVAE model on the CUB data, showing reconstruction in the individual modalities (top rows), cross generation (middle and bottom rows on right), and joint generation (bottom row on left). More qualitative examples can be found in Appendix G.

Qualitative Results: Figure 7 shows qualitative results for the MMVAE model trained with the MOE objective in (3) on the CUB data. We generate from the model as before, with R = 4, and N1 = 9, N2 = 1. 
Interestingly, even for such a complicated dataset, we see joint generation align quite well, with descriptions largely in line with the image across a range of different attributes, and cross generation where the generated descriptions again match the input image quite well, and vice versa. An interesting avenue for future work on this experiment would be to incorporate more complex generative models for images, potentially using GANs to improve the quality of generated output, and to observe the effect on the joint modelling. We also generate results for MVAE on this dataset, where we observe that the image modality dominates, resulting in poor language reconstruction and cross-modal generation. See Appendix H for examples and further details.

Quantitative Results: We evaluate the coherence of our generations by calculating the correlation between the jointly and cross-generated image-caption pairs. We do so by employing Canonical Correlation Analysis (CCA), following the observation of its effectiveness as a baseline for language and vision tasks by Massiceti et al. (2018a). Here, given paired observations $\{x_1 \in \mathbb{R}^{n_1}, x_2 \in \mathbb{R}^{n_2}\}$, CCA learns projections $W_1 \in \mathbb{R}^{n_1 \times k}$ and $W_2 \in \mathbb{R}^{n_2 \times k}$ that maximise the correlation between the projected variables $W_1^\top x_1$ and $W_2^\top x_2$. With this formulation, the correlation between any new pair of observations $\{\tilde{x}_1, \tilde{x}_2\}$ can be computed as the cosine similarity between the mean-centred projected variables, i.e.

\[ \mathrm{corr}(\tilde{x}_1, \tilde{x}_2) = \frac{\phi(\tilde{x}_1)^\top \phi(\tilde{x}_2)}{\|\phi(\tilde{x}_1)\|_2 \, \|\phi(\tilde{x}_2)\|_2}, \tag{4} \]

where $\phi(\tilde{x}_n) = W_n^\top \tilde{x}_n - \mathrm{avg}(W_n^\top x_n)$.

We prepare the dataset for CCA by pre-processing both modalities using feature extractors.
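The projection step and the correlation in (4) can be sketched in NumPy; the ridge term `reg`, the component count `k`, and the toy dimensions below are our own illustrative assumptions, not the paper's settings:

```python
import numpy as np

def fit_cca(X1, X2, k, reg=1e-4):
    """Classical CCA: returns projections W1, W2 and the training means.

    X1: (N, n1) and X2: (N, n2) are paired training observations; reg is
    a small ridge term for numerical stability on high-dimensional features.
    """
    m1, m2 = X1.mean(0), X2.mean(0)
    Xc, Yc = X1 - m1, X2 - m2
    n = X1.shape[0]
    Sxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X1.shape[1])
    Syy = Yc.T @ Yc / (n - 1) + reg * np.eye(X2.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)

    def inv_sqrt(S):  # symmetric inverse square root via eigendecomposition
        w, V = np.linalg.eigh(S)
        return (V / np.sqrt(w)) @ V.T

    A, B = inv_sqrt(Sxx), inv_sqrt(Syy)
    U, _, Vt = np.linalg.svd(A @ Sxy @ B)
    # Leading k canonical directions for each view.
    return A @ U[:, :k], B @ Vt[:k].T, m1, m2

def cca_corr(x1, x2, W1, W2, m1, m2):
    """Equation (4): cosine similarity of mean-centred projections."""
    p1, p2 = W1.T @ (x1 - m1), W2.T @ (x2 - m2)
    return float(p1 @ p2 / (np.linalg.norm(p1) * np.linalg.norm(p2)))
```

On synthetic paired data sharing a common latent, matched pairs score markedly higher under `cca_corr` than mismatched ones, which is exactly the property exploited when scoring generated image-caption pairs.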
For images, similar to training, we use the off-the-shelf ResNet-101 to generate 2048-dimensional feature vectors; for captions, we fit a FastText model on all sentences in the training set, projecting each word onto a 300-dimensional vector (Bojanowski et al., 2017). The representation for each caption is then obtained by aggregating the embeddings of all words in the sentence (here we simply take the average). To compute the correlation between generated images and captions, we first compute the projection matrix W for each modality using the training set of CUB Image-Captions, then apply (4) to i) jointly generated image-sentence pairs, averaging over 1000 examples, and ii) the image→sentence and sentence→image pairs of cross generation, averaging over the entire test set. Results are shown in Table 4.

Table 4: Correlation of Image (I)-Sentence (S) pairs for joint and cross generation.

            Joint    Cross (I→S)   Cross (S→I)   Ground Truth
MMVAE       0.263    0.104          0.135        0.273
MVAE       −0.095    0.011         −0.013

Table 4 shows that the average correlation for joint generation with our model is 0.263. This value is only slightly lower than the average correlation of the data itself (0.273 in Table 4), demonstrating high coherence between the jointly generated image-caption pairs. For cross generation, the correlation between input images and generated captions is slightly lower than that between input captions and generated images, at 0.104 and 0.135 respectively.

In comparison, the MVAE model yields (marginally) negative correlation for the jointly generated pairs and the sentence → image cross-generated pairs.
Notably, the correlation for image → sentence generation is higher than that for sentence → image (0.011 and −0.013 respectively). The qualitative results for MVAE in Appendix H likewise show that outputs generated from image inputs are more expressive than those generated from language inputs. These findings indicate that the model places more weight on the image modality than on language in the factorisation of the joint posterior, once again indicating an overdominant encoder and providing empirical evidence for our analysis of the POE's potential bias towards stronger experts in § 3.

5 Conclusion

In this paper, we explore multi-modal generative models, characterising successful learning of such models as the fulfilment of four specific criteria: i) implicit latent decomposition into shared and private subspaces, ii) coherent simultaneous joint generation over all modalities, iii) coherent cross-generation between individual modalities, and iv) improved model learning for the individual modalities as a consequence of having observed data from different modalities. Satisfying these goals enables more useful and generalisable representations for downstream tasks such as classification, by capturing the abstract relationship between the modalities. To this end, we propose a variational mixture-of-experts (MOE) autoencoder framework that allows us to achieve these criteria, in contrast to prior work, which primarily targets just the cross-modal generation and improved model learning aspects. We compare and contrast our MMVAE model against the state-of-the-art product-of-experts (POE) model of Wu and Goodman (2018) and demonstrate that we outperform it at satisfying these four criteria.
We evaluate our model on two challenging datasets that capture both image ↔ image and image ↔ language transformations, showing appealing results across these tasks.

Acknowledgements

YS, NS, and PHST were supported by the ERC grant ERC-2012-AdG 321162-HELIOS, EPSRC grant Seebibyte EP/M013774/1 and EPSRC/MURI grant EP/N019474/1, with further support from the Royal Academy of Engineering and FiveAI. YS was additionally supported by Remarkdip through their PhD Scholarship Programme. BP is supported by the Alan Turing Institute under the EPSRC grant EP/N510129/1.

References

K. Barnard, P. Duygulu, D. Forsyth, N. de Freitas, D. M. Blei, and M. I. Jordan. Matching words and pictures. Journal of Machine Learning Research, 3(Feb):1107–1135, 2003.

L. W. Barsalou. Grounded cognition. Annual Review of Psychology, 59:617–645, 2008.

M. I. Bauer and P. N. Johnson-Laird. How diagrams can improve reasoning. Psychological Science, 4(6):372–378, 1993.

D. M. Blei and M. I. Jordan. Modeling annotated data. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 127–134. ACM, 2003.

P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146, 2017.

K. Bousmalis, G. Trigeorgis, N. Silberman, D. Krishnan, and D. Erhan. Domain separation networks. In Advances in Neural Information Processing Systems, pages 343–351, 2016.

Y. Burda, R. Grosse, and R. Salakhutdinov. Importance weighted autoencoders. In International Conference on Learning Representations, 2015.

C. Chu, A. Zhmoginov, and M. Sandler. CycleGAN, a master of steganography. arXiv preprint arXiv:1712.02950, 2017.

C. Cremer, Q. Morris, and D. Duvenaud. Reinterpreting importance-weighted autoencoders.
In International\n\nConference on Learning Representations (Workshop), 2017.\n\nJ. E. Fan, D. Yamins, and N. B. Turk-Browne. Common object representations for visual recognition and\n\nproduction. Cognitive Science, 42:2670\u20132698, 2018.\n\nY. Ganin and V. Lempitsky. Unsupervised domain adaptation by backpropagation. In Proceedings of The 32nd\n\nInternational Conference on Machine Learning, pages 1180\u20131189, 2015.\n\nK. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE\n\nconference on computer vision and pattern recognition, pages 770\u2013778, 2016.\n\nG. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural computation, 14(8):\n\n1771\u20131800, 2002.\n\nN. Kalchbrenner, E. Grefenstette, and P. Blunsom. A convolutional neural network for modelling sentences. In\nProceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 212\u2013217.\nAssociation for Computational Linguistics, 2014.\n\nD. P. Kingma and J. Ba. Adam: a method for stochastic optimization. In International Conference on Learning\n\nRepresentations, 2014.\n\nD. P. Kingma and M. Welling. Auto-encoding variational bayes. In International Conference on Learning\n\nRepresentations, 2014.\n\nT. A. Le, M. Igl, T. Rainforth, T. Jin, and F. Wood. Auto-encoding sequential monte carlo. In International\n\nConference on Learning Representations, 2018.\n\nC. Ledig, L. Theis, F. Husz\u00e1r, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, and\nW. Shi. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings\nof the IEEE Conference on Computer Vision and Pattern Recognition, pages 4681\u20134690, 2017.\n\nC. Li and M. Wand. Precomputed real-time texture synthesis with markovian generative adversarial networks.\n\nIn European Conference on Computer Vision, pages 702\u2013716. Springer, 2016.\n\nZ. C. Lipton. 
The mythos of model interpretability. Communications of the ACM, 61(10):36–43, 2018.

M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised image-to-image translation networks. In Advances in Neural Information Processing Systems, pages 700–708, 2017.

M.-Y. Liu, X. Huang, A. Mallya, T. Karras, T. Aila, J. Lehtinen, and J. Kautz. Few-shot unsupervised image-to-image translation, 2019.

M. Long, Y. Cao, J. Wang, and M. I. Jordan. Learning transferable features with deep adaptation networks. In Proceedings of the 32nd International Conference on Machine Learning, pages 97–105, 2015.

M. Long, H. Zhu, J. Wang, and M. I. Jordan. Unsupervised domain adaptation with residual transfer networks. In Advances in Neural Information Processing Systems, pages 136–144, 2016.

D. Massiceti, P. K. Dokania, N. Siddharth, and P. H. Torr. Visual dialogue without vision or dialogue. In NeurIPS Workshop on Critiquing and Correcting Trends in Machine Learning, 2018a.

D. Massiceti, N. Siddharth, P. K. Dokania, and P. H. Torr. FlipDial: a generative model for two-way visual dialogue. In IEEE Conference on Computer Vision and Pattern Recognition, 2018b.

E. Mathieu, T. Rainforth, N. Siddharth, and Y. W. Teh. Disentangling disentanglement in variational autoencoders. In K. Chaudhuri and R. Salakhutdinov, editors, International Conference on Machine Learning (ICML), volume 97 of Proceedings of Machine Learning Research, pages 4402–4412, Long Beach, California, USA, June 2019. PMLR.

T. Mukherjee, M. Yamada, and T. M. Hospedales. Deep matching autoencoders. CoRR, abs/1711.06047, 2017.

J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng. Multimodal deep learning. In Proceedings of the 28th International Conference on Machine Learning, pages 689–696, 2011.

G. Pandey and A. Dukkipati. Variational methods for conditional multimodal deep learning. In International Joint Conference on Neural Networks, pages 308–315.
IEEE, 2017.\n\nN.-Q. Pham, G. Kruszewski, and G. Boleda. Convolutional neural network language models. In Proceedings of\n\nthe Conference on Empirical Methods in Natural Language Processing, pages 1153\u20131162, 2016.\n\nY. Pu, Z. Gan, R. Henao, X. Yuan, C. Li, A. Stevens, and L. Carin. Variational autoencoder for deep learning of\nimages, labels and captions. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors,\nAdvances in Neural Information Processing Systems, pages 2352\u20132360. Curran Associates, Inc., 2016.\n\nR. Q. Quiroga, A. Kraskov, C. Koch, and I. Fried. Explicit encoding of multimodal percepts by single neurons\n\nin the human brain. Current Biology, 19(15):1308\u20131313, 2009.\n\nT. Rainforth, A. R. Kosiorek, T. A. Le, C. J. Maddison, M. Igl, F. Wood, and Y. W. Teh. Tighter variational\n\nbounds are not necessarily better. International Conference on Machine Learning (ICML), 2018.\n\nS. J. Reddi, S. Kale, and S. Kumar. On the convergence of adam and beyond. In International Conference on\n\nLearning Representations, 2018.\n\nC. Robert and G. Casella. Monte Carlo statistical methods. Springer Science & Business Media, 2013.\n\nG. Roeder, Y. Wu, and D. K. Duvenaud. Sticking the landing: Simple, lower-variance gradient estimators\nfor variational inference. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan,\nand R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 6925\u20136934. Curran\nAssociates, Inc., 2017.\n\nC. Silberer and M. Lapata. Learning grounded meaning representations with autoencoders. In Proceedings of\nthe 52nd Annual Meeting of the Association for Computational Linguistics, volume 1, pages 721\u2013732, 2014.\n\nJ. M. Siskind. Grounding language in perception. Arti\ufb01cial Intelligence Review, 8(5):371\u2013391, Sep 1994.\n\nK. Sohn, H. Lee, and X. Yan. Learning structured output representation using deep conditional generative\n\nmodels. 
In Advances in Neural Information Processing Systems, pages 3483–3491, 2015.

B. E. Stein, T. R. Stanford, and B. A. Rowland. The neural basis of multisensory integration in the midbrain: its organization and maturation. Hearing Research, 258(1-2):4–15, 2009.

M. Suzuki, K. Nakayama, and Y. Matsuo. Joint multimodal learning with deep generative models. In International Conference on Learning Representations Workshop, 2017.

Y. Taigman, A. Polyak, and L. Wolf. Unsupervised cross-domain image generation. In International Conference on Learning Representations, 2017.

Y. Tian and J. Engel. Latent translation: Crossing modalities by bridging generative models. arXiv preprint arXiv:1902.08261, 2019.

Y. H. Tsai, P. P. Liang, A. A. Bagherzade, L.-P. Morency, and R. Salakhutdinov. Learning factorized multimodal representations. In International Conference on Learning Representations, 2019.

G. Tucker, D. Lawson, S. Gu, and C. J. Maddison. Doubly reparameterized gradient estimators for Monte Carlo objectives. In International Conference on Learning Representations, 2019.

E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell. Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474, 2014.

R. Vedantam, I. Fischer, J. Huang, and K. Murphy. Generative models of visually grounded imagination. In International Conference on Learning Representations, 2018.

C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.

W. Wang, X. Yan, H. Lee, and K. Livescu. Deep variational canonical correlation analysis. arXiv preprint arXiv:1610.03454, 2016.

X. Wang and A. Gupta. Generative image modeling using style and structure adversarial networks. In European Conference on Computer Vision, pages 318–335. Springer, 2016.

M. Wu and N. Goodman.
Multimodal generative models for scalable weakly-supervised learning. In Advances in Neural Information Processing Systems, pages 5580–5590, 2018.

X. Yan, J. Yang, K. Sohn, and H. Lee. Attribute2image: Conditional image generation from visual attributes. In European Conference on Computer Vision, pages 776–791. Springer, 2016.

I. Yildirim. From perception to conception: learning multisensory representations. 2014.

S. Zhao, J. Song, and S. Ermon. Towards deeper understanding of variational autoencoding models. arXiv preprint arXiv:1702.08658, 2017.

J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2223–2232, 2017a.

J.-Y. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, O. Wang, and E. Shechtman. Toward multimodal image-to-image translation. In Advances in Neural Information Processing Systems, pages 465–476, 2017b.