{"title": "Multimodal Generative Models for Scalable Weakly-Supervised Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 5575, "page_last": 5585, "abstract": "Multiple modalities often co-occur when describing natural phenomena. Learning a joint representation of these modalities should yield deeper and more useful representations. Previous generative approaches to multi-modal input either do not learn a joint distribution or require additional computation to handle missing data. Here, we introduce a multimodal variational autoencoder (MVAE) that uses a product-of-experts inference network and a sub-sampled training paradigm to solve the multi-modal inference problem. Notably, our model shares parameters to efficiently learn under any combination of missing modalities. We apply the MVAE on four datasets and match state-of-the-art performance using many fewer parameters. In addition, we show that the MVAE is directly applicable to weakly-supervised learning, and is robust to incomplete supervision. We then consider two case studies, one of learning image transformations---edge detection, colorization, segmentation---as a set of modalities, followed by one of machine translation between two languages. We find appealing results across this range of tasks.", "full_text": "Multimodal Generative Models for Scalable Weakly-Supervised Learning

Mike Wu
Department of Computer Science
Stanford University
Stanford, CA 94025
wumike@stanford.edu

Noah Goodman
Departments of Computer Science and Psychology
Stanford University
Stanford, CA 94025
ngoodman@stanford.edu

Abstract

Multiple modalities often co-occur when describing natural phenomena. Learning a joint representation of these modalities should yield deeper and more useful representations. Previous generative approaches to multi-modal input either do not learn a joint distribution or require additional computation to handle missing data.
Here, we introduce a multimodal variational autoencoder (MVAE) that uses a product-of-experts inference network and a sub-sampled training paradigm to solve the multi-modal inference problem. Notably, our model shares parameters to efficiently learn under any combination of missing modalities. We apply the MVAE on four datasets and match state-of-the-art performance using many fewer parameters. In addition, we show that the MVAE is directly applicable to weakly-supervised learning, and is robust to incomplete supervision. We then consider two case studies, one of learning image transformations (edge detection, colorization, segmentation) as a set of modalities, followed by one of machine translation between two languages. We find appealing results across this range of tasks.

1 Introduction

Learning from diverse modalities has the potential to yield more generalizable representations. For instance, the visual appearance and tactile impression of an object converge on a more invariant abstract characterization [32]. Similarly, an image and a natural language caption can capture complementary but converging information about a scene [28, 31]. While fully-supervised deep learning approaches can learn to bridge modalities, generative approaches promise to capture the joint distribution across modalities and flexibly support missing data. Indeed, multimodal data is expensive and sparse, which leads to a weakly-supervised setting: only a small set of examples has all observations present, but a much larger dataset has one (or a subset of) the modalities.

We propose a novel multimodal variational autoencoder (MVAE) to learn a joint distribution under weak supervision. The VAE [11] jointly trains a generative model, from latent variables to observations, with an inference network from observations to latents.
Moving to multiple modalities\nand missing data, we would naively need an inference network for each combination of modalities.\nHowever, doing so would result in an exponential explosion in the number of trainable parameters.\nAssuming conditional independence among the modalities, we show that the correct inference net-\nwork will be a product-of-experts [8], a structure which reduces the number of inference networks\nto one per modality. While the inference networks can be best trained separately, the generative\nmodel requires joint observations. Thus we propose a sub-sampled training paradigm in which\nfully-observed examples are treated as both fully and partially observed (for each gradient update).\nAltogether, this provides a novel and useful solution to the multi-modal inference problem.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fWe report experiments to measure the quality of the MVAE, comparing with previous models.\nWe train on MNIST [14], binarized MNIST [13], MultiMNIST [6, 20], FashionMNIST [30], and\nCelebA [15]. Several of these datasets have complex modalities\u2014character sequences, RGB images\u2014\nrequiring large inference networks with RNNs and CNNs. We show that the MVAE is able to support\nheavy encoders with thousands of parameters, matching state-of-the-art performance.\nWe then apply the MVAE to problems with more than two modalities. First, we revisit CelebA,\nthis time \ufb01tting the model with each of the 18 attributes as an individual modality. Doing so, we\n\ufb01nd better performance from sharing of statistical strength. We further explore this question by\nchoosing a handful of image transformations commonly studied in computer vision\u2014colorization,\nedge detection, segmentation, etc.\u2014and synthesizing a dataset by applying them to CelebA. 
We show that the MVAE can jointly learn these transformations by modeling them as modalities.

Finally, we investigate how the MVAE performs under incomplete supervision by reducing the number of multi-modal examples. We find that the MVAE is able to capture a good joint representation when only a small percentage of examples are multi-modal. To show real-world applicability, we then investigate weak supervision on machine translation, where each language is a modality.

2 Methods

A variational autoencoder (VAE) [11] is a latent variable generative model of the form pθ(x, z) = p(z)pθ(x|z), where p(z) is a prior, usually a spherical Gaussian. The decoder, pθ(x|z), consists of a deep neural net with parameters θ, composed with a simple likelihood (e.g. Bernoulli or Gaussian). The goal of training is to maximize the marginal likelihood of the data (the "evidence"); however, since this is intractable, the evidence lower bound (ELBO) is instead optimized. The ELBO is defined via an inference network, qφ(z|x), which serves as a tractable importance distribution:

ELBO(x) ≜ E_{qφ(z|x)}[λ log pθ(x|z)] − β KL[qφ(z|x), p(z)]    (1)

where KL[p, q] is the Kullback-Leibler divergence between distributions p and q; β [7] and λ are weights balancing the terms in the ELBO. In practice, λ = 1 and β is slowly annealed to 1 [2] to form a valid lower bound on the evidence. The ELBO is usually optimized (as we will do here) via stochastic gradient descent, using the reparameterization trick to estimate the gradient [11].

Figure 1: (a) Graphical model of the MVAE. Gray circles represent observed variables. (b) MVAE architecture with N modalities. Ei represents the i-th inference network; µi and σi represent the i-th variational parameters; µ0 and σ0 represent the prior parameters.
The product-of-experts (PoE) combines all variational parameters in a principled and efficient manner. (c) If a modality is missing during training, we drop the respective inference network. Thus, the parameters of E1, ..., EN are shared across different combinations of missing inputs.

In the multimodal setting we assume the N modalities, x1, ..., xN, are conditionally independent given the common latent variable, z (see Fig. 1a). That is, we assume a generative model of the form pθ(x1, x2, ..., xN, z) = p(z)pθ(x1|z)pθ(x2|z)···pθ(xN|z). With this factorization, we can ignore unobserved modalities when evaluating the marginal likelihood. If we write a data point as the collection of modalities present, that is X = {xi | ith modality present}, then the ELBO becomes:

ELBO(X) ≜ E_{qφ(z|X)}[ Σ_{xi∈X} λi log pθ(xi|z) ] − β KL[qφ(z|X), p(z)].    (2)

2.1 Approximating The Joint Posterior

The first obstacle to training the MVAE is specifying the 2^N inference networks, q(z|X), one for each subset of modalities X ⊆ {x1, x2, ..., xN}. Previous work (e.g. [23, 26]) has assumed that the relationship between the joint- and single-modality inference networks is unpredictable (and that separate training is therefore required). However, the optimal inference network q(z|x1, ..., xN) would be the true posterior p(z|x1, ..., xN).
The conditional independence assumptions in the generative model imply a relation among joint- and single-modality posteriors:

p(z|x1, ..., xN) = p(x1, ..., xN|z) p(z) / p(x1, ..., xN)
                 = [p(z) / p(x1, ..., xN)] ∏_{i=1}^{N} p(xi|z)
                 = [p(z) / p(x1, ..., xN)] ∏_{i=1}^{N} [p(z|xi) p(xi) / p(z)]
                 = [∏_{i=1}^{N} p(xi) / p(x1, ..., xN)] · [∏_{i=1}^{N} p(z|xi) / ∏_{i=1}^{N−1} p(z)]
                 ∝ ∏_{i=1}^{N} p(z|xi) / ∏_{i=1}^{N−1} p(z)    (3)

That is, the joint posterior is a product of the individual posteriors, with an additional quotient by the prior. If we assume that the true posterior for each individual factor, p(z|xi), is properly contained in the family of its variational counterpart¹, q(z|xi), then Eqn. 3 suggests that the correct q(z|x1, ..., xN) is a product and quotient of experts: ∏_{i=1}^{N} q(z|xi) / ∏_{i=1}^{N−1} p(z), which we call MVAE-Q.

Alternatively, if we approximate p(z|xi) with q(z|xi) ≡ q̃(z|xi) p(z), where q̃(z|xi) is the underlying inference network, we can avoid the quotient term:

∏_{i=1}^{N} p(z|xi) / ∏_{i=1}^{N−1} p(z) ≈ ∏_{i=1}^{N} [q̃(z|xi) p(z)] / ∏_{i=1}^{N−1} p(z) = p(z) ∏_{i=1}^{N} q̃(z|xi).    (4)

In other words, we can use a product of experts (PoE), including a "prior expert", as the approximating distribution for the joint posterior (Figure 1b). This representation is simpler and, as we describe below, numerically more stable.
This derivation is easily extended to any subset of modalities, yielding q(z|X) ∝ p(z) ∏_{xi∈X} q̃(z|xi) (Figure 1c). We refer to this version as MVAE.

The product and quotient distributions required above are not in general solvable in closed form. However, when p(z) and q̃(z|xi) are Gaussian there is a simple analytical solution: a product of Gaussian experts is itself Gaussian [5] with mean µ = (Σ_i µi Ti)(Σ_i Ti)^{−1} and covariance V = (Σ_i Ti)^{−1}, where µi, Vi are the parameters of the i-th Gaussian expert and Ti = Vi^{−1} is the inverse of the covariance (the precision). Similarly, given two Gaussian experts, p1(x) and p2(x), we can show that the quotient (QoE), p1(x)/p2(x), is also Gaussian with mean µ = (T1µ1 − T2µ2)(T1 − T2)^{−1} and covariance V = (T1 − T2)^{−1}, where Ti = Vi^{−1}. However, this distribution is well-defined only if V2 > V1 element-wise, a simple constraint that can be hard to deal with in practice. Full derivations for the PoE and QoE can be found in the supplement.

Thus we can compute all 2^N multi-modal inference networks required for the MVAE efficiently in terms of the N uni-modal components, q̃(z|xi); the additional quotient needed by the MVAE-Q variant is also easily calculated but requires an added constraint on the variances.

2.2 Sub-sampled Training Paradigm

On the face of it, we can now train the MVAE by simply optimizing the evidence lower bound given in Eqn. 2. However, a product of Gaussians does not uniquely specify its component Gaussians. Hence, given a complete dataset with no missing modalities, optimizing Eqn. 2 has an unfortunate consequence: we never train the individual inference networks (or small sub-networks) and thus do not know how to use them if presented with missing data at test time.
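Under the Gaussian assumption above, both combination rules reduce to a few lines of arithmetic on means and variances. The following sketch (our own illustrative code with hypothetical names, not the authors' implementation) implements the PoE and QoE updates for diagonal Gaussians:

```python
def poe(mus, variances):
    """Product of N diagonal Gaussian experts: precisions (inverse variances)
    add, and the mean is the precision-weighted average of the expert means."""
    dim = len(mus[0])
    mu, var = [], []
    for d in range(dim):
        t = sum(1.0 / v[d] for v in variances)                       # total precision
        m = sum(m_i[d] / v[d] for m_i, v in zip(mus, variances)) / t # weighted mean
        mu.append(m)
        var.append(1.0 / t)
    return mu, var

def qoe(mu1, var1, mu2, var2):
    """Quotient p1/p2 of two diagonal Gaussians; only a valid Gaussian when
    var2 > var1 element-wise (the constraint that makes MVAE-Q delicate)."""
    mu, var = [], []
    for d in range(len(mu1)):
        t1, t2 = 1.0 / var1[d], 1.0 / var2[d]
        assert var2[d] > var1[d], "quotient undefined unless V2 > V1"
        var.append(1.0 / (t1 - t2))
        mu.append((t1 * mu1[d] - t2 * mu2[d]) / (t1 - t2))
    return mu, var
```

Including the prior N(0, I) as one more expert in `poe` yields the MVAE posterior q(z|X) ∝ p(z) ∏ q̃(z|xi), whereas `qoe` would be needed only for the MVAE-Q variant.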
Conversely, if we treat every observation as independent observations of each modality, we can adequately train the inference networks q̃(z|xi), but will fail to capture the relationship between modalities in the generative model.

¹Without this assumption, the best approximation to a product of factors may not be the product of the best approximations for each individual factor. But the product of q(z|xi) is still a tractable family of approximations.

We propose instead a simple training scheme that combines these extremes, including ELBO terms for whole and partial observations. For instance, with N modalities, a complete example {x1, x2, ..., xN} can be split into 2^N partial examples: {x1}, {x2, x6}, {x5, xN−4, xN}, .... If we were to train using all 2^N subsets, it would require evaluating 2^N ELBO terms, which is computationally intractable. To reduce the cost, we sub-sample which ELBO terms to optimize at every gradient step. Specifically, we choose (1) the ELBO using the product of all N Gaussians, (2) all ELBO terms using a single modality, and (3) k ELBO terms using k randomly chosen subsets, Xk. For each minibatch, we thus evaluate a random subset of the 2^N ELBO terms; in expectation, we approximate the full objective. The sub-sampled objective can be written as:

ELBO(x1, ..., xN) + Σ_{i=1}^{N} ELBO(xi) + Σ_{j=1}^{k} ELBO(Xj)    (5)

We explore the effect of k in Sec. 5. A pleasant side-effect of this training scheme is that it generalizes to weakly-supervised learning. Given an example with missing data, X = {xi | ith modality present}, we can still sample partial data from X, ignoring modalities that are missing.

3 Related Work

Given two modalities, x1 and x2, many variants of VAEs [11, 10] have been used to train generative models of the form p(x2|x1), including conditional VAEs (CVAE) [21] and conditional multi-modal autoencoders (CMMA) [17].
Similar work has explored using hidden features from a VAE trained on\nimages to generate captions, even in the weakly supervised setting [18]. Critically, these models are\nnot bi-directional. We are more interested in studying models where we can condition interchangeably.\nFor example, the BiVCCA [29] trains two VAEs together with interacting inference networks to\nfacilitate two-way reconstruction. However, it does not attempt to directly model the joint distribution,\nwhich we \ufb01nd empirically to improve the ability of a model to learn the data distribution.\nSeveral recent models have tried to capture the joint distribution explicitly. [23] introduced the joint\nmulti-modal VAE (JMVAE), which learns p(x1, x2) using a joint inference network, q(z|x1, x2). To\nhandle missing data at test time, the JMVAE collectively trains q(z|x1, x2) with two other inference\nnetworks q(z|x1) and q(z|x2). The authors use an ELBO objective with two additional divergence\nterms to minimize the distance between the uni-modal and the multi-modal importance distributions.\nUnfortunately, the JMVAE trains a new inference network for each multi-modal subset, which we\nhave previously argued in Sec. 2 to be intractable in the general setting.\nMost recently, [26] introduce another objective for the bi-modal VAE, which they call the triplet\nELBO. Like the MVAE, their model\u2019s joint inference network q(z|x1, x2) combines variational\ndistributions using a product-of-experts rule. Unlike the MVAE, the authors report a two-stage\ntraining process: using complete data, \ufb01t q(z|x1, x2) and the decoders. Then, freezing p(x1|z) and\np(x2|z), \ufb01t the uni-modal inference networks, q(z|x1) and q(z|x2) to handle missing data at test\ntime. Crucially, because training is separated, the model has to \ufb01t 2 new inference networks to handle\nall combinations of missing data in stage two. 
While this paradigm is sufficient for two modalities, it does not generalize to the truly multi-modal case. To the best of our knowledge, the MVAE is the first deep generative model to explore more than two modalities efficiently. Moreover, the single-stage training of the MVAE makes it uniquely applicable to weakly-supervised learning.

Our proposed technique resembles established work in several ways. For example, the PoE is reminiscent of a restricted Boltzmann machine (RBM), another latent variable model that has been applied to multi-modal learning [16, 22]. Like our inference networks, the RBM decomposes the posterior into a product of independent components. The benefit that an MVAE offers over an RBM is a simpler training algorithm via gradient descent, rather than requiring contrastive divergence, yielding faster models that can handle more data. Our sub-sampling technique is somewhat similar to denoising [27, 16], where a subset of inputs are "partially destructed" to encourage robust representations in autoencoders. In our case, we can think of "robustness" as capturing the true marginal distributions.

4 Experiments

As in previous literature, we transform uni-modal datasets into multi-modal problems by treating labels as a second modality. We compare existing models (VAE, BiVCCA, JMVAE) to the MVAE and show that we equal state-of-the-art performance on four image datasets: MNIST, FashionMNIST, MultiMNIST, and CelebA. For each dataset, we keep the network architectures consistent across models, varying only the objective and training procedure. Unless otherwise noted, given images x1 and labels x2, we set λ1 = 1 and λ2 = 50.
We find that upweighting the reconstruction error for the low-dimensional modalities is important for learning a good joint distribution.

Model   | BinaryMNIST | MNIST   | FashionMNIST | MultiMNIST | CelebA
VAE     | 730240      | 730240  | 3409536      | 1316936    | 4070472
CVAE    | 735360      | 735360  | 3414656      | –          | 4079688
BiVCCA  | 1063680     | 1063680 | 3742976      | 1841936    | 4447504
JMVAE   | 2061184     | 2061184 | 7682432      | 4075064    | 9052504
MVAE-Q  | 1063680     | 1063680 | 3742976      | 1841936    | 4447504
MVAE    | 1063680     | 1063680 | 3742976      | 1841936    | 4447504
JMVAE19 | –           | –       | –            | –          | 3.6259e12
MVAE19  | –           | –       | –            | –          | 10857048

Table 1: Number of inference network parameters. For a single dataset, each generative model uses the same inference network architecture(s) for each modality; thus, the difference in parameter count is due solely to how the inference networks interact in the model. We note that the MVAE has the same number of parameters as BiVCCA. JMVAE19 and MVAE19 show the number of parameters using 19 inference networks, when each of the attributes in CelebA is its own modality.

Our version of MultiMNIST contains between 0 and 4 digits composed together on a 50x50 canvas. Unlike [6], the digits are fixed in location. We generate the second modality by concatenating the digits from top-left to bottom-right to form a string. As in the literature, we use an RNN encoder and decoder [2]. Furthermore, we explore two versions of learning in CelebA: one where we treat the 18 attributes as a single modality, and one where we treat each attribute as its own modality, for a total of 19. We denote the latter as MVAE19. In this scenario, to approximate the full objective, we set k = 1 for a total of 21 ELBO terms (as in Eqn. 5).
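The sub-sampled objective (Eqn. 5) boils down to choosing, at each gradient step, which subsets of the present modalities receive an ELBO term. A minimal sketch (function and variable names are ours, not from the paper's code):

```python
import random

def sample_elbo_subsets(present, k, rng=random):
    """Choose the ELBO terms evaluated on one gradient step (Eqn. 5):
    (1) the joint term over all present modalities, (2) one term per single
    modality, and (3) k randomly chosen non-empty subsets. `present` lists
    the indices of the modalities observed in this example, so the same
    routine applies unchanged under weak supervision."""
    terms = [tuple(present)]                 # (1) joint term
    terms += [(i,) for i in present]         # (2) single-modality terms
    for _ in range(k):                       # (3) k random subsets
        size = rng.randint(1, len(present))
        terms.append(tuple(sorted(rng.sample(present, size))))
    return terms
```

With 19 modalities and k = 1 this returns 1 + 19 + 1 = 21 terms, matching the CelebA setup described above.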
For complete details, including training hyperparameters and encoder/decoder architecture specification, refer to the supplement.

5 Evaluation

In the bi-modal setting, with x1 denoting the image and x2 denoting the label, we measure the test marginal log-likelihood, log p(x1), and test joint log-likelihood, log p(x1, x2), using 100 importance samples in CelebA and 1000 samples in the other datasets. In doing so, we have a choice of which inference network to use. For example, using q(z|x1), we estimate log p(x1) ≈ log E_{q(z|x1)}[p(x1|z)p(z) / q(z|x1)]. We also compute the test conditional log-likelihood, log p(x1|x2), as a measure of classification performance, as done in [23]: log p(x1|x2) ≈ log E_{q(z|x2)}[p(x1|z)p(x2|z)p(z) / q(z|x2)] − log E_{p(z)}[p(x2|z)]. In CelebA, we use 1000 samples to estimate E_{p(z)}[p(x2|z)]; in all others, we use 5000 samples. These marginal probabilities measure the ability of the model to capture the data distribution and its conditionals. Higher-scoring models are better able to generate proper samples and convert between modalities, which is exactly what we find desirable in a generative model.

Quality of the Inference Network: In all VAE-family models, the inference network functions as an importance distribution for approximating the intractable posterior. A better importance distribution, which more accurately approximates the posterior, results in importance weights with lower variance. Thus, we estimate the variance of the (log) importance weights as a measure of inference network quality (see Table 3).

Fig. 2 shows image samples and conditional image samples for each dataset using the image generative model. We find the samples to be of good quality, and find conditional samples to be largely correctly matched to the target label.
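Both evaluation quantities, the importance-sampled log-likelihood and the variance of the log importance weights, can be computed from the same per-sample log weights. A sketch (ours, assuming the log weights log p(x1|z) + log p(z) − log q(z|x1) have already been evaluated for each z ∼ q):

```python
import math

def log_mean_exp(vals):
    """Numerically stable log(mean(exp(vals)))."""
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals) / len(vals))

def estimate_log_marginal(log_weights):
    """log p(x1) ~= log E_q[p(x1|z) p(z) / q(z|x1)], estimated from K
    importance samples via their log weights."""
    return log_mean_exp(log_weights)

def log_weight_variance(log_weights):
    """Variance of the log importance weights (Table 3's metric); lower
    means the proposal q(z|x1) better matches the true posterior."""
    mean = sum(log_weights) / len(log_weights)
    return sum((w - mean) ** 2 for w in log_weights) / len(log_weights)
```

The same two functions cover the joint and conditional cases by plugging in the corresponding log weights.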
Table 2 shows test log-likelihoods for each model and dataset.² We see that the MVAE performs on par with the state of the art (JMVAE) while using far fewer parameters (see Table 1).

²These results used q(z|x1) as the importance distribution. See the supplement for similar results using q(z|x1, x2). Because importance sampling with either q(z|x1) or q(z|x1, x2) yields an unbiased estimator of the marginal likelihood, we expect the log-likelihoods to agree asymptotically.

Model  | BinaryMNIST | MNIST   | FashionMNIST | MultiMNIST | CelebA
Estimated log p(x1)
VAE    | -232.758 | -91.126 | -86.313 | -152.835 | -6237.120
BiVCCA | -233.634 | -92.089 | -87.354 | -202.490 | -7263.536
JMVAE  | -232.630 | -90.697 | -86.305 | -152.787 | -6237.967
MVAE-Q | -236.081 | -96.028 | -91.665 | -166.580 | -6290.085
MVAE   | -232.535 | -90.619 | -86.026 | -152.761 | -6236.923
MVAE19 | –        | –       | –       | –        | -6236.109
Estimated log p(x1, x2)
JMVAE  | -232.948 | -90.769 | -86.371 | -153.101 | -6242.187
MVAE-Q | -236.827 | -96.641 | -92.259 | -173.615 | -6294.861
MVAE   | -233.007 | -90.859 | -86.255 | -153.469 | -6242.034
MVAE19 | –        | –       | –       | –        | -6239.944
Estimated log p(x1|x2)
CVAE   | -229.667 | -87.773 | -83.448 | –        | -6228.771
JMVAE  | -230.396 | -88.696 | -83.985 | -145.977 | -6231.468
MVAE-Q | -234.514 | -94.347 | -90.024 | -163.302 | -6311.487
MVAE   | -230.695 | -88.569 | -83.970 | -147.027 | -6234.955
MVAE19 | –        | –       | –       | –        | -6233.340

Table 2: Estimates (using q(z|x1)) for marginal probabilities on the average test example. MVAE and JMVAE are roughly equivalent in data log-likelihood, but as Table 1 shows, the MVAE uses far fewer parameters. The CVAE is often better at capturing p(x1|x2) but does not learn a joint distribution.

Figure 2: Image samples using the MVAE. (a, c, e, g) show 64 images per dataset obtained by sampling z ∼ p(z) and then generating via p(x1|z). Similarly, (b, d, f, h) show conditional image reconstructions obtained by sampling z ∼ q(z|x2), where (b) x2 = 5, (d) x2 = Ankle boot, (f) x2 = 1773, (h) x2 = Male.

When considering only p(x1) (i.e. the likelihood of the image modality alone), the MVAE also performs best, slightly beating even the image-only VAE, indicating that solving the harder multi-modal problem does not sacrifice any uni-modal model capacity and perhaps helps. On CelebA, MVAE19 (which treats features as independent modalities) outperforms the MVAE (which treats the feature vector as a single modality). This suggests that the PoE approach generalizes to a larger number of modalities, and that joint training shares statistical strength. Moreover, we show in the supplement that MVAE19 is robust to randomly dropping modalities.

Table 3 shows the variances of the log importance weights. The MVAE always produces lower variance than the other methods that capture the joint distribution, and often lower than conditional or single-modality models. Furthermore, MVAE19 consistently produces lower variance than MVAE on CelebA.
Overall, this suggests that the PoE approach used by the MVAE yields better inference networks.

Model  | BinaryMNIST | MNIST  | FashionMNIST | MultiMNIST | CelebA
Variance of Marginal Log Importance Weights: var(log(p(x1, z) / q(z|x1)))
VAE    | 22.264 | 26.904 | 25.795 | 54.554  | 56.291
BiVCCA | 55.846 | 93.885 | 33.930 | 185.709 | 429.045
JMVAE  | 39.427 | 37.479 | 53.697 | 84.186  | 331.865
MVAE-Q | 34.300 | 37.463 | 34.285 | 69.099  | 100.072
MVAE   | 22.181 | 25.640 | 20.309 | 26.917  | 73.923
MVAE19 | –      | –      | –      | –       | 71.640
Variance of Joint Log Importance Weights: var(log(p(x1, x2, z) / q(z|x1)))
JMVAE  | 40.126 | 41.003 | 56.640 | 91.850  | 334.887
MVAE-Q | 38.190 | 34.615 | 34.908 | 64.556  | 101.238
MVAE   | 27.570 | 23.343 | 20.587 | 27.989  | 76.938
MVAE19 | –      | –      | –      | –       | 72.030
Variance of Conditional Log Importance Weights: var(log(p(x1, z|x2) / q(z|x1)))
CVAE   | 21.203 | 22.486 | 12.748 | –       | 56.852
JMVAE  | 23.877 | 26.695 | 26.658 | 37.726  | 81.190
MVAE-Q | 34.719 | 38.090 | 34.978 | 44.269  | 101.223
MVAE   | 19.478 | 25.899 | 18.443 | 16.822  | 73.885
MVAE19 | –      | –      | –      | –       | 71.824

Table 3: Average variance of log importance weights for three marginal probabilities, estimated by importance sampling from q(z|x1). 1000 importance samples were used to approximate the variance. The lower the variance, the better the quality of the inference network.

Effect of number of ELBO terms: In the MVAE training paradigm, there is a hyperparameter k that controls the number of sampled ELBO terms used to approximate the intractable objective. To investigate its importance, we vary k from 0 to 50 and, for each value, train an MVAE19 on CelebA. We find that increasing k has little effect on data log-likelihood but reduces the variance of the importance distribution defined by the inference networks. In practice, we choose a small k as a tradeoff between computation and a better importance distribution.
See the supplement for more details.

Figure 3: Effects of supervision level, for (a) Dynamic MNIST, (b) FashionMNIST, and (c) MultiMNIST. [Plots of accuracy against the log number of paired examples, comparing Autoencoder, VAE, JMVAE, MVAE, Feedforward, RBM, and LogReg.] We plot the level of supervision as the log number of paired examples shown to each model. For MNIST and FashionMNIST, we predict the target class. For MultiMNIST, we predict the correct string representing each digit. We compare against a suite of baselines composed of models in the relevant literature and commonly used classifiers. The MVAE consistently beats all baselines in the middle region, where there is enough data to fit a deep model but not enough paired data for full supervision; in the fully-supervised regime, the MVAE is competitive with feedforward deep networks. See the supplement for accuracies.

5.1 Weakly Supervised Learning

For each dataset, we simulate incomplete supervision by randomly reserving a fraction of the dataset as multi-modal examples. The remaining data is split into two datasets: one with only the first modality, and one with only the second. These are shuffled to destroy any pairing. We examine the effect of supervision on the predictive task p(x2|x1), e.g. predicting the correct digit label, x2, from an image x1. For the MVAE, the total number of examples shown to the model is always fixed; only the proportion of complete bi-modal examples is varied.
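The data preparation just described can be sketched as follows (illustrative code, ours; `weak_supervision_split` is a hypothetical name):

```python
import random

def weak_supervision_split(pairs, frac_paired, rng):
    """Simulate incomplete supervision (Sec. 5.1): keep a random fraction of
    (x1, x2) pairs as bi-modal examples; split the rest into two uni-modal
    datasets and shuffle each independently to destroy the pairing."""
    pairs = pairs[:]
    rng.shuffle(pairs)
    n_paired = int(round(frac_paired * len(pairs)))
    paired, rest = pairs[:n_paired], pairs[n_paired:]
    only_x1 = [x1 for x1, _ in rest]   # first modality only
    only_x2 = [x2 for _, x2 in rest]   # second modality only
    rng.shuffle(only_x1)
    rng.shuffle(only_x2)
    return paired, only_x1, only_x2
```

Shuffling the two uni-modal halves independently is what destroys any residual alignment, so the model can only learn the cross-modal relationship from the reserved paired fraction.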
We compare the performance of the MVAE against a suite of baseline models: (1) a supervised neural network using the same architectures (with the stochastic layer removed) as in the MVAE; (2) logistic regression on raw pixels; (3) an autoencoder trained on the full set of images, followed by logistic regression on a subset of paired examples; we do something similar for (4) VAEs and (5) RBMs, where the internal latent state is used as input to the logistic regression; finally, (6) we train the JMVAE (α = 0.01, as suggested in [23]) on the subset of paired examples. Fig. 3 shows performance as we vary the level of supervision. For MultiMNIST, x2 is a string (e.g. "6 8 1 2") representing the numbers in the image. We only include the JMVAE as a baseline, since it is not straightforward to output raw strings in a supervised manner.

We find that the MVAE surpasses all the baselines in a middle region, where there are enough paired examples to sufficiently train the deep networks but not enough to learn a supervised network. This is especially pronounced in FashionMNIST, where the MVAE equals a fully supervised network with two orders of magnitude fewer paired examples (see Fig. 3). Intuitively, these results suggest that the MVAE can effectively learn the joint distribution by bootstrapping from a larger set of uni-modal data. A second observation is that the MVAE almost always performs better than the JMVAE. This discrepancy is likely due to directly optimizing the marginal distributions rather than minimizing the distance between several variational posteriors.
We noticed empirically that in the JMVAE, using samples from q(z|x, y) did much better (in accuracy) than samples from q(z|x).

Figure 4: Learning computer vision transformations: (a) Edge Detection and Facial Landscapes: 4 ground-truth images randomly chosen from CelebA, along with reconstructed images, edges, and facial landscape masks; (b) Colorization: reconstructed color images; (c) Fill in the Blank: image completion via reconstruction; (d) Removing Watermarks: reconstructed images with the watermark removed. See the supplement for a larger version with more samples.

6 Case study: Computer Vision Applications

We use the MVAE to learn image transformations (and their inverses) as conditional distributions. In particular, we focus on colorization, edge detection, facial landmark segmentation, image completion, and watermark removal. The original image is itself a modality, for a total of six.

To build the dataset, we apply ground-truth transformations to CelebA. For colorization, we transform RGB colors to grayscale. For image completion, half of the image is replaced with black pixels. For watermark removal, we overlay a generic watermark. To extract edges, we use the Canny detector [4] from Scikit-Image [24]. To compute facial landscape masks, we use dlib [9] and OpenCV [3]. We fit an MVAE with 250 latent dimensions and k = 1. We use Adam with a 10^-4 learning rate, a batch size of 50, λi = 1 for i = 1, ..., N, and β annealing for 20 out of 100 epochs. Fig. 4 shows samples showcasing the different learned transformations. In Fig. 4a we encode the original image with the learned encoder, then decode the transformed image with the learned generative model. We see reasonable reconstruction, and good facial landscape and edge extraction. In Figs. 4b, 4c, and 4d we go in the opposite direction, encoding a transformed image and then sampling from the generative model to reconstruct the original.
The results are again quite good: reconstructed half-images agree on gaze direction and hair color, colorizations are reasonable, and all trace of the watermark is removed. (Though the reconstructed images still suffer from the same blurriness that VAEs do [33].)

7 Case study: Machine Translation

As a second case study we explore machine translation with weak supervision, that is, where only a small subset of the data consists of translated sentence pairs. Many of the popular translation models [25] are fully supervised, with millions of parameters, and trained on datasets with tens of millions of paired examples. Yet aligning text across languages is very costly, requiring input from expert human translators. Even the unsupervised machine translation literature relies on large bilingual dictionaries, strong pre-trained language models, or synthetic datasets [12, 1, 19]. These factors make weak supervision particularly intriguing.
We use the English-Vietnamese dataset (113K sentence pairs) from IWSLT 2015 and treat English (en) and Vietnamese (vi) as two modalities. We train the MVAE with 100 latent dimensions for 100 epochs (λen = λvi = 1). We use the RNN architectures from [2] with a maximum sequence length of 70 tokens. As in [2], word dropout and KL annealing are crucial to prevent latent collapse.

Num. Aligned Data (%)    Test log p(x)
133 (0.1%)               −558.88 ± 3.56
665 (0.5%)               −494.76 ± 4.18
1330 (1%)                −483.23 ± 5.81
6650 (5%)                −478.75 ± 3.00
13300 (10%)              −478.04 ± 4.95
133000 (100%)            −478.12 ± 3.02

Table 4: Weakly supervised translation. Log likelihoods on a test set, averaged over 3 runs. Notably, we find good performance with a small fraction of paired examples.

Type                     Sentence
xen ∼ pdata              this was one of the highest points in my life.
xvi ∼ p(xvi|z(xen))      Đó là một gian tôi với của cuộc đời tôi.
GOOGLE(xvi)              It was a great time of my life.
xen ∼ pdata              the project's also made a big difference in the lives of the people.
xvi ∼ p(xvi|z(xen))      tôi án này được ra một Điều lớn lao cuộc sống của chúng người sống chứa hưởng.
GOOGLE(xvi)              this project is a great thing for the lives of people who live and thrive.
xvi ∼ pdata              trước tiên, tại sao chúng lại có ấn tượng xấu như vậy?
xen ∼ p(xen|z(xvi))      first of all, you do not a good job?
GOOGLE(xvi)              First, why are they so bad?
xvi ∼ pdata              Ông ngoại của tôi là một người thật đáng phục vào thời ấy.
xen ∼ p(xen|z(xvi))      grandfather is the best experience of me family.
GOOGLE(xvi)              My grandfather was a worthy person at the time.

Table 5: Examples of (1) translating English to Vietnamese by sampling from p(xvi|z) where z ∼ q(z|xen), and (2) the inverse. We use Google Translate (GOOGLE) for ground truth.

With only 1% of aligned examples, the MVAE is able to describe test data almost as well as it could with a fully supervised dataset (Table 4). With 5% aligned examples, the model reaches maximum performance. Table 5 shows examples of translation forwards and backwards between English and Vietnamese. See supplement for more examples. We find that while many of the translations are not especially faithful, they often capture an interpretation close to the true meaning.
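Translation here is just cross-modal inference: encode the source sentence to a posterior over z, sample, and decode with the other language's decoder. The reason a single-modality encoding is even well-defined is the MVAE's product-of-experts posterior: one Gaussian expert per observed modality plus a standard-normal prior expert, so dropping a modality simply removes its factor. A minimal numpy sketch of the standard precision-weighted product of Gaussians (the function name is illustrative):

```python
import numpy as np

def poe_gaussian(mus, logvars):
    # Product of diagonal-Gaussian experts plus a standard-normal prior expert:
    # precisions add, and the combined mean is the precision-weighted average.
    mu_stack = np.vstack([np.zeros_like(mus[0])] + list(mus))
    prec = 1.0 / np.exp(np.vstack([np.zeros_like(logvars[0])] + list(logvars)))
    var = 1.0 / prec.sum(axis=0)
    mu = (mu_stack * prec).sum(axis=0) * var
    return mu, np.log(var)

# With a single modality present, the product falls back gracefully:
# one expert N(2, 1) combined with the prior N(0, 1) gives N(1, 0.5).
mu, logvar = poe_gaussian([np.array([2.0])], [np.array([0.0])])
```

Passing only the English expert's (mu, logvar) yields q(z|xen); passing both language experts yields the joint posterior, all with the same parameters.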
While these results are not competitive with state-of-the-art translation, they are remarkable given the very weak supervision. Future work should investigate combining the MVAE with modern translation architectures (e.g. transformers, attention).

8 Conclusion

We introduced a multi-modal variational autoencoder with a new training paradigm that learns a joint distribution and is robust to missing data. By optimizing the ELBO with multi-modal and uni-modal examples, we fully utilize the product-of-experts structure to share inference-network parameters in a fashion that scales to an arbitrary number of modalities. We find that the MVAE matches the state-of-the-art on four bi-modal datasets, and shows promise on two real-world datasets.

Acknowledgments

MW is supported by NSF GRFP and the Google Cloud Platform Education grant. NDG is supported under DARPA PPAML through the U.S. AFRL under Cooperative Agreement FA8750-14-2-0006. We thank Robert X. D. Hawkins and Ben Peloquin for helpful discussions.

References

[1] Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. Unsupervised neural machine translation. arXiv preprint arXiv:1710.11041, 2017.

[2] Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349, 2015.

[3] Gary Bradski and Adrian Kaehler. OpenCV. Dr. Dobb's Journal of Software Tools, 3, 2000.

[4] John Canny. A computational approach to edge detection. In Readings in Computer Vision, pages 184–203. Elsevier, 1987.

[5] Yanshuai Cao and David J. Fleet. Generalized product of experts for automatic and principled fusion of Gaussian process predictions. arXiv preprint arXiv:1410.7827, 2014.

[6] S. M. Ali Eslami, Nicolas Heess, Theophane Weber, Yuval Tassa, David Szepesvari, Geoffrey E. Hinton, et al. Attend, infer, repeat: Fast scene understanding with generative models.
In Advances in Neural Information Processing Systems, pages 3225–3233, 2016.

[7] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. β-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, 2017.

[8] Geoffrey E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002.

[9] Davis E. King. Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research, 10(Jul):1755–1758, 2009.

[10] Diederik P. Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pages 3581–3589, 2014.

[11] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

[12] Guillaume Lample, Ludovic Denoyer, and Marc'Aurelio Ranzato. Unsupervised machine translation using monolingual corpora only. arXiv preprint arXiv:1711.00043, 2017.

[13] Hugo Larochelle and Iain Murray. The neural autoregressive distribution estimator. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 29–37, 2011.

[14] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[15] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV), 2015.

[16] Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y. Ng. Multimodal deep learning. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 689–696, 2011.

[17] Gaurav Pandey and Ambedkar Dukkipati.
Variational methods for conditional multimodal deep learning. In 2017 International Joint Conference on Neural Networks (IJCNN), pages 308–315. IEEE, 2017.

[18] Yunchen Pu, Zhe Gan, Ricardo Henao, Xin Yuan, Chunyuan Li, Andrew Stevens, and Lawrence Carin. Variational autoencoder for deep learning of images, labels and captions. In Advances in Neural Information Processing Systems, pages 2352–2360, 2016.

[19] Sujith Ravi and Kevin Knight. Deciphering foreign language. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1, pages 12–21. Association for Computational Linguistics, 2011.

[20] Sara Sabour, Nicholas Frosst, and Geoffrey E. Hinton. Dynamic routing between capsules. In Advances in Neural Information Processing Systems, pages 3859–3869, 2017.

[21] Kihyuk Sohn, Honglak Lee, and Xinchen Yan. Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483–3491, 2015.

[22] Nitish Srivastava and Ruslan R. Salakhutdinov. Multimodal learning with deep Boltzmann machines. In Advances in Neural Information Processing Systems, pages 2222–2230, 2012.

[23] Masahiro Suzuki, Kotaro Nakayama, and Yutaka Matsuo. Joint multimodal learning with deep generative models. arXiv preprint arXiv:1611.01891, 2016.

[24] Stefan Van der Walt, Johannes L. Schönberger, Juan Nunez-Iglesias, François Boulogne, Joshua D. Warner, Neil Yager, Emmanuelle Gouillart, and Tony Yu. scikit-image: Image processing in Python. PeerJ, 2:e453, 2014.

[25] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.
In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.

[26] Ramakrishna Vedantam, Ian Fischer, Jonathan Huang, and Kevin Murphy. Generative models of visually grounded imagination. arXiv preprint arXiv:1705.10762, 2017.

[27] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pages 1096–1103. ACM, 2008.

[28] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3156–3164. IEEE, 2015.

[29] Weiran Wang, Xinchen Yan, Honglak Lee, and Karen Livescu. Deep variational canonical correlation analysis. arXiv preprint arXiv:1610.03454, 2016.

[30] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.

[31] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pages 2048–2057, 2015.

[32] Ilker Yildirim. From Perception to Conception: Learning Multisensory Representations. PhD thesis, University of Rochester, 2014.

[33] Shengjia Zhao, Jiaming Song, and Stefano Ermon. Towards deeper understanding of variational autoencoding models. arXiv preprint arXiv:1702.08658, 2017.