{"title": "Improved Multimodal Deep Learning with Variation of Information", "book": "Advances in Neural Information Processing Systems", "page_first": 2141, "page_last": 2149, "abstract": "Deep learning has been successfully applied to multimodal representation learning problems, with a common strategy of learning joint representations that are shared across multiple modalities on top of layers of modality-specific networks. Nonetheless, there still remains the question of how to learn a good association between data modalities; in particular, a good generative model of multimodal data should be able to reason about a missing data modality given the rest of the data modalities. In this paper, we propose a novel multimodal representation learning framework that explicitly aims at this goal. Rather than learning with maximum likelihood, we train the model to minimize the variation of information. We provide theoretical insight into why the proposed learning objective is sufficient to estimate the data-generating joint distribution of multimodal data. We apply our method to restricted Boltzmann machines and introduce learning methods based on contrastive divergence and multi-prediction training. In addition, we extend the method to deep networks with a recurrent encoding structure to finetune the whole network. 
In experiments, we demonstrate state-of-the-art visual recognition performance on the MIR-Flickr and PASCAL VOC 2007 databases with and without text features.", "full_text": "Improved Multimodal Deep Learning with Variation of Information\n\nKihyuk Sohn, Wenling Shang, Honglak Lee\n\nUniversity of Michigan, Ann Arbor, MI, USA\n\n{kihyuks,shangw,honglak}@umich.edu\n\nAbstract\n\nDeep learning has been successfully applied to multimodal representation learning problems, with a common strategy of learning joint representations that are shared across multiple modalities on top of layers of modality-specific networks. Nonetheless, there still remains a question about how to effectively learn associations between heterogeneous data modalities; in particular, a good generative model of multimodal data should be able to reason about a missing data modality given the rest of the data modalities. In this paper, we propose a novel multimodal representation learning framework that explicitly aims at this goal by training the model to minimize the variation of information rather than maximizing likelihood. We provide a theoretical insight into why the proposed learning objective is sufficient to estimate the data-generating joint distribution of multimodal data. We apply our method to restricted Boltzmann machines and introduce learning algorithms based on contrastive divergence and multi-prediction training. Further, we extend our method to deep networks with recurrent encoding for finetuning. In experiments, we demonstrate state-of-the-art visual recognition performance on the MIR-Flickr and PASCAL VOC2007 databases with and without text observations.\n\n1 Introduction\n\nDifferent types of multiple data modalities can be used to describe the same event. 
For example,\nimages, which are often represented with pixels or image descriptors, can also be described with\naccompanying text (e.g., user tags or subtitles) or audio data (e.g., human voice or natural sound).\nThere have been several applications of multimodal learning from multiple domains such as emo-\ntion and speech recognition with audio-visual data [16, 24, 13], robotics applications with visual and\ndepth data [18, 20, 34, 26], and medical applications with visual and temporal data [29]. For each\napplication, data from multiple sources are semantically correlated, and sometimes provide comple-\nmentary information about each other. To facilitate information exchange, it is important to capture\na high-level association between data modalities with a compact set of latent variables. However,\nlearning associations between multiple heterogeneous data distributions is a challenging problem.\nA naive approach is to concatenate the data descriptors from different input sources to construct a\nsingle high-dimensional feature vector and use it to solve a unimodal representation learning prob-\nlem. However, the correlation between features in each data modality is much stronger than that\nbetween data modalities. As a result, the learning algorithms are easily tempted to learn dominant\npatterns in each data modality separately while giving up learning patterns that occur simultaneously\nin multiple data modalities, as suggested by [24]. To resolve this issue, deep learning methods, such\nas deep autoencoders [11] or deep Boltzmann machines (DBM) [27], have been adapted [24, 30],\nwhere the common strategy is to learn joint representations that are shared across multiple modali-\nties at the higher layer of the deep network, after learning layers of modality-speci\ufb01c networks. The\nrationale is that the learned features may have less within-modality correlation than raw features, and\nthis makes it easier to capture patterns across data modalities. 
This has shown promise, but there still remains the challenging question of how to learn associations between multiple heterogeneous data modalities so that we can effectively deal with missing data modalities at testing time.\nOne necessary condition for a good generative model of multimodal data is the ability to predict or reason about missing data modalities given partial observation. To this end, we propose a novel multimodal representation learning framework that explicitly aims at this goal. The key idea is to minimize the information distance between data modalities through the shared latent representations. More concretely, we train the model to minimize the variation of information (VI), an information theoretic measure that computes the distance between random variables, i.e., multiple data modalities. Note that this is in contrast to previous approaches on multimodal deep learning, which are based on maximum (joint) likelihood (ML) learning [24, 30]. We explain how our method can be more effective in learning the joint representation of multimodal data than ML learning, and give theoretical insight into why the proposed learning objective is sufficient to estimate the data-generating joint distribution of multimodal data. We apply the proposed framework to the multimodal restricted Boltzmann machine (MRBM) and propose two learning algorithms, based on contrastive divergence [23] and multi-prediction training [7]. Finally, we extend the framework to a multimodal deep recurrent neural network (MDRNN) for unsupervised finetuning of the whole network. In experiments, we demonstrate state-of-the-art visual recognition performance on the MIR-Flickr and PASCAL VOC2007 databases with and without text observations at testing time.\n\n2 Multimodal Learning with Variation of Information\n\nIn this section, we propose a novel training objective based on the VI. 
We make a comparison to the ML objective, a typical learning objective for training generative models of multimodal data, to give an insight as to how our proposed method can be better for multimodal data. Finally, we establish a theorem showing that the proposed learning objective is sufficient to obtain a good generative model that fully recovers the joint data-generating distribution of multimodal data.\nNotation. We use uppercase letters $X$, $Y$ to denote random variables and lowercase letters $x$, $y$ for their realizations. Let $P_D$ be the data-generating distribution and $P_\\theta$ the model distribution parametrized by $\\theta$. For presentation clarity, we slightly abuse the notation and let $Q$ denote conditional ($Q(x|y)$, $Q(y|x)$), marginal ($Q(x)$, $Q(y)$), as well as joint ($Q(x, y)$) distributions. The type of distribution of $Q$ should be clear from the context.\n\n2.1 Minimum Variation of Information Learning\n\nMotivated by the necessary condition for good generative models to reason about the missing data modality, it seems natural to learn to maximize the amount of information that one data modality has about the others. We quantify such an amount of information between data modalities using variation of information. The VI is an information theoretic measure that computes the information distance between two random variables (e.g., data modalities), and is written as follows:^1\n\n$$\\mathrm{VI}_Q(X, Y) = -\\mathbb{E}_{Q(X,Y)}\\big[\\log Q(X|Y) + \\log Q(Y|X)\\big] \\tag{1}$$\n\nwhere $Q(X, Y) = P_\\theta(X, Y)$ is any joint distribution on random variables $(X, Y)$ parametrized by $\\theta$. Informally, VI is small when the conditional likelihoods $Q(X|Y)$ and $Q(Y|X)$ are \u201cpeaked\u201d, meaning that $X$ has low entropy conditioned on $Y$ and vice versa. Following this intuition, we define a new multimodal learning criterion, minimum variation of information (MinVI) learning, as follows:\n\n$$\\text{MinVI:}\\quad \\min_\\theta \\mathcal{L}_{\\mathrm{VI}}(\\theta), \\qquad \\mathcal{L}_{\\mathrm{VI}}(\\theta) = -\\mathbb{E}_{P_D(X,Y)}\\big[\\log P_\\theta(X|Y) + \\log P_\\theta(Y|X)\\big] \\tag{2}$$\n\nNote that, unlike in Equation (1), the expectation in $\\mathcal{L}_{\\mathrm{VI}}(\\theta)$ is taken over $P_D$. Furthermore, we observe that the MinVI objective can be decomposed into a sum of two negative conditional LLs. This indeed aligns well with our initial motivation of reasoning about a missing data modality. In the following, we provide more insight into our MinVI objective in relation to the ML objective, which is a standard learning objective in generative models.\n\n^1 In practice, we use finite samples of the training data and use a regularizer (e.g., l2 regularizer) to avoid overfitting to the finite sample distribution.\n\n2.2 Relation to Maximum Likelihood Learning\n\nThe ML objective function can be written as a minimization of the negative LL (NLL) as follows:\n\n$$\\text{ML:}\\quad \\min_\\theta \\mathcal{L}_{\\mathrm{NLL}}(\\theta), \\qquad \\mathcal{L}_{\\mathrm{NLL}}(\\theta) = -\\mathbb{E}_{P_D(X,Y)}\\big[\\log P_\\theta(X, Y)\\big] \\tag{3}$$\n\nand we can show that the NLL objective function is reformulated as follows:\n\n$$2\\mathcal{L}_{\\mathrm{NLL}}(\\theta) = \\underbrace{\\mathrm{KL}\\big(P_D(X)\\,\\|\\,P_\\theta(X)\\big) + \\mathrm{KL}\\big(P_D(Y)\\,\\|\\,P_\\theta(Y)\\big)}_{(a)} + \\underbrace{\\mathbb{E}_{P_D(X)}\\big[\\mathrm{KL}\\big(P_D(Y|X)\\,\\|\\,P_\\theta(Y|X)\\big)\\big] + \\mathbb{E}_{P_D(Y)}\\big[\\mathrm{KL}\\big(P_D(X|Y)\\,\\|\\,P_\\theta(X|Y)\\big)\\big]}_{(b)} + C \\tag{4}$$\n\nwhere $C$ is a constant that does not depend on $\\theta$. Note that (b) is equivalent to $\\mathcal{L}_{\\mathrm{VI}}(\\theta)$ in Equation (2) up to a constant. We provide a full derivation of Equation (4) in Appendix A.\nIgnoring the constant, the NLL objective has four KL divergence terms. Since KL divergence is non-negative and is zero only when two distributions match, the ML learning in Equation (3) can be viewed as a distribution matching problem involving (a) marginal likelihoods and (b) conditional likelihoods. Here, we argue that (a) is more difficult to optimize than (b) because there are often too many modes in the marginal distribution. 
Compared to the marginal distribution, the number of modes can be dramatically reduced in the conditional distribution, since the conditioning variables may effectively restrict the support of the random variable. Therefore, (a) may become a dominant factor to be minimized during the optimization process and, as a trade-off, (b) will be easily compromised, which makes it difficult to learn a good association between data modalities. On the other hand, the MinVI objective focuses on modeling the conditional distributions (term (b) in Equation (4)), which is arguably easier to optimize. Indeed, a similar argument has been made for generalized denoising autoencoders (DAEs) [3] and generative stochastic networks (GSNs) [2], which focus on learning the transition operators (e.g., $P_\\theta(X|\\tilde{X})$, where $\\tilde{X}$ is a corrupted version of data $X$, or $P_\\theta(X|H)$, where $H$ can be arbitrary latent variables) to bypass the intractable problem of learning the density model $P_\\theta(X)$.\n\n2.3 Theoretical Results\n\nBengio et al. [3, 2] proved that learning transition operators of DAEs or GSNs is sufficient to learn a good generative model that estimates a data-generating distribution. Under similar assumptions, we establish a theoretical result that we can obtain a good density estimator for the joint distribution of multimodal data by learning the transition operators derived from the conditional distributions of one data modality given the other. In the multimodal learning framework, we define the transition operators $T^X_n$ and $T^Y_n$ for Markov chains of data modalities $X$ and $Y$, respectively. Specifically,\n\n$$T^X_n(x[t] \\mid x[t-1]) = \\sum_{y \\in \\mathcal{Y}} P_{\\theta_n}(x[t] \\mid y)\\, P_{\\theta_n}(y \\mid x[t-1]),$$\n\nwhere $P_{\\theta_n}(X|Y)$ and $P_{\\theta_n}(Y|X)$ are the model conditional distributions after learning from training data of size $n$. $T^Y_n$ is defined in a similar way. Note that we do not require that the model conditionals are derived from an analytically defined joint distribution. 
Now, we formalize the theorem as follows:\nTheorem 2.1. For finite state spaces $\\mathcal{X}$, $\\mathcal{Y}$, if, $\\forall x \\in \\mathcal{X}$, $\\forall y \\in \\mathcal{Y}$, $P_{\\theta_n}(\\cdot|y)$ and $P_{\\theta_n}(\\cdot|x)$ converge in probability to $P_D(\\cdot|y)$ and $P_D(\\cdot|x)$, respectively, and $T^X_n$ and $T^Y_n$ are ergodic Markov chains, then, as the number of examples $n \\to \\infty$, the asymptotic distributions $\\pi_n(X)$ and $\\pi_n(Y)$ converge to the data-generating marginal distributions $P_D(X)$ and $P_D(Y)$, respectively. Moreover, the joint probability distribution $P_{\\theta_n}(X, Y)$ converges to $P_D(X, Y)$ in probability.\nThe proof is provided in Appendix B. The theorem ensures that the MinVI objective can lead to a good generative model estimating the joint data-generating distribution of multimodal data. The theorem holds under two assumptions: consistency of density estimators and ergodicity of transition operators. The ergodicity condition is satisfied for a wide variety of neural networks, such as RBM or DBM.^2 The consistency assumption is more difficult to satisfy, and the aforementioned deep energy-based models or RNN may not satisfy the condition due to the model capacity limitation or approximated posteriors (e.g., factorial distribution). However, deep architectures are arguably among the most promising models for approximating the true conditionals from multimodal data. We expect that a more accurate approximation of the true conditional distributions would lead to better performance in our multimodal learning framework, and we leave it for future work.\nWe note that our Theorem 2.1 is related to composite likelihood methods [21] and dependency networks [9]. For composite likelihood, the consistency result is derived upon a well-defined graphical model (e.g., Markov network) and the joint distribution converges in the sense that the maximum composite likelihood estimators are consistent for the parameters associated with the graphical model. 
However, in Theorem 2.1, it is not necessary to design a full graphical model (e.g., of the joint distribution) with analytical forms; for example, the two conditionals can be defined by neural networks with different parameters. In this case, the joint distribution is defined implicitly, and the setting is similar to general dependency networks [9]. However, [9] uses ordered pseudo-Gibbs samplers which may be unstable (i.e., inconsistencies between the local conditionals and the true conditionals can be amplified to a large inconsistency between the model joint distribution and the true joint distribution). In our case, we prove that the implicit model joint distribution will converge to the true joint distribution under assumptions that can plausibly hold for deep architectures.\n\n^2 For energy-based models like RBM and DBM, it is straightforward to see that every state has non-zero probability and can be reached from any other state. However, the mixing of the chain might be slow in practice.\n\n3 Application to Multimodal Deep Learning\n\nIn this section, we describe the MinVI learning in multimodal deep learning framework. To overview our pipeline, we use the commonly used network architecture that consists of layers of modality-specific deep networks followed by a layer of neural network that jointly models the multiple modalities [24, 30]. The network is trained in two steps: In layer-wise pretraining, each layer of modality-specific deep network is trained using restricted Boltzmann machines (RBMs). For the top-layer shared network, we train MRBM with MinVI objective (Section 3.2). Then, we finetune the whole deep network by constructing multimodal deep recurrent neural network (MDRNN) (Section 3.3).\n\n3.1 Restricted Boltzmann Machines for Multimodal Learning\n\nThe restricted Boltzmann machine (RBM) is an undirected graphical model that defines the distribution of visible units using hidden units. 
For multimodal input, we define the joint distribution of the multimodal RBM (MRBM) [24, 30] as $P(x, y, h) = \\frac{1}{Z} \\exp(-E(x, y, h))$ with the energy function:\n\n$$E(x, y, h) = -\\sum_{i=1}^{N_x} \\sum_{k=1}^{K} x_i W^x_{ik} h_k - \\sum_{j=1}^{N_y} \\sum_{k=1}^{K} y_j W^y_{jk} h_k - \\sum_{k=1}^{K} b_k h_k - \\sum_{i=1}^{N_x} c^x_i x_i - \\sum_{j=1}^{N_y} c^y_j y_j, \\tag{5}$$\n\nwhere $Z$ is the normalizing constant, $x \\in \\{0, 1\\}^{N_x}$, $y \\in \\{0, 1\\}^{N_y}$ are the binary visible units of multimodal input (i.e., observations), and $h \\in \\{0, 1\\}^K$ are the binary hidden units (i.e., latent variables). $W^x \\in \\mathbb{R}^{N_x \\times K}$ defines the weights between $x$ and $h$, and $W^y \\in \\mathbb{R}^{N_y \\times K}$ defines the weights between $y$ and $h$. $c^x \\in \\mathbb{R}^{N_x}$, $c^y \\in \\mathbb{R}^{N_y}$, and $b \\in \\mathbb{R}^K$ are bias vectors corresponding to $x$, $y$, and $h$, respectively. Note that the MRBM is equivalent to an RBM whose visible units are constructed by concatenating the visible units of multiple input modalities, i.e., $v = [x ; y]$.\nDue to the bipartite structure, units in the same layer are conditionally independent given the units of the other layer, and the conditional probabilities are written as follows:\n\n$$P(h_k = 1 \\mid x, y) = \\sigma\\Big(\\sum_i W^x_{ik} x_i + \\sum_j W^y_{jk} y_j + b_k\\Big), \\tag{6}$$\n\n$$P(x_i = 1 \\mid h) = \\sigma\\Big(\\sum_k W^x_{ik} h_k + c^x_i\\Big), \\qquad P(y_j = 1 \\mid h) = \\sigma\\Big(\\sum_k W^y_{jk} h_k + c^y_j\\Big), \\tag{7}$$\n\nwhere $\\sigma(x) = \\frac{1}{1+\\exp(-x)}$. Similar to the standard RBM, the MRBM can be trained to maximize the joint LL ($\\log P(x, y)$) using stochastic gradient descent (SGD) while approximating the gradient with contrastive divergence (CD) [10] or persistent CD (PCD) [32]. In our case, however, we train the MRBM with the MinVI criterion. We will discuss the inference and training algorithms in Section 3.2.\nWhen we have access to all data modalities, we can use Equation (6) for exact posterior inference. On the other hand, when some of the input modalities are missing, the inference is intractable, and we resort to the variational method. 
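The conditionals in Equations (6) and (7) can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: the function names, the list-of-lists weight layout, and the toy parameter values are assumptions made for the example.

```python
import math

def sigmoid(a):
    # Logistic function: 1 / (1 + exp(-a)).
    return 1.0 / (1.0 + math.exp(-a))

def hidden_given_visible(x, y, Wx, Wy, b):
    # Equation (6): P(h_k = 1 | x, y) for every hidden unit k.
    # Wx[i][k] and Wy[j][k] are the weights of the two modalities (assumed layout).
    return [
        sigmoid(
            sum(Wx[i][k] * x[i] for i in range(len(x)))
            + sum(Wy[j][k] * y[j] for j in range(len(y)))
            + b[k]
        )
        for k in range(len(b))
    ]

def visible_given_hidden(h, W, c):
    # Equation (7): P(v_i = 1 | h) for one modality with weights W and bias c.
    return [
        sigmoid(sum(W[i][k] * h[k] for k in range(len(h))) + c[i])
        for i in range(len(c))
    ]

# Toy, hypothetical parameters with Nx = 2, Ny = 1, K = 2.
Wx = [[0.5, -1.0], [1.5, 0.0]]
Wy = [[2.0, 0.3]]
b = [0.0, -0.2]
cx = [0.1, 0.0]
cy = [-0.5]

p_h = hidden_given_visible([1, 0], [1], Wx, Wy, b)
p_x = visible_given_hidden([1, 0], Wx, cx)
p_y = visible_given_hidden([1, 0], Wy, cy)
```

Because units within a layer are conditionally independent, each probability depends only on a weighted sum over the other layer plus a bias, which is why both functions reduce to one sigmoid per unit.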
For example, when we are given $x$ but not $y$, the true posterior can be approximated with a fully factorized distribution $Q(y, h) = \\prod_j \\prod_k Q(y_j) Q(h_k)$ by minimizing $\\mathrm{KL}\\big(Q(y, h)\\,\\|\\,P_\\theta(y, h|x)\\big)$. This leads to the following fixed-point equations:\n\n$$\\hat{h}_k = \\sigma\\Big(\\sum_i W^x_{ik} x_i + \\sum_j W^y_{jk} \\hat{y}_j + b_k\\Big), \\qquad \\hat{y}_j = \\sigma\\Big(\\sum_k W^y_{jk} \\hat{h}_k + c^y_j\\Big), \\tag{8}$$\n\nwhere $\\hat{h}_k = Q(h_k = 1)$ and $\\hat{y}_j = Q(y_j = 1)$. The variational inference proceeds by alternately updating the mean-field parameters $\\hat{h}$ and $\\hat{y}$, which are initialized with all zeros.\n\n3.2 Training Algorithms\n\nCD-PercLoss. As in Equation (2), the objective function can be decomposed into two conditional LLs, and the MRBM with MinVI objective can be trained equivalently by training the two conditional RBMs (CRBMs) while sharing the weights. Since the objective function is the sum of two conditional LLs, we compute the (approximate) gradient of each CRBM separately using CD-PercLoss [23] and accumulate them to update parameters.^3\n\n^3 In CD-PercLoss learning, we run separate Gibbs chains for different conditioning variables and select the negative particles with the lowest free energy among sampled particles. We refer to [23] for further details.\n\nMulti-Prediction. We found a few practical issues of CD-PercLoss training in our application. In particular, there exists a difference between the encoding process of training and testing, especially when the unimodal query (e.g., when one of the data modalities is missing) is considered for testing. As an alternative objective, we propose multi-prediction (MP) training of MRBM with the MinVI criterion. The MP training was originally proposed to train deep Boltzmann machines [7] as an alternative to the stochastic approximation learning [27]. The idea is to train the model to be good at predicting any subset of input variables given the rest of them by constructing the recurrent network with encoding function derived from the variational inference problem.\nThe MP training can be adapted to learn MRBM with MinVI objective with some modifications. For example, the CRBM with an objective $\\log P(y|x)$ can be trained by randomly selecting the subset of variables to be predicted only from the target modality $y$, while the conditioning modality $x$ is assumed to be given in all cases. Specifically, given an arbitrary subset $S \\subset \\{1, \\cdots, N_y\\}$ drawn from the independent Bernoulli distribution $P_S$, the MP algorithm predicts $y_S = \\{y_j : j \\in S\\}$ given $x$ and $y_{\\setminus S} = \\{y_j : j \\notin S\\}$ through the iterative encoding function derived from fixed-point equations:\n\n$$\\hat{h}_k = \\sigma\\Big(\\sum_i W^x_{ik} x_i + \\sum_{j \\in S} W^y_{jk} \\hat{y}_j + \\sum_{j \\notin S} W^y_{jk} y_j + b_k\\Big), \\qquad \\hat{y}_j = \\sigma\\Big(\\sum_k W^y_{jk} \\hat{h}_k + c^y_j\\Big), \\; j \\in S, \\tag{9}$$\n\nwhich is a solution to the variational inference problem $\\min_Q \\mathrm{KL}\\big(Q(y_S, h)\\,\\|\\,P_\\theta(y_S, h|x, y_{\\setminus S})\\big)$ with factorized distribution $Q(y_S, h) = \\prod_{j \\in S} \\prod_k Q(y_j) Q(h_k)$. Note that Equation (9) is similar to Equation (8) except that only $y_j$, $j \\in S$ are updated. Using an iterative encoding function, the network parameters are trained using SGD while computing the gradient by backpropagating the error between the prediction and the ground truth of $y_S$ through the derived recurrent network. The MP formulation (e.g., encoding function) of the CRBM with $\\log P(x|y)$ can be derived similarly, and the gradients are simply the addition of two gradients that are computed individually.\nWe have two additional hyperparameters, the number of mean-field updates and the sampling ratio of a subset $S$ to be predicted from the target data modality. In our experiments, it was sufficient to use 10 to 20 iterations until convergence. We used a sampling ratio of 1 (i.e., all the variables in the target data modality are to be predicted) since we are already conditioned on one data modality, which is sufficient to make a good prediction of variables in the target data modality.\n\n3.3 Finetuning Multimodal Deep Network with Recurrent Neural Network\n\nMotivated by the MP training of MRBM, we propose a multimodal deep recurrent neural network (MDRNN) that tries to predict the target modality given the input modality through the recurrent encoding function. The MDRNN iteratively performs a full pass of bottom-up and top-down encoding from bottom-layer visible variables to top-layer joint representation back to bottom-layer through the modality-specific deep network corresponding to the target. We show an instance of $L = 3$ layer MDRNN in Figure 1, and the encoding functions are written as follows:^4\n\n$$x \\to h^{(L-1)}_x: \\quad h^{(l)}_x = \\sigma\\big(W^{x,(l)\\top} h^{(l-1)}_x + b^{x,(l)}\\big), \\quad l = 1, \\cdots, L-1 \\tag{10}$$\n\n$$y \\to h^{(L-1)}_y: \\quad h^{(l)}_y = \\sigma\\big(W^{y,(l)\\top} h^{(l-1)}_y + b^{y,(l)}\\big), \\quad l = 1, \\cdots, L-1 \\tag{11}$$\n\n$$h^{(L-1)}_x, h^{(L-1)}_y \\to h^{(L)}: \\quad h^{(L)} = \\sigma\\big(W^{x,(L)\\top} h^{(L-1)}_x + W^{y,(L)\\top} h^{(L-1)}_y + b^{(L)}\\big) \\tag{12}$$\n\n$$h^{(L)} \\to y: \\quad h^{(l-1)}_y = \\sigma\\big(W^{y,(l)} h^{(l)}_y + b^{y,(l-1)}\\big), \\quad l = L, \\cdots, 1. \\tag{13}$$\n\nFigure 1: An instance of MDRNN with target $y$ given $x$. Multiple iterations of bottom-up updates ($y \\to h^{(3)}$; Eqs. (11) & (12)) and top-down updates ($h^{(3)} \\to y$; Eq. (13)) are performed. The arrow indicates encoding direction.\n\n^4 There could be different ways of constructing MDRNN; for instance, one can construct the RNN with DBM-style mean-field updates. 
In our empirical evaluation, however, running full pass of bottom-up and top-down updates performed the best, and DBM-style updates did not give competitive results.\n\nHere, we define $h^{(0)}_x = x$ and $h^{(0)}_y = y$, and the visible variables of the target modality are initialized with zeros. In other words, in the initial bottom-up update, we compute $h^{(L)}$ only from $x$ while setting $y = 0$ using Equations (10), (11), & (12). Then, we run multiple iterations of top-down (Equation (13)) and bottom-up updates (Equations (11) & (12)). Finally, we compute the gradient by backpropagating the reconstruction error of target modality through the network.\n\nFigure 2: Visualization of samples with inferred missing modality. From top to bottom, we visualize ground truth, left or right halves of digits, generated samples with inferred missing modality using MRBM with ML objective, MinVI objective using CD-PercLoss and MP training methods.\n\nTable 1: Test set errors on handwritten digit recognition dataset using MRBMs with different training objectives and learning methods. The joint representation was fed into linear SVM for classification.\nInput modalities at test time | Left+Right | Left | Right\nML (PCD) | 1.57% | 14.98% | 18.88%\nMinVI (CD-PercLoss) | 1.71% | 9.42% | 11.02%\nMinVI (MP) | 1.73% | 6.58% | 7.27%\n\n4 Experiments\n\n4.1 Toy Example on MNIST\n\nIn our first experiment, we evaluate the proposed learning algorithm on the MNIST handwritten digit recognition dataset [19]. We consider left and right halves of the digit images as two input modalities and report the recognition performance with different combinations of input modalities at the test time, such as full (left + right) or missing (left or right) data modalities. We compare the performance of the MRBM trained with 1) ML objective using PCD [32], or MinVI objectives with 2) CD-PercLoss or 3) MP training. 
The recognition errors are provided in Table 1. Compared\nto ML training, the recognition errors for unimodal queries are reduced by more than a half with\nMP training of MinVI objective. For multimodal queries, the model trained with ML objective\nperformed the best, although the performance gain was incremental. CD-PercLoss training of MinVI\nobjective also showed signi\ufb01cant improvement over ML training, but the errors were not as low as\nthose obtained with MP training. We hypothesize that, although it is an approximation of MinVI\nobjective, the exact gradient for MP algorithm makes learning more ef\ufb01cient than CD-PercLoss. For\nthe rest of the paper, we focus on MP training method.\nIn Figure 2, we visualize the generated samples conditioned on one input modality (e.g., left or right\nhalves of digits). There are many samples generated by the models with MinVI objective that look\nclearly better than those generated by the model with ML objective.\n4.2 MIR-Flickr Database\nIn this section, we evaluate our methods on MIR-Flickr database [14], which is composed of 1\nmillion examples of images and their user tags collected from the social photo-sharing website\nFlickr. Among those, 25000 examples were annotated with 24 potential topics and 14 regular topics,\nwhich leads to 38 classes in total with distributed class membership. The topics included object\ncategories such as dog, \ufb02ower, and people, or scenic concepts such as sky, sea, and night.\nWe used the same visual and text features as in [30].5 Speci\ufb01cally, the image feature was a 3857\ndimensional vector composed of Pyramid Histogram of Words (PHOW) features [4], GIST [25], and\nMPEG-7 descriptors [22]. We preprocessed the image features to have zero mean and unit variance\nfor each dimension across all examples. The text feature was a word count vector of 2000 most\nfrequent tags. 
The number of tags varied from 0 to 72, with 5.15 tags per example in average.\nFollowing the experimental protocol [15, 30], we randomly split the labeled examples into 15000 for training and 10000 for testing, and used 5000 from training set for validation. We iterated the procedure for 5 times and report the mean average precision (mAP) averaged over 38 classes.\n\n^5 http://www.cs.toronto.edu/~nitish/multimodal/index.html\n\nModel Architecture. We used the network composed of [3857, 1024, 1024] variables for visual pathway, [2000, 1024, 1024] variables for text pathway, and 2048 variables for top-layer MRBM, as used in [30]. As described in Section 3, we pretrained the modality-specific deep networks in a greedy layerwise way, and finetuned the whole network by initializing MDRNN with the pretrained network. Specifically, we used gaussian RBM for the bottom layer of visual pathway and binary RBM for text pathway.^6 The intermediate layers were trained with binary RBMs, and the top-layer MRBM was trained using MP training algorithm. For the layer-wise pretraining of RBMs, we used PCD [32] to approximate the gradient. Since our algorithm requires both data modalities during training, we excluded examples with too sparse or no tags from unlabeled dataset and used about 750K examples with at least 2 tags. After unsupervised training, we extracted joint feature representations of the labeled training data and use them to train multiclass logistic regression classifiers.\nRecognition Tasks. For recognition tasks, we trained multiclass logistic regression classifiers using joint representations as input features. Depending on the availability of data modalities at testing time, we evaluated the performance using multimodal queries (i.e., both visual and text data are available) and unimodal queries (i.e., visual data is available while the text data is missing). In Table 2, we report the test set mAPs of our proposed model and compared to other methods. The proposed MDRNN outperformed the previous state-of-the-art in multimodal queries by 4.5% in mAP. The performance improvement becomes more significant for unimodal queries, achieving 7.6% improvement in mAP over the best published result. As we used the same input features in [30], the results suggest that our proposed algorithm learns better representations shared across multiple modalities.\n\nTable 2: Test set mAPs on MIR-Flickr database. We implemented autoencoder following the description in [24]. Multimodal DBM\u2020 is supervised finetuned model. See [31] for details.\nModel | Multimodal query | Unimodal query\nAutoencoder | 0.610 | 0.495\nMultimodal DBM [30] | 0.609 | 0.531\nMultimodal DBM\u2020 [31] | 0.641 | -\nMK-SVM [8] | 0.623 | 0.530\nTagProp [33] | 0.640 | -\nMDRNN | 0.686 \u00b1 0.003 | 0.607 \u00b1 0.005\n\nFor a closer look into our model, we performed an additional control experiment to explore the benefit of recurrent encoding of MDRNN. Specifically, we compared the performance of the models with different number of mean-field iterations.^7 We report the validation set mAPs of models with different number of iterations (0 to 10) in Table 3. For multimodal query, the MDRNN with 10 iterations improves the recognition performance by only 0.8% compared to the model with 0 iterations. However, the improvement becomes significant for unimodal query, achieving 5.0% performance gain. In addition, the largest improvement was made when we have at least one iteration (from 0 to 1 iteration, 3.4% gain; from 1 to 10 iteration, 1.6% gain). This suggests that a crucial factor of improvement comes from the inference with reconstructed missing data modality (e.g., text features), and the quality of inferred missing modality improves as we increase the number of iterations.\n\nTable 3: Validation set mAPs on MIR-Flickr database with different number of mean-field iterations.\n# iterations | 0 | 1 | 2 | 3 | 5 | 10\nMultimodal query | 0.677 | 0.678 | 0.679 | 0.680 | 0.682 | 0.685\nUnimodal query | 0.557 | 0.591 | 0.599 | 0.602 | 0.605 | 0.607\n\nRetrieval Tasks. We performed retrieval tasks using multimodal and unimodal input queries. Following [30], we selected 5000 image-text pairs from the test set to form a database and use 1000 disjoint set of examples from the test set as queries. For each query example, we computed the relevance score to the data points as a cosine similarity of joint representations. The binary relevance labels between query and the data points are determined 1 if any of the 38 class labels are overlapped. Our proposed model achieves 0.633 mAP with multimodal query and 0.638 mAP with unimodal query. This significantly outperforms the performance of multimodal DBM [30], which reported 0.622 mAP with multimodal query and 0.614 mAP with unimodal query. We show retrieved examples with multimodal queries in Figure 3.\n\n^6 We assumed text features as binary, which is different from [30] where they modeled using replicated-softmax RBM [28]. The rationale is that the tags are not likely to be assigned more than once for single image.\n^7 In [24], Ngiam et al. proposed the \u201cvideo-only\u201d deep autoencoder whose objective is to predict audio data and reconstruct video data when only video data is given as an input during the training. 
Our baseline model\n(MDRNN with 0 iterations) is similar, but differs in that it has no reconstruction training objective.\n\nFigure 3: Retrieval results with multimodal queries. The leftmost image-text pairs are multimodal query samples\nand those on the right side of the bar are retrieved samples with the highest similarities to the query sample\nfrom the database. We include more results in Appendix C.\n\n4.3 PASCAL VOC 2007\nWe evaluated the proposed algorithm on the PASCAL VOC 2007 database. The original dataset does not\ncontain user tags, but Guillaumin et al. [8] have collected user tags from the Flickr website.8\nMotivated by the success of convolutional neural networks (CNNs) on large-scale visual object\nrecognition [17], we used the DeCAF7 features [6] as input features for the visual pathway, where\nDeCAF7 is a 4096-dimensional feature extracted from the CNN trained on ImageNet [5]. For text\nfeatures, we used the vocabulary of size 804 suggested by [8]. For unsupervised feature learning of\nMDRNN, we used unlabeled data of the MIR-Flickr database while converting the text features using\nthe new vocabulary from the PASCAL database. The network architecture used in this experiment was\nas follows: [4096, 1536, 1536] variables for the visual pathway, [804, 512, 1536] variables for the\ntext pathway, and 2048 variables for the top-layer joint network.\nFollowing the standard practice, we reported the mAP over 20 object classes. 
The performance improvement\nof our proposed method was significant, achieving 81.5% mAP with multimodal queries\nand 76.2% mAP with unimodal queries, whereas the performance of the baseline model was 74.5%\nmAP with multimodal queries (DeCAF7 + Text) and 74.3% mAP with unimodal queries (DeCAF7).\n\n8 http://lear.inrialpes.fr/people/guillaumin/data.php\n\n5 Conclusion\nMotivated by the property of good generative models of multimodal data, we proposed a novel\nmultimodal deep learning framework based on variation of information. The minimum variation of\ninformation objective enables the model to learn good shared representations of multiple heterogeneous data\nmodalities with a better prediction of the missing input modality. We demonstrated the effectiveness\nof our proposed method on multimodal RBM and its deep extensions and showed state-of-the-art\nrecognition performance on the MIR-Flickr database and competitive performance on the PASCAL VOC\n2007 database with multimodal (visual + text) and unimodal (visual only) queries.\nAcknowledgments This work was supported in part by ONR N00014-13-1-0762, Toyota Technical\nCenter, and the Google Faculty Research Award. We thank Yoshua Bengio, Pedro Domingos,\nFrancis Bach, Nando de Freitas, Max Welling, Scott Reed, and Yuting Zhang for helpful comments.\nReferences\n[1] G. Bachman and L. Narici. Functional Analysis. Dover Publications, 2012.\n[2] Y. Bengio, E. Thibodeau-Laufer, G. Alain, and J. Yosinski. Deep generative stochastic networks trainable by backprop. In ICML, 2014.\n[3] Y. Bengio, L. Yao, G. Alain, and P. Vincent. Generalized denoising auto-encoders as generative models. In NIPS, 2013.\n[4] A. Bosch, A. Zisserman, and X. Munoz. Image classification using random forests and ferns. In ICCV, 2007.\n[5] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.\n[6] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. 
Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. In ICML, 2014.\n[7] I. Goodfellow, M. Mirza, A. Courville, and Y. Bengio. Multi-prediction deep Boltzmann machines. In NIPS, 2013.\n[8] M. Guillaumin, J. Verbeek, and C. Schmid. Multimodal semi-supervised learning for image classification. In CVPR, 2010.\n[9] D. Heckerman, D. M. Chickering, C. Meek, R. Rounthwaite, and C. Kadie. Dependency networks for inference, collaborative filtering, and data visualization. The Journal of Machine Learning Research, 1:49\u201375, 2001.\n[10] G. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771\u20131800, 2002.\n[11] G. E. Hinton and R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504\u2013507, 2006.\n[12] R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, 2012.\n[13] J. Huang and B. Kingsbury. Audio-visual deep learning for noise robust speech recognition. In ICASSP, 2013.\n[14] M. J. Huiskes and M. S. Lew. The MIR Flickr retrieval evaluation. In ICMIR, 2008.\n[15] M. J. Huiskes, B. Thomee, and M. S. Lew. New trends and ideas in visual concept detection: The MIR Flickr retrieval evaluation initiative. In ICMIR, 2010.\n[16] Y. Kim, H. Lee, and E. M. Provost. Deep learning for robust feature generation in audiovisual emotion recognition. In ICASSP, 2013.\n[17] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.\n[18] K. Lai, L. Bo, X. Ren, and D. Fox. RGB-D object recognition: Features, algorithms, and a large scale benchmark. In Consumer Depth Cameras for Computer Vision, pages 167\u2013192. Springer, 2013.\n[19] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. 
Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278\u20132324, 1998.\n[20] I. Lenz, H. Lee, and A. Saxena. Deep learning for detecting robotic grasps. In RSS, 2013.\n[21] B. G. Lindsay. Composite likelihood methods. Contemporary Mathematics, 80(1):221\u201339, 1988.\n[22] B. S. Manjunath, J.-R. Ohm, V. V. Vasudevan, and A. Yamada. Color and texture descriptors. IEEE Transactions on Circuits and Systems for Video Technology, 11(6):703\u2013715, 2001.\n[23] V. Mnih, H. Larochelle, and G. E. Hinton. Conditional restricted Boltzmann machines for structured output prediction. In UAI, 2011.\n[24] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng. Multimodal deep learning. In ICML, 2011.\n[25] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42(3):145\u2013175, 2001.\n[26] D. Rao, M. D. Deuge, N. Nourani-Vatani, B. Douillard, S. B. Williams, and O. Pizarro. Multimodal learning for autonomous underwater vehicles from visual and bathymetric data. In ICRA, 2014.\n[27] R. Salakhutdinov and G. E. Hinton. Deep Boltzmann machines. In AISTATS, 2009.\n[28] R. Salakhutdinov and G. E. Hinton. Replicated softmax: an undirected topic model. In NIPS, 2009.\n[29] H.-C. Shin, M. R. Orton, D. J. Collins, S. J. Doran, and M. O. Leach. Stacked autoencoders for unsupervised feature learning and multiple organ detection in a pilot study using 4D patient data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1930\u20131943, 2013.\n[30] N. Srivastava and R. Salakhutdinov. Multimodal learning with deep Boltzmann machines. In NIPS, 2012.\n[31] N. Srivastava and R. Salakhutdinov. Discriminative transfer learning with tree-based priors. In NIPS, 2013.\n[32] T. Tieleman. Training restricted Boltzmann machines using approximations to the likelihood gradient. In ICML, 2008.\n[33] J. 
Verbeek, M. Guillaumin, T. Mensink, and C. Schmid. Image annotation with tagprop on the MIR Flickr set. In ICMIR, 2010.\n[34] A. Wang, J. Lu, G. Wang, J. Cai, and T.-J. Cham. Multi-modal unsupervised feature learning for RGB-D scene labeling. In ECCV. Springer, 2014.", "award": [], "sourceid": 1137, "authors": [{"given_name": "Kihyuk", "family_name": "Sohn", "institution": "University of Michigan Ann Arbor"}, {"given_name": "Wenling", "family_name": "Shang", "institution": "University of Michigan"}, {"given_name": "Honglak", "family_name": "Lee", "institution": "University of Michigan"}]}