{"title": "Neural Multisensory Scene Inference", "book": "Advances in Neural Information Processing Systems", "page_first": 8996, "page_last": 9006, "abstract": "For embodied agents to infer representations of the underlying 3D physical world they inhabit, they should efficiently combine multisensory cues from numerous trials, e.g., by looking at and touching objects. Despite its importance, multisensory 3D scene representation learning has received less attention compared to the unimodal setting. In this paper, we propose the Generative Multisensory Network (GMN) for learning latent representations of 3D scenes which are partially observable through multiple sensory modalities. We also introduce a novel method, called the Amortized Product-of-Experts, to improve the computational efficiency and the robustness to unseen combinations of modalities at test time. Experimental results demonstrate that the proposed model can efficiently infer robust modality-invariant 3D-scene representations from arbitrary combinations of modalities and perform accurate cross-modal generation. \nTo perform this exploration we have also developed a novel multi-sensory simulation environment for embodied agents.", "full_text": "Neural Multisensory Scene Inference\n\nJae Hyun Lim\u2020\u00a7123, Pedro O. Pinheiro1, Negar Rostamzadeh1,\n\nChristopher Pal1234, Sungjin Ahn\u2021\u00a75\n\n1Element AI, 2Mila, 3Universit\u00e9 de Montr\u00e9al, 4Polytechnique Montr\u00e9al, 5Rutgers University\n\nAbstract\n\nFor embodied agents to infer representations of the underlying 3D physical world\nthey inhabit, they should ef\ufb01ciently combine multisensory cues from numerous\ntrials, e.g., by looking at and touching objects. Despite its importance, multisen-\nsory 3D scene representation learning has received less attention compared to the\nunimodal setting. 
In this paper, we propose the Generative Multisensory Network (GMN) for learning latent representations of 3D scenes that are partially observable through multiple sensory modalities. We also introduce a novel method, called the Amortized Product-of-Experts, to improve computational efficiency and robustness to unseen combinations of modalities at test time. Experimental results demonstrate that the proposed model can efficiently infer robust modality-invariant 3D-scene representations from arbitrary combinations of modalities and perform accurate cross-modal generation. To perform this exploration, we also develop the Multisensory Embodied 3D-Scene Environment (MESE).

1 Introduction

Learning a world model and its representation is an effective way of solving many challenging problems in machine learning and robotics, e.g., via model-based reinforcement learning (Silver et al., 2016). One characteristic of learning the physical world is that it is inherently multifaceted and that we can perceive its complete characteristics only through multiple sensory modalities. Thus, incorporating different physical aspects of the world via different modalities should help build a richer model and representation. One approach to learning such multisensory representations is to learn a modality-invariant representation as an abstract concept representation of the world. This idea is well supported in both psychology and neuroscience. According to the grounded cognition perspective (Barsalou, 2008), abstract concepts such as objects and events can only be obtained through perceptual signals. For example, what represents a cup in our brain is its visual appearance, the sound it could make, the tactile sensation, etc. 
In neuroscience, the existence of concept cells (Quiroga, 2012) that respond only to a specific concept regardless of the modality sourcing it (e.g., showing a picture of Jennifer Aniston or hearing her name) can be considered biological evidence for the metamodal brain perspective (Pascual-Leone & Hamilton, 2001; Yildirim, 2014) and for modality-invariant representation.

An unanswered question from the computational perspective (our particular interest in this paper) is how to learn such a modality-invariant representation of the complex physical world (e.g., 3D scenes populated with objects). We argue that this is a particularly challenging problem because the following requirements need to be satisfied by the learned world model. First, the learned representation should reflect the 3D nature of the world. Although there have been some efforts in learning multimodal representations (see Section 3), those works do not consider this fundamental 3D aspect of the physical world. Second, the learned representation should also be able to model the intrinsic stochasticity of the world. Third, for the learned representation to generalize, be robust, and be practical in many applications, it should be inferable from experiences of any partial combination of modalities. It should also facilitate the generative modelling of other arbitrary combinations of modalities (Yildirim, 2014), supporting the metamodal brain hypothesis, for which human evidence can be found in the phantom limb phenomenon (Ramachandran & Hirstein, 1998). 

†Work done during the internship of JHL at Element AI. ‡Part of the work was done while SA was at Element AI. §Correspondence to jae.hyun.lim@umontreal.ca and sungjin.ahn@cs.rutgers.edu

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
Fourth, even though there is evidence for metamodal representation, there still exist modality-dependent brain regions, revealing a modal-to-metamodal hierarchical structure (Rohe & Noppeney, 2016). A learning model can also benefit from such hierarchical representation, as shown by Hsu & Glass (2018). Lastly, the learning should be computationally efficient and scalable, e.g., with respect to the number of possible modalities.

Motivated by the above desiderata, we propose the Generative Multisensory Network (GMN) for neural multisensory scene inference and rendering. In GMN, from an arbitrary set of source modalities we infer a 3D representation of a scene that can be queried for generation via an arbitrary target modality set, a property we call generalized cross-modal generation. To this end, we formalize the problem as a probabilistic latent variable model based on the Generative Query Network (Eslami et al., 2018) framework and introduce the Amortized Product-of-Experts (APoE). The prior and the posterior approximation using APoE make the model trainable with only a small set of modality combinations, instead of the entire combination set. The APoE also resolves the inherent space complexity problem of the traditional Product-of-Experts model and improves computational efficiency. As a result, the APoE allows the model to learn from a large number of modalities without tight coupling among the modalities, a desired property in many applications such as Cloud Robotics (Saha & Dasgupta, 2018) and Federated Learning (Konečný et al., 2016). In addition, with the APoE the modal-to-metamodal hierarchical structure is easily obtained. 
In experiments, we show the above properties of the proposed model on 3D scenes with blocks of various shapes and colors, along with a human-like hand.

The contributions of the paper are as follows: (i) We introduce a formalization of modality-invariant multisensory 3D representation learning using a generative query network model and propose the Generative Multisensory Network (GMN)1. (ii) We introduce the Amortized Product-of-Experts network that allows for generalized cross-modal generation while resolving the problems of the GQN and the traditional Product-of-Experts. (iii) Our model is the first to extend multisensory representation learning to 3D scene understanding with human-like sensory modalities (such as haptic information) and cross-modal generation. (iv) We also develop the Multisensory Embodied 3D-Scene Environment (MESE), used to develop and test the model.

2 Neural Multisensory Scene Inference

2.1 Problem Description

Our goal is to understand 3D scenes by learning a metamodal representation of the scene through the interaction of multiple sensory modalities such as vision, haptics, and auditory inputs. In particular, motivated by human multisensory processing (Deneve & Pouget, 2004; Shams & Seitz, 2008; Murray & Wallace, 2011), we consider a setting where the model infers a scene from experiences of one set of modalities and then generates another set of modalities given a query. For example, we can experience a 3D scene in which a cup is on a table only by touching or grabbing it from some hand poses, and then ask whether we can visually imagine the appearance of the cup from an arbitrary query viewpoint (see Fig. 1). We begin this section with a formal definition of this problem.

A multisensory scene, simply a scene, S consists of context C and observation O. 
Given the set of all available modalities $M$, the context and observation in a scene are obtained through the context modalities $M_c(S) \subseteq M$ and the observation modalities $M_o(S) \subseteq M$, respectively. In the following, we omit the scene index $S$ when the meaning is clear without it. Note that $M_c$ and $M_o$ are arbitrary subsets of $M$, including the cases $M_o \cap M_c = \emptyset$, $M_o = M_c$, and $M_o \cup M_c \subsetneq M$. We also use $M_S$ to denote all modalities available in a scene, $M_o(S) \cup M_c(S)$.

The context and observation consist of sets of experience trials represented as query($v$)-sense($x$) pairs, i.e., $C = \{(v_n, x_n)\}_{n=1}^{N_c}$ and $O = \{(v_n, x_n)\}_{n=1}^{N_o}$. For convenience, we denote the set of queries and senses in the observation by $V$ and $X$, respectively, i.e., $O = (V, X)$.

1Code is available at: https://github.com/lim0606/pytorch-generative-multisensory-network

2

Figure 1: Cross-modal inference using scene representation. (a) A single image context. (b) Haptic contexts. (c) Generated images for some viewpoints (image queries) in the scene, given the contexts. (d) Ground truth images for the same queries. Conditioning on an image context and multiple haptic contexts, a modality-agnostic latent scene representation, $z$, is inferred. Given sampled $z$s, images are generated using various queries; in (c), each row corresponds to the same latent sample. Note that the shapes of predicted objects are consistent across different samples $z^{(i)}$, while the color pattern of the object changes, except for the parts seen by the image context (a).

Each query $v_n$ and sense $x_n$ in a context consists of modality-wise queries and senses corresponding to each modality in the context modalities, i.e., $(v_n, x_n) = \{(v_n^m, x_n^m)\}_{m \in M_c}$ (see Fig. S1). Similarly, the query and the sense in observation $O$ are constrained to have only the observation modalities $M_o$. For example, for modality $m = \text{vision}$, a unimodal query $v_n^{vision}$ can be a viewpoint, and the sense $x_n^{vision}$ is the observation image obtained from the query viewpoint. Similarly, for $m = \text{haptics}$, a unimodal query $v_n^{haptics}$ can be a hand position, and the sense $x_n^{haptics}$ is the tactile and pressure senses obtained by a grab from the query hand pose. For a scene, we may have $M_c = \{\text{haptics}, \text{auditory}\}$ and $M_o = \{\text{vision}, \text{auditory}\}$. For convenience, we also introduce the following notation. We denote the context corresponding only to a particular modality $m$ by $C^m = \{(v_n^m, x_n^m)\}_{n=1}^{N_c^m}$, such that $N_c = \sum_m N_c^m$ and $C = \{C^m\}_{m \in M_c}$. Similarly, $O^m$, $X^m$, and $V^m$ denote the modality-$m$ parts of $O$, $X$, and $V$, respectively.

Given the above definitions, we formalize the problem as learning a generative model of a scene that can generate senses $X$ corresponding to queries $V$ of one set of modalities, provided a context $C$ from other arbitrary modalities. Given scenes from the scene distribution $(O, C) \sim P(S)$, our training objective is to maximize $E_{(O,C) \sim P(S)}[\log P_\theta(X|V, C)]$, where $\theta$ denotes the model parameters to be learned.

2.2 Generative Process

We formulate this problem as a probabilistic latent variable model where we introduce the latent metamodal scene representation $z$ from a conditional prior $P_\theta(z|C)$. The joint distribution of the generative process becomes:

$$P_\theta(X, z|V, C) = P_\theta(X|V, z)\,P_\theta(z|C) = \prod_{n=1}^{N_o} P_\theta(x_n|v_n, z)\,P_\theta(z|C) = \prod_{n=1}^{N_o} \prod_{m \in M_o} P_{\theta_m}(x_n^m|v_n^m, z)\,P_\theta(z|C). \quad (1)$$

2.3 Prior for Multisensory Context

As the prior $P_\theta(z|C)$ is conditioned on the context, we need an encoding mechanism of the context to obtain $z$. A simple way to do this is to follow the Generative Query Network (GQN) (Eslami et al., 2018) approach: each context query-sense pair $(v_n, x_n)$ is encoded to $r_n = f_{enc}(v_n, x_n)$ and summed (or averaged) to obtain a permutation-invariant context representation $r = \sum_n r_n$. 
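As a concrete illustration, the sum-pooling aggregation above can be sketched in a few lines. The random linear-plus-tanh encoder below is a toy stand-in for the learned $f_{enc}$ (not the paper's convolutional architecture), and the dimensions are arbitrary; the point is only that summing per-pair encodings makes $r$ invariant to the order of context trials.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 8))  # toy stand-in for a learned encoder's weights

def f_enc(v, x):
    """Encode one query-sense pair (v, x) into a representation r_n."""
    return np.tanh(W @ np.concatenate([v, x]))

def aggregate(context):
    """Permutation-invariant context representation r = sum_n f_enc(v_n, x_n)."""
    return sum(f_enc(v, x) for v, x in context)

# A context of 5 query-sense trials (v: 3-dim query, x: 5-dim sense).
context = [(rng.normal(size=3), rng.normal(size=5)) for _ in range(5)]
r1 = aggregate(context)
r2 = aggregate(context[::-1])  # same trials, reversed order
assert np.allclose(r1, r2)    # the order of trials does not change r
```

Averaging instead of summing keeps the same invariance while normalizing for the number of context trials.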
A ConvDRAW module (Gregor et al., 2016) is then used to sample $z$ from $r$.

In the multisensory setting, however, this approach cannot be directly adopted due to a few challenges. First, unlike in GQN, the sense and query of each sensory modality have different structure, and thus we cannot have a single, shared context encoder that deals with all the modalities. In our model, we therefore introduce a modality encoder $r^m = \sum_{(x,v) \in C^m} f^m_{enc}(x, v)$ for each $m \in M$.

3

The second challenge stems from the fact that we want our model to be capable of generating from any context modality set $M_c(S)$ to any observation modality set $M_o(S)$, a property we call generalized cross-modal generation (GCG). However, at test time we do not know which sensory modality combinations will be given as a context and as a target to generate. This hence requires collecting training data that contains all possible combinations of context-observation modalities $M^*$. This equals the Cartesian product of $M$'s powersets, i.e., $M^* = Power(M) \times Power(M)$. This is a very expensive requirement, as $|M^*|$ increases exponentially with respect to the number of modalities2 $|M|$.

Although one might consider dropping out random modalities during training to achieve generalized cross-modal generation, this still assumes the availability of the full modalities from which to drop some. It is also unrealistic to assume that we always have access to the full modalities; to learn, we humans do not need to touch everything we see. Therefore, it is important to make the model learnable from only a small subset of all possible modality combinations while still achieving the GCG property. 
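To make the expense concrete, here is a back-of-the-envelope count of $|M^*| = |Power(M)|^2$; the helper name is ours, not from the paper.

```python
def num_context_observation_pairs(num_modalities: int) -> int:
    """|M*| = |Power(M)| * |Power(M)| = 2^|M| * 2^|M| = 4^|M|."""
    return (2 ** num_modalities) ** 2

# The requirement grows exponentially with |M|:
assert num_context_observation_pairs(2) == 16          # 4^2
assert num_context_observation_pairs(5) == 1024        # 4^5
assert num_context_observation_pairs(14) == 268435456  # 4^14
```

Already at the paper's largest setting of 14 modalities, covering every context-observation combination would require hundreds of millions of configurations.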
We call this the missing-modality problem.

To this end, we can model the conditional prior as a Product-of-Experts (PoE) network (Hinton, 2002) with one expert per sensory modality parameterized by $\theta_m$, that is, $P(z|C) \propto \prod_{m \in M_c} P_{\theta_m}(z|C^m)$. While this could achieve our goal at the functional level, it comes at the cost of increased space and time complexity w.r.t. the number of modalities. This is particularly problematic when we want to employ diverse sensory modalities (as in, e.g., robotics), or when each expert has to be a powerful (hence expensive in both computation and storage) model, as in the 3D scene inference task (Eslami et al., 2018), where it is necessary to use the powerful ConvDraw network to represent the complex 3D scene.

2.4 Amortized Product-of-Experts as Metamodal Representation

To deal with the limitations of PoE, we introduce the Amortized Product-of-Experts (APoE). For each modality $m \in M_c$, we first obtain a modality-level representation $r^m$ using the modal-encoder. Note that this modal-encoder is a much lighter module than the full ConvDraw network. Then, each modal-encoding $r^m$, along with its modality-id $m$, is fed into the expert-amortizer $P(z|r^m, m)$ that is shared across all modal experts through a shared parameter. In our case, this is implemented as a ConvDraw module (see Appendix B for the implementation details). We can write the APoE prior as follows:

$$P(z|C) = \prod_{m \in M_c} P(z|r^m, m). \quad (2)$$

We can extend this further to obtain a hierarchical representation model by treating $r^m$ as a latent variable:

$$P(z, \{r^m\}|C) \propto \prod_{m \in M_c} P(z|r^m, m)\,P_{\theta_m}(r^m|C^m),$$

where $r^m$ is the modality-level representation and $z$ is the metamodal representation. Although we can train this hierarchical model with the reparameterization trick and Monte Carlo sampling, for simplicity our experiments use a deterministic function, $P_{\theta_m}(r^m|C^m) = \delta[r^m = f_{\theta_m}(C^m)]$, where $\delta$ is a Dirac delta function. 
In this hierarchical version, the generative process becomes:

$$P(X, z, \{r^m\}|V, C) = P_\theta(X|V, z) \prod_{m \in M_c} P(z|r^m, m)\,P_{\theta_m}(r^m|C^m). \quad (3)$$

An illustration of the generative process is provided in Fig. S2(b) in the Appendix. From the perspective of cognitive psychology, the APoE model can be considered a computational model of the metamodal brain hypothesis (Pascual-Leone & Hamilton, 2001), which posits the existence of metamodal brain areas (the expert-of-experts in our case) that perform a specific function not specific to any input sensory modality.

2The number of modalities or sensory input sources can be very large depending on the application. Even in the case of 'human-like' embodied learning, it is not only vision, haptics, audition, etc. For example, given a robotic hand, the context input sources can be only a part of the hand, e.g., some parts of some fingers, from which we humans can imagine the senses of other parts.

4

Figure 2: Baseline model, PoE, and APoE. In the baseline model (left), a single inference network (denoted as Encoder) receives as input the sum of all modality encoders' outputs. In PoE (middle), each expert contains an integrated network combining the modality encoder and a complex inference network like ConvDraw, resulting in an O(|M|) space cost for the inference networks. In APoE (right), the modality encoding and the inference network are decoupled, and the inference networks are integrated into a single amortized expert inference network serving all experts. 
Thus, the space cost of the inference networks reduces to O(1).

2.5 Inference

Since the optimization of the aforementioned objective is intractable, we perform variational inference by maximizing the following evidence lower bound (ELBO), $\mathcal{L}_S(\theta, \phi)$, with the reparameterization trick (Kingma & Welling, 2013; Rezende et al., 2014):

$$\log P_\theta(X|V, C) \geq E_{Q_\phi(z|C,O)}[\log P_\theta(X|V, z)] - KL[Q_\phi(z|C, O)\,\|\,P_\theta(z|C)], \quad (4)$$

where $P_\theta(X|V, z) = \prod_{n=1}^{N_o} \prod_{m \in M_o} P_{\theta_m}(x_n^m|z, v_n^m)$. This can be considered a cognitively plausible objective as, according to the "grounded cognition" perspective (Barsalou, 2008), the modality-invariant representation of an abstract concept, $z$, can never be fully modality-independent.

APoE Approximate Posterior. The approximate posterior $Q_\phi(z|C, O)$ is implemented as follows. Following Wu & Goodman (2018), we first represent the true posterior as

$$P(z|C, O) = \frac{P(O, C|z)P(z)}{P(O, C)} = \frac{P(z)}{P(C, O)} \prod_{m \in M_S} P(C^m, O^m|z) = \frac{P(z)}{P(C, O)} \prod_{m \in M_S} \frac{P(z|C^m, O^m)\,P(C^m, O^m)}{P(z)}.$$

After ignoring the terms that are not a function of $z$, we obtain $P(z|C, O) \propto \prod_{m \in M_S} P(z|C^m, O^m) \,/\, P(z)^{|M_S|-1}$. Replacing the numerator terms with an approximation $P(z|C^m, O^m) \approx Q(z|C^m, O^m)\,P(z)^{(|M_S|-1)/|M_S|}$, we can remove the priors in the denominator and obtain the following APoE approximate posterior:

$$P(z|C, O) \approx \prod_{m \in M_S} Q(z|C^m, O^m). \quad (5)$$

Although the above product is intractable in general, a closed-form solution exists if each expert is a Gaussian (Wu & Goodman, 2018). The mean $\mu$ and covariance $T$ of the APoE are, respectively, $\mu = (\sum_m \mu_m U_m)(\sum_m U_m)^{-1}$ and $T = (\sum_m U_m)^{-1}$, where $\mu_m$ and $U_m$ are the mean and the inverse of the covariance of each expert. 
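This closed-form combination is easy to check numerically. The sketch below handles the scalar (per-dimension diagonal) case and illustrates the formula above; it is not the paper's implementation, and the sanity check uses toy one-dimensional experts.

```python
import numpy as np

def poe_gaussian(mus, Us):
    """Combine Gaussian experts with means mu_m and precisions U_m
    (U_m = inverse covariance): T = (sum_m U_m)^-1, mu = T * sum_m mu_m U_m."""
    U_sum = sum(Us)
    T = 1.0 / U_sum                              # combined covariance (diagonal case)
    mu = T * sum(m * U for m, U in zip(mus, Us))  # precision-weighted mean
    return mu, T

# Two 1-D experts: N(0, var=1) and N(2, var=1/3), i.e., precisions 1 and 3.
mu, T = poe_gaussian(mus=[0.0, 2.0], Us=[1.0, 3.0])
assert np.isclose(T, 0.25)   # precisions add: T = 1/(1+3)
assert np.isclose(mu, 1.5)   # mu = 0.25 * (0*1 + 2*3)

# Sanity check: the product of the expert densities is proportional to the
# Gaussian with the combined (mu, T), i.e., log-densities differ by a constant.
def logpdf(x, m, var):
    return -0.5 * ((x - m) ** 2 / var + np.log(2 * np.pi * var))

xs = np.linspace(-3.0, 3.0, 7)
diff = logpdf(xs, 0.0, 1.0) + logpdf(xs, 2.0, 1.0 / 3.0) - logpdf(xs, mu, T)
assert np.allclose(diff, diff[0])  # constant offset, independent of x
```

The same formulas apply per dimension for the diagonal-covariance experts used in practice, so combining any subset of experts costs only a few sums.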
The posterior APoE $Q_\phi(z|C^m, O^m)$ is implemented by first encoding $r^m = f^m_{enc}(C^m, O^m)$ and then feeding $r^m$ and the modality-id $m$ into the amortized expert $Q_\phi(z|r^m, m)$, which is a ConvDraw module in our implementation. The amortized expert outputs $\mu_m$ and $U_m$ for $m \in M_S$ while sharing the variational parameter $\phi$ across the modality-experts. Fig. 2 compares the inference network architectures of CGQN, PoE, and APoE.

3 Related Works

5

Figure 3: Results on cross-modal density estimation. (a) Log-likelihood of target images (gray) vs. the number of haptic observations. (b) Log-likelihood of target images (RGB) vs. the number of haptic observations. (c) Log-likelihood of target haptic values vs. the number of image observations. The dotted lines show fully cross-modal inference, where the context does not include any target modality. For inference with additional context from the target modality, the results are denoted as dashed, dashdot, and solid lines.

Multimodal Generative Models. Multimodal data are associated with many interesting learning problems, e.g., cross-modal inference, zero-shot learning, and weakly supervised learning. For these problems, latent variable models have provided effective solutions: from a model with a global latent variable shared among all modalities (Suzuki et al., 2016) to hierarchical latent structures (Hsu & Glass, 2018) and scalable inference networks with Product-of-Experts (PoE) (Hinton, 2002; Wu & Goodman, 2018; Kurle et al., 2018). In contrast to these works, the current study addresses two additional challenges. First, this work aims at achieving any-modal to any-modal conditional inference regardless of the modality configurations seen during training: it targets generalization under distribution shifts at test time. 
Previous studies, on the other hand, assume full modality configurations during training, even when they address missing modality configurations at test time. Second, the proposed model considers each source of information to be only partially observable, whereas prior work treats each modality-specific input as fully observable. As a result, the modality-agnostic metamodal representation is inferred from modality-specific representations, each of which is integrated from a set of partially observable inputs.

3D Representations and Rendering. Learning representations of 3D scenes or environments from partially observable inputs has been addressed by supervised learning (Choy et al., 2016; Wu et al., 2017; Shin et al., 2018; Mescheder et al., 2018), latent variable models (Eslami et al., 2018; Rosenbaum et al., 2018; Kumar et al., 2018), and generative adversarial networks (Wu et al., 2016; Rajeswar et al., 2019; Nguyen-Phuoc et al., 2019). The GAN-based approaches exploit domain-specific functions, e.g., 3D representations, 3D-to-2D projection, and 3D rotations. Thus, they are hard to apply to non-visual modalities whose underlying transformations are unknown. On the other hand, neural latent variable models for random processes (Eslami et al., 2018; Rosenbaum et al., 2018; Kumar et al., 2018; Garnelo et al., 2018a,b; Le et al., 2018; Kim et al., 2019) have dealt with more general settings and studied order-invariant inference. 
However, these studies focus on single-modality cases, which contrasts with our method: we address a new problem setting where qualitatively different information sources are available for learning the scene representations.

4 Experiment

The proposed model is evaluated with respect to the following criteria: (i) cross-modal density estimation in terms of log-likelihood, (ii) ability to perform cross-modal sample generation, (iii) quality of the learned representation when applied to a downstream classification task, (iv) robustness to the missing-modality problem, and (v) space and computational cost.

To evaluate our model, we have developed an environment, the Multisensory Embodied 3D-Scene Environment (MESE). MESE integrates MuJoCo (Todorov et al., 2012), MuJoCo HAPTIX (Kumar & Todorov, 2015), and the OpenAI gym (Brockman et al., 2016) for 3D scene understanding through multisensory interactions. In particular, from MuJoCo HAPTIX, the Johns Hopkins Modular Prosthetic Limb (MPL) (Johannes et al., 2011) is used. The resulting MESE, equipped with vision and proprioceptive sensors, is particularly suitable for tasks related to human-like embodied multisensory learning. In our experiments, the visual input is a 64×64 RGB image and the haptic input is 132-dimensional, consisting of the hand pose and touch senses. Our main task is similar to the Shepard-Metzler object experiments used in Eslami et al. (2018) but extends them with the MPL hand.

As a baseline model, we use a GQN variant (Kumar et al., 2018) (discussed in Section 2.3). In this model, following GQN, the representations from different modalities are summed and then given to a ConvDraw network. We also provide a comparison to the PoE version of the model in terms of computation speed and memory footprint. For more details on the experimental environments, implementations, and settings, refer to Appendix A.

6

Cross-Modal Density Estimation. 
Our first evaluation is cross-modal conditional density estimation. For this, we estimate the conditional log-likelihood $\log P(X|V, C)$ for $M = \{\text{RGB-image}, \text{haptics}\}$, i.e., $|M| = 2$. During training, we use both modalities for each sampled scene and use 0 to 15 randomly sampled context query-sense pairs for each modality. At test time, we provide a unimodal context from one modality and generate the other.

Fig. 3 shows results on three different experiments: (a) HAPTIC→GRAY, (b) HAPTIC→RGB, and (c) RGB→HAPTIC. Note that we include HAPTIC→GRAY, although GRAY images are not used during training, to analyze the effect of color in haptic-to-image generation. The APoE and the baseline are plotted in blue and orange, respectively. In all cases our model (blue) outperforms the baseline (orange). This gap is even larger when the model is provided a limited amount of context information, suggesting that the baseline requires more context to improve the representation. Specifically, in the fully cross-modal setting, where the context does not include any target modality (the dotted lines), the gap is largest. We believe that our model can better leverage modality-invariant representations from one modality to another. Also, when we provide additional context from the target modality (dashed, dashdot, and solid lines), we still see that our model outperforms the baseline. This implies that our model can successfully incorporate information from different modalities without interference between them. Furthermore, from Fig. 3(a) and (b), we observe that haptic information captures only shapes: the prediction in RGB has lower likelihood without any image in the context. However, for the GRAY image in (a), the likelihood approaches the upper bound.

Cross-Modal Generation. We now qualitatively evaluate the cross-modal generation ability. Fig. 1 shows samples of our cross-modal generation for various query viewpoints. 
Here, we condition the model on 15 haptic context signals but provide only a single image. We note that the single image provides limited color information about the object, namely that red and cyan are part of the object, and almost no information about the shape. We can see that the model is able to almost perfectly infer the shape of the object. However, it fails to predict the correct colors (Fig. 1(c)), which is expected given the limited visual information provided. Interestingly, the object part for which the context image provides color information has correct colors, while other parts have random colors across different samples, showing that the model captures the uncertainty in $z$. Additional results provided in Appendix D further suggest that: (i) our model gradually aggregates evidence across numerous trials to improve predictions (Fig. S5), and (ii) our model successfully integrates distinctive multisensory information in its inference (Fig. S6).

Classification. To further evaluate the quality of the modality-invariant scene representations, we test on a downstream classification task. We randomly sample 10 scenes, and from each scene we prepare held-out query-sense pairs to use as input to the classifier. The models are then asked to classify which scene (1 out of 10) a given query-sense pair belongs to. We use Eq. (6) for this classification. To see how the provided multimodal context contributes to obtaining useful representations for this task, we test the following three context configurations: (i) image-query pairs only (I), (ii) haptic-query pairs only (H), and (iii) all sensory contexts (H + I).

In Fig. 4, both models use contexts to classify scenes, and their performance improves as the number of contexts increases. APoE outperforms the baseline in classification accuracy, while both methods have a similar ELBO (see Fig. S4). 
This suggests that the representation of our model tends to be more discriminative than that of the baseline. In APoE, the results with an individual modality (I or H) are close to those with all modalities (H + I). The drop in performance with only haptic-query pairs (H) is due to the fact that certain samples might have the same shape but different colors. On the other hand, the baseline shows worse performance when inferring the modality-invariant representation from a single sensory modality, especially for images. This demonstrates that the APoE model helps learn better representations for both modality-specific (I and H) and modality-invariant tasks (H + I).

Figure 4: Classification result.

7

Figure 5: Results of missing-modality experiments for (a,b) $|M| = 8$ and (c,d) $|M| = 14$ environments. During training (train), limited combinations of all possible modalities are presented to the model. The size of the exposed multimodal senses per scene is denoted as $|M_S^{train}|$. For the validation dataset, the models are evaluated with the same limited combinations as in training (valmissing), as well as with all combinations (valfull).

Missing-modality Problem. In practical scenarios, since it is difficult to assume that we always have access to all modalities, it is important for the model to learn when some modalities are missing. Here, we evaluate this robustness by providing unseen combinations of modalities at test time. This is done by limiting the set of modality combinations observed during training. That is, we provide only a subset of modality combinations for each scene $S$, i.e., $M_S^{train} \subseteq M$. At test time, the model is evaluated on every combination of all modalities $M$, thus including settings not observed during training. 
As an example, for a total of 8 modalities, $M = \{left, right3\} \times \{R, G, B\} \times \{haptics1\} \times \{haptics2\}$, we use $|M_S^{train}| \in \{1, 2\}$ to indicate that each scene in the training data contains only one or two modalities. Fig. 5(a) and (b) show results with $|M| = 8$, while (c) and (d) show results with $|M| = 14$. Fig. 5(a) and (c) are results when a much more restricted number of modalities is available during training: 2 out of 8 and 4 out of 14, respectively. At test time, however, all combinations of modalities are used. We denote the performance on the full configurations by valfull and on the limited modality configurations used during training by valmissing. Fig. 5(b) and (d) show the opposite setting where, during training, a large number of modalities (e.g., 7 or 8 modalities) are always provided together for each scene. Thus, the model has not been trained on scenes with only a small number of modalities, such as one or two, but we test these configurations at test time to assess its ability to perform generalized cross-modal generation. For more results, see Appendix E.

Overall, in all cases our model shows good test-time performance on the unseen context modality configurations, whereas the baseline model mostly either overfits severely (except in (c)) or converges slowly. This is because, in the baseline model, the summed representation of an unseen context configuration is itself likely to be unseen at test time, leading to overfitting. In contrast, our model, as a PoE, is robust to this problem, since all experts agree on a similar representation. The baseline results for case (c) seem less prone to this problem but converged much more slowly. Given its slow convergence, we believe it might still overfit in the end with a longer run.

Space and Time Complexity. The expert amortizer of APoE significantly reduces the inherent space problem of PoE, although it still requires separate modality encoders. 
Specifically, in our experiments, for the M = 5 case, PoE requires 53M parameters while APoE uses 29M. For M = 14, PoE uses 131M parameters while APoE uses only 51M. We also observed a reduction in computation time from using APoE. For the M = 5 model, one iteration of PoE takes, on average, 790 ms while APoE takes 679 ms. This gap becomes more significant for M = 14, where PoE takes 2059 ms while APoE takes 1189 ms. This is partly due to the number of parameters. Moreover, unlike PoE, APoE can parallelize its encoder computation via convolution. For more results, see Table 1 in the Appendix.

5 Conclusion

We propose the Generative Multisensory Network (GMN) for understanding 3D scenes via modality-invariant representation learning. In GMN, we introduce the Amortized Product-of-Experts (APoE) in order to deal with the problem of missing modalities while resolving the space complexity problem of the standard Product-of-Experts. In experiments on 3D scenes with blocks of different shapes and a human-like hand, we show that GMN can generate any modality from any context configuration. We also show that the model with APoE learns better modality-agnostic representations, as well as modality-specific ones. To the best of our knowledge, this is the first exploration of multisensory representation learning with vision and haptics for generating 3D objects. Furthermore, we have developed a novel multisensory simulation environment, called the Multisensory Embodied 3D-Scene Environment (MESE), which is critical to performing these experiments.

³ left and right half of an image

Acknowledgments
JL would like to thank Chin-Wei Huang, Shawn Tan, Tatjana Chavdarova, Arantxa Casanova, Ankesh Anand, and Evan Racah for helpful comments and advice. SA thanks Kakao Brain, the Center for Super Intelligence (CSI), and Element AI for their support.
CP also thanks NSERC and PROMPT.

References

Brandon Amos, Laurent Dinh, Serkan Cabi, Thomas Rothörl, Alistair Muldal, Tom Erez, Yuval Tassa, Nando de Freitas, and Misha Denil. Learning awareness models. In ICLR, 2018.

Lawrence W Barsalou. Grounded cognition. Annu. Rev. Psychol., 2008.

Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.

Yuri Burda, Roger B. Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. In ICLR, 2016.

Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. cuDNN: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759, 2014.

Christopher B Choy, Danfei Xu, JunYoung Gwak, Kevin Chen, and Silvio Savarese. 3D-R2N2: A unified approach for single and multi-view 3D object reconstruction. In ECCV, 2016.

Sophie Deneve and Alexandre Pouget. Bayesian multisensory integration and cross-modal spatial links. Journal of Physiology-Paris, 98(1-3):249–258, 2004.

S. M. Ali Eslami, Danilo Jimenez Rezende, Frederic Besse, Fabio Viola, Ari S. Morcos, Marta Garnelo, Avraham Ruderman, Andrei A. Rusu, Ivo Danihelka, Karol Gregor, David P. Reichert, Lars Buesing, Theophane Weber, Oriol Vinyals, Dan Rosenbaum, Neil Rabinowitz, Helen King, Chloe Hillier, Matt Botvinick, Daan Wierstra, Koray Kavukcuoglu, and Demis Hassabis. Neural scene representation and rendering. Science, 2018.

Marta Garnelo, Dan Rosenbaum, Chris J. Maddison, Tiago Ramalho, David Saxton, Murray Shanahan, Yee Whye Teh, Danilo J. Rezende, and S. M. Ali Eslami. Conditional neural processes. arXiv preprint arXiv:1807.01613, 2018a.

Marta Garnelo, Jonathan Schwarz, Dan Rosenbaum, Fabio Viola, Danilo J. Rezende, S. M. Ali Eslami, and Yee Whye Teh. Neural processes.
arXiv preprint arXiv:1807.01622, 2018b.

Karol Gregor, Frederic Besse, Danilo Jimenez Rezende, Ivo Danihelka, and Daan Wierstra. Towards conceptual compression. In NIPS, 2016.

Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In ICLR, 2017.

Geoffrey E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 2002.

Wei-Ning Hsu and James R. Glass. Disentangling by partitioning: A representation learning framework for multimodal sensory data. arXiv preprint arXiv:1805.11264, 2018.

Matthew S Johannes, John D Bigelow, James M Burck, Stuart D Harshbarger, Matthew V Kozlowski, and Thomas Van Doren. An overview of the developmental process for the modular prosthetic limb. Johns Hopkins APL Technical Digest, 2011.

Hyunjik Kim, Andriy Mnih, Jonathan Schwarz, Marta Garnelo, Ali Eslami, Dan Rosenbaum, Oriol Vinyals, and Yee Whye Teh. Attentive neural processes. In ICLR, 2019.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

Jakub Konečný, H. Brendan McMahan, Felix X. Yu, Peter Richtarik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: Strategies for improving communication efficiency. In NIPS Workshop on Private Multi-Party Machine Learning, 2016. URL https://arxiv.org/abs/1610.05492.

Ananya Kumar, S. M. Ali Eslami, Danilo J. Rezende, Marta Garnelo, Fabio Viola, Edward Lockhart, and Murray Shanahan. Consistent generative query networks. arXiv preprint arXiv:1807.02033, 2018.

Vikash Kumar and Emanuel Todorov. MuJoCo HAPTIX: A virtual reality system for hand manipulation.
In International Conference on Humanoid Robots, Humanoids, 2015.

Richard Kurle, Stephan Günnemann, and Patrick van der Smagt. Multi-source neural variational inference. arXiv preprint arXiv:1811.04451, 2018.

Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 2015.

Tuan Anh Le, Hyunjik Kim, Marta Garnelo, Dan Rosenbaum, Jonathan Schwarz, and Yee Whye Teh. Empirical evaluation of neural process objectives. In NeurIPS Bayesian Workshop, 2018.

Lars M. Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and Andreas Geiger. Occupancy networks: Learning 3D reconstruction in function space. arXiv preprint arXiv:1812.03828, 2018.

Micah M Murray and Mark T Wallace. The neural bases of multisensory processes. CRC Press, 2011.

Thu Nguyen-Phuoc, Chuan Li, Lucas Theis, Christian Richardt, and Yong-Liang Yang. HoloGAN: Unsupervised learning of 3D representations from natural images. arXiv preprint arXiv:1904.01326, 2019.

John Nickolls, Ian Buck, Michael Garland, and Kevin Skadron. Scalable parallel programming with CUDA. Queue, 6(2):40–53, March 2008. ISSN 1542-7730.

Alvaro Pascual-Leone and Roy Hamilton. The metamodal organization of the brain. In Progress in Brain Research, volume 134, pp. 427–445. Elsevier, 2001.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS-W, 2017.

Rodrigo Quian Quiroga. Concept cells: the building blocks of declarative memory functions. Nature Reviews Neuroscience, 2012.

Sai Rajeswar, Fahim Mannan, Florian Golemo, David Vazquez, Derek Nowrouzezahrai, and Aaron Courville. Pix2Scene: Learning implicit 3D representations from images. preprint https://openreview.net/forum?id=BJeem3C9F7, 2019.

Vilayanur S Ramachandran and William Hirstein.
The perception of phantom limbs. The D. O. Hebb lecture. Brain: A Journal of Neurology, 121(9):1603–1630, 1998.

Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.

Tim Rohe and Uta Noppeney. Distinct computational principles govern multisensory integration in primary sensory and association cortices. Current Biology, 26(4):509–514, 2016.

Dan Rosenbaum, Frederic Besse, Fabio Viola, Danilo J. Rezende, and S. M. Ali Eslami. Learning models for visual 3D localization with implicit mapping. arXiv preprint arXiv:1807.03149, 2018.

Olimpiya Saha and Prithviraj Dasgupta. A comprehensive survey of recent trends in cloud robotics architectures and applications. Robotics, 7(3):47, 2018.

Ladan Shams and Aaron R Seitz. Benefits of multisensory learning. Trends in Cognitive Sciences, 12(11):411–417, 2008.

Daeyun Shin, Charless C. Fowlkes, and Derek Hoiem. Pixels, voxels, and views: A study of shape representations for single view 3D object shape prediction. In CVPR, 2018.

David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 2016.

Masahiro Suzuki, Kotaro Nakayama, and Yutaka Matsuo. Joint multimodal learning with deep generative models. arXiv preprint arXiv:1611.01891, 2016.

Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In IROS, 2012.

Jiajun Wu, Chengkai Zhang, Tianfan Xue, Bill Freeman, and Josh Tenenbaum. Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling. In NIPS, 2016.

Jiajun Wu, Yifan Wang, Tianfan Xue, Xingyuan Sun, Bill Freeman, and Josh Tenenbaum.
MarrNet: 3D shape reconstruction via 2.5D sketches. In NIPS, 2017.

Mike Wu and Noah Goodman. Multimodal generative models for scalable weakly-supervised learning. In NeurIPS, 2018.

Ilker Yildirim. From perception to conception: learning multisensory representations. University of Rochester, 2014.