{"title": "Unsupervised Learning of 3D Structure from Images", "book": "Advances in Neural Information Processing Systems", "page_first": 4996, "page_last": 5004, "abstract": "A key goal of computer vision is to recover the underlying 3D structure that gives rise to 2D observations of the world. If endowed with 3D understanding, agents can abstract away from the complexity of the rendering process to form stable, disentangled representations of scene elements. In this paper we learn strong deep generative models of 3D structures, and recover these structures from 2D images via probabilistic inference. We demonstrate high-quality samples and report log-likelihoods on several datasets, including ShapeNet, and establish the first benchmarks in the literature. We also show how these models and their inference networks can be trained jointly, end-to-end, and directly from 2D images without any use of ground-truth 3D labels. This demonstrates for the first time the feasibility of learning to infer 3D representations of the world in a purely unsupervised manner.", "full_text": "Unsupervised Learning of 3D Structure from Images\n\nDanilo Jimenez Rezende*\ndanilor@google.com\n\nS. M. Ali Eslami*\n\naeslami@google.com\n\nShakir Mohamed*\n\nshakir@google.com\n\nPeter Battaglia*\n\npeterbattaglia@google.com\n\nMax Jaderberg*\n\njaderberg@google.com\n\nNicolas Heess*\n\nheess@google.com\n\n* Google DeepMind\n\nAbstract\n\nA key goal of computer vision is to recover the underlying 3D structure that gives\nrise to 2D observations of the world. If endowed with 3D understanding, agents\ncan abstract away from the complexity of the rendering process to form stable,\ndisentangled representations of scene elements. In this paper we learn strong\ndeep generative models of 3D structures, and recover these structures from 2D\nimages via probabilistic inference. 
We demonstrate high-quality samples and\nreport log-likelihoods on several datasets, including ShapeNet [2], and establish\nthe \ufb01rst benchmarks in the literature. We also show how these models and their\ninference networks can be trained jointly, end-to-end, and directly from 2D images\nwithout any use of ground-truth 3D labels. This demonstrates for the \ufb01rst time\nthe feasibility of learning to infer 3D representations of the world in a purely\nunsupervised manner.\n\n1 Introduction\n\nWe live in a three-dimensional world, yet our observations of it are typically in the form of\ntwo-dimensional projections that we capture with our eyes or with cameras. A key goal of computer\nvision is that of recovering the underlying 3D structure that gives rise to these 2D observations.\nThe 2D projection of a scene is a complex function of the attributes and positions of the camera, lights\nand objects that make up the scene. If endowed with 3D understanding, agents can abstract away\nfrom this complexity to form stable, disentangled representations, e.g., recognizing that a chair is a\nchair whether seen from above or from the side, under different lighting conditions, or under partial\nocclusion. Moreover, such representations would allow agents to determine downstream properties\nof these elements more easily and with less training, e.g., enabling intuitive physical reasoning about\nthe stability of the chair, planning a path to approach it, or \ufb01guring out how best to pick it up or sit on\nit. Models of 3D representations also have applications in scene completion, denoising, compression\nand generative virtual reality.\nThere have been many attempts at performing this kind of reasoning, dating back to the earliest years\nof the \ufb01eld. Despite this, progress has been slow for several reasons: First, the task is inherently\nill-posed. 
Objects always appear under self-occlusion, and there are an in\ufb01nite number of 3D structures\nthat could give rise to a particular 2D observation. The natural way to address this problem is by\nlearning statistical models that recognize which 3D structures are likely and which are not. Second,\neven when endowed with such a statistical model, inference is intractable. This includes the sub-tasks\nof mapping image pixels to 3D representations, detecting and establishing correspondences between\ndifferent images of the same structures, and that of handling the multi-modality of the representations\nin this 3D space. Third, it is unclear how 3D structures are best represented, e.g., via dense volumes\nof voxels, via a collection of vertices, edges and faces that de\ufb01ne a polyhedral mesh, or some other\nkind of representation. Finally, ground-truth 3D data is dif\ufb01cult and expensive to collect and therefore\ndatasets have so far been relatively limited in size and scope.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nFigure 1: Motivation: The 3D representation of a 2D image is ambiguous and multi-modal. We\nachieve such reasoning by learning a generative model of 3D structures, and recover this structure\nfrom 2D images via probabilistic inference. Panels: 2D input; 3D interpretation.\n\nIn this paper we introduce a family of generative models of 3D structures and recover these\nstructures from 2D images via probabilistic inference. Learning models of 3D structures directly\nfrom pixels has been a long-standing research problem and a number of approaches with different\nlevels of underlying assumptions and feature engineering have been proposed. Traditional\napproaches to vision as inverse graphics [20, 17, 19] and analysis-by-synthesis [23, 27, 16, 28] rely\non heavily engineered visual features with which inference of object properties such as shape and\npose is substantially simpli\ufb01ed. 
More recent\nwork [16, 4, 3, 30] addresses some of these limitations by learning\nparts of the encoding-decoding pipeline depicted in \ufb01gure 2 in separate stages. Concurrent with our\nwork, [10] also develops a generative model of volumetric data based on adversarial\nmethods. We discuss other related work in appendix A.1. Unlike existing approaches, our approach is one of\nthe \ufb01rst to learn 3D representations in an unsupervised, end-to-end manner, directly from 2D images.\nOur contributions are as follows. (a) We design a strong generative model of 3D structures, de\ufb01ned\nover the space of volumes and meshes, combining ideas from state-of-the-art generative models\nof images [7]. (b) We show that our models produce high-quality samples, can effectively capture\nuncertainty and are amenable to probabilistic inference, allowing for applications in 3D generation\nand simulation. We report log-likelihoods on a dataset of shape primitives, a 3D version of MNIST,\nand on ShapeNet [2], which, to the best of our knowledge, constitutes the \ufb01rst quantitative benchmark\nfor 3D density modeling. (c) We show how complex inference tasks, e.g., that of inferring plausible\n3D structures given a 2D image, can be achieved using conditional training of the models. We\ndemonstrate that such models recover 3D representations in one forward pass of a neural network\nand they accurately capture the multi-modality of the posterior. (d) We explore both volumetric\nand mesh-based representations of 3D structure. The latter is achieved by \ufb02exible inclusion of\noff-the-shelf renderers such as OpenGL [22]. This allows us to build in further knowledge of the\nrendering process, e.g., how light bounces off surfaces and interacts with their material attributes. (e)\nWe show how the aforementioned models and inference networks can be trained end-to-end directly\nfrom 2D images without any use of ground-truth 3D labels. 
This demonstrates for the \ufb01rst time the\nfeasibility of learning to infer 3D representations of the world in a purely unsupervised manner.\n\n2 Conditional Generative Models\n\nIn this section we develop our framework for learning models of 3D structure from volumetric data\nor directly from images. We consider conditional latent variable models, structured as in \ufb01gure 2\n(left). Given an observed volume or image x and a context c, we wish to infer a corresponding\n3D representation h (which can be a volume or a mesh). This is achieved by modelling the latent\nmanifold of object shapes and poses via the low-dimensional codes z. The context is any quantity\nthat is always observed at both train- and test-time, and it conditions all computations of inference\nand generation (see \ufb01gure 2, middle). In our experiments, context is either 1) nothing, 2) an object\nclass label, or 3) one or more views of the scene from different cameras.\nOur models employ a generative process which consists of \ufb01rst generating a 3D representation h\n(\ufb01gure 2, middle) and then projecting to the domain of the observed data (\ufb01gure 2, right). For instance,\nthe model will \ufb01rst generate a volume or mesh representation of a scene or object and then render it\ndown using a convolutional network or an OpenGL renderer to form a 2D image.\nGenerative models with latent variables describe probability densities p(x) over datapoints x implicitly\nthrough a marginalization of the set of latent variables z, $p(x) = \\int p_\\theta(x|z)\\,p(z)\\,dz$. Flexible\nmodels can be built by using multiple layers of latent variables, where each layer speci\ufb01es a\nconditional distribution parameterized by a deep neural network. Examples of such models include\n[12, 15, 24]. 
The marginal likelihood p(x) is intractable and we must resort to approximations.\n\nFigure 2: Proposed framework: Left: Given an observed volume or image x and contextual\ninformation c, we wish to infer a corresponding 3D representation h (which can be a volume or a\nmesh). This is achieved by modeling the latent manifold of object shapes via the low-dimensional\ncodes z. In experiments we will consider unconditional models (i.e., no context), as well as models\nwhere the context c is class or one or more 2D views of the scene. Right: We train a\ncontext-conditional inference network (red) and object model (green). When ground-truth volumes are\navailable, they can be trained directly. When only ground-truth images are available, a renderer is\nrequired to measure the distance between an inferred 3D representation and the ground-truth image.\nDiagram labels: observed context c (observed class or observed view(s)); abstract code z;\nvolume/mesh representation h; observed volume/image x; inference network; 3D structure model;\nlearned/speci\ufb01ed renderer.\n\nWe opt for variational approximations [13], in which we bound the marginal likelihood p(x) by\n$F = \\mathbb{E}_{q_\\phi(z|x)}[\\log p_\\theta(x|z)] - \\mathrm{KL}[q_\\phi(z|x) \\,\\|\\, p(z)]$, where the true posterior distribution is approximated\nby a parametric family of posteriors $q_\\phi(z|x)$ with parameters $\\phi$. Learning involves joint optimization\nof the variational parameters $\\phi$ and model parameters $\\theta$. In this framework, we can think of the\ngenerative model as a decoder of the latent variables, and the inference network as an encoder of the\nobserved data into the latent representation. 
Gradients of F are estimated using path-wise derivative\nestimators (\u2018reparameterization trick\u2019) [12, 15].\n\n2.1 Architectures\n\nWe build on recent work on sequential generative models [7, 11, 6] by extending them to operate on\ndifferent 3D representations. This family of models generates the observed data over the course\nof T computational steps. More precisely, these models operate by sequentially transforming\nindependently generated Gaussian latent variables into re\ufb01nements of a hidden representation h,\nwhich we refer to as the \u2018canvas\u2019. The \ufb01nal con\ufb01guration of the canvas, $h_T$, is then transformed into\nthe target data x (e.g. an image) through a \ufb01nal smooth transformation. In our framework, we refer\nto the hidden representation $h_T$ as the \u20183D representation\u2019 since it will have a special form that is\namenable to 3D transformations. This generative process is described by the following equations:\n\nLatents: $z_t \\sim \\mathcal{N}(\\cdot | 0, 1)$ (1)\nEncoding: $e_t = f_{\\mathrm{read}}(c, s_{t-1}; \\theta_r)$ (2)\nHidden state: $s_t = f_{\\mathrm{state}}(s_{t-1}, z_t, e_t; \\theta_s)$ (3)\n3D representation: $h_t = f_{\\mathrm{write}}(s_t, h_{t-1}; \\theta_w)$ (4)\n2D projection: $\\hat{x} = \\mathrm{Proj}(h_T, s_T; \\theta_p)$ (5)\nObservation: $x \\sim p(x | \\hat{x})$ (6)\n\nEach step generates an independent set of K-dimensional variables $z_t$ (equation 1). We use a fully\nconnected long short-term memory network (LSTM, [8]) as the transition function $f_{\\mathrm{state}}(s_{t-1}, z_t, c; \\theta_s)$.\nThe context encoder $f_{\\mathrm{read}}(c, s_{t-1}; \\theta_r)$ is task dependent; we provide further details in section 3.\nWhen using a volumetric latent 3D representation, the representation update function\n$f_{\\mathrm{write}}(s_t, h_{t-1}; \\theta_w)$ in equation 4 is parameterized by a volumetric spatial transformer (VST, [9]).\nMore precisely, we set $f_{\\mathrm{write}}(s_t, h_{t-1}; \\theta_w) = \\mathrm{VST}(g_1(s_t), g_2(s_t))$ where $g_1$ and $g_2$ are MLPs that\ntake the state $s_t$ and map it to appropriate sizes. More details about the VST are provided in the\nappendix A.3. 
When using a mesh 3D representation, $f_{\\mathrm{write}}$ is a fully-connected MLP.\nThe function $\\mathrm{Proj}(h_T, s_T)$ is a projection operator from the model\u2019s latent 3D representation $h_T$ to\nthe training data\u2019s domain (which in our experiments is either a volume or an image) and plays the\nrole of a \u2018renderer\u2019. The conditional density $p(x|\\hat{x})$ is either a diagonal Gaussian (for real-valued\ndata) or a product of Bernoulli distributions (for binary data). We denote the set of all parameters\nof this generative model as $\\theta = \\{\\theta_r, \\theta_w, \\theta_s, \\theta_p\\}$. Details of the inference model and the variational\nbound are provided in the appendix A.2.\n\nFigure 3: Projection operators: These drop-in modules relate a latent 3D representation with the\ntraining data. The choice of representation and the type of available training data determine which\noperator should be used. Left: Volume-to-volume projection (no parameters). Middle:\nVolume-to-image neural projection (learnable parameters). Right: Mesh-to-image OpenGL projection (no\nlearnable parameters). Diagram labels: left, volume $h_T$ (DxHxW) to volume $\\hat{x}$ (DxHxW); middle,\nvolume $h_T$ (FxDxHxW) and camera $s_T$ to image $\\hat{x}$ (1xHxW), with VST and 3D conv blocks; right,\nmesh $h_T$ (3xM) and camera $s_T$ to image $\\hat{x}$ (3xHxW).\n\nHere we discuss the projection operators in detail. These drop-in modules relate a latent 3D\nrepresentation with the training data. The choice of representation (volume or mesh) and the type of available\ntraining data (3D or 2D) determine which operator is used.\n3D \u2192 3D projection (identity): In cases where training data is already in the form of volumes (e.g.,\nin medical imagery, volumetrically rendered objects, or videos), we can directly de\ufb01ne the likelihood\ndensity $p(x|\\hat{x})$, and the projection operator is simply the identity function $\\hat{x} = h_T$ (see \ufb01gure 3 left).\n3D \u2192 
2D neural projection (learned): In most practical applications we only have access to images\ncaptured by a camera. Moreover, the camera pose may be unknown or partially known. For these\ncases, we construct and learn a map from an $F$-dimensional volume $h_T$ to the observed 2D images\nby combining the VST with 3D and 2D convolutions. When multiple views from different positions\nare simultaneously observed, the projection operator is simply cloned as many times as there are\ntarget views. The parameters of the projection operator are trained jointly with the rest of the model.\nThis operator is depicted in \ufb01gure 3 (middle). For details see appendix A.4.\n3D \u2192 2D OpenGL projection (\ufb01xed): When working with a mesh representation, the projection\noperator in equation 5 is a complex map from the mesh description h provided by the generative\nmodel to the rendered images $\\hat{x}$. In our experiments we use an off-the-shelf OpenGL renderer and\ntreat it as a black box with no parameters. This operator is depicted in \ufb01gure 3 (right).\nA challenge in working with black-box renderers is that of back-propagating errors from the image\nto the mesh. This requires either a differentiable renderer [19], or resorting to gradient estimation\ntechniques such as \ufb01nite-differences [5] or Monte Carlo estimators [21, 1]. We opt for a scheme\nbased on REINFORCE [26], details of which are provided in appendix A.5.\n\n3 Experiments\n\nWe demonstrate the ability of our model to learn and exploit 3D scene representations in \ufb01ve\nchallenging tasks. 
These tasks establish it as a powerful, robust and scalable model that is able to\nprovide high-quality generations of 3D scenes, can robustly be used as a tool for 3D scene completion,\ncan be adapted to provide class-speci\ufb01c or view-speci\ufb01c generations that allow variations in scenes to\nbe explored, can synthesize multiple 2D scenes to form a coherent understanding of a scene, and can\noperate with complex visual systems such as graphics renderers. We explore four data sets:\nNecker cubes: The Necker cube is a classical psychological test of the human ability for 3D and\nspatial reasoning. This is the simplest dataset we use and consists of $40 \\times 40 \\times 40$ volumes with a\n$10 \\times 10 \\times 10$ wire-frame cube drawn at a random orientation at the center of the volume [25].\nPrimitives: The volumetric primitives are of size $30 \\times 30 \\times 30$. Each volume contains a simple\nsolid geometric primitive (e.g., cube, sphere, pyramid, cylinder, capsule or ellipsoid) that undergoes\nrandom translations ([0, 20] pixels) and rotations ($[-\\pi, \\pi]$ radians).\nMNIST3D: We extended the MNIST dataset [18] to create a $30 \\times 30 \\times 30$ volumetric dataset by\nextruding the MNIST images. The resulting dataset has the same number of images as MNIST. The\ndata is then augmented with random translations ([0, 20] pixels) and rotations ($[-\\pi, \\pi]$ radians) that\nare procedurally applied during training.\n\nFigure 4: A generative model of volumes: For each dataset we display 9 samples from the model.\nThe samples are sharp and capture the multi-modality of the data. Left: Primitives (trained with\ntranslations and rotations). Middle: MNIST3D (translations and rotations). Right: ShapeNet (trained\nwith rotations only). Videos of these samples can be seen at https://goo.gl/9hCkxs.\n\nFigure 5: Probabilistic volume completion (Necker Cube, Primitives, MNIST3D): Left: Full\nground-truth volume. 
Middle: First few steps of the MCMC chain completing the missing left half of\nthe data volume. Right: 100th iteration of the MCMC chain. Best viewed on a screen. Videos of\nthese samples can be seen at https://goo.gl/9hCkxs.\n\nShapeNet: The ShapeNet dataset [2] is a large dataset of 3D meshes of objects. We experiment with\na 40-class subset of the dataset, commonly referred to as ShapeNet40. We render each mesh as a\nbinary $30 \\times 30 \\times 30$ volume.\nFor all experiments we used LSTMs with 300 hidden neurons and 10 latent variables per generation\nstep. The context encoder $f_c(c, s_{t-1})$ was varied for each task. For image inputs we used convolutions\nand standard spatial transformers, and for volumes we used volumetric convolutions and VSTs. For\nthe class-conditional experiments, the context c is a one-hot encoding of the class. As meshes are\nmuch lower-dimensional than volumes, we set the number of steps to be T = 1 when working with\nthis representation. We used the Adam optimizer [14] for all experiments.\n\n3.1 Generating volumes\n\nWhen ground-truth volumes are available we can directly train the model using the identity projection\noperator (see section 2.1). We explore the performance of our model by training on several datasets.\nWe show in \ufb01gure 4 that it can capture rich statistics of shapes, translations and rotations across the\ndatasets. For simpler datasets such as Primitives and MNIST3D (\ufb01gure 4 left, middle), the model\nlearns to produce very sharp samples. Even for the more complex ShapeNet dataset (\ufb01gure 4 right)\nits samples show a large diversity of shapes whilst maintaining \ufb01ne details.\n\n3.2 Probabilistic volume completion and denoising\n\nWe test the ability of the model to impute missing data in 3D volumes. This is a capability that is\noften needed to remedy sensor defects that result in missing or corrupt regions (see for instance\n[29, 4]). 
For volume completion, we use an unconditional volumetric model and alternate between\ninference and generation, feeding the result of one into the other. This procedure simulates a Markov\nchain and samples from the correct distribution, as we show in appendix A.10. We test the model by\noccluding half of a volume and completing the missing half. Figure 5 demonstrates that our model\nsuccessfully completes large missing regions with high precision. More examples are shown in the\nappendix A.7.\n\nFigure 6: Quantitative results: Increasing the number of steps or the number of contextual views\nboth lead to improved log-likelihoods. Left: Primitives. Middle: MNIST3D. Right: ShapeNet.\nPlot axes: Generation Steps (horizontal), Bound (nats) (vertical). Legend: Baseline model,\nUnconditional, 1 context view, 2 context views, 3 context views.\n\n3.3 Conditional volume generation\n\nThe models can also be trained with context representing the class of the object, allowing for class\nconditional generation. We train a class-conditional model on ShapeNet and show multiple samples\nfor 10 of the 40 classes in \ufb01gure 7. The model produces high-quality samples of all classes. We\nnote their sharpness, and that they accurately capture object rotations, and also provide a variety of\nplausible generations. Samples for all 40 ShapeNet classes are shown in appendix A.8.\nWe also form conditional models using a single view of 2D contexts. Our results, shown in \ufb01gure 8\nindicate that the model generates plausible shapes that match the constraints provided by the context\nand captures the multi-modality of the posterior. For instance, consider \ufb01gure 8 (right). 
The model\nis conditioned on a single view of an object that has a triangular shape. The three samples shown\nvary greatly in shape (e.g., one is a cone and another a pyramid), whilst maintaining\nthe same triangular projection. More examples of these inferences are shown in the appendix A.9.\n\n3.4 Performance benchmarking\n\nWe quantify the performance of the model by computing likelihood scores, varying the number of\nconditioning views and the number of inference steps in the model. Figure 6 indicates that the number\nof generation steps is a very important factor for performance (note that increasing the number of\nsteps does not affect the total number of parameters in the model). Additional context views generally\nimprove the model\u2019s performance but the effect is relatively small. With these experiments we\nestablish the \ufb01rst benchmark of likelihood-bounds on Primitives (unconditional: 500 nats; 3-views:\n472 nats), MNIST3D (unconditional: 410 nats; 3-views: 393 nats) and ShapeNet (unconditional: 827\nnats; 3-views: 814 nats). As a strong baseline, we have also trained a deterministic 6-layer volumetric\nconvolutional network with Bernoulli likelihoods to generate volumes conditioned on 3 views. The\nperformance of this model is indicated by the red line in \ufb01gure 6. Our generative model substantially\noutperforms the baseline for all 3 datasets, even when conditioned on a single view.\n\n3.5 Multi-view training\n\nIn most practical applications, ground-truth volumes are not available for training. Instead, data is\ncaptured as a collection of images (e.g., from a multi-camera rig or a moving robot). To accommodate\nthis fact, we extend the generative model with a projection operator that maps the internal volumetric\nrepresentation $h_T$ to a 2D image $\\hat{x}$. 
This map imitates a \u2018camera\u2019 in that it \ufb01rst applies an af\ufb01ne\ntransformation to the volumetric representation, and then \ufb02attens the result using a convolutional\nnetwork. The parameters of this projection operator are trained jointly with the rest of the model.\nFurther details are explained in the appendix A.4.\nIn this experiment we train the model to learn to reproduce an image of the object given one or\nmore views of it from \ufb01xed camera locations. It is the model\u2019s responsibility to infer the volumetric\nrepresentation as well as the camera\u2019s position relative to the volume. It is clear to see how the\nmodel can \u2018cheat\u2019 by generating volumes that lead to good reconstructions but do not capture the\nunderlying 3D structure. We overcome this by reconstructing multiple views from the same volumetric\nrepresentation and using the context information to \ufb01x a reference frame for the internal volume. This\nenforces a consistent hidden representation that generalises to new views.\n\nFigure 7: Class-conditional samples: Given a one-hot encoding of class as context, the model\nproduces high-quality samples. Notice, for instance, sharpness and variability of generations for\n\u2018chair\u2019, accurate capture of rotations for \u2018car\u2019, and even identi\ufb01able legs for the \u2018person\u2019 class. Videos\nof these samples can be seen at https://goo.gl/9hCkxs. Panel titles: table, vase, car, laptop,\nairplane, bowl, person, cone.\n\nWe train a model that conditions on 3 \ufb01xed context views to reproduce 10 simultaneous random\nviews of an object. After training, we can sample a 3D representation given the context, and render it\nfrom arbitrary camera angles. We show the model\u2019s ability to perform this kind of inference in \ufb01gure 9. 
The resulting network is capable of producing an abstract 3D representation from 2D observations\nthat is amenable to, for instance, arbitrary camera rotations.\n\n3.6 Single-view training\n\nFinally, we consider a mesh-based 3D representation and demonstrate the feasibility of training our\nmodels with a fully-\ufb02edged, black-box renderer in the loop. Such renderers (e.g. OpenGL) accurately\ncapture the relationship between a 3D representation and its 2D rendering out of the box. The rendered\nimage is a complex function of the objects\u2019 colors, materials and textures, the positions of lights, and\nthe positions of other objects. By building this knowledge into the model we give hints for learning and\nconstrain its hidden representation.\nWe consider again the Primitives dataset; however, now we only have access to 2D images of the\nobjects at training time. The primitives are textured with a color on each side (which increases\nthe complexity of the data, but also makes it easier to detect the object\u2019s orientation relative to the\ncamera), and are rendered under three lights. We train an unconditional model that, given a 2D image,\ninfers the parameters of a 3D mesh and its orientation relative to the camera, such that the mesh, when\ntextured and rendered, reconstructs the image accurately. The inferred mesh is formed by a collection of 162\nvertices that can move on \ufb01xed lines that spread from the object\u2019s center, and is parameterized by the\nvertices\u2019 positions on these lines.\nThe results of these experiments are shown in \ufb01gure 10. We observe that in addition to reconstructing\nthe images accurately (which implies correct inference of mesh and camera), the model correctly\ninfers the extents of the object not in view, as demonstrated by views of the inferred mesh from\nunobserved camera angles.\n\n4 Discussion\n\nIn this paper we introduced a powerful family of 3D generative models inspired by recent advances\nin image modeling. 
When trained on ground-truth volumes, they can produce high-quality samples\nthat capture the multi-modality of the data. We further showed how common inference tasks, such\nas that of inferring a posterior over 3D structures given a 2D image, can be performed ef\ufb01ciently\nvia conditional training. We also demonstrated end-to-end training of such models directly from 2D\nimages through the use of differentiable renderers. This demonstrates for the \ufb01rst time the feasibility\nof learning to infer 3D representations in a purely unsupervised manner.\nWe experimented with two kinds of 3D representations: volumes and meshes. Volumes are \ufb02exible\nand can capture a diverse range of structures; however, they introduce modeling and computational\nchallenges due to their high dimensionality. Conversely, meshes can be much lower dimensional\nand therefore easier to work with, and they are the data-type of choice for common rendering\nengines; however, standard parameterizations can be restrictive in the range of shapes they can capture.\nIt will be of interest to consider other representation types, such as NURBS, or training with a\nvolume-to-mesh conversion algorithm (e.g., marching cubes) in the loop.\n\nFigure 8: Recovering 3D structure from 2D images: The model is trained on volumes, conditioned\non c as context. Each row corresponds to an independent sample h from the model given c.\nWe display $\\hat{x}$, which is h viewed from the same angle as c. Columns r1 and r2 display the\ninferred 3D representation h from different viewpoints. The model generates plausible, but varying,\ninterpretations, capturing the inherent ambiguity of the problem. Left: MNIST3D. 
Right: ShapeNet.\nVideos of these samples can be seen at https://goo.gl/9hCkxs.\n\nFigure 9: 3D structure from multiple 2D images: Conditioned on 3 depth images of an object,\nthe model is trained to generate depth images of that object from 10 different views. Left: Context\nviews. Right: Columns r1 through r8 display the inferred abstract 3D representation h rendered\nfrom different viewpoints by the learned projection operator. Videos of these samples can be seen at\nhttps://goo.gl/9hCkxs. Column labels: c1, c2, c3, r1 through r8.\n\nFigure 10: Unsupervised learning of 3D structure: The model observes x and is trained to\nreconstruct it using a mesh representation and an OpenGL renderer, resulting in $\\hat{x}$. We rotate the camera\naround the inferred mesh to visualize the model\u2019s understanding of 3D shape. We observe that\nin addition to reconstructing accurately, the model correctly infers the extents of the object not in\nview, demonstrating true 3D understanding of the scene. Videos of these reconstructions have been\nincluded in the supplementary material. Best viewed in color. Videos of these samples can be seen at\nhttps://goo.gl/9hCkxs. Column labels: x, $\\hat{x}$, r1, r2, r3.\n\nReferences\n\n[1] Y. Burda, R. Grosse, and R. Salakhutdinov. Importance weighted autoencoders. arXiv preprint:1509.00519, 2015.\n\n[2] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song,\nH. Su, et al. Shapenet: An information-rich 3d model repository. arXiv preprint:1512.03012, 2015.\n\n[3] C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese. 3d-r2n2: A uni\ufb01ed approach for single and\nmulti-view 3d object reconstruction. arXiv preprint:1604.00449, 2016.\n\n[4] A. Dosovitskiy, J. Tobias Springenberg, and T. Brox. Learning to generate chairs with convolutional neural\nnetworks. 
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages\n1538\u20131546, 2015.\n\n[5] S. Eslami, N. Heess, T. Weber, Y. Tassa, K. Kavukcuoglu, and G. E. Hinton. Attend, infer, repeat: Fast\nscene understanding with generative models. arXiv preprint:1603.08575, 2016.\n\n[6] K. Gregor, F. Besse, D. Jimenez Rezende, I. Danihelka, and D. Wierstra. Towards conceptual compression.\narXiv preprint:1604.08772, 2016.\n\n[7] K. Gregor, I. Danihelka, A. Graves, D. Jimenez Rezende, and D. Wierstra. Draw: A recurrent neural\nnetwork for image generation. In ICML, 2015.\n\n[8] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735\u20131780, 1997.\n\n[9] M. Jaderberg, K. Simonyan, A. Zisserman, et al. Spatial transformer networks. In NIPS, pages 2008\u20132016,\n2015.\n\n[10] J. Wu, C. Zhang, T. Xue, W. T. Freeman, and J. Tenenbaum. Learning a probabilistic latent space\nof object shapes via 3d generative-adversarial modeling. arXiv preprint:1610.07584, 2016.\n\n[11] D. Jimenez Rezende, S. Mohamed, I. Danihelka, K. Gregor, and D. Wierstra. One-shot generalization in\ndeep generative models. arXiv preprint:1603.05106, 2016.\n\n[12] D. Jimenez Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference\nin deep generative models. In ICML, 2014.\n\n[13] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for\ngraphical models. Machine learning, 37(2):183\u2013233, 1999.\n\n[14] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.\n\n[15] D. P. Kingma and M. Welling. Auto-encoding variational bayes. In ICLR, 2014.\n\n[16] T. Kulkarni, I. Yildirim, P. Kohli, W. Freiwald, and J. B. Tenenbaum. Deep generative vision as approximate\nbayesian computation. In NIPS 2014 ABC Workshop, 2014.\n\n[17] T. D. Kulkarni, V. K. Mansinghka, P. Kohli, and J. B. Tenenbaum. 
Inverse graphics with probabilistic cad\nmodels. arXiv preprint:1407.1339, 2014.\n\n[18] Y. Lecun and C. Cortes. The MNIST database of handwritten digits.\n\n[19] M. M. Loper and M. J. Black. Opendr: An approximate differentiable renderer. In Computer Vision\u2013ECCV\n2014, pages 154\u2013169. Springer, 2014.\n\n[20] V. Mansinghka, T. D. Kulkarni, Y. N. Perov, and J. Tenenbaum. Approximate bayesian image interpretation\nusing generative probabilistic graphics programs. In NIPS, pages 1520\u20131528, 2013.\n\n[21] A. Mnih and D. Jimenez Rezende. Variational inference for monte carlo objectives. arXiv\npreprint:1602.06725, 2016.\n\n[22] OpenGL Architecture Review Board. OpenGL Reference Manual: The Official Reference Document for\nOpenGL, Release 1. 1993.\n\n[23] L. D. Pero, J. Bowdish, D. Fried, B. Kermgard, E. Hartley, and K. Barnard. Bayesian geometric modeling\nof indoor scenes. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages\n2719\u20132726. IEEE, 2012.\n\n[24] R. Salakhutdinov and I. Murray. On the quantitative analysis of deep belief networks. In Proceedings of\nthe 25th international conference on Machine learning, pages 872\u2013879, 2008.\n\n[25] R. Sundareswara and P. R. Schrater. Perceptual multistability predicted by search model for bayesian\ndecisions. Journal of Vision, 8(5):12\u201312, 2008.\n\n[26] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning.\nMachine learning, 8(3-4):229\u2013256, 1992.\n\n[27] D. Wingate, N. Goodman, A. Stuhlmueller, and J. M. Siskind. Nonstandard interpretations of probabilistic\nprograms for efficient inference. In NIPS, pages 1152\u20131160, 2011.\n\n[28] J. Wu, I. Yildirim, J. J. Lim, B. Freeman, and J. Tenenbaum. Galileo: perceiving physical object properties\nby integrating a physics engine with deep learning. In NIPS, pages 127\u2013135, 2015.\n\n[29] Z. Wu, S. Song, A. Khosla, F. Yu, L.
Zhang, X. Tang, and J. Xiao. 3d shapenets: A deep representation for\nvolumetric shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,\npages 1912\u20131920, 2015.\n\n[30] T. Zhou, S. Tulsiani, W. Sun, J. Malik, and A. A. Efros. View synthesis by appearance flow. arXiv preprint,\nMay 2016.", "award": [], "sourceid": 2556, "authors": [{"given_name": "Danilo", "family_name": "Jimenez Rezende", "institution": "Google DeepMind"}, {"given_name": "S. M. Ali", "family_name": "Eslami", "institution": "Google DeepMind"}, {"given_name": "Shakir", "family_name": "Mohamed", "institution": "Google DeepMind"}, {"given_name": "Peter", "family_name": "Battaglia", "institution": "Google DeepMind"}, {"given_name": "Max", "family_name": "Jaderberg", "institution": "DeepMind"}, {"given_name": "Nicolas", "family_name": "Heess", "institution": "Google DeepMind"}]}