{"title": "Visual Object Networks: Image Generation with Disentangled 3D Representations", "book": "Advances in Neural Information Processing Systems", "page_first": 118, "page_last": 129, "abstract": "Recent progress in deep generative models has led to tremendous breakthroughs in image generation. While being able to synthesize photorealistic images, existing models lack an understanding of our underlying 3D world. Different from previous works built on 2D datasets and models, we present a new generative model, Visual Object Networks (VONs), synthesizing natural images of objects with a disentangled 3D representation. Inspired by classic graphics rendering pipelines, we unravel the image formation process into three conditionally independent factors---shape, viewpoint, and texture---and present an end-to-end adversarial learning framework that jointly models 3D shape and 2D texture. Our model first learns to synthesize 3D shapes that are indistinguishable from real shapes. It then renders the object's 2.5D sketches (i.e., silhouette and depth map) from its shape under a sampled viewpoint. Finally, it learns to add realistic textures to these 2.5D sketches to generate realistic images. The VON not only generates images that are more realistic than the state-of-the-art 2D image synthesis methods but also enables many 3D operations such as changing the viewpoint of a generated image, shape and texture editing, linear interpolation in texture and shape space, and transferring appearance across different objects and viewpoints.", "full_text": "Visual Object Networks: Image Generation with\n\nDisentangled 3D Representation\n\nJun-Yan Zhu\nMIT CSAIL\n\nZhoutong Zhang\n\nMIT CSAIL\n\nChengkai Zhang\n\nMIT CSAIL\n\nJiajun Wu\nMIT CSAIL\n\nAntonio Torralba\n\nMIT CSAIL\n\nJoshua B. Tenenbaum\n\nMIT CSAIL\n\nWilliam T. Freeman\nMIT CSAIL, Google\n\nAbstract\n\nRecent progress in deep generative models has led to tremendous breakthroughs in\nimage generation. 
However, while existing models can synthesize photorealistic\nimages, they lack an understanding of our underlying 3D world. We present\na new generative model, Visual Object Networks (VON), synthesizing natural\nimages of objects with a disentangled 3D representation.\nInspired by classic\ngraphics rendering pipelines, we unravel our image formation process into three\nconditionally independent factors\u2014shape, viewpoint, and texture\u2014and present an\nend-to-end adversarial learning framework that jointly models 3D shapes and 2D\nimages. Our model \ufb01rst learns to synthesize 3D shapes that are indistinguishable\nfrom real shapes. It then renders the object\u2019s 2.5D sketches (i.e., silhouette and\ndepth map) from its shape under a sampled viewpoint. Finally, it learns to add\nrealistic texture to these 2.5D sketches to generate natural images. The VON\nnot only generates images that are more realistic than state-of-the-art 2D image\nsynthesis methods, but also enables many 3D operations such as changing the\nviewpoint of a generated image, editing of shape and texture, linear interpolation in\ntexture and shape space, and transferring appearance across different objects and\nviewpoints.\n\n1\n\nIntroduction\n\nModern deep generative models learn to synthesize realistic images. Figure 1a shows several cars\ngenerated by a recent model [Gulrajani et al., 2017]. However, most methods have only focused\non generating images in 2D, ignoring the 3D nature of the world. As a result, they are unable to\nanswer some questions that would be effortless for a human, for example: what will a car look like\nfrom a different angle? What if we apply its texture to a truck? 
Can we mix different 3D designs?\nTherefore, a 2D-only perspective inevitably limits a model\u2019s practical application in \ufb01elds such as\nrobotics, virtual reality, and gaming.\nIn this paper, we present an end-to-end generative model that jointly synthesizes 3D shapes and 2D\nimages via a disentangled object representation. Speci\ufb01cally, we decompose our image generation\nmodel into three conditionally independent factors: shape, viewpoint, and texture, borrowing ideas\nfrom classic graphics rendering engines [Kajiya, 1986]. Our model \ufb01rst learns to synthesize 3D\nshapes that are indistinguishable from real shapes. It then computes its 2.5D sketches [Barrow and\nTenenbaum, 1978, Marr, 1982] with a differentiable projection module from a sampled viewpoint.\nFinally, it learns to add diverse, realistic texture to 2.5D sketches and produce 2D images that are\nindistinguishable from real photos. We call our model Visual Object Networks (VON).\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fFigure 1: Previous 2D GANs vs. Visual Object Networks (VON). (a) Typical examples produced by\na recent GAN model [Gulrajani et al., 2017]. (b) Our model produces three outputs: a 3D shape, its\n2.5D projection given a viewpoint, and a \ufb01nal image with realistic texture. (c) Given this disentangled\n3D representation, our method allows several 3D applications including changing viewpoint and\nediting shape or texture independently. Please see our code and website for more details.\n\nWiring in conditional independence reduces our need for densely annotated data: unlike classic\nmorphable face models [Blanz and Vetter, 1999], our training does not require paired data between\n2D images and 3D shapes, nor dense correspondence annotations in 3D data. 
This advantage allows us to leverage both 2D image datasets and 3D shape collections [Chang et al., 2015] and to synthesize objects of diverse shapes and texture.\nThrough extensive experiments, we show that VON produces more realistic image samples than recent 2D deep generative models. We also demonstrate many 3D applications that are enabled by our disentangled representation, including rotating an object, adjusting object shape and texture, interpolating between two objects in texture and shape space independently, and transferring the appearance of a real image to new objects and viewpoints.\n\n2 Related Work\n\nGANs for 2D image synthesis. Since the invention of Generative Adversarial Nets (GANs) [Goodfellow et al., 2014], many researchers have adopted adversarial learning for various image synthesis tasks, ranging from image generation [Radford et al., 2016, Arjovsky et al., 2017, Karras et al., 2018], image-to-image translation [Isola et al., 2017, Zhu et al., 2017a], text-to-image synthesis [Zhang et al., 2017, Reed et al., 2016], and interactive image editing [Zhu et al., 2016, Wang et al., 2018], to classic vision and graphics tasks such as inpainting [Pathak et al., 2016] and super-resolution [Ledig et al., 2017]. Despite the tremendous progress made on 2D image synthesis, most of the above methods operate in 2D space, ignoring the 3D nature of our physical world. As a result, the lack of 3D structure inevitably limits some practical applications of these generative models. In contrast, we present an image synthesis method powered by a disentangled 3D representation. It allows a user to change the viewpoint easily, as well as to edit the object\u2019s shape or texture independently. Dosovitskiy et al. [2015] used supervised CNNs for generating synthetic images given object style, viewpoint, and color. 
We differ in that our aim is to produce objects with 3D geometry and natural texture without using labelled data.\n\n3D shape generation. There has been an increasing interest in synthesizing 3D shapes with deep generative models, especially GANs. Popular representations include voxels [Wu et al., 2016], point clouds [Gadelha et al., 2017b, Achlioptas et al., 2018], and octrees [Tatarchenko et al., 2017]. Other methods learn 3D shape priors from 2D images [Rezende et al., 2016, Gadelha et al., 2017a]. Recent work also explored 3D shape completion from partial scans with deep generative models [Dai et al., 2017, Wang et al., 2017, Wu et al., 2018], including generalization to unseen object categories [Zhang et al., 2018]. Unlike prior methods that only synthesize untextured 3D shapes, our method learns to produce both realistic shapes and images. Recent and concurrent work has learned to infer both texture and 3D shapes from 2D images, represented as parametrized meshes [Kanazawa et al., 2018], point clouds [Tatarchenko et al., 2016], or colored voxels [Tulsiani et al., 2017, Sun et al., 2018b]. While they focus on 3D reconstruction, we aim to learn an unconditional generative model of shapes and images with disentangled representations of object texture, shape, and pose.\n\nFigure 2: Our image formation model. We first learn a shape generative adversarial network Gshape that maps a randomly sampled shape code zshape to a voxel grid v. Given a sampled viewpoint zview, we project v to 2.5D sketches v2.5D with our differentiable projection module P. 
The 2.5D\nsketches v2.5D include both the object\u2019s depth and silhouette, which help to bridge 3D and 2D data.\nFinally, we learn a texture network x = Gtexture(v2.5D, ztexture) to add realistic, diverse texture to these\n2.5D sketches, so that generated 2D images cannot be distinguished from real images by an image\ndiscriminator. The model is fully differentiable and trained end-to-end on both 2D and 3D data.\n\nInverse graphics. Motivated by the philosophy of \u201cvision as inverse graphics\u201d [Yuille and Kersten,\n2006, Bever and Poeppel, 2010], researchers have made much progress in recent years on learning to\ninvert graphics engines, many with deep neural networks [Kulkarni et al., 2015b, Yang et al., 2015,\nKulkarni et al., 2015a, Tung et al., 2017, Shu et al., 2017]. In particular, Kulkarni et al. [2015b]\nproposed a convolutional inverse graphics network. Given an image of a face, the network learns to\ninfer properties such as pose and lighting. Tung et al. [2017] extended inverse graphics networks with\nadversarial learning. Wu et al. [2017, 2018] inferred 3D shapes from a 2D image via 2.5D sketches\nand learned shape priors. Here we focus on a complementary problem\u2014learning generative graphics\nnetworks via the idea of \u201cgraphics as inverse vision\u201d. In particular, we learn our generative model\nwith recognition models that recover 2.5D sketches from generated images.\n\n3 Formulation\n\nOur goal is to learn an (implicit) generative model that can sample an image x \u2208 RH\u00d7W\u00d73 from\nthree factors: a shape code zshape, a viewpoint code zview, and a texture code ztexture. The texture\ncode describes the appearance of the object, which accounts for the object\u2019s albedo, re\ufb02ectance, and\nenvironment illumination. These three factors are disentangled, conditionally independent from each\nother. Our model is category-speci\ufb01c, as the visual appearance of an object depends on the class. 
We further assume that all the codes lie in their own low-dimensional spaces. During training, we are given a 3D shape collection {v_i}_{i=1}^N, where v_i \u2208 R^{W\u00d7W\u00d7W} is a binary voxel grid, and a 2D image collection {x_j}_{j=1}^M, where x_j \u2208 R^{H\u00d7W\u00d73}. Our model training requires no alignment between 3D and 2D data. We assume that every training image has a clean background and only contains the object of interest. This assumption makes our model focus on generating realistic images of the objects instead of complex backgrounds.\nFigure 2 illustrates our model. First, we learn a 3D shape generation network that produces realistic voxels v = Gshape(zshape) given a shape code zshape (Section 3.1). We then develop a differentiable projection module P that projects a 3D voxel grid v into 2.5D sketches via v2.5D = P(v, zview), given a particular viewpoint zview (Section 3.2). Next, we learn to produce a final image given the 2.5D sketches v2.5D and a randomly sampled texture code ztexture, using our texture synthesis network x = Gtexture(v2.5D, ztexture) in Section 3.3. Section 3.4 summarizes our full model and Section 3.5 includes implementation details. Our entire model is differentiable and can be trained end-to-end. During testing, we sample an image x = Gtexture(P(Gshape(zshape), zview), ztexture) from latent codes (zshape, zview, ztexture) via our shape network Gshape, texture network Gtexture, and projection module P.\n\n3.1 Learning 3D Shape Priors\n\nOur first step is to learn a category-specific 3D shape prior from large shape collections [Chang et al., 2015]. 
This prior depends on the object class but is conditionally independent of other factors such as viewpoint and texture. To model the 3D shape prior and generate realistic shapes, we adopt the 3D Generative Adversarial Networks (3D-GAN) recently proposed by Wu et al. [2016].\nConsider a voxelized 3D object collection {v_i}_{i=1}^N, where v_i \u2208 R^{W\u00d7W\u00d7W}. We learn a generator Gshape to map a shape code zshape, randomly sampled from a Gaussian distribution, to a W \u00d7 W \u00d7 W voxel grid. Simultaneously, we train a 3D discriminator Dshape to classify a shape as real or generated. Both discriminator and generator contain fully volumetric convolutional and deconvolutional layers. We find that the original 3D-GAN [Wu et al., 2016] sometimes suffers from mode collapse. To improve the quality and diversity of the results, we use the Wasserstein distance of WGAN-GP [Arjovsky et al., 2017, Gulrajani et al., 2017]. Formally, we play the following minimax two-player game between Gshape and Dshape: min_{Gshape} max_{Dshape} L_shape*, where\n\nL_shape = E_v[Dshape(v)] \u2212 E_{zshape}[Dshape(Gshape(zshape))]. (1)\n\nTo enforce the Lipschitz constraint in Wasserstein GANs [Arjovsky et al., 2017], we add a gradient-penalty loss \u03bbGP E_{\u02dcv}[(||\u2207_{\u02dcv}Dshape(\u02dcv)|| \u2212 1)^2] to Eqn. 
1, where \u02dcv is a randomly sampled point along the\nstraight line between a real shape and a generated shape, and \u03bbGP controls the capacity of Dshape.\nSince binary data is often challenging to model using GANs, we also experiment with distance\nfunction (DF) representation [Curless and Levoy, 1996], which is continuous on the 3D voxel space.\nSee Section 4.1 for quantitative evaluations.\n\n3.2 Generating 2.5D Sketches\n\nGiven a synthesized voxelized shape v = Gshape(zshape), how can we connect it to a 2D image?\nInspired by recent work on 3D reconstruction [Wu et al., 2017], we use 2.5D sketches [Barrow\nand Tenenbaum, 1978, Marr, 1982] to bridge the gap between 3D and 2D. This intermediate repre-\nsentation provides three main advantages. First, generating 2.5D sketches from a 3D voxel grid is\nstraightforward, as the projection is differentiable with respect to both the input shape and the view-\npoint. Second, 2D image synthesis from a 2.5D sketch can be cast as an image-to-image translation\nproblem [Isola et al., 2017], where existing methods have achieved successes even without paired\ndata [Zhu et al., 2017a]. Third, compared with alternative approaches such as colored voxels, our\nmethod enables generating images at a higher resolution.\nHere we describe our differentiable module for projecting voxels into 2.5D sketches. The inputs\nto this module are the camera parameters and 3D voxels. The value of each voxel stores the\nprobability of it being present. To render the 2.5D sketches from the voxels under a perspective\ncamera, we \ufb01rst generate a collection of rays, each originating from the camera\u2019s center and going\nthrough a pixel\u2019s center in the image plane. To render the 2.5D sketches, we need to calculate\nwhether a given ray would hit the voxels, and if so, the corresponding depth value of that ray. To\nthis end, we \ufb01rst sample a collection of points at evenly spaced depth along each ray. 
Next, for each point, we calculate the probability of hitting the input voxels using a differentiable trilinear interpolation [Jaderberg et al., 2015] of the input voxels. Similar to Tulsiani et al. [2017], we then calculate the expectation of visibility and depth along each ray. Specifically, given a ray R with N samples R_1, R_2, ..., R_N along its path, we calculate the visibility (silhouette) as the expectation of the ray hitting the voxels: \u2211_{j=1}^{N} [\u220f_{k=1}^{j\u22121} (1 \u2212 R_k)] R_j. Similarly, the expected depth can be calculated as \u2211_{j=1}^{N} d_j [\u220f_{k=1}^{j\u22121} (1 \u2212 R_k)] R_j, where d_j is the depth of the sample R_j. This process is fully differentiable since the gradients can be back-propagated through both the expectation calculation and the trilinear interpolation.\n\nViewpoint estimation. Our two-dimensional viewpoint code zview encodes camera elevation and azimuth. We sample zview from an empirical distribution pdata(zview) of the camera poses from the training images. To estimate pdata(zview), we first render the silhouettes of several candidate 3D models under uniformly sampled camera poses. For each input image, we compare its silhouette to the rendered 2D views and choose the pose with the largest Intersection-over-Union value. More details can be found in the supplement.\n\n*For notation simplicity, we denote E_v := E_{v\u223cpdata(v)} and E_{zshape} := E_{zshape\u223cpdata(zshape)}.\n\n3.3 Learning 2D Texture Priors\n\nNext, we learn to synthesize realistic 2D images given projected 2.5D sketches that encode both the viewpoint and the object shape. In particular, we learn a texture network Gtexture that takes a randomly sampled texture code ztexture and the projected 2.5D sketches v2.5D as input, and produces a 2D image x = Gtexture(v2.5D, ztexture). 
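The per-ray visibility and depth expectations of the projection module (Section 3.2) can be sketched in code. This is a minimal NumPy sketch for exposition, not the custom CUDA kernel the paper describes; the function and array names are illustrative, and in practice an autograd framework would be used so the operation stays differentiable:

```python
import numpy as np

def ray_expectations(occ, depths):
    """Expected visibility (silhouette) and depth along each ray.

    occ:    (num_rays, N) occupancy probabilities R_1..R_N sampled along
            each ray (e.g., by trilinear interpolation of the voxel grid).
    depths: (N,) depth d_j of each sample along a ray.
    """
    # Probability that the ray travels freely past samples 1..j-1:
    # prod_{k=1}^{j-1} (1 - R_k), with an empty product of 1 for j = 1.
    transmittance = np.cumprod(1.0 - occ, axis=1)
    transmittance = np.concatenate(
        [np.ones((occ.shape[0], 1)), transmittance[:, :-1]], axis=1)
    # Probability that the ray first hits the shape exactly at sample j.
    hit = transmittance * occ
    visibility = hit.sum(axis=1)                 # expected silhouette value
    expected_depth = (hit * depths).sum(axis=1)  # expected depth value
    return visibility, expected_depth
```

For example, a single ray with occupancies [0.5, 1.0] at depths [1.0, 2.0] yields visibility 1.0 and expected depth 1.5, since the ray stops at the first sample with probability 0.5 and otherwise at the second.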
This texture network needs to model both object texture and environment illumination, as well as the rendering equation [Kajiya, 1986]. Fortunately, this mapping problem can be cast as an unpaired image-to-image translation problem [Zhu et al., 2017a, Yi et al., 2017, Liu et al., 2017]. We adopt the recently proposed cycle-consistent adversarial networks (CycleGAN) [Zhu et al., 2017a] as our baseline. Later, we relax the one-to-one mapping restriction in CycleGAN to handle one-to-many mappings from 2.5D sketches to 2D images.\nHere we introduce two encoders Etexture and E2.5D to estimate a texture code ztexture and 2.5D sketches v2.5D from a real image x. We train Gtexture, Etexture, and E2.5D jointly with adversarial losses [Goodfellow et al., 2014] and cycle-consistency losses [Zhu et al., 2017a, Yi et al., 2017]. We use the following adversarial loss on the final generated image:\n\nL_image^GAN = E_x[log Dimage(x)] + E_{(v2.5D, ztexture)}[log(1 \u2212 Dimage(Gtexture(v2.5D, ztexture)))], (2)\n\nwhere Dimage learns to classify real and generated images. We apply the same adversarial loss for 2.5D sketches v2.5D:\n\nL_2.5D^GAN = E_{v2.5D}[log D2.5D(v2.5D)] + E_x[log(1 \u2212 D2.5D(E2.5D(x)))], (3)\n\nwhere D2.5D aims to distinguish between 2.5D sketches v2.5D and estimated 2.5D sketches E2.5D(x) from a real 2D image. We further use cycle-consistency losses [Zhu et al., 2017a] to enforce the bijective relationship between the two domains:\n\nL_2.5D^cyc = \u03bb_2.5D^cyc E_{(v2.5D, ztexture)}[||E2.5D(Gtexture(v2.5D, ztexture)) \u2212 v2.5D||_1] and L_image^cyc = \u03bb_image^cyc E_x[||Gtexture(E2.5D(x), Etexture(x)) \u2212 x||_1], (4)\n\nwhere \u03bb_image^cyc and \u03bb_2.5D^cyc control the importance of each cycle loss. The texture encoder Etexture and 2.5D sketch encoder E2.5D serve as recognition models that recover the texture and 2.5D representation from a 2D image.\n\nOne-to-many mappings. 
Prior studies [Isola et al., 2016, Mathieu et al., 2016] have found that latent codes are often ignored in conditional image generation due to the assumption of a one-to-one mapping; vanilla CycleGAN also suffers from this problem in our experiments. To address this, we introduce a latent space cycle-consistency loss to encourage Gtexture to use the texture code ztexture:\n\nL_texture^cyc = \u03bb_texture^cyc E_{(v2.5D, ztexture)}[||Etexture(Gtexture(v2.5D, ztexture)) \u2212 ztexture||_1], (5)\n\nwhere \u03bb_texture^cyc controls its importance. Finally, to allow sampling at test time, we add a Kullback\u2013Leibler (KL) loss on the z space to force Etexture(x) to be close to a Gaussian distribution:\n\nL_KL = \u03bb_KL E_x[D_KL(Etexture(x) || N(0, I))], (6)\n\nwhere D_KL(p || q) = \u222b_z p(z) log (p(z)/q(z)) dz and \u03bb_KL is its weight. We write the final texture loss as\n\nL_texture = L_image^GAN + L_2.5D^GAN (adversarial losses) + L_image^cyc + L_2.5D^cyc + L_texture^cyc (cycle-consistency losses) + L_KL (KL loss). (7)\n\nNote that the latent space reconstruction loss L_texture^cyc has been explored in unconditional GANs [Chen et al., 2016] and image-to-image translation [Zhu et al., 2017b, Almahairi et al., 2018]. Here we use this loss to learn one-to-many mappings from unpaired data.\n\n3.4 Our Full Model\n\nOur full objective is\n\narg min_{(Gshape, Gtexture, E2.5D, Etexture)} arg max_{(Dshape, Dimage, D2.5D)} \u03bb_shape L_shape + L_texture, (8)\n\nwhere \u03bb_shape controls the relative weight of the shape and texture loss functions. We compare our visual object networks against 2D deep generative models in Section 4.1.\n\n3.5 Implementation Details\n\nShape networks. For shape generation, we adopt the 3D-GAN architecture from Wu et al. 
[2016].\nIn particular, the discriminator Dshape contains 6 volumetric convolutional layers and the generator Gshape contains 6 strided-convolutional layers. We remove the batch normalization layers [Ioffe and Szegedy, 2015] in Gshape, as suggested by the WGAN-GP paper [Gulrajani et al., 2017].\n\nTexture networks. For texture generation, we use the ResNet encoder-decoder [Zhu et al., 2017a, Huang et al., 2018] and concatenate the texture code ztexture to intermediate layers in the encoder. For the discriminator, we use two-scale PatchGAN classifiers [Isola et al., 2017, Zhu et al., 2017a] to classify overlapping patches as real or fake. We use a least-squares objective as in LS-GAN [Mao et al., 2017] for stable training. We use ResNet encoders [He et al., 2015] for our Etexture and E2.5D.\n\nDifferentiable projection module. We assume the camera is at a fixed distance of 2m from the object\u2019s center and use a focal length of 50mm (35mm film equivalent). The resolution of the rendered sketches is 128 \u00d7 128, and we sample 128 points evenly along each camera ray. We also assume no in-plane rotation, that is, no tilting in the image plane. We implement a custom CUDA kernel for sampling along the projection rays and calculating the stop probabilities.\nTraining details. We train our models on 128 \u00d7 128 \u00d7 128 shapes (voxels or distance functions) and 128 \u00d7 128 \u00d7 3 images. During training, we first train the shape generator Gshape on 3D shape collections and then train the texture generator Gtexture given ground-truth 3D shape data and image data. Finally, we fine-tune both modules together. We sample the shape code zshape and texture code ztexture from the standard Gaussian distribution N(0, I), with the code lengths |zshape| = 200 and |ztexture| = 8. The entire training usually takes two to three days. 
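The texture-code conditioning mentioned above (concatenating ztexture to intermediate layers in the encoder) is commonly implemented by spatially replicating the code and stacking it channel-wise. A minimal sketch, assuming a channel-first layout; the function name is an assumption:

```python
import numpy as np

def concat_texture_code(features, z_texture):
    """Concatenate a latent code to a convolutional feature map.

    features:  (C, H, W) intermediate encoder activation.
    z_texture: (D,) texture code (|z_texture| = 8 in our setting).
    Returns a (C + D, H, W) array in which the code is replicated at
    every spatial location and appended along the channel axis.
    """
    _, h, w = features.shape
    # Broadcast the D-dimensional code to a (D, H, W) volume.
    tiled = np.broadcast_to(z_texture[:, None, None],
                            (z_texture.shape[0], h, w))
    return np.concatenate([features, tiled], axis=0)
```

Every spatial position of the returned map carries the full code, so subsequent convolutions can condition on it everywhere.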
For hyperparameters, we set \u03bbKL = 0.05, \u03bbGP = 10, \u03bb_image^cyc = 10, \u03bb_2.5D^cyc = 25, \u03bb_texture^cyc = 1, and \u03bbshape = 0.05. We use the Adam solver [Kingma and Ba, 2015] with a learning rate of 0.0002 for shape generation and 0.0001 for texture generation.\nWe observe that the texture generator Gtexture sometimes introduces the undesirable effect of changing the shape of the silhouette when rendering 2.5D sketches v2.5D (i.e., depth and mask). To address this issue, we explicitly mask the generated 2D images with the silhouette from v2.5D: i.e., Gtexture(v2.5D, ztexture) = mask \u00b7 gtexture(depth) + (1 \u2212 mask) \u00b7 1, where 1 is the background white color and the generator gtexture synthesizes an image given a depth map. Similarly, we reformulate E2.5D(x) = (e2.5D(x) \u00b7 maskgt, maskgt), where the encoder e2.5D only predicts depth, and the input object mask is used. In addition, we add a small mask consistency loss ||e2.5D(x) \u2212 maskgt||_1 to encourage the predicted depth map to be consistent with the object mask. As our training images have clean backgrounds, we can estimate the object mask with a simple threshold.\n4 Experiments\nWe first compare our visual object networks (VON) against recent 2D GAN variants on two datasets. We evaluate the results using both a quantitative metric and a qualitative human perception study. We then perform an ablation study on the objective functions of our shape generation network. Finally, we demonstrate several applications enabled by our disentangled 3D representation. The full results and datasets can be found at our website. Please find our implementation on GitHub.\n4.1 Evaluations\nDatasets. We use ShapeNet [Chang et al., 2015] for learning to generate 3D shapes. ShapeNet is a large shape repository of 55 object categories. Here we use the chair and car categories, which have 6,777 and 3,513 CAD models, respectively. 
For 2D datasets, we use the recently released Pix3D dataset to obtain 1,515 RGB images of chairs along with their silhouettes [Sun et al., 2018a], with an addition of 448 clean-background images crawled from Google image search. We also crawled 2,605 images of cars.\nBaselines. We compare our method to three popular GAN variants commonly used in the literature: DCGAN with the standard cross-entropy loss [Goodfellow et al., 2014, Radford et al., 2016], LSGAN [Mao et al., 2017], and WGAN-GP [Gulrajani et al., 2017]. We use the same DCGAN-like generator and discriminator architectures for all three GAN models. For WGAN-GP, we replace the BatchNorm by InstanceNorm [Ulyanov et al., 2016] in the discriminator, and we train the discriminator 5 times per generator iteration.\n\nFigure 3: Qualitative comparisons between 2D GAN models and VON: we show samples from DCGAN [Radford et al., 2016], LSGAN [Mao et al., 2017], WGAN-GP [Gulrajani et al., 2017], and our VON. For our method, we show both 3D shapes and 2D images. Note that VON is trained on unpaired 3D shapes and 2D images, while DCGAN, LSGAN, and WGAN-GP are trained only on 2D images. The learned 3D prior helps our model produce better samples. (Top: cars; bottom: chairs.)\n\nFigure 4: Sampled 3D shapes, from left to right: 3D-GAN [Wu et al., 2016], VON on voxels, VON on distance functions (DF). Our model produces more natural 3D shapes with higher quality.\n\nFID (Cars / Chairs): 3D-GAN (voxels) 3.021 / 2.598; VON (voxels) 0.021 / 0.082; 3D-GAN (DF) 3.896 / 1.790; VON (DF) 0.002 / 0.006.\n\nTable 3: Quantitative comparisons on shape generation: Fr\u00e9chet Inception Distances (FID) between real shapes and shapes generated by 3D-GAN [Wu et al., 2016] and our shape network, both on voxels and distance function representation (DF). 
Our model achieves better results regarding FID.\n\nFID (Car / Chair): DCGAN 130.5 / 225.0; LSGAN 171.4 / 225.3; WGAN-GP 123.4 / 184.9; VON (voxels) 81.6 / 58.0; VON (DF) 83.3 / 51.8.\n\nTable 1: Fr\u00e9chet Inception Distances [Heusel et al., 2017] between real images and images generated by DCGAN, LSGAN, WGAN-GP, our VON (voxels), and our VON (DF). DF denotes the distance function representation.\n\nHuman preference for VON (DF) (Car / Chair): vs. DCGAN 72.2% / 90.3%; vs. LSGAN 78.7% / 92.4%; vs. WGAN-GP 63.0% / 89.1%.\n\nTable 2: Human preferences on images generated by DCGAN [Radford et al., 2016], LSGAN [Mao et al., 2017], and WGAN-GP [Gulrajani et al., 2017] vs. our VON (DF). Each number shows the percentage of human subjects who prefer our method to the baseline method.\n\nMetrics. To evaluate the image generation models, we calculate the Fr\u00e9chet Inception Distance between generated images and real images, a metric highly correlated with human perception [Heusel et al., 2017, Lucic et al., 2018]. Each set of images is fed to the Inception network [Szegedy et al., 2015] trained on ImageNet [Deng et al., 2009], and the features from the layer before the last fully-connected layer are used to calculate the Fr\u00e9chet Inception Distance.\nSecond, we sample 200 pairs of generated images from the VON and the state-of-the-art models (DCGAN, LSGAN, and WGAN-GP), and show each pair to five subjects on Amazon MTurk. The subjects are asked to choose the more realistic result within each pair.\n\nResults. Our VON consistently outperforms the 2D generative models. In particular, Table 1 shows that our results have the smallest Fr\u00e9chet Inception Distance; in Table 2, 74%\u201385% of the responses preferred our results. This performance gain demonstrates that the learned 3D prior helps synthesize more realistic images. 
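The Fréchet Inception Distance used above can be computed from the two sets of Inception activations as follows. This is a sketch of the metric itself; extracting the features with the Inception network is omitted, and the helper name is an assumption:

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_a, feats_b):
    """Frechet distance between two sets of feature activations.

    Each input is an (n_samples, dim) array of features (e.g., from the
    layer before the last fully-connected layer of Inception). Both sets
    are modeled as Gaussians (mu, Sigma); the distance is
        ||mu_a - mu_b||^2 + Tr(S_a + S_b - 2 (S_a S_b)^{1/2}).
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Matrix square root of the covariance product; sqrtm may return a
    # complex array with tiny imaginary parts, which we discard.
    covmean = linalg.sqrtm(cov_a @ cov_b).real
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))
```

Two identical feature sets give a distance of (numerically) zero; larger values indicate more dissimilar feature distributions.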
See Figure 3 for a qualitative comparison between these methods.\n\nAnalysis of shape generation. For shape generation, we compare our method against the prior 3D-GAN work by Wu et al. [2016] on both voxel grids and the distance function representation. 3D-GAN uses the same architecture but is trained with a cross-entropy loss. We evaluate the shape generation models using the Fr\u00e9chet Inception Distance (FID) between the generated and real shapes. To extract statistics for each set of generated/real shapes, we train ResNet-based 3D shape classifiers [He et al., 2015] on all 55 classes of shapes from ShapeNet; classifiers are trained separately on voxels and distance function representations. We extract the features from the layer before the last fully-connected layer. Table 3 shows that our method achieves better results regarding FID. Figure 4 shows that the Wasserstein distance improves the quality of the results. As we use different classifiers for voxels and distance functions, the Fr\u00e9chet Inception Distance is not comparable across representations.\n\n4.2 Applications\nWe apply our visual object networks to several 3D manipulation applications that are not possible with previous 2D generative models [Goodfellow et al., 2014, Kingma and Welling, 2014].\n\nChanging viewpoints. As our VON first produces a 3D shape, we can project the shape to the image plane given different viewpoints zview while keeping the same shape and texture code. Figure 1c and Figure 5a show a few examples.\n\nShape and texture editing. With our learned disentangled 3D representation, we can easily change only the shape code or the texture code, which allows us to edit the shape and texture separately. See Figure 1c and Figure 5a for a few examples.\n\nDisentangled interpolation. Given our disentangled 3D representation, we can choose to interpolate between two objects in different ways. 
For example, we can interpolate objects in shape space, α z1_shape + (1 − α) z2_shape, with the same texture; in texture space, α z1_texture + (1 − α) z2_texture, with the same shape; or in both, where α ∈ [0, 1]. Figure 5b shows linear interpolations in the latent space.

Example-based texture transfer. We can infer the texture code ztexture from a real image x with the texture encoder, ztexture = Etexture(x), and apply the code to new shapes. Figure 6 shows texture transfer results on cars and chairs using real images and generated shapes.

Figure 5: 3D-aware applications: Our visual object networks allow several 3D applications such as (a) changing the viewpoint, texture, or shape independently, and (b) interpolating between two objects in shape space, texture space, or both. None of these can be achieved by previous 2D GANs.

Figure 6: Given a real input image, we synthesize objects with similar texture using the inferred texture code. The same texture is transferred to different shapes and viewpoints.

5 Discussion
In this paper, we have presented visual object networks (VON), a fully differentiable 3D-aware generative model for image and shape synthesis. Our key idea is to disentangle the image generation process into three factors: shape, viewpoint, and texture. This disentangled 3D representation allows us to learn the model from both 3D and 2D visual data collections under an adversarial learning framework. Our model synthesizes more photorealistic images than existing 2D generative models; it also enables various 3D manipulations that are not possible with prior 2D methods. In the future, we are interested in incorporating coarse-to-fine modeling [Karras et al., 2017] for producing shapes and images at a higher resolution.
Another interesting direction to explore is to disentangle texture further into lighting and appearance (e.g., albedo), which could improve the consistency of appearance across different viewpoints and lighting conditions. Finally, as we do not have large-scale 3D geometric data for entire scenes, our current method only works for individual objects. Synthesizing natural scenes is also a meaningful next step.

Acknowledgements. This work is supported by NSF #1231216, NSF #1447476, NSF #1524817, ONR MURI N00014-16-1-2007, Toyota Research Institute, Shell, and Facebook. We thank Xiuming Zhang, Richard Zhang, David Bau, and Zhuang Liu for valuable discussions.

References

Panos Achlioptas, Olga Diamanti, Ioannis Mitliagkas, and Leonidas Guibas. Learning representations and generative models for 3d point clouds. In ICLR Workshop, 2018.

Amjad Almahairi, Sai Rajeswar, Alessandro Sordoni, Philip Bachman, and Aaron Courville. Augmented cyclegan: Learning many-to-many mappings from unpaired data. In ICML, 2018.

Martín Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In ICML, 2017.

Harry G Barrow and Jay M Tenenbaum. Recovering intrinsic scene characteristics from images. Computer Vision Systems, 1978.

Thomas G Bever and David Poeppel. Analysis by synthesis: a (re-)emerging program of research for language and vision. Biolinguistics, 4(2-3):174–200, 2010.

Volker Blanz and Thomas Vetter. A morphable model for the synthesis of 3d faces. In SIGGRAPH, 1999.

Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, Jianxiong Xiao, Li Yi, and Fisher Yu.
Shapenet: An information-rich 3d model repository. arXiv:1512.03012, 2015.

Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In NIPS, 2016.

Brian Curless and Marc Levoy. A volumetric method for building complex models from range images. In SIGGRAPH, 1996.

Angela Dai, Charles Ruizhongtai Qi, and Matthias Nießner. Shape completion using 3d-encoder-predictor cnns and shape synthesis. In CVPR, 2017.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.

Alexey Dosovitskiy, Jost Tobias Springenberg, and Thomas Brox. Learning to generate chairs with convolutional neural networks. In CVPR, 2015.

Matheus Gadelha, Subhransu Maji, and Rui Wang. 3d shape induction from 2d views of multiple objects. In 3DV, 2017a.

Matheus Gadelha, Subhransu Maji, and Rui Wang. Shape generation using spatially partitioned point clouds. In BMVC, 2017b.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014.

Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. Improved training of wasserstein gans. In NIPS, 2017.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2015.

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, Günter Klambauer, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a nash equilibrium. In NIPS, 2017.

Xun Huang, Ming-Yu Liu, Serge Belongie, and Jan Kautz.
Multimodal unsupervised image-to-image translation. In ECCV, 2018.

Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.

Phillip Isola, Daniel Zoran, Dilip Krishnan, and Edward H Adelson. Learning visual groups from co-occurrences in space and time. In ICLR Workshop, 2016.

Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.

Max Jaderberg, Karen Simonyan, and Andrew Zisserman. Spatial transformer networks. In NIPS, 2015.

James T Kajiya. The rendering equation. In SIGGRAPH, 1986.

Angjoo Kanazawa, Shubham Tulsiani, Alexei A Efros, and Jitendra Malik. Learning category-specific mesh reconstruction from image collections. In ECCV, 2018.

Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. In ICLR, 2018.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.

Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. In ICLR, 2014.

Tejas D Kulkarni, Pushmeet Kohli, Joshua B Tenenbaum, and Vikash Mansinghka. Picture: A probabilistic programming language for scene perception. In CVPR, 2015a.

Tejas D Kulkarni, William F Whitney, Pushmeet Kohli, and Josh Tenenbaum. Deep convolutional inverse graphics network. In NIPS, 2015b.

Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, and Wenzhe Shi. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, 2017.
Ming-Yu Liu, Thomas Breuel, and Jan Kautz. Unsupervised image-to-image translation networks. In NIPS, 2017.

Mario Lucic, Karol Kurach, Marcin Michalski, Sylvain Gelly, and Olivier Bousquet. Are gans created equal? a large-scale study. In NIPS, 2018.

Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adversarial networks. In ICCV, 2017.

David Marr. Vision: A computational investigation into the human representation and processing of visual information, volume 2. W. H. Freeman and Company, 1982.

Michael Mathieu, Camille Couprie, and Yann LeCun. Deep multi-scale video prediction beyond mean square error. In ICLR, 2016.

Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. Context encoders: Feature learning by inpainting. In CVPR, 2016.

Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.

Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text-to-image synthesis. In ICML, 2016.

Danilo Jimenez Rezende, SM Eslami, Shakir Mohamed, Peter Battaglia, Max Jaderberg, and Nicolas Heess. Unsupervised learning of 3d structure from images. In NIPS, 2016.

Zhixin Shu, Ersin Yumer, Sunil Hadap, Kalyan Sunkavalli, Eli Shechtman, and Dimitris Samaras. Neural face editing with intrinsic image disentangling. In CVPR, 2017.

Xingyuan Sun, Jiajun Wu, Xiuming Zhang, Zhoutong Zhang, Chengkai Zhang, Tianfan Xue, Joshua B Tenenbaum, and William T Freeman. Pix3d: Dataset and methods for single-image 3d shape modeling. In CVPR, 2018a.

Yongbin Sun, Ziwei Liu, Yue Wang, and Sanjay E Sarma. Im2avatar: Colorful 3d reconstruction from a single image. arXiv:1804.06375, 2018b.
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, 2015.

Maxim Tatarchenko, Alexey Dosovitskiy, and Thomas Brox. Multi-view 3d models from single images with a convolutional network. In ECCV, 2016.

Maxim Tatarchenko, Alexey Dosovitskiy, and Thomas Brox. Octree generating networks: Efficient convolutional architectures for high-resolution 3d outputs. In ICCV, 2017.

Shubham Tulsiani, Hao Su, Leonidas J Guibas, Alexei A Efros, and Jitendra Malik. Learning shape abstractions by assembling volumetric primitives. In CVPR, 2017.

Hsiao-Yu Fish Tung, Adam W Harley, William Seto, and Katerina Fragkiadaki. Adversarial inverse graphics networks: Learning 2d-to-3d lifting and image-to-image translation from unpaired supervision. In ICCV, 2017.

Dmitry Ulyanov, Andrea Vedaldi, and Victor S. Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv:1607.08022, 2016.

Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-resolution image synthesis and semantic manipulation with conditional gans. In CVPR, 2018.

Weiyue Wang, Qiangui Huang, Suya You, Chao Yang, and Ulrich Neumann. Shape inpainting using 3d generative adversarial network and recurrent convolutional networks. In ICCV, 2017.

Jiajun Wu, Chengkai Zhang, Tianfan Xue, William T Freeman, and Joshua B Tenenbaum. Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling. In NIPS, 2016.

Jiajun Wu, Yifan Wang, Tianfan Xue, Xingyuan Sun, William T Freeman, and Joshua B Tenenbaum. MarrNet: 3D shape reconstruction via 2.5D sketches. In NIPS, 2017.
Jiajun Wu, Chengkai Zhang, Xiuming Zhang, Zhoutong Zhang, William T Freeman, and Joshua B Tenenbaum. Learning 3d shape priors for shape completion and reconstruction. In ECCV, 2018.

Jimei Yang, Scott E Reed, Ming-Hsuan Yang, and Honglak Lee. Weakly-supervised disentangling with recurrent transformations for 3d view synthesis. In NIPS, 2015.

Zili Yi, Hao (Richard) Zhang, Ping Tan, and Minglun Gong. Dualgan: Unsupervised dual learning for image-to-image translation. In ICCV, 2017.

Alan Yuille and Daniel Kersten. Vision as bayesian inference: analysis by synthesis? TiCS, 10(7):301–308, 2006.

Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris Metaxas. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In ICCV, 2017.

Xiuming Zhang, Zhoutong Zhang, Chengkai Zhang, Joshua B Tenenbaum, William T Freeman, and Jiajun Wu. Learning to reconstruct shapes from unseen categories. In NIPS, 2018.

Jun-Yan Zhu, Philipp Krähenbühl, Eli Shechtman, and Alexei A Efros. Generative visual manipulation on the natural image manifold. In ECCV, 2016.

Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017a.

Jun-Yan Zhu, Richard Zhang, Deepak Pathak, Trevor Darrell, Alexei A Efros, Oliver Wang, and Eli Shechtman. Toward multimodal image-to-image translation. In NIPS, 2017b.