{"title": "Perspective Transformer Nets: Learning Single-View 3D Object Reconstruction without 3D Supervision", "book": "Advances in Neural Information Processing Systems", "page_first": 1696, "page_last": 1704, "abstract": "Understanding the 3D world is a fundamental problem in computer vision. However, learning a good representation of 3D objects is still an open problem due to the high dimensionality of the data and many factors of variation involved. In this work, we investigate the task of single-view 3D object reconstruction from a learning agent's perspective. We formulate the learning process as an interaction between 3D and 2D representations and propose an encoder-decoder network with a novel projection loss defined by the projective transformation. More importantly, the projection loss enables the unsupervised learning using 2D observation without explicit 3D supervision. We demonstrate the ability of the model in generating 3D volume from a single 2D image with three sets of experiments: (1) learning from single-class objects; (2) learning from multi-class objects and (3) testing on novel object classes. Results show superior performance and better generalization ability for 3D object reconstruction when the projection loss is involved.", "full_text": "Perspective Transformer Nets: Learning Single-View\n3D Object Reconstruction without 3D Supervision\n\nXinchen Yan1\n\nJimei Yang2 Ersin Yumer2 Yijie Guo1 Honglak Lee1,3\n\n1University of Michigan, Ann Arbor\n\n2Adobe Research\n3Google Brain\n\n{xcyan,guoyijie,honglak}@umich.edu, {jimyang,yumer}@adobe.com\n\nAbstract\n\nUnderstanding the 3D world is a fundamental problem in computer vision. How-\never, learning a good representation of 3D objects is still an open problem due\nto the high dimensionality of the data and many factors of variation involved. In\nthis work, we investigate the task of single-view 3D object reconstruction from a\nlearning agent\u2019s perspective. 
We formulate the learning process as an interaction\nbetween 3D and 2D representations and propose an encoder-decoder network with\na novel projection loss de\ufb01ned by the perspective transformation. More importantly,\nthe projection loss enables the unsupervised learning using 2D observation without\nexplicit 3D supervision. We demonstrate the ability of the model in generating 3D\nvolume from a single 2D image with three sets of experiments: (1) learning from\nsingle-class objects; (2) learning from multi-class objects and (3) testing on novel\nobject classes. Results show superior performance and better generalization ability\nfor 3D object reconstruction when the projection loss is involved.\n\nIntroduction\n\n1\nUnderstanding the 3D world is at the heart of successful computer vision applications in robotics, ren-\ndering and modeling [19]. It is especially important to solve this problem using the most convenient\nvisual sensory data: 2D images. In this paper, we propose an end-to-end solution to the challenging\nproblem of predicting the underlying true shape of an object given an arbitrary single image obser-\nvation of it. This problem de\ufb01nition embodies a fundamental challenge: Imagery observations of\n3D shapes are interleaved representations of intrinsic properties of the shape itself (e.g., geometry,\nmaterial), as well as its extrinsic properties that depend on its interaction with the observer and the\nenvironment (e.g., orientation, position, and illumination). 
Physically principled shape understanding should be able to efficiently disentangle such interleaved factors.

This observation leads to the insight that an end-to-end solution to this problem from the perspective of learning agents (neural networks) should involve the following properties: 1) the agent should understand the physical meaning of how a 2D observation is generated from the 3D shape, and 2) the agent should be conscious of the outcome of its interaction with the object; more specifically, by moving around the object, the agent should be able to relate its observations to the viewpoint change. If such properties are embodied in a learning agent, it will be able to disentangle the shape from the extrinsic factors, because these factors are trivial to understand in the 3D world. To enable the agent with these capabilities, we introduce a built-in camera system that can transform the 3D object into 2D images in-network. Additionally, we architect the network such that the latent representation disentangles the shape from view changes. More specifically, our network takes as input an object image and predicts its volumetric 3D shape so that the perspective transformations of the predicted shape match well with the corresponding 2D observations.

We implement this neural network as a combination of an image encoder, a volume decoder and a perspective transformer (similar to the spatial transformer introduced by Jaderberg et al. [6]). During training, the volumetric 3D shape is gradually learned from single-view input and the feedback of other views through back-propagation. Thus, at test time, the 3D shape can be directly generated from a single image. We conduct experimental evaluations using a subset of 3D models from ShapeNetCore [1].

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.
Results from single-class and multi-class training demonstrate excellent performance of our network for volumetric 3D reconstruction. Our main contributions are summarized below.

• We show that neural networks are able to predict 3D shape from a single view without using ground-truth 3D volumetric data for training. This is made possible by introducing a 2D silhouette loss function based on perspective transformations.

• We train a single network for multi-class 3D object volumetric reconstruction and show its generalization potential to unseen categories.

• Compared to training with full azimuth angles, we demonstrate comparable results when training with partial views.

2 Related Work

Representation learning for 3D objects. Recently, advances have been made in learning deep neural networks for 3D objects using large-scale CAD databases [22, 1]. Wu et al. [22] proposed a deep generative model that extends the convolutional deep belief network [11] to model volumetric 3D shapes. Different from [22], which uses a volumetric 3D representation, Su et al. [18] proposed a multi-view convolutional network for 3D shape categorization with a view-pooling mechanism. These methods focus more on 3D shape recognition than on 3D shape reconstruction. Recent work [20, 14, 4, 2] attempts to learn a joint representation for both 2D images and 3D shapes. Tatarchenko et al. [20] developed a convolutional network to synthesize unseen 3D views from a single image and demonstrated that the synthesized images can be used to reconstruct 3D shape. Qi et al. [14] introduced a joint embedding that combines a volumetric representation and a multi-view representation to improve 3D shape recognition performance. Girdhar et al. [4] proposed a generative model for 3D volumetric data and combined it with a 2D image embedding network for single-view 3D shape generation. Choy et al.
[2] introduced a 3D recurrent neural network (3D-R2N2) based on long short-term memory (LSTM) to predict the 3D shape of an object from a single view or multiple views. Compared to these single-view methods, our 3D reconstruction network is learned end-to-end and can even be trained without ground-truth volumes.

Concurrent to our work, Rezende et al. [16] introduced a general framework to learn 3D structures from 2D observations with a 3D-2D projection mechanism. Their 3D-2D projection mechanism either has learnable parameters or adopts a non-differentiable component using MCMC, whereas our perspective projection net is both differentiable and parameter-free.

Representation learning by transformations. Learning from transformed sensory data has gained attention [12, 5, 15, 13, 23, 6, 24] in recent years. Memisevic and Hinton [12] introduced a gated Boltzmann machine that models the transformations between image pairs using multiplicative interaction. Reed et al. [15] showed that disentangled hidden-unit representations of Boltzmann machines (disBM) could be learned based on the transformations on the data manifold. Yang et al. [23] learned out-of-plane rotations of rendered images to obtain disentangled identity and viewpoint units by curriculum learning. Kulkarni et al. [9] proposed to learn a semantically interpretable latent representation from 3D rendered images using variational auto-encoders [8] by including specific transformations in mini-batches. Complementary to convolutional networks, Jaderberg et al. [6] introduced a differentiable sampling layer that directly incorporates geometric transformations into representation learning. Concurrent to our work, Wu et al. [21] proposed a 3D-2D projection layer that enables the learning of 3D object structures using 2D keypoints as annotation.

3 Problem Formulation

In this section, we develop neural networks for reconstructing 3D objects.
From the perspective of a learning agent (e.g., a neural network), a natural way to understand a 3D object X is from its 2D views under transformations. By moving around the 3D object, the agent should be able to recognize its unique features and eventually build a 3D mental model of it, as illustrated in Figure 1(a). Assume that I(k) is the 2D image from the k-th viewpoint α(k) by projection I(k) = P(X; α(k)), or rendering in graphics. An object X in a certain scene is the entanglement of shape, color and texture (its intrinsic properties), and the image I(k) is the further entanglement with viewpoint and illumination (extrinsic parameters). The general goal of understanding 3D objects can be viewed as disentangling intrinsic properties and extrinsic parameters from a single image.

Figure 1: (a) Understanding a 3D object from the learning agent's perspective; (b) single-view 3D volume reconstruction with perspective transformation; (c) illustration of perspective projection. The minimum and maximum disparity in the screen coordinates are denoted as dmin and dmax.

In this paper, we focus on 3D shape learning by ignoring the color and texture factors, and we further simplify the problem by making the following assumptions: 1) the scene has a clean white background; 2) the illumination is constant natural lighting. We use the volumetric representation of the 3D shape V, where each voxel Vi is a binary unit. In other words, the voxel equals one, i.e., Vi = 1, if the i-th voxel space is occupied by the shape; otherwise Vi = 0. Assuming the 2D silhouette S(k) is obtained from the k-th image I(k), we can specify the 3D-2D projection S(k) = P(V; α(k)).
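As a toy illustration of the projection S(k) = P(V; α(k)): for an axis-aligned orthographic view (a simplification; the paper's P is the full perspective transform of Section 3.2), the silhouette of a binary voxel grid reduces to a max over the viewing axis. A minimal numpy sketch (all names here are ours):

```python
import numpy as np

# Toy binary occupancy volume: a 4x4x4 grid containing a 2x2x2 solid block.
V = np.zeros((4, 4, 4), dtype=np.uint8)
V[1:3, 1:3, 1:3] = 1

# For an axis-aligned orthographic view, the silhouette S = P(V; alpha)
# reduces to a max over the viewing axis: a pixel is foreground iff any
# voxel along the corresponding ray is occupied.
S_front = V.max(axis=2)   # viewing down the z-axis
S_top = V.max(axis=0)     # viewing down the x-axis

assert S_front.shape == (4, 4)
assert S_front.sum() == 4   # the 2x2 footprint of the solid block
```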
Note that 2D silhouette estimation is typically solved by object segmentation in the real world, but it becomes trivial in our case due to the white background.

In the following sub-sections, we propose a formulation for learning to predict the volumetric 3D shape V from an image I(k) with and without 3D volume supervision.

3.1 Learning to Reconstruct Volumetric 3D Shape from Single-View

We consider single-view volumetric 3D reconstruction as a dense prediction problem and develop a convolutional encoder-decoder network for this learning task, denoted by V̂ = f(I(k)). The encoder network h(·) learns a viewpoint-invariant latent representation h(I(k)), which is then used by the decoder g(·) to generate the volume V̂ = g(h(I(k))). In case the ground-truth volumetric shapes V are available, the problem can be considered as learning volumetric 3D shapes with a regular reconstruction objective in 3D space: $L_{vol}(I^{(k)}) = \|f(I^{(k)}) - V\|^2_2$.

In practice, however, the ground-truth volumetric 3D shapes may not be available for training. For example, the agent observes the 2D silhouette via its built-in camera without accessing the volumetric 3D shape. Inspired by the space carving theory [10], we propose a silhouette-based volumetric loss function. In particular, we build on the premise that a 2D silhouette Ŝ(j) projected from the generated volume V̂ under a certain camera viewpoint α(j) should match the ground-truth 2D silhouette S(j) from image observations.
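The volumetric objective L_vol and the silhouette-matching premise above can be sketched in a few lines of numpy; the per-view averaging mirrors the projection objective formalized in Eq. (1), and all function names and toy data are ours:

```python
import numpy as np

def vol_loss(pred_vol, gt_vol):
    # L_vol: squared L2 error against the ground-truth volume (when available).
    return np.sum((pred_vol - gt_vol) ** 2)

def proj_loss(pred_sils, gt_sils):
    # Silhouette-matching loss: average squared L2 error between the
    # silhouettes projected from the predicted volume and the observed ones.
    n = len(gt_sils)
    return sum(np.sum((p - s) ** 2) for p, s in zip(pred_sils, gt_sils)) / n

def comb_loss(pred_sils, gt_sils, pred_vol=None, gt_vol=None,
              lam_proj=1.0, lam_vol=0.0):
    # Weighted combination; with lam_vol = 0 no 3D supervision is used at all.
    loss = lam_proj * proj_loss(pred_sils, gt_sils)
    if gt_vol is not None:
        loss += lam_vol * vol_loss(pred_vol, gt_vol)
    return loss

# Toy check: two 2x2 silhouettes, each prediction off by 1 at a single pixel.
gt = [np.zeros((2, 2)), np.ones((2, 2))]
pred = [g.copy() for g in gt]
pred[0][0, 0] = 1.0
pred[1][0, 0] = 0.0
assert proj_loss(pred, gt) == (1 + 1) / 2
```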
In other words, if all the generated silhouettes Ŝ(j) match well with their corresponding ground-truth silhouettes S(j) for all j's, then we hypothesize that the generated volume V̂ should be as good as one instance of the visual hull equivalence class of the ground-truth volume V [10]. Therefore, we formulate the learning objective for the k-th image as

$$L_{proj}(I^{(k)}) = \sum_{j=1}^{n} L^{(j)}_{proj}(I^{(k)}; S^{(j)}, \alpha^{(j)}) = \frac{1}{n} \sum_{j=1}^{n} \|P(f(I^{(k)}); \alpha^{(j)}) - S^{(j)}\|^2_2, \qquad (1)$$

where j is the index of output 2D silhouettes, n is the number of silhouettes used for each input image and P(·) is the 3D-2D projection function. Note that the training objective in Eq. (1) enables training without ground-truth volumes. The network diagram is illustrated in Figure 1(b). A more general learning objective is given by a combination of both objectives:

$$L_{comb}(I^{(k)}) = \lambda_{proj} L_{proj}(I^{(k)}) + \lambda_{vol} L_{vol}(I^{(k)}), \qquad (2)$$

where λproj and λvol are constants that control the tradeoff between the two losses.

3.2 Perspective Transformer Nets

As defined previously, the 2D silhouette S(k) is obtained via perspective projection given the input 3D volume V and a specific camera viewpoint α(k).
In this work, we implement the perspective projection (see Figure 1(c)) with a 4-by-4 transformation matrix $\Theta_{4\times4}$, where $K$ is the camera calibration matrix and $(R, t)$ are the extrinsic parameters:

$$\Theta_{4\times4} = \begin{bmatrix} K & 0 \\ 0^T & 1 \end{bmatrix} \begin{bmatrix} R & t \\ 0^T & 1 \end{bmatrix} \qquad (3)$$

For each point $p^s_i = (x^s_i, y^s_i, z^s_i, 1)$ in 3D world coordinates, we compute the corresponding point $p^t_i = (x^t_i, y^t_i, 1, d^t_i)$ in screen coordinates (plus disparity $d^t_i$) using the perspective transformation: $p^s_i \sim \Theta_{4\times4} p^t_i$.

Similar to the spatial transformer network introduced in [6], we propose a 2-step procedure: (1) performing dense sampling from the input volume (in 3D world coordinates) to the output volume (in screen coordinates), and (2) flattening the 3D spatial output across the disparity dimension. In the experiments, we assume that the transformation matrix is always given as input, parametrized by the viewpoint $\alpha$. Again, the 3D point $(x^s_i, y^s_i, z^s_i)$ in the input volume $V \in \mathbb{R}^{H\times W\times D}$ and the corresponding point $(x^t_i, y^t_i, d^t_i)$ in the output volume $U \in \mathbb{R}^{H'\times W'\times D'}$ are linked by the perspective transformation matrix $\Theta_{4\times4}$. Here, $(W, H, D)$ and $(W', H', D')$ are the width, height and depth of the input and output volumes, respectively. We summarize the dense sampling step and the channel-wise flattening step as follows:

$$U_i = \sum_{n}^{H} \sum_{m}^{W} \sum_{l}^{D} V_{nml}\,\max(0, 1-|x^s_i - m|)\,\max(0, 1-|y^s_i - n|)\,\max(0, 1-|z^s_i - l|) \qquad (4)$$

$$S_{n'm'} = \max_{l'} U_{n'm'l'}$$

Here, $U_i$ is the i-th voxel value corresponding to the point $(x^t_i, y^t_i, d^t_i)$, where $i \in \{1, ..., W' \times H' \times D'\}$. Note that we use the max operator for projection instead of summation along one dimension, since the volume is represented as a binary cube where solid voxels have value 1 and empty voxels have value 0. Intuitively, we have the following two observations: (1) an empty voxel never contributes to a foreground pixel of S from any viewpoint; (2) a solid voxel contributes to a foreground pixel of S only if it is visible from the specific viewpoint.

3.3 Training

As the same volumetric 3D shape is expected to be generated from different images of the object, the encoder network is required to learn a 3D view-invariant latent representation

$$h(I^{(1)}) = h(I^{(2)}) = \cdots = h(I^{(k)}) \qquad (5)$$

This sub-problem itself is a challenging task in computer vision [23, 9]. Thus, we adopt a two-stage training procedure: first we learn the encoder network for a 3D view-invariant latent representation h(I), and then train the volumetric decoder with the perspective transformer networks.
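The perspective transformer of Section 3.2 (grid generation through Θ, trilinear sampling as in Eq. (4), and max-flattening over disparity) can be sketched in numpy as follows. This is a simplified sketch under conventions of our own choosing: we order the homogeneous screen coordinate as (x_t, y_t, d_t, 1) rather than the paper's (x_t, y_t, 1, d_t) so that the identity matrix acts as a no-op, and we normalize all coordinates to [-1, 1]; the function name is ours:

```python
import numpy as np

def perspective_transform_silhouette(V, theta, out_shape):
    """Project a binary volume V of shape (H, W, D) to a 2D silhouette.
    theta maps homogeneous screen coordinates (x_t, y_t, d_t, 1) to
    homogeneous world coordinates; out_shape is (H', W', D')."""
    Hs, Ws, Ds = V.shape
    Ht, Wt, Dt = out_shape
    # Step 1a: dense target grid in normalized screen coordinates [-1, 1].
    yt, xt, dt = np.meshgrid(np.linspace(-1, 1, Ht),
                             np.linspace(-1, 1, Wt),
                             np.linspace(-1, 1, Dt), indexing='ij')
    p_t = np.stack([xt.ravel(), yt.ravel(), dt.ravel(), np.ones(xt.size)])
    p_s = theta @ p_t                       # p_s ~ theta p_t
    p_s = p_s[:3] / p_s[3]                  # dehomogenize
    # Map normalized world coordinates to voxel indices of V.
    xs = (p_s[0] + 1) * (Ws - 1) / 2
    ys = (p_s[1] + 1) * (Hs - 1) / 2
    zs = (p_s[2] + 1) * (Ds - 1) / 2
    # Step 1b: trilinear sampling (Eq. 4) -- sum over the 8 neighbouring
    # voxels, each weighted by max(0, 1 - |distance|) per axis.
    U = np.zeros(xs.size)
    x0, y0, z0 = (np.floor(xs).astype(int), np.floor(ys).astype(int),
                  np.floor(zs).astype(int))
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                m, n, l = x0 + dx, y0 + dy, z0 + dz
                w = (np.maximum(0, 1 - np.abs(xs - m)) *
                     np.maximum(0, 1 - np.abs(ys - n)) *
                     np.maximum(0, 1 - np.abs(zs - l)))
                ok = (m >= 0) & (m < Ws) & (n >= 0) & (n < Hs) & (l >= 0) & (l < Ds)
                U[ok] += w[ok] * V[n[ok], m[ok], l[ok]]
    # Step 2: channel-wise flattening, S = max over the disparity axis.
    return U.reshape(Ht, Wt, Dt).max(axis=2)

# With the identity transform, the silhouette is an axis-aligned projection.
V = np.zeros((8, 8, 8))
V[2:6, 2:6, 2:6] = 1.0
S = perspective_transform_silhouette(V, np.eye(4), (8, 8, 8))
assert S.shape == (8, 8)
```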
As shown in [23], a disentangled representation of 2D synthetic images can be learned from consecutive rotations with a recurrent network; we therefore pre-train the encoder of our network using a similar curriculum strategy so that the latent representation contains only 3D view-invariant identity information of the object. Once we obtain an encoder network that recognizes the identity of single-view images, we learn the volume generator regularized by the perspective transformer networks. To encourage the volume decoder to learn a consistent 3D volume from different viewpoints, we include the projections from neighboring viewpoints in each mini-batch so that the network has sufficient information to reconstruct the 3D shape.

4 Experiments

ShapeNetCore. This dataset contains about 51,300 unique 3D models from 55 common object categories [1]. Each 3D model is rendered from 24 azimuth angles (in steps of 15°) at a fixed elevation angle (30°) under the same camera and lighting setup. We then crop and rescale the center region of each image to 64 × 64 × 3 pixels. For each ground-truth 3D shape, we create a volume of 32 × 32 × 32 voxels from its canonical orientation (0°).

Network Architecture. As shown in Figure 2, our encoder-decoder network has three components: a 2D convolutional encoder, a 3D up-convolutional decoder and a perspective transformer network. The 2D convolutional encoder consists of 3 convolution layers, followed by 3 fully-connected layers (the convolution layers have 64, 128 and 256 channels with a fixed filter size of 5 × 5; the three fully-connected layers have 1024, 1024 and 512 neurons, respectively).
Figure 2: Illustration of network architecture.

The 3D convolutional decoder consists of one fully-connected layer, followed by 3 convolution layers (the fully-connected layer has 3 × 3 × 3 × 512 neurons; the convolution layers have 256, 96 and 1 channels with filter sizes of 4 × 4 × 4, 5 × 5 × 5 and 6 × 6 × 6). For the perspective transformer network, we use a perspective transformation to project the 3D volume to a 2D silhouette, where the transformation matrix is parametrized by 16 variables and the sampling grid is set to 32 × 32 × 32. We use the same network architecture for all the experiments.

Implementation Details. We used the ADAM [7] solver for stochastic optimization in all the experiments. During the pre-training stage (for the encoder), we used mini-batches of size 32, 32, 8, 4, 3 and 2 for training RNN-1, RNN-2, RNN-4, RNN-8, RNN-12 and RNN-16, as in Yang et al. [23]. We used a learning rate of 10^-4 for RNN-1 and 10^-5 for the rest of the recurrent neural networks. During the fine-tuning stage (for the volume decoder), we used a mini-batch of size 6 and a learning rate of 10^-4. For each object in a mini-batch, we include projections from all 24 views as supervision. The models, including the perspective transformer nets, are implemented using Torch [3]. To download the code, please refer to the project webpage: http://goo.gl/YEJ2H6.

Experimental Design. As mentioned in the formulation, there are several variants of the model depending on the hyper-parameters of the learning objectives λproj and λvol.
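The 24-view supervision above requires one transformation matrix per viewpoint. A sketch of building the corresponding rotations for the rendering setup (24 azimuths in 15° steps at a fixed 30° elevation); the azimuth-then-elevation composition below is our assumption, since the paper only states that Θ is parametrized by the viewpoint α and supplied as input:

```python
import numpy as np

def view_rotation(azimuth_deg, elevation_deg):
    """World-to-camera rotation for a camera orbiting the object.
    The composition convention (rotate about y by azimuth, then about x by
    elevation) is an assumption for illustration only."""
    a = np.deg2rad(azimuth_deg)
    e = np.deg2rad(elevation_deg)
    Ry = np.array([[np.cos(a), 0, np.sin(a)],
                   [0, 1, 0],
                   [-np.sin(a), 0, np.cos(a)]])
    Rx = np.array([[1, 0, 0],
                   [0, np.cos(e), -np.sin(e)],
                   [0, np.sin(e), np.cos(e)]])
    return Rx @ Ry

# The rendering setup: 24 azimuths in 15-degree steps, fixed 30-degree elevation.
views = [view_rotation(az, 30.0) for az in range(0, 360, 15)]
assert len(views) == 24
# Each rotation is orthonormal: R R^T = I.
assert np.allclose(views[0] @ views[0].T, np.eye(3))
```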
In what follows, we denote the model trained with the projection loss only, the volume loss only, and the combined loss as PTN-Proj (PR), CNN-Vol (VO), and PTN-Comb (CO), respectively.

In the experiments, we address the following questions: (1) Does the model trained with the combined loss achieve better single-view 3D reconstruction performance than the model trained with the volume loss only (PTN-Comb vs. CNN-Vol)? (2) What is the performance gap between the models with and without ground-truth volumes (PTN-Comb vs. PTN-Proj)? (3) How well do the three models generalize to instances from unseen categories that are not present in the training set? To answer these questions, we trained the three models under two experimental settings: single category and multiple categories.

4.1 Training on single category

We select the chair category as the training set for the single-category experiment. For model comparison, we first conduct quantitative evaluations on the 3D volumes generated from the test-set single-view images. For each instance in the test set, we generate one volume per view image (24 volumes in total). Given a pair of ground-truth volume and generated volume (binarization threshold 0.5), we compute the intersection-over-union (IU) score, and the average IU score is calculated over the 24 volumes of all instances in the test set. In addition, we provide a baseline method based on nearest-neighbor (NN) search. Specifically, for each test image we extract the VGG fc6 feature (a 4096-dim vector) [17] and retrieve the nearest training example using Euclidean distance in the feature space. The ground-truth 3D volume corresponding to the nearest training example is regarded as the retrieval result.

Table 1: Prediction IU using the models trained on the chair category.
Below, "chair" corresponds to the setting where each object is observable from full azimuth angles, while "chair-N" corresponds to the setting where each object is only observable from a narrow range (subset) of azimuth angles.

Method / Evaluation Set                  chair (training / test)   chair-N (training / test)
PTN-Proj:single (no vol. supervision)    0.5712 / 0.5027           0.4882 / 0.4583
PTN-Comb:single (vol. supervision)       0.6435 / 0.5067           0.5564 / 0.4429
CNN-Vol:single (vol. supervision)        0.6390 / 0.4983           0.5518 / 0.4380
NN search (vol. supervision)             — / 0.3557                — / 0.3073

Figure 3: Single-class results. GT: ground truth, PR: PTN-Proj, CO: PTN-Comb, VO: CNN-Vol (best viewed in the digital version; zoom in for the 3D shape details). The angles are shown in parentheses. Please also see more examples and video animations on the project webpage.

As shown in Table 1, the model trained without volume supervision (projection loss) performs as well as the model trained with volume supervision (volume loss) on the chair category (test set). In addition to the comparison of overall IU, we measured the view-dependent IU for each model. As shown in Figure 4, the average prediction error (mean IU) changes as we gradually move from the first view to the last view (15° to 360°). For visual comparison, we provide a side-by-side analysis of the three models we trained. As shown in Figure 3, each row shows an independent comparison. The first column is the 2D image used as input to the model.
The second and third columns show the ground-truth 3D volume (the same volume rendered from two views for better visualization). Similarly, we list the model trained with projection loss only (PTN-Proj), combined loss (PTN-Comb) and volume loss only (CNN-Vol) from the fourth to the ninth column. The volumes predicted by PTN-Proj and PTN-Comb faithfully represent the shape. However, the volumes predicted by CNN-Vol do not form a solid chair shape in some cases.

Figure 4: View-dependent IU. For illustration, images of a sample chair with the corresponding azimuth angles are shown below the curves. For example, 3D reconstruction from 0° is more difficult than from 30° due to self-occlusion.

Table 2: Prediction IU using the models trained on large-scale datasets.

Test Category    airplane  bench   dresser  car     chair   display  lamp
PTN-Proj:multi   0.5556    0.4924  0.6823   0.7123  0.4494  0.5395   0.4223
PTN-Comb:multi   0.5836    0.5079  0.7109   0.7381  0.4702  0.5473   0.4158
CNN-Vol:multi    0.5747    0.5142  0.6975   0.7348  0.4451  0.5390   0.3865
NN search        0.5564    0.4875  0.5713   0.6519  0.3512  0.3958   0.2905

Test Category    loudspeaker  rifle   sofa    table   telephone  vessel
PTN-Proj:multi   0.5868       0.5987  0.6221  0.4938  0.7504     0.5507
PTN-Comb:multi   0.5675       0.6097  0.6534  0.5146  0.7728     0.5399
CNN-Vol:multi    0.5478       0.6031  0.6467  0.5136  0.7692     0.5445
NN search        0.4600       0.5133  0.5314  0.3097  0.6696     0.4078

Figure 5: Multiclass results. GT: ground truth, PR: PTN-Proj, CO: PTN-Comb, VO: CNN-Vol (best viewed in the digital version; zoom in for the 3D shape details).
The angles are shown in parentheses. Please also see more examples and video animations on the project webpage.

Training with partial views. We also conduct control experiments where each object is only observable from a narrow range of azimuth angles (e.g., 8 out of 24 views, such as 0°, 15°, ..., 105°). We include a detailed description in the supplementary materials. As shown in Table 1 (last two columns), the performance of all three models drops slightly, but the conclusion is similar: the proposed network (1) learns better 3D shape with projection regularization and (2) is capable of learning 3D shape from 2D observations only.

4.2 Training on multiple categories

We conducted the multiclass experiment using the same setup as in the single-class experiment. For the multi-category experiment, the training set includes 13 major categories: airplane, bench, dresser, car, chair, display, lamp, loudspeaker, rifle, sofa, table, telephone and vessel. We held out 20% of the instances from each category as test data. As shown in Table 2, the quantitative results demonstrate that (1) the model trained with the combined loss is superior to the volume loss in most cases and (2) the model trained with the projection loss performs as well as the volume/combined loss. From the visualization results shown in Figure 5, all three models predict volumes reasonably well. There are only subtle performance differences in object parts such as airplane wings.

Table 3: Prediction IU in out-of-category tests.

Method / Test Category                   bed     bookshelf  cabinet  motorbike  train
PTN-Proj:single (no vol. supervision)    0.1801  0.1707     0.3937   0.1189     0.1550
PTN-Comb:single (vol. supervision)       0.1507  0.1186     0.2626   0.0643     0.1044
CNN-Vol:single (vol. supervision)        0.1558  0.1183     0.2588   0.0580     0.0956
PTN-Proj:multi (no vol. supervision)     0.1944  0.3448     0.6484   0.3216     0.3670
PTN-Comb:multi (vol. supervision)        0.1647  0.3195     0.5257   0.1914     0.3744
CNN-Vol:multi (vol. supervision)         0.1586  0.3037     0.4977   0.2253     0.3740

Figure 6: Out-of-category results. GT: ground truth, PR: PTN-Proj, CO: PTN-Comb, VO: CNN-Vol (best viewed in the digital version; zoom in for the 3D shape details). The angles are shown in parentheses. Please also see more examples and video animations on the project webpage.

4.3 Out-of-Category Tests

Ideally, an intelligent agent should have the ability to generalize the knowledge learned from previously seen categories to unseen categories. To this end, we design out-of-category tests for the models trained on a single category and on multiple categories, as described in Section 4.1 and Section 4.2, respectively. We select 5 unseen categories from ShapeNetCore for the out-of-category tests: bed, bookshelf, cabinet, motorbike and train. Here, the two categories cabinet and train are relatively easier than the others, since there might be instances in the training set with similar shapes (e.g., dresser, vessel and airplane), while bed, bookshelf and motorbike can be considered completely novel categories in terms of shape.

We summarize the quantitative results in Table 3. Surprisingly, the model trained on multiple categories still achieves reasonably good overall IU. As shown in Figure 6, the proposed projection loss generalizes better than the models trained using the combined loss or the volume loss on train, motorbike and cabinet.
The observations from the out-of-category tests suggest that (1) generalization from a single category is very challenging, but training on multiple categories significantly improves generalization, and (2) the projection regularization can help learn a robust representation that generalizes better to unseen categories.

5 Conclusions

In this paper, we investigate the problem of single-view 3D shape reconstruction from a learning agent's perspective. By formulating the learning procedure as an interaction between the 3D shape and the 2D observation, we propose an encoder-decoder network that takes advantage of the projection transformation as regularization. Experimental results demonstrate (1) excellent performance of the proposed model in reconstructing objects even without ground-truth 3D volumes as supervision and (2) the generalization potential of the proposed model to unseen categories.

Acknowledgments

This work was supported in part by NSF CAREER IIS-1453651, ONR N00014-13-1-0762, a Sloan Research Fellowship, and a gift from Adobe. We acknowledge NVIDIA for the donation of GPUs. We also thank Yuting Zhang, Scott Reed, Junhyuk Oh, Ruben Villegas, Seunghoon Hong, Wenling Shang, Kibok Lee, Lajanugen Logeswaran, Rui Zhang and Yi Zhang for helpful comments and discussions.

References

[1] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al. ShapeNet: An information-rich 3D model repository. arXiv preprint arXiv:1512.03012, 2015.

[2] C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese. 3D-R2N2: A unified approach for single and multi-view 3D object reconstruction. In ECCV, 2016.

[3] R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A Matlab-like environment for machine learning. In BigLearn, NIPS Workshop, 2011.

[4] R. Girdhar, D. F. Fouhey, M. Rodriguez, and A. Gupta.
Learning a predictable and generative vector\n\nrepresentation for objects. arXiv preprint arXiv:1603.08637, 2016.\n\n[5] G. E. Hinton, A. Krizhevsky, and S. D. Wang. Transforming auto-encoders. In ICANN. Springer, 2011.\n[6] M. Jaderberg, K. Simonyan, A. Zisserman, et al. Spatial transformer networks. In NIPS, 2015.\n[7] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.\n[8] D. P. Kingma and M. Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.\n[9] T. D. Kulkarni, W. Whitney, P. Kohli, and J. B. Tenenbaum. Deep convolutional inverse graphics network.\n\nIn NIPS, 2015.\n\n[10] K. N. Kutulakos and S. M. Seitz. A theory of shape by space carving. International Journal of Computer\n\nVision, 38(3):199\u2013218, 2000.\n\n[11] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng. Convolutional deep belief networks for scalable\n\nunsupervised learning of hierarchical representations. In ICML, 2009.\n\n[12] R. Memisevic and G. Hinton. Unsupervised learning of image transformations. In CVPR, 2007.\n[13] V. Michalski, R. Memisevic, and K. Konda. Modeling deep temporal dependencies with recurrent grammar\n\ncells\"\". In NIPS, 2014.\n\n[14] C. R. Qi, H. Su, M. Niessner, A. Dai, M. Yan, and L. J. Guibas. Volumetric and multi-view cnns for object\n\nclassi\ufb01cation on 3d data. In CVPR, 2016.\n\n[15] S. Reed, K. Sohn, Y. Zhang, and H. Lee. Learning to disentangle factors of variation with manifold\n\ninteraction. In ICML, 2014.\n\n[16] D. J. Rezende, S. Eslami, S. Mohamed, P. Battaglia, M. Jaderberg, and N. Heess. Unsupervised learning of\n\n3d structure from images. In NIPS, 2016.\n\n[17] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition.\n\narXiv preprint arXiv:1409.1556, 2014.\n\n[18] H. Su, S. Maji, E. Kalogerakis, and E. Learned-Miller. Multi-view convolutional neural networks for 3d\n\nshape recognition. In ICCV, 2015.\n\n[19] R. Szeliski. 
Computer vision: algorithms and applications. Springer Science & Business Media, 2010.\n[20] M. Tatarchenko, A. Dosovitskiy, and T. Brox. Single-view to multi-view: Reconstructing unseen views\n\nwith a convolutional network. In ECCV, 2016.\n\n[21] J. Wu, T. Xue, J. J. Lim, Y. Tian, J. B. Tenenbaum, A. Torralba, and W. T. Freeman. Single image 3d\n\ninterpreter network. In ECCV, 2016.\n\n[22] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao. 3d shapenets: A deep representation for\n\nvolumetric shapes. In CVPR, 2015.\n\n[23] J. Yang, S. E. Reed, M.-H. Yang, and H. Lee. Weakly-supervised disentangling with recurrent transforma-\n\ntions for 3d view synthesis. In NIPS, 2015.\n\n[24] E. Yumer and N. J. Mitra. Learning semantic deformation \ufb02ows with 3d convolutional networks. In ECCV,\n\n2016.\n\n9\n\n\f", "award": [], "sourceid": 935, "authors": [{"given_name": "Xinchen", "family_name": "Yan", "institution": "University of Michigan"}, {"given_name": "Jimei", "family_name": "Yang", "institution": "Adobe Research"}, {"given_name": "Ersin", "family_name": "Yumer", "institution": "Adobe Research"}, {"given_name": "Yijie", "family_name": "Guo", "institution": "University of Michigan"}, {"given_name": "Honglak", "family_name": "Lee", "institution": "University of Michigan"}]}