{"title": "Learning to Reconstruct Shapes from Unseen Classes", "book": "Advances in Neural Information Processing Systems", "page_first": 2257, "page_last": 2268, "abstract": "From a single image, humans are able to perceive the full 3D shape of an object by exploiting learned shape priors from everyday life. Contemporary single-image 3D reconstruction algorithms aim to solve this task in a similar fashion, but often end up with priors that are highly biased by training classes. Here we present an algorithm, Generalizable Reconstruction (GenRe), designed to capture more generic, class-agnostic shape priors. We achieve this with an inference network and training procedure that combine 2.5D representations of visible surfaces (depth and silhouette), spherical shape representations of both visible and non-visible surfaces, and 3D voxel-based representations, in a principled manner that exploits the causal structure of how 3D shapes give rise to 2D images. Experiments demonstrate that GenRe performs well on single-view shape reconstruction, and generalizes to diverse novel objects from categories not seen during training.", "full_text": "Learning to Reconstruct Shapes from Unseen Classes\n\nXiuming Zhang\u2217\n\nMIT CSAIL\n\nZhoutong Zhang\u2217\n\nMIT CSAIL\n\nChengkai Zhang\n\nMIT CSAIL\n\nJoshua B. Tenenbaum\n\nMIT CSAIL\n\nWilliam T. Freeman\n\nMIT CSAIL, Google Research\n\nJiajun Wu\nMIT CSAIL\n\nAbstract\n\nFrom a single image, humans are able to perceive the full 3D shape of an object by\nexploiting learned shape priors from everyday life. Contemporary single-image\n3D reconstruction algorithms aim to solve this task in a similar fashion, but often\nend up with priors that are highly biased by training classes. Here we present\nan algorithm, Generalizable Reconstruction (GenRe), designed to capture more\ngeneric, class-agnostic shape priors. 
We achieve this with an inference network and\ntraining procedure that combine 2.5D representations of visible surfaces (depth and\nsilhouette), spherical shape representations of both visible and non-visible surfaces,\nand 3D voxel-based representations, in a principled manner that exploits the causal\nstructure of how 3D shapes give rise to 2D images. Experiments demonstrate\nthat GenRe performs well on single-view shape reconstruction, and generalizes to\ndiverse novel objects from categories not seen during training.\n\n1 Introduction\n\nHumans can imagine an object\u2019s full 3D shape from just a single image, showing only a fraction of\nthe object\u2019s surface. This applies to common objects such as chairs, but also to novel objects that\nwe have never seen before. Vision researchers have long argued that the key to this ability may be a\nsophisticated hierarchy of representations, extending from images through surfaces to volumetric\nshape, which process different aspects of shape in different representational formats [Marr, 1982].\nHere we explore how these ideas can be integrated into state-of-the-art computer vision systems for\n3D shape reconstruction.\nRecently, computer vision and machine learning researchers have made impressive progress on\nsingle-image 3D reconstruction by learning a parametric function f2D\u21923D, implemented as a deep\nneural network, that maps a 2D image to its corresponding 3D shape. Essentially, f2D\u21923D encodes\nshape priors (\u201cwhat realistic shapes look like\u201d), often learned from large shape repositories such as\nShapeNet [Chang et al., 2015]. Because the problem is well-known to be ill-posed\u2014there exist many\n3D explanations for any 2D visual observation\u2014modern systems have explored incorporating various\nstructures into this learning process. 
For example, MarrNet [Wu et al., 2017] uses intrinsic images or\n2.5D sketches [Marr, 1982] as an intermediate representation, and concatenates two learned mappings\nfor shape reconstruction: f2D\u21923D = f2.5D\u21923D \u25e6 f2D\u21922.5D.\nMany existing methods, however, ignore the fact that mapping a 2D image or a 2.5D sketch to a 3D\nshape involves complex, but deterministic geometric projections. Simply using a neural network to\napproximate this projection, instead of modeling this mapping explicitly, leads to inference models\nthat are overparametrized (and hence subject to over\ufb01tting training classes). It also misses valuable\ninductive biases that can be wired in through such projections. Both of these factors contribute to\npoor generalization to unseen classes.\n\n\u2217 indicates equal contribution. Project page: http://genre.csail.mit.edu\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fFigure 1: We study the task of generalizable single-image 3D reconstruction, aiming to reconstruct\nthe 3D shape of an object outside training classes. Here we show a table and a bed reconstructed\nfrom single RGB images by our model trained on cars, chairs, and airplanes. Our model learns to\nreconstruct objects outside the training classes.\n\nHere we propose to disentangle geometric projections from shape reconstruction to better generalize\nto unseen shape categories. Building upon the MarrNet framework [Wu et al., 2017], we further\ndecompose f2.5D\u21923D into a deterministic geometric projection p from 2.5D to a partial 3D model,\nand a learnable completion c of the 3D model. A straightforward version of this idea would be to\nperform shape completion in the 3D voxel grid: f2.5D\u21923D = c3D\u21923D \u25e6 p2.5D\u21923D. 
However, shape\ncompletion in 3D is challenging, as the manifold of plausible shapes is sparser in 3D than in 2D, and\nempirically this fails to reconstruct shapes well.\nInstead we perform completion based on spherical maps. Spherical maps are surface representations\nde\ufb01ned on the UV coordinates of a unit sphere, where the value at each coordinate is calculated as the\nminimal distance travelled from this point to the 3D object surface along the sphere\u2019s radius. Such a\nrepresentation combines appealing features of 2D and 3D: spherical maps are a form of 2D images,\non which neural inpainting models work well; but they have a semantics that allows them to be\nprojected into 3D to recover full shape geometry. They essentially allow us to complete non-visible\nobject surfaces from visible ones, as a further intermediate step to full 3D reconstruction. We now\nhave f2.5D\u21923D = pS\u21923D \u25e6 cS\u2192S \u25e6 p2.5D\u2192S, where S stands for spherical maps.\nOur full model, named Generalizable Reconstruction (GenRe), thus comprises three cascaded,\nlearnable modules connected by \ufb01xed geometric projections. First, a single-view depth estimator\npredicts depth from a 2D image (f2D\u21922.5D); the depth map is then projected into a spherical map\n(p2.5D\u2192S). Second, a spherical map inpainting network inpaints the partial spherical map (cS\u2192S);\nthe inpainted spherical map is then projected into 3D voxels (pS\u21923D). Finally, we introduce an\nadditional voxel re\ufb01nement network to re\ufb01ne the estimated 3D shape in voxel space. Our neural\nmodules only have to model object geometry for reconstruction, without having to learn geometric\nprojections. 
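The spherical representation just described can be sketched numerically. The following is a minimal, illustrative NumPy version that bins a surface point cloud (assumed already centered and scaled to fit inside the unit sphere) into a latitude-longitude grid and records, per UV cell, the distance travelled from the unit sphere toward the center before hitting the outermost surface point in that direction; the function name, the grid resolutions, and the nearest-bin approximation are our assumptions, not the paper's ray-casting implementation:

```python
import numpy as np

def spherical_map(points, n_lat=64, n_lon=128):
    """Bin a 3D point cloud (inside the unit sphere) into a UV spherical map.

    Each cell stores 1 minus the largest point radius falling into it, i.e.,
    the minimal distance travelled from the unit sphere toward the center
    before reaching the outermost surface point in that direction.
    Empty cells stay at 0 (treated as background in this sketch).
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1)
    lat = np.arccos(np.clip(z / np.maximum(r, 1e-8), -1.0, 1.0))  # [0, pi]
    lon = np.arctan2(y, x) % (2 * np.pi)                          # [0, 2*pi)
    u = np.minimum((lat / np.pi * n_lat).astype(int), n_lat - 1)
    v = np.minimum((lon / (2 * np.pi) * n_lon).astype(int), n_lon - 1)
    smap = np.zeros((n_lat, n_lon))
    np.maximum.at(smap, (u, v), r)      # keep the outermost point per cell
    hit = smap > 0
    smap[hit] = 1.0 - smap[hit]         # distance from the sphere, not center
    return smap
```

A point at radius 0.8 on the z-axis, for instance, lands in the top-latitude cell with value 0.2 (it is 0.2 away from the unit sphere along that radius).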
This enhances generalizability, along with several other factors: during training, our\nmodularized design forces each module of the network to use features from the previous module,\ninstead of directly memorizing shapes from the training classes; also, each module only predicts\noutputs that are in the same domain as its inputs (image-based or voxel-based), which leads to more\nregular mappings.\nOur GenRe model achieves state-of-the-art performance on reconstructing shapes both within and\noutside training classes. Figure 1 shows examples of our model reconstructing a table and a bed from\nsingle images, after training only on cars, chairs, and airplanes. We also present detailed analyses of\nhow each component contributes to the \ufb01nal prediction.\nThis paper makes three contributions. First, we emphasize the task of generalizable single-image 3D\nshape reconstruction. Second, we propose to disentangle geometric projections from shape recon-\nstruction, and include spherical maps with differentiable, deterministic projections in an integrated\nneural model. Third, we demonstrate that the resulting model achieves state-of-the-art performance\non single-image 3D shape reconstruction for objects within and outside training classes.\n2 Related Work\nSingle-image 3D reconstruction. The problem of recovering the object shape from a single image\nis challenging, as it requires both powerful recognition systems and prior knowledge of plausible\n3D shapes. Large CAD model repositories [Chang et al., 2015] and deep networks have contributed\nto the signi\ufb01cant progress in recent years, mostly with voxel representations [Choy et al., 2016,\nGirdhar et al., 2016, H\u00e4ne et al., 2017, Kar et al., 2015, Novotny et al., 2017, Rezende et al., 2016,\nTatarchenko et al., 2016, Tulsiani et al., 2017, Wu et al., 2016, 2017, 2018, Zhu et al., 2018, Yan\net al., 2016]. 
Apart from voxels, some researchers have also studied reconstructing objects in point\nclouds [Fan et al., 2017] or octrees [Riegler et al., 2017, Tatarchenko et al., 2017].\n\nFigure 2: Our model for generalizable single-image 3D reconstruction (GenRe) has three components:\n(a) a depth estimator that predicts depth in the original view from a single RGB image, (b) a spherical\ninpainting network that inpaints a partial, single-view spherical map, and (c) a voxel re\ufb01nement\nnetwork that integrates two backprojected 3D shapes (from the inpainted spherical map and from\ndepth) to produce the \ufb01nal output.\n\nThe shape priors learned in these approaches, however, are in general only applicable to their training classes,\nwith very limited generalization power for reconstructing shapes from unseen categories. In contrast,\nour system exploits 2.5D sketches and spherical representations for better generalization to objects\noutside training classes.\nSpherical projections. Spherical projections have been shown effective in 3D shape retrieval [Es-\nteves et al., 2018], classi\ufb01cation [Cao et al., 2017], and \ufb01nding possible rotational as well as re\ufb02ective\nsymmetries [Kazhdan et al., 2004, 2002]. Recent papers [Cohen et al., 2018, 2017] have studied\ndifferentiable, spherical convolution on spherical projections, aiming to preserve rotational equivari-\nance within a neural network. These designs, however, perform convolution in the spectral domain\nwith limited frequency bands, causing aliasing and loss of high-frequency information. In particular,\nconvolution in the spectral domain is not suitable for shape reconstruction, since the reconstruction\nquality highly depends on the high-frequency components. In addition, the ringing effects caused by\naliasing would introduce undesired artifacts.\n2.5D sketch recovery. 
The origin of intrinsic image estimation dates back to the early years\nof computer vision [Barrow and Tenenbaum, 1978]. Over the years, researchers have explored\nrecovering 2.5D sketches from texture, shading, or color images [Barron and Malik, 2015, Bell et al.,\n2014, Horn and Brooks, 1989, Tappen et al., 2003, Weiss, 2001, Zhang et al., 1999]. As handy depth\nsensors have matured [Izadi et al., 2011], and larger-scale RGB-D datasets have become available [McCormac\net al., 2017, Silberman et al., 2012, Song et al., 2017], many papers have started to estimate depth [Chen et al.,\n2016, Eigen and Fergus, 2015], surface normals [Bansal and Russell, 2016, Wang et al., 2015], and\nother intrinsic images [Janner et al., 2017, Shi et al., 2017] with deep networks. Our method employs\n2.5D estimation as a component, but focuses on reconstructing shapes from unseen categories.\nZero- and few-shot recognition. In computer vision, abundant attempts have been made to tackle\nthe problem of few-shot recognition. We refer readers to the review article [Xian et al., 2017] for a\ncomprehensive list. A number of earlier papers have explored sharing features across categories to\nrecognize new objects from a few examples [Bart and Ullman, 2005, Farhadi et al., 2009, Lampert\net al., 2009, Torralba et al., 2007]. More recently, many researchers have begun to study zero- or\nfew-shot recognition with deep networks [Akata et al., 2016, Antol et al., 2014, Hariharan and\nGirshick, 2017, Wang et al., 2017, Wang and Hebert, 2016]. Notably, Peng et al. [2015] explored\nthe idea of learning to recognize novel 3D models via domain adaptation.\nWhile these proposed methods are for recognizing and categorizing images or shapes, in this paper\nwe explore reconstructing the 3D shape of an object from unseen classes. This problem has received\nlittle attention in the past, possibly due to its considerable dif\ufb01culty. 
A few imaging systems have\nattempted to recover 3D shape from a single shot by making use of special cameras [Proesmans\net al., 1996, Sagawa et al., 2011]. Unlike them, we study 3D reconstruction from a single RGB\nimage. Very recently, researchers have begun to look at the generalization power of 3D reconstruction\nalgorithms [Shin et al., 2018, Jayaraman et al., 2018, Rock et al., 2015, Funk and Liu, 2017]. Here\nwe present a novel approach that makes use of spherical representations for better generalization.\n3 Approach\nSingle-image reconstruction algorithms learn a parametric function f2D\u21923D that maps a 2D image to\na 3D shape. We tackle the problem of generalization by regularizing f2D\u21923D. The key regularization\nwe impose is to factorize f2D\u21923D into geometric projections and learnable reconstruction modules.\n\nFigure 3: Examples of our spherical inpainting module generalizing to new classes. Trained on chairs,\ncars, and planes, the module completes the partially visible leg of the table (red boxes) and the unseen\ncabinet bottom (purple boxes) from partial spherical maps projected from ground-truth depth.\n\nOur GenRe model consists of three learnable modules, connected by geometric projections as shown\nin Figure 2. The \ufb01rst module is a single-view depth estimator f2D\u21922.5D (Figure 2a), which takes a color\nimage as input and estimates its depth map. As the depth map can be interpreted as the visible surface\nof the object, the reconstruction problem becomes predicting the object\u2019s complete surface given this\npartial estimate.\nAs 3D surfaces are hard to parametrize ef\ufb01ciently, we use spherical maps as a surrogate representation.\nA geometric projection module (p2.5D\u2192S) converts the estimated depth map into a spherical map,\nreferred to as the partial spherical map. 
It is then passed to the spherical map inpainting network\n(cS\u2192S, Figure 2b) to predict an inpainted spherical map, representing the object\u2019s complete surface.\nAnother projection module (pS\u21923D) projects the inpainted spherical map back to the voxel space.\nAs spherical maps only capture the outermost surface towards the sphere, they cannot handle self-\nocclusion along the sphere\u2019s radius. We use a voxel re\ufb01nement module (Figure 2c) to tackle this\nproblem. It takes two 3D shapes as input, one projected from the inpainted spherical map and the\nother from the estimated depth map, and outputs a \ufb01nal 3D shape.\n3.1 Single-View Depth Estimator\nThe \ufb01rst component of our network predicts a depth map from an image with a clean background.\nUsing depth as an intermediate representation facilitates the reconstruction process by distilling\nessential geometric information from the input image [Wu et al., 2017].\nFurther, depth estimation is a class-agnostic task: shapes from different classes often share common\ngeometric structure, despite distinct visual appearances. Take beds and cabinets as examples. Al-\nthough they are of different anatomy in general, both have perpendicular planes and hence similar\npatches in their depth images. We demonstrate this both qualitatively and quantitatively in Section 4.4.\n3.2 Spherical Map Inpainting Network\nWith spherical maps, we cast the problem of 3D surface completion into 2D spherical map inpainting.\nEmpirically we observe that networks trained to inpaint spherical maps generalize well to new shape\nclasses (Figure 3). 
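Because a spherical map is just a 2D array indexed by latitude and longitude, the completion step can reuse standard image-to-image machinery. One detail worth making concrete is the longitude-wrapping (periodic) padding the paper applies before convolution (see Section 3.4); a minimal NumPy sketch, with the function name and the choice to leave latitude unpadded being our assumptions:

```python
import numpy as np

def pad_periodic_longitude(smap, pad):
    """Circularly pad a (latitude, longitude) spherical map along longitude.

    Column 0 and column -1 are neighbors on the sphere, so wrapping the map
    before an ordinary 2D convolution lets a standard CNN respect that
    periodicity. Latitude is left unpadded in this sketch.
    """
    return np.concatenate([smap[:, -pad:], smap, smap[:, :pad]], axis=1)
```

The same effect can be had with `np.pad(smap, ((0, 0), (pad, pad)), mode="wrap")`; the explicit concatenation above just makes the wrap-around visible.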
Also, compared with voxels, spherical maps are more ef\ufb01cient to process, as 3D\nsurfaces are sparse in nature; quantitatively, as we demonstrate in Section 4.5 and Section 4.6, using\nspherical maps results in better performance.\nAs spherical maps are signals on the unit sphere, it is tempting to use network architectures based\non spherical convolution [Cohen et al., 2018]. They are however not suitable for our task of shape\nreconstruction. This is because spherical convolution is conducted in the spectral domain. Every\nconversion to and from the spectral domain requires capping the maximum frequency, causing extra\naliasing and information loss. For tasks such as recognition, the information loss may be negligible\ncompared with the advantage of rotational invariance offered by spherical convolution. But for\nreconstruction, the loss leads to blurred output with only low-frequency components. We empirically\n\ufb01nd that standard convolution works much better than spherical convolution under our setup.\n3.3 Voxel Re\ufb01nement Network\nAlthough an inpainted spherical map provides a projection of an object\u2019s surface onto the unit sphere,\nthe surface information is lost when self-occlusion occurs. We use a re\ufb01nement network that operates\nin the voxel space to recover the lost information. This module takes two voxelized shapes as input,\none projected from the estimated depth map and the other from the inpainted spherical map, and\npredicts the \ufb01nal shape. As the occluded regions can be recovered from local neighboring regions,\nthis network only needs to capture local shape priors and is therefore class-agnostic. As shown in\nthe experiments, when provided with ground-truth depth and spherical maps, this module performs\nconsistently well across training and unseen classes.\n3.4 Technical Details\nSingle-view depth estimator. Following Wu et al. 
[2017], we use an encoder-decoder network\nfor depth estimation. Our encoder is a ResNet-18 [He et al., 2015], encoding a 256\u00d7256 RGB\nimage into 512 feature maps of size 1\u00d71. The decoder is a mirrored version of the encoder,\nreplacing all convolution layers with transposed convolution layers. In addition, we adopt the U-Net\nstructure [Ronneberger et al., 2015] and feed the intermediate outputs of each block of the encoder to\nthe corresponding block of the decoder. The decoder outputs the depth map in the original view at\nthe resolution of 256\u00d7256. We use an \u21132 loss between predicted and target images.\nSpherical map inpainting network. The spherical map inpainting network has a similar architecture\nto the single-view depth estimator. To reduce the gap between standard and spherical\nconvolutions, we apply periodic padding to both inputs and training targets in the longitude dimension,\nmaking the network aware of the periodic nature of spherical maps.\nVoxel re\ufb01nement network. Our voxel re\ufb01nement network takes as input voxels projected from the\nestimated, original-view depth and from the inpainted spherical map, and recovers the \ufb01nal shape in\nvoxel space. Speci\ufb01cally, the encoder takes as input a two-channel 128\u00d7128\u00d7128 voxel grid (one for\ncoarse shape estimation and the other for surface estimation), and outputs a 320-D latent vector. In\ndecoding, each layer takes an extra input directly from the corresponding level of the encoder.\nGeometric projections. We make use of three geometric projections: a depth to spherical map\nprojection, a depth map to voxel projection, and a spherical map to voxel projection. For the depth\nto spherical map projection, we \ufb01rst convert depth into 3D point clouds using camera parameters,\nand then turn them into surfaces with the marching cubes algorithm [Lewiner et al., 2003]. 
Then, the\nspherical representation is generated by casting rays from each UV coordinate on the unit sphere to\nthe sphere\u2019s center. This process is not differentiable. To project depth or spherical maps into voxels,\nwe \ufb01rst convert them into 3D point clouds. Then, a grid of voxels is initialized, where the value of\neach voxel is determined by the average distance from all the points inside it to its center. Then,\nfor each voxel that contains points, we negate its value and add 1. This projection process is\nfully differentiable.\nTraining. We train our network with viewer-centered 3D supervision, where the 3D shape is rotated\nto match the object\u2019s pose in the input image. This is in contrast to object-centered approaches,\nwhere the 3D supervision is always in a prede\ufb01ned pose regardless of the object\u2019s pose in the input\nimage. Object-centered approaches are less suitable for reconstructing shapes from new categories,\nas prede\ufb01ned poses are unlikely to generalize across categories.\nWe \ufb01rst train the 2.5D sketch estimator with RGB images and their corresponding depth images, all\nrendered with ShapeNet [Chang et al., 2015] objects (see Section 4.2 and the supplemental material\nfor details). We then train the spherical map inpainting network with single-view (partial) spherical\nmaps and the ground-truth full spherical maps as supervision. Finally, we train the voxel re\ufb01nement\nnetwork on coarse shapes predicted by the inpainting network as well as 3D surfaces backprojected\nfrom the estimated 2.5D sketches, with the corresponding ground-truth shapes as supervision. We\nthen jointly \ufb01ne-tune the spherical inpainting module and the voxel re\ufb01nement module with both 3D\nshape and 2D spherical map supervision.\n4 Experiments\n4.1 Baselines\nWe organize baselines based on the shape representation they use.\nVoxels. 
Voxels are arguably the most common representation for 3D shapes in the deep learning era\ndue to their amenability to 3D convolution. For this representation, we consider DRC [Tulsiani et al.,\n2017] and MarrNet [Wu et al., 2017] as baselines. Our model uses 128\u00d7128\u00d7128 voxels of [0, 1] occupancy.\nMeshes and point clouds. Considering the cubic complexity of the voxel representation, recent\npapers have explored meshes [Groueix et al., 2018, Yao et al., 2018] and point clouds [Fan et al.,\n2017] in the context of neural networks. In this work, we consider AtlasNet [Groueix et al., 2018] as\na baseline.\nMulti-view maps. Another way of representing 3D shapes is to use a set of multi-view depth\nimages [Soltani et al., 2017, Shin et al., 2018, Jayaraman et al., 2018]. We compare with the model\nfrom Shin et al. [2018] in this regime.\nSpherical maps. As introduced in Section 1, one can also represent 3D shapes as spherical maps.\nWe include two baselines with spherical maps: \ufb01rst, a one-step baseline that predicts \ufb01nal spherical\nmaps directly from RGB images (GenRe-1step); second, a two-step baseline that \ufb01rst predicts single-\nview spherical maps from RGB images and then inpaints them (GenRe-2step). Both baselines use\nthe aforementioned U-ResNet image-to-image network architecture.\nTo justify the use of spherical maps, we also include a baseline (3D Completion) that directly\nperforms 3D shape completion in voxel space. This baseline \ufb01rst predicts depth from an input image;\nit then projects the depth map into the voxel space. 
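The point-to-voxel projection described in Section 3.4 (each occupied voxel gets one minus the mean distance from its points to the voxel center) can be sketched as follows. This is an illustrative NumPy version, not the paper's implementation: the unit-cube normalization of the input points, the default resolution, and the function name are our assumptions, and the paper implements the projection differentiably while this sketch only reproduces the values:

```python
import numpy as np

def points_to_voxels(points, res=32):
    """Project a point cloud in [0, 1)^3 into a res^3 occupancy-like grid.

    A voxel containing points gets 1 minus the mean distance from those
    points to the voxel center, so surface points near a center score close
    to 1; empty voxels stay 0.
    """
    grid = np.zeros((res, res, res))
    counts = np.zeros((res, res, res))
    idx = np.clip((points * res).astype(int), 0, res - 1)  # voxel indices
    centers = (idx + 0.5) / res                            # centers in world units
    dists = np.linalg.norm(points - centers, axis=1)       # point-to-center distances
    np.add.at(grid, tuple(idx.T), dists)                   # accumulate distances
    np.add.at(counts, tuple(idx.T), 1)                     # count points per voxel
    occupied = counts > 0
    grid[occupied] = 1.0 - grid[occupied] / counts[occupied]
    return grid
```

A point sitting exactly at a voxel center yields the maximal value 1 for that voxel.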
A completion module takes the projected voxels\nas input and predicts the \ufb01nal result.\nTo provide a performance upper bound for our spherical inpainting and voxel re\ufb01nement networks (b\nand c in Figure 2), we also include the results when our model has access to ground-truth depth in the\noriginal view (GenRe-Oracle) and to ground-truth full spherical maps (GenRe-SphOracle).\n4.2 Data\nWe use ShapeNet [Chang et al., 2015] renderings for network training and testing. Speci\ufb01cally, we\nrender each object in 20 random views. In addition to RGB images, we also render their corresponding\nground-truth depth maps. We use Mitsuba [Jakob, 2010], a physically-based rendering engine, for all\nour renderings. Please see the supplementary material for details on data generation and augmentation.\nFor all models, we train them on the three largest ShapeNet classes (cars, chairs, and airplanes), and\ntest them on the next 10 largest classes: bench, vessel, ri\ufb02e, sofa, table, phone, cabinet, speaker, lamp,\nand display. Besides ShapeNet renderings, we also test these models, trained only on synthetic data,\non real images from Pix3D [Sun et al., 2018], a dataset of real images and the ground-truth shape of\nevery pictured object. In Section 5, we also test our model on non-rigid shapes such as humans and\nhorses [Bronstein et al., 2008] and on highly regular shape primitives.\n4.3 Metrics\nBecause neither depth maps nor spherical maps provide information inside shapes, our model predicts\nonly surface voxels that are not guaranteed watertight. Consequently, intersection over union (IoU)\ncannot be used as an evaluation metric. 
We hence evaluate reconstruction quality using Chamfer\ndistance (CD) [Barrow et al., 1977], de\ufb01ned as\n\nCD(S1, S2) = (1/|S1|) \u03a3_{x\u2208S1} min_{y\u2208S2} \u2016x \u2212 y\u20162 + (1/|S2|) \u03a3_{y\u2208S2} min_{x\u2208S1} \u2016x \u2212 y\u20162,    (1)\n\nwhere S1 and S2 are sets of points sampled from surfaces of the 3D shape pair. For models that\noutput voxels, including DRC and our GenRe model, we sweep voxel thresholds from 0.3 to 0.7 with\na step size of 0.05 for isosurfaces, compute CD with 1,024 points sampled from all isosurfaces, and\nreport the best average CD for each object class.\nShin et al. [2018] report that object-centered supervision produces better reconstructions for objects\nfrom the training classes, whereas viewer-centered supervision is advantaged in generalizing to novel\nclasses. Therefore, for DRC and AtlasNet, we train each network with both types of supervision. Note\nthat AtlasNet, when trained with viewer-centered supervision, tends to produce unstable predictions\nthat render CD meaningless. Hence, we only present CD for the object-centered AtlasNet.\n4.4 Results on Depth Estimation\nWe show qualitative and quantitative results on depth estimation quality across categories. As shown\nin Figure 4, our depth estimator learns effectively the concept of near and far, generalizes well to\nunseen categories, and does not show statistically signi\ufb01cant deterioration as the novel test class gets\nincreasingly dissimilar to the training classes, laying the foundation for the generalization power of\nour approach. Formally, the dissimilarity from test class Ctest to training classes Ctrain is de\ufb01ned as\n(1/|Ctest|) \u03a3_{x\u2208Ctest} min_{y\u2208Ctrain} CD(x, y).\n\nFigure 4: Left: Our single-view depth estimator, trained on cars, chairs, and airplanes, generalizes to\nnovel classes: buses, trains, and tables. 
Right: As the novel test class gets increasingly dissimilar to\nthe training classes (left to right), depth prediction does not show statistically signi\ufb01cant degradation\n(p > 0.05).\n\nModels                             Seen   Bch   Vsl   Rfl   Sfa   Tbl   Phn   Cbn   Spk   Lmp   Dsp   Avg (Unseen)\nObject-Centered:\n  DRC [Tulsiani et al., 2017]     .072  .112  .100  .104  .108  .133  .199  .168  .164  .145  .188  .142\n  AtlasNet [Groueix et al., 2018] .059  .102  .092  .088  .098  .130  .146  .149  .158  .131  .173  .127\nViewer-Centered:\n  DRC [Tulsiani et al., 2017]     .092  .120  .109  .121  .107  .129  .132  .142  .141  .131  .156  .129\n  MarrNet [Wu et al., 2017]       .070  .107  .094  .125  .090  .122  .117  .125  .123  .144  .149  .120\n  Multi-View [Shin et al., 2018]  .065  .092  .092  .102  .085  .105  .110  .119  .117  .142  .142  .111\n  3D Completion                   .076  .102  .099  .121  .095  .109  .122  .131  .126  .138  .141  .118\n  GenRe-1step                     .063  .104  .093  .114  .084  .108  .121  .128  .124  .126  .151  .115\n  GenRe-2step                     .061  .098  .094  .117  .084  .102  .115  .125  .125  .118  .118  .110\n  GenRe (Ours)                    .064  .089  .092  .112  .082  .096  .107  .116  .115  .124  .130  .106\n  GenRe-Oracle                    .045  .050  .048  .031  .059  .057  .054  .076  .077  .060  .060  .057\n  GenRe-SphOracle                 .034  .032  .030  .021  .044  .038  .037  .044  .045  .031  .040  .036\n\nTable 1: Reconstruction errors (in CD) of the training classes (Seen) and 10 novel classes (bench,\nvessel, ri\ufb02e, sofa, table, phone, cabinet, speaker, lamp, and display), ordered from the\nmost to the least similar to the training classes. Our model is viewer-centered by design, but achieves\nperformance on par with the object-centered state of the art (AtlasNet) in reconstructing the seen\nclasses. As for generalization to novel classes, our model outperforms the state of the art across 9 out\nof the 10 classes.\n\n4.5 Reconstructing Novel Objects from Training Classes\nWe present results on generalizing to novel objects from the training classes. All models are trained\non cars, chairs, and airplanes, and tested on unseen objects from the same three categories.\nAs shown in Table 1, our GenRe model is the best-performing viewer-centered model. 
It also\noutperforms most object-centered models except AtlasNet. GenRe\u2019s performance is impressive given\nthat object-centered models tend to perform much better on objects from seen classes [Shin et al.,\n2018]. This is because object-centered models, by exploiting the concept of canonical views, actually\nsolve an easier problem. The performance drop from object-centered DRC to viewer-centered DRC\nsupports this empirically. However, for objects from unseen classes, the concept of canonical views\nis no longer well-de\ufb01ned. As we will see in Section 4.6, this hurts the generalization power of\nobject-centered methods.\n4.6 Reconstructing Objects from Unseen Classes\nWe study how our approach enables generalization to novel shape classes unseen during training.\nSynthetic renderings. We use the 10 largest ShapeNet classes other than chairs, cars, and airplanes\nas our test set. Table 1 shows that our model consistently outperforms the state of the art, except for the\nclass of ri\ufb02es, for which AtlasNet performs best. Qualitatively, our model produces reconstructions\nthat are much more consistent with input images, as shown in Figure 5. 
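For concreteness, the Chamfer distance of Equation (1) and the class-dissimilarity measure of Section 4.4 can be sketched together. This is a brute-force NumPy illustration under our own naming; the paper's protocol additionally sweeps isosurface thresholds from 0.3 to 0.7 and samples 1,024 points per shape:

```python
import numpy as np

def chamfer_distance(s1, s2):
    """Symmetric Chamfer distance (Eq. 1) between point sets s1 (N,3), s2 (M,3).

    Brute-force O(N*M) pairwise version; each term averages, over one set,
    the distance to the nearest point of the other set.
    """
    d = np.linalg.norm(s1[:, None, :] - s2[None, :, :], axis=-1)  # pairwise
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def class_dissimilarity(test_shapes, train_shapes):
    """Dissimilarity from a test class to the training classes (Section 4.4):
    the mean, over test shapes, of the CD to the closest training shape."""
    return float(np.mean([min(chamfer_distance(x, y) for y in train_shapes)
                          for x in test_shapes]))
```

For larger point sets, the pairwise distance matrix is usually replaced by a k-d tree nearest-neighbor query; the brute-force form above is only meant to mirror the formulas.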
In particular, on unseen\nclasses, our results still attain good consistency with the input images, while the competitors either\nlack structural details present in the input (e.g., 5) or retrieve shapes from the training classes (e.g., 4,\n6, 7, 8, 9).
Figure 5: Single-image 3D reconstructions of objects within and beyond training classes. Each row from left to right: the input image, two views from the best-performing baseline for each testing object (1-4, 6-9: AtlasNet; 5, 10: Shin et al. [2018]), two views of our GenRe predictions, and the ground truth. All models are trained on the same dataset of cars, chairs, and airplanes.

Comparing our model with its variants, we find that the two-step approaches (GenRe-2step and GenRe) outperform the one-step approach across all novel categories. This empirically supports the advantage of our two-step modeling strategy that disentangles geometric projections from shape reconstruction.

Real images. We further compare how our model, AtlasNet, and Shin et al. [2018] perform on real images from Pix3D. Here, all models are trained on ShapeNet cars, chairs, and airplanes, and tested on real images of beds, bookcases, desks, sofas, tables, and wardrobes.

Quantitatively, Table 2 shows that our model outperforms the two competitors across all novel classes except beds, for which Shin et al. [2018] performs the best. For chairs, one of the training classes, the object-centered AtlasNet leverages the canonical view and outperforms the two viewer-centered approaches. Qualitatively, our reconstructions preserve the details present in the input (e.g., the hollow structures in the second row of Figure 6).

             AtlasNet   Shin et al. [2018]   GenRe
Chair          .080           .089           .093
Bed            .114           .106           .113
Bookcase       .140           .109           .101
Desk           .126           .121           .109
Sofa           .095           .088           .083
Table          .134           .124           .116
Wardrobe       .121           .116           .109

Table 2: Reconstruction errors (in CD) for seen (chairs) and unseen classes (the rest) on real images from Pix3D. GenRe outperforms the two baselines across all unseen classes except beds. For chairs, object-centered AtlasNet performs the best by leveraging the canonical view.

Figure 6: Reconstructions on real images from Pix3D by GenRe and AtlasNet or Shin et al. [2018]. All models are trained on cars, chairs, and airplanes.

5 Analyses

5.1 The Effect of Viewpoints on Generalization

The generic viewpoint assumption states that the observer is not in a special position relative to the object [Freeman, 1994]. This makes us wonder if the “accidentalness” of the viewpoint affects the quality of reconstructions.

As a quantitative analysis, we test our model trained on ShapeNet chairs, cars, and airplanes on 100 randomly sampled ShapeNet tables, each rendered in 200 different views sampled uniformly on a sphere. We then compute, for each of the 200 views, the median CD of the 100 reconstructions. Finally, in Figure 7, we visualize these median CDs as a heatmap over an elevation-azimuth view grid.

Figure 7: Reconstruction errors (CD) for different input viewpoints. The vertical (horizontal) axis represents elevation (azimuth). Accidental views (blue box) lead to large errors, while generic views (green box) result in smaller errors. Errors are computed for 100 tables; these particular tables are for visualization purposes only.
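The per-view analysis above can be sketched in a few lines: compute a Chamfer distance per reconstruction, then bin the per-view errors into an elevation-azimuth grid. This is an illustrative sketch, not the authors' code; the brute-force CD implementation (sum of the two mean nearest-neighbor distances) and the grid resolution are assumptions.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point clouds a (N,3) and b (M,3):
    mean nearest-neighbor distance from a to b, plus from b to a."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (N, M) pairwise
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def viewpoint_error_grid(errors, elevations, azimuths, n_elev=20, n_azim=40):
    """Bin per-view errors into an elevation-azimuth grid, taking the median
    within each bin (NaN where a bin receives no views).
    elevations in [-90, 90] degrees; azimuths in [0, 360)."""
    grid = np.full((n_elev, n_azim), np.nan)
    ei = np.clip(((elevations + 90.0) / 180.0 * n_elev).astype(int), 0, n_elev - 1)
    ai = np.clip((azimuths / 360.0 * n_azim).astype(int), 0, n_azim - 1)
    for r in range(n_elev):
        for c in range(n_azim):
            vals = errors[(ei == r) & (ai == c)]
            if vals.size:
                grid[r, c] = np.median(vals)
    return grid
```

The resulting grid can be rendered directly as the Figure 7 heatmap; a k-d tree (e.g., `scipy.spatial.cKDTree`) would replace the quadratic pairwise distance for large point clouds.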
As the heatmap shows, our model makes better predictions when the input view is generic than when it is accidental, consistent with our intuition.

5.2 Reconstructing Non-Rigid Shapes

We probe the generalization limit of our model by testing it with unseen non-rigid shapes, such as horses and humans. As the focus is mainly on the spherical map inpainting network (Figure 2b) and the voxel refinement network (Figure 2c), we assume our model has access to the ground-truth single-view depth (i.e., GenRe-Oracle) in this experiment. As demonstrated in Figure 8, our model not only retains the visible details in the original view, but also completes the unseen surfaces using the generic shape priors learned from rigid objects (cars, chairs, and airplanes).

Figure 8: Single-view completion of non-rigid shapes from depth maps by our model trained on cars, chairs, and airplanes.

Figure 9: Single-view completion of highly regular shapes (primitives) from depth maps by our model trained on cars, chairs, and airplanes.

5.3 Reconstructing Highly Regular Shapes

We further explore whether our model captures global shape attributes by testing it on highly regular shapes that can be parametrized by only a few attributes (such as cones and cubes). Similar to Section 5.2, the model has only seen cars, chairs, and airplanes during training, and we assume our model has access to the ground-truth single-view depth (i.e., GenRe-Oracle).

As Figure 9 shows, although our model hallucinates the unseen parts of these shape primitives, it fails to exploit global shape symmetry to produce correct predictions. This is not surprising given that our network design does not explicitly model such regularity. A possible future direction is to incorporate priors that facilitate learning high-level concepts such as symmetry.

6 Conclusion

We have studied the problem of generalizable single-image 3D reconstruction.
We exploit various image and shape representations, including 2.5D sketches, spherical maps, and voxels. We have proposed GenRe, a novel viewer-centered model that integrates these representations for generalizable, high-quality 3D shape reconstruction. Experiments demonstrate that GenRe achieves state-of-the-art performance on shape reconstruction for both seen and unseen classes. We hope our system will inspire future research along this challenging but rewarding research direction.

Acknowledgements We thank the anonymous reviewers for their constructive comments. This work is supported by NSF #1231216, NSF #1447476, ONR MURI N00014-16-1-2007, Toyota Research Institute, Shell, and Facebook.

References

Zeynep Akata, Mateusz Malinowski, Mario Fritz, and Bernt Schiele. Multi-cue zero-shot learning with strong supervision. In CVPR, 2016.

Stanislaw Antol, C Lawrence Zitnick, and Devi Parikh. Zero-shot learning via visual abstraction. In ECCV, 2014.

Aayush Bansal and Bryan Russell. Marr revisited: 2D-3D alignment via surface normal prediction. In CVPR, 2016.

Jonathan T Barron and Jitendra Malik. Shape, illumination, and reflectance from shading. IEEE TPAMI, 37(8):1670–1687, 2015.

Harry G Barrow and Jay M Tenenbaum. Recovering intrinsic scene characteristics from images. Computer Vision Systems, 1978.

Harry G Barrow, Jay M Tenenbaum, Robert C Bolles, and Helen C Wolf. Parametric correspondence and chamfer matching: two new techniques for image matching. In IJCAI, 1977.

Evgeniy Bart and Shimon Ullman. Cross-generalization: learning novel classes from a single example by feature replacement. In CVPR, 2005.

Sean Bell, Kavita Bala, and Noah Snavely. Intrinsic images in the wild. ACM TOG, 33(4):159, 2014.

Alexander M Bronstein, Michael M Bronstein, and Ron Kimmel. Numerical geometry of non-rigid shapes. Springer Science & Business Media, 2008.

Zhangjie Cao, Qixing Huang, and Karthik Ramani. 3D object classification via spherical projections. In 3DV, 2017.

Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, Jianxiong Xiao, Li Yi, and Fisher Yu. ShapeNet: an information-rich 3D model repository. arXiv:1512.03012, 2015.
Weifeng Chen, Zhao Fu, Dawei Yang, and Jia Deng. Single-image depth perception in the wild. In NeurIPS, 2016.

Christopher B Choy, Danfei Xu, JunYoung Gwak, Kevin Chen, and Silvio Savarese. 3D-R2N2: a unified approach for single and multi-view 3D object reconstruction. In ECCV, 2016.

Taco Cohen, Mario Geiger, and Max Welling. Convolutional networks for spherical signals. In ICML Workshop, 2017.

Taco S Cohen, Mario Geiger, Jonas Köhler, and Max Welling. Spherical CNNs. In ICLR, 2018.

David Eigen and Rob Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In ICCV, 2015.

Carlos Esteves, Christine Allen-Blanchette, Ameesh Makadia, and Kostas Daniilidis. Learning SO(3) equivariant representations with spherical CNNs. In ECCV, 2018.

Haoqiang Fan, Hao Su, and Leonidas Guibas. A point set generation network for 3D object reconstruction from a single image. In CVPR, 2017.

Ali Farhadi, Ian Endres, Derek Hoiem, and David Forsyth. Describing objects by their attributes. In CVPR, 2009.

William T Freeman. The generic viewpoint assumption in a framework for visual perception. Nature, 368(6471):542, 1994.

Christopher Funk and Yanxi Liu. Beyond planar symmetry: modeling human perception of reflection and rotation symmetries in the wild. In ICCV, 2017.

Rohit Girdhar, David F Fouhey, Mikel Rodriguez, and Abhinav Gupta. Learning a predictable and generative vector representation for objects. In ECCV, 2016.

Thibault Groueix, Matthew Fisher, Vladimir G Kim, Bryan C Russell, and Mathieu Aubry. AtlasNet: a Papier-Mâché approach to learning 3D surface generation. In CVPR, 2018.

Christian Häne, Shubham Tulsiani, and Jitendra Malik. Hierarchical surface prediction for 3D object reconstruction. In 3DV, 2017.

Bharath Hariharan and Ross Girshick. Low-shot visual recognition by shrinking and hallucinating features. In ICCV, 2017.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.

Berthold KP Horn and Michael J Brooks. Shape from shading. MIT Press, 1989.

Shahram Izadi, David Kim, Otmar Hilliges, David Molyneaux, Richard A Newcombe, Pushmeet Kohli, Jamie Shotton, Steve Hodges, Dustin Freeman, Andrew J Davison, and Andrew W Fitzgibbon. KinectFusion: real-time 3D reconstruction and interaction using a moving depth camera. In UIST, 2011.

Wenzel Jakob. Mitsuba renderer, 2010.

Michael Janner, Jiajun Wu, Tejas Kulkarni, Ilker Yildirim, and Joshua B Tenenbaum. Self-supervised intrinsic image decomposition. In NeurIPS, 2017.

Dinesh Jayaraman, Ruohan Gao, and Kristen Grauman. ShapeCodes: self-supervised feature learning by lifting views to viewgrids. In ECCV, 2018.

Abhishek Kar, Shubham Tulsiani, Joao Carreira, and Jitendra Malik. Category-specific object reconstruction from a single image. In CVPR, 2015.

Michael Kazhdan, Bernard Chazelle, David Dobkin, Adam Finkelstein, and Thomas Funkhouser. A reflective symmetry descriptor. In ECCV, 2002.

Michael Kazhdan, Thomas Funkhouser, and Szymon Rusinkiewicz. Symmetry descriptors and 3D shape matching. In SGP, 2004.

Christoph H Lampert, Hannes Nickisch, and Stefan Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In CVPR, 2009.

Thomas Lewiner, Hélio Lopes, Antônio Wilson Vieira, and Geovan Tavares. Efficient implementation of marching cubes' cases with topological guarantees. Journal of Graphics Tools, 8(2):1–15, 2003.

David Marr. Vision. W. H. Freeman and Company, 1982.

John McCormac, Ankur Handa, Stefan Leutenegger, and Andrew J Davison. SceneNet RGB-D: can 5M synthetic images beat generic ImageNet pre-training on indoor segmentation? In ICCV, 2017.

David Novotny, Diane Larlus, and Andrea Vedaldi. Learning 3D object categories by looking around them. In ICCV, 2017.

Xingchao Peng, Baochen Sun, Karim Ali, and Kate Saenko. Learning deep object detectors from 3D models. In ICCV, 2015.

Marc Proesmans, Luc Van Gool, and André Oosterlinck. One-shot active 3D shape acquisition. In ICPR, 1996.

Danilo Jimenez Rezende, S M Ali Eslami, Shakir Mohamed, Peter Battaglia, Max Jaderberg, and Nicolas Heess. Unsupervised learning of 3D structure from images. In NeurIPS, 2016.

Gernot Riegler, Ali Osman Ulusoys, and Andreas Geiger. OctNet: learning deep 3D representations at high resolutions. In CVPR, 2017.

Jason Rock, Tanmay Gupta, Justin Thorsen, JunYoung Gwak, Daeyun Shin, and Derek Hoiem. Completing 3D object shape from one depth image. In CVPR, 2015.

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: convolutional networks for biomedical image segmentation. In MICCAI, 2015.

Ryusuke Sagawa, Hiroshi Kawasaki, Shota Kiyota, and Ryo Furukawa. Dense one-shot 3D reconstruction by detecting continuous regions with parallel line projection. In ICCV, 2011.

Jian Shi, Yue Dong, Hao Su, and Stella X Yu. Learning non-Lambertian object intrinsics across ShapeNet categories. In CVPR, 2017.

Daeyun Shin, Charless C Fowlkes, and Derek Hoiem. Pixels, voxels, and views: a study of shape representations for single view 3D object shape prediction. In CVPR, 2018.

Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from RGBD images. In ECCV, 2012.

Amir Arsalan Soltani, Haibin Huang, Jiajun Wu, Tejas D Kulkarni, and Joshua B Tenenbaum. Synthesizing 3D shapes via modeling multi-view depth maps and silhouettes with deep generative networks. In CVPR, 2017.

Shuran Song, Fisher Yu, Andy Zeng, Angel X Chang, Manolis Savva, and Thomas Funkhouser. Semantic scene completion from a single depth image. In CVPR, 2017.

Xingyuan Sun, Jiajun Wu, Xiuming Zhang, Zhoutong Zhang, Chengkai Zhang, Tianfan Xue, Joshua B Tenenbaum, and William T Freeman. Pix3D: dataset and methods for single-image 3D shape modeling. In CVPR, 2018.

Marshall F Tappen, William T Freeman, and Edward H Adelson. Recovering intrinsic images from a single image. In NeurIPS, 2003.

Maxim Tatarchenko, Alexey Dosovitskiy, and Thomas Brox. Multi-view 3D models from single images with a convolutional network. In ECCV, 2016.

Maxim Tatarchenko, Alexey Dosovitskiy, and Thomas Brox. Octree generating networks: efficient convolutional architectures for high-resolution 3D outputs. In ICCV, 2017.

Antonio Torralba, Kevin P Murphy, and William T Freeman. Sharing visual features for multiclass and multiview object detection. IEEE TPAMI, 29(5), 2007.

Shubham Tulsiani, Tinghui Zhou, Alexei A Efros, and Jitendra Malik. Multi-view supervision for single-view reconstruction via differentiable ray consistency. In CVPR, 2017.

Peng Wang, Lingqiao Liu, Chunhua Shen, Zi Huang, Anton van den Hengel, and Heng Tao Shen. Multi-attention network for one shot learning. In CVPR, 2017.

Xiaolong Wang, David Fouhey, and Abhinav Gupta. Designing deep networks for surface normal estimation. In CVPR, 2015.

Yu-Xiong Wang and Martial Hebert. Learning to learn: model regression networks for easy small sample learning. In ECCV, 2016.

Yair Weiss. Deriving intrinsic images from image sequences. In ICCV, 2001.

Jiajun Wu, Chengkai Zhang, Tianfan Xue, William T Freeman, and Joshua B Tenenbaum. Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling. In NeurIPS, 2016.

Jiajun Wu, Yifan Wang, Tianfan Xue, Xingyuan Sun, William T Freeman, and Joshua B Tenenbaum. MarrNet: 3D shape reconstruction via 2.5D sketches. In NeurIPS, 2017.

Jiajun Wu, Chengkai Zhang, Xiuming Zhang, Zhoutong Zhang, William T Freeman, and Joshua B Tenenbaum. Learning 3D shape priors for shape completion and reconstruction. In ECCV, 2018.

Yongqin Xian, Bernt Schiele, and Zeynep Akata. Zero-shot learning: the good, the bad and the ugly. In CVPR, 2017.

Xinchen Yan, Jimei Yang, Ersin Yumer, Yijie Guo, and Honglak Lee. Perspective transformer nets: learning single-view 3D object reconstruction without 3D supervision. In NeurIPS, 2016.

Shunyu Yao, Tzu Ming Harry Hsu, Jun-Yan Zhu, Jiajun Wu, Antonio Torralba, William T Freeman, and Joshua B Tenenbaum. 3D-aware scene manipulation via inverse graphics. In NeurIPS, 2018.

Ruo Zhang, Ping-Sing Tsai, James Edwin Cryer, and Mubarak Shah. Shape-from-shading: a survey. IEEE TPAMI, 21(8):690–706, 1999.

Jun-Yan Zhu, Zhoutong Zhang, Chengkai Zhang, Jiajun Wu, Antonio Torralba, Joshua B Tenenbaum, and William T Freeman. Visual object networks: image generation with disentangled 3D representations. In NeurIPS, 2018.