{"title": "Multi-View Silhouette and Depth Decomposition for High Resolution 3D Object Representation", "book": "Advances in Neural Information Processing Systems", "page_first": 6478, "page_last": 6488, "abstract": "We consider the problem of scaling deep generative shape models to high-resolution. Drawing motivation from the canonical view representation of objects, we introduce a novel method for the fast up-sampling of 3D objects in voxel space through networks that perform super-resolution on the six orthographic depth projections. This allows us to generate high-resolution objects with more efficient scaling than methods which work directly in 3D. We decompose the problem of 2D depth super-resolution into silhouette and depth prediction to capture both structure and fine detail. This allows our method to generate sharp edges more easily than an individual network. We evaluate our work on multiple experiments concerning high-resolution 3D objects, and show our system is capable of accurately predicting novel objects at resolutions as large as 512x512x512 -- the highest resolution reported for this task. 
We achieve state-of-the-art performance on 3D object reconstruction from RGB images on the ShapeNet dataset, and further demonstrate the first effective 3D super-resolution method.", "full_text": "Multi-View Silhouette and Depth Decomposition for\n\nHigh Resolution 3D Object Representation\n\nEdward Smith\nMcGill University\n\nedward.smith@mail.mcgill.ca\n\nScott Fujimoto\nMcGill University\n\nscott.fujimoto@mail.mcgill.ca\n\nDavid Meger\n\nMcGill University\n\ndmeger@cim.mcgill.ca\n\nAbstract\n\nWe consider the problem of scaling deep generative shape models to high-resolution.\nDrawing motivation from the canonical view representation of objects, we introduce\na novel method for the fast up-sampling of 3D objects in voxel space through\nnetworks that perform super-resolution on the six orthographic depth projections.\nThis allows us to generate high-resolution objects with more ef\ufb01cient scaling than\nmethods which work directly in 3D. We decompose the problem of 2D depth\nsuper-resolution into silhouette and depth prediction to capture both structure and\n\ufb01ne detail. This allows our method to generate sharp edges more easily than an\nindividual network. We evaluate our work on multiple experiments concerning\nhigh-resolution 3D objects, and show our system is capable of accurately predicting\nnovel objects at resolutions as large as 512\u00d7512\u00d7512 \u2013 the highest resolution\nreported for this task. 
We achieve state-of-the-art performance on 3D object reconstruction from RGB images on the ShapeNet dataset, and further demonstrate the first effective 3D super-resolution method.

1 Introduction

The 3D shape of an object is a combination of countless physical elements that range in scale from gross structure and topology to minute textures endowed by the material of each surface. Intelligent systems require representations capable of modeling this complex shape efficiently, in order to perceive and interact with the physical world in detail (e.g., object grasping, 3D perception, motion prediction and path planning). Deep generative models have recently achieved strong performance in hallucinating diverse 3D object shapes, capturing their overall structure and rough texture [3, 37, 47]. The first generation of these models utilized voxel representations which scale cubically with resolution, limiting training to only 64³ shapes on typical hardware. Numerous recent papers have begun to propose high resolution 3D shape representations with better scaling, such as those based on meshes, point clouds or octrees, but these often require more difficult training procedures and customized network architectures.

Our 3D shape model is motivated by a foundational concept in 3D perception: that of canonical views. The shape of a 3D object can be completely captured by a set of 2D images from multiple viewpoints (see [21, 4] for an analysis of selecting the location and number of viewpoints). Deep learning approaches for 2D image recognition and generation [40, 10, 8, 13] scale easily to high resolutions. 
This motivates the primary question in this paper: can a multi-view representation be used efficiently with modern deep learning methods?

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Figure 1: Scene created from objects reconstructed by our method from RGB images at 256³ resolution. See the supplementary video for better viewing: https://sites.google.com/site/mvdnips2018.

We propose a novel approach for deep shape interpretation which captures the structure of an object via modeling of its canonical views in 2D as depth maps, in a framework we refer to as Multi-View Decomposition (MVD). By utilizing many 2D orthographic projections to capture shape, a model represented in this fashion can be up-scaled to high resolution by performing semantic super-resolution in 2D space, which leverages efficient, well-studied network structures and training procedures. The higher resolution depth maps are finally merged into a detailed 3D object using model carving.

Our method has several key components that allow effective and efficient training. We leverage two synergistic deep networks that decompose the task of representing an object's depth: one that outputs the silhouette, capturing the gross structure, and a second that produces the local variations in depth, capturing the fine detail. This decomposition addresses the blurred images that often occur when minimizing reconstruction error by allowing the silhouette prediction to form sharp edges. Our method utilizes the low-resolution input shape as a rough template which simply needs carving and refinement to form the high resolution product. 
Learning the residual errors between this template and the desired high resolution shape simplifies the generation task and allows for constrained output scaling, which leads to significant performance improvements.

We evaluate our method's ability to perform 3D object reconstruction on the ShapeNet dataset [1]. This standard evaluation task requires generating high resolution 3D objects from single 2D RGB images. Furthermore, due to the nature of our pipeline, we present the first results for 3D object super-resolution: generating high resolution 3D objects directly from low resolution 3D objects. Our method achieves state-of-the-art quantitative performance when compared to a variety of other 3D representations such as octrees, mesh models and point clouds. Furthermore, our system is the first to produce 3D objects at 512³ resolution. We demonstrate these objects are visually impressive in isolation and when compared to the ground truth objects. We additionally demonstrate that objects reconstructed from images can be placed in scenes to create realistic environments, as shown in figure 1. In order to ensure reproducible experimental comparison, code for our system has been made publicly available on a GitHub repository¹. Given the efficiency of our method, each experiment was run on a single NVIDIA Titan X GPU on the order of hours.

2 Related Work

Deep Learning with 3D Data. Recent advances with 3D data have leveraged deep learning, beginning with architectures such as 3D convolutions for object classification [25, 19]. When adapted to 3D generation, these methods typically use an autoencoder network, with a decoder composed of 3D deconvolutional layers [3, 47]. This decoder receives a latent representation of the 3D shape and produces a probability for occupancy at each discrete position in 3D voxel space. 
This approach has been combined with generative adversarial approaches [8] to generate novel 3D objects [47, 41, 20], but only at a limited resolution.

¹ https://github.com/EdwardSmith1884/Multi-View-Silhouette-and-Depth-Decomposition-for-High-Resolution-3D-Object-Representation

Figure 2: The complete pipeline for 3D object reconstruction and super-resolution outlined in this paper. Our method accepts either a single RGB image for low resolution reconstruction or a low resolution object for 3D super-resolution. ODM up-scaling is defined in section 3.1 and model carving in section 3.2.

2D Super-Resolution. Super-resolution of 2D images is a well-studied problem [29]. Traditionally, image super-resolution has used dictionary-style methods [7, 49], matching patches of images to higher-resolution counterparts. This research also extends to depth map super-resolution [22, 28, 11]. Modern approaches to super-resolution are built on deep convolutional networks [5, 46, 27] as well as generative adversarial networks [18, 13] which use an adversarial loss to imagine high-resolution details in RGB images.

Multi-View Representation. Our work connects to multi-view representations which capture the characteristics of a 3D object from multiple viewpoints in 2D [17, 26, 43, 32, 12, 39, 34], such as decomposing image silhouettes [23, 42], Light Field Descriptors [2], and 2D panoramic mapping [38]. Other representations aim to use orientation [36], rotational invariance [15] or 3D-SURF features [16]. While many of these representations are effective for 3D classification, they have not previously been utilized to recover 3D shape in high resolution.

Efficient 3D Representations. Given that naïve representations of 3D data require cubic computational costs with respect to resolution, many alternate representations have been proposed. 
Octree methods [44, 9] use non-uniform discretization of the voxel space to efficiently capture 3D objects by adapting the discretization level locally based on shape. Hierarchical surface prediction (HSP) [9] is an octree-style method which divides the voxel space into free, occupied and boundary space. The object is generated at different scales of resolution, where occupied space is generated at a very coarse resolution and the boundary space is generated at a very fine resolution. Octree generating networks (OGN) [44] use a convolutional network that operates directly on octrees, rather than in voxel space. These methods have only shown novel generation results up to 256³ resolution. Our method achieves higher accuracy at this resolution and can efficiently produce novel objects as large as 512³.

A recent trend is the use of unstructured representations such as mesh models [31, 14, 45] and point clouds [33, 6] which represent the data by an unordered set with a fixed number of points. MarrNet [48], which resembles our work, models 3D objects through the use of 2.5D sketches, which capture depth maps from a single viewpoint. This approach requires working in voxel space when translating 2.5D sketches to high resolution, while our method can work directly in 2D space, leveraging 2D super-resolution technology within the 3D pipeline.

3 Method

In this section we describe our methodology for representing high resolution 3D objects. Our algorithm is a novel approach which uses the six axis-aligned orthographic depth maps (ODMs) to efficiently scale 3D objects to high resolution without directly interacting with the voxels. To achieve this, a pair of networks is used for each view, decomposing the super-resolution task into predicting the silhouette and relative depth from the low resolution ODM. 
This approach is able to recover fine object details and scales better to higher resolutions than previous methods, due to the simplified learning problem faced by each network and scalable computations that occur primarily in 2D image space.

Figure 3: Our Multi-View Decomposition framework (MVD). Each ODM prediction task can be decomposed into a silhouette and detail prediction. We further simplify the detail prediction task by encoding only the residual details (change from the low resolution input), masked by the ground truth silhouette.

3.1 Orthographic Depth Map Super-Resolution

Our method begins by obtaining the orthographic depth maps of the six primary views of the low-resolution 3D object. In an ODM, each pixel holds a value equal to the surface depth of the object along the viewing direction at the corresponding coordinate. This projection can be computed quickly and easily from an axis-aligned 3D array via z-clipping. Super-resolution is then performed directly on these ODMs, before being mapped onto the low resolution object to produce a high resolution object.

Representing an object by a set of depth maps, however, introduces a challenging learning problem, which requires both local and global consistency in depth. Furthermore, minimizing the mean squared error results in blurry images without sharp edges [24, 30]. This is particularly problematic as a depth map is required to be bimodal, with large variations in depth to create structure and small variations in depth to create texture and fine detail. To address this concern, we propose decomposing the learning problem into predicting the silhouette and depth map separately. 
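The z-clipping projection described above can be sketched in a few lines of NumPy. This is an illustrative sketch under our own conventions (the view ordering, the function name, and storing the grid dimension for empty columns are our assumptions, not the authors' released implementation):

```python
import numpy as np

def extract_odms(voxels):
    """Z-clip a binary occupancy grid into the six axis-aligned
    orthographic depth maps (ODMs). Views are ordered axis by axis,
    forward direction then reversed; empty columns store `dim`."""
    dim = voxels.shape[0]
    odms = []
    for axis in range(3):
        for flipped in (False, True):
            v = np.flip(voxels, axis=axis) if flipped else voxels
            v = np.moveaxis(v, axis, 0)      # viewing direction first
            occupied = v.any(axis=0)         # silhouette of this view
            depth = np.argmax(v, axis=0)     # index of first occupied slice
            odms.append(np.where(occupied, depth, dim))
    return odms
```

Under this convention a view's silhouette falls out for free: a pixel lies inside the silhouette exactly when its stored depth is less than `dim`.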
Separating the challenge of predicting gross shape from fine detail regularizes and reduces the complexity of the learning problem, leading to improved results when compared with directly estimating new surface depths.

Our Multi-View Decomposition framework (MVD) uses a pair of twin deep convolutional models, f_SIL and f_ΔD, to separately predict the silhouette and the variations in depth of the higher resolution ODM. We depict our system in figure 3. The deep convolutional network for predicting the high-resolution silhouette, f_SIL with parameters θ, is passed the low resolution ODM D_L, extracted from the input 3D object. The network outputs a probability that each pixel is occupied. It is trained by minimizing the mean squared error between the predicted and true silhouette of the high resolution ODM D_H:

L(θ) = Σ_{i=1}^{N} ‖f_SIL(D_L^(i); θ) − 1_{D_H^(i)≠0}(D_H^(i))‖²,    (1)

where 1_{D_H^(i)≠0} is an indicator function for each coordinate in the image.

The same low-resolution ODM D_L is passed through the second deep convolutional neural network, denoted f_ΔD with parameters φ, whose final output is passed through a sigmoid to produce an estimate for the variation of the ODM within a fixed range r. This output is added to the low-resolution depth map to produce our prediction for a constrained high-resolution depth map C_H:

C_H = r·σ(f_ΔD(D_L; φ)) + g(D_L),    (2)

where g(·) denotes up-sampling using nearest neighbor interpolation.

We train our network f_ΔD by minimizing the mean squared error between our prediction and the ground truth high-resolution depth map D_H. During training only, we mask the output with the ground truth silhouette to allow effective focus on fine detail for f_ΔD. 
We further add a smoothing regularizer which penalizes the total variation V(x) = Σ_{i,j} √((x_{i+1,j} − x_{i,j})² + (x_{i,j+1} − x_{i,j})²) [35] within the predicted ODM. Our loss function is a simple combination of these terms:

L(φ) = Σ_{i=1}^{N} ‖(C_H^(i) ∘ 1_{D_H^(i)≠0}(D_H^(i))) − D_H^(i)‖² + λV(C_H^(i)),    (3)

where ∘ is the Hadamard product. The total variation penalty is used as an edge-preserving denoiser which smooths out irregularities in the output.

The outputs of the constrained depth map and silhouette networks are then combined to produce a complete prediction for the high-resolution ODM. This is accomplished by masking the constrained high-resolution depth map by the predicted silhouette:

D̂_H = C_H ∘ f_SIL(D_L; θ).    (4)

D̂_H denotes our predicted high resolution ODM, which can then be mapped back onto the original low resolution object by model carving to produce a high resolution object. Each of the six high resolution ODMs is predicted using the same two network models, with the side information for each view passed using a fourth channel in the corresponding low resolution ODM.

3.2 3D Model Carving

To complete our super-resolution procedure, the six ODMs are combined with the low-resolution object to form a high-resolution object. This begins by further smoothing the up-sampled ODM with an adaptive averaging filter, which only considers neighboring pixels within a small radius. To preserve edges, only neighboring pixels within a threshold of the value of the center pixel are included. This smoothing, along with the total variation regularization in our loss function, is added to enforce smooth changes in local depth regions.

Model carving begins by first up-sampling the low-resolution model to the desired resolution, using nearest neighbor interpolation. 
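The training objectives of section 3.1, equations (1) through (4), can be condensed into a short PyTorch sketch. The tensor shapes, the sum reduction, the default range `r`, the weight `lam`, and the 0.5 binarization threshold are our assumptions for illustration, not values from the paper:

```python
import torch

def total_variation(x):
    # V(x) = sum_ij sqrt((x[i+1,j] - x[i,j])^2 + (x[i,j+1] - x[i,j])^2)
    dx = x[1:, :-1] - x[:-1, :-1]
    dy = x[:-1, 1:] - x[:-1, :-1]
    return torch.sqrt(dx ** 2 + dy ** 2 + 1e-12).sum()

def mvd_losses(sil_prob, delta_raw, d_low_up, d_high, r=8.0, lam=1e-4):
    """sil_prob: f_SIL's occupancy probabilities; delta_raw: f_dD's
    pre-sigmoid output; d_low_up: the up-sampled low-res ODM g(D_L)."""
    silhouette = (d_high != 0).float()              # ground-truth indicator
    loss_sil = ((sil_prob - silhouette) ** 2).sum()            # eq. (1)
    c_high = r * torch.sigmoid(delta_raw) + d_low_up           # eq. (2)
    loss_depth = ((c_high * silhouette - d_high) ** 2).sum() \
                 + lam * total_variation(c_high)               # eq. (3)
    return loss_sil, loss_depth

def predict_odm(sil_prob, delta_raw, d_low_up, r=8.0):
    # eq. (4): mask the constrained depth map with the binarized silhouette
    c_high = r * torch.sigmoid(delta_raw) + d_low_up
    return c_high * (sil_prob > 0.5).float()
```

Note how the ground-truth silhouette mask in `loss_depth` keeps f_ΔD focused on fine detail inside the object, while f_SIL alone is responsible for the sharp boundary.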
We then use the predicted ODMs D̂_H = C_H ∘ f_SIL(D_L; θ) to determine the surface of the new object. The carving procedure is separated into (1) structure carving, corresponding to the silhouette prediction f_SIL(D_L; θ), and (2) detail carving, corresponding to the constrained depth prediction C_H.

For the structure carving, for each predicted ODM f_SIL(D_L; θ), if a coordinate is predicted unoccupied, then all voxels perpendicular to the coordinate are highlighted to be removed. The removal only occurs if there is agreement of at least two ODMs for the removal of a voxel. As there is a large amount of overlap in the surface area that the six ODMs observe, this silhouette agreement is enforced to maintain the structure of the object.

This same process occurs for detail carving with C_H. However, we do not require agreement within the constrained depth map predictions. This is because, unlike the silhouettes, a depth map can cause or deepen concavities in the surface of the object which may not be visible from any other face. Requiring agreement among depth maps would eliminate their ability to influence these concavities. Thus, performing detail carving simply involves removing all voxels perpendicular to each coordinate of each ODM, up to the predicted depth.

4 Experiments

In this section we present the results for our method, Multi-View Decomposition Networks (MVD), for both 3D object super-resolution and 3D object reconstruction from single RGB images. Our results are evaluated across 13 classes of the ShapeNet [1] dataset. 3D super-resolution is the task of generating a high resolution 3D object conditioned on a low resolution input, while 3D object reconstruction is the task of re-creating high resolution 3D objects from a single RGB image of the object.

4.1 3D Object Super-Resolution

Dataset. The dataset consists of 32³ low resolution voxelized objects and their 256³ high resolution counterparts. 
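The two carving passes of section 3.2 can be sketched as follows, assuming boolean occupancy grids and six ODMs ordered axis by axis (forward then reversed direction) with empty columns storing the grid dimension; the function name and vote bookkeeping are our own illustration, not the authors' code:

```python
import numpy as np

def carve(low_res_up, odms):
    """Structure carving removes a voxel only when at least two
    silhouettes flag its column as empty; detail carving removes all
    voxels in front of each ODM's predicted surface unconditionally."""
    dim = low_res_up.shape[0]
    vox = low_res_up.copy()
    votes = np.zeros(vox.shape, dtype=np.int32)    # structure-carving votes
    idx = np.arange(dim)[:, None, None]
    for view, depth in enumerate(odms):
        axis, flipped = divmod(view, 2)
        v = np.moveaxis(vox, axis, 0)              # views write through
        c = np.moveaxis(votes, axis, 0)
        if flipped:
            v, c = v[::-1], c[::-1]
        empty = depth >= dim                       # silhouette: column empty
        c += empty[None, :, :]                     # vote to remove column
        v[(~empty)[None] & (idx < depth[None])] = False   # detail carving
    vox[votes >= 2] = False                        # require 2-view agreement
    return vox
```

Because `np.moveaxis` and slice reversal return views, the in-place updates write directly into `vox` and `votes` without copying the grid per view.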
These objects were produced by converting CAD models found in the ShapeNetCore dataset [1] into voxel format, in a canonical view. We work with the three commonly used object classes from this dataset: Car, Chair and Plane, with around 8000, 7000, and 4000 objects respectively. For training, we pre-process this dataset to extract the six ODMs from each object at high and low resolution. CAD models converted at this resolution do not remain watertight in many cases, making it difficult to fill the inner volume of the object. We describe an efficient method for obtaining high resolution voxelized objects in the supplementary material. Data is split into training, validation, and test sets using a ratio of 70:10:20 respectively.

Figure 4: Super-resolution rendering results. Each set shows, from left to right, the low resolution input and the results of MVD at 512³. Sets in (b) additionally show the ground-truth 512³ objects on the far right.

Figure 5: Super-resolution rendering results. Each pair shows the low resolution input (left) and the results of MVD at 256³ resolution (right).

Evaluation. We evaluate our method quantitatively using the intersection over union metric (IoU) against a simple baseline and the predictions of the individual networks on the test set. The baseline method corresponds to up-scaling the ground truth at 32³ resolution to the high resolution using nearest neighbor up-sampling. While our full method, MVD, uses a combination of networks, we present an ablation study to evaluate the contribution of each separate network.

Implementation. The super-resolution task requires a pair of networks, f_ΔD and f_SIL, which share the same architecture. This architecture is derived from the generator of SRGAN [18], a state-of-the-art 2D super-resolution network. 
Exact network architectures and the training regime are provided in the supplementary material.

Results. The super-resolution IoU scores are presented in table 1. Our method greatly outperforms the naïve nearest neighbor up-sampling baseline in every class. While we find that the silhouette prediction contributes far more to the IoU score, the addition of the depth variation network further increases it. This is because the silhouette captures the gross structure of the object from multiple viewpoints, while the depth variation captures the fine-grained details, which contribute less to the total IoU score. To qualitatively demonstrate the results of our super-resolution system, we render objects from the test set at both 256³ resolution in figure 5 and 512³ resolution in figure 4. The predicted high-resolution objects are all of high quality and accurately mimic the shapes of the ground truth objects. Additional 512³ renderings as well as multiple objects from each class at 256³ resolution can be found in our supplementary material.

4.2 3D Object Reconstruction from RGB Images

Dataset. To match the datasets used by prior work, two datasets are used for 3D object reconstruction, both derived from the ShapeNet dataset. 
The first, which we refer to as Data_HSP, consists of only the Car, Chair and Plane classes from the ShapeNet dataset, and we re-use the 32³ and 256³ voxel objects produced for these classes in the previous section. The CAD models for each of these objects were rendered into 128² RGB images capturing random viewpoints of the objects at elevations between (−20°, 30°) and all possible azimuth rotations. The voxelized objects and corresponding images were split into a training, validation and test set, with a ratio of 70:10:20 respectively.

The second dataset, which we refer to as Data_3D-R2N2, is that provided by Choy et al. [3]. It consists of images and objects produced from the 3 classes in the ShapeNet dataset used in the previous section, as well as 10 additional classes, for a total of around 50000 objects. From each object, 137² RGB images are rendered at random viewpoints, and we again compute their 32³ and 256³ resolution voxelized models and ODMs. The data is split into a training, validation and test set with a ratio of 70:10:20.

Category | Baseline | Depth Variation (f_ΔD) | Silhouette (f_SIL) | MVD (Both)
Car      | 73.2     | 80.6                   | 86.9               | 89.9
Chair    | 54.9     | 58.5                   | 67.3               | 68.5
Plane    | 39.9     | 50.5                   | 70.2               | 71.1

Table 1: Super-resolution IoU results against the nearest neighbor baseline and an ablation over the individual networks, at 256³ from 32³ input.

Figure 6: 3D object reconstruction 256³ rendering results from our method, MVD (bottom), of the 13 classes from ShapeNet, by interpreting 2D image input (top).

Evaluation. We evaluate our method quantitatively with two evaluation schemes. In the first, we use IoU scores when reconstructing objects at 256³ resolution. We compare against HSP [9] using the first dataset, Data_HSP, and against OGN [44] using the second dataset, Data_3D-R2N2. 
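The IoU metric used in this evaluation reduces to a few lines over binary occupancy grids (a straightforward sketch; the function name is ours):

```python
import numpy as np

def voxel_iou(pred, target):
    """Intersection over union between two binary occupancy grids."""
    pred, target = pred.astype(bool), target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return intersection / union
```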
To study the effectiveness of our super-resolution pipeline, we also compute the IoU scores using the low resolution objects predicted by our autoencoder (AE) with nearest neighbor up-sampling to produce predictions at 256³ resolution.

Our second evaluation is performed only on the second dataset, Data_3D-R2N2, by comparing the accuracy of the surfaces of predicted objects to those of the ground truth meshes. Following the evaluation procedure defined by Wang et al. [45], we first convert the 256³ voxel models into meshes by defining squared polygons on all exposed faces on the surface of the voxel models. We then uniformly sample points from the two mesh surfaces and compute F1 scores. Precision and recall are calculated using the percentage of points found with a nearest neighbor in the ground truth sampling set less than a squared distance threshold of 0.0001. We compare to state-of-the-art mesh model methods, N3MR [14] and Pixel2Mesh [45], a point cloud method, PSG [6], and a voxel baseline, 3D-R2N2 [3], using the values reported by Wang et al. [45].

Implementation. For 3D object reconstruction, we first trained a standard autoencoder, similar to prior work [3, 41], to produce objects at 32³ resolution. These low resolution objects are then used with our 3D super-resolution method to generate 3D object reconstructions at a high 256³ resolution. This process is described in figure 2. The exact network architecture and training regime are provided in the supplementary material.

(a) Data_HSP

Category | AE   | HSP [9] | MVD (Ours)
Car      | 55.2 | 70.1    | 72.7
Chair    | 36.4 | 37.8    | 40.1
Plane    | 28.9 | 56.1    | 56.4

(b) Data_3D-R2N2

Category | AE   | OGN [44] | MVD (Ours)
Car      | 68.1 | 78.2     | 80.7
Chair    | 37.6 | -        | 43.3
Plane    | 34.6 | -        | 58.9

Table 2: 3D Object Reconstruction IoU at 256³. 
Cells with a dash (-) indicate that the corresponding result was not reported by the original author.

Category   | 3D-R2N2 [3] | PSG [6] | N3MR [14] | Pixel2Mesh [45] | MVD (Ours)
Plane      | 41.46       | 68.20   | 62.10     | 71.12           | 87.34
Bench      | 34.09       | 49.29   | 35.84     | 57.57           | 69.92
Cabinet    | 49.88       | 39.93   | 21.04     | 60.39           | 65.87
Car        | 37.80       | 50.70   | 36.66     | 67.86           | 67.69
Chair      | 40.22       | 41.60   | 30.25     | 54.38           | 62.57
Monitor    | 34.38       | 40.53   | 28.77     | 51.39           | 57.48
Lamp       | 32.35       | 41.40   | 27.97     | 48.15           | 48.37
Speaker    | 45.30       | 32.61   | 19.46     | 48.84           | 53.88
Firearm    | 28.34       | 69.96   | 52.22     | 73.20           | 78.12
Couch      | 40.01       | 36.59   | 25.04     | 51.90           | 53.66
Table      | 43.79       | 53.44   | 28.40     | 66.30           | 68.06
Cellphone  | 42.31       | 55.95   | 27.96     | 70.24           | 86.00
Watercraft | 37.10       | 51.28   | 43.71     | 55.12           | 64.07
Mean       | 39.01       | 48.58   | 33.80     | 59.72           | 66.39

Table 3: 3D object reconstruction surface sampling F1 scores.

Results. The results of our IoU evaluation compared to the octree methods [44, 9] can be seen in table 2. We achieve state-of-the-art performance on every object class in both datasets. Our surface accuracy results compared to [45, 6, 14, 3] can be seen in table 3. Our method achieves state-of-the-art results on all 13 classes. We show significant improvements for many object classes and demonstrate a large improvement on the mean over all classes when compared against the methods presented. To qualitatively evaluate our performance, we rendered our reconstructions for each class, which can be seen in figure 6. Additional renderings can be found in the supplementary material.

5 Conclusion

In this paper we argue for the application of multi-view representations when predicting the structure of objects at high resolution. 
We outline our Multi-View Decomposition framework, a novel system for learning to represent 3D objects, and demonstrate its affinity for capturing category-specific shape details at a high resolution by operating over the six orthographic projections of the object.

In the task of super-resolution, our method outperforms baseline methods by a large margin, and we show its ability to produce objects as large as 512³, a 16-fold increase in size over the input objects. The results produced are visually impressive, even when compared against the ground-truth. When applied to the reconstruction of high-resolution 3D objects from single RGB images, we outperform several state-of-the-art methods with a variety of representation types, across two evaluation metrics.

All of our visualizations demonstrate the effectiveness of our method at capturing fine-grained detail, which is not present in the low resolution input but must be captured in our network's weights during learning. Furthermore, given that the deep aspect of our method works entirely in 2D space, our method scales naturally to high resolutions. This paper demonstrates that multi-view representations along with 2D super-resolution through decomposed networks are indeed capable of modeling complex shapes.

References

[1] Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015.

[2] Ding-Yun Chen, Xiao-Pei Tian, Yu-Te Shen, and Ming Ouhyoung. On visual similarity based 3d model retrieval. In Computer graphics forum, volume 22, pages 223–232. Wiley Online Library, 2003.

[3] Christopher B Choy, Danfei Xu, JunYoung Gwak, Kevin Chen, and Silvio Savarese. 3d-r2n2: A unified approach for single and multi-view 3d object reconstruction. 
In European Conference\non Computer Vision, pages 628\u2013644. Springer, 2016.\n\n[4] Trip Denton, M Fatih Demirci, Jeff Abrahamson, Ali Shokoufandeh, and Sven Dickinson.\nSelecting canonical views for view-based 3-d object recognition. In Pattern Recognition, 2004.\nICPR 2004. Proceedings of the 17th International Conference on, volume 2, pages 273\u2013276.\nIEEE, 2004.\n\n[5] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Image super-resolution using\ndeep convolutional networks. IEEE transactions on pattern analysis and machine intelligence,\n38(2):295\u2013307, 2016.\n\n[6] Haoqiang Fan, Hao Su, and Leonidas Guibas. A point set generation network for 3d object\nreconstruction from a single image. In Conference on Computer Vision and Pattern Recognition\n(CVPR), volume 38, 2017.\n\n[7] William T Freeman, Thouis R Jones, and Egon C Pasztor. Example-based super-resolution.\n\nIEEE Computer graphics and Applications, 22(2):56\u201365, 2002.\n\n[8] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil\nOzair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural\ninformation processing systems, pages 2672\u20132680. 2014.\n\n[9] Christian H\u00e4ne, Shubham Tulsiani, and Jitendra Malik. Hierarchical surface prediction for 3d\n\nobject reconstruction. arXiv preprint arXiv:1704.00710, 2017.\n\n[10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image\nrecognition. In Proceedings of the IEEE conference on computer vision and pattern recognition,\npages 770\u2013778, 2016.\n\n[11] Tak-Wai Hui, Chen Change Loy, and Xiaoou Tang. Depth map super-resolution by deep\n\nmulti-scale guidance. pages 353\u2013369, 2016.\n\n[12] Abhishek Kar, Christian H\u00e4ne, and Jitendra Malik. Learning a multi-view stereo machine. In\n\nAdvances in Neural Information Processing Systems, pages 364\u2013375, 2017.\n\n[13] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. 
Progressive growing of gans for\nimproved quality, stability, and variation. International Conference on Learning Representations,\n2018.\n\n[14] Hiroharu Kato, Yoshitaka Ushiku, and Tatsuya Harada. Neural 3d mesh renderer. arXiv preprint\n\narXiv:1711.07566, 2017.\n\n[15] Michael Kazhdan, Thomas Funkhouser, and Szymon Rusinkiewicz. Rotation invariant spherical\nIn Symposium on geometry processing,\n\nharmonic representation of 3 d shape descriptors.\nvolume 6, pages 156\u2013164, 2003.\n\n[16] Jan Knopp, Mukta Prasad, Geert Willems, Radu Timofte, and Luc Van Gool. Hough transform\nand 3d surf for robust three dimensional classi\ufb01cation. In European Conference on Computer\nVision, pages 589\u2013602. Springer, 2010.\n\n[17] Jan J Koenderink and Andrea J Van Doorn. The singularities of the visual mapping. Biological\n\ncybernetics, 24(1):51\u201359, 1976.\n\n9\n\n\f[18] Christian Ledig, Lucas Theis, Ferenc Husz\u00e1r, Jose Caballero, Andrew Cunningham, Ale-\njandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-\nrealistic single image super-resolution using a generative adversarial network. arXiv preprint\narXiv:1609.04802, 2016.\n\n[19] Yangyan Li, Soeren Pirk, Hao Su, Charles R Qi, and Leonidas J Guibas. Fpnn: Field probing\nneural networks for 3d data. In Advances in Neural Information Processing Systems, pages\n307\u2013315, 2016.\n\n[20] Jerry Liu, Fisher Yu, and Thomas Funkhouser. Interactive 3d modeling with a generative\n\nadversarial network. arXiv preprint arXiv:1706.05170, 2017.\n\n[21] Q-T Luong and Thierry Vi\u00e9ville. Canonical representations for the geometries of multiple\n\nprojective views. Computer vision and image understanding, 64(2):193\u2013229, 1996.\n\n[22] Oisin Mac Aodha, Neill DF Campbell, Arun Nair, and Gabriel J Brostow. Patch based synthesis\nfor single depth image super-resolution. In European Conference on Computer Vision, pages\n71\u201384. 
Springer, 2012.\n\n[23] Diego Macrini, Ali Shokoufandeh, Sven Dickinson, Kaleem Siddiqi, and Steven Zucker. View-\nbased 3-d object recognition using shock graphs. In Pattern Recognition, 2002. Proceedings.\n16th International Conference on, volume 3, pages 24\u201328. IEEE, 2002.\n\n[24] Michael Mathieu, Camille Couprie, and Yann LeCun. Deep multi-scale video prediction beyond\n\nmean square error. arXiv preprint arXiv:1511.05440, 2015.\n\n[25] Daniel Maturana and Sebastian Scherer. Voxnet: A 3d convolutional neural network for real-\ntime object recognition. In Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International\nConference on, pages 922\u2013928. IEEE, 2015.\n\n[26] Hiroshi Murase and Shree K Nayar. Visual learning and recognition of 3-d objects from\n\nappearance. International journal of computer vision, 14(1):5\u201324, 1995.\n\n[27] Christian Osendorfer, Hubert Soyer, and Patrick Van Der Smagt.\n\nImage super-resolution\nwith fast approximate convolutional sparse coding. In International Conference on Neural\nInformation Processing, pages 250\u2013257. Springer, 2014.\n\n[28] Jaesik Park, Hyeongwoo Kim, Yu-Wing Tai, Michael S Brown, and Inso Kweon. High quality\ndepth map upsampling for 3d-tof cameras. In Computer Vision (ICCV), 2011 IEEE International\nConference on, pages 1623\u20131630. IEEE, 2011.\n\n[29] Sung Cheol Park, Min Kyu Park, and Moon Gi Kang. Super-resolution image reconstruction: a\n\ntechnical overview. IEEE signal processing magazine, 20(3):21\u201336, 2003.\n\n[30] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. Context\nencoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer\nVision and Pattern Recognition, pages 2536\u20132544, 2016.\n\n[31] Jhony K Pontes, Chen Kong, Sridha Sridharan, Simon Lucey, Anders Eriksson, and Clinton\nFookes. Image2mesh: A learning framework for single image 3d reconstruction. 
arXiv preprint\narXiv:1711.10669, 2017.\n\n[32] Charles R Qi, Hao Su, Matthias Nie\u00dfner, Angela Dai, Mengyuan Yan, and Leonidas J Guibas.\nVolumetric and multi-view cnns for object classi\ufb01cation on 3d data. In Proceedings of the IEEE\nconference on computer vision and pattern recognition, pages 5648\u20135656, 2016.\n\n[33] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point\nsets for 3d classi\ufb01cation and segmentation. Proc. Computer Vision and Pattern Recognition\n(CVPR), IEEE, 1(2):4, 2017.\n\n[34] Gernot Riegler, Ali Osman Ulusoy, Horst Bischof, and Andreas Geiger. Octnetfusion: Learning\n\ndepth fusion from data. In Proceedings of the International Conference on 3D Vision, 2017.\n\n[35] Leonid I Rudin, Stanley Osher, and Emad Fatemi. Nonlinear total variation based noise removal\n\nalgorithms. Physica D: nonlinear phenomena, 60(1-4):259\u2013268, 1992.\n\n10\n\n\f[36] Ashutosh Saxena, Min Sun, and Andrew Y Ng. Make3d: Learning 3d scene structure from\na single still image. IEEE transactions on pattern analysis and machine intelligence, 31(5):\n824\u2013840, 2009.\n\n[37] Abhishek Sharma, Oliver Grau, and Mario Fritz. Vconv-dae: Deep volumetric shape learning\nwithout object labels. In European Conference on Computer Vision, pages 236\u2013250. Springer,\n2016.\n\n[38] Baoguang Shi, Song Bai, Zhichao Zhou, and Xiang Bai. Deeppano: Deep panoramic rep-\nresentation for 3-d shape recognition. IEEE Signal Processing Letters, 22(12):2339\u20132343,\n2015.\n\n[39] Daeyun Shin, Charless Fowlkes, and Derek Hoiem. Pixels, voxels, and views: A study of shape\nrepresentations for single view 3d object shape prediction. In IEEE Conference on Computer\nVision and Pattern Recognition (CVPR), 2018.\n\n[40] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale\n\nimage recognition. arXiv preprint arXiv:1409.1556, 2014.\n\n[41] Edward J Smith and David Meger. 
Improved adversarial systems for 3d object generation and\n\nreconstruction. In Conference on Robot Learning, pages 87\u201396, 2017.\n\n[42] Amir Arsalan Soltani, Haibin Huang, Jiajun Wu, Tejas D Kulkarni, and Joshua B Tenen-\nbaum. Synthesizing 3d shapes via modeling multi-view depth maps and silhouettes with deep\ngenerative networks.\n\n[43] Hang Su, Subhransu Maji, Evangelos Kalogerakis, and Erik Learned-Miller. Multi-view convo-\nlutional neural networks for 3d shape recognition. In Proceedings of the IEEE international\nconference on computer vision, pages 945\u2013953, 2015.\n\n[44] Maxim Tatarchenko, Alexey Dosovitskiy, and Thomas Brox. Octree generating networks:\nEf\ufb01cient convolutional architectures for high-resolution 3d outputs. In Proceedings of the IEEE\nConference on Computer Vision and Pattern Recognition, pages 2088\u20132096, 2017.\n\n[45] Nanyang Wang, Yinda Zhang, Zhuwen Li, Yanwei Fu, Wei Liu, and Yu-Gang Jiang. Pixel2mesh:\nGenerating 3d mesh models from single rgb images. arXiv preprint arXiv:1804.01654, 2018.\n\n[46] Zhaowen Wang, Ding Liu, Jianchao Yang, Wei Han, and Thomas Huang. Deep networks for\nimage super-resolution with sparse prior. In Proceedings of the IEEE International Conference\non Computer Vision, pages 370\u2013378, 2015.\n\n[47] Jiajun Wu, Chengkai Zhang, Tianfan Xue, William T Freeman, and Joshua B Tenenbaum.\nLearning a probabilistic latent space of object shapes via 3d generative-adversarial modeling.\nIn Advances in Neural Information Processing Systems, pages 82\u201390, 2016.\n\n[48] Jiajun Wu, Yifan Wang, Tianfan Xue, Xingyuan Sun, Bill Freeman, and Josh Tenenbaum.\nIn Advances In Neural Information\n\nMarrnet: 3d shape reconstruction via 2.5 d sketches.\nProcessing Systems, pages 540\u2013550, 2017.\n\n[49] Jianchao Yang, John Wright, Thomas S Huang, and Yi Ma. Image super-resolution via sparse\n\nrepresentation. 
IEEE transactions on image processing, 19(11):2861\u20132873, 2010.\n\n11\n\n\f", "award": [], "sourceid": 3185, "authors": [{"given_name": "Edward", "family_name": "Smith", "institution": "McGill University"}, {"given_name": "Scott", "family_name": "Fujimoto", "institution": "McGill University"}, {"given_name": "David", "family_name": "Meger", "institution": "University of British Columbia"}]}
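To make the multi-view idea concrete, the round trip the framework relies on, projecting a voxel grid onto its six orthographic depth maps and recombining them by carving away the free space each map implies, can be sketched in a few lines of NumPy. This is an illustrative toy under our own naming and recombination choices, not the authors' released implementation, and it omits the learned silhouette and depth super-resolution networks entirely:

```python
import numpy as np

def orthographic_depth_maps(voxels):
    """Project a cubic boolean voxel grid onto its six orthographic depth maps.

    Each map stores, per ray, the index of the first occupied voxel seen
    from that face of the cube (or n for a ray that misses the object).
    """
    v = voxels.astype(bool)
    n = v.shape[0]
    maps = []
    for axis in range(3):
        for flipped in (False, True):
            g = np.flip(v, axis=axis) if flipped else v
            hit = g.any(axis=axis)                      # which rays hit the object
            depth = np.where(hit, g.argmax(axis=axis), n)
            maps.append(depth)
    return maps

def carve(maps, n):
    """Recombine six depth maps by carving the free space each one implies.

    Exact for shapes fully visible from the six axis-aligned views; an
    upper bound on occupancy for self-occluding concavities.
    """
    v = np.ones((n, n, n), dtype=bool)
    idx = np.arange(n)
    for k, depth in enumerate(maps):
        axis, flipped = divmod(k, 2)
        shape = [1, 1, 1]
        shape[axis] = n
        coord = idx.reshape(shape)                      # coordinate along the view axis
        if flipped:
            coord = (n - 1) - coord                     # distance from the far face
        # Voxels strictly in front of the first hit are empty space;
        # intersecting all six constraints carves out the object.
        v &= coord >= np.expand_dims(depth, axis=axis)
    return v
```

For example, a solid axis-aligned box survives this round trip unchanged, since the intersection of the six carved half-spaces per ray recovers the box exactly; concave details hidden from all six views are where the learned 2D networks, rather than pure carving, must fill in structure.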