{"title": "Unsupervised Learning of Shape and Pose with Differentiable Point Clouds", "book": "Advances in Neural Information Processing Systems", "page_first": 2802, "page_last": 2812, "abstract": "We address the problem of learning accurate 3D shape and camera pose from a collection of unlabeled category-specific images. We train a convolutional network to predict both the shape and the pose from a single image by minimizing the reprojection error: given several views of an object, the projections of the predicted shapes to the predicted camera poses should match the provided views. To deal with pose ambiguity, we introduce an ensemble of pose predictors which we then distill to a single \"student\" model. To allow for efficient learning of high-fidelity shapes, we represent the shapes by point clouds and devise a formulation allowing for differentiable projection of these. Our experiments show that the distilled ensemble of pose predictors learns to estimate the pose accurately, while the point cloud representation allows to predict detailed shape models.", "full_text": "Unsupervised Learning of Shape and Pose\n\nwith Differentiable Point Clouds\n\nEldar Insafutdinov\u2217\n\nMax Planck Institute for Informatics\n\neldar@mpi-inf.mpg.de\n\nAlexey Dosovitskiy\n\nIntel Labs\n\nadosovitskiy@gmail.com\n\nAbstract\n\nWe address the problem of learning accurate 3D shape and camera pose from a\ncollection of unlabeled category-speci\ufb01c images. We train a convolutional network\nto predict both the shape and the pose from a single image by minimizing the\nreprojection error: given several views of an object, the projections of the predicted\nshapes to the predicted camera poses should match the provided views. To deal\nwith pose ambiguity, we introduce an ensemble of pose predictors which we then\ndistill to a single \u201cstudent\u201d model. 
To allow for efficient learning of high-fidelity shapes, we represent the shapes by point clouds and devise a formulation allowing for differentiable projection of these. Our experiments show that the distilled ensemble of pose predictors learns to estimate the pose accurately, while the point cloud representation allows prediction of detailed shape models.

1 Introduction

We live in a three-dimensional world, and a proper understanding of its volumetric structure is crucial for acting and planning. However, we perceive the world mainly via its two-dimensional projections. Based on these projections, we are able to infer the three-dimensional shapes and poses of the surrounding objects. How does this volumetric shape perception emerge from observing only two-dimensional projections? Is it possible to design learning systems with similar capabilities?

Deep learning methods have recently shown promise in addressing these questions [25, 20]. Given a set of views of an object and the corresponding camera poses, these methods learn 3D shape via the reprojection error: given an estimated shape, one can project it to the known camera views and compare the result to the provided images. The discrepancy between these generated projections and the training samples provides the training signal for improving the shape estimate. Existing methods of this type have two general restrictions. First, they assume that the camera poses are known precisely for all provided images. This is a practically and biologically unrealistic assumption: a typical intelligent agent only has access to its observations, not its precise location relative to the objects in the world. Second, the shape is predicted as a low-resolution (usually 32^3 voxels) voxelized volume. This representation can only describe a very rough shape of an object. 
It should be possible to learn finer shape details from 2D supervision.

In this paper, we learn high-fidelity shape models solely from their projections, without ground truth camera poses. This setup is challenging for two reasons. First, estimating both shape and pose is a chicken-and-egg problem: without a good shape estimate it is impossible to learn an accurate pose because the projections would be uninformative, and vice versa, an accurate pose estimate is necessary to learn the shape. Second, pose estimation is prone to local minima caused by ambiguity: an object may look similar from two viewpoints, and if the network converges to predicting only one of these in all cases, it will not be able to learn predicting the other one. We find that the first problem can be solved surprisingly well by joint optimization of shape and pose predictors: in practice, good shape estimates can be learned even with relatively noisy pose predictions. The second problem, however, leads to drastic errors in pose estimation. To address this, we train a diverse ensemble of pose predictors and distill those to a single student model.

* Work done while interning at Intel.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

To allow learning of high-fidelity shapes, we use the point cloud representation, in contrast with the voxels used in previous works. Point clouds allow for computationally efficient processing, can produce high-quality shape models [6], and are conceptually attractive because they can be seen as "matter-centric", as opposed to "space-centric" voxel grids. To enable learning point clouds without explicit 3D supervision, we implement a differentiable projection operator that, given a point set and a camera pose, generates a 2D projection - a silhouette, a color image, or a depth map. 
We dub the formulation "Differentiable Point Clouds".

We evaluate the proposed approach on the task of estimating the shape and the camera pose from a single image of an object (see footnote 2). The method successfully learns to predict both the shape and the pose, with only a minor performance drop relative to a model trained with ground truth camera poses. The point-cloud-based formulation allows for effective learning of high-fidelity shape models when images of sufficiently high resolution are provided as supervision. We demonstrate learning point clouds from silhouettes and augmenting those with color when color images are available during training. Finally, we show how the point cloud representation makes it possible to automatically discover semantic correspondences between objects.

2 Related Work

Reconstruction of three-dimensional shapes from their two-dimensional projections has a long history in computer vision, constituting the field of 3D reconstruction. A review of this field is outside the scope of this paper; however, we briefly list several related methods. Cashman and Fitzgibbon [2] use silhouettes and keypoint annotations to reconstruct deformable shape models from small class-specific image collections, Vicente et al. [22] apply similar methods to the large-scale Pascal VOC dataset, and Tulsiani et al. [18] reduce the required supervision by leveraging computer vision techniques. These methods show impressive results even in the small-data regime; however, they have difficulties with representing diverse and complex shapes. Loper and Black [12] implement a differentiable renderer and apply it to analysis-by-synthesis. Our work is similar in spirit, but operates on point clouds and integrates the idea of differentiable rendering with deep learning. The approach of Rhodin et al. 
[14] is technically similar to ours in that it models the human body with a set of Gaussian density functions and renders them using a physics-motivated equation for light transport. Unlike in our approach, the representation is not integrated into the learning framework and requires careful initial placement of the Gaussians, making it unsuitable for automated reconstruction of arbitrary shape categories. Moreover, the projection method scales quadratically with the number of Gaussians, which limits the maximum fidelity of the shapes being represented.

Recently, the task of learning 3D structure from 2D supervision has been addressed with deep-learning-based methods. These methods are typically based on the reprojection error - comparing 2D projections of a predicted 3D shape to the ground truth 2D projections. Yan et al. [25] learn 3D shape from silhouettes, via a projection operation based on selecting the maximum occupancy value along a ray. Tulsiani et al. [20] devise a differentiable formulation based on ray collision probabilities and apply it to learning from silhouettes, depth maps, color images, and semantic segmentation maps. Lin et al. [11] represent point clouds by depth maps and re-project them using a high-resolution grid and inverse depth max-pooling. Concurrently with us, Kato et al. [8] propose a differentiable renderer for meshes and use it for learning mesh-based representations of object shapes. All these methods require exact ground truth camera poses corresponding to the 2D projections used for training. In contrast, we aim to relax this unrealistic assumption and learn only from the projections.

Rezende et al. [13] explore several approaches to generative modeling of 3D shapes based on their 2D views. One of the approaches does not require knowledge of the ground truth camera pose; however, it is only demonstrated on a simple dataset of textured geometric primitives. 
Most related to our submission is the concurrent work of Tulsiani et al. [21]. The work extends the Differentiable Ray Consistency formulation [20] to learning without pose supervision. The method is voxel-based and deals with the complications of unsupervised pose learning using reinforcement learning and a GAN-based prior. In contrast, we make use of a point cloud representation, use an ensemble to predict the pose, and do not require a prior on the camera poses.

² The project website with code can be found at https://eldar.github.io/PointClouds/.

Figure 1: Learning to predict the shape and the camera pose. Given two views of the same object, we predict the corresponding shape (represented as a point cloud) and the camera pose. Then we use a differentiable projection module to generate the view of the predicted shape from the predicted camera pose. Dissimilarity between this synthesized projection and the ground truth view serves as the training signal.

The issue of representation is central to deep learning with volumetric data. The most commonly used structure is a voxel grid - a direct 3D counterpart of the 2D pixelated image [5, 23]. This similarity allows for a simple transfer of convolutional network architectures from 2D to 3D. However, on the downside, the voxel grid representation leads to memory- and computation-hungry architectures. This motivates the search for alternative options. Existing solutions include octrees [17], meshes [8, 26], part-based representations [19, 10], multi-view depth maps [15], object skeletons [24], and point clouds [6, 11]. We choose to use point clouds in this work, since they are less overcomplete than voxel grids and allow for effective network architectures, while at the same time being more flexible than mesh-based or skeleton-based representations.

3 Single-view Shape and Pose Estimation

We address the task of predicting the three-dimensional shape of an object and the camera pose from a single view of the object. Assume we are given a dataset D of views of K objects, with m_i views available for the i-th object: D = ∪_{i=1}^{K} {⟨x^i_j, p^i_j⟩}_{j=1}^{m_i}. Here x^i_j denotes a color image and p^i_j the projection of some modality (a silhouette, a depth map, or a color image) from the same view. Each view may be accompanied by the corresponding camera pose c^i_j, but the more interesting case is when the camera poses are not known. We focus on this more difficult scenario in the remainder of this section.

An overview of the model is shown in Figure 1. Assume we are given two images x_1 and x_2 of the same object. We use parametric function approximators to predict a 3D shape (represented by a point cloud) from one of them, P̂_1 = F_P(x_1, θ_P), and the camera pose from the other one, ĉ_2 = F_c(x_2, θ_c). In our case, F_P and F_c are convolutional networks that share most of their parameters. Both the shape and the pose are predicted as fixed-length vectors using fully connected layers.

Given the predictions, we render the predicted shape from the predicted view: p̂_{1,2} = π(P̂_1, ĉ_2), where π denotes the differentiable point cloud renderer described in Section 4. The loss function is then the discrepancy between this predicted projection and the ground truth. We use the standard MSE for all modalities, summed over the whole dataset:

L(θ_P, θ_c) = ∑_{i=1}^{N} ∑_{j_1,j_2=1}^{m_i} ‖p̂^i_{j_1,j_2} − p^i_{j_2}‖².   (1)

Intuitively, this training procedure requires that for all pairs of views of the same object, the renderings of the predicted point cloud match the provided ground truth views.

Estimating pose with a distilled ensemble. We found that the basic implementation described above fails to predict accurate poses. 
This is caused by local minima: the pose predictor converges to either estimating all objects as viewed from the back, or all as viewed from the front. Indeed, based on silhouettes, it is difficult to distinguish between certain views even for a human, see Figure 2 (a).

Figure 2: (a) Pose ambiguity: segmentation masks, which we use for supervision, look very similar from different camera views. (b) The proposed ensemble of pose regressors designed to resolve this ambiguity. The network predicts a diverse set {c_k}_{k=1}^{K} of pose candidates, each of which is used to compute a projection of the predicted point cloud P. The weight update (backward pass shown in dashed red) is only performed for the pose candidate yielding the projection that best matches the ground truth.

To alleviate this issue, instead of a single pose regressor F_c(·, θ_c), we introduce an ensemble of K pose regressors F^k_c(·, θ^k_c) (see Figure 2 (b)) and train the system with the "hindsight" loss [7, 4]:

L_h(θ_P, θ^1_c, ..., θ^K_c) = min_{k∈[1,K]} L(θ_P, θ^k_c).   (2)

The idea is that each of the predictors learns to specialize on a subset of poses and together they cover the whole range of possible values. 
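The hindsight selection in Eq. (2) reduces to picking the candidate with the lowest reprojection loss. A minimal NumPy sketch, assuming the per-candidate losses L(θ_P, θ^k_c) have already been computed (the function name is ours; the authors' implementation is in TensorFlow):

```python
import numpy as np

def hindsight_select(reprojection_losses):
    """Hindsight loss (Eq. 2): evaluate the reprojection loss for every
    pose candidate in the ensemble and keep only the best one.
    In the actual training step, only the selected candidate (and the
    shared shape predictor) would receive gradients."""
    k_best = int(np.argmin(reprojection_losses))
    return reprojection_losses[k_best], k_best

# Example: candidate 1 explains the view best, so only it is updated.
losses = np.array([3.2, 1.5, 2.0, 4.1])  # K = 4 pose regressors
loss, k = hindsight_select(losses)
```

Because the gradient of a minimum flows only through the achieving branch, training with this loss updates a single pose regressor per sample, which is what lets the members of the ensemble specialize.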
No special measures are needed to ensure this specialization: it emerges naturally as a result of random weight initialization if the network architecture is appropriate. Namely, the different pose predictors need to have several (at least 3, in our experience) non-shared layers.

In parallel with training the ensemble, we distill it to a single regressor by using the best model from the ensemble as the teacher. This best model is selected based on the loss, as in Eq. (2). At test time we discard the ensemble and use the distilled regressor to estimate the camera pose. The loss for training the student is computed as the angular difference between two rotations represented by quaternions: L(q_1, q_2) = 1 − Re(q_1 q_2^{-1} / ‖q_1 q_2^{-1}‖), where Re denotes the real part of the quaternion. We found that the standard MSE loss performs poorly when regressing rotations.

Network architecture. We implement the shape and pose predictor as a convolutional network with two branches. The network starts with a convolutional encoder with a total of 7 layers, 4 of which have stride 2. These are followed by 2 shared fully connected layers, after which the network splits into two branches for shape and pose prediction. The shape branch is an MLP with one hidden layer. The point cloud of N points is predicted as a vector with dimensionality 3N (point positions) or 6N (positions and RGB values). The pose branch is an MLP with one shared hidden layer and two more hidden layers for each of the pose predictors. The camera pose is predicted as a quaternion. In the ensemble model we use K = 4 pose predictors. The "student" model is another branch with the same architecture.

4 Differentiable Point Clouds

A key component of our model is the differentiable point cloud renderer π. Given a point cloud P and a camera pose c, it generates a view p = π(P, c). 
The point cloud may have a signal, such as color, associated with it, in which case the signal can be projected to the view.

The high-level idea of the method is to smooth the point cloud by representing the points with density functions. Formally, we assume the point cloud is a set of N tuples P = {⟨x_i, s_i, y_i⟩}_{i=1}^{N}, each including the point position x_i = (x_{i,1}, x_{i,2}, x_{i,3}), the size parameter s_i, and the associated signal y_i (for instance, an RGB color). In most of our experiments the size parameter is a two-dimensional vector including the covariance of an isotropic Gaussian and a scaling factor. However, in general s_i can represent an arbitrary parametric distribution: for instance, in the supplement we show experiments with Gaussians with a full covariance matrix. The size parameters can be either specified manually or learned jointly with the point positions.

Figure 3: Differentiable rendering of a point cloud. We show 2D-to-1D projection for illustration purposes, but in practice we perform 3D-to-2D projection. The points are transformed according to the camera parameters, smoothed, and discretized. We perform occlusion reasoning via a form of ray tracing, and finally project the result orthogonally.

The overall differentiable rendering pipeline is illustrated in Figure 3. For illustration purposes we show 2D-to-1D projection in the figure, but in practice we perform 3D-to-2D projection. We start by transforming the positions of the points to the standard coordinate frame by the projective transformation T_c corresponding to the camera pose c of interest: x′_i = T_c x_i. 
The transform T_c accounts for both extrinsic and intrinsic camera parameters. We also compute the transformed size parameters s′ (the exact transformation rule depends on the distribution used). We set up the camera transformation matrix such that after the transform, the projection amounts to orthogonal projection along the third axis.

To allow for gradient flow, we represent each point ⟨x_i, s_i⟩ by a smooth function f_i(·). In this work we set f_i to scaled Gaussian densities. The occupancy function of the point cloud is a clipped sum of the individual per-point functions:

o(x) = clip(∑_{i=1}^{N} f_i(x), [0, 1]),   f_i(x) = c_i exp(−(1/2) (x − x′_i)^T Σ_i^{−1} (x − x′_i)),   (3)

where ⟨c_i, Σ_i⟩ = s_i are the size parameters. We discretize the resulting function to a grid of resolution D_1×D_2×D_3. Note that the third index corresponds to the projection axis, with index 1 being the closest to the camera and D_3 the furthest from the camera.

Before projecting the resulting volume to a plane, we need to ensure that the signal from the occluded points does not interfere with the foreground points. To this end, we perform occlusion reasoning using a differentiable ray tracing formulation, similar to Tulsiani et al. [20]. We convert the occupancies o to ray termination probabilities r as follows:

r_{k_1,k_2,k_3} = o_{k_1,k_2,k_3} ∏_{u=1}^{k_3−1} (1 − o_{k_1,k_2,u}) if k_3 ≤ D_3,   r_{k_1,k_2,D_3+1} = ∏_{u=1}^{D_3} (1 − o_{k_1,k_2,u}).   (4)

Intuitively, a cell has a high termination probability r_{k_1,k_2,k_3} if its occupancy value o_{k_1,k_2,k_3} is high and all previous occupancy values {o_{k_1,k_2,u}}_{u<k_3} are low. 
The additional background cell r_{k_1,k_2,D_3+1} serves to ensure that the termination probabilities sum to 1.

Finally, we project the volume to the plane:

p_{k_1,k_2} = ∑_{k_3=1}^{D_3+1} r_{k_1,k_2,k_3} y_{k_1,k_2,k_3}.   (5)

Here y is the signal being projected, which defines the modality of the result. To obtain a silhouette, we set y_{k_1,k_2,k_3} = 1 − δ_{k_3,D_3+1}. For a depth map, we set y_{k_1,k_2,k_3} = k_3/D_3. Finally, to project a signal y associated with the point cloud, such as color, we set y to a discretized version of the normalized signal distribution: y(x) = ∑_{i=1}^{N} y_i f_i(x) / ∑_{i=1}^{N} f_i(x).

4.1 Implementation details

Technically, the most complex part of the algorithm is the conversion of a point cloud to a volume. We have experimented with two implementations of this step: one that is simple and flexible (we refer to it as basic) and another version that is less flexible, but much more efficient (we refer to it as fast). We implemented both versions using standard TensorFlow [1] operations. At a high level, in the basic implementation each function f_i is computed on an individual volumetric grid, and the results are summed. This allows for flexibility in the choice of the function class, but leads to both computational and memory requirements growing linearly with both the number of points N and the volume of the grid V, resulting in the complexity O(NV). The fast version scales more gracefully, as O(N + V). This comes at the cost of using the same kernel for all functions f_i. The fast implementation performs the operation in two steps: first putting all points on the grid with trilinear interpolation, then applying a convolution with the kernel. 
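As a concrete illustration of Eqs. (3)-(5), the following NumPy sketch renders a silhouette from a point cloud already transformed to the camera frame. It follows the basic O(NV) variant with a single shared isotropic Gaussian; the function name, the cubic grid, and the [0, 1)³ coordinate range are our own assumptions for the sketch:

```python
import numpy as np

def project_point_cloud(points, scale, sigma, D):
    """Render a (D, D) silhouette of `points` (shape (N, 3), coordinates
    in [0, 1)^3, already in the camera frame). The last axis is the
    projection axis; index 0 is closest to the camera."""
    # Eq. 3: occupancy = clipped sum of scaled isotropic Gaussians,
    # evaluated on a D x D x D grid of voxel centers.
    ax = (np.arange(D) + 0.5) / D
    grid = np.stack(np.meshgrid(ax, ax, ax, indexing="ij"), axis=-1)
    occ = np.zeros((D, D, D))
    for x in points:
        sq_dist = np.sum((grid - x) ** 2, axis=-1)
        occ += scale * np.exp(-0.5 * sq_dist / sigma ** 2)
    occ = np.clip(occ, 0.0, 1.0)

    # Eq. 4: ray termination probabilities along the projection axis.
    free = np.cumprod(1.0 - occ, axis=2)        # P(ray survives up to k3)
    free_before = np.concatenate(
        [np.ones((D, D, 1)), free[:, :, :-1]], axis=2)
    r = occ * free_before                       # termination inside the volume
    r_bg = free[:, :, -1]                       # background cell, k3 = D3 + 1

    # Eq. 5 with y = 1 - delta_{k3, D3+1}: the silhouette is the
    # probability that the ray terminates anywhere inside the volume.
    return np.sum(r, axis=2)                    # equivalently 1 - r_bg
```

Note that the in-volume termination probabilities and the background cell indeed sum to one per ray (the products telescope), which is what makes the silhouette and depth modalities of Eq. (5) well-defined.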
Further details are provided in the supplement.

5 Experiments

5.1 Experimental setup

Datasets. We conduct the experiments on 3D models from the ShapeNet [3] dataset. We focus on 3 categories typically used in related work: chairs, cars, and airplanes. We follow the train/test protocol and the data generation procedure of Tulsiani et al. [20]: we split the models into training, validation and test sets and render 5 random views of each model with random light source positions and random camera azimuth and elevation, sampled uniformly from [0°, 360°) and [−20°, 40°] respectively.

Evaluation metrics. We use the Chamfer distance as our main evaluation metric, since it has been shown to be well correlated with human judgment of shape similarity [16]. Given a ground truth point cloud P^gt = {x^gt_n} and a predicted point cloud P^pr = {x^pr_n}, the distance is defined as follows:

d_Chamf(P^gt, P^pr) = (1/|P^pr|) ∑_{x^pr∈P^pr} min_{x∈P^gt} ‖x^pr − x‖₂ + (1/|P^gt|) ∑_{x^gt∈P^gt} min_{x∈P^pr} ‖x^gt − x‖₂.   (6)

The two sums in Eq. (6) have clear intuitive meanings. The first sum evaluates the precision of the predicted point cloud by computing how far on average the closest ground truth point is from a predicted point. The second sum measures the coverage of the ground truth by the predicted point cloud: how far on average the closest predicted point is from a ground truth point.

For measuring the pose error, we use the same metrics as Tulsiani et al. [21]: accuracy (the percentage of samples for which the predicted pose is within 30° of the ground truth) and the median error (in degrees). 
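Eq. (6) is straightforward to implement directly. A brute-force NumPy sketch (the function name is ours; the pairwise distance matrix costs O(|P^gt|·|P^pr|) memory, which is fine for clouds of a few thousand points):

```python
import numpy as np

def chamfer_distance(p_gt, p_pr):
    """Symmetric Chamfer distance of Eq. 6 between two (N, 3) arrays:
    precision (predicted -> nearest ground truth) plus
    coverage (ground truth -> nearest predicted)."""
    # Pairwise Euclidean distances, shape (N_gt, N_pr).
    d = np.linalg.norm(p_gt[:, None, :] - p_pr[None, :, :], axis=-1)
    precision = d.min(axis=0).mean()  # nearest GT point per predicted point
    coverage = d.min(axis=1).mean()   # nearest predicted point per GT point
    return precision + coverage
```

For larger clouds the nearest-neighbor queries are typically done with a k-d tree (e.g. `scipy.spatial.cKDTree`) instead of the dense distance matrix.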
Before starting the pose and shape evaluation, we align the canonical pose learned by the network with the canonical pose in the dataset, using the Iterative Closest Point (ICP) algorithm on the first 20 models in the validation set. Further details are provided in the supplement.

Training details. We trained the networks using the Adam optimizer [9], for 600,000 mini-batch iterations. We used mini-batches of 16 samples (4 views of 4 objects). We used a fixed learning rate of 0.0001 and the standard momentum parameters. We used the fast projection in most experiments, unless mentioned otherwise. We varied both the number of points in the point cloud and the resolution of the volume used in the projection operation depending on the resolution of the ground truth projections used for supervision. We used a volume with the same side as the training samples (e.g., a 64^3 volume for 64^2 projections), and we used 2000 points for 32^2 projections, 8000 points for 64^2 projections, and 16,000 points for 128^2 projections.

When predicting dense point clouds, we found it useful to apply dropout to the predictions of the network to ensure an even distribution of points on the shape. Dropout has the effect of selecting only a subset of all predicted points for projection and loss computation. In the experiments reported in Sections 5.2 and 5.3 we started with a very high 90% dropout and linearly reduced it to 0 towards the end of training. We also implemented a schedule for the point size parameters, linearly decreasing from 5% of the projection volume size to 0.3% over the course of training. The scaling coefficient of the points was learned in all experiments. An ablation study is shown in the supplement.

Computational efficiency. A practical advantage of a point-cloud-based method is that it does not require the 3D convolutional decoder needed by voxel-based methods. 
This improves efficiency and allows the method to scale better to higher resolutions. For resolution 32 the training times of the methods are roughly on par. For 64 the training time of our method is roughly 1 day, in contrast with 2.5 days for its voxel-based counterpart. For 128 the training time of our method is 3 days, while the voxel-based method does not fit into 12 GB of GPU memory with our batch size.

          |        Resolution 32           | Resolution 64 | Resolution 128
          | DRC [20]  PTN [25] Ours-V Ours | Ours-V  Ours  | EPCG [11]  Ours
Airplane  |   8.35      3.79    5.57  4.52 |  4.94   3.50  |   4.03     2.84
Car       |   4.35      3.94    3.88  4.22 |  3.41   2.98  |   3.69     2.42
Chair     |   8.01      5.10    5.57  5.10 |  4.80   4.15  |   5.62     3.62
Mean      |   6.90      4.27    5.01  4.61 |  4.39   3.55  |   4.45     2.96

Table 1: Quantitative results on shape prediction with known camera pose. We report the Chamfer distance between normalized point clouds, multiplied by 100. Our point-cloud-based method (Ours) outperforms its voxel-based counterpart (Ours-V) and benefits from higher-resolution training samples.

Figure 4: Learning colored point clouds. Best viewed on screen. We show the input image, as well as two renderings of the predicted point cloud from other views. The general color is preserved well, but fine details may be lost.

5.2 Estimating shape with known pose

Comparison with baselines. We start by benchmarking the proposed formulation against existing methods in the simple setup with known ground truth camera poses and silhouette-based training. We compare to the Perspective Transformer Networks (PTN) of Yan et al. [25], the Differentiable Ray Consistency (DRC) of Tulsiani et al. [20], the Efficient Point Cloud Generation (EPCG) of Lin et al. [11], and to the voxel-based counterpart of our method. PTN and DRC are only available for 32^3 output voxel grid resolution. 
EPCG uses the point cloud representation, same as our method. However, in the original work EPCG has only been evaluated in the unrealistic setup of having 100 random views per object and pre-training from 8 fixed views (the corners of a cube). We re-train this method in the more realistic setting used in this work - 5 random views per object.

The quantitative results are shown in Table 1. Our point-cloud-based formulation (Ours) outperforms its voxel-based counterpart (Ours-V) in all cases. It improves when provided with a higher-resolution training signal, and benefits from it more than the voxel-based method. Overall, our best model (at resolution 128) decreases the mean error by 30% compared to the best baseline. An interesting observation is that at low resolution, PTN performs remarkably well, closely followed by our point-cloud-based formulation. Note, however, that the PTN formulation only applies to learning from silhouettes and cannot be easily generalized to other modalities.

Our model achieves a 50% improvement over the point-cloud-based EPCG, even though EPCG is trained from depth maps, which is stronger supervision than the silhouettes used for our models. When trained with silhouette supervision only, EPCG achieves an average error of 8.20, 2.7 times worse than our model. We believe our model is more successful because our rendering procedure is differentiable w.r.t. all three coordinates of the points, while the method of Lin et al. is differentiable only w.r.t. the depth.

Colored point clouds. Our formulation supports training with supervision other than silhouettes, for instance, color. In Figure 4 we demonstrate qualitative results of learning colored point clouds with our method. Despite the challenges presented by the variation in lighting and shading between different views, the method is able to learn correctly colored point clouds. 
For objects with complex textures the predicted colors get blurred (last example).

Learnable covariance. In the experiments reported above we learned point clouds with all points having identical isotropic covariance matrices. We conducted additional experiments where the covariance matrices are learned jointly with the point positions, allowing for a more flexible representation of shapes. Results are reported in the supplement.

Figure 5: Qualitative results of shape prediction. Best viewed on screen. Rows: input, ground truth, MVC [21], Ours-naive, Ours. Shapes predicted by our naive model with a single pose predictor (Ours-naive) are more detailed than those of MVC [21]. The model with an ensemble of pose predictors (Ours) generates yet sharper shapes. The point cloud representation makes it possible to preserve fine details such as thin chair legs.

          | Shape (d_Chamf)           | Pose (Accuracy / Median error)
          | MVC [21] Ours-naive Ours  | GT pose [21] | MVC [21]    | Ours-naive   | Ours
Airplane  |   4.43      7.22    3.91  | 0.79 / 10.7  | 0.69 / 14.3 | 0.20 / 100.2 | 0.75 / 8.2
Car       |   4.16      4.14    3.47  | 0.90 /  7.4  | 0.87 /  5.2 | 0.49 /  42.8 | 0.86 / 5.0
Chair     |   6.51      4.79    4.30  | 0.85 / 11.2  | 0.81 /  7.8 | 0.50 /  31.3 | 0.86 / 8.1
Mean      |   5.04      5.38    3.89  | 0.85 / 10.0  | 0.79 /  9.0 | 0.40 /  58.1 | 0.82 / 7.1

Table 2: Quantitative results of shape and pose prediction. The naive version of our method predicts the shape quite well, but fails to predict an accurate pose. The full version predicts both shape and pose well.

5.3 Estimating shape and pose

We now drop the unrealistic assumption of having the ground truth camera pose during training and experiment with predicting both the shape and the camera pose. We use ground truth projections at 64-pixel resolution for our method in these experiments. 
We compare to the concurrent Multi-View Consistency (MVC) approach of Tulsiani et al. [21], using the results reported by the authors for pose estimation and the pre-trained models provided by the authors for shape evaluation.

Quantitative results are provided in Table 2. Our naive model (Ours-naive) learns quite accurate shape (7% worse than MVC), despite not being able to predict the pose well. Our explanation is that predicting the wrong pose for similar-looking projections does not significantly hamper the training of the shape predictor. The shape predicted by the full model (Ours) is more precise still: 28% more accurate than MVC and only 10% less accurate than with ground truth pose (as reported in Table 1). Pose prediction improves dramatically, thanks to the diverse ensemble formulation. As a result, our pose prediction results are on average slightly better than those of MVC [21] in both metrics, and even better in median error than the results of training with ground truth pose labels (as reported by Tulsiani et al. [21]).

Figure 5 shows a qualitative comparison of shapes generated with different methods. Even the results of the naive model (Ours-naive) compare favorably to MVC [21]. Introducing the pose ensemble leads to learning a more accurate pose and, as a consequence, more precise shapes. These results demonstrate the advantage of the point cloud representation over the voxel-based one. Point clouds are especially suitable for representing fine details, such as the thin legs of the chairs. We also show typical failure cases of the proposed method. One of the airplanes is rotated by 180 degrees, since the network does not have a way to find which orientation is considered correct.

Figure 6: Discovered semantic correspondences. Points of the same color correspond to the same subset in the point cloud across different instances. The points were selected on two template instances (top left). Best viewed on screen.
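The diverse ensemble formulation credited above can be sketched as a multiple-choice ("hindsight") loss in the spirit of Guzmán-Rivera et al. [7]: each of K pose heads proposes a pose, only the hypothesis with the lowest reprojection loss is trained on a given sample, and a "student" predictor is distilled toward the winning pose. The code below is an illustrative sketch, not the authors' implementation; the function name and the squared-error distillation term are assumptions:

```python
import numpy as np

def ensemble_pose_loss(hyp_losses, hyp_poses, student_pose):
    """Hindsight loss over an ensemble of K pose hypotheses.

    hyp_losses:   (K,) reprojection losses, one per hypothesis head
    hyp_poses:    (K, 4) pose predictions (e.g. quaternions), one per head
    student_pose: (4,) the student predictor's pose

    Returns (total_loss, best_index): only the best hypothesis
    contributes its reprojection loss, and the student is regressed
    toward that winning pose.
    """
    best = int(np.argmin(hyp_losses))
    distill = ((student_pose - hyp_poses[best]) ** 2).sum()
    return hyp_losses[best] + distill, best

hyp_losses = np.array([3.0, 1.0, 2.0])
hyp_poses = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]], float)
loss, best = ensemble_pose_loss(hyp_losses, hyp_poses, np.array([0.0, 1.0, 0.0, 0.0]))
```

Training only the winning head lets different heads specialize in different pose modes, which is what resolves the pose ambiguity the naive single-head model suffers from.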
The shapes of two of the chairs differ somewhat from the true shapes. This is due to the complexity of the training problem and, possibly, overfitting. Still, the shapes look detailed and realistic.

5.4 Discovery of semantic correspondences

Besides higher shape fidelity, the "matter-centric" point cloud representation has another advantage over the "space-centric" voxel representation: there is a natural correspondence between points in different predicted point clouds. Since we predict the points with a fully connected layer, the points generated by the same output unit in different shapes can be expected to carry similar semantic meaning. We empirically verify this hypothesis. We choose two instances from the validation set of the chair category as templates (shown in the top-left corner of Figure 6) and manually annotate 3D keypoint locations corresponding to characteristic parts, such as the corners of the seat, the tips of the legs, etc. Then, for each keypoint we select all points in the predicted clouds within a small distance from the keypoint and compute the intersection of the point indices between the two templates. (Intersecting the indices between two object instances is not strictly necessary, but we found it to slightly improve the quality of the resulting correspondences.) We then visualize the points with these indices on several other object instances, highlighting each set of points with a different color. Results are shown in Figure 6. As hypothesized, the selected points tend to represent the same object parts in different object instances. Note that no explicit supervision was imposed towards this goal: the semantic correspondences emerge automatically.
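The correspondence-extraction procedure described above can be sketched directly. The helper below is hypothetical, and it assumes what the paper relies on: the network emits N points in a fixed order, so index i refers to the same output unit in every predicted cloud:

```python
import numpy as np

def correspondence_indices(template_a, template_b, kp_a, kp_b, radius=0.05):
    """Index-based correspondence discovery as in Section 5.4: for a
    keypoint annotated on two template point clouds, take the indices of
    predicted points within `radius` of the keypoint in each template and
    intersect the two index sets. The resulting indices can then be
    highlighted on any other predicted instance.

    template_a, template_b: (N, 3) predicted point clouds
    kp_a, kp_b: (3,) the keypoint's annotated location on each template
    """
    near_a = np.flatnonzero(np.linalg.norm(template_a - kp_a, axis=1) < radius)
    near_b = np.flatnonzero(np.linalg.norm(template_b - kp_b, axis=1) < radius)
    return np.intersect1d(near_a, near_b)

# Toy example: point index 1 is near the keypoint in both templates.
ta = np.array([[0.0, 0.0, 0.0], [0.01, 0.0, 0.0], [1.0, 1.0, 1.0]])
tb = np.array([[1.0, 1.0, 1.0], [0.0, 0.0, 0.0], [0.01, 0.0, 0.0]])
idx = correspondence_indices(ta, tb, np.zeros(3), np.zeros(3))
```

Intersecting across two templates filters out indices that only happen to land near the keypoint on one instance, which matches the paper's observation that intersection slightly improves correspondence quality.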
We attribute this to the implicit ability of the model to learn a regular, smooth representation of the output shape space, which is facilitated by reusing the same points for the same object parts.

6 Conclusion

We have proposed a method for learning the pose and shape of 3D objects given only their 2D projections, using the point cloud representation. Extensive validation has shown that point clouds compare favorably with the voxel-based representation in terms of efficiency and accuracy. Our work opens up multiple avenues for future research. First, our projection method requires an explicit volume to perform occlusion reasoning. We believe this is just an implementation detail, which might be relaxed in the future with a custom rendering procedure. Second, since the method does not require accurate ground truth camera poses, it could be applied to learning from real-world data. Learning from color images or videos would be especially exciting, but it would require explicit reasoning about lighting and shading, as well as dealing with the background. Third, we used a very basic decoder architecture for generating point clouds, and we believe more advanced architectures [26] could improve both the efficiency and the accuracy of the method. Finally, the fact that the loss is explicitly computed on projections (in contrast with, e.g., Tulsiani et al. [20]) allows directly applying advanced techniques from the 2D domain, such as perceptual losses and GANs, to learning 3D representations.

Acknowledgements

We would like to thank René Ranftl and Stephan Richter for valuable discussions and feedback. We would also like to thank Shubham Tulsiani for providing the models of the MVC method for testing.

References

[1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, et al. TensorFlow: A system for large-scale machine learning. In OSDI, 2016.

[2] T. J. Cashman and A. W. Fitzgibbon. What shape are dolphins?
Building 3D morphable models from 2D images. PAMI, 35, 2013.

[3] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu. ShapeNet: An information-rich 3D model repository. Technical Report arXiv:1512.03012, 2015.

[4] Q. Chen and V. Koltun. Photographic image synthesis with cascaded refinement networks. In ICCV, 2017.

[5] C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese. 3D-R2N2: A unified approach for single and multi-view 3D object reconstruction. In ECCV, 2016.

[6] H. Fan, H. Su, and L. J. Guibas. A point set generation network for 3D object reconstruction from a single image. In CVPR, 2017.

[7] A. Guzmán-Rivera, D. Batra, and P. Kohli. Multiple choice learning: Learning to produce multiple structured outputs. In NIPS, 2012.

[8] H. Kato, Y. Ushiku, and T. Harada. Neural 3D mesh renderer. In CVPR, 2018.

[9] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.

[10] J. Li, K. Xu, S. Chaudhuri, E. Yumer, H. Zhang, and L. Guibas. GRASS: Generative recursive autoencoders for shape structures. SIGGRAPH, 2017.

[11] C.-H. Lin, C. Kong, and S. Lucey. Learning efficient point cloud generation for dense 3D object reconstruction. In AAAI, 2018.

[12] M. M. Loper and M. J. Black. OpenDR: An approximate differentiable renderer. In ECCV, 2014.

[13] D. Rezende, S. M. A. Eslami, S. Mohamed, P. Battaglia, M. Jaderberg, and N. Heess. Unsupervised learning of 3D structure from images. In NIPS, 2016.

[14] H. Rhodin, N. Robertini, C. Richardt, H.-P. Seidel, and C. Theobalt. A versatile scene model with differentiable visibility applied to generative pose estimation. In ICCV, 2015.

[15] A. A. Soltani, H. Huang, J. Wu, T. D. Kulkarni, and J. B. Tenenbaum. Synthesizing 3D shapes via modeling multi-view depth maps and silhouettes with deep generative networks. In CVPR, 2017.

[16] X. Sun, J. Wu, X.
Zhang, Z. Zhang, C. Zhang, T. Xue, J. B. Tenenbaum, and W. T. Freeman. Pix3D: Dataset and methods for single-image 3D shape modeling. In CVPR, 2018.

[17] M. Tatarchenko, A. Dosovitskiy, and T. Brox. Octree generating networks: Efficient convolutional architectures for high-resolution 3D outputs. In ICCV, 2017.

[18] S. Tulsiani, A. Kar, J. Carreira, and J. Malik. Learning category-specific deformable 3D models for object reconstruction. PAMI, 39, 2017.

[19] S. Tulsiani, H. Su, L. J. Guibas, A. A. Efros, and J. Malik. Learning shape abstractions by assembling volumetric primitives. In CVPR, 2017.

[20] S. Tulsiani, T. Zhou, A. A. Efros, and J. Malik. Multi-view supervision for single-view reconstruction via differentiable ray consistency. In CVPR, 2017.

[21] S. Tulsiani, A. A. Efros, and J. Malik. Multi-view consistency as supervisory signal for learning shape and pose prediction. In CVPR, 2018.

[22] S. Vicente, J. Carreira, L. Agapito, and J. Batista. Reconstructing PASCAL VOC. In CVPR, 2014.

[23] J. Wu, C. Zhang, T. Xue, W. T. Freeman, and J. B. Tenenbaum. Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling. In NIPS, 2016.

[24] J. Wu, T. Xue, J. J. Lim, Y. Tian, J. B. Tenenbaum, A. Torralba, and W. T. Freeman. 3D interpreter networks for viewer-centered wireframe modeling. IJCV, 2018.

[25] X. Yan, J. Yang, E. Yumer, Y. Guo, and H. Lee. Perspective transformer nets: Learning single-view 3D object reconstruction without 3D supervision. In NIPS, 2016.

[26] Y. Yang, C. Feng, Y. Shen, and D. Tian. FoldingNet: Interpretable unsupervised learning on 3D point clouds. In CVPR, 2018.