{"title": "Unsupervised Learning of Shape and Pose with Differentiable Point Clouds", "book": "Advances in Neural Information Processing Systems", "page_first": 2802, "page_last": 2812, "abstract": "We address the problem of learning accurate 3D shape and camera pose from a collection of unlabeled category-specific images. We train a convolutional network to predict both the shape and the pose from a single image by minimizing the reprojection error: given several views of an object, the projections of the predicted shapes to the predicted camera poses should match the provided views. To deal with pose ambiguity, we introduce an ensemble of pose predictors which we then distill to a single \"student\" model. To allow for efficient learning of high-fidelity shapes, we represent the shapes by point clouds and devise a formulation that allows for differentiable projection of these. Our experiments show that the distilled ensemble of pose predictors learns to estimate the pose accurately, while the point cloud representation enables the prediction of detailed shape models.", "full_text": "Unsupervised Learning of Shape and Pose with Differentiable Point Clouds\n\nEldar Insafutdinov\u2217\nMax Planck Institute for Informatics\neldar@mpi-inf.mpg.de\n\nAlexey Dosovitskiy\nIntel Labs\nadosovitskiy@gmail.com\n\nAbstract\n\nWe address the problem of learning accurate 3D shape and camera pose from a collection of unlabeled category-specific images. We train a convolutional network to predict both the shape and the pose from a single image by minimizing the reprojection error: given several views of an object, the projections of the predicted shapes to the predicted camera poses should match the provided views. To deal with pose ambiguity, we introduce an ensemble of pose predictors which we then distill to a single \u201cstudent\u201d model. 
To allow for efficient learning of high-fidelity shapes, we represent the shapes by point clouds and devise a formulation that allows for differentiable projection of these. Our experiments show that the distilled ensemble of pose predictors learns to estimate the pose accurately, while the point cloud representation enables the prediction of detailed shape models.\n\n1 Introduction\n\nWe live in a three-dimensional world, and a proper understanding of its volumetric structure is crucial for acting and planning. However, we perceive the world mainly via its two-dimensional projections. Based on these projections, we are able to infer the three-dimensional shapes and poses of the surrounding objects. How does this volumetric shape perception emerge from observing only two-dimensional projections? Is it possible to design learning systems with similar capabilities?\nDeep learning methods have recently shown promise in addressing these questions [25, 20]. Given a set of views of an object and the corresponding camera poses, these methods learn 3D shape via the reprojection error: given an estimated shape, one can project it to the known camera views and compare to the provided images. The discrepancy between these generated projections and the training samples provides a training signal for improving the shape estimate. Existing methods of this type have two general restrictions. First, these approaches assume that the camera poses are known precisely for all provided images. This is a practically and biologically unrealistic assumption: a typical intelligent agent only has access to its observations, not its precise location relative to objects in the world. Second, the shape is predicted as a low-resolution (usually 32\u00b3 voxels) voxelized volume. This representation can only describe the very rough shape of an object. 
It should be possible to learn finer shape details from 2D supervision.\nIn this paper, we learn high-fidelity shape models solely from their projections, without ground truth camera poses. This setup is challenging for two reasons. First, estimating both shape and pose is a chicken-and-egg problem: without a good shape estimate it is impossible to learn an accurate pose because the projections would be uninformative, and vice versa, an accurate pose estimate is necessary to learn the shape. Second, pose estimation is prone to local minima caused by ambiguity: an object may look similar from two viewpoints, and if the network converges to predicting only one of these in all cases, it will not be able to learn to predict the other one. We find that the first problem can be solved surprisingly well by joint optimization of shape and pose predictors: in practice, good shape estimates can be learned even with relatively noisy pose predictions. The second problem, however, leads to drastic errors in pose estimation. To address this, we train a diverse ensemble of pose predictors and distill them to a single student model.\n\n\u2217Work done while interning at Intel.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nTo allow learning of high-fidelity shapes, we use the point cloud representation, in contrast with the voxels used in previous works. Point clouds allow for computationally efficient processing, can produce high-quality shape models [6], and are conceptually attractive because they can be seen as \u201cmatter-centric\u201d, as opposed to \u201cspace-centric\u201d voxel grids. To enable learning point clouds without explicit 3D supervision, we implement a differentiable projection operator that, given a point set and a camera pose, generates a 2D projection \u2013 a silhouette, a color image, or a depth map. 
We dub the formulation \u201cDifferentiable Point Clouds\u201d.\nWe evaluate the proposed approach on the task of estimating the shape and the camera pose from a single image of an object.2 The method successfully learns to predict both the shape and the pose, with only a minor performance drop relative to a model trained with ground truth camera poses. The point-cloud-based formulation allows for effective learning of high-fidelity shape models when images of sufficiently high resolution are provided as supervision. We demonstrate learning point clouds from silhouettes and augmenting those with color if color images are available during training. Finally, we show how the point cloud representation can be used to automatically discover semantic correspondences between objects.\n\n2 Related Work\n\nReconstruction of three-dimensional shapes from their two-dimensional projections has a long history in computer vision, constituting the field of 3D reconstruction. A review of this field is beyond the scope of this paper; however, we briefly list several related methods. Cashman and Fitzgibbon [2] use silhouettes and keypoint annotations to reconstruct deformable shape models from small class-specific image collections; Vicente et al. [22] apply similar methods to the large-scale Pascal VOC dataset; Tulsiani et al. [18] reduce the required supervision by leveraging computer vision techniques. These methods show impressive results even in the small data regime; however, they have difficulties with representing diverse and complex shapes. Loper and Black [12] implement a differentiable renderer and apply it for analysis-by-synthesis. Our work is similar in spirit, but operates on point clouds and integrates the idea of differentiable rendering with deep learning. The approach of Rhodin et al. 
[14] is technically similar to ours in that it models the human body with a set of Gaussian density functions and renders them using a physics-motivated equation for light transport. Unlike in our approach, the representation is not integrated into the learning framework and requires careful initial placement of the Gaussians, making it unsuitable for automated reconstruction of arbitrary shape categories. Moreover, the projection method scales quadratically with the number of Gaussians, which limits the maximum fidelity of the shapes being represented.\nThe task of learning 3D structure from 2D supervision has recently been addressed with deep-learning-based methods. These methods are typically based on the reprojection error \u2013 comparing 2D projections of a predicted 3D shape to the ground truth 2D projections. Yan et al. [25] learn 3D shape from silhouettes, via a projection operation based on selecting the maximum occupancy value along a ray. Tulsiani et al. [20] devise a differentiable formulation based on ray collision probabilities and apply it to learning from silhouettes, depth maps, color images, and semantic segmentation maps. Lin et al. [11] represent point clouds by depth maps and re-project them using a high-resolution grid and inverse depth max-pooling. Concurrently with us, Kato et al. [8] propose a differentiable renderer for meshes and use it for learning mesh-based representations of object shapes. All these methods require the exact ground truth camera pose corresponding to the 2D projections used for training. In contrast, we aim to relax this unrealistic assumption and learn only from the projections.\nRezende et al. [13] explore several approaches to generative modeling of 3D shapes based on their 2D views. One of the approaches does not require the knowledge of the ground truth camera pose; however, it is only demonstrated on a simple dataset of textured geometric primitives. 
Most related to our submission is the concurrent work of Tulsiani et al. [21]. That work extends the Differentiable Ray Consistency formulation [20] to learning without pose supervision. The method is voxel-based and deals with the complications of unsupervised pose learning using reinforcement learning and a GAN-based prior. In contrast, we make use of a point cloud representation, use an ensemble to predict the pose, and do not require a prior on the camera poses.\n\n2The project website with code can be found at https://eldar.github.io/PointClouds/.\n\nFigure 1: Learning to predict the shape and the camera pose. Given two views of the same object, we predict the corresponding shape (represented as a point cloud) and the camera pose. Then we use a differentiable projection module to generate the view of the predicted shape from the predicted camera pose. Dissimilarity between this synthesized projection and the ground truth view serves as the training signal.\n\nThe issue of representation is central to deep learning with volumetric data. The most commonly used structure is a voxel grid \u2013 a direct 3D counterpart of a 2D pixelated image [5, 23]. This similarity allows for simple transfer of convolutional network architectures from 2D to 3D. On the downside, however, the voxel grid representation leads to memory- and computation-hungry architectures. This motivates the search for alternative options. Existing solutions include octrees [17], meshes [8, 26], part-based representations [19, 10], multi-view depth maps [15], object skeletons [24], and point clouds [6, 11]. We choose to use point clouds in this work, since they are less overcomplete than voxel grids and allow for effective network architectures, but at the same time are more flexible than mesh-based or skeleton-based representations.\n\n3 Single-view Shape and Pose Estimation\n\n
We address the task of predicting the three-dimensional shape of an object and the camera pose from a single view of the object. Assume we are given a dataset D of views of K objects, with mi views available for the i-th object: D = \u222a_{i=1}^{K} {\u27e8x^i_j, p^i_j\u27e9}_{j=1}^{mi}. Here x^i_j denotes a color image and p^i_j the projection of some modality (a silhouette, a depth map, or a color image) from the same view. Each view may be accompanied by the corresponding camera pose c^i_j, but the more interesting case is when the camera poses are not known. We focus on this more difficult scenario in the remainder of this section.\nAn overview of the model is shown in Figure 1. Assume we are given two images x1 and x2 of the same object. We use parametric function approximators to predict a 3D shape (represented by a point cloud) from one of them, \u02c6P1 = FP(x1, \u03b8P), and the camera pose from the other one: \u02c6c2 = Fc(x2, \u03b8c). In our case, FP and Fc are convolutional networks that share most of their parameters. Both the shape and the pose are predicted as fixed-length vectors using fully connected layers.\nGiven the predictions, we render the predicted shape from the predicted view: \u02c6p_{1,2} = \u03c0(\u02c6P1, \u02c6c2), where \u03c0 denotes the differentiable point cloud renderer described in Section 4. The loss function is then the discrepancy between this predicted projection and the ground truth. We use the standard MSE for all modalities, summed over the whole dataset:\n\nL(\u03b8P, \u03b8c) = \u2211_{i=1}^{K} \u2211_{j1,j2=1}^{mi} \u2016\u02c6p^i_{j1,j2} \u2212 p^i_{j2}\u2016\u00b2.    (1)\n\nIntuitively, this training procedure requires that for all pairs of views of the same object, the renderings of the predicted point cloud match the provided ground truth views.\nEstimating pose with a distilled ensemble. We found that the basic implementation described above fails to predict accurate poses. 
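To make Eq. (1) concrete, here is a minimal NumPy sketch of the pairwise reprojection objective for a single object; `projections` and `targets` are stand-ins for the renderer outputs and the ground truth views (the names are ours, not from the paper's implementation):

```python
import numpy as np

def reprojection_loss(projections, targets):
    """MSE reprojection loss of Eq. (1) for one object with m views.

    projections[j1][j2] -- rendering of the shape predicted from view j1,
                           projected to the camera pose predicted from view j2
    targets[j2]         -- ground-truth projection of view j2
    """
    total = 0.0
    m = len(targets)
    for j1 in range(m):          # view the shape is predicted from
        for j2 in range(m):      # view the camera pose is predicted from
            diff = projections[j1][j2] - targets[j2]
            total += np.sum(diff ** 2)
    return total
```

In the full objective these per-object terms are summed over all K objects in the dataset; a single object is shown here for clarity.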
This is caused by local minima: the pose predictor converges to either estimating all objects as viewed from the back, or all viewed from the front. Indeed, based on silhouettes, it is difficult to distinguish between certain views even for a human; see Figure 2 (a).\n\n(a) Pose ambiguity\n\n(b) Training an ensemble of pose regressors\n\nFigure 2: (a) Pose ambiguity: segmentation masks, which we use for supervision, look very similar from different camera views. (b) The proposed ensemble of pose regressors designed to resolve this ambiguity. The network predicts a diverse set {c_k}_{k=1}^{K} of pose candidates, each of which is used to compute a projection of the predicted point cloud P. The weight update (backward pass shown in dashed red) is only performed for the pose candidate yielding the projection that best matches the ground truth.\n\nTo alleviate this issue, instead of a single pose regressor Fc(\u00b7, \u03b8c), we introduce an ensemble of K pose regressors F^k_c(\u00b7, \u03b8^k_c) (see Figure 2 (b)) and train the system with the \u201chindsight\u201d loss [7, 4]:\n\nL_h(\u03b8P, \u03b8^1_c, . . . , \u03b8^K_c) = min_{k\u2208[1,K]} L(\u03b8P, \u03b8^k_c).    (2)\n\nThe idea is that each of the predictors learns to specialize on a subset of poses and together they cover the whole range of possible values. 
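The hindsight selection of Eq. (2) can be sketched in a few lines; in an actual training step only the winning regressor's loss would be backpropagated, while this NumPy stand-in (names are ours) merely selects it:

```python
import numpy as np

def hindsight_loss(losses_per_regressor):
    """Hindsight loss of Eq. (2): given the reprojection loss obtained with
    each of the K pose regressors, keep only the best one. Returns the
    minimal loss and the index of the winning regressor, which also serves
    as the teacher when distilling the student."""
    k_best = int(np.argmin(losses_per_regressor))
    return losses_per_regressor[k_best], k_best
```

Because only the best candidate receives a gradient, each regressor is free to specialize on the subset of poses where it already wins.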
No special measures are needed to ensure this specialization: it emerges naturally as a result of random weight initialization if the network architecture is appropriate. Namely, the different pose predictors need to have several (at least 3, in our experience) non-shared layers.\nIn parallel with training the ensemble, we distill it to a single regressor by using the best model from the ensemble as the teacher. This best model is selected based on the loss, as in Eq. (2). At test time we discard the ensemble and use the distilled regressor to estimate the camera pose. The loss for training the student is computed as the angular difference between two rotations represented by quaternions: L(q1, q2) = 1 \u2212 Re(q1 q2^{\u22121} / \u2016q1 q2^{\u22121}\u2016), where Re denotes the real part of the quaternion. We found that the standard MSE loss performs poorly when regressing rotation.\nNetwork architecture. We implement the shape and pose predictor with a convolutional network with two branches. The network starts with a convolutional encoder with a total of 7 layers, 4 of which have stride 2. These are followed by 2 shared fully connected layers, after which the network splits into two branches for shape and pose prediction. The shape branch is an MLP with one hidden layer. The point cloud of N points is predicted as a vector with dimensionality 3N (point positions) or 6N (positions and RGB values). The pose branch is an MLP with one shared hidden layer and two more hidden layers for each of the pose predictors. The camera pose is predicted as a quaternion. In the ensemble model we use K = 4 pose predictors. The \u201cstudent\u201d model is another branch with the same architecture.\n\n4 Differentiable Point Clouds\n\nA key component of our model is the differentiable point cloud renderer \u03c0. Given a point cloud P and a camera pose c, it generates a view p = \u03c0(P, c). 
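As a rough illustration of what \u03c0 computes, here is a compact NumPy sketch of rendering a silhouette from a point cloud with fixed-size isotropic Gaussians. All names, the [0, 1]^3 camera-frame convention, and the fixed sigma are our assumptions for the sketch; the actual renderer is implemented as a differentiable operator with learnable, possibly anisotropic point sizes:

```python
import numpy as np

def render_silhouette(points, R, t, grid=(32, 32, 32), sigma=0.05, scale=1.0):
    """Sketch of the renderer pi(P, c): transform the points into the camera
    frame, splat each point as an isotropic Gaussian onto a D1 x D2 x D3
    occupancy grid over [0, 1]^3, convert occupancies to ray termination
    probabilities, and project orthogonally along the depth axis."""
    d1, d2, d3 = grid
    cam = points @ R.T + t  # rigid stand-in for the transform x'_i = T_c x_i
    # voxel centers along each axis
    g1, g2, g3 = np.meshgrid((np.arange(d1) + 0.5) / d1,
                             (np.arange(d2) + 0.5) / d2,
                             (np.arange(d3) + 0.5) / d3, indexing="ij")
    occ = np.zeros(grid)
    for p in cam:  # clipped sum of scaled Gaussian densities (Eq. (3) style)
        sq = (g1 - p[0]) ** 2 + (g2 - p[1]) ** 2 + (g3 - p[2]) ** 2
        occ += scale * np.exp(-0.5 * sq / sigma ** 2)
    occ = np.clip(occ, 0.0, 1.0)
    # ray termination: a ray stops in cell k3 if that cell is occupied and
    # all closer cells along the ray are free (Eq. (4) style)
    free = np.cumprod(1.0 - occ, axis=2)
    prev_free = np.concatenate([np.ones((d1, d2, 1)), free[:, :, :-1]], axis=2)
    return (occ * prev_free).sum(axis=2)  # probability the ray terminates
```

Every step is a composition of differentiable array operations, which is what makes the projection usable inside a gradient-based learning pipeline.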
The point cloud may have a signal, such as color, associated with it, in which case the signal can be projected to the view.\nThe high-level idea of the method is to smooth the point cloud by representing the points with density functions. Formally, we assume the point cloud is a set of N tuples P = {\u27e8xi, si, yi\u27e9}_{i=1}^{N}, each including the point position xi = (x_{i,1}, x_{i,2}, x_{i,3}), the size parameter si, and the associated signal yi (for instance, an RGB color). In most of our experiments the size parameter is a two-dimensional vector including the covariance of an isotropic Gaussian and a scaling factor. However, in general si can represent an arbitrary parametric distribution: for instance, in the supplement we show experiments with Gaussians with a full covariance matrix. The size parameters can be either specified manually or learned jointly with the point positions.\n\nFigure 3: Differentiable rendering of a point cloud. We show 2D-to-1D projection for illustration purposes, but in practice we perform 3D-to-2D projection. The points are transformed according to the camera parameters, smoothed, and discretized. We perform occlusion reasoning via a form of ray tracing, and finally project the result orthogonally.\n\nThe overall differentiable rendering pipeline is illustrated in Figure 3. For illustration purposes we show 2D-to-1D projection in the figure, but in practice we perform 3D-to-2D projection. We start by transforming the positions of points to the standard coordinate frame by the projective transformation Tc corresponding to the camera pose c of interest: x'_i = Tc xi. 
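A projective transform of this kind can be applied to a batch of points via homogeneous coordinates. Below is a small NumPy sketch; the 4x4 matrix T is an assumed encoding of T_c with extrinsics and intrinsics folded together, and the function name is ours:

```python
import numpy as np

def transform_points(points, T):
    """Apply a 4x4 projective transform T to an (N, 3) array of points:
    x'_i = T x_i in homogeneous coordinates, then dehomogenize."""
    n = points.shape[0]
    homo = np.concatenate([points, np.ones((n, 1))], axis=1)  # (N, 4)
    out = homo @ T.T                                          # apply T per point
    return out[:, :3] / out[:, 3:4]                           # back to 3D
```

For a purely rigid T the last row is (0, 0, 0, 1) and the division is a no-op; a perspective component makes the final division nontrivial.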
The transform Tc accounts for both extrinsic and intrinsic camera parameters. We also compute the transformed size parameters s' (the exact transformation rule depends on the distribution used). We set up the camera transformation matrix such that after the transform, the projection amounts to orthogonal projection along the third axis.\nTo allow for gradient flow, we represent each point \u27e8xi, si\u27e9 by a smooth function fi(\u00b7). In this work we set fi to scaled Gaussian densities. The occupancy function of the point cloud is a clipped sum of the individual per-point functions:\n\no(x) = clip(\u2211_{i=1}^{N} fi(x), [0, 1]),    fi(x) = ci exp(\u2212(1/2)(x \u2212 x'_i)^T \u03a3_i^{\u22121}(x \u2212 x'_i)),    (3)\n\nwhere \u27e8ci, \u03a3i\u27e9 = si are the size parameters. We discretize the resulting function to a grid of resolution D1\u00d7D2\u00d7D3. Note that the third index corresponds to the projection axis, with index 1 being the closest to the camera and D3 the furthest from the camera.\nBefore projecting the resulting volume to a plane, we need to ensure that the signal from the occluded points does not interfere with the foreground points. To this end, we perform occlusion reasoning using a differentiable ray tracing formulation, similar to Tulsiani et al. [20]. We convert the occupancies o to ray termination probabilities r as follows:\n\nr_{k1,k2,k3} = o_{k1,k2,k3} \u220f_{u=1}^{k3\u22121} (1 \u2212 o_{k1,k2,u}) if k3 \u2264 D3,    r_{k1,k2,D3+1} = \u220f_{u=1}^{D3} (1 \u2212 o_{k1,k2,u}).    (4)\n\nIntuitively, a cell has a high termination probability r_{k1,k2,k3} if its occupancy value o_{k1,k2,k3} is high and all previous occupancy values {o_{k1,k2,u}}_{u