{"title": "Learning elementary structures for 3D shape generation and matching", "book": "Advances in Neural Information Processing Systems", "page_first": 7435, "page_last": 7445, "abstract": "We propose to represent shapes as the deformation and combination of learned elementary 3D structures. We demonstrate that this decomposition into learned elementary 3D structures is highly interpretable and leads to clear improvements in 3D shape generation and matching. More precisely, we present two complementary approaches to learning elementary structures in a deep learning framework: (i) continuous surface deformation learning and (ii) 3D structure points learning. Both approaches can be extended to abstract structures of higher dimensions for improved results. We evaluate our method on two very different tasks: ShapeNet object reconstruction and dense correspondence estimation between human scans. Qualitatively, our approach provides interpretable and repeatable results. Quantitatively, we show a 16% boost for 3D object generation via surface deformation, as well as a clear 6% improvement over state-of-the-art correspondence results on the FAUST inter challenge.", "full_text": "Learning elementary structures for 3D shape generation and matching\n\nTheo Deprelle1*, Thibault Groueix1, Matthew Fisher2, Vladimir G. Kim2, Bryan C. Russell2, Mathieu Aubry1\n\n1LIGM (UMR 8049), École des Ponts, UPE, 2Adobe Research\n\nAbstract\n\nWe propose to represent shapes as the deformation and combination of learnable elementary 3D structures, which are primitives resulting from training over a collection of shapes. We demonstrate that the learned elementary 3D structures lead to clear improvements in 3D shape generation and matching. More precisely, we present two complementary approaches for learning elementary structures: (i) patch deformation learning and (ii) point translation learning. 
Both approaches can\nbe extended to abstract structures of higher dimensions for improved results. We\nevaluate our method on two tasks: reconstructing ShapeNet objects and estimating\ndense correspondences between human scans (FAUST inter challenge). We show\n16% improvement over surface deformation approaches for shape reconstruction\nand outperform FAUST inter and intra challenge state of the art by 2% and 7%,\nrespectively.\n\n1\n\nIntroduction\n\nCurrent surface-parametric approaches for generating a surface or aligning two surfaces, such as\nAtlasNet [11] and 3D-CODED [10], rely on alignment of one or more shape primitives to a target\nshape. The shape primitives can be a set of patches or a sphere, as in AtlasNet, or a human template\nshape, as in 3D-CODED. These approaches could easily be extended to other parametric shapes, such\nas blocks [22], generalized cylinders [4], or modern shape abstractions [16, 26, 28]. While surface-\nparametric approaches have achieved state-of-the-art results for (single-view) shape reconstruction\n[11] and 3D shape correspondences [10], they rely on hand-chosen parametric shape primitives tuned\nfor the target shape collection and task. In this paper, we ask \u2013 what is the right set of primitives to\nrepresent a collection of diverse shapes?\nTo address this question, we seek to go beyond manually choosing shape primitives and automatically\nlearn what we call \u201clearnable elementary structures\u201d from a shape collection, which can be used for\nshape reconstruction and matching. The ability to automatically learn elementary structures allows\nthe shape generator to \ufb01nd a better set of primitives for a shape collection and target task. We \ufb01nd\nthat learned elementary structures correspond to recurrent parts among 3D objects. For example, in\nFigure 1, we show automatically learned elementary structures roughly corresponding to the tail,\nwing, and reactor of an airplane. 
Moreover, we \ufb01nd that learning the elementary structures leads to\nan improvement in shape reconstruction and correspondence accuracy.\nWe explore two approaches for learning elementary structures \u2013 patch deformation learning and point\ntranslation learning. For patch deformation learning, similar to AtlasNet [11], we start from a surface\nelement, such as a 2D square, and deform it into the learned structure using a multi-layer perceptron\n[23]. This approach has the advantage that the learned elementary structures are continuous surfaces.\nIts key difference with respect to AtlasNet is that the deformations, and thus the elementary structures,\nare common to all shapes. For point translation learning, starting from a \ufb01xed set of points, we\noptimize their position to reconstruct the target objects. The drawback of this approach is that it does\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\f(a) Input target shapes\n\n(b) Learned elementary structures\n\n(c) Our reconstructions\n\nFigure 1: Problem statement. We seek to automatically learn a set of primitives (called \u201clearned\nelementary structures\u201d) for shape reconstruction and matching. (a) Input target shapes to reconstruct.\n(b) Learned elementary structures roughly corresponding to the tail, wing, and reactor of airplanes.\n(c) Our output reconstructions with learned elementary structures highlighted.\n\nnot produce a continuous surface \u2013 only a \ufb01nite set of points. However, this approach is more \ufb02exible\nsince it can, for example, change the topology of the structure.\nWe show how to deform and combine our learnable elementary structures to explain a given 3D\nshape. 
At inference, given the learned elementary structures, we learn to position the structures\nby adjustment \u2013 a linear (projective) transformation will lead to maximum interpretability, while\na complex transformation parameterized by a multi-layer perceptron will make our approaches\ngeneralizations of prior shape reconstruction methods [11, 10] using optimized instead of manually\nde\ufb01ned templates. Moreover, such representation allows for disentanglement of the structure\u2019s shape\nand pose. We include structure learning in a deep architecture that uni\ufb01es shape abstraction and deep\nsurface deformation approaches.\nWe demonstrate that our architecture leads to improvements for 3D shape generation and matching\n\u2013 16% relative improvement over AtlasNet for generic object shape reconstruction and 7% and 2%\nover 3D-CODED for human shape matching on Faust [5] Intra and Inter challenges, respectively,\nachieving state of the art for the latter task. Our code is available on our project webpage1\n2 Related Work\n\nPrimitive \ufb01tting is a classic topic in computer vision [22], with a large number of methods targeting\nparsimonious shape approximations, such as generalized cylinders[4] and geons [3]. Ef\ufb01cient \ufb01tting\nof these primitives attracted a lot of research efforts [13, 18, 24, 25]. Since these methods analyze\nshapes independently, they are not expected to use the primitives consistently across different objects,\nwhich makes the result unsuitable for discovering a common structure in a collection of shapes,\nperforming consistent segmentation, or correspondence estimation. To address these limitations\nsome methods optimize for consistent primitive \ufb01tting over the entire shape collection [15], or aim\nto discover a consistent set of parts [9, 12, 27]. 
The resulting optimization problems are usually non-convex, and thus existing solutions tend to be slow, require heuristics, and are prone to getting stuck in local optima.

Learning-based techniques offer a promising alternative to hand-crafted heuristics. Zhu et al. [31] use a Recurrent Neural Network supervised by a traditional heuristic-based algorithm for cuboid fitting. Tulsiani et al. [28] use a reconstruction loss to predict the parameters of cuboids that approximate an input shape, and thus do not require any direct supervision. Several recent techniques, concurrent to our work, extend this approach by using more complex primitives that can better approximate the surface, such as anisotropic 3D Gaussians [8], category-specific morphable models [14], or superquadrics [20]. All of these techniques use a collection of simple hand-picked parametric primitives. In contrast, we propose to learn a set of deformable primitives that best approximate a collection of shapes.

One can further improve reconstruction by fitting a diverse set of primitives [17] or constructive solid geometry graphs [26]. These methods, however, usually do not produce consistent fitting

1http://imagine.enpc.fr/ deprellt/atlasnet2

(a) Points translation learning module

(b) Patch deformation learning module

(c) Elementary structures combination model

Figure 2: Approach overview. At training time, we learn (a) translations ti or (b) deformations di that transform points from the unit square Si into shared learned elementary structures (c). 
At evaluation time, we transform each elementary structure Ei to the target shape Z using learned shape-dependent adjustment networks pi that produce points on the surface of the output shape O.

across different shapes, and thus cannot be used to discover common shape structures or inter-shape relationships.

On the other side of the spectrum, instead of simple primitives, some techniques fit deformable mesh models [1, 2, 19, 32]. While they can capture complex structures, these techniques are also prone to getting stuck in local optima, due to the large number of degrees of freedom (e.g., mesh vertex coordinates).

Neural network architectures have been used to facilitate mesh fitting [10], learning to predict the deformation of a template to reconstruct an unstructured input point cloud. This approach is sensitive to the choice of the template. We demonstrate that our method improves the quality of the fitting by learning the structure of the reference shape. Neural mesh fitting has also been employed for geometrically and topologically diverse datasets that do not have a natural template. In these cases, meshed planes or spheres can be deformed into complex 3D structures [11, 30]. We extend this line of work by proposing a technique for learning the base shapes that are further used to approximate the shapes in the collection. Learning these elementary structures enables us to more accurately and consistently reconstruct the shapes in the collection.

3 Approach

We aim to learn shared elementary structures to reconstruct a set of 3D shapes. We visualize an overview of our approach in Figure 2. We formulate two ways to learn elementary structures, via patch deformation learning and point translation learning modules. The elementary structures are learned over the entire training set and do not depend on the input during testing. 
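As a rough, framework-free sketch of this decomposition (plain Python with illustrative placeholder names; the actual modules in the paper are learned neural networks):

```python
# Illustrative sketch only: an output shape is the union, over the K
# learned elementary structures, of their points mapped by a
# shape-conditioned adjustment function. `adjust` stands in for the
# learned adjustment modules p_k; none of these names are the authors'.
def reconstruct(structures, feature, adjust):
    """structures: list of K point lists; feature: shape descriptor f(Z);
    adjust(k, e, feature) -> 3D point placed on the output surface."""
    output = []
    for k, E_k in enumerate(structures):
        for e in E_k:
            output.append(adjust(k, e, feature))
    return output
```

For instance, with an identity adjustment the reconstruction is simply the concatenation of the elementary structures; the learned adjustments instead place each structure on the target surface.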
At test time, the elementary structures are deformed by adjustment modules to create the output 3D shape. These modules take as input features computed from the input by an encoder network, together with the coordinates of the elementary structure points, and output the 3D coordinates of the deformed primitives.

For the task of 3D shape reconstruction, we assume that we are given a training set Z of target shapes Z ∈ Z. Our goal is to reconstruct the target shapes using a set of K learned elementary structures E1, . . . , EK, which are deformed via shape-dependent adjustment modules p1, . . . , pK. We represent each shape by a feature vector f(Z) computed by a point set encoder f (defined later in this section). Each adjustment module pk takes as input the coordinates of a point e ∈ Ek in the associated elementary structure and the feature vector f(Z) of the target shape, and outputs the 3D coordinates of the corresponding point. The output shape O = p(Z) can thus be written as the union over learned and adjusted elementary structures,

O = p(Z) = ∪_{k=1}^{K} ∪_{e ∈ Ek} pk(e, f(Z)).   (1)

If the elementary structures were unit squares or a unit sphere, then this equation would describe exactly the AtlasNet [11] model. On the other hand, the 3D-CODED model [10] uses an instance of Z as a single elementary structure. Generalizing these approaches, our goal is to automatically learn the elementary structures Ek over a shape collection. The intuition behind our approach is that if the elementary structures Ek have shapes that are useful for reconstructing the targets, the adjustments pk should be easier to learn and more interpretable.

3.1 Learnable elementary structures

For each k ∈ {1, . . . , K}, we start from an initial surface Sk on which we sample N points to obtain an initial point cloud Sk. We then pass each sampled point sk,i ∈ Sk for i ∈ {1, . . .
, N} through an elementary structure learning module ψk.

We consider two types of elementary structure learning modules ψk. The first type, the patch deformation learning module, learns a continuous mapping dk that produces deformed points ek,i = dk(sk,i) from the sampled points sk,i. The intuition behind the deformation module is that the elementary structures Ek should be surface elements, and can thus be obtained as transformations of the original surfaces Sk. Alternatively, we consider a point translation learning module, which translates each point sk,i independently by a learned vector tk,i: ek,i = sk,i + tk,i. This module thus allows the network to update the position of each point on the surface independently. Either module produces a set of elementary structure points ek,i = ψk(sk,i), and we write the elementary structure Ek as the union of the independently deformed or translated points ψk(sk,i) for sk,i ∈ Sk. In Section 4 we show that different choices can be desirable depending on the application domain.

Dimensionality of the elementary structures. While it is natural to consider elementary structures as sets of 3D points, we can extend the idea to other dimensions. We experimented with 2D, 3D, and 10D elementary structures and show that, while they are less interpretable, higher-dimensional structures lead to better shape reconstruction results.

3.2 Architecture details

The following describes our final network in more detail.

Shape encoder. We represent the input shape as a point cloud, and as shape encoder we use a simplified version of the PointNet network [21] used in [10, 11]. We represent each 3D point of the input shape as a 1024-dimensional vector using a multi-layer perceptron with 3 hidden layers of 64, 128 and 1024 neurons and ReLU activations. 
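The two structure-learning modules of Section 3.1 can be sketched as follows. This is a hedged, framework-free illustration with invented names; the paper implements both with learned neural networks:

```python
# Point translation module: one learnable offset t_i per sampled point
# s_i, giving e_i = s_i + t_i (each point moves independently, so the
# structure's topology can change, but the point count is fixed).
class PointTranslation:
    def __init__(self, n_points, dim=3):
        self.t = [[0.0] * dim for _ in range(n_points)]  # learned offsets

    def __call__(self, points):
        return [[s + o for s, o in zip(p, off)]
                for p, off in zip(points, self.t)]

# Patch deformation module: a single continuous map d applied to every
# point, so arbitrarily many points can be sampled and deformed.
class PatchDeformation:
    def __init__(self, d):
        self.d = d  # stands in for the learned MLP d_k

    def __call__(self, points):
        return [self.d(p) for p in points]
```

The contrast between the two is exactly the trade-off discussed in Section 4: per-point offsets are more flexible, while a continuous map yields a resampleable surface.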
We then apply max-pooling over all point features, followed by a linear layer, producing a global shape feature used as input to the adjustment modules.

Patch deformation learning module. The patch deformation learning modules are continuous-space deformations that we learn as multi-layer perceptrons with 3 layers of 128, 128 and 3 neurons and ReLU activations. This module takes as input the coordinates of points in the initial structures and can compute not only a set of points [11] but the full image of a surface. If this module is used, we can densely sample points on the generated surface.

Point translation learning module. The point translation learning modules learn a translation for each of the N points of the associated initial structure. While this gives more flexibility than generating points through the patch deformation learning module, it can only be applied to a fixed number of points, similar to point-based shape generation [7].

Adjustment module. The goal of the adjustment modules pk is to reconstruct the input shape by positioning each elementary structure. The intuition is that this adjustment should be relatively simple. However, we can expect the quality of the reconstruction to increase when using more complex adjustment modules. In this paper, we consider two cases:

• Linear adjustment: each adjustment module applies an affine transformation to the corresponding elementary structure. The parameters of this transformation are predicted by a multi-layer perceptron that takes as input the point cloud feature vector generated by the encoder. 
We use three hidden MLP layers (512, 512, 12), ReLU activations, BatchNorm layers, and a hyperbolic tangent at the last layer for this module.

• MLP adjustment: each adjustment module uses a multi-layer perceptron (MLP) that takes as input the concatenation of the coordinates of a point from the associated elementary structure and the shape feature predicted by the shape encoder, and outputs 3D coordinates. We use the same architecture as [11] for this network to obtain comparable results.

                    Single-category training    Multi-category training
                    Airplanes   Chairs          Airplanes   Chairs   All
Linear adjustment
  AtlasNet [11]     1.57        4.14            2.22        3.72     3.07
  Deformation       1.16        2.76            1.49        2.52     2.26
  Points            1.04        2.00            1.35        2.47     2.11
MLP adjustment
  AtlasNet [11]     0.91        1.64            0.81        1.50     1.45
  Deformation       0.87        1.56            0.81        1.25     1.43
  Points            0.79        1.43            0.71        1.25     1.22

Multi-category training    Points   Def.
MLP adjustment
  2D                       1.28     1.42
  3D                       1.22     1.43
  10D                      1.21     1.39
Linear adjustment
  2D                       2.45     2.75
  3D                       2.11     2.26
  10D                      1.66     1.90

Table 1: ShapeNet reconstruction. We evaluate variants of our method on single- and multi-category reconstruction tasks. Left: linear vs. MLP adjustment and patch deformation vs. point translation, with 3D elementary structures. Right: different template dimensionalities and deformation vs. point learning modules in the multi-category setup with MLP adjustment. We report the Chamfer distance (multiplied by 10^-3). AtlasNet uses 10 patch primitives, the same number as our approach, but without the learned elementary structures.

3.3 Losses and training

We now discuss two scenarios in which we tested our approach.

Training with correspondences. In this scenario, we assume point correspondences across all training examples and a common template that we can use as an initial structure for all shapes. 
More precisely, we assume that each training shape Z is represented as an ordered set of N 3D points z1, . . . , zN in consistent locations on all shapes. Since all shapes are in correspondence, we consider a single elementary structure S1 (K = 1) and N points s1,1, . . . , s1,N sampled on the shape. We then train our network to minimize the following squared loss between the sampled points zi on each training shape and the points reconstructed from the sampled template points s1,i:

Lsup(θ) = Σ_{Z ∈ Z} Σ_{i=1}^{N} ‖zi − p1(ψ1(s1,i), f(Z))‖²,   (2)

where θ are the parameters of the networks. Note that at inference we do not need to know the correspondences of the points in the test shape, since they are processed by the point set encoder, which is invariant to the order of the points. Instead, the points in the reconstructed shapes will be in correspondence with the elementary structure and, by extension, with each other. We use this property to predict correspondences between test shapes, following the pipeline of [10]. Learning the elementary structures is the difference between our approach and 3D-CODED [10] in this scenario, and it leads to improved reconstruction and correspondence accuracy.

Training without correspondences. We are also able to train our system when no correspondence supervision is available during training. In this case, there are many options for our choice of elementary structures. To be comparable with AtlasNet [11], we assume we have K elementary structures and that each initial structure Sk is a unit 2D square patch. 
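The supervised loss of Eq. (2) can be sketched as follows, with placeholder callables `psi` and `p1` standing in for the learned modules (an illustrative, framework-free sketch, not the authors' implementation):

```python
# Sketch of the supervised squared loss (Eq. 2) for one training shape:
# each ordered target point z_i is compared with the reconstruction of
# the corresponding template point s_i. `psi`, `p1`, and `feature` are
# placeholders for the learned modules and the encoder output f(Z).
def supervised_loss(targets, template, psi, p1, feature):
    total = 0.0
    for z, s in zip(targets, template):
        r = p1(psi(s), feature)  # reconstructed point for template point s
        total += sum((zc - rc) ** 2 for zc, rc in zip(z, r))
    return total
```

Summing this quantity over all training shapes Z gives the full objective of Eq. (2).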
For a given training shape Z, we compute the output shape O = p(Z) according to Equation 1, and train our network's parameters to minimize the symmetric Chamfer distance [7] between the point clouds p(Z) and Z:

Lunsup(θ) = Σ_{Z ∈ Z} Σ_{z ∈ Z} min_{k ∈ {1,...,K}, i ∈ {1,...,N}} ‖z − pk(ψk(sk,i), f(Z))‖² + Σ_{Z ∈ Z} Σ_{k=1}^{K} Σ_{i=1}^{N} min_{z ∈ Z} ‖z − pk(ψk(sk,i), f(Z))‖²,   (3)

where θ are the parameters of the networks. In all of our experiments, we used K = 10.

Training details. We use the Adam optimizer with a learning rate of 0.001, a batch size of 16, and batch normalization layers. We train our method using input point clouds of 2500 points when correspondences are not available and 6800 points when correspondences are available. When training using only the deformation modules dk, we resample the initial surfaces Sk at each training step to minimize over-fitting. At inference time, we sample a regular grid to allow easy mesh generation. We train our model on an NVIDIA 1080Ti GPU, with a 16-core Intel I7-7820X CPU (3.6GHz), 126GB RAM, and SSD storage. Training takes about 48h for most experiments. Using the trained models from the official implementation on all categories, AtlasNet-25 performance is 1.56 (see also Table 1 in the AtlasNet paper). Using the released code to train AtlasNet-10 yields an error of 1.55. By adding a learning rate schedule to the original implementation we decreased this error to 1.45, and we report this improved baseline (see Table 1).

(a) Category-specific 2D elementary structures (3 out of 10 structures) learned for chairs (left) and planes (right).

(b) Reconstructions using elementary structures with category-specific training.

(c) 2D elementary structures learned from all categories (7 out of 10 structures are shown).

(d) Our reconstructions using 2D elementary structures trained on all categories.

(e) AtlasNet reconstructions using square patch primitives trained on all categories.

Figure 3: We visualize elementary structures using the point learning and MLP adjustment modules. For all reconstruction results, we show in color the points corresponding to the visualized 2D primitives. For AtlasNet, the primitives are unit squares (so we do not show the elementary structures), and we visualize seven of them for the reconstruction (similarly to our method). Contrary to AtlasNet, our learned elementary structures have limited overlap in the reconstructions and better reconstruct the shapes.

(a) MLP adjustment

(b) Linear adjustment

Figure 4: Three (out of ten) 3D elementary structures learned by the point translation learning approach when training on all ShapeNet categories.

4 Experiments

In this section, we show qualitative and quantitative results of our approach on the tasks of shape reconstruction and shape matching.

4.1 Generic object shape reconstruction

We evaluate our approach on non-articulated generic 3D object shapes for the task of shape reconstruction. We use the training setting without correspondences described in Section 3.3.

Dataset, evaluation criteria, baseline. We evaluate on the ShapeNet Core dataset [6]. For single-category reconstruction, we evaluated over the airplane (5424/1360 train/test shapes) and chair (3248/816) categories. For multi-category reconstruction, we used 13 categories – airplane, bench, cabinet, car, chair, monitor, lamp, speaker, firearm, couch, table, cellphone, watercraft (31760/7952). We report the symmetric Chamfer distance between the reconstructed and target shapes. 
All reported Chamfer results are multiplied by 10^-3. As a baseline, we compare against AtlasNet [11] with ten unit-square primitives.

Single-category shape reconstruction. For our first experiment, we trained separate networks for the different ShapeNet Core categories. Figure 3a shows 2D elementary structures learned using ten 2D unit squares as initial structures Sk. In Figure 3b, we show shape reconstructions using our point translation learning module with MLP adjustment. Note the emergence of symmetric and topologically complex elementary structures.

Multi-category shape reconstruction. We now evaluate how well our method generalizes when trained on multiple categories, again using 2D elementary structures with the point translation learning module and MLP adjustment. As in the single-category case, we observe the discovery of non-trivial 2D elementary structures (Figure 3c) that are used to accurately reconstruct the shapes (Figure 3d), with higher fidelity than the baseline performance of AtlasNet with ten 2D square patches (Figure 3e). Note how AtlasNet is less faithful to the topology of the reconstructed shapes, incorrectly synthesizing geometry in hollow areas between the back and the seat. Our quantitative evaluation in Table 1 confirms that AtlasNet provides less accurate reconstructions than our method.

Linear vs. MLP adjustment. We evaluated networks trained in both the single- and multi-category settings with linear and MLP adjustment modules, using 3D learned elementary structures (Table 1 left, Figure 4). In all experimental setups, we observe that the MLP adjustment offers significant quantitative improvements over restricting the network to linear transformations of the elementary structures. This result is expected, as linear adjustment allows only limited adaptation of the elementary structures to each shape. 
Similar to shape abstraction methods [28], linear adjustment allows better intuition of the shape generation process but limits the reconstruction accuracy. Using MLP adjustment, however, offers the network more flexibility to faithfully reconstruct the shapes.

Patch deformation vs. point translation modules. We compare using patch deformation vs. point translation modules in Table 1. The patch deformation learning module does not allow topological changes or discontinuities in the mapping, and produces inferior results in comparison to point translation learning. On the other hand, learning patch deformations enables the estimation of the entire deformation field. Thus one can warp an arbitrary number of points, or even tessellate the domain and warp the entire mesh to generate a polygonal surface, which is more amenable to tasks such as rendering.

Figure 5: 3D elementary structure obtained with point learning when initializing the training from a template shape (left) or a random set of points (right). See text for details.

Higher-dimensional structures. We experimented with the dimensionality of the learned elementary structures. Figures 3a and 3c suggest that learned 2D elementary structures can capture interesting topological and symmetric aspects of the data, splitting, for instance, the patch into two identical parts for the legs of the chairs. Note also the variable point density. Similarly, learned 3D elementary structures with linear adjustment and patch deformation learning modules are shown in Figure 1 for the airplane category. Note that they roughly correspond to meaningful parts, such as the wings, tail, and reactor. Figure 4 shows 3D elementary structures inferred from all ShapeNet categories, where the learned structures include non-trivial elements such as symmetric planes, sharp angles, and smooth parabolic surfaces. The learned structures often correspond to consistent parts in the reconstructions. 
In our quantitative evaluations (Table 1, right), we found that the results improve with the dimensionality. The improvement diminishes for higher-dimensional spaces, which are also more difficult to visualize and interpret.

Consistency in template elementary structures. We experimented with several initializations of our elementary structures on the ShapeNet plane category. We used the point translation learning method and a single 3D elementary structure. In Figure 5, we show our results when initializing the elementary structure with either an airplane 3D model (left) or a set of random 3D points sampled uniformly (right). Notice that the learned 3D elementary structure is similar regardless of the initial template shape.

Generalization to new categories. To test the generality of our approach, we trained on the chair category using ten 2D elementary structures and tested on the table category. As shown in Figure 6, point translation learning outperforms both patch deformation learning and AtlasNet. Figure 7 shows qualitatively how the elementary structures are positioned on chairs and tables. Notice how the chair and table legs are reconstructed by the same elementary structures.

           Chairs   Tables
AtlasNet   1.64     4.70
Patch.     1.56     4.82
Point.     1.34     4.45

Figure 6: Category generalization. Chamfer distance for networks trained on chairs and tested on either the chairs or tables test sets.

Figure 7: Elementary structures learned on chairs (left) used to reconstruct chairs and tables (right).

             Param.      Chamfer
AtlasNet     1.8 × 10^8  1.45
6-layer AN   3.9 × 10^8  1.35
Patch.       1.8 × 10^8  1.43
Point.       1.8 × 10^8  1.22

Figure 8: Impact of the number of parameters on reconstruction error.

Number of parameters. In Figure 8, we show the number of parameters for AtlasNet and our method. 
Our method has less than 1% additional parameters to learn the elementary structures – 2.0 × 10^6 and 2.5 × 10^3 for patch deformation and point translation, respectively (orders of magnitude smaller than the 1.8 × 10^8 of the full network). During inference, our approach has the same complexity as AtlasNet, as the elementary structures are precomputed and remain fixed for all shapes. We also tried training AtlasNet with six layers (6-layer AN), which significantly increases the number of parameters. Our approach with point translation learning outperforms all methods.

4.2 Human shape reconstruction and matching

We now evaluate our approach on 3D human shapes for the tasks of shape reconstruction and matching, using the training setup with correspondences described in Section 3.3. For this task, we use a single elementary structure for the human body, using one of the meshes as the initial structure S1. Since we use a single elementary structure and the shapes are deformable, we only report results using the MLP adjustment.

Datasets, evaluation criteria, baselines. We train our method using the SURREAL dataset [29], extended to include some additional bend-over poses as in 3D-CODED [10]. We use 229,984 SURREAL meshes of humans in various poses for training and 224 SURREAL meshes to test reconstruction quality. To evaluate correspondences on real data, we use the FAUST benchmark [5], consisting of 200 testing scans with ∼170k vertices from the “inter” challenge, including noise and holes which are not present in our training data. As a baseline, we compare against 3D-CODED [10].

(a) Learned points

(b) Learned deformation

(c) Learned deformation

Figure 9: Initial shape (left) and learned elementary structure (right) using the deformation or points learning modules. 
Notice the similarity between the elementary structures learned with the different approaches.

             SURREAL [29]   FAUST [5] Inter | Intra
3D-CODED     1.32           2.64 | 1.747
Deformation  1.44           2.58 | 1.742
Points       1.00           2.71 | 1.626

        SURREAL [29]
        Points   Deform.
2D      1.54     6.76
3D      1.00     1.44
10D     1.06     1.18

Table 2: Human correspondences and reconstruction. We evaluate different variants of our method (deformation vs. point translation learning, and different template dimensionalities) for surface reconstruction (SURREAL column) and matching (FAUST column). We report the Chamfer loss for the former and the correspondence error (measured by the distance between corresponding points) for the latter. Results in the left table use 3D elementary structures; the only difference with the 3D-CODED baseline is thus the template/elementary structure learning. The table on the right shows results with elementary structures of different dimensions.

Results. Figure 9 shows elementary structures learned using deformation or point translation learning and different initial surfaces. We observe that the learned templates are inflated, bent, and have their arms and legs in a similar pose, suggesting a reasonable amount of consistency in the properties of a desirable primitive shape for this task.

As before, we found that point translation learning provides the best reconstruction (see the SURREAL column in Table 2). Both of our approaches also provide a lower reconstruction loss than 3D-CODED. We used the reconstruction to estimate correspondences by finding the closest points on the deformed elementary structure, as in 3D-CODED [10]. We report the correspondence error in the FAUST column of Table 2. We observe that deformation learning provides better correspondences than point learning, also yielding state-of-the-art results and a clear improvement over 3D-CODED. 
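The closest-point correspondence step described above can be sketched as follows (a minimal pure-Python illustration with invented names; a practical implementation would use an accelerated nearest-neighbor search over the ∼170k scan vertices):

```python
# Illustrative sketch of correspondence estimation: each vertex of a
# target scan is matched to the index of the nearest point of the
# deformed elementary structure (the reconstruction). Two scans are
# then put in correspondence through these shared indices.
def match(scan, reconstruction):
    """Return, for each scan vertex, the index of the closest reconstructed point."""
    def sq(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(reconstruction)), key=lambda j: sq(v, reconstruction[j]))
            for v in scan]
```

Because the reconstructed points are, by construction, in correspondence with the elementary structure, matching two scans reduces to comparing their per-vertex indices.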
This result is not surprising, because understanding the deformation field of the entire surface is more relevant for matching and correspondence problems.
Elementary structure dimension. As for generic object reconstruction, we evaluate 2D, 3D, and 10D elementary structures (Table 2, right). Note that with the patch deformation learning module we control the output size, so it is easy to map the input 3D template to a higher- or lower-dimensional elementary structure. The point translation learning module, on the other hand, does not allow changing the dimensionality of the input template. Hence, for 2D elementary structures we project the 3D template (a front-facing human in a T-pose) onto a frontal plane, and for 10D elementary structures we embed the 3D human into a hypercube, keeping the higher dimensions at zero. The difference in performance is clearer for human reconstruction than for generic object reconstruction, which may be related both to the fact that humans are complex articulated shapes and to our use of a single elementary structure for all human reconstructions.
5 Conclusion
We have presented a method that takes a collection of training shapes and learns common elementary structures which can be deformed and composed to consistently reconstruct arbitrary shapes. We learn consistent structures without explicit point supervision between shapes, and we demonstrate that using these structures for reconstruction and correspondence tasks leads to significant quantitative improvements. When trained on shape categories, the structures are often interpretable. Moreover, our deformation learning approach learns elementary structures as deformations of continuous surfaces, so the output surfaces can be densely sampled and meshed at test time. Our approach opens up possibilities for other applications, such as shape morphing and scan completion.

Acknowledgments.
This work was partly supported by ANR project EnHerit ANR-17-CE23-0008, Labex Bézout, and gifts from Adobe to École des Ponts.

References

[1] B. Allen, B. Curless, and Z. Popovic. Articulated body deformation from range scan data. SIGGRAPH, 2002.
[2] B. Allen, B. Curless, and Z. Popovic. The space of human body shapes: reconstruction and parameterization from range scans. SIGGRAPH, 2003.
[3] I. Biederman. Recognition-by-components: a theory of human image understanding. Psychological Review, 94(2):115, 1987.
[4] T. O. Binford. Visual perception by computer. In IEEE Conference on Systems and Control, 1971.
[5] F. Bogo, J. Romero, M. Loper, and M. J. Black. FAUST: Dataset and evaluation for 3D mesh registration. In CVPR, 2014.
[6] A. X. Chang, T. A. Funkhouser, L. J. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu. ShapeNet: An information-rich 3D model repository. CoRR, abs/1512.03012, 2015.
[7] H. Fan, H. Su, and L. J. Guibas. A point set generation network for 3D object reconstruction from a single image. In CVPR, pages 605–613, 2017.
[8] K. Genova, F. Cole, D. Vlasic, A. Sarna, W. T. Freeman, and T. A. Funkhouser. Learning shape templates with structured implicit functions. CoRR, abs/1904.06447, 2019.
[9] A. Golovinskiy and T. Funkhouser. Learning consistent segmentation of 3D models. Computers and Graphics (Shape Modeling International), 2009.
[10] T. Groueix, M. Fisher, V. G. Kim, B. Russell, and M. Aubry. 3D-CODED: 3D correspondences by deep deformation. In ECCV, 2018.
[11] T. Groueix, M. Fisher, V. G. Kim, B. Russell, and M. Aubry. AtlasNet: A papier-mâché approach to learning 3D surface generation. In CVPR, 2018.
[12] Q. Huang, V. Koltun, and L. Guibas.
Joint shape segmentation with linear programming. ACM Transactions on Graphics (Proc. SIGGRAPH Asia), 2011.
[13] A. Kaiser, J. A. Ybanez Zepeda, and T. Boubekeur. A survey of simple geometric primitives detection methods for captured 3D data. In Computer Graphics Forum. Wiley Online Library, 2018.
[14] A. Kanazawa, S. Tulsiani, A. A. Efros, and J. Malik. Learning category-specific mesh reconstruction from image collections. In ECCV, 2018.
[15] V. G. Kim, W. Li, N. J. Mitra, S. Chaudhuri, S. DiVerdi, and T. Funkhouser. Learning part-based templates from large collections of 3D shapes. Transactions on Graphics (Proc. of SIGGRAPH), 32(4), 2013.
[16] J. Li, K. Xu, S. Chaudhuri, E. Yumer, H. Zhang, and L. Guibas. GRASS: Generative recursive autoencoders for shape structures. ACM Transactions on Graphics (TOG), 36(4):52, 2017.
[17] L. Li, M. Sung, A. Dubrovina, L. Yi, and L. Guibas. Supervised fitting of geometric primitives to 3D point clouds. arXiv preprint arXiv:1811.08988, 2018.
[18] Y. Li, X. Wu, Y. Chrysanthou, A. Sharf, D. Cohen-Or, and N. J. Mitra. GlobFit: Consistently fitting primitives by discovering global relations. ACM Transactions on Graphics, 30(4):52:1–52:12, 2011.
[19] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black. SMPL: A skinned multi-person linear model. SIGGRAPH Asia, 2015.
[20] D. Paschalidou, A. O. Ulusoy, and A. Geiger. Superquadrics revisited: Learning 3D shape parsing beyond cuboids. In CVPR, June 2019.
[21] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In CVPR, pages 652–660, 2017.
[22] L. G. Roberts. Machine perception of three-dimensional solids.
PhD thesis, Massachusetts Institute of Technology, 1963.
[23] F. Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386, 1958.
[24] R. Schnabel, P. Degener, and R. Klein. Completion and reconstruction with primitive shapes. Computer Graphics Forum (Proc. of Eurographics), 28(2):503–512, Mar. 2009.
[25] R. Schnabel, R. Wahl, and R. Klein. Efficient RANSAC for point-cloud shape detection. In Computer Graphics Forum, volume 26, pages 214–226. Wiley Online Library, 2007.
[26] G. Sharma, R. Goyal, D. Liu, E. Kalogerakis, and S. Maji. CSGNet: Neural shape parser for constructive solid geometry. In CVPR, pages 5515–5523, 2018.
[27] O. Sidi, O. van Kaick, Y. Kleiman, H. Zhang, and D. Cohen-Or. Unsupervised co-segmentation of a set of shapes via descriptor-space spectral clustering. In Transactions on Graphics (Proc. of SIGGRAPH Asia), pages 126:1–126:10, 2011.
[28] S. Tulsiani, H. Su, L. J. Guibas, A. A. Efros, and J. Malik. Learning shape abstractions by assembling volumetric primitives. In CVPR, 2017.
[29] G. Varol, J. Romero, X. Martin, N. Mahmood, M. J. Black, I. Laptev, and C. Schmid. Learning from synthetic humans. In CVPR, 2017.
[30] Y. Yang, C. Feng, Y. Shen, and D. Tian. FoldingNet: Point cloud auto-encoder via deep grid deformation. In CVPR, June 2018.
[31] C. Zou, E. Yumer, J. Yang, D. Ceylan, and D. Hoiem. 3D-PRNN: Generating shape primitives with recurrent neural networks. In ICCV, pages 900–909, 2017.
[32] S. Zuffi and M. J. Black. The stitched puppet: A graphical model of 3D human shape and pose. Proceedings IEEE Conf.
on Computer Vision and Pattern Recognition (CVPR), 2015.