{"title": "PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space", "book": "Advances in Neural Information Processing Systems", "page_first": 5099, "page_last": 5108, "abstract": "Few prior works study deep learning on point sets. PointNet is a pioneer in this direction. However, by design PointNet does not capture local structures induced by the metric space points live in, limiting its ability to recognize fine-grained patterns and generalizability to complex scenes. In this work, we introduce a hierarchical neural network that applies PointNet recursively on a nested partitioning of the input point set. By exploiting metric space distances, our network is able to learn local features with increasing contextual scales. With further observation that point sets are usually sampled with varying densities, which results in greatly decreased performance for networks trained on uniform densities, we propose novel set learning layers to adaptively combine features from multiple scales. Experiments show that our network called PointNet++ is able to learn deep point set features efficiently and robustly. In particular, results significantly better than state-of-the-art have been obtained on challenging benchmarks of 3D point clouds.", "full_text": "PointNet++: Deep Hierarchical Feature Learning on\n\nPoint Sets in a Metric Space\n\nCharles R. Qi Li Yi Hao Su Leonidas J. Guibas\n\nStanford University\n\nAbstract\n\nFew prior works study deep learning on point sets. PointNet [20] is a pioneer in this\ndirection. However, by design PointNet does not capture local structures induced by\nthe metric space points live in, limiting its ability to recognize \ufb01ne-grained patterns\nand generalizability to complex scenes. In this work, we introduce a hierarchical\nneural network that applies PointNet recursively on a nested partitioning of the\ninput point set. 
By exploiting metric space distances, our network is able to learn\nlocal features with increasing contextual scales. With further observation that point\nsets are usually sampled with varying densities, which results in greatly decreased\nperformance for networks trained on uniform densities, we propose novel set\nlearning layers to adaptively combine features from multiple scales. Experiments\nshow that our network called PointNet++ is able to learn deep point set features\nef\ufb01ciently and robustly. In particular, results signi\ufb01cantly better than state-of-the-art\nhave been obtained on challenging benchmarks of 3D point clouds.\n\n1\n\nIntroduction\n\nWe are interested in analyzing geometric point sets which are collections of points in a Euclidean\nspace. A particularly important type of geometric point set is point cloud captured by 3D scanners,\ne.g., from appropriately equipped autonomous vehicles. As a set, such data has to be invariant to\npermutations of its members. In addition, the distance metric de\ufb01nes local neighborhoods that may\nexhibit different properties. For example, the density and other attributes of points may not be uniform\nacross different locations \u2014 in 3D scanning the density variability can come from perspective effects,\nradial density variations, motion, etc.\nFew prior works study deep learning on point sets. PointNet [20] is a pioneering effort that directly\nprocesses point sets. The basic idea of PointNet is to learn a spatial encoding of each point and then\naggregate all individual point features to a global point cloud signature. By its design, PointNet does\nnot capture local structure induced by the metric. However, exploiting local structure has proven to\nbe important for the success of convolutional architectures. A CNN takes data de\ufb01ned on regular\ngrids as the input and is able to progressively capture features at increasingly larger scales along a\nmulti-resolution hierarchy. 
At lower levels neurons have smaller receptive \ufb01elds whereas at higher\nlevels they have larger receptive \ufb01elds. The ability to abstract local patterns along the hierarchy\nallows better generalizability to unseen cases.\nWe introduce a hierarchical neural network, named as PointNet++, to process a set of points sampled\nin a metric space in a hierarchical fashion. The general idea of PointNet++ is simple. We \ufb01rst\npartition the set of points into overlapping local regions by the distance metric of the underlying\nspace. Similar to CNNs, we extract local features capturing \ufb01ne geometric structures from small\nneighborhoods; such local features are further grouped into larger units and processed to produce\nhigher level features. This process is repeated until we obtain the features of the whole point set.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fThe design of PointNet++ has to address two issues: how to generate the partitioning of the point set,\nand how to abstract sets of points or local features through a local feature learner. The two issues\nare correlated because the partitioning of the point set has to produce common structures across\npartitions, so that weights of local feature learners can be shared, as in the convolutional setting. We\nchoose our local feature learner to be PointNet. As demonstrated in that work, PointNet is an effective\narchitecture to process an unordered set of points for semantic feature extraction. In addition, this\narchitecture is robust to input data corruption. As a basic building block, PointNet abstracts sets of\nlocal points or features into higher level representations. In this view, PointNet++ applies PointNet\nrecursively on a nested partitioning of the input set.\nOne issue that still remains is how to generate\noverlapping partitioning of a point set. 
Each\npartition is de\ufb01ned as a neighborhood ball in\nthe underlying Euclidean space, whose param-\neters include centroid location and scale. To\nevenly cover the whole set, the centroids are se-\nlected among input point set by a farthest point\nsampling (FPS) algorithm. Compared with vol-\numetric CNNs that scan the space with \ufb01xed\nstrides, our local receptive \ufb01elds are dependent\non both the input data and the metric, and thus more ef\ufb01cient and effective.\nDeciding the appropriate scale of local neighborhood balls, however, is a more challenging yet\nintriguing problem, due to the entanglement of feature scale and non-uniformity of input point\nset. We assume that the input point set may have variable density at different areas, which is quite\ncommon in real data such as Structure Sensor scanning [18] (see Fig. 1). Our input point set is thus\nvery different from CNN inputs which can be viewed as data de\ufb01ned on regular grids with uniform\nconstant density. In CNNs, the counterpart to local partition scale is the size of kernels. [25] shows\nthat using smaller kernels helps to improve the ability of CNNs. Our experiments on point set data,\nhowever, give counter evidence to this rule. Small neighborhood may consist of too few points due to\nsampling de\ufb01ciency, which might be insuf\ufb01cient to allow PointNets to capture patterns robustly.\nA signi\ufb01cant contribution of our paper is that PointNet++ leverages neighborhoods at multiple scales\nto achieve both robustness and detail capture. Assisted with random input dropout during training,\nthe network learns to adaptively weight patterns detected at different scales and combine multi-scale\nfeatures according to the input data. Experiments show that our PointNet++ is able to process point\nsets ef\ufb01ciently and robustly. 
In particular, results that are significantly better than state-of-the-art have been obtained on challenging benchmarks of 3D point clouds.

Figure 1: Visualization of a scan captured from a Structure Sensor (left: RGB; right: point cloud).

2 Problem Statement

Suppose that X = (M, d) is a discrete metric space whose metric is inherited from a Euclidean space R^n, where M ⊆ R^n is the set of points and d is the distance metric. In addition, the density of M in the ambient Euclidean space may not be uniform everywhere. We are interested in learning set functions f that take such X as the input (along with additional features for each point) and produce information of semantic interest regarding X. In practice, such an f can be a classification function that assigns a label to X or a segmentation function that assigns a per-point label to each member of M.

3 Method

Our work can be viewed as an extension of PointNet [20] with added hierarchical structure. We first review PointNet (Sec. 3.1) and then introduce a basic extension of PointNet with hierarchical structure (Sec. 3.2). Finally, we propose our PointNet++ that is able to robustly learn features even in non-uniformly sampled point sets (Sec. 3.3).

[Figure 2 diagram: hierarchical point set feature learning via repeated set abstraction levels (sampling & grouping followed by a PointNet), with feature sizes (N, d+C) → (N1, d+C1) → (N2, d+C2); a classification branch (PointNet, then fully connected layers, yielding class scores of size (k)) and a segmentation branch (interpolation, skip link concatenation, and unit PointNets, yielding per-point scores of size (N, k)).]

Figure 2: Illustration of our hierarchical feature learning architecture and its application for set segmentation and classification using points in 2D Euclidean space as an example. 
Single scale point grouping is visualized here. For details on density adaptive grouping, see Fig. 3.

3.1 Review of PointNet [20]: A Universal Continuous Set Function Approximator

Given an unordered point set {x1, x2, ..., xn} with xi ∈ R^d, one can define a set function f : X → R that maps a set of points to a vector:

f(x_1, x_2, ..., x_n) = γ( MAX_{i=1,...,n} { h(x_i) } )    (1)

where γ and h are usually multi-layer perceptron (MLP) networks.
The set function f in Eq. 1 is invariant to input point permutations and can arbitrarily approximate any continuous set function [20]. Note that the response of h can be interpreted as the spatial encoding of a point (see [20] for details).
PointNet achieved impressive performance on a few benchmarks. However, it lacks the ability to capture local context at different scales. We will introduce a hierarchical feature learning framework in the next section to resolve this limitation.

3.2 Hierarchical Point Set Feature Learning

While PointNet uses a single max pooling operation to aggregate the whole point set, our new architecture builds a hierarchical grouping of points and progressively abstracts larger and larger local regions along the hierarchy.
Our hierarchical structure is composed of a number of set abstraction levels (Fig. 2). At each level, a set of points is processed and abstracted to produce a new set with fewer elements. The set abstraction level is made of three key layers: the Sampling layer, the Grouping layer and the PointNet layer. The Sampling layer selects a set of points from the input points, which defines the centroids of local regions. The Grouping layer then constructs local region sets by finding "neighboring" points around the centroids. The PointNet layer uses a mini-PointNet to encode local region patterns into feature vectors.
A set abstraction level takes an N × (d + C) matrix as input, representing N points with d-dim coordinates and C-dim point features. 
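The PointNet set function of Eq. 1, which each set abstraction level applies as its local feature learner, can be sketched numerically. The fixed random affine maps standing in for the learned h and γ, and the layer sizes, are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-ins for h and gamma in Eq. 1: fixed random affine maps
# followed by ReLU (in the real PointNet these are learned MLPs).
W_h, b_h = rng.normal(size=(3, 64)), rng.normal(size=64)
W_g, b_g = rng.normal(size=(64, 16)), rng.normal(size=16)

def set_function(points):
    """f(x1, ..., xn) = gamma(MAX_i {h(x_i)}): per-point encoding,
    symmetric max pooling over the set, then a final transform."""
    h = np.maximum(points @ W_h + b_h, 0.0)      # h(x_i) for every point
    pooled = h.max(axis=0)                       # MAX is order-independent
    return np.maximum(pooled @ W_g + b_g, 0.0)   # gamma

pts = rng.normal(size=(128, 3))                  # an unordered set of 128 points
assert np.allclose(set_function(pts), set_function(pts[rng.permutation(128)]))
```

Because max pooling is symmetric, any reordering of the input points yields the same output vector, which is the permutation invariance claimed for Eq. 1.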
It outputs an N′ × (d + C′) matrix of N′ subsampled points with d-dim coordinates and new C′-dim feature vectors summarizing local context. We introduce the layers of a set abstraction level in the following paragraphs.

Sampling layer. Given input points {x1, x2, ..., xn}, we use iterative farthest point sampling (FPS) to choose a subset of points {x_{i1}, x_{i2}, ..., x_{im}}, such that x_{ij} is the most distant point (in metric distance) from the set {x_{i1}, x_{i2}, ..., x_{i(j-1)}} with regard to the rest of the points. Compared with random sampling, it has better coverage of the entire point set given the same number of centroids. In contrast to CNNs that scan the vector space agnostic of data distribution, our sampling strategy generates receptive fields in a data dependent manner.

Grouping layer. The input to this layer is a point set of size N × (d + C) and the coordinates of a set of centroids of size N′ × d. The output consists of groups of point sets of size N′ × K × (d + C), where each group corresponds to a local region and K is the number of points in the neighborhood of a centroid point. Note that K varies across groups, but the succeeding PointNet layer is able to convert a flexible number of points into a fixed length local region feature vector.
In convolutional neural networks, a local region of a pixel consists of pixels with array indices within a certain Manhattan distance (kernel size) of the pixel. In a point set sampled from a metric space, the neighborhood of a point is defined by metric distance.
Ball query finds all points that are within a radius of the query point (an upper limit on K is set in implementation). An alternative range query is K nearest neighbor (kNN) search, which finds a fixed number of neighboring points. 
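Both the FPS sampling and the ball query grouping described above can be sketched in a few lines of NumPy. The values of `npoint`, `radius` and `nsample` below are illustrative assumptions, not the paper's hyperparameters.

```python
import numpy as np

def farthest_point_sampling(points, npoint):
    """Iteratively pick the point farthest (in Euclidean distance) from
    those already chosen; returns the indices of the centroids."""
    n = points.shape[0]
    chosen = [0]                              # start from an arbitrary point
    dist = np.full(n, np.inf)                 # distance to nearest chosen centroid
    for _ in range(npoint - 1):
        dist = np.minimum(dist, np.linalg.norm(points - points[chosen[-1]], axis=1))
        chosen.append(int(dist.argmax()))     # farthest from all chosen so far
    return np.array(chosen)

def ball_query(points, centroids, radius, nsample):
    """For each centroid, gather up to nsample neighbor indices within radius,
    padding by repetition so every group has a fixed size K."""
    groups = []
    for c in centroids:
        idx = np.flatnonzero(np.linalg.norm(points - points[c], axis=1) <= radius)
        idx = np.resize(idx, nsample)         # pad/truncate to the upper limit K
        groups.append(idx)
    return np.array(groups)

rng = np.random.default_rng(1)
pts = rng.uniform(size=(512, 3))
cent = farthest_point_sampling(pts, 32)
grp = ball_query(pts, cent, radius=0.2, nsample=16)   # shape (32, 16)
```

Note that the centroid itself is always within the ball (distance zero), so a group is never empty, and `np.resize` simply repeats neighbors to fill the fixed group size.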
Compared with kNN, ball query's local neighborhood guarantees a fixed region scale, thus making the local region feature more generalizable across space, which is preferred for tasks requiring local pattern recognition (e.g. semantic point labeling).

PointNet layer. In this layer, the input consists of N′ local regions of points with data size N′ × K × (d + C). Each local region in the output is abstracted by its centroid and a local feature that encodes the centroid's neighborhood. The output data size is N′ × (d + C′).
The coordinates of points in a local region are firstly translated into a local frame relative to the centroid point: x_i^{(j)} = x_i^{(j)} − x̂^{(j)} for i = 1, 2, ..., K and j = 1, 2, ..., d, where x̂ is the coordinate of the centroid. We use PointNet [20] as described in Sec. 3.1 as the basic building block for local pattern learning. By using relative coordinates together with point features we can capture point-to-point relations in the local region.

3.3 Robust Feature Learning under Non-Uniform Sampling Density

As discussed earlier, it is common that a point set comes with non-uniform density in different areas. Such non-uniformity introduces a significant challenge for point set feature learning. Features learned in dense data may not generalize to sparsely sampled regions. Consequently, models trained on sparse point clouds may not recognize fine-grained local structures.
Ideally, we want to inspect a point set as closely as possible to capture the finest details in densely sampled regions. However, such close inspection is prohibited in low density areas because local patterns may be corrupted by the sampling deficiency. In this case, we should look for larger scale patterns in a greater vicinity. To achieve this goal we propose density adaptive PointNet layers (Fig. 3) that learn to combine features from regions of different scales when the input sampling density changes. We call our hierarchical network with density adaptive PointNet layers PointNet++.
Previously in Sec. 3.2, each abstraction level contains grouping and feature extraction at a single scale. In PointNet++, each abstraction level extracts multiple scales of local patterns and combines them intelligently according to local point densities. In terms of grouping local regions and combining features from different scales, we propose two types of density adaptive layers as listed below.

Figure 3: (a) Multi-scale grouping (MSG); (b) Multi-resolution grouping (MRG).

Multi-scale grouping (MSG). As shown in Fig. 3 (a), a simple but effective way to capture multi-scale patterns is to apply grouping layers with different scales, followed by corresponding PointNets to extract features at each scale. Features at different scales are concatenated to form a multi-scale feature.
We train the network to learn an optimized strategy to combine the multi-scale features. This is done by randomly dropping out input points with a randomized probability for each instance, which we call random input dropout. Specifically, for each training point set, we choose a dropout ratio θ uniformly sampled from [0, p] where p ≤ 1. For each point, we randomly drop it with probability θ. In practice we set p = 0.95 to avoid generating empty point sets. In doing so we present the network with training sets of varying sparsity (induced by θ) and varying uniformity (induced by the randomness in dropout). During test, we keep all available points.

Multi-resolution grouping (MRG). 
The MSG approach above is computationally expensive since it runs a local PointNet on large scale neighborhoods for every centroid point. In particular, since the number of centroid points is usually quite large at the lowest level, the time cost is significant.
Here we propose an alternative approach that avoids such expensive computation but still preserves the ability to adaptively aggregate information according to the distributional properties of points. In Fig. 3 (b), the feature of a region at some level L_i is a concatenation of two vectors. One vector (left in figure) is obtained by summarizing the features of each subregion from the lower level L_{i-1} using the set abstraction level. The other vector (right) is the feature obtained by directly processing all raw points in the local region using a single PointNet.
When the density of a local region is low, the first vector may be less reliable than the second, since the subregions used in computing the first vector contain even sparser points and suffer more from sampling deficiency. In such a case, the second vector should be weighted higher. On the other hand, when the density of a local region is high, the first vector provides information of finer details since it possesses the ability to inspect at higher resolutions recursively at lower levels.
Compared with MSG, this method is computationally more efficient since we avoid feature extraction in large scale neighborhoods at the lowest levels.

3.4 Point Feature Propagation for Set Segmentation

In the set abstraction layers, the original point set is subsampled. However, in set segmentation tasks such as semantic point labeling, we want to obtain point features for all the original points. One solution is to always sample all points as centroids in all set abstraction levels, which however results in high computation cost. 
Another way is to propagate features from subsampled points to the original points.
We adopt a hierarchical propagation strategy with distance based interpolation and across-level skip links (as shown in Fig. 2). In a feature propagation level, we propagate point features from N_l × (d + C) points to N_{l-1} points, where N_{l-1} and N_l (with N_l ≤ N_{l-1}) are the point set sizes of the input and output of set abstraction level l. We achieve feature propagation by interpolating the feature values f of the N_l points at the coordinates of the N_{l-1} points. Among the many choices for interpolation, we use an inverse distance weighted average based on k nearest neighbors (as in Eq. 2; by default we use p = 2, k = 3). The interpolated features on the N_{l-1} points are then concatenated with skip-linked point features from the set abstraction level. Then the concatenated features are passed through a "unit pointnet", which is similar to one-by-one convolution in CNNs. A few shared fully connected and ReLU layers are applied to update each point's feature vector. The process is repeated until we have propagated features to the original set of points.

f^{(j)}(x) = ( Σ_{i=1}^{k} w_i(x) f_i^{(j)} ) / ( Σ_{i=1}^{k} w_i(x) ),  where w_i(x) = 1 / d(x, x_i)^p,  j = 1, ..., C    (2)

4 Experiments

Datasets. We evaluate on four datasets ranging from 2D objects (MNIST [11]) and 3D objects (ModelNet40 [31] rigid objects, SHREC15 [12] non-rigid objects) to real 3D scenes (ScanNet [5]). Object classification is evaluated by accuracy. Semantic scene labeling is evaluated by average voxel classification accuracy following [5]. We list below the experiment setting for each dataset:
• MNIST: Images of handwritten digits with 60k training and 10k testing samples.
• ModelNet40: CAD models of 40 categories (mostly man-made). We use the official split with 9,843 shapes for training and 2,468 for testing.
• SHREC15: 1200 shapes from 50 categories. 
Each category contains 24 shapes which are mostly organic ones with various poses, such as horses, cats, etc. We use five-fold cross validation to acquire classification accuracy on this dataset.
• ScanNet: 1513 scanned and reconstructed indoor scenes. We follow the experiment setting in [5] and use 1201 scenes for training and 312 scenes for test.

Table 1: MNIST digit classification.
Method                       | Error rate (%)
Multi-layer perceptron [24]  | 1.60
LeNet5 [11]                  | 0.80
Network in Network [13]      | 0.47
PointNet (vanilla) [20]      | 1.30
PointNet [20]                | 0.78
Ours                         | 0.51

Table 2: ModelNet40 shape classification.
Method                   | Input | Accuracy (%)
Subvolume [21]           | vox   | 89.2
MVCNN [26]               | img   | 90.1
PointNet (vanilla) [20]  | pc    | 87.2
PointNet [20]            | pc    | 89.2
Ours                     | pc    | 90.7
Ours (with normal)       | pc    | 91.9

Figure 4: Left: Point cloud with random point dropout. Right: Curve showing the advantage of our density adaptive strategy in dealing with non-uniform density (test sizes from 1024 down to 128 points). DP means random input dropout during training; otherwise training is on uniformly dense points. See Sec. 3.3 for details.

4.1 Point Set Classification in Euclidean Metric Space

We evaluate our network on classifying point clouds sampled from both 2D (MNIST) and 3D (ModelNet40) Euclidean spaces. MNIST images are converted to 2D point clouds of digit pixel locations. 3D point clouds are sampled from the mesh surfaces of ModelNet40 shapes. By default we use 512 points for MNIST and 1024 points for ModelNet40. In the last row of Table 2, Ours (with normal), we use face normals as additional point features, where we also use more points (N = 5000) to further boost performance. All point sets are normalized to be zero mean and within a unit ball. We use a three-level hierarchical network with three fully connected layers.¹

Results. In Table 1 and Table 2, we compare our method with a representative set of previous state-of-the-art methods. 
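The MNIST preprocessing described above (pixel locations as a 2D point set, normalized to zero mean inside a unit ball) can be sketched as follows; the toy image, binarization threshold, and 512-point resampling details are illustrative assumptions.

```python
import numpy as np

def image_to_point_cloud(img, num_points=512, threshold=0.5, seed=0):
    """Turn a grayscale image into a 2D point set of 'on' pixel locations,
    then normalize to zero mean and fit within a unit ball."""
    ys, xs = np.nonzero(img > threshold)          # digit pixel locations
    pts = np.stack([xs, ys], axis=1).astype(float)
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(pts), num_points, replace=len(pts) < num_points)
    pts = pts[idx]
    pts -= pts.mean(axis=0)                       # zero mean
    pts /= np.linalg.norm(pts, axis=1).max()      # inside the unit ball
    return pts

# A filled rectangle in a 28x28 image stands in for an MNIST digit.
img = np.zeros((28, 28))
img[8:20, 10:18] = 1.0
pc = image_to_point_cloud(img)                    # shape (512, 2)
```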
Note that PointNet (vanilla) in Table 2 is the version in [20] that does not use transformation networks, which is equivalent to our hierarchical net with only one level.
Firstly, our hierarchical learning architecture achieves significantly better performance than the non-hierarchical PointNet [20]. On MNIST, we see relative error rate reductions of 60.8% and 34.6% from PointNet (vanilla) and PointNet, respectively, to our method. In ModelNet40 classification, we also see that using the same input data size (1024 points) and features (coordinates only), ours is remarkably stronger than PointNet. Secondly, we observe that point set based methods can even achieve better or similar performance to mature image CNNs. On MNIST, our method (based on a 2D point set) achieves an accuracy close to the Network in Network CNN. On ModelNet40, ours with normal information significantly outperforms the previous state-of-the-art method MVCNN [26].

Robustness to Sampling Density Variation. Sensor data directly captured from the real world usually suffers from severe irregular sampling issues (Fig. 1). Our approach selects point neighborhoods of multiple scales and learns to balance descriptiveness and robustness by properly weighting them.
We randomly drop points (see Fig. 4 left) during test time to validate our network's robustness to non-uniform and sparse data. In Fig. 4 right, we see that MSG+DP (multi-scale grouping with random input dropout during training) and MRG+DP (multi-resolution grouping with random input dropout during training) are very robust to sampling density variation. MSG+DP performance drops by less than 1% from 1024 to 256 test points. Moreover, it achieves the best performance at almost all sampling densities compared with the alternatives. PointNet (vanilla) [20] is fairly robust under density variation due to its focus on global abstraction rather than fine details. However, the loss of details also makes it less powerful compared to our approach. 
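The random input dropout (DP) used to train these robust variants can be sketched as follows. Only the recipe comes from Sec. 3.3 (θ drawn uniformly from [0, p] with p = 0.95, each point dropped independently with probability θ); the empty-set guard is an implementation assumption.

```python
import numpy as np

def random_input_dropout(points, p=0.95, rng=None):
    """Per training instance: draw theta ~ U[0, p], then drop each point
    independently with probability theta (Sec. 3.3). At test time all
    points are kept, so this applies to training sets only."""
    if rng is None:
        rng = np.random.default_rng()
    theta = rng.uniform(0.0, p)
    keep = rng.random(len(points)) >= theta
    if not keep.any():                      # guard against an empty point set
        keep[rng.integers(len(points))] = True
    return points[keep]

rng = np.random.default_rng(2)
pts = rng.normal(size=(1024, 3))
sparse = random_input_dropout(pts, rng=rng)   # a nonuniformly thinned copy
```

Applying this per instance exposes the network to training sets of varying sparsity and uniformity, which is what lets the multi-scale layers learn density-adaptive weighting.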
SSG (ablated PointNet++ with single scale grouping in each level) fails to generalize to sparse sampling density, while SSG+DP amends the problem by randomly dropping out points at training time.

¹See supplementary for more details on network architecture and experiment preparation.

4.2 Point Set Segmentation for Semantic Scene Labeling

[Figure 5 bar chart: per-voxel accuracy on ScanNet and on ScanNet with non-uniform sampling density, comparing 3DCNN [5], PointNet [20], Ours (SSG), Ours (SSG+DP), Ours (MSG+DP) and Ours (MRG+DP).]
Figure 5: ScanNet labeling accuracy.

To validate that our approach is suitable for large scale point cloud analysis, we also evaluate on the semantic scene labeling task. The goal is to predict semantic object labels for points in indoor scans. [5] provides a baseline using a fully convolutional neural network on voxelized scans. They rely purely on scanning geometry instead of RGB information and report the accuracy on a per-voxel basis. To make a fair comparison, we remove RGB information in all our experiments and convert point cloud label prediction into voxel labeling following [5]. We also compare with [20]. The accuracy is reported on a per-voxel basis in Fig. 5 (blue bars).
Our approach outperforms all the baseline methods by a large margin. In comparison with [5], which learns on voxelized scans, we directly learn on point clouds to avoid additional quantization error, and conduct data dependent sampling to allow more effective learning. 
Compared with [20], our approach introduces hierarchical feature learning and captures geometric features at different scales. This is very important for understanding scenes at multiple levels and labeling objects of various sizes. We visualize example scene labeling results in Fig. 6.

Robustness to Sampling Density Variation. To test how our trained model performs on scans with non-uniform sampling density, we synthesize virtual scans of ScanNet scenes similar to the one in Fig. 1 and evaluate our network on this data. We refer readers to the supplementary material for how we generate the virtual scans. We evaluate our framework in three settings (SSG, MSG+DP, MRG+DP) and compare with a baseline approach [20].
The performance comparison is shown in Fig. 5 (yellow bars). We see that SSG performance falls greatly due to the sampling density shift from uniform point clouds to virtually scanned scenes. The MRG network, on the other hand, is more robust to the sampling density shift since it is able to automatically switch to features depicting coarser granularity when the sampling is sparse. Even though there is a domain gap between the training data (uniform points with random dropout) and the scanned data with non-uniform density, our MSG network is only slightly affected and achieves the best accuracy among the methods in comparison. These results prove the effectiveness of our density adaptive layer design.

Figure 6: ScanNet labeling results (panels: Ground Truth, PointNet, Ours; legend: Wall, Floor, Chair, Desk, Door, Table, Bed). [20] captures the overall layout of the room correctly but fails to discover the furniture. Our approach, in contrast, is much better at segmenting objects besides the room layout.

4.3 Point Set Classification in Non-Euclidean Metric Space

In this section, we show the generalizability of our approach to non-Euclidean spaces. In non-rigid shape classification (Fig. 
7), a good classifier should be able to classify (a) and (c) in Fig. 7 correctly as the same category even given their difference in pose, which requires knowledge of the intrinsic structure. Shapes in SHREC15 are 2D surfaces embedded in 3D space. Geodesic distances along the surfaces naturally induce a metric space. We show through experiments that adopting PointNet++ in this metric space is an effective way to capture the intrinsic structure of the underlying point set.
For each shape in [12], we firstly construct the metric space induced by pairwise geodesic distances. We follow [23] to obtain an embedding metric that mimics geodesic distance. Next we extract intrinsic point features in this metric space, including WKS [1], HKS [27] and multi-scale Gaussian curvature [16]. We use these features as input and then sample and group points according to the underlying metric space. In this way, our network learns to capture multi-scale intrinsic structure that is not influenced by the specific pose of a shape. Alternative design choices include using XYZ coordinates as the point feature or using the Euclidean space R^3 as the underlying metric space. We show below that these are not optimal choices.

Results. We compare our method with the previous state-of-the-art method [14] in Table 3. [14] extracts geodesic moments as shape features and uses a stacked sparse autoencoder to digest these features to predict the shape category. Our approach, using a non-Euclidean metric space and intrinsic features, achieves the best performance in all settings and outperforms [14] by a large margin. Comparing the first and second settings of our approach, we see that intrinsic features are very important for non-rigid shape classification. The XYZ feature fails to reveal intrinsic structures and is greatly influenced by pose variation. Comparing the second and third 
Comparing the second and third\nsetting of our approach, we see using geodesic neighborhood is bene\ufb01cial compared with Euclidean\nneighborhood. Euclidean neighborhood might include points far away on surfaces and this neighbor-\nhood could change dramatically when shape affords non-rigid deformation. This introduces dif\ufb01culty\nfor effective weight sharing since the local structure could become combinatorially complicated.\nGeodesic neighborhood on surfaces, on the other hand, gets rid of this issue and improves the learning\neffectiveness.\n\nFigure 7: An example of non-\nrigid shape classi\ufb01cation.\n\nMetric space\n\nInput feature\n\nAccuracy (%)\n\nDeepGM [14]\n\n-\n\nOurs\n\nEuclidean\nEuclidean\n\nNon-Euclidean\n\nIntrinsic features\n\nXYZ\n\nIntrinsic features\nIntrinsic features\n\n93.03\n60.18\n94.49\n96.09\n\nTable 3: SHREC15 Non-rigid shape classi\ufb01cation.\n\n4.4 Feature Visualization.\n\nIn Fig. 8 we visualize what has been learned by the \ufb01rst\nlevel kernels of our hierarchical network. We created\na voxel grid in space and aggregate local point sets that\nactivate certain neurons the most in grid cells (highest\n100 examples are used). Grid cells with high votes\nare kept and converted back to 3D point clouds, which\nrepresents the pattern that neuron recognizes. Since the\nmodel is trained on ModelNet40 which is mostly con-\nsisted of furniture, we see structures of planes, double\nplanes, lines, corners etc. in the visualization.\n\n5 Related Work\n\nFigure 8: 3D point cloud patterns learned\nfrom the \ufb01rst layer kernels. The model is\ntrained for ModelNet40 shape classi\ufb01cation\n(20 out of the 128 kernels are randomly\nselected). Color indicates point depth (red\nis near, blue is far).\n\nThe idea of hierarchical feature learning has been very\nsuccessful. Among all the learning models, convolu-\ntional neural network [10; 25; 8] is one of the most\nprominent ones. 
However, convolution does not apply to unordered point sets with distance metrics, which are the focus of our work.
A few very recent works [20; 28] have studied how to apply deep learning to unordered sets. They ignore the underlying distance metric even if the point set possesses one. As a result, they are unable to capture the local context of points and are sensitive to global set translation and normalization. In this work, we target points sampled from a metric space and tackle these issues by explicitly considering the underlying distance metric in our design.
Points sampled from a metric space are usually noisy and have non-uniform sampling density. This hampers effective point feature extraction and makes learning difficult. One of the key issues is selecting a proper scale for point feature design. Several approaches have previously been developed for this problem [19; 17; 2; 6; 7; 30] in the geometry processing, photogrammetry, and remote sensing communities. In contrast to all these works, our approach learns to extract point features and balance multiple feature scales in an end-to-end fashion.
In 3D metric spaces, besides point sets, there are several other popular representations for deep learning, including volumetric grids [21; 22; 29] and geometric graphs [3; 15; 33]. However, none of these works explicitly considers the problem of non-uniform sampling density.

6 Conclusion

In this work, we propose PointNet++, a powerful neural network architecture for processing point sets sampled in a metric space. PointNet++ recursively functions on a nested partitioning of the input point set, and is effective in learning hierarchical features with respect to the distance metric. To handle the non-uniform point sampling issue, we propose two novel set abstraction layers that intelligently aggregate multi-scale information according to local point densities.
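The multi-scale aggregation idea can be illustrated with a minimal sketch; the function name is hypothetical, and the max-pooled coordinate summary is a stand-in for the learned per-scale PointNet features, not the paper's implementation:

```python
import numpy as np

def multi_scale_features(points, centroid, radii=(0.1, 0.2, 0.4)):
    """Illustrative multi-scale grouping: gather the neighborhood of one
    centroid at several radii and concatenate a per-scale summary."""
    feats = []
    for r in radii:
        mask = np.linalg.norm(points - centroid, axis=1) <= r
        neigh = points[mask] if mask.any() else centroid[None, :]
        feats.append(neigh.max(axis=0))  # placeholder for a learned PointNet(neigh)
    return np.concatenate(feats)         # features from all scales, combined
```

A density-adaptive variant would additionally weight or select among the per-scale features depending on how many points fall inside each neighborhood.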
These contributions enable us to achieve state-of-the-art performance on challenging benchmarks of 3D point clouds.
In the future, it is worth investigating how to accelerate the inference speed of the proposed network, especially for the MSG and MRG layers, by sharing more computation within each local region. It is also interesting to find applications in higher-dimensional metric spaces, where CNN-based methods would be computationally infeasible while our method can scale well.
Acknowledgement. The authors would like to acknowledge the support of a Samsung GRO grant, NSF grants IIS-1528025 and DMS-1546206, and ONR MURI grant N00014-13-1-0341.

References
[1] M. Aubry, U. Schlickewei, and D. Cremers. The wave kernel signature: A quantum mechanical approach to shape analysis. In Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on, pages 1626–1633. IEEE, 2011.
[2] D. Belton and D. D. Lichti. Classification and segmentation of terrestrial laser scanner point clouds using local variance information. IAPRS, XXXVI, 5:44–49, 2006.
[3] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203, 2013.
[4] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu. ShapeNet: An information-rich 3D model repository. Technical Report arXiv:1512.03012 [cs.GR], 2015.
[5] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner. ScanNet: Richly-annotated 3d reconstructions of indoor scenes. arXiv preprint arXiv:1702.04405, 2017.
[6] J. Demantké, C. Mallet, N. David, and B. Vallet. Dimensionality based scale selection in 3d lidar point clouds. The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, 38(Part 5):W12, 2011.
[7] A. Gressin, C. Mallet, J. Demantké, and N. David.
Towards 3d lidar point cloud registration improvement using optimal neighborhood knowledge. ISPRS Journal of Photogrammetry and Remote Sensing, 79:240–251, 2013.
[8] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[9] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[10] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[11] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[12] Z. Lian, J. Zhang, S. Choi, H. ElNaghy, J. El-Sana, T. Furuya, A. Giachetti, R. A. Guler, L. Lai, C. Li, H. Li, F. A. Limberger, R. Martin, R. U. Nakanishi, A. P. Neto, L. G. Nonato, R. Ohbuchi, K. Pevzner, D. Pickup, P. Rosin, A. Sharf, L. Sun, X. Sun, S. Tari, G. Unal, and R. C. Wilson. Non-rigid 3D shape retrieval. In I. Pratikakis, M. Spagnuolo, T. Theoharis, L. V. Gool, and R. Veltkamp, editors, Eurographics Workshop on 3D Object Retrieval. The Eurographics Association, 2015.
[13] M. Lin, Q. Chen, and S. Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.
[14] L. Luciano and A. B. Hamza. Deep learning with geodesic moments for 3d shape classification. Pattern Recognition Letters, 2017.
[15] J. Masci, D. Boscaini, M. Bronstein, and P. Vandergheynst. Geodesic convolutional neural networks on Riemannian manifolds. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 37–45, 2015.
[16] M. Meyer, M. Desbrun, P. Schröder, A. H. Barr, et al. Discrete differential-geometry operators for triangulated 2-manifolds.
Visualization and Mathematics, 3(2):52–58, 2002.
[17] N. J. Mitra, A. Nguyen, and L. Guibas. Estimating surface normals in noisy point cloud data. International Journal of Computational Geometry & Applications, 14(04n05):261–276, 2004.
[18] I. Occipital. Structure sensor: 3d scanning, augmented reality, and more for mobile devices, 2016.
[19] M. Pauly, L. P. Kobbelt, and M. Gross. Point-based multiscale surface representation. ACM Transactions on Graphics (TOG), 25(2):177–193, 2006.
[20] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. PointNet: Deep learning on point sets for 3d classification and segmentation. arXiv preprint arXiv:1612.00593, 2016.
[21] C. R. Qi, H. Su, M. Nießner, A. Dai, M. Yan, and L. Guibas. Volumetric and multi-view CNNs for object classification on 3d data. In Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 2016.
[22] G. Riegler, A. O. Ulusoys, and A. Geiger. OctNet: Learning deep 3d representations at high resolutions. arXiv preprint arXiv:1611.05009, 2016.
[23] R. M. Rustamov, Y. Lipman, and T. Funkhouser. Interior distance using barycentric coordinates. In Computer Graphics Forum, volume 28, pages 1279–1288. Wiley Online Library, 2009.
[24] P. Y. Simard, D. Steinkraus, and J. C. Platt. Best practices for convolutional neural networks applied to visual document analysis. In ICDAR, volume 3, pages 958–962, 2003.
[25] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[26] H. Su, S. Maji, E. Kalogerakis, and E. G. Learned-Miller. Multi-view convolutional neural networks for 3d shape recognition. In Proc. ICCV, to appear, 2015.
[27] J. Sun, M. Ovsjanikov, and L. Guibas. A concise and provably informative multi-scale signature based on heat diffusion. In Computer Graphics Forum, volume 28, pages 1383–1392. Wiley Online Library, 2009.
[28] O. Vinyals, S. Bengio, and M. Kudlur. Order matters: Sequence to sequence for sets. arXiv preprint arXiv:1511.06391, 2015.
[29] P.-S. Wang, Y. Liu, Y.-X. Guo, C.-Y. Sun, and X. Tong. O-CNN: Octree-based convolutional neural networks for 3d shape analysis. 2017.
[30] M. Weinmann, B. Jutzi, S. Hinz, and C. Mallet. Semantic point cloud interpretation based on optimal neighborhoods, relevant features and efficient classifiers. ISPRS Journal of Photogrammetry and Remote Sensing, 105:286–304, 2015.
[31] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao. 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1912–1920, 2015.
[32] L. Yi, V. G. Kim, D. Ceylan, I.-C. Shen, M. Yan, H. Su, C. Lu, Q. Huang, A. Sheffer, and L. Guibas. A scalable active framework for region annotation in 3d shape collections. SIGGRAPH Asia, 2016.
[33] L. Yi, H. Su, X. Guo, and L. Guibas. SyncSpecCNN: Synchronized spectral CNN for 3d shape segmentation. arXiv preprint arXiv:1612.00606, 2016.