{"title": "Exploiting Local and Global Structure for Point Cloud Semantic Segmentation with Contextual Point Representations", "book": "Advances in Neural Information Processing Systems", "page_first": 4571, "page_last": 4581, "abstract": "In this paper, we propose one novel model for point cloud semantic segmentation,which exploits both the local and global structures within the point cloud based onthe contextual point representations. Specifically, we enrich each point represen-tation by performing one novel gated fusion on the point itself and its contextualpoints. Afterwards, based on the enriched representation, we propose one novelgraph pointnet module, relying on the graph attention block to dynamically com-pose and update each point representation within the local point cloud structure.Finally, we resort to the spatial-wise and channel-wise attention strategies to exploitthe point cloud global structure and thereby yield the resulting semantic label foreach point. Extensive results on the public point cloud databases, namely theS3DIS and ScanNet datasets, demonstrate the effectiveness of our proposed model,outperforming the state-of-the-art approaches. Our code for this paper is available at https://github.com/fly519/ELGS.", "full_text": "Exploiting Local and Global Structure for Point\n\nCloud Semantic Segmentation with Contextual Point\n\nRepresentations\n\nXu Wang\n\nJingming He\n\nCollege of Computer Science\n\nand Software Engineering\n\nCollege of Computer Science\n\nand Software Engineering\n\nShenzhen University\n\nShenzhen, China\n\nShenzhen University\n\nShenzhen, China\n\nwangxu@szu.edu.cn\n\nhejingming519@gmail.com\n\nLin Ma\u2217\n\nTencent AI Lab\nShenzhen, China\n\nforest.linma@gmail.com\n\nAbstract\n\nIn this paper, we propose one novel model for point cloud semantic segmentation,\nwhich exploits both the local and global structures within the point cloud based on\nthe contextual point representations. 
Specifically, we enrich each point representation by performing one novel gated fusion on the point itself and its contextual points. Afterwards, based on the enriched representation, we propose one novel graph pointnet module, relying on the graph attention block to dynamically compose and update each point representation within the local point cloud structure. Finally, we resort to the spatial-wise and channel-wise attention strategies to exploit the point cloud global structure and thereby yield the resulting semantic label for each point. Extensive results on the public point cloud databases, namely the S3DIS and ScanNet datasets, demonstrate the effectiveness of our proposed model, outperforming the state-of-the-art approaches. Our code for this paper is available at https://github.com/fly519/ELGS.

1 Introduction

The point cloud captured by 3D scanners has attracted increasing research interest, especially for point cloud understanding tasks, including 3D object classification [13, 14, 10, 11], 3D object detection [21, 27], and 3D semantic segmentation [25, 13, 14, 23, 10]. 3D semantic segmentation, aiming at providing a class label for each point in the 3D space, is a prevalent and challenging problem. First, the points captured by 3D scanners are usually sparse, which hinders the design of one effective and efficient deep model for semantic segmentation. Second, the points always appear unstructured and unordered. As such, the relationships between the points are hard to capture and model.

As points are not in a regular format, some existing approaches first transform the point clouds into regular 3D voxel grids or collections of images, and then feed them into a traditional convolutional neural network (CNN) to yield the resulting semantic segmentation [25, 5, 22]. Such a transformation process can somehow capture the structure information of the points and thereby exploit their relationships. 
However, such approaches, especially those operating on 3D volumetric data, incur high memory and computation costs. Recently, another thread of deep learning architectures on point clouds, namely PointNet [13] and PointNet++ [14], has been proposed to handle the points in an efficient and effective way. Specifically, PointNet learns a spatial encoding of each point and then aggregates all individual point features as one global representation. However, PointNet does not consider the local structures.

*Corresponding author.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

In order to further exploit the local structures, PointNet++ processes a set of points in a hierarchical manner. Specifically, the points are partitioned into overlapping local regions to capture the fine geometric structures, and the obtained local features are further aggregated into larger units to generate higher-level features until the global representation is obtained. Although promising results have been achieved on the public datasets, there still remain some open issues. First, each point is characterized by its own coordinate information and extra attribute values, i.e., color, normal, reflectance, etc. Such a representation only expresses the physical meaning of the point itself, without considering its neighbouring and contextual points. Second, we argue that the local structures within the point cloud are complicated, while the simple partitioning process in PointNet++ cannot effectively capture such complicated relationships. Third, the labeling of each point not only depends on its own representation, but also relates to the other points. 
Although the global representation is obtained in PointNet and PointNet++, the complicated global relationships within the point cloud have not been explicitly exploited and characterized.

In this paper, we propose one novel model for point cloud semantic segmentation. First, for each point, we construct one contextual representation by considering its neighboring points to enrich its semantic meaning with one novel gated fusion strategy. Based on the enriched semantic representations, we propose one novel graph pointnet module (GPM), which relies on one graph attention block (GAB) to compose and update the feature representation of each point within the local structure. Multiple GPMs can be stacked together to generate the compact representation of the point cloud. Finally, the global point cloud structure is exploited by the spatial-wise and channel-wise attention strategies to generate the semantic label for each point.

2 Related Work

Recently, deep models have demonstrated strong feature learning abilities on computer vision tasks with regular data structures. However, due to the limitations of the data representation, there are still many challenges for 3D point cloud tasks, which involve irregular data structures. According to the 3D data representation methods, existing approaches can be roughly categorized as 3D voxel-based [5, 25, 22, 7, 9], multiview-based [18, 12], and set-based approaches [13, 14].

3D Voxel-based Approach. The 3D voxel-based methods first transform the point clouds into regular 3D voxel grids, on which a 3D CNN can be directly applied, similarly to images or videos. Wu et al. [25] propose the full-voxel-based 3D ShapeNets network to store and process 3D data. Due to the constraints of representation resolution, information loss is inevitable during the discretization process. Meanwhile, the memory and computational consumption increase dramatically with respect to the voxel resolution. 
Recently, OctNet [16], Kd-Net [7], and O-CNN [22] have been proposed to reduce the computational cost by skipping the operations on empty voxels.

Multiview-based Approach. The multiview-based methods need to render multiple images from the target point cloud based on different view angle settings. Afterwards, each image can be processed by traditional 2D CNN operations [18]. Recently, the multiview image CNN [12] has been applied to 3D shape segmentation and has obtained satisfactory results. The multiview-based approaches help reduce the computational cost and running memory. However, converting the 3D point cloud into images also introduces information loss. Moreover, how to determine the number of views and how to allocate the views to better represent the 3D shape remains an intractable problem.

Set-based Approach. PointNet [13] is the first set-based method, which learns the representation directly on the unordered and unstructured point clouds. PointNet++ [14] relies on a hierarchical learning strategy to extend PointNet for capturing local structure information. PointCNN [10] is further proposed to exploit the canonical order of points for local context information extraction. Recently, there have been several attempts in the literature to model the point cloud as structured graphs. For example, Qi et al. [15] propose to build a k-nearest neighbor directed graph on top of the point cloud to boost the performance on the semantic segmentation task. SPGraph [8] is proposed to deal with large-scale point clouds. The points are adaptively partitioned into geometrically homogeneous elements to build a superpoint graph, which is then fed into a graph convolutional network (GCN) for predicting the semantic labels. DGCNN [24] relies on the edge convolution operation to dynamically capture the local shapes. 
RS-CNN [11] extends the regular grid CNN to irregular configurations, encoding the geometric relations of points to achieve contextual shape-aware learning of the point cloud.

Figure 1: Our proposed model for the point cloud semantic segmentation, consisting of three fully-coupled components. The point enrichment not only considers the point itself but also its contextual points to enrich the corresponding semantic representation. The feature representation relies on a conventional encoder-decoder architecture with lateral connections to learn the feature representation for each point. Specifically, the GPM is proposed to dynamically compose and update each point representation via a GAB module. For the prediction, we resort to both channel-wise and spatial-wise attentions to exploit the global structure for the final semantic label prediction for each point.

These approaches mainly focus on exploiting the local point relationships, while neglecting the global ones. Unlike previous set-based methods that only consider the raw coordinate and attribute information of each single point, we pay more attention to the spatial context information within neighboring points. Our proposed contextual representation is able to express more fine-grained structural information. We also rely on one novel graph pointnet module to compose and update each point representation within the local point cloud structure. Moreover, the point cloud global structure information is considered via the spatial-wise and channel-wise attention strategies.

3 Approach

The point cloud semantic segmentation aims to take the 3D point cloud as input and assign one semantic class label to each point. We propose one novel model for handling this point cloud semantic segmentation task, as shown in Fig. 1. Specifically, our proposed network consists of three components, namely, the point enrichment, the feature representation, and the prediction. 
These three components are fully coupled together, ensuring that the whole model can be trained in an end-to-end manner.

Point Enrichment. To make an accurate class prediction for each point within the complicated point cloud structure, we need to consider not only the information of each point itself but also its neighboring or contextual points. Different from the existing approaches, which rely on the information of each point itself, such as the geometry, color, etc., we propose one novel point enrichment layer to enrich each point representation by taking its neighboring or contextual points into consideration. With the incorporated contextual information, each point is able to sense the complicated point cloud structure information. As will be demonstrated in Sec. 4.4, the contextual information, enriching the semantic information of each point, helps boost the final segmentation performance.

Feature Representation. With the enriched point representation, we resort to the conventional encoder-decoder architecture with lateral connections to learn the feature representation for each point. To further exploit the local structure information of the point cloud, the GPM is employed in the encoder, which relies on the GAB to dynamically compose and update the feature representation of each point within its local regions. The decoder, with lateral connections, works on the compact representation obtained from the encoder to generate the semantic feature representation for each point.

Prediction. Based on the obtained semantic representations, we resort to both the channel-wise and spatial-wise attentions to further exploit the global structure of the point cloud. Afterwards, the semantic label is predicted for each point.

3.1 Point Enrichment

The raw representation of each point is usually its 3D position and associated attributes, such as color, reflectance, surface normal, etc. 
Existing approaches usually directly take such a representation as input, neglecting the neighboring or contextual information, which is believed to play an essential role [17] in characterizing the point cloud structure, especially from the local perspective. In this paper, besides the point itself, we incorporate its neighboring points as its contextual information to enrich the point semantic representation. With such incorporated contextual information, each point is aware of the complicated point cloud structure information.

As illustrated in Fig. 2, a point cloud consists of N points, which can be represented as {P_1, P_2, ..., P_N}, with P_i ∈ R^{C_f} denoting the attribute values of the i-th point, such as position coordinate, color, normal, etc. To characterize the contextual information for each point, the k-nearest neighbor set N_i within the local region centered on the i-th point is selected and concatenated together, where the contextual representation R_i ∈ R^{kC_f} of the given point i is as follows:

R_i = ∥_{j∈N_i} P_j.    (1)

Figure 2: The point enrichment process relies on our proposed gated fusion strategy to enrich the point representation by considering the neighbouring and contextual points of each point.

For each point, we have two different representations, specifically P_i and R_i. However, these two representations are of different dimensions and different characteristics. How to effectively fuse them together to produce one more representative feature for each point remains an open issue. In this paper, we propose a novel gated fusion strategy. We first feed P_i into one fully-connected (FC) layer to obtain a new feature vector P̃_i ∈ R^{kC_f}. 
Afterwards, the gated fusion operation is performed:

P̂_i = g_i ⊙ P̃_i,
R̂_i = g^R_i ⊙ R_i,    (2)

g_i = σ(w_i R_i + b_i),
g^R_i = σ(w^R_i P̃_i + b^R_i),

where w_i, w^R_i ∈ R^{kC_f × kC_f} and b_i, b^R_i ∈ R^{kC_f} are the learnable parameters, σ is the non-linear sigmoid function, and ⊙ is the element-wise multiplication. The gated fusion aims to mutually absorb useful and meaningful information from P_i and R_i. The interactions between P_i and R_i are updated, yielding P̂_i and R̂_i. As such, the i-th point representation is then enriched by concatenating them together as P̂_i ∥ R̂_i. To ease the following introduction, we will re-use P_i to denote the enriched representation of the i-th point.

3.2 Feature Representation

Based on the enriched point representation, we rely on one traditional encoder-decoder architecture with lateral connections to learn the feature representation of each point.

3.2.1 Encoder

Although the enriched point representation has somewhat considered the local structure information, the complicated relationships among points, especially from the local perspective, need to be further exploited. In order to tackle this challenge, we propose one novel GPM in the encoder, which aims to learn the composition ability between points and thereby more effectively capture the local structural information within the point cloud.

Graph Pointnet Module. Same as [14], we first use the sampling and grouping layers to divide the point set into several local groups. Within each group, the GPM is used to exploit the local relationships between points, and thereby update the point representation by aggregating the point information within the local structure.

As illustrated in Fig. 3, the proposed GPM consists of one multi-layer perceptron (MLP) and GAB. 
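Before detailing the GPM, the point enrichment of Sec. 3.1 (Eqs. (1)-(2)) can be made concrete with a minimal NumPy sketch. This is an illustration, not the trained model: the random matrices stand in for learned FC weights, biases are omitted, the first three attribute channels are assumed to be the xyz coordinates, and each point's own entry appears among its k nearest neighbors — all of these are assumptions of this sketch.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def enrich_points(P, k=3, seed=0):
    """Contextual enrichment (Eqs. (1)-(2)): kNN concatenation + gated fusion.
    P: (N, Cf) array of per-point attributes. Returns an (N, 2*k*Cf) array."""
    rng = np.random.default_rng(seed)
    N, Cf = P.shape
    # Eq. (1): concatenate the k nearest neighbours of each point (distances
    # computed on the first three channels, assumed to be xyz) into R_i.
    d = np.linalg.norm(P[:, None, :3] - P[None, :, :3], axis=-1)
    nbr = np.argsort(d, axis=1)[:, :k]           # indices of the k nearest points
    R = P[nbr].reshape(N, k * Cf)                # R_i = || P_j for j in N_i
    # FC layer lifting P_i to the same dimension kCf (random stand-in weights).
    W_fc = rng.standard_normal((Cf, k * Cf)) * 0.1
    P_t = P @ W_fc                               # \tilde{P}_i
    # Eq. (2): cross-gating -- each representation is gated by the other one.
    W_g  = rng.standard_normal((k * Cf, k * Cf)) * 0.1
    W_gR = rng.standard_normal((k * Cf, k * Cf)) * 0.1
    g  = sigmoid(R @ W_g)                        # gate on \tilde{P}_i, driven by R_i
    gR = sigmoid(P_t @ W_gR)                     # gate on R_i, driven by \tilde{P}_i
    return np.concatenate([g * P_t, gR * R], axis=1)   # \hat{P}_i || \hat{R}_i
```

The output dimension 2kC_f matches the concatenation P̂_i ∥ R̂_i of Sec. 3.1; in the paper the result then feeds the encoder described below.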
The MLP in conventional PointNet [13] and PointNet++ [14] independently operates on each point to mine the information within the point itself, while neglecting the correlations and relationships among the points. In order to more comprehensively exploit the point relationships, we rely on the GAB to aggregate the neighboring point representations and thereby update each point representation.

Figure 3: The architecture of our proposed GPM, which stacks MLP and GAB to exploit the point relationships within the local structure.

For each local structure obtained by the sampling and grouping layers, GAB [20] first defines one fully connected undirected graph to measure the similarities between any two points within the local structure. Given the output feature map G ∈ R^{C_e × N_e} of the MLP layer in the GPM module, we first linearly project each point to one common space through a FC layer to obtain a new feature map Ĝ ∈ R^{C_e × N_e}. The similarity α_ij between point i and point j is measured as follows:

α_ij = Ĝ_i · Ĝ_j.    (3)

Afterwards, we calculate the influence factor of point j on point i:

β_ij = softmax_j(LeakyReLU(α_ij)),    (4)

where β_ij is regarded as the normalized attentive weight, representing how point j relates to point i. The representation of each point is updated by attentively aggregating the point representations with reference to β_ij:

G̃_i = Σ_{j=1}^{N_e} β_ij Ĝ_j.    (5)

It can be observed that the GAB dynamically updates the local feature representation by referring to the similarities between points and thereby captures their relationships. Moreover, in order to preserve the original information, the point feature after the MLP is concatenated with the updated one via one skip connection through a gated fusion operation, as shown in Fig. 3.

Please note that we can stack multiple GPMs, as shown in Fig. 
3, to further exploit the complicated non-linear relationships within each local structure. Afterwards, one max pooling layer is used to aggregate the feature map into a one-dimensional feature vector, which not only lowers the dimensionality of the representation, making it possible to quickly generate a compact representation of the point cloud, but also helps filter out unreliable noise.

3.2.2 Decoder

For the decoder, we use the same architecture as [14]. Specifically, we progressively upsample the compact feature obtained from the encoder until the original resolution is restored. Please note that, for preserving the information generated in the encoder as much as possible, lateral connections are also used.

3.3 Prediction

After performing the feature representation, a rich semantic representation of each point is obtained. Note that our previous operations, including the contextual representation and the feature representation, only mine the local point relationships. However, the global information is also important and needs to be considered when determining the label of each individual point. For the semantic segmentation task, two points far apart in space may belong to the same semantic category, and they can be jointly considered to mutually enhance their feature representations. Moreover, for high-dimensional feature representations, inter-dependencies between feature channels also exist. As such, in order to capture the global context information for each point, we introduce two attention modules, namely spatial-wise and channel-wise attentions [4], for modeling the global relationships between points.

Spatial-wise Attention. To model rich global contextual relationships among points, the spatial-wise attention module is employed to adaptively aggregate spatial contexts of local features. 
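The attentive aggregation at the core of the GAB (Eqs. (3)-(5)), which the attention modules of this section mirror with an added residual, can be sketched as follows. This is a minimal NumPy sketch under stated assumptions: the projection matrix stands in for the learned FC layer, the LeakyReLU slope of 0.2 is an illustrative choice, and the gated skip connection of the full GPM is omitted.

```python
import numpy as np

def attentive_aggregate(G, W_proj, residual=False):
    """Attentive aggregation (Eqs. (3)-(5)).
    G: (C, N) feature map; W_proj: (C, C) stand-in for the FC projection.
    Returns the updated (C, N) features."""
    Gh = W_proj @ G                          # \hat{G}: project to a common space
    alpha = Gh.T @ Gh                        # Eq. (3): pairwise similarities
    alpha = np.where(alpha > 0, alpha, 0.2 * alpha)   # LeakyReLU (slope assumed 0.2)
    # Eq. (4): softmax over j, i.e. each row of beta sums to one
    e = np.exp(alpha - alpha.max(axis=1, keepdims=True))
    beta = e / e.sum(axis=1, keepdims=True)
    out = Gh @ beta.T                        # Eq. (5): \tilde{G}_i = sum_j beta_ij \hat{G}_j
    # With residual=True this mimics the "+ F_i" form of the spatial-wise
    # attention (Eq. (7)) rather than the plain GAB update.
    return out + G if residual else out
```

Because each output point is a convex combination of the projected point features, the update stays within the per-channel range of the inputs, which is one way to see why the aggregation is stable.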
Given the feature map F ∈ R^{C_d × N_d} from the decoder, we first feed it into two FC layers to obtain two new feature maps A and B, respectively, where {A, B} ∈ R^{C_d × N_d}. N_d is the number of points and C_d is the number of feature channels. The normalized spatial-wise attentive weight v_ij measures the influence factor of point j on point i as follows:

v_ij = softmax_j(A_i · B_j).    (6)

Afterwards, the feature map F is fed into another FC layer to generate a new feature map D ∈ R^{C_d × N_d}. The output feature map F̂ ∈ R^{C_d × N_d} after spatial-wise attention is obtained as:

F̂_i = Σ_{j=1}^{N_d} (v_ij D_j) + F_i.    (7)

As such, the global spatial structure information is attentively aggregated into each point representation.

Channel-wise Attention. The channel-wise attention performs similarly to the spatial-wise attention, with the channel attention map explicitly modeling the interdependencies between channels and thereby boosting the feature discriminability. Similar to the spatial-wise attention module, the output feature map F̃ ∈ R^{C_d × N_d} is obtained by aggregating the global channel structure information with each channel representation.

After summing the feature maps F̂ and F̃, the semantic label for each point can be obtained with one additional FC layer. With such attention processes from the global perspective, the feature representation of each point is updated. As such, the complicated relationships between the points can be comprehensively exploited, yielding more accurate segmentation results.

4 Experiment

4.1 Experiment Setting

Dataset. To evaluate the performance of the proposed model and compare with the state-of-the-art, we conduct experiments on two publicly available datasets, the Stanford 3D Indoor Semantics (S3DIS) dataset [1] and the ScanNet dataset [2]. 
The S3DIS dataset comes from real scans of indoor environments, including 3D scans collected by Matterport scanners from 6 areas, which are divided into 271 rooms. ScanNet is a point cloud dataset with scanned indoor scenes. It has 22 categories of semantic tags, with 1513 scenes. ScanNet contains a wide variety of spaces. Each point is annotated with an instance-level semantic category label.

Implementation Details. The number of neighboring points k in the contextual representation is set to 3, and the farthest distance for a neighboring point is fixed to 0.06. For feature extraction, a four-layer encoder is used, where the spatial scale of each layer is set to 1024, 256, 64, and 16, respectively. The GPM is enabled in the first two layers of the encoder to exploit the local relationships between points. The maximum numbers of training epochs for S3DIS and ScanNet are set to 120 and 500, respectively.

Evaluation Metric. Two widely used metrics, namely overall accuracy (OA) and mean intersection over union (mIoU), are used to measure the semantic segmentation performance. OA is the prediction accuracy over all points. IoU measures the ratio of the area of overlap to the area of union between the ground truth and the segmentation result. mIoU is the average of IoU over all categories.

Competitor Methods. For the S3DIS dataset, we compare our method with PointNet [13], PointNet++ [14], SEGCloud [19], RSNet [6], SPGraph [8], SGPN [23], Engelmann et al. [3], A-SCN [26] and DGCNN [24]. For the ScanNet dataset, we compare with 3DCNN [2], PointNet [13], PointNet++ [14], RSNet [6] and PointCNN [10].

Table 1: Results on the S3DIS dataset on "Area 5" and over 6 folds in terms of OA and mIoU. † and ‡ indicate that the PointNet performances are directly copied from [8] and [3], respectively. * indicates that the PointNet++ performances are produced with the publicly available code.

Test Area  Method                 OA     mIoU
Area 5     PointNet† [13]          -     41.09
Area 5     SEGCloud [19]           -     48.92
Area 5     RSNet [6]               -     51.93
Area 5     PointNet++* [14]      86.43   54.98
Area 5     SPGraph [8]           86.38   58.04
Area 5     Ours                  88.43   60.06
6 fold     PointNet‡ [13]        78.5    47.6
6 fold     SGPN [23]             80.8    50.4
6 fold     Engelmann et al. [3]  81.1    49.7
6 fold     A-SCN [26]            81.6    52.7
6 fold     SPGraph [8]           85.5    62.1
6 fold     DGCNN [24]            84.3    56.1
6 fold     Ours                  87.6    66.3

Figure 4: Qualitative results from the S3DIS dataset. All the walls are removed for better visualization. From top to bottom are the Point Cloud, the results of PointNet++, Ours, and the Ground Truth, respectively. The segmentation results of our proposed model are closer to the ground truth than those of PointNet++.

4.2 S3DIS Semantic Segmentation

We perform semantic segmentation experiments on the S3DIS dataset to evaluate our performance on indoor real-world scene scans, and perform ablation experiments on this dataset. Same as the experimental setup in PointNet [13], we divide each room evenly into several 1 m³ cubes, with 4096 points uniformly sampled from each. Same as [13, 3, 8], we perform 6-fold cross validation with micro-averaging. In order to compare with more methods, we also report the performance on the fifth fold only (Area 5). The OA and mIoU results are summarized in Table 1. From the results, we can see that our algorithm performs better than the other competitor methods in terms of both the OA and mIoU metrics.

Table 2: The segmentation results on the S3DIS dataset in terms of IoU for each category.

Test Area  Method                ceiling  floor  wall   beam  column window  door  table  chair  sofa  bookcase board clutter
Area 5     PointNet [13] in [8]   88.80   97.33  69.80  0.05   3.92  46.26  10.76 52.61  58.93 40.28    5.85  26.38  33.22
Area 5     SEGCloud [19]          90.06   96.05  69.86  0.00  18.37  38.35  23.12 75.89  70.40 58.42   40.88  12.96  41.60
Area 5     RSNet [6]              93.34   98.36  79.18  0.00  15.75  45.37  50.10 65.52  67.87 22.45   52.45  41.02  43.64
Area 5     PointNet++ [14]        91.41   97.92  69.45  0.00  16.27  66.13  14.48 72.32  81.10 35.12   59.67  59.45  51.42
Area 5     SPGraph [8]            89.35   96.87  78.12  0.00  42.81  48.93  61.58 84.66  75.41 69.84   52.60   2.10  52.22
Area 5     Ours                   92.80   98.48  72.65  0.01  32.42  68.12  28.79 74.91  85.12 55.89   64.93  47.74  58.22
6 fold     PointNet [13] in [3]   88.0    88.7   69.3  42.4   23.1   47.5   51.6  42.0   54.1  38.2     9.6   29.4   35.2
6 fold     Engelmann et al. [3]   90.3    92.1   67.9  44.7   24.2   52.3   51.2  47.4   58.1  39.0     6.9   30.0   41.9
6 fold     SPGraph [8]            89.9    95.1   76.4  62.8   47.1   55.3   68.4  73.5   69.2  63.2    45.9    8.7   52.9
6 fold     Ours                   93.7    95.6   76.9  69.0   46.7   63.9   42.6  70.1   76.0  52.8    57.2   54.8   62.5

Besides, the IoU values for each category are summarized in Table 2. It can be observed that our proposed method achieves the best performance for several categories. 
For simple shapes such as "floor" and "ceiling", each model performs well, with our approach performing better. This is mainly because the prediction layer of our proposed method incorporates the global structure information between points, which enhances the point representations in flat areas. For categories with complex local structures, such as "chair" and "bookcase", our model shows the best performance, since we consider the contextual representation to enhance the relationship between each point and its neighbors, and use the GPM module to exploit the local structure information. However, the "window" and "board" categories are more difficult to distinguish from the "wall", as they are close to the "wall" in position and appear similar. The key to distinguishing them is to find subtle shape differences and detect the edges. It can be observed that our model performs well on the "window" and "board" categories. In order to further demonstrate the effectiveness of our model, some qualitative examples from the S3DIS dataset are provided in Fig. 4 and Fig. 5, showing that our model yields more accurate segmentation results.

Figure 5: Qualitative results from the S3DIS dataset. From top to bottom are the Point Cloud, the results of PointNet++, Ours, and the Ground Truth, respectively. The segmentation results of our proposed model are closer to the ground truth than those of PointNet++.

4.3 ScanNet Semantic Segmentation

For the ScanNet dataset, the numbers of scenes used for training and testing are 1201 and 312, respectively, the same as in [14, 10]. We only use the XYZ coordinate information. 
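The OA and mIoU metrics used throughout this section can be computed as below. A minimal sketch; the convention that classes absent from both the prediction and the ground truth are skipped when averaging is an assumption of this sketch, and implementations differ on this point.

```python
import numpy as np

def oa_miou(pred, gt, num_classes):
    """Overall accuracy (OA) and mean IoU over categories (Sec. 4.1).
    pred, gt: integer label arrays of the same shape."""
    pred, gt = np.asarray(pred).ravel(), np.asarray(gt).ravel()
    oa = float((pred == gt).mean())              # prediction accuracy over all points
    ious = []
    for c in range(num_classes):
        inter = np.sum((pred == c) & (gt == c))  # overlap with the ground truth
        union = np.sum((pred == c) | (gt == c))  # union with the ground truth
        if union > 0:                            # skip classes absent from both
            ious.append(inter / union)
    return oa, float(np.mean(ious))
```

For example, `oa_miou([0, 0, 1, 1], [0, 1, 1, 1], 2)` gives an OA of 0.75 and a mIoU of (1/2 + 2/3) / 2.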
The results are illustrated in Table 3. Compared with the other competitive methods, our proposed model achieves better performance in terms of both the OA and mIoU metrics.

Table 3: The segmentation results on the ScanNet dataset in terms of both OA and mIoU.

Method            OA    mIoU
3DCNN [2]        73.0     -
PointNet [13]    73.9     -
PointNet++ [14]  84.5   38.28
RSNet [6]          -    39.35
PointCNN [10]    85.1     -
Ours             85.3   40.6

4.4 Ablation Study

To validate the contribution of each module in our framework, we conduct ablation studies to demonstrate their effectiveness. Detailed experimental results are provided in Table 4.

Contextual Representation Module. After removing the contextual representation module in the input layer (denoted as w/o CR), the mIoU value drops from 60.06 to 56.15, as shown in Table 4. Based on the per-category results in Table 5, some categories show significant drops in IoU, such as "column", "sofa", and "door". The contextual representation can enhance the point features of the categories with complex local structures. We also replace the gating operation in the contextual representation with a simple concatenation operation. 
Due to the inequality of the two kinds of information, the OA and mIoU decrease. Thus, the proposed gating operation is useful for fusing the information of the point itself and its neighborhood.

Table 4: Ablation studies in terms of OA and mIoU.

Method                          OA    mIoU
Ours (w/o CR)                  87.91  56.15
Ours (w/o GPM)                 87.74  57.84
Ours (w/o AM)                  87.90  58.67
Ours (CR with concatenation)   88.21  59.14
Ours                           88.43  60.06

Table 5: Ablation studies and analysis in terms of IoU for each category.

Method          ceiling  floor  wall   beam  column window  door  table  chair  sofa  bookcase board clutter
Ours (w/o CR)    92.62   98.69  69.65  0.00   7.81  66.02  21.92 74.64  84.38 29.94   62.53  66.52  55.19
Ours (w/o AM)    92.30   97.91  70.98  0.00  21.40  65.43  31.58 75.16  83.26 48.80   62.68  56.84  56.45
Ours (w/o GPM)   92.17   98.75  72.29  0.00  14.89  72.30  19.70 75.78  84.61 36.48   62.73  68.01  54.25
Ours             92.80   98.48  72.65  0.01  32.42  68.12  28.79 74.91  85.12 55.89   64.93  47.74  58.22

Graph Pointnet Module. The segmentation performance of our model without the GPM module (denoted as w/o GPM) also drops significantly, which indicates that both the proposed GPM and CR are important for the performance improvement. Specifically, without the GPM, the IoU of categories such as "column" and "sofa" drops significantly.

Attention Module. Removing the attention module (denoted as w/o AM) decreases both OA and mIoU. Moreover, the performances on categories with large flat areas, such as "ceiling", "floor", "wall", and "window", drop significantly. As aforementioned, the attention module aims to mine the global relationships between points. Two points within the same category may be spatially far apart. 
With the attention module, the features of these points are mutually aggregated.

We further incorporate the proposed CR, GPM, and AM together with DGCNN [24] for point cloud semantic segmentation, with the performances illustrated in Table 6. It can be observed that CR, GPM, and AM each help improve the performance, demonstrating the effectiveness of each module.

Table 6: Performances of DGCNN with our proposed modules in terms of OA.

Model              OA
DGCNN              84.31
DGCNN+CR           85.35
DGCNN+GPM          84.90
DGCNN+AM           85.17
DGCNN+CR+GPM+AM    86.07

Model Complexity. Table 7 illustrates the model complexity comparisons. The sample sizes for all the models are fixed at 4096 points. It can be observed that the inference time of our model (28 ms) is less than those of the other competitor models, except for PointNet (5.3 ms) and PointNet++ (24 ms). Our model size (1.04 M) is comparable with those of the other models, except for PointCNN, which presents the largest model.

Table 7: Model complexity.

Model        Time (ms)   Size (M)
PointNet      5.3         1.17
DGCNN        42.0         0.99
PointNet++   24.0         0.97
RSNet        60.4         6.92
PointCNN     34.4        11.51
Ours         28.0         1.04

Robustness under Noise. We further demonstrate the robustness of our proposed model with respect to PointNet++. As for scaling, when the scaling ratio is 50%, the OA of our proposed model and that of PointNet++ on the segmentation task decrease by 3.0% and 4.5%, respectively. As for rotation, when the rotation angle is π/10, the OA of our proposed model and that of PointNet++ decrease by 1.7% and 1.0%, respectively. As such, our model is more robust to scaling, while less robust to rotation.

5 Conclusion

In this paper, we proposed one novel network for point cloud semantic segmentation. Different from existing approaches, we enrich each point representation by incorporating its neighboring and contextual points.
Moreover, we proposed one novel graph pointnet module to exploit the point cloud local structure, and relied on the spatial-wise and channel-wise attention strategies to exploit the point cloud global structure. Extensive experiments on two public point cloud semantic segmentation datasets demonstrate the superiority of our proposed model.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (Grant 61871270 and Grant 61672443), in part by the Natural Science Foundation of SZU (Grant No. 827000144), and in part by the National Engineering Laboratory for Big Data System Computing Technology of China.

References

[1] Iro Armeni, Ozan Sener, Amir R. Zamir, Helen Jiang, Ioannis Brilakis, Martin Fischer, and Silvio Savarese. 3D semantic parsing of large-scale indoor spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1534–1543, 2016.

[2] Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5828–5839, 2017.

[3] Francis Engelmann, Theodora Kontogianni, Alexander Hermans, and Bastian Leibe. Exploring spatial context for 3D semantic segmentation of point clouds. In Proceedings of the IEEE International Conference on Computer Vision, pages 716–724, 2017.

[4] Jun Fu, Jing Liu, Haijie Tian, Zhiwei Fang, and Hanqing Lu. Dual attention network for scene segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3146–3154, 2019.

[5] Benjamin Graham, Martin Engelcke, and Laurens van der Maaten. 3D semantic segmentation with submanifold sparse convolutional networks.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3577–3586, 2018.

[6] Qiangui Huang, Weiyue Wang, and Ulrich Neumann. Recurrent slice networks for 3D segmentation of point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2626–2635, 2018.

[7] Roman Klokov and Victor Lempitsky. Escape from cells: Deep Kd-networks for the recognition of 3D point cloud models. In Proceedings of the IEEE International Conference on Computer Vision, pages 863–872, 2017.

[8] Loic Landrieu and Martin Simonovsky. Large-scale point cloud semantic segmentation with superpoint graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4558–4567, 2018.

[9] Truc Le and Ye Duan. PointGrid: A deep network for 3D shape understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9204–9214, 2018.

[10] Yangyan Li, Rui Bu, Mingchao Sun, Wei Wu, Xinhan Di, and Baoquan Chen. PointCNN: Convolution on X-transformed points. In Advances in Neural Information Processing Systems, pages 820–830, 2018.

[11] Yongcheng Liu, Bin Fan, Shiming Xiang, and Chunhong Pan. Relation-shape convolutional neural network for point cloud analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8895–8904, 2019.

[12] Guan Pang and Ulrich Neumann. 3D point cloud object detection with multi-view convolutional neural network. In 23rd International Conference on Pattern Recognition (ICPR), pages 585–590. IEEE, 2016.

[13] Charles R. Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 652–660, 2017.

[14] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J. Guibas.
PointNet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems, pages 5099–5108, 2017.

[15] Xiaojuan Qi, Renjie Liao, Jiaya Jia, Sanja Fidler, and Raquel Urtasun. 3D graph neural networks for RGBD semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pages 5199–5208, 2017.

[16] Gernot Riegler, Ali Osman Ulusoy, and Andreas Geiger. OctNet: Learning deep 3D representations at high resolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3577–3586, 2017.

[17] Sam T. Roweis and Lawrence K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323–2326, 2000.

[18] Baoguang Shi, Song Bai, Zhichao Zhou, and Xiang Bai. DeepPano: Deep panoramic representation for 3-D shape recognition. IEEE Signal Processing Letters, 22(12):2339–2343, 2015.

[19] Lyne Tchapmi, Christopher Choy, Iro Armeni, JunYoung Gwak, and Silvio Savarese. SEGCloud: Semantic segmentation of 3D point clouds. In International Conference on 3D Vision (3DV), pages 537–547. IEEE, 2017.

[20] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In International Conference on Learning Representations, 2018.

[21] Dominic Zeng Wang and Ingmar Posner. Voting for voting in online point cloud object detection. In Proceedings of Robotics: Science and Systems, Rome, Italy, July 2015.

[22] Peng-Shuai Wang, Yang Liu, Yu-Xiao Guo, Chun-Yu Sun, and Xin Tong. O-CNN: Octree-based convolutional neural networks for 3D shape analysis. ACM Transactions on Graphics (TOG), 36(4):72, 2017.

[23] Weiyue Wang, Ronald Yu, Qiangui Huang, and Ulrich Neumann.
SGPN: Similarity group proposal network for 3D point cloud instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2569–2578, 2018.

[24] Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E. Sarma, Michael M. Bronstein, and Justin M. Solomon. Dynamic graph CNN for learning on point clouds. ACM Transactions on Graphics (TOG), 2019.

[25] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3D ShapeNets: A deep representation for volumetric shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1912–1920, 2015.

[26] Saining Xie, Sainan Liu, Zeyu Chen, and Zhuowen Tu. Attentional ShapeContextNet for point cloud recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4606–4615, 2018.

[27] Yin Zhou and Oncel Tuzel. VoxelNet: End-to-end learning for point cloud based 3D object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4490–4499, 2018.