{"title": "Learning Object Bounding Boxes for 3D Instance Segmentation on Point Clouds", "book": "Advances in Neural Information Processing Systems", "page_first": 6740, "page_last": 6749, "abstract": "We propose a novel, conceptually simple and general framework for instance segmentation on 3D point clouds. Our method, called 3D-BoNet, follows the simple design philosophy of per-point multilayer perceptrons (MLPs). The framework directly regresses 3D bounding boxes for all instances in a point cloud, while simultaneously predicting a point-level mask for each instance. It consists of a backbone network followed by two parallel network branches for 1) bounding box regression and 2) point mask prediction. 3D-BoNet is single-stage, anchor-free and end-to-end trainable. Moreover, it is remarkably computationally efficient as, unlike existing approaches, it does not require any post-processing steps such as non-maximum suppression, feature sampling, clustering or voting. Extensive experiments show that our approach surpasses existing work on both ScanNet and S3DIS datasets while being approximately 10x more computationally efficient. Comprehensive ablation studies demonstrate the effectiveness of our design.", "full_text": "Learning Object Bounding Boxes for 3D Instance Segmentation on Point Clouds

Bo Yang 1, Jianan Wang 2, Ronald Clark 3, Qingyong Hu 1, Sen Wang 4, Andrew Markham 1, Niki Trigoni 1
1 University of Oxford  2 DeepMind  3 Imperial College London  4 Heriot-Watt University
firstname.lastname@cs.ox.ac.uk

Abstract

We propose a novel, conceptually simple and general framework for instance segmentation on 3D point clouds. Our method, called 3D-BoNet, follows the simple design philosophy of per-point multilayer perceptrons (MLPs). The framework directly regresses 3D bounding boxes for all instances in a point cloud, while simultaneously predicting a point-level mask for each instance.
It consists of a backbone network followed by two parallel network branches for 1) bounding box regression and 2) point mask prediction. 3D-BoNet is single-stage, anchor-free and end-to-end trainable. Moreover, it is remarkably computationally efficient as, unlike existing approaches, it does not require any post-processing steps such as non-maximum suppression, feature sampling, clustering or voting. Extensive experiments show that our approach surpasses existing work on both ScanNet and S3DIS datasets while being approximately 10× more computationally efficient. Comprehensive ablation studies demonstrate the effectiveness of our design.

1 Introduction

Enabling machines to understand 3D scenes is a fundamental necessity for autonomous driving, augmented reality and robotics. Core problems on 3D geometric data such as point clouds include semantic segmentation, object detection and instance segmentation. Of these problems, instance segmentation has only started to be tackled in the literature. The primary obstacle is that point clouds are inherently unordered, unstructured and non-uniform. Widely used convolutional neural networks require 3D point clouds to be voxelized, incurring high computational and memory costs.

The first neural algorithm to directly tackle 3D instance segmentation is SGPN [51], which learns to group per-point features through a similarity matrix. Similarly, ASIS [52], JSIS3D [35], MASC [31], 3D-BEVIS [8] and [29] apply the same per-point feature grouping pipeline to segment 3D instances. Mo et al. formulate instance segmentation as a per-point feature classification problem in PartNet [33]. However, the learnt segments of these proposal-free methods do not have high objectness, as they do not explicitly detect object boundaries. In addition, they inevitably require a post-processing step such as mean-shift clustering [6] to obtain the final instance labels, which is computationally heavy. Another pipeline comprises the proposal-based 3D-SIS [15] and GSPN [59], which usually rely on two-stage training and expensive non-maximum suppression to prune dense object proposals.

In this paper, we present an elegant, efficient and novel framework for 3D instance segmentation, where objects are loosely but uniquely detected through a single forward stage using efficient MLPs, and then each instance is precisely segmented through a simple point-level binary classifier. To this end, we introduce a new bounding box prediction module together with a series of carefully designed loss functions to directly learn object boundaries. Our framework is significantly different from existing proposal-based and proposal-free approaches, since we are able to efficiently segment all instances with high objectness without relying on expensive and dense object proposals. Our code and data are available at https://github.com/Yang7879/3D-BoNet.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: The 3D-BoNet framework for instance segmentation on 3D point clouds.

Figure 2: Rough instance boxes.

As shown in Figure 1, our framework, called 3D-BoNet, is a single-stage, anchor-free and end-to-end trainable neural architecture. It first uses an existing backbone network to extract a local feature vector for each point and a global feature vector for the whole input point cloud.
The backbone is followed by two branches: 1) instance-level bounding box prediction, and 2) point-level mask prediction for instance segmentation.

The bounding box prediction branch is the core of our framework. This branch aims to predict a unique, unoriented and rectangular bounding box for each instance in a single forward stage, without relying on predefined spatial anchors or a region proposal network [40]. As shown in Figure 2, we believe that roughly drawing a 3D bounding box for an instance is relatively achievable, because the input point cloud explicitly includes 3D geometry information. Doing so before tackling point-level instance segmentation is extremely beneficial, since reasonable bounding boxes guarantee high objectness for the learnt segments. However, learning instance boxes involves two critical issues: 1) the number of total instances is variable, i.e., from 1 to many, and 2) there is no fixed order for the instances. These issues pose great challenges for correctly optimizing the network, because there is no information to directly link predicted boxes with ground truth labels to supervise the network. We show how to elegantly solve these issues. The box prediction branch simply takes the global feature vector as input and directly outputs a large and fixed number of bounding boxes together with confidence scores. These scores indicate whether a box contains a valid instance or not. To supervise the network, we design a novel bounding box association layer followed by a multi-criteria loss function. Given a set of ground-truth instances, we need to determine which of the predicted boxes best fit them. We formulate this association process as an optimal assignment problem with an existing solver.
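This association step can be sketched as a standard assignment problem over a cost matrix. Below is a minimal, stdlib-only illustration that brute-forces over permutations, which is feasible only for a handful of boxes; the paper itself uses the Hungarian algorithm, and in practice a polynomial-time solver such as SciPy's `linear_sum_assignment` would be used:

```python
from itertools import permutations

def associate_boxes(cost):
    """Assign one unique predicted box (row) to each ground-truth box
    (column) so that the total cost is minimal.

    cost: H x T list of lists, with H >= T. Returns (assign, total)
    where assign[j] is the index of the predicted box paired with
    ground truth j. Brute force over permutations; a tiny stand-in
    for the Hungarian algorithm used in the paper.
    """
    H, T = len(cost), len(cost[0])
    best, best_total = None, float("inf")
    # Try every ordered choice of T distinct predicted boxes.
    for perm in permutations(range(H), T):
        total = sum(cost[i][j] for j, i in enumerate(perm))
        if total < best_total:
            best, best_total = list(perm), total
    return best, best_total

# Toy example: 3 predicted boxes, 2 ground-truth boxes.
cost = [[0.9, 0.1],
        [0.2, 0.8],
        [0.5, 0.5]]
assign, total = associate_boxes(cost)
# assign pairs ground truth 0 with predicted box 1, ground truth 1
# with predicted box 0, for a total cost of 0.3.
```

The brute force is exponential in T; the Hungarian algorithm solves the same problem in O(H³).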
After the boxes have been optimally associated, our multi-criteria loss function not only minimizes the Euclidean distance of paired boxes, but also maximizes the coverage of valid points inside the predicted boxes.

The predicted boxes, together with point and global features, are then fed into the subsequent point mask prediction branch in order to predict a point-level binary mask for each instance. The purpose of this branch is to classify whether each point inside a bounding box belongs to the valid instance or the background. Assuming the estimated instance box is reasonably good, it is very likely to obtain an accurate point mask, because this branch simply rejects points that are not part of the detected instance; even a random guess would classify about 50% of the points correctly.

Overall, our framework differs from all existing 3D instance segmentation approaches in three respects. 1) Compared with the proposal-free pipeline, our method segments instances with high objectness by explicitly learning 3D object boundaries. 2) Compared with the widely used proposal-based approaches, our framework does not require expensive and dense proposals. 3) Our framework is remarkably efficient, since the instance-level masks are learnt in a single forward pass without requiring any post-processing steps. Our key contributions are:

• We propose a new framework for instance segmentation on 3D point clouds. The framework is single-stage, anchor-free and end-to-end trainable, without requiring any post-processing steps.

• We design a novel bounding box association layer followed by a multi-criteria loss function to supervise the box prediction branch.

• We demonstrate significant improvement over baselines and provide intuition behind our design choices through extensive ablation studies.

Figure 3: The general workflow of the 3D-BoNet framework.

2 3D-BoNet

2.1 Overview

As shown in Figure 3, our framework consists of two branches on top of the backbone network. Given an input point cloud P with N points in total, i.e., P ∈ R^{N×k_0}, where k_0 is the number of channels such as the location {x, y, z} and color {r, g, b} of each point, the backbone network extracts point local features, denoted as F^l ∈ R^{N×k}, and aggregates a global point cloud feature vector, denoted as F^g ∈ R^{1×k}, where k is the length of the feature vectors.

The bounding box prediction branch simply takes the global feature vector F^g as input, and directly regresses a predefined and fixed set of bounding boxes, denoted as B, and the corresponding box scores, denoted as B_s. We use ground truth bounding box information to supervise this branch. During training, the predicted bounding boxes B and the ground truth boxes are fed into a box association layer. This layer aims to automatically associate a unique and most similar predicted bounding box with each ground truth box. The output of the association layer is a list of association indices A. The indices reorganize the predicted boxes such that each ground truth box is paired with a unique predicted box for subsequent loss calculation. The predicted bounding box scores are also reordered accordingly before calculating the loss.
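The reordering step can be sketched as a simple gather by the association indices; the handling of the unassigned predictions (appended after the first T entries) is an illustrative assumption, not the paper's exact implementation:

```python
def reorder_by_association(pred_boxes, pred_scores, assign):
    """Reorganize predictions so that the first T entries are paired,
    in order, with the T ground-truth boxes. assign[j] is the index
    of the predicted box associated with ground truth j; remaining
    unassigned predictions are appended afterwards (an assumption
    for illustration only).
    """
    rest = [i for i in range(len(pred_boxes)) if i not in assign]
    order = list(assign) + rest
    return ([pred_boxes[i] for i in order],
            [pred_scores[i] for i in order])

boxes = ["b0", "b1", "b2", "b3"]      # placeholder box objects
scores = [0.1, 0.9, 0.8, 0.2]
# Ground truth 0 was matched to prediction 2, ground truth 1 to 1.
new_boxes, new_scores = reorder_by_association(boxes, scores, [2, 1])
# new_boxes → ["b2", "b1", "b0", "b3"]
```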
The reordered predicted bounding boxes are then fed into the multi-criteria loss function. Basically, this loss function aims to not only minimize the Euclidean distance between each ground truth box and the associated predicted box, but also maximize the coverage of valid points inside each predicted box. Note that both the bounding box association layer and the multi-criteria loss function are only designed for network training; they are discarded during testing. Eventually, this branch is able to directly predict a correct bounding box together with a box score for each instance.

In order to predict a point-level binary mask for each instance, every predicted box, together with the previous local and global features F^l and F^g, is further fed into the point mask prediction branch. This network branch is shared by all instances of different categories, and is therefore extremely light and compact. Such a class-agnostic approach inherently allows general segmentation across unseen categories.

2.2 Bounding Box Prediction

Bounding Box Encoding: In existing object detection networks, a bounding box is usually represented by the center location and the lengths of the three dimensions [3], or by the corresponding residuals [61] together with orientations. Instead, we parameterize the rectangular bounding box by only two min-max vertices for simplicity:

{[x_min y_min z_min], [x_max y_max z_max]}

Neural Layers: As shown in Figure 4, the global feature vector F^g is fed through two fully connected layers with Leaky ReLU as the non-linear activation function. It is then followed by another two parallel fully connected layers. One layer outputs a 6H dimensional vector, which is then reshaped as an H × 2 × 3 tensor, where H is a predefined and fixed maximum number of bounding boxes that the whole network is expected to predict. The other layer outputs an H dimensional vector followed by a sigmoid function to represent the bounding box scores.
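As a rough NumPy sketch of this branch, assuming a 128-dimensional global feature and small random placeholder weights (the function name `box_head` and the initialization are illustrative, not the released implementation; the hidden widths 512 and 256 follow Figure 4):

```python
import numpy as np

def box_head(Fg, H, rng):
    """Sketch of the box-prediction branch: the global feature vector
    is passed through two fully connected layers (Leaky ReLU), then
    two parallel heads output H boxes (reshaped to H x 2 x 3 min-max
    vertices) and H sigmoid scores. Weights are small random
    placeholders standing in for trained parameters."""
    k = Fg.shape[0]
    leaky = lambda x: np.where(x > 0, x, 0.01 * x)
    W1 = 0.01 * rng.standard_normal((512, k))
    W2 = 0.01 * rng.standard_normal((256, 512))
    Wb = 0.01 * rng.standard_normal((6 * H, 256))   # box head
    Ws = 0.01 * rng.standard_normal((H, 256))       # score head
    h = leaky(W2 @ leaky(W1 @ Fg))
    boxes = (Wb @ h).reshape(H, 2, 3)               # min/max vertices
    scores = 1.0 / (1.0 + np.exp(-(Ws @ h)))        # sigmoid in (0, 1)
    return boxes, scores

rng = np.random.default_rng(0)
boxes, scores = box_head(np.ones(128), H=24, rng=rng)
# boxes.shape == (24, 2, 3); each score lies strictly in (0, 1)
```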
The higher the score, the more likely it is that the predicted box contains a valid instance.

Bounding Box Association Layer: Given the previously predicted H bounding boxes, i.e., B ∈ R^{H×2×3}, it is not straightforward to make use of the ground truth boxes, denoted as B̄ ∈ R^{T×2×3}, to supervise the network, because there are no predefined anchors to trace each predicted box back to a corresponding ground truth box in our framework. Besides, for each input point cloud P, the number of ground truth boxes T varies and is usually different from the predefined number H, although we can safely assume H ≥ T for all input point clouds. In addition, there is no box order for either the predicted or the ground truth boxes.

Figure 4: The architecture of the bounding box regression branch. The predicted H boxes are optimally associated with the T ground truth boxes before calculating the multi-criteria loss.

Optimal Association Formulation: To associate a unique predicted bounding box from B with each ground truth box of B̄, we formulate this association process as an optimal assignment problem. Formally, let A be a boolean association matrix where A_{i,j} = 1 iff the ith predicted box is assigned to the jth ground truth box. A is also called the association index in this paper. Let C be the association cost matrix where C_{i,j} represents the cost of assigning the ith predicted box to the jth ground truth box. Basically, the cost C_{i,j} represents the similarity between two boxes: the lower the cost, the more similar the two boxes. Therefore, the bounding box association problem is to find the optimal assignment matrix A with the minimal overall cost:

A = \arg\min_A \sum_{i=1}^{H} \sum_{j=1}^{T} C_{i,j} A_{i,j} \quad \text{subject to} \quad \sum_{i=1}^{H} A_{i,j} = 1, \;\; \sum_{j=1}^{T} A_{i,j} \le 1, \;\; j \in \{1..T\}, \; i \in \{1..H\}    (1)

To solve the above optimal association problem, the existing Hungarian algorithm [21; 22] is applied.

Association Matrix Calculation: To evaluate the similarity between the ith predicted box and the jth ground truth box, a simple and intuitive criterion is the Euclidean distance between the two pairs of min-max vertices. However, it is not optimal. Basically, we want the predicted box to include as many valid points as possible. As illustrated in Figure 5, the input point cloud is usually sparse and distributed non-uniformly in 3D space. Regarding the same ground truth box #0 (blue), the candidate box #2 (red) is much better than the candidate #1 (black), because box #2 has more valid points overlapping with #0. Therefore, the coverage of valid points should be included in the calculation of the cost matrix C. In this paper, we consider the following three criteria:

Figure 5: A sparse input point cloud.

(1) Euclidean Distance between Vertices. Formally, the cost between the ith predicted box B_i and the jth ground truth box B̄_j is calculated as follows:

C^{ed}_{i,j} = \frac{1}{6} \sum (B_i - \bar{B}_j)^2    (2)

(2) Soft Intersection-over-Union on Points. Given the input point cloud P and the jth ground truth instance box B̄_j, it is possible to directly obtain a hard-binary vector q̄_j ∈ R^N representing whether each point is inside the box or not, where '1' indicates the point is inside and '0' outside. However, for a specific ith predicted box of the same input point cloud P, directly obtaining a similar hard-binary vector would make the framework non-differentiable, due to the discretization operation. Therefore, we introduce a differentiable yet simple Algorithm 1 to obtain a similar but soft-binary vector q_i, called the point-in-pred-box-probability, where all values are in the range (0, 1). The deeper the corresponding point is inside the box, the higher the value; the farther away the point is outside, the smaller the value.

Algorithm 1: An algorithm to calculate the point-in-pred-box-probability. H is the number of predicted bounding boxes B, N is the number of points in the point cloud P, and θ_1 and θ_2 are hyperparameters for numerical stability. We use θ_1 = 100, θ_2 = 20 in all our implementation.

for i ← 1 to H do
    • the ith box min-vertex B^i_min = [x^i_min y^i_min z^i_min]
    • the ith box max-vertex B^i_max = [x^i_max y^i_max z^i_max]
    for n ← 1 to N do
        • the nth point location P^n = [x^n y^n z^n]
        • step 1: Δ_xyz ← (B^i_min − P^n)(P^n − B^i_max)
        • step 2: Δ_xyz ← max[min(θ_1 Δ_xyz, θ_2), −θ_2]
        • step 3: probability p_xyz = 1 / (1 + exp(−Δ_xyz))
        • step 4: point probability q^n_i = min(p_xyz)
    • obtain the soft-binary vector q_i = [q^1_i ... q^N_i]

The above two loops are only for illustration; they are easily replaced by standard and efficient matrix operations.

Figure 6: The architecture of the point mask prediction branch. The point features are fused with each bounding box and score, after which a point-level binary mask is predicted for each instance.

Formally, the Soft Intersection-over-Union (sIoU) cost between the ith predicted box and the jth ground truth box is defined as follows:

C^{sIoU}_{i,j} = \frac{-\sum_{n=1}^{N} (q^n_i * \bar{q}^n_j)}{\sum_{n=1}^{N} q^n_i + \sum_{n=1}^{N} \bar{q}^n_j - \sum_{n=1}^{N} (q^n_i * \bar{q}^n_j)}    (3)

where q^n_i and q̄^n_j are the nth values of q_i and q̄_j.

(3) Cross-Entropy Score. In addition, we also consider the cross-entropy score between q_i and q̄_j. Unlike the sIoU cost, which prefers tighter boxes, this score represents how confidently a predicted bounding box is able to include as many valid points as possible. It prefers larger and more inclusive boxes, and is formally defined as:

C^{ces}_{i,j} = -\frac{1}{N} \sum_{n=1}^{N} \left[ \bar{q}^n_j \log q^n_i + (1 - \bar{q}^n_j) \log(1 - q^n_i) \right]    (4)

Overall, criterion (1) guarantees the geometric boundaries of the learnt boxes, while criteria (2)(3) maximize the coverage of valid points and overcome the non-uniformity illustrated in Figure 5. The final association cost between the ith predicted box and the jth ground truth box is defined as:

C_{i,j} = C^{ed}_{i,j} + C^{sIoU}_{i,j} + C^{ces}_{i,j}    (5)

Loss Functions: After the bounding box association layer, both the predicted boxes B and scores B_s are reordered using the association index A, such that the first T predicted boxes and scores are well paired with the T ground truth boxes.

Multi-criteria Loss for Box Prediction: The previous association layer finds the most similar predicted box for each ground truth box according to the minimal cost, comprising: 1) vertex Euclidean distance, 2) sIoU cost on points, and 3) cross-entropy score.
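The three criteria, together with the soft point-in-box probability of Algorithm 1, can be sketched in NumPy as follows (the function names and the epsilon added inside the logarithms for numerical safety are our own illustrative choices):

```python
import numpy as np

def point_in_box_prob(box_min, box_max, pts, theta1=100.0, theta2=20.0):
    """Algorithm 1: differentiable soft point-in-box probability.
    (min - p) * (p - max) is positive along an axis exactly when the
    point lies inside the box on that axis; clipping to [-theta2,
    theta2] keeps the sigmoid numerically stable."""
    d = (box_min - pts) * (pts - box_max)           # N x 3
    d = np.clip(theta1 * d, -theta2, theta2)
    p = 1.0 / (1.0 + np.exp(-d))
    return p.min(axis=1)                            # soft-binary q, length N

def pair_cost(box, gt_box, pts, q_gt):
    """Sum of the three criteria of Eqs. (2)-(5): vertex distance,
    soft IoU on points, and cross-entropy. q_gt is the hard 0/1
    inside-mask of the ground-truth box. Illustrative sketch only."""
    c_ed = np.mean((box - gt_box) ** 2)             # Eq. (2): 1/6 of the sum
    q = point_in_box_prob(box[0], box[1], pts)
    inter = np.sum(q * q_gt)
    c_siou = -inter / (q.sum() + q_gt.sum() - inter)        # Eq. (3)
    eps = 1e-8                                      # illustrative safeguard
    c_ces = -np.mean(q_gt * np.log(q + eps)
                     + (1 - q_gt) * np.log(1 - q + eps))    # Eq. (4)
    return c_ed + c_siou + c_ces                            # Eq. (5)

pts = np.array([[0.5, 0.5, 0.5], [2.0, 2.0, 2.0]])  # one inside, one outside
gt = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])   # unit ground-truth box
q_gt = np.array([1.0, 0.0])
cost = pair_cost(gt.copy(), gt, pts, q_gt)  # a perfect box gives a low cost
```

For a perfect prediction the Euclidean term vanishes and the sIoU term approaches its minimum of −1, so the total cost is close to −1.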
Therefore, the loss function for bounding box prediction is naturally designed to consistently minimize those costs. It is formally defined as follows:

\ell_{bbox} = \frac{1}{T} \sum_{t=1}^{T} (C^{ed}_{t,t} + C^{sIoU}_{t,t} + C^{ces}_{t,t})    (6)

where C^{ed}_{t,t}, C^{sIoU}_{t,t} and C^{ces}_{t,t} are the costs of the tth paired boxes. Note that we only minimize the cost of the T paired boxes; the remaining H − T predicted boxes are ignored because there is no corresponding ground truth for them. Therefore, this box prediction sub-branch is agnostic to the predefined value of H. This raises an issue: since the H − T negative predictions are not penalized, the network might predict multiple similar boxes for a single instance. Fortunately, the loss function for the parallel box score prediction is able to alleviate this problem.

Loss for Box Score Prediction: The predicted box scores aim to indicate the validity of the corresponding predicted boxes. After being reordered by the association index A, the ground truth scores for the first T scores are all '1', and '0' for the remaining invalid H − T scores. We use the cross-entropy loss for this binary classification task:

\ell_{bbs} = -\frac{1}{H} \left[ \sum_{t=1}^{T} \log B^t_s + \sum_{t=T+1}^{H} \log(1 - B^t_s) \right]    (7)

where B^t_s is the tth predicted score after being associated.
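A sketch of this score loss, with a small epsilon added inside the logarithms for numerical safety (an illustrative detail, not stated in the paper):

```python
import numpy as np

def box_score_loss(scores, T):
    """Eq. (7): binary cross-entropy on the associated box scores.
    After reordering, the first T scores have ground truth '1' and
    the remaining H - T have ground truth '0'."""
    eps = 1e-8                       # illustrative numerical safeguard
    H = len(scores)
    pos = np.log(scores[:T] + eps).sum()
    neg = np.log(1.0 - scores[T:] + eps).sum()
    return -(pos + neg) / H

good = np.array([0.99, 0.98, 0.01, 0.02])  # 2 valid, 2 invalid boxes
bad = np.array([0.01, 0.02, 0.99, 0.98])   # scores flipped
# box_score_loss(good, T=2) is far smaller than box_score_loss(bad, T=2)
```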
Basically, this loss function rewards correctly predicted bounding boxes, while implicitly penalizing cases where multiple similar boxes are regressed for a single instance.

2.3 Point Mask Prediction

Given the predicted bounding boxes B, the learnt point features F^l and global features F^g, the point mask prediction branch processes each bounding box individually with shared neural layers.

Table 1: Instance segmentation results on the ScanNet(v2) benchmark (hidden test set) across 18 object categories. The metric is AP(%) with an IoU threshold of 0.5. Accessed on 2 June 2019.

Method | mean AP
MaskRCNN [13] | 5.8
SGPN [51] | 14.3
3D-BEVIS [8] | 24.8
R-PointNet [59] | 30.6
UNet-Backbone [29] | 31.9
3D-SIS (5 views) [15] | 38.2
MASC [31] | 44.7
ResNet-Backbone [29] | 45.9
PanopticFusion [34] | 47.8
MTML | 48.1
3D-BoNet (Ours) | 48.8

Neural Layers: As shown in Figure 6, both the point and global features are compressed to 256 dimensional vectors through fully connected layers, before being concatenated and further compressed to 128 dimensional mixed point features F̃^l. For the ith predicted bounding box B_i, the estimated vertices and score are fused with the features F̃^l through concatenation, producing box-aware features F̂^l. These features are then fed through shared layers, predicting a point-level binary mask, denoted as M_i. We use sigmoid as the last activation function. This simple box fusing approach is extremely computationally efficient, compared with the commonly used RoIAlign in prior art [59; 15; 13], which involves expensive point feature sampling and alignment.

Loss Function: The predicted instance masks M are similarly associated with the ground truth masks according to the previous association index A. Due to the imbalance of instance and background point numbers, we use the focal loss [30] with default hyper-parameters instead of the standard cross-entropy loss to optimize this branch. Only the valid T paired masks are used for the loss ℓ_pmask.

2.4 End-to-End Implementation

While our framework is not restricted to any particular point cloud network, we adopt PointNet++ [39] as the backbone to learn the local and global features.
In parallel, another separate branch is implemented to learn per-point semantics with the standard softmax cross-entropy loss function ℓ_sem. The architecture of the backbone and semantic branch is the same as used in [51]. Given an input point cloud P, the above three branches are linked and trained end-to-end using a single combined multi-task loss:

\ell_{all} = \ell_{sem} + \ell_{bbox} + \ell_{bbs} + \ell_{pmask}    (8)

We use the Adam solver [19] with its default hyper-parameters for optimization. The initial learning rate is set to 5e−4 and then divided by 2 every 20 epochs. The whole network is trained on a Titan X GPU from scratch. We use the same settings for all experiments, which guarantees the reproducibility of our framework.

3 Experiments

3.1 Evaluation on ScanNet Benchmark

We first evaluate our approach on the ScanNet(v2) 3D semantic instance segmentation benchmark [7]. Similar to SGPN [51], we divide the raw input point clouds into 1m × 1m blocks for training, while using all points for testing, followed by the BlockMerging algorithm [51] to assemble the blocks into complete 3D scenes. In our experiments, we observe that the performance of the vanilla PointNet++ based semantic prediction sub-branch is limited and unable to provide satisfactory semantics. Thanks to the flexibility of our framework, we therefore easily train a parallel SCN network [11] to estimate more accurate per-point semantic labels for the predicted instances of our 3D-BoNet. The average precision (AP) with an IoU threshold of 0.5 is used as the evaluation metric.

We compare with the leading approaches on 18 object categories in Table 1. In particular, SGPN [51], 3D-BEVIS [8], MASC [31] and [29] are point feature clustering based approaches; R-PointNet [59] learns to generate dense object proposals followed by point-level segmentation; 3D-SIS [15] is a proposal-based approach using both point clouds and color images as input.
PanopticFusion [34] learns to segment instances on multiple 2D images with Mask-RCNN [13] and then uses a SLAM system to reproject the results back into 3D space. Our approach surpasses them all using point clouds only. Remarkably, our framework performs relatively well on all categories without preferring specific classes, demonstrating its superiority.

Figure 7: A lecture room with hundreds of objects (e.g., chairs, tables), highlighting the challenge of instance segmentation. Different colors indicate different instances; the same instance may not have the same color. Our framework predicts more precise instance labels than the others.

3.2 Evaluation on S3DIS Dataset

We further evaluate the semantic instance segmentation of our framework on S3DIS [1], which consists of complete 3D scans of 271 rooms belonging to 6 large areas. Our data preprocessing and experimental settings strictly follow PointNet [38], SGPN [51], ASIS [52], and JSIS3D [35]. In our experiments, H is set to 24 and we follow the 6-fold evaluation [1; 52].

We compare with ASIS [52], the state of the art on S3DIS, and the PartNet baseline [33]. For fair comparison, we carefully train the PartNet baseline with the same PointNet++ backbone and other settings as used in our framework. For evaluation, the classical metrics mean precision (mPrec) and mean recall (mRec) with an IoU threshold of 0.5 are reported. Note that we use the same BlockMerging algorithm [51] to merge the instances from different blocks for both our approach and the PartNet baseline. The final scores are averaged across the total 13 categories.

Table 2: Instance segmentation results on the S3DIS dataset.

Method | mPrec | mRec
PartNet [33] | 56.4 | 43.4
ASIS [52] | 63.6 | 47.5
3D-BoNet (Ours) | 65.6 | 47.6

Table 2 presents the mPrec/mRec scores and Figure 7 shows qualitative results.
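The precision/recall computation at a fixed IoU threshold can be sketched as a greedy matching of predicted instance masks to ground-truth masks; this class-agnostic, greedy version is a simplification of the actual evaluation protocol, for illustration only:

```python
def eval_prec_rec(pred_masks, gt_masks, iou_thresh=0.5):
    """Simplified precision/recall at an IoU threshold. A predicted
    instance counts as a true positive if it overlaps some still
    unmatched ground-truth instance with IoU above the threshold.
    Masks are represented as sets of point indices."""
    matched = set()
    tp = 0
    for pm in pred_masks:
        for j, gm in enumerate(gt_masks):
            if j in matched:
                continue
            iou = len(pm & gm) / len(pm | gm)
            if iou > iou_thresh:
                matched.add(j)
                tp += 1
                break
    return tp / len(pred_masks), tp / len(gt_masks)

gt = [{0, 1, 2, 3}, {4, 5, 6}]
pred = [{0, 1, 2}, {7, 8}]       # one good match, one false positive
prec, rec = eval_prec_rec(pred, gt)
# prec == 0.5 and rec == 0.5: one of two predictions matches
# (IoU 3/4 > 0.5), and one of two ground truths is recovered.
```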
Our method surpasses the PartNet baseline [33] by large margins, and also outperforms ASIS [52], though not significantly, mainly because our semantic prediction branch (vanilla PointNet++ based) is inferior to ASIS, which tightly fuses semantic and instance features for mutual optimization. We leave this feature fusion as future work.

3.3 Ablation Study

To evaluate the effectiveness of each component of our framework, we conduct 6 groups of ablation experiments on the largest area (Area 5) of the S3DIS dataset.

(1) Remove Box Score Prediction Sub-branch. Basically, the box score serves as an indicator and regularizer for valid bounding box prediction. After removing it, we train the network with:

\ell_{ab1} = \ell_{sem} + \ell_{bbox} + \ell_{pmask}

By default, the multi-criteria loss function is a simple unweighted combination of the Euclidean distance, the soft IoU cost, and the cross-entropy score. However, this may not be optimal, because the density of input point clouds is usually inconsistent and tends to favor different criteria. We conduct the following 3 groups of experiments on the ablated bounding box loss function.

(2)-(4) Use a Single Criterion. Only one criterion is used for the box association and the loss ℓ_bbox:

\ell_{ab2} = \ell_{sem} + \frac{1}{T} \sum_{t=1}^{T} C^{ed}_{t,t} + \ell_{bbs} + \ell_{pmask} \quad \cdots \quad \ell_{ab4} = \ell_{sem} + \frac{1}{T} \sum_{t=1}^{T} C^{ces}_{t,t} + \ell_{bbs} + \ell_{pmask}

(5) Do Not Supervise Box Prediction. The predicted boxes are still associated according to the three criteria, but we remove the box supervision signal. The framework is trained with:

\ell_{ab5} = \ell_{sem} + \ell_{bbs} + \ell_{pmask}

(6) Remove Focal Loss for Point Mask Prediction. In the point mask prediction branch, the focal loss is replaced by the standard cross-entropy loss for comparison.

Table 3: Instance segmentation results of all ablation experiments on Area 5 of S3DIS.

Ablation | mPrec | mRec
(1) Remove Box Score Sub-branch | 50.9 | 40.9
(2) Euclidean Distance Only | 53.8 | 41.1
(3) Soft IoU Cost Only | 55.2 | 40.6
(4) Cross-Entropy Score Only | 51.8 | 37.8
(5) Do Not Supervise Box Prediction | 28.5 | 37.3
(6) Remove Focal Loss | 50.8 | 39.2
(7) The Full Framework | 57.5 | 40.2

Analysis. Table 3 shows the scores for the ablation experiments. (1) The box score sub-branch indeed benefits the overall instance segmentation performance, as it tends to penalize duplicated box predictions. (2) Compared with the Euclidean distance and the cross-entropy score, the sIoU cost tends to be better for box association and supervision, thanks to our differentiable Algorithm 1. As the three individual criteria prefer different types of point structures, a simple combination of the three may not always be optimal on a specific dataset. (3) Without supervision for box prediction, the performance drops significantly, primarily because the network is unable to infer satisfactory instance 3D boundaries and the quality of the predicted point masks deteriorates accordingly. (4) Compared with the focal loss, the standard cross-entropy loss is less effective for point mask prediction due to the imbalance of instance and background point numbers.

3.4 Computation Analysis

(1) For point feature clustering based approaches including SGPN [51], ASIS [52], JSIS3D [35], 3D-BEVIS [8], MASC [31], and [29], the computational complexity of the post clustering algorithm such as Mean Shift [6] tends towards O(TN²), where T is the number of instances and N is the number of input points.
(2) For dense proposal-based methods including GSPN [59], 3D-SIS [15] and PanopticFusion [34], a region proposal network and non-maximum suppression are usually required to generate and prune the dense proposals, which is computationally expensive [34]. (3) Both the PartNet baseline [33] and our 3D-BoNet have a similarly efficient computational complexity of $O(N)$. Empirically, our 3D-BoNet takes around 20 ms of GPU time to process 4k points, while most approaches in (1) and (2) need more than 200 ms of GPU/CPU time to process the same number of points.

4 Related Work

To extract features from 3D point clouds, traditional approaches usually craft features manually [5; 43]. Recent learning-based approaches mainly include voxel-based [43; 47; 42; 24; 41; 11; 4] and point-based schemes [38; 20; 14; 17; 46].

Semantic Segmentation. PointNet [38] shows leading results on classification and semantic segmentation, but it does not capture contextual features. To address this, a number of approaches [39; 58; 44; 32; 56; 50; 27; 18] have been proposed recently. Another line of work is based on convolutional kernels [56; 28; 48]. Most of these approaches can serve as our backbone network and be trained in parallel with our 3D-BoNet to learn per-point semantics.

Object Detection. A common way to detect objects in 3D point clouds is to project points onto 2D images to regress bounding boxes [26; 49; 3; 57; 60; 54]. Detection performance is further improved by fusing RGB images in [3; 55; 37; 53]. Point clouds can also be divided into voxels for object detection [9; 25; 61]. However, most of these approaches rely on predefined anchors and the two-stage region proposal network [40], which are inefficient to extend to 3D point clouds. Without relying on anchors, the recent PointRCNN [45] learns to detect via foreground point segmentation, and VoteNet [36] detects objects via point feature grouping, sampling and voting.
By contrast, our box prediction branch is completely different: our framework directly regresses 3D object bounding boxes from compact global features through a single forward pass.

Instance Segmentation. SGPN [51] is the first neural algorithm to segment instances on 3D point clouds by grouping point-level embeddings. ASIS [52], JSIS3D [35], MASC [31], 3D-BEVIS [8] and [29] use the same strategy of grouping point-level features for instance segmentation. Mo et al. introduce a segmentation algorithm in PartNet [33] by classifying point features. However, the learnt segments of these proposal-free methods do not have high objectness, as they do not explicitly detect object boundaries. Drawing on the success of the 2D RPN [40] and RoI-based methods [13], GSPN [59] and 3D-SIS [15] are proposal-based methods for 3D instance segmentation. However, they usually rely on two-stage training and a post-processing step for dense proposal pruning. By contrast, our framework directly predicts a point-level mask for each instance within an explicitly detected object boundary, without requiring any post-processing steps.

5 Conclusion

Our framework is simple, effective and efficient for instance segmentation on 3D point clouds. However, it also has some limitations, which point to future work. (1) Instead of using an unweighted combination of the three criteria, it would be better to design a module that automatically learns the weights, so as to adapt to different types of input point clouds. (2) Instead of training a separate branch for semantic prediction, more advanced feature fusion modules could be introduced to mutually improve both semantic and instance segmentation. (3) Our framework follows the MLP design and is therefore agnostic to the number and order of input points. It is desirable to directly train and test on large-scale input point clouds instead of divided small blocks, drawing on the recent work [10; 23; 16].

References

[1] I.
Armeni, O. Sener, A. Zamir, and H. Jiang. 3D Semantic Parsing of Large-Scale Indoor Spaces. CVPR, 2016.

[2] Y. Bengio, N. Léonard, and A. Courville. Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation. arXiv, 2013.

[3] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia. Multi-View 3D Object Detection Network for Autonomous Driving. CVPR, 2017.

[4] C. Choy, J. Gwak, and S. Savarese. 4D Spatio-Temporal ConvNets: Minkowski Convolutional Neural Networks. CVPR, 2019.

[5] C. S. Chua and R. Jarvis. Point Signatures: A New Representation for 3D Object Recognition. IJCV, 25(1):63–85, 1997.

[6] D. Comaniciu and P. Meer. Mean Shift: A Robust Approach toward Feature Space Analysis. TPAMI, 24(5):603–619, 2002.

[7] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner. ScanNet: Richly-annotated 3D Reconstructions of Indoor Scenes. CVPR, 2017.

[8] C. Elich, F. Engelmann, J. Schult, T. Kontogianni, and B. Leibe. 3D-BEVIS: Birds-Eye-View Instance Segmentation. GCPR, 2019.

[9] M. Engelcke, D. Rao, D. Z. Wang, C. H. Tong, and I. Posner. Vote3Deep: Fast Object Detection in 3D Point Clouds Using Efficient Convolutional Neural Networks. ICRA, 2017.

[10] F. Engelmann, T. Kontogianni, A. Hermans, and B. Leibe. Exploring Spatial Context for 3D Semantic Segmentation of Point Clouds. ICCV Workshops, 2017.

[11] B. Graham, M. Engelcke, and L. v. d. Maaten. 3D Semantic Segmentation with Submanifold Sparse Convolutional Networks. CVPR, 2018.

[12] A. Grover, E. Wang, A. Zweig, and S. Ermon. Stochastic Optimization of Sorting Networks via Continuous Relaxations. ICLR, 2019.

[13] K. He, G. Gkioxari, P. Dollar, and R. Girshick. Mask R-CNN. ICCV, 2017.

[14] P. Hermosilla, T. Ritschel, P.-P. Vazquez, A. Vinacua, and T. Ropinski. Monte Carlo Convolution for Learning on Non-Uniformly Sampled Point Clouds. ACM Transactions on Graphics, 2018.

[15] J.
Hou, A. Dai, and M. Nießner. 3D-SIS: 3D Semantic Instance Segmentation of RGB-D Scans. CVPR, 2019.

[16] Q. Hu, B. Yang, L. Xie, S. Rosa, Y. Guo, Z. Wang, N. Trigoni, and A. Markham. RandLA-Net: Efficient Semantic Segmentation of Large-Scale Point Clouds. arXiv preprint arXiv:1911.11236, 2019.

[17] B.-S. Hua, M.-K. Tran, and S.-K. Yeung. Pointwise Convolutional Neural Networks. CVPR, 2018.

[18] Q. Huang, W. Wang, and U. Neumann. Recurrent Slice Networks for 3D Segmentation of Point Clouds. CVPR, 2018.

[19] D. P. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. ICLR, 2015.

[20] R. Klokov and V. Lempitsky. Escape from Cells: Deep Kd-Networks for the Recognition of 3D Point Cloud Models. ICCV, 2017.

[21] H. W. Kuhn. The Hungarian Method for the Assignment Problem. Naval Research Logistics Quarterly, 2(1-2):83–97, 1955.

[22] H. W. Kuhn. Variants of the Hungarian Method for Assignment Problems. Naval Research Logistics Quarterly, 3(4):253–258, 1956.

[23] L. Landrieu and M. Simonovsky. Large-scale Point Cloud Semantic Segmentation with Superpoint Graphs. CVPR, 2018.

[24] T. Le and Y. Duan. PointGrid: A Deep Network for 3D Shape Understanding. CVPR, 2018.

[25] B. Li. 3D Fully Convolutional Network for Vehicle Detection in Point Cloud. IROS, 2017.

[26] B. Li, T. Zhang, and T. Xia. Vehicle Detection from 3D Lidar Using Fully Convolutional Network. RSS, 2016.

[27] J. Li, B. M. Chen, and G. H. Lee. SO-Net: Self-Organizing Network for Point Cloud Analysis. CVPR, 2018.

[28] Y. Li, R. Bu, M. Sun, W. Wu, X. Di, and B. Chen. PointCNN: Convolution On X-Transformed Points. NeurIPS, 2018.

[29] Z. Liang, M. Yang, and C. Wang. 3D Graph Embedding Learning with a Structure-aware Loss Function for Point Cloud Semantic Instance Segmentation. arXiv, 2019.

[30] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar. Focal Loss for Dense Object Detection. ICCV, 2017.

[31] C. Liu and Y.
Furukawa. MASC: Multi-scale Affinity with Sparse Convolution for 3D Instance Segmentation. arXiv, 2019.

[32] S. Liu, S. Xie, Z. Chen, and Z. Tu. Attentional ShapeContextNet for Point Cloud Recognition. CVPR, 2018.

[33] K. Mo, S. Zhu, A. X. Chang, L. Yi, S. Tripathi, L. J. Guibas, and H. Su. PartNet: A Large-scale Benchmark for Fine-grained and Hierarchical Part-level 3D Object Understanding. CVPR, 2019.

[34] G. Narita, T. Seno, T. Ishikawa, and Y. Kaji. PanopticFusion: Online Volumetric Semantic Mapping at the Level of Stuff and Things. IROS, 2019.

[35] Q.-H. Pham, D. T. Nguyen, B.-S. Hua, G. Roig, and S.-K. Yeung. JSIS3D: Joint Semantic-Instance Segmentation of 3D Point Clouds with Multi-Task Pointwise Networks and Multi-Value Conditional Random Fields. CVPR, 2019.

[36] C. R. Qi, O. Litany, K. He, and L. J. Guibas. Deep Hough Voting for 3D Object Detection in Point Clouds. ICCV, 2019.

[37] C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas. Frustum PointNets for 3D Object Detection from RGB-D Data. CVPR, 2018.

[38] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. CVPR, 2017.

[39] C. R. Qi, L. Yi, H. Su, and L. J. Guibas. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space. NIPS, 2017.

[40] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards Real-time Object Detection with Region Proposal Networks. NIPS, 2015.

[41] D. Rethage, J. Wald, J. Sturm, N. Navab, and F. Tombari. Fully-Convolutional Point Networks for Large-Scale Point Clouds. ECCV, 2018.

[42] G. Riegler, A. O. Ulusoy, and A. Geiger. OctNet: Learning Deep 3D Representations at High Resolutions. CVPR, 2017.

[43] R. B. Rusu, N. Blodow, and M. Beetz. Fast Point Feature Histograms (FPFH) for 3D Registration. ICRA, 2009.

[44] Y. Shen, C. Feng, Y. Yang, and D. Tian.
Mining Point Cloud Local Structures by Kernel Correlation and Graph Pooling. CVPR, 2018.

[45] S. Shi, X. Wang, and H. Li. PointRCNN: 3D Object Proposal Generation and Detection from Point Cloud. CVPR, 2019.

[46] H. Su, V. Jampani, D. Sun, S. Maji, E. Kalogerakis, M.-H. Yang, and J. Kautz. SPLATNet: Sparse Lattice Networks for Point Cloud Processing. CVPR, 2018.

[47] L. P. Tchapmi, C. B. Choy, I. Armeni, J. Gwak, and S. Savarese. SEGCloud: Semantic Segmentation of 3D Point Clouds. 3DV, 2017.

[48] H. Thomas, C. R. Qi, J.-E. Deschaud, B. Marcotegui, F. Goulette, and L. J. Guibas. KPConv: Flexible and Deformable Convolution for Point Clouds. ICCV, 2019.

[49] V. Vaquero, I. Del Pino, F. Moreno-Noguer, J. Solà, A. Sanfeliu, and J. Andrade-Cetto. Deconvolutional Networks for Point-Cloud Vehicle Detection and Tracking in Driving Scenarios. ECMR, 2017.

[50] C. Wang, B. Samari, and K. Siddiqi. Local Spectral Graph Convolution for Point Set Feature Learning. ECCV, 2018.

[51] W. Wang, R. Yu, Q. Huang, and U. Neumann. SGPN: Similarity Group Proposal Network for 3D Point Cloud Instance Segmentation. CVPR, 2018.

[52] X. Wang, S. Liu, X. Shen, C. Shen, and J. Jia. Associatively Segmenting Instances and Semantics in Point Clouds. CVPR, 2019.

[53] Z. Wang, W. Zhan, and M. Tomizuka. Fusing Bird View LIDAR Point Cloud and Front View Camera Image for Deep Object Detection. arXiv, 2018.

[54] B. Wu, A. Wan, X. Yue, and K. Keutzer. SqueezeSeg: Convolutional Neural Nets with Recurrent CRF for Real-Time Road-Object Segmentation from 3D LiDAR Point Cloud. arXiv, 2017.

[55] D. Xu, D. Anguelov, and A. Jain. PointFusion: Deep Sensor Fusion for 3D Bounding Box Estimation. CVPR, 2018.

[56] Y. Xu, T. Fan, M. Xu, L. Zeng, and Y. Qiao. SpiderCNN: Deep Learning on Point Sets with Parameterized Convolutional Filters. ECCV, 2018.

[57] G. Yang, Y. Cui, S. Belongie, and B. Hariharan.
Learning Single-View 3D Reconstruction with Limited Pose Supervision. ECCV, 2018.

[58] X. Ye, J. Li, H. Huang, L. Du, and X. Zhang. 3D Recurrent Neural Networks with Context Fusion for Point Cloud Semantic Segmentation. ECCV, 2018.

[59] L. Yi, W. Zhao, H. Wang, M. Sung, and L. Guibas. GSPN: Generative Shape Proposal Network for 3D Instance Segmentation in Point Cloud. CVPR, 2019.

[60] Y. Zeng, Y. Hu, S. Liu, J. Ye, Y. Han, X. Li, and N. Sun. RT3D: Real-Time 3D Vehicle Detection in LiDAR Point Cloud for Autonomous Driving. IEEE Robotics and Automation Letters, 3(4):3434–3440, 2018.

[61] Y. Zhou and O. Tuzel. VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection. CVPR, 2018.