{"title": "Semantic Labeling of 3D Point Clouds for Indoor Scenes", "book": "Advances in Neural Information Processing Systems", "page_first": 244, "page_last": 252, "abstract": "Inexpensive RGB-D cameras that give an RGB image together with depth data have become widely available. In this paper, we use this data to build 3D point clouds of full indoor scenes such as an office and address the task of semantic labeling of these 3D point clouds. We propose a graphical model that captures various features and contextual relations, including the local visual appearance and shape cues, object co-occurence relationships and geometric relationships. With a large number of object classes and relations, the model\u2019s parsimony becomes important and we address that by using multiple types of edge potentials. The model admits efficient approximate inference, and we train it using a maximum-margin learning approach. In our experiments over a total of 52 3D scenes of homes and offices (composed from about 550 views, having 2495 segments labeled with 27 object classes), we get a performance of 84.06% in labeling 17 object classes for offices, and 73.38% in labeling 17 object classes for home scenes. Finally, we applied these algorithms successfully on a mobile robot for the task of finding objects in large cluttered rooms.", "full_text": "Semantic Labeling of 3D Point Clouds for\n\nIndoor Scenes\n\nHema Swetha Koppula\u2217, Abhishek Anand\u2217, Thorsten Joachims, and Ashutosh Saxena\n\nDepartment of Computer Science, Cornell University.\n{hema,aa755,tj,asaxena}@cs.cornell.edu\n\nAbstract\n\nInexpensive RGB-D cameras that give an RGB image together with depth data\nhave become widely available. In this paper, we use this data to build 3D point\nclouds of full indoor scenes such as an of\ufb01ce and address the task of semantic la-\nbeling of these 3D point clouds. 
We propose a graphical model that captures various features and contextual relations, including the local visual appearance and shape cues, object co-occurrence relationships and geometric relationships. With a large number of object classes and relations, the model's parsimony becomes important and we address that by using multiple types of edge potentials. The model admits efficient approximate inference, and we train it using a maximum-margin learning approach. In our experiments over a total of 52 3D scenes of homes and offices (composed from about 550 views, having 2495 segments labeled with 27 object classes), we get a performance of 84.06% in labeling 17 object classes for offices, and 73.38% in labeling 17 object classes for home scenes. Finally, we applied these algorithms successfully on a mobile robot for the task of finding objects in large cluttered rooms.1\n\n1 Introduction\nInexpensive RGB-D sensors that augment an RGB image with depth data have recently become widely available. At the same time, years of research on SLAM (Simultaneous Localization and Mapping) now make it possible to reliably merge multiple RGB-D images into a single point cloud, easily providing an approximate 3D model of a complete indoor scene (e.g., a room). In this paper, we explore how this move from part-of-scene 2D images to full-scene 3D point clouds can improve the richness of models for object labeling.\nIn the past, a significant amount of work has been done in semantic labeling of 2D images. However, a lot of valuable information about the shape and geometric layout of objects is lost when a 2D image is formed from the corresponding 3D world. 
A classifier that has access to a full 3D model can access important geometric properties in addition to the local shape and appearance of an object. For example, many objects occur in characteristic relative geometric configurations (e.g., a monitor is almost always on a table), and many objects consist of visually distinct parts that occur in a certain relative configuration. More generally, a 3D model makes it easy to reason about a variety of properties based on 3D distances, volume and local convexity.\nSome recent works attempt to first infer the geometric layout from 2D images for improving object detection [12, 14, 28]. However, such a geometric layout is not accurate enough to give significant improvements. Other recent work [35] considers labeling a scene using a single 3D view (i.e., a 2.5D representation). In our work, we first use SLAM in order to compose multiple views from a Microsoft Kinect RGB-D sensor together into one 3D point cloud, providing each RGB pixel with an absolute 3D location in the scene. We then (over-)segment the scene and predict semantic labels for each segment (see Fig. 1). 
We predict not only coarse classes like in [1, 35] (i.e., wall, ground, ceiling, building), but also label individual objects (e.g., printer, keyboard, mouse). Furthermore, we model rich relational information beyond an associative coupling of labels [1].\n\n1This work was first presented at [16].\n\u2217 indicates equal contribution.\n\nFigure 1: Office scene (top) and home scene (bottom) with the corresponding label coloring above the images. The left-most is the original point cloud, the middle is the ground-truth labeling and the right-most is the point cloud with predicted labels.\n\nIn this paper, we propose and evaluate the first model and learning algorithm for scene understanding that exploits rich relational information derived from the full-scene 3D point cloud for object labeling. In particular, we propose a graphical model that naturally captures the geometric relationships of a 3D scene. Each 3D segment is associated with a node, and pairwise potentials model the relationships between segments (e.g., co-planarity, convexity, visual similarity, object co-occurrences and proximity). The model admits efficient approximate inference [25], and we show that it can be trained using a maximum-margin approach [7, 31, 34] that globally minimizes an upper bound on the training loss. We model both associative and non-associative coupling of labels. With a large number of object classes, the model's parsimony becomes important. Some features are better indicators of label similarity, while other features are better indicators of non-associative relations such as geometric arrangement (e.g., on-top-of, in-front-of). We therefore introduce parsimony in the model by using appropriate clique potentials rather than using general clique potentials. 
Our model is highly flexible and our software is available as a ROS package at: http://pr.cs.cornell.edu/sceneunderstanding\nTo empirically evaluate our model and algorithms, we perform several experiments over a total of 52 scenes of two types: offices and homes. These scenes were built from about 550 views from the Kinect sensor, and they are also available for public use. We consider labeling each segment (from a total of about 50 segments per scene) into 27 classes (17 for offices and 17 for homes, with 7 classes in common). Our experiments show that our method, which captures several local cues and contextual properties, achieves an overall performance of 84.06% on office scenes and 73.38% on home scenes. We also consider the problem of labeling 3D segments with multiple attributes meaningful in a robotics context (such as small objects that can be manipulated, furniture, etc.). Finally, we successfully applied these algorithms on mobile robots for the task of finding objects in cluttered office scenes.\n2 Related Work\nThere is a huge body of work in the area of scene understanding and object recognition from 2D images. Previous works focus on several different aspects: designing good local features such as HOG (histogram-of-gradients) [5] and bag of words [4], and designing good global (context) features such as GIST features [33]. However, these approaches do not consider the relative arrangement of the parts of the object or of multiple objects with respect to each other. A number of works propose models that explicitly capture the relations between different parts of the object, e.g., the part-based models of Felzenszwalb et al. [6], and between different objects in 2D images [13, 14]. However, a lot of valuable information about the shape and geometric layout of objects is lost when a 2D image is formed from the corresponding 3D world. 
In some recent works, 3D layout or depths have been used for improving object detection (e.g., [11, 12, 14, 20, 21, 22, 27, 28]). Here a rough 3D scene geometry (e.g., main surfaces in the scene) is inferred from a single 2D image or a stereo video stream. However, the estimated geometry is not accurate enough to give significant improvements. With 3D data, we can more precisely determine the shape, size and geometric orientation of the objects, among several other properties, and therefore capture much stronger context.\nThe recent availability of synchronized videos of both color and depth obtained from RGB-D (Kinect-style) depth cameras has shifted the focus to making use of both visual as well as shape features for object detection [9, 18, 19, 24, 26] and 3D segmentation (e.g., [3]). These methods demonstrate that augmenting visual features with 3D information can enhance object detection in cluttered, real-world environments. However, these works do not make use of the contextual relationships between various objects, which have been shown to be useful for tasks such as object detection and scene understanding in 2D images. Our goal is to perform semantic labeling of indoor scenes by modeling and learning several contextual relationships.\nThere is also some recent work in labeling outdoor scenes obtained from LIDAR data into a few geometric classes (e.g., ground, building, trees, vegetation, etc.). [8, 30] capture context by designing node features and [36] do so by stacking layers of classifiers; however, these methods do not model the correlation between the labels. Some of these works model some contextual relationships in the learning model itself. For example, [1, 23] use associative Markov networks in order to favor similar labels for nodes in the cliques. However, many relative features between objects are not associative in nature. 
For example, the relationship \u201con top of\u201d does not hold between two ground segments, i.e., a ground segment cannot be \u201con top of\u201d another ground segment. Therefore, using an associative Markov network is very restrictive for our problem. All of these works [1, 23, 29, 30, 36] were designed for outdoor scenes with LIDAR data (without RGB values) and therefore would not apply directly to RGB-D data in indoor environments. Furthermore, these methods consider only a few geometric classes (three to five) in outdoor environments, whereas we consider a large number of object classes for labeling the indoor RGB-D data.\nThe most related work to ours is [35], where they label the planar patches in a point cloud of an indoor scene with four geometric labels (walls, floors, ceilings, clutter). They use a CRF to model geometrical relationships such as orthogonal, parallel, adjacent, and coplanar. The learning method for estimating the parameters was based on maximizing the pseudo-likelihood, resulting in a sub-optimal learning algorithm. In comparison, our basic representation is a 3D segment (as compared to planar patches) and we consider a much larger number of classes (beyond just the geometric classes). We also capture a much richer set of relationships between pairs of objects, and use a principled max-margin learning method to learn the parameters of our model.\n3 Approach\nWe now outline our approach, including the model, its inference methods, and the learning algorithm. Our input is multiple Kinect RGB-D images of a scene (i.e., a room) stitched into a single 3D point cloud using RGBDSLAM.2 Each such point cloud is then over-segmented based on smoothness (i.e., difference in the local surface normals) and continuity of surfaces (i.e., distance between the points). These segments are the atomic units in our model. 
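The over-segmentation step just described can be sketched as a simple region-growing pass over points with precomputed normals. This is an illustrative toy only, with a brute-force neighbor search and made-up thresholds; it is not the implementation used in the paper:

```python
import numpy as np

# Minimal region-growing over-segmentation in the spirit described above:
# points are grouped while surface normals stay similar (smoothness) and
# neighbors stay close (continuity). Thresholds and the O(n^2) neighbor
# search are illustrative choices, not the paper's implementation.

def over_segment(points, normals, dist_thresh=0.05, angle_thresh=0.9):
    n = len(points)
    labels = -np.ones(n, dtype=int)   # -1 means "not yet assigned"
    next_label = 0
    for seed in range(n):
        if labels[seed] != -1:
            continue
        labels[seed] = next_label
        stack = [seed]
        while stack:
            i = stack.pop()
            d = np.linalg.norm(points - points[i], axis=1)   # continuity
            sim = normals @ normals[i]                       # cos of normal angle
            grow = (labels == -1) & (d < dist_thresh) & (sim > angle_thresh)
            for j in np.flatnonzero(grow):
                labels[j] = next_label
                stack.append(j)
        next_label += 1
    return labels
```

Two coplanar patches separated by more than `dist_thresh` end up in different segments, as do adjacent patches whose normals disagree.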
Our goal is to label each of them.\nBefore getting into the technical details of the model, the following outlines the properties we aim to capture in our model:\nVisual appearance. The reasonable success of object detection in 2D images shows that visual appearance is a good indicator for labeling scenes. We therefore model the local color, texture, gradients of intensities, etc. for predicting the labels. In addition, we also model the property that if nearby segments are similar in visual appearance, they are more likely to belong to the same object.\nLocal shape and geometry. Objects have characteristic shapes: for example, a table is horizontal, a monitor is vertical, a keyboard is uneven, and a sofa is usually smoothly curved. Furthermore, parts of an object often form a convex shape. We compute 3D shape features to capture this.\nGeometrical context. Many sets of objects occur in characteristic relative geometric configurations. For example, a monitor is always on-top-of a table, chairs are usually found near tables, a keyboard is in-front-of a monitor. This means that our model needs to capture non-associative relationships (i.e., that neighboring segments differ in their labels in specific patterns).\nNote the examples given above are just illustrative. For any particular practical application, there will likely be other properties that could also be included. As demonstrated in the following section, our model is flexible enough to include a wide range of features.\n3.1 Model Formulation\nWe model the three-dimensional structure of a scene using a model isomorphic to a Markov Random Field with log-linear node and pairwise edge potentials. Given a segmented point cloud x = (x1, ..., xN) consisting of segments xi, we aim to predict a labeling y = (y1, ..., yN) for the segments. 
Each segment label yi is itself a vector of K binary class labels yi = (yi^1, ..., yi^K), with each yi^k \u2208 {0, 1} indicating whether segment i is a member of class k. Note that multiple yi^k can be 1 for each segment (e.g., a segment can be both a \u201cchair\u201d and a \u201cmovable object\u201d). We use such multi-labelings in our attribute experiments, where each segment can have multiple attributes, but not in segment labeling experiments, where each segment can have only one label.\n\n2http://openslam.org/rgbdslam.html\n\nFigure 2: Illustration of a few features. (Left) Features N11 and E9: segment i is in front of segment j if rhi < rhj. (Middle) Two connected segments i and j form a convex shape if (ri \u2212 rj) \u00b7 \u02c6ni \u2265 0 and (rj \u2212 ri) \u00b7 \u02c6nj \u2265 0. (Right) Illustrating feature E8.\n\nFor a segmented point cloud x, the prediction \u02c6y is computed as the argmax of a discriminant function fw(x, y) that is parameterized by a vector of weights w.\n\n\u02c6y = argmax_y fw(x, y) (1)\n\nThe discriminant function captures the dependencies between segment labels as defined by an undirected graph (V, E) of vertices V = {1, ..., N} and edges E \u2286 V \u00d7 V. We describe in Section 3.2 how this graph is derived from the spatial proximity of the segments. Given (V, E), we define the following discriminant function based on individual segment features \u03c6n(i) and edge features \u03c6t(i, j) as further described below.\n\nfw(x, y) = \u03a3_{i\u2208V} \u03a3_{k=1}^{K} yi^k [wn^k \u00b7 \u03c6n(i)] + \u03a3_{(i,j)\u2208E} \u03a3_{Tt\u2208T} \u03a3_{(l,k)\u2208Tt} yi^l yj^k [wt^lk \u00b7 \u03c6t(i, j)] (2)\n\nThe node feature map \u03c6n(i) describes segment i through a vector of features, and there is one weight vector for each of the K classes. Examples of such features are the ones capturing local visual appearance, shape and geometry. 
The edge feature maps \u03c6t(i, j) describe the relationship between segments i and j. Examples of edge features are the ones capturing similarity in visual appearance and geometric context.3 There may be multiple types t of edge feature maps \u03c6t(i, j), and each type has a graph over the K classes with edges Tt. If Tt contains an edge between classes l and k, then this feature map and a weight vector wt^lk are used to model the dependencies between classes l and k. If the edge is not present in Tt, then \u03c6t(i, j) is not used.\nWe say that a type t of edge features is modeled by an associative edge potential if Tt = {(k, k) | \u2200k = 1..K}, by a non-associative edge potential if Tt = {(l, k) | \u2200l, k = 1..K}, and by an object-associative edge potential if Tt = {(l, k) | \u2203object, l, k \u2208 parts(object)}.\nParsimonious model. In our experiments we distinguished between two types of edge feature maps: \u201cobject-associative\u201d features \u03c6oa(i, j) used between classes that are parts of the same object (e.g., \u201cchair base\u201d, \u201cchair back\u201d and \u201cchair back rest\u201d), and \u201cnon-associative\u201d features \u03c6na(i, j) that are used between any pair of classes. Examples of features in the object-associative feature map \u03c6oa(i, j) include similarity in appearance, co-planarity, and convexity, i.e., features that indicate whether two adjacent segments belong to the same class or object. A key reason for distinguishing between object-associative and non-associative features is parsimony of the model. In this parsimonious model (referred to as svm mrf parsimon), we model object-associative features using object-associative edge potentials and non-associative features as non-associative edge potentials. As not all edge features are non-associative, we avoid learning weight vectors for relationships which do not exist. 
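The class-pair sets Tt just defined, and the discriminant function of Eq. (2) they plug into, can be illustrated with a toy sketch. All sizes, feature values, and the `parts` table below are hypothetical, chosen only to make the structure concrete:

```python
import numpy as np

# Toy illustration of Eq. (2) with the three kinds of edge potentials.
# K, `classes`, and `parts` are made-up; the paper's classes differ.

K = 4
classes = range(K)
parts = {"chair": [0, 1]}   # hypothetical: classes 0 and 1 are parts of one object

# Class-pair sets T_t for each edge-feature type t.
T_assoc = {(k, k) for k in classes}                                    # associative
T_obj = {(l, k) for ps in parts.values() for l in ps for k in ps}      # object-associative
T_nonassoc = {(l, k) for l in classes for k in classes}                # non-associative

def discriminant(y, phi_n, edges, phi_e, w_n, w_e, T_e):
    """f_w(x, y) = sum_i sum_k y_i^k [w_n^k . phi_n(i)]
                 + sum_(i,j) sum_(l,k) in T_e y_i^l y_j^k [w_e^lk . phi_e(i,j)]"""
    score = sum(y[i, k] * w_n[k] @ phi_n[i]
                for i in range(len(phi_n)) for k in classes)
    for (i, j) in edges:
        for (l, k) in T_e:
            score += y[i, l] * y[j, k] * (w_e[l, k] @ phi_e[i, j])
    return score
```

The parsimony argument is visible in the set sizes: a non-associative feature type needs a weight vector for all K^2 class pairs, while an object-associative type needs them only for the few within-object pairs.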
Note that |Tna| >> |Toa| since, in practice, the number of parts of an object is much smaller than K. Due to this, the model we learn with both types of edge features has a much smaller number of parameters than a model learned with all edge features treated as non-associative.\n3.2 Features\nTable 1 summarizes the features used in our experiments. \u03bbi0, \u03bbi1 and \u03bbi2 are the 3 eigenvalues of the scatter matrix computed from the points of segment i, in decreasing order. ci is the centroid of segment i. ri is the ray vector to the centroid of segment i from the position of the camera in which it was captured. rhi is the projection of ri on the horizontal plane. \u02c6ni is the unit normal of segment i which points towards the camera (ri \u00b7 \u02c6ni < 0). The node features \u03c6n(i) consist of visual appearance features based on the histogram of HSV values and the histogram of gradients (HOG), as well as local shape and geometry features that capture properties such as how planar a segment is, its absolute\n\n3Even though it is not represented in the notation, note that both the node feature map \u03c6n(i) and the edge feature maps \u03c6t(i, j) can compute their features based on the full x, not just xi and 
xj.\n\nNode features for segment i (counts in parentheses):\nVisual Appearance (48): N1. Histogram of HSV color values (14); N2. Average HSV color values (3); N3. Average of HOG features of the blocks in image spanned by the points of a segment (31).\nLocal Shape and Geometry (8): N4. Linearness (\u03bbi0 \u2212 \u03bbi1), planarness (\u03bbi1 \u2212 \u03bbi2) (2); N5. Scatter: \u03bbi0 (1); N6. Vertical component of the normal: \u02c6niz (1); N7. Vertical position of centroid: ciz (1); N8. Vertical and horizontal extent of bounding box (2); N9. Distance from the scene boundary (Fig. 2) (1).\nEdge features for (segment i, segment j):\nVisual Appearance, associative (3): E1. Difference of average HSV color values (3).\nLocal Shape and Geometry, associative (2): E2. Coplanarity and convexity (Fig. 2) (2).\nGeometric context, non-associative (6): E3. Horizontal distance between centroids (1); E4. Vertical displacement between centroids: (ciz \u2212 cjz) (1); E5. Angle between normals (dot product): \u02c6ni \u00b7 \u02c6nj (1); E6. Difference in angle with vertical: cos\u22121(niz) \u2212 cos\u22121(njz) (1); E8. Distance between closest points: min_{u\u2208si, v\u2208sj} d(u, v) (Fig. 2) (1); E9. Relative position from camera (in front of/behind) (Fig. 2) (1).\nTable 1: Node and edge features.\n\nlocation above ground, and its shape. Some features capture spatial location of an object in the scene (e.g., N9).\nWe connect two segments (nodes) i and j by an edge if there exists a point in segment i and a point in segment j which are less than context range distance apart. This captures the closest distance between two segments (as compared to centroid distance between the segments); we study the effect of context range more in Section 4. 
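The eigenvalue-based shape features (N4, N5) and the closest-point connectivity rule can be sketched directly from the definitions above. The helper names are ours, not from the paper's code, and the thresholds are illustrative:

```python
import numpy as np

# N4-N5 node features from the eigenvalues of a segment's scatter matrix,
# and the closest-point edge-connectivity rule described in the text.
# `points` is an (n, 3) array of a segment's 3D points.

def shape_features(points):
    centered = points - points.mean(axis=0)
    scatter = centered.T @ centered / len(points)
    lam = np.sort(np.linalg.eigvalsh(scatter))[::-1]   # lambda0 >= lambda1 >= lambda2
    linearness = lam[0] - lam[1]    # N4: large for rod-like segments
    planarness = lam[1] - lam[2]    # N4: large for plate-like segments
    scatterness = lam[0]            # N5
    return linearness, planarness, scatterness

def connected(seg_i, seg_j, context_range):
    """Link two segments if their closest points are within context_range."""
    d = np.linalg.norm(seg_i[:, None, :] - seg_j[None, :, :], axis=2)
    return d.min() < context_range
```

A perfectly collinear segment gets planarness 0 and positive linearness; a flat patch gets the reverse.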
The edge features \u03c6t(i, j) (Table 1-right) consist of associative features (E1-E2) based on visual appearance and local shape, as well as non-associative features (E3-E9) that capture the tendencies of two objects to occur in certain configurations.\nNote that our features are insensitive to horizontal translation and rotation of the camera. However, our features place a lot of emphasis on the vertical direction because gravity influences the shape and relative positions of objects to a large extent.\n3.2.1 Computing Predictions\nSolving the argmax in Eq. (1) for the discriminant function in Eq. (2) is NP-hard. However, its equivalent formulation as the following mixed-integer program has a linear relaxation with several desirable properties.\n\n\u02c6y = argmax_y max_z \u03a3_{i\u2208V} \u03a3_{k=1}^{K} yi^k [wn^k \u00b7 \u03c6n(i)] + \u03a3_{(i,j)\u2208E} \u03a3_{Tt\u2208T} \u03a3_{(l,k)\u2208Tt} zij^lk [wt^lk \u00b7 \u03c6t(i, j)] (3)\n\n\u2200i, j, l, k: zij^lk \u2264 yi^l, zij^lk \u2264 yj^k, yi^l + yj^k \u2264 zij^lk + 1, zij^lk, yi^l \u2208 {0, 1} (4)\n\nNote that the products yi^l yj^k have been replaced by auxiliary variables zij^lk. Relaxing the variables zij^lk and yi^l to the interval [0, 1] leads to a linear program that can be shown to always have half-integral solutions (i.e., yi^l only takes values {0, 0.5, 1} at the solution) [10]. Furthermore, this relaxation can also be solved as a quadratic pseudo-Boolean optimization problem using a graph-cut method [25], which is orders of magnitude faster than using a general-purpose LP solver (i.e., 10 sec for labeling a typical scene in our experiments). Therefore, we refer to the solution of this relaxation as \u02c6ycut.\nThe relaxation solution \u02c6ycut has an interesting property called Persistence [2, 10]. Persistence says that any segment for which the value of yi^l is integral in \u02c6ycut (i.e. 
does not take value 0.5) is labeled just like it would be in the optimal mixed-integer solution.
Since every segment in our experiments is in exactly one class, we also consider the linear relaxation from above with the additional constraint ∀i: Σ_{j=1}^{K} y_i^j = 1. This problem can no longer be solved via graph cuts and is not half-integral. We refer to its solution as ŷLP. Computing ŷLP for a scene takes 11 minutes on average⁴. Finally, we can also compute the exact mixed-integer solution including the additional constraint ∀i: Σ_{j=1}^{K} y_i^j = 1 using a general-purpose MIP solver⁴. We set a time limit of 30 minutes for the MIP solver. This takes 18 minutes on average for a scene. All runtimes are for single-CPU implementations using 17 classes.
When using this algorithm in practice on new scenes (e.g., during our robotic experiments), objects other than the 27 objects we modeled might be present (e.g., coffee mugs). So we relax the constraint from ∀i: Σ_{j=1}^{K} y_i^j = 1 to ∀i: Σ_{j=1}^{K} y_i^j ≤ 1. This increases precision greatly at the cost of some drop in recall. Also, this relaxed MIP takes less time to solve.

3.2.2 Learning Algorithm
We take a large-margin approach to learning the parameter vector w of Eq. (2) from labeled training examples (x1, y1), ..., (xn, yn) [31, 32, 34]. Compared to Conditional Random Field training [17] using maximum likelihood, this has the advantage that the partition function normalizing Eq. (2) does not need to be computed, and that the training problem can be formulated as a convex program for which efficient algorithms exist.

4 http://www.tfinley.net/software/pyglpk/readme.html

Our method optimizes a regularized upper bound on the training error

  R(h) = (1/n) Σ_{j=1}^{n} Δ(yj, ŷj),                                                    (5)

where ŷj is the optimal solution of Eq.
(1) and Δ(y, ŷ) = Σ_{i=1}^{N} Σ_{k=1}^{K} |y_i^k - ŷ_i^k|. To simplify notation, note that Eq. (3) can be equivalently written as wᵀΨ(x, y) by appropriately stacking the w_n^k and w_t^{lk} into w and the y_i^k φ_n(i) and z_ij^{lk} φ_t(i, j) into Ψ(x, y), where each z_ij^{lk} is consistent with Eq. (4) given y. Training can then be formulated as the following convex quadratic program [15]:

  min_{w,ξ}  (1/2) wᵀw + Cξ                                                              (6)
  s.t.  ∀ (ȳ1, ..., ȳn) ∈ {0, 0.5, 1}^{N·K}:
        (1/n) wᵀ Σ_{i=1}^{n} [Ψ(xi, yi) - Ψ(xi, ȳi)]  ≥  (1/n) Σ_{i=1}^{n} Δ(yi, ȳi) - ξ

While the number of constraints in this quadratic program is exponential in n, N, and K, it can nevertheless be solved efficiently using the cutting-plane algorithm for training structural SVMs [15]. The algorithm maintains a working set of constraints, and it can be shown to provide an ε-accurate solution after adding at most O(R²C/ε) constraints (ignoring log terms). The algorithm merely needs access to an efficient method for computing

  ȳi = argmax_{y ∈ {0, 0.5, 1}^{N·K}}  [ wᵀΨ(xi, y) + Δ(yi, y) ].                        (7)

Due to the structure of Δ(·, ·), this problem is identical to the relaxed prediction problem in Eqs. (3)-(4) and can be solved efficiently using graph cuts.
Since our training problem is an overgenerating formulation as defined in [7], the value of ξ at the solution is an upper bound on the training error in Eq. (5). Furthermore, [7] observed empirically that the relaxed prediction ŷcut after training w via Eq. (6) is typically largely integral, meaning that most labels y_i^k of the relaxed solution are the same as in the optimal mixed-integer solution due to persistence.
We made the same observation in our experiments as well.

4 Experiments
4.1 Data
We consider labeling object segments in full 3D scenes (as compared to 2.5D data from a single view). For this purpose, we collected data of 24 office and 28 home scenes (composed from about 550 views). Each scene was reconstructed from about 8-9 RGB-D views from a Kinect sensor and contains about one million colored points.
We first over-segment the 3D scene (as described earlier) to obtain the atomic units of our representation. For training, we manually labeled the segments, and we selected the labels which were present in a minimum of 5 scenes in the dataset. Specifically, the office labels are: {wall, floor, tableTop, tableDrawer, tableLeg, chairBackRest, chairBase, chairBack, monitor, printerFront, printerSide, keyboard, cpuTop, cpuFront, cpuSide, book, paper}, and the home labels are: {wall, floor, tableTop, tableDrawer, tableLeg, chairBackRest, chairBase, sofaBase, sofaArm, sofaBackRest, bed, bedSide, quilt, pillow, shelfRack, laptop, book}. This gave us a total of 1108 labeled segments in the office scenes and 1387 segments in the home scenes. Often one object may be divided into multiple segments because of over-segmentation. We have made this data available at: http://pr.cs.cornell.edu/sceneunderstanding/data/data.php.
4.2 Results
Table 2 shows the results of 4-fold cross-validation, averaging performance across the folds for the models trained separately on the home and office datasets. We use both macro and micro averaging to aggregate precision and recall over the various classes. Since our algorithm can only predict one label for each segment, micro precision and recall are the same as the percentage of correctly classified segments. Macro precision and recall are respectively the averages of precision and recall over all classes.
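As a concrete illustration of the two averaging schemes (our own sketch, not the paper's evaluation code), micro and macro precision/recall can be computed from per-segment predictions as follows; with exactly one predicted label per segment, micro precision and recall both reduce to segment accuracy:

```python
def micro_macro_pr(y_true, y_pred, classes):
    """Micro- and macro-averaged precision/recall for single-label predictions."""
    # Micro: pool all per-segment decisions; with one predicted label per
    # segment this equals the fraction of correctly classified segments.
    micro = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    # Macro: compute precision and recall per class, then average over classes.
    precs, recs = [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        predicted_c = sum(p == c for p in y_pred)
        actual_c = sum(t == c for t in y_true)
        precs.append(tp / predicted_c if predicted_c else 0.0)
        recs.append(tp / actual_c if actual_c else 0.0)
    return micro, sum(precs) / len(classes), sum(recs) / len(classes)
```

This also makes the max-class baseline intuitive: predicting only the majority class gives perfect recall on that one class and zero on the remaining 16, so its macro recall is 100/17 ≈ 5.88%.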
The optimal C value is determined separately for each of the algorithms by cross-validation.

Table 2: Learning experiment statistics. The table shows average micro precision/recall, and average macro precision and recall for home and office scenes.

                                                 Office Scenes               Home Scenes
                                             micro   macro   macro      micro   macro   macro
  features                algorithm           P/R    Prec.   Recall      P/R    Prec.   Recall
  None                    max class          26.23   26.23    5.88      29.38   29.38    5.88
  Image Only              svm node only      46.67   35.73   31.67      38.00   15.03   14.50
  Shape Only              svm node only      75.36   64.56   60.88      56.25   35.90   36.52
  Image+Shape             svm node only      77.97   69.44   66.23      56.50   37.18   34.73
  Image+Shape & context   single frames      84.32   77.84   68.12      69.13   47.84   43.62
  Image+Shape & context   svm mrf assoc      75.94   63.89   61.79      62.50   44.65   38.34
  Image+Shape & context   svm mrf nonassoc   81.45   76.79   70.07      72.38   57.82   53.62
  Image+Shape & context   svm mrf parsimon   84.06   80.52   72.64      73.38   56.81   54.80

Figure 1 shows the original point cloud, ground-truth and predicted labels for one office (top) and one home scene (bottom). We see that on the majority of the classes we are able to predict the correct label. It makes mistakes in some cases and these usually tend to be reasonable, such as a pillow getting confused with the bed, and a table-top getting confused with the shelf-rack.
One of our goals is to study the effect of various factors, and therefore we compared different versions of the algorithms with various settings. We discuss them in the following.
Do Image and Point-Cloud Features Capture Complementary Information? The RGB-D data contains both image and depth information, and enables us to compute a wide variety of features. In this experiment, we compare the two kinds of features: Image (RGB) and Shape (Point Cloud) features.
To show the effect of the features independent of the effect of context, we use only the node potentials from our model, referred to as svm node only in Table 2. The svm node only model is equivalent to the multi-class SVM formulation [15]. Table 2 shows that Shape features are more effective than Image features, and that the combination works better on both precision and recall. This indicates that the two types of features offer complementary information and their combination is better for our classification task.
How Important is Context? Using our svm mrf parsimon model as described in Section 3.1, we show significant improvements in performance over the svm node only model on both datasets. In office scenes, the micro precision increased by 6.09% over the best svm node only model that does not use any context. In home scenes the increase is much higher, 16.88%.
The type of contextual relations we capture depends on the type of edge potentials we model. To study this, we compared our method with models using only associative or only non-associative edge potentials, referred to as svm mrf assoc and svm mrf nonassoc respectively. We observed that modeling all edge features using associative potentials is poor compared to our full model. In fact, using only associative potentials showed a drop in performance compared to the svm node only model on the office dataset. This indicates that it is important to capture the relations between regions having different labels. Our svm mrf nonassoc model does so by modeling all edge features using non-associative potentials, which can favor or disfavor labels of different classes for nearby segments. It gives higher precision and recall compared to svm node only and svm mrf assoc.
This shows that modeling using non-associative potentials is a better choice for our labeling problem.
However, not all the edge features are non-associative in nature, and modeling them using only non-associative potentials could be overkill (each non-associative feature adds K² more parameters to be learnt). Therefore, using our svm mrf parsimon model to model these relations achieves higher performance on both datasets.
How Large should the Context Range be? Context relationships of different objects can be meaningful at different spatial distances, and this range may vary depending on the environment as well. For example, in an office, a keyboard and a monitor go together, but they may have little relation with a sofa that is slightly farther away. In a house, a sofa and a table may go together even if they are farther apart.
In order to study this, we compared our svm mrf parsimon with varying context range for determining the neighborhood (see Figure 3 for the average micro precision vs. range plot). Note that the context range is determined from the boundary of one segment to the boundary of the other, and hence it is somewhat independent of the size of the object. We note that increasing the context range increases the performance up to a point, and then it drops slightly. We attribute this to the fact that increasing the context range can connect irrelevant objects with an edge, and with limited training data, spurious relationships may be learned. We observe that the optimal context range is around 0.3 meters for office scenes and 0.6 meters for home scenes.

Figure 3: Effect of context range on precision (= recall here).

How does a Full 3D Model Compare to a 2.5D Model? In Table 2, we compare the performance of our full model with a model that was trained and tested on single views of the same scenes.
During the comparison, the training folds were kept consistent with the other experiments; however, the segmentation of the point clouds was different (because each point cloud is from a single view). This makes the micro precision values meaningless because the distribution of labels is not the same for the two cases. In particular, many large objects in scenes (e.g., wall, ground) get split up into multiple segments in single views. We observed that the macro precision and recall are higher when multiple views are combined to form the scene. We attribute the improvement in macro precision and recall to the fact that larger scenes have more context, and models are more complete because of multiple views.
What is the Effect of the Inference Method? The results for the svm mrf algorithms in Table 2 were generated using the MIP solver. We observed that the MIP solver is typically 2-3% more accurate than the LP solver. The graph-cut algorithm, however, gives higher precision and lower recall on both datasets. For example, on office data, graph-cut inference for our svm mrf parsimon gave a micro precision of 90.25 and a micro recall of 61.74. Here, micro precision and recall are not the same because some of the segments might not get any label. Since it is orders of magnitude faster, it is ideal for real-time robotic applications.
4.3 Robotic experiments
The ability to label segments is very useful for robotics applications, for example, in detecting objects (so that a robot can find/retrieve an object on request) or for other robotic tasks. We therefore performed two relevant robotic experiments.
Attribute Learning: In some robotic tasks, such as robotic grasping, it is not important to know the exact object category; just knowing a few attributes of an object may be useful. For example, if a robot has to clean a floor, it would help if it knows which objects it can move and which it cannot.
If it has to place an object, it should place it on horizontal surfaces, preferably where humans do not sit. With this motivation, we designed 8 attributes each for the home and office scenes, giving 10 unique attributes in total: wall, floor, flat-horizontal-surfaces, furniture, fabric, heavy, seating-areas, small-objects, table-top-objects, electronics. Note that each segment in the point cloud can have multiple attributes, and therefore we can learn these attributes using our model, which naturally allows multiple labels per segment. We compute the precision and recall over the attributes by counting how many attributes were correctly inferred. In home scenes we obtained a precision of 83.12% and a recall of 70.03%, and in the office scenes we obtained a precision of 87.92% and a recall of 71.93%.
Object Detection: We finally use our algorithm on two mobile robots, mounted with Kinects, for completing the goal of finding objects such as a keyboard in cluttered office scenes. The following video shows our robot successfully finding a keyboard in an office: http://pr.cs.cornell.edu/sceneunderstanding/

Figure 4: Cornell's POLAR robot using our classifier for detecting a keyboard in a cluttered room.

In conclusion, we have proposed and evaluated the first model and learning algorithm for scene understanding that exploits rich relational information from the full-scene 3D point cloud. We applied this technique to the object labeling problem and studied the effects of various factors on a large dataset. Our robotic application shows that such inexpensive RGB-D sensors can be extremely useful for scene understanding for robots. This research was funded in part by NSF Award IIS-0713483.

References
[1] D. Anguelov, B. Taskar, V. Chatalbashev, D. Koller, D. Gupta, G. Heitz, and A. Ng. Discriminative learning of markov random fields for segmentation of 3d scan data.
In CVPR, 2005.
[2] E. Boros and P. Hammer. Pseudo-boolean optimization. Dis. Appl. Math., 123(1-3):155-225, 2002.
[3] A. Collet Romea, S. Srinivasa, and M. Hebert. Structure discovery in multi-modal data: a region-based approach. In ICRA, 2011.
[4] G. Csurka, C. Dance, L. Fan, J. Willamowski, and C. Bray. Visual categorization with bags of keypoints. In Workshop on statistical learning in computer vision, ECCV, 2004.
[5] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
[6] P. Felzenszwalb, D. McAllester, and D. Ramanan. A discriminatively trained, multiscale, deformable part model. In CVPR, 2008.
[7] T. Finley and T. Joachims. Training structural svms when exact inference is intractable. In ICML, 2008.
[8] A. Golovinskiy, V. G. Kim, and T. Funkhouser. Shape-based recognition of 3d point clouds in urban environments. In ICCV, 2009.
[9] S. Gould, P. Baumstarck, M. Quigley, A. Y. Ng, and D. Koller. Integrating visual and range data for robotic object detection. In ECCV workshop Multi-camera Multi-modal (M2SFA2), 2008.
[10] P. Hammer, P. Hansen, and B. Simeone. Roof duality, complementation and persistency in quadratic 0-1 optimization. Mathematical Programming, 28(2):121-155, 1984.
[11] V. Hedau, D. Hoiem, and D. Forsyth. Thinking inside the box: Using appearance models and context based on room geometry. In ECCV, 2010.
[12] G. Heitz, S. Gould, A. Saxena, and D. Koller. Cascaded classification models: Combining models for holistic scene understanding. In NIPS, 2008.
[13] G. Heitz and D. Koller. Learning spatial context: Using stuff to find things. In ECCV, 2008.
[14] D. Hoiem, A. A. Efros, and M. Hebert. Putting objects in perspective. In CVPR, 2006.
[15] T. Joachims, T. Finley, and C. Yu. Cutting-plane training of structural SVMs. Machine Learning, 77(1):27-59, 2009.
[16] H. Koppula, A. Anand, T. Joachims, and A. Saxena.
Labeling 3d scenes for personal assistant robots. In R:SS workshop on RGB-D cameras, 2011.
[17] J. D. Lafferty, A. McCallum, and F. C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, 2001.
[18] K. Lai, L. Bo, X. Ren, and D. Fox. A large-scale hierarchical multi-view RGB-D object dataset. In ICRA, 2011.
[19] K. Lai, L. Bo, X. Ren, and D. Fox. Sparse distance learning for object recognition combining RGB and depth information. In ICRA, 2011.
[20] D. C. Lee, A. Gupta, M. Hebert, and T. Kanade. Estimating spatial layout of rooms using volumetric reasoning about objects and surfaces. In NIPS, 2010.
[21] B. Leibe, N. Cornelis, K. Cornelis, and L. V. Gool. Dynamic 3d scene analysis from a moving vehicle. In CVPR, 2007.
[22] C. Li, A. Kowdle, A. Saxena, and T. Chen. Towards holistic scene understanding: Feedback enabled cascaded classification models. In NIPS, 2010.
[23] D. Munoz, N. Vandapel, and M. Hebert. Onboard contextual classification of 3-d point clouds with learned high-order markov random fields. In ICRA, 2009.
[24] M. Quigley, S. Batra, S. Gould, E. Klingbeil, Q. V. Le, A. Wellman, and A. Y. Ng. High-accuracy 3d sensing for mobile manipulation: Improving object detection and door opening. In ICRA, 2009.
[25] C. Rother, V. Kolmogorov, V. Lempitsky, and M. Szummer. Optimizing binary mrfs via extended roof duality. In CVPR, 2007.
[26] R. B. Rusu, Z. C. Marton, N. Blodow, M. Dolha, and M. Beetz. Towards 3d point cloud based object maps for household environments. Robot. Auton. Syst., 56:927-941, 2008.
[27] A. Saxena, S. H. Chung, and A. Y. Ng. Learning depth from single monocular images. In NIPS 18, 2005.
[28] A. Saxena, M. Sun, and A. Y. Ng. Make3d: Learning 3d scene structure from a single still image. IEEE PAMI, 31(5):824-840, 2009.
[29] R. Shapovalov and A. Velizhev.
Cutting-plane training of non-associative markov network for 3d point cloud segmentation. In 3DIMPVT, 2011.
[30] R. Shapovalov, A. Velizhev, and O. Barinova. Non-associative markov networks for 3d point cloud classification. In ISPRS Commission III Symposium - PCV 2010, 2010.
[31] B. Taskar, V. Chatalbashev, and D. Koller. Learning associative markov networks. In ICML, 2004.
[32] B. Taskar, C. Guestrin, and D. Koller. Max-margin markov networks. In NIPS, 2003.
[33] A. Torralba. Contextual priming for object detection. IJCV, 53(2):169-191, 2003.
[34] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support vector machine learning for interdependent and structured output spaces. In ICML, 2004.
[35] X. Xiong and D. Huber. Using context to create semantic 3d models of indoor environments. In BMVC, 2010.
[36] X. Xiong, D. Munoz, J. A. Bagnell, and M. Hebert. 3-d scene analysis via sequenced predictions over points and regions. In ICRA, 2011.