{"title": "Estimating Spatial Layout of Rooms using Volumetric Reasoning about Objects and Surfaces", "book": "Advances in Neural Information Processing Systems", "page_first": 1288, "page_last": 1296, "abstract": "There has been a recent push in extraction of 3D spatial layout of scenes. However, none of these approaches model the 3D interaction between objects and the spatial layout. In this paper, we argue for a parametric representation of objects in 3D, which allows us to incorporate volumetric constraints of the physical world. We show that augmenting current structured prediction techniques with volumetric reasoning significantly improves the performance of the state-of-the-art.", "full_text": "Estimating Spatial Layout of Rooms using Volumetric\n\nReasoning about Objects and Surfaces\n\nDavid C. Lee, Abhinav Gupta, Martial Hebert, Takeo Kanade\n\n{dclee,abhinavg,hebert,tk}@cs.cmu.edu\n\nCarnegie Mellon University\n\nAbstract\n\nThere has been a recent push in extraction of 3D spatial layout of scenes. However,\nnone of these approaches model the 3D interaction between objects and the spatial\nlayout. In this paper, we argue for a parametric representation of objects in 3D,\nwhich allows us to incorporate volumetric constraints of the physical world. We\nshow that augmenting current structured prediction techniques with volumetric\nreasoning signi\ufb01cantly improves the performance of the state-of-the-art.\n\n1\n\nIntroduction\n\nConsider the indoor image shown in Figure 1. Understanding such a complex scene not only in-\nvolves visual recognition of objects but also requires extracting the 3D spatial layout of the room\n(ceiling, \ufb02oor and walls). Extraction of the spatial layout of a room provides crucial geometric con-\ntext required for visual recognition. 
There has been a recent push to extract the spatial layout of the room with classifiers which predict qualitative surface orientation labels (floor, ceiling, left, right, center wall and object) from appearance features and then fit a parametric model of the room. However, such an approach is limited in that it does not use the additional information conveyed by the configuration of objects in the room and, therefore, it fails to use all of the available cues for estimating the spatial layout.

In this paper, we propose to incorporate an explicit volumetric representation of objects in 3D into the spatial interpretation process. Unlike previous approaches, which model objects by their projection in the image plane, we propose a parametric representation of the 3D volumes occupied by objects in the scene. We show that such a parametric representation of the volume occupied by an object can provide crucial evidence for estimating the spatial layout of the room. This evidence comes from volumetric reasoning between the objects in the room and the spatial layout of the room. We propose to augment existing structured classification approaches with volumetric reasoning in 3D for extracting the spatial layout of the room.

Figure 1 shows an example of a case where volumetric reasoning is crucial in estimating the surface layout of the room. Figure 1(b) shows the estimated spatial layout of the room (overlaid on surface orientation labels predicted by a classifier) when no reasoning about the objects is performed. In this case, the couch is predicted as floor and therefore there is substantial error in estimating the spatial layout.
If the couch is predicted as clutter and the image evidence from the couch is ignored (Figure 1(c)), multiple room hypotheses can be selected based on the predicted labels of the pixels on the wall (Figure 1(d)), and there is still not enough evidence in the image to select one hypothesis over another in a confident manner. However, if we represent the object by a 3D parametric model, such as a cuboid (Figure 1(e)), then simple volumetric reasoning (the 3D volume occupied by the couch should be contained in the free space of the room) can help us reject physically invalid hypotheses and estimate the correct layout of the room by pushing the walls to completely contain the cuboid (Figure 1(f)).

In this paper, we propose a method to perform volumetric reasoning by combining classical constrained search techniques and current structured prediction techniques. We show that the resulting approach leads to substantially improved performance on standard datasets, with the added benefit of a more complete scene description that includes objects in addition to the surface layout.

Figure 1: (a) Input image. (b) Estimate of the spatial layout of the room without object reasoning. Colors represent the output of the surface geometry by [8]. Green: floor, red: left wall, yellow: center wall, cyan: right wall. (c) Evidence from the object region removed. (d) Spatial layout with 2D object reasoning. (e) Object fitted with a 3D parametric model. (f) Spatial layout with 3D volumetric reasoning. The wall is pushed out by the volume occupied by the object.

1.1 Background

The goal of extracting 3D geometry by using geometric relationships between objects dates back to the start of computer vision, around four decades ago. In the early days of computer vision, researchers extracted lines from "blockworld" scenes [1] and reasoned about geometric relationships using constraint satisfaction algorithms on junctions [2, 3].
However, the reasoning approaches used in these blockworld scenarios (synthetic line drawings) proved too brittle for real-world images: they could not handle errors in the extraction of line segments, nor generalize to other shapes.

In recent years, there has been renewed interest in extracting camera parameters and three-dimensional structure in restricted domains such as Manhattan worlds [4]. Kosecka et al. [5] developed a method to recover vanishing points and camera parameters from a single image by using line segments found in Manhattan structures. Using the recovered vanishing points, rectangular surfaces aligned with the major orientations were also detected [6]. However, these approaches are only concerned with the dominant directions in the 3D world and do not attempt to extract three-dimensional information about the room and the objects in it. Yu et al. [7] inferred the relative depth order of rectangular surfaces by considering their relationships. However, this method only provides depth cues for partial rectangular regions in the image and not for the entire scene.

There has been a recent series of methods related to our work that attempt to model geometric scene structure from a single image, including geometric label classification [8, 9] and finding vertical/ground fold-lines [10]. Lee et al. [11] introduced parameterized models of indoor environments, constrained by rules inspired by blockworld to guarantee physical validity. However, since this approach samples possible spatial layout hypotheses without considering clutter, it is prone to errors caused by occlusion and tends to fit rooms in which the walls coincide with object surfaces. A recent paper by Hedau et al. [12] uses an appearance-based clutter classifier and computes visual features only from the regions classified as "non-clutter", while parameterizing the 3D structure of the scene by a box.
They use structured approaches to estimate the best-fitting room box for the image. A similar approach has been used by Wang et al. [13], which does not require ground-truth labels of clutter. In these methods, however, the modeling of interactions between clutter and the spatial layout of the room is done only in the image plane, and the 3D interactions between room and clutter are not considered.

In work concurrent to ours, Hedau et al. [14] have also modeled objects as three-dimensional cuboids and considered their volumetric intersection with the room structure. The goal of their work differs from ours. Their primary goal is to improve object detection, such as of beds, by using information about scene geometry, whereas our goal is to improve scene understanding by proposing a control structure that incorporates volumetric constraints. Therefore, we are able to improve the estimate of the room by estimating the objects and vice versa, whereas in their work information flows in only one direction (from scene to objects).

In very recent work by Gupta et al. [15], qualitative reasoning about scene geometry was done by modeling objects as "blocks" for outdoor scenes. In contrast, we use stronger parametric models for rooms and objects in indoor scenes, which are more structured, which allows us to perform more explicit and exact 3D volumetric reasoning.

2 Overview

Our goal is to jointly extract the spatial layout of the room and the configuration of objects in the scene. We model the spatial layout of the room by a 3D box, and we model objects as solids which occupy 3D volumes in the free space defined by the room walls.
Given a set of room hypotheses and object hypotheses, our goal is to search the space of scene configurations and select the configuration that best matches the local surface geometry estimated from image cues and satisfies the volumetric constraints of the physical world. These constraints (shown in Figure 3(i)) are:

• Finite volume: Every object in the world should have a non-zero, finite volume.
• Spatial exclusion: Objects are assumed to be solid and therefore cannot intersect: the volumes occupied by different objects are mutually exclusive. This implies that the volumetric intersection between any two objects should be empty.
• Containment: Every object should be contained in the free space defined by the walls of the room (i.e., none of the objects should be outside the room walls).

Our approach is illustrated in Figure 2. We first extract line segments and estimate three mutually orthogonal vanishing points (Figure 2(b)). The vanishing points define the orientations of the major surfaces in the scene [6, 11, 12] and hence constrain the layout of the ceiling, floor, and walls of the room. Using the line segments labeled by their orientations, we then generate multiple hypotheses for rooms and objects (Figure 2(e)(f)). A room hypothesis is a 3D parametric representation of the layout of the major surfaces of the scene: floor, left wall, center wall, right wall, and ceiling. An object hypothesis is a 3D parametric representation of an object in the scene, approximated as a cuboid.

The room and cuboid hypotheses are then combined to form the set of possible configurations of the entire scene (Figure 2(h)). A scene configuration consists of one room hypothesis together with some subset of the object hypotheses. The number of possible scene configurations is exponential in the number of object hypotheses.¹
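To make the three constraints concrete, here is a minimal Python sketch that reduces each object and the room to an axis-aligned rectangular footprint on the floor plane. This is a deliberate simplification of the paper's actual tests, which compare projections in the image plane (Section 4.3); all names are illustrative, not from the authors' implementation.

```python
# Sketch of the three volumetric constraints, using axis-aligned
# floor-plane rectangles as stand-ins for object and room footprints.
# (Hypothetical simplification; the paper tests image projections.)
from dataclasses import dataclass

@dataclass
class Footprint:
    """Axis-aligned rectangle (x0, z0)-(x1, z1) on the floor plane."""
    x0: float
    z0: float
    x1: float
    z1: float

    def area(self) -> float:
        return max(0.0, self.x1 - self.x0) * max(0.0, self.z1 - self.z0)

def finite_volume(obj: Footprint, height: float) -> bool:
    # Finite volume: the cuboid must occupy a non-zero, finite volume.
    return obj.area() > 0 and height > 0

def contained(obj: Footprint, room: Footprint) -> bool:
    # Containment: the object's footprint must lie inside the room floor.
    return (room.x0 <= obj.x0 and obj.x1 <= room.x1 and
            room.z0 <= obj.z0 and obj.z1 <= room.z1)

def spatially_exclusive(a: Footprint, b: Footprint) -> bool:
    # Spatial exclusion: two solid objects may not overlap on the floor.
    overlap_x = min(a.x1, b.x1) - max(a.x0, b.x0)
    overlap_z = min(a.z1, b.z1) - max(a.z0, b.z0)
    return overlap_x <= 0 or overlap_z <= 0
```

A configuration passes only if every object satisfies all three predicates against the room and against every other object, which is exactly the pairwise test structure the next paragraph describes.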
However, not all cuboid and room subsets are compatible with each other. We use simple 3D spatial reasoning to enforce the volumetric constraints described above (see Figure 2(g)). We test each room-object pair and each object-object pair for 3D volumetric compatibility, so that we allow only scene configurations with no room-object and no object-object volumetric intersection.

Finally, we evaluate the scene configurations created by combinations of room and object hypotheses to find the scene configuration that best matches the image (Figure 2(i)). As the scene configuration is a structured variable, we use a variant of a structured prediction algorithm [16] to learn the cost function. We use two sources of surface geometry, the orientation map [11] and geometric context [8], which serve as features in the cost function. Since it is computationally expensive to test exhaustive combinations of scene configurations in practice, we use beam search to sample scene configurations that are volumetrically compatible (Section 5.1).

3 Estimating Surface Geometry

We would like to predict the local surface geometry of the regions in the image. A scene configuration should satisfy the local surface geometry extracted from image cues as well as the 3D volumetric constraints. The estimated surface geometry is therefore used as features in a scoring function that evaluates a given scene configuration. For estimating surface geometry we use two methods: the line-sweeping algorithm [11] and a multiple segmentation classifier [8].

¹ O(n · 2^m), where n is the number of room hypotheses and m is the number of object hypotheses.

Figure 2: Overview of our approach for estimating the spatial layout of the room and the objects.
The line-sweeping algorithm takes line segments as input and predicts an orientation map, in which regions are classified into one of three possible surface orientations. Figure 2(d) shows an example of an orientation map. Regions estimated as horizontal surfaces are colored red, and vertical surfaces are colored green and blue, corresponding to the associated vanishing point. This orientation map is used to evaluate scene configuration hypotheses. The multiple segmentation classifier [8] takes the full image as input, uses image features such as combinations of color and texture, and predicts geometric context represented by surface geometry labels for each superpixel (floor, ceiling, vertical (left, center, right), solid, and porous regions). Similar to orientation maps, the predicted labels are used to evaluate scene configuration hypotheses.

4 Generating Scene Configuration Hypotheses

Given the local surface geometry and the oriented line segments extracted from the image, we now create multiple hypotheses for the possible spatial layout of the room and the layout of objects in the room. These hypotheses are then combined to produce scene configurations such that all objects occupy mutually exclusive 3D volumes and lie inside the free space of the room defined by the walls.

4.1 Generating Room Hypotheses

A room hypothesis encodes the position and orientation of the walls, floor, and ceiling. In this paper, we represent a room hypothesis by a parametric box model [12]. Room hypotheses are generated from line segments in a way similar to the method described by Lee et al. [11]. They examine exhaustive combinations of line segments and check which of the resulting combinations define physically valid room models. Instead, we sample random tuples of line segments that define the boundaries of the parametric box.
Only the minimum number of line segments needed to define the parametric room model is sampled. Figure 2(e) shows examples of generated room hypotheses.

Figure 3: (i) Examples of volumetric constraint violations. (ii) Object hypothesis generation: we use the orientation maps to generate object hypotheses by finding convex edges.

4.2 Generating Object Hypotheses

Our goal is to extract the 3D geometry of the clutter objects in order to perform 3D spatial reasoning. Estimating precise 3D models of objects from a single image is an extremely difficult problem and probably requires recognition of object classes such as couches and tables. However, our goal is to perform coarse 3D reasoning about the spatial layout of rooms and of the objects in them. We only need to model a subset of the objects in the scene to provide enough constraints for volumetric reasoning. Therefore, we adopt a coarse 3D model of objects in the scene and model each object volume as a cuboid. We found that parameterizing objects as cuboids provides a good approximation to the occupied volume in man-made environments. Furthermore, by modeling objects with a parametric cuboid model, we can determine their location and dimensions in 3D up to scale, which allows volumetric reasoning about the 3D interaction between objects and the room.

We generate object hypotheses from the orientation map described above. Figure 3(ii)(a)(b) shows an example scene and its orientation map. The three colors represent the three possible plane orientations used in the orientation map.
We can see from the figure that the distribution of surfaces on the objects estimated by the orientation map suggests the presence of a cuboidal object. Figure 3(ii)(c) shows a pair of regions which can potentially form a convex edge if the regions represent the visible surfaces of a cuboidal object.

We test all pairs of regions in the orientation map to check whether they can form convex edges. This is achieved by checking the estimated orientation of the regions and their spatial location with respect to the vanishing points. If a region pair can form a convex corner, we use these regions to form an object hypothesis. To generate a cuboidal object hypothesis from a pair of regions, we first fit tight bounding quadrilaterals (Figure 3(ii)(c)) to each region in the pair and then sample all combinations of three points out of the eight vertices of the two quadrilaterals, which do not lie on a plane. Three is the minimum number of points (with (x, y) coordinates) that provide enough information to define a cuboid projected onto the 2D image plane, which has five degrees of freedom. We can then hypothesize a cuboid whose corner best approximates the three points. Figure 3(ii)(d) shows a sample cuboidal object hypothesis generated from the given orientation map.

4.3 Volumetric Compatibility of Scene Configurations

Given a room configuration and a set of candidate objects, a key operation is to evaluate whether the resulting combination satisfies the three fundamental volumetric compatibility constraints described in Section 2. The problem of estimating the three-dimensional layout of a scene from a single image is inherently ambiguous, because any measurement from a single image can only be determined up to scale. In order to test the volumetric compatibility of room-object hypothesis pairs and object-object hypothesis pairs, we make the assumption that all objects rest on the floor.
This assumption fixes the scale ambiguity between room and object hypotheses and allows us to reason about their 3D locations.

To test whether an object is contained within the free space of a room, we check whether the projection of the bottom surface of the object onto the image is completely contained within the projection of the floor surface of the room. If the projection of the bottom surface of the object is not completely within the floor surface, the corresponding 3D object model must be protruding into the walls of the room. Figure 3(i)(a) shows an example of an incompatible room-object pair.

Similarly, to test whether the volumes occupied by two objects are exclusive, we assume that the two objects rest on the same floor plane and we compare the projections of their bottom surfaces onto the image. If there is any overlap between the projections of the bottom surfaces of the two object hypotheses, the objects occupy intersecting volumes in 3D. Figure 3(i)(b) shows an example of an incompatible object-object pair.

5 Evaluating Scene Configurations

5.1 Inference

Given an image x, a set of room hypotheses {r_1, r_2, ..., r_n}, and a set of object hypotheses {o_1, o_2, ..., o_m}, our goal is to find the best scene configuration y = (y_r, y_o), where y_r = (y_r^1, ..., y_r^n) and y_o = (y_o^1, ..., y_o^m). Here y_r^i = 1 if room hypothesis r_i is used in the scene configuration and y_r^i = 0 otherwise, and y_o^i = 1 if object hypothesis o_i is present in the scene configuration and y_o^i = 0 otherwise. Note that ∑_i y_r^i = 1, as exactly one room hypothesis is needed to define the scene configuration.

Suppose that we are given a function f(x, y) that returns a score for y.
Finding the best scene configuration y* = argmax_y f(x, y) by testing all possible scene configurations requires n · 2^m evaluations of the score function. We resort to beam search (a fixed-width search tree) to keep the computation manageable by avoiding the evaluation of all scene configurations.

In the first level of the search tree, scene configurations with a room hypothesis and no object hypothesis are evaluated. In the following levels, an object hypothesis is added to the parent configuration and the resulting configuration is evaluated. The top k_l nodes with the highest scores are added to the search tree as child nodes, where k_l is a pre-determined beam width for level l.² The search is continued for a fixed number of levels or until no cuboid that is compatible with the existing configurations can be added. After the search tree has been explored, the best-scoring node in the tree is returned as the best scene configuration.

5.2 Learning the Score Function

We set the score function to f(x, y) = w^T ψ(x, y) + w_φ^T φ(y), where ψ(x, y) is a feature vector for a given image x that measures the compatibility of the scene configuration y with the estimated surface geometry, and φ(y) is a penalty term for incompatible configurations that penalizes room and object configurations which violate the volumetric constraints.

We use a structured SVM [16] to learn the weight vector w. The weights are learned by solving

  min_{w,ξ}  (1/2) ‖w‖² + C ∑_i ξ_i
  s.t.  w^T ψ(x_i, y_i) − w^T ψ(x_i, y) − w_φ^T φ(y) ≥ Δ(y_i, y) − ξ_i,  ∀i, ∀y
        ξ_i ≥ 0,  ∀i,

where x_i are the training images, y_i are the ground-truth configurations, ξ_i are slack variables, and Δ(y_i, y) is the loss function that measures the error of configuration y.
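The beam search of Section 5.1 can be sketched in a few lines of Python. Here `score` and `compatible` are hypothetical callables standing in for the learned score function f(x, y) and the volumetric tests of Section 4.3; this is an illustration of the control structure, not the authors' implementation.

```python
# Sketch of the fixed-width beam search over scene configurations.
# `score(room, objs)` stands in for f(x, y); `compatible(room, objs)`
# stands in for the volumetric compatibility tests (both assumed).
def beam_search(rooms, objects, score, compatible, widths=(100, 5, 2, 1)):
    # Level 1: configurations with one room hypothesis and no objects.
    beam = sorted(((score(r, frozenset()), r, frozenset()) for r in rooms),
                  key=lambda t: t[0], reverse=True)[:widths[0]]
    best = max(beam, key=lambda t: t[0])
    # Following levels: grow each configuration by one compatible object.
    for width in widths[1:]:
        children = []
        for _, room, objs in beam:
            for i in range(len(objects)):
                if i in objs or not compatible(room, objs | {i}):
                    continue  # reject volumetrically invalid configurations
                grown = objs | {i}
                children.append((score(room, grown), room, grown))
        if not children:
            break  # no compatible cuboid can be added to any configuration
        beam = sorted(children, key=lambda t: t[0], reverse=True)[:width]
        best = max([best] + beam, key=lambda t: t[0])
    # Return the best-scoring node found anywhere in the search tree.
    return best
```

Note that the best node at an intermediate level can win, so the search keeps a running `best` rather than returning the last beam.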
Tsochantaridis et al. [16] deal with the large number of constraints by iteratively adding the most violated constraints. We simplify this by sampling a fixed number of configurations for each training image, using the same beam search process used for inference, and solving with quadratic programming.

Loss Function: The loss function Δ(y_i, y) is the percentage of pixels in the entire image having an incorrect label. For example, pixels that are labeled as left wall when they actually belong to the center wall, or pixels labeled as object when they actually belong to the floor, are counted as incorrectly labeled pixels. A wall is labeled as center if its surface normal is within 45 degrees of the camera optical axis, and as left or right otherwise.

² We set k_l to (100, 5, 2, 1), with a maximum of 4 levels. The results were not sensitive to these parameters.

Figure 4: Two qualitative examples showing how 3D volumetric reasoning aids estimation of the spatial layout of the room.

Table 1: Percentage of pixels with incorrect estimates of room surfaces (lower is better). The first row performs no reasoning about objects; the second row is our approach with 3D volumetric reasoning about objects. Columns show the features used. OM: orientation map from [11]. GC: geometric context from [8].

                          OM       GC       OM+GC
  No object reasoning     24.7%    22.7%    18.6%
  Volumetric reasoning    19.5%    20.2%    16.2%

Feature Vector: The feature vector ψ(x, y) is computed by measuring how well each surface in the scene configuration y is supported by the orientation map and the geometric context. A feature is computed for each of the six surfaces in the scene configuration (floor, left wall, center wall, right wall, ceiling, object) as the relative area over which the orientation map or the geometric context correctly explains the attribute of the surface. This results in a twelve-dimensional feature vector for a given scene configuration.
For example, the feature for the floor surface in a scene configuration is computed as the relative area over which the orientation map predicts a horizontal surface and the relative area over which the geometric context predicts a floor label.

Volumetric Penalty: The penalty term φ(y) measures how much the volumetric constraints are violated. (1) The first term, φ(y_r, y_o), measures the volumetric intersection between the volume defined by the room walls and the objects. It penalizes configurations in which an object hypothesis lies outside the room volume, and the penalty is proportional to the volume outside the room. (2) The second term, ∑_{i,j} φ(y_o^i, y_o^j), measures the volume intersection between two objects (i, j). The penalty from this term is proportional to the overlap of the cuboids projected onto the floor.

6 Experimental Results

We evaluated our 3D geometric reasoning approach on the indoor image dataset introduced in [12]. The dataset consists of 314 images, and the ground truth consists of the marked spatial layout of the room and the clutter layouts. For our experiments, we use the same training/test split as [12] (209 training and 105 test images). We use the training images to estimate the weight vector.

Qualitative Evaluation: Figure 4 illustrates the benefit of the 3D spatial reasoning introduced in our approach. If no 3D clutter reasoning is used and the room box is fitted to the orientation map and geometric context, the box fits to the object surfaces, which leads to substantial error in the spatial layout estimate. However, if we use 3D object reasoning, the walls get pushed out by the containment constraint and the spatial layout estimate improves. We can also see from the examples that extracting a subset of the objects in the scene is enough for reasoning and for improving the spatial layout estimate.
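Both the training loss Δ(y_i, y) and the pixel-based evaluation below reduce to counting mislabeled pixels. A NumPy sketch, assuming label maps are integer arrays of per-pixel surface classes (an illustration, not the authors' evaluation code):

```python
import numpy as np

def pixel_loss(pred, gt):
    """Delta(y_i, y): fraction of pixels whose surface label (floor,
    left/center/right wall, ceiling, object) disagrees with ground truth."""
    pred, gt = np.asarray(pred), np.asarray(gt)
    assert pred.shape == gt.shape, "label maps must align pixel for pixel"
    # Boolean mismatch mask averaged over all pixels of the image.
    return float(np.mean(pred != gt))
```

For instance, `pixel_loss([[0, 1], [1, 2]], [[0, 0], [1, 2]])` returns 0.25, since one of four pixels is mislabeled.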
Figures 5 and 6 show more examples of the spatial layouts and the estimated clutter objects in the images. Additional results are in the supplementary material.

Quantitative Evaluation: We evaluate the performance of our approach in estimating the spatial layout of the room. We use the pixel-based measure introduced in [12], which counts the percentage of pixels on the room surfaces that disagree with the ground truth. For comparison, we employ the simple multiple segmentation classifier [8] and the recent approach introduced in [12] as baselines. The images in the dataset have significant clutter; therefore, simple classification-based approaches with no clutter reasoning perform poorly and have an error of 26.5%. The state-of-the-art approach [12], which utilizes clutter reasoning in the image plane, has an error of 21.2%. On the other hand, our approach, which uses a parametric model of clutter and simple 3D volumetric reasoning, outperforms both approaches and has an error of 16.2%.

Figure 5: Additional examples showing the performance on a wide variety of scenes. Dotted lines represent the room estimate without object reasoning.

Figure 6: Failure examples. The first two examples are failure cases in which the cuboids are either missed or estimated incorrectly. The last two failure cases are due to errors in vanishing point estimation.

We also performed several experiments to measure the significance of each step and of the features in our approach. When we only use the surface layout estimates from [8] as features of the cost function, our approach has an error rate of 20.2%, whereas using only orientation maps as features yields an error rate of 19.5%. We also tried several search techniques to search the space of hypotheses.
With a greedy approach (the best cuboid added at each iteration) to searching the hypothesis space, we achieved an error rate of 19.2%, which shows that early commitment to partial configurations leads to error, and that a search strategy that allows late commitment, such as beam search, should be used.

7 Conclusion

In this paper, we have proposed the use of volumetric reasoning between objects and the surfaces of the room layout to recover the spatial layout of a scene. By parametrically representing the 3D volumes of objects and rooms, we can apply constraints for volumetric reasoning, such as spatial exclusion and containment. Our experiments show that volumetric reasoning improves the estimate of the room layout and provides a richer interpretation that includes the objects in the scene. The rich geometric information provided by our method can provide crucial information for object recognition and eventually aid complete scene understanding.

8 Acknowledgements

This research was supported by NSF Grant EEEC-0540865, ONR MURI Grant N00014-07-1-0747, NSF Grant IIS-0905402, and ONR Grant N000141010766.

References

[1] L. Roberts. Machine perception of 3-D solids. PhD thesis, 1965.
[2] A. Guzman. Decomposition of a visual scene into three-dimensional bodies. In Proceedings of the Fall Joint Computer Conference, 1968.
[3] D. A. Waltz. Generating semantic descriptions from line drawings of scenes with shadows. Technical report, MIT, 1972.
[4] J. Coughlan and A. Yuille. Manhattan world: Compass direction from a single image by Bayesian inference. In Proc. ICCV, 1999.
[5] J. Kosecka and W. Zhang. Video compass. In Proc. ECCV, 2002.
[6] J. Kosecka and W. Zhang. Extraction, matching, and pose recovery based on dominant rectangular structures. CVIU, 2005.
[7] S. Yu, H. Zhang, and J. Malik. Inferring spatial layout from a single image via depth-ordered grouping.
IEEE Computer Society Workshop on Perceptual Organization in Computer Vision, 2008.

[8] D. Hoiem, A. Efros, and M. Hebert. Recovering surface layout from an image. IJCV, 75(1), 2007.
[9] A. Saxena, M. Sun, and A. Ng. Make3D: Learning 3D scene structure from a single image. PAMI, 2008.
[10] E. Delage, H. Lee, and A. Ng. A dynamic Bayesian network model for autonomous 3D reconstruction from a single indoor image. In Proc. CVPR, 2006.
[11] D. Lee, M. Hebert, and T. Kanade. Geometric reasoning for single image structure recovery. In Proc. CVPR, 2009.
[12] V. Hedau, D. Hoiem, and D. Forsyth. Recovering the spatial layout of cluttered rooms. In Proc. ICCV, 2009.
[13] H. Wang, S. Gould, and D. Koller. Discriminative learning with latent variables for cluttered indoor scene understanding. In Proc. ECCV, 2010.
[14] V. Hedau, D. Hoiem, and D. Forsyth. Thinking inside the box: Using appearance models and context based on room geometry. In Proc. ECCV, 2010.
[15] A. Gupta, A. Efros, and M. Hebert. Blocks world revisited: Image understanding using qualitative geometry and mechanics. In Proc. ECCV, 2010.
[16] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. JMLR, 6:1453-1484, 2005.
", "award": [], "sourceid": 594, "authors": [{"given_name": "Abhinav", "family_name": "Gupta", "institution": null}, {"given_name": "Martial", "family_name": "Hebert", "institution": null}, {"given_name": "Takeo", "family_name": "Kanade", "institution": null}, {"given_name": "David C.", "family_name": "Lee", "institution": null}]}