{"title": "Localizing 3D cuboids in single-view images", "book": "Advances in Neural Information Processing Systems", "page_first": 746, "page_last": 754, "abstract": "In this paper we seek to detect rectangular cuboids and localize their corners in uncalibrated single-view images depicting everyday scenes. In contrast to recent approaches that rely on detecting vanishing points of the scene and grouping line segments to form cuboids, we build a discriminative parts-based detector that models the appearance of the cuboid corners and internal edges while enforcing consistency to a 3D cuboid model. Our model is invariant to the different 3D viewpoints and aspect ratios and is able to detect cuboids across many different object categories. We introduce a database of images with cuboid annotations that spans a variety of indoor and outdoor scenes and show qualitative and quantitative results on our collected database. Our model out-performs baseline detectors that use 2D constraints alone on the task of localizing cuboid corners.", "full_text": "Localizing 3D cuboids in single-view images\n\nJianxiong Xiao\n\nBryan C. Russell\u2217\n\nAntonio Torralba\n\nMassachusetts Institute of Technology\n\n\u2217University of Washington\n\nAbstract\n\nIn this paper we seek to detect rectangular cuboids and localize their corners in\nuncalibrated single-view images depicting everyday scenes. In contrast to recent\napproaches that rely on detecting vanishing points of the scene and grouping line\nsegments to form cuboids, we build a discriminative parts-based detector that\nmodels the appearance of the cuboid corners and internal edges while enforcing\nconsistency to a 3D cuboid model. Our model copes with different 3D viewpoints\nand aspect ratios and is able to detect cuboids across many different object cate-\ngories. 
We introduce a database of images with cuboid annotations that spans a\nvariety of indoor and outdoor scenes and show qualitative and quantitative results\non our collected database. Our model out-performs baseline detectors that use 2D\nconstraints alone on the task of localizing cuboid corners.\n\n1 Introduction\n\nExtracting a 3D representation from a single-view image depicting a 3D object has been a long-\nstanding goal of computer vision [20]. Traditional approaches have sought to recover 3D properties,\nsuch as creases, folds, and occlusions of surfaces, from a line representation extracted from the\nimage [18]. Among these are works that have characterized and detected geometric primitives, such\nas quadrics (or \u201cgeons\u201d) and surfaces of revolution, which have been thought to form the components\nfor many different object types [1]. While these approaches have achieved notable early successes,\nthey could not be scaled-up due to their dependence on reliable contour extraction from natural\nimages.\nIn this work we focus on the task of detecting rectangular cuboids, which are a basic geometric\nprimitive type and occur often in 3D scenes (e.g. indoor and outdoor man-made scenes [22, 23, 24]).\nMoreover, we wish to recover the shape parameters of the detected cuboids. The detection and\nrecovery of shape parameters yield at least a partial geometric description of the depicted scene,\nwhich allows a system to reason about the affordances of a scene in an object-agnostic fashion [9,\n15]. This is especially important when the category of the object is ambiguous or unknown.\nThere have been several recent efforts that revisit this problem [9, 11, 12, 17, 19, 21, 26, 28, 29].\nAlthough there are many technical differences amongst these works, the main pipeline of these ap-\nproaches is similar. Most of them estimate the camera parameters using three orthogonal vanishing\npoints with a Manhattan world assumption of a man-made scene. 
They detect line segments via\nCanny edges and recover surface orientations [13] to form 3D cuboid hypotheses using bottom-up grouping of line and region segments. Then, they score these hypotheses based on the image\nevidence for lines and surface orientations [13].\nIn this paper we take a different approach to this problem. As shown in Figure 1, we aim to\nbuild a 3D cuboid detector to detect individual boxy volumetric structures. We build a discriminative\nparts-based detector that models the appearance of the corners and internal edges of cuboids while\nenforcing spatial consistency of the corners and edges to a 3D cuboid model. Our model is trained\nin a similar fashion to recent work that detects articulated human body joints [27].\n\n(Figure 1 panels: Input Image; 3D Cuboid Detector; Output Detection Result; Synthesized New Views.)\nFigure 1: Problem summary. Given a single-view input image, our goal is to detect the 2D corner\nlocations of the cuboids depicted in the image. With the output part locations we can subsequently\nrecover information about the camera and 3D shape via camera resectioning.\n\nOur cuboid detector is trained across different 3D viewpoints and aspect ratios. This is in contrast to\nview-based approaches for object detection that train separate models for different viewpoints, e.g.\n[7]. Moreover, instead of relying on edge detection and grouping to form an initial hypothesis of a\ncuboid [9, 17, 26, 29], we use a 2D sliding window approach to exhaustively evaluate all possible\ndetection windows. Also, our model does not rely on any preprocessing step, such as computing\nsurface orientations [13]. Instead, we learn the parameters for our model using a structural SVM\nframework. This allows the detector to adapt to the training data to identify the relative importance\nof corners, edges and 3D shape constraints by learning the weights for these terms. 
We introduce an\nannotated database of images with geometric primitives labeled and validate our model by showing\nqualitative and quantitative results on our collected database. We also compare to baseline detectors\nthat use 2D constraints alone on the tasks of geometric primitive detection and part localization. We\nshow improved performance on the part localization task.\n\n2 Model for 3D cuboid localization\n\nWe represent the appearance of cuboids by a set of parts located at the corners of the cuboid and\na set of internal edges. We enforce spatial consistency among the corners and edges by explicitly\nreasoning about the cuboid's 3D shape. Let I be the image and pi = (xi, yi) be the 2D image location of the\nith corner on the cuboid. We define an undirected loopy graph G = (V, E) over the corners of the\ncuboid, with vertices V and edges E connecting the corners of the cuboid. We illustrate our loopy\ngraph layout in Figure 2(a). We define a scoring function associated with the corner locations in the\nimage:\n\nS(I, p) = \u2211_{i\u2208V} wH_i \u00b7 HOG(I, pi) + \u2211_{ij\u2208E} wD_ij \u00b7 Displacement2D(pi, pj) + \u2211_{ij\u2208E} wE_ij \u00b7 Edge(I, pi, pj) + wS \u00b7 Shape3D(p)   (1)\n\nwhere HOG(I, pi) is a HOG descriptor [4] computed at image location pi, and\nDisplacement2D(pi, pj) = \u2212[(xi \u2212 xj)^2, xi \u2212 xj, (yi \u2212 yj)^2, yi \u2212 yj] is a 2D corner displacement term that is used in other pictorial parts-based models [7, 27]. By reasoning about the\n3D shape, our model handles different 3D viewpoints and aspect ratios, as illustrated in Figure 2.\nNotice that our model is linear in the weights w. Moreover, the model is flexible as it adapts to\nthe training data by automatically learning weights that measure the relative importance of the\nappearance and spatial terms. 
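Since the scoring function in Equation (1) is a weighted linear combination of feature terms, it can be sketched in a few lines. The sketch below is illustrative, not the authors' released code; the feature functions `hog`, `edge`, and `shape3d` are hypothetical callables standing in for the HOG, Chamfer-matching, and reprojection-error features described in this section.

```python
import numpy as np

def displacement2d(pi, pj):
    # Displacement2D(pi, pj) = -[(xi-xj)^2, xi-xj, (yi-yj)^2, yi-yj]
    dx, dy = pi[0] - pj[0], pi[1] - pj[1]
    return -np.array([dx * dx, dx, dy * dy, dy], dtype=float)

def score(p, w, hog, edge, shape3d, V, E):
    """Equation (1): a sum of per-corner appearance terms, pairwise 2D
    displacement terms, internal edge terms, and a global 3D shape term,
    linear in all weights w."""
    s = sum(w["H"][i] @ hog(p[i]) for i in V)                              # corner appearance
    s += sum(w["D"][ij] @ displacement2d(p[ij[0]], p[ij[1]]) for ij in E)  # 2D spring terms
    s += sum(w["E"][ij] * edge(p[ij[0]], p[ij[1]]) for ij in E)            # internal edges
    s += w["S"] @ shape3d(p)                                               # 3D shape consistency
    return float(s)
```

Because the score is linear in w, the same concatenated feature vector can be reused directly in the structural SVM training of Section 2.2.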
We define the Edge and Shape3D terms as follows.\n\nEdge(I, pi, pj): The internal edge information on cuboids is informative and provides a salient\nfeature for the locations of the corners. For this, we include a term that models the appearance of\nthe internal edges, which is illustrated in Figure 3. For adjacent corners on the cuboid, we identify\nthe edge between the two corners and calculate the image evidence to support the existence of such\nan edge. Given the corner locations pi and pj, we use Chamfer matching to align the straight line\nbetween the two corners to edges extracted from the image. We find image edges using Canny edge\ndetection [3] and efficiently compute the distance of each pixel along the line segment to the nearest\nedge via the truncated distance transform. We use Bresenham's line algorithm [2] to efficiently find\nthe 2D image locations on the line between the two points. The final edge term is the negative mean\nvalue of the Chamfer matching score for all pixels on the line. As there are usually 9 visible edges\nfor a cuboid, we have 9 dimensions for the edge term.\n\n(a) Our Full Model. (b) 2D Tree Model. (c) Example Part Detections.\nFigure 2: Model visualization. Corresponding model parts are colored consistently in the figure.\nIn (a) and (b) the displayed corner locations are the average 2D locations across all viewpoints and\naspect ratios in our database. In (a) the edge thickness corresponds to the learned weight for the edge\nterm. We can see that the bottom edge is significantly thicker, which indicates that it is informative\nfor detection, possibly due to shadows and contact with a supporting plane.\n\nShape3D(p): The 3D shape of a cuboid constrains the layout of the parts and edges in the image.\nWe propose to define a shape term that measures how well the configuration of corner locations\nrespects the 3D shape. 
In other words, given the 2D locations p of the corners, we de\ufb01ne a term\nthat tells us how likely this con\ufb01guration of corner locations p can be interpreted as the reprojection\nof a valid cuboid in 3D. When combined with the weights wS, we get an overall evaluation of\nthe goodness of the 2D locations with respect to the 3D shape. Let X (written in homogeneous\ncoordinates) be the 3D points on the unit cube centered at the world origin. Then, X transforms as\nx = PLX, where L is a matrix that transforms the shape of the unit cube and P is a 3 \u00d7 4 camera\nmatrix. For each corner, we use the other six 2D corner locations to estimate the product PL using\ncamera resectioning [10]. The estimated matrix is used to predict the corner location. We use the\nnegative L2 distance to the predicted corner location as a feature for the corner in our model. As we\nmodel 7 corners on the cuboid, there are 7 dimensions in the feature vector. When combined with\nthe learned weights wS through dot-product, this represents a weighted reprojection error score.\n\n2.1 Inference\n\nOur goal is to \ufb01nd the 2D corner locations p over the HOG grid of I that maximizes the score given\nin Equation (1). Note that exact inference is hard due to the global shape term. Therefore, we\npropose a spanning tree approximation to the graph to obtain multiple initial solutions for possible\ncorner locations. Then we adjust the corner locations using randomized simple hill climbing.\nFor the initialization, it is important for the computation to be ef\ufb01cient since we need to evaluate all\npossible detection windows in the image. Therefore, for simplicity and speed, we use a spanning\ntree T to approximate the full graph G, as shown in Figure 2(b). In addition to the HOG feature as\na unary term, we use a popular pairwise spring term along the edges of the tree to establish weak\nspatial constraints on the corners. 
We use the following scoring function for the initialization:\n\nST(I, p) = \u2211_{i\u2208V} wH_i \u00b7 HOG(I, pi) + \u2211_{ij\u2208T} wD_ij \u00b7 Displacement2D(pi, pj)   (2)\n\nNote that the model used for obtaining initial solutions is similar to [7, 27], which is only able\nto handle a fixed viewpoint and 2D aspect ratio. Nonetheless, we use it since it meets our speed\nrequirement via dynamic programming and the distance transform [8].\nWith the tree approximation, we pick the top 1000 possible configurations of corner locations from\neach image and optimize our scoring function by adjusting the corner locations using randomized\nsimple hill climbing. Given the initial corner locations for a single configuration, we iteratively\nchoose a random corner i with the goal of finding a new pixel location that increases the scoring\nfunction given in Equation (1) while holding the other corner locations fixed. We compute the scores\nat pixel locations neighboring the current setting pi. We also consider the pixel location that the\n3D rigid model predicts when estimated from the other corner locations. We randomly choose one\nof the locations and update pi if it yields a higher score. Otherwise, we choose another random\ncorner and location. The algorithm terminates when no corner can reach a location that improves\nthe score, which indicates that we have reached a local maximum.\nDuring detection, since the edge and 3D shape terms are non-positive and the weights are constrained\nto be positive, we can upper-bound the scoring function and quickly reject candidate locations without evaluating the entire function.\n\nFigure 3: Illustration of the edge term in our model. Given line endpoints, we compute a Chamfer\nmatching score for pixels that lie on the line using the response from a Canny edge detector.\n\n
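The Chamfer-matching edge term of Section 2 (illustrated in Figure 3) can be sketched as follows. This is a simplified stand-in for the paper's implementation: the Canny edge map is assumed to be given as a binary array, the truncated distance transform is computed by brute force for clarity (a real implementation would use a linear-time distance transform), and the truncation value is a hypothetical choice.

```python
import numpy as np

def bresenham(p0, p1):
    # Integer pixel coordinates on the segment p0 -> p1 (Bresenham's algorithm [2]).
    (x0, y0), (x1, y1) = p0, p1
    dx, dy = abs(x1 - x0), -abs(y1 - y0)
    sx, sy = (1 if x0 < x1 else -1), (1 if y0 < y1 else -1)
    err, pts = dx + dy, []
    while True:
        pts.append((x0, y0))
        if (x0, y0) == (x1, y1):
            return pts
        e2 = 2 * err
        if e2 >= dy:
            err, x0 = err + dy, x0 + sx
        if e2 <= dx:
            err, y0 = err + dx, y0 + sy

def truncated_chamfer_map(edge_map, trunc=10.0):
    # Distance from every pixel to the nearest edge pixel, truncated at `trunc`.
    # (Brute force for clarity; scipy.ndimage.distance_transform_edt is the fast route.)
    ys, xs = np.nonzero(edge_map)
    h, w = edge_map.shape
    dist = np.full((h, w), trunc)
    for y in range(h):
        for x in range(w):
            if len(xs) > 0:
                d = np.sqrt((ys - y) ** 2 + (xs - x) ** 2).min()
                dist[y, x] = min(d, trunc)
    return dist

def edge_term(edge_map, pi, pj, trunc=10.0):
    # Negative mean Chamfer score over the pixels on the line between pi and pj.
    dist = truncated_chamfer_map(edge_map, trunc)
    return -float(np.mean([dist[y, x] for (x, y) in bresenham(pi, pj)]))
```

A line lying exactly on image edges scores 0 (the maximum); lines far from any edge are penalized up to the truncation value.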
Also, since only one corner can change locations at each\niteration, we can reuse the computed scoring function from previous iterations during hill climbing.\nFinally, we perform non-maximal suppression among the parts and then perform non-maximal suppression over the entire object to get the final detection result.\n\n2.2 Learning\n\nFor learning, we first note that our scoring function in Equation (1) is linear in the weights w.\nThis allows us to use existing structured prediction procedures for learning. To learn the weights,\nwe adapt the structural SVM framework of [16]. Given positive training images with the 2D corner\nlocations labeled {In, pn} and negative training images {In}, we wish to learn the weights and bias term\n\u03b2 = (wH, wD, wE, wS, b) that minimize the following structured prediction objective function:\n\nmin_{\u03b2, \u03be\u22650} (1/2) \u03b2 \u00b7 \u03b2 + C \u2211_n \u03ben   (3)\ns.t. \u2200n \u2208 pos: \u03b2 \u00b7 \u03a6(In, pn) \u2265 1 \u2212 \u03ben\n\u2200n \u2208 neg, \u2200p \u2208 P: \u03b2 \u00b7 \u03a6(In, p) \u2264 \u22121 + \u03ben\n\nwhere all appearance and spatial feature vectors are concatenated into the vector \u03a6(In, p) and P\nis the set of all possible part locations. During training we constrain the weights wD, wE, wS \u2265 0.0001. We tried mining negatives from the wrong corner locations in the positive examples but\nfound that it did not improve the performance. We also tried latent positive mining and empirically\nobserved that it slightly helps. Since the latent positive mining helped, we also tried an offset\ncompensation as post-processing to obtain the offset of corner locations introduced during latent\npositive mining. For this, we ran the trained detector on the training set to obtain the offsets and\nused the mean to compensate for the location changes. 
However, we observed empirically that it did\nnot help performance.\n\n2.3 Discussion\n\nSliding window object detectors typically use a root filter that covers the entire object [4] or a\ncombination of root filter and part filters [7]. The use of a root filter is sufficient to capture the\nappearance for many object categories since they have canonical 3D viewpoints and aspect ratios.\nHowever, cuboids in general span a large number of object categories and do not have a consistent\n3D viewpoint or aspect ratio. The diversity of 3D viewpoints and aspect ratios causes dramatic\nchanges in the root filter response. However, we have observed that the responses for the part filters\nare less affected.\nMoreover, we argue that a purely view-based approach that trains separate models for the different\nviewpoints and aspect ratios may not capture this diversity well. For example, such a strategy would\nrequire dividing the training data to train each model. In contrast, we train our model for all 3D\nviewpoints and aspect ratios. We illustrate this in Figure 2, where detected parts are colored consistently in the figure. As our model handles different viewpoints and aspect ratios, we are able to\nmake use of the entire database during training.\nDue to the diversity of cuboid appearance, our model is designed to capture the most salient features,\nnamely the corners and edges. While the corners and edges may be occluded (e.g. by self-occlusion,\n\nFigure 4: Illustration of the labeling tool and 3D viewpoint statistics. (a) A cuboid being labeled\nthrough the tool. A projection of the cuboid model is overlaid on the image and the user must\nselect and drag anchor points to their corresponding location in the image. 
(b) Scatter plot of 3D\nazimuth and elevation angles for annotated cuboids with zenith angle close to zero. We perform an\nimage left/right swap to limit the rotation range. (c) Crops of cuboids at different azimuth angles for\na fixed elevation, with the shown examples marked as red points in the scatter plot of (b).\n\nother objects in front, or cropping), for now we do not handle these cases explicitly in our model.\nFurthermore, we do not make use of other appearance cues, such as the appearance within the cuboid\nfaces, since they have a larger variation across the object categories (e.g. dice and fire alarm trigger)\nand may not generalize as well. We also take into account the tractability of our model, as adding\nadditional appearance cues will increase the complexity of our model and the detector needs to be\nevaluated over a large number of possible sliding windows in an image.\nCompared with recent approaches that detect cuboids by reasoning about the shape of the entire\nscene [9, 11, 12, 17, 19, 29], one of the key differences is that we detect cuboids directly without\nconsideration of the global scene geometry. These prior approaches rely heavily on the assumption\nthat the camera is located inside a cuboid-like room and held at human height, with the parameters\nof the room cuboid inferred through vanishing points based on a Manhattan world assumption.\nTherefore, they cannot handle outdoor scenes or close-up snapshots of an object (e.g. the boxes on\na shelf in row 1, column 3 of Figure 6). As our detector is agnostic to the scene geometry, we are\nable to detect cuboids even when these assumptions are violated.\nWhile previous approaches reason over rigid cuboids, our model is flexible in that it can adapt\nto deformations of the 3D shape. We observe that not all cuboid-like objects are perfect cuboids\nin practice. Deformations of the shape may arise due to the design of the object (e.g. 
the printer\nin Figure 1), natural deformation or degradation of the object (e.g. a cardboard box), or a global\ntransformation of the image (e.g. camera radial distortion). We argue that modeling the deformations\nis important in practice since a violation of the rigid constraints may make a 3D reconstruction-\nbased approach numerically unstable. In our approach, we model the 3D deformation and allow the\nstructural SVM to learn based on the training data how to weight the importance of the 3D shape\nterm. Moreover, a rigid shape requires a perfect 3D reconstruction and it is usually done with non-\nlinear optimization [17], which is expensive to compute and becomes impractical in an exhaustive\nsliding-window search in order to maintain a high recall rate. With our approach, if a rigid cuboid\nis needed, we can recover the 3D shape parameters via camera resectioning, as shown in Figure 9.\n\n3 Database of 3D cuboids\n\nTo develop and evaluate any models for 3D cuboid detection in real-world environments, it is nec-\nessary to have a large database of images depicting everyday scenes with 3D cuboids labeled. In\nthis work, we seek to build a database by manually labeling point correspondences between images\nand 3D cuboids. We have built a labeling tool that allows a user to select and drag key points on\na projected 3D cuboid model to its corresponding location in the image. This is similar to existing\ntools, such as Google building maker [14], which has been used to build 3D models of buildings for\nmaps. Figure 4(a) shows a screenshot of our tool. 
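Camera resectioning, used both in the Shape3D term of Section 2 and to recover cuboid and camera parameters from labeled correspondences, can be sketched with the standard DLT estimate from [10]. This is a minimal sketch under our own function names (the paper additionally refines with Levenberg-Marquardt, which is omitted here).

```python
import numpy as np

def resection(X, x):
    """Direct Linear Transform [10]: estimate the 3x4 matrix M = PL from
    n >= 6 homogeneous 3D points X (n x 4) and 2D image points x (n x 2)."""
    A = []
    for (u, v), Xi in zip(x, X):
        A.append(np.concatenate([Xi, np.zeros(4), -u * Xi]))  # row for the u coordinate
        A.append(np.concatenate([np.zeros(4), Xi, -v * Xi]))  # row for the v coordinate
    _, _, Vt = np.linalg.svd(np.asarray(A))
    return Vt[-1].reshape(3, 4)  # right singular vector of the smallest singular value

def shape3d_feature(corners2d, cube3d, i):
    """One dimension of the Shape3D term: estimate PL from the other six
    corners, reproject corner i, and return the negative L2 distance."""
    idx = [j for j in range(len(corners2d)) if j != i]
    M = resection(cube3d[idx], corners2d[idx])
    proj = M @ cube3d[i]
    pred = proj[:2] / proj[2]
    return -float(np.linalg.norm(pred - corners2d[i]))
```

With exact (noise-free) projections of the seven visible unit-cube corners, the feature is 0 for every corner; it grows more negative as a corner drifts away from any configuration that a true cuboid projection would allow.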
For the database, we have harvested images from\nfour sources: (i) a subset of the SUN database [25], which contains images depicting a large variety\nof different scene categories, (ii) ImageNet synsets [5] with objects having one or more 3D cuboids\ndepicted, (iii) images returned from an Internet search using keywords for objects that are wholly or\npartially described by 3D cuboids, and (iv) a set of images that we manually collected from our personal photographs. Given the corner correspondences, the cuboid and camera parameters are estimated up to a similarity transformation via\ncamera resectioning using Levenberg-Marquardt optimization [10].\n\nFigure 5: Single top 3D cuboid detection in each image. Yellow: ground truth, green: correct\ndetection, red: false alarm. Bottom row - false positives. The false positives tend to occur when a\npart fires on a \u201ccuboid-like\u201d corner region (e.g. row 3, column 5) or finds a smaller cuboid (e.g. the\nRubik\u2019s cube depicted in row 3, column 1).\n\nFigure 6: All 3D cuboid detections above a fixed threshold in each image. Notice that our model is\nable to detect the presence of multiple cuboids in an image (e.g. row 1, columns 2-5) and handles\npartial occlusions (e.g. row 1, column 4), small objects, and a range of 3D viewpoints, aspect ratios,\nand object classes. Moreover, the depicted scenes have varying amounts of clutter. Yellow - ground\ntruth. Green - correct prediction. Red - false positive. Line thickness corresponds to detector\nconfidence.\n\nFor our database, we have 785 images with 1269 cuboids annotated. We have also collected a\nnegative set containing 2746 images that do not contain any cuboid-like objects. We perform an image\nleft/right swap to limit the rotation range. 
As a result, the min/max azimuth, elevation, and zenith\nangles are 0/45, -90/90, -180/180 degrees respectively. In Figure 4(b) we show a scatter plot of the\nazimuth and elevation angles for all of the labeled cuboids with zenith angle close to zero. Notice that\nthe cuboids cover a large range of azimuth angles for elevation angles between 0 (frontal view) and\n45 degrees. We also show a number of cropped examples for a \ufb01xed elevation angle in Figure 4(c),\nwith their corresponding azimuth angles indicated by the red points in the scatter plot. Figure 8(c)\nshows the distribution of objects from the SUN database [25] that overlap with our cuboids (there\nare 326 objects total from 114 unique classes). Compared with [12], our database covers a larger set\nof object and scene categories, with images focusing on both objects and scenes (all images in [12]\nare indoor scene images). Moreover, we annotate objects closely resembling a 3D cuboid (in [12]\nthere are many non-cuboids that are annotated with a bounding cuboid) and overall our cuboids are\nmore accurately labeled.\n\n4 Evaluation\n\nIn this section we show qualitative results of our model on the 3D cuboids database and report\nquantitative results on two tasks: (i) 3D cuboid detection and (ii) corner localization accuracy. For\ntraining and testing, we randomly split equally the positive and negative images. As discussed in\nSection 3, there is rotational symmetry in the 3D cuboids. During training, we allow the image\n\n6\n\n\fFigure 7: Corner localization comparison for detected geometric primitives. (a) Input image and\nground truth annotation. (b) 2D tree-based initialization. (c) Our full model. Notice that our model\nis able to better localize cuboid corners over the baseline 2D tree-based model, which corresponds\nto 2D parts-based models used in object detection and articulated pose estimation [7, 27]. 
The last\ncolumn shows a failure case where a part \ufb01res on a \u201ccuboid-like\u201d corner region in the image.\n\nto mirror left-right and orient the 3D cuboid to minimize the variation in rotational angle. During\ntesting, we run the detector on left-right mirrors of the image and select the output at each location\nwith the highest detector response. For the parts we extract HOG features [4] in a window centered at\neach corner with scale of 10% of the object bounding box size. Figure 5 shows the single top cuboid\ndetection in each image and Figure 6 shows all of the most con\ufb01dent detections in the image. Notice\nthat our model is able to handle partial occlusions (e.g. row 1, column 4 of Figure 6), small objects,\nand a range of 3D viewpoints, aspect ratios, and object classes. Moreover, the depicted scenes have\nvarying amount of clutter. We note that our model fails when a corner \ufb01res on a \u201ccuboid-like\u201d corner\nregion (e.g. row 3, column 5 of Figure 5).\nWe compare the various components of our model against two baseline approaches. The \ufb01rst base-\nline is a root HOG template [4] trained over the appearance within a bounding box covering the\nentire object. A single model using the root HOG template is trained for all viewpoints and as-\npect ratios. During detection, output corner locations corresponding to the average training corner\nlocations relative to the bounding boxes are returned. The second baseline is the 2D tree-based\napproximation of Equation (2), which corresponds to existing 2D parts models used in object detec-\ntion and articulated pose estimation [7, 27]. Figure 7 shows a qualitative comparison of our model\nagainst the 2D tree-based model. Notice that our model localizes well and often provides a tighter\n\ufb01t to the image data than the baseline model.\nWe evaluate geometric primitive detection accuracy using the bounding box overlap criteria in the\nPascal VOC [6]. 
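The two evaluation criteria used in this section, Pascal VOC bounding-box overlap for detection and a distance threshold t (15% of the square root of the ground-truth box area) for corner localization, can be sketched as follows; box coordinates are assumed to be (xmin, ymin, xmax, ymax), which is our convention here rather than anything fixed by the paper.

```python
import math

def iou(a, b):
    # Pascal VOC overlap criterion [6]: intersection-over-union of two boxes;
    # a detection is conventionally counted correct when IoU >= 0.5.
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def corner_correct(pred, gt, gt_box, frac=0.15):
    # A corner is correct if within t pixels of ground truth, with
    # t = frac * sqrt(area of the ground-truth bounding box).
    t = frac * math.sqrt((gt_box[2] - gt_box[0]) * (gt_box[3] - gt_box[1]))
    return math.hypot(pred[0] - gt[0], pred[1] - gt[1]) <= t
```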
We report precision-recall curves in Figure 8(a). We have observed that all of the corner-based models achieve almost identical detection accuracy across all recall levels, and out-perform\nthe root HOG template detector [4]. This is expected as we initialize our full model with the output\nof the 2D tree-based model and it generally does not drift too far from this initialization. This in\neffect does not allow us to detect additional cuboids but allows for better part localization.\nIn addition to detection accuracy, we also measure corner localization accuracy for correctly detected\nexamples for a given model. A corner is deemed correct if its predicted image location is within t\npixels of the ground truth corner location. We set t to be 15% of the square root of the area of the\nground truth bounding box for the object. The reported trends in the corner localization performance\nhold for nearby values of t. In Figure 8 we plot corner localization accuracy as a function of recall\nand compare our model against the two baselines. Moreover, we report performance when either the\nedge term or the 3D shape term is omitted from our model. Notice that our full model out-performs\nthe other baselines. Also, the additional edge and 3D shape terms provide a gain in performance\nover using the appearance and 2D spatial terms alone. The edge term provides a slightly larger gain\nin performance than the 3D shape term, but when integrated together they consistently provide the best\nperformance on our database.\n\n(a) Cuboid detection\n\n(b) Corner localization\n\n(c) Object distribution\n\nFigure 8: Cuboid detection (precision vs. recall) and corner localization accuracy (accuracy vs.\nrecall). The area under the curve is reported in the plot legends. Notice that all of the corner-based\nmodels achieve almost identical detection accuracy across all recall levels and out-perform the root\nHOG template detector [4]. 
For the task of corner localization, our full model out-performs the\ntwo baseline detectors, as well as variants of our model with either the Edge or Shape3D term omitted. (c)\nDistribution of objects from the SUN database [25] that overlap with our cuboids. There are 326\nobjects total from 114 unique classes. The first number within the parentheses indicates the number\nof instances in each object category that overlaps with a labeled cuboid, while the second number is\nthe total number of labeled instances for the object category within our dataset.\n\nFigure 9: Detected cuboids and subsequent synthesized new views via camera resectioning.\n\n5 Conclusion\n\nWe have introduced a novel model that detects 3D cuboids and localizes their corners in single-view\nimages. Our 3D cuboid detector makes use of both corner and edge information. Moreover, we\nhave constructed a dataset with ground truth cuboid annotations. Our detector handles different 3D\nviewpoints and aspect ratios and, in contrast to recent approaches for 3D cuboid detection, does\nnot make any assumptions about the scene geometry and allows for deformation of the 3D cuboid\nshape. As HOG is not invariant to viewpoint, we believe that part mixtures would allow the model\nto become invariant to viewpoint. We believe our approach extends to other shapes, such as cylinders\nand pyramids. Our work raises a number of (long-standing) issues that would be interesting to\naddress. For instance, which objects can be described by one or more geometric primitives, and how\ncan we best represent the compositionality of objects in general? By detecting geometric primitives, what\napplications and systems can be developed to exploit this? Our dataset and source code are publicly\navailable at the project webpage: http://SUNprimitive.csail.mit.edu.\nAcknowledgments: Jianxiong Xiao is supported by a Google U.S./Canada Ph.D. Fellowship in Computer Vision. 
Bryan Russell was funded by the Intel Science and Technology Center for Pervasive Computing (ISTC-PC). This work is funded by ONR MURI N000141010933 and NSF Career\nAward No. 0747120 to Antonio Torralba.\n\n[Figure 8 plot legends \u2014 cuboid detection AUC: Root Filter 0.16, 2D Tree Approximation 0.23, Full Model\u2212Edge 0.26, Full Model\u2212Shape 0.24, Full Model 0.24. Corner localization AUC (criteria = 0.150): Root Filter 0.25, 2D Tree Approximation 0.30, Full Model\u2212Edge 0.37, Full Model\u2212Shape 0.37, Full Model 0.38. Object distribution: stove (5/13), refrigerator (5/8), night table occluded (5/12), kitchen island (5/6), cabinets (5/22), brick (5/5), screen (5/16), stand (7/11), CPU (7/8), table (8/26), desk (8/22), box (9/18), chest of drawers (10/10), bed (15/22), night table (15/29), building (16/49), cabinet (28/87), others: 97 categories (168/883).]\n\nReferences\n[1] I. Biederman. Recognition by components: a theory of human image interpretation. Psychological Review, 94:115\u2013147, 1987.\n[2] J. E. Bresenham. Algorithm for computer control of a digital plotter. IBM Systems Journal, 4(1):25\u201330, 1965.\n[3] J. F. Canny. A computational approach to edge detection. IEEE PAMI, 8(6):679\u2013698, 1986.\n[4] N. Dalal and B. Triggs. Histograms of Oriented Gradients for Human Detection. In CVPR, 2005.\n[5] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.\n[6] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The Pascal visual object classes (VOC) challenge. IJCV, 88(2):303\u2013338, 2010.\n[7] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. IEEE PAMI, 32(9), 2010.\n[8] P. Felzenszwalb and D. Huttenlocher. Pictorial structures for object recognition. IJCV, 61(1), 2005.\n[9] A. Gupta, S. Satkin, A. A. Efros, and M. Hebert. 
From 3D scene geometry to human workspace. In CVPR, 2011.\n[10] R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, ISBN: 0521540518, second edition, 2004.\n[11] V. Hedau, D. Hoiem, and D. Forsyth. Thinking inside the box: Using appearance models and context based on room geometry. In ECCV, 2010.\n[12] V. Hedau, D. Hoiem, and D. Forsyth. Recovering free space of indoor scenes from a single image. In CVPR, 2012.\n[13] D. Hoiem, A. Efros, and M. Hebert. Geometric context from a single image. In ICCV, 2005.\n[14] http://sketchup.google.com, 2012.\n[15] K. Ikeuchi and T. Suehiro. Toward an assembly plan from observation: Task recognition with polyhedral objects. In Robotics and Automation, 1994.\n[16] T. Joachims, T. Finley, and C.-N. J. Yu. Cutting-plane training of structural SVMs. Machine Learning, 77(1), 2009.\n[17] D. C. Lee, A. Gupta, M. Hebert, and T. Kanade. Estimating spatial layout of rooms using volumetric reasoning about objects and surfaces. In NIPS, 2010.\n[18] J. L. Mundy. Object recognition in the geometric era: A retrospective. In Toward Category-Level Object Recognition, volume 4170 of Lecture Notes in Computer Science, pages 3\u201329. Springer, 2006.\n[19] L. D. Pero, J. C. Bowdish, D. Fried, B. D. Kermgard, E. L. Hartley, and K. Barnard. Bayesian geometric modelling of indoor scenes. In CVPR, 2012.\n[20] L. Roberts. Machine perception of 3-D solids. PhD thesis, 1965.\n[21] H. Wang, S. Gould, and D. Koller. Discriminative learning with latent variables for cluttered indoor scene understanding. In ECCV, 2010.\n[22] J. Xiao, T. Fang, P. Tan, P. Zhao, E. Ofek, and L. Quan. Image-based fa\u00e7ade modeling. In SIGGRAPH Asia, 2008.\n[23] J. Xiao, T. Fang, P. Zhao, M. Lhuillier, and L. Quan. Image-based street-side city modeling. In SIGGRAPH Asia, 2009.\n[24] J. Xiao and Y. Furukawa. Reconstructing the world\u2019s museums. 
In ECCV, 2012.\n[25] J. Xiao, J. Hays, K. Ehinger, A. Oliva, and A. Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In CVPR, 2010.\n[26] J. Xiao, B. C. Russell, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba. Basic level scene understanding: From labels to structure and beyond. In SIGGRAPH Asia, 2012.\n[27] Y. Yang and D. Ramanan. Articulated pose estimation using flexible mixtures of parts. In CVPR, 2011.\n[28] S. Yu, H. Zhang, and J. Malik. Inferring spatial layout from a single image via depth-ordered grouping. In IEEE Workshop on Perceptual Organization in Computer Vision, 2008.\n[29] Y. Zhao and S.-C. Zhu. Image parsing with stochastic scene grammar. In NIPS, 2011.\n", "award": [], "sourceid": 342, "authors": [{"given_name": "Jianxiong", "family_name": "Xiao", "institution": null}, {"given_name": "Bryan", "family_name": "Russell", "institution": null}, {"given_name": "Antonio", "family_name": "Torralba", "institution": null}]}