{"title": "Joint 3D Estimation of Objects and Scene Layout", "book": "Advances in Neural Information Processing Systems", "page_first": 1467, "page_last": 1475, "abstract": "We propose a novel generative model that is able to reason jointly about the 3D scene layout as well as the 3D location and orientation of objects in the scene. In particular, we infer the scene topology, geometry as well as traffic activities from a short video sequence acquired with a single camera mounted on a moving car. Our generative model takes advantage of dynamic information in the form of vehicle tracklets as well as static information coming from semantic labels and geometry (i.e., vanishing points). Experiments show that our approach outperforms a discriminative baseline based on multiple kernel learning (MKL) which has access to the same image information. Furthermore, as we reason about objects in 3D, we are able to significantly increase the performance of state-of-the-art object detectors in their ability to estimate object orientation.", "full_text": "Joint 3D Estimation of Objects and Scene Layout\n\nAndreas Geiger\nKarlsruhe Institute of Technology\ngeiger@kit.edu\n\nChristian Wojek\nMPI Saarbr\u00fccken\ncwojek@mpi-inf.mpg.de\n\nRaquel Urtasun\nTTI Chicago\nrurtasun@ttic.edu\n\nAbstract\n\nWe propose a novel generative model that is able to reason jointly about the 3D\nscene layout as well as the 3D location and orientation of objects in the scene.\nIn particular, we infer the scene topology, geometry as well as traffic activities\nfrom a short video sequence acquired with a single camera mounted on a moving\ncar. Our generative model takes advantage of dynamic information in the form of\nvehicle tracklets as well as static information coming from semantic labels and\ngeometry (i.e., vanishing points). 
Experiments show that our approach outperforms\na discriminative baseline based on multiple kernel learning (MKL) which has access\nto the same image information. Furthermore, as we reason about objects in\n3D, we are able to significantly increase the performance of state-of-the-art object\ndetectors in their ability to estimate object orientation.\n\n1 Introduction\n\nVisual 3D scene understanding is an important component in applications such as autonomous driving\nand robot navigation. Existing approaches produce either only qualitative results [11] or a mild\nlevel of understanding, e.g., semantic labels [10, 26], object detection [5] or rough 3D [15, 24].\nNotable exceptions are approaches that try to infer the scene layout of indoor scenes in the form of\n3D bounding boxes [13, 22]. However, these approaches can only cope with limited amounts of\nclutter (e.g., beds), and rely on the fact that indoor scenes satisfy very closely the Manhattan world\nassumption, i.e., walls (and often objects) are aligned with the three dominant vanishing points. In\ncontrast, outdoor scenarios often show more clutter, vanishing points are not necessarily orthogonal\n[25, 2], and objects often do not agree with the dominant vanishing points.\nPrior work on 3D urban scene analysis is mostly limited to simple ground plane estimation [4, 29]\nor models for which the objects and the scene are inferred separately [6, 7]. In contrast, in this paper\nwe propose a novel generative model that is able to reason jointly about the 3D scene layout as well\nas the 3D location and orientation of objects in the scene. In particular, given a video sequence\nof short duration acquired with a single camera mounted on a moving car, we estimate the scene\ntopology and geometry, as well as the traffic activities and 3D objects present in the scene (see Fig.\n1 for an illustration). 
Towards this goal we propose a novel image likelihood which takes advantage\nof dynamic information in the form of vehicle tracklets as well as static information coming from\nsemantic labels and geometry (i.e., vanishing points). Interestingly, our inference reasons about\nwhether vehicles are on the road, or parked, in order to get more accurate estimations. Furthermore,\nwe propose a novel learning-based approach to detecting vanishing points and experimentally show\nimproved performance in the presence of clutter when compared to existing approaches [19].\nWe focus our evaluation mainly on estimating the layout of intersections, as this is the most challenging\ninference task in urban scenes. Our approach proves superior to a discriminative baseline\nbased on multiple kernel learning (MKL) which has access to the same image information (i.e., 3D\ntracklets, segmentation and vanishing points). We evaluate our method on a wide range of metrics\nincluding the accuracy of estimating the topology and geometry of the scene, as well as detecting\nactivities (i.e., traffic situations). Furthermore, we show that we are able to significantly increase the\nperformance of state-of-the-art object detectors [5] in terms of estimating object orientation.\n\nFigure 1: Monocular 3D Urban Scene Understanding. (Left) Image cues. (Right) Estimated layout: Detections\nbelonging to a tracklet are depicted with the same color, traffic activities are depicted with red lines.\n\n2 Related Work\n\nWhile outdoor scenarios remain fairly unexplored, estimating the 3D layout of indoor scenes has\nexperienced increased popularity in the past few years [13, 27, 22]. 
This can be mainly attributed\nto the success of novel structured prediction methods as well as the fact that indoor scenes behave\nmostly as \u201cManhattan worlds\u201d, i.e., edges in the image can be associated with parallel lines defined\nin terms of the three dominant vanishing points which are orthonormal. With a moderate degree of\nclutter, accurate geometry estimation has been shown for this scenario.\nUnfortunately, most urban scenes violate the Manhattan world assumption. Several approaches\nhave focused on estimating vanishing points in this more adversarial setting [25]. Barinova et al. [2]\nproposed to jointly perform line detection as well as vanishing point, azimuth and zenith estimation.\nHowever, their approach does not tackle the problem of 3D scene understanding and 3D object\ndetection. In contrast, we propose a generative model which jointly reasons about these two tasks.\nExisting approaches to estimate 3D from single images in outdoor scenarios typically infer pop-ups\n[14, 24]. Geometric approaches, reminiscent of the blocks world model, which impose physical\nconstraints between objects (e.g., object A supports object B) have also been introduced [11].\nUnfortunately, all these approaches are mainly qualitative and do not provide the level of accuracy\nnecessary for real-world applications such as autonomous driving and robot navigation. Prior work\non 3D traffic scene analysis is mostly limited to simple ground plane estimation [4], or models for\nwhich the objects and scene are inferred separately [6]. In contrast, our model offers a much richer\nscene description and reasons jointly about 3D objects and the scene layout.\nSeveral methods have tried to infer the 3D locations of objects in outdoor scenarios [15, 1]. The most\nsuccessful approaches use tracklets to prune spurious detections by linking consistent evidence in\nsuccessive frames [18, 16]. 
However, these models are either designed for static camera setups in\nsurveillance applications [16] or do not provide a rich scene description [18]. Notable exceptions\nare [3, 29] which jointly infer the camera pose and the location of objects. However, the employed\nscene models are rather simplistic containing only a single flat ground plane.\nThe closest approach to ours is probably the work of Geiger et al. [7], where a generative model is\nproposed in order to estimate the scene topology, geometry as well as traffic activities at intersections.\nOur work differs from theirs in two important aspects. First, they rely on stereo sequences\nwhile we make use of monocular imagery. This makes the inference problem much harder, as the\nnoise in monocular imagery is strongly correlated with depth. To compensate, we develop a richer\nimage likelihood model that takes advantage of vehicle tracklets, vanishing points as well as segmentations\nof the scene into semantic labels. The second and most important difference is that\nGeiger et al. [7] estimate only the scene layout, while we reason jointly about the layout as well as\nthe 3D location and orientation of objects in the scene (i.e., vehicles).\n\nFigure 2: (a) Model geometry (\u03b8 = 4). (b) Model topology \u03b8. In (b), the grey shaded areas illustrate the range of \u03b1.\n\nFinally, non-parametric models have been proposed to perform traffic scene analysis from a stationary\ncamera with a view similar to bird\u2019s eye perspective [20, 28]. In our work we aim to infer similar\nactivities but use video sequences from a camera mounted on a moving car with a substantially lower\nviewpoint. This makes the recognition task much more challenging. 
Furthermore, those models do\nnot allow for viewpoint changes, while our model reasons about over 100 unseen scenes.\n\n3 3D Urban Scene Understanding\n\nWe tackle the problem of estimating the 3D layout of urban scenes (i.e., road intersections) from\nmonocular video sequences. In this paper 2D refers to observations in the image plane while 3D\nrefers to the bird\u2019s eye perspective (in our scenario the height above ground is non-informative). We\nassume that the road surface is flat, and model the bird\u2019s eye perspective as the y = 0 plane of the\nstandard camera coordinate system. The reference coordinate system is given by the position of the\ncamera in the last frame of the sequence. The intrinsic parameters of the camera are obtained using\ncamera calibration and the extrinsics using a standard Structure-from-Motion (SfM) pipeline [12].\nWe take advantage of dynamic and static information in the form of 3D vehicle tracklets, semantic\nlabels (i.e., sky, background, road) and vanishing points. In order to compute 3D tracklets,\nwe first detect vehicles in each frame independently using a semi-supervised version of the part-based\ndetector of [5] in order to obtain orientation estimates. 2D tracklets are then estimated using\n\u2018tracking-by-detection\u2019: first, adjacent frames are linked and then short tracklets are associated to\ncreate longer ones via the Hungarian method. Finally, 3D vehicle tracklets are obtained by projecting\nthe 2D tracklets into bird\u2019s eye perspective, employing error-propagation to obtain covariance\nestimates. This is illustrated in Fig. 1 where detections belonging to the same tracklet are grouped\nby color. The observer (i.e., our car) is shown in black. See Sec. 3.2 for more details on this process.\nSince depth estimates in the monocular case are much noisier than in the stereo case, we employ\na more constrained model than the one utilized in [7]. In particular, as depicted in Fig. 
2, we\nmodel all intersection arms with the same width and force alternate arms to be collinear. We model\nlanes with splines (see red lines for active lanes in Fig. 1), and place parking spots\nat equidistant places along the street boundaries (see Fig. 3(b)). Our model then infers whether\nthe cars participate in traffic or are parked in order to get more accurate layout estimations. Latent\nvariables are employed to associate each detected vehicle with positions in one of these lanes or\nparking spaces. In the following, we first give an overview of our probabilistic model and then\ndescribe each part in detail.\n\n3.1 Probabilistic Model\n\nAs illustrated in Fig. 2(b), we consider a fixed set of road layouts \u03b8, including straight roads, turns,\n3- and 4-armed intersections. Each of these layouts is associated with a set of geometric random\nvariables: the intersection center c, the street width w, the global scene rotation r and the angle of\nthe crossing street \u03b1 with respect to r (see Fig. 2(a)). Note that for \u03b8 = 1, \u03b1 does not exist.\n\nFigure 3: (a) Graphical model. (b) Road model with lanes represented as B-splines.\n\nJoint Distribution: Our goal is to estimate the most likely configuration R = (\u03b8, c, w, r, \u03b1) given\nthe image evidence E = {T, V, S}, which comprises vehicle tracklets T = {t1, .., tN}, vanishing\npoints V = {vf, vc} and semantic labels S. We assume that, given R, all observations are\nindependent. Fig. 
3(a) depicts our graphical model which factorizes the joint distribution as\n\np(E, R \\mid C) = p(R) \\underbrace{\\left[ \\prod_{n=1}^{N} \\sum_{l_n} p(t_n, l_n \\mid R, C) \\right]}_{\\text{Vehicle Tracklets}} \\underbrace{p(v_f \\mid R, C) \\, p(v_c \\mid R, C)}_{\\text{Vanishing Points}} \\underbrace{p(S \\mid R, C)}_{\\text{Semantic Labels}} \\qquad (1)\n\nwhere C are the (known) extrinsic and intrinsic camera parameters for all the frames in the video\nsequence, N is the total number of tracklets and {ln} denotes latent variables representing the lane\nor parking positions associated with every vehicle tracklet. See Fig. 3(b) for an illustration.\n\nPrior: Let us first define a scene prior, which factorizes as\n\np(R) = p(\u03b8) p(c, w) p(r) p(\u03b1)    (2)\n\nwhere c and w are modeled jointly to capture their correlation. We model w using a log-Normal\ndistribution since it takes only positive values. Further, since the distribution of \u03b1 is highly multimodal,\nwe model p(\u03b1) in a non-parametric fashion using kernel density estimation (KDE), and define:\n\nr \u223c N(\u00b5r, \u03c3r),    (c, log w)^T \u223c N(\u00b5cw, \u03a3cw),    \u03b8 \u223c \u03b4(\u03b8MAP)\n\nIn order to avoid the requirement for trans-dimensional inference procedures, the topology \u03b8MAP\nis estimated a priori using joint boosting, and held fixed during inference. To estimate \u03b8MAP, we use the\nsame feature set employed by the MKL baseline (see Sec. 4 for details).\n\n3.2 Image Likelihood\n\nThis section details our image likelihood for tracklets, vanishing points and semantic labels.\n\nVehicle Tracklets: In the following, we drop the tracklet index n to simplify notation. Let us\ndefine a 3D tracklet as a set of object detections t = {d1, .., dM}. 
Here, each object detection\ndm = (fm, bm, om) contains the frame index fm \u2208 N, the object bounding box bm \u2208 R^4 defined\nas 2D position and size, as well as a normalized orientation histogram om \u2208 R^8 with 8 bins. We\ncompute the bounding box bm and orientation om by supervised training of a part-based object\ndetector [5], where each component contains examples from a single orientation. Following [5], we\napply the softmax function on the output scores and associate frames using the Hungarian algorithm\nin order to obtain tracklets.\nAs illustrated in Fig. 3(b), we represent drivable locations with splines, which connect incoming\nand outgoing lanes of the intersection. We also allow cars to be parked on the side of the road, see\nFig. 3(b) for an illustration. Thus, for a K-armed intersection, the index l takes values in {1, .., K(K \u2212 1) + 2K},\nwhere K(K \u2212 1) is the number of lanes and 2K is the number of parking areas. We use the\nlatent variable l to index the lane or parking position associated with a tracklet. The joint probability\nof a tracklet t and its lane index l is given by p(t, l|R,C) = p(t|l,R,C)p(l). We assume a uniform\nprior over lanes and parking positions l \u223c U(1, K(K \u2212 1) + 2K), and denote the posterior by pl\nwhen l corresponds to a lane, and pp when it is a parking position.\n\nFigure 4: Scene Labels: Scene labels obtained from joint boosting (left) and from our model (right).\n\nIn order to evaluate the tracklet posterior for lanes pl(t|l,R,C), we need to associate all object\ndetections t = {d1, .., dM} to locations on the spline. We do this by augmenting the observation\nmodel with an additional latent variable s per object detection d as illustrated in Fig. 3(b). 
The\nposterior is modeled using a left-to-right Hidden Markov Model (HMM), defined as:\n\np_l(t \\mid l, R, C) = \\sum_{s_1, \\ldots, s_M} p_l(s_1) \\, p_l(d_1 \\mid s_1, l, R, C) \\prod_{m=2}^{M} p_l(s_m \\mid s_{m-1}) \\, p_l(d_m \\mid s_m, l, R, C) \\qquad (3)\n\nWe constrain all tracklets to move forward in 3D by defining the transition probability pl(sm|sm\u22121)\nas uniform on sm \u2265 sm\u22121 and 0 otherwise. Further, uniform initial probabilities pl(s1) are employed,\nsince no location information is available a priori. We assume that the emission likelihood\npl(dm|sm, l,R,C) factorizes into the object location and its orientation. We impose a multinomial\ndistribution over the orientation pl(fm, om|sm, l,R,C), where each object orientation votes for its\nbin as well as neighboring bins, accounting for the uncertainty of the object detector. The 3D object\nlocation is modeled as a Gaussian with uniform outlier probability cl\n\npl(fm, bm|sm, l,R,C) \u221d cl + N(\u03c0m|\u00b5m, \u03a3m)    (4)\n\nwhere \u03c0m = \u03c0m(fm, bm,C) \u2208 R^2 denotes the object detection mapped into bird\u2019s eye perspective,\n\u00b5m = \u00b5m(sm, l,R) \u2208 R^2 is the coordinate of the spline point sm on lane l and\n\u03a3m = \u03a3m(fm, bm,C) \u2208 R^{2\u00d72} is the covariance of the object location in bird\u2019s eye coordinates.\nWe now describe how we transform the 2D tracklets into 3D tracklets {\u03c01, \u03a31, .., \u03c0M, \u03a3M}, which\nwe use in pl(dm|sm, l,R,C): We project the image coordinates into bird\u2019s eye perspective by back-projecting\nobjects into 3D using several complementary cues. To this end, we use the 2D\nbounding box foot-point in combination with the estimated road plane. Assuming typical vehicle\ndimensions obtained from annotated ground truth, we also exploit the width and height of the bounding\nbox. Covariances in bird\u2019s eye perspective are obtained by error-propagation. 
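As a reading aid, the lane posterior of Eq. 3 can be evaluated with the standard forward recursion for HMMs. The following is a minimal Python sketch under stated assumptions, not the implementation used in the paper: only the location term of Eq. 4 is modeled (the orientation term is omitted), the spline is discretized into a list of 2D points, and the names gaussian2d and lane_likelihood as well as the default outlier constant are hypothetical.

```python
import math


def gaussian2d(x, mean, cov):
    # Density of a 2D normal N(x | mean, cov), with cov = [[a, b], [c, d]].
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # Quadratic form dx^T cov^-1 dx using the closed-form 2x2 inverse.
    q = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * q) / (2.0 * math.pi * math.sqrt(det))


def lane_likelihood(detections, spline_points, outlier=1e-15):
    # detections: list of (pi_m, Sigma_m) pairs in bird's eye coordinates.
    # spline_points: 2D points mu(s), ordered along the driving direction.
    # Returns a value proportional to Eq. 3 (location term only, assumption).
    n_states = len(spline_points)

    def emission(det, s):
        # Eq. 4 up to normalization: outlier constant plus Gaussian.
        pi, sigma = det
        return outlier + gaussian2d(pi, spline_points[s], sigma)

    # Uniform initial distribution over spline positions.
    alpha = [emission(detections[0], s) / n_states for s in range(n_states)]
    for det in detections[1:]:
        new_alpha = []
        running = 0.0
        for s in range(n_states):
            # Forward-only transition: from s_prev, each of the
            # (n_states - s_prev) states at or ahead of s_prev is equally likely.
            running += alpha[s] / (n_states - s)
            new_alpha.append(running * emission(det, s))
        alpha = new_alpha
    # Marginalize over the final spline position.
    return sum(alpha)
```

Because the transition is uniform over all positions at or ahead of the previous one, a running prefix sum suffices, so each step costs time linear in the number of spline points instead of quadratic.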
In order to reduce\nnoise in the observations we employ a Kalman smoother with constant 3D velocity model.\nOur parking posterior model is similar to the lane posterior described above, except that we do not\nallow parked vehicles to move; we assume them to have arbitrary orientations and place them at the\nsides of the road. Hence, we have\n\np_p(t \\mid l, R, C) = \\sum_{s} \\prod_{m=1}^{M} p_p(d_m \\mid s, l, R, C) \\, p(s) \\qquad (5)\n\nwith s the index for the parking spot location within a parking area and\n\npp(dm|s, l,R,C) = pp(fm, bm|s, l,R,C) \u221d cp + N(\u03c0m|\u00b5m, \u03a3m)    (6)\n\nHere, cp, \u03c0m and \u03a3m are defined as above, while \u00b5m = \u00b5m(s, l,R) \u2208 R^2 is the coordinate of\nthe parking spot location in bird\u2019s eye perspective (see Fig. 3(b) for an illustration). For inference,\nwe subsample each tracklet trajectory equidistantly in intervals of 5 meters in order to reduce the\nnumber of detections within a tracklet and keep the total evaluation time of p(E,R|C) low.\n\nVanishing Points: We detect two types of dominant vanishing points (VP) in the last frame of\neach sequence: vf corresponding to the forward facing street and vc corresponding to the crossing\nstreet. While vf is usually in the image, the u-coordinate of the crossing VP is often close to infinity\n(see Fig. 1). As a consequence, we represent vf \u2208 R by its image u-coordinate and vc \u2208 [\u2212\u03c0/4, \u03c0/4]\nby the angle of the crossing road, back projected into the image.\nFollowing [19], we employ a line detector to reason about dominant VPs in the scene. We relax\nthe original model of [19] to allow for non-orthogonal VPs, as intersection arms are often non-orthogonal.\nUnfortunately, traditional VP detectors tend to fail in the presence of clutter, which\nour images exhibit to a large extent, for example generated by shadows. To tackle this problem we\nreweight line segments according to their likelihood of carrying structural information. 
To this end,\nwe learn a k-nn classifier on an annotated training database where lines are labeled as either structure\nor clutter. Here, structure refers to line segments that are aligned with the major orientations of the\nroad, as well as facade edges of buildings belonging to dominant VPs. Our feature set comprises\ngeometric information in the form of position, length, orientation and number of lines with the\nsame orientation as well as perpendicular orientation in a local window. The local appearance is\nrepresented by the mean, standard deviation and entropy of all pixels on both sides of the line.\nFinally, we add texton-like features using a Gabor filter bank, as well as 3 principal components of\nthe scene GIST [23]. The structure k-nn classifier\u2019s confidence is used in the VP voting process to\nreweight the lines. The benefit of our learning-based approach is illustrated in Fig. 5.\n\nFigure 5: Detecting Structured Lines and Object Orientation Errors: Our approach outperforms [19] in\nthe task of VP estimation, and [5] in estimating the orientation of objects.\n(a) Detecting Structured Lines\n(b) Object Orientation Error\n\nMethod                                Error\nFelzenszwalb et al. [5] (raw)         32.6\u00b0\nFelzenszwalb et al. [5] (smoothed)    31.2\u00b0\nOur method (\u03b8 unknown)                15.7\u00b0\nOur method (\u03b8 known)                  13.7\u00b0\n\nTo avoid estimates from spurious outliers we threshold the dominant VPs and only retain the most\nconfident ones. We assume that vf and vc are independent given the road parameters. Let \u00b5f =\n\u00b5f(R,C) be the image u-coordinate (in pixels) of the forward facing street\u2019s VP and let \u00b5c =\n\u00b5c(R,C) be the orientation (in radians) of the crossing street in the image. 
We define\n\np(vf|R,C) \u221d cf + \u03b4f N(vf|\u00b5f, \u03c3f)\n\np(vc|R,C) \u221d cc + \u03b4c N(vc|\u00b5c, \u03c3c)\n\nwhere {cf, cc} are small constants capturing outliers, {\u03b4f, \u03b4c} take value 1 if the corresponding VP\nhas been detected in the image and 0 otherwise, and {\u03c3f, \u03c3c} are parameters of the VP model.\n\nSemantic Labels: We segment the last frame of the sequence pixelwise into 3 semantic classes,\ni.e., road, sky and background. For each patch, we infer a score for each of the 3 labels using the\nboosting algorithm of [30] with a combination of Walsh-Hadamard filters [30], as well as multi-scale\nfeatures developed for detecting man-made structures [21] on patches of size 16\u00d716, 32\u00d732 and\n64\u00d764. We include the latter ones as they help in discriminating buildings from road. For training,\nwe use a set of 200 hand-labeled images which are not part of the test data.\n\nGiven the softmax normalized label scores S^{(i)}_{u,v} \u2208 R of each class i for the patch located at position\n(u, v) in the image, we define the likelihood of a scene labeling S = {S^{(1)}, S^{(2)}, S^{(3)}} as\n\np(S \\mid R, C) \\propto \\exp\\left( \\gamma \\sum_{i=1}^{3} \\sum_{(u,v) \\in S_i} S^{(i)}_{u,v} \\right) \\qquad (7)\n\nwhere \u03b3 is a model parameter and S_i is the set of all pixels of class i obtained from the reprojection\nof the geometric model into the image. Note that the road boundaries directly define the lower end\nof a facade while we assume a typical building height of 4 stories, leading to the upper end. Facades\nadjacent to the observer\u2019s own street are not considered. Fig. 4 illustrates an example of the scene\nlabeling returned by boosting (left) as well as the labeling generated from the reprojection of our\nmodel (right). Note that a large overlap corresponds to a large likelihood in Eq. 
7.\n\n3.3 Learning and Inference\n\nOur goal is to estimate the posterior of R, given the image evidence E and the camera calibration C:\n\np(R|E,C) \u221d p(E|R,C)p(R)    (8)\n\nLearning the prior: We estimate the parameters of the prior p(R) using maximum likelihood\nleave-one-out cross-validation on the scene database of [7]. This is straightforward as the prior in\nEq. 2 factorizes. We employ KDE with \u03c3 = 0.02 to model p(\u03b1), as it works well in practice.\n\n[Figure 5(a) plot: ROC curves (true positive rate vs. false positive rate) for the learning-based approach and Kosecka et al.]\n\n(Inference with known \u03b8)\n          Location  Orientation  Overlap  Activity\nBaseline  6.0 m     9.6 deg      44.9 %   18.4 %\nOurs      5.8 m     5.9 deg      53.0 %   11.5 %\n\n(Inference with unknown \u03b8)\n          \u03b8       Location  Orientation  Overlap  Activity\nBaseline  27.4 %  6.2 m     21.7 deg     39.3 %   28.1 %\nOurs      70.8 %  6.6 m     7.2 deg      48.1 %   16.6 %\n\nFigure 6: Inference of topology and geometry.\n\n        k       Location  Orientation  Overlap  Activity\nStereo  92.9 %  4.4 m     6.6 deg      62.7 %   8.0 %\nOurs    71.7 %  6.6 m     7.2 deg      48.1 %   16.6 %\n\nFigure 7: Comparison with stereo when k and \u03b8 are unknown.\n\nLearning the 3D tracklet parameters: Eq. 4 requires a function \u03d5 : (f, b, C) \u2192 (\u03c0, \u03a3) which takes\na frame index f \u2208 N, an object bounding box b \u2208 R^4 and the calibration parameters C as input\nand maps them to the object location \u03c0 \u2208 R^2 and uncertainty \u03a3 \u2208 R^{2\u00d72} in bird\u2019s eye perspective.\nAs cues for this mapping we use the bounding box width and height, as well as the location of the\nbounding box foot-point. Scene depth adaptive error propagation is employed for obtaining \u03a3. The\nunknown parameters of the mapping are the uncertainty in bounding box location \u03c3u, \u03c3v, width \u03c3\u2206u\nand height \u03c3\u2206v as well as the real-world object dimensions \u2206x, \u2206y along with their uncertainties\n\u03c3\u2206x, \u03c3\u2206y. 
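To illustrate why the uncertainty has to adapt to scene depth, the following Python sketch propagates errors through a single cue, the bounding box height, using a pinhole camera. This is a simplified illustration under assumptions, not the mapping learned in the paper: the foot-point and width cues are ignored, and the function name depth_from_box_height and the focal length parameter are hypothetical.

```python
import math


def depth_from_box_height(box_height_px, focal_px, object_height_m,
                          sigma_box_px, sigma_object_m):
    # Pinhole relation (assumption): box_height_px = focal_px * object_height_m / z.
    z = focal_px * object_height_m / box_height_px
    # First-order error propagation: square the partial derivatives of z
    # with respect to each noisy input and weight them by the input variances.
    dz_dobject = focal_px / box_height_px
    dz_dbox = -focal_px * object_height_m / box_height_px ** 2
    var_z = (dz_dobject * sigma_object_m) ** 2 + (dz_dbox * sigma_box_px) ** 2
    return z, math.sqrt(var_z)
```

With fixed pixel noise, halving the box height doubles the estimated depth and more than doubles its standard deviation, which is the kind of depth-dependent covariance that enters Eq. 4.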
We learn these parameters using a separate training dataset, including 1020 images with\n3634 manually labeled vehicles and depth information [8].\n\nInference: Since the posterior in Eq. 8 cannot be computed in closed form, we approximate it\nusing Metropolis-Hastings sampling [9]. We exploit a combination of local and global moves to\nobtain a well-mixing Markov chain. While local moves modify R slightly, global moves sample R\ndirectly from the prior. This ensures that the search space is traversed quickly, while local\nmodes are still explored. To avoid trans-dimensional jumps, the road layout \u03b8 is estimated separately beforehand using\nthe MAP estimate \u03b8MAP provided by joint boosting [30]. We pick each of the remaining elements of\nR at random and select local and global moves with equal probability.\n\n4 Experimental Evaluation\n\nIn this section, we first show that learning which line features convey structural information improves\ndominant vanishing point detection. Next, we compare our method to a multiple kernel learning\n(MKL) baseline in estimating scene topology, geometry and traffic activities on the dataset of [7], but\nonly employing information from a single camera. Finally, we show that our model can significantly\nimprove object orientation estimates compared to state-of-the-art part-based models [5]. For all\nexperiments, we set cl = cp = 10^{\u221215}, \u03c3f = 0.1, cf = 10^{\u221210}, \u03c3c = 0.01, cc = 10^{\u221230} and \u03b3 = 0.1.\n\nVanishing Point Estimation: We use a database of 185 manually annotated images to learn a\npredictor of which line segments are structured. This is important since cast shadows often mislead\nthe VP estimation process. Fig. 5(a) shows the ROC curves for the method of [19] relaxed to non-orthogonal\nVPs (blue) as well as our learning-based approach (red). 
While the baseline gets easily\ndisturbed by clutter, our method is more accurate and has significantly fewer false positives.\n\n3D Urban Scene Inference: We evaluate our method\u2019s ability to infer the scene layout by building\na competitive baseline based on multi-kernel Gaussian process regression [17]. We employ a total of\n4 kernels built on GIST [23], tracklet histograms, VPs as well as scene labels. Note that these are the\nsame features employed by our model to estimate the scene topology, \u03b8MAP. For the tracklets, we\ndiscretize the 50\u00d750 m area in front of the vehicle into bins of size 5\u00d75 m. Each bin consists of four\nbinary elements, indicating whether forward, backward, left or right motion has been observed at\nthat location. The VPs are included with their value as well as an indicator variable denoting whether\nthe VP has been found or not. For each semantic class, we compute histograms at 3 scales, which\ndivide the image into 3\u00d71, 6\u00d72 and 12\u00d74 bins, and concatenate them. Following [7] we measure\nerror in terms of the location of the intersection center in meters, the orientation of the intersection\narms in degrees, the overlap of road area with ground truth as well as the percentage of correctly\ndiscovered intersection crossing activities. For details about these metrics we refer the reader to [7].\n\nFigure 8: Automatically inferred scene descriptions. (Left) Tracklets from all frames superimposed. (Middle)\nInference result with \u03b8 known and (Right) \u03b8 unknown. The inferred intersection layout is shown in gray, ground\ntruth labels are given in blue. Detected activities are marked by red lines.\n\nWe perform two types of experiments: In the first one we assume that the type of intersection \u03b8 is\ngiven, and in the second one we estimate \u03b8 as well. As shown in Fig. 6, our method significantly\noutperforms the MKL baseline in almost all error measures. 
Our method particularly excels in\nestimating the intersection arm orientations and activities. We also compare our approach to [7] in\nFig. 7. As this approach uses stereo cameras, it can be considered as an oracle, yielding the highest\nperformance achievable. Our approach is close to the oracle; the difference in performance is due\nto the depth uncertainties that arise in the monocular case, which make the problem much more\nambiguous. Fig. 8 shows qualitative results, with detections belonging to the same tracklet depicted\nwith the same color. The trajectories of all the tracklets are superimposed in the last frame. Note\nthat, while for the 2-armed and 4-armed case the topology has been estimated correctly, the 3-armed\ncase has been confused with a 4-armed intersection. This is our most typical failure mode. Despite\nthis, the orientations are correctly estimated and the vehicles are placed at the correct locations.\n\nImproving Object Orientation Estimation: We also evaluate the performance of our method\nin estimating 360 degree object orientations. As cars are mostly aligned with the road surface,\nwe only focus on the orientation angle in bird\u2019s eye coordinates. As a baseline, we employ the\npart-based detector of [5] trained in a supervised fashion to distinguish between 8 canonical views,\nwhere each view is a mixture component. We correct for the ego motion and project the highest\nscoring orientation into bird\u2019s eye perspective. For our method, we infer the scene layout R using\nour approach and associate every tracklet to its lane by maximizing pl(l|t,R,C) over l using Viterbi\ndecoding. We then select the tangent angle at the associated spline\u2019s footpoint s on the inferred lane\nl as our orientation estimate. Since parked cars are often oriented arbitrarily, our evaluation focuses\non moving vehicles only. Fig. 5(b) shows that we are able to significantly reduce the orientation\nerror with respect to [5]. 
This also holds true for the smoothed version of [5], where we average\norientations over temporally neighboring bins within each tracklet.\n\n5 Conclusions\n\nWe have proposed a generative model which is able to perform joint 3D inference over the scene\nlayout as well as the location and orientation of objects. Our approach is able to infer the scene\ntopology and geometry, as well as traffic activities from a short video sequence acquired with a\nsingle camera mounted on a car driving around a mid-size city. Our generative model proves superior\nto a discriminative approach based on MKL. Furthermore, our approach is able to significantly\noutperform a state-of-the-art detector in its ability to estimate 3D object orientation. In the future,\nwe plan to incorporate more discriminative cues to further boost performance in the monocular\ncase. We also believe that incorporating traffic sign states and pedestrians into our model will be an\ninteresting avenue for future research towards fully understanding complex urban scenarios.\n\nReferences\n[1] S. Bao, M. Sun, and S. Savarese. Toward coherent object detection and scene layout understanding. In CVPR, 2010.\n[2] O. Barinova, V. Lempitsky, E. Tretyak, and P. Kohli. Geometric image parsing in man-made environments. In ECCV, 2010.\n[3] W. Choi and S. Savarese. Multiple target tracking in world coordinate with single, minimally calibrated camera. In ECCV, 2010.\n[4] A. Ess, B. Leibe, K. Schindler, and L. Van Gool. Robust multi-person tracking from a mobile platform. PAMI, 31:1831\u20131846, 2009.\n[5] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. PAMI, 32:1627\u20131645, 2010.\n[6] D. Gavrila and S. Munder. Multi-cue pedestrian detection and tracking from a moving vehicle. IJCV, 73:41\u201359, 2007.\n[7] A. Geiger, M. Lauer, and R. Urtasun. 
A generative model for 3D urban scene understanding from movable platforms. In Computer Vision and Pattern Recognition, 2011.\n[8] A. Geiger, M. Roser, and R. Urtasun. Efficient large-scale stereo matching. In Asian Conference on Computer Vision, 2010.\n[9] W. Gilks and S. Richardson, editors. Markov Chain Monte Carlo in Practice. Chapman & Hall, 1995.\n[10] S. Gould, T. Gao, and D. Koller. Region-based segmentation and object detection. In NIPS, 2009.\n[11] A. Gupta, A. Efros, and M. Hebert. Blocks world revisited: Image understanding using qualitative geometry and mechanics. In ECCV, 2010.\n[12] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge, 2004.\n[13] V. Hedau, D. Hoiem, and D.A. Forsyth. Recovering the spatial layout of cluttered rooms. In ICCV, 2009.\n[14] D. Hoiem, A. Efros, and M. Hebert. Recovering surface layout from an image. IJCV, 75:151\u2013172, 2007.\n[15] D. Hoiem, A. Efros, and M. Hebert. Putting objects in perspective. IJCV, 80:3\u201315, 2008.\n[16] C. Huang, B. Wu, and R. Nevatia. Robust object tracking by hierarchical association of detection responses. In ECCV, 2008.\n[17] A. Kapoor, K. Grauman, R. Urtasun, and T. Darrell. Gaussian processes for object categorization. IJCV, 88:169\u2013188, 2010.\n[18] R. Kaucic, A. Perera, G. Brooksby, J. Kaufhold, and A. Hoogs. A unified framework for tracking through occlusions and across sensor gaps. In CVPR, 2005.\n[19] J. Kosecka and W. Zhang. Video compass. In ECCV, 2002.\n[20] D. Kuettel, M. Breitenstein, L. Gool, and V. Ferrari. What\u2019s going on?: Discovering spatio-temporal dependencies in dynamic scenes. In CVPR, 2010.\n[21] S. Kumar and M. Hebert. Man-made structure detection in natural images using a causal multiscale random field. In CVPR, 2003.\n[22] D. Lee, A. Gupta, M. Hebert, and T. Kanade. Estimating spatial layout of rooms using volumetric reasoning about objects and surfaces. 
In NIPS, 2010.\n[23] A. Oliva and A. Torralba. Modeling the shape of the scene: a holistic representation of the spatial envelope. IJCV, 42:145\u2013175, 2001.\n[24] A. Saxena, S. H. Chung, and A. Y. Ng. 3-D depth reconstruction from a single still image. IJCV, 76:53\u201369, 2008.\n[25] G. Schindler and F. Dellaert. Atlanta world: An expectation maximization framework for simultaneous low-level edge grouping and camera calibration in complex man-made environments. In CVPR, 2004.\n[26] J. Shotton, J. Winn, C. Rother, and A. Criminisi. Textonboost for image understanding: Multi-class object recognition and segmentation by jointly modeling texture, layout, and context. IJCV, 81:2\u201323, 2009.\n[27] H. Wang, S. Gould, and D. Koller. Discriminative learning with latent variables for cluttered indoor scene understanding. In ECCV, 2010.\n[28] X. Wang, X. Ma, and W. Grimson. Unsupervised activity perception in crowded and complicated scenes using hierarchical bayesian models. PAMI, 2009.\n[29] C. Wojek, S. Roth, K. Schindler, and B. Schiele. Monocular 3D Scene Modeling and Inference: Understanding Multi-Object Traffic Scenes. In ECCV, 2010.\n[30] C. Wojek and B. Schiele. A dynamic CRF model for joint labeling of object and scene classes. In ECCV, 2008.\n", "award": [], "sourceid": 842, "authors": [{"given_name": "Andreas", "family_name": "Geiger", "institution": null}, {"given_name": "Christian", "family_name": "Wojek", "institution": null}, {"given_name": "Raquel", "family_name": "Urtasun", "institution": null}]}