{"title": "Cascaded Classification Models: Combining Models for Holistic Scene Understanding", "book": "Advances in Neural Information Processing Systems", "page_first": 641, "page_last": 648, "abstract": "One of the original goals of computer vision was to fully understand a natural scene. This requires solving several problems simultaneously, including object detection, labeling of meaningful regions, and 3d reconstruction. While great progress has been made in tackling each of these problems in isolation, only recently have researchers again been considering the difficult task of assembling various methods to the mutual benefit of all. We consider learning a set of such classification models in such a way that they both solve their own problem and help each other. We develop a framework known as Cascaded Classification Models (CCM), where repeated instantiations of these classifiers are coupled by their input/output variables in a cascade that improves performance at each level. Our method requires only a limited \u201cblack box\u201d interface with the models, allowing us to use very sophisticated, state-of-the-art classifiers without having to look under the hood. We demonstrate the effectiveness of our method on a large set of natural images by combining the subtasks of scene categorization, object detection, multiclass image segmentation, and 3d scene reconstruction.", "full_text": "Cascaded Classi\ufb01cation Models:\n\nCombining Models for Holistic Scene Understanding\n\nGeremy Heitz\nStephen Gould\nDepartment of Electrical Engineering\n\nStanford University, Stanford, CA 94305\n{gaheitz,sgould}@stanford.edu\n\nAshutosh Saxena\n\nDaphne Koller\n\nDepartment of Computer Science\n\nStanford University, Stanford, CA 94305\n{asaxena,koller}@cs.stanford.edu\n\nAbstract\n\nOne of the original goals of computer vision was to fully understand a natural scene. 
This requires solving several sub-problems simultaneously, including object detection, region labeling, and geometric reasoning. The last few decades have seen great progress in tackling each of these problems in isolation. Only recently have researchers returned to the dif\ufb01cult task of considering them jointly. In this work, we consider learning a set of related models in such a way that they both solve their own problem and help each other. We develop a framework called Cascaded Classi\ufb01cation Models (CCM), where repeated instantiations of these classi\ufb01ers are coupled by their input/output variables in a cascade that improves performance at each level. Our method requires only a limited \u201cblack box\u201d interface with the models, allowing us to use very sophisticated, state-of-the-art classi\ufb01ers without having to look under the hood. We demonstrate the effectiveness of our method on a large set of natural images by combining the subtasks of scene categorization, object detection, multiclass image segmentation, and 3d reconstruction.\n\n1 Introduction\nThe problem of \u201cholistic scene understanding\u201d encompasses a number of notoriously dif\ufb01cult computer vision tasks. Presented with an image, scene understanding involves processing the image to answer a number of questions, including: (i) What type of scene is it (e.g., urban, rural, indoor)? (ii) What meaningful regions compose the image? (iii) What objects are in the image? (iv) What is the 3d structure of the scene? (See Figure 1). Many of these questions are coupled\u2014e.g., a car present in the image indicates that the scene is likely to be urban, which in turn makes it more likely to \ufb01nd road or building regions. Indeed, this idea of communicating information between tasks is not new and dates back to some of the earliest work in computer vision (e.g., [1]). 
In this paper, we present a framework that exploits such dependencies to answer questions about novel images.\n\nWhile our focus will be on image understanding, the goal of combining related classi\ufb01ers is relevant to many other machine learning domains where several related tasks operate on the same (or related) raw data and provide correlated outputs. In the area of natural language processing, for instance, we might want to process a single document and predict the part of speech of all words, correspond the named entities, and label the semantic roles of verbs. In the area of audio signal processing, we might want to simultaneously do speech recognition, source separation, and speaker recognition.\n\nIn the problem of scene understanding (as in many others), state-of-the-art models already exist for many of the tasks of interest. However, these carefully engineered models are often tricky to modify, or even simply to re-implement from available descriptions. As a result, it is sometimes desirable to treat these models as \u201cblack boxes,\u201d where we have access only to a very simple input/output interface. In short, we require only the ability to train on data and produce classi\ufb01cations for each data instance; speci\ufb01cs are given in Section 3 below.\nIn this paper, we present the framework of Cascaded Classi\ufb01cation Models (CCMs), where state-of-the-art \u201cblack box\u201d classi\ufb01ers for a set of related tasks are combined to improve performance on some or all tasks.\n\n(a) Detected Objects\n\n(b) Classi\ufb01ed Regions\n\n(c) 3D Structure\n\n(d) CCM Framework\n\nFigure 1: (a)-(c) Some properties of a scene required for holistic scene understanding that we seek to unify using a cascade of classi\ufb01ers. (d) The CCM framework for jointly predicting each of these label types.\n
Speci\ufb01cally, the CCM framework creates multiple instantiations of each classi\ufb01er, and organizes them into tiers where models in the \ufb01rst tier learn in isolation, processing the data to produce the best classi\ufb01cations given only the raw instance features. Lower tiers accept as input both the features from the data instance, as well as features computed from the output classi\ufb01cations of the models at the previous tier. While only demonstrated in the computer vision domain, we expect the CCM framework to have broad applicability to many other machine learning problems.\n\nWe apply our model to the scene understanding task by combining scene categorization, object detection, multi-class segmentation, and 3d reconstruction. We show how \u201cblack-box\u201d classi\ufb01ers can be easily integrated into our framework. Importantly, in extensive experiments on large image databases, we show that our combined model yields superior results on all tasks considered.\n\n2 Related Work\nA number of works in various \ufb01elds aim to combine classi\ufb01ers to improve \ufb01nal output accuracy. These works can be divided into two broad groups. The \ufb01rst is the combination of classi\ufb01ers that predict the same set of random variables. Here the aim is to improve classi\ufb01cations by combining the outputs of the individual models. Boosting [6], in which many weak learners are combined into a highly accurate classi\ufb01er, is one of the most common and powerful such schemes. In computer vision, this idea has been very successfully applied to the task of face detection using the so-called Cascade of Boosted Ensembles (CoBE) [18, 2] framework. While similar to our work in constructing a cascade of classi\ufb01ers, their motivation was computational ef\ufb01ciency, rather than a consideration of contextual bene\ufb01ts. 
Tu [17] learns context cues by cascading models for pixel-level labeling.\nHowever, the context is, again, limited to interactions between labels of the same type.\n\nThe other broad group of works that combine classi\ufb01ers is aimed at using the classi\ufb01ers as compo-\nnents in large intelligent systems. Kumar and Hebert [9], for example, develop a large MRF-based\nprobabilistic model linking multiclass segmentation and object detection. Such approaches have also\nbeen used in the natural language processing literature. For example, the work of Sutton and McCal-\nlum [15] combines a parsing model with a semantic role labeling model into a uni\ufb01ed probabilistic\nframework that solves both simultaneously. While technically-correct probabilistic representations\nare appealing, it is often painful to \ufb01t existing methods into a large, complex, highly interdepen-\ndent network. By leveraging the idea of cascades, our method provides a simpli\ufb01ed approach that\nrequires minimal tuning of the components.\n\nThe goal of holistic scene understanding dates back to the early days of computer vision, and is\nhighlighted in the \u201cintrinsic images\u201d system proposed by Barrow and Tenenbaum [1], where maps\nof various image properties (depth, re\ufb02ectance, color) are computed using information present in\nother maps. Over the last few decades, however, researchers have instead targeted isolated computer\nvision tasks, with considerable success in improving the state-of-the-art. For example, in our work,\nwe build on the prior work in scene categorization of Li and Perona [10], object detection of Dalal\nand Triggs [4], multi-class image segmentation of Gould et al. [7], and 3d reconstruction of Saxena\net al. [13]. Recently, however, researchers have returned to the question of how one can bene\ufb01t from\nexploiting the dependencies between different classi\ufb01ers.\n\nTorralba et al. 
[16] use context to signi\ufb01cantly boost object detection performance, and Sudderth et al. [14] use object recognition for 3d structure estimation. In independent contemporary work, Hoiem et al. [8] propose an innovative system for integrating the tasks of object recognition, surface orientation estimation, and occlusion boundary detection. Like ours, their system is modular and leverages state-of-the-art components. However, their work has a strong leaning towards 3d scene reconstruction rather than understanding, and their algorithms contain many steps that have been specialized for this purpose. Their training also requires intimate knowledge of the implementation of each module, while ours is more \ufb02exible, allowing integration of many related vision tasks regardless of their implementation details. Furthermore, we consider a broader class of images and object types, and label regions with speci\ufb01c classes, rather than generic properties.\n\n3 Cascaded Classi\ufb01cation Models\nOur goal is to classify various characteristics of our data using state-of-the-art methods in a way that allows each model to bene\ufb01t from the others\u2019 expertise. We are interested in using proven \u201coff-the-shelf\u201d classi\ufb01ers for each subtask. As such, these classi\ufb01ers will be treated as \u201cblack boxes,\u201d each with its own (specialized) data structures, feature sets, and inference and training algorithms.\n\nTo \ufb01t into our framework, we only require that each classi\ufb01er provides a mechanism for including additional (auxiliary) features from other modules. Many state-of-the-art models lend themselves to the easy addition of new features. In the case of \u201cintrinsic images\u201d [1], the output of each component is converted into an image-sized feature map (e.g., each \u201cpixel\u201d contains the probability that it belongs to a car). These maps can easily be fed into the other components as additional image channels. In cases where this cannot be done, it is trivial to convert the original classi\ufb01er\u2019s output to a log-odds ratio and use it along with features from the other classi\ufb01ers in a simple logistic model.\nA standard setup has, say, two models that predict the variables YD and YS respectively for the same input instance I. For example, I might be an image, and YD could be the locations of all cars in the image, while YS could be a map indicating which pixels are road. Most algorithms begin by processing I to produce a set of features, and then learn a function that maps these features into a predicted label (and in some cases also a con\ufb01dence estimate). A Cascaded Classi\ufb01cation Model (CCM) is a joint classi\ufb01cation model that shares information between tasks by linking component classi\ufb01ers in order to leverage their relatedness. Formally:\nDe\ufb01nition 3.1: An L-tier Cascaded Classi\ufb01cation Model (L-CCM) is a cascade of classi\ufb01ers of the target labels Y = {Y1, . . . , YK}^L (L \u201ccopies\u201d of each label), consisting of independent classi\ufb01ers f_{k,0}(\u03c6_k(I); \u03b8_{k,0}) \u2192 \u02c6Y^0_k and a series of conditional classi\ufb01ers f_{k,\u2113}(\u03c6_k(I, y^{\u2113\u22121}_{\u2212k}); \u03b8_{k,\u2113}) \u2192 \u02c6Y^\u2113_k, indexed by \u2113, indicating the \u201ctier\u201d of the model, where y_{\u2212k} indicates the assignment to all labels other than y_k. The labels at the \ufb01nal tier (L \u2212 1) represent the \ufb01nal classi\ufb01cation outputs.\nA CCM uses L copies of each component model, stacked into tiers, as depicted in Figure 1(d). One
One\ncopy of each model lies in the \ufb01rst tier, and learns with only the image features, \u03c6k(I), as input.\nSubsequent tiers of models accepts a feature vector, \u03c6k(I, y\u2113\u22121\n\u2212k ), containing the original image\nfeatures and additional features computed from the outputs of models in the preceeding tier. Given\na novel test instance, classi\ufb01cation is performed by predicting the most likely (MAP) assignment to\neach of the variables in the \ufb01nal tier.\nWe learn our CCM in a feed-forward manner. That is, we begin from the top level, training the\nindependent (fk,0) classi\ufb01ers \ufb01rst, in order to maximize the classi\ufb01cation performance on the train-\ning data. Because we assume a learning interface into each model, we simply supply the subset of\ndata that has ground labels for that model to its learning function. For learning each component k in\neach subsequent level \u2113 of the CCM, we \ufb01rst perform classi\ufb01cation using the (\u2113 \u2212 1)-tier CCM that\nhas already been trained. From these output assignments, each classi\ufb01er can compute a new set of\nfeatures and perform learning using the algorithm of choice for that classi\ufb01er.\nFor learning a CCM, we assume that we have a dataset of fully or partially annotated instances. It\nis not necessary for every instance to have groundtruth labels for every component, and our method\nworks even when the training sets are disjoint. 
This is appealing since the prevalence of large, volunteer-annotated datasets (e.g., the LabelMe dataset [12] in vision or the Penn Treebank [11] in language processing) is likely to provide large amounts of heterogeneously labeled data.\n\n4 CCM for Holistic Scene Understanding\nOur scene understanding model uses a CCM to combine various subsets of four computer vision tasks: scene categorization, multi-class image segmentation, object detection, and 3d reconstruction. We \ufb01rst introduce the notation for the target labels and then brie\ufb02y describe the speci\ufb01cs of each component. Consider an image I. Our scene categorization classi\ufb01er produces a scene label C from one of a small number of classes. Our multi-class segmentation model produces a class label Sj for each of a prede\ufb01ned set of regions j in the image. The base object detectors produce a set of scored windows (Wc,i) that potentially contain an object of type c. We attach a label Dc,i to each window that indicates whether or not the window contains the object. Our last component module is monocular 3d reconstruction, which produces a depth Zi for every pixel i in the image.\n\nFigure 2: (left, middle) Two example features used by the \u201ccontext\u201d aware object detector. (right) Relative location maps showing the relative location of regions (columns) to objects (rows). Each map shows the prevalence of the region relative to the center of the object. For example, the top row shows that cars are likely to have road beneath and sky above, while the bottom rows show that cows and sheep are often surrounded by grass.\n\nScene Categorization Our scene categorization module is a simple multi-class logistic model that classi\ufb01es the entire scene into one of a small number of classes. The base model uses a 13-dimensional feature vector \u03c6(I) with elements based on the mean and variance of the RGB and YCrCb color channels over the entire image, plus a bias term. 
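A minimal sketch of this base feature vector follows (mean and variance of the three RGB and three YCrCb channels, twelve statistics, plus a bias term); the BT.601 conversion coefficients and the channel ordering are our illustrative assumptions, not taken from the paper:

```python
import numpy as np

def scene_features(img):
    """13-d base feature for the scene classifier: mean and variance of the
    three RGB and three YCrCb channels (12 statistics) plus a bias term.
    The BT.601 conversion and channel ordering are illustrative assumptions."""
    r, g, b = (img[..., c].astype(float) for c in range(3))
    y = 0.299 * r + 0.587 * g + 0.114 * b      # luma
    cr = 0.713 * (r - y)                       # red-difference chroma
    cb = 0.564 * (b - y)                       # blue-difference chroma
    stats = [f(c) for c in (r, g, b, y, cr, cb) for f in (np.mean, np.var)]
    return np.array(stats + [1.0])             # 12 statistics + bias = 13

feat = scene_features(np.zeros((4, 4, 3)))     # one 13-d vector per image
```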
In the conditional model, we include features that\nindicate the relative proportions of each region label (a histogram of Sj values) in the image, plus\ncounts of the number of objects of each type detected, producing a \ufb01nal feature vector of length 26.\nMulticlass Image Segmentation The segmentation module aims to assign a label to each pixel. We\nbase our model on the work of Gould et al. [7] who make use of relative location\u2014the preference for\nclasses to be arranged in a consistent con\ufb01guration with respect to one another (e.g., cars are often\nfound above roads). Each image is pre-partitioned into a set {S1, . . . , SN } of regions (superpixels)\nand the pixels are labeled by assigning a class to each region Sj. The method employs a pairwise\nconditional Markov random \ufb01eld (CRF) constructed over the superpixels with node potentials based\non appearance features and edge potentials encoding a preference for smoothness.\n\nIn our work we wish to model the relative location between detected objects and region labels. This\nhas the advantage of being able to encode scale, which was not possible in [7]. The right side of\nFigure 2 shows the relative location maps learned by our model. These maps model the spatial\nlocation of all classes given the location and scale of detected objects. Because the detection model\nprovides probabilities for each detection, we actually use the relative location maps multiplied by\nthe probability that each detection is a true detection. Preliminary results showed an improvement\nin using these soft detections over hard (thresholded) detections.\nObject Detectors Our detection module builds on the HOG detector of Dalal and Triggs [4]. For\neach class, the HOG detector is trained on a set of images disjoint from our datasets below. This\ndetector is then applied to all images in our dataset with a low threshold that produces an overde-\ntection. 
For each image I and each object class c, we typically \ufb01nd 10-100 candidate detection windows Wc,i. Our independent detector model learns a logistic model over a small feature vector \u03c6_{c,i} that can be extracted directly from the candidate window.\nOur conditional classi\ufb01er seeks to improve the accuracy of the HOG detector by using the image segmentation (a label Sj for each region j), the 3d reconstruction of the scene (a depth Zj for each region), and the categorization of the scene as a whole (C). Thus, the outputs from the other modules and the image are combined into a feature vector \u03c6_k(I, C, S, Z). A sampling of the features used is shown in Figure 2. This augmented feature vector is used in a logistic model as in the independent case. Both the independent and context-aware logistic models are regularized with a small ridge term to prevent over\ufb01tting.\n\n             CAR   PEDES. BIKE  BOAT  SHEEP COW   Mean  Segment  Category\nHOG          0.39  0.29   0.13  0.11  0.19  0.23  0.28  N/A      N/A\nIndependent  0.55  0.53   0.57  0.31  0.39  0.47  0.49  72.1%    70.6%\n2-CCM        0.58  0.55   0.65  0.48  0.45  0.54  0.53  75.0%    77.3%\n5-CCM        0.59  0.56   0.63  0.47  0.40  0.54  0.53  75.8%    76.8%\nGroundtruth  0.49  0.53   0.62  0.35  0.40  0.51  0.48  73.6%    69.9%\nIdeal Input  0.63  0.64   0.56  0.65  0.45  0.58  0.56  78.4%    86.7%\n\nTable 1: Numerical evaluation of our various training regimes for the DS1 dataset. We show average precision (AP) for the six classes, as well as the mean. We also show segmentation and scene categorization accuracy.\n\nReconstruction Module Our reconstruction module is based on the work of Saxena et al. [13]. Our Markov Random Field (MRF) approach models the 3d reconstruction (i.e., depths Z at each point in the image) as a function of the image features and also models the relations between depths at various points in the image. 
For example, unless there is occlusion, it is more likely that two nearby\nregions in the image would have similar depths.\nMore formally, our variables are continuous, i.e., at a point i, the depth Zi \u2208 R. Our baseline model\nconsists of two types of terms. The \ufb01rst terms model the depth at each point as a linear function\nof the local image features, and the second type models relationships between neighboring points,\nencouraging smoothness. Our conditional model includes an additional set of terms that models the\ndepth at each point as a function of the features computed from an image segmentation S in the\nneighborhood of a point. By including this third term, our model bene\ufb01ts from the segmentation\noutputs in various ways. For example, a classi\ufb01cation of grass implies a horizontal surface, and a\nclassi\ufb01cation of sky correlates with distant image points. While detection outputs might also help\nreconstruction, we found that most of the signal was present in the segmentation maps, and therefore\ndropped the detection features for simplicity.\n\n5 Experiments\nWe perform experiments on two subsets of images. The \ufb01rst subset DS1 contains 422 fully-labeled\nimages of urban and rural outdoor scenes. Each image is assigned a category (urban, rural, water,\nother). We hand label each pixel as belonging to one of: tree, road, grass, water, sky, building\nand foreground. The foreground class captures detectable objects, and a void class (not used during\ntraining or evaluation) allows for the small number of regions not \ufb01tting into one of these classes\n(e.g., mountain) to be ignored. This is standard practice for the pixel-labeling task (e.g., see [3]). We\nalso annotate the location of six different object categories (car, pedestrian, motorcycle, boat, sheep,\nand cow) by drawing a tight bounding box around each object. 
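Detections are later scored against these boxes by their overlap (area of intersection over area of union, with 0.2 as the correctness threshold; see Section 5.1). A minimal sketch, assuming boxes are given as (x1, y1, x2, y2) corner tuples (an ordering we choose for illustration):

```python
def overlap_score(a, b):
    """Area of intersection divided by area of union for two axis-aligned
    boxes, each given as (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))   # intersection width
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))   # intersection height
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def is_correct(detected, groundtruth, thresh=0.2):
    """A detection counts as correct when its overlap exceeds the threshold."""
    return overlap_score(detected, groundtruth) > thresh
```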
We use this dataset to demonstrate the\ncombining of three vision tasks: object detection, multi-class segmentation, and scene categorization\nusing the models described above.\nOur much larger second dataset DS2 was assembled by combining 362 images from the DS1 dataset\n(including either the segmentation or detection labels, but not both), 296 images from the Microsoft\nResearch Segmentation dataset [3] (labeled with segments), 557 images from the PASCAL VOC\n2005 and 2006 challenges [5] (labeled with objects), and 534 images with ground truth depth in-\nformation. This results in 1749 images with disjoint labelings (no image contains groundtruth la-\nbels for more than one task). Combining these datasets results in 534 reconstruction images with\ngroundtruth depths obtained by laser range-\ufb01nder (split into 400 training and 134 test), 596 images\nwith groundtruth detections (same 6 classes as above, split into 297 train and 299 test), and 615 with\ngroundtruth segmentations (300 train and 315 test). This dataset demonstrates the typical situation\nin learning related tasks whereby it is dif\ufb01cult to obtain large fully-labeled datasets. We use this\ndataset to demonstrate the power of our method in leveraging the data from these three tasks to\nimprove performance.\n\n5.1 DS1 Dataset\nExperiments with the DS1 dataset were performed using 5-fold cross validation, and we report\nthe mean performance results across folds. We compare \ufb01ve training/testing regimes (see Table 1).\nIndependent learns parameters on a 0-Tier (independent) CCM, where no information is exchanged\nbetween tasks. We compare two levels of complexity for our method, a 2-CCM and a 5-CCM\nto test how the depth of the cascade affects performance. The last two training/testing regimes\ninvolve using groundtruth information at every stage for training and for both training and testing,\nrespectively. 
Groundtruth trains a 5-CCM using groundtruth inputs for the feature construction (i.e., as if each tier received perfect inputs from above), but is evaluated with real inputs. The Ideal Input experiment uses the Groundtruth model and also uses the groundtruth input to each tier at testing time. We could do this since, for this dataset, we had access to fully labeled groundtruth. Obviously this is not a legitimate operating mode, but it does provide an interesting upper bound on what we might hope to achieve.\n\n(a) Cars\n\n(b) Pedestrians\n\n(c) Motorbikes\n\n(d) Categorization\n\n(e) Boats\n\n(f) Sheep\n\n(g) Cows\n\n(h) Segmentation\n\nFigure 3: Results for the DS1 dataset. (a-c,e-g) show precision-recall curves for the six object classes that we consider, while (d) shows our accuracy on the scene categorization task and (h) our accuracy in labeling regions in one of seven classes.\n\nTo quantitatively evaluate our method, we consider metrics appropriate to the tasks in question. For scene categorization, we report an overall accuracy for assigning the correct scene label to an image. For segmentation, we compute a per-segment accuracy, where each segment is assigned the groundtruth label that occurs for the majority of pixels in the region. For detection, we consider a particular detection correct if the overlap score is larger than 0.2 (the overlap score equals the area of intersection divided by the area of union between the detected bounding box and the groundtruth). We plot precision-recall (PR) curves for detections, and report the average precision (AP) of these curves. AP is a more stable version of the area under the PR curve.\n\nOur numerical results are shown in Table 1, and the corresponding graphs are given in Figure 3. The PR curves compare the HOG detector results to our Independent results and to our 2-CCM results. It is interesting to note that a large gain was achieved by adding the independent features to the object detectors. 
While the HOG score looks at only the pixels inside the target window, the other features take into account the size and location of the window, allowing our model to capture the fact that foreground objects tend to occur in the middle of the image and at a relatively small range of scales. On top of this, we were able to gain an additional bene\ufb01t through the use of context in the CCM framework. For the categorization task, we gained 7% using the CCM framework, and for segmentation, CCM afforded a 3% improvement in accuracy. Furthermore, for this task, running an additional three tiers, for a 5-CCM, produced an additional 1% improvement.\nInterestingly, the Groundtruth method performs little better than Independent on these three tasks. This shows that it is better to train the models using input features that are closer to those they will see at test time. In this way, the downstream tiers can learn to ignore signals that the upstream tiers are bad at capturing, or even take advantage of consistent upstream bias. Also, the Ideal Input results show that CCMs have made signi\ufb01cant progress towards the best we can hope for from these models.\n\n5.2 DS2 Dataset\nFor this dataset we combine the three subtasks of reconstruction, segmentation, and object detection. Furthermore, as described above, the labels for our training data are disjoint. We trained an Independent model and a 2-CCM on this data. Quantitatively, 2-CCM outperformed Independent on segmentation by 2% (75% vs. 73% accuracy), on detection by 0.02 (0.33 vs. 0.31 mean average precision), and on depth reconstruction by 1.3 meters (15.4 vs. 16.7 root mean squared error).\n\nFigure 4: (top two rows) three cases where CCM improved results for all tasks. In the \ufb01rst, for instance, the presence of grass allows the CCM to remove the boat detections. 
The next four rows show four examples where detections are improved and four examples where segmentations are improved.\n\nFigure 4 shows example outputs from each component. The \ufb01rst three examples (top two rows) show images where all components improved over the independent model. In the top left, our detectors removed some false boat detections which were out of context, and determined that the watery appearance of the bottom of the car was actually foreground. Also, by providing a sky segment, our method allowed the 3d reconstruction model to infer that those pixels must be very distant (red). The next two examples show similar improvements for detections of boats and water.\n\nThe remaining examples show how separate tasks improve by using information from the others. In each example we show results from the independent model for the task in question, the independent contextual task, and the 2-CCM output. The \ufb01rst four examples show that our method was able to make correct detections where the independent model could not. The last examples show improvements in multi-class image segmentation.\n\n6 Discussion\nIn this paper, we have presented the Cascaded Classi\ufb01cation Models (CCM) method for combining a collection of state-of-the-art classi\ufb01ers toward improving the results of each. We demonstrated our method on the task of holistic scene understanding by combining scene categorization, object detection, multi-class segmentation, and depth reconstruction, and improving on all. Our results are consistent with other contemporary research, including the work of Hoiem et al. [8], which uses different components and a smaller number of object classes.\n\nImportantly, our framework is very general and can be applied to a number of machine learning domains. This result provides hope that we can improve by combining our complex models in a simple way. The simplicity of our method is one of its most appealing aspects. 
Cascades of classi\ufb01ers have been used extensively within a particular task, and our results suggest that this should generalize to work between tasks. In addition, we showed that CCMs can bene\ufb01t from the cascade even with disjoint training data, e.g., no images containing labels for more than one subtask.\n\nIn our experiments, we passed relatively few features between the tasks. Due to the homogeneity of our data, many of the features carried the same signal (e.g., a high probability of an ocean scene is a surrogate for a large portion of the image containing water regions). For larger, more heterogeneous datasets, including more features may improve performance. In addition, larger datasets will help prevent the over\ufb01tting that we experienced when trying to include a large number of features.\nIt is an open question how deep a CCM is appropriate in a given scenario. Over\ufb01tting is anticipated for very deep cascades. Furthermore, because of limits in the context signal, we cannot expect to get unlimited improvements. Further exploration of cases where this combination is appropriate is an important future direction. Another exciting avenue is the idea of feeding back information from the later classi\ufb01ers to the earlier ones. Intuitively, a later classi\ufb01er might encourage earlier ones to focus their efforts on \ufb01xing certain error modes, or allow the earlier classi\ufb01ers to ignore mistakes that do not hurt \u201cdownstream.\u201d This should also allow components with little training data to optimize their results to be most bene\ufb01cial to other modules, while worrying less about their own task.\nAcknowledgements This work was supported by the DARPA Transfer Learning program under contract number FA8750-05-2-0249 and the Multidisciplinary University Research Initiative (MURI), contract number N000140710747, managed by the Of\ufb01ce of Naval Research.\n\nReferences\n[1] H. G. Barrow and J.M. Tenenbaum. 
Recovering intrinsic scene characteristics from images. CVS, 1978.\n[2] S.C. Brubaker, J. Wu, J. Sun, M.D. Mullin, and J.M. Rehg. On the design of cascades of boosted ensembles for face detection. Tech report GIT-GVU-05-28, 2005.\n[3] A. Criminisi. Microsoft Research Cambridge object recognition image database (version 1.0 and 2.0), 2004. Available online: http://research.microsoft.com/vision/cambridge/recognition.\n[4] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.\n[5] M. Everingham et al. The 2005 PASCAL visual object classes challenge. In MLCW, 2005.\n[6] Y. Freund and R.E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In European Conference on Computational Learning Theory, pages 23\u201337, 1995.\n[7] S. Gould, J. Rodgers, D. Cohen, G. Elidan, and D. Koller. Multi-class segmentation with relative location prior. IJCV, 2008.\n[8] D. Hoiem, A.A. Efros, and M. Hebert. Closing the loop on scene interpretation. In CVPR, 2008.\n[9] S. Kumar and M. Hebert. A hierarchical \ufb01eld framework for uni\ufb01ed context-based classi\ufb01cation. In ICCV, 2005.\n[10] F. Li and P. Perona. A Bayesian hierarchical model for learning natural scene categories. In CVPR, 2005.\n[11] M. P. Marcus, M.A. Marcinkiewicz, and B. Santorini. Building a large annotated corpus of English: the Penn Treebank. Comput. Linguist., 19(2), 1993.\n[12] B.C. Russell, A.B. Torralba, K.P. Murphy, and W.T. Freeman. LabelMe: a database and web-based tool for image annotation. IJCV, 2008.\n[13] A. Saxena, M. Sun, and A.Y. Ng. Learning 3-d scene structure from a single still image. PAMI, 2008.\n[14] E.B. Sudderth, A. Torralba, W.T. Freeman, and A.S. Willsky. Depth from familiar objects: A hierarchical model for 3d scenes. In CVPR, 2006.\n[15] C. Sutton and A. McCallum. Joint parsing and semantic role labeling. In CoNLL, 2005.\n[16] A.B. Torralba, K.P. Murphy, and W.T. 
Freeman. Contextual models for object detection using boosted random \ufb01elds. In NIPS, 2004.\n[17] Z. Tu. Auto-context and its application to high-level vision tasks. In CVPR, 2008.\n[18] P. Viola and M.J. Jones. Robust real-time object detection. IJCV, 2001.\n", "award": [], "sourceid": 60, "authors": [{"given_name": "Geremy", "family_name": "Heitz", "institution": null}, {"given_name": "Stephen", "family_name": "Gould", "institution": null}, {"given_name": "Ashutosh", "family_name": "Saxena", "institution": null}, {"given_name": "Daphne", "family_name": "Koller", "institution": null}]}