{"title": "Towards Holistic Scene Understanding: Feedback Enabled Cascaded Classification Models", "book": "Advances in Neural Information Processing Systems", "page_first": 1351, "page_last": 1359, "abstract": "In many machine learning domains (such as scene understanding), several related sub-tasks (such as scene categorization, depth estimation, object detection) operate on the same raw data and provide correlated outputs. Each of these tasks is often notoriously hard, and state-of-the-art classifiers already exist for many sub-tasks. It is desirable to have an algorithm that can capture such correlation without requiring to make any changes to the inner workings of any classifier. We propose Feedback Enabled Cascaded Classification Models (FE-CCM), that maximizes the joint likelihood of the sub-tasks, while requiring only a \u2018black-box\u2019 interface to the original classifier for each sub-task. We use a two-layer cascade of classifiers, which are repeated instantiations of the original ones, with the output of the first layer fed into the second layer as input. Our training method involves a feedback step that allows later classifiers to provide earlier classifiers information about what error modes to focus on. 
We show that our method significantly improves performance in all the sub-tasks in two different domains: (i) scene understanding, where we consider depth estimation, scene categorization, event categorization, object detection, geometric labeling and saliency detection, and (ii) robotic grasping, where we consider grasp point detection and object classification.", "full_text": "Towards Holistic Scene Understanding: Feedback Enabled Cascaded Classification Models\n\nCongcong Li, Adarsh Kowdle, Ashutosh Saxena, Tsuhan Chen\n\n{cl758,apk64}@cornell.edu, asaxena@cs.cornell.edu, tsuhan@ece.cornell.edu\n\nCornell University, Ithaca, NY.\n\nAbstract\n\nIn many machine learning domains (such as scene understanding), several related sub-tasks (such as scene categorization, depth estimation, object detection) operate on the same raw data and provide correlated outputs. Each of these tasks is often notoriously hard, and state-of-the-art classifiers already exist for many sub-tasks. It is desirable to have an algorithm that can capture such correlation without requiring to make any changes to the inner workings of any classifier. We propose Feedback Enabled Cascaded Classification Models (FE-CCM), that maximizes the joint likelihood of the sub-tasks, while requiring only a \u2018black-box\u2019 interface to the original classifier for each sub-task. We use a two-layer cascade of classifiers, which are repeated instantiations of the original ones, with the output of the first layer fed into the second layer as input. Our training method involves a feedback step that allows later classifiers to provide earlier classifiers information about what error modes to focus on. 
We show that our method significantly improves performance in all the sub-tasks in two different domains: (i) scene understanding, where we consider depth estimation, scene categorization, event categorization, object detection, geometric labeling and saliency detection, and (ii) robotic grasping, where we consider grasp point detection and object classification.\n\n1 Introduction\n\nIn many machine learning domains, several sub-tasks operate on the same raw data to provide correlated outputs. Each of these sub-tasks is often notoriously hard, and state-of-the-art classifiers already exist for many of them. In the domain of scene understanding, for example, several independent efforts have resulted in good classifiers for tasks such as scene categorization, depth estimation, object detection, etc. In practice, we see that these sub-tasks are coupled; for example, if we know that the scene is indoors, it would help us estimate depth more accurately from that single image. In another example in the robotic grasping domain, if we know what object it is, then it is easier for a robot to figure out how to pick it up. In this paper, we propose a unified model which jointly optimizes for all the sub-tasks, allowing them to share information and guide the classifiers towards a joint optimum. We show that this can be seamlessly applied across different machine learning domains.\n\nRecently, several approaches have tried to combine these different classifiers for related tasks in vision [19, 25, 35]; however, most of them tend to be ad-hoc (i.e., a hard-coded rule is used) and often intimate knowledge of the inner workings of the individual classifiers is required. Even beyond vision, in most other domains, state-of-the-art classifiers already exist for many sub-tasks. However, these carefully engineered models are often tricky to modify, or even to simply re-implement from the available descriptions. 
Heitz et al. [17] recently developed a framework for scene understanding called Cascaded Classification Models (CCM) treating each classifier as a \u2018black-box\u2019. Each classifier is repeatedly instantiated with the next layer using the outputs of the previous classifiers as inputs. While this work proposed a method of combining the classifiers in a way that increased the performance in all of the four tasks they considered, it had a drawback that it optimized for each task independently and there was no way of feeding back information from later classifiers to earlier classifiers during training. This feedback can help the CCM achieve a better solution.\n\nIn our work, we propose Feedback Enabled Cascaded Classification Models (FE-CCM), which provides feedback from the later classifiers to the earlier ones during the training phase. This feedback provides earlier stages information about what error modes should be focused on, or what can be ignored without hurting the performance of the later classifiers. For example, misclassifying a street scene as highway would not hurt as much as misclassifying a street scene as open country. Therefore we prefer the first layer classifier to focus on fixing the latter error instead of optimizing the training accuracy. In another example, allowing the depth estimation to focus on some specific regions can help perform better scene categorization. For instance, the open country scene is characterized by its upper part as a wide sky area. Therefore, estimating the depth well in that region, at the cost of some regions at the bottom, may help an image to be assigned to the correct category. In detail, we do so by jointly maximizing the likelihood of all the tasks; the outputs of the first layers are treated as latent variables and training is done by using an iterative algorithm. 
Another benefit of our method is that each of the classifiers can be trained using its own independent training dataset, i.e., our model does not require a datapoint to have labels for all the tasks, and hence it scales well with heterogeneous datasets.\n\nIn our approach, we treat the classifier as a \u2018black-box\u2019, with no restrictions on its operation other than requiring the ability to train on data and to have an input/output interface. Often each of these individual classifiers could be quite complex, e.g., producing labelings over pixels in an entire image. Therefore, our method is applicable to many tasks that have different but correlated outputs.\n\nIn extensive experiments, we show that our method achieves significant improvements in the performance of all the sub-tasks in two different domains: (i) scene understanding, where we consider six tasks: depth estimation, object detection, scene categorization, event categorization, geometric labeling and saliency detection, and (ii) robotic grasping, where we consider two tasks: grasp point detection and object classification.\n\nThe rest of the paper is organized as follows. We discuss the related works in Section 2. We describe our FE-CCM method in Section 3 followed by the implementation of the classifiers in Section 4. We present the experiments and results in Section 5. We finally conclude in Section 6.\n\n2 Related Work\n\nThe idea of using information from related tasks to improve the performance of the task in question has been studied in various fields of machine learning and vision. The idea of cascading layers of classifiers to aid the final task was first introduced with neural networks as multi-level perceptrons, where the output of the first layer of perceptrons is passed on as input to the next hidden layer [16, 12, 6]. 
However, it is often hard to train neural networks and gain an insight into their operation, thus making them hard to apply to complicated tasks.\n\nThere has been a huge body of work in the area of sensor fusion where classifiers work with different modalities, each one giving additional information and thus improving the performance, e.g., in biometrics, data from voice recognition and face recognition is combined [21]. However, in our scenario, we consider multiple tasks where each classifier is tackling a different problem (i.e., predicting different labels), with the same input being provided to all the classifiers.\n\nThe idea of improving classification performance by combining outputs of many classifiers is used in methods such as Boosting [13], where many weak learners are combined to obtain a more accurate classifier; this has been applied to tasks such as face detection [4, 40]. However, unlike the CCM framework which focuses on contextual benefits, their motivation was computational efficiency. Tu [39] used pixel-level label maps to learn a contextual model for pixel-level labeling, through a cascaded classifier approach, but such works considered only the interactions between labels of the same type.\n\nWhile the above combine classifiers to predict the same labels, there is a group of works that combine classifiers and use them as components in large systems. Kumar and Hebert [23] developed a large MRF-based probabilistic model to link multi-class segmentation and object detection. Similar efforts have been made in the field of natural language processing. Sutton and McCallum [36] combined a parsing model with a semantic role labeling model into a unified probabilistic framework that solved both simultaneously. 
However, it is hard to fit existing state-of-the-art classifiers into these technically-sound probabilistic representations because they require knowledge of the inner workings of the individual classifiers. Structured learning (e.g., [38]) could also be a viable option for our setting; however, it needs a fully-labeled dataset, which is not available in vision tasks.\n\nFigure 1: Combining related classifiers using the proposed FE-CCM model (\u2200i \u2208 {1, 2, . . . , n}: \u03a8i(X) = features corresponding to Classifieri extracted from image X, Zi = output of Classifieri in the first stage parameterized by \u03b8i, Yi = output of Classifieri in the second stage parameterized by \u03c9i): (a) Cascaded classification model (CCM) where the output from the previous stage of the classifier is used in the subsequent stage along with image features. The model optimizes the output of each Classifierj on the second stage independently; (b) Proposed Feed-back enabled cascaded classification model (FE-CCM), where there is feed-back from the latter stages to help achieve a model which optimizes all the tasks considered, jointly. (Note that different colors of lines are used only to make the figure more readable.)\n\nThere have been many works which show that with a well-designed model, one can improve the performance of a particular task by using cues from other tasks (e.g., [29, 37, 2]). Saxena et al. manually designed the terms in an MRF to combine depth estimation with object detection [34] and stereo cues [33]. Sudderth et al. [35] used object recognition to help 3D structure estimation. Hoiem et al. [19] proposed an innovative but ad-hoc system that combined boundary detection and surface labeling by sharing some low-level information between the classifiers. Li et al. [25, 24] combined image classification, annotation and segmentation with a hierarchical graphical model. 
However, these methods required considerable attention to each classifier, and considerable insight into the inner workings of each task and also the connections between tasks. This limits the generality of the approaches in introducing new tasks easily or being applied to other domains.\n\nThere is also a large body of work in the area of deep learning, and we refer the reader to Bengio and LeCun [3] for a nice overview of deep learning architectures and Caruana [5] for multitask learning with shared representation. Most works in deep learning (e.g., [15, 18, 41]) are different from our work in that they focus on one particular task (same labels) by building different classifier architectures, as compared to our setting of different tasks with different labels. Hinton et al. [18] used unsupervised learning to obtain an initial configuration of the parameters. This provides a good initialization and hence their multi-layered architecture does not suffer from local minima during optimization. At a high level, we can also look at our work as a multi-layered architecture (where each node typically produces complex outputs, e.g., labels over the pixels in the image); and initialization in our case comes from existing state-of-the-art individual classifiers. Given this initialization, our training procedure finds parameters that (consistently) improve performance across all the sub-tasks.\n\n3 Feedback Enabled Cascaded Classification Models\n\nWe will describe the proposed model for combining and training the classifiers in this section. We consider related subtasks denoted by Classifieri, where i \u2208 {1, 2, . . . , n} for a total of n tasks (Figure 1). Let \u03a8i(X) correspond to the features extracted from image X for the Classifieri. Our cascade is composed of two layers, where the outputs from classifiers on the first layer go as input into the classifiers in the second layer. 
We do this by appending all the outputs from the first layer to the features for that task. \u03b8i represents the parameters for the first level of Classifieri with output Zi, and \u03c9i represents the parameters of the second level of Classifieri with output Yi.\n\nWe model the conditional joint log likelihood of all the classifiers, i.e., log P (Y1, Y2, . . . , Yn|X), where X is an image belonging to training set \u0393.\n\n\\log \\prod_{X \\in \\Gamma} P(Y_1, Y_2, \\ldots, Y_n | X; \\theta_1, \\theta_2, \\ldots, \\theta_n, \\omega_1, \\omega_2, \\ldots, \\omega_n) \\qquad (1)\n\nDuring training, Y1, Y2, . . . , Yn are all observed (because the ground-truth labels are available). However, Z1, Z2, . . . , Zn (output of layer 1 and input to layer 2) are hidden, and this makes training of each classifier as a black-box hard. Heitz et al. [17] assume that each layer is independent and that each layer produces the best output independently (without consideration for other layers), and therefore use the ground-truth labels for Z1, Z2, . . . , Zn for training the classifiers.\n\nOn the other hand, we want our classifiers to learn jointly, i.e., the first layer classifiers need not perform their best (w.r.t. groundtruth), but rather focus on error modes, which would result in the second layer\u2019s output (Y1, Y2, . . . , Yn) becoming the best. Therefore, we expand Equation 1 as follows, using the independencies represented by the directed graphical model in Figure 1(b).\n\n= \\sum_{X \\in \\Gamma} \\log \\sum_{Z_1, \\ldots, Z_n} P(Y_1, \\ldots, Y_n, Z_1, \\ldots, Z_n | X; \\theta_1, \\ldots, \\theta_n, \\omega_1, \\ldots, \\omega_n) \\qquad (2)\n\n= \\sum_{X \\in \\Gamma} \\log \\sum_{Z_1, \\ldots, Z_n} \\prod_{i=1}^{n} P(Y_i | \\Psi_i(X), Z_1, \\ldots, Z_n; \\omega_i) P(Z_i | \\Psi_i(X); \\theta_i) \\qquad (3)\n\nHowever, the summation inside the log makes it difficult to learn the parameters. Motivated by the Expectation Maximization [8] algorithm, we use an iterative algorithm where we first fix the latent variables Zi\u2019s and learn the parameters in the first step (Feed-forward step), and estimate the latent variables Zi\u2019s in the second step (Feed-back step). We then iterate between these two steps. While this algorithm is not guaranteed to converge to the global maximum, in practice, we find it gives good results. In our experiments, the results of our algorithm are always better than those of [17], which in our formulation is equivalent to fixing the latent variables to ground-truth permanently (thus highlighting the impact of the feedback).\n\nInitialization: We initialize this process by setting the latent variables Zi\u2019s to the groundtruth. Training with this initialization, our cascade is equivalent to CCM in [17], where the classifiers (and the parameters) in the first layer are similar to the original state-of-the-art classifier and the classifiers in the second layer use the outputs of the first layer in addition to the original features.\n\nFeed-forward Step: In this step, we estimate the parameters. We assume that the latent variables Zi\u2019s are known (and Yi\u2019s are known anyway because they are the ground-truth). This results in\n\n\\max_{\\theta_1, \\ldots, \\theta_n, \\omega_1, \\ldots, \\omega_n} \\sum_{X \\in \\Gamma} \\log \\prod_{i=1}^{n} P(Y_i | \\Psi_i(X), Z_1, \\ldots, Z_n; \\omega_i) P(Z_i | \\Psi_i(X); \\theta_i) \\qquad (4)\n\nNow in this feed-forward step, the terms for maximizing the different parameters turn out to be independent. So, for the ith classifier we have:\n\n\\max_{\\omega_i} \\sum_{X \\in \\Gamma} \\log P(Y_i | \\Psi_i(X), Z_1, \\ldots, Z_n; \\omega_i) \\qquad (5)\n\n\\max_{\\theta_i} \\sum_{X \\in \\Gamma} \\log P(Z_i | \\Psi_i(X); \\theta_i) \\qquad (6)\n\nNote that the optimization problem nicely breaks down into the sub-problems of training the individual classifier for the respective sub-tasks. Depending on the specific form of the classifier used for the sub-task (see Section 4 for our implementation), we can use the appropriate training method for each of them. For example, we can use the same training algorithm as the original black-box classifier. Therefore, we consider the original classifiers as black-box and we do not need any low level information about the particular tasks or knowledge of the inner workings of the classifier.\n\nFeed-back Step: In this second step, we will estimate the values of the latent variables Zi\u2019s assuming that the parameters are fixed (and Yi\u2019s are given because the ground-truth is available). This feed-back step is the crux that provides information to the first-layer classifiers about what error modes should be focused on and what can be ignored without hurting the final performance.\n\nWe will perform MAP inference on Zi\u2019s (and not marginalization). This can be considered as a special variant of the general EM framework (hard EM, [26]). Using Equation 3, we get the following optimization problem for the feed-back step:\n\n\\max_{Z_1, \\ldots, Z_n} \\log P(Y_1, \\ldots, Y_n, Z_1, \\ldots, Z_n | X; \\theta_1, \\ldots, \\theta_n, \\omega_1, \\ldots, \\omega_n)\n\n\\Leftrightarrow \\max_{Z_1, \\ldots, Z_n} \\sum_{i=1}^{n} \\left[ \\log P(Z_i | \\Psi_i(X); \\theta_i) + \\log P(Y_i | \\Psi_i(X), Z_1, \\ldots, Z_n; \\omega_i) \\right] \\qquad (7)\n\nThis maximization problem requires that we have access to the characterization of the individual black-box classifiers in a probabilistic form. While at first blush this may seem to be asking a lot, our method can even handle classifiers for which log likelihood is not available. 
We can do this by taking the output of the previous classifiers and modeling their log-odds as a Gaussian (partly motivated by variational approximation methods [14]). Parameters of the Gaussians are empirically estimated when the actual model is not available.\n\nIn some cases, the optimization problem in Equation 7 actually turns out to be convex. For example, if the individual classifiers are linear or logistic classifiers, minimizing the negative log-likelihood is a convex problem and can be solved by gradient descent (or any similar method).\n\nFigure 2: Results showing improvement using the proposed model. All depth maps in depth estimation are at the same scale (black means near and white means far); salient regions in saliency detection are indicated in cyan; geometric labeling: Green = Support, Blue = Sky and Red = Vertical (best viewed in color).\n\nInference. Our FE-CCM is a directed model and inference in these models is straightforward. Maximizing the conditional log likelihood P (Y1, Y2, . . . , Yn|X) corresponds to performing inference over the first layer (using the same inference techniques for the respective black-box classifiers), followed by inference on the second layer.\n\nSparsity and Scaling with a large number of tasks. In Equation 4 we use weight decay (with L-1 penalty on the weights, ||\u03c9||1) to enforce sparsity in the \u03c9\u2019s. With a large number of sub-tasks, the number of the weights in the second layer increases, and our sparsity term results in only a few non-zero (active) connections between sub-tasks.\n\nTraining with Heterogeneous datasets. Often real datasets are disjoint for different tasks, i.e., each datapoint does not have the labels for all the tasks. Our formulation handles this scenario well. We show our formulation for the general case, where we use \u0393i as the dataset that has labels for the ith task. 
Now, we maximize the joint likelihood over all the datapoints, i.e., \\log \\prod_{i=1}^{n} \\prod_{X \\in \\Gamma_i} P(Y_1, \\ldots, Y_n | X). Equation 3 reduces to maximizing the terms below, which is solved using equations in Section 3 with corresponding modification:\n\n\\sum_{i=1}^{n} \\lambda_i \\sum_{X \\in \\Gamma_i} \\log \\sum_{Z_1, \\ldots, Z_n} P(Y_i | \\Psi_i(X), Z_1, \\ldots, Z_n; \\omega_i) \\prod_{j=1}^{n} P(Z_j | \\Psi_j(X); \\theta_j) \\qquad (8)\n\nHere \u03bbi is the tuning parameter that balances the amount of data in different datasets (n = 6 in our experiments).\n\n4 Scene Understanding: Implementation\n\nHere we briefly describe the implementation details for our instantiation of FE-CCMs for scene understanding (footnote 1). Each of the classifiers described below for the sub-tasks is our \u201cbase-model\u201d shown in Table 1. In some sub-tasks, our base-model will be simpler than the state-of-the-art models (that are often hand-tuned for the specific sub-tasks respectively). However, even when using base-models in our FE-CCM, our comparison will still be against the state-of-the-art models for the respective sub-tasks (and on the same standard respective datasets) in Section 5.\n\nIn our preliminary work [22], where we optimized for each target task independently, we considered four vision tasks: scene categorization, depth estimation, event categorization and saliency detection. Please refer to Section 4 in [22] for implementation details. In this work, we add object detection and geometric labeling, and jointly optimize all six tasks.\n\nScene Categorization. For scene categorization, we classify an image into one of the 8 categories defined by Torralba et al. [28]: tall building, inside city, street, highway, coast, open-country, mountain and forest. We define the output of a scene classifier to be an 8-dimensional vector with each element representing the score for each category. 
We evaluate the performance by measuring the accuracy of assigning the correct scene label to an image on the MIT outdoor scene dataset [28].\n\nDepth Estimation. For the single image depth estimation task, we want to estimate the depth d \u2208 R+ of every pixel in an image (Figure 2a). We evaluate the performance of the estimation by computing the root mean square error of the estimated depth with respect to ground truth laser scan depth using the Make3D Range Image dataset [30, 31].\n\nFootnote 1: Space constraints do not allow us to describe each sub-task in detail here, but please refer to the respective state-of-the-art algorithm. Note that the power of our method is in not needing to know the details of the internals of each sub-task.\n\nEvent Categorization. For event categorization, we classify an image into one of the 8 sports events as defined by Li et al. [24]: bocce, badminton, polo, rowing, snowboarding, croquet, sailing and rock-climbing. We define the output of an event classifier to be an 8-dimensional vector with each element representing the log-odds score for each category. For evaluation, we compute the accuracy of assigning the correct event label to an image.\n\nSaliency Detection. Here, we want to classify each pixel in the image to be either salient or non-salient (Figure 2c). We define the output of the classifier as a scalar indicating the saliency confidence score of each pixel. We threshold this saliency score to determine whether the point is salient (+1) or not (\u22121). 
For evaluation, we compute the accuracy of labeling each pixel as salient or non-salient.\n\nObject Detection. We consider the following object categories: car, person, horse and cow. A sample image with the object detections is shown in Figure 2b. We use the train-set and test-set of PASCAL 2006 [9] for our experiments. Our object detection module builds on the part-based detector of Felzenszwalb et al. [10]. We first generate 5 to 100 candidate windows for each image by applying the part-based detector with a low threshold (over-detection). We then extract HOG features [7] on every candidate window and learn an RBF-kernel SVM model as the first layer classifier. The classifier assigns each window a +1 or \u22121 label indicating whether the window belongs to the object or not. For the second-layer classifier, we learn a logistic model over the feature vector constituted by the outputs of all first-level tasks and the original HOG feature. We use average precision to quantitatively measure the performance.\n\nGeometric labeling. The geometric labeling task refers to assigning each pixel to one of three geometric classes: support, vertical and sky (Figure 2d), as defined by Hoiem et al. [20]. We use the dataset and the algorithm by [20] as the first-layer geometric labeling module. In order to reduce the computational time, we avoid the multiple segmentation and instead use a single segmentation with about 100 segments/image. For the second layer, we learn a logistic model over a feature vector which is constituted by the outputs of all first-level tasks and the features used in the first layer. For evaluation, we compute the accuracy of assigning the correct geometric label to a pixel.\n\n5 Experiments and Results\n\nThe proposed FE-CCM model is a unified model which jointly optimizes for all sub-tasks. 
We believe this is a powerful algorithm in that, while independent efforts towards each sub-task have led to state-of-the-art algorithms that require intricate modeling for that specific sub-task, the proposed approach is a unified model which can beat the state-of-the-art performance in each sub-task and can be seamlessly applied across different machine learning domains.\n\nWe evaluate our proposed method on two different domains: scene understanding and robotic grasping. We use the same proposed algorithm in both domains. For each of the sub-tasks in each of the domains, we evaluate our performance on the standard dataset for that sub-task (and compare against the specifically designed state-of-the-art algorithm for that dataset). Note that, with such disjoint yet practical datasets, no image would have ground truth available for more than one task. Our model handles this well.\n\nIn our experiments we evaluate the following algorithms, as listed in Table 1:\n\n\u2022 Base model: Our implementation (Section 4) of the algorithm for the sub-task, which serves as a base model for our FE-CCM. (The base model uses less information than state-of-the-art algorithms for some sub-tasks.)\n\n\u2022 All-features-direct: A classifier that takes all the features of all sub-tasks, appends them together, and builds a separate classifier for each task.\n\n\u2022 State-of-the-art model: The state-of-the-art algorithm for each sub-task respectively on that specific dataset.\n\n\u2022 CCM: The cascaded classifier model by Heitz et al. [17], which we re-implement for six sub-tasks.\n\n\u2022 FE-CCM (unified): This is our proposed model. Note that this is one single model which maximizes the joint likelihood of all sub-tasks.\n\n\u2022 FE-CCM (target specific): Here, we train a specific FE-CCM for each sub-task, by using cross-validation to estimate \u03bbi\u2019s in Equation 8. 
Different values for \u03bbi\u2019s result in different parameters learned for each FE-CCM.\n\nNote that both CCM and All-features-direct use information from all sub-tasks, and state-of-the-art models also use carefully designed models that implicitly capture information from other sub-tasks.\n\nTable 1: Summary of results for the SIX vision tasks. Our method improves performance in every single task. (Note: Bold face corresponds to our model performing better than state-of-the-art.)\n\nModel | Event Categorization (% Accuracy) | Depth Estimation (RMSE in m) | Scene Categorization (% Accuracy) | Saliency Detection (% Accuracy) | Geometric Labeling (% Accuracy) | Object Detection (% Average precision): Car / Person / Horse / Cow / Mean\nImages in testset | 1579 | 400 | 2688 | 1000 | 300 | 2686\nChance | 22.5 | 24.6 | 22.5 | 50 | 33.3 | -\nOur base-model | 71.8 (\u00b10.8) | 16.7 (\u00b10.4) | 83.8 (\u00b10.2) | 85.2 (\u00b10.2) | 86.2 (\u00b10.2) | 62.4 / 36.3 / 39.0 / 39.9 / 44.4\nAll-features-direct | 72.7 (\u00b10.8) | 16.4 (\u00b10.4) | 83.8 (\u00b10.4) | 85.7 (\u00b10.2) | 87.0 (\u00b10.6) | 62.3 / 36.8 / 38.8 / 40.0 / 44.5\nState-of-the-art model (reported) | 73.4, Li [24] | 16.7 (MRF), Saxena [31] | 83.8, Torralba [27] | 82.5 (\u00b10.2), Achanta [1] | 88.1, Hoiem [20] | 61.5 / 36.3 / 39.2 / 40.7 / 44.4, Felzenszwalb et al. [11] (base)\nCCM [17] (our implementation) | 73.3 (\u00b11.6) | 16.4 (\u00b10.4) | 83.8 (\u00b10.6) | 85.6 (\u00b10.2) | 87.0 (\u00b10.6) | 62.2 / 37.0 / 38.8 / 40.1 / 44.5\nFE-CCM (unified) | 74.3 (\u00b10.6) | 15.5 (\u00b10.2) | 85.9 (\u00b10.3) | 86.2 (\u00b10.2) | 88.6 (\u00b10.2) | 63.2 / 37.6 / 40.1 / 40.5 / 45.4\nFE-CCM (target specific) | 74.7 (\u00b10.6) | 15.2 (\u00b10.2) | 86.1 (\u00b10.2) | 87.6 (\u00b10.2) | 88.9 (\u00b10.2) | 63.2 / 38.0 / 40.1 / 40.7 / 45.5\n\n5.1 Scene Understanding\n\nDatasets: The datasets used are mentioned in Section 4, and the number of test images in each dataset is in Table 1. For each dataset we use the same number of training images as the state-of-the-art algorithm (for comparison). We perform 6-fold cross validation on the whole model with 5 of 6 sub-tasks to evaluate the performance on each task. We do not do cross-validation on object detection as the train/test split is standard on the PASCAL 2006 [9] dataset (1277 train and 2686 test images respectively).\n\nResults and discussion: To quantitatively evaluate our method for each of the sub-tasks, we consider the metrics appropriate to each of the six tasks in Section 4. Table 1 shows that FE-CCM not only beats the state of the art in all the tasks but also does it jointly as one single unified model.\n\nIn detail, we see that all-features-direct improves over the base model because it uses features from all the tasks. The state-of-the-art classifiers improve on the base model by explicitly hand-designing the task specific probabilistic model [24, 31] or by using ad-hoc methods to implicitly use information from other tasks [20]. Our FE-CCM model, which is a single model that was not given any manually designed task-specific insight, achieves a more significant improvement over the base model.\n\nWe also observe that our target-specific FE-CCM, which is optimized for each task independently, achieves the best performance, and this is a fairer comparison to the state-of-the-art because each state-of-the-art model is trained specifically for the respective task. Furthermore, Table 1 shows the results for CCM (which is a cascade without feedback information) and all-features-direct (which uses features from all the tasks). 
FE-CCM outperforms both, which indicates that the improvement is due to the proposed feedback and not merely to having more information.\nWe show some visual improvements due to the proposed FE-CCM in Figure 2. In comparison to CCM, FE-CCM leads to better depth estimation of the sky and the ground, better coverage and more accurate labeling of the salient region in the image, and better geometric labeling and object detection. More visual results are provided in the supplementary material.\nFE-CCM allows each classifier in the second layer to learn, in the form of weights, which information from the other first-layer sub-tasks is useful (in contrast to manually using the information shared across sub-tasks, as in some prior works). We provide a visualization of the weights for the six vision tasks in Figure 3-left. The model agrees with our intuition that high weights are assigned to the outputs of the same task from the first-layer classifier (see the high weights assigned to the diagonals in the categorization tasks). Saliency detection is an exception: it depends more on its original features (not shown here) and on the geometric labeling output. We also observe that the weights are sparse. This is an advantage of our approach, since the algorithm automatically figures out which outputs from the first-level classifiers are useful for the second-level classifier to achieve the best performance.\nFigure 3-right provides a closer look at the positive weights given to the various outputs for a second-level geometric classifier. We observe that high positive weights are assigned to \u201cmountain\u201d, \u201cforest\u201d, \u201ctall building\u201d, etc. for supporting the geometric class \u201cvertical\u201d, and similarly to \u201ccoast\u201d, \u201csailing\u201d and \u201cdepth\u201d for supporting the \u201csky\u201d class. 
These illustrate some of the relationships the model learns automatically, without any intricate manual modeling.\n\nFigure 3: (Left) The absolute values of the weight vectors for second-level classifiers, i.e. \u03c9. Each column shows the contribution of the various tasks towards a certain task. (Right) Detailed illustration of the positive values in the weight vector for a second-level geometric classifier. (Note: Blue is low and Red is high.)\n\n5.2 Robotic Grasping\nTo show the applicability of our FE-CCM to problems in other machine learning domains, we also consider the problem of a robot autonomously grasping objects. Given an image and a depth map, the goal of the learning algorithm is to select a point at which to grasp the object (this location is called the grasp point [32]). Different categories of objects can call for different grasping strategies, and therefore in this work we use our FE-CCM to combine object classification and grasping point detection.\nImplementation: We work with the labeled synthetic dataset by Saxena et al. [32], which spans 6 object categories and also includes an aligned pixel-level depth map for each image. For grasp point detection, we use a regression over features computed from the image [32]. The output of the regression is a score for each point, giving the confidence that the point is a good grasping point. For object classification, we use a logistic classifier. The output of the classifier is a 6-dimensional vector representing the log-odds score for each category.\nResults: We evaluate our algorithm on the dataset published in [32], and perform cross-validation to evaluate the performance on each task. Table 2 shows the results for our algorithm\u2019s ability to predict the grasping point, given an image and the depths observed by the robot using its sensors. 
We see that our FE-CCM obtains significantly better performance than all-features-direct and CCM (our implementation). Figure 4 shows our robot grasping an object using our algorithm.\n\nTable 2: Summary of results for the robotic grasping experiment. Our method improves performance in every single task.\nModel | Grasping Point Detection (% accuracy) | Object Classification (% accuracy)\nImages in testset | 6000 | 1200\nChance | 50 | 16.7\nAll features direct | 87.7 | 45.8\nOur base-model | 87.7 | 45.8\nCCM (Heitz et al.) | 90.5 | 49.5\nFE-CCM | 92.2 | 49.7\n\nFigure 4: Our robot grasping an object using our algorithm.\n\n6 Conclusions\nWe propose a method for combining existing classifiers for different but related tasks. We treat the individual classifiers only as \u201cblack boxes\u201d (thus not needing to know the inner workings of each classifier) and propose learning techniques for combining them (thus not needing to know how to combine the tasks). Our method introduces feedback in the training process from the later stage to the earlier one, so that a later classifier can provide the earlier classifiers with information about which error modes to focus on, and which can be ignored without hurting the joint performance.\nWe consider two domains: scene understanding and robotic grasping. Our unified model (a single FE-CCM trained for all the sub-tasks in a domain) significantly improves performance across all the sub-tasks considered, relative to the respective state-of-the-art classifiers. We show that this is the result of our feedback process. The classifier learns meaningful relationships between the tasks automatically. We believe that this is a small step towards holistic scene understanding.\n\nAcknowledgements\nWe thank the Industrial Technology Research Institute in Taiwan and Kodak for their financial support of this research. 
We thank Anish Nahar, Matthew Cong and Colin Ponce for help with the robotic experiments. We also thank John Platt and Daphne Koller for useful discussions.\n\nReferences\n[1] R. Achanta, S. Hemami, F. Estrada, and S. Susstrunk. Frequency-tuned Salient Region Detection. In CVPR, 2009.\n[2] A. Agarwal and B. Triggs. Monocular human motion capture with a mixture of regressors. In IEEE Workshop Vision for HCI, CVPR, 2005.\n[3] Y. Bengio and Y. LeCun. Scaling learning algorithms towards AI. In Large-Scale Kernel Machines, 2007.\n[4] S. C. Brubaker, J. Wu, J. Sun, M. D. Mullin, and J. M. Rehg. On the design of cascades of boosted ensembles for face detection. IJCV, 77(1-3):65\u201386, 2008.\n[5] R. Caruana. Multitask learning. Machine Learning, 28:41\u201375, 1997.\n[6] R. Collobert and J. Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In ICML, 2008.\n[7] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. CVPR, 2005.\n[8] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. J of Royal Stat. Soc., Series B, 39(1):1\u201338, 1977.\n[9] M. Everingham, A. Zisserman, C. K. I. Williams, and L. Van Gool. The PASCAL VOC2006 results.\n[10] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Discriminatively trained deformable part models, release 3. http://people.cs.uchicago.edu/~pff/latent-release3/.\n[11] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. PAMI, 2009.\n[12] Y. Freund and R. E. Schapire. Cascaded neural networks based image classifier. In ICASSP, 1993.\n[13] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In EuroCOLT, 1995.\n[14] M. Gibbs and D. Mackay. 
Variational Gaussian process classifiers. Neural Networks, IEEE Trans, 2000.\n[15] I. Goodfellow, Q. Le, A. Saxena, H. Lee, and A. Ng. Measuring invariances in deep networks. In NIPS, 2009.\n[16] L. Hansen and P. Salamon. Neural network ensembles. PAMI, 12(10):993\u20131001, 1990.\n[17] G. Heitz, S. Gould, A. Saxena, and D. Koller. Cascaded classification models: Combining models for holistic scene understanding. In NIPS, 2008.\n[18] G. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. In N. Comp, 2006.\n[19] D. Hoiem, A. A. Efros, and M. Hebert. Closing the loop on scene interpretation. In CVPR, 2008.\n[20] D. Hoiem, A. A. Efros, and M. Hebert. Putting objects in perspective. IJCV, 2008.\n[21] J. Kittler, M. Hatef, R. P. Duin, and J. Matas. On combining classifiers. PAMI, 20:226\u2013239, 1998.\n[22] A. Kowdle, C. Li, A. Saxena, and T. Chen. A generic model to compose vision modules for holistic scene understanding. In Workshop on Parts and Attributes, ECCV, 2010.\n[23] S. Kumar and M. Hebert. A hierarchical field framework for unified context-based classification. In ICCV, 2005.\n[24] L. Li and L. Fei-Fei. What, where and who? classifying event by scene and object recognition. In ICCV, 2007.\n[25] L.-J. Li, R. Socher, and L. Fei-Fei. Towards total scene understanding: Classification, annotation and segmentation in an automatic framework. In CVPR, 2009.\n[26] R. Neal and G. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. Learning in graphical models, 89:355\u2013368, 1998.\n[27] A. Oliva and A. Torralba. MIT outdoor scene dataset. http://people.csail.mit.edu/torralba/code/spatialenvelope/.\n[28] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. IJCV, 42:145\u2013175, 2001.\n[29] D. Parikh, C. Zitnick, and T. Chen. From appearance to context-based recognition: Dense labeling in small images. CVPR, 2008.\n[30] A. Saxena, S. H. Chung, and A. Y. Ng. Learning depth from single monocular images. In NIPS, 2005.\n[31] A. Saxena, S. H. Chung, and A. Y. Ng. 3-d depth reconstruction from a single still image. IJCV, 76, 2007.\n[32] A. Saxena, J. Driemeyer, J. Kearns, and A. Y. Ng. Robotic grasping of novel objects. In NIPS, 2006.\n[33] A. Saxena, J. Schulte, and A. Y. Ng. Depth estimation using monocular and stereo cues. In IJCAI, 2007.\n[34] A. Saxena, M. Sun, and A. Y. Ng. Make3d: Learning 3d scene structure from a single still image. IEEE PAMI, 30(5), 2009.\n[35] E. B. Sudderth, A. Torralba, W. T. Freeman, and A. S. Willsky. Depth from familiar objects: A hierarchical model for 3d scenes. In CVPR, 2006.\n[36] C. Sutton and A. McCallum. Joint parsing and semantic role labeling. In CoNLL, 2005.\n[37] A. Toshev, B. Taskar, and K. Daniilidis. Object detection via boundary structure segmentation. In CVPR, 2010.\n[38] I. Tsochantaridis, T. Hofmann, and T. Joachims. Support vector machine learning for interdependent and structured output spaces. In ICML, 2004.\n[39] Z. Tu. Auto-context and its application to high-level vision tasks. In CVPR, 2008.\n[40] P. Viola and M. J. Jones. Robust real-time face detection. IJCV, 57(2):137\u2013154, 2004.\n[41] M. Zeiler, D. Krishnan, G. Taylor, and R. Fergus. Deconvolutional networks. In CVPR, 2010.\n", "award": [], "sourceid": 117, "authors": [{"given_name": "Congcong", "family_name": "Li", "institution": null}, {"given_name": "Adarsh", "family_name": "Kowdle", "institution": null}, {"given_name": "Ashutosh", "family_name": "Saxena", "institution": null}, {"given_name": "Tsuhan", "family_name": "Chen", "institution": null}]}