{"title": "$\\theta$-MRF: Capturing Spatial and Semantic Structure in the Parameters for Scene Understanding", "book": "Advances in Neural Information Processing Systems", "page_first": 549, "page_last": 557, "abstract": "For most scene understanding tasks (such as object detection or depth estimation), the classifiers need to consider contextual information in addition to the local features. We can capture such contextual information by taking as input the features/attributes from all the regions in the image. However, this contextual dependence also varies with the spatial location of the region of interest, and we therefore need a different set of parameters for each spatial location. This results in a very large number of parameters. In this work, we model the independence properties between the parameters for each location and for each task, by defining a Markov Random Field (MRF) over the parameters. In particular, two sets of parameters are encouraged to have similar values if they are spatially close or semantically close. Our method is, in principle, complementary to other ways of capturing context such as the ones that use a graphical model over the labels instead. In extensive evaluation over two different settings, of multi-class object detection and of multiple scene understanding tasks (scene categorization, depth estimation, geometric labeling), our method beats the state-of-the-art methods in all the four tasks.", "full_text": "\u03b8-MRF: Capturing Spatial and Semantic Structure in\n\nthe Parameters for Scene Understanding\n\nCongcong Li, Ashutosh Saxena, Tsuhan Chen\nCornell University, Ithaca, NY 14853, United States\n\ncl758@cornell.edu, asaxena@cs.cornell.edu, tsuhan@ece.cornell.edu\n\nAbstract\n\nFor most scene understanding tasks (such as object detection or depth estima-\ntion), the classi\ufb01ers need to consider contextual information in addition to the\nlocal features. 
We can capture such contextual information by taking as input\nthe features/attributes from all the regions in the image. However, this contextual\ndependence also varies with the spatial location of the region of interest, and we\ntherefore need a different set of parameters for each spatial location. This results\nin a very large number of parameters. In this work, we model the independence\nproperties between the parameters for each location and for each task, by de\ufb01ning\na Markov Random Field (MRF) over the parameters. In particular, two sets\nof parameters are encouraged to have similar values if they are spatially close or\nsemantically close. Our method is, in principle, complementary to other ways\nof capturing context such as the ones that use a graphical model over the labels\ninstead. In extensive evaluation over two different settings, of multi-class object\ndetection and of multiple scene understanding tasks (scene categorization, depth\nestimation, geometric labeling), our method beats the state-of-the-art methods in\nall the four tasks.\n\n1 Introduction\nMost scene understanding tasks (e.g., object detection, depth estimation, etc.) require that we exploit\ncontextual information in addition to the local features for predicting the labels. For example, a\nregion is more likely to be labeled as a car if the region below is labeled as road. I.e., we have\nto consider information in a larger area around the region of interest. Furthermore, the location of\nthe region in the image could also have a large effect on its label, and on how it depends on the\nneighboring regions. 
For example, one would look for sky or clouds when looking for an airplane;\nhowever if one sees grass or a runway, then there may still be an airplane (e.g., when the airplane is\non the ground). Here the contextual dependence of the airplane classi\ufb01er changes based on the object\u2019s\nlocation in the image.\nWe can capture such contextual information by using features from all the regions in the image, and\nthen also train a speci\ufb01c classi\ufb01er for each spatial location and each object category. However, the\ndimensionality of the feature space would become quite large,1 and training a classi\ufb01er with limited\ntraining data would not be effective. In such a case, one could reduce the amount of context captured\nto prevent over\ufb01tting. For example, some recent works [22, 33, 37] use context by encoding input\nfeatures, but are limited by the amount of context area they can handle.\nIn our work, we do not want to reduce the amount of context captured. We therefore keep the large\nnumber of parameters, and model the interaction between the parameters of the classi\ufb01ers at different\nlocations and different tasks. For example, the parameters of two neighboring locations are similar.\nThe key contribution of our work is to note that the interaction between two sets of parameters need\nnot carry a directionality. These interactions are sparse, and we represent these interactions as\nan undirected graph where the nodes represent the parameters for each location (for each task) and\nthe edges represent the interaction between the parameters. We call this representation a \u03b8-MRF, i.e.,\na Markov Random Field over the parameters. This idea is, in principle, complementary to previous\nworks that capture context by capturing the correlation between the labels. Note that our goal is not\nto directly compare against such models. Instead, we want to answer the question: How far can we\ngo with just modeling the interactions between the parameters?\n\n1 As an example, consider the problem of object detection with many categories: we have 107 object\ncategories which may occur in any spatial location in the image. Even if we group the regions into 64 (8 \u00d7 8) spatial\nlocations, the total number of parameters will be 107 \u2217 64 \u2217 K (for K features each). This is rather large, e.g.,\nin our multi-class object detection task this number would be about 47.6 million (see Section 4).\n\nThe edges in our \u03b8-MRF not only connect spatial neighbors but also semantic neighbors. In particular,\nif two tasks are highly correlated, their parameters given to the same image context should be\nsimilar. For example, an oven is often next to the dishwasher (in a kitchen scene), therefore they should\nshare similar context, indicating that they can share their parameters. These semantic interactions\nbetween the parameters from different tasks also follow the undirected graph. Just like object labels\nare often modeled as conditionally independent of other non-contextual objects given the important\ncontext, the corresponding parameters can also be modeled similarly.\nThere has been a large body of work that captures contextual information in many different ways\nwhich are often complementary to ours. These methods range from capturing the correlation between\nlabels using a graphical model to introducing different types of priors on the labels (based on\nlocation, prior knowledge, etc.). For example, a graphical model (directed or undirected) is often\nused to model the dependency between different labels [29, 40, 19, 17]. Informative priors on the\nlabels are also commonly used to improve performance (e.g., [47]). 
Some previous works enforce\npriors on the parameters as a directed graph [46, 32], but our model offers a different and perhaps a\nmore relevant perspective than a directed model, in terms of the independence properties modeled.\nWe extensively evaluate our method on two different settings. First, we consider the task of labeling\n107 object categories in the SUN09 dataset, and show that our method gets better performance than\nthe state-of-the-art methods even with simple regression as the learning model. Second, we\nconsider the multiple tasks of scene categorization, depth estimation and geometry labeling, and\nagain show that our method gets performance comparable to or better than the state-of-the-art methods\nwhen we use our method with simple regression. Furthermore, we show that our performance is\nmuch higher than that obtained with other methods of putting priors on the parameters.\n\n2 Related Work\nThere is a large body of work that leverages contextual information. We cannot possibly do justice\nto the literature, but we mention a few works here. Various sources of context have been explored, ranging\nfrom the global scene layout and interactions between regions to local features. To incorporate scene-level\ninformation, Torralba et al. [47] use the statistics of low-level features across the entire scene\nto prime object detection. Hoiem et al. [24] and Saxena et al. [45] use 3D scene information to\nprovide priors on potential object locations. Li et al. [32] propose a hierarchical model to make\nuse of contextual information between tasks on different levels. There are also generic approaches\n[22, 31] that leverage related tasks to boost the overall performance, without requiring considerable\ninsight into speci\ufb01c tasks.\nMany works also model context to capture the local interactions between neighboring regions\n[23, 35, 28], objects [48, 14], or both [16, 10, 2]. 
Object co-occurrence statistics have also been\ncaptured in several ways, e.g., using a CRF [40, 19, 17]. Desai et al. [9] combine individual classi\ufb01ers\nby considering spatial interactions between the object detections, and solve a uni\ufb01ed multi-class\nobject detection problem through a structured discriminative approach. Other ways to share information\nacross categories include sharing representations [12, 30], sharing training examples between\ncategories [36, 15], sharing parameters [26, 27], and so on. Our work lies in the category of sharing\nparameters, aiming at capturing the dependencies in the parameters for relevant vision applications.\nThere are several regularization methods when the number of parameters is quite large, e.g., based\non L2 norms [6] and Lasso shrinkage methods [42]. Liang et al. [34] present an asymptotic analysis\nof smooth regularizers. Recent works [26, 1, 18, 25] place interesting priors on parameters. Jalali\net al. [26] do multi-task learning by expressing the parameters as a sum of two parts: shared and\nspeci\ufb01c to the task, which combines the l\u221e penalty and l1 penalty to get block-sparse and element-wise\nsparse components in the parameters. Negahban and Wainwright [38] provide an analysis of when\nthe l1,\u221e norm could be useful. Kim and Xing [27] use a tree to construct the hierarchy of multi-task\noutputs, and then use the tree-guided group lasso to regularize the multi-task regression. In\ncontemporary work [43], Salakhutdinov et al. learn a hierarchy to share the hierarchical parameters\nfor the object appearance models. Our work is motivated by this direction of work, and our focus\nis to capture spatial and semantic sharing in parameters using undirected graphical models that have\nappropriate independence properties.\n\nFigure 1: The proposed \u03b8-MRF graph with spatial and semantic interaction structure.\n\nBayesian priors over parameters are also quite commonly used. 
For example, [3] uses Dirichlet\npriors for parameters of a multinomial and normal distribution respectively. In fact, there is a huge\nbody of work on using non-informative prior distributions over parameters [4]; this is particularly\nuseful when the amount of data is not enough to train the parameters. If all the distributions involved\n(including the prior distribution) are Gaussian, the parameters follow certain useful statistical hyper\nMarkov properties [41, 21, 8]. In applications, [46] considers capturing relationships between the\nobject categories using a Dirichlet prior on the parameters. [20] considers putting posterior sparsity\non the parameters instead of parameter sparsity. [11] presents a method to learn hyperparameters\nfor CRF-type models. Most of these methods express the prior as another distribution with\nhyperparameters; one can view this as a directed graphical model over the parameters. On the other hand,\nwe express relationships between two parameters of the distribution, which does not necessarily\ninvolve hyperparameters. This also allows us to capture interesting independence properties.\n\n3 Our Approach: \u03b8-MRF\nIn order to give better intuition, we use the multi-class object detection task as an illustrative\nexample. (Later we will describe and apply it to other scene understanding problems.) Let us consider\nK-class object detection. We uniformly divide an image into L grids. We then have a binary\nclassi\ufb01er, whose output $y^{(n)}_{k,\\ell} \\in \\{0, 1\\}$ indicates the presence of the kth object at the \u2113th grid in the\nnth image. Let $x^{(n)}$ be the features (or attributes) extracted from the nth image, and let the parameters\nof the classi\ufb01er be $\\theta_{k,\\ell}$. Let $\\Theta_k = (\\theta_{k,1}, \\cdots, \\theta_{k,L})$ and let $\\Theta$ be the set $\\{\\Theta_k\\}$, $k = 1, \\ldots, K$.\nLet $P(y_{k,\\ell} | x^{(n)}, \\theta_{k,\\ell})$ be the probability of the output given the input features and the parameters.\nIn order to \ufb01nd the classi\ufb01er parameters, one typically solves an optimization problem, such as:\n\n$$\\underset{\\Theta}{\\text{minimize}} \\;\\; \\sum_{n} \\sum_{k,\\ell} -\\log P(y_{k,\\ell} | x^{(n)}, \\theta_{k,\\ell}) + R(\\Theta) \\qquad (1)$$\n\nwhere $R(\\Theta)$ is a regularization term (e.g., $\\lambda \\|\\Theta\\|_2^2$ with $\\lambda$ as a tuning parameter). (In the Bayesian view,\nit is a prior on the parameters that could be informative or non-informative.) Let us use $J(\\theta_{k,\\ell}) =\n-\\log P(y_{k,\\ell} | x^{(n)}, \\theta_{k,\\ell})$ to denote the data-dependent cost for $\\theta_{k,\\ell}$. The exact form of\n$J(\\theta_{k,\\ell})$ would depend on the particular learning model being used over the labels y. For example,\nfor logistic regression it would be\n\n$$J(\\theta_{k,\\ell}) = -\\log \\left( \\left( \\frac{1}{1 + e^{-\\theta_{k,\\ell}^T x^{(n)}}} \\right)^{y_{k,\\ell}} \\left( 1 - \\frac{1}{1 + e^{-\\theta_{k,\\ell}^T x^{(n)}}} \\right)^{(1 - y_{k,\\ell})} \\right).$$\n\nMotivated by the earlier discussion, we want to model the interactions between the parameters of\nthe different classi\ufb01cation models, indexed by {k, \u2113} that we merge into one index {m}.\nIn this work, we represent these interactions as an undirected graph G where each node m represents\nthe parameters $\\theta_m$. The edges E in this graph represent the interaction between two sets\nof parameters $\\theta_i$ and $\\theta_j$. These interactions are often sparse. We call this graph \u03b8-MRF. Eq. 1 can\nnow be viewed as optimizing the energy function of the MRF over the parameters. I.e.,\n\n$$\\underset{\\Theta}{\\text{minimize}} \\;\\; \\sum_{m \\in G} J(\\theta_m) + \\sum_{i,j \\in E} R(\\theta_i, \\theta_j) \\qquad (2)$$\n\nwhere $J(\\theta_m)$ is now the node potential, and the term $R(\\theta_i, \\theta_j)$ corresponds to the edge potentials. 
Note that this idea of an MRF over parameters is complementary to other modeling structures one may impose\nover the y\u2019s, which may themselves form an MRF. This \u03b8-MRF is different from the label-based MRFs whose\nvariables y are often low-dimensional. In our parameter-based MRF, each node constitutes high-dimensional\nvariables \u03b8m. One nice property of having an MRF over parameters is that there is no\nincrease in complexity of the inference problem.\nIn previous work (also see Section 2), several priors have been used on the parameters. Such priors\nare often in the form of imposing a distribution with some other hyperparameters. This corresponds\nto a directed model on \u0398, and in some application scenarios such priors may not be able to express\nthe desired conditional independence properties and therefore may be sub-optimal. Our \u03b8-MRF is\nlargely a non-informative prior, and also corresponds to some regularization methods. See Section 5\nfor experimental comparisons with different forms of priors. Having presented this general notion\nof \u03b8-MRF, we now describe two types of interactions that it models well.\nSpatial interactions. Intuitively, the classi\ufb01ers at neighboring spatial regions (for\nthe same object category) should share their parameters. To model this type of interaction between\nparameters, we introduce edges on the \u03b8-MRF that connect the spatially neighboring nodes, as\nshown in Figure 1-left. Note that the spatial edges only couple the parameters of the same task\ntogether. This type of edge does not exist across tasks. 
We de\ufb01ne the edge potential as follows.\n\n$$R(\\theta_i, \\theta_j) = \\begin{cases} \\lambda_{spt} \\|\\theta_i - \\theta_j\\|_p & \\text{if } \\theta_i \\text{ and } \\theta_j \\text{ are spatial neighbors for a task} \\\\ 0 & \\text{otherwise} \\end{cases}$$\n\nwhere $\\lambda_{spt}$ is a tuning factor for the spatial interactions. When p \u2265 1, this potential has the nice\nproperty of being convex. Note that such a potential has been extensively used in an MRF over\nlabels, e.g., [44]. Note that this potential does not make the original learning problem in Equation 1\nany \u201charder.\u201d In fact, if the original objective $J(\\theta)$ is convex, then the overall problem still remains\nconvex. In this work, we consider p = 1 and p = 2.\nIn addition to connecting the parameters for neighboring locations, we also encourage the sharing\nbetween the elements of a parameter vector that correspond to spatially neighboring inputs. The\nintuition is described in the following example. Assume we have the presence of the object \u201croad\u201d\nat the different regions of an image as attributes. In order to learn a car detector with these attributes\nas inputs, we would like to give similar high weights to the neighboring regions in the car detector\noutput. We call this source-based spatial grouping, as compared to target-based spatial grouping\nthat we described in the previous paragraph. We found that this also gives us a contextual map\n(i.e., parameters that map the feature/attributes in the neighboring regions) that is more spatially\nstructured. This interaction happens within the same node in the graph, therefore it is equivalent to\nadding an extra term to the node potential on the \u03b8-MRF.\n\n$$J_{new}(\\theta_m) = J(\\theta_m) + \\lambda_{src} \\sum_{t_1} \\sum_{t_2 \\in Nr(t_1)} \\|\\theta_m^{t_1} - \\theta_m^{t_2}\\|_p \\qquad (3)$$\n\nwhere $\\theta_m^{t_1}$ and $\\theta_m^{t_2}$ correspond to the weights given to the $t_1$th and the $t_2$th feature inputs, and $t_2 \\in Nr(t_1)$\nmeans that the respective features are the same type of attributes from neighboring regions. Equation\n3 can be rewritten as $J_{new}(\\theta_m) = J(\\theta_m) + \\lambda_{src} \\|T\\theta_m\\|_p$, where T is the linear transform matrix\nthat computes the differences between neighbors and $\\lambda_{src}$ is a tuning factor for the source interactions.\nSemantic interactions. We not only connect the parameters for spatial neighbors of the same task,\nbut also consider the semantic neighbors across tasks. Motivated by the conditional independence\nin the object labels, which suggests that given the important context the presence of an object is\nindependent of other non-contextual objects, we can encode such properties in our \u03b8-MRF. For example,\nthe road often appears below the car. Note that in our framework we have the road classi\ufb01er and\nthe car classi\ufb01er take the same features as input, which are extracted from all regions of the images\nto capture long-range context. Because these two objects co-occur frequently, their corresponding\ndetectors should be activated simultaneously. Therefore, the parameter for detecting \u201croad\u201d at a bottom\nregion of the image can be partly shared with the parameter for detecting \u201ccar\u201d above the bottom\nregion. Assuming we already know the dependency between the objects, we introduce the semantic\nedge potential of the \u03b8-MRF, as shown in Figure 1-right.\n\n$$R(\\theta_i, \\theta_j) = \\begin{cases} \\lambda_{smn} w_{ij} \\|\\theta_i - \\theta_j\\|_p & \\text{if } \\theta_i \\text{ and } \\theta_j \\text{ are semantic neighbors} \\\\ 0 & \\text{otherwise} \\end{cases}$$\n\nwhere $w_{ij}$ indicates the strength of the semantic dependency between these two parameters and\n$\\lambda_{smn}$ is a tuning factor for the semantic interactions. In the following we discuss how to \ufb01nd the\nsemantic connections and the weights w.\nFinding the semantic neighbors. 
We \ufb01rst calculate the positive correlations between the tasks from\nthe ground-truth training data. If two tasks are highly positively correlated, they are likely to share\nsome of the parameters. In order to model how they share parameters, we model the relative spatial\nrelationship between the positive outputs of the two tasks. For example, assume we have two highly\nco-occurring object categories, indexed by $k_1$ and $k_2$. From the training data, we learn the relative\nspatial distribution map of the presence of the $k_2$th object, given the $k_1$th object in the center. We then\n\ufb01nd the top M highest response regions on the map, each of which has a relative location $\\Delta\\ell$\nand co-occurring response w. Therefore, the parameters of the $k_2$th object that satisfy these relative\nlocations have semantic edges with $\\theta_{k_1,\\ell_1}$.\n\nFigure 2: An instantiation of the proposed algorithm for the object recognition tasks on SUN09 dataset.\n\nLearning and Optimization. R(\u0398) couples the otherwise independent parameters. Typically, the\ntotal number of parameters is quite large in an application (e.g., 47.6 million in one of our applications,\nsee Section 4). Running an optimization algorithm jointly on all the parameters would either\nnot be feasible or have very slow convergence in practice. Since the parameters follow conditional\nindependence assumptions and also follow a nice topological structure, we can optimize more connected\nsubsets of the parameters separately, and then iterate. These separate sub-problems can also\nrun in parallel. 
In our implementation, the R(\u0398) terms and J(\u03b8m) are convex, and such a decomposed\nalgorithm for optimizing the parameters is guaranteed to converge to the global optimum [5].\n\n4 Applications\nWe apply our \u03b8-MRF on two different settings: 1) object detection on the SUN09 dataset [7]; 2)\nmultiple scene understanding tasks (scene categorization, geometric labeling, depth estimation),\ncomparing to the cascaded classi\ufb01cation models (CCM) [22, 31].\nObject Detection. The task of object detection is to recognize and localize objects of interest in an\nimage. We use the SUN09 dataset introduced in [7], which has 4,367 training images and 4,317 test\nimages. Choi et al. [7] use an additional set of 26,000 images to train baseline detectors [13], and\nselect 107 object categories to evaluate their contextual model. We follow the same settings as [7],\ni.e., we use the same baseline object detector outputs as the attribute inputs for our algorithm, the\nsame training/testing data, and the same evaluation metrics. For evaluation, a predicted bounding\nbox is considered correct if it overlaps the ground-truth bounding box (in the intersection/union\nsense) by more than 50%. We compute the average precision (AP) of the precision-recall curve for\neach category, and compute the mean AP across categories as the overall performance.\nWe use each of the baseline object detectors to produce an 8 \u00d7 8 detection map, with each element\nindicating the con\ufb01dence (between 0 and 1) of the object\u2019s presence at the respective region. We also\nde\ufb01ne 107 scene categories, where the ith (i = 1, . . . , 107) scene category indicates the type of scene\ncontaining the ith object category. We train a logistic regression classi\ufb01er for each scene category.\nThe 107 8 \u00d7 8 object maps and the 107 scene classi\ufb01er outputs together form a 6955-dimensional\nfeature vector (107 \u2217 64 + 107 = 6955), which serves as the attribute input for our algorithm. 
The setup is shown in Figure 2.\nWe divide an image into 8 \u00d7 8 regions. Our algorithm learns a region-speci\ufb01c contextual model\nfor each object category, resulting in a speci\ufb01c classi\ufb01er of each region for each category. The\n8 \u00d7 8 division is determined based on the criterion that more than 70% of the training data contain\nbounding boxes no smaller than a single grid. We use a linear model for each classi\ufb01er. So we have\n6955 \u2217 8 \u2217 8 \u2217 107 = 47627840 parameter dimensions in total. Our \u03b8-MRF captures the independencies\nbetween these parameters based on location and semantics. The \u2113th region is labeled as positive\nfor the kth object category if it satis\ufb01es $\\text{overlap}(O_k, R_\\ell) / \\min(\\text{area}(R_\\ell), \\text{area}(O_k)) > 0.3$, where\n$O_k$ means a bounding-box instantiation of the kth object category and $R_\\ell$ means the \u2113th grid cell.\nNegative examples are sampled from the false positives of the baseline detectors. We apply the\ntrained classi\ufb01ers to the test images, and obtain the object detection maps. To create bounding-box\nbased results, we use the candidate bounding boxes created by the baseline detectors, and average the\nscores obtained from our algorithm within the bounding box as the con\ufb01dence score for the candidate.\nMultiple Scene Understanding Tasks. We consider the task of estimating different types of labels\nin a scene: scene categorization, geometry labeling, and depth estimation. We compose these three\ntasks in the feed-forward cascaded classi\ufb01cation models (CCM) [22]. CCM creates repeated instantiations\nof each classi\ufb01er on multiple layers of a cascade, where the later-layer classi\ufb01ers take the\noutputs of the previous-layer classi\ufb01ers as input. 
The previous CCM algorithms [22, 31] consider\nsharing information across tasks, but do not consider the sharing between categories or between\ndifferent spatial regions within a task. Here we introduce the semantically-grouped regularization to\nscene categorization, and the spatially-grouped regularization to depth and geometry estimation.\n\nTable 1: Performance of object recognition and detection on SUN09 dataset.\n\nModel | Object Recognition (% AP) | Object Detection (% AP)\nChance | 5.34 | N/A\nBaseline (w/o context) | 17.9 | 7.06\nSingle model per object | 22.3 | 8.02\nIndependent model | 22.9 | 8.18\nState-of-the-art [7] | 25.2 | 8.33\n\u03b8-MRF (l2-regularized) | 26.4 | 8.76\n\u03b8-MRF (l1-regularized) | 27.0 | 8.93\n\nTable 2: Performance of scene categorization, geometric labeling, and depth estimation in CCM.\n\nModel | Scene Categorization (% AP) | Geometric Labeling (% AP) | Depth Estimation (RMSE in m)\nChance | 22.5 | 33.3 | 24.6\nBaseline (w/o context) | 83.8 | 86.2 | 16.7\nState-of-the-art [31] | 86.1 | 88.9 | 15.2\nCCM [22] (our implementation) | 83.8 | 87.0 | 16.5\n\u03b8-MRF (l2-regularized) | 85.7 | 88.6 | 15.3\n\u03b8-MRF (l1-regularized) | 86.3 | 89.2 | 15.2\n\nFor the three tasks we consider, we use the same datasets and 2-layer settings as [31]. 
For scene\ncategorization, we classify 8 different categories on the MIT outdoor scene dataset [39]. We consider\ntwo semantic groups: man-made (tall building, inside city, street, highway) and natural (coast, open\ncountry, mountain and forest). Semantic edges are introduced between the parameters within each\ngroup. We train a logistic classi\ufb01er for each scene category. This gives us a total of 8 parameter\nvectors for the scene categorization task. We evaluate the performance by measuring the accuracy of\nassigning the correct scene label to an image.\nFor depth estimation, we train a speci\ufb01c linear regression model for every region of the image (with\nuniformly divided 11 \u00d7 10 regions), and incorporate the spatial grouping on both the second-layer\ninputs and outputs. This gives us a total of 110 parameter vectors for the depth estimation task.\nWe evaluate the performance by computing the root mean square error of the estimated depth with\nrespect to ground truth laser scan depth using the Make3D Range Image dataset [44].\nFor geometry labeling, we use the dataset and the algorithm by [24] as the \ufb01rst-layer geometric\nlabeling module, and use a single segmentation with about 100 segments/image. On the second\nlayer, we train a logistic regression classi\ufb01er for every region of the image (with uniformly divided\n16 \u00d7 16 regions), and incorporate the spatial grouping on both the second-layer inputs and outputs.\nThis gives us a total of 768 parameter vectors. We then assign the geometric label to each segment\nbased on the average con\ufb01dence scores within the segment. 
We evaluate the performance by\ncomputing the accuracy of assigning the correct geometric label to a pixel.\n\n5 Experiments\nWe evaluate the proposed algorithm on two applications: (1) object recognition and detection on\nthe SUN09 dataset with 107 object categories; (2) the multi-task cascaded structure that composes\nscene categorization, depth estimation and geometric labeling on multiple datasets as described in\nSection 4. The training of our algorithm takes 6-7 hours for object detection/recognition and 3-4\nhours for the multi-task cascade. The attribute models in (1) and the \ufb01rst-layer base classi\ufb01ers in (2) are\npre-trained. The complexity of our inference is no more than a constant factor times the complexity of\ninference of an individual classi\ufb01er. Furthermore, the inference for different classi\ufb01ers can be easily\nparallelized. For example, a base object detector [13] takes about 1.5 seconds to output results for an\nimage. Our algorithm, taking the outputs of the base detectors as input, only requires an overhead\nof less than 0.2 seconds.\n5.1 Overall performance on multiple tasks in CCM structure.\nTable 2 shows the performance of different methods on the three tasks composed into the cascaded\nclassi\ufb01cation model (CCM) [22]. \u201cBaseline\u201d means the individual classi\ufb01er for each task on the \ufb01rst\nlayer, \u201cState-of-the-art\u201d corresponds to the state-of-the-art algorithm for each sub-task respectively\nfor that speci\ufb01c dataset, and \u201cCCM\u201d corresponds to the second-layer output for each sub-task in the\nCCM structure. The results are computed as the average performance over 6-fold cross validation.\nWith the semantic and spatial regularization, our proposed \u03b8-MRF algorithm improves signi\ufb01cantly\nover the CCM algorithm that also uses the same set of tasks for prediction. 
Finally, we perform\nbetter than the state-of-the-art algorithms on two tasks and comparably for the third.\nIs \u03b8-MRF \u201ccomplementary\u201d to label-MRF? In this experiment, we also consider the MRF over\nlabels [44] together with our \u03b8-MRF for depth estimation. The combination results in a lower\nroot-mean-square error (RMSE) of 15.0m as compared to 15.2m for \u03b8-MRF alone and 16.0m for\nlabel-MRF alone. This indicates that our method is complementary to the traditional MRF over labels.\n5.2 Overall performance on SUN09 object detection.\nTable 1 gives the performance of different methods on the SUN09 dataset, for both object recognition\n(predicting the object presence) and object detection (predicting the object location).\n\nFigure 3: Examples showing that infrequent object categories share parameters with frequent object categories.\n\n- Baseline (w/o context): the baseline object detectors trained by [13], which are also used to generate\nthe initial detection results used as inputs for our algorithm and the state-of-the-art algorithm.\n- Single model: a single classi\ufb01er is trained for each object category, not varying across different\nlocations. In the following, if not speci\ufb01ed, we use an l1-regularized linear regression as the classi\ufb01er.\n- Independent model: this means an independent classi\ufb01er is trained for the presence of an object for\neach region. There is no information sharing between the models belonging to different locations of\nthe same category, or different categories.\n- State-of-the-art: This is the tree-based graphical model proposed in [7], which explicitly models the\nobject dependencies based on labels and detector outputs.2\n- The proposed \u03b8-MRF algorithm, which shares the models spatially within an object category and\nsemantically across various objects. 
We evaluate both the l1 and l2 regularization on the potentials.\n\nTable 1 shows that the location-speci\ufb01c model (Independent) is better than the general model (Single\nmodel), which con\ufb01rms our intuition that the contextual model is location-speci\ufb01c. Furthermore,\nour approach that shares parameters spatially and semantically signi\ufb01cantly outperforms the\nindependent model without these regularizations. We also note that our algorithm achieves performance\ncomparable to the state-of-the-art algorithm, without explicitly modeling the probabilistic\ndependency between the object labels.\nWe study the relative improvement of the proposed parameter-sharing algorithm over the\nnon-parameter-sharing algorithm (Independent model in Table 1) on object categories with different\nnumbers of training samples in the SUN09 object recognition task. The relative improvement on\nobject categories with fewer than 200 training samples is 34.2%, while the improvement on objects\nwith more than 200 training samples is 11.5%. Our parameter-sharing algorithm helps the infrequent\nobjects implicitly make use of the data of frequent objects to learn better models.\nWe give two examples in Fig. 3, focusing on two infrequent object categories: van and awning,\nrespectively. The histogram in each \ufb01gure shows the number of training instances for each object\ncategory. The color bar shows the correlation between the learned parameters of the object and the\nparameters of other objects. Redder colors indicate higher correlation between the parameters\nof the respective categories. Figure 3-left shows that the van category, which has few training instances,\nturns out to share parameters strongly with the categories car, building and road. Similarly,\nFigure 3-right shows how the learned awning parameters correlate with those of other categories. 
We note that in the\ndataset, awning and streetlight do not co-occur frequently, so when we initially create the semantic\ngroups, these two objects do not appear together in any group. However, the semantic groups\ncontaining streetlight and the semantic groups containing awning both contain objects such as road,\nbuilding, and car. Through our \u03b8-MRF algorithm, the shared information can be transferred.\nEffect of different priors. We compare our spatially-grouped and semantically-grouped regularization\nwith other parameter-sharing algorithms such as the prior-based algorithms in Figure 4.\n\nFigure 4: Some baseline prior-based algorithms that we compare the proposed algorithm with. From left to right:\nthese models use a global prior, a spatial-based prior, and a semantic-based prior.\n\n2 We evaluate the contextual model in [7] using the software published by the authors:\nhttp://web.mit.edu/~myungjin/www/HContext.html and report the average performance over multiple runs.\n\n