{"title": "Contextual Models for Object Detection Using Boosted Random Fields", "book": "Advances in Neural Information Processing Systems", "page_first": 1401, "page_last": 1408, "abstract": null, "full_text": " Contextual models for object detection using\n boosted random fields\n\n\n\n Antonio Torralba Kevin P. Murphy William T. Freeman\n MIT, CSAIL UBC, CS MIT, CSAIL\n Cambridge, MA 02139 Vancouver, BC V6T 1Z4 Cambridge, MA 02139\n torralba@mit.edu murphyk@cs.ubc.edu billf@mit.edu\n\n Abstract\n\n We seek to both detect and segment objects in images. To exploit both lo-\n cal image data as well as contextual information, we introduce Boosted\n Random Fields (BRFs), which uses Boosting to learn the graph struc-\n ture and local evidence of a conditional random field (CRF). The graph\n structure is learned by assembling graph fragments in an additive model.\n The connections between individual pixels are not very informative, but\n by using dense graphs, we can pool information from large regions of\n the image; dense models also support efficient inference. We show how\n contextual information from other objects can improve detection perfor-\n mance, both in terms of accuracy and speed, by using a computational\n cascade. We apply our system to detect stuff and things in office and\n street scenes.\n1 Introduction\n\nOur long-term goal is to build a vision system that can examine an image and describe what\nobjects are in it, and where. In many images, such as Fig. 5(a), objects of interest, such as\nthe keyboard or mouse, are so small that they are impossible to detect just by using local\nfeatures. Seeing a blob next to a keyboard, humans can infer it is likely to be a mouse; we\nwant to give a computer the same abilities.\n\nThere are several pieces of related work. Murphy et al [9] used global scene context to\nhelp object recognition, but did not model relationships between objects. 
Fink and Perona [4] exploited local dependencies in a boosting framework, but did not allow for multiple rounds of communication between correlated objects. He et al [6] do not model connections between objects directly, but rather induce such correlations indirectly, via a bank of hidden variables, using a "restricted Boltzmann machine" architecture.

In this paper, we exploit contextual correlations between object classes by introducing Boosted Random Fields (BRFs). Boosted random fields build on both boosting [5, 10] and conditional random fields (CRFs) [8, 7, 6]. Boosting is a simple way of sequentially constructing "strong" classifiers from "weak" components, and has been used for single-class object detection with great success [12]. Dietterich et al [3] combine boosting and 1D CRFs, but they only consider the problem of learning the local evidence potentials; we consider the much harder problem of learning the structure of a 2D CRF.

Standard applications of MRFs/CRFs to images [7] assume a 4-nearest-neighbor grid structure. While successful in low-level vision, this structure fails to capture important long-distance dependencies between whole regions and across classes. We propose a method for learning densely connected random fields with long-range connections. The topology of these connections is chosen by a weak learner which has access to a library of graph fragments, derived from patches of labeled training images, which reflect typical spatial arrangements of objects (similar to the segmentation fragments in [2]). At each round of the learning algorithm, we add more connections from other locations in the image and from other classes (detectors). 
The connections are assumed to be spatially invariant, which means this update can be performed using convolution followed by a sigmoid nonlinearity. The resulting architecture is similar to a convolutional neural network, although we use a stagewise training procedure, which is much faster than back-propagation.

In addition to recognizing things, such as cars and people, we are also interested in recognizing spatially extended "stuff" [1], such as roads and buildings. The traditional sliding-window approach to object detection does not work well for detecting "stuff". Instead, we combine object detection and image segmentation (c.f. [2]) by labeling every pixel in the image. We do not rely on a bottom-up image segmentation algorithm, which can be fragile without top-down guidance.

2 Learning potentials and graph structure

A conditional random field (CRF) is a distribution of the form

    P(S|x) = \frac{1}{Z} \prod_i \phi_i(S_i) \prod_{j \in N_i} \psi_{i,j}(S_i, S_j)

where x is the input (e.g., image), N_i are the neighbors of node i, and S_i are labels. We have assumed pairwise potentials for notational simplicity. Our goal is to learn the local evidence potentials \phi_i, the compatibility potentials \psi, and the set of neighbors N_i.

We propose the following simple approximation: use belief propagation (BP) to estimate the marginals, P(S_i|x), and then use boosting to maximize the likelihood of each node's training data with respect to \phi_i and \psi.

In more detail, the algorithm is as follows. At iteration t, the goal is to minimize the negative log-likelihood of the training data. As in [11], we consider the per-label loss (i.e., we use marginal probabilities), as opposed to requiring that the joint labeling be correct (as in Viterbi decoding). 
Hence the cost function to be minimized is

    J^t = \sum_i J_i^t = -\sum_i \prod_m b_{i,m}^t(S_{i,m}) = -\sum_i \prod_m b_{i,m}^t(+1)^{\hat{S}_{i,m}} \, b_{i,m}^t(-1)^{1-\hat{S}_{i,m}}    (1)

where S_{i,m} \in \{-1,+1\} is the true label for pixel i in training case m, \hat{S}_{i,m} = (S_{i,m}+1)/2 \in \{0,1\} is just a relabeling, and b_{i,m}^t = [P(S_i=-1|x_m,t), P(S_i=+1|x_m,t)] is the belief state at node i given input image x_m after t iterations of the algorithm.

The belief at node i is given by the following (dropping the dependence on case m): b_i^t(1) \propto \phi_i^t(1) M_i^t(1), where M_i^t is the product of all the messages coming into i from all its neighbors at time t, and where the message that k sends to i is given by

    M_i^{t+1}(1) = \prod_{k \in N_i} m_{ki}^{t+1}(1), \qquad m_{ki}^{t+1}(1) = \sum_{s_k \in \{-1,+1\}} \psi_{k,i}(s_k, 1) \, \frac{b_k^t(s_k)}{m_{ik}^t(s_k)}    (2)

where \psi_{k,i} is the compatibility between nodes k and i. If we assume that the local potentials have the form \phi_i^t(s_i) = [e^{F_i^t/2}; e^{-F_i^t/2}], where F_i^t is some function of the input data, then:

    b_i^t(+1) = \sigma(F_i^t + G_i^t), \qquad G_i^t = \log M_i^t(+1) - \log M_i^t(-1)    (3)

where \sigma(u) = 1/(1+e^{-u}) is the sigmoid function. Hence each term in Eq. 1 simplifies to a cost function similar to that used in boosting:

    \log J_i^t = \sum_m \log\left(1 + e^{-S_{i,m}(F_{i,m}^t + G_{i,m}^t)}\right).    (4)

1. Input: a set of labeled pairs \{x_{i,m}, S_{i,m}\}, bound T.
   Output: local evidence functions f_i^t(x) and message update functions g_i^t(b_{N_i}).

2. Initialize: b_{i,m}^{t=0} = 0; F_{i,m}^{t=0} = 0; G_{i,m}^{t=0} = 0.

3. 
For t = 1..T:

   (a) Fit the local potential f_i^t(x_{i,m}) by weighted least squares to Y_{i,m}^t = S_{i,m}(1 + e^{-S_{i,m}(F_{i,m}^t + G_{i,m}^t)}).

   (b) Fit the compatibilities g_i^t(b_{N_i,m}^{t-1}) to Y_{i,m}^t by weighted least squares.

   (c) Compute the local potential F_{i,m}^t = F_{i,m}^{t-1} + f_i^t(x_{i,m}).

   (d) Compute the compatibilities G_{i,m}^t = \sum_{n=1}^t g_i^n(b_{N_i,m}^{t-1}).

   (e) Update the beliefs b_{i,m}^t = \sigma(F_{i,m}^t + G_{i,m}^t).

   (f) Update the weights w_{i,m}^{t+1} = b_{i,m}^t(-1) \, b_{i,m}^t(+1).

Figure 1: BRF training algorithm.

We assume that the graph is very densely connected, so that the information that one single node sends to another is so small that we can make the approximation m_{ki}^{t+1}(+1)/m_{ki}^{t+1}(-1) \approx 1. (This is a reasonable approximation in the case of images, where each node represents a single pixel; only when the influence of many pixels is taken into account do the messages become informative.) Hence

    G_i^{t+1} = \log \frac{M_i^{t+1}(+1)}{M_i^{t+1}(-1)} = \log \frac{\prod_k \sum_{s_k \in \{-1,+1\}} \psi_{k,i}(s_k,+1) \, b_{k,m}^t(s_k)/m_{ik}^t(s_k)}{\prod_k \sum_{s_k \in \{-1,+1\}} \psi_{k,i}(s_k,-1) \, b_{k,m}^t(s_k)/m_{ik}^t(s_k)}    (5)

            \approx \sum_k \log \frac{\sum_{s_k \in \{-1,+1\}} \psi_{k,i}(s_k,+1) \, b_{k,m}^t(s_k)}{\sum_{s_k \in \{-1,+1\}} \psi_{k,i}(s_k,-1) \, b_{k,m}^t(s_k)}    (6)

With this simplification, G_i^{t+1} is now a non-linear function G_i^{t+1}(b_m^t) of the beliefs at iteration t. Therefore, we can write the beliefs at iteration t as a function of the local evidences and the beliefs at time t-1: b_i^t(+1) = \sigma(F_i^t(x_{i,m}) + G_i^t(b_m^{t-1})). The key idea behind BRFs is to use boosting to learn the G functions, which approximately implement message passing in densely connected graphs. We explain this in more detail below.

2.1 Learning local evidence potentials

Defining F_i^t(x_{i,m}) = F_i^{t-1}(x_{i,m}) + f_i^t(x_{i,m}) as an additive model, where x_{i,m} are the features of training sample m at node i, we can learn this function in a stagewise fashion by optimizing the second-order Taylor expansion of Eq. 
4 with respect to f_i^t, as in logitBoost [5]:

    \arg\min_{f_i^t} \log J_i^t \approx \arg\min_{f_i^t} \sum_m w_{i,m}^t \left(Y_{i,m}^t - f_i^t(x_{i,m})\right)^2    (7)

where Y_{i,m}^t = S_{i,m}(1 + e^{-S_{i,m}(F_{i,m}^t + G_{i,m}^t)}). In the case that the weak learner is a "regression stump", f_i(x) = a h(x) + b, we can find the optimal a, b by solving a weighted least squares problem, with weights w_{i,m}^t = b_i^t(-1) \, b_i^t(+1); we can find the best basis function h(x) by searching over all elements of a dictionary.

2.2 Learning compatibility potentials and graph structure

In this section, we discuss how to learn the compatibility functions \psi_{ij}, and hence the structure of the graph. Instead of learning the compatibility functions \psi_{ij}, we propose to learn directly the function G_i^{t+1}.

1. Input: a set of inputs \{x_{i,m}\} and functions f_i^t, g_i^t.
   Output: set of beliefs b_{i,m} and MAP estimates \hat{S}_{i,m}.

2. Initialize: b_{i,m}^{t=0} = 0; F_{i,m}^{t=0} = 0; G_{i,m}^{t=0} = 0.

3. For t = 1 to T, repeat:

   (a) Update the local evidences F_{i,m}^t = F_{i,m}^{t-1} + f_i^t(x_{i,m}).

   (b) Update the compatibilities G_{i,m}^t = \sum_{n=1}^t g_i^n(b_{N_i,m}^{t-1}).

   (c) Compute the current beliefs b_{i,m}^t = \sigma(F_{i,m}^t + G_{i,m}^t).

4. The output classification is \hat{S}_{i,m} = b_{i,m}^t > 0.5.

Figure 2: BRF run-time inference algorithm.

We propose to use an additive model for G_i^{t+1}, as we did for learning F: G_{i,m}^{t+1} = \sum_{n=1}^t g_i^n(b_m^t), where b_m^t is a vector with the beliefs of all nodes in the graph at iteration t for the training sample m. 
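The run-time procedure of Fig. 2 can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the weak learners are passed in as plain callables, and the shapes and names are assumptions for the sake of the example.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def brf_inference(x, f_rounds, g_rounds):
    """Sketch of BRF run-time inference (cf. Fig. 2).

    x        : input features, shape (n_nodes,) here for simplicity
    f_rounds : list of callables f_t(x) -> local-evidence increment, shape (n_nodes,)
    g_rounds : list of callables g_n(b) -> message contribution,     shape (n_nodes,)
    Returns the beliefs b and the hard labels S.
    """
    n = x.shape[0]
    F = np.zeros(n)   # accumulated local evidence F^t
    b = np.zeros(n)   # beliefs b_i(+1), initialised to 0 as in the paper
    for t, f_t in enumerate(f_rounds):
        F += f_t(x)                                # (a) update local evidence
        G = sum(g(b) for g in g_rounds[:t + 1])    # (b) all weak learners on current beliefs
        b = sigmoid(F + G)                         # (c) belief update
    S = b > 0.5                                    # output classification
    return b, S
```

Note that, unlike standard boosting, the compatibility terms g_n are re-evaluated at every round on the current beliefs, which is what implements the rounds of message passing.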
The weak learners g_i^n(b_m^t) can be regression stumps of the form g_i^n(b_m^t) = a \, \delta(w^T b_m^t > \theta) + b, where a, b, \theta are the parameters of the regression stump, and w is a set of weights selected from a dictionary. In the case of a graph with weak and almost symmetrical connections (which holds if \psi(s_1, s_2) \approx 1 for all (s_1, s_2), which implies the messages are not very informative), we can further simplify the function G_i^{t+1} by approximating it as a linear function of the beliefs:

    G_{i,m}^{t+1} = \sum_{k \in N_i} \alpha_{k,i} \, b_{k,m}^t(+1) + \beta_{k,i}    (8)

This step reduces the computational cost. The weak learners g_i^n(b_m^t) will also be linear functions. Hence the belief update simplifies to b_{i,m}^{t+1}(+1) = \sigma(\alpha_i \cdot b_m^t + \beta_i + F_{i,m}^t), which is similar to the mean-field update equations. The neighborhood N_i over which we sum incoming messages is determined by the graph structure, which is encoded in the non-zero values of \alpha_i. Each weak learner g_i^n will compute a weighted combination of the beliefs of some subset of the nodes; this subset may change from iteration to iteration, and can be quite large. At iteration t, we choose the weak learner g_i^t so as to minimize

    \log J_i^t(b^{t-1}) = \sum_m \log\left(1 + e^{-S_{i,m}\left(F_{i,m}^t + g_i^t(b_m^{t-1}) + \sum_{n=1}^{t-1} g_i^n(b_m^{t-1})\right)}\right)

which reduces to a weighted least squares problem similar to Eq. 7. See Fig. 1 for the pseudo-code for the complete learning algorithm, and Fig. 2 for the pseudo-code for run-time inference.

3 BRFs for multiclass object detection and segmentation

With the BRF training algorithm in hand, we describe our approach for multiclass object detection and region-labeling using densely connected BRFs.

3.1 Weak learners for detecting stuff and things

The square sliding-window approach does not provide a natural way of working with irregular objects. Using region labeling as an image representation allows dealing with irregular and extended objects (buildings, bookshelf, road, ...). 
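Since the connections are spatially invariant, the linearised update of Eq. 8 amounts to convolving per-class belief maps with kernels and passing the result through the sigmoid. The sketch below illustrates this; the class names, kernel contents and map sizes are assumptions for the example, not values from the paper.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def conv2_same(a, k):
    """Plain 2-D convolution with zero padding, output the same size as a.
    Assumes an odd-sized kernel k."""
    kh, kw = k.shape
    pad = np.pad(a, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    kf = k[::-1, ::-1]  # flip the kernel for true convolution
    out = np.zeros_like(a, dtype=float)
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            out[i, j] = np.sum(pad[i:i + kh, j:j + kw] * kf)
    return out

def update_beliefs(F, beliefs, kernels, alpha=0.0):
    """One round of the linearised update of Eq. 8, in convolutional form:
    G_c = sum_c' beliefs[c'] * W[(c, c')] + alpha;  b_c = sigmoid(F_c + G_c).

    F, beliefs : dicts mapping class -> 2-D map
    kernels    : dict mapping (c, c') -> 2-D kernel (graph fragment)
    """
    new = {}
    for c in F:
        G = sum(conv2_same(beliefs[cp], kernels[(c, cp)]) for cp in beliefs) + alpha
        new[c] = sigmoid(F[c] + G)
    return new
```

For example, a "car" kernel that gives positive weight to "road" beliefs below a pixel would raise the car belief where road evidence is strong, which is exactly the kind of contextual coupling the graph fragments encode.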
Extended stuff [1] may be a very important source of contextual information for other objects.

Figure 3: Examples of patches from the dictionary, and an example of the segmentation obtained using boosting trained with patches from (a). (a) Examples from the dictionary of about 2000 patches and masks, U_{x,y}, V_{x,y}. (b) Examples from the dictionary of 30 graphs, W_{x,y,c}. (c) Example of feedforward segmentation for screens: the accumulated output F = f^{t=0} + f^{t=1} + f^{t=2} + ... is thresholded to give the segmentation S.

The weak learners we use for the local evidence potentials are based on the segmentation fragments proposed in [2]. Specifically, we create a dictionary of about 2000 image patches U, chosen at random (but overlapping each object), plus a corresponding set of binary (in-class/out-of-class) image masks, V: see Fig. 3(a). At each round t, for each class c, and for each dictionary entry, we construct the following weak learner, whose output is a binary matrix of the same size as the image I:

    v(I) = \left[ (I \star U > \theta) \ast V \right] > 0    (9)

where \star represents normalized cross-correlation and \ast represents convolution. The intuition behind this is that I \star U will produce peaks at image locations that contain this patch/template, and then convolving with V will superimpose the segmentation mask on top of the peaks. As a function of the threshold \theta, the feature will behave more like a template detector (\theta \approx 1) or like a texture descriptor (\theta \ll 1).

To be able to detect objects at multiple scales, we first downsample the image to scale \sigma, compute v(I \downarrow \sigma), and then upsample the result. The final weak learner does this for multiple scales, ORs all the results together, and then takes a linear transformation:

    f(I) = \alpha \left( \vee_\sigma \left[ v(I \downarrow \sigma) \uparrow \sigma \right] \right) + \beta    (10)

Fig. 
3(c) shows an example of a segmentation obtained by using boosting without context. The weak learners we use for the compatibility functions have a similar form:

    g_c(b) = \sum_{c'=1}^{C} b_{c'} \ast W_{c'} + \alpha    (11)

where b_{c'} is the image formed by the beliefs at all pixels for class c'. This convolution corresponds to Eq. 8, in which the node i is one pixel (x, y) of class c. The binary kernels (graph fragments) W define, for each node (x, y) of object class c, all the nodes from which it will receive messages. These kernels are chosen by sampling patches of various sizes from the labeling of images from the training set. This allows generating complicated patterns of connectivity that reflect the statistics of object co-occurrences in the training set. The overall incoming message is given by adding the kernels obtained at each boosting round. (This is the key difference from mutual boosting [4], where the incoming message is just the output of a single weak learner; thus, in mutual boosting, previously learned inter-class connections are only used once.) Although it would seem to take O(t) time to compute G^t, we can precompute a single equivalent kernel W', so at runtime the overall complexity is still linear in the number of boosting rounds, O(T):

    G_{x,y,c}^t = \sum_{c'=1}^{C} \sum_{n=1}^{t} \left( b_{c'} \ast W_{c'}^n + \alpha_n \right) \;\stackrel{def}{=}\; \sum_{c'=1}^{C} b_{c'} \ast W'_{c'} + \alpha'

Figure 4: Street scene. The BRF is trained to detect cars, buildings and the road. (a) Incoming messages to a car node. (b) Compatibilities (W'). (c) A car out of context (outside 3rd-floor windows) is less of a car. (d) Evolution of the beliefs for the car nodes (b) and of the labeling (S) for road, building and car, at t = 1, 2, 4, 20, 40, and the final labeling.

In Fig. 
4(a-b), we show the structure of the graph and the weights W' defined by G^T for a BRF trained to detect cars, buildings and roads in street scenes.

3.2 Learning and inference

For training we used a labeled dataset of office and street scenes, with about 100 images in each set. During training, in the first 5 rounds we only update the local potentials, to allow local evidence to accrue. After the 5th iteration we also start updating the compatibility functions. At each round, we update only the local potential and compatibility function associated with the single object class that most reduces the multiclass cost. This allows objects that need many features to have more complicated local potentials.

The algorithm learns to first detect easy (and large) objects, since these reduce the error of all classes the fastest. The easy-to-detect objects can then pass information to the harder ones. For instance, in office scenes, the system first detects screens, then keyboards, and finally computer mice. Fig. 5 illustrates this behavior on the test set. A similar behavior is obtained for the car detector (Fig. 4(d)). The detection of the building and the road provides strong constraints for the locations of the car.

3.3 Cascade of classifiers with BRFs

The BRF can be turned into a cascade [12] by thresholding the beliefs. Computations can then be reduced by doing the convolutions (required for computing f and g) only at pixels that are still candidates for the presence of the target. At each round we update a binary rejection mask for each object class, R_{x,y,c}^t, by thresholding the beliefs at round t: R_{x,y,c}^t = R_{x,y,c}^{t-1} \wedge (b_{x,y,c}^t > \theta_c^t). A pixel in the rejection mask is set to zero when we can decide that the object is not present (when b_{x,y,c}^t is below the threshold \theta_c^t), and it is set to 1 when more processing is required. The threshold \theta_c^t is chosen so that the percentage of missed detections is below a predefined level (we use 1%). 
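This rejection-mask update can be sketched directly. The threshold-selection rule below (a quantile of the beliefs of true positives) is one simple way to keep missed detections below a target rate; the names and values are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def pick_threshold(b_pos, miss_rate=0.01):
    """Choose theta so that at most miss_rate of the true-positive
    beliefs b_pos fall below it (i.e. at most 1% missed detections)."""
    return np.quantile(b_pos, miss_rate)

def update_rejection_mask(R_prev, b, theta):
    """R^t = R^{t-1} AND (b^t > theta): once a pixel's belief drops
    below theta it is rejected and never revisited."""
    return R_prev & (b > theta)
```

Because the mask is ANDed with its previous value, rejection is monotone: the set of active pixels can only shrink from round to round, so each round's convolutions touch fewer pixels.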
Similarly, we can define a detection mask that indicates pixels at which we decide the object is present. The mask is then used when computing the features v(I) and the messages G, by applying the convolutions only at the pixels not yet classified. We denote these masked operators \star_R and \ast_R. This results in a more efficient classifier with only a slight decrease in performance.

Figure 5: Top: in this desk scene, it is easy to identify objects like the screen, keyboard and mouse, even though the local information is sometimes insufficient; input image, ground truth and output labeling are shown. Middle: the evolution of the beliefs (b, F and G) for screen, keyboard and mouse during detection on a test image (t = 5, 10, 15, 25, 50). Bottom: the average evolution of the area under the ROC for the three objects on 120 test images, comparing boosting and the BRF.

In Fig. 6 we compare the reduction of the search space when implementing a cascade using independent boosting (which reduces to Viola and Jones [12]) and when using BRFs. We see that for objects for which context is the main source of information, like the mouse, the reduction in search space is much more dramatic using BRFs than using boosting alone.

4 Conclusion

The proposed BRF algorithm combines boosting and CRFs, providing an algorithm in which both training and inference are easy. We have demonstrated object detection in cluttered scenes by exploiting contextual relationships between objects. 
The BRF algorithm is computationally efficient and provides a natural extension of the cascade of classifiers, integrating evidence from other objects in order to quickly reject certain image regions. The BRF's densely connected graphs, which efficiently collect information over large image regions, provide an alternative framework to nearest-neighbor grids for vision problems.

Acknowledgments

This work was sponsored in part by the Nippon Telegraph and Telephone Corporation as part of the NTT/MIT Collaboration Agreement, by BAE Systems, and by DARPA contract DABT63-99-1-0012.

Figure 6: Contextual information reduces the search space in the framework of a cascade and improves performance. The search space is defined as the percentage of pixels that require further processing before a decision can be reached at each round. BRFs provide better performance and require fewer computations. The graphs (search space per round, and ROC curves, for screen, keyboard and mouse) correspond to the average results on a test set of 120 images.

References

[1] E. H. Adelson. On seeing stuff: the perception of materials by humans and machines. In Proc. SPIE, volume 4299, pages 1-12, 2001.

[2] E. Borenstein and S. Ullman. Class-specific, top-down segmentation. In Proc. European Conf. on Computer Vision, 2002.

[3] T. Dietterich, A. Ashenfelter, and Y. Bulatov. Training conditional random fields via gradient tree boosting. In Intl. Conf. on Machine Learning, 2004.

[4] M. Fink and P. Perona. Mutual boosting for contextual influence. In Advances in Neural Info. Proc. Systems, 2003.

[5] J. Friedman, T. 
Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28(2):337-374, 2000.

[6] Xuming He, Richard Zemel, and Miguel Carreira-Perpinan. Multiscale conditional random fields for image labelling. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2004.

[7] S. Kumar and M. Hebert. Discriminative random fields: A discriminative framework for contextual interaction in classification. In IEEE Conf. on Computer Vision and Pattern Recognition, 2003.

[8] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Intl. Conf. on Machine Learning, 2001.

[9] K. Murphy, A. Torralba, and W. Freeman. Using the forest to see the trees: a graphical model relating features, objects and scenes. In Advances in Neural Info. Proc. Systems, 2003.

[10] R. Schapire. The boosting approach to machine learning: An overview. In MSRI Workshop on Nonlinear Estimation and Classification, 2001.

[11] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In Advances in Neural Info. Proc. Systems, 2003.

[12] P. Viola and M. Jones. Robust real-time object detection. Intl. J. Computer Vision, 57(2):137-154, 2004.
", "award": [], "sourceid": 2663, "authors": [{"given_name": "Antonio", "family_name": "Torralba", "institution": null}, {"given_name": "Kevin", "family_name": "Murphy", "institution": null}, {"given_name": "William", "family_name": "Freeman", "institution": null}]}