{"title": "Context-Sensitive Decision Forests for Object Detection", "book": "Advances in Neural Information Processing Systems", "page_first": 431, "page_last": 439, "abstract": "In this paper we introduce Context-Sensitive Decision Forests - a new perspective on exploiting contextual information in the popular decision forest framework for the object detection problem. They are tree-structured classifiers with the ability to access intermediate prediction (here: classification and regression) information during training and inference time. This intermediate prediction is available for each sample, which allows us to develop context-based decision criteria used for refining the prediction process. In addition, we introduce a novel split criterion which, in combination with a priority-based way of constructing the trees, allows more accurate regression mode selection and hence improves the current context information. In our experiments, we demonstrate improved results for the task of pedestrian detection on the challenging TUD data set when compared to state-of-the-art methods.", "full_text": "Context-Sensitive Decision Forests\n\nfor Object Detection\n\nPeter Kontschieder1\n\nSamuel Rota Bul\u00f22 Antonio Criminisi3\n\nPushmeet Kohli3 Marcello Pelillo2 Horst Bischof1\n\n1ICG, Graz University of Technology, Austria\n2DAIS, Universit\u00e0 Ca' Foscari Venezia, Italy\n\n3Microsoft Research Cambridge, UK\n\nAbstract\n\nIn this paper we introduce Context-Sensitive Decision Forests - a new perspective on exploiting contextual information in the popular decision forest framework for the object detection problem. They are tree-structured classifiers with the ability to access intermediate prediction (here: classification and regression) information during training and inference time. This intermediate prediction is available for each sample and allows us to develop context-based decision criteria, used for refining the prediction process. 
In addition, we introduce a novel split criterion which, in combination with a priority-based way of constructing the trees, allows more accurate regression mode selection and hence improves the current context information. In our experiments, we demonstrate improved results for the task of pedestrian detection on the challenging TUD data set when compared to state-of-the-art methods.\n\n1 Introduction and Related Work\n\nIn recent years, the random forest framework [1, 6] has become a very popular and powerful tool for classification and regression problems, exhibiting many appealing properties such as inherent multi-class capability, robustness to label noise and a reduced tendency to overfitting [7]. Random forests are considered to be close to an ideal learner [13], making them attractive in many areas of computer vision like image classification [5, 17], clustering [19], regression [8] or semantic segmentation [24, 15, 18]. In this work we show how the decision forest algorithm can be extended to include contextual information during learning and inference for classification and regression problems. We focus on applying random forests to object detection, i.e. the problem of localizing multiple instances of a given object class in a test image. This task has previously been addressed with random forests [9], where the trees were modified to learn a mapping between the appearance of an image patch and its relative position to the object category centroid (i.e. center voting information). During inference, the resulting Hough Forest not only performs classification on test samples but also casts probabilistic votes in a generalized Hough-voting space [3] that is subsequently used to obtain object center hypotheses. Since then, a series of applications such as tracking and action recognition [10], body-joint position estimation [12] and multi-class object detection [22] have been presented. 
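The Hough-voting step described above can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: the patch positions, predicted center offsets and vote weights are assumed to come from an already-trained forest (e.g. the foreground probability mass stored at the reached leaf).

```python
def accumulate_hough_votes(patches, width, height):
    """Accumulate probabilistic center votes from image patches.

    `patches` is an iterable of (x, y, offsets, weights): the patch position,
    the displacement vectors to the predicted object center, and the
    corresponding vote weights.  Returns the Hough image as a dict mapping
    (x, y) center locations to accumulated vote mass.
    """
    hough = {}
    for x, y, offsets, weights in patches:
        for (dx, dy), w in zip(offsets, weights):
            cx, cy = x + dx, y + dy            # voted object-center location
            if 0 <= cx < width and 0 <= cy < height:
                hough[(cx, cy)] = hough.get((cx, cy), 0.0) + w
    return hough
```

Object-center hypotheses then correspond to (smoothed) local maxima of this vote map; as discussed in the text, these modes are often non-distinctive, which is what non-maximum suppression and the context-sensitive extension aim to address.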
However, Hough Forests typically produce non-distinctive object hypotheses in the Hough space, and hence non-maximum suppression (NMS) needs to be performed to obtain the final results. While this has been addressed in [4, 26], another shortcoming is that standard (Hough) forests treat samples in a completely independent way, i.e. there is no mechanism that encourages the classifier to perform consistent predictions.\nWithin this work we propose that context information can be used to overcome the aforementioned problems. For example, training data for visual learning is often represented by images in the form of a (regular) pixel grid topology, i.e. objects appearing in natural images can often be found in a specific context. The importance of contextual information was already highlighted in the 80's with\n\nFigure 1: Top row: Training image, label image, visualization of priority-based growing of the tree (the lower the node index, the earlier it was considered during training). Bottom row: Inverted Hough image using [9] and breadth-first training after 6 levels (2^6 = 64 nodes), inverted Hough image after growing 64 nodes using our priority queue, inverted Hough image using the priority queue showing distinctive peaks at the end of training.\n\na pioneering work on relaxation labelling [14] and a later work with a focus on inference tasks [20] that addressed the issue of learning within the same framework. More recently, contextual information has been used in the field of object class segmentation [21], however, mostly for high-level reasoning in random field models or to resolve contradicting segmentation results. The introduction of contextual information as additional features in low-level classifiers was initially proposed in the Auto-context [25] and Semantic Texton Forest [24] models. 
Auto-context shows a general approach for classifier boosting by iteratively learning from appearance and context information. In this line of research, [18] augmented the feature space of an Entanglement Random Forest with a classification feature that is consequently refined by the class posterior distributions according to the progress of the trained subtree. The training procedure is allowed to perform tests for specific, contextual label configurations, which was demonstrated to significantly improve the segmentation results.\nIn this paper we present Context-Sensitive Decision Forests - a novel and unified interpretation of Hough Forests in light of contextual sensitivity. Our work is inspired by Auto-Context and Entanglement Forests, but instead of providing only posterior classification results from an earlier level of the classifier construction during learning and testing, we additionally provide regression (voting) information as it is used in Hough Forests. The second core contribution of our work is related to how we grow the trees: instead of training them in a depth- or breadth-first way, we propose a priority-based construction (which could actually consider depth- or breadth-first as particular cases). The priority is determined by the current training error, i.e. we first grow the parts of the tree where we experience higher error. To this end, we introduce a unified splitting criterion that estimates the joint error of classification and regression. The consequences of using our priority-based training are illustrated in Figure 1: given the training image with the corresponding label image (top row, images 1 and 2), the tree first tries to learn the foreground samples, as shown in the color-coded plot (top row, image 3; colors correspond to the index numbers of nodes in the tree). 
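The priority-based construction can be sketched with a standard max-priority queue over pending nodes. This is a minimal sketch under stated assumptions: the helpers `node_error` and `best_split` are hypothetical stand-ins for the paper's unified classification/regression error and its randomized split-function pool, and the tree is encoded as nested dicts purely for illustration.

```python
import heapq
import itertools

def grow_tree_by_priority(root_samples, node_error, best_split, budget):
    """Grow a tree by always expanding the pending node with the highest
    training error first.  Breadth- or depth-first growing are special cases
    obtained by using the node's depth (or its negation) as the priority.

    `node_error(samples)` returns the training error of a candidate node;
    `best_split(samples)` returns (split_fn, left_samples, right_samples),
    or (None, None, None) if the node should stay a leaf.
    """
    counter = itertools.count()          # tie-breaker for equal priorities
    tree = {"samples": root_samples, "children": None}
    # heapq is a min-heap, so push the negated error to pop the worst node.
    heap = [(-node_error(root_samples), next(counter), tree)]
    grown = 0
    while heap and grown < budget:
        _, _, node = heapq.heappop(heap)
        split, left, right = best_split(node["samples"])
        if split is None:                # no useful split: keep as a leaf
            continue
        node["split"] = split
        node["children"] = ({"samples": left, "children": None},
                            {"samples": right, "children": None})
        grown += 1
        for child in node["children"]:
            heapq.heappush(heap,
                           (-node_error(child["samples"]), next(counter), child))
    return tree
```

The `budget` argument mirrors the experiment in Figure 1, where 64 nodes are grown either breadth-first or by priority.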
The effects on the intermediate prediction quality are shown in the bottom row for the regression case: the first image shows the regression quality after training a tree with 6 levels (2^6 = 64 nodes) in a breadth-first way, while the second image shows the progress after growing 64 nodes according to the priority-based training. Clearly, the modes for the center hypotheses are more distinctive, which in turn yields more accurate intermediate regression information that can be used for further tree construction. Our third contribution is a new family of split functions that allows learning from training images containing multiple training instances, as shown for the pedestrians in the example. We introduce a test that checks the centroid compatibility for pairs of training samples taken from the context, based on the intermediate classification and regression derived as described before. To assess our contributions, we performed several experiments on the challenging TUD pedestrian data set [2], yielding a significant improvement of 9% in the recall at 90% precision rate in comparison to standard Hough Forests when learning from crowded pedestrian images.\n\n2 Context-Sensitive Decision Trees\n\nThis section introduces the general idea behind the context-sensitive decision forest without references to specific applications. Only in Section 3 do we show a particular application to the problem of object detection. After introducing some basic notational conventions that are used in the paper, we provide a section that revisits the random forest framework for classification and regression tasks from a joint perspective, i.e. a theory allowing us to consider e.g. [1, 11] and [9] in a unified way. Starting from this general view we finally introduce the context-sensitive forests in Section 2.2.\nNotations. In the paper we denote vectors using boldface lowercase (e.g. d, u, v) and sets using uppercase calligraphic (e.g.
X, Y) symbols. The sets of real, natural and integer numbers are denoted by R, N and Z as usual. We denote by 2^X the power set of X and by 1[P] the indicator function returning 1 or 0 according to whether the proposition P is true or false. Moreover, with P(Y) we denote the set of probability distributions having Y as sample space, and we implicitly assume that some \u03c3-algebra is defined on Y. We denote by \u03b4(x) the Dirac delta function. Finally, E_{x\u223cQ}[f(x)] denotes the expectation of f(x) with respect to x sampled according to the distribution Q.\n\n2.1 Random Decision Forests for joint classification and regression\n\nA (binary) decision tree is a tree-structured predictor\u00b9 where, starting from the root, a sample is routed until it reaches a leaf where the prediction takes place. At each internal node of the tree the decision is taken whether the sample should be forwarded to the left or right child, according to a binary-valued function. In formal terms, let X denote the input space, let Y denote the output space and let T^dt be the set of decision trees. In its simplest form a decision tree consists of a single node (a leaf) and is parametrized by a probability distribution Q \u2208 P(Y) which represents the posterior probability of elements in Y given any data sample reaching the leaf. We denote this (admittedly rudimentary) tree as LF(Q) \u2208 T^dt. Otherwise, a decision tree consists of a node with a left and a right sub-tree. This node is parametrized by a split function \u03c6 : X \u2192 {0, 1}, which determines whether to route a data sample x \u2208 X reaching it to the left decision sub-tree t_l \u2208 T^dt (if \u03c6(x) = 0) or to the right one t_r \u2208 T^dt (if \u03c6(x) = 1). We denote such a tree as ND(\u03c6, t_l, t_r) \u2208 T^dt. 
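The routing behaviour of the two constructors LF(Q) and ND(\u03c6, t_l, t_r) can be sketched directly. The tagged-tuple encoding and dict-valued distributions below are illustrative choices, not from the paper:

```python
# A decision tree is either a leaf LF(Q) holding a distribution Q over Y,
# or an internal node ND(phi, t_l, t_r) with a binary split function phi.
def LF(Q):
    return ("leaf", Q)

def ND(phi, t_left, t_right):
    return ("node", phi, t_left, t_right)

def route_to_leaf(tree, x):
    """Route sample x down the tree until a leaf distribution is reached."""
    while tree[0] == "node":
        _, phi, t_left, t_right = tree
        tree = t_left if phi(x) == 0 else t_right
    return tree[1]          # the distribution Q stored at the reached leaf
```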
Finally, a decision forest is an ensemble F \u2286 T^dt of decision trees which makes a prediction about a data sample by averaging over the single predictions gathered from all trees.\nInference. Given a decision tree t \u2208 T^dt, the associated posterior probability of each element in Y given a sample x \u2208 X is determined by finding the probability distribution Q parametrizing the leaf that is reached by x when routed along the tree. This is compactly presented with the following definition of P(y | x, t), which is inductive in the structure of t:\n\nP(y | x, t) = Q(y) if t = LF(Q); P(y | x, t_l) if t = ND(\u03c6, t_l, t_r) and \u03c6(x) = 0; P(y | x, t_r) if t = ND(\u03c6, t_l, t_r) and \u03c6(x) = 1.  (1)\n\nFinally, the combination of the posterior probabilities derived from the trees in a forest F \u2286 T^dt can be done by an averaging operation [6], yielding a single posterior probability for the whole forest:\n\nP(y | x, F) = (1/|F|) \u2211_{t \u2208 F} P(y | x, t).  (2)\n\nRandomized training. A random forest is created by training a set of random decision trees independently on random subsets of the training data D \u2286 X \u00d7 Y. The training procedure for a single decision tree heuristically optimizes a set of parameters like the tree structure, the split functions at the internal nodes and the density estimates at the leaves in order to reduce the prediction error on the training data. In order to prevent overfitting problems, the search space of possible split functions is limited to a random set and a minimum number of training samples is required to grow a leaf node. During the training procedure, each new node is fed with a set of training samples Z \u2286 D. 
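The inference rules (1) and (2) translate almost literally into code. In this minimal sketch, trees are encoded as nested tuples ("leaf", Q) / ("node", phi, t_l, t_r) and distributions as dicts over Y; these encodings are illustrative assumptions, not the paper's implementation.

```python
def posterior(tree, x):
    """P(y | x, t) as in Eq. (1): route x to a leaf, return its distribution."""
    if tree[0] == "leaf":                # t = LF(Q)
        return tree[1]
    _, phi, t_left, t_right = tree       # t = ND(phi, t_l, t_r)
    return posterior(t_left if phi(x) == 0 else t_right, x)

def forest_posterior(forest, x):
    """P(y | x, F) as in Eq. (2): average the per-tree posteriors."""
    acc = {}
    for tree in forest:
        for y, p in posterior(tree, x).items():
            acc[y] = acc.get(y, 0.0) + p
    return {y: p / len(forest) for y, p in acc.items()}
```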
If some stopping condition holds, depending on Z, the node becomes a leaf and a density on Y is estimated based on Z. Otherwise, an internal node is grown and a split function is selected from a pool of random ones in a way to minimize some sort of training error on Z. The selected split function induces a partition of Z into two sets, which in turn become the left and right children of the current node, where the training procedure is continued.\nWe will now write this training procedure in more formal terms. To this end we introduce a function \u03c0(Z) \u2208 P(Y) providing a density on Y estimated from the training data Z \u2286 D and a loss function L(Z | Q) \u2208 R penalizing wrong predictions on the training samples in Z, when predictions are given according to a distribution Q \u2208 P(Y). The loss function L can be further decomposed in terms of a loss function \u2113(\u00b7 | Q) : Y \u2192 R acting on each sample of the training set:\n\nL(Z | Q) = \u2211_{(x,y) \u2208 Z} \u2113(y | Q).  (3)\n\nAlso, let \u03a6(Z) be a set of split functions randomly generated for a training set Z and, given a split function \u03c6 \u2208 \u03a6(Z), we denote by Z^\u03c6_l and Z^\u03c6_r the sets identified by splitting Z according to \u03c6, i.e.\n\nZ^\u03c6_l = {(x, y) \u2208 Z : \u03c6(x) = 0}   and   Z^\u03c6_r = {(x, y) \u2208 Z : \u03c6(x) = 1}.\n\nWe can now summarize the training procedure in terms of a recursive function g : 2^{X \u00d7 Y} \u2192 T^dt, which generates a random decision tree from a training set given as argument:\n\ng(Z) = LF(\u03c0(Z)) if some stopping condition holds; ND(\u03c6, g(Z^\u03c6_l), g(Z^\u03c6_r)) otherwise.  (4)\n\nHere, we determine the optimal split function \u03c6 in the pool \u03a6(Z) as the one minimizing the loss we incur as a result of the node split:\n\n\u03c6 \u2208 arg min {L(Z^{\u03c6\u2032}_l) + L(Z^{\u03c6\u2032}_r) : \u03c6\u2032 \u2208 \u03a6(Z)},  (5)\n\nwhere we compactly write L(Z) for L(Z | \u03c0(Z)), i.e. the loss on Z obtained with predictions driven by \u03c0(Z). A typical split function selection criterion commonly adopted for classification and regression is information gain. The equivalent counterpart in terms of loss can be obtained by using a log-loss, i.e. \u2113(y | Q) = \u2212log(Q(y)). A further widely used criterion is based on Gini impurity, which can be expressed in this setting by using \u2113(y | Q) = 1 \u2212 Q(y). Finally, the stopping condition that is used in (4) to determine whether to create a leaf or to continue branching the tree typically consists in checking whether |Z|, i.e. the number of training samples at the node, or the loss L(Z) is below some given threshold, or whether a maximum depth is reached.\n\n\u00b9We use the term predictor because we will jointly consider classification and regression.\n\n2.2 Context-sensitive decision forests\n\nA context-sensitive (CS) decision tree is a decision tree in which split functions are enriched with the ability of testing contextual information of a sample before taking a decision about where to route it. We generate contextual information at each node of a decision tree by exploiting a truncated version of the same tree as a predictor. This idea is shared with [18]; however, we introduce some novelties by tackling both classification and regression problems in a joint manner and by leaving a wider flexibility in the tree truncation procedure. 
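Returning briefly to the split-selection rule (5) of Section 2.1, its log-loss instantiation can be sketched as follows. As an illustrative assumption, \u03c0(Z) is taken to be the empirical class histogram of Z; the candidate pool \u03a6(Z) is passed in as a plain list of functions.

```python
import math
from collections import Counter

def empirical_distribution(Z):
    """pi(Z): empirical distribution over the labels y occurring in Z."""
    counts = Counter(y for _, y in Z)
    total = sum(counts.values())
    return {y: c / total for y, c in counts.items()}

def log_loss(Z, Q):
    """L(Z | Q) = sum over (x, y) in Z of -log Q(y); Q(y) = 0 gives infinity."""
    return sum(-math.log(Q[y]) if Q.get(y, 0.0) > 0 else float("inf")
               for _, y in Z)

def select_split(Z, candidate_splits):
    """Pick phi minimizing L(Z_l | pi(Z_l)) + L(Z_r | pi(Z_r)), as in Eq. (5)."""
    best_phi, best_loss = None, float("inf")
    for phi in candidate_splits:
        Z_l = [(x, y) for (x, y) in Z if phi(x) == 0]
        Z_r = [(x, y) for (x, y) in Z if phi(x) == 1]
        if not Z_l or not Z_r:
            continue                      # skip degenerate splits
        loss = (log_loss(Z_l, empirical_distribution(Z_l)) +
                log_loss(Z_r, empirical_distribution(Z_r)))
        if loss < best_loss:
            best_phi, best_loss = phi, loss
    return best_phi, best_loss
```

Swapping `log_loss` for \u2113(y | Q) = 1 \u2212 Q(y) yields the Gini-impurity variant mentioned above.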
We denote the set of CS decision trees as T.\nThe main differences characterizing a CS decision tree t \u2208 T compared with a standard decision tree are the following: a) every node (leaves and internal nodes) of t has an associated probability distribution Q \u2208 P(Y) representing the posterior probability of an element in Y given any data sample reaching it; b) internal nodes are indexed with distinct natural numbers n \u2208 N in a way to preserve the property that children nodes have a larger index compared to their parent node; c) the split function at each internal node, denoted by \u03d5(\u00b7 | t\u2032) : X \u2192 {0, 1}, is bound to a CS decision tree t\u2032 \u2208 T, which is a truncated version of t and can be used to compute intermediate, contextual information.\nSimilar to Section 2.1 we denote by LF(Q) \u2208 T the simplest CS decision tree, consisting of a single leaf node parametrized by the distribution Q, while we denote by ND(n, Q, \u03d5, t_l, t_r) \u2208 T the rest of the trees, consisting of a node having a left and a right sub-tree, denoted by t_l, t_r \u2208 T respectively, and being parametrized by the index n, a probability distribution Q and the split function \u03d5 as described above.\nAs shown in Figure 2, the truncation of a CS decision tree at each node is obtained by exploiting the indexing imposed on the internal nodes of the tree. Given a CS decision tree t \u2208 T and m \u2208 N,\n\n(a) A CS decision tree t   (b) The truncated version t(<5)\n\nFigure 2: On the left, we find a CS decision tree t, where only the internal nodes are indexed. On the right, we see the truncated version t(<5) of t, which is obtained by converting to leaves all nodes having index \u2265 5 (we marked with colors the corresponding node transformations).\n\nwe denote by t(