{"title": "Multi-Resolution Cascades for Multiclass Object Detection", "book": "Advances in Neural Information Processing Systems", "page_first": 2186, "page_last": 2194, "abstract": "An algorithm for learning fast multiclass object detection cascades is introduced. It produces multi-resolution (MRes) cascades, whose early stages are binary target vs. non-target detectors that eliminate false positives, late stages multiclass classifiers that finely discriminate target classes, and middle stages have intermediate numbers of classes, determined in a data-driven manner. This MRes structure is achieved with a new structurally biased boosting algorithm (SBBoost). SBBost extends previous multiclass boosting approaches, whose boosting mechanisms are shown to implement two complementary data-driven biases: 1) the standard bias towards examples difficult to classify, and 2) a bias towards difficult classes. It is shown that structural biases can be implemented by generalizing this class-based bias, so as to encourage the desired MRes structure. This is accomplished through a generalized definition of multiclass margin, which includes a set of bias parameters. SBBoost is a boosting algorithm for maximization of this margin. It can also be interpreted as standard multiclass boosting algorithm augmented with margin thresholds or a cost-sensitive boosting algorithm with costs defined by the bias parameters. A stage adaptive bias policy is then introduced to determine bias parameters in a data driven manner. This is shown to produce MRes cascades that have high detection rate and are computationally efficient. Experiments on multiclass object detection show improved performance over previous solutions.", "full_text": "Multi-Resolution Cascades for Multiclass Object\n\nDetection\n\nMohammad Saberian\n\nYahoo! 
Labs\n\nsaberian@yahoo-inc.com\n\nNuno Vasconcelos\n\nStatistical Visual Computing Laboratory\n\nUniversity of California, San Diego\n\nnuno@ucsd.edu\n\nAbstract\n\nAn algorithm for learning fast multiclass object detection cascades is introduced.\nIt produces multi-resolution (MRes) cascades, whose early stages are binary target\nvs. non-target detectors that eliminate false positives, late stages multiclass clas-\nsi\ufb01ers that \ufb01nely discriminate target classes, and middle stages have intermediate\nnumbers of classes, determined in a data-driven manner. This MRes structure\nis achieved with a new structurally biased boosting algorithm (SBBoost). SBBost\nextends previous multiclass boosting approaches, whose boosting mechanisms are\nshown to implement two complementary data-driven biases: 1) the standard bias\ntowards examples dif\ufb01cult to classify, and 2) a bias towards dif\ufb01cult classes. It is\nshown that structural biases can be implemented by generalizing this class-based\nbias, so as to encourage the desired MRes structure. This is accomplished through\na generalized de\ufb01nition of multiclass margin, which includes a set of bias pa-\nrameters. SBBoost is a boosting algorithm for maximization of this margin. It\ncan also be interpreted as standard multiclass boosting algorithm augmented with\nmargin thresholds or a cost-sensitive boosting algorithm with costs de\ufb01ned by the\nbias parameters. A stage adaptive bias policy is then introduced to determine bias\nparameters in a data driven manner. This is shown to produce MRes cascades\nthat have high detection rate and are computationally ef\ufb01cient. Experiments on\nmulticlass object detection show improved performance over previous solutions.\n\n1\n\nIntroduction\n\nThere are many learning problems where classi\ufb01ers must make accurate decisions quickly. 
A promi-\nnent example is the problem of object detection in computer vision, where a sliding window is\nscanned throughout an image, generating hundreds of thousands of image sub-windows. A classi-\n\ufb01er must then decide if each sub-window contains certain target objects, ideally at video frame-rates,\ni.e. less than a micro second per window. The problem of simultaneous real-time detection of multi-\nple class of objects subsumes various important applications in computer vision alone. These range\nfrom the literal detection of many objects (e.g. an automotive vision system that must detect cars,\npedestrians, traf\ufb01c signs), to the detection of objects at multiple semantic resolutions (e.g. a camera\nthat can both detect faces and recognize certain users), to the detection of different aspects of the\nsame object (e.g. by de\ufb01ning classes as different poses). A popular architecture for real-time object\ndetection is the detector cascade of Figure 1-a [17]. This is implemented as a sequence of simple to\ncomplex classi\ufb01cation stages, each of which can either reject the example x to classify or pass it to\nthe next stage. An example that reaches the end of the cascade is classi\ufb01ed as a target. Since targets\nconstitute a very small portion of the space of image sub-windows, most examples can be rejected in\nthe early cascade stages, by classi\ufb01ers of very small computation. In result, the average computation\nper image is small, and the cascaded detector is very fast. 
While the design of cascades for real-time\ndetection of a single object class has been the subject of extensive research [18, 20, 2, 15, 1, 12, 14],\nthe simultaneous detection of multiple objects has received much less attention.\n\n1\n\n\f(a)\n\n(b)\n\nFigure 1: a) detector cascade [17], b) parallel cascade [19], c) parallel cascade with pre-estimator [5] and d)\nall-class cascade with post-estimator.\n\n(c)\n\n(d)\n\nMost solutions for multiclass cascade learning simply decompose the problem into several binary\n(single class) detection sub-problems. They can be grouped into two main classes. Methods in\nthe \ufb01rst class, here denoted parallel cascades [19], learn a cascaded detector per object class (e.g.\nview), as shown in Figure 1-b, and rely on some post-processing to combine their decisions. This\nhas two limitations. The \ufb01rst is the well known sub-optimality of one-vs.-all multiclass classi\ufb01ca-\ntion, since scores of independently trained detectors are not necessarily comparable [10]. Second,\nbecause there is no sharing of features across detectors, the overall classi\ufb01er performs redundant\ncomputations and tends to be very slow. This has motivated work in feature sharing. Examples\ninclude JointBoost [16], which exhaustively searches for features to be shared between classes, and\n[11], which implicitly partitions positive examples and performs a joint search for the best parti-\ntion and features. These methods have large training complexity. The complexity of the parallel\narchitecture can also be reduced by \ufb01rst making a rough guess of the target class and then running\nonly one of the binary detectors, as in Figure 1-c. We refer to these methods as parallel cascades\nwith pre-estimator [5]. While, for some applications (e.g. where classes are object poses), it is\npossible to obtain a reasonable pre-estimate of the target class, pre-estimation errors are dif\ufb01cult\nto undo. 
Hence, this classi\ufb01er must be fairly accurate. Since it must also be fast, this approach\nboils down to real-time multiclass classi\ufb01cation, i.e. the original problem. [4] proposed a variant\nof this method, where multiple detectors are run after the pre-estimate. This improves accuracy but\nincreases complexity.\nIn this work, we pursue an alternative strategy, inspired by Figure 1-d. Target classes are \ufb01rst\ngrouped into an abstract class of positive patches. A detector cascade is then trained to distinguish\nthese patches from everything else. A patch identi\ufb01ed as positive is \ufb01nally fed to a multiclass clas-\nsi\ufb01er, for assignment to one of the target classes. In comparison to parallel cascades, this has the\nadvantage of sharing features across all classes, eliminating redundant computation. When com-\npared to the parallel cascade with pre-estimator, it has the advantage that the complexity of its class\nestimator has little weight in the overall computation, since it only processes a small percentage of\nthe examples. This allows the use of very accurate/complex estimators. The main limitation is that\nthe design of a cascade to detect all positive patches can be quite dif\ufb01cult, due to the large intra-\nclass variability. This is, however, due to the abrupt transition between the all-class and multiclass\nregimes. While it is dif\ufb01cult to build an all-class detector with high detection and low false-positive\nrate, we show that this is really not needed. Rather than the abrupt transition of Figure 1-d, we\npropose to learn a multiclass cascade that gradually progresses from all-class to multiclass. Early\nstages are binary all-class detectors, aimed at eliminating sub-windows in background image re-\ngions. Intermediate stages are classi\ufb01ers with intermediate numbers of classes, determined by the\nstructure of the data itself. Late stages are multiclass classi\ufb01ers of high accuracy/complexity. 
Since\nthese cascades represent the set of classes at different resolutions, they are denoted multi-resolution\n(MRes) cascades.\nTo learn MRes cascades, we consider a M-class classi\ufb01cation problem and de\ufb01ne a negative class\nM + 1, which contains all non-target examples. We then analyze a recent multiclass boosting algo-\nrithm, MCBoost [13], showing that its weighting mechanism has two components. The \ufb01rst is the\nstandard weighting of examples by how well they are classi\ufb01ed at each iteration. The second, and\nmore relevant to this work, is a similar weighting of the classes according to their dif\ufb01culty. MC-\n\n2\n\n1)2)M)\u2026Cascade(cid:3)(1Cascade(cid:3)(2Cascade(cid:3)(MDetector(cid:3)Detector(cid:3)Detector(cid:3)CClass(cid:3)Estimator\u2026scade(cid:3)(1)scade(cid:3)(2)scade(cid:3)(M)Detector(cid:3)CaDetector(cid:3)Caetector(cid:3)CasDDDede(cid:3)(all)ector(cid:3)CascaClassEstimatorDeteClass(cid:3)Estimator\u2026\fBoost is shown to select the weak learner of largest margin on the reweighted training sample, under\na biased de\ufb01nition of margin that re\ufb02ects the class weights. This is a data-driven bias, based purely\non classi\ufb01cation performance, which does not take computational ef\ufb01ciency into account. To induce\nthe MRes behavior, it must be complemented by a structural bias that modi\ufb01es the class weighting\nto encourage the desired multi resolution structure. We show that this can be implemented by aug-\nmenting MCBoost with structural bias parameters that lead to a new structurally biased boosting\nalgorithm (SBBoost). This can also be seen as a variant of boosting with tunable margin thresholds\nor as boosting under a cost-sensitive risk. By establishing a connection between the bias parameters\nand the computational complexity of cascade stages, we then derive a stage adaptive bias policy\nthat guarantees computationally ef\ufb01cient MRes cascades of high detection rate. 
Experiments in\nmulti-view car detection and simultaneous detection of multiple traf\ufb01c signs show that the resulting\nclassi\ufb01ers are faster and more accurate than those previously available.\n\n2 Boosting with structural biases\n\nConsider the design of a M class cascade. The M target classes are augmented with a class M + 1,\nthe negative class, containing non-target examples. The goal is to learn a multiclass cascade detector\nH[h1(x), . . . , hr(x)] with r stages. This has the structure of Figure 1-a but, instead of a binary\ndetector, each stage is a multiclass classi\ufb01er hk(x) : X \u2192{1, . . . , M + 1}. Mathematically,\n\nH[h1(x), . . . , hr(x)] =\n\nif hk(x) (cid:54)= M + 1 \u2200k,\nif \u2203k| hk(x) = M + 1.\n\n(cid:26) hr(x)\n\nM + 1\n\nWe propose to learn the cascade stages with an extension of the MCBoost framework for multi-\nclass boosting of [13]. The class labels {1, . . . , M + 1} are \ufb01rst translated into a set of codewords\ni=1 yi = 0. MCBoost uses the codewords to\nlearn a M-dimensional predictor F \u2217(x) = [f1(x), . . . , fM (x)] \u2208 RM so that\n\n{y1, . . . , yM +1} \u2208 RM that form a simplex where(cid:80)M +1\nM +1(cid:88)\n(cid:80)n\n\n\uf8f1\uf8f4\uf8f2\uf8f4\uf8f3 F \u2217(x) = arg minF (x) R[F ] = 1\n\ne\u2212 1\n\n2 [(cid:104)yzi ,F (xi)(cid:105)\u2212(cid:104)yj ,F (xi)(cid:105)]\n\n(2)\n\nF (x) \u2208 span(G),\n\nwhere G = {gi} is a set of weak learners. This is done by iterative descent [3, 9]. 
At each iteration,\nthe best update for F (x) is identi\ufb01ed as\ng\u2217\nk = arg max\n\n(3)\n\ns.t\n\nj=1\n\ni=1\n\nn\n\nwith\n\n\u2212\u03b4R[F ; g] = \u2212 \u2202R[f t + \u0001g]\n\n\u2202\u0001\n\nk=1\nThe optimal step size along this weak learner direction is\n\ni=1\n\ng\u2208G \u2212\u03b4R[F ; g],\nn(cid:88)\n\nM +1(cid:88)\n(cid:104)g(xi), yzi \u2212 yk(cid:105)e\u2212 1\n\n(cid:12)(cid:12)(cid:12)(cid:12)\u0001=0\n\n=\n\n1\n2\n\n2(cid:104)F (xi),yzi\u2212yk(cid:105).\n\nand the predictor is updated according to F (x) = F (x) + \u03b1\u2217g\u2217(x). The \ufb01nal decision rule is\n\n\u03b1\u2217 = arg min\n\n\u03b1\u2208R R[F (x) + \u03b1g\u2217(x)],\n\nh(x) = arg max\n\nk=1...M +1\n\n(cid:104)yk, F \u2217(xi)(cid:105).\n\n(1)\n\n(4)\n\n(5)\n\n(6)\n\n(8)\n\n(9)\n\nWe next provide an analysis of the updates of (3) which inspires the design of MRes cascades.\nWeak learner selection: the multiclass margin of predictor F (x) for an example x from class z is\n(7)\n\n(cid:104)F (x), yz \u2212 yj(cid:105),\n\nM(z, F (x)) = (cid:104)F (x), yz(cid:105) \u2212 max\nj(cid:54)=z\n\n(cid:104)F (x), yj(cid:105) = min\nj(cid:54)=z\n\nwhere (cid:104)F (x), yz \u2212 yj(cid:105) is the margin component of F (x) with respect to class j. Rewriting (3) as\n\n\u2212\u03b4R[F ; g] =\n\n=\n\n1\n2\n\n1\n2\n\nM +1(cid:88)\n\nn(cid:88)\nn(cid:88)\n\ni=1\n\nk=1|k(cid:54)=zi\nw(xi)(cid:104)g(xi),\n\nM +1(cid:88)\n\nk=1|k(cid:54)=zi\n\ni=1\n\n3\n\n(cid:104)g(xi), yzi \u2212 yk(cid:105)e\u2212 1\n\n2(cid:104)F (xi),yzi\u2212yk(cid:105)\n\n\u03c1k(xi)(yzi \u2212 yk)(cid:105),\n\n\fwhere\n\nw(xi) =\n\nM(cid:88)\n\nk=1|k(cid:54)=zi\n\ne\u2212 1\n\n2(cid:104)F (xi),yzi\u2212yk(cid:105),\n\n\u03c1k(xi) =\n\n(cid:80)M\n\ne\u2212 1\nk=1|k(cid:54)=zi\n\n2(cid:104)F (xi),yzi\u2212yk(cid:105)\n\n2(cid:104)F (xi),yzi\u2212yk(cid:105) .\n\ne\u2212 1\n\n(10)\n\nenables the interpretation of MCBoost as a generalization of AdaBoost. 
From (10), an example xi\nhas large weight w(xi) if F (xi) has at least one large negative margin component, namely\n\n(cid:104)F (xi), yz \u2212 y(cid:105) < 0\n\nfor\n\ny = arg min\nyj(cid:54)=yz\n\n(cid:104)F (xi), yz \u2212 yj(cid:105).\n\n(11)\n\nIn this case, it follows from (6) that xi is incorrectly classi\ufb01ed into the class of codeword y. In sum-\nmary, as in AdaBoost, the weighting mechanism of (9) emphasizes examples incorrectly classi\ufb01ed\nby the current predictor F (x). However, in the multiclass setting, this is only part of the weighting\nmechanism, since the terms \u03c1k(xi) of (9)-(10) are coef\ufb01cients of a soft-min operator over margin\ncomponents (cid:104)F (xi), yzi \u2212 yk(cid:105). Assuming the soft-min closely approximates the min, (9) becomes\n\n\u2212\u03b4R[F ; g] \u2248 n(cid:88)\n\nw(xi)MF (yzi, g(xi)),\n\n(12)\n\ni=1\n\nwhere\n\n(13)\nand y is the codeword of (11). This is the multiclass margin of weak learner g(x) under an alternative\nmargin de\ufb01nition MF (z, g(x)). Comparing to the original de\ufb01nition of (7), which can be written as\n\nMF (z, g(x)) = (cid:104)g(x), yz \u2212 y(cid:105).\n\nM(z, g(x)) =\n\n(cid:104)g(x), yz \u2212 yj(cid:105),\n\n(cid:104)g(x), yz \u2212 y(cid:105)\n\n1\n2\n\nwhere\n\ny = arg min\nyj(cid:54)=yz\n\n(14)\nMF (yz, g(x)) restricts the margin of g(x) to the worst case codeword y for the current predictor\nF (x). The strength of this restriction is determined by the soft-min operator. If < F (x), yz \u2212 y > is\nmuch smaller than < F (x), yz \u2212 yj > \u2200yj (cid:54)= y, \u03c1k(x) closely approximates the minimum operator\nand (12) is identical to (9). Otherwise, the remaining codewords also contribute to (9). In summary,\n\u03c1k(xi) is a set of class weights that emphasizes classes of small margin for F (x). The inner product\nof (9) is the margin of g(x) after this class reweighting. 
Overall, MCBoost weights introduce a bias\ntowards dif\ufb01cult examples (weights w) and dif\ufb01cult classes (margin MF ).\nStructural biases: The core idea of cascade design is to bias the learning algorithm towards compu-\ntationally ef\ufb01cient classi\ufb01er architectures. This is not a data driven bias, as in the previous section,\nbut a structural bias, akin to the use of a prior (in Bayesian learning) to guarantee that a graphi-\ncal model has a certain structure. For example, because classi\ufb01er speed depends critically on the\nability to quickly eliminate negative examples, the initial cascade stages should effectively behave\nas a binary classi\ufb01er (all classes vs. negative). This implies that the learning algorithm should be\nbiased towards predictors of large margin component (cid:104)F (x), yz \u2212 yM +1(cid:105) with respect to the neg-\native class j = M + 1. We propose to implement this structural bias by forcing yM +1 to be the\ndominant codeword in the soft-min weighting of (10). This is achieved by rescaling the soft-min\n2(cid:104)F (xi),yzi\u2212yk(cid:105), where\ncoef\ufb01cients, i.e. by using an alternative soft-min operator \u03c1\u03b1\n\u03b1k = \u03c4 \u2208 [0, 1] for k (cid:54)= M + 1 and \u03b1M +1 = 1. The parameter \u03c4 controls the strength of the\nstructural bias. When \u03c4 = 0, \u03c1\u03b1\nk (xi) assigns all weight to codeword yM +1 and the structural bias\nk (xi) varies between the data driven bias of \u03c1k(xi) and\ndominates. For 0 < \u03c4 < 1 the bias of \u03c1\u03b1\nthe structural bias towards yM +1. When \u03c4 = 1, \u03c1\u03b1\nk (xi) = \u03c1k(xi), the bias is purely data driven,\nas in MCBoost. More generally, we can de\ufb01ne biases towards any classes (beyond j = M + 1) by\nallowing different \u03b1k \u2208 [0, 1] for different k (cid:54)= M + 1. From (10), this is equivalent to rede\ufb01ning\nthe margin components as (cid:104)F (xi), yzi \u2212 yk(cid:105) \u2212 2 log \u03b1k. 
Finally the biases can be adaptive with\nrespect to the class of xi, by rede\ufb01ning the margin components as (cid:104)F (xi), yzi \u2212 yk(cid:105) \u2212 \u03b4zi,k. Under\nthis structurally biased margin, the approximate boosting updates of (12) become\n\nk (xi) \u221d \u03b1ke\u2212 1\n\n\u2212\u03b4R[F ; g] \u2248 n(cid:88)\n\nw(xi)Mc\n\nF (yzi, g(xi)),\n\nwhere\n\nMc\n\nF (z, g(x)) = (cid:104)g(x), yz \u2212 \u02c6y(cid:105) \u2212 \u03b4zi,k\n\ni=1\n\n\u02c6y = arg min\nyj(cid:54)=yz\n\n(cid:104)F (x), yz \u2212 yj(cid:105) \u2212 \u03b4zi,k.\n\n4\n\n(15)\n\n(16)\n\n\fThis is, in turn, equivalent to the approximation of (9) by (12) under the de\ufb01nition of margin as\n\nMc(z, F (x)) = min\nj(cid:54)=z\n\n(cid:104)F (x), yz \u2212 yj(cid:105) \u2212 \u03b4z,j,\n\nand boosting weights\n\nM(cid:88)\n\nwc(xi) =\n\nk=1|k(cid:54)=zi\n\ne\u2212 1\n\n2 [(cid:104)F (xi),yzi\u2212yk(cid:105)\u2212\u03b4zi,k],\n\n\u03c1c\nk(xi) =\n\n(cid:80)M\n\ne\u2212 1\nl=1|k(cid:54)=zi\n\n2 [(cid:104)F (xi),yzi\u2212yk(cid:105)\u2212\u03b4zi,k]\n\ne\u2212 1\n\n2 [(cid:104)F (xi),yzi\u2212yl(cid:105)\u2212\u03b4zi,l]\n\n(17)\n\n. (18)\n\nWe denote the boosting algorithm with these weights as structurally biased boosting (SBBoost).\nAlternative interpretations: the parameters \u03b4zi,k, which control the amount of structural bias, can\nbe seen as thresholds on the margin components. For binary classi\ufb01cation, where M = 1, y1 =\n1, y2 = \u22121 and F (x) is scalar, (7) reduces to the standard margin M(z, F (x)) = yzF (x), (10) to\nthe standard boosting weights w(xi) = e\u2212yzi F (xi) and \u03c1k(xi) = 1, k \u2208 {1, 2}. In this case, MC-\nBoost is identical to AdaBoost. SBBoost can thus been seen as an extension of AdaBoost, where\nthe margin is rede\ufb01ned to include thresholds \u03b4zi according to Mc(z, F (x)) = yzF (x) \u2212 \u03b4z. By\ncontrolling the thresholds it is possible to bias the learned classi\ufb01er towards accepting or rejecting\nmore examples. 
For multiclass classi\ufb01cation, a larger \u03b4z,j encodes a larger bias against assigning\nexamples from class z to class j. This behavior is frequently denoted as cost-sensitive classi\ufb01cation.\nWhile it can be achieved by training a classi\ufb01er with AdaBoost (or MCBoost) and adding thresholds\nto the \ufb01nal decision rule, this is suboptimal since it corresponds to using a classi\ufb01cation boundary on\nwhich the predictor F (x) was not trained [8]. Due to Boosting\u2019s weighting mechanism (which em-\nphasizes a small neighborhood of the classi\ufb01cation boundary), classi\ufb01cation accuracy can be quite\npoor when the thresholds are introduced a-posteriori. Signi\ufb01cantly superior performance is achieved\nwhen the thresholds are accounted for by the learning algorithm, as is the case for SBBoost. Boost-\ning algorithms with this property are usually denoted as cost-sensitive and derived by introducing a\nset of classi\ufb01cation costs in the risk of (2). It can be shown, through a derivation identical to that of\nSection 2, that SBBoost is a cost-sensitive boosting algorithm with respect to the risk\n\nCz,je\u2212 1\n\n2(cid:104)yzi ,F (xi)(cid:105)\u2212(cid:104)yj ,F (xi)(cid:105)\n\n(19)\n\nn(cid:88)\n\nM +1(cid:88)\n\ni=1\n\nj=1\n\nRc\n\n[F ] =\n\n1\nn\n\nwith \u03b4z,j = 1\n2 log Cz,j. Under this interpretation, the bias parameters \u03b4z,j are the log-costs of\nassigning examples of class z to class j. For binary classi\ufb01cation, SBBoost reduces to the cost-\nsensitive boosting algorithm of [18].\n\n3 Boosting MRes cascades\n\nIn this section we discuss a strategy for the selection of bias parameters \u03b4i,j that encourage multi-\nresolution behavior. We start by noting that some biases must be shared by all stages. For example,\nwhile a cascade cannot recover a rejected target, the false-positives of a stage can be rejected by its\nsuccessors. 
Hence, the learning of each stage must enforce a bias against target rejections, at the cost\nof increased false-positive rates. This high detection rate problem has been the subject of extensive\nresearch in binary cascade learning, where a bias against assigning examples to the negative class\nis commonly used [18, 8]. The natural multiclass extension is to use much larger thresholds for the\nmargin components with respect to the negative class than the others, i.e.\n\n\u03b4k,M +1 (cid:29) \u03b4M +1,k \u2200k = 1, . . . , M.\n\nWe implement this bias with the thresholds\n\n\u03b4k,M +1 = log \u03b2\n\n\u03b4M +1,k = log(1 \u2212 \u03b2)\n\n\u03b2 \u2208 [0.5, 1].\n\n(20)\n\n(21)\n\nThe value of \u03b2 is determined by the target detection rate of the cascade. For each boosting iteration,\nwe set \u03b2 = 0.5 and measure the detection rate of the cascade. If this falls below the target rate, \u03b2 is\nincreased to (\u03b2 + 1)/2. The process is repeated until the desired rate is achieved.\nThere is also a need for structural biases that vary with the cascade stage. For example, the compu-\ntational complexity ct+1 of stage t + 1 is proportional to the product of the per-example complexity\n\n5\n\n\f\u0001t+1 of the classi\ufb01er (e.g. number of weak learners) and the number of image sub-windows that it\nevaluates. Since the latter is dominated by the false positives rate of the previous cascade stages,\nf pt, it follows that ct+1 \u221d f pt\u0001t+1. Since f pt decreases with t, an ef\ufb01cient cascade must have early\nstages of low complexity and more complicated detectors in later stages. This suggests the use of\nstages that gradually progress from binary to multiclass. Early stages eliminate false-positives, late\nstages are accurate multiclass classi\ufb01ers. In between, the cascade stages should detect intermediate\nnumbers of classes, according to the structure of the data. 
Cascades with this structure represent the\nset of classes at different resolutions and are denoted Multi-Resolution (MRes) cascades.\nTo encourage the MRes structure, we propose the following stage adaptive bias policy\n\n\u2200k, l \u2208 {1, . . . , M}\nfor k \u2208 {1, . . . , M} and l = M + 1\nfor k = M + 1 and l \u2208 {1, . . . , M},\n\n(22)\n\n\uf8f1\uf8f2\uf8f3 \u03b3t = log F P\n\nlog \u03b2\nlog(1 \u2212 \u03b2)\n\n\u03b4t\nk,l =\n\nf pt\n\nwhere F P is the target false-positive rate for the whole cascade. This policy complements the stage-\nk,l = \u03b3t,\u2200k, l \u2208\nindependent bias towards high detection rate (due to \u03b2) with a stage dependent bias \u03b4t\n{1, . . . , M}. This has the following consequences. First, since \u03b2 \u2265 0.5 and f pt (cid:29) 2F P when t is\nsmall, it follows that \u03b3t (cid:28) \u03b4k,M +1 in the early stages. Hence, for these stages, there is a much larger\nbias against rejection of examples from the target classes {1, . . . , M}, than for the differentiation\nof these classes. In result, the classi\ufb01er ht(x) is an all-class detector, as in Figure 1-d. Second, for\nlarge t, where f pt approaches FP, \u03b3t decreases to zero. In this case, there is no bias against class\ndifferentiation and the learning algorithm places less emphasis on improvements of false-positive\nrate (\u03b4k,M +1 \u2248 \u03b3t) and more emphasis on target differentiation. Like MCBoost (which has no\nbiases), it will focus in the precise assignment of targets to their individual classes. In result, for\nlate cascade stages, ht(x) is a multiclass classi\ufb01er, similar to the class post-estimator of Figure 1-\nd. Third, for intermediate t, it follows from (19) and e\u03b3t \u221d \u0001t+1/ct+1 that the learned cascade\nz,j \u221d 1/\u03bdt+1, for z, j \u2208 {1, . . . 
, M} where \u03bdt = ct/\u0001t.\nstages are optimal under a risk with costs C t\nNote that \u03bdt is a measure of how much the computational cost per example is magni\ufb01ed by stage\nt, therefore this risk favors cascades with stages of low complexity magni\ufb01cation. In result, weak\nlearners are preferentially added to the stages where their addition produces the smallest overall\ncomputational increase. This makes the resulting cascades computationally ef\ufb01cient, since 1) stages\nof high complexity magni\ufb01cation have small per example complexity \u0001t and 2) classi\ufb01ers of large\nper example complexity are pushed to the stages of low complexity magni\ufb01cation. Since complexity\nmagni\ufb01cation is proportional to false-positive rate (ct/\u0001t \u221d f pt\u22121), multiclass decisions (higher \u0001t)\nare pushed to the latter cascade stages. This push is data driven and gradual and thus the cascade\ngradually transitions from binary to multiclass, becoming a soft version of the detector of Figure\n1-d.\n\n4 Experiments\n\nSBBoost was evaluated on the tasks of multi-view car detection, and multiple traf\ufb01c sign detection.\nThe resulting MRes cascades were compared to the detectors of Figure 1. Since it has been estab-\nlished in the literature that the all-class detector with post-estimation has poor performance [5], the\ncomparison was limited to parallel cascades [19] and parallel cascades with pre-estimation [5]. All\nbinary cascade detectors were learned with a combination of the ECBoost algorithm of [14] and the\ncost-sensitive Boosting method of [18]. Following [2], all cascaded detectors used integral channel\nfeatures and trees of depth two as weak learners. The training parameters were set to \u03b7 = 0.02,\nD = 0.95, F P = 10\u22126 and the training set was bootstrapped whenever the false positive rate\ndropped below 90%. 
Bootstrapping also produced an estimate of the real false positive rate f pt,\nused to de\ufb01ne the biases \u03b4t\nk,l. As in [5], the detector cascade with pre-class estimation used tree\nclassi\ufb01ers for pre-estimation. In the remainder of this section, detection rate is de\ufb01ned as the per-\ncentage of target examples, from all views or target classes, that were detected. Detector accuracy\nis the percentage of the target examples that were detected and assigned to the correct class. Finally,\ndetector complexity is the average number of tree node classi\ufb01ers evaluated per example.\nMulti-view Car Detection: To train a multi-view car detector, we collected images of 128 Frontal,\n100 Rear, 103 Left, and 103 Right car views. These were resized to 41 \u00d7 70 pixels. The multi-view\ncar detector was evaluated on the USC car dataset [6], which consists of 197 color images of size\n480 \u00d7 640, containing 410 instances of cars in different views.\n\n6\n\n\fa)\n\nFigure 2: ROCs for a) multi-view car detection and b) traf\ufb01c sign detection.\n\nb)\n\nTable 1: Multi-view car detection performance at 100 false positives.\n\ncar detection\n\ntraf\ufb01c sign detection\n\nMethod\n\ncomplexity accuracy det. rate complexity accuracy det. rate\n\nParallel Cascades [19]\nP.C. + Pre-estimation [5] 15.15 + 6\n\n59.94\n\nMRes cascade\n\n16.40\n\n0.35\n0.35\n0.58\n\n0.72\n0.70\n0.88\n\n10.08\n\n2.32 + 4\n\n5.56\n\n0.78\n0.78\n0.84\n\n0.78\n0.78\n0.84\n\nThe ROCs of the various cascades are shown in Figure 2-a. Their detection rate, accuracy and\ncomplexity are reported in Table 1. The complexity of parallel cascades with pre-processing is\nbroken up into the complexity of the cascade plus the complexity of the pre-estimator. Figure 2-\na, shows that the MRes cascade has signi\ufb01cantly better ROC performance than any of the other\ndetectors. 
This is partially due to the fact that the detector is learned jointly across classes and thus\nhas access to more training examples. In result, there is less over-\ufb01tting and better generalization.\nFurthermore, as shown in Table 1, the MRes cascade is much faster. The 3.5-fold reduction of\ncomplexity over the parallel cascade suggests that MRes cascades share features very ef\ufb01ciently\nacross classes. The MRes cascade also detects 16% more cars and assigns 23% more cars to the\ntrue class. The parallel cascade with pre-processing was slightly less accurate than the parallel\ncascade but three times as fast. Its accuracy is still 23% lower than that of the MRes cascade and the\ncomplexity of the pre-estimator makes it 20% slower.\nFigure 3 shows the evolution of the detection rate, false positive rate, and accuracy of the MRes cas-\ncade with learning iterations. Note that the detection rate is above the speci\ufb01ed D = 95% throughout\nthe learning process. This is due to the updating of the \u03b2 parameter of (22). It can also be seen that,\nwhile the false positive rate decreases gradually, accuracy remains low for many iterations. This\nshows that the early stages of the MRes cascade place more emphasis on rejecting negative exam-\nples (lowering the false positive rate) than making precise view assignments for the car examples.\nThis re\ufb02ects the structural biases imposed by the policy of (22). Early on, confusion between classes\nhas little cost. However, as the cascade grows and its false positive rate f pt decreases, the detector\nstarts to distinguish different car views. This happens soon after iteration 100, where there is a sig-\nni\ufb01cant jump in accuracy. Note, however, that the false-positive rate is still 10\u22124 at this point. In the\nremaining iterations, the learning algorithm continues to improve this rate, but also \u201cgoes to work\u201d\non increasing accuracy. 
Eventually, the false-positive rate flattens and SBBoost behaves as a multiclass boosting algorithm. Overall, the MRes cascade behaves as a soft version of the all-class detector cascade with post-estimation, shown in Figure 1-d.
Traffic Sign Detection: For the detection of traffic signs, we extracted 1,159 training examples from the first set of the Summer traffic sign dataset [7]. This produced 660 examples of "priority road", 145 of "pedestrian crossing", 232 of "give way", and 122 of "no stopping no standing" signs. For training, these images were resized to 40 × 40. For testing, we used 357 images from the second set of the Summer dataset that contained at least one visible instance of the traffic signs with more than 35 pixels of height. The performance of the different traffic sign detectors is reported in Figure 2-b) and Table 1. Again, the MRes cascade was both faster and more accurate than the alternatives, detecting and recognizing 6% more traffic signs.

[Figure 2: detection rate versus number of false positives for the parallel cascade, the parallel cascade with pre-estimate, and the MRes cascade.]

Figure 3: MRes cascade detection rate (left), false positive rate (center), and accuracy (right) during learning.

Figure 4: Evolution of MRes cascade decisions for 20 randomly selected examples of 17 traffic sign classes. Each row illustrates the evolution of the label assigned to one example. The ground-truth traffic sign classes and corresponding label colors are shown on the left.

We next trained an MRes cascade for detection of the 17 traffic signs shown on the left of Figure 4. The figure also shows the evolution of MRes cascade decisions for 20 examples from each of the different classes. Each row of color pixels illustrates the evolution of one example; the color of the kth pixel in a row indicates the decision made by the cascade after k weak learners. The traffic signs and corresponding colors are shown on the left of the figure. Note that the early cascade stages reject only a few examples, assigning most of the remainder to the first class. This assures a high detection rate but very low accuracy. However, as more weak learners are evaluated, the detector starts to create some intermediate categories. For example, after 20 weak learners, all traffic signs containing red and yellow colors are assigned to the "give way" class. Evaluating more weak learners further separates these classes. Eventually, almost all examples are assigned to the correct class (right side of the picture). This shows that, besides being a soft version of the all-class detector cascade, the MRes cascade automatically creates an internal class taxonomy.
Finally, although we have not produced detection ground truth for this experiment, we have empirically observed that the final 17-traffic-sign MRes cascade is accurate and has low complexity (5.15). This makes it possible to run the detector in real time on low-complexity devices, such as smart-phones. A video illustrating the detection results is available in the supplementary material.

5 Conclusion

In this work, we have made various contributions to multiclass boosting with structural constraints and cascaded detector design. First, we proposed that a multiclass detector cascade should have an MRes structure, where early stages are binary target vs. non-target detectors and late stages perform fine target discrimination. Learning such cascades requires the addition of a structural bias to the learning algorithm.
Second, to incorporate such biases in boosting, we analyzed the recent MCBoost algorithm, showing that it implements two complementary weighting mechanisms. The first is the standard weighting of examples by difficulty of classification. The second is a redefinition of the margin so as to weight the most difficult classes more heavily. This class reweighting was interpreted as a data-driven class bias, aimed at optimizing classification performance. This suggested a natural way to add structural biases, by modifying class weights so as to favor the desired MRes structure. Third, we showed that such biases can be implemented through the addition of a set of thresholds, the bias parameters, to the definition of multiclass margin. This was, in turn, shown to be identical to a cost-sensitive multiclass boosting algorithm, using the bias parameters as log-costs of misclassifying examples between pairs of classes. Fourth, we introduced a stage-adaptive policy for the determination of bias parameters, which was shown to enforce a bias towards cascade stages of 1) high detection rate, and 2) MRes structure. Cascades designed under this policy were shown to have stages that progress from binary to multiclass in a gradual, data-driven, and computationally efficient manner. Finally, these properties were illustrated in fast multiclass object detection experiments involving multi-view car detection and detection of multiple traffic signs. These experiments showed that MRes cascades are faster and more accurate than previous solutions.

References

[1] L. Bourdev and J. Brandt. Robust object detection via soft cascade. In CVPR, pages 236–243, 2005.
[2] P. Dollar, Z.
Tu, P. Perona, and S. Belongie. Integral channel features. In BMVC, 2009.
[3] J. H. Friedman. Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29:1189–1232, 2001.
[4] C. Huang, H. Ai, Y. Li, and S. Lao. High-performance rotation invariant multiview face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(4):671–686, 2007.
[5] M. Jones and P. Viola. Fast multi-view face detection. In CVPR, 2003.
[6] C. Kuo and R. Nevatia. Robust multi-view car detection using unsupervised sub-categorization. In Workshop on Applications of Computer Vision (WACV), pages 1–8, 2009.
[7] F. Larsson, M. Felsberg, and P. Forssen. Correlating Fourier descriptors of local patches for road sign recognition. IET Computer Vision, 5(4):244–254, 2011.
[8] H. Masnadi-Shirazi and N. Vasconcelos. Cost-sensitive boosting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33:294–309, 2011.
[9] L. Mason, J. Baxter, P. Bartlett, and M. Frean. Boosting algorithms as gradient descent. In NIPS, 2000.
[10] D. Mease and A. Wyner. Evidence contrary to the statistical view of boosting. Journal of Machine Learning Research, 9:131–156, June 2008.
[11] X. Perrotton, M. Sturzel, and M. Roux. Implicit hierarchical boosting for multi-view object detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 958–965, 2010.
[12] M. Pham, V.-D. D. Hoang, and T. Cham. Detection with multi-exit asymmetric boosting. In CVPR, pages 1–8, 2008.
[13] M. Saberian and N. Vasconcelos. Multiclass boosting: Theory and algorithms. In NIPS, 2011.
[14] M. Saberian and N. Vasconcelos. Learning optimal embedded cascades. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 2005–2018, 2012.
[15] J. Sochman and J. Matas.
WaldBoost: learning for time constrained sequential detection. In CVPR, pages 150–157, 2005.
[16] A. Torralba, K. Murphy, and W. Freeman. Sharing visual features for multiclass and multiview object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(5):854–869, 2007.
[17] P. Viola and M. Jones. Robust real-time object detection. In Workshop on Statistical and Computational Theories of Vision, 2001.
[18] P. Viola and M. Jones. Fast and robust classification using asymmetric AdaBoost and a detector cascade. In NIPS, pages 1311–1318, 2002.
[19] B. Wu, H. Ai, C. Huang, and S. Lao. Fast rotation invariant multi-view face detection based on Real AdaBoost. In IEEE International Conference on Automatic Face and Gesture Recognition, pages 79–84, 2004.
[20] Q. Zhu, S. Avidan, M. Yeh, and K. Cheng. Fast human detection using a cascade of histograms of oriented gradients. In CVPR, pages 1491–1498, 2006.