{"title": "Fast and Robust Classification using Asymmetric AdaBoost and a Detector Cascade", "book": "Advances in Neural Information Processing Systems", "page_first": 1311, "page_last": 1318, "abstract": "", "full_text": "Fast and Robust Classi\ufb01cation using Asymmetric\n\nAdaBoost and a Detector Cascade\n\nPaul Viola and Michael Jones\nMistubishi Electric Research Lab\n\nCambridge, MA\n\nviola@merl.com and mjones@merl.com\n\nAbstract\n\nThis paper develops a new approach for extremely fast detection in do-\nmains where the distribution of positive and negative examples is highly\nskewed (e.g. face detection or database retrieval). In such domains a\ncascade of simple classi\ufb01ers each trained to achieve high detection rates\nand modest false positive rates can yield a \ufb01nal detector with many desir-\nable features: including high detection rates, very low false positive rates,\nand fast performance. Achieving extremely high detection rates, rather\nthan low error, is not a task typically addressed by machine learning al-\ngorithms. We propose a new variant of AdaBoost as a mechanism for\ntraining the simple classi\ufb01ers used in the cascade. Experimental results\nin the domain of face detection show the training algorithm yields sig-\nni\ufb01cant improvements in performance over conventional AdaBoost. The\n\ufb01nal face detection system can process 15 frames per second, achieves\nover 90% detection, and a false positive rate of 1 in a 1,000,000.\n\n1 Introduction\n\nIn many applications fast classi\ufb01cation is almost as important as accurate classi\ufb01cation.\nCommon examples include robotics, user interfaces, and classi\ufb01cation in large databases.\nIn this paper we demonstrate our approach in the domain of low latency, sometimes called\n\u201creal-time\u201d, face detection. An extremely fast face detector is a critical component in\nmany applications. User-interfaces can be constructed which detect the presence and num-\nber of users. Teleconference systems can automatically devote additional bandwidth to\nparticipant\u2019s faces. Video security systems can record facial images of individuals after\nunauthorized entry.\n\nRecently we presented a real-time face detection system which scans video images at 15\nframes per second [8] yet achieves detection rates comparable with the best published re-\nsults (e.g. [7]) 1 Face detection is a scanning process, in which a face classi\ufb01er is evaluated\nat every scale and location within each image. Since there are about 50,000 unique scales\n\n1In order to achieve real-time speeds other systems often resort to skin color \ufb01ltering in color\nimages or motion \ufb01ltering in video images. These simple queues are useful but unreliable. In large\nimage databases color and motion are often unavailable. Our system detects faces using only static\nmonochrome information.\n\n\fand locations in a typical image, this amounts to evaluating the face classi\ufb01er 750,000 times\nper second.\n\nOne key contribution of our previous work was the introduction of a classi\ufb01er cascade.\nEach stage in this cascade was trained using AdaBoost until the required detection per-\nformance was achieved [2]. In this paper we present a new training algorithm designed\nspeci\ufb01cally for a classi\ufb01er cascade called asymmetric AdaBoost. The algorithm is a gener-\nalization of that given in Singer and Shapire [6]. Many of the formal guarantees presented\nby Singer and Shapire also hold for this new algorithm. 
The paper concludes with a set of experiments in the domain of face detection demonstrating that asymmetric AdaBoost yields a significant improvement in detection performance over conventional boosting.

2 Classifier Cascade

In the machine learning community it is well known that more complex classification functions yield lower training errors yet run the risk of poor generalization. If the main consideration is test set error, structural risk minimization provides a formal mechanism for selecting a classifier with the right balance of complexity and training error [1].

Another significant consideration in classifier design is computational complexity. Since time and error are fundamentally different quantities, no theory can simply select the optimal trade-off. Nevertheless, for many classification functions computation time is directly related to structural complexity. In this way temporal risk minimization is clearly related to structural risk minimization.

This direct analogy breaks down in domains where the distribution over the class labels is highly skewed. For example, in the domain of face detection, there are at most a few dozen faces among the 50,000 sub-windows in an image. Surprisingly, in these domains it is often possible to have the best of both worlds: high detection rates and extremely fast classification. The key insight is that while it may be impossible to construct a simple classifier which achieves a low training/test error, in some cases it is possible to construct a simple classifier with a very low false negative rate. For example, in the domain of face detection, we have constructed an extremely fast classifier with a very low false negative rate (i.e. it almost never misses a face) and a 50% false positive rate. Such a detector might be more accurately called a classification pre-filter: when an image region is labeled 'non-face' it can be immediately discarded, but when a region is labeled 'face' further classification effort is required. Such a pre-filter can be used as the first stage in a cascade of classifiers (see Figure 1).

In our face detection application (described in more detail in Section 5) the cascade has 38 stages. Even though there are many stages, most are not evaluated for a typical non-face input window, since the early stages weed out many non-faces. In fact, over a large test set, the average number of stages evaluated is less than 2. In a cascade, the computation time and detection rate of the first few stages are critically important to overall performance. The remainder of the paper describes techniques for training cascade classifiers which are efficient yet effective.

3 Using Boosting to Train the Cascade

In general almost any form of classifier can be used to construct a cascade; the key properties are that the computation time and the detection rate can be adjusted. Examples include support vector machines, perceptrons, and nearest neighbor classifiers. In the case of an SVM the computation time is directly related to the number of support vectors and the detection rate is related to the margin threshold [1].
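To make the cascade evaluation concrete, the following is a minimal sketch of the control flow implied by Figure 1. The stage representation (a list of score-function/threshold pairs) and the helper names are illustrative assumptions, not the data structures used in our detector.

    # Minimal sketch of detector-cascade evaluation (illustrative only).
    # Each stage is a (score_function, threshold) pair; a sub-window is
    # rejected as soon as any stage scores it below its threshold.

    def evaluate_cascade(stages, window):
        """Return True if every stage accepts the sub-window."""
        for score_fn, threshold in stages:
            if score_fn(window) < threshold:
                return False      # early rejection: later stages never run
        return True               # survived all stages -> candidate face

    def scan_image(stages, windows):
        """Apply the cascade to every candidate sub-window of an image."""
        return [w for w in windows if evaluate_cascade(stages, w)]

Because the overwhelming majority of sub-windows are rejected by the first stage or two, the average cost per window is dominated by those early stages, which is why their feature counts are kept very small.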
[Figure 1 diagram: all sub-windows enter a sequence of classifier stages 1, 2, 3, ...; a true ("T") outcome passes the sub-window on to the next stage and eventually to further processing, while a false ("F") outcome at any stage rejects the sub-window.]

Figure 1: Schematic depiction of a detection cascade. A sequence of classifiers is applied to every example. The initial classifier eliminates a large number of negative examples with very little processing. Subsequent stages eliminate additional negatives but require additional computation. Extremely few negative examples remain after several stages.

In our system each classifier in the cascade is a single layer perceptron whose input is a set of computationally efficient binary features. The computational cost of each classifier is then simply the number of input features. The detection rate is adjusted by changing the threshold (or bias).

Much of the power of our face detection system comes from the very large and varied set of features available. In our experiments over 6,000,000 different binary features were available for inclusion in the final classifiers (see Figure 4 for some example features). The efficiency of each classifier, and hence the efficiency of the cascade, is ensured because a very small number of features are included in the early stages; the first stage has 1 (!) feature, the second stage 5 features, then 20, and then 50. See Section 5 for a brief description of the feature set. The main contribution of this paper is the adaptation of AdaBoost to the task of feature selection and classifier learning.

Though it is not widely appreciated, AdaBoost provides a principled and highly efficient mechanism for feature selection [2, 6]. If the set of weak classifiers is simply the set of binary features (this is often called boosting stumps), each round of boosting adds a single feature to the set of current features.

AdaBoost is an iterative process in which each round selects a weak classifier, $h_t(x)$, which minimizes

$$Z_t = \sum_i D_t(i) \exp(-y_i h_t(x_i)). \qquad (1)$$

Following the notation of Schapire and Singer [6], $D_t(i)$ is the weight on example $i$ at round $t$, $x_i$ is the example, $y_i \in \{-1, +1\}$ is the target label of the example, and $h_t(x)$ is a confidence-rated binary classifier; the classifier $h_t(x)$ thus minimizes the weighted exponential loss at round $t$. After every round the weights are updated as follows:

$$D_{t+1}(i) = \frac{D_t(i) \exp(-y_i h_t(x_i))}{Z_t}. \qquad (2)$$

The classifier $h_t(x)$ takes on two possible values,

$$c_+ = \frac{1}{2} \ln \frac{W_{++}}{W_{+-}} \quad \mathrm{and} \quad c_- = \frac{1}{2} \ln \frac{W_{-+}}{W_{--}},$$

where $W_{lm}$ is the total weight of the examples assigned the label $l$ which have true label $m$. These predictions ensure that the weights on the next round are balanced: the relative weights of positive and negative examples on each side of the classification boundary are equal. Minimizing $Z_t$ in each round is also a greedy technique for minimizing $\prod_t Z_t$, which is an upper bound on the training error of the strong classifier. It has also been observed that the example weights are directly related to example margin, which leads to a principled argument for AdaBoost's generalization capabilities [5].
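To make the feature-selection view concrete, below is a minimal sketch of a single round of confidence-rated boosting over binary features (stumps), following Equations 1 and 2 and the two confidence values defined above. The array layout, the helper name boost_round, and the eps smoothing term are our own illustrative assumptions, not the training code used in our system.

    import numpy as np

    def boost_round(F, y, D, eps=1e-12):
        """One round of confidence-rated AdaBoost over binary features (stumps).

        F : (n_examples, n_features) array of binary feature outputs in {0, 1}
        y : (n_examples,) array of labels in {-1, +1}
        D : (n_examples,) array of example weights (sums to 1)
        Returns the selected feature index, its two confidence values, and
        the updated weights (Equations 1 and 2)."""
        best = None
        for j in range(F.shape[1]):
            on = F[:, j] == 1
            # W_lm: weight of examples the stump labels l whose true label is m
            Wpp = D[on & (y == 1)].sum();  Wpm = D[on & (y == -1)].sum()
            Wmp = D[~on & (y == 1)].sum(); Wmm = D[~on & (y == -1)].sum()
            c_plus = 0.5 * np.log((Wpp + eps) / (Wpm + eps))
            c_minus = 0.5 * np.log((Wmp + eps) / (Wmm + eps))
            h = np.where(on, c_plus, c_minus)          # confidence-rated output
            Z = np.sum(D * np.exp(-y * h))             # Equation 1
            if best is None or Z < best[0]:
                best = (Z, j, c_plus, c_minus, h)
        Z, j, c_plus, c_minus, h = best
        D_next = D * np.exp(-y * h) / Z                # Equation 2 (Z normalizes)
        return j, c_plus, c_minus, D_next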
The key advantage of AdaBoost as a feature selection mechanism, over competitors such as the wrapper method [3], is the speed of learning. Given the constraint that the search over features is greedy, AdaBoost efficiently selects the feature which minimizes $Z_t$, a surrogate for overall classification error. The entire dependence on previously selected features is efficiently and compactly encoded in the example weights. As a result, the addition of the 100th feature requires no more effort than the selection of the first feature.2

2 Given that there are millions of features and thousands of examples, the boosting process requires days of computation. Many other techniques, while feasible for smaller problems, are likely to be infeasible for a problem of this size.

4 Asymmetric AdaBoost

One limitation of AdaBoost arises in the context of skewed example distributions and cascaded classifiers: AdaBoost minimizes a quantity related to classification error; it does not minimize the number of false negatives. Given that the final form of the classifier is a weighted majority of features, the detection and false positive rates are adjustable after training. Unfortunately feature selection proceeds as if classification error were the only goal, and the features selected are not optimal for the task of rejecting negative examples.

One naive scheme for "fixing" AdaBoost is to modify the initial distribution over the training examples. If we hope to minimize false negatives, then the weight on positive examples could be increased so that the minimum error criterion will also yield very few false negatives. We can formalize this intuitive approach as follows. Recall that AdaBoost is a scheme which minimizes

$$\sum_i \exp(-y_i C(x_i)), \qquad (3)$$

where $C(x) = \sum_t h_t(x)$ is the output of the boosted classifier. Each term in the summation bounds a simple loss function from above:

$$\exp(-y_i C(x_i)) \ge \mathrm{Loss}(x_i, y_i) = \begin{cases} 1 & \mathrm{if}\ \hat{y}_i \ne y_i \\ 0 & \mathrm{otherwise,} \end{cases} \qquad (4)$$

where $\hat{y}_i = \mathrm{sign}(C(x_i))$ is the class assigned by the boosted classifier. As a result, minimizing (3) minimizes an upper bound on simple loss. We can introduce a related notion of asymmetric loss:

$$\mathrm{ALoss}(x_i, y_i) = \begin{cases} \sqrt{k} & \mathrm{if}\ y_i = 1\ \mathrm{and}\ \hat{y}_i = -1 \\ 1/\sqrt{k} & \mathrm{if}\ y_i = -1\ \mathrm{and}\ \hat{y}_i = 1 \\ 0 & \mathrm{otherwise,} \end{cases}$$

where false negatives cost $k$ times more than false positives. Note that the asymmetric loss differs from the simple loss only by the factor $\exp(y_i \log\sqrt{k})$ on misclassified examples. If we take the bound in Equation 4 and multiply both sides by $\exp(y_i \log\sqrt{k})$ we obtain a bound on the asymmetric loss:

$$\exp(y_i \log\sqrt{k}) \exp(-y_i C(x_i)) \ge \mathrm{ALoss}(x_i, y_i). \qquad (5)$$
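As a quick numerical sanity check of the bound in Equation 5 (in our reconstruction of the notation), the snippet below evaluates both sides on a few made-up cases; the values of k, y, and C(x) are arbitrary illustrations, not data from the experiments.

    import math

    def aloss(y, y_hat, k):
        """Asymmetric loss: false negatives cost k times more than false positives."""
        if y == 1 and y_hat == -1:
            return math.sqrt(k)        # false negative
        if y == -1 and y_hat == 1:
            return 1.0 / math.sqrt(k)  # false positive
        return 0.0

    k = 4.0
    for y, C in [(1, -0.7), (-1, 1.3), (1, 2.0)]:   # (true label, classifier output)
        lhs = math.exp(y * math.log(math.sqrt(k))) * math.exp(-y * C)
        rhs = aloss(y, int(math.copysign(1, C)), k)
        assert lhs >= rhs              # Equation 5 holds on each example
        print(f"y={y:+d}, C={C:+.1f}: bound {lhs:.3f} >= ALoss {rhs:.3f}")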
Minimization of this bound can be achieved simply by running AdaBoost with each example pre-weighted by $\exp(y_i \log\sqrt{k})$: the resulting procedure minimizes $\sum_i \exp(y_i \log\sqrt{k}) \exp(-y_i C(x_i))$, exactly as standard AdaBoost minimizes Equation 3. Expanding Equation 2 repeatedly to write $D_{t+1}(i)$ in terms of the initial weights $D_1(i) \propto \exp(y_i \log\sqrt{k})$, we arrive at

$$D_{t+1}(i) = \frac{\exp(-y_i \sum_{t'=1}^{t} h_{t'}(x_i)) \, \exp(y_i \log\sqrt{k})}{N \prod_{t'=1}^{t} Z_{t'}}, \qquad (6)$$

where $N$ is the initial normalization constant and the second term in the numerator arises because of the initial asymmetric weighting. Noticing that the left hand side must sum to 1 yields the following equality,

$$\prod_{t'=1}^{t} Z_{t'} = \frac{1}{N} \sum_i \exp(y_i \log\sqrt{k}) \exp\Big(-y_i \sum_{t'=1}^{t} h_{t'}(x_i)\Big). \qquad (7)$$

Therefore AdaBoost minimizes the required bound on the asymmetric loss.

Unfortunately this naive technique is only somewhat effective. The main reason is AdaBoost's balanced reweighting scheme, through which the initially asymmetric example weights are immediately lost. Essentially the AdaBoost process is too greedy: the first classifier selected absorbs the entire effect of the initial asymmetric weights, and the remaining rounds are entirely symmetric.

We propose a closely related approach that results in the minimization of the same bound, yet preserves the asymmetric loss throughout all rounds. Instead of applying the necessary asymmetric multiplier $\exp(y_i \log\sqrt{k})$ at the first round of an $n$ round process, its $n$th root $\exp(\frac{1}{n} y_i \log\sqrt{k})$ is applied before each round. Referring to Equation 6 we can see that the final effect is the same, since the product of the per-round multipliers over $n$ rounds equals the full multiplier; this preserves the bound on asymmetric loss. But the effect on the training process is quite different. In order to demonstrate this approach we generated an artificial data set and learned strong classifiers containing 4 weak classifiers. The results are shown in Figure 2. In this figure we can see that all but the first weak classifier learned by the naive rule are poor, since they each balance positive and negative errors. The final combination of these classifiers cannot yield high detection rates without introducing many false positives. All the weak classifiers generated by the proposed asymmetric AdaBoost rule are consistent with the asymmetric loss, and the final strong classifier yields very high detection rates and modest false positive rates.
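A minimal sketch of this per-round reweighting, built on top of the illustrative boost_round helper sketched in Section 3, is given below. It is a sketch of the idea under the assumptions stated there, not the exact training procedure used in the experiments.

    import numpy as np

    def asymmetric_adaboost(F, y, k, n_rounds):
        """Train n_rounds confidence-rated stumps with per-round asymmetric reweighting.

        Before every round each example weight is scaled by
        exp((1/n_rounds) * y_i * log(sqrt(k))), so that over the whole run
        false negatives are penalized k times more than false positives."""
        y = np.asarray(y)
        n = len(y)
        D = np.full(n, 1.0 / n)                  # start from the uniform distribution
        per_round = np.exp((1.0 / n_rounds) * y * np.log(np.sqrt(k)))
        strong = []                              # list of (feature index, c_plus, c_minus)
        for _ in range(n_rounds):
            D = D * per_round                    # nth root of the asymmetric multiplier
            D = D / D.sum()                      # renormalize to a distribution
            j, c_plus, c_minus, D = boost_round(F, y, D)   # Equations 1 and 2
            strong.append((j, c_plus, c_minus))
        return strong

    def strong_output(strong, F_row):
        """Sum of weak confidences C(x); the final threshold (bias) is tuned separately."""
        return sum(c_plus if F_row[j] == 1 else c_minus for j, c_plus, c_minus in strong)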
One simple reinterpretation of this distributed scheme for asymmetric reweighting is as a reduction in the positive confidence of each weak classifier, $h_t(x) \rightarrow h_t(x) - \frac{1}{n}\log\sqrt{k}$. This forces each subsequent weak classifier to focus asymmetrically on positive examples.

5 Experiments

We performed two experiments in the domain of frontal face detection to demonstrate the advantages of asymmetric AdaBoost. The experiments follow the general form, though differ in details, of those presented in Viola and Jones [8]. In each round of boosting one of a very large set of binary features is selected. These features, which we call rectangle features, are briefly described in Figure 4.

In the first experiment a training and a test set containing faces and non-faces of a fixed size were acquired (faces were scaled to a size of 24 x 24 pixels). The training set consisted of 1500 face examples and 5000 non-face examples. Test data included 900 faces and 5000 non-faces. The face examples were manually cropped from a large collection of Web images, while the non-face examples were randomly chosen patches from Web images that were known not to contain any faces.

Naive asymmetric AdaBoost and three parameterizations of asymmetric AdaBoost were used to train classifiers with 4 features on this data. Figure 3 shows the ROC curves on test data for these classifiers. The key result here is that at high detection rates the false positive rate can be reduced significantly.

Figure 2: Two simple examples: positive examples are 'x', negative examples are 'o', and weak classifiers are linear separators. On the left is the naive asymmetric result. The first feature selected is labelled '1'. Subsequent features attempt to balance positive and negative errors. Notice that no linear combination of the 4 weak classifiers can achieve a low false positive and low false negative rate. On the right is the asymmetric boosting result. After learning 4 weak classifiers the positives are well modelled and most of the negatives are rejected.

[Figure 3 plot: detection rate (0.97 to 0.995, vertical axis) versus false positive rate (0.2 to 0.7, horizontal axis) for classifiers labelled NAIVE, T11-F10, T15-F10, and T20-F10.]

Figure 3: ROC curves for four boosted classifiers with 4 features. The first is naive asymmetric boosting. The other three results are for the new asymmetric approach, each using slightly different parameters. The ROC curve has been cropped to show only the region of interest in training a cascaded detector, the high detection rate regime. Notice that at 99% detection asymmetric AdaBoost cuts the false positive rate by about 20%. This will significantly reduce the work done by later stages in the cascade.
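The ROC curves reported in this section trade detection rate against false positives by adjusting the final classifier threshold (Section 3). A generic sketch of such a threshold sweep, with hypothetical score arrays, is shown below; it is not a description of the exact evaluation protocol used for Figures 3 and 5.

    import numpy as np

    def roc_points(scores_pos, scores_neg, biases):
        """Sweep the strong-classifier bias to trade detection rate against
        false positive rate, producing one ROC point per bias value."""
        points = []
        for b in biases:
            detection = np.mean(np.asarray(scores_pos) >= b)
            false_pos = np.mean(np.asarray(scores_neg) >= b)
            points.append((false_pos, detection))
        return points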
[Figure 4 graphic: panels (A) through (D) showing example rectangle features within the detection window (left), and the first two features selected by boosting (right).]

Figure 4: Left: Example rectangle features shown relative to the enclosing detection window. The sum of the pixels which lie within the white rectangles is subtracted from the sum of pixels in the gray rectangles. A threshold operation is then applied to yield a binary output. Two-rectangle features are shown in (A) and (B). Figure (C) shows a three-rectangle feature, and (D) a four-rectangle feature. Right: The first two example features selected by the boosting process. Notice that the first feature relies on the fact that the horizontal region of the eyes is darker than the horizontal region of the cheeks. The second feature, whose selection is conditioned on the first, acts to distinguish horizontal edges from faces by looking for a strong vertical edge near the nose.
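To illustrate how a rectangle feature of the kind shown in Figure 4 yields a binary output, here is a small sketch that computes a two-rectangle feature using an integral image, a standard constant-time method for rectangle sums. The function names, coordinates, and threshold are hypothetical and are only meant to mirror the description in the Figure 4 caption.

    import numpy as np

    def integral_image(img):
        """Cumulative sums of a 2-D pixel array, so any rectangle sum costs four lookups."""
        return img.cumsum(axis=0).cumsum(axis=1)

    def rect_sum(ii, top, left, height, width):
        """Sum of pixels in the rectangle [top, top+height) x [left, left+width)."""
        total = ii[top + height - 1, left + width - 1]
        if top > 0:
            total -= ii[top - 1, left + width - 1]
        if left > 0:
            total -= ii[top + height - 1, left - 1]
        if top > 0 and left > 0:
            total += ii[top - 1, left - 1]
        return total

    def two_rect_feature(ii, top, left, height, width, threshold):
        """Two-rectangle feature as in Figure 4(A): the white-rectangle sum is
        subtracted from the gray-rectangle sum, then thresholded to a binary output."""
        half = width // 2
        white = rect_sum(ii, top, left, height, half)
        gray = rect_sum(ii, top, left + half, height, half)
        return 1 if (gray - white) >= threshold else 0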
In the second experiment, naive and asymmetric AdaBoost were used to train two different complete cascaded face detectors. Performance of each cascade was determined on a real-world face detection task, which requires scanning the cascade across a set of large images which contain embedded faces.

The cascade training process is complex, and as a result comparing detection results is useful but potentially risky. While the data used to train the two cascades were identical, the performance of earlier stages affects the selection of non-faces used to train later stages. As a result, different non-face examples are used to train the corresponding stages of the naive and asymmetric cascades.

Layers were added to each of the cascades until the number of false positives was reduced below 100 on a validation set. For normal boosting this occurred with 34 layers; for asymmetric AdaBoost this occurred with 38 layers. Figure 5 shows the ROC curves for the resulting face detectors on the MIT+CMU test set [4].3 Careful examination of the ROC curves shows that the asymmetric cascade reduces the number of false positives significantly. At a detection rate of 91% the reduction is by a factor of 2.

3 Note: the detection and false positive rates for the simple 4 feature experiment and the more complex cascaded experiment are not directly comparable, since the test sets are quite different.

[Figure 5 plot, titled "ROC curves for face detector with different boosting algorithms": correct detection rate (0.8 to 0.95) versus number of false positives (0 to 300), with curves for Asymmetric Boosting and Normal Boosting.]

Figure 5: ROC curves comparing the accuracy of two full face detectors, one trained using normal boosting and the other with asymmetric AdaBoost. Again, the detector trained using asymmetric AdaBoost is more accurate over a wide range of false positive values.

6 Conclusions

We have demonstrated that a cascade classification framework can be used to achieve fast classification, high detection rates, and very low false positive rates. The goal for each classifier in the cascade is not low error, but instead an extremely high detection rate and a modest false positive rate. If this is achieved, each classifier stage can be used to filter out and discard many negatives.
Many modern approaches for classification focus entirely on the minimization of errors; questions of relative loss only arise in the final tuning of the classifier. We propose a new training algorithm called asymmetric AdaBoost which performs learning and efficient feature selection with the fundamental goal of achieving high detection rates. Asymmetric AdaBoost is a simple modification of the "confidence-rated" boosting approach of Schapire and Singer, and many of their derivations apply to this new approach as well.

Experiments have demonstrated that asymmetric AdaBoost can lead to significant improvements both in classification speed and in detection rates.

References

[1] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20, 1995.
[2] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In Computational Learning Theory: Eurocolt '95, pages 23-37. Springer-Verlag, 1995.
[3] G. John, R. Kohavi, and K. Pfleger. Irrelevant features and the subset selection problem. In Machine Learning Conference, pages 121-129. Morgan Kaufmann, 1994.
[4] H. Rowley, S. Baluja, and T. Kanade. Neural network-based face detection. IEEE Patt. Anal. Mach. Intell., volume 20, pages 22-38, 1998.
[5] R. E. Schapire, Y. Freund, P. Bartlett, and W. S. Lee. Boosting the margin: a new explanation for the effectiveness of voting methods. Ann. Stat., 26(5):1651-1686, 1998.
[6] Robert E. Schapire and Yoram Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37:297-336, 1999.
[7] H. Schneiderman and T. Kanade. A statistical method for 3D object detection applied to faces and cars. In Computer Vision and Pattern Recognition, 2000.
[8] Paul Viola and Michael J. Jones. Robust real-time object detection. In Proc. of IEEE Workshop on Statistical and Computational Theories of Vision, 2001.
", "award": [], "sourceid": 2091, "authors": [{"given_name": "Paul", "family_name": "Viola", "institution": null}, {"given_name": "Michael", "family_name": "Jones", "institution": null}]}