{"title": "Joint Cascade Optimization Using A Product Of Boosted Classifiers", "book": "Advances in Neural Information Processing Systems", "page_first": 1315, "page_last": 1323, "abstract": "The standard strategy for efficient object detection consists of building a cascade composed of several binary classifiers. The detection process takes the form of a lazy evaluation of the conjunction of the responses of these classifiers, and concentrates the computation on difficult parts of the image which can not be trivially rejected. We introduce a novel algorithm to construct jointly the classifiers of such a cascade. We interpret the response of a classifier as a probability of a positive prediction, and the overall response of the cascade as the probability that all the predictions are positive. From this noisy-AND model, we derive a consistent loss and a Boosting procedure to optimize that global probability on the training set. Such a joint learning allows the individual predictors to focus on a more restricted modeling problem, and improves the performance compared to a standard cascade. We demonstrate the efficiency of this approach on face and pedestrian detection with standard data-sets and comparisons with reference baselines.", "full_text": "Joint Cascade Optimization Using a Product of\n\nBoosted Classi\ufb01ers\n\nLeonidas Lefakis\n\nIdiap Research Institute\nMartigny, Switzerland\n\nFranc\u00b8ois Fleuret\n\nIdiap Research Institute\nMartigny, Switzerland\n\nleonidas.lefakis@idiap.ch\n\nfrancois.fleuret@idiap.ch\n\nAbstract\n\nThe standard strategy for ef\ufb01cient object detection consists of building a cascade\ncomposed of several binary classi\ufb01ers. 
The detection process takes the form of a\nlazy evaluation of the conjunction of the responses of these classi\ufb01ers, and con-\ncentrates the computation on dif\ufb01cult parts of the image which cannot be trivially\nrejected.\nWe introduce a novel algorithm to construct jointly the classi\ufb01ers of such a cas-\ncade, which interprets the response of a classi\ufb01er as the probability of a positive\nprediction, and the overall response of the cascade as the probability that all the\npredictions are positive. From this noisy-AND model, we derive a consistent loss\nand a Boosting procedure to optimize that global probability on the training set.\nSuch a joint learning allows the individual predictors to focus on a more restricted\nmodeling problem, and improves the performance compared to a standard cas-\ncade. We demonstrate the ef\ufb01ciency of this approach on face and pedestrian de-\ntection with standard data-sets and comparisons with reference baselines.\n\n1\n\nIntroduction\n\nObject detection remains one of the core objectives of computer vision, either as an objective per\nse, for instance for automatic focusing on faces in digital cameras, or as means to get high-level\nunderstanding of natural scenes for robotics and image retrieval.\nThe standard strategy which has emerged for detecting objects of reasonable complexity such as\nfaces is the so-called \u201csliding-window\u201d approach. It consists of visiting all locations and scales in\nthe scene to be parsed, and for any such pose, evaluating a two-class predictor which computes if\nthe object of interest is visible there.\nThe computational cost of such approaches is controlled traditionally with a cascade, that is a suc-\ncession of classi\ufb01ers, each one being evaluated only if the previous ones in the sequence have not\nalready rejected the candidate location. 
Such an architecture concentrates the computation on the difficult parts of the global image to be processed, and tremendously reduces the overall computational effort.\nIn its original form, this approach constructs classifiers one after another during training, each one from examples which have not been rejected by the previous ones. While very successful, this technique suffers from three main practical drawbacks. The first one is the need for a very large number of negative samples, so that enough samples are available to train every classifier. The second drawback is the necessity to define as many thresholds as there are levels in the cascade. This second step may seem innocuous, but in practice it is a serious difficulty, requiring additional validation data. Finally, the third drawback is the inability of a standard cascade to properly exploit the trade-off between the different levels. A response marginally below threshold at a certain level is enough to reject a sample, even if classifiers at other levels have strong responses.\nAt a more conceptual level, standard training for cascades does not allow the classifiers to exploit their joint modeling: each classifier is trained as if it has to do the job alone, without having the opportunity to properly balance its own modeling effort against that of the other classifiers.\nThe novel approach we propose here is a joint learning of the classifiers constituting a cascade. We interpret the individual responses of the classifiers as probabilities of responding positively, and define the overall response of the cascade as the probability of all the classifiers responding positively, under an assumption of independence. Instead of training classifiers successively, we directly minimize a loss taking into account this global response. 
This noisy-AND model leads to a very simple criterion for a new Boosting procedure, which improves all the classifiers symmetrically on the positive samples, and focuses on improving the classifier with the best response on every negative sample.\nWe demonstrate the efficiency of this technique for face and pedestrian detection. Experiments show that this joint cascade learning requires far fewer negative training examples, and achieves better performance than standard cascades without the need for intensive bootstrapping. At the computational level, we propose to optimally permute the order of the classifiers during the evaluation to reduce the overall number of evaluated classifiers, and show that such optimization allows for better error rates at similar computational costs.\n\n2 Related works\n\nA number of methods have been proposed over the years to control the computational cost of machine-learning based object detection. The idea common to these approaches is to rely on a form of adaptive testing: only candidates which cannot be trivially rejected as not being the object of interest require heavy computation. In practice the majority of the candidates are rejected with a very coarse criterion, hence requiring very little computation.\n\n2.1 Reducing object detection computational cost\n\nHeisele et al. [1] propose a hierarchy of linear Support Vector Machines, each trained on images of increasing resolution, to weed out background patches, followed by a final computationally intensive polynomial SVM. In [2] and [3], the authors use a hierarchy of respectively two and three Support Vector Machines of increasing complexity. Graf et al. 
[4] introduced the parallel support vector machine, which creates a filtering process by combining layers of parallel SVMs, each trained using the support vectors of classifiers in the previous layer.\nFleuret and Geman [5] introduce a hierarchy of classifiers dedicated to positive populations with geometrical poses of decreasing randomness. This approach generalizes the cascade to more complex pose spaces, but as for cascades, trains the classifiers separately.\nRecently, a number of scanning alternatives to sliding window have also been introduced. In [6] a branch and bound approach is utilized during scanning, while in [7] a divide and conquer approach is proposed, wherein regions in the image are either accepted or rejected as a whole, or split and further processed. Feature-centric approaches are proposed by the authors in [8] and [9].\nThe most popular approach however, for both its conceptual simplicity and practical efficiency, is the attentional cascade proposed by Viola and Jones [10]. Following this seminal paper, cascades have been used in a variety of problems [11, 12, 13].\n\n2.2 Improving attentional cascades\n\nIn recent years approaches have been proposed that address some of the issues we list in the introduction. In [14] the authors train a cascade with a global performance criterion and a single set of parameters common to all stages. In [15] the authors address the asymmetric nature of the stage goals via a biased minimax probability machine, while in [16] the authors formulate the stage goals as a constrained optimization problem. In [17] an alternate boosting method dubbed FloatBoost is proposed. It allows for backtracking and removing weak classifiers which no longer contribute.\n\nTable 1: Notation\n\n$(x_n, y_n), n = 1, \\dots, N$, training examples.\n$K$ number of levels in the cascade.\n$f_k(x)$ non-thresholded response of classifier $k$. 
During training, $f^t_k(x)$ stands for that response after $t$ steps of Boosting.\n$p_k(x) = \\frac{1}{1 + \\exp(-f_k(x))}$ probability of classifier $k$ to respond positively on $x$. During training, $p^t_k(x)$ stands for the same value after $t$ steps of Boosting, computed from $f^t_k(x)$.\n$p(x) = \\prod_k p_k(x)$ posterior probability of sample $x$ to be positive, as estimated jointly by all the classifiers of the cascade. During training, $p^t(x)$ is that value after only $t$ steps of Boosting, computed from the $p^t_k(x)$.\n\nSochman and Matas [18] presented a Boosting algorithm based on sequential probability ratio tests, minimizing the average evaluation time subject to upper bounds on the false negative and false positive rates. A general framework for probabilistic boosting trees (of which cascades are a degenerate case) was proposed in [19]. In all these methods however, a set of free parameters concerning detection and false alarm performances must be set during training. As will be seen, our method is capable of postponing any decisions concerning performance goals until after training.\nThe authors in [20] use the output of each stage as an initial weak classifier of the boosting classifier in the next stage. This allows the cascade to retain information between stages. However this approach only constitutes a backward view of the cascade: no information concerning the future performance of the cascade is available to each stage. In [21] sample traces are utilized to keep track of the performance of the cascade on the training data, and thresholds are picked after the cascade training is finished. This allows for reordering of cascade stages. However, besides a validation set, a large number of negative examples must also be bootstrapped, not only during the training phase, but also during the post-processing step of threshold and order calibration. 
Furthermore, different\nlearning targets are used in the learning and calibration phases.\nTo our knowledge, very little work has been done on the joint optimization of the cascaded stages. In\n[22] the authors attempt to jointly optimize a cascade of SVMs. As can be seen, a cascade effectively\nperforms an AND operation over the data, enforcing that a positive example passes all stages; and\nthat a negative example be rejected by at least one stage. In order to simulate this behavior, the\nauthors attempt to minimize the maximum hinge loss over the SVMs for the positive examples, and\nto minimize the product of the hinge losses for the negative examples. An approximate solution to\nthis formulation is found via cyclic optimization. In [23] the authors present a method similar to\nours, jointly optimizing a cascade using the product of the output of individual logistic regression\nbase classi\ufb01ers. Their method attempts to \ufb01nd the MAP-estimate of the optimal classi\ufb01er weights\nusing cyclic coordinate descent. As is the case with the work in [22], the authors consider the\nordering of the stages a priori \ufb01xed.\n\n3 Method\n\nOur approach can be interpreted as a noisy-AND: The classi\ufb01ers in the cascade produce stochastic\nBoolean predictions, conditionally independent given the signal to classify. We de\ufb01ne the global\nresponse of the cascade as the probability that all these predictions are positive.\nThis can be interpreted as if we were \ufb01rst computing from the signal x, for each classi\ufb01er in the\ncascade, a probability pk(x), and de\ufb01ning the response of the cascade as the probability that K\nindependent Bernoulli variables of parameters p1(x), . . . , pK(x) would all be equal to 1. 
Such a criterion naturally takes into account the confidence of individual classifiers in the final response, and introduces an additional non-linearity in the decision function.\nThis approach is related to the noisy-OR proposed in [24] for multi-view object detection. However, their approach aims at decomposing a complex population into a collection of homogeneous populations, while our objective is to speed up the computation for the detection of a homogeneous population. In some sense the noisy-OR they propose and the noisy-AND we use for training are addressing dual objectives.\n\n3.1 Formalization\n\nLet $f_k(x)$ stand for the non-thresholded response of the classifier at level $k$ of the cascade. We define\n\n$p_k(x) = \\frac{1}{1 + \\exp(-f_k(x))}$ \\quad (1)\n\nas the probabilistic interpretation of the deterministic output of classifier $k$.\nFrom that, we define the final output of the cascade as the probability that all classifiers make positive predictions, under the assumption that they are conditionally independent given $x$:\n\n$p(x) = \\prod_{k=1}^{K} p_k(x).$ \\quad (2)\n\nIn the ideal Boolean case, an example $x$ will be classified as positive if and only if all classifiers classify it as such. Conversely the example will be classified as negative if $p_k(x) = 0$ for at least one $k$. This is consistent with the AND nature of the cascade. Of course, due to the product, the final classifier is able to make probabilistic predictions rather than solely hard ones as in [22].\n\n3.2 Joint Boosting\n\nLet\n\n$(x_n, y_n) \\in \\mathbb{R}^d \\times \\{0, 1\\}, \\quad n = 1, \\dots, N$ \\quad (3)\n\ndenote a training set. In order to train our cascade we consider the maximization of the joint log-likelihood of the data:\n\n$J = \\log \\prod_n p(x_n)^{y_n} (1 - p(x_n))^{1 - y_n}.$ \\quad (4)\n\nAt each round $t$ we sequentially visit each classifier and add the weak learner which locally increases $J$ the most. 
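The response model (1)-(2) and the log-likelihood (4) above are direct to write down in code. The following NumPy sketch is an illustration only, not the authors' implementation; the stage responses would in practice come from boosted strong classifiers, and the small `eps` guard is ours, added for numerical safety:

```python
import numpy as np

def cascade_response(F):
    """Noisy-AND response of the cascade, eqs. (1)-(2).

    F: (N, K) matrix of non-thresholded stage responses f_k(x_n).
    Returns p(x_n) = prod_k sigmoid(f_k(x_n)).
    """
    p_k = 1.0 / (1.0 + np.exp(-F))   # per-stage probabilities p_k(x)
    return p_k.prod(axis=1)          # product over the K stages

def log_likelihood(F, y):
    """Joint log-likelihood J of eq. (4), labels y in {0, 1}."""
    p = cascade_response(F)
    eps = 1e-12                      # numerical guard, not part of the model
    return np.sum(y * np.log(p + eps) + (1 - y) * np.log(1.0 - p + eps))
```

For instance, with three stages all outputting $f_k = 0$, every $p_k$ is 0.5 and the cascade response is 0.125: the product makes the cascade strictly more conservative than any single stage.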
If $p^t(x)$ denotes the overall response of the cascade after having added $t$ weak learners in each classifier, and $p^t_k(x)$ denotes the response of classifier $k$ at that point (hence a function of $f^t_k(x)$, the response of classifier $k$ at step $t$), the score to maximize to select a weak learner $h^k_t$ is\n\n$\\sum_n w^{k,t}_n \\, h^k_t(x_n)$ \\quad (5)\n\nwith\n\n$w^{k,t}_n = \\frac{\\partial J}{\\partial f_k(x_n)} = \\frac{y_n - p^t(x_n)}{1 - p^t(x_n)} \\left(1 - p^t_k(x_n)\\right).$ \\quad (6)\n\nIt should be noted that in this formulation the weights $w^{k,t}_n$ are signed, and those assigned to negative examples are negative.\nIn the case of a positive example $x_n$ this simplifies to $w^{k,t}_n = 1 - p^t_k(x_n)$, and thus this criterion pushes every classifier in the cascade to maximize the response on positive samples, irrespective of the performance of the overall cascade.\nIn the case of a negative example however, the weight update rule becomes $w^{k,t}_n = -\\frac{p^t(x_n)}{1 - p^t(x_n)} (1 - p^t_k(x_n))$; each classifier in the cascade is then passed information regarding the overall performance via the term $-\\frac{p^t(x_n)}{1 - p^t(x_n)}$. If the cascade is already rejecting the negative example, then this term becomes 0 and the classifier ignores its performance on the specific example. 
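Concretely, the weight rule (6) can be written as follows; this is a sketch under our notational assumptions (an (N, K) matrix F of current stage responses, labels in {0, 1}), with the weak learner for stage k then chosen to maximize the weighted sum (5):

```python
import numpy as np

def boosting_weights(F, y, k):
    """Signed sample weights w_n^{k,t} of eq. (6) for stage k.

    F: (N, K) matrix of current stage responses f_k^t(x_n).
    y: labels in {0, 1}.
    """
    p_k = 1.0 / (1.0 + np.exp(-F))   # per-stage probabilities p_k^t(x_n)
    p = p_k.prod(axis=1)             # overall noisy-AND response p^t(x_n)
    # (y - p) / (1 - p) equals 1 on positives and -p/(1 - p) on negatives,
    # so positives get 1 - p_k(x) and negatives -p/(1-p) * (1 - p_k(x)).
    return (y - p) / (1.0 - p) * (1.0 - p_k[:, k])
```

In practice a numerical guard is needed when $p^t(x_n)$ is close to 1, a case the formula above does not protect against.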
On the other hand, if the cascade is performing poorly, then the term becomes increasingly large and the classifiers put large weights on that example.\nFurthermore, due to the term $1 - p^t_k(x_n)$, each classifier puts larger weight on negative examples that it is already performing well on, effectively partitioning the space of negative examples.\nThe weights of the weak learners cannot be computed in closed form as for AdaBoost, and are estimated through a numerical line-search.\n\n3.3 Exponential variant\n\nTo assess whether the asymptotic behavior of the loss (which is similar in spirit to the logistic one) is critical to the performance, we also experimented with the minimization of the exponential error of the output.\nThis translates to the minimization of the cost function\n\n$J^{\\exp} = \\sum_n \\left( \\frac{1 - p(x_n)}{p(x_n)} \\right)^{2 y_n - 1}$ \\quad (7)\n\nand leads to the following expression for the sample weights during Boosting:\n\n$w^{k,t}_n = \\frac{p^t_k(x_n) - 1}{p^t(x_n)}$ \\quad (8)\n\nfor the positive samples and\n\n$w^{k,t}_n = \\frac{(1 - p^t_k(x_n)) \\, p^t(x_n)}{(1 - p^t(x_n))^2}$ \\quad (9)\n\nfor the negative ones.\nSuch a weighting strongly penalizes outliers in the training set, in a manner similar to AdaBoost's exponential loss.\n\n4 Experiments\n\n4.1 Implementation Details\n\nWe comparatively evaluate the proposed cascade framework on two data-sets. In [10] the authors present an initial comparison between their cascade framework and an AdaBoost classifier on the CMU-MIT data-set. They train the monolithic classifier for 200 rounds and compare it against a simple cascade containing ten stages, each with 20 weak learners. 
As cascade architecture plays an important role in the final performance of the cascade, and in order to avoid any issues in the comparison pertaining to architectural design, we keep this structure and evaluate both the proposed cascade and the Viola and Jones cascade using this architecture. The monolithic classifier is similarly trained for 200 rounds. During the training, the thresholds for each stage in the Viola and Jones cascade are set to achieve a 99.5% detection rate.\nAs pointed out, our approach does not make use of a validation set, nor does it use bootstrapping during training. We experimented with bootstrapping a fixed number M of negative examples at fixed intervals, similar to [21], and attained higher performance than the one presented here. However, it was found that training was highly sensitive to the choice of M, and furthermore that this choice of M was application specific.\nWe tested three versions of our JointCascade approach: JointCascade is the algorithm described in § 3.2, JointCascade Augmented is the same, but is trained with as many negative examples as the total number used by the Viola and Jones cascade, and JointCascade Exponential uses the same number of negative samples as the basic setting, but uses the exponential version of the loss described in § 3.3.\n\n4.2 Data-Sets\n\n4.2.1 Pedestrians\n\nFor pedestrian detection we use the INRIA pedestrian data-set [25], which contains pedestrian images of various poses with high variance in background and lighting. The training set consists of 1239 images of pedestrians as positive examples, and 12180 negative examples, mined from 1218 pedestrian-free images. Of these we keep 900 images for training (together with their mirror images, for a total of 1800) and 9000 negative examples. 
The remaining images in the original training set are put aside to be used as a validation set by the Viola and Jones cascade.\nAs in [25] we utilize a histogram of oriented gradients to describe each image. The reader is referred to this article for implementation details of the descriptor.\nThe trained classifiers are then tested on a test set composed of 1126 images of pedestrians and 18120 non-pedestrian images.\n\n4.2.2 Faces\n\nFor faces, we evaluate against the CMU+MIT data-set of frontal faces. We utilize the Haar-like wavelet features introduced in [10]; however, for performance reasons, we sub-sample 2000 of these features at each round to be used for training.\nFor training we use the same data-set as that used by Viola and Jones, consisting of 4916 images of faces. Of these we use 4000 (plus their mirror images) for training and set apart a further 916 (plus mirror images) for use as the validation set needed by the classical cascade approach. The negative portion of the training set is comprised of 10000 non-face images, mined randomly from images containing no faces.\nIn order to test the trained classifiers, we extract the 507 faces in the data-set and scale-normalize them to 24x24 images; a further 12700 non-face image patches are extracted from the backgrounds of the images in the data-set. We do not perform scale search, nor do we use any form of post-processing.\n\n4.2.3 Bootstrap Images\n\nAs, during training, the Viola and Jones cascade needs to bootstrap false positive examples after each stage, we randomly mined a data-set of approximately 7000 images from the web. These images have been manually inspected to ensure that they contain neither faces nor pedestrians. They are used for bootstrapping in both sets of experiments.\n\n4.3 Error rate\n\nThe evaluation on the face data-set can be seen in Figure 1. The plotted lines represent the ROC curves for the evaluated methods. 
The proposed methods are able to reach a level of performance on par with the Viola and Jones cascade, without the need for a validation set or bootstrapping. The log-likelihood version of our method performs slightly better than the exponential-error version.\nThe ROC curves for the pedestrian detection task can be seen in Figure 2. The log-likelihood version of our method significantly outperforms the Viola and Jones cascade. The exponential-error version is again slightly worse than the log-likelihood version, but it too outperforms the classical approach. Finally, as can be seen, augmenting the training data for the proposed method leads to further improvement.\nThe results on the two data-sets show that the proposed methods are capable of performing on par with or better than the Viola and Jones cascade, while avoiding the need for a validation set or for bootstrapping. This lack of a need for bootstrapping further means that the required training time is considerably smaller than in the case of the classical cascade.\n\n4.4 Optimization of the evaluation order\n\nAs stated, one of the main motivations for using cascades is speed. We compare the average number of stages visited per negative example for the various methods presented.\nTypically in cascade training, the thresholds and orders of the various stages must be determined during training, either by setting them in an ad hoc manner or by using one of the many optimization schemes proposed. In our case however, any decision concerning the thresholds as well as the ordering of the stages can be postponed until after training. It is easy to derive, for any given detection goal, a relevant threshold $\\theta$ on the overall cascade response. Thus we ask that $p(x_n) > \\theta$ for an image patch to be accepted as positive. 
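Both this thresholded decision and the greedy ordering of stages described in the rest of this section can be sketched in a few lines. This is an illustration only, not the authors' implementation; it assumes the per-stage probabilities $p_k(x)$ have been precomputed (as an array per sample, and as an (N, K) matrix on the negative training samples for the ordering), and all names are ours:

```python
import numpy as np

def evaluate_lazy(p_stages, order, theta):
    """Accept a patch iff p(x) = prod_k p_k(x) exceeds theta, rejecting early.

    Since every p_k(x) is at most 1, the partial product can only decrease,
    so the patch can be rejected as soon as it drops below theta.
    p_stages: length-K array of p_k(x) for one sample.
    Returns (accepted, number of stages evaluated).
    """
    prod = 1.0
    for i, k in enumerate(order, start=1):
        prod *= p_stages[k]
        if prod < theta:
            return False, i
    return True, len(order)

def greedy_order(P_neg, theta):
    """Greedy stage ordering computed on negative training samples.

    P_neg: (N, K) matrix of p_k(x) on the negatives. First pick the stage
    that alone rejects (falls below theta on) the most negatives; then
    repeatedly pick the stage that, multiplied with the aggregated response
    of the stages already ordered, rejects the most.
    """
    N, K = P_neg.shape
    order, remaining = [], list(range(K))
    agg = np.ones(N)                   # aggregated product of ordered stages
    while remaining:
        best = max(remaining, key=lambda k: np.sum(agg * P_neg[:, k] < theta))
        order.append(best)
        remaining.remove(best)
        agg = agg * P_neg[:, best]
    return order
```

With such an ordering, the average number of stages evaluated per negative sample (the quantity reported in Table 2) is simply the mean of the second value returned by `evaluate_lazy`.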
Subsequently, an image patch can be rejected as soon as the product of any subset of strong classifiers has a value smaller than $\\theta$.\nBased on this we use a greedy method to evaluate, using the original training set, the optimal order of classifiers as follows: originally we chose as the first stage in our cascade the classifier whose\n\nFigure 1: True-positive rate vs. false-positive rate on the face data-set for the methods proposed, AdaBoost and the Viola and Jones type cascade. The JointCascade variants are described in § 4.1. At any true-positive rate above 95%, all three methods perform better than the standard cascade. This is a particularly good result for the basic JointCascade, which does not use bootstrapping during training, which would seem to be critical for such conservative regimes.\n\nFigure 2: True-positive rate vs. false-positive rate on the pedestrian data-set for the methods proposed, AdaBoost and the Viola and Jones type cascade. All three JointCascade methods outperform the standard cascade, for regions of the false positive rate which are of practical use.\n\n[Both figures plot the true-positive rate against the false-positive rate for Non-cascade AdaBoost, VJ cascade, JointCascade, JointCascade Augmented and JointCascade Exponential.]\n\nTable 2: Average number of classifiers evaluated on a sample, for each method and different true-positive rates, on the two data-sets. As expected, the computational load increases with the accuracy. The JointCascade variants require marginally more operations at a fixed rate on the pedestrian population, and marginally fewer on the faces except at very conservative rates. 
This is an especially good result, given their lower false-positive rates, which should induce more computation on average.\n\nTP | Faces: VJ, JointCascade, JC Augmented, JC Exponential | Pedestrians: VJ, JointCascade, JC Augmented, JC Exponential\n95% | 1.35, 1.49, 1.62, 1.69 | 2.27, 2.58, 2.66, 2.93\n90% | 1.21, 1.18, 1.31, 1.25 | 1.93, 2.04, 1.94, 2.21\n86% | 1.13, 1.09, 1.18, 1.11 | 1.56, 1.79, 1.71, 1.81\n82% | 1.10, 1.04, 1.12, 1.07 | 1.38, 1.49, 1.59, 1.52\n78% | 1.07, 1.03, 1.09, 1.04 | 1.30, 1.37, 1.48, 1.39\n\nresponse is smaller than $\\theta$ for the largest number of negative examples. We then iteratively add to the order of the cascade that classifier which, when multiplied with the aggregated response of the stages already ordered, leads to a response smaller than $\\theta$ for the most negative examples.\nAs stated, this ordering of the cascade stages is computed using the training set. We then measure the speed of our ordered cascade on the same test sets as above, as shown in Table 2. As can be seen, in the case of the face data-set, in almost all cases our approach is actually faster during scanning than the classical Viola and Jones approach. When the augmented data-set is used however, this speed advantage is lost; there is thus a trade-off between performance and speed, as is to be expected.\nThe speed of our JointCascade approach on the pedestrian data-set is marginally worse than that of Viola and Jones, which is due to the lower false-positive rates.\n\n5 Conclusion\n\nWe have presented a new criterion to train a cascade of classifiers in a joint manner. 
This approach has a clear probabilistic interpretation as a noisy-AND, and leads to a global decision criterion which avoids thresholding classifiers individually and can exploit the amplitudes of the individual classifier responses.\nThis method avoids the need for picking multiple thresholds and the requirement for additional validation data. It makes it easy to fix the final operating point without the need for re-training. Finally, we have demonstrated that it reaches state-of-the-art performance on standard data-sets, without the need for bootstrapping.\nThis approach is very promising as a general framework to build adaptive detection techniques. It could easily be extended to hierarchical approaches instead of a simple cascade, hence could be used for latent poses richer than location and scale.\nFinally, the reduction of the computational cost itself could be addressed in a more explicit manner than the optimization of the order presented in § 4.4. We are investigating a dynamic approach where the same criterion is used to allocate weak learners adaptively among the classifiers. This could be combined with a loss function explicitly estimating the expected computational cost of detection, hence providing an incentive for early rejection of more samples in the cascade.\n\nAcknowledgments\n\nWe thank the anonymous reviewers for their helpful comments. This work was supported by the European Community's Seventh Framework Programme FP7 - Challenge 2 - Cognitive Systems, Interaction, Robotics - under grant agreement No 247022 - MASH.\n\nReferences\n[1] B. Heisele, T. Serre, S. Prentice, and T. Poggio. Hierarchical classification and feature reduction for fast face detection with support vector machines. Pattern Recognition, 36(9):2007–2017, 2003.\n[2] Hedi Harzallah, Frédéric Jurie, and Cordelia Schmid. Combining efficient object localization and image classification. 
In International Conference on Computer Vision, pages 237–244, 2009.\n[3] A. Vedaldi, V. Gulshan, M. Varma, and A. Zisserman. Multiple kernels for object detection. In International Conference on Computer Vision, pages 606–613, 2009.\n[4] Hans Peter Graf, Eric Cosatto, Léon Bottou, Igor Dourdanovic, and Vladimir Vapnik. Parallel support vector machines: The cascade SVM. In Neural Information Processing Systems, pages 521–528, 2005.\n[5] F. Fleuret and D. Geman. Coarse-to-fine face detection. International Journal of Computer Vision, 41(1/2):85–107, 2001.\n[6] Christopher H. Lampert, M. B. Blaschko, and Thomas Hofmann. Beyond sliding windows: Object localization by efficient subwindow search. In Conference on Computer Vision and Pattern Recognition, pages 1–8, 2008.\n[7] Christoph H. Lampert. An efficient divide-and-conquer cascade for nonlinear object detection. In Conference on Computer Vision and Pattern Recognition, pages 1022–1029, 2010.\n[8] Henry Schneiderman. Feature-centric evaluation for efficient cascaded object detection. In Conference on Computer Vision and Pattern Recognition, pages 29–36, 2004.\n[9] A. Lehmann, B. Leibe, and L. Van Gool. Feature-centric efficient subwindow search. In International Conference on Computer Vision, pages 940–947, 2009.\n[10] Paul Viola and Michael Jones. Rapid object detection using a boosted cascade of simple features. In Conference on Computer Vision and Pattern Recognition, pages 511–518, 2001.\n[11] Owen T. Carmichael and Martial Hebert. Shape-based recognition of wiry objects. In Conference on Computer Vision and Pattern Recognition, pages 401–408, 2003.\n[12] Qiang Zhu, Shai Avidan, Mei-Chen Yeh, and Kwang-Ting Cheng. Fast human detection using a cascade of histograms of oriented gradients. 
In Conference on Computer Vision and Pattern Recognition, pages 1491–1498, 2006.\n[13] Geremy Heitz, Stephen Gould, Ashutosh Saxena, and Daphne Koller. Cascaded classification models: Combining models for holistic scene understanding. In Neural Information Processing Systems, pages 641–648, 2009.\n[14] S. Charles Brubaker, Jianxin Wu, Jie Sun, Matthew D. Mullin, and James M. Rehg. On the design of cascades of boosted ensembles for face detection. International Journal of Computer Vision, 77(1-3):65–86, 2008.\n[15] Kaizhu Huang, Haiqin Yang, Irwin King, and Michael R. Lyu. Learning classifiers from imbalanced data based on biased minimax probability machine. In Conference on Computer Vision and Pattern Recognition, pages 558–563, 2004.\n[16] J. Wu, S. C. Brubaker, M. D. Mullin, and J. M. Rehg. Fast asymmetric learning for cascade face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30:369–382, 2008.\n[17] Stan Z. Li and ZhenQiu Zhang. FloatBoost learning and statistical face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(9), 2004.\n[18] Jan Sochman and Jiri Matas. WaldBoost – learning for time constrained sequential detection. In Conference on Computer Vision and Pattern Recognition, pages 150–156, 2005.\n[19] Zhuowen Tu. Probabilistic boosting-tree: Learning discriminative models for classification, recognition, and clustering. In International Conference on Computer Vision, pages 1589–1596, 2005.\n[20] Rong Xiao, Long Zhu, and HongJiang Zhang. Boosting chain learning for object detection. In International Conference on Computer Vision, pages 709–715, 2003.\n[21] Lubomir Bourdev and Jonathan Brandt. Robust object detection via soft cascade. In Conference on Computer Vision and Pattern Recognition, pages 236–243, 2005.\n[22] M. Murat Dundar and Jinbo Bi. 
Joint optimization of cascaded classifiers for computer aided detection. In Conference on Computer Vision and Pattern Recognition, pages 1–8, 2007.\n[23] V. C. Raykar, B. Krishnapuram, and S. Yu. Designing efficient cascaded classifiers: Tradeoff between accuracy and cost. In Conference on Knowledge Discovery and Data Mining, 2010.\n[24] Tae-Kyun Kim and Roberto Cipolla. MCBoost: Multiple classifier boosting for perceptual co-clustering of images and visual features. In Neural Information Processing Systems, pages 841–856, 2008.\n[25] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Conference on Computer Vision and Pattern Recognition, pages 886–893, 2005.\n", "award": [], "sourceid": 903, "authors": [{"given_name": "Leonidas", "family_name": "Lefakis", "institution": null}, {"given_name": "Francois", "family_name": "Fleuret", "institution": null}]}