{"title": "Ensemble and Modular Approaches for Face Detection: A Comparison", "book": "Advances in Neural Information Processing Systems", "page_first": 472, "page_last": 478, "abstract": "", "full_text": "Ensemble and Modular Approaches for \n\nFace Detection: a Comparison \n\nRaphael Feraud \u00b7and Olivier Bernier t \n\nTechnopole Anticipa, 2 avenue Pierre Marzin, 22307 Lannion cedex, FRANCE \n\nFrance-Telecom CNET DTLjDLI \n\nAbstract \n\nA new learning model based on autoassociative neural networks \nis developped and applied to face detection. To extend the de(cid:173)\ntection ability in orientation and to decrease the number of false \nalarms, different combinations of networks are tested: ensemble, \nconditional ensemble and conditional mixture of networks. The \nuse of a conditional mixture of networks allows to obtain state of \nthe art results on different benchmark face databases. \n\n1 A constrained generative model \n\nOur purpose is to classify an extracted window x from an image as a face (x E \nV) or non-face (x EN). The set of all possible windows is E = V uN, with \nV n N = 0. Since collecting a representative set of non-face examples is impossible, \nface detection by a statistical model is a difficult task. An autoassociative network, \nusing five layers of neurons, is able to perform a non-linear dimensionnality reduction \n[Kramer, 1991]. However, its use as an estimator, to classify an extracted window \nas face or non-face, raises two problems: \n\n1. V', the obtained sub-manifold can contain non-face examples (V C V'), \n2. owing to local minima, the obtained solution can be close to the linear \n\nsolution: the principal components analysis. \n\nOur approach is to use counter-examples in order to find a sub-manifold as close as \npossible to V and to constrain the algorithm to converge to a non-linear solution \n[Feraud, R. et al., 1997]. Each non-face example is constrained to be reconstructed \nas its projection on V. 
email: feraud@lannion.cnet.fr \nemail: bernier@lannion.cnet.fr \n\nThe projection P of a point x of the input space E on V is defined by: \n\n• if x ∈ V, then P(x) = x, \n• if x ∉ V, then P(x) = argmin_{y ∈ V} d(x, y), where d is the Euclidean distance. \n\nDuring the learning process, the projection P of x on V is approximated by: P(x) ≈ (1/n) Σ_{i=1}^{n} v_i, where v_1, v_2, ..., v_n are the n nearest neighbours, in the training set of faces, of v, the nearest face example of x. \n\nThe goal of the learning process is to approximate the distance D of an input space element x to the set of faces V: \n\n• D(x, V) = ||x − P(x)|| ≈ (1/M) Σ_{i=1}^{M} (x_i − x̂_i)², where M is the size of the input image x and x̂ is the image reconstructed by the neural network, \n• let x ∈ E; then x ∈ V if and only if D(x, V) ≤ τ, with τ ∈ ℝ a threshold used to adjust the sensitivity of the model. \n\nFigure 1: The use of two hidden layers and counter-examples in a compression neural network achieves a non-linear dimensionality reduction (layer sizes: 15x20 inputs, 35 neurons, 50 neurons, 15x20 outputs). \n\nIn the case of non-linear dimensionality reduction, the reconstruction error is related to the position of a point relative to the non-linear principal components in the input space. Nevertheless, a point can be near a principal component and far from the set of faces. With the proposed algorithm, the reconstruction error is related to the distance from a point to the set of faces. As a consequence, if we assume that the learning process is consistent [Vapnik, 1995], our algorithm is able to evaluate the probability that a point belongs to the set of faces. 
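The decision rule above can be sketched in a few lines (a toy illustration of ours, not the authors' code: the network reconstruction is replaced by the k-nearest-neighbour projection used during learning, and the function names are hypothetical):

```python
import numpy as np

def projection(x, faces, n=5):
    """Approximate P(x): mean of the n nearest neighbours, among the
    training faces, of v, the face example nearest to x."""
    v = faces[np.argmin(np.linalg.norm(faces - x, axis=1))]
    nearest = np.argsort(np.linalg.norm(faces - v, axis=1))[:n]
    return faces[nearest].mean(axis=0)

def is_face(x, faces, tau, n=5):
    """Declare x a face iff D(x, V) <= tau, where D is the mean squared
    reconstruction error against the approximate projection."""
    d = np.mean((x - projection(x, faces, n)) ** 2)
    return d <= tau
```

Lowering tau makes the detector stricter (fewer false alarms, more missed faces), which is exactly the sensitivity adjustment the threshold provides.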
Let y be a binary random variable: y = 1 corresponds to a face example and y = 0 to a non-face example. We use: \n\nP(y = 1|x) = e^{−D(x, V)²/σ²}, where σ depends on the threshold τ. \n\nThe size of the training windows is 15x20 pixels. The faces are normalized in position and scale. The windows are enhanced by histogram equalization to obtain a relative independence from lighting conditions, smoothed to remove noise, and normalized by the average face evaluated on the training set. Three face databases are used: after vertical mirroring, Bf1 is composed of 3600 different faces with orientation between 0 and 20 degrees, Bf2 is composed of 1600 different faces with orientation between 20 and 60 degrees, and Bf3 is the concatenation of Bf1 and Bf2, giving a total of 5200 faces. All of the training faces are extracted from the usenix face database(**), from the test set B of CMU(**), and from 100 images containing faces and complex backgrounds. \n\nFigure 2: Left to right: the counter-examples successively chosen by the algorithm are increasingly similar to real faces (iterations 1 to 8). \n\nThe non-face databases (Bnf1, Bnf2, Bnf3), corresponding to each face database, are collected by an iterative algorithm similar to the one used in [Sung, K. and Poggio, T., 1994] or in [Rowley, H. et al., 1995]: \n\n• 1) Bnf = ∅, τ = τ_min, \n• 2) the neural network is trained with Bf + Bnf, \n• 3) the face detection system is tested on a set of background images, \n• 4) a maximum of 100 subimages x_i are collected with D(x_i, V) ≤ τ, \n• 5) Bnf = Bnf + {x_0, ..., x_n}, τ = τ + μ, with μ > 0, \n• 6) while τ < τ_max, go back to step 2. \n\nAfter vertical mirroring, the sizes of the obtained sets of non-face examples are respectively 1500 for Bnf1, 600 for Bnf2 and 2600 for Bnf3. 
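The collection loop can be sketched as follows (a schematic of ours; `train` and `distance` are hypothetical interfaces standing in for CGM training and the distance D(x, V)):

```python
def collect_counter_examples(train, distance, faces, backgrounds,
                             tau_min, tau_max, mu, max_new=100):
    """Iterative counter-example collection, steps 1) to 6) of the text."""
    non_faces = []                       # 1) Bnf = empty set, tau = tau_min
    tau = tau_min
    while tau < tau_max:                 # 6) loop while tau < tau_max
        model = train(faces, non_faces)  # 2) train on Bf + Bnf
        new = [x for x in backgrounds    # 3)-4) scan backgrounds, keep at
               if distance(model, x) <= tau][:max_new]  # most 100 false alarms
        non_faces += new                 # 5) Bnf = Bnf + {x0, ..., xn}
        tau += mu                        #    tau = tau + mu, with mu > 0
    return non_faces
```

Because the threshold starts low and rises by mu at each pass, the background windows closest to the face manifold are collected first, matching the behaviour shown in Figure 2.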
Since the non-face set N is too large, it is not possible to prove that this algorithm converges in a finite time. Nevertheless, in only 8 iterations, the collected counter-examples are close to the set of faces (Figure 2). Using this algorithm, three elementary face detectors are constructed: the front view face detector trained on Bf1 and Bnf1 (CGM1), the turned face detector trained on Bf2 and Bnf2 (CGM2) and the general face detector trained on Bf3 and Bnf3 (CGM3). \nTo obtain a non-linear dimensionality reduction, five layers are in theory necessary. However, our experiments show that four layers are sufficient. Consequently, each CGM has four layers (Figure 1). The first and last layers each consist of 300 neurons, corresponding to the image size 15x20. The first hidden layer has 35 neurons and the second hidden layer 50 neurons. In order to reduce the false alarm rate and to extend the face detection ability in orientation, different combinations of networks are tested. The use of an ensemble of networks to reduce the false alarm rate was demonstrated by [Rowley, H. et al., 1995]. However, considering that detecting a face in an image involves two subproblems, detection of front view faces and detection of turned faces, a modular architecture can also be used. \n\n2 Ensemble of CGMs \n\nThe generalization error of an estimator can be decomposed into two terms: the bias and the variance [Geman, S. et al., 1992]. The bias is reduced with prior knowledge. The use of an ensemble of estimators can reduce the variance when these estimators are independently and identically distributed [Raviv, Y. and Intrator, N., 1996]. 
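This variance reduction is easy to check numerically (a toy simulation of ours with three unbiased iid estimators of a probability, not data from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
p_true = 0.7                      # true P(y = 1|x) for some window x
trials, n_estimators = 100_000, 3
# each estimator = truth + iid zero-mean noise, so the bias is zero
outputs = p_true + 0.1 * rng.standard_normal((trials, n_estimators))
single_var = ((outputs[:, 0] - p_true) ** 2).mean()
ensemble_var = ((outputs.mean(axis=1) - p_true) ** 2).mean()
# ensemble_var comes out close to single_var / n_estimators
```

If the estimators were correlated, as the experiments below suggest the CGMs are, the reduction would be smaller; this is consistent with the modest results of the plain ensemble.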
\nEach face detector i produces an estimate P_i(y = 1|x) of the probability that the window x is a face. Assuming that the three face detectors (CGM1, CGM2, CGM3) are independently and identically distributed (iid), the output of the ensemble is their average: \n\nP(y = 1|x) = (1/3) Σ_{i=1}^{3} P_i(y = 1|x) \n\n3 Conditional mixture of CGMs \n\nTo extend the detection ability in orientation, a conditional mixture of CGMs is tested. The training set is separated into two subsets: front view faces with the corresponding counter-examples (B = 1), and turned faces with the corresponding counter-examples (B = 0). The first subnetwork (CGM1) evaluates the probability that the tested image is a front view face, given that the label equals 1 (P(y = 1|x, B = 1)). The second (CGM2) evaluates the probability that the tested image is a turned face, given that the label equals 0 (P(y = 1|x, B = 0)). A gating network is trained to evaluate P(B = 1|x), supposing that the partition B = 1, B = 0 can be generalized to every input: \n\nE[y|x] = E[y|B = 1, x]f(x) + E[y|B = 0, x](1 − f(x)) \n\nwhere f(x) is the estimated value of P(B = 1|x). \nThis system is different from the mixture of experts introduced by [Jacobs, R. A. et al., 1991]: each module is trained separately on a subset of the training set, and the gating network then learns to combine the outputs. \n\n4 Conditional ensemble of CGMs \n\nTo reduce the false alarm rate and to detect both front view and turned faces, an original combination using (CGM1, CGM2) and a gate network is proposed. Four sets are defined: \n\n• F is the front view face set, \n• P is the turned face set, with F ∩ P = ∅, \n• V = F ∪ P is the face set, \n• N is the non-face set, with V ∩ N = ∅. \n\nOur goal is to evaluate P(x ∈ V|x). 
Each estimator computes respectively: \n\n• P(x ∈ F|x ∈ F ∪ N, x) (CGM1(x)), \n• P(x ∈ P|x ∈ P ∪ N, x) (CGM2(x)). \n\nUsing the Bayes theorem, we obtain: \n\nP(x ∈ F|x ∈ F ∪ N, x) = P(x ∈ F ∪ N|x ∈ F, x) P(x ∈ F|x) / P(x ∈ F ∪ N|x) \n\nSince x ∈ F ⇒ x ∈ F ∪ N, then: \n\nP(x ∈ F|x, x ∈ F ∪ N) = P(x ∈ F|x) / P(x ∈ F ∪ N|x) \n\n⇒ P(x ∈ F|x) = P(x ∈ F|x ∈ F ∪ N, x) P(x ∈ F ∪ N|x) \n\n⇒ P(x ∈ F|x) = P(x ∈ F|x ∈ F ∪ N, x)[P(x ∈ F|x) + P(x ∈ N|x)] \n\nIn the same way, we have: \n\nP(x ∈ P|x) = P(x ∈ P|x ∈ P ∪ N, x)[P(x ∈ P|x) + P(x ∈ N|x)] \n\nThen: \n\nP(x ∈ V|x) = P(x ∈ N|x)[P(x ∈ P|x ∈ P ∪ N, x) + P(x ∈ F|x ∈ F ∪ N, x)] + P(x ∈ P|x)P(x ∈ P|x ∈ P ∪ N, x) + P(x ∈ F|x)P(x ∈ F|x ∈ F ∪ N, x) \n\nRewriting the previous equation using the notation CGM1(x) for P(x ∈ F|x ∈ F ∪ N, x) and CGM2(x) for P(x ∈ P|x ∈ P ∪ N, x), we have: \n\nP(x ∈ V|x) = P(x ∈ N|x)[CGM1(x) + CGM2(x)] (1) \n+ P(x ∈ P|x)CGM2(x) + P(x ∈ F|x)CGM1(x) (2) \n\nThen, we can deduce the behaviour of the conditional ensemble: \n\n• in N, if the output of the gate network is 0.5, as in the case of ensembles, the conditional ensemble reduces the variance of the error (first term, line (1) of the equation), \n• in V, as in the case of the conditional mixture, the conditional ensemble combines two different tasks (second term, line (2) of the equation): detection of turned faces and detection of front view faces. \n\nThe gate network f(x) is trained to compute the probability that the tested image is a face (P(x ∈ V|x)), using the following cost function: \n\nC = Σ_{x_i ∈ V} (f(x_i)CGM1(x_i) + (1 − f(x_i))CGM2(x_i) − y_i)² + Σ_{x_i ∈ N} (f(x_i) − 0.5)² \n\n5 Discussion \n\nEach 15x20 subimage is extracted and normalized by enhancing, smoothing and subtracting the average face, before being processed by the network. The detection threshold τ is fixed for all the tested images. To detect a face at different scales, the image is subsampled. 
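The multiscale scan can be sketched as a simple pyramid (a simplification of ours: nearest-neighbour subsampling, an arbitrary stride, and a `classify` callback standing in for the thresholded distance D(x, V)):

```python
import numpy as np

def multiscale_scan(image, classify, win=(20, 15), stride=5, factors=(1, 2)):
    """Scan every win-sized window of the image at several subsampling
    factors; return detections in original-image coordinates."""
    detections = []
    for f in factors:
        sub = image[::f, ::f]            # crude subsampling by factor f
        h, w = sub.shape
        for y in range(0, h - win[0] + 1, stride):
            for x in range(0, w - win[1] + 1, stride):
                if classify(sub[y:y + win[0], x:x + win[1]]):
                    detections.append((y * f, x * f))
    return detections
```

A real system would low-pass filter before subsampling and merge overlapping detections; both are omitted here for brevity.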
\n\nThe first test evaluates the limits in orientation of the face detectors. The sussex face database(**), containing different faces with ten orientations between 0 and 90 degrees, is used (Table 1). The general face detector (CGM3) uses the same learning face database as the different mixtures of CGMs. Nevertheless, CGM3 has a smaller orientation range than the conditional mixture of CGMs and the conditional ensemble of CGMs. Since the performances of the ensemble of CGMs are low, the corresponding hypothesis (that the CGMs are iid) is invalid. Moreover, this test shows that the combination by a gating neural network of CGMs trained on different training sets extends the detection ability to both front view and turned faces. The conditional mixture of CGMs obtains results in terms of orientation and false alarm rate close to the best CGMs used to construct it (see Table 1 and Table 2). \n\nTable 1: Results on Sussex face database \n\norientation (degree) | CGM1 | CGM2 | CGM3 | Ensemble (1,2,3) | Conditional ensemble | Conditional mixture \n0 | 100.0 % \n10 | 62.5 % \n20 | 50.0 % \n30 | 12.5 % \n40 | 0.0 % \n50 | 0.0 % \n60 | 0.0 % \n70 | 0.0 % \n\nThe second test evaluates the false alarm rate and compares our results with the best results published so far on the test set A [Rowley, H. et al., 1995] of the CMU(**), containing 42 images of various quality. First, these results show that the model trained without counter-examples (GM) overestimates the distribution of faces, and its false alarm rate is too high for it to be used as a face detector. Second, the estimation of the probability distribution of faces performed by one CGM (CGM3) is more precise than the one obtained by [Rowley, H. et al., 1995] with one SWN (see Table 2). 
The conditional ensemble of CGMs and the conditional mixture of CGMs obtain a detection rate similar to that of an ensemble of SWNs [Rowley, H. et al., 1995], but with a false alarm rate two to three times lower. Since the results of the conditional ensemble of CGMs and the conditional mixture of CGMs are close on this test, the detection rate versus the number of false alarms is plotted (Figure 3) for different thresholds. The conditional mixture of CGMs curve is above the one for the conditional ensemble of CGMs. \n\nFigure 3: Detection rate versus number of false alarms on the CMU test set A. Dashed line: conditional ensemble; solid line: conditional mixture. \n\nThe conditional mixture of CGMs is used in an application called LISTEN [Collobert, M. et al., 1996]: a camera detects and tracks a face and controls a microphone array towards a speaker, in real time. The small size of the processed subimages (15x20) allows the detection of a face far from the camera (with an aperture of 60 degrees, the maximum distance to the camera is 6 meters). To detect a face in real time, the number of tested hypotheses is reduced by motion and color analysis. \n\nTable 2: Results on the CMU test set A \nGM: the model trained without counter-examples, CGM1: front view face detector, CGM2: turned face detector, CGM3: general face detector. SWN: shared weight network. (*) Considering that our goal is to detect human faces, non-human faces and rough face drawings have not been taken into account. 
\n\nModel | Detection rate | False alarm rate \nGM | 84 % +/- 5 % | 1/1,000 +/- 0.1/1,000,000 \nCGM1 | 77 % (127/164) +/- 5 % | 5.43/1,000,000 (47/33,700,000) +/- 0.38/1,000,000 \nCGM2 | 85 % (139/164) +/- 5 % | 6.3/1,000,000 (212/33,700,000) +/- 0.37/1,000,000 \nCGM3 | 85 % (139/164) +/- 5 % | 1.36/1,000,000 (46/33,700,000) +/- 0.41/1,000,000 \n[Rowley, 1995] (one SWN) | 84 % (142/169*) +/- 5 % | 8.13/1,000,000 (179/22,000,000) +/- 0.4/1,000,000 \nEnsemble (CGM1,CGM2,CGM3) | 74 % (121/164) +/- 5 % | 0.71/1,000,000 (24/33,700,000) +/- 0.43/1,000,000 \nConditional ensemble (CGM1,CGM2,gate) | 82 % (134/164) +/- 5 % | 0.77/1,000,000 (26/33,700,000) +/- 0.38/1,000,000 \n[Rowley, 1995] (three SWNs) | 85 % (144/169*) +/- 5 % | 2.13/1,000,000 (47/22,000,000) +/- 0.42/1,000,000 \nConditional mixture (CGM1,CGM2,gate) | 87 % (142/164) +/- 5 % | 1.15/1,000,000 (39/33,700,000) +/- 0.35/1,000,000 \n\n(**) usenix face database, sussex face database and CMU test sets can be retrieved at www.cs.rug.nl/peterkr/FACE/face.html. \n\nReferences \n\n[Collobert, M. et al., 1996] Collobert, M., Feraud, R., Le Tourneur, G., Bernier, O., Viallet, J.E., Mahieux, Y., and Collobert, D. (1996). Listen: a system for locating and tracking individual speaker. In Second International Conference On Automatic Face and Gesture Recognition. \n\n[Feraud, R. et al., 1997] Feraud, R., Bernier, O., and Collobert, D. (1997). A constrained generative model applied to face detection. Neural Processing Letters. \n\n[Geman, S. et al., 1992] Geman, S., Bienenstock, E., and Doursat, R. (1992). Neural networks and the bias-variance dilemma. Neural Computation, 4:1-58. \n\n[Jacobs, R. A. et al., 1991] Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. (1991). Adaptive mixtures of local experts. Neural Computation, 3:79-87. \n\n[Kramer, 1991] Kramer, M. (1991). 
Nonlinear principal component analysis using autoassociative neural networks. AIChE Journal, 37:233-243. \n\n[Raviv, Y. and Intrator, N., 1996] Raviv, Y. and Intrator, N. (1996). Bootstrapping with noise: An effective regularization technique. Connection Science, 8:355-372. \n\n[Rowley, H. et al., 1995] Rowley, H., Baluja, S., and Kanade, T. (1995). Human face detection in visual scenes. In Neural Information Processing Systems 8. \n\n[Sung, K. and Poggio, T., 1994] Sung, K. and Poggio, T. (1994). Example-based learning for view-based human face detection. Technical report, M.I.T. \n\n[Vapnik, 1995] Vapnik, V. (1995). The Nature of Statistical Learning Theory. Springer-Verlag, New York. \n", "award": [], "sourceid": 1436, "authors": [{"given_name": "Rapha\u00ebl", "family_name": "Feraud", "institution": null}, {"given_name": "Olivier", "family_name": "Bernier", "institution": null}]}