Ensemble and Modular Approaches for Face Detection: A Comparison

Part of Advances in Neural Information Processing Systems 10 (NIPS 1997)



Raphaël Feraud, Olivier Bernier


A new learning model based on autoassociative neural networks is developed and applied to face detection. To extend the detection ability in orientation and to decrease the number of false alarms, different combinations of networks are tested: ensemble, conditional ensemble and conditional mixture of networks. The use of a conditional mixture of networks yields state-of-the-art results on different benchmark face databases.

1 A constrained generative model

Our purpose is to classify an extracted window $x$ from an image as a face ($x \in V$) or non-face ($x \in N$). The set of all possible windows is $E = V \cup N$, with $V \cap N = \emptyset$. Since collecting a representative set of non-face examples is impossible, face detection by a statistical model is a difficult task. An autoassociative network using five layers of neurons is able to perform a non-linear dimensionality reduction [Kramer, 1991]. However, its use as an estimator to classify an extracted window as face or non-face raises two problems:

  1. $V'$, the obtained sub-manifold, can contain non-face examples ($V \subset V'$),
  2. owing to local minima, the obtained solution can be close to the linear solution: principal components analysis.
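To make the five-layer structure concrete, here is a minimal sketch of the forward (reconstruction) pass of an autoassociative network in plain Python. The layer sizes, the random initialisation, the choice of tanh on the mapping and demapping layers, and the names `make_autoassociator` and `reconstruct` are all illustrative assumptions; the text above only states that the network has five layers of neurons.

```python
import math
import random

def dense(inputs, weights, biases, activation=None):
    """One fully connected layer: activation(W . inputs + b)."""
    outputs = []
    for row, b in zip(weights, biases):
        s = sum(w * v for w, v in zip(row, inputs)) + b
        outputs.append(activation(s) if activation else s)
    return outputs

def random_layer(n_in, n_out, rng):
    """Small random weights and zero biases (hypothetical initialisation)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def make_autoassociator(n_input, n_mapping, n_bottleneck, rng):
    """Five layers of neurons: input, mapping, bottleneck, demapping, output."""
    sizes = [n_input, n_mapping, n_bottleneck, n_mapping, n_input]
    return [random_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]

def reconstruct(network, x):
    """Propagate a window x through the network and return its reconstruction."""
    # Assumption: non-linear (tanh) mapping/demapping layers,
    # linear bottleneck and output layers.
    activations = [math.tanh, None, math.tanh, None]
    h = x
    for (weights, biases), act in zip(network, activations):
        h = dense(h, weights, biases, act)
    return h

# Usage: a toy 6-pixel "window" compressed to 2 dimensions and reconstructed.
network = make_autoassociator(n_input=6, n_mapping=4, n_bottleneck=2,
                              rng=random.Random(0))
x_hat = reconstruct(network, [0.1, 0.2, 0.3, 0.4, 0.5, 0.6])
```

The bottleneck layer is what forces the dimensionality reduction: the output has the same size as the input, but every reconstruction must pass through the narrower middle layer.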

Our approach is to use counter-examples in order to find a sub-manifold as close as possible to $V$ and to constrain the algorithm to converge to a non-linear solution [Feraud, R. et al., 1997]. Each non-face example is constrained to be reconstructed as its projection on $V$. The projection $P$ of a point $x$ of the input space $E$ on $V$ is defined by:

* email: feraud@lannion.cnet.fr  † email: bernier@lannion.cnet.fr



• if $x \in V$, then $P(x) = x$,
• if $x \notin V$, then $P(x) = \arg\min_{y \in V} d(x, y)$, where $d$ is the Euclidean distance.

During the learning process, the projection $P$ of $x$ on $V$ is approximated by $P(x) \approx \frac{1}{n} \sum_{i=1}^{n} v_i$, where $v_1, v_2, \ldots, v_n$ are the $n$ nearest neighbours, in the training set of faces, of $v$, the nearest face example of $x$.
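The construction of the reconstruction target used during learning can be sketched as follows. This is a minimal plain-Python illustration, not the authors' implementation: the function names are hypothetical, windows are represented as flat lists of pixel values, and it assumes that the nearest face example itself counts among its own nearest neighbours, which the text does not specify.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two windows given as flat lists."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def training_target(x, is_face, faces, n):
    """Target the network should reconstruct for training example x.

    - Face examples are reconstructed as themselves (P(x) = x).
    - Non-face examples are reconstructed as the mean of the n nearest
      neighbours, among the training faces, of the face nearest to x.
    """
    if is_face:
        return list(x)
    # Nearest face example of x.
    v = min(faces, key=lambda f: euclidean(x, f))
    # Assumption: v is included among its own nearest neighbours.
    neighbours = sorted(faces, key=lambda f: euclidean(v, f))[:n]
    return [sum(f[i] for f in neighbours) / len(neighbours)
            for i in range(len(x))]

# Usage with toy 2-pixel "windows".
faces = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0]]
target = training_target([0.2, 0.1], is_face=False, faces=faces, n=3)
# The nearest face is [0, 0]; its 3 nearest neighbours average to [1/3, 1/3].
```

Averaging over several neighbours, instead of using the single nearest face, gives a smoother target than the exact projection, which is only available at the training points.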

The goal of the learning process is to approximate the distance $D$ of an input space element $x$ to the set of faces $V$:

• $D(x, V) = \|x - P(x)\| \approx \frac{1}{M}(x - \hat{x})^2$, where $M$ is the size of the input image $x$ and $\hat{x}$ the image reconstructed by the neural network,

• let $x \in E$, then $x \in V$ if and only if $D(x, V) \leq T$, with $T \in \mathbb{R}$, where $T$ is a threshold used to adjust the sensitivity of the model.
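The decision rule above can be sketched in a few lines of plain Python. The function names are hypothetical, and the sketch assumes that the squared term is summed over the $M$ pixels of the window, i.e. that the estimated distance is the mean squared reconstruction error.

```python
def reconstruction_distance(x, x_hat):
    """Estimated distance of window x to the face set: (1/M) * sum (x - x_hat)^2.

    x is the input window and x_hat the reconstruction produced by the
    network, both as flat lists of M pixel values.
    """
    M = len(x)
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / M

def is_face(x, x_hat, threshold):
    """Classify x as a face if and only if its estimated distance is <= threshold."""
    return reconstruction_distance(x, x_hat) <= threshold

# Usage with a toy 4-pixel window and a made-up reconstruction.
x = [1.0, 2.0, 3.0, 4.0]
x_hat = [1.0, 2.0, 3.0, 2.0]
dist = reconstruction_distance(x, x_hat)   # (0 + 0 + 0 + 4) / 4 = 1.0
accepted = is_face(x, x_hat, threshold=1.0)
```

Raising the threshold accepts more windows as faces (higher detection rate, more false alarms); lowering it does the opposite.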