Face Detection --- Efficient and Rank Deficient

Part of Advances in Neural Information Processing Systems 17 (NIPS 2004)

Bibtex Metadata Paper


Wolf Kienzle, Matthias Franz, Bernhard Schölkopf, Gökhan Bakir


This paper proposes a method for computing fast approximations to sup- port vector decision functions in the field of object detection. In the present approach we are building on an existing algorithm where the set of support vectors is replaced by a smaller, so-called reduced set of syn- thesized input space points. In contrast to the existing method that finds the reduced set via unconstrained optimization, we impose a structural constraint on the synthetic points such that the resulting approximations can be evaluated via separable filters. For applications that require scan- ning large images, this decreases the computational complexity by a sig- nificant amount. Experimental results show that in face detection, rank deficient approximations are 4 to 6 times faster than unconstrained re- duced set systems.

1 Introduction

It has been shown that support vector machines (SVMs) provide state-of-the-art accuracies in object detection. In time-critical applications, however, they are of limited use due to their computationally expensive decision functions. In particular, the time complexity of an SVM classification operation is characterized by two parameters. First, it is linear in the number of support vectors (SVs). Second, it scales with the number of operations needed for computing the similarity between an SV and the input, i.e. the complexity of the kernel function. When classifying image patches of size h w using plain gray value features, the decision function requires an h w dimensional dot product for each SV. As the patch size increases, these computations become extremely expensive. As an example, the evaluation of a single 20 20 patch on a 320 240 image at 25 frames per second already requires 660 million operations per second.

In the past, research towards speeding up kernel expansions has focused exclusively on the first issue, i.e. on how to reduce the number of expansion points (SVs) [1, 2]. In [2], Burges introduced a method that, for a given SVM, creates a set of so-called reduced set vectors (RSVs) that approximate the decision function. This approach has been successfully ap- plied in the image classification domain -- speedups on the order of 10 to 30 have been re- ported [2, 3, 4] while the full accuracy was retained. Additionally, for strongly unbalanced classification problems such as face detection, the average number of RSV evaluations can be further reduced using cascaded classifiers [5, 6, 7]. Unfortunately, the above example illustrates that even with as few as three RSVs on average (as in [5]), such systems are not competitive for time-critical applications.

The present work focuses on the second issue, i.e. the high computational cost of the kernel evaluations. While this could be remedied by switching to a sparser image representation (e.g. a wavelet basis), one could argue that in connection with SVMs, not only are plain gray values straightforward to use, but they have shown to outperform Haar wavelets and gradients in the face detection domain [8]. Alternatively, in [9], the authors suggest to com- pute the costly correlations in the frequency domain. In this paper, we develop a method that combines the simplicity of gray value correlations with the speed advantage of more sophisticated image representations. To this end, we borrow an idea from image processing: by constraining the RSVs to have a special structure, they can be evaluated via separable convolutions. This works for most standard kernels (e.g. linear, polynomial, Gaussian and sigmoid) and decreases the average computational complexity of the RSV evaluations from O(h w) to O(r (h + w)), where r is a small number that allows the user to balance be- tween speed and accuracy. To evaluate our approach, we examine the performance of these approximations on the MIT+CMU face detection database (used in [10, 8, 5, 6]).

2 Burges' method for reduced set approximations

The present section briefly describes Burges' reduced set method [2] on which our work is based. For reasons that will become clear below, h w image patches are written as h w matrices (denoted by bold capital letters) whose entries are the respective pixel intensities. In this paper, we refer to this as the image-matrix notation.

Assume that an SVM has been successfully trained on the problem at hand. Let {X1, . . . Xm} denote the set of SVs, {1, . . . m} the corresponding coefficients, k(, ) the kernel function and b the bias of the SVM solution. The decision rule for a test pattern X reads m

                      f (X) = sgn             yiik(Xi, X) + b .                       (1)                                            i=1

In SVMs, the decision surface induced by f corresponds to a hyperplane in the reproducing kernel Hilbert space (RKHS) associated with k. The corresponding normal


                                =            yiik(Xi, )                             (2)                                           i=1

can be approximated using a smaller, so-called reduced set (RS) {Z1, . . . Zm } of size m < m, i.e. an approximation to of the form


                                 =              ik(Zi, ).                           (3)                                            i=1

This speeds up the decision process by a factor of m/m . To find such , we fix a desired set size m and solve min - 2RKHS (4)

for i and Zi. Here, RKHS denotes the Euclidian norm in the RKHS. The resulting RS decision function f is then given by


                      f (X) = sgn                                             ik(Zi,X)+b                                             i=1                     .                     (5) In practice, i, Zi are found using a gradient based optimization technique. Details can be found in [2].

3 From separable filters to rank deficient reduced sets

We now describe the concept of separable filters in image processing and show how this idea extends to a broader class of linear filters and to a special class of nonlinear filters, namely those used by SVM decision functions. Using the image-matrix notation, it will become clear that the separability property boils down to a matrix rank constraint.

3.1 Linear separable filters

Applying a linear filter to an image amounts to a two-dimensional convolution of the image with the impulse response of the filter. In particular, if I is the input image, H the impulse response, i.e. the filter mask, and J the output image, then

                                           J = I  H.                                                 (6)

If H has size h w, the convolution requires O(h w) operations for each output pixel. However, in special cases where H can be decomposed into two column vectors a and b, such that H = ab (7)

holds, we can rewrite (6) as J = [I a] b , (8)

since the convolution is associative and in this case, ab = a b . This splits the original problem (6) into two convolution operations with masks of size h1 and 1w, respectively. As a result, if a linear filter is separable in the sense of equation (7), the computational complexity of the filtering operation can be reduced from O(h w) to O(h + w) per pixel by computing (8) instead of (6).

3.2 Linear rank deficient filters

In view of (7) being equivalent to rank(H) 1, we now generalize the above concept to linear filters with low rank impulse responses. Consider the singular value decomposition (SVD) of the h w matrix H, H = USV , (9)

and recall that U and V are orthogonal matrices of size h h and w w, respectively, whereas S is diagonal (the diagonal entries are the singular values) and has size h w. Now let r = rank(H). Due to rank(S) = rank(H), we may write H as a sum of r rank one matrices r

                                      H =               s u v                                                             i    i    i                                 (10)                                                      i=1

where si denotes the ith singular value of H and ui, vi are the iths columns of U and V (i.e. the ith singular vectors of the matrix H), respectively. As a result, the correspond- ing linear filter can be evaluated (analogously to (8)) as the weighted sum of r separable convolutions r

                                   J =           si [I  ui]  vi                                   (11)                                               i=1

and the computational complexity drops from O(h w) to O(r (h + w)) per output pixel. Not surprisingly, the speed benefit depends on r, which can be seen to measure the structural complexity1 of H. For square matrices (w = h) for instance, (11) does not give any speedup compared to (6) if r > w/2.

   1In other words, the flatter the spectrum of HH , the less benefit can be expected from (11).

3.3 Nonlinear rank deficient filters and reduced sets

Due to the fact that in 2D, correlation is identical with convolution if the filter mask is rotated by 180 degrees (and vice versa), we can apply the above idea to any image filter f (X) = g(c(H, X)) where g is an arbitrary nonlinear function and c(H, X) denotes the correlation between images patches X and H (both of size h w). In SVMs this amounts to using a kernel of the form

                                       k(H, X) = g(c(H, X)).                                                           (12)

If H has rank r, we may split the kernel evaluation into r separable correlations plus a scalar nonlinearity. As a result, if the RSVs in a kernel expansion such as (5) satisfy this constraint, the average computational complexity decreases from O(m h w) to O(m r (h + w)) per output pixel. This concept works for many off-the-shelf kernels used in SVMs. While linear, polynomial and sigmoid kernels are defined as functions of input space dot products and therefore immediately satisfy equation (12), the idea applies to kernels based on the Euclidian distance as well. For instance, the Gaussian kernel reads

                  k(H, X) = exp((c(X, X) - 2c(H, X) + c(H, H))).                                                      (13)

Here, the middle term is the correlation which we are going to evaluate via separable filters. The first term is independent of the SVs -- it can be efficiently pre-computed and stored in a separate image. The last term is merely a constant scalar independent of the image data. Finally, note that these kernels are usually defined on vectors. Nevertheless, we can use our image-matrix notation due to the fact that the squared Euclidian distance between two vectors of gray values x and z may be written as

                                            x - z 2 = X - Z 2 ,                                                        (14)                                                                                           F

whereas the dot product amounts to 1 x z = X 2 + Z 2 - X - Z 2 , (15) 2 F F F where X and Z are the corresponding image patches and F is the Frobenius norm for matrices.

4 Finding rank deficient reduced sets

In our approach we consider a special class of the approximations given by (3), namely those where the RSVs can be evaluated efficiently via separable correlations. In order to obtain such approximations, we use a constrained version of Burges' method. In particular, we restrict the RSV search space to the manifold spanned by all image patches that -- viewed as matrices -- have a fixed, small rank r (which is to be chosen a priori by the user). To this end, the Zi in equation (3) are replaced by their singular value decompositions

                                             Z                 S V                                                       i  Ui        i          i     .                                     (16)

The rank constraint can then be imposed by allowing only the first r diagonal elements of Si to be non-zero. Note that this boils down to using an approximation of the form


                                =                                  S           V                                       r                     ik(Ui,r           i,r         i,r    , )                      (17)                                                 i=1

with Si,r being r r (diagonal) and Ui,r, Vi,r being h r, w r (orthogonal2) matrices, respectively. Analogously to (4) we fix m and r and find Si,r, Ui,r, Vi,r and i that mini- mize the approximation error = - 2 . r The minimization problem is solved via RKHS

   2In this paper we call a non-square matrix orthogonal if its columns are pairwise orthogonal and have unit length.

gradient decent. Note that when computing gradients, the image-matrix notation (together with (14) or (15), and the equality X 2 = tr(XX )) allows a straightforward computa- F tion of the kernel derivatives w.r.t. the components of the decomposed RSV image patches, i.e. the row, column and scale information in Vi,r, Ui,r and Si,r, respectively. However, while the update rules for i and Si,r follow immediately from the respective derivatives, care must be taken in order to keep Ui,r and Vi,r orthogonal during optimization. This can be achieved through re-orthogonalization of these matrices after each gradient step.

In our current implementation, however, we perform those updates subject to the so-called Stiefel constraints [11]. Intuitively, this amounts to rotating (rather than translating) the columns of Ui,r and Vi,r, which ensures that the resulting matrices are still orthogonal, i.e. lie on the Stiefel manifold. Let S(h, r) be the manifold of orthogonal h r matrices, the (h, r)-Stiefel manifold. Further, let U denote an orthogonal basis for the orthogo- i,r nal complement of the subspace spanned by the columns of Ui,r. Now, given the 'free' gradient G = /Ui,r we compute the 'constrained' gradient

                                  ^                                    G = G - U                       G U                                                             i,r           i,r ,              (18)

which is the projection of G onto the tangent space of S(h, r) at Ui,r. The desired rotation is then given [11] by the (matrix) exponential of the h h skew-symmetric matrix

                                       ^                                            G U                                      )                            A = t                    i,r      -( ^                                                                     G U                                                                              i,r                                            ^                                                 (19)                                            G U                       0                                                      i,r

where t is a user-defined step size parameter. For details, see [11]. A Matlab library is available at [12].

5 Experiments

This section shows the results of two experiments. The first part illustrates the behavior of rank deficient approximations for a face detection SVM in terms of the convergence rate and classification accuracy for different values of r. In the second part, we show how an actual face detection system, similar to that presented in [5], can be sped up using rank deficient RSVs. In both experiments we used the same training and validation set. It consisted of 19 19 gray level image patches containing 16081 manually collected faces (3194 of them kindly provided by Sami Romdhani) and 42972 non-faces automatically collected from a set of 206 background scenes. Each patch was normalized to zero mean and unit variance. The set was split into a training set (13331 faces and 35827 non-faces) and a validation set (2687 faces and 7145 non-faces). We trained a 1-norm soft margin SVM on the training set using a Gaussian kernel with = 10. The regularization constant C was set to 1. The resulting decision function (1) achieved a hit rate of 97.3% at 1.0% false positives on the validation set using m = 6910 SVs. This solution served as the approximation target (see equation (2)) during the experiments described below.

5.1 Rank deficient faces

In order to see how m and r affect the accuracy of our approximations, we compute rank deficient reduced sets for m = 1 . . . 32 and r = 1 . . . 3 (the left array in Figure 1 illustrates the actual appearance of rank deficient RSVs for the m = 6 case). Accuracy of the re- sulting decision functions is measured in ROC score (the area under the ROC curve) on the validation set. For the full SVM, this amounts to 0.99. The results for our approximations are depicted in Figure 2. As expected, we need a larger number of rank deficient RSVs than unconstrained RSVs to obtain similar classification accuracies, especially for small r. Nevertheless, the experiment points out two advantages of our method. First, a rank as

                                                 m' = 6                                                                    m' = 1

                                                                                                                     =         +     +     + ...          r=full                                                                                  r=full

                                                                                                       r=3           =         +     +                     r=3

                                                                                      r=2                            =         +    r=2

                      r=1                                                                                     r=1    =

Figure 1: Rank deficient faces. The left array shows the RSVs (Zi) of the unconstrained (top row) and constrained (r decreases from 3 to 1 down the remaining rows) approxi- mations for m = 6. Interestingly, the r = 3 RSVs are already able to capture face-like structures. This supports the fact that the classification accuracy for r = 3 is similar to that of the unconstrained approximations (cf. Figure 2, left plot). The right array shows the m = 1 RSVs (r = full, 3, 2, 1, top to bottom row) and their decomposition into rank one matrices according to (10). For the unconstrained RSV (first row) it shows an approximate (truncated) expansion based on the three leading singular vectors. While for r = 3 the decomposition is indeed similar to the truncated SVD, note how this similarity decreases for r = 2, 1. This illustrates that the approach is clearly different from simply finding un- constrained RSVs and then imposing the rank constraint via SVD (in fact, the norm (4) is smaller for the r = 1 RSV than for the leading singular vector of the r = full RSV).

low as three seems already sufficient for our face detection SVM in the sense that for equal sizes m there is no significant loss in accuracy compared to the unconstrained approxima- tion (at least for m > 2). The associated speed benefit over unconstrained RSVs is shown in the right plot of Figure 2: the rank three approximations achieve accuracies similar to the unconstrained functions, while the number of operations reduces to less than a third. Second, while for unconstrained RSVs there is no solution with a number of operations smaller than h w = 361 (in the right plot, this is the region beyond the left end of the solid line), there exist rank deficient functions which are not only much faster than this, but yield considerably higher accuracies. This property will be exploited in the next experiment.

5.2 A cascade-based face detection system

In this experiment we built a cascade-based face detection system similar to [5, 6], i.e. a cascade of RSV approximations of increasing size m . As the benefit of a cascaded classifier heavily depends on the speed of the first classifier which has to be evaluated on the whole image [5, 6], our system uses a rank deficient approximation as the first stage. Based on the previous experiment, we chose the m = 3, r = 1 classifier. Note that this function yields an ROC score of 0.9 using 114 multiply-adds, whereas the simplest possible unconstrained approximation m = 1, r = full needs 361 multiply-adds to achieve a ROC score of only 0.83 (cf. Figure 2). In particular, if the threshold of the first stage is set to yield a hit rate of 95% on the validation set, scanning the MIT+CMU set (130 images, 507 faces) with m = 3, r = 1 discards 91.5% of the false positives, whereas the m = 1, r = full can only reject 70.2%. At the same time, when scanning a 320 240 image3, the three separable convolutions plus nonlinearity require 55ms, whereas the single, full kernel evaluation takes 208ms on a Pentium 4 with 2.8 GHz. Moreover, for the unconstrained

                             3For multi-scale processing the detectors are evaluated on an image pyramid with 12 different scales using a scale decay of 0.75. This amounts to scanning 140158 patches for a 320  240 image.

           1                                                            1

          0.9                                                          0.9

          0.8                                                          0.8

ROC score ROC score r=1 r=1 0.7 r=2 0.7 r=2 r=3 r=3 r=full r=full 0.6 0.6 0 1 2 3 4 10 10 10 10 10

                      #RSVs (m')                                                    #operations (m'r(h+w))

Figure 2: Effect of the rank parameter r on classification accuracies. The left plots shows the ROC score of the rank deficient RSV approximations (cf. Section 4) for varying set sizes (m = 1 . . . 32, on a logarithmic scale) and ranks (r = 1 . . . 3). Additionally, the solid line shows the accuracy of the RSVs without rank constraint (cf. Section 2), here denoted by r = full. The right plot shows the same four curves, but plotted against the number of operations needed for the evaluation of the corresponding decision function when scanning large images (i.e. m r (h + w) with h = w = 19), also on a logarithmic scale.

                                                   Figure 3: A sample output from our demonstra-                                                        tion system (running at 14 frames per second).                                                        In this implementation, we reduced the number                                                        of false positives by adjusting the threshold of                                                        the final classifier. Although this reduces the                                                        number of detections as well, the results are still                                                        satisfactory. This is probably due to the fact that                                                        the MIT+CMU set contains several images of                                                        very low quality that are not likely to occur in                                                        our setting, using a good USB camera.

cascade to catch up in terms of accuracy, the (at least) m = 2, r = full classifier (also with an ROC score of roughly 0.9) should be applied afterwards, requiring another 0.3 2 208 ms 125ms.

The subsequent stages of our system consist of unconstrained RSV approximations of size m = 4, 8, 16, 32, respectively. These sizes were chosen such that the number of false positives roughly halves after each stage, while the number of correct detections remains close to 95% on the validation set (with the decision thresholds adjusted accordingly). To eliminate redundant detections, we combine overlapping detections via averaging of posi- tion and size if they are closer than 0.15 times the estimated patch size. This system yields 93.1% correct detections and 0.034% false positives on the MIT+CMU set. The current system was incorporated into a demo application (Figure 3). For optimal performance, we re-compiled our system using the Intel compiler (ICC). The application now classifies a 320x240 image within 54ms (vs. 238ms with full rank RSVs only) on a 2.8 GHz PC. To further reduce the number of false positives, additional bootstrapped (as in [5]) stages need to be added to the cascade. Note that this will not significantly affect the speed of our sys- tem (currently 14 frames per second) since 0.034% false positives amounts to merely 47 patches to be processed by subsequent classifiers.