{"title": "Face Detection --- Efficient and Rank Deficient", "book": "Advances in Neural Information Processing Systems", "page_first": 673, "page_last": 680, "abstract": null, "full_text": " Face Detection -- Efficient and Rank Deficient\n\n\n\n Wolf Kienzle, G okhan Bakir, Matthias Franz and Bernhard Sch olkopf\n Max-Planck-Institute for Biological Cybernetics\n Spemannstr. 38, D-72076 T ubingen, Germany\n {kienzle, gb, mof, bs}@tuebingen.mpg.de\n\n\n\n Abstract\n\n This paper proposes a method for computing fast approximations to sup-\n port vector decision functions in the field of object detection. In the\n present approach we are building on an existing algorithm where the set\n of support vectors is replaced by a smaller, so-called reduced set of syn-\n thesized input space points. In contrast to the existing method that finds\n the reduced set via unconstrained optimization, we impose a structural\n constraint on the synthetic points such that the resulting approximations\n can be evaluated via separable filters. For applications that require scan-\n ning large images, this decreases the computational complexity by a sig-\n nificant amount. Experimental results show that in face detection, rank\n deficient approximations are 4 to 6 times faster than unconstrained re-\n duced set systems.\n\n\n\n1 Introduction\n\nIt has been shown that support vector machines (SVMs) provide state-of-the-art accuracies\nin object detection. In time-critical applications, however, they are of limited use due to\ntheir computationally expensive decision functions. In particular, the time complexity of\nan SVM classification operation is characterized by two parameters. First, it is linear in the\nnumber of support vectors (SVs). Second, it scales with the number of operations needed\nfor computing the similarity between an SV and the input, i.e. the complexity of the kernel\nfunction. 
When classifying image patches of size h × w using plain gray value features, the decision function requires an h · w dimensional dot product for each SV. As the patch size increases, these computations become extremely expensive. As an example, the evaluation of a single 20 × 20 patch on a 320 × 240 image at 25 frames per second already requires 660 million operations per second.\n\nIn the past, research towards speeding up kernel expansions has focused exclusively on the first issue, i.e. on how to reduce the number of expansion points (SVs) [1, 2]. In [2], Burges introduced a method that, for a given SVM, creates a set of so-called reduced set vectors (RSVs) that approximate the decision function. This approach has been successfully applied in the image classification domain -- speedups on the order of 10 to 30 have been reported [2, 3, 4] while the full accuracy was retained. Additionally, for strongly unbalanced classification problems such as face detection, the average number of RSV evaluations can be further reduced using cascaded classifiers [5, 6, 7]. Unfortunately, the above example illustrates that even with as few as three RSVs on average (as in [5]), such systems are not competitive for time-critical applications.\n\nThe present work focuses on the second issue, i.e. the high computational cost of the kernel evaluations. While this could be remedied by switching to a sparser image representation (e.g. a wavelet basis), one could argue that in connection with SVMs, plain gray values are not only straightforward to use, but have also been shown to outperform Haar wavelets and gradients in the face detection domain [8]. Alternatively, in [9], the authors suggest computing the costly correlations in the frequency domain. In this paper, we develop a method that combines the simplicity of gray value correlations with the speed advantage of more sophisticated image representations. 
To this end, we borrow an idea from image processing: by constraining the RSVs to have a special structure, they can be evaluated via separable convolutions. This works for most standard kernels (e.g. linear, polynomial, Gaussian and sigmoid) and decreases the average computational complexity of the RSV evaluations from O(h · w) to O(r · (h + w)), where r is a small number that allows the user to balance between speed and accuracy. To evaluate our approach, we examine the performance of these approximations on the MIT+CMU face detection database (used in [10, 8, 5, 6]).\n\n\n2 Burges' method for reduced set approximations\n\nThe present section briefly describes Burges' reduced set method [2], on which our work is based. For reasons that will become clear below, h × w image patches are written as h × w matrices (denoted by bold capital letters) whose entries are the respective pixel intensities. In this paper, we refer to this as the image-matrix notation.\n\nAssume that an SVM has been successfully trained on the problem at hand. Let {X_1, . . . , X_m} denote the set of SVs, {α_1, . . . , α_m} the corresponding coefficients, k(·, ·) the kernel function and b the bias of the SVM solution. The decision rule for a test pattern X reads\n\n f(X) = sgn( Σ_{i=1}^{m} y_i α_i k(X_i, X) + b ). (1)\n\nIn SVMs, the decision surface induced by f corresponds to a hyperplane in the reproducing kernel Hilbert space (RKHS) associated with k. The corresponding normal\n\n Ψ = Σ_{i=1}^{m} y_i α_i k(X_i, ·) (2)\n\ncan be approximated using a smaller, so-called reduced set (RS) {Z_1, . . . , Z_{m′}} of size m′ < m, i.e. an approximation to Ψ of the form\n\n Ψ′ = Σ_{i=1}^{m′} β_i k(Z_i, ·). (3)\n\nThis speeds up the decision process by a factor of m/m′. To find such a Ψ′, we fix a desired set size m′ and solve\n\n min ‖Ψ − Ψ′‖²_RKHS (4)\n\nfor β_i and Z_i. Here, ‖·‖_RKHS denotes the Euclidean norm in the RKHS. The resulting RS decision function f′ is then given by\n\n f′(X) = sgn( Σ_{i=1}^{m′} β_i k(Z_i, X) + b ). 
(5)\n\nIn practice, β_i and Z_i are found using a gradient based optimization technique. Details can be found in [2].\n\n\n3 From separable filters to rank deficient reduced sets\n\nWe now describe the concept of separable filters in image processing and show how this idea extends to a broader class of linear filters and to a special class of nonlinear filters, namely those used by SVM decision functions. Using the image-matrix notation, it will become clear that the separability property boils down to a matrix rank constraint.\n\n\n3.1 Linear separable filters\n\nApplying a linear filter to an image amounts to a two-dimensional convolution of the image with the impulse response of the filter. In particular, if I is the input image, H the impulse response, i.e. the filter mask, and J the output image, then\n\n J = I ∗ H. (6)\n\nIf H has size h × w, the convolution requires O(h · w) operations for each output pixel. However, in special cases where H can be decomposed into two column vectors a and b such that\n\n H = a b^⊤ (7)\n\nholds, we can rewrite (6) as\n\n J = [I ∗ a] ∗ b^⊤, (8)\n\nsince the convolution is associative and, in this case, a b^⊤ = a ∗ b^⊤. This splits the original problem (6) into two convolution operations with masks of size h × 1 and 1 × w, respectively. As a result, if a linear filter is separable in the sense of equation (7), the computational complexity of the filtering operation can be reduced from O(h · w) to O(h + w) per pixel by computing (8) instead of (6).\n\n\n3.2 Linear rank deficient filters\n\nIn view of (7) being equivalent to rank(H) ≤ 1, we now generalize the above concept to linear filters with low rank impulse responses. Consider the singular value decomposition (SVD) of the h × w matrix H,\n\n H = U S V^⊤, (9)\n\nand recall that U and V are orthogonal matrices of size h × h and w × w, respectively, whereas S is diagonal (the diagonal entries are the singular values) and has size h × w. Now let r = rank(H). 
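The separability claim of Section 3.1 is easy to check numerically. The following pure-Python sketch (hypothetical helper names, not from the paper) uses 'valid'-mode cross-correlation rather than full convolution, which differs only by a 180 degree flip of the masks; it verifies that a rank-one mask H = a b^⊤, applied as two 1-D passes, matches direct 2-D filtering:

```python
# Hypothetical sketch: a rank-one mask applied as two 1-D passes
# gives the same output as direct 2-D 'valid' cross-correlation.

def corr2d(img, mask):
    # direct 2-D correlation: O(h*w) multiply-adds per output pixel
    mh, mw = len(mask), len(mask[0])
    return [[sum(img[i + di][j + dj] * mask[di][dj]
                 for di in range(mh) for dj in range(mw))
             for j in range(len(img[0]) - mw + 1)]
            for i in range(len(img) - mh + 1)]

def corr2d_separable(img, a, b):
    # column pass with a, then row pass with b: O(h + w) per pixel
    tmp = [[sum(img[i + di][j] * a[di] for di in range(len(a)))
            for j in range(len(img[0]))]
           for i in range(len(img) - len(a) + 1)]
    return [[sum(tmp[i][j + dj] * b[dj] for dj in range(len(b)))
             for j in range(len(tmp[0]) - len(b) + 1)]
            for i in range(len(tmp))]

a, b = [1.0, 2.0, 1.0], [1.0, 0.0, -1.0]   # H = a b^T, a Sobel-like mask
H = [[ai * bj for bj in b] for ai in a]
I = [[float(5 * i + j) for j in range(5)] for i in range(4)]

assert corr2d(I, H) == corr2d_separable(I, a, b)
```

Since all values here are exactly representable integers, the two results match exactly; with general floating-point data the comparison would need a tolerance.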
Since rank(S) = rank(H), we may write H as a sum of r rank one matrices\n\n H = Σ_{i=1}^{r} s_i u_i v_i^⊤, (10)\n\nwhere s_i denotes the ith singular value of H and u_i, v_i are the ith columns of U and V (i.e. the ith singular vectors of H), respectively. As a result, the corresponding linear filter can be evaluated (analogously to (8)) as the weighted sum of r separable convolutions\n\n J = Σ_{i=1}^{r} s_i [I ∗ u_i] ∗ v_i^⊤, (11)\n\nand the computational complexity drops from O(h · w) to O(r · (h + w)) per output pixel. Not surprisingly, the speed benefit depends on r, which can be seen to measure the structural complexity¹ of H. For square matrices (w = h), for instance, (11) gives no speedup over (6) if r > w/2.\n\n ¹In other words, the flatter the spectrum of H H^⊤, the less benefit can be expected from (11).\n\n3.3 Nonlinear rank deficient filters and reduced sets\n\nSince in 2D, correlation is identical to convolution with the filter mask rotated by 180 degrees (and vice versa), we can apply the above idea to any image filter f(X) = g(c(H, X)), where g is an arbitrary nonlinear function and c(H, X) denotes the correlation between image patches X and H (both of size h × w). In SVMs this amounts to using a kernel of the form\n\n k(H, X) = g(c(H, X)). (12)\n\nIf H has rank r, we may split the kernel evaluation into r separable correlations plus a scalar nonlinearity. As a result, if the RSVs in a kernel expansion such as (5) satisfy this constraint, the average computational complexity decreases from O(m′ · h · w) to O(m′ · r · (h + w)) per output pixel. This concept works for many off-the-shelf kernels used in SVMs. While linear, polynomial and sigmoid kernels are defined as functions of input space dot products and therefore immediately satisfy equation (12), the idea applies to kernels based on the Euclidean distance as well. For instance, the Gaussian kernel reads\n\n k(H, X) = exp(−(c(X, X) − 2 c(H, X) + c(H, H)) / (2σ²)). 
(13)\n\nHere, the middle term is the correlation, which we are going to evaluate via separable filters. The first term is independent of the SVs -- it can be efficiently pre-computed and stored in a separate image. The last term is merely a constant scalar independent of the image data. Finally, note that these kernels are usually defined on vectors. Nevertheless, we can use our image-matrix notation due to the fact that the squared Euclidean distance between two vectors of gray values x and z may be written as\n\n ‖x − z‖² = ‖X − Z‖²_F, (14)\n\nwhereas the dot product amounts to\n\n x^⊤ z = ½ (‖X‖²_F + ‖Z‖²_F − ‖X − Z‖²_F), (15)\n\nwhere X and Z are the corresponding image patches and ‖·‖_F is the Frobenius norm for matrices.\n\n\n4 Finding rank deficient reduced sets\n\nIn our approach we consider a special class of the approximations given by (3), namely those whose RSVs can be evaluated efficiently via separable correlations. In order to obtain such approximations, we use a constrained version of Burges' method. In particular, we restrict the RSV search space to the manifold spanned by all image patches that -- viewed as matrices -- have a fixed, small rank r (which is to be chosen a priori by the user). To this end, the Z_i in equation (3) are replaced by their singular value decompositions\n\n Z_i = U_i S_i V_i^⊤. (16)\n\nThe rank constraint can then be imposed by allowing only the first r diagonal elements of S_i to be non-zero. Note that this boils down to using an approximation of the form\n\n Ψ′_r = Σ_{i=1}^{m′} β_i k(U_{i,r} S_{i,r} V_{i,r}^⊤, ·), (17)\n\nwith S_{i,r} being r × r (diagonal) and U_{i,r}, V_{i,r} being h × r and w × r (orthogonal²) matrices, respectively. Analogously to (4), we fix m′ and r and find S_{i,r}, U_{i,r}, V_{i,r} and β_i that minimize the approximation error ε_r = ‖Ψ − Ψ′_r‖²_RKHS. The minimization problem is solved via gradient descent.\n\n ²In this paper we call a non-square matrix orthogonal if its columns are pairwise orthogonal and have unit length. 
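The computational core of the parametrization above can be checked with a small numerical sketch (hypothetical variable names, random data): for Z = U S V^⊤, the correlation c(Z, X) = Σ_ij Z_ij X_ij equals Σ_k s_k u_k^⊤ X v_k, i.e. r separable passes instead of one full h · w product. Note the identity holds whether or not U and V are orthogonal:

```python
# Hypothetical check: correlating a patch X with a rank-r matrix
# Z = sum_k s_k u_k v_k^T via r separable passes matches the full
# elementwise product sum. Pure Python, random test data.
import random

h, w, r = 5, 4, 2
random.seed(0)

U = [[random.gauss(0, 1) for _ in range(r)] for _ in range(h)]  # h x r
V = [[random.gauss(0, 1) for _ in range(r)] for _ in range(w)]  # w x r
s = [random.gauss(0, 1) for _ in range(r)]                      # r scales
X = [[random.gauss(0, 1) for _ in range(w)] for _ in range(h)]  # patch

# full correlation with Z = sum_k s_k u_k v_k^T  (O(h*w) per patch)
Z = [[sum(s[k] * U[i][k] * V[j][k] for k in range(r)) for j in range(w)]
     for i in range(h)]
full = sum(Z[i][j] * X[i][j] for i in range(h) for j in range(w))

# separable evaluation: sum_k s_k * u_k^T X v_k  (O(r*(h+w)) when scanning)
sep = sum(s[k] * sum(U[i][k] * sum(X[i][j] * V[j][k] for j in range(w))
                     for i in range(h))
          for k in range(r))

assert abs(full - sep) < 1e-9
```

The speed advantage only materializes when scanning whole images, where the inner sums become shared 1-D correlation passes.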
Note that when computing gradients, the image-matrix notation (together with (14) or (15), and the equality ‖X‖²_F = tr(X X^⊤)) allows a straightforward computation of the kernel derivatives w.r.t. the components of the decomposed RSV image patches, i.e. the row, column and scale information in V_{i,r}, U_{i,r} and S_{i,r}, respectively. However, while the update rules for β_i and S_{i,r} follow immediately from the respective derivatives, care must be taken to keep U_{i,r} and V_{i,r} orthogonal during optimization. This could be achieved through re-orthogonalization of these matrices after each gradient step.\n\nIn our current implementation, however, we perform those updates subject to the so-called Stiefel constraints [11]. Intuitively, this amounts to rotating (rather than translating) the columns of U_{i,r} and V_{i,r}, which ensures that the resulting matrices are still orthogonal, i.e. lie on the Stiefel manifold. Let S(h, r) be the manifold of orthogonal h × r matrices, the (h, r)-Stiefel manifold. Further, let U⊥_{i,r} denote an orthogonal basis for the orthogonal complement of the subspace spanned by the columns of U_{i,r}. Now, given the 'free' gradient G = ∂ε_r/∂U_{i,r}, we compute the 'constrained' gradient\n\n Ĝ = G − U_{i,r} G^⊤ U_{i,r}, (18)\n\nwhich is the projection of G onto the tangent space of S(h, r) at U_{i,r}. The desired rotation is then given [11] by the (matrix) exponential of the h × h skew-symmetric matrix\n\n A = t [ U_{i,r}^⊤ Ĝ   −(U⊥_{i,r}^⊤ Ĝ)^⊤ ; U⊥_{i,r}^⊤ Ĝ   0 ], (19)\n\nwhere t is a user-defined step size parameter. For details, see [11]. A Matlab library is available at [12].\n\n\n5 Experiments\n\nThis section shows the results of two experiments. The first part illustrates the behavior of rank deficient approximations for a face detection SVM in terms of the convergence rate and classification accuracy for different values of r. 
In the second part, we show how an actual face detection system, similar to that presented in [5], can be sped up using rank deficient RSVs. In both experiments we used the same training and validation set. It consisted of 19 × 19 gray level image patches containing 16081 manually collected faces (3194 of them kindly provided by Sami Romdhani) and 42972 non-faces automatically collected from a set of 206 background scenes. Each patch was normalized to zero mean and unit variance. The set was split into a training set (13331 faces and 35827 non-faces) and a validation set (2687 faces and 7145 non-faces). We trained a 1-norm soft margin SVM on the training set using a Gaussian kernel with σ = 10. The regularization constant C was set to 1. The resulting decision function (1) achieved a hit rate of 97.3% at 1.0% false positives on the validation set using m = 6910 SVs. This solution served as the approximation target (see equation (2)) during the experiments described below.\n\n\n5.1 Rank deficient faces\n\nIn order to see how m′ and r affect the accuracy of our approximations, we compute rank deficient reduced sets for m′ = 1 . . . 32 and r = 1 . . . 3 (the left array in Figure 1 illustrates the actual appearance of rank deficient RSVs for the m′ = 6 case). Accuracy of the resulting decision functions is measured in ROC score (the area under the ROC curve) on the validation set. For the full SVM, this amounts to 0.99. The results for our approximations are depicted in Figure 2. As expected, we need a larger number of rank deficient RSVs than unconstrained RSVs to obtain similar classification accuracies, especially for small r. Nevertheless, the experiment points out two advantages of our method. First, a rank as\n\n [Figure 1 graphic: two arrays of RSV image patches, for m′ = 6 (left) and m′ = 1 (right), each shown for r = full, 3, 2, 1.]\n\nFigure 1: Rank deficient faces. 
The left array shows the RSVs (Z_i) of the unconstrained (top row) and constrained (r decreases from 3 to 1 down the remaining rows) approximations for m′ = 6. Interestingly, the r = 3 RSVs are already able to capture face-like structures. This supports the fact that the classification accuracy for r = 3 is similar to that of the unconstrained approximations (cf. Figure 2, left plot). The right array shows the m′ = 1 RSVs (r = full, 3, 2, 1, top to bottom row) and their decomposition into rank one matrices according to (10). For the unconstrained RSV (first row) it shows an approximate (truncated) expansion based on the three leading singular vectors. While for r = 3 the decomposition is indeed similar to the truncated SVD, note how this similarity decreases for r = 2, 1. This illustrates that the approach is clearly different from simply finding unconstrained RSVs and then imposing the rank constraint via SVD (in fact, the approximation error (4) is smaller for the r = 1 RSV than for the leading singular vector of the r = full RSV).\n\n\n\nlow as three already seems sufficient for our face detection SVM, in the sense that for equal sizes m′ there is no significant loss in accuracy compared to the unconstrained approximation (at least for m′ > 2). The associated speed benefit over unconstrained RSVs is shown in the right plot of Figure 2: the rank three approximations achieve accuracies similar to the unconstrained functions, while the number of operations reduces to less than a third. Second, while for unconstrained RSVs there is no solution with a number of operations smaller than h · w = 361 (in the right plot, this is the region beyond the left end of the solid line), there exist rank deficient functions which are not only much faster than this, but yield considerably higher accuracies. 
This property will be exploited in the next experiment.\n\n\n5.2 A cascade-based face detection system\n\nIn this experiment we built a cascade-based face detection system similar to [5, 6], i.e. a cascade of RSV approximations of increasing size m′. As the benefit of a cascaded classifier heavily depends on the speed of the first classifier, which has to be evaluated on the whole image [5, 6], our system uses a rank deficient approximation as the first stage. Based on the previous experiment, we chose the m′ = 3, r = 1 classifier. Note that this function yields an ROC score of 0.9 using 114 multiply-adds, whereas the simplest possible unconstrained approximation, m′ = 1, r = full, needs 361 multiply-adds to achieve an ROC score of only 0.83 (cf. Figure 2). In particular, if the threshold of the first stage is set to yield a hit rate of 95% on the validation set, scanning the MIT+CMU set (130 images, 507 faces) with m′ = 3, r = 1 discards 91.5% of the false positives, whereas the m′ = 1, r = full classifier can only reject 70.2%. At the same time, when scanning a 320 × 240 image³, the three separable convolutions plus nonlinearity require 55 ms, whereas the single, full kernel evaluation takes 208 ms on a Pentium 4 with 2.8 GHz. Moreover, for the unconstrained\n\n ³For multi-scale processing the detectors are evaluated on an image pyramid with 12 different scales using a scale decay of 0.75. This amounts to scanning 140158 patches for a 320 × 240 image.\n\n [Figure 2 graphic: two plots of ROC score for r = 1, 2, 3 and r = full.]\n\nFigure 2: Effect of the rank parameter r on classification accuracies. The left plot shows the ROC score of the rank deficient RSV approximations (cf. Section 4) for varying set sizes (m′ = 1 . . . 32, on a logarithmic scale) and ranks (r = 1 . . . 3). 
Additionally, the solid line shows the accuracy of the RSVs without rank constraint (cf. Section 2), here denoted by r = full. The right plot shows the same four curves, but plotted against the number of operations needed for the evaluation of the corresponding decision function when scanning large images (i.e. m′ · r · (h + w) with h = w = 19), also on a logarithmic scale.\n\n\n\n Figure 3: A sample output from our demonstration system (running at 14 frames per second). In this implementation, we reduced the number of false positives by adjusting the threshold of the final classifier. Although this reduces the number of detections as well, the results are still satisfactory. This is probably due to the fact that the MIT+CMU set contains several images of very low quality that are not likely to occur in our setting, using a good USB camera.\n\n\n\ncascade to catch up in terms of accuracy, the (at least) m′ = 2, r = full classifier (also with an ROC score of roughly 0.9) would have to be applied afterwards, requiring another 0.3 × 2 × 208 ms ≈ 125 ms.\n\nThe subsequent stages of our system consist of unconstrained RSV approximations of size m′ = 4, 8, 16, 32, respectively. These sizes were chosen such that the number of false positives roughly halves after each stage, while the number of correct detections remains close to 95% on the validation set (with the decision thresholds adjusted accordingly). To eliminate redundant detections, we combine overlapping detections via averaging of position and size if they are closer than 0.15 times the estimated patch size. This system yields 93.1% correct detections and 0.034% false positives on the MIT+CMU set. The current system was incorporated into a demo application (Figure 3). For optimal performance, we re-compiled our system using the Intel compiler (ICC). The application now classifies a 320 × 240 image within 54 ms (vs. 238 ms with full rank RSVs only) on a 2.8 GHz PC. 
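The per-patch multiply-add counts quoted in this section follow directly from the complexity expressions of Section 3; a small sketch double-checking them (the function names are ours, not the paper's):

```python
# Back-of-the-envelope check of the per-patch operation counts:
# unconstrained RSVs cost m' * h * w multiply-adds per scanned patch,
# rank deficient ones cost m' * r * (h + w).

h = w = 19  # patch size used throughout the experiments

def ops_full(m_prime):
    # unconstrained RSVs: one full h x w dot product each
    return m_prime * h * w

def ops_rank_deficient(m_prime, r):
    # rank deficient RSVs: r separable passes each
    return m_prime * r * (h + w)

assert ops_rank_deficient(3, 1) == 114   # first cascade stage (m' = 3, r = 1)
assert ops_full(1) == 361                # simplest unconstrained stage
```

This reproduces the comparison above: the three-RSV rank-one first stage needs fewer operations than even a single unconstrained RSV.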
To further reduce the number of false positives, additional bootstrapped (as in [5]) stages would need to be added to the cascade. Note that this will not significantly affect the speed of our system (currently 14 frames per second), since 0.034% false positives amounts to merely 47 patches to be processed by subsequent classifiers.\n\n\n6 Discussion\n\nWe have presented a new reduced set method for SVMs in image processing, which creates sparse kernel expansions that can be evaluated via separable filters. To this end, the user-defined rank (the number of separable filters into which the RSVs are decomposed) provides a mechanism to control the tradeoff between accuracy and speed of the resulting approximation. Our experiments show that for face detection, the use of rank deficient RSVs leads to a significant speedup without losing accuracy. Especially when rough approximations are required, our method gives superior results compared to the existing reduced set methods, since it allows for a finer granularity, which is vital in cascade-based detection systems. Another property of our approach is simplicity. At run-time, rank deficient RSVs can be used together with unconstrained RSVs or SVs using the same canonical image representation. As a result, the required changes to existing code, such as that of [5], are small. In addition, our approach allows the use of off-the-shelf image processing libraries for separable convolutions. Since such operations are essential in image processing, there exist many (often highly optimized) implementations. Finally, the method can well be used to train a neural network, i.e. to go directly from the training data to a sparse, separable function, as opposed to taking the SVM 'detour'. A comparison of that approach to the present one, however, remains to be done.\n\n\nReferences\n\n [1] E. Osuna and F. Girosi. Reducing the run-time complexity in support vector machines. In B. Schölkopf, C. J. C. Burges, and A. J. 
Smola, editors, Advances in Kernel Methods -- Support Vector Learning, pages 271-284, Cambridge, MA, 1999. MIT Press.\n\n [2] C. J. C. Burges. Simplified support vector decision rules. In International Conference on Machine Learning, pages 71-77, 1996.\n\n [3] C. J. C. Burges and B. Schölkopf. Improving the accuracy and speed of support vector machines. In M. C. Mozer, M. I. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems, volume 9, page 375. MIT Press, 1997.\n\n [4] E. Osuna, R. Freund, and F. Girosi. Training support vector machines: an application to face detection. In Proceedings IEEE Conference on Computer Vision and Pattern Recognition, 1997.\n\n [5] S. Romdhani, P. Torr, B. Schölkopf, and A. Blake. Computationally efficient face detection. In Proceedings of the International Conference on Computer Vision, pages 695-700, 2001.\n\n [6] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In Proceedings IEEE Conference on Computer Vision and Pattern Recognition, 2001.\n\n [7] G. Blanchard and D. Geman. Hierarchical testing designs for pattern recognition. Technical Report 2003-07, Université Paris-Sud, 2003.\n\n [8] B. Heisele, T. Poggio, and M. Pontil. Face detection in still gray images. AI Memo 1687, MIT, May 2000. CBCL Memo 187.\n\n [9] S. Ben-Yacoub, B. Fasel, and J. Luettin. Fast face detection using MLP and FFT. In Proceedings International Conference on Audio and Video-based Biometric Person Authentication, 1999.\n\n[10] H. A. Rowley, S. Baluja, and T. Kanade. Neural network-based face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(1):23-38, January 1998.\n\n[11] A. Edelman, T. Arias, and S. Smith. 
The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications, 20:303-353, 1998.\n\n[12] RDRSLIB -- a Matlab library for rank deficient reduced sets in object detection, http://www.kyb.mpg.de/bs/people/kienzle/rdrs/rdrs.htm.\n", "award": [], "sourceid": 2646, "authors": [{"given_name": "Wolf", "family_name": "Kienzle", "institution": null}, {"given_name": "Matthias", "family_name": "Franz", "institution": null}, {"given_name": "Bernhard", "family_name": "Sch\u00f6lkopf", "institution": null}, {"given_name": "G\u00f6khan", "family_name": "Bakir", "institution": null}]}