{"title": "The power of feature clustering: An application to object detection", "book": "Advances in Neural Information Processing Systems", "page_first": 57, "page_last": 64, "abstract": null, "full_text": " The power of feature clustering: An application\n to object detection\n\n\n\n Shai Avidan Moshe Butman\n Mitsibishi Electric Research Labs Adyoron Intelligent Systems LTD.\n 201 Broadway 34 Habarzel St.\n Cambridge, MA 02139 Tel-Aviv, Israel\n avidan@merl.com mosheb@adyoron.com\n\n\n\n\n Abstract\n\n We give a fast rejection scheme that is based on image segments and\n demonstrate it on the canonical example of face detection. However, in-\n stead of focusing on the detection step we focus on the rejection step and\n show that our method is simple and fast to be learned, thus making it\n an excellent pre-processing step to accelerate standard machine learning\n classifiers, such as neural-networks, Bayes classifiers or SVM. We de-\n compose a collection of face images into regions of pixels with similar\n behavior over the image set. The relationships between the mean and\n variance of image segments are used to form a cascade of rejectors that\n can reject over 99.8% of image patches, thus only a small fraction of the\n image patches must be passed to a full-scale classifier. Moreover, the\n training time for our method is much less than an hour, on a standard PC.\n The shape of the features (i.e. image segments) we use is data-driven,\n they are very cheap to compute and they form a very low dimensional\n feature space in which exhaustive search for the best features is tractable.\n\n\n1 Introduction\n\nThis work is motivated by recent advances in object detection algorithms that use a cascade\nof rejectors to quickly detect objects in images. Instead of using a full fledged classifier on\nevery image patch, a sequence of increasingly more complex rejectors is applied. 
Non-\nface image patches will be rejected early on in the cascade, while face image patches will\nsurvive the entire cascade and be marked as a face.\n\nThe work of Viola & Jones [15] demonstrated the advantages of such an approach. Other\nresearchers suggested similar methods [4, 6, 12]. Common to all these methods is the\nrealization that simple and fast classifiers are enough to reject large portions of the im-\nage, leaving more time to use more sophisticated, and time-consuming, classifiers on the\nremaining regions of the image.\n\nAll these \"fast\" methods must address three issues: first, the feature space in which to\nwork; second, a fast method to calculate the features from the raw image data; and third,\nthe feature selection algorithm to use.\n\nEarly attempts assumed the feature space to be the space of pixel values. Elad et al. [4]\n\n\f\nsuggest the maximum rejection criterion, which chooses rejectors that maximize the rejection\nrate of each classifier. Keren et al. [6] use anti-face detectors by assuming a normal distri-\nbution on the background. A different approach was suggested by Romdhani et al. [12],\nwho constructed the full SVM classifier first and then approximated it with a sequence of\nsupport vector rejectors that were calculated using non-linear optimization. All the above-\nmentioned methods need to \"touch\" every pixel in an image patch at least once before they\ncan reject the image patch.\n\nViola & Jones [15], on the other hand, construct a huge feature space that consists of\ncombined box regions that can be quickly computed from the raw pixel data using the\n\"integral image\" and use a sequential feature selection algorithm for feature selection. The\nrejectors are combined using a variant of AdaBoost [2]. 
Li et al. [7] replaced the sequential\nforward search algorithm with a floating search algorithm (which can backtrack as well).\nAn important advantage of the huge feature space advocated by Viola & Jones is that\nimage patches can now be rejected with an extremely small number of operations; there is\nno need to \"touch\" every pixel in the image patch.\n\nMany of these methods focus on developing fast classifiers that are often constructed in a\ngreedy manner. This precludes classifiers that might demonstrate excellent classification\nresults but are slower to compute, such as the methods suggested by Schneiderman et al.\n[8], Rowley et al. [13], Sung and Poggio [10] or Heisele et al. [5].\n\nOur method offers a way to accelerate \"slow\" classification methods by using a pre-\nprocessing rejection step. Our rejection scheme is fast to train and very effective\nin rejecting the vast majority of false patterns. On the canonical face detection example, it\ntook our method much less than an hour to train and it was able to reject over 99.8% of the\nimage patches, meaning that we can effectively accelerate standard classifiers by several\norders of magnitude, without changing the classifier at all.\n\nLike other \"fast\" methods we use a cascade of rejectors, but we use a different type of\nfilters and a different type of feature selection method. We take our features to be the\napproximated mean and variance of image segments, where every image segment consists\nof pixels that have similar behavior across the entire image set. As a result, our features\nare derived from the data and do not have to be hand-crafted for the particular object of\ninterest. In fact, they do not even have to form contiguous regions. 
We use only a small\nnumber of representative pixels to calculate the approximated mean and variance, which\nmakes our features very fast to compute during detection (in our experiments we found that\nour first rejector rejects almost 50% of all image patches using just 8 pixels). Finally, the\nnumber of segments we use is quite small, which makes it possible to exhaustively evaluate\nall possible rejectors based on singles, pairs and triplets of segments in order to find the best\nrejectors at every step of the cascade. This is in contrast to methods that construct a huge\nfeature bank and use a greedy feature selection algorithm to choose \"good\" features from\nit. Taken together, our algorithm is fast to train and fast to test. In our experiments we train\non a database that contains several thousand face images and roughly half a million\nnon-faces in less than an hour on an average PC, and our rejection module runs at several\nframes per second.\n\n\n2 Algorithm\n\nAt the core of our algorithm is the realization that feature representation is a crucial ingredi-\nent in any classification system. For instance, the Viola-Jones box filters are extremely effi-\ncient to compute using the \"integral image\" but they form a large feature space, thus placing\na heavy computational burden on the feature selection algorithm that follows. Moreover,\nthey show empirically that the first features selected by their method correspond to mean-\ningful regions in the face. This suggests that it might be better to focus on features that\n\n\f\ncorrespond to coherent regions in the image. This leads to the idea of image segmentation,\nwhich breaks an ensemble of images into regions of pixels that exhibit similar temporal be-\nhavior. Given the image segmentation we take our features to be the mean and variance of\neach segment, giving us a very small feature space to work in (we chose to segment the\nface image into eight segments). 
Unfortunately, calculating the mean and variance of an\nimage segment requires going over all the pixels in the segment, a time-consuming pro-\ncess. However, since the segments represent similar-behaving pixels, we found that we can\napproximate the mean and variance of the entire segment using quite a\nsmall number of representative pixels. In our experiments, four pixels were enough to ad-\nequately represent segments that contain several tens of pixels. Now that we have a very\nsmall feature space to work with, and a fast way to extract features from raw pixel data,\nwe can exhaustively search over all possible combinations of singles, pairs or triplets of fea-\ntures to find the best rejector at every stage. The remaining patterns are passed to a\nstandard classifier for final validation.\n\n\n2.1 Image Segments\n\nImage segments were presented in the past [1] for the problem of classification of\nobjects such as faces or vehicles. We briefly repeat the presentation to keep the paper\nself-contained. An ensemble of scaled, cropped and aligned images of a given object (say\nfaces) can be approximated by its leading principal components. This is done by stacking\nthe images (in vector form) in a design matrix A and taking the leading eigenvectors of the\ncovariance matrix C = (1/N) A A^T, where N is the number of images. The leading principal\ncomponents are the leading eigenvectors of the covariance matrix C and they form a basis\nthat approximates the space of all the columns of the design matrix A [11, 9]. But instead\nof looking at the columns of A, consider the rows of A. Each row of A gives the intensity\nprofile of a particular pixel, i.e., each row represents the intensity values that a particular\npixel takes in the different images of the ensemble. If two pixels come from the same\nregion of the face they are likely to take similar intensity values and hence have a strong\ntemporal correlation. 
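This notion of temporal correlation can be made concrete in a few lines. The following is our own illustrative sketch (not the authors' code): two rows of the design matrix are normalized and compared by their dot product, which is close to 1 for pixels that behave alike across the ensemble and close to 0 for unrelated pixels.

```python
import numpy as np

def temporal_correlation(ax, ay):
    """Dot product of two normalized intensity profiles (rows of A).

    Close to 1 when the two pixels vary together across the image
    ensemble; close to 0 when their profiles are unrelated.
    """
    ax = ax / np.linalg.norm(ax)
    ay = ay / np.linalg.norm(ay)
    return float(ax @ ay)
```

For example, two pixels whose profiles differ only by a scale factor get a correlation of 1, while profiles with disjoint support get 0.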
We wish to find these correlations and segment the image plane into\nregions of pixels that have similar temporal behavior. This approach broadly falls under\nthe category of Factor Analysis [3], which seeks a low-dimensional representation that\ncaptures the correlations between features.\n\nLet Ax be the x-th row of the design matrix A. Then Ax is the intensity profile of pixel x\n(we address pixels with a single number because the images are represented in scan-line\nvector form). That is, Ax is an N-dimensional vector (where N is the number of images)\nthat holds the intensity values of pixel x in each image of the ensemble. Pixels x and y\nare temporally correlated if the dot product of the (normalized) rows Ax and Ay is close\nto 1, and temporally uncorrelated if the dot product is close to 0.\n\nThus, to find temporally correlated pixels all we need to do is run a clustering algorithm\non the rows of the design matrix A. In particular, we used the k-means algorithm on the\nrows of the matrix A, but any method of Factor Analysis can be used. As a result, the\nimage plane is segmented into several (possibly non-contiguous) segments of temporally\ncorrelated pixels. Past experiments [1] showed good classification results on different\nobjects such as faces and vehicles.\n\n\n2.2 Finding Representative Pixels\n\nOur algorithm works by comparing the mean and variance properties of one or more image\nsegments. Unfortunately this requires touching every pixel in the image segment during\ntest time, thus slowing the classification process considerably. Therefore, during training\nwe find a set of representative pixels that will be used during test time. 
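The clustering step of Section 2.1 can be sketched as follows. This is a minimal illustration under our own assumptions (plain k-means in NumPy with row-normalized profiles and a simple deterministic initialization), not the authors' implementation:

```python
import numpy as np

def segment_pixels(A, n_segments=8, n_iters=50):
    """Cluster pixels into segments by the similarity of their intensity
    profiles (rows of the design matrix A, one image per column).

    Returns an (n_pixels,) array of segment labels in [0, n_segments).
    """
    # Normalize each row so distance reflects temporal correlation,
    # not absolute brightness.
    rows = A / (np.linalg.norm(A, axis=1, keepdims=True) + 1e-12)
    # Deterministic initialization: evenly spaced rows as first centers.
    idx = np.linspace(0, len(rows) - 1, n_segments).astype(int)
    centers = rows[idx].copy()
    for _ in range(n_iters):
        # Assignment step: nearest center per pixel.
        d2 = ((rows[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        # Update step: recompute each center from its members.
        for c in range(n_segments):
            if np.any(labels == c):
                centers[c] = rows[labels == c].mean(axis=0)
    return labels
```

Because the labels are assigned per pixel index, nothing forces a segment to be spatially contiguous, matching the non-contiguous segments described above.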
Specifically, we\napproximate every segment in a face image with a small number of representative pixels\nthat approximate the mean and variance of the entire image segment.\n\n\f\n (a) (b)\nFigure 1: Face segmentation and representative pixels. (a) The face segmentation was\ncomputed using 1400 faces; each segment is marked with a different color and the segments\nneed not be contiguous. The crosses overlaid on the segments mark the representative\npixels that were automatically selected by our method. (b) Histogram of the difference\nbetween the approximated mean and the exact mean of a particular segment (the light blue\nsegment on the left). The histogram is peaked at zero, meaning that the representative\npixels give a good approximation.\n\n\nDefine \u03bci(x) to be the true mean of segment i of pattern x, and let \u03bc\u0302i(x) be its\napproximation, defined as\n\n \u03bc\u0302i(x) = (1/k) \u03a3_{l=1}^{k} x(p_l)\n\nwhere {p_l}_{l=1}^{k} is a subset of the pixels of segment i and x(p) denotes the value of\npixel p in pattern x. We use a greedy algorithm that incrementally searches for the next\nrepresentative pixel that minimizes\n\n \u03a3_{j=1}^{n} (\u03bc\u0302i(x_j) - \u03bci(x_j))^2\n\nover the n training patterns x_1, ..., x_n, and adds it to the collection of representative\npixels of segment i. In practice we use four representative pixels per segment. The\nrepresentative pixels computed this way are used for computing both the approximated\nmean and the approximated variance of every test pattern. Figure 1 shows how well this\napproximation works in practice.\n\nGiven the representative pixels, the approximated variance \u03c3\u0302i(x) of segment i of\npattern x is given by:\n\n \u03c3\u0302i(x) = \u03a3_{l=1}^{k} |x(p_l) - \u03bc\u0302i(x)|\n\n\n2.3 The rejection cascade\n\nWe construct a rejection cascade that can quickly reject image patches with minimal com-\nputational load. 
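Before detailing the rejectors, the representative-pixel selection of Section 2.2 can be sketched as follows. The function names and the squared-error loop are our own hypothetical rendering of the greedy procedure described above, with `patterns` an (n_patterns, n_pixels) array and `segment` the pixel indices of one segment:

```python
import numpy as np

def pick_representatives(patterns, segment, k=4):
    """Greedily pick k pixels whose average best approximates the
    segment's true per-pattern mean, in the squared-error sense."""
    true_mean = patterns[:, segment].mean(axis=1)  # exact mean per pattern
    chosen = []
    for _ in range(k):
        best_pix, best_err = None, np.inf
        for pix in segment:
            if pix in chosen:
                continue
            # Error of the approximated mean if we add this pixel.
            approx = patterns[:, chosen + [pix]].mean(axis=1)
            err = ((approx - true_mean) ** 2).sum()
            if err < best_err:
                best_pix, best_err = pix, err
        chosen.append(best_pix)
    return chosen

def approx_mean_var(patch, reps):
    """Approximated segment mean, and variance as a sum of absolute
    deviations from that mean (as in the paper's formulation)."""
    vals = patch[reps]
    mu = vals.mean()
    return mu, np.abs(vals - mu).sum()
```

At test time only the k representative pixels of a patch are read, which is what makes the rejectors cheap.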
Our feature space consists of the approximated mean and variance of the\nimage segments. In our experiments we have 8 segments, each represented by its mean and\nvariance, giving rise to a 16D feature space. This feature space is very fast to compute, as\nwe need only four pixels to calculate the approximated mean and variance of each segment.\nBecause the feature space is so small we can exhaustively search over all classifiers on\nsingles, pairs and triplets of segments. In addition, this feature space gives enough\ninformation to reject texture-less regions without the need to normalize the mean or\nvariance of the entire image patch. We next describe our rejectors in detail.\n\n\f\n2.3.1 Feature rejectors\n\nNow that we have segmented every image into several segments and approximated every\nsegment with a small number of representative pixels, we can exhaustively search for the\nbest combination of segments that rejects the largest number of non-face images. We\nrepeat this process until the improvement in rejection is negligible.\n\nGiven a training set of P positive examples (i.e. faces) and N negative examples, we con-\nstruct the following linear rejectors and adjust the parameter \u03b8 so that each correctly\nclassifies d\u00b7P (we use d = 0.95) of the face images; we save r, the number of negative\nexamples it correctly rejected, as well as the parameter \u03b8.\n\n 1. For each segment i, find a bound on its approximated mean. Formally, find \u03b8 s.t.\n\n \u03bc\u0302i(x) > \u03b8 or \u03bc\u0302i(x) < \u03b8\n\n 2. For each segment i, find a bound on its approximated variance. Formally, find \u03b8\n s.t.\n\n \u03c3\u0302i(x) > \u03b8 or \u03c3\u0302i(x) < \u03b8\n\n 3. For each pair of segments i, j, find a bound on the difference between their ap-\n proximated means. Formally, find \u03b8 s.t.\n\n \u03bc\u0302i(x) - \u03bc\u0302j(x) > \u03b8 or \u03bc\u0302i(x) - \u03bc\u0302j(x) < \u03b8\n\n 4. For each pair of segments i, j, find a bound on the difference between their ap-\n proximated variances. Formally, find \u03b8 s.t.\n\n \u03c3\u0302i(x) - \u03c3\u0302j(x) > \u03b8 or \u03c3\u0302i(x) - \u03c3\u0302j(x) < \u03b8\n\n 5. 
For each triplet of segments i, j, k, find a bound on the difference of the absolute\n differences of their approximated means. Formally, find \u03b8 s.t.\n\n |\u03bc\u0302i(x) - \u03bc\u0302j(x)| - |\u03bc\u0302i(x) - \u03bc\u0302k(x)| > \u03b8\n\nThis process is done only once to form a pool of rejectors. We do not re-train rejectors after\nselecting a particular rejector.\n\n\n2.3.2 Training\n\nWe form the cascade of rejectors from a large pattern-vs.-rejector binary table T, where\neach entry T(i, j) is 1 if rejector j rejects pattern i. Because the table is binary we can\nstore every entry in a single bit, and therefore a table of 513,000 patterns and 664 rejectors\ncan easily fit in memory. We then use a greedy algorithm to pick the next rejector with\nthe highest rejection score r. We repeat this process until r falls below some predefined\nthreshold.\n\n 1. Sum each column and choose the column (rejector) j with the highest sum.\n\n 2. For each entry T(i, j), in column j, that is equal to one, zero row i.\n\n 3. Go to step 1.\n\nThe entire process is extremely fast and takes only several minutes, including I/O. The idea\nof creating a rejector pool in advance was independently suggested by [16] to accelerate\nthe Viola-Jones training time. We obtain 50 rejectors using this method. Figure 2a shows\nthe rejection rate of this cascade on a training set of 513,000 images, as well as the number\nof arithmetic operations it takes. Note that roughly 50% of all patterns are rejected by the\nfirst rejector using only 12 operations. 
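The three-step training loop above amounts to repeated column selection on the binary table. The following sketch is our own illustration, with a dense boolean array standing in for the bit-packed table:

```python
import numpy as np

def build_cascade(T, min_score=1):
    """Greedy rejector selection from a patterns-by-rejectors table.

    T[i, j] is True if rejector j rejects pattern i. Repeatedly pick the
    rejector that rejects the most still-alive patterns, then zero out
    the rows (patterns) it rejects, until the gain drops below min_score.
    """
    T = T.copy().astype(bool)
    order = []
    while True:
        scores = T.sum(axis=0)      # step 1: column sums
        j = int(np.argmax(scores))
        if scores[j] < min_score:
            break
        order.append(j)
        T[T[:, j], :] = False       # step 2: zero the rejected rows
    return order                    # step 3 is the loop itself
```

Because a selected rejector's patterns are removed from the table, each subsequent pick is scored only on the patterns that survive the cascade so far.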
During testing we compute the approximated mean\nand variance only when they are needed, and not beforehand.\n\n\f\n (a) (b)\nFigure 2: (a) Rejection rate on the training set. The x-axis counts the number of arithmetic\noperations needed for rejection. The y-axis is the rejection rate on a training set of about\nhalf a million non-faces and about 1500 faces. Note that almost 50% of the false patterns\nare rejected with just 12 operations. The overall rejection rate of the feature rejectors on\nthe training set is 88%; it drops to about 80% on the CMU+MIT database. (b) Rejection\nrate as a function of image segmentation method. We trained our system using four types\nof image segmentation and show the resulting rejection rates. We compare our image\nsegmentation approach against naive segmentation of the image plane into horizontal\nblocks, vertical blocks or random segmentation. In each case we trained a cascade of 21\nrejectors and calculated their cumulative rejection rate on our training set. Clearly,\nworking with our image segments gives the best results.\n\n\n\nWe wanted to confirm our intuition that indeed only meaningful regions in the image can\nproduce such results and we therefore performed the following experiment. We segmented\nthe pixels in the image using four different methods: 
(1) our image segments; (2)\n8 horizontal blocks; (3) 8 vertical blocks; (4) 8 randomly generated segments.\nFigure 2b shows that image segments give the best results, by far.\n\nThe remaining false-positive patterns are passed on to the next rejectors, described next.\n\n\n2.4 Texture-less region rejection\n\nWe found that the feature rejectors defined in the previous section do poorly at\nrejecting texture-less regions. This is because we do not perform any sort of variance\nnormalization on the image patch, a step that would slow us down. However, by now we\nhave computed the approximated mean and variance of all the image segments and we\ncan construct rejectors based on all of them to reject texture-less regions. In particular we\nconstruct the following two rejectors:\n\n\n 1. Reject all image patches where the variance of the 8 approximated means falls\n below a threshold. Formally, find \u03b8 s.t.\n\n \u03c3(\u03bc\u03021(x), ..., \u03bc\u03028(x)) < \u03b8\n\n 2. Reject all image patches where the variance of the 8 approximated variances falls\n below a threshold. Formally, find \u03b8 s.t.\n\n \u03c3(\u03c3\u03021(x), ..., \u03c3\u03028(x)) < \u03b8\n\n\n2.5 Linear classifier\n\nFinally, we construct a cascade of 10 linear rejectors, using all 16 features (i.e. the approx-\nimated means and variances of all 8 segments).\n\n\f\n (a) (b)\nFigure 3: Examples. We show examples from the CMU+MIT dataset. Our method cor-\nrectly rejected over 99.8% of the image patches in each image, leaving only a handful of\nimage patches to be tested by a \"slow\", full-scale classifier.\n\n\n\n2.6 Multi-detection heuristic\n\nAs noted by previous authors [15], face classifiers are insensitive to small changes in posi-\ntion and scale, and we therefore adopt the heuristic that at least four overlapping detections\nare required to declare a face. 
This helps reduce the number of detected rectangles around a face, as\nwell as reject some spurious false detections.\n\n\n3 Experiments\n\nWe tested our rejection scheme on the standard CMU+MIT database [13]. We created\nan image pyramid with scale steps of 1.1 and scanned every scale with 20 x 20 rectangles\nin jumps of two pixels. We calculate the approximated mean and variance only when they\nare needed, to save time.\n\nOverall, our rejection scheme rejected over 99.8% of the image patches, while correctly de-\ntecting 93% of the faces. On average the feature rejectors rejected roughly 80% of all image\npatches, the texture-less region rejectors rejected an additional 10% of the image patches,\nthe linear rejectors rejected an additional 5% and the multi-detection heuristic rejected the\nremaining image patterns. The average rejection rate per image is over 99.8%. This is not\nenough for face detection on its own, as there are roughly 615,000 image patches per image\nin the CMU+MIT database, and our rejector cascade passes, on average, 870 false-positive\nimage patches per image. These patterns have to be passed to a full-scale classifier to be\nproperly rejected. Figure 3 gives some examples of our system. Note that the system\ncorrectly detects all the faces, while allowing a small number of false positives.\n\nWe also experimented with rescaling the features, instead of rescaling the image, but\nnoted that the number of false positives increased by about 5% for every fixed detection\nrate we tried (all the results reported here use image pyramids).\n\n\n4 Summary and Conclusions\n\nWe presented a fast rejection scheme that is based on image segments and demonstrated it\non the canonical example of face detection. Image segments are made of regions of pixels\nwith similar behavior over the image set. The shape of the features (i.e. 
image segments)\nwe use is data-driven and they are very cheap to compute. The relationships between the\nmean and variance of image segments are used to form a cascade of rejectors that can reject\nover 99.8% of the image patches, so only a small fraction of the image patches must be\n\n\f\npassed to a full-scale classifier. The training time for our method is much less than an hour\non a standard PC. We believe that our method can be used to accelerate standard machine\nlearning algorithms that are too slow for object detection, by serving as a gatekeeper that\nrejects most of the false patterns.\n\n\nReferences\n\n[1] Shai Avidan. EigenSegments: A spatio-temporal decomposition of an ensemble of\n images. In European Conference on Computer Vision (ECCV), May 2002, Copenhagen,\n Denmark.\n\n[2] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line\n learning and an application to boosting. In Computational Learning Theory: Eurocolt\n 95, pages 23-37. Springer-Verlag, 1995.\n\n[3] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley-\n Interscience, 1973.\n\n[4] M. Elad, Y. Hel-Or and R. Keshet. Rejection based classifier for face detection. Pattern\n Recognition Letters, 23:1459-1471, 2002.\n\n[5] B. Heisele, T. Serre, S. Mukherjee, and T. Poggio. Feature reduction and hierarchy of\n classifiers for fast object detection in video images. In Proc. CVPR, volume 2, pages\n 18-24, 2001.\n\n[6] D. Keren, M. Osadchy, and C. Gotsman. Antifaces: A novel, fast method for image\n detection. IEEE Trans. on Pattern Analysis and Machine Intelligence, 23(7):747-761,\n 2001.\n\n[7] S.Z. Li, L. Zhu, Z.Q. Zhang, A. Blake, H.J. Zhang and H. Shum. Statistical Learn-\n ing of Multi-View Face Detection. In Proceedings of the 7th European Conference on\n Computer Vision, Copenhagen, Denmark, May 2002.\n\n[8] Henry Schneiderman and Takeo Kanade. A statistical model for 3D object detection\n applied to faces and cars. 
In IEEE Conference on Computer Vision and Pattern Recog-\n nition. IEEE, June 2000.\n\n[9] L. Sirovich and M. Kirby. Low-dimensional procedure for the characterization of hu-\n man faces. Journal of the Optical Society of America A, 4:519-524, 1987.\n\n[10] K.-K. Sung and T. Poggio. Example-based Learning for View-Based Human Face De-\n tection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(1):39-\n 51, 1998.\n\n[11] M. Turk and A. Pentland. Eigenfaces for recognition. Journal of Cognitive Neuro-\n science, 3(1):71-86, 1991.\n\n[12] S. Romdhani, P. Torr, B. Schoelkopf, and A. Blake. Computationally efficient face\n detection. In Proc. Intl. Conf. Computer Vision, pages 695-700, 2001.\n\n[13] H. A. Rowley, S. Baluja, and T. Kanade. Neural network-based face detection. IEEE\n Trans. on Pattern Analysis and Machine Intelligence, 20(1):23-38, 1998.\n\n[14] V. Vapnik. The Nature of Statistical Learning Theory. Springer, N.Y., 1995.\n\n[15] P. Viola and M. Jones. Rapid Object Detection using a Boosted Cascade of Simple\n Features. In IEEE Conference on Computer Vision and Pattern Recognition, Hawaii,\n 2001.\n\n[16] J. Wu, J. M. Rehg, and M. D. Mullin. Learning a Rare Event Detection Cascade\n by Direct Feature Selection. In Advances in Neural Information Processing Systems 16\n (NIPS*2003), MIT Press, 2004.\n\n\f\n", "award": [], "sourceid": 2622, "authors": [{"given_name": "Shai", "family_name": "Avidan", "institution": null}, {"given_name": "Moshe", "family_name": "Butman", "institution": null}]}