{"title": "Local Decorrelation For Improved Pedestrian Detection", "book": "Advances in Neural Information Processing Systems", "page_first": 424, "page_last": 432, "abstract": "Even with the advent of more sophisticated, data-hungry methods, boosted decision trees remain extraordinarily successful for fast rigid object detection, achieving top accuracy on numerous datasets. While effective, most boosted detectors use decision trees with orthogonal (single feature) splits, and the topology of the resulting decision boundary may not be well matched to the natural topology of the data. Given highly correlated data, decision trees with oblique (multiple feature) splits can be effective. Use of oblique splits, however, comes at considerable computational expense. Inspired by recent work on discriminative decorrelation of HOG features, we instead propose an efficient feature transform that removes correlations in local neighborhoods. The result is an overcomplete but locally decorrelated representation ideally suited for use with orthogonal decision trees. In fact, orthogonal trees with our locally decorrelated features outperform oblique trees trained over the original features at a fraction of the computational cost. The overall improvement in accuracy is dramatic: on the Caltech Pedestrian Dataset, we reduce false positives nearly tenfold over the previous state-of-the-art.", "full_text": "Local Decorrelation for Improved Pedestrian Detection\n\nWoonhyun Nam\u2217\nStradVision, Inc.\n\nwoonhyun.nam@stradvision.com\n\nPiotr Doll\u00b4ar\n\nMicrosoft Research\npdollar@microsoft.com\n\nJoon Hee Han\n\nPOSTECH, Republic of Korea\n\njoonhan@postech.ac.kr\n\nAbstract\n\nEven with the advent of more sophisticated, data-hungry methods, boosted deci-\nsion trees remain extraordinarily successful for fast rigid object detection, achiev-\ning top accuracy on numerous datasets. 
While effective, most boosted detectors\nuse decision trees with orthogonal (single feature) splits, and the topology of the\nresulting decision boundary may not be well matched to the natural topology of\nthe data. Given highly correlated data, decision trees with oblique (multiple fea-\nture) splits can be effective. Use of oblique splits, however, comes at considerable\ncomputational expense. Inspired by recent work on discriminative decorrelation\nof HOG features, we instead propose an ef\ufb01cient feature transform that removes\ncorrelations in local neighborhoods. The result is an overcomplete but locally\ndecorrelated representation ideally suited for use with orthogonal decision trees.\nIn fact, orthogonal trees with our locally decorrelated features outperform oblique\ntrees trained over the original features at a fraction of the computational cost. The\noverall improvement in accuracy is dramatic: on the Caltech Pedestrian Dataset,\nwe reduce false positives nearly tenfold over the previous state-of-the-art.\n\n1\n\nIntroduction\n\nIn recent years object detectors have undergone an impressive transformation [11, 32, 14]. Never-\ntheless, boosted detectors remain extraordinarily successful for fast detection of quasi-rigid objects.\nSuch detectors were \ufb01rst proposed by Viola and Jones in their landmark work on ef\ufb01cient sliding\nwindow detection that made face detection practical and commercially viable [35]. This initial ar-\nchitecture remains largely intact today: boosting [31, 12] is used to train and combine decision trees\nand a cascade is employed to allow for fast rejection of negative samples. Details, however, have\nevolved considerably; in particular, signi\ufb01cant progress has been made on the feature representa-\ntion [6, 9, 2] and cascade architecture [3, 8]. 
Recent boosted detectors [1, 7] achieve state-of-the-art\naccuracy on modern benchmarks [10, 22] while retaining computational ef\ufb01ciency.\nWhile boosted detectors have evolved considerably over the past decade, decision trees with orthog-\nonal (single feature) splits \u2013 also known as axis-aligned decision trees \u2013 remain popular and pre-\ndominant. A possible explanation for the persistence of orthogonal splits is their ef\ufb01ciency: oblique\n(multiple feature) splits incur considerable computational cost during both training and detection.\nNevertheless, oblique trees can hold considerable advantages. In particular, Menze et al. [23] re-\ncently demonstrated that oblique trees used in conjunction with random forests are quite effective\ngiven high dimensional data with heavily correlated features.\nTo achieve similar advantages while avoiding the computational expense of oblique trees, we instead\ntake inspiration from recent work by Hariharan et al. [15] and propose to decorrelate features prior to\napplying orthogonal trees. To do so we introduce an ef\ufb01cient feature transform that removes corre-\nlations in local image neighborhoods (as opposed to decorrelating features globally as in [15]). The\nresult is an overcomplete but locally decorrelated representation that is ideally suited for use with\northogonal trees. In fact, orthogonal trees with our locally decorrelated features require estimation\nof fewer parameters and actually outperform oblique trees trained over the original features.\n\n\u2217This research was performed while W.N. was a postdoctoral researcher at POSTECH.\n\n1\n\n\fFigure 1: A comparison of boosting of orthogonal and oblique trees on highly correlated data while\nvarying the number (T ) and depth (D) of the trees. 
Observe that orthogonal trees generalize poorly\nas the topology of the decision boundary is not well aligned to the natural topology of the data.\n\nWe evaluate boosted decision tree learning with decorrelated features in the context of pedestrian\ndetection. As our baseline we utilize the aggregated channel features (ACF) detector [7], a popular,\ntop-performing detector for which source code is available online. Coupled with use of deeper trees\nand a denser sampling of the data, the improvement obtained using our locally decorrelated channel\nfeatures (LDCF) is substantial. While in the past year the use of deep learning [25], motion features\n[27], and multi-resolution models [36] has brought down log-average miss rate (MR) to under 40%\non the Caltech Pedestrian Dataset [10], LDCF reduces MR to under 25%. This translates to a nearly\ntenfold reduction in false positives over the (very recent) state-of-the-art.\nThe paper is organized as follows. In \u00a72 we review orthogonal and oblique trees and demonstrate\nthat orthogonal trees trained on decorrelated data may be equally or more effective as oblique trees\ntrained on the original data. We introduce the baseline in \u00a73 and in \u00a74 show that use of oblique\ntrees improves results but at considerable computational expense. Next, in \u00a75, we demonstrate that\northogonal trees trained with locally decorrelated features are ef\ufb01cient and effective. Experiments\nand results are presented in \u00a76. We begin by brie\ufb02y reviewing related work next.\n\n1.1 Related Work\n\nPedestrian Detection: Recent work in pedestrian detection includes use of deformable part models\nand their extensions [11, 36, 26], convolutional nets and deep learning [33, 37, 25], and approaches\nthat focus on optimization and learning [20, 18, 34]. Boosted detectors are also widely used. 
In\nparticular, the channel features detectors [9, 1, 2, 7] are a family of conceptually straightforward and\nef\ufb01cient detectors based on boosted decision trees computed over multiple feature channels such as\ncolor, gradient magnitude, gradient orientation and others. Current top results on the INRIA [6] and\nCaltech [10] Pedestrian Datasets include instances of the channel features detector with additional\nmid-level edge features [19] and motion features [27], respectively.\nOblique Decision Trees: Typically, decision trees are trained with orthogonal (single feature) splits;\nhowever, the extension to oblique (multiple feature) splits is fairly intuitive and well known, see\ne.g. [24]. In fact, Breiman\u2019s foundational work on random forests [5] experimented with oblique\ntrees. Recently there has been renewed interest in random forests with oblique splits [23, 30] and\nMarin et al. [20] even applied such a technique to pedestrian detection. Likewise, while typically\northogonal trees are used with boosting [12], oblique trees can easily be used instead. The contri-\nbution of this work is not the straightforward coupling of oblique trees with boosting, rather, we\npropose a local decorrelation transform that eliminates the necessity of oblique splits altogether.\nDecorrelation: Decorrelation is a common pre-processing step for classi\ufb01cation [17, 15]. In recent\nwork, Hariharan et al. [15] proposed an ef\ufb01cient scheme for estimating covariances between HOG\nfeatures [6] with the goal of replacing linear SVMs with LDA and thus allowing for fast training.\nHariharan et al. demonstrated that the global covariance matrix for a detection window can be esti-\nmated ef\ufb01ciently as the covariance between two features should depend only on their relative offset.\nInspired by [15], we likewise exploit the stationarity of natural image statistics, but instead propose\nto estimate a local covariance matrix shared across all image patches. 
Next, rather than applying\nglobal decorrelation, which would be computationally prohibitive for sliding window detection with\na nonlinear classi\ufb01er1, we instead propose to apply an ef\ufb01cient local decorrelation transform. The\nresult is an overcomplete representation well suited for use with orthogonal trees.\n\n1Global decorrelation coupled with a linear classi\ufb01er is ef\ufb01cient as the two linear operations can be merged.\n\n2\n\n\fFigure 2: A comparison of boosting with orthogonal decision trees (T = 5) on transformed data.\nOrthogonal trees with both decorrelated and PCA-whitened features show improved generalization\nwhile ZCA-whitening is ineffective. Decorrelating the features is critical, while scaling is not.\n\n2 Boosted Decision Trees with Correlated Data\n\nBoosting is a simple yet powerful tool for classi\ufb01cation and can model complex non-linear functions\n[31, 12]. The general idea is to train and combine a number of weak learners into a more powerful\nstrong classi\ufb01er. Decision trees are frequently used as the weak learner in conjunction with boosting,\nand in particular orthogonal decision trees, that is trees in which every split is a threshold on a single\nfeature, are especially popular due to their speed and simplicity [35, 7, 1].\nThe representational power obtained by boosting orthogonal trees is not limited by use of orthogonal\nsplits; however, the number and depth of the trees necessary to \ufb01t the data may be large. This can\nlead to complex decision boundaries and poor generalization, especially given highly correlated\nfeatures. Figure 1(a)-(c) shows the result of boosted orthogonal trees on correlated data. Observe\nthat the orthogonal trees generalize poorly even as we vary the number and depth of the trees.\nDecision trees with oblique splits can more effectively model data with correlated features as the\ntopology of the resulting classi\ufb01er can better match the natural topology of the data [23]. 
In oblique trees, every split is based on a linear projection of the data z = w\u22a4x followed by thresholding. The projection w can be sparse (and orthogonal splits are a special case with \u2016w\u20160 = 1). While in principle numerous approaches can be used to obtain w, in practice linear discriminant analysis (LDA) is a natural choice for obtaining discriminative splits ef\ufb01ciently [16]. LDA aims to minimize within-class scatter while maximizing between-class scatter. w is computed from class-conditional mean vectors \u00b5+ and \u00b5\u2212 and a class-independent covariance matrix \u03a3 as follows:\n\nw = \u03a3\u22121(\u00b5+ \u2212 \u00b5\u2212).    (1)\n\nThe covariance may be degenerate if the amount or underlying dimension of the data is low; in this case LDA can be regularized by using (1 \u2212 \u03b5)\u03a3 + \u03b5I in place of \u03a3. In Figure 1(d) we apply boosted oblique trees trained with LDA on the same data as before. Observe that the resulting decision boundary better matches the underlying data distribution and shows improved generalization.\nThe connection between whitening and LDA is well known [15]. Speci\ufb01cally, LDA simpli\ufb01es to a trivial classi\ufb01cation rule on whitened data (data whose covariance is the identity). Let \u03a3 = Q\u039bQ\u22a4 be the eigendecomposition of \u03a3, where Q is an orthogonal matrix and \u039b is a diagonal matrix of eigenvalues. W = Q\u039b\u22121/2Q\u22a4 = \u03a3\u22121/2 is known as a whitening matrix because the covariance of x\u2032 = Wx is the identity matrix. Given whitened data and means, LDA can be interpreted as learning the trivial projection w\u2032 = \u00b5\u2032+ \u2212 \u00b5\u2032\u2212 = W\u00b5+ \u2212 W\u00b5\u2212, since w\u2032\u22a4x\u2032 = w\u2032\u22a4Wx = w\u22a4x.\nCan whitening or a related transform likewise simplify learning of boosted decision trees? Using standard terminology [17], we de\ufb01ne the following related transforms: decorrelation (Q\u22a4), PCA-whitening (\u039b\u22121/2Q\u22a4), and ZCA-whitening (Q\u039b\u22121/2Q\u22a4). Figure 2 shows the result of boosting orthogonal trees on the variously transformed features, using the same data as before. Observe that with decorrelated and PCA-whitened features orthogonal trees show improved generalization. In fact, as each split is invariant to scaling of individual features, orthogonal trees with PCA-whitened and decorrelated features give identical results. Decorrelating the features is critical, while scaling is not. The intuition is clear: each split operates on a single feature, which is most effective if the features are decorrelated. Interestingly, the ZCA-whitening transform used by LDA is ineffective: while the resulting features are not technically correlated, due to the additional rotation by Q each resulting feature is a linear combination of features obtained by PCA-whitening.\n\n3\n\n\f3 Baseline Detector (ACF)\n\nWe next brie\ufb02y review our baseline detector and evaluation benchmark. This will allow us to apply the ideas from \u00a72 to object detection in subsequent sections. In this work we utilize the channel features detectors [9, 7, 1, 2], a family of conceptually straightforward and ef\ufb01cient detectors for which variants have been utilized for diverse tasks such as pedestrian detection [10], sign recognition [22] and edge detection [19]. Speci\ufb01cally, for our experiments we focus on pedestrian detection and employ the aggregate channel features (ACF) variant [7] for which code is available online2.\nGiven an input image, ACF computes several feature channels, where each channel is a per-pixel feature map such that output pixels are computed from corresponding patches of input pixels (thus preserving image layout). 
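The whitening-related transforms defined in \u00a72 (decorrelation, PCA-whitening, ZCA-whitening) and the LDA identity w\u2032\u22a4x\u2032 = w\u22a4x are easy to verify numerically. The following is a minimal numpy sketch on toy 2-D data; the data, class means, and variable names are our own illustration, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy correlated 2-D data (illustration only; any correlated data works).
Sigma_true = np.array([[2.0, 1.6], [1.6, 2.0]])
X = rng.multivariate_normal([0.0, 0.0], Sigma_true, size=1000)

# Eigendecomposition of the sample covariance: Sigma = Q Lambda Q^T.
Sigma = np.cov(X, rowvar=False)
lam, Q = np.linalg.eigh(Sigma)

# The three transforms from Sec. 2 (each matrix maps x -> x' = W x).
decorrelate = Q.T                             # Q^T
pca_whiten = np.diag(lam ** -0.5) @ Q.T       # Lambda^{-1/2} Q^T
zca_whiten = Q @ np.diag(lam ** -0.5) @ Q.T   # Q Lambda^{-1/2} Q^T = Sigma^{-1/2}

# PCA- and ZCA-whitened data have identity covariance; decorrelated
# data has diagonal (but not identity) covariance -- scaling differs only.
assert np.allclose(np.cov(X @ pca_whiten.T, rowvar=False), np.eye(2), atol=1e-8)
assert np.allclose(np.cov(X @ zca_whiten.T, rowvar=False), np.eye(2), atol=1e-8)
assert np.allclose(np.cov(X @ decorrelate.T, rowvar=False), np.diag(lam), atol=1e-8)

# On whitened data, LDA reduces to the trivial projection
# w' = mu'_+ - mu'_-, and w'^T x' = w^T x with w = Sigma^{-1}(mu+ - mu-).
mu_p, mu_n = np.array([1.0, 0.5]), np.array([-1.0, -0.5])  # toy class means
w = np.linalg.solve(Sigma, mu_p - mu_n)                    # Eq. (1)
w_prime = zca_whiten @ (mu_p - mu_n)
assert np.isclose(w_prime @ (zca_whiten @ X[0]), w @ X[0])
```

Note that PCA- and ZCA-whitening both yield identity covariance, but only decorrelation and PCA-whitening keep each output feature aligned with a single eigenvector, which is what benefits orthogonal splits.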
We use the same channels as [7]: normalized gradient magnitude (1 chan-\nnel), histogram of oriented gradients (6 channels), and LUV color channels (3 channels), for a total\nof 10 channels. We downsample the channels by 2x and features are single pixel lookups in the\naggregated channels. Thus, given a h \u00d7 w detection window, there are h/2 \u00b7 w/2 \u00b7 10 candidate\nfeatures (channel pixel lookups). We use RealBoost [12] with multiple rounds of bootstrapping to\ntrain and combine 2048 depth-3 decision trees over these features to distinguish object from back-\nground. Soft-cascades [3] and an ef\ufb01cient multiscale sliding-window approach are employed. Our\nbaseline uses slightly altered parameters from [7] (RealBoost, deeper trees, and less downsampling);\nthis increases model capacity and bene\ufb01ts our \ufb01nal approach as we report in detail in \u00a76.\nCurrent practice is to use the INRIA Pedestrian Dataset [6] for parameter tuning, with the test set\nserving as a validation set, see e.g. [20, 2, 9]. We utilize this dataset in much the same way and report\nfull results on the more challenging Caltech Pedestrian Dataset [10]. Following the methodology\nof [10], we summarize performance using the log-average miss rate (MR) between 10\u22122 and 100\nfalse positives per image. We repeat all experiments 10 times and report the mean MR and standard\nerror for every result. Due to the use of a log-log scale, even small improvements in (log-average)\nMR correspond to large reductions in false-positives. On INRIA, our (slightly modi\ufb01ed) baseline\nversion of ACF scores at 17.3% MR compared to 17.0% MR for the model reported in [7].\n\n4 Detection with Oblique Splits (ACF-LDA)\n\n(cid:124)\n\nIn this section we modify the ACF detector to enable oblique splits and report the resulting gains.\nRecall that given input x, at each split of an oblique decision tree we need to compute z = w\nx for\nsome projection w and threshold the result. 
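To make the orthogonal/oblique distinction concrete, here is a small illustrative sketch (our own toy example, not the detector's implementation): an orthogonal split is simply an oblique split whose projection w is one-hot.

```python
import numpy as np

def orthogonal_split(x, feature_index, threshold):
    """Axis-aligned split: threshold a single feature (||w||_0 = 1)."""
    return bool(x[feature_index] < threshold)

def oblique_split(x, w, threshold):
    """Oblique split: threshold a linear projection z = w^T x."""
    return bool(w @ x < threshold)

x = np.array([0.2, 0.8, 0.5])
w_onehot = np.array([0.0, 1.0, 0.0])  # selects feature 1 only

# The two formulations agree when w is one-hot.
assert orthogonal_split(x, 1, 0.5) == oblique_split(x, w_onehot, 0.5)
```

In the detector, w would come from the local LDA procedure described next and x from channel-feature lookups; evaluating a dense w over a full window at every split of thousands of trees is what makes oblique splits expensive.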
For our baseline pedestrian detector, we use 128 \u00d7 64\nwindows where each window is represented by a feature vector x of size 128/2 \u00b7 64/2 \u00b7 10 = 20480\n(see \u00a73). Given the high dimensionality of the input x coupled with the use of thousands of trees in\na typical boosted classi\ufb01er, for ef\ufb01ciency w must be sparse.\nLocal w: We opt to use w\u2019s that correspond to local m\u00d7m blocks of pixels. In other words, we treat\nx as a h/2 \u00d7 w/2 \u00d7 10 tensor and allow w to operate over any m \u00d7 m \u00d7 1 patch in a single channel\nof x. Doing so holds multiple advantages. Most importantly, each pixel has strongest correlations\nto spatially nearby pixels [15]. Since oblique splits are expected to help most when features are\nstrongly correlated, operating over local neighborhoods is a natural choice. In addition, using local\nw allows for faster lookups due to the locality of adjacent pixels in memory.\nComplexity: First, let us consider the complexity of training the oblique splits. Let d = h/2\u00b7w/2 be\nthe window size of a single channel. The number of patches per channel in x is about d, thus naively\ntraining a single split means applying LDA d times \u2013 once per patch \u2013 and keeping w with lowest\nerror. Instead of computing d independent matrices \u03a3 per channel, for ef\ufb01ciency, we compute \u03a3, a\nd \u00d7 d covariance matrix for the entire window, and reconstruct individual m2 \u00d7 m2 \u03a3\u2019s by fetching\nappropriate entries from \u03a3. A similar trick can be used for the \u00b5\u2019s. Computing \u03a3 is O(nd2) given\nn training examples (and could be made faster by omitting unnecessary elements). Inverting each\n\u03a3, the bottleneck of computing Eq. (1), is O(dm6) but independent of n and thus fairly small as\nn (cid:29) m. 
Finally, computing z = w\u22a4x over all n training examples and d projections is O(ndm2). Given the high complexity of each step, a naive brute-force approach for training is infeasible.\nSpeedup: While the weights over training examples change at every boosting iteration and after every tree split, in practice we \ufb01nd it is unnecessary to recompute the projections that frequently. Table 1, rows 2-4, shows the results of ACF with oblique splits, updated every T boosting iterations (denoted by ACF-LDA-T ).\n\n2http://vision.ucsd.edu/\u02dcpdollar/toolbox/doc/\n\n4\n\n\f            | Shared \u03a3 | T  | Miss Rate   | Training\nACF         | -         | -  | 17.3 \u00b1 .33 | 4.93m\nACF-LDA-4   | No        | 4  | 14.9 \u00b1 .37 | 303.57m\nACF-LDA-16  | No        | 16 | 15.1 \u00b1 .28 | 78.11m\nACF-LDA-\u221e  | No        | \u221e | 17.0 \u00b1 .22 | 5.82m\nACF-LDA\u2217-4  | Yes       | 4  | 14.7 \u00b1 .29 | 194.26m\nACF-LDA\u2217-16 | Yes       | 16 | 15.1 \u00b1 .12 | 51.19m\nACF-LDA\u2217-\u221e | Yes       | \u221e | 16.4 \u00b1 .17 | 5.79m\nLDCF        | Yes       | -  | 13.7 \u00b1 .15 | 6.04m\n\nTable 1: A comparison of boosted trees with orthogonal and oblique splits.\n\nWhile more frequent updates improve accuracy, ACF-LDA-16 has negligibly higher MR than ACF-LDA-4 but a nearly fourfold reduction in training time (timed using 12 cores). Training the brute-force version of ACF-LDA, updated at every iteration and each tree split (7 interior nodes per depth-3 tree), would have taken about 5 \u00b7 4 \u00b7 7 = 140 hours. For these results we used regularization of \u03b5 = .1 and patch size of m = 5 (the effect of varying m is explored in \u00a76).\nShared \u03a3: The crux and computational bottleneck of ACF-LDA is the computation and application of a separate covariance \u03a3 at each local neighborhood. In recent work on training linear object detectors using LDA, Hariharan et al. 
[15] exploited the observation that the statistics of natural images are translationally invariant and therefore the covariance between two features should depend only on their relative offset. Furthermore, as positives are rare, [15] showed that the covariances can be precomputed using natural images. Inspired by these observations, we propose to use a single, \ufb01xed covariance \u03a3 shared across all local image neighborhoods. We precompute one \u03a3 per channel and do not allow it to vary spatially or with boosting iteration. Table 1, rows 5-7, shows the results of ACF with oblique splits using \ufb01xed \u03a3, denoted by ACF-LDA\u2217. As before, the \u00b5\u2019s and resulting w are updated every T iterations. As expected, training time is reduced relative to ACF-LDA. Surprisingly, however, accuracy improves as well, presumably due to the implicit regularization effect of using a \ufb01xed \u03a3. This is a powerful result we will exploit further.\nSummary: ACF with local oblique splits and a single shared \u03a3 (ACF-LDA\u2217-4) achieves 14.7% MR compared to 17.3% MR for ACF with orthogonal splits. The 2.6% improvement in log-average MR corresponds to a nearly twofold reduction in false positives but comes at considerable computational cost. In the next section, we propose an alternative, more ef\ufb01cient approach for exploiting the use of a single shared \u03a3 capturing correlations in local neighborhoods.\n\n5 Locally Decorrelated Channel Features (LDCF)\n\nWe now have all the necessary ingredients to introduce our approach. We have made the following observations: (1) oblique splits learned with LDA over local m \u00d7 m patches improve results over orthogonal splits, (2) a single covariance matrix \u03a3 can be shared across all patches per channel, and (3) orthogonal trees with decorrelated features can potentially be used in place of oblique trees. 
This suggests the following approach: for every m \u00d7 m patch p in x, we can create a decorrelated representation by computing Q\u22a4p, where Q\u039bQ\u22a4 is the eigendecomposition of \u03a3 as before, followed by use of orthogonal trees. However, such an approach is computationally expensive.\nFirst, due to the use of overlapping patches, computing Q\u22a4p for every overlapping patch results in an overcomplete representation with a factor m2 increase in feature dimensionality. To reduce dimensionality, we only utilize the top k eigenvectors in Q, resulting in k < m2 features per pixel. The intuition is that the top eigenvectors capture the salient neighborhood structure. Our experiments in \u00a76 con\ufb01rm this: using as few as k = 4 eigenvectors per channel for patches of size m = 5 is suf\ufb01cient. As our second speedup, we observe that the projection Q\u22a4p can be computed by a series of k convolutions between a channel image and each m \u00d7 m \ufb01lter reshaped from its corresponding eigenvector (column of Q). This is possible because the covariance matrix \u03a3 is shared across all patches per channel and hence the derived Q is likewise spatially invariant. Decorrelating all 10 channels in an entire feature pyramid for a 640 \u00d7 480 image takes about .5 seconds.\n\n5\n\n\f
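The two speedups above (keep only the top k eigenvectors of the shared \u03a3; apply them as spatially invariant filters) can be sketched as follows. This is an illustrative numpy version using a toy single-channel image; function names are ours, and in the paper \u03a3 is precomputed from natural images rather than the input itself:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def decorrelation_filters(channel, m=5, k=4):
    """Estimate a shared m*m patch covariance for one channel and return
    the top-k eigenvectors reshaped into m x m decorrelation filters."""
    patches = sliding_window_view(channel, (m, m)).reshape(-1, m * m)
    Sigma = np.cov(patches, rowvar=False)   # shared local covariance
    lam, Q = np.linalg.eigh(Sigma)          # eigenvalues in ascending order
    top = Q[:, ::-1][:, :k]                 # top-k eigenvectors (columns)
    return top.T.reshape(k, m, m)

def decorrelate(channel, filters):
    """Apply each filter densely (valid region), yielding k locally
    decorrelated output channels per input channel."""
    m = filters.shape[-1]
    windows = sliding_window_view(channel, (m, m))  # (H-m+1, W-m+1, m, m)
    return np.stack([np.tensordot(windows, f, axes=([2, 3], [0, 1]))
                     for f in filters])

rng = np.random.default_rng(0)
img = rng.standard_normal((40, 40)).cumsum(0).cumsum(1)  # toy correlated image
F = decorrelation_filters(img, m=5, k=4)
out = decorrelate(img, F)
assert F.shape == (4, 5, 5) and out.shape == (4, 36, 36)
```

In LDCF this is repeated per channel (10 channels \u00d7 k = 4 filters = 40 output channels), after which training proceeds exactly as in ACF but with orthogonal splits over the decorrelated features.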
Given the new locally decorrelated channels, all other steps of ACF training and testing are identical. The extra implementation effort is likewise minimal: given the decorrelation \ufb01lters, a few lines of code suf\ufb01ce to convert ACF into LDCF. To further improve clarity, all source code for LDCF will be released.\nResults of the LDCF detector on the INRIA dataset are given in the last row of Table 1. The LDCF detector (which uses orthogonal splits) improves accuracy over ACF with oblique splits by an additional 1% MR. Training time is signi\ufb01cantly faster, and indeed, is only \u223c1 minute longer than for the original ACF detector. More detailed experiments and results are reported in \u00a76. We conclude by (1) describing the estimation of \u03a3 for each channel, (2) showing various visualizations, and (3) discussing the \ufb01lters themselves and connections to known \ufb01lters.\nEstimating \u03a3: We can estimate a spatially constant \u03a3 for each channel using any large collection of natural images. \u03a3 for each channel is represented by a spatial autocorrelation function \u03a3(x,y),(x+\u2206x,y+\u2206y) = C(\u2206x, \u2206y). Given a collection of natural images, we \ufb01rst estimate a separate autocorrelation function for each image and then average the results. Naive computation of the \ufb01nal function is O(np2), but the Wiener-Khinchin theorem reduces the complexity to O(np log p) via the FFT [4], where n and p denote the number of images and pixels per image, respectively.\nVisualization: Fig. 3, top-left, illustrates the estimated autocorrelations for each channel. Nearby features are highly correlated and oriented gradients are spatially correlated along their orientation due to curvilinear continuity [15]. Fig. 3, bottom-left, shows the decorrelation \ufb01lters for each channel obtained by reshaping the largest eigenvectors of \u03a3. 
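The FFT shortcut used when estimating \u03a3 (the Wiener\u2013Khinchin theorem: the autocorrelation is the inverse transform of the power spectrum) can be sketched as below. This is a schematic, not the authors' implementation; it assumes a single zero-mean channel with circular boundary handling, and in practice the result would be averaged over many natural images as described above:

```python
import numpy as np

def autocorrelation_fft(img):
    """Circular spatial autocorrelation C(dx, dy) via Wiener-Khinchin:
    inverse FFT of the power spectrum, O(p log p) for p pixels instead
    of the naive O(p^2)."""
    x = img - img.mean()
    power = np.abs(np.fft.fft2(x)) ** 2       # |FFT|^2 = power spectrum
    corr = np.fft.ifft2(power).real / x.size  # back to the spatial domain
    return np.fft.fftshift(corr)              # zero offset at the center

rng = np.random.default_rng(0)
img = rng.standard_normal((64, 64))
C = autocorrelation_fft(img)
h, w = C.shape
# The zero-offset entry is the variance and dominates all other lags.
assert C[h // 2, w // 2] == C.max()
```

Reshaping the largest eigenvectors of the \u03a3 assembled from such a C (one entry per relative offset) yields the decorrelation filters shown in Fig. 3.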
The largest eigenvectors are smoothing\n\ufb01lters while the smaller ones resemble increasingly higher-frequency \ufb01lters. The corresponding\neigenvalues decay rapidly and in practice we use the top k = 4 \ufb01lters. Observe that the decorrela-\ntion \ufb01lters for oriented gradients are aligned to their orientation. Finally, Fig. 3, right, shows original\nand decorrelated channels averaged over positive training examples.\nDiscussion: Our decorrelation \ufb01lters are closely related to sinusoidal, DCT basis, and Gaussian\nderivative \ufb01lters. Spatial interactions in natural images are often well-described by Markov mod-\nels [13] and \ufb01rst-order stationary Markov processes are known to have sinusoidal KLT bases [29].\nIn particular, for the LUV color channels, our \ufb01lters are similar to the discrete cosine transform\n(DCT) bases that are often used to approximate the KLT. For oriented gradients, however, the decor-\nrelation \ufb01lters are no longer well modeled by the DCT bases (note also that our \ufb01lters are applied\ndensely whereas the DCT typically uses block processing). Alternatively, we can interpret our \ufb01lters\nas Gaussian derivative \ufb01lters. Assume that the autocorrelation is modeled by a squared-exponential\nfunction C(\u2206x) = exp(\u2212\u2206x2/2l2), which is fairly reasonable given the estimation results in Fig. 3.\nIn 1D, the kth largest eigenfunction of such an autocorrelation function is a k \u2212 1 order Gaussian\nderivative \ufb01lter [28]. 
It is straightforward to extend the result to the anisotropic multivariate case, in which the eigenfunctions are Gaussian directional derivative \ufb01lters similar to our \ufb01lters.\n\n6\n\n\fFigure 4: (a-b) Use of k = 4 local decorrelation \ufb01lters of size m = 5 gives optimal performance. (c) Increasing tree depth while simultaneously enlarging the quantity of data available for training can have a large impact on accuracy (blue stars indicate optimal depth at each sampling interval).\n\n                  | description                          | # channels | miss rate\n1. ACF            | (modi\ufb01ed) baseline                  | 10         | 17.3 \u00b1 .33\n2. LDCF small \u03bb  | decorrelation w/ k smallest \ufb01lters  | 10k        | 61.7 \u00b1 .28\n3. LDCF random    | \ufb01ltering w/ k random \ufb01lters        | 10k        | 15.6 \u00b1 .26\n4. LDCF LUV only  | decorrelation of LUV channels only   | 3k + 7     | 16.2 \u00b1 .37\n5. LDCF grad only | decorrelation of grad channels only  | 3 + 7k     | 14.9 \u00b1 .29\n6. LDCF constant  | decorrelation w/ constant \ufb01lters    | 10k        | 14.2 \u00b1 .34\n7. LDCF           | proposed approach                    | 10k        | 13.7 \u00b1 .15\n\nTable 2: Locally decorrelated channels compared to alternate \ufb01ltering strategies. See text.\n\n6 Experiments\n\nIn this section, we demonstrate the effectiveness of locally decorrelated channel features (LDCF) in the context of pedestrian detection. We: (1) study the effect of parameter settings, (2) test variations of our approach, and \ufb01nally (3) compare our results with the state-of-the-art.\nParameters: LDCF has two parameters: the count and size of the decorrelation \ufb01lters. Fig. 4(a) and (b) show the results of LDCF on the INRIA dataset while varying the \ufb01lter count (k) and size (m), respectively. Use of k = 4 decorrelation \ufb01lters of size m = 5 improves performance by up to \u223c4% MR compared to ACF. Inclusion of additional higher-frequency \ufb01lters or use of larger \ufb01lters can cause performance degradation. 
For all remaining experiments we \ufb01x k = 4 and m = 5.\nVariations: We test variants of LDCF and report results on INRIA in Table 2. LDCF (row 7) outper-\nforms all variants, including the baseline (1). Filtering the channels with the smallest k eigenvectors\n(2) or k random \ufb01lters (3) performs worse. Local decorrelation of only the color channels (4) or\nonly the gradient channels (5) is inferior to decorrelation of all channels. Finally, we test constant\ndecorrelation \ufb01lters obtained from the intensity channel L that resemble the \ufb01rst k DCT basis \ufb01lters.\nUse of unique \ufb01lters per channel outperforms use of constant \ufb01lters across all channels (6).\nModel Capacity: Use of locally decorrelated features implicitly allows for richer, more effective\nsplitting functions, increasing modeling capacity and generalization ability. Inspired by their suc-\ncess, we explore additional strategies for augmenting model capacity. For the following experiments,\nwe rely solely on the training set of the Caltech Pedestrian Dataset [10]. Of the 71 minute long train-\ning videos (\u223c128k images), we use every fourth video as validation data and the rest for training. On\nthe validation set, LDCF outperforms ACF by a considerable margin, reducing MR from 46.2% to\n41.7%. We \ufb01rst augment model capacity by increasing the number of trees twofold (to 4096) and the\nsampled negatives \ufb01vefold (to 50k). Surprisingly, doing so reduces MR by an additional 4%. Next,\nwe experiment with increasing maximum tree depth while simultaneously enlarging the amount of\ndata available for training. Typically, every 30th image in the Caltech dataset is used for training and\ntesting. Instead, Figure 4(c) shows validation performance of LDCF with different tree depths while\nvarying the training data sampling interval. The impact of maximum depth on performance is quite\nlarge. 
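For reference, the log-average miss rate used throughout these experiments (following the protocol of [10], \u00a73) averages the miss rate at points evenly spaced in log-FPPI between 10^-2 and 10^0. The following is a simplified sketch under that reading, with step interpolation of the detector's tradeoff curve; it is our own illustration, not the official evaluation code:

```python
import numpy as np

def log_average_miss_rate(fppi, miss, num_points=9):
    """Average the miss rate at reference points evenly spaced in
    log-FPPI on [1e-2, 1e0]; the mean is taken in log space and
    exponentiated back, per the Caltech protocol [10]."""
    refs = np.logspace(-2, 0, num_points)
    # For each reference FPPI, take the miss rate of the last achieved
    # operating point at or below it (simple step interpolation).
    samples = [miss[fppi <= r][-1] if np.any(fppi <= r) else miss[0]
               for r in refs]
    return np.exp(np.mean(np.log(samples)))

# Toy curve: miss rate falls as more false positives are tolerated.
fppi = np.array([0.005, 0.01, 0.05, 0.1, 0.5, 1.0])
miss = np.array([0.60, 0.50, 0.40, 0.30, 0.22, 0.18])
mr = log_average_miss_rate(fppi, miss)
assert 0.18 <= mr <= 0.60
```

Because the metric averages in log space, small absolute gains in MR correspond to large multiplicative reductions in false positives, which is why the improvements reported here translate to such dramatic false-positive reductions.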
At a dense sampling interval of every 4th frame, use of depth-5 trees (up from depth-2 for the original approach) improves performance by an additional 5% to 32.6% MR. Note that, consistent with the generalization bounds of boosting [31], use of deeper trees requires more data.

Figure 5: A comparison of our LDCF detector with state-of-the-art pedestrian detectors on (a) the INRIA Pedestrian Dataset and (b) the Caltech Pedestrian Dataset.

INRIA Results: In Figure 5(a) we compare LDCF with state-of-the-art detectors on INRIA [6] using benchmark code maintained by [10]. Since the INRIA dataset is oft-used as a validation set, including in this work, we include these results for completeness only. LDCF is essentially tied for second place with Roerei [2] and Franken [21] and is outperformed by ∼1% MR by SketchTokens [19]. These approaches all belong to the family of channel features detectors, and as the improvements proposed in this work are orthogonal, the methods could potentially be combined.

Caltech Results: We present our main result on the Caltech Pedestrian Dataset [10], see Fig. 5(b), generated using the official evaluation code available online3. The Caltech dataset has become the standard for evaluating pedestrian detectors, and the latest methods based on deep learning (JointDeep) [25], multi-resolution models (MT-DPM) [36] and motion features (ACF+SDt) [27] achieve under 40% log-average MR. For a complete comparison, we first present results for an augmented-capacity ACF model which uses more (4096) and deeper (depth-5) trees trained with RealBoost using dense sampling of the training data (every 4th image). See the preceding note on model capacity for details and motivation. This augmented model (ACF-Caltech+) achieves 29.8% MR, a considerable gain of nearly 10% MR over previous methods, including the baseline version of ACF (ACF-Caltech).
With identical parameters, locally decorrelated channel features (LDCF) further reduce error to 24.9% MR, with substantial gains at higher recall. Overall, this is a massive improvement and represents a nearly 10× reduction in false positives over the previous state-of-the-art.

7 Conclusion

In this work we have presented a simple, principled approach for improving boosted object detectors. Our core observation was that effective but expensive oblique splits in decision trees can be replaced by orthogonal splits over locally decorrelated data. Moreover, due to the stationary statistics of image features, the local decorrelation can be performed efficiently via convolution with a fixed filter bank precomputed from natural images. Our approach is general, simple and fast.

Our method showed dramatic improvement over the previous state-of-the-art. While some of the gain was from increasing model capacity, use of local decorrelation gave a clear and significant boost. Overall, we reduced false positives tenfold on Caltech. Such large gains are fairly rare.

In the present work we did not decorrelate features across channels (decorrelation was applied independently per channel); this is a clear future direction. Testing local decorrelation in the context of other classifiers (e.g.
convolutional nets or linear classifiers as in [15]) would also be interesting. While the proposed locally decorrelated channel features (LDCF) require only modest modification to existing code, we will release all source code used in this work to ease reproducibility.

3 http://www.vision.caltech.edu/Image_Datasets/CaltechPedestrians/

[Figure 5 plots: miss rate vs. false positives per image. Legend, (a) INRIA: 72% VJ, 46% HOG, 21% pAUCBoost, 20% FisherBoost, 20% LatSvm-V2, 20% ConvNet, 19% CrossTalk, 17% ACF, 16% VeryFast, 15% RandForest, 14% LDCF, 14% Franken, 14% Roerei, 13% SketchTokens. (b) Caltech: 95% VJ, 68% HOG, 48% DBN-Mut, 46% MF+Motion+2Ped, 46% MOCO, 45% MultiSDP, 44% ACF-Caltech, 43% MultiResC+2Ped, 41% MT-DPM, 39% JointDeep, 38% MT-DPM+Context, 37% ACF+SDt, 30% ACF-Caltech+, 25% LDCF.]

References

[1] R. Benenson, M. Mathias, R. Timofte, and L. Van Gool. Pedestrian detection at 100 frames per second. In CVPR, 2012.
[2] R. Benenson, M. Mathias, T. Tuytelaars, and L. Van Gool. Seeking the strongest rigid detector. In CVPR, 2013.
[3] L. Bourdev and J. Brandt. Robust object detection via soft cascade. In CVPR, 2005.
[4] G. Box, G. Jenkins, and G. Reinsel. Time series analysis: forecasting and control. Prentice Hall, 1994.
[5] L. Breiman. Random forests. Machine Learning, 45(1):5–32, Oct. 2001.
[6] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
[7] P. Dollár, R. Appel, S. Belongie, and P. Perona. Fast feature pyramids for object detection. PAMI, 2014.
[8] P. Dollár, R. Appel, and W. Kienzle. Crosstalk cascades for frame-rate pedestrian detection. In ECCV, 2012.
[9] P. Dollár, Z. Tu, P. Perona, and S. Belongie. Integral channel features. In BMVC, 2009.
[10] P. Dollár, C. Wojek, B. Schiele, and P. Perona.
Pedestrian detection: An evaluation of the state of the art. PAMI, 34, 2012.
[11] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. PAMI, 32(9):1627–1645, 2010.
[12] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. The Annals of Statistics, 38(2):337–374, 2000.
[13] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. PAMI, PAMI-6(6):721–741, 1984.
[14] R. B. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
[15] B. Hariharan, J. Malik, and D. Ramanan. Discriminative decorrelation for clustering and classification. In ECCV, 2012.
[16] T. Hastie, R. Tibshirani, and J. Friedman. The elements of statistical learning. Springer, 2009.
[17] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Master's thesis, Department of Computer Science, University of Toronto, 2009.
[18] D. Levi, S. Silberstein, and A. Bar-Hillel. Fast multiple-part based object detection using kd-ferns. In CVPR, 2013.
[19] J. Lim, C. L. Zitnick, and P. Dollár. Sketch tokens: A learned mid-level representation for contour and object detection. In CVPR, 2013.
[20] J. Marín, D. Vázquez, A. López, J. Amores, and B. Leibe. Random forests of local experts for pedestrian detection. In ICCV, 2013.
[21] M. Mathias, R. Benenson, R. Timofte, and L. Van Gool. Handling occlusions with franken-classifiers. In ICCV, 2013.
[22] M. Mathias, R. Timofte, R. Benenson, and L. Van Gool. Traffic sign recognition - how far are we from the solution? In IJCNN, 2013.
[23] B. H. Menze, B. M. Kelm, D. N. Splitthoff, U. Koethe, and F. A. Hamprecht.
On oblique random forests. In Machine Learning and Knowledge Discovery in Databases, 2011.
[24] S. K. Murthy, S. Kasif, and S. Salzberg. A system for induction of oblique decision trees. Journal of Artificial Intelligence Research, 1994.
[25] W. Ouyang and X. Wang. Joint deep learning for pedestrian detection. In ICCV, 2013.
[26] D. Park, D. Ramanan, and C. Fowlkes. Multiresolution models for object detection. In ECCV, 2010.
[27] D. Park, C. L. Zitnick, D. Ramanan, and P. Dollár. Exploring weak stabilization for motion feature extraction. In CVPR, 2013.
[28] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006.
[29] W. Ray and R. Driver. Further decomposition of the Karhunen-Loève series representation of a stationary random process. IEEE Transactions on Information Theory, 16(6):663–668, Nov. 1970.
[30] J. J. Rodriguez, L. I. Kuncheva, and C. J. Alonso. Rotation forest: A new classifier ensemble method. PAMI, 28(10), 2006.
[31] R. E. Schapire, Y. Freund, P. Bartlett, and W. S. Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics, 1998.
[32] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. arXiv:1312.6229, 2013.
[33] P. Sermanet, K. Kavukcuoglu, S. Chintala, and Y. LeCun. Pedestrian detection with unsupervised multi-stage feature learning. In CVPR, 2013.
[34] C. Shen, P. Wang, S. Paisitkriangkrai, and A. van den Hengel. Training effective node classifiers for cascade classification. IJCV, 103(3):326–347, July 2013.
[35] P. A. Viola and M. J. Jones. Robust real-time face detection. IJCV, 57(2):137–154, 2004.
[36] J. Yan, X. Zhang, Z. Lei, S. Liao, and S. Z. Li. Robust multi-resolution pedestrian detection in traffic scenes. In CVPR, 2013.
[37] X.
Zeng, W. Ouyang, and X. Wang. Multi-stage contextual deep learning for pedestrian detection. In ICCV, 2013.