{"title": "Kernel Descriptors for Visual Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 244, "page_last": 252, "abstract": "The design of low-level image features is critical for computer vision algorithms. Orientation histograms, such as those in SIFT~\\cite{Lowe2004Distinctive} and HOG~\\cite{Dalal2005Histograms}, are the most successful and popular features for visual object and scene recognition. We highlight the kernel view of orientation histograms, and show that they are equivalent to a certain type of match kernels over image patches. This novel view allows us to design a family of kernel descriptors which provide a unified and principled framework to turn pixel attributes (gradient, color, local binary pattern, \\etc) into compact patch-level features. In particular, we introduce three types of match kernels to measure similarities between image patches, and construct compact low-dimensional kernel descriptors from these match kernels using kernel principal component analysis (KPCA)~\\cite{Scholkopf1998Nonlinear}. Kernel descriptors are easy to design and can turn any type of pixel attribute into patch-level features. They outperform carefully tuned and sophisticated features including SIFT and deep belief networks. We report superior performance on standard image classification benchmarks: Scene-15, Caltech-101, CIFAR10 and CIFAR10-ImageNet.", "full_text": "Kernel Descriptors for Visual Recognition\n\nLiefeng Bo\n\nUniversity of Washington\nSeattle WA 98195, USA\n\nXiaofeng Ren\nIntel Labs Seattle\n\nSeattle WA 98105, USA\n\nDieter Fox\n\nUniversity of Washington & Intel Labs Seattle\n\nSeattle WA 98195 & 98105, USA\n\nAbstract\n\nThe design of low-level image features is critical for computer vision algorithms.\nOrientation histograms, such as those in SIFT [16] and HOG [3], are the most\nsuccessful and popular features for visual object and scene recognition. 
We high-\nlight the kernel view of orientation histograms, and show that they are equivalent\nto a certain type of match kernels over image patches. This novel view allows\nus to design a family of kernel descriptors which provide a uni\ufb01ed and princi-\npled framework to turn pixel attributes (gradient, color, local binary pattern, etc.)\ninto compact patch-level features. In particular, we introduce three types of match\nkernels to measure similarities between image patches, and construct compact\nlow-dimensional kernel descriptors from these match kernels using kernel princi-\npal component analysis (KPCA) [23]. Kernel descriptors are easy to design and\ncan turn any type of pixel attribute into patch-level features. They outperform\ncarefully tuned and sophisticated features including SIFT and deep belief net-\nworks. We report superior performance on standard image classi\ufb01cation bench-\nmarks: Scene-15, Caltech-101, CIFAR10 and CIFAR10-ImageNet.\n\n1 Introduction\n\nImage representation (features) is arguably the most fundamental task in computer vision. The\nproblem is highly challenging because images exhibit high variations, are highly structured, and\nlie in high dimensional spaces.\nIn the past ten years, a large number of low-level features over\nimages have been proposed. In particular, orientation histograms such as SIFT [16] and HOG [3]\nare the most popular low-level features, essential to many computer vision tasks such as object\nrecognition and 3D reconstruction. The success of SIFT and HOG naturally raises questions on how\nthey measure the similarity between image patches, how we should understand the design choices in\nthem, and whether we can \ufb01nd a principled way to design and learn comparable or superior low-level\nimage features.\nIn this work, we highlight the kernel view of orientation histograms and provide a uni\ufb01ed way to\nlow-level image feature design and learning. 
Our low-level image feature extractors, kernel descriptors, consist of three steps: (1) design match kernels using pixel attributes; (2) learn compact basis vectors using kernel principal component analysis; (3) construct kernel descriptors by projecting the infinite-dimensional feature vectors onto the learned basis vectors. We show how our framework is applied to gradient, color, and shape pixel attributes, leading to three effective kernel descriptors. We validate our approach on four standard image category recognition benchmarks, and show that our kernel descriptors surpass both manually designed, well tuned low-level features (SIFT) [16] and sophisticated feature learning approaches (convolutional networks, deep belief networks, sparse coding, etc.) [10, 26, 14, 24].

The most relevant work to this paper is that on efficient match kernels (EMK) [1], which provides a kernel view of the frequently used bag-of-words representation and forms image-level features by learning compact low-dimensional projections or by using random Fourier transformations. While the work on efficient match kernels is interesting, hand-crafted SIFT features are still used as the basic building block. Another related work is based on the mathematics of the neural response, which shows that hierarchical architectures motivated by the neuroscience of the visual cortex are associated with a derived kernel [24]. Instead, the goal of this paper is to provide a deep understanding of how orientation histograms (SIFT and HOG) work, and to show how we can generalize them and design novel low-level image features based on the kernel insight. Our kernel descriptors are general and provide a principled way to convert pixel attributes to patch-level features. 
To the best of our knowledge, this is the first time that low-level image features are designed and learned from scratch using kernel methods; they can serve as the foundation of many computer vision tasks including object recognition.

This paper is organized as follows. Section 2 introduces the kernel view of histograms. Our novel kernel descriptors are presented in Section 3, followed by an extensive experimental evaluation in Section 4. We conclude in Section 5.

2 Kernel View of Orientation Histograms

Orientation histograms, such as SIFT [16] and HOG [3], are the most commonly used low-level features for object detection and recognition. Here we describe the kernel view of such orientation histogram features, and show how this kernel view can help overcome issues such as orientation binning. Let \theta(z) and m(z) be the orientation and magnitude of the image gradient at a pixel z. In HOG and SIFT, the gradient orientation of each pixel is discretized into a d-dimensional indicator vector \delta(z) = [\delta_1(z), \cdots, \delta_d(z)] with

\delta_i(z) = \begin{cases} 1, & \lfloor \frac{d\theta(z)}{2\pi} \rfloor = i - 1 \\ 0, & \text{otherwise} \end{cases}   (1)

where \lfloor x \rfloor takes the largest integer less than or equal to x (we will describe soft binning further below). The feature vector of each pixel z is a weighted indicator vector F(z) = m(z)\delta(z). Aggregating feature vectors of pixels over an image patch P, we obtain the histogram of oriented gradients:

F_h(P) = \sum_{z \in P} \tilde{m}(z)\,\delta(z)   (2)

where \tilde{m}(z) = m(z)/\sqrt{\sum_{z \in P} m(z)^2 + \epsilon_g} is the normalized gradient magnitude, with \epsilon_g a small constant. P is typically a 4 \times 4 rectangle in SIFT and an 8 \times 8 rectangle in HOG. Without loss of generality, we consider L2-based normalization here. In object detection [3, 5] and matching based object recognition [18], linear support vector machines or the L2 distance are commonly applied to sets of image patch features. This is equivalent to measuring the similarity of image patches using a linear kernel in the feature map F_h(P) in kernel space:

K_h(P, Q) = F_h(P)^\top F_h(Q) = \sum_{z \in P} \sum_{z' \in Q} \tilde{m}(z)\tilde{m}(z')\,\delta(z)^\top\delta(z')   (3)

where P and Q are patches usually from two different images. In Eq. 3, both k_{\tilde{m}}(z, z') = \tilde{m}(z)\tilde{m}(z') and k_\delta(z, z') = \delta(z)^\top\delta(z') are inner products of two vectors and thus are positive definite kernels. Therefore, K_h(P, Q) is a match kernel over sets (here the sets are image patches) as in [8, 1, 11, 17, 7]. Thus Eq. 3 provides a kernel view of HOG features over image patches. For simplicity, we only use one image patch here; it is straightforward to extend to sets of image patches.

The hard binning underlying Eq. 1 is only for ease of presentation. To get a kernel view of soft binning [13], we only need to replace the delta function in Eq. 1 by the following soft \delta(\cdot) function:

\delta_i(z) = \max(\cos(\theta(z) - a(i))^9, 0)   (4)

where a(i) is the center of the i-th bin. In addition, one can easily include soft spatial binning by normalizing gradient magnitudes using the corresponding spatial weights. The L2 distance between P and Q can be expressed as D(P, Q) = 2 - 2F(P)^\top F(Q), since F(P)^\top F(P) = 1, and the kernel view can be provided in the same manner.

Figure 1: Pixel attributes. Left: Gradient orientation representation. 
To measure similarity between two pixel gradient orientations \theta and \theta', we use the L2 norm between the normalized gradient vectors \tilde{\theta} = [\sin(\theta)\ \cos(\theta)] and \tilde{\theta}' = [\sin(\theta')\ \cos(\theta')]. The red dots represent the normalized gradient vectors, and the blue line represents the distance between them. Right: Local binary patterns. The values indicate brightness of pixels in a 3\times3 patch. Red pixels have intensities larger than the center pixel, blue pixels are darker. The 8-dimensional indicator vector is the resulting local binary pattern.

Note that the kernel k_{\tilde{m}}(z, z') measuring the similarity of gradient magnitudes of two pixels is linear in gradient magnitude. k_\delta(z, z') measures the similarity of gradient orientations of two pixels: 1 if the two gradient orientations are in the same bin, and 0 otherwise (Eq. 1, hard binning). As can be seen, this kernel introduces quantization errors and could lead to suboptimal performance in subsequent stages of processing. While soft binning results in a smoother kernel function, it still suffers from discretization. 
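The contrast between hard binning (Eq. 1) and soft binning (Eq. 4) can be made concrete with a small sketch. This is a minimal numpy illustration, not the paper's implementation; the bin count d = 8 and evenly spaced bin centers are assumptions for the example.

```python
import numpy as np

def hard_binning(theta, d=8):
    """Hard orientation binning (Eq. 1): a d-dimensional indicator vector."""
    delta = np.zeros(d)
    delta[int(np.floor(d * theta / (2 * np.pi))) % d] = 1.0
    return delta

def soft_binning(theta, d=8):
    """Soft binning (Eq. 4): max(cos(theta - a_i)^9, 0) over bin centers a_i."""
    centers = 2 * np.pi * np.arange(d) / d
    return np.maximum(np.cos(theta - centers) ** 9, 0.0)
```

Two orientations that straddle a bin boundary by a hair get orthogonal hard-binned vectors (kernel value 0) but nearly identical soft-binned vectors, which is exactly the quantization artifact discussed above.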
This motivates us to search for alternative match kernels which can measure the similarity of image patches more accurately.

3 Kernel Descriptors

3.1 Gradient, Color, and Shape Match Kernels

We introduce the following gradient match kernel, K_{grad}, to capture image variations:

K_{grad}(P, Q) = \sum_{z \in P} \sum_{z' \in Q} \tilde{m}(z)\tilde{m}(z')\, k_o(\tilde{\theta}(z), \tilde{\theta}(z'))\, k_p(z, z')   (5)

where k_p(z, z') = \exp(-\gamma_p \|z - z'\|^2) is a Gaussian position kernel with z denoting the 2D position of a pixel in an image patch (normalized to [0, 1]), and k_o(\tilde{\theta}(z), \tilde{\theta}(z')) = \exp(-\gamma_o \|\tilde{\theta}(z) - \tilde{\theta}(z')\|^2) is a Gaussian kernel over orientations. To estimate the difference between orientations at pixels z and z', we use the following normalized gradient vectors in the kernel function k_o:

\tilde{\theta}(z) = [\sin(\theta(z))\ \cos(\theta(z))].   (6)

The L2 distance between such vectors measures the difference of gradient orientations very well (see Figure 1). Note that computing the L2 distance on the raw angle values \theta instead of the normalized gradient vectors \tilde{\theta} would produce wrong similarities in some cases. For example, consider the two angles 2\pi - 0.01 and 0.01, which have very similar orientation but very large L2 distance.

To summarize, our gradient match kernel K_{grad} consists of three kernels: the normalized linear kernel is the same as that in the orientation histograms, weighting the contribution of each pixel using gradient magnitudes; the orientation kernel k_o computes the similarity of gradient orientations; and the position Gaussian kernel k_p measures how close two pixels are spatially.

The kernel view of orientation histograms provides a simple, unified way to turn pixel attributes into patch-level features. One immediate extension is to construct color match kernels over pixel values:

K_{col}(P, Q) = \sum_{z \in P} \sum_{z' \in Q} k_c(c(z), c(z'))\, k_p(z, z')   (7)

where c(z) is the pixel color at position z (intensity for gray images and RGB values for color images), and k_c(c(z), c(z')) = \exp(-\gamma_c \|c(z) - c(z')\|^2) measures how similar two pixel values are.

While the gradient match kernel can capture image variations and the color match kernel can describe image appearance, we find that a match kernel over local binary patterns [19] can capture local shape more effectively:

K_{shape}(P, Q) = \sum_{z \in P} \sum_{z' \in Q} \tilde{s}(z)\tilde{s}(z')\, k_b(b(z), b(z'))\, k_p(z, z')   (8)

where \tilde{s}(z) = s(z)/\sqrt{\sum_{z \in P} s(z)^2 + \epsilon_s}, s(z) is the standard deviation of pixel values in the 3 \times 3 neighborhood around z, \epsilon_s is a small constant, and b(z) is a binary column vector that binarizes the pixel value differences in a local window around z (see Fig. 1, right). The normalized linear kernel \tilde{s}(z)\tilde{s}(z') weighs the contribution of each local binary pattern, and the Gaussian kernel k_b(b(z), b(z')) = \exp(-\gamma_b \|b(z) - b(z')\|^2) measures shape similarity through local binary patterns.

Match kernels defined over various pixel attributes provide a unified way to generate a rich, diverse visual feature set, which has been shown to be very successful in boosting recognition accuracy [6]. As validated by our own experiments, the gradient, color and shape match kernels are strong in their own right and complement one another. Their combination turns out to be always (much) better than the best individual feature.

3.2 Learning Compact Features

Match kernels provide a principled way to measure the similarity of image patches, but evaluating kernels can be computationally expensive when image patches are large [1]. Both for computational efficiency and for representational convenience, we present an approach to extract compact low-dimensional features from match kernels: (1) uniformly and densely sample sufficient basis vectors from the support region to guarantee an accurate approximation to the match kernels; (2) learn compact basis vectors using kernel principal component analysis. An important advantage of our approach is that no local minima are involved, unlike constrained kernel singular value decomposition [1].

We now describe how our compact low-dimensional features are extracted from the gradient kernel K_{grad}; features for the other kernels can be generated in the same way. Rewriting the kernels in Eq. 5 as inner products k_o(\tilde{\theta}(z), \tilde{\theta}(z')) = \phi_o(\tilde{\theta}(z))^\top \phi_o(\tilde{\theta}(z')) and k_p(z, z') = \phi_p(z)^\top \phi_p(z'), we can derive the following feature over image patches:

F_{grad}(P) = \sum_{z \in P} \tilde{m}(z)\, \phi_o(\tilde{\theta}(z)) \otimes \phi_p(z)   (9)

where \otimes is the tensor product. 
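For concreteness, the gradient match kernel of Eq. 5 can be evaluated naively with a double sum over pixel pairs. This is a sketch under stated assumptions: patches are represented as dicts of per-pixel arrays (a representation chosen for the example, not the paper's), and \epsilon_g = 0.8 follows the hyperparameter setting reported in Section 4.1. It is exactly this O(|P||Q|) cost that the finite-dimensional features derived next avoid.

```python
import numpy as np

def gradient_match_kernel(P, Q, gamma_o=5.0, gamma_p=3.0, eps_g=0.8):
    """Naive evaluation of Kgrad (Eq. 5) between two patches.

    Each patch is a dict with per-pixel arrays:
      'pos'  : (n, 2) pixel positions, normalized to [0, 1]
      'theta': (n,)   gradient orientations in radians
      'mag'  : (n,)   gradient magnitudes
    """
    def prep(patch):
        # normalized magnitude m~(z) and normalized gradient vector theta~(z) (Eq. 6)
        m = patch['mag'] / np.sqrt(np.sum(patch['mag'] ** 2) + eps_g)
        t = np.stack([np.sin(patch['theta']), np.cos(patch['theta'])], axis=1)
        return m, t, patch['pos']

    m1, t1, p1 = prep(P)
    m2, t2, p2 = prep(Q)
    k = 0.0
    for i in range(len(m1)):
        for j in range(len(m2)):
            ko = np.exp(-gamma_o * np.sum((t1[i] - t2[j]) ** 2))  # orientation kernel
            kp = np.exp(-gamma_p * np.sum((p1[i] - p2[j]) ** 2))  # position kernel
            k += m1[i] * m2[j] * ko * kp
    return k
```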
For this feature, it follows that F_{grad}(P)^\top F_{grad}(Q) = K_{grad}(P, Q). Because we use Gaussian kernels, F_{grad}(P) is an infinite-dimensional vector.

A straightforward approach to dimension reduction is to sample sufficient image patches from training images and perform KPCA on the match kernels. However, such an approach makes the learned features depend on the task at hand. Moreover, KPCA can become computationally infeasible when the number of patches is very large.

Sufficient Finite-dimensional Approximation. We present an approach to approximate match kernels directly, without requiring any image. Following classic methods, we learn finite-dimensional features by projecting F_{grad}(P) into a set of basis vectors. A key issue in this projection process is how to choose a set of basis vectors which makes the finite-dimensional kernel approximate the original kernel well. Since pixel attributes are low-dimensional vectors, we can achieve a very good approximation by sampling sufficient basis vectors using a fine grid over the support region. For example, consider the Gaussian kernel k_o(\tilde{\theta}(z), \tilde{\theta}(z')) over gradient orientations. Given a set of basis vectors \{\phi_o(x_i)\}_{i=1}^{d_o}, where the x_i are sampled normalized gradient vectors, we can approximate the infinite-dimensional vector \phi_o(\tilde{\theta}(z)) by its projection into the space spanned by this set of d_o basis vectors. Following the formulation in [1], such a procedure is equivalent to using a finite-dimensional kernel:

\tilde{k}_o(\tilde{\theta}(z), \tilde{\theta}(z')) = k_o(\tilde{\theta}(z), X)^\top K_o^{-1} k_o(\tilde{\theta}(z'), X) = \left[ G k_o(\tilde{\theta}(z), X) \right]^\top \left[ G k_o(\tilde{\theta}(z'), X) \right]   (10)

where k_o(\tilde{\theta}(z), X) = [k_o(\tilde{\theta}(z), x_1), \cdots, k_o(\tilde{\theta}(z), x_{d_o})]^\top is a d_o \times 1 vector, K_o is a d_o \times d_o matrix with K_{o,ij} = k_o(x_i, x_j), and K_o^{-1} = G^\top G. The resulting feature map \tilde{\phi}_o(\tilde{\theta}(z)) = G k_o(\tilde{\theta}(z), X) is now only d_o-dimensional. In a similar manner, we can also approximate the kernels k_p, k_c and k_b. The finite-dimensional feature for the gradient match kernel is \tilde{F}_{grad}(P) = \sum_{z \in P} \tilde{m}(z)\, \tilde{\phi}_o(\tilde{\theta}(z)) \otimes \tilde{\phi}_p(z), and may be efficiently used as a feature over image patches.

Figure 2: Finite-dimensional approximation. Left: the orientation kernel k_o(\tilde{\theta}(z), \tilde{\theta}(z')) and its finite-dimensional approximation. \gamma_o is set to 5 (as used in the experiments) and \tilde{\theta}(z') is fixed to [1\ 0]. All curves show kernel values as functions of \tilde{\theta}(z). The red line is the ground truth kernel function k_o, and the black, green and blue lines are the finite approximation kernels with different grid sizes. Right: root mean square error (RMSE) between the KPCA approximation and the corresponding match kernel as a function of dimensionality. We compute the RMSE on 10,000 randomly sampled datapoints. The three lines show the RMSE between the kernels K_{grad} (red), K_{col} (blue) and K_{shape} (green) and their respective approximation kernels.

We validate our intuition in Fig. 2. As we expect, the approximation error rapidly drops with increasing grid sizes. When the grid size is larger than 16, the finite kernel and the original kernel become virtually indistinguishable. For the shape kernel over local binary patterns, because the variables are binary, we simply choose the set of all 2^8 = 256 basis vectors and thus no approximation error is introduced.

Compact Features. Although \tilde{F}_{grad}(P) is finite-dimensional, the dimensionality can be high due to the tensor product. For example, consider the shape kernel descriptor: the size of the basis-vector set for the kernel k_b is 256; if we choose the basis vectors of the position kernel k_p on a 5 \times 5 regular grid, the dimensionality of the resulting shape kernel descriptor F_{shape} would be 256 \times 25 = 6400, too high for practical purposes. Dense uniform sampling leads to accurate approximation but does not guarantee orthogonality of the basis vectors, thus introducing redundancy. The size of the basis-vector set can be further reduced by performing kernel principal component analysis over the joint basis vectors \{\phi_o(x_1) \otimes \phi_p(y_1), \cdots, \phi_o(x_{d_o}) \otimes \phi_p(y_{d_p})\}, where the \phi_p(y_s) are basis vectors for the position kernel and d_p is the number of such basis vectors. The t-th kernel principal component can be written as

PC^t = \sum_{i=1}^{d_o} \sum_{j=1}^{d_p} \alpha^t_{ij}\, \phi_o(x_i) \otimes \phi_p(y_j)   (11)

where d_o and d_p are the sizes of the basis-vector sets for the orientation and position kernel, respectively, and \alpha^t_{ij} is learned through kernel principal component analysis: K^c \alpha^t = \lambda^t \alpha^t, where K^c is the centered kernel matrix with [K^c]_{ijst} = k_o(x_i, x_j)k_p(y_s, y_t) - \frac{2}{d_o d_p}\sum_{i',s'} k_o(x_{i'}, x_j)k_p(y_{s'}, y_t) + \frac{1}{d_o^2 d_p^2}\sum_{i',j',s',t'} k_o(x_{i'}, x_{j'})k_p(y_{s'}, y_{t'}). As shown in Fig. 2, match kernels can be approximated rather accurately using the reduced basis vectors by KPCA. 
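The KPCA step over joint basis vectors (Eq. 11) can be sketched in a few lines of numpy. This is an illustrative sketch, not the paper's implementation: the joint Gram matrix is formed as a Kronecker product of the two small Gram matrices, centered with the usual centering matrix, and eigendecomposed; the eigenvector normalization by \sqrt{\lambda} (so each component has unit norm in feature space) and all grid sizes are assumptions of the example.

```python
import numpy as np

def joint_kpca_components(X, Y, gamma_o=5.0, gamma_p=3.0, r=200):
    """KPCA over joint basis vectors phi_o(x_i) (x) phi_p(y_j) (Eq. 11).

    X: (d_o, 2) orientation basis vectors, Y: (d_p, 2) position basis vectors.
    Returns alpha of shape (r, d_o * d_p): coefficients of the top-r
    principal components in the joint basis (i-major flattening).
    """
    def gram(A, gamma):
        sq = ((A[:, None, :] - A[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * sq)

    K = np.kron(gram(X, gamma_o), gram(Y, gamma_p))   # joint Gram matrix
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n               # centering matrix
    Kc = H @ K @ H
    w, V = np.linalg.eigh(Kc)                         # ascending eigenvalues
    idx = np.argsort(w)[::-1][:r]                     # top-r components
    alpha = V[:, idx].T / np.sqrt(np.maximum(w[idx], 1e-12))[:, None]
    return alpha
```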
Under the framework of kernel principal component analysis, our gradient kernel descriptor (Eq. 5) has the form

F^t_{grad}(P) = \sum_{i=1}^{d_o} \sum_{j=1}^{d_p} \alpha^t_{ij} \left\{ \sum_{z \in P} \tilde{m}(z)\, k_o(\tilde{\theta}(z), x_i)\, k_p(z, y_j) \right\}   (12)

The computational bottleneck of extracting kernel descriptors is evaluating the kernel product k_o k_p between pixel attributes and basis vectors. Fortunately, we can compute the two kernel values separately at a cost of d_o + d_p evaluations per pixel, rather than d_o d_p. Our most expensive kernel descriptor, the shape kernel, takes about 4 seconds in MATLAB to compute on a typical image (300 \times 300 resolution and 16 \times 16 image patches over 8 \times 8 grids). It is about 1.5 seconds for the gradient kernel descriptor, compared to about 0.4 seconds for SIFT under the same setting. A more efficient GPU-based implementation would certainly reduce the computation time for kernel descriptors such that real-time applications become feasible.

4 Experiments

We compare gradient (KDES-G), color (KDES-C), and shape (KDES-S) kernel descriptors to SIFT and several other state-of-the-art object recognition algorithms on four publicly available datasets: Scene-15, Caltech-101, CIFAR-10, and CIFAR10-ImageNet (a subset of ImageNet). For the gradient and shape kernel descriptors and SIFT, all images are transformed into grayscale ([0, 1]) and resized to be no larger than 300 \times 300 pixels with preserved aspect ratio. Image intensity or RGB values are normalized to [0, 1]. We extract all low-level features with 16 \times 16 image patches over dense regular grids with a spacing of 8 pixels. 
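The descriptor of Eq. 12 can be sketched directly in numpy. This is an illustrative sketch under stated assumptions (the patch representation and basis grids are example choices; \alpha would come from the KPCA step); note that the kernel evaluations k_o and k_p are computed separately, in line with the d_o + d_p cost per pixel discussed above.

```python
import numpy as np

def gradient_kernel_descriptor(mag, theta, pos, X, Y, alpha,
                               gamma_o=5.0, gamma_p=3.0, eps_g=0.8):
    """Gradient kernel descriptor (Eq. 12) for one patch.

    mag, theta: (n,) gradient magnitudes/orientations; pos: (n, 2) positions
    in [0, 1]. X: (d_o, 2) orientation basis, Y: (d_p, 2) position basis,
    alpha: (r, d_o * d_p) KPCA coefficients. Returns an (r,) descriptor.
    """
    m = mag / np.sqrt(np.sum(mag ** 2) + eps_g)            # normalized magnitudes
    t = np.stack([np.sin(theta), np.cos(theta)], axis=1)   # theta~(z), Eq. 6
    # Evaluate k_o and k_p separately: d_o + d_p kernel evaluations per pixel.
    Ko = np.exp(-gamma_o * ((t[:, None, :] - X[None]) ** 2).sum(-1))    # (n, d_o)
    Kp = np.exp(-gamma_p * ((pos[:, None, :] - Y[None]) ** 2).sum(-1))  # (n, d_p)
    # Inner sum of Eq. 12: sum_z m~(z) k_o(., x_i) k_p(., y_j), flattened over (i, j).
    S = np.einsum('z,zi,zj->ij', m, Ko, Kp).ravel()        # (d_o * d_p,)
    return alpha @ S                                       # project onto components
```

The final projection is linear in \alpha, so descriptors for many principal components come from a single matrix-vector product.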
We used the publicly available dense SIFT code at http://www.cs.unc.edu/~lazebnik [13], which includes spatial binning, soft binning and truncation (nonlinear cutoff at 0.2), and has been demonstrated to obtain high accuracy for object recognition. For our gradient kernel descriptors we use the same gradient computation as used for the SIFT descriptors. We also evaluate the performance of the combination of the three kernel descriptors (KDES-A), obtained by simply concatenating the image-level feature vectors.

Instead of spatial pyramid kernels, we compute image-level features using efficient match kernels (EMK), which have been shown to produce more accurate quantization. We consider 1 \times 1, 2 \times 2 and 4 \times 4 pyramid sub-regions (see [1]), and perform constrained kernel singular value decomposition (CKSVD) to form image-level features, using 1,000 visual words (basis vectors in CKSVD) learned by K-means from about 100,000 image patch features. We evaluate classification performance with accuracy averaged over 10 random training/testing splits, with the exception of the CIFAR-10 dataset, where we report the accuracy on the test set. We have experimented with both linear SVMs and Laplacian kernel SVMs and found that Laplacian kernel SVMs over efficient match kernel features are always better than linear SVMs (see §4.2). We use Laplacian kernel SVMs in our experiments (except for the tiny image dataset CIFAR-10).

4.1 Hyperparameter Selection

We select kernel parameters using a subset of ImageNet. We retrieve 8 everyday categories from the ImageNet collection: apple, banana, box, coffee mug, computer keyboard, laptop, soda can and water bottle. We choose basis vectors for k_o, k_c and k_p from 25, 5 \times 5 \times 5 and 5 \times 5 uniform grids, respectively, which give sufficient approximations to the original kernels (see also Fig. 2). We optimize the dimensionality of the KPCA and the match kernel parameters jointly using exhaustive grid search. Our experiments suggest that the optimal parameter settings are r = 200 (dimensionality of the kernel descriptors), \gamma_o = 5, \gamma_c = 4, \gamma_b = 2, \gamma_p = 3, \epsilon_g = 0.8 and \epsilon_s = 0.2 (Fig. 3). In the following experiments, we keep these values fixed, even though the performance might improve with task-dependent hyperparameter selection.

4.2 Benchmark Comparisons

Scene-15. Scene-15 is a popular scene recognition benchmark from [13] which contains 15 scene categories with 200 to 400 images each. SIFT features have been extensively used on Scene-15. Following the common experimental setting, we train our models on 1,500 randomly selected images (100 images per category) and test on the rest. We report the averaged accuracy of SIFT, KDES-C, KDES-G, KDES-S, and KDES-A over 10 random training/test splits in Table 1. As we can see, both the gradient and shape kernel descriptors outperform SIFT by a margin, and the two have similar performance. It is not surprising that the intensity kernel descriptor has a lower accuracy, as all the images are grayscale. The combination of the three kernel descriptors further boosts the performance by about 2 percent. Another interesting finding is that Laplacian kernel SVMs are significantly better than linear SVMs (86.7% vs. 81.9% for KDES-A).

In our recognition system, the accuracy of SIFT is 82.2%, compared to 81.4% in spatial pyramid match (SPM). We also tried replacing the SIFT features with our gradient and shape kernel descriptors in SPM, and both obtained 83.5% accuracy, 2 percent higher than the SIFT features. To the best of our knowledge, our gradient kernel descriptor alone outperforms the best published result of 84.2% [27].

Figure 3: Hyperparameter selection. Left: accuracy as a function of feature dimensionality for the orientation kernel (KDES-G) and shape kernel (KDES-S), respectively. Center: accuracy as a function of \epsilon_g and \epsilon_s. Right: accuracy as a function of \gamma_o and \gamma_b.

Methods                  SIFT       KDES-C     KDES-G     KDES-S     KDES-A
Linear SVM               76.7±0.7   38.5±0.4   81.6±0.6   79.8±0.5   81.9±0.6
Laplacian kernel SVM     82.2±0.9   47.9±0.8   85.0±0.6   84.9±0.7   86.7±0.4

Table 1: Comparisons of recognition accuracy on Scene-15: kernel descriptors and their combination vs. SIFT.

Caltech-101. Caltech-101 [15] consists of 9,144 images in 101 object categories and one background category. The number of images per category varies from 31 to 800. Because many researchers have reported their results on Caltech-101, we can directly compare our algorithm to the existing ones. Following the standard experimental setting, we train classifiers on 30 images and test on no more than 50 images per category. We report our results in Table 2. We compare our kernel descriptors with recently published results obtained both by low-level feature learning algorithms, namely convolutional deep belief networks (CDBN), and by sparse coding methods: invariant predictive sparse decomposition (IPSD) and locality-constrained linear coding (LLC). We observe that SIFT features in conjunction with efficient match kernels work well on this dataset and obtain 70.8% accuracy using a single patch size, which beats SPM with the same SIFT features by a large margin. Both our gradient kernel descriptor and shape kernel descriptor are superior to CDBN by a large margin.

We have performed feature extraction with three different patch sizes, 16 \times 16, 25 \times 25 and 31 \times 31, and reached the same conclusion as many other researchers: multiple patch sizes (scales) can boost the performance by a few percent compared to a single patch size. 
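The Laplacian kernel SVMs used throughout these comparisons rely on the kernel k(x, x') = \exp(-\gamma \|x - x'\|_1) over image-level features (the standard Laplacian kernel; the paper does not spell out its form, so this is an assumption). A minimal numpy sketch follows, with closed-form kernel ridge regression standing in for the SVM solver purely for brevity:

```python
import numpy as np

def laplacian_kernel(A, B, gamma=1.0):
    """Laplacian kernel matrix: K[i, j] = exp(-gamma * ||A_i - B_j||_1)."""
    d = np.abs(A[:, None, :] - B[None, :, :]).sum(-1)
    return np.exp(-gamma * d)

def kernel_ridge_fit_predict(Xtr, ytr, Xte, gamma=1.0, lam=1e-3):
    """Kernel ridge classifier as a simple stand-in for a kernel SVM:
    solve (K + lam I) c = y, then predict K(test, train) @ c."""
    K = laplacian_kernel(Xtr, Xtr, gamma)
    coef = np.linalg.solve(K + lam * np.eye(len(Xtr)), ytr)
    return laplacian_kernel(Xte, Xtr, gamma) @ coef
```

Any kernel machine with a precomputed-kernel interface could be plugged in the same way; the point is only that the L1-based Gaussian-like kernel replaces the linear inner product over image-level features.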
Notice that both naive Bayesian nearest neighbor (NBNN) and locality-constrained linear coding should be compared to our kernel descriptors over multiple patch sizes, because both of them use multiple scales to boost performance. Using only our gradient kernel descriptor obtains 75.2% accuracy, higher than the results obtained by all other single-feature-based methods, to the best of our knowledge. Another finding is that the combination of the three kernel descriptors outperforms any single kernel descriptor. We note that better performance has been reported with the use of more image features [6]. Our goal in this paper is to evaluate the strengths of kernel descriptors. To improve accuracy further, kernel descriptors can be combined with other types of image features.

CIFAR-10. CIFAR-10 is a labeled subset of the 80 million tiny images dataset [25, 12]. This dataset consists of 60,000 32 \times 32 color images in 10 categories, with 5,000 images per category as the training set and 1,000 images per category as the test set. Deep belief networks have been extensively investigated on this dataset [21, 22]. We extract kernel descriptors over 8 \times 8 image patches per pixel. Efficient match kernels over the three spatial grids 1 \times 1, 2 \times 2, and 3 \times 3 are used to generate image-level features. The resulting feature vectors have a length of (1+4+9) \times 1000 (visual words) = 14,000 per kernel descriptor. Linear SVMs are trained due to the large number of training images.

SPM [13]    64.4±0.5    kCNN [28]    67.4       KDES-C   40.8±0.9   KDES-C(M)   42.4±0.5
NBNN [2]    73.0        IPSD [10]    56.0       KDES-G   73.3±0.6   KDES-G(M)   75.2±0.4
CDBN [14]   65.5        LLC [26]     73.4±0.5   KDES-S   68.2±0.7   KDES-S(M)   70.3±0.6
SIFT        70.8±0.8    SIFT(M)      73.2±0.5   KDES-A   74.5±0.8   KDES-A(M)   76.4±0.7

Table 2: Comparisons on Caltech-101. Kernel descriptors are compared to recently published results. (M) indicates that features are extracted with multiple image patch sizes.

LR          36.0    GRBM, ZCA'd images   59.6    mRBM        59.7    KDES-C   53.9
SVM         39.5    GRBM                 63.8    cRBM        64.7    KDES-G   66.3
GIST [20]   54.7    fine-tuning GRBM     64.8    mcRBM       68.3    KDES-S   68.2
SIFT        65.6    GRBM two layers      56.6    mcRBM-DBN   71.0    KDES-A   76.0

Table 3: Comparisons on CIFAR-10. Both logistic regression (LR) and the SVM are trained over raw image pixels.

Methods                  SIFT       KDES-C     KDES-G     KDES-S     KDES-A
Laplacian kernel SVMs    66.5±0.4   56.4±0.8   69.0±0.8   70.5±0.7   75.2±0.7

Table 4: Comparisons on CIFAR10-ImageNet, a subset of ImageNet using the 10 CIFAR categories.

We compare our kernel descriptors to deep networks [14, 9] and several baselines in Table 3. One immediate observation is that sophisticated feature extraction is significantly better than raw pixel features: linear logistic regression and linear SVMs over raw pixels only reach accuracies of 36% and 39.5%, respectively, over 30 percent lower than deep belief networks and our kernel descriptors. SIFT features still work well on tiny images and have an accuracy of 65.2%. The color kernel descriptor, KDES-C, has 53.9% accuracy. This result is a bit surprising, since each category has large color variation; a possible explanation is that spatial information helps a lot. To validate this intuition, we also evaluated the color kernel descriptor without spatial information (kernel features extracted on a 1 \times 1 spatial grid) and obtained only 38.5% accuracy, 18 percent lower than the color kernel descriptor over pyramid spatial grids. KDES-G is slightly better than SIFT features. 
The shape kernel descriptor, KDES-S, has an accuracy of 68.2% and is the best single feature on this dataset. Combining the three kernel descriptors, we obtain the best performance of 76%, 5 percent higher than the most sophisticated deep network, mcRBM-DBN, which models pixel means and covariances jointly using factorized third-order Boltzmann machines.
CIFAR-10-ImageNet. Motivated by CIFAR-10, we collect a labeled subset of ImageNet [4] by retrieving the 10 CIFAR-10 categories from ImageNet: Airplane, Automobile, Bird, Cat, Deer, Dog, Frog, Horse, Ship and Truck. The total number of images is 15,561, with more than 1,200 images per category. This dataset is very challenging: multiple objects can appear in one image, objects may be only partially visible, backgrounds are cluttered, and so on. We train models on 1,000 images per category and test on 200 images per category. We report results averaged over 10 random training/test splits in Table 4. We could not finish running deep belief networks in a reasonable time, since they are slow on images of this scale. Both the gradient and shape kernel descriptors achieve higher accuracy than SIFT features, which again confirms that our gradient and shape kernel descriptors outperform SIFT features on high-resolution images of the same categories as CIFAR-10. We also ran the experiments on downsized images, no larger than 50×50 with preserved aspect ratio. We observe that accuracy drops by 4-6 percent compared to the high-resolution images. This validates that high resolution is helpful for object recognition.

5 Conclusion

We have proposed a general framework, kernel descriptors, to extract low-level features from image patches. Our approach is able to turn any pixel attribute into patch-level features in a unified and principled way.
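The kernel view that underlies this framework — the inner product of two magnitude-weighted orientation histograms is a sum of matches over pixel pairs that fall into the same orientation bin — can be checked numerically. The sketch below uses hard binning and synthetic magnitudes and orientations; it is illustrative, not the paper's soft-binned match kernel:

```python
import numpy as np

def orientation_histogram(mags, thetas, bins=8):
    """Hard-binned orientation histogram weighted by gradient magnitude."""
    h = np.zeros(bins)
    for m, t in zip(mags, thetas):
        h[int(bins * t / (2 * np.pi)) % bins] += m
    return h

# synthetic magnitudes and orientations for two patches P and Q
rng = np.random.default_rng(1)
mP, tP = rng.random(50), rng.random(50) * 2 * np.pi
mQ, tQ = rng.random(60), rng.random(60) * 2 * np.pi

# inner product of the two patch histograms
hist_inner = orientation_histogram(mP, tP) @ orientation_histogram(mQ, tQ)

# equivalent match kernel: sum m_p * m_q over pixel pairs sharing a bin
bins = 8
bP = (bins * tP / (2 * np.pi)).astype(int) % bins
bQ = (bins * tQ / (2 * np.pi)).astype(int) % bins
match_kernel = sum(m1 * m2
                   for m1, b1 in zip(mP, bP)
                   for m2, b2 in zip(mQ, bQ) if b1 == b2)

print(np.isclose(hist_inner, match_kernel))  # True
```

Expanding the histogram dot product bin by bin gives exactly the pairwise sum, which is the equivalence the kernel descriptor framework generalizes to soft binning and other pixel attributes.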
Kernel descriptors are based on the insight that the inner product of orientation histograms is a particular match kernel over image patches. We have performed extensive comparisons and confirmed that kernel descriptors outperform both SIFT features and hierarchical feature learning, where the former is the default choice for object recognition and the latter is the most popular low-level feature learning technique. To the best of our knowledge, we are the first to show how kernel methods can be applied to extracting low-level image features and to demonstrate superior performance. This opens up many possibilities for learning low-level features with other kernel methods. Considering the huge success of kernel methods over the last twenty years, we believe that this direction is worth pursuing. In the future, we plan to investigate alternative kernels for low-level feature learning and to learn pixel attributes from large image collections such as ImageNet.

References
[1] L. Bo and C. Sminchisescu. Efficient match kernel between sets of features for visual recognition. In NIPS, 2009.
[2] O. Boiman, E. Shechtman, and M. Irani. In defense of nearest-neighbor based image classification. In CVPR, 2008.
[3] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
[4] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[5] P. Felzenszwalb, D. McAllester, and D. Ramanan. A discriminatively trained, multiscale, deformable part model. In CVPR, 2008.
[6] P. Gehler and S. Nowozin. On feature combination for multiclass object classification. In ICCV, 2009.
[7] K. Grauman and T. Darrell. The pyramid match kernel: discriminative classification with sets of image features. In ICCV, 2005.
[8] D. Haussler. Convolution kernels on discrete structures. Technical report, 1999.
[9] K.
Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun. What is the best multi-stage architecture for object recognition? In ICCV, 2009.
[10] K. Kavukcuoglu, M. Ranzato, R. Fergus, and Y. LeCun. Learning invariant features through topographic filter maps. In CVPR, 2009.
[11] R. Kondor and T. Jebara. A kernel between sets of vectors. In ICML, 2003.
[12] A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.
[13] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In CVPR, 2006.
[14] H. Lee, R. Grosse, R. Ranganath, and A. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML, 2009.
[15] F. Li, R. Fergus, and P. Perona. One-shot learning of object categories. IEEE PAMI, 2006.
[16] D. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60:91-110, 2004.
[17] S. Lyu. Mercer kernels for object recognition with local features. In CVPR, 2005.
[18] K. Mikolajczyk and C. Schmid. A performance evaluation of local descriptors. IEEE PAMI, 27(10):1615-1630, 2005.
[19] T. Ojala, M. Pietikäinen, and T. Mäenpää. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE PAMI, 24(7):971-987, 2002.
[20] A. Oliva and A. Torralba. Modeling the shape of the scene: a holistic representation of the spatial envelope. IJCV, 42(3):145-175, 2001.
[21] M. Ranzato, A. Krizhevsky, and G. Hinton. Factored 3-way restricted Boltzmann machines for modeling natural images. In AISTATS, 2010.
[22] M. Ranzato and G. Hinton. Modeling pixel means and covariances using factorized third-order Boltzmann machines. In CVPR, 2010.
[23] B. Schölkopf, A. Smola, and K. Müller. Nonlinear component analysis as a kernel eigenvalue problem.
Neural Computation, 10:1299-1319, 1998.
[24] S. Smale, L. Rosasco, J. Bouvrie, A. Caponnetto, and T. Poggio. Mathematics of the neural response. Foundations of Computational Mathematics, 10(1):67-91, 2010.
[25] A. Torralba, R. Fergus, and W. Freeman. 80 million tiny images: a large data set for nonparametric object and scene recognition. IEEE PAMI, 30(11):1958-1970, 2008.
[26] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Guo. Locality-constrained linear coding for image classification. In CVPR, 2010.
[27] J. Wu and J. Rehg. Beyond the Euclidean distance: creating effective visual codebooks using the histogram intersection kernel. In ICCV, 2009.
[28] K. Yu, W. Xu, and Y. Gong. Deep learning with kernel regularization for visual recognition. In NIPS, 2008.