{"title": "Convolutional Kernel Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 2627, "page_last": 2635, "abstract": "An important goal in visual recognition is to devise image representations that are invariant to particular transformations. In this paper, we address this goal with a new type of convolutional neural network (CNN) whose invariance is encoded by a reproducing kernel. Unlike traditional approaches where neural networks are learned either to represent data or for solving a classification task, our network learns to approximate the kernel feature map on training data. Such an approach enjoys several benefits over classical ones. First, by teaching CNNs to be invariant, we obtain simple network architectures that achieve a similar accuracy to more complex ones, while being easy to train and robust to overfitting. Second, we bridge a gap between the neural network literature and kernels, which are natural tools to model invariance. We evaluate our methodology on visual recognition tasks where CNNs have proven to perform well, e.g., digit recognition with the MNIST dataset, and the more challenging CIFAR-10 and STL-10 datasets, where our accuracy is competitive with the state of the art.", "full_text": "Convolutional Kernel Networks\n\nJulien Mairal, Piotr Koniusz, Zaid Harchaoui, and Cordelia Schmid\n\nfirstname.lastname@inria.fr\n\nInria\u2217\n\nAbstract\n\nAn important goal in visual recognition is to devise image representations that are\ninvariant to particular transformations. In this paper, we address this goal with a\nnew type of convolutional neural network (CNN) whose invariance is encoded by\na reproducing kernel. Unlike traditional approaches where neural networks are\nlearned either to represent data or for solving a classi\ufb01cation task, our network\nlearns to approximate the kernel feature map on training data.\nSuch an approach enjoys several bene\ufb01ts over classical ones. 
First, by teach-\ning CNNs to be invariant, we obtain simple network architectures that achieve a\nsimilar accuracy to more complex ones, while being easy to train and robust to\nover\ufb01tting. Second, we bridge a gap between the neural network literature and\nkernels, which are natural tools to model invariance. We evaluate our methodol-\nogy on visual recognition tasks where CNNs have proven to perform well, e.g.,\ndigit recognition with the MNIST dataset, and the more challenging CIFAR-10\nand STL-10 datasets, where our accuracy is competitive with the state of the art.\n\n1\n\nIntroduction\n\nWe have recently seen a revival of attention given to convolutional neural networks (CNNs) [22]\ndue to their high performance for large-scale visual recognition tasks [15, 21, 30]. The architecture\nof CNNs is relatively simple and consists of successive layers organized in a hierarchical fashion;\neach layer involves convolutions with learned \ufb01lters followed by a pointwise non-linearity and a\ndownsampling operation called \u201cfeature pooling\u201d. The resulting image representation has been em-\npirically observed to be invariant to image perturbations and to encode complex visual patterns [33],\nwhich are useful properties for visual recognition. Training CNNs remains however dif\ufb01cult since\nhigh-capacity networks may involve billions of parameters to learn, which requires both high com-\nputational power, e.g., GPUs, and appropriate regularization techniques [18, 21, 30].\n\nThe exact nature of invariance that CNNs exhibit is also not precisely understood. Only recently, the\ninvariance of related architectures has been characterized; this is the case for the wavelet scattering\ntransform [8] or the hierarchical models of [7]. Our work revisits convolutional neural networks,\nbut we adopt a signi\ufb01cantly different approach than the traditional one. Indeed, we use kernels [26],\nwhich are natural tools to model invariance [14]. 
Inspired by the hierarchical kernel descriptors\nof [2], we propose a reproducing kernel that produces multi-layer image representations.\n\nOur main contribution is an approximation scheme called convolutional kernel network (CKN) to\nmake the kernel approach computationally feasible. Our approach is a new type of unsupervised\nconvolutional neural network that is trained to approximate the kernel map. Interestingly, our net-\nwork uses non-linear functions that resemble recti\ufb01ed linear units [1, 30], even though they were not\nhandcrafted and naturally emerge from an approximation scheme of the Gaussian kernel map.\n\nBy bridging a gap between kernel methods and neural networks, we believe that we are opening\na fruitful research direction for the future. Our network is learned without supervision since the\n\n\u2217LEAR team, Inria Grenoble, Laboratoire Jean Kuntzmann, CNRS, Univ. Grenoble Alpes, France.\n\n1\n\n\flabel information is only used subsequently in a support vector machine (SVM). Yet, we achieve\ncompetitive results on several datasets such as MNIST [22], CIFAR-10 [20] and STL-10 [13] with\nsimple architectures, few parameters to learn, and no data augmentation. Open-source code for\nlearning our convolutional kernel networks is available on the \ufb01rst author\u2019s webpage.\n\n1.1 Related Work\n\nThere have been several attempts to build kernel-based methods that mimic deep neural networks;\nwe only review here the ones that are most related to our approach.\n\nArc-cosine kernels. Kernels for building deep large-margin classi\ufb01ers have been introduced\nin [10]. The multilayer arc-cosine kernel is built by successive kernel compositions, and each layer\nrelies on an integral representation. Similarly, our kernels rely on an integral representation, and\nenjoy a multilayer construction. 
However, in contrast to arc-cosine kernels: (i) we build our se-\nquence of kernels by convolutions, using local information over spatial neighborhoods (as opposed\nto compositions, using global information); (ii) we propose a new training procedure for learning a\ncompact representation of the kernel in a data-dependent manner.\n\nMultilayer derived kernels. Kernels with invariance properties for visual recognition have been\nproposed in [7]. Such kernels are built with a parameterized \u201cneural response\u201d function, which con-\nsists in computing the maximal response of a base kernel over a local neighborhood. Multiple layers\nare then built by iteratively renormalizing the response kernels and pooling using neural response\nfunctions. Learning is performed by plugging the obtained kernel in an SVM. In contrast to [7], we\npropagate information up, from lower to upper layers, by using sequences of convolutions. Further-\nmore, we propose a simple and effective data-dependent way to learn a compact representation of\nour kernels and show that we obtain near state-of-the-art performance on several benchmarks.\n\nHierarchical kernel descriptors. The kernels proposed in [2, 3] produce multilayer image repre-\nsentations for visual recognition tasks. We discuss in details these kernels in the next section: our\npaper generalizes them and establishes a strong link with convolutional neural networks.\n\n2 Convolutional Multilayer Kernels\n\nThe convolutional multilayer kernel is a generalization of the hierarchical kernel descriptors intro-\nduced in computer vision [2, 3]. The kernel produces a sequence of image representations that are\nbuilt on top of each other in a multilayer fashion. Each layer can be interpreted as a non-linear trans-\nformation of the previous one with additional spatial invariance. We call these layers image feature\nmaps1, and formally de\ufb01ne them as follows:\nDe\ufb01nition 1. 
An image feature map ϕ is a function ϕ : Ω → H, where Ω is a (usually discrete) subset of [0, 1]^d representing normalized "coordinates" in the image and H is a Hilbert space.

For all practical examples in this paper, Ω is a two-dimensional grid and corresponds to different locations in a two-dimensional image. In other words, Ω is a set of pixel coordinates. Given z in Ω, the point ϕ(z) represents some characteristics of the image at location z, or in a neighborhood of z. For instance, a color image of size m × n with three channels, red, green, and blue, may be represented by an initial feature map ϕ_0 : Ω_0 → H_0, where Ω_0 is an m × n regular grid, H_0 is the Euclidean space R^3, and ϕ_0 provides the color pixel values. With the multilayer scheme, non-trivial feature maps will be obtained subsequently, which will encode more complex image characteristics. With this terminology in hand, we now introduce the convolutional kernel, first, for a single layer.

Definition 2 (Convolutional Kernel with Single Layer). Let us consider two images represented by two image feature maps, respectively ϕ and ϕ′ : Ω → H, where Ω is a set of pixel locations, and H is a Hilbert space. The one-layer convolutional kernel between ϕ and ϕ′ is defined as

K(ϕ, ϕ′) := Σ_{z∈Ω} Σ_{z′∈Ω} ‖ϕ(z)‖_H ‖ϕ′(z′)‖_H e^{−(1/(2β²))‖z−z′‖²_2} e^{−(1/(2σ²))‖ϕ̃(z)−ϕ̃′(z′)‖²_H},   (1)

1 In the kernel literature, "feature map" denotes the mapping between data points and their representation in a reproducing kernel Hilbert space (RKHS) [26].
Here, feature maps refer to spatial maps representing local image characteristics at every location, as usual in the neural network literature [22].

where β and σ are smoothing parameters of Gaussian kernels, and ϕ̃(z) := (1/‖ϕ(z)‖_H) ϕ(z) if ϕ(z) ≠ 0 and ϕ̃(z) = 0 otherwise. Similarly, ϕ̃′(z′) is a normalized version of ϕ′(z′).2

It is easy to show that the kernel K is positive definite (see Appendix A). It consists of a sum of pairwise comparisons between the image features ϕ(z) and ϕ′(z′) computed at all spatial locations z and z′ in Ω. To be significant in the sum, a comparison needs the corresponding z and z′ to be close in Ω, and the normalized features ϕ̃(z) and ϕ̃′(z′) to be close in the feature space H. The parameters β and σ respectively control these two definitions of "closeness". Indeed, when β is large, the kernel K is invariant to the positions z and z′, but when β is small, only features placed at the same location z = z′ are compared to each other. Therefore, the role of β is to control how much the kernel is locally shift-invariant. Next, we will show how to go beyond one single layer, but before that, we present concrete examples of simple input feature maps ϕ_0 : Ω_0 → H_0.

Gradient map. Assume that H_0 = R² and that ϕ_0(z) provides the two-dimensional gradient of the image at pixel z, which is often computed with first-order differences along each dimension. Then, the quantity ‖ϕ_0(z)‖_{H_0} is the gradient intensity, and ϕ̃_0(z) is its orientation, which can be characterized by a particular angle; that is, there exists θ in [0; 2π] such that ϕ̃_0(z) = [cos(θ), sin(θ)].
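As a minimal illustration of this construction, the gradient map and its intensity/orientation decomposition can be sketched as follows (a NumPy sketch with hypothetical function and array names, not the authors' implementation):

```python
import numpy as np

def gradient_feature_map(image, eps=1e-8):
    """Compute phi_0, its intensity ||phi_0(z)||, and its orientation.

    `image` is a 2-D array of gray levels; first-order differences
    approximate the gradient, as described in the text.
    """
    gx = np.zeros_like(image, dtype=float)
    gy = np.zeros_like(image, dtype=float)
    gx[:, :-1] = image[:, 1:] - image[:, :-1]   # horizontal differences
    gy[:-1, :] = image[1:, :] - image[:-1, :]   # vertical differences
    phi0 = np.stack([gx, gy], axis=-1)          # phi_0 : Omega_0 -> R^2
    intensity = np.linalg.norm(phi0, axis=-1)   # gradient intensity
    # Normalized gradient [cos(theta), sin(theta)]; zero where intensity is ~0.
    orientation = phi0 / np.maximum(intensity[..., None], eps)
    return phi0, intensity, orientation
```

At every pixel with non-zero gradient, the returned orientation has unit norm, matching the normalization ϕ̃_0(z) in the definition above.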
The resulting kernel K is exactly the kernel descriptor introduced in [2, 3] for natural image patches.

Patch map. In that setting, ϕ_0 associates to a location z an image patch of size m × m centered at z. Then, the space H_0 is simply R^{m×m}, and ϕ̃_0(z) is a contrast-normalized version of the patch, which is a useful transformation for visual recognition according to classical findings in computer vision [19]. When the image is encoded with three color channels, patches are of size m × m × 3. We now define the multilayer convolutional kernel, generalizing some ideas of [2].

Definition 3 (Multilayer Convolutional Kernel). Let us consider a set Ω_{k−1} ⊆ [0, 1]^d and a Hilbert space H_{k−1}. We build a new set Ω_k and a new Hilbert space H_k as follows:
(i) choose a patch shape P_k defined as a bounded symmetric subset of [−1, 1]^d, and a set of coordinates Ω_k such that for all locations z_k in Ω_k, the patch {z_k} + P_k is a subset of Ω_{k−1};3 In other words, each coordinate z_k in Ω_k corresponds to a valid patch in Ω_{k−1} centered at z_k.
(ii) define the convolutional kernel K_k on the "patch" feature maps P_k → H_{k−1}, by replacing in (1): Ω by P_k, H by H_{k−1}, and σ, β by appropriate smoothing parameters σ_k, β_k. We denote by H_k the Hilbert space for which the positive definite kernel K_k is reproducing.

An image represented by a feature map ϕ_{k−1} : Ω_{k−1} → H_{k−1} at layer k−1 is now encoded in the k-th layer as ϕ_k : Ω_k → H_k, where for all z_k in Ω_k, ϕ_k(z_k) is the representation in H_k of the patch feature map z ↦ ϕ_{k−1}(z_k + z) for z in P_k. Concretely, the kernel K_k between two patches of ϕ_{k−1} and ϕ′_{k−1} at respective locations z_k and z′_k is

Σ_{z∈P_k} Σ_{z′∈P_k} ‖ϕ_{k−1}(z_k+z)‖ ‖ϕ′_{k−1}(z′_k+z′)‖ e^{−(1/(2β_k²))‖z−z′‖²_2} e^{−(1/(2σ_k²))‖ϕ̃_{k−1}(z_k+z)−ϕ̃′_{k−1}(z′_k+z′)‖²},   (2)

where ‖.‖ is the Hilbertian norm of H_{k−1}. In Figure 1(a), we illustrate the interactions between the sets of coordinates Ω_k, patches P_k, and feature spaces H_k across layers. For two-dimensional grids, a typical patch shape is a square, for example P := {−1/n, 0, 1/n} × {−1/n, 0, 1/n} for a 3 × 3 patch in an image of size n × n. Information encoded in the k-th layer differs from the (k−1)-th one in two aspects: first, each point ϕ_k(z_k) in layer k contains information about several points from the (k−1)-th layer and can possibly represent larger patterns; second, the new feature map is more locally shift-invariant than the previous one due to the term involving the parameter β_k in (2).

The multilayer convolutional kernel slightly differs from the hierarchical kernel descriptors of [2] but exploits similar ideas. Bo et al. [2] define indeed several ad hoc kernels for representing local information in images, such as gradient, color, or shape. These kernels are close to the one defined in (1) but with a few variations. Some of them do not use normalized features ϕ̃(z), and these kernels use different weighting strategies for the summands of (1) that are specialized to the image modality, e.g., color, or gradient, whereas we use the same weight ‖ϕ(z)‖_H ‖ϕ′(z′)‖_H for all kernels.
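For concreteness, the single-layer kernel (1) can be evaluated directly as a double sum over locations; the sketch below (my own illustration, not the authors' code) compares two small feature maps stored as arrays:

```python
import numpy as np

def conv_kernel_single_layer(phi, phi_prime, beta, sigma):
    """Evaluate the one-layer convolutional kernel K(phi, phi') of Eq. (1).

    `phi` and `phi_prime` have shape (h, w, d): a feature vector in R^d at
    every 2-D location z on a normalized grid in [0, 1]^2.
    """
    h, w, d = phi.shape
    # Normalized grid coordinates z in [0, 1]^2.
    ys, xs = np.meshgrid(np.linspace(0, 1, h), np.linspace(0, 1, w), indexing="ij")
    coords = np.stack([ys, xs], axis=-1).reshape(-1, 2)
    f, g = phi.reshape(-1, d), phi_prime.reshape(-1, d)
    nf, ng = np.linalg.norm(f, axis=1), np.linalg.norm(g, axis=1)
    # Normalized features (zero vectors stay zero, as in the definition).
    ft = f / np.maximum(nf, 1e-12)[:, None]
    gt = g / np.maximum(ng, 1e-12)[:, None]
    dz = np.sum((coords[:, None, :] - coords[None, :, :]) ** 2, axis=-1)
    df = np.sum((ft[:, None, :] - gt[None, :, :]) ** 2, axis=-1)
    weights = nf[:, None] * ng[None, :]
    return float(np.sum(weights
                        * np.exp(-dz / (2 * beta**2))
                        * np.exp(-df / (2 * sigma**2))))
```

With a large β the spatial Gaussian is close to 1 and all pairs of locations are compared, while a small β restricts the sum to nearby z and z′, which is exactly the shift-invariance trade-off discussed above.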
The generic formulation (1) that we propose may be useful per se, but our main contribution comes in the next section, where we use the kernel as a new tool for learning convolutional neural networks.

2 When Ω is not discrete, the sum in (1) should be replaced by a Lebesgue integral in the paper.
3 For two sets A and B, the Minkowski sum A + B is defined as {a + b : a ∈ A, b ∈ B}.

[Figure 1 shows two diagrams: (a) the hierarchy of image feature maps ϕ_0, ϕ_1, ϕ_2 over Ω_0, Ω_1, Ω_2 with patches {z_1} + P_1 and {z_2} + P_2; (b) one CKN layer, with patch extraction ψ_{k−1}(z_{k−1}), convolution + non-linearity producing ζ_k, and Gaussian filtering + downsampling (= pooling) producing ξ_k.]

Figure 1: Left: concrete representation of the successive layers for the multilayer convolutional kernel. Right: one layer of the convolutional neural network that approximates the kernel.

3 Training Invariant Convolutional Kernel Networks

Generic schemes have been proposed for approximating a non-linear kernel with a linear one, such as the Nyström method and its variants [5, 31], or random sampling techniques in the Fourier domain for shift-invariant kernels [24]. In the context of convolutional multilayer kernels, such an approximation is critical because computing the full kernel matrix on a database of images is computationally infeasible, even for a moderate number of images (≈ 10 000) and a moderate number of layers. For this reason, Bo et al.
[2] use the Nyström method for their hierarchical kernel descriptors.

In this section, we show that when the coordinate sets Ω_k are two-dimensional regular grids, a natural approximation for the multilayer convolutional kernel consists of a sequence of spatial convolutions with learned filters, pointwise non-linearities, and pooling operations, as illustrated in Figure 1(b). More precisely, our scheme approximates the kernel map of K defined in (1) at layer k by finite-dimensional spatial maps ξ_k : Ω′_k → R^{p_k}, where Ω′_k is a set of coordinates related to Ω_k, and p_k is a positive integer controlling the quality of the approximation. Consider indeed two images represented at layer k by image feature maps ϕ_k and ϕ′_k, respectively. Then,

(A) the corresponding maps ξ_k and ξ′_k are learned such that K(ϕ_{k−1}, ϕ′_{k−1}) ≈ ⟨ξ_k, ξ′_k⟩, where ⟨., .⟩ is the Euclidean inner-product acting as if ξ_k and ξ′_k were vectors in R^{|Ω′_k| p_k};

(B) the set Ω′_k is linked to Ω_k by the relation Ω′_k = Ω_k + P′_k, where P′_k is a patch shape, and the quantities ϕ_k(z_k) in H_k admit finite-dimensional approximations ψ_k(z_k) in R^{|P′_k| p_k}; as illustrated in Figure 1(b), ψ_k(z_k) is a patch from ξ_k centered at location z_k with shape P′_k;

(C) an activation map ζ_k : Ω_{k−1} → R^{p_k} is computed from ξ_{k−1} by convolution with p_k filters followed by a non-linearity. The subsequent map ξ_k is obtained from ζ_k by a pooling operation.

We call this approximation scheme a convolutional kernel network (CKN).
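Steps (B) and (C) can be sketched in a few lines of NumPy; this is an illustrative single-layer forward pass under my own simplifying assumptions (1-D "images", valid patches only, unit pixel spacing), not the authors' released code:

```python
import numpy as np

def ckn_layer(xi_prev, W, eta, sigma, gamma):
    """One CKN layer applied to a 1-D map `xi_prev` of shape (n, p_prev).

    W: filters of shape (patch_len * p_prev, p); eta: weights of shape (p,).
    Returns the pooled map xi.
    """
    n, p_prev = xi_prev.shape
    m = W.shape[0] // p_prev                          # patch length
    # (B) extract a patch psi(z) around each valid location z.
    psi = np.stack([xi_prev[i:i + m].ravel() for i in range(n - m + 1)])
    norms = np.linalg.norm(psi, axis=1)
    psi_t = psi / np.maximum(norms, 1e-12)[:, None]   # l2-normalized patches
    # (C) convolution + pointwise non-linearity.
    d2 = ((psi_t[:, :, None] - W[None, :, :]) ** 2).sum(axis=1)
    zeta = norms[:, None] * np.sqrt(eta)[None, :] * np.exp(-d2 / sigma**2)
    # (C) Gaussian linear pooling with subsampling factor gamma
    #     (beta = gamma times the pixel spacing, here 1).
    z = np.arange(zeta.shape[0], dtype=float)
    u = z[::gamma]
    pool = np.exp(-(u[:, None] - z[None, :]) ** 2 / float(gamma) ** 2)
    return np.sqrt(2 / np.pi) * pool @ zeta
```

Stacking several such calls, each consuming the previous output, mimics the multilayer construction; only the cost function used to learn `W` and `eta` distinguishes this from a standard unsupervised CNN layer.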
In comparison to CNNs, our approach enjoys similar benefits such as efficient prediction at test time, and involves the same set of hyper-parameters: number of layers, numbers of filters p_k at layer k, shape P′_k of the filters, sizes of the feature maps. The other parameters β_k, σ_k can be automatically chosen, as discussed later. Training a CKN can be argued to be as simple as training a CNN in an unsupervised manner [25] since we will show that the main difference is in the cost function that is optimized.

3.1 Fast Approximation of the Gaussian Kernel

A key component of our formulation is the Gaussian kernel. We start by approximating it by a linear operation with learned filters followed by a pointwise non-linearity. Our starting point is the next lemma, which can be obtained after a simple calculation.

Lemma 1 (Linear expansion of the Gaussian Kernel). For all x and x′ in R^m, and σ > 0,

e^{−(1/(2σ²))‖x−x′‖²_2} = (2/(πσ²))^{m/2} ∫_{w∈R^m} e^{−(1/σ²)‖x−w‖²_2} e^{−(1/σ²)‖x′−w‖²_2} dw.   (3)

The lemma gives us a mapping of any x in R^m to the function w ↦ √C e^{−(1/σ²)‖x−w‖²_2} in L2(R^m), where the kernel is linear, and C is the constant in front of the integral. To obtain a finite-dimensional representation, we need to approximate the integral with a weighted finite sum, which is a classical problem arising in statistics (see [29] and chapter 8 of [6]). Then, we consider two different cases.

Small dimension, m ≤ 2. When the data lives in a compact set of R^m, the integral in (3) can be approximated by uniform sampling over a large enough set. We choose such a strategy for two types of kernels from Eq.
(1): (i) the spatial kernels e^{−(1/(2β²))‖z−z′‖²_2}; (ii) the terms e^{−(1/(2σ²))‖ϕ̃(z)−ϕ̃′(z′)‖²_H} when ϕ is the "gradient map" presented in Section 2. In the latter case, H = R² and ϕ̃(z) is the gradient orientation. We typically sample a few orientations as explained in Section 4.

Higher dimensions. To prevent the curse of dimensionality, we learn to approximate the kernel on training data, which is intrinsically low-dimensional. We optimize importance weights η = [η_l]^p_{l=1} in R^p_+ and sampling points W = [w_l]^p_{l=1} in R^{m×p} on n training pairs (x_i, y_i)_{i=1,...,n} in R^m × R^m:

min_{η∈R^p_+, W∈R^{m×p}} (1/n) Σ_{i=1}^n ( e^{−(1/(2σ²))‖x_i−y_i‖²_2} − Σ_{l=1}^p η_l e^{−(1/σ²)‖x_i−w_l‖²_2} e^{−(1/σ²)‖y_i−w_l‖²_2} )².   (4)

Interestingly, we may already draw some links with neural networks. When applied to unit-norm vectors x_i and y_i, problem (4) produces sampling points w_l whose norm is close to one. After learning, a new unit-norm point x in R^m is mapped to the vector [√η_l e^{−(1/σ²)‖x−w_l‖²_2}]^p_{l=1} in R^p, which may be written as [f(w_l^⊤ x)]^p_{l=1}, assuming that the norm of w_l is always one, where f is the function u ↦ e^{(2/σ²)(u−1)} for u = w_l^⊤ x in [−1, 1]. Therefore, the finite-dimensional representation of x only involves a linear operation followed by a non-linearity, as in typical neural networks. In Figure 2, we show that the shape of f resembles the "rectified linear unit" function [30].

[Figure 2 plots f(u) = e^{(2/σ²)(u−1)} against f(u) = max(u, 0) for u in [−1, 1].]

Figure 2: In dotted red, we plot the "rectified linear unit" function u ↦ max(u, 0).
In blue, we plot the non-linear functions of our network for typical values of σ that we use in our experiments.

3.2 Approximating the Multilayer Convolutional Kernel

We have now all the tools in hand to build our convolutional kernel network. We start by making assumptions on the input data, and then present the learning scheme and its approximation principles.

The zeroth layer. We assume that the input data is a finite-dimensional map ξ_0 : Ω′_0 → R^{p_0}, and that ϕ_0 : Ω_0 → H_0 "extracts" patches from ξ_0. Formally, there exists a patch shape P′_0 such that Ω′_0 = Ω_0 + P′_0, H_0 = R^{p_0 |P′_0|}, and for all z_0 in Ω_0, ϕ_0(z_0) is a patch of ξ_0 centered at z_0. Then, property (B) described at the beginning of Section 3 is satisfied for k = 0 by choosing ψ_0 = ϕ_0. The examples of input feature maps given earlier satisfy this finite-dimensional assumption: for the gradient map, ξ_0 is the gradient of the image along each direction, with p_0 = 2, P′_0 = {0} is a 1×1 patch, Ω_0 = Ω′_0, and ϕ_0 = ξ_0; for the patch map, ξ_0 is the input image, say with p_0 = 3 for RGB data.

The convolutional kernel network. The zeroth layer being characterized, we present in Algorithms 1 and 2 the subsequent layers and how to learn their parameters in a feedforward manner. It is interesting to note that the input parameters of the algorithm are exactly the same as a CNN's, that is, number of layers and filters, sizes of the patches and feature maps (obtained here via the subsampling factor). Ultimately, CNNs and CKNs only differ in the cost function that is optimized for learning the filters and in the choice of non-linearities. As we show next, there exists a link
As we show next, there exists a link between\nthe parameters of a CKN and those of a convolutional multilayer kernel.\n\n5\n\n\fAlgorithm 1 Convolutional kernel network - learning the parameters of the k-th layer.\ninput \u03be1\n\nk\u20131 \u2192 Rpk\u20131 (sequence of (k\u20131)-th maps obtained from training images);\n\nk\u20131, . . . : \u2126\u2032\n\nk\u20131, \u03be2\n\nP \u2032\nk\u20131 (patch shape); pk (number of \ufb01lters); n (number of training pairs);\n\nk\u20131 from the maps \u03be1\n1: extract at random n pairs (xi, yi) of patches with shape P \u2032\n2: if not provided by the user, set \u03c3k to the 0.1 quantile of the data (kxi \u2212 yik2)n\n3: unsupervised learning: optimize (4) to obtain the \ufb01lters Wk in R|P \u2032\noutput Wk, \u03b7k, and \u03c3k (smoothing parameter);\n\nk\u20131, . . .;\n\nk\u20131, \u03be2\ni=1;\n\nk\u20131|pk\u20131\u00d7pk and \u03b7k in Rpk ;\n\nAlgorithm 2 Convolutional kernel network - computing the k-th map form the (k\u20131)-th one.\ninput \u03bek\u20131 : \u2126\u2032\n\nber of \ufb01lters); \u03c3k (smoothing parameter); Wk = [wkl]pk\n\nk\u20131 \u2192 Rpk\u20131 (input map); P \u2032\n\nk\u20131 (patch shape); \u03b3k \u2265 1 (subsampling factor); pk (num-\nl=1 (layer parameters);\n\nl=1 and \u03b7k = [\u03b7kl]pk\n\n1: convolution and non-linearity: de\ufb01ne the activation map \u03b6k : \u2126k\u20131 \u2192 Rpk as\n\n\u03b6k : z 7\u2192 k\u03c8k\u20131(z)k2(cid:20)\u221a\u03b7kle\n\n\u2212 1\n\u03c32\n\nk k \u02dc\u03c8k\u20131(z)\u2212wklk2\n\n2(cid:21)pk\n\nl=1\n\n,\n\n(5)\n\nwhere \u03c8k\u20131(z) is a vector representing a patch from \u03bek\u20131 centered at z with shape P \u2032\nk\u20131, and the\nvector \u02dc\u03c8k\u20131(z) is an \u21132-normalized version of \u03c8k\u20131(z). 
This operation can be interpreted as a spatial convolution of the map ξ_{k−1} with the filters w_{kl} followed by pointwise non-linearities;
2: set β_k to be γ_k times the spacing between two pixels in Ω_{k−1};
3: feature pooling: Ω′_k is obtained by subsampling Ω_{k−1} by a factor γ_k and we define a new map ξ_k : Ω′_k → R^{p_k} obtained from ζ_k by linear pooling with Gaussian weights:

ξ_k : z ↦ √(2/π) Σ_{u∈Ω_{k−1}} e^{−(1/β_k²)‖u−z‖²_2} ζ_k(u).   (6)

output ξ_k : Ω′_k → R^{p_k} (new map);

Approximation principles. We proceed recursively to show that the kernel approximation property (A) is satisfied; we assume that (B) holds at layer k−1, and then, we show that (A) and (B) also hold at layer k. This is sufficient for our purpose since we have previously assumed (B) for the zeroth layer. Given two image feature maps ϕ_{k−1} and ϕ′_{k−1}, we start by approximating K(ϕ_{k−1}, ϕ′_{k−1}) by replacing ϕ_{k−1}(z) and ϕ′_{k−1}(z′) by their finite-dimensional approximations provided by (B):

K(ϕ_{k−1}, ϕ′_{k−1}) ≈ Σ_{z,z′∈Ω_{k−1}} ‖ψ_{k−1}(z)‖_2 ‖ψ′_{k−1}(z′)‖_2 e^{−(1/(2β_k²))‖z−z′‖²_2} e^{−(1/(2σ_k²))‖ψ̃_{k−1}(z)−ψ̃′_{k−1}(z′)‖²_2}.   (7)

Then, we use the finite-dimensional approximation of the Gaussian kernel involving σ_k and obtain

K(ϕ_{k−1}, ϕ′_{k−1}) ≈ Σ_{z,z′∈Ω_{k−1}} ζ_k(z)^⊤ ζ′_k(z′) e^{−(1/(2β_k²))‖z−z′‖²_2},   (8)

where ζ_k is defined in (5) and ζ′_k is defined similarly by replacing ψ̃ by ψ̃′. Finally, we approximate the remaining Gaussian kernel by uniform sampling on Ω′_k, following Section 3.1. After exchanging sums and grouping appropriate terms together, we obtain the new approximation

K(ϕ_{k−1}, ϕ′_{k−1}) ≈ (2/π) Σ_{u∈Ω′_k} ( Σ_{z∈Ω_{k−1}} e^{−(1/β_k²)‖z−u‖²_2} ζ_k(z) )^⊤ ( Σ_{z′∈Ω_{k−1}} e^{−(1/β_k²)‖z′−u‖²_2} ζ′_k(z′) ),   (9)

where the constant 2/π comes from the multiplication of the constant 2/(πβ_k²) from (3) and the weight β_k² of uniform sampling, corresponding to the square of the distance between two pixels of Ω′_k.4 As a result, the right-hand side is exactly ⟨ξ_k, ξ′_k⟩, where ξ_k is defined in (6), giving us property (A).

It remains to show that property (B) also holds, specifically that the quantity (2) can be approximated by the Euclidean inner-product ⟨ψ_k(z_k), ψ′_k(z′_k)⟩ with the patches ψ_k(z_k) and ψ′_k(z′_k) of shape P′_k; we assume for that purpose that P′_k is a subsampled version of the patch shape P_k by a factor γ_k.

4 The choice of β_k in Algorithm 2 is driven by signal processing principles. The feature pooling step can indeed be interpreted as a downsampling operation that reduces the resolution of the map from Ω_{k−1} to Ω_k by using a Gaussian anti-aliasing filter, whose role is to reduce frequencies above the Nyquist limit.

We remark that the kernel (2) is the same as (1) applied to layer k−1 by replacing Ω_{k−1} by {z_k} + P_k. By doing the same substitution in (9), we immediately obtain an approximation of (2).
Then, all Gaussian terms are negligible for all u and z that are far from each other, say when ‖u−z‖_2 ≥ 2β_k. Thus, we may replace the sums Σ_{u∈Ω′_k} Σ_{z,z′∈Ω_{k−1}} by Σ_{u∈{z_k}+P′_k} Σ_{z,z′∈{z_k}+P_k}, which has the same set of "non-negligible" terms. This yields exactly the approximation ⟨ψ_k(z_k), ψ′_k(z′_k)⟩.

Optimization. Regarding problem (4), stochastic gradient descent (SGD) may be used since a potentially infinite amount of training data is available. However, we have preferred to use L-BFGS-B [9] on 300 000 pairs of randomly selected training data points, and initialize W with the K-means algorithm. L-BFGS-B is a parameter-free state-of-the-art batch method, which is not as fast as SGD but much easier to use. We always run the L-BFGS-B algorithm for 4 000 iterations, which seems to ensure convergence to a stationary point. Our goal is to demonstrate the preliminary performance of a new type of convolutional network, and we leave as future work any speed improvement.

4 Experiments

We now present experiments that were performed using Matlab and an L-BFGS-B solver [9] interfaced by Stephen Becker. Each image is represented by the last map ξ_k of the CKN, which is used in a linear SVM implemented in the software package LibLinear [16]. These representations are centered, rescaled to have unit ℓ2-norm on average, and the regularization parameter of the SVM is always selected on a validation set or by 5-fold cross-validation in the range 2^i, i = −15, . . . , 15. The patches P′_k are typically small; we tried the sizes m × m with m = 3, 4, 5 for the first layer, and m = 2, 3 for the upper ones. The number of filters p_k in our experiments is in the set {50, 100, 200, 400, 800}.
The downsampling factor γ_k is always chosen to be 2 between two consecutive layers, whereas the last layer is downsampled to produce final maps ξ_k of a small size, say, 5×5 or 4×4. For the gradient map ϕ_0, we approximate the Gaussian kernel e^{−(1/(2σ_1²))‖ϕ̃_0(z)−ϕ̃′_0(z′)‖²_{H_0}} by uniformly sampling p_1 = 12 orientations, setting σ_1 = 2π/p_1. Finally, we also use a small offset ε to prevent numerical instabilities in the normalization steps ψ̃(z) = ψ(z)/max(‖ψ(z)‖_2, ε).

4.1 Discovering the Structure of Natural Image Patches

Unsupervised learning was first used for discovering the underlying structure of natural image patches by Olshausen and Field [23]. Without making any a priori assumption about the data except a parsimony principle, the method is able to produce small prototypes that resemble Gabor wavelets, that is, spatially localized oriented basis functions. The results were found impressive by the scientific community and their work received substantial attention. It is also known that such results can be achieved with CNNs [25]. We show in this section that this is also the case for convolutional kernel networks, even though they are not explicitly trained to reconstruct data.

Following [23], we randomly select a database of 300 000 whitened natural image patches of size 12 × 12 and learn p = 256 filters W using the formulation (4). We initialize W with Gaussian random noise without performing the K-means step, in order to ensure that the output we obtain is not an artifact of the initialization. In Figure 3, we display the filters associated to the top-128 largest weights η_l. Among the 256 filters, 197 exhibit interpretable Gabor-like structures and the rest was less interpretable.
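The filter-learning step, i.e., problem (4), can be prototyped in a few lines; below is an illustrative SciPy sketch (hypothetical function name, random initialization, finite-difference gradients, and a capped iteration budget, all simplified relative to the paper's K-means-initialized L-BFGS-B setup):

```python
import numpy as np
from scipy.optimize import minimize

def learn_gaussian_approx(X, Y, p=8, sigma=1.0, seed=0):
    """Fit weights eta >= 0 and points W so that sum_l eta_l k(x, w_l) k(y, w_l)
    approximates the Gaussian kernel k(x, y), as in problem (4)."""
    n, m = X.shape
    rng = np.random.default_rng(seed)
    target = np.exp(-np.sum((X - Y) ** 2, axis=1) / (2 * sigma**2))

    def objective(theta):
        eta, W = theta[:p], theta[p:].reshape(p, m)
        kx = np.exp(-((X[:, None, :] - W[None, :, :]) ** 2).sum(-1) / sigma**2)
        ky = np.exp(-((Y[:, None, :] - W[None, :, :]) ** 2).sum(-1) / sigma**2)
        residual = target - (kx * ky) @ eta
        return np.mean(residual**2)

    theta0 = np.concatenate([np.ones(p) / p, rng.normal(size=p * m)])
    bounds = [(0, None)] * p + [(None, None)] * (p * m)   # eta >= 0
    res = minimize(objective, theta0, method="L-BFGS-B",
                   bounds=bounds, options={"maxiter": 50})
    return res.x[:p], res.x[p:].reshape(p, m), res.fun
```

A production version would supply the analytic gradient and run far more iterations, but even this sketch exhibits the behavior described above: on unit-norm inputs, the learned w_l tend toward unit norm, making each coordinate of the map a linear operation followed by the non-linearity f.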
To the best of our knowledge, this is the first time that the explicit kernel map of the Gaussian kernel for whitened natural image patches is shown to be related to Gabor wavelets.

4.2 Digit Classification on MNIST

The MNIST dataset [22] consists of 60 000 images of handwritten digits for training and 10 000 for testing. We use two types of initial maps in our networks: the "patch map", denoted by CKN-PM, and the "gradient map", denoted by CKN-GM. We follow the evaluation methodology of [25] for comparison when varying the training set size. We select the regularization parameter of the SVM by 5-fold cross-validation when the training size is smaller than 20 000; otherwise, we keep 10 000 examples from the training set for validation. We report in Table 1 the results obtained for four simple architectures. CKN-GM1 is the simplest one: its second layer uses 3 × 3 patches and only p2 = 50 filters, resulting in a network with 5 400 parameters. Yet, it achieves an outstanding performance of 0.58% error on the full dataset.

Figure 3: Filters obtained by the first layer of the convolutional kernel network on natural images.

Tr. size | CNN [25] | Scat-1 [8] | Scat-2 [8] | CKN-GM1 (12/50) | CKN-GM2 (12/400) | CKN-PM1 (200) | CKN-PM2 (50/200) | [32] | [18] | [19]
300      | 7.18     | 4.7        | 5.6        | 4.39            | 4.24             | 5.98          | 4.15             | NA   | NA   | NA
1K       | 3.21     | 2.3        | 2.6        | 2.60            | 2.05             | 3.23          | 2.76             | NA   | NA   | NA
2K       | 2.53     | 1.3        | 1.8        | 1.85            | 1.51             | 1.97          | 2.28             | NA   | NA   | NA
5K       | 1.52     | 1.03       | 1.4        | 1.41            | 1.21             | 1.41          | 1.56             | NA   | NA   | NA
10K      | 0.85     | 0.88       | 1          | 1.17            | 0.88             | 1.18          | 1.10             | NA   | NA   | NA
20K      | 0.76     | 0.79       | 0.58       | 0.89            | 0.60             | 0.83          | 0.77             | NA   | NA   | NA
40K      | 0.65     | 0.74       | 0.53       | 0.68            | 0.51             | 0.64          | 0.58             | NA   | NA   | NA
60K      | 0.53     | 0.70       | 0.4        | 0.58            | 0.39             | 0.63          | 0.53             | 0.47 | 0.45 | 0.53

Table 1: Test error in % for various approaches on the MNIST dataset without data augmentation. The numbers in parentheses represent the sizes p1 and p2 of the feature maps at each layer.
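The evaluation pipeline described above (centering the CKN image representations, rescaling them to have unit ℓ2-norm on average, and selecting the SVM's regularization parameter on the grid $2^i$, $i = -15, \dots, 15$) can be sketched as follows; the toy array is a stand-in for the final maps ξk, and the paper's LibLinear SVM is not reproduced here:

```python
import numpy as np

# Post-process CKN representations before feeding them to a linear SVM:
# center, then rescale so the average l2-norm over images equals one.
rng = np.random.default_rng(0)
Z = rng.standard_normal((100, 64))   # toy stand-in: 100 images, 64-dim maps

Zc = Z - Z.mean(axis=0)                        # centering
Zs = Zc / np.linalg.norm(Zc, axis=1).mean()    # unit l2-norm on average

# Candidate values for the SVM regularization parameter, 2^i for i=-15..15,
# to be selected on a validation set or by 5-fold cross-validation.
C_grid = 2.0 ** np.arange(-15, 16)
```

Rescaling by a single scalar preserves the centering, so both properties hold simultaneously.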
The best performing, CKN-GM2, is similar to CKN-GM1 but uses p2 = 400 filters. When working with raw patches, two layers (CKN-PM2) give better results than one layer. More details about the network architectures are provided in the supplementary material. In general, our method achieves state-of-the-art accuracy for this task, since lower error rates have only been reported by using data augmentation [11].

4.3 Visual Recognition on CIFAR-10 and STL-10

We now move to the more challenging datasets CIFAR-10 [20] and STL-10 [13]. We select the best architectures on a validation set of 10 000 examples from the training set for CIFAR-10, and by 5-fold cross-validation on STL-10. We report in Table 2 results for CKN-GM, defined in the previous section, without exploiting color information, and for CKN-PM when working on raw RGB patches whose mean color is subtracted. The best selected models always have two layers, with 800 filters for the top layer. Since CKN-PM and CKN-GM exploit different information, we also report a combination of the two models, CKN-CO, obtained by concatenating their normalized image representations. The standard deviations for STL-10 were always below 0.7%. Our approach appears to be competitive with the state of the art, especially on STL-10, where only one method does better than ours, despite the fact that our models use only two layers and require learning few parameters. Note that better results than those reported in Table 2 have been obtained in the literature by using either data augmentation (around 90% on CIFAR-10 for [18, 30]) or external data (around 70% on STL-10 for [28]).
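The CKN-CO combination above amounts to a simple concatenation of normalized representations; a minimal sketch, with random arrays standing in for the CKN-GM and CKN-PM outputs:

```python
import numpy as np

# Combine two models by l2-normalizing each image representation and
# concatenating them, so a single linear SVM can exploit both.
rng = np.random.default_rng(0)
Z_gm = rng.standard_normal((100, 64))    # stand-in for CKN-GM representations
Z_pm = rng.standard_normal((100, 128))   # stand-in for CKN-PM representations

def l2_normalize(Z, eps=1e-12):
    return Z / np.maximum(np.linalg.norm(Z, axis=1, keepdims=True), eps)

Z_co = np.hstack([l2_normalize(Z_gm), l2_normalize(Z_pm)])
```

Normalizing each block first keeps the two models on an equal footing regardless of their dimensions: every combined representation has norm √2.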
We are planning to investigate similar data manipulations in the future.

Method   | [12] | [13] | [17]  | [4]  | [32]  | [27] | [18]  | CKN-GM | CKN-PM | CKN-CO
CIFAR-10 | 82.0 | 79.6 | 83.96 | NA   | 84.87 | 82.2 | 88.32 | 74.84  | 78.30  | 82.18
STL-10   | 60.1 | 51.5 | 62.3  | 64.5 | NA    | 58.7 | NA    | 60.04  | 60.25  | 62.32

Table 2: Classification accuracy in % on CIFAR-10 and STL-10 without data augmentation.

5 Conclusion

In this paper, we have proposed a new methodology for combining kernels and convolutional neural networks. We show that mixing the ideas of these two concepts is fruitful, since we achieve near state-of-the-art performance on several datasets such as MNIST, CIFAR-10, and STL-10, with simple architectures and no data augmentation. Some challenges regarding our work are left open for the future. The first is the use of supervision to better approximate the kernel for the prediction task. The second is to leverage the kernel interpretation of our convolutional neural networks to better understand the theoretical properties of the feature spaces that these networks produce.

Acknowledgments

This work was partially supported by grants from ANR (project MACARON ANR-14-CE23-0003-01), the MSR-Inria joint centre, the European Research Council (project ALLEGRO), the CNRS-Mastodons program (project GARGANTUA), and the LabEx PERSYVAL-Lab (ANR-11-LABX-0025).

References

[1] Y. Bengio. Learning deep architectures for AI. Found. Trends Mach. Learn., 2009.

[2] L. Bo, K. Lai, X. Ren, and D. Fox. Object recognition with hierarchical kernel descriptors. In Proc. CVPR, 2011.

[3] L. Bo, X. Ren, and D. Fox. Kernel descriptors for visual recognition. In Adv. NIPS, 2010.

[4] L. Bo, X. Ren, and D. Fox. Unsupervised feature learning for RGB-D based object recognition. In Experimental Robotics, 2013.

[5] L. Bo and C. Sminchisescu. Efficient match kernel between sets of features for visual recognition.
In Adv. NIPS, 2009.

[6] L. Bottou, O. Chapelle, D. DeCoste, and J. Weston. Large-Scale Kernel Machines (Neural Information Processing). The MIT Press, 2007.

[7] J. V. Bouvrie, L. Rosasco, and T. Poggio. On invariance in hierarchical models. In Adv. NIPS, 2009.

[8] J. Bruna and S. Mallat. Invariant scattering convolution networks. IEEE T. Pattern Anal., 35(8):1872–1886, 2013.

[9] R. H. Byrd, P. Lu, J. Nocedal, and C. Zhu. A limited memory algorithm for bound constrained optimization. SIAM J. Sci. Comput., 16(5):1190–1208, 1995.

[10] Y. Cho and L. K. Saul. Large-margin classification in infinite neural networks. Neural Comput., 22(10), 2010.

[11] D. Ciresan, U. Meier, and J. Schmidhuber. Multi-column deep neural networks for image classification. In Proc. CVPR, 2012.

[12] A. Coates and A. Y. Ng. Selecting receptive fields in deep networks. In Adv. NIPS, 2011.

[13] A. Coates, A. Y. Ng, and H. Lee. An analysis of single-layer networks in unsupervised feature learning. In Proc. AISTATS, 2011.

[14] D. Decoste and B. Schölkopf. Training invariant support vector machines. Mach. Learn., 46(1-3):161–190, 2002.

[15] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. preprint arXiv:1310.1531, 2013.

[16] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. J. Mach. Learn. Res., 9:1871–1874, 2008.

[17] R. Gens and P. Domingos. Discriminative learning of sum-product networks. In Adv. NIPS, 2012.

[18] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. In Proc. ICML, 2013.

[19] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun. What is the best multi-stage architecture for object recognition? In Proc. ICCV, 2009.

[20] A. Krizhevsky and G.
Hinton. Learning multiple layers of features from tiny images. Tech. Rep., 2009.

[21] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Adv. NIPS, 2012.

[22] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. P. IEEE, 86(11):2278–2324, 1998.

[23] B. A. Olshausen and D. J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583):607–609, 1996.

[24] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In Adv. NIPS, 2007.

[25] M. Ranzato, F.-J. Huang, Y.-L. Boureau, and Y. LeCun. Unsupervised learning of invariant feature hierarchies with applications to object recognition. In Proc. CVPR, 2007.

[26] J. Shawe-Taylor and N. Cristianini. Kernel methods for pattern analysis. Cambridge University Press, 2004.

[27] K. Sohn and H. Lee. Learning invariant representations with local transformations. In Proc. ICML, 2012.

[28] K. Swersky, J. Snoek, and R. P. Adams. Multi-task Bayesian optimization. In Adv. NIPS, 2013.

[29] G. Wahba. Spline models for observational data. SIAM, 1990.

[30] L. Wan, M. D. Zeiler, S. Zhang, Y. LeCun, and R. Fergus. Regularization of neural networks using dropconnect. In Proc. ICML, 2013.

[31] C. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In Adv. NIPS, 2001.

[32] M. D. Zeiler and R. Fergus. Stochastic pooling for regularization of deep convolutional neural networks. In Proc. ICLR, 2013.

[33] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In Proc.
ECCV, 2014.