{"title": "Learning Convolutional Feature Hierarchies for Visual Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 1090, "page_last": 1098, "abstract": "We propose an unsupervised method for learning multi-stage hierarchies of sparse convolutional features. While sparse coding has become an increasingly popular method for learning visual features, it is most often trained at the patch level. Applying the resulting filters convolutionally results in highly redundant codes because overlapping patches are encoded in isolation. By training convolutionally over large image windows, our method reduces the redundancy between feature vectors at neighboring locations and improves the efficiency of the overall representation. In addition to a linear decoder that reconstructs the image from sparse features, our method trains an efficient feed-forward encoder that predicts quasi-sparse features from the input. While patch-based training rarely produces anything but oriented edge detectors, we show that convolutional training produces highly diverse filters, including center-surround filters, corner detectors, cross detectors, and oriented grating detectors. We show that using these filters in a multi-stage convolutional network architecture improves performance on a number of visual recognition and detection tasks.", "full_text": "Learning Convolutional Feature Hierarchies for Visual Recognition\n\nKoray Kavukcuoglu1, Pierre Sermanet1, Y-Lan Boureau2,1, Karol Gregor1, Michaël Mathieu1, Yann LeCun1\n\n1 Courant Institute of Mathematical Sciences, New York University\n2 INRIA - Willow project-team*\n\n{koray,sermanet,ylan,kgregor,yann}@cs.nyu.edu, mmathieu@clipper.ens.fr\n\nAbstract\n\nWe propose an unsupervised method for learning multi-stage hierarchies of sparse convolutional features. 
While sparse coding has become an increasingly popular method for learning visual features, it is most often trained at the patch level. Applying the resulting filters convolutionally results in highly redundant codes because overlapping patches are encoded in isolation. By training convolutionally over large image windows, our method reduces the redundancy between feature vectors at neighboring locations and improves the efficiency of the overall representation. In addition to a linear decoder that reconstructs the image from sparse features, our method trains an efficient feed-forward encoder that predicts quasi-sparse features from the input. While patch-based training rarely produces anything but oriented edge detectors, we show that convolutional training produces highly diverse filters, including center-surround filters, corner detectors, cross detectors, and oriented grating detectors. We show that using these filters in a multi-stage convolutional network architecture improves performance on a number of visual recognition and detection tasks.\n\n1 Introduction\n\nOver the last few years, a growing amount of research on visual recognition has focused on learning low-level and mid-level features using unsupervised learning, supervised learning, or a combination of the two. The ability to learn multiple levels of good feature representations in a hierarchical structure would enable the automatic construction of sophisticated recognition systems operating, not just on natural images, but on a wide variety of modalities. This would be particularly useful for sensor modalities where our lack of intuition makes it difficult to engineer good feature extractors.\n\nThe present paper introduces a new class of techniques for learning features extracted through convolutional filter banks. 
The techniques are applicable to Convolutional Networks and their variants, which use multiple stages of trainable convolutional filter banks, interspersed with non-linear operations, and spatial feature pooling operations [1, 2]. While ConvNets have traditionally been trained in supervised mode, a number of recent systems have proposed to use unsupervised learning to pre-train the filters, followed by supervised fine-tuning. Some authors have used convolutional forms of Restricted Boltzmann Machines (RBM) trained with contrastive divergence [3], but many of them have relied on sparse coding and sparse modeling [4, 5, 6]. In sparse coding, a sparse feature vector z is computed so as to best reconstruct the input x through a linear operation with a learned dictionary matrix D. The inference procedure produces a code z* by minimizing an energy function:\n\nL(x, z, D) = 1/2 ||x - Dz||_2^2 + |z|_1,   z* = arg min_z L(x, z, D)   (1)\n\n*Laboratoire d'Informatique de l'Ecole Normale Supérieure (INRIA/ENS/CNRS UMR 8548)\n\nFigure 1: Left: A dictionary with 128 elements, learned with a patch-based sparse coding model. Right: A dictionary with 128 elements, learned with a convolutional sparse coding model. The dictionary learned with the convolutional model spans the orientation space much more uniformly. In addition, it can be seen that the diversity of filters obtained by the convolutional sparse model is much richer than that of the patch-based one.\n\nThe dictionary is obtained by minimizing energy (1) with respect to D: min_{z,D} L(x, z, D), averaged over a training set of input samples. 
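The inference step of equation (1) can be sketched numerically. This is a minimal illustration using iterative soft-thresholding (ISTA) rather than the coordinate descent solver used later in the paper; the sparsity weight `lam`, the problem sizes, and the iteration count are illustrative choices, not values from the paper.

```python
import numpy as np

# Sketch of patch-level sparse coding inference, eq. (1), via ISTA.
rng = np.random.default_rng(0)
n, K = 64, 128                        # input dim, dictionary size (overcomplete)
D = rng.normal(size=(n, K))
D /= np.linalg.norm(D, axis=0)        # unit-norm dictionary columns
x = rng.normal(size=n)

def soft(v, t):
    """Soft-thresholding: the proximal operator of the l1 penalty."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def energy(z, lam):
    return 0.5 * np.sum((x - D @ z) ** 2) + lam * np.sum(np.abs(z))

lam = 0.5
step = 1.0 / np.linalg.norm(D, 2) ** 2    # 1/L, L = Lipschitz const. of the gradient
z = np.zeros(K)
e_start = energy(z, lam)
for _ in range(200):
    z = soft(z - step * (D.T @ (D @ z - x)), lam * step)
e_end = energy(z, lam)    # energy decreases; many coordinates are exactly zero
```

The soft-thresholding step is what produces exact zeros in the code, which is the property the encoder of Section 2.2 tries to imitate.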
There are two problems with the traditional sparse modeling method when training convolutional filter banks: 1) the representations of whole images are highly redundant because the training and the inference are performed at the patch level; 2) the inference for a whole image is computationally expensive.\n\nFirst problem. In most applications of sparse coding to image analysis [7, 8], the system is trained on single image patches whose dimensions match those of the filters. After training, patches in the image are processed separately. This procedure completely ignores the fact that the filters are eventually going to be used in a convolutional fashion. Learning will produce a dictionary of filters that are essentially shifted versions of each other over the patch, so as to reconstruct each patch in isolation. Inference is performed on all (overlapping) patches independently, which produces a highly redundant representation for the whole image. To address this problem, we apply sparse coding to the entire image at once, and we view the dictionary as a convolutional filter bank:\n\nL(x, z, D) = 1/2 ||x - sum_{k=1}^{K} D_k * z_k||_2^2 + |z|_1,   (2)\n\nwhere D_k is an s × s 2D filter kernel, x is a w × h image (instead of an s × s patch), z_k is a 2D feature map of dimension (w + s - 1) × (h + s - 1), and "*" denotes the discrete convolution operator. Convolutional Sparse Coding has been used by several authors, notably [6].\n\nTo address the second problem, we follow the idea of [4, 5], and use a trainable, feed-forward, non-linear encoder module to produce a fast approximation of the sparse code. 
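The convolutional energy of equation (2) can be evaluated directly; a toy sketch follows. The sizes are illustrative, and the kernel flip in the helper makes it a true convolution rather than a correlation. Because the feature maps are (w + s - 1) × (h + s - 1), a "valid" convolution with an s × s kernel reconstructs the full w × h image.

```python
import numpy as np

# Toy evaluation of the convolutional sparse coding energy, eq. (2).
rng = np.random.default_rng(0)
w, h, s, K = 16, 16, 5, 4
x = rng.normal(size=(w, h))
D = rng.normal(size=(K, s, s))
z = rng.normal(size=(K, w + s - 1, h + s - 1))
z *= rng.random(z.shape) < 0.05           # make the code sparse

def conv2d_valid(a, k):
    """'Valid' 2D convolution (kernel flipped, as in true convolution)."""
    kf = k[::-1, ::-1]
    n = k.shape[0]
    out = np.zeros((a.shape[0] - n + 1, a.shape[1] - n + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(a[i:i + n, j:j + n] * kf)
    return out

recon = sum(conv2d_valid(z[k], D[k]) for k in range(K))    # shape (w, h)
energy = 0.5 * np.sum((x - recon) ** 2) + np.abs(z).sum()
```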
The new energy function includes a code prediction error term:\n\nL(x, z, D, W) = 1/2 ||x - sum_{k=1}^{K} D_k * z_k||_2^2 + sum_{k=1}^{K} ||z_k - f(W_k * x)||_2^2 + |z|_1,   (3)\n\nwhere z* = arg min_z L(x, z, D, W), W_k is an encoding convolution kernel of size s × s, and f is a point-wise non-linear function. Two crucially important questions are the form of the non-linear function f, and the optimization method to find z*. Both questions will be discussed at length below. The contribution of this paper is to address both issues simultaneously, thus allowing convolutional approaches to sparse coding to scale up, and opening the road to real-time applications.\n\n2 Algorithms and Method\n\nIn this section, we analyze the benefits of convolutional sparse coding for object recognition systems, and propose convolutional extensions to the coordinate descent sparse coding (CoD) [9] algorithm and the dictionary learning procedure.\n\n2.1 Learning Convolutional Dictionaries\n\nThe key observation for modeling convolutional filter banks is that the convolution of a signal with a given kernel can be represented as a matrix-vector product by constructing a special Toeplitz-structured matrix for each dictionary element and concatenating all such matrices to form a new dictionary. Any existing sparse coding algorithm can then be used. Unfortunately, this method incurs a cost, since the size of the dictionary then depends on the size of the input signal. Therefore, it is advantageous to use a formulation based on convolutions rather than following the naive method outlined above. In this work, we use the coordinate descent sparse coding algorithm [9] as a starting point and generalize it using convolution operations. Two important issues arise when learning convolutional dictionaries: 1. The boundary effects due to convolutions need to be properly handled. 2. 
The derivative of equation 2 should be computed efficiently. Since the loss is not jointly convex in D and z, but is convex in each variable when the other one is kept fixed, sparse dictionaries are usually learned by an approach similar to block coordinate descent, which alternately minimizes over z and D (e.g., see [10, 8, 4]). One can use either batch updates [7] (accumulating derivatives over many samples) or online updates [8, 6, 5] (updating the dictionary after each sample). In this work, we use a stochastic online procedure for updating the dictionary elements.\n\nThe updates to the dictionary elements, calculated from equation 2, are sensitive to the boundary effects introduced by the convolution operator. The code units at the boundary might grow much larger than the middle elements, since the outermost boundaries of the reconstruction take contributions from only a single code unit, whereas the middle ones combine s × s units. Therefore the reconstruction error, and correspondingly the derivatives, grow proportionally larger. One way to properly handle this situation is to apply a mask to the derivatives of the reconstruction error with respect to z: D^T * (x - D * z) is replaced by D^T * (mask(x) - D * z), where mask is a term-by-term multiplier that either puts zeros or gradually scales down the boundaries.\n\nAlgorithm 1 Convolutional extension to coordinate descent sparse coding [9]. A subscript index (set) of a matrix represents a particular element. 
For slicing the 4D tensor S we adopt MATLAB notation for simplicity.\n\nfunction ConvCoD(x, D, α)\n  Set: S = D^T * D\n  Initialize: z = 0; β = D^T * mask(x)\n  Require: h_α: smooth thresholding function.\n  repeat\n    z̄ = h_α(β)\n    (k, p, q) = arg max_{i,m,n} |z_imn - z̄_imn|   (k: dictionary index, (p, q): location index)\n    b = β_kpq\n    β = β + (z_kpq - z̄_kpq) × align(S(:, k, :, :), (p, q))\n    z_kpq = z̄_kpq;  β_kpq = b\n  until change in z is below a threshold\nend function\n\nThe second important point in training convolutional dictionaries is the computation of the S = D^T * D operator. For most algorithms, like coordinate descent [9], FISTA [11] and matching pursuit [12], it is advantageous to store the similarity matrix S explicitly and use a single column at a time for updating the corresponding component of the code z. For convolutional modeling, the same approach can be followed with some additional care. In patch-based sparse coding, each element (i, j) of S equals the dot product of dictionary elements i and j. Since the similarity of a pair of dictionary elements also has to be considered across spatial offsets, each term is expanded as the "full" convolution of two dictionary elements (i, j), producing a (2s - 1) × (2s - 1) matrix. It is more convenient to think of the resulting matrix as a 4D tensor of size K × K × (2s - 1) × (2s - 1). One should note that, depending on the input image size, proper alignment of the corresponding column of this tensor has to be applied in the z space. One could also use the steepest descent algorithm for finding the solution to the convolutional sparse coding problem in equation 2; however, this method would be orders of magnitude slower than specialized algorithms like CoD [9], and the solution would never contain exact zeros. 
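The construction of the 4D similarity tensor described above can be sketched as follows: each pair (i, j) of s × s dictionary elements is expanded by "full" correlation into a (2s - 1) × (2s - 1) map, so that the zero-offset entry recovers the ordinary dot product used in patch-based sparse coding. Sizes here are illustrative.

```python
import numpy as np

# Sketch of building the similarity tensor S = D^T * D for convolutional CoD.
rng = np.random.default_rng(0)
K, s = 8, 5
D = rng.normal(size=(K, s, s))

def corr_full(a, b):
    """Full 2D cross-correlation of two s x s kernels -> (2s-1) x (2s-1)."""
    n = a.shape[0]
    pa = np.zeros((3 * n - 2, 3 * n - 2))
    pa[n - 1:2 * n - 1, n - 1:2 * n - 1] = a      # zero-pad a by n-1 on each side
    out = np.zeros((2 * n - 1, 2 * n - 1))
    for u in range(2 * n - 1):
        for v in range(2 * n - 1):
            out[u, v] = np.sum(pa[u:u + n, v:v + n] * b)
    return out

S = np.zeros((K, K, 2 * s - 1, 2 * s - 1))
for i in range(K):
    for j in range(K):
        S[i, j] = corr_full(D[i], D[j])
# The zero-offset entry S[i, j, s-1, s-1] equals the dot product <D_i, D_j>.
```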
In Algorithm 1 we describe the extension of the coordinate descent algorithm [9] to convolutional inputs. Having formulated convolutional sparse coding, the overall learning procedure is simple stochastic (online) gradient descent over the dictionary D:\n\nfor all x_i in the training set X:   z* = arg min_z L(x_i, z, D),   D ← D - η ∂L(x_i, z*, D)/∂D   (4)\n\nThe columns of D are normalized after each iteration. A convolutional dictionary with 128 elements, trained on images from the Berkeley dataset [13], is shown in figure 1.\n\nFigure 2: Left: Smooth shrinkage function. Parameters β and b control the smoothness and location of the kink of the function. As β → ∞ it converges to the soft thresholding operator. Center: Total loss as a function of the number of iterations. The vertical dotted line marks the iteration at which the diagonal Hessian approximation was updated. It is clear that for both encoder functions, the Hessian update improves convergence significantly. Right: 128 convolutional filters (W) learned in the encoder using the smooth shrinkage function. The decoder of this system is shown in figure 1.\n\n2.2 Learning an Efficient Encoder\n\nIn [4], [14] and [15] a feedforward regressor was trained for fast approximate inference. In this work, we extend their encoder module training to the convolutional domain and also propose a new encoder function that approximates sparse codes more closely. The encoder used in [14] is a simple feedforward function which can also be seen as a small convolutional neural network: z̃_k = g_k × tanh(x * W_k), (k = 1..K). This function has been shown to produce good features for object recognition [14]; however, it does not include a shrinkage operator, so its ability to produce sparse representations is very limited. Therefore, we propose a different encoding function with a shrinkage operator. 
The standard soft thresholding operator has the nice property of producing exact zeros around the origin; however, over a very wide region the derivatives are also zero. In order to be able to train a filter bank that is applied to the input before the shrinkage operator, we propose to use an encoder with a smooth shrinkage operator z̃_k = sh_{β_k,b_k}(x * W_k), where k = 1..K and:\n\nsh_{β_k,b_k}(s) = sign(s) × ( (1/β_k) log(exp(β_k × b_k) + exp(β_k × |s|) - 1) - b_k )   (5)\n\nNote that each β_k and b_k is a single scalar per feature map k. The shape of the smooth shrinkage operator is shown in figure 2 for several values of β and b. It can be seen that β controls the smoothness of the kink of the shrinkage operator and b controls its location. The function is guaranteed to pass through the origin and is antisymmetric. The partial derivatives ∂sh/∂β and ∂sh/∂b can easily be written out, so these parameters can be learned from data.\n\nUpdating the parameters of the encoding function is performed by minimizing equation 3. The additional cost term penalizes the squared distance between the optimal code z and the prediction z̃. In a sense, training the encoder module is similar to training a ConvNet. To aid faster convergence, we use the stochastic diagonal Levenberg-Marquardt method [16] to calculate a positive diagonal approximation to the Hessian. We update the Hessian approximation every 10000 samples; the effect of Hessian updates on the total loss is shown in figure 2. It can be seen that, especially for the tanh encoder function, the effect of using second order information on convergence is significant.\n\n2.3 Patch Based vs Convolutional Sparse Modeling\n\nNatural images, sounds, and more generally, signals that display translation invariance in any dimension, are better represented using convolutional dictionaries. 
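Returning briefly to the encoder of Section 2.2, the properties claimed for the smooth shrinkage of equation (5) can be checked numerically: it passes through the origin, is antisymmetric, and approaches the soft thresholding operator as β grows. The values β = 100, b = 1 below are illustrative; in the paper these parameters are learned per feature map.

```python
import numpy as np

# Numeric sketch of the smooth shrinkage operator of eq. (5).
def smooth_shrink(s, beta, b):
    return np.sign(s) * ((1.0 / beta) *
        np.log(np.exp(beta * b) + np.exp(beta * np.abs(s)) - 1.0) - b)

def soft_threshold(s, b):
    return np.sign(s) * np.maximum(np.abs(s) - b, 0.0)

s = np.linspace(-3.0, 3.0, 601)
beta, b = 100.0, 1.0
y = smooth_shrink(s, beta, b)
# Maximum deviation from soft thresholding; roughly log(2)/beta, at |s| = b.
gap = np.max(np.abs(y - soft_threshold(s, b)))
```

Unlike soft thresholding, this function has nonzero derivatives with respect to β and b everywhere, which is what makes the encoder parameters trainable by gradient descent.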
The convolution operator enables the system to model local structures that appear anywhere in the signal. For example, if k × k image patches are sampled from a set of natural images, an edge at a given orientation may appear at any location, forcing local models to allocate multiple dictionary elements to represent a single underlying orientation. By contrast, a convolutional model only needs to record the oriented structure once, since dictionary elements can be used at all locations. Figure 1 shows atoms from patch-based and convolutional dictionaries comprising the same number of elements. The convolutional dictionary does not waste resources modeling similar filter structure at multiple locations. Instead, it models more orientations, frequencies, and different structures including center-surround filters, double center-surround filters, and corner structures at various angles.\n\nIn this work, we present two encoder architectures: 1. steepest descent sparse coding with a tanh encoding function, g_k × tanh(x * W_k); 2. convolutional CoD sparse coding with a shrink encoding function, sh_{β,b}(x * W_k). The time required for training the first system is much higher than for the second due to steepest descent sparse coding. However, the performance of the two encoding functions is almost identical.\n\n2.4 Multi-stage architecture\n\nOur convolutional encoder can be used to replace the patch-based sparse coding modules used in multi-stage object recognition architectures such as the one proposed in our previous work [14]. Building on our previous findings, for each stage the encoder is followed by an absolute value rectification, contrast normalization and average subsampling. Absolute Value Rectification is a simple pointwise absolute value function applied to the output of the encoder. Contrast Normalization is the same operation used for pre-processing the images. 
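The contrast normalization step can be sketched as follows. This is a simplified single-feature-map version showing only the mechanics (subtract the local mean, divide by the local standard deviation over a 9 × 9 neighborhood); the paper's version computes the statistics jointly across all feature maps, and the padding mode and `eps` stabilizer here are our choices.

```python
import numpy as np

# Simplified sketch of local contrast normalization on one feature map.
def contrast_normalize(fm, radius=4, eps=1e-6):
    padded = np.pad(fm, radius, mode='reflect')
    out = np.empty_like(fm)
    for i in range(fm.shape[0]):
        for j in range(fm.shape[1]):
            win = padded[i:i + 2 * radius + 1, j:j + 2 * radius + 1]
            out[i, j] = (fm[i, j] - win.mean()) / (win.std() + eps)
    return out

rng = np.random.default_rng(0)
fm = rng.normal(loc=5.0, scale=2.0, size=(32, 32))   # map with a large DC offset
normed = contrast_normalize(fm)                       # locally ~zero mean, unit std
```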
This type of operation has been shown to reduce the dependencies between components [17, 18] (feature maps in our case). When used between layers, the mean and standard deviation are calculated across all feature maps within a 9 × 9 neighborhood in the spatial dimensions. The last operation, average pooling, is simply a spatial pooling operation applied to each feature map independently.\n\nOne or more additional stages can be stacked on top of the first one. Each stage then takes the output of its preceding stage as input and processes it using the same series of operations with different architectural parameters such as size and connections. When the input to a stage is a series of feature maps, each output feature map is formed by the summation of multiple filters.\n\nIn the next sections, we present experiments showing that using convolutionally trained encoders in this architecture leads to better object recognition performance.\n\n3 Experiments\n\nWe closely follow the architecture proposed in [14] for object recognition experiments. As stated above, we use two different systems: 1. Steepest descent sparse coding with tanh encoder: SDtanh. 2. Coordinate descent sparse coding with shrink encoder: CDshrink. In the following, we give details of the unsupervised training and supervised recognition experiments.\n\n3.1 Object Recognition using the Caltech 101 Dataset\n\nThe Caltech-101 dataset [19] contains up to 30 training images per class and each image contains a single object. We process the images in the dataset as follows: 1. Each image is converted to gray-scale and resized so that the largest edge is 151 pixels. 2. Images are contrast normalized to obtain locally zero mean and unit standard deviation input using a 9 × 9 neighborhood. 3. The short side of each image is zero padded to 143 pixels. We report the results in Tables 1 and 2. 
All results in these tables are obtained using 30 training samples per class and 5 different choices of the training set. We use the background class during training and testing.\n\nArchitecture: We use the unsupervised trained encoders in a multi-stage system identical to the one proposed in [14]. At the first layer 64 features are extracted from the input image, followed by a second layer that produces 256 features. Second layer features are connected to first layer features through a sparse connection table to break the symmetry and to decrease the number of parameters.\n\nUnsupervised Training: The input to unsupervised training consists of contrast normalized gray-scale images [20] obtained from the Berkeley segmentation dataset [13]. Contrast normalization consists of processing each feature map value by removing the mean and dividing by the standard deviation calculated over a 9 × 9 region centered at that value, across all feature maps.\n\nFirst Layer: We have trained both systems using 64 dictionary elements. Each dictionary item is a 9 × 9 convolution kernel. The resulting problem to be solved is a 64 times overcomplete sparse coding problem. Both systems are trained for 10 different sparsity values ranging between 0.1 and 3.0.\n\nSecond Layer: Using the 64 feature maps output by the first layer encoder on Berkeley images, we train a second layer convolutional sparse coding. At the second layer, the number of feature maps is 256 and each feature map is connected to 16 randomly selected input features out of 64. Thus, we aim to learn 4096 convolutional kernels at the second layer. To the best of our knowledge, none of the previous convolutional RBM [3] and sparse coding [6] methods have learned such a large number of dictionary elements. Our aim is motivated by the fact that using such a large number of elements together with a linear classifier [14] yields recognition results similar to [3] and [6]. 
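The sparse second-layer connectivity described above can be sketched as a random connection table: each of the 256 output maps draws on 16 distinct input maps out of 64, one convolution kernel per connection, for 256 × 16 = 4096 kernels. The sampling scheme below (uniform, without replacement) is an assumption; the paper only states that the 16 inputs are randomly selected.

```python
import numpy as np

# Sketch of a sparse connection table for the second stage.
rng = np.random.default_rng(0)
n_in, n_out, fan_in = 64, 256, 16
table = np.stack([rng.choice(n_in, size=fan_in, replace=False)
                  for _ in range(n_out)])    # table[j] = input maps feeding output j
n_kernels = table.size                        # one 9x9 kernel per connection
```

Each output feature map is then the sum of `fan_in` convolutions, one per listed input map, which is how the summation over multiple filters in Section 2.4 is realized.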
In both of these studies, a more powerful Pyramid Match Kernel SVM classifier [21] is used to reach the same level of performance. Figure 3 shows 128 filters that connect to 8 first layer features. Each row of filters connects to a particular second layer feature map. It is seen that each row of filters extracts similar features, since their output responses are summed together to form one output feature map.\n\nFigure 3: Second stage filters. Left: Encoder kernels that correspond to the dictionary elements. Right: 128 dictionary elements; each row shows 16 dictionary elements, connecting to a single second layer feature map. It can be seen that each group extracts a similar type of features from its corresponding inputs.\n\nTable 1: Comparing the SDtanh encoder to the CDshrink encoder on the Caltech 101 dataset using a single stage architecture. Each system is trained using 64 convolutional filters. The recognition accuracy results are very similar for both systems.\n\n        Logistic Regression Classifier\n        SDtanh         CDshrink       PSD [14]\n  U     57.1 ± 0.6%    57.3 ± 0.5%    52.2%\n  U+    57.6 ± 0.4%    56.4 ± 0.5%    54.2%\n\nOne Stage System: We train 64 convolutional unsupervised features using both the SDtanh and CDshrink methods. We use the encoder function obtained from this training, followed by absolute value rectification, contrast normalization and average pooling. The convolutional filters used are 9 × 9. The average pooling is applied over a 10 × 10 area with a 5 pixel stride. The output of the first layer is then 64 × 26 × 26 and is fed into a logistic regression classifier and Lazebnik's PMK-SVM classifier [21] (that is, the spatial pyramid pipeline is used, with our features replacing the SIFT features).\n\nTwo Stage System: We train 4096 convolutional filters with the SDtanh method using 64 input feature maps from the first stage to produce 256 feature maps. 
The second layer features are also 9 × 9, producing 256 × 18 × 18 features. After applying absolute value rectification, contrast normalization and average pooling (over a 6 × 6 area with stride 4), the output features are 256 × 4 × 4 (4096) dimensional. We use only a multinomial logistic regression classifier after the second layer feature extraction stage.\n\nWe denote unsupervised trained one stage systems by U and two stage unsupervised trained systems by UU; "+" indicates that supervised training is performed afterwards. R stands for randomly initialized systems with no unsupervised training.\n\nTable 2: Recognition accuracy on the Caltech 101 dataset using a variety of feature representations, two stage systems and two different classifiers.\n\n  Logistic Regression Classifier        PMK-SVM [21] Classifier (hard quantization +\n                                        multiscale pooling + intersection kernel SVM)\n  PSD [14] (UU)     63.7%               SIFT [21]     64.6 ± 0.7%\n  PSD [14] (U+U+)   65.5%               RBM [3]       66.4 ± 0.5%\n  SDtanh (UU)       65.3 ± 0.9%         DN [6]        66.9 ± 1.1%\n  SDtanh (U+U+)     66.3 ± 1.5%         SDtanh (U)    65.7 ± 0.7%\n\nComparing our U system using both SDtanh and CDshrink (57.1% and 57.3%) with the 52.2% reported in [14], we see that convolutional training results in a significant improvement. With two layers of purely unsupervised features (UU, 65.3%), we even achieve the same performance as the patch-based model of Jarrett et al. [14] after supervised fine-tuning (63.7%). 
Moreover, with additional supervised fine-tuning (U+U+) we match or come very close to (66.3%) similar models [3, 6] with two layers of convolutional feature extraction, even though these models use the more complex spatial pyramid classifier (PMK-SVM) instead of the logistic regression we have used; the spatial pyramid framework comprises a codeword extraction step and an SVM, thus effectively adding one layer to the system. We get 65.7% with a spatial pyramid on top of our single-layer U system (with 256 codewords jointly encoding 2 × 2 neighborhoods of our features by hard quantization, then max pooling in each cell of the pyramid, with a linear SVM, as proposed by the authors of [22]).\n\nFigure 4: Results on the INRIA dataset with the per-image metric (miss rate vs. false positives per image). Left: Comparing the two best systems with unsupervised initialization (U+U+, 11.5%) vs random initialization (R+R+, 14.8%). Right: Effect of bootstrapping on final performance for the unsupervised initialized system, from 23.6% with no bootstrapping (bt0) down to 11.5% after four passes (bt4).\n\nOur experiments have shown that sparse features achieve superior recognition performance compared to features obtained using a dictionary trained by a patch-based procedure, as shown in Table 2. It is interesting to note that the improvement is larger when using feature extractors trained in a purely unsupervised way than when unsupervised training is followed by a supervised training phase (57.1 to 57.6). 
Recalling that the supervised tuning is a convolutional procedure, this last training step might have the additional benefit of decreasing the redundancy between patch-based dictionary elements. On the other hand, this contribution would be minor for dictionaries which have already been trained convolutionally in the unsupervised stage.\n\n3.2 Pedestrian Detection\n\nWe train and evaluate our architecture on the INRIA Pedestrian dataset [23], which contains 2416 positive examples (after mirroring) and 1218 negative full images. For training, we also augment the positive set with small translations and scale variations to learn invariance to small transformations, yielding 11370 and 1000 positive examples for training and validation respectively. The negative set is obtained by sampling patches from negative full images at random scales and locations. Additionally, we include samples from the positive set at larger and smaller scales to avoid false positives from very different scales. With these additions, the negative set is composed of 9001 training and 1000 validation samples.\n\nArchitecture and Training: A similar architecture to that of the previous section was used, with 32 filters, each 7 × 7, for the first layer and 64 filters, also 7 × 7, for the second layer. We used 2 × 2 average pooling between each layer. A fully connected linear layer with 2 output scores (for pedestrian and background) was used as the classifier. We trained this system on 78 × 38 inputs where pedestrians are approximately 60 pixels high. We trained our system with and without unsupervised initialization, followed by fine-tuning of the entire architecture in a supervised manner. 
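The feature map sizes implied by the pedestrian architecture above can be worked out with a few lines of arithmetic. The intermediate sizes below are our inference, assuming "valid" 7 × 7 convolutions and non-overlapping 2 × 2 pooling; the paper does not state them explicitly.

```python
# Feature-map size arithmetic for the 78 x 38 pedestrian input (our inference,
# assuming "valid" convolutions and non-overlapping 2x2 average pooling).
def conv_out(size, k):
    return size - k + 1      # "valid" convolution

def pool_out(size, p):
    return size // p         # non-overlapping pooling

h, w = 78, 38                              # input window, pedestrians ~60 px high
h, w = conv_out(h, 7), conv_out(w, 7)      # stage 1 conv (32 maps): 72 x 32
h, w = pool_out(h, 2), pool_out(w, 2)      # stage 1 pool:           36 x 16
h, w = conv_out(h, 7), conv_out(w, 7)      # stage 2 conv (64 maps): 30 x 10
h, w = pool_out(h, 2), pool_out(w, 2)      # stage 2 pool:           15 x 5
```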
Figure 5 shows comparisons of our system with other methods as well as the effect of unsupervised initialization.\n\nAfter one pass of unsupervised and/or supervised training, several bootstrapping passes were performed to augment the negative set with the 10 most offending samples on each full negative image and the bigger/smaller scaled positives. We select the most offending samples, i.e., those with the biggest opposite score. We limit the number of extracted false positives to 3000 per bootstrapping pass. As [24] showed, the number of bootstrapping passes matters more than the initial training set. We find that the best results were obtained after four passes, as shown in figure 4, improving from 23.6% to 11.5%.\n\nPer-Image Evaluation: Performance on the INRIA set is usually reported with the per-window methodology to avoid post-processing biases, assuming that better per-window performance yields better per-image performance.\n\nFigure 5: Results on the INRIA dataset with the per-image metric (miss rate vs. false positives per image, for methods ranging from Shapelet-orig at 90.5% down to ChnFtrs at 8.7%). These curves are computed from the bounding boxes and confidences made available by [25]. Comparing our two best systems, labeled U+U+ (11.5%) and R+R+ (14.8%), with all the other methods.\n\nHowever, [25] empirically showed that the per-window methodology fails to predict per-image performance and is therefore not adequate for real applications. 
Thus, we evaluate the per-image accuracy using the source code available from [25], which matches bounding boxes with the 50% PASCAL matching measure (intersection/union > 0.5).\n\nIn figure 5, we compare our best result (11.5%) to the latest state-of-the-art results (8.7%) gathered and published on the Caltech Pedestrians website1. The results are ordered by miss rate (the lower the better) at 1 false positive per image on average (1 FPPI). The value of 1 FPPI is meaningful for pedestrian detection because in real world applications, it is desirable to limit the number of false alarms.\n\nIt can be seen from figure 4 that unsupervised initialization significantly improves the performance (14.8% vs 11.5%). The number of labeled images in the INRIA dataset is relatively small, which limits the capability of supervised learning algorithms. However, an unsupervised method can model large variations in pedestrian pose, scale and clutter with much better success.\n\nTop performing methods [26], [27], [28], [24] also contain several components that our simple model does not. Probably the most important of all is color information, whereas we have trained our systems only on gray-scale images. Another important aspect is training on multi-resolution inputs [26], [27], [28]. Currently, we train our systems on fixed scale inputs with very small variation. Additionally, we have used much lower resolution images than the top performing systems to train our models (78 × 38 vs 128 × 64 in [24]). Finally, some models [28] use deformable body part models to improve their performance, whereas we rely on a much simpler pipeline of feature extraction and linear classification.\n\nOur aim in this work was to show that an adaptable feature extraction system that learns its parameters from available data can perform comparably to the best systems for pedestrian detection. 
We believe that including color features and using multi-resolution input would further increase our system's performance.

4 Summary and Future Work
In this work we have presented a method for learning hierarchical feature extractors. Two different methods were presented for convolutional sparse coding, and it was shown that convolutional training of feature extractors reduces the redundancy among filters compared with those obtained from patch-based models. Additionally, we have introduced two different convolutional encoder functions for performing efficient feature extraction, which is crucial for using sparse coding in real-world applications. We have applied the proposed sparse modeling systems, within a successful multi-stage architecture, to object recognition and pedestrian detection problems and performed comparably to similar systems.

In the pedestrian detection task, we have demonstrated the advantage of using unsupervised learning for feature extraction. We believe unsupervised learning significantly helps to properly model the extensive variations in the dataset, where a purely supervised learning algorithm fails. We aim to further improve our system by better modeling the input, including color and multi-resolution information.

1http://www.vision.caltech.edu/Image_Datasets/CaltechPedestrians/files/data-INRIA

References
[1] LeCun, Y, Bottou, L, Bengio, Y, and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998.
[2] Serre, T, Wolf, L, and Poggio, T. Object recognition with features inspired by visual cortex. In CVPR'05 - Volume 2, pages 994–1000, Washington, DC, USA, 2005. IEEE Computer Society.
[3] Lee, H, Grosse, R, Ranganath, R, and Ng, A. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML'09, pages 609–616.
ACM, 2009.
[4] Ranzato, M, Poultney, C, Chopra, S, and LeCun, Y. Efficient learning of sparse representations with an energy-based model. In NIPS'07. MIT Press, 2007.
[5] Kavukcuoglu, K, Ranzato, M, Fergus, R, and LeCun, Y. Learning invariant features through topographic filter maps. In CVPR'09. IEEE, 2009.
[6] Zeiler, M, Krishnan, D, Taylor, G, and Fergus, R. Deconvolutional networks. In CVPR'10. IEEE, 2010.
[7] Aharon, M, Elad, M, and Bruckstein, A. M. K-SVD and its non-negative variant for dictionary design. In Papadakis, M, Laine, A. F, and Unser, M. A, editors, Society of Photo-Optical Instrumentation Engineers (SPIE) Conference Series, volume 5914, pages 327–339, August 2005.
[8] Mairal, J, Bach, F, Ponce, J, and Sapiro, G. Online dictionary learning for sparse coding. In ICML'09, pages 689–696. ACM, 2009.
[9] Li, Y and Osher, S. Coordinate descent optimization for l1 minimization with application to compressed sensing; a greedy algorithm. CAM Report 09-17.
[10] Olshausen, B. A and Field, D. J. Sparse coding with an overcomplete basis set: a strategy employed by V1? Vision Research, 37(23):3311–3325, 1997.
[11] Beck, A and Teboulle, M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Img. Sci., 2(1):183–202, 2009.
[12] Mallat, S and Zhang, Z. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41(12):3397–3415, 1993.
[13] Martin, D, Fowlkes, C, Tal, D, and Malik, J. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In ICCV'01, volume 2, pages 416–423, July 2001.
[14] Jarrett, K, Kavukcuoglu, K, Ranzato, M, and LeCun, Y. What is the best multi-stage architecture for object recognition? In ICCV'09. IEEE, 2009.
[15] Gregor, K and LeCun, Y.
Learning fast approximations of sparse coding. In Proc. International Conference on Machine Learning (ICML'10), 2010.
[16] LeCun, Y, Bottou, L, Orr, G, and Muller, K. Efficient backprop. In Orr, G and Muller, K, editors, Neural Networks: Tricks of the Trade. Springer, 1998.
[17] Schwartz, O and Simoncelli, E. P. Natural signal statistics and sensory gain control. Nature Neuroscience, 4(8):819–825, August 2001.
[18] Lyu, S and Simoncelli, E. P. Nonlinear image representation using divisive normalization. In CVPR'08. IEEE Computer Society, June 23-28 2008.
[19] Fei-Fei, L, Fergus, R, and Perona, P. Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories. In Workshop on Generative-Model Based Vision, 2004.
[20] Pinto, N, Cox, D. D, and DiCarlo, J. J. Why is real-world visual object recognition hard? PLoS Comput Biol, 4(1):e27, January 2008.
[21] Lazebnik, S, Schmid, C, and Ponce, J. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. CVPR'06, 2:2169–2178, 2006.
[22] Boureau, Y, Bach, F, LeCun, Y, and Ponce, J. Learning mid-level features for recognition. In CVPR'10. IEEE, 2010.
[23] Dalal, N and Triggs, B. Histograms of oriented gradients for human detection. In Schmid, C, Soatto, S, and Tomasi, C, editors, CVPR'05, volume 2, pages 886–893, June 2005.
[24] Walk, S, Majer, N, Schindler, K, and Schiele, B. New features and insights for pedestrian detection. In CVPR'10, San Francisco, California, 2010.
[25] Dollár, P, Wojek, C, Schiele, B, and Perona, P. Pedestrian detection: A benchmark. In CVPR'09. IEEE, June 2009.
[26] Dollár, P, Tu, Z, Perona, P, and Belongie, S. Integral channel features. In BMVC'09, London, England, 2009.
[27] Dollár, P, Belongie, S, and Perona, P.
The fastest pedestrian detector in the west. In BMVC'10, Aberystwyth, UK, 2010.
[28] Felzenszwalb, P, Girshick, R, McAllester, D, and Ramanan, D. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010.