Discriminative Unsupervised Feature Learning with Convolutional Neural Networks

Advances in Neural Information Processing Systems, pages 766–774

Alexey Dosovitskiy, Jost Tobias Springenberg, Martin Riedmiller and Thomas Brox
{dosovits,springj,riedmiller,brox}@cs.uni-freiburg.de
Department of Computer Science, University of Freiburg
79110 Freiburg im Breisgau, Germany

Abstract

Current methods for training convolutional neural networks depend on large amounts of labeled samples for supervised training. In this paper we present an approach for training a convolutional neural network using only unlabeled data. We train the network to discriminate between a set of surrogate classes. Each surrogate class is formed by applying a variety of transformations to a randomly sampled 'seed' image patch. We find that this simple feature learning algorithm is surprisingly successful when applied to visual object recognition.
The feature representation learned by our algorithm achieves classification results matching or outperforming the current state-of-the-art for unsupervised learning on several popular datasets (STL-10, CIFAR-10, Caltech-101).

1 Introduction

Convolutional neural networks (CNNs) trained via backpropagation were recently shown to perform well on image classification tasks with millions of training images and thousands of categories [1, 2]. The feature representation learned by these networks achieves state-of-the-art performance not only on the classification task for which the network was trained, but also on various other visual recognition tasks, for example: classification on Caltech-101 [2, 3], Caltech-256 [2] and the Caltech-UCSD birds dataset [3]; scene recognition on the SUN-397 database [3]; detection on the PASCAL VOC dataset [4]. This capability to generalize to new datasets makes supervised CNN training an attractive approach for generic visual feature learning.

The downside of supervised training is the need for expensive labeling, as the number of required labeled samples grows quickly with the size of the model. The large performance increase achieved by methods based on the work of Krizhevsky et al. [1] was, for example, only possible due to massive efforts on manually annotating millions of images. For this reason, unsupervised learning – although currently underperforming – remains an appealing paradigm, since it can make use of raw unlabeled images and videos. Furthermore, on vision tasks outside classification it is not even certain whether training based on object class labels is advantageous.
For example, unsupervised feature learning is known to be beneficial for image restoration [5], and recent results show that it outperforms supervised feature learning also on descriptor matching [6].

In this work we combine the power of a discriminative objective with the major advantage of unsupervised feature learning: cheap data acquisition. We introduce a novel training procedure for convolutional neural networks that does not require any labeled data. It instead relies on an automatically generated surrogate task. The task is created by taking the idea of data augmentation – which is commonly used in supervised learning – to the extreme. Starting with trivial surrogate classes consisting of one random image patch each, we augment the data by applying a random set of transformations to each patch. Then we train a CNN to classify these surrogate classes. We refer to this method as exemplar training of convolutional neural networks (Exemplar-CNN).

The feature representation learned by Exemplar-CNN is, by construction, discriminative and invariant to typical transformations. We confirm this both theoretically and empirically, showing that this approach matches or outperforms all previous unsupervised feature learning methods on the standard image classification benchmarks STL-10, CIFAR-10, and Caltech-101.

1.1 Related Work

Our approach is related to a large body of work on unsupervised learning of invariant features and training of convolutional neural networks.

Convolutional training is commonly used in both supervised and unsupervised methods to exploit the invariance of image statistics to translations (e.g. LeCun et al. [7], Kavukcuoglu et al. [8], Krizhevsky et al. [1]).
Like our approach, the current surge of successful methods employing convolutional neural networks for object recognition often relies on data augmentation to generate additional training samples for the classification objective (e.g. Krizhevsky et al. [1], Zeiler and Fergus [2]). While we share the architecture (a convolutional neural network) with these approaches, our method does not rely on any labeled training data.

In unsupervised learning, several studies on learning invariant representations exist. Denoising autoencoders [9], for example, learn features that are robust to noise by trying to reconstruct data from randomly perturbed input samples. Zou et al. [10] learn invariant features from video by enforcing a temporal slowness constraint on the feature representation learned by a linear autoencoder. Sohn and Lee [11] and Hui [12] learn features invariant to local image transformations. In contrast to our discriminative approach, all these methods rely on directly modeling the input distribution and are typically hard to use for jointly training multiple layers of a CNN.

The idea of learning features that are invariant to transformations has also been explored for supervised training of neural networks. The research most similar to ours is early work on tangent propagation [13] (and the related double backpropagation [14]), which aims to learn invariance to small predefined transformations in a neural network by directly penalizing the derivative of the output with respect to the magnitude of the transformations. In contrast, our algorithm does not regularize the derivative explicitly and is thus less sensitive to the magnitude of the applied transformation.

This work is also loosely related to the use of unlabeled data for regularizing supervised algorithms, for example self-training [15] or entropy regularization [16].
In contrast to these semi-supervised methods, Exemplar-CNN training does not require any labeled data. Finally, the idea of creating an auxiliary task in order to learn a good data representation was used by Ahmed et al. [17] and Collobert et al. [18].

2 Creating Surrogate Training Data

The input to the training procedure is a set of unlabeled images, which come from roughly the same distribution as the images to which we later aim to apply the learned features. We randomly sample N ∈ [50, 32000] patches of size 32×32 pixels from different images at varying positions and scales, forming the initial training set X = {x_1, …, x_N}. We are interested in patches containing objects or parts of objects, hence we sample only from regions containing considerable gradients.

We define a family of transformations {T_α | α ∈ A} parameterized by vectors α ∈ A, where A is the set of all possible parameter vectors. Each transformation T_α is a composition of elementary transformations from the following list:

• translation: vertical or horizontal translation by a distance within 0.2 of the patch size;
• scaling: multiplication of the patch scale by a factor between 0.7 and 1.4;
• rotation: rotation of the image by an angle of up to 20 degrees;
• contrast 1: multiply the projection of each patch pixel onto the principal components of the set of all pixels by a factor between 0.5 and 2 (factors are independent for each principal component and the same for all pixels within a patch);
• contrast 2: raise saturation and value (S and V components of the HSV color representation) of all pixels to a power between 0.25 and 4 (same for all pixels within a patch), multiply these values by a factor between 0.7 and 1.4, and add to them a value between −0.1 and 0.1;
• color: add a value between −0.1 and 0.1 to the hue (H component of the HSV color representation) of all pixels in the patch (the same value is used for all pixels within a patch).

Figure 1: Exemplary patches sampled from the STL unlabeled dataset which are later augmented by various transformations to obtain surrogate data for the CNN training.

Figure 2: Several random transformations applied to one of the patches extracted from the STL unlabeled dataset. The original ('seed') patch is in the top left corner.

All numerical parameters of the elementary transformations, when concatenated together, form a single parameter vector α. For each initial patch x_i ∈ X we sample K ∈ [1, 300] random parameter vectors {α_i^1, …, α_i^K} and apply the corresponding transformations T_i = {T_{α_i^1}, …, T_{α_i^K}} to the patch x_i. This yields the set of its transformed versions S_{x_i} = T_i x_i = {T x_i | T ∈ T_i}. Afterwards we subtract the mean of each pixel over the whole resulting dataset. We do not apply any other preprocessing. Exemplary patches sampled from the STL-10 unlabeled dataset are shown in Fig. 1. Examples of transformed versions of one patch are shown in Fig. 2.

3 Learning Algorithm

Given the sets of transformed image patches, we declare each of these sets to be a class by assigning label i to the class S_{x_i}. We next train a CNN to discriminate between these surrogate classes. Formally, we minimize the following loss function:

L(X) = Σ_{x_i ∈ X} Σ_{T ∈ T_i} l(i, T x_i),   (1)

where l(i, T x_i) is the loss on the transformed sample T x_i with (surrogate) true label i. We use a CNN with a softmax output layer and optimize the multinomial negative log likelihood of the network output, hence in our case

l(i, T x_i) = M(e_i, f(T x_i)),   M(y, f) = −⟨y, log f⟩ = −Σ_k y_k log f_k,   (2)

where f(·) denotes the function computing the values of the output layer of the CNN given the input data, and e_i is the i-th standard basis vector.
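As a concrete illustration, the surrogate-class loss of Eqs. (1)–(2) can be sketched in a few lines of numpy. This is an illustrative sketch, not the paper's Caffe implementation; `logits` stands in for the pre-softmax network outputs.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def surrogate_loss(logits, surrogate_labels):
    """Eqs. (1)-(2): mean multinomial negative log-likelihood of the
    network outputs f = softmax(logits) against the surrogate labels."""
    f = softmax(logits)
    n = len(surrogate_labels)
    return -np.mean(np.log(f[np.arange(n), surrogate_labels]))
```

With uniform (all-zero) logits over N surrogate classes the loss equals log N, and it approaches zero as the network becomes confident about the correct surrogate class.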
We note that in the limit of an infinite number of transformations per surrogate class, the objective function (1) takes the form

L̂(X) = Σ_{x_i ∈ X} E_α[l(i, T_α x_i)],   (3)

which we shall analyze in the next section.

Intuitively, the classification problem described above serves to ensure that different input samples can be distinguished. At the same time, it enforces invariance to the specified transformations. In the following sections we provide a foundation for this intuition. We first present a formal analysis of the objective, separating it into a well defined classification problem and a regularizer that enforces invariance (resembling the analysis in Wager et al. [19]). We then discuss the derived properties of this classification problem and compare it to common practices for unsupervised feature learning.

3.1 Formal Analysis

We denote by α ∈ A the random vector of transformation parameters, by g(x) the vector of activations of the second-to-last layer of the network when presented the input patch x, by W the matrix of the weights of the last network layer, by h(x) = W g(x) the last layer activations before applying the softmax, and by f(x) = softmax(h(x)) the output of the network. By plugging in the definition of the softmax activation function

softmax(z) = exp(z) / ‖exp(z)‖_1,   (4)

the objective function (3) with loss (2), i.e. l(i, T x_i) = −⟨e_i, log f(T x_i)⟩, takes the form

Σ_{x_i ∈ X} E_α[ −⟨e_i, h(T_α x_i)⟩ + log ‖exp(h(T_α x_i))‖_1 ].   (5)

With ĝ_i = E_α[g(T_α x_i)] being the average feature representation of transformed versions of the image patch x_i, we can rewrite Eq. (5) as

Σ_{x_i ∈ X} [ −⟨e_i, W ĝ_i⟩ + log ‖exp(W ĝ_i)‖_1 ] + Σ_{x_i ∈ X} [ E_α[log ‖exp(h(T_α x_i))‖_1] − log ‖exp(W ĝ_i)‖_1 ].   (6)

The first sum is the objective function of a multinomial logistic regression problem with input-target pairs (ĝ_i, e_i). This objective falls back to the transformation-free instance classification problem L(X) = Σ_{x_i ∈ X} l(i, x_i) if g(x_i) = E_α[g(T_α x_i)]. In general, this equality does not hold, and thus the first sum enforces correct classification of the average representation E_α[g(T_α x_i)] for a given input sample. For a truly invariant representation, however, the equality is achieved. Similarly, if we suppose that T_α x = x for α = 0, that for small values of α the feature representation g(T_α x_i) is approximately linear with respect to α, and that the random variable α is centered, i.e. E_α[α] = 0, then ĝ_i = E_α[g(T_α x_i)] ≈ E_α[g(x_i) + ∇_α(g(T_α x_i))|_{α=0} α] = g(x_i).

The second sum in Eq. (6) can be seen as a regularizer enforcing all h(T_α x_i) to be close to their average value, i.e., the feature representation is sought to be approximately invariant to the transformations T_α.
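This decomposition can be checked numerically. The sketch below (names and the random stand-ins for the features g(T_α x_i) are mine, not from the paper) computes the empirical second sum of Eq. (6) for one patch, exploiting that h = W g is linear, so the mean of the pre-softmax activations over sampled transformations equals W ĝ_i.

```python
import numpy as np

def invariance_regularizer(h_samples):
    """Empirical regularizer term of Eq. (6) for one patch:
    mean of log||exp(h)||_1 over sampled transformations minus the
    same quantity at the mean activation (= W g_hat, since h is
    linear in the features). Nonnegative because logsumexp is convex."""
    def logsumexp(z):
        m = z.max()
        return m + np.log(np.exp(z - m).sum())
    mean_term = logsumexp(h_samples.mean(axis=0))
    return np.mean([logsumexp(h) for h in h_samples]) - mean_term
```

For any sample of activations the value is nonnegative, and it vanishes exactly when all h(T_α x_i) coincide, i.e. when the representation is perfectly invariant.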
To show this we use the convexity of the function log ‖exp(·)‖_1 and Jensen's inequality, which yields (proof in the supplementary material)

E_α[log ‖exp(h(T_α x_i))‖_1] − log ‖exp(W ĝ_i)‖_1 ≥ 0.   (7)

If the feature representation is perfectly invariant, then h(T_α x_i) = W ĝ_i and inequality (7) turns into an equality, meaning that the regularizer reaches its global minimum.

3.2 Conceptual Comparison to Previous Unsupervised Learning Methods

Suppose we want to learn, without supervision, a feature representation useful for a recognition task, for example classification. The mapping from input images x to a feature representation g(x) should then satisfy two requirements: (1) there must be at least one feature that is similar for images of the same category y (invariance); (2) there must be at least one feature that is sufficiently different for images of different categories (ability to discriminate).

Most unsupervised feature learning methods aim to learn such a representation by modeling the input distribution p(x). This is based on the assumption that a good model of p(x) contains information about the category distribution p(y|x). That is, if a representation is learned from which a given sample can be reconstructed perfectly, then the representation is expected to also encode information about the category of the sample (ability to discriminate). Additionally, the learned representation should be invariant to variations in the samples that are irrelevant for the classification task, i.e., it should adhere to the manifold hypothesis (see e.g. Rifai et al. [20] for a recent discussion). Invariance is classically achieved by regularization of the latent representation, e.g., by enforcing sparsity [8] or robustness to noise [9].

In contrast, the discriminative objective in Eq.
(1) does not directly model the input distribution p(x) but learns a representation that discriminates between input samples. The representation is not required to reconstruct the input, which is unnecessary in a recognition or matching task. This leaves more degrees of freedom to model the desired variability of a sample. As shown in our analysis (see Eq. (7)), we achieve partial invariance to transformations applied during surrogate data creation by forcing the representation g(T_α x_i) of the transformed image patch to be predictive of the surrogate label assigned to the original image patch x_i.

It should be noted that this approach assumes that the transformations T_α do not change the identity of the image content. If we, for example, use a color transformation, we will force the network to be invariant to this change and cannot expect the extracted features to perform well in a task relying on color information (such as differentiating black panthers from pumas).¹

4 Experiments

To compare our discriminative approach to previous unsupervised feature learning methods, we report classification results on the STL-10 [21], CIFAR-10 [22] and Caltech-101 [23] datasets. Moreover, we assess the influence of the augmentation parameters on the classification performance and study the invariance properties of the network.

4.1 Experimental Setup

The datasets we test on differ in the number of classes (10 for CIFAR and STL, 101 for Caltech) and the number of samples per class. STL is especially well suited for unsupervised learning as it contains a large set of 100,000 unlabeled samples. In all experiments (except for the dataset transfer experiment in the supplementary material) we extracted surrogate training data from the unlabeled subset of STL-10.
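Extracting surrogate training data starts from the seed-patch sampling of Section 2 (random positions, keeping only patches with considerable gradients). A sketch follows; the gradient-energy threshold, the grayscale conversion, and the retry limit are my assumptions, as the paper does not specify them, and sampling over scales is omitted for brevity.

```python
import numpy as np

def sample_seed_patches(image, n_patches, patch_size=32,
                        grad_thresh=0.05, max_tries=10000, rng=None):
    """Randomly sample patch_size x patch_size seed patches, keeping
    only those whose mean gradient magnitude exceeds grad_thresh
    (a hypothetical threshold; the paper does not give one)."""
    rng = rng or np.random.default_rng()
    gray = image.mean(axis=2) if image.ndim == 3 else image
    gy, gx = np.gradient(gray)
    grad_mag = np.hypot(gx, gy)
    h, w = gray.shape
    patches = []
    for _ in range(max_tries):
        if len(patches) == n_patches:
            break
        y = rng.integers(0, h - patch_size + 1)
        x = rng.integers(0, w - patch_size + 1)
        # keep only patches with considerable gradient energy
        if grad_mag[y:y + patch_size, x:x + patch_size].mean() > grad_thresh:
            patches.append(image[y:y + patch_size, x:x + patch_size])
    return patches
```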
When testing on CIFAR-10, we resized the images from 32×32 pixels to 64×64 pixels so that the scale of the depicted objects roughly matches the two other datasets.

We worked with two network architectures. A "small" network was used to evaluate the influence of different components of the augmentation procedure on classification performance. It consists of two convolutional layers with 64 filters each, followed by a fully connected layer with 128 neurons. This last layer is succeeded by a softmax layer, which serves as the network output. A "large" network, consisting of three convolutional layers with 64, 128 and 256 filters respectively, followed by a fully connected layer with 512 neurons, was trained to compare our method to the state-of-the-art. In both models all convolutional filters are connected to a 5×5 region of their input. 2×2 max-pooling was performed after the first and second convolutional layers. Dropout [24] was applied to the fully connected layers. We trained the networks using an implementation based on Caffe [25]. Details on the training, the hyperparameter settings, and an analysis of the performance depending on the network architecture are provided in the supplementary material. Our code and training data are available at http://lmb.informatik.uni-freiburg.de/resources .

We applied the feature representation to images of arbitrary size by convolutionally computing the responses of all the network layers except the top softmax. To each feature map, we applied the pooling method that is commonly used for the respective dataset: 1) 4-quadrant max-pooling, resulting in 4 values per feature map, which is the standard procedure for STL-10 and CIFAR-10 [26, 10, 27, 12]; 2) 3-layer spatial pyramid, i.e.
max-pooling over the whole image as well as within 4 quadrants and within the cells of a 4×4 grid, resulting in 1 + 4 + 16 = 21 values per feature map, which is the standard for Caltech-101 [28, 10, 29]. Finally, we trained a linear support vector machine (SVM) on the pooled features.

¹ Such cases could be covered either by careful selection of applied transformations or by combining features from multiple networks trained with different sets of transformations and letting the final classifier choose which features to use.

On all datasets we used the standard training and test protocols. On STL-10 the SVM was trained on 10 pre-defined folds of the training data. We report the mean and standard deviation achieved on the fixed test set. For CIFAR-10 we report two results: (1) training the SVM on the whole CIFAR-10 training set ('CIFAR-10'); (2) the average over 10 random selections of 400 training samples per class ('CIFAR-10(400)'). For Caltech-101 we followed the usual protocol of selecting 30 random samples per class for training and not more than 50 samples per class for testing. This was repeated 10 times.

4.2 Classification Results

In Table 1 we compare Exemplar-CNN to several unsupervised feature learning methods, including the current state-of-the-art on each dataset. We also list the state-of-the-art for supervised learning (which is not directly comparable). Additionally we show the dimensionality of the feature vectors produced by each method before final pooling.

Table 1: Classification accuracies on several datasets (in percent). † Average per-class accuracy² 78.0% ± 0.4%. ‡ Average per-class accuracy 84.4% ± 0.6%.

Algorithm | STL-10 | CIFAR-10(400) | CIFAR-10 | Caltech-101 | #features
Convolutional K-means Network [26] | 60.1 ± 1 | 70.7 ± 0.7 | 82.0 | — | 8000
Multi-way local pooling [28] | — | — | — | 77.3 ± 0.6 | 1024 × 64
Slowness on videos [10] | 61.0 | — | — | 74.6 | 556
Hierarchical Matching Pursuit (HMP) [27] | 64.5 ± 1 | — | — | — | 1000
Multipath HMP [29] | — | — | — | 82.5 ± 0.5 | 5000
View-Invariant K-means [12] | 63.7 | 72.6 ± 0.7 | 81.9 | — | 6400
Exemplar-CNN (64c5-64c5-128f) | 67.1 ± 0.3 | 69.7 ± 0.3 | 75.7 | 79.8 ± 0.5 † | 256
Exemplar-CNN (64c5-128c5-256c5-512f) | 72.8 ± 0.4 | 75.3 ± 0.2 | 82.0 | 85.5 ± 0.4 ‡ | 960
Supervised state of the art | 70.1 [30] | — | 91.2 [31] | 91.44 [32] | —

The small network was trained on 8000 surrogate classes containing 150 samples each, and the large one on 16000 classes with 100 samples each.

The features extracted from the larger network match or outperform the best prior result on all datasets. This is despite the fact that the dimensionality of the feature vector is smaller than that of most other approaches and that the networks are trained on the STL-10 unlabeled dataset (i.e. they are used in a transfer learning manner when applied to CIFAR-10 and Caltech-101). The increase in performance is especially pronounced when only few labeled samples are available for training the SVM (as is the case for all the datasets except full CIFAR-10). This is in agreement with previous evidence that with increasing feature vector dimensionality and number of labeled samples, training an SVM becomes less dependent on the quality of the features [26, 12].
Remarkably, on STL-10 we achieve an accuracy of 72.8%, which is a large improvement over all previously reported results.

4.3 Detailed Analysis

We performed additional experiments (using the "small" network) to study the effect of three design choices in Exemplar-CNN training and to validate the invariance properties of the learned features. Experiments on sampling 'seed' patches from different datasets can be found in the supplementary material.

4.3.1 Number of Surrogate Classes

We varied the number N of surrogate classes between 50 and 32000. As a sanity check, we also tried classification with random filters. The results are shown in Fig. 3.

Clearly, the classification accuracy increases with the number of surrogate classes until it reaches an optimum at about 8000 surrogate classes, after which it remains constant or even decreases. This is to be expected: the larger the number of surrogate classes, the more likely it is to draw very similar or even identical samples, which are hard or impossible to discriminate. A few such cases are not detrimental to the classification performance, but as soon as such collisions dominate the set of surrogate labels, the discriminative loss is no longer reasonable and training the network on the surrogate task no longer succeeds. To check the validity of this explanation we also plot in Fig. 3 the classification error on the validation set (taken from the surrogate data) computed after training the network. It rapidly grows as the number of surrogate classes increases. We also observed that the optimal number of surrogate classes increases with the size of the network (not shown in the figure), but eventually saturates. This demonstrates the main limitation of our approach of randomly sampling 'seed' patches: it does not scale to arbitrarily large amounts of unlabeled data.
However, we do not see this as a fundamental restriction and discuss possible solutions in Section 5.

4.3.2 Number of Samples per Surrogate Class

Fig. 4 shows the classification accuracy when the number K of training samples per surrogate class varies between 1 and 300. The performance improves with more samples per surrogate class and saturates at around 100 samples. This indicates that this amount is sufficient to approximate the formal objective from Eq. (3), hence further increasing the number of samples does not significantly change the optimization problem. On the other hand, if the number of samples is too small, there is insufficient data to learn the desired invariance properties.

² On Caltech-101 one can either measure average accuracy over all samples (average overall accuracy) or calculate the accuracy for each class and then average these values (average per-class accuracy). These differ, as some classes contain fewer than 50 test samples. Most researchers in ML use average overall accuracy.

Figure 3: Influence of the number of surrogate training classes. The validation error on the surrogate data is shown in red. Note the different y-axes for the two curves.

Figure 4: Classification performance on STL for different numbers of samples per class. Random filters can be seen as '0 samples per class'.

4.3.3 Types of Transformations

We varied the transformations used for creating the surrogate data to analyze their influence on the final classification performance. The set of 'seed' patches was fixed. The result is shown in Fig. 5. The value '0' corresponds to applying random compositions of all elementary transformations: scaling, rotation, translation, color variation, and contrast variation.
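For reference, one such random composition is determined by drawing a parameter vector α with the ranges given in Section 2. The dictionary layout below is my own shorthand for the concatenated vector; the ranges are the paper's.

```python
import numpy as np

def sample_alpha(rng=None):
    """Draw one transformation parameter vector (ranges from Section 2)."""
    rng = rng or np.random.default_rng()
    u = rng.uniform
    return {
        "translation": u(-0.2, 0.2, size=2),  # fraction of the patch size
        "scale": u(0.7, 1.4),
        "rotation_deg": u(-20.0, 20.0),
        "pc_factors": u(0.5, 2.0, size=3),    # 'contrast 1', per principal component
        "sv_power": u(0.25, 4.0),             # 'contrast 2': power applied to S and V
        "sv_factor": u(0.7, 1.4),
        "sv_offset": u(-0.1, 0.1),
        "hue_shift": u(-0.1, 0.1),            # 'color'
    }
```

Discarding a group of transformations, as in this experiment, amounts to pinning the corresponding entries to their identity values.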
Different columns of the plot show the difference in classification accuracy as we discarded some types of elementary transformations.

Several tendencies can be observed. First, rotation and scaling have only a minor impact on the performance, while translations, color variations and contrast variations are significantly more important. Second, the results on STL-10 and CIFAR-10 consistently show that spatial invariance and color-contrast invariance are approximately of equal importance for the classification performance. This indicates that variations in color and contrast, though often neglected, may also improve performance in a supervised learning scenario. Third, on Caltech-101 color and contrast transformations are much more important relative to spatial transformations than on the two other datasets. This is not surprising, since Caltech-101 images are often well aligned, and this dataset bias makes spatial invariance less useful.

Figure 5: Influence of removing groups of transformations during generation of the surrogate training data. The baseline ('0' value) is applying all transformations. Each group of three bars corresponds to removing some of the transformations.

4.3.4 Invariance Properties of the Learned Representation

In a final experiment, we analyzed to what extent the representation learned by the network is invariant to the transformations applied during training. We randomly sampled 500 images from the STL-10 test set and applied a range of transformations (translation, rotation, contrast, color) to each image. To avoid empty regions beyond the image boundaries when applying spatial transformations, we cropped the central 64×64 pixel sub-patch from each 96×96 pixel image.
We then applied two measures of invariance to these patches.

First, as an explicit measure of invariance, we calculated the normalized Euclidean distance between normalized feature vectors of the original image patch and the transformed one [10] (see the supplementary material for details). The downside of this approach is that the distance between extracted features does not take into account how informative and discriminative they are. We therefore evaluated a second measure – classification performance depending on the magnitude of the transformation applied to the classified patches – which does not suffer from this problem.

Figure 6: Invariance properties of the feature representation learned by Exemplar-CNN. (a): Normalized Euclidean distance between feature vectors of the original and the translated image patches vs. the magnitude of the translation; (b)-(d): classification performance on transformed image patches vs. the magnitude of the transformation, for various magnitudes of transformations applied for creating the surrogate data. (b): rotation, (c): additive color change, (d): multiplicative contrast change.
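The first measure can be sketched as follows; the paper defers its exact normalization to the supplementary material, so plain L2-normalization of each feature vector is an assumption here.

```python
import numpy as np

def invariance_distance(feat_orig, feat_trans, eps=1e-8):
    """Euclidean distance between L2-normalized feature vectors of the
    original and the transformed patch. (Plain unit-normalization is
    assumed; the paper's exact scheme is in its supplementary material.)"""
    a = feat_orig / (np.linalg.norm(feat_orig) + eps)
    b = feat_trans / (np.linalg.norm(feat_trans) + eps)
    return np.linalg.norm(a - b)
```

By construction the measure ignores the overall scale of the features: a perfectly invariant representation gives distance zero, while unrelated (orthogonal) feature vectors give the maximal-dissimilarity value √2.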
To compute the classification accuracy, we trained an SVM on the central 64×64 pixel patches from one fold of the STL-10 training set and measured classification performance on all transformed versions of 500 samples from the test set.

The results of both experiments are shown in Fig. 6. Due to space restrictions we show only a few representative plots. Overall, the experiments empirically confirm that the Exemplar-CNN objective leads to learning invariant features. Features in the third layer and the final pooled feature representation compare favorably to a HOG baseline (Fig. 6 (a)). Furthermore, adding stronger transformations to the surrogate training data leads to classification that is more invariant with respect to these transformations (Fig. 6 (b)-(d)). However, adding too much contrast variation may deteriorate classification performance (Fig. 6 (d)). One possible reason is that the level of contrast can itself be a useful feature: for example, strong edges in an image are usually more important than weak ones.

5 Discussion

We have proposed a discriminative objective for unsupervised feature learning by training a CNN without class labels. The core idea is to generate a set of surrogate labels via data augmentation. The features learned by the network yield a large improvement in classification accuracy compared to features obtained with previous unsupervised methods. These results strongly indicate that a discriminative objective is superior to objectives previously used for unsupervised feature learning.

One potential shortcoming of the proposed method is that in its current state it does not scale to arbitrarily large datasets.
Two probable reasons for this are that (1) as the number of surrogate classes\ngrows larger, many of them become similar, which contradicts the discriminative objective, and (2)\nthe surrogate task we use is relatively simple and does not allow the network to learn invariance to\ncomplex variations, such as 3D viewpoint changes or inter-instance variation. We hypothesize that\nthe presented approach could learn more powerful higher-level features, if the surrogate data were\nmore diverse. This could be achieved by using additional weak supervision, for example, by means\nof video or a small number of labeled samples. Another possible way of obtaining richer surro-\ngate training data and at the same time avoiding similar surrogate classes would be (unsupervised)\nmerging of similar surrogate classes. We see these as interesting directions for future work.\n\nAcknowledgements\n\nWe acknowledge funding by the ERC Starting Grant VideoLearn (279401); the work was also partly\nsupported by the BrainLinks-BrainTools Cluster of Excellence funded by the German Research\nFoundation (DFG, grant number EXC 1086).\n\nReferences\n[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classi\ufb01cation with deep convolutional neural\n\nnetworks. 
In NIPS, pages 1106–1114, 2012.

[2] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In ECCV, 2014.

[3] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. In ICML, 2014.

[4] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.

[5] K. Cho. Simple sparsification improves sparse denoising autoencoders in denoising highly corrupted images. In ICML. JMLR Workshop and Conference Proceedings, 2013.

[6] P. Fischer, A. Dosovitskiy, and T. Brox. Descriptor matching with convolutional neural networks: a comparison to SIFT. 2014. pre-print, arXiv:1405.5769v1 [cs.CV].

[7] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.

[8] K. Kavukcuoglu, P. Sermanet, Y. Boureau, K. Gregor, M. Mathieu, and Y. LeCun. Learning convolutional feature hierarchies for visual recognition. In NIPS, 2010.

[9] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust features with denoising autoencoders. In ICML, pages 1096–1103, 2008.

[10] W. Y. Zou, A. Y. Ng, S. Zhu, and K. Yu. Deep learning of invariant features via simulated fixations in video. In NIPS, pages 3212–3220, 2012.

[11] K. Sohn and H. Lee. Learning invariant representations with local transformations. In ICML, 2012.

[12] K. Y. Hui. Direct modeling of complex invariances for visual object features. In ICML, 2013.

[13] P. Simard, B. Victorri, Y. LeCun, and J. S. Denker. Tangent Prop - A formalism for specifying selected invariances in an adaptive network. In NIPS, 1992.

[14] H. Drucker and Y. LeCun. Improving generalization performance using double backpropagation. IEEE Transactions on Neural Networks, 3(6):991–997, 1992.

[15] M.-R. Amini and P. Gallinari. Semi supervised logistic regression. In ECAI, pages 390–394, 2002.

[16] Y. Grandvalet and Y. Bengio. Entropy regularization. In Semi-Supervised Learning, pages 151–168. MIT Press, 2006.

[17] A. Ahmed, K. Yu, W. Xu, Y. Gong, and E. Xing. Training hierarchical feed-forward visual recognition models using transfer learning from pseudo-tasks. In ECCV (3), pages 69–82, 2008.

[18] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2493–2537, 2011.

[19] S. Wager, S. Wang, and P. Liang. Dropout training as adaptive regularization. In NIPS, 2013.

[20] S. Rifai, Y. N. Dauphin, P. Vincent, Y. Bengio, and X. Muller. The manifold tangent classifier. In NIPS, 2011.

[21] A. Coates, H. Lee, and A. Y. Ng. An analysis of single-layer networks in unsupervised feature learning. In AISTATS, 2011.

[22] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Master's thesis, Department of Computer Science, University of Toronto, 2009.

[23] L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. In CVPR WGMBV, 2004.

[24] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. 2012. pre-print, arXiv:1207.0580v3.

[25] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.

[26] A. Coates and A. Y. Ng. Selecting receptive fields in deep networks. In NIPS, pages 2528–2536, 2011.

[27] L. Bo, X. Ren, and D. Fox. Unsupervised feature learning for RGB-D based object recognition. In ISER, June 2012.

[28] Y. Boureau, N. Le Roux, F. Bach, J. Ponce, and Y. LeCun. Ask the locals: multi-way local pooling for image recognition. In ICCV'11. IEEE, 2011.

[29] L. Bo, X. Ren, and D. Fox. Multipath sparse coding using hierarchical matching pursuit. In CVPR, pages 660–667, 2013.

[30] K. Swersky, J. Snoek, and R. P. Adams. Multi-task Bayesian optimization. In NIPS, 2013.

[31] M. Lin, Q. Chen, and S. Yan. Network in network. In ICLR, 2014.

[32] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, 2014.