{"title": "Winner-Take-All Autoencoders", "book": "Advances in Neural Information Processing Systems", "page_first": 2791, "page_last": 2799, "abstract": "In this paper, we propose a winner-take-all method for learning hierarchical sparse representations in an unsupervised fashion. We first introduce fully-connected winner-take-all autoencoders which use mini-batch statistics to directly enforce a lifetime sparsity in the activations of the hidden units. We then propose the convolutional winner-take-all autoencoder which combines the benefits of convolutional architectures and autoencoders for learning shift-invariant sparse representations. We describe a way to train convolutional autoencoders layer by layer, where in addition to lifetime sparsity, a spatial sparsity within each feature map is achieved using winner-take-all activation functions. We will show that winner-take-all autoencoders can be used to to learn deep sparse representations from the MNIST, CIFAR-10, ImageNet, Street View House Numbers and Toronto Face datasets, and achieve competitive classification performance.", "full_text": "Winner-Take-All Autoencoders\n\nAlireza Makhzani, Brendan Frey\n\nUniversity of Toronto\n\nmakhzani, frey@psi.toronto.edu\n\nAbstract\n\nIn this paper, we propose a winner-take-all method for learning hierarchical sparse\nrepresentations in an unsupervised fashion. We \ufb01rst introduce fully-connected\nwinner-take-all autoencoders which use mini-batch statistics to directly enforce a\nlifetime sparsity in the activations of the hidden units. We then propose the convo-\nlutional winner-take-all autoencoder which combines the bene\ufb01ts of convolutional\narchitectures and autoencoders for learning shift-invariant sparse representations.\nWe describe a way to train convolutional autoencoders layer by layer, where in\naddition to lifetime sparsity, a spatial sparsity within each feature map is achieved\nusing winner-take-all activation functions. 
We will show that winner-take-all au-\ntoencoders can be used to learn deep sparse representations from the MNIST,\nCIFAR-10, ImageNet, Street View House Numbers and Toronto Face datasets,\nand achieve competitive classi\ufb01cation performance.\n\n1 Introduction\n\nRecently, supervised learning has been developed and used successfully to produce representations\nthat have enabled leaps forward in classi\ufb01cation accuracy for several tasks [1]. However, the ques-\ntion that has remained unanswered is whether it is possible to learn equally \u201cpowerful\u201d representations\nfrom unlabeled data without any supervision. It is still widely recognized that unsupervised learning\nalgorithms that can extract useful features are needed for solving problems with limited label infor-\nmation. In this work, we exploit sparsity as a generic prior on the representations for unsupervised\nfeature learning. We \ufb01rst introduce the fully-connected winner-take-all autoencoders that learn to\ndo sparse coding by directly enforcing a winner-take-all lifetime sparsity constraint. We then intro-\nduce convolutional winner-take-all autoencoders that learn to do shift-invariant/convolutional sparse\ncoding by directly enforcing winner-take-all spatial and lifetime sparsity constraints.\n\n2 Fully-Connected Winner-Take-All Autoencoders\n\nTraining sparse autoencoders has been well studied in the literature. For example, in [2], a \u201clifetime\nsparsity\u201d penalty function proportional to the KL divergence between the hidden unit marginals ( \u02c6\u03c1)\nand the target sparsity probability (\u03c1) is added to the cost function: \u03bbKL(\u03c1 \u2016 \u02c6\u03c1). A major drawback\nof this approach is that it only works for certain target sparsities, and it is often very dif\ufb01cult to \ufb01nd\nthe right \u03bb parameter that results in a properly trained sparse autoencoder. 
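As a reference point, the \u03bbKL(\u03c1 \u2016 \u02c6\u03c1) penalty of [2] can be sketched as follows (a NumPy sketch; the function name, the clipping constant and the default values are our own assumptions, not taken from [2]):

```python
import numpy as np

# Sketch of the lifetime-sparsity penalty of [2] for sigmoid autoencoders:
# lambda * KL(rho || rho_hat), summed over hidden units. The function name
# and the clipping constant are our own choices.
def kl_sparsity_penalty(h, rho=0.05, lam=1e-3):
    """h: (batch, units) sigmoid activations in (0, 1)."""
    rho_hat = np.clip(h.mean(axis=0), 1e-8, 1 - 1e-8)  # hidden unit marginals
    kl = rho * np.log(rho / rho_hat) \
        + (1 - rho) * np.log((1 - rho) / (1 - rho_hat))
    return lam * kl.sum()
```

For ReLU activations, \u02c6\u03c1 can exceed one and the second logarithm is then undefined, which is exactly the limitation discussed next.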
Also, KL divergence\nwas originally proposed for sigmoidal autoencoders, and it is not clear how it can be applied to\nReLU autoencoders where \u02c6\u03c1 could be larger than one (in which case the KL divergence cannot be\nevaluated). In this paper, we propose Fully-Connected Winner-Take-All (FC-WTA) autoencoders to\naddress these concerns. FC-WTA autoencoders can aim for any target sparsity rate, train very fast\n(marginally slower than a standard autoencoder), have no hyper-parameter to be tuned (except the\ntarget sparsity rate) and ef\ufb01ciently train all the dictionary atoms even when very aggressive sparsity\nrates (e.g., 1%) are enforced.\n\n1\n\n\f(a) MNIST, 10%\n\n(b) MNIST, 5%\n\n(c) MNIST, 2%\n\nFigure 1: Learnt dictionary (decoder) of FC-WTA with 1000 hidden units trained on MNIST\n\nSparse coding algorithms typically comprise two steps: a highly non-linear sparse encoding oper-\nation that \ufb01nds the \u201cright\u201d atoms in the dictionary, and a linear decoding stage that reconstructs\nthe input with the selected atoms and updates the dictionary. The FC-WTA autoencoder is a non-\nsymmetric autoencoder where the encoding stage is typically a stack of several ReLU layers and\nthe decoder is just a linear layer. In the feedforward phase, after computing the hidden codes of\nthe last layer of the encoder, rather than reconstructing the input from all of the hidden units, for\neach hidden unit, we impose a lifetime sparsity by keeping the k percent largest activations of that\nhidden unit across the mini-batch samples and setting the rest of the activations of that hidden unit to\nzero. In the backpropagation phase, we only backpropagate the error through the k percent non-zero\nactivations. In other words, we are using the mini-batch statistics to approximate the statistics of\nthe activation of a particular hidden unit across all the samples, and \ufb01nding a hard threshold value\nfor which we can achieve a k% lifetime sparsity rate. 
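A minimal sketch of this lifetime-sparsity step (NumPy; the (batch, units) activation layout and all names are our assumptions rather than the paper's code):

```python
import numpy as np

# Lifetime sparsity as described above: for each hidden unit, keep only
# its k% largest activations across the mini-batch and zero the rest.
def lifetime_sparsity(h, rate=0.05):
    """h: (batch, units) ReLU activations. Returns sparsified h and mask."""
    batch = h.shape[0]
    k = max(1, int(round(rate * batch)))
    # per-unit hard threshold: the k-th largest activation in the batch
    thresh = np.sort(h, axis=0)[-k]           # shape (units,)
    mask = h >= thresh
    return h * mask, mask
```

The returned mask marks exactly the winner activations through which the reconstruction error would be backpropagated; at test time this step is simply skipped.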
In this setting, the highly nonlinear encoder of\nthe network (ReLUs followed by top-k sparsity) learns to do sparse encoding, and the decoder of\nthe network reconstructs the input linearly. At test time, we turn off the sparsity constraint and the\noutput of the deep ReLU network will be the \ufb01nal representation of the input. In order to train a\nstacked FC-WTA autoencoder, we \ufb01x the weights and train another FC-WTA autoencoder on top of\nthe \ufb01xed representation of the previous network.\n\nThe learnt dictionaries of FC-WTA autoencoders trained on the MNIST, CIFAR-10 and Toronto Face\ndatasets are visualized in Fig. 1 and Fig. 2. For large sparsity levels, the algorithm tends to learn\nvery local features that are too primitive to be used for classi\ufb01cation (Fig. 1a). As we decrease\nthe sparsity level, the network learns more useful features (longer digit strokes) and achieves better\nclassi\ufb01cation (Fig. 1b). Nevertheless, forcing too much sparsity results in features that are too global\nand do not factor the input into parts (Fig. 1c). Section 4.1 reports the classi\ufb01cation results.\n\nWinner-Take-All RBMs. Besides autoencoders, WTA activations can also be used in Restricted\nBoltzmann Machines (RBM) to learn sparse representations. Suppose h and v denote the hidden and\nvisible units of RBMs. For training WTA-RBMs, in the positive phase of the contrastive divergence,\ninstead of sampling from P(hi|v), we \ufb01rst keep the k% largest P(hi|v) for each hi across the\nmini-batch dimension and set the rest of the P(hi|v) values to zero, and then sample hi according to\nthe sparsi\ufb01ed P(hi|v). Filters of a WTA-RBM trained on MNIST are visualized in Fig. 3. We\ncan see that WTA-RBMs learn longer digit strokes on MNIST, which, as will be shown in Section 4.1,\nimproves the classi\ufb01cation rate. 
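The WTA-RBM positive phase described above can be sketched as follows (NumPy; the function name and the (batch, hidden_units) layout are our assumptions, not the paper's code):

```python
import numpy as np

# WTA-RBM positive phase: sparsify P(h_i|v) across the mini-batch
# (keep the k% largest per hidden unit, zero the rest), then sample
# binary hidden states from the sparsified probabilities.
def wta_positive_phase(p_h_given_v, rate=0.3, rng=None):
    """p_h_given_v: (batch, hidden_units) conditional probabilities."""
    rng = np.random.default_rng() if rng is None else rng
    batch = p_h_given_v.shape[0]
    k = max(1, int(round(rate * batch)))
    # per hidden unit, keep the k% largest probabilities in the batch
    thresh = np.sort(p_h_given_v, axis=0)[-k]
    sparse_p = np.where(p_h_given_v >= thresh, p_h_given_v, 0.0)
    # sample binary hidden states; zeroed probabilities never fire
    return (rng.random(sparse_p.shape) < sparse_p).astype(float)
```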
Note that the sparsity rate of WTA-RBMs (e.g., 30%) should not be\nas aggressive as WTA autoencoders (e.g., 5%), since RBMs are already being regularized by having\nbinary hidden states.\n\n(a) Toronto Face Dataset (48 \u00d7 48)\n\n(b) CIFAR-10 Patches (11 \u00d7 11)\n\nFigure 2: Dictionaries (decoder) of FC-WTA autoencoder with 256 hidden units and sparsity of 5%\n\n2\n\n\f(a) Standard RBM\n\n(b) WTA-RBM (sparsity of 30%)\n\nFigure 3: Features learned on MNIST by 256 hidden unit RBMs.\n\n3 Convolutional Winner-Take-All Autoencoders\n\nThere are several problems with applying conventional sparse coding methods on large images.\nFirst, it is not practical to directly apply a fully-connected sparse coding algorithm on high-resolution\n(e.g., 256 \u00d7 256) images. Second, even if we could do that, we would learn a very redundant\ndictionary whose atoms are just shifted copies of each other. For example, in Fig. 2a, the FC-\nWTA autoencoder has allocated different \ufb01lters for the same patterns (i.e., mouths/noses/glasses/face\nborders) occurring at different locations. One way to address this problem is to extract random image\npatches from input images and then train an unsupervised learning algorithm on these patches in\nisolation [3]. Once training is complete, the \ufb01lters can be used in a convolutional fashion to obtain\nrepresentations of images. As discussed in [3, 4], the main problem with this approach is that if the\nreceptive \ufb01eld is small, this method will not capture relevant features (imagine the extreme of 1 \u00d7 1\npatches). Increasing the receptive \ufb01eld size is problematic, because then a very large number of\nfeatures are needed to account for all the position-speci\ufb01c variations within the receptive \ufb01eld. For\nexample, we see that in Fig. 2b, the FC-WTA autoencoder allocates different \ufb01lters to represent the\nsame horizontal edge appearing at different locations within the receptive \ufb01eld. 
As a result, the learnt\nfeatures are essentially shifted versions of each other, which results in redundancy between \ufb01lters.\nUnsupervised methods that make use of convolutional architectures can be used to address this\nproblem, including convolutional RBMs [5], convolutional DBNs [6, 5], deconvolutional networks\n[7] and convolutional predictive sparse decomposition (PSD) [4, 8]. These methods learn features\nfrom the entire image in a convolutional fashion. In this setting, the \ufb01lters can focus on learning the\nshapes (i.e., \u201cwhat\u201d), because the location information (i.e., \u201cwhere\u201d) is encoded into feature maps\nand thus the redundancy among the \ufb01lters is reduced.\n\nIn this section, we propose Convolutional Winner-Take-All (CONV-WTA) autoencoders that learn\nto do shift-invariant/convolutional sparse coding by directly enforcing winner-take-all spatial and\nlifetime sparsity constraints. Our work is similar in spirit to deconvolutional networks [7] and convo-\nlutional PSD [4, 8]; whereas the approach in those works is to break apart the recognition and data\ngeneration pathways but learn them so that they are consistent, we describe a technique for\ndirectly learning a sparse convolutional autoencoder.\n\nA shallow convolutional autoencoder maps an input vector to a set of feature maps in a convolu-\ntional fashion. We assume that the boundaries of the input image are zero-padded, so that each\nfeature map has the same size as the input. The hidden representation is then mapped linearly to the\noutput using a deconvolution operation (Appendix A.1). The parameters are optimized to minimize\nthe mean square error. 
A non-regularized convolutional autoencoder learns useless delta function\n\ufb01lters that copy the input image to the feature maps and copy back the feature maps to the output.\nInterestingly, we have observed that even in the presence of denoising[9]/dropout[10] regulariza-\ntions, convolutional autoencoders still learn useless delta functions. Fig. 4a depicts the \ufb01lters of a\nconvolutional autoencoder with 16 maps, 20% input and 50% hidden unit dropout trained on Street\nView House Numbers dataset [11]. We see that the 16 learnt delta functions make 16 copies of the\ninput pixels, so even if half of the hidden units get dropped during training, the network can still\nrely on the non-dropped copies to reconstruct the input. This highlights the need for new and more\naggressive regularization techniques for convolutional autoencoders.\n\nThe proposed architecture for CONV-WTA autoencoder is depicted in Fig. 4b. The CONV-WTA\nautoencoder is a non-symmetric autoencoder where the encoder typically consists of a stack of\nseveral ReLU convolutional layers (e.g., 5 \u00d7 5 \ufb01lters) and the decoder is a linear deconvolutional\nlayer of larger size (e.g., 11 \u00d7 11 \ufb01lters). We chose to use a deep encoder with smaller \ufb01lters (e.g.,\n5 \u00d7 5) instead of a shallow one with larger \ufb01lters (e.g., 11 \u00d7 11), because the former introduces more\n\n3\n\n\f(a) Dropout CONV Autoencoder\n\n(b) WTA-CONV Autoencoder\n\nFigure 4: (a) Filters and feature maps of a denoising/dropout convolutional autoencoder, which\nlearns useless delta functions. (b) Proposed architecture for CONV-WTA autoencoder with spatial\nsparsity (128conv5-128conv5-128deconv11).\n\nnon-linearity and regularizes the network by forcing it to have a decomposition over large receptive\n\ufb01elds through smaller \ufb01lters. 
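For illustration only, the 128conv5-128conv5-128deconv11 layout of Fig. 4b might be written as follows (a PyTorch sketch; the class name, the single input channel and the padding choices are our assumptions, not the authors' code):

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the CONV-WTA encoder/decoder layout of Fig. 4b.
class ConvWTAAutoencoder(nn.Module):
    def __init__(self, maps=128):
        super().__init__()
        # deep ReLU encoder with small 5x5 filters; padding keeps each
        # feature map the same size as the input
        self.encoder = nn.Sequential(
            nn.Conv2d(1, maps, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv2d(maps, maps, kernel_size=5, padding=2), nn.ReLU(),
        )
        # linear deconvolutional decoder with a larger 11x11 filter
        self.decoder = nn.ConvTranspose2d(maps, 1, kernel_size=11, padding=5)

    def forward(self, x):
        h = self.encoder(x)
        # the winner-take-all sparsity of Sections 3.1-3.2 would be applied
        # to h here during training; omitted in this structural sketch
        return self.decoder(h)
```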
The CONV-WTA autoencoder is trained under two winner-take-all\nsparsity constraints: spatial sparsity and lifetime sparsity.\n\n3.1 Spatial Sparsity\n\nIn the feedforward phase, after computing the last feature maps of the encoder, rather than recon-\nstructing the input from all of the hidden units of the feature maps, we identify the single largest\nhidden activity within each feature map, and set the rest of the activities as well as their derivatives\nto zero. This results in a sparse representation whose sparsity level is the number of feature maps.\nThe decoder then reconstructs the output using only the active hidden units in the feature maps and\nthe reconstruction error is only backpropagated through these hidden units as well.\n\nConsistent with other representation learning approaches such as triangle k-means [3] and deconvo-\nlutional networks [7, 12], we observed that using a softer sparsity constraint at test time results in\na better classi\ufb01cation performance. So, in the CONV-WTA autoencoder, in order to \ufb01nd the \ufb01nal\nrepresentation of the input image, we simply turn off the sparsity regularizer and use ReLU con-\nvolutions to compute the last layer feature maps of the encoder. After that, we apply max-pooling\n(e.g., over 4 \u00d7 4 regions) on these feature maps and use this representation for classi\ufb01cation tasks\nor in training stacked CONV-WTA as will be discussed in Section 3.3. Fig. 
5 shows a CONV-WTA\nautoencoder that was trained on MNIST.\n\nFigure 5: The CONV-WTA autoencoder with 16 \ufb01rst layer \ufb01lters and 128 second layer \ufb01lters trained\non MNIST: (a) Input image. (b) Learnt dictionary (deconvolution \ufb01lters). (c) 16 feature maps while\ntraining (spatial sparsity applied). (d) 16 feature maps after training (spatial sparsity turned off). (e)\n16 feature maps of the \ufb01rst layer after applying local max-pooling. (f) 48 out of 128 feature maps of\nthe second layer after turning off the sparsity and applying local max-pooling (\ufb01nal representation).\n\n4\n\n\f(a) Spatial sparsity only\n\n(b) Spatial & lifetime sparsity 20% (c) Spatial & lifetime sparsity 5%\n\nFigure 6: Learnt dictionary (deconvolution \ufb01lters) of CONV-WTA autoencoder trained on MNIST\n(64conv5-64conv5-64conv5-64deconv11).\n\n3.2 Lifetime Sparsity\n\nAlthough spatial sparsity is very effective in regularizing the autoencoder, it requires all the dictio-\nnary atoms to contribute to the reconstruction of every image. We can further increase the sparsity\nby exploiting the winner-take-all lifetime sparsity as follows. Suppose we have 128 feature maps and\nthe mini-batch size is 100. After applying spatial sparsity, for each \ufb01lter we will have 100 \u201cwinner\u201d\nhidden units corresponding to the 100 mini-batch images. During the feedforward phase, for each \ufb01lter,\nwe only keep the k% largest of these 100 values and set the rest of the activations to zero. 
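A minimal NumPy sketch combining the spatial sparsity of Section 3.1 with this lifetime sparsity (the (batch, maps, H, W) layout and all names are our assumptions, not the paper's code):

```python
import numpy as np

def conv_wta_sparsity(fmaps, lifetime_rate=None):
    """fmaps: (batch, maps, H, W) encoder feature maps."""
    b, m, h, w = fmaps.shape
    flat = fmaps.reshape(b, m, -1)
    # spatial sparsity: keep only the single largest activity per feature map
    winners = flat.max(axis=2, keepdims=True)              # (b, m, 1)
    spatial = np.where(flat == winners, flat, 0.0).reshape(b, m, h, w)
    if lifetime_rate is not None:
        # lifetime sparsity: per filter, keep only the k% largest winners
        # across the mini-batch and drop the rest
        k = max(1, int(round(lifetime_rate * b)))
        vals = winners[:, :, 0]                            # (b, m) winner values
        thresh = np.sort(vals, axis=0)[-k]                 # per-filter threshold
        spatial = spatial * (vals >= thresh)[:, :, None, None]
    return spatial
```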
Note that\ndespite this aggressive sparsity, every \ufb01lter is forced to get updated upon visiting every mini-batch,\nwhich is crucial for avoiding the dead \ufb01lter problem that often occurs in sparse coding.\n\nFig. 6 and Fig. 7 show the effect of the lifetime sparsity on the dictionaries trained on MNIST\nand Toronto Face dataset. We see that similar to the FC-WTA autoencoders, by tuning the lifetime\nsparsity of CONV-WTA autoencoders, we can aim for different sparsity rates. If no lifetime sparsity\nis enforced, we learn local \ufb01lters that contribute to every training point (Fig. 6a and 7a). As we\nincrease the lifetime sparsity, we can learn rare but useful features that result in better classi\ufb01cation\n(Fig. 6b). Nevertheless, forcing too much lifetime sparsity will result in features that are too diverse\nand rare and do not properly factor the input into parts (Fig. 6c and 7b).\n\n3.3 Stacked CONV-WTA Autoencoders\n\nThe CONV-WTA autoencoder can be used as a building block to form a hierarchy. In order to train\nthe hierarchical model, we \ufb01rst train a CONV-WTA autoencoder on the input images. Then we pass\nall the training examples through the network and obtain their representations (last layer of the en-\ncoder after turning off sparsity and applying local max-pooling). Now we treat these representations\nas a new dataset and train another CONV-WTA autoencoder to obtain the stacked representations.\nFig. 5(f) shows the deep feature maps of a stacked CONV-WTA that was trained on MNIST.\n\n3.4 Scaling CONV-WTA Autoencoders to Large Images\n\nThe goal of convolutional sparse coding is to learn shift-invariant dictionary atoms and encoding\n\ufb01lters. Once the \ufb01lters are learnt, they can be applied convolutionally to any image of any size,\nand produce a spatial map corresponding to different locations at the input. We can use this idea\nto ef\ufb01ciently train CONV-WTA autoencoders on datasets containing large images. 
Suppose we\nwant to train an AlexNet [1] architecture in an unsupervised fashion on ImageNet, ILSVRC-2012\n(224 \u00d7 224).\n\n(a) Spatial sparsity only\n\n(b) Spatial and lifetime sparsity of 10%\n\nFigure 7: Learnt dictionary (deconvolution \ufb01lters) of CONV-WTA autoencoder trained on the\nToronto Face dataset (64conv7-64conv7-64conv7-64deconv15).\n\n5\n\n\f(a) Spatial sparsity\n\n(b) Spatial and lifetime sparsity of 10%\n\nFigure 8: Learnt dictionary (deconvolution \ufb01lters) of CONV-WTA autoencoder trained on ImageNet\n48 \u00d7 48 whitened patches. (64conv5-64conv5-64conv5-64deconv11).\n\nIn order to learn the \ufb01rst layer 11 \u00d7 11 shift-invariant \ufb01lters, we can extract medium-\nsize image patches of size 48 \u00d7 48 and train a CONV-WTA autoencoder with 64 dictionary atoms\nof size 11 on these patches. This will result in 64 shift-invariant \ufb01lters of size 11 \u00d7 11 that can\nef\ufb01ciently capture the statistics of 48 \u00d7 48 patches. Once the \ufb01lters are learnt, we can apply them in\na convolutional fashion with a stride of 4 to the full images and, after max-pooling, we will have\na 64 \u00d7 27 \u00d7 27 representation of the images. Now we can train another CONV-WTA autoencoder\non top of these feature maps to capture the statistics of a larger receptive \ufb01eld at different locations\nof the input image. This process could be repeated for multiple layers. Fig. 8 shows the dictionary\nlearnt on ImageNet using this approach. We can see that by imposing lifetime sparsity, we could\nlearn very diverse \ufb01lters such as corner, circular and blob detectors.\n\n4 Experiments\n\nIn all the experiments of this section, we evaluate the quality of unsupervised features of WTA\nautoencoders by training a naive linear classi\ufb01er (i.e., SVM) on top of them. We did not \ufb01ne-tune the\n\ufb01lters in any of the experiments. 
The implementation details of all the experiments are provided in\nAppendix A (in the supplementary materials). An IPython demo for reproducing important results\nof this paper is publicly available at http://www.comm.utoronto.ca/\u02dcmakhzani/.\n\n4.1 Winner-Take-All Autoencoders on MNIST\n\nThe MNIST dataset has 60K training points and 10K test points. Table 1 compares the performance\nof FC-WTA autoencoder and WTA-RBMs with other permutation-invariant architectures. Table 2a\ncompares the performance of CONV-WTA autoencoder with other convolutional architectures. In\nthese experiments, we have used all the available training labels (N = 60000 points) to train a linear\nSVM on top of the unsupervised features.\n\nAn advantage of unsupervised learning algorithms is the ability to use them in semi-supervised sce-\nnarios where labeled data is limited. Table 2b shows the semi-supervised performance of a CONV-\nWTA where we have assumed only N labels are available. In this case, the unsupervised features are\nstill trained on the whole dataset (60K points), but the SVM is trained only on the N labeled points\nwhere N varies from 300 to 60K. We compare this with the performance of a supervised deep con-\nvnet (CNN) [17] trained only on the N labeled training points. We can see supervised deep learning\ntechniques fail to learn good representations when labeled data is limited, whereas our WTA algo-\nrithm can extract useful features from the unlabeled data and achieve a better classi\ufb01cation. 
We also\ncompare our method with some of the best semi-supervised learning results recently obtained by\nconvolutional kernel networks (CKN) [16] and convolutional scattering networks (SC) [15]. We see\nthat CONV-WTA outperforms both of these methods when very few labels are available (N < 1K).\n\nMethod | Error Rate\nShallow Denoising/Dropout Autoencoder (20% input and 50% hidden units dropout) | 1.60%\nStacked Denoising Autoencoder (3 layers) [9] | 1.28%\nDeep Boltzmann Machines [13] | 0.95%\nk-Sparse Autoencoder [14] | 1.35%\nShallow FC-WTA Autoencoder, 2000 units, 5% sparsity | 1.20%\nStacked FC-WTA Autoencoder, 5% and 2% sparsity | 1.11%\nRestricted Boltzmann Machines | 1.60%\nWinner-Take-All Restricted Boltzmann Machines (30% sparsity) | 1.38%\n\nTable 1: Classi\ufb01cation performance of FC-WTA autoencoder features + SVM on MNIST.\n\n6\n\n\f(a) Unsupervised features + SVM trained on N = 60000 labels (no \ufb01ne-tuning)\n\nMethod | Error\nDeep Deconvolutional Network [7, 12] | 0.84%\nConvolutional Deep Belief Network [5] | 0.82%\nScattering Convolution Network [15] | 0.43%\nConvolutional Kernel Network [16] | 0.39%\nCONV-WTA Autoencoder, 16 maps | 1.02%\nCONV-WTA Autoencoder, 128 maps | 0.64%\nStacked CONV-WTA, 128 & 2048 maps | 0.48%\n\n(b) Unsupervised features + SVM trained on few labels N (semi-supervised)\n\nN | CNN [17] | CKN [16] | SC [15] | CONV-WTA\n300 | 7.18% | 4.15% | 4.70% | 3.47%\n600 | 5.28% | - | - | 2.37%\n1K | 3.21% | 2.05% | 2.30% | 1.92%\n2K | 2.53% | 1.51% | 1.30% | 1.45%\n5K | 1.52% | 1.21% | 1.03% | 1.07%\n10K | 0.85% | 0.88% | 0.88% | 0.91%\n60K | 0.53% | 0.39% | 0.43% | 0.48%\n\nTable 2: Classi\ufb01cation performance of CONV-WTA autoencoder trained on MNIST.\n\n4.2 CONV-WTA Autoencoder on Street View House Numbers\n\nThe SVHN dataset has about 600K training points and 26K test points. Table 3 reports the classi\ufb01cation results of the CONV-WTA autoencoder on this dataset. 
We \ufb01rst trained a shallow and a stacked\nCONV-WTA on all 600K training cases to learn the unsupervised features, and then performed two\nsets of experiments. In the \ufb01rst experiment, we used all the N=600K available labels to train an SVM\non top of the CONV-WTA features, and compared the result with convolutional k-means [11]. We\nsee that the stacked CONV-WTA achieves a dramatic improvement over the shallow CONV-WTA\nas well as k-means. In the second experiment, we trained an SVM by using only N = 1000 la-\nbeled data points and compared the result with deep variational autoencoders [18] trained in the same\nsemi-supervised fashion. Fig. 9 shows the learnt dictionary of CONV-WTA on this dataset.\n\nMethod | Accuracy\nConvolutional Triangle k-means [11] | 90.6%\nCONV-WTA Autoencoder, 256 maps (N=600K) | 88.5%\nStacked CONV-WTA Autoencoder, 256 and 1024 maps (N=600K) | 93.1%\nDeep Variational Autoencoders (non-convolutional) [18] (N=1000) | 63.9%\nStacked CONV-WTA Autoencoder, 256 and 1024 maps (N=1000) | 76.2%\nSupervised Maxout Network [19] (N=600K) | 97.5%\n\nTable 3: CONV-WTA unsupervised features + SVM trained on N labeled points of the SVHN dataset.\n\n(a) Contrast Normalized SVHN\n\n(b) Learnt Dictionary (64conv5-64conv5-64conv5-64deconv11)\n\nFigure 9: CONV-WTA autoencoder trained on the Street View House Numbers (SVHN) dataset.\n\n4.3 CONV-WTA Autoencoder on CIFAR-10\n\nFig. 10a reports the classi\ufb01cation results of CONV-WTA on CIFAR-10. We see that when a small num-\nber of feature maps (< 256) are used, considerable improvements over k-means can be achieved.\nThis is because our method can learn a shift-invariant dictionary as opposed to the redundant dictio-\nnaries learnt by patch-based methods such as k-means. In the largest deep network that we trained,\nwe used 256, 1024 and 4096 maps and achieved a classi\ufb01cation rate of 80.1% without using \ufb01ne-\ntuning, model averaging or data augmentation. Fig. 
10b shows the learnt dictionary on the CIFAR-\n10 dataset. We can see that the network has learnt diverse shift-invariant \ufb01lters such as point/corner\ndetectors, as opposed to Fig. 2b, which shows the position-speci\ufb01c \ufb01lters of patch-based methods.\n\n7\n\n\fMethod | Accuracy\nShallow Convolutional Triangle k-means (64 maps) [3] | 62.3%\nShallow CONV-WTA Autoencoder (64 maps) | 68.9%\nShallow Convolutional Triangle k-means (256 maps) [3] | 70.2%\nShallow CONV-WTA Autoencoder (256 maps) | 72.3%\nShallow Convolutional Triangle k-means (4000 maps) [3] | 79.6%\nDeep Triangle k-means (1600, 3200, 3200 maps) [20] | 82.0%\nConvolutional Deep Belief Net (2 layers) [6] | 78.9%\nExemplar CNN (300x Data Augmentation) [21] | 82.0%\nNOMP (3200, 6400, 6400 maps + Averaging 7 Models) [22] | 82.9%\nStacked CONV-WTA (256, 1024 maps) | 77.9%\nStacked CONV-WTA (256, 1024, 4096 maps) | 80.1%\nSupervised Maxout Network [19] | 88.3%\n\n(a) Unsupervised features + SVM (without \ufb01ne-tuning)\n\n(b) Learnt dictionary (deconv-\ufb01lters) 64conv5-64conv5-64conv5-64deconv7\n\nFigure 10: CONV-WTA autoencoder trained on the CIFAR-10 dataset.\n\n5 Discussion\n\nRelationship of FC-WTA to k-sparse autoencoders. k-sparse autoencoders impose sparsity across\ndifferent channels (population sparsity), whereas the FC-WTA autoencoder imposes sparsity across\ntraining examples (lifetime sparsity). When aiming for low sparsity levels, k-sparse autoencoders\nuse a scheduling technique to avoid the dead dictionary atom problem. WTA autoencoders, however,\ndo not have this problem since all the hidden units get updated upon visiting every mini-batch no\nmatter how aggressive the sparsity rate is (no scheduling required). As a result, we can train larger\nnetworks and achieve better classi\ufb01cation rates.\n\nRelationship of CONV-WTA to deconvolutional networks and convolutional PSD. 
Deconvolu-\ntional networks [7, 12] are top down models with no direct link from the image to the feature maps.\nThe inference of the sparse maps requires solving the iterative ISTA algorithm, which is costly.\nConvolutional PSD [4] addresses this problem by training a parameterized encoder separately to\nexplicitly predict the sparse codes using a soft thresholding operator. Deconvolutional networks and\nconvolutional PSD can be viewed as the generative decoder and encoder paths of a convolutional\nautoencoder. Our contribution is to propose a speci\ufb01c winner-take-all approach for training a convo-\nlutional autoencoder, in which both paths are trained jointly using direct backpropagation yielding\nan algorithm that is much faster, easier to implement and can train much larger networks.\n\nRelationship to maxout networks. Maxout networks [19] take the max across different channels,\nwhereas our method takes the max across space and mini-batch dimensions. Also the winner-take-all\nfeature maps retain the location information of the \u201cwinners\u201d within each feature map and different\nlocations have different connectivity on the subsequent layers, whereas the maxout activity is passed\nto the next layer using weights that are the same regardless of which unit gave the maximum.\n\n6 Conclusion\n\nWe proposed the winner-take-all spatial and lifetime sparsity methods to train autoencoders that\nlearn to do fully-connected and convolutional sparse coding. We observed that CONV-WTA autoen-\ncoders learn shift-invariant and diverse dictionary atoms as opposed to position-speci\ufb01c Gabor-like\natoms that are typically learnt by conventional sparse coding methods. Unlike related approaches,\nsuch as deconvolutional networks and convolutional PSD, our method jointly trains the encoder and\ndecoder paths by direct back-propagation, and does not require an iterative EM-like optimization\ntechnique during training. 
We described how our method can be scaled to large datasets such as\nImageNet and showed the necessity of the deep architecture to achieve better results. We performed\nexperiments on the MNIST, SVHN and CIFAR-10 datasets and showed that the classi\ufb01cation rates\nof winner-take-all autoencoders are competitive with the state-of-the-art.\n\nAcknowledgments\n\nWe would like to thank Ruslan Salakhutdinov and Andrew Delong for the valuable comments. We\nalso acknowledge the support of NVIDIA with the donation of the GPUs used for this research.\n\n8\n\n\fReferences\n\n[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, \u201cImagenet classi\ufb01cation with deep convo-\n\nlutional neural networks.,\u201d in NIPS, vol. 1, p. 4, 2012.\n\n[2] A. Ng, \u201cSparse autoencoder,\u201d CS294A Lecture notes, vol. 72, 2011.\n\n[3] A. Coates, A. Y. Ng, and H. Lee, \u201cAn analysis of single-layer networks in unsupervised\nfeature learning,\u201d in International Conference on Arti\ufb01cial Intelligence and Statistics,\n2011.\n\n[4] K. Kavukcuoglu, P. Sermanet, Y.-L. Boureau, K. Gregor, M. Mathieu, and Y. LeCun,\n\u201cLearning convolutional feature hierarchies for visual recognition.,\u201d in NIPS, vol. 1, p. 5,\n2010.\n\n[5] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng, \u201cConvolutional deep belief networks for\nscalable unsupervised learning of hierarchical representations,\u201d in Proceedings of the 26th\nAnnual International Conference on Machine Learning, pp. 609\u2013616, ACM, 2009.\n\n[6] A. Krizhevsky, \u201cConvolutional deep belief networks on cifar-10,\u201d Unpublished, 2010.\n[7] M. D. Zeiler, D. Krishnan, G. W. Taylor, and R. Fergus, \u201cDeconvolutional networks,\u201d in\nComputer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pp. 2528\u2013\n2535, IEEE, 2010.\n\n[8] P. Sermanet, K. Kavukcuoglu, S. Chintala, and Y. 
LeCun, \u201cPedestrian detection with\nunsupervised multi-stage feature learning,\u201d in Computer Vision and Pattern Recognition\n(CVPR), 2013 IEEE Conference on, pp. 3626\u20133633, IEEE, 2013.\n\n[9] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, \u201cStacked denoising\nautoencoders: Learning useful representations in a deep network with a local denoising\ncriterion,\u201d The Journal of Machine Learning Research, vol. 11, pp. 3371\u20133408, 2010.\n\n[10] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, \u201cIm-\nproving neural networks by preventing co-adaptation of feature detectors,\u201d arXiv preprint\narXiv:1207.0580, 2012.\n\n[11] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng, \u201cReading digits in\nnatural images with unsupervised feature learning,\u201d in NIPS workshop on deep learning\nand unsupervised feature learning, vol. 2011, p. 5, Granada, Spain, 2011.\n\n[12] M. D. Zeiler and R. Fergus, \u201cDifferentiable pooling for hierarchical feature learning,\u201d\n\narXiv preprint arXiv:1207.0151, 2012.\n\n[13] R. Salakhutdinov and G. E. Hinton, \u201cDeep boltzmann machines,\u201d in International Con-\n\nference on Arti\ufb01cial Intelligence and Statistics, pp. 448\u2013455, 2009.\n\n[14] A. Makhzani and B. Frey, \u201ck-sparse autoencoders,\u201d International Conference on Learning\n\nRepresentations, ICLR, 2014.\n\n[15] J. Bruna and S. Mallat, \u201cInvariant scattering convolution networks,\u201d Pattern Analysis and\n\nMachine Intelligence, IEEE Transactions on, vol. 35, no. 8, pp. 1872\u20131886, 2013.\n\n[16] J. Mairal, P. Koniusz, Z. Harchaoui, and C. Schmid, \u201cConvolutional kernel networks,\u201d in\n\nAdvances in Neural Information Processing Systems, pp. 2627\u20132635, 2014.\n\n[17] M. Ranzato, F. J. Huang, Y.-L. Boureau, and Y. 
Lecun, \u201cUnsupervised learning of invari-\nant feature hierarchies with applications to object recognition,\u201d in Computer Vision and\nPattern Recognition, 2007. CVPR\u201907. IEEE Conference on, pp. 1\u20138, IEEE, 2007.\n\n[18] D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling, \u201cSemi-supervised learning\nwith deep generative models,\u201d in Advances in Neural Information Processing Systems,\npp. 3581\u20133589, 2014.\n\n[19] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio, \u201cMaxout net-\n\nworks,\u201d ICML, 2013.\n\n[20] A. Coates and A. Y. Ng, \u201cSelecting receptive \ufb01elds in deep networks.,\u201d in NIPS, 2011.\n[21] A. Dosovitskiy, J. T. Springenberg, M. Riedmiller, and T. Brox, \u201cDiscriminative unsuper-\nvised feature learning with convolutional neural networks,\u201d in Advances in Neural Infor-\nmation Processing Systems, pp. 766\u2013774, 2014.\n\n[22] T.-H. Lin and H. Kung, \u201cStable and ef\ufb01cient representation learning with nonnegativity\nconstraints,\u201d in Proceedings of the 31st International Conference on Machine Learning\n(ICML-14), pp. 1323\u20131331, 2014.\n\n9\n\n\f", "award": [], "sourceid": 1591, "authors": [{"given_name": "Alireza", "family_name": "Makhzani", "institution": "University of Toronto"}, {"given_name": "Brendan", "family_name": "Frey", "institution": "U. Toronto"}]}