{"title": "Sparse Feature Learning for Deep Belief Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 1185, "page_last": 1192, "abstract": "Unsupervised learning algorithms aim to discover the structure hidden in the data, and to learn representations that are more suitable as input to a supervised machine than the raw input. Many unsupervised methods are based on reconstructing the input from the representation, while constraining the representation to have certain desirable properties (e.g. low dimension, sparsity, etc). Others are based on approximating density by stochastically reconstructing the input from the representation. We describe a novel and efficient algorithm to learn sparse representations, and compare it theoretically and experimentally with a similar machines trained probabilistically, namely a Restricted Boltzmann Machine. We propose a simple criterion to compare and select different unsupervised machines based on the trade-off between the reconstruction error and the information content of the representation. We demonstrate this method by extracting features from a dataset of handwritten numerals, and from a dataset of natural image patches. We show that by stacking multiple levels of such machines and by training sequentially, high-order dependencies between the input variables can be captured.", "full_text": "Sparse Feature Learning for Deep Belief Networks\n\nMarc\u2019Aurelio Ranzato1\n\nY-Lan Boureau2,1\n\nYann LeCun1\n\n1 Courant Institute of Mathematical Sciences, New York University\n\n2 INRIA Rocquencourt\n\n{ranzato,ylan,yann@courant.nyu.edu}\n\nAbstract\n\nUnsupervised learning algorithms aim to discover the structure hidden in the data,\nand to learn representations that are more suitable as input to a supervised machine\nthan the raw input. 
Many unsupervised methods are based on reconstructing the\ninput from the representation, while constraining the representation to have cer-\ntain desirable properties (e.g. low dimension, sparsity, etc). Others are based on\napproximating density by stochastically reconstructing the input from the repre-\nsentation. We describe a novel and ef\ufb01cient algorithm to learn sparse represen-\ntations, and compare it theoretically and experimentally with a similar machine\ntrained probabilistically, namely a Restricted Boltzmann Machine. We propose a\nsimple criterion to compare and select different unsupervised machines based on\nthe trade-off between the reconstruction error and the information content of the\nrepresentation. We demonstrate this method by extracting features from a dataset\nof handwritten numerals, and from a dataset of natural image patches. We show\nthat by stacking multiple levels of such machines and by training sequentially,\nhigh-order dependencies between the input observed variables can be captured.\n\n1 Introduction\n\nOne of the main purposes of unsupervised learning is to produce good representations for data, that\ncan be used for detection, recognition, prediction, or visualization. Good representations eliminate\nirrelevant variabilities of the input data, while preserving the information that is useful for the ul-\ntimate task. One cause for the recent resurgence of interest in unsupervised learning is the ability\nto produce deep feature hierarchies by stacking unsupervised modules on top of each other, as pro-\nposed by Hinton et al. [1], Bengio et al. [2] and our group [3, 4]. The unsupervised module at one\nlevel in the hierarchy is fed with the representation vectors produced by the level below. Higher-\nlevel representations capture high-level dependencies between input variables, thereby improving\nthe ability of the system to capture underlying regularities in the data. 
The output of the last layer in the hierarchy can be fed to a conventional supervised classifier.\n\nA natural way to design stackable unsupervised learning systems is the encoder-decoder paradigm [5]. An encoder transforms the input into the representation (also known as the code or the feature vector), and a decoder reconstructs the input (perhaps stochastically) from the representation. PCA, auto-encoder neural nets, Restricted Boltzmann Machines (RBMs), our previous sparse energy-based model [3], and the model proposed in [6] for noisy overcomplete channels are just examples of this kind of architecture. The encoder/decoder architecture is attractive for two reasons: 1. after training, computing the code is a very fast process that merely consists in running the input through the encoder; 2. reconstructing the input with the decoder provides a way to check that the code has captured the relevant information in the data. Some learning algorithms [7] do not have a decoder and must resort to computationally expensive Markov Chain Monte Carlo (MCMC) sampling methods in order to provide reconstructions. Other learning algorithms [8, 9] lack an encoder, which makes it necessary to run an expensive optimization algorithm to find the code associated with each new input sample. In this paper we will focus only on encoder-decoder architectures.\n\nIn general terms, we can view an unsupervised model as defining a distribution over input vectors Y through an energy function E(Y, Z, W):\n\nP(Y|W) = ∫_z P(Y, z|W) = ∫_z e^(−βE(Y,z,W)) / ∫_(y,z) e^(−βE(y,z,W))    (1)\n\nwhere Z is the code vector, W the trainable parameters of encoder and decoder, and β is an arbitrary positive constant. The energy function includes the reconstruction error, and perhaps other terms as well. For convenience, we will omit W from the notation in the following. 
Training the machine to model the input distribution is performed by finding the encoder and decoder parameters that minimize a loss function equal to the negative log likelihood of the training data under the model. For a single training sample Y, the loss function is\n\nL(W, Y) = −(1/β) log ∫_z e^(−βE(Y,z)) + (1/β) log ∫_(y,z) e^(−βE(y,z))    (2)\n\nThe first term is the free energy F_β(Y). Assuming that the distribution over Z is rather peaked, it can be simpler to approximate this distribution over Z by its mode, which turns the marginalization over Z into a minimization:\n\nL*(W, Y) = E(Y, Z*(Y)) + (1/β) log ∫_y e^(−βE(y, Z*(y)))    (3)\n\nwhere Z*(Y) is the maximum likelihood value Z*(Y) = argmin_z E(Y, z), also known as the optimal code. We can then define an energy for each input point, that measures how well it is reconstructed by the model:\n\nF_∞(Y) = E(Y, Z*(Y)) = lim_(β→∞) −(1/β) log ∫_z e^(−βE(Y,z))    (4)\n\nThe second term in equations 2 and 3 is called the log partition function, and can be viewed as a penalty term for low energies. It ensures that the system produces low energy only for input vectors that have high probability in the (true) data distribution, and produces higher energies for all other input vectors [5]. The overall loss is the average of the above over the training set.\n\nRegardless of whether only Z* or the whole distribution over Z is considered, the main difficulty with this framework is that it can be very hard to compute the gradient of the log partition function in equation 2 or 3 with respect to the parameters W. Efficient methods shortcut the computation by drastically and cleverly reducing the integration domain. 
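The marginalization-versus-minimization trade-off in equations 2–4 can be checked numerically. Below is a minimal sketch (a toy scalar input, a hypothetical quadratic energy, and a small discrete code set — none of these come from the paper) showing that the free energy F_β(Y) approaches min_z E(Y, z) as β grows:

```python
import numpy as np

# Hypothetical toy setup (not the paper's model): scalar input y and a small
# discrete set of candidate codes z, with a quadratic reconstruction energy.
def energy(y, z):
    return (y - z) ** 2

def free_energy(y, codes, beta):
    # F_beta(Y) = -(1/beta) * log sum_z exp(-beta * E(Y, z))  (first term of eq. 2)
    e = np.array([energy(y, z) for z in codes])
    return -np.logaddexp.reduce(-beta * e) / beta

codes = np.linspace(-1.0, 1.0, 5)
y = 0.3
# As beta grows, the marginalization approaches the minimization of eq. 4:
# F_beta(Y) -> min_z E(Y, z)
print(free_energy(y, codes, beta=1.0))
print(free_energy(y, codes, beta=1000.0))
print(min(energy(y, z) for z in codes))
```

For any finite β the free energy lies below the minimum energy (the soft minimum), and the gap closes as β → ∞.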
For instance, Restricted Boltzmann Ma-\nchines (RBM) [10] approximate the gradient of the log partition function in equation 2 by sampling\nvalues of Y whose energy will be pulled up using an MCMC technique. By running the MCMC for\na short time, those samples are chosen in the vicinity of the training samples, thereby ensuring that\nthe energy surface forms a ravine around the manifold of the training samples. This is the basis of\nthe Contrastive Divergence method [10].\n\nThe role of the log partition function is merely to ensure that the energy surface is lower around\ntraining samples than anywhere else. The method proposed here eliminates the log partition function\nfrom the loss, and replaces it by a term that limits the volume of the input space over which the energy\nsurface can take a low value. This is performed by adding a penalty term on the code rather than on\nthe input. While this class of methods does not directly maximize the likelihood of the data, it can be\nseen as a crude approximation of it. To understand the method, we \ufb01rst note that if for each vector\nY , there exists a corresponding optimal code Z \u2217(Y ) that makes the reconstruction error (or energy)\nF\u221e(Y ) zero (or near zero), the model can perfectly reconstruct any input vector. This makes the\nenergy surface \ufb02at and indiscriminate. On the other hand, if Z can only take a small number of\ndifferent values (low entropy code), then the energy F\u221e(Y ) can only be low in a limited number of\nplaces (the Y \u2019s that are reconstructed from this small number of Z values), and the energy cannot\nbe \ufb02at.\n\nMore generally, a convenient method through which \ufb02at energy surfaces can be avoided is to limit\nthe maximum information content of the code. 
Hence, minimizing the energy F_∞(Y) together with the information content of the code is a good substitute for minimizing the log partition function.\n\nA popular way to minimize the information content in the code is to make the code sparse or low-dimensional [5]. This technique is used in a number of unsupervised learning methods, including PCA, auto-encoder neural networks, and sparse coding methods [6, 3, 8, 9]. In sparse methods, the code is forced to have only a few non-zero units while most code units are zero most of the time. Sparse-overcomplete representations have a number of theoretical and practical advantages, as demonstrated in a number of recent studies [6, 8, 3]. In particular, they have good robustness to noise, and provide a good tiling of the joint space of location and frequency. In addition, they are advantageous for classifiers because classification is more likely to be easier in higher dimensional spaces. This may explain why biology seems to like sparse representations [11]. In our context, the main advantage of sparsity constraints is to allow us to replace a marginalization by a minimization, and to free ourselves from the need to minimize the log partition function explicitly.\n\nIn this paper we propose a new unsupervised learning algorithm called Sparse Encoding Symmetric Machine (SESM), which is based on the encoder-decoder paradigm, and which is able to produce sparse overcomplete representations efficiently without any need for filter normalization [8, 12] or code saturation [3]. As described in more detail in sec. 2 and 3, we consider a loss function which is a weighted sum of the reconstruction error and a sparsity penalty, as in many other unsupervised learning algorithms [13, 14, 8]. Encoder and decoder are constrained to be symmetric, and share a set of linear filters. 
Although we only consider linear filters in this paper, the method allows the use of any differentiable function for encoder and decoder. We propose an iterative on-line learning algorithm which is closely related to those proposed by Olshausen and Field [8] and by us previously [3]. The first step computes the optimal code by minimizing the energy for the given input. The second step updates the parameters of the machine so as to minimize the energy.\n\nIn sec. 4, we compare SESM with RBM and PCA. Following [15], we evaluate these methods by measuring the reconstruction error for a given entropy of the code. In another set of experiments, we train a classifier on the features extracted by the various methods, and measure the classification error on the MNIST dataset of handwritten numerals. Interestingly, the machine achieving the best recognition performance is the one with the best trade-off between RMSE and entropy. In sec. 5, we compare the filters learned by SESM and RBM for handwritten numerals and natural image patches. In sec. 5.1.1, we describe a simple way to produce a deep belief net by stacking multiple levels of SESM modules. The representational power of this hierarchical non-linear feature extraction is demonstrated through the unsupervised discovery of the numeral class labels in the high-level code.\n\n2 Architecture\n\nIn this section we describe a Sparse Encoding Symmetric Machine (SESM) having a set of linear filters in both encoder and decoder. However, everything can be easily extended to any other choice of parameterized functions as long as these are differentiable and maintain symmetry between encoder and decoder. Let us denote with Y the input defined in R^N, and with Z the code defined in R^M, where M is in general greater than N (for overcomplete representations). 
Let the filters in encoder and decoder be the columns of matrix W ∈ R^(N×M), and let the biases in the encoder and decoder be denoted by b_enc ∈ R^M and b_dec ∈ R^N, respectively. Then, encoder and decoder compute:\n\nf_enc(Y) = W^T Y + b_enc,    f_dec(Z) = W l(Z) + b_dec    (5)\n\nwhere the function l is a point-wise logistic non-linearity of the form:\n\nl(x) = 1/(1 + exp(−gx))    (6)\n\nwith g a fixed gain. The system is characterized by an energy measuring the compatibility between pairs of input Y and latent code Z, E(Y, Z) [16]. The lower the energy, the more compatible (or likely) is the pair. We define the energy as:\n\nE(Y, Z) = α_e ‖Z − f_enc(Y)‖²₂ + ‖Y − f_dec(Z)‖²₂    (7)\n\nDuring training we minimize the following loss:\n\nL(W, Y) = E(Y, Z) + α_s h(Z) + α_r ‖W‖₁ = α_e ‖Z − f_enc(Y)‖²₂ + ‖Y − f_dec(Z)‖²₂ + α_s h(Z) + α_r ‖W‖₁    (8)\n\nThe first term tries to make the output of the encoder as similar as possible to the code Z. The second term is the mean-squared error between the input Y and the reconstruction provided by the decoder. The third term ensures the sparsity of the code by penalizing non-zero values of code units; this term acts independently on each code unit and is defined as h(Z) = Σ_(i=1..M) log(1 + l²(z_i)) (corresponding to a factorized Student-t prior distribution on the non-linearly transformed code units [8] through the logistic of equation 6). The last term is an L1 regularization on the filters to suppress noise and favor more localized filters. The loss formulated in equation 8 combines terms that also characterize other methods. For instance, the first two terms appear in our previous model [3], but in that work, the weights of encoder and decoder were not tied and the parameters in the logistic were updated using running averages. 
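For concreteness, the energy and loss of equations 5–8 can be sketched as follows. The dimensions, initialization, and hyper-parameter values below are illustrative assumptions, not the paper's experimental settings (except the gain g = 7 mentioned in sec. 5); biases are kept but left at zero:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions and hyper-parameters (not the paper's settings):
N, M = 16, 32            # input in R^N, overcomplete code in R^M
g = 7.0                  # logistic gain (the paper uses g = 7)
alpha_e, alpha_s, alpha_r = 1.0, 0.2, 0.0004

W = 0.1 * rng.standard_normal((N, M))   # shared encoder/decoder filters
b_enc = np.zeros(M)
b_dec = np.zeros(N)

def l(x):                 # point-wise logistic with fixed gain (eq. 6)
    return 1.0 / (1.0 + np.exp(-g * x))

def f_enc(y):             # encoder (eq. 5)
    return W.T @ y + b_enc

def f_dec(z):             # decoder (eq. 5), shares W with the encoder
    return W @ l(z) + b_dec

def energy(y, z):         # eq. 7
    return alpha_e * np.sum((z - f_enc(y)) ** 2) + np.sum((y - f_dec(z)) ** 2)

def sparsity(z):          # h(Z) = sum_i log(1 + l(z_i)^2)
    return np.sum(np.log(1.0 + l(z) ** 2))

def loss(y, z):           # eq. 8
    return energy(y, z) + alpha_s * sparsity(z) + alpha_r * np.sum(np.abs(W))

y = rng.standard_normal(N)
z = f_enc(y)              # encoder prediction as an initial code
print(loss(y, z))
```

Note that when Z is set to the encoder output, the first term of the energy vanishes and only the reconstruction error remains.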
The second and third terms are present in the “decoder-only” model proposed in [8]. The third term was used in the “encoder-only” model of [7]. Besides the already-mentioned advantages of using an encoder-decoder architecture, we point out another good feature of this algorithm due to its symmetry. A common idiosyncrasy of sparse-overcomplete methods using both a reconstruction and a sparsity penalty in the objective function (second and third term in equation 8) is the need to normalize the basis functions in the decoder during learning [8, 12] with a somewhat ad-hoc technique; otherwise some of the basis functions collapse to zero, and some blow up to infinity. Because of the sparsity penalty and the linear reconstruction, code units become tiny and are compensated by the filters in the decoder that grow without bound. Even though the overall loss decreases, training is unsuccessful. Unfortunately, simply normalizing the filters makes it less clear which objective function is minimized. Some authors have proposed quite expensive methods to solve this issue: by making better approximations of the posterior distribution [15], or by using sampling techniques [17]. In this work, we propose to enforce symmetry between encoder and decoder (through weight sharing) so as to have automatic scaling of filters. Their norm cannot possibly be large because code units, produced by the encoder weights, would have large values as well, producing bad reconstructions and increasing the energy (the second term in equations 7 and 8).\n\n3 Learning Algorithm\n\nLearning consists of determining the parameters in W, b_enc, and b_dec that minimize the loss in equation 8. As indicated in the introduction, the energy augmented with the sparsity constraint is minimized with respect to the code to find the optimal code. No marginalization over the code distribution is performed. This is akin to using the loss function in equation 3. 
However, the log partition function term is dropped. Instead, we rely on the code sparsity constraints to ensure that the energy surface is not flat.\n\nSince the second term in equation 8 couples Z with both W and b_dec, it is not straightforward to minimize this loss with respect to both. On the other hand, once Z is given, the minimization with respect to W is a convex quadratic problem. Vice versa, if the parameters W are fixed, the optimal code Z* that minimizes L can be computed easily through gradient descent. This suggests the following iterative on-line coordinate descent learning algorithm:\n1. for a given sample Y and parameter setting, minimize the loss in equation 8 with respect to Z by gradient descent to obtain the optimal code Z*;\n2. clamping both the input Y and the optimal code Z* found at the previous step, do one step of gradient descent to update the parameters.\nUnlike other methods [8, 12], no column normalization of W is required. Also, all the parameters are updated by gradient descent, unlike in our previous work [3] where some parameters are updated using a moving average.\n\nAfter training, the system converges to a state where the decoder produces good reconstructions from a sparse code, and the optimal code is predicted by a simple feed-forward propagation through the encoder.\n\n4 Comparative Coding Analysis\n\nIn the following sections, we mainly compare SESM with RBM in order to better understand their differences in terms of maximum likelihood approximation, and in terms of coding efficiency and robustness.\nRBM\nAs explained in the introduction, RBMs minimize an approximation of the negative log likelihood of the data under the model. An RBM is a binary stochastic symmetric machine defined by an energy function of the form:\n\nE(Y, Z) = −Z^T W^T Y − b_enc^T Z − b_dec^T Y\n\nAlthough this is not obvious at first glance, this energy can be seen as a special case of the encoder-decoder architecture that pertains to binary data vectors and code vectors [5]. Training an RBM minimizes an approximation of the negative log likelihood loss function of equation 2, averaged over the training set, through a gradient descent procedure. Instead of estimating the gradient of the log partition function, RBM training uses contrastive divergence [10], which takes random samples drawn over a limited region Ω around the training samples. The loss becomes:\n\nL(W, Y) = −(1/β) log Σ_z e^(−βE(Y,z)) + (1/β) log Σ_(y∈Ω) Σ_z e^(−βE(y,z))    (9)\n\nBecause of the RBM architecture, given a Y, the components of Z are independent, hence the sum over configurations of Z can be done independently for each component of Z. Sampling y in the neighborhood Ω is performed with one, or a few, alternated MCMC steps over Y and Z. This means that only the energy of points around training samples is pulled up. Hence, the likelihood function takes the right shape around the training samples, but not necessarily everywhere. However, the code vector in an RBM is binary and noisy, and one may wonder whether this does not have the effect of surreptitiously limiting the information content of the code, thereby further minimizing the log partition function as a bonus.\nSESM\nRBM and SESM have almost the same architecture because they both have a symmetric encoder and decoder, and a logistic non-linearity on the top of the encoder. However, RBM is trained using (approximate) maximum likelihood, while SESM is trained by simply minimizing the average energy F_∞(Y) of equation 4 with an additional code sparsity term. 
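The two-step coordinate descent of sec. 3 can be sketched as below. This is a toy instance with hand-derived gradients, omitted biases, and illustrative dimensions and step sizes; the paper's actual experimental settings appear in sec. 5:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy instance of the two-step coordinate descent of sec. 3 (dimensions and
# hyper-parameters are illustrative; biases are omitted for brevity).
N, M, g = 8, 16, 7.0
alpha_e, alpha_s, alpha_r = 1.0, 0.2, 0.0004
W = 0.1 * rng.standard_normal((N, M))

def l(z):
    return 1.0 / (1.0 + np.exp(-g * z))

def loss(W, y, z):
    return (alpha_e * np.sum((z - W.T @ y) ** 2)
            + np.sum((y - W @ l(z)) ** 2)
            + alpha_s * np.sum(np.log(1 + l(z) ** 2))
            + alpha_r * np.sum(np.abs(W)))

def grad_z(W, y, z):
    lz = l(z)
    dl = g * lz * (1 - lz)                       # l'(z)
    return (2 * alpha_e * (z - W.T @ y)
            - 2 * dl * (W.T @ (y - W @ lz))
            + alpha_s * 2 * lz * dl / (1 + lz ** 2))

y = rng.standard_normal(N)

# Step 1: find the optimal code Z* by gradient descent on the loss w.r.t. Z.
z = W.T @ y
for _ in range(200):
    z -= 0.05 * grad_z(W, y, z)

# Step 2: clamp Y and Z*, take one gradient step on the parameters W.
lz = l(z)
grad_W = (-2 * alpha_e * np.outer(y, z - W.T @ y)
          - 2 * np.outer(y - W @ lz, lz)
          + alpha_r * np.sign(W))
W_new = W - 0.01 * grad_W
print(loss(W_new, y, z) < loss(W, y, z))
```

In the full algorithm these two steps alternate over the training set on-line, one sample at a time.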
SESM relies on the sparsity term to prevent flat energy surfaces, while RBM relies on an explicit contrastive term in the loss, an approximation of the log partition function. Also, the coding strategy is very different because code units are “noisy” and binary in RBM, while they are quasi-binary and sparse in SESM. Features extracted by SESM look like object parts (see next section), while features produced by RBM lack an intuitive interpretation because they aim at modeling the input distribution and they are used in a distributed representation.\n\n4.1 Experimental Comparison\n\nIn the first experiment we have trained SESM, RBM, and PCA on the first 20000 digits in the MNIST training dataset [18] in order to produce codes with 200 components. Similarly to [15] we have collected test image codes after the logistic non-linearity (except for PCA, which is linear), and we have measured the root mean square error (RMSE) and the entropy. SESM was run for different values of the sparsity coefficient α_s in equation 8 (while all other parameters are left unchanged; see the next section for details). The RMSE is defined as\n\nRMSE = (1/σ) √( (1/(PN)) ‖Y − f_dec(Z̄)‖²₂ )\n\nwhere Z̄ is the uniformly quantized code produced by the encoder, P is the number of test samples, and σ is the estimated variance of units in the input Y. Assuming that the (quantized) code units are encoded independently and with the same distribution, the lower bound on the number of bits required to encode each of them is given by\n\nH_c.u. = −Σ_(i=1..Q) (c_i/(PM)) log₂ (c_i/(PM))\n\nwhere c_i is the number of counts in the i-th bin, and Q is the number of quantization levels. The number of bits per pixel is then equal to (M/N) H_c.u.. Unlike in [15, 12], the reconstruction is done taking the quantized code in order to measure the robustness of the code to the quantization noise. As shown in fig. 1-C, RBM is very robust to noise in the code because it is trained by sampling. 
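The coding-cost side of this measurement can be sketched as follows. The function name, the uniform random codes, and the dimensions are illustrative stand-ins for the actual quantized test codes:

```python
import numpy as np

# Toy version of the coding cost used in the comparison: quantize code units
# into Q bins and compute the empirical lower bound
# H_c.u. = -sum_i (c_i / (P*M)) * log2(c_i / (P*M))  in bits per code unit.
def code_entropy(codes, Q):
    # codes: P x M array of code values in [0, 1] (e.g. logistic outputs)
    bins = np.clip((codes * Q).astype(int), 0, Q - 1)   # uniform quantization
    counts = np.bincount(bins.ravel(), minlength=Q)
    p = counts[counts > 0] / counts.sum()
    return -(p * np.log2(p)).sum()

P, M, N, Q = 1000, 200, 784, 256
rng = np.random.default_rng(0)
codes = rng.random((P, M))            # hypothetical code values
H = code_entropy(codes, Q)
bits_per_pixel = (M / N) * H          # (M/N) * H_c.u.
print(H, bits_per_pixel)
```

Uniform codes cost close to log₂(Q) bits per unit; sparse or quasi-binary codes concentrate the counts in a few bins and therefore cost far less, which is the trade-off plotted in fig. 1.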
The opposite is true for PCA, which achieves the lowest RMSE when using high precision codes, but the highest RMSE when using a coarse quantization. SESM seems to give the best trade-off between RMSE and entropy. Fig. 1-D/F compare the features learned by SESM and RBM. Despite the similarities in the architecture, filters look quite different in general, revealing two different coding strategies: distributed for RBM, and sparse for SESM.\n\nIn the second experiment, we have compared these methods by means of a supervised task in order to assess which method produces the most discriminative representation. Since the labels are also available in MNIST, we have used the codes (produced by these machines trained unsupervised) as input to the same linear classifier. This is run for 100 epochs to minimize the squared error between outputs and targets, and has a mild ridge regularizer. Fig. 1-A/B show the result of these experiments in addition to what can be achieved by a linear classifier trained on the raw pixel data. Note that: 1) training on features instead of raw data improves the recognition (except for PCA\n\nFigure 1: (A)-(B) Error rate on MNIST training (with 10, 100 and 1000 samples per class) and test set produced by a linear classifier trained on the codes produced by SESM, RBM, and PCA. The entropy and RMSE refer to a quantization into 256 bins. The comparison has been extended also to the same classifier trained on raw pixel data (showing the advantage of extracting features). The error bars refer to 1 std. dev. of the error rate for 10 random choices of training datasets (same splits for all methods). The parameter α_s in eq. 8 takes values: 1, 0.5, 0.2, 0.1, 0.05. (C) Comparison between SESM, RBM, and PCA when quantizing the code into 5 and 256 bins. (D) Random selection from the 200 linear filters that were learned by SESM (α_s = 0.2). (E) Some pairs of original and reconstructed digits from the code produced by the encoder in SESM (feed-forward propagation through encoder and decoder). (F) Random selection of filters learned by RBM. (G) Back-projection in image space of the filters learned in the second stage of the hierarchical feature extractor. The second stage was trained on the non-linearly transformed codes produced by the first stage machine. The back-projection has been performed by using a 1-of-10 code in the second stage machine, and propagating this through the second stage decoder and first stage decoder. 
The filters at the second stage discover the class prototypes (manually ordered for visual convenience) even though no class label was ever used during training. (H) Feature extraction from 8x8 natural image patches: some filters that were learned.\n\nwhen the number of training samples is small), 2) RBM performance is competitive overall when few training samples are available, 3) the best performance is achieved by SESM for a sparsity level which trades off RMSE for entropy (overall for large training sets), 4) the method with the best RMSE is not the one with the lowest error rate, 5) compared to a SESM having the same error rate, RBM is more costly in terms of entropy.\n\n5 Experiments\n\nThis section describes some experiments we have done with SESM. The coefficient α_e in equation 8 has always been set equal to 1, and the gain in the logistic has been set equal to 7 in order to achieve a quasi-binary coding. The parameter α_s has to be set by cross-validation to a value which depends on the level of sparsity required by the specific application.\n\n5.1 Handwritten Digits\n\nFig. 1-B/E shows the result of training a SESM with α_s equal to 0.2. Training was performed on 20000 digits scaled between 0 and 1, by setting α_r to 0.0004 (in equation 8) with a learning rate equal to 0.025 (decreased exponentially). Filters detect the strokes that can be combined to form a digit. Even if the code unit activation has a very sparse distribution, reconstructions are very good (no minimization in code space was performed).\n\n5.1.1 Hierarchical Features\n\nA hierarchical feature extractor can be trained layer-by-layer similarly to what has been proposed in [19, 1] for training deep belief nets (DBNs). We have trained a second (higher) stage machine on the non-linearly transformed codes produced by the first (lower) stage machine described in the previous example. 
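Schematically, the layer-by-layer stacking reduces to feeding each stage's non-linearly transformed codes to the next stage. A minimal sketch with random stand-in filters (no actual training; each stage is reduced to a linear encoder followed by the logistic) is:

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer-by-layer stacking sketch (sec. 5.1.1): weights below are random
# stand-ins for filters that would come from unsupervised training.
def logistic(x, g=7.0):
    return 1.0 / (1.0 + np.exp(-g * x))

def encode(W, b, Y):
    # feed-forward code prediction: l(W^T Y + b)
    return logistic(Y @ W + b)

N, M1, M2 = 784, 200, 10          # input, first-stage code, second-stage code
W1 = 0.01 * rng.standard_normal((N, M1));  b1 = np.zeros(M1)
W2 = 0.01 * rng.standard_normal((M1, M2)); b2 = np.zeros(M2)

Y = rng.random((5, N))            # a batch of 5 hypothetical inputs
Z1 = encode(W1, b1, Y)            # first-stage codes
Z2 = encode(W2, b2, Z1)           # second stage is trained on the codes Z1
print(Z1.shape, Z2.shape)
```

In the paper, each stage is trained to completion before the next one is trained on its codes; only the feed-forward path is shown here.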
We used just 20000 codes to produce a higher level representation with just 10 components. Since we aimed to find a 1-of-10 code, we increased the sparsity level (in the second stage machine) by setting α_s to 1. Despite the completely unsupervised training procedure, the feature detectors in the second stage machine look like digit prototypes, as can be seen in fig. 1-G. The hierarchical unsupervised feature extractor is able to capture higher order correlations among the input pixel intensities, and to discover the highly non-linear mapping from raw pixel data to the class labels. Changing the random initialization can sometimes lead to the discovery of two different shapes of “9” without a unit encoding the “4”, for instance. Nevertheless, results are qualitatively very similar to this one. For comparison, when training a DBN, prototypes are not recovered because the learned code is distributed among units.\n\n5.2 Natural Image Patches\n\nA SESM with about the same setup was trained on a dataset of 30000 8x8 natural image patches randomly extracted from the Berkeley segmentation dataset [20]. The input images were simply scaled down to the range [0, 1.7], without even subtracting the mean. We have considered a 2 times overcomplete code with 128 units. The parameters α_s, α_r and the learning rate were set to 0.4, 0.025, and 0.001 respectively. Some filters are localized Gabor-like edge detectors in different positions and orientations, others are more global, and some encode the mean value (see fig. 1-H).\n\n6 Conclusions\n\nThere are two strategies to train unsupervised machines: 1) having a contrastive term in the loss function minimized during training, 2) constraining the internal representation in such a way that training samples can be better reconstructed than other points in input space. 
We have shown that RBM, which falls in the first class of methods, is particularly robust to channel noise, and achieves very low RMSE and a good recognition rate. We have also proposed a novel symmetric sparse encoding method following the second strategy which: is particularly efficient to train, has fast inference, works without requiring any whitening or even mean removal from the input, can provide the best recognition performance and trade-off between entropy/RMSE, and can be easily extended to a hierarchy discovering hidden structure in the data. We have proposed an evaluation protocol to compare different machines which is based on RMSE, entropy and, eventually, error rate when labels are also available. Interestingly, the machine achieving the best performance in classification is the one with the best trade-off between reconstruction error and entropy. A future avenue of work is to understand the reasons for this “coincidence”, and deeper connections between these two strategies.\nAcknowledgments\nWe wish to thank Jonathan Goodman, Geoffrey Hinton, and Yoshua Bengio for helpful discussions. This work was supported in part by NSF grant IIS-0535166 “toward category-level object recognition”, NSF ITR-0325463 “new directions in predictive learning”, and ONR grant N00014-07-1-0535 “integration and representation of high dimensional data”.\n\nReferences\n\n[1] G.E. Hinton and R.R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.\n[2] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep networks. In NIPS, 2006.\n[3] M. Ranzato, C. Poultney, S. Chopra, and Y. LeCun. Efficient learning of sparse representations with an energy-based model. In NIPS 2006. MIT Press, 2006.\n[4] Y. Bengio and Y. LeCun. Scaling learning algorithms towards AI. In L. Bottou, O. Chapelle, D. DeCoste, and J. Weston, editors, Large-Scale Kernel Machines. MIT Press, 2007.\n[5] M. Ranzato, Y. Boureau, S. Chopra, and Y. LeCun. A unified energy-based framework for unsupervised learning. In Proc. Conference on AI and Statistics (AI-Stats), 2007.\n[6] E. Doi, D.C. Balcan, and M.S. Lewicki. A theoretical analysis of robust coding over noisy overcomplete channels. In NIPS. MIT Press, 2006.\n[7] Y.W. Teh, M. Welling, S. Osindero, and G.E. Hinton. Energy-based models for sparse overcomplete representations. Journal of Machine Learning Research, 4:1235–1260, 2003.\n[8] B.A. Olshausen and D.J. Field. Sparse coding with an overcomplete basis set: a strategy employed by V1? Vision Research, 37:3311–3325, 1997.\n[9] D.D. Lee and H.S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401:788–791, 1999.\n[10] G.E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14:1771–1800, 2002.\n[11] P. Lennie. The cost of cortical computation. Current Biology, 13:493–497, 2003.\n[12] J.F. Murray and K. Kreutz-Delgado. Learning sparse overcomplete codes for images. The Journal of VLSI Signal Processing, 45:97–110, 2008.\n[13] G.E. Hinton and R.S. Zemel. Autoencoders, minimum description length, and Helmholtz free energy. In NIPS, 1994.\n[14] G.E. Hinton, P. Dayan, and M. Revow. Modeling the manifolds of images of handwritten digits. IEEE Transactions on Neural Networks, 8:65–74, 1997.\n[15] M.S. Lewicki and T.J. Sejnowski. Learning overcomplete representations. Neural Computation, 12:337–365, 2000.\n[16] Y. LeCun, S. Chopra, R. Hadsell, M. Ranzato, and F.J. Huang. A tutorial on energy-based learning. In G. Bakir et al., editors, Predicting Structured Data. MIT Press, 2006.\n[17] P. Sallee and B.A. Olshausen. Learning sparse multiscale image representations. In NIPS. 
MIT Press, 2002.\n\n[18] http://yann.lecun.com/exdb/mnist/.\n[19] G.E. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18:1527–1554, 2006.\n[20] http://www.cs.berkeley.edu/projects/vision/grouping/segbench/.\n", "award": [], "sourceid": 1118, "authors": [{"given_name": "Marc'aurelio", "family_name": "Ranzato", "institution": null}, {"given_name": "Y-lan", "family_name": "Boureau", "institution": null}, {"given_name": "Yann", "family_name": "LeCun", "institution": null}]}