{"title": "Sparse Filtering", "book": "Advances in Neural Information Processing Systems", "page_first": 1125, "page_last": 1133, "abstract": "Unsupervised feature learning has been shown to be effective at learning representations that perform well on image, video and audio classification. However, many existing feature learning algorithms are hard to use and require extensive hyperparameter tuning. In this work, we present sparse filtering, a simple new algorithm which is efficient and only has one hyperparameter, the number of features to learn. In contrast to most other feature learning methods, sparse filtering does not explicitly attempt to construct a model of the data distribution. Instead, it optimizes a simple cost function -- the sparsity of L2-normalized features -- which can easily be implemented in a few lines of MATLAB code. Sparse filtering scales gracefully to handle high-dimensional inputs, and can also be used to learn meaningful features in additional layers with greedy layer-wise stacking. We evaluate sparse filtering on natural images, object classification (STL-10), and phone classification (TIMIT), and show that our method works well on a range of different modalities.", "full_text": "Sparse Filtering\n\nJiquan Ngiam, Pang Wei Koh, Zhenghao Chen, Sonia Bhaskar, Andrew Y. Ng\n{jngiam,pangwei,zhenghao,sbhaskar,ang}@cs.stanford.edu\n\nComputer Science Department, Stanford University\n\nAbstract\n\nUnsupervised feature learning has been shown to be effective at learning repre-\nsentations that perform well on image, video and audio classi\ufb01cation. However,\nmany existing feature learning algorithms are hard to use and require extensive\nhyperparameter tuning. In this work, we present sparse \ufb01ltering, a simple new\nalgorithm which is ef\ufb01cient and only has one hyperparameter, the number of fea-\ntures to learn. 
In contrast to most other feature learning methods, sparse \ufb01ltering\ndoes not explicitly attempt to construct a model of the data distribution. Instead, it\noptimizes a simple cost function \u2013 the sparsity of (cid:96)2-normalized features \u2013 which\ncan easily be implemented in a few lines of MATLAB code. Sparse \ufb01ltering scales\ngracefully to handle high-dimensional inputs, and can also be used to learn mean-\ningful features in additional layers with greedy layer-wise stacking. We evaluate\nsparse \ufb01ltering on natural images, object classi\ufb01cation (STL-10), and phone clas-\nsi\ufb01cation (TIMIT), and show that our method works well on a range of different\nmodalities.\n\n1\n\nIntroduction\n\nUnsupervised feature learning has recently emerged as a viable alternative to manually designing\nfeature representations. In many audio [1, 2], image [3, 4], and video [5] tasks, learned features\nhave matched or outperformed features speci\ufb01cally designed for such tasks. However, many current\nfeature learning algorithms are hard to use because they require a good deal of hyperparameter tun-\ning. For example, the sparse RBM [6, 7] has up to half a dozen hyperparameters and an intractable\nobjective function, making it hard to tune and monitor convergence.\nIn this work, we present sparse \ufb01ltering, a new feature learning algorithm which is easy to implement\nand essentially hyperparameter-free. Sparse \ufb01ltering is ef\ufb01cient and scales gracefully to handle\nlarge input dimensions. In contrast, it is typically computationally expensive to run straightforward\nimplementations of many other feature learning algorithms on large inputs.\nSparse \ufb01ltering works by optimizing exclusively for sparsity in the feature distribution. A key idea\nin our method is avoiding explicit modeling of the data distribution; this gives rise to a simple\nformulation and permits ef\ufb01cient learning. 
As a result, our method can be implemented in a few\nlines of MATLAB code1 and works well with an off-the-shelf function minimizer such as L-BFGS.\nMoreover, the hyperparameter-free approach means that sparse \ufb01ltering works well on a range of\ndata modalities without the need for speci\ufb01c tuning on each modality. This allows us to easily learn\nfeature representations that are well-suited for a variety of tasks, including object classi\ufb01cation and\nphone classi\ufb01cation.\n\n1\n\n\fTable 1. Comparison of tunable hyperparameters in various feature learning algorithms.\n\nAlgorithm\nOur Method (Sparse Filtering)\nICA\nSparse Coding\nSparse Autoencoders\nSparse RBMs\n\nTunable hyperparameters\n# features\n# features\n# features, sparsity penalty, mini-batch size\n# features, target activation, weight decay, sparsity penalty\n# features, target activation, weight decay, sparsity penalty,\nlearning rate, momentum\n\n2 Unsupervised feature learning\n\nTraditionally, feature learning methods have largely sought to learn models that provide good ap-\nproximations of the true data distribution; these include denoising autoencoders [8], restricted Boltz-\nmann machines (RBMs) [6, 7], (some versions of) independent component analysis (ICA) [9, 10],\nand sparse coding [11], among others.\nThese feature learning approaches have been successfully used to learn good feature representations\nfor a wide variety of tasks [1, 2, 3, 4, 5]. However, they are also often challenging to implement,\nrequiring the tuning of various hyperparameters; see Table 1 for a comparison of tunable hyperpa-\nrameters in several popular feature learning algorithms. 
Good settings for these hyperparameters\ncan vary widely from task to task, and can sometimes result in a drawn-out development process.\nThough ICA has only one tunable hyperparameter, it scales poorly to large sets of features or large\ninputs.2\nIn this work, our goal is to develop a simple and ef\ufb01cient feature learning algorithm that requires\nminimal tuning. To this end, we only focus on a few key properties of our features \u2013 population\nsparsity, lifetime sparsity, and high dispersal \u2013 without explicitly modeling the data distribution.\nWhile learning a model for the data distribution is desirable, it can complicate learning algorithms:\nfor example, sparse RBMs need to approximate the log-partition function\u2019s gradient in order to opti-\nmize for the data likelihood, while sparse coding needs to run relatively expensive inference at each\niteration to \ufb01nd the coef\ufb01cients of the active bases. The relative weightage of a data reconstruction\nterm versus a sparsity-inducing term is also often a hyperparameter that needs to be tuned.\n\n3 Feature distributions\n\nThe feature learning methods discussed in the previous section can all be viewed as generating\nparticular feature distributions. For instance, sparse coding represents each example using a few\nnon-zero coef\ufb01cients (features). A feature distribution oriented approach can provide insights into\ndesigning new algorithms based on optimizing for desirable properties of the feature distribution.\nFor clarity, let us consider a feature distribution matrix over a \ufb01nite dataset, where each row is a\nfeature, each column is an example, and each entry f (i)\nis the activity of feature j on example i. We\nassume that the features are generated through some deterministic function of the examples.\nWe consider the following as desirable properties of the feature distribution:\nSparse features per example (Population Sparsity). 
Each example should be represented by only\na few active (non-zero) features. Concretely, for each column (one example) in our feature matrix,\nf (i), we want a small number of active elements. For example, an image can be represented by a\ndescription of the objects in it, and while there are many possible objects that can appear, only a few\nare typically present at a single time. This notion is known as population sparsity [13, 14] and is\nconsidered a principle adopted by the early visual cortex as an ef\ufb01cient means of coding.\n\nj\n\n1We have included a complete MATLAB implementation of sparse \ufb01ltering in the supplementary material.\n2ICA is unable to learn overcomplete feature representations unless one resorts to extremely expensive\napproximate orthogonalization algorithms [12]. Even when learning complete feature representations, it still\nrequires an expensive orthogonalization step at every iteration.\n\n2\n\n\fSparse features across examples (Lifetime Sparsity). Features should be discriminative and allow\nus to distinguish examples; thus, each feature should only be active for a few examples. This means\nthat each row in the feature matrix should have few non-zero elements. This property is known as\nlifetime sparsity [13, 14].\nUniform activity distribution (High Dispersal). For each row, the distribution should have simi-\nlar statistics to every other row; no one row should have signi\ufb01cantly more \u201cactivity\u201d than the other\nrows. Concretely, we consider the mean squared activations of each feature obtained by averaging\nthe squared values in the feature matrix across the columns (examples). This value should be roughly\nthe same for all features, implying that all features have similar contributions. While high dispersal\nis not strictly necessary for good feature representations, we found that enforcing high dispersal\nprevents degenerate situations in which the same features are always active [14]. 
For overcomplete representations, high dispersal translates to having fewer "inactive" features. As an example, principal component analysis (PCA) codes do not generally satisfy high dispersal, since the codes that correspond to the largest eigenvalues are almost always active.

These properties of feature distributions have been explored in the neuroscience literature [9, 13, 14, 15]. For instance, [14] showed that population sparsity and lifetime sparsity are not necessarily correlated. We note that the characterization of neural codes has conventionally been expressed in terms of properties of the feature distribution, rather than as a way of modeling the data distribution.

Many feature learning algorithms include these objectives. For example, the sparse RBM [6] works by constraining the expected activation of a feature (over its lifetime) to be close to a target value. ICA [9, 10] has constraints (e.g., each basis has unit norm) that normalize each feature, and further optimizes for the lifetime sparsity of the features it learns. Sparse autoencoders [16] also explicitly optimize for lifetime sparsity.

On the other hand, clustering-based methods such as k-means [17] can be seen as enforcing an extreme form of population sparsity, where each cluster centroid corresponds to a feature and only one feature is allowed to be active per example. "Triangle" activation functions, which essentially serve to ensure population sparsity, have also been shown to obtain good classification results [17]. Sparse coding [11] is also typically seen as enforcing population sparsity.

In this work, we use the feature distribution view to derive a simple feature learning algorithm that solely optimizes for population sparsity while enforcing high dispersal.
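These three properties can be read directly off the feature distribution matrix. As a concrete illustration (our own sketch, not from the paper; the helper name and activity threshold are assumptions), with rows as features and columns as examples:

```python
import numpy as np

def feature_distribution_stats(F, thresh=1e-6):
    """Rough proxies for the three desiderata of Sec. 3.

    F : feature distribution matrix, rows = features, columns = examples.
    `thresh` decides what counts as an "active" entry (our assumption)."""
    active = np.abs(F) > thresh
    pop_sparsity = active.sum(axis=0).mean()    # avg active features per example (want: small)
    life_sparsity = active.sum(axis=1).mean()   # avg examples a feature fires on (want: small)
    dispersal = (F ** 2).mean(axis=1)           # mean squared activation per feature (want: roughly equal)
    return pop_sparsity, life_sparsity, dispersal
```

For instance, an identity-like feature matrix is maximally population sparse (one active feature per example), lifetime sparse, and perfectly dispersed.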
In our experiments, we found that realizing these two properties was sufficient to allow us to learn overcomplete representations; we also argue later that these two properties are jointly sufficient to ensure lifetime sparsity.

4 Sparse filtering

In this section, we will show how the sparse filtering objective captures the aforementioned principles. Consider learning a function that computes linear features for every example. Concretely, let $f^{(i)}_j = w_j^\top x^{(i)}$ represent the $j$th feature value (rows) for the $i$th example (columns). Our method simply involves first normalizing the feature distribution matrix by rows, then by columns, and finally summing up the absolute value of all entries.

Specifically, we first normalize each feature to be equally active by dividing each feature by its $\ell_2$-norm across all examples: $\tilde{f}_j = f_j / \|f_j\|_2$. We then normalize these features per example, so that they lie on the unit $\ell_2$-ball, by computing $\hat{f}^{(i)} = \tilde{f}^{(i)} / \|\tilde{f}^{(i)}\|_2$. The normalized features are optimized for sparseness using the $\ell_1$ penalty. For a dataset of $M$ examples, this gives us the sparse filtering objective (Eqn. 1):

$$\text{minimize} \; \sum_{i=1}^{M} \big\| \hat{f}^{(i)} \big\|_1 = \sum_{i=1}^{M} \left\| \frac{\tilde{f}^{(i)}}{\|\tilde{f}^{(i)}\|_2} \right\|_1 . \qquad (1)$$

4.1 Optimizing for population sparsity

The term $\|\hat{f}^{(i)}\|_1 = \big\| \tilde{f}^{(i)} / \|\tilde{f}^{(i)}\|_2 \big\|_1$ measures the population sparsity of the features on the $i$th example. Since the normalized features $\hat{f}^{(i)}$ are constrained to lie on the unit $\ell_2$-ball, this objective is minimized when the features are sparse (Fig. 1-Left), which corresponds to being close to the axes. Conversely, an example which has similar values for every feature would incur a high penalty.

Figure 1: Left: Sparse filtering showing two features ($f_1$, $f_2$) and two examples (red and green). Each example is first projected onto the $\ell_2$-ball and then optimized for sparseness. The $\ell_2$-ball is shown together with level sets of the $\ell_1$-norm. Notice that the sparseness of the features (in the $\ell_1$ sense) is maximized when the examples are on the axes. Right: Competition between features due to normalization. We show one example where only $f_1$ is increased. Notice that even though only $f_1$ is increased, the normalized value of the second feature, $\hat{f}_2$, decreases.

One property of normalizing features is that it implicitly introduces competition between features. Notice that if only one component of $f^{(i)}$ is increased, all the other components $\hat{f}^{(i)}_j$ will decrease because of the normalization (Fig. 1-Right). Similarly, if only one component of $f^{(i)}$ is decreased, all other components will increase. Since we are minimizing $\|\hat{f}^{(i)}\|_1$, the objective encourages the normalized features, $\hat{f}^{(i)}$, to be sparse and mostly close to zero. Putting this together with the normalization, this means that some features in $f^{(i)}$ have to be large while most of them are small (close to zero). Therefore, the objective optimizes for population sparsity.

The formulation above is closely related to the Treves-Rolls [14, 18] measure of population/lifetime sparsity:

$$s^{(i)} = \Big[ \textstyle\sum_j \tilde{f}^{(i)}_j / F \Big]^2 \Big/ \Big[ \textstyle\sum_j \big(\tilde{f}^{(i)}_j\big)^2 / F \Big],$$

where $F$ is the total number of features. This measure is commonly used to characterize the sparsity of neuron activations in the brain. In particular, our proposed formulation can be viewed as a re-scaling of the square-root of this measure.

4.2 Optimizing for high dispersal

Recall that for high dispersal we want every feature to be equally active. Specifically, we want the mean squared activation of each feature to be roughly equal. In our formulation of sparse filtering, we first normalize each feature so that they are equally active by dividing each feature by its norm across the examples: $\tilde{f}_j = f_j / \|f_j\|_2$. This has the same effect as constraining each feature to have the same expected squared value, $E_{x^{(i)} \sim \mathcal{D}}\big[(f^{(i)}_j)^2\big] = 1$, thus enforcing high dispersal.

4.3 Optimizing for lifetime sparsity

We found that optimizing for population sparsity and enforcing high dispersal led to lifetime sparsity in our features. To understand how lifetime sparsity is achieved, first notice that a feature distribution which is population sparse must have many non-active (zero) entries in the feature distribution matrix. Since these features are highly dispersed, these zero entries (and also the non-zero entries) are approximately evenly distributed among all the features. Therefore, every feature must have a significant number of zero entries and be lifetime sparse. This implies that optimizing for population sparsity and high dispersal is sufficient to define a good feature distribution.

4.4 Deep sparse filtering

Since the sparse filtering objective is agnostic about the method which generates the feature matrix, one is relatively free to choose the feedforward network that computes the features.
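Putting the pieces of Sec. 4 together, the objective is only a few lines in any matrix language. The paper ships a MATLAB implementation in its supplementary material; the numpy sketch below is our own illustrative translation (the function name and array shapes are our choices), using the soft-absolute activation adopted in Sec. 5:

```python
import numpy as np

def sparse_filtering_objective(W, X, eps=1e-8):
    """Sparse filtering cost (Eqn. 1) -- illustrative sketch, not the paper's code.

    W : (n_features, n_inputs) filter matrix
    X : (n_inputs, n_examples) data matrix, one example per column
    """
    F = np.sqrt((W @ X) ** 2 + eps)                      # soft-absolute activations (Sec. 5)
    Ft = F / np.linalg.norm(F, axis=1, keepdims=True)    # row-normalize: each feature equally active
    Fh = Ft / np.linalg.norm(Ft, axis=0, keepdims=True)  # column-normalize: each example on the unit l2-ball
    return Fh.sum()                                      # l1 penalty (all entries are positive)
```

In practice one would flatten W and hand this cost (with its analytic gradient, or automatic differentiation) to an off-the-shelf L-BFGS routine, as described in Sec. 5.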
It is thus possible to use more complex non-linear functions (e.g., $f^{(i)}_j = \log(1 + (w_j^\top x^{(i)})^2)$), or even multi-layered networks, when computing the features. In this way, sparse filtering presents itself as a natural framework for training deep networks.

Training a deep network with sparse filtering can be achieved using the canonical greedy layer-wise approach [7, 19]. In particular, after training a single layer of features with sparse filtering, one can compute the normalized features $\hat{f}^{(i)}$ and then use these as input to sparse filtering for learning another layer of features. In practice, we find that greedy layer-wise training with sparse filtering learns meaningful representations on the next layer (Sec. 5.2).

5 Experiments

In our experiments, we adopted the soft-absolute function

$$f^{(i)}_j = \sqrt{\epsilon + (w_j^\top x^{(i)})^2} \approx \big| w_j^\top x^{(i)} \big|$$

as our activation function, setting $\epsilon = 10^{-8}$, and used an off-the-shelf L-BFGS [20] package to optimize the sparse filtering objective until convergence.

5.1 Timing and scaling up

Figure 2: Timing comparisons between sparse coding, ICA, sparse autoencoders and sparse filtering over different input sizes.

In this section, we examine the efficiency of the sparse filtering algorithm by comparing it against ICA, sparse coding, and sparse autoencoders. We compared the convergence of each algorithm by measuring the relative change in function value over each iteration of the algorithm, stopping when this change dropped below a preset threshold. We performed experiments using 10,000 color image patches with varying image sizes to evaluate the efficiency and scalability of the methods. For each image size, we learned a complete set of features (i.e., equal to the number of input dimensions). We implemented sparse autoencoders as described in Coates et al. [17].
For sparse coding, we used code from [2], as it is fairly optimized and easy to modify.

For smaller image dimensions of sizes 8 × 8 (192-dimensional inputs, since our images have 3 color channels) and 16 × 16 (768-dimensional inputs), we found that the algorithms generally performed similarly in terms of efficiency. However, with 32 × 32 image patches (3072-dimensional inputs), sparse coding, sparse autoencoders and ICA were significantly slower to converge than sparse filtering (Fig. 2). For ICA, each iteration of the algorithm (FastICA [12]) requires orthogonalizing the bases learned; since the cost of orthogonalization is cubic in the number of features, the algorithm can be very slow when the number of features is large. For sparse coding, as the number of features increased, it took significantly longer to solve the $\ell_1$-regularized least squares problem for finding the coefficients.

We obtained an overall speedup of at least 4x over sparse coding and ICA when learning features from 32 × 32 image patches. In contrast to ICA, optimizing the sparse filtering objective does not require the expensive cubic-time whitening step. For the larger input dimensions, sparse coding and sparse autoencoders did not converge in a reasonable time (< 3 hours).

5.2 Natural images

In this section, we applied sparse filtering to learn features from 200,000 randomly sampled 16 × 16 patches drawn from natural images [9].
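The per-patch preprocessing described next (subtracting each patch's mean, i.e., removing the DC component) is a one-line operation; a minimal numpy sketch, where the (pixels × patches) layout and function name are our assumptions:

```python
import numpy as np

def remove_dc(patches):
    """Subtract each patch's mean from itself (DC removal).

    patches : (n_pixels, n_patches) array, one column per patch
              (the layout is our assumption, not specified by the paper)."""
    return patches - patches.mean(axis=0, keepdims=True)
```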
The only preprocessing done before feature learning was to subtract the mean of each image patch from itself (i.e., removing the DC component).

The first layer of features learned by sparse filtering corresponded to Gabor-like edge detectors, similar to those learned by standard sparse feature learning methods [6, 9, 10, 11, 16]. More interestingly, when we learned a second layer of features using greedy layer-wise stacking on the features produced by the first layer, it discovered meaningful features that pool the first layer features (Fig. 3). We highlight that the second layer of features was learned using the same algorithm, without any tuning or preprocessing of the data. While recent work by [21, 22] has also been able to learn meaningful second layer features, our method is simpler to implement, fast to run, and does not require time-consuming tuning of hyperparameters.

Figure 3: Learned pooling units in a second layer using sparse filtering. We show the most strongly connected first layer units for each second layer unit; each column corresponds to a second layer unit.

5.3 STL-10 object classification

Table 2. Classification accuracy on STL-10.

Method | Accuracy
Raw Pixels [17] | 31.8% ± 0.63%
ICA (Complete) | 48.0% ± 1.47%
K-means (Triangle) [17] | 51.5% ± 1.73%
Random Weight Baseline | 50.2% ± 1.08%
Our Method | 53.5% ± 0.53%

Figure 4: A subset of the learned filters from 10 × 10 patches extracted from the STL dataset.

We also evaluated the performance of our model on an object classification task. We used the STL-10 dataset [17], which consists of an unsupervised training set of 100,000 images, a supervised training set of 10 training folds, each with 500 training instances, and a test set of 8,000 test instances.
Each instance is a 96 × 96 RGB image from 1 of 10 object categories.

To obtain features from the large image, we followed the protocol of [17]: features were extracted densely from all locations in each image and later pooled into quadrants. Supervised training was carried out by training a linear SVM on this representation of the training set, where C and the receptive field size were chosen by hold-out cross-validation. We obtained a test set accuracy of 53.5% ± 0.53% with features learned from 10 × 10 patches. For a fair comparison, the number of features learnt was also set to be consistent with the number of features used by [17].

In accordance with the recommended STL-10 testing protocol [17], we performed supervised training on each of the 10 supervised training folds and reported the mean accuracy on the full test set along with the standard deviation across the 10 training folds (Table 2).

In order to show the effects of feature learning, we include a comparison to a random weight baseline of our method. For the baseline, we keep the basic architecture (e.g., the divisive normalization), but fill the entries of the weight matrix W by sampling random values. Random weight baselines have been shown to perform remarkably well on a variety of tasks [23], and provide a means of distinguishing the effect of our divisive normalization scheme versus the effect of feature learning.

5.4 Phone classification (TIMIT)

Table 3.
Test accuracy for phone classification using features learned from MFCCs.

Classifier | Method | Accuracy
SVM (Linear) | ICA | 57.3%
SVM (Linear) | MFCC | 67.2%
SVM (Linear) | Sparse Coding | 76.8%
SVM (Linear) | Our Method | 75.7%
SVM (RBF) | MFCC | 80.4%
SVM (RBF) | MFCC+ICA | 78.3%
SVM (RBF) | MFCC+Sparse Coding | 80.1%
SVM (RBF) | MFCC+Our Method | 80.5%
– | HMM [24] | 78.6%
– | Large Margin GMM (LMGMM) [25] | 78.9%
– | CRF [26] | 79.2%
– | MFCC+CDBN [2] | 80.3%
– | Hierarchical LMGMM [27] | 81.3%

To evaluate the model's ability to work with a range of data modalities, we further evaluated the ability of our models to do 39-way phone classification on the TIMIT dataset [28]. As with [2, 29], our dataset comprised 132,833 training phones and 6,831 testing phones. Following standard approaches [2, 24, 25, 26, 27, 29], we first extracted 13 mel-frequency cepstral coefficients (MFCCs) and augmented them with the first and second order derivatives. Using sparse filtering, we learned 256 features from contiguous groups of 11 MFCC frames. For comparison, we also learned sets of 256 features in a similar way using sparse coding [11, 30] and ICA [12]. A fixed-length feature vector was formed from each example using the protocol described in [29].³

To evaluate the relative performances of the different feature sets (MFCC, ICA, sparse coding and sparse filtering), we used a linear SVM, choosing the regularization coefficient C by cross-validation on the development set. We found that the features learned using sparse filtering outperformed MFCC features alone and ICA features; they were also competitive with sparse coding and faster to compute.
Using an RBF kernel [31] gave performances competitive with state-of-the-art methods when MFCCs were combined with learned sparse filtering features (Table 3). In contrast, concatenating ICA and sparse coding features with MFCCs resulted in decreased performance when compared to MFCCs alone.

While methods such as the HMM and LMGMM were included in Table 3 to provide context, we note that these methods use pipelines that are more complex than straightforwardly applying an SVM. Indeed, these pipelines are built on top of feature representations that can be derived from a variety of sources, including sparse filtering. We thus see these methods as being complementary to sparse filtering.

³We found that splitting the features into positive and negative components improved performance slightly.

6 Discussion

6.1 Connections to divisive normalization

Our formulation of population sparsity in sparse filtering is closely related to divisive normalization [32] – a process in early visual processing in which a neuron's response is divided by a (weighted) sum of the responses of neighboring neurons. Divisive normalization has previously been found [33, 34] to be useful as part of a multi-stage object classification pipeline. However, it was introduced as a processing stage [33], rather than as part of unsupervised feature learning (pretraining). Conversely, sparse filtering uses divisive normalization as an integral component of the feature learning process to introduce competition between features, resulting in population sparse representations.

6.2 Connections to ICA and sparse coding

The sparse filtering objective can be viewed as a normalized version of the ICA objective. In ICA [12], the objective is to minimize the response of linear filters (e.g., $\|Wx\|_1$), subject to the constraint that the filters are orthogonal to each other.
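One way to see why normalization can stand in for ICA's orthogonality constraint: the plain ℓ1 response shrinks with the filters (so an unconstrained minimizer could collapse W toward zero), whereas the normalized penalty ‖Wx‖1/‖Wx‖2 discussed in this section is invariant to rescaling W. A small numeric sketch (our own illustration; the function names are ours):

```python
import numpy as np

def l1_response(W, x):
    """ICA-style objective: the l1 norm of the filter responses, ||Wx||_1."""
    return np.abs(W @ x).sum()

def normalized_penalty(W, x):
    """Sparse filtering's normalized penalty: ||Wx||_1 / ||Wx||_2."""
    r = W @ x
    return np.abs(r).sum() / np.linalg.norm(r)
```

Since `l1_response(c * W, x)` scales linearly with c while `normalized_penalty` is unchanged for any c > 0, competition between filters (rather than an explicit orthogonalization step) is what keeps the learned filters diverse; the same ℓ1/ℓ2 idea applies to sparse coding coefficients as ‖s‖1/‖s‖2.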
The orthogonality constraint results in a set of diverse filters. In sparse filtering, we replace the objective with a normalized sparsity penalty, where the response of the filters is divided by the norm of all the filters ($\|Wx\|_1 / \|Wx\|_2$). This introduces competition between the filters and thus removes the need for orthogonalization.

Similarly, one can apply the normalization idea to the sparse coding framework. In particular, sparse filtering resembles the $\ell_1/\ell_2$ sparsity penalty that has been used in non-negative matrix factorization [35]. Thus, instead of the usual $\ell_1$ penalty that is used in conjunction with sparse coding (i.e., $\|s\|_1$), one can instead use a normalized penalty (i.e., $\|s\|_1 / \|s\|_2$). This normalized penalty is scale invariant and can be more robust to variations in the data.

References

[1] G. E. Dahl, M. Ranzato, A. Mohamed, and G. E. Hinton. Phone recognition with the mean-covariance restricted Boltzmann machine. In NIPS, 2010.
[2] H. Lee, Y. Largman, P. Pham, and A. Y. Ng. Unsupervised feature learning for audio classification using convolutional deep belief networks. In NIPS, 2009.
[3] J. Yang, K. Yu, Y. Gong, and T. Huang. Linear spatial pyramid matching using sparse coding for image classification. In CVPR, 2009.
[4] M. A. Ranzato, F. J. Huang, Y.-L. Boureau, and Y. LeCun. Unsupervised learning of invariant feature hierarchies with applications to object recognition. In CVPR, 2007.
[5] Q. V. Le, W. Y. Zou, S. Y. Yeung, and A. Y. Ng. Learning hierarchical spatio-temporal features for action recognition with independent subspace analysis. In CVPR, 2011.
[6] H. Lee, C. Ekanadham, and A. Y. Ng. Sparse deep belief net model for visual area V2. In NIPS, 2008.
[7] G. E. Hinton, S. Osindero, and Y. W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.
[8] P. Vincent, H. Larochelle, Y. Bengio, and P. A. Manzagol. Extracting and composing robust features with denoising autoencoders. In ICML, 2008.
[9] J. H. van Hateren and A. van der Schaaf. Independent component filters of natural images compared with simple cells in primary visual cortex. Proceedings: Biological Sciences, 265(1394):359–366, 1998.
[10] A. J. Bell and T. J. Sejnowski. The "independent components" of natural scenes are edge filters. Vision Res., 37(23):3327–3338, December 1997.
[11] B. Olshausen and D. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Nature, 1997.
[12] A. Hyvärinen, J. Hurri, and P. O. Hoyer. Natural Image Statistics: A Probabilistic Approach to Early Computational Vision (Computational Imaging and Vision). Springer, 2nd printing edition, 2009.
[13] D. J. Field. What is the goal of sensory coding? Neural Computation, 6(4):559–601, July 1994.
[14] B. Willmore and D. J. Tolhurst. Characterizing the sparseness of neural codes. Network, 12(3):255–270, January 2001.
[15] O. Schwartz and E. P. Simoncelli. Natural signal statistics and sensory gain control. Nature Neuroscience, 4:819–825, 2001.
[16] M. A. Ranzato, C. Poultney, S. Chopra, and Y. LeCun. Efficient learning of sparse representations with an energy-based model. In NIPS, 2006.
[17] A. Coates, H. Lee, and A. Y. Ng. An analysis of single-layer networks in unsupervised feature learning. In AISTATS, 2011.
[18] A. Treves and E. Rolls. What determines the capacity of autoassociative memories in the brain? Network: Computation in Neural Systems, 2:371–397, 1991.
[19] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep networks. In NIPS, 2006.
[20] M. Schmidt. minFunc.
http://www.cs.ubc.ca/\u02dcschmidtm/Software/minFunc.html,\n\n2005.\n\n[21] M. Ranzato and G. E. Hinton. Modeling Pixel Means and Covariances Using Factorized Third-Order\n\nBoltzmann Machines. In CVPR, 2010.\n\n[22] U. K\u00a8oster and A. Hyv\u00a8arinen. A two-layer model of natural stimuli estimated with score matching. Neural\n\nComputation, 22(9):2308\u20132333, 2010.\n\n[23] A. Saxe, M. Bhand, Z. Chen, P.W. Koh, B. Suresh, and A.Y. Ng. On random weights and unsupervised\n\nfeature learning. In ICML, 2011.\n\n[24] S. Petrov, A. Pauls, and D. Klein. Learning structured models for phone recognition. In Proc. of EMNLP-\n\nCoNLL, 2007.\n\n[25] F. Sha and L.K. Saul. Large margin gaussian mixture modeling for phonetic classi\ufb01cation and recognition.\n\nIn ICASSP. IEEE, 2006.\n\n[26] D. Yu, L. Deng, and A. Acero. Hidden conditional random \ufb01eld with distribution constraints for phone\n\nclassi\ufb01cation. In Interspeech, 2009.\n\n[27] H.A. Chang and J.R. Glass. Hierarchical large-margin gaussian mixture models for phonetic classi\ufb01cation.\nIn Automatic Speech Recognition & Understanding, 2007. ASRU. IEEE Workshop on, pages 272\u2013277.\nIEEE, 2007.\n\n[28] W. E. Fisher, G. R. Doddington, and K. M. Goudle-marshall. The DARPA speech recognition research\n\ndatabase: speci\ufb01cations and status. 1986.\n\n[29] P. Clarkson and P. J. Moreno. On the use of support vector machines for phonetic classi\ufb01cation. Acoustics,\n\nSpeech, and Signal Processing, IEEE International Conference on, 2:585\u2013588, 1999.\n\n[30] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online dictionary learning for sparse coding. In ICML, 2009.\n[31] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on\nIntelligent Systems and Technology, 2:27:1\u201327:27, 2011. Software available at http://www.csie.\nntu.edu.tw/\u02dccjlin/libsvm.\n\n[32] M. Wainwright, O. Schwartz, and E. Simoncelli. 
Natural image statistics and divisive normalization:\n\nModeling nonlinearity and adaptation in cortical neurons, 2001.\n\n[33] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun. What is the best multi-stage architecture for\n\nobject recognition? In ICCV, 2009.\n\n[34] N. Pinto, D. D. Cox, and J. J. DiCarlo. Why is Real-World visual object recognition hard? PLoS Comput\n\nBiol, 4(1):e27+, January 2008.\n\n[35] Patrik O. Hoyer. Non-negative matrix factorization with sparseness constraints. JMLR, 5:1457\u20131469,\n\n2004.\n\n9\n\n\f", "award": [], "sourceid": 668, "authors": [{"given_name": "Jiquan", "family_name": "Ngiam", "institution": null}, {"given_name": "Zhenghao", "family_name": "Chen", "institution": null}, {"given_name": "Sonia", "family_name": "Bhaskar", "institution": null}, {"given_name": "Pang", "family_name": "Koh", "institution": null}, {"given_name": "Andrew", "family_name": "Ng", "institution": null}]}