{"title": "ICA with Reconstruction Cost for Efficient Overcomplete Feature Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1017, "page_last": 1025, "abstract": "Independent Components Analysis (ICA) and its variants have been successfully used for unsupervised feature learning. However, standard ICA requires an orthonormality constraint to be enforced, which makes it difficult to learn overcomplete features. In addition, ICA is sensitive to whitening. These properties make it challenging to scale ICA to high dimensional data. In this paper, we propose a robust soft reconstruction cost for ICA that allows us to learn highly overcomplete sparse features even on unwhitened data. Our formulation reveals formal connections between ICA and sparse autoencoders, which have previously been observed only empirically. Our algorithm can be used in conjunction with off-the-shelf fast unconstrained optimizers. We show that the soft reconstruction cost can also be used to prevent replicated features in tiled convolutional neural networks. Using our method to learn highly overcomplete sparse features and tiled convolutional neural networks, we obtain competitive performances on a wide variety of object recognition tasks. We achieve state-of-the-art test accuracies on the STL-10 and Hollywood2 datasets.", "full_text": "ICA with Reconstruction Cost for Efficient Overcomplete Feature Learning\n\nQuoc V. Le, Alexandre Karpenko, Jiquan Ngiam and Andrew Y. Ng\n\n{quocle,akarpenko,jngiam,ang}@cs.stanford.edu\nComputer Science Department, Stanford University\n\nAbstract\n\nIndependent Components Analysis (ICA) and its variants have been successfully used for unsupervised feature learning. However, standard ICA requires an orthonormality constraint to be enforced, which makes it difficult to learn overcomplete features. In addition, ICA is sensitive to whitening. 
These properties make it challenging to scale ICA to high dimensional data. In this paper, we propose a robust soft reconstruction cost for ICA that allows us to learn highly overcomplete sparse features even on unwhitened data. Our formulation reveals formal connections between ICA and sparse autoencoders, which have previously been observed only empirically. Our algorithm can be used in conjunction with off-the-shelf fast unconstrained optimizers. We show that the soft reconstruction cost can also be used to prevent replicated features in tiled convolutional neural networks. Using our method to learn highly overcomplete sparse features and tiled convolutional neural networks, we obtain competitive performances on a wide variety of object recognition tasks. We achieve state-of-the-art test accuracies on the STL-10 and Hollywood2 datasets.\n\n1 Introduction\n\nSparsity has been shown to work well for learning feature representations that are robust for object recognition [1, 2, 3, 4, 5, 6, 7]. A number of algorithms have been proposed to learn sparse features. These include: sparse auto-encoders [8], Restricted Boltzmann Machines (RBMs) [9], sparse coding [10] and Independent Component Analysis (ICA) [11]. ICA, in particular, has been shown to perform well in a wide range of object recognition tasks [12]. In addition, ISA (Independent Subspace Analysis, a variant of ICA) has been used to learn features that achieved state-of-the-art performance on action recognition tasks [13].\n\nHowever, standard ICA has two major drawbacks. First, it is difficult to learn overcomplete feature representations (i.e., the number of features cannot exceed the dimensionality of the input data). This puts ICA at a disadvantage compared to other methods, because Coates et al. [6] have shown that classification performance improves for algorithms such as sparse autoencoders [8], K-means [6] and RBMs [9], when the learned features are overcomplete. 
Second, ICA is sensitive to whitening (a preprocessing step that decorrelates the input data, and cannot always be computed exactly for high dimensional data). As a result, it is difficult to scale ICA to high dimensional data. In this paper we propose a modification to ICA that not only addresses these shortcomings but also reveals strong connections between ICA, sparse autoencoders and sparse coding.\n\nBoth drawbacks arise from a constraint in the standard ICA formulation that requires features to be orthogonal. This hard orthonormality constraint, W W^T = I, is used to prevent degenerate solutions in the feature matrix W (where each feature is a row of W). However, if W is overcomplete (i.e., a \u201ctall\u201d matrix) then this constraint can no longer be satisfied. In particular, the standard optimization procedure for ICA, ISA and TICA (Topographic ICA) uses projected gradient descent, where W is orthonormalized at each iteration by solving W := (W W^T)^{-1/2} W. This symmetric orthonormalization procedure does not work when W is overcomplete. As a result, this standard ICA method cannot learn more features than the number of dimensions in the data. Furthermore, while alternative orthonormalization procedures or score matching can learn overcomplete representations, they are expensive to compute. Constrained optimizers also tend to be much slower than unconstrained ones.1\n\nOur algorithm enables ICA to scale to overcomplete representations by replacing the orthonormalization constraint with a linear reconstruction penalty (akin to the one used in sparse auto-encoders). This reconstruction penalty removes the need for a constrained optimizer. As a result, we can implement our algorithm with only a few lines of MATLAB, and plug it directly into unconstrained solvers (e.g., L-BFGS and CG [14]). 
This results in very fast convergence rates for our method.\n\nIn addition, recent ICA-based algorithms, such as tiled convolutional neural networks (also known as local receptive field TICA) [12], also suffer from the difficulty of enforcing the hard orthonormality constraint globally. As a result, orthonormalization is typically performed locally instead, which results in copied (i.e., degenerate) features. Our reconstruction penalty, on the other hand, can be enforced globally across all receptive fields. As a result, our method prevents degenerate features.\n\nFurthermore, ICA's sensitivity to whitening is undesirable because exactly whitening high dimensional data is often not feasible. For example, exact whitening using principal component analysis (PCA) for input images of size 200x200 pixels is challenging, because it requires solving the eigendecomposition of a 40,000 x 40,000 covariance matrix. Other methods, such as sparse autoencoders or RBMs, work well using approximate whitening and in some cases work even without any whitening. Standard ICA, on the other hand, tends to produce noisy filters unless the data is exactly white. Our soft reconstruction penalty shares the property of autoencoders, in that it also makes our approach less sensitive to whitening. Similarities between ICA, autoencoders and sparse coding have been observed empirically before (i.e., they all learn edge filters). Our contribution is to show a formal proof and a set of conditions under which these algorithms are equivalent.\n\nFinally, we use our algorithm for classifying STL-10 images [6] and Hollywood2 [15] videos. In particular, on the STL-10 dataset, we learn highly overcomplete representations and achieve 52.9% on the test set. 
On Hollywood2, we achieve a mean Average Precision of 54.6%, which is also the best published result on this dataset.\n\n2 Standard ICA and Reconstruction ICA\n\nWe begin by introducing our proposed algorithm for overcomplete ICA. In subsequent sections we will show how our method is related to ICA, sparse auto-encoders and sparse coding. Given unlabeled data {x^(i)}_{i=1}^m, x^(i) in R^n, regular ICA [11] is traditionally defined as the following optimization problem:\n\nminimize_W Σ_{i=1}^{m} Σ_{j=1}^{k} g(W_j x^(i)), subject to W W^T = I (1)\n\nwhere g is a nonlinear convex function, e.g., the smooth L1 penalty g(·) := log(cosh(·)) [16], W in R^{k x n} is the weight matrix, k is the number of components (features), and W_j is one row (feature) of W. The orthonormality constraint W W^T = I is used to prevent the bases in W from becoming degenerate. We refer to this as \u201cnon-degeneracy control\u201d in this paper.\n\nTypically, ICA requires the data to have zero mean, Σ_{i=1}^{m} x^(i) = 0, and unit covariance, (1/m) Σ_{i=1}^{m} x^(i) (x^(i))^T = I. While the former can be achieved by subtracting the empirical mean, the latter requires finding a linear transformation by solving the eigendecomposition of the covariance matrix [11]. This preprocessing step is also known as whitening or sphering the data.\n\nFor overcomplete representations (k > n) [17, 18], the orthonormality constraint can no longer hold. As a result, approximate orthonormalization (e.g., Gram-Schmidt) or fixed-point iterative methods [11] have been proposed. These algorithms are often slow and require tuning.\n\n1 FastICA is a specialized solver that works well for complete or undercomplete ICA. Here, we focus our attention on ICA and its variants, such as ISA and TICA, in the context of overcomplete representations, where FastICA does not work. 
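The whitening (sphering) step described above, subtracting the mean and then solving the eigendecomposition of the covariance matrix, can be sketched in a few lines of NumPy. This is an illustrative sketch with names of our choosing, not the authors' code:

```python
import numpy as np

def pca_whiten(X, eps=1e-8):
    """Whiten data: subtract the mean, then map to identity covariance.

    X: (m, n) array of m examples; eps guards against tiny eigenvalues."""
    X = X - X.mean(axis=0)                         # zero mean
    cov = X.T @ X / X.shape[0]                     # empirical covariance (n x n)
    d, E = np.linalg.eigh(cov)                     # eigendecomposition of cov
    V = E @ np.diag(1.0 / np.sqrt(d + eps)) @ E.T  # symmetric whitening map
    return X @ V

# Correlated toy data; after whitening its covariance is the identity.
rng = np.random.default_rng(0)
M = rng.normal(size=(16, 16))
A = M @ M.T + np.eye(16)                           # well-conditioned mixing
X = rng.normal(size=(5000, 16)) @ A
Xw = pca_whiten(X)
print(np.allclose(Xw.T @ Xw / Xw.shape[0], np.eye(16), atol=1e-6))  # True
```

For 200x200 images this eigendecomposition becomes a 40,000 x 40,000 problem, which is exactly the scaling bottleneck the paper discusses.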
Other approaches, e.g., interior point methods [19] or score matching [16], exist, but they are complicated to implement and also slow. Score matching, for example, is difficult to implement and expensive for multilayered algorithms like ISA or TICA, because it requires backpropagation of a Hessian matrix.\n\nThese challenges motivate our search for a better type of non-degeneracy control for ICA. A frequently employed form of non-degeneracy control in auto-encoders and sparse coding is the use of reconstruction costs. As a result, we propose to replace the hard orthonormality constraint in ICA with a soft reconstruction cost. Applying this change to eq. 1 produces the following unconstrained problem:\n\nReconstruction ICA (RICA): minimize_W (λ/m) Σ_{i=1}^{m} ||W^T W x^(i) - x^(i)||_2^2 + Σ_{i=1}^{m} Σ_{j=1}^{k} g(W_j x^(i)) (2)\n\nWe use the term \u201creconstruction cost\u201d for this smooth penalty because it corresponds to the reconstruction cost of a linear autoencoder, where the encoding weights and decoding weights are tied (i.e., the encoding step is W x^(i) and the decoding step is W^T W x^(i)).\n\nThe choice to swap the orthonormality constraint for a reconstruction penalty seems arbitrary at first. However, we will show in the following section that these two forms of degeneracy control are, in fact, equivalent under certain conditions. Furthermore, this change has two key benefits: first, it allows unconstrained optimizers (e.g., L-BFGS, CG [20] and SGDs) to be used to minimize this cost function instead of relying on slower constrained optimizers (e.g., projected gradient descent) to solve the standard ICA cost function. 
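As a concrete illustration, the RICA objective of eq. 2 and its gradient fit in a dozen lines of NumPy and can be handed directly to an off-the-shelf unconstrained solver such as SciPy's L-BFGS. The paper's own implementation is a few lines of MATLAB, so treat this as our own sketch (with hypothetical names) rather than the reference code:

```python
import numpy as np
from scipy.optimize import minimize

def rica_cost(w, X, k, lam):
    """RICA objective (eq. 2): lam/m * sum ||W^T W x - x||^2 + sum log cosh(W x).

    w: flattened (k, n) filter matrix; X: (n, m) data matrix.
    Returns the cost and its flattened gradient, as L-BFGS expects."""
    n, m = X.shape
    W = w.reshape(k, n)
    Z = W @ X                           # activations W x^(i), shape (k, m)
    R = W.T @ Z - X                     # reconstruction residuals, shape (n, m)
    # log cosh written via logaddexp for numerical stability
    cost = lam / m * np.sum(R ** 2) + np.sum(np.logaddexp(Z, -Z) - np.log(2.0))
    grad = 2.0 * lam / m * (W @ R @ X.T + Z @ R.T) + np.tanh(Z) @ X.T
    return cost, grad.ravel()

# Toy run: fit a 2x-overcomplete W on random data with an unconstrained solver.
rng = np.random.default_rng(0)
n, m, k, lam = 8, 500, 16, 0.1
X = rng.normal(size=(n, m))
w0 = 0.1 * rng.normal(size=k * n)
res = minimize(rica_cost, w0, args=(X, k, lam), jac=True,
               method="L-BFGS-B", options={"maxiter": 200})
```

Because the problem is unconstrained, nothing here projects W back onto the orthonormal manifold; the reconstruction term alone keeps the filters from degenerating.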
And second, the reconstruction penalty works even when W is overcomplete and the data is not fully whitened.\n\n3 Connections between orthonormality and reconstruction\n\nSparse autoencoders, sparse coding and ICA have been previously suspected to be strongly connected because they all learn edge filters on natural image data. In this section we present formal proofs that they are indeed mathematically equivalent under certain conditions (e.g., whitening and linear coding). Our proofs reveal the underlying principles in unsupervised feature learning that tie these algorithms together.\n\nWe start by reviewing the optimization problems of two common unsupervised feature learning algorithms: sparse autoencoders and sparse coding. In particular, the objective function of tied-weight sparse autoencoders [8, 21, 22, 23] is:\n\nminimize_{W,b,c} (λ/m) Σ_{i=1}^{m} ||σ(W^T σ(W x^(i) + b) + c) - x^(i)||_2^2 + S({W, b}, x^(1), ..., x^(m)) (3)\n\nwhere σ is the activation function (e.g., sigmoid), b, c are biases, and S is some sparse penalty function. Typically, S is chosen to be the smooth L1 penalty S({W, b}, x^(1), ..., x^(m)) = Σ_{i=1}^{m} Σ_{j=1}^{k} g(W_j x^(i)) or the KL divergence between the average activation and a target activation [24]. Similarly, the optimization problem of sparse coding [10] is:\n\nminimize_{W, z^(1), ..., z^(m)} (λ/m) Σ_{i=1}^{m} ||W^T z^(i) - x^(i)||_2^2 + Σ_{i=1}^{m} Σ_{j=1}^{k} g(z_j^(i)) subject to ||W_j||_2^2 ≤ c, for all j = 1, ..., k. (4)\n\nFrom these formulations, it is clear that there are links between ICA, RICA, sparse autoencoders and sparse coding. In particular, most methods use the L1 sparsity penalty and, except for ICA, most use reconstruction costs as non-degeneracy control. 
These observations are summarized in Table 1.\n\nICA's main distinction compared to sparse coding and autoencoders is its use of the hard orthonormality constraint in lieu of reconstruction costs. However, we will now present a proof (consisting of two lemmas) that derives the relationship between ICA's orthonormality constraint and RICA's reconstruction cost. We subsequently present a set of conditions under which RICA is equivalent to sparse coding and autoencoders. The result is a novel and formal proof of the relationship between ICA, sparse coding and autoencoders.\n\nWe let I denote an identity matrix, and I_l an identity matrix of size l x l. We denote the L2 norm by ||·||_2 and the matrix Frobenius norm by ||·||_F. We also assume that the data {x^(i)}_{i=1}^m has zero mean.\n\nTable 1: A summary of different unsupervised feature learning methods. \u201cNon-degeneracy control\u201d refers to the mechanism that prevents all bases from learning uninteresting weights (e.g., zero weights or identical weights). 
Note that using sparsity is optional in autoencoders.\n\nAlgorithm | Sparsity | Non-degeneracy control | Activation function\nSparse coding [10] | L1 | L2 reconstruction | Implicit\nAutoencoders and denoising autoencoders [21] | Optional: KL [24] or L1 [22] | L2 reconstruction (or cross entropy [21, 8]) | Sigmoid\nICA [16] | L1 | Orthonormality | Linear\nRICA (this paper) | L1 | L2 reconstruction | Linear\n\nThe first lemma states that the reconstruction cost and the column orthonormality cost2 are equivalent when the data is whitened (see the Appendix in the supplementary material for proofs):\n\nLemma 3.1 When the input data {x^(i)}_{i=1}^m is whitened, the reconstruction cost (λ/m) Σ_{i=1}^{m} ||W^T W x^(i) - x^(i)||_2^2 is equivalent to the orthonormality cost λ ||W^T W - I||_F^2.\n\nOur second lemma states that minimizing the column and row orthonormality costs turns out to be equivalent due to a property of the Frobenius norm:\n\nLemma 3.2 The column orthonormality cost λ ||W^T W - I_n||_F^2 is equivalent to the row orthonormality cost λ ||W W^T - I_k||_F^2 up to an additive constant.\n\nTogether these two lemmas tell us that the reconstruction cost is equivalent to both the column and row orthonormality costs for whitened data. Furthermore, as λ approaches infinity the orthonormality cost becomes the hard orthonormality constraint of ICA (see equations 1 & 2) if W is complete or undercomplete. Thus, ICA's hard orthonormality constraint and RICA's reconstruction cost are related under these conditions. More formally, the following remarks explain this conclusion, and describe the set of conditions under which RICA (and by extension ICA) is equivalent to autoencoders and sparse coding.\n\n1) If the data is whitened, RICA is equivalent to ICA for undercomplete representations and λ approaching infinity. 
For whitened data our RICA formulation:\n\nRICA: minimize_W (λ/m) Σ_{i=1}^{m} ||W^T W x^(i) - x^(i)||_2^2 + Σ_{i=1}^{m} Σ_{j=1}^{k} g(W_j x^(i)) (5)\n\nis equivalent (from the above lemmas) to:\n\nminimize_W λ ||W^T W - I||_F^2 + Σ_{i=1}^{m} Σ_{j=1}^{k} g(W_j x^(i)), and (6)\n\nminimize_W λ ||W W^T - I||_F^2 + Σ_{i=1}^{m} Σ_{j=1}^{k} g(W_j x^(i)) (7)\n\nFurthermore, for undercomplete representations, in the limit of λ approaching infinity, the orthonormalization costs above become hard constraints. As a result, they are equivalent to:\n\nConventional ICA: minimize_W Σ_{i=1}^{m} Σ_{j=1}^{k} g(W_j x^(i)) subject to W W^T = I (8)\n\nwhich is just plain ICA, or ISA/TICA with appropriate choices of the sparsity function g.\n\n2) Autoencoders and sparse coding are equivalent to RICA if\n\n\u2022 in autoencoders, we use a linear activation function σ(x) = x, ignore the biases b, c, and use the soft L1 sparsity penalty for the activations: S({W, b}, x^(1), ..., x^(m)) = Σ_{i=1}^{m} Σ_{j=1}^{k} g(W_j x^(i)), and\n\n\u2022 in sparse coding, we use the explicit encoding z_j^(i) = W_j x^(i) and ignore the norm ball constraints.\n\n2 The column orthonormality cost is zero only if the columns of W are orthonormal.\n\nDespite their equivalence, certain formulations have certain advantages. For instance, RICA (eq. 2) and soft orthonormalization ICA (eq. 6 and 7) are smooth and can be optimized efficiently by fast unconstrained solvers (e.g., L-BFGS or CG), while the conventional constrained ICA optimization problem cannot. Soft penalties are also preferred if we want to learn overcomplete representations, where explicitly constraining W W^T = I is not possible3.\n\nWe derive an additional relationship in the appendix (see supplementary material), which shows that for whitened data denoising autoencoders are equivalent to RICA with weight decay. 
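Lemmas 3.1 and 3.2 are easy to confirm numerically. The sketch below (our own toy check, not from the paper) exactly whitens a random data matrix and compares the reconstruction, column orthonormality, and row orthonormality costs:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, k = 6, 400, 4  # undercomplete W: k rows, n input dimensions

# Exactly whiten X so its empirical covariance is the identity.
X = rng.normal(size=(n, m))
C = X @ X.T / m
d, E = np.linalg.eigh(C)
X = E @ np.diag(d ** -0.5) @ E.T @ X   # now X @ X.T / m == I (up to roundoff)

W = rng.normal(size=(k, n))

# Lemma 3.1: reconstruction cost == column orthonormality cost on whitened data.
recon = np.sum((W.T @ W @ X - X) ** 2) / m
ortho_col = np.linalg.norm(W.T @ W - np.eye(n), "fro") ** 2
print(np.isclose(recon, ortho_col))            # True

# Lemma 3.2: column and row orthonormality costs differ by the constant n - k.
ortho_row = np.linalg.norm(W @ W.T - np.eye(k), "fro") ** 2
print(np.isclose(ortho_col - ortho_row, n - k))  # True
```

The additive constant n - k comes from expanding both Frobenius norms: the trace terms agree, and only the sizes of the two identity matrices differ.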
Another interesting connection, between RBMs and denoising autoencoders, is derived in [25]. These connections between RBMs, autoencoders and denoising autoencoders, and the fact that the reconstruction cost captures whitening (by the above lemmas), likely explain why whitening does not matter much for RBMs and autoencoders in [6].\n\n4 Effects of whitening on ICA and RICA\n\nIn practice, ICA tends to be much more sensitive to whitening than sparse autoencoders. Running ICA on unwhitened data results in very noisy bases. In this section, we study empirically how whitening affects ICA and our formulation, RICA.\n\nWe sampled 20000 patches of size 16x16 from a set of 11 natural images [16] and visualized the filters learned using ICA and RICA on raw images, as well as on approximately whitened images. For approximate whitening, we use 1/f whitening with low-pass filtering. This 1/f whitening transformation uses Fourier analysis of natural image statistics and produces transformed data which has an approximately identity covariance matrix. This procedure does not require pretraining. As a result, 1/f whitening runs quickly and scales well to high dimensional data. We used the 1/f whitening implementation described in [16].\n\n[Figure 1 panels: (a) ICA on 1/f whitened images; (b) ICA on raw images; (c) RICA on 1/f whitened images; (d) RICA on raw images.]\n\nFigure 1: ICA and RICA on approximately whitened and raw images. (a-b): Bases learned with ICA. (c-d): Bases learned with RICA. RICA retains some structure of the data whereas ICA does not (i.e., it learns noisy bases).\n\nFigure 1 shows the results of running ICA and RICA on raw and 1/f whitened images. As can be seen, ICA learns very noisy bases on raw data as well as on approximately whitened data. In contrast, RICA works well for both 1/f whitened data and raw data. 
Our quantitative analysis with kurtosis (not shown due to space limits) agrees with visual inspection: RICA learns more kurtotic representations than ICA on approximately whitened or raw data.\n\nRobustness to approximate whitening is desirable, because exactly whitening high dimensional data using PCA may not be feasible. For instance, PCA on images of size 200x200 requires computing the eigendecomposition of a 40,000 x 40,000 covariance matrix, which is computationally expensive. With RICA, approximate whitening or raw data can be used instead. This allows our method to scale to higher dimensional data than regular ICA.\n\n5 Local receptive field TICA\n\nThe first application of our RICA algorithm that we examine is local receptive field neural networks. The motivation behind local receptive fields is computational efficiency. Specifically, rather than having each hidden unit connect to the entire input image, each unit is instead connected to a small patch (see figure 2a for an illustration). This reduces the number of parameters in the model. As a result, local receptive field neural networks are faster to optimize than their fully connected counterparts. A major drawback of this approach, however, is the difficulty of enforcing orthogonality across partially overlapping patches. We show that swapping out locally enforced orthogonality constraints for a global reconstruction cost solves this issue.\n\n3 Note that when W is overcomplete, some rows may degenerate and become zero, because the reconstruction constraint can be satisfied with only a complete subset of rows. To prevent this, we employ an additional norm ball constraint (see the Appendix for more details regarding L-BFGS and norm ball constraints).\n\nSpecifically, we examine the local receptive field network proposed by Le et al. [12]. 
Their formulation constrains each feature (a row of W) to connect to a small region of the image (i.e., all weights outside the patch are set to zero). This modification allows learning ICA and TICA on larger images, because W is now sparse. Unlike standard convolutional networks, these networks may be extended to have fully unshared weights. This permits them to learn invariances other than translational invariance, which is hardwired into convolutional networks.\n\nThe pre-training step for the TCNN (local receptive field TICA) [12] is performed by minimizing the following cost function:\n\nminimize_W Σ_{i=1}^{m} Σ_{j=1}^{k} sqrt(ε + H_j (W x^(i))^2), subject to W W^T = I (9)\n\nwhere H is the spatial pooling matrix and W is a learned weight matrix. The corresponding neural network representation of this algorithm is one with two layers with weights W, H and nonlinearities (·)^2 and sqrt(·) respectively (see Figure 2a). In addition, W and H are set to be local. That is, each row of W and H connects to a small region of the input data.\n\n[Figure 2 panels: (a) local receptive field neural net; (b) local orthogonalization; (c) RICA global reconstruction cost.]\n\nFigure 2: (a) Local receptive field neural network with fully untied weights. A single map consists of local receptive fields that do not share a location (i.e., only different colored nodes). (b & c) For illustration purposes we have brightened the area of each local receptive field within the input image. (b) Hard orthonormalization [12] is applied at each location only (i.e., nodes of the same color), which results in copied filters (for example, see the filters outlined in red; notice that the location of the edge stays the same within the image even though the receptive field areas are different). 
(c) Global reconstruction (this paper) is applied both within each location and across locations (nodes of the same and different colors), which prevents copying of receptive fields.\n\nEnforcing the hard orthonormality constraint on the entire sparse W matrix is challenging because it is typically overcomplete for TCNNs. As a result, Le et al. [12] performed local orthonormalization instead. That is, only the features (rows of W) that share a location (e.g., only the red nodes in figure 2) were orthonormalized using symmetric orthogonalization.\n\nHowever, visualizing the filters learned by a TCNN with local orthonormalization shows that many adjacent receptive fields end up learning the same (copied) filters due to the lack of an orthonormality constraint between them. For instance, the green nodes in Figure 2 may end up being copies of the red nodes (see the copied receptive fields in Figure 2b).\n\nIn order to prevent copied features, we replace the local orthonormalization constraint with a global reconstruction cost (i.e., computing the reconstruction cost ||W^T W x^(i) - x^(i)||_2^2 for the entire overcomplete sparse W matrix). Figure 2c shows the resulting filters. Figure 3 shows that the reconstruction penalty produces a better distribution of edge detector locations within the image patch (this also holds true for frequencies and orientations).\n\n[Figure 3: two scatter plots of edge detector positions; axes are Horizontal Location and Vertical Location, both from 0 to 1.]\n\nFigure 3: Location of each edge detector within the image patch. Symbols of the same color/shape correspond to a single map. Left: local orthonormalization constraint. Right: global reconstruction penalty. 
The reconstruction penalty prevents copied filters, producing a more uniform distribution of edge detectors.\n\n6 Experiments\n\nThe following experiments compare the speed gains of RICA over standard overcomplete ICA. We then use RICA to learn a large filter bank, and show that it works well for classification on the STL-10 dataset.\n\n6.1 Speed improvements for overcomplete ICA\n\nIn this experiment, we examine the speed performance of RICA and overcomplete ICA with score matching [26]. We trained overcomplete ICA on 20000 gray-scale image patches, each patch of size 16x16. We learn representations that are 2x, 4x and 6x overcomplete. We terminate both algorithms when changes in the parameter vector drop below 10^-6. We use the score matching implementation provided in [16]. We report the time required to learn these representations in Table 2. The results show that our method is much faster than the competing method. In particular, learning features that are 6x overcomplete takes 1 hour using our method, whereas [26] requires 2 days.\n\nTable 2: Speed improvements of our method over score matching [26].\n\nMethod | 2x overcomplete | 4x overcomplete | 6x overcomplete\nScore matching ICA | 33000 seconds | 65000 seconds | 180000 seconds\nRICA | 1000 seconds | 1600 seconds | 3700 seconds\nSpeed up | 33x | 40x | 48x\n\nFigure 4 shows the peak frequencies and orientations for 4x overcomplete bases learned using our method. The learned bases do not degenerate, and they cover a broad range of frequencies and orientations (cf. Figure 3 in [27]). This ability to learn a diverse set of features allows our algorithm to perform well on various discriminative tasks.\n\nFigure 4: Scatter plot of peak frequencies and orientations of Gabor functions fitted to the filters learned by RICA on whitened images. 
Our model yields a diverse set of filters that covers the spatial frequency space evenly.\n\n6.2 Overcomplete ICA on STL-10 dataset\n\nIn this section, we evaluate the overcomplete features learned by our model. The experiments are carried out on the STL-10 dataset [6], where overcomplete representations have been shown to work well. The STL-10 dataset contains 96x96 pixel color images taken from 10 classes. For each class, 500 training images and 800 test images are provided. In addition, 100,000 unlabeled images are included for unsupervised learning. We use RICA to learn overcomplete features on 100,000 randomly sampled color patches from the unlabeled images in the STL-10 dataset. We then apply RICA to extract features from images in the same manner described in [6].\n\nUsing the same number of features (1600) employed by Coates et al. [6] on 96x96 images and 10x10 receptive fields, our soft reconstruction ICA achieves 52.9% on the test set. This result is slightly better than (but within the error bars of) the best published result, 51.5%, obtained by K-means [6].\n\n[Figure 5 plot: cross-validation accuracy (%) versus number of features (0 to 1600) for soft ICA and ICA, on whitened and raw data, with the 51.5% result of Coates et al. marked.]\n\nFigure 5: Classification accuracy on the STL-10 dataset as a function of the number of bases learned (for a patch size of 8x8 pixels). The best result shown uses bases that are 8x overcomplete.\n\nFinally, we compare classification accuracy as a function of the number of bases. Figure 5 shows the results for ICA and RICA. 
Notice that the reconstruction cost in RICA allows us to learn overcomplete representations that outperform the complete representation obtained by regular ICA.\n\n6.3 Reconstruction Independent Subspace Analysis for action recognition\n\nRecently we presented a system [13] for learning features from unlabelled data that can lead to state-of-the-art performance on many challenging datasets such as Hollywood2 [15], KTH [28] and YouTube [29]. This system makes use of a two-layered Independent Subspace Analysis (ISA) network [16]. Like ICA, ISA also uses orthogonalization for degeneracy control.4\n\nIn this section we compare the effects of reconstruction versus orthogonality on classification performance using ISA. In our experiments we swap out the orthonormality constraint employed by ISA for a reconstruction penalty. Apart from this change, the entire pipeline and parameters are identical to the system described in [13].\n\nWe observe that the reconstruction penalty tends to work better than orthogonality constraints. In particular, on the Hollywood2 dataset ISA achieves a mean AP of 54.6% when the reconstruction penalty is used. The performance of ISA drops to 53.3% when orthogonality constraints are used. Both results are state-of-the-art results on this dataset [30]. We attribute the improvement in performance to the fact that features in invariant subspaces of ISA need not be strictly orthogonal.\n\n7 Discussion\n\nIn this paper, we presented a novel soft reconstruction approach that enables the learning of overcomplete representations in ICA and TICA. We have also presented mathematical proofs that connect ICA with autoencoders and sparse coding. We showed that our algorithm works well even without whitening, and that the reconstruction cost allows us to fix replicated filters in tiled convolutional neural networks. Our experiments show that RICA is fast and works well in practice. 
In particular, we found our method to be 30-50x faster than overcomplete ICA with score matching. Furthermore, our overcomplete features achieve state-of-the-art performance on the STL-10 and Hollywood2 datasets.\n\n4 Note that in ISA the square nonlinearity is used in the first layer, and the square root is used in the second layer [13].\n\nReferences\n[1] M.A. Ranzato, C. Poultney, S. Chopra, and Y. LeCun. Efficient learning of sparse representations with an energy-based model. In NIPS, 2006.\n[2] R. Raina, A. Battle, H. Lee, B. Packer, and A.Y. Ng. Self-taught learning: Transfer learning from unlabelled data. In ICML, 2007.\n[3] M. Ranzato, F. J. Huang, Y. Boureau, and Y. LeCun. Unsupervised learning of invariant feature hierarchies with applications to object recognition. In CVPR, 2007.\n[4] J. Yang, K. Yu, Y. Gong, and T. Huang. Linear spatial pyramid matching using sparse coding for image classification. In CVPR, 2009.\n[5] J. Yang, K. Yu, and T. Huang. Efficient highly over-complete sparse coding using a mixture model. In ECCV, 2010.\n[6] A. Coates, H. Lee, and A. Y. Ng. An analysis of single-layer networks in unsupervised feature learning. In AISTATS 14, 2011.\n[7] K. Yu, Y. Lin, and J. Lafferty. Learning image representations from pixel level via hierarchical sparse coding. In CVPR, 2011.\n[8] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layerwise training of deep networks. In NIPS, 2007.\n[9] G. E. Hinton, S. Osindero, and Y. W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 2006.\n[10] B. Olshausen and D. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 1996.\n[11] A. Hyvärinen, J. Karhunen, and E. Oja. Independent Component Analysis. Wiley Interscience, 2001.\n[12] Q. V. Le, J. Ngiam, Z. Chen, D. Chia, P. W. Koh, and A. Y. Ng. Tiled convolutional neural networks. 
In\n\nNIPS, 2010.\n\n[13] Q. V. Le, W. Zou, S. Y. Yeung, and A. Y. Ng. Learning hierarchical spatio-temporal features for action\n\nrecognition with independent subspace analysis. In CVPR, 2011.\n\n[14] Q. V. Le, J. Ngiam, A. Coates, A. Lahiri, B. Prochnow, and A. Y. Ng. On optimization methods for deep\n\nlearning. In ICML, 2011.\n\n[15] M. Marzalek, I. Laptev, and C. Schmid. Actions in context. In CVPR, 2009.\n[16] A. Hyv\u00a8arinen, J. Hurri, and P. O. Hoyer. Natural Image Statistics. Springer, 2009.\n[17] B. Olshausen and D. Field. Sparse coding with an overcomplete basis set: A strategy employed by v1.\n\nVision Research, 1997.\n\n[18] M. S. Lewicki and T. J. Sejnowski. Learning overcomplete representations. Neural Computation, 2000.\n[19] L. Ma and L. Zhang. Overcomplete topographic independent component analysis. Elsevier, 2008.\n[20] M. Schmidt. minFunc, 2005.\n[21] P. Vincent, H. Larochelle, Y. Bengio, and P. A. Manzagol. Extracting and composing robust features with\n\ndenoising autoencoders. In ICML, 2008.\n\n[22] H. Lee, C. Ekanadham, and A. Y. Ng. Sparse deep belief net model for visual area V2. In NIPS, 2008.\n[23] H. Larochelle, Y. Bengio, J. Louradour, and P. Lamblin. Exploring strategies for training deep neural\n\nnetworks. JMLR, 2009.\n\n[24] G. Hinton. A practical guide to training restricted boltzmann machines. Technical report, U. of Toronto,\n\n2010.\n\n[25] P. Vincent. A connection between score matching and denoising autoencoders. Neural Computation,\n\n2010.\n\n[26] A. Hyv\u00a8arinen. Estimation of non-normalized statistical models using score matching. JMLR, 2005.\n[27] Y. Karklin and M.S. Lewicki. Is early vision optimized for extracting higher-order dependencies? In\n\nNIPS, 2006.\n\n[28] C. Schuldt, I. Laptev, and B. Caputo. Recognizing human actions: A local SVM approach. In ICPR,\n\n2004.\n\n[29] J. Liu, J. Luo, and M. Shah. Recognizing realistic actions from videos \u201cin the Wild\u201d. 
In CVPR, 2009.\n[30] Heng Wang, Muhammad Muneeb Ullah, Alexander Klaser, Ivan Laptev, and Cordelia Schmid. Evaluation of local spatio-temporal features for action recognition. In BMVC, 2010.\n", "award": [], "sourceid": 625, "authors": [{"given_name": "Quoc", "family_name": "Le", "institution": null}, {"given_name": "Alexandre", "family_name": "Karpenko", "institution": null}, {"given_name": "Jiquan", "family_name": "Ngiam", "institution": null}, {"given_name": "Andrew", "family_name": "Ng", "institution": null}]}