{"title": "Optimal Sparse Linear Encoders and Sparse PCA", "book": "Advances in Neural Information Processing Systems", "page_first": 298, "page_last": 306, "abstract": "Principal components analysis~(PCA) is the optimal linear  encoder of data. Sparse linear encoders (e.g., sparse PCA) produce more interpretable features that  can promote better generalization. (\\rn{1}) Given a level of sparsity, what is the best approximation to PCA?  (\\rn{2}) Are there efficient algorithms which can achieve this optimal  combinatorial tradeoff? We answer both questions by  providing the first polynomial-time algorithms to construct \\emph{optimal} sparse linear auto-encoders; additionally, we demonstrate the performance of our algorithms on real data.", "full_text": "Optimal Sparse Linear Encoders and Sparse PCA\n\nMalik Magdon-Ismail\n\nRensselaer Polytechnic Institute, Troy, NY 12211\n\nChristos Boutsidis\n\nNew York, NY\n\nmagdon@cs.rpi.edu\n\nchristos.boutsidis@gmail.com\n\nAbstract\n\nPrincipal components analysis (PCA) is the optimal linear encoder of data. Sparse\nlinear encoders (e.g., sparse PCA) produce more interpretable features that can\npromote better generalization. (i) Given a level of sparsity, what is the best ap-\nproximation to PCA? (ii) Are there ef\ufb01cient algorithms which can achieve this\noptimal combinatorial tradeoff? We answer both questions by providing the \ufb01rst\npolynomial-time algorithms to construct optimal sparse linear auto-encoders; ad-\nditionally, we demonstrate the performance of our algorithms on real data.\n\n1\n\nIntroduction\n\ni \u2208 R1\u00d7d is a data point in d dimensions). auto-encoders\nThe data matrix is X \u2208 Rn\u00d7d (a row xT\ntransform (encode) the data into a low dimensional feature space and then lift (decode) it back to\nthe original space, reconstructing the data through a bottleneck. 
If the reconstruction is close to the original, then the encoder preserved most of the information using just a small number of features. Auto-encoders are important in machine learning because they perform information-preserving dimension reduction. Our focus is the linear auto-encoder, which, for $k < d$, is a pair of linear mappings $h : \mathbb{R}^d \mapsto \mathbb{R}^k$ and $g : \mathbb{R}^k \mapsto \mathbb{R}^d$, specified by an encoder matrix $H \in \mathbb{R}^{d \times k}$ and a decoder matrix $G \in \mathbb{R}^{k \times d}$. For a data point $x \in \mathbb{R}^d$, the encoded feature is $z = h(x) = H^T x \in \mathbb{R}^k$ and the reconstructed datum is $\hat{x} = g(z) = G^T z \in \mathbb{R}^d$. The reconstructed data matrix is $\hat{X} = XHG$. The pair $(H, G)$ is a good auto-encoder if $\hat{X} \approx X$ under some loss metric (we use squared loss):

Definition 1 (Loss $\ell(H, X)$). The loss of encoder $H$ is the minimum possible Frobenius reconstruction error (over all linear decoders $G$) when using $H$ as encoder for $X$:
$$\ell(H, X) = \min_{G \in \mathbb{R}^{k \times d}} \|X - XHG\|_F^2 = \|X - XH(XH)^\dagger X\|_F^2.$$

The loss is defined for an encoder $H$ alone, by choosing the decoder optimally. The literature considers primarily the symmetric auto-encoder, which places the additional restriction that $G = H^\dagger$ [18]. To get the most useful features, one should not place unnecessary constraints on the decoder. Principal Component Analysis (PCA) is the most famous linear auto-encoder, because it is optimal (and symmetric). Since $\mathrm{rank}(XHG) \le k$, the loss is bounded by $\ell(H, X) \ge \|X - X_k\|_F^2$ ($X_k$ is the best rank-$k$ approximation to $X$). By the Eckart-Young theorem, $X_k = XV_kV_k^T$, where $V_k \in \mathbb{R}^{d \times k}$ is the matrix whose columns are the top-$k$ right singular vectors of $X$ (see, for example, e-Chapter 9 of [14]). Thus, the optimal linear encoder is $H_{\mathrm{opt}} = V_k$, and the top-$k$ PCA-features are $Z_{\mathrm{pca}} = XV_k$. 
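Definition 1 and the PCA optimality claim are easy to check numerically; a minimal numpy sketch (ours, not the authors' code):

```python
import numpy as np

def loss(H, X):
    # Loss of Definition 1: Frobenius reconstruction error with the
    # optimal linear decoder, ||X - XH (XH)^+ X||_F^2.
    XH = X @ H
    return np.linalg.norm(X - XH @ np.linalg.pinv(XH) @ X, "fro") ** 2

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
k = 3

# PCA encoder: top-k right singular vectors, H_opt = V_k.
_, s, Vt = np.linalg.svd(X, full_matrices=False)
Vk = Vt[:k].T

# Optimal loss ||X - X_k||_F^2 = sum of the discarded squared singular values.
opt = np.sum(s[k:] ** 2)
assert np.isclose(loss(Vk, X), opt)

# Any other k-dimensional encoder can only do worse (Eckart-Young).
H = rng.standard_normal((10, k))
assert loss(H, X) >= opt - 1e-9
```

Note that the loss depends only on the column space of $XH$, which is why the decoder can be eliminated via the pseudo-inverse.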
Since its early beginnings [19], PCA has evolved into a classic tool for data analysis. While PCA simplifies the data by concentrating the maximum information into a few components, those components may be hard to interpret. In many applications, it is desirable to "explain" the features using a few original variables (for example, genes in a biological application or assets in a financial application). One trades off the fidelity of the features (their ability to reconstruct the data) with the interpretability of the features using a few original features (sparsity of the encoder $H$). We introduce a sparsity parameter $r$ and require that every column of $H$ have at most $r$ non-zero elements. Every feature in an $r$-sparse encoding can be "explained" using at most $r$ original features. We now formally state the sparse linear encoder problem:

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Problem 1: Optimal r-sparse encoder (Sparse PCA)

Given $X \in \mathbb{R}^{n \times d}$, $\varepsilon > 0$ and $k < \mathrm{rank}(X)$, find an $r$-sparse encoder $H$ with minimum $r$ for which the loss is a $(1 + \varepsilon)$ relative approximation to the optimal loss,
$$\ell(H, X) = \|X - XH(XH)^\dagger X\|_F^2 \le (1 + \varepsilon)\|X - X_k\|_F^2.$$

Our Contributions. First, we propose the "sparse-PCA" problem defined above in lieu of traditional sparse-PCA, which is based on maximizing variance, not minimizing loss. With no sparsity constraint, variance maximization and loss minimization are equivalent, both being solved by PCA. Historically, variance maximization became the norm for sparse-PCA. However, minimizing loss better achieves the machine learning goal of preserving information. 
The table below compares the 10-fold cross-validation error $E_{\mathrm{out}}$ for an SVM classifier using features from popular variance-maximizing sparse-PCA encoders and our loss-minimizing sparse encoder ($k = 6$ and $r = 7$).

[Table: for six datasets, ARCENE ($d = 10^4$), Gisette ($d = 5000$), Madelon ($d = 500$), "1" vs "5" ($d = 256$), SECOM ($d = 691$) and Spam ($d = 57$), it reports $E_{\mathrm{out}}$ for an SVM on all features, and $E_{\mathrm{out}}$, normalized loss and explained variance for T-Power [23], G-Power-$\ell_0$ [10], G-Power-$\ell_1$ [10] and our algorithm; the full SVM could not be run on ARCENE (matrix too large). The numeric entries were scrambled in extraction and are omitted.]

Lower loss gives lower error. Our experiments are not exhaustive, but their role is modest: to motivate minimizing loss as the right machine learning objective for sparse encoders (Problem 1).

Our main contribution is polynomial sparse encoder algorithms with theoretical guarantees that solve Problem 1. We give a $(1+\varepsilon)$-optimal encoder with sparsity $O(k/\varepsilon)$ (Theorem 7). This sparsity cannot be beaten by any algorithm that guarantees a $(1 + \varepsilon)$-approximation (Theorem 8). Ours is the first theoretical guarantee for a $k$-component sparse linear encoder with respect to the optimal PCA. Our algorithm constructs sparse PCA features (columns of the encoder $H$) which preserve almost as much information as optimal PCA-features. Our second technical contribution (Theorem 11) is an algorithm to construct sparse features iteratively (typical of sparse-PCA algorithms). Iterative algorithms are notoriously hard to analyze, and we give the first theoretical guarantees for an iterative sparse encoder. 
(Detailed proofs are postponed to a full version.)

Notation. Let $\rho \le \min\{n, d\}$ be $\mathrm{rank}(X)$ (typically $\rho = d$). We use $A, B, C, \ldots$ for matrices and $a, b, c, \ldots$ for vectors. The Euclidean basis is $e_1, e_2, \ldots$ (dimension can be inferred from context). For an $n \times d$ matrix $X$, the singular value decomposition (SVD) gives $X = U\Sigma V^T$, where the columns of $U \in \mathbb{R}^{n \times \rho}$ are the left singular vectors, the columns of $V \in \mathbb{R}^{d \times \rho}$ are the right singular vectors, and $\Sigma \in \mathbb{R}^{\rho \times \rho}$ is the positive diagonal matrix of singular values $\sigma_1 \ge \cdots \ge \sigma_\rho$; $U$ and $V$ are orthonormal, $U^T U = V^T V = I_\rho$. For integer $k$, we use $U_k \in \mathbb{R}^{n \times k}$ (resp. $V_k \in \mathbb{R}^{d \times k}$) for the first $k$ left (resp. right) singular vectors, and $\Sigma_k \in \mathbb{R}^{k \times k}$ is the diagonal matrix with the top-$k$ singular values. We can view a matrix as a row of columns. So, $X = [f_1, \ldots, f_d]$, $U = [u_1, \ldots, u_\rho]$, $V = [v_1, \ldots, v_\rho]$, $U_k = [u_1, \ldots, u_k]$ and $V_k = [v_1, \ldots, v_k]$. We use $f$ for the columns of $X$, the features, and we reserve $x_i$ for the data points (rows of $X$), $X^T = [x_1, \ldots, x_n]$. $A = [a_1, \ldots, a_k]$ is $(r_1, \ldots, r_k)$-sparse if $\|a_i\|_0 \le r_i$; if all $r_i$ are equal to $r$, we say the matrix is $r$-sparse. The Frobenius (Euclidean) norm of a matrix $A$ is $\|A\|_F^2 = \sum_{ij} A_{ij}^2 = \mathrm{Tr}(A^T A) = \mathrm{Tr}(AA^T)$. The pseudo-inverse of $A$ with SVD $U_A \Sigma_A V_A^T$ is $A^\dagger = V_A \Sigma_A^{-1} U_A^T$; $AA^\dagger = U_A U_A^T$ is a symmetric projection operator. For matrices $A, B$ with $A^T B = 0$, a generalized Pythagoras theorem holds, $\|A + B\|_F^2 = \|A\|_F^2 + \|B\|_F^2$. $\|A\|_2$ is the operator norm (top singular value) of $A$.

Discussion of Related Work. PCA is the optimal (and most popular) linear auto-encoder. Nonlinear auto-encoders became prominent with auto-associative neural networks [7, 3, 4, 17, 18]. There is some work on sparse linear auto-encoders (e.g. 
[15]) and a lot of research on "sparse PCA". The importance of sparse factors in dimensionality reduction has been recognized in some early work: the varimax criterion [11] has been used to rotate the factors to encourage sparsity, and this has been used in multi-dimensional scaling approaches to dimension reduction [20, 12]. One of the first attempts at sparse PCA used axis rotations and component thresholding [6]. The traditional formulation of sparse PCA is as cardinality-constrained variance maximization: $\max_v v^T A v$ subject to $v^T v = 1$ and $\|v\|_0 \le r$, which is NP-hard [14]. The exhaustive algorithm requires $O\big(dr^2\binom{d}{r}\big)$ computation, which can be improved to $O(d^{q+1})$ for a rank-$q$ perturbation of the identity [2]. These algorithms are not practical. Several heuristics exist. [22] and [24] take an $L_1$ penalization view. DSPCA (direct sparse PCA) [9] also uses an $L_1$ sparsifier but solves a relaxed convex semidefinite program, which is further refined in [8], where they also give a tractable sufficient condition for testing optimality. The simplest algorithms use greedy forward and backward subset selection. For example, [16] develop a greedy branch-and-bound algorithm based on spectral bounds with $O(d^3)$ running time for forward selection and $O(d^4)$ running time for backward selection. An alternative view of the problem is as a sparse matrix reconstruction problem; for example, [21] obtain sparse principal components using regularized low-rank matrix approximation. Most existing SPCA algorithms find one sparse principal component. 
One applies the algorithm iteratively on the residual after projection to get additional sparse principal components [13].

There are no polynomial algorithms with optimality guarantees. [1] considers sparse PCA with a non-negativity constraint: they give an algorithm with input parameter $k$ and running time $O(d^{k+1}\log d + d^k r^3)$ which constructs a sparse component within $\big(1 - \tfrac{n}{r}\|X - X_k\|^2/\|X\|^2\big)$ from optimal. The running time is not practical when $k$ is large; further, the approximation guarantee relies on rapid spectral decay of $X$ and only applies to the first component, not to further iterates.

Explained Variance vs. Loss. For symmetric auto-encoders, minimizing loss is equivalent to maximizing the symmetric explained variance $\|XHH^\dagger\|_F^2$, due to the identity
$$\mathrm{var}(X) = \|X\|_F^2 = \|X(I - HH^\dagger) + XHH^\dagger\|_F^2 = \|X - XHH^\dagger\|_F^2 + \|XHH^\dagger\|_F^2$$
(the last equality is from Pythagoras' theorem). The PCA auto-encoder is symmetric, $V_k^\dagger = V_k^T$. So the optimal encoders for maximum variance and for minimum loss are the same: PCA. But, when it comes to approximation, an approximation algorithm for loss can be converted to an approximation algorithm for variance maximization (the reverse is not true).

Theorem 2. If $\|X - XHH^\dagger\|_F^2 \le (1 + \varepsilon)\|X - X_k\|_F^2$, then $\|XHH^\dagger\|_F^2 \ge \big(1 - \tfrac{\rho - k}{k}\varepsilon\big)\|X_k\|_F^2$.

When factors are not decorrelated, explained variance is not well defined [24], whereas loss is well defined for any encoder. Minimizing loss and maximizing the explained variance are both ways of encouraging $H$ to be close to $V_k$. However, when $H$ is constrained (for example to be sparse), these optimization objectives can produce very different solutions. From the machine learning perspective, symmetry is an unnecessary constraint on the auto-encoder. 
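The variance decomposition for symmetric auto-encoders is easy to verify numerically; a quick numpy check (ours, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((60, 12))
H = rng.standard_normal((12, 4))

# H H^+ is the symmetric projector onto range(H), so X(I - HH^+) and XHH^+
# are orthogonal pieces of X and Pythagoras gives
# var(X) = ||X||_F^2 = ||X - XHH^+||_F^2 + ||XHH^+||_F^2.
P = H @ np.linalg.pinv(H)
sym_loss = np.linalg.norm(X - X @ P, "fro") ** 2
sym_var = np.linalg.norm(X @ P, "fro") ** 2
assert np.isclose(np.linalg.norm(X, "fro") ** 2, sym_loss + sym_var)
```

Because the two terms sum to a constant, driving the symmetric loss down and driving the symmetric explained variance up are the same optimization; this equivalence is exactly what breaks for non-symmetric decoders.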
All we want is an encoder that produces a compact representation of the data while capturing as much information as possible.

2 Optimal Sparse Linear Encoder

We show a black-box reduction of sparse encoding to the column subset selection problem (CSSP). We then use column subset selection algorithms to construct provably accurate sparse auto-encoders. For $X = [f_1, \ldots, f_d]$, we let $C = [f_{i_1}, f_{i_2}, \ldots, f_{i_r}]$ denote a matrix formed using $r$ columns "sampled" from $X$, where $1 \le i_1 < i_2 < \cdots < i_r \le d$ are distinct column indices. We can use a matrix $\Omega \in \mathbb{R}^{d \times r}$ to perform the sampling, $C = X\Omega$, where $\Omega = [e_{i_1}, e_{i_2}, \ldots, e_{i_r}]$ and $e_i$ are the standard basis vectors in $\mathbb{R}^d$ (post-multiplying $X$ by $e_i$ "samples" the $i$th column of $X$). The columns of $C$ span a subspace in the range of $X$. A sampling matrix can be used to construct an $r$-sparse matrix.

Lemma 3. Let $\Omega = [e_{i_1}, e_{i_2}, \ldots, e_{i_r}]$ and let $A \in \mathbb{R}^{r \times k}$ be any matrix. Then $\Omega A$ is $r$-sparse.

Define $X_C = CC^\dagger X$, the projection of $X$ onto the column space of $C$. Let $X_{C,k} \in \mathbb{R}^{n \times d}$ be the optimal rank-$k$ approximation to $X_C$ obtained via the SVD of $X_C$.

Lemma 4 (See, for example, [5]). $X_{C,k}$ is a rank-$k$ matrix whose columns are in the span of $C$. Let $\hat{X}$ be any rank-$k$ matrix whose columns are in the span of $C$. Then, $\|X - X_{C,k}\|_F \le \|X - \hat{X}\|_F$.

That is, $X_{C,k}$ is the best rank-$k$ approximation to $X$ whose columns are in the span of $C$. An efficient algorithm to compute $X_{C,k}$ is also given in [5]. The algorithm runs in $O(ndr + (n + d)r^2)$ time.

2.1 Sparse Linear Encoders from Column Subset Selection

We show that a set of columns $C$ for which $X_{C,k}$ is a good approximation to $X$ can produce a good sparse linear encoder. 
In the algorithm below we assume (not essential) that $C$ has full column rank. The algorithm uses standard linear algebra operations and has total runtime $O(ndr + (n + d)r^2)$.

Blackbox algorithm to compute encoder from CSSP

Inputs: $X \in \mathbb{R}^{n \times d}$; $C \in \mathbb{R}^{n \times r}$ with $C = X\Omega$ and $\Omega = [e_{i_1}, \ldots, e_{i_r}]$; $k \le r$.
Output: $r$-sparse linear encoder $H \in \mathbb{R}^{d \times k}$.
1: Compute a QR-factorization of $C$ as $C = QR$, with $Q \in \mathbb{R}^{n \times r}$, $R \in \mathbb{R}^{r \times r}$.
2: Obtain the SVD of $R^{-1}(Q^T X)_k$: $R^{-1}(Q^T X)_k = U_R\Sigma_R V_R^T$ ($U_R \in \mathbb{R}^{r \times k}$, $\Sigma_R \in \mathbb{R}^{k \times k}$ and $V_R \in \mathbb{R}^{d \times k}$).
3: Return $H = \Omega U_R \in \mathbb{R}^{d \times k}$.

In step 2, even though $R^{-1}(Q^T X)_k$ is an $r \times d$ matrix, it has rank $k$, hence the dimensions of $U_R, \Sigma_R, V_R$ depend on $k$, not $r$. By Lemma 3, the encoder $H$ is $r$-sparse. Also, $H$ has orthonormal columns, as is typically desired for an encoder ($H^T H = U_R^T\Omega^T\Omega U_R = U_R^T U_R = I$). In every column of our encoder, the non-zeros are at the same $r$ coordinates, which is much stronger than $r$-sparsity. The next theorem shows that our encoder is good if $C$ contains a good rank-$k$ approximation $X_{C,k}$.

Theorem 5 (Blackbox encoder from CSSP). Given $X \in \mathbb{R}^{n \times d}$, $C = X\Omega \in \mathbb{R}^{n \times r}$ with $\Omega = [e_{i_1}, \ldots, e_{i_r}]$ and $k \le r$, let $H$ be the $r$-sparse linear encoder produced by the algorithm above in $O(ndr + (n + d)r^2)$ time. Then, the loss satisfies
$$\ell(H, X) = \|X - XH(XH)^\dagger X\|_F^2 \le \|X - X_{C,k}\|_F^2.$$

The theorem says that if we can find a set of $r$ columns within which a good rank-$k$ approximation to $X$ exists, then we can construct a good sparse linear encoder. What remains is to find a sampling matrix $\Omega$ which gives a good set of columns $C = X\Omega$ for which $\|X - X_{C,k}\|_F^2$ is small. 
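The three steps of the blackbox algorithm can be sketched in a few lines of numpy (our own sketch, assuming the sampled columns have full column rank; not the authors' implementation):

```python
import numpy as np

def cssp_encoder(X, cols, k):
    # Blackbox CSSP-to-encoder: given column indices for C = X[:, cols],
    # build an r-sparse encoder H with orthonormal columns.
    C = X[:, cols]
    Q, R = np.linalg.qr(C)                       # step 1: C = QR
    # (Q^T X)_k: best rank-k approximation of the r x d matrix Q^T X.
    U, s, Vt = np.linalg.svd(Q.T @ X, full_matrices=False)
    QtX_k = (U[:, :k] * s[:k]) @ Vt[:k]
    # step 2: SVD of R^{-1} (Q^T X)_k, which has rank k; keep U_R (r x k).
    UR = np.linalg.svd(np.linalg.solve(R, QtX_k), full_matrices=False)[0][:, :k]
    # step 3: H = Omega U_R -- nonzeros only on the r sampled coordinates.
    H = np.zeros((X.shape[1], k))
    H[cols] = UR
    return H

rng = np.random.default_rng(3)
X = rng.standard_normal((40, 15))
H = cssp_encoder(X, [0, 3, 7, 9, 12], k=2)

assert np.allclose(H.T @ H, np.eye(2))           # orthonormal columns
assert np.count_nonzero(np.any(H != 0, axis=1)) <= 5   # row-sparsity <= r
```

Any column-selection routine can supply `cols`; the encoder quality then follows from Theorem 5 via the quality of the selected columns.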
The main tool to obtain $C$ and $\Omega$ was developed in [5], which gave a constant-factor deterministic approximation algorithm and a relative-error randomized approximation algorithm. We state a simplified form of the result and then discuss various ways in which this result can be enhanced. Any algorithm to construct a good set of columns can be used as a black box to get a sparse linear encoder.

Theorem 6 (Near-optimal CSSP [5]). Given $X \in \mathbb{R}^{n \times d}$ of rank $\rho$ and target rank $k$:

(i) (Theorem 2 in [5]) For sparsity parameter $r > k$, there is a deterministic algorithm which runs in time $T_{V_k} + O(ndk + dk^3)$ to construct a sampling matrix $\Omega = [e_{i_1}, \ldots, e_{i_r}]$ and corresponding columns $C = X\Omega$ such that
$$\|X - X_{C,k}\|_F^2 \le \Big(1 + \big(1 - \sqrt{k/r}\big)^{-2}\Big)\|X - X_k\|_F^2.$$

(ii) (Simplified Theorem 5 in [5]) For sparsity parameter $r > 5k$, there is a randomized algorithm which runs in time $O(ndk + dk^3 + r\log r)$ to construct a sampling matrix $\Omega = [e_{i_1}, \ldots, e_{i_r}]$ and corresponding columns $C = X\Omega$ such that
$$\mathbb{E}\big[\|X - X_{C,k}\|_F^2\big] \le \Big(1 + \frac{5k}{r - 5k}\Big)\|X - X_k\|_F^2.$$

Our "batch" sparse linear encoder uses Theorem 6 in our black-box CSSP-encoder.

"Batch" Sparse Linear Encoder Algorithm

Inputs: $X \in \mathbb{R}^{n \times d}$; rank $k \le \mathrm{rank}(X)$; sparsity $r > k$.
Output: $r$-sparse linear encoder $H \in \mathbb{R}^{d \times k}$.
1: Use Theorem 6-(ii) to compute columns $C = X\Omega \in \mathbb{R}^{n \times r}$, with inputs $X, k, r$.
2: Return $H$ computed with $X, C, k$ as input to the CSSP-blackbox encoder algorithm.

Using Theorem 6 in Theorem 5, we have an approximation guarantee for our algorithm.

Theorem 7 (Sparse Linear Encoder). 
Given $X \in \mathbb{R}^{n \times d}$ of rank $\rho$, the target number of sparse PCA vectors $k \le \rho$, and sparsity parameter $r > 5k$, the "batch" sparse linear encoder algorithm above runs in time $O(ndr + (n + d)r^2 + dk^3)$ and constructs an $r$-sparse encoder $H$ such that:
$$\mathbb{E}\big[\|X - XH(XH)^\dagger X\|_F^2\big] \le \Big(1 + \frac{5k}{r - 5k}\Big)\|X - X_k\|_F^2.$$

Comments. 1. The expectation is over the random choices in the algorithm, and the bound can be boosted to hold with high probability or even deterministically. 2. The guarantee is with respect to $\|X - X_k\|_F^2$ (optimal dense PCA): sparsity $r = O(k/\varepsilon)$ suffices to mimic top-$k$ (dense) PCA.

We now give the lower bound on sparsity, showing that our result is worst-case optimal. Define the row-sparsity of $H$ as the number of its rows that are non-zero. When $k = 1$, the row-sparsity equals the sparsity of the single factor. The row-sparsity is the total number of dimensions which have non-zero loadings among all the factors. Our algorithm produces an encoder with row-sparsity $O(k/\varepsilon)$ and comes within $(1 + \varepsilon)$ of the minimum possible loss. This is worst-case optimal:

Theorem 8 (Lower Bound). There is a matrix $X$ for which any linear encoder that achieves a $(1 + \varepsilon)$-approximate loss as compared to PCA must have row-sparsity $r \ge k/\varepsilon$.

The common case in the literature is $k = 1$ (top sparse component). Our lower bound shows that $\Omega(1/\varepsilon)$-sparsity is required, and our algorithm asymptotically achieves this lower bound. To prove Theorem 8, we show the converse of Theorem 5: a good linear auto-encoder with row-sparsity $r$ can be used to construct $r$ columns $C$ for which $X_{C,k}$ approximates $X$.

Lemma 9. Suppose $H$ is a linear encoder for $X$ with row-sparsity $r$ and decoder $G$. Then, $XHG = CY$, where $C$ is a set of $r$ columns of $X$ and $Y \in \mathbb{R}^{r \times d}$.

Given Lemma 9, a sketch of the rest of the argument is as follows. 
Section 9.2 of [5] demonstrates a matrix for which there do not exist $r$ good columns. Since a good $r$-sparse encoder gives $r$ good columns, no $r$-sparse encoder can be $(1 + k/r)$-optimal: no linear encoder with row-sparsity $r$ achieves a loss within $(1 + k/r)$ of PCA. Our construction is asymptotically worst-case optimal. The lower bound holds for general linear auto-encoders, and so this lower bound also applies to the symmetric auto-encoder $HH^\dagger$, the traditional formulation of sparse PCA. When $k = 1$, for any $r$-sparse unit norm $v$, there exists $X$ for which $\|X - Xvv^T\|_F^2 \ge (1 + \tfrac{1}{r})\|X - X_1\|_F^2$, or, in terms of the symmetric explained variance, $v^T X^T X v \le \|X_1\|_F^2 - \tfrac{1}{r}\|X - X_1\|_F^2$.

3 Iterative Sparse Linear Encoders

Our CSSP-based algorithm is "batch" in that all $k$ factors are constructed simultaneously. Every feature in the encoder is $r$-sparse with non-zero loadings on the same set of $r$ original dimensions; and, you cannot do better with a row-sparsity of $r$. Further, the batch algorithm does not distinguish between the $k$ factors. That is, there is no top component, second component, and so on.

The traditional techniques for sparse PCA construct the factors iteratively. We can too: run our batch algorithm in an iterative mode, where in each step we set $k = 1$ and compute a sparse factor for a residual matrix. By constructing our $k$ features iteratively (and adaptively), we identify an ordering among the $k$ features. Further, we might be able to get each feature sparser while still maintaining a bound on the row-sparsity. We now give an iterative version of our algorithm. In each iteration, we augment $H$ by computing a top sparse encoder for the residual obtained using the current $H$.

Iterative Sparse Linear Encoder Algorithm

Inputs: $X \in \mathbb{R}^{n \times d}$; rank $k \le \mathrm{rank}(X)$; sparsity parameters $r_1, \ldots, r_k$.
Output: (r1, . . .
, rk)-sparse linear encoder $H \in \mathbb{R}^{d \times k}$.
1: Set the residual $\Delta = X$ and $H = [\,]$.
2: for $i = 1$ to $k$ do
3:   Use the batch algorithm to compute encoder $h$ for $\Delta$, with $k = 1$ and $r = r_i$.
4:   Add $h$ to the encoder: $H \leftarrow [H, h]$.
5:   Update the residual $\Delta$: $\Delta \leftarrow X - XH(XH)^\dagger X$.
6: Return the $(r_1, \ldots, r_k)$-sparse encoder $H \in \mathbb{R}^{d \times k}$.

The next lemma bounds the reconstruction error for this iterative step in the algorithm.

Lemma 10. Suppose, for $k \ge 1$, $H_k = [h_1, \ldots, h_k]$ is an encoder for $X$, satisfying
$$\|X - XH_k(XH_k)^\dagger X\|_F^2 = \mathrm{err}.$$
Given a sparsity $r > 5$ and $\delta \le 5/(r - 5)$, one can compute in time $O(ndr + (n + d)r^2)$ an $r$-sparse feature $h_{k+1}$ such that the reconstruction error of the encoder $H_{k+1} = [h_1, \ldots, h_k, h_{k+1}]$ satisfies
$$\mathbb{E}\big[\|X - XH_{k+1}(XH_{k+1})^\dagger X\|_F^2\big] = (1 + \delta)\big(\mathrm{err} - \|X - XH_k(XH_k)^\dagger X\|_2^2\big).$$

Lemma 10 gives a bound on the reconstruction error for an iterative addition of the next sparse encoder vector. To see how Lemma 10 is useful, consider target rank $k = 2$. First construct $h_1$ with sparsity $r_1 = 5 + 5/\varepsilon$, which gives $(1 + \varepsilon)\|X - X_1\|_F^2$ loss. Now construct $h_2$, also with sparsity $r_2 = 5 + 5/\varepsilon$. The loss for $H = [h_1, h_2]$ is bounded by
$$\ell(H, X) \le (1 + \varepsilon)^2\|X - X_2\|_F^2 + \varepsilon(1 + \varepsilon)\|X - X_1\|_2^2.$$
On the other hand, our batch algorithm uses sparsity $r = 10 + 10/\varepsilon$ in each encoder $h_1, h_2$ and achieves reconstruction error $(1 + \varepsilon)\|X - X_2\|_F^2$. The iterative algorithm uses sparser features, but pays for it a little in reconstruction error. The second term is small, $O(\varepsilon)$, and depends on $\|X - X_1\|_2^2 = \sigma_2^2$, which in practice is smaller than $\|X - X_2\|_F^2 = \sigma_3^2 + \cdots + \sigma_d^2$. Using the iterative algorithm, we can tailor the sparsity of each encoder vector separately to achieve a desired accuracy. 
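The structure of the iterative loop is simple; the sketch below uses a naive truncation heuristic as a stand-in for the batch $k = 1$ step (our own illustration, not the paper's CSSP-based routine, which carries the guarantees):

```python
import numpy as np

def top_sparse_factor(D, r):
    # Stand-in for the batch step with k = 1: truncate the top right singular
    # vector of the residual to its r largest-magnitude entries and
    # renormalize (a heuristic, NOT the paper's CSSP-based routine).
    v = np.linalg.svd(D, full_matrices=False)[2][0]
    h = np.zeros_like(v)
    keep = np.argsort(-np.abs(v))[:r]
    h[keep] = v[keep]
    return h / np.linalg.norm(h)

def iterative_encoder(X, sparsities):
    # Greedily append one r_i-sparse factor for the current residual, then
    # recompute the residual X - XH(XH)^+ X with the optimal decoder.
    H = np.zeros((X.shape[1], 0))
    for r in sparsities:
        D = X if H.shape[1] == 0 else X - X @ H @ np.linalg.pinv(X @ H) @ X
        H = np.column_stack([H, top_sparse_factor(D, r)])
    return H

rng = np.random.default_rng(4)
X = rng.standard_normal((30, 10))
H = iterative_encoder(X, [3, 4])     # r_1 = 3, r_2 = 4
assert H.shape == (10, 2)
assert all(np.count_nonzero(H[:, i]) <= s for i, s in enumerate([3, 4]))
```

Unlike the batch encoder, each column here can have its own sparsity budget and its own support, which is what makes the per-column bounds of Lemma 10 applicable.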
It is algebraically intense to prove a bound for a general choice of the sparsity parameters $r_1, \ldots, r_k$, so for simplicity, we prove a bound for a specific choice of the sparsity parameters which slowly increase with each iterate. The proof idea is similar to our example with $k = 2$.

Theorem 11 (Iterative Encoder). Given $X \in \mathbb{R}^{n \times d}$ of rank $\rho$ and $k < \rho$, set $r_j = 5 + \lceil 5j/\varepsilon \rceil$ in our iterative encoder to compute the $(r_1, \ldots, r_k)$-sparse encoder $H = [h_1, h_2, \ldots, h_k]$. Then, for every $\ell = 1, \ldots, k$, the encoder $H_\ell = [h_1, h_2, \ldots, h_\ell]$ has reconstruction error
$$\mathbb{E}\big[\|X - XH_\ell(XH_\ell)^\dagger X\|_F^2\big] \le (e\ell)^\varepsilon\|X - X_\ell\|_F^2 + \varepsilon\ell^{1+\varepsilon}\|X_\ell - X_1\|_F^2. \quad (1)$$
The running time to compute all the encoder vectors is $O(ndk^2\varepsilon^{-1} + (n + d)k^3\varepsilon^{-2})$.

Comments. This is the first theoretical guarantee for iterative sparse encoders. Up to a small additive term, we have a relative-error approximation because $(e\ell)^\varepsilon = 1 + O(\varepsilon\log\ell)$ grows slowly with $\ell$. Each successive encoder vector has a larger sparsity (as opposed to a fixed sparsity $r = 5k + 5k/\varepsilon$ in the batch algorithm). If we used a constant sparsity $r_j = 5 + 5k/\varepsilon$ for every encoder vector in the iterative algorithm, the relative-error term becomes $1 + O(\varepsilon\ell)$ as opposed to $1 + O(\varepsilon\log\ell)$. Just as with the PCA vectors $v_1, v_2, \ldots$, we have a provably good encoder for any $\ell$ by taking the first $\ell$ factors $h_1, \ldots, h_\ell$. In the batch encoder $H = [h_1, \ldots, h_k]$, we cannot guarantee that $h_1$ will give a reconstruction comparable with $X_1$. The detailed proof is in the supplementary material.

Proof. 
(Sketch) For $\ell \ge 1$, we define two quantities $Q_\ell, P_\ell$ that will be useful in the proof:
$$Q_\ell = (1 + \varepsilon)(1 + \tfrac{1}{2}\varepsilon)(1 + \tfrac{1}{3}\varepsilon)(1 + \tfrac{1}{4}\varepsilon)\cdots(1 + \tfrac{1}{\ell}\varepsilon); \qquad P_\ell = Q_\ell - 1.$$

Using Lemma 10 and induction, we can prove a bound on the loss of $H_\ell$:
$$\mathbb{E}\big[\|X - XH_\ell(XH_\ell)^\dagger X\|_F^2\big] \le Q_\ell\|X - X_\ell\|_F^2 + Q_\ell\sum_{j=2}^{\ell}\sigma_j^2\,\frac{P_{j-1}}{Q_{j-1}}. \quad (*)$$
When $\ell = 1$, the claim is that $\mathbb{E}[\|X - XH_1(XH_1)^\dagger X\|_F^2] \le (1+\varepsilon)\|X - X_1\|_F^2$ (since the summation is empty), which is true by construction of $H_1 = [h_1]$ because $r_1 = 5 + 5/\varepsilon$. For the induction step, we apply Lemma 10 with $\delta = \varepsilon/(\ell + 1)$, condition on $\mathrm{err} = \|X - XH_\ell(XH_\ell)^\dagger X\|_F^2$ whose expectation is given in $(*)$, and use iterated expectation. The details are given in the supplementary material. The first term in the bound (1) follows by bounding $Q_\ell$ using elementary calculus:
$$\log Q_\ell = \sum_{i=1}^{\ell}\log\Big(1 + \frac{\varepsilon}{i}\Big) \le \sum_{i=1}^{\ell}\frac{\varepsilon}{i} \le \varepsilon\log(e\ell),$$
where we used $\log(1 + x) \le x$ for $x \ge 0$ and the well-known upper bound $\log(e\ell)$ for the $\ell$th harmonic number $1 + \tfrac{1}{2} + \tfrac{1}{3} + \cdots + \tfrac{1}{\ell}$. Thus, $Q_\ell \le (e\ell)^\varepsilon$. The rest of the proof is to bound the second term in $(*)$ to obtain the second term in (1). Observe that for $i \ge 1$,
$$P_i = Q_i - 1 = \frac{\varepsilon}{i}\,Q_{i-1} + Q_{i-1} - 1 \le \varepsilon\,\frac{Q_i}{Q_1} + P_{i-1},$$
where we used $Q_i/Q_1 \le Q_{i-1}$ and we define $P_0 = 0$. 
After some algebra which we omit,
$$\sum_{j=2}^{\ell}\sigma_j^2\,\frac{P_{j-1}}{Q_{j-1}} \le \frac{\varepsilon}{Q_1}\|X_\ell - X_1\|_F^2 + \sum_{j=3}^{\ell}\sigma_j^2\,\frac{P_{j-2}}{Q_{j-2}}.$$

Figure 1: Performance of the sparse encoder algorithms (Batch, Iterative, Tpower, Gpower-$\ell_0$, Gpower-$\ell_1$) on the PitProps data (left), Lymphoma data (middle) and Colon data (right): information loss (top) and symmetric explained variance (bottom) with $k = 2$. Our algorithms give the minimum information loss, which decreases inversely with $r$ as the theory predicts. 
It is no surprise that existing sparse-PCA algorithms do better at maximizing symmetric explained variance.

Using this reduction it is now an elementary task to prove by induction that
$$\sum_{j=2}^{\ell}\sigma_j^2\,\frac{P_{j-1}}{Q_{j-1}} \le \frac{\varepsilon}{Q_1}\sum_{j=1}^{\ell-1}\|X_\ell - X_j\|_F^2.$$
Since $\|X_\ell - X_j\|_F^2 \le \|X_\ell - X_1\|_F^2(\ell - j)/(\ell - 1)$, we have that
$$\sum_{j=2}^{\ell}\sigma_j^2\,\frac{P_{j-1}}{Q_{j-1}} \le \frac{\varepsilon\|X_\ell - X_1\|_F^2}{Q_1(\ell - 1)}\sum_{j=1}^{\ell-1}(\ell - j) = \frac{\varepsilon\ell\|X_\ell - X_1\|_F^2}{2Q_1}.$$
Using $(*)$, we have that
$$\|X - XH_\ell(XH_\ell)^\dagger X\|_F^2 \le (e\ell)^\varepsilon\|X - X_\ell\|_F^2 + \frac{\varepsilon\ell\|X_\ell - X_1\|_F^2}{2}\cdot\frac{Q_\ell}{Q_1}.$$
The result finally follows because
$$\log\frac{Q_\ell}{Q_1} = \sum_{i=2}^{\ell}\log\Big(1 + \frac{\varepsilon}{i}\Big) \le \varepsilon\sum_{i=2}^{\ell}\frac{1}{i} \le \varepsilon(\log(e\ell) - 1) = \varepsilon\log\ell,$$
and so $Q_\ell/Q_1 \le \ell^\varepsilon$.

4 Demonstration

We empirically demonstrate our algorithms against existing state-of-the-art sparse PCA methods. The inputs are $X \in \mathbb{R}^{n \times d}$, the number of components $k$ and the sparsity parameter $r$. The output is the sparse encoder H = [h1, h2, . . .
, hk] $\in \mathbb{R}^{d \times k}$ with $\|h_i\|_0 \le r$; $H$ is used to project $X$ onto some subspace to obtain a reconstruction $\hat{X}$, which decomposes the variance into two terms:
$$\|X\|_F^2 = \|X - \hat{X}\|_F^2 + \|\hat{X}\|_F^2 = \text{Loss} + \text{Explained Variance}.$$
For symmetric auto-encoders, minimizing loss is equivalent to maximizing the symmetric explained variance, the path traditional sparse-PCA takes:
$$\text{Symmetric Explained Variance} = \|XHH^\dagger\|_F^2/\|X_k\|_F^2 \le 1.$$
To capture how informative the sparse components are, we can use the normalized loss:
$$\text{Loss} = \|X - XH(XH)^\dagger X\|_F^2/\|X - X_k\|_F^2 \ge 1.$$
We report the symmetric explained variance primarily for historical reasons, because existing sparse PCA methods have constructed auto-encoders to optimize the symmetric explained variance.

We implemented an instance of the sparse PCA algorithm of Theorem 7 with the deterministic technique described in part (i) of Theorem 6. (This algorithm gives a constant-factor approximation, as opposed to the relative-error approximation of the algorithm in Theorem 7, but it is deterministic and simpler to implement.) We call this the "Batch" sparse linear auto-encoder algorithm. We correspondingly implement an "Iterative" version with fixed sparsity $r$ in each principal component. In each step of the iterative sparse auto-encoder algorithm we use the above batch algorithm to select one principal component with sparsity at most $r$.

We compare ours to the following state-of-the-art sparse PCA algorithms: (1) T-Power: truncated power method [23]. (2) G-power-$\ell_0$: generalized power method with $\ell_0$ regularization [10]. (3) G-power-$\ell_1$: generalized power method with $\ell_1$ regularization [10]. All these algorithms were designed to operate for $k = 1$ (notice our algorithms handle any $k$), so to pick $k$ components, we use the "deflation" method suggested in [13]. 
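The two reported metrics are straightforward to compute; a numpy sketch in our notation (not the authors' evaluation code):

```python
import numpy as np

def normalized_loss(H, X, k):
    # ||X - XH(XH)^+ X||_F^2 / ||X - X_k||_F^2, always >= 1;
    # a value of 1 means the sparse encoder matches dense top-k PCA.
    XH = X @ H
    num = np.linalg.norm(X - XH @ np.linalg.pinv(XH) @ X, "fro") ** 2
    s = np.linalg.svd(X, compute_uv=False)
    return num / np.sum(s[k:] ** 2)

def symmetric_explained_variance(H, X, k):
    # ||X H H^+||_F^2 / ||X_k||_F^2, always <= 1 (the traditional metric).
    P = H @ np.linalg.pinv(H)
    s = np.linalg.svd(X, compute_uv=False)
    return np.linalg.norm(X @ P, "fro") ** 2 / np.sum(s[:k] ** 2)

rng = np.random.default_rng(5)
X = rng.standard_normal((50, 10))
k = 2
Vk = np.linalg.svd(X, full_matrices=False)[2][:k].T   # dense PCA encoder

# Dense PCA attains the best possible value of both metrics.
assert np.isclose(normalized_loss(Vk, X, k), 1.0)
assert np.isclose(symmetric_explained_variance(Vk, X, k), 1.0)
```

A sparse encoder is then judged by how close its normalized loss stays to 1 as the sparsity budget $r$ shrinks.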
We use the same data sets used by these prior algorithms (all available in [23]): PitProps (X ∈ R^{13×13}); Colon (X ∈ R^{500×500}); Lymphoma (X ∈ R^{500×500}).

The qualitative results for different k are similar, so we only show k = 2 in Figure 1. The take-away is that loss and symmetric variance give very different sparse encoders (example encoders [h1, h2] with r = 5 are shown on the right). This underlines why the correct objective is important. The machine learning goal is to preserve as much information as possible, which makes loss the compelling objective. The figures show that as r increases, our algorithms deliver near-optimal 1 + O(1/r) normalized loss, as the theory guarantees. The "iterative" algorithm has better empirical performance than the batch algorithm.

[Table: example encoder loadings [h1, h2] at r = 5 for Batch, Iterative, T-Power, G-power-ℓ0 and G-power-ℓ1; numeric values not recoverable from the extraction.]

Summary. Loss minimization and variance maximization give very different encoders under a sparsity constraint.
The empirical performance of our loss minimization algorithms follows the theory. Our iterative algorithm is empirically better, though it has a slightly worse theoretical guarantee.

5 Discussion

Historically, sparse PCA was cardinality constrained variance maximization. Variance per se has no intrinsic value, and is hard to define for non-orthogonal or correlated encoders, which is to be expected once you introduce a sparsity constraint. Our definition of loss is general and captures the machine learning goal of preserving as much information as possible.

We gave theoretical guarantees for sparse encoders. Our iterative algorithm has a weaker bound than our batch algorithm, yet the iterative algorithm is better empirically. Iterative algorithms are tough to analyze, and it remains open whether a tighter analysis can be given. We conjecture that the iterative algorithm is as good as or better than the batch algorithm, though proving it seems elusive.

Finally, we have not optimized for running times. Considerable speed-ups may be possible without sacrificing accuracy. For example, in the iterative algorithm (which repeatedly calls the CSSP algorithm with k = 1), it should be possible to significantly speed up the generic algorithm (for arbitrary k) to a specialized one for k = 1. We leave such implementation optimizations for future work.

Acknowledgments. Magdon-Ismail was partially supported by NSF:IIS 1124827 and by the Army Research Laboratory under Cooperative Agreement W911NF-09-2-0053 (the ARL-NSCTA). The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation here on.

References

[1] M. Asteris, D.
Papailiopoulos, and A. Dimakis. Non-negative sparse PCA with provable guarantees. In Proc. ICML, 2014.

[2] M. Asteris, D. Papailiopoulos, and G. Karystinos. Sparse principal component of a rank-deficient matrix. In Proc. ISIT, 2011.

[3] P. Baldi and K. Hornik. Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2:53–58, 1988.

[4] H. Bourlard and Y. Kamp. Auto-association by multilayer perceptrons and singular value decomposition. Biological Cybernetics, 59:291–294, 1988.

[5] C. Boutsidis, P. Drineas, and M. Magdon-Ismail. Near-optimal column-based matrix reconstruction. SIAM Journal on Computing, 43(2), 2014.

[6] J. Cadima and I. Jolliffe. Loadings and correlations in the interpretation of principal components. Applied Statistics, 22:203–214, 1995.

[7] G. Cottrell and P. Munro. Principal components analysis of images via back propagation. In Proc. SPIE 1001, Visual Communications and Image Processing '88, 1988.

[8] A. d'Aspremont, F. Bach, and L. E. Ghaoui. Optimal solutions for sparse principal component analysis. Journal of Machine Learning Research, 9:1269–1294, June 2008.

[9] A. d'Aspremont, L. El Ghaoui, M. I. Jordan, and G. R. G. Lanckriet. A direct formulation for sparse PCA using semidefinite programming. SIAM Review, 49(3):434–448, 2007.

[10] M. Journée, Y. Nesterov, P. Richtárik, and R. Sepulchre. Generalized power method for sparse principal component analysis. The Journal of Machine Learning Research, 11:517–553, 2010.

[11] H. Kaiser. The varimax criterion for analytic rotation in factor analysis. Psychometrika, 23(3):187–200, 1958.

[12] J. Kruskal. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29(1):1–27, 1964.

[13] L. W. Mackey. Deflation methods for sparse PCA. In Proc.
NIPS, pages 1017–1024, 2009.

[14] M. Magdon-Ismail. NP-hardness and inapproximability of sparse PCA. arXiv:1502.05675, 2015.

[15] A. Makhzani and B. Frey. k-sparse autoencoders. In ICLR, 2014.

[16] B. Moghaddam, Y. Weiss, and S. Avidan. Generalized spectral bounds for sparse LDA. In ICML, 2006.

[17] E. Oja. Data compression, feature extraction and autoassociation in feedforward neural networks. In Artificial Neural Networks, volume 1, pages 737–745, 1991.

[18] E. Oja. Principal components, minor components and linear neural networks. Neural Networks, 5:927–935, 1992.

[19] K. Pearson. On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2:559–572, 1901.

[20] J. Sammon. A nonlinear mapping for data structure analysis. IEEE Transactions on Computers, C-18(5):401–409, 1969.

[21] H. Shen and J. Z. Huang. Sparse principal component analysis via regularized low rank matrix approximation. Journal of Multivariate Analysis, 99:1015–1034, July 2008.

[22] N. Trendafilov, I. T. Jolliffe, and M. Uddin. A modified principal component technique based on the lasso. Journal of Computational and Graphical Statistics, 12:531–547, 2003.

[23] X.-T. Yuan and T. Zhang. Truncated power method for sparse eigenvalue problems. The Journal of Machine Learning Research, 14(1):899–925, 2013.

[24] H. Zou, T. Hastie, and R. Tibshirani. Sparse principal component analysis. Journal of Computational & Graphical Statistics, 15(2):265–286, 2006.