{"title": "Column Selection via Adaptive Sampling", "book": "Advances in Neural Information Processing Systems", "page_first": 406, "page_last": 414, "abstract": "Selecting a good column (or row) subset of massive data matrices has found many applications in data analysis and machine learning. We propose a new adaptive sampling algorithm that can be used to improve any relative-error column selection algorithm. Our algorithm delivers a tighter theoretical bound on the approximation error which we also demonstrate empirically using two well known relative-error column subset selection algorithms. Our experimental results on synthetic and real-world data show that our algorithm outperforms non-adaptive sampling as well as prior adaptive sampling approaches.", "full_text": "Column Selection via Adaptive Sampling\n\nSaurabh Paul\n\nGlobal Risk Sciences, Paypal Inc.\n\nsaupaul@paypal.com\n\nMalik Magdon-Ismail\n\nCS Dept., Rensselaer Polytechnic Institute\n\nmagdon@cs.rpi.edu\n\nPetros Drineas\n\nCS Dept., Rensselaer Polytechnic Institute\n\ndrinep@cs.rpi.edu\n\nAbstract\n\nSelecting a good column (or row) subset of massive data matrices has found many\napplications in data analysis and machine learning. We propose a new adap-\ntive sampling algorithm that can be used to improve any relative-error column\nselection algorithm. Our algorithm delivers a tighter theoretical bound on the ap-\nproximation error which we also demonstrate empirically using two well known\nrelative-error column subset selection algorithms. Our experimental results on\nsynthetic and real-world data show that our algorithm outperforms non-adaptive\nsampling as well as prior adaptive sampling approaches.\n\n1\n\nIntroduction\n\nIn numerous machine learning and data analysis applications, the input data are modelled as a matrix\nA \u2208 Rm\u00d7n, where m is the number of objects (data points) and n is the number of features. 
Often, it is desirable to represent your solution using a few features (to promote better generalization and interpretability of the solutions), or using a few data points (to identify important coresets of the data), for example PCA, sparse PCA, sparse regression, coreset based regression, etc. [1, 2, 3, 4]. These problems can be reduced to identifying a good subset of the columns (or rows) in the data matrix, the column subset selection problem (CSSP). For example, finding an optimal sparse linear encoder for the data (dimension reduction) can be explicitly reduced to CSSP [5]. Motivated by the fact that in many practical applications, the left and right singular vectors of a matrix A lack any physical interpretation, a long line of work [6, 7, 8, 9, 10, 11, 12, 13, 14, 15] has focused on extracting a subset of columns of the matrix A that is approximately as good as A_k at reconstructing A. To make our discussion more concrete, let us formally define CSSP.\n\nColumn Subset Selection Problem, CSSP: Find a matrix C \u2208 R^{m\u00d7c} containing c columns of A for which \u2016A \u2212 CC^+A\u2016_F is small.1 In the prior work, one measures the quality of a CSSP-solution against A_k, the best rank-k approximation to A obtained via the singular value decomposition (SVD), where k is a user specified target rank parameter. For example, [15] gives efficient algorithms to find C with c \u2248 2k/\u03b5 columns, for which \u2016A \u2212 CC^+A\u2016_F \u2264 (1 + \u03b5)\u2016A \u2212 A_k\u2016_F.\n\nOur contribution is not to directly attack CSSP. We present a novel algorithm that can improve an existing CSSP algorithm by adaptively invoking it, in a sense actively learning which columns to sample next based on the columns you have already sampled. 
If you use the CSSP-algorithm from [15] as a strawman benchmark, you can obtain c columns all at once and incur an error roughly (1 + 2k/c)\u2016A \u2212 A_k\u2016_F. Or, you can invoke the algorithm to obtain, for example, c/2 columns, and then allow the algorithm to adapt to the columns already chosen (for example by modifying A) before choosing the remaining c/2 columns. We refer to the former as continued sampling and to the latter as adaptive sampling. We prove performance guarantees which show that adaptive sampling improves upon continued sampling, and we present experiments on synthetic and real data that demonstrate significant empirical performance gains.\n\n1CC^+A is the best possible reconstruction of A by projection into the space spanned by the columns of C.\n\n1.1 Notation\nA, B, . . . denote matrices and a, b, . . . denote column vectors; I_n is the n \u00d7 n identity matrix. [A, B] and [A; B] denote matrix concatenation operations in a column-wise and row-wise manner, respectively. Given a set S \u2286 {1, . . . , n}, A_S is the matrix that contains the columns of A \u2208 R^{m\u00d7n} indexed by S. Let rank(A) = \u03c1 \u2264 min{m, n}. The (economy) SVD of A is\n\nA = [U_k, U_{\u03c1\u2212k}] [\u03a3_k, 0; 0, \u03a3_{\u03c1\u2212k}] [V_k^T; V_{\u03c1\u2212k}^T] = \u2211_{i=1}^{\u03c1} \u03c3_i(A) u_i v_i^T,\n\nwhere U_k \u2208 R^{m\u00d7k} and U_{\u03c1\u2212k} \u2208 R^{m\u00d7(\u03c1\u2212k)} contain the left singular vectors u_i, V_k \u2208 R^{n\u00d7k} and V_{\u03c1\u2212k} \u2208 R^{n\u00d7(\u03c1\u2212k)} contain the right singular vectors v_i, and \u03a3 \u2208 R^{\u03c1\u00d7\u03c1} is a diagonal matrix containing the singular values \u03c3_1(A) \u2265 . . . \u2265 \u03c3_\u03c1(A) > 0. The Frobenius norm of A is \u2016A\u2016_F^2 = \u2211_{i,j} A_{ij}^2; Tr(A) is the trace of A; the pseudoinverse of A is A^+ = V\u03a3^{\u22121}U^T; and A_k, the best rank-k approximation to A under any unitarily invariant norm, is A_k = U_k\u03a3_kV_k^T = \u2211_{i=1}^{k} \u03c3_i u_i v_i^T.\n\n1.2 Our Contribution: Adaptive Sampling\n\nWe design a novel CSSP-algorithm that adaptively selects columns from the matrix A in rounds. In each round we remove from A the information that has already been \u201ccaptured\u201d by the columns that have been thus far selected. Algorithm 1 selects tc columns of A in t rounds, where in each round c columns of A are selected using a relative-error CSSP-algorithm from prior work.\n\nInput: A \u2208 R^{m\u00d7n}; target rank k; # rounds t; columns per round c\nOutput: C \u2208 R^{m\u00d7tc}, tc columns of A and S, the indices of those columns.\n1: S = {}; E_0 = A\n2: for \u2113 = 1, . . . , t do\n3:   Sample indices S_\u2113 of c columns from E_{\u2113\u22121} using a CSSP-algorithm.\n4:   S \u2190 S \u222a S_\u2113.\n5:   Set C = A_S and E_\u2113 = A \u2212 (CC^+A)_{\u2113k}.\n6: return C, S\n\nAlgorithm 1: Adaptive Sampling\n\nAt round \u2113 in Step 3, we compute column indices S (and C = A_S) using a CSSP-algorithm on the residual E_{\u2113\u22121} of the previous round. To compute this residual, remove from A the best rank-(\u2113\u22121)k approximation to A in the span of the columns selected from the first \u2113 \u2212 1 rounds,\n\nE_{\u2113\u22121} = A \u2212 (CC^+A)_{(\u2113\u22121)k}.\n\nA similar strategy was developed in [8] with sequential adaptive use of (additive error) CSSP-algorithms. These (additive error) CSSP-algorithms select columns according to column norms [11]. In [8], the residual in step 5 is defined differently, as E_\u2113 = A \u2212 CC^+A. 
To motivate our result, it helps to take a closer look at the reconstruction error E = A \u2212 CC^+A after t adaptive rounds of the strategy in [8] with the CSSP-algorithm in [11].\n\n# rounds | Continued sampling: tc columns using CSSP-algorithm from [11] (\u03b5 = k/c) | Adaptive sampling: t rounds of the strategy in [8] with the CSSP-algorithm from [11]\nt = 2 | \u2016E\u2016_F^2 \u2264 \u2016A \u2212 A_k\u2016_F^2 + (\u03b5/2) \u2016A\u2016_F^2 | \u2016E\u2016_F^2 \u2264 (1 + \u03b5) \u2016A \u2212 A_k\u2016_F^2 + \u03b5^2 \u2016A\u2016_F^2\nt | \u2016E\u2016_F^2 \u2264 \u2016A \u2212 A_k\u2016_F^2 + (\u03b5/t) \u2016A\u2016_F^2 | \u2016E\u2016_F^2 \u2264 (1 + O(\u03b5)) \u2016A \u2212 A_k\u2016_F^2 + \u03b5^t \u2016A\u2016_F^2\n\nTypically \u2016A\u2016_F^2 \u226b \u2016A \u2212 A_k\u2016_F^2 and \u03b5 is small (i.e., c \u226b k), so adaptive sampling \u00e0 la [8] wins over continued sampling for additive error CSSP-algorithms. This is especially apparent after t rounds, where continued sampling only attenuates the big term \u2016A\u2016_F^2 by \u03b5/t, but adaptive sampling exponentially attenuates this term by \u03b5^t.\nRecently, powerful CSSP-algorithms have been developed which give relative-error guarantees [15]. We can use the adaptive strategy from [8] together with these newer relative error CSSP-algorithms. If one carries out the analysis from [8] by replacing the additive error CSSP-algorithm from [11] with the relative error CSSP-algorithm in [15], the comparison of continued and adaptive sampling using the strategy from [8] becomes (t = 2 rounds suffices to see the problem):\n\n# rounds | Continued sampling: tc columns using CSSP-algorithm from [15] (\u03b5 = 2k/c) | Adaptive sampling: t rounds of the strategy in [8] with the CSSP-algorithm from [15]\nt = 2 | \u2016E\u2016_F^2 \u2264 (1 + \u03b5/2) \u2016A \u2212 A_k\u2016_F^2 | \u2016E\u2016_F^2 \u2264 (1 + \u03b5/2 + \u03b5^2/2) \u2016A \u2212 A_k\u2016_F^2\n\nAdaptive sampling from [8] gives a worse theoretical guarantee than continued sampling for relative error CSSP-algorithms. In a nutshell, no matter how many rounds of adaptive sampling you do, the theoretical bound will not be better than (1 + k/c)\u2016A \u2212 A_k\u2016_F^2 if you are using a relative error CSSP-algorithm. This raises an obvious question: is it possible to combine relative-error CSSP-algorithms with adaptive sampling to get (provably and empirically) improved CSSP-algorithms? The approach of [8] does not achieve this objective. We provide a positive answer to this question.\nOur approach is a subtle modification to the approach in [8], in Step 5 of Algorithm 1: when we compute the residual matrix in round \u2113, we subtract (CC^+A)_{\u2113k} from A, the best rank-\u2113k approximation to the projection of A onto the current columns selected, as opposed to subtracting the full projection CC^+A. This subtle change is critical in our new analysis which gives a tighter bound on the final error, allowing us to boost relative-error CSSP-algorithms. For t = 2 rounds of adaptive sampling, we get a reconstruction error of\n\n\u2016E\u2016_F^2 \u2264 (1 + \u03b5)\u2016A \u2212 A_{2k}\u2016_F^2 + \u03b5(1 + \u03b5)\u2016A \u2212 A_k\u2016_F^2,\n\nwhere \u03b5 = 2k/c. The critical improvement in the bound is that the dominant O(1)-term depends on \u2016A \u2212 A_{2k}\u2016_F^2, and the dependence on \u2016A \u2212 A_k\u2016_F^2 is now O(\u03b5). 
To highlight this improved theoretical bound in an extreme case, consider a matrix A that has rank exactly 2k, so that \u2016A \u2212 A_{2k}\u2016_F = 0. Continued sampling gives an error-bound (1 + \u03b5/2)\u2016A \u2212 A_k\u2016_F^2, whereas our adaptive sampling gives an error-bound (\u03b5 + \u03b5^2)\u2016A \u2212 A_k\u2016_F^2, which is clearly better in this extreme case. In practice, data matrices have rapidly decaying singular values, so this extreme case is not far from reality (see Figure 1).\n\nFigure 1: Figure showing the singular value decay for two real world datasets. [Panels: Singular Values of TechTC-300, averaged over 49 datasets; Singular Values of HGDP, averaged over 22 chromosomes.]\n\nTo state our main theoretical result, we need to more formally define a relative error CSSP-algorithm.\nDefinition 1 (Relative Error CSSP-algorithm A(X, k, c)). A relative error CSSP-algorithm A takes as input a matrix X, a rank parameter k < rank(X) and a number of columns c, and outputs column indices S with |S| = c, so that the columns C = X_S satisfy:\n\nE_C[ \u2016X \u2212 (CC^+X)_k\u2016_F^2 ] \u2264 (1 + \u03b5(c, k))\u2016X \u2212 X_k\u2016_F^2,\n\nwhere \u03b5(c, k) depends on the algorithm A and the expectation is over random choices made in the algorithm.2\n\nOur main theorem bounds the reconstruction error when our adaptive sampling approach is used to boost the algorithm A. The boost in performance depends on the decay of the spectrum of the matrix A.\nTheorem 1. Let A \u2208 R^{m\u00d7n} be a matrix of rank \u03c1 and let k < \u03c1 be a target rank. 
If, in Step 3 of Algorithm 1, we use the relative error CSSP-algorithm A with \u03b5(c, k) = \u03b5 < 1, then\n\nE_C[ \u2016A \u2212 (CC^+A)_{tk}\u2016_F^2 ] \u2264 (1 + \u03b5)\u2016A \u2212 A_{tk}\u2016_F^2 + \u03b5 \u2211_{i=1}^{t\u22121} (1 + \u03b5)^{t\u2212i} \u2016A \u2212 A_{ik}\u2016_F^2.\n\nComments.\n1. The dominant O(1) term in our bound is \u2016A \u2212 A_{tk}\u2016_F, not \u2016A \u2212 A_k\u2016_F. This is a major improvement since the former is typically much smaller than the latter in real data. Further, we need a bound on the reconstruction error \u2016A \u2212 CC^+A\u2016_F. Our theorem gives a stronger result than needed because \u2016A \u2212 CC^+A\u2016_F \u2264 \u2016A \u2212 (CC^+A)_{tk}\u2016_F.\n2. We presented our result for the case of a relative error CSSP-algorithm with a guarantee on the expected reconstruction error. Clearly, if the CSSP-algorithm is deterministic, then Theorem 1 will also hold deterministically. The result in Theorem 1 can also be boosted to hold with high probability, by repeating the process log(1/\u03b4) times and picking the columns which performed best. Then, with probability at least 1 \u2212 \u03b4,\n\n\u2016A \u2212 (CC^+A)_{tk}\u2016_F^2 \u2264 (1 + 2\u03b5)\u2016A \u2212 A_{tk}\u2016_F^2 + 2\u03b5 \u2211_{i=1}^{t\u22121} (1 + \u03b5)^{t\u2212i} \u2016A \u2212 A_{ik}\u2016_F^2.\n\nIf the CSSP-algorithm itself only gives a high-probability (at least 1 \u2212 \u03b4) guarantee, then the bound in Theorem 1 also holds with high probability, at least 1 \u2212 t\u03b4, which is obtained by applying a union bound to the probability of failure in each round.\n3. Our results hold for any relative error CSSP-algorithm combined with our adaptive sampling strategy. The relative error CSSP-algorithm in [15] has \u03b5(c, k) \u2248 2k/c. The relative error CSSP-algorithm in [16] has \u03b5(c, k) = O(k log k/c). 
Other algorithms can be found in [8, 9, 17].\nWe presented the simplest form of the result, which can be generalized to sample a different\nnumber of columns in each round, or even use a different CSSP-algorithm in each round. We\nhave not optimized the sampling schedule, i.e. how many columns to sample in each round. At\nthe moment, this is largely dictated by the CSSP algorithm itself, which requires a minimum\nnumber of samples in each round to give a theoretical guarantee. From the empirical perspective\n(for example using leverage score sampling to select columns), strongest performance may be\nobtained by adapting after every column is selected.\n\n4. In the context of the additive error CSSP-algorithm from [11], our adaptive sampling strategy\ngives a theoretical performance guarantee which is at least as good as the adaptive sampling\nstrategy from [8].\n\nLastly, we also provide the \ufb01rst empirical evaluation of adaptive sampling algorithms. We imple-\nmented our algorithm using two relative-error column selection algorithms (the near-optimal column\nselection algorithm of [18, 15] and the leverage-score sampling algorithm of [19]) and compared it\nagainst the adaptive sampling algorithm of [8] on synthetic and real-world data. The experimental\nresults show that our algorithm outperforms prior approaches.\n\n1.3 Related Work\n\nColumn selection algorithms have been extensively studied in prior literature. Such algorithms\ninclude rank-revealing QR factorizations [6, 20] for which only weak performance guarantees can\nbe derived. The QR approach was improved in [21] where the authors proposed a memory ef\ufb01cient\nimplementation. 
The randomized additive error CSSP-algorithm [11] was a breakthrough, which led to a series of improvements producing relative CSSP-algorithms using a variety of randomized and deterministic techniques. These include leverage score sampling [19, 16], volume sampling [8, 9, 17], the two-stage hybrid sampling approach of [22], the near-optimal column selection algorithms of [18, 15], as well as deterministic variants presented in [23]. We refer the reader to Section 1.5 of [15] for a detailed overview of prior work. Our focus is not on CSSP-algorithms per se, but rather on adaptively invoking existing CSSP-algorithms. The only prior adaptive sampling with a provable guarantee was introduced in [8] and further analyzed in [24, 9, 25]; this strategy specifically boosts the additive error CSSP-algorithm, but does not work with the relative error CSSP-algorithms which are currently in use. Our modification of the approach in [8] is delicate, but crucial to the new analysis we perform in the context of relative error CSSP-algorithms.\n\n2For an additive-error CSSP algorithm, E_C[ \u2016X \u2212 (CC^+X)_k\u2016_F^2 ] \u2264 \u2016X \u2212 X_k\u2016_F^2 + \u03b5(c, k)\u2016X\u2016_F^2.\n\nOur work is motivated by relative error CSSP-algorithms satisfying Definition 1. Such algorithms exist which give expected guarantees [15] as well as high probability guarantees [19]. Specifically, given X \u2208 R^{m\u00d7n} and a target rank k, the leverage-score sampling approach of [19] selects c = O((k/\u03b5^2) log(k/\u03b5^2)) columns of X to form a matrix C \u2208 R^{m\u00d7c} to give a (1+\u03b5)-relative error with probability at least 1 \u2212 \u03b4. Similarly, [18, 15] proposed near-optimal relative error CSSP-algorithms selecting c \u2248 2k/\u03b5 columns and giving a (1 + \u03b5)-relative error guarantee in expectation, which can be boosted to a high probability guarantee via independent repetition.\n\n2 Proof of Theorem 1\n\nWe now prove the main result which analyzes the performance of our adaptive sampling in Algorithm 1 for a relative error CSSP-algorithm. We will need the following linear algebraic Lemma.\nLemma 1. Let X, Y \u2208 R^{m\u00d7n} and suppose that rank(Y) = r. Then,\n\n\u03c3_i(X \u2212 Y) \u2265 \u03c3_{r+i}(X).\n\nProof. Observe that \u03c3_i(X \u2212 Y) = \u2016(X \u2212 Y) \u2212 (X \u2212 Y)_{i\u22121}\u2016_2. The claim is now immediate from the Eckart-Young theorem because Y + (X \u2212 Y)_{i\u22121} has rank at most r + i \u2212 1, therefore\n\n\u03c3_i(X \u2212 Y) = \u2016X \u2212 (Y + (X \u2212 Y)_{i\u22121})\u2016_2 \u2265 \u2016X \u2212 X_{r+i\u22121}\u2016_2 = \u03c3_{r+i}(X).\n\nWe are now ready to prove Theorem 1 by induction on t, the number of rounds of adaptive sampling. When t = 1, the claim is that\n\nE[ \u2016A \u2212 (CC^+A)_k\u2016_F^2 ] \u2264 (1 + \u03b5)\u2016A \u2212 A_k\u2016_F^2,\n\nwhich is immediate from the definition of the relative error CSSP-algorithm. Now for the induction. Suppose that after t rounds, columns C^t are selected, and we have the induction hypothesis that\n\nE_{C^t}[ \u2016A \u2212 (C^t (C^t)^+ A)_{tk}\u2016_F^2 ] \u2264 (1 + \u03b5)\u2016A \u2212 A_{tk}\u2016_F^2 + \u03b5 \u2211_{i=1}^{t\u22121} (1 + \u03b5)^{t\u2212i} \u2016A \u2212 A_{ik}\u2016_F^2.   (1)\n\nIn the (t + 1)th round, we use the residual E^t = A \u2212 (C^t (C^t)^+ A)_{tk} to select new columns C'. 
Our relative error CSSP-algorithm A gives the following guarantee:\n\nE_{C'}[ \u2016E^t \u2212 (C'(C')^+ E^t)_k\u2016_F^2 | E^t ] \u2264 (1 + \u03b5) \u2016E^t \u2212 (E^t)_k\u2016_F^2\n= (1 + \u03b5) ( \u2016E^t\u2016_F^2 \u2212 \u2211_{i=1}^{k} \u03c3_i^2(E^t) )\n\u2264 (1 + \u03b5) ( \u2016E^t\u2016_F^2 \u2212 \u2211_{i=1}^{k} \u03c3_{tk+i}^2(A) ).   (2)\n\n(The last step follows because \u03c3_i^2(E^t) = \u03c3_i^2(A \u2212 (C^t (C^t)^+ A)_{tk}) and we can apply Lemma 1 with X = A, Y = (C^t (C^t)^+ A)_{tk} and r = rank(Y) = tk, to obtain \u03c3_i^2(E^t) \u2265 \u03c3_{tk+i}^2(A).) We now take the expectation of both sides with respect to the columns C^t,\n\nE_{C^t}[ E_{C'}[ \u2016E^t \u2212 (C'(C')^+ E^t)_k\u2016_F^2 | E^t ] ] \u2264 (1 + \u03b5) ( E_{C^t}[ \u2016E^t\u2016_F^2 ] \u2212 \u2211_{i=1}^{k} \u03c3_{tk+i}^2(A) )\n(a)\u2264 (1 + \u03b5)^2 \u2016A \u2212 A_{tk}\u2016_F^2 + \u03b5 \u2211_{i=1}^{t\u22121} (1 + \u03b5)^{t+1\u2212i} \u2016A \u2212 A_{ik}\u2016_F^2 \u2212 (1 + \u03b5) \u2211_{i=1}^{k} \u03c3_{tk+i}^2(A)\n= (1 + \u03b5) ( \u2016A \u2212 A_{tk}\u2016_F^2 \u2212 \u2211_{i=1}^{k} \u03c3_{tk+i}^2(A) ) + \u03b5(1 + \u03b5)\u2016A \u2212 A_{tk}\u2016_F^2 + \u03b5 \u2211_{i=1}^{t\u22121} (1 + \u03b5)^{t+1\u2212i} \u2016A \u2212 A_{ik}\u2016_F^2\n= (1 + \u03b5)\u2016A \u2212 A_{(t+1)k}\u2016_F^2 + \u03b5 \u2211_{i=1}^{t} (1 + \u03b5)^{t+1\u2212i} \u2016A \u2212 A_{ik}\u2016_F^2.   (3)\n\n(a) follows, because of the induction hypothesis (eqn. 1). The columns chosen after round t + 1 are C^{t+1} = [C^t, C']. By the law of iterated expectation,\n\nE_{C^t}[ E_{C'}[ \u2016E^t \u2212 (C'(C')^+ E^t)_k\u2016_F^2 | E^t ] ] = E_{C^{t+1}}[ \u2016E^t \u2212 (C'(C')^+ E^t)_k\u2016_F^2 ].\n\nObserve that E^t \u2212 (C'(C')^+ E^t)_k = A \u2212 (C^t (C^t)^+ A)_{tk} \u2212 (C'(C')^+ E^t)_k = A \u2212 Y, where Y is in the column space of C^{t+1} = [C^t, C']; further, rank(Y) \u2264 (t + 1)k. Since (C^{t+1} (C^{t+1})^+ A)_{(t+1)k} is the best rank-(t + 1)k approximation to A in the column space of C^{t+1}, for any realization of C^{t+1},\n\n\u2016A \u2212 (C^{t+1} (C^{t+1})^+ A)_{(t+1)k}\u2016_F^2 \u2264 \u2016E^t \u2212 (C'(C')^+ E^t)_k\u2016_F^2.   (4)\n\nCombining (4) with (3), we have that\n\nE_{C^{t+1}}[ \u2016A \u2212 (C^{t+1} (C^{t+1})^+ A)_{(t+1)k}\u2016_F^2 ] \u2264 (1 + \u03b5)\u2016A \u2212 A_{(t+1)k}\u2016_F^2 + \u03b5 \u2211_{i=1}^{t} (1 + \u03b5)^{t+1\u2212i} \u2016A \u2212 A_{ik}\u2016_F^2.\n\nThis is the desired bound after t + 1 rounds, concluding the induction.\nIt is instructive to understand where our new adaptive sampling strategy is needed for the proof to go through. The crucial step is (2), where we use Lemma 1: it is essential that the residual is a low-rank perturbation of A.\n\n3 Experiments\n\nWe compared three adaptive column sampling methods, using two real and two synthetic data sets.3\n\nAdaptive Sampling Methods\n\nADP-AE: the prior adaptive method which uses the additive error CSSP-algorithm [8].\nADP-LVG: our new adaptive method using the relative error CSSP-algorithm [19].\nADP-Nopt: our adaptive method using the near optimal relative error CSSP-algorithm [15].\n\nData Sets\n\nHGDP 22 chromosomes: SNPs human chromosome data from the HGDP database [26]. 
We use all 22 chromosome matrices (1043 rows; 7,334-37,493 columns) and report the average. Each matrix contains +1, 0, \u22121 entries, and we randomly filled in missing entries.\nTechTC-300: 49 document-term matrices [27] (150-300 rows (documents); 10,000-40,000 columns (words)). We kept 5-letter or larger words and report averages over 49 data-sets.\nSynthetic 1: Random 1000 \u00d7 10000 matrices with \u03c3_i = i^{\u22120.3} (power law).\nSynthetic 2: Random 1000 \u00d7 10000 matrices with \u03c3i = exp(1\u2212i)/10 (exponential).\n\n3ADP-Nopt: has two stages. The first stage is a deterministic dual set spectral-Frobenius column selection in which ties could occur. We break ties in favor of the column not already selected with the maximum norm.\n\nFor randomized algorithms, we repeat the experiments five times and take the average. We use the synthetic data sets to provide a controlled environment in which we can see performance for different types of singular value spectra on very large matrices. In prior work it is common to report on the quality of the columns selected C by comparing the best rank-k approximation within the column-span of C to A_k. Hence, we report the relative error \u2016A \u2212 (CC^+A)_k\u2016_F / \u2016A \u2212 A_k\u2016_F when comparing the algorithms. We set the target rank k = 5 and the number of columns in each round to c = 2k. We have tried several choices for k and c and the results are qualitatively identical so we only report on one choice. Our first set of results in Figure 2 is to compare the prior adaptive algorithm ADP-AE with the new adaptive ones ADP-LVG and ADP-Nopt which boost relative error CSSP-algorithms. Our two new algorithms both perform better than the prior existing adaptive sampling algorithm. 
Further, ADP-Nopt is performing better than ADP-LVG, and this is also not surprising, because ADP-Nopt produces near-optimal columns: if you boost a better CSSP-algorithm, you get better results. Further, by comparing the performance on Synthetic 1 with Synthetic 2, we see that our algorithm (as well as prior algorithms) gains significantly in performance for rapidly decaying singular values; our new theoretical analysis reflects this behavior, whereas prior results do not.\nThe theory bound depends on \u03b5, which is a function of the ratio c/k (\u03b5 \u2248 2k/c for the algorithm in [15]). The figure to the right shows a result for k = 10; c = 2k (k increases but \u03b5 is constant). Comparing the figure with the HGDP plot in Figure 2, we see that the quantitative performance is approximately the same, as the theory predicts (since c/k has not changed). The percentage error stays the same even though we are sampling more columns because the benchmark \u2016A \u2212 A_k\u2016_F also gets smaller when k increases. Since ADP-Nopt is the superior algorithm, we continue with results only for this algorithm.\nOur next experiment is to test which adaptive strategy works better in practice given the same initial selection of columns. That is, in Figure 2, ADP-AE uses an adaptive sampling based on the residual A \u2212 CC^+A and then adaptively samples according to the adaptive strategy in [8]; the initial columns are chosen with the additive error algorithm. Our approach chooses initial columns with the relative error CSSP-algorithm and then continues to sample adaptively based on the relative error CSSP-algorithm and the residual A \u2212 (CC^+A)_{tk}. We now give all the adaptive sampling algorithms the benefit of the near-optimal initial columns chosen in the first round by the algorithm from [15]. 
The result shown to the right confirms that ADP-Nopt is best even if all adaptive strategies start from the same initial near-optimal columns.\nAdaptive versus Continued Sequential Sampling. Our last experiment is to demonstrate that adaptive sampling works better than continued sequential sampling. We consider the relative error CSSP-algorithm in [15] in two modes. The first is ADP-Nopt, which is our adaptive sampling algorithm that selects tc columns in t rounds of c columns each. The second is SEQ-Nopt, which is just the relative error CSSP-algorithm in [15] sampling tc columns, all in one go. The results are shown on the right. The adaptive boosting of the relative error CSSP-algorithm can give up to a 1% improvement in this data set.\n\n[Inline plots of ||A \u2212 (CC+A)k||F / ||A \u2212 Ak||F versus # of rounds: HGDP 22 chromosomes, k = 10, c = 2k (ADP-AE, ADP-LVG, ADP-Nopt); TechTC-300 49 datasets, k = 5, c = 2k (ADP-AE, ADP-LVG, ADP-Nopt); TechTC-300 49 datasets, k = 5, c = 2k (ADP-Nopt, SEQ-Nopt).]\n\nFigure 2: Plots of relative error ratio \u2016A \u2212 (CC^+A)_k\u2016_F / \u2016A \u2212 A_k\u2016_F for various adaptive sampling algorithms for k = 5 and c = 2k. In all cases, performance improves with more rounds of sampling, and rapidly converges to a relative reconstruction error of 1. This is most pronounced in data matrices with singular values that decay quickly (such as TechTC and Synthetic 2). The HGDP singular values decay slowly because missing entries are selected randomly, and Synthetic 1 has slowly decaying power-law singular values by construction.\n\n4 Conclusion\nWe present a new approach for adaptive sampling algorithms which can boost relative error CSSP-algorithms, in particular the near optimal CSSP-algorithm in [15]. 
We showed theoretical and experimental evidence that our new adaptively boosted CSSP-algorithm is better than the prior existing adaptive sampling algorithm which is based on the additive error CSSP-algorithm in [11]. We also showed evidence (theoretical and empirical) that our adaptive sampling algorithms are better than sequentially sampling all the columns at once. In particular, our theoretical bounds give a result which is tighter for matrices whose singular values decay rapidly.\nSeveral interesting questions remain. We showed that the simplest adaptive sampling algorithm which samples a constant number of columns in each round improves upon sequential sampling all at once. What is the optimal sampling schedule, and does it depend on the singular value spectrum of the data matrix? In particular, can improved theoretical bounds or empirical performance be obtained by carefully choosing how many columns to select in each round?\nIt would also be interesting to see the improved adaptive sampling boosting of CSSP-algorithms in the actual applications which require column selection (such as sparse PCA or unsupervised feature selection). How do the improved theoretical estimates we have derived carry over to these problems (theoretically or empirically)? We leave these directions for future work.\nAcknowledgements\nMost of the work was done when SP was a graduate student at RPI. PD was supported by IIS-1447283 and IIS-1319280. MMI was partially supported by the Army Research Laboratory under Cooperative Agreement Number W911NF-09-2-0053 (the ARL Network Science CTA). The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government. The U.S. 
Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.\n\n[Figure 2 panels: HGDP 22 chromosomes; TechTC-300 49 Datasets; Synthetic Data 1; Synthetic Data 2; each with k = 5, c = 2k, plotting ||A \u2212 (CC+A)k||F / ||A \u2212 Ak||F against # of rounds for ADP-AE, ADP-LVG and ADP-Nopt.]\n\nReferences\n[1] Christos Boutsidis, Petros Drineas, and Malik Magdon-Ismail. Near optimal coresets for least-squares regression. IEEE Transactions on Information Theory, 59(10), October 2013.\n\n[2] C. Boutsidis and M. Magdon-Ismail. A note on sparse least-squares regression. Information Processing Letters, 115(5):273\u2013276, 2014.\n\n[3] Christos Boutsidis and Malik Magdon-Ismail. Deterministic feature selection for k-means clustering. IEEE Transactions on Information Theory, 59(9), September 2013.\n\n[4] Christos Boutsidis, Petros Drineas, and Malik Magdon-Ismail. Sparse features for pca-like regression. In Proc. 25th Annual Conference on Neural Information Processing Systems (NIPS), 2011. to appear.\n\n[5] Malik Magdon-Ismail and Christos Boutsidis. Optimal sparse linear auto-encoders and sparse pca. arXiv:1502.06626, 2015.\n\n[6] T. F. Chan and P. C. Hansen. Some applications of the rank revealing QR factorization. SIAM J. Sci. Stat. Comput., 13(3):727\u2013741, 1992.\n\n[7] A. Deshpande and L. Rademacher. Efficient volume sampling for row/column subset selection. In Proceedings of the IEEE 51st FOCS, pages 329\u2013338, 2010.\n\n[8] A. Deshpande and S. Vempala. Adaptive sampling and fast low-rank matrix approximation. 
In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pages 292\u2013303. Springer, 2006.\n\n[9] A. Deshpande, L. Rademacher, S. Vempala, and G. Wang. Matrix approximation and projective clustering via volume sampling. Theory of Computing, 2(1):225\u2013247, 2006.\n\n[10] P. Drineas, I. Kerenidis, and P. Raghavan. Competitive recommendation systems. In Proceedings of the 34th STOC, pages 82\u201390, 2002.\n\n[11] A. Frieze, R. Kannan, and S. Vempala. Fast Monte-Carlo algorithms for finding low-rank approximations. Journal of the ACM (JACM), 51(6):1025\u20131041, 2004.\n\n[12] N. Halko, P. G. Martinsson, and J. A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Rev., 53(2):217\u2013288, May 2011.\n\n[13] E. Liberty, F. Woolfe, P.G. Martinsson, V. Rokhlin, and M. Tygert. Randomized algorithms for the low-rank approximation of matrices. PNAS, 104(51):20167\u201320172, 2007.\n\n[14] Michael W Mahoney and Petros Drineas. CUR matrix decompositions for improved data analysis. PNAS, 106(3):697\u2013702, 2009.\n\n[15] C. Boutsidis, P. Drineas, and M. Magdon-Ismail. Near-optimal column-based matrix reconstruction. SIAM Journal on Computing, 43(2):687\u2013717, 2014.\n\n[16] P. Drineas, M. W Mahoney, and S Muthukrishnan. Subspace sampling and relative-error matrix approximation: Column-based methods. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pages 316\u2013326. Springer, 2006.\n\n[17] Venkatesan Guruswami and Ali Kemal Sinop. Optimal column-based low-rank matrix reconstruction. In Proceedings of the 23rd Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1207\u20131214, 2012.\n\n[18] C. Boutsidis, P. Drineas, and M. Magdon-Ismail. Near optimal column-based matrix reconstruction. 
In IEEE 54th Annual Symposium on FOCS, pages 305\u2013314, 2011.\n\n[19] Petros Drineas, Michael W Mahoney, and S Muthukrishnan. Relative-error CUR matrix decompositions. SIAM Journal on Matrix Analysis and Applications, 30(2):844\u2013881, 2008.\n\n[20] T.F. Chan. Rank revealing QR factorizations. Linear Algebra and its Applications, 88\u201389:67\u201382, 1987.\n\n[21] Crystal Maung and Haim Schweitzer. Pass-efficient unsupervised feature selection. In Advances in Neural Information Processing Systems, pages 1628\u20131636, 2013.\n\n[22] C. Boutsidis, M. W Mahoney, and P. Drineas. An improved approximation algorithm for the column subset selection problem. In Proceedings of the 20th SODA, pages 968\u2013977, 2009.\n\n[23] D. Papailiopoulos, A. Kyrillidis, and C. Boutsidis. Provable deterministic leverage score sampling. In Proc. SIGKDD, pages 997\u20131006, 2014.\n\n[24] A. Deshpande, L. Rademacher, S. Vempala, and G. Wang. Matrix approximation and projective clustering via volume sampling. In Proc. SODA, pages 1117\u20131126, 2006.\n\n[25] P. Drineas and M. W Mahoney. A randomized algorithm for a tensor-based generalization of the singular value decomposition. Linear Algebra and its Applications, 420(2):553\u2013571, 2007.\n\n[26] P. Paschou, J. Lewis, A. Javed, and P. Drineas. Ancestry informative markers for fine-scale individual assignment to worldwide populations. Journal of Medical Genetics, 47(12):835\u201347, 2010.\n\n[27] D. Davidov, E. Gabrilovich, and S. Markovitch. Parameterized generation of labeled datasets for text categorization based on a hierarchical directory. In Proc. SIGIR, pages 250\u2013257, 2004.", "award": [], "sourceid": 279, "authors": [{"given_name": "Saurabh", "family_name": "Paul", "institution": "Paypal Inc"}, {"given_name": "Malik", "family_name": "Magdon-Ismail", "institution": "RPI"}, {"given_name": "Petros", "family_name": "Drineas", "institution": "Rensselaer Polytechnic Institute"}]}