{"title": "Random Projections for $k$-means Clustering", "book": "Advances in Neural Information Processing Systems", "page_first": 298, "page_last": 306, "abstract": "This paper discusses the topic of dimensionality reduction for $k$-means clustering. We prove that any set of $n$ points in $d$ dimensions (rows in a matrix $A \\in \\RR^{n \\times d}$) can be projected into $t = \\Omega(k / \\eps^2)$ dimensions, for any $\\eps \\in (0,1/3)$, in $O(n d \\lceil \\eps^{-2} k/ \\log(d) \\rceil )$ time, such that with constant probability the optimal $k$-partition of the point set is preserved within a factor of $2+\\eps$. The projection is done by post-multiplying $A$ with a $d \\times t$ random matrix $R$ having entries $+1/\\sqrt{t}$ or $-1/\\sqrt{t}$ with equal probability. A numerical implementation of our technique and experiments on a large face images dataset verify the speed and the accuracy of our theoretical results.", "full_text": "Random Projections for k-means Clustering\n\nChristos Boutsidis\n\nDepartment of Computer Science\n\nRPI\n\nAnastasios Zouzias\n\nPetros Drineas\n\nDepartment of Computer Science\n\nDepartment of Computer Science\n\nUniversity of Toronto\n\nRPI\n\nAbstract\n\nThis paper discusses the topic of dimensionality reduction for k-means clustering. We prove that\nany set of n points in d dimensions (rows in a matrix A \u2208 Rn\u00d7d) can be projected into t = \u2126(k/\u03b52)\ndimensions, for any \u03b5 \u2208 (0, 1/3), in O(nd\u2308\u03b5\u22122k/ log(d)\u2309) time, such that with constant probability\nthe optimal k-partition of the point set is preserved within a factor of 2 + \u03b5. The projection is done\nby post-multiplying A with a d \u00d7 t random matrix R having entries +1/\u221at or \u22121/\u221at with equal\n\nprobability. 
A numerical implementation of our technique and experiments on a large face images\ndataset verify the speed and the accuracy of our theoretical results.\n\n1 Introduction\n\nThe k-means clustering algorithm [16] was recently recognized as one of the top ten data mining tools of the last \ufb01fty\nyears [20]. In parallel, random projections (RP) or the so-called Johnson-Lindenstrauss type embeddings [12] became\npopular and found applications in both theoretical computer science [2] and data analytics [4]. This paper focuses on\nthe application of the random projection method (see Section 2.3) to the k-means clustering problem (see De\ufb01nition\n1). Formally, assuming as input a set of n points in d dimensions, our goal is to randomly project the points into \u02dcd\ndimensions, with \u02dcd \u226a d, and then apply a k-means clustering algorithm (see De\ufb01nition 2) on the projected points. Of\ncourse, one should be able to compute the projection fast without distorting signi\ufb01cantly the \u201cclusters\u201d of the original\npoint set. Our algorithm (see Algorithm 1) satis\ufb01es both conditions by computing the embedding in time linear in the\nsize of the input and by distorting the \u201cclusters\u201d of the dataset by a factor of at most 2 + \u03b5, for some \u03b5 \u2208 (0, 1/3) (see\nTheorem 1). 
We believe that the high dimensionality of modern data will render our algorithm useful and attractive in\nmany practical applications [9].\n\nDimensionality reduction encompasses the union of two different approaches: feature selection, which embeds the\npoints into a low-dimensional space by selecting actual dimensions of the data, and feature extraction, which finds an\nembedding by constructing new artificial features that are, for example, linear combinations of the original features.\nLet A be an n × d matrix containing n d-dimensional points (A_(i) denotes the i-th point of the set), and let k be\nthe number of clusters (see also Section 2.2 for more notation). We slightly abuse notation by also denoting by A\nthe n-point set formed by the rows of A. We say that an embedding f : A → R^{d̃} with f(A_(i)) = Ã_(i) for all\ni ∈ [n] and some d̃ < d, preserves the clustering structure of A within a factor φ, for some φ ≥ 1, if finding an\noptimal clustering in Ã and plugging it back to A is only a factor of φ worse than finding the optimal clustering\ndirectly in A. Clustering optimality and approximability are formally presented in Definitions 1 and 2, respectively.\nPrior efforts on designing provably accurate dimensionality reduction methods for k-means clustering include: (i) the\nSingular Value Decomposition (SVD), where one finds an embedding with image Ã = UkΣk ∈ R^{n×k} such that the\nclustering structure is preserved within a factor of two; (ii) random projections, where one projects the input points into\nt = Ω(log(n)/ε^2) dimensions such that with constant probability the clustering structure is preserved within a factor\nof 1 + ε (see Section 2.3); (iii) SVD-based feature selection, where one can use the SVD to find c = Ω(k log(k/ε)/ε^2)\nactual features, i.e. 
an embedding with image Ã ∈ R^{n×c} containing (rescaled) columns from A, such that with constant\nprobability the clustering structure is preserved within a factor of 2 + ε. These results are summarized in Table 1. A\nhead-to-head comparison of our algorithm with existing results allows us to claim the following improvements: (i)\n\nYear | Ref.       | Description              | Dimensions         | Time                         | Accuracy\n1999 | [6]        | SVD - feature extraction | k                  | O(nd min{n, d})              | 2\n-    | Folklore   | RP - feature extraction  | Ω(log(n)/ε^2)      | O(nd⌈ε^{-2} log(n)/log(d)⌉)  | 1 + ε\n2009 | [5]        | SVD - feature selection  | Ω(k log(k/ε)/ε^2)  | O(nd min{n, d})              | 2 + ε\n2010 | This paper | RP - feature extraction  | Ω(k/ε^2)           | O(nd⌈ε^{-2} k/log(d)⌉)       | 2 + ε\n\nTable 1: Dimension reduction methods for k-means. In the RP methods the construction is done with random sign matrices and\nthe mailman algorithm (see Sections 2.3 and 3.1, respectively).\n\nreduce the running time by a factor of min{n, d}⌈ε^2 log(d)/k⌉, while losing only a factor of ε in the approximation\naccuracy and a factor of 1/ε^2 in the dimension of the embedding; (ii) reduce the dimension of the embedding and\nthe running time by a factor of log(n)/k while losing an additive factor of one in the approximation accuracy; (iii) reduce the\ndimension of the embedding by a factor of log(k/ε) and the running time by a factor of min{n, d}⌈ε^2 log(d)/k⌉,\nrespectively. Finally, we should point out that other techniques, for example the Laplacian scores [10] or the Fisher\nscores [7], are very popular in applications (see also surveys on the topic [8, 13]). However, they lack a theoretical\nworst-case analysis of the form we describe in this work.\n\n2 Preliminaries\n\nWe start by formally defining the k-means clustering problem using matrix notation. 
Later in this section, we precisely\ndescribe the approximability framework adopted in the k-means clustering literature and fix the notation.\n\nDefinition 1. [THE K-MEANS CLUSTERING PROBLEM]\nGiven a set of n points in d dimensions (rows in an n × d matrix A) and a positive integer k denoting the number of\nclusters, find the n × k indicator matrix Xopt such that\n\nXopt = arg min_{X ∈ X} ‖A − XX⊤A‖_F^2.   (1)\n\nHere X denotes the set of all n × k indicator matrices X. The functional F(A, X) = ‖A − XX⊤A‖_F^2 is the so-called\nk-means objective function. An n × k indicator matrix has exactly one non-zero element per row, which denotes\ncluster membership. Equivalently, for all i = 1, . . . , n and j = 1, . . . , k, the i-th point belongs to the j-th cluster if\nand only if Xij = 1/√zj, where zj denotes the number of points in the corresponding cluster. Note that X⊤X = Ik,\nwhere Ik is the k × k identity matrix.\n\n2.1 Approximation Algorithms for k-means clustering\n\nFinding Xopt is an NP-hard problem even for k = 2 [3], thus research has focused on developing approximation\nalgorithms for k-means clustering. The following definition captures the framework of such efforts.\n\nDefinition 2. 
[K-MEANS APPROXIMATION ALGORITHM]\nAn algorithm is a “γ-approximation” for the k-means clustering problem (γ ≥ 1) if it takes inputs A and k, and\nreturns an indicator matrix Xγ that satisfies with probability at least 1 − δγ,\n\n‖A − XγXγ⊤A‖_F^2 ≤ γ min_{X ∈ X} ‖A − XX⊤A‖_F^2.   (2)\n\nIn the above, δγ ∈ [0, 1) is the failure probability of the γ-approximation k-means algorithm.\nFor our discussion, we fix the γ-approximation algorithm to be the one presented in [14], which guarantees γ = 1 + ε′\nfor any ε′ ∈ (0, 1] with running time O(2^{(k/ε′)^{O(1)}} dn).\n\n2.2 Notation\n\nGiven an n × d matrix A and an integer k with k < min{n, d}, let Uk ∈ R^{n×k} (resp. Vk ∈ R^{d×k}) be the matrix\nof the top k left (resp. right) singular vectors of A, and let Σk ∈ R^{k×k} be a diagonal matrix containing the top\nk singular values of A in non-increasing order. If we let ρ be the rank of A, then Aρ−k is equal to A − Ak, with\nAk = UkΣkVk⊤. By A_(i) we denote the i-th row of A. For an index i taking values in the set {1, . . . , n} we write\ni ∈ [n]. We denote, in non-increasing order, the non-negative singular values of A by σi(A) with i ∈ [ρ]. ‖A‖_F and\n‖A‖_2 denote the Frobenius and the spectral norm of a matrix A, respectively. A† denotes the pseudo-inverse of A, i.e.\nthe unique d × n matrix satisfying A = AA†A, A†AA† = A†, (AA†)⊤ = AA†, and (A†A)⊤ = A†A. Note also\nthat ‖A†‖_2 = σ1(A†) = 1/σρ(A) and ‖A‖_2 = σ1(A) = 1/σρ(A†). 
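Definition 1 and the notation above can be made concrete with a minimal NumPy sketch (our own illustrative code, not part of the paper; the function names are ours). It builds the n × k indicator matrix X of Definition 1 and evaluates the k-means objective F(A, X) = ‖A − XX⊤A‖_F^2; note that XX⊤A simply replaces each row of A by the mean of its cluster, and that X⊤X = Ik as claimed.

```python
import numpy as np

def indicator_matrix(labels, k):
    """n x k indicator matrix of Definition 1:
    X[i, j] = 1/sqrt(z_j) iff point i belongs to cluster j (z_j = cluster size)."""
    n = len(labels)
    X = np.zeros((n, k))
    for j in range(k):
        members = np.where(labels == j)[0]
        X[members, j] = 1.0 / np.sqrt(len(members))
    return X

def kmeans_objective(A, labels, k):
    """F(A, X) = ||A - X X^T A||_F^2; X X^T A replaces each row by its cluster mean."""
    X = indicator_matrix(labels, k)
    return np.linalg.norm(A - X @ X.T @ A, "fro") ** 2
```

For a clustering that groups each row of A with its nearest neighbors, the objective is exactly the sum of squared distances of the points to their cluster centroids.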
A useful property of matrix norms is that for\nany two matrices C and T of appropriate dimensions, ‖CT‖_F ≤ ‖C‖_F ‖T‖_2; this is a stronger version of the standard\nsubmultiplicativity property. We call P a projector matrix if it is square and P^2 = P. We use E[Y] and Var[Y] to take\nthe expectation and the variance of a random variable Y and P(e) to take the probability of an event e. We abbreviate\n“independent identically distributed” to “i.i.d.” and “with probability” to “w.p.”. Finally, all logarithms are base two.\n\n2.3 Random Projections\n\nA classical result of Johnson and Lindenstrauss states that any n-point set in d dimensions - rows in a matrix A ∈ R^{n×d}\n- can be linearly projected into t = Ω(log(n)/ε^2) dimensions while preserving pairwise distances within a factor of\n1 ± ε using a random orthonormal matrix [12]. Subsequent research simplified the proof of the above result by showing\nthat such a projection can be generated using a d × t random Gaussian matrix R, i.e., a matrix whose entries are i.i.d.\nGaussian random variables with zero mean and variance 1/t [11]. More precisely, the following inequality holds\nwith high probability over the randomness of R,\n\n(1 − ε) ‖A_(i) − A_(j)‖_2 ≤ ‖A_(i)R − A_(j)R‖_2 ≤ (1 + ε) ‖A_(i) − A_(j)‖_2.   (3)\n\nNotice that such an embedding Ã = AR preserves the metric structure of the point set, so it also preserves, within a\nfactor of 1 + ε, the optimal value of the k-means objective function of A. Achlioptas proved that even a (rescaled)\nrandom sign matrix suffices in order to get the same guarantees as above [1], an approach that we adopt here (see step\ntwo in Algorithm 1). 
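The distance-preservation guarantee of Eqn. (3) is easy to probe numerically for the rescaled sign matrices adopted here (a small illustrative experiment of ours; the dimensions, number of points, and random seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, t = 50, 2000, 600          # t in the Omega(log(n)/eps^2) regime
A = rng.standard_normal((n, d))  # n points in d dimensions (rows)

# Rescaled random sign matrix [1]: entries are +1/sqrt(t) or -1/sqrt(t) w.p. 1/2.
R = rng.choice([-1.0, 1.0], size=(d, t)) / np.sqrt(t)
B = A @ R                        # the embedding A~ = AR

# Ratios of projected to original pairwise distances; these concentrate around 1.
ratios = np.array([
    np.linalg.norm(B[i] - B[j]) / np.linalg.norm(A[i] - A[j])
    for i in range(n) for j in range(i + 1, n)
])
```

All n(n−1)/2 ratios should fall well inside 1 ± ε for a modest ε, illustrating why the embedding also preserves the k-means objective up to a 1 + ε factor.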
Moreover, in this paper we will heavily exploit the structure of such a random matrix, and obtain,\nas an added bonus, savings on the computation of the projection.\n\n3 A random-projection-type k-means algorithm\n\nAlgorithm 1 takes as inputs the matrix A ∈ R^{n×d}, the number of clusters k, an error parameter ε ∈ (0, 1/3), and some\nγ-approximation k-means algorithm. It returns an indicator matrix Xγ̃ determining a k-partition of the rows of A.\n\nInput: n × d matrix A (n points, d features), number of clusters k, error parameter ε ∈ (0, 1/3), and\nγ-approximation k-means algorithm.\nOutput: Indicator matrix Xγ̃ determining a k-partition on the rows of A.\n\n1. Set t = Ω(k/ε^2), i.e. set t ≥ ck/ε^2 for a sufficiently large constant c.\n2. Compute a random d × t matrix R as follows. For all i ∈ [d], j ∈ [t],\n\nRij = +1/√t w.p. 1/2, and Rij = −1/√t w.p. 1/2.\n\n3. Compute the product Ã = AR.\n4. Run the γ-approximation algorithm on Ã to obtain Xγ̃; return the indicator matrix Xγ̃.\n\nAlgorithm 1: A random projection algorithm for k-means clustering.\n\n3.1 Running time analysis\n\nAlgorithm 1 reduces the dimensions of A by post-multiplying it with a random sign matrix R. Interestingly, any\n“random projection matrix” R that respects the properties of Lemma 2 with t = Ω(k/ε^2) can be used in this step. If R\nis constructed as in Algorithm 1, one can employ the so-called mailman algorithm for matrix multiplication [15] and\ncompute the product AR in O(nd⌈ε^{-2}k/log(d)⌉) time. Indeed, the mailman algorithm computes (after preprocessing¹)\na matrix-vector product of any d-dimensional vector (row of A) with a d × log(d) sign matrix in O(d) time.\nBy partitioning the columns of our d × t matrix R into ⌈t/log(d)⌉ blocks, the claim follows. 
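The four steps of Algorithm 1 can be sketched as follows (illustrative NumPy code of ours, not the authors' implementation; the constant c and the plugged-in clustering routine are assumptions, and the projection uses plain dense multiplication rather than the mailman algorithm):

```python
import numpy as np

def rp_kmeans(A, k, eps, kmeans_fn, c=4, seed=None):
    """Sketch of Algorithm 1.

    A         : n x d data matrix (rows are points).
    k         : number of clusters.
    eps       : error parameter in (0, 1/3).
    kmeans_fn : any gamma-approximation k-means routine taking (points, k)
                and returning a length-n array of cluster labels.
    c         : stand-in for the unspecified constant in t = Omega(k/eps^2).
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    t = min(d, int(np.ceil(c * k / eps ** 2)))             # step 1
    R = rng.choice([-1.0, 1.0], size=(d, t)) / np.sqrt(t)  # step 2: rescaled signs
    A_tilde = A @ R                                        # step 3 (dense BLAS here; the
                                                           # mailman algorithm is faster asymptotically)
    return kmeans_fn(A_tilde, k)                           # step 4
```

Any clustering routine, e.g. Lloyd's heuristic, can be plugged in as kmeans_fn; it then operates on n points in t = O(k/ε^2) dimensions instead of the original d.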
Notice that when\nk = O(log(d)), we get an almost linear time complexity O(nd/ε^2). The latter assumption is reasonable in our\nsetting since the need for dimension reduction in k-means clustering arises usually in high-dimensional data (large d).\nOther choices of R would give the same approximation results; the time complexity to compute the embedding would\nbe different though. A matrix where each entry is a random Gaussian variable with zero mean and variance 1/t\nwould imply an O(knd/ε^2) time complexity (naive multiplication). In our experiments in Section 5 we experiment\nwith the matrix R described in Algorithm 1 and employ MatLab's matrix-matrix BLAS implementation for the third step of the algorithm. We also experimented with a novel MatLab/C implementation of the mailman\nalgorithm but, in the general case, we were not able to outperform MatLab's built-in routines (see Section 5.2).\n\nFinally, note that any γ-approximation algorithm may be used in the last step of Algorithm 1. Using, for example,\nthe algorithm of [14] with γ = 1 + ε would result in an algorithm that preserves the clustering within a factor of\n2 + ε, for any ε ∈ (0, 1/3), running in time O(nd⌈ε^{-2}k/log(d)⌉ + 2^{(k/ε)^{O(1)}} kn/ε^2). In practice though, the Lloyd\nalgorithm [16, 17] is very popular and although it does not admit a worst-case theoretical analysis, it empirically\ndoes well. We thus employ the Lloyd algorithm in the experimental evaluation of our algorithm in Section 5. Note\nthat, after using the proposed dimensionality reduction method, the cost of the Lloyd heuristic is only O(nk^2/ε^2) per\niteration. This should be compared to the cost of O(knd) per iteration if applied on the original high-dimensional data.\n\n4 Main Theorem\n\nTheorem 1 is our main quality-of-approximation result for Algorithm 1. 
Notice that if γ = 1, i.e. if the k-means\nproblem with inputs Ã and k is solved exactly, Algorithm 1 guarantees a distortion of at most 2 + ε, as advertised.\n\nTheorem 1. Let the n × d matrix A and the positive integer k < min{n, d} be the inputs of the k-means clustering\nproblem. Let ε ∈ (0, 1/3) and assume access to a γ-approximation k-means algorithm. Run Algorithm 1 with inputs\nA, k, ε, and the γ-approximation algorithm in order to construct an indicator matrix Xγ̃. Then with probability at\nleast 0.97 − δγ,\n\n‖A − Xγ̃Xγ̃⊤A‖_F^2 ≤ (1 + (1 + ε)γ) ‖A − XoptXopt⊤A‖_F^2.   (4)\n\nProof of Theorem 1\n\nThe proof of Theorem 1 employs several results from [19], including Lemmas 6 and 8 and Corollary 11. We summarize\nthese results in Lemma 2 below. Before employing Corollary 11, Lemma 6, and Lemma 8 from [19] we need to make\nsure that the matrix R constructed in Algorithm 1 is consistent with Definition 1 and Lemma 5 in [19]. Theorem 1.1\nof [1] immediately shows that the random sign matrix R of Algorithm 1 satisfies Definition 1 and Lemma 5 in [19].\n\nLemma 2. Assume that the matrix R is constructed by using Algorithm 1 with inputs A, k and ε.\n\n1. Singular Values Preservation: For all i ∈ [k] and w.p. at least 0.99,\n\n|1 − σi(Vk⊤R)| ≤ ε.\n\n2. Matrix Multiplication: For any two matrices S ∈ R^{n×d} and T ∈ R^{d×k},\n\nE[‖ST − SRR⊤T‖_F^2] ≤ (2/t) ‖S‖_F^2 ‖T‖_F^2.\n\n3. Moments: For any C ∈ R^{n×d}: E[‖CR‖_F^2] = ‖C‖_F^2 and Var[‖CR‖_F^2] ≤ 2‖C‖_F^4 / t.\n\nThe first statement above assumes that c is sufficiently large (see step 1 of Algorithm 1). 
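The moment bounds in statement 3 of Lemma 2 can be checked with a quick Monte Carlo simulation (our own sanity check, not part of the paper; the dimensions, sample count, and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, t = 20, 100, 400
C = rng.standard_normal((n, d))
fro2 = np.linalg.norm(C, "fro") ** 2           # ||C||_F^2

# Sample Z = ||CR||_F^2 over independent rescaled random sign matrices R.
samples = np.array([
    np.linalg.norm(C @ (rng.choice([-1.0, 1.0], size=(d, t)) / np.sqrt(t)), "fro") ** 2
    for _ in range(300)
])
```

The sample mean of Z should sit on top of ‖C‖_F^2, and the empirical variance should stay below the 2‖C‖_F^4/t bound (in practice, well below it, since the bound is not tight for well-conditioned C).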
We continue with several\nnovel results of general interest.\n\n¹Reading the input d × log(d) sign matrix requires O(d log d) time. However, in our case we only consider multiplication with\na random sign matrix, therefore we can avoid the preprocessing step by directly computing a random correspondence matrix as\ndiscussed in [15, Preprocessing Section].\n\nLemma 3. Under the same assumptions as in Lemma 2 and w.p. at least 0.99,\n\n‖(Vk⊤R)† − (Vk⊤R)⊤‖_2 ≤ 3ε.\n\nProof. Let Φ = Vk⊤R; note that Φ is a k × t matrix and the SVD of Φ is Φ = UΦΣΦVΦ⊤, where UΦ and ΣΦ are k × k\nmatrices, and VΦ is a t × k matrix. By taking the SVD of (Vk⊤R)† and (Vk⊤R)⊤ we get\n\n‖(Vk⊤R)† − (Vk⊤R)⊤‖_2 = ‖VΦΣΦ^{-1}UΦ⊤ − VΦΣΦUΦ⊤‖_2 = ‖VΦ(ΣΦ^{-1} − ΣΦ)UΦ⊤‖_2 = ‖ΣΦ^{-1} − ΣΦ‖_2,\n\nsince VΦ and UΦ⊤ can be dropped without changing any unitarily invariant norm. Let Ψ = ΣΦ^{-1} − ΣΦ; Ψ is a k × k\ndiagonal matrix. 
Assuming that, for all i ∈ [k], σi(Φ) and τi(Ψ) denote the i-th largest singular value of Φ and the\ni-th diagonal element of Ψ, respectively, it is\n\nτi(Ψ) = (1 − σi(Φ)σ_{k+1−i}(Φ)) / σ_{k+1−i}(Φ).   (5)\n\nSince Ψ is a diagonal matrix,\n\n‖Ψ‖_2 = max_{1≤i≤k} |τi(Ψ)| = max_{1≤i≤k} |1 − σi(Φ)σ_{k+1−i}(Φ)| / σ_{k+1−i}(Φ).   (6)\n\nThe first statement of Lemma 2, our choice of ε ∈ (0, 1/3) and elementary calculations suffice to conclude the\nproof.\n\nLemma 4. Under the same assumptions as in Lemma 2 and for any n × d matrix C, w.p. at least 0.99,\n\n‖CR‖_F ≤ √(1 + ε) ‖C‖_F.\n\nProof. Notice that there exists a sufficiently large constant c such that t ≥ ck/ε^2. Then, setting Z = ‖CR‖_F^2, using\nthe third statement of Lemma 2, the fact that k ≥ 1, and Chebyshev's inequality we get\n\nP(|Z − E[Z]| ≥ ε ‖C‖_F^2) ≤ Var[Z] / (ε^2 ‖C‖_F^4) ≤ 2 ‖C‖_F^4 / (t ε^2 ‖C‖_F^4) ≤ 2/(ck) ≤ 0.01.\n\nThe last inequality follows assuming c sufficiently large. Finally, taking square roots on both sides concludes the\nproof.\n\nLemma 5. Under the same assumptions as in Lemma 2 and w.p. at least 0.97,\n\nAk = (AR)(Vk⊤R)† Vk⊤ + E,   (7)\n\nwhere E is an n × d matrix with ‖E‖_F ≤ 4ε ‖A − Ak‖_F.\n\nProof. Since (AR)(Vk⊤R)† Vk⊤ is an n × d matrix, let us write E = Ak − (AR)(Vk⊤R)† Vk⊤. 
Then, setting A =\nAk + Aρ−k, and using the triangle inequality, we get\n\n‖E‖_F ≤ ‖Ak − AkR(Vk⊤R)† Vk⊤‖_F + ‖Aρ−kR(Vk⊤R)† Vk⊤‖_F.\n\nThe first statement of Lemma 2 implies that rank(Vk⊤R) = k, thus (Vk⊤R)(Vk⊤R)† = Ik, where Ik is the k × k identity\nmatrix. Replacing Ak = UkΣkVk⊤ we get that\n\n‖Ak − AkR(Vk⊤R)† Vk⊤‖_F = ‖Ak − UkΣkVk⊤R(Vk⊤R)† Vk⊤‖_F = ‖Ak − UkΣkVk⊤‖_F = 0.\n\nTo bound the second term above, we drop Vk⊤, add and subtract the matrix Aρ−kR(Vk⊤R)⊤, and use the triangle\ninequality and submultiplicativity:\n\n‖Aρ−kR(Vk⊤R)† Vk⊤‖_F ≤ ‖Aρ−kR(Vk⊤R)⊤‖_F + ‖Aρ−kR((Vk⊤R)† − (Vk⊤R)⊤)‖_F\n≤ ‖Aρ−kRR⊤Vk‖_F + ‖Aρ−kR‖_F ‖(Vk⊤R)† − (Vk⊤R)⊤‖_2.\n\nNow we will bound each term individually. A crucial observation for bounding the first term is that Aρ−kVk =\nUρ−kΣρ−kVρ−k⊤Vk = 0 by orthogonality of the columns of Vk and Vρ−k. This term can now be bounded using the\nsecond statement of Lemma 2 with S = Aρ−k and T = Vk. 
This statement, assuming c sufficiently large, and an\napplication of Markov's inequality on the random variable ‖Aρ−kRR⊤Vk − Aρ−kVk‖_F give that w.p. at least 0.99,\n\n‖Aρ−kRR⊤Vk‖_F ≤ 0.5ε ‖Aρ−k‖_F.   (8)\n\nThe remaining two terms can be bounded using Lemma 3 and Lemma 4 with C = Aρ−k. Hence, by applying a union bound\non Lemma 3, Lemma 4 and Inq. (8), we get that w.p. at least 0.97,\n\n‖E‖_F ≤ ‖Aρ−kRR⊤Vk‖_F + ‖Aρ−kR‖_F ‖(Vk⊤R)† − (Vk⊤R)⊤‖_2\n≤ 0.5ε ‖Aρ−k‖_F + √(1 + ε) ‖Aρ−k‖_F · 3ε\n≤ 0.5ε ‖Aρ−k‖_F + 3.5ε ‖Aρ−k‖_F\n= 4ε ‖Aρ−k‖_F.\n\nThe last inequality holds thanks to our choice of ε ∈ (0, 1/3).\n\nProposition 6. A well-known property connects the SVD of a matrix and k-means clustering. Recall Definition 1, and\nnotice that XoptXopt⊤A is a matrix of rank at most k. From the SVD optimality we immediately get that\n\n‖Aρ−k‖_F^2 = ‖A − Ak‖_F^2 ≤ ‖A − XoptXopt⊤A‖_F^2.   (9)\n\n4.1 The proof of Eqn. (4) of Theorem 1\n\nWe start by manipulating the term ‖A − Xγ̃Xγ̃⊤A‖_F^2 in Eqn. (4). Replacing A by Ak + Aρ−k, and using the\nPythagorean theorem (the subspaces spanned by the components Ak − Xγ̃Xγ̃⊤Ak and Aρ−k − Xγ̃Xγ̃⊤Aρ−k are\nperpendicular) we get\n\n‖A − Xγ̃Xγ̃⊤A‖_F^2 = ‖(I − Xγ̃Xγ̃⊤)Ak‖_F^2 + ‖(I − Xγ̃Xγ̃⊤)Aρ−k‖_F^2 = θ1^2 + θ2^2.   (10)\n\nWe first bound the second term of Eqn. (10). Since I − Xγ̃Xγ̃⊤ is a projector matrix, it can be dropped without\nincreasing a unitarily invariant norm. Now Proposition 6 implies that\n\nθ2^2 ≤ ‖Aρ−k‖_F^2 ≤ ‖A − XoptXopt⊤A‖_F^2.\n\nWe now bound the first term of Eqn. (10):\n\nθ1 ≤ ‖(I − Xγ̃Xγ̃⊤)AR(Vk⊤R)† Vk⊤‖_F + ‖E‖_F   (12)\n≤ ‖(I − Xγ̃Xγ̃⊤)AR‖_F ‖(Vk⊤R)†‖_2 + ‖E‖_F   (13)\n≤ √γ ‖(I − XoptXopt⊤)AR‖_F ‖(Vk⊤R)†‖_2 + ‖E‖_F   (14)\n≤ √γ √(1 + ε) ‖(I − XoptXopt⊤)A‖_F · (1/(1 − ε)) + 4ε ‖(I − XoptXopt⊤)A‖_F   (15)\n≤ √γ (1 + 2.5ε) ‖(I − XoptXopt⊤)A‖_F + √γ 4ε ‖(I − XoptXopt⊤)A‖_F   (16)\n≤ √γ (1 + 6.5ε) ‖(I − XoptXopt⊤)A‖_F.   (17)\n\nIn Eqn. (12) we used Lemma 5, the triangle inequality, and the fact that I − Xγ̃Xγ̃⊤ is a projector matrix and can be\ndropped without increasing a unitarily invariant norm. In Eqn. (13) we used submultiplicativity (see Section 2.2) and\nthe fact that Vk⊤ can be dropped without changing the spectral norm. In Eqn. 
(14) we replaced Xγ̃ by Xopt and the\nfactor √γ appeared in the first term. To better understand this step, notice that Xγ̃ gives a γ-approximation to the\noptimal k-means clustering of the matrix AR, and any other n × k indicator matrix (for example, the matrix Xopt)\nsatisfies\n\n‖(I − Xγ̃Xγ̃⊤)AR‖_F^2 ≤ γ min_{X ∈ X} ‖(I − XX⊤)AR‖_F^2 ≤ γ ‖(I − XoptXopt⊤)AR‖_F^2.   (11)\n\nFigure 1: The results of our experiments after running Algorithm 1 with k = 40 on the face images collection. (Three panels, each plotted against the number of dimensions t: the normalized objective function value F, the mis-classification rate P, and the time T of the k-means procedure in seconds.)\n\nIn Eqn. (15) we used Lemma 4 with C = (I − XoptXopt⊤)A, Lemma 3 and Proposition 6. In Eqn. (16) we used the\nfact that γ ≥ 1 and that for any ε ∈ (0, 1/3) it is √(1 + ε)/(1 − ε) ≤ 1 + 2.5ε. Taking squares in Eqn. 
(17) we get\n\nθ1^2 ≤ γ (1 + 28ε) ‖(I − XoptXopt⊤)A‖_F^2.\n\nFinally, rescaling ε accordingly and applying the union bound on Lemma 5 and Definition 2 concludes the proof.\n\n5 Experiments\n\nThis section describes an empirical evaluation of Algorithm 1 on a face images collection. We implemented our\nalgorithm in MatLab and compared it against other prominent dimensionality reduction techniques such as the Local\nLinear Embedding (LLE) algorithm and the Laplacian scores for feature selection. We ran all the experiments on a\nMac machine with a dual core 2.26 GHz processor and 4 GB of RAM. Our empirical findings are very promising,\nindicating that our algorithm and implementation could be very useful in real applications involving clustering of\nlarge-scale data.\n\n5.1 An application of Algorithm 1 on a face images collection\n\nWe experiment with a face images collection. We downloaded the images corresponding to the ORL database from\n[21]. This collection contains 400 face images of dimensions 64 × 64 corresponding to 40 different people. These\nimages form 40 groups, each one containing exactly 10 different images of the same person. After vectorizing each\n2-D image and putting it as a row vector in an appropriate matrix, one can construct a 400 × 4096 image-by-pixel\nmatrix A. In this matrix, objects are the face images of the ORL collection while features are the pixel values of the\nimages. To apply Lloyd's heuristic on A, we employ MatLab's function kmeans with the parameter determining\nthe maximum number of repetitions set to 30. We also chose a deterministic initialization of Lloyd's iterative\nE-M procedure, i.e. 
whenever we call kmeans with inputs a matrix \u02dcA \u2208 R400\u00d7 \u02dcd, with \u02dcd \u2265 1, and the integer k = 40,\nwe initialize the cluster centers with the 1-st, 11-th,..., 391-th rows of \u02dcA, respectively. Note that this initialization\ncorresponds to picking images from the forty different groups of the available collection, since the images of every\ngroup are stored sequentially in A. We evaluate the clustering outcome from two different perspectives. First, we\nmeasure and report the objective function F of the k-means clustering problem. In particular, we report a normalized\nversion of F , i.e. \u02dcF = F/||A||2\nF . Second, we report the mis-classi\ufb01cation accuracy of the clustering result. We\ndenote this number by P (0 \u2264 P \u2264 1), where P = 0.9, for example, implies that 90% of the objects were assigned\nto the correct cluster after the application of the clustering algorithm. In the sequel, we \ufb01rst perform experiments by\nrunning Algorithm 1 with everything \ufb01xed but t, which denotes the dimensionality of the projected data. Then, for\nfour representative values of t, we compare Algorithm 1 with three other dimensionality reduction methods as well\nwith the approach of running the Lloyd\u2019s heuristic on the original high dimensional data.\n\nWe run Algorithm 1 with t = 5, 10, ..., 300 and k = 40 on the matrix A described above. Figure 1 depicts the results\nof our experiments. A few interesting observations are immediate. First, the normalized objective function \u02dcF is a\npiece-wise non-increasing function of the number of dimensions t. 
       t = 10           t = 20           t = 50           t = 100
       P       F̃       P       F̃       P       F̃       P       F̃
SVD  0.5900  0.0262   0.6750  0.0268   0.7650  0.0269   0.6500  0.0324
LLE  0.6500  0.0245   0.7125  0.0247   0.7725  0.0258   0.6150  0.0337
LS   0.3400  0.0380   0.3875  0.0362   0.4575  0.0319   0.4850  0.0278
HD   0.6255  0.0220   0.6255  0.0220   0.6255  0.0220   0.6255  0.0220
RP   0.4225  0.0283   0.4800  0.0255   0.6425  0.0234   0.6575  0.0219

Table 2: Numerics from our experiments with five different methods.

The decrease in F̃ is large for the first few choices of t; after that, increasing the number of dimensions t of the projected data decreases F̃ by a smaller value. The increase of t seems to become irrelevant after around t = 90 dimensions. Second, the classification accuracy P is a piece-wise non-decreasing function of t. The increase of t again seems to become irrelevant after around t = 90 dimensions. Another interesting observation from these two plots is that the classification accuracy is not directly tied to the objective function F; notice, for example, that the two behave differently from t = 20 to t = 25 dimensions. Finally, we report the running time T of the algorithm, which includes only the clustering step. Notice that the increase in the running time is almost linear in t. The non-linearities in the plot are due to the fact that the number of iterations necessary to guarantee convergence of Lloyd's method differs for different values of t. This observation indicates that small values of t result in significant computational savings, especially when n is large. Compare, for example, the one second of running time needed to solve the k-means problem when t = 275 against the 10 seconds necessary to solve the problem on the high-dimensional data.
To our benefit, in this case, the multiplication AR takes only 0.1 seconds, resulting in a total running time of 1.1 seconds, which corresponds to an almost 90% speedup of the overall procedure.

We now compare our algorithm against other dimensionality reduction techniques. In particular, in this paragraph we present head-to-head comparisons of the following five methods: (i) SVD: the Singular Value Decomposition (or Principal Components Analysis) dimensionality reduction approach - we use MatLab's svds function; (ii) LLE: the famous Local Linear Embedding algorithm of [18] - we use the MatLab code from [23] with the parameter K determining the number of neighbors set equal to 40; (iii) LS: the Laplacian score feature selection method of [10] - we use the MatLab code from [22] with the default parameters2; (iv) HD: we run the k-means algorithm on the High Dimensional data; and (v) RP: the random projection method we propose in this work - we use our own MatLab implementation. The results of our experiments on A, with k = 40 and t = 10, 20, 50, 100, are shown in Table 2. In terms of computational complexity, for t = 50, for example, the times (in seconds) needed by all five methods (only the dimension reduction step) are T_SVD = 5.9, T_LLE = 4.4, T_LS = 0.32, T_HD = 0, and T_RP = 0.03. Notice that our algorithm is much faster than the other approaches while achieving worse (t = 10, 20), slightly worse (t = 50), or slightly better (t = 100) approximation accuracy.

5.2 A note on the mailman algorithm for matrix-matrix and matrix-vector multiplication

In this section, we compare three different implementations of the third step of Algorithm 1. As we already discussed in Section 3.1, the mailman algorithm is asymptotically faster than naively multiplying the two matrices A and R. Here we want to understand whether this asymptotic behavior of the mailman algorithm is indeed achieved in a practical implementation.
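Before turning to the timings, it may help to make concrete what computing the product AR entails. One simple, memory-light alternative to forming R explicitly is to generate its sign columns on the fly and accumulate one column of the product at a time. The NumPy sketch below is a hypothetical illustration of this folklore idea (it is not the mailman algorithm, and the function name is our own):

```python
import numpy as np

def project_on_the_fly(A, t, seed=0):
    """Compute the sketch A @ R without materializing the d x t sign
    matrix R: draw one random sign column at a time, compute the
    corresponding column of the product, and discard the column."""
    n, d = A.shape
    out = np.empty((n, t))
    rng = np.random.default_rng(seed)
    for j in range(t):
        r_col = rng.choice([-1.0, 1.0], size=d)  # j-th column of sqrt(t)*R
        out[:, j] = A @ r_col
    return out / np.sqrt(t)
```

This keeps the extra memory at O(d) instead of O(dt), at the cost of performing t separate matrix-vector products rather than one matrix-matrix product.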
We compare three different approaches for the implementation of the third step of our algorithm: the first is MatLab's function times(A, R) (MM1); the second exploits the fact that we do not need to explicitly store the whole matrix R, so the computation can be performed on the fly (column-by-column) (MM2); the last is the mailman algorithm [15] (see Section 3.1 for more details). We implemented the last two algorithms in C using MatLab's MEX technology. We observed that when A is a vector (n = 1), the mailman algorithm is indeed faster than (MM1) and (MM2), as is also observed in the numerical experiments of [15]. Moreover, it is worth noting that (MM2) is also superior to (MM1). On the other hand, our best implementation of the mailman algorithm for matrix-matrix operations is inferior to both (MM1) and (MM2) for any 10 ≤ n ≤ 10,000. Based on these findings, we chose to use (MM1) for our experimental evaluations.

Acknowledgments: Christos Boutsidis was supported by NSF CCF 0916415 and a Gerondelis Foundation Fellowship; Petros Drineas was partially supported by an NSF CAREER Award and NSF CCF 0916415.

2In particular, we run W = constructW(A); Scores = LaplacianScore(A, W);

References

[1] D. Achlioptas. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4):671–687, 2003.

[2] N. Ailon and B. Chazelle. Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform. In ACM Symposium on Theory of Computing (STOC), pages 557–563, 2006.

[3] D. Aloise, A. Deshpande, P. Hansen, and P. Popat. NP-hardness of Euclidean sum-of-squares clustering. Machine Learning, 75(2):245–248, 2009.

[4] E. Bingham and H. Mannila. Random projection in dimensionality reduction: applications to image and text data.
In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 245–250, 2001.

[5] C. Boutsidis, M. W. Mahoney, and P. Drineas. Unsupervised feature selection for the k-means clustering problem. In Advances in Neural Information Processing Systems (NIPS), 2009.

[6] P. Drineas, A. Frieze, R. Kannan, S. Vempala, and V. Vinay. Clustering in large graphs and matrices. In ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 291–299, 1999.

[7] D. Foley and J. Sammon. An optimal set of discriminant vectors. IEEE Transactions on Computers, C-24(3):281–289, March 1975.

[8] I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157–1182, 2003.

[9] I. Guyon, S. Gunn, A. Ben-Hur, and G. Dror. Result analysis of the NIPS 2003 feature selection challenge. In Advances in Neural Information Processing Systems (NIPS), pages 545–552, 2005.

[10] X. He, D. Cai, and P. Niyogi. Laplacian score for feature selection. In Advances in Neural Information Processing Systems (NIPS), pages 507–514, 2006.

[11] P. Indyk and R. Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In ACM Symposium on Theory of Computing (STOC), pages 604–613, 1998.

[12] W. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26:189–206, 1984.

[13] E. Kokiopoulou, J. Chen, and Y. Saad. Trace optimization and eigenproblems in dimension reduction methods. Numerical Linear Algebra with Applications, to appear.

[14] A. Kumar, Y. Sabharwal, and S. Sen. A simple linear time (1+ε)-approximation algorithm for k-means clustering in any dimensions. In IEEE Symposium on Foundations of Computer Science (FOCS), pages 454–462, 2004.

[15] E. Liberty and S. Zucker. The Mailman algorithm: A note on matrix-vector multiplication.
Information Processing Letters, 109(3):179–182, 2009.

[16] S. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.

[17] R. Ostrovsky, Y. Rabani, L. J. Schulman, and C. Swamy. The effectiveness of Lloyd-type methods for the k-means problem. In IEEE Symposium on Foundations of Computer Science (FOCS), pages 165–176, 2006.

[18] S. Roweis and L. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.

[19] T. Sarlos. Improved approximation algorithms for large matrices via random projections. In IEEE Symposium on Foundations of Computer Science (FOCS), pages 329–337, 2006.

[20] X. Wu et al. Top 10 algorithms in data mining. Knowledge and Information Systems, 14(1):1–37, 2008.

[21] http://www.cs.uiuc.edu/~dengcai2/Data/FaceData.html

[22] http://www.cs.uiuc.edu/~dengcai2/Data/data.html

[23] http://www.cs.nyu.edu/~roweis/lle/