{"title": "Subspace Embeddings for the Polynomial Kernel", "book": "Advances in Neural Information Processing Systems", "page_first": 2258, "page_last": 2266, "abstract": "Sketching is a powerful dimensionality reduction tool for accelerating statistical learning algorithms. However, its applicability has been limited to a certain extent since the crucial ingredient, the so-called oblivious subspace embedding, can only be applied to data spaces with an explicit representation as the column span or row span of a matrix, while in many settings learning is done in a high-dimensional space implicitly defined by the data matrix via a kernel transformation. We propose the first {\\em fast} oblivious subspace embeddings that are able to embed a space induced by a non-linear kernel {\\em without} explicitly mapping the data to the high-dimensional space. In particular, we propose an embedding for mappings induced by the polynomial kernel. Using the subspace embeddings, we obtain the fastest known algorithms for computing an implicit low rank approximation of the higher-dimension mapping of the data matrix, and for computing an approximate kernel PCA of the data, as well as doing approximate kernel principal component regression.", "full_text": "Subspace Embeddings for the Polynomial Kernel\n\nHaim Avron\n\nIBM T.J. Watson Research Center\n\nYorktown Heights, NY 10598\n\nhaimav@us.ibm.com\n\nHuy L. Nguy\u02dc\u02c6en\n\nSimons Institute, UC Berkeley\n\nBerkeley, CA 94720\n\nhlnguyen@cs.princeton.edu\n\nDavid P. Woodruff\n\nIBM Almaden Research Center\n\nSan Jose, CA 95120\n\ndpwoodru@us.ibm.com\n\nAbstract\n\nSketching is a powerful dimensionality reduction tool for accelerating statistical\nlearning algorithms. 
However, its applicability has been limited to a certain extent since the crucial ingredient, the so-called oblivious subspace embedding, can only be applied to data spaces with an explicit representation as the column span or row span of a matrix, while in many settings learning is done in a high-dimensional space implicitly defined by the data matrix via a kernel transformation. We propose the first fast oblivious subspace embeddings that are able to embed a space induced by a non-linear kernel without explicitly mapping the data to the high-dimensional space. In particular, we propose an embedding for mappings induced by the polynomial kernel. Using the subspace embeddings, we obtain the fastest known algorithms for computing an implicit low rank approximation of the higher-dimension mapping of the data matrix, and for computing an approximate kernel PCA of the data, as well as doing approximate kernel principal component regression.

1 Introduction

Sketching has emerged as a powerful dimensionality reduction technique for accelerating statistical learning techniques such as ℓp-regression, low rank approximation, and principal component analysis (PCA) [12, 5, 14]. For natural settings of parameters, this technique has led to the first asymptotically optimal algorithms for a number of these problems, often providing considerable speedups over exact algorithms. Behind many of these remarkable algorithms is a mathematical apparatus known as an oblivious subspace embedding (OSE). An OSE is a data-independent random transform which is, with high probability, an approximate isometry over the embedded subspace, i.e. ‖Sx‖ = (1 ± ε)‖x‖ simultaneously for all x ∈ V, where S is the OSE, V is the embedded subspace, and ‖·‖ is some norm of interest.
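To make the guarantee concrete, here is a minimal NumPy sketch (an illustration under our own choice of sizes, not code from the paper) that draws an explicit COUNTSKETCH matrix, a standard OSE discussed in Section 2, and checks the approximate isometry on vectors drawn from one fixed low-dimensional subspace:

```python
import numpy as np

rng = np.random.default_rng(0)

def countsketch_matrix(d, m, rng):
    # Explicit m x d COUNTSKETCH: column j has a single entry s(j) in row h(j).
    S = np.zeros((m, d))
    h = rng.integers(0, m, size=d)           # bucket h(j) of each coordinate
    s = rng.choice([-1.0, 1.0], size=d)      # random sign s(j) of each coordinate
    S[h, np.arange(d)] = s
    return S

d, m, k = 5000, 2000, 5
S = countsketch_matrix(d, m, rng)

# An arbitrary k-dimensional subspace V of R^d, and some vectors x in V.
Q, _ = np.linalg.qr(rng.standard_normal((d, k)))
X = Q @ rng.standard_normal((k, 50))

# ||Sx|| / ||x|| should be close to 1 simultaneously for all x in V.
ratios = np.linalg.norm(S @ X, axis=0) / np.linalg.norm(X, axis=0)
print(ratios.min(), ratios.max())
```

With these (arbitrary) sizes the printed ratios typically stay within roughly ten percent of 1; shrinking m visibly degrades the distortion, in line with the m = Ω(k²/(ε²δ))-type bounds proved below.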
For the OSE to be useful in applications, it is crucial that applying it to a vector or a collection of vectors (a matrix) can be done faster than the intended downstream use.

So far, all OSEs proposed in the literature are for embedding subspaces that have a representation as the column space or row space of an explicitly provided matrix, or close variants of it that admit a fast multiplication given an explicit representation (e.g. [1]). This is quite unsatisfactory in many statistical learning settings. In many cases the input may be described by a moderately sized n-by-d sample-by-feature matrix A, but the actual learning is done in a much higher (possibly infinite) dimensional space, by mapping each row of A to a high-dimensional feature space. Using the kernel trick one can access the high-dimensional mapped data points through an inner product space, and thus avoid computing the mapping explicitly. This enables learning in the high-dimensional space even if explicitly computing the mapping (if at all possible) is prohibitive. In such a setting, computing the explicit mapping just to compute an OSE is usually unreasonable, if not impossible (e.g., if the feature space is infinite-dimensional).

The main motivation for this paper is the following question: is it possible to design OSEs that operate on the high-dimensional space without explicitly mapping the data to that space?

We propose the first fast oblivious subspace embeddings for spaces induced by a non-linear kernel without explicitly mapping the data to the high-dimensional space. In particular, we propose an OSE for mappings induced by the polynomial kernel. We then show that the OSE can be used to obtain faster algorithms for the polynomial kernel. Namely, we obtain faster algorithms for approximate kernel PCA and principal component regression.

We now elaborate on these contributions.

Subspace Embedding for Polynomial Kernel Maps. Let k(x, y) = (⟨x, y⟩ + c)^q for some constant c ≥ 0 and positive integer q. This is the degree q polynomial kernel function. Without loss of generality we assume that c = 0, since a non-zero c can be handled by adding a coordinate of value √c to all of the data points. Let φ(x) denote the function that maps a d-dimensional vector x to the d^q-dimensional vector formed by taking the product of all subsets of q coordinates of x, i.e. φ(v) = v ⊗ ··· ⊗ v (taking the tensor product q times), and let φ(A) denote the application of φ to the rows of A. φ is the map that corresponds to the polynomial kernel, that is, k(x, y) = ⟨φ(x), φ(y)⟩, so learning with the data matrix A and the polynomial kernel corresponds to using φ(A) instead of A in a method that uses linear modeling.

We describe a distribution over d^q × O(3^q n²/ε²) sketching matrices S so that the mapping φ(A)·S can be computed in O(nnz(A)q) + poly(3^q n/ε) time, where nnz(A) denotes the number of non-zero entries of A. We show that with constant probability arbitrarily close to 1, simultaneously for all n-dimensional vectors z, ‖z·φ(A)·S‖₂ = (1 ± ε)‖z·φ(A)‖₂, that is, the entire row-space of φ(A) is approximately preserved. Additionally, the distribution does not depend on A, so it defines an OSE.

It is important to note that while the literature has proposed transformations for non-linear kernels that generate an approximate isometry (e.g. Kernel PCA), or methods that are data independent (like the Random Fourier Features [17]), no method previously had both properties, and thus they do not constitute an OSE. Both properties are crucial for the algorithmic applications we propose (which we discuss next).

Applications: Approximate Kernel PCA, PCR.
We say an n \u00d7 k matrix V with orthonormal\ncolumns spans a rank-k (1 + \u0001)-approximation of an n \u00d7 d matrix A if (cid:107)A \u2212 V V T A(cid:107)F \u2264 (1 +\n\u0001)(cid:107)A\u2212 Ak(cid:107)F , where (cid:107)A(cid:107)F is the Frobenius norm of A and Ak = arg minX of rank k (cid:107)A\u2212 X(cid:107)F . We\nstate our results for constant q.\nIn O(nnz(A))+n\u00b7poly(k/\u0001) time an n\u00d7k matrix V with orthonormal columns can be computed, for\nwhich (cid:107)\u03c6(A)\u2212 V V T \u03c6(A)(cid:107)F \u2264 (1 + \u0001)(cid:107)\u03c6(A)\u2212 [\u03c6(A)]k(cid:107)F , where [\u03c6(A)]k denotes the best rank-k\napproximation to \u03c6(A). The k-dimensional subspace V of Rn can be thought of as an approximation\nto the top k left singular vectors of \u03c6(A). The only alternative algorithm we are aware of, which\ndoesn\u2019t take time at least dq, would be to \ufb01rst compute the Gram matrix \u03c6(A) \u00b7 \u03c6(A)T in O(n2d)\ntime, and then compute a low rank approximation, which, while this computation can also exploit\nsparsity in A, is much slower since the Gram matrix is often dense and requires \u2126(n2) time just to\nwrite down.\nGiven V , we show how to obtain a low rank approximation to \u03c6(A). Our algorithm computes three\nmatrices V, U, and R, for which (cid:107)\u03c6(A) \u2212 V \u00b7 U \u00b7 \u03c6(R)(cid:107)F \u2264 (1 + \u0001)(cid:107)\u03c6(A) \u2212 [\u03c6(A)]k(cid:107)F . This\nrepresentation is useful, since given a point y \u2208 Rd, we can compute \u03c6(R) \u00b7 \u03c6(y) quickly using\nthe kernel trick. The total time to compute the low rank approximation is O(nnz(A)) + (n + d) \u00b7\npoly(k/\u0001). This is considerably faster than standard kernel PCA which \ufb01rst computes the Gram\nmatrix of \u03c6(A).\nWe also show how the subspace V can be used to regularize and speed up various learning algorithms\nwith the polynomial kernel. 
For example, we can use the subspace V to solve regression problems\n\n2\n\n\fof the form minx (cid:107)V x \u2212 b(cid:107)2, an approximate form of principal component regression [8]. This can\nserve as a form of regularization, which is required as the problem minx (cid:107)\u03c6(A)x \u2212 b(cid:107)2 is usually\nunderdetermined. A popular alternative form of regularization is to use kernel ridge regression,\nwhich requires O(n2d) operations. As nnz(A) \u2264 nd, our method is again faster.\n\nOur Techniques and Related Work. Pagh recently introduced the TENSORSKETCH algo-\nrithm [14], which combines the earlier COUNTSKETCH of Charikar et al. [3] with the Fast Fourier\nTransform (FFT) in a clever way. Pagh originally applied TENSORSKETCH for compressing matrix\nmultiplication. Pham and Pagh then showed that TENSORSKETCH can also be used for statistical\nlearning with the polynomial kernel [16].\nHowever, it was unclear whether TENSORSKETCH can be used to approximately preserve entire\nsubspaces of points (and thus can be used as an OSE). Indeed, Pham and Pagh show that a \ufb01xed\npoint v \u2208 Rd has the property that for the TENSORSKETCH sketching matrix S, (cid:107)\u03c6(v) \u00b7 S(cid:107)2 =\n(1 \u00b1 \u0001)(cid:107)\u03c6(v)(cid:107)2 with constant probability. To obtain a high probability bound using their results,\nthe authors take a median of several independent sketches. Given a high probability bound, one\ncan use a net argument to show that the sketch is correct for all vectors v in an n-dimensional\nsubspace of Rd. The median operation results in a non-convex embedding, and it is not clear how\nto ef\ufb01ciently solve optimization problems in the sketch space with such an embedding. 
Moreover,\nsince n independent sketches are needed for probability 1 \u2212 exp(\u2212n), the running time will be at\nleast n \u00b7 nnz(A), whereas we seek only nnz(A) time.\nRecently, Clarkson and Woodruff [5] showed that COUNTSKETCH can be used to provide a subspace\nembedding, that is, simultaneously for all v \u2208 V , (cid:107)\u03c6(v) \u00b7 S(cid:107)2 = (1 \u00b1 \u0001)(cid:107)\u03c6(v)(cid:107)2. TENSORSKETCH\ncan be seen as a very restricted form of COUNTSKETCH, where the additional restrictions enable\nits fast running time on inputs which are tensor products. In particular, the hash functions in TEN-\nSORSKETCH are only 3-wise independent. Nelson and Nguyen [13] showed that COUNTSKETCH\nstill provides a subspace embedding if the entries are chosen from a 4-wise independent distribution.\nWe signi\ufb01cantly extend their analysis, and in particular show that 3-wise independence suf\ufb01ces for\nCOUNTSKETCH to provide an OSE, and that TENSORSKETCH indeed provides an OSE.\nWe stress that all previous work on sketching the polynomial kernel suffers from the drawback de-\nscribed above, that is, it provides no provable guarantees for preserving an entire subspace, which is\nneeded, e.g., for low rank approximation. This is true even of the sketching methods for polynomial\nkernels that do not use TENSORSKETCH [10, 7], as it only provides tail bounds for preserving the\nnorm of a \ufb01xed vector, and has the aforementioned problems of extending it to a subspace, i.e.,\nboosting the probability of error to be enough to union bound over net vectors in a subspace would\nrequire increasing the running time by a factor equal to the dimension of the subspace.\nAfter we show that TENSORSKETCH is an OSE, we need to show how to use it in applications. 
An unusual aspect is that for a TENSORSKETCH matrix S, we can compute φ(A)·S very efficiently, as shown by Pagh [14], but S·φ(A) is not known to be efficiently computable; indeed, for degree-2 polynomial kernels this can be shown to be as hard as general rectangular matrix multiplication. In general, even writing down S·φ(A) would take a prohibitive d^q amount of time. We thus need to design algorithms which only sketch on one side of φ(A).

Another line of research related to ours is that on random feature maps, pioneered in the seminal paper of Rahimi and Recht [17] and extended by several papers, including a recent fast variant [11]. The goal in this line of research is to construct randomized feature maps Ψ(·) so that the Euclidean inner product ⟨Ψ(u), Ψ(v)⟩ closely approximates the value of k(u, v), where k is the kernel; the mapping Ψ(·) is dependent on the kernel. Theoretical analysis has focused so far on showing that ⟨Ψ(u), Ψ(v)⟩ is indeed close to k(u, v). This is also the kind of approach that Pham and Pagh [16] use to analyze TENSORSKETCH. The problem with this kind of analysis is that it is hard to relate it to downstream metrics like generalization error, and thus, in a sense, the algorithm remains a heuristic. In contrast, our approach based on OSEs provides a mathematical framework for analyzing the mappings, to reason about their downstream use, and to utilize various tools from numerical linear algebra in conjunction with them, as we show in this paper. We also note that, in contrast to random feature maps, TENSORSKETCH is attuned to taking advantage of possible input sparsity; e.g., Le et al.
[11] requires computing the Walsh-Hadamard transform, whose running time is independent of the sparsity of the input.

2 Background: COUNTSKETCH and TENSORSKETCH

We start by describing the COUNTSKETCH transform [3]. Let m be the target dimension. When applied to d-dimensional vectors, the transform is specified by a 2-wise independent hash function h : [d] → [m] and a 2-wise independent sign function s : [d] → {−1, +1}. When applied to v, the value at coordinate i of the output, i = 1, 2, . . . , m, is Σ_{j | h(j)=i} s(j) v_j. Note that COUNTSKETCH can be represented as an m × d matrix in which the j-th column contains a single non-zero entry s(j) in the h(j)-th row.

We now describe the TENSORSKETCH transform [14]. Suppose we are given a point v ∈ R^d and so φ(v) ∈ R^{d^q}, and the target dimension is again m. The transform is specified using q 3-wise independent hash functions h_1, . . . , h_q : [d] → [m], and q 4-wise independent sign functions s_1, . . . , s_q : [d] → {+1, −1}. TENSORSKETCH applied to v is then COUNTSKETCH applied to φ(v) with hash function H : [d^q] → [m] and sign function S : [d^q] → {+1, −1} defined as follows:

H(i_1, . . . , i_q) = h_1(i_1) + h_2(i_2) + ··· + h_q(i_q) mod m,

and

S(i_1, . . . , i_q) = s_1(i_1) · s_2(i_2) ··· s_q(i_q).

It is well-known that if H is constructed this way, then it is 3-wise independent [2, 15]. Unlike the work of Pham and Pagh [16], which only used that H was 2-wise independent, our analysis needs this stronger property of H.

The TENSORSKETCH transform can be applied to v without computing φ(v) as follows. First, compute the polynomials

p_ℓ(x) = Σ_{i=0}^{m−1} x^i Σ_{j | h_ℓ(j)=i} v_j · s_ℓ(j),

for ℓ = 1, 2, . . . , q. A calculation [14] shows

Π_{ℓ=1}^{q} p_ℓ(x) mod (x^m − 1) = Σ_{i=0}^{m−1} x^i Σ_{(j_1,...,j_q) | H(j_1,...,j_q)=i} v_{j_1} ··· v_{j_q} S(j_1, . . . , j_q),

that is, the coefficients of the product of the q polynomials mod (x^m − 1) form the value of TENSORSKETCH(v). Pagh observed that this product of polynomials can be computed in O(qm log m) time using the Fast Fourier Transform. As it takes O(q nnz(v)) time to form the q polynomials, the overall time to compute TENSORSKETCH(v) is O(q(nnz(v) + m log m)).

3 TENSORSKETCH is an Oblivious Subspace Embedding

Let S be the d^q × m matrix such that TENSORSKETCH(v) is φ(v) · S for a randomly selected TENSORSKETCH. Notice that S is a random matrix. In the rest of the paper, we refer to such a matrix as a TENSORSKETCH matrix with an appropriate number of columns, i.e. the number of hash buckets. We will show that S is an oblivious subspace embedding for subspaces in R^{d^q} for appropriate values of m. Notice that S has exactly one non-zero entry per row. The index of the non-zero in the row (i_1, . . . , i_q) is H(i_1, . . . , i_q) = Σ_{j=1}^{q} h_j(i_j) mod m. Let δ_{a,b} be the indicator random variable of whether S_{a,b} is non-zero. The sign of the non-zero entry in row (i_1, . . . , i_q) is S(i_1, . . . , i_q) = Π_{j=1}^{q} s_j(i_j). Our main result is that the embedding matrix S of TENSORSKETCH can be used to approximate matrix product and is a subspace embedding (OSE).

Theorem 1 (Main Theorem). Let S be the d^q × m matrix such that TENSORSKETCH(v) is φ(v)S for a randomly selected TENSORSKETCH. The matrix S satisfies the following two properties.

1. (Approximate Matrix Product:) Let A and B be matrices with d^q rows. For m ≥ (2 + 3^q)/(ε²δ), we have

Pr[‖A^T SS^T B − A^T B‖²_F ≤ ε²‖A‖²_F ‖B‖²_F] ≥ 1 − δ.

2. (Subspace Embedding:) Consider a fixed k-dimensional subspace V. If m ≥ k²(2 + 3^q)/(ε²δ), then with probability at least 1 − δ, ‖xS‖ = (1 ± ε)‖x‖ simultaneously for all x ∈ V.

Algorithm 1 k-Space
1: Input: A ∈ R^{n×d}, ε ∈ (0, 1], integer k.
2: Output: V ∈ R^{n×k} with orthonormal columns which spans a rank-k (1 + ε)-approximation to φ(A).
3: Set the parameters m = Θ(3^q k² + k/ε) and r = Θ(3^q m²/ε²).
4: Let S be a d^q × m TENSORSKETCH and T be an independent d^q × r TENSORSKETCH.
5: Compute φ(A) · S and φ(A) · T.
6: Let U be an orthonormal basis for the column space of φ(A) · S.
7: Let W be the m × k matrix containing the top k left singular vectors of U^T φ(A) T.
8: Output V = U W.

We establish the theorem via two lemmas. The first lemma proves the approximate matrix product property via a careful second moment analysis. Due to space constraints, a proof is included only in the supplementary material version of the paper.

Lemma 2. Let A and B be matrices with d^q rows. For m ≥ (2 + 3^q)/(ε²δ), we have

Pr[‖A^T SS^T B − A^T B‖²_F ≤ ε²‖A‖²_F ‖B‖²_F] ≥ 1 − δ.

The second lemma proves that the subspace embedding property follows from the approximate matrix product property.

Lemma 3. Consider a fixed k-dimensional subspace V. If m ≥ k²(2 + 3^q)/(ε²δ), then with probability at least 1 − δ, ‖xS‖ = (1 ± ε)‖x‖ simultaneously for all x ∈ V.

Proof. Let B be a d^q × k matrix whose columns form an orthonormal basis of V. Thus, we have B^T B = I_k and ‖B‖²_F = k. The condition that ‖xS‖ = (1 ± ε)‖x‖ simultaneously for all x ∈ V is equivalent to the condition that the singular values of B^T S are bounded by 1 ± ε. By Lemma 2, for m ≥ (2 + 3^q)/((ε/k)²δ), with probability at least 1 − δ, we have

‖B^T SS^T B − B^T B‖²_F ≤ (ε/k)²‖B‖⁴_F = ε².

Thus, we have ‖B^T SS^T B − I_k‖₂ ≤ ‖B^T SS^T B − I_k‖_F ≤ ε. In other words, the squared singular values of B^T S are bounded by 1 ± ε, implying that the singular values of B^T S are also bounded by 1 ± ε. Note that ‖A‖₂ for a matrix A denotes its operator norm.

4 Applications

4.1 Approximate Kernel PCA and Low Rank Approximation

We say an n × k matrix V with orthonormal columns spans a rank-k (1 + ε)-approximation of an n × d matrix A if ‖A − V V^T A‖_F ≤ (1 + ε)‖A − A_k‖_F. Algorithm k-Space (Algorithm 1) finds an n × k matrix V which spans a rank-k (1 + ε)-approximation of φ(A).

Before proving the correctness of the algorithm, we start with two key lemmas. Proofs are included only in the supplementary material version of the paper.

Lemma 4. Let S ∈ R^{d^q × m} be a randomly chosen TENSORSKETCH matrix with m = Ω(3^q k² + k/ε). Let UU^T be the n × n projection matrix onto the column space of φ(A) · S. Then if [U^T φ(A)]_k is the best rank-k approximation to the matrix U^T φ(A), we have

‖U [U^T φ(A)]_k − φ(A)‖_F ≤ (1 + O(ε))‖φ(A) − [φ(A)]_k‖_F.

Lemma 5. Let UU^T be as in Lemma 4. Let T ∈ R^{d^q × r} be a randomly chosen TENSORSKETCH matrix with r = O(3^q m²/ε²), where m = Ω(3^q k² + k/ε).
Suppose W is the m \u00d7 k matrix whose\ncolumns are the top k left singular vectors of U T \u03c6(A)T . Then,\n\n(cid:107)U W W T U T \u03c6(A) \u2212 \u03c6(A)(cid:107)F \u2264 (1 + \u0001)(cid:107)\u03c6(A) \u2212 [\u03c6(A)]k(cid:107)F .\n\nTheorem 6. (Polynomial Kernel Rank-k Space.) For the polynomial kernel of degree q, in\nO(nnz(A)q) + n \u00b7 poly(3qk/\u0001) time, Algorithm k-SPACE \ufb01nds an n \u00d7 k matrix V which spans\na rank-k (1 + \u0001)-approximation of \u03c6(A).\n\n5\n\n\fProof. By Lemma 4 and Lemma 5, the output V = U W spans a rank-k (1 + \u0001)-approximation to\n\u03c6(A). It only remains to argue the time complexity. The sketches \u03c6(A) \u00b7 S and \u03c6(A) \u00b7 T can be\ncomputed in O(nnz(A)q) + n \u00b7 poly(3qk/\u0001) time. In n \u00b7 poly(3qk/\u0001) time, the matrix U can be\nobtained from \u03c6(A) \u00b7 S and the product U T \u03c6(A)T can be computed. Given U T \u03c6(A)T , the matrix\nW of top k left singular vectors can be computed in poly(3qk/\u0001) time, and in n \u00b7 poly(3qk/\u0001) time\nthe product V = U W can be computed. Hence the overall time is O(nnz(A)q) + n \u00b7 poly(3qk/\u0001),\nand the theorem follows.\n\nWe now show how to \ufb01nd a low rank approximation to \u03c6(A). A proof is included in the supplemen-\ntary material version of the paper.\nTheorem 7. 
(Polynomial Kernel PCA and Low Rank Factorization) For the polynomial kernel of degree q, in O(nnz(A)q) + (n + d) · poly(3^q k/ε) time, we can find an n × k matrix V, a k × poly(k/ε) matrix U, and a poly(k/ε) × d matrix R for which

‖V · U · φ(R) − φ(A)‖_F ≤ (1 + ε)‖φ(A) − [φ(A)]_k‖_F.

The success probability of the algorithm is at least .6, which can be amplified with independent repetition.

Note that Theorem 7 implies the rowspace of φ(R) contains a k-dimensional subspace L with d^q × d^q projection matrix LL^T for which ‖φ(A)LL^T − φ(A)‖_F ≤ (1 + ε)‖φ(A) − [φ(A)]_k‖_F, that is, L provides an approximation to the space spanned by the top k principal components of φ(A).

4.2 Regularizing Learning With the Polynomial Kernel

Consider learning with the polynomial kernel. Even if d ≪ n it might be that even for low values of q we have d^q ≫ n. This makes a number of learning algorithms underdetermined, and increases the chance of overfitting. The problem is even more severe if the input matrix A has a lot of redundancy in it (noisy features).

To address this, many learning algorithms add a regularizer, e.g., ridge terms. Here we propose to regularize by using rank-k approximations to the matrix (where k is the regularization parameter that is controlled by the user). With the tools developed in the previous subsection, this not only serves as a regularization but also as a means of accelerating the learning.

We now describe two different methods that can be regularized using this approach.

4.2.1 Approximate Kernel Principal Component Regression

If d^q > n then linear regression with φ(A) becomes underdetermined and exact fitting to the right-hand side is possible, in more than one way.
One form of regularization is Principal Component\nRegression (PCR), which \ufb01rst uses PCA to project the data on the principal component, and then\ncontinues with linear regression in this space.\nWe now introduce the following approximate version of PCR.\nDe\ufb01nition 8. In the Approximate Principal Component Regression Problem (Approximate PCR),\nwe are given an n \u00d7 d matrix A and an n \u00d7 1 vector b, and the goal is to \ufb01nd a vector x \u2208 Rk and\nan n \u00d7 k matrix V with orthonormal columns spanning a rank-k (1 + \u0001)-approximation to A for\nwhich x = argminx(cid:107)V x \u2212 b(cid:107)2.\nNotice that if A is a rank-k matrix, then Approximate PCR coincides with ordinary least squares\nregression with respect to the column space of A. While PCR would require solving the regression\nproblem with respect to the top k singular vectors of A, in general \ufb01nding these k vectors exactly\nresults in unstable computation, and cannot be found by an ef\ufb01cient linear sketch. This would\noccur, e.g., if the k-th singular value \u03c3k of A is very close (or equal) to \u03c3k+1. We therefore relax\nthe de\ufb01nition to only require that the regression problem be solved with respect to some k vectors\nwhich span a rank-k (1 + \u0001)-approximation to A.\nThe following is our main theorem for Approximate PCR.\nTheorem 9. (Polynomial Kernel Approximate PCR.) For the polynomial kernel of degree q, in\nO(nnz(A)q) + n \u00b7 poly(3qk/\u0001) time one can solve the approximate PCR problem, namely, one\n\n6\n\n\fcan output a vector x \u2208 Rk and an n \u00d7 k matrix V with orthonormal columns spanning a rank-k\n(1 + \u0001)-approximation to \u03c6(A), for which x = argminx(cid:107)V x \u2212 b(cid:107)2.\nProof. Applying Theorem 6, we can \ufb01nd an n \u00d7 k matrix V with orthonormal columns spanning a\nrank-k (1 + \u0001)-approximation to \u03c6(A) in O(nnz(A)q) + n\u00b7 poly(3qk/\u0001) time. 
At this point, one can solve the regression problem argmin_x ‖V x − b‖₂ exactly in O(nk) time since the minimizer is x = V^T b.

4.2.2 Approximate Kernel Canonical Correlation Analysis

In Canonical Correlation Analysis (CCA) we are given two matrices A, B and we wish to find directions in which the spaces spanned by their columns are correlated. Due to space constraints, details appear only in the supplementary material version of the paper.

5 Experiments

We report two sets of experiments whose goal is to demonstrate that the k-Space algorithm (Algorithm 1) is useful as a feature extraction algorithm. We use standard classification and regression datasets.

In the first set of experiments, we compare ordinary ℓ2 regression to approximate principal component ℓ2 regression, where the approximate principal components are extracted using k-Space (we use RLSC for classification). Specifically, as explained in Section 4.2.1, we use k-Space to compute V and then use regression on V (in one dataset we also add an additional ridge regularization). To predict, we notice that V = φ(A) · S · R⁻¹ · W, where R is the R factor of φ(A) · S, so S · R⁻¹ · W defines a mapping to the approximate principal components. So, to predict on a matrix A_t we first compute φ(A_t) · S · R⁻¹ · W (using TENSORSKETCH to compute φ(A_t) · S fast) and then multiply by the coefficients found by the regression. In all the experiments, φ(·) is defined using the kernel k(u, v) = (u^T v + 1)³.

While k-Space is efficient and gives an embedding in time that is faster than explicitly expanding the feature map, or using kernel PCA, there is still some non-negligible overhead in using it. Therefore, we also experimented with feature extraction using only a subset of the training set.
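The core primitive in this prediction pipeline, computing φ(A_t) · S with TENSORSKETCH rather than ever forming φ(A_t), can be sketched in NumPy as follows. This is an illustrative version of the FFT construction from Section 2, with plain random arrays standing in for the 3-wise and 4-wise independent hash and sign functions; it is not the code used in our experiments:

```python
import numpy as np

def tensorsketch(A, hs, ss, m):
    """Compute phi(A) . S row-wise for a degree-q TENSORSKETCH S, where
    hs = [h_1, ..., h_q] (each an int array in {0..m-1}^d) and
    ss = [s_1, ..., s_q] (each an array in {-1,+1}^d).
    Each row's sketch is the product of its q COUNTSKETCH polynomials
    mod (x^m - 1), computed via FFT; phi(A) itself is never formed."""
    n, d = A.shape
    prod = np.ones((n, m), dtype=complex)
    for h, s in zip(hs, ss):
        P = np.zeros((n, m))
        np.add.at(P.T, h, (A * s).T)   # COUNTSKETCH: P[:, h[j]] += s[j] * A[:, j]
        prod *= np.fft.fft(P, axis=1)  # polynomial product = pointwise FFT product
    return np.real(np.fft.ifft(prod, axis=1))

rng = np.random.default_rng(1)
n, d, q, m = 4, 10, 3, 4096
A = rng.standard_normal((n, d))
hs = [rng.integers(0, m, size=d) for _ in range(q)]
ss = [rng.choice([-1.0, 1.0], size=d) for _ in range(q)]

Z = tensorsketch(A, hs, ss, m)   # n x m, approximates phi(A) . S
G = Z @ Z.T                      # approximate kernel Gram matrix
G_exact = (A @ A.T) ** q         # exact Gram matrix: <phi(x), phi(y)> = <x, y>^q
```

In the pipeline above, the sketch Z of test points would then be multiplied by R⁻¹ · W to map them to the approximate principal components; the gap between G and G_exact shrinks as m grows, in accordance with Theorem 1.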
Speci\ufb01cally, we\n\ufb01rst sample the dataset, and then use k-Space to compute the mapping S \u00b7 R\u22121 \u00b7 W . We apply this\nmapping to the entire dataset before doing regression.\nThe results are reported in Table 1. Since k-Space is randomized, we report the mean and standard\ndeviation of 5 runs. For all datasets, learning with the extracted features yields better generalized\nerrors than learning with the original features. Extracting the features using only a sample of the\ntraining set results in only slightly worse generalization errors. With regards to the MNIST dataset,\nwe caution the reader not to compare the generalization results to the ones obtained using the poly-\nnomial kernel (as reported in the literature). In our experiments we do not use the polynomial kernel\non the entire dataset, but rather use it to extract features (i.e., do principal component regularization)\nusing only a subset of the examples (only 5,000 examples out of 60,000). One can expect worse re-\nsults, but this is a more realistic strategy for very large datasets. On very large datasets it is typically\nunrealistic to use the polynomial kernel on the entire dataset, and approximation techniques, like the\nones we suggest, are necessary.\nWe use a similar setup in the second set of experiments, now using linear SVM instead of regression\n(we run only on the classi\ufb01cation datasets). The results are reported in Table 2. Although the gap is\nsmaller, we see again that generally the extracted features lead to better generalization errors.\nWe remark that it is not our goal to show that k-Space is the best feature extraction algorithm of\nthe classi\ufb01cation algorithms we considered (RLSC and SVM), or that it is the fastest, but rather\nthat it can be used to extract features of higher quality than the original one. 
In fact, in our experi-\nments, while for a \ufb01xed number of extracted features, k-Space produces better features than simply\nusing TENSORSKETCH, it is also more expensive in terms of time. If that additional time is used\nto do learning or prediction with TENSORSKETCH with more features, we overall get better gen-\neralization error (we do not report the results of these experiments). However, feature extraction is\nwidely applicable, and there can be cases where having fewer high quality features is bene\ufb01cial, e.g.\nperforming multiple learning on the same data, or a very expensive learning tasks.\n\n7\n\n\fTable 1: Comparison of testing error with using regression with original features and with features extracted using k-Space. In the table, n\nis number of training instances, d is the number of features per instance and nt is the number of instances in the test set. \u201cRegression\u201d stands\nfor ordinary (cid:96)2 regression. \u201cPCA Regression\u201d stand for approximate principal component (cid:96)2 regression. \u201cSample PCA Regression\u201d stands\napproximate principal component (cid:96)2 regression where only ns samples from the training set are used for computing the feature extraction. In\n\u201cPCA Regression\u201d and \u201cSample PCA Regression\u201d k features are extracted. In k-Space we use m = O(k) and r = O(k) with the ratio\nbetween m and k and r and k as detailed in the table. 
For classification tasks, the percent of testing points incorrectly predicted is reported. For regression tasks, we report ‖yp − y‖2/‖y‖2, where yp is the vector of predicted values and y is the ground truth.

MNIST (classification; n = 60,000, d = 784, nt = 10,000):
  Regression: 14%
  PCA Regression: Out of Memory
  Sampled PCA Regression: 7.9% ± 0.06% (k = 500, ns = 5000, m/k = 2, r/k = 4)

CPU (regression; n = 6,554, d = 21, nt = 819):
  Regression: 12%
  PCA Regression: 4.3% ± 1.0% (k = 200, m/k = 4, r/k = 8)
  Sampled PCA Regression: 3.6% ± 0.1% (k = 200, ns = 2000, m/k = 4, r/k = 8)

ADULT (classification; n = 32,561, d = 123, nt = 16,281):
  Regression: 15.3%
  PCA Regression: 15.2% ± 0.1% (k = 500, m/k = 2, r/k = 4)
  Sampled PCA Regression: 15.2% ± 0.03% (k = 500, ns = 5000, m/k = 2, r/k = 4)

CENSUS (regression; n = 18,186, d = 119, nt = 2,273):
  Regression: 7.1%
  PCA Regression: 6.5% ± 0.2% (k = 500, m/k = 4, r/k = 8, λ = 0.001)
  Sampled PCA Regression: 6.8% ± 0.4% (k = 500, ns = 5000, m/k = 4, r/k = 8, λ = 0.001)

USPS (classification; n = 7,291, d = 256, nt = 2,007):
  Regression: 13.1%
  PCA Regression: 7.0% ± 0.2% (k = 200, m/k = 4, r/k = 8)
  Sampled PCA Regression: 7.5% ± 0.3% (k = 200, ns = 2000, m/k = 4, r/k = 8)

Table 2: Comparison of testing error when using SVM with the original features and with features extracted using k-Space. In the table, n is the number of training instances, d is the number of features per instance and nt is the number of instances in the test set. "SVM" stands for linear SVM. "PCA SVM" stands for using k-Space to extract features, and then using linear SVM. "Sampled PCA SVM" stands for using only ns samples from the training set to compute the feature extraction. In "PCA SVM" and "Sampled PCA SVM" k features are extracted. In k-Space we use m = O(k) and r = O(k), with the ratios between m and k and between r and k as detailed in the table.
For classification tasks, the percent of testing points incorrectly predicted is reported.

MNIST (classification; n = 60,000, d = 784, nt = 10,000):
  SVM: 8.4%
  PCA SVM: Out of Memory
  Sampled PCA SVM: 6.1% ± 0.1% (k = 500, ns = 5000, m/k = 2, r/k = 4)

ADULT (classification; n = 32,561, d = 123, nt = 16,281):
  SVM: 15.0%
  PCA SVM: 15.1% ± 0.1% (k = 500, m/k = 2, r/k = 4)
  Sampled PCA SVM: 15.2% ± 0.1% (k = 500, ns = 5000, m/k = 2, r/k = 4)

USPS (classification; n = 7,291, d = 256, nt = 2,007):
  SVM: 8.3%
  PCA SVM: 7.2% ± 0.2% (k = 200, m/k = 4, r/k = 8)
  Sampled PCA SVM: 7.5% ± 0.3% (k = 200, ns = 2000, m/k = 4, r/k = 8)

6 Conclusions and Future Work

Sketching based dimensionality reduction has so far been limited to linear models. In this paper, we describe the first oblivious subspace embeddings for a non-linear kernel expansion (the polynomial kernel), opening the door for sketching based algorithms for a multitude of problems involving kernel transformations. We believe this represents a significant expansion of the capabilities of sketching based algorithms. However, the polynomial kernel has a finite expansion, and this finiteness is quite useful in the design of the embedding, while many popular kernels induce an infinite-dimensional mapping. We propose that the next step in expanding the reach of sketching based methods for statistical learning is to design oblivious subspace embeddings for non-finite kernel expansions, e.g., the expansions induced by the Gaussian kernel.

References

[1] H. Avron, V. Sindhwani, and D. P. Woodruff. Sketching structured matrices for faster nonlinear regression. In Advances in Neural Information Processing Systems (NIPS), 2013.

[2] L. Carter and M. N. Wegman. Universal classes of hash functions. J. Comput. Syst. Sci., 18(2):143–154, 1979.

[3] M. Charikar, K. Chen, and M. Farach-Colton. Finding frequent items in data streams. Theor. Comput.
Sci., 312(1):3–15, 2004.

[4] K. L. Clarkson and D. P. Woodruff. Numerical linear algebra in the streaming model. In Proceedings of the 41st Annual ACM Symposium on Theory of Computing (STOC), 2009.

[5] K. L. Clarkson and D. P. Woodruff. Low rank approximation and regression in input sparsity time. In Proceedings of the 45th Annual ACM Symposium on Theory of Computing (STOC), 2013.

[6] P. Drineas, M. W. Mahoney, and S. Muthukrishnan. Relative-error CUR matrix decompositions. SIAM J. Matrix Analysis Applications, 30(2):844–881, 2008.

[7] R. Hamid, Y. Xiao, A. Gittens, and D. DeCoste. Compact random feature maps. In Proc. of the 31st International Conference on Machine Learning (ICML), 2014.

[8] I. T. Jolliffe. A note on the use of principal components in regression. Journal of the Royal Statistical Society, Series C, 31(3):300–303, 1982.

[9] R. Kannan, S. Vempala, and D. P. Woodruff. Principal component analysis and higher correlations for distributed data. In Proceedings of the 27th Conference on Learning Theory (COLT), 2014.

[10] P. Kar and H. Karnick. Random feature maps for dot product kernels. In Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics (AISTATS), 2012.

[11] Q. Le, T. Sarlós, and A. Smola. Fastfood – Approximating kernel expansions in loglinear time. In Proc. of the 30th International Conference on Machine Learning (ICML), 2013.

[12] M. W. Mahoney. Randomized algorithms for matrices and data. Foundations and Trends in Machine Learning, 3(2):123–224, 2011.

[13] J. Nelson and H. Nguyen. OSNAP: Faster numerical linear algebra algorithms via sparser subspace embeddings. In 54th IEEE Annual Symposium on Foundations of Computer Science (FOCS), 2013.

[14] R. Pagh. Compressed matrix multiplication. ACM Trans. Comput. Theory, 5(3):9:1–9:17, 2013.

[15] M. Patrascu and M. Thorup.
The power of simple tabulation hashing. J. ACM, 59(3):14, 2012.

[16] N. Pham and R. Pagh. Fast and scalable polynomial kernels via explicit feature maps. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '13, pages 239–247, New York, NY, USA, 2013. ACM.

[17] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems (NIPS), 2007.