{"title": "Is Input Sparsity Time Possible for Kernel Low-Rank Approximation?", "book": "Advances in Neural Information Processing Systems", "page_first": 4435, "page_last": 4445, "abstract": "Low-rank approximation is a common tool used to accelerate kernel methods: the $n \\times n$ kernel matrix $K$ is approximated via a rank-$k$ matrix $\\tilde K$ which can be stored in much less space and processed more quickly. In this work we study the limits of computationally efficient low-rank kernel approximation. We show that for a broad class of kernels, including the popular Gaussian and polynomial kernels, computing a relative error $k$-rank approximation to $K$ is at least as difficult as multiplying the input data matrix $A \\in R^{n \\times d}$ by an arbitrary matrix $C \\in R^{d \\times k}$. Barring a breakthrough in fast matrix multiplication, when $k$ is not too large, this requires $\\Omega(nnz(A)k)$ time where $nnz(A)$ is the number of non-zeros in $A$. This lower bound matches, in many parameter regimes, recent work on subquadratic time algorithms for low-rank approximation of general kernels [MM16,MW17], demonstrating that these algorithms are unlikely to be significantly improved, in particular to $O(nnz(A))$ input sparsity runtimes. At the same time there is hope: we show for the first time that $O(nnz(A))$ time approximation is possible for general radial basis function kernels (e.g., the Gaussian kernel) for the closely related problem of low-rank approximation of the kernelized dataset.", "full_text": "Is Input Sparsity Time Possible for\nKernel Low-Rank Approximation?\n\nCameron Musco\n\nMIT\n\ncnmusco@mit.edu\n\nDavid P. Woodruff\n\nCarnegie Mellon University\ndwoodruf@cs.cmu.edu\n\nAbstract\n\nLow-rank approximation is a common tool used to accelerate kernel methods: the\nn\u21e5 n kernel matrix K is approximated via a rank-k matrix \u02dcK which can be stored\nin much less space and processed more quickly. 
In this work we study the limits of computationally efficient low-rank kernel approximation. We show that for a broad class of kernels, including the popular Gaussian and polynomial kernels, computing a relative error rank-k approximation to K is at least as difficult as multiplying the input data matrix A ∈ R^{n×d} by an arbitrary matrix C ∈ R^{d×k}. Barring a breakthrough in fast matrix multiplication, when k is not too large, this requires Ω(nnz(A)k) time, where nnz(A) is the number of non-zeros in A. This lower bound matches, in many parameter regimes, recent work on subquadratic time algorithms for low-rank approximation of general kernels [MM16, MW17], demonstrating that these algorithms are unlikely to be significantly improved, in particular to O(nnz(A)) input sparsity runtimes. At the same time there is hope: we show for the first time that O(nnz(A)) time approximation is possible for general radial basis function kernels (e.g., the Gaussian kernel) for the closely related problem of low-rank approximation of the kernelized dataset.

1 Introduction

The kernel method is a popular technique used to apply linear learning and classification algorithms to datasets with nonlinear structure. Given training input points a_1, ..., a_n ∈ R^d, the idea is to replace the standard Euclidean dot product ⟨a_i, a_j⟩ = a_i^T a_j with the kernel dot product ψ(a_i, a_j), where ψ : R^d × R^d → R_+ is some positive semidefinite function. Popular kernel functions include, e.g., the Gaussian kernel with ψ(a_i, a_j) = e^{−‖a_i − a_j‖²/σ} for some bandwidth parameter σ, and the polynomial kernel of degree q with ψ(a_i, a_j) = (c + a_i^T a_j)^q for some parameter c.

Throughout this work, we focus on kernels where ψ(a_i, a_j) is a function of the dot products a_i^T a_i = ‖a_i‖², a_j^T a_j = ‖a_j‖², and a_i^T a_j. Such functions encompass many kernels used in practice, including the Gaussian kernel, the Laplace kernel, the polynomial kernel, and the Matérn kernels.

Letting F be the reproducing kernel Hilbert space associated with ψ(·,·), we can write ψ(a_i, a_j) = ⟨φ(a_i), φ(a_j)⟩, where φ : R^d → F is a typically non-linear feature map. We let Φ = [φ(a_1), ..., φ(a_n)]^T denote the kernelized dataset, whose ith row is the kernelized datapoint φ(a_i). There is no requirement that φ can be efficiently computed or stored – for example, in the case of the Gaussian kernel, F is an infinite dimensional space. Thus, kernel methods typically work with the kernel matrix K ∈ R^{n×n} with K_{i,j} = ψ(a_i, a_j). We will also sometimes denote K = {ψ(a_i, a_j)} to make it clear which kernel function it is generated by. We can equivalently write K = ΦΦ^T. As long as all operations of an algorithm access Φ only via the dot products between its rows, they can be implemented using just K, without explicitly computing the feature map.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Unfortunately, computing K is expensive, and a bottleneck for scaling kernel methods to large datasets. For the kernels we consider, where ψ depends on dot products between the input points, we must at least compute the Gram matrix AA^T, requiring Θ(n²d) time in general. Even if A is sparse, this takes Θ(nnz(A)·n) time.
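To make the costs above concrete, here is a minimal NumPy sketch (the function name is ours) of forming the Gaussian kernel matrix from the Gram matrix AA^T; the `A @ A.T` product is exactly the Θ(n²d) bottleneck just discussed.

```python
import numpy as np

def gaussian_kernel_matrix(A, sigma):
    """Form K with K[i, j] = exp(-||a_i - a_j||^2 / sigma).

    The dominant cost is the Gram matrix A @ A.T: Theta(n^2 d) time for
    dense A, since ||a_i - a_j||^2 = ||a_i||^2 + ||a_j||^2 - 2 a_i^T a_j.
    """
    G = A @ A.T                              # Gram matrix AA^T
    sq = np.diag(G)                          # squared row norms ||a_i||^2
    D = sq[:, None] + sq[None, :] - 2.0 * G  # squared pairwise distances
    return np.exp(-np.maximum(D, 0.0) / sigma)
```

Even after this computation, the result is a dense n × n matrix, which motivates the low-rank surrogates discussed next.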
Storing K then takes Θ(n²) space, and processing it for downstream applications like kernel ridge regression and kernel SVM can be even more expensive.

1.1 Low-rank kernel approximation

For this reason, a vast body of work studies how to efficiently approximate K via a low-rank surrogate K̃ [SS00, AMS01, WS01, FS02, RR07, ANW14, LSS13, BJ02, DM05, ZTK08, BW09, CKS11, WZ13, GM13]. If K̃ is rank-k, it can be stored in factored form in O(nk) space and operated on quickly – e.g., it can be inverted in just O(nk²) time to solve kernel ridge regression.

One possibility is to set K̃ = K_k, where K_k is K's best rank-k approximation – the projection onto its top k eigenvectors. K_k minimizes, over all rank-k K̃, the error ‖K − K̃‖_F, where ‖M‖_F is the Frobenius norm: (Σ_{i,j} M_{i,j}²)^{1/2}. It in fact minimizes error under any unitarily invariant norm, e.g., the popular spectral norm. Unfortunately, K_k is prohibitively expensive to compute, requiring Θ(n³) time in practice, or n^ω in theory using fast matrix multiplication, where ω ≈ 2.373 [LG14].

The idea of much prior work on low-rank kernel approximation is to find K̃ which is nearly as good as K_k, but can be computed much more quickly. Specifically, it is natural to ask for K̃ fulfilling the following relative error guarantee for some parameter ε > 0:

‖K − K̃‖_F ≤ (1 + ε)‖K − K_k‖_F.    (1)

Other goals, such as nearly matching the spectral norm error ‖K − K_k‖₂, or approximating K entrywise, have also been considered [RR07, GM13]. Of particular interest to our results is the closely related goal of outputting an orthonormal basis Z ∈ R^{n×k} satisfying, for any Φ with ΦΦ^T = K:

‖Φ − ZZ^T Φ‖_F ≤ (1 + ε)‖Φ − Φ_k‖_F.    (2)

(2) can be viewed as a kernel PCA guarantee – it asks us to find a low-rank subspace Z such that the projection of our kernelized dataset onto Z nearly optimally approximates this dataset.
Given Z, we can approximate K using K̃ = ZZ^T ΦΦ^T ZZ^T = ZZ^T K ZZ^T. Alternatively, letting P be the projection onto the row span of ZZ^T Φ, we can write K̃ = ΦPΦ^T, which can be computed efficiently, for example, when P is a projection onto a subset of the kernelized datapoints [MM16].

1.2 Fast algorithms for relative-error kernel approximation

Until recently, all algorithms achieving the guarantees of (1) and (2) were at least as expensive as computing the full matrix K, which was needed to compute the low-rank approximation [GM13]. However, recent work has shown that this is not required. Avron, Nguyen, and Woodruff [ANW14] demonstrate that for the polynomial kernel of degree q, Z satisfying (2) can be computed in O(nnz(A)q) + n·poly(3^q k/ε) time.

Musco and Musco [MM16] give a fast algorithm for any kernel, using recursive Nyström sampling, which computes K̃ (in factored form) satisfying ‖K − K̃‖₂ ≤ λ, for input parameter λ. With the proper setting of λ, it can output Z satisfying (2) (see Section C.3 of [MM16]). Computing Z requires evaluating Õ(k/ε) columns of the kernel matrix, along with Õ(n(k/ε)^{ω−1}) additional time for other computations. Assuming the kernel is a function of the dot products between the input points, the kernel evaluations require Õ(nnz(A)k/ε) time. The results of [MM16] can also be used to compute K̃ satisfying (1) with ε = √n in Õ(nnz(A)k + nk^{ω−1}) time (see Appendix A of [MW17]).

Woodruff and Musco [MW17] show that for any kernel, and for any ε > 0, it is possible to achieve (1) in Õ(nnz(A)k/ε) + n·poly(k/ε) time, plus the time needed to compute an Õ(√n·k/ε²) × Õ(√n·k/ε) submatrix of K. If A has uniform row sparsity – i.e., nnz(a_i) ≤ c·nnz(A)/n for some constant c and all i – this step can be done in Õ(nnz(A)k/ε^{2.5}) time.
Alternatively, if d ≤ (√n·k/ε²)^α for α < 0.314, this step can be done in Õ(nk/ε⁴) = Õ(nnz(A)k/ε⁴) time using fast rectangular matrix multiplication [LG12, GU17] (assuming that there are no all-zero data points, so n ≤ nnz(A)).

1.3 Our results

The algorithms of [MM16, MW17] make significant progress in efficiently solving (1) and (2) for general kernel matrices. They demonstrate that, surprisingly, a relative-error low-rank approximation can be computed significantly faster than the time required to write down all of K.

A natural question is if these results can be improved. Even ignoring ε dependencies and typically lower order terms, both algorithms use Ω(nnz(A)k) time. One might hope to improve this to input sparsity, or near input sparsity time, Õ(nnz(A)), which is known for computing a low-rank approximation of A itself [CW13]. The work of Avron et al. affirms that this is possible for the kernel PCA guarantee of (2) for degree-q polynomial kernels, for constant q. Can this result be extended to other popular kernels, or even more general classes?

1.3.1 Lower bounds

We show that achieving the guarantee of (1) significantly more efficiently than the work of [MM16, MW17] is likely very difficult. Specifically, we prove that for a wide class of kernels, the kernel low-rank approximation problem is as hard as multiplying the input A ∈ R^{n×d} by an arbitrary C ∈ R^{d×k}. We have the following result for some common kernels to which our techniques apply:

Theorem 1 (Hardness for low-rank kernel approximation). Consider any polynomial kernel ψ(m_i, m_j) = (c + m_i^T m_j)^q, Gaussian kernel ψ(m_i, m_j) = e^{−‖m_i − m_j‖²/σ}, or the linear kernel ψ(m_i, m_j) = m_i^T m_j. Assume there is an algorithm which, given M ∈ R^{n×d} with associated kernel matrix K = {ψ(m_i, m_j)}, returns N ∈ R^{n×k} in o(nnz(M)k) time satisfying:

‖K − NN^T‖²_F ≤ Δ‖K − K_k‖²_F

for some approximation factor Δ. Then there is an o(nnz(A)k) + O(nk²) time algorithm for multiplying arbitrary integer matrices A ∈ R^{n×d}, C ∈ R^{d×k}.

The above applies for any approximation factor Δ. While we work in the real RAM model, ignoring bit complexity, as long as Δ = poly(n) and A, C have polynomially bounded entries, our reduction from multiplication to low-rank approximation is achieved using matrices that can be represented with just O(log(n + d)) bits per entry.

Theorem 1 shows that the runtime of Õ(nnz(A)k + nk^{ω−1}) for Δ = √n achieved by [MM16] for general kernels cannot be significantly improved without advancing the state-of-the-art in matrix multiplication. Currently no general algorithm is known for multiplying integer A ∈ R^{n×d}, C ∈ R^{d×k} in o(nnz(A)k) time, except when k ≤ n^α for α < 0.314 and A is dense. In this case, AC can be computed in O(nd) time using fast rectangular matrix multiplication [LG12, GU17].

As discussed, when A has uniform row sparsity or when d ≤ (√n·k/ε²)^α, the runtime of [MW17] for Δ = (1 + ε), ignoring ε dependencies and typically lower order terms, is Õ(nnz(A)k), which is also nearly tight.

In recent work, Backurs et al. [BIS17] give lower bounds for a number of kernel learning problems, including kernel PCA for the Gaussian kernel.
However, their strong bound, of Ω(n²) time, requires very small error ε = exp(−ω(log² n)), whereas ours applies for any relative error Δ.

1.3.2 Improved algorithm for radial basis function kernels

In contrast to the above negative result, we demonstrate that achieving the alternative kernel PCA guarantee of (2) is possible in input sparsity time for any shift- and rotationally-invariant kernel – e.g., any radial basis function kernel, where ψ(x_i, x_j) = f(‖x_i − x_j‖). This result significantly extends the progress of Avron et al. [ANW14] on the polynomial kernel.

Our algorithm is based on a fast implementation of the random Fourier features method [RR07], which uses the fact that the Fourier transform of any shift-invariant kernel is a probability distribution after appropriate scaling (a consequence of Bochner's theorem). Sampling frequencies from this distribution gives an approximation to ψ(·,·) and consequently to the matrix K.

We employ a new analysis of this method [AKM+17], which shows that sampling Õ(n/(λε²)) random Fourier features suffices to give K̃ = Φ̃Φ̃^T satisfying the spectral approximation guarantee:

(1 − ε)(K̃ + λI) ⪯ K + λI ⪯ (1 + ε)(K̃ + λI).

If we set λ ≤ λ_{k+1}(K)/k, we can show that Φ̃ also gives a projection-cost preserving sketch [CEM+15] for the kernelized dataset Φ. This ensures that any Z satisfying ‖Φ̃ − ZZ^T Φ̃‖²_F ≤ (1 + ε)‖Φ̃ − Φ̃_k‖²_F also satisfies ‖Φ − ZZ^T Φ‖²_F ≤ (1 + O(ε))‖Φ − Φ_k‖²_F, and thus achieves (2).

Our algorithm samples s = Õ(nk/(ε²λ_{k+1}(K))) random Fourier features, which naively requires O(nnz(A)s) time. We show that this can be accelerated to O(nnz(A)) + poly(n, s) time, using a recent result of Kapralov et al.
on fast multiplication by random Gaussian matrices [KPW16]. Our technique is analogous to the 'Fastfood' approach to accelerating random Fourier features using fast Hadamard transforms [LSS13]. However, our runtime scales with nnz(A), which can be significantly smaller than the Õ(nd) runtime given by Fastfood when A is sparse. Our main algorithmic result is:

Theorem 2 (Input sparsity time kernel PCA). There is an algorithm that, given A ∈ R^{n×d} along with a shift- and rotation-invariant kernel function ψ : R^d × R^d → R_+ with ψ(x, x) = 1, outputs, with probability 99/100, Z ∈ R^{n×k} satisfying:

‖Φ − ZZ^T Φ‖²_F ≤ (1 + ε)‖Φ − Φ_k‖²_F

for any Φ with ΦΦ^T = K = {ψ(a_i, a_j)} and any ε > 0. Letting λ_{k+1} denote the (k+1)st largest eigenvalue of K and ω < 2.373 be the exponent of fast matrix multiplication, the algorithm runs in

O(nnz(A)) + Õ(n^{ω+1.5} · (k/(λ_{k+1}ε²))^{ω−1.5}) time.

We note that the runtime of our algorithm is O(nnz(A)) whenever n, k, 1/λ_{k+1}, and 1/ε are not too large. Due to the relatively poor dependence on n, the algorithm is relevant for very high dimensional datasets with d ≫ n. Such datasets are found often, e.g., in genetics applications [HDC+01, JDMP11]. While we have dependence on 1/λ_{k+1}, in the natural setting we only compute a low-rank approximation up to an error threshold, ignoring very small eigenvalues of K, and so λ_{k+1} will not be too small. We do note that if we apply Theorem 2 to the low-rank approximation instances given by our lower bound construction, λ_{k+1} can be very small – ≤ 1/poly(n, d) for matrices with poly(n) bounded entries.
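For reference, the basic (unaccelerated) random Fourier features construction behind Theorem 2 looks as follows for the Gaussian kernel, for which the frequency distribution p(η) is itself Gaussian. This sketch uses the standard real cosine form of [RR07] rather than the complex exponentials in the text, and all names are ours.

```python
import numpy as np

def rff_features(A, s, sigma, rng):
    """Random Fourier features for the kernel exp(-||x - y||^2 / sigma).

    By Bochner's theorem the frequency distribution for this kernel is
    Gaussian, omega ~ N(0, (2 / sigma) I), and the cosine features below
    satisfy E[z(x) z(y)] = exp(-||x - y||^2 / sigma).
    """
    d = A.shape[1]
    omega = rng.standard_normal((d, s)) * np.sqrt(2.0 / sigma)
    b = rng.uniform(0.0, 2.0 * np.pi, size=s)
    return np.sqrt(2.0 / s) * np.cos(A @ omega + b)  # the n x s matrix

rng = np.random.default_rng(3)
A = rng.standard_normal((20, 4))
K = np.exp(-((A[:, None, :] - A[None, :, :]) ** 2).sum(-1) / 4.0)
Z = rff_features(A, s=20000, sigma=4.0, rng=rng)
assert np.max(np.abs(Z @ Z.T - K)) < 0.1   # entrywise error ~ 1/sqrt(s)
```

The product `A @ omega` is exactly the O(nnz(A)·s) bottleneck that the [KPW16]-based acceleration removes.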
Thus, removing this dependence is an important open question in understanding the complexity of low-rank kernel approximation.

We leave open the possibility of improving our algorithm, achieving O(nnz(A)) + n·poly(k, ε) runtime, which would match the state-of-the-art for low-rank approximation of non-kernelized matrices [CW13]. Alternatively, it is possible that a lower bound can be shown, proving that the high n dependence, or the 1/λ_{k+1} term, is required even for the kernel PCA guarantee of (2).

2 Lower bounds

Our lower bound proof argues that for a broad class of kernels, given input M, a low-rank approximation of the associated kernel matrix K achieving (1) can be used to obtain a close approximation to the Gram matrix MM^T. We write ψ(m_i, m_j) as a function of m_i^T m_j (or ‖m_i − m_j‖² for distance kernels) and expand this function as a power series. We show that if the input points are appropriately rescaled, the contribution of the degree-1 term m_i^T m_j dominates, and hence our kernel matrix approximates MM^T, up to some easy-to-compute low-rank components.

We then show that such an approximation can be used to give a fast algorithm for multiplying any two integer matrices A ∈ R^{n×d} and C ∈ R^{d×k}. The key idea is to set M = [A^T, wC]^T where w is a large weight. We then have:

MM^T = [ AA^T, wAC ; wC^T A^T, w²C^T C ].

Since w is very large, the AA^T block is relatively very small, and so MM^T is nearly rank-2k – it has a 'heavy' strip of elements in its last k rows and columns. Thus, computing a relative-error rank-2k approximation to MM^T recovers all entries except those in the AA^T block very accurately, and importantly, recovers the wAC block and so the product AC.

2.1 Lower bound for low-rank approximation of MM^T

We first illustrate our lower bound technique by showing hardness of direct approximation of MM^T.

Theorem 3 (Hardness of low-rank approximation for MM^T).
Assume there is an algorithm A which, given any M ∈ R^{n×d}, returns N ∈ R^{n×k} such that ‖MM^T − NN^T‖²_F ≤ Δ1‖MM^T − (MM^T)_k‖²_F in T(M, k) time, for some approximation factor Δ1. For any A ∈ R^{n×d} and C ∈ R^{d×k}, each with integer entries in [−Δ2, Δ2], let B = [A^T, wC]^T where w = 3√Δ1·Δ2²nd. It is possible to compute the product AC in time T(B, 2k) + O(nk^{ω−1}).

Proof. We can write the (n + k) × (n + k) matrix BB^T as:

BB^T = [A^T, wC]^T [A^T, wC] = [ AA^T, wAC ; wC^T A^T, w²C^T C ].

Let Q ∈ R^{(n+k)×2k} be an orthogonal span for the columns of the (n + k) × 2k matrix:

[ 0, wAC ; V, w²C^T C ]

where V ∈ R^{k×k} spans the columns of wC^T A^T ∈ R^{k×n}. The projection QQ^T BB^T gives the best Frobenius norm approximation to BB^T in the span of Q. We can see that:

‖BB^T − (BB^T)_{2k}‖²_F ≤ ‖BB^T − QQ^T BB^T‖²_F ≤ ‖[ AA^T, 0 ; 0, 0 ]‖²_F ≤ Δ2⁴n²d²    (3)

since each entry of A is bounded in magnitude by Δ2, and so each entry of AA^T is bounded by dΔ2². Let N be the matrix returned by running the assumed algorithm on B with rank 2k. In order to achieve the approximation bound of ‖BB^T − NN^T‖²_F ≤ Δ1‖BB^T − (BB^T)_{2k}‖²_F, we must have, for all i, j:

(BB^T − NN^T)²_{i,j} ≤ ‖BB^T − NN^T‖²_F ≤ Δ1Δ2⁴n²d²

where the last inequality is from (3). This gives |BB^T − NN^T|_{i,j} ≤ √Δ1·Δ2²nd. Since A and C have integer entries, each entry in the submatrix wAC of BB^T is an integer multiple of w = 3√Δ1·Δ2²nd. Since (NN^T)_{i,j} approximates this entry to error √Δ1·Δ2²nd, by simply rounding (NN^T)_{i,j} to the nearest multiple of w, we obtain the entry exactly. Thus, given N, we can exactly recover AC in O(nk^{ω−1}) time by computing the n × k submatrix of NN^T corresponding to AC in BB^T.

Theorem 3 gives our main bound, Theorem 1, for the case of the linear kernel ψ(m_i, m_j) = m_i^T m_j.

Proof of Theorem 1 – Linear Kernel.
We apply Theorem 3 after noting that for B = [A^T, wC]^T, nnz(B) ≤ nnz(A) + nk, and so T(B, 2k) = o(nnz(A)k) + O(nk²).

We show in Appendix A that there is an algorithm which nearly matches the lower bound of Theorem 1 for Δ = (1 + ε), for any ε > 0. Further, in Appendix B we show that even just outputting an orthogonal matrix Z ∈ R^{n×k} such that K̃ = ZZ^T MM^T is a relative-error low-rank approximation of MM^T, but not computing a factorization of K̃ itself, is enough to give fast multiplication of integer matrices A and C.

2.2 Lower bound for dot product kernels

We now extend Theorem 3 to general dot product kernels – where ψ(a_i, a_j) = f(a_i^T a_j) for some function f. This includes, for example, the polynomial kernel.

Theorem 4 (Hardness of low-rank approximation for dot product kernels). Consider any kernel ψ : R^d × R^d → R_+ with ψ(a_i, a_j) = f(a_i^T a_j) for some function f which can be expanded as f(x) = Σ_{q=0}^∞ c_q x^q with c_1 ≠ 0 and |c_q/c_1| ≤ G^{q−1} for all q ≥ 2 and some G ≥ 1. Assume there is an algorithm A which, given M ∈ R^{n×d} with kernel matrix K = {ψ(m_i, m_j)}, returns N ∈ R^{n×k} satisfying ‖K − NN^T‖²_F ≤ Δ1‖K − K_k‖²_F in T(M, k) time. For any A ∈ R^{n×d}, C ∈ R^{d×k} with integer entries in [−Δ2, Δ2], let B = [w1·A^T, w2·C]^T with w1 = w2/(12√Δ1·Δ2²nd) and w2 = 1/(4√G·dΔ2²). Then it is possible to compute AC in time T(B, 2k + 1) + O(nk^{ω−1}).

Proof. Using our decomposition of ψ(·,·), we can write the kernel matrix for B as:

K = c_0·1 + c_1·[ w1²AA^T, w1w2·AC ; w1w2·C^T A^T, w2²C^T C ] + c_2 K^(2) + c_3 K^(3) + ...    (4)

where K^(q)_{i,j} = (b_i^T b_j)^q and 1 denotes the all ones matrix of appropriate size.
The key idea is to show that the contribution of the K^(q) terms is small, and so any relative-error rank-(2k+1) approximation to K must recover an approximation to BB^T, and thus the product AC, as in Theorem 3.

By our setting of w2 = 1/(4√G·dΔ2²), the fact that w1 < w2, and our bound on the entries of A and C, we have for all i, j: |b_i^T b_j| ≤ w2²·dΔ2² ≤ 1/(16G). Thus, for any i, j, using that |c_q/c_1| ≤ G^{q−1}:

|Σ_{q=2}^∞ c_q K^(q)_{i,j}| ≤ c_1|b_i^T b_j| · Σ_{q=2}^∞ G^{q−1}|b_i^T b_j|^{q−1} ≤ c_1|b_i^T b_j| · Σ_{q=2}^∞ G^{q−1}/(16G)^{q−1} ≤ (1/12)·c_1|b_i^T b_j|.    (5)

Let K̄ be the matrix K − c_0·1, with its top left n × n block set to 0. K̄ just has its last k columns and rows non-zero, so has rank ≤ 2k. Let Q ∈ R^{(n+k)×(2k+1)} be an orthogonal span for the columns of K̄ along with the all ones vector of length n + k. Let N be the result of running the assumed algorithm on B with rank 2k + 1. Then we have:

‖K − NN^T‖²_F ≤ Δ1‖K − K_{2k+1}‖²_F ≤ Δ1‖K − QQ^T K‖²_F ≤ Δ1·‖[ c_1w1²AA^T + c_2 K̂^(2) + c_3 K̂^(3) + ..., 0 ; 0, 0 ]‖²_F    (6)

where K̂^(q) denotes the top left n × n submatrix of K^(q). By our bound on the entries of A and (5):

(c_1w1²AA^T + c_2 K̂^(2) + c_3 K̂^(3) + ...)_{i,j} ≤ (13/12)·c_1w1²dΔ2² ≤ 2c_1w1²dΔ2².    (7)

Plugging back into (6) and using w1 = w2/(12√Δ1·Δ2²nd), this gives for any i, j:

(K − NN^T)_{i,j} ≤ ‖K − NN^T‖_F ≤ √Δ1·n · 2c_1w1²dΔ2² = c_1w1w2/6.

Since A and C have integer entries, each entry of c_1w1w2·AC is an integer multiple of c_1w1w2. By the decomposition of (4) and the bound of (5), if we subtract c_0 from the corresponding entry of K and round it to the nearest multiple of c_1w1w2, we will recover the entry of AC. By the bound of (7), we can likewise round the corresponding entry of NN^T.
Computing all nk of these entries\ngiven N takes time O(nk!1), giving the theorem.\n\nTheorem 4 lets us lower bound the time to compute a low-rank kernel approximation for any kernel\nfunction expressible as a reasonable power expansion of aT\ni aj. As a straightforward example, it\ngives the lower bound for the polynomial kernel of any degree stated in Theorem 1.\n\nProof of Theorem 1 \u2013 Polynomial Kernel. We apply Theorem 4, noting that (mi, mj) = (c +\nj. Thus c1 6= 0\ni mj)q can be written as f (mT\nmT\nand |cj/c1|\uf8ff Gj1 for G = (q/c). Finally note that nnz(B) \uf8ff nnz(A) + nk giving the result.\n2.3 Lower bound for distance kernels\n\nj=0 cjxj with cj = cqjq\n\ni mj) where f (x) =Pq\n\nWe \ufb01nally extend Theorem 4 to handle kernels like the Gaussian kernel whose value depends on the\nsquared distance kai ajk2 rather than just the dot product aT\n\ni aj. We prove:\n\n6\n\n\fTheorem 5 (Hardness of low-rank approximation for distance kernels). Consider any kernel func-\ntion : Rd\u21e5Rd ! R+ with (ai, aj) = f (kaiajk2) for some function f which can be expanded\nas f (x) =P1q=0 cqxq with c1 6= 0 and |cq/c1|\uf8ff Gq1 and for all q 2 and some G 1.\nAssume there is an algorithm A which given input M 2 Rn\u21e5d with kernel matrix K =\n{ (mi, mj)}, returns N 2 Rn\u21e5k satisfying kK N N Tk2\nF \uf8ff 1kK Kkk in T (M, k) time.\nFor any A 2 Rn\u21e5d, C 2 Rd\u21e5k with integer entries in [2, 2], let B = [w1AT , w2C]T with\n2nd). It is possible to compute AC in T (B, 2k + 3) +\nw1 =\n36p12\nO(nk!1) time.\n\n2nd, w2 =\n\n2)(36p12\n\n(16Gd24\n\nw2\n\n1\n\nThe proof of Theorem 5 is similar to that of Theorem 4, and relegated to Appendix C. The key\nidea is to write K as a polynomial in the distance matrix D with Di,j = kbi bjk2\n2. Since kbi \ni bj, D can be written as 2BBT plus a rank-2 component. 
By setting\nbjk2\nw1, w2 suf\ufb01ciently small, as in the proof of Theorem 4, we ensure that the higher powers of D are\nnegligible, and thus that our low-rank approximation must accurately recover the submatrix of BBT\ncorresponding to AC. Theorem 5 gives Theorem 1 for the popular Gaussian kernel:\n\n2 = kbik2\n\n2 + kbjk2\n\n2 2bT\n\n(1/)q\n\nProof of Theorem 1 \u2013 Gaussian Kernel. (mi, mj) can be written as f (kmimjk2) where f (x) =\nex/ =P1q=0\nxq. Thus c1 6= 0 and |cq/c1|\uf8ff Gq1 for G = 1/. Applying Theorem 5\nand bounding nnz(B) \uf8ff nnz(A) + nk, gives the result.\n3\n\nInput sparsity time kernel PCA for radial basis kernels\n\nq!\n\nTheorem 1 gives little hope for achieving o(nnz(A)k) time for low-rank kernel approximation.\nHowever, the guarantee of (1) is not the only way of measuring the quality of \u02dcK. Here we show that\nfor shift/rotationally invariant kernels, including e.g., radial basis kernels, input sparsity time can be\nachieved for the kernel PCA goal of (2).\n\n3.1 Basic algorithm\nOur technique is based on the random Fourier features technique [RR07]. Given any shift-invariant\nkernel, (x, y) = (x y) with (0) = 1 (we will assume this w.l.o.g. as the function can always\nbe scaled), there is a probability density function p(\u2318) over vectors in Rd such that:\n\np(\u2318) is just the (inverse) Fourier transform of (\u00b7), and is a density function by Bochner\u2019s theorem.\nInformally, given A 2 Rn\u21e5d if we let Z denote the matrix with columns z(\u2318) indexed by \u2318 2 Rd.\nz(\u2318)j = e2\u21e1i\u2318 T aj . Then (8) gives ZP Z\u21e4 = K where P is diagonal with P\u2318,\u2318 = p(\u2318), and Z\u21e4\ndenotes the Hermitian transpose.\nThe idea of random Fourier features is to select s frequencies \u23181, ...,\u2318 s according to the density p(\u2318)\nand set \u02dcZ = 1ps [z(\u23181), ...z(\u2318s)]. \u02dcK = \u02dcZ \u02dcZT is then used to approximate K.\nIn recent work, Avron et al. 
[AKM+17] give a new analysis of random Fourier features. Extending\nprior work on ridge leverage scores in the discrete setting [AM15, CMM17], they de\ufb01ne the ridge\nleverage function for parameter > 0:\n\n\u2327(\u2318) = p(\u2318)z(\u2318)\u21e4(K + I)1z(\u2318)\n\n(9)\n\nAs part of their results, which seek \u02dcK that spectrally approximates K, they prove the following:\nLemma 6. For all \u2318, \u2327(\u2318) \uf8ff n/.\nWhile simple, this bound is key to our algorithm. It was shown in [CMM17] that if the columns of\na matrix are sampled by over-approximations to their ridge leverage scores (with appropriately set\n), the sample is a projection-cost preserving sketch for the original matrix. That is, it can be used\nas a surrogate in computing a low-rank approximation. The results of [CMM17] carry over to the\ncontinuous setting giving, in conjunction with Lemma 6:\n\n7\n\n (x y) =ZRd\n\ne2\u21e1i\u2318 T (xy)p(\u2318)d\u2318.\n\n(8)\n\n\fLemma 7 (Projection-cost preserving sketch via random Fourier features). Consider any A 2 Rn\u21e5d\nand shift-invariant kernel (\u00b7) with (0) = 1, with associated kernel matrix K = { (ai aj)}\nkPn\nfor\nand kernel Fourier transform p(\u2318). For any 0 < \uf8ff 1\nsuf\ufb01ciently large c and let \u02dcZ = 1ps [z(\u23181), ..., z(\u2318s)] where \u23181, ...,\u2318 s are sampled independently\naccording to p(\u2318). Then with probability 1 , for any orthonormal Q 2 Rn\u21e5k and any with\nT = K:\n(10)\n\ni=k+1 i(K), let s = cn log(n/)\n\n\u270f2\n\nF \uf8ff kQQT k2\n\nF \uf8ff (1 + \u270f)kQQT \u02dcZ \u02dcZk2\nF .\n\n(1 \u270f)kQQT \u02dcZ \u02dcZk2\n\nBy (10) if we compute Q satisfying kQQT \u02dcZ \u02dcZk2\nF \uf8ff (1 + \u270f)2k \u02dcZ \u02dcZkk2\n\nkQQT k2\n\nF then we have:\n\n(1 + \u270f)2\n\nF \uf8ff (1 + \u270f)k \u02dcZ \u02dcZkk2\n1 \u270f kUkU T\nF \uf8ff\n= (1 + O(\u270f))k kk2\n\nk k2\n\nF\n\nF\n\nwhere Uk 2 Rn\u21e5k contains the top k column singular vectors of . 
By adjusting constants on \u270f by\nmaking c large enough, we thus have the relative error low-rank approximation guarantee of (2). It\nremains to show that this approach can be implemented ef\ufb01ciently.\n\n3.2\n\nInput sparsity time implementation\n\nGiven \u02dcZ sampled as in Lemma 7, we can \ufb01nd a near optimal subspace Q using any input sparsity\ntime low-rank approximation algorithm (e.g., [CW13, NN13]). We have the following Corollary:\nCorollary 8. Given \u02dcZ sampled as in Lemma 7 with s = \u02dc\u21e5(\nin time \u02dcO(\n\n\u270f2k+1(K) ), there is an algorithm running\n\u270f2k+1(K) ) that computes Q satisfying with high probability, for any with T = K:\n\nn2k\n\nnk\n\nkQQT k2\n\nF \uf8ff (1 + \u270f)k kk2\nF .\n\nWith Corollary 8 in place the main bottleneck to our approach becomes computing \u02dcZ.\n\n3.2.1 Sampling Frequencies\nTo compute \u02dcZ, we \ufb01rst sample \u23181, ...,\u2318 s according to p(\u2318). Here we use the rotational invariance of\n (\u00b7). In this case, p(\u2318) is also rotationally invariant [LSS13] and so, letting \u02c6p(\u00b7) be the distribution\nover norms of vectors sampled from p(\u2318) we can sample \u23181, ...,\u2318 n by \ufb01rst selecting s random\nGaussian vectors and then rescaling them to have norms distributed according to \u02c6p(\u00b7). That is, we\ncan write [\u23181, ...,\u2318 n] = GD where G 2 Rd\u21e5s is a random Gaussian matrix and D is a diagonal\nrescaling matrix with Dii = m\nwith m \u21e0 \u02c6p. We will assume that \u02c6p can be sampled from in\nkGik\nO(1) time. This is true for many natural kernels \u2013 e.g., for the Gaussian kernel, \u02c6p is just a Gaussian\ndensity.\n\n3.2.2 Computing \u02dcZ\nDue to our large sample size, s > n, even writing down G above requires \u2326(nd) time. However,\nto form \u02dcZ we do not need G itself:\nit suf\ufb01ces to compute for m = 1, ..., s the column z(\u2318m)\nwith z(\u2318m)j = e2\u21e1i\u2318 T\nmaj . 
This requires computing AGD, which contains the appropriate dot\nproducts aT\nj \u2318m for all m, j. We use a recent result [KPW16] which shows that this can be performed\napproximately in input sparsity time:\nLemma 9 (From Theorem 1 of [KPW16]). There is an algorithm running in O(nnz(A) +\nlog4 dn3s!1.5\n) time which outputs random B whose distribution has total variation distance at most\n from the distribution of AG where G 2 Rd\u21e5s is a random Gaussian matrix. Here, !< 2.373 is\nthe exponent of fast matrix multiplication.\n\n\n\nProof. Theorem 1 of [KPW16] shows that for B to have total variation distance from the distribu-\ntion of AG it suf\ufb01ces to set B = ACG0 where C is a d \u21e5 O(log4 dn2s1/2/) CountSketch matrix\n\n8\n\n\fand G0 is an O(log4 dn2s1/2/) \u21e5 s random Gaussian matrix. Computing AC requires O(nnz(A))\ntime. Multiplying the result by G0 then requires O( log4 dn3s1.5\n) time if fast matrix multiplication is\nnot employed. Using fast matrix multiplication, this can be improved to O( log4 dn3s!1.5\n\n).\n\n\n\n\n\nApplying Lemma 9 with = 1/200 lets us compute random BD with total variation distance 1/200\nfrom AGD. Thus, the distribution of \u02dcZ generated from this matrix has total variation distance\n\uf8ff 1/200 from the \u02dcZ generated from the true random Fourier features distribution. So, by Corollary\n8, we can use \u02dcZ to compute Q satisfying kQQT k2\nF with probability\n1/100 accounting for the the total variation difference and the failure probability of Corollary 8.\nThis yields our main algorithmic result, Theorem 2.\n\nF \uf8ff (1 + \u270f)k kk2\n\n3.3 An alternative approach\n\nWe conclude by noting that near input sparsity time Kernel PCA can also be achieved for a broad\nclass of kernels using a very different approach. 
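Before that, a minimal sketch of the CountSketch step in Lemma 9: the sketch is applied in one pass over A's columns (O(nnz(A)) time for sparse A), after which only an m × s rather than d × s Gaussian multiply remains. The dimensions here are toy – [KPW16] sets the intermediate dimension to O(log⁴d · n²√s/δ) for the total variation bound to go through – and all names are ours.

```python
import numpy as np

def countsketch(A, h, signs, m):
    """Apply a d x m CountSketch to A's columns in O(nnz(A)) time:
    column j is added into bucket h[j] with random sign signs[j]."""
    out = np.zeros((A.shape[0], m))
    for j in range(A.shape[1]):
        out[:, h[j]] += signs[j] * A[:, j]
    return out

rng = np.random.default_rng(5)
n, d, m, s = 6, 500, 100, 40
h = rng.integers(0, m, size=d)          # random bucket for each coordinate
signs = rng.choice([-1.0, 1.0], size=d)  # random sign for each coordinate
A = rng.standard_normal((n, d))

# Two-stage product standing in for A G: sketch A down to n x m, then
# multiply by an m x s (rather than d x s) Gaussian matrix.
B = countsketch(A, h, signs, m) @ rng.standard_normal((m, s))
assert B.shape == (n, s)
```

The sketch is a linear map, so it composes with the Gaussian multiply exactly as in the two-stage factorization of Lemma 9.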
We can approximate the kernel function via an expansion into polynomial kernel matrices, as is done in [CKS11], and then apply the sketching algorithms for the polynomial kernel developed in [ANW14]. As long as the expansion achieves high accuracy with low degree, and as long as σ_{k+1} is not too small – since this controls the necessary approximation factor – this technique can yield runtimes of the form Õ(nnz(A)) + poly(n, k, 1/σ_{k+1}, 1/ε), giving improved dependence on n for some kernels over our random Fourier features method. Improving the poly(n, k, 1/σ_{k+1}, 1/ε) term in both these methods, and especially removing the 1/σ_{k+1} dependence and achieving linear dependence on n, is an interesting open question for future work.

4 Conclusion

In this work we have shown that for a broad class of kernels, including the Gaussian, polynomial, and linear kernels, given a data matrix A, computing a relative error low-rank approximation to A's kernel matrix K (i.e., satisfying (1)) requires at least Ω(nnz(A)k) time, barring a major breakthrough in the runtime of matrix multiplication. In the constant error regime, this lower bound essentially matches the runtimes given by recent work on subquadratic time kernel and PSD matrix low-rank approximation [MM16, MW17].

We show that for the alternative kernel PCA guarantee of (2), a potentially faster runtime of O(nnz(A)) + poly(n, k, 1/σ_{k+1}, 1/ε) can be achieved for general shift- and rotation-invariant kernels. Practically, improving the second term in our runtime, especially the poor dependence on n, is an important open question. Generally, computing the kernel matrix K explicitly requires O(n²d) time, and so our algorithm only gives runtime gains when d is large compared to n – at least Ω(n^{ω−0.5}), even ignoring k, σ_{k+1}, and ε dependencies.
Theoretically, removing the dependence on σ_{k+1} would be of interest, as it would give input sparsity runtime without any assumptions on the matrix A (i.e., that σ_{k+1} is not too small). Resolving this question has strong connections to finding efficient kernel subspace embeddings, which approximate the full spectrum of K.

References

[AKM+17] Haim Avron, Michael Kapralov, Cameron Musco, Christopher Musco, Ameya Velingker, and Amir Zandieh. Random Fourier features for kernel ridge regression: Approximation bounds and statistical guarantees. In Proceedings of the 34th International Conference on Machine Learning (ICML), 2017.

[AM15] Ahmed Alaoui and Michael W. Mahoney. Fast randomized kernel ridge regression with statistical guarantees. In Advances in Neural Information Processing Systems 28 (NIPS), pages 775–783, 2015.

[AMS01] Dimitris Achlioptas, Frank McSherry, and Bernhard Schölkopf. Sampling techniques for kernel methods. In Advances in Neural Information Processing Systems 14 (NIPS), 2001.

[ANW14] Haim Avron, Huy Nguyen, and David Woodruff. Subspace embeddings for the polynomial kernel. In Advances in Neural Information Processing Systems 27 (NIPS), pages 2258–2266, 2014.

[BIS17] Arturs Backurs, Piotr Indyk, and Ludwig Schmidt. On the fine-grained complexity of empirical risk minimization: Kernel methods and neural networks. In Advances in Neural Information Processing Systems 30 (NIPS), 2017.

[BJ02] Francis Bach and Michael I. Jordan. Kernel independent component analysis. Journal of Machine Learning Research, 3(Jul):1–48, 2002.

[BW09] Mohamed-Ali Belabbas and Patrick J. Wolfe. Spectral methods in machine learning: New strategies for very large datasets. Proceedings of the National Academy of Sciences of the USA, 106:369–374, 2009.

[CEM+15] Michael B. Cohen, Sam Elder, Cameron Musco, Christopher Musco, and Madalina Persu.
Dimensionality reduction for k-means clustering and low rank approximation. In Proceedings of the 47th Annual ACM Symposium on Theory of Computing (STOC), pages 163–172, 2015.

[CKS11] Andrew Cotter, Joseph Keshet, and Nathan Srebro. Explicit approximations of the Gaussian kernel. arXiv:1109.4603, 2011.

[CMM17] Michael B. Cohen, Cameron Musco, and Christopher Musco. Input sparsity time low-rank approximation via ridge leverage score sampling. In Proceedings of the 28th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1758–1777, 2017.

[CW13] Kenneth L. Clarkson and David P. Woodruff. Low rank approximation and regression in input sparsity time. In Proceedings of the 45th Annual ACM Symposium on Theory of Computing (STOC), pages 81–90, 2013.

[CW17] Kenneth L. Clarkson and David P. Woodruff. Low-rank PSD approximation in input-sparsity time. In Proceedings of the 28th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 2061–2072, 2017.

[DM05] Petros Drineas and Michael W. Mahoney. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. Journal of Machine Learning Research, 6:2153–2175, 2005.

[FS02] Shai Fine and Katya Scheinberg. Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2:243–264, 2002.

[FT07] Shmuel Friedland and Anatoli Torokhti. Generalized rank-constrained matrix approximations. SIAM Journal on Matrix Analysis and Applications, 29(2):656–659, 2007.

[GM13] Alex Gittens and Michael Mahoney. Revisiting the Nyström method for improved large-scale machine learning. In Proceedings of the 30th International Conference on Machine Learning (ICML), pages 567–575, 2013. Full version at arXiv:1303.1849.

[GU17] François Le Gall and Florent Urrutia. Improved rectangular matrix multiplication using powers of the Coppersmith-Winograd tensor.
arXiv:1708.05622, 2017.

[HDC+01] Ingrid Hedenfalk, David Duggan, Yidong Chen, Michael Radmacher, Michael Bittner, Richard Simon, Paul Meltzer, Barry Gusterson, Manel Esteller, Mark Raffeld, et al. Gene-expression profiles in hereditary breast cancer. New England Journal of Medicine, 344(8):539–548, 2001.

[JDMP11] Asif Javed, Petros Drineas, Michael W. Mahoney, and Peristera Paschou. Efficient genomewide selection of PCA-correlated tSNPs for genotype imputation. Annals of Human Genetics, 75(6):707–722, 2011.

[KPW16] Michael Kapralov, Vamsi Potluru, and David Woodruff. How to fake multiply by a Gaussian matrix. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pages 2101–2110, 2016.

[LG12] François Le Gall. Faster algorithms for rectangular matrix multiplication. In Proceedings of the 53rd Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 514–523, 2012.

[LG14] François Le Gall. Powers of tensors and fast matrix multiplication. In Proceedings of the 39th International Symposium on Symbolic and Algebraic Computation (ISSAC), pages 296–303. ACM, 2014.

[LSS13] Quoc Le, Tamás Sarlós, and Alexander Smola. Fastfood – Computing Hilbert space expansions in loglinear time. In Proceedings of the 30th International Conference on Machine Learning (ICML), pages 244–252, 2013.

[MM16] Cameron Musco and Christopher Musco. Recursive sampling for the Nyström method. In Advances in Neural Information Processing Systems 30 (NIPS), 2016.

[MW17] Cameron Musco and David P. Woodruff. Sublinear time low-rank approximation of positive semidefinite matrices. In Proceedings of the 58th Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2017.

[NN13] Jelani Nelson and Huy L. Nguyễn. OSNAP: Faster numerical linear algebra algorithms via sparser subspace embeddings.
In Proceedings of the 54th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 117–126, 2013.

[RR07] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems 20 (NIPS), pages 1177–1184, 2007.

[SS00] Alex J. Smola and Bernhard Schölkopf. Sparse greedy matrix approximation for machine learning. In Proceedings of the 17th International Conference on Machine Learning (ICML), pages 911–918, 2000.

[WS01] Christopher Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems 14 (NIPS), pages 682–688, 2001.

[WZ13] Shusen Wang and Zhihua Zhang. Improving CUR matrix decomposition and the Nyström approximation via adaptive sampling. Journal of Machine Learning Research, 14:2729–2769, 2013.

[ZTK08] Kai Zhang, Ivor W. Tsang, and James T. Kwok. Improved Nyström low-rank approximation and error analysis. In Proceedings of the 25th International Conference on Machine Learning (ICML), pages 1232–1239, 2008.