{"title": "Recursive Sampling for the Nystrom Method", "book": "Advances in Neural Information Processing Systems", "page_first": 3833, "page_last": 3845, "abstract": "We give the first algorithm for kernel Nystrom approximation that runs in linear time in the number of training points and is provably accurate for all kernel matrices, without dependence on regularity or incoherence conditions. The algorithm projects the kernel onto a set of s landmark points sampled by their ridge leverage scores, requiring just O(ns) kernel evaluations and O(ns^2) additional runtime. While leverage score sampling has long been known to give strong theoretical guarantees for Nystrom approximation, by employing a fast recursive sampling scheme, our algorithm is the first to make the approach scalable. Empirically we show that it finds more accurate kernel approximations in less time than popular techniques such as classic Nystrom approximation and the random Fourier features method.", "full_text": "Recursive Sampling for the Nystr\u00f6m Method\n\nCameron Musco\n\nMIT EECS\n\ncnmusco@mit.edu\n\nChristopher Musco\n\nMIT EECS\n\ncpmusco@mit.edu\n\nAbstract\n\nWe give the \ufb01rst algorithm for kernel Nystr\u00f6m approximation that runs in linear\ntime in the number of training points and is provably accurate for all kernel matrices,\nwithout dependence on regularity or incoherence conditions. The algorithm projects\nthe kernel onto a set of s landmark points sampled by their ridge leverage scores,\nrequiring just O(ns) kernel evaluations and O(ns2) additional runtime. While\nleverage score sampling has long been known to give strong theoretical guarantees\nfor Nystr\u00f6m approximation, by employing a fast recursive sampling scheme, our\nalgorithm is the \ufb01rst to make the approach scalable. 
Empirically we show that it finds more accurate kernel approximations in less time than popular techniques such as classic Nyström approximation and the random Fourier features method.

1 Introduction

The kernel method is a powerful tool for applying linear learning algorithms (SVMs, linear regression, etc.) to nonlinear problems. The key idea is to map data to a higher dimensional kernel feature space, where linear relationships correspond to nonlinear relationships in the original data.
Typically this mapping is implicit. A kernel function is used to compute inner products in the high-dimensional kernel space, without ever actually mapping original data points to the space. Given n data points x1, . . . , xn, the n × n kernel matrix K is formed, where K_{i,j} contains the high-dimensional inner product between x_i and x_j, as computed by the kernel function. All computations required by a linear learning method are performed using the inner product information in K.
Unfortunately, the transition from linear to nonlinear comes at a high cost: just generating the entries of K requires Θ(n^2) time, which is prohibitive for large datasets.

1.1 Kernel approximation

A large body of work seeks to accelerate kernel methods by finding a compressed, often low-rank, approximation ˜K to the true kernel matrix K. Techniques include random sampling and embedding [AMS01, BBV06, ANW14], random Fourier feature methods for shift-invariant kernels [RR07, RR09, LSS13], and incomplete Cholesky factorization [FS02, BJ02].
One of the most popular techniques is the Nyström method, which constructs ˜K using a subset of "landmark" data points [WS01]. Once s data points are selected, ˜K (in factored form) takes just O(ns) kernel evaluations and O(s^3) additional time to compute, requires O(ns) space to store, and can be manipulated quickly in downstream applications.
E.g., inverting ˜K takes O(ns^2) time.
The Nyström method performs well in practice [YLM+12, GM13, TRVR16], is widely implemented [HFH+09, PVG+11, IBM14], and is used in a number of applications under different names such as "landmark isomap" [DST03] and "landmark MDS" [Pla05]. In the classic variant, landmark points are selected uniformly at random. However, significant research seeks to improve performance via data-dependent sampling that selects landmarks which more closely approximate the full kernel matrix than uniformly sampled landmarks [SS00, DM05, ZTK08, BW09, KMT12, WZ13, GM13, LJS16].
Theoretical work has converged on leverage score based approaches, as they give the strongest provable guarantees for both kernel approximation [DMM08, GM13] and statistical performance in downstream applications [AM15, RCR15, Wan16]. Leverage scores capture how important an individual data point is in composing the span of the kernel matrix.
Unfortunately, these scores are prohibitively expensive to compute. All known approximation schemes require Ω(n^2) time or only run quickly under strong conditions on K – e.g. good conditioning or data "incoherence" [DMIMW12, GM13, AM15, CLV16]. Hence, leverage score based approaches remain largely in the domain of theory, with limited practical impact [KMT12, LBKL15, YPW15].

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

1.2 Our contributions

In this work, we close the gap between strong approximation bounds and efficiency: we present a new Nyström algorithm based on recursive leverage score sampling which achieves the "best of both worlds": it produces kernel approximations provably matching the high accuracy of leverage score methods while only requiring O(ns) kernel evaluations and O(ns^2) runtime for s landmark points.
Theoretically, this runtime is surprising.
In the typical case when s ≪ n, the algorithm evaluates just a small subset of K, ignoring most of the kernel space inner products. Yet its performance guarantees hold for general kernels, requiring no assumptions on coherence or regularity.
Empirically, the runtime's linear dependence on n means that our method is the first leverage score algorithm that can compete with the most commonly implemented techniques, including the classic uniform sampling Nyström method and random Fourier features sampling [RR07]. Since our algorithm obtains higher quality samples, we show experimentally that it outperforms these methods on benchmark datasets – it can obtain as accurate a kernel approximation in significantly less time. Our approximations also have lower rank, so they can be stored in less space and processed more quickly in downstream learning tasks.

1.3 Paper outline

Our recursive sampling algorithm is built on top of a Nyström scheme of Alaoui and Mahoney that samples landmark points based on their ridge leverage scores [AM15]. After reviewing preliminaries in Section 2, in Section 3 we analyze this scheme, which we refer to as RLS-Nyström. To simplify prior work, which studies the statistical performance of RLS-Nyström for specific kernel learning tasks [AM15, RCR15, Wan16], we prove a strong, application independent approximation guarantee: for any λ, if ˜K is constructed with s = Θ(d^λ_eff log(d^λ_eff/δ)) samples¹, where d^λ_eff = tr(K(K + λI)^{-1}) is the so-called "λ-effective dimensionality" of K, then with high probability, ‖K − ˜K‖_2 ≤ λ.
In Appendix E, we show that this guarantee implies bounds on the statistical performance of RLS-Nyström for kernel ridge regression and canonical correlation analysis.
We also use it to prove new\nresults on the performance of RLS-Nystr\u00f6m for kernel rank-k PCA and k-means clustering \u2013 in both\ncases just O(k log k) samples are required to obtain a solution with good accuracy.\nAfter af\ufb01rming the favorable theoretical properties of RLS-Nystr\u00f6m, in Section 4 we show that its\nruntime can be signi\ufb01cantly improved using a recursive sampling approach. Intuitively our algorithm\nis simple. We show how to approximate the kernel ridge leverage scores using a uniform sample of 1\n2\nof our input points. While the subsampled kernel matrix still has a prohibitive n2/4 entries, we can\nrecursively approximate it, using our same sampling algorithm. If our \ufb01nal Nystr\u00f6m approximation\nwill use s landmarks, the recursive approximation only needs rank O(s), which lets us estimate\nthe ridge leverage scores of the original kernel matrix in just O(ns2) time. Since n is cut in half\n\nat each level of recursion, our total runtime is O\u21e3ns2 + ns2\n\n2 + ns2\n\n4 + ...\u2318 = O(ns2), signi\ufb01cantly\n\nimproving upon the method of [AM15], which takes \u21e5(n3) time in the worst case.\nOur approach builds on recent work on iterative sampling methods for approximate linear algebra\n[CLM+15, CMM17]. While the analysis in the kernel setting is technical, our \ufb01nal algorithm is\n\n1This is within a log factor of the best possible for any low-rank approximation with error .\n\n2\n\n\fsimple and easy to implement. We present and test a parameter-free variation of Recursive RLS-\nNystr\u00f6m in Section 5, con\ufb01rming superior performance compared to existing methods.\n\n2 Preliminaries\nConsider an input space X and a positive semide\ufb01nite kernel function K : X\u21e5X! R. Let\nbe a (typically nonlinear)\nF be an associated reproducing kernel Hilbert space and : X!F\nfeature map such that for any x, y 2X , K(x, y) = h(x), (y)iF. Given a set of n input points\nx1, . . . 
, xn 2X , de\ufb01ne the kernel matrix K 2 Rn\u21e5n by Ki,j = K(xi, xj).\nIt is often natural to consider the kernelized data matrix that generates K. Informally, let 2 Rn\u21e5d0\nbe the matrix containing (x1), ..., (xn) as its rows (note that d0 may be in\ufb01nite). K = T .\nWhile we use for intuition, in our formal proofs we replace it with any matrix B 2 Rn\u21e5n satisfying\nBBT = K (e.g. a Cholesky factor). Such a B is guaranteed to exist since K is positive semide\ufb01nite.\nWe repeatedly use the singular value decomposition, which allows us to write any rank r matrix\nM 2 Rn\u21e5d as M = U\u2303VT, where U 2 Rn\u21e5r and V 2 Rd\u21e5r have orthogonal columns (the left\nand right singular vectors of M), and \u2303 2 Rr\u21e5r is a positive diagonal matrix containing the singular\nvalues: 1(M) 2(M) . . . r(M) > 0. M\u2019s pseudoinverse is given by M+ = V\u23031UT .\n2.1 Nystr\u00f6m approximation\n\nThe Nystr\u00f6m method selects a subset of \u201clandmark\u201d points and uses them to construct a low-rank\napproximation to K. Given a matrix S 2 Rn\u21e5s that has a single entry in each column equal to 1 so\nthat KS is a subset of s columns from K, the associated Nystr\u00f6m approximation is:\n\n\u02dcK = KS(ST KS)+ST K.\n\n(1)\n\u02dcK can be stored in O(ns) space by separately storing KS 2 Rn\u21e5s and (ST KS)+ 2 Rs\u21e5s. Further-\nmore, the factors can be computed using just O(ns) evaluations of the kernel inner product to form\nKS and O(s3) time to compute (ST KS)+. Typically s \u2327 n so these costs are signi\ufb01cantly lower\nthan the cost to form and store the full kernel matrix K.\nWe view Nystr\u00f6m approximation as a low-rank approximation to the dataset in feature space. Re-\ncalling that K = T , S selects s kernelized data points ST and we approximate using its\nprojection onto these points. Informally, let PS 2 Rd0\u21e5d0 be the orthogonal projection onto the row\nspan of ST . We approximate by \u02dc def= PS. 
We can write PS = T S(ST T S)+ST .\nSince it is an orthogonal projection, PSPT\n\nS = PS, and so we can write:\n\nS = P2\n\n\u02dcK = \u02dc \u02dcT = P2\n\nST = T S(ST T S)+ST T = KS(ST KS)+ST K.\n\nThis recovers the standard Nystr\u00f6m approximation (1).\n\n3 The RLS-Nystr\u00f6m method\n\nWe now introduce the RLS-Nystr\u00f6m method, which uses ridge leverage score sampling to select\nlandmark data points, and discuss its strong approximation guarantees for any kernel matrix K.\n\n3.1 Ridge leverage scores\n\nIn classical Nystr\u00f6m approximation (1), S is formed by sampling data points uniformly at random.\nUniform sampling can work in practice, but it only gives theoretical guarantees under strong regularity\nor incoherence assumptions on K [Git11]. It will fail for many natural kernel matrices where the\nrelative \u201cimportance\u201d of points is not uniform across the dataset\nFor example, imagine a dataset where points fall into several clusters, but one of the clusters is much\nlarger than the rest. Uniform sampling will tend to oversample landmarks from the large cluster while\nundersampling or possibly missing smaller but still important clusters. Approximation of K and\nlearning performance (e.g. classi\ufb01cation accuracy) will decline as a result.\n\n3\n\n\f(a) Uniform landmark sampling.\n\n(b) Improved landmark sampling.\n\nFigure 1: Uniform sampling for Nystr\u00f6m approximation can oversample from denser parts of the\ndataset. A better Nystr\u00f6m scheme will select points that more equally cover the relevant data.\n\nTo combat this issue, alternative methods compute a measure of point importance that is used to\nselect landmarks. For example, one heuristic applies k-means clustering to the input and takes the\ncluster centers as landmarks [ZTK08]. A large body of theoretical work measures importance using\nvariations on the statistical leverage scores. 
One natural variation is the ridge leverage score:

Definition 1 (Ridge leverage scores [AM15]). For any λ > 0, the λ-ridge leverage score of data point x_i with respect to the kernel matrix K is defined as

    l^λ_i(K) := (K(K + λI)^{-1})_{i,i},    (2)

where I is the n × n identity matrix. For any B ∈ R^{n×n} satisfying BB^T = K, we can also write:

    l^λ_i(K) = b_i^T (B^T B + λI)^{-1} b_i,    (3)

where b_i^T ∈ R^{1×n} is the i-th row of B.

For conciseness we typically write l^λ_i(K) as l^λ_i. To check that (2) and (3) are equivalent, note that b_i^T (B^T B + λI)^{-1} b_i = (B(B^T B + λI)^{-1} B^T)_{i,i}. Using the SVD to write B = UΣV^T, and accordingly K = UΣ^2 U^T, confirms that K(K + λI)^{-1} = B(B^T B + λI)^{-1} B^T = UΣ^2 (Σ^2 + λI)^{-1} U^T.
It is not hard to check (see [CLM+15]) that the ridge scores can be defined alternatively as:

    l^λ_i = min_{y ∈ R^n} (1/λ)‖b_i^T − y^T B‖_2^2 + ‖y‖_2^2.    (4)

This formulation provides better insight into these scores. Since BB^T = K, any kernel algorithm effectively works with B's rows as data points. The ridge scores reflect the relative importance of these rows. From (4) it's clear that l^λ_i ≤ 1, since we can set y to the i-th standard basis vector. b_i will have score ≪ 1 (i.e. is less important) when it's possible to find a more "spread out" y that uses other rows in B to approximately reconstruct b_i – in other words, when the row is less unique.

3.2 Sum of ridge leverage scores

As is standard in leverage score methods, we don't directly select landmarks to be the points with the highest scores. Instead, we sample each point with probability proportional to l^λ_i. Accordingly, the number of landmarks selected, which controls ˜K's rank, is a random variable with expectation equal to the sum of the λ-ridge leverage scores. To ensure compact kernel approximations, we want this sum to be small.
Immediately from Definition 1, we have:

Fact 2. Σ_{i=1}^n l^λ_i(K) = tr(K(K + λI)^{-1}).

We denote d^λ_eff := tr(K(K + λI)^{-1}). d^λ_eff is a natural quantity, referred to as the "effective dimension" or "degrees of freedom" for a ridge regression problem on K with regularization λ [HTF02, Zha06]. d^λ_eff increases monotonically as λ decreases. For any fixed λ it is essentially the smallest possible rank achievable for ˜K satisfying the approximation guarantee given by RLS-Nyström: ‖K − ˜K‖_2 < λ.

3.3 The basic sampling algorithm

We can now introduce the RLS-Nyström method as Algorithm 1. We allow sampling each point by any probability greater than l^λ_i, which is useful later when we compute the scores approximately. Oversampling landmarks can only improve ˜K's accuracy. It could cause us to take more samples, but we will always ensure that the sum of our approximate ridge leverage scores is not too large.

Algorithm 1 RLS-NYSTRÖM SAMPLING
input: x1, . . . , xn ∈ X, kernel matrix K, ridge parameter λ > 0, failure probability δ ∈ (0, 1/8)
1: Compute an over-approximation ˜l^λ_i > l^λ_i for the λ-ridge leverage score of each x1, . . . , xn.
2: Set p_i := min{1, ˜l^λ_i · 16 log(Σ_i ˜l^λ_i / δ)}.
3: Construct S ∈ R^{n×s} by sampling x1, . . . , xn each independently with probability p_i. In other words, for each i add a column to S with a 1 in position i with probability p_i.
4: return the Nyström factors KS ∈ R^{n×s} and (S^T KS)^+ ∈ R^{s×s}.

3.4 Accuracy bounds

We show that RLS-Nyström produces ˜K which spectrally approximates K up to a small additive error. This is the strongest type of approximation offered by any known Nyström method [GM13]. It guarantees provable accuracy when ˜K is used in place of K in many learning applications [CMT10].
Theorem 3 (Spectral error approximation).
For any λ > 0 and δ ∈ (0, 1/8), Algorithm 1 returns S ∈ R^{n×s} such that with probability 1 − δ, s ≤ 2 Σ_i p_i and ˜K = KS(S^T KS)^+ S^T K satisfies:

    ˜K ⪯ K ⪯ ˜K + λI.    (5)

When ridge scores are computed exactly, Σ_i p_i = O(d^λ_eff log(d^λ_eff / δ)).
⪯ denotes the Loewner ordering: M ⪯ N means that N − M is positive semidefinite. Note that (5) immediately implies the well studied (see e.g. [GM13]) spectral norm guarantee ‖K − ˜K‖_2 ≤ λ.
Intuitively, Theorem 3 guarantees that ˜K well approximates the top of K's spectrum (i.e. any eigenvalues > λ) while losing information about smaller, less important eigenvalues. Due to space constraints, we defer the proof to Appendix A. It relies on the view of Nyström approximation as a low-rank projection of the kernelized data (see Section 2.1), and we use an intrinsic dimension matrix Bernstein bound to show accuracy of the sampled approximation.
Often the regularization parameter λ is specified for a learning task, and for near optimal performance on this task, we set the approximation factor in Theorem 3 to ελ. In this case we have:
Corollary 4 (Tighter spectral error approximation). For any λ > 0 and δ ∈ (0, 1/8), Algorithm 1 run with ridge parameter ελ returns S ∈ R^{n×s} such that with probability 1 − δ, s = O((d^λ_eff / ε) log(d^λ_eff / (εδ))) and ˜K = KS(S^T KS)^+ S^T K satisfies ˜K ⪯ K ⪯ ˜K + ελI.

Proof. This follows from Theorem 3 by noting d^{ελ}_eff ≤ d^λ_eff / ε, since (K + ελI)^{-1} ⪯ (1/ε)(K + λI)^{-1}.

Corollary 4 suffices to prove that ˜K can be used in place of K without sacrificing performance on kernel ridge regression and canonical correlation tasks [AM15, Wan16]. We also use it to prove a projection-cost preservation guarantee (Theorem 12, Appendix B), which gives approximation bounds for kernel PCA and k-means clustering.
Projection-cost preservation has proven a powerful concept in the matrix sketching literature [FSS13, CEM+15, CMM17, BWZ16, CW17] and we hope that extending the guarantee to kernels leads to applications beyond those considered in this work.
Our results on downstream learning bounds that can be derived from Theorem 3 are summarized in Table 1. Details can be found in Appendices B and E.

Table 1: Downstream guarantees for ˜K obtained from RLS-Nyström (Algorithm 1).

Application                          | Guarantee                              | Theorem | Space to store ˜K*
Kernel Ridge Regression w/ param λ   | (1 + ε) relative error risk bound      | Thm 16  | Õ(n d^λ_eff / ε)
Kernel k-means Clustering            | (1 + ε) relative error                 | Thm 17  | Õ(nk / ε)
Rank k Kernel PCA                    | (1 + ε) relative Frob norm error       | Thm 18  | Õ(nk / ε)
Kernel CCA w/ params λx, λy          | ε additive error                       | Thm 19  | Õ((n d^{λx}_eff + n d^{λy}_eff) / ε)

* For conciseness, Õ(·) hides log factors in the failure probability δ, d_eff, and k.

4 Recursive sampling for efficient RLS-Nyström

Having established strong approximation guarantees for RLS-Nyström, it remains to provide an efficient implementation. Specifically, Step 1 of Algorithm 1 naively requires Θ(n^3) time. We show that significant acceleration is possible using a recursive sampling approach.

4.1 Ridge leverage score approximation via uniform sampling

The key is to estimate the leverage scores by computing (3) approximately, using a uniform sample of the data points. To ensure accuracy, the sample must be large – a constant fraction of the points. Our fast runtimes are achieved by recursively approximating this large sample. In Appendix F we prove:
Lemma 5.
For any B ∈ R^{n×n} with BB^T = K and S ∈ R^{n×s} chosen by sampling each data point independently with probability 1/2, let ˜l_i = b_i^T (B^T S S^T B + λI)^{-1} b_i and p_i = min{1, 16 ˜l_i log(Σ_i ˜l_i / δ)} for any δ ∈ (0, 1/8). Then with probability at least 1 − δ:

    1) ˜l_i ≥ l^λ_i for all i,    2) Σ_i p_i ≤ 64 Σ_i l^λ_i log(Σ_i l^λ_i / δ).

The first condition ensures that the approximate scores ˜l_i suffice for use in Algorithm 1. The second ensures that the Nyström approximation obtained will not have too many sampled landmarks.
Naively computing ˜l_i in Lemma 5 involves explicitly forming B, requiring Ω(n^2) time (e.g. Θ(n^3) via Cholesky decomposition). Fortunately, the following formula (proof in Appx. F) avoids this cost:
Lemma 6. For any sampling matrix S ∈ R^{n×s}, and any λ > 0:

    ˜l_i := b_i^T (B^T S S^T B + λI)^{-1} b_i = (1/λ) (K − KS(S^T KS + λI)^{-1} S^T K)_{i,i}.

It follows that we can compute ˜l_i for all i in O(ns^2) time, using just O(ns) kernel evaluations to compute KS and the diagonal of K.

4.2 Recursive RLS-Nyström

We apply Lemmas 5 and 6 to give an efficient recursive implementation of RLS-Nyström, Algorithm 2. We show that the output of this algorithm, S, is sampled according to approximate ridge leverage scores for K and thus satisfies the approximation guarantee of Theorem 3.
Theorem 7 (Main Result). Let S ∈ R^{n×s} be computed by Algorithm 2. With probability 1 − 3δ, s ≤ 384 · d^λ_eff log(d^λ_eff / δ), S is sampled by overestimates of the λ-ridge leverage scores of K, and thus by Theorem 3, the Nyström approximation ˜K = KS(S^T KS)^+ S^T K satisfies:

    ˜K ⪯ K ⪯ ˜K + λI.

Algorithm 2 uses O(ns) kernel evaluations and O(ns^2) computation time.

Algorithm 2 RECURSIVERLS-NYSTRÖM
input: x1, . . . , xm ∈ X, kernel function K : X × X → R, ridge λ > 0, failure prob. δ ∈ (0, 1/32)
output: weighted sampling matrix S ∈ R^{m×s}
1: if m ≤ 192 log(1/δ) then
2:    return S := I_{m×m}.
3: end if
4: Let S̄ be a random subset of {1, ..., m}, with each i included independently with probability 1/2. Let X̄ = {x_{i1}, x_{i2}, ..., x_{i|S̄|}} for i_j ∈ S̄ be the data sample corresponding to S̄, and let S̄ = [e_{i1}, e_{i2}, ..., e_{i|S̄|}] be the sampling matrix corresponding to S̄.
5: S̃ := RECURSIVERLS-NYSTRÖM(X̄, K, λ, δ/3).
6: Ŝ := S̄ · S̃.
7: Set ˜l_i := (3/(2λ)) (K − KŜ(Ŝ^T KŜ + λI)^{-1} Ŝ^T K)_{i,i} for each i ∈ {1, . . . , m}. (By Lemma 6, this equals (3/2) (B(B^T Ŝ Ŝ^T B + λI)^{-1} B^T)_{i,i}, where K denotes the kernel matrix for data points {x1, . . . , xm} and kernel function K.)
8: Set p_i := min{1, 16 ˜l_i log(Σ_i ˜l_i / δ)} for each i ∈ {1, . . . , m}.
9: Initially set the weighted sampling matrix S to be empty. For each i ∈ {1, . . . , m}, with probability p_i, append the column (1/√p_i) e_i onto S.
10: return S.

Note that in Algorithm 2 the columns of S are weighted by 1/√p_i. The Nyström approximation ˜K = KS(S^T KS)^+ S^T K is not affected by column weights (see derivation in Section 2.1). However, the weighting is necessary when the output is used in recursive calls (i.e. when S̃ is used in Step 6).
We prove Theorem 7 via the following intermediate result:
Theorem 8. For any inputs x1, . . . , xm, K, λ > 0 and δ ∈ (0, 1/32), let K be the kernel matrix for x1, . . . , xm and kernel function K, and let d^λ_eff(K) be the effective dimension of K with parameter λ. With probability (1 − 3δ), RECURSIVERLS-NYSTRÖM outputs S with s columns that satisfies:

    (1/2)(B^T B + λI) ⪯ (B^T S S^T B + λI) ⪯ (3/2)(B^T B + λI)    (6)

for any B with BB^T = K. Additionally, s ≤ smax(d^λ_eff(K), δ), where smax(w, z) := 384 · (w + 1) log((w + 1)/z).
The algorithm uses at most c1 · m · smax(d^λ_eff(K), δ) kernel evaluations and at most c2 · m · smax(d^λ_eff(K), δ)^2 additional computation time, where c1 and c2 are fixed universal constants.
Theorem 8 is proved via an inductive argument, given in Appendix C. Roughly, consider in Step 6 of Algorithm 2 setting Ŝ := S̄ instead of S̄ · S̃. By Lemma 5 and the formula in Lemma 6, the leverage score approximations ˜l_i computed in Step 7 would be good approximations to the true leverage scores, and S would satisfy Theorem 8 by a standard matrix Bernstein bound (see Lemma 9).
However, if we set Ŝ := S̄, it will have n/2 columns in expectation, and the computation in Step 7 will be expensive – requiring roughly O(n^3) time. By recursively calling Algorithm 2 and applying Theorem 8 inductively, we obtain S̃ satisfying with high probability:

    (1/2)(B^T S̄ S̄^T B + λI) ⪯ (B^T S̄ S̃ S̃^T S̄^T B + λI) ⪯ (3/2)(B^T S̄ S̄^T B + λI).

This guarantee ensures that when we use Ŝ = S̄ · S̃ in place of S̄ in Step 7, the leverage score estimates are changed only by a constant factor. Thus, sampling by these estimates still gives us the desired guarantee (6). Further, S̃ and therefore Ŝ has just O(smax(d^λ_eff(K), δ)) columns, so Step 7 can be performed very efficiently, within the stated runtime bounds.
With Theorem 8 we can easily prove our main result, Theorem 7.

Proof of Theorem 7. In our proof of Theorem 3 in Appendix A.1, we show that if

    (1/2)(B^T B + λI) ⪯ (B^T S S^T B + λI) ⪯ (3/2)(B^T B + λI)

for a weighted sampling matrix S, then even if we remove the weights from S so that it has all unit entries (they don't affect the Nyström approximation), ˜K = KS(S^T KS)^+ S^T K satisfies:

    ˜K ⪯ K ⪯ ˜K + λI.

The runtime bounds also follow nearly directly from Theorem 8.
In particular, we have established that O(n · smax(d^λ_eff(K), δ)) kernel evaluations and O(n · smax(d^λ_eff(K), δ)^2) additional runtime are required by RECURSIVERLS-NYSTRÖM. We only needed the upper bound to prove Theorem 8, but along the way actually show that in a successful run of RECURSIVERLS-NYSTRÖM, S has Θ(d^λ_eff(K) log(d^λ_eff(K)/δ)) columns. Additionally, we may assume that d^λ_eff(K) ≥ 1/2. If it is not, then it's not hard to check (see proof of Lemma 20) that λ must be ≥ ‖K‖. If this is the case, the guarantee of Theorem 7 is vacuous: any Nyström approximation ˜K satisfies ˜K ⪯ K ⪯ ˜K + λI.
With d^λ_eff(K) ≥ 1/2, d^λ_eff(K) log(d^λ_eff(K)/δ) and thus s are Θ(smax(d^λ_eff(K), δ)), so we conclude that Theorem 7 uses O(ns) kernel evaluations and O(ns^2) additional runtime.

5 Empirical evaluation

We conclude with an empirical evaluation of our recursive RLS-Nyström method. We use a variant of Algorithm 2 where, instead of choosing a regularization parameter λ, the user sets a sample size s, and λ is automatically determined such that s = Θ(d^λ_eff · log(d^λ_eff/δ)). This variant is practically appealing as it essentially yields the best possible approximation to K for a fixed sample budget. Pseudocode and proofs of correctness are included in Appendix D.

5.1 Performance of Recursive RLS-Nyström for kernel approximation

We evaluate RLS-Nyström on the YearPredictionMSD, Covertype, Cod-RNA, and Adult datasets downloaded from the UCI ML Repository [Lic13] and [UKM06]. These datasets contain 515,345, 581,012, 331,152, and 48,842 data points respectively. We compare against the classic Nyström method with uniform sampling [WS01] and the random Fourier features method [RR07].
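For context, the random Fourier features baseline [RR07] for a Gaussian kernel admits a very short sketch. The version below is illustrative only (the bandwidth sigma, feature count, and data are arbitrary choices of ours, not the paper's experimental setup):

```python
import numpy as np

def rff_features(X, s, sigma, seed=0):
    """Random Fourier features for the Gaussian kernel
    k(x, y) = exp(-||x - y||^2 / (2 sigma^2)), so that Z @ Z.T approximates K."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.standard_normal((d, s)) / sigma     # frequencies ~ N(0, sigma^{-2} I)
    b = rng.uniform(0.0, 2.0 * np.pi, s)        # random phases
    return np.sqrt(2.0 / s) * np.cos(X @ W + b)

rng = np.random.default_rng(3)
X = rng.standard_normal((400, 4))
sigma = 2.0
Z = rff_features(X, s=2000, sigma=sigma)
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / (2 * sigma ** 2))
# With enough features, the approximation error is small relative to ||K||_2
assert np.linalg.norm(K - Z @ Z.T, 2) < 0.1 * np.linalg.norm(K, 2)
```

Unlike Nyström methods, the feature map here is data-oblivious, which is why its per sample cost is lower but its approximation quality for a fixed budget is typically weaker, as the experiments below show.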
Due to the large size of the datasets, prior leverage score based Nyström approaches [DMIMW12, GM13, AM15], which require at least Ω(n^2) time, are infeasible and thus not included in our tests.
We split categorical features into binary indicator features, and mean-center and normalize all features to have variance 1. We use a Gaussian kernel for all tests, with the width parameter selected via cross validation on regression and classification tasks. ‖K − ˜K‖_2 is used to measure approximation error. Since this quantity is prohibitively expensive to compute directly (it requires building the full kernel matrix K), the error is estimated using a random subset of 20,000 data points and repeated trials.

[Figure 2: log-scale plots of ‖K − ˜K‖_2 versus number of samples for Recursive RLS-Nyström, uniform Nyström, and random Fourier features, on four panels: (a) Adult, (b) Covertype, (c) Cod-RNA, (d) YearPredictionMSD.]

Figure 2: For a given number of samples, Recursive RLS-Nyström yields approximations with lower error, measured by ‖K − ˜K‖_2. Error is plotted on a logarithmic scale, averaged over 10 trials.

Figure 2 confirms that Recursive RLS-Nyström consistently obtains substantially better kernel approximation error than the other methods. As we can see in Figure 3, with the exception of YearPredictionMSD, the better quality of the landmarks obtained with Recursive RLS-Nyström also translates into runtime improvements. While the cost per sample is higher for our method, at O(nd + ns) time versus O(nd + s^2) for uniform Nyström and O(nd) for random Fourier features, since RLS-Nyström requires fewer samples it more quickly obtains ˜K with a given accuracy. ˜K will also have lower rank, which can accelerate processing in downstream applications.

[Figure 3: plots of ‖K − ˜K‖_2 versus runtime (sec.) for Recursive RLS-Nyström and uniform Nyström, on four panels: (a) Adult, (b) Covertype, (c) Cod-RNA, (d) YearPredictionMSD.]

Figure 3: Recursive RLS-Nyström obtains a fixed level of approximation faster than uniform sampling, only underperforming on YearPredictionMSD.
Results for random Fourier features are not shown: while the method is faster, it never obtained high enough accuracy to be directly comparable.

In Appendix G, we show that the runtime of RLS-Nyström can be further accelerated via a heuristic approach that under-samples landmarks at each level of recursion. This approach brings the per-sample cost down to approximately that of random Fourier features and uniform Nyström while nearly maintaining the same approximation quality. Results are shown in Figure 4.
On datasets such as Covertype, where Recursive RLS-Nyström performs significantly better than uniform sampling, the accelerated method does as well (see Figure 4b). However, the performance of the accelerated method does not degrade when leverage scores are relatively uniform – it still offers the best runtime to approximation quality tradeoff (Figure 4c).
We note that further runtime optimizations may be possible. Subsequent work extends fast ridge leverage score methods to distributed and streaming environments [CLV17].
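To make the quantities in these experiments concrete, the approximation pipeline can be sketched in a few lines: form a Gaussian kernel, build a Nyström approximation K̃ = C W⁺ Cᵀ from s landmark points, and measure the spectral error ‖K − K̃‖₂. The sketch below uses uniform landmark sampling as a stand-in for our recursive ridge leverage score sampler; the function names and the tiny synthetic dataset are illustrative, not part of our implementation.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    # K[i, j] = exp(-||x_i - y_j||^2 / (2 sigma^2))
    sq_dists = (np.sum(X**2, axis=1)[:, None]
                + np.sum(Y**2, axis=1)[None, :] - 2 * X @ Y.T)
    return np.exp(-np.maximum(sq_dists, 0) / (2 * sigma**2))

def nystrom_factor(X, s, sigma=1.0, seed=0):
    # Sample s landmarks (uniformly here, standing in for recursive
    # ridge leverage score sampling) and return a factor F with
    # K_tilde = F F^T = C W^+ C^T.
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=s, replace=False)
    C = gaussian_kernel(X, X[idx], sigma)   # n x s: O(ns) kernel evaluations
    W = C[idx]                              # s x s landmark kernel matrix
    # Pseudoinverse square root of W via its eigendecomposition.
    vals, U = np.linalg.eigh(W)
    good = vals > 1e-10 * vals.max()
    inv_sqrt = (U[:, good] / np.sqrt(vals[good])) @ U[:, good].T
    return C @ inv_sqrt                     # n x s factor F

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5))
F = nystrom_factor(X, s=100)
K = gaussian_kernel(X, X)
spectral_err = np.linalg.norm(K - F @ F.T, ord=2)
```

Note that K̃ is never formed explicitly: keeping only the n × s factor F gives the O(ns) storage and fast downstream manipulation discussed above.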
Empirical evaluation of these techniques could lead to even more scalable, high accuracy Nyström methods.

[Figure 4 plots omitted: runtime vs. samples, error vs. samples, and error vs. runtime, comparing Recursive RLS-Nystrom, Uniform Nystrom, Random Fourier Features, and Accelerated Recursive RLS-Nystrom. (a) Runtimes for Covertype. (b) Errors for Covertype. (c) Runtime/error tradeoff for YearPredictionMSD.]

Figure 4: Our accelerated Recursive RLS-Nyström nearly matches the per-sample runtime of random Fourier features and uniform Nyström while still providing much better approximation.

5.2 Additional Empirical Results

In Appendix G we verify the usefulness of our kernel approximations in downstream learning tasks. While full kernel methods do not scale to our large datasets, Recursive RLS-Nyström does, since its runtime depends linearly on n. For example, on YearPredictionMSD the method requires 307 sec. (averaged over 5 trials) to build a 2,000 landmark Nyström approximation for 463,716 training points. Ridge regression using the approximate kernel then requires 208 sec., for a total of 515 sec.
These runtimes are comparable to those of the very fast random Fourier features method, which underperforms RLS-Nyström in terms of regression and classification accuracy.

Acknowledgements

We would like to thank Michael Mahoney for bringing the potential of ridge leverage scores to our attention and suggesting their possible approximation via iterative sampling schemes. We would also like to thank Michael Cohen for pointing out (and fixing) an error in our original manuscript and generally for his close collaboration in our work on leverage score sampling algorithms. Finally, thanks to Haim Avron for pointing out an error in our original analysis.

References

[AM15] Ahmed Alaoui and Michael W. Mahoney. Fast randomized kernel ridge regression with statistical guarantees. In Advances in Neural Information Processing Systems 28 (NIPS), pages 775–783, 2015.

[AMS01] Dimitris Achlioptas, Frank McSherry, and Bernhard Schölkopf. Sampling techniques for kernel methods. In Advances in Neural Information Processing Systems 14 (NIPS), 2001.

[ANW14] Haim Avron, Huy Nguyen, and David Woodruff. Subspace embeddings for the polynomial kernel. In Advances in Neural Information Processing Systems 27 (NIPS), pages 2258–2266, 2014.

[Bac13] Francis Bach. Sharp analysis of low-rank kernel matrix approximations. In Proceedings of the 26th Annual Conference on Computational Learning Theory (COLT), 2013.

[BBV06] Maria-Florina Balcan, Avrim Blum, and Santosh Vempala. Kernels as features: On kernels, margins, and low-dimensional mappings. Machine Learning, 65(1):79–94, 2006.

[BJ02] Francis Bach and Michael I. Jordan. Kernel independent component analysis. Journal of Machine Learning Research, 3(Jul):1–48, 2002.

[BMD09] Christos Boutsidis, Michael W. Mahoney, and Petros Drineas. Unsupervised feature selection for the k-means clustering problem.
In Advances in Neural Information Processing Systems 22 (NIPS), pages 153–161, 2009.

[BW09] Mohamed-Ali Belabbas and Patrick J. Wolfe. Spectral methods in machine learning: New strategies for very large datasets. Proceedings of the National Academy of Sciences of the USA, 106:369–374, 2009.

[BWZ16] Christos Boutsidis, David P. Woodruff, and Peilin Zhong. Optimal principal component analysis in distributed and streaming models. In Proceedings of the 48th Annual ACM Symposium on Theory of Computing (STOC), 2016.

[CEM+15] Michael B. Cohen, Sam Elder, Cameron Musco, Christopher Musco, and Madalina Persu. Dimensionality reduction for k-means clustering and low rank approximation. In Proceedings of the 47th Annual ACM Symposium on Theory of Computing (STOC), pages 163–172, 2015.

[CLL+15] Shouyuan Chen, Yang Liu, Michael Lyu, Irwin King, and Shengyu Zhang. Fast relative-error approximation algorithm for ridge regression. In Proceedings of the 31st Annual Conference on Uncertainty in Artificial Intelligence (UAI), pages 201–210, 2015.

[CLM+15] Michael B. Cohen, Yin Tat Lee, Cameron Musco, Christopher Musco, Richard Peng, and Aaron Sidford. Uniform sampling for matrix approximation. In Proceedings of the 6th Conference on Innovations in Theoretical Computer Science (ITCS), pages 181–190, 2015.

[CLV16] Daniele Calandriello, Alessandro Lazaric, and Michal Valko. Analysis of Nyström method with sequential ridge leverage score sampling. In Proceedings of the 32nd Annual Conference on Uncertainty in Artificial Intelligence (UAI), pages 62–71, 2016.

[CLV17] Daniele Calandriello, Alessandro Lazaric, and Michal Valko. Distributed adaptive sampling for kernel matrix approximation. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.

[CMM17] Michael B. Cohen, Cameron Musco, and Christopher Musco.
Input sparsity time low-rank approximation via ridge leverage score sampling. In Proceedings of the 28th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1758–1777, 2017.

[CMT10] Corinna Cortes, Mehryar Mohri, and Ameet Talwalkar. On the impact of kernel approximation on learning accuracy. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 113–120, 2010.

[CW17] Kenneth L. Clarkson and David P. Woodruff. Low-rank PSD approximation in input-sparsity time. In Proceedings of the 28th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 2061–2072, 2017.

[DM05] Petros Drineas and Michael W. Mahoney. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. Journal of Machine Learning Research, 6:2153–2175, 2005.

[DMIMW12] Petros Drineas, Malik Magdon-Ismail, Michael W. Mahoney, and David P. Woodruff. Fast approximation of matrix coherence and statistical leverage. Journal of Machine Learning Research, 13:3475–3506, 2012.

[DMM08] Petros Drineas, Michael W. Mahoney, and S. Muthukrishnan. Relative-error CUR matrix decompositions. SIAM Journal on Matrix Analysis and Applications, 30(2):844–881, 2008.

[DST03] Vin De Silva and Joshua B. Tenenbaum. Global versus local methods in nonlinear dimensionality reduction. In Advances in Neural Information Processing Systems 16 (NIPS), pages 721–728, 2003.

[FS02] Shai Fine and Katya Scheinberg. Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2:243–264, 2002.

[FSS13] Dan Feldman, Melanie Schmidt, and Christian Sohler. Turning big data into tiny data: Constant-size coresets for k-means, PCA, and projective clustering.
In Proceedings of the 24th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1434–1453, 2013.

[Git11] Alex Gittens. The spectral norm error of the naive Nyström extension. arXiv:1110.5305, 2011.

[GM13] Alex Gittens and Michael Mahoney. Revisiting the Nyström method for improved large-scale machine learning. In Proceedings of the 30th International Conference on Machine Learning (ICML), pages 567–575, 2013. Full version at arXiv:1303.1849.

[HFH+09] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter, 11(1):10–18, 2009.

[HKZ14] Daniel Hsu, Sham M. Kakade, and Tong Zhang. Random design analysis of ridge regression. Foundations of Computational Mathematics, 14(3):569–600, 2014.

[HTF02] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer, 2nd edition, 2002.

[IBM14] IBM Research Division, Skylark Team. Libskylark: Sketching-based Distributed Matrix Computations for Machine Learning. IBM Corporation, Armonk, NY, 2014.

[KMT12] Sanjiv Kumar, Mehryar Mohri, and Ameet Talwalkar. Sampling methods for the Nyström method. Journal of Machine Learning Research, 13:981–1006, 2012.

[LBKL15] Mu Li, Wei Bi, James T. Kwok, and Bao-Liang Lu. Large-scale Nyström kernel matrix approximation using randomized SVD. IEEE Transactions on Neural Networks and Learning Systems, 26(1):152–164, 2015.

[Lic13] M. Lichman. UCI machine learning repository, 2013.

[LJS16] Chengtao Li, Stefanie Jegelka, and Suvrit Sra. Fast DPP sampling for Nyström with application to kernel methods. In Proceedings of the 33rd International Conference on Machine Learning (ICML), 2016.

[LSS13] Quoc Le, Tamás Sarlós, and Alexander Smola.
Fastfood - Computing Hilbert space expansions in loglinear time. In Proceedings of the 30th International Conference on Machine Learning (ICML), pages 244–252, 2013.

[MU17] Michael Mitzenmacher and Eli Upfal. Probability and Computing: Randomization and Probabilistic Techniques in Algorithms and Data Analysis. Cambridge University Press, 2017.

[PD16] Saurabh Paul and Petros Drineas. Feature selection for ridge regression with provable guarantees. Neural Computation, 28(4):716–742, 2016.

[Pla05] John Platt. FastMap, MetricMap, and Landmark MDS are all Nyström algorithms. In Proceedings of the 8th International Conference on Artificial Intelligence and Statistics (AISTATS), 2005.

[PVG+11] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[RCR15] Alessandro Rudi, Raffaello Camoriano, and Lorenzo Rosasco. Less is more: Nyström computational regularization. In Advances in Neural Information Processing Systems 28 (NIPS), pages 1648–1656, 2015.

[RR07] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems 20 (NIPS), pages 1177–1184, 2007.

[RR09] Ali Rahimi and Benjamin Recht. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In Advances in Neural Information Processing Systems 22 (NIPS), pages 1313–1320, 2009.

[SS00] Alex J. Smola and Bernhard Schölkopf. Sparse greedy matrix approximation for machine learning. In Proceedings of the 17th International Conference on Machine Learning (ICML), pages 911–918, 2000.

[SS02] Bernhard Schölkopf and Alexander J. Smola.
Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2002.

[SSM99] Bernhard Schölkopf, Alexander J. Smola, and Klaus-Robert Müller. Kernel principal component analysis. In Advances in Kernel Methods, pages 327–352. MIT Press, 1999.

[Tro15] Joel A. Tropp. An introduction to matrix concentration inequalities. Foundations and Trends in Machine Learning, 8(1-2):1–230, 2015.

[TRVR16] Stephen Tu, Rebecca Roelofs, Shivaram Venkataraman, and Benjamin Recht. Large scale kernel learning using block coordinate descent. arXiv:1602.05310, 2016.

[UKM06] Andrew V. Uzilov, Joshua M. Keegan, and David H. Mathews. Detection of non-coding RNAs on the basis of predicted secondary structure formation free energy change. BMC Bioinformatics, 7(1):173, 2006.

[Wan16] Weiran Wang. On column selection in approximate kernel canonical correlation analysis. arXiv:1602.02172, 2016.

[Woo14] David P. Woodruff. Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science, 10(1-2):1–157, 2014.

[WS01] Christopher Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems 14 (NIPS), pages 682–688, 2001.

[WZ13] Shusen Wang and Zhihua Zhang. Improving CUR matrix decomposition and the Nyström approximation via adaptive sampling. Journal of Machine Learning Research, 14:2729–2769, 2013.

[YLM+12] Tianbao Yang, Yu-feng Li, Mehrdad Mahdavi, Rong Jin, and Zhi-Hua Zhou. Nyström method vs random Fourier features: A theoretical and empirical comparison. In Advances in Neural Information Processing Systems 25 (NIPS), pages 476–484, 2012.

[YPW15] Yun Yang, Mert Pilanci, and Martin J. Wainwright. Randomized sketches for kernels: Fast and optimal non-parametric regression.
Annals of Statistics, 2015.

[YZ13] Yuchen Zhang, John Duchi, and Martin Wainwright. Divide and conquer kernel ridge regression. In Proceedings of the 26th Annual Conference on Computational Learning Theory (COLT), 2013.

[Zha06] Tong Zhang. Learning bounds for kernel regression using effective data dimensionality. Neural Computation, 17(9), 2006.

[ZTK08] Kai Zhang, Ivor W. Tsang, and James T. Kwok. Improved Nyström low-rank approximation and error analysis. In Proceedings of the 25th International Conference on Machine Learning (ICML), pages 1232–1239, 2008.