On Fast Leverage Score Sampling and Optimal Learning

Advances in Neural Information Processing Systems (NeurIPS 2018), pp. 5672–5682.

Alessandro Rudi* (INRIA – Sierra team, ENS, Paris), Daniele Calandriello* (LCSL – IIT & MIT, Genoa, Italy), Luigi Carratino (University of Genoa, Genoa, Italy), Lorenzo Rosasco (University of Genoa, LCSL – IIT & MIT)

Abstract

Leverage score sampling provides an appealing way to perform approximate computations for large matrices. Indeed, it allows one to derive faithful approximations with a complexity adapted to the problem at hand. Yet, performing leverage score sampling is a challenge in its own right, requiring further approximations. In this paper, we study the problem of leverage score sampling for positive definite matrices defined by a kernel. Our contribution is twofold. First, we provide a novel algorithm for leverage score sampling and, second, we exploit the proposed method in statistical learning by deriving a novel solver for kernel ridge regression. 
Our\nmain technical contribution is showing that the proposed algorithms are currently\nthe most ef\ufb01cient and accurate for these problems.\n\n1\n\nIntroduction\n\nA variety of machine learning problems require manipulating and performing computations with\nlarge matrices that often do not \ufb01t memory. In practice, randomized techniques are often employed to\nreduce the computational burden. Examples include stochastic approximations [1], columns/rows\nsubsampling and more general sketching techniques [2, 3]. One of the simplest approach is uniform\ncolumn sampling [4, 5], that is replacing the original matrix with a subset of columns chosen\nuniformly at random. This approach is fast to compute, but the number of columns needed for a\nprescribed approximation accuracy does not take advantage of the possible low rank structure of the\nmatrix at hand. As discussed in [6], leverage score sampling provides a way to tackle this shortcoming.\nHere columns are sampled proportionally to suitable weights, called leverage scores (LS) [7, 6]. With\nthis sampling strategy, the number of columns needed for a prescribed accuracy is governed by the\nso called effective dimension which is a natural extension of the notion of rank. Despite these nice\nproperties, performing leverage score sampling provides a challenge in its own right, since it has\ncomplexity in the same order of an eigendecomposition of the original matrix. Indeed, much effort\nhas been recently devoted to derive fast and provably accurate algorithms for approximate leverage\nscore sampling [2, 8, 6, 9, 10].\nIn this paper, we consider these questions in the case of positive semi-de\ufb01nite matrices, central for\nexample in Gaussian processes [11] and kernel methods [12]. Sampling approaches in this context\nare related to the so called Nystr\u00f6m approximation [13] and Nystr\u00f6m centers selection problem [11],\nand are widely studied both in practice [4] and in theory [5]. 
Our contribution is twofold. First,\nwe propose and study BLESS, a novel algorithm for approximate leverage scores sampling. The\n\ufb01rst solution to this problem is introduced in [6], but has poor approximation guarantees and high\ntime complexity. Improved approximations are achieved by algorithms recently proposed in [8] and\n[9]. In particular, the approach in [8] can obtain good accuracy and very ef\ufb01cient computations but\nonly as long as distributed resources are available. Our \ufb01rst technical contribution is showing that\nour algorithm can achieve state of the art accuracy and computational complexity without requiring\n\n\u21e4Equal contribution. Respective emails: alessandro.rudi@inria.fr, daniele.calandriello@iit.it\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fdistributed resources. The key idea is to follow a coarse to \ufb01ne strategy, alternating uniform and\nleverage scores sampling on sets of increasing size.\nOur second, contribution is considering leverage score sampling in statistical learning with least\nsquares. We extend the approach in [14] for ef\ufb01cient kernel ridge regression based on combining\nfast optimization algorithms (preconditioned conjugate gradient) with uniform sampling. Results in\n\n[14] showed that optimal learning bounds can be achieved with a complexity which is eO(npn) in\ntime and eO(n) space. In this paper, we study the impact of replacing uniform with leverage score\nthe time and memory is now eO(ndeff), and eO(deff\n\nsampling. In particular, we prove that the derived method still achieves optimal learning bounds but\n2) respectively, where deff is the effective dimension\nwhich and is never larger, and possibly much smaller, than pn. 
To the best of our knowledge these are the best currently known computational guarantees for a kernel ridge regression solver.

2 Leverage score sampling with BLESS

After introducing leverage score sampling and previous algorithms, we present our approach and first theoretical results.

2.1 Leverage score sampling

Suppose $\widehat K \in \mathbb{R}^{n\times n}$ is symmetric and positive semidefinite. A basic question is deriving memory-efficient approximations of $\widehat K$ [4, 8] or related quantities, e.g. approximate projections on its range [9], or associated estimators, as in kernel ridge regression [15, 14]. The eigendecomposition of $\widehat K$ offers a natural, but computationally demanding, solution. Subsampling columns (or rows) is an appealing alternative. A basic approach is uniform sampling, whereas a more refined approach is leverage score sampling. This latter procedure corresponds to sampling columns with probabilities proportional to the leverage scores

$$\ell(i,\lambda) = \big(\widehat K(\widehat K + \lambda n I)^{-1}\big)_{ii}, \qquad i \in [n], \qquad (1)$$

where $[n] = \{1, \dots, n\}$ and $\lambda > 0$. The advantage of leverage score sampling is that potentially very few columns can suffice for the desired approximation. Indeed, letting

$$d_{\infty}(\lambda) = n \max_{i=1,\dots,n} \ell(i,\lambda), \qquad d_{\mathrm{eff}}(\lambda) = \sum_{i=1}^n \ell(i,\lambda),$$

for $\lambda > 0$, it is easy to see that $d_{\mathrm{eff}}(\lambda) \le d_{\infty}(\lambda) \le 1/\lambda$ for all $\lambda$, and previous results show that the number of columns required for accurate approximation is $d_{\infty}$ for uniform sampling and $d_{\mathrm{eff}}$ for leverage score sampling [5, 6]. However, it is clear from definition (1) that an exact leverage score computation would require the same order of computations as an eigendecomposition, hence approximations are needed. The accuracy of approximate leverage scores is typically measured by $t > 0$ in multiplicative bounds of the form

$$\frac{1}{1+t}\,\ell(i,\lambda) \;\le\; \widetilde\ell(i,\lambda) \;\le\; (1+t)\,\ell(i,\lambda), \qquad \forall i \in [n]. \qquad (2)$$

Before proposing a new improved solution, we briefly discuss relevant previous works. To provide a unified view, some preliminary discussion is useful.

2.2 Approximate leverage scores

First, we recall how a subset of columns can be used to compute approximate leverage scores. For $M \le n$, let $J = \{j_i\}_{i=1}^M$ with $j_i \in [n]$, and let $\widehat K_{J,J} \in \mathbb{R}^{M\times M}$ have entries $(\widehat K_{J,J})_{lm} = \widehat K_{j_l, j_m}$. For $i \in [n]$, let $\widehat K_{J,i} = (\widehat K_{j_1,i}, \dots, \widehat K_{j_M,i})$ and consider, for $\lambda > 1/n$,

$$\widetilde\ell_J(i,\lambda) = (\lambda n)^{-1}\big(\widehat K_{ii} - \widehat K_{J,i}^{\top}(\widehat K_{J,J} + \lambda n A)^{-1}\widehat K_{J,i}\big), \qquad (3)$$

where $A \in \mathbb{R}^{M\times M}$ is a matrix to be specified* (see later for details). The above definition is motivated by the observation that if $J = [n]$ and $A = I$, then $\widetilde\ell_J(i,\lambda) = \ell(i,\lambda)$, by the identity

$$\widehat K(\widehat K + \lambda n I)^{-1} = (\lambda n)^{-1}\big(\widehat K - \widehat K(\widehat K + \lambda n I)^{-1}\widehat K\big). \qquad (4)$$

In the following, it is also useful to consider a subset of leverage scores computed as in (3). For $M \le R \le n$, let $U = \{u_i\}_{i=1}^R$ with $u_i \in [n]$, and

$$L_J(U,\lambda) = \{\widetilde\ell_J(u_1,\lambda), \dots, \widetilde\ell_J(u_R,\lambda)\}.$$

Also, in the following we will use the notation

$$L_J(U,\lambda) \mapsto J' \qquad (5)$$

to indicate leverage score sampling of a subset $J' \subset U$ of columns based on the leverage scores $L_J(U,\lambda)$, that is, the procedure of sampling columns from $U$ according to their leverage scores, computed using $J$, to obtain a new subset of columns $J'$. We end by noting that leverage score sampling (5) requires $O(M^2)$ memory to store $\widehat K_{J,J}$, and $O(M^3 + RM^2)$ time to invert $\widehat K_{J,J}$ and compute $R$ leverage scores via (3).

*Clearly, $\widetilde\ell_J$ depends on the choice of the matrix $A$, but we omit this dependence to simplify the notation.

2.3 Previous algorithms for leverage score computations

We discuss relevant previous approaches using the above quantities.

TWO-PASS sampling [6]. This is the first approximate leverage score sampling scheme proposed, based on using (5) directly as $L_{J_1}(U_2,\lambda) \mapsto J_2$, with $U_2 = [n]$ and $J_1$ a subset taken uniformly at random. 
Here we call this method TWO-PASS sampling since it requires two rounds of\nsampling on the whole set [n], one uniform to select J1 and one using leverage scores to select J2.\n\nRECURSIVE-RLS [9]. This is a development of TWO-PASS sampling based on the idea of\nrecursing the above construction. In our notation, let U1 \u21e2 U2 \u21e2 U3 = [n], where U1, U2 are\nuniformly sampled and have cardinalities n/4 and n/2, respectively. The idea is to start from\nJ1 = U1, and consider \ufb01rst\n\nbut then continue with\n\nLJ1(U2, ) 7! J2,\nLJ2(U3, ) 7! J3.\n\nIndeed, the above construction can be made recursive for a family of nested subsets (Uh)H of\ncardinalities n/2h, considering J1 = U1 and\n\nLJh(Uh+1, ) 7! Jh+1.\n\n(6)\n\nSQUEAK[8]. This approach follows a different iterative strategy. Consider a partition U1, U2, U3\nof [n], so that Uj = n/3, for j = 1, . . . 3. Then, consider J1 = U1, and\n\nand then continue with\n\nLJ1[U2(J1 [ U2, ) 7! J2,\nLJ2[U3(J2 [ U3, ) 7! J3.\n\nSimilarly to the other cases, the procedure is iterated considering H subsets (Uh)H\ncardinality n/H. Starting from J1 = U1 the iterations is\n\nh=1 each with\n\nLJh[Uh+1(Jh [ Uh+1, ).\n\n(7)\n\nWe note that all the above procedures require specifying the number of iteration to be performed, the\nweights matrix to compute the leverage scores at each iteration, and a strategy to select the subsets\n(Uh)h. In all the above cases the selection of Uh is based on uniform sampling, while the number of\niterations and weight choices arise from theoretical considerations (see [6, 8, 9] for details).\nNote that TWO-PASS SAMPLING uses a set J1 of cardinality roughly 1/ (an upper bound on d1())\nand incurs in a computational cost of RM 2 = n/2. In comparison, RECURSIVE-RLS [9] leads\nto essentially the same accuracy while improving computations. In particular, the sets Jh are never\nlarger than deff(). 
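To make these quantities concrete, here is a minimal NumPy sketch (our own illustration, not code from the paper) of the exact scores in Eq. (1) and the subset-based approximation in Eq. (3); for simplicity the weight matrix A is taken to be the identity, in which case J = [n] recovers the exact scores via identity (4).

```python
import numpy as np

def exact_leverage_scores(K, lam):
    """Eq. (1): l(i, lam) = (K (K + lam*n*I)^{-1})_{ii}."""
    n = K.shape[0]
    # K and (K + lam*n*I)^{-1} commute, so the order of the product is immaterial.
    return np.diag(np.linalg.solve(K + lam * n * np.eye(n), K))

def approx_leverage_scores(K, J, lam, A=None):
    """Eq. (3) from the columns indexed by J; A defaults to the identity
    (a simplification -- the paper's algorithms use specific weights)."""
    n = K.shape[0]
    J = np.asarray(J, dtype=int)
    if A is None:
        A = np.eye(len(J))
    K_J = K[J, :]                                           # rows K_{J,i}, all i
    S = np.linalg.solve(K[np.ix_(J, J)] + lam * n * A, K_J)
    return (np.diag(K) - (K_J * S).sum(axis=0)) / (lam * n)

# Sanity usage on a small Gaussian kernel matrix.
rng = np.random.default_rng(1)
X = rng.standard_normal((80, 3))
K = np.exp(-0.5 * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
lam = 0.1
ls = exact_leverage_scores(K, lam)
ls_full = approx_leverage_scores(K, np.arange(80), lam)   # J = [n]: exact by (4)
ls_sub = approx_leverage_scores(K, np.arange(0, 80, 4), lam)  # J of 20 columns
```

Note how the approximate computation only ever inverts the M × M block indexed by J, which is the source of the O(M³ + RM²) cost quoted above.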
Taking into account that at the last iteration performs leverage score sampling on\nUh = [n], the total computational complexity is ndeff()2. SQUEAK [8] recovers the same accuracy,\nsize of Jh, and ndeff()2 time complexity when |Uh|' deff(), but only requires a single pass over\nthe data. We also note that a distributed version of SQUEAK is discussed in [8], which allows to\nreduce the computational cost to ndeff()2/p, provided p machines are available.\n\n3\n\n\fi=1, regularization , step q, starting reg. 0, constants q1, q2 controlling the\n\nlog q\n\napproximation level.\n\nAlgorithm 1 Bottom-up Leverage Scores Sampling (BLESS)\nInput: dataset {xi}n\nOutput: Mh 2 [n] number of selected points, Jh set of indexes, Ah weights.\n1: J0 = ;, A0 = [], H = log(0/)\n2: for h = 1 . . . H do\n3:\n4:\n5:\n6:\n7:\n8:\n9:\n10: Ah = RhMh\n11: end for\n\nh = h1/q\nset constant Rh = q1 min{\uf8ff2/h, n}\nsample Uh = {u1, . . . , uRh} i.i.d. ui \u21e0 U nif orm([n])\ncomputee`Jh1(xuk , h) for all uk 2 Uh using Eq. 3\nset Ph = (ph,k)Rh\nset constant Mh = q2dh with dh = n\nsample Jh = {j1, . . . , jMh} i.i.d. ji \u21e0 M ultinomial(Ph, Uh)\n\nk=1 with ph,k =e`Jh1(xuk , h)/(Pu2Uhe`Jh1(xu, h))\ndiag\u21e3ph,j1, . . . , ph,jMh\u2318\n\nRhPu2Uhe`Jh1(xu, h), and\n\nn\n\n2.4 Leverage score sampling with BLESS\nThe procedure we propose, dubbed BLESS, has similarities to the one proposed in [9] (see (6)),\nbut also some important differences. The main difference is that, rather than a \ufb01xed , we consider\na decreasing sequence of parameters 0 > 1 > \u00b7\u00b7\u00b7 > H = resulting in different algorithmic\nchoices. For the construction of the subsets Uh we do not use nested subsets, but rather each (Uh)H\nh=1\nis sampled uniformly and independently, with a size smoothly increasing as 1/h. Similarly, as in [9]\nwe proceed iteratively, but at each iteration a different decreasing parameter h is used to compute\nthe leverage scores. 
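A drastically simplified sketch of this coarse-to-fine loop may help fix ideas; it uses an identity weight matrix, ad hoc constants `c` and `q`, and no reweighting of the selected columns, so it is an illustration in the spirit of Algorithm 1 rather than the algorithm itself.

```python
import numpy as np

def _approx_ls(K, J, lam):
    # l~_J(i, lam) of Eq. (3), with the weight matrix A set to I (simplified).
    n = K.shape[0]
    if len(J) == 0:
        return np.diag(K) / (lam * n)   # empty J: nothing to project out
    K_J = K[J, :]
    S = np.linalg.solve(K[np.ix_(J, J)] + lam * n * np.eye(len(J)), K_J)
    return (np.diag(K) - (K_J * S).sum(axis=0)) / (lam * n)

def bless_sketch(K, lam, lam0=1.0, q=2.0, c=4.0, seed=0):
    """Coarse-to-fine pass: lam_h decreases from lam0 to lam; each round draws
    a fresh uniform U_h of size ~ 1/lam_h and resamples J_h from U_h using
    leverage scores computed with the previous, coarser J_{h-1}."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    J = np.array([], dtype=int)
    lam_h = lam0
    while lam_h > lam:
        lam_h = max(lam_h / q, lam)                  # next, finer scale
        R = int(min(c / lam_h, n))                   # |U_h| grows like 1/lam_h
        U = rng.choice(n, size=R, replace=False)     # fresh uniform subsample
        p = np.maximum(_approx_ls(K, J, lam_h)[U], 1e-12)
        d_h = (n / R) * p.sum()                      # estimate of deff(lam_h)
        M = int(np.clip(c * d_h, 1, R))              # keep ~ c * deff(lam_h) columns
        J = np.unique(rng.choice(U, size=M, replace=True, p=p / p.sum()))
    return J
```

The point of the schedule is visible in the sketch: every linear system solved involves at most |J_{h-1}| ≲ deff(λ_h) columns and at most ~1/λ_h candidate points, never all n of them at once.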
Using the notation introduced above, the iteration of BLESS is given by\n\nLJh(Uh+1, h+1) 7! Jh+1,\n\n(8)\n\nwhere the initial set J1 = U1 is sampled uniformly with size roughly 1/0.\nBLESS has two main advantages. The \ufb01rst is computational: each of the sets Uh, including the \ufb01nal\nUH, has cardinality smaller than 1/. Therefore the overall runtime has a cost of only RM 2 \uf8ff M 2/,\nwhich can be dramatically smaller than the nM 2 cost achieved by the methods in [9], [8] and is\ncomparable to the distributed version of SQUEAK using p = /n machines. The second advantage\nis that a whole path of leverage scores {`(i, h)}H\nh=1 is computed at once, in the sense that at each\niteration accurate approximate leverage scores at scale h are computed. This is extremely useful in\npractice, as it can be used when cross-validating h. As a comparison, for all previous method a full\nrun of the algorithm is needed for each value of h.\nIn the paper we consider two variations of the above general idea leading to Algorithm 1 and\nAlgorithm 2. The main difference in the two algorithms lies in the way in which sampling is\nperformed: with and without replacement, respectively. In particular, considering sampling without\nreplacement (see 2) it is possible to take the set (Uh)H\nh=1 to be nested and also to obtain slightly\nimproved results, as shown in the next section.\nThe derivation of BLESS rests on some basic ideas. First, note that, since sampling uniformly a set\nU of size d1() \uf8ff 1/ allows a good approximation, then we can replace L[n]([n], ) 7! J by\n\n(9)\nwhere J can be taken to have cardinality deff(). However, this is still costly, and the idea is to repeat\nand couple approximations at multiple scales. Consider 0 > , a set U0 of size d1(0) \uf8ff 1/0\nsampled uniformly, and LU0 (U0, 0) 7! J0. The basic idea behind BLESS is to replace (9) by\n\nLU(U, ) 7! J,\n\nThe key result, see , is that taking \u02dcJ of cardinality\n\nLJ0(U, ) 7! 
\u02dcJ.\n\n(10)\nsuf\ufb01ce to achieve the same accuracy as J. Now, if we take 0 suf\ufb01ciently large, it is easy to see that\ndeff(0) \u21e0 d1(0) \u21e0 1/0, so that we can take J0 uniformly at random. However, the factor (0/)\nin (10) becomes too big. Taking multiple scales \ufb01x this problem and leads to the iteration in (8).\n\n(0/)deff()\n\n4\n\n\fi=1, regularization , step q, starting reg. 0, constant q2 controlling the approxi-\n\n,\n\nlog q\n\nmation level.\n\nAlgorithm 2 Bottom-up Leverage Scores Sampling without Replacement (BLESS-R)\nInput: dataset {xi}n\nOutput: Mh 2 [n] number of selected points, Jh set of indexes, Ah weights.\n1: J0 = ;, A0 = [], H = log(0/)\n2: for h = 1 . . . H do\n3:\n4:\n5:\n6:\n7:\n8:\n9:\n10:\n11:\n12:\n13:\n14: end for\n\nh = h1/q\nset constant h = min{q2\uf8ff2/(hn), 1}\ninitialize Uh = ;\nfor i 2 [n] do\nend for\nfor j 2 Uh do\ncompute ph,j = min{q2e`Jh1(xj, h1), 1}\nJh = {j1, . . . , jMh}, and Ah = diag\u21e3ph,j1, . . . , ph,jMh\u2318 .\n\nadd j to Jh with probability ph,j/h\n\nadd i to Uh with probability h\n\nend for\n\n2.5 Theoretical guarantees\nOur \ufb01rst main result establishes in a precise and quantitative way the advantages of BLESS.\nTheorem 1. Let n 2 N, > 0 and 2 (0, 1]. Given t > 0, q > 1 and H 2 N, (h)H\nin Algorithms 1 and 2, when (Jh, ah)H\n\nh=1 are computed\n\nh=1 de\ufb01ned as\n\n1. by Alg. 1 with parameters 0 = \uf8ff2\n\n2. by Alg. 2 with parameters 0 = \uf8ff2\n\nmin(t,1), q1 5\uf8ff2q2\nmin(t,1), q1 54\uf8ff2 (2t+1)2\n\nq(1+t), q2 12q (2t+1)2\n,\n\nlog 12Hn\n\nt2\n\nt2\n\n\n\n(1 + t) log 12Hn\n\n\n\n,\n\nlete`Jh(i, h) as in Eq. 
(3) depending on Jh, Ah, then with probability at least 1 :\n\n1\n\n`(i, h) \uf8ff e`Jh(i, h) \uf8ff (1 + min(t, 1))`(i, h),\n\n1 + t\n|Jh|\uf8ff q2deff(h),\n\n8h 2 [H].\n\n(a)\n\n(b)\n\n8i 2 [n], h 2 [H],\n\nThe above result con\ufb01rms that the subsets Jh computed by BLESS are accurate in the desired sense,\nsee (2), and the size of all Jh is small and proportional to deff(h), leading to a computational\n\ncost of only Omin 1\n\n , n deff()2 log2 1\n\n in time and Odeff()2 log2 1\n\n in space (for additional\n\nproperties of Jh see Thm. 4 in appendixes). Table 1 compares the complexity and number of\ncolumns sampled by BLESS with other methods. The crucial point is that in most applications, the\nparameter is chosen as a decreasing function of n, e.g. = 1/pn, resulting in potentially massive\ncomputational gains. Indeed, since BLESS computes leverage scores for sets of size at most 1/, this\nallows to perform leverage scores sampling on matrices with millions of rows/columns, as shown in\nthe experiments. In the next section, we illustrate the impact of BLESS in the context of supervised\nstatistical learning.\n\n3 Ef\ufb01cient supervised learning with leverage scores\n\nIn this section, we discuss the impact of BLESS in a supervised learning. Unlike most previous\nresults on leverage scores sampling in this context [6, 8, 9], we consider the setting of statistical\nlearning, where the challenge is that inputs, as well as the outputs, are random. More precisely, given\na probability space (X \u21e5 Y, \u21e2), where Y \u21e2 R, and considering least squares, the problem is to solve\n(11)\n\n(f (x) y)2d\u21e2(x, y),\n\nE(f ) =ZX\u21e5Y\n\nmin\n\nf2HE(f ),\n\n5\n\n\fAlgorithm\nUniform Sampling [5]\nExact RLS Sampl.\nTwo-Pass Sampling [6]\nRecursive RLS [9]\nSQUEAK [8]\nThis work, Alg. 
1 and 2\n\nRuntime\n\n\nn3\nn/2\n\nndeff()2\nndeff()2\n\n1/ de\u21b5 ()2\n\n|J|\n1/\ndeff()\ndeff()\ndeff()\ndeff()\ndeff()\n\ni=1 \u21e0 \u21e2n.\n\ncomplexity and cardinality of the set J required to satisfy the approximation condition in Eq. 2.\n\nTable 1: The proposed algorithms are compared with the state of the art (in eO notation), in terms of time\nIn the above minimization problem, H is a\nwhen \u21e2 is known only through (xi, yi)n\nreproducing kernel Hilbert space de\ufb01ned by a positive de\ufb01nite kernel K : X \u21e5 X ! R [12].\nRecall that the latter is de\ufb01ned as the completion of span{K(x,\u00b7) | x 2 X} with the inner product\nhK(x,\u00b7), K(x0,\u00b7)iH = K(x, x0). The quality of an empirical approximate solution bf is measured\nvia probabilistic bounds on the excess risk R(bf ) = E(bf ) minf2H E(f ).\n\n3.1 Learning with FALKON-BLESS\nThe algorithm we propose, called FALKON-BLESS, combines BLESS with FALKON [14] a state of\nthe art algorithm to solve the least squares problem presented above. The appeal of FALKON is that\nit is currently the most ef\ufb01cient solution to achieve optimal excess risk bounds. As we discuss in the\nfollowing, the combination with BLESS leads to further improvements.\nWe describe the derivation of the considered algorithm starting from kernel ridge regression (KRR)\n\nnXi=1\n\nbf(x) =\n\nK(x, xi)ci,\n\nc = (bK + nI)1bY\n\n(12)\n\nspace requirements. FALKON can be seen as an approximate ridge regression solver combining a\n\nwhere c = (c1, . . . , cn),bY = (y1, . . . , yn) and bK 2 Rn\u21e5n is the empirical kernel matrix with entries\n(bK)ij = K(xi, xj). KRR has optimal statistical properties [16], but large O(n3) time and O(n2)\nnumber of algorithmic ideas. First, sampling is used to select a subset {ex1, . . . 
,exM} of the input\n\ndata uniformly at random, and to de\ufb01ne an approximate solution\n\n= (K>nM KnM + KMM )1K>nM y,\n\n(13)\n\nbf,M (x) =\n\nMXj=1\n\nK(exj, x)\u21b5j,\u21b5\n\nwhere \u21b5 = (\u21b51, . . . ,\u21b5 M ), KnM 2 Rn\u21e5M, has entries (KnM )ij = K(xi, \u02dcxj) and KMM 2 RM\u21e5M\nhas entries (KMM )jj0 = K(\u02dcxj, \u02dcxj0), with i 2 [n], j, j0 2 [M ]. We note, that the linear system\nin (13) can be seen to obtained from the one in (12) by uniform column subsampling of the empirical\n\nkernel matrix. The columns selected corresponds to the inputs {ex1, . . . ,exM}. FALKON proposes to\n\ncompute a solution of the linear system 13 via a preconditioned iterative solver. The preconditioner is\nthe core of the algorithm and is de\ufb01ned by a matrix B such that\n\nBB> =\u21e3 n\n\nM\n\nK2\n\nMM + KMM\u23181\n\n.\n\n(14)\n\nThe above choice provides a computationally ef\ufb01cient approximation to the exact preconditioner\nof the linear system in (13) corresponding to B such that BB> = (K>nM KnM + KMM )1. The\npreconditioner in (14) can then be combined with conjugate gradient to solve the linear system in (13).\nThe overall algorithm has complexity O(nM t) in time and O(M 2) in space, where t is the number\nof conjugate gradient iterations performed.\n\nleverage score sampling using BLESS, see Algorithm 1 or Algorithm 2, so that M = Mh and\n\nIn this paper, we analyze a variation of FALKON where the points {ex1, . . . ,exM} are selected via\nexk = xjk, for Jh = {j1, . . . , jMh} and k 2 [Mh]. 
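For reference, the plain-FALKON building blocks in (13) and (14) can be sketched as below; the Gaussian kernel, the λn scaling of the regularizer, the jitter terms, and the helper names are our illustrative choices, not the paper's implementation.

```python
import numpy as np

def gauss(A, B, sigma=1.0):
    """Gaussian kernel matrix between row-sets A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def nystrom_krr(X, y, centers, lam, sigma=1.0):
    """Estimator of Eq. (13): solve (Knm^T Knm + lam*n*Kmm) a = Knm^T y
    (small jitter added for numerical stability)."""
    n, M = X.shape[0], centers.shape[0]
    Knm = gauss(X, centers, sigma)
    Kmm = gauss(centers, centers, sigma)
    a = np.linalg.solve(Knm.T @ Knm + lam * n * Kmm + 1e-10 * np.eye(M),
                        Knm.T @ y)
    return a, Knm, Kmm

def falkon_preconditioner(Kmm, n, lam):
    """B with B B^T = ((n/M) Kmm^2 + lam*n*Kmm)^{-1}, as in Eq. (14)."""
    M = Kmm.shape[0]
    P = (n / M) * Kmm @ Kmm + lam * n * Kmm + 1e-10 * np.eye(M)  # jittered
    L = np.linalg.cholesky(P)       # P = L L^T
    return np.linalg.inv(L).T       # B = L^{-T}, hence B B^T = P^{-1}

# Toy usage with 30 uniformly chosen centers.
rng = np.random.default_rng(2)
X = rng.standard_normal((300, 4))
y = np.sin(X[:, 0])
centers = X[:30]
lam = 1e-3
a, Knm, Kmm = nystrom_krr(X, y, centers, lam)
preds = Knm @ a
B = falkon_preconditioner(Kmm, 300, lam)
P = (300 / 30) * Kmm @ Kmm + lam * 300 * Kmm + 1e-10 * np.eye(30)
```

In the full method the M × M preconditioner is cheap to form relative to the n × M system it accelerates, which is what keeps the per-iteration cost at O(nM).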
Further, the preconditioner in (14) is replaced by

$$B_h B_h^{\top} = \Big(\frac{n}{M}\, K_{J_h,J_h} A_h^{-1} K_{J_h,J_h} + \lambda_h n\, K_{J_h,J_h}\Big)^{-1}. \qquad (15)$$

Method      Time (s)   R-ACC   5th / 95th quantile
BLESS          17      1.06       0.57 / 2.03
BLESS-R        17      1.06       0.73 / 1.50
SQUEAK         52      1.06       0.70 / 1.48
Uniform         -      1.09       0.22 / 3.75
RRLS          235      1.59       1.00 / 2.70

Figure 1: Leverage scores relative accuracy (R-ACC) for $\lambda = 10^{-5}$, $n = 70\,000$, $M = 10\,000$, 10 repetitions. [Right panel: box plot of R-ACC for BLESS, BLESS-R, SQUEAK, Uniform, RRLS.]

This solution can lead to huge computational improvements. Indeed, the total cost of FALKON-BLESS is the sum of computing BLESS and FALKON, corresponding to

$$O\big(nMt + (1/\lambda)M^2 \log n + M^3\big) \ \text{in time} \quad \text{and} \quad O(M^2) \ \text{in space}, \qquad (16)$$

where $M$ is the size of the set $J_H$ returned by BLESS.

3.2 Statistical properties of FALKON-BLESS

In this section, we state and discuss our second main result, providing an excess risk bound for FALKON-BLESS. Here a population version of the effective dimension plays a key role. Let $\rho_X$ be the marginal measure of $\rho$ on $X$, let $C : \mathcal H \to \mathcal H$ be the linear operator defined below, and let $d^*_{\mathrm{eff}}(\lambda)$ be the population version of $d_{\mathrm{eff}}(\lambda)$:

$$d^*_{\mathrm{eff}}(\lambda) = \mathrm{Tr}\big(C(C+\lambda I)^{-1}\big), \quad \text{with} \quad (Cf)(x') = \int_X K(x', x) f(x)\, d\rho_X(x),$$

for any $f \in \mathcal H$ and $x' \in X$. It is possible to show that $d^*_{\mathrm{eff}}(\lambda)$ is the limit of $d_{\mathrm{eff}}(\lambda)$ as $n$ goes to infinity; see Lemma 1 below, taken from [15]. If we assume throughout that

$$K(x, x') \le \kappa^2, \qquad \forall x, x' \in X, \qquad (17)$$

then the operator $C$ is symmetric, positive definite and trace class, and the behavior of $d^*_{\mathrm{eff}}(\lambda)$ can be characterized in terms of the properties of the eigenvalues $(\lambda_j)_{j\in\mathbb N}$ of $C$. 
Indeed as for deff(), we\nhave that deff\u21e4() \uf8ff \uf8ff2/, moreover if j = O(j\u21b5), for \u21b5 1, we have deff\u21e4() = O(1/\u21b5) .\nThen for larger \u21b5, deff\u21e4 is smaller than 1/ and faster learning rates are possible, as shown below.\nWe next discuss the properties of the FALKON-BLESS solution denoted by bf,n,t.\nTheorem 2. Let n 2 N, > 0 and 2 (0, 1]. Assume that y 2 [ a\n2 ], almost surely, a > 0, and\ndenote by fH a minimizer of (11). There exists n0 2 N, such that for any n n0, if t log n,\n 9\uf8ff2\n+ ! .\n\n2 , a\n , then the following holds with probability at least 1 :\nR(bf,n,t) \uf8ff\n\nIn particular, when deff\u21e4() = O(1/\u21b5), for \u21b5 1, by selecting \u21e4 = n\u21b5/(\u21b5+1), we have\n\nH a2 log2 2\n\n+ 32kfHk2\n\na deff() log 2\n\n\nn log n\n\n\n\n+\n\nn2\n\nn\n\n4a\nn\n\nwhere c is given explicitly in the proof.\n\nR(bf\u21e4,n,t) \uf8ff cn \u21b5\n\n\u21b5+1 ,\n\nWe comment on the above result discussing the statistical and computational implications.\n\nStatistics. The above theorem provides statistical guarantees in terms of \ufb01nite sample bounds on\nthe excess risk of FALKON-BLESS, A \ufb01rst bound depends of the number of examples n, the\nregularization parameter and the population effective dimension deff\u21e4(). The second bound is\nderived optimizing , and is the same as the one achieved by exact kernel ridge regression which is\n\n7\n\n\fFigure 2: Runtimes with = 103 and n increasing Figure 3: C-err at 5 iterations for varying f alkon\n\nknown to be optimal [16, 17, 18]. Note that improvements under further assumptions are possible\nand are derived in the supplementary materials, see Thm. 8. Here, we comment on the computational\nproperties of FALKON-BLESS and compare it to previous solutions.\n\nComputations. 
To discuss computational implications, we recall a result from [15] show-\ning that the population version of the effective dimension deff\u21e4() and the effective dimension deff()\nassociated to the empirical kernel matrix converge up to constants.\nLemma 1. Let > 0 and 2 (0, 1]. When 9\uf8ff2\n\n , then with probability at least 1 ,\n\nn log n\n\n(1/3)deff\u21e4() \uf8ff deff() \uf8ff 3deff\u21e4().\n\nRecalling the complexity of FALKON-BLESS (16), using Thm 2 and Lemma 1, we derive a cost\n\nO\u2713ndeff\u21e4() log n +\n\n1\n\n\ndeff\u21e4()2 log n + deff\u21e4()3\u25c6\n\n1\n\ndeff\u21e4() \uf8ff \uf8ff2/,\n\nin time and O(deff\u21e4()2) in space, for all n, satisfying the assumptions in Theorem 2. These\nexpressions can be further simpli\ufb01ed. Indeed, it is easy to see that for all > 0,\n(18)\n deff\u21e4()2. Moreover, if we consider the optimal choice \u21e4 = O(n \u21b5\nso that deff\u21e4()3 \uf8ff \uf8ff2\n\u21b5+1 )\ngiven in Theorem 2, and take deff\u21e4() = O(1/\u21b5), we have 1\ndeff\u21e4(\u21e4) \uf8ffO (n), and therefore\n\u21e4\n deff\u21e4()2 \uf8ffO (ndeff\u21e4()). In summary, for the parameter choices leading to optimal learning rates,\nFALKON-BLESS has complexity eO(ndeff\u21e4(\u21e4)), in time and eO(deff\u21e4(\u21e4)2) in space, ignoring log\nterms. We can compare this to previous results. In [14] uniform sampling is considered leading to\nM \uf8ffO (1/) and achieving a complexity of eO(n/) which is always larger than the one achieved by\neO(ndeff()2) time and reducing the time complexity of FALKON to eO(ndeff(\u21e4)). Clearly in this\n\nFALKON in view of (18). Approximate leverage scores sampling is also considered in [14] requiring\n\ncase the complexity of leverage scores sampling dominates, and our results provide BLESS as a \ufb01x.\n\n4 Experiments\n\nLeverage scores accuracy. We \ufb01rst study the accuracy of the leverage scores generated by BLESS\nand BLESS-R, comparing SQUEAK [8] and Recursive-RLS (RRLS) [9]. 
We begin by uniformly\nsampling a subsets of n = 7 \u21e5 104 points from the SUSY dataset [19], and computing the exact\nleverage scores `(i, ) using a Gaussian Kernel with = 4 and = 105, which is at the limit of our\ncomputational feasibility. We then run each algorithm to compute the approximate leverage scores\ne`JH (i, ), and we measure the accuracy of each method using the ratioe`JH (i, )/`(i, ) (R-ACC).\n\nThe \ufb01nal results are presented in Figure 1. On the left side for each algorithm we report runtime, mean\nR-ACC, and the 5th and 95th quantile, each averaged over the 10 repetitions. On the right side a box-\nplot of the R-ACC. As shown in Figure 1 BLESS and BLESS-R achieve the same optimal accuracy\n\n8\n\n01234567Number of Points10410-210-1100101SecondsTime ComparisonBLESSBLESS-RSQUEAKRRLS10-1410-1210-1010-810-610-410-2100falkon0.180.20.220.240.260.280.30.32Classification ErrorTest ErrorFALKON-UNIFALKON-BLESS\fFigure 4: AUC per iteration of the SUSY dataset\n\nFigure 5: AUC per iteration of the HIGGS dataset\n\nof SQUEAK with just a fraction of time. Note that despite our best efforts, we could not obtain\nhigh-accuracy results for RRLS (maybe a wrong constant in the original implementation). However\nnote that RRLS is computationally demanding compared to BLESS, being orders of magnitude\nslower, as expected from the theory. Finally, although uniform sampling is the fastest approach, it\nsuffers from much larger variance and can over or under-estimate leverage scores by an order of\nmagnitude more than the other methods, making it more fragile for downstream applications.\nIn Fig. 2 we plot the runtime cost of the compared algorithms as the number of points grows from\nn = 1000 to 70000, this time for = 103. We see that while previous algorithms\u2019 runtime grows\nnear-linearly with n, BLESS and BLESS-R run in a constant 1/ runtime, as predicted by the theory.\n\nBLESS for supervised learning. 
We study the performance of FALKON-BLESS and compare it\nwith the original FALKON [14] where an equal number of Nystr\u00f6m centres are sampled uniformly at\nrandom (FALKON-UNI). We take from [14] the two biggest datasets and their best hyper-parameters\nfor the FALKON algorithm.\nWe noticed that it is possible to achieve the same accuracy of FALKON-UNI, by using bless for\nBLESS and f alkon for FALKON with bless f alkon, in order to lower the deff and keep\nthe number of Nystr\u00f6m centres low. For the SUSY dataset we use a Gaussian Kernel with =\n4, f alkon = 106, bless = 104 obtaining MH ' 104 Nystr\u00f6m centres. For the HIGGS dataset\nwe use a Gaussian Kernel with = 22, f alkon = 108, bless = 106, obtaining MH ' 3 \u21e5 104\nNystr\u00f6m centres. We then sample a comparable number of centers uniformly for FALKON-UNI.\nLooking at the plot of their AUC at each iteration (Fig.4,5) we observe that FALKON-BLESS\nconverges much faster than FALKON-UNI. For the SUSY dataset (Figure 4) 5 iterations of FALKON-\nBLESS (160 seconds) achieve the same accuracy of 20 iterations of FALKON-UNI (610 seconds).\nSince running BLESS takes just 12 secs. this corresponds to a \u21e0 4\u21e5 speedup. For the HIGGS dataset\n10 iter. of FALKON-BLESS (with BLESS requiring 1.5 minutes, for a total of 1.4 hours) achieve\nbetter accuracy of 20 iter. of FALKON-UNI (2.7 hours). Additionally we observed that FALKON-\nBLESS is more stable than FALKON-UNI w.r.t. f alkon, . In Figure 3 the classi\ufb01cation error after\n5 iterations of FALKON-BLESS and FALKON-UNI over the SUSY dataset (bless = 104). We\nnotice that FALKON-BLESS has a wider optimal region (95% of the best error) for the regulariazion\nparameter ([1.3 \u21e5 103, 4.8 \u21e5 108]) w.r.t. 
FALKON-UNI ([1.3 \u21e5 103, 3.8 \u21e5 106]).\n5 Conclusions\n\nIn this paper we presented two algorithms BLESS and BLESS-R to ef\ufb01ciently compute a small set\nof columns from a large symmetric positive semide\ufb01nite matrix K, useful for approximating the\nmatrix or to compute leverage scores with a given precision. Moreover we applied the proposed\nalgorithms in the context of statistical learning with least squares, combining BLESS with FALKON\n[14]. We analyzed the computational and statistical properties of the resulting algorithm, showing that\nit achieves optimal statistical guarantees with a cost that is O(ndeff\u21e4()) in time, being currently the\nfastest. We can extend the proposed work in several ways: (a) combine BLESS with fast stochastic\n[20] or online [21] gradient algorithms and other approximation schemes (i.e. random features\n[22, 23, 24]), to further reduce the computational complexity for optimal rates, (b) consider the\nimpact of BLESS in the context of multi-tasking [25, 26] or structured prediction [27, 28].\n\n9\n\n5101520Iterations0.50.550.60.650.70.750.80.85AUCTest Accuracy\fAcknowledgments.\nThis material is based upon work supported by the Center for Brains, Minds and Machines (CBMM), funded by\nNSF STC award CCF-1231216, and the Italian Institute of Technology. We gratefully acknowledge the support\nof NVIDIA Corporation for the donation of the Titan Xp GPUs and the Tesla k40 GPU used for this research.\nL. R. acknowledges the support of the AFOSR projects FA9550-17-1-0390 and BAA-AFRL-AFOSR-2016-0007\n(European Of\ufb01ce of Aerospace Research and Development), and the EU H2020-MSCA-RISE project NoMADS\n- DLV-777826. A. R. acknowledges the support of the European Research Council (grant SEQUOIA 724063).\n\nReferences\n[1] Raman Arora, Andrew Cotter, Karen Livescu, and Nathan Srebro. Stochastic optimization for\npca and pls. 
In Communication, Control, and Computing (Allerton), 2012 50th Annual Allerton Conference on, pages 861–868. IEEE, 2012.

[2] David P. Woodruff. Sketching as a tool for numerical linear algebra. arXiv preprint arXiv:1411.4357, 2014.

[3] Joel A. Tropp. User-friendly tools for random matrices: An introduction. Technical report, California Institute of Technology, Pasadena, Division of Engineering and Applied Science, 2012.

[4] Christopher Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. In Neural Information Processing Systems, 2001.

[5] Francis Bach. Sharp analysis of low-rank kernel matrix approximations. In Conference on Learning Theory, 2013.

[6] Ahmed El Alaoui and Michael W. Mahoney. Fast randomized kernel methods with statistical guarantees. In Neural Information Processing Systems, 2015.

[7] Petros Drineas, Malik Magdon-Ismail, Michael W. Mahoney, and David P. Woodruff. Fast approximation of matrix coherence and statistical leverage. The Journal of Machine Learning Research, 13(1):3475–3506, 2012.

[8] Daniele Calandriello, Alessandro Lazaric, and Michal Valko. Distributed adaptive sampling for kernel matrix approximation. In AISTATS, 2017.

[9] Cameron Musco and Christopher Musco. Recursive sampling for the Nyström method. In NIPS, 2017.

[10] Daniele Calandriello, Alessandro Lazaric, and Michal Valko. Second-order kernel online convex optimization with adaptive sketching. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 645–653, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.

[11] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian processes for machine learning. Adaptive Computation and Machine Learning.
MIT Press, Cambridge, MA, 2006.

[12] Bernhard Schölkopf, Alexander J. Smola, et al. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT Press, 2002.

[13] Alex J. Smola and Bernhard Schölkopf. Sparse greedy matrix approximation for machine learning. In International Conference on Machine Learning, 2000.

[14] Alessandro Rudi, Luigi Carratino, and Lorenzo Rosasco. FALKON: An optimal large scale kernel method. In Advances in Neural Information Processing Systems, pages 3891–3901, 2017.

[15] Alessandro Rudi, Raffaello Camoriano, and Lorenzo Rosasco. Less is more: Nyström computational regularization. In Advances in Neural Information Processing Systems, pages 1657–1665, 2015.

[16] Andrea Caponnetto and Ernesto De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007.

[17] Ingo Steinwart, Don R. Hush, Clint Scovel, et al. Optimal rates for regularized least squares regression. In COLT, 2009.

[18] Junhong Lin, Alessandro Rudi, Lorenzo Rosasco, and Volkan Cevher. Optimal rates for spectral algorithms with least-squares regression over Hilbert spaces. Applied and Computational Harmonic Analysis, 2018.

[19] Pierre Baldi, Peter Sadowski, and Daniel Whiteson. Searching for exotic particles in high-energy physics with deep learning. Nature Communications, 5:4308, 2014.

[20] Nicolas L. Roux, Mark Schmidt, and Francis R. Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems, pages 2663–2671, 2012.

[21] Daniele Calandriello, Alessandro Lazaric, and Michal Valko. Efficient second-order online kernel learning with adaptive embedding. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R.
Garnett, editors, Advances in Neural Information Processing Systems 30, pages 6140–6150. Curran Associates, Inc., 2017.

[22] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pages 1177–1184, 2008.

[23] Alessandro Rudi and Lorenzo Rosasco. Generalization properties of learning with random features. In Advances in Neural Information Processing Systems, pages 3215–3225, 2017.

[24] Luigi Carratino, Alessandro Rudi, and Lorenzo Rosasco. Learning with SGD and random features. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 10213–10224. Curran Associates, Inc., 2018.

[25] Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243–272, 2008.

[26] Carlo Ciliberto, Alessandro Rudi, Lorenzo Rosasco, and Massimiliano Pontil. Consistent multi-task learning with nonlinear output relations. In Advances in Neural Information Processing Systems, pages 1986–1996, 2017.

[27] Carlo Ciliberto, Lorenzo Rosasco, and Alessandro Rudi. A consistent regularization approach for structured prediction. In Advances in Neural Information Processing Systems 29, pages 4412–4420, 2016.

[28] Anna Korba, Alexandre Garcia, and Florence d'Alché-Buc. A structured prediction approach for label ranking. In Advances in Neural Information Processing Systems, pages 9008–9018, 2018.

[29] Nachman Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337–404, 1950.

[30] Ingo Steinwart and Andreas Christmann. Support vector machines.
Springer Science & Business Media, 2008.