{"title": "Fast Randomized Kernel Ridge Regression with Statistical Guarantees", "book": "Advances in Neural Information Processing Systems", "page_first": 775, "page_last": 783, "abstract": "One approach to improving the running time of kernel-based methods is to build a small sketch of the kernel matrix and use it in lieu of the full matrix in the machine learning task of interest. Here, we describe a version of this approach that comes with running time guarantees as well as improved guarantees on its statistical performance. By extending the notion of \\emph{statistical leverage scores} to the setting of kernel ridge regression, we are able to identify a sampling distribution that reduces the size of the sketch (i.e., the required number of columns to be sampled) to the \\emph{effective dimensionality} of the problem. This latter quantity is often much smaller than previous bounds that depend on the \\emph{maximal degrees of freedom}. We give empirical evidence supporting this fact. Our second contribution is to present a fast algorithm to quickly compute coarse approximations to these scores in time linear in the number of samples. More precisely, the running time of the algorithm is $O(np^2)$ with $p$ depending only on the trace of the kernel matrix and the regularization parameter. This is obtained via a variant of squared length sampling that we adapt to the kernel setting. Lastly, we discuss how this new notion of the leverage of a data point captures a fine notion of the difficulty of the learning problem.", "full_text": "Fast Randomized Kernel Ridge Regression with Statistical Guarantees∗

Ahmed El Alaoui†  Michael W. Mahoney‡
†Electrical Engineering and Computer Sciences
‡Statistics and International Computer Science Institute
University of California, Berkeley, Berkeley, CA 94720.
{elalaoui@eecs,mmahoney@stat}.berkeley.edu

Abstract

One approach to improving the running time of kernel-based methods is to build a small sketch of the kernel matrix and use it in lieu of the full matrix in the machine learning task of interest. Here, we describe a version of this approach that comes with running time guarantees as well as improved guarantees on its statistical performance. By extending the notion of statistical leverage scores to the setting of kernel ridge regression, we are able to identify a sampling distribution that reduces the size of the sketch (i.e., the required number of columns to be sampled) to the effective dimensionality of the problem. This latter quantity is often much smaller than previous bounds that depend on the maximal degrees of freedom. We give empirical evidence supporting this fact. Our second contribution is to present a fast algorithm to quickly compute coarse approximations to these scores in time linear in the number of samples. More precisely, the running time of the algorithm is O(np²) with p depending only on the trace of the kernel matrix and the regularization parameter. This is obtained via a variant of squared length sampling that we adapt to the kernel setting. Lastly, we discuss how this new notion of the leverage of a data point captures a fine notion of the difficulty of the learning problem.

1 Introduction

We consider the low-rank approximation of symmetric positive semi-definite (SPSD) matrices that arise in machine learning and data analysis, with an emphasis on obtaining good statistical guarantees. This is of interest primarily in connection with kernel-based machine learning methods.
Recent work in this area has focused on one or the other of two very different perspectives: an algorithmic perspective, where the focus is on running time issues and worst-case quality-of-approximation guarantees, given a fixed input matrix; and a statistical perspective, where the goal is to obtain good inferential properties, under some hypothesized model, by using the low-rank approximation in place of the full kernel matrix. The recent results of Gittens and Mahoney [2] provide the strongest example of the former, and the recent results of Bach [3] are an excellent example of the latter. In this paper, we combine ideas from these two lines of work in order to obtain a fast randomized kernel method with statistical guarantees that are improved relative to the state of the art.

To understand our approach, recall that several papers have established the crucial importance, from the algorithmic perspective, of the statistical leverage scores, as they capture structural non-uniformities of the input matrix and can be used to obtain very sharp worst-case approximation guarantees. See, e.g., work on CUR matrix decompositions [5, 6], work on the fast approximation of the statistical leverage scores [7], and the recent review [8] for more details. Here, we simply note that, when restricted to an n × n SPSD matrix K and a rank parameter k, the statistical leverage scores relative to the best rank-k approximation to K, call them ℓ_i for i ∈ {1, ..., n}, are the diagonal elements of the projection matrix onto the best rank-k approximation of K. That is, ℓ_i = diag(K_k K_k^†)_i, where K_k is the best rank-k approximation of K and where K_k^† is the Moore-Penrose inverse of K_k.

∗A technical report version of this conference paper is available at [1].
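As a concrete illustration (our own minimal numpy sketch, not code from the paper): when the top-k eigenvalues of the SPSD matrix are positive and separated from the rest, diag(K_k K_k^†) equals the squared row norms of the top-k eigenvector block, so the scores can be read off an eigendecomposition. The helper name `rank_k_leverage_scores` is ours.

```python
import numpy as np

def rank_k_leverage_scores(K, k):
    # Leverage scores relative to the best rank-k approximation:
    # diagonal of the projector onto the top-k eigenspace of the SPSD matrix K.
    _, U = np.linalg.eigh(K)          # eigenvalues returned in ascending order
    Uk = U[:, -k:]                    # eigenvectors of the k largest eigenvalues
    return np.sum(Uk ** 2, axis=1)    # (U_k U_k^T)_{ii} = squared row norms

# toy SPSD matrix of rank 3
rng = np.random.default_rng(0)
X = rng.standard_normal((8, 3))
K = X @ X.T
scores = rank_k_leverage_scores(K, k=2)
```

By construction the scores lie in [0, 1] and sum to k, since they are the diagonal of a rank-k orthogonal projector.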
The recent work by Gittens and Mahoney [2] showed that qualitatively improved worst-case bounds for the low-rank approximation of SPSD matrices could be obtained in one of two related ways: either compute (with the fast algorithm of [7]) approximations to the leverage scores, and use those approximations as an importance sampling distribution in a random sampling algorithm; or rotate (with a Gaussian-based or Hadamard-based random projection) to a random basis where those scores are uniformized, and sample randomly in that rotated basis.

In this paper, we extend these ideas, and we show that, from the statistical perspective, we are able to obtain a low-rank approximation that comes with improved statistical guarantees by using a variant of this more traditional notion of statistical leverage. In particular, we improve the recent bounds of Bach [3], which provide the first known statistical convergence result when substituting the kernel matrix by its low-rank approximation. To understand the connection, recall that a key component of Bach's approach is the quantity d_mof = n ‖diag(K(K + nλI)^{-1})‖_∞, which he calls the maximal marginal degrees of freedom.¹ Bach's main result is that by constructing a low-rank approximation of the original kernel matrix by sampling p = O(d_mof/ε) columns uniformly at random, i.e., performing the vanilla Nyström method, and then by using this low-rank approximation in a prediction task, the statistical performance is within a factor of 1 + ε of the performance when the entire kernel matrix is used. Here, we show that this uniform sampling is suboptimal. We do so by sampling with respect to a coarse but quickly-computable approximation of a variant of the statistical leverage scores, given in Definition 1 below, and we show that we can obtain similar 1 + ε guarantees by sampling only O(d_eff/ε) columns, where d_eff = Tr(K(K + nλI)^{-1}) < d_mof. The quantity d_eff is called the effective dimensionality of the learning problem, and it can be interpreted as the implicit number of parameters in this nonparametric setting [9, 10].

We expect that our results and insights will be useful much more generally. As an example of this, we can directly compare the Nyström sampling method to a related divide-and-conquer approach, thereby answering an open problem of Zhang et al. [9]. Recall that the Zhang et al. divide-and-conquer method consists of dividing the dataset {(x_i, y_i)}_{i=1}^n into m random partitions of equal size, computing estimators on each partition in parallel, and then averaging the estimators. They prove the minimax optimality of their estimator, although their multiplicative constants are suboptimal; and, in terms of the number of kernel evaluations, their method requires m × (n/m)², with m on the order of n/d²_eff, which gives a total number of O(nd²_eff) evaluations. They noticed that the scaling of their estimator was not directly comparable to that of the Nyström sampling method (which was proven to require only O(nd_mof) evaluations if the sampling is uniform [3]), and they left it as an open problem to determine whether either method is fundamentally better than the other. Using our Theorem 3, we are able to put both results on a common ground for comparison. Indeed, the estimator obtained by our non-uniform Nyström sampling requires only O(nd_eff) kernel evaluations (compared to O(nd²_eff) and O(nd_mof)), and it obtains the same bound on the statistical predictive performance as in [3].
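To make the comparison concrete, here is a schematic numpy rendering (our own, not the authors' code) of the divide-and-conquer estimator with the squared loss. The RBF kernel, its bandwidth, and all helper names are illustrative choices, not taken from [9].

```python
import numpy as np

def krr_fit_predict(K_train, y_train, K_cross, lam):
    # alpha = (K + n*lam*I)^{-1} y; predictions at new points are K_cross @ alpha
    n = K_train.shape[0]
    alpha = np.linalg.solve(K_train + n * lam * np.eye(n), y_train)
    return K_cross @ alpha

def dc_krr(x, y, x_test, kernel, lam, m, seed=1):
    # Divide-and-conquer KRR: m random partitions, one estimator each, averaged.
    idx = np.random.default_rng(seed).permutation(len(x))
    preds = [
        krr_fit_predict(kernel(x[part], x[part]), y[part],
                        kernel(x_test, x[part]), lam)
        for part in np.array_split(idx, m)
    ]
    return np.mean(preds, axis=0)

# toy 1-d regression with a hand-picked RBF bandwidth (assumption, for illustration)
rbf = lambda a, b: np.exp(-50.0 * (a[:, None] - b[None, :]) ** 2)
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 200)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(200)
x_test = np.linspace(0, 1, 50)
pred = dc_krr(x, y, x_test, rbf, lam=1e-3, m=4)
```

Each partition solves a (n/m) × (n/m) system, which is the source of the m × (n/m)² kernel-evaluation count quoted above.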
In this sense, our result combines "the best of both worlds," by having the reduced sample complexity of [9] and the sharp approximation bound of [3].

2 Preliminaries and notation

Let {(x_i, y_i)}_{i=1}^n be n pairs of points in X × Y, where X is the input space and Y is the response space. The kernel-based learning problem can be cast as the following minimization problem:

min_{f ∈ F}  (1/n) Σ_{i=1}^n ℓ(y_i, f(x_i)) + (λ/2) ‖f‖²_F,   (1)

where F is a reproducing kernel Hilbert space and ℓ : Y × Y → R is a loss function. We denote by k : X × X → R the positive definite kernel corresponding to F and by φ : X → F a corresponding feature map. That is, k(x, x′) = ⟨φ(x), φ(x′)⟩_F for every x, x′ ∈ X. The representer theorem [11, 12] allows us to reduce Problem (1) to a finite-dimensional optimization problem, in which case Problem (1) boils down to finding the vector α ∈ R^n that solves

min_{α ∈ R^n}  (1/n) Σ_{i=1}^n ℓ(y_i, (Kα)_i) + (λ/2) α^⊤ K α,   (2)

where K_ij = k(x_i, x_j). We let UΣU^⊤ be the eigenvalue decomposition of K, with Σ = Diag(σ_1, ..., σ_n), σ_1 ≥ ··· ≥ σ_n ≥ 0, and U an orthogonal matrix. The underlying data model is

y_i = f*(x_i) + σ ξ_i,   i = 1, ..., n,

with f* ∈ F, (x_i)_{1≤i≤n} a deterministic sequence, and the ξ_i i.i.d. standard normal random variables.

¹We will refer to it as the maximal degrees of freedom.

We consider ℓ to be the squared loss, in which case we will be interested in the mean squared error as a measure of statistical risk: for any estimator f̂, let

R(f̂) := (1/n) E_ξ ‖f̂ − f*‖²₂   (3)

be the risk function of f̂, where E_ξ denotes the expectation under the randomness induced by ξ. In this setting the problem is called Kernel Ridge Regression (KRR). The solution to Problem (2) is α = (K + nλI)^{-1} y, and the estimate of f* at any training point x_i is given by f̂(x_i) = (Kα)_i. We will use f̂_K as a shorthand for the vector (f̂(x_i))_{1≤i≤n} ∈ R^n when the matrix K is used as a kernel matrix. This notation will be used accordingly for other kernel matrices (e.g. f̂_L for a matrix L). Recall that the risk of the estimator f̂_K can then be decomposed into a bias and a variance term:

R(f̂_K) = (1/n) E_ξ ‖K(K + nλI)^{-1}(f* + σξ) − f*‖²₂
       = (1/n) ‖(K(K + nλI)^{-1} − I) f*‖²₂ + (σ²/n) E_ξ ‖K(K + nλI)^{-1} ξ‖²₂
       = nλ² ‖(K + nλI)^{-1} f*‖²₂ + (σ²/n) Tr(K²(K + nλI)^{-2})
       =: bias(K)² + variance(K).   (4)

Solving Problem (2), either by a direct method or by an optimization algorithm, needs at least a quadratic and often a cubic running time in n, which is prohibitive in the large-scale setting. The so-called Nyström method approximates the solution to Problem (2) by substituting K with a low-rank approximation to K. In practice, this approximation is often not only fast to construct, but the resulting learning problem is also often easier to solve [13, 14, 15, 2]. The method operates as follows.
A small number of columns K_1, ..., K_p are randomly sampled from K. If we let C = [K_1, ..., K_p] ∈ R^{n×p} denote the matrix containing the sampled columns, and W ∈ R^{p×p} the overlap between C and C^⊤ in K, then the Nyström approximation of K is the matrix

L = C W^† C^⊤.

More generally, if we let S ∈ R^{n×p} be an arbitrary sketching matrix, i.e., a tall and skinny matrix that, when left-multiplied by K, produces a "sketch" of K that preserves some desirable properties, then the Nyström approximation associated with S is

L = KS(S^⊤KS)^† S^⊤K.

For instance, for random sampling algorithms, S would contain a non-zero entry at position (i, j) if the i-th column of K is chosen at the j-th trial of the sampling process. Alternatively, S could also be a random projection matrix; or S could be constructed with some other (perhaps deterministic) method, as long as it verifies some structural properties, depending on the application [8, 2, 6, 5]. We will focus in this paper on analyzing this approximation in the statistical prediction context related to the estimation of f* by solving Problem (2). We proceed by revisiting and improving upon prior results from three different areas. The first result (Theorem 1) is on the behavior of the bias of f̂_L, when L is constructed using a general sketching matrix S. This result underlies the statistical analysis of the Nyström method. To see this, first, it is not hard to prove that L ⪯ K in the sense of the usual order on the positive semi-definite cone. Second, one can prove that the variance is matrix-increasing; hence the variance will decrease when replacing K by L. On the other hand, the bias (while not matrix monotone in general) can be proven to not increase too much when replacing K by L.
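An illustrative numpy sketch (our own, not the paper's implementation) of the column-sampled Nyström approximation, including a numerical check of the ordering L ⪯ K; it also shows that the approximation is exact once the sampled columns span the range of K.

```python
import numpy as np

def nystrom(K, cols):
    # L = C W^+ C^T, with C the sampled columns and W the p x p overlap block
    C = K[:, cols]
    W = K[np.ix_(cols, cols)]
    return C @ np.linalg.pinv(W) @ C.T

rng = np.random.default_rng(0)
X = rng.standard_normal((60, 5))
K = X @ X.T                           # SPSD matrix of rank 5
L_small = nystrom(K, [0, 1, 2])       # p = 3 < rank(K): a strict approximation
L_full = nystrom(K, list(range(10)))  # p = 10 columns span the range of K
```

Here K − L_small is positive semi-definite (the ordering L ⪯ K mentioned above), while L_full recovers K exactly because the sampled columns already span its range.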
This latter statement will be the main technical difficulty for obtaining a bound on R(f̂_L) (see Appendix A). A form of this result is due to Bach [3] in the case where S is a uniform sampling matrix. The second result (Theorem 2) is a concentration bound for approximating matrix multiplication when the rank-one components of the product are sampled non-uniformly. This result is derived from the matrix Bernstein inequality, and yields a sharp quantification of the deviation of the approximation from the true product. The third result (Definition 1) is an extension of the definition of the leverage scores to the context of kernel ridge regression. Whereas the notion of leverage is established as an algorithmic tool in randomized linear algebra, we introduce a natural counterpart of it in this statistical setting. By combining these contributions, we are able to give a sharp statistical statement on the behavior of the Nyström method if one is allowed to sample non-uniformly. All the proofs are deferred to the appendix (or see [1]).

3 Revisiting prior work and new results

3.1 A structural result

We begin by stating a "structural" result that upper-bounds the bias of the estimator constructed using the approximation L. This result is deterministic: it only depends on the properties of the input data, and it holds for any sketching matrix S that satisfies certain conditions. This way the randomness of the construction of S is decoupled from the rest of the analysis. We highlight the fact that this view offers a possible way of improving the current results, since a better construction of S (whether deterministic or random) satisfying the data-related conditions would immediately lead to downstream algorithmic and statistical improvements in this setting.

Theorem 1. Let S ∈ R^{n×p} be a sketching matrix and L the corresponding Nyström approximation. For γ > 0, let Φ = Σ(Σ + nγI)^{-1}. If the sketching matrix S satisfies

λ_max(Φ − Φ^{1/2} U^⊤ S S^⊤ U Φ^{1/2}) ≤ t

for some t ∈ (0, 1), and λ ≥ (1/(1−t)) ‖S‖²_op λ_max(K)/n, where λ_max denotes the maximum eigenvalue and ‖·‖_op is the operator norm, then

bias(L) ≤ (1 + (γ/λ)/(1 − t)) bias(K).   (5)

In the special case where S contains one non-zero entry equal to 1/√(pn) in every column, with p the number of sampled columns, the result and its proof can be found in [3] (Appendix B.2), although we believe that their argument contains a problematic statement. We propose an alternative and complete proof in Appendix A. The subsequent analysis unfolds in two steps: (1) assuming the sketching matrix S satisfies the conditions stated in Theorem 1, we will have R(f̂_L) ≲ R(f̂_K); and (2) matrix concentration is used to show that an appropriate random construction of S satisfies the said conditions. We start by stating the concentration result that is the source of our improvement (Section 3.2), define a notion of statistical leverage scores (Section 3.3), and then state and prove the main statistical result (Theorem 3, Section 3.4). We then present our main algorithmic result, consisting of a fast approximation to this new notion of leverage scores (Section 3.5).

3.2 A concentration bound on matrix multiplication

Next, we state our result for approximating matrix products of the form ΨΨ^⊤ when a few columns from Ψ are sampled to form the approximate product Ψ_I Ψ_I^⊤, where Ψ_I contains the chosen columns. The proof relies on a matrix Bernstein inequality (see e.g. [16]) and is presented at the end of the paper (Appendix B).

Theorem 2. Let n, m be positive integers.
Consider a matrix Ψ ∈ R^{n×m} and denote by ψ_i the i-th column of Ψ. Let p ≤ m and let I = {i_1, ..., i_p} be a subset of {1, ..., m} formed by p elements chosen randomly with replacement, according to a distribution satisfying

∀i ∈ {1, ..., m}:  Pr(choosing i) = p_i ≥ β ‖ψ_i‖²₂ / ‖Ψ‖²_F   (6)

for some β ∈ (0, 1]. Let S ∈ R^{m×p} be a sketching matrix such that S_{ij} = 1/√(p · p_{i_j}) if i = i_j, and 0 elsewhere. Then

Pr[ λ_max(ΨΨ^⊤ − ΨSS^⊤Ψ^⊤) ≥ t ] ≤ n exp( −(p t²/2) / (λ_max(ΨΨ^⊤)(‖Ψ‖²_F/β + t/3)) ).   (7)

Remarks: 1. This result will be used for Ψ = Φ^{1/2} U^⊤, in conjunction with Theorem 1, to prove our main result in Theorem 3. Notice that Ψ^⊤ is a scaled version of the eigenvectors, with a scaling given by the diagonal matrix Φ = Σ(Σ + nγI)^{-1}, which should be considered a "soft projection" matrix that smoothly selects the top part of the spectrum of K. The setting of Gittens et al. [2], in which Φ is a 0-1 diagonal matrix, is the closest analog of our setting.

2. It is known that p_i = ‖ψ_i‖²₂/‖Ψ‖²_F is the optimal sampling distribution in terms of minimizing the expected error E‖ΨΨ^⊤ − ΨSS^⊤Ψ^⊤‖²_F [17]. The above result exhibits a robustness property by allowing the chosen sampling distribution to differ from the optimal one by a factor β.² The sub-optimality of such a distribution is reflected in the upper bound (7) by the amplification of the squared Frobenius norm of Ψ by a factor 1/β.
For instance, if the sampling distribution is chosen to be uniform, i.e. p_i = 1/m, then the value of β for which (6) is tight is β = ‖Ψ‖²_F / (m max_i ‖ψ_i‖²₂), in which case we recover a concentration result proven by Bach [3]. Note that Theorem 2 is derived from one of the state-of-the-art bounds on matrix concentration, but it is one among many others in the literature; and while it constitutes the basis of our improvement, it is possible that a concentration bound more tailored to the problem might yield sharper results.

3.3 An extended definition of leverage

We introduce an extended notion of leverage scores that is specifically tailored to the ridge regression problem, and that we call the λ-ridge leverage scores.

Definition 1. For λ > 0, the λ-ridge leverage scores associated with the kernel matrix K and the parameter λ are

∀i ∈ {1, ..., n}:  l_i(λ) = Σ_{j=1}^n (σ_j / (σ_j + nλ)) U_ij².   (8)

Note that l_i(λ) is the i-th diagonal entry of K(K + nλI)^{-1}. The quantities (l_i(λ))_{1≤i≤n} are in this setting the analogs of the so-called leverage scores in the statistical literature, as they characterize the data points that "stick out", and consequently most affect the result of a statistical procedure. They are classically defined as the row norms of the left singular matrix U of the input matrix, and they have been used in regression diagnostics for outlier detection [18], and more recently in randomized matrix algorithms, as they often provide an optimal importance sampling distribution for constructing random sketches for low-rank approximation [17, 19, 5, 6, 2] and least squares regression [20] when the input matrix is tall and skinny (n ≥ m). In the case where the input matrix is square, this definition is vacuous, as the row norms of U are all equal to 1. Recently, Gittens and Mahoney [2] used a truncated version of these scores (which they called leverage scores relative to the best rank-k space) to obtain the best algorithmic results known to date on low-rank approximation of positive semi-definite matrices. Definition 1 is a weighted version of the classical leverage scores, where the weights depend on the spectrum of K and a regularization parameter λ. In this sense, it is an interpolation between Gittens' scores and the classical (tall-and-skinny) leverage scores, where the parameter λ plays the role of a rank parameter. In addition, we point out that Bach's maximal degrees of freedom d_mof is to the λ-ridge leverage scores what the coherence is to Gittens' leverage scores, i.e. their (scaled) maximum value: d_mof/n = max_i l_i(λ); and that while the sum of Gittens' scores is the rank parameter k, the sum of the λ-ridge leverage scores is the effective dimensionality d_eff. We argue in the following that Definition 1 provides a relevant notion of leverage in the context of kernel ridge regression. It is the natural counterpart of the algorithmic notion of leverage in the prediction context. We use it in the next section to make a statistical statement on the performance of the Nyström method.

²In their work [17], Drineas et al. have a comparable robust statement for controlling the expected error. Our result is a robust quantification of the tail probability of the error, which is a much stronger statement.

3.4 Main statistical result: an error bound on approximate kernel ridge regression

Now we are able to give an improved version of a theorem by Bach [3] that establishes a performance guarantee on the use of the Nyström method in the context of kernel ridge regression.
It is improved in the sense that the sufficient number of columns that should be sampled in order to incur no (or little) loss in the prediction performance is lower. This is due to a more data-sensitive way of sampling the columns of K (depending on the λ-ridge leverage scores) during the construction of the approximation L. The proof is in Appendix C.

Theorem 3. Let λ, ε > 0, ρ ∈ (0, 1/2), n ≥ 2, and let L be a Nyström approximation of K obtained by choosing p columns randomly with replacement according to a probability distribution (p_i)_{1≤i≤n} such that

∀i ∈ {1, ..., n}:  p_i ≥ β l_i(λε) / Σ_{j=1}^n l_j(λε)

for some β ∈ (0, 1], and let l ≤ min_i l_i(λε). If

p ≥ 8 (d_eff/β + 1/6) log(n/ρ)   and   λ ≥ 2 (1 + 1/l) λ_max(K)/n,

with d_eff = Σ_{i=1}^n l_i(λε) = Tr(K(K + nλεI)^{-1}), then

R(f̂_L) ≤ (1 + 2ε)² R(f̂_K)

with probability at least 1 − 2ρ, where the (l_i)_i are introduced in Definition 1 and R is defined in (3).

Theorem 3 asserts that substituting the kernel matrix K by a Nyström approximation of rank p in the KRR problem induces an arbitrarily small prediction loss, provided that p scales linearly with the effective dimensionality d_eff³ and that λ is not too small⁴. The leverage-based sampling appears to be crucial for obtaining this dependence, as the λ-ridge leverage scores provide information on which columns (and hence which data points) capture most of the difficulty of the estimation problem. Also, as a sanity check, the smaller the target accuracy ε, the higher d_eff, and the more uniform the sampling distribution (l_i(λε))_i becomes.
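On a small problem the quantities in Definition 1 and Theorem 3 can be computed exactly; below is an illustrative numpy check (ours, not the paper's code) that the scores sum to d_eff, that d_eff is at most the rank of K, and that d_eff ≤ d_mof.

```python
import numpy as np

def ridge_leverage_scores(K, lam):
    # l_i(lam) = i-th diagonal entry of K (K + n*lam*I)^{-1}
    n = K.shape[0]
    return np.diag(K @ np.linalg.inv(K + n * lam * np.eye(n)))

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))
K = X @ X.T                        # rank-4 kernel matrix
lam = 1e-2
l = ridge_leverage_scores(K, lam)
d_eff = l.sum()                    # = Tr(K (K + n*lam*I)^{-1})
d_mof = K.shape[0] * l.max()       # maximal degrees of freedom: n * max_i l_i
```

Since d_eff = Σ_j σ_j/(σ_j + nλ), it is bounded by the number of non-zero eigenvalues, here 4, whereas d_mof can be as large as n.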
In the limit ε → 0, p is of the order of n and the scores are uniform, and the method is essentially equivalent to using the entire matrix K. Moreover, if the sampling distribution (p_i)_i is a factor β away from optimal, a slight oversampling (i.e. increasing p by a factor 1/β) achieves the same performance. In this sense, the above result shows robustness to the sampling distribution. This property is very beneficial from an implementation point of view, as the error bounds still hold when only an approximation of the leverage scores is available. If the columns are sampled uniformly, a worse lower bound on p, depending on d_mof, is obtained [3].

3.5 Main algorithmic result: a fast approximation to the λ-ridge leverage scores

Although the λ-ridge leverage scores can be naively computed using an SVD, the exact computation is as costly as solving the original Problem (2). Therefore, the central role they play in the above result motivates the problem of a fast approximation, in a similar way that the importance of the usual leverage scores motivated Drineas et al. to approximate them in random projection time [7]. A success in this task will allow us to combine the running time benefits with the improved statistical guarantees we have provided.

Algorithm:

• Inputs: data points (x_i)_{1≤i≤n}, probability vector (p_i)_{1≤i≤n}, sampling parameter p ∈ {1, 2, ...}, λ > 0, ε ∈ (0, 1/2).
• Output: (l̃_i)_{1≤i≤n}, ε-approximations to (l_i(λ))_{1≤i≤n}.

1. Sample p data points from (x_i)_{1≤i≤n} with replacement, with probabilities (p_i)_{1≤i≤n}.
2. Compute the corresponding columns K_1, ..., K_p of the kernel matrix.
3. Construct C = [K_1, ..., K_p] ∈ R^{n×p} and W ∈ R^{p×p} as presented in Section 2.
4. Construct B ∈ R^{n×p} such that BB^⊤ = CW^†C^⊤.
5.
For every i ∈ {1, ..., n}, set

l̃_i = B_i^⊤ (B^⊤B + nλI)^{-1} B_i   (9)

where B_i is the i-th row of B, and return it.

Running time: The running time of the above algorithm is dominated by steps 4 and 5. Indeed, constructing B can be done using a Cholesky factorization of W and then a multiplication of C by the inverse of the obtained Cholesky factor, which yields a running time of O(p³ + np²). Computing the approximate leverage scores (l̃_i)_{1≤i≤n} in step 5 also runs in O(p³ + np²). Thus, for p ≪ n, the overall algorithm runs in O(np²). Note that formula (9) only involves matrices and vectors of size p (everything is computed in the smaller dimension p), and the fact that this yields a correct approximation relies on the matrix inversion lemma (see proof in Appendix D). Also, only the relevant columns of K are computed, and we never have to form the entire kernel matrix. This improves over earlier methods [2] that require all of K to be written down in memory. The improved running time is obtained by considering the construction (9), which is quite different from the regular setting of approximating the leverage scores of a rectangular matrix [7]. We now give both additive and multiplicative error bounds on its approximation quality.

Theorem 4. Let ε ∈ (0, 1/2), ρ ∈ (0, 1) and λ > 0. Let L be a Nyström approximation of K obtained by choosing p columns at random with probabilities p_i = K_ii/Tr(K), i = 1, ..., n. If

p ≥ 8 (Tr(K)/(nλε) + 1/6) log(n/ρ),

then for every i ∈ {1, ..., n} we have

(additive error bound)   l_i(λ) − 2ε ≤ l̃_i ≤ l_i(λ)

and

(multiplicative error bound)   ((σ_n − nλε)/(σ_n + nλε)) l_i(λ) ≤ l̃_i ≤ l_i(λ)

with probability at least 1 − ρ.

Remarks: 1. Theorem 4 states that if the columns of K are sampled proportionally to K_ii, then O(Tr(K)/(nλ)) is a sufficient number of samples. Recall that K_ii = ‖φ(x_i)‖²_F, so our procedure is akin to sampling according to the squared lengths of the data vectors, which has been extensively used in different contexts of randomized matrix approximation [21, 17, 19, 8, 2].

2. Due to how λ is defined in eq. (1), the n in the denominator is artificial: nλ should be thought of as a "rescaled" regularization parameter λ′. In some settings, the λ that yields the best generalization error scales like O(1/√n), hence p = O(Tr(K)/√n) is sufficient. On the other hand, if the columns are sampled uniformly, one would get p = O(d_mof) = O(n max_i l_i(λ)).

³Note that d_eff depends on the precision parameter ε, which is absent in the classical definition of the effective dimensionality [10, 9, 3]. However, the following bound holds: d_eff ≤ (1/ε) Tr(K(K + nλI)^{-1}).
⁴This condition on λ is not necessary if one constructs L as KS(S^⊤KS + nλεI)^{-1}S^⊤K (see proof).

4 Experiments

We test our results on several datasets: one synthetic regression problem from [3], to illustrate the importance of the λ-ridge leverage scores; the Pumadyn family, consisting of the three datasets pumadyn-32fm, pumadyn-32fh and pumadyn-32nh⁵; and the Gas Sensor Array Drift Dataset from the UCI database⁶. The synthetic case consists of a regression problem on the interval X = [0, 1] where, given a sequence (x_i)_{1≤i≤n} and a sequence of noise (ε_i)_{1≤i≤n}, we observe the sequence

y_i = f(x_i) + σ ε_i,   i ∈ {1, ..., n}.
B2\u03b2(x\u2212y\u2212(cid:98)x\u2212y(cid:99))\nThe function f belongs to the RKHS F generated by the kernel k(x, y) = 1\nwhere B2\u03b2 is the 2\u03b2-th Bernoulli polynomial [3]. One important feature of this regression problem\nis the distribution of the points (xi)1\u2264i\u2264n on the interval X : if they are spread uniformly over the\ninterval, the \u03bb-ridge leverage scores (li(\u03bb))1\u2264i\u2264n are uniform for every \u03bb > 0, and uniform column\nn for i = 1,\u00b7\u00b7\u00b7 , n, the kernel matrix K is\nsampling is optimal in this case. In fact, if xi = i\u22121\na circulant matrix [3], in which case, we can prove that the \u03bb-ridge leverage scores are constant.\nOtherwise, if the data points are distributed asymmetrically on the interval, the \u03bb-ridge leverage\nscores are non uniform, and importance sampling is bene\ufb01cial (Figure 1). In this experiment, the\ndata points xi \u2208 (0, 1) have been generated with a distribution symmetric about 1\n2, having a high\ndensity on the borders of the interval (0, 1) and a low density on the center of the interval. The\nnumber of observations is n = 500. On Figure 1, we can see that there are few data points with\n\n5http://www.cs.toronto.edu/\u02dcdelve/data/pumadyn/desc.html\n6https://archive.ics.uci.edu/ml/datasets/Gas+Sensor+Array+Drift+Dataset\n\n7\n\n\fFigure 1: The \u03bb-ridge leverage scores for the synthetic Bernoulli data set described in the text (left) and\nthe MSE risk vs. the number of sampled columns used to construct the Nystr\u00a8om approximation for different\nsampling methods (right).\nhigh leverage, and those correspond to the region that is underrepresented in the data sample (i.e. the\nregion close to the center of the interval since it is the one that has the lowest density of observations).\nThe \u03bb-ridge leverage scores are able to capture the importance of these data points, thus providing a\nway to detect them (e.g. 
For all datasets, we determine λ and the bandwidth of k by cross-validation, and we compute the effective dimensionality deff and the maximal degrees of freedom dmof. Table 1 summarizes the experiments. It is often the case that deff ≪ dmof and R(f̂L)/R(f̂K) ≈ 1, in agreement with Theorem 3.

kernel  dataset    n     nb. feat  band width  λ        deff  dmof  risk ratio R(f̂L)/R(f̂K)
Bern    Synth      500   -         -           1e−6     24    500   1.01 (p = 2deff)
Linear  Gas2       1244  128       -           1e−3     126   1244  1.10 (p = 2deff)
Linear  Gas3       1586  128       -           1e−3     125   1586  1.09 (p = 2deff)
Linear  Pum-32fm   2000  32        -           1e−3     31    2000  0.99 (p = 2deff)
Linear  Pum-32fh   2000  32        -           1e−3     31    2000  0.99 (p = 2deff)
Linear  Pum-32nh   2000  32        -           1e−3     32    2000  0.99 (p = 2deff)
RBF     Gas2       1244  -         1           4.5e−4   1135  1244  1.56 (p = deff)
RBF     Gas3       1586  -         1           5e−4     1450  1586  1.50 (p = deff)
RBF     Pum-32fm   2000  -         5           0.5      142   1897  1.00 (p = deff)
RBF     Pum-32fh   2000  -         5           5e−2     747   1989  1.00 (p = deff)
RBF     Pum-32nh   2000  -         5           1.3e−2   1337  1997  0.99 (p = deff)

Table 1: Parameters and quantities of interest for the different datasets and kernels: the synthetic dataset using the Bernoulli kernel (denoted Synth), the Gas Sensor Array Drift Dataset (batches 2 and 3, denoted Gas2 and Gas3), and the Pumadyn datasets (Pum-32fm, Pum-32fh, Pum-32nh) using linear and RBF kernels.

5 Conclusion

We showed in this paper that, in the case of kernel ridge regression, the sampling complexity of the Nyström method can be reduced to the effective dimensionality of the problem, hence bridging and improving upon several previous attempts that established weaker forms of this result.
This was achieved by defining a natural analog of the notion of leverage scores in this statistical context, and using it as a column sampling distribution. We obtained this result by combining and improving upon results that have emerged from two different perspectives on low-rank matrix approximation. We also presented a computationally tractable way to approximate these scores, i.e., one that runs in time O(np²) with p depending only on the trace of the kernel matrix and the regularization parameter. One natural open question is whether it is possible to further reduce the sampling complexity, or whether the effective dimensionality is also a lower bound on p. As pointed out by previous work [22, 3], it is likely that the same results hold for smooth losses beyond the squared loss (e.g., logistic regression). The situation is unclear, however, for non-smooth losses (e.g., support vector regression).

Acknowledgements: We thank Xixian Chen for pointing out a mistake in an earlier draft of this paper [1]. We thank Francis Bach for stimulating discussions and for contributing to a rectified proof of Theorem 1. We thank Jason Lee and Aaditya Ramdas for fruitful discussions regarding the proof of Theorem 1. We thank Yuchen Zhang for pointing out the connection to his work.

References

[1] Ahmed El Alaoui and Michael W. Mahoney. Fast randomized kernel methods with statistical guarantees. arXiv preprint arXiv:1411.0306, 2014.
[2] Alex Gittens and Michael W. Mahoney. Revisiting the Nyström method for improved large-scale machine learning. In Proceedings of the 30th International Conference on Machine Learning, pages 567–575, 2013.
[3] Francis Bach. Sharp analysis of low-rank kernel matrix approximations. In Proceedings of the 26th Conference on Learning Theory, pages 185–209, 2013.
[4] Francis Bach. Personal communication, October 2015.
[5] Petros Drineas, Michael W. Mahoney, and S. Muthukrishnan.
Relative-error CUR matrix decompositions. SIAM Journal on Matrix Analysis and Applications, 30(2):844–881, 2008.
[6] Michael W. Mahoney and Petros Drineas. CUR matrix decompositions for improved data analysis. Proceedings of the National Academy of Sciences, 106(3):697–702, 2009.
[7] Petros Drineas, Malik Magdon-Ismail, Michael W. Mahoney, and David P. Woodruff. Fast approximation of matrix coherence and statistical leverage. The Journal of Machine Learning Research, 13(1):3475–3506, 2012.
[8] Michael W. Mahoney. Randomized algorithms for matrices and data. Foundations and Trends in Machine Learning, 3(2):123–224, 2011.
[9] Yuchen Zhang, John Duchi, and Martin Wainwright. Divide and conquer kernel ridge regression. In Proceedings of the 26th Conference on Learning Theory, pages 592–617, 2013.
[10] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The Elements of Statistical Learning, volume 1. Springer Series in Statistics. Springer, Berlin, 2001.
[11] George Kimeldorf and Grace Wahba. Some results on Tchebycheffian spline functions. Journal of Mathematical Analysis and Applications, 33(1):82–95, 1971.
[12] Bernhard Schölkopf, Ralf Herbrich, and Alex J. Smola. A generalized representer theorem. In Computational Learning Theory, pages 416–426. Springer, 2001.
[13] Shai Fine and Katya Scheinberg. Efficient SVM training using low-rank kernel representations. The Journal of Machine Learning Research, 2:243–264, 2002.
[14] Christopher Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. In Proceedings of the 14th Annual Conference on Neural Information Processing Systems, pages 682–688, 2001.
[15] Sanjiv Kumar, Mehryar Mohri, and Ameet Talwalkar. Sampling techniques for the Nyström method.
In International Conference on Artificial Intelligence and Statistics, pages 304–311, 2009.
[16] Joel A. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389–434, 2012.
[17] Petros Drineas, Ravi Kannan, and Michael W. Mahoney. Fast Monte Carlo algorithms for matrices I: Approximating matrix multiplication. SIAM Journal on Computing, 36(1):132–157, 2006.
[18] Samprit Chatterjee and Ali S. Hadi. Influential observations, high leverage points, and outliers in linear regression. Statistical Science, pages 379–393, 1986.
[19] Petros Drineas, Ravi Kannan, and Michael W. Mahoney. Fast Monte Carlo algorithms for matrices II: Computing a low-rank approximation to a matrix. SIAM Journal on Computing, 36(1):158–183, 2006.
[20] Petros Drineas, Michael W. Mahoney, S. Muthukrishnan, and Tamás Sarlós. Faster least squares approximation. Numerische Mathematik, 117(2):219–249, 2011.
[21] Alan Frieze, Ravi Kannan, and Santosh Vempala. Fast Monte-Carlo algorithms for finding low-rank approximations. Journal of the ACM, 51(6):1025–1041, 2004.
[22] Francis Bach. Self-concordant analysis for logistic regression. Electronic Journal of Statistics, 4:384–414, 2010.