{"title": "Generalization Properties of Learning with Random Features", "book": "Advances in Neural Information Processing Systems", "page_first": 3215, "page_last": 3225, "abstract": "We study the generalization properties of ridge regression with random features in the statistical learning framework. We show for the first time that $O(1/\\sqrt{n})$ learning bounds can be achieved with only $O(\\sqrt{n}\\log n)$ random features rather than $O({n})$ as suggested by previous results. Further, we prove faster learning rates and show that they might require more random features, unless they are sampled according to a possibly problem dependent distribution. Our results shed light on the statistical computational trade-offs in large scale kernelized learning, showing the potential effectiveness of random features in reducing the computational complexity while keeping optimal generalization properties.", "full_text": "Generalization Properties of Learning with Random\n\nFeatures\n\nAlessandro Rudi \u2217\n\nINRIA - Sierra Project-team,\n\n\u00b4Ecole Normale Sup\u00b4erieure, Paris,\n\n75012 Paris, France\n\nalessandro.rudi@inria.fr\n\nLorenzo Rosasco\nUniversity of Genova,\n\nIstituto Italiano di Tecnologia,\n\nMassachusetts Institute of Technology.\n\nlrosasco@mit.edu\n\nAbstract\n\nWe study the generalization properties of ridge regression with random features\n\u221a\nin the statistical learning framework. We show for the \ufb01rst time that O(1/\nn)\nlearning bounds can be achieved with only O(\nn log n) random features rather\nthan O(n) as suggested by previous results. Further, we prove faster learning\nrates and show that they might require more random features, unless they are\nsampled according to a possibly problem dependent distribution. 
Our results shed light on the statistical computational trade-offs in large scale kernelized learning, showing the potential effectiveness of random features in reducing the computational complexity while keeping optimal generalization properties.

1 Introduction

Supervised learning is a basic machine learning problem where the goal is estimating a function from random noisy samples [1, 2]. The function to be learned is fixed, but unknown, and flexible non-parametric models are needed for good results. A general class of models is based on functions of the form

$$f(x) = \sum_{i=1}^{M} \alpha_i\, q(x, \omega_i), \qquad (1)$$

where $q$ is a non-linear function, $\omega_1, \dots, \omega_M \in \mathbb{R}^d$ are often called centers, $\alpha_1, \dots, \alpha_M \in \mathbb{R}$ are coefficients, and $M = M_n$ could/should grow with the number of data points $n$. Algorithmically, the problem reduces to computing from data the parameters $\omega_1, \dots, \omega_M$, $\alpha_1, \dots, \alpha_M$ and $M$. Among others, one-hidden layer networks [3], or RBF networks [4], are examples of classical approaches considering these models. Here, parameters are computed by considering a non-convex optimization problem, typically hard to solve and analyze [5]. Kernel methods are another notable example of an approach [6] using functions of the form (1). In this case, $q$ is assumed to be a positive definite function [7] and it is shown that choosing the centers to be the input points, hence $M = n$, suffices for optimal statistical results [8, 9, 10]. As a by-product, kernel methods require only finding the coefficients $(\alpha_i)_i$, typically by convex optimization.
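The model class (1) is simple to state in code. A minimal numpy sketch follows; the cosine nonlinearity and the random centers are illustrative choices of ours, not prescribed by the paper:

```python
import numpy as np

def predict(x, alphas, centers, q):
    """Evaluate f(x) = sum_{i=1}^M alpha_i * q(x, omega_i), the model class of Eq. (1)."""
    return sum(a * q(x, w) for a, w in zip(alphas, centers))

# Illustrative choice: q(x, w) = cos(w . x), with M = 3 centers in R^2.
rng = np.random.default_rng(0)
M, d = 3, 2
centers = rng.standard_normal((M, d))  # omega_1, ..., omega_M
alphas = rng.standard_normal(M)        # alpha_1, ..., alpha_M
q = lambda x, w: np.cos(w @ x)

x = np.ones(d)
f_x = predict(x, alphas, centers, q)   # a single real-valued prediction
```

Learning amounts to fitting the $\alpha_i$, the $\omega_i$, and $M$ from data, which is precisely where the approaches discussed next differ.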
While theoretically sound and remarkably effective in small and medium size problems, memory requirements make kernel methods unfeasible for large scale problems.

The most popular approaches to tackle these limitations are randomized and include sampling the centers at random, either in a data-dependent or in a data-independent way. Notable examples include Nyström [11, 12] and random features [13] approaches. Given random centers, computations still reduce to convex optimization, with potentially large memory gains, provided that the centers are fewer than the data points. In practice, the choice of the number of centers is based on heuristics or memory constraints, and the question arises of characterizing theoretically which choices provide optimal learning bounds. Answering this question allows one to understand the statistical and computational trade-offs of using these randomized approximations. For Nyström methods, partial results in this direction were derived for example in [14] and improved in [15], but only for a simplified setting where the input points are fixed.

*This work was done when A.R. was working at the Laboratory of Computational and Statistical Learning (Istituto Italiano di Tecnologia).

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
Results in the statistical learning setting were given in [16] for ridge regression, showing in particular that $O(\sqrt{n}\log n)$ random centers uniformly sampled from the $n$ training points suffice to yield $O(1/\sqrt{n})$ learning bounds, the same as full kernel ridge regression. A question motivating our study is whether similar results hold for random features approaches. Several papers consider the properties of random features for approximating the kernel function, see [17] and references therein, an interesting line of research with connections to sketching [24] and non-linear (one-bit) compressed sensing [18]. However, only a few results consider the generalization properties of learning with random features.

An exception is one of the original random features papers, which provides learning bounds for a general class of loss functions [19]. These results show that $O(n)$ random features are needed for $O(1/\sqrt{n})$ learning bounds, and choosing fewer random features leads to worse bounds. In other words, these results suggest that computational gains come at the expense of learning accuracy. Later results, see e.g. [20, 21, 22], essentially confirm these considerations, albeit the analysis in [22] suggests that fewer random features could suffice if sampled in a problem dependent way.

In this paper, we focus on the least squares loss, considering random features within a ridge regression approach. Our main result shows, under standard assumptions, that the estimator obtained with a number of random features proportional to $O(\sqrt{n}\log n)$ achieves an $O(1/\sqrt{n})$ learning error, that is, the same prediction accuracy as the exact kernel ridge regression estimator.
In other words, there are problems for which random features allow one to drastically reduce computational costs without any loss of prediction accuracy. To the best of our knowledge, this is the first result showing that such an effect is possible. Our study improves on previous results by taking advantage of analytic and probabilistic results developed to provide sharp analyses of kernel ridge regression. We further present a second set of more refined results deriving fast convergence rates. We show that fast rates are indeed possible but, depending on the problem at hand, a larger number of features might be needed. We then discuss how the requirement on the number of random features can be weakened at the expense of typically more complex sampling schemes. Indeed, in this latter case either some knowledge of the data-generating distribution or some potentially data-driven sampling scheme is needed. Here, we borrow and extend ideas from [22, 16], inspired by the theory of statistical leverage scores [23]. Theoretical findings are complemented by numerical simulations validating the bounds.

The rest of the paper is organized as follows. In Section 2, we review relevant results on learning with kernels, least squares and learning with random features. In Section 3, we present and discuss our main results, while proofs are deferred to the appendix. Finally, numerical experiments are presented in Section 4.

2 Learning with random features and ridge regression

We begin by recalling basic ideas in kernel methods and their approximation via random features.

Kernel ridge regression. Consider the supervised problem of learning a function given a training set of $n$ examples $(x_i, y_i)_{i=1}^n$, where $x_i \in X$, $X = \mathbb{R}^D$, and $y_i \in \mathbb{R}$. Kernel methods are nonparametric approaches defined by a kernel $K : X \times X \to \mathbb{R}$, that is, a symmetric and positive definite (PD) function2.
A particular instance is kernel ridge regression (KRR), given by

$$\hat f_\lambda(x) = \sum_{i=1}^{n} \alpha_i K(x_i, x), \qquad \alpha = (K + \lambda n I)^{-1} y. \qquad (2)$$

Here $\lambda > 0$, $y = (y_1, \dots, y_n)$, $\alpha \in \mathbb{R}^n$, and $K$ is the $n$ by $n$ matrix with entries $K_{ij} = K(x_i, x_j)$. The above method is standard, can be derived from an empirical risk minimization perspective [6], and is related to Gaussian processes [3]. While KRR has optimal statistical properties (see later), its applicability to large scale datasets is limited, since it requires $O(n^2)$ in space, to store $K$, and roughly $O(n^3)$ in time, to solve the linear system in (2). Similar requirements are shared by other kernel methods [6].

2A kernel $K$ is PD if for all $x_1, \dots, x_N$ the $N$ by $N$ matrix with entries $K(x_i, x_j)$ is positive semidefinite.

To explain the basic ideas behind using random features with ridge regression, it is useful to recall the computations needed to solve KRR when the kernel is linear, $K(x, x') = x^\top x'$. In this case, Eq. (2) reduces to standard ridge regression and can be equivalently computed considering

$$\hat f_\lambda(x) = x^\top \hat w_\lambda, \qquad \hat w_\lambda = (\hat X^\top \hat X + \lambda n I)^{-1} \hat X^\top y, \qquad (3)$$

where $\hat X$ is the $n$ by $D$ data matrix. In this case, the complexity becomes $O(nD)$ in space, and $O(nD^2 + D^3)$ in time. Beyond the linear case, the above reasoning extends to inner product kernels

$$K(x, x') = \phi_M(x)^\top \phi_M(x'), \qquad (4)$$

where $\phi_M : X \to \mathbb{R}^M$ is a finite dimensional (feature) map. In this case, KRR can be computed considering (3) with the data matrix $\hat X$ replaced by the $n$ by $M$ matrix $\hat S_M^\top = (\phi_M(x_1), \dots, \phi_M(x_n))$. The complexity is then $O(nM)$ in space, and $O(nM^2 + M^3)$ in time, hence much better than $O(n^2)$ and $O(n^3)$, as soon as $M \ll n$.
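The computations behind Eq. (3) can be sketched directly. A minimal numpy example follows, with the stated $O(nM^2 + M^3)$ cost made explicit in comments; the variable names are ours:

```python
import numpy as np

def ridge_with_features(Phi, y, lam):
    """Solve w = (Phi^T Phi + lam * n * I)^{-1} Phi^T y, as in Eq. (3).
    Forming Phi^T Phi costs O(n M^2); solving the M x M system costs O(M^3)."""
    n, M = Phi.shape
    A = Phi.T @ Phi + lam * n * np.eye(M)
    return np.linalg.solve(A, Phi.T @ y)

rng = np.random.default_rng(0)
n, M = 200, 10
Phi = rng.standard_normal((n, M))   # feature matrix: n data points mapped to R^M
w_true = rng.standard_normal(M)
y = Phi @ w_true                    # noiseless targets, so the solver is easy to check
w_hat = ridge_with_features(Phi, y, lam=1e-8)
```

With negligible regularization and noiseless targets, `w_hat` recovers `w_true`; only the $M \times M$ system is ever formed, which is the source of the memory gain when $M \ll n$.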
Considering only kernels of the form (4) can be restrictive. Indeed, classic examples of kernels, e.g. the Gaussian kernel $e^{-\|x - x'\|^2}$, do not satisfy (4) with finite $M$. It is then natural to ask if the above reasoning can still be useful to reduce the computational burden for more complex kernels, such as the Gaussian kernel. Random features, which we recall next, show that this is indeed the case.

Random features with ridge regression. The basic idea of random features [13] is to relax Eq. (4), assuming it holds only approximately,

$$K(x, x') \approx \phi_M(x)^\top \phi_M(x'). \qquad (5)$$

Clearly, if one such approximation exists, the approach described in the previous section can still be used. A first question is then for which kernels an approximation of the form (5) can be derived. A simple manipulation of the Gaussian kernel provides one basic example.

Example 1 (Random Fourier features [13]). If we write the Gaussian kernel as $K(x, x') = G(x - x')$, with $G(z) = e^{-\frac{1}{2\sigma^2}\|z\|^2}$ for a $\sigma > 0$, then, since the inverse Fourier transform of $G$ is a Gaussian, and using a basic symmetry argument, it is easy to show that

$$G(x - x') = \frac{1}{2\pi Z} \int \int_0^{2\pi} \sqrt{2}\cos(w^\top x + b)\; \sqrt{2}\cos(w^\top x' + b)\; e^{-\frac{\sigma^2}{2}\|w\|^2}\, dw\, db,$$

where $Z$ is a normalizing factor. Then, the Gaussian kernel has an approximation of the form (5) with $\phi_M(x) = M^{-1/2}\,(\sqrt{2}\cos(w_1^\top x + b_1), \dots, \sqrt{2}\cos(w_M^\top x + b_M))$, and $w_1, \dots, w_M$ and $b_1, \dots, b_M$ sampled independently from $\frac{1}{Z} e^{-\sigma^2\|w\|^2/2}$ and uniformly in $[0, 2\pi]$, respectively.

The above example can be abstracted to a general strategy.
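Example 1 translates into a few lines of numpy. The sketch below (variable names ours) samples $w \sim \mathcal{N}(0, \sigma^{-2} I)$, which matches the density $\frac{1}{Z}e^{-\sigma^2\|w\|^2/2}$, and checks that the feature inner products approach the Gaussian kernel as $M$ grows:

```python
import numpy as np

def gaussian_rf(X, M, sigma, rng):
    """Random Fourier features phi_M for the Gaussian kernel
    exp(-||x - x'||^2 / (2 sigma^2)), as in Example 1."""
    d = X.shape[1]
    W = rng.standard_normal((M, d)) / sigma         # w_j ~ N(0, sigma^{-2} I)
    b = rng.uniform(0.0, 2 * np.pi, size=M)         # b_j ~ Uniform[0, 2pi]
    return np.sqrt(2.0 / M) * np.cos(X @ W.T + b)   # rows are phi_M(x_i)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
sigma = 1.5
Phi = gaussian_rf(X, M=50000, sigma=sigma, rng=rng)
K_approx = Phi @ Phi.T                               # phi_M(x)^T phi_M(x')
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K_exact = np.exp(-sq / (2 * sigma ** 2))             # exact Gaussian kernel
err = np.abs(K_approx - K_exact).max()               # shrinks as O(1/sqrt(M))
```

The Monte Carlo error decays as $O(1/\sqrt{M})$, so a large $M$ is used here purely to make the kernel approximation visible; the point of the paper is that far fewer features suffice for *prediction*.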
Assume the kernel $K$ to have an integral representation,

$$K(x, x') = \int_\Omega \psi(x, \omega)\, \psi(x', \omega)\, d\pi(\omega), \qquad \forall x, x' \in X, \qquad (6)$$

where $(\Omega, \pi)$ is a probability space and $\psi : X \times \Omega \to \mathbb{R}$. The random features approach provides an approximation of the form (5), where $\phi_M(x) = M^{-1/2}\,(\psi(x, \omega_1), \dots, \psi(x, \omega_M))$, and with $\omega_1, \dots, \omega_M$ sampled independently with respect to $\pi$. Key to the success of random features is that kernels to which the above idea applies abound; see Appendix E for a survey with some details.

Remark 1 (Random features, sketching and one-bit compressed sensing). We note that specific examples of random features can be seen as a form of sketching [24]. This latter term typically refers to reducing data dimensionality by random projection, e.g. considering

$$\psi(x, \omega) = x^\top \omega,$$

where $\omega \sim \mathcal{N}(0, I)$ (or suitable bounded measures). From a random features perspective, we are defining an approximation of the linear kernel, since $\mathbb{E}[\psi(x, \omega)\psi(x', \omega)] = \mathbb{E}[x^\top \omega \omega^\top x'] = x^\top \mathbb{E}[\omega \omega^\top] x' = x^\top x'$. More general non-linear sketching can also be considered. For example, in one-bit compressed sensing [18] the following random features are relevant,

$$\psi(x, \omega) = \mathrm{sign}(x^\top \omega),$$

with $\omega \sim \mathcal{N}(0, I)$ and $\mathrm{sign}(a) = 1$ if $a > 0$ and $-1$ otherwise. Deriving the corresponding kernel is more involved and we refer to [25] (see Section E in the appendices).

Back to supervised learning, combining random features with ridge regression leads to

$$\hat f_{\lambda,M}(x) := \phi_M(x)^\top \hat w_{\lambda,M}, \quad \text{with} \quad \hat w_{\lambda,M} := (\hat S_M^\top \hat S_M + \lambda I)^{-1} \hat S_M^\top \hat y, \qquad (7)$$

for $\lambda > 0$, $\hat S_M^\top := n^{-1/2}\,(\phi_M(x_1), \dots, \phi_M(x_n))$ and $\hat y := n^{-1/2}\,(y_1, \dots, y_n)$. Then, random features can be used to reduce the computational costs of full kernel ridge regression as soon as $M \ll n$ (see Sec. 2). However, since random features rely on an approximation (5), the question is whether there is a loss of prediction accuracy. This is the question we analyze in the rest of the paper.

3 Main Results

In this section, we present our main results characterizing the generalization properties of random features with ridge regression. We begin considering a basic setting and then discuss fast learning rates and the possible benefits of problem dependent sampling schemes.

3.1 $O(\sqrt{n}\log n)$ random features lead to $O(1/\sqrt{n})$ learning error

We consider a standard statistical learning setting. The data $(x_i, y_i)_{i=1}^n$ are sampled identically and independently with respect to a probability $\rho$ on $X \times \mathbb{R}$, with $X$ a separable space (e.g. $X = \mathbb{R}^D$, $D \in \mathbb{N}$). The goal is to minimize the expected risk

$$\mathcal{E}(f) = \int (f(x) - y)^2\, d\rho(x, y),$$

since this implies that $f$ will generalize/predict well on new data. Since we consider estimators of the form (2), (7), we are potentially restricting the space of possible solutions. Indeed, estimators of this form can be naturally related to the so-called reproducing kernel Hilbert space (RKHS) corresponding to the PD kernel $K$. Recall that the latter is the function space $\mathcal{H}$ defined as the completion of the linear span of $\{K(x, \cdot) : x \in X\}$ with respect to the inner product $\langle K(x, \cdot), K(x', \cdot)\rangle := K(x, x')$ [7]. In this view, the best possible solution is $f_{\mathcal{H}}$ solving

$$\min_{f \in \mathcal{H}} \mathcal{E}(f). \qquad (8)$$

We will assume throughout that $f_{\mathcal{H}}$ exists. We add one technical remark useful in the following.

Remark 2.
Existence of $f_{\mathcal{H}}$ is not ensured, since we consider a potentially infinite dimensional RKHS $\mathcal{H}$, possibly universal [26]. The situation is different if $\mathcal{H}$ is replaced by $\mathcal{H}_R = \{f \in \mathcal{H} : \|f\| \le R\}$, with $R$ fixed a priori. In this case a minimizer of the risk $\mathcal{E}$ always exists, but $R$ needs to be fixed a priori and $\mathcal{H}_R$ cannot be universal. Clearly, assuming $f_{\mathcal{H}}$ to exist implies it belongs to a ball of radius $R_{\rho,\mathcal{H}}$. However, our results do not require prior knowledge of $R_{\rho,\mathcal{H}}$ and hold uniformly over all finite radii.

The following is our first result on the learning properties of random features with ridge regression.

Theorem 1. Assume that $K$ is a kernel with an integral representation (6). Assume $\psi$ continuous, such that $|\psi(x, \omega)| \le \kappa$ almost surely, with $\kappa \in [1, \infty)$, and $|y| \le b$ almost surely, with $b > 0$. Let $\delta \in (0, 1]$. If $n \ge n_0$ and $\lambda_n = n^{-1/2}$, then a number of random features $M_n$ equal to

$$M_n = c_0\, \sqrt{n}\, \log\frac{108\,\kappa^2\sqrt{n}}{\delta}$$

is enough to guarantee, with probability at least $1 - \delta$, that

$$\mathcal{E}(\hat f_{\lambda_n, M_n}) - \mathcal{E}(f_{\mathcal{H}}) \le c_1\, \frac{\log^2\frac{18}{\delta}}{\sqrt{n}}.$$

In particular, the constants $c_0$, $c_1$ do not depend on $n$, $\lambda$, $\delta$, and $n_0$ does not depend on $n$, $\lambda$, $f_{\mathcal{H}}$, $\rho$.

The above result is presented with some simplifications (e.g. the assumption of bounded output) for the sake of presentation, while it is proved and presented in full generality in the Appendix. In particular, the values of all the constants are given explicitly. Here, we make a few comments. The learning bound is the same achieved by the exact kernel ridge regression estimator (2) choosing $\lambda = n^{-1/2}$, see e.g. [10]. The theorem derives a bound in a worst case situation, where no assumption is made besides the existence of $f_{\mathcal{H}}$, and is optimal in a minimax sense [10]. This means that, in this setting, as soon as the number of features is of order $\sqrt{n}\log n$, the corresponding ridge regression estimator has optimal generalization properties. This is remarkable considering the corresponding gain from a computational perspective: from roughly $O(n^3)$ and $O(n^2)$ in time and space for kernel ridge regression to $O(n^2)$ and $O(n\sqrt{n})$ for ridge regression with random features (see Section 2). Note that taking $\delta \propto 1/n^2$ changes only the constants and allows one to derive bounds in expectation and almost sure convergence (see Cor. 1 in the appendix for the result in expectation).

The above result shows that there is a whole set of problems where computational gains are achieved without having to trade off statistical accuracy. In the next sections we consider what happens under more benign assumptions, which are standard, but also somewhat more technical. We first compare with previous works, since the above setting is the one most closely related.

Comparison with [19]. This is one of the original random features papers and considers the question of generalization properties. In particular, they study the estimator

$$\hat f_R(x) = \phi_M(x)^\top \hat\beta_{R,\infty}, \qquad (9)$$

solving

$$\hat\beta_{R,\infty} = \operatorname*{argmin}_{\|\beta\|_\infty \le R}\; \frac{1}{n}\sum_{i=1}^{n} \ell(\phi_M(x_i)^\top \beta,\, y_i), \qquad (10)$$

for a fixed $R$ and a Lipschitz loss function $\ell$, and where $\|\beta\|_\infty = \max\{|\beta_1|, \cdots, |\beta_M|\}$. The largest space considered in [19] is

$$G_R = \left\{ \int \psi(\cdot, \omega)\, \beta(\omega)\, d\pi(\omega)\; : \; |\beta(\omega)| < R \text{ a.e.} \right\},$$

rather than a RKHS, where $R$ is fixed a priori.
The best possible solution is $f^*_{G_R}$ solving $\min_{f \in G_R} \mathcal{E}(f)$, and the main result in [19] provides the bound

$$\mathcal{E}(\hat f_R) - \mathcal{E}(f^*_{G_R}) \lesssim \frac{R}{\sqrt{n}} + \frac{R}{\sqrt{M}}.$$

This is the first, and still one of the main, results providing a statistical analysis of an estimator based on random features for a wide class of loss functions. There are a few elements of comparison with the result in this paper, but the main one is that to get $O(1/\sqrt{n})$ learning bounds, the above result requires $O(n)$ random features, while a smaller number leads to worse bounds. This shows the main novelty of our analysis. Indeed, we prove that, considering the square loss, fewer random features are sufficient, hence allowing computational gains without loss of accuracy. We add a few more technical comments explaining: 1) how the setting we consider covers a wider range of problems, and 2) why the bounds we obtain are sharper. First, note that the functional setting in our paper is more general in the following sense. It is easy to see that considering the RKHS $\mathcal{H}$ is equivalent to considering $\mathcal{H}_2 = \left\{ \int \psi(\cdot, \omega)\beta(\omega)\, d\pi(\omega) \,\middle|\, \int |\beta(\omega)|^2\, d\pi(\omega) < \infty \right\}$, and the following inclusions hold: $G_R \subset G_\infty \subset \mathcal{H}_2$. Clearly, assuming a minimizer of the expected risk to exist in $\mathcal{H}_2$ does not imply it belongs to $G_\infty$ or $G_R$, while the converse is true. In this view, our results cover a wider range of problems. Second, note that this gap is not easy to bridge. Indeed, even if we were to consider $G_\infty$ in place of $G_R$, the results in [19] could be used to derive the bound

$$\mathbb{E}\, \mathcal{E}(\hat f_R) - \mathcal{E}(f^*_{G_\infty}) \lesssim \frac{R}{\sqrt{n}} + \frac{R}{\sqrt{M}} + A(R), \qquad (11)$$

where $A(R) := \mathcal{E}(f^*_{G_R}) - \mathcal{E}(f^*_{G_\infty})$ and $f^*_{G_\infty}$ is a minimizer of the expected risk on $G_\infty$. In this case we would have to balance the various terms in (11), which would lead to a worse bound.
For example, we could consider $R := \log n$, obtaining a bound $n^{-1/2}\log n$ with an extra logarithmic term, but the result would hold only for $n$ larger than a number of examples $n_0$ at least exponential in the norm of $f_\infty$. Moreover, to derive results uniform with respect to $f_\infty$, we would have to take into account the decay rate of $A(R)$, and this would give bounds slower than $n^{-1/2}$.

Figure 1: Random features $M = O(n^c)$ required for optimal generalization. Left: $\alpha = 1$. Right: $\alpha = \gamma$.

Comparison with other results. Several other papers study the generalization properties of random features, see [22] and references therein. For example, generalization bounds are derived in [20] from very general arguments. However, the corresponding generalization bound requires a number of random features much larger than the number of training examples to give $O(1/\sqrt{n})$ bounds. The basic results in [22] are analogous to those in [19], with the set $G_R$ replaced by $\mathcal{H}_R$. These results are closer, albeit more restrictive, than ours (see Remark 8), and, like the bounds in [19], suggest that $O(n)$ random features are needed for $O(1/\sqrt{n})$ learning bounds. A novelty in [22] is the introduction of more complex problem dependent sampling that can reduce the number of random features. In Section 3.3, we show that using possibly data-dependent random features can lead to rates much faster than $n^{-1/2}$, using much fewer than $\sqrt{n}$ features.

Remark 3 (Sketching and randomized numerical linear algebra (RandLA)). Standard sketching techniques from RandLA [24] can be recovered, when $X$ is a bounded subset of $\mathbb{R}^D$, by selecting $\psi(x, \omega) = x^\top \omega$ and $\omega$ sampled from a suitable bounded distribution (e.g. $\omega = (\zeta_1, \dots, \zeta_d)$ with independent Rademacher random variables).
Note, however, that the final goal of the analysis in the randomized numerical linear algebra community is to minimize the empirical error instead of $\mathcal{E}$.

3.2 Refined Results: Fast Learning Rates

Faster rates can be achieved under favorable conditions. Such conditions for kernel ridge regression are standard, but somewhat technical. Roughly speaking, they characterize the "size" of the considered RKHS and the regularity of $f_{\mathcal{H}}$. The key quantity needed to make this precise is the integral operator defined by the kernel $K$ and the marginal distribution $\rho_X$ of $\rho$ on $X$, that is,

$$(Lg)(x) = \int_X K(x, z)\, g(z)\, d\rho_X(z), \qquad \forall g \in L^2(X, \rho_X),$$

seen as a map from $L^2(X, \rho_X) = \{f : X \to \mathbb{R} \mid \|f\|_\rho^2 = \int |f(x)|^2\, d\rho_X < \infty\}$ to itself. Under the assumptions of Thm. 1, the integral operator is positive, self-adjoint and trace-class (hence compact) [27]. We next define the conditions that will lead to fast rates, and then comment on their interpretation.

Assumption 1 (Prior assumptions). For $\lambda > 0$, let the effective dimension be defined as $\mathcal{N}(\lambda) := \mathrm{Tr}\left((L + \lambda I)^{-1} L\right)$, and assume there exist $Q > 0$ and $\gamma \in [0, 1]$ such that

$$\mathcal{N}(\lambda) \le Q^2 \lambda^{-\gamma}. \qquad (12)$$

Moreover, assume there exist $r \ge 1/2$ and $g \in L^2(X, \rho_X)$ such that

$$f_{\mathcal{H}}(x) = (L^r g)(x) \quad \text{a.s.} \qquad (13)$$

We provide some intuition on the meaning of the above assumptions, and defer the interested reader to [10] for more details. The effective dimension can be seen as a "measure of the size" of the RKHS $\mathcal{H}$. Condition (12) allows one to control the variance of the estimator and is equivalent to conditions on covering numbers and related capacity measures [26]. In particular, it holds if the eigenvalues $\sigma_i$ of $L$ decay as $i^{-1/\gamma}$.
Intuitively, a fast decay corresponds to a smaller RKHS, whereas a slow decay corresponds to a larger RKHS. The case $\gamma = 0$ is the most benign situation, whereas $\gamma = 1$ is the worst case, corresponding to the basic setting. A classic example, when $X = \mathbb{R}^D$, corresponds to considering kernels of smoothness $s$, in which case $\gamma = D/(2s)$ and condition (12) is equivalent to assuming $\mathcal{H}$ to be a Sobolev space [26]. Condition (13) allows one to control the bias of the estimator and is common in approximation theory [28]. It is a regularity condition that can be seen as a form of weak sparsity of $f_{\mathcal{H}}$. Roughly speaking, it requires the expansion of $f_{\mathcal{H}}$, on the basis given by the eigenfunctions of $L$, to have coefficients that decay faster than $\sigma_i^r$. A large value of $r$ means that the coefficients decay fast and hence many are close to zero. The case $r = 1/2$ is the worst case, and can be shown to be equivalent to assuming that $f_{\mathcal{H}}$ exists. This latter situation corresponds to the setting considered in the previous section. We next show how these assumptions allow us to derive fast rates.

Theorem 2. Let $\delta \in (0, 1]$. Under Asm. 1 and the same assumptions of Thm. 1, if $n \ge n_0$ and $\lambda_n = n^{-\frac{1}{2r+\gamma}}$, then a number of random features $M_n$ equal to

$$M_n = c_0\, n^{\frac{1 + \gamma(2r-1)}{2r+\gamma}}\, \log\frac{108\,\kappa^2 n}{\delta}$$

is enough to guarantee, with probability at least $1 - \delta$, that

$$\mathcal{E}(\hat f_{\lambda_n, M_n}) - \mathcal{E}(f_{\mathcal{H}}) \le c_1\, \log^2\frac{18}{\delta}\; n^{-\frac{2r}{2r+\gamma}},$$

for $r \le 1$, and where $c_0$, $c_1$ do not depend on $n$, $\tau$, while $n_0$ does not depend on $n$, $f_{\mathcal{H}}$, $\rho$.

The above bound is the same as the one obtained by the full kernel ridge regression estimator and is optimal in a minimax sense [10]. For large $r$ and small $\gamma$ it approaches an $O(1/n)$ bound. When $\gamma = 1$ and $r = 1/2$, the worst case bound of the previous section is recovered.
Interestingly, the number of random features in the different regimes is typically smaller than $n$, but can be larger than $O(\sqrt{n})$. Figure 1 provides a pictorial representation of the number of random features needed for optimal rates in different regimes. In particular, $M \ll n$ random features are enough when $\gamma > 0$ and $r > 1/2$. For example, for $r = 1$, $\gamma = 0$ (higher regularity/sparsity and a small RKHS), $O(\sqrt{n})$ features are sufficient to get a rate $O(1/n)$. But, for example, if $r = 1/2$, $\gamma = 0$ (not too much regularity/sparsity but a small RKHS), $O(n)$ features are needed for $O(1/n)$ error. The proof suggests that this effect can be a byproduct of sampling features in a data-independent way. Indeed, in the next section we show how much fewer features can be used considering problem dependent sampling schemes.

3.3 Refined Results: Beyond uniform sampling

We show next that fast learning rates can be achieved with fewer random features if they are somewhat compatible with the data distribution. This is made precise by the following condition.

Assumption 2 (Compatibility condition). Define the maximum random features dimension as

$$\mathcal{F}_\infty(\lambda) = \sup_{\omega \in \Omega}\, \|(L + \lambda I)^{-1/2}\, \psi(\cdot, \omega)\|_{\rho_X}^2, \qquad \lambda > 0. \qquad (14)$$

Assume there exist $\alpha \in [0, 1]$ and $F > 0$ such that $\mathcal{F}_\infty(\lambda) \le F\, \lambda^{-\alpha}$, $\forall \lambda > 0$.

The above assumption is abstract and we comment on it before showing how it affects the results. The maximum random features dimension (14) relates the random features to the data-generating distribution through the operator $L$. It is always satisfied for $\alpha = 1$ and $F = \kappa^2$, e.g. considering any random features satisfying (6). The favorable situation corresponds to random features such that $\alpha = \gamma$.
The following theoretical construction borrowed from [22] gives an example.

Example 2 (Problem dependent RF). Assume $K$ is a kernel with an integral representation (6). For $s(\omega) = \|(L + \lambda I)^{-1/2}\psi(\cdot, \omega)\|_{\rho_X}^{-2}$ and $C_s := \int \frac{1}{s(\omega)}\, d\pi(\omega)$, consider the random features $\psi_s(x, \omega) = \psi(x, \omega)\sqrt{C_s\, s(\omega)}$, with distribution $\pi_s(\omega) := \frac{\pi(\omega)}{C_s\, s(\omega)}$. We show in the Appendix that these random features provide an integral representation of $K$ and satisfy Asm. 2 with $\alpha = \gamma$.

We next show how random features satisfying Asm. 2 can lead to better results.

Theorem 3. Let $\delta \in (0, 1]$. Under Asm. 2 and the same assumptions of Thm. 1, 2, if $n \ge n_0$ and $\lambda_n = n^{-\frac{1}{2r+\gamma}}$, then a number of random features $M_n$ equal to

$$M_n = c_0\, n^{\frac{\alpha + (1+\gamma-\alpha)(2r-1)}{2r+\gamma}}\, \log\frac{108\,\kappa^2 n}{\delta}$$

is enough to guarantee, with probability at least $1 - \delta$, that

$$\mathcal{E}(\hat f_{\lambda_n, M_n}) - \mathcal{E}(f_{\mathcal{H}}) \le c_1\, \log^2\frac{18}{\delta}\; n^{-\frac{2r}{2r+\gamma}},$$

where $c_0$, $c_1$ do not depend on $n$, $\tau$, while $n_0$ does not depend on $n$, $f_{\mathcal{H}}$, $\rho$.

Figure 2: Comparison between the number of features $M = O(n^c)$ required by Nyström (uniform sampling, left) [16] and Random Features ($\alpha = 1$, right), for optimal generalization.

The above learning bound is the same as in Thm. 2, but the number of random features is given by a more complex expression depending on $\alpha$. In particular, in the slow $O(1/\sqrt{n})$ rates scenario, that is $r = 1/2$, $\gamma = 1$, we see that $O(n^{\alpha/2})$ features are needed, recovering $O(\sqrt{n})$, since $\gamma \le \alpha \le 1$. On the contrary, for a small RKHS, that is $\gamma = 0$, and random features with $\alpha = \gamma$, a constant (!) number of features is sufficient. A similar trend is seen considering fast rates.
For $\gamma > 0$ and $r > 1/2$, if $\alpha < 1$, then the number of random features is always smaller, and potentially much smaller, than the number of random features sampled in a problem independent way, that is, with $\alpha = 1$. For $\gamma = 0$ and $r = 1/2$, the number of features is $O(n^\alpha)$ and can again be just constant if $\alpha = \gamma$. Figure 1 depicts the number of random features required if $\alpha = \gamma$. The above result shows the potentially dramatic effect of problem dependent random features. However, the construction in Ex. 2 is theoretical. We comment on this in the next remark.

Remark 4 (Random features leverage scores). The construction in Ex. 2 is theoretical; however, empirical random features leverage scores $\hat s(\omega) = \hat v(\omega)^\top (K + \lambda n I)^{-1} \hat v(\omega)$, with $\hat v(\omega) \in \mathbb{R}^n$, $(\hat v(\omega))_i = \psi(x_i, \omega)$, can be considered. Statistically, this requires considering an extra estimation step. It seems our proof can be extended to account for this, and we will pursue this in future work. Computationally, it requires devising approximate numerical strategies, like those for standard leverage scores [23].

Comparison with Nyström. This question was recently considered in [21] and our results offer new insights. In particular, recalling the results in [16], we see that in the slow rate setting there is essentially no difference between random features and Nyström approaches, neither from a statistical nor from a computational point of view. In the case of fast rates, Nyström methods with uniform sampling require $O(n^{\frac{1}{2r+\gamma}})$ random centers, which, compared to Thm. 2, suggests that Nyström methods can be advantageous in this regime.
While problem dependent random features provide a further improvement, they should be compared with the number of centers needed for Nyström with leverage scores, which is O(n^{γ/(2r+γ)}) and hence again better, see Thm. 3. In summary, both random features and Nyström methods achieve optimal statistical guarantees while reducing computations. They are essentially the same in the worst case, while Nyström can be better for benign problems. Finally, we add a few words about the main steps of the proof.

Steps of the proof. The proofs are quite technical and long, and are collected in the appendices. They use a battery of tools developed to analyze KRR and related methods. The key challenges in the analysis include analyzing the bias of the estimator, the effect of noise in the outputs, the effect of random sampling of the data, the approximation due to random features, and a notion of orthogonality between the function space corresponding to random features and the full RKHS. The last two points are the main elements of novelty in the proof. In particular, compared to other studies, we identify and study the quantity needed to assess the effect of the random feature approximation when the goal is prediction rather than kernel approximation itself.

Figure 3: Comparison of theoretical and simulated rates for: excess risk E(f̂_{λ,M}) − inf_{f∈H} E(f), λ, and M, w.r.t. n (100 repetitions). Parameters r = 11/16, γ = 1/8 (top), and r = 7/8, γ = 1/4 (bottom).

4 Numerical results

While the learning bounds we present are optimal, there are no lower bounds on the number of random features; hence we present numerical experiments validating our bounds. Consider a spline kernel of order q (see [29], Eq. 2.1.7 when q is an integer), defined as Λ_q(x, x′) = Σ_{k=−∞}^{∞} e^{2πikx} e^{−2πikx′} |k|^{−q} almost everywhere on [0, 1], with q ∈ R, for which we have ∫_0^1 Λ_q(x, z) Λ_{q′}(x′, z) dz = Λ_{q+q′}(x, x′), for any q, q′ ∈ R. Let X = [0, 1] and let ρ_X be the uniform distribution. For γ ∈ (0, 1) and r ∈ [1/2, 1], let K(x, x′) = Λ_{1/γ}(x, x′), ψ(ω, x) = Λ_{1/(2γ)}(ω, x), and f∗(x) = Λ_{r/γ + 1/2 + ε}(x, x_0), with ε > 0 and x_0 ∈ X. Let ρ(y|x) be a Gaussian density with variance σ² and mean f∗(x). Then Asm. 1, 2 are satisfied and α = γ. We compute the KRR estimator for n ∈ {10³, . . . , 10⁴} and select λ minimizing the excess risk, which can be computed analytically. Then we compute the RF-KRR estimator and select the number of features M needed to obtain an excess risk within 5% of the one achieved by KRR. In Figure 3, the theoretical and estimated behavior of the excess risk, λ, and M with respect to n are reported, together with their standard deviations over 100 repetitions. The experiment shows that the predictions of Thm. 3 are accurate, since the theoretical estimates are within one standard deviation of the values measured in the simulation.

5 Conclusion

In this paper, we provide a thorough analysis of the generalization properties of random features with ridge regression. We consider a statistical learning theory setting where data are noisy and sampled at random. Our main results show that there are large classes of learning problems where random features allow one to reduce computations while preserving the optimal statistical accuracy of exact kernel ridge regression. This is in contrast with previous state-of-the-art results suggesting that computational gains need to be traded off against statistical accuracy. Our results open several avenues for both theoretical and empirical work. As mentioned in the paper, it would be interesting to analyze random features with empirical leverage scores.
This is immediate if input points are fixed, but our approach should allow us to also consider the statistical learning setting. Beyond KRR, it would be interesting to analyze random features together with other approaches, in particular accelerated and stochastic gradient methods, or distributed techniques. It should be possible to extend the results in the paper to these cases. A more substantial generalization would be to consider loss functions other than the quadratic loss, since this requires different techniques from empirical process theory.

Acknowledgments. The authors gratefully acknowledge the contribution of Raffaello Camoriano, who was involved in the initial phase of this project. These preliminary results appeared in the 2016 NIPS workshop "Adaptive and Scalable Nonparametric Methods in ML". This work is funded by the Air Force project FA9550-17-1-0390 (European Office of Aerospace Research and Development) and by the FIRB project RBFR12M3AC (Italian Ministry of Education, University and Research).

References

[1] V. Vapnik. Statistical learning theory, volume 1.
Wiley New York, 1998.

[2] F. Cucker and S. Smale. On the mathematical foundations of learning. Bulletin of the AMS, 39:1–49, 2002.

[3] C. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.

[4] T. Poggio and F. Girosi. Networks for approximation and learning. Proceedings of the IEEE, 1990.

[5] A. Pinkus. Approximation theory of the MLP model in neural networks. Acta Numerica, 8:143–195, 1999.

[6] B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond (Adaptive Computation and Machine Learning). MIT Press, 2002.

[7] N. Aronszajn. Theory of reproducing kernels. Transactions of the AMS, 68(3):337–404, 1950.

[8] G. S. Kimeldorf and G. Wahba. A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. The Annals of Mathematical Statistics, 41(2):495–502, 1970.

[9] B. Schölkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In Computational Learning Theory, pages 416–426. Springer, 2001.

[10] A. Caponnetto and E. De Vito. Optimal rates for the regularized least-squares algorithm. FoCM, 2007.

[11] A. J. Smola and B. Schölkopf. Sparse greedy matrix approximation for machine learning. In ICML, 2000.

[12] C. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In NIPS, 2000.

[13] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In NIPS, 2007.

[14] F. Bach. Sharp analysis of low-rank kernel matrix approximations. In COLT, 2013.

[15] A. Alaoui and M. Mahoney. Fast randomized kernel ridge regression with statistical guarantees. In NIPS, 2015.

[16] A. Rudi, R. Camoriano, and L. Rosasco. Less is more: Nyström computational regularization. In NIPS, 2015.

[17] B. K. Sriperumbudur and Z. Szabo. Optimal rates for random Fourier features.
ArXiv e-prints, June 2015.

[18] Y. Plan and R. Vershynin. Dimension reduction by random hyperplane tessellations. Discrete & Computational Geometry, 51(2):438–461, 2014.

[19] A. Rahimi and B. Recht. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In NIPS, 2009.

[20] C. Cortes, M. Mohri, and A. Talwalkar. On the impact of kernel approximation on learning accuracy. In AISTATS, 2010.

[21] T. Yang, Y. Li, M. Mahdavi, R. Jin, and Z. Zhou. Nyström method vs random Fourier features: A theoretical and empirical comparison. In NIPS, pages 485–493, 2012.

[22] F. Bach. On the equivalence between quadrature rules and random features. ArXiv e-prints, February 2015.

[23] P. Drineas, M. Magdon-Ismail, M. W. Mahoney, and D. P. Woodruff. Fast approximation of matrix coherence and statistical leverage. JMLR, 13:3475–3506, 2012.

[24] N. Halko, P. Martinsson, and J. A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288, 2011.

[25] Y. Cho and L. K. Saul. Kernel methods for deep learning. In NIPS, pages 342–350, 2009.

[26] I. Steinwart and A. Christmann. Support Vector Machines. Springer, New York, 2008.

[27] S. Smale and D. Zhou. Learning theory estimates via integral operators and their approximations. Constructive Approximation, 26(2):153–172, 2007.

[28] S. Smale and D. Zhou. Estimating the approximation error in learning theory. Analysis and Applications, 1(01):17–41, 2003.

[29] G. Wahba. Spline Models for Observational Data, volume 59 of CBMS-NSF Regional Conference Series in Applied Mathematics. SIAM, Philadelphia, 1990.

[30] E. De Vito, L. Rosasco, A. Caponnetto, U. D. Giovannini, and F. Odone.
Learning from examples as an inverse problem. JMLR, pages 883–904, 2005.

[31] S. Boucheron, G. Lugosi, and O. Bousquet. Concentration inequalities. In Advanced Lectures on Machine Learning, 2004.

[32] V. V. Yurinsky. Sums and Gaussian Vectors. 1995.

[33] J. A. Tropp. User-friendly tools for random matrices: An introduction. 2012.

[34] S. Minsker. On some extensions of Bernstein's inequality for self-adjoint operators. arXiv, 2011.

[35] J. Fujii, M. Fujii, T. Furuta, and R. Nakamoto. Norm inequalities equivalent to Heinz inequality. Proceedings of the American Mathematical Society, 118(3), 1993.

[36] A. Caponnetto and Y. Yao. Adaptation for regularization operators in learning theory. Technical report, DTIC Document, 2006.

[37] R. Bhatia. Matrix Analysis, volume 169. Springer Science & Business Media, 2013.

[38] M. Raginsky and S. Lazebnik. Locality-sensitive binary codes from shift-invariant kernels. In NIPS, 2009.

[39] P. Kar and H. Karnick. Random feature maps for dot product kernels. In AISTATS, 2012.

[40] N. Pham and R. Pagh. Fast and scalable polynomial kernels via explicit feature maps. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 239–247. ACM, 2013.

[41] Q. Le, T. Sarlós, and A. Smola. Fastfood - computing Hilbert space expansions in loglinear time. In ICML, 2013.

[42] J. Yang, V. Sindhwani, Q. Fan, H. Avron, and M. Mahoney. Random Laplace feature maps for semigroup kernels on histograms. In CVPR, pages 971–978. IEEE, 2014.

[43] R. Hamid, Y. Xiao, A. Gittens, and D. Decoste. Compact random feature maps. In ICML, pages 19–27, 2014.

[44] J. Yang, V. Sindhwani, H. Avron, and M. W. Mahoney. Quasi-Monte Carlo feature maps for shift-invariant kernels. In ICML, volume 32 of JMLR Proceedings, pages 485–493.
JMLR.org, 2014.

[45] I. Steinwart, D. Hush, and C. Scovel. An explicit description of the reproducing kernel Hilbert spaces of Gaussian RBF kernels. IEEE Transactions on Information Theory, 52(10):4635–4643, 2006.

[46] A. Vedaldi and A. Zisserman. Efficient additive kernels via explicit feature maps. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(3):480–492, 2012.