{"title": "Kernel Truncated Randomized Ridge Regression: Optimal Rates and Low Noise Acceleration", "book": "Advances in Neural Information Processing Systems", "page_first": 15358, "page_last": 15367, "abstract": "In this paper we consider the nonparametric least square regression in a Reproducing Kernel Hilbert Space (RKHS). We propose a new randomized algorithm that has optimal generalization error bounds with respect to the square loss, closing a long-standing gap between upper and lower bounds. Moreover, we show that our algorithm has faster finite-time and asymptotic rates on problems where the Bayes risk with respect to the square loss is small. We state our results using standard tools from the theory of least square regression in RKHSs, namely, the decay of the eigenvalues of the associated integral operator and the complexity of the optimal predictor measured through the integral operator.", "full_text": "Kernel Truncated Randomized Ridge Regression:\n\nOptimal Rates and Low Noise Acceleration\n\nKwang-Sung Jun\n\nThe University of Arizona\u02da\nkjun@cs.arizona.edu\n\nAshok Cutkosky\nGoogle Research\n\nashok@cutkosky.com\n\nFrancesco Orabona\nBoston University\n\nfrancesco@orabona.com\n\nAbstract\n\nIn this paper, we consider the nonparametric least square regression in a Reproduc-\ning Kernel Hilbert Space (RKHS). We propose a new randomized algorithm that\nhas optimal generalization error bounds with respect to the square loss, closing a\nlong-standing gap between upper and lower bounds. Moreover, we show that our\nalgorithm has faster \ufb01nite-time and asymptotic rates on problems where the Bayes\nrisk with respect to the square loss is small. 
We state our results using standard tools from the theory of least squares regression in RKHSs, namely, the decay of the eigenvalues of the associated integral operator and the complexity of the optimal predictor measured through the integral operator.

1 Introduction

Given a training set $S = \{x_t, y_t\}_{t=1}^n$ of $n$ samples drawn independently and identically distributed from a fixed but unknown distribution $\rho$ on $X \times Y$, the goal of nonparametric least squares regression is to find a function $\hat{f}$ whose risk
\[
R(\hat{f}) := \int_{X \times Y} \big(\hat{f}(x) - y\big)^2 \, d\rho
\]
is close to the optimal risk
\[
R^\star := \inf_{f} R(f)\,.
\]
We focus on kernel-based methods, which consider candidate functions from a Reproducing Kernel Hilbert Space (RKHS) of functions, possibly composed with elementary functions. A classic kernel-based algorithm for nonparametric least squares is Kernel Ridge Regression (KRR), which constructs the prediction function $\hat{f}$ as
\[
\hat{f} = \operatorname*{argmin}_{f \in H_K} \; \lambda \|f\|^2 + \frac{1}{n} \sum_{t=1}^n (f(x_t) - y_t)^2,
\]
where $H_K$ is the RKHS associated with a kernel $K$ and $\lambda$ is the hyperparameter controlling the amount of regularization.
It has been proved that, when the amount of regularization is chosen optimally and under similar assumptions, KRR converges to the Bayes risk at the best known rate among kernel-based algorithms [Lin et al., 2018]. Despite this result, kernel-based learning is still not a solved problem: these rates match the known lower bounds in Fischer and Steinwart [2017] only in some regimes, unless additional assumptions are used [Steinwart et al., 2009].
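For concreteness, the KRR minimizer above admits the well-known closed form given by the representer theorem: $\hat{f} = \sum_t \alpha_t K(x_t, \cdot)$ with $\alpha = (K_n + \lambda n I)^{-1} y$. The following is a minimal numerical check of this fact; it is our own sketch with an assumed Gaussian kernel and synthetic data, not code from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
n, lam = 20, 0.1
X = rng.uniform(-1, 1, (n, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(n)
K = np.exp(-((X - X.T) ** 2))          # Gaussian kernel Gram matrix

# Representer theorem: the KRR minimizer is f = sum_t alpha_t K(x_t, .) with
#   alpha = (K + lam * n * I)^{-1} y
# for the objective  lam * ||f||^2 + (1/n) * sum_t (f(x_t) - y_t)^2.
alpha = np.linalg.solve(K + lam * n * np.eye(n), y)

def objective(a):
    # objective written in the coefficient space: ||f||^2 = a' K a
    return lam * a @ K @ a + np.mean((K @ a - y) ** 2)

# alpha is the global minimizer of a convex quadratic:
# any perturbation can only increase the objective
for _ in range(100):
    perturbed = alpha + 0.01 * rng.standard_normal(n)
    assert objective(alpha) <= objective(perturbed) + 1e-12
```

The optimality condition can also be read off directly: the gradient $2\lambda K\alpha + \frac{2}{n}K(K\alpha - y)$ vanishes exactly when $K\alpha + \lambda n \alpha = y$.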
Indeed, it was not even known if the lower bound was optimal in all the regimes [Pillaud-Vivien et al., 2018].

*This work was done while the author was at Boston University.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Moreover, recent empirical results have also challenged the theoretical ones. In particular, KRR without regularization seems to perform very well on real-world datasets [Zhang et al., 2017, Belkin et al., 2018], at least in the classification setting, and even to outperform KRR with any nonzero regularization on a popular computer vision dataset [Liang and Rakhlin, 2018, Figure 1]. This challenges the theoretical findings because our current understanding of kernel-based learning tells us that a non-zero regularization is needed in all cases for learning in infinite dimensional RKHSs. Given the current gap between upper and lower bounds, it is unclear if this mismatch between theory and practice is due to (i) suboptimal analyses that lead to suboptimal choices of the amount of regularization or (ii) not taking into account crucial data-dependent quantities (e.g., capturing "easiness" of the problem) that allow fast rates and minimal regularization.
In this work, we address all these questions. We propose a new kernel-based learning algorithm named Kernel Truncated Randomized Ridge Regression (KTR3). We show that the performance of KTR3 is minimax optimal, matching known lower bounds. This closes the gap between upper and lower bounds, without the need for additional assumptions. Moreover, we show that the generalization guarantee of KTR3 accelerates when the Bayes risk is zero or close to zero. As far as we know, this phenomenon is new in this literature.
Finally, we identify a regime of easy problems in which the best amount of regularization is exactly zero.
Another important contribution lies in our proof methods, which differ substantially from the usual ones in this field. In particular, we use methods from the online learning literature that make the proof very simple and rely only on population quantities rather than empirical ones. We believe the community of nonparametric kernel regression will greatly benefit from the addition of these new tools.
The rest of the paper is organized as follows: In the next section, we formally introduce the setting and our assumptions. In Section 3 we introduce our KTR3 algorithm and its theoretical guarantees, and in Section 4 we give a precise comparison with similar results. In Section 5, we empirically validate our findings. Finally, Section 6 discusses open problems and future directions of research.

2 Setting and Notation: Source Condition and Eigenvalue Decay

In this section, we formally introduce our learning setting and our characterization of the complexity of each regression problem. This characterization is standard in the literature on regression in RKHSs, see, e.g., Steinwart and Christmann [2008], Steinwart et al. [2009], Dieuleveut and Bach [2016], Lin et al. [2018].
Let $X \subset \mathbb{R}^d$ be a compact set and $H_K$ a separable RKHS associated to a Mercer kernel $K : X \times X \to \mathbb{R}$, with inner product $\langle \cdot, \cdot \rangle$ and induced norm $\|\cdot\|$. The inner product satisfies the reproducing property $\langle K(x, \cdot), f(\cdot) \rangle = f(x)$. Denote by $K_t \in \mathbb{R}^{t \times t}$ the Gram matrix such that $(K_t)_{i,j} = K(x_i, x_j)$, where $x_i, x_j$ belong to $S_t \subseteq S$, the subset containing the first² $t$ elements of the training set $S$.
Our first assumption is related to the boundedness of the kernel and the labels.
Assumption 1 (Boundedness).
We assume $K$ to be bounded, that is, $\sup_{x \in X} K(x, x) = R^2 < \infty$. To avoid superfluous notation and without loss of generality, we further assume $R = 1$. We also assume the labels to be bounded: $Y = [-Y, Y]$ with $Y < \infty$.
Denote by $\rho_X$ the marginal probability measure on $X$ and let $L^2_{\rho_X}$ be the space of square integrable functions with respect to $\rho_X$, whose norm is denoted by $\|g\|_\rho := \sqrt{\int_X g^2(x) \, d\rho_X}$. We will assume that the support of $\rho_X$ is $X$. It is well known that the function minimizing the risk over all functions in $L^2_{\rho_X}$ is $f_\rho(x) := \int_Y y \, d\rho(y|x)$, which attains the Bayes risk with respect to the square loss, $R^\star = R(f_\rho) = \inf_{f \in L^2_{\rho_X}} R(f)$.
If we use a universal kernel (e.g., the Gaussian kernel) [Steinwart, 2001] and $X$ is compact, we have that $\inf_{f \in H_K} R(f) = R^\star$ [Steinwart and Christmann, 2008, Corollary 5.29]. This suggests that using a universal kernel is somehow enough to reach the Bayes risk. However, while $f_\rho \in L^2_{\rho_X}$, this does not imply that $f_\rho \in H_K$, but only that $f_\rho \in \overline{H_K}$, the closure of $H_K$. Thus, the question

²Note that the ordering of the elements in $S$ is immaterial, but our algorithm will depend on it. So we can just consider $S$ ordered according to an initial random shuffling.

Algorithm 1 KTR3: Kernel Truncated Randomized Ridge Regression
  Input: a training set $S = \{(x_i, y_i)\}_{i=1}^n$, a regularization parameter $\lambda \geq 0$
  Randomly permute the training set $S$
  for $t = 0, 1, \ldots, n-1$ do
    Set $f_t = \operatorname*{argmin}_{f \in H_K} \lambda \|f\|^2 + \frac{1}{n} \sum_{i=1}^{t} (f(x_i) - y_i)^2$
    (take the minimum norm solution when there is no unique solution)
  end for
  Return $T_Y \circ f_k$, where $k$ is drawn uniformly at random between $0$ and $n-1$

of whether it is possible to achieve the Bayes risk is relevant even for universal kernels. We address this through the standard parametrization called the source condition, which smoothly characterizes whether or not $f_\rho$ belongs to $H_K$. To introduce the formalism, let $L_K : L^2_{\rho_X} \to L^2_{\rho_X}$ be the integral operator defined by $(L_K f)(x) = \int_X K(x, x') f(x') \, d\rho_X(x')$. There exists an orthonormal basis $\{\Phi_1, \Phi_2, \ldots\}$ of $L^2_{\rho_X}$ consisting of eigenfunctions of $L_K$ with corresponding non-negative eigenvalues $\{\lambda_1, \lambda_2, \ldots\}$, and the set $\{\lambda_i\}$ is finite or $\lambda_k \to 0$ as $k \to \infty$ [Cucker and Zhou, 2007, Theorem 4.7]. Since $K$ is a Mercer kernel, $L_K$ is compact and positive. Moreover, given that we assumed the kernel to be bounded, $L_K$ is trace class [Steinwart and Christmann, 2008]. Therefore, the fractional power operator $L_K^\beta$ is well-defined for any $\beta \geq 0$. We indicate its range space by
\[
L_K^\beta(L^2_{\rho_X}) := \left\{ f = \sum_{i=1}^\infty \lambda_i^\beta a_i \Phi_i \;:\; \sum_{i=1}^\infty a_i^2 < \infty \right\}.
\]
This space has a key role in our analysis. In particular, we will use the following assumption.
Assumption 2 (Source Condition). We assume that $f_\rho \in L_K^\beta(L^2_{\rho_X})$ for some $0 < \beta \leq \frac{1}{2}$, that is, $\exists g \in L^2_{\rho_X} : f_\rho = L_K^\beta(g)$.
Note that the assumption above is always satisfied for $\beta = 0$ because, by definition of the orthonormal basis, $L_K^0(L^2_{\rho_X}) = L^2_{\rho_X}$. On the other hand, we have that $L_K^{1/2}(L^2_{\rho_X}) = H_K$, that is, every function $f \in H_K$ can be written as $f = L_K^{1/2} g$ for some $g \in L^2_{\rho_X}$, and $\|f\| = \|L_K^{-1/2} f\|_\rho$ [Cucker and Zhou, 2007, Corollary 4.13]. Hence, the values of $\beta$ in $[0, \frac{1}{2}]$ allow us to consider spaces in between $L^2_{\rho_X}$ and $H_K$, including the extremes. Thus, a bigger $\beta$ means a simpler function $f_\rho$.
Another assumption needed to characterize the learning process concerns the complexity of the RKHS itself, rather than the complexity of the optimal function. This is typically done by assuming that the eigenvalues of the integral operator decay at a certain rate. We will use an equivalent condition, assuming that the trace of some fractional power of the integral operator is bounded.
Assumption 3 (Eigenvalue Decay). Assume that there exists $b \in [0, 1]$ such that $\operatorname{Tr}[L_K^b] < \infty$.
Note that the sum of the eigenvalues of $L_K$ is at most $\sup_{x \in X} K(x, x)$, which we assumed to be bounded in Assumption 1. This implies that the assumption above is always satisfied with $b = 1$. Hence, a smaller $b$ corresponds to an RKHS with a smaller complexity.

3 Kernel Truncated Randomized Ridge Regression

We now describe our algorithm, called Kernel Truncated Randomized Ridge Regression (KTR3). The pseudo-code is in Algorithm 1. The algorithm consists of two stages. In the first stage, we generate $n$ candidate functions by solving KRR with increasing sizes of the training set and a fixed regularization weight $\lambda$. In the second stage, we select the prediction function as the truncation of one of the candidate functions, chosen uniformly at random.
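The two stages of Algorithm 1 can be sketched in a few lines of NumPy. This is our own illustration, not the authors' code: the Gaussian kernel, the function names, and the use of a pseudo-inverse for the minimum-norm solution are our assumptions.

```python
import numpy as np

def gaussian_kernel(A, B):
    # assumed kernel for illustration; any Mercer kernel works
    return np.exp(-((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))

def ktr3(X, y, X_test, lam, Y_bound, seed=0):
    """Sketch of Algorithm 1 (hypothetical helper, not the paper's code)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    perm = rng.permutation(n)                 # initial random shuffle of S
    X, y = X[perm], y[perm]
    k = int(rng.integers(0, n))               # pick f_k, k ~ Uniform{0,...,n-1}
    if k == 0:
        pred = np.zeros(len(X_test))          # f_0 is the zero function
    else:
        # f_k = argmin_f lam*||f||^2 + (1/n) * sum_{i<=k} (f(x_i) - y_i)^2,
        # so alpha = (K_k + lam*n*I)^{-1} y_{1:k}; pinv returns the
        # minimum-norm solution when lam = 0 and K_k is singular
        Kk = gaussian_kernel(X[:k], X[:k])
        alpha = np.linalg.pinv(Kk + lam * n * np.eye(k)) @ y[:k]
        pred = gaussian_kernel(X_test, X[:k]) @ alpha
    return np.clip(pred, -Y_bound, Y_bound)   # truncation T_Y

rng = np.random.default_rng(42)
X = rng.uniform(0, 1, (50, 1))
y = np.sin(2 * np.pi * X[:, 0])
X_test = rng.uniform(0, 1, (10, 1))
p = ktr3(X, y, X_test, lam=0.01, Y_bound=1.0)
```

Note the fixed $1/n$ weighting of the prefix losses, matching the notation of Theorem 3 below, rather than a $1/t$ weighting.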
Note that this is equivalent to extracting a subset of the training set of size $k$, where $k$ is drawn uniformly at random between $0$ and $n-1$, and training KRR on the subset with parameter $\lambda$. The truncation function is defined as
\[
T_Y(z) := \min(Y, |z|) \cdot \operatorname{sign}(z)\,.
\]
The definition of the truncation function implies that $(T_Y(\hat{y}) - y)^2 \leq (\hat{y} - y)^2$ for all $\hat{y} \in \mathbb{R}$, $y \in Y$.
We now present our two main theorems on the excess risk of KTR3, where Theorem 1 covers $\lambda > 0$ and Theorem 2 covers $\lambda = 0$ in an "easy" problem regime. The proof of Theorem 2 is in the Appendix.

Theorem 1. Let $X \subset \mathbb{R}^d$ be a compact domain and $K$ a Mercer kernel such that Assumptions 1, 2, and 3 are verified. Denote by $f_{S,\lambda}$ the function returned by the KTR3 algorithm on a training set $S$ with regularization parameter $\lambda > 0$. Then
\[
\mathbb{E}[R(f_{S,\lambda})] - R(f_\rho)
\leq \lambda^{2\beta} \|L_K^{-\beta} f_\rho\|_\rho^2
+ \min\left( \frac{4 Y^2 \operatorname{Tr}[L_K^b]}{\lambda^b n} \min\left( \frac{1}{b}, \ln^{1-b}\Big(1 + \frac{1}{\lambda}\Big) \right),\;
\frac{\lambda^{2\beta-1} \|L_K^{-\beta} f_\rho\|_\rho^2}{n} + \frac{R(f_\rho)}{\lambda n} \right),
\]
where the expectation is with respect to $S$ and the randomization of the algorithm.

Theorem 2. Let $\lambda = 0$ and assume the same conditions as in Theorem 1 except for $\lambda$. Assume $\beta = 1/2$ and $R(f_\rho) = 0$. Assume that the distribution $\rho$ is such that $K_n$ is invertible with probability 1. Then, $\mathbb{E}[R(f_{S,0})] - R(f_\rho) = O(n^{-1})$.

Remark. Our algorithm can be changed to randomize at prediction time for each test data point rather than at training time while enjoying the same risk bound. Furthermore, our algorithm can sample $k$ from $\{\lfloor (1-\alpha) n \rfloor, \ldots, n-1\}$ for some $\alpha \in (0, 1]$ instead of from $\{0, \ldots, n-1\}$ and obtain a rate worse by a factor $\frac{1}{\alpha}$ than the bounds above; our choice of presentation in Algorithm 1 is for simplicity.

From the above theorem, with appropriate settings of the regularization parameter $\lambda$ it is possible to obtain the following convergence rates.

Corollary 1. Under the assumptions of Theorem 1, there exists a setting of $\lambda \geq 0$ such that:
(i) when $b \neq 0$,
\[
\mathbb{E}[R(f_{S,\lambda})] - R(f_\rho) \leq O\left( \min\left( (n / R(f_\rho))^{-\frac{2\beta}{2\beta+1}} + n^{-2\beta},\; n^{-\frac{2\beta}{2\beta+b}} \right) \right);
\]
(ii) in the case $b = 0$ and $\beta = \frac{1}{2}$,³
\[
\mathbb{E}[R(f_{S,\lambda})] - R(f_\rho) \leq O\left( n^{-1} \operatorname{Tr}[L_K^0] \log\big(1 + n / \operatorname{Tr}[L_K^0]\big) \right).
\]
The proof and the tuning of $\lambda$ can be found in the Appendix. Before moving to the proof of Theorem 1 in the next section, there are some interesting points to stress.

• In the case of $R(f_\rho) \neq 0$, our rate $n^{-\frac{2\beta}{2\beta+b}}$ matches the worst-case lower bound [Fischer and Steinwart, 2017] without additional assumptions, for the first time in the literature to our knowledge. Specifically, our bound is a strict improvement in the regime $2\beta + b < 1$ upon the best-known bound $O(n^{-2\beta})$ of KRR [Lin et al., 2018] and stochastic gradient descent [Dieuleveut and Bach, 2016]. In this regime, our rate approaches $O(n^{-1})$ as $b$ goes to 0.
• If $R(f_\rho) = 0$, the risk converges to 0 at the faster rate $n^{-\frac{2\beta}{\min(2\beta+b,\,1)}}$. It is important to stress that this holds also in the case that $f_\rho \notin H_K$, i.e., $\beta < \frac{1}{2}$.
As far as we know, this result is new and we are not aware of lower bounds under the same assumptions.
• When $R(f_\rho) = 0$, the optimal $\lambda$ that minimizes the generalization upper bound in Theorem 1 goes to zero as $\beta$ goes to $1/2$, and becomes exactly 0 when $\beta$ is exactly $1/2$.

3.1 Proof of Theorem 1

Our proof technique is vastly different from the existing ones for analyzing KRR and stochastic gradient descent methods. It is also extremely short and simple compared to the proofs of similar results. Our technique is based on the well-known possibility of solving batch problems through a reduction to online learning. In turn, we use a recent result on the performance of online kernel ridge regression, Theorem 3 by Zhdanov and Kalnishkan [2013]. This result is the key to obtaining the improved rates in the regime $2\beta + b < 1$. In particular, it allows us to analyze the effect of the eigenvalues using only the expectation of the Gram matrix $K_n$ and nothing else. Instead, previous proofs [e.g., Lin and Cevher, 2018] involved the study of the convergence of the empirical covariance operator to the population one, which seems to deteriorate when the regularization parameter becomes too small, which is precisely what is needed in the regime $2\beta + b < 1$.

³When $b = 0$ the space is finite dimensional, hence $\beta$ can only take the values $0$ or $1/2$, and there is no convergence to the Bayes risk when $\beta = 0$.

Theorem 3 [Zhdanov and Kalnishkan, 2013, Theorem 1]. Take a kernel $K$ on a domain $X$ and a parameter $\lambda > 0$. Then, with the notation of Algorithm 1, we have
\[
\frac{1}{n} \sum_{t=1}^n \frac{(f_{t-1}(x_t) - y_t)^2}{1 + \frac{d_t}{\lambda n}} = \min_{f \in H_K} \lambda \|f\|^2 + \frac{1}{n} \sum_{t=1}^n (f(x_t) - y_t)^2\,,
\]
where $d_t := K(x_t, x_t) - k_{t-1}(x_t)^\top (K_{t-1} + \lambda n I)^{-1} k_{t-1}(x_t) \geq 0$, $k_{t-1}(x_t) := [K(x_t, x_1), \ldots, K(x_t, x_{t-1})]^\top$, and $K_{t-1}$ is the Gram matrix of the samples $x_1, \ldots, x_{t-1}$.
We use the following well-known result to upper bound the approximation error, which is the gap between the value of the regularized population risk minimization problem and the Bayes risk.

Theorem 4 [Cucker and Zhou, 2007, Proposition 8.5.ii]. Let $X \subset \mathbb{R}^d$ be a compact domain and $K$ a Mercer kernel such that Assumption 2 holds. Then, for any $0 < \beta \leq 1/2$, we have
\[
\min_{f \in H_K} \lambda \|f\|^2 + R(f) - R(f_\rho) \leq \lambda^{2\beta} \|L_K^{-\beta} f_\rho\|_\rho^2\,.
\]
We also need the following technical lemmas. The proof of the next lemma is in the Appendix.

Lemma 1. Under Assumptions 1 and 3, and with $\lambda > 0$, we have
\[
\mathbb{E}_S\left[ \ln \frac{|\lambda I_n + \frac{1}{n} K_n|}{|\lambda I_n|} \right] \leq \frac{\operatorname{Tr}[L_K^b]}{\lambda^b} \min\left( \frac{1}{b}, \ln^{1-b}\Big(1 + \frac{1}{\lambda}\Big) \right).
\]
Furthermore, if $b = 0$, then
\[
\mathbb{E}_S\left[ \ln \frac{|\lambda I_n + \frac{1}{n} K_n|}{|\lambda I_n|} \right] \leq \operatorname{Tr}[L_K^0] \ln\left(1 + \frac{1}{\operatorname{Tr}[L_K^0]\, \lambda}\right).
\]
Note that the logarithmic term is unavoidable when $b = 0$ because in the finite dimensional case we pay $-\ln(\lambda)$ due to the online learning setting. The last lemma is a classic result in online learning [e.g., Cesa-Bianchi et al., 2005].

Lemma 2. With the notation in Theorem 3, we have that
\[
\sum_{t=1}^n \frac{d_t}{d_t + \lambda n} \leq \ln \frac{|\lambda I_n + \frac{1}{n} K_n|}{|\lambda I_n|}\,.
\]
Proof. From the elementary inequality $\ln(1 + x) \geq \frac{x}{x+1}$, we have $\sum_{t=1}^n \frac{d_t}{d_t + \lambda n} \leq \sum_{t=1}^n \ln\big(1 + \frac{d_t}{\lambda n}\big)$. Also, using Zhdanov and Kalnishkan [2013, Lemma 3], we have $\prod_{t=1}^n (\lambda n + d_t) = |\lambda n I_n + K_n|$.
Putting everything together, we have the stated bound.
We are now ready to prove Theorem 1.

Proof of Theorem 1. Define $f_\lambda = \operatorname*{argmin}_{f \in H_K} \lambda \|f\|^2 + R(f)$, the solution of the regularized true risk minimization problem.
First, we use the so-called online-to-batch conversion [Cesa-Bianchi et al., 2004] to obtain
\[
\mathbb{E}_{S,k}[R(T_Y \circ f_k)]
= \mathbb{E}_S\left[ \frac{1}{n} \sum_{t=0}^{n-1} R(T_Y \circ f_t) \right]
= \mathbb{E}_S\left[ \frac{1}{n} \sum_{t=0}^{n-1} \mathbb{E}_{S_t}\big[(T_Y(f_t(x)) - y)^2\big] \right]
= \mathbb{E}_S\left[ \frac{1}{n} \sum_{t=0}^{n-1} \mathbb{E}_{S_t}\big[(T_Y(f_t(x_{t+1})) - y_{t+1})^2\big] \right]
= \mathbb{E}_S\left[ \frac{1}{n} \sum_{t=1}^{n} (T_Y(f_{t-1}(x_t)) - y_t)^2 \right].
\]
Denote by $d'_t = \frac{d_t}{\lambda n}$, $\ell'_t = (T_Y(f_{t-1}(x_t)) - y_t)^2$, and $\ell_t = (f_{t-1}(x_t) - y_t)^2$.
We have that
\[
\mathbb{E}_S\left[ \frac{1}{n} \sum_{t=1}^n \ell'_t \right]
= \mathbb{E}_S\left[ \frac{1}{n} \sum_{t=1}^n \frac{\ell'_t d'_t}{1 + d'_t} \right] + \mathbb{E}_S\left[ \frac{1}{n} \sum_{t=1}^n \frac{\ell'_t}{1 + d'_t} \right]
\leq \mathbb{E}_S\left[ \frac{1}{n} \sum_{t=1}^n \frac{\ell'_t d'_t}{1 + d'_t} \right] + \mathbb{E}_S\left[ \frac{1}{n} \sum_{t=1}^n \frac{\ell_t}{1 + d'_t} \right].
\]
We now focus on the first sum in the last inequality and upper bound it in two different ways. First, using Lemma 2 and Lemma 1, and the fact that $\ell'_t \leq 4Y^2$, we have
\[
\mathbb{E}_S\left[ \frac{1}{n} \sum_{t=1}^n \frac{\ell'_t d'_t}{1 + d'_t} \right]
\leq 4Y^2 \, \mathbb{E}_S\left[ \frac{1}{n} \sum_{t=1}^n \frac{d'_t}{1 + d'_t} \right]
\leq \frac{4Y^2}{n} \mathbb{E}_S\left[ \ln \frac{|\lambda I + \frac{1}{n} K_n|}{|\lambda I|} \right]
\leq \frac{4Y^2 \operatorname{Tr}[L_K^b]}{\lambda^b n} \min\left( \frac{1}{b}, \ln^{1-b}\Big(1 + \frac{1}{\lambda}\Big) \right).
\]
Also, we can upper bound the same term as
\[
\mathbb{E}_S\left[ \frac{1}{n} \sum_{t=1}^n \frac{\ell'_t d'_t}{1 + d'_t} \right]
\leq \mathbb{E}_S\left[ \frac{1}{n} \sum_{t=1}^n \frac{\ell_t d'_t}{1 + d'_t} \right]
\leq \Big( \max_t d'_t \Big) \, \mathbb{E}_S\left[ \frac{1}{n} \sum_{t=1}^n \frac{\ell_t}{1 + d'_t} \right]
\leq \frac{1}{\lambda n} \, \mathbb{E}_S\left[ \frac{1}{n} \sum_{t=1}^n \frac{\ell_t}{1 + d'_t} \right],
\]
where the last inequality uses $d_t \leq 1$, which follows from Assumption 1.
Now, using Theorems 3 and 4, we bound the term $\mathbb{E}_S\big[ \frac{1}{n} \sum_{t=1}^n \frac{\ell_t}{1 + d'_t} \big]$ as
\[
\mathbb{E}_S\left[ \frac{1}{n} \sum_{t=1}^n \frac{\ell_t}{1 + d'_t} \right]
= \mathbb{E}_S\left[ \min_{f \in H_K} \lambda \|f\|^2 + \frac{1}{n} \sum_{t=1}^n (f(x_t) - y_t)^2 \right]
\leq \lambda \|f_\lambda\|^2 + \mathbb{E}_S\left[ \frac{1}{n} \sum_{t=1}^n (f_\lambda(x_t) - y_t)^2 \right]
= \lambda \|f_\lambda\|^2 + R(f_\lambda)
= \min_{f \in H_K} \lambda \|f\|^2 + R(f)
\leq \lambda^{2\beta} \|L_K^{-\beta} f_\rho\|_\rho^2 + R(f_\rho)\,.
\]
Putting everything together, we have the stated bound.

4 Detailed Comparison with Previous Results

The sheer volume of research on regression, see, e.g., Lin and Cevher [2018, Table 1], precludes a complete survey of the results. In this section, we focus on the closely related ones that involve infinite dimensional spaces.
First, it is useful to compare our convergence rate to the one we would get from known guarantees for KRR. We can compare it to the stability bound in Shalev-Shwartz and Ben-David [2014] for KRR:
\[
\mathbb{E}_S\big[ R(f^{\mathrm{KRR}}_{S,\lambda}) \big]
\leq \left( 1 + \frac{192}{\lambda n} \right) \mathbb{E}_S\left[ \frac{1}{n} \sum_{t=1}^n \big(f^{\mathrm{KRR}}_{S,\lambda}(x_t) - y_t\big)^2 \right].
\]
It is easy to see⁴ that this bound implies the following convergence rate:
\[
\mathbb{E}_S\big[ R(f^{\mathrm{KRR}}_{S,\lambda}) \big] - R(f_\rho)
\leq \left( 1 + \frac{192}{\lambda n} \right) \lambda^{2\beta} \|L_K^{-\beta} f_\rho\|_\rho^2 + \frac{192 \, R(f_\rho)}{\lambda n}\,.
\]
This convergence rate matches only half of our bound. In particular, it does not contain the term that depends on the capacity of the RKHS through $b$. Also, the theorem in Shalev-Shwartz and Ben-David [2014] holds only for $\lambda \geq \frac{4}{m}$ (in their notation, $m$ is the training set size). This essentially prevents the setting $\lambda = 0$ and the possibility to achieve the rate $n^{-1}$ in the case that $\beta = \frac{1}{2}$ and $R(f_\rho) = 0$.
Another similar bound is the leave-one-out analysis in Zhang [2003], which gives
\[
\mathbb{E}_S\big[ R(f^{\mathrm{KRR}}_{S,\lambda}) \big]
\leq \left( 1 + \frac{2}{\lambda n} \right)^2 \mathbb{E}_S\left[ \min_{f \in H_K} \lambda \|f\|^2 + \frac{1}{n} \sum_{t=1}^n (f(x_t) - y_t)^2 \right].
\]

⁴For completeness, the proof is in Theorem 5 in the Appendix.

As for the stability bound, using Theorem 4, this bound implies the following bound for $\lambda > 0$:
\[
\mathbb{E}_S\big[ R(f^{\mathrm{KRR}}_{S,\lambda}) \big] - R(f_\rho)
\leq \left( 1 + \frac{2}{\lambda n} \right)^2 \lambda^{2\beta} \|L_K^{-\beta} f_\rho\|_\rho^2 + \left( \frac{4}{\lambda n} + \frac{4}{\lambda^2 n^2} \right) R(f_\rho)\,.
\]
Hence, this bound suffers from the same problems as the stability bound: it is suboptimal with respect to the capacity of the space, and the presence of the square always makes the $\lambda$ that minimizes the risk bound bounded away from zero.
The best known results for nonparametric least squares under Assumptions 1-3 are obtained by KRR [Lin et al., 2018] and by stochastic least squares [Dieuleveut and Bach, 2016], with the rate
\[
\mathbb{E}_S[R(f_{S,\lambda})] - R(f_\rho) \leq
\begin{cases}
O\big(n^{-\frac{2\beta}{2\beta+b}}\big), & \text{if } 2\beta + b \geq 1, \\
O\big(n^{-2\beta}\big), & \text{otherwise.}
\end{cases}
\]
These kinds of rates are suboptimal in the regime $2\beta + b < 1$. In contrast, our result achieves the optimal rate in all regimes. Also, these rates do not depend in any way on the risk of the optimal function $f_\rho$. Hence, they never support the choice of a regularization parameter equal to zero.
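As a side note on the proof machinery of Section 3.1: the equality in Theorem 3 is an exact identity, not an inequality, and it can be checked numerically. The following is our own illustration, with an assumed Gaussian kernel and random data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam = 8, 0.3
X = rng.standard_normal((n, 2))
y = rng.standard_normal(n)
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))  # Gram matrix

# Left-hand side: predictions of f_{t-1} (trained on the first t-1 points),
# each squared error discounted by 1 / (1 + d_t/(lam*n))
lhs = 0.0
for t in range(n):
    Kt, kt = K[:t, :t], K[:t, t]
    inv = np.linalg.inv(Kt + lam * n * np.eye(t)) if t > 0 else np.zeros((0, 0))
    pred = kt @ inv @ y[:t] if t > 0 else 0.0        # f_{t-1}(x_t); f_0 = 0
    d_t = K[t, t] - (kt @ inv @ kt if t > 0 else 0.0)
    lhs += (pred - y[t]) ** 2 / (1 + d_t / (lam * n))
lhs /= n

# Right-hand side: value of the batch problem, alpha = (K + lam*n*I)^{-1} y
alpha = np.linalg.solve(K + lam * n * np.eye(n), y)
rhs = lam * alpha @ K @ alpha + np.mean((K @ alpha - y) ** 2)

assert np.isclose(lhs, rhs)   # the two sides agree up to floating-point error
```

The same script with `lam` made smaller illustrates why the identity is so useful here: it holds for every $\lambda > 0$, with no empirical-to-population operator comparison involved.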
Pillaud-Vivien et al. [2018] call the regime $2\beta + b < 1$ the "hard" problems and prove that SGD with multiple passes achieves the optimal rate for a subset of the hard problems. However, their result makes an additional assumption on the infinity norm of the functions in $H_K$. Under the same assumption, Steinwart et al. [2009] present a convergence rate of $O(n^{-\frac{2\beta}{2\beta+b}})$ in all regimes for truncated KRR.
The only result we are aware of that shows an acceleration in the low noise case is Orabona [2014]. Using an SGD-like procedure that does not require setting parameters, he proves a rate of $O(n^{-\frac{2\beta}{2\beta+1}})$ that accelerates to $O(n^{-\frac{2\beta}{\beta+1}})$ when $R(f_\rho) = 0$, for smooth and Lipschitz losses.
Turning to KRR used for classification, in the extreme case of Tsybakov's noise condition (also called the Massart low noise condition [Massart and Nédélec, 2006]), Yao et al. [2007] proved an exponential rate of convergence. However, this is specific to the classification case and does not apply to the regression setting. Under stronger assumptions, i.e., data separable with a margin, the same effect was already proved in Zhang [2001]. It is also interesting to note that these results require a non-zero implicit or explicit regularization.
More recently, Hastie et al. [2019] showed⁵ an asymptotic result (as $n \to \infty$) that the best regularization parameter $\lambda$ of ridge regression is 0 when there is no label noise (i.e., $R(f_\rho) = 0$) and $\beta = \frac{1}{2}$. Their result aligns well with ours, but we are not limited to asymptotic regimes nor to finite dimensional spaces. On the other hand, our guarantee is an upper bound on the risk rather than an equality.

⁵To see this, set $\sigma^2 = 0$ in Hastie et al. [2019, Theorem 6].

5 Empirical Validation

In this section, we empirically validate some of our theoretical findings. Inspired by Pillaud-Vivien et al. [2018], we consider a spline kernel of order $q \geq 2$ where $q$ is even [Wahba, 1990, Eq. (2.1.7)]. Specifically, we define
\[
\Lambda_q(x, x') = 1 + 2 \sum_{k=1}^\infty \frac{\cos(2\pi k (x - x'))}{(2\pi k)^q}
\]
and use the kernel $K(x, x') = \Lambda_{1/b}(x, x')$ for some $b \in [0, 1]$. We consider the uniform distribution $\rho_X$ on $X = [0, 1]$ and define the target function to be $f^\star(x) = \Lambda_{\frac{\beta}{b} + \frac{1}{2}}(x, 0)$ for $x \in X$. We define the observed response of $x$ to be $f^\star(x) + B$, where $B$ is a uniform random variable on $[-\epsilon, \epsilon]$. One can show that this problem satisfies Assumptions 1-3 [Pillaud-Vivien et al., 2018].
For each $n$ in a fine-grained grid in $[10^2, 10^3]$ and each $\lambda$ in another fine-grained set of values, we draw $n$ training points, compute $f_n$ by Algorithm 1, and estimate its excess risk on a test set. Finally, for each $n$ we choose the $\lambda$ that minimizes the average excess risk. We repeat the whole procedure 5 times. First, we set $b = \frac{1}{8}$, $\beta = \frac{7}{16}$, and $\epsilon = 0.1$. Figure 1(a) plots the excess risk of the best $\lambda$'s vs $n$, which approximately achieves the predicted rate $n^{-\frac{7}{8}}$.

Figure 1: Expected excess risk of KTR3 vs the number of training points on a synthetic dataset with a spline kernel. (a) and (b) show two different difficulties of the task, as parametrized by $\beta$ and $b$.

To verify our improved rate in the regime $2\beta + b < 1$, we also consider the case of $\beta = \frac{1}{4}$, $b = \frac{1}{6}$, and $\epsilon = 0.1$.
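The spline kernel series above is straightforward to evaluate numerically by truncation. A sketch of the kernel and target used in this experiment follows; it is our own illustration, and the truncation level `n_terms` is our assumption, not a detail from the paper:

```python
import numpy as np

def spline_kernel(x, xp, q, n_terms=5000):
    """Truncated Fourier series for Lambda_q(x, x'); the tail is O(n_terms^(1-q))."""
    k = np.arange(1, n_terms + 1)
    return 1 + 2 * np.sum(np.cos(2 * np.pi * k * (x - xp)) / (2 * np.pi * k) ** q)

b, beta = 1 / 8, 7 / 16
q = 1 / b                                            # kernel K = Lambda_{1/b}
K_val = spline_kernel(0.3, 0.7, q)                   # one kernel evaluation
f_star = spline_kernel(0.25, 0.0, beta / b + 0.5)    # target f* = Lambda_{beta/b + 1/2}(., 0)

# sanity check against the closed form Lambda_q(x, x) = 1 + 2*zeta(q)/(2*pi)^q
# at q = 2: zeta(2) = pi^2/6, so Lambda_2(x, x) = 1 + 1/12
assert abs(spline_kernel(0.5, 0.5, 2) - (1 + 1 / 12)) < 1e-3
```

For the first experiment, $\beta/b + 1/2 = (7/16)/(1/8) + 1/2 = 4$, so the target is simply $\Lambda_4(\cdot, 0)$.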
Figure 1(b) plots the excess risk of the best $\lambda$'s vs $n$, which approximately achieves the predicted rate $n^{-\frac{3}{4}}$ rather than the slow rate $n^{-\frac{1}{2}}$ of prior art.⁶

6 Discussion and Open Problems

We have presented a new algorithm for kernel-based nonparametric least squares that achieves optimal generalization rates with respect to the source condition and the complexity of the RKHS. Moreover, faster rates are possible when the Bayes risk is zero, even when the optimal predictor is not in $H_K$.
One natural open problem is to prove similar guarantees for KRR. We conjecture that the randomization used in our analysis is not strictly necessary; it only greatly simplifies the proof. One may try to prove that the generalization error of KRR is nonincreasing with $n$, in which case the randomization only harms the generalization, which would imply that KRR enjoys the same error bound as KTR3. Such a claim is, unfortunately, not true, as shown by Viering et al. [2019, Example III], where the error rate of KRR can increase with $n$.
It would also be interesting to prove lower bounds for the $R(f_\rho) = 0$ case, to understand whether the obtained rates are optimal. Furthermore, alleviating the boundedness assumption (Assumption 1) would be interesting, possibly with some mild moment conditions as in Hsu et al. [2012], Audibert and Catoni [2011], and Hsu and Sabato [2016].
One consequence of our work is that it shows a gap between the best-known bounds for SGD and ERM-based algorithms. Indeed, before this work, the rates of SGD and ERM-based algorithms (e.g., KRR) under Assumptions 1-3 were the same. It would be interesting to understand if some variant of SGD can achieve the optimal rates or if there is indeed a clear separation between the rates.
The main limitation of this work concerns the parametrization of the problem via the source condition and the complexity of the RKHS.
Specifically, our rates are only valid for $\beta \leq 1/2$ (see Assumption 2), due to the use of Theorem 4. However, this is unlikely to be a limitation of the analysis, but rather a consequence of the use of a regularizer and the consequent "saturation" phenomenon; see the discussion in Yao et al. [2007]. Another limitation of our framework is that it is well-known that the guarantee on the approximation error in Theorem 4 is non-trivial for a Gaussian kernel with fixed bandwidth only if $f_\rho \in C^\infty$ [Smale and Zhou, 2003]. While this is a strong condition from a mathematical point of view, it is unclear how strong it is for real-world problems, where the bandwidth of the Gaussian kernel is often tuned.

⁶We remark that the considered kernel satisfies an extra assumption (e.g., Pillaud-Vivien et al. [2018, Assumption (A3)]) that in fact allows KRR to achieve the same optimal rate as ours. We are not aware of simple problems where that condition is not satisfied. However, our theory clearly does not make such an assumption yet achieves the optimal rate.

Finally, we believe the assumptions considered too strong by the theory community can be reconsidered in light of modern machine learning tasks. Indeed, most results in the community have ignored the case of $R(f_\rho) = 0$, perhaps because it was considered too strong a condition. However, most of the visual perception tasks on which modern machine learning has been successful seem to satisfy this assumption; for example, humans have zero or very close to zero error in recognizing cats versus dogs from a photograph.
In this view, a more ambitious open problem is to find the correct characterization of "easiness" for real-world problems, rather than using mathematically appealing ones.

Acknowledgements

The authors thank Junhong Lin, Lorenzo Rosasco, and Alessandro Rudi for the comments and discussions on this work. This material is based upon work supported by the National Science Foundation under grant no. 1908111 "Collaborative Research: TRIPODS Institute for Optimization and Learning".

References

J.-Y. Audibert and O. Catoni. Robust linear least squares regression. The Annals of Statistics, 39(5):2766–2794, 2011.

M. Belkin, S. Ma, and S. Mandal. To understand deep learning we need to understand kernel learning. In J. Dy and A. Krause, editors, International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 541–549, Stockholmsmässan, Stockholm, Sweden, 2018. PMLR.

N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning algorithms. IEEE Trans. Inf. Theory, 50(9):2050–2057, 2004. URL https://homes.di.unimi.it/~cesabian/Pubblicazioni/J20.pdf.

N. Cesa-Bianchi, A. Conconi, and C. Gentile. A second-order Perceptron algorithm. SIAM Journal on Computing, 34(3):640–668, 2005.

F. Cucker and D. X. Zhou. Learning Theory: An Approximation Theory Viewpoint. Cambridge University Press, New York, NY, USA, 2007.

A. Dieuleveut and F. Bach. Nonparametric stochastic approximation with large step-sizes. The Annals of Statistics, 44(4):1363–1399, 2016.

S. Fischer and I. Steinwart. Sobolev norm learning rates for regularized least-squares algorithm. arXiv preprint arXiv:1702.07254, 2017.

T. Hastie, A. Montanari, S. Rosset, and R. J. Tibshirani. Surprises in high-dimensional ridgeless least squares interpolation. arXiv preprint arXiv:1903.08560v2, 2019.

D. Hsu and S. Sabato.
Loss minimization and parameter estimation with heavy tails. The Journal of Machine Learning Research, 17(1):543–582, 2016.

D. Hsu, S. M. Kakade, and T. Zhang. Random design analysis of ridge regression. In Proc. of the 25th Conference on Learning Theory, pages 9.1–9.24, 2012.

T. Liang and A. Rakhlin. Just interpolate: Kernel "ridgeless" regression can generalize. arXiv preprint arXiv:1808.00387, 2018.

J. Lin and V. Cevher. Optimal convergence for distributed learning with stochastic gradient methods and spectral algorithms. arXiv preprint arXiv:1801.07226, 2018.

J. Lin, A. Rudi, L. Rosasco, and V. Cevher. Optimal rates for spectral algorithms with least-squares regression over Hilbert spaces. Applied and Computational Harmonic Analysis, 2018.

P. Massart and É. Nédélec. Risk bounds for statistical learning. The Annals of Statistics, 34(5):2326–2366, 2006.

F. Orabona. Simultaneous model selection and optimization through parameter-free stochastic learning. In Advances in Neural Information Processing Systems 27, 2014.

L. Pillaud-Vivien, A. Rudi, and F. Bach. Statistical optimality of stochastic gradient descent on hard learning problems through multiple passes. In Advances in Neural Information Processing Systems 31, pages 8114–8124. Curran Associates, Inc., 2018.

L. Rosasco, M. Belkin, and E. De Vito. On learning with integral operators. J. Mach. Learn. Res., 11:905–934, March 2010.

S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, New York, NY, USA, 2014.

S. Smale and D.-X. Zhou. Estimating the approximation error in learning theory. Analysis and Applications, 1(01):17–41, 2003.

I. Steinwart. On the influence of the kernel on the consistency of support vector machines. Journal of Machine Learning Research, 2(Nov):67–93, 2001.

I. Steinwart and A.
Christmann. Support Vector Machines. Springer, 2008.

I. Steinwart, D. R. Hush, and C. Scovel. Optimal rates for regularized least squares regression. In Proc. of the 22nd Conference on Learning Theory, 2009.

T. Viering, A. Mey, and M. Loog. Open problem: Monotonicity of learning. In Proc. of the Conference on Learning Theory (COLT), pages 3198–3201. PMLR, 2019.

G. Wahba. Spline models for observational data, volume 59. SIAM, 1990.

Y. Yao, L. Rosasco, and A. Caponnetto. On early stopping in gradient descent learning. Constr. Approx., 26:289–315, 2007.

C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017.

T. Zhang. Convergence of large margin separable linear classification. In Advances in Neural Information Processing Systems, pages 357–363, 2001.

T. Zhang. Leave-one-out bounds for kernel methods. Neural Comput., 15(6):1397–1437, June 2003.

F. Zhdanov and Y. Kalnishkan. An identity for kernel ridge regression. Theor. Comput. Sci., 473:157–178, February 2013.