{"title": "Efficient online learning with kernels for adversarial large scale problems", "book": "Advances in Neural Information Processing Systems", "page_first": 9432, "page_last": 9441, "abstract": "We are interested in a framework of online learning with kernels for low-dimensional, but large-scale and potentially adversarial datasets. \nWe study the computational and theoretical performance of online variations of kernel Ridge regression. Despite its simplicity, the algorithm we study is the first to achieve the optimal regret for a wide range of kernels with a per-round complexity of order $n^\alpha$ with $\alpha < 2$. \n\nThe algorithm we consider is based on approximating the kernel with the linear span of basis functions. Our contributions are twofold: 1) For the Gaussian kernel, we propose to build the basis beforehand (independently of the data) through Taylor expansion. For $d$-dimensional inputs, we provide a (close to) optimal regret of order $O((\log n)^{d+1})$ with per-round time complexity and space complexity $O((\log n)^{2d})$. This makes the algorithm a suitable choice as soon as $n \gg e^d$ which is likely to happen in a scenario with small dimensional and large-scale dataset; 2) For general kernels with low effective dimension, the basis functions are updated sequentially, adapting to the data, by sampling Nyström points. In this case, our algorithm improves the computational trade-off known for online kernel regression.", "full_text": "Efficient online learning with Kernels for adversarial large scale problems

Rémi Jézéquel, Pierre Gaillard, Alessandro Rudi
INRIA - Département d'Informatique de l'École Normale Supérieure
PSL Research University, Paris, France
{remi.jezequel,pierre.gaillard,alessandro.rudi}@inria.fr

Abstract

We are interested in a framework of online learning with kernels for low-dimensional, but large-scale and potentially adversarial datasets.
We study the computational and theoretical performance of online variations of kernel Ridge regression. Despite its simplicity, the algorithm we study is the first to achieve the optimal regret for a wide range of kernels with a per-round complexity of order n^α with α < 2.
The algorithm we consider is based on approximating the kernel with the linear span of basis functions. Our contributions are twofold: 1) For the Gaussian kernel, we propose to build the basis beforehand (independently of the data) through a Taylor expansion. For d-dimensional inputs, we provide a (close to) optimal regret of order O((log n)^{d+1}) with per-round time and space complexity O((log n)^{2d}). This makes the algorithm a suitable choice as soon as n ≫ e^d, which is likely to happen in scenarios with low-dimensional but large-scale datasets; 2) For general kernels with low effective dimension, the basis functions are updated sequentially, adapting to the data, by sampling Nyström points. In this case, our algorithm improves the computational trade-off known for online kernel regression.

1 Introduction

Nowadays the volume and the velocity of data flows are increasing rapidly. Consequently, many applications need to switch from batch to online procedures that can treat and adapt to data on the fly. Furthermore, to take advantage of very large datasets, non-parametric methods are gaining momentum in practice. Yet the latter often suffer from slow rates of convergence and poor computational complexity. At the same time, data is getting more complicated, and simple stochastic assumptions, such as i.i.d. data, are often not satisfied. In this paper, we address these two challenges, large-scale and arbitrary data, jointly.
We build a non-parametric online procedure based on kernels, which is efficient for large data sets and achieves close to optimal theoretical guarantees.
Online learning is a subfield of machine learning where a learner sequentially interacts with an environment and tries to learn and adapt on the fly to the observed data. We consider the following sequential setting. At each iteration t ≥ 1, the learner receives an input x_t ∈ X, makes a prediction ŷ_t ∈ R, and the environment reveals the output y_t ∈ R. The inputs x_t and the outputs y_t are sequentially chosen by the environment and can be arbitrary. The learner's goal is to minimize the cumulative regret

    R_n(f) := Σ_{t=1}^n (y_t − ŷ_t)² − Σ_{t=1}^n (y_t − f(x_t))²    (1)

uniformly over all functions f in a space of functions H. We will consider a Reproducing Kernel Hilbert Space (RKHS) H [see the next section or Aro50 for more details].

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

It is worth noting here that all the properties of a RKHS are controlled by the associated kernel function k : X × X → R, usually known in closed form, and that many function spaces of interest are (or are contained in) RKHS, e.g. when X ⊆ R^d: polynomials of arbitrary degree, band-limited functions, analytic functions with given decay at infinity, Sobolev spaces, and many others [BTA11].

Previous work. Kernel regression in a statistical setting has been widely studied by the statistics community. Our setting of online kernel regression with adversarial data is more recent. Most of the existing work focuses on the linear setting (i.e., the linear kernel). The first work on online linear regression dates back to [Fos91].
[BKM+15] provided the minimax rates (together with an algorithm) and we refer the reader to the references therein for a recent overview of the literature in the linear case. We only recall the work relevant for this paper. [AW01, Vov01] designed the nonlinear Ridge forecaster (denoted AWV). In linear regression (linear kernel), it achieves the optimal regret of order O(d log n) uniformly over all ℓ2-bounded vectors. The latter can be extended to kernels (see Definition (3)), which we refer to as Kernel-AWV. With regularization parameter λ > 0, it obtains a regret upper-bounded for all f ∈ H as

    R_n(f) ≲ λ‖f‖² + B² d_eff(λ),  where  d_eff(λ) := Tr(K_nn (K_nn + λ I_n)^{−1})    (2)

is the effective dimension and K_nn := (k(x_i, x_j))_{1≤i,j≤n} ∈ R^{n×n} denotes the kernel matrix at time n. The above upper-bound on the regret is essentially optimal (see Remark 2.1). Yet the per-round complexity and the space complexity of Kernel-AWV are O(n²). In this paper, we aim at reducing this complexity while keeping optimal regret guarantees.
Though the literature on online contextual learning is vast, little of it considers non-parametric function classes. Related work includes [Vov06], which considers the Exponentially Weighted Average forecaster, or [HM07], which considers bounded Lipschitz function sets and Lipschitz loss functions, while here we focus on the square loss. Minimax rates for general function sets H are provided by [RST13]. RKHS spaces were first considered in [Vov05], though they only obtain O(√n) rates, which are suboptimal for our problem.
More recently, a regret bound of the form (2) was proved by [ZK10] for a clipped version of kernel Ridge regression and by [CLV17b] for a clipped version of Kernel Online Newton Step (KONS) for general exp-concave loss functions.
The computational complexity (O(n²) per round) of these algorithms is however prohibitive for large datasets. [CLV17b] and [CLV17a] provide approximations of KONS to get manageable complexities. However, these come with deteriorated regret guarantees. [CLV17b] improves the time and space complexities by a factor γ ∈ (0, 1) at the price of enlarging the regret upper-bound by 1/γ. [CLV17a] designs an efficient approximation of KONS based on Nyström approximation [SSL00, WS01] and restarts, with per-round complexity O(m²), where m is the number of Nyström points. Yet their regret bound suffers an additional multiplicative factor m with respect to (2) because of the restarts. Furthermore, contrary to our results, the regret bounds of [CLV17b] and [CLV17a] hold not with respect to all functions in H but only with respect to functions f ∈ H such that f(x_t) ≤ C for all t ≥ 1, where C is a parameter of their algorithms. Since C comes as a multiplicative factor in their bounds, their results are sensitive to outliers, which may lead to large C. Other relevant approximation schemes for Online Kernel Learning are due to [LHW+16] and [ZL19]. These authors consider online gradient descent algorithms which they approximate using different approximation schemes (such as Nyström and random features).
However, since they use general Lipschitz loss functions and consider ℓ1-bounded dual norms of functions f, their regret bounds of order O(√n) are hardly comparable to ours and seem suboptimal in n in our more restrictive setting with square loss and kernels with small effective dimension (such as the Gaussian kernel).

Contributions and outline of the paper. The main contribution of the paper is to analyse a variant of Kernel-AWV that we call PKAWV (see Definition (4)). Despite its simplicity, it is to our knowledge the first algorithm for kernel online regression that recovers the optimal regret (see bound (2)) with an improved space and time complexity of order ≪ n² per round. Table 1 summarizes the regret rates and complexities obtained by our algorithm and the ones of [CLV17b, CLV17a].
Our procedure consists simply in applying Kernel-AWV while, at time t ≥ 1, approximating the RKHS H with a linear subspace H̃_t of smaller dimension. In Theorem 3, PKAWV suffers an additional approximation term with respect to the optimal bound of Kernel-AWV, which can be made small enough by properly choosing H̃_t. To achieve the optimal regret with low computational

    Gaussian kernel, d_eff(λ) ≤ (log(n/λ))^d:
    Algorithm                        | Regret               | Per-round complexity
    PKAWV                            | (log n)^{d+1}        | (log n)^{2d}
    Sketched-KONS [CLV17b] (c > 0)   | c (log n)^{d+1}      | (n/c)²
    Pros-N-KONS [CLV17a]             | (log n)^{2d+1}       | (log n)^{2d}

    General kernel, d_eff(λ) ≤ (n/λ)^γ with γ < √2 − 1:
    PKAWV                            | n^{γ/(γ+1)} log n    | n^{4γ/(1−γ²)}
    Sketched-KONS [CLV17b] (c > 0)   | c n^{γ/(γ+1)} log n  | (n/c)²
    Pros-N-KONS [CLV17a]             | n^{4γ/(1+γ)²} log n  | n^{4γ(1−γ)/(1+γ)²}

Table 1: Order in n of the best possible regret rates achievable by the algorithms and the corresponding per-round time complexity.
Up to log n, the rates obtained by PKAWV are optimal.

complexity, H̃_t needs to approximate H well and to be low-dimensional with an easy-to-compute projection. We provide two relevant constructions for H̃_t.
In Section 3.1, we focus on the Gaussian kernel, which we approximate by a finite set of basis functions. The functions are deterministic and chosen beforehand by the learner, independently of the data. The number of functions included in the basis is a parameter to be optimized that fixes an approximation-computation trade-off. Theorem 4 shows that PKAWV satisfies (up to log terms) the optimal regret bound (2) while enjoying a per-round space and time complexity of O(log^{2d}(n/λ)). For the Gaussian kernel, this corresponds to O(d_eff(λ)²), which is known to be optimal even in the statistical setting with i.i.d. data.
In Section 3.2, we consider data-adaptive approximation spaces H̃_t based on Nyström approximation. At time t ≥ 1, we approximate any kernel H by sampling a subset of the input vectors {x_1, . . . , x_t}. If the kernel satisfies the capacity condition d_eff(λ) ≤ (n/λ)^γ for γ ∈ (0, 1), the optimal regret is of order d_eff(λ) = O(n^{γ/(1+γ)}) for a well-tuned parameter λ. Our method then recovers the optimal regret with a computational complexity of O(d_eff(λ)^{4/(1−γ)}). The latter is o(n²) (for well-tuned λ) as soon as γ < √2 − 1. Furthermore, if the sequence of input vectors x_t is given beforehand to the algorithm, the per-round complexity needed to reach the optimal regret is improved to O(d_eff(λ)⁴) and our algorithm can achieve it for all γ ∈ (0, 1).
Finally, in Section 4 we perform several experiments based on real and simulated data to compare the performance (in regret and in time) of our methods with competitors.

Notations. We recall here basic notation that we will use throughout the paper. Given a vector v ∈ R^d, we write v = (v^{(1)}, . . . , v^{(d)}). We denote by N_0 = N ∪ {0} the set of non-negative integers and, for p ∈ N_0^d, |p| = p^{(1)} + ··· + p^{(d)}. By a slight abuse of notation, we denote by ‖·‖ both the Euclidean norm and the norm of the Hilbert space H. We write v^⊤w for the dot product between v, w ∈ R^D. The conjugate transpose of a linear operator Z on H is denoted Z*. The notation ≲ refers to approximate inequalities up to logarithmic multiplicative factors. Finally, we denote a ∨ b = max(a, b) and a ∧ b = min(a, b) for a, b ∈ R.

2 Background

Kernels. Let k : X × X → R be a positive definite kernel [Aro50] that we assume to be bounded (i.e., sup_{x∈X} k(x, x) ≤ κ² for some κ > 0). The function k is characterized by the existence of a feature map φ : X → R^D, with D ∈ N ∪ {∞}¹, such that k(x, x') = φ(x)^⊤φ(x'). Moreover, the reproducing kernel Hilbert space (RKHS) associated with k is characterized by H = {f | f(x) = w^⊤φ(x), w ∈ R^D, x ∈ X}, with inner product ⟨f, g⟩_H := v^⊤w for f, g ∈ H defined by f(x) = v^⊤φ(x), g(x) = w^⊤φ(x) and v, w ∈ R^D. For more details and different characterizations of k, H, see [Aro50, BTA11].
It\u2019s worth noting that the knowledge of \u03c6 is not necessary when working\ni=1 \u03b1i\u03c6(xi), with \u03b1i \u2208 R, xi \u2208 X and \ufb01nite p \u2208 N, indeed\ni=1 \u03b1ik(xi, x), and moreover (cid:107)f(cid:107)2H = \u03b1(cid:62)Kpp\u03b1, with Kpp the\n\nwith functions of the form f = (cid:80)p\ni=1 \u03b1i\u03c6(xi)(cid:62)\u03c6(x) =(cid:80)p\nf (x) =(cid:80)p\n\nkernel matrix associated to the set of points x1, . . . , xp.\n\n1when D = \u221e we consider RD as the space of squared summable sequences.\n\n3\n\n\fKernel-AWV. The Azoury-Warmuth-Vovk forecaster (denoted AWV) on the space of linear func-\ntions on X = Rd has been introduced and analyzed in [AW01, Vov01]. We consider here a\nstraightforward generalization to kernels (denoted Kernel-AWV) of the nonlinear Ridge forecaster\n(AWV) introduced by [AW01, Vov01] on the space of linear functions on X = Rd. At iteration t \u2265 1,\n\nKernel-AWV predicts(cid:98)yt = (cid:98)ft(xt), where\n(cid:40)t\u22121(cid:88)\n\n(cid:98)ft \u2208 argmin\n\nf\u2208H\n\ns=1\n\n(cid:0)ys \u2212 f (xs)(cid:1)2\n\n+ \u03bb(cid:13)(cid:13)f(cid:13)(cid:13)2\n\n(cid:41)\n\n+ f (xt)2\n\n.\n\n(3)\n\nA variant of this algorithm, more used in the context of data independently sampled from distribution,\nis known as kernel Ridge regression. It corresponds to solving the problem above, without the last\npenalization term f (xt)2.\nOptimal regret for Kernel-AWV. In the next proposition, we state a preliminary result which proves\nthat Kernel-AWV achieves a regret depending on the eigenvalues of the kernel matrix.\nProposition 1. Let \u03bb, B > 0. For any RKHS H, for all n \u2265 1, for all inputs x1, . . . , xn \u2208 X and all\ny1, . . . 
, yn \u2208 [\u2212B, B], the regret of Kernel-AWV is upper-bounded for all f \u2208 H as\n\nRn(f ) \u2264 \u03bb(cid:13)(cid:13)f(cid:13)(cid:13)2\n\nn(cid:88)\n\n(cid:18)\n\n+ B2\n\nlog\n\n1 +\n\n\u03bbk(Knn)\n\n\u03bb\n\n,\n\n(cid:19)\n\nwhere \u03bbk(Knn) denotes the k-th largest eigenvalue of Knn.\n\nk=1\n\nThe proof is a direct consequence of the known regret bound of AWV in the \ufb01nite dimensional linear\nregression setting\u2014see Theorem 11.8 of [CBL06] or Theorem 2 of [GGHS18]. For completeness,\nwe reproduce the analysis for in\ufb01nite dimensional space (RKHS) in Appendix C.1. In online linear\nregression in dimension d, the above result implies the optimal rate of convergence dB2 log(n)+O(1)\n(see [GGHS18] and [Vov01]). As shown by the following proposition, Proposition 1 yields optimal\nregret (up to log) of the form (2) for online kernel regression.\nProposition 2. For all n \u2265 1, \u03bb > 0 and all input sequences x1, . . . , xn \u2208 X ,\n\n(cid:18)\n\nn(cid:88)\n\nk=1\n\n(cid:19)\n\n(cid:16)\n\n(cid:17)\n\n(cid:0)\u03bb(cid:1) .\n\nlog\n\n1 +\n\n\u03bbk(Kn)\n\n\u03bb\n\n\u2264 log\n\ne +\n\nen\u03ba2\n\n\u03bb\n\ndeff\n\nCombined with Proposition 1, this entails that Kernel-AWV satis\ufb01es (up to the logarithmic factor)\nthe optimal regret bound (2). As discussed in the introduction, such an upper-bound on the regret is\nnot new and was already proved by [ZK10] or by [CLV17b] for other algorithms. An advantage of\nKernel-AWV is that it does not require any clipping and thus the beforehand knowledge of B > 0 to\nobtained Proposition 1. Furthermore, we slightly improve the constants in the above proposition.\nRemark 2.1 (Optimal regret under the capacity condition). Assuming the capacity condition (deff(\u03bb) \u2264\n(n/\u03bb)\u03b3 for 0 \u2264 \u03b3 \u2264 1), the rate of the regret bound (2) can be made explicit. As we show now,\nthis matches existing minimax lower rates in the stochastic setting. 
Under the capacity condition,\noptimizing \u03bb (cid:39) n\u03b3/(1+\u03b3) to minimize the r.h.s. of (2), the regret bound is then of order Rn(f ) \u2264\nO(n\u03b3/(1+\u03b3)) (up to logs). If the data (x1, y1), . . . , (xn, yn) is i.i.d. according to some distribution\n\u03c1 over X \u00d7 R, we can apply a standard online to batch conversion (see [CBCG04]). The estimator\n\u00affn = 1\nn\n\n(cid:80)n\nt=1 ft satis\ufb01es for any f \u2208 H the upper-bound on its excess risk\n\nn\n\n\u2264 O(n\n\n\u2212 1\n\nE( \u00affn) \u2212 E(f ) \u2264 E\n\n(cid:2)(f (X) \u2212 Y )2(cid:3). This corresponds to the known minimax lower rate in this\n\nwhere E(f ) := E(X,Y )\u223c\u03c1\nstochastic setting as shown by Theorem 2 (applied with c = 1 and b = 1/\u03b3) of [CDV07].\nIt is worth pointing out that in the worst case deff(\u03bb) \u2264 \u03ba2n/\u03bb for any bounded kernel. In particular,\n\u221a\noptimizing the bound yields \u03bb = O(\nn log n). In the\nspecial case of the Gaussian kernel (which we consider in Section 3.1), the latter can be improved\n\nto deff(\u03bb) (cid:46)(cid:0) log(n/\u03bb)(cid:1)d (see [ABRW18]) which entails Rn(f ) \u2264 O(cid:0)(log n)d+1(cid:1) for well tuned\n\nn log n) and a regret bound of order O(\n\n1+\u03b3 ) ,\n\n\u221a\n\n(cid:20) Rn(f )\n\n(cid:21)\n\nvalue of \u03bb.\n\n4\n\n\f3 Online Kernel Regression with projections\n\nIn the previous section we have seen that Kernel-AWV achieves optimal regret. Yet, it has computa-\ntional requirements that are O(n3) in time and O(n2) in space, for n steps of the algorithm, making it\nunfeasible in the context of large scale datasets, i.e. n (cid:29) 105. In this paper, we consider and analyze\na simple variation of Kernel-AWV denoted PKAWV. 
At time t ≥ 1, for a regularization parameter λ > 0 and a linear subspace H̃_t of H, the algorithm predicts ŷ_t = f̂_t(x_t), where

    f̂_t = argmin_{f ∈ H̃_t} { Σ_{s=1}^{t−1} (y_s − f(x_s))² + λ‖f‖² + f(x_t)² }.    (4)

In the next subsections, we exhibit relevant approximations H̃_t of H (typically the span of a small number of basis functions) that trade off good approximation against low computational cost. Appendix H details how (4) can be implemented efficiently in these cases.
The result below bounds the regret of PKAWV for any function f ∈ H and holds for any bounded kernel and any explicit subspace H̃ with associated projection P. The cost of approximating H by H̃ is measured by the key quantity µ := ‖(I − P)C_n^{1/2}‖², where C_n is the covariance operator.
Theorem 3. Let H̃ be a linear subspace of H and P the Euclidean projection onto H̃. When PKAWV is run with λ > 0 and fixed subspaces H̃_t = H̃, then for all f ∈ H

    R_n(f) ≤ λ‖f‖² + B² Σ_{j=1}^n log(1 + λ_j(K_nn)/λ) + (µ + λ) nµB²/λ²    (5)

for any sequence (x_1, y_1), . . . , (x_n, y_n) ∈ X × [−B, B], where µ := ‖(I − P)C_n^{1/2}‖² and C_n := Σ_{t=1}^n φ(x_t) ⊗ φ(x_t).
The proof of Thm. 3 is deferred to Appendix D.1 and is a consequence of the more general Thm. 9.

3.1 Learning with Taylor expansions and the Gaussian kernel for very large data sets

In this section we focus on non-parametric regression with the widely used Gaussian kernel, defined by k(x, x') = exp(−‖x − x'‖²/(2σ²)) for x, x' ∈ X and σ > 0, and the associated RKHS H. Using the results of the previous section with a fixed linear subspace H̃ spanned by O(polylog(n/λ)) basis functions, we prove that PKAWV achieves optimal regret. This leads to a computational complexity that is only O(n polylog(n/λ)) for optimal regret. We need a basis that (1) approximates the Gaussian kernel very well and (2) has an easy-to-compute projection. We consider the following basis of functions, for k ∈ N_0^d:

    g_k(x) = Π_{i=1}^d ψ_{k_i}(x^{(i)}),  where  ψ_t(x) = x^t / (σ^t √(t!)) e^{−x²/(2σ²)}.    (6)

For one-dimensional data, this corresponds to the Taylor expansion of the Gaussian kernel. Our theorem below states that PKAWV (see (4)) using, for all iterations t ≥ 1,

    H̃_t = Span(G_M)  with  G_M = {g_k | |k| ≤ M, k ∈ N_0^d},

where |k| := k^{(1)} + ··· + k^{(d)} for k ∈ N_0^d, gets optimal regret while enjoying low complexity. The size of the basis, controlled by M, trades off approximating the Gaussian kernel well (to incur low regret) against computational cost. Theorem 4 optimizes M so that the approximation term of Theorem 3 (due to the kernel approximation) is of the same order as the optimal regret.
Theorem 4. Let λ > 0, n ∈ N and R, B > 0. Assume that ‖x_t‖ ≤ R and |y_t| ≤ B. When M = ⌈8R²/σ² ∨ 2 log(n/(λ ∧ 1))⌉, then running PKAWV with G_M as set of functions achieves a regret bounded by

    ∀f ∈ H,  R_n(f) ≤ λ‖f‖² + (3B²/2) Σ_{j=1}^n log(1 + λ_j(K_nn)/λ).

Moreover, its per-iteration computational cost is O((3 + (1/d) log(n/(λ ∧ 1)))^{2d}) in space and time.

Therefore PKAWV achieves a regret bound deteriorated only by a multiplicative factor 3/2 with respect to the bound obtained by Kernel-AWV (see Prop. 1). From Prop. 2 this also yields (up to logs) the optimal bound (2). In particular, it is known [ABRW18] that for the Gaussian kernel

    d_eff(λ) ≤ 3 (41 R²/(2dσ²) + (3/d) log(6 + n/λ))^d = O((log(n/λ))^d).

The upper-bound is matched even in the i.i.d. setting for nontrivial distributions. In this case, we have |G_M| ≲ d_eff(λ). The per-round space and time complexities are thus O(d_eff(λ)²). Though our method is quite simple (it uses a fixed, explicit embedding), it recovers results, in terms of computational time and of bounds in the adversarial setting, similar to those obtained in the more restrictive i.i.d. setting via much more sophisticated methods, such as learning with (1) Nyström with importance sampling via leverage scores [RCR15], (2) reweighted random features [Bac17, RR17], or (3) volume sampling [DWH18]. By choosing λ = (B/‖f‖)², which minimizes the r.h.s. of the regret bound of the theorem, we get

    R_n(f) ≲ (log(n‖f‖²_H / B²))^{d+1} B² + B².    (7)

Note that the optimal λ does not depend on n and can be optimized in practice through standard online calibration methods.
For instance, one can run in parallel subroutines of the algorithm, each using a different value of λ in the finite grid Λ := {n2^k, k = −n^{1/(d+1)}, . . . , 0}. The subroutines can then be sequentially combined with an expert advice algorithm such as the Exponentially Weighted Average forecaster [CBL06], at an additional negligible cost of order O(B² log |Λ|) in the regret (using the fact that the squared loss is exp-concave on [0, B]). Similarly, though we use a fixed number of features M in the experiments, the latter could be increased slowly over time thanks to online calibration techniques.

3.2 Nyström projection

The previous subsection considered a deterministic function basis (independent of the data) to approximate a specific RKHS. Here, we analyse Nyström projections [RCR15], which are data-dependent and work for any RKHS. The method consists in sequentially updating a dictionary I_t ⊂ {x_1, . . . , x_t} and using

    H̃_t = Span{φ(x), x ∈ I_t}.    (8)

If the points included in I_t are well chosen, the latter may approximate well the solution of (3), which belongs to the linear span of {φ(x_1), . . . , φ(x_t)}. The inputs x_t might be included in the dictionary independently and uniformly at random. Here, we build the dictionary by following the KORS algorithm of [CLV17a], which is based on approximate leverage scores. At time t ≥ 1, it evaluates the importance of including x_t to obtain an accurate projection P_t by computing its leverage score. Then it decides whether to add x_t by drawing a Bernoulli random variable. Points are never dropped from the dictionary, so that I_1 ⊂ I_2 ⊂ ··· ⊂ I_n. With their notations, choosing ε = 1/2 and remarking that ‖Φ_t^⊤(I − P_t)Φ_t‖ = ‖(I − P_t)C_t^{1/2}‖², their Proposition 1 can be rewritten as follows.
Proposition 5 ([CLV17a, Prop. 1]). Let δ > 0, n ≥ 1, µ > 0. Then the sequence of dictionaries I_1 ⊂ I_2 ⊂ ··· ⊂ I_n learned by KORS with parameters µ and β = 12 log(n/δ) satisfies, with probability 1 − δ,

    ∀t ≥ 1,  ‖(I − P_t)C_t^{1/2}‖² ≤ µ  and  |I_t| ≤ 9 d_eff(µ) log(2n/δ)².

Furthermore, the algorithm runs in O(d_eff(µ)² log⁴(n)) space and O(d_eff(µ)²) time per iteration.

Using this approximation result together with Thm. 9 (a more general version of Thm. 3), we can bound the regret of PKAWV with KORS. The proof is postponed to Appendix E.1.
Theorem 6. Let n ≥ 1, δ > 0 and λ ≥ µ > 0. Assume that the dictionaries (I_t)_{t≥1} are built according to Proposition 5. Then, with probability at least 1 − δ, PKAWV with the subspaces H̃_t defined in (8) satisfies the regret upper-bound

    R_n ≤ λ‖f‖² + B² d_eff(λ) log(e + enκ²/λ) + 2B²(|I_n| + 1) nµ/λ,

and the algorithm runs in O(d_eff(µ)²) space and O(d_eff(µ)²) time per iteration.

[Figure 1: three panels plotting log R_n / log n against log m / log n for PKAWV, PKAWV (beforehand features), Sketched-KONS [CLV17b] and Pros-N-KONS [CLV17a].]

Figure 1: Comparison of the theoretical regret rate log R_n / log n according to the size of the dictionary log m / log n considered by PKAWV, Sketched-KONS and Pros-N-KONS for optimized parameters when d_eff(λ) ≤ (n/λ)^γ with γ = 0.2, √2 − 1, 0.6 (from left to right). The value γ/(1 + γ) corresponds to the optimal rate.

The last term of the regret upper-bound above corresponds to the approximation cost of using the projections (8) in PKAWV. This cost is controlled by the parameter µ > 0, which trades off a small approximation error (small µ) against a small dictionary of size |I_n| ≈ d_eff(µ) (large µ), and thus a small computational complexity. For the Gaussian kernel, using that d_eff(λ) ≤ O(log(n/λ)^d), the above theorem yields, for the choice λ = 1 and µ = n^{−2}, a regret bound of order R_n ≤ O((log n)^{d+1}) with a per-round time and space complexity of order O(|I_n|²) = O((log n)^{2d+4}). We recover a result similar to the one obtained in Section 3.1.

Explicit rates under the capacity condition. Assuming the capacity condition d_eff(λ') ≤ (n/λ')^γ for 0 ≤ γ ≤ 1 and all λ' > 0, which is a classical assumption on kernels [RCR15], the following corollary provides explicit rates for the regret according to the size of the dictionary m ≈ |I_n|.
Corollary 7. Let n ≥ 1 and m ≥ 1. Assume that d_eff(λ') ≤ (n/λ')^γ for all λ' > 0. Then, under the assumptions of Theorem 6, PKAWV with µ = nm^{−1/γ} has a dictionary of size |I_n| ≲ m and a regret upper-bounded with high probability as

    R_n ≲ n^{γ/(1+γ)}         if m ≥ n^{2γ/(1−γ²)},  for λ = n^{γ/(1+γ)};
    R_n ≲ n m^{1/2 − 1/(2γ)}  otherwise,             for λ = n m^{1/2 − 1/(2γ)}.

The per-round space and time complexity of the algorithm is O(m²).

The rate of order n^{γ/(1+γ)} is optimal in this case (it corresponds to optimizing (2) in λ).
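All of these rates are driven by the effective dimension, which is a simple spectral functional of the kernel matrix: d_eff(λ) = Σ_i λ_i(K_nn)/(λ_i(K_nn) + λ). A small numerical sketch (synthetic data of our own choosing, not from the paper's experiments) showing how it shrinks as λ grows:

```python
import numpy as np

def effective_dimension(K, lam):
    """d_eff(lam) = Tr(K (K + lam I)^{-1}), computed from the eigenvalues of K."""
    eigs = np.linalg.eigvalsh(K)
    return float(np.sum(eigs / (eigs + lam)))

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / 2.0)  # Gaussian kernel matrix with sigma = 1 (so kappa = 1)

for lam in (1e-3, 1e-1, 1e1):
    print(lam, effective_dimension(K, lam))  # decreasing in lam, far below n = 200
```

The fast spectral decay of the Gaussian kernel matrix is what makes d_eff(λ), and hence the dictionaries above, so small compared to n.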
If the dictionary is large enough, m ≥ n^{2γ/(1−γ²)}, the approximation term is negligible and the algorithm recovers the optimal rate. This is possible with a small dictionary m = o(n) whenever 2γ/(1 − γ²) < 1, which corresponds to γ < √2 − 1. The rates obtained in Corollary 7 can be compared to the ones obtained by Sketched-KONS of [CLV17b] and Pros-N-KONS of [CLV17a], which also provide a similar trade-off between the dictionary size m and the regret bound. The forms of the regret bounds in m, µ, λ of the three algorithms can be summarized as follows:

    R_n ≲ λ + d_eff(λ) + nmµ/λ      for PKAWV with KORS,
    R_n ≲ λ + (n/m) d_eff(λ)        for Sketched-KONS,
    R_n ≲ m(λ + d_eff(λ)) + nµ/λ    for Pros-N-KONS.    (9)

When d_eff(λ) ≤ (n/λ)^γ, optimizing these bounds in λ, PKAWV performs better than Sketched-KONS as soon as γ ≤ 1/2, and the latter cannot obtain the optimal rate λ + d_eff(λ) = n^{γ/(1+γ)} if m = o(n). Furthermore, because of the multiplicative factor m, Pros-N-KONS cannot reach the optimal rate either, even for m = n. Figure 1 plots the rate in n of the regret of these algorithms as the size m of the dictionary grows. We can see that for γ = 1/4, PKAWV is the only algorithm that achieves the optimal rate n^{γ/(1+γ)} with m = o(n) features. The rate of Pros-N-KONS cannot beat 4γ/(1 + γ)² and stops improving even when the size of the dictionary increases. This is because Pros-N-KONS is restarted whenever a point is added to the dictionary, which is too costly for large dictionaries. It is worth pointing out that these rates are for a well-tuned value of λ.
However, such an optimization can be performed at a small cost by running an expert-advice algorithm over a finite grid of λ.

[Figure 2: Average classification error and time on: (top) cod-rna (n = 2.7 × 10^5, d = 8); (bottom) SUSY (n = 6 × 10^6, d = 22).]

Beforehand known features. We may assume that the sequence of feature vectors x_t is given in advance to the learner, while only the outputs y_t are sequentially revealed (see [GGHS18] or [BKM+15] for details). In this case, the complete dictionary I_n ⊂ {x_1, . . . , x_n} may be computed beforehand and PKAWV can be used with the fixed subspace H̃ = Span(φ(x), x ∈ I_n). The regret upper-bound can then be improved to R_n ≲ λ + d_eff(λ) + n·µ/λ, removing a factor m in the last term (see (9)).

Corollary 8. Under the notation and assumptions of Corollary 7, PKAWV used with dictionary I_n and parameter µ = n·m^{−1/γ} achieves with high probability

    R_n ≲ n^{γ/(1+γ)}        if m ≥ n^{2γ/(1+γ)}   (for λ = n^{γ/(1+γ)}),
    R_n ≲ n·m^{−1/(2γ)}      otherwise             (for λ = n·m^{−1/(2γ)}).

Furthermore, w.h.p. the dictionary is of size |I_n| ≲ m, leading to a per-round space and time complexity O(m²).

The suboptimal rate due to a small dictionary is improved by a factor √m compared to the "sequentially revealed features" setting. Furthermore, since 2γ/(1 + γ) < 1 for all γ ∈ (0, 1), the algorithm is able to recover the optimal rate n^{γ/(1+γ)} for all γ ∈ (0, 1) with a dictionary of sub-linear size m ≪ n.
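The improvement of Corollary 8 over Corollary 7 can be summarized by the dictionary-size exponent needed for the optimal rate; a small sketch of ours (function names are illustrative, not from the paper):

```python
import math

def dict_exp_sequential(gamma: float) -> float:
    """Exponent a such that m = n^a suffices for the optimal rate
    when features are revealed sequentially (Corollary 7)."""
    return 2 * gamma / (1 - gamma ** 2)

def dict_exp_beforehand(gamma: float) -> float:
    """Same exponent when all features are known in advance (Corollary 8)."""
    return 2 * gamma / (1 + gamma)

# Beforehand-known features always allow a sub-linear dictionary ...
for g in (0.1, 0.25, 0.5, 0.9):
    assert dict_exp_beforehand(g) < 1
    assert dict_exp_beforehand(g) < dict_exp_sequential(g)

# ... whereas the sequential setting needs m = n^a with a >= 1
# as soon as gamma >= sqrt(2) - 1.
assert dict_exp_sequential(math.sqrt(2) - 1) > 1 - 1e-9
```

Since 1 − γ² = (1 + γ)(1 − γ) < 1 + γ, the beforehand exponent is always the smaller of the two, which is exactly the √m improvement discussed above.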
We leave for future work the question of whether there is really a gap between these two settings or whether this gap stems from a suboptimality of our analysis.

4 Experiments

We empirically test PKAWV against several state-of-the-art algorithms for online kernel regression. In particular, we test our algorithms (1) in an adversarial setting [see Appendix G] and (2) on large-scale datasets. The following algorithms have been tested:

• Kernel-AWV for the adversarial setting, or Kernel Ridge Regression for i.i.d. real-data settings;
• Pros-N-Kons [CLV17a];
• Fourier Online Gradient Descent (FOGD, [LHW+16]);
• PKAWV (or PKRR for real-data settings) with Taylor expansions (M ∈ {2, 3, 4});
• PKAWV (or PKRR for real-data settings) with Nyström.

The algorithms above have been implemented in Python with NumPy (the code for our algorithm is in Appendix H.2). The code necessary to reproduce the following experiments is available on GitHub at https://github.com/Remjez/kernel-online-learning. For most algorithms, we used the hyperparameters from the respective papers. For all algorithms and all experiments, we set σ = 1 [except for SUSY, where σ = 4 to match the accuracy results from [RCR17]] and λ = 1. When using KORS, we set µ = 1, β = 1 and ε = 0.5 as in [CLV17b]. The number of random features in FOGD is fixed to 1000 and the learning rate η is 1/√n. All experiments have been run on a single desktop computer (Intel Core i7-6700) with a timeout of 5 minutes per algorithm; the results of the algorithms are only recorded up to this time.

Large scale datasets. The algorithms are evaluated on four datasets from the UCI machine learning repository: casp (regression) and ijcnn1, cod-rna, SUSY (classification) [see Appendix G for casp and ijcnn1], ranging from 4 × 10^4 to 6 × 10^6 datapoints. For all datasets, we scaled x to [−1, 1]^d and y to [−1, 1]. In Figs.
2 and 4 we show the average loss (square loss for regression and classification error for classification) and the computational costs of the considered algorithms.

In all the experiments, PKAWV with M = 2 approximates reasonably well the performance of the kernel forecaster and is usually very fast. We remark that using PKAWV with M = 2 on the first million examples of SUSY, we achieve in 10 minutes on a single desktop the same average classification error obtained with specific large-scale methods for i.i.d. data [RCR17], although Kernel-AWV uses a number of features reduced by a factor 100 with respect to the one used for FALKON in the same paper. Indeed, they used r = 10^4 Nyström centers, while with M = 2 we used r = 190 features, validating empirically the effectiveness of the chosen features for the Gaussian kernel. This shows the effectiveness of the proposed approach for large-scale machine learning problems with a moderate dimension d.

References

[ABRW18] Jason Altschuler, Francis Bach, Alessandro Rudi, and Jonathan Weed. Massively scalable Sinkhorn distances via the Nyström method. arXiv preprint arXiv:1812.05189, 2018.

[Aro50] Nachman Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337–404, 1950.

[AW01] Katy S. Azoury and Manfred K. Warmuth. Relative loss bounds for on-line density estimation with the exponential family of distributions. Machine Learning, 43(3):211–246, 2001.

[Bac17] Francis Bach. On the equivalence between kernel quadrature rules and random feature expansions. Journal of Machine Learning Research, 18(21):1–38, 2017.

[BKM+15] Peter L. Bartlett, Wouter M. Koolen, Alan Malek, Eiji Takimoto, and Manfred K. Warmuth. Minimax fixed-design linear regression. JMLR: Workshop and Conference Proceedings, 40:1–14, 2015. Proceedings of COLT'2015.

[BTA11] Alain Berlinet and Christine Thomas-Agnan.
Reproducing kernel Hilbert spaces in probability and statistics. Springer Science & Business Media, 2011.

[CBCG04] Nicolò Cesa-Bianchi, Alex Conconi, and Claudio Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050–2057, 2004.

[CBL06] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

[CDV07] Andrea Caponnetto and Ernesto De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007.

[CKS11] Andrew Cotter, Joseph Keshet, and Nathan Srebro. Explicit approximations of the Gaussian kernel. arXiv preprint arXiv:1109.4603, 2011.

[CLV17a] Daniele Calandriello, Alessandro Lazaric, and Michal Valko. Efficient second-order online kernel learning with adaptive embedding. In Neural Information Processing Systems, 2017.

[CLV17b] Daniele Calandriello, Alessandro Lazaric, and Michal Valko. Second-order kernel online convex optimization with adaptive sketching. In International Conference on Machine Learning, 2017.

[DWH18] Michał Dereziński, Manfred K. Warmuth, and Daniel Hsu. Correcting the bias in least squares regression with volume-rescaled sampling. arXiv preprint arXiv:1810.02453, 2018.

[Fos91] Dean P. Foster. Prediction in the worst case. The Annals of Statistics, 19(2):1084–1090, 1991.

[GGHS18] Pierre Gaillard, Sébastien Gerchinovitz, Malo Huard, and Gilles Stoltz. Uniform regret bounds over R^d for the sequential linear regression problem with the square loss. arXiv preprint arXiv:1805.11386, 2018.

[HM07] Elad Hazan and Nimrod Megiddo. Online learning with prior knowledge. In International Conference on Computational Learning Theory, pages 499–513. Springer, 2007.

[LHW+16] Jing Lu, Steven C.H. Hoi, Jialei Wang, Peilin Zhao, and Zhi-Yong Liu.
Large scale online kernel learning. The Journal of Machine Learning Research, 17(1):1613–1655, 2016.

[OLBC10] Frank W.J. Olver, Daniel W. Lozier, Ronald F. Boisvert, and Charles W. Clark. NIST Handbook of Mathematical Functions. Cambridge University Press, 2010.

[RCR15] Alessandro Rudi, Raffaello Camoriano, and Lorenzo Rosasco. Less is more: Nyström computational regularization. In Advances in Neural Information Processing Systems, pages 1657–1665, 2015.

[RCR17] Alessandro Rudi, Luigi Carratino, and Lorenzo Rosasco. FALKON: An optimal large scale kernel method. In Advances in Neural Information Processing Systems, pages 3888–3898, 2017.

[RR17] Alessandro Rudi and Lorenzo Rosasco. Generalization properties of learning with random features. In Advances in Neural Information Processing Systems, pages 3215–3225, 2017.

[RST13] A. Rakhlin, K. Sridharan, and A.B. Tsybakov. Empirical entropy, minimax regret and minimax risk. Bernoulli, 2013. To appear.

[SHS06] Ingo Steinwart, Don Hush, and Clint Scovel. An explicit description of the reproducing kernel Hilbert spaces of Gaussian RBF kernels. IEEE Transactions on Information Theory, 52(10):4635–4643, 2006.

[SSL00] A.J. Smola, B. Schölkopf, and P. Langley. Sparse greedy matrix approximation for machine learning. In 17th International Conference on Machine Learning, Stanford, 2000, pages 911–911, 2000.

[Vov01] Vladimir Vovk. Competitive on-line statistics. International Statistical Review, 69(2):213–248, 2001.

[Vov05] V. Vovk. On-line regression competitive with reproducing kernel Hilbert spaces. arXiv, 2005.

[Vov06] V. Vovk. Metric entropy in competitive on-line prediction. arXiv, 2006.

[WS01] Christopher K.I. Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems, pages 682–688, 2001.

[ZK10] Fedor Zhdanov and Yuri Kalnishkan.
An identity for kernel ridge regression. In International Conference on Algorithmic Learning Theory, pages 405–419. Springer, 2010.

[ZL19] Xiao Zhang and Shizhong Liao. Incremental randomized sketching for online kernel learning. In International Conference on Machine Learning, pages 7394–7403, 2019.