{"title": "Efficient Second-Order Online Kernel Learning with Adaptive Embedding", "book": "Advances in Neural Information Processing Systems", "page_first": 6140, "page_last": 6150, "abstract": "Online kernel learning (OKL) is a flexible framework to approach prediction problems, since the large approximation space provided by reproducing kernel Hilbert spaces can contain an accurate function for the problem. Nonetheless, optimizing over this space is computationally expensive. Not only first order methods accumulate $\\O(\\sqrt{T})$ more loss than the optimal function, but the curse of kernelization results in a $\\O(t)$ per step complexity. Second-order methods get closer to the optimum much faster, suffering only $\\O(\\log(T))$ regret, but second-order updates are even more expensive, with a $\\O(t^2)$ per-step cost. Existing approximate OKL methods try to reduce this complexity either by limiting the Support Vectors (SV) introduced in the predictor, or by avoiding the kernelization process altogether using embedding. Nonetheless, as long as the size of the approximation space or the number of SV does not grow over time, an adversary can always exploit the approximation process. In this paper, we propose PROS-N-KONS, a method that combines Nystrom sketching to project the input point in a small, accurate embedded space, and performs efficient second-order updates in this space. The embedded space is continuously updated to guarantee that the embedding remains accurate, and we show that the per-step cost only grows with the effective dimension of the problem and not with $T$. Moreover, the second-order updated allows us to achieve the logarithmic regret. 
We empirically compare our algorithm on recent large-scale benchmarks and show that it performs favorably.", "full_text": "Efficient Second-Order Online Kernel Learning with Adaptive Embedding

Daniele Calandriello

Alessandro Lazaric

Michal Valko

SequeL team, INRIA Lille - Nord Europe, France

{daniele.calandriello, alessandro.lazaric, michal.valko}@inria.fr

Abstract

Online kernel learning (OKL) is a flexible framework for prediction problems, since the large approximation space provided by reproducing kernel Hilbert spaces often contains an accurate function for the problem. Nonetheless, optimizing over this space is computationally expensive. Not only do first-order methods accumulate O(√T) more loss than the optimal function, but the curse of kernelization also results in a O(t) per-step complexity. Second-order methods get closer to the optimum much faster, suffering only O(log T) regret, but second-order updates are even more expensive, with their O(t²) per-step cost. Existing approximate OKL methods reduce this complexity either by limiting the support vectors (SV) used by the predictor, or by avoiding the kernelization process altogether using embedding. Nonetheless, as long as the size of the approximation space or the number of SV does not grow over time, an adversarial environment can always exploit the approximation process. In this paper, we propose PROS-N-KONS, a method that combines Nyström sketching, to project the input point into a small and accurate embedded space, with efficient second-order updates in this space. The embedded space is continuously updated to guarantee that the embedding remains accurate. We show that the per-step cost only grows with the effective dimension of the problem and not with T. Moreover, the second-order updates allow us to achieve logarithmic regret. 
We empirically compare our algorithm on recent large-scale benchmarks and show that it performs favorably.

1 Introduction

Online learning (OL) represents a family of efficient and scalable learning algorithms for building a predictive model incrementally from a sequence of T data points. A popular online learning approach [26] is to learn a linear predictor using gradient descent (GD) in the input space R^d. Since we can explicitly store and update the d weights of the linear predictor, the total runtime of this algorithm is O(Td), allowing it to scale to large problems. Unfortunately, it is sometimes the case that no good predictor can be constructed from a linear combination of the input features alone. For this reason, online kernel learning (OKL) [10] first maps the points into a high-dimensional reproducing kernel Hilbert space (RKHS) using a non-linear feature map φ, and then runs GD on the projected points, which is often referred to as functional GD (FGD) [10]. With the kernel approach, each gradient step does not update a fixed set of weights, but instead introduces the feature-mapped point into the predictor as a support vector (SV). The resulting kernel-based predictor is flexible and data-adaptive, but the number of parameters, and therefore the per-step space and time cost, now scales as O(t), the number of SVs included after t steps of GD. 
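The O(t) per-step growth is easy to see in code. The following minimal sketch is our own illustration, not part of the paper; the Gaussian kernel, step size, and data stream are arbitrary choices:

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    # K(x, y) = exp(-||x - y||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

class FunctionalGD:
    """Online kernel GD: the predictor is f(x) = sum_s a_s K(x_s, x)."""
    def __init__(self, eta=0.1):
        self.eta = eta
        self.svs, self.coefs = [], []   # support vectors and their weights

    def predict(self, x):
        # O(t) work: one kernel evaluation per support vector stored so far
        return sum(a * gaussian_kernel(xs, x)
                   for a, xs in zip(self.coefs, self.svs))

    def step(self, x, y):
        # a squared-loss gradient step adds x itself as a new support vector
        g = 2.0 * (self.predict(x) - y)
        self.svs.append(x)
        self.coefs.append(-self.eta * g)

learner = FunctionalGD()
for t in range(50):
    learner.step(np.array([np.sin(t), np.cos(t)]), float(t % 2))

# after t steps the predictor stores t SVs: O(t) per step, O(T^2) in total
assert len(learner.svs) == 50
```

Every step both reads all stored SVs and appends a new one, which is exactly the growth the paper sets out to avoid.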
This curse of kernelization results in an O(T²) total runtime, and prevents standard OKL methods from scaling to large problems. Given an RKHS H containing functions with very small prediction loss, the objective of an OL algorithm is to approach over time the performance of the best predictor in H, and thus to minimize the regret, that is, the difference in cumulative loss between the OL algorithm and the best predictor in hindsight.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

First-order GD achieves a O(√T) regret for any arbitrary sequence of convex losses [10]. However, if we know that the losses are strongly convex, setting a more aggressive step-size in first-order GD achieves a smaller O(log T) regret [25]. Unfortunately, most common losses, such as the squared loss, are not strongly convex when evaluated at a single point x_t. Nonetheless, they possess a certain directional curvature [8] that can be exploited by second-order GD methods, such as the kernelized online Newton step (KONS) [2] and kernel recursive least squares (KRLS) [24], to achieve the O(log T) regret without strong convexity along all directions. The drawback of second-order methods is that they have to store and invert the t × t covariance matrix between all SVs included in the predictor. This requires O(t²) space and time per step, dwarfing the O(t) cost of first-order methods and resulting in an even more infeasible O(T³) runtime.

Contributions In this paper, we introduce PROS-N-KONS, a new OKL method that (1) achieves logarithmic regret for losses with directional curvature using second-order updates, and (2) avoids the curse of kernelization, paying only a fixed per-step time and space cost. To achieve this, we start from KONS, a low-regret exact second-order OKL method proposed in [2], but replace the exact feature map φ with an approximate φ̃ constructed using a Nyström dictionary approximation. For a dictionary of size j, this non-linearly embeds the points in R^j, where we can efficiently perform exact second-order updates in constant O(j²) per-step time and achieve the desired O(log T) regret. Combined with an online dictionary-learning algorithm (KORS [2]) and an adaptive restart strategy, we show that we never get stuck performing GD in an embedded space that is too distant from the true H, but at the same time the size of the embedding j never grows larger than the effective dimension of the problem. While previous methods [13, 11] used fixed embeddings, we adaptively construct a small dictionary that scales only with the effective dimension of the data. We then construct an accurate approximation of the covariance matrix, using carefully designed projections to avoid the variance due to dictionary changes.

Related work Although first-order OKL methods cannot achieve logarithmic regret, many approximation methods have been proposed to make them scale to large datasets. Approximate methods usually take one of two approaches: either performing approximate gradient updates in the true RKHS (budgeted perceptron [4], projectron [15], forgetron [6]), preventing SVs from entering the predictor, or performing exact gradient updates in an approximate RKHS (Nyström [13], random-feature expansion [11]), where the points are embedded in a finite-dimensional space and the curse of kernelization does not apply. Overall, the goal is to never exceed a budget of SVs in order to maintain a fixed per-step update cost. Among budgeted methods, weight degradation [17] can be done in many different ways, such as removal [6] or more expensive projection [15] and merging. 
Nonetheless, as long as the size of the budget is fixed, the adversary can exploit this to increase the regret of the algorithm, and oblivious inclusion strategies such as uniform sampling [9] fail. Another approach is to replace the exact feature map φ with an approximate feature map φ̃, which allows us to explicitly represent the mapped points and run linear OL on this embedding [13, 21]. When the embedding is oblivious to the data, the method is known as random-feature expansion, while a common data-dependent embedding is the Nyström method [19]. Again, if the embedding is fixed or limited in size, the adversary can exploit it. In addition, analyzing a change of embedding during the gradient descent is an open problem, since the underlying RKHS changes with it.

The only approximate second-order method known to achieve logarithmic regret is SKETCHED-KONS. Both SKETCHED-KONS and PROS-N-KONS are based on the exact second-order OL method ONS [8] or its kernelized version KONS [2]. However, SKETCHED-KONS only applies budgeting techniques to the Hessian of the second-order updates and not to the predictor itself, resulting in a O(t) per-step evaluation time cost. Moreover, the Hessian sketching is performed only through SV removal, resulting in high instability. In this paper, we solve these two issues with PROS-N-KONS by directly approximating KONS using Nyström functional approximation. This results in updates that are closer to SV projection than to removal, and that budget both the representation of the Hessian and the predictor.

2 Background

Notation We borrow the notation from [14] and [2]. We use upper-case bold letters A for matrices, lower-case bold letters a for vectors, and lower-case letters a for scalars. We denote by [A]_ij and [a]_i the (i, j) element of a matrix and the i-th element of a vector, respectively. 
We denote by I_T ∈ R^{T×T} the identity matrix of dimension T and by Diag(a) ∈ R^{T×T} the diagonal matrix with the vector a ∈ R^T on the diagonal. We use e_{T,i} ∈ R^T to denote the indicator vector of dimension T for element i. When the dimension of I and e_i is clear from the context, we omit the T, and we also indicate the identity operator by I. We use A ⪰ B to indicate that A − B is a positive semi-definite (PSD) matrix. Finally, the set of integers between 1 and T is denoted by [T] := {1, ..., T}.

Kernels Given an input space X and a kernel function K(·,·) : X × X → R, we denote by H the reproducing kernel Hilbert space (RKHS) induced by K, and by φ(·) : X → H the associated feature map. Using the feature map, the kernel function can be represented as K(x, x′) = ⟨φ(x), φ(x′)⟩_H, but with a slight abuse of notation we use the simplified notation K(x, x′) = φ(x)^T φ(x′) in the following. Any function f ∈ H can be represented as a (potentially infinite) set of weights w such that f_w(x) = φ(x)^T w. Given a set of t points D_t = {x_s}_{s=1}^t, we denote by Φ_t ∈ R^{∞×t} the feature matrix with φ_s as its s-th column.

Online kernel learning (OKL) We consider online kernel learning, where an adversary chooses an arbitrary sequence of points {x_t}_{t=1}^T and convex differentiable losses {ℓ_t}_{t=1}^T. The learning protocol is the following. At each round t ∈ [T]: (1) the adversary reveals the new point x_t; (2) the learner chooses a function f_{w_t} and predicts f_{w_t}(x_t) = φ(x_t)^T w_t; (3) the adversary reveals the loss ℓ_t; and (4) the learner suffers ℓ_t(φ(x_t)^T w_t) and observes the associated gradient g_t. We are interested in bounding the cumulative regret between the learner and a fixed function w, defined as R_T(w) = Σ_{t=1}^T ℓ_t(φ_t^T w_t) − ℓ_t(φ_t^T w). Since H is potentially a very large space, we need to restrict the class of comparators w. As in [14], we consider all functions that guarantee bounded predictions, i.e., S = {w : ∀t ∈ [T], |φ_t^T w| ≤ C}. We make the following assumptions on the losses.

Assumption 1 (Scalar Lipschitz). The loss functions ℓ_t satisfy |ℓ′_t(z)| ≤ L whenever |z| ≤ C.

Assumption 2 (Curvature). There exists σ_t ≥ σ > 0 such that for all u, w ∈ S and for all t ∈ [T],

ℓ_t(φ_t^T w) := l_t(w) ≥ l_t(u) + ∇l_t(u)^T (w − u) + (σ_t/2) (∇l_t(u)^T (w − u))².

This assumption is weaker than strong convexity, as it only requires the losses to be strongly convex in the direction of the gradient. It is satisfied by the squared loss, the squared hinge loss, and, in general, all exp-concave losses [8]. Under this weaker requirement, second-order learning methods [8, 2] obtain the O(log T) regret at the cost of a higher computational complexity w.r.t. first-order methods.

Nyström approximation A common approach to alleviate the computational cost is to replace the high-dimensional feature map φ with a finite-dimensional approximate feature map φ̃. Let I = {x_i}_{i=1}^j be a dictionary of j points from the dataset and Φ_I be the associated feature matrix with the φ(x_i) as columns. We define the embedding φ̃(x) := Σ^{-1} U^T Φ_I^T φ(x) ∈ R^j, where Φ_I = VΣU^T is the singular value decomposition of the feature matrix. While in general Φ_I is infinite-dimensional and cannot be directly decomposed, we exploit the fact that UΣV^T VΣU^T = Φ_I^T Φ_I = K_I = UΛU^T and that K_I is a (finite-dimensional) PSD matrix. 
Therefore it is sufficient to compute the eigenvectors U and eigenvalues Λ of K_I and take the square root Λ^{1/2} = Σ. Note that with this definition we are effectively replacing the kernel K and H with an approximate K_I and H_I, such that K_I(x, x′) = φ̃(x)^T φ̃(x′) = φ(x)^T Φ_I U Σ^{-1} Σ^{-1} U^T Φ_I^T φ(x′) = φ(x)^T P_I φ(x′), where P_I = Φ_I (Φ_I^T Φ_I)^{-1} Φ_I^T is the projection matrix on the column span of Φ_I. Since φ̃ returns vectors in R^j, this transformation effectively reduces the computational complexity of kernel operations from t down to the size of the dictionary j. The accuracy of φ̃ is directly related to the accuracy of the projection P_I in approximating the projection P_t = Φ_t (Φ_t^T Φ_t)^{-1} Φ_t^T, so that for all s, s′ ∈ [t], φ̃(x_s)^T φ̃(x_{s′}) is close to φ(x_s)^T P_t φ(x_{s′}) = φ(x_s)^T φ(x_{s′}).

Ridge leverage scores All that is left is to find an efficient algorithm to choose a good dictionary I that minimizes the error P_I − P_t. Among dictionary-selection methods, we focus on those that sample points proportionally to their ridge leverage scores (RLS) [1], because they provide strong reconstruction guarantees. We now define the RLS and the associated effective dimension.

Definition 1. Given a kernel function K, a set of points D_t = {x_s}_{s=1}^t, and a parameter γ > 0, the γ-ridge leverage score (RLS) of point i is defined as

τ_{t,i} = e_{t,i}^T K_t (K_t + γI_t)^{-1} e_{t,i} = φ_i^T (Φ_t Φ_t^T + γI)^{-1} φ_i,   (1)

and the effective dimension of D_t as the sum of the RLS over all the examples of D_t,

d_eff^t(γ) = Σ_{i=1}^t τ_{t,i} = Tr(K_t (K_t + γI_t)^{-1}).   (2)

The RLS of a point measures how orthogonal φ_i is w.r.t. the other points in Φ_t, and therefore how important it is to include it in I to obtain an accurate projection P_I. The effective dimension captures the capacity of the RKHS H over the support vectors in D_t. Let {λ_i}_i be the eigenvalues of K_t. Since d_eff^t(γ) = Σ_{i=1}^t λ_i/(λ_i + γ), the effective dimension can be seen as the soft rank of K_t, where only eigenvalues above γ are counted.

To estimate the RLS and construct an accurate I, we leverage KORS [2] (see Alg. 1 in App. A), which extends the online row sampling of Cohen et al. [5] to kernels. Starting from an empty dictionary, at each round KORS receives a new point x_t, temporarily adds it to the current dictionary I_t, and estimates its associated RLS τ̃_t. Then it draws a Bernoulli r.v. proportionally to τ̃_t. If the outcome is one, the point is deemed relevant and added to the dictionary; otherwise it is discarded and never added. Note that since points only get evaluated once, and are never dropped, the size of the dictionary grows over time and the RKHS H_{I_t} is included in the RKHS H_{I_{t+1}}, guaranteeing stability in the RKHS evolution, unlike alternative methods (e.g., [3]) that construct smaller but often changing dictionaries. 
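For reference, the batch counterparts of these quantities are direct to compute. The sketch below is our own illustration (a Gaussian kernel on random data; KORS itself replaces the exact inverse with cheaper dictionary-based estimates): it computes the RLS of Eq. 1, the effective dimension of Eq. 2, and the Nyström embedding φ̃:

```python
import numpy as np

def ridge_leverage_scores(K, gamma):
    # Eq. 1 (batch form): tau_i = [K (K + gamma I)^{-1}]_{ii}
    t = K.shape[0]
    return np.diag(K @ np.linalg.inv(K + gamma * np.eye(t)))

def effective_dimension(K, gamma):
    # Eq. 2: d_eff(gamma) = Tr(K (K + gamma I)^{-1}) = sum of the RLS
    return ridge_leverage_scores(K, gamma).sum()

def nystrom_embed(K_ID, K_II):
    # phi_tilde(x) = Sigma^{-1} U^T Phi_I^T phi(x), computed from the
    # dictionary kernel K_II (j x j) and the dictionary-vs-data kernel
    # K_ID (j x t); columns of the output are the embedded points in R^j.
    lam, U = np.linalg.eigh(K_II)
    lam = np.maximum(lam, 1e-12)            # numerical floor for Sigma^{-1}
    return (U / np.sqrt(lam)).T @ K_ID      # = Lambda^{-1/2} U^T K_ID

rng = np.random.default_rng(0)
X = rng.standard_normal((25, 3))
sq = np.sum((X[:, None] - X[None, :]) ** 2, axis=2)
K = np.exp(-sq)                             # Gaussian kernel matrix

gamma = 0.1
tau = ridge_leverage_scores(K, gamma)
d_eff = effective_dimension(K, gamma)

# taking the whole dataset as dictionary gives P_I = P_t, so the embedded
# inner products reproduce K exactly
Z = nystrom_embed(K, K)
assert np.all((tau > 0) & (tau < 1)) and 0 < d_eff < 25
assert np.allclose(Z.T @ Z, K, atol=1e-6)
```

With a strict subset of the data as dictionary, Z.T @ Z instead reproduces the projected kernel φ(x)^T P_I φ(x′), whose distance from K is what Prop. 1 controls.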
We restate the quality of the learned dictionaries and the complexity of the algorithm, which we use as a building block.

Proposition 1 ([2, Thm. 2]). Given parameters 0 < ε ≤ 1, γ > 0, and 0 < δ < 1, if β ≥ 3 log(T/δ)/ε², then the dictionary learned by KORS is such that w.p. 1 − δ,

(1) for all rounds t ∈ [T], we have 0 ⪯ Φ_t^T (P_t − P_{I_t}) Φ_t ⪯ (ε/(1−ε)) γI, and
(2) the maximum size of the dictionary J is bounded by ((1+ε)/(1−ε)) 3β d_eff^T(γ) log(2T/δ).

The algorithm runs in O(d_eff^T(α)² log⁴(T)) space and Õ(d_eff^T(α)³) time per iteration.

3 The PROS-N-KONS algorithm

We first use a toy OKL example from [2] to illustrate the main challenges for FGD in achieving both computational efficiency and optimal regret guarantees. We then propose a different approach, which naturally leads to the definition of PROS-N-KONS.

Consider the case of binary classification with the squared loss, where the point presented by the adversary is always the same point x_exp, but each round with an opposite {1, −1} label. Note that the difficulty in this problem arises from the adversarial nature of the labels and is not due to the dataset itself. The cumulative loss of the comparator w becomes (φ(x_exp)^T w − 1)² + (φ(x_exp)^T w + 1)² + ... for T steps. Our goal is to achieve O(log T) regret w.r.t. the best solution in hindsight, which is easily achieved by always predicting 0. Intuitively, an algorithm will do well when the gradient-step magnitude shrinks as 1/t. Note that these losses are not strongly convex, thus exact first-order FGD only achieves O(√T) regret and does not guarantee our goal. Exact second-order methods (e.g., KONS) achieve the O(log T) regret, but also store T copies of the SV, and have a O(T⁴) runtime. 
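This failure mode can be simulated in a few lines. The sketch below is our own illustration with a one-dimensional linear parametrization (so the predictor at x_exp is a single scalar w): a constant step-size keeps oscillating, while a step-size shrinking as 1/t averages the ±1 labels and approaches the optimal prediction 0:

```python
def run(T, step_size):
    # squared loss l_t(w) = (w - y_t)^2 on the same point, alternating labels
    w = 0.0
    for t in range(1, T + 1):
        y = 1.0 if t % 2 == 1 else -1.0
        w -= step_size(t) * 2.0 * (w - y)   # gradient step
    return w

w_fixed = run(1000, lambda t: 0.25)         # constant step: cycles near +-1/3
w_decay = run(1000, lambda t: 0.5 / t)      # 1/t step: running label average

assert abs(w_fixed) > 0.2                   # stuck oscillating away from 0
assert abs(w_decay) < 0.01                  # close to the optimal prediction 0
```

With the 1/t schedule, w_t is exactly the running mean of the labels seen so far, which is the mechanism an adversary disrupts when SV insertions (and hence step-size reductions) are skipped.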
If we try to improve the runtime using approximate updates and a fixed budget of SVs, we lose the O(log T) regime, since skipping the insertion of a SV also slows down the reduction of the step-size, both for first-order and second-order methods. If instead we try to compensate for the scarcity of SV additions due to the budget with larger updates to the step-size, the adversary can exploit such an unstable algorithm, as shown in [2], where avoiding an unstable solution forces the algorithm to introduce SVs with a constant probability. Finally, note that this example can be easily generalized to any algorithm that stores a fixed budget of SVs, replacing the single x_exp with a set of repeating vectors that exceeds the budget. This also defeats oblivious embedding techniques, such as random-feature expansion with a fixed amount of random features or a fixed dictionary, and simple strategies that update the SV dictionary by insertion and removal.

If we relax the fixed-budget requirement, selection algorithms such as KORS can find an appropriate budget size for the SV dictionary. Indeed, this single-sample problem is intrinsically simple: its effective dimension d_eff^T(α) ≈ 1 is small, and its induced RKHS H = φ(x_exp) is a singleton. Therefore, following an adaptive embedding approach, we can reduce it to a one-dimensional parametric problem and solve it efficiently in this space using exact ONS updates. Alternatively, we can see this approach as constructing an approximate feature map φ̃ that after one step will exactly coincide with the exact feature map φ, but allows us to run exact KONS updates efficiently, replacing K with K̃. Building on this intuition, we propose PROS-N-KONS, a new second-order FGD method that continuously searches for the best embedding space H_{I_t} and, at the same time, exploits the small embedding space H_{I_t} to efficiently perform exact second-order updates.

We start from an empty dictionary I_0 and a null predictor w_0 = 0. At each round, PROS-N-KONS (Algorithm 1) receives a new point x_t and invokes KORS to decide whether it should be included in the current dictionary or not. Let t_j with j ∈ [J] be the random step at which KORS introduces x_{t_j} into the dictionary. We analyze PROS-N-KONS as an epoch-based algorithm using these milestones t_j. Note that the length h_j = t_{j+1} − t_j and the total number of epochs J are random, and are decided in a data-adaptive way by KORS based on the difficulty of the problem. During epoch j, we have a fixed dictionary I_j that induces a feature matrix Φ_{I_j} containing the samples φ_i ∈ I_j, an embedding φ̃(x) : X → R^j = Σ_j^{-1} U_j^T Φ_j^T φ(x) based on the singular values Σ_j and singular vectors U_j of Φ_j, with its associated approximate kernel function K̃ and induced RKHS H_j. At each round t_j < t < t_{j+1}, we perform an exact KONS update using the approximate map φ̃. This can be computed explicitly, since φ̃_t is in R^j and can be easily stored in memory. The update rules are

Ã_t = Ã_{t-1} + (σ_t/2) g̃_t g̃_t^T,   υ̃_t = ω̃_{t-1} − Ã_{t-1}^{-1} g̃_{t-1},   ω̃_t = Π_{S_t}^{Ã_{t-1}}(υ̃_t) = υ̃_t − (h(φ̃_t^T υ̃_t)/(φ̃_t^T Ã_{t-1}^{-1} φ̃_t)) Ã_{t-1}^{-1} φ̃_t,

where h(z) = sign(z) max{|z| − C, 0} and the oblique projection Π_{S_t}^{Ã_{t-1}} is computed using the closed-form solution from [14]. The complete procedure is summarized by the following pseudocode.

Input: feasible parameter C, step-sizes η_t, regularizer α
1: Initialize j = 0, ω̃_0 = 0, g̃_0 = 0, P̃_0 = 0, Ã_0 = αI
2: Start a KORS instance with an empty dictionary I_0
3: for t = {1, ..., T} do
4:   Receive x_t, feed it to KORS
5:   Receive z_t (point added to the dictionary or not)
6:   if z_{t-1} = 1 then {Dictionary changed, reset.}
7:     j = j + 1
8:     Build K_j from I_j and decompose it as U_j Σ_j Σ_j^T U_j^T
9:     Set Ã_{t-1} = αI ∈ R^{j×j}
10:    ω̃_t = 0 ∈ R^j
11:  else {Execute a gradient-descent step.}
12:    Compute the map φ_t and the approximate map φ̃_t = Σ_j^{-1} U_j^T Φ_j^T φ_t ∈ R^j
13:    Compute υ̃_t = ω̃_{t-1} − Ã_{t-1}^{-1} g̃_{t-1}
14:    Compute ω̃_t = υ̃_t − (h(φ̃_t^T υ̃_t)/(φ̃_t^T Ã_{t-1}^{-1} φ̃_t)) Ã_{t-1}^{-1} φ̃_t, where h(z) = sign(z) max{|z| − C, 0}
15:    Predict ỹ_t = φ̃_t^T ω̃_t
16:    Observe g̃_t = ∇_{ω̃_t} ℓ_t(φ̃_t^T ω̃_t) = ℓ′_t(ỹ_t) φ̃_t
17:    Update Ã_t = Ã_{t-1} + (σ_t/2) g̃_t g̃_t^T
18:  end if
19: end for

When t = t_j and a new epoch begins, we perform a reset step before taking the first gradient step in the new embedded space: we update the feature map φ̃, but we reset Ã_{t_j} and ω̃_{t_j} to zero. While this may seem a poor choice, as information learned over time is lost, it leaves the dictionary intact. As long as (a) the dictionary, and therefore the embedded space where we perform our GD, keeps improving, and (b) we do not needlessly reset too often, we can count on the fast second-order updates to quickly catch up to the best function in the current H_j. The motivating reason to reset the descent procedure when we switch subspace is to guarantee that our starting point in the descent cannot be influenced by the adversary, which in turn allows us to bound the regret of the overall process (Sect. 4).

Computational complexity PROS-N-KONS's computational complexity is dominated by the Ã_{t-1}^{-1} inversion required to compute the projection and the gradient update, and by the query to KORS, which internally also inverts a j × j matrix. Therefore, a naïve implementation requires O(j³) per-step time and O(j²) space to store Ã_t. Taking advantage of the fact that KORS only adds SVs to the dictionary and never removes them, and that, similarly, the Ã_t matrix is constructed using rank-one updates, a careful implementation reduces the per-step cost to O(j²). Overall, the total runtime of PROS-N-KONS is then O(TJ²), which, using the bound on J provided by Prop. 1 and neglecting logarithmic terms, reduces to Õ(T d_eff^T(γ)²). Compared to exact second-order FGD methods, such as KONS or KRLS, PROS-N-KONS dramatically improves the time and space complexity from polynomial to linear. 
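The per-epoch computations can be condensed into a short sketch. Everything below is our own simplification, not the authors' code: a single frozen dictionary stands in for the KORS-maintained one, the loss is the squared loss, and the rank-one inverse updates use the Sherman-Morrison formula; KORS sampling and the restart logic are omitted:

```python
import numpy as np

def clip_excess(z, C):
    # h(z) = sign(z) * max(|z| - C, 0): the part of z sticking out of [-C, C]
    return np.sign(z) * max(abs(z) - C, 0.0)

class EmbeddedKONS:
    """ONS-style second-order updates in a fixed Nystrom embedding.

    A condensed sketch of one PROS-N-KONS epoch: the dictionary (hence the
    embedding) stays frozen, and KORS and the restart logic are omitted.
    """
    def __init__(self, X_dict, kernel, alpha=2.0, sigma=0.125, C=1.0):
        self.Xd, self.kernel, self.sigma, self.C = X_dict, kernel, sigma, C
        lam, U = np.linalg.eigh(kernel(X_dict, X_dict))
        lam = np.maximum(lam, 1e-12)
        self.embed_mat = (U / np.sqrt(lam)).T     # Sigma^{-1} U^T
        j = X_dict.shape[0]
        self.A_inv = np.eye(j) / alpha            # (alpha I)^{-1}, kept updated
        self.w = np.zeros(j)                      # omega_t
        self.g_prev = np.zeros(j)

    def embed(self, x):
        # phi_tilde(x) = Sigma^{-1} U^T k_I(x) in R^j
        return self.embed_mat @ self.kernel(self.Xd, x[None, :]).ravel()

    def step(self, x, y):
        phi = self.embed(x)
        v = self.w - self.A_inv @ self.g_prev     # Newton-style descent step
        Ainv_phi = self.A_inv @ phi
        # oblique projection: pull the prediction back into [-C, C]
        self.w = v - clip_excess(phi @ v, self.C) / (phi @ Ainv_phi) * Ainv_phi
        y_hat = phi @ self.w
        g = 2.0 * (y_hat - y) * phi               # squared-loss gradient
        # A_t = A_{t-1} + (sigma/2) g g^T, inverted via Sherman-Morrison
        c = self.sigma / 2.0
        Ag = self.A_inv @ g
        self.A_inv -= c * np.outer(Ag, Ag) / (1.0 + c * (g @ Ag))
        self.g_prev = g
        return y_hat

kernel = lambda A, B: np.exp(-np.sum((A[:, None] - B[None, :]) ** 2, axis=2))
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 2))
y = np.sin(X[:, 0])                               # smooth, bounded target
model = EmbeddedKONS(X[:20].copy(), kernel)       # first 20 points as dictionary
preds = [model.step(x, t) for x, t in zip(X, y)]
losses = [(p - t) ** 2 for p, t in zip(preds, y)]
assert max(abs(p) for p in preds) <= 1.0 + 1e-8   # projection keeps |y_hat| <= C
assert np.mean(losses[-50:]) < np.mean(losses[:50])
```

Because Ã_t^{-1} is maintained by rank-one updates, each step costs O(j²), matching the careful implementation described above.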
Unlike other approximate second-order methods, PROS-N-KONS does not add a new SV at each step. This way, it removes the T² term from the O(T² + T d_eff^T(γ)³) time complexity of SKETCHED-KONS [2]. Moreover, when min_t τ_{t,t} is small, SKETCHED-KONS needs to compensate by adding a SV to the dictionary with constant probability, resulting in a larger runtime complexity, while PROS-N-KONS has no dependency on the value of the RLS. Even compared to first-order methods, which incur a larger regret, PROS-N-KONS performs favorably, improving on the O(T²) runtime of exact first-order FGD. Compared to other approximate methods, the variant using rank-one updates matches the O(J²) per-step cost of the more accurate first-order methods, such as the budgeted perceptron [4], the projectron [15], and Nyström GD [13], while improving on their regret. PROS-N-KONS also closely matches faster but less accurate O(J) methods, such as the forgetron [6] and budgeted GD [23].

Figure 1: PROS-N-KONS

4 Regret guarantees

In this section, we study the regret performance of PROS-N-KONS.

Theorem 1 (proof in App. B). For any sequence of losses ℓ_t satisfying Asm. 2 with Lipschitz constant L, let σ = min_t σ_t. If η_t ≥ σ for all t, α ≤ √T, γ ≤ α, and the predictions are bounded by C, then the regret of PROS-N-KONS over T steps is bounded w.p. 1 − δ as

R_T(w) ≤ J (α‖w‖² + (4/σ) d_eff^T(α/(σL²)) log(2σL²T/α)) + (Tγε/(4(1−ε))) (L²/α + 1) + 2JC,   (3)

where J ≤ 3β d_eff^T(γ) log(2T) is the number of epochs. If γ = α/T, the previous bound reduces to

R_T(w) = O(α‖w‖² d_eff^T(α/T) log(T) + d_eff^T(α/T) d_eff^T(α) log²(T)).   (4)

Remark (bound) The bound in Eq. 3 is composed of three terms. At each epoch of PROS-N-KONS, an instance of KONS is run on the embedded feature space H_j obtained by using the dictionary I_j constructed up to the previous epoch. As a result, we directly use the bound on the regret of KONS (Thm. 1 in [2]) for each of the J epochs, thus leading to the first term in the regret. Since a new epoch is started whenever a new SV is added to the dictionary, the number of epochs J is at most the size of the dictionary returned by KORS up to step T, which w.h.p. is Õ(d_eff^T(γ)), making the first term scale as Õ(d_eff^T(γ) d_eff^T(α)) overall. Nonetheless, the comparator used in the per-epoch regret of KONS is constrained to the RKHS H_j induced by the embedding used in epoch j. The second term accounts for the difference in performance between the best solutions in the RKHS of epoch j and in the original RKHS H. While this error is directly controlled by KORS through the RLS regularization γ and the parameter ε (hence the factor γε/(1−ε) from Property (1) in Prop. 1), its impact on the regret is amplified by the length of each epoch, thus leading to an overall linear term that needs to be regularized. Finally, the last term summarizes the regret suffered every time a new epoch is started and the default prediction ŷ = 0 is returned. Since the values y_t and ŷ_t are constrained in S, this results in a regret of 2JC.

Remark (regret comparison) Tuning the RLS regularization as γ = α/T leads to the bound in Eq. 4. 
While the bound displays an explicit logarithmic dependency on T , this comes at the cost\nof increasing the effective dimension, which now depends on the regularization \u03b1/T . While in\ngeneral this could possibly compromise the overall regret, if the sequence of points \u03c61, . . . , \u03c6T\ninduces a kernel matrix with a rapidly decaying spectrum, the resulting regret is still competitive.\nFor instance, if the eigenvalues of KT decrease as \u03bbt = at\u2212q with constants a > 0 and q > 1, then\n\u221a\neff (\u03b1/T ) \u2264 aqT 1/q/(q \u2212 1). This shows that for any q > 2 we obtain a regret1 o(\nT log2 T )\ndT\nshowing that KONS still improves over \ufb01rst-order methods. Furthermore, if the kernel has a low\nrank or the eigenvalues decrease exponentially, the \ufb01nal regret is poly-logarithmic, thus preserving\nthe full advantage of the second-order approach. Notice that this scenario is always veri\ufb01ed when\nH = Rd, and is also veri\ufb01ed when the adversary draws samples from a stationary distribution\nand, e.g., the Gaussian kernel [22] (see also [16, 18]). This result is particularly remarkable when\n\ncompared to SKETCHED-KONS, whose regret scales as O(cid:0)\u03b1(cid:107)w(cid:107)2 + dT\n\neff (\u03b1) (log T )/\u03b7(cid:1), where \u03b7\n\nis the fraction of samples which is forced into the dictionary (when \u03b7 = 1, we recover the bound\nfor KONS). Even when the effective dimension is small (e.g., exponentially decaying eigenvalues),\nSKETCHED-KONS requires setting \u03b7 to T \u2212p for a constant p > 0 to get a subquadratic space\ncomplexity, at the cost of increasing the regret to O(T p log T ). On the other hand, PROS-N-KONS\nachieves a poly-logarithmic regret with linear space complexity up to poly-log factors (i.e., T dT\neff(\u03b3)2),\nthus greatly improving both the learning and computational performance w.r.t. 
SKETCHED-KONS. Finally, notice that while γ = α/T is the best choice agnostic to the kernel, better bounds can be obtained by optimizing Eq. 3 for γ depending on d^T_eff(γ). For instance, let γ = α/T^s; then the optimal value of s for a q-polynomially decaying spectrum is s = q/(1 + q), leading to a regret bound Õ(T^{q/(1+q)}), which is always o(√T) for any q > 1.

Remark (comparison in the Euclidean case) In the special case H = R^d, we can make a comparison with existing approximate methods for OL. In particular, the closest algorithm is SKETCHED-ONS by Luo et al. [14]. Unlike PROS-N-KONS, and similarly to SKETCHED-KONS, they take the approach of directly approximating A_t in the exact H = R^d using frequent directions [7] to construct a rank-k approximation of A_t for a fixed k. The resulting algorithm achieves a regret bounded by k log T + k Σ_{i=k+1}^T σ_i², where the sum Σ_{i=k+1}^T σ_i² is equal to the sum of the smallest d − k eigenvalues of the final (exact) matrix A_T. This quantity can vary from 0, when the data lies in a subspace of rank r ≤ k, to T(d − k)/d, when the samples lie orthogonally and in equal number along all d directions available in R^d. Computationally, the algorithm requires O(T dk) time and O(dk) space. Conversely, PROS-N-KONS automatically adapts its time and space complexity to the effective dimension of the problem d^T_eff(α/T), which is smaller than the rank r for any α. As a consequence, it requires only Õ(T r²) time and Õ(r²) space, achieving O(r² log T) regret independently of the spectrum of the covariance matrix.

¹Here we ignore the term d^T_eff(α), which is a constant w.r.t. T for any constant α.
Computationally, all of these complexities are smaller than those of SKETCHED-ONS in the regime r < k, which is the only one where SKETCHED-ONS can guarantee a sublinear regret, and where the regrets of the two algorithms are close. Overall, while SKETCHED-ONS implicitly relies on the assumption r < k, continues to operate in a d-dimensional space, and suffers a large regret if r > k, PROS-N-KONS adaptively converts the d-dimensional problem into a simpler one with the appropriate rank, fully reaping the computational and regret benefits.

The bound in Thm. 1 can be refined in the specific case of the squared loss as follows.

Theorem 2. For any sequence of squared losses ℓ_t = (y_t − ŷ_t)², L = 4C and σ = 1/(8C²), if η_t ≥ σ for all t, α ≤ √T and γ ≤ α, the regret of PROS-N-KONS over T steps is bounded w.p. 1 − δ as

    R_T(w) ≤ Σ_{j=1}^J [ (4/σ) d^j_eff log( α/(2σ) + (σL²/α) Tr(K_j) ) + ε′L*_j + ε′α‖w‖₂² ] + J L(C + L/α),    (5)

where ε′ = α(α − γε/(1 − ε))⁻¹ − 1 and L*_j = min_{w∈H} [ Σ_{t=t_j}^{t_{j+1}−1} (φ_t^⊤w − y_t)² + α‖w‖₂² ] is the best regularized cumulative loss in H within epoch j.

Let L*_T be the best regularized cumulative loss over all T steps; then L*_j ≤ L*_T. Furthermore, we have that d^j_eff ≤ d^T_eff, and thus the regret in Eq. 5 can be (loosely) bounded as

    R_T(w) = O( J ( d^T_eff(α) log(T) + ε′L*_T + ε′α‖w‖₂² ) ).

The major difference w.r.t. the general bound in Eq. 3 is that we directly relate the regret of PROS-N-KONS to the performance of the best predictor in H in hindsight, which replaces the linear term γT/α. As a result, we can set γ = α (for which ε′ = ε/(1 − 2ε)) and avoid increasing the effective dimension of the problem. Furthermore, since L*_T is the regularized loss of the optimal batch solution, we expect it to be small whenever H is well designed for the prediction task at hand. For instance, if L*_T scales as O(log T) for a given regularization α (e.g., in the realizable case L*_T is actually just α‖w‖²), then the regret of PROS-N-KONS is directly comparable with that of KONS, up to a multiplicative factor depending on the number of epochs J and with a much smaller time and space complexity that adapts to the effective dimension of the problem (see Prop. 1).

5 Experiments

We empirically validate PROS-N-KONS on several regression and binary classification problems, showing that it is competitive with state-of-the-art methods. We focused on verifying 1) the advantage of second-order vs. first-order updates, 2) the effectiveness of data-adaptive embedding w.r.t. the oblivious one, and 3) the effective dimension of real datasets. Note that our guarantees hold for more challenging (possibly adversarial) settings than what we test empirically.

Algorithms Besides PROS-N-KONS, we introduce two heuristic variants.
CON-KONS follows the same update rules as PROS-N-KONS during the descent steps, but at reset steps it does not reset the solution: instead, starting from ω̃_{t−1}, it computes w̃_{t−1} = Φ_{j−1} U_{j−1} Σ⁻¹_{j−1} ω̃_{t−1} and sets ω̃_t = Σ⁻¹_j U_j^⊤ Φ_j^⊤ w̃_{t−1}. A similar update rule is used to map Ã_{t−1} into the new embedded space without resetting it. B-KONS is a budgeted version of PROS-N-KONS that stops updating the dictionary at a maximum budget J_max and then continues learning in the last space for the rest of the run. Finally, we also include the best BATCH solution in the final space H_J returned by KORS as a best-in-hindsight comparator. We also compare to two state-of-the-art embedding-based first-order methods from [13]. NOGD selects the first J points, uses them to construct an embedding, and then performs exact GD in the embedded space. FOGD uses a random-feature expansion to construct an embedding, and then runs first-order GD in the embedded space. While oblivious embedding methods are cheaper than data-adaptive Nyström, they are usually less accurate. Finally, DUAL-SGD also performs a random-feature expansion embedding, but in the dual space. Given the number #SV of SVs stored in the predictor and the input dimension d of the dataset's samples, the time complexity of all first-order methods is O(T d #SV), while that of PROS-N-KONS and its variants is O(T(d + #SV)#SV). When #SV ∼ d (as in our case) the two complexities coincide. The space complexities are also close, with the O(#SV²) of PROS-N-KONS not much larger than the first-order methods' O(#SV).

parkinson (n = 5,875, d = 20)
Algorithm        avg. squared loss      #SV    time
FOGD             0.04909 ± 0.00020      30     —
NOGD             0.04896 ± 0.00068      30     —
PROS-N-KONS      0.05798 ± 0.00136      18     5.16
CON-KONS         0.05696 ± 0.00129      18     5.21
B-KONS           0.05795 ± 0.00172      18     5.35
BATCH            0.04535 ± 0.00002      —      —

cpusmall (n = 8,192, d = 12)
Algorithm        avg. squared loss      #SV    time
FOGD             0.02577 ± 0.00050      30     —
NOGD             0.02559 ± 0.00024      30     —
PROS-N-KONS      0.02494 ± 0.00141      20     7.28
CON-KONS         0.02269 ± 0.00164      20     7.40
B-KONS           0.02496 ± 0.00177      20     7.37
BATCH            0.01090 ± 0.00082      —      —

cadata (n = 20,640, d = 8)
Algorithm        avg. squared loss      #SV    time
FOGD             0.04097 ± 0.00015      30     —
NOGD             0.03983 ± 0.00018      30     —
PROS-N-KONS      0.03095 ± 0.00110      20     18.59
CON-KONS         0.02850 ± 0.00174      19     18.45
B-KONS           0.03095 ± 0.00118      19     18.65
BATCH            0.02202 ± 0.00002      —      —

casp (n = 45,730, d = 9)
Algorithm        avg. squared loss      #SV    time
FOGD             0.08021 ± 0.00031      30     —
NOGD             0.07844 ± 0.00008      30     —
PROS-N-KONS      0.06773 ± 0.00105      21     40.73
CON-KONS         0.06832 ± 0.00315      20     40.91
B-KONS           0.06775 ± 0.00067      21     41.13
BATCH            0.06100 ± 0.00003      —      —

slice (n = 53,500, d = 385)
Algorithm        avg. squared loss      #SV    time
FOGD             0.00726 ± 0.00019      30     —
NOGD             0.02636 ± 0.00460      30     —
DUAL-SGD         —                      —      —
PROS-N-KONS      did not complete       —      —
CON-KONS         did not complete       —      —
B-KONS           0.00913 ± 0.00045      100    60
BATCH            0.00212 ± 0.00001      —      —

year (n = 463,715, d = 90)
Algorithm        avg. squared loss      #SV    time
FOGD             0.01427 ± 0.00004      30     —
NOGD             0.01427 ± 0.00004      30     —
DUAL-SGD         0.01440 ± 0.00000      100    —
PROS-N-KONS      0.01450 ± 0.00014      149    884.82
CON-KONS         0.01444 ± 0.00017      147    889.42
B-KONS           0.01302 ± 0.00006      100    505.36
BATCH            0.01147 ± 0.00001      —      —

Table 1: Regression datasets
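The two embedding strategies used by these baselines (data-adaptive Nyström as in NOGD, oblivious random features as in FOGD) can be sketched in a few lines of numpy. This is an illustrative toy on random data with an assumed bandwidth, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))   # toy data stream (assumption)
sigma = 2.0                          # illustrative kernel bandwidth

def gauss_kernel(A, B):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

# NOGD-style data-dependent Nystrom embedding: dictionary = first J points,
# embed each x as z(x) = K_II^{-1/2} k_I(x).
J = 20
K_II = gauss_kernel(X[:J], X[:J])
w, V = np.linalg.eigh(K_II)
inv_sqrt = V @ np.diag(1.0 / np.sqrt(np.maximum(w, 1e-12))) @ V.T
Z = gauss_kernel(X, X[:J]) @ inv_sqrt

# On the dictionary itself the Nystrom embedding reproduces the kernel exactly.
assert np.allclose(Z[:J] @ Z[:J].T, K_II, atol=1e-6)

# FOGD-style oblivious embedding: random Fourier features for the same kernel;
# Phi @ Phi.T only approximates the kernel matrix in expectation.
D = 2000
W = rng.standard_normal((5, D)) / sigma
b = rng.uniform(0.0, 2 * np.pi, D)
Phi = np.sqrt(2.0 / D) * np.cos(X @ W + b)
rff_err = np.abs(Phi @ Phi.T - gauss_kernel(X, X)).max()
```

The contrast mirrors the discussion above: the oblivious embedding never looks at the data, so its accuracy is only statistical, while the Nyström embedding adapts to the observed points.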
We do not run SKETCHED-KONS because its T² runtime is prohibitive.
Experimental setup We replicate the experimental setting of [13] with 9 datasets for regression and 3 datasets for binary classification. We use the same preprocessing as Lu et al. [13]: each feature of the points x_t is rescaled to fit in [0, 1]; for regression the target variable y_t is rescaled to [0, 1], while in binary classification the labels are {−1, 1}. We also do not tune the Gaussian kernel bandwidth, but take the value σ = 8 used by [13]. For all datasets, we set β = 1 and ε = 0.5 for all PROS-N-KONS variants, and J_max = 100 for B-KONS. For each algorithm and dataset, we report the average and standard deviation of the losses. The scores for the competitor baselines are reported as provided in the original papers [13, 12]. We only report scores for NOGD, FOGD, and DUAL-SGD, since they have been shown to outperform other baselines such as the budgeted perceptron [4], projectron [15], forgetron [6], and budgeted GD [23]. For the PROS-N-KONS variants we also report the runtime in seconds, but do not compare with the runtimes reported by [13, 12], as that would imply comparing different implementations. Note that since the complexities O(T d #SV) and O(T(d + #SV)#SV) are close, we do not expect large differences. All experiments are run on a single machine with 2 Xeon E5-2630 CPUs for a total of 10 cores, and are averaged over 15 runs.
Effective dimension and runtime We use the size of the dictionary returned by KORS as a proxy for the effective dimension of the datasets. As expected, larger datasets and datasets with a larger input dimension have a larger effective dimension. Furthermore, d^T_eff(γ) increases (sublinearly) when we reduce γ from 1 to 0.01 on the ijcnn1 dataset.
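This proxy is principled: KORS samples points according to their γ-ridge leverage scores, and for any kernel matrix these scores sum exactly to d^T_eff(γ). A small numpy check on a toy kernel (batch computation for clarity; KORS itself estimates the scores online, and the data and bandwidth here are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((300, 4))
# Toy Gaussian kernel matrix (unit bandwidth, an arbitrary choice).
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / 2.0)

gamma = 0.1
# gamma-ridge leverage score of point i: tau_i = [K (K + gamma I)^{-1}]_{ii}.
tau = np.diag(K @ np.linalg.inv(K + gamma * np.eye(len(K))))

# The scores sum exactly to the effective dimension d_eff(gamma).
lam = np.linalg.eigvalsh(K)
d_eff = np.sum(lam / (lam + gamma))
assert np.isclose(tau.sum(), d_eff)
# A dictionary sampled proportionally to tau therefore has ~d_eff entries
# (up to log factors), rather than growing with the number of points.
```

This identity (tr[K(K + γI)⁻¹] = Σ_i λ_i/(λ_i + γ)) is why the dictionary size tracks the effective dimension rather than T.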
More importantly, d^T_eff(γ) remains empirically small even for datasets with hundreds of thousands of samples, such as year, ijcnn1 and cod-rna. On the other hand, on the slice dataset the effective dimension is too large for PROS-N-KONS to complete, and we only provide results for B-KONS. Overall, the proposed algorithm can process hundreds of thousands of points in a matter of minutes, showing that it can practically scale to large datasets.

α = 1, γ = 1

ijcnn1 (n = 141,691, d = 22)
Algorithm        avg. error rate    #SV    time
FOGD             9.06 ± 0.05        400    —
NOGD             9.55 ± 0.01        100    —
DUAL-SGD         8.35 ± 0.20        100    —
PROS-N-KONS      9.70 ± 0.01        100    211.91
CON-KONS         9.64 ± 0.01        101    215.71
B-KONS           9.70 ± 0.01        98     206.53
BATCH            8.33 ± 0.03        —      —

cod-rna (n = 271,617, d = 8)
Algorithm        avg. error rate    #SV    time
FOGD             10.30 ± 0.10       400    —
NOGD             13.80 ± 2.10       100    —
DUAL-SGD         4.83 ± 0.21        100    —
PROS-N-KONS      13.95 ± 1.19       38     270.81
CON-KONS         18.99 ± 9.47       38     271.85
B-KONS           13.99 ± 1.16       38     274.94
BATCH            3.781 ± 0.01       —      —

α = 0.01, γ = 0.01

ijcnn1 (n = 141,691, d = 22)
Algorithm        avg. error rate    #SV    time
FOGD             9.06 ± 0.05        400    —
NOGD             9.55 ± 0.01        100    —
DUAL-SGD         8.35 ± 0.20        100    —
PROS-N-KONS      10.73 ± 0.12       436    1003.82
CON-KONS         6.23 ± 0.18        432    987.33
B-KONS           4.85 ± 0.08        100    147.22
BATCH            5.61 ± 0.01        —      —

cod-rna (n = 271,617, d = 8)
Algorithm        avg. error rate    #SV    time
FOGD             10.30 ± 0.10       400    —
NOGD             13.80 ± 2.10       100    —
DUAL-SGD         4.83 ± 0.21        100    —
PROS-N-KONS      4.91 ± 0.04        111    459.28
CON-KONS         5.81 ± 1.96        111    458.90
B-KONS           4.57 ± 0.05        100    333.57
BATCH            3.61 ± 0.01        —      —

Table 2: Binary classification datasets

Regression All algorithms are trained and evaluated using the squared loss.
Notice that whenever the budget J_max is not exceeded, B-KONS and PROS-N-KONS are the same algorithm and obtain the same result. On the regression datasets (Tab. 1) we set α = 1 and γ = 1, which satisfies the requirements of Thm. 2. Note that we did not tune α and γ for optimal performance, as that would require multiple runs and violate the online setting. On smaller datasets such as parkinson and cpusmall, where frequent restarts greatly interfere with the gradient descent and even a small non-adaptive embedding can capture the geometry of the data, PROS-N-KONS is outperformed by simpler first-order methods. As soon as T reaches the order of tens of thousands (cadata, casp), second-order updates and data adaptivity become relevant, and PROS-N-KONS outperforms its competitors both in the number of SVs and in the average loss. In this intermediate regime, CON-KONS outperforms PROS-N-KONS and B-KONS, since it is less affected by restarts. Finally, when the number of samples rises to hundreds of thousands, the intrinsic effective dimension of the dataset starts playing a larger role. On slice, where the effective dimension is too large for PROS-N-KONS to run, B-KONS still outperforms NOGD with a comparable budget of SVs, showing the advantage of second-order updates.
Binary classification All algorithms are trained using the hinge loss and are evaluated using the average online error rate. Results are reported in Tab. 2. While for regression an arbitrary value of γ = α = 1 is sufficient to obtain good results, it fails for binary classification.
Decreasing the two parameters to 0.01 resulted in a 3-fold increase in the number of SVs included and in the runtime, but almost a 2-fold decrease in the error rate, placing PROS-N-KONS and B-KONS on par with or ahead of the competitors without the need for any further parameter tuning.

6 Conclusions

We presented PROS-N-KONS, a novel algorithm for sketched second-order OKL that achieves O(d^T_eff log T) regret for losses with directional curvature. Our sketching is data-adaptive and, when the effective dimension of the dataset is constant, it achieves a constant per-step cost, unlike SKETCHED-KONS [2], which was previously proposed for the same setting. We empirically showed that PROS-N-KONS is practical, performing on par with or better than state-of-the-art methods on standard benchmarks, using small dictionaries on realistic data.

Acknowledgements The research presented was supported by French Ministry of Higher Education and Research, Nord-Pas-de-Calais Regional Council, Inria and Universität Potsdam associated-team north-european project Allocate, and French National Research Agency projects ExTra-Learn (n.ANR-14-CE24-0010-01) and BoB (n.ANR-16-CE23-0003).

References

[1] Ahmed El Alaoui and Michael W. Mahoney. Fast randomized kernel methods with statistical guarantees. In Neural Information Processing Systems, 2015.

[2] Daniele Calandriello, Alessandro Lazaric, and Michal Valko. Second-order kernel online convex optimization with adaptive sketching. In International Conference on Machine Learning, 2017.

[3] Daniele Calandriello, Alessandro Lazaric, and Michal Valko. Distributed sequential sampling for kernel matrix approximation. In AISTATS, 2017.

[4] Giovanni Cavallanti, Nicolo Cesa-Bianchi, and Claudio Gentile. Tracking the best hyperplane with a simple budget perceptron. Machine Learning, 69(2-3):143–167, 2007.

[5] Michael B. Cohen, Cameron Musco, and Jakub Pachocki. Online row sampling.
In International Workshop on Approximation, Randomization, and Combinatorial Optimization (APPROX), 2016.

[6] Ofer Dekel, Shai Shalev-Shwartz, and Yoram Singer. The forgetron: A kernel-based perceptron on a budget. SIAM Journal on Computing, 37(5):1342–1372, 2008.

[7] Mina Ghashami, Edo Liberty, Jeff M. Phillips, and David P. Woodruff. Frequent directions: Simple and deterministic matrix sketching. SIAM Journal on Computing, 45(5):1762–1792, 2016.

[8] Elad Hazan, Adam Kalai, Satyen Kale, and Amit Agarwal. Logarithmic regret algorithms for online convex optimization. In Conference on Learning Theory. Springer, 2006.

[9] Wenwu He and James T. Kwok. Simple randomized algorithms for online learning with kernels. Neural Networks, 60:17–24, 2014.

[10] J. Kivinen, A. J. Smola, and R. C. Williamson. Online learning with kernels. IEEE Transactions on Signal Processing, 52(8), 2004.

[11] Quoc Le, Tamás Sarlós, and Alex J. Smola. Fastfood - Approximating kernel expansions in loglinear time. In International Conference on Machine Learning, 2013.

[12] Trung Le, Tu Nguyen, Vu Nguyen, and Dinh Phung. Dual Space Gradient Descent for Online Learning. In Neural Information Processing Systems, 2016.

[13] Jing Lu, Steven C.H. Hoi, Jialei Wang, Peilin Zhao, and Zhi-Yong Liu. Large scale online kernel learning. Journal of Machine Learning Research, 17(47):1–43, 2016.

[14] Haipeng Luo, Alekh Agarwal, Nicolo Cesa-Bianchi, and John Langford. Efficient second-order online learning via sketching. In Neural Information Processing Systems, 2016.

[15] Francesco Orabona, Joseph Keshet, and Barbara Caputo. The projectron: a bounded kernel-based perceptron. In International Conference on Machine Learning, 2008.

[16] Yi Sun, Jürgen Schmidhuber, and Faustino J. Gomez. On the size of the online kernel sparsification dictionary.
In International Conference on Machine Learning, 2012.

[17] Zhuang Wang, Koby Crammer, and Slobodan Vucetic. Breaking the curse of kernelization: Budgeted stochastic gradient descent for large-scale SVM training. Journal of Machine Learning Research, 13(Oct):3103–3131, 2012.

[18] Andrew J. Wathen and Shengxin Zhu. On spectral distribution of kernel matrices related to radial basis functions. Numerical Algorithms, 70(4):709–726, 2015.

[19] Christopher Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. In Neural Information Processing Systems, 2001.

[20] Yi Xu, Haiqin Yang, Lijun Zhang, and Tianbao Yang. Efficient non-oblivious randomized reduction for risk minimization with improved excess risk guarantee. In AAAI Conference on Artificial Intelligence, 2017.

[21] Tianbao Yang, Yu-Feng Li, Mehrdad Mahdavi, Rong Jin, and Zhi-Hua Zhou. Nyström method vs random Fourier features: A theoretical and empirical comparison. In Neural Information Processing Systems, 2012.

[22] Y. Yang, M. Pilanci, and M. J. Wainwright. Randomized sketches for kernels: Fast and optimal non-parametric regression. Annals of Statistics, 2017.

[23] Peilin Zhao, Jialei Wang, Pengcheng Wu, Rong Jin, and Steven C. H. Hoi. Fast bounded online gradient descent algorithms for scalable kernel-based online learning. In International Conference on Machine Learning, 2012.

[24] Fedor Zhdanov and Yuri Kalnishkan. An identity for kernel ridge regression. In Algorithmic Learning Theory, 2010.

[25] Changbo Zhu and Huan Xu. Online gradient descent in function space. arXiv:1512.02394, 2015.

[26] Martin Zinkevich.
Online convex programming and generalized infinitesimal gradient ascent. In International Conference on Machine Learning, 2003.