{"title": "Optimal learning rates for Kernel Conjugate Gradient regression", "book": "Advances in Neural Information Processing Systems", "page_first": 226, "page_last": 234, "abstract": "We prove rates of convergence in the statistical sense for kernel-based least squares regression using a conjugate gradient algorithm, where regularization against overfitting is obtained by early stopping. This method is directly related to Kernel Partial Least Squares, a regression method that combines supervised dimensionality reduction with least squares projection. The rates depend on two key quantities: first, on the regularity of the target regression function and second, on the effective dimensionality of the data mapped into the kernel space. Lower bounds on attainable rates depending on these two quantities were established in earlier literature, and we obtain upper bounds for the considered method that match these lower bounds (up to a log factor) if the true regression function belongs to the reproducing kernel Hilbert space. If the latter assumption is not fulfilled, we obtain similar convergence rates provided additional unlabeled data are available. The order of the learning rates in these two situations match state-of-the-art results that were recently obtained for the least squares support vector machine and for linear regularization operators.", "full_text": "Optimal learning rates\n\nfor Kernel Conjugate Gradient regression\n\nGilles Blanchard\n\nMathematics Institute, University of Potsdam\n\nAm neuen Palais 10, 14469 Potsdam\n\nblanchard@math.uni-potsdam.de\n\nNicole Kr\u00a8amer\n\nWeierstrass Institute\n\nMohrenstr. 39, 10117 Berlin, Germany\n\nnicole.kraemer@wias-berlin.de\n\nAbstract\n\nWe prove rates of convergence in the statistical sense for kernel-based least\nsquares regression using a conjugate gradient algorithm, where regularization\nagainst over\ufb01tting is obtained by early stopping. 
This method is directly related to Kernel Partial Least Squares, a regression method that combines supervised dimensionality reduction with least squares projection. The rates depend on two key quantities: first, on the regularity of the target regression function and second, on the effective dimensionality of the data mapped into the kernel space. Lower bounds on attainable rates depending on these two quantities were established in earlier literature, and we obtain upper bounds for the considered method that match these lower bounds (up to a log factor) if the true regression function belongs to the reproducing kernel Hilbert space. If this assumption is not fulfilled, we obtain similar convergence rates provided additional unlabeled data are available. The order of the learning rates matches state-of-the-art results that were recently obtained for least squares support vector machines and for linear regularization operators.\n\n1 Introduction\n\nThe contribution of this paper is the learning theoretical analysis of kernel-based least squares regression in combination with conjugate gradient techniques. The goal is to estimate a regression function f* based on random noisy observations. We have an i.i.d. sample of n observations (Xi, Yi) ∈ X × R from an unknown distribution P(X, Y) that follows the model\n\nY = f*(X) + ε,\n\nwhere ε is a noise variable whose distribution can possibly depend on X, but satisfies E[ε|X] = 0. We assume that the true regression function f* belongs to the space L2(PX) of square-integrable functions. Following the kernelization principle, we implicitly map the data into a reproducing kernel Hilbert space H with a kernel k. We denote by Kn = (1/n) (k(Xi, Xj)) ∈ R^{n×n} the normalized kernel matrix and by Υ = (Y1, . . . , Yn)ᵀ ∈ Rn the n-vector of response observations. 
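As a concrete illustration, the normalized kernel matrix Kn above can be assembled as follows. This is a minimal sketch of ours; the Gaussian RBF kernel and its bandwidth are illustrative choices, not prescribed by the paper (any bounded kernel satisfies the assumptions below):

```python
import numpy as np

def gaussian_kernel(x, z, bandwidth=1.0):
    # k(x, z) = exp(-||x - z||^2 / (2 * bandwidth^2)); bounded with k(x, x) = 1
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * bandwidth ** 2))

def normalized_kernel_matrix(X, kernel=gaussian_kernel):
    # K_n = (1/n) (k(X_i, X_j))_{i,j=1..n}, the normalized kernel matrix of the text
    n = len(X)
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    return K / n
```

Note the 1/n factor, which matches the rescaled inner product used throughout the paper.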
The task is to find coefficients α such that the function defined by the normalized kernel expansion\n\nf_α(X) = (1/n) Σ_{i=1}^n α_i k(Xi, X)\n\nis an adequate estimator of the true regression function f*. The closeness of the estimator f_α to the target f* is measured via the L2(PX) distance,\n\n‖f_α − f*‖₂² = E_{X∼PX}[(f_α(X) − f*(X))²] = E_{XY}[(f_α(X) − Y)²] − E_{XY}[(f*(X) − Y)²].\n\nThe last equality recalls that this criterion is the same as the excess generalization error for the squared error loss ℓ(f, x, y) = (f(x) − y)².\n\nIn empirical risk minimization, we use the training data empirical distribution as a proxy for the generating distribution, and minimize the training squared error. This gives rise to the linear equation\n\nKn α = Υ with α ∈ Rn. (1)\n\nAssuming Kn invertible, the solution of the above equation is given by α = Kn⁻¹ Υ, which yields a function in H interpolating perfectly the training data but having poor generalization error. It is well-known that to avoid overfitting, some form of regularization is needed. There is a considerable variety of possible approaches (see e.g. [10] for an overview). Perhaps the most well-known one is\n\nα = (Kn + λI)⁻¹ Υ, (2)\n\nknown alternatively as kernel ridge regression, Tikhonov's regularization, least squares support vector machine, or MAP Gaussian process regression. A powerful generalization of this is to consider\n\nα = F_λ(Kn) Υ, (3)\n\nwhere F_λ : R₊ → R₊ is a fixed function depending on a parameter λ. The notation F_λ(Kn) is to be interpreted as F_λ applied to each eigenvalue of Kn in its eigendecomposition. Intuitively, F_λ should be a “regularized” version of the inverse function F(x) = x⁻¹. 
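The spectral construction (3) can be sketched concretely. In the sketch below (our own illustration, not code from the paper), F_λ is the ridge filter F_λ(x) = 1/(x + λ), so applying it to the spectrum of Kn reproduces the kernel ridge solution (2):

```python
import numpy as np

def linear_regularization(K_n, Y, F_lambda):
    # alpha = F_lambda(K_n) Y: apply F_lambda to each eigenvalue of K_n (eq. (3))
    eigvals, eigvecs = np.linalg.eigh(K_n)   # K_n is symmetric positive semi-definite
    filtered = F_lambda(eigvals)             # the filter acts on the spectrum only
    return eigvecs @ (filtered * (eigvecs.T @ Y))

lam = 0.1
ridge_filter = lambda x: 1.0 / (x + lam)     # a "regularized" inverse of F(x) = 1/x
```

With this filter, `linear_regularization(K_n, Y, ridge_filter)` coincides with the direct solve of (2), i.e. (Kn + λI)⁻¹Υ; other filters (spectral cut-off, L2-boosting) fit the same template.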
This type of regularization, which we refer to as linear regularization methods, is directly inspired by the theory of inverse problems. Popular examples include as particular cases kernel ridge regression, principal components regression and L2-boosting. Their application in a learning context has been studied extensively [1, 2, 5, 6, 12]. Results obtained in this framework will serve as a comparison yardstick in the sequel.\n\nIn this paper, we study conjugate gradient (CG) techniques in combination with early stopping for the regularization of the kernel based learning problem (1). The principle of CG techniques is to restrict the learning problem onto a nested set of data-dependent subspaces, the so-called Krylov subspaces, defined as\n\nK_m(Υ, Kn) = span{Υ, Kn Υ, . . . , Kn^{m−1} Υ}. (4)\n\nDenote by ⟨·, ·⟩ the usual euclidean scalar product on Rn rescaled by the factor n⁻¹. We define the Kn-norm as ‖α‖²_{Kn} := ⟨α, α⟩_{Kn} := ⟨α, Kn α⟩. The CG solution after m iterations is formally defined as\n\nα_m = argmin_{α ∈ K_m(Υ, Kn)} ‖Υ − Kn α‖_{Kn}, (5)\n\nand the number m of CG iterations is the model parameter. To simplify notation we define f_m := f_{α_m}. In the learning context considered here, regularization corresponds to early stopping. Conjugate gradients have the appealing property that the optimization criterion (5) can be computed by a simple iterative algorithm that constructs basis vectors d₁, . . . , d_m of K_m(Υ, Kn) by using only forward multiplication of vectors by the matrix Kn. 
Algorithm 1 displays the computation of the CG kernel coefficients α_m defined by (5).\n\nAlgorithm 1 Kernel Conjugate Gradient regression\nInput: kernel matrix Kn, response vector Υ, maximum number of iterations m\nInitialization: α₀ = 0_n; r₁ = Υ; d₁ = Υ; t₁ = Kn Υ\nfor i = 1, . . . , m do\n  t_i = t_i/‖t_i‖_{Kn}; d_i = d_i/‖t_i‖_{Kn} (normalization of the basis, resp. update vector)\n  γ_i = ⟨Υ, t_i⟩_{Kn} (projection of Υ on the basis vector)\n  α_i = α_{i−1} + γ_i d_i (update)\n  r_{i+1} = r_i − γ_i t_i (residuals)\n  d_{i+1} = r_{i+1} − d_i ⟨t_i, Kn r_{i+1}⟩_{Kn}; t_{i+1} = Kn d_{i+1} (new update vector, resp. basis vector)\nend for\nReturn: CG kernel coefficients α_m, CG function f_m = Σ_{i=1}^n α_{i,m} k(Xi, ·)\n\nThe CG approach is also inspired by the theory of inverse problems, but it is not covered by the framework of linear operators defined in (3): As we restrict the learning problem onto the Krylov space K_m(Υ, Kn), the CG coefficients α_m are of the form α_m = q_m(Kn) Υ with q_m a polynomial of degree ≤ m − 1. However, the polynomial q_m is not fixed but depends on Υ as well, making the CG method nonlinear in the sense that the coefficients α_m depend on Υ in a nonlinear fashion.\n\nWe remark that in machine learning, conjugate gradient techniques are often used as fast solvers for operator equations, e.g. to obtain the solution for the regularized equation (2). We stress that in this paper, we study conjugate gradients as a regularization approach for kernel based learning, where regularization is ensured via early stopping. This approach is not new. 
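For small problems, the minimizer (5) can equivalently be obtained by materializing the Krylov basis explicitly and solving a small weighted least squares problem. The sketch below is our own illustration of this equivalent direct computation (it is not a reproduction of the iterative updates of Algorithm 1), together with a generic discrepancy-style early stopping loop in which the threshold sequence is a caller-supplied assumption:

```python
import numpy as np

def kernel_cg(K, Y, m):
    """alpha_m = argmin over the Krylov space K_m(Y, K) of ||Y - K alpha||_{K_n} (eq. (5))."""
    n = len(Y)
    # Krylov basis [Y, K Y, ..., K^{m-1} Y], orthonormalized for numerical stability
    V = np.empty((n, m))
    v = Y
    for j in range(m):
        V[:, j] = v
        v = K @ v
    V, _ = np.linalg.qr(V)
    # Weighted least squares: minimize (Y - K V c)^T K (Y - K V c)
    w, Q = np.linalg.eigh(K)
    W = Q @ (np.sqrt(np.clip(w, 0.0, None))[:, None] * Q.T)  # symmetric square root of K
    c, *_ = np.linalg.lstsq(W @ (K @ V), W @ Y, rcond=None)
    return V @ c

def cg_early_stopping(K, Y, thresholds):
    # Discrepancy stopping: return the first alpha_m with ||Y - K alpha_m||_{K_n} < Lambda_m;
    # `thresholds` (a non-empty list of Lambda_m values) is supplied by the caller.
    n = len(Y)
    for m, lam_m in enumerate(thresholds, start=1):
        alpha = kernel_cg(K, Y, m)
        r = Y - K @ alpha
        if np.sqrt(r @ (K @ r) / n) < lam_m:
            return m, alpha
    return len(thresholds), alpha
```

The residual norm uses the n⁻¹-rescaled inner product, matching the definition of the Kn-norm above; for m = n the Krylov space generically fills Rⁿ and the residual vanishes, which is why the stopping rules of Section 2 always terminate by iteration n.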
As mentioned in the abstract, the algorithm that we study is closely related to Kernel Partial Least Squares [18]. The latter method also restricts the learning problem onto the Krylov subspace K_m(Υ, Kn), but it minimizes the euclidean distance ‖Υ − Kn α‖ instead of the distance ‖Υ − Kn α‖_{Kn} defined above¹. Kernel Partial Least Squares has shown competitive performance in benchmark experiments (see e.g. [18, 19]). Moreover, a similar conjugate gradient approach for non-definite kernels has been proposed and empirically evaluated by Ong et al. [17]. The focus of the current paper is therefore not to stress the usefulness of CG methods in practical applications (for which we refer to the above mentioned references) but to examine their theoretical convergence properties. In particular, we establish the existence of early stopping rules that lead to optimal convergence rates. We summarize our main results in the next section.\n\n2 Main results\n\nFor the presentation of our convergence results, we require suitable assumptions on the learning problem. We first assume that the kernel space H is separable and that the kernel function is measurable. (This assumption is satisfied for all practical situations that we know of.) Furthermore, for all results, we make the (relatively standard) assumption that the kernel is bounded: k(x, x) ≤ κ for all x ∈ X. We consider – depending on the result – one of the following assumptions on the noise:\n\n(Bounded) (Bounded Y): |Y| ≤ M almost surely.\n(Bernstein) (Bernstein condition): E[ε^p|X] ≤ (1/2) p! M^p almost surely, for all integers p ≥ 2.\n\nThe second assumption is weaker than the first. 
In particular, the first assumption implies that not only the noise, but also the target function f* is bounded in supremum norm, while the second assumption does not put any additional restriction on the target function.\n\nThe regularity of the target function f* is measured in terms of a source condition as follows. The kernel integral operator is given by\n\nK : L2(PX) → L2(PX), g ↦ ∫ k(·, x) g(x) dPX(x).\n\nThe source condition for the parameters r > 0 and ρ > 0 is defined by:\n\nSC(r, ρ): f* = K^r u with ‖u‖ ≤ κ^{−r} ρ.\n\nIt is a known fact that if r ≥ 1/2, then f* coincides almost surely with a function belonging to H. We refer to r ≥ 1/2 as the “inner case” and to r < 1/2 as the “outer case”.\n\nThe regularity of the kernel operator K with respect to the marginal distribution PX is measured in terms of the so-called effective dimensionality condition, defined by the two parameters s ∈ (0, 1), D ≥ 0 and the condition\n\nED(s, D): tr(K(K + λI)⁻¹) ≤ D² (κ⁻¹ λ)^{−s} for all λ ∈ (0, 1].\n\nThis notion was first introduced in [22] in a learning context, along with a number of fundamental analysis tools which we rely on and which have been used in the rest of the related literature cited here. It is known that the best attainable rates of convergence, as a function of the number of examples n, are determined by the parameters r and s in the above conditions: It was shown in [6] that the minimax learning rate given these two parameters is lower bounded by O(n^{−2r/(2r+s)}).\n\nWe now expose our main results in different situations. 
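The trace quantity in ED(s, D) is straightforward to evaluate for a given spectrum. The sketch below is our own illustration: it computes the effective dimension tr(K(K + λI)⁻¹) from the eigenvalues of K and checks the polynomial bound for given (s, D) on a grid of λ values:

```python
import numpy as np

def effective_dimension(eigvals, lam):
    # tr(K (K + lam I)^{-1}) = sum_i mu_i / (mu_i + lam) over the eigenvalues mu_i of K
    eigvals = np.asarray(eigvals, dtype=float)
    return np.sum(eigvals / (eigvals + lam))

def check_ED(eigvals, s, D, kappa=1.0, grid=None):
    # Verify ED(s, D): tr(K(K + lam I)^{-1}) <= D^2 (lam / kappa)^{-s} for lam in (0, 1]
    if grid is None:
        grid = np.logspace(-6, 0, 25)
    return all(effective_dimension(eigvals, lam) <= D**2 * (lam / kappa) ** (-s)
               for lam in grid)
```

For a polynomially decaying spectrum μ_i ≍ i^{−1/s}, the effective dimension scales as λ^{−s}, which is exactly the regime the condition captures.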
In all the cases considered, the early stopping rule takes the form of a so-called discrepancy stopping rule: For some sequence of thresholds Λ_m > 0 to be specified (and possibly depending on the data), define the (data-dependent) stopping iteration m̂ as the first iteration m for which\n\n‖Υ − Kn α_m‖_{Kn} < Λ_m. (6)\n\nOnly in the first result below does the threshold Λ_m actually depend on the iteration m and on the data. It is not difficult to prove from (4) and (5) that ‖Υ − Kn α_n‖_{Kn} = 0, so that the above type of stopping rule always has m̂ ≤ n.\n\n¹This is generalized to a CG-l algorithm (l ∈ N≥0) by replacing the Kn-norm in (5) with the norm defined by Kn^l. Corresponding fast iterative algorithms to compute the solution exist for all l (see e.g. [11]).\n\n2.1 Inner case without knowledge on effective dimension\n\nThe inner case corresponds to r ≥ 1/2, i.e. the target function f* lies in H almost surely. For some constants τ > 1 and 1 > γ > 0, we consider the discrepancy stopping rule with the threshold sequence\n\nΛ_m = 4τ ( κ ‖α_m‖_{Kn} log(2γ⁻¹)/n + M √(κ log(2γ⁻¹)/n) ). (7)\n\nFor technical reasons, we consider a slight variation of the rule in that we stop at step m̂ − 1 instead of m̂ if q_m̂(0) ≥ 4κ √(log(2γ⁻¹)/n), where q_m is the iteration polynomial such that α_m = q_m(Kn) Υ. Denote by m̃ the resulting stopping step. We obtain the following result.\n\nTheorem 2.1. Suppose that Y is bounded (Bounded), and that the source condition SC(r, ρ) holds for r ≥ 1/2. With probability 1 − 2γ, the estimator f_m̃ obtained by the (modified) discrepancy stopping rule (7) satisfies\n\n‖f_m̃ − f*‖₂² ≤ c(r, τ)(M + ρ)² ( log²(γ⁻¹)/n )^{2r/(2r+1)}.\n\nWe present the proof in Section 4.\n\n2.2 Optimal rates in inner case\n\nWe now introduce a stopping rule yielding order-optimal convergence rates as a function of the two parameters r and s in the “inner” case (r ≥ 1/2, which is equivalent to saying that the target function belongs to H almost surely). For some constant τ₀ > 3/2 and 1 > γ > 0, we consider the discrepancy stopping rule with the fixed threshold\n\nΛ_m ≡ Λ = τ₀ M √κ ( (4D/√n) log(6/γ) )^{(2r+1)/(2r+s)}, (8)\n\nfor which we obtain the following:\n\nTheorem 2.2. Suppose that the noise fulfills the Bernstein assumption (Bernstein), that the source condition SC(r, ρ) holds for r ≥ 1/2, and that ED(s, D) holds. With probability 1 − 3γ, the estimator f_m̂ obtained by the discrepancy stopping rule (8) satisfies\n\n‖f_m̂ − f*‖₂² ≤ c(r, τ₀)(M + ρ)² ( 16D² log²(6/γ)/n )^{2r/(2r+s)}.\n\nDue to space limitations, the proof is presented in the supplementary material.\n\n2.3 Optimal rates in outer case, given additional unlabeled data\n\nWe now turn to the “outer” case. In this case, we make the additional assumption that unlabeled data is available. Assume that we have ñ i.i.d. observations X₁, . . . , X_ñ, out of which only the first n are labeled. We define a new response vector Υ̃ = (ñ/n)(Y₁, . . . , Yₙ, 0, . . . , 0) ∈ R^ñ and run the CG algorithm 1 on X₁, . . . , X_ñ and Υ̃. We use the same threshold (8) as in the previous section for the stopping rule, except that the factor M is replaced by max(M, ρ).\n\nTheorem 2.3. 
Suppose assumptions (Bounded), SC(r, ρ) and ED(s, D), with r + s ≥ 1/2. Assume unlabeled data is available with\n\nñ/n ≥ ( 16D² log²(6/γ)/n )^{−(1−2r)₊/(2r+s)}.\n\nThen with probability 1 − 3γ, the estimator f_m̂ obtained by the discrepancy stopping rule defined above satisfies\n\n‖f_m̂ − f*‖₂² ≤ c(r, τ₀)(M + ρ)² ( 16D² log²(6/γ)/n )^{2r/(2r+s)}.\n\nA sketch of the proof can be found in the supplementary material.\n\n3 Discussion and comparison to other results\n\nFor the inner case – i.e. f* ∈ H almost surely – we provide two different consistent stopping criteria. The first one (Section 2.1) is oblivious to the effective dimension parameter s, and the obtained bound corresponds to the “worst case” with respect to this parameter (that is, s = 1). However, an interesting feature of stopping rule (7) is that the rule itself does not depend on a priori knowledge of the regularity parameter r, while the achieved learning rate does (and with the optimal dependence on r when s = 1). Hence, Theorem 2.1 implies that the obtained rule is automatically adaptive with respect to the regularity of the target function. This contrasts with the results obtained in [1] for linear regularization schemes of the form (3) (also in the case s = 1), for which the choice of the regularization parameter λ leading to optimal learning rates requires knowledge of r beforehand.\n\nWhen also taking into account the effective dimensionality parameter s, Theorem 2.2 provides the order-optimal convergence rate in the inner case (up to a log factor). A noticeable difference to Theorem 2.1, however, is that the stopping rule is no longer adaptive, that is, it depends on a priori knowledge of the parameters r and s. 
We observe that previously obtained results for linear regularization schemes of the form (2) in [6] and of the form (3) in [5] also rely on a priori knowledge of r and s to determine the appropriate regularization parameter λ.\n\nThe outer case – when the target function does not lie in the reproducing kernel Hilbert space H – is more challenging and to some extent less well understood. The fact that additional assumptions are made is not a particular artefact of CG methods, but also appears in the studies of other regularization techniques. Here we follow the semi-supervised approach that is proposed in e.g. [5] (to study linear regularization of the form (3)) and assume that we have sufficient additional unlabeled data in order to ensure learning rates that are optimal as a function of the number of labeled data. We remark that other forms of additional requirements can be found in the recent literature in order to reach optimal rates. For the regularized M-estimation schemes studied in [20], availability of unlabeled data is not required, but a condition is imposed of the form ‖f‖_∞ ≤ C ‖f‖_H^p ‖f‖₂^{1−p} for all f ∈ H and some p ∈ (0, 1]. In [13], assumptions on the supremum norm of the eigenfunctions of the kernel integral operator are made (see [20] for an in-depth discussion of this type of assumptions).\n\nFinally, as explained in the introduction, the term ‘conjugate gradients’ comprises a class of methods that approximate the solution of linear equations on Krylov subspaces. In the context of learning, our approach is most closely linked to Partial Least Squares (PLS) [21] and its kernel extension [18]. 
While PLS has proven to be successful in a wide range of applications and is considered one of the standard approaches in chemometrics, there are only few studies of its theoretical properties. In [8, 14], consistency properties are provided for linear PLS under the assumption that the target function f* depends on a finite known number of orthogonal latent components. These findings were recently extended to the nonlinear case and without the assumption of a latent components model [3], but all results come without optimal rates of convergence. For the slightly different CG approach studied by Ong et al. [17], bounds on the difference between the empirical risks of the CG approximation and of the target function are derived in [16], but no bounds on the generalization error were obtained.\n\n4 Proofs\n\nConvergence rates for regularization methods of the type (2) or (3) have been studied by casting kernel learning methods into the framework of inverse problems (see [9]). We use this framework for the present results as well, and recapitulate here some important facts.\n\nWe first define the empirical evaluation operator Tn as follows:\n\nTn : g ∈ H ↦ Tn g := (g(X₁), . . . , g(Xₙ))ᵀ ∈ Rn,\n\nand the empirical integral operator Tn* as:\n\nTn* : u = (u₁, . . . , uₙ) ∈ Rn ↦ Tn* u := (1/n) Σ_{i=1}^n u_i k(Xi, ·) ∈ H.\n\nUsing the reproducing property of the kernel, it can be readily checked that Tn and Tn* are adjoint operators, i.e. they satisfy ⟨Tn* u, g⟩_H = ⟨u, Tn g⟩ for all u ∈ Rn, g ∈ H. Furthermore, Kn = Tn Tn*, and therefore ‖α‖_{Kn} = ‖f_α‖_H. Based on these facts, equation (5) can be rewritten as\n\nf_m = argmin_{f ∈ K_m(Tn* Υ, Sn)} ‖Tn* Υ − Sn f‖_H, (9)\n\nwhere Sn = Tn* Tn is a self-adjoint operator of H, called the empirical covariance operator. 
This definition corresponds to that of the “usual” conjugate gradient algorithm formally applied to the so-called normal equation (in H)\n\nSn f_α = Tn* Υ,\n\nwhich is obtained from (1) by left multiplication by Tn*. The advantage of this reformulation is that it can be interpreted as a “perturbation” of a population, noiseless version (of the equation and of the algorithm), wherein Υ is replaced by the target function f* and the empirical operators Tn*, Tn are respectively replaced by their population analogues, the kernel integral operator\n\nT* : g ∈ L2(PX) ↦ T* g := ∫ k(·, x) g(x) dPX(x) = E[k(X, ·) g(X)] ∈ H,\n\nand the change-of-space operator\n\nT : g ∈ H ↦ g ∈ L2(PX).\n\nThe latter maps a function to itself but between two Hilbert spaces which differ with respect to their geometry – the inner product of H being defined by the kernel function k, while the inner product of L2(PX) depends on the data generating distribution (this operator is well defined: since the kernel is bounded, all functions in H are bounded and therefore square integrable under any distribution PX).\n\nThe following results, taken from [1] (Propositions 21 and 22), quantify more precisely that the empirical covariance operator Sn = Tn* Tn and the empirical integral operator applied to the data, Tn* Υ, are close to the population covariance operator S = T* T and to the kernel integral operator applied to the noiseless target function, T* f*, respectively.\n\nProposition 4.1. Assume that k(x, x) ≤ κ for all x ∈ X. 
Then the following holds:\n\nP[ ‖Sn − S‖_HS ≤ (4κ/√n) log(2/γ) ] ≥ 1 − γ, (10)\n\nwhere ‖·‖_HS denotes the Hilbert–Schmidt norm. If the representation f* = T f*_H holds, and under assumption (Bernstein), we have the following:\n\nP[ ‖Tn* Υ − S f*_H‖ ≤ (4M√κ/√n) log(2/γ) ] ≥ 1 − γ. (11)\n\nWe note that f* = T f*_H implies that the target function f* coincides with a function f*_H belonging to H (remember that T is just the change-of-space operator). Hence, the second result (11) is valid in the case r ≥ 1/2, but it is not true in general for r < 1/2.\n\n4.1 Nemirovskii's result on conjugate gradient regularization rates\n\nWe recall a sharp result due to Nemirovskii [15] establishing convergence rates for conjugate gradient methods in a deterministic context. We present the result in an abstract context, then show how, combined with the previous section, it leads to a proof of Theorem 2.1. Consider the linear equation\n\nA z* = b,\n\nwhere A is a bounded linear operator over a Hilbert space H. Assume that the above equation has a solution and denote by z* its minimal norm solution; assume further that a self-adjoint operator Ā and an element b̄ ∈ H are known such that\n\n‖A − Ā‖ ≤ δ; ‖b − b̄‖ ≤ ε (12)\n\n(with δ and ε known positive numbers). Consider the CG algorithm based on the noisy operator Ā and data b̄, giving the output at step m\n\nz_m = argmin_{z ∈ K_m(Ā, b̄)} ‖Ā z − b̄‖². (13)\n\nThe discrepancy principle stopping rule is defined as follows. Consider a fixed constant τ > 1 and define\n\nm̄ = min{ m ≥ 0 : ‖Ā z_m − b̄‖ < τ(δ ‖z_m‖ + ε) }.\n\nWe output the solution obtained at step max(0, m̄ − 1). Consider a minor variation of this rule:\n\nm̂ = m̄ if q_m̄(0) < η δ⁻¹, and m̂ = max(0, m̄ − 1) otherwise,\n\nwhere q_m̄ is the degree m̄ − 1 polynomial such that z_m̄ = q_m̄(Ā) b̄, and η is an arbitrary positive constant such that η < 1/τ. Nemirovskii established the following theorem:\n\nTheorem 4.2. Assume that (a) max(‖A‖, ‖Ā‖) ≤ L; and that (b) z* = A^μ u* with ‖u*‖ ≤ R for some μ > 0. Then for any θ ∈ [0, 1], provided that m̂ < ∞, it holds that\n\n‖A^θ (z_m̂ − z*)‖² ≤ c(μ, τ, η) R^{2(1−θ)/(1+μ)} (ε + δ R L^μ)^{2(θ+μ)/(1+μ)}.\n\n4.2 Proof of Theorem 2.1\n\nWe apply Nemirovskii's result in our setting (assuming r ≥ 1/2): By identifying the approximate operator and data as Ā = Sn and b̄ = Tn* Υ, we see that the CG algorithm considered by Nemirovskii (13) is exactly (9), more precisely with the identification z_m = f_m. For the population version, we identify A = S and z* = f*_H (remember that provided r ≥ 1/2 in the source condition, there exists f*_H ∈ H such that f* = T f*_H). Condition (a) of Nemirovskii's Theorem 4.2 is satisfied with L = κ by the boundedness of the kernel. Condition (b) is satisfied with μ = r − 1/2 ≥ 0 and R = κ^{−r} ρ, as implied by the source condition SC(r, ρ). Finally, the concentration result of Proposition 4.1 ensures that the approximation conditions (12) are satisfied with probability 1 − 2γ, more precisely with δ = (4κ/√n) log(4/γ) and ε = (4M√κ/√n) log(4/γ). (Here we replaced γ in (10) and (11) by γ/2, so that the two conditions are satisfied simultaneously, by the union bound.) 
The operator norm is upper bounded by the Hilbert–Schmidt norm, so that the deviation inequality for the operators is actually stronger than what is needed. We consider the discrepancy principle stopping rule associated to these parameters, the choice η = 1/(2τ), and θ = 1/2, thus obtaining the result, since\n\n‖A^{1/2}(z_m̂ − z*)‖² = ‖S^{1/2}(f_m̂ − f*_H)‖²_H = ‖f_m̂ − f*_H‖₂².\n\n4.3 Notes on the proof of Theorems 2.2 and 2.3\n\nThe above proof shows that an application of Nemirovskii's fundamental result for CG regularization of inverse problems under deterministic noise (on the data and the operator) allows us to obtain our first result. One key ingredient is the concentration property of Proposition 4.1, which allows to bound deviations in a quasi-deterministic manner.\n\nTo prove the sharper results of Theorems 2.2 and 2.3, such a direct approach unfortunately does not work, and a complete rework and extension of the proof is necessary. The proof of Theorem 2.2 is presented in the supplementary material to the paper. In a nutshell, the concentration result of Proposition 4.1 is too coarse to prove the optimal rates of convergence taking into account the effective dimension parameter. Instead of that result, we have to consider the deviations from the mean in a “warped” norm, i.e. of the form ‖(S + λI)^{−1/2}(Tn* Υ − T* f*)‖ for the data, and ‖(S + λI)^{−1/2}(Sn − S)‖_HS for the operator (with an appropriate choice of λ > 0), respectively. Deviations of this form were introduced and used in [5, 6] to obtain sharp rates in the framework of Tikhonov's regularization (2) and of the more general linear regularization schemes of the form (3). Bounds on deviations of this form can be obtained via a Bernstein-type concentration inequality for Hilbert-space valued random variables.\n\nOn the one hand, the results concerning linear regularization schemes of the form (3) do not apply to the nonlinear CG regularization. On the other hand, Nemirovskii's result does not apply to deviations controlled in the warped norm. Moreover, the “outer” case introduces additional technical difficulties. Therefore, the proofs for Theorems 2.2 and 2.3, while still following the overall fundamental structure and ideas introduced by Nemirovskii, are significantly different in that context. As mentioned above, we present the complete proof of Theorem 2.2 in the supplementary material and a sketch of the proof of Theorem 2.3.\n\n5 Conclusion\n\nIn this work, we derived early stopping rules for kernel Conjugate Gradient regression that provide optimal rates of convergence to the true target function. Depending on the situation that we study, the rates are adaptive with respect to the regularity of the target function in some cases. The proofs of our results rely most importantly on ideas introduced by Nemirovskii [15] and further developed by Hanke [11] for CG methods in the deterministic case, and moreover on ideas inspired by [5, 6]. Certainly, in practice, as for a large majority of learning algorithms, cross-validation remains the standard approach for model selection. 
The motivation of this work is, however, mainly theoretical, and our overall goal is to show that from the learning theoretical point of view, CG regularization stands on equal footing with other well-studied regularization methods such as kernel ridge regression or more general linear regularization methods (which include, among many others, L2-boosting). We also note that theoretically well-grounded model selection rules can generally help cross-validation in practice by providing a well-calibrated parametrization of regularizer functions, or, as is the case here, of thresholds used in the stopping rule.\n\nOne crucial property used in the proofs is that the proposed CG regularization scheme can be conveniently cast in the reproducing kernel Hilbert space H, as displayed in e.g. (9). This reformulation is not possible for Kernel Partial Least Squares: It is also a CG type method, but it uses the standard euclidean norm instead of the Kn-norm used here. This point is the main technical justification of why we focus on (5) rather than on kernel PLS. Obtaining optimal convergence rates also valid for Kernel PLS is an important future direction and should build on the present work.\n\nAnother important direction for future efforts is the derivation of stopping rules that do not depend on the confidence parameter γ. Currently, this dependence prevents us from going from convergence in high probability to convergence in expectation, which would be desirable. Perhaps more importantly, it would be of interest to find a stopping rule that is adaptive to both parameters r (target function regularity) and s (effective dimension parameter) without their a priori knowledge. We recall that our first stopping rule is adaptive to r, but at the price of being worst-case in s. 
In the literature on linear regularization methods, the optimal choice of the regularization parameter is likewise non-adaptive, whether one considers optimal rates with respect to r only [1] or to both r and s [5]. One approach to alleviating this problem is to use a hold-out sample for model selection; this was studied theoretically in [7] for linear regularization methods (see also [4] for an account of the properties of hold-out in a general setup). We strongly believe that the hold-out method will yield theoretically founded adaptive model selection for CG as well. However, hold-out is often regarded as inelegant, since it requires setting aside part of the data from the estimation phase. It would therefore be of greater interest to study model selection methods that use the whole data in the estimation phase. The application of Lepskii's method is a possible step in this direction.

References

[1] F. Bauer, S. Pereverzev, and L. Rosasco. On Regularization Algorithms in Learning Theory. Journal of Complexity, 23:52–72, 2007.

[2] N. Bissantz, T. Hohage, A. Munk, and F. Ruymgaart. Convergence Rates of General Regularization Methods for Statistical Inverse Problems and Applications. SIAM Journal on Numerical Analysis, 45(6):2610–2636, 2007.

[3] G. Blanchard and N. Krämer. Kernel Partial Least Squares is Universally Consistent. Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, JMLR Workshop & Conference Proceedings, 9:57–64, 2010.

[4] G. Blanchard and P. Massart. Discussion of V. Koltchinskii's "Local Rademacher complexities and oracle inequalities in risk minimization". Annals of Statistics, 34(6):2664–2671, 2006.

[5] A. Caponnetto. Optimal Rates for Regularization Operators in Learning Theory. Technical Report CBCL Paper 264 / CSAIL-TR 2006-062, Massachusetts Institute of Technology, 2006.

[6] A. Caponnetto and E. De Vito.
Optimal Rates for the Regularized Least-Squares Algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007.

[7] A. Caponnetto and Y. Yao. Cross-validation based Adaptation for Regularization Operators in Learning Theory. Analysis and Applications, 8(2):161–183, 2010.

[8] H. Chun and S. Keles. Sparse Partial Least Squares for Simultaneous Dimension Reduction and Variable Selection. Journal of the Royal Statistical Society B, 72(1):3–25, 2010.

[9] E. De Vito, L. Rosasco, A. Caponnetto, U. De Giovannini, and F. Odone. Learning from Examples as an Inverse Problem. Journal of Machine Learning Research, 6:883–904, 2005.

[10] L. Györfi, M. Kohler, A. Krzyzak, and H. Walk. A Distribution-Free Theory of Nonparametric Regression. Springer, 2002.

[11] M. Hanke. Conjugate Gradient Type Methods for Linear Ill-posed Problems. Pitman Research Notes in Mathematics Series, 327, 1995.

[12] L. Lo Gerfo, L. Rosasco, F. Odone, E. De Vito, and A. Verri. Spectral Algorithms for Supervised Learning. Neural Computation, 20:1873–1897, 2008.

[13] S. Mendelson and J. Neeman. Regularization in Kernel Learning. The Annals of Statistics, 38(1):526–565, 2010.

[14] P. Naik and C.L. Tsai. Partial Least Squares Estimator for Single-index Models. Journal of the Royal Statistical Society B, 62(4):763–771, 2000.

[15] A. S. Nemirovskii. The Regularizing Properties of the Adjoint Gradient Method in Ill-posed Problems. USSR Computational Mathematics and Mathematical Physics, 26(2):7–16, 1986.

[16] C. S. Ong. Kernels: Regularization and Optimization. Doctoral dissertation, Australian National University, 2005.

[17] C. S. Ong, X. Mary, S. Canu, and A. J. Smola. Learning with Non-positive Kernels. In Proceedings of the 21st International Conference on Machine Learning, pages 639–646, 2004.

[18] R. Rosipal and L.J. Trejo.
Kernel Partial Least Squares Regression in Reproducing Kernel Hilbert Spaces. Journal of Machine Learning Research, 2:97–123, 2001.

[19] R. Rosipal, L.J. Trejo, and B. Matthews. Kernel PLS-SVC for Linear and Nonlinear Classification. In Proceedings of the Twentieth International Conference on Machine Learning, pages 640–647, Washington, DC, 2003.

[20] I. Steinwart, D. Hush, and C. Scovel. Optimal Rates for Regularized Least Squares Regression. In Proceedings of the 22nd Annual Conference on Learning Theory, pages 79–93, 2009.

[21] S. Wold, A. Ruhe, H. Wold, and W.J. Dunn III. The Collinearity Problem in Linear Regression. The Partial Least Squares (PLS) Approach to Generalized Inverses. SIAM Journal on Scientific and Statistical Computing, 5:735–743, 1984.

[22] T. Zhang. Learning Bounds for Kernel Regression Using Effective Data Dimensionality. Neural Computation, 17(9):2077–2098, 2005.