{"title": "Data-driven calibration of linear estimators with minimal penalties", "book": "Advances in Neural Information Processing Systems", "page_first": 46, "page_last": 54, "abstract": "This paper tackles the problem of selecting among several linear estimators in non-parametric regression; this includes model selection for linear regression, the choice of a regularization parameter in kernel ridge regression or spline smoothing, and the choice of a kernel in multiple kernel learning. We propose a new algorithm which first estimates consistently the variance of the noise, based upon the concept of minimal penalty which was previously introduced in the context of model selection. Then, plugging our variance estimate in Mallows $C_L$ penalty is proved to lead to an algorithm satisfying an oracle inequality. Simulation experiments with kernel ridge regression and multiple kernel learning show that the proposed algorithm often improves significantly existing calibration procedures such as 10-fold cross-validation or generalized cross-validation.", "full_text": "Data-driven calibration of linear estimators\n\nwith minimal penalties\n\nFrancis Bach \u2020\n\nINRIA ; Willow Project-Team\nLaboratoire d\u2019Informatique de\nl\u2019Ecole Normale Superieure\n\n(CNRS/ENS/INRIA UMR 8548)\n\nSylvain Arlot \u2217\n\nCNRS ; Willow Project-Team\nLaboratoire d\u2019Informatique de\nl\u2019Ecole Normale Superieure\n\n(CNRS/ENS/INRIA UMR 8548)\n\n23, avenue d\u2019Italie, F-75013 Paris, France\n\n23, avenue d\u2019Italie, F-75013 Paris, France\n\nsylvain.arlot@ens.fr\n\nfrancis.bach@ens.fr\n\nAbstract\n\nThis paper tackles the problem of selecting among several linear estimators in\nnon-parametric regression; this includes model selection for linear regression, the\nchoice of a regularization parameter in kernel ridge regression or spline smooth-\ning, and the choice of a kernel in multiple kernel learning. 
We propose a new algorithm which first consistently estimates the variance of the noise, based upon the concept of minimal penalty, which was previously introduced in the context of model selection. Then, plugging our variance estimate into Mallows' $C_L$ penalty is proved to lead to an algorithm satisfying an oracle inequality. Simulation experiments with kernel ridge regression and multiple kernel learning show that the proposed algorithm often significantly improves on existing calibration procedures such as 10-fold cross-validation or generalized cross-validation.\n\n1 Introduction\n\nKernel-based methods are now well-established tools for supervised learning, allowing one to perform various tasks, such as regression or binary classification, with linear and non-linear predictors [1, 2]. A central issue common to all regularization frameworks is the choice of the regularization parameter: while most practitioners use cross-validation procedures to select such a parameter, data-driven procedures not based on cross-validation are rarely used. The choice of the kernel, a seemingly unrelated issue, is also important for good predictive performance: several techniques exist, either based on cross-validation, Gaussian processes or multiple kernel learning [3, 4, 5].\n\nIn this paper, we consider least-squares regression and cast these two problems as the problem of selecting among several linear estimators, where the goal is to choose an estimator with a quadratic risk which is as small as possible.
This problem includes for instance model selection for linear regression, the choice of a regularization parameter in kernel ridge regression or spline smoothing, and the choice of a kernel in multiple kernel learning (see Section 2).\n\nThe main contribution of the paper is to extend the notion of minimal penalty [6, 7] to all discrete classes of linear operators, and to use it for defining a fully data-driven selection algorithm satisfying a non-asymptotic oracle inequality. Our new theoretical results presented in Section 4 extend similar results which were limited to unregularized least-squares regression (i.e., projection operators). Finally, in Section 5, we show that our algorithm improves the performances of classical selection procedures, such as GCV [8] and 10-fold cross-validation, for kernel ridge regression or multiple kernel learning, for moderate values of the sample size.\n\n\u2217 http://www.di.ens.fr/~arlot/\n\u2020 http://www.di.ens.fr/~fbach/\n\n2 Linear estimators\n\nIn this section, we define the problem we aim to solve and give several examples of linear estimators.\n\n
2.1 Framework and notation\n\nLet us assume that one observes\n\n$$Y_i = f(x_i) + \\varepsilon_i \\in \\mathbb{R} \\quad \\text{for} \\quad i = 1, \\ldots, n ,$$\n\nwhere $\\varepsilon_1, \\ldots, \\varepsilon_n$ are i.i.d. centered random variables with $\\mathbb{E}[\\varepsilon_i^2] = \\sigma^2$ unknown, $f$ is an unknown measurable function $\\mathcal{X} \\to \\mathbb{R}$ and $x_1, \\ldots, x_n \\in \\mathcal{X}$ are deterministic design points. No assumption is made on the set $\\mathcal{X}$. The goal is to reconstruct the signal $F = (f(x_i))_{1 \\le i \\le n} \\in \\mathbb{R}^n$, with some estimator $\\widehat{F} \\in \\mathbb{R}^n$, depending only on $(x_1, Y_1), \\ldots, (x_n, Y_n)$, and having a small quadratic risk $n^{-1} \\|\\widehat{F} - F\\|_2^2$, where $\\forall t \\in \\mathbb{R}^n$, we denote by $\\|t\\|_2$ the $\\ell_2$-norm of $t$, defined as $\\|t\\|_2^2 := \\sum_{i=1}^n t_i^2$.\n\nIn this paper, we focus on linear estimators $\\widehat{F}$ that can be written as a linear function of $Y = (Y_1, \\ldots, Y_n) \\in \\mathbb{R}^n$, that is, $\\widehat{F} = AY$, for some (deterministic) $n \\times n$ matrix $A$. Here and in the rest of the paper, vectors such as $Y$ or $F$ are assumed to be column-vectors. We present in Section 2.2 several important families of estimators of this form. The matrix $A$ may depend on $x_1, \\ldots, x_n$ (which are known and deterministic), but not on $Y$, and may be parameterized by certain quantities, usually regularization parameters or kernel combination weights.\n\n
2.2 Examples of linear estimators\n\nIn this paper, our theoretical results apply to matrices $A$ which are symmetric positive semi-definite, such as the ones defined below.\n\nOrdinary least-squares regression / model selection. If we consider linear predictors from a design matrix $X \\in \\mathbb{R}^{n \\times p}$, then $\\widehat{F} = AY$ with $A = X(X^\\top X)^{-1}X^\\top$, which is a projection matrix (i.e., $A^\\top A = A$); $\\widehat{F} = AY$ is often called a projection estimator. In the variable selection setting, one wants to select a subset $J \\subset \\{1, \\ldots, p\\}$, and matrices $A$ are parameterized by $J$.\n\nKernel ridge regression / spline smoothing. We assume that a positive definite kernel $k : \\mathcal{X} \\times \\mathcal{X} \\to \\mathbb{R}$ is given, and we are looking for a function $f : \\mathcal{X} \\to \\mathbb{R}$ in the associated reproducing kernel Hilbert space (RKHS) $\\mathcal{F}$, with norm $\\| \\cdot \\|_{\\mathcal{F}}$. If $K$ denotes the $n \\times n$ kernel matrix, defined by $K_{ab} = k(x_a, x_b)$, then the ridge regression estimator (a.k.a. spline smoothing estimator for spline kernels [9]) is obtained by minimizing with respect to $f \\in \\mathcal{F}$ [2]:\n\n$$\\frac{1}{n} \\sum_{i=1}^n (Y_i - f(x_i))^2 + \\lambda \\|f\\|_{\\mathcal{F}}^2 .$$\n\nThe unique solution is equal to $\\widehat{f} = \\sum_{i=1}^n \\alpha_i k(\\cdot, x_i)$, where $\\alpha = (K + n\\lambda I_n)^{-1} Y$. This leads to the smoothing matrix $A_\\lambda = K(K + n\\lambda I_n)^{-1}$, parameterized by the regularization parameter $\\lambda \\in \\mathbb{R}_+$.\n\nMultiple kernel learning / Group Lasso / Lasso.
We now assume that we have $p$ different kernels $k_j$, feature spaces $\\mathcal{F}_j$ and feature maps $\\Phi_j : \\mathcal{X} \\to \\mathcal{F}_j$, $j = 1, \\ldots, p$. The group Lasso [10] and multiple kernel learning [11, 5] frameworks consider the following objective function\n\n$$J(f_1, \\ldots, f_p) = \\frac{1}{n} \\sum_{i=1}^n \\Big( y_i - \\sum_{j=1}^p \\langle f_j, \\Phi_j(x_i) \\rangle \\Big)^2 + 2\\lambda \\sum_{j=1}^p \\|f_j\\|_{\\mathcal{F}_j} = L(f_1, \\ldots, f_p) + 2\\lambda \\sum_{j=1}^p \\|f_j\\|_{\\mathcal{F}_j} .$$\n\nNote that when $\\Phi_j(x)$ is simply the $j$-th coordinate of $x \\in \\mathbb{R}^p$, we get back the penalization by the $\\ell_1$-norm and thus the regular Lasso [12].\n\nUsing $a^{1/2} = \\min_{b > 0} \\frac{1}{2} \\{ \\frac{a}{b} + b \\}$, we obtain a variational formulation of the sum of norms $2 \\sum_{j=1}^p \\|f_j\\| = \\min_{\\eta \\in \\mathbb{R}_+^p} \\sum_{j=1}^p \\{ \\frac{\\|f_j\\|^2}{\\eta_j} + \\eta_j \\}$. Thus, minimizing $J(f_1, \\ldots, f_p)$ with respect to $(f_1, \\ldots, f_p)$ is equivalent to minimizing with respect to $\\eta \\in \\mathbb{R}_+^p$ (see [5] for more details):\n\n$$\\min_{f_1, \\ldots, f_p} L(f_1, \\ldots, f_p) + \\lambda \\sum_{j=1}^p \\frac{\\|f_j\\|^2}{\\eta_j} + \\lambda \\sum_{j=1}^p \\eta_j = \\lambda \\, y^\\top \\Big( \\sum_{j=1}^p \\eta_j K_j + n\\lambda I_n \\Big)^{-1} y + \\lambda \\sum_{j=1}^p \\eta_j ,$$\n\nwhere $I_n$ is the $n \\times n$ identity matrix. Moreover, given $\\eta$, this leads to a smoothing matrix of the form\n\n$$A_{\\eta,\\lambda} = \\Big( \\sum_{j=1}^p \\eta_j K_j \\Big) \\Big( \\sum_{j=1}^p \\eta_j K_j + n\\lambda I_n \\Big)^{-1} , \\qquad (1)$$\n\nparameterized by the regularization parameter $\\lambda \\in \\mathbb{R}_+$ and the kernel combinations in $\\mathbb{R}_+^p$; note that it depends only on $\\lambda^{-1}\\eta$, which can be grouped in a single parameter in $\\mathbb{R}_+^p$.\n\nThus, the Lasso/group Lasso can be seen as particular (convex) ways of optimizing over $\\eta$. In this paper, we propose a non-convex alternative with better statistical properties (oracle inequality in Theorem 1). Note that in our setting, finding the solution of the problem is hard in general since the optimization is not convex.
However, while the model selection problem is by nature combinatorial, our optimization problems for multiple kernels are all differentiable and are thus amenable to gradient descent procedures, which only find local optima.\n\nNon symmetric linear estimators. Other linear estimators are commonly used, such as nearest-neighbor regression or the Nadaraya-Watson estimator [13]; those however lead to non symmetric matrices $A$, and are not entirely covered by our theoretical results.\n\n3 Linear estimator selection\n\nIn this section, we first describe the statistical framework of linear estimator selection and introduce the notion of minimal penalty.\n\n
3.1 Unbiased risk estimation heuristics\n\nUsually, several estimators of the form $\\widehat{F} = AY$ can be used. The problem that we consider in this paper is then to select one of them, that is, to choose a matrix $A$. Let us assume that a family of matrices $(A_\\lambda)_{\\lambda \\in \\Lambda}$ is given (examples are shown in Section 2.2), hence a family of estimators $(\\widehat{F}_\\lambda)_{\\lambda \\in \\Lambda}$ can be used, with $\\widehat{F}_\\lambda := A_\\lambda Y$. The goal is to choose from data some $\\widehat{\\lambda} \\in \\Lambda$, so that the quadratic risk of $\\widehat{F}_{\\widehat{\\lambda}}$ is as small as possible.\n\nThe best choice would be the oracle:\n\n$$\\lambda^\\star \\in \\arg\\min_{\\lambda \\in \\Lambda} \\{ n^{-1} \\|\\widehat{F}_\\lambda - F\\|_2^2 \\} ,$$\n\nwhich cannot be used since it depends on the unknown signal $F$. Therefore, the goal is to define a data-driven $\\widehat{\\lambda}$ satisfying an oracle inequality\n\n$$n^{-1} \\|\\widehat{F}_{\\widehat{\\lambda}} - F\\|_2^2 \\le C_n \\inf_{\\lambda \\in \\Lambda} \\{ n^{-1} \\|\\widehat{F}_\\lambda - F\\|_2^2 \\} + R_n , \\qquad (2)$$\n\nwith large probability, where the leading constant $C_n$ should be close to 1 (at least for large $n$) and the remainder term $R_n$ should be negligible compared to the risk of the oracle.\n\n
Many classical selection methods are built upon the \u201cunbiased risk estimation\u201d heuristics: If $\\widehat{\\lambda}$ minimizes a criterion $\\mathrm{crit}(\\lambda)$ such that\n\n$$\\forall \\lambda \\in \\Lambda, \\quad \\mathbb{E}[ \\mathrm{crit}(\\lambda) ] \\approx \\mathbb{E}[ n^{-1} \\|\\widehat{F}_\\lambda - F\\|_2^2 ] ,$$\n\nthen $\\widehat{\\lambda}$ satisfies an oracle inequality such as in Eq. (2) with large probability. For instance, cross-validation [14, 15] and generalized cross-validation (GCV) [8] are built upon this heuristics.\n\nOne way of implementing this heuristics is penalization, which consists in minimizing the sum of the empirical risk and a penalty term, i.e., using a criterion of the form:\n\n$$\\mathrm{crit}(\\lambda) = n^{-1} \\|\\widehat{F}_\\lambda - Y\\|_2^2 + \\mathrm{pen}(\\lambda) .$$\n\nThe unbiased risk estimation heuristics, also called Mallows' heuristics, then leads to the ideal (deterministic) penalty\n\n$$\\mathrm{pen}_{\\mathrm{id}}(\\lambda) := \\mathbb{E}[ n^{-1} \\|\\widehat{F}_\\lambda - F\\|_2^2 ] - \\mathbb{E}[ n^{-1} \\|\\widehat{F}_\\lambda - Y\\|_2^2 ] .$$\n\n
When $\\widehat{F}_\\lambda = A_\\lambda Y$, we have:\n\n$$\\|\\widehat{F}_\\lambda - F\\|_2^2 = \\|(A_\\lambda - I_n)F\\|_2^2 + \\|A_\\lambda \\varepsilon\\|_2^2 + 2 \\langle A_\\lambda \\varepsilon, (A_\\lambda - I_n)F \\rangle , \\qquad (3)$$\n\n$$\\|\\widehat{F}_\\lambda - Y\\|_2^2 = \\|\\widehat{F}_\\lambda - F\\|_2^2 + \\|\\varepsilon\\|_2^2 - 2 \\langle \\varepsilon, A_\\lambda \\varepsilon \\rangle + 2 \\langle \\varepsilon, (I_n - A_\\lambda)F \\rangle , \\qquad (4)$$\n\nwhere $\\varepsilon = Y - F \\in \\mathbb{R}^n$ and $\\forall t, u \\in \\mathbb{R}^n$, $\\langle t, u \\rangle = \\sum_{i=1}^n t_i u_i$. Since $\\varepsilon$ is centered with covariance matrix $\\sigma^2 I_n$, Eq. (3) and Eq. (4) imply that\n\n$$\\mathrm{pen}_{\\mathrm{id}}(\\lambda) = \\frac{2 \\sigma^2 \\operatorname{tr}(A_\\lambda)}{n} , \\qquad (5)$$\n\nup to the term $-\\mathbb{E}[n^{-1} \\|\\varepsilon\\|_2^2] = -\\sigma^2$, which can be dropped since it does not vary with $\\lambda$.\n\nNote that $\\mathrm{df}(\\lambda) = \\operatorname{tr}(A_\\lambda)$ is called the effective dimensionality or degrees of freedom [16], so that the ideal penalty in Eq. (5) is proportional to the dimensionality associated with the matrix $A_\\lambda$; for projection matrices, we get back the dimension of the subspace, which is classical in model selection.\n\n
The expression of the ideal penalty in Eq. (5) led to several selection procedures, in particular Mallows' $C_L$ (called $C_p$ in the case of projection estimators) [17], where $\\sigma^2$ is replaced by some estimator $\\widehat{\\sigma^2}$. The estimator of $\\sigma^2$ usually used with $C_L$ is based upon the value of the empirical risk at some $\\lambda_0$ with $\\mathrm{df}(\\lambda_0)$ large; it has the drawback of overestimating the risk, in a way which depends on $\\lambda_0$ and $F$ [18]. GCV, which implicitly estimates $\\sigma^2$, has the drawback of overfitting if the family $(A_\\lambda)_{\\lambda \\in \\Lambda}$ contains a matrix too close to $I_n$ [19]; GCV also overestimates the risk even more than $C_L$ for most $A_\\lambda$ (see (7.9) and Table 4 in [18]).\n\nIn this paper, we define an estimator of $\\sigma^2$ directly related to the selection task which does not have similar drawbacks. Our estimator relies on the concept of minimal penalty, introduced by Birg\u00e9 and Massart [6] and further studied in [7].\n\n
3.2 Minimal and optimal penalties\n\nWe deduce from Eq. (3) the bias-variance decomposition of the risk:\n\n$$\\mathbb{E}[ n^{-1} \\|\\widehat{F}_\\lambda - F\\|_2^2 ] = n^{-1} \\|(A_\\lambda - I_n)F\\|_2^2 + \\frac{\\operatorname{tr}(A_\\lambda^\\top A_\\lambda) \\sigma^2}{n} = \\text{bias} + \\text{variance} , \\qquad (6)$$\n\nand from Eq. (4) the expectation of the empirical risk:\n\n$$\\mathbb{E}[ n^{-1} ( \\|\\widehat{F}_\\lambda - Y\\|_2^2 - \\|\\varepsilon\\|_2^2 ) ] = n^{-1} \\|(A_\\lambda - I_n)F\\|_2^2 - \\frac{( 2 \\operatorname{tr}(A_\\lambda) - \\operatorname{tr}(A_\\lambda^\\top A_\\lambda) ) \\sigma^2}{n} . \\qquad (7)$$\n\nNote that the variance term in Eq. (6) is not proportional to the effective dimensionality $\\mathrm{df}(\\lambda) = \\operatorname{tr}(A_\\lambda)$ but to $\\operatorname{tr}(A_\\lambda^\\top A_\\lambda)$. Although several papers argue these terms are of the same order (for instance, they are equal when $A_\\lambda$ is a projection matrix), this may not hold in general. If $A_\\lambda$ is symmetric with a spectrum $\\mathrm{Sp}(A_\\lambda) \\subset [0, 1]$, as in all the examples of Section 2.2, we only have\n\n$$0 \\le \\operatorname{tr}(A_\\lambda^\\top A_\\lambda) \\le \\operatorname{tr}(A_\\lambda) \\le 2 \\operatorname{tr}(A_\\lambda) - \\operatorname{tr}(A_\\lambda^\\top A_\\lambda) \\le 2 \\operatorname{tr}(A_\\lambda) . \\qquad (8)$$\n\n
In order to give a first intuitive interpretation of Eq. (6) and Eq. (7), let us consider the kernel ridge regression example and assume that the risk and the empirical risk behave as their expectations in Eq. (6) and Eq. (7); see also Fig. 1. Completely rigorous arguments based upon concentration inequalities are developed in [20] and summarized in Section 4, leading to the same conclusion as the present informal reasoning.\n\nFirst, as proved in [20], the bias $n^{-1} \\|(A_\\lambda - I_n)F\\|_2^2$ is a decreasing function of the dimensionality $\\mathrm{df}(\\lambda) = \\operatorname{tr}(A_\\lambda)$, and the variance $\\operatorname{tr}(A_\\lambda^\\top A_\\lambda) \\sigma^2 n^{-1}$ is an increasing function of $\\mathrm{df}(\\lambda)$, as well as $2 \\operatorname{tr}(A_\\lambda) - \\operatorname{tr}(A_\\lambda^\\top A_\\lambda)$. Therefore, Eq. (6) shows that the optimal $\\lambda$ realizes the best trade-off between bias (which decreases with $\\mathrm{df}(\\lambda)$) and variance (which increases with $\\mathrm{df}(\\lambda)$), which is a classical fact in model selection.\n\nSecond, the expectation of the empirical risk in Eq.
(7) can be decomposed into the bias and a negative variance term which is the opposite of\n\n$$\\mathrm{pen}_{\\min}(\\lambda) := n^{-1} ( 2 \\operatorname{tr}(A_\\lambda) - \\operatorname{tr}(A_\\lambda^\\top A_\\lambda) ) \\sigma^2 . \\qquad (9)$$\n\n[Figure 1 plot: bias, variance ($\\sim \\sigma^2 \\operatorname{tr} A^2$), generalization error ($\\sim \\text{bias} + \\sigma^2 \\operatorname{tr} A^2$) and empirical error minus $\\sigma^2$ ($\\sim \\text{bias} + \\sigma^2 \\operatorname{tr} A^2 - 2 \\sigma^2 \\operatorname{tr} A$), plotted against the degrees of freedom ($\\operatorname{tr} A$); vertical axis: generalization errors.]\n\nFigure 1: Bias-variance decomposition of the generalization error, and minimal/optimal penalties.\n\n
As suggested by the notation $\\mathrm{pen}_{\\min}$, we will show it is a minimal penalty in the following sense. If\n\n$$\\forall C \\ge 0, \\quad \\widehat{\\lambda}_{\\min}(C) \\in \\arg\\min_{\\lambda \\in \\Lambda} \\{ n^{-1} \\|\\widehat{F}_\\lambda - Y\\|_2^2 + C \\, \\mathrm{pen}_{\\min}(\\lambda) \\} ,$$\n\nthen, up to concentration inequalities that are detailed in Section 4.2, $\\widehat{\\lambda}_{\\min}(C)$ behaves like a minimizer of\n\n$$g_C(\\lambda) = \\mathbb{E}[ n^{-1} \\|\\widehat{F}_\\lambda - Y\\|_2^2 + C \\, \\mathrm{pen}_{\\min}(\\lambda) ] - \\sigma^2 = n^{-1} \\|(A_\\lambda - I_n)F\\|_2^2 + (C - 1) \\, \\mathrm{pen}_{\\min}(\\lambda) .$$\n\nTherefore, two main cases can be distinguished:\n\n\u2022 if $C < 1$, then $g_C(\\lambda)$ decreases with $\\mathrm{df}(\\lambda)$ so that $\\mathrm{df}(\\widehat{\\lambda}_{\\min}(C))$ is huge: $\\widehat{\\lambda}_{\\min}(C)$ overfits.\n\u2022 if $C > 1$, then $g_C(\\lambda)$ increases with $\\mathrm{df}(\\lambda)$ when $\\mathrm{df}(\\lambda)$ is large enough, so that $\\mathrm{df}(\\widehat{\\lambda}_{\\min}(C))$ is much smaller than when $C < 1$.\n\nAs a conclusion, $\\mathrm{pen}_{\\min}(\\lambda)$ is the minimal amount of penalization needed so that a minimizer $\\widehat{\\lambda}$ of a penalized criterion is not clearly overfitting.\n\n
Following an idea first proposed in [6] and further analyzed or used in several other papers such as [21, 7, 22], we now propose to use that $\\mathrm{pen}_{\\min}(\\lambda)$ is a minimal penalty for estimating $\\sigma^2$ and plug this estimator into Eq. (5). This leads to the algorithm described in Section 4.1.\n\nNote that the minimal penalty given by Eq. (9) is new; it generalizes previous results [6, 7] where $\\mathrm{pen}_{\\min}(A_\\lambda) = n^{-1} \\operatorname{tr}(A_\\lambda) \\sigma^2$ because all $A_\\lambda$ were assumed to be projection matrices, i.e., $A_\\lambda^\\top A_\\lambda = A_\\lambda$. Furthermore, our results generalize the slope heuristics $\\mathrm{pen}_{\\mathrm{id}} \\approx 2 \\, \\mathrm{pen}_{\\min}$ (only valid for projection estimators [6, 7]) to general linear estimators for which $\\mathrm{pen}_{\\mathrm{id}} / \\mathrm{pen}_{\\min} \\in (1, 2]$.\n\n
4 Main results\n\nIn this section, we first describe our algorithm and then present our theoretical results.\n\n4.1 Algorithm\n\nThe following algorithm first computes an estimator $\\widehat{C}$ of $\\sigma^2$ using the minimal penalty in Eq. (9), then considers the ideal penalty in Eq. (5) for selecting $\\lambda$.\n\nInput: $\\Lambda$ a finite set with $\\mathrm{Card}(\\Lambda) \\le K n^\\alpha$ for some $K, \\alpha \\ge 0$, and matrices $A_\\lambda$.\n\n\u2022 $\\forall C > 0$, compute $\\widehat{\\lambda}_0(C) \\in \\arg\\min_{\\lambda \\in \\Lambda} \\{ \\|\\widehat{F}_\\lambda - Y\\|_2^2 + C ( 2 \\operatorname{tr}(A_\\lambda) - \\operatorname{tr}(A_\\lambda^\\top A_\\lambda) ) \\}$.\n\u2022 Find $\\widehat{C}$ such that $\\mathrm{df}(\\widehat{\\lambda}_0(\\widehat{C})) \\in [ n^{3/4}, n/10 ]$.\n\u2022 Select $\\widehat{\\lambda} \\in \\arg\\min_{\\lambda \\in \\Lambda} \\{ \\|\\widehat{F}_\\lambda - Y\\|_2^2 + 2 \\widehat{C} \\operatorname{tr}(A_\\lambda) \\}$.\n\n
In steps 1 and 2 of the above algorithm, in practice, a grid in log-scale is used, and our theoretical results from the next section suggest using a step-size of order $n^{-1/4}$. Note that it may not be possible in all cases to find a $C$ such that $\\mathrm{df}(\\widehat{\\lambda}_0(C)) \\in [n^{3/4}, n/10]$; therefore, our condition in step 2 could be relaxed to finding a $\\widehat{C}$ such that for all $C > \\widehat{C} + \\delta$, $\\mathrm{df}(\\widehat{\\lambda}_0(C)) < n^{3/4}$ and for all $C < \\widehat{C} - \\delta$, $\\mathrm{df}(\\widehat{\\lambda}_0(C)) > n/10$, with $\\delta = n^{-1/4+\\xi}$, where $\\xi > 0$ is a small constant.\n\nAlternatively, using the same grid in log-scale, we can select $\\widehat{C}$ with maximal jump between successive values of $\\mathrm{df}(\\widehat{\\lambda}_0(C))$; note that our theoretical result then does not entirely hold, as we show the presence of a jump around $\\sigma^2$, but do not show the absence of similar jumps elsewhere.\n\n
4.2 Oracle inequality\n\nTheorem 1 Let $\\widehat{C}$ and $\\widehat{\\lambda}$ be defined as in the algorithm of Section 4.1, with $\\mathrm{Card}(\\Lambda) \\le K n^\\alpha$ for some $K, \\alpha \\ge 0$. Assume that $\\forall \\lambda \\in \\Lambda$, $A_\\lambda$ is symmetric with $\\mathrm{Sp}(A_\\lambda) \\subset [0, 1]$, that $\\varepsilon_i$ are i.i.d. Gaussian with variance $\\sigma^2 > 0$, and that $\\exists \\lambda_1, \\lambda_2 \\in \\Lambda$ with\n\n$$\\mathrm{df}(\\lambda_1) \\ge \\frac{n}{2} , \\quad \\mathrm{df}(\\lambda_2) \\le \\sqrt{n} , \\quad \\text{and} \\quad \\forall i \\in \\{1, 2\\} , \\; n^{-1} \\|(A_{\\lambda_i} - I_n)F\\|_2^2 \\le \\sigma^2 \\sqrt{\\frac{\\ln(n)}{n}} . \\qquad (A_{1-2})$$\n\nThen, a numerical constant $C_a$ and an event of probability at least $1 - 8Kn^{-2}$ exist on which, for every $n \\ge C_a$,\n\n$$\\Big( 1 - 91(\\alpha + 2) \\sqrt{\\frac{\\ln(n)}{n}} \\Big) \\sigma^2 \\le \\widehat{C} \\le \\Big( 1 + \\frac{44(\\alpha + 2) \\sqrt{\\ln(n)}}{n^{1/4}} \\Big) \\sigma^2 . \\qquad (10)$$\n\nFurthermore, if\n\n$$\\exists \\kappa \\ge 1, \\; \\forall \\lambda \\in \\Lambda, \\quad n^{-1} \\operatorname{tr}(A_\\lambda) \\sigma^2 \\le \\kappa \\, \\mathbb{E}[ n^{-1} \\|\\widehat{F}_\\lambda - F\\|_2^2 ] , \\qquad (A_3)$$\n\nthen, a constant $C_b$ depending only on $\\kappa$ exists such that for every $n \\ge C_b$, on the same event,\n\n$$n^{-1} \\|\\widehat{F}_{\\widehat{\\lambda}} - F\\|_2^2 \\le \\Big( 1 + \\frac{40 \\kappa}{\\ln(n)} \\Big) \\inf_{\\lambda \\in \\Lambda} \\{ n^{-1} \\|\\widehat{F}_\\lambda - F\\|_2^2 \\} + \\frac{36(\\kappa + \\alpha + 2) \\ln(n) \\sigma^2}{n} . \\qquad (11)$$\n\n
Theorem 1 is proved in [20]. The proof mainly follows from the informal arguments developed in Section 3.2, completed with the following two concentration inequalities: If $\\xi \\in \\mathbb{R}^n$ is a standard Gaussian random vector, $\\alpha \\in \\mathbb{R}^n$ and $M$ is a real-valued $n \\times n$ matrix, then for every $x \\ge 0$,\n\n$$\\mathbb{P}\\big( |\\langle \\alpha, \\xi \\rangle| \\le \\sqrt{2x} \\, \\|\\alpha\\|_2 \\big) \\ge 1 - 2e^{-x} , \\qquad (12)$$\n\n$$\\mathbb{P}\\big( \\forall \\theta > 0, \\; \\big| \\|M\\xi\\|_2^2 - \\operatorname{tr}(M^\\top M) \\big| \\le \\theta \\operatorname{tr}(M^\\top M) + 2(1 + \\theta^{-1}) \\|M\\|^2 x \\big) \\ge 1 - 2e^{-x} , \\qquad (13)$$\n\nwhere $\\|M\\|$ is the operator norm of $M$. A proof of Eq. (12) and (13) can be found in [20].\n\n
4.3 Discussion of the assumptions of Theorem 1\n\nGaussian noise. When $\\varepsilon$ is sub-Gaussian, Eq. (12) and Eq. (13) can be proved for $\\xi = \\sigma^{-1} \\varepsilon$ at the price of additional technicalities, which implies that Theorem 1 is still valid.\n\nSymmetry. The assumption that matrices $A_\\lambda$ must be symmetric can certainly be relaxed, since it is only used for deriving from Eq. (13) a concentration inequality for $\\langle A_\\lambda \\xi, \\xi \\rangle$. Note that $\\mathrm{Sp}(A_\\lambda) \\subset [0, 1]$ barely is an assumption since it means that $A_\\lambda$ actually shrinks $Y$.\n\nAssumptions $(A_{1-2})$. $(A_{1-2})$ holds if $\\max_{\\lambda \\in \\Lambda} \\{ \\mathrm{df}(\\lambda) \\} \\ge n/2$ and the bias is smaller than $c \\, \\mathrm{df}(\\lambda)^{-d}$ for some $c, d > 0$, a quite classical assumption in the context of model selection. Besides, $(A_{1-2})$ is much less restrictive and can even be relaxed, see [20].\n\n
Assumption $(A_3)$. The upper bound $(A_3)$ on $\\operatorname{tr}(A_\\lambda)$ is certainly the strongest assumption of Theorem 1, but it is only needed for Eq. (11). According to Eq. (6), $(A_3)$ holds with $\\kappa = 1$ when $A_\\lambda$ is a projection matrix since $\\operatorname{tr}(A_\\lambda^\\top A_\\lambda) = \\operatorname{tr}(A_\\lambda)$. In the kernel ridge regression framework, $(A_3)$ holds as soon as the eigenvalues of the kernel matrix $K$ decrease like $j^{-\\alpha}$; see [20]. In general, $(A_3)$ means that $\\widehat{F}_\\lambda$ should not have a risk smaller than the parametric convergence rate associated with a model of dimension $\\mathrm{df}(\\lambda) = \\operatorname{tr}(A_\\lambda)$.\n\nWhen $(A_3)$ does not hold, selecting among estimators whose risks are below the parametric rate is a rather difficult problem and it may not be possible to attain the risk of the oracle in general. Nevertheless, an oracle inequality can still be proved without $(A_3)$, at the price of enlarging $\\widehat{C}$ slightly and adding a small fraction of $\\sigma^2 n^{-1} \\operatorname{tr}(A_\\lambda)$ in the right-hand side of Eq. (11), see [20]. Enlarging $\\widehat{C}$ is necessary in general: If $\\operatorname{tr}(A_\\lambda^\\top A_\\lambda) \\ll \\operatorname{tr}(A_\\lambda)$ for most $\\lambda \\in \\Lambda$, the minimal penalty is very close to $2 \\sigma^2 n^{-1} \\operatorname{tr}(A_\\lambda)$, so that according to Eq. (10), overfitting is likely as soon as $\\widehat{C}$ underestimates $\\sigma^2$, even by a very small amount.\n\n
[Figure 2 plots: selected degrees of freedom against $\\log(C/\\sigma^2)$. Left legend: minimal penalty, optimal penalty / 2. Right legend: optimal/2, minimal (discrete), minimal (continuous).]\n\nFigure 2: Selected degrees of freedom vs. penalty strength $\\log(C/\\sigma^2)$: note that when penalizing by the minimal penalty, there is a strong jump at $C = \\sigma^2$, while when using half the optimal penalty, this is not the case. Left: single kernel case, Right: multiple kernel case.\n\n
4.4 Main consequences of Theorem 1 and comparison with previous results\n\nConsistent estimation of $\\sigma^2$. The first part of Theorem 1 shows that $\\widehat{C}$ is a consistent estimator of $\\sigma^2$ in a general framework and under mild assumptions. Compared to classical estimators of $\\sigma^2$, such as the one usually used with Mallows' $C_L$, $\\widehat{C}$ does not depend on the choice of some model assumed to have almost no bias, which can lead to overestimating $\\sigma^2$ by an unknown amount [18].\n\nOracle inequality. Our algorithm satisfies an oracle inequality with high probability, as shown by Eq. (11): The risk of the selected estimator $\\widehat{F}_{\\widehat{\\lambda}}$ is close to the risk of the oracle, up to a remainder term which is negligible when the dimensionality $\\mathrm{df}(\\lambda^\\star)$ grows with $n$ faster than $\\ln(n)$, a typical situation when the bias is never equal to zero, for instance in kernel ridge regression.\n\n
Several oracle inequalities have been proved in the statistical literature for Mallows' $C_L$ with a consistent estimator of $\\sigma^2$, for instance in [23]. Nevertheless, except for the model selection problem (see [6] and references therein), all previous results were asymptotic, meaning that $n$ is implicitly assumed to be large compared to each parameter of the problem. This assumption can be problematic for several learning problems, for instance in multiple kernel learning when the number $p$ of kernels may grow with $n$. On the contrary, Eq. (11) is non-asymptotic, meaning that it holds for every fixed $n$ as soon as the assumptions explicitly made in Theorem 1 are satisfied.\n\nComparison with other procedures. According to Theorem 1 and previous theoretical results [23, 19], $C_L$, GCV, cross-validation and our algorithm satisfy similar oracle inequalities in various frameworks. This should not lead to the conclusion that these procedures are completely equivalent. Indeed, second-order terms can be large for a given $n$, while they are hidden in asymptotic results and not tightly estimated by non-asymptotic results.
As shown by the simulations in Section 5, our algorithm yields statistical performances as good as existing methods, and often noticeably better.\n\nFurthermore, our algorithm never overfits too much because $\\mathrm{df}(\\widehat{\\lambda})$ is by construction smaller than the effective dimensionality of $\\widehat{\\lambda}_0(\\widehat{C})$ at which the jump occurs. This is a quite interesting property compared for instance to GCV, which is likely to overfit if it is not corrected because GCV minimizes a criterion proportional to the empirical risk.\n\n5 Simulations\n\nThroughout this section, we consider exponential kernels on $\\mathbb{R}^d$, $k(x, y) = \\prod_{i=1}^d e^{-|x_i - y_i|}$, with the $x$'s sampled i.i.d. from a standard multivariate Gaussian. The functions $f$ are then selected randomly as $\\sum_{i=1}^m \\alpha_i k(\\cdot, z_i)$, where both $\\alpha$ and $z$ are i.i.d. standard Gaussian (i.e., $f$ belongs to the RKHS).\n\n[Figure 3 plots. Left: mean(error / error oracle) against $\\log(n)$ for 10-fold CV, GCV and min. penalty. Right: mean(error / error Mallows) against $\\log(n)$ for MKL+CV, GCV, kernel sum and min. penalty.]\n\nFigure 3: Comparison of various smoothing parameter selection methods (minikernel, GCV, 10-fold cross-validation) for various numbers of observations, averaged over 20 replications. Left: single kernel, right: multiple kernels.\n\nJump. In Figure 2 (left), we consider data $x_i \\in \\mathbb{R}^6$, $n = 1000$, and study the size of the jump for kernel ridge regression. With half the optimal penalty (which is used in traditional variable selection for linear regression), we do not get any jump, while with the minimal penalty we always do.
In Figure 2 (right), we plot the same curves for the multiple kernel learning problem with two kernels on two different 4-dimensional variables, with similar results. In addition, we show two ways of optimizing over $\\lambda \\in \\Lambda = \\mathbb{R}_+^2$, by discrete optimization with $n$ different kernel matrices (a situation covered by Theorem 1) or with continuous optimization with respect to $\\eta$ in Eq. (1), by gradient descent (a situation not covered by Theorem 1).\n\nComparison of estimator selection methods. In Figure 3, we plot model selection results for 20 replications of data ($d = 4$, $n = 500$), comparing GCV [8], our minimal penalty algorithm, and cross-validation methods. In the left part (single kernel), we compare to the oracle (which can be computed because we can enumerate $\\Lambda$), and use for cross-validation all possible values of $\\lambda$. In the right part (multiple kernel), we compare to the performance of Mallows' $C_L$ when $\\sigma^2$ is known (i.e., penalty in Eq. (5)), and since we cannot enumerate all $\\lambda$'s, we use the solution obtained by MKL with CV [5]. We also compare to using our minimal penalty algorithm with the sum of kernels.\n\n6 Conclusion\n\nA new light on the slope heuristics. Theorem 1 generalizes some results first proved in [6] where all $A_\\lambda$ are assumed to be projection matrices, a framework where assumption $(A_3)$ is automatically satisfied. To this extent, Birg\u00e9 and Massart's slope heuristics has been modified in a way that sheds a new light on the \u201cmagical\u201d factor 2 between the minimal and the optimal penalty, as proved in [6, 7].
Indeed, Theorem 1 shows that for general linear estimators,\n\n$$\\frac{\\mathrm{pen}_{\\mathrm{id}}(\\lambda)}{\\mathrm{pen}_{\\min}(\\lambda)} = \\frac{2 \\operatorname{tr}(A_\\lambda)}{2 \\operatorname{tr}(A_\\lambda) - \\operatorname{tr}(A_\\lambda^\\top A_\\lambda)} , \\qquad (14)$$\n\nwhich can take any value in $(1, 2]$ in general; this ratio is only equal to 2 when $\\operatorname{tr}(A_\\lambda) \\approx \\operatorname{tr}(A_\\lambda^\\top A_\\lambda)$, hence mostly when $A_\\lambda$ is a projection matrix.\n\nFuture directions. In the case of projection estimators, the slope heuristics still holds when the design is random and data are heteroscedastic [7]; we would like to know whether Eq. (14) is still valid for heteroscedastic data with general linear estimators. In addition, the good empirical performances of elbow heuristics based algorithms (i.e., based on the sharp variation of a certain quantity around good hyperparameter values) suggest that Theorem 1 can be generalized to many learning frameworks (and potentially to non-linear estimators), probably with small modifications in the algorithm, but always relying on the concept of minimal penalty.\n\n
Another interesting open problem would be to extend the results of Section 4, where $\\mathrm{Card}(\\Lambda) \\le K n^\\alpha$ is assumed, to continuous sets $\\Lambda$ such as the ones appearing naturally in kernel ridge regression and multiple kernel learning. We conjecture that Theorem 1 is valid without modification for a \u201csmall\u201d continuous $\\Lambda$, such as in kernel ridge regression where taking a grid of size $n$ in log-scale is almost equivalent to taking $\\Lambda = \\mathbb{R}_+$. On the contrary, in applications such as the Lasso with $p \\gg n$ variables, the natural set $\\Lambda$ cannot be well covered by a grid of cardinality $n^\\alpha$ with $\\alpha$ small, and our minimal penalty algorithm and Theorem 1 certainly have to be modified.\n\nReferences\n\n[1] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.\n\n[2] B. Sch\u00f6lkopf and A. J. Smola. Learning with Kernels.
MIT Press, 2001.\n[3] O. Chapelle and V. Vapnik. Model selection for support vector machines. In Advances in Neural Information Processing Systems (NIPS), 1999.\n[4] C. E. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.\n[5] F. Bach. Consistency of the group Lasso and multiple kernel learning. Journal of Machine Learning Research, 9:1179\u20131225, 2008.\n[6] L. Birg\u00e9 and P. Massart. Minimal penalties for Gaussian model selection. Probab. Theory Related Fields, 138(1-2):33\u201373, 2007.\n[7] S. Arlot and P. Massart. Data-driven calibration of penalties for least-squares regression. J. Mach. Learn. Res., 10:245\u2013279, 2009.\n[8] P. Craven and G. Wahba. Smoothing noisy data with spline functions. Estimating the correct degree of smoothing by the method of generalized cross-validation. Numer. Math., 31(4):377\u2013403, 1978/79.\n[9] G. Wahba. Spline Models for Observational Data. SIAM, 1990.\n[10] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of The Royal Statistical Society Series B, 68(1):49\u201367, 2006.\n[11] G. R. G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. I. Jordan. Learning the kernel matrix with semidefinite programming. J. Mach. Learn. Res., 5:27\u201372, 2003/04.\n[12] R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of The Royal Statistical Society Series B, 58(1):267\u2013288, 1996.\n[13] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer-Verlag, 2001.\n[14] D. M. Allen. The relationship between variable selection and data augmentation and a method for prediction. Technometrics, 16:125\u2013127, 1974.\n[15] M. Stone. Cross-validatory choice and assessment of statistical predictions. J. Roy. Statist. Soc. Ser. B, 36:111\u2013147, 1974.\n[16] T. Zhang. 
Learning bounds for kernel regression using effective data dimensionality. Neural Comput., 17(9):2077\u20132098, 2005.\n[17] C. L. Mallows. Some comments on C_p. Technometrics, 15:661\u2013675, 1973.\n[18] B. Efron. How biased is the apparent error rate of a prediction rule? J. Amer. Statist. Assoc., 81(394):461\u2013470, 1986.\n[19] Y. Cao and Y. Golubev. On oracle inequalities related to smoothing splines. Math. Methods Statist., 15(4):398\u2013414 (2007), 2006.\n[20] S. Arlot and F. Bach. Data-driven calibration of linear estimators with minimal penalties, September 2009. Long version. arXiv:0909.1884v1.\n[21] \u00c9. Lebarbier. Detecting multiple change-points in the mean of a Gaussian process by model selection. Signal Proces., 85:717\u2013736, 2005.\n[22] C. Maugis and B. Michel. Slope heuristics for variable selection and clustering via Gaussian mixtures. Technical Report 6550, INRIA, 2008.\n[23] K.-C. Li. Asymptotic optimality for C_p, C_L, cross-validation and generalized cross-validation: discrete index set. Ann. Statist., 15(3):958\u2013975, 1987.\n", "award": [], "sourceid": 291, "authors": [{"given_name": "Sylvain", "family_name": "Arlot", "institution": null}, {"given_name": "Francis", "family_name": "Bach", "institution": null}]}