{"title": "Early stopping for kernel boosting algorithms: A general analysis with localized complexities", "book": "Advances in Neural Information Processing Systems", "page_first": 6065, "page_last": 6075, "abstract": "Early stopping of iterative algorithms is a widely-used form of regularization in statistical learning, commonly used in conjunction with boosting and related gradient-type algorithms. Although consistency results have been established in some settings, such estimators are less well-understood than their analogues based on penalized regularization. In this paper, for a relatively broad class of loss functions and boosting algorithms (including $L^2$-boost, LogitBoost and AdaBoost, among others), we connect the performance of a stopped iterate to the localized Rademacher/Gaussian complexity of the associated function class. This connection allows us to show that local fixed point analysis, now standard in the analysis of penalized estimators, can be used to derive optimal stopping rules. We derive such stopping rules in detail for various kernel classes, and illustrate the correspondence of our theory with practice for Sobolev kernel classes.", "full_text": "Early stopping for kernel boosting algorithms: A general analysis with localized complexities

Yuting Wei¹  Fanny Yang²*  Martin J. Wainwright¹,²
Department of Statistics¹
Department of Electrical Engineering and Computer Sciences²
UC Berkeley
Berkeley, CA 94720
{ytwei, fanny-yang, wainwrig}@berkeley.edu

Abstract

Early stopping of iterative algorithms is a widely-used form of regularization in statistics, commonly used in conjunction with boosting and related gradient-type algorithms. Although consistency results have been established in some settings, such estimators are less well-understood than their analogues based on penalized regularization.
In this paper, for a relatively broad class of loss functions and boosting algorithms (including L2-boost, LogitBoost and AdaBoost, among others), we exhibit a direct connection between the performance of a stopped iterate and the localized Gaussian complexity of the associated function class. This connection allows us to show that local fixed point analysis of Gaussian or Rademacher complexities, now standard in the analysis of penalized estimators, can be used to derive optimal stopping rules. We derive such stopping rules in detail for various kernel classes, and illustrate the correspondence of our theory with practice for Sobolev kernel classes.

1 Introduction

While non-parametric models offer great flexibility, they can also lead to overfitting, and thus poor generalization performance. For this reason, procedures for fitting non-parametric models must involve some form of regularization, most commonly done by adding some type of penalty to the objective function. An alternative form of regularization is based on the principle of early stopping, in which an iterative algorithm is terminated after a pre-specified number of steps prior to convergence.

While the idea of early stopping is fairly old (e.g., [31, 1, 35]), recent years have witnessed renewed interest in its properties, especially in the context of boosting algorithms and neural network training (e.g., [25, 12]). Over the past decade, a line of work has yielded some theoretical insight into early stopping, including works on classification error for boosting algorithms [3, 13, 18, 23, 39, 40], L2-boosting algorithms for regression [8, 7], and similar gradient algorithms in reproducing kernel Hilbert spaces (e.g. [11, 10, 34, 39, 26]).
A number of these papers establish consistency results for particular forms of early stopping, guaranteeing that the procedure outputs a function with statistical error that converges to zero as the sample size increases. On the other hand, there are relatively few results that actually establish rate optimality of an early stopping procedure, meaning that the achieved error matches known statistical minimax lower bounds. To the best of our knowledge, Bühlmann and Yu [8] were the first to prove optimality for early stopping of L2-boosting as applied to spline classes, albeit with a rule that was not computable from the data. Subsequent work by Raskutti et al. [26] refined this analysis of L2-boosting for kernel classes and first established an important connection to the localized Rademacher complexity; see also the related work [39, 27, 9] with rates for particular kernel classes.

*Yuting Wei and Fanny Yang contributed equally to this work.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

More broadly, relative to our rich and detailed understanding of regularization via penalization (e.g., see the books [17, 33, 32, 37] and papers [2, 20] for details), the theory for early stopping regularization is still not as well developed. In particular, for penalized estimators, it is now well-understood that complexity measures such as the localized Gaussian width, or its Rademacher analogue, can be used to characterize their achievable rates [2, 20, 32, 37]. Is such a general and sharp characterization also possible in the context of early stopping? The main contribution of this paper is to answer this question in the affirmative for boosting algorithms in regression and classification problems involving functions in reproducing kernel Hilbert spaces (RKHS).

The remainder of this paper is organized as follows.
In Section 2, we provide background on boosting methods and reproducing kernel Hilbert spaces, and then introduce the updates studied in this paper. Section 3 is devoted to statements of our main results, followed by a discussion of their consequences for particular function classes in Section 4. We provide simulations that confirm the practical effectiveness of our stopping rules and show close agreement with our theoretical predictions. The proofs for all of our results can be found in the supplemental material.

2 Background and problem formulation

The goal of prediction is to learn a function that maps covariates x ∈ X to responses y ∈ Y. In a regression problem, the responses are typically real-valued, whereas in a classification problem, the responses take values in a finite set. In this paper, we study both regression (Y = R) and classification problems (e.g., Y = {−1, +1} in the binary case) where we observe a collection of n pairs of the form {(x_i, Y_i)}_{i=1}^n, with fixed covariates x_i ∈ X and corresponding random responses Y_i ∈ Y drawn independently from a distribution P_{Y|x_i}. In this section, we provide some necessary background on a gradient-type algorithm which is often referred to as a boosting algorithm.

2.1 Boosting and early stopping

Consider a cost function φ : R × R → [0, ∞), where the non-negative scalar φ(y, θ) denotes the cost associated with predicting θ when the true response is y.
Some common examples of loss functions φ that we consider in later sections include:

• the least-squares loss φ(y, θ) := ½(y − θ)², which underlies L2-boosting [8],
• the logistic regression loss φ(y, θ) = ln(1 + e^{−yθ}), which underlies the LogitBoost algorithm [14, 15], and
• the exponential loss φ(y, θ) = exp(−yθ), which underlies the AdaBoost algorithm [13].

The least-squares loss is typically used for regression problems (e.g., [8, 11, 10, 34, 39, 26]), whereas the latter two losses are frequently used in the setting of binary classification (e.g., [13, 23, 15]). Given some loss function φ and function space F, we define the population cost functional f ↦ L(f) and the corresponding optimal (minimizing) function† via

  L(f) := E_{Y_1^n}[ (1/n) Σ_{i=1}^n φ(Y_i, f(x_i)) ],  and  f* := argmin_{f ∈ F} L(f).  (1)

Note that with the covariates {x_i}_{i=1}^n fixed, the functional L is a non-random object. As a standard example, when we adopt the least-squares loss φ(y, θ) = ½(y − θ)², the population minimizer f* corresponds to the conditional expectation x ↦ E[Y | x]. Since we do not have access to the population distribution of the responses however, the computation of f* is impossible. Given our samples {Y_i}_{i=1}^n, we consider instead some procedure applied to the empirical loss

  L_n(f) := (1/n) Σ_{i=1}^n φ(Y_i, f(x_i)),  (2)

where the population expectation has been replaced by an empirical expectation.

†As clarified in the sequel, our assumptions guarantee uniqueness of f*.
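As a concrete illustration of the quantities above, the three scalar losses and the empirical cost (2) can be written in a few lines of Python. This is a minimal sketch of our own (not code from the paper), with function names chosen for readability:

```python
import numpy as np

# The three scalar losses phi(y, theta) listed in Section 2.1.
def least_squares(y, theta):
    """Least-squares loss underlying L2-boosting: 0.5 * (y - theta)^2."""
    return 0.5 * (y - theta) ** 2

def logistic(y, theta):
    """Logistic loss underlying LogitBoost, with y in {-1, +1}."""
    return np.log1p(np.exp(-y * theta))

def exponential(y, theta):
    """Exponential loss underlying AdaBoost, with y in {-1, +1}."""
    return np.exp(-y * theta)

def empirical_loss(phi, y, f_x):
    """Empirical cost L_n(f) = (1/n) * sum_i phi(Y_i, f(x_i)),
    given the response vector y and the fitted values f_x."""
    return np.mean(phi(y, f_x))
```

With fitted values all zero and responses (1, 0), for instance, `empirical_loss(least_squares, ...)` evaluates to the average of ½·1² and ½·0², i.e. 0.25.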
For example, when L_n corresponds to the negative log likelihood of the samples with φ(Y_i, f(x_i)) = −log[P(Y_i; f(x_i))], direct unconstrained minimization of L_n would yield the maximum likelihood estimator. It is well-known that direct minimization of L_n over a rich function class F may lead to overfitting. A classical method to mitigate this phenomenon is to minimize the sum of the empirical loss with a penalty term. Adjusting the weight on the regularization term allows for a trade-off between fit to the data, and some form of regularity or smoothness of the fit. The behavior of such penalized estimation methods is quite well understood (see e.g. the books [17, 33, 32, 37] and papers [2, 20] for details).

In this paper, we study a form of algorithmic regularization, based on applying a gradient-type algorithm to L_n. In particular, we consider boosting algorithms (see the survey paper [7]), which involve "boosting" or improving the fit of a function via a sequence of additive updates (see e.g. [28, 13, 6, 5, 29]) and can be understood as forms of functional gradient methods [23, 15]. Instead of running until convergence, we then stop the algorithm "early", that is, after some fixed number of steps. The way in which the number of steps is chosen is referred to as a stopping rule, and the overall procedure is referred to as early stopping of a boosting algorithm.

Figure 1: Plots of the squared error ‖f^t − f*‖_n² := (1/n) Σ_{i=1}^n (f^t(x_i) − f*(x_i))² versus the iteration number t for (a) LogitBoost using a first-order Sobolev kernel, and (b) AdaBoost using the same first-order Sobolev kernel K(x, x′) = 1 + min(x, x′), which generates a class of Lipschitz functions (splines of order one). Both plots correspond to a sample size n = 100.

In more detail, a broad class of boosting algorithms [23] generate a sequence {f^t}_{t=0}^∞ via updates of the form

  f^{t+1} = f^t − α^t g^t  with  g^t ∝ argmax_{‖d‖_F ≤ 1} ⟨∇L_n(f^t), d(x_1^n)⟩,  (3)

where {α^t}_{t=0}^∞ is a sequence of step sizes chosen by the user, the constraint ‖d‖_F ≤ 1 defines the unit ball in a given function class F, ∇L_n(f) ∈ R^n denotes the gradient taken at the vector (f(x_1), . . . , f(x_n)), and ⟨h, g⟩ is the usual inner product between vectors h, g ∈ R^n. For non-decaying step sizes and a convex objective L_n, running this procedure for an infinite number of iterations will lead to a minimizer of the empirical loss, thus causing overfitting. In order to illustrate this phenomenon, Figure 1 provides plots of the squared error ‖f^t − f*‖_n² versus the iteration number, for LogitBoost in panel (a) and AdaBoost in panel (b). (See Section 4.2 for more details on how these experiments were set up.)

In these plots, the dotted line indicates the minimum mean-squared error ρ_n² over all iterates of that particular run of the algorithm. Both plots are qualitatively similar, illustrating the existence of a "good" number of iterations to take, after which the MSE greatly increases. Hence a natural problem is to decide at what iteration T to stop such that the iterate f^T satisfies bounds of the form

  L(f^T) − L(f*) ≾ ρ_n²  and  ‖f^T − f*‖_n² ≾ ρ_n²  (4)

with high probability.
The main results of this paper provide a stopping rule T for which bounds of the form (4) do in fact hold with high probability over the randomness in the observed responses.

Moreover, as shown by our later results, under suitable regularity conditions, the expectation of the minimum squared error ρ_n² is proportional to the statistical minimax risk inf_{f̂} sup_{f ∈ F} E[L(f̂) − L(f)], where the infimum is taken over all possible estimators f̂. Coupled with our stopping time guarantee (4), this implies that our estimate achieves the minimax risk up to constant factors. As a result, our bounds are unimprovable in general (see Corollary 1).

2.2 Reproducing kernel Hilbert spaces

The analysis of this paper focuses on algorithms with the update (3) when the function class F is a reproducing kernel Hilbert space H (RKHS, see standard sources [36, 16, 30, 4]), consisting of functions mapping a domain X to the real line R. Any RKHS is defined by a bivariate symmetric kernel function K : X × X → R which is required to be positive semidefinite, i.e. for any integer N ≥ 1 and a collection of points {x_j}_{j=1}^N in X, the matrix [K(x_i, x_j)]_{ij} ∈ R^{N×N} is positive semidefinite. The associated RKHS is the closure of the linear span of functions of the form f(·) = Σ_{j≥1} ω_j K(·, x_j), where {x_j}_{j=1}^∞ is some collection of points in X, and {ω_j}_{j=1}^∞ is a real-valued sequence. For two functions f_1, f_2 ∈ H which can be expressed as finite sums f_1(·) = Σ_{i=1}^{ℓ_1} α_i K(·, x_i) and f_2(·) = Σ_{j=1}^{ℓ_2} β_j K(·, x_j), the inner product is defined as ⟨f_1, f_2⟩_H = Σ_{i=1}^{ℓ_1} Σ_{j=1}^{ℓ_2} α_i β_j K(x_i, x_j), with induced norm ‖f_1‖_H² = Σ_{i=1}^{ℓ_1} Σ_{j=1}^{ℓ_1} α_i α_j K(x_i, x_j). For each x ∈ X, the function K(·, x) belongs to H, and satisfies the reproducing relation ⟨f, K(·, x)⟩_H = f(x) for all f ∈ H.

Throughout this paper, we assume that the kernel function is uniformly bounded, meaning that there is a constant L such that sup_{x∈X} K(x, x) ≤ L. Such a boundedness condition holds for many kernels used in practice, including the Gaussian, Laplacian, Sobolev, other types of spline kernels, as well as any trace class kernel with trigonometric eigenfunctions. By rescaling the kernel as necessary, we may assume without loss of generality that L = 1. As a consequence, for any function f such that ‖f‖_H ≤ r, we have by the reproducing relation that

  ‖f‖_∞ = sup_x ⟨f, K(·, x)⟩_H ≤ ‖f‖_H sup_x ‖K(·, x)‖_H ≤ r.

Given samples {(x_i, y_i)}_{i=1}^n, by the representer theorem [19], it is sufficient to restrict ourselves to the linear subspace H_n = span{K(·, x_i)}_{i=1}^n, for which all f ∈ H_n can be expressed as

  f = (1/√n) Σ_{i=1}^n ω_i K(·, x_i)  (5)

for some coefficient vector ω ∈ R^n. Among those functions which achieve the infimum in expression (1), let us define f* as the one with the minimum Hilbert norm. This definition is equivalent to restricting f* to be in the linear subspace H_n.

2.3 Boosting in kernel spaces

For a finite number of covariates x_i from i = 1, . . . , n, let us define the normalized kernel matrix K ∈ R^{n×n} with entries K_ij = K(x_i, x_j)/n.
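To make the RKHS definitions concrete, the following sketch (our own illustration, not code from the paper) implements the inner product ⟨f_1, f_2⟩_H for finite kernel expansions and checks the reproducing relation ⟨f, K(·, x)⟩_H = f(x) numerically, using the first-order Sobolev kernel K(x, x′) = 1 + min(x, x′) that appears later in the experiments:

```python
import numpy as np

def sobolev_kernel(x, xp):
    """First-order Sobolev kernel on [0, 1]: K(x, x') = 1 + min(x, x')."""
    return 1.0 + np.minimum(x, xp)

def rkhs_inner(alpha, xs1, beta, xs2, kernel):
    """<f1, f2>_H = sum_{i,j} alpha_i beta_j K(x_i, x_j) for the finite
    expansions f1 = sum_i alpha_i K(., x_i) and f2 = sum_j beta_j K(., x_j)."""
    gram = kernel(xs1[:, None], xs2[None, :])
    return alpha @ gram @ beta

def evaluate(alpha, xs, kernel, x):
    """Point evaluation f(x) = sum_i alpha_i K(x, x_i)."""
    return alpha @ kernel(xs, x)

# Reproducing relation: <f, K(., x0)>_H equals the point evaluation f(x0).
rng = np.random.default_rng(0)
xs = rng.uniform(size=5)
alpha = rng.normal(size=5)
x0 = 0.3
lhs = rkhs_inner(alpha, xs, np.array([1.0]), np.array([x0]), sobolev_kernel)
rhs = evaluate(alpha, xs, sobolev_kernel, x0)
assert np.isclose(lhs, rhs)
```

Since K(·, x0) itself has the finite expansion with a single coefficient β = 1 at x0, the reproducing relation reduces to an exact algebraic identity here, which the assertion confirms.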
Since we can restrict the minimization of L_n and L from H to the subspace H_n without loss of generality, using expression (5) we can then write the function value vector f(x_1^n) := (f(x_1), . . . , f(x_n)) as f(x_1^n) = √n K ω. As there is a one-to-one correspondence between the n-dimensional vectors f(x_1^n) ∈ R^n and the corresponding function f ∈ H_n in H by the representer theorem, minimization of an empirical loss in the subspace H_n essentially becomes the n-dimensional problem of fitting a response vector y over the set range(K). In the sequel, all updates will thus be performed on the function value vectors f(x_1^n).

With a change of variable d(x_1^n) = √n √K z, we then have

  d^t(x_1^n) := argmax_{‖d‖_H ≤ 1} ⟨∇L_n(f^t), d(x_1^n)⟩ = √n K ∇L_n(f^t) / √(∇L_n(f^t)ᵀ K ∇L_n(f^t)),

where the maximum is taken over vectors d ∈ range(K). In this paper we study g^t = ⟨∇L_n(f^t), d^t(x_1^n)⟩ d^t in the boosting update (3), so that the function value iterates take the form

  f^{t+1}(x_1^n) = f^t(x_1^n) − α n K ∇L_n(f^t),  (6)

where α > 0 is a constant stepsize choice. Choosing f^0(x_1^n) = 0 ensures that all iterates f^t(x_1^n) remain in the range space of K. Our goal is to propose a stopping time T such that the averaged function f̂ = (1/T) Σ_{t=1}^T f^t satisfies bounds of the type (4). Importantly, we exhibit such bounds with a statistical error term δ_n that is specified by the localized Gaussian complexity of the kernel class.

3 Main results

We now turn to the statement of our main results, beginning with the introduction of some regularity assumptions.

3.1 Assumptions

Recall from our earlier set-up that we differentiate between the empirical loss function L_n in expression (2), and the population loss L in expression (1).
Apart from assuming differentiability of both functions, all of our remaining conditions are imposed on the population loss. Such conditions at the population level are weaker than their analogues at the empirical level. For a given radius r > 0, let us define the Hilbert ball around the optimal function f* as

  B_H(f*, r) := {f ∈ H | ‖f − f*‖_H ≤ r}.

Our analysis makes particular use of this ball defined for the squared radius

  C_H² := 2 max{‖f*‖_H², 32, σ²},  (7)

where σ is the effective noise level defined as

  σ := min{ t | max_{i=1,...,n} E[e^{(Y_i − f*(x_i))²/t²}] < ∞ }  for least squares, and
  σ := 4(2M + 1)(1 + 2C_H)  for φ′-bounded losses.  (8)

We assume that the population loss is m-strongly convex and M-smooth over B_H(f*, 2C_H), meaning that the sandwich inequality

  (m-M-condition)  (m/2)‖f − g‖_n² ≤ L(f) − L(g) − ⟨∇L(g), f(x_1^n) − g(x_1^n)⟩ ≤ (M/2)‖f − g‖_n²

holds for all f, g ∈ B_H(f*, 2C_H). On top of that, we assume φ to be M-Lipschitz in the second argument. To be clear, here ∇L(g) denotes the vector in R^n obtained by taking the gradient of L with respect to the vector g(x_1^n). It can be verified by a straightforward computation that when L is induced by the least-squares cost φ(y, θ) = ½(y − θ)², the m-M-condition holds for m = M = 1. The logistic and exponential loss satisfy this condition (see supplementary material), where it is key that we have imposed the condition only locally on the ball B_H(f*, 2C_H).

In addition to the least-squares cost, our theory also applies to losses L induced by scalar functions φ that satisfy the following condition:

  (φ′-boundedness)  max_{i=1,...,n} |∂φ(y, θ)/∂θ|_{θ=f(x_i)} ≤ B  for all f ∈ B_H(f*, 2C_H) and y ∈ Y.

This condition holds with B = 1 for the logistic loss for all Y, and B = exp(2.5 C_H) for the exponential loss for binary classification with Y = {−1, 1}, using our kernel boundedness condition. Note that whenever this condition holds with some finite B, we can always rescale the scalar loss φ by 1/B so that it holds with B = 1, and we do so in order to simplify the statement of our results.

3.2 Upper bound in terms of localized Gaussian width

Our upper bounds involve a complexity measure known as the localized Gaussian width. In general, Gaussian widths are widely used to obtain risk bounds for least-squares and other types of M-estimators. In our case, we consider Gaussian complexities for "localized" sets of the form

  E_n(δ, 1) := { f − g | f, g ∈ H, ‖f − g‖_H ≤ 1, ‖f − g‖_n ≤ δ }.  (9)

The Gaussian complexity localized at scale δ is given by

  G_n(E_n(δ, 1)) := E[ sup_{g ∈ E_n(δ,1)} (1/n) Σ_{i=1}^n w_i g(x_i) ],  (10)

where (w_1, . . . , w_n) denotes an i.i.d. sequence of standard Gaussian variables. An essential quantity in our theory is specified by a certain fixed point equation that is now standard in empirical process theory [32, 2, 20, 26].
The critical radius δ_n is the smallest positive scalar such that

  G_n(E_n(δ, 1))/δ ≤ δ/σ.  (11)

We note that past work on localized Rademacher and Gaussian complexity [24, 2] guarantees that there exists a unique δ_n > 0 that satisfies this condition, so that our definition is sensible.

3.2.1 Upper bounds on excess risk and empirical L²(P_n)-error

With this set-up, we are now equipped to state our main theorem. It provides high-probability bounds on the excess risk and L²(P_n)-error of the estimator f̄^T := (1/T) Σ_{t=1}^T f^t defined by averaging the T iterates of the algorithm.

Theorem 1. Consider any loss function satisfying the m-M-condition and the φ′-boundedness condition (if not least squares), for which we generate function iterates {f^t}_{t=0}^∞ of the form (6) with step size α ∈ (0, min{1/M, M}], initialized at f^0 = 0. Then, if n is large enough such that δ_n ≤ M/m, for all iterations T = 0, 1, . . . , ⌊m/(8M δ_n²)⌋, the averaged function estimate f̄^T satisfies the bounds

  L(f̄^T) − L(f*) ≤ C M (1/(αmT) + δ_n²/m²),  (12a)
  ‖f̄^T − f*‖_n² ≤ C (1/(αmT) + δ_n²/m²),  (12b)

where both inequalities hold with probability at least 1 − c₁ exp(−C₂ m²nδ_n²/σ²).

In our statements, constants of the form c_j are universal, whereas capital C_j may depend on parameters of the joint distribution and population loss L. In the previous theorem, C₂ = min{m²/σ², 1} and C depends on the squared radius C_H² := 2 max{‖f*‖_H², 32, σ²}. In order to gain intuition for the claims in the theorem, note that (disregarding factors depending on (m, M)), for all iterations T ≲ 1/δ_n², the first term 1/(αmT) dominates the second term δ_n²/m², so that taking further iterations reduces the upper bound on the error until T ∼ 1/δ_n², at which point the upper bound on the error is of the order δ_n².

Furthermore, note that similar bounds as in Theorem 1 can be obtained for the expected loss (over the response y_i, with the design fixed) by a simple integration argument. Hence if we perform updates with step size α = 1/M, after τ := m/(δ_n² max{8, M}) iterations, the mean squared error is bounded as

  E‖f̄^τ − f*‖_n² ≤ C′ δ_n²/m²,  (13)

where we use M ≥ m and where C′ is another constant depending on C_H. It is worth noting that guarantee (13) matches the best known upper bounds for kernel ridge regression (KRR); indeed, this must be the case, since a sharp analysis of KRR is based on the same notion of localized Gaussian complexity. Thus, our results establish a strong parallel between the algorithmic regularization of early stopping, and the penalized regularization of kernel ridge regression. Moreover, as discussed in Section 3.3, under suitable regularity conditions on the RKHS, the critical squared radius δ_n² also acts as a lower bound for the expected risk, i.e. our upper bounds are not improvable in general.

Compared with the work of Raskutti et al.
[26], which also analyzes the kernel boosting iterates of the form (6), our theory more directly analyzes the effective function class that is explored in the boosting process by taking T steps, with the localized Gaussian width (10) appearing more naturally. In addition, our analysis applies to a broader class of loss functions beyond least-squares.

In the case of reproducing kernel Hilbert spaces, it is possible to sandwich the localized Gaussian complexity by a function of the eigenvalues of the kernel matrix. Mendelson [24] provides this argument in the case of the localized Rademacher complexity, but similar arguments apply to the localized Gaussian complexity. Letting μ₁ ≥ μ₂ ≥ · · · ≥ μ_n ≥ 0 denote the ordered eigenvalues of the normalized kernel matrix K, define the function

  R(δ) = (1/√n) √( Σ_{j=1}^n min{δ², μ_j} ).  (14)

Up to a universal constant, this function is an upper bound on the Gaussian width G_n(E_n(δ, 1)) for all δ ≥ 0, and up to another universal constant, it is also a lower bound for all δ ≥ 1/√n.

Note that the critical radius δ_n² only depends on our observations {(x_i, y_i)}_{i=1}^n through the solution of inequality (11). In many cases, with examples given in Section 4, it is possible to compute or upper bound this critical radius, so that a concrete stopping rule can indeed be calculated in advance.

3.3 Achieving minimax lower bounds

We claim that when the noise Y − f(x) is Gaussian, for a broad class of kernels, upper bound (13) matches the known minimax lower bound, and thus is unimprovable in general. In particular, Yang et al. [38] define the class of regular kernels, which includes the Gaussian and Sobolev kernels as particular cases.
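Because R(δ) from equation (14) upper-bounds the localized Gaussian width, one can obtain a data-dependent surrogate for the critical radius by solving R(δ)/δ ≤ δ/σ over a grid. The sketch below (our own illustration, with the grid solver and the first-order Sobolev example chosen by us) computes such a surrogate δ_n from the eigenvalues of the normalized kernel matrix and then a stopping time of order 1/δ_n², as suggested by Theorem 1 with m = M = 1 for least squares:

```python
import numpy as np

def critical_radius(mu, sigma, grid_size=10_000):
    """Smallest delta on a grid with R(delta)/delta <= delta/sigma, where
    R(delta) = sqrt((1/n) * sum_j min(delta^2, mu_j)) is the upper bound (14)
    on the localized Gaussian width.  mu: eigenvalues of the normalized
    kernel matrix K (non-negative)."""
    n = len(mu)
    for delta in np.linspace(1e-4, 1.0, grid_size):
        r = np.sqrt(np.sum(np.minimum(delta**2, mu)) / n)
        if r / delta <= delta / sigma:
            return delta
    return 1.0

def normalized_sobolev_gram(n):
    """Normalized Gram matrix K_ij = K(x_i, x_j)/n of the first-order
    Sobolev kernel on an equidistant design over (0, 1]."""
    x = (np.arange(n) + 1.0) / n
    return (1.0 + np.minimum(x[:, None], x[None, :])) / n

n, sigma = 200, 1.0
mu = np.clip(np.linalg.eigvalsh(normalized_sobolev_gram(n))[::-1], 0.0, None)
delta_n = critical_radius(mu, sigma)
# Theorem 1 then suggests stopping after T of order m/(8*M*delta_n^2) steps;
# here with m = M = 1 as for least squares.
T = int(1.0 / (8 * delta_n**2))
```

Recomputing δ_n for growing n shows the expected shrinkage (for this kernel, theory predicts δ_n² ≍ n^{−2/3}), so the implied stopping time grows with the sample size.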
For such kernels, the authors provide a minimax lower bound over the unit ball of the Hilbert space involving δ_n, which implies that any estimator f̂ has prediction risk lower bounded as

  sup_{‖f*‖_H ≤ 1} E‖f̂ − f*‖_n² ≥ c_ℓ δ_n².  (15)

Comparing the lower bound (15) with upper bound (13) for our estimator f̄^T stopped after O(1/δ_n²) many steps, it follows that the bounds proven in Theorem 1 are unimprovable apart from constant factors. We summarize our findings in the following corollary:

Corollary 1. For the class of regular kernels and any function f* with ‖f*‖_H ≤ 1, running T := ⌊1/(δ_n² max{8, M})⌋ iterations with step size α = m/M and f^0 = 0 yields an estimate f̄^T such that

  E‖f̄^T − f*‖_n² ≍ inf_{f̂} sup_{‖f*‖_H ≤ 1} E‖f̂ − f*‖_n²,  (16)

where the infimum is taken over all measurable functions of the input data and the expectation is taken over the randomness of the response variables {Y_i}_{i=1}^n.

On a high level, the statement in Corollary 1 implies that stopping early essentially prevents us from overfitting to the data and automatically finds the optimal balance between low training error (i.e. fitting the data well) and low model complexity (i.e. generalizing well).

4 Consequences for various kernel classes

In this section, we apply Theorem 1 to derive some concrete rates for different kernel spaces and then illustrate them with some numerical experiments. It is known that the complexity of an RKHS in association with fixed covariates {x_i}_{i=1}^n can be characterized by the decay rate of the eigenvalues {μ_j}_{j=1}^n of the normalized kernel matrix K.
The representation power of a kernel class is directly correlated with the eigen-decay: the faster the decay, the smaller the function class.

4.1 Theoretical predictions as a function of decay

In this section, let us consider two broad types of eigen-decay:

• γ-exponential decay: For some γ > 0, the kernel matrix eigenvalues satisfy a decay condition of the form μ_j ≤ c₁ exp(−c₂ j^γ), where c₁, c₂ are universal constants. Examples of kernels in this class include the Gaussian kernel, which for the Lebesgue measure satisfies such a bound with γ = 2 (real line) or γ = 1 (compact domain).

• β-polynomial decay: For some β > 1/2, the kernel matrix eigenvalues satisfy a decay condition of the form μ_j ≤ c₁ j^{−2β}, where c₁ is a universal constant. Examples of kernels in this class include the kth-order Sobolev spaces for some fixed integer k ≥ 1 with Lebesgue measure on a bounded domain. We consider Sobolev spaces that consist of functions that have kth-order weak derivatives f^{(k)} being Lebesgue integrable and f(0) = f^{(1)}(0) = · · · = f^{(k−1)}(0) = 0. For such classes, the β-polynomial decay condition holds with β = k.

Given eigendecay conditions of these types, it is possible to compute an upper bound on the critical radius δ_n. In particular, using the fact that the function R from equation (14) is an upper bound on the function G_n(E_n(δ, 1)), we can show that for γ-exponentially decaying kernels, we have δ_n² ≲ (log n)^{1/γ}/n, whereas for β-polynomial kernels, we have δ_n² ≲ n^{−2β/(2β+1)} up to universal constants. Combining with our Theorem 1, we obtain the following result:

Corollary 2 (Bounds based on eigendecay). Suppose we apply boosting with stepsize α = m/M and initialization f^0 = 0 on the empirical loss function L_n which satisfies the m-M-condition and φ′-boundedness conditions, and is defined on covariate-response pairs {(x_i, Y_i)}_{i=1}^n with Y_i drawn from the distribution P_{Y|x_i}. Then, the error of the averaged iterate f̄^T satisfies the following upper bounds with high probability, with "≲" neglecting dependence on problem parameters such as (m, M):

(a) For kernels with γ-exponential eigen-decay with respect to {x_i}_{i=1}^n: ‖f̄^T − f*‖_n² ≲ (log^{1/γ} n)/n, when stopped after T ≍ n/log^{1/γ} n steps.

(b) For kernels with β-polynomial eigen-decay with respect to {x_i}_{i=1}^n: ‖f̄^T − f*‖_n² ≲ n^{−2β/(2β+1)}, when stopped after T ≍ n^{2β/(2β+1)} steps.

In particular, these bounds hold for LogitBoost and AdaBoost.

To the best of our knowledge, this result is the first to show non-asymptotic and optimal statistical rates for the ‖·‖_n²-error when using early stopping LogitBoost or AdaBoost with an explicit dependence of the stopping rule on n. Our results also yield similar guarantees for L2-boosting, as has been established in past work [26]. Note that we can observe a similar trade-off between computational efficiency and statistical accuracy as in the case of kernel least-squares regression [39, 26]: although larger kernel classes (e.g. Sobolev classes) yield higher estimation errors, boosting updates reach the optimum faster than for a smaller kernel class (e.g.
Gaussian kernels).

4.2 Numerical experiments

We now describe some numerical experiments that provide illustrative confirmations of our theoretical predictions, using the first-order Sobolev kernel as a typical example of a kernel class with polynomial eigen-decay. In particular, we consider the first-order Sobolev space of Lipschitz functions on the unit interval [0, 1], defined by the kernel K(x, x′) = 1 + min(x, x′), with the design points {x_i}_{i=1}^n set equidistantly over [0, 1]. Note that the equidistant design yields β-polynomial decay of the eigenvalues of K with β = 1, so that δ_n^2 ≍ n^{−2/3}. Accordingly, our theory predicts that the stopping time T = (cn)^{2/3} should lead to an estimate f̄^T such that ‖f̄^T − f*‖_n^2 ≲ n^{−2/3}.

In our experiments for L2-Boost, we sampled Y_i according to Y_i = f*(x_i) + w_i with w_i ∼ N(0, 0.5), which corresponds to the probability distribution P(Y | x_i) = N(f*(x_i), 0.5), where f*(x) = |x − 1/2| − 1/4 is defined on the unit interval [0, 1]. By construction, the function f* belongs to the first-order Sobolev space with ‖f*‖_H = 1. For LogitBoost, we sampled Y_i according to Bern(p(x_i)), where p(x) = exp(f*(x)) / (1 + exp(f*(x))) with the same f*. We chose f^0 = 0 in all cases, and ran the updates (6) for L2-Boost and LogitBoost with the constant step size α = 0.75. We compared various stopping rules to the oracle gold standard G, which chooses the stopping time G = arg min_{t ≥ 1} ‖f^t − f*‖_n^2 that yields the minimum prediction error among all iterates {f^t}.
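The update (6) is defined earlier in the paper and not reproduced here; for L2-Boost it amounts to gradient descent on the vector of fitted values θ^t = (f^t(x_1), ..., f^t(x_n)), namely θ^{t+1} = θ^t − (α/n) K (θ^t − Y) (cf. [26]). Under that assumption, a minimal sketch of the L2-Boost experiment with the constants described above:

```python
import numpy as np

rng = np.random.default_rng(0)
n, alpha = 500, 0.75
x = np.arange(1, n + 1) / n                    # equidistant design on (0, 1]
K = 1.0 + np.minimum.outer(x, x)               # first-order Sobolev kernel
f_star = np.abs(x - 0.5) - 0.25                # f*(x) = |x - 1/2| - 1/4
y = f_star + rng.normal(0.0, np.sqrt(0.5), n)  # noise w_i ~ N(0, 0.5)

T = round((7 * n) ** (2 / 3))                  # stopping rule T = (7n)^{2/3}
theta = np.zeros(n)                            # f^0 = 0, values at design points
avg = np.zeros(n)
for t in range(T):
    theta -= (alpha / n) * K @ (theta - y)     # L2-Boost step on fitted values
    avg += theta
avg /= T                                       # averaged iterate

err = np.mean((avg - f_star) ** 2)             # ||f_bar^T - f*||_n^2
print(f"T = {T}, error = {err:.4f}")
```

Averaging over many noise draws and sweeping n reproduces the qualitative n^{−2/3} error decay shown in Figure 2(a).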
Although this procedure is unimplementable in practice, it serves as a convenient lower bound with which to compare. Figure 2 shows plots of the mean-squared error ‖f̄^T − f*‖_n^2 over the sample size n, averaged over 40 trials, for the gold standard T = G and for stopping rules based on T = (7n)^κ for different choices of κ. Error bars correspond to the standard errors computed from our simulations. Panel (a) shows the behavior for L2-boosting, whereas panel (b) shows the behavior for LogitBoost.

Note that both plots are qualitatively similar, and that the theoretically derived stopping rule T = (7n)^κ with κ* = 2/3 ≈ 0.67, while slightly worse than the gold standard, tracks its performance closely.

Figure 2: The mean-squared errors for the stopped iterates f̄^T at the gold standard, i.e. the iterate with the minimum error among all unstopped updates (blue), and at T = (7n)^κ (with the theoretically optimal κ = 0.67 in red, κ = 0.33 in black, and κ = 1 in green) for (a) L2-Boost and (b) LogitBoost.

We also performed simulations for some "bad" stopping rules, in particular for an exponent κ not equal to κ* = 2/3, indicated by the green and black curves. In the log-scale plots in Figure 3 we can clearly see that for κ ∈ {0.33, 1} the performance is indeed much worse, with the difference in slope even suggesting a different scaling of the error with the number of observations n.
Recalling our discussion for Figure 1, this phenomenon likely occurs due to underfitting and overfitting effects.

Figure 3: Logarithmic plots of the mean-squared errors at the gold standard in blue and at T = (7n)^κ (with the theoretically optimal rule κ = 0.67 in red, κ = 0.33 in black, and κ = 1 in green) for (a) L2-Boost and (b) LogitBoost.

5 Discussion

In this paper, we have proven non-asymptotic bounds for early stopping of kernel boosting for a relatively broad class of loss functions. These bounds allowed us to propose simple stopping rules which, for the class of regular kernel functions [38], yield minimax-optimal rates of estimation. Although the connection between early stopping and regularization has long been studied and explored in the literature, to the best of our knowledge, this paper is the first to establish a general relationship between the statistical optimality of stopped iterates and the localized Gaussian complexity, a quantity well understood to play a central role in controlling the behavior of regularized estimators based on penalization [32, 2, 20, 37].

There are various open questions suggested by our results. Can fast approximation techniques for kernels be used to approximately compute optimal stopping rules without having to calculate all eigenvalues of the kernel matrix? Furthermore, we suspect that similar guarantees can be shown for the stopped estimator f^T, which we observed to behave similarly to the averaged estimator f̄^T in our simulations.
It would be of interest to establish results on f^T directly.

Acknowledgements

This work was partially supported by DOD Advanced Research Projects Agency W911NF-16-1-0552, National Science Foundation grant NSF-DMS-1612948, and Office of Naval Research Grant DOD-ONR-N00014.

References

[1] R. S. Anderssen and P. M. Prenter. A formal comparison of methods proposed for the numerical solution of first kind integral equations. Jour. Australian Math. Soc. (Ser. B), 22:488–500, 1981.

[2] P. L. Bartlett, O. Bousquet, and S. Mendelson. Local Rademacher complexities. Annals of Statistics, 33(4):1497–1537, 2005.

[3] P. L. Bartlett and M. Traskin. AdaBoost is consistent. Journal of Machine Learning Research, 8(Oct):2347–2368, 2007.

[4] A. Berlinet and C. Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer Academic, Norwell, MA, 2004.

[5] L. Breiman. Prediction games and arcing algorithms. Neural Computation, 11(7):1493–1517, 1999.

[6] L. Breiman et al. Arcing classifier (with discussion and a rejoinder by the author). Annals of Statistics, 26(3):801–849, 1998.

[7] P. Bühlmann and T. Hothorn. Boosting algorithms: Regularization, prediction and model fitting. Statistical Science, pages 477–505, 2007.

[8] P. Bühlmann and B. Yu.
Boosting with L2 loss: Regression and classification. Journal of the American Statistical Association, 98:324–340, 2003.

[9] R. Camoriano, T. Angles, A. Rudi, and L. Rosasco. NYTRO: When subsampling meets early stopping. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, pages 1403–1411, 2016.

[10] A. Caponnetto and Y. Yao. Adaptation for regularization operators in learning theory. Technical Report CBCL Paper #265/AI Technical Report #063, Massachusetts Institute of Technology, September 2006.

[11] A. Caponnetto. Optimal rates for regularization operators in learning theory. Technical Report CBCL Paper #264/AI Technical Report #062, Massachusetts Institute of Technology, September 2006.

[12] R. Caruana, S. Lawrence, and C. L. Giles. Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping. In Advances in Neural Information Processing Systems, pages 402–408, 2001.

[13] Y. Freund and R. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.

[14] J. Friedman, T. Hastie, R. Tibshirani, et al. Additive logistic regression: A statistical view of boosting (with discussion and a rejoinder by the authors). Annals of Statistics, 28(2):337–407, 2000.

[15] J. H. Friedman. Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29:1189–1232, 2001.

[16] C. Gu. Smoothing Spline ANOVA Models. Springer Series in Statistics. Springer, New York, NY, 2002.

[17] L. Györfi, M. Kohler, A. Krzyzak, and H. Walk. A Distribution-Free Theory of Nonparametric Regression. Springer Series in Statistics. Springer, 2002.

[18] W. Jiang. Process consistency for AdaBoost. Annals of Statistics, 21:13–29, 2004.

[19] G. Kimeldorf and G. Wahba. Some results on Tchebycheffian spline functions. Jour.
Math. Anal. Appl., 33:82–95, 1971.

[20] V. Koltchinskii. Local Rademacher complexities and oracle inequalities in risk minimization. Annals of Statistics, 34(6):2593–2656, 2006.

[21] M. Ledoux. The Concentration of Measure Phenomenon. Mathematical Surveys and Monographs. American Mathematical Society, Providence, RI, 2001.

[22] M. Ledoux and M. Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Springer-Verlag, New York, NY, 1991.

[23] L. Mason, J. Baxter, P. L. Bartlett, and M. R. Frean. Boosting algorithms as gradient descent. In Advances in Neural Information Processing Systems 12, pages 512–518, 1999.

[24] S. Mendelson. Geometric parameters of kernel machines. In Proceedings of the Conference on Learning Theory (COLT), pages 29–43, 2002.

[25] L. Prechelt. Early stopping - but when? In Neural Networks: Tricks of the Trade, pages 55–69. Springer, 1998.

[26] G. Raskutti, M. J. Wainwright, and B. Yu. Early stopping and non-parametric regression: An optimal data-dependent stopping rule. Journal of Machine Learning Research, 15:335–366, 2014.

[27] L. Rosasco and S. Villa. Learning with incremental iterative regularization. In Advances in Neural Information Processing Systems, pages 1630–1638, 2015.

[28] R. E. Schapire. The strength of weak learnability. Machine Learning, 5(2):197–227, 1990.

[29] R. E. Schapire. The boosting approach to machine learning: An overview. In Nonlinear Estimation and Classification, pages 149–171. Springer, 2003.

[30] B. Schölkopf and A. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.

[31] O. N. Strand. Theory and methods related to the singular value expansion and Landweber's iteration for integral equations of the first kind. SIAM J. Numer. Anal., 11:798–825, 1974.

[32] S. van de Geer. Empirical Processes in M-Estimation. Cambridge University Press, 2000.

[33] A.
W. van der Vaart and J. Wellner. Weak Convergence and Empirical Processes. Springer-Verlag, New York, NY, 1996.

[34] E. De Vito, S. Pereverzyev, and L. Rosasco. Adaptive kernel methods using the balancing principle. Foundations of Computational Mathematics, 10(4):455–479, 2010.

[35] G. Wahba. Three topics in ill-posed problems. In M. Engl and G. Groetsch, editors, Inverse and Ill-Posed Problems, pages 37–50. Academic Press, 1987.

[36] G. Wahba. Spline Models for Observational Data. CBMS-NSF Regional Conference Series in Applied Mathematics. SIAM, Philadelphia, PA, 1990.

[37] M. J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press, 2017.

[38] Y. Yang, M. Pilanci, and M. J. Wainwright. Randomized sketches for kernels: Fast and optimal non-parametric regression. Annals of Statistics, 2017. To appear.

[39] Y. Yao, L. Rosasco, and A. Caponnetto. On early stopping in gradient descent learning. Constructive Approximation, 26(2):289–315, 2007.

[40] T. Zhang and B. Yu. Boosting with early stopping: Convergence and consistency. Annals of Statistics, 33(4):1538–1579, 2005.