{"title": "Early Stopping for Nonparametric Testing", "book": "Advances in Neural Information Processing Systems", "page_first": 3985, "page_last": 3994, "abstract": "Early stopping of iterative algorithms is an algorithmic regularization method to avoid over-fitting in estimation and classification. In this paper, we show that early stopping can also be applied to obtain the minimax optimal testing in a general non-parametric setup. Specifically, a Wald-type test statistic is obtained based on an iterated estimate produced by functional gradient descent algorithms in a reproducing kernel Hilbert space. A notable contribution is to establish a ``sharp'' stopping rule: when the number of iterations achieves an optimal order, testing optimality is achievable; otherwise, testing optimality becomes impossible. As a by-product, a similar sharpness result is also derived for minimax optimal estimation under early stopping. All obtained results hold for various kernel classes, including Sobolev smoothness classes and Gaussian kernel classes.", "full_text": "Early Stopping for Nonparametric Testing\n\nMeimei Liu\n\nDepartment of Statistical Science\n\nDuke University\n\nDurham, NC 27705\n\nmeimei.liu@duke.edu\n\nGuang Cheng\n\nDepartment of Statistics\n\nPurdue University\n\nWest Lafayette, IN 47907\n\nchengg@purdue.edu\n\nAbstract\n\nEarly stopping of iterative algorithms is an algorithmic regularization method to\navoid over-\ufb01tting in estimation and classi\ufb01cation. In this paper, we show that\nearly stopping can also be applied to obtain the minimax optimal testing in a\ngeneral non-parametric setup. Speci\ufb01cally, a Wald-type test statistic is obtained\nbased on an iterated estimate produced by functional gradient descent algorithms\nin a reproducing kernel Hilbert space. 
A notable contribution is to establish a \u201csharp\u201d stopping rule: when the number of iterations achieves an optimal order, testing optimality is achievable; otherwise, testing optimality becomes impossible. As a by-product, a similar sharpness result is also derived for minimax optimal estimation under early stopping. All obtained results hold for various kernel classes, including Sobolev smoothness classes and Gaussian kernel classes.\n\n1 Introduction\n\nAs a computationally efficient approach, early stopping often works by terminating an iterative algorithm after a pre-specified number of steps to avoid over-fitting. Recently, various forms of early stopping have been proposed in estimation and classification. Examples include boosting algorithms (B\u00fchlmann and Yu [2003], Zhang and Yu [2005], Wei et al. [2017]) and gradient descent over reproducing kernel Hilbert spaces (Yao et al. [2007], Raskutti et al. [2014]), and references therein. However, statistical inference based on early stopping has largely remained unexplored.\nIn this paper, we apply early stopping regularization to nonparametric testing and characterize its minimax optimality from an algorithmic perspective. Notably, this differs from the traditional framework of using penalization methods to conduct statistical inference. Recall that classical nonparametric inference often involves minimizing objective functions of the loss + penalty form to avoid overfitting; examples include the penalized likelihood ratio test and Wald-type tests; see Fan and Jiang [2007], Shang and Cheng [2013], Liu et al. [2018] and references therein. However, solving a quadratic program in the penalized regularization requires O(n^3) basic operations. Additionally, in practice the cross validation method (Golub et al. [1979]) is often used as a tuning procedure, which is known to be optimal for estimation but suboptimal for testing; see Fan et al. [2001]. 
As far as we are aware, there is no theoretically justified tuning procedure for obtaining optimal testing in our setup. We address this issue by proposing a data-dependent early stopping rule that enjoys both theoretical support and computational efficiency.\nTo be more specific, we first develop a Wald-type test statistic Dn,t based on the iterated estimator ft, with t being the number of iterations. As illustrated in Figure 1 (a) and (b), the testing power demonstrates a parabolic pattern: it increases as the iterations grow in the beginning, and then decreases after reaching its largest value at t = T\u21e4, showing how over-fitting affects the power performance. To precisely quantify T\u21e4, we analyze the power performance by characterizing the strength of the weakest detectable signals (SWDS). We show that the SWDS at each iteration is controlled by the bias of the iterated estimator and the standard deviation of the test statistic. In\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\ffact, each iterative step reduces the former but increases the latter. Such a tradeoff in testing is rather different from the classical \u201cbias-variance\u201d tradeoff in estimation, as shown in Figure 1 (c). Hence, the early stopping rule to be provided is different from those in the literature such as Raskutti et al. [2014] and Wei et al. 
[2017]; also see Figure 1 (a) and (b) in comparison with power and MSE.\n\nFigure 1: (a) and (b) are the mean square error (MSE) and power performance of the gradient descent update at each iteration with constant step size \u21b5 = 1 (left panel: second-order Sobolev kernel; right panel: Gaussian kernel); power was calculated based on 500 replicates. (a) Data were generated via yi = 0.5xi^2 + 0.5 sin(4\u21e1xi) + \u270fi with sample size n = 200, {xi}_{i=1}^n \u21e0 Unif[0, 1], \u270fi \u21e0 N(0, 1). (b) Data were generated by yi = 0.5xi^2 + 0.5|xi \u2212 0.5| + \u270fi with sample size n = 200. (c) Stopping rules for estimation and testing based on different tradeoff criteria.\n\nThe above analysis applies to many reproducing kernels, and leads to specific optimal testing rates depending on their eigendecay rate. In the specific examples of the polynomial decay kernel and the exponential-polynomial decay kernel, we further show that the developed stopping rule is indeed \u201csharp\u201d: testing optimality is obtained if and only if the number of iterations attains an optimal order defined by the stopping rule. As a by-product, we prove that the early stopping rule in Raskutti et al. [2014] and Wei et al. 
[2017] is also \u201csharp\u201d for optimal estimation.\n\n2 Background and Problem Formulation\n\nWe begin by introducing some background on reproducing kernel Hilbert spaces (RKHS) and functional gradient descent algorithms in the RKHS, together with our nonparametric testing formulation.\n\n2.1 Nonparametric estimation in reproducing kernel Hilbert spaces\n\nConsider the following nonparametric model\n\nyi = f(xi) + \u270fi, i = 1, \u00b7\u00b7\u00b7 , n, (2.1)\n\nwhere xi \u2208 X \u2282 Rd for a fixed d \u2265 1 are random covariates, and \u270fi are Gaussian random noise with mean zero and variance \u03c3^2. Throughout we assume that f \u2208 H, where H \u2282 L2(PX) is a reproducing kernel Hilbert space (RKHS) associated with an inner product \u27e8\u00b7, \u00b7\u27e9H and a reproducing kernel function K(\u00b7, \u00b7) : X \u00d7 X \u2192 R. By Mercer\u2019s Theorem, K has the following spectral expansion:\n\nK(x, x\u2032) = \u2211_{i=1}^\u221e \u00b5i \u03c6i(x) \u03c6i(x\u2032), x, x\u2032 \u2208 X, (2.2)\n\nwhere \u00b51 \u2265 \u00b52 \u2265 \u00b7\u00b7\u00b7 \u2265 0 is a sequence of eigenvalues and {\u03c6i}_{i=1}^\u221e form a basis in L2(PX). Moreover, for any i, j \u2208 N,\n\n\u27e8\u03c6i, \u03c6j\u27e9L2(PX) = \u03b4ij and \u27e8\u03c6i, \u03c6j\u27e9H = \u03b4ij/\u00b5i.\n\nIn the literature, e.g., Guo [2002] and Shang and Cheng [2013], it is common to assume that the \u03c6j\u2019s are uniformly bounded. This is also assumed throughout this paper.\nAssumption A1. The eigenfunctions {\u03c6k}_{k=0}^\u221e are uniformly bounded on X, i.e., there exists a finite constant cK > 0 such that\n\nsup_{j \u2265 1} \u2016\u03c6j\u2016sup \u2264 cK.\n\nTwo types of kernel are often considered in the nonparametric literature, depending on how fast the eigenvalues decay to zero. The first is \u00b5i \u21e3 i^{\u22122m}, leading to the so-called polynomial decay kernel (PDK) of order m > 0. For instance, an mth-order Sobolev space is an RKHS with a PDK of order m; see Wahba [1990], and the trigonometric basis in the periodic Sobolev space with PDK satisfies Assumption A1 trivially. 
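Before turning to the second kernel type, a quick empirical illustration of the polynomial eigendecay just described; this is our own sketch (not the paper's code), with the kernel choice (the first-order Sobolev kernel K(x, x') = min(x, x'), whose eigenvalues decay like i^(-2)), the sample size, and the index range all illustrative assumptions:

```python
import numpy as np

# Illustrative sketch: check the polynomial eigendecay mu_i ~ i^(-2m)
# (PDK of order m = 1) for the first-order Sobolev kernel
# K(x, x') = min(x, x') on [0, 1], via the empirical kernel matrix.
rng = np.random.default_rng(0)
n = 400
x = rng.uniform(0, 1, n)
K = np.minimum.outer(x, x) / n            # empirical kernel [K]_ij = K(x_i, x_j)/n
mu_hat = np.linalg.eigvalsh(K)[::-1]      # eigenvalues, largest first

# The slope of log(mu_hat_i) against log(i) over a mid-range of indices
# should be close to -2m = -2 for this kernel.
idx = np.arange(5, 50)
slope = np.polyfit(np.log(idx), np.log(mu_hat[idx - 1]), 1)[0]
print(f"estimated eigendecay exponent: {slope:.2f}")
```

The same check run with a Gaussian kernel would instead show super-polynomial (exponential-type) decay, which is the second kernel class discussed next.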
The second is \u00b5i \u21e3 exp(\u2212\u03b3 i^p) for some constants \u03b3, p > 0, corresponding to the so-called exponential-polynomial decay kernel (EDK) of order p > 0; see Sch\u00f6lkopf et al. [1999]. In particular, for EDK of order two, an example is K(x1, x2) = exp(\u2212(x1 \u2212 x2)^2/2), i.e., the Gaussian kernel. In the latter case, Assumption A1 holds according to Lu et al. [2016].\nBy the representer theorem, any f \u2208 H can be represented as\n\nf(\u00b7) = (1/\u221an) \u2211_{i=1}^n wi K(xi, \u00b7) + \u03be(\u00b7),\n\nwhere \u03be \u2208 H and \u03be(\u00b7) \u22a5 span{K(x1, \u00b7), \u00b7\u00b7\u00b7 , K(xn, \u00b7)}. Given x = (x1, \u00b7\u00b7\u00b7 , xn), define the empirical kernel matrix [K]ij = (1/n)K(xi, xj) and f = (f(x1), \u00b7\u00b7\u00b7 , f(xn)); then f = \u221an K w, where w = (w1, \u00b7\u00b7\u00b7 , wn)\u22a4 \u2208 Rn.\n\n2.2 Gradient Descent Algorithms\n\nGiven the samples {(xi, yi)}, consider minimizing the least-square loss function\n\nL(f) := (1/2n) \u2211_{i=1}^n (yi \u2212 f(xi))^2 (2.3)\n\nover a Hilbert space H. Note that by the representer theorem, f(x) = \u27e8f, K(x, \u00b7)\u27e9H, so the gradient of L(f) is \u2207L(f) = n^{\u22121} \u2211_{i=1}^n (f(xi) \u2212 yi)K(xi, \u00b7). Given x = (x1, \u00b7\u00b7\u00b7 , xn) and y = (y1, \u00b7\u00b7\u00b7 , yn), define f t = (ft(x1), \u00b7\u00b7\u00b7 , ft(xn)) for t = 0, 1, \u00b7\u00b7\u00b7. Then a straightforward calculation shows that the functional gradient descent algorithm generates a sequence of vectors {f t}_{t=0}^\u221e via the recursion\n\nf t+1 = f t \u2212 \u21b5t K(f t \u2212 y),\n\nwhere {\u21b5t}_{t=0}^\u221e are the step sizes. Denote the total step size up to the t-th step as \u2318t = \u2211_{\u2327=0}^{t\u22121} \u21b5\u2327. Consider the singular value decomposition K = U\u039bU\u22a4, where UU\u22a4 = In and \u039b = diag(\u00b5\u03021, \u00b5\u03022, \u00b7\u00b7\u00b7 , \u00b5\u0302n) with \u00b5\u03021 \u2265 \u00b5\u03022 \u2265 \u00b7\u00b7\u00b7 \u2265 \u00b5\u0302n \u2265 0. We have the following assumption on the step sizes and \u2318t.\nAssumption A2. The step sizes {\u21b5t}_{t=0}^\u221e are non-increasing; for all \u2327 = 0, 1, 2, \u00b7\u00b7\u00b7 , 0 \u2264 \u21b5\u2327 \u2264 min{1, 1/\u00b5\u03021}. 
The total step size \u2318t = \u2211_{\u2327=0}^{t\u22121} \u21b5\u2327 diverges as t \u2192 \u221e; and for 0 \u2264 t1 \u2272 t2 as t2 \u2192 \u221e, \u2318t1 \u2272 \u2318t2.\nAssumption A2 supposes the step sizes {\u21b5t}_{t=0}^\u221e to be bounded and non-increasing, but they cannot decrease too fast as t diverges. Many choices of step sizes satisfy Assumption A2. A trivial example is to choose a constant step size \u21b50 = \u00b7\u00b7\u00b7 = \u21b5t = min{1, 1/\u00b5\u03021}.\nDefine \u03bat = argmin{j : \u00b5j < 1/\u2318t} \u2212 1; we have the following assumption on the population eigenvalues through \u03bat.\nAssumption A3. \u03bat diverges as t \u2192 \u221e.\nIt is easy to check that Assumption A3 is satisfied by the PDK and EDK introduced in Section 2.1.\n\n2.3 Nonparametric testing\n\nOur goal is to test whether the nonparametric function in (2.1) is equal to some known function. To be precise, we consider the nonparametric hypothesis testing problem\n\nH0 : f = f\u21e4 v.s. H1 : f \u2208 H \\ {f\u21e4},\n\nwhere f\u21e4 is a hypothesized function. For convenience, assume f\u21e4 = 0, i.e., we will test\n\nH0 : f = 0 vs. H1 : f \u2208 H \\ {0}. (2.4)\n\nIn general, testing f = f\u21e4 (for an arbitrary known f\u21e4) is equivalent to testing f\u0303 \u2261 f \u2212 f\u21e4 = 0. So, (2.4) has no loss of generality. Based on the iterated estimator ft, we propose the following Wald-type test statistic:\n\nDn,t = \u2016ft\u2016n^2, (2.5)\n\nwhere \u2016ft\u2016n^2 = (1/n) \u2211_{i=1}^n ft^2(xi). In what follows, we will derive the null limit distribution of Dn,t, and explicitly show how the stopping time affects the minimax optimality of testing.\n\n3 Main Results\n\n3.1 Stopping rule for nonparametric testing\n\nGiven a sequence of step sizes {\u21b5t}_{t=0}^\u221e satisfying Assumption A2, we first introduce the stopping rule\n\nT\u21e4 := argmin{ t \u2208 N : 1/\u2318t < (\u03c3/n) \u221a(\u2211_{i=1}^n min{1, \u2318t \u00b5\u0302i}) }. (3.1)\n\nAs will be clarified in Section 3.2, the intuition underlying the stopping rule (3.1) is that 1/\u2318t controls the bias of the iterated estimator ft, which is a decreasing function of t, while (\u03c3/n)\u221a(\u2211_{i=1}^n min{1, \u2318t \u00b5\u0302i}) controls the standard deviation of the test statistic Dn,t, which is an increasing function of t. The optimal stopping rule is achieved by this bias-standard deviation tradeoff. Recall that the tradeoff in (3.1) for testing is different from the bias-variance tradeoff in estimation (see Raskutti et al. [2014], Wei et al. [2017]), thus leading to a different optimal stopping time. In fact, as seen in Figure 1 (c), optimal estimation can be achieved at T\u0303, which is earlier than T\u21e4. This is also empirically confirmed by Figure 1 (a) and (b), where the minimum mean square error (MSE) is always achieved earlier than the maximum power. Please see Section 4 for more discussion.\n\n3.2 Minimax optimal testing\n\nIn this section, we first derive the null limit distribution of (standardized) Dn,t as standard Gaussian under mild conditions; that is, we only require the total step size \u2318t to go to infinity.\nDefine a sequence of diagonal shrinkage matrices St = \u220f_{\u2327=0}^{t\u22121}(In \u2212 \u21b5\u2327 \u039b). As stated in Raskutti et al. [2014], the matrix St describes the extent of shrinkage towards the origin. By Assumption A2, 0 \u2264 \u21b5\u2327 \u2264 min{1, 1/\u00b5\u03021}, so St is positive semidefinite.\nTheorem 3.1. Suppose Assumptions A2 and A3 are satisfied. Then under H0, as n \u2192 
\u221e and t \u2192 \u221e, we have\n\n(Dn,t \u2212 \u00b5n,t)/\u03c3n,t \u2192d N(0, 1).\n\nHere \u00b5n,t = EH0[Dn,t|x] = (1/n) tr((In \u2212 St)^2) and \u03c3n,t^2 = VarH0[Dn,t|x] = (2/n^2) tr((In \u2212 St)^4).\nThen based on Theorem 3.1, we have the following testing rule at significance level \u21b5:\n\n\u03c6n,t = I(|Dn,t \u2212 \u00b5n,t| \u2265 z1\u2212\u21b5/2 \u03c3n,t),\n\nwhere z1\u2212\u21b5/2 is the 100 \u00d7 (1 \u2212 \u21b5/2)th percentile of the standard normal distribution.\nLemma 3.2. \u00b5n,t \u21e3 (1/n) \u2211_{i=1}^n min{1, \u2318t \u00b5\u0302i}, and \u03c3n,t^2 \u21e3 (1/n^2) \u2211_{i=1}^n min{1, \u2318t \u00b5\u0302i}.\nDefine the squared separation rate\n\ndn,t^2 = 1/\u2318t + \u03c3n,t \u21e3 1/\u2318t + (1/n) \u221a(\u2211_{i=1}^n min{1, \u2318t \u00b5\u0302i}).\n\nThe separation rate dn,t is used to measure the distance between the null hypothesis and a sequence of alternative hypotheses. The following Theorem 3.3 shows that, if the alternative signal f is separated from zero by an order dn,t, then the proposed test statistic Dn,t asymptotically achieves high power at the total step size \u2318t. To achieve the maximum power, we need to minimize dn,t. Under the stopping rule (3.1), when t = T\u21e4, the separation rate achieves its minimal value d\u21e4n := dn,T\u21e4.\nTheorem 3.3. (a) Suppose Assumptions A2 and A3 are satisfied. For any \u03b5 > 0, there exist positive constants C\u03b5, t\u03b5 and N\u03b5 such that, with probability greater than 1 \u2212 e^{\u2212c\u03bat},\n\ninf_{t \u2265 t\u03b5} inf_{n \u2265 N\u03b5} inf_{f \u2208 B, \u2016f\u2016n \u2265 C\u03b5 dn,t} Pf(\u03c6n,t = 1|x) \u2265 1 \u2212 \u03b5,\n\nwhere c is a constant, B = {f \u2208 H : \u2016f\u2016H \u2264 C} for a constant C, and Pf(\u00b7) is the probability measure under f.\n(b) The separation rate dn,t achieves its minimal value d\u21e4n := dn,T\u21e4.\nThe general Theorem 3.3 implies the following concrete stopping rules under various kernel classes.\nCorollary 3.4. (PDK of order m) Suppose Assumption A2 holds and m > 3/2. 
Then at time T\u21e4 with \u2318T\u21e4 \u21e3 n^{4m/(4m+1)}, for any \u03b5 > 0, there exist constants C\u03b5 and N\u03b5 such that, with probability greater than 1 \u2212 e^{\u2212cm n^{(2m\u22123)/(2m\u22121)}} \u2212 e^{\u2212c1 n^{2/(4m+1)}},\n\ninf_{n \u2265 N\u03b5} inf_{f \u2208 B, \u2016f\u2016n \u2265 C\u03b5 n^{\u22122m/(4m+1)}} Pf(\u03c6n,T\u21e4 = 1|x) \u2265 1 \u2212 \u03b5,\n\nwhere cm is an absolute constant depending on m only and c1 is a constant.\nNote that the minimal separation rate n^{\u22122m/(4m+1)} is minimax optimal according to Ingster [1993]. Thus, Dn,T\u21e4 is optimal when \u2318T\u21e4 \u21e3 n^{4m/(4m+1)}. Note that \u2318T\u21e4 = \u2211_{t=0}^{T\u21e4\u22121} \u21b5t, so T\u21e4 \u21e3 n^{4m/(4m+1)} when constant step sizes are chosen.\nCorollary 3.5. (EDK of order p) Suppose Assumption A2 holds and p \u2265 1. Then at time T\u21e4 with \u2318T\u21e4 \u21e3 n(log n)^{\u22121/(2p)}, for any \u03b5 > 0, there exist constants C\u03b5 and N\u03b5 such that, with probability greater than 1 \u2212 e^{\u2212c\u03b3,p n(log n)^{\u22122/p}} \u2212 e^{\u2212c1(log n)^{1/p}},\n\ninf_{n \u2265 N\u03b5} inf_{f \u2208 B, \u2016f\u2016n \u2265 C\u03b5 n^{\u22121/2}(log n)^{1/(4p)}} Pf(\u03c6n,T\u21e4 = 1|x) \u2265 1 \u2212 \u03b5,\n\nwhere c\u03b3,p is an absolute constant depending on \u03b3 and p.\nNote that the minimal separation rate n^{\u22121/2}(log n)^{1/(4p)} is proven to be minimax optimal in Corollary 1 of Wei and Wainwright [2017]. Hence, Dn,T\u21e4 is optimal at the total step size \u2318T\u21e4 \u21e3 n(log n)^{\u22121/(2p)}. When the step sizes are chosen as constants, the corresponding T\u21e4 \u21e3 n(log n)^{\u22121/(2p)}.\n\n3.3 Sharpness of the stopping rule\n\nTheorem 3.3 shows that optimal testing can be achieved when t = T\u21e4. In the specific examples of PDK and EDK, Theorem 3.6 further shows that when t \u226a T\u21e4 or t \u226b T\u21e4, there exists a local alternative f that is not detectable by Dn,t even when it is separated from zero by d\u21e4n. In this case, the asymptotic testing power is actually smaller than \u21b5. Hence, we claim that T\u21e4 is sharp in the sense that testing optimality is obtained if and only if the total step size achieves the order of \u2318T\u21e4. 
Given a sequence of step sizes {\u21b5t}_{t=0}^\u221e satisfying Assumption A2, we have the following results.\nTheorem 3.6. Suppose Assumption A2 holds, and t \u226a T\u21e4 or t \u226b T\u21e4. There exists a positive constant C1 such that, with probability approaching 1,\n\nlim sup_{n\u2192\u221e} inf_{f \u2208 B, \u2016f\u2016n \u2265 C1 d\u21e4n} Pf(\u03c6n,t = 1|x) \u2264 \u21b5.\n\nIn the proof, we construct the alternative function as \u2211_{i=1}^n K(xi, \u00b7)wi, with wi being defined in (A.8) and (A.9) for the two cases t \u226a T\u21e4 and t \u226b T\u21e4, respectively.\n\n4 Sharpness of early stopping in nonparametric estimation\n\nIn this section, we review the existing early stopping rule for estimation, and further explore its \u201csharpness\u201d property. In the literature, Raskutti et al. [2014] and Wei et al. [2017] proposed to use the fixed point of the local empirical Rademacher complexity to define the stopping rule as follows:\n\nT\u0303 := argmin{ t \u2208 N : 1/\u2318t < (\u03c3^2/n) \u2211_{i=1}^n min{1, \u2318t \u00b5\u0302i} }. (4.1)\n\nGiven the above stopping rule, the following theorem holds, where f\u21e4 represents the truth.\nTheorem 4.1 (Raskutti et al. [2014]). Given the stopping time T\u0303 defined by (4.1), there are universal positive constants (c1, c2) such that the following events hold with probability at least 1 \u2212 c1 exp(\u2212c2 n/\u2318T\u0303):\n(a) For all iterations t = 1, 2, \u00b7\u00b7\u00b7 , T\u0303: \u2016ft \u2212 f\u21e4\u2016n^2 \u2264 4\u03c3^2/(e\u2318t).\n(b) At the iteration T\u0303, \u2016fT\u0303 \u2212 f\u21e4\u2016n^2 \u2264 12\u03c3^2/\u2318T\u0303.\n(c) For all t \u2265 T\u0303,\n\nE\u2016ft \u2212 f\u21e4\u2016n^2 \u2265 (\u03c3^2/4)(1/n) \u2211_{i=1}^n min{1, \u00b5\u0302i \u2318t} \u2265 \u03c3^2/(4\u2318T\u0303).\n\nTo show the sharpness of T\u0303, it suffices to examine the upper bound in Theorem 4.1 (a). In particular, we prove a complementary lower bound result. Specifically, Theorem 4.2 implies that once t \u226a T\u0303, the rate optimality will break down for at least one true f \u2208 B with high probability. 
Denote the stopping time T\u0303 satisfying\n\n\u2318T\u0303 \u21e3 n^{2m/(2m+1)} if K is PDK of order m, and \u2318T\u0303 \u21e3 n/(log n)^{1/p} if K is EDK of order p.\n\nTheorem 4.2. (a) (PDK of order m) Suppose Assumption A2 holds and m > 3/2. For all t \u226a T\u0303, with probability approaching 1, it holds that\n\nsup_{f\u21e4 \u2208 B} \u2016ft \u2212 f\u21e4\u2016n^2 \u2265 cm \u03c3^2/\u2318t \u226b 1/\u2318T\u0303.\n\n(b) (EDK of order p) Suppose Assumption A2 holds and p \u2265 1. For all t \u226a T\u0303, with probability approaching 1,\n\nsup_{f\u21e4 \u2208 B} \u2016ft \u2212 f\u21e4\u2016n^2 \u226b 1/\u2318T\u0303.\n\nCombining with Theorem 4.1, we claim that T\u0303 is a \u201csharp\u201d stopping time for estimation.\nLastly, we comment briefly that the stopping rule for estimation and Theorem 4.1 (a), (b) can also be obtained in our framework as a by-product. Intuitively, the stopping time T\u0303 in (4.1) is achieved by the classical bias-variance tradeoff. Note that \u2016ft \u2212 f\u21e4\u2016n^2 has the trivial upper bound\n\n\u2016ft \u2212 f\u21e4\u2016n^2 \u2264 2\u2016ft \u2212 E\u270f ft\u2016n^2 [variance] + 2\u2016E\u270f ft \u2212 f\u21e4\u2016n^2 [squared bias],\n\nwhere the expectation is taken with respect to \u270f. The squared bias term is bounded by 1/\u2318t (see Lemma A.3); the variance term is bounded by the mean of Dn,t, that is, \u2016ft \u2212 E ft\u2016n^2 = OP(\u00b5n,t) (see Lemma A.1), where \u00b5n,t = tr((I \u2212 St)^2)/n \u21e3 (1/n) \u2211_{i=1}^n min{1, \u2318t \u00b5\u0302i} as shown in Lemma 3.2. Obviously, according to (4.1), when t \u226a T\u0303, the squared bias will dominate the variance.\n\n5 Numerical Study\n\nIn this section, we compare our testing method with an oracle version of the stopping rule that uses knowledge of f\u21e4, as well as the test based on the penalized regularization. 
We further conduct simulation studies to verify our theoretical results.\nAn oracle version of the early stopping rule. The early stopping rule defined in (3.1) involves the bias of the iterated estimator ft, which can be directly calculated as\n\n\u2016E ft \u2212 f\u21e4\u2016n^2 = \u2016St U\u22a4 f\u21e4\u2016n^2 = (1/n) \u2211_{i=1}^n (St,ii)^2 [U\u22a4 f\u21e4(x)]i^2.\n\nAnd the standard deviation of Dn,t is \u03c3n,t = (1/n)\u221a(2 tr((I \u2212 St)^4)). An \u201coracle\u201d method is to base its stopping time on the exact in-sample bias of ft and the standard deviation of Dn,t, defined as follows:\n\nT\u2020 := argmin{ t \u2208 N : (1/n) \u2211_{i=1}^n (St,ii)^2 [U\u22a4 f\u21e4(x)]i^2 < (1/n)\u221a(2 tr((I \u2212 St)^4)) }. (5.1)\n\nOur oracle method represents an ideal case where the true function f\u21e4 is known.\nAlgorithm based on the early stopping rule (3.1). In the early stopping rule defined in (3.1), the bias term is bounded by the order of 1/\u2318t. To implement the stopping rule in (3.1) practically, we propose a bootstrap method to approximate the bias term. Specifically, we calculate a sequence {ft^(b)}_{b=1}^B based on the pair-bootstrapped data {xi^(b), yi^(b)}_{i=1}^n, and use \u2016St U\u22a4 f\u0304tB\u2016n^2 to approximate the bias term, where f\u0304tB = \u2211_{b=1}^B ft^(b)/B and B is a positive integer. On the other hand, the standard deviation term (1/n)\u221a(2 tr((I \u2212 St)^4)) involves calculating all eigenvalues of the kernel matrix. This step can be implemented by many methods for fast computation of kernel eigenvalues; see Stewart [2002], Drineas and Mahoney [2005] and references therein.\nPenalization-based test. As another reference, we also conduct the penalization-based test by using the test statistic Dn,\u03bb = \u2016f\u0302n,\u03bb\u2016n^2. Here f\u0302n,\u03bb is the kernel ridge regression (KRR) estimator (Shawe-Taylor and Cristianini [2004]) defined as\n\nf\u0302n,\u03bb := argmin_{f \u2208 H} { (1/n) \u2211_{i=1}^n (yi \u2212 f(xi))^2 + \u03bb\u2016f\u2016H^2 }, (5.2)\n\nwhere \u2016f\u2016H^2 = \u27e8f, f\u27e9H with \u27e8\u00b7, \u00b7\u27e9H the inner product of H. 
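The penalization-based reference test can be sketched in a few lines: compute the KRR fitted values and the statistic D_{n,lambda} = ||f_hat||_n^2. This is a hedged illustration, not the paper's experiment code: the Gaussian kernel, the fixed lambda, and the data-generating function are our own choices, and the fitted values use the standard KRR formula K(K + lambda*n*I)^(-1) y rather than the paper's exact normalization.

```python
import numpy as np

# Minimal sketch of the penalization-based test statistic D_{n,lambda}.
rng = np.random.default_rng(1)
n, lam = 200, 1e-2
x = rng.uniform(0, 1, n)
y = 0.5 * x**2 + 0.5 * np.sin(4 * np.pi * x) + rng.normal(size=n)

K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 2)   # Gaussian kernel (EDK, p = 2)
# KRR fitted values at the sample points: K (K + lambda*n*I)^(-1) y
f_hat = K @ np.linalg.solve(K + lam * n * np.eye(n), y)
D_n_lam = np.mean(f_hat ** 2)                      # ||f_hat||_n^2
print(f"D_n,lambda = {D_n_lam:.3f}")
```

In practice the remaining difficulty, as the text notes, is choosing lambda: cross validation targets estimation rather than testing, which is exactly the gap the early stopping rule addresses.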
The penalty parameter \u03bb plays the same role as the total step size \u2318t in avoiding overfitting. Liu et al. [2018] shows that the minimax optimal testing rate can be achieved by choosing the penalty parameter \u03bb\u21e4 satisfying \u03bb\u21e4 \u21e3 \u221a(tr(((\u039b + \u03bb\u21e4 In)^{\u22121}\u039b)^4))/n. The specific \u03bb\u21e4 varies for different kernel classes. For example, in PDK, optimal testing can be achieved with \u03bb\u21e4 \u21e3 n^{\u22124m/(4m+1)}; in EDK, the corresponding \u03bb\u21e4 \u21e3 n^{\u22121}(log n)^{1/(2p)}. We discover an interesting connection: the inverses of these \u03bb\u21e4 share the same order as the stopping rules in Corollary 3.4 and Corollary 3.5, respectively. Lemma 5.1 provides a theoretical explanation for this connection.\nLemma 5.1. tr(((\u039b + \u03bb In)^{\u22121}\u039b)^4) \u21e3 tr((I \u2212 St)^4) holds if and only if \u03bb \u21e3 1/\u2318t.\nHowever, it is still challenging to choose the optimal penalty parameter for testing in practice. A compromising strategy is to use the cross validation (CV) method (Golub et al. [1979]), which was invented for optimal estimation problems. In the following numerical study, we will show that the CV-based Dn,\u03bb performs less satisfactorily than our proposed early stopping method.\n\n5.1 Numerical study I\n\nIn this section, we compare our early stopping based test statistic (ES) with two other methods: the oracle early stopping (Oracle ES) method, and the penalization-based test described above. In particular, we consider the hypothesis testing problem H0 : f = 0.\n\nData were generated from the regression model (2.1) with f(xi) = c \u00b7 cos(4\u21e1xi), where xi iid\u21e0 Unif[0, 1] and c = 0, 1, respectively. c = 0 is used for examining the size of the test, and c = 1 is used for examining the power of the test. The sample size n ranges from 100 to 1000. We use the Gaussian kernel (i.e., p = 2 in EDK) to fit the data. The significance level was chosen as 0.05. 
Both size and power were calculated as the proportions of rejections based on 500 independent replications. For the ES, we use the bootstrap method to approximate the bias with B = 10 and the step size \u21b5 = 1. For the penalization-based test, we use 10-fold cross validation (10-fold CV) to select the penalty parameter. For the Oracle ES, we follow the stopping rule in (5.1) with constant step size \u21b5 = 1.\nFigure 2 (a) shows that the sizes of the three testing methods approach the nominal level 0.05 under various n, demonstrating the testing consistency. Figure 2 (b) displays the power of the three testing rules. The ES exhibits better power performance than the penalization-based test with 10-fold CV under various sample sizes. Furthermore, as n increases, the power of the ES approaches that of the Oracle ES, which serves as the benchmark. As shown in Figure 2 (c), the ES enjoys great computational efficiency compared with the Wald-test with 10-fold CV, and it is reasonable that our proposed ES takes more time than the Oracle ES, due to the extra step for bootstrapping. 
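The ES procedure above rests on the gradient descent recursion of Section 2.2; a self-contained sketch (our own simplified setup with the Numerical Study I signal, not the paper's experiment code) that runs the recursion and tracks D_{n,t} along the path:

```python
import numpy as np

# Gradient descent f^{t+1} = f^t - alpha * K (f^t - y) on the empirical
# kernel matrix [K]_ij = K(x_i, x_j)/n, recording the Wald-type statistic
# D_{n,t} = ||f^t||_n^2 at every iteration. Step size follows Assumption A2.
rng = np.random.default_rng(2)
n, T = 200, 300
x = rng.uniform(0, 1, n)
y = np.cos(4 * np.pi * x) + rng.normal(size=n)    # signal strength c = 1

K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 2) / n
alpha = min(1.0, 1.0 / np.linalg.eigvalsh(K).max())   # alpha <= min{1, 1/mu_hat_1}
f = np.zeros(n)
D = []
for _ in range(T):
    f = f - alpha * (K @ (f - y))
    D.append(np.mean(f ** 2))
print(f"D_n,t after {T} steps: {D[-1]:.3f}")
```

Since each eigencomponent of f^t shrinks monotonically toward the corresponding component of y, the recorded statistic D_{n,t} is non-decreasing in t, which is why the standard deviation side of the tradeoff grows with the iteration count.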
In Supplementary A.8, we show additional synthetic experiments with various c based on the second-order Sobolev kernel, verifying our theoretical contribution.\n\nFigure 2: (a) is the size with signal strength c = 0; (b) is the power with signal strength c = 1; (c) is the computational time (in seconds) for the three testing rules (10-fold CV, Oracle ES, and ES).\n\n5.2 Numerical study II\n\nIn this section, we show synthetic experiments verifying our sharpness results stated in Corollary 3.4, Corollary 3.5 and Theorem 3.6. Data were generated from the regression model (2.1) with f(xi) = c(0.8(xi \u2212 0.5)^2 + 0.2 sin(4\u21e1xi)), where xi iid\u21e0 Unif[0, 1], and c = 0, 1, respectively. Other settings are the same as in Section 5.1.\nIn Figure 3 (a) and (b), we use the second-order Sobolev kernel (i.e., m = 2 in PDK) to fit the model, and set the constant step size \u21b5 = 1. Corollary 3.4 suggests that optimal power can be achieved at the stopping time T\u21e4 \u21e3 n^{8/9}. 
To display the impact of the stopping time on power performance, we set the total iteration steps T as \u03b3 \u00b7 n^{8/9} with \u03b3 = 2/3, 1, 4/3, and n ranges from 100 to 1000. Figure 3 (a) shows that the size approaches the nominal level 0.05 under various choices of (\u03b3, n), demonstrating the testing consistency supported by Theorem 3.1. Figure 3 (b) displays the power of our testing rule. A key observation is that the power under the theoretically derived stopping rule (\u03b3 = 1) performs best, compared with the other stopping choices (\u03b3 = 2/3, 4/3). In Figure 3 (c) and (d), we use the Gaussian kernel (i.e., p = 2 in EDK) to fit the model, and set the constant step size \u21b5 = 1. Here we set the total iteration steps as \u03b3 \u00b7 n/(log n)^{1/4} with \u03b3 = 2/3, 1, 4/3, and n ranges from 100 to 1000. Note that \u03b3 = 1 corresponds to the optimal stopping time in Corollary 3.5. Overall, the interpretations are similar to those of Figure 3 (a) and (b) for PDK.\n\n6 Discussion\n\nThe main contribution of this paper is that we apply the early stopping strategy to nonparametric testing, and propose the first \u201csharp\u201d stopping rule to guarantee minimax optimal testing (to the best of our knowledge). Our stopping rule depends on the eigenvalues of the kernel matrix, especially the first few leading eigenvalues. 
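Since only a few leading eigenvalues matter for the stopping rule, a full eigendecomposition is unnecessary. A minimal sketch using subspace (orthogonal) iteration, with all parameter choices ours; production code would instead use a Krylov method or a Nystrom approximation, as cited below:

```python
import numpy as np

# Approximate the top-k eigenvalues of a kernel matrix by subspace iteration
# plus a Rayleigh-Ritz step, avoiding a full eigendecomposition.
def top_eigvals(K, k=5, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.linalg.qr(rng.normal(size=(K.shape[0], k)))[0]
    for _ in range(iters):
        Q = np.linalg.qr(K @ Q)[0]        # power step + re-orthogonalization
    return np.sort(np.linalg.eigvalsh(Q.T @ K @ Q))[::-1]

rng = np.random.default_rng(3)
n = 300
x = rng.uniform(0, 1, n)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 2) / n   # Gaussian empirical kernel
approx = top_eigvals(K, k=5)
exact = np.sort(np.linalg.eigvalsh(K))[::-1][:5]
print("max abs error on top-5 eigenvalues:", np.max(np.abs(approx - exact)))
```

For an EDK spectrum the eigenvalue gaps are large, so subspace iteration converges quickly, which is what makes the stopping rule cheap to evaluate in this regime.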
There are many efficient methods to compute the top eigenvalues fast; see Drineas and Mahoney [2005], Ma and Belkin [2017]. As future work, we can also introduce randomly projected kernel methods to accelerate the computation.\n\nFigure 3: (a) is the size of Dn,t with signal strength c = 0 under PDK; (b) is the power of Dn,t with signal strength c = 1 under PDK; (c) is the size of Dn,t with signal strength c = 0 under EDK; (d) is the power of Dn,t with signal strength c = 1 under EDK (curves: \u03b3 = 2/3, 1, 4/3).\n\nReferences\n\nPeter B\u00fchlmann and Bin Yu. Boosting with the L2 loss: regression and classification. Journal of the American Statistical Association, 98(462):324\u2013339, 2003.\n\nPetros Drineas and Michael W Mahoney. On the Nystr\u00f6m method for approximating a Gram matrix for improved kernel-based learning. 
Journal of Machine Learning Research, 6(Dec):2153\u20132175, 2005.\n\nJianqing Fan and Jiancheng Jiang. Nonparametric inference with generalized likelihood ratio tests. TEST, 16(3):409\u2013444, Dec 2007. ISSN 1863-8260.\n\nJianqing Fan, Chunming Zhang, and Jian Zhang. Generalized likelihood ratio statistics and Wilks phenomenon. The Annals of Statistics, 29(1):153\u2013193, 2001.\n\nGene H Golub, Michael Heath, and Grace Wahba. Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics, 21(2):215\u2013223, 1979.\n\nWensheng Guo. Inference in smoothing spline analysis of variance. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64(4):887\u2013898, 2002.\n\nYuri I Ingster. Asymptotically minimax hypothesis testing for nonparametric alternatives. I, II, III. Mathematical Methods of Statistics, 2(2):85\u2013114, 1993.\n\nMeimei Liu, Zuofeng Shang, and Guang Cheng. Nonparametric testing under random projection. arXiv preprint arXiv:1802.06308, 2018.\n\nJunwei Lu, Guang Cheng, and Han Liu. Nonparametric heterogeneity testing for massive data. arXiv preprint arXiv:1601.06212, 2016.\n\nSiyuan Ma and Mikhail Belkin. Diving into the shallows: a computational perspective on large-scale shallow learning. In Advances in Neural Information Processing Systems, pages 3781\u20133790, 2017.\n\nGarvesh Raskutti, Martin J Wainwright, and Bin Yu. Early stopping and non-parametric regression: an optimal data-dependent stopping rule. Journal of Machine Learning Research, 15(1):335\u2013366, 2014.\n\nMark Rudelson and Roman Vershynin. Hanson-Wright inequality and sub-Gaussian concentration. Electronic Communications in Probability, 18(82):1\u20139, 2013.\n\nBernhard Sch\u00f6lkopf, Christopher JC Burges, and Alexander J Smola. Advances in kernel methods: support vector learning. MIT Press, 1999.\n\nZuofeng Shang and Guang Cheng. 
Local and global asymptotic inference in smoothing spline models. The Annals of Statistics, 41(5):2608\u20132638, 2013.\n\nJohn Shawe-Taylor and Nello Cristianini. Kernel methods for pattern analysis. Cambridge University Press, 2004.\n\nGilbert W Stewart. A Krylov\u2013Schur algorithm for large eigenproblems. SIAM Journal on Matrix Analysis and Applications, 23(3):601\u2013614, 2002.\n\nGrace Wahba. Spline models for observational data. SIAM, 1990.\n\nYuting Wei and Martin J Wainwright. The local geometry of testing in ellipses: Tight control via localized Kolmogorov widths. arXiv preprint arXiv:1712.00711, 2017.\n\nYuting Wei, Fanny Yang, and Martin J Wainwright. Early stopping for kernel boosting algorithms: A general analysis with localized complexities. In Advances in Neural Information Processing Systems, pages 6067\u20136077, 2017.\n\nYuan Yao, Lorenzo Rosasco, and Andrea Caponnetto. On early stopping in gradient descent learning. Constructive Approximation, 26(2):289\u2013315, 2007.\n\nTong Zhang and Bin Yu. Boosting with early stopping: Convergence and consistency. The Annals of Statistics, 33(4):1538\u20131579, 2005.\n", "award": [], "sourceid": 1974, "authors": [{"given_name": "Meimei", "family_name": "Liu", "institution": "Duke University"}, {"given_name": "Guang", "family_name": "Cheng", "institution": "Purdue University"}]}