{"title": "Inference for the Generalization Error", "book": "Advances in Neural Information Processing Systems", "page_first": 307, "page_last": 313, "abstract": null, "full_text": "Inference for the Generalization Error \n\nClaude Nadeau \n\nCIRANO \n\n2020, University, \n\nYoshua Bengio \n\nCIRANO and Dept. IRO \nUniversite de Montreal \n\nMontreal, Qc, Canada, H3A 2A5 \njcnadeau@altavista.net \n\nMontreal, Qc, Canada, H3C 3J7 \n\nbengioy@iro.umontreal.ca \n\nAbstract \n\nIn order to to compare learning algorithms, experimental results reported \nin the machine learning litterature often use statistical tests of signifi(cid:173)\ncance. Unfortunately, most of these tests do not take into account the \nvariability due to the choice of training set. We perform a theoretical \ninvestigation of the variance of the cross-validation estimate of the gen(cid:173)\neralization error that takes into account the variability due to the choice \nof training sets. This allows us to propose two new ways to estimate \nthis variance. We show, via simulations, that these new statistics perform \nwell relative to the statistics considered by Dietterich (Dietterich, 1998). \n\n1 Introduction \n\nWhen applying a learning algorithm (or comparing several algorithms), one is typically \ninterested in estimating its generalization error. Its point estimation is rather trivial through \ncross-validation. Providing a variance estimate of that estimation, so that hypothesis test(cid:173)\ning and/or confidence intervals are possible, is more difficult, especially, as pointed out in \n(Hinton et aI., 1995), if one wants to take into account the variability due to the choice of \nthe training sets (Breiman, 1996). A notable effort in that direction is Dietterich's work (Di(cid:173)\netterich, 1998). Careful investigation of the variance to be estimated allows us to provide \nnew variance estimates, which tum out to perform well. \n\nLet us first layout the framework in which we shall work. 
We assume that data are available in the form Z_1^n = {Z_1, ..., Z_n}. For example, in the case of supervised learning, Z_i = (X_i, Y_i) ∈ Z ⊆ R^{p+q}, where p and q denote the dimensions of the X_i's (inputs) and the Y_i's (outputs). We also assume that the Z_i's are independent, with Z_i ~ P(Z). Let L(D; Z), where D represents a subset of size n1 ≤ n taken from Z_1^n, be a function from Z^{n1} × Z to R. For instance, this function could be the loss incurred by the decision that a learning algorithm trained on D makes on a new example Z. We are interested in estimating nμ ≡ E[L(Z_1^n; Z_{n+1})], where Z_{n+1} ~ P(Z) is independent of Z_1^n. The subscript n stands for the size of the training set (Z_1^n here). The above expectation is taken over Z_1^n and Z_{n+1}, meaning that we are interested in the performance of an algorithm rather than the performance of the specific decision function it yields on the data at hand. According to Dietterich's taxonomy (Dietterich, 1998), we deal with problems of type 5 through 8 (evaluating learning algorithms) rather than type 1 through 4 (evaluating decision functions). We call nμ the generalization error even though it can also represent an error difference: \n\n• Generalization error. We may take \n\nL(D; Z) = L(D; (X, Y)) = Q(F(D)(X), Y), (1) \n\nwhere F(D) : R^p → R^q is the decision function obtained when training an algorithm on D, and Q is a loss function measuring the inaccuracy of a decision. For instance, we could have Q(ŷ, y) = I[ŷ ≠ y], where I[·] is the indicator function, for classification problems, and Q(ŷ, y) = ‖ŷ − y‖², where ‖·‖ is the Euclidean norm, for \"regression\" problems. In that case nμ is what most people call the generalization error. 
• Comparison of generalization errors. Sometimes, we are not interested in the performance of algorithms per se, but instead in how two algorithms compare with each other. In that case we may want to consider \n\nL(D; Z) = L(D; (X, Y)) = Q(F_A(D)(X), Y) − Q(F_B(D)(X), Y), (2) \n\nwhere F_A(D) and F_B(D) are the decision functions obtained when training two algorithms (A and B) on D, and Q is a loss function. In this case nμ would be a difference of generalization errors as outlined in the previous example. \n\nThe generalization error is often estimated via some form of cross-validation. Since there are various versions of the latter, we lay out the specific form we use in this paper. \n\n• Let S_j be a random set of n1 distinct integers from {1, ..., n} (n1 < n). Here n1 represents the size of the training set and we shall let n2 = n − n1 be the size of the test set. \n\n• Let S_1, ..., S_J be independent such random sets, and let S̄_j = {1, ..., n} ∖ S_j denote the complement of S_j. \n\n• Let Z_{S_j} = {Z_i | i ∈ S_j} be the training set obtained by subsampling Z_1^n according to the random index set S_j. The corresponding test set is Z_{S̄_j} = {Z_i | i ∈ S̄_j}. \n\n• Let L(j, i) = L(Z_{S_j}; Z_i). According to (1), this could be the error an algorithm trained on the training set Z_{S_j} makes on example Z_i. According to (2), this could be the difference of such errors for two different algorithms. \n\n• Let μ̂_j = (1/K) Σ_{k=1}^K L(j, i_k^j), where i_1^j, ..., i_K^j are randomly and independently drawn from S̄_j. Here we draw K examples from the test set Z_{S̄_j} with replacement and compute the average error committed. The notation does not convey the fact that μ̂_j depends on K, n1 and n2. \n\n• Let μ̄_j = lim_{K→∞} μ̂_j = (1/n2) Σ_{i∈S̄_j} L(j, i) denote what μ̂_j becomes as K increases without bound. Indeed, when sampling infinitely often from Z_{S̄_j}, each Z_i (i ∈ S̄_j) is chosen with relative frequency 1/n2, yielding the usual \"average test error\". 
The use of K is just a mathematical device to make the test examples sampled independently from S̄_j. \n\nThen the cross-validation estimate of the generalization error considered in this paper is \n\n^{n2}_{n1}μ̂_J^K = (1/J) Σ_{j=1}^J μ̂_j. \n\nWe note that this is an unbiased estimator of n1μ ≡ E[L(Z_1^{n1}; Z_{n+1})] (not the same as nμ). \n\nThis paper is about the estimation of the variance of ^{n2}_{n1}μ̂_J^K. We first study this variance theoretically in section 2, leading to two new variance estimators developed in section 3. Section 4 shows part of a simulation study we performed to see how the proposed statistics behave compared to statistics already in use. \n\n2 Analysis of Var[^{n2}_{n1}μ̂_J^K] \n\nHere we study Var[^{n2}_{n1}μ̂_J^K]. This is important to understand why some inference procedures about n1μ presently in use are inadequate, as we shall underline in section 4. This investigation also enables us to develop estimators of Var[^{n2}_{n1}μ̂_J^K] in section 3. Before we proceed, we state the following useful lemma, proved in (Nadeau and Bengio, 1999). \n\nLemma 1 Let U_1, ..., U_k be random variables with common mean β, common variance δ and Cov[U_i, U_j] = γ for all i ≠ j. Let π = γ/δ be the correlation between U_i and U_j (i ≠ j). Let Ū = (1/k) Σ_{i=1}^k U_i and S_U² = (1/(k−1)) Σ_{i=1}^k (U_i − Ū)² be the sample mean and sample variance respectively. Then E[S_U²] = δ − γ and Var[Ū] = γ + (δ − γ)/k = δ(π + (1 − π)/k). \n\nTo study Var[^{n2}_{n1}μ̂_J^K] we need to define the following covariances. \n\n• Let σ0 = σ0(n1) = Var[L(j, i)] when i is randomly drawn from S̄_j. \n\n• Let σ1 = σ1(n1, n2) = Cov[L(j, i), L(j, i′)] for i and i′ randomly and independently drawn from S̄_j. \n\n• Let σ2 = σ2(n1, n2) = Cov[L(j, i), L(j′, i′)], with j ≠ j′, and i and i′ randomly and independently drawn from S̄_j and S̄_{j′} respectively. 
• Let σ3 = σ3(n1) = Cov[L(j, i), L(j, i′)] for i, i′ ∈ S̄_j and i ≠ i′. This is not the same as σ1. In fact, it may be shown that \n\nσ1 = Cov[L(j, i), L(j, i′)] = (σ0 + (n2 − 1)σ3)/n2 = σ3 + (σ0 − σ3)/n2. (3) \n\nLet us look at the mean and variance of μ̂_j and ^{n2}_{n1}μ̂_J^K. Concerning expectations, we obviously have E[μ̂_j] = n1μ and thus E[^{n2}_{n1}μ̂_J^K] = n1μ. From Lemma 1, we have Var[μ̂_j] = σ1 + (σ0 − σ1)/K, which implies \n\nVar[μ̄_j] = Var[lim_{K→∞} μ̂_j] = lim_{K→∞} Var[μ̂_j] = σ1. \n\nIt can also be shown that Cov[μ̄_j, μ̄_{j′}] = σ2 for j ≠ j′, and therefore (using Lemma 1) \n\nVar[^{n2}_{n1}μ̂_J^K] = σ2 + (1/J)(σ1 − σ2 + (σ0 − σ1)/K). (4) \n\nWe shall often encounter σ0, σ1, σ2, σ3 in what follows, so some knowledge about those quantities is valuable. Here is what we can say about them. \n\nProposition 1 For given n1 and n2, we have 0 ≤ σ2 ≤ σ1 ≤ σ0 and 0 ≤ σ3 ≤ σ1. \nProof See (Nadeau and Bengio, 1999). \n\nA natural question about the estimator ^{n2}_{n1}μ̂_J^K is how n1, n2, K and J affect its variance. \n\nProposition 2 The variance of ^{n2}_{n1}μ̂_J^K is non-increasing in J, K and n2. \nProof See (Nadeau and Bengio, 1999). \n\nClearly, increasing K leads to smaller variance because the noise introduced by sampling with replacement from the test set disappears when this is done over and over again. Also, averaging over many train/test splits (increasing J) improves the estimation of n1μ. Finally, all things being equal elsewhere (n1 fixed among other things), the larger the size of the test sets, the better the estimation of n1μ. \n\nThe behavior of Var[^{n2}_{n1}μ̂_J^K] with respect to n1 is unclear, but we conjecture that in most situations it should decrease in n1. Our argument goes like this. 
The variability in ^{n2}_{n1}μ̂_J^K comes from two sources: sampling decision rules (training process) and sampling testing examples. Holding n2, J and K fixed freezes the second source of variation, as it solely depends on those three quantities, not n1. The problem to solve becomes: how does n1 affect the first source of variation? It is not unreasonable to say that the decision function yielded by a learning algorithm is less variable when the training set is large. We conclude that the first source of variation, and thus the total variation (that is, Var[^{n2}_{n1}μ̂_J^K]), is decreasing in n1. We advocate the use of the estimator \n\n^{n2}_{n1}μ̂_J^∞ = (1/J) Σ_{j=1}^J μ̄_j, (5) \n\nas it is easier to compute and has smaller variance than ^{n2}_{n1}μ̂_J^K (J, n1, n2 held constant). Its variance is \n\nVar[^{n2}_{n1}μ̂_J^∞] = lim_{K→∞} Var[^{n2}_{n1}μ̂_J^K] = σ2 + (σ1 − σ2)/J = σ1(ρ + (1 − ρ)/J), (6) \n\nwhere ρ = σ2/σ1 = Corr[μ̄_j, μ̄_{j′}]. \n\n3 Estimation of Var[^{n2}_{n1}μ̂_J^∞] \n\nWe are interested in estimating ^{n2}_{n1}σ_J² ≡ Var[^{n2}_{n1}μ̂_J^∞], where ^{n2}_{n1}μ̂_J^∞ is as defined in (5). We provide two different estimators of this variance. The first is simple but may have a positive or negative bias for the actual variance. The second is meant to be conservative, that is, if our conjecture of the previous section is correct, its expected value exceeds the actual variance. \n\n1st Method: Corrected Resampled t-Test. Let us recall that ^{n2}_{n1}μ̂_J^∞ = (1/J) Σ_{j=1}^J μ̄_j. Let σ̂² be the sample variance of the μ̄_j's. According to Lemma 1, \n\nE[σ̂²] = σ1(1 − ρ) = Var[^{n2}_{n1}μ̂_J^∞] / (1/J + ρ/(1 − ρ)), (7) \n\nso that (1/J + ρ/(1 − ρ)) σ̂² is an unbiased estimator of Var[^{n2}_{n1}μ̂_J^∞]. The only problem is that ρ = ρ(n1, n2) = σ2/σ1, the correlation between the μ̄_j's, is unknown and difficult to estimate. We use a naive surrogate for ρ as follows. 
Let us recall that μ̄_j = (1/n2) Σ_{i∈S̄_j} L(Z_{S_j}; Z_i). For the purpose of building our estimator, let us make the approximation that L(Z_{S_j}; Z_i) depends only on Z_i and n1. Then it is not hard to show (see (Nadeau and Bengio, 1999)) that the correlation between the μ̄_j's becomes n2/(n1 + n2). Therefore our first estimator of Var[^{n2}_{n1}μ̂_J^∞] is (1/J + ρ0/(1 − ρ0)) σ̂², where ρ0 = ρ0(n1, n2) = n2/(n1 + n2), that is (1/J + n2/n1) σ̂². This will tend to overestimate or underestimate Var[^{n2}_{n1}μ̂_J^∞] according to whether ρ0 > ρ or ρ0 < ρ. Note that this first method basically does not require any more computation than that already performed to estimate the generalization error by cross-validation. \n\n2nd Method: Conservative Z. Our second method aims at overestimating Var[^{n2}_{n1}μ̂_J^∞], which will lead to conservative inference, that is, tests of hypothesis with actual size less than the nominal size. This is important because techniques currently in use have the opposite defect, that is, they tend to be liberal (tests with actual size exceeding the nominal size), which is typically regarded as less desirable than conservative tests. \n\nEstimating ^{n2}_{n1}σ_J² unbiasedly is not trivial, as hinted above. However, we may estimate unbiasedly ^{n2}_{n1′}σ_J² = Var[^{n2}_{n1′}μ̂_J^∞], where n1′ = ⌊n/2⌋ − n2 < n1. Let ^{n2}_{n1′}σ̂_J² be the unbiased estimator, developed below, of the above variance. We argued in the previous section that Var[^{n2}_{n1′}μ̂_J^∞] ≥ Var[^{n2}_{n1}μ̂_J^∞]. Therefore ^{n2}_{n1′}σ̂_J² will tend to overestimate ^{n2}_{n1}σ_J², that is \n\nE[^{n2}_{n1′}σ̂_J²] = ^{n2}_{n1′}σ_J² ≥ ^{n2}_{n1}σ_J². \n\nHere is how we may estimate ^{n2}_{n1′}σ_J² without bias. For simplicity, assume that n is even. We randomly split our data Z_1^n into two disjoint data sets, D_1 and D_1^c, of size n/2 each. Let μ̂_(1) be the statistic of interest (^{n2}_{n1′}μ̂_J^∞) computed on D_1. This involves, among other things, drawing J train/test subsets from D_1. 
Let μ̂_(1)^c be the statistic computed on D_1^c. Then μ̂_(1) and μ̂_(1)^c are independent, since D_1 and D_1^c are independent data sets, with common mean, so that (1/2)(μ̂_(1) − μ̂_(1)^c)² is an unbiased estimate of ^{n2}_{n1′}σ_J². This splitting process may be repeated M times. This yields D_m and D_m^c, with D_m ∪ D_m^c = Z_1^n and D_m ∩ D_m^c = ∅, for m = 1, ..., M. Each split yields a pair (μ̂_(m), μ̂_(m)^c) such that (1/2)(μ̂_(m) − μ̂_(m)^c)² is unbiased for ^{n2}_{n1′}σ_J². This allows us to use the following unbiased estimator of ^{n2}_{n1′}σ_J²: \n\n^{n2}_{n1′}σ̂_J² = (1/(2M)) Σ_{m=1}^M (μ̂_(m) − μ̂_(m)^c)². (8) \n\nNote that, according to Lemma 1, Var[^{n2}_{n1′}σ̂_J²] = (1/4) Var[(μ̂_(m) − μ̂_(m)^c)²] (r + (1 − r)/M), with r = Corr[(μ̂_(i) − μ̂_(i)^c)², (μ̂_(j) − μ̂_(j)^c)²] for i ≠ j. Simulations suggest that r is usually close to 0, so that the above variance decreases roughly like 1/M for M up to 20, say. The second method is therefore a bit more computation intensive, since it requires performing cross-validation M times, but it is expected to be conservative. \n\n4 Simulation study \n\nWe consider five different test statistics for the hypothesis H0 : n1μ = μ0. The first three are methods already in use in the machine learning community; the last two are the new methods we put forward. They all have the following form: \n\nreject H0 if |μ̂ − μ0| > c. (9) \n\nTable 1 describes what they are.¹ We performed a simulation study to investigate the size (probability of rejecting the null hypothesis when it is true) and the power (probability of rejecting the null hypothesis when it is false) of the five test statistics shown in Table 1. We consider the problem of estimating generalization errors in the Letter Recognition classification problem (available from www. 
ics.uci.edu/pub/machine-learning-databases). The learning algorithms are \n\n1. Classification tree. We used the function tree in Splus version 4.5 for Windows. The default arguments were used and no pruning was performed. The function predict with option type=\"class\" was used to retrieve the decision function of the tree: F_A(Z_S)(X). Here the classification loss function L_A(j, i) = I[F_A(Z_{S_j})(X_i) ≠ Y_i] is equal to 1 whenever this algorithm misclassifies example i when the training set is Z_{S_j}; otherwise it is 0. \n\n2. First nearest neighbor. We apply the first nearest neighbor rule with a distorted distance metric to pull down the performance of this algorithm to the level of the classification tree (as in (Dietterich, 1998)). We have L_B(j, i) equal to 1 whenever this algorithm misclassifies example i when the training set is Z_{S_j}; otherwise it is 0. \n\nIn addition to inference about the generalization errors n1μ_A and n1μ_B associated with those two algorithms, we also consider inference about n1μ_{A−B} = n1μ_A − n1μ_B = E[L_{A−B}(j, i)], where L_{A−B}(j, i) = L_A(j, i) − L_B(j, i). \n\n¹When comparing two classifiers, (Nadeau and Bengio, 1999) show that the t-test is closely related to McNemar's test described in (Dietterich, 1998). The 5 × 2 cv procedure was developed in (Dietterich, 1998) with solely the comparison of classifiers in mind but may trivially be extended to other problems as shown in (Nadeau and Bengio, 1999). \n\nWe sample, without replacement, 300 examples from the 20000 examples available in the Letter Recognition data base. Repeating this 500 times, we obtain 500 sets of data of the form {Z_1, ..., Z_300}. Once a data set Z_1^300 = {Z_1, ..., Z_300} has been generated, we may 
perform hypothesis testing based on the statistics shown in Table 1. \n\nName | μ̂ | c | variance ratio \nt-test (McNemar) | ^{n2}_{n1}μ̂_1^∞ | t_{n2−1,1−α/2} √(S²_{L(1,i)}/n2) | (n2σ3 + (σ0 − σ3))/(σ0 − σ3) ≥ 1 \nresampled t | ^{n2}_{n1}μ̂_J^∞ | t_{J−1,1−α/2} √(σ̂²/J) | 1 + Jρ/(1 − ρ) ≥ 1 \nDietterich's 5 × 2 cv | ^{n/2}_{n/2}μ̂_1^∞ | see (Dietterich, 1998) | ? \n1: conservative Z | ^{n2}_{n1}μ̂_J^∞ | z_{1−α/2} √(^{n2}_{n1′}σ̂_J²) | ^{n2}_{n1}σ_J² / ^{n2}_{n1′}σ_J² ≤ 1 \n2: corr. resampled t | ^{n2}_{n1}μ̂_J^∞ | t_{J−1,1−α/2} √((1/J + n2/n1) σ̂²) | (1 + Jρ/(1 − ρ)) / (1 + J n2/n1) \n\nTable 1: Description of the five test statistics in relation to the rejection criterion shown in (9). z_p and t_{k,p} refer to the quantile p of the N(0, 1) and Student t_k distributions respectively. σ̂² is as defined above (7), and S²_{L(1,i)} is the sample variance of the L(1, i)'s involved in ^{n2}_{n1}μ̂_1^∞. The variance ratio (the true variance of μ̂ over the expectation of the variance estimate used, which comes from proper application of Lemma 1, except for Dietterich's 5 × 2 cv and the conservative Z) indicates whether a test will tend to be conservative (ratio less than 1) or liberal (ratio greater than 1). \n\nA difficulty arises, however. For a given n (n = 300 here), those methods do not aim at inference for the same generalization error. For instance, Dietterich's 5 × 2 cv test aims at n/2μ, while the others aim at n1μ, where n1 would usually be different for different methods (e.g. n1 = 2n/3 for the t-test statistic and n1 = 9n/10 for the resampled t-test statistic, for instance). In order to compare the different techniques, for a given n, we shall always aim at n/2μ, i.e. use n1 = n/2. However, for statistics involving ^{n2}_{n1}μ̂_J^∞ with J > 1, normal usage would call for n1 to be 5 or 10 times larger than n2, not n1 = n2 = n/2. Therefore, for those statistics, we also use n1 = n/2 and n2 = n/10, so that n1/n2 = 5. To obtain ^{n/10}_{n/2}μ̂_J^∞ we simply throw out 40% of the data. For the conservative Z, we do the variance calculation as we would normally do (n2 = n/10 for instance) to obtain ^{n2}_{n/2−n2}σ̂_J² = ^{n/10}_{2n/5}σ̂_J². 
However, in the numerator we compute ^{n/10}_{n/2}μ̂_J^∞ instead of ^{n/10}_{n−n2}μ̂_J^∞, as explained above. Note that the rationale that led to the conservative Z statistic is maintained, that is, ^{n/10}_{2n/5}σ̂_J² overestimates Var[^{n/10}_{n/2}μ̂_J^∞]: \n\nE[^{n/10}_{2n/5}σ̂_J²] = Var[^{n/10}_{2n/5}μ̂_J^∞] ≥ Var[^{n/10}_{n/2}μ̂_J^∞]. \n\nFigure 1 shows the estimated power of the different statistics when we are interested in μ_A and μ_{A−B}. We estimate powers by computing the proportion of rejections of H0. We see that tests based on the t-test or resampled t-test are liberal: they reject the null hypothesis with probability greater than the prescribed α = 0.1 when the null hypothesis is true. The other tests appear to have sizes that are either not significantly larger than 10% or barely so. Note that Dietterich's 5 × 2 cv is not very powerful (its curve has the lowest power at the extreme values of μ0). To make a fair comparison of power between two curves, one should mentally align the sizes (bottoms of the curves). Indeed, even the resampled t-test and the conservative Z that throw out 40% of the data are more powerful. That is of course due to the fact that the 5 × 2 cv method uses J = 1 instead of J = 15. \n\nThis is just a glimpse of a much larger simulation study. When studying the corrected resampled t-test and the conservative Z in their natural habitat (n1 = 9n/10 and n2 = n/10), we see that they are usually either right on the money in terms of size, or slightly conservative. Their powers appear equivalent. The simulations were performed with J up to 25 and M up to 20. 
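The two variance estimators studied in these simulations can be sketched in code. The following is a minimal illustration, not the paper's implementation: the learn/loss interface, the function names, and the toy mean-predictor learner are our own assumptions.

```python
import random
import statistics

def cv_estimate(data, learn, loss, n1, J):
    # Average test error over J random train/test splits (the mu-bar_j's).
    n, mu_bar = len(data), []
    for _ in range(J):
        train_idx = set(random.sample(range(n), n1))   # S_j: training indices
        f = learn([data[i] for i in train_idx])
        test = [data[i] for i in range(n) if i not in train_idx]
        mu_bar.append(sum(loss(f(x), y) for x, y in test) / len(test))
    return mu_bar

def corrected_resampled_t_variance(mu_bar, n1, n2):
    # 1st method: inflate the naive sample variance by (1/J + n2/n1).
    return (1 / len(mu_bar) + n2 / n1) * statistics.variance(mu_bar)

def conservative_z_variance(data, learn, loss, n2, J, M):
    # 2nd method, eq. (8): average squared difference of the statistic
    # computed on M independent half/half splits of the data, with
    # training sets of size n/2 - n2 inside each half.
    n = len(data)
    n1p = n // 2 - n2
    total = 0.0
    for _ in range(M):
        shuffled = random.sample(data, n)
        d1, d2 = shuffled[: n // 2], shuffled[n // 2 :]
        m1 = statistics.mean(cv_estimate(d1, learn, loss, n1p, J))
        m2 = statistics.mean(cv_estimate(d2, learn, loss, n1p, J))
        total += (m1 - m2) ** 2
    return total / (2 * M)

# Toy usage: mean predictor with squared-error loss (our stand-in learner).
random.seed(0)
pts = [(0.0, random.gauss(1.0, 1.0)) for _ in range(100)]
learn = lambda tr: (lambda x, m=statistics.mean(y for _, y in tr): m)
loss = lambda pred, y: (pred - y) ** 2
mu_bar = cv_estimate(pts, learn, loss, n1=90, J=15)
mu = statistics.mean(mu_bar)                      # cross-validation estimate
v1 = corrected_resampled_t_variance(mu_bar, n1=90, n2=10)
v2 = conservative_z_variance(pts, learn, loss, n2=10, J=15, M=10)
```

With v1, inference uses a Student quantile, |μ̂ − μ0| compared to t_{J−1,1−α/2}·√v1; with v2, a normal quantile, as in Table 1.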
We found that taking J greater than 15 did not improve much the power of the statistics. Taking M = 20 instead of M = 10 does not lead to any noticeable difference in the distribution of the conservative Z. Taking M = 5 makes the statistic slightly less conservative. See (Nadeau and Bengio, 1999) for further details. \n\nFigure 1: Powers of the tests about H0 : μ_A = μ0 (left panel) and H0 : μ_{A−B} = μ0 (right panel) at level α = 0.1 for varying μ0. The dotted vertical lines correspond to the 95% confidence interval for the actual μ_A or μ_{A−B}; therefore that is where the actual size of the tests may be read. The solid horizontal line displays the nominal size of the tests, i.e. 10%. Estimated probabilities of rejection lying above the dotted horizontal line are significantly greater than 10% (at significance level 5%). Solid curves correspond to either the resampled t-test or the corrected resampled t-test; the resampled t-test is the one that has ridiculously high size. Curves with circled points are the versions of the ordinary and corrected resampled t-test and conservative Z with 40% of the data thrown away. Where it matters, J = 15 and M = 10 were used. \n\n5 Conclusion \n\nThis paper addresses a very important practical issue in the empirical validation of new machine learning algorithms: how to decide whether one algorithm is significantly better than another one. We argue that it is important to take into account the variability due to the choice of training set. (Dietterich, 1998) had already proposed a statistic for this purpose. We have constructed two new variance estimates of the cross-validation estimator of the generalization error. These enable one to construct tests of hypothesis and confidence intervals that are seldom liberal. Furthermore, tests based on these have powers that are unmatched by any known techniques with comparable size. 
One of them (the corrected resampled t-test) can be computed without any additional cost to the usual K-fold cross-validation estimates. The other one (the conservative Z) requires M times more computation, where we found sufficiently good values of M to be between 5 and 10. \n\nReferences \n\nBreiman, L. (1996). Heuristics of instability and stabilization in model selection. Annals of Statistics, 24(6):2350-2383. \n\nDietterich, T. (1998). Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10(7):1895-1924. \n\nHinton, G., Neal, R., Tibshirani, R., and DELVE team members (1995). Assessing learning procedures using DELVE. Technical report, University of Toronto, Department of Computer Science. \n\nNadeau, C. and Bengio, Y. (1999). Inference for the generalisation error. Technical report in preparation, CIRANO. \n", "award": [], "sourceid": 1661, "authors": [{"given_name": "Claude", "family_name": "Nadeau", "institution": null}, {"given_name": "Yoshua", "family_name": "Bengio", "institution": null}]}