{"title": "Benchmarking Non-Parametric Statistical Tests", "book": "Advances in Neural Information Processing Systems", "page_first": 651, "page_last": 658, "abstract": null, "full_text": "Benchmarking Non-Parametric Statistical Tests\n\nMikaela Keller IDIAP Research Institute 1920 Martigny Switzerland mkeller@idiap.ch\n\nSamy Bengio IDIAP Research Institute 1920 Martigny Switzerland bengio@idiap.ch\n\nSiew Yeung Wong IDIAP Research Institute 1920 Martigny Switzerland sywong@idiap.ch\n\nAbstract\nAlthough non-parametric tests have already been proposed for that purpose, statistical significance tests for non-standard measures (different from the classification error) are less often used in the literature. This paper is an attempt at empirically verifying how these tests compare with more classical tests, on various conditions. More precisely, using a very large dataset to estimate the whole \"population\", we analyzed the behavior of several statistical test, varying the class unbalance, the compared models, the performance measure, and the sample size. The main result is that providing big enough evaluation sets non-parametric tests are relatively reliable in all conditions.\n\n1\n\nIntroduction\n\nStatistical tests are often used in machine learning in order to assess the performance of a new learning algorithm or model over a set of benchmark datasets, with respect to the state-of-the-art solutions. Several researchers (see for instance [4] and [9]) have proposed statistical tests suited for 2-class classification tasks where the performance is measured in terms of the classification error (ratio of the number of errors and the number of examples), which enables the use of assumptions based on the fact that the error can be seen as a sum of random variables over the evaluation examples. On the other hand, various research domains prefer to measure the performance of their models using different indicators, such as the F1 measure, used in information retrieval [11], described in Section 2.1. Most classical statistical tests cannot cope directly with such measure as the usual necessary assumptions are no longer correct, and non-parametric bootstrap-based methods are then used [5]. Since several papers already use these non-parametric tests [2, 1], we were interested in verifying empirically how reliable they were. For this purpose, we used a very large text categorization database (the extended Reuters dataset [10]), composed of more than 800000 examples, and concerning more than 100 categories (each document was labelled with one or more of these categories). We purposely set aside the largest part of the dataset and considered it as the whole population, while a much smaller part of it was used as a training set for the models. Using the large set aside dataset part, we tested the statistical test in the\n This work was supported in part by the Swiss NSF through the NCCR on IM2 and in part by the European PASCAL Network of Excellence, IST-2002-506778, through the Swiss OFES.\n\n\f\nsame spirit as was done in [4], by sampling evaluation sets over which we observed the performance of the models and the behavior of the significance test. Following the taxonomy of questions of interest defined by Dietterich in [4], we can differentiate between statistical tests that analyze learning algorithms and statistical tests that analyze classifiers. 
In the first case, one intends to be robust to possible variations of the training and evaluation sets, while in the latter, one intends to be robust only to variations of the evaluation set. While the methods discussed in this paper can be applied to both approaches, we concentrate here on the second one, as it is more tractable (for the empirical section) while still corresponding to real-life situations where the training set is fixed and one wants to compare two solutions (such as during a competition). In order to conduct a thorough analysis, we tried to vary the evaluation set size, the class unbalance, the error measure, the statistical test itself (with its associated assumptions), and even the closeness of the compared learning algorithms. This paper, and more precisely Section 3, is a detailed account of this analysis. As will be seen empirically, the closeness of the compared learning algorithms seems to have an effect on the resulting quality of the statistical tests: comparing an MLP and an SVM yields less reliable statistical tests than comparing two SVMs with different kernels. To the best of our knowledge, this has never been considered in the literature on statistical tests for machine learning.

2 A Statistical Significance Test for the Difference of F1

Let us first recall the basic classification framework in which statistical significance tests are used in machine learning. We consider comparing two models A and B on a two-class classification task where the goal is to classify input examples x_i into the corresponding class y_i \in \{-1, 1\}, using already trained models f_A(x_i) or f_B(x_i). One can estimate their respective performance on some test data by counting the number of occurrences of each possible outcome: either the obtained class corresponds to the desired class, or not. Let N_{e,A} (resp. N_{e,B}) be the number of errors of model A (resp. B) and N the total number of test examples. The difference between models A and B can then be written as

D = \frac{N_{e,A} - N_{e,B}}{N}.   (1)

The usual starting point of most statistical tests is to define the so-called null hypothesis H0, which considers that the two models are equivalent, and then to verify how probable this hypothesis is. Hence, assuming that the observed difference D is an instance of some random variable \mathcal{D} which follows some distribution, we are interested in whether

p(|D| < |\mathcal{D}|) < \alpha,   (2)

where \alpha represents the risk of selecting the alternate hypothesis (the two models are different) while the null hypothesis is in fact true. This can in general be estimated easily when the distribution of \mathcal{D} is known. In the simplest case, known as the proportion test, one assumes (reasonably) that the decision taken by each model on each example can be modeled by a Bernoulli, and further assumes that the errors of the models are independent. The latter assumption is in general wrong in machine learning since the evaluation sets are the same for both models. When N is large, this leads to estimating \mathcal{D} as a Normal distribution with zero mean and standard deviation

\sigma_D = \sqrt{\frac{2 \bar{C} (1 - \bar{C})}{N}},   (3)

where \bar{C} = \frac{N_{e,A} + N_{e,B}}{2N} is the average classification error. In order to get rid of the wrong independence assumption between the errors of the models, the McNemar test [6] concentrates on the examples which were classified differently by the two compared models. Following the notation of [4], let N_{01} be the number of examples misclassified by model A but not by model B, and N_{10} the number of examples misclassified by model B but not by model A. It can be shown that the following statistic is approximately distributed as \chi^2 with 1 degree of freedom:

z = \frac{(|N_{01} - N_{10}| - 1)^2}{N_{01} + N_{10}}.   (4)
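For concreteness, here is a minimal sketch of the two parametric tests above (Eqs. (1)-(4)). The function names, the 0/1 error-indicator input format and the use of NumPy/SciPy are our own illustrative choices, not part of the paper; both functions return True when H0 is rejected at level alpha.

```python
import numpy as np
from scipy.stats import chi2, norm

def proportion_test(err_a, err_b, alpha=0.05):
    """Two-sided proportion test of Eqs. (1)-(3).

    err_a, err_b: arrays of 0/1 error indicators of models A and B
    on the same N evaluation examples.
    """
    err_a, err_b = np.asarray(err_a, float), np.asarray(err_b, float)
    n = len(err_a)
    d = (err_a.sum() - err_b.sum()) / n              # Eq. (1)
    c_bar = (err_a.sum() + err_b.sum()) / (2 * n)    # average classification error
    sigma = np.sqrt(2 * c_bar * (1 - c_bar) / n)     # Eq. (3)
    if sigma == 0.0:                                 # both models error-free: no evidence
        return False
    return 2 * norm.sf(abs(d) / sigma) < alpha       # two-sided Normal tail -> reject H0?

def mcnemar_test(err_a, err_b, alpha=0.05):
    """McNemar test of Eq. (4), using only the examples classified differently."""
    err_a, err_b = np.asarray(err_a, bool), np.asarray(err_b, bool)
    n01 = int(np.sum(err_a & ~err_b))                # misclassified by A only
    n10 = int(np.sum(~err_a & err_b))                # misclassified by B only
    if n01 + n10 == 0:                               # no disagreement at all
        return False
    z = (abs(n01 - n10) - 1) ** 2 / (n01 + n10)      # ~ chi^2 with 1 d.o.f. under H0
    return chi2.sf(z, df=1) < alpha                  # reject H0?
```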
More recently, several other statistical tests have been proposed, such as the 5x2cv method [4] or the variance estimate proposed in [9], which both claim to better estimate the distribution of the errors (and hence the confidence in the statistical significance of the results). Note however that these solutions assume that the error of one model is the average of some random variable (the error) estimated on each example. Intuitively, it will thus tend to be Normally distributed as N grows, following the central limit theorem.

2.1 The F1 Measure

Text categorization is the task of assigning one or several categories, among a predefined set of K categories, to textual documents. As explained in [11], text categorization is usually solved as K 2-class classification problems, in a one-against-the-others approach. In this field two measures are considered of importance:

Precision = \frac{N_{tp}}{N_{tp} + N_{fp}}  and  Recall = \frac{N_{tp}}{N_{tp} + N_{fn}},

where for each category N_{tp} is the number of true positives (documents belonging to the category that were classified as such), N_{fp} the number of false positives (documents out of this category but classified as being part of it) and N_{fn} the number of false negatives (documents from the category classified as out of it). Precision and Recall are effectiveness measures, i.e. they lie in the [0, 1] interval and the closer to 1 the better. For each category k, Precision_k measures the proportion of documents of the class among the ones considered as such by the classifier, and Recall_k the proportion of documents of the class correctly classified. To summarize these two values, it is common to consider the so-called F1 measure [12], often used in domains such as information retrieval, text categorization, or vision processing. F1 is the harmonic mean of Precision and Recall, i.e. the inverse of the average of their inverses:

F_1 = \left[ \frac{1}{2} \left( \frac{1}{Precision} + \frac{1}{Recall} \right) \right]^{-1} = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall} = \frac{2 N_{tp}}{2 N_{tp} + N_{fn} + N_{fp}}.   (5)

Let us consider two models A and B, which achieve a performance measured by F_{1,A} and F_{1,B} respectively. The difference dF1 = F_{1,A} - F_{1,B} does not fit the assumptions of the tests presented earlier. Indeed, it cannot be decomposed into a sum over the documents of independent random variables, since the numerator and the denominator of dF1 are themselves non-constant sums over documents of random variables. For the same reason F1, while being a proportion, cannot be considered as a random variable following a Normal distribution for which we could easily estimate the variance. An alternative solution to measure the statistical significance of dF1 is based on the Bootstrap Percentile Test proposed in [5]. The idea of this test is to approximate the unknown distribution of dF1 by an estimate based on bootstrap replicates of the data.
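The quantities in Eq. (5) are straightforward to compute from the predictions of a single model. The following small helper is our own illustrative sketch (it assumes labels in {-1, +1}, following the notation above); the statistic of interest, dF1, is then simply the difference of two such scores.

```python
import numpy as np

def f1_score(y_true, y_pred):
    """F1 of Eq. (5) for a 2-class task with labels in {-1, +1}."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n_tp = np.sum((y_pred == 1) & (y_true == 1))     # true positives
    n_fp = np.sum((y_pred == 1) & (y_true == -1))    # false positives
    n_fn = np.sum((y_pred == -1) & (y_true == 1))    # false negatives
    denom = 2 * n_tp + n_fn + n_fp
    return 2 * n_tp / denom if denom > 0 else 0.0

# The statistic compared between models A and B:
# d_f1 = f1_score(y_true, pred_a) - f1_score(y_true, pred_b)
```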
2.2 Bootstrap Percentile Test

Given an evaluation set of size N, one draws, with replacement, N samples from it. This gives the first bootstrap replicate B_1, over which one can compute the statistic of interest, dF1_{B_1}. Similarly, one can create as many bootstrap replicates B_n as needed, and for each, compute dF1_{B_n}. The higher n is, the more precise the statistical test should be. The literature [3] suggests creating at least 50/\alpha replicates, where \alpha is the level of the test; for the smallest \alpha we considered (0.01), this amounts to 5000 replicates. These 5000 estimates dF1_{B_i} represent the non-parametric distribution of the random variable dF1. From it, one can for instance consider an interval [a, b] such that p(a < dF1 < b) = 1 - \alpha, centered around the mean of p(dF1). If 0 lies outside this interval, one can say that dF1 = 0 is not among the most probable results, and thus reject the null hypothesis.

3 Analysis of Statistical Tests

We report in this section an analysis of the bootstrap percentile test, as well as of other more classical statistical tests, based on a real large database. We first describe the database itself and the protocol we used for this analysis, and then provide results and comments.

3.1 Database, Models and Protocol

All the experiments detailed in this paper are based on the very large RCV1 Reuters dataset [10], which contains up to 806,791 documents. We divided it as follows: 798,809 documents were kept aside, and any statistic computed over this set Dtrue was considered as being the truth (i.e. a very good estimate of the actual value); the remaining 7,982 documents were used as a training set Dtr (to train models A and B). There was a total of 101 categories and each document was labeled with one or more of these categories. We first extracted the dictionary from the training set, removed stop-words and applied stemming, as normally done in text categorization. Each document was then represented as a bag-of-words using the usual tf-idf coding. We trained three different models: a linear Support Vector Machine (SVM), a Gaussian kernel SVM, and a multi-layer perceptron (MLP). There was one model per category for the SVMs, and a single MLP for the 101 categories. All models were properly tuned using cross-validation on the training set.

Using the notation introduced earlier, we define the following competing hypotheses: H0: |dF1| = 0 and H1: |dF1| > 0. We further define the level of the test \alpha = p(Reject H0 | H0), where \alpha takes on values 0.01, 0.05 and 0.1. Table 1 summarizes the possible outcomes of a statistical test. In that respect, rejecting H0 means that one is confident with (1 - \alpha) * 100% that H0 is really false.

Table 1: Various outcomes of a statistical test, with \alpha = p(Type I error).

                         Truth
  Decision        H0               H1
  Reject H0       Type I error     OK
  Accept H0       OK               Type II error

In order to assess the performance of the statistical tests on their Type I error, also called the Size of the test, and on their Power = 1 - p(Type II error), we used the following protocol. For each category Ci, we sampled over Dtrue S = 500 evaluation sets Dte^s of N documents each, ran the significance test over each Dte^s, and computed the proportion of sets for which H0 was rejected. When H0 was true over Dtrue, we denote this proportion \alpha_true; when H0 was false over Dtrue, the corresponding proportion is our estimate of the Power. We used \alpha_true as an estimate of the significance test's probability of making a Type I error. When \alpha_true is higher than the \alpha fixed by the statistical test, the test underestimates its Type I error, which means we should not rely on its decision regarding the superiority of one model over the other; in that case, we consider that the significance test fails. On the contrary, \alpha_true < \alpha yields a pessimistic statistical test that correctly decides H0 more often than predicted. Furthermore, we would like to favor significance tests with a high Power estimate, since the Power of the test reflects its ability to reject H0 when H0 is false.
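As a concrete illustration, here is a minimal sketch of the bootstrap percentile test of Section 2.2, written so that it can reuse a per-model statistic such as the f1_score helper above. The function name, defaults and the use of the empirical \alpha/2 and 1-\alpha/2 percentiles (a common variant of the interval described above) are our own choices, not the paper's exact implementation.

```python
import numpy as np

def bootstrap_percentile_test(y_true, pred_a, pred_b, statistic,
                              alpha=0.05, n_replicates=None, seed=0):
    """Bootstrap percentile test for a difference statistic such as dF1.

    `statistic(y_true, y_pred)` returns the performance of one model
    (e.g. f1_score above); H0 is rejected when 0 falls outside the
    central (1 - alpha) interval of the bootstrap distribution.
    """
    rng = np.random.default_rng(seed)
    y_true, pred_a, pred_b = map(np.asarray, (y_true, pred_a, pred_b))
    n = len(y_true)
    if n_replicates is None:
        n_replicates = int(np.ceil(50 / alpha))      # heuristic suggested in [3]
    diffs = np.empty(n_replicates)
    for i in range(n_replicates):
        idx = rng.integers(0, n, size=n)             # draw N examples with replacement
        diffs[i] = (statistic(y_true[idx], pred_a[idx])
                    - statistic(y_true[idx], pred_b[idx]))
    lo, hi = np.percentile(diffs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return not (lo < 0.0 < hi)                       # True -> reject H0
```

In the protocol above, such a decision function would simply be run on each of the S = 500 sampled evaluation sets, and the resulting rejection proportion compared with the nominal \alpha.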
3.2 Summary of Conditions

In order to verify the sensitivity of the analyzed statistical tests to several conditions, we varied the following parameters:

- the value of \alpha: it took on values in {0.1, 0.05, 0.01};
- the two compared models: there were three models; two of them were of the same family (SVMs), hence optimizing the same criterion, while the third one was an MLP. Most of the time the two SVMs gave very similar results (probably because the optimal capacity for this problem was near linear), while the MLP gave poorer results on average. The point here was to verify whether the test was sensitive to the closeness of the tested models (although a more formal definition of closeness should certainly be devised);
- the evaluation sample size: we varied it from small sizes (100) up to larger sizes (6000) to see the robustness of the statistical tests to it;
- the class unbalance: out of the 101 categories of the problem, most resulted in highly unbalanced tasks, often with a ratio of 10 to 100 between the two classes. In order to experiment with more balanced tasks, we artificially created meta-categories, which were random aggregations of normal categories and tended to be more balanced;
- the tested measure: our initial interest was to directly test dF1, the difference of F1, but given poor initial results, we also decided to assess dCerr, the difference of classification errors, in order to see whether the tests were sensitive to the measure itself;
- the statistical test: on top of the bootstrap percentile test, we also analyzed the more classical proportion test and McNemar test, both of them only on dCerr (since they are not adapted to dF1).

3.3 Results

Figure 1 summarizes the results for the Size of the test estimates. All graphs show \alpha_true, the proportion of times the test rejected H0 while H0 was true, for a fixed \alpha = 0.05, with respect to the sample size, for various statistical tests and tested measures. Figure 2 shows the obtained results for the Power of the test estimates: the proportion of evaluation sets over which the significance test (with \alpha = 0.05) rejected H0 when indeed H0 was false is plotted against the evaluation set size. Figures 1(a) and 2(a) show the results for balanced data (where the positive and negative examples were approximately equally present in the evaluation set) when comparing two different models (an SVM and an MLP). Figures 1(b) and 2(b) show the results for unbalanced data when comparing two different models. Figures 1(c) and 2(c) show the results for balanced data when comparing two similar models (a linear SVM and a Gaussian SVM), and finally Figures 1(d) and 2(d) show the results for unbalanced data and two similar models. Note that each point in the graphs was computed over a different number of samples, since, e.g., over the (500 evaluation sets x 101 categories) experiments only those for which H0 was true in Dtrue were taken into account in the computation of \alpha_true. When the proportion of H0 true in Dtrue equals 0 (resp. the proportion of H0 false in Dtrue equals 0), \alpha_true (resp. the Power estimate) is set to -1. Hence, for instance, the first points ({100, ..., 1000}) of Figures 2(c) and 2(d) were each computed over only 500 evaluation sets, corresponding in each case to a single categorization task. This makes these points unreliable. See [8] for more details.
For each of the Size graphs, when a curve is above the 0.05 line, we can state that the statistical test is optimistic, while when it is below the line, the statistical test is pessimistic. As already explained, a pessimistic test should be favored whenever possible. Several interesting conclusions can be drawn from the analysis of these graphs. First of all, as expected, most of the statistical tests are positively influenced by the size of the evaluation set, in the sense that their \alpha_true value converges to \alpha for large sample sizes.(1)

On the available results, the McNemar test and the bootstrap test over dCerr have a similar performance. They are always pessimistic, even for small evaluation set sizes, and tend to the expected values when the models compared on balanced tasks are dissimilar. They also have a similar performance in Power over all the different conditions, higher in general when comparing very different models. When the compared models are similar, the bootstrap test over dF1 has a pessimistic behavior even on quite small evaluation sets. However, when the models are really different, the bootstrap test over dF1 is on average always optimistic. Note nevertheless that most of the points in Figures 1(a) and 1(b) have a standard deviation std, over the categories, such that \alpha_true - std < \alpha (see [8] for more details). Another interesting point is that, in the available results for the Power, the dF1 bootstrap test has relatively high values with respect to the other tests.

The proportion test has in general, on the available results, a more conservative behavior than the McNemar test and the dCerr bootstrap test. It has more pessimistic results and less Power. It is too often prone to "Accept H0", i.e. to conclude that the compared models have an equivalent performance, whether that is true or not. These results seem to be consistent with those of [4] and [9]. However, when comparing close models on a small unbalanced evaluation set (Figure 1(d)), this conservative behavior is not present.

To summarize the findings, the bootstrap-based statistical test over dCerr obtained a good performance in Size, comparable to that of the McNemar test, in all conditions. However, the performance in Power of both significance tests is low even for big evaluation sets, in particular when the compared models are close. The bootstrap-based statistical test over dF1 has higher Power than the other compared tests; however, it must be emphasized that it is slightly over-optimistic, in particular for small evaluation sets.
Finally, when applying the proportion test over unbalanced data for close models, we obtained an optimistic behavior, untypical of this usually conservative test.

(1) Note that the same is true for the variance of \alpha_true (which tends to 0), and this for any of the \alpha values tested.

[Figure 1: four panels plotting the proportion of Type I error against the evaluation set size (100 to 6000) for the Bootstrap test on dF1, the McNemar test, the Proportion test and the Bootstrap test on dCerr. Panels: (a) Linear SVM vs MLP, balanced data; (b) Linear SVM vs MLP, unbalanced data; (c) Linear vs RBF SVMs, balanced data; (d) Linear vs RBF SVMs, unbalanced data.]
Figure 1: Several statistical tests comparing Linear SVM vs MLP or vs RBF SVM. The proportion of Type I error equals -1, in Figure 1(b), when there was no data to compute the proportion (i.e. H0 was always false).

[Figure 2: four panels plotting the Power of the test against the evaluation set size, for the same four tests and the same four model/balance combinations as in Figure 1.]
Figure 2: Power of several statistical tests comparing Linear SVM vs MLP or vs RBF SVM. The power equals -1, in Figures 2(c) and 2(d), when there was no data to compute the proportion (i.e. H1 was never true).

4 Conclusion

In this paper, we have analyzed several parametric and non-parametric statistical tests under various conditions often present in machine learning tasks, including the class balancing, the performance measure, the size of the test sets, and the closeness of the compared models. More particularly, we were concerned with the quality of non-parametric tests since in some cases (when using more complex performance measures such as F1) they are the only available statistical tests. Fortunately, most statistical tests performed reasonably well (in the sense that they were more often pessimistic than optimistic in their decisions) and larger test sets always improved their performance. Note however that for dF1 the only available statistical test was too optimistic, although consistent across the different \alpha levels. An unexpected result was that the rather conservative proportion test, used over unbalanced data for close models, yielded an optimistic behavior. It has to be noted that recently a probabilistic interpretation of F1 was suggested in [7], and a comparison with bootstrap-based tests would be worthwhile.

References

[1] M. Bisani and H. Ney. Bootstrap estimates for confidence intervals in ASR performance evaluation. In Proceedings of ICASSP, 2004.
[2] R. M. Bolle, N. K. Ratha, and S. Pankanti. Error analysis of pattern recognition systems - the subsets bootstrap. Computer Vision and Image Understanding, 93:1-33, 2004.
[3] A. C. Davison and D. V. Hinkley. Bootstrap Methods and their Application. Cambridge University Press, 1997.
[4] T. G. Dietterich. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10(7):1895-1924, 1998.
[5] B. Efron and R. Tibshirani. An Introduction to the Bootstrap. Chapman and Hall, 1993.
[6] B. S. Everitt. The Analysis of Contingency Tables. Chapman and Hall, 1977.
[7] C. Goutte and E. Gaussier. A probabilistic interpretation of precision, recall and F-score, with implication for evaluation. In Proceedings of ECIR, pages 345-359, 2005.
[8] M. Keller, S. Bengio, and S. Y. Wong. Surprising Outcome While Benchmarking Statistical Tests. IDIAP-RR 38, IDIAP, 2005.
[9] C. Nadeau and Y. Bengio. Inference for the generalization error. Machine Learning, 52(3):239-281, 2003.
[10] T. G. Rose, M. Stevenson, and M. Whitehead. The Reuters Corpus Volume 1 - from yesterday's news to tomorrow's language resources. In Proceedings of the 3rd Int. Conf. on Language Resources and Evaluation, 2002.
[11] F. Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1-47, 2002.
[12] C. J. van Rijsbergen. Information Retrieval. Butterworths, London, UK, 1975.
", "award": [], "sourceid": 2846, "authors": [{"given_name": "Mikaela", "family_name": "Keller", "institution": null}, {"given_name": "Samy", "family_name": "Bengio", "institution": null}, {"given_name": "Siew", "family_name": "Wong", "institution": null}]}