{"title": "Optimizing Classifiers for Imbalanced Training Sets", "book": "Advances in Neural Information Processing Systems", "page_first": 253, "page_last": 259, "abstract": null, "full_text": "Optimizing Classifiers for Imbalanced Training Sets \n\nGrigoris Karakoulas \nGlobal Analytics Group \nCanadian Imperial Bank of Commerce \n161 Bay St., BCE-11, Toronto ON, Canada M5J 2S8 \nEmail: karakoul@cibc.ca \n\nJohn Shawe-Taylor \nDepartment of Computer Science \nRoyal Holloway, University of London \nEgham, TW20 0EX, England \nEmail: jst@dcs.rhbnc.ac.uk \n\nAbstract \n\nFollowing recent results [9, 8] showing the importance of the fat-shattering dimension in explaining the beneficial effect of a large margin on generalization performance, the current paper investigates the implications of these results for the case of imbalanced datasets and develops two approaches to setting the threshold. The approaches are incorporated into ThetaBoost, a boosting algorithm for dealing with unequal loss functions. The performance of ThetaBoost and the two approaches is tested experimentally. \n\nKeywords: Computational Learning Theory, Generalization, fat-shattering, large margin, pac estimates, unequal loss, imbalanced datasets \n\n1 Introduction \n\nShawe-Taylor [8] demonstrated that the output margin can also be used as an estimate of the confidence with which a particular classification is made. In other words, if a new example has an output value well clear of the threshold, we can be more confident of the associated classification than when the output value is closer to the threshold. The current paper applies this result to the case where there are different losses associated with a false positive than with a false negative. If a significant number of data points are misclassified, we can use the criterion of minimising the empirical loss. 
If, however, the data is correctly classified, the empirical loss is zero for all correctly separating hyperplanes. It is in this case that the approach can provide insight into how to choose the hyperplane and threshold. In summary, the paper suggests ways in which a hyperplane should be optimised for imbalanced datasets where the loss associated with misclassifying the less prevalent class is higher. \n\n2 Background to the Analysis \n\nDefinition 2.1 [3] Let F be a set of real-valued functions. We say that a set of points X is γ-shattered by F if there are real numbers r_x indexed by x ∈ X such that for all binary vectors b indexed by X, there is a function f_b ∈ F realising dichotomy b with margin γ. The fat-shattering dimension Fat_F of the set F is a function from the positive real numbers to the integers which maps a value γ to the size of the largest γ-shattered set, if this is finite, or infinity otherwise. \n\nIn general we are concerned with classifications obtained by thresholding real-valued functions. The classification values will be {-1, 1} instead of the usual {0, 1} in order to simplify some expressions. Hence, typically we will consider a set F of functions mapping from an input space X to the reals. For the sake of simplifying the presentation of our results we will assume that the threshold used for classification is 0. The results can be extended to other thresholds without difficulty. Hence we implicitly use the classification functions H = T(F) = {T(f) : f ∈ F}, where T(f) is the function f thresholded at 0. We will say that f has margin γ on the training set {(x_i, y_i) : i = 1, ..., m} if min_i y_i f(x_i) ≥ γ. We also say that a class F is closed under addition of constants if f ∈ F and η ∈ ℝ imply f + η ∈ F. \n\nNote that the linear functions (with threshold weights) used in perceptrons [9] satisfy this property, as do neural networks with linear output units. 
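The margin condition above is easy to state concretely. The following is a minimal sketch (toy weight vector and data of our own choosing, not from the paper) of computing the functional margin min_i y_i f(x_i) of a linear function f(x) = w · x:

```python
# f has margin gamma on {(x_i, y_i)} when min_i y_i * f(x_i) >= gamma.
# The weight vector and the two training points below are illustrative only.

def margin(w, xs, ys):
    """Return min_i y_i * (w . x_i), the functional margin of f(x) = w . x."""
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    return min(y * dot(w, x) for x, y in zip(xs, ys))

w = [1.0, -1.0]
xs = [[2.0, 0.5], [0.5, 2.0]]
ys = [1, -1]
print(margin(w, xs, ys))  # both points are classified with functional margin 1.5
```

With the threshold fixed at 0, as in the text, a positive margin means every training point is correctly classified with clearance at least γ.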
Hence, this property applies to the Support Vector Machine and the neural network examples. We now quote a result from [8]. \n\nTheorem 2.4 [8] Let F be a class of real-valued functions closed under addition of constants with fat-shattering dimension bounded by Fat_F(γ), which is continuous from the right. With probability at least 1 - δ over the choice of a random m-sample (x_i, y_i) drawn according to P the following holds. Suppose that for some f ∈ F, η > 0, \n\n1. y_i f(x_i) ≥ -η + 2γ, for all (x_i, y_i) in the sample, \n2. n = |{i : y_i f(x_i) ≥ η + 2γ}|, \n3. n ≥ 3√(2m(2d ln(288m) log₂(12em) + ln(32m²/δ))). \n\nLet d = Fat_F(γ/6). Then the probability that a new example with margin η is misclassified is bounded by \n\n(2/n)(2d log₂(288m) log₂(12em) + log₂(32m²/δ)). \n\n3 Unequal Loss Functions \n\nWe consider the situation where the loss associated with an example is different for misclassification of positive and negative examples. Let L_h(x, y) be the loss associated with the classification function h on example (x, y). For the analysis considered above the loss function is taken to be L_h(x, y) = |h(x) - y|, that is 1 if the point x is misclassified and 0 otherwise. This is also known as the discrete loss. In this paper we consider a different loss function for classification functions. \n\nDefinition 3.1 The loss function L_β is defined as L_β(x, y) = βy + (1 - y) if h(x) ≠ y, and 0 otherwise. \n\nWe first consider the classical approach of minimizing the empirical loss, that is the loss on the training set. Since the loss function is no longer binary, the standard theoretical results that can be applied are much weaker than for the binary case. The algorithmic implications will, however, be investigated under the assumption that we are using a hyperplane parallel to the maximal margin hyperplane. 
The empirical risk is given by ER(h) = Σ_{i=1}^m L_β(x_i, y_i), for the training set {(x_i, y_i) : i = 1, ..., m}. Assuming that the training set can be correctly classified by the hypothesis class, this criterion will not be able to distinguish between consistent hypotheses, hence giving no reason not to choose the standard maximal margin choice. However, there is a natural way to introduce the different losses into the maximal margin quadratic programming procedure [1]. Here, the constraints given are specified as y_i((w · x_i) + θ) ≥ 1, i = 1, 2, ..., m. In order to force the hyperplane away from the positive points, which will incur greater loss, a natural heuristic is to set y_i = -1 for negative examples and y_i = 1/β for positive points, hence making them further from the decision boundary. In the case where consistent classification is possible, the effect of this will be to move the hyperplane parallel to itself so that the margin on the positive side is β times that on the negative side. Hence, to solve the problem we simply use the standard maximal margin algorithm [1] and then replace the threshold θ with \n\nb = (1/(1 + β))[(w · x⁺) + β(w · x⁻)],   (1) \n\nwhere x⁺ (x⁻) is one of the closest positive (negative) points. \n\nThe alternative approach we wish to employ is to consider other movements of the hyperplane parallel to itself while retaining consistency. Let γ₀ be the margin of the maximal margin hyperplane. We consider a consistent hyperplane h_η with margin γ₀ + η to the positive examples, and γ₀ - η to the negative examples. The basic analytic tool is Theorem 2.4, which will be applied once for the positive examples and once for the negative examples (note that classifications are in the set {-1, 1}). \n\nTheorem 3.2 Let h₀ be the maximal margin hyperplane with margin γ₀, while h_η is as above with η < γ₀. Set γ⁺ = (γ₀ + η)/2 and γ⁻ = (γ₀ - η)/2. 
With probability at least 1 - δ over the choice of a random m-sample (x_i, y_i) drawn according to P the following holds. Suppose that for h₀ \n\n1. n₀ = |{i : y_i h₀(x_i) ≥ 2η + γ₀}|, \n2. n₀ ≥ 3√(2m(d ln(288m) log₂(12em) + ln(8/δ))). \n\nLet d⁺ = Fat_F(γ⁺/6) and d⁻ = Fat_F(γ⁻/6). Then we can bound the expected loss by \n\nProof: Using Theorem 2.4 we can bound the probability of error given that the correct classification is positive in terms of the expression with the fat-shattering dimension d⁺ and n = n₀, while for a negative example we can bound the probability of error in terms of the expression with fat-shattering dimension d⁻ and n = m. Hence, the expected loss can be bounded by taking the maximum of the second bound with n⁺ in place of m together with a factor β in front of the second log term and the first bound multiplied by β. ∎ \n\nThe bound obtained suggests a way of optimising the choice of η, namely to minimise the expression for the fat-shattering dimension of linear functions [9]. Solving for η in terms of γ₀ and β gives \n\nη = γ₀((β^(1/3) - 1)/(β^(1/3) + 1)).   (2) \n\nThis choice of η does not in general agree with that suggested by the choice of the threshold b in the previous section. In a later section we report on initial experiments investigating the performance of these different choices. \n\n4 The ThetaBoost Algorithm \n\nThe above idea of adjusting the margin in the case of an unequal loss function can also be applied to the AdaBoost algorithm [2], which has been shown to maximise the margin on the training examples, so that the generalization can be bounded in terms of the margin and the fat-shattering dimension of the functions that can be produced by the algorithm [6]. We will first develop a boosting algorithm for unequal loss functions and then extend it for adjustable margin. 
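The two competing threshold choices of Section 3 can be sketched side by side. This is an illustrative sketch, not the authors' code: `threshold_b` implements the interpolation of equation (1) from the outputs of the closest positive and negative points, and `margin_shift_eta` implements an equation (2)-style shift under the assumption (ours, from minimising fat-shattering terms proportional to β/(γ₀+η)² + 1/(γ₀-η)²) that the optimal margin ratio is the cube root of β:

```python
def threshold_b(f_pos, f_neg, beta):
    """Eq. (1): shifted threshold making the positive-side margin beta times
    the negative-side margin; f_pos / f_neg are the outputs w.x of the
    closest positive and negative points."""
    return (f_pos + beta * f_neg) / (1.0 + beta)

def margin_shift_eta(gamma0, beta):
    """Eq. (2)-style shift of the maximal margin hyperplane. This sketch
    assumes the optimal ratio (gamma0 + eta)/(gamma0 - eta) = beta**(1/3)."""
    r = beta ** (1.0 / 3.0)
    return gamma0 * (r - 1.0) / (r + 1.0)

# beta = 1 recovers the symmetric choices: midpoint threshold, no shift.
print(threshold_b(1.0, -1.0, 1.0))   # 0.0
print(margin_shift_eta(1.0, 1.0))    # 0.0
```

For beta = 3 and closest outputs ±1, `threshold_b` returns -0.5, giving a positive-side margin of 1.5 against a negative-side margin of 0.5, i.e. the 3:1 ratio the heuristic asks for.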
More specifically, assume: (i) a set of training examples (x₁, y₁), ..., (x_m, y_m), where x_i ∈ X and y ∈ Y = {-1, +1}; (ii) a weak learner that outputs hypotheses h : X → {-1, +1}; and (iii) the unequal loss function L_β of Definition 3.1. \n\nWe assign initial weight D₁(i) = w⁺ to the n⁺ positive examples and D₁(i) = w⁻ to the n⁻ negative examples, where w⁺n⁺ + w⁻n⁻ = 1. The values can be set so that w⁺/w⁻ = β or they can be adjusted using a validation set. The generalization of AdaBoost to the case of an unequal loss function is given as the AdaUBoost algorithm in Figure 1. We adapt Theorem 1 in [7] for this algorithm. \n\nTheorem 4.1 Assuming the notation and algorithm of Figure 1, the following bound holds on the training error of H: \n\nw⁺|{i : H(x_i) ≠ y_i = 1}| + w⁻|{i : H(x_i) ≠ y_i = -1}| ≤ ∏_{t=1}^T Z_t.   (3) \n\nThe choice of w⁺ and w⁻ will force uneven probabilities of misclassification on the training set, but to ensure that the weak learners concentrate on misclassified positive examples we define Z (suppressing the subscript t) as \n\nZ = Σ_i D(i) exp(-α β_i y_i h(x_i)),   (4) \n\nwhere β_i = 1/β if y_i = 1 and 1 otherwise. Thus, to minimize training error we should seek to minimize Z with respect to α (the voting coefficient) on each iteration of boosting. Following [7], we introduce the notation W₊₊, W₋₊, W₊₋ and W₋₋, where for s₁ and s₂ ∈ {-1, +1} \n\nW_{s₁s₂} = Σ_{i : h(x_i)=s₁, y_i=s₂} D(i).   (5) \n\nBy equating to zero the first derivative of (4) with respect to α, Z'(α), and using (5) we have -exp(-α/β)W₊₊/β + exp(α/β)W₋₊/β + exp(α)W₊₋ - exp(-α)W₋₋ = 0. Letting Y = exp(α) we get \n\nc₁Y^(-1/β) + c₂Y^(1/β) + c₃Y + c₄Y^(-1) = 0,   (6) \n\nwhere c₁ = -W₊₊/β, c₂ = W₋₊/β, c₃ = W₊₋, and c₄ = -W₋₋; for rational 1/β this reduces, after clearing the negative and fractional powers, to a polynomial in a power of Y. The root of this equation can be found numerically. Since Z''(α) > 0, Z'(α) can have at most one zero and this gives the unique minimum of Z(α). 
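Since Z is strictly convex in α, the numerical root-finding the text calls for can be done by simple bisection on Z'(α). The following is a sketch under our conventions (W_{s₁s₂} is the total weight of examples with h(x) = s₁ and y = s₂; the bracketing interval and tolerance are arbitrary choices of ours):

```python
import math

def solve_alpha(Wpp, Wmp, Wpm, Wmm, beta, lo=-10.0, hi=10.0, tol=1e-10):
    """Find the unique zero of Z'(alpha) by bisection, where
    Z(alpha) = Wpp*exp(-alpha/beta) + Wmp*exp(alpha/beta)
             + Wpm*exp(alpha)      + Wmm*exp(-alpha).
    Z is strictly convex, so Z' is increasing and has at most one zero."""
    def dZ(a):
        return (-Wpp / beta * math.exp(-a / beta)
                + Wmp / beta * math.exp(a / beta)
                + Wpm * math.exp(a)
                - Wmm * math.exp(-a))
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if dZ(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Sanity check: beta = 1 collapses the four terms into correct/incorrect
# weight, recovering AdaBoost's alpha = 0.5 * ln(W_correct / W_wrong).
print(round(solve_alpha(0.4, 0.1, 0.1, 0.4, 1.0), 4))  # 0.6931, i.e. ln 2
```

The β = 1 check is a useful guard when implementing this: with equal losses the unequal-loss update must reduce exactly to standard AdaBoost.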
The solution for α from (6) is used (as α_t) when taking the distance of a training example from the standard threshold on each iteration of the AdaUBoost algorithm in Figure 1, as well as when combining the weak learners in H(x). \n\nThe ThetaBoost algorithm searches for a positive and a negative support vector (SV) point such that the hyperplane separating them has the largest margin. Once these SV points are found we can then apply the formulas (1) and (2) of Section 3 to compute values for adjusting the threshold. See Figure 1 for the complete algorithm. \n\nAlgorithm AdaUBoost(X, Y, β) \n\n1. Initialize D₁(i) as described above. \n2. For t = 1, ..., T: \n• train weak learner using distribution D_t; \n• get weak hypothesis h_t; \n• choose α_t ∈ ℝ; \n• update: D_{t+1}(i) = D_t(i) exp[-α_t β_i y_i h_t(x_i)]/Z_t, where β_i = 1/β if y_i = 1 and 1 otherwise, and Z_t is a normalization factor such that Σ_i D_{t+1}(i) = 1. \n3. Output the final hypothesis: H(x) = sgn(Σ_{t=1}^T α_t h_t(x)). \n\nAlgorithm ThetaBoost(X, Y, β, δ_M) \n\n1. H(x) = AdaUBoost(X, Y, β); \n2. Remove from the training dataset the false positive and borderline points; \n3. Find the smallest H(x⁺) and mark this as SV₊; remove any negative points with value greater than H(SV₊); \n4. Find the first negative point that is next in ranking to SV₊ and mark this as SV₋; compute the margin as the sum of the distances, d₊ and d₋, of SV₊ and SV₋ from the standard threshold; \n5. Check for candidate SV₋'s that are near the current one and change the margin by at least δ_M; \n6. Use SV₊ and SV₋ to compute the theta threshold from Eqns (1) and (2); \n7. Output the final hypothesis: H(x) = sgn(Σ_{t=1}^T α_t h_t(x) - θ). \n\nFigure 1: The AdaUBoost and ThetaBoost algorithms. \n\n
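The AdaUBoost loop of Figure 1 can be sketched compactly. This is a minimal illustration, not the authors' implementation: the decision-stump weak learner, the ternary search used to minimise the convex Z(α), the search interval, and all identifiers are our own assumptions.

```python
import math

def ada_u_boost(X, y, beta, T, weak_learner):
    """Sketch of AdaUBoost (Figure 1). X: feature vectors, y: labels in
    {-1,+1}. beta_i = 1/beta for positives, 1 for negatives."""
    n_pos = sum(1 for t in y if t == 1)
    n_neg = len(y) - n_pos
    # initial weights with w+/w- = beta and w+ n+ + w- n- = 1
    w_neg = 1.0 / (beta * n_pos + n_neg)
    D = [beta * w_neg if t == 1 else w_neg for t in y]
    b = [1.0 / beta if t == 1 else 1.0 for t in y]   # the beta_i of the update
    ensemble = []
    for _ in range(T):
        h = weak_learner(X, y, D)
        Z = lambda a: sum(Di * math.exp(-a * bi * yi * h(xi))
                          for Di, bi, yi, xi in zip(D, b, y, X))
        lo, hi = -10.0, 10.0            # Z is convex: ternary-search its minimum
        for _ in range(200):
            m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
            if Z(m1) < Z(m2):
                hi = m2
            else:
                lo = m1
        alpha = 0.5 * (lo + hi)
        ensemble.append((alpha, h))
        D = [Di * math.exp(-alpha * bi * yi * h(xi))
             for Di, bi, yi, xi in zip(D, b, y, X)]
        s = sum(D)
        D = [Di / s for Di in D]        # normalize so the weights sum to 1
    return lambda x: 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1

def stump(X, y, D):
    """Toy weak learner: best weighted threshold split on feature 0."""
    best = None
    for th in sorted(x[0] for x in X):
        for s in (1, -1):
            err = sum(d for d, xi, yi in zip(D, X, y)
                      if (s if xi[0] >= th else -s) != yi)
            if best is None or err < best[0]:
                best = (err, th, s)
    _, th, s = best
    return lambda x: s if x[0] >= th else -s

X = [[0.0], [1.0], [2.0], [3.0]]
y = [-1, -1, 1, 1]
H = ada_u_boost(X, y, beta=2.0, T=3, weak_learner=stump)
print([H(x) for x in X])  # [-1, -1, 1, 1]
```

Replacing the ternary search with a closed-form or root-finding solution of equation (6) gives the same α_t; the search is used here only to keep the sketch self-contained.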
5 Experiments \n\nThe purpose of the experiments reported in this section is two-fold: \n\n(i) to compare the generalization performance of AdaUBoost against that of standard AdaBoost on imbalanced datasets; \n(ii) to examine the two formulas for choosing the threshold in ThetaBoost and evaluate their effect on generalization performance. \n\nFor the evaluations in (i) and (ii) we use two performance measures: the average L_β and the geometric mean of accuracy (g-mean) [4]. The latter is defined as g = √(precision · recall), where \n\nprecision = (# positives correct)/(# positives predicted), recall = (# positives correct)/(# true positives). \n\nThe g-mean has recently been proposed as a performance measure that, in contrast to accuracy, can capture the \"specificity\" trade-off between false positives and true positives in imbalanced datasets [4]. It is also independent of the distribution of examples between classes. \n\nFor our initial experiments we used the satimage dataset from the UCI repository [5] and used a uniform D₁. The dataset is about classifying neighborhoods of pixels in a satellite image. It has 36 continuous attributes and 6 classes. We picked class 4 as the goal class since it is the least prevalent one (9.73% of the dataset). The dataset comes in a training (4435 examples) and a test (2000 examples) set. \n\nTable 1 shows the performance on the test set of AdaUBoost, AdaBoost and C4.5 for different values of the beta parameter. It should be pointed out that the latter two algorithms minimize the total error assuming an equal loss function (β = 1). In the case of equal loss AdaUBoost simply reduces to AdaBoost. As observed from the table, the higher the loss parameter the bigger the improvement of AdaUBoost over the other two algorithms. This is particularly apparent in the values of g-mean. 
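The g-mean used in the evaluation is straightforward to compute; a minimal sketch (function name and toy labels are ours) following the definition above:

```python
import math

def g_mean(y_true, y_pred):
    """Geometric mean of precision and recall over the positive class,
    g = sqrt(precision * recall), with labels in {-1, +1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    pred_pos = sum(1 for p in y_pred if p == 1)
    true_pos = sum(1 for t in y_true if t == 1)
    precision = tp / pred_pos if pred_pos else 0.0
    recall = tp / true_pos if true_pos else 0.0
    return math.sqrt(precision * recall)

# 2 of 3 predicted positives correct, 2 of 3 true positives found:
print(g_mean([1, 1, 1, -1, -1], [1, 1, -1, 1, -1]))  # 2/3
```

Unlike raw accuracy, this score collapses to 0 whenever the classifier predicts no positives at all, which is exactly the degenerate behaviour imbalanced datasets invite.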
\n\nf3 values \n1 \n2 \n4 \n8 \n16 \n\nAdaUBoost \n\navgLoss g-mean \n0.773 \n0.0545 \n0.865 \n0.0895 \n0.13 \n0.889 \n0.1785 \n0.898 \n0.267 \n0.89 \n\nAdaBoost \n\navgLoss g-mean \n0.0545 \n0.773 \n0.0831 \n0.773 \n0.1662 \n0.773 \n0.3324 \n0.773 \n0.664 \n0.773 \n\nC4.5 \n\navgLoss g-mean \n0.724 \n0.0885 \n0.724 \n0.136 \n0.724 \n0.231 \n0.724 \n0.421 \n0.724 \n0.801 \n\nTable 1: Generalization performance in the SatImage dataset. \n\nFigure 2 shows the generalization performance of ThetaBoost in terms of average \nloss (13 = 2) for different values of the threshold (). The latter ranges from the largest \nmargin of negative examples that corresponds to SV_ to the smallest margin of \npositive examples that corresponds to SV+. This range includes the values of band \nTJ given by formulas (I) and (2). In this experiment,sM was set to 0.2. As depicted in \nthe figure , the margin defined by b achieves better generalization performance than \nthe margin defined by TJ. In particular, b is closer to the value of () that gives the \nminimum loss on this test set. In addition, ThetaBoost with b performs better than \nAdaUBoost on this test set. We should emphasise, however, that the differences \nare not significant and that more extensive experiments are required before the two \napproaches can be ranked reliably. \n\n\fOptimizing Classifers for Imbalanced Training Sets \n\n259 \n\n0 .2 . - - - - - - - - - - . - - - - - - - - - - . , - - - - - - - - - - - , \n\n0.18 \n\n0.16 \n\nen \nen \n.3 \n~0.14 \nl!! \n~ \n\nQ) \n\n0.12 \n\n0.1 \n\n0 .08L - - - - - - - -L - - - - - - - - ' - - - - - - - - - - ' \n100 \n\n-50 \n\no \n\nFigure 2: Average Loss L{3 (13 = 2) on test set as a function of () \n\nThreshold e \n\n50 \n\n6 Discussion \n\nIn the above we built a theoretical framework for optimaIly setting the margin \ngiven an unequal loss function. 
By applying this framework to boosting we developed AdaUBoost and ThetaBoost, which generalize AdaBoost, a well-known boosting algorithm, to take into account unequal loss functions and to adjust the margin in imbalanced datasets. Initial experiments have shown that both these factors improve the generalization performance of the boosted classifier. \n\nReferences \n\n[1] Corinna Cortes and Vladimir Vapnik. Machine Learning, 20:273-297, 1995. \n[2] Yoav Freund and Robert Schapire. Pages 148-156 in Proceedings of the International Conference on Machine Learning, ICML'96, 1996. \n[3] Michael J. Kearns and Robert E. Schapire. Pages 382-391 in Proceedings of the 31st Symposium on the Foundations of Computer Science, FOCS'90, 1990. \n[4] M. Kubat, R. Holte and S. Matwin. Machine Learning, 30:195-215, 1998. \n[5] C.J. Merz and P.M. Murphy (1997). UCI repository of machine learning databases. http://www.ics.uci.edu/~mlearn/MLRepository.html. \n[6] R. Schapire, Y. Freund, P. Bartlett and Wee Sun Lee. Pages 322-330 in Proceedings of the International Conference on Machine Learning, ICML'97, 1997. \n[7] Robert Schapire and Yoram Singer. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, COLT'98, 1998. \n[8] John Shawe-Taylor. Algorithmica, 22:157-172, 1998. \n[9] John Shawe-Taylor, Peter Bartlett, Robert Williamson and Martin Anthony. IEEE Trans. Inf. Theory, 44(5):1926-1940, 1998. \n", "award": [], "sourceid": 1523, "authors": [{"given_name": "Grigoris", "family_name": "Karakoulas", "institution": null}, {"given_name": "John", "family_name": "Shawe-Taylor", "institution": null}]}