{"title": "ARC-LH: A New Adaptive Resampling Algorithm for Improving ANN Classifiers", "book": "Advances in Neural Information Processing Systems", "page_first": 522, "page_last": 528, "abstract": null, "full_text": "ARC-LH: A New Adaptive Resampling \nAlgorithm for Improving ANN Classifiers \n\nFriedrich Leisch \n\nKurt Hornik \n\nFriedrich.Leisch@ci.tuwien.ac.at \n\nKurt.Hornik@ci.tuwien.ac.at \n\nInstitut fiir Statistik und Wahrscheinlichkeitstheorie \n\nTechnische UniversWit Wien \n\nA-I040 Wien, Austria \n\nAbstract \n\nWe introduce arc-Ih, a new algorithm for improvement of ANN clas(cid:173)\nsifier performance, which measures the importance of patterns by \naggregated network output errors. On several artificial benchmark \nproblems, this algorithm compares favorably with other resample \nand combine techniques. \n\n1 \n\nIntroduction \n\nThe training of artificial neural networks (ANNs) is usually a stochastic and unsta(cid:173)\nble process. As the weights of the network are initialized at random and training \npatterns are presented in random order, ANNs trained on the same data will typ(cid:173)\nically be different in value and performance. In addition, small changes in the \ntraining set can lead to two completely different trained networks with different \nperformance even if the nets had the same initial weights. \n\nRoughly speaking, ANNs have a low bias because of their approximation capabili(cid:173)\nties, but a rather high variance because of the instability. Recently, several resample \nand combine techniques for improving ANN performance have been proposed. In \nthis paper we introduce an new arcing (\"~aptive resample and \u00a3ombine\") method \ncalled arc-Ih. Contrary to the arc-fs method by Freund & Schapire (1995), which \nuses misclassification rates for adapting the resampling probabilities, arc-Ih uses the \naggregated network output error. 
The performance of arc-lh is compared with other techniques on several popular artificial benchmark problems.

2 Bias-Variance Decomposition of 0-1 Loss

Consider the task of classifying a random vector $\xi$ taking values in $\mathcal{X}$ into one of $c$ classes $G_1, \ldots, G_c$, and let $g(\cdot)$ be a classification function mapping the input space onto the finite set $\{1, \ldots, c\}$.

The classification task is to find an optimal function $g$ minimizing the risk

$$R_g = \mathbb{E}\, L_g(\xi) = \int_{\mathcal{X}} L_g(x) \, dF(x) \qquad (1)$$

where $F$ denotes the (typically unknown) distribution function of $\xi$, and $L$ is a loss function. In this paper, we consider 0-1 loss only, i.e., the loss is 1 for all misclassified patterns and zero otherwise.

It is well known that the optimal classifier, i.e., the classifier with minimum risk, is the Bayes classifier $g^*$ assigning to each input $x$ the class with maximum posterior probability $\mathbb{P}(G_n \mid x)$. These posterior probabilities are typically unknown, hence the Bayes classifier cannot be used directly. Note that $R_{g^*} = 0$ for disjoint classes and $R_{g^*} > 0$ otherwise.

Let $X_N = \{x_1, \ldots, x_N\}$ be a set of independent input vectors for which the true class is known, available for training the classifier. Further, let $g_{X_N}(\cdot)$ denote a classifier trained using set $X_N$. The risk $R_{g_{X_N}} \geq R_{g^*}$ of classifier $g_{X_N}$ is a random variable depending on the training sample $X_N$. In the case of ANN classifiers it also depends on the network training, i.e., even for fixed $X_N$ the performance of a trained ANN is a random variable depending on the initialization of the weights and the (often random) presentation of the patterns $x_n$ during training.

Following Breiman (1996a) we decompose the risk of a classifier into the (minimum possible) Bayes error, a systematic bias term of the model class and the variance of the classifier within its model class.
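The risk in Equation 1 is not computable when $F$ is unknown, but it can be estimated on an independent test sample drawn from $F$, as done via Monte Carlo in Section 4. A minimal sketch under 0-1 loss (the function name is ours, not from the paper):

```python
import numpy as np

def empirical_risk(classifier, inputs, labels):
    """Monte Carlo estimate of the 0-1 risk R_g: the fraction of
    misclassified patterns in an independent test sample."""
    predictions = classifier(inputs)
    return np.mean(predictions != labels)
```

On a test set of size 10000, as used in the experiments below, this simply counts the misclassification rate of the trained classifier.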
We call a classifier model unbiased for input $x$ if, over replications of all possible training sets $X_N$ of size $N$, network initializations and pattern presentations, $g$ picks the correct class more often than any other class. Let $U = U(g)$ denote the set of all $x \in \mathcal{X}$ where $g$ is unbiased, and $B = B(g) = \mathcal{X} \setminus U$ the set of all points where $g$ is biased. The risk of classifier $g$ can be decomposed as

$$R_g = R_{g^*} + \mathrm{Bias}(g) + \mathrm{Var}(g) \qquad (2)$$

where $R_{g^*}$ is the risk of the Bayes classifier,

$$\mathrm{Bias}(g) = R_B(g) - R_B(g^*), \qquad \mathrm{Var}(g) = R_U(g) - R_U(g^*)$$

and $R_B$ and $R_U$ denote the risk on the sets $B$ and $U$, respectively, i.e., the integration in Equation 1 is over $B$ or $U$ instead of $\mathcal{X}$.

A simpler bias-variance decomposition has been proposed by Kong & Dietterich (1995):

$$\mathrm{Bias}(g) = \mathbb{P}\{B\}, \qquad \mathrm{Var}(g) = R_g - \mathrm{Bias}(g)$$

The size of the bias set is seen as the bias of the model (i.e., the error the model class "typically" makes). The variance is simply the difference between the actual risk and this bias term. This decomposition yields negative variance if the current classifier performs better than the average classifier.

In both decompositions, the bias gives the systematic risk of the model, whereas the variance measures how good the current realization is compared to the best possible realization of the model. Neural networks are very powerful but rather unstable approximators, hence their bias should be low, but the variance may be high.

3 Resample and Combine

Suppose we had $k$ independent training sets $X_{N_1}, \ldots, X_{N_k}$ and corresponding classifiers $g_1, \ldots, g_k$ trained using these sets, respectively. We can then combine these single classifiers into a joint voting classifier $g_v^k$ by assigning to each input $x$ the class the majority of the $g_j$ vote for. If the $g_j$ have low bias, then $g_v^k$ should have low bias, too.
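The majority vote defining $g_v^k$ can be sketched as follows (a sketch under our notation; ties are broken toward the smaller class index, a detail the paper does not specify):

```python
import numpy as np

def majority_vote(predictions):
    """Combine k classifiers' class labels by majority vote.

    predictions: array of shape (k, n) with integer class labels,
    one row per classifier g_j. Returns one label per input.
    """
    predictions = np.asarray(predictions)
    n_classes = predictions.max() + 1
    # Count the votes each class receives for every input,
    # then pick the class with the most votes.
    votes = np.apply_along_axis(np.bincount, 0, predictions,
                                minlength=n_classes)
    return votes.argmax(axis=0)
```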
If the model is unbiased for an input $x$, then the variance of $g_v^k$ vanishes as $k \to \infty$, and $g_v = \lim_{k \to \infty} g_v^k$ is optimal for $x$. Hence, by resampling training sets from the original training set and combining the resulting classifiers into a voting classifier it might be possible to reduce the high variance of unstable classification algorithms.

[Figure: The resample and combine scheme. Training sets $X_{N_1}, \ldots, X_{N_k}$ are resampled from $X_N$ (with adaptation of the resampling probabilities in arcing), ANN classifiers $g_1, \ldots, g_k$ are trained on them and finally combined.]

3.1 Bagging

Breiman (1994, 1996a) introduced a procedure called bagging ("bootstrap aggregating") for tree classifiers that may also be used for ANNs. The bagging algorithm starts with a training set $X_N$ of size $N$. Several bootstrap replica $X_N^1, \ldots, X_N^k$ are constructed and a neural network is trained on each. These networks are finally combined by majority voting. The bootstrap sets $X_N^i$ consist of $N$ patterns drawn with replacement from the original training set (see Efron & Tibshirani (1993) for more information on the bootstrap).

3.2 Arcing

3.2.1 Arcing Based on Misclassification Rates

Arcing, which is a more sophisticated version of bagging, was first introduced by Freund & Schapire (1995) and called boosting. The new training sets are not constructed by uniformly sampling from the empirical distribution of the training set $X_N$, but from a distribution over $X_N$ that includes information about previous misclassifications.

Let $p_n^i$ denote the probability that pattern $x_n$ is included in the $i$-th training set $X_N^i$ and initialize with $p_n^1 = 1/N$.
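Bagging and the arcing variants construct each training set $X_N^i$ the same way, by drawing $N$ patterns with replacement; only the inclusion probabilities $p_n^i$ differ (uniform for bagging, adapted for arcing). A minimal sketch (the function name is ours):

```python
import numpy as np

def resample_indices(n, probs=None, rng=None):
    """Draw a training set X_N^i of size n by sampling pattern indices
    with replacement. probs are the inclusion probabilities p_n^i;
    None means uniform, i.e., an ordinary bootstrap replicate."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.choice(n, size=n, replace=True, p=probs)
```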
Freund and Schapire's arcing algorithm, called arc-fs as in Breiman (1996a), works as follows:

1. Construct a pattern set $X_N^i$ by sampling with replacement with probabilities $p_n^i$ from $X_N$ and train a classifier $g_i$ using set $X_N^i$.

2. Set $d_n = 1$ for all patterns that are misclassified by $g_i$ and zero otherwise. With $\epsilon_i = \sum_{n=1}^N p_n^i d_n$ and $\beta_i = (1 - \epsilon_i)/\epsilon_i$ update the probabilities by

$$p_n^{i+1} = \frac{p_n^i \beta_i^{d_n}}{\sum_{m=1}^N p_m^i \beta_i^{d_m}}$$

3. Set $i := i + 1$ and repeat.

After $k$ steps, $g_1, \ldots, g_k$ are combined by weighted voting where each $g_i$'s vote has weight $\log \beta_i$. Breiman (1996a) and Quinlan (1996) compare bagging and arcing for CART and C4.5 classifiers, respectively. Both bagging and arc-fs are very effective in reducing the high variance component of tree classifiers, with adaptive resampling being a bit better than simple bagging.

3.2.2 Arcing Based on Network Error

Independently from the arcing and bagging procedures described above, adaptive resampling has been introduced for active pattern selection in leave-k-out cross-validation, CV/APS (Leisch & Jain, 1996; Leisch et al., 1995). Whereas arc-fs (or Breiman's arc-x4) uses only the information whether a pattern is misclassified or not, in CV/APS the fact that MLPs approximate the posterior probabilities of the classes (Kanaya & Miyake, 1991) is utilized, too. We introduce a simple new arcing method based on the main idea of CV/APS that the "importance" of a pattern for the learning process can be measured by the aggregated output error of an MLP for the pattern over several training runs.

Let the classifier $g$ be an ANN using 1-of-$c$ coding, i.e., one output node per class; the target $t(x)$ for each input $x$ is one at the node corresponding to the class of $x$ and zero at the remaining output nodes. Let $e(x) = \|t(x) - g(x)\|^2$ be the squared error of the network for input $x$.
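The probability update in step 2 of arc-fs above can be sketched as follows (a sketch under the paper's notation; the function returns the new $p_n^{i+1}$ together with the voting weight $\log \beta_i$):

```python
import numpy as np

def arc_fs_update(p, misclassified):
    """One arc-fs reweighting step.

    p: current resampling probabilities p_n^i (sums to 1).
    misclassified: d_n = 1 if pattern n was misclassified by g_i, else 0.
    Assumes 0 < epsilon_i < 1, as arc-fs does.
    """
    p = np.asarray(p, dtype=float)
    d = np.asarray(misclassified)
    eps = np.sum(p * d)               # weighted error epsilon_i
    beta = (1.0 - eps) / eps          # beta_i = (1 - eps_i) / eps_i
    new_p = p * beta ** d             # scale misclassified patterns by beta_i
    return new_p / new_p.sum(), np.log(beta)
```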
Patterns that repeatedly have high output errors are somewhat harder to learn for the network, and therefore their resampling probabilities are increased proportionally to the error. Error-dependent resampling introduces a "grey-scale" of pattern importance as opposed to the "black and white" paradigm of misclassification-dependent resampling.

Again let $p_n^i$ denote the probability that pattern $x_n$ is included in the $i$-th training set $X_N^i$ and initialize with $p_n^1 = 1/N$. Our new arcing algorithm, called arc-lh, works as follows:

1. Construct a pattern set $X_N^i$ by sampling with replacement with probabilities $p_n^i$ from $X_N$ and train a classifier $g_i$ using set $X_N^i$.

2. Add the network output error of each pattern to the resampling probabilities and renormalize:

$$p_n^{i+1} = \frac{p_n^i + e_i(x_n)}{\sum_{m=1}^N \left(p_m^i + e_i(x_m)\right)}$$

3. Set $i := i + 1$ and repeat.

After $k$ steps, $g_1, \ldots, g_k$ are combined by majority voting.

3.3 Jittering

In our experiments, we also compare the above resample and combine methods with jittering, which resamples the training set by contaminating the inputs with artificial noise. No voting is done, but the size of the training set is increased by the creation of artificial inputs "around" the original inputs, see Koistinen & Holmström (1992).

4 Experiments

We demonstrate the effects of bagging and arcing on several well known artificial benchmark problems. For all problems, $i$-$h$-$c$ single hidden layer perceptrons (SHLPs) with $i$ input, $h$ hidden and $c$ output nodes were used. The number of hidden nodes $h$ was chosen such that the corresponding networks have reasonably low bias.

2 Spirals with noise: 2-dimensional input, 2 classes. Inputs with uniform noise around two spirals. $N = 300$. $R_{g^*} = 0\%$. 2-14-2 SHLP.

Continuous XOR: 2-dimensional input, 2 classes. Uniform inputs on the 2-dimensional square $-1 \leq x, y \leq 1$, classified into the two classes $x \cdot y \geq 0$ and $x \cdot y < 0$. $N = 300$.
$R_{g^*} = 0\%$. 2-4-2 SHLP.

Ringnorm: 20-dimensional input, 2 classes. Class 1 is normal with mean zero and covariance 4 times the identity matrix. Class 2 is a unit normal with mean $(a, a, \ldots, a)$, $a = 2/\sqrt{20}$. $N = 300$. $R_{g^*} = 1.2\%$. 20-4-2 SHLP.

The first two problems are standard benchmark problems (note however that we use a noisy variant of the standard spirals problem); the last one is used, e.g., in Breiman (1994, 1996a).

All experiments were replicated 50 times; in each bagging and arcing replication 10 classifiers were combined to build a voting classifier. Generalization errors were computed using Monte Carlo techniques on test sets of size 10000.

Table 1 gives the average risk over the 50 replications for a standard single SHLP, an SHLP trained on a jittered training set, and for voting classifiers using ten votes constructed with bagging, arc-lh and arc-fs, respectively. The Bayes risk of the spiral and xor examples is zero, hence the risk of a network equals the sum of its bias and variance. The Bayes risk of the ringnorm example is 1.2%.
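Test patterns for the Monte Carlo error estimates can be drawn directly from the generating distributions; for instance, the ringnorm generator described above can be sketched as follows (a sketch from the description; equal class priors are assumed, and the function and variable names are ours):

```python
import numpy as np

def ringnorm(n, d=20, rng=None):
    """Sample n patterns of the ringnorm problem: class 1 is
    N(0, 4*I_d), class 2 is N((a,...,a), I_d) with a = 2/sqrt(d).
    Returns inputs of shape (n, d) and labels in {1, 2}."""
    rng = np.random.default_rng() if rng is None else rng
    labels = rng.integers(1, 3, size=n)         # classes drawn with equal probability
    a = 2.0 / np.sqrt(d)
    class1 = rng.normal(0.0, 2.0, size=(n, d))  # std 2 <=> covariance 4*I_d
    class2 = rng.normal(a, 1.0, size=(n, d))
    x = np.where((labels == 1)[:, None], class1, class2)
    return x, labels
```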
\n\nKong & Dietterich \nRg Bias(g) Var(g) Bias(g) Var(g) \n\nBreiman \n\nstandard \njitter \nbagging \narc-fs \narc-Ih \n\nstandard \njitter \nbagging \narc-fs \narc-Ih \n\n7.75 \n6.53 \n4.39 \n4.31 \n4.32 \n\n6.54 \n6.29 \n3.69 \n3.73 \n3.58 \n\nstandard 18.64 \njitter \n18.56 \n15.72 \nbagging \n15.71 \narc-fs \narc-Ih \n15.63 \n\n2 Spirals \n\n7.43 \n6.27 \n4.04 \n3.96 \n4.01 \nXOR \n6.01 \n5.92 \n3.09 \n3.15 \n3.08 \n\nRingnorm \n\n8.26 \n8.34 \n4.91 \n4.81 \n5.13 \n\n0.82 \n0.52 \n0.68 \n0.60 \n0.72 \n\n1.32 \n1.08 \n1.22 \n1.12 \n1.20 \n\n13.84 \n13.72 \n13.54 \n13.58 \n13.20 \n\n0.32 \n0.26 \n0.35 \n0.35 \n0.31 \n\n0.53 \n0.37 \n0.59 \n0.58 \n0.50 \n\n9.19 \n9.03 \n9.61 \n9.70 \n9.30 \n\n6.93 \n6.02 \n3.71 \n3.71 \n3.60 \n\n5.22 \n5.21 \n2.47 \n2.61 \n2.38 \n\n4.80 \n4.84 \n2.18 \n2.13 \n2.43 \n\nTable 1: Bias-variance decompositions. \n\nThe variance part was drastically reduced by the res ample & combine methods, with \nonly a negligible change in bias. Note the low bias in the spiral and xor problems. \nANNs obviously can solve these classification tasks (one could create appropriate \nnets by hand), but of course training cannot find the exact boundaries between the \nclasses. Averaging over several nets helps to overcome this problem. The bias in \nthe ringnorm example is rather high, indicating that a change of network topology \n(bigger net, etc.) or training algorithm (learning rate, etc.) may lower the overall \nrisk. \n\n5 Summary \n\nComparison of of the resample and combine algorithms shows slight advantages \nfor adaptive resampling, but no algorithm dominates the other two. Further im-\n\n\f528 \n\nF. Leisch and K. Hornik \n\nprovements should be possible based on a better understanding of the theoretical \nproperties of resample and combine techniques. These issues are currently being \ninvestigated. \n\nReferences \n\nBreiman, L. (1994). Bagging predictors. Tech. Rep. 
421, Department of Statistics, University of California, Berkeley, California, USA.

Breiman, L. (1996a). Bias, variance, and arcing classifiers. Tech. Rep. 460, Statistics Department, University of California, Berkeley, CA, USA.

Breiman, L. (1996b). Stacked regressions. Machine Learning, 24, 49-64.

Drucker, H. & Cortes, C. (1996). Boosting decision trees. In Touretzky, D. S., Mozer, M. C., & Hasselmo, M. E. (eds.), Advances in Neural Information Processing Systems, vol. 8. MIT Press.

Efron, B. & Tibshirani, R. J. (1993). An introduction to the bootstrap. Monographs on Statistics and Applied Probability. New York: Chapman & Hall.

Freund, Y. & Schapire, R. E. (1995). A decision-theoretic generalization of on-line learning and an application to boosting. Tech. rep., AT&T Bell Laboratories, 600 Mountain Ave, Murray Hill, NJ, USA.

Kanaya, F. & Miyake, S. (1991). Bayes statistical behavior and valid generalization of pattern classifying neural networks. IEEE Transactions on Neural Networks, 2(4), 471-475.

Kohavi, R. & Wolpert, D. H. (1996). Bias plus variance decomposition for zero-one loss. In Machine Learning: Proceedings of the 13th International Conference.

Koistinen, P. & Holmström, L. (1992). Kernel regression and backpropagation training with noise. In Moody, J. E., Hanson, S. J., & Lippmann, R. P. (eds.), Advances in Neural Information Processing Systems, vol. 4, pp. 1033-1039. Morgan Kaufmann Publishers, Inc.

Kong, E. B. & Dietterich, T. G. (1995). Error-correcting output coding corrects bias and variance. In Machine Learning: Proceedings of the 12th International Conference, pp. 313-321. Morgan Kaufmann.

Leisch, F. & Jain, L. C. (1996). Cross-validation with active pattern selection for neural network classifiers. Submitted to IEEE Transactions on Neural Networks, in review.

Leisch, F., Jain, L. C., & Hornik, K. (1995).
NN classifiers: Reducing the computational cost of cross-validation by active pattern selection. In Artificial Neural Networks and Expert Systems, vol. 2. Los Alamitos, CA, USA: IEEE Computer Society Press.

Quinlan, J. R. (1996). Bagging, boosting and C4.5. University of Sydney, Australia.

Ripley, B. D. (1996). Pattern recognition and neural networks. Cambridge, UK: Cambridge University Press.

Tibshirani, R. (1996a). Bias, variance and prediction error for classification rules. University of Toronto, Canada.

Tibshirani, R. (1996b). A comparison of some error estimates for neural network models. Neural Computation, 8(1), 152-163.
", "award": [], "sourceid": 1198, "authors": [{"given_name": "Friedrich", "family_name": "Leisch", "institution": null}, {"given_name": "Kurt", "family_name": "Hornik", "institution": null}]}