{"title": "Training Methods for Adaptive Boosting of Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 647, "page_last": 653, "abstract": null, "full_text": "Training Methods for Adaptive Boosting \n\nof Neural Networks \n\nHolger Schwenk \n\nDept.IRO \n\nUniversite de Montreal \n2920 Chemin de la Tour, \n\nMontreal, Qc, Canada, H3C 317 \n\nschwenk@iro.umontreal.ca \n\nYoshua Bengio \n\nDept.IRO \n\nUniversite de Montreal \n\nand AT&T Laboratories, NJ \n\nbengioy@iro.umontreal.ca \n\nAbstract \n\n\"Boosting\" is a general method for improving the performance of any \nlearning algorithm that consistently generates classifiers which need to \nperform only slightly better than random guessing. A recently proposed \nand very promising boosting algorithm is AdaBoost [5]. It has been ap(cid:173)\nplied with great success to several benchmark machine learning problems \nusing rather simple learning algorithms [4], and decision trees [1, 2, 6]. \nIn this paper we use AdaBoost to improve the performances of neural \nnetworks. We compare training methods based on sampling the training \nset and weighting the cost function. Our system achieves about 1.4% \nerror on a data base of online handwritten digits from more than 200 \nwriters. Adaptive boosting of a multi-layer network achieved 1.5% error \non the UCI Letters and 8.1 % error on the UCI satellite data set. \n\n1 Introduction \nAdaBoost [4, 5] (for Adaptive Boosting) constructs a composite classifier by sequentially \ntraining classifiers, while putting more and more emphasis on certain patterns. AdaBoost \nhas been applied to rather weak learning algorithms (with low capacity) [4] and to deci(cid:173)\nsion trees [1, 2, 6], and not yet, until now, to the best of our knowledge, to artificial neural \nnetworks. These experiments displayed rather intriguing generalization properties, such as \ncontinued decrease in generalization error after training error reaches zero. 
Previous workers also disagree on the reasons for the impressive generalization performance displayed by AdaBoost on a large array of tasks. One issue raised by Breiman [1] and the authors of AdaBoost [4] is whether some of this effect is due to a reduction in variance similar to the one obtained from the Bagging algorithm. \n\nIn this paper we explore the application of AdaBoost to Diabolo (auto-associative) networks and multi-layer neural networks (MLPs). In doing so, we also compare three different versions of AdaBoost: (R) training each classifier with a fixed training set obtained by resampling with replacement from the original training set (as in [1]), (E) training by resampling a new training set from the original training set after each epoch, and (W) training by directly weighting the cost function (here the squared error) of the neural network. Note that the second version (E) is a better approximation of the weighted cost function than the first one (R), in particular when many epochs are performed. If the variance reduction induced by averaging the hypotheses from very different models explains a good part of the generalization performance of AdaBoost, then the weighted training version (W) should perform worse than the resampling versions, and the fixed-sample version (R) should perform better than the continuously resampled version (E). \n\n2 AdaBoost \n\nAdaBoost combines the hypotheses generated by a set of classifiers trained one after the other. The t-th classifier is trained with more emphasis on certain patterns, using a cost function weighted by a probability distribution D_t over the training data (D_t(i) is positive and Σ_i D_t(i) = 1). Some learning algorithms do not permit training with respect to a weighted cost function. 
In this case sampling with replacement (using the probability distribution D_t) can be used to approximate a weighted cost function. Examples with high probability then occur more often than those with low probability, while some examples may not occur in the sample at all even though their probability is not zero. This is particularly true in the simple resampling version (labeled \"R\" earlier), and unlikely when a new training set is resampled after each epoch (the \"E\" version). Neural networks can be trained directly with respect to a distribution over the learning data by weighting the cost function (this is the \"W\" version): the squared error on the i-th pattern is weighted by the probability D_t(i). The result of training the t-th classifier is a hypothesis h_t: X → Y, where Y = {1, ..., k} is the space of labels and X is the space of input features. After the t-th round the weighted error ε_t of the resulting classifier is calculated and the distribution D_{t+1} is computed from D_t by increasing the probability of incorrectly labeled examples. The global decision f is obtained by weighted voting. Figure 1 (left) summarizes the basic AdaBoost algorithm. It converges (learns the training set) if each classifier yields a weighted error that is less than 50%, i.e., better than chance in the 2-class case. There is also a multi-class version, called pseudoloss-AdaBoost, that can be used when the classifier computes confidence scores for each class. Due to lack of space, we give only the algorithm (see figure 1, right) and refer the reader to the references for more details [4, 5]. \n\nAdaBoost has very interesting theoretical properties; in particular, it can be shown that the error of the composite classifier on the training data decreases exponentially fast to zero [5] as the number of combined classifiers is increased. 
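The weighted-error and distribution-update steps just described can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: a 1-D decision stump (trained directly on the weighted error, as in the \"W\" variant) stands in for the neural network, and all function names are hypothetical.

```python
import math

def train_stump(xs, ys, dist):
    # Hypothetical weak learner standing in for the neural network:
    # a 1-D threshold stump chosen to minimize the weighted error.
    best = None
    for thr in xs:
        for sign in (1, -1):
            h = lambda x, t=thr, s=sign: 1 if s * (x - t) >= 0 else 0
            err = sum(d for x, y, d in zip(xs, ys, dist) if h(x) != y)
            if best is None or err < best[0]:
                best = (err, h)
    return best[1]

def adaboost(xs, ys, rounds=5):
    n = len(xs)
    dist = [1.0 / n] * n                      # D_1(i) = 1/N
    hyps = []
    for _ in range(rounds):
        h = train_stump(xs, ys, dist)
        # weighted error: eps_t = sum of D_t(i) over misclassified examples
        eps = sum(d for x, y, d in zip(xs, ys, dist) if h(x) != y)
        if eps > 0.5:                         # abort loop if eps_t > 1/2
            break
        eps = max(eps, 1e-10)                 # guard against eps_t = 0 (perfect stump)
        beta = eps / (1.0 - eps)              # beta_t = eps_t / (1 - eps_t)
        hyps.append((h, math.log(1.0 / beta)))
        # D_{t+1}(i) = D_t(i) * beta_t^{[h_t(x_i) = y_i]} / Z_t:
        # correctly classified examples are down-weighted
        dist = [d * (beta if h(x) == y else 1.0)
                for x, y, d in zip(xs, ys, dist)]
        z = sum(dist)                         # Z_t, the normalization constant
        dist = [d / z for d in dist]
    return hyps

def predict(hyps, x, labels=(0, 1)):
    # weighted vote: f(x) = argmax_y sum over {t: h_t(x) = y} of log(1/beta_t)
    votes = {y: 0.0 for y in labels}
    for h, w in hyps:
        votes[h(x)] += w
    return max(votes, key=votes.get)
```

Replacing `train_stump` with a network trained on the weighted squared error (or on a resampled training set) recovers the W, R, and E variants compared in this paper.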
More importantly, however, bounds on the generalization error of such a system have been formulated [7]. These are based on a notion of margin of classification, defined as the difference between the score of the correct class and the strongest score of a wrong class. In the case in which there are just two possible labels {-1, +1}, this is y f(x), where f is the composite classifier and y the correct label. Obviously, the classification is correct if the margin is positive. We now state the theorem bounding the generalization error of AdaBoost [7] (and of any classifier obtained by a convex combination of a set of classifiers). Let H be a set of hypotheses (from which the h_t are chosen), with VC-dimension d. Let f be any convex combination of hypotheses from H. Let S be a sample of N examples chosen independently at random according to a distribution D. Then with probability at least 1 - δ over the random choice of the training set S from D, the following bound is satisfied for all θ > 0: \n\nP_D[y f(x) ≤ 0] ≤ P_S[y f(x) ≤ θ] + O( (1/√N) √( d log²(N/d)/θ² + log(1/δ) ) )   (1) \n\nNote that this bound is independent of the number of combined hypotheses and of how they are chosen from H. \n\nInput: sequence of N examples (x_1, y_1), ..., (x_N, y_N) with labels y_i ∈ Y = {1, ..., k} \n\nBasic AdaBoost (left): \nInit: D_1(i) = 1/N for all i \nRepeat: \n1. Train neural network with respect to distribution D_t and obtain hypothesis h_t: X → Y \n2. Calculate the weighted error of h_t: ε_t = Σ_{i: h_t(x_i) ≠ y_i} D_t(i); abort loop if ε_t > 1/2 \n3. Set β_t = ε_t / (1 - ε_t) \n4. Update the distribution: D_{t+1}(i) = D_t(i) β_t^{δ_i} / Z_t, with δ_i = 1 if h_t(x_i) = y_i and 0 otherwise, and Z_t a normalization constant \nOutput: final hypothesis f(x) = argmax_{y ∈ Y} Σ_{t: h_t(x) = y} log(1/β_t) \n\nPseudoloss-AdaBoost (right): \nInit: let B = {(i, y): i ∈ {1, ..., N}, y ≠ y_i}; D_1(i, y) = 1/|B| for all (i, y) ∈ B \nRepeat: \n1. Train neural network with respect to distribution D_t and obtain hypothesis h_t: X × Y → [0, 1] \n2. Calculate the pseudo-loss of h_t: ε_t = (1/2) Σ_{(i,y) ∈ B} D_t(i, y) (1 - h_t(x_i, y_i) + h_t(x_i, y)) \n3. Set β_t = ε_t / (1 - ε_t) \n4. Update the distribution: D_{t+1}(i, y) = D_t(i, y) β_t^{(1/2)(1 + h_t(x_i, y_i) - h_t(x_i, y))} / Z_t, where Z_t is a normalization constant \nOutput: final hypothesis f(x) = argmax_{y ∈ Y} Σ_t log(1/β_t) h_t(x, y) \n\nFigure 1: AdaBoost algorithm (left), multi-class extension using confidence scores (right) \n\nThe distribution of the margins, however, plays an important role. It can be shown that the AdaBoost algorithm is especially well suited to the task of maximizing the number of training examples with large margin [7]. \n\n3 The Diabolo Classifier \n\nNormally, neural networks used for classification are trained to map an input vector to an output vector that encodes directly the classes, usually by the so-called \"1-out-of-N encoding\". An alternative approach with interesting properties is to use auto-associative neural networks, also called autoencoders or Diabolo networks, to learn a model of each class. In the simplest case, each autoencoder network is trained only with examples of the corresponding class, i.e., it learns to reconstruct all examples of one class at its output. The distance between the input vector and the reconstructed output vector expresses the likelihood that a particular example is part of the corresponding class. Therefore classification is done by choosing the best fitting model. Figure 2 summarizes the basic architecture. It also shows typical classification behavior for an online character recognition task. The input and output vectors are (x, y)-coordinate sequences of a character. 
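The per-class reconstruction scheme described above can be sketched as follows. This is a toy stand-in under stated assumptions: each class \"autoencoder\" is replaced by the class mean, the squared distance replaces the paper's tangent-distance measure, and all names are illustrative.

```python
def fit_models(examples):
    # One "model" per class label: here simply the class mean vector,
    # a toy stand-in for a trained per-class autoencoder (Diabolo network).
    models = {}
    for label, vecs in examples.items():
        dim = len(vecs[0])
        models[label] = [sum(v[k] for v in vecs) / len(vecs)
                         for k in range(dim)]
    return models

def reconstruction_error(model, x):
    # Squared distance between the input and its "reconstruction";
    # the paper instead uses a tangent-distance-based objective.
    return sum((a - b) ** 2 for a, b in zip(model, x))

def classify(models, x):
    # Choose the class whose model reconstructs x best (smallest error).
    return min(models, key=lambda c: reconstruction_error(models[c], x))
```

In the real classifier the inputs are the resampled (x, y)-coordinate sequences, and each class model is a trained network rather than a mean vector; the decision rule, however, is exactly this smallest-reconstruction-error choice.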
The visual representation in the figure is obtained by connecting these points. In this example the \"1\" is correctly classified since the network for this class has the smallest reconstruction error. \n\nThe Diabolo classifier uses a distributed representation of the models which is much more compact than the enumeration of references often used by distance-based classifiers like nearest-neighbor or RBF networks. Furthermore, one has to calculate only one distance measure for each class to recognize. This makes it possible to incorporate knowledge via a domain-specific distance measure at a very low computational cost. In previous work [8], we have shown that the well-known tangent distance [11] can be used in the objective function of the autoencoders. This Diabolo classifier has achieved state-of-the-art results in handwritten OCR [8, 9]. Recently, we have also extended the idea of a transformation-invariant distance measure to online character recognition [10]. \n\nFigure 2: Architecture of a Diabolo classifier (character to classify → input sequence → one output sequence per class → distance measures/scores → decision module) \n\nOne autoencoder alone, however, cannot efficiently learn the model of a character if it is written in many different stroke orders and directions. The architecture can be extended by using several autoencoders per class, each one specializing on a particular writing style (subclass). For the class \"0\", for instance, we would have one Diabolo network that learns a model for zeros written clockwise and another one for zeros written counterclockwise. The assignment of the training examples to the different subclass models should ideally be done in an unsupervised way. 
However, this can be quite difficult since the number of writing styles is not known in advance and the number of examples in each subclass usually varies a lot. Our training database contains, for instance, 100 zeros written counterclockwise, but only 3 written clockwise (there are also some more examples written in other unusual styles). Classical clustering algorithms would probably tend to ignore subclasses with very few examples since they are not responsible for much of the error, but this may result in poor generalization behavior. Therefore, in previous work we manually assigned the subclass labels [10]. Of course, this is not a generally satisfactory approach, and it is certainly infeasible when the training set is large. In the following, we show that the emphasizing algorithm of AdaBoost can be used to train multiple Diabolo classifiers per class, performing a soft assignment of the examples of the training set to each network. \n\n4 Results with Diabolo and MLP Classifiers \n\nExperiments have been performed on three data sets: a database of online handwritten digits, the UCI Letters database of offline machine-printed alphabetical characters, and the UCI satellite database, which is generated from Landsat Multi-Spectral Scanner image data. All data sets have a pre-defined training and test set. The Diabolo classifier was only applied to the online data set (since it takes advantage of the structure of the input features). \n\nThe online data set was collected at Paris 6 University [10]. It is writer-independent (different writers in training and test sets) and there are 203 writers, 1200 training examples and 830 test examples. Each writer gave only one example per class. Therefore, there are many different writing styles, with very different frequencies. 
We only applied a simple preprocessing: the characters were resampled to 11 points, centered, and size-normalized to an (x, y)-coordinate sequence in [-1, 1]^2. Since the Diabolo classifier with tangent distance [10] is invariant to small transformations, we do not need to extract further features. \n\nTable 1 summarizes the results on the test set of different approaches before using AdaBoost. The Diabolo classifier with hand-selected sub-classes in the training set performs best since it is invariant to transformations and since it can deal with the different writing styles. The experiments suggest that fully connected neural networks are not well suited for this task: small nets do poorly on both training and test sets, while large nets overfit. \n", "award": [], "sourceid": 1335, "authors": [{"given_name": "Holger", "family_name": "Schwenk", "institution": null}, {"given_name": "Yoshua", "family_name": "Bengio", "institution": null}]}