{"title": "Semi-supervised MarginBoost", "book": "Advances in Neural Information Processing Systems", "page_first": 553, "page_last": 560, "abstract": "", "full_text": "Semi-Supervised MarginBoost \n\nF. d'Alche-Buc \nLIP6, UMR CNRS 7606, \nUniversite P. et M. Curie \n75252 Paris Cedex, France \nflorence.dAlche@lip6.fr \n\nYves Grandvalet \nHeudiasyc, UMR CNRS 6599, \nUniversite de Technologie de Compiegne, \nBP 20.529, 60205 Compiegne cedex, France \nYves.Grandvalet@hds.utc.fr \n\nChristophe Ambroise \nHeudiasyc, UMR CNRS 6599, \nUniversite de Technologie de Compiegne, \nBP 20.529, 60205 Compiegne cedex, France \nChristophe.Ambroise@hds.utc.fr \n\nAbstract \n\nIn many discrimination problems a large amount of data is available, but only a few examples are labeled. This provides a strong motivation to improve or develop methods for semi-supervised learning. In this paper, boosting is generalized to this task within the optimization framework of MarginBoost. We extend the margin definition to unlabeled data and develop the gradient descent algorithm that corresponds to the resulting margin cost function. This meta-learning scheme can be applied to any base classifier able to benefit from unlabeled data. We propose here to apply it to mixture models trained with an Expectation-Maximization algorithm. Promising results are presented on benchmarks with different rates of labeled data. \n\n1 Introduction \n\nIn semi-supervised classification tasks, a concept is to be learnt using both labeled and unlabeled examples. Such problems arise frequently in data mining, where the cost of the labeling process can be prohibitive because it requires human help, as in video indexing, text categorization [12] and medical diagnosis.
While some works proposed different methods [16] to learn mixture models [12], [1], SVMs [3], or co-trained machines [5] for this task, no extension has been developed so far for ensemble methods such as boosting [7, 6]. Boosting consists in sequentially building a linear combination of base classifiers that focus on the difficult examples. For AdaBoost and extensions such as MarginBoost [10], this stage-wise procedure corresponds to a gradient descent, in the space of linear combinations of base classifiers, of a cost functional based on a decreasing function of the margin. \n\nWe propose to generalize boosting to semi-supervised learning within this optimization framework. We extend the margin notion to unlabeled data, derive the corresponding criterion to be maximized, and propose the resulting algorithm, called Semi-Supervised MarginBoost (SSMBoost). This new method enhances our previous work [9], based on a direct plug-in extension of AdaBoost, in the sense that all the ingredients of the gradient algorithm, such as the gradient direction and the stopping rule, are derived from the expression of the new cost function. Moreover, while the algorithm has been tested using mixture models [1], SSMBoost is designed to combine any base classifiers that deal with both labeled and unlabeled data. The paper begins with a brief presentation of MarginBoost (section 2). Then, in section 3, the SSMBoost algorithm is presented. Section 4 describes the base classifier. Experimental results are discussed in section 5 and we conclude in section 6. \n\n2 Boosting with MarginBoost \n\nBoosting [7, 6, 15] aims at improving the performance of any weak \"base classifier\" by linear combination.
We focus here on normalized ensemble classifiers $g_t \in \mathrm{Lin}(\mathcal{H})$, whose normalized(1) coefficients are noted $\alpha_\tau = a_\tau / \|a\|_1$, and each base classifier $h_\tau \in \mathcal{H}$ has outputs in $[-1, 1]$: \n\n$g_t(x) = \sum_{\tau=1}^{t} \alpha_\tau h_\tau(x)$    (1) \n\nDifferent contributions [13, 14], [8], [10] have described boosting within an optimization scheme, considering that it carries out a gradient descent in the space of linear combinations of base functions. We have chosen the MarginBoost algorithm, a variant of a more general algorithm called AnyBoost [10], which generalizes AdaBoost and formally justifies the interpretation in terms of margin. If $S$ is the training sample $\{(x_i, y_i), i = 1..l\}$, MarginBoost, described in Fig. 1, minimizes the cost functional $C$ defined for any scalar decreasing function $c$ of the margin $\rho$: \n\n$C(g_t) = \sum_{i=1}^{l} c(\rho(g_t(x_i), y_i))$    (2) \n\nInstead of taking exactly $h_{t+1} = -\nabla C(g_t)$, which does not ensure that the resulting function $g_{t+1}$ belongs to $\mathrm{Lin}(\mathcal{H})$, $h_{t+1}$ is chosen such that the inner product(2) $-\langle \nabla C(g_t), h_{t+1} \rangle$ is maximal. The equivalent weighted cost function to be maximized can thus be expressed as: \n\n$J_t^S = \sum_{i \in S} w_t(i) \, y_i \, h_{t+1}(x_i)$    (3) \n\n3 Generalizing MarginBoost to semi-supervised classification \n\n3.1 Margin extension \n\nFor labeled data, the margin measures the quality of the classifier output. When no label is observed, the usual margin cannot be calculated and has to be estimated. A first estimate can be derived from the expected margin $E_y[\rho_L(g_t(x), y)]$. We can use the output of the classifier, $(g_t(x) + 1)/2$, as an estimate of the posterior probability $P(Y = +1 \mid x)$. This leads to the following margin $\rho_U^g$, which depends on the input and is linked with the response of the classifier: \n\n$\rho_U^g(g_t(x)) = g_t(x)^2$    (4) \n\n(1) $a_\tau \geq 0$ and the L1 norm is used for normalization: $\|a\|_1 = \sum_{\tau=1}^{t} a_\tau$. \n(2) $\langle f, g \rangle = \sum_{i \in S} f(x_i) g(x_i)$. \n\nLet $w_0(i) = 1/l$, $i = 1, \ldots, l$. \nLet $g_0(x) = 0$. \nFor $t = 1 \ldots T$ (do the gradient descent): \n\n1.
Learn a gradient direction $h_{t+1} \in \mathcal{H}$ with a high value of $J_t^S = \sum_{i \in S} w_t(i) \, y_i \, h_{t+1}(x_i)$. \n\n2. Apply the stopping rule: if $J_t^S \leq \sum_{i \in S} w_t(i) \, y_i \, g_t(x_i)$ then return $g_t$, else go on. \n\n3. Choose a step-length $a_{t+1}$ for the obtained direction by a line-search, or by fixing it as a constant $\epsilon$. \n\n4. Add the new direction to obtain $g_{t+1} = (\|a_t\|_1 \, g_t + a_{t+1} h_{t+1}) / \|a_{t+1}\|_1$. \n\n5. Fix the weight distribution: $w_{t+1}(i) = c'(\rho(g_{t+1}(x_i), y_i)) / \sum_{j \in S} c'(\rho(g_{t+1}(x_j), y_j))$. \n\nFigure 1: MarginBoost algorithm (with L1 normalization of the combination coefficients). \n\nAnother way of defining the extended margin is to use directly the maximum a posteriori estimate of the true margin. This MAP estimate depends on the sign of the classifier output and provides the following margin definition $\rho_U^s$: \n\n$\rho_U^s(g_t(x)) = |g_t(x)|$    (5) \n\n3.2 Semi-Supervised MarginBoost: generalization of MarginBoost to deal with unlabeled data \n\nThe generalization of the margin can be used to define an appropriate cost functional for the semi-supervised learning task. Considering that the training sample $S$ is now divided into two disjoint subsets, $L$ for labeled data and $U$ for unlabeled data, the cost falls into two parts involving $\rho_L = \rho$ and $\rho_U$: \n\n$C(g_t) = \sum_{i \in L} c(\rho_L(g_t(x_i), y_i)) + \sum_{i \in U} c(\rho_U(g_t(x_i)))$    (6) \n\nThe maximization of $-\langle \nabla C(g_t), h_{t+1} \rangle$ is equivalent to optimizing the new quantity $J_t^S$, which now splits into two terms: $J_t^S = J_t^L + J_t^U$. The first term can be directly obtained from equation (3): \n\n$J_t^L = \sum_{i \in L} w_t(i) \, y_i \, h_{t+1}(x_i)$    (7) \n\nThe second term, $J_t^U$, can be expressed as follows: \n\n$J_t^U = \sum_{i \in U} w_t(i) \, \frac{\partial \rho_U(g_t(x_i))}{\partial g_t(x_i)} \, h_{t+1}(x_i)$    (8) \n\nwith the weight distribution $w_t$ now defined as $w_t(i) = c'(\rho_L(g_t(x_i), y_i)) / |W_t|$ if $i \in L$, and $w_t(i) = c'(\rho_U(g_t(x_i))) / |W_t|$
if $i \in U$, with $|W_t| = \sum_{i \in S} w_t(i)$.    (9) \n\nThis expression of $J_t^U$ comes directly from differential calculus and the chosen inner product: \n\n$\nabla C(g_t)(x_i) = y_i \, c'(\rho_L(g_t(x_i), y_i))$ if $i \in L$, and $\nabla C(g_t)(x_i) = c'(\rho_U(g_t(x_i))) \, \frac{\partial \rho_U(g_t(x_i))}{\partial g_t(x_i)}$ if $i \in U$.    (10) \n\nImplementation of SSMBoost with margins $\rho_U^g$ and $\rho_U^s$ requires their derivatives. Let us notice that the \"signed margin\" $\rho_U^s$ is not differentiable at point 0. However, according to results of convex analysis (see for instance [2]), it is possible to define the subderivative of $\rho_U^s$, since it is a continuous and convex function. The value of the subderivative corresponds here to the average value of the right and left derivatives: \n\n$\frac{\partial \rho_U^s(g_t(x_i))}{\partial g_t(x_i)} = \mathrm{sign}(g_t(x_i))$ if $g_t(x_i) \neq 0$, and $0$ if $g_t(x_i) = 0$.    (11) \n\nAnd for the \"squared margin\" $\rho_U^g$, we have: \n\n$\frac{\partial \rho_U^g(g_t(x_i))}{\partial g_t(x_i)} = 2 \, g_t(x_i)$    (12) \n\nThis completes the set of ingredients that must be incorporated into the algorithm of Fig. 1 to obtain SSMBoost. \n\n4 Base classifier \n\nThe base classifier should be able to make use of the unlabeled data provided by the boosting algorithm. Mixture models are well suited for this purpose, as shown by their extensive use in clustering. Hierarchical mixtures provide flexible discrimination tools, where each conditional distribution $f(x \mid Y = k)$ is modelled by a mixture of components [4]. At the high level, the distribution is described by \n\n$f(x; \Phi) = \sum_{k=1}^{K} p_k f_k(x; \theta_k)$    (13) \n\nwhere $K$ is the number of classes, $p_k$ are the mixing proportions, $\theta_k$ the conditional distribution parameters, and $\Phi$ denotes all parameters $\{p_k, \theta_k\}_{k=1}^{K}$.
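As a minimal illustration, the labeled margin and the two extended unlabeled margins with their (sub)derivatives of equations (11) and (12) can be sketched as follows; the function names are ours, not part of the paper:

```python
import numpy as np

def margin_labeled(g, y):
    # usual margin rho_L(g(x), y) = y * g(x)
    return y * g

def margin_unlabeled_sq(g):
    # "squared" margin rho_U^g(g(x)) = g(x)^2, eq. (4)
    return g ** 2

def d_margin_unlabeled_sq(g):
    # derivative of rho_U^g with respect to g, eq. (12)
    return 2.0 * g

def d_margin_unlabeled_sign(g):
    # subderivative of the "signed" margin rho_U^s(g) = |g|, eq. (11):
    # sign(g) away from 0, and 0 at 0 (average of left and right derivatives)
    return np.sign(g)
```

Note that `np.sign` already returns 0 at 0, which matches the subderivative convention of equation (11) without any special casing.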
The high-level description can also be expressed as a low-level mixture of components, as shown here for binary classification: \n\n$f(x; \Phi) = \sum_{k=1}^{K_1} p_{k1} f_{k1}(x; \theta_{k1}) + \sum_{k=1}^{K_2} p_{k2} f_{k2}(x; \theta_{k2})$    (14) \n\nWith this setting, the EM algorithm is used to maximize the log-likelihood with respect to $\Phi$, considering that the incomplete data is $\{x_i, y_i\}_{i=1}^{l}$ and the missing data is the component label $c_{ik}$, $k = 1, \ldots, K_1 + K_2$ [11]. An original implementation of EM based on the concept of possible labels [1] is considered here. It is well adapted to hierarchical mixtures, where the class label $y$ provides a subset of possible components. When $y = 1$ the first $K_1$ modes are possible, when $y = -1$ the last $K_2$ modes are possible, and when an example is unlabeled, all modes are possible. \n\nA binary vector $z_i \in \{0, 1\}^{K_1 + K_2}$ indicates the components from which feature vector $x_i$ may have been generated, in agreement with the assumed mixture model and the (absence of) label $y_i$. Assuming that the training sample $\{x_i, z_i\}_{i=1}^{l}$ is i.i.d., the weighted log-likelihood is given by \n\n$L(\Phi; \{x_i, z_i\}_{i=1}^{l}) = \sum_{i=1}^{l} w_t(i) \log f(x_i, z_i; \Phi)$    (15) \n\nE-step: Compute $Q(\Phi \mid \Phi^q) = E[L(\Phi; \{x_i, z_i\}_{i=1}^{l})]$ conditionally to $\{x_i, z_i\}_{i=1}^{l}$ and the current value of $\Phi$ (denoted $\Phi^q$): \n\n$Q(\Phi \mid \Phi^q) = \sum_{i=1}^{l} \sum_{k=1}^{K_1 + K_2} w_t(i) \, u_{ik} \log(p_k f_k(x_i; \theta_k))$, with $u_{ik} = \frac{z_{ik} \, p_k f_k(x_i; \theta_k)}{\sum_{\ell} z_{i\ell} \, p_\ell f_\ell(x_i; \theta_\ell)}$.    (16) \n\nM-step: Maximize $Q(\Phi \mid \Phi^q)$ with respect to $\Phi$. Assuming that each mode $k$ follows a normal distribution with mean $\mu_k$ and covariance $\Sigma_k$, $\Phi^{q+1} = \{\mu_k^{q+1}, \Sigma_k^{q+1}, p_k^{q+1}\}_{k=1}^{K_1+K_2}$ is given by the weighted updates: \n\n$p_k^{q+1} = \frac{\sum_i w_t(i) u_{ik}}{\sum_i w_t(i)}, \qquad \mu_k^{q+1} = \frac{\sum_i w_t(i) u_{ik} \, x_i}{\sum_i w_t(i) u_{ik}}$    (17) \n\n$\Sigma_k^{q+1} = \frac{\sum_i w_t(i) u_{ik} \, (x_i - \mu_k^{q+1})(x_i - \mu_k^{q+1})^T}{\sum_i w_t(i) u_{ik}}$    (18) \n\n5 Experimental results \n\nTests of the algorithm are performed on three benchmarks of the boosting literature: twonorm and ringnorm [6] and banana [13]. Information about these datasets and the results obtained in discrimination are available at www.first.gmd.de/~raetsch/. Ten different samples were used for each experiment.
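The possible-label E-step of equation (16) amounts to masking the usual mixture responsibilities with the binary vector $z_i$ and renormalizing. A minimal sketch, assuming the per-component density values have already been evaluated (the function name and interface are ours):

```python
import numpy as np

def possible_label_responsibilities(fvals, Z, p):
    """E-step of eq. (16): u_ik = z_ik p_k f_k(x_i) / sum_l z_il p_l f_l(x_i).

    fvals : (n, K) array of component density values f_k(x_i; theta_k)
    Z     : (n, K) binary matrix of possible components (row of ones if unlabeled)
    p     : (K,) mixing proportions
    """
    num = Z * p * fvals                       # zero out impossible components
    return num / num.sum(axis=1, keepdims=True)
```

A labeled example, whose row of `Z` selects only the modes of its class, puts all of its responsibility mass on those modes; an unlabeled example, with a row of ones, is shared among all modes according to $p_k f_k$.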
We first study the behavior of SSMBoost according to the evolution of the test error with increasing rates of unlabeled data (table 1). We consider five different settings where 0%, 50%, 75%, 90% and 95% of labels are missing. SSMBoost is tested for the margins $\rho_U^s$ and $\rho_U^g$ with $c(x) = \exp(-x)$. It is compared to mixture models and AdaBoost. SSMBoost and AdaBoost are trained identically, the only difference being that AdaBoost is not provided with the examples with missing labels. \n\nBoth algorithms are run for $T = 100$ boosting steps, without special care for overfitting. The base classifier (called here base(EM)) is a hierarchical mixture model with an arbitrary choice of 4 modes per class, but since the algorithm may be stalled in local minima, it is restarted 100 times from different initial solutions, and the best final solution (regarding training error rate) is selected. We report mean error rates together with the lower and upper quartiles in table 1. For sake of space, we do not display the results obtained without missing labels: in this case, AdaBoost and SSMBoost behave nearly identically, and better than EM only for banana. \n\nFor rates of unlabeled data below 95%, SSMBoost slightly beats AdaBoost on ringnorm and twonorm (except at 75%) but is not able to do as well as AdaBoost on the banana data. \n\nTable 1: Mean error rates (in %) and interquartile ranges obtained with 4 different percentages of unlabeled data for mixture models base(EM), AdaBoost and SSMBoost. \n\nRingnorm          50%               75%               90%               95% \nbase(EM)          2.1 [ 1.7, 2.1]   4.3 [ 1.9, 5.7]   9.5 [ 2.7,12.0]  23.7 [14.5,27.0] \nAdaBoost          1.8 [ 1.6, 2.0]   3.1 [ 1.9, 4.1]  11.5 [ 4.2,12.1]  28.7 [11.5,37.6] \nSSMBoost rho^s    1.7 [ 1.5, 1.8]   2.0 [ 1.5, 2.4]   3.7 [ 2.1, 4.8]   6.9 [ 5.6,10.7] \nSSMBoost rho^g    1.7 [ 1.6, 1.8]   2.0 [ 1.4, 2.5]   4.5 [ 2.2, 3.6]   8.1 [ 4.2, 9.0] \n\nTwonorm           50%               75%               90%               95% \nbase(EM)          3.2 [ 2.7, 3.1]   6.5 [ 3.0, 9.0]  20.6 [10.3,22.5]  24.8 [18.3,31.9] \nAdaBoost          3.2 [ 2.9, 3.2]   3.2 [ 3.0, 3.5]  11.0 [ 5.2,14.2]  38.9 [29.4,50.0] \nSSMBoost rho^s    2.7 [ 2.5, 2.9]   3.4 [ 2.8, 4.3]  10.1 [ 5.8,13.6]  20.4 [11.9,32.3] \nSSMBoost rho^g    2.7 [ 2.5, 2.8]   3.4 [ 2.8, 4.2]  11.0 [ 5.6,16.2]  21.1 [12.5,30.8] \n\nBanana            50%               75%               90%               95% \nbase(EM)         18.2 [16.7,18.6]  21.8 [18.0,25.0]  26.1 [20.7,29.8]  31.7 [23.8,35.8] \nAdaBoost         12.6 [11.7,13.1]  15.2 [13.0,16.8]  22.1 [18.0,24.3]  37.5 [32.2,42.2] \nSSMBoost rho^s   13.3 [12.7,14.3]  17.0 [15.3,17.8]  22.2 [18.0,28.0]  28.3 [20.2,35.2] \nSSMBoost rho^g   13.3 [12.8,14.2]  16.9 [15.6,17.8]  22.8 [18.3,29.3]  28.6 [21.5,34.2] \n\nOne possible explanation is that the discrimination frontiers involved in the banana problem are so complex that the labels bring crucial information, so that adding unlabeled data does not help in such a case. Nevertheless, at the 95% rate, which is the most realistic situation, the margin $\rho_U$ obtains the minimal error rate on each of the three problems. It shows that it is worth boosting and using unlabeled data. As there is no great difference between the two proposed margins, we conducted further experiments using only the margin $\rho_U$. \n\nSecond, in order to study the relation between the presence of noise in the dataset and the ability of SSMBoost to enhance generalization performance, we plot in Fig. 2 the test errors obtained for problems with different values of the Bayes error, when varying the rate of labeled examples. We see that even for difficult tasks (very noisy problems), the degradation in performance for large subsets of unlabeled data is still low. This reflects some consistency in the behavior of our algorithm. \n\nThird, we test the sensitivity of SSMBoost to overfitting.
Overfitting can usually be avoided by techniques such as early stopping, softening of the margin ([13], [14]), or using an adequate margin function such as $1 - \tanh(\rho)$ instead of $\exp(-\rho)$ [10]. Here we keep using $c = \exp$ and ran SSMBoost with a maximal number of steps $T = 1000$ with 95% of unlabeled data. Of course, this does not correspond to a realistic use of boosting in practice, but it allows us to check whether the algorithm behaves consistently as the number of gradient steps grows. It is remarkable that no overfitting is observed, and in the twonorm case (see Fig. 3) the test error still decreases! We also observe that the standard deviation of the error is reduced at the end of the process. For the banana problem (see Fig. 3 b.), we observe a stabilization near step $t = 100$. A massive presence of unlabeled data thus implies a regularizing effect. \n\nFigure 2: Consistency of the SSMBoost behavior: evolution of test error versus the missing labels rate, for various Bayes errors (2.3%, 15.7% and 31.2%) on twonorm. \n\nFigure 3: Evolution of test error with respect to the maximal number $T$ of iterations with 95% of missing labels (twonorm and banana).
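The full procedure evaluated above can be sketched as follows. This is a simplified reading of Fig. 1 with the squared margin $\rho_U^g$ and $c(\rho) = \exp(-\rho)$: it uses a constant step and a plain additive combination instead of the paper's L1 renormalization, and the `base_fit(X, targets, weights)` interface is a hypothetical base-learner API of ours, not the paper's:

```python
import numpy as np

def ssmboost(X, y, labeled, base_fit, T=10, eps=0.1):
    """Sketch of the SSMBoost loop with margin rho_U^g and c(rho) = exp(-rho).

    base_fit(X, targets, weights) must return a callable h with h(X) in [-1, 1].
    y entries at unlabeled positions are ignored (any placeholder value works).
    """
    g = np.zeros(len(X))
    ensemble = []
    for _ in range(T):
        rho = np.where(labeled, y * g, g ** 2)   # rho_L or rho_U^g
        w = np.exp(-rho)                          # |c'(rho)| for c = exp(-.)
        w = w / w.sum()                           # normalization, as in eq. (9)
        tgt = np.where(labeled, y, 2.0 * g)       # y_i, or d rho_U^g/dg (eq. 12)
        h = base_fit(X, tgt, w)
        # stopping rule of Fig. 1, step 2, with the extended pseudo-targets
        if np.sum(w * tgt * h(X)) <= np.sum(w * tgt * g):
            break
        ensemble.append((eps, h))
        g = g + eps * h(X)                        # constant step-length
    return g, ensemble
```

A toy run with a fixed decision stump as base learner drives the ensemble output toward the correct sign on the labeled points, while the unlabeled points are pushed away from the decision boundary through the $2g$ pseudo-targets.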
6 Conclusion \n\nThe MarginBoost algorithm has been extended to deal with both labeled and unlabeled data. Results obtained on three classical benchmarks of the boosting literature show that it is worth using the additional information conveyed by the patterns alone. No overfitting was observed when running SSMBoost on the benchmarks with 95% of the labels missing: this suggests that the unlabeled data play a regularizing role in the ensemble classifier during the boosting process. After applying this method to a large real dataset, such as those of text categorization, our future work on this theme will concern the use of the extended margin cost function in the base classifiers themselves, such as multilayered perceptrons or decision trees. Another approach could also be derived from the more general framework of AnyBoost, which optimizes any differentiable cost function. \n\nReferences \n\n[1] C. Ambroise and G. Govaert. EM algorithm for partially known labels. In IFCS 2000, July 2000. \n[2] J.-P. Aubin. L'analyse non lineaire et ses applications a l'economie. Masson, 1984. \n[3] K. P. Bennett and A. Demiriz. Semi-supervised support vector machines. In D. Cohn, M. Kearns, and S. Solla, editors, Advances in Neural Information Processing Systems, pages 368-374. MIT Press, 1999. \n[4] C. M. Bishop and M. E. Tipping. A hierarchical latent variable model for data visualization. IEEE PAMI, 20:281-293, 1998. \n[5] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the 1998 Conference on Computational Learning Theory, July 1998. \n[6] L. Breiman. Prediction games and arcing algorithms. Technical Report 504, Statistics Department, University of California at Berkeley, 1997. \n[7] Y. Freund and R. E. Schapire. Experiments with a new boosting algorithm. In Machine Learning: Proceedings of the Thirteenth International Conference, pages 148-156.
Morgan Kaufmann, 1996. \n[8] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. The Annals of Statistics, 28(2):337-407, 2000. \n[9] Y. Grandvalet, F. d'Alche-Buc, and C. Ambroise. Boosting mixture models for semi-supervised learning. In ICANN 2001, August 2001. \n[10] L. Mason, J. Baxter, P. L. Bartlett, and M. Frean. Functional gradient techniques for combining hypotheses. In Advances in Large Margin Classifiers. MIT Press, 2000. \n[11] G. J. McLachlan and T. Krishnan. The EM Algorithm and Extensions. Wiley, 1997. \n[12] K. Nigam, A. K. McCallum, S. Thrun, and T. Mitchell. Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2/3):135-167, 2000. \n[13] G. R\u00e4tsch, T. Onoda, and K.-R. M\u00fcller. Soft margins for AdaBoost. Technical report, Department of Computer Science, Royal Holloway, London, 1998. \n[14] G. R\u00e4tsch, T. Onoda, and K.-R. M\u00fcller. Soft margins for AdaBoost. Machine Learning, 42(3):287-320, 2001. \n[15] R. E. Schapire, Y. Freund, P. Bartlett, and W. S. Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5):1651-1686, 1998. \n[16] M. Seeger. Learning with labeled and unlabeled data. www.citeseer.nj.nec.com/seeger01learning.html. \n", "award": [], "sourceid": 2108, "authors": [{"given_name": "Florence", "family_name": "d'Alch\u00e9-Buc", "institution": ""}, {"given_name": "Yves", "family_name": "Grandvalet", "institution": null}, {"given_name": "Christophe", "family_name": "Ambroise", "institution": null}]}