{"title": "Semi-Supervised Support Vector Machines", "book": "Advances in Neural Information Processing Systems", "page_first": 368, "page_last": 374, "abstract": null, "full_text": "Semi-Supervised Support Vector Machines \n\nKristin P. Bennett \nDepartment of Mathematical Sciences \nRensselaer Polytechnic Institute \nTroy, NY 12180 \nbennek@rpi.edu \n\nAyhan Demiriz \nDepartment of Decision Sciences and Engineering Systems \nRensselaer Polytechnic Institute \nTroy, NY 12180 \ndemira@rpi.edu \n\nAbstract \n\nWe introduce a semi-supervised support vector machine (S3VM) method. Given a training set of labeled data and a working set of unlabeled data, S3VM constructs a support vector machine using both the training and working sets. We use S3VM to solve the transduction problem using overall risk minimization (ORM) posed by Vapnik. The transduction problem is to estimate the value of a classification function at the given points in the working set. This contrasts with the standard inductive learning problem of estimating the classification function at all possible values and then using the fixed function to deduce the classes of the working set data. We propose a general S3VM model that minimizes both the misclassification error and the function capacity based on all the available data. We show how the S3VM model for 1-norm linear support vector machines can be converted to a mixed-integer program and then solved exactly using integer programming. Results of S3VM and the standard 1-norm support vector machine approach are compared on ten data sets. Our computational results support the statistical learning theory results showing that incorporating working data improves generalization when insufficient training information is available. 
In every case, S3VM either improved or showed no significant difference in generalization compared to the traditional approach. \n\n1 INTRODUCTION \n\nIn this work we propose a method for semi-supervised support vector machines (S3VM). S3VM are constructed using a mixture of labeled data (the training set) and unlabeled data (the working set). The objective is to assign class labels to the working set such that the \"best\" support vector machine (SVM) is constructed. If the working set is empty, the method becomes the standard SVM approach to classification [20, 9, 8]. If the training set is empty, the method becomes a form of unsupervised learning. Semi-supervised learning occurs when both training and working sets are nonempty. Semi-supervised learning for problems with small training sets and large working sets is a form of semi-supervised clustering. There are successful semi-supervised algorithms for k-means and fuzzy c-means clustering [4, 18]. Clustering is a potential application for S3VM as well. When the training set is large relative to the working set, S3VM can be viewed as a method for solving the transduction problem according to the principle of overall risk minimization (ORM) posed by Vapnik at the NIPS 1998 SVM Workshop and in [19, Chapter 10]. S3VM for ORM is the focus of this paper. \n\nIn classification, the transduction problem is to estimate the class of each given point in the unlabeled working set. The usual support vector machine (SVM) approach estimates the entire classification function using the principle of structural risk minimization (SRM). In transduction, one estimates the classification function at points within the working set using information from both the training and working set data. 
Theoretically, if there is adequate training data to estimate the function satisfactorily, then SRM will be sufficient, and we would expect transduction to yield no significant improvement over SRM alone. If, however, there is inadequate training data, then ORM may improve generalization on the working set. Intuitively, we would expect ORM to yield improvements when the training sets are small or when there is a significant deviation between the training and working set subsamples of the total population. Indeed, the theoretical results in [19] support these hypotheses. \n\nIn Section 2, we briefly review the standard SVM model for structural risk minimization. According to the principles of structural risk minimization, SVM minimize both the empirical misclassification rate and the capacity of the classification function [19, 20] using the training data. The capacity of the function is determined by the margin of separation between the two classes based on the training set. ORM also minimizes both the empirical misclassification rate and the function capacity, but the capacity of the function is determined using both the training and working sets. In Section 3, we show how SVM can be extended to the semi-supervised case and how mixed integer programming can be used practically to solve the resulting problem. We compare support vector machines constructed by structural risk minimization and overall risk minimization computationally on ten problems in Section 4. Our computational results support past theoretical results that improved generalization can be obtained by incorporating working set information during training when there is a deviation between the working set and training set sample distributions. In three of ten real-world problems the semi-supervised approach, S3VM, achieved a significant increase in generalization. 
In no case did S3VM ever obtain a significant decrease in generalization. We conclude with a discussion of more general S3VM algorithms. \n\nFigure 1: Optimal plane maximizes margin. (The separating plane w · x = b lies midway between the parallel supporting planes w · x = b + 1, bounding Class 1, and w · x = b − 1, bounding Class −1.) \n\n2 SVM using Structural Risk Minimization \n\nThe basic SRM task is to estimate a classification function f : R^N → {±1} using input-output training data from two classes \n\n(x_1, y_1), ..., (x_ℓ, y_ℓ) ∈ R^N × {±1}.   (1) \n\nThe function f should correctly classify unseen examples (x, y), i.e. f(x) = y if (x, y) is generated from the same underlying probability distribution as the training data. In this work we limit discussion to linear classification functions. We will discuss extensions to the nonlinear case in Section 5. If the points are linearly separable, then there exist an n-vector w and scalar b such that \n\nw · x_i − b ≥ 1 if y_i = 1, and w · x_i − b ≤ −1 if y_i = −1, i = 1, ..., ℓ,   (2) \n\nor equivalently \n\ny_i [w · x_i − b] ≥ 1, i = 1, ..., ℓ.   (3) \n\nThe \"optimal\" separating plane, w · x = b, is the one which is furthest from the closest points in the two classes. Geometrically this is equivalent to maximizing the separation margin or distance between the two parallel planes w · x = b + 1 and w · x = b − 1 (see Figure 1). \n\nThe \"margin of separation\" in Euclidean distance is 2/||w||_2, where ||w||_2 = (Σ_{i=1..n} w_i^2)^{1/2} is the 2-norm. To maximize the margin, we minimize ||w||_2^2 / 2 subject to the constraints (3). According to structural risk minimization, for a fixed empirical misclassification rate, larger margins should lead to better generalization and prevent overfitting in high-dimensional attribute spaces. 
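As a concrete illustration, the separability condition (3) and the margin 2/||w||_2 can be checked directly. The toy points and the plane (w, b) below are illustrative values chosen for this sketch, not data from the paper.

```python
import math

def satisfies_margin(X, y, w, b):
    """True iff y_i * (w . x_i - b) >= 1 for every training point, i.e. condition (3)."""
    return all(
        yi * (sum(wj * xj for wj, xj in zip(w, xi)) - b) >= 1
        for xi, yi in zip(X, y)
    )

def margin(w):
    """Euclidean distance 2/||w||_2 between the planes w.x = b+1 and w.x = b-1."""
    return 2.0 / math.sqrt(sum(wj * wj for wj in w))

# Illustrative linearly separable toy set and a feasible separating plane.
X = [(2.0, 2.0), (3.0, 3.0), (-2.0, -2.0), (-3.0, -1.0)]
y = [1, 1, -1, -1]
w, b = (0.5, 0.5), 0.0

assert satisfies_margin(X, y, w, b)
print(margin(w))  # prints 2.8284..., i.e. 2 * sqrt(2)
```

Among all (w, b) satisfying (3) for this toy set, SRM would select the one maximizing `margin(w)`.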
The classifier is called a support vector machine because the solution depends only on the points (called support vectors) located on the two supporting planes w · x = b − 1 and w · x = b + 1. In general the classes will not be separable, so the generalized optimal plane (GOP) problem (4) [9, 20] is used. A slack term η_i is added for each point such that if the point is misclassified, η_i ≥ 1. The final GOP formulation is: \n\nmin_{w,b,η} C Σ_{i=1..ℓ} η_i + (1/2)||w||_2^2 \ns.t. y_i [w · x_i − b] + η_i ≥ 1, η_i ≥ 0, i = 1, ..., ℓ   (4) \n\nwhere C > 0 is a fixed penalty parameter. The capacity control provided by the margin maximization is imperative to achieve good generalization [21, 19]. \n\nThe Robust Linear Programming (RLP) approach to SVM is identical to GOP except the margin term is changed from the 2-norm ||w||_2 to the 1-norm, ||w||_1 = Σ_{j=1..n} |w_j|. The problem becomes the following robust linear program (RLP) [2, 7, 1]: \n\nmin_{w,b,s,η} C Σ_{i=1..ℓ} η_i + Σ_{j=1..n} s_j \ns.t. y_i [w · x_i − b] + η_i ≥ 1, η_i ≥ 0, i = 1, ..., ℓ \n−s_j ≤ w_j ≤ s_j, j = 1, ..., n.   (5) \n\nThe RLP formulation is a useful variation of SVM with some nice characteristics. The 1-norm weight reduction still provides capacity control. The results in [13] can be used to show that minimizing ||w||_1 corresponds to maximizing the separation margin using the infinity norm. Statistical learning theory could potentially be extended to incorporate alternative norms. One major benefit of RLP over GOP is dimensionality reduction. Both RLP and GOP minimize the magnitude of the weights w, but RLP forces more of the weights to be 0 due to the properties of the 1-norm. Another benefit of RLP over GOP is that it can be solved using linear programming instead of quadratic programming. 
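Formulation (5) is an ordinary linear program, so it can be written in the standard inequality form A x ≤ b over the stacked variable vector x = [w (n entries), b, η (ℓ entries), s (n entries)]. The sketch below assembles those matrices; the helper name `build_rlp` and the toy data are our own illustration, and a real solve would hand `c`, `A`, `b` to any LP solver.

```python
def build_rlp(X, y, C):
    """Assemble min c.x s.t. A x <= b for the RLP problem (5).
    Variable order: x = [w_1..w_n, b, eta_1..eta_l, s_1..s_n]."""
    l, n = len(X), len(X[0])
    ncols = n + 1 + l + n
    c = [0.0] * n + [0.0] + [C] * l + [1.0] * n   # C * sum(eta) + sum(s)
    A, b = [], []
    for i, (xi, yi) in enumerate(zip(X, y)):
        # y_i (w.x_i - b) + eta_i >= 1  ->  -y_i w.x_i + y_i b - eta_i <= -1
        row = [0.0] * ncols
        for j in range(n):
            row[j] = -yi * xi[j]
        row[n] = float(yi)
        row[n + 1 + i] = -1.0
        A.append(row); b.append(-1.0)
        row = [0.0] * ncols                        # eta_i >= 0
        row[n + 1 + i] = -1.0
        A.append(row); b.append(0.0)
    for j in range(n):                             # -s_j <= w_j <= s_j
        row = [0.0] * ncols; row[j] = 1.0;  row[n + 1 + l + j] = -1.0
        A.append(row); b.append(0.0)
        row = [0.0] * ncols; row[j] = -1.0; row[n + 1 + l + j] = -1.0
        A.append(row); b.append(0.0)
    return c, A, b

c, A, b = build_rlp([(1.0, 2.0), (-1.0, -2.0)], [1, -1], C=10.0)
assert len(A) == 2 * 2 + 2 * 2      # 2l + 2n inequality rows
assert len(c) == 2 + 1 + 2 + 2      # n + 1 + l + n variables
```

The 2n rows for −s_j ≤ w_j ≤ s_j are what make the non-smooth 1-norm objective linear, at the cost of n extra variables.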
Both approaches can be extended to handle nonlinear discrimination using kernel functions [8, 12]. Empirical comparisons of the approaches have not found any significant difference in generalization between the formulations [5, 7, 3, 12]. \n\n3 Semi-supervised support vector machines \n\nTo formulate the S3VM, we start with either SVM formulation, (4) or (5), and then add two constraints for each point in the working set. One constraint calculates the misclassification error as if the point were in class 1, and the other constraint calculates the misclassification error as if the point were in class −1. The objective function calculates the minimum of the two possible misclassification errors. The final class of each point corresponds to the one that results in the smaller error. Specifically, we define the semi-supervised support vector machine problem (S3VM) as: \n\nmin_{w,b,η,ξ,z} C [Σ_{i=1..ℓ} η_i + Σ_{j=ℓ+1..ℓ+k} min(ξ_j, z_j)] + ||w|| \nsubject to y_i (w · x_i − b) + η_i ≥ 1, η_i ≥ 0, i = 1, ..., ℓ \nw · x_j − b + ξ_j ≥ 1, ξ_j ≥ 0, j = ℓ+1, ..., ℓ+k \n−(w · x_j − b) + z_j ≥ 1, z_j ≥ 0   (6) \n\nwhere C > 0 is a fixed misclassification penalty. \n\nInteger programming can be used to solve this problem. The basic idea is to add a 0 or 1 decision variable, d_j, for each point x_j in the working set. This variable indicates the class of the point: if d_j = 1 then the point is in class 1, and if d_j = 0 then the point is in class −1. This results in the following mixed integer program: \n\nmin_{w,b,η,ξ,z,d} C [Σ_{i=1..ℓ} η_i + Σ_{j=ℓ+1..ℓ+k} (ξ_j + z_j)] + ||w|| \nsubject to y_i (w · x_i − b) + η_i ≥ 1, η_i ≥ 0, i = 1, ..., ℓ \nw · x_j − b + ξ_j + M(1 − d_j) ≥ 1, ξ_j ≥ 0, j = ℓ+1, ..., ℓ+k \n−(w · x_j − b) + z_j + M d_j ≥ 1, z_j ≥ 0, d_j ∈ {0, 1}   (7) \n\nThe constant M > 0 is chosen sufficiently large such that if d_j = 0 then ξ_j = 0 is feasible for any optimal w and b. Likewise, if d_j = 1 then z_j = 0. A globally optimal solution to this problem can be found using CPLEX or other commercial mixed integer programming codes [10], provided computer resources are sufficient for the problem size. Using the mathematical programming modeling language AMPL [11], we were able to express the problem in thirty lines of code plus a data file and solve it using CPLEX. \n\nFigure 2: Left = solution found by RLP; Right = solution found by S3VM. \n\n4 S3VM and Overall Risk Minimization \n\nAn integer S3VM can be used to solve the Overall Risk Minimization problem. Consider the simple problem given in Figure 20 of [19]. Using RLP alone on the training data results in the separation shown in Figure 1. Figure 2 illustrates what happens when working set data is added. The training set points are shown as transparent triangles and hexagons. The working set points are shown as filled circles. The left picture in Figure 2 shows the solution found by RLP. Note that when the working set points are added, the resulting separation has a very small margin. The right picture shows the S3VM solution constructed using the unlabeled working set. Note that a much larger and clearer separation margin is found. These computational solutions are identical to those presented in [19]. \n\nWe also tested S3VM on ten real-world data sets (eight from [14] and the bright and dim galaxy sets from [15]). There have been many algorithms applied successfully to these problems without incorporating working set information. 
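The role of the paired slacks in formulation (6) can be illustrated directly: for a fixed plane (w, b), an unlabeled working-set point x_j contributes min(ξ_j, z_j), the smaller of its class +1 and class −1 hinge errors, and the decision variable d_j of (7) picks the class achieving that minimum. The function below is a sketch under that reading; the plane and points are illustrative values, not from the paper.

```python
def dot(u, v):
    return sum(a * c for a, c in zip(u, v))

def assign_working_point(x, w, b):
    """Return (d, error) for an unlabeled point x given a fixed plane (w, b).
    d = 1 means class +1 (slack xi active), d = 0 means class -1 (slack z active)."""
    f = dot(w, x) - b
    xi = max(0.0, 1.0 - f)    # error if x is treated as class +1
    z  = max(0.0, 1.0 + f)    # error if x is treated as class -1
    d = 1 if xi <= z else 0
    return d, min(xi, z)

w, b = (1.0, 0.0), 0.0        # illustrative separating plane

d, err = assign_working_point((2.0, 0.5), w, b)   # f = 2: clearly class +1
assert (d, err) == (1, 0.0)

d, err = assign_working_point((-0.5, 0.0), w, b)  # f = -0.5: class -1, inside margin
assert d == 0 and abs(err - 0.5) < 1e-12
```

Minimizing these per-point errors jointly with ||w|| is what pushes the S3VM plane away from the unlabeled points, producing the wide margin seen in the right panel of Figure 2.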
Thus it was not clear a priori that S3VM would improve generalization on these data sets. For the data sets where no improvement is possible, we would like transduction using ORM to not degrade the performance of the induction via SRM approach. For each data set, we performed 10-fold cross-validation. For the three starred data sets, our integer programming solver failed due to excessive branching required within the CPLEX algorithm. On those data sets we randomly extracted 50 point working sets for each trial. The same C parameter was used for each data set in both the RLP and S3VM problems.¹ In all ten problems, S3VM never performed significantly worse than RLP. In three of the problems, S3VM performed significantly better. So ORM did not hurt generalization, and in some cases it helped significantly. We would expect this based on ORM theory. The generalization bounds for ORM depend on the difference between the training and working sets. If there is little difference, we would not expect any improvement using ORM. \n\n¹ The formula for C was C = (1 − λ)/(λ(ℓ + k)) with λ = .001, where ℓ is the size of the training set and k is the size of the working set. This formula was chosen because it worked well empirically for both methods. \n\nData Set             Dim  Points  CV-size  RLP    S3VM   p-value \nBright               14   2462    50*      0.02   0.018  0.343 \nCancer               9    699     70       0.036  0.034  0.591 \nCancer (Prognostic)  30   569     57       0.035  0.033  0.678 \nDim                  14   4192    50*      0.064  0.054  0.096 \nHeart                13   297     30       0.173  0.160  0.104 \nHousing              13   506     51       0.155  0.151  0.590 \nIonosphere           34   351     35       0.109  0.106  0.59 \nMusk                 166  476     48       0.173  0.173  0.999 \nPima                 8    769     50*      0.220  0.222  0.678 \nSonar                60   208     21       0.281  0.219  0.045 \n\n5 Conclusion \n\nWe introduced a semi-supervised SVM model. 
S3VM constructs a support vector machine using all the available data from both the training and working sets. We show how the S3VM model for 1-norm linear support vector machines can be converted to a mixed-integer program. One great advantage of solving S3VM using integer programming is that the globally optimal solution can be found using packages such as CPLEX. Using the integer S3VM we performed an empirical investigation of transduction using overall risk minimization, a problem posed by Vapnik. Our results support the statistical learning theory results that incorporating working data improves generalization when insufficient training information is available. In every case, S3VM either improved or showed no significant difference in generalization compared to the usual structural risk minimization approach. Our empirical results, combined with the theoretical results in [19], indicate that transduction via ORM constitutes a very promising research direction. \n\nMany research questions remain. Since transduction via overall risk minimization is not always better than basic induction via structural risk minimization, can we identify a priori the problems likely to benefit from transduction? The best methods of constructing S3VM for the 2-norm case and for nonlinear functions are still open questions. Kernel-based methods can be incorporated into S3VM. The practical scalability of the approach needs to be explored. We were able to solve moderately sized problems with on the order of 50 working set points using a general purpose integer programming code. The recent success of special purpose algorithms for support vector machines [16, 17, 6] indicates that such approaches may produce improvements for S3VM as well. \n\nReferences \n\n[1] K. P. Bennett and E. J. Bredensteiner. Geometry in learning. In C. Gorini, E. Hart, W. Meyer, and T. 
Phillips, editors, Geometry at Work, Washington, D.C., 1997. Mathematical Association of America. To appear. \n\n[2] K. P. Bennett and O. L. Mangasarian. Robust linear programming discrimination of two linearly inseparable sets. Optimization Methods and Software, 1:23-34, 1992. \n\n[3] K. P. Bennett, D. H. Wu, and L. Auslender. On support vector decision trees for database marketing. R.P.I. Math Report No. 98-100, Rensselaer Polytechnic Institute, Troy, NY, 1998. \n\n[4] A. M. Bensaid, L. O. Hall, J. C. Bezdek, and L. P. Clarke. Partially supervised clustering for image segmentation. Pattern Recognition, 29(5):859-871, 1996. \n\n[5] P. S. Bradley and O. L. Mangasarian. Feature selection via concave minimization and support vector machines. Mathematical Programming Technical Report 98-03, University of Wisconsin-Madison, 1998. To appear in ICML-98. \n\n[6] P. S. Bradley and O. L. Mangasarian. Massive data discrimination via linear support vector machines. Mathematical Programming Technical Report 98-05, University of Wisconsin-Madison, 1998. Submitted for publication. \n\n[7] E. J. Bredensteiner and K. P. Bennett. Feature minimization within decision trees. Computational Optimization and Applications, 10:110-126, 1997. \n\n[8] C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 1998. To appear. \n\n[9] C. Cortes and V. N. Vapnik. Support vector networks. Machine Learning, 20:273-297, 1995. \n\n[10] CPLEX Optimization Incorporated, Incline Village, Nevada. Using the CPLEX Callable Library, 1994. \n\n[11] R. Fourer, D. Gay, and B. Kernighan. AMPL: A Modeling Language for Mathematical Programming. Boyd and Frazer, Danvers, Massachusetts, 1993. \n\n[12] T. T. Fries and R. Harrison. 
Linear programming support vector machines for pattern classification and regression estimation: and the SR algorithm. Research Report 706, University of Sheffield, 1998. \n\n[13] O. L. Mangasarian. Parsimonious least norm approximation. Mathematical Programming Technical Report 97-03, University of Wisconsin-Madison, 1997. To appear in Computational Optimization and Applications. \n\n[14] P. M. Murphy and D. W. Aha. UCI repository of machine learning databases. Department of Information and Computer Science, University of California, Irvine, California, 1992. \n\n[15] S. Odewahn, E. Stockwell, R. Pennington, R. Humphreys, and W. Zumach. Automated star/galaxy discrimination with neural networks. Astronomical Journal, 103(1):318-331, 1992. \n\n[16] E. Osuna, R. Freund, and F. Girosi. Support vector machines: Training and applications. AI Memo 1602, Massachusetts Institute of Technology, 1997. \n\n[17] J. Platt. Sequential minimal optimization: A fast algorithm for training support vector machines. Technical Report 98-14, Microsoft Research, 1998. \n\n[18] M. Vaidyanathan, R. P. Velthuizen, P. Venugopal, L. P. Clarke, and L. O. Hall. Tumor volume measurements using supervised and semi-supervised MRI segmentation. In Artificial Neural Networks in Engineering Conference, ANNIE'94, 1994. \n\n[19] V. N. Vapnik. Estimation of Dependencies Based on Empirical Data. Springer, New York, 1982. English translation, Russian version 1979. \n\n[20] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag, New York, 1995. \n\n[21] V. N. Vapnik and A. Ja. Chervonenkis. Theory of Pattern Recognition. Nauka, Moscow, 1974. In Russian. \n", "award": [], "sourceid": 1582, "authors": [{"given_name": "Kristin", "family_name": "Bennett", "institution": null}, {"given_name": "Ayhan", "family_name": "Demiriz", "institution": null}]}