{"title": "Regularized Boost for Semi-Supervised Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 281, "page_last": 288, "abstract": "Semi-supervised inductive learning concerns how to learn a decision rule from a data set containing both labeled and unlabeled data. Several boosting algorithms have been extended to semi-supervised learning with various strategies. To our knowledge, however, none of them takes local smoothness constraints among data into account during ensemble learning. In this paper, we introduce a local smoothness regularizer to semi-supervised boosting algorithms based on the universal optimization framework of margin cost functionals. Our regularizer is applicable to existing semi-supervised boosting algorithms to improve their generalization and speed up their training. Comparative results on synthetic, benchmark and real world tasks demonstrate the effectiveness of our local smoothness regularizer. We discuss relevant issues and relate our regularizer to previous work.", "full_text": "Regularized Boost for Semi-Supervised Learning\n\nKe Chen and Shihai Wang\nSchool of Computer Science\nThe University of Manchester\n\nManchester M13 9PL, United Kingdom\n\n{chen,swang}@cs.manchester.ac.uk\n\nAbstract\n\nSemi-supervised inductive learning concerns how to learn a decision rule from a\ndata set containing both labeled and unlabeled data. Several boosting algorithms\nhave been extended to semi-supervised learning with various strategies. To our\nknowledge, however, none of them takes local smoothness constraints among data\ninto account during ensemble learning. In this paper, we introduce a local smooth-\nness regularizer to semi-supervised boosting algorithms based on the universal\noptimization framework of margin cost functionals. Our regularizer is applicable\nto existing semi-supervised boosting algorithms to improve their generalization\nand speed up their training. 
Comparative results on synthetic, benchmark and real\nworld tasks demonstrate the effectiveness of our local smoothness regularizer. We\ndiscuss relevant issues and relate our regularizer to previous work.\n\n1 Introduction\n\nSemi-supervised inductive learning concerns the problem of automatically learning a decision rule\nfrom a set of both labeled and unlabeled data, which has received a great deal of attention due to\nenormous demands of real world learning tasks ranging from data mining to medical diagnosis [1].\nFrom different perspectives, a number of semi-supervised learning algorithms have been proposed\n[1],[2], e.g., self-training, co-training, generative models along with the EM algorithm, transductive\nlearning models and graph-based methods.\nIn semi-supervised learning, the ultimate goal is to \ufb01nd out a classi\ufb01cation function which not only\nminimizes classi\ufb01cation errors on the labeled training data but also must be compatible with the\ninput distribution by inspecting their values on unlabeled data. To work towards the goal, unlabeled\ndata can be exploited to discover how data is distributed in the input space and then the informa-\ntion acquired from the unlabeled data is used to \ufb01nd a good classi\ufb01er. 
As a generic framework,\nregularization has been used in semi-supervised learning to exploit unlabeled data by working on\nwell known semi-supervised learning assumptions, i.e., the smoothness, the cluster, and the mani-\nfold assumptions [1], which leads to a number of regularizers applicable to various semi-supervised\nlearning paradigms, e.g., the measure-based [3], the manifold-based [4], the information-based [5],\nthe entropy-based [6], harmonic mixtures [7] and graph-based regularization [8].\nAs a generic ensemble learning framework [9] , boosting works by sequentially constructing a lin-\near combination of base learners that concentrate on dif\ufb01cult examples, which results in a great\nsuccess in supervised learning. Recently boosting has been extended to semi-supervised learning\nwith different strategies. Within the universal optimization framework of margin cost functional\n[9], semi-supervised MarginBoost [10] and ASSEMBLE [11] were proposed by introducing the\n\u201cpseudo-classes\u201d to unlabeled data for characterizing dif\ufb01cult unlabeled examples. In essence, such\nextensions work in a self-training way; the unlabeled data are assigned pseudo-class labels based on\nthe constructed ensemble learner so far, and in turn the pseudo-class labels achieved will be used\nto \ufb01nd out a new proper learner to be added to the ensemble. The co-training idea was extended to\n\n\fboosting, e.g. CoBoost [12]. More recently, the Agreement Boost algorithm [13] has been devel-\noped with a theoretic justi\ufb01cation of bene\ufb01ts from the use of multiple boosting learners within the\nco-training framework. To our knowledge, however, none of the aforementioned semi-supervised\nboosting algorithms has taken the local smoothness constraints into account.\nIn this paper, we exploit the local smoothness constraints among data by introducing a regular-\nizer to semi-supervised boosting. 
Based on the universal optimization framework of margin cost\nfunctional for boosting [9], our regularizer is applicable to existing semi-supervised boosting\nalgorithms [10]-[13]. Experimental results on synthetic, benchmark and real world classification\ntasks demonstrate the effectiveness of our regularizer in semi-supervised boosting learning.\nIn the remainder of this paper, Sect. 2 briefly reviews semi-supervised boosting learning and presents\nour regularizer. Sect. 3 reports experimental results and the behaviors of regularized semi-supervised\nboosting algorithms. Sect. 4 discusses relevant issues and the last section draws conclusions.\n\n2 Semi-supervised boosting learning and regularization\n\nIn this section, we first briefly review the basic idea behind existing semi-supervised boosting\nalgorithms within the universal optimization framework of margin cost functional [9] to make the\npaper self-contained. Then we present our Regularized Boost based on the previous work.\n\n2.1 Semi-supervised boosting learning\n\nGiven a training set, $S = L \\cup U$, of $|L|$ labeled examples, $L = \\{(x_1, y_1), \\cdots, (x_{|L|}, y_{|L|})\\}$, and\n$|U|$ unlabeled examples, $U = \\{x_{|L|+1}, \\cdots, x_{|L|+|U|}\\}$, we wish to construct an ensemble learner\n$F(x) = \\sum_t w_t f_t(x)$, where $w_t$ is the coefficient of the linear combination and $f_t(x)$ is a base learner,\nso that $P(F(x) \\neq y)$ is small. Since there exists no label information available for unlabeled data,\nthe critical idea underlying semi-supervised boosting is introducing a pseudo-class [11] or a pseudo\nmargin [10] concept within the universal optimization framework [9] to unlabeled data. Similar to\nan approach in supervised learning, e.g., [14], a multi-class problem can be converted into binary\nclassification forms. Therefore, our presentation below focuses on the binary classification problem\nonly; i.e. $y \\in \\{-1, 1\\}$.\n
The pseudo-class of an unlabeled example, $x$, is typically defined as\n$y = \\mathrm{sign}[F(x)]$ [11] and its corresponding pseudo margin is $yF(x) = |F(x)|$ [10],[11].\nWithin the universal optimization framework of margin cost functional [9], semi-supervised\nboosting learning is to find $F$ such that the cost functional\n\n$$C(F) = \\sum_{x_i \\in L} \\alpha_i C[y_i F(x_i)] + \\sum_{x_i \\in U} \\alpha_i C[|F(x_i)|]$$  (1)\n\nis minimized for some non-negative and monotonically decreasing cost function $C : \\mathbb{R} \\to \\mathbb{R}$ and\nthe weight $\\alpha_i \\in \\mathbb{R}^+$. In the universal optimization framework [9], constructing an ensemble learner\nneeds to choose a base learner, $f(x)$, to maximize the inner product $-\\langle \\nabla C(F), f \\rangle$. For unlabeled\ndata, a subgradient of $C(F)$ in (1) has been introduced to tackle its non-differentiability [11]\nand then unlabeled data with pseudo-class labels can be treated in the same way as labeled data in the\noptimization problem. As a result, finding a proper $f(x)$ amounts to maximizing\n\n$$-\\langle \\nabla C(F), f \\rangle = \\sum_{i: f(x_i) \\neq y_i} \\alpha_i C'[y_i F(x_i)] - \\sum_{i: f(x_i) = y_i} \\alpha_i C'[y_i F(x_i)],$$  (2)\n\nwhere $y_i$ is the true class label if $x_i$ is a labeled example or a pseudo-class label otherwise. After\ndividing through by $-\\sum_{i: x_i \\in S} \\alpha_i C'[y_i F(x_i)]$ on both sides of (2), finding $f(x)$ to maximize\n$-\\langle \\nabla C(F), f \\rangle$ is equivalent to searching for $f(x)$ to minimize\n\n$$\\sum_{i: f(x_i) \\neq y_i} D(i) - \\sum_{i: f(x_i) = y_i} D(i) = 2 \\sum_{i: f(x_i) \\neq y_i} D(i) - 1,$$  (3)\n\nwhere $D(i)$, for $1 \\leq i \\leq |L| + |U|$, is the empirical data distribution defined as\n$D(i) = \\alpha_i C'[y_i F(x_i)] / \\sum_{k: x_k \\in S} \\alpha_k C'[y_k F(x_k)]$. From (3), a proper base learner, $f(x)$, can be found by minimizing the weighted\nerrors $\\sum_{i: f(x_i) \\neq y_i} D(i)$. Thus, any boosting algorithm specified for supervised learning [9] is\nnow applicable to semi-supervised learning with the aforementioned treatment.\n\nFor co-training based semi-supervised boosting algorithms [12],[13], the above semi-supervised\nboosting procedure is applied to each view of data to build up a component ensemble learner. Instead\nof self-training, the pseudo-class label of an unlabeled example for a specific view is determined by\nensemble learners trained on other views of this example. For example, the Agreement Boost [13]\ndefines the co-training cost functional as\n\n$$C(F^1, \\cdots, F^J) = \\sum_{j=1}^{J} \\sum_{x_i \\in L} C[y_i F^j(x_i)] + \\eta \\sum_{x_i \\in U} C[-V(x_i)].$$  (4)\n\nHere $J$ views of data are used to train $J$ ensemble learners, $F^1, \\cdots, F^J$, respectively. The disagreement\nof the $J$ ensemble learners for an unlabeled example, $x_i \\in U$, is\n$V(x_i) = \\frac{1}{J} \\sum_{j=1}^{J} [F^j(x_i)]^2 - \\big( \\frac{1}{J} \\sum_{j=1}^{J} F^j(x_i) \\big)^2$ and the weight $\\eta \\in \\mathbb{R}^+$. In light of view $j$, the pseudo-class label of an unlabeled\nexample, $x_i$, is determined by $y_i = \\mathrm{sign}\\big[ \\frac{1}{J} \\sum_{j'=1}^{J} F^{j'}(x_i) - F^j(x_i) \\big]$. Thus, the minimization\nof (3) with such pseudo-class labels leads to a proper base learner $f^j(x)$ to be added to $F^j(x)$.\n\n2.2 Boosting with regularization\n\nMotivated by the work on the use of regularization in semi-supervised learning [3]-[8], we introduce\na local smoothness regularizer to semi-supervised boosting based on the universal optimization\nframework of margin cost functional [9], which results in a novel objective function:\n\n$$T(F, f) = -\\langle \\nabla C(F), f \\rangle - \\sum_{i: x_i \\in S} \\beta_i R(i),$$  (5)\n\nwhere $\\beta_i \\in \\mathbb{R}^+$ is a weight associated with each training example, determined by the input distribution as discussed in Sect. 4,\nand the local smoothness around an example, $x_i$, is measured by\n\n$$R(i) = \\sum_{j: x_j \\in S, j \\neq i} W_{ij} \\tilde{C}(-I_{ij}).$$  (6)\n\nHere, $I_{ij}$ is a class label compatibility function for two different examples $x_i, x_j \\in S$, defined as\n$I_{ij} = |y_i - y_j|$ where $y_i$ and $y_j$ are the true labels of $x_i$ and $x_j$ for labeled data or their pseudo-class\nlabels otherwise. $\\tilde{C} : \\mathbb{R} \\to \\mathbb{R}$ is a monotonically decreasing function derived from the cost function\nadopted in (1) so that $\\tilde{C}(0) = 0$. $W_{ij}$ is an affinity measure defined by $W_{ij} = \\exp(-\\|x_i - x_j\\|^2 / 2\\sigma^2)$\nwhere $\\sigma$ is a bandwidth parameter. To find a proper base learner, $f(x)$, we now need to maximize\n$T(F, f)$ in (5) so as to minimize not only misclassification errors as before (see Sect. 2.1) but also\nthe local class label incompatibility cost for smoothness.\nIn order to use the objective function in (5) for boosting learning, we need to have the new empirical\ndata distribution and the termination condition. Inserting (2) into (5) results in\n\n$$T(F, f) = \\sum_{i: f(x_i) \\neq y_i} \\alpha_i C'[y_i F(x_i)] - \\sum_{i: f(x_i) = y_i} \\alpha_i C'[y_i F(x_i)] - \\sum_{i: x_i \\in S} \\beta_i R(i).$$  (7)\n\nSince an appropriate cost function used in (1) is non-negative and monotonically decreasing,\n$C'[y_i F(x_i)]$ is always negative and $R(i)$ is non-negative according to its definition in (6). Therefore,\nwe can define our empirical data distribution as\n\n$$\\tilde{D}(i) = \\frac{\\alpha_i C'[y_i F(x_i)] - \\beta_i R(i)}{\\sum_{k: x_k \\in S} \\{\\alpha_k C'[y_k F(x_k)] - \\beta_k R(k)\\}}, \\quad 1 \\leq i \\leq |L| + |U|.$$  (8)\n\n$\\tilde{D}(i)$ is always non-negative based on the definitions of the cost function in (1) and $R(i)$ in (6).\n
Applying (8)\nto (7) with some mathematical development similar to that described in Sect. 2.1, we can show that\nfinding a proper base learner $f(x)$ to maximize $T(F, f)$ is equivalent to finding $f(x)$ to minimize\n\n$$\\sum_{i: f(x_i) \\neq y_i} \\tilde{D}(i) - \\sum_{i: f(x_i) = y_i} \\tilde{D}(i) - 2 \\sum_{i: f(x_i) = y_i} \\frac{\\beta_i R(i)}{\\sum_{k: x_k \\in S} \\{\\alpha_k C'[y_k F(x_k)] - \\beta_k R(k)\\}},$$\n\nwhich is equal to\n\n$$2 \\underbrace{\\sum_{i: f(x_i) \\neq y_i} \\tilde{D}(i)}_{\\text{misclassification errors}} + 2 \\underbrace{\\sum_{i: f(x_i) = y_i} \\frac{-\\beta_i R(i)}{\\sum_{k: x_k \\in S} \\{\\alpha_k C'[y_k F(x_k)] - \\beta_k R(k)\\}}}_{\\text{local class label incompatibility}} - 1.$$  (9)\n\nIn (9), the first term refers to misclassification errors while the second term corresponds to the class\nlabel incompatibility of a data point with its nearby data points even though this data point itself\nfits well. In contrast to (3), finding a proper base learner, $f(x)$, now needs to minimize not only\nthe misclassification errors but also the local class label incompatibility in our Regularized Boost.\nAccordingly, a new termination condition of our Regularized Boost is derived from (9) as $\\epsilon \\geq \\frac{1}{2}$\nwhere\n\n$$\\epsilon = \\sum_{i: f(x_i) \\neq y_i} \\tilde{D}(i) + \\sum_{i: f(x_i) = y_i} \\frac{-\\beta_i R(i)}{\\sum_{k: x_k \\in S} \\{\\alpha_k C'[y_k F(x_k)] - \\beta_k R(k)\\}}.$$\n\nOnce an optimal base learner, $f_{t+1}(x)$, is found at step $t+1$, we need to choose a proper weight,\n$w_{t+1}$, to form a new ensemble, $F_{t+1}(x) = F_t(x) + w_{t+1} f_{t+1}(x)$. In our Regularized Boost, we\nchoose $w_{t+1} = \\frac{1}{2} \\log\\big( \\frac{1 - \\epsilon}{\\epsilon} \\big)$ by simply treating pseudo-class labels for unlabeled data in the same way as\ntrue labels of labeled data, as suggested in [11].\n\n3 Experiments\n\nIn this section, we report experimental results on synthetic, benchmark and real data sets. Although\nour regularizer is applicable to existing semi-supervised boosting [10]-[13], we mainly apply it\nto the ASSEMBLE [11], a winning algorithm from the NIPS 2001 Unlabeled Data Competition,\non a variety of classification tasks.\nIn addition, our regularizer is also used to train component\nensemble learners of the Agreement Boost [13] for binary classification benchmark tasks since the\nalgorithm [13] in its original form can cope with binary classification only. In our experiments, we\nuse $C(\\gamma) = e^{-\\gamma}$ in (1) and $\\tilde{C}(\\gamma) = C(\\gamma) - 1$ in (6) and set $\\alpha_i = 1$ in (1) and $\\beta_i = \\frac{1}{2}$ in (5).\nFor synthetic and benchmark data sets, we always randomly select 20% of examples as testing data\nunless a benchmark data set has a pre-defined training/test split. Accordingly, the remaining\nexamples used as a training set or those in a pre-defined training set, S, are randomly divided into two\nsubsets, i.e., labeled data (L) and unlabeled data (U), and the ratio between labeled and unlabeled\ndata is 1:4 in our experiments. For reliability, each experiment is repeated ten times. To test the\neffectiveness of our Regularized Boost across different base learners, we perform all experiments\nwith K nearest-neighbors (KNN) classifier, a local classifier, and multi-layer perceptron (MLP), a\nglobal classifier, where 3NN and a single hidden layered MLP are used in our experiments. For\ncomparison, we report results of a semi-supervised boosting algorithm (i.e., ASSEMBLE [11] or\nAgreement Boost [13]) and its regularized version (i.e., Regularized Boost). In addition, we also\nprovide results of a variant of Adaboost [14] trained on the labeled data only for reference. The\nabove experimental method conforms to those used in semi-supervised boosting methods [10]-[13]\nas well as other empirical studies of semi-supervised learning methods, e.g., [15].\n\n3.1 Synthetic data set\n\nWe use a Gaussian mixture model of four components to generate a data set of four categories in\nthe 2-D space; 200 examples are in each category, as illustrated in Figure 1(a). We wish to test our\nregularizer on this intuitive multi-class classification task of a high optimal Bayes error.\n\n[Figure 1 plots: (a) scatter plot of the data set in the 2-D input space; (b) bar chart of error rate (%) for AdaBoost, ASSEMBLE and RegularizedBoost with KNN and with MLP; bar labels as extracted: 32.50, 31.88, 28.75, 27.37, 26.87, 26.25.]\n\nFigure 1: Synthetic data classification task. (a) The data set. (b) Classification results\n\nFrom Figure 1(b), it is observed that the use of unlabeled data improves the performance of Adaboost\nand the use of our regularizer further improves the generalization performance of the ASSEMBLE\nby achieving an average error rate closer to the optimal Bayes error no matter what kind of a base\nlearner is used. Our further observation via visualization with the ground truth indicates that the\nuse of our regularizer leads to smoother decision boundaries than the original ASSEMBLE, which\nyields the better generalization performance.\n\n3.2 Benchmark data sets\n\nTo assess the performance of our regularizer for semi-supervised boosting algorithms, we perform\na series of experiments on benchmark data sets from the UCI machine learning repository [16]\nwithout any data transformation.\n
In our experiments, we use the same initialization conditions for\nall boosting algorithms. Our empirical work suggests that a maximum number of 100 boosting steps\nis sufficient to achieve reasonable performance for those benchmark tasks. Hence, we set such a\nmaximum number of boosting steps to stop all boosting algorithms for a sensible comparison.\nWe first apply our regularizer to the ASSEMBLE [11] on five UCI benchmark classification tasks\nof different categories [16]: BUPA liver disorders (BUPA), Wisconsin Diagnostic Breast Cancer\n(WDBC), Balance Scale Weight & Distance (BSWD), Car Evaluation Database (CAR), and Optical\nRecognition of Handwritten Digits (OPTDIGITS), whose data set has been split into fixed\ntraining and testing subsets in advance by the data collector.\n\nTable 1: Error rates (mean\u00b1dev.)% of AdaBoost, ASSEMBLE and Regularized Boost (RegBoost)\nwith different base learners on five UCI classification data sets.\n\n            KNN                               MLP\nData Set    AdaBoost   ASSEMBLE   RegBoost    AdaBoost   ASSEMBLE   RegBoost\nBUPA        37.7\u00b13.4   36.1\u00b13.0   34.9\u00b13.1    35.1\u00b11.1   31.2\u00b16.7   28.8\u00b15.6\nWDBC        8.3\u00b11.9    4.1\u00b11.0    3.7\u00b12.0     9.7\u00b12.0    3.5\u00b10.9    3.2\u00b10.8\nBSWD        22.2\u00b10.9   18.7\u00b10.4   17.4\u00b10.9    16.8\u00b12.8   14.4\u00b12.4   13.6\u00b12.6\nCAR         31.3\u00b11.2   24.4\u00b10.7   23.2\u00b11.1    30.6\u00b13.0   20.5\u00b10.9   17.7\u00b11.1\nOPTDIGITS   4.9\u00b10.1    3.1\u00b10.5    2.7\u00b10.7     6.3\u00b10.2    5.2\u00b10.2    5.0\u00b10.2\n\nTable 1 tabulates the results of different boosting learning algorithms. It is evident from Table 1 that\nin general the use of unlabeled data consistently improves the generalization performance in contrast\nto the performance of AdaBoost and the use of our regularizer in the ASSEMBLE always further\nreduces its error rates on all five data sets no matter what kind of a base learner is used.\n
It is also\nobserved that the use of different base learners results in various performance on five data sets; the\nuse of KNN as a base learner yields better performance on the WDBC and OPTDIGITS data set\nwhereas the use of MLP as a base learner outperforms its KNN counterpart on other three data\nsets. Apparently the nature of a base learner, e.g., global vs. local classifiers, may determine if it is\nsuitable for a classification task. It is worth mentioning that for the OPTDIGITS data set the lowest\nerror rate achieved by 3NN with the entire training set, i.e., using all 3823 examples as training\nprototypes, is around 2.2% on the testing set, as reported in the literature [16].\nIn contrast, the\nASSEMBLE [11] on 3NN equipped with our regularizer yields an error rate of 2.7% on average\ndespite the fact that our Regularized Boost algorithm simply uses 765 labeled examples.\n\nTable 2: Error rates (mean\u00b1dev.)% of AdaBoost, Agreement Boost and Regularized Boost (RegBoost)\non five UCI binary classification data sets.\n\nData Set     AdaBoost-KNN   AdaBoost-MLP   AgreementBoost   RegBoost\nBUPA         37.7\u00b13.4       35.1\u00b11.1       30.4\u00b17.5         28.9\u00b15.8\nWDBC         8.3\u00b11.9        9.7\u00b12.0        3.3\u00b10.7          3.0\u00b10.8\nVOTE         9.0\u00b11.5        10.6\u00b10.5       4.4\u00b10.8          2.8\u00b10.6\nAUSTRALIAN   37.7\u00b11.2       21.0\u00b13.4       16.7\u00b12.1         15.2\u00b12.8\nKR-vs-KP     15.6\u00b10.7       7.1\u00b10.2        6.3\u00b11.3          5.2\u00b11.6\n\nWe further apply our regularizer to the Agreement Boost [13]. Due to the limitation of this algorithm\n[13], we can use only the binary classification data sets to test the effectiveness. As a result, we use\nBUPA and WDBC mentioned above and three additional UCI binary classification data sets [16]:\n1984 U.S. Congressional Voting Records (VOTE), Australian Credit Approval (AUSTRALIAN)\nand Chess End-Game King Rook versus King Pawn (KR-vs-KP).\n
As required by the Agreement\nBoost [13], the KNN and the MLP classifiers as base learners are used to construct two component\nensemble learners without and with the use of our regularizer in experiments, which correspond to\nthe original and regularized versions of the Agreement Boost.\nTable 2 tabulates results produced by different boosting algorithms. It is evident from Table 2 that\nthe use of our regularizer in its component ensemble learners always leads the Agreement Boost to\nimprove its generalization on five benchmark tasks while its original version trained with labeled\nand unlabeled data considerably outperforms the Adaboost trained with labeled data only.\n\n[Figure 2 plots: three panels (a), (b), (c), each showing error rate (%) against the number of base learners (0 to 100); curves are RegularizedBoost vs. ASSEMBLE in (a) and (b), and RegularizedBoost vs. AgreementBoost in (c).]\n\nFigure 2: Behaviors of semi-supervised boosting algorithms: the original version vs. the regularized\nversion. (a) The ASSEMBLE with KNN on the OPTDIGITS. (b) The ASSEMBLE with MLP on\nthe OPTDIGITS. (c) The Agreement Boost on the KR-vs-KP.\n\nWe investigate behaviors of regularized semi-supervised boosting algorithms on the two largest data\nsets, OPTDIGITS and KR-vs-KP. Figure 2 shows the average generalization performance\nachieved by stopping a boosting algorithm at different boosting steps. From Figure 2, the use of\nour regularizer in the ASSEMBLE regardless of base learners adopted and the Agreement Boost\nalways yields fast training. As illustrated in Figures 2(a) and 2(b), the regularized version of the\nASSEMBLE with KNN and MLP takes only 22 and 46 boosting steps on average to reach the\nperformance of the original ASSEMBLE after 100 boosting steps, respectively. Similarly, Figure 2(c)\nshows that the regularized Agreement Boost takes only 12 steps on average to achieve the\nperformance of its original version after 100 boosting steps.\n\n3.3 Facial expression recognition\n\nFacial expression recognition is a typical semi-supervised learning task since labeling facial\nexpressions is an extremely expensive process and very prone to errors due to ambiguities. We test the\neffectiveness of our regularizer by using a facial expression benchmark database, JApanese Female\nFacial Expression (JAFFE) [17] where there are 10 female expressers who posed 3 or 4 examples for\neach of seven universal facial expressions (anger, disgust, fear, joy, neutral, sadness and surprise),\nas exemplified in Figure 3(a), and 213 pictures of 256 \u00d7 256 pixels were collected in total.\n\n[Figure 3 plots: (a) exemplar face images; (b) bar chart of error rate (%): AdaBoost 34.27, ASSEMBLE 32.19, RegularizedBoost 26.37.]\n\nFigure 3: Facial expression recognition on the JAFFE. (a) Exemplar pictures corresponding to seven\nuniversal facial expressions. (b) Classification results of different boosting algorithms.\n\nIn our experiments, we first randomly choose 20% of the images (balanced across the seven classes) as testing data\nand the rest of the images constitute a training set (S) randomly split into labeled (L) and unlabeled\n(U) data of equal size in each trial. We apply the independent component analysis and then the\nprincipal component analysis (PCA) to each image for feature extraction and use only the first 40 PCA\ncoefficients to form a feature vector. A single hidden layered MLP of 30 hidden neurons is used\nas the base learner.\n
We set a maximum number of 1000 boosting rounds to stop the algorithms\nif their termination conditions are not met while the same initialization is used for all boosting\nalgorithms. For reliability, the experiment is repeated 10 times. From Figure 3(b), it is evident that\nthe ASSEMBLE with our regularizer yields a 5.82% reduction in error rate on average; the average error\nrate of 26.37% achieved is even better than that of some supervised learning methods on the same\ndatabase, e.g., [18] where around 70% of the images were used to train a convolutional neural network and\nan average error rate of 31.5% was achieved on the remaining images.\n\n4 Discussions\n\nIn this section, we discuss issues concerning our regularizer and relate it to previous work in the\ncontext of regularization in semi-supervised learning.\nAs defined in (5), our regularizer has a parameter, $\\beta_i$, associated with each training point, which can\nbe used to encode the information of the marginal or input distribution, $P(x)$, by setting $\\beta_i = \\lambda P(x)$\nwhere $\\lambda$ is a tradeoff or regularization parameter. Thus, the use of $\\beta_i$ would make the regularization\ntake effect only in dense regions although our experiments reported were carried out by setting\n$\\beta_i = \\frac{1}{2}$; i.e., we were using a weak assumption that data are scattered uniformly throughout the\nwhole space. In addition, (6) uses an affinity metric system to measure the proximity of data points\nand can be extended by incorporating the manifold information, if available, into our regularizer.\nOur local smoothness regularizer plays an important role in re-sampling all training data including\nlabeled and unlabeled data for boosting learning.\n
As uncovered in (9), the new empirical distribution\nbased on our regularizer not only assigns a large probability to a misclassified data point but also\nmay cause a data point even classified correctly in the last round of boosting learning but located\nin a \u201cnon-smoothing\u201d region to be assigned a relatively large probability, which distinguishes our\napproach from existing boosting algorithms where the distribution for re-sampling training data is\ndetermined solely by misclassification errors. For unlabeled data, such an effect always makes sense\nto work on the smoothness and the cluster assumptions [1] as performed by existing regularization\ntechniques [3]-[8]. For labeled data, it actually has an effect that the labeled data points located in\na \u201cnon-smoothing\u201d region are more likely to be retained in the next round of boosting learning. As\nexemplified in Figure 1, such points are often located around boundaries between different classes\nand therefore more informative in determining a decision boundary, which would be another reason\nwhy our regularizer improves the generalization of semi-supervised boosting algorithms.\nThe use of manifold smoothness in a special form of Adaboost, marginal Adaboost, has been\nattempted in [19] where the graph Laplacian regularizer was applied to select base learners by the\nadaptive penalization of base learners according to their decision boundaries and the actual manifold\nstructural information. In essence, the objective of using manifold smoothness in our Regularized\nBoost is identical to theirs in [19] but we accomplish it in a different way. We encode the manifold\nsmoothness into the empirical data distribution used in boosting algorithms for semi-supervised\nlearning, while their implementation adaptively adjusts the edge offset in the marginal Adaboost\nalgorithm for a weight decay used in the linear combination of base learners [19].\n
In contrast, our\nimplementation is simpler yet applicable to any boosting algorithms for semi-supervised learning,\nwhile theirs needs to be ful\ufb01lled via the marginal Adaboost algorithm even though their regularized\nmarginal Adaboost is applicable to both supervised and semi-supervised learning indeed.\nBy comparison with existing regularization techniques used in semi-supervised learning, our Reg-\nularized Boost is closely related to graph-based semi-supervised learning methods, e.g., [8].\nIn\ngeneral, a graph-based method wants to \ufb01nd a function to simultaneously satisfy two conditions\n[2]: a) it should be close to given labels on the labeled nodes, and b) it should be smooth on the\nwhole graph. In particular, the work in [8] develops a regularization framework to carry out the\nabove idea by de\ufb01ning the global and local consistency terms in their cost function. Similarly, our\ncost function in (9) has two terms explicitly corresponding to global and local consistency though\ntrue labels of labeled data never change during our boosting learning, which resembles theirs [8].\nNevertheless, a graph-based algorithm is an iterative label propagation process on a graph where\na regularizer directly gets involved in label modi\ufb01cation over the graph, whereas our Regularized\nBoost is an iterative process that runs a base learner on various distributions over training data where\n\n\four regularizer simply plays a role in determining distributions. In general, a graph-based algorithm\nis applicable to transductive learning only although it can be combined with other methods, e.g. a\nmixture model [7], for inductive learning. In contrast, our Regularized Boost is developed for induc-\ntive learning. 
Finally it is worth stating that unlike most of existing regularization techniques used in\nsemi-supervised learning, e.g., [5],[6], our regularization takes effect on both labeled and unlabeled\ndata while theirs are based on unlabeled data only.\n\n5 Conclusions\n\nWe have proposed a local smoothness regularizer for semi-supervised boosting learning and\ndemonstrated its effectiveness on different types of data sets. In our ongoing work, we are working towards a\nformal analysis to justify the advantage of our regularizer and explain the behaviors of Regularized\nBoost, e.g. fast training, theoretically.\n\nReferences\n\n[1] Chapelle, O., Sch\u00f6lkopf, B., & Zien, A. (2006) Semi-Supervised Learning. Cambridge, MA: MIT Press.\n[2] Zhu, X. (2006) Semi-supervised learning literature survey. Computer Science TR-1530, University of\nWisconsin - Madison, U.S.A.\n[3] Bousquet, O., Chapelle, O., & Hein, M. (2004) Measure based regularization. In Advances in Neural\nInformation Processing Systems 16. Cambridge, MA: MIT Press.\n[4] Belkin, M., Niyogi, P., & Sindhwani, V. (2004) Manifold regularization: a geometric framework for learning\nfrom examples. Technical Report, University of Michigan, U.S.A.\n[5] Szummer, M., & Jaakkola, T. (2003) Information regularization with partially labeled data. In Advances in\nNeural Information Processing Systems 15. Cambridge, MA: MIT Press.\n[6] Grandvalet, Y., & Bengio, Y. (2005) Semi-supervised learning by entropy minimization. In Advances in\nNeural Information Processing Systems 17. Cambridge, MA: MIT Press.\n[7] Zhu, X., & Lafferty, J. (2005) Harmonic mixtures: combining mixture models and graph-based methods for\ninductive and scalable semi-supervised learning. In Proc. Int. Conf. Machine Learning, pp. 1052-1059.\n[8] Zhou, D., Bousquet, O., Lal, T., Weston, J., & Sch\u00f6lkopf, B. (2004) Learning with local and global\nconsistency. In Advances in Neural Information Processing Systems 16.\n
Cambridge, MA: MIT Press.\n[9] Mason, L., Bartlett, P., Baxter, J., & Frean, M. (2000) Functional gradient techniques for combining\nhypotheses. In Advances in Large Margin Classifiers. Cambridge, MA: MIT Press.\n[10] d\u2019Alch\u00e9-Buc, F., Grandvalet, Y., & Ambroise, C. (2002) Semi-supervised MarginBoost. In Advances in\nNeural Information Processing Systems 14. Cambridge, MA: MIT Press.\n[11] Bennett, K., Demiriz, A., & Maclin, R. (2002) Exploiting unlabeled data in ensemble methods. In Proc.\nACM Int. Conf. Knowledge Discovery and Data Mining, pp. 289-296.\n[12] Collins, M., & Singer, Y. (1999) Unsupervised models for the named entity classification. In Proc. SIGDAT\nConf. Empirical Methods in Natural Language Processing and Very Large Corpora.\n[13] Leskes, B. (2005) The value of agreement, a new boosting algorithm. In Proc. Int. Conf. Algorithmic\nLearning Theory (LNAI 3559), pp. 95-110, Berlin: Springer-Verlag.\n[14] Eibl, G., & Pfeiffer, K.P. (2005) Multiclass boosting for weak classifiers. Journal of Machine Learning\nResearch 6:189-210.\n[15] Nigam, K., McCallum, A., Thrun, S., & Mitchell, T. (2000) Using EM to classify text from labeled and\nunlabeled documents. Machine Learning 39:103-134.\n[16] Blake, C., Keogh, E., & Merz, C.J. (1998) UCI repository of machine learning databases. University of\nCalifornia, Irvine. [on-line] http://www.ics.uci.edu/~mlearn/MLRepository.html\n[17] The JAFFE Database. [Online] http://www.kasrl.org/jaffe.html\n[18] Fasel, B. (2002) Robust face analysis using convolutional neural networks. In Proc. Int. Conf. Pattern\nRecognition, vol. 2, pp. 40-43.\n[19] K\u00e9gl, B., & Wang, L. (2004) Boosting on manifolds: adaptive regularization of base classifier. In Advances\nin Neural Information Processing Systems 16.\n
Cambridge, MA: MIT Press.\n\n\f", "award": [], "sourceid": 164, "authors": [{"given_name": "Ke", "family_name": "Chen", "institution": null}, {"given_name": "Shihai", "family_name": "Wang", "institution": null}]}