{"title": "Adaptive Regularization for Transductive Support Vector Machine", "book": "Advances in Neural Information Processing Systems", "page_first": 2125, "page_last": 2133, "abstract": "We discuss the framework of Transductive Support Vector Machine (TSVM) from the perspective of the regularization strength induced by the unlabeled data. In this framework, SVM and TSVM can be regarded as a learning machine without regularization and one with full regularization from the unlabeled data, respectively. Therefore, to supplement this framework of the regularization strength, it is necessary to introduce data-dependant partial regularization. To this end, we reformulate TSVM into a form with controllable regularization strength, which includes SVM and TSVM as special cases. Furthermore, we introduce a method of adaptive regularization that is data dependant and is based on the smoothness assumption. Experiments on a set of benchmark data sets indicate the promising results of the proposed work compared with state-of-the-art TSVM algorithms.", "full_text": "Adaptive Regularization for\n\nTransductive Support Vector Machine\n\nZenglin Xu \u2020\u2021\n\u2020 Cluster MMCI\n\nSaarland Univ. & MPI INF\n\nSaarbrucken, Germany\nzlxu@mpi-inf.mpg.de\n\nRong Jin\n\nComputer Sci. & Eng.\nMichigan State Univ.\nEast Lansing, MI, U.S.\nrongjin@cse.msu.edu\n\nJianke Zhu\n\nComputer Vision Lab\n\nETH Zurich\n\nZurich, Switzerland\n\nzhuji@vision.ee.ethz.ch\n\nIrwin King\u2021\n\nMichael R. Lyu\u2021\n\n\u2021 Computer Science & Engineering\nThe Chinese Univ. of Hong Kong\n\nShatin, N.T., Hong Kong\n\n{king,lyu}@cse.cuhk.edu.hk\n\nZhirong Yang\n\nInformation & Computer Science\n\nHelsinki Univ. of Technology\n\nEspoo, Finland\n\nzhirong.yang@tkk.fi\n\nAbstract\n\nWe discuss the framework of Transductive Support Vector Machine\n(TSVM) from the perspective of the regularization strength induced by\nthe unlabeled data. 
In this framework, SVM and TSVM can be regarded as a learning machine without regularization and one with full regularization from the unlabeled data, respectively. Therefore, to supplement this framework of the regularization strength, it is necessary to introduce data-dependent partial regularization. To this end, we reformulate TSVM into a form with controllable regularization strength, which includes SVM and TSVM as special cases. Furthermore, we introduce a method of adaptive regularization that is data dependent and is based on the smoothness assumption. Experiments on a set of benchmark data sets indicate the promising results of the proposed work compared with state-of-the-art TSVM algorithms.

1 Introduction

Semi-supervised learning has attracted considerable research attention in recent years. Most of the existing approaches can be roughly divided into two categories: (1) clustering-based methods [12, 4, 8, 17], which assume that most of the data, both labeled and unlabeled, should be far away from the decision boundary of the target classes; and (2) manifold-based methods, which assume that most of the data lie on a low-dimensional manifold in the input space, and which include Label Propagation [21], Graph Cuts [2], Spectral Kernels [9, 22], the Spectral Graph Transducer [11], and Manifold Regularization [1]. Comprehensive studies of semi-supervised learning techniques can be found in the recent surveys [23, 3].

Although semi-supervised learning has achieved success in many real-world applications, two major challenges remain unsolved. One is whether the unlabeled data can help the classification; the other is the relation between the clustering assumption and the manifold assumption.

As for the first challenge, Singh et al. [16] provided a finite sample analysis of the usefulness of unlabeled data based on the cluster assumption.
They show that unlabeled data may be useful for improving the error bounds of supervised learning methods when the margin between different classes satisfies certain conditions. In real-world problems, however, it is hard to identify the conditions under which unlabeled data can help.

On the other hand, it is interesting to explore the relation between the low density assumption and the manifold assumption. Narayanan et al. [14] showed that, for a fixed partition, the cut-size of the graph partition converges to the weighted volume of the boundary separating the two regions of the domain. This is a step forward in exploring the connection between graph-based partitioning and the idea surrounding the low density assumption. Unfortunately, this result cannot be generalized uniformly over all partitions. Lafferty and Wasserman [13] revisited the assumptions of semi-supervised learning from the perspective of minimax theory, and suggested that the manifold assumption is stronger than the smoothness assumption for regression. To date, the underlying relationships between the cluster assumption and the manifold assumption remain unresolved. Specifically, it is unclear in which situations the clustering assumption or the manifold assumption should be adopted.

In this paper, we address these limitations with a unified solution from the perspective of the regularization strength of the unlabeled data. Taking Transductive Support Vector Machine (TSVM) as an example, we suggest a framework that introduces the regularization strength of the unlabeled data when estimating the decision boundary. We can therefore obtain a spectrum of models by varying the regularization strength of the unlabeled data, which corresponds to moving from supervised SVM to Transductive SVM.
To select the optimal model under the proposed framework, we employ the manifold regularization assumption, which requires the prediction function to be smooth over the data space. The optimal function is then a linear combination of supervised models, weakly semi-supervised models, and semi-supervised models. Additionally, this provides an effective approach towards combining the cluster assumption and the manifold assumption in semi-supervised learning.

The rest of this paper is organized as follows. In Section 2, we review the background of Transductive SVM. In Section 3, we first present a framework of models with different regularization strengths, followed by an integrating approach based on manifold regularization. In Section 4, we report the experimental results on a series of benchmark data sets. Section 5 concludes the paper.

2 Related Work on TSVM

Before presenting the formulation of TSVM, we first describe the notation used in this paper. Let X = (x_1, ..., x_n) denote the entire data set, including both the labeled examples and the unlabeled ones. We assume that the first l examples within X are labeled and the remaining n - l examples are unlabeled. We denote the unknown labels by y_u = (y^u_{l+1}, ..., y^u_n).

TSVM [12] maximizes the margin in the presence of unlabeled data and keeps the boundary traversing through low-density regions while respecting the labels in the input space. Under the maximum-margin framework, TSVM aims to find the classification model with the maximum classification margin for both labeled and unlabeled examples, which amounts to solving the following optimization problem:

    min_{w, y_u, xi}  (1/2) ||w||^2 + C Sum_{i=1}^{l} xi_i + C* Sum_{i=l+1}^{n} xi_i
    s.t.  y_i w^T phi(x_i) >= 1 - xi_i,  xi_i >= 0,  1 <= i <= l,
          y^u_i w^T phi(x_i) >= 1 - xi_i,  xi_i >= 0,  l+1 <= i <= n,      (1)

where C and C* are the trade-off parameters between the complexity of the function w and the margin errors. Moreover, the prediction function can be formulated as f(x) = w^T phi(x). Note that we remove the bias term in the above formulation, since it can alternatively be taken into account by introducing a constant element into the input pattern.

As in [19] and [20], we can rewrite (1) as the following optimization problem:

    min_{f, xi}  (1/2) f^T K^{-1} f + C Sum_{i=1}^{l} xi_i + C* Sum_{i=l+1}^{n} xi_i
    s.t.  y_i f_i >= 1 - xi_i,  xi_i >= 0,  1 <= i <= l,
          |f_i| >= 1 - xi_i,  xi_i >= 0,  l+1 <= i <= n.      (2)

The optimization problem of TSVM is non-linear and non-convex [6]. Over the past several years, researchers have devoted a significant amount of effort to solving this problem. A branch-and-bound method [5] was developed to search for the optimal solution, but it can only handle problems with a small number of examples due to its heavy computational cost. To apply TSVM to large-scale problems, Joachims [12] proposed a label-switching-retraining procedure to speed up the optimization. Later, the hinge loss in TSVM was replaced by a smooth loss function, and a gradient descent method was used to find the decision boundary in a region of low density [4]. In addition, there are several iterative methods, such as deterministic annealing [15], the concave-convex procedure (CCCP) [8], and convex relaxation methods [19, 18].
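To make the objective concrete, the reformulation (2) can be evaluated directly. The following is a minimal illustrative sketch, not the paper's implementation; the function name `tsvm_objective` and the toy inputs are our own, and the kernel matrix is assumed positive definite:

```python
import numpy as np

def tsvm_objective(f, K, y_l, C=1.0, C_star=0.1):
    """Value of objective (2): 0.5 f^T K^{-1} f plus hinge losses.

    f   : predictions on all n points, labeled points first.
    K   : n x n kernel matrix (assumed positive definite).
    y_l : +1/-1 labels of the first l points.
    The symmetric hinge max(0, 1 - |f_i|) on the unlabeled points pushes
    predictions away from the decision boundary; this term is what makes
    the problem non-convex.
    """
    l = len(y_l)
    reg = 0.5 * f @ np.linalg.solve(K, f)
    labeled = np.maximum(0.0, 1.0 - y_l * f[:l]).sum()
    unlabeled = np.maximum(0.0, 1.0 - np.abs(f[l:])).sum()
    return reg + C * labeled + C_star * unlabeled
```

Minimizing this objective over f is exactly problem (2) after eliminating the slack variables xi.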
Despite the success of TSVM, the unlabeled data do not necessarily improve classification accuracy.

To better utilize the unlabeled data, unlike existing TSVM approaches, we propose a framework that controls the regularization strength of the unlabeled data. To do this, we learn the optimal regularization strength from the combination of a spectrum of models: supervised, weakly semi-supervised, and semi-supervised.

3 TSVM: A Regularization View

For the sake of illustration, we first study a model that does not penalize the classification errors of unlabeled data. Note that the penalization of margin errors on the unlabeled data can be included if needed. We therefore have the following form of TSVM, which can be derived through duality:

    min_{f, xi}  (1/2) f^T K^{-1} f + C Sum_{i=1}^{l} xi_i
    s.t.  y_i f_i >= 1 - xi_i,  xi_i >= 0,  1 <= i <= l,
          f_i^2 >= 1,  l+1 <= i <= n.      (3)

3.1 Full Regularization of Unlabeled Data

In order to adjust the strength of the regularization raised by the unlabeled examples, we introduce a coefficient rho >= 0 and modify problem (3) as follows:

    min_{f, xi}  (1/2) f^T K^{-1} f + C Sum_{i=1}^{l} xi_i
    s.t.  y_i f_i >= 1 - xi_i,  xi_i >= 0,  1 <= i <= l,
          f_i^2 >= rho,  l+1 <= i <= n.      (4)

Obviously, this is the standard TSVM for rho = 1; in general, the larger rho is, the stronger the regularization of the unlabeled data. It is also important to note that we only take into account the classification errors on the labeled examples in the above formulation; namely, we only introduce xi_i for each labeled example.

Further, we write f = (f_l; f_u), where f_l = (f_1, ..., f_l) and f_u = (f_{l+1}, ..., f_n) represent the predictions for the labeled and the unlabeled examples, respectively.
According to the inverse lemma of the block matrix, we can write K^{-1} as follows:

    K^{-1} = ( M_l^{-1}                  -K_{l,l}^{-1} K_{l,u} M_u^{-1} )
             ( -M_u^{-1} K_{u,l} K_{l,l}^{-1}        M_u^{-1}           ),

where

    M_l = K_{l,l} - K_{l,u} K_{u,u}^{-1} K_{u,l},
    M_u = K_{u,u} - K_{u,l} K_{l,l}^{-1} K_{l,u}.

Thus, the term f^T K^{-1} f is computed as

    f^T K^{-1} f = f_l^T M_l^{-1} f_l + f_u^T M_u^{-1} f_u - 2 f_l^T K_{l,l}^{-1} K_{l,u} M_u^{-1} f_u.

When the unlabeled data are loosely correlated with the labeled data, namely when most of the elements of K_{u,l} are small, we have M_u ≈ K_{u,u}. We refer to this case as "weakly semi-supervised learning". Using the above equations, we rewrite TSVM as follows:

    min_{f_l, f_u, xi}  (1/2) f_l^T M_l^{-1} f_l + C Sum_{i=1}^{l} xi_i + omega(f_l, rho)      (5)
    s.t.  y_i f_i >= 1 - xi_i,  xi_i >= 0,  1 <= i <= l,

where omega(f_l, rho) is a regularization function of f_l, given by the value of the following optimization problem:

    min_{f_u}  (1/2) f_u^T M_u^{-1} f_u - f_l^T K_{l,l}^{-1} K_{l,u} M_u^{-1} f_u      (6)
    s.t.  [f^u_i]^2 >= rho,  l+1 <= i <= n.

To understand the regularization function omega(f_l, rho), we first compute the dual of problem (6) via the Lagrangian:

    L = (1/2) f_u^T M_u^{-1} f_u - f_l^T K_{l,l}^{-1} K_{l,u} M_u^{-1} f_u - (1/2) Sum_{i=1}^{n-l} lambda_i ([f^u_i]^2 - rho)
      = (1/2) f_u^T (M_u^{-1} - D(lambda)) f_u - f_l^T K_{l,l}^{-1} K_{l,u} M_u^{-1} f_u + (rho/2) lambda^T e,

where D(lambda) = diag(lambda_1, ..., lambda_{n-l}) and e denotes a vector with all elements being one.
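The block-inverse identity and the two Schur complements above can be checked numerically. A small self-contained sketch; the matrix sizes and the linear-kernel construction are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 3))
K = X @ X.T + np.eye(6)          # a well-conditioned PD "kernel" matrix
l = 2                            # first l points labeled, rest unlabeled
Kll, Klu = K[:l, :l], K[:l, l:]
Kul, Kuu = K[l:, :l], K[l:, l:]

# Schur complements M_l and M_u from the text.
Ml = Kll - Klu @ np.linalg.solve(Kuu, Kul)
Mu = Kuu - Kul @ np.linalg.solve(Kll, Klu)

Kinv = np.linalg.inv(K)
# Diagonal blocks of K^{-1} are the inverses of the Schur complements,
assert np.allclose(Kinv[:l, :l], np.linalg.inv(Ml))
assert np.allclose(Kinv[l:, l:], np.linalg.inv(Mu))
# and the top-right block is -K_ll^{-1} K_lu M_u^{-1}.
assert np.allclose(Kinv[:l, l:], -np.linalg.solve(Kll, Klu) @ np.linalg.inv(Mu))
```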
As the derivatives vanish at optimality, we have

    f_u = (M_u^{-1} - D(lambda))^{-1} M_u^{-1} K_{u,l} K_{l,l}^{-1} f_l
        = (I - M_u D(lambda))^{-1} K_{u,l} K_{l,l}^{-1} f_l,

where I is the identity matrix.

Replacing f_u in (6) with the above expression, we obtain the following dual problem:

    max_{lambda}  -(1/2) f_l^T K_{l,l}^{-1} K_{l,u} (M_u - M_u D(lambda) M_u)^{-1} K_{u,l} K_{l,l}^{-1} f_l + (rho/2) lambda^T e      (7)
    s.t.  M_u^{-1} ⪰ D(lambda),  lambda_i >= 0,  i = 1, ..., n-l.

The above formulation allows us to understand how the parameter rho controls the strength of the regularization from the unlabeled data. In the following, we show that a series of learning models can be derived by assigning various values to the coefficient rho.

3.2 No Regularization from Unlabeled Data

First, we study the case rho = 0. The following theorem establishes the relationship between the dual problem (7) and the supervised SVM.

Theorem 1. When rho = 0, the optimization problem reduces to the standard supervised SVM.

Proof. It is not difficult to see that the optimal solution to (7) is lambda = 0. As a result, omega(f_l, rho) becomes

    omega(f_l, rho = 0) = -(1/2) f_l^T K_{l,l}^{-1} K_{l,u} M_u^{-1} K_{u,l} K_{l,l}^{-1} f_l.

Substituting this expression for omega(f_l, rho) into (5), the overall optimization problem becomes

    min_{f_l, xi}  (1/2) f_l^T (M_l^{-1} - K_{l,l}^{-1} K_{l,u} M_u^{-1} K_{u,l} K_{l,l}^{-1}) f_l + C Sum_{i=1}^{l} xi_i
    s.t.  y_i f_i >= 1 - xi_i,  xi_i >= 0,  1 <= i <= l.

According to the matrix inverse lemma, we calculate M_l^{-1} as below:

    M_l^{-1} = (K_{l,l} - K_{l,u} K_{u,u}^{-1} K_{u,l})^{-1}
             = K_{l,l}^{-1} + K_{l,l}^{-1} K_{l,u} (K_{u,u} - K_{u,l} K_{l,l}^{-1} K_{l,u})^{-1} K_{u,l} K_{l,l}^{-1}
             = K_{l,l}^{-1} + K_{l,l}^{-1} K_{l,u} M_u^{-1} K_{u,l} K_{l,l}^{-1}.

Finally, the optimization problem simplifies to

    min_{f_l, xi}  (1/2) f_l^T K_{l,l}^{-1} f_l + C Sum_{i=1}^{l} xi_i      (8)
    s.t.  y_i f_i >= 1 - xi_i,  xi_i >= 0,  1 <= i <= l.

Clearly, the above optimization is identical to the standard supervised SVM. Hence, the unlabeled data are not employed to regularize the decision boundary when rho = 0.

3.3 Partial Regularization of Unlabeled Data

Second, we consider the case when rho is small. According to (7), we expect lambda to be small when rho is small. As a result, we can approximate (M_u - M_u D(lambda) M_u)^{-1} as follows:

    (M_u - M_u D(lambda) M_u)^{-1} ≈ M_u^{-1} + D(lambda).

Consequently, we can write omega(f_l, rho) as follows:

    omega(f_l, rho) = -(1/2) f_l^T K_{l,l}^{-1} K_{l,u} M_u^{-1} K_{u,l} K_{l,l}^{-1} f_l + phi(f_l, rho),      (9)

where phi(f_l, rho) is the value of the following optimization problem:

    max_{lambda}  (rho/2) lambda^T e - (1/2) f_l^T K_{l,l}^{-1} K_{l,u} D(lambda) K_{u,l} K_{l,l}^{-1} f_l
    s.t.  M_u^{-1} ⪰ D(lambda),  lambda_i >= 0,  i = 1, ..., n-l.

We can simplify the above problem by relaxing M_u^{-1} ⪰ D(lambda) to lambda_i <= [sigma_1(M_u)]^{-1}, i = 1, ..., n-l, where sigma_1(M_u) represents the maximum eigenvalue of the matrix M_u. The resulting simplified problem becomes

    max_{lambda}  (rho/2) lambda^T e - (1/2) f_l^T K_{l,l}^{-1} K_{l,u} D(lambda) K_{u,l} K_{l,l}^{-1} f_l
    s.t.  0 <= lambda_i <= [sigma_1(M_u)]^{-1},  1 <= i <= n-l.

As the above problem is a linear program in lambda, its solution can be computed in closed form:

    lambda_i = 0                       if [K_{u,l} K_{l,l}^{-1} f_l]_i^2 > rho,
    lambda_i = [sigma_1(M_u)]^{-1}     if [K_{u,l} K_{l,l}^{-1} f_l]_i^2 <= rho.

From the above formulation, we find that rho plays the role of a threshold for selecting unlabeled examples. Since [K_{u,l} K_{l,l}^{-1} f_l]_i can be regarded as an approximate prediction for the i-th unlabeled example, the formulation can be interpreted as follows: only the unlabeled examples with low prediction confidence are selected to regularize the decision boundary, while all the unlabeled examples with high prediction confidence are ignored. From the above discussion, we conclude that rho determines the regularization strength of the unlabeled examples.

Then, we rewrite the overall optimization problem as below:

    min_{f_l, xi} max_{lambda}  (1/2) f_l^T K_{l,l}^{-1} f_l + C Sum_{i=1}^{l} xi_i - (1/2) f_l^T K_{l,l}^{-1} K_{l,u} D(lambda) K_{u,l} K_{l,l}^{-1} f_l      (10)
    s.t.  y_i f_i >= 1 - xi_i,  xi_i >= 0,  1 <= i <= l,
          0 <= lambda_i <= [sigma_1(M_u)]^{-1},  1 <= i <= n-l.

This is a min-max optimization problem, and thus a globally optimal solution can be guaranteed. To obtain the optimal solution, we employ an alternating optimization procedure, which iteratively computes the values of f_l and lambda. To account for the penalty on the margin errors from the unlabeled data, we only need to add an extra constraint lambda_i <= 2C for i = 1, ..., n-l.

By varying the parameter rho from 0 to 1, we can indeed obtain a series of transductive models for SVM.
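The closed-form threshold rule above can be sketched numerically. This is an illustrative sketch under our own naming (`lambda_threshold` is not from the paper), with small hypothetical matrices:

```python
import numpy as np

def lambda_threshold(fl, Kll, Kul, Mu, rho):
    """Closed-form solution of the simplified linear program for lambda.

    [K_ul K_ll^{-1} f_l]_i acts as an approximate prediction on the i-th
    unlabeled point: lambda_i is switched on (at its upper bound
    1/sigma_1(M_u)) only when the squared prediction is at most rho, so
    only low-confidence unlabeled points regularize the boundary.
    """
    pred = Kul @ np.linalg.solve(Kll, fl)
    upper = 1.0 / np.linalg.eigvalsh(Mu)[-1]  # 1 / largest eigenvalue
    return np.where(pred ** 2 <= rho, upper, 0.0)
```

With rho = 0 the rule returns lambda = 0 for every unlabeled point, matching Theorem 1; with rho = 1 every point whose approximate prediction lies inside the margin is activated.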
When rho is small, we refer to the corresponding optimization problem as weakly semi-supervised learning. It is therefore important to find an appropriate rho that adapts to the input data. However, as the data distribution is usually unknown, it is very challenging to directly estimate an optimal regularization strength rho. Instead, we explore an alternative approach that selects an appropriate rho by combining prediction functions. Due to the large cost of computing the inverses of kernel matrices, one can instead solve the dual problems according to the Representer theorem.

3.4 Adaptive Regularization

As stated in the previous sections, rho determines the regularization strength of the unlabeled data. We now adapt the parameter rho according to the information in the unlabeled data. Specifically, we intend to implicitly select the best rho from a given list Upsilon = {rho_1, ..., rho_m}, where rho_1 = 0 and rho_m = 1. This is equivalent to selecting the optimal f from a list of prediction functions F = {f_1, ..., f_m}. Motivated by the ensemble technique for semi-supervised learning [7], we assume that the optimal f comes from a linear combination of the base functions {f_i}. We then have:

    f = Sum_{i=1}^{m} theta_i f_i,    Sum_{i=1}^{m} theta_i = 1,    theta_i >= 0,  i = 1, ..., m,

where theta_i is the weight of the prediction function f_i and theta is a vector in R^m. One can also impose a prior on theta_i; for example, if we have more confidence in the semi-supervised classifier, we can introduce a constraint such as theta_m >= 0.5. It is important to note that the learning functions in ensemble methods [7] are usually weak learners, while in our approach the learning functions are strong learners with different degrees of regularization.

In the following, we study how to set the regularization strength adaptively to the data.
Since TSVM naturally follows the cluster assumption of semi-supervised learning, in order to complement the cluster assumption we adopt the other main principle of semi-supervised learning, namely the manifold assumption. From the point of view of the manifold assumption, the prediction function f should be smooth on the unlabeled data. To this end, manifold regularization is widely adopted as a smoothing term in the semi-supervised learning literature, e.g., [1, 10]. In the following, we employ the manifold regularization principle for selecting the regularization strength.

Manifold regularization is mainly based on a graph G = <V, E> derived from the whole data set X, where V = {x_i}_{i=1}^{n} is the vertex set and E denotes the edges linking pairs of nodes. In general, a graph is built in the following four steps: (1) constructing the adjacency graph; (2) calculating the weights on the edges; (3) computing the adjacency matrix W; (4) obtaining the graph Laplacian L = diag(Sum_{j=1}^{n} W_ij) - W. We then denote the manifold regularization term as f^T L f.

For simplicity, we denote the predicted values of function f_i on the data X as f_i = ([f_i]_1, ..., [f_i]_n), and use F = (f_1, ..., f_m)^T to represent the set of prediction values of all prediction functions. Finally, we have the following minimization problem:

    min_{theta}  (eta/2) theta^T F L F^T theta - y_l^T (F_l^T theta)      (11)
    s.t.  theta^T e = 1,  theta_i >= 0,  i = 1, ..., m,

where F_l denotes the columns of F corresponding to the labeled examples, and the second term, y_l^T (F_l^T theta), strengthens the confidence in the predictions on the labeled data; eta is a trade-off parameter. The above optimization problem is a simple quadratic program, which can be solved very efficiently.
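The four graph-construction steps and problem (11) can be sketched as follows. This is a minimal illustrative sketch under stated assumptions: the names `knn_laplacian`, `combine_predictions`, and `project_simplex` are our own, and the simplex-projected gradient loop is a simple substitute for the off-the-shelf QP solver the paper uses:

```python
import numpy as np

def knn_laplacian(X, k=3):
    """Binary k-NN graph Laplacian L = diag(W 1) - W (symmetrized)."""
    d2 = ((X[:, None] - X[None]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)
    W = np.zeros_like(d2)
    idx = np.argsort(d2, axis=1)[:, :k]
    W[np.repeat(np.arange(len(X)), k), idx.ravel()] = 1.0
    W = np.maximum(W, W.T)                  # symmetrize the adjacency
    return np.diag(W.sum(1)) - W

def project_simplex(v):
    """Euclidean projection onto {theta : theta >= 0, sum(theta) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    k = np.nonzero(u - css / np.arange(1, len(v) + 1) > 0)[0][-1]
    return np.maximum(v - css[k] / (k + 1.0), 0.0)

def combine_predictions(F, L, y_l, l, eta=1e-3, steps=2000, lr=0.01):
    """Projected gradient on the simplex for problem (11).

    F is m x n: row i holds the predictions of the model trained with
    rho_i on all n points; the first l columns are the labeled points.
    """
    m = F.shape[0]
    theta = np.full(m, 1.0 / m)
    for _ in range(steps):
        grad = eta * F @ (L @ (F.T @ theta)) - F[:, :l] @ y_l
        theta = project_simplex(theta - lr * grad)
    return theta
```

The returned theta weights the base models: a model whose predictions agree with the labels and vary smoothly over the graph receives a larger weight.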
It is important to note that the above optimization problem is less sensitive to the graph structure than the Laplacian SVM used in [1], since the base learning functions are all strong learners. It also saves a considerable amount of effort in estimating parameters compared with Laplacian SVM.

The above approach indeed provides a practical way to combine the cluster assumption and the manifold assumption. It has been empirically suggested that combining these two assumptions helps to improve the prediction accuracy of semi-supervised learning, according to the survey on semi-supervised SVMs [6]. Moreover, when rho = 0, supervised models are incorporated in the framework; thus the usefulness of the unlabeled data is naturally accounted for by the regularization. This therefore provides a practical solution to the problems described in Section 1.

4 Experiment

In this section, we give the details of our implementation and discuss the results of the proposed approach on several benchmark data sets. To conduct a comprehensive evaluation, we employ several well-known data sets as the testbed. As summarized in Table 1, three image data sets and five text data sets are selected from the recent book (www.kyb.tuebingen.mpg.de/ssl-book/) and the literature (www.cs.uchicago.edu/~vikass/).

Table 1: Data sets used in our experiments. d represents the data dimensionality, and n denotes the total number of examples.

    Data set      n      d      |  Data set       n      d
    usps         1500    241    |  digit1        1500    241
    coil         1500    241    |  ibm vs rest   1500  11960
    pcmac        1946   7511    |  page          1051   3000
    link         1051   1800    |  pagelink      1051   4800

For simplicity, our proposed adaptive regularization approach is denoted ARTSVM. To evaluate it, we conduct an extensive comparison with several state-of-the-art approaches, including the label-switching-retraining algorithm in SVM-Light [12], CCCP [8], and ∇TSVM [4].
We employ SVM as the baseline method.

In our experiments, we run each algorithm 20 times on each data set. In each run, 10% of the data are randomly selected as labeled training data and the remaining data are used as unlabeled data. The value of C in all algorithms is selected from [1, 10, 100, 1000] using cross-validation. The set of rho values is set to [0, 0.01, 0.05, 0.1, 1], and eta is fixed to 0.001. As stated in Section 3.4, ARTSVM is less sensitive to the graph structure. Thus, we adopt a simple way to construct the graph: for each data point, the number of neighbors is set to 20 and binary weighting is employed. In ARTSVM, the supervised, weakly semi-supervised, and semi-supervised algorithms are based on the implementations in LibSVM (www.csie.ntu.edu.tw/~cjlin/libsvm/), MOSEK (www.mosek.org), and ∇TSVM (www.kyb.tuebingen.mpg.de/bs/people/chapelle/lds/), respectively. For the competing algorithms, we adopt the original authors' own implementations.

Table 2 summarizes the classification accuracy and the standard deviations of the proposed ARTSVM method and the competing methods. We can draw several observations from the results. First of all, we can clearly see that our proposed algorithm performs significantly better than the baseline SVM method across all the data sets. Note that the very large deviations of SVM on some data sets arise mainly because the labeled data and the unlabeled data may have quite different distributions after random sampling. On the other hand, the unlabeled data capture the underlying distribution and help to correct such random error. Comparing ARTSVM with the other TSVM algorithms, we observe that ARTSVM achieves the best performance in most cases. For example, on the digit image data sets, especially digit1, supervised learning usually works well and the advantages of TSVM are very limited.
However, the proposed ARTSVM outperforms both the supervised and the other semi-supervised algorithms. This indicates that an appropriate amount of regularization from the unlabeled data improves the classification performance.

Table 2: The classification performance of transductive SVMs on benchmark data sets.

    Data set      ARTSVM        ∇TSVM         SVM           CCCP          SVM-Light
    usps          81.30±4.04    79.44±3.63    79.23±8.60    80.48±3.20    78.16±4.41
    digit1        82.10±2.11    80.55±1.94    81.70±5.61    80.69±2.97    77.53±4.24
    coil          81.70±2.10    79.84±1.88    78.98±8.07    80.15±2.90    79.03±2.84
    ibm vs rest   78.04±1.44    76.83±2.11    72.90±2.32    77.52±1.51    73.99±5.18
    pcmac         95.50±0.88    95.42±0.95    92.57±0.82    94.86±1.09    91.42±7.24
    page          94.65±1.19    94.78±1.83    75.22±17.38   94.47±1.67    93.98±2.60
    link          94.27±0.97    93.56±1.58    40.79±3.63    92.60±2.10    92.18±2.45
    pagelink      97.31±0.68    96.53±1.84    89.41±3.12    95.97±2.22    94.89±1.81

5 Conclusion

This paper presents a novel framework for semi-supervised learning from the perspective of the regularization strength of the unlabeled data. In particular, for Transductive SVM, we show that SVM and TSVM can be incorporated as special cases of this framework. In more detail, the loss on the unlabeled data can essentially be regarded as an additional regularizer on the decision boundary in TSVM. To control the regularization strength, we introduce a data-dependent regularization method based on the principle of manifold regularization.
Empirical studies on benchmark data sets demonstrate that the proposed framework is more effective than the previous transductive algorithms and purely supervised methods.

For future work, we plan to design a controlling strategy that is adaptive to data from the perspective of the low density assumption and manifold regularization of semi-supervised learning. Finally, it is desirable to integrate the low density assumption and manifold regularization into a unified framework.

Acknowledgement

The work was supported by the National Science Foundation (IIS-0643494), National Institute of Health (1R01GM079688-01), Research Grants Council of Hong Kong (CUHK4158/08E and CUHK4128/08E), and MSRA (FY09-RES-OPP-103). It is also affiliated with the MS-CUHK Joint Lab for Human-centric Computing & Interface Technologies.

References

[1] Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7:2399-2434, 2006.

[2] Avrim Blum and Shuchi Chawla. Learning from labeled and unlabeled data using graph mincuts. In ICML '01: Proceedings of the 18th International Conference on Machine Learning, pages 19-26. Morgan Kaufmann, San Francisco, CA, 2001.

[3] O. Chapelle, B. Schölkopf, and A. Zien, editors. Semi-Supervised Learning. MIT Press, Cambridge, MA, 2006.

[4] O. Chapelle and A. Zien. Semi-supervised classification by low density separation. In Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, pages 57-64, 2005.

[5] Olivier Chapelle, Vikas Sindhwani, and Sathiya Keerthi. Branch and bound for semi-supervised support vector machines. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19.
MIT Press, Cambridge, MA, 2007.

[6] Olivier Chapelle, Vikas Sindhwani, and Sathiya S. Keerthi. Optimization techniques for semi-supervised support vector machines. Journal of Machine Learning Research, 9:203-233, 2008.

[7] Ke Chen and Shihai Wang. Regularized boost for semi-supervised learning. In J.C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 281-288. MIT Press, Cambridge, MA, 2008.

[8] Ronan Collobert, Fabian Sinz, Jason Weston, and Léon Bottou. Large scale transductive SVMs. Journal of Machine Learning Research, 7:1687-1712, 2006.

[9] S. C. H. Hoi, M. R. Lyu, and E. Y. Chang. Learning the unified kernel machines for classification. In Proceedings of the Twelfth International Conference on Knowledge Discovery and Data Mining (KDD-2006), pages 187-196, New York, NY, USA, 2006. ACM Press.

[10] Steven C. H. Hoi, Rong Jin, and Michael R. Lyu. Learning nonparametric kernel matrices from pairwise constraints. In ICML '07: Proceedings of the 24th International Conference on Machine Learning, pages 361-368, New York, NY, USA, 2007. ACM.

[11] T. Joachims. Transductive learning via spectral graph partitioning. In ICML '03: Proceedings of the 20th International Conference on Machine Learning, pages 290-297, 2003.

[12] Thorsten Joachims. Transductive inference for text classification using support vector machines. In ICML '99: Proceedings of the 16th International Conference on Machine Learning, pages 200-209, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc.

[13] John Lafferty and Larry Wasserman. Statistical analysis of semi-supervised regression. In J.C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 801-808.
MIT Press, Cambridge, MA, 2008.

[14] Hariharan Narayanan, Mikhail Belkin, and Partha Niyogi. On the relation between low density separation, spectral clustering and graph cuts. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 1025-1032. MIT Press, Cambridge, MA, 2007.

[15] Vikas Sindhwani, S. Sathiya Keerthi, and Olivier Chapelle. Deterministic annealing for semi-supervised kernel machines. In ICML '06: Proceedings of the 23rd International Conference on Machine Learning, pages 841-848, New York, NY, USA, 2006. ACM Press.

[16] Aarti Singh, Robert Nowak, and Xiaojin Zhu. Unlabeled data: Now it helps, now it doesn't. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 1513-1520. 2009.

[17] Junhui Wang, Xiaotong Shen, and Wei Pan. On efficient large margin semisupervised learning: Method and theory. Journal of Machine Learning Research, 10:719-742, 2009.

[18] Linli Xu and Dale Schuurmans. Unsupervised and semi-supervised multi-class support vector machines. In AAAI, pages 904-910, 2005.

[19] Zenglin Xu, Rong Jin, Jianke Zhu, Irwin King, and Michael R. Lyu. Efficient convex relaxation for transductive support vector machine. In J.C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 1641-1648. MIT Press, Cambridge, MA, 2008.

[20] T. Zhang and R. Ando. Analysis of spectral kernel design based semi-supervised learning. In Y. Weiss, B. Schölkopf, and J. Platt, editors, Advances in Neural Information Processing Systems (NIPS 18), pages 1601-1608. MIT Press, Cambridge, MA, 2006.

[21] Dengyong Zhou, Olivier Bousquet, Thomas Navin Lal, Jason Weston, and Bernhard Schölkopf. Learning with local and global consistency.
In Sebastian Thrun, Lawrence Saul, and Bernhard Schölkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 2004.

[22] X. Zhu, J. Kandola, Z. Ghahramani, and J. Lafferty. Nonparametric transforms of graph kernels for semi-supervised learning. In Advances in Neural Information Processing Systems (NIPS 17), pages 1641-1648, Cambridge, MA, 2005. MIT Press.

[23] Xiaojin Zhu. Semi-supervised learning literature survey. Technical report, Computer Sciences, University of Wisconsin-Madison, 2005.
", "award": [], "sourceid": 697, "authors": [{"given_name": "Zenglin", "family_name": "Xu", "institution": null}, {"given_name": "Rong", "family_name": "Jin", "institution": null}, {"given_name": "Jianke", "family_name": "Zhu", "institution": null}, {"given_name": "Irwin", "family_name": "King", "institution": null}, {"given_name": "Michael", "family_name": "Lyu", "institution": null}, {"given_name": "Zhirong", "family_name": "Yang", "institution": null}]}