{"title": "Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 627, "page_last": 634, "abstract": null, "full_text": "Hyperparameter and Kernel Learning for\nGraph Based Semi-Supervised Classi\ufb01cation\n\nAshish Kapoory, Yuan (Alan) Qiz, Hyungil Ahny and Rosalind W. Picardy\n\nyMIT Media Laboratory, Cambridge, MA 02139\n\nfkapoor, hiahn, picardg@media.mit.edu\n\nzMIT CSAIL, Cambridge, MA 02139\n\nalanqi@csail.mit.edu\n\nAbstract\n\nThere have been many graph-based approaches for semi-supervised clas-\nsi\ufb01cation. One problem is that of hyperparameter learning: performance\ndepends greatly on the hyperparameters of the similarity graph, trans-\nformation of the graph Laplacian and the noise model. We present a\nBayesian framework for learning hyperparameters for graph-based semi-\nsupervised classi\ufb01cation. Given some labeled data, which can contain\ninaccurate labels, we pose the semi-supervised classi\ufb01cation as an in-\nference problem over the unknown labels. Expectation Propagation is\nused for approximate inference and the mean of the posterior is used for\nclassi\ufb01cation. The hyperparameters are learned using EM for evidence\nmaximization. We also show that the posterior mean can be written in\nterms of the kernel matrix, providing a Bayesian classi\ufb01er to classify new\npoints. Tests on synthetic and real datasets show cases where there are\nsigni\ufb01cant improvements in performance over the existing approaches.\n\n1\n\nIntroduction\n\nA lot of recent work on semi-supervised learning is based on regularization on graphs [5].\nThe basic idea is to \ufb01rst create a graph with the labeled and unlabeled data points as the\nvertices and with the edge weights encoding the similarity between the data points. The aim\nis then to obtain a labeling of the vertices that is both smooth over the graph and compatible\nwith the labeled data. 
The performance of most of these algorithms depends upon the edge weights of the graph. Often the smoothness constraints on the labels are imposed using a transformation of the graph Laplacian, and the parameters of the transformation affect the performance. Further, there might be other parameters in the model, such as parameters to address label noise in the data. Finding the right set of parameters is a challenge, and usually the method of choice is cross-validation, which can be prohibitively expensive for real-world problems and problematic when we have few labeled data points.\n\nMost methods ignore the problem of learning the hyperparameters that determine the similarity graph, and only a few approaches address it. Zhu et al. [8] propose learning a non-parametric transformation of the graph Laplacian using semidefinite programming. This approach assumes that the similarity graph is already provided; thus, it does not address the learning of edge weights. Other approaches include label entropy minimization [7] and evidence maximization using the Laplace approximation [9].\n\nThis paper provides a new way to learn the kernel and hyperparameters for graph-based semi-supervised classification, while adhering to a Bayesian framework. The semi-supervised classification is posed as Bayesian inference. We use the evidence to simultaneously tune the hyperparameters that define the structure of the similarity graph, the parameters that determine the transformation of the graph Laplacian, and any other parameters of the model. Closest to our work is Zhu et al. [9], who proposed a Laplace approximation for learning the edge weights. We use Expectation Propagation (EP), a technique for approximate Bayesian inference that provides better approximations than Laplace. 
An additional contribution is a new EM algorithm to learn the hyperparameters for the edge weights and the parameters of the transformation of the graph spectrum. More importantly, we explicitly model the level of label noise in the data, which [9] does not. We provide what may be the first comparison of hyperparameter learning with cross-validation on state-of-the-art algorithms (LLGC [6] and harmonic fields [7]).\n\n2 Bayesian Semi-Supervised Learning\n\nWe assume that we are given a set of data points X = {x_1, ..., x_{n+m}}, of which X_L = {x_1, ..., x_n} are labeled as t_L = {t_1, ..., t_n} and X_U = {x_{n+1}, ..., x_{n+m}} are unlabeled. Throughout this paper we limit ourselves to two-way classification, thus t ∈ {−1, 1}. Our model assumes that the hard labels t_i depend upon hidden soft labels y_i for all i. Given the dataset D = [{X_L, t_L}, X_U], the task of semi-supervised learning is then to infer the posterior p(t_U | D), where t_U = [t_{n+1}, ..., t_{n+m}]. The posterior can be written as:\n\np(t_U | D) = ∫_y p(t_U | y) p(y | D)    (1)\n\nIn this paper, we propose to first approximate the posterior p(y | D) and then use (1) to classify the unlabeled data. Using Bayes rule we can write:\n\np(y | D) = p(y | X, t_L) ∝ p(y | X) p(t_L | y)\n\nThe term p(y | X) is the prior. It enforces a smoothness constraint and depends upon the underlying data manifold. Similar in spirit to graph regularization [5], we use similarity graphs and their transformed Laplacian to induce priors on the soft labels y. The second term, p(t_L | y), is the likelihood that incorporates the information provided by the labels. In this paper, p(y | D) is inferred using Expectation Propagation, a technique for approximate Bayesian inference [3]. 
In the following subsections we first describe the prior and the likelihood in detail, and then we show how evidence maximization can be used to learn the hyperparameters and other parameters in the model.\n\n2.1 Priors and Regularization on Graphs\n\nThe prior plays a significant role in semi-supervised learning, especially when there is only a small amount of labeled data. The prior imposes a smoothness constraint and should be such that it gives higher probability to the labelings that respect the similarity of the graph.\n\nThe prior, p(y | X), is constructed by first forming an undirected graph over the data points. The data points are the nodes of the graph and the edge weights between the nodes are based on similarity. This similarity is usually captured using a kernel; examples of kernels include RBF, polynomial etc. Given the data points and a kernel, we can construct an (n + m) × (n + m) kernel matrix K, where K_ij = k(x_i, x_j) for all i, j ∈ {1, ..., n + m}. Consider the matrix K̃, which is the same as the matrix K except that the diagonal is set to zero. Further, if G is a diagonal matrix such that G_ii = Σ_j K̃_ij, then we can construct the combinatorial Laplacian (Δ = G − K̃) or the normalized Laplacian (Δ̃ = I − G^{−1/2} K̃ G^{−1/2}) of the graph. For brevity, in the text we use Δ as a notation for both Laplacians. Both Laplacians are symmetric and positive semidefinite. Consider the eigendecomposition of Δ, where {v_i} denote the eigenvectors and {λ_i} the corresponding eigenvalues; thus, we can write Δ = Σ_{i=1}^{n+m} λ_i v_i v_i^T. Usually, a transformation r(Δ) = Σ_{i=1}^{n+m} r(λ_i) v_i v_i^T that modifies the spectrum of Δ is used as a regularizer. Specifically, the smoothness imposed by this regularizer prefers soft labelings for which the norm y^T r(Δ) y is small. Equivalently, we can interpret this probabilistically as follows:\n\np(y | X) ∝ e^{−(1/2) y^T r(Δ) y} = N(0, r(Δ)^{−1})    (2)\n\nwhere r(Δ)^{−1} denotes the pseudo-inverse if the inverse does not exist. Equation (2) suggests that labelings with a small value of y^T r(Δ) y are more probable than others. Note that when r(Δ) is not invertible the prior is improper. The fact that the prior can be written as a Gaussian is advantageous, as techniques for approximate inference can be easily applied. Also, different choices of the transformation function lead to different semi-supervised learning algorithms. For example, the approach based on Gaussian fields and harmonic functions (Harmonic) [7] can be thought of as using the transformation r(λ) = λ on the combinatorial Laplacian without any noise model. Similarly, the approach based on local and global consistency (LLGC) [6] can be thought of as using the same transformation but on the normalized Laplacian and a Gaussian likelihood. Therefore, it is easy to see that most of these algorithms can exploit the proposed evidence maximization framework. In the following we focus only on the parametric linear transformation r(λ) = λ + δ. Note that this transformation removes zero eigenvalues from the spectrum of Δ.\n\n2.2 The Likelihood\n\nAssuming conditional independence of the observed labels given the hidden soft labels, the likelihood p(t_L | y) can be written as p(t_L | y) = Π_{i=1}^n p(t_i | y_i). The likelihood models the probabilistic relation between the observed label t_i and the hidden label y_i. Many real-world datasets contain hand-labeled data and can often have labeling errors. 
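The graph construction and prior of section 2.1 can be sketched in a few lines. This is a minimal NumPy sketch, not the authors' code: the function names and the RBF bandwidth parameter `sigma` (playing the role of Θ_K) are illustrative choices.

```python
import numpy as np

def rbf_kernel(X, sigma):
    # K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)); sigma is an assumed kernel hyperparameter
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def laplacians(K):
    # K~ is K with the diagonal zeroed; G_ii = sum_j K~_ij
    K_tilde = K - np.diag(np.diag(K))
    g = K_tilde.sum(axis=1)
    comb = np.diag(g) - K_tilde                                  # combinatorial: G - K~
    d = 1.0 / np.sqrt(g)
    norm = np.eye(len(K)) - d[:, None] * K_tilde * d[None, :]    # I - G^-1/2 K~ G^-1/2
    return comb, norm

def prior_precision(lap, delta):
    # linear transformation r(lambda) = lambda + delta, i.e. r(Delta) = Delta + delta*I,
    # the precision matrix of the Gaussian prior in (2)
    return lap + delta * np.eye(len(lap))
```

With delta > 0 the zero eigenvalue of the Laplacian is removed, so the prior N(0, r(Δ)^{−1}) in (2) is proper.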
While most people tend to model label errors with a linear or quadratic slack in the likelihood, it has been noted that such an approach does not address cases where label errors are far from the decision boundary [2]. The flipping likelihood can handle errors even when they are far from the decision boundary and can be written as:\n\np(t_i | y_i) = ε (1 − Φ(y_i · t_i)) + (1 − ε) Φ(y_i · t_i) = ε + (1 − 2ε) Φ(y_i · t_i)    (3)\n\nHere, Φ is the step function and ε is the labeling error rate; the model admits the possibility of errors in labeling with probability ε. This likelihood has been used earlier in the context of Gaussian process classification [2][4]. The likelihood described above explicitly models the labeling error rate; thus, the model should be more robust to the presence of label noise in the data. The experiments in this paper use the flipping noise likelihood shown in (3).\n\n2.3 Approximate Inference\n\nIn this paper, we use EP to obtain a Gaussian approximation of the posterior p(y | D). Although the prior derived in section 2.1 is a Gaussian distribution, the exact posterior is not a Gaussian due to the form of the likelihood. We use EP to approximate the posterior as a Gaussian, and then equation (1) can be used to classify unlabeled data points. EP has been previously used [3] to train a Bayes Point Machine, where EP starts with a Gaussian prior over the classifiers and produces a Gaussian posterior. Our task is very similar and we use the same algorithm. In our case, EP starts with the prior defined in (2) and incorporates the likelihood to approximate the posterior p(y | D) ∼ N(μ_y, Σ_y).\n\n2.4 Hyperparameter Learning\n\nWe use evidence maximization to learn the hyperparameters. 
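As a small illustration of the noise model entering the evidence, the flipping likelihood (3) above can be evaluated directly. This is a sketch, with the step function Φ implemented as an indicator on the margin y·t; the function name is illustrative.

```python
import numpy as np

def flipping_likelihood(t, y, eps):
    # Eq. (3): p(t|y) = eps + (1 - 2*eps) * step(y * t).
    # eps is the labeling error rate; a label whose sign disagrees with the
    # soft label y still receives probability eps instead of zero.
    step = (np.asarray(y) * np.asarray(t) > 0).astype(float)
    return eps + (1.0 - 2.0 * eps) * step
```

With eps = 0 this reduces to a hard step likelihood; with eps > 0 a mislabeled point far from the boundary contributes a factor eps rather than driving the likelihood to zero, which is what makes the model robust to flipping noise.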
Denote the parameters of the kernel as Θ_K and the parameters of the transformation of the graph Laplacian as Θ_T. Let Θ = {Θ_K, Θ_T, ε}, where ε is the noise hyperparameter. The goal is to solve Θ̂ = arg max_Θ log p(t_L | X, Θ).\n\nNon-linear optimization techniques, such as gradient descent or Expectation Maximization (EM), can be used to optimize the evidence. When the parameter space is small, the Matlab function fminbnd, based on golden section search and parabolic interpolation, can be used. The main challenge is that the gradient of the evidence is not easy to compute. Previously, an EM algorithm for hyperparameter learning [2] has been derived for Gaussian Process classification. Using similar ideas we can derive an EM algorithm for semi-supervised learning. In the E-step, EP is used to infer the posterior q(y) over the soft labels. The M-step consists of maximizing the lower bound:\n\nF = ∫_y q(y) log [p(y | X, Θ) p(t_L | y, Θ) / q(y)]\n  = −∫_y q(y) log q(y) + ∫_y q(y) log N(y; 0, r(Δ)^{−1}) + Σ_{i=1}^n ∫_{y_i} q(y_i) log(ε + (1 − 2ε) Φ(y_i · t_i)) ≤ log p(t_L | X, Θ)\n\nThe EM procedure alternates between the E-step and the M-step until convergence.\n\n• E-Step: Given the current parameters Θ^i, approximate the posterior q(y) ∼ N(μ_y, Σ_y) by EP.\n\n• M-Step: Update Θ^{i+1} = arg max_Θ ∫_y q(y) log [p(y | X, Θ) p(t_L | y, Θ) / q(y)]\n\nIn the M-step the maximization with respect to Θ cannot be computed in closed form, but it can be solved using gradient descent. For maximizing the lower bound, we used a gradient-based projected BFGS method with the Armijo rule and a simple line search. When using the linear transformation r(λ) = λ + δ on the Laplacian Δ, the prior p(y | X, Θ) can be written as N(0, (Δ + δI)^{−1}). 
Define Z = Δ + δI; then the gradients of the lower bound with respect to the parameters are as follows:\n\n∂F/∂Θ_K = (1/2) tr(Z^{−1} ∂Δ/∂Θ_K) − (1/2) μ_y^T (∂Δ/∂Θ_K) μ_y − (1/2) tr((∂Δ/∂Θ_K) Σ_y)\n\n∂F/∂Θ_T = (1/2) tr(Z^{−1}) − (1/2) μ_y^T μ_y − (1/2) tr(Σ_y)\n\n∂F/∂ε ≈ Σ_{i=1}^n [1 − 2Φ(t_i · μ_{y_i})] / [ε + (1 − 2ε) Φ(t_i · μ_{y_i})],  where μ_{y_i} = ∫_y y_i q(y)\n\nIt is easy to show that the provided approximation of the derivative ∂F/∂ε equals zero when ε = k/n, where k is the number of labeled data points differing in sign from their posterior means. The EM procedure described here is susceptible to local minima and in a few cases might be too slow to converge. Especially when the evidence curve is flat and the initial values are far from the optimum, we found that the EM algorithm took very small steps, thus taking a long time to converge.\n\nWhenever we encountered this problem in the experiments, we used an approximate gradient search to find a good value of the initial parameters for the EM algorithm. 
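For the linear transformation (Θ_T = δ), the second gradient above and the closed-form stationary point of ∂F/∂ε take only a few lines. This is a hedged sketch, assuming the EP posterior moments μ_y and Σ_y are already available; the function names are illustrative.

```python
import numpy as np

def grad_wrt_delta(lap, delta, mu, Sigma):
    # dF/d(delta) = 1/2 tr(Z^-1) - 1/2 mu^T mu - 1/2 tr(Sigma), with Z = Delta + delta*I
    Z = lap + delta * np.eye(len(lap))
    return 0.5 * np.trace(np.linalg.inv(Z)) - 0.5 * float(mu @ mu) - 0.5 * np.trace(Sigma)

def update_eps(t_labeled, mu_labeled):
    # stationary point of dF/d(eps): eps = k/n, where k counts labeled points
    # whose observed label disagrees in sign with the posterior mean
    t = np.asarray(t_labeled, dtype=float)
    mu = np.asarray(mu_labeled, dtype=float)
    k = int(np.sum(t * mu < 0))
    return k / len(t)
```

The δ gradient can be fed to any gradient-ascent routine (the paper uses projected BFGS); the ε update is simply the fraction of labeled points the current posterior "disagrees" with.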
Essentially, since the gradients of the evidence are hard to compute, they can be approximated by the gradients of the lower bound and used in any gradient ascent procedure.\n\nFigure 1: Evidence curves showing similar properties across different datasets (half-moon, odd vs even and PC vs MAC). The top row figures (a), (b) and (c) show the evidence curves for different amounts of labeled data per class. The bottom row figures (d), (e) and (f) show the correlation between recognition accuracy on unlabeled points and the evidence.\n\n2.5 Classifying New Points\n\nSince we compute a posterior distribution over the soft labels of the labeled and unlabeled data points, classifying a new point is tricky. Note that from the parameterization lemma for Gaussian Processes [1] it follows that, given a prior distribution p(y | X) ∼ N(0, r(Δ)^{−1}), the mean of the posterior p(y | D) is a linear combination of the columns of r(Δ)^{−1}. That is:\n\nμ_y = r(Δ)^{−1} a,  where a ∈ R^{(n+m)×1}\n\nFurther, if the similarity matrix K is a valid kernel matrix¹ then we can write the mean directly as a linear combination of the columns of K:\n\nμ_y = K K^{−1} r(Δ)^{−1} a = K b    (4)\n\nHere, b = [b_1, ..., b_{n+m}]^T is a column vector equal to K^{−1} r(Δ)^{−1} a. Thus, we have μ_{y_i} = Σ_{j=1}^{n+m} b_j · K(x_i, x_j). This provides a natural extension of the framework to classify new points.\n\n3 Experiments\n\nWe performed experiments to evaluate the three main contributions of this work: Bayesian hyperparameter learning, classification of unseen data points, and robustness with respect to noisy labels. For all the experiments we use the linear transformation r(λ) = λ + δ, either on the normalized Laplacian (EP-NL) or the combinatorial Laplacian (EP-CL). The experiments were performed on one synthetic dataset (Figure 4(a)) and on three real-world datasets. Two of the real-world datasets were the handwritten digits and the newsgroup data from [7]. 
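The expansion in (4) above suggests a simple recipe for new points: solve for b once, then score any new input through its kernel values against the training set. A minimal sketch, assuming the posterior mean μ_y (from EP) and an invertible K are given:

```python
import numpy as np

def kernel_expansion(K, mu_y):
    # eq. (4): mu_y = K b, so b = K^-1 r(Delta)^-1 a = K^-1 mu_y
    return np.linalg.solve(K, mu_y)

def predict_soft_label(k_new, b):
    # soft label of a new point x: sum_j b_j * k(x, x_j)
    return float(np.asarray(k_new) @ b)
```

The sign of the predicted soft label gives the class of the new point.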
¹The matrix K is the adjacency matrix of the graph and, depending upon the similarity criterion, might not always be positive semi-definite. For example, discrete graphs induced using K-nearest neighbors might result in a K that is not positive semi-definite.\n\nFigure 2: Evidence curves showing similar properties across different parameters of the model. The figures (a), (b) and (c) show the evidence curves for different amounts of labeled data per class for the three different parameters in the model.\n\nFigure 3: Error rates for different algorithms on the digits (first column, (a) and (c)) and the newsgroup dataset (second column, (b) and (d)). The figures in the top row, (a) and (b), show error rates on unlabeled points and the bottom row figures, (c) and (d), on the new points. The results are averaged over 5 runs. Non-overlap of the error bars (the standard error scaled by 1.64) indicates 95% significance of the performance difference.\n\nWe evaluated the task of classifying odd vs even digits (15 labeled, 485 unlabeled and the rest new (unseen) points per class) and the task of classifying PC vs MAC (5 labeled, 895 unlabeled and the rest new (unseen) points per class). An RBF kernel was used for the handwritten digits, whereas the kernel K(x_i, x_j) = exp[−(1/γ)(1 − x_i^T x_j / (|x_i| |x_j|))] was used on a 10-NN graph to determine similarity. The third real-world dataset labels the level of interest (61 samples of high interest and 75 samples of low interest) of a child solving a puzzle on the computer. Each data point is a 19 dimensional real vector summarizing 8 seconds of activity from the face, posture and the puzzle. 
The labels in this database are suspected to be noisy because of the human labeling. All the experiments on this data used K-nearest neighbors to determine the kernel matrix.\n\nHyperparameter learning: Figures 1 (a), (b) and (c) plot the log evidence versus the kernel parameters that determine the similarity graphs for the different datasets, with varying sizes of the labeled set per class. The values of δ and ε were fixed to the values shown in the plots. Figures 2 (a), (b) and (c) plot the log evidence versus the noise parameter (ε), the kernel parameter (k in k-NN) and the transformation parameter (δ) for the affect dataset. First, we see that the evidence curves generated with very little data are flat, and as the number of labeled data points increases the curves become peakier. When there is very little labeled data, there is not much information available for the evidence maximization framework to prefer one parameter value over another. With more labeled data, the evidence curves become more informative. Figures 1 (d), (e) and (f) show the correlation between the evidence curves and the recognition rate on the unlabeled data, and reveal that the recognition over the unlabeled data points is highly correlated with the evidence. Note that both of these effects are observed across all the datasets as well as all the different parameters, justifying evidence maximization for hyperparameter learning.\n\nFigure 4: Semi-supervised classification in the presence of label noise. (a) Input data with label noise. Classification (b) without the flipping noise model and (c) with the flipping noise model.\n\nHow good are the learnt parameters? 
We performed experiments on the handwritten digits and on the newsgroup data, and compared with 1-NN, LLGC and the Harmonic approach. The kernel parameters for both LLGC and Harmonic were estimated using leave-one-out cross validation². Note that both approaches can be interpreted in terms of the proposed Bayesian framework (see sec 2.1). We performed experiments with both the normalized (EP-NL) and the combinatorial Laplacian (EP-CL) within the proposed framework to classify the digits and the newsgroup data. The approximate gradient descent was first used to find an initial value of the kernel parameter for the EM algorithm. All three parameters were learnt, and the top row in figure 3 shows the average error obtained over 5 different runs on the unlabeled points. On the task of classifying odd vs even, the error rate for EP-NL was 14.46±4.4%, significantly outperforming Harmonic (23.98±4.9%) and 1-NN (24.23±1.1%). Since the prior in EP-NL is determined using the normalized Laplacian and there is no label noise in the data, we expect EP-NL to work at least as well as LLGC (16.02±1.1%). Similarly, for the newsgroup dataset EP-CL (9.28±0.7%) significantly beats LLGC (18.03±3.5%) and 1-NN (46.88±0.3%) and is better than Harmonic (10.86±2.4%). Similar results are obtained on the new points as well. The unseen points were classified using eq. (4), and the nearest neighbor rule was used for LLGC and Harmonic.\n\nHandling label noise: Figure 4(a) shows a synthetic dataset with noisy labels. We performed semi-supervised classification both with and without the likelihood model given in (3), and the EM algorithm was used to tune all the parameters including the noise (ε). Besides modifying the spectrum of the Laplacian, the transformation parameter δ can also be considered as latent noise and provides a quadratic slack for the noisy labels [2]. 
The results are shown in figure 4 (b) and (c). The EM algorithm can correctly learn the noise parameter, resulting in a perfect classification. The classification without the flipping model, even with the quadratic slack, cannot handle the noisy labels far from the decision boundary.\n\nIs there label noise in the data? It was suspected that, due to the manual labeling, the affect dataset might have some label noise. To confirm this, and as a sanity check, we first plotted the evidence using all the available data. For all the semi-supervised methods in these experiments, we use 3-NN to induce the adjacency graph. Figure 5(a) shows the plot of the evidence against the noise parameter (ε). From the figure, we see that the evidence peaks at ε = 0.05, suggesting that the dataset has around 5% labeling noise. Figure 5(b) shows comparisons with other semi-supervised methods (LLGC and SVM with a graph kernel) and supervised methods (SVM with an RBF kernel) for different sizes of the labeled dataset. Each point in the graph is the average error on 20 random splits of the data, where the error bars represent the standard error. EM was used to tune ε and δ in every run. We used the same transformation r(λ) = λ + δ on the graph kernel in the semi-supervised SVM. The hyperparameters in both the SVMs (including δ for the semi-supervised case) were estimated using leave-one-out. 
²The search space for σ (odd vs even) was 100 to 400 with increments of 10, and for γ (PC vs MAC) it was 0.01 to 0.2 with increments of 0.1.\n\nFigure 5: (a) Evidence vs the noise parameter, plotted using all the available data in the affect dataset. The maximum at ε = 0.05 suggests that there is around 5% label noise in the data. (b) Performance comparison of the proposed approach with LLGC, SVM using a graph kernel, and the supervised SVM (RBF kernel) on the affect dataset, which has label noise. The error bars represent the standard error. (c) Comparison of the proposed EM method for hyperparameter learning with the result reported in [7] using label entropy minimization. The plotted error bars represent the standard deviation.\n\nWhen the number of labeled points is small, both LLGC and EP-NL perform similarly, beating both the SVMs; but as the size of the labeled data increases, we see a significant improvement of the proposed approach over the other methods. One reason is that when there are few labels, the probability of the labeled set containing a noisy label is low. As the size of the labeled set increases, the labeled data has more noisy labels, and since LLGC has a Gaussian noise model, it cannot handle flipping noise well. 
As the number of labels increases, the evidence curve turns informative and EP-NL starts to learn the label noise correctly, outperforming the other methods. Both the SVMs show competitive performance with more labels but are still worse than EP-NL. Finally, we also test the method on the task of classifying \u201c1\u201d vs \u201c2\u201d in the handwritten digits dataset. With 40 labeled examples per class (80 total labels and 1800 unlabeled), EP-NL obtained an average recognition accuracy of 99.72±0.04%, and figure 5(c) graphically shows the gain over the accuracy of 98.56±0.43% reported in [7], where the hyperparameters were learnt by minimizing label entropy with 92 labeled and 2108 unlabeled examples.\n\n4 Conclusion\n\nWe presented and evaluated a Bayesian framework for learning hyperparameters for graph-based semi-supervised classification. The results indicate that evidence maximization works well for learning hyperparameters, including the amount of label noise in the data.\n\nReferences\n\n[1] Csato, L. (2002) Gaussian processes: iterative sparse approximation. PhD Thesis, Aston University.\n[2] Kim, H. & Ghahramani, Z. (2004) The EM-EP algorithm for Gaussian process classification. ECML.\n[3] Minka, T. P. (2001) Expectation propagation for approximate Bayesian inference. UAI.\n[4] Opper, M. & Winther, O. (1999) Mean field methods for classification with Gaussian processes. NIPS.\n[5] Smola, A. & Kondor, R. (2003) Kernels and regularization on graphs. COLT.\n[6] Zhou et al. (2004) Learning with local and global consistency. NIPS.\n[7] Zhu, X., Ghahramani, Z. & Lafferty, J. (2003) Semi-supervised learning using Gaussian fields and harmonic functions. ICML.\n[8] Zhu, X., Kandola, J., Ghahramani, Z. & Lafferty, J. (2004) Nonparametric transforms of graph kernels for semi-supervised learning. NIPS.\n[9] Zhu, X., Lafferty, J. & Ghahramani, Z. (2003) Semi-supervised learning: from Gaussian fields to Gaussian processes. 
CMU Tech Report CMU-CS-03-175.\n", "award": [], "sourceid": 2932, "authors": [{"given_name": "Ashish", "family_name": "Kapoor", "institution": null}, {"given_name": "Hyungil", "family_name": "Ahn", "institution": null}, {"given_name": "Yuan", "family_name": "Qi", "institution": null}, {"given_name": "Rosalind", "family_name": "Picard", "institution": null}]}