{"title": "Hyperparameter Learning for Graph Based Semi-supervised Learning Algorithms", "book": "Advances in Neural Information Processing Systems", "page_first": 1585, "page_last": 1592, "abstract": null, "full_text": "Hyperparameter Learning for Graph Based Semi-supervised Learning Algorithms\n\nXinhua Zhang Statistical Machine Learning Program National ICT Australia, Canberra, Australia and CSL, RSISE, ANU, Canberra, Australia xinhua.zhang@nicta.com.au\n\nWee Sun Lee Department of Computer Science National University of Singapore 3 Science Drive 2, Singapore 117543 leews@comp.nus.edu.sg\n\nAbstract\nSemi-supervised learning algorithms have been successfully applied in many applications with scarce labeled data, by utilizing the unlabeled data. One important category is graph based semi-supervised learning algorithms, for which the performance depends considerably on the quality of the graph, or its hyperparameters. In this paper, we deal with the less explored problem of learning the graphs. We propose a graph learning method for the harmonic energy minimization method; this is done by minimizing the leave-one-out prediction error on labeled data points. We use a gradient based method and design an efficient algorithm which significantly accelerates the calculation of the gradient by applying the matrix inversion lemma and using careful pre-computation. Experimental results show that the graph learning method is effective in improving the performance of the classification algorithm.\n\n1 Introduction\n\nRecently, graph based semi-supervised learning algorithms have been used successfully in various machine learning problems including classification, regression, ranking, and dimensionality reduction. These methods create graphs whose vertices correspond to the labeled and unlabeled data while the edge weights encode the similarity between each pair of data points. 
Classification is performed using these graphs by labeling unlabeled data in such a way that instances connected by large weights are given similar labels. Example graph based semi-supervised algorithms include min-cut [3], harmonic energy minimization [11], and spectral graph transducer [8]. The performance of the classifier depends considerably on the similarity measure of the graph, which is normally defined in two steps. Firstly, the weights are defined locally in a pair-wise parametric form using functions that are essentially based on a distance metric such as radial basis functions (RBF). It is argued in [7] that modeling error can degrade performance of semi-supervised learning. As the distance metric is an important part of graph based semi-supervised learning, it is crucial to use a good distance metric. In the second step, smoothing is applied globally, typically based on the spectral transformation of the graph Laplacian [6, 10]. There have been only a few existing approaches which address the problem of graph learning. [13] learns a nonparametric spectral transformation of the graph Laplacian, assuming that the weight and distance metric are given. [9] learns the spectral parameters by performing evidence maximization using approximate inference and gradient descent. [12] uses evidence maximization and Laplace approximation to learn simple parameters of the similarity function. Instead of learning one single good graph, [4] proposed building robust graphs by applying random perturbation and edge removal from an ensemble of minimum spanning trees. [1] combined graph Laplacians to learn a graph. Closest to our work is [11], which learns different bandwidths for different dimensions by minimizing the entropy on unlabeled data; like the maximum margin motivation in transductive SVM, the aim here is to get confident labeling of the data by the algorithm.\n\n(Footnote: This work was done when the author was at the National University of Singapore.)\n\n
In this paper, we propose a new algorithm to learn the hyperparameters of the distance metric, or more specifically, the bandwidths for different dimensions in the RBF form. In essence, these bandwidths are just model parameters, and normal model selection methods, including k-fold cross validation or, in the extreme case, leave-one-out (LOO) cross validation, can be used for selecting them. Motivated by the same spirit, we base our learning algorithm on the aim of achieving low LOO prediction loss on labeled data, i.e., each labeled data point can be correctly classified by the other labeled data in a semi-supervised style with as high probability as possible. This idea is similar to [5], which learns multiple parameters for SVM. Since most LOO style algorithms are plagued with prohibitive computational cost, an efficient algorithm is designed. With a simple regularizer, the experimental results show that learning the hyperparameters by minimizing the LOO loss is effective.\n\n2 Graph Based Semi-supervised Learning\n\nSuppose we have a set of labeled data points {(x_i, y_i)} for i ∈ L := {1, ..., l}. In this paper, we only consider binary classification, i.e., y_i ∈ {1 (positive), 0 (negative)}. In addition, we also have a set of unlabeled data points {x_i} for i ∈ U := {l + 1, ..., l + u}. Denote n := l + u. Suppose the dimensionality of input feature vectors is m.\n\n2.1 Graph Based Classification Algorithms\n\nOne of the earliest graph based semi-supervised learning algorithms is min-cut by [3], which minimizes:\n\nE(f) := ∑_{i,j} w_ij (f_i - f_j)^2 (1)\n\nwhere the nonnegative w_ij encodes the similarity between instances i and j. The label f_i is fixed to y_i ∈ {1, 0} if i ∈ L. The optimization variables f_i (i ∈ U) are constrained to {1, 0}. This combinatorial optimization problem can be efficiently solved by the max-flow algorithm. [11] relaxed the constraint f_i ∈ {1, 0} (i ∈ U) to real numbers. 
The optimal solution of the unlabeled data's soft labels can be written neatly as:\n\nf_U = (D_U - W_UU)^{-1} W_UL f_L = (I - P_UU)^{-1} P_UL f_L (2)\n\nwhere f_L is the vector of soft labels (fixed to y_i) for L, D := diag(d_i) with d_i := ∑_j w_ij, D_U is the submatrix of D associated with unlabeled data, and P := D^{-1} W. W_UU, W_UL, P_UU, and P_UL are defined by the block partitions\n\nW = [W_LL W_LU; W_UL W_UU], P = [P_LL P_LU; P_UL P_UU].\n\nThe solution (2) has a number of interesting properties pointed out by [11]. All f_i (i ∈ U) are automatically bounded by [0, 1], so it is also known as square interpolation. They can be interpreted by using a Markov random walk on the graph. Imagine a graph with n nodes corresponding to the n data points. Define the probability of transferring from x_i to x_j as p_ij, which is the row-wise normalization of w_ij. The random walk starts from any unlabeled point and stops once it hits any labeled point (absorbing boundary). Then f_i is the probability of hitting a positive labeled point. In this sense, the labeling of each unlabeled point is largely based on its neighboring labeled points, which helps to alleviate the problem of noisy data. (1) can also be interpreted as a quadratic energy function and its minimizer is known to be harmonic: f_i (i ∈ U) equals the average of f_j (j ≠ i) weighted by p_ij. So we call this algorithm Harmonic Energy Minimization (HEM). By (1), f_U is independent of w_ii (i = 1, ..., n), so henceforth we fix w_ii = p_ii = 0. Finally, to translate the soft labels f_i to hard labels pos/neg, the simplest way is by thresholding at 0.5, which works well when the two classes are well separated. [11] proposed another approach, called Class Mass Normalization (CMN), to make use of prior information such as the class ratio in unlabeled data, estimated by that in labeled data. 
Specifically, they normalize the soft labels to f_i^+ := f_i / ∑_{j=1}^n f_j as the probabilistic score of being positive, and to f_i^- := (1 - f_i) / ∑_{j=1}^n (1 - f_j) as the score of being negative. Suppose there are r^+ positive points and r^- negative points in the labeled data, then we classify x_i to positive iff f_i^+ r^+ > f_i^- r^-.\n\n2.2 Basic Hyperparameter Learning Algorithms\n\nOne of the simplest parametric forms of w_ij is RBF:\n\nw_ij = exp(-∑_d (x_{i,d} - x_{j,d})^2 / σ_d^2) (3)\n\nwhere x_{i,d} is the d-th component of x_i, and likewise the meaning of f_{U,i} in (4). The bandwidth σ_d has considerable influence on the classification accuracy. HEM uses one common bandwidth for all dimensions, which can be easily selected by cross validation. However, it will be desirable to learn different σ_d for different dimensions; this allows a form of feature selection. [11] proposed learning the hyperparameters σ_d by minimizing the entropy on unlabeled data points (we call it MinEnt):\n\nH(f_U) = -∑_{i=1}^u (f_{U,i} log f_{U,i} + (1 - f_{U,i}) log(1 - f_{U,i})) (4)\n\nThe optimization is conducted by gradient descent. To prevent numerical problems, they replaced P with P̃ = εU + (1 - ε)P, where ε ∈ [0, 1), and U is the uniform matrix with U_ij = n^{-1}.\n\n3 Leave-one-out Hyperparameter Learning\n\nIn this section, we present the formulation and efficient calculation of our graph learning algorithm.\n\n3.1 Formulation and Efficient Calculation\n\nWe propose a graph learning algorithm which is similar to minimizing the leave-one-out cross validation error. Suppose we hold out a labeled example x_t and predict its label by using the rest of the labeled and unlabeled examples. Making use of the result in (2), the soft label for x_t is s^T f_U^t (the first component of f_U^t), where s := (1, 0, ..., 0)^T ∈ R^{u+1} and f_U^t := (f_0^t, f_{l+1}^t, ..., f_n^t)^T. 
Here, the value of f_U^t can be determined by f_U^t = (I - P̃_UU^t)^{-1} P̃_UL^t f_L^t, where f_L^t := (f_1, ..., f_{t-1}, f_{t+1}, ..., f_l)^T, p̃_ij := (1 - ε) p_ij + ε/n, P_UU^t := [p_tt p_tU; p_Ut P_UU] with p_Ut := (p_{l+1,t}, ..., p_{n,t})^T and p_tU := (p_{t,l+1}, ..., p_{t,n}), and P_UL^t is the (u + 1) × (l - 1) matrix whose first row is (p_{t,1}, ..., p_{t,t-1}, p_{t,t+1}, ..., p_{t,l}) and whose remaining rows are (p_{i,1}, ..., p_{i,t-1}, p_{i,t+1}, ..., p_{i,l}) for i = l + 1, ..., n. P̃_UU^t and P̃_UL^t are built in the same way from p̃_ij.\n\nIf x_t is positive, then we hope that f_{U,1}^t is as close to 1 as possible. Otherwise, if x_t is negative, we hope that f_{U,1}^t is as close to 0 as possible. So the cost function to be minimized can be written as:\n\nQ = ∑_{t=1}^l h_t(f_{U,1}^t) = ∑_{t=1}^l h_t(s^T (I - P̃_UU^t)^{-1} P̃_UL^t f_L^t) (5)\n\nwhere h_t(x) is the cost function for instance t. We denote h_t(x) = h^+(x) for y_t = 1 and h_t(x) = h^-(x) for y_t = 0. Possible choices of h^+(x) include 1 - x, (1 - x)^a, a^{-x}, and -log x with a > 1. Possible choices for h^-(x) include x, x^a, a^{x-1}, and -log(1 - x). Let Loo_loss(x_t, y_t) := h_t(f_{U,1}^t).\n\nTo minimize Q, we use gradient-based optimization methods. The gradient is:\n\n∂Q/∂σ_d = ∑_{t=1}^l h_t'(f_{U,1}^t) s^T (I - P̃_UU^t)^{-1} ((∂P̃_UU^t/∂σ_d) f_U^t + (∂P̃_UL^t/∂σ_d) f_L^t)\n\nusing the matrix property dX^{-1} = -X^{-1} (dX) X^{-1}. Denoting (β^t)^T := h_t'(f_{U,1}^t) s^T (I - P̃_UU^t)^{-1} and noting P̃ = εU + (1 - ε)P, we have\n\n∂Q/∂σ_d = (1 - ε) ∑_{t=1}^l (β^t)^T ((∂P_UU^t/∂σ_d) f_U^t + (∂P_UL^t/∂σ_d) f_L^t). (6)\n\nSince in both P_UU^t and P_UL^t, the first row corresponds to x_t, and the i-th row (i ≥ 2) corresponds to x_{i+l-1}, denoting P_UN^t := (P_UL^t P_UU^t) makes sense as each row of P_UN^t corresponds to a well defined single data point. Let all notations about P carry over to the corresponding W. We now use sw_i^t := ∑_{k=1}^n w_UN^t(i, k) and ∑_{k=1}^n ∂w_UN^t(i, k)/∂σ_d (i = 1, ..., u + 1) to denote the sums of these corresponding rows. Now (6) can be rewritten in ground terms by the following \"two\" equations:\n\n∂P_{U·}^t(i, j)/∂σ_d = (sw_i^t)^{-1} (∂w_{U·}^t(i, j)/∂σ_d - p_{U·}^t(i, j) ∑_{k=1}^n ∂w_UN^t(i, k)/∂σ_d), where · can be U or L,\n\n∂w_ij/∂σ_d = 2 w_ij (x_{i,d} - x_{j,d})^2 / σ_d^3 by (3).\n\n
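As a rough illustration of the cost (5), the following minimal Python sketch evaluates Q by brute force on a toy problem: each labeled point is held out in turn, moved to the front of the unlabeled block, and scored with the harmonic solution. It assumes the RBF weights (3), the smoothed transition matrix with a smoothing factor eps, and the squared losses h+(x) = (1 - x)^2 and h-(x) = x^2 that are used later in the experiments. The helper names solve, rbf_weights and loo_loss and the toy data are hypothetical and not part of the original method description; the linear system is solved directly here, without the acceleration developed in the rest of this section.

```python
import math


def solve(a, b):
    # Gaussian elimination with partial pivoting for a small dense system a x = b.
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            k = m[r][c] / m[c][c]
            for j in range(c, n + 1):
                m[r][j] -= k * m[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][j] * x[j] for j in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def rbf_weights(x, sigma):
    # Equation (3), with the diagonal w_ii fixed to 0.
    n = len(x)
    return [[0.0 if i == j else
             math.exp(-sum((p - q) ** 2 / s ** 2
                           for p, q, s in zip(x[i], x[j], sigma)))
             for j in range(n)] for i in range(n)]


def loo_loss(x, y, sigma, eps=0.01):
    # x: n points with the l labeled ones first; y: their labels in {1, 0}.
    n, l = len(x), len(y)
    w = rbf_weights(x, sigma)
    # Row-normalize W to P, then smooth: eps * U + (1 - eps) * P with U_ij = 1/n.
    p = [[eps / n + (1.0 - eps) * wij / sum(row) for wij in row] for row in w]
    q = 0.0
    for t in range(l):
        unl = [t] + list(range(l, n))          # held-out point comes first
        lab = [k for k in range(l) if k != t]  # remaining labeled points
        a = [[(1.0 if i == j else 0.0) - p[i][j] for j in unl] for i in unl]
        b = [sum(p[i][k] * y[k] for k in lab) for i in unl]
        f_t = solve(a, b)[0]                   # soft label predicted for x_t
        q += (1.0 - f_t) ** 2 if y[t] == 1 else f_t ** 2
    return q


# Toy data (assumed): two tight groups on a line, one per class.
x = [[0.0, 0.0], [0.1, 0.0], [1.0, 0.0], [0.9, 0.0],  # labeled points
     [0.05, 0.0], [0.95, 0.0]]                         # unlabeled points
y = [1, 1, 0, 0]
q_narrow = loo_loss(x, y, sigma=[0.3, 0.3])
q_wide = loo_loss(x, y, sigma=[10.0, 10.0])
print(q_narrow, q_wide)
```

On this toy set a bandwidth comparable to the within-class spacing should give a much lower leave-one-out loss than a very wide bandwidth, which makes all points look alike; this is the kind of signal the gradient of Q is meant to follow.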
The naive way to calculate the function value Q and its gradient is presented in Algorithm 1. We call it leave-one-out hyperparameter learning (LOOHL).\n\nAlgorithm 1 Naive form of LOOHL\nfunction value Q ← 0, gradient g ← (0, ..., 0)^T ∈ R^m\nfor each t = 1, ..., l (leave-one-out loop for each labeled point) do\n  f_L^t ← (f_1, ..., f_{t-1}, f_{t+1}, ..., f_l)^T, f_U^t ← (I - P̃_UU^t)^{-1} P̃_UL^t f_L^t\n  Q ← Q + h_t(f_{U,1}^t), (β^t)^T ← h_t'(f_{U,1}^t) s^T (I - P̃_UU^t)^{-1}\n  for each d = 1, ..., m (for all feature dimensions) do\n    ∂P_UU^t(i, j)/∂σ_d ← (sw_i^t)^{-1} (∂w_UU^t(i, j)/∂σ_d - p_UU^t(i, j) ∑_{k=1}^n ∂w_UN^t(i, k)/∂σ_d), where sw_i^t = ∑_{k=1}^n w_UN^t(i, k), i, j = 1, ..., u + 1\n    ∂P_UL^t(i, j)/∂σ_d ← (sw_i^t)^{-1} (∂w_UL^t(i, j)/∂σ_d - p_UL^t(i, j) ∑_{k=1}^n ∂w_UN^t(i, k)/∂σ_d), i = 1, ..., u + 1, j = 1, ..., l - 1\n    g_d ← g_d + (1 - ε) (β^t)^T ((∂P_UU^t/∂σ_d) f_U^t + (∂P_UL^t/∂σ_d) f_L^t)\n  end for\nend for\n\nThe computational cost of the naive algorithm is high: O(lu(mn + u^2)), just to calculate the gradient once. Here we assume the cost of inverting a u × u matrix is O(u^3). We reduce the two terms in the cost by means of the matrix inversion lemma and careful pre-computation.\n\nOne part of the cost, O(lu^3), stems from inverting I - P̃_UU^t, a (u + 1) × (u + 1) matrix, l times in (5). We note that for different t, I - P̃_UU^t differs only in the first row and first column. So there exist two vectors ξ, η ∈ R^{u+1} such that I - P̃_UU^{t1} = (I - P̃_UU^{t2}) + e ξ^T + η e^T, where e = (1, 0, ..., 0)^T ∈ R^{u+1}. With I - P̃_UU^t expressed in this form, we are ready to apply the matrix inversion lemma:\n\n(A + ξ η^T)^{-1} = A^{-1} - A^{-1} ξ η^T A^{-1} / (1 + η^T A^{-1} ξ). (7)\n\nWe only need to invert I - P̃_UU^t for t = 1 from scratch, and then apply (7) twice for each t ≥ 2. The new total complexity related to matrix inversion is O(u^3 + lu^2).\n\nThe other part of the cost, O(lumn), can be reduced by using careful pre-computation. 
Written in detail, we have:\n\n∂Q/∂σ_d = (1 - ε) ∑_{t=1}^l ∑_{i=1}^{u+1} (β_i^t / sw_i^t) [ ∑_{j=1}^{u+1} (∂w_UU^t(i, j)/∂σ_d) f_{U,j}^t + ∑_{j=1}^{l-1} (∂w_UL^t(i, j)/∂σ_d) f_{L,j}^t - (∑_{k=1}^n ∂w_UN^t(i, k)/∂σ_d) (∑_{j=1}^{u+1} p_UU^t(i, j) f_{U,j}^t + ∑_{j=1}^{l-1} p_UL^t(i, j) f_{L,j}^t) ] = (1 - ε) ∑_{i=1}^n ∑_{j=1}^n α_ij ∂w_ij/∂σ_d.\n\nThe crucial observation is the existence of the coefficients α_ij, which are independent of the dimension index d. Therefore, they can be pre-computed efficiently. Algorithm 2 below presents the efficient approach to gradient calculation.\n\nAlgorithm 2 Efficient algorithm for gradient calculation\nfor i, j = 1, ..., n do\n  for all feature dimensions d on which either x_i or x_j is nonzero do\n    g_d ← g_d + (1 - ε) α_ij ∂w_ij/∂σ_d\n  end for\nend for\n\nFigure 1: Examples of degenerative graphs learned by pure LOOHL.\n\nLetting sw_i := ∑_{k=1}^n w_ik and δ(·) be the Kronecker delta, we derive the form of α_ij as:\n\nα_ij = sw_i^{-1} ∑_{t=1}^l β_{i-l+1}^t (f_{U,j-l+1}^t - p_it f_{U,1}^t - ∑_{k=l+1}^n p_ik f_{U,k-l+1}^t - ∑_{k=1:k≠t}^l p_ik f_k) for i > l and j > l;\nα_ij = sw_i^{-1} ∑_{t=1}^l β_{i-l+1}^t (f_{U,1}^t δ(t = j) + f_j δ(t ≠ j) - p_it f_{U,1}^t - ∑_{k=l+1}^n p_ik f_{U,k-l+1}^t - ∑_{k=1:k≠t}^l p_ik f_k) for i > l and j ≤ l;\nα_ij = sw_i^{-1} β_1^i (f_{U,j-l+1}^i - ∑_{k=l+1}^n p_ik f_{U,k-l+1}^i - ∑_{k=1}^l p_ik f_k) for i ≤ l and j > l;\nα_ij = sw_i^{-1} β_1^i (f_j - ∑_{k=l+1}^n p_ik f_{U,k-l+1}^i - ∑_{k=1}^l p_ik f_k) for i ≤ l and j ≤ l,\n\nand α_ii are fixed to 0 for all i since we fix w_ii = p_ii = 0.\n\nAll α_ij can be computed in O(u^2 l) time and Algorithm 2 can be completed in O(n^2 m̃) time, where\n\nm̃ := 2 n^{-1} (n - 1)^{-1} ∑_{1≤i<j≤n} |{d ∈ 1...m | x_i or x_j is not zero on feature d}|.\n\nIn many applications such as text classification and image pattern recognition, the data is very sparse and m̃ ≪ m. In sum, the computational cost has been reduced from O(lu(mn + u^2)) to O(lnu + n^2 m̃ + u^3). The space cost is mild at O(n^2 + n m̃). 
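The rank-one update behind equation (7) can be checked numerically. The sketch below, in plain Python, applies the Sherman-Morrison form of the matrix inversion lemma twice to a small matrix whose first row and first column are perturbed, and compares the result with a direct inversion. The matrix entries, the helper names invert and rank_one_update, and the vector names xi and eta are illustrative assumptions; a practical implementation would more likely rely on a numerical linear algebra library.

```python
def invert(a):
    # Gauss-Jordan inversion with partial pivoting (reference implementation).
    n = len(a)
    m = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        piv = m[c][c]
        m[c] = [v / piv for v in m[c]]
        for r in range(n):
            if r != c:
                k = m[r][c]
                m[r] = [vr - k * vc for vr, vc in zip(m[r], m[c])]
    return [row[n:] for row in m]


def rank_one_update(a_inv, xi, eta):
    # Inverse of (A + xi eta^T) computed from the inverse of A in O(n^2) operations.
    n = len(a_inv)
    left = [sum(a_inv[i][k] * xi[k] for k in range(n)) for i in range(n)]
    right = [sum(eta[k] * a_inv[k][j] for k in range(n)) for j in range(n)]
    denom = 1.0 + sum(eta[k] * left[k] for k in range(n))
    return [[a_inv[i][j] - left[i] * right[j] / denom for j in range(n)]
            for i in range(n)]


# Assumed toy matrix; b differs from a only in its first row and first column.
a = [[4.0, 1.0, 0.5], [1.0, 3.0, 0.2], [0.5, 0.2, 5.0]]
e = [1.0, 0.0, 0.0]
row_change = [0.3, -0.2, 0.1]   # added to the first row: e times row_change^T
col_change = [0.0, 0.4, -0.1]   # added to the first column: col_change times e^T
b = [[a[i][j] + e[i] * row_change[j] + col_change[i] * e[j] for j in range(3)]
     for i in range(3)]

# Two rank-one updates replace one full inversion.
b_inv = rank_one_update(rank_one_update(invert(a), e, row_change), col_change, e)
direct = invert(b)
max_err = max(abs(b_inv[i][j] - direct[i][j]) for i in range(3) for j in range(3))
print(max_err)
```

Each update costs a quadratic number of operations in the matrix size, compared with a cubic cost for a fresh inversion, which is where the saving in the leave-one-out loop comes from.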
\n\n4 Regularizing the Graph Learning\n\nSimilar to the MinEnt method, purely applying LOOHL can lead to degenerative graphs. In this section, we show two such examples and then propose a simple approach which regularizes the graph learning process. Two degenerative graphs are shown in Figure 1. In example (a), the points with the same x_v coordinate are from the same class. For each labeled point, there is another labeled point from the opposite class which has the same x_h coordinate. So the leave-one-out hyperparameter learning will push 1/σ_h to zero and 1/σ_v to infinity, i.e., all points can transfer only horizontally. Therefore the graph will effectively split into six disconnected sub-graphs, each sharing the same x_v coordinate, as shown in (a). So the desired gradual change of label from positive to negative along dimension x_v cannot appear. As a result, the point at the question mark cannot hit any labeled points and cannot be classified. One way to prevent such degenerate graphs is to prevent 1/σ_v from growing too large, e.g., with a regularizer such as ∑_d (1/σ_d)^2.\n\nIn example (b), although the negative points will encourage both horizontal and vertical walk, horizontal walk will make the leave-one-out error large on positive points. So the learned 1/σ_v will be far smaller than 1/σ_h, i.e., the result strongly encourages walking in the vertical direction and ignoring the information from the horizontal direction. As a result, the point at the question mark will be labeled as positive, although by nearest neighbor intuition, it should be labeled as negative. We notice that the four negative points will be partitioned into two groups as shown in the figure. In such a case, the regularizer ∑_d (1/σ_d)^2 will not be helpful with utilizing dimensions that are informative. A different regularizer that encourages the use of more dimensions may be better in this case. 
One simple regularizer that has this property is to minimize the variance of the inverse bandwidth, ∑_d (1/σ_d - μ)^2, where μ = m^{-1} ∑_d 1/σ_d, assuming that the mean is non-zero. It is a priori unclear which regularizer will be better empirically, but for the datasets in our experiments, the minimum variance regularizer is overwhelmingly better, even when useless features are intentionally added to the datasets. Since the gradient based optimization can get stuck in local minima, it is advantageous to test several different parameter initializations. With this in mind, we implement a simple approximation to the minimum variance regularizer that tests different parameter initializations as well. We discretize μ and minimize the leave-one-out loss plus ∑_d (1/σ_d - 1/σ̃)^2, where σ̃ is fixed a priori to several different possible values. We run with different σ̃ and set all initial σ_d to σ̃. Then we choose the function produced by the value of σ̃ that has the smallest regularized cost function value. This process is similar to restarting from various values to avoid local minima, but now we are also trying different means of the estimated optimal bandwidth at the same time. A similar way to regularize is by using a Gaussian prior with mean μ^{-1} and minimizing Q + C ∑_d (1/σ_d - 1/μ)^2 with respect to σ_d and μ simultaneously.\n\nTable 1: Dataset properties. Sparsity is the average frequency of features to be zero in the whole dataset. The rightmost column gives the size of the whole dataset from which the labeled data in experiment is sampled. Some data in text dataset has unknown label, thus always used as unlabeled.\n\n5 Experimental Results\n\nUsing HEM as a basis for classification, we compare the test accuracy of three model selection methods: LOOHL, 5-CV (tying all bandwidths and choosing by 5-fold cross validation), and MinEnt, each with both thresholding and CMN. 
Since the topic of this paper is how to learn the hyperparameters of a graph, we pay more attention to how the performance of a given recognized classifier can be improved by means of learning the graph, than to the comparison between different classifiers' performance, i.e., comparing with other semi-supervised or supervised learning algorithms. Ionosphere is from the UCI repository. The other four datasets used in the experiment are from the NIPS 2003 Workshop on feature selection challenge. Each of them has two versions: the original version and the probe version, which adds useless probing features in order to investigate the algorithm's performance in the presence of useless features, though at the current stage we do not use the algorithm as a feature selector. Since the workshop did not provide the original datasets, we downloaded the original datasets from other sites. Our original intention was to use the original versions that we downloaded and to reproduce the probe versions ourselves using the pre-processing described in the NIPS 2003 workshop, so that we could check the performance of the algorithms on datasets with and without redundant features. Unfortunately, we find that with our own effort at pre-processing, the datasets with probes yield far different accuracies compared with the datasets with probes downloaded from the workshop web site. Thus we are using the original version and the probe version downloaded from different sources, and the comparison between them should be done with care, though the demonstration of LOOHL's efficacy is not affected. The properties of the five datasets are summarized in Table 1. We randomly pick the labeled subset L from all labeled data available under the constraint that both classes must be present in L. The remaining labeled and unlabeled data are used as unlabeled data. 
For example, by saying |L| = 20 for the text dataset, we mean randomly picking 20 points from the 600 labeled data as labeled, and labeling the other 1980 points by using our algorithm. Finally we calculate the prediction accuracy on the 580 (originally) labeled points. For other datasets, say cancer, testing is on 180 points since we know the labels of all points. For each fixed |L|, this random test is conducted 10 times and the average accuracy is reported. Then |L| is varied. We normalized all input feature vectors to have length 1.\n\nFigure 2: Accuracy of original and probe versions in percentage vs. number of labeled data. Panels: (a) 4 vs 9 (original), (b) cancer (original), (c) text (original), (d) thrombin (original), (e) 4 vs 9 (probe), (f) cancer (probe), (g) text (probe), (h) thrombin (probe).\n\nThe initial common bandwidth and smoothing factor in MinEnt are selected by five fold cross validation. For LOOHL, we fix h^+(x) = (1 - x)^2 and h^-(x) = x^2. The final objective function is:\n\nC1 × Loo_loss_Normal + C2 × ∑_d (1/σ_d - 1/σ̃)^2, where Loo_loss_Normal := (2r^+)^{-1} ∑_{y_i=1} Loo_loss(x_i, y_i) + (2r^-)^{-1} ∑_{y_i=0} Loo_loss(x_i, y_i), (8)\n\nand there are r^+ positive labeled examples and r^- negative labeled examples. For each C1:C2 ratio, we run on σ̃ = 0.05, 0.1, 0.15, 0.2, 0.25, 0.3 for all datasets and select the function that corresponds to the smallest objective function value for use in cross validation testing. The final C1:C2 value was picked by five fold cross validation, with discrete levels at 10^{-i}, where i = 1, 2, 3, 4, 5, since a strong regularizer is needed given the large number of features (variables) and much fewer labeled points. The optimization solver we use is the Toolkit for Advanced Optimization [2]. From the results in Figure 2 and Figure 3, we can make the following observations and conclusions:\n\n1. LOOHL generally outperforms 5-CV and MinEnt. 
Both LOOHL+Thrd and LOOHL+CMN outperform 5-CV and MinEnt (regardless of Thrd or CMN) on all datasets except thrombin and ionosphere, where either LOOHL+CMN or LOOHL+Thrd finally performs best.\n\n2. For 5-CV, CMN is almost always better than thresholding, except on the original form of the cancer and thrombin datasets, where CMN hurts 5-CV. In [11], it is claimed that although the theory of HEM is sound, CMN is still necessary to achieve reasonable performance because the underlying graph is often poorly estimated and may not reflect the classification goal, i.e., one should not rely exclusively on the graph. Now that our LOOHL is aimed at learning a good graph, the ideal case is that the graph learned is suitable for our classification such that the improvement by CMN will not be large. In other words, the difference between LOOHL+CMN and LOOHL+Thrd, compared with the difference between 5-CV+CMN and 5-CV+Thrd, can be viewed as an approximate indicator of how well the graph is learned by LOOHL. The efficacy of LOOHL can be clearly observed in datasets 4vs9, cancer, text, ionosphere and the original version of thrombin. In these cases, we see that LOOHL+Thrd is already achieving high accuracy and LOOHL+CMN does not offer much improvement, or even hurts performance due to inaccurate class ratio estimation. In fact, LOOHL+Thrd performs reliably well on all datasets. It is thus desirable to learn the bandwidth for each dimension of the feature vector, and there is no longer any need to post-process by using class ratio information.\n\n3. The performance of MinEnt is generally inferior to 5-CV and LOOHL. MinEnt+Thrd has an equal chance of outperforming or losing to 5-CV+Thrd, while 5-CV+CMN is almost always better than MinEnt+CMN. Most of the time, MinEnt+CMN performs significantly better than MinEnt+Thrd, so we can conclude that MinEnt fails to learn a good graph. 
This may be due to converging to a poor local minimum, or because the idea of minimizing the entropy on unlabeled data is by itself insufficient.\n\nFigure 3: Accuracy of Ionosphere in percentage vs. number of labeled data.\n\n4. For these datasets, assuming low variance of inverse bandwidth with discretization as regularizer is more reasonable than assuming that many features are irrelevant to the classification. This is even true for probe versions of the datasets. Figure 4 shows the comparison.\n\nFigure 4: Accuracy comparison of priors in percentage between minimizing the sum of squared inverse bandwidths ∑_d σ_d^{-2} and minimizing the variance of inverse bandwidth.\n\n6 Conclusions\n\nIn this paper, we proposed learning the graph for graph based semi-supervised learning by minimizing the leave-one-out prediction error, with a simple regularizer. Efficient gradient calculation algorithms are designed and the empirical result is encouraging.\n\nAcknowledgements\nThis work is partially funded by the Singapore-MIT Alliance. National ICT Australia is funded through the Australian Government's Backing Australia's Ability initiative, in part through the Australian Research Council.\n\nReferences\n[1] Andreas Argyriou, Mark Herbster, and Massimiliano Pontil. Combining Graph Laplacians for Semi-Supervised Learning. In NIPS 2005, Vancouver, Canada, 2005. [2] Steven Benson, Lois McInnes, Jorge Moré, and Jason Sarich. TAO User Manual ANL/MCS-TM-242, http://www.mcs.anl.gov/tao, 2005. [3] Avrim Blum and Shuchi Chawla. Learning From Labeled and Unlabeled Data using Graph Mincuts. In ICML 2001. [4] Miguel A. Carreira-Perpiñán and Richard S. Zemel. Proximity Graphs for Clustering and Manifold Learning. In NIPS 2004. [5] Olivier Chapelle, Vladimir Vapnik, Olivier Bousquet, and Sayan Mukherjee. Choosing Multiple Parameters for Support Vector Machines. Machine Learning, 46, 131–159, 2002. [6] Olivier Chapelle, Jason Weston, and Bernhard Schölkopf. 
Cluster Kernels for Semi-Supervised Learning. In NIPS 2002. [7] Fabio G. Cozman, Ira Cohen, and Marcelo C. Cirelo. Semi-Supervised Learning of Mixture Models and Bayesian Networks. In ICML 2003. [8] Thorsten Joachims. Transductive Learning via Spectral Graph Partitioning. In ICML 2003. [9] Ashish Kapoor, Yuan Qi, Hyungil Ahn, and Rosalind Picard. Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification. In NIPS 2005. [10] Alexander Smola, and Risi Kondor. Kernels and Regularization on Graphs. In COLT 2003. [11] Xiaojin Zhu, Zoubin Ghahramani, and John Lafferty. Semi-Supervised Learning Using Gaussian Fields and Harmonic Functions. In ICML 2003. [12] Xiaojin Zhu, John Lafferty, and Zoubin Ghahramani. Semi-Supervised Learning: From Gaussian Fields to Gaussian Processes. CMU Technical Report CMU-CS-03-175. [13] Xiaojin Zhu, Jaz Kandola, Zoubin Ghahramani, and John Lafferty. Non-parametric Transforms of Graph Kernels for Semi-Supervised Learning. In NIPS 2004.\n\n\f\n", "award": [], "sourceid": 3043, "authors": [{"given_name": "Xinhua", "family_name": "Zhang", "institution": null}, {"given_name": "Wee", "family_name": "Lee", "institution": null}]}