{"title": "Hyperparameter Learning for Graph Based Semi-supervised Learning Algorithms", "book": "Advances in Neural Information Processing Systems", "page_first": 1585, "page_last": 1592, "abstract": null, "full_text": "Hyperparameter Learning for Graph Based Semi-supervised Learning Algorithms\n\nXinhua Zhang Statistical Machine Learning Program National ICT Australia, Canberra, Australia and CSL, RSISE, ANU, Canberra, Australia xinhua.zhang@nicta.com.au\n\nWee Sun Lee Department of Computer Science National University of Singapore 3 Science Drive 2, Singapore 117543 leews@comp.nus.edu.sg\n\nAbstract\nSemi-supervised learning algorithms have been successfully applied in many applications with scarce labeled data, by utilizing the unlabeled data. One important category is graph based semi-supervised learning algorithms, for which the performance depends considerably on the quality of the graph, or its hyperparameters. In this paper, we deal with the less explored problem of learning the graphs. We propose a graph learning method for the harmonic energy minimization method; this is done by minimizing the leave-one-out prediction error on labeled data points. We use a gradient based method and designed an efficient algorithm which significantly accelerates the calculation of the gradient by applying the matrix inversion lemma and using careful pre-computation. Experimental results show that the graph learning method is effective in improving the performance of the classification algorithm.\n\n1\n\nIntroduction\n\nRecently, graph based semi-supervised learning algorithms have been used successfully in various machine learning problems including classification, regression, ranking, and dimensionality reduction. These methods create graphs whose vertices correspond to the labeled and unlabeled data while the edge weights encode the similarity between each pair of data points. 
Classification is performed using these graphs by labeling the unlabeled data in such a way that instances connected by large weights are given similar labels. Example graph based semi-supervised algorithms include min-cut [3], harmonic energy minimization [11], and the spectral graph transducer [8]. The performance of the classifier depends considerably on the similarity measure of the graph, which is normally defined in two steps. First, the weights are defined locally in a pair-wise parametric form, using functions that are essentially based on a distance metric such as radial basis functions (RBF). It is argued in [7] that modeling error can degrade the performance of semi-supervised learning. As the distance metric is an important part of graph based semi-supervised learning, it is crucial to use a good one. In the second step, smoothing is applied globally, typically based on a spectral transformation of the graph Laplacian [6, 10].

There have been only a few existing approaches that address the problem of graph learning. [13] learns a nonparametric spectral transformation of the graph Laplacian, assuming that the weights and the distance metric are given. [9] learns the spectral parameters by performing evidence maximization using approximate inference and gradient descent. [12] uses evidence maximization and the Laplace approximation to learn simple parameters of the similarity function. Instead of learning one single good graph, [4] proposed building robust graphs by applying random perturbation and edge removal from an ensemble of minimum spanning trees. [1] combined graph Laplacians to learn a graph. Closest to our work is [11], which learns different bandwidths for different dimensions by minimizing the entropy on unlabeled data; like the maximum margin motivation in transductive SVMs, the aim there is to obtain a confident labeling of the data by the algorithm.

(Footnote: This work was done when the author was at the National University of Singapore.)
In this paper, we propose a new algorithm to learn the hyperparameters of the distance metric, more specifically the bandwidths of the different dimensions in the RBF form. In essence, these bandwidths are just model parameters, and standard model selection methods such as k-fold cross validation, or leave-one-out (LOO) cross validation in the extreme case, can be used to select them. Motivated by the same spirit, we base our learning algorithm on the aim of achieving a low LOO prediction loss on the labeled data, i.e., each labeled point should be correctly classified by the other labeled data in a semi-supervised style with as high probability as possible. This idea is similar to [5], which learns multiple parameters for SVMs. Since most LOO style algorithms are plagued by prohibitive computational cost, an efficient algorithm is designed. With a simple regularizer, the experimental results show that learning the hyperparameters by minimizing the LOO loss is effective.

2 Graph Based Semi-supervised Learning

Suppose we have a set of labeled data points {(x_i, y_i)} for i ∈ L ≡ {1, ..., l}. In this paper, we only consider binary classification, i.e., y_i ∈ {1 (positive), 0 (negative)}. In addition, we also have a set of unlabeled data points {x_i} for i ∈ U ≡ {l + 1, ..., l + u}. Denote n ≡ l + u, and suppose the dimensionality of the input feature vectors is m.

2.1 Graph Based Classification Algorithms

One of the earliest graph based semi-supervised learning algorithms is min-cut by [3], which minimizes:

    E(f) ≡ Σ_{i,j} w_ij (f_i - f_j)^2        (1)

where the nonnegative w_ij encodes the similarity between instances i and j. The label f_i is fixed to y_i ∈ {1, 0} if i ∈ L. The optimization variables f_i (i ∈ U) are constrained to {1, 0}. This combinatorial optimization problem can be solved efficiently by the max-flow algorithm. [11] relaxed the constraint f_i ∈ {1, 0} (i ∈ U) to real numbers.
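Under this relaxation, the minimizer can be obtained by solving a small linear system. A minimal numpy sketch, using a hypothetical 4-point graph whose weights are invented purely for illustration:

```python
import numpy as np

# Toy similarity matrix for n = 4 points (l = 2 labeled, u = 2 unlabeled);
# the weights are hypothetical, chosen only to illustrate the computation.
W = np.array([[0.0, 0.1, 0.8, 0.1],
              [0.1, 0.0, 0.1, 0.9],
              [0.8, 0.1, 0.0, 0.2],
              [0.1, 0.9, 0.2, 0.0]])
l = 2
f_L = np.array([1.0, 0.0])          # fixed labels y_1 = 1, y_2 = 0

D = np.diag(W.sum(axis=1))          # d_i = sum_j w_ij
# Harmonic solution for the unlabeled block:
#   f_U = (D_UU - W_UU)^{-1} W_UL f_L
f_U = np.linalg.solve(D[l:, l:] - W[l:, l:], W[l:, :l] @ f_L)
print(f_U)                          # soft labels, each lies in [0, 1]
```

The same solution can equivalently be written with the row-normalized matrix P = D^{-1} W as (I - P_UU)^{-1} P_UL f_L; the linear-system form above avoids forming P explicitly.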
The optimal solution for the soft labels of the unlabeled data can be written neatly as:

    f_U = (D_UU - W_UU)^{-1} W_UL f_L = (I - P_UU)^{-1} P_UL f_L        (2)

where f_L is the vector of soft labels (fixed to y_i) for L, D ≡ diag(d_i) with d_i ≡ Σ_j w_ij, D_UU is the submatrix of D associated with the unlabeled data, and P ≡ D^{-1} W. The blocks W_UU, W_UL, P_UU, and P_UL are defined by:

    W = [ W_LL  W_LU ] ,    P = [ P_LL  P_LU ] .
        [ W_UL  W_UU ]          [ P_UL  P_UU ]

The solution (2) has a number of interesting properties, as pointed out by [11]. All f_i (i ∈ U) are automatically bounded in [0, 1], so it is also known as square interpolation. They can be interpreted using a Markov random walk on the graph. Imagine a graph with n nodes corresponding to the n data points, and define the probability of transferring from x_i to x_j as p_ij, which is simply the row-wise normalization of w_ij. The random walk starts from any unlabeled point, and stops once it hits any labeled point (absorbing boundary). Then f_i is the probability of hitting a positively labeled point. In this sense, the labeling of each unlabeled point is largely based on its neighboring labeled points, which helps alleviate the problem of noisy data. (1) can also be interpreted as a quadratic energy function, and its minimizer is known to be harmonic: f_i (i ∈ U) equals the average of the f_j (j ≠ i) weighted by p_ij. So we call this algorithm Harmonic Energy Minimization (HEM). By (1), f_U is independent of w_ii (i = 1, ..., n), so henceforth we fix w_ii = p_ii = 0.

Finally, to translate the soft labels f_i into hard labels pos/neg, the simplest way is thresholding at 0.5, which works well when the two classes are well separated. [11] proposed another approach, called Class Mass Normalization (CMN), to make use of prior information such as the class ratio in the unlabeled data, estimated by that in the labeled data.
Specifically, they normalize the soft labels to f_i^+ ≡ f_i / Σ_{j=1}^u f_j as the probabilistic score of being positive, and to f_i^- ≡ (1 - f_i) / Σ_{j=1}^u (1 - f_j) as the score of being negative. Suppose there are r^+ positive points and r^- negative points in the labeled data; then we classify x_i as positive iff f_i^+ r^+ > f_i^- r^-.

2.2 Basic Hyperparameter Learning Algorithms

One of the simplest parametric forms of w_ij is the RBF:

    w_ij = exp( - Σ_{d=1}^m (x_{i,d} - x_{j,d})^2 / σ_d^2 )        (3)

where x_{i,d} is the d-th component of x_i, and likewise the meaning of f_{U,i} in (4). The bandwidth σ_d has considerable influence on the classification accuracy. HEM uses one common bandwidth for all dimensions, which can easily be selected by cross validation. However, it is desirable to learn a different σ_d for each dimension; this allows a form of feature selection. [11] proposed learning the hyperparameters σ_d by minimizing the entropy on the unlabeled data points (we call it MinEnt):

    H(f_U) = - Σ_{i=1}^u ( f_{U,i} log f_{U,i} + (1 - f_{U,i}) log(1 - f_{U,i}) )        (4)

The optimization is conducted by gradient descent. To prevent numerical problems, they replaced P with P̃ = εU + (1 - ε)P, where ε ∈ [0, 1) and U is the uniform matrix with U_ij = n^{-1}.

3 Leave-one-out Hyperparameter Learning

In this section, we present the formulation and the efficient calculation of our graph learning algorithm.

3.1 Formulation and Efficient Calculation

We propose a graph learning algorithm which is similar to minimizing the leave-one-out cross validation error. Suppose we hold out a labeled example x_t and predict its label by using the rest of the labeled and unlabeled examples. Making use of the result in (2), the soft label for x_t is s^T f_U^t (the first component of f_U^t), where

    s ≡ (1, 0, ..., 0)^T ∈ R^{u+1},    f_U^t ≡ (f_t^t, f_{l+1}^t, ..., f_n^t)^T.
Here, the value of f_U^t can be determined by f_U^t = (I - P̃_UU^t)^{-1} P̃_UL^t f_L^t, where p̃_ij ≡ (1 - ε) p_ij + ε/n and

    f_L^t ≡ (f_1, ..., f_{t-1}, f_{t+1}, ..., f_l)^T,

    P_UU^t ≡ [ p_tt   p_tU  ] ,    p_Ut ≡ (p_{l+1,t}, ..., p_{n,t})^T,    p_tU ≡ (p_{t,l+1}, ..., p_{t,n}),
             [ p_Ut   P_UU ]

    P_UL^t ≡ [ p_{t,1}     ...  p_{t,t-1}     p_{t,t+1}     ...  p_{t,l}   ]
             [ p_{l+1,1}   ...  p_{l+1,t-1}   p_{l+1,t+1}   ...  p_{l+1,l} ]
             [   ...                                                       ]
             [ p_{n,1}     ...  p_{n,t-1}     p_{n,t+1}     ...  p_{n,l}   ] .

If x_t is positive, then we hope that f_{U,1}^t is as close to 1 as possible. Otherwise, if x_t is negative, we hope that f_{U,1}^t is as close to 0 as possible. So the cost function to be minimized can be written as:

    Q = Σ_{t=1}^l h_t( f_{U,1}^t ) = Σ_{t=1}^l h_t( s^T (I - P̃_UU^t)^{-1} P̃_UL^t f_L^t )        (5)

where h_t(x) is the cost function for instance t. We denote h_t(x) = h^+(x) for y_t = 1 and h_t(x) = h^-(x) for y_t = 0. Possible choices of h^+(x) include 1 - x, (1 - x)^a, a^{-x}, and -log x, with a > 1. Possible choices of h^-(x) include x, x^a, a^{x-1}, and -log(1 - x). Let Loo_loss(x_t, y_t) ≡ h_t(f_{U,1}^t).

To minimize Q, we use gradient-based optimization methods. The gradient is:

    ∂Q/∂σ_d = Σ_{t=1}^l h'_t(f_{U,1}^t) s^T (I - P̃_UU^t)^{-1} [ (∂P̃_UU^t/∂σ_d) f_U^t + (∂P̃_UL^t/∂σ_d) f_L^t ]

using the matrix identity dX^{-1} = -X^{-1} (dX) X^{-1}. Denoting (α^t)^T ≡ h'_t(f_{U,1}^t) s^T (I - P̃_UU^t)^{-1} and noting P̃ = εU + (1 - ε)P, we have

    ∂Q/∂σ_d = (1 - ε) Σ_{t=1}^l (α^t)^T [ (∂P_UU^t/∂σ_d) f_U^t + (∂P_UL^t/∂σ_d) f_L^t ]        (6)

Since in both P_UU^t and P_UL^t the first row corresponds to x_t, and the i-th row (i ≥ 2) corresponds to x_{i+l-1}, denoting P_UN^t ≡ (P_UL^t  P_UU^t) makes sense, as each row of P_UN^t corresponds to a well defined single data point. Let all notations about P carry over to the corresponding W. We now use sw_i^t ≡ Σ_{k=1}^n w_UN^t(i, k) and Σ_{k=1}^n ∂w_UN^t(i, k)/∂σ_d (i = 1, ..., u + 1) to denote the sums of the corresponding rows. Now (6) can be rewritten in ground terms by the following "two" equations:

    ∂P_U•^t(i, j)/∂σ_d = (sw_i^t)^{-1} [ ∂w_U•^t(i, j)/∂σ_d - p_U•^t(i, j) Σ_{k=1}^n ∂w_UN^t(i, k)/∂σ_d ]

where • can be U or L, and ∂w_ij/∂σ_d = 2 w_ij (x_{i,d} - x_{j,d})^2 σ_d^{-3} by (3).
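The bandwidth derivative ∂w_ij/∂σ_d = 2 w_ij (x_{i,d} - x_{j,d})^2 / σ_d^3 follows directly from (3) and can be sanity-checked against finite differences. A small sketch; the points and bandwidths below are made up purely for the check:

```python
import numpy as np

def rbf_weight(xi, xj, sigma):
    # Eq. (3): w_ij = exp(-sum_d (x_{i,d} - x_{j,d})^2 / sigma_d^2)
    return np.exp(-np.sum((xi - xj) ** 2 / sigma ** 2))

def rbf_weight_grad(xi, xj, sigma):
    # Analytic derivative: dw_ij/dsigma_d = 2 w_ij (x_{i,d} - x_{j,d})^2 / sigma_d^3
    return 2.0 * rbf_weight(xi, xj, sigma) * (xi - xj) ** 2 / sigma ** 3

# Hypothetical points and bandwidths, used only to verify the formula.
xi = np.array([0.3, -1.2, 0.7])
xj = np.array([1.0, 0.4, 0.5])
sigma = np.array([0.8, 1.5, 2.0])

g = rbf_weight_grad(xi, xj, sigma)
eps = 1e-6
for d in range(3):
    s = sigma.copy()
    s[d] += eps
    fd = (rbf_weight(xi, xj, s) - rbf_weight(xi, xj, sigma)) / eps
    assert abs(fd - g[d]) < 1e-5   # finite difference matches analytic gradient
```

Increasing any σ_d can only increase w_ij (the exponent moves toward 0), which is why every component of the gradient is nonnegative.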
The naïve way to calculate the function value Q and its gradient is presented in Algorithm 1. We call it leave-one-out hyperparameter learning (LOOHL).

Algorithm 1 Naïve form of LOOHL
1: function value Q ← 0, gradient g ← (0, ..., 0)^T ∈ R^m
2: for each t = 1, ..., l (leave-one-out loop for each labeled point) do
3:   f_L^t ← (f_1, ..., f_{t-1}, f_{t+1}, ..., f_l)^T,  f_U^t ← (I - P̃_UU^t)^{-1} P̃_UL^t f_L^t,
     Q ← Q + h_t(f_{U,1}^t),  (α^t)^T ← h'_t(f_{U,1}^t) s^T (I - P̃_UU^t)^{-1}
4:   for each d = 1, ..., m (for all feature dimensions) do
5:     ∂P_UU^t(i,j)/∂σ_d ← (sw_i^t)^{-1} [ ∂w_UU^t(i,j)/∂σ_d - p_UU^t(i,j) Σ_{k=1}^n ∂w_UN^t(i,k)/∂σ_d ],
       where sw_i^t = Σ_{k=1}^n w_UN^t(i,k), for i, j = 1, ..., u+1
6:     ∂P_UL^t(i,j)/∂σ_d ← (sw_i^t)^{-1} [ ∂w_UL^t(i,j)/∂σ_d - p_UL^t(i,j) Σ_{k=1}^n ∂w_UN^t(i,k)/∂σ_d ],
       for i = 1, ..., u+1, j = 1, ..., l-1
7:     g_d ← g_d + (1 - ε)(α^t)^T [ (∂P_UU^t/∂σ_d) f_U^t + (∂P_UL^t/∂σ_d) f_L^t ]
8:   end for
9: end for

The computational complexity of the naïve algorithm is expensive: O(lu(mn + u^2)) just to calculate the gradient once. Here we assume the cost of inverting a u × u matrix is O(u^3). We reduce both terms in this cost by using the matrix inversion lemma and careful pre-computation.

One part of the cost, O(lu^3), stems from inverting I - P̃_UU^t, a (u+1) × (u+1) matrix, l times in (5). We note that for different t, I - P̃_UU^t differs only in the first row and first column. So there exist two vectors a, b ∈ R^{u+1} such that I - P̃_UU^{t1} = (I - P̃_UU^{t2}) + a e^T + e b^T, where e ≡ (1, 0, ..., 0)^T ∈ R^{u+1}. With I - P̃_UU^t expressed in this form, we are ready to apply the matrix inversion lemma:

    (A + a b^T)^{-1} = A^{-1} - A^{-1} a b^T A^{-1} / (1 + b^T A^{-1} a)        (7)

We only need to invert I - P̃_UU^t from scratch for t = 1, and then apply (7) twice for each t ≥ 2. The new total complexity related to matrix inversion is O(u^3 + lu^2).

The other part of the cost, O(lumn), can be reduced by using careful pre-computation.
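The rank-one update idea can be checked numerically: changing the first row and first column of a matrix is two rank-one corrections, each handled by one application of the lemma. A sketch with a hypothetical random system standing in for I - P̃_UU^t:

```python
import numpy as np

rng = np.random.default_rng(0)
u1 = 5                                        # hypothetical (u+1) x (u+1) system
A = np.eye(u1) + 0.1 * rng.random((u1, u1))   # stands in for I - P~_UU at t = t1
A_inv = np.linalg.inv(A)                      # the one inversion done from scratch

def rank_one_inv_update(A_inv, a, b):
    # Matrix inversion lemma for a rank-one change A + a b^T:
    #   (A + a b^T)^{-1} = A^{-1} - A^{-1} a b^T A^{-1} / (1 + b^T A^{-1} a)
    Aa = A_inv @ a
    bA = b @ A_inv
    return A_inv - np.outer(Aa, bA) / (1.0 + b @ Aa)

# Changing the first row and first column = two rank-one updates with e = e_1.
e = np.zeros(u1); e[0] = 1.0
a = 0.05 * rng.random(u1)                     # hypothetical change to first column
b = 0.05 * rng.random(u1)                     # hypothetical change to first row
B = A + np.outer(a, e) + np.outer(e, b)       # stands in for I - P~_UU at t = t2

B_inv = rank_one_inv_update(rank_one_inv_update(A_inv, a, e), e, b)
assert np.allclose(B_inv, np.linalg.inv(B))   # matches inversion from scratch
```

Each update costs O(u^2) instead of O(u^3), which is where the O(u^3 + lu^2) total comes from.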
Written in detail, we have:

    ∂Q/∂σ_d = (1 - ε) Σ_{t=1}^l Σ_{i=1}^{u+1} (α_i^t / sw_i^t) [ Σ_{j=1}^{u+1} (∂w_UU^t(i,j)/∂σ_d) f_{U,j}^t + Σ_{j=1}^{l-1} (∂w_UL^t(i,j)/∂σ_d) f_{L,j}^t
              - ( Σ_{k=1}^n ∂w_UN^t(i,k)/∂σ_d ) ( Σ_{j=1}^{u+1} p_UU^t(i,j) f_{U,j}^t + Σ_{j=1}^{l-1} p_UL^t(i,j) f_{L,j}^t ) ]
            = (1 - ε) Σ_{i=1}^n Σ_{j=1}^n β_ij ∂w_ij/∂σ_d

The crucial observation is the existence of the β_ij, which are independent of the dimension index d. Therefore, they can be pre-computed efficiently. Algorithm 2 below presents the efficient approach to gradient calculation.

Algorithm 2 Efficient algorithm for gradient calculation
1: for i, j = 1, ..., n do
2:   for all feature dimensions d on which either x_i or x_j is nonzero do
3:     g_d ← g_d + (1 - ε) β_ij ∂w_ij/∂σ_d
4:   end for
5: end for

[Figure 1: Examples of degenerative graphs learned by pure LOOHL.]

Letting sw_i ≡ Σ_{k=1}^n w_ik and δ(·) be the Kronecker delta, we derive the form of β_ij as:

    β_ij = sw_i^{-1} Σ_{t=1}^l α_{i-l+1}^t [ f_{U,j-l+1}^t - Σ_{k=l+1}^n p_ik f_{U,k-l+1}^t - p_it f_{U,1}^t - Σ_{k=1,k≠t}^l p_ik f_k ]   for i > l and j > l;

    β_ij = sw_i^{-1} Σ_{t=1}^l α_{i-l+1}^t [ f_{U,1}^t δ(t = j) + f_j δ(t ≠ j) - p_it f_{U,1}^t - Σ_{k=l+1}^n p_ik f_{U,k-l+1}^t - Σ_{k=1,k≠t}^l p_ik f_k ]   for i > l and j ≤ l;

    β_ij = sw_i^{-1} α_1^i [ f_{U,j-l+1}^i - Σ_{k=l+1}^n p_ik f_{U,k-l+1}^i - Σ_{k=1,k≠i}^l p_ik f_k ]   for i ≤ l and j > l;

    β_ij = sw_i^{-1} α_1^i [ f_j - Σ_{k=l+1}^n p_ik f_{U,k-l+1}^i - Σ_{k=1,k≠i}^l p_ik f_k ]   for i ≤ l and j ≤ l;

and β_ii is fixed to 0 for all i, since we fix w_ii = p_ii = 0.

All β_ij can be computed in O(u^2 l) time, and Algorithm 2 can be completed in O(n^2 m̃) time, where 1 ≤ m̃ ≤ m and m̃ ≡ 2 n^{-1} (n - 1)^{-1} Σ_{i<j} |{ d ∈ {1, ..., m} : x_i or x_j is nonzero on feature d }|.
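The LOO objective (5) itself is straightforward to evaluate in the naïve way (function value only, without the gradient machinery above). A compact sketch, using h^+(x) = (1 - x)^2 and h^-(x) = x^2 from the listed choices; the 6-point graph and all weights are invented purely for illustration:

```python
import numpy as np

def loo_objective(W, y_L, eps=0.1):
    """Naive LOO loss Q: hold out each labeled point, treat it as the first
    node of the unlabeled block, and score its predicted soft label with
    h+(x) = (1 - x)^2 for positives and h-(x) = x^2 for negatives."""
    n = W.shape[0]
    l = len(y_L)
    Q = 0.0
    for t in range(l):
        # Reorder so the held-out point t comes first among the unlabeled block.
        rest = [i for i in range(l) if i != t]
        order = rest + [t] + list(range(l, n))
        Wp = W[np.ix_(order, order)]
        P = Wp / Wp.sum(axis=1, keepdims=True)     # row-normalized transition matrix
        Pt = eps / n + (1.0 - eps) * P             # smoothed P~ = eps*U + (1-eps)*P
        lt = l - 1                                 # labeled block size in this fold
        f_L = np.array([float(y_L[i]) for i in rest])
        f_U = np.linalg.solve(np.eye(n - lt) - Pt[lt:, lt:], Pt[lt:, :lt] @ f_L)
        ft = f_U[0]                                # soft label of the held-out point
        Q += (1.0 - ft) ** 2 if y_L[t] == 1 else ft ** 2
    return Q

# Hypothetical toy graph: two tight clusters {0, 1, 4} and {2, 3, 5},
# with points 0, 1 labeled positive and 2, 3 labeled negative.
W = np.full((6, 6), 0.1)
np.fill_diagonal(W, 0.0)
for i, j in [(0, 1), (0, 4), (1, 4), (2, 3), (2, 5), (3, 5)]:
    W[i, j] = W[j, i] = 0.9

Q = loo_objective(W, y_L=[1, 1, 0, 0])
```

Minimizing Q over the bandwidths σ_d (which determine W through (3)) with the efficient gradient above is the LOOHL procedure; this sketch rebuilds each fold from scratch and so has the naïve cost discussed in the text.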