{"title": "Combining Graph Laplacians for Semi--Supervised Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 67, "page_last": 74, "abstract": null, "full_text": "Combining Graph Laplacians for SemiSupervised Learning\n\nAndreas Argyriou,\n\nMark Herbster,\n\nMassimiliano Pontil\n\nDepartment of Computer Science University College London Gower Street, London WC1E 6BT, England, UK {a.argyriou, m.herbster, m.pontil}@cs.ucl.ac.uk\n\nAbstract\nA foundational problem in semi-supervised learning is the construction of a graph underlying the data. We propose to use a method which optimally combines a number of differently constructed graphs. For each of these graphs we associate a basic graph kernel. We then compute an optimal combined kernel. This kernel solves an extended regularization problem which requires a joint minimization over both the data and the set of graph kernels. We present encouraging results on different OCR tasks where the optimal combined kernel is computed from graphs constructed with a variety of distances functions and the `k ' in nearest neighbors.\n\n1\n\nIntroduction\n\nSemi-supervised learning has received significant attention in machine learning in recent years, see, for example, [2, 3, 4, 8, 9, 16, 17, 18] and references therein. The defining insight of semi-supervised methods is that unlabeled data may be used to improve the performance of learners in a supervised task. One of the key semi-supervised learning methods builds on the assumption that the data is situated on a low dimensional manifold within the ambient space of the data and that this manifold can be approximated by a weighted discrete graph whose vertices are identified with the empirical (labeled and unlabeled) data, [3, 17]. Graph construction consists of two stages, first selection of a distance function and then application of it to determine the graph's edges (or weights thereof). 
For example, in this paper we consider distances between images based on the Euclidean distance, Euclidean distance combined with image transformations, and the related tangent distance [6]; we determine the edge set of the graph with k-nearest neighbors. Another common choice is to weight edges by a decreasing function of the distance d, such as e^{-d²/σ²}. Although a surplus of unlabeled data may improve the quality of the empirical approximation of the manifold (via the graph), leading to improved performance, practical experience with these methods indicates that their performance significantly depends on how the graph is constructed. Hence, the model selection problem must consider both the selection of the distance function and the parameters k or σ used in the graph building process described above. A diversity of methods have been proposed for graph construction; in this paper we do not advocate selecting a single graph but, rather, we propose combining a number of graphs. Our solution implements a method based on regularization which builds upon the work in [1]. For a given dataset, each combination of distance functions and edge set specifications from the distance will lead to a specific graph. Each of these graphs may then be associated with a kernel. We then apply regularization to select the best convex combination of these kernels; the minimizing function will trade off its fit to the data against its norm. What is unique about this regularization is that the minimization is not over a single kernel space but rather over a space corresponding to all convex combinations of kernels. Thus all data (labeled vertices) may be conserved for training rather than reduced by cross-validation, which is not an appealing option when the number of labeled vertices per class is very small. Figure 3 in Section 4 illustrates our algorithm on a simple example. 
There, three different distances for 400 images of the digits `six' and `nine' are depicted, namely, the Euclidean distance, a distance invariant under small centered image rotations from [-10°, 10°], and a distance invariant under rotations from [-180°, 180°]. Clearly, the last distance is problematic as sixes become similar to nines. The performance of our graph regularization learning algorithm discussed in Section 2.2 with these distances is reported below each plot; as expected, this performance is much lower in the case that the third distance is used. The paper is organized as follows. In Section 2 we discuss how regularization may be applied to single graphs. First, we review regularization in the context of reproducing kernel Hilbert spaces (Section 2.1); then in Section 2.2 we specialize our discussion to Hilbert spaces of functions defined over a graph. Here we review the (normalized) Laplacian of the graph and a kernel which is the pseudoinverse of the graph Laplacian. In Section 3 we detail our algorithm for learning an optimal convex combination of Laplacian kernels. Finally, in Section 4 we present experiments on the USPS dataset with our algorithm trained over different classes of Laplacian kernels.\n\n2\n\nBackground on graph regularization\n\nIn this section we review graph regularization [2, 9, 14] from the perspective of reproducing kernel Hilbert spaces, see e.g. [12].\n\n2.1 Reproducing kernel Hilbert spaces\n\nLet X be a set and K : X × X → ℝ a kernel function. We say that H_K is a reproducing kernel Hilbert space (RKHS) of functions f : X → ℝ if (i): for every x ∈ X, K(x, ·) ∈ H_K and (ii): the reproducing kernel property f(x) = ⟨f, K(x, ·)⟩_K holds for every f ∈ H_K and x ∈ X, where ⟨·, ·⟩_K is the inner product on H_K. 
In particular, (ii) tells us that for x, t ∈ X, K(x, t) = ⟨K(x, ·), K(t, ·)⟩_K, implying that the p × p matrix (K(t_i, t_j) : i, j ∈ ℕ_p) is symmetric and positive semi-definite for any set of inputs {t_i : i ∈ ℕ_p} ⊆ X, p ∈ ℕ, where we use the notation ℕ_p := {1, . . . , p}. Regularization in an RKHS learns a function f ∈ H_K on the basis of available input/output examples {(x_i, y_i) : i ∈ ℕ_ℓ} by solving the variational problem\n\nE(K) := min{ Σ_{i=1}^ℓ V(y_i, f(x_i)) + γ‖f‖_K² : f ∈ H_K }  (2.1)\n\nwhere V : ℝ × ℝ → [0, ∞) is a loss function and γ a positive parameter. Moreover, if f is a solution to problem (2.1) then it has the form\n\nf(x) = Σ_{i=1}^ℓ c_i K(x_i, x),  x ∈ X  (2.2)\n\nfor some real vector of coefficients c = (c_i : i ∈ ℕ_ℓ)^⊤, see, for example, [12], where \"⊤\" denotes transposition. This vector can be found by replacing f by the right hand side of equation (2.2) in equation (2.1) and then optimizing with respect to c. However, in many practical situations it is more convenient to compute c by solving the dual problem to (2.1), namely\n\n−E(K) := min{ (1/(4γ)) c^⊤ K c + Σ_{i=1}^ℓ V*(y_i, c_i) : c ∈ ℝ^ℓ }  (2.3)\n\nwhere K = (K(x_i, x_j))_{i,j=1}^ℓ and the function V* : ℝ × ℝ → ℝ ∪ {+∞} is the conjugate of the loss function V, which is defined, for every z, ψ ∈ ℝ, as V*(z, ψ) := sup{ψζ − V(z, ζ) : ζ ∈ ℝ}, see, for example, [1] for a discussion. The choice of the loss function V leads to different learning methods, among which the most prominent are square loss regularization and support vector machines, see, for example, [15].\n\n2.2 Graph regularization\n\nLet G be an undirected graph with m vertices and an m × m adjacency matrix A such that A_ij = 1 if there is an edge connecting vertices i and j and zero otherwise¹. The graph Laplacian L is the m × m matrix defined as L := D − A, where D = diag(d_i : i ∈ ℕ_m) and d_i is the degree of vertex i, that is, d_i = Σ_{j=1}^m A_ij. We identify the linear space of real-valued functions defined on the graph with ℝ^m and introduce on it the semi-inner product ⟨u, v⟩ := u^⊤ L v, u, v ∈ ℝ^m. 
The induced semi-norm is ‖v‖ := √⟨v, v⟩, v ∈ ℝ^m. It is a semi-norm since ‖v‖ = 0 if v is a constant vector, as can be verified by noting that ‖v‖² = ½ Σ_{i,j=1}^m (v_i − v_j)² A_ij.\n\nWe recall that G has r connected components if and only if L has r eigenvectors with zero eigenvalues. Those eigenvectors are piece-wise constant on the connected components of the graph. In particular, G is connected if and only if the constant vector is the only eigenvector of L with zero eigenvalue [5]. We let {λ_i, u_i}_{i=1}^m be a system of eigenvalues/vectors of L, where the eigenvalues are in non-decreasing order and λ_i = 0, i ∈ ℕ_r, and define the linear subspace H(G) of ℝ^m which is orthogonal to the eigenvectors with zero eigenvalue, that is, H(G) := {v : v^⊤ u_i = 0, i ∈ ℕ_r}. Within this framework, we wish to learn a function v ∈ H(G) on the basis of a set of labeled vertices. Without loss of generality we assume that the first ℓ ≤ m vertices are labeled and let y_1, . . . , y_ℓ ∈ {−1, 1} be the corresponding labels. Following [2] we prescribe a loss function V and compute the function v by solving the optimization problem\n\nmin{ Σ_{i=1}^ℓ V(y_i, v_i) + γ‖v‖² : v ∈ H(G) }.  (2.4)\n\nWe note that a similar approach is presented in [17] where v is (essentially) obtained as the minimal norm interpolant in H(G) to the labeled vertices. The functional (2.4) balances the error on the labeled points with a smoothness term measuring the complexity of v on the graph. Note that this last term contains the information of both the labeled and unlabeled vertices via the graph Laplacian.\n\n¹ The ideas we discuss below naturally extend to weighted graphs.\n\nMethod (2.4) is a special case of problem (2.1). Indeed, the restriction of the semi-norm to H(G) is a norm. Moreover, the pseudoinverse of the Laplacian, L⁺, is the reproducing kernel of H(G), see, for example, [7] for a proof. This means that for every v ∈ H(G) and i ∈ ℕ_m there holds the reproducing kernel property v_i = ⟨L_i⁺, v⟩, where L_i⁺ is the i-th column of L⁺. 
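(For illustration only, and not part of the paper's code, the following numpy sketch verifies the two facts just stated on a small connected graph: the identity ‖v‖² = ½ Σ_{i,j} (v_i − v_j)² A_ij, and the reproducing property v_i = ⟨L_i⁺, v⟩ for v ∈ H(G).)

```python
import numpy as np

# a small connected graph on 4 vertices: path 0-1-2-3 plus chord (1,3)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A          # graph Laplacian L = D - A
Lplus = np.linalg.pinv(L)               # pseudoinverse: reproducing kernel of H(G)

v = np.array([3.0, -1.0, 2.0, 0.0])
v -= v.mean()                           # project onto H(G): for a connected graph,
                                        # H(G) is the orthogonal complement of constants

# semi-norm identity: v' L v = 1/2 * sum_{ij} (v_i - v_j)^2 A_ij
lhs = v @ L @ v
rhs = 0.5 * np.sum((v[:, None] - v[None, :]) ** 2 * A)

# reproducing property: v_i = <L_i^+, v> = (L^+ e_i)' L v, i.e. L^+ L v = v on H(G)
reproduced = Lplus @ (L @ v)
```

The projection step matters: on the constant component the semi-norm vanishes and the reproducing property does not apply.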
Hence, by setting X = ℕ_m, f(i) = v_i and K(i, j) = L⁺_ij, i, j ∈ ℕ_m, we see that H_K ≡ H(G). We note that the above analysis naturally extends to the case that L is replaced by any positive semidefinite matrix. In particular, in our experiments below we will use the normalized Laplacian matrix given by D^{−1/2} L D^{−1/2}. Typically, problem (2.4) is solved by optimizing over v = (v_i : i ∈ ℕ_m). In particular, for square loss regularization [2] and minimal norm interpolation [17] this requires solving a squared linear system of m and m − ℓ equations respectively. On the contrary, in this paper we use the representer theorem to express v as\n\nv_i = Σ_{j=1}^ℓ L⁺_ij c_j,  i ∈ ℕ_m.\n\nThis approach is advantageous if L⁺ can be computed off-line because, typically, ℓ ≪ m. A further advantage of this approach is that multiple problems may be solved with the same Laplacian kernel. The coefficients c_i are obtained by solving problem (2.3) with K = (L⁺_ij)_{i,j=1}^ℓ. For example, for square loss regularization the computation of the parameter vector c = (c_i : i ∈ ℕ_ℓ) involves solving a linear system of equations, namely\n\n(K + γI)c = y.  (2.5)\n\n3\n\nLearning a convex combination of Laplacian kernels\n\nWe now describe our framework for learning with multiple graph Laplacians. We assume that we are given n graphs G^(q), q ∈ ℕ_n, all having m vertices, with corresponding Laplacians L^(q), kernels K^(q) = (L^(q))⁺, Hilbert spaces H^(q) := H(G^(q)) and norms ‖v‖_q² := v^⊤ L^(q) v, v ∈ H^(q). We propose to learn an optimal convex combination of graph kernels, that is, we solve the optimization problem\n\nE := min{ Σ_{i=1}^ℓ V(y_i, v_i) + γ‖v‖²_{K(θ)} : θ ∈ Θ, v ∈ H_{K(θ)} }  (3.1)\n\nwhere we have defined the set Θ := {θ ∈ ℝ^n : θ_q ≥ 0, Σ_{q=1}^n θ_q = 1} and, for each θ ∈ Θ, the kernel K(θ) := Σ_{q=1}^n θ_q K^(q). The above problem is motivated by observing that\n\nE ≤ min{ E(K^(q)) : q ∈ ℕ_n }.\n\nHence an optimal convex combination of kernels has a smaller right hand side than that of any individual kernel, motivating the expectation of improved performance. Furthermore, large values of the components of the minimizing θ identify the most relevant kernels. Problem (3.1) is a special case of the problem of jointly minimizing functional (2.1) over v ∈ H_K and K ∈ co(K), the convex hull of kernels in a prescribed set K. This problem is discussed in detail in [1, 12], see also [10, 11] where the case that K is finite is considered. Practical experience with this method [1, 10, 11] indicates that it can enhance the performance of the learning algorithm and, moreover, it is computationally efficient to solve. When solving problem (3.1) it is important to require that the kernels K^(q) satisfy a normalization condition, such as that they all have the same trace or the same Frobenius norm, see [10] for a discussion.\n\nInitialization: Choose K^(1) ∈ co{K^(q) : q ∈ ℕ_n}\nFor t = 1 to T:\n1. compute c^(t) to be the solution of problem (2.3) with K = K^(t);\n2. find q̂ ∈ ℕ_n : (c^(t), K^(q̂) c^(t)) > (c^(t), K^(t) c^(t)). If such q̂ does not exist, terminate;\n3. compute p̂ = argmin{ E(p K^(q̂) + (1 − p) K^(t)) : p ∈ (0, 1] };\n4. set K^(t+1) = p̂ K^(q̂) + (1 − p̂) K^(t).\n\nFigure 1: Algorithm to compute an optimal convex combination of kernels in the set co{K^(q) : q ∈ ℕ_n}.\n\nUsing the dual problem formulation discussed above (see equation (2.3)) in the inner minimum in (3.1), we can rewrite this problem as\n\n−E = max{ min{ (1/(4γ)) c^⊤ K(θ) c + Σ_{i=1}^ℓ V*(y_i, c_i) : c ∈ ℝ^ℓ } : θ ∈ Θ }.  (3.2)\n\nThe variational problem (3.2) expresses the optimal convex combination of the kernels as the solution to a saddle point problem. This problem is simpler to solve than the original problem (3.1) since its objective function is linear in θ, see [1] for a discussion. Several algorithms can be used for computing a saddle point (ĉ, θ̂) ∈ ℝ^ℓ × Θ. Here we adapt an algorithm from [1] which alternately optimizes over c and θ. For reproducibility of the algorithm, it is reported in Figure 1. 
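(To make the alternating scheme of Figure 1 concrete, here is a small numpy sketch added for this edition. It specializes to the square loss, for which step 1 reduces to solving (K + γI)c = y as in (2.5), and one can verify that the objective value is then E(K) = γ c^⊤ y. The grid-based line search in step 3 and the early-exit guard are our simplifications, not the paper's implementation.)

```python
import numpy as np

def square_loss_E(K, y, gamma):
    """Solve (K + gamma*I)c = y (cf. (2.5)); for the square loss the
    optimal value of the regularization functional is E(K) = gamma * c'y."""
    c = np.linalg.solve(K + gamma * np.eye(len(y)), y)
    return gamma * (c @ y), c

def combine_kernels(Ks, y, gamma, T=100, grid=np.linspace(0.01, 1.0, 100)):
    """Figure 1-style alternating minimization over c and the convex weights."""
    K = sum(Ks) / len(Ks)                    # start from the average kernel
    for _ in range(T):
        E, c = square_loss_E(K, y, gamma)    # step 1: solve for c
        scores = [c @ Kq @ c for Kq in Ks]
        q = int(np.argmax(scores))           # step 2: best candidate kernel
        if scores[q] <= c @ K @ c + 1e-12:
            break                            # no such q exists: terminate
        # step 3: line search over p in (0, 1] (grid-search simplification)
        Es = [square_loss_E(p * Ks[q] + (1 - p) * K, y, gamma)[0] for p in grid]
        p = grid[int(np.argmin(Es))]
        if min(Es) >= E:
            break                            # no improvement on the grid
        K = p * Ks[q] + (1 - p) * K          # step 4: update the combination
    return K
```

By construction the objective E decreases monotonically, so the returned combination is never worse than the starting average kernel.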
Note that once θ̂ is computed, ĉ is given by a minimizer of problem (2.3) for K = K(θ̂). In particular, for square loss regularization this requires solving the equation (2.5) with K = (K_ij(θ̂) : i, j ∈ ℕ_ℓ).\n\n4\n\nExperiments\n\nIn this section we present our experiments on optical character recognition. We observed the following. First, the optimal convex combination of kernels computed by our algorithm is competitive with the best base kernels. Second, by observing the `weights' of the convex combination we can distinguish the strong from the weak candidate kernels. We proceed by discussing the details of the experimental design interleaved with our results. We used the USPS dataset² of 16 × 16 images of handwritten digits with pixel values ranging between -1 and 1. We present the results for 5 pairwise classification tasks of varying difficulty and for odd vs. even digit classification. For pairwise classification, the training set consisted of the first 200 images for each digit in the USPS training set and the number of labeled points was chosen to be 4, 8 or 12 (with equal numbers for each digit). For odd vs. even digit classification, the training set consisted of the first 80 images per digit in the USPS training set and the number of labeled points was 10, 20 or 30, with equal numbers for each digit. Performance was averaged over 30 random selections, each with the same number of labeled points. In each experiment, we constructed n = 30 graphs G^(q) (q ∈ ℕ_n) by combining k-nearest neighbors (k ∈ ℕ_10) with three different distances. Then, n corresponding Laplacians were computed together with their associated kernels. We chose as the loss function V the square loss. Since kernels obtained from different types of graphs can vary widely, it was necessary to renormalize them. Hence, we chose to normalize each kernel during the training process by the Frobenius norm of its submatrix corresponding to the labeled data. We also observed that similar results were obtained when normalizing with the trace of this submatrix. The regularization parameter γ was set to 10^{-5} in all algorithms. For convex minimization, as the starting kernel in the algorithm in Figure 1 we always used the average of the n kernels, and as the maximum number of iterations T = 100.\n\n² Available at: http://www-stat-class.stanford.edu/tibs/ElemStatLearn/data.html\n\nTable 1 shows the results obtained using three distances as combined with k-NN (k ∈ ℕ_10). The first distance is the Euclidean distance between images.\n\nEuclidean (10 kernels)\nTask \\ Labels | 1% | 2% | 3%\n1 vs. 7 | 1.55 ± 0.08 | 1.53 ± 0.05 | 1.50 ± 0.15\n2 vs. 3 | 3.08 ± 0.85 | 3.34 ± 1.21 | 3.38 ± 1.29\n2 vs. 7 | 4.46 ± 1.17 | 4.04 ± 1.21 | 3.56 ± 0.82\n3 vs. 8 | 7.33 ± 1.67 | 7.30 ± 1.49 | 7.03 ± 1.43\n4 vs. 7 | 2.90 ± 0.77 | 2.64 ± 0.78 | 2.25 ± 0.77\nOdd vs. Even | 10: 18.6 ± 3.98 | 20: 15.5 ± 2.40 | 30: 13.4 ± 2.67\n\nTransf. (10 kernels)\nTask \\ Labels | 1% | 2% | 3%\n1 vs. 7 | 1.45 ± 0.10 | 1.45 ± 0.11 | 1.38 ± 0.12\n2 vs. 3 | 0.80 ± 0.40 | 0.85 ± 0.38 | 0.82 ± 0.32\n2 vs. 7 | 3.27 ± 1.16 | 2.92 ± 1.26 | 2.96 ± 1.08\n3 vs. 8 | 6.98 ± 1.57 | 6.87 ± 1.77 | 6.50 ± 1.78\n4 vs. 7 | 1.81 ± 0.26 | 1.82 ± 0.42 | 1.69 ± 0.45\nOdd vs. Even | 10: 15.7 ± 4.40 | 20: 11.7 ± 3.14 | 30: 8.52 ± 1.32\n\nTangent dist. (10 kernels)\nTask \\ Labels | 1% | 2% | 3%\n1 vs. 7 | 1.01 ± 0.00 | 1.00 ± 0.09 | 1.00 ± 0.11\n2 vs. 3 | 0.73 ± 0.93 | 0.19 ± 0.51 | 0.03 ± 0.09\n2 vs. 7 | 2.95 ± 1.79 | 2.30 ± 0.76 | 2.14 ± 0.53\n3 vs. 8 | 4.43 ± 1.21 | 4.22 ± 1.36 | 3.96 ± 1.25\n4 vs. 7 | 0.88 ± 0.17 | 0.90 ± 0.20 | 0.90 ± 0.20\nOdd vs. Even | 10: 14.66 ± 4.37 | 20: 10.50 ± 2.30 | 30: 8.38 ± 1.90\n\nAll (30 kernels)\nTask \\ Labels | 1% | 2% | 3%\n1 vs. 7 | 1.28 ± 0.28 | 1.24 ± 0.27 | 1.20 ± 0.22\n2 vs. 3 | 0.79 ± 0.93 | 0.25 ± 0.61 | 0.10 ± 0.21\n2 vs. 7 | 3.51 ± 1.92 | 2.54 ± 0.97 | 2.41 ± 0.89\n3 vs. 8 | 4.80 ± 1.57 | 4.32 ± 1.46 | 4.20 ± 1.53\n4 vs. 7 | 1.04 ± 0.37 | 1.14 ± 0.42 | 1.13 ± 0.39\nOdd vs. Even | 10: 17.07 ± 4.38 | 20: 10.98 ± 2.61 | 30: 8.74 ± 2.39\n\nTable 1: Misclassification error percentage ± standard deviation for the best convex combination of kernels on different handwritten digit recognition tasks, using different distances. For the odd vs. even task the columns give the number of labeled points (10, 20, 30) instead of a percentage. See text for description. 
The second distance is the transformation distance, where the distance between two images is given by the smallest Euclidean distance between any pair of transformed images as determined by applying a number of affine transformations and a thickness transformation³, see [6] for more information. The third distance is the tangent distance, as described in [6], which is a first-order approximation to the above transformations. For the first three columns in the table the Euclidean distance was used, for columns 4-6 the image transformation distance was used, and for columns 7-9 the tangent distance was used. Finally, in the last three columns all three methods were jointly compared. As the results indicate, when combining different types of kernels, the algorithm tends to select the most effective ones (in this case the tangent distance kernels and, to a lesser degree, the transformation distance kernels, which did not work very well because of the Matlab optimization routine we used). We also noted that within each of the methods the performance of the convex combination is comparable to that of the best kernels. Figure 2 reports the weight of each individual kernel learned by our algorithm when 2% labels are used in the pairwise tasks and 20 labels are used for odd vs. even. With the exception of the easy 1 vs. 7 task, the large weights are associated with the graphs/kernels built with the tangent distance. The effectiveness of our algorithm in selecting the good graphs/kernels is better demonstrated in Figure 3, where the Euclidean and the transformation kernels are combined with a \"low-quality\" kernel. 
This \"low-quality\" kernel is induced by considering distances invariant over rotation in the range [-180 , 180 ], so that the image of a 6 can easily have a small distance from an image of a 9, that is, if x and t are two images and T (x) is the image obtained by rotating x by degrees, we set d(x, t) = min{ T (x) - T (t) : , [-180 , 180 ]}.\n3\n\nThis distance was approximated using Matlab's constrained minimization function.\n\n\f\nThe figure shows the distance matrix on the set of labeled and unlabeled data for the Euclidean, transformation and \"low-quality distance\" respectively. The best error among 15 different values of k within each method, the error of the learned convex combination and the total learned weights for each method are shown below each plot. It is clear that the solution of the algorithm is dominated by the good kernels and is not influenced by the ones with low performance. As a result, the error of the convex combination is comparable to that of the Euclidean and transformation methods. The final experiment (see Figure 4) demonstrates that unlabeled data improves the performance of our method.\n0.12 0.1 0.08 0.06 0.3 0.04 0.02 0 0 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 0 5 10 15 20 25 30 0.2 0.1 5 10 15 20 25 30 0 0 0.35 0.3 0.2 0.25 0.2 0.15 0.1 0.05 0.05 0 0 5 10 15 20 25 30 0 0 5 10 15 20 25 30 0.15 5 10 15 20 25 30 0.15 0.1 0.05 0 0 0.25 5 10 15 20 25 30\n\n1 vs. 7\n\n0.7 0.6 0.5 0.4\n\n2 vs. 3\n\n0.35 0.3 0.25 0.2\n\n2 vs. 7\n\n3 vs. 8\n\n4 vs. 7\n\nodd-even\n\n0.1\n\nFigure 2: Kernel weights for Euclidean (first 10), Transformation (middle 10) and Tangent (last 10). 
See text for more information.\n\n[Figure 3 contains three similarity matrices on the 400 labeled and unlabeled points: Euclidean (error = 0.24%, Σ_{i=1}^{15} θ_i = 0.553), Transformation (error = 0.24%, Σ_{i=16}^{30} θ_i = 0.406) and Low-quality distance (error = 17.47%, Σ_{i=31}^{45} θ_i = 0.041); the convex combination error is 0.26%.]\n\nFigure 3: Similarity matrices and corresponding learned coefficients of the convex combination for the 6 vs. 9 task. See text for description.\n\n5\n\nConclusion\n\nWe have presented a method for computing an optimal kernel within the framework of regularization over graphs. The method consists of a minimax problem which can be efficiently solved by using an algorithm from [1]. When tested on optical character recognition tasks, the method exhibits competitive performance and is able to select good graph structures. Future work will focus on out-of-sample extensions of this algorithm and on continuous optimization versions of it. In particular, we may consider a continuous family of graphs, each corresponding to a different weight matrix, and study graph kernel combinations over this class.\n\nFigure 4: Misclassification error vs. number of training points for odd vs. even classification, for the Euclidean, transformation and tangent distances. The number of labeled points is 10 on the left and 20 on the right.\n\nReferences\n\n[1] A. Argyriou, C.A. Micchelli and M. Pontil. Learning convex combinations of continuously parameterized basic kernels. Proc. 18th Conf. on Learning Theory (COLT), 2005.\n\n[2] M. Belkin, I. Matveeva and P. Niyogi. Regularization and semi-supervised learning on large graphs. Proc. 17th Conf. on Learning Theory (COLT), 2004.\n\n[3] M. Belkin and P. Niyogi. 
Semi-supervised learning on Riemannian manifolds. Mach. Learn., 56: 209-239, 2004.\n\n[4] A. Blum and S. Chawla. Learning from labeled and unlabeled data using graph mincuts. Proc. 18th Int. Conf. on Machine Learning (ICML), 2001.\n\n[5] F.R. Chung. Spectral Graph Theory. Regional Conference Series in Mathematics, Vol. 92, 1997.\n\n[6] T. Hastie and P. Simard. Models and metrics for handwritten character recognition. Statistical Science, 13(1): 54-65, 1998.\n\n[7] M. Herbster, M. Pontil and L. Wainer. Online learning over graphs. Proc. 22nd Int. Conf. on Machine Learning (ICML), 2005.\n\n[8] T. Joachims. Transductive learning via spectral graph partitioning. Proc. 20th Int. Conf. on Machine Learning (ICML), 2003.\n\n[9] R.I. Kondor and J. Lafferty. Diffusion kernels on graphs and other discrete input spaces. Proc. 19th Int. Conf. on Machine Learning (ICML), 2002.\n\n[10] G.R.G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui and M.I. Jordan. Learning the kernel matrix with semidefinite programming. J. Machine Learning Research, 5: 27-72, 2004.\n\n[11] Y. Lin and H.H. Zhang. Component selection and smoothing in smoothing spline analysis of variance models - COSSO. Institute of Statistics Mimeo Series 2556, NCSU, January 2003.\n\n[12] C.A. Micchelli and M. Pontil. Learning the kernel function via regularization. J. Machine Learning Research, 6: 1099-1125, 2005.\n\n[13] C.S. Ong, A.J. Smola and R.C. Williamson. Hyperkernels. Advances in Neural Information Processing Systems 15, S. Becker et al. (Eds.), MIT Press, Cambridge, MA, 2003.\n\n[14] A.J. Smola and R.I. Kondor. Kernels and regularization on graphs. Proc. 16th Conf. on Learning Theory (COLT), 2003.\n\n[15] V.N. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.\n\n[16] D. Zhou, O. Bousquet, T.N. Lal, J. Weston and B. Schölkopf. Learning with local and global consistency. Advances in Neural Information Processing Systems 16, S. Thrun et al. (Eds.), MIT Press, Cambridge, MA, 2004.\n\n[17] X. Zhu, Z. Ghahramani and J. Lafferty. 
Semi-supervised learning using Gaussian fields and harmonic functions. Proc. 20th Int. Conf. on Machine Learning (ICML), 2003.\n\n[18] X. Zhu, J. Kandola, Z. Ghahramani and J. Lafferty. Nonparametric transforms of graph kernels for semi-supervised learning. Advances in Neural Information Processing Systems 17, L.K. Saul et al. (Eds.), MIT Press, Cambridge, MA, 2005.\n", "award": [], "sourceid": 2938, "authors": [{"given_name": "Andreas", "family_name": "Argyriou", "institution": null}, {"given_name": "Mark", "family_name": "Herbster", "institution": null}, {"given_name": "Massimiliano", "family_name": "Pontil", "institution": null}]}