{"title": "Learning on Graph with Laplacian Regularization", "book": "Advances in Neural Information Processing Systems", "page_first": 25, "page_last": 32, "abstract": null, "full_text": "Learning on Graph with Laplacian Regularization\n\nRie Kubota Ando IBM T.J. Watson Research Center Hawthorne, NY 10532, U.S.A. rie1@us.ibm.com\n\nTong Zhang Yahoo! Inc. New York City, NY 10011, U.S.A. tzhang@yahoo-inc.com\n\nAbstract\nWe consider a general form of transductive learning on graphs with Laplacian regularization, and derive margin-based generalization bounds using appropriate geometric properties of the graph. We use this analysis to obtain a better understanding of the role of normalization of the graph Laplacian matrix as well as the effect of dimension reduction. The results suggest a limitation of the standard degree-based normalization. We propose a remedy from our analysis and demonstrate empirically that the remedy leads to improved classification performance.\n\n1 Introduction\nIn graph-based methods, one often constructs similarity graphs by linking similar data points that are close in the feature space. It was proposed in [3] that one may first project these data points into the eigenspace corresponding to the largest eigenvalues of a normalized adjacency matrix of the graph and then use the standard k -means method for clustering. In the ideal case, points in the same class will be mapped into a single point in the reduced eigenspace, while points in different classes will be mapped to different points. One may also consider similar ideas in semi-supervised learning using a discriminative kernel method. If the underlying kernel is induced from the graph, one may formulate semi-supervised learning directly on the graph (e.g., [1, 5, 7, 8]). In these studies, the kernel is induced from the adjacency matrix W whose (i, j )-entry is the weight of edge (i, j ). 
W is sometimes normalized as D^{-1/2} W D^{-1/2} [2, 3, 4, 7], where D is a diagonal matrix whose (j, j)-entry is the degree of the j-th node, but sometimes not [1, 8]. Although such normalization may significantly affect the performance, the issue has not been studied from the learning theory perspective. The relationship of kernel design and graph learning was investigated in [6], which argued that quadratic regularization-based graph learning can be regarded as kernel design. However, normalization of W was not considered there. The goal of this paper is to provide some learning theoretical insight into the role of normalization of the graph Laplacian matrix (D - W). We first present a model for transductive learning on graphs and develop a margin analysis for multi-class graph learning. Based on this, we analyze the performance of Laplacian regularization-based graph learning in relation to graph properties. We use this analysis to obtain a better understanding of the role of normalization of the graph Laplacian matrix as well as of dimension reduction in graph learning. The results indicate a limitation of the commonly practiced degree-based normalization mentioned above. We propose a learning theoretical remedy based on our analysis and use experiments to demonstrate that the remedy leads to improved classification performance.\n\n2 Transductive Learning Model\nWe consider the following multi-category transductive learning model defined on a graph. Let V = {v_1, ..., v_m} be a set of m nodes, and let Y be a set of K possible output values. Assume that each node v_j is associated with an output value y_j ∈ Y, which we are interested in predicting. We randomly draw a set of n indices Z_n = {j_i : 1 ≤ i ≤ n} from {1, ..., m} uniformly and without replacement. We then manually label the n nodes v_{j_i} with labels y_{j_i} ∈ Y, and then automatically label the remaining m - n nodes. The goal is to estimate the labels of the remaining m - n nodes as accurately as possible. 
We encode the label y_j into a vector in R^K, so that the problem becomes that of generating an estimation vector f_{j,·} = [f_{j,1}, ..., f_{j,K}] ∈ R^K, which can then be used to recover the label y_j. In multi-category classification with K classes Y = {1, ..., K}, we encode each y_j = k ∈ Y as e_k ∈ R^K, where e_k is a vector of zero entries except for the k-th entry being one. Given f_{j,·} = [f_{j,1}, ..., f_{j,K}] ∈ R^K (which is intended to approximate e_{y_j}), we decode the corresponding label estimation as: ŷ_j = arg max_k {f_{j,k} : k = 1, ..., K}. If the true label is y_j, then the classification error is err(f_{j,·}, y_j) = I(ŷ_j ≠ y_j), where we use I(·) to denote the set indicator function.\n\nIn order to estimate f = [f_{j,k}] ∈ R^{mK} from only a subset of labeled nodes, we consider, for a given kernel matrix K ∈ R^{m×m}, the quadratic regularization f^T Q_K f = Σ_{k=1}^K f_{·,k}^T K^{-1} f_{·,k}, where f_{·,k} = [f_{1,k}, ..., f_{m,k}] ∈ R^m. We assume that K is full-rank. We will consider the kernel matrix induced by the graph Laplacian, to be introduced later in the paper. Note that the bold symbol K denotes the kernel matrix, and regular K denotes the number of classes. Given a vector f ∈ R^{mK}, the accuracy of its component f_{j,·} = [f_{j,1}, ..., f_{j,K}] ∈ R^K is measured by a loss function φ(f_{j,·}, y_j). Our learning method attempts to minimize the empirical risk on the set Z_n of n labeled training nodes, subject to f^T Q_K f being small:\n\n  f̂(Z_n) = arg min_{f ∈ R^{mK}} [ (1/n) Σ_{j ∈ Z_n} φ(f_{j,·}, y_j) + λ f^T Q_K f ],  (1)\n\nwhere λ > 0 is an appropriately chosen regularization parameter.\n\nIn this paper, we focus on a special class of loss functions of the form φ(f_{j,·}, y_j) = Σ_{k=1}^K φ_0(f_{j,k}, δ_{k,y_j}), where δ_{a,b} is the delta function defined as: δ_{a,b} = 1 when a = b and δ_{a,b} = 0 otherwise. We are interested in the generalization behavior of (1) compared to a properly defined optimal regularized risk, often referred to as an \"oracle inequality\" in the learning theory literature. 
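For the least squares loss φ_0(x, y) = (x - y)^2 used later in the paper, the minimizer of (1) has a closed form per class. The following NumPy sketch (function and variable names are ours, not from the paper) illustrates this: from the first-order optimality condition of (1) one obtains f_{·,k} = K[:, Z_n] (K[Z_n, Z_n] + nλI)^{-1} y_k, where y_k is the k-th one-hot target column on the labeled nodes.

```python
import numpy as np

def fit_transductive(K, labeled_idx, y_labeled, num_classes, lam):
    """Closed-form minimizer of (1) for the least squares loss
    phi_0(x, y) = (x - y)^2: one ridge-style linear system per class.

    K           : (m, m) kernel matrix (e.g. induced by the graph Laplacian)
    labeled_idx : indices Z_n of the labeled nodes
    y_labeled   : class labels of those nodes (values in 0..num_classes-1)
    lam         : regularization parameter lambda in (1)
    """
    n = len(labeled_idx)
    # One-hot targets e_{y_j} for the labeled nodes (n x K).
    Y = np.eye(num_classes)[np.asarray(y_labeled)]
    # f_{.,k} = K[:, Z_n] (K[Z_n, Z_n] + n*lam*I)^{-1} Y[:, k]
    A = K[np.ix_(labeled_idx, labeled_idx)] + n * lam * np.eye(n)
    F = K[:, labeled_idx] @ np.linalg.solve(A, Y)
    # Decode: y_hat_j = arg max_k f_{j,k}
    return F.argmax(axis=1)
```

The identity used here is standard for quadratic regularizers (a representer-style reduction of the m×K unknowns to one n×n system per class); it is not spelled out in the paper itself.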
Theorem 1 Let φ(f_{j,·}, y_j) = Σ_{k=1}^K φ_0(f_{j,k}, δ_{k,y_j}) in (1). Assume that there exist positive constants a, b, and c such that: (i) φ_0(x, y) is non-negative and convex in x, (ii) φ_0(x, y) is Lipschitz with constant b when φ_0(x, y) ≤ a, and (iii) c = inf{x : φ_0(x, 1) ≤ a} - sup{x : φ_0(x, 0) ≤ a}. Then for all p > 0, the expected generalization error of the learning method (1) over the random training samples Z_n can be bounded by:\n\n  E_{Z_n} (1/(m-n)) Σ_{j ∈ Z̄_n} err(f̂_{j,·}(Z_n), y_j) ≤ (1/a) inf_{f ∈ R^{mK}} [ (1/m) Σ_{j=1}^m φ(f_{j,·}, y_j) + λ f^T Q_K f ] + ( b tr_p(K) / (cλn) )^p,\n\nwhere Z̄_n = {1, ..., m} - Z_n, tr_p(K) = ( (1/m) Σ_{j=1}^m K_{j,j}^p )^{1/p}, and K_{j,j} is the (j, j)-entry of K.\n\nProof. The proof is similar to the proof of a related bound for binary classification in [6]. We shall introduce the following notation: let i_{n+1} ∉ {i_1, ..., i_n} be an integer randomly drawn from Z̄_n, and let Z_{n+1} = Z_n ∪ {i_{n+1}}. Let f̂(Z_{n+1}) be the semi-supervised learning method (1) using training data in Z_{n+1}: f̂(Z_{n+1}) = arg inf_{f ∈ R^{mK}} [ (1/n) Σ_{j ∈ Z_{n+1}} φ(f_{j,·}, Y_j) + λ f^T Q_K f ]. Adapted from a related lemma used in [6] for proving a similar result, we have the following inequality for each k = 1, ..., K:\n\n  |f̂_{i_{n+1},k}(Z_{n+1}) - f̂_{i_{n+1},k}(Z_n)| ≤ |∇_{1,k} φ(f̂_{i_{n+1},·}(Z_{n+1}), Y_{i_{n+1}})| K_{i_{n+1},i_{n+1}} / (2λn),  (2)\n\nwhere ∇_{1,k} φ(f_{i,·}, y) denotes a sub-gradient of φ(f_{i,·}, y) with respect to f_{i,k}, and f_{i,·} = [f_{i,1}, ..., f_{i,K}]. Next we prove\n\n  err(f̂_{i_{n+1},·}(Z_n), y_{i_{n+1}}) ≤ (1/a) sup_k φ_0(f̂_{i_{n+1},k}(Z_{n+1}), δ_{k,y_{i_{n+1}}}) + ( b K_{i_{n+1},i_{n+1}} / (cλn) )^p.  (3)\n\nIn fact, if f̂(Z_n) does not make an error on the i_{n+1}-th example, then the inequality automatically holds. Otherwise, assume that f̂(Z_n) makes an error on the i_{n+1}-th example; then there exists k_0 ≠ y_{i_{n+1}} such that f̂_{i_{n+1},y_{i_{n+1}}}(Z_n) ≤ f̂_{i_{n+1},k_0}(Z_n). If we let d = (inf{x : φ_0(x, 1) ≤ a} + sup{x : φ_0(x, 0) ≤ a})/2, then either f̂_{i_{n+1},y_{i_{n+1}}}(Z_n) ≤ d or f̂_{i_{n+1},k_0}(Z_n) ≥ d. By the definition of c and d, it follows that there exists k = k_0 or k = y_{i_{n+1}} such that either φ_0(f̂_{i_{n+1},k}(Z_{n+1}), δ_{k,y_{i_{n+1}}}) > a or |f̂_{i_{n+1},k}(Z_{n+1}) - f̂_{i_{n+1},k}(Z_n)| ≥ c/2. Using (2), we have either φ_0(f̂_{i_{n+1},k}(Z_{n+1}), δ_{k,y_{i_{n+1}}}) > a or b K_{i_{n+1},i_{n+1}}/(2λn) ≥ c/2, implying that err(f̂_{i_{n+1},·}(Z_n), y_{i_{n+1}}) = 1 ≤ (1/a) φ_0(f̂_{i_{n+1},k}(Z_{n+1}), δ_{k,y_{i_{n+1}}}) + ( b K_{i_{n+1},i_{n+1}} / (cλn) )^p. This proves (3).\n\nWe are now ready to prove Theorem 1 using (3). For every j ∈ Z_{n+1}, denote by Z_{n+1}^{(j)} the subset of n samples in Z_{n+1} with the j-th data point left out. By (3), we have err(f̂_{j,·}(Z_{n+1}^{(j)}), y_j) ≤ (1/a) φ(f̂_{j,·}(Z_{n+1}), y_j) + ( b K_{j,j} / (cλn) )^p. We thus obtain, for all f ∈ R^{mK}:\n\n  E_{Z_n} (1/(m-n)) Σ_{j ∈ Z̄_n} err(f̂_{j,·}(Z_n), y_j) = E_{Z_{n+1}} (1/(n+1)) Σ_{j ∈ Z_{n+1}} err(f̂_{j,·}(Z_{n+1}^{(j)}), y_j)\n  ≤ E_{Z_{n+1}} (1/(n+1)) Σ_{j ∈ Z_{n+1}} [ (1/a) φ(f̂_{j,·}(Z_{n+1}), y_j) + ( b K_{j,j}/(cλn) )^p ]\n  ≤ (n/(a(n+1))) E_{Z_{n+1}} [ (1/n) Σ_{j ∈ Z_{n+1}} φ(f_{j,·}, y_j) + λ f^T Q_K f ] + E_{Z_{n+1}} (1/(n+1)) Σ_{j ∈ Z_{n+1}} ( b K_{j,j}/(cλn) )^p\n  ≤ (1/a) [ (1/m) Σ_{j=1}^m φ(f_{j,·}, y_j) + λ f^T Q_K f ] + ( b tr_p(K)/(cλn) )^p,\n\nwhere the third line uses the fact that f̂(Z_{n+1}) minimizes the regularized empirical risk on Z_{n+1}. Since this holds for all f, the theorem follows. 2\n\nThe formulation used here corresponds to the one-versus-all method for multi-category classification. For the SVM loss φ_0(x, y) = max(0, 1 - (2x - 1)(2y - 1)), we may take a = 0.5, b = 2, and c = 0.5. In the experiments reported here, we shall employ the least squares function φ_0(x, y) = (x - y)^2, which is widely used for graph learning. With this formulation, we may choose a = 1/16, b = 0.5, c = 0.5 in Theorem 1.\n\n3 Laplacian regularization\nConsider an undirected graph G = (V, E) defined on the nodes V = {v_j : j = 1, ..., m}, with edges E ⊂ {1, ..., m} × {1, ..., m} and weights w_{j,j'} ≥ 0 associated with edges (j, j') ∈ E. For simplicity, we assume that (j, j) ∉ E and that w_{j,j'} = 0 when (j, j') ∉ E. Let deg_j(G) = Σ_{j'=1}^m w_{j,j'} be the degree of node j of graph G. We consider the following definition of normalized Laplacian.\n\nDefinition 1 Consider a graph G = (V, E) of m nodes with weights w_{j,j'} (j, j' = 1, ..., m). 
The unnormalized Laplacian matrix L(G) ∈ R^{m×m} is defined as: L_{j,j'}(G) = -w_{j,j'} if j ≠ j', and deg_j(G) otherwise. Given m scaling factors S_j (j = 1, ..., m), let S = diag({S_j}). The S-normalized Laplacian matrix is defined as: L_S(G) = S^{-1/2} L(G) S^{-1/2}. The corresponding regularization is based on:\n\n  f_{·,k}^T L_S(G) f_{·,k} = (1/2) Σ_{j,j'=1}^m w_{j,j'} ( f_{j,k}/√S_j - f_{j',k}/√S_{j'} )^2.\n\nA common choice of S is S = I, corresponding to regularizing with the unnormalized Laplacian L. The idea is natural: we assume that the predictive values f_{j,k} and f_{j',k} should be close when (j, j') ∈ E with a strong link. Another common choice is to normalize by S_j = deg_j(G) (i.e., S = D) so that the diagonals of L_S become all one [2, 3, 4, 7].\n\nDefinition 2 Given labels y = {y_j}_{j=1,...,m} on V, we define the cut for L_S in Definition 1 as:\n\n  cut(L_S, y) = (1/2) Σ_{j,j': y_j ≠ y_{j'}} w_{j,j'} ( 1/S_j + 1/S_{j'} ) + (1/2) Σ_{j,j': y_j = y_{j'}} w_{j,j'} ( 1/√S_j - 1/√S_{j'} )^2.\n\nUnlike typical graph-theoretical definitions of graph cut, this learning theoretical definition penalizes not only between-class edge weights but also within-class edge weights when such an edge connects two nodes with different scaling factors. This penalization is intuitive if we look at the regularizer in Definition 1, which encourages f_{j,k}/√S_j to be similar to f_{j',k}/√S_{j'} when w_{j,j'} is large. If j and j' belong to the same class, we want f_{j,k} to be similar to f_{j',k}. Therefore for such an in-class pair (j, j'), we want to have S_j ≈ S_{j'}. This penalization has important consequences, which we will investigate later in the paper. For the unnormalized Laplacian (i.e., S_j = 1), the second term on the right hand side of Definition 2 vanishes, and our learning theoretical definition becomes identical to the standard graph-theoretical definition: cut(L, y) = Σ_{j,j': y_j ≠ y_{j'}} w_{j,j'}. We consider K in (1) defined as follows: K = (αS^{-1} + L_S(G))^{-1}, where α > 0 is a tuning parameter to make K strictly positive definite. This parameter is important. 
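The quadratic-form identity behind the regularizer in Definition 1 can be checked numerically. A minimal NumPy sketch (helper names are ours):

```python
import numpy as np

def laplacian(W):
    """Unnormalized Laplacian L(G) = D - W (Definition 1)."""
    return np.diag(W.sum(axis=1)) - W

def s_normalized_laplacian(W, S):
    """S-normalized Laplacian L_S(G) = S^{-1/2} L(G) S^{-1/2}."""
    d = 1.0 / np.sqrt(S)
    return d[:, None] * laplacian(W) * d[None, :]

# Check the identity
#   f^T L_S f = (1/2) sum_{j,j'} w_{j,j'} (f_j/sqrt(S_j) - f_{j'}/sqrt(S_{j'}))^2
# on a random symmetric weight matrix with zero diagonal.
rng = np.random.default_rng(0)
W = rng.random((5, 5)); W = (W + W.T) / 2; np.fill_diagonal(W, 0.0)
S = rng.random(5) + 0.5
f = rng.standard_normal(5)
lhs = f @ s_normalized_laplacian(W, S) @ f
g = f / np.sqrt(S)
rhs = 0.5 * ((g[:, None] - g[None, :]) ** 2 * W).sum()
assert np.isclose(lhs, rhs)
```

With S = I the function reduces to the unnormalized Laplacian, matching the S = I case discussed in the text.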
For simplicity, we state the generalization bound based on Theorem 1 with optimal λ. Note that in applications, λ is usually tuned through cross validation. Therefore assuming optimal λ simplifies the bound so that we can focus on the more essential characteristics of generalization performance.\n\nTheorem 2 Let the conditions of Theorem 1 hold with the regularization condition K = (αS^{-1} + L_S(G))^{-1}. Assume that φ_0(0, 0) = φ_0(1, 1) = 0. Then for all p > 0, there exists a sample-independent regularization parameter λ in (1) such that the expected generalization error is bounded by:\n\n  E_{Z_n} (1/(m-n)) Σ_{j ∈ Z̄_n} err(f̂_{j,·}(Z_n), y_j) ≤ ( C_p(a, b, c) / n^{p/(p+1)} ) ( αs̄ + cut(L_S, y) )^{p/(p+1)} tr_p(K)^{p/(p+1)},\n\nwhere C_p(a, b, c) = (b/(ac))^{p/(p+1)} ( p^{1/(p+1)} + p^{-p/(p+1)} ) and s̄ = Σ_{j=1}^m S_j^{-1}.\n\nProof. Let f_{j,k} = δ_{y_j,k}. It can be easily verified that (1/m) Σ_{j=1}^m φ(f_{j,·}, y_j) + λ f^T Q_K f = λ ( αs̄ + cut(L_S, y) ). Now, we simply use this expression in Theorem 1, and then optimize over λ. 2\n\nThis theorem relates graph cut to generalization performance. The conditions on the loss function in Theorem 2 hold for least squares with b/(ac) = 16. It also applies to other standard loss functions such as SVM. With p fixed, the generalization error decreases at the rate O(n^{-p/(p+1)}) as n increases. This rate of convergence is faster when p increases. However, in general, tr_p(K) is an increasing function of p. Therefore we have a trade-off between the two terms. The bound also suggests that if we normalize the diagonal entries of K such that K_{j,j} is a constant, then tr_p(K) is independent of p, and thus a larger p can be used in the bound. This motivates the idea of normalizing the diagonals of K. Our goal is to better understand how the quantity (αs̄ + cut(L_S, y))^{p/(p+1)} tr_p(K)^{p/(p+1)} is related to properties of the graph, which gives a better understanding of graph-based learning. 
Definition 3 A subgraph G_0 = (V_0, E_0) of G = (V, E) is a pure component if G_0 is connected, E_0 is induced by restricting E to V_0, and the labels y have identical values on V_0. A pure subgraph G' = ∪_{ℓ=1}^q G^{(ℓ)} of G divides V into q disjoint sets V = ∪_{ℓ=1}^q V_ℓ such that each subgraph G^{(ℓ)} = (V_ℓ, E_ℓ) is a pure component. Denote by λ_i(G^{(ℓ)}) = λ_i(L(G^{(ℓ)})) the i-th smallest eigenvalue of L(G^{(ℓ)}).\n\nIf we remove all edges of G that connect nodes with different labels, then the resulting subgraph is a pure subgraph (but not the only one). For each pure component G^{(ℓ)}, its first eigenvalue λ_1(G^{(ℓ)}) is always zero. The second eigenvalue λ_2(G^{(ℓ)}) > 0, and it measures how well-connected G^{(ℓ)} is [2].\n\nTheorem 3 Let the assumptions of Theorem 2 hold, and let G' = ∪_{ℓ=1}^q G^{(ℓ)} (G^{(ℓ)} = (V_ℓ, E_ℓ)) be a pure subgraph of G. For all p ≥ 1, there exist sample-independent λ and α such that the generalization performance of (1), E_{Z_n} Σ_{j ∈ Z̄_n} err(f̂_{j,·}, y_j)/(m-n), is bounded by\n\n  ( C_p(a, b, c) / n^{p/(p+1)} ) [ s̄^{1/2} ( Σ_{ℓ=1}^q s̄_ℓ(p)/(m m_ℓ^p) )^{1/(2p)} + cut(L_S, y)^{1/2} ( Σ_{ℓ=1}^q s̄_ℓ(p)/(m λ_2(G^{(ℓ)})^p) )^{1/(2p)} ]^{2p/(p+1)},\n\nwhere m_ℓ = |V_ℓ|, s̄ = Σ_{j=1}^m S_j^{-1}, and s̄_ℓ(p) = Σ_{j ∈ V_ℓ} S_j^p.\n\nProof sketch. We simply upper bound tr_p(K) in terms of λ_2(G^{(ℓ)}) and s̄_ℓ, where K = (αS^{-1} + L_S)^{-1}. Substitute this estimate into Theorem 2 and optimize over α. 2\n\nTo put this into perspective, suppose that we use the unnormalized Laplacian regularizer on a zero-cut graph. Then S = I and cut(L_S, y) = 0, and by letting p = 1 and p → ∞ in Theorem 3, we have:\n\n  E_{Z_n} (1/(m-n)) Σ_{j ∈ Z̄_n} err(f̂_{j,·}, y_j) ≤ 2 √( bq/(acn) )   and   E_{Z_n} (1/(m-n)) Σ_{j ∈ Z̄_n} err(f̂_{j,·}, y_j) ≤ (b/(ac)) · m/(n min_ℓ m_ℓ).\n\nThat is, in the zero-cut case, the generalization performance can be bounded as O(√(q/n)). We can also achieve a faster convergence rate of O(1/n), but the bound then depends on m/(min_ℓ m_ℓ) ≥ q. This implies that we will achieve better convergence at the O(q/n) level if the sizes of the components are balanced, while the convergence may behave like O(√(q/n)) otherwise. 
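The proof sketch controls tr_p(K) through the algebraic connectivity λ_2 of each pure component. As a numeric illustration for the unnormalized case S = I (the displayed per-diagonal bound K_{jj} ≤ 1/(αm_ℓ) + 1/λ_2(G^{(ℓ)}) is our reading of the argument, not a formula stated verbatim in the paper):

```python
import numpy as np

# A single pure component: a clique on m_l = 5 nodes, S = I, alpha = 0.1.
m_l, alpha = 5, 0.1
W = np.ones((m_l, m_l)) - np.eye(m_l)
L = np.diag(W.sum(axis=1)) - W
# lambda_2: the second-smallest eigenvalue (algebraic connectivity).
lam2 = np.sort(np.linalg.eigvalsh(L))[1]
K = np.linalg.inv(alpha * np.eye(m_l) + L)
# Spectral decomposition of (alpha*I + L)^{-1}: the zero eigenvalue of L
# contributes 1/(alpha*m_l) to each diagonal entry, and all remaining
# eigenvalues are at least lambda_2, giving K_{jj} <= 1/(alpha*m_l) + 1/lambda_2.
assert np.all(np.diag(K) <= 1.0 / (alpha * m_l) + 1.0 / lam2 + 1e-12)
```

For the clique, λ_2 equals m_ℓ, so the component is maximally well-connected in the sense used after Definition 3.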
3.1 Near zero-cut optimum scaling factors\nThe above observation motivates a scaling matrix S that compensates for unbalanced pure component sizes. From Definition 2 and Theorem 2, we know that good scaling factors should be approximately constant within each class. Here we focus on the case that the scaling factors are constant within each pure component (S_j = s_ℓ when j ∈ V_ℓ) in order to derive optimum scaling factors. Let us define\n\n  cut(G', y) = Σ_{j,j': y_j ≠ y_{j'}} w_{j,j'} + Σ_{ℓ ≠ ℓ'} Σ_{j ∈ V_ℓ, j' ∈ V_{ℓ'}} w_{j,j'}/2.\n\nIn Theorem 3, when we use cut(L_S, y) ≤ 2 cut(G', y)/min_ℓ s_ℓ, let p → ∞, and assume that cut(G', y) is sufficiently small, the dominant term of the bound becomes (b/(acn)) max_ℓ (s_ℓ/m_ℓ) Σ_{ℓ=1}^q m_ℓ/s_ℓ, which can then be optimized with the choice s_ℓ = m_ℓ, and the resulting bound becomes:\n\n  E_{Z_n} (1/(m-n)) Σ_{j ∈ Z̄_n} err(f̂_{j,·}, y_j) ≤ (b/(acn)) ( √q + √( 2 cut(G', y)/(u(G') min_ℓ m_ℓ) ) )^2,\n\nwhere u(G') = min_ℓ ( λ_2(G^{(ℓ)})/m_ℓ ). Hence, if cut(G', y) is small, then we should choose s_ℓ ≈ m_ℓ for each pure component, so that the generalization performance is approximately (ac)^{-1} bq/n.\n\nThe analysis provided here not only formally shows the importance of normalization in the learning theoretical framework but also suggests that a good normalization factor for each node j is approximately the size of the well-connected pure component that contains node j (assuming that nodes belonging to different pure components are only weakly connected). The commonly practiced degree-based normalization method S_j = deg_j(G) provides such good normalization factors under a simplified \"box model\" used in early studies, e.g., [4]. In this model, each node connects to itself and all other nodes of the same pure component with edge weight w_{j,j'} = 1. The degree is thus deg_j(G^{(ℓ)}) = |V_ℓ| = m_ℓ, which gives the optimal scaling in our analysis. However, in general, the box model may not be a good approximation for practical problems. A more realistic approximation, which we call the core-satellite model, will be introduced in the experimental section. 
For such a model, the degree-based normalization can fail because deg_j(G^{(ℓ)}) within each pure component G^{(ℓ)} is not approximately constant (thus raising cut(L_S, y)), and it may not be proportional to m_ℓ.\n\nOur remedy is as follows. Let K̄ = (αI + L)^{-1} be the kernel matrix corresponding to the unnormalized Laplacian. Let v_ℓ ∈ R^m be the vector whose j-th entry is 1 if j ∈ V_ℓ and 0 otherwise. Then it is easy to verify that for small α and near-zero cut(G', y), we have K̄ = Σ_{ℓ=1}^q v_ℓ v_ℓ^T/(αm_ℓ) + O(1), so that K̄_{j,j} ≈ (αm_ℓ)^{-1} for each j ∈ V_ℓ. Therefore the scaling factor S_j = 1/K̄_{j,j} is nearly optimal. We call this method of normalization (S_j = 1/K̄_{j,j}, K = (αS^{-1} + L_S)^{-1}) K-scaling in this paper, as it approximately scales the kernel matrix K so that each K_{j,j} ≈ 1. By contrast, we call the standard degree-based normalization (S_j = deg_j(G), K = (αI + L_S)^{-1}) L-scaling, as it scales the diagonals of L_S to 1. Although K-scaling coincides with a common practice in standard kernel learning, it is important to note that showing this method behaves well in the graph learning setting is non-trivial and novel. In fact, this normalization method had not been proposed in the graph learning setting before this work. Without the learning theoretical results developed here, it is not obvious whether this method should work better than the commonly practiced degree-based normalization.\n\n4 Dimension Reduction\nNormalization and dimension reduction have been commonly used in spectral clustering, e.g., [3, 4]. For semi-supervised learning, dimension reduction (without normalization) is known to improve performance [1, 6], while normalization (without dimension reduction) has also been explored [7]. An appropriate combination of normalization and dimension reduction can further improve performance. We shall first introduce dimension reduction with the normalized Laplacian L_S(G). Denote by P_S^r(G) the projection operator onto the eigenspace of αS^{-1} + L_S(G) corresponding to the r smallest eigenvalues. 
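The two normalization schemes can be sketched directly from their definitions. In the toy example below (the clique sizes and the value of α are illustrative choices of ours), K-scaling recovers factors proportional to the component size m_ℓ, as the analysis predicts for the near zero-cut case:

```python
import numpy as np

def k_scaling_factors(W, alpha):
    """K-scaling: S_j = 1 / Kbar_{jj} with Kbar = (alpha*I + L)^{-1}."""
    L = np.diag(W.sum(axis=1)) - W
    Kbar = np.linalg.inv(alpha * np.eye(len(W)) + L)
    return 1.0 / np.diag(Kbar)

def l_scaling_factors(W):
    """L-scaling: the standard degree-based choice S_j = deg_j(G)."""
    return W.sum(axis=1)

# Two disjoint cliques of sizes 4 and 2 (an idealized zero-cut graph).
W = np.zeros((6, 6))
W[:4, :4] = 1 - np.eye(4)
W[4:, 4:] = 1 - np.eye(2)
S = k_scaling_factors(W, alpha=0.01)
ratio = S[0] / S[4]   # approx m_1/m_2 = 2 for small alpha
assert abs(ratio - 2.0) < 0.1
```

By contrast, `l_scaling_factors` returns the degrees (3 and 1 here), which are proportional to component size only under the box model discussed above.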
Now, we may define the following regularizer on the reduced subspace:\n\n  f_{·,k}^T K^{-1} f_{·,k} = f_{·,k}^T K_0^{-1} f_{·,k} if P_S^r(G) f_{·,k} = f_{·,k}, and +∞ otherwise.  (4)\n\nNote that we will focus on bounding the generalization complexity using the reduced dimensionality r. In this context, the choice of K_0 is not important. For example, we may simply choose K_0 = I. The benefit of dimension reduction in graph learning has been investigated in [6], under the spectral kernel design framework. Note that the normalization issue, which changes the eigenvectors and their ordering, was not investigated there. The following theorem shows that the target vectors can be well approximated by their projection onto P_S^q(G). We skip the proof due to the space limitation.\n\nTheorem 4 Let G' = ∪_{ℓ=1}^q G^{(ℓ)} (G^{(ℓ)} = (V_ℓ, E_ℓ)) be a pure subgraph of G. Consider r ≥ q, so that λ_{r+1}(L_S(G)) ≥ λ_{r+1}(L_S(G')) ≥ min_ℓ λ_2(L_S(G^{(ℓ)})). For each k, let f_{j,k} = δ_{y_j,k} be the target (encoding of the true labels) for class k (j = 1, ..., m). Then\n\n  ‖ P_S^r(G) f_{·,k} - f_{·,k} ‖_2 ≤ δ_r(S) ‖ f_{·,k} ‖_2,\n\nwhere δ_r(S) = ( ‖L_S(G) - L_S(G')‖_2 + d(S) ) / λ_{r+1}(L_S(G)) and d(S) = max_ℓ (1/(2|V_ℓ|)) Σ_{j,j' ∈ V_ℓ} ( S_j^{-1/2} - S_{j'}^{-1/2} )^2.\n\nWe can prove a generalization bound using Theorem 4. For simplicity, we only consider the least squares loss φ(f_{j,·}, y_j) = Σ_{k=1}^K (f_{j,k} - δ_{k,y_j})^2 in (1), using the regularization (4) with K_0 = I. With p = 1, we have (1/m) Σ_{j=1}^m φ(f_{j,·}, y_j) ≤ δ_r(S)^2 + λm. It is also equivalent to take K_0 = P_S^r(G) due to the dimension reduction, so that we can use tr(K) = r. Now from Theorem 1 with a = 1/16, b = 0.5, c = 0.5, we have E_{Z_n} (1/(m-n)) Σ_{j ∈ Z̄_n} err(f̂_{j,·}, y_j) ≤ 16 ( δ_r(S)^2 + λm + r/(λnm) ). By optimizing over λ, we obtain\n\n  E_{Z_n} (1/(m-n)) Σ_{j ∈ Z̄_n} err(f̂_{j,·}, y_j) ≤ 16 δ_r(S)^2 + 32 √(r/n).  (5)\n\nThe analysis of optimum scaling factors is analogous to Section 3.1, and the conclusions there hold. Compared to Theorem 3, the advantage of dimension reduction in (5) is that the quantity cut(L_S, y) is replaced by ‖L_S(G) - L_S(G')‖_2, which is typically much smaller. 
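The projector P_S^r(G) can be formed from an eigendecomposition. A NumPy sketch on an idealized zero-cut graph (the construction and names are ours): with q pure components and S = I, the class-indicator targets lie in the span of the q smallest eigenvectors, so the projection error of Theorem 4 is essentially zero.

```python
import numpy as np

def projection_onto_r_smallest(W, S, alpha, r):
    """P_S^r(G): orthogonal projector onto the eigenspace of
    alpha*S^{-1} + L_S(G) for the r smallest eigenvalues."""
    d = 1.0 / np.sqrt(S)
    L_S = d[:, None] * (np.diag(W.sum(axis=1)) - W) * d[None, :]
    M = alpha * np.diag(1.0 / S) + L_S
    _, V = np.linalg.eigh(M)   # eigh returns eigenvalues in ascending order
    Vr = V[:, :r]
    return Vr @ Vr.T

# Two disjoint cliques (q = 2 pure components), S = I.
W = np.zeros((6, 6))
W[:4, :4] = 1 - np.eye(4)
W[4:, 4:] = 1 - np.eye(2)
P = projection_onto_r_smallest(W, S=np.ones(6), alpha=0.01, r=2)
f = np.array([1., 1., 1., 1., 0., 0.])   # target f_{.,k} for one class
assert np.allclose(P @ f, f, atol=1e-8)
```

On a real graph (non-zero cut), P f differs from f, and Theorem 4 bounds this difference through ‖L_S(G) - L_S(G')‖_2.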
Instead of a rigorous analysis, we shall just give a brief intuition. For simplicity we take S = I so that we can ignore the variations caused by S. The 2-norm of the symmetric error matrix L_S(G) - L_S(G') is its largest eigenvalue, which is no more than the largest 1-norm of its row vectors. In contrast, cut(L_S, y) behaves like the absolute sum of the entries of the error matrix, which is m times the average 1-norm of its row vectors. Therefore, if the error is relatively uniform across rows, then cut(L_S, y) can be on the order of m times larger than ‖L_S(G) - L_S(G')‖_2.\n\n5 Experiments\nWe test the three types of kernel matrix K (unnormalized, normalized by K-scaling, or normalized by L-scaling) with two regularization methods: the first uses K without dimension reduction, and the second reduces the dimension of K^{-1} to the eigenvectors corresponding to the smallest r eigenvalues and regularizes with f^T Q_K f if P_S^r(G) f = f and +∞ otherwise. We are particularly interested in how well K-scaling performs.\n\nFrom m data points, n labeled training examples are randomly chosen while ensuring that at least one training example is chosen from each class. The remaining m - n data points serve as test data. The regularization parameter λ is chosen by cross validation on the n labeled training examples. We will show performance either when the rest of the parameters (α and the dimensionality r) are also chosen by cross validation or when they are set to the optimum (oracle performance). The dimensionality r is chosen from {K, K+5, K+10, ..., 100}, where K is the number of classes, unless otherwise specified. Our focus is on small n close to the number of classes. Throughout this section, we conduct 10 runs with random training/test splits and report the average accuracy. We use the one-versus-all strategy with the least squares loss φ_k(a, b) = (a - δ_{k,b})^2. 
Controlled data experiments\nThe purpose of the controlled data experiments is to observe the correlation of the effectiveness of the normalization methods with graph properties. The graphs we generate contain 2000 nodes, each of which is assigned one of 10 classes. We show the results when dimension reduction is applied to the three types of matrix K. The performance is averaged over 10 random splits, with error bars representing one standard deviation.\n\nFigure 1: Classification accuracy (%). (a) Graphs with near-constant within-class degrees. (b) Core-satellite graphs. n = 40, m = 2000. With dimension reduction (dim 20; chosen by cross validation). Bars compare Unnormalized, L-scaling, and K-scaling on graphs 1-3 (panel a) and graphs 6-10 (panel b).\n\nFigure 1 (a) shows classification accuracy on three graphs that were generated so that the node degrees (of either correct edges or erroneous edges) are close to constant within each class but vary across classes. On these graphs, both K-scaling and L-scaling significantly improve classification accuracy over the unnormalized baseline. There is not much difference between K-scaling and L-scaling.\n\nObserve that K-scaling and L-scaling perform differently on the graphs used in Figure 1 (b). These five graphs have the following properties. Each class consists of core nodes and satellite nodes. Core nodes of the same class are tightly connected with each other and do not have any erroneous edges. Satellite nodes are relatively weakly connected to core nodes of the same class. They are also connected to some other classes' satellite nodes (i.e., introducing errors). This core-satellite model is intended to simulate real-world data in which some data points are close to the class boundaries (satellite nodes). 
For graphs generated in this manner, degrees vary within the same class, since the satellite nodes have smaller degrees than the core nodes. Our analysis suggests that L-scaling will do poorly. Figure 1 (b) shows that on the five core-satellite graphs, K-scaling indeed produces higher performance than L-scaling. In particular, K-scaling does well even when L-scaling underperforms the unnormalized baseline.\n\nReal-world data experiments\nOur real-world data experiments use an image data set (MNIST) and a text data set (RCV1). The MNIST data set, downloadable from http://yann.lecun.com/exdb/mnist/, consists of hand-written digit image data (representing 10 classes, from digit \"0\" to \"9\"). For our experiments, we randomly choose 2000 images (i.e., m = 2000). Reuters Corpus Version 1 (RCV1) consists of news articles labeled with topics. For our experiments, we chose 10 topics (ranging from sports to labor issues; representing 10 classes) that have relatively large populations and randomly chose 2000 articles that are labeled with exactly one of those 10 topics.\n\nTo generate graphs from the image data, as is commonly done, we first generate the vectors of the gray-scale values of the pixels, and produce the edge weight between the i-th and the j-th data points X_i and X_j by w_{i,j} = exp(-‖X_i - X_j‖^2/t), where t > 0 is a parameter (RBF kernels). To generate graphs from the text data, we first create the bag-of-word vectors and then set w_{i,j} based on RBF as above. As our baseline, we test the supervised configuration by letting W + I be the kernel matrix and using the same least squares loss function, with the oracle λ that is optimal. Figure 2 (a-1, a-2) shows performance in relation to the number of labeled examples (n) on the MNIST data set. 
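The RBF edge-weight construction can be sketched as follows (a dense version; the paper additionally restricts the RCV1 graph to nearest neighbors, e.g., 100NN, which this sketch omits):

```python
import numpy as np

def rbf_graph(X, t):
    """Edge weights w_{i,j} = exp(-||X_i - X_j||^2 / t), zero diagonal."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-sq / t)
    np.fill_diagonal(W, 0.0)   # no self-loops, matching Section 3
    return W

# Nearby points get weight near 1; distant pairs decay toward 0.
X = np.array([[0.0, 0.0], [0.1, 0.0], [3.0, 3.0]])
W = rbf_graph(X, t=1.0)
assert W[0, 1] > 0.9 and W[0, 2] < 1e-6
```

The bandwidth t plays the role of the parameter t in the text (e.g., t = 0.25 for RCV1); the graph W can then be fed into any of the Laplacian constructions above.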
The comparison of the three bold lines (representing the methods with dimension reduction) in Figure 2 (a-1) shows that when the dimensionality and α are determined by cross validation, K-scaling outperforms L-scaling, and L-scaling outperforms the unnormalized Laplacian.\n\nFigure 2: Classification accuracy (%) versus sample size n (m = 2000). (a-1) MNIST, dim and α determined by cross validation. (a-2) MNIST, dim and α set to the optimum. (b-1) RCV1, dim and α determined by cross validation. (b-2) RCV1, dim and α set to the optimum. Lines: Supervised baseline; Unnormalized, L-scaling, and K-scaling, each with and without dimension reduction.\n\nThe performance differences among these three are statistically significant (p ≤ 0.01) based on the paired t test. The performance of the unnormalized Laplacian (with dimension reduction) is roughly consistent with the performance with similar (m, n) with heuristic dimension selection in [1]. Without dimension reduction, L-scaling and K-scaling still improve performance over the unnormalized Laplacian. The best performance is always obtained by K-scaling with dimension reduction.\n\nIn Figure 2 (a-1), the unnormalized Laplacian with dimension reduction underperforms the unnormalized Laplacian without dimension reduction, indicating that dimension reduction rather degrades performance. By comparing Figure 2 (a-1) and (a-2), we observe that this seemingly counterintuitive performance trend is caused by the difficulty of choosing the right dimensionality by cross validation. 
Figure 2 (a-2) shows the performance at the oracle optimal dimensionality and α. As observed, if the optimal dimensionality is known (as in (a-2)), dimension reduction improves performance either with or without normalization by K-scaling or L-scaling, and all transductive configurations outperform the supervised baseline. We also note that the comparison of Figure 2 (a-1) and (a-2) shows that choosing a good dimensionality by cross validation is much harder than choosing λ by cross validation, especially when the number of labeled examples is small.\n\nOn the RCV1 data set, the performance trend is similar to that of MNIST. Figure 2 (b-1, b-2) shows the performance on RCV1 using the RBF kernel (t = 0.25, 100NN). In the setting of Figure 2 (b-1), where the dimensionality and α were determined by cross validation, K-scaling with dimension reduction generally performs the best. By setting the dimensionality and α to the optimum, the benefit of K-scaling with dimension reduction is even clearer (Figure 2 (b-2)). Its performance differences from the second and third best `L-scaling (w/ dim redu.)' and `Unnormalized (w/ dim redu.)' are statistically significant (p ≤ 0.01) in both Figure 2 (b-1) and (b-2).\n\nIn our experiments, K-scaling with dimension reduction consistently outperformed the others. Without dimension reduction, K-scaling and L-scaling are not always effective. This is consistent with our analysis. On real data, the cut is not near-zero, and the effect of normalization is unclear (Section 3.1); however, when the dimension is reduced, ‖L_S(G) - L_S(G')‖_2 (corresponding to the cut) can be much smaller (Section 4), which suggests that K-scaling should improve performance.\n\n6 Conclusion\nWe derived generalization bounds for learning on graphs with Laplacian regularization, using properties of the graph. In particular, we explained the importance of Laplacian normalization and dimension reduction for graph learning. 
We argued that the standard L-scaling normalization method has the undesirable property that the normalization factors can vary significantly within a pure component. An alternative normalization method, which we call K-scaling, is proposed to remedy the problem. Experiments confirm the superiority of this normalization scheme.\n\nReferences\n[1] M. Belkin and P. Niyogi. Semi-supervised learning on Riemannian manifolds. Machine Learning, Special Issue on Clustering:209-239, 2004.\n[2] F. R. Chung. Spectral Graph Theory. Regional Conference Series in Mathematics. American Mathematical Society, Rhode Island, 1998.\n[3] A. Y. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In NIPS, pages 849-856, 2001.\n[4] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 22:888-905, 2000.\n[5] M. Szummer and T. Jaakkola. Partially labeled classification with Markov random walks. In NIPS 2001, 2002.\n[6] T. Zhang and R. K. Ando. Analysis of spectral kernel design based semi-supervised learning. In NIPS, 2006.\n[7] D. Zhou, O. Bousquet, T. Lal, J. Weston, and B. Schölkopf. Learning with local and global consistency. In NIPS 2003, pages 321-328, 2004.\n[8] X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In ICML 2003, 2003.\n", "award": [], "sourceid": 3148, "authors": [{"given_name": "Rie", "family_name": "Ando", "institution": null}, {"given_name": "Tong", "family_name": "Zhang", "institution": null}]}