{"title": "Analyzing the Harmonic Structure in Graph-Based Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 3129, "page_last": 3137, "abstract": "We show that, either explicitly or implicitly, various well-known graph-based models exhibit a common significant \\emph{harmonic} structure in their target functions -- the value of a vertex is approximately the weighted average of the values of its adjacent neighbors. Understanding this structure and analyzing the loss defined over it help reveal important properties of the target function over a graph. In this paper, we show that the variation of the target function across a cut can be upper and lower bounded by the ratio of its harmonic loss and the cut cost. We use this to develop an analytical tool and analyze five popular models in graph-based learning: absorbing random walks, partially absorbing random walks, hitting times, the pseudo-inverse of the graph Laplacian, and eigenvectors of the Laplacian matrices. Our analysis explains several open questions about these models reported in the literature. Furthermore, it provides theoretical justifications and guidelines for their practical use. Simulations on synthetic and real datasets support our analysis.", "full_text": "Analyzing the Harmonic Structure\n\nin Graph-Based Learning\n\nXiao-Ming Wu1, Zhenguo Li3, and Shih-Fu Chang1,2\n\n1Department of Electrical Engineering, Columbia University\n\n2Department of Computer Science, Columbia University\n\n3Huawei Noah's Ark Lab, Hong Kong\n\n{xmwu, sfchang}@ee.columbia.edu,\n\nli.zhenguo@huawei.com\n\nAbstract\n\nWe find that various well-known graph-based models exhibit a common important\nharmonic structure in their target functions – the value of a vertex is approximately\nthe weighted average of the values of its adjacent neighbors. 
Understanding of\nsuch structure and analysis of the loss de\ufb01ned over such structure help reveal im-\nportant properties of the target function over a graph. In this paper, we show that\nthe variation of the target function across a cut can be upper and lower bounded by\nthe ratio of its harmonic loss and the cut cost. We use this to develop an analytical\ntool and analyze \ufb01ve popular graph-based models: absorbing random walks, par-\ntially absorbing random walks, hitting times, pseudo-inverse of the graph Lapla-\ncian, and eigenvectors of the Laplacian matrices. Our analysis sheds new insights\ninto several open questions related to these models, and provides theoretical justi-\n\ufb01cations and guidelines for their practical use. Simulations on synthetic and real\ndatasets con\ufb01rm the potential of the proposed theory and tool.\n\n1 Introduction\n\nVarious graph-based models, regardless of application, aim to learn a target function on graphs that\nwell respects the graph topology. This has been done under different motivations such as Laplacian\nregularization [4, 5, 6, 14, 24, 25, 26], random walks [17, 19, 23, 26], hitting and commute times\n[10], p-resistance distances [1], pseudo-inverse of the graph Laplacian [10], eigenvectors of the\nLaplacian matrices [18, 20], diffusion maps [8], to name a few. Whether these models can capture\nthe graph structure faithfully, or whether their target functions possess desirable properties over\nthe graph, remain unclear. Understanding of such issues can be of great value in practice and has\nattracted much attention recently [16, 22, 23].\n\nSeveral important observations about learning on graphs have been reported. Nadler et al. [16]\nshowed that the target functions of Laplacian regularized methods become \ufb02at as the number of\nunlabeled points increases, but they also observed that a good classi\ufb01cation can still be obtained\nif an appropriate threshold is used. 
An explanation of this would be interesting. Von Luxburg\net al. [22] proved that commute and hitting times are dominated by the local structures in large\ngraphs, ignoring the global patterns. Does this mean these metrics are flawed? Interestingly, despite\nthis finding, the pseudo-inverse of the graph Laplacian, known as the kernel matrix of commute times,\nconsistently delivers superior performance in collaborative filtering [10]. In spectral clustering, the eigenvectors\nof the normalized graph Laplacian are preferred to those of the un-normalized one [20, 21].\nAlso, for the recently proposed partially absorbing random walks [23], certain settings of absorption\nrates seem better than others. While these issues arise from seemingly unrelated contexts, we will\nshow in this paper that they can be addressed in a single framework.\n\nOur starting point is the discovery of a common structure hidden in the target functions of various\ngraph models. That is, the value of a vertex is approximately the weighted average of the values\nof its adjacent neighbors. We call this structure the harmonic structure for its resemblance to the\nharmonic function [9, 26]. It naturally arises from the first step analysis of random walk models,\nand, as will be shown in this paper, implicitly exists in other methods such as the pseudo-inverse of the\ngraph Laplacian and eigenvectors of the Laplacian matrices. The target functions of these models\nare characterized by their harmonic loss, a quantitative notion introduced in this paper to measure\nthe discrepancy of a target function f on cuts of graphs. The variations of f across cuts can then be\nupper and lower bounded by the ratio of its harmonic loss and the cut cost. As long as the harmonic\nloss varies slowly, the graph conductance dominates the variations of f – it will remain smooth in\na dense area but vary sharply otherwise. 
Models possessing such properties successfully capture\nthe cluster structures and, as shown in Sec. 4, lead to superior performance in practical applications\nincluding classification and retrieval.\n\nThis novel perspective allows us to give a unified treatment of graph-based models. We use this tool\nto study five popular models: absorbing random walks, partially absorbing random walks, hitting\ntimes, pseudo-inverse of the graph Laplacian, and eigenvectors of the Laplacian matrices. Our\nanalysis provides new theoretical understanding of these models, answers related open questions,\nand helps to correct and justify their practical use. The key message conveyed in our results is that\nvarious existing models enjoying the harmonic structure are actually capable of capturing the global\ngraph topology, and understanding of this structure can guide us in applying them properly.\n\n2 Analysis\n\nLet us first define some notation. In this paper, we consider graphs which are connected, undirected,\nweighted, and without self-loops. Denote by G = (V, W) a graph with n vertices V and a symmetric\nnon-negative affinity matrix W = [wij] ∈ R^{n×n} (wii = 0). Denote by di = Σ_j wij the degree of\nvertex i, by D = diag(d1, d2, . . . , dn) the degree matrix, and by L = D − W the graph Laplacian\n[7]. The conductance of a subset S ⊂ V of vertices is defined as Φ(S) = w(S, ¯S)/min(d(S), d( ¯S)), where\nw(S, ¯S) = Σ_{i∈S, j∈ ¯S} wij is the cut cost between S and its complement ¯S, and d(S) = Σ_{i∈S} di is\nthe volume of S. For any i ∉ S, we write i ∼ S if there is an edge between vertex i and the set S.\nDefinition 2.1 (Harmonic loss). 
The harmonic loss of f : V → R on any S ⊆ V is defined as:\n\nL_f(S) := Σ_{i∈S} di ( f(i) − Σ_{j∼i} (wij/di) f(j) ) = Σ_{i∈S} ( di f(i) − Σ_{j∼i} wij f(j) ).   (1)\n\nNote that L_f(S) = Σ_{i∈S} (Lf)(i). By definition, the harmonic loss can be negative. However, as\nwe shall see below, it is always non-negative on superlevel sets.\nThe following lemma shows that the harmonic loss couples the cut cost and the discrepancy of the\nfunction across the cut. This observation will serve as the foundation of our analysis in this paper.\nLemma 2.2. L_f(S) = Σ_{i∈S, j∈ ¯S} wij (f(i) − f(j)). In particular, L_f(V) = 0.\nIn practice, to examine the variation of f on a graph, one need not examine every subset of vertices,\nwhich would be exponential in the number of vertices. Instead, it suffices to consider\nits variation on the superlevel sets defined as follows.\nDefinition 2.3 (Superlevel set). For any function f : V → R on a graph and a scalar c ∈ R, the\nset {i | f(i) ≥ c} is called a superlevel set of f with level c.\n\nW.l.o.g., we assume the vertices are sorted such that f(1) ≥ f(2) ≥ · · · ≥ f(n − 1) ≥ f(n). The\nsubset Si := {1, . . . , i} is the superlevel set with level f(i) if f(i) > f(i + 1). For convenience, we\nstill call Si a superlevel set of f even if f(i) = f(i + 1). In this paper, we will mainly examine the\nvariation of f on its n superlevel sets S1, . . . , Sn. Our first observation is that the harmonic loss on\neach superlevel set is non-negative, stated as follows.\nLemma 2.4. L_f(Si) ≥ 0, i = 1, . . .
, n.\n\nBased on the notion of superlevel sets, it becomes legitimate to talk about the continuity of a function\non graphs, which we formally define as follows.\nDefinition 2.5 (Continuity). For any function f : V → R, we call it left-continuous if i ∼ Si−1,\ni = 2, . . . , n; we call it right-continuous if i ∼ ¯Si, i = 1, . . . , n − 1; we call it continuous if\ni ∼ Si−1 and i ∼ ¯Si, i = 2, . . . , n − 1. Particularly, f is called left-continuous, right-continuous,\nor continuous at vertex i if i ∼ Si−1, i ∼ ¯Si, or i ∼ Si−1 and i ∼ ¯Si, respectively.\nProposition 2.6. For any function f : V → R and any vertex 1 < i < n, 1) if L_f(i) < 0, then\ni ∼ Si−1, i.e., f is left-continuous at i; 2) if L_f(i) > 0, then i ∼ ¯Si, i.e., f is right-continuous at i;\n3) if L_f(i) = 0 and f(i − 1) > f(i) > f(i + 1), then i ∼ Si−1 and i ∼ ¯Si, i.e., f is continuous at i.\n\nThe variation of f can be characterized by the following upper and lower bounds.\nTheorem 2.7 (Dropping upper bound). For i = 1, . . . , n − 1,\n\nf(i) − f(i + 1) ≤ L_f(Si) / w(Si, ¯Si) = L_f(Si) / (Φ(Si) min(d(Si), d( ¯Si))).   (2)\n\nTheorem 2.8 (Dropping lower bound). For i = 1, . . . , n − 1,\n\nf(u) − f(v) ≥ L_f(Si) / w(Si, ¯Si) = L_f(Si) / (Φ(Si) min(d(Si), d( ¯Si))),   (3)\n\nwhere u := arg max_{j∈Si, j∼ ¯Si} f(j) and v := arg min_{j∈ ¯Si, j∼Si} f(j).\n\nThe key observations are two-fold. First, for any function f on a graph, as long as its harmonic\nloss L_f(Si) varies slowly on the superlevel sets, i.e., f is harmonic almost everywhere, the graph\nconductance Φ(Si) will dominate the variation of f. 
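The quantities in Eq. (1) and Lemma 2.2 are directly computable. A minimal numerical sketch (our illustration with NumPy on a small arbitrary weighted graph, not part of the paper's experiments):

```python
import numpy as np

# Small arbitrary weighted graph (illustrative only): symmetric W, no self-loops.
W = np.array([[0.0, 1.0, 1.0, 0.1],
              [1.0, 0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0, 0.0],
              [0.1, 0.0, 0.0, 0.0]])
d = W.sum(axis=1)
L = np.diag(d) - W                       # graph Laplacian L = D - W

f = np.array([1.0, 0.8, 0.7, 0.1])       # any function on the vertices
S, Sbar = [0, 1, 2], [3]                 # a subset and its complement

# Harmonic loss, Eq. (1): L_f(S) = sum over i in S of (L f)(i)
loss = (L @ f)[S].sum()

# Lemma 2.2: the same quantity equals the cut-weighted discrepancy across {S, S-bar}
cut_discrepancy = sum(W[i, j] * (f[i] - f[j]) for i in S for j in Sbar)
assert np.isclose(loss, cut_discrepancy)
assert np.isclose((L @ f).sum(), 0.0)    # L_f(V) = 0
```

Dividing `loss` by the cut cost `W[np.ix_(S, Sbar)].sum()` gives exactly the ratio that bounds the drop of f in Theorems 2.7 and 2.8.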
In particular, by Theorem 2.7, f(i + 1) drops\nlittle if Φ(Si) is large, whereas by Theorem 2.8, a big gap exists across the cut if Φ(Si) is small (see\nSec. 3.1 for illustration). Second, the continuity (either left, right, or both) of f ensures that its variations\nconform with the graph connectivity, i.e., points with similar values on f tend to be connected.\nThis is a desired property because a "discontinuous" function that alternates among different\nclusters can hardly describe the graph. These observations can guide us in identifying "good"\nfunctions that encode the global structure of graphs, as will be shown in the next section.\n\n3 Examples\n\nWith the tool developed in Sec. 2, in this section, we study five popular graph models arising from\ndifferent contexts including SSL, retrieval, recommendation, and clustering. For each model, we\nshow its target function in harmonic form, quantify its harmonic loss, analyze its dropping bounds,\nand provide corrections or justifications for its use.\n\n3.1 Absorbing Random Walks\n\nThe first model we examine is the seminal Laplacian regularization method [26] proposed for SSL.\nWhile it has a nice interpretation in terms of absorbing random walks, with the labeled points being\nabsorbing states, it was argued in [16] that this method might be ill-posed for large unlabeled data\nin high dimension (≥ 2) because the target function is extremely flat and thus seems problematic\nfor classification. [1] further connected this argument with the resistance distance on graphs, pointing\nout that the classification is biased toward the labeled points with larger degrees. Here we show that\nLaplacian regularization can actually capture the global graph structure and that a simple normalization\nscheme resolves the raised issue.\n\nFor simplicity, we consider the binary classification setting with one label in each class. 
Denote by\nf : V → R the absorption probability vector from every point to the positive labeled point. Assume\nthe vertices are sorted such that 1 = f(1) > f(2) ≥ · · · ≥ f(n − 1) > f(n) = 0 (vertex 1 is labeled\npositive and vertex n is labeled negative). By the first step analysis of the random walk,\n\nf(i) = Σ_{k∼i} (wik/di) f(k), for i = 2, . . . , n − 1.   (4)\n\nOur first observation is that the harmonic loss of f is constant w.r.t. Si, as shown below.\n\nFigure 1: Absorbing random walks on a 6-point graph (two unit-weight triangles joined by an edge of weight 0.1; the computed values are f(1) = 1, f(2) = 0.97, f(3) = 0.94, f(4) = 0.06, f(5) = 0.03, f(6) = 0).\n\nCorollary 3.1. L_f(Si) = Σ_{k∼1} w1k(1 − f(k)), i = 1, . . . , n − 1.\nThe following statement shows that f changes continuously on graphs under a general condition.\nCorollary 3.2. Suppose f is mutually different on unlabeled data. Then f is continuous.\n\nSince the harmonic loss of f is a constant on the superlevel sets Si (Corollary 3.1), by Theorems\n2.7 and 2.8, the variation of f depends solely on the cut value w(Si, ¯Si), which indicates that it will\ndrop slowly when the cut is dense but drastically when the cut is sparse. Also, by Corollary 3.2, f is\ncontinuous. Therefore, we conclude that f is a good function on graphs.\nThis can be illustrated by a toy example in Fig. 1, where the graph consists of 6 points in 2 classes\ndenoted by different colors, with 3 points in each. The edge weights are all 1 except for the edge\nbetween the two clusters, which is 0.1. Vertices 1 and 6 (black edged) are labeled. The absorption\nprobabilities from all the vertices to vertex 1 are computed and shown. 
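The toy computation can be reproduced exactly by solving the linear system behind Eq. (4). A sketch (NumPy; the graph below is our reading of Fig. 1: two unit-weight triangles joined by the 0.1 edge):

```python
import numpy as np

# The 6-point graph of Fig. 1: triangles {1,2,3} and {4,5,6} with unit weights,
# joined by a weight-0.1 edge between vertices 3 and 4 (0-indexed here).
n = 6
W = np.zeros((n, n))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    W[i, j] = W[j, i] = 1.0
W[2, 3] = W[3, 2] = 0.1                  # the weak edge between the clusters

d = W.sum(axis=1)
L = np.diag(d) - W

labeled, unlabeled = [0, 5], [1, 2, 3, 4]
f = np.zeros(n)
f[0] = 1.0                               # vertex 1 labeled positive, vertex 6 negative

# Absorption probabilities solve the Dirichlet problem implied by Eq. (4):
#   L_uu f_u = -L_ul f_l  on the unlabeled vertices.
f[unlabeled] = np.linalg.solve(L[np.ix_(unlabeled, unlabeled)],
                               -L[np.ix_(unlabeled, labeled)] @ f[labeled])
print(np.round(f, 2))   # approximately [1, 0.97, 0.94, 0.06, 0.03, 0], matching Fig. 1
```

The small drop f(2) − f(3) and the large drop f(3) − f(4) fall out of this solve, exactly as the bounds predict.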
We can see that since the\ncut w(S2, ¯S2) = 2 is quite dense, the drop between f(2) and f(3) is upper bounded by a small\nnumber (Theorem 2.7), so f(3) must be very close to f(2), as observed. In contrast, since the cut\nw(S3, ¯S3) = 0.1 is very weak, Theorem 2.8 guarantees that there will be a huge gap between f(3)\nand f(4), as also verified. The bound in Theorem 2.8 is tight in this case as there is only 1 edge in the cut.\nNow let f1 and f2 denote the absorption probability vectors to the two labeled points respectively.\nTo classify an unlabeled point i, the usual way is to compare f1(i) and f2(i), which is equivalent to\nsetting the threshold as 0 in f0 = f1 − f2. It was observed in [16] that although f0 can be extremely\nflat in the presence of large unlabeled data in high dimension, setting the "right" threshold can\nproduce sensible results. Our analysis explains this – it is because both f1 and f2 are informative of\nthe cluster structures. Our key argument is that Laplacian regularization actually carries sufficient\ninformation about the graph structure, but how to exploit it can really make a difference.\n\nFigure 2: (a) Two 20-dimensional Gaussians with the first two dimensions plotted. The magenta\ntriangle and the green circle denote labeled data. The blue cross denotes a starting vertex indexed\nby i for later use. (c) Classification by comparing the absorption probabilities. 
(b) Absorption probabilities to the two labeled points. (d) Normalized absorption probabilities. (e) Classification\nby comparing the normalized absorption probabilities.\n\nWe illustrate this point by using a mixture of two 20-dimensional Gaussians of 600 points, with one\nlabel in each Gaussian (Fig. 2(a)). The absorption probabilities to both labeled points are shown in\nFig. 2(b), in magenta and green respectively. The green vector is well above the magenta vector,\nindicating that every unlabeled point has a larger absorption probability to the green labeled point.\nComparing them classifies all the unlabeled points to the green Gaussian (Fig. 2(c)). Since the green\nlabeled point has a larger degree than the magenta one1, this result is expected from the analysis in\n[1]. However, the probability vectors are informative, with a clear gap between the clusters in each\nvector.\n\n1The degrees are 1.4405 and 0.1435. We use a weighted 20-NN graph (see Supplement).\n\nTo use this information, we propose to normalize each vector by its probability mass, i.e.,\nf′(i) = f(i)/Σ_j f(j) (Fig. 2(d)). Comparing them leads to a perfect classification (Fig. 2(e)).\nThis idea is based on two observations from our analysis: 1) the variance of the probabilities within\neach cluster is small; 2) there is a gap between the clusters. The small variance indicates that\ncomparing the probabilities is essentially the same as comparing their means within clusters. The\ngap between the clusters ensures that the normalization makes the vectors align well (this point is\nmade precise in the Supplement). Our above analysis applies to multi-class problems and allows more\nthan one labeled point per class. 
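The normalization-by-mass step can be sketched as follows. The vectors here are hypothetical, chosen only to mimic the qualitative shape of Fig. 2(b) (one vector uniformly above the other, both with a within-cluster plateau and a between-cluster gap); they are not the actual Gaussian data:

```python
import numpy as np

# Hypothetical absorption-probability vectors over six unlabeled points.
f1 = np.array([0.90, 0.88, 0.89, 0.10, 0.11, 0.09])   # to labeled point 1
f2 = np.array([0.95, 0.94, 0.96, 0.80, 0.82, 0.81])   # to labeled point 2 (larger degree)

# Raw comparison assigns everything to class 2 -- the degree bias noted in [1, 16].
assert (f2 > f1).all()

# Normalizing each vector by its probability mass re-aligns them.
g1, g2 = f1 / f1.sum(), f2 / f2.sum()
labels = np.where(g1 > g2, 1, 2)
print(labels)   # [1 1 1 2 2 2]: the first cluster is now recovered as class 1
```

The gap between clusters is what makes this rescaling safe: within each cluster the normalized values barely move relative to the gap.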
In this general case, the classification rule is as follows: 1) compute the absorption probability vector fi : U → R for each labeled point i by taking all other labeled points as negative, where U denotes the set of unlabeled points; 2) normalize fi by its mass, denoted by fi′; 3) assign each unlabeled point j to the class of j* := arg max_i {fi′(j)}. We denote\nthis algorithm as ARW-N-1NN.\n\n3.2 Partially Absorbing Random Walks\n\nHere we revisit the recently proposed partially absorbing random walks (PARW) [23], which generalize\nabsorbing random walks by allowing partial absorption at each state. The absorption rate pii\nat state i is defined as pii = αλi/(αλi + di), where α > 0, λi > 0 are regularization parameters. Given current\nstate i, a PARW in the next step gets absorbed at i with probability pii and moves to state j with probability\n(1 − pii) wij/di. Let aij be the probability that a PARW starting from state i gets\nabsorbed at state j within finite steps, and denote by A = [aij] ∈ R^{n×n} the absorption probability\nmatrix. Then A = (αΛ + L)^{−1}αΛ, where Λ = diag(λ1, . . . , λn) is the regularization matrix.\nPARW is a unified framework with several popular SSL methods and PageRank [17] as its special\ncases, corresponding to different Λ. Particularly, the case Λ = I has been justified in capturing the\ncluster structures [23]. In what follows, we extend this result to show that the columns of A obtained\nby PARW with almost arbitrary Λ (not just Λ = I) actually exhibit strong harmonic structures and\nshould be expected to work equally well.\n\nOur first observation is that while A is not symmetric for arbitrary Λ, AΛ^{−1} = (αΛ + L)^{−1}α is.\nLemma 3.3. aij = (λj/λi) aji.\nLemma 3.4. aii is the only largest entry in the i-th column of A, i = 1, . . . , n.\n\nOur second observation is that the harmonic structure exists in the probabilities of PARW from every\nvertex getting absorbed at a particular vertex, i.e., in the columns of A. W.l.o.g., consider the first\ncolumn of A and denote it by p. Assume that the vertices are sorted such that p(1) > p(2) ≥ · · · ≥\np(n − 1) ≥ p(n), where p(1) > p(2) is due to Lemma 3.4. By the first step analysis of PARW, we\ncan write p in a recursive form:\n\np(1) = αλ1/(d1 + αλ1) + Σ_{k∼1} (w1k/(d1 + αλ1)) p(k),   p(i) = Σ_{k∼i} (wik/(di + αλi)) p(k), i = 2, . . . , n,   (5)\n\nwhich is equivalent to the following harmonic form:\n\np(1) = (αλ1/d1)(1 − p(1)) + Σ_{k∼1} (w1k/d1) p(k),   p(i) = −(αλi/di) p(i) + Σ_{k∼i} (wik/di) p(k), i = 2, . . . , n.   (6)\n\nThe harmonic loss of p can be computed from Eq. (6).\nCorollary 3.5. L_p(Si) = αλ1(1 − Σ_{k∈Si} a1k) = αλ1 Σ_{k∈ ¯Si} a1k, i = 1, . . . , n − 1.\nCorollary 3.6. p is left-continuous.\n\nNow we are ready to examine the variation of p. Note that Σ_k a1k = 1 and a1k → λk/Σ_i λi\nas α → 0 [23]. By Theorem 2.7, the drop of p(i) is upper bounded by αλ1/w(Si, ¯Si), which is\nsmall when the cut w(Si, ¯Si) is dense and α is small. Now let k be the largest number such that\nd(Sk) ≤ (1/2) d(V), and assume Σ_{i∈ ¯Sk} λi ≥ (1/2) Σ_i λi. By Theorem 2.8, for 1 ≤ i ≤ k, the drop of p(i)\nacross the cut {Si, ¯Si} is lower bounded by (1/3) αλ1/w(Si, ¯Si), if α is sufficiently small. This shows\nthat p(i) will drop a lot when the cut w(Si, ¯Si) is weak. The comparison between the corresponding\nrow and column of A is shown in Figs. 3(a–b)2, which confirms our analysis.\n\n2The λi's are sampled from the uniform distribution on the interval [0, 1] and α = 1e−6, as used in Sec. 4.\n\nFigure 3: (a) Absorption probabilities that a PARW gets absorbed at other points when starting from\ni (see Fig. 2). (b) Absorption probabilities that a PARW gets absorbed at i when starting from other\npoints. (c) The i-th row of L†. (d) Hitting times from i to hit other points. (e) Hitting times from\nother points to hit i. (f) and (g) Eigenvectors of L (min_i{di} = 0.0173), with λu = 0.0144 and λu = 0.0172. (h) An eigenvector of\nLsym, with λ = 0.0304. (i) and (j) Eigenvectors of Lrw, with λv = 0.0304 and λv = 0.3845.\n\nIt is worth mentioning that our analysis substantially extends the results in [23] by showing that the\nspecial setting of Λ is not really necessary – a random Λ can perform equally well if one uses the columns\ninstead of the rows of A. In addition, our result includes the seminal local clustering model [2] as a\nspecial case, which corresponds to Λ = D in our analysis.\n\n3.3 Pseudo-inverse of the Graph Laplacian\n\nThe pseudo-inverse L† of the graph Laplacian is a valid kernel corresponding to commute times\n[10, 12]. 
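This kernel property is easy to check numerically. A quick sketch (our illustration, on a small connected graph of our choosing): the pseudo-inverse is symmetric positive semi-definite, and on a connected graph L†L is the centering projector I − (1/n)11ᵀ, i.e. off-diagonal entries −1/n and diagonal entries 1 − 1/n.

```python
import numpy as np

# Small connected weighted graph (illustrative only).
W = np.array([[0.0, 1.0, 1.0, 0.1],
              [1.0, 0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0, 0.0],
              [0.1, 0.0, 0.0, 0.0]])
n = W.shape[0]
d = W.sum(axis=1)
L = np.diag(d) - W
Lp = np.linalg.pinv(L)                   # pseudo-inverse of the Laplacian

# A valid kernel: symmetric and positive semi-definite.
assert np.allclose(Lp, Lp.T)
assert (np.linalg.eigvalsh(Lp) > -1e-10).all()

# On a connected graph, L+ L = I - (1/n) * ones(n, n):
# the orthogonal projector onto the space orthogonal to the all-ones vector.
P = Lp @ L
assert np.allclose(P, np.eye(n) - np.ones((n, n)) / n)
```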
While commute times may fail to capture the global topology in large graphs [22], L†, if\nused directly as a similarity measure, gives superior performance in practice [10]. Here we provide\na formal analysis and justification for L† by revealing the strong harmonic structure hidden in it.\nLemma 3.7. (L†L)ij = −1/n, i ≠ j; and (L†L)ii = 1 − 1/n.\n\nNote that L† is symmetric since L is symmetric. W.l.o.g., we consider the first row of L† and denote\nit by ℓ. The following lemma shows the harmonic form of ℓ.\nLemma 3.8. ℓ has the following harmonic form:\n\nℓ(1) = (1 − 1/n)/d1 + Σ_{k∼1} (w1k/d1) ℓ(k),   ℓ(i) = −(1/n)/di + Σ_{k∼i} (wik/di) ℓ(k), i = 2, . . . , n.   (7)\n\nW.l.o.g., assume the vertices have been sorted such that ℓ(1) > ℓ(2) ≥ · · · ≥ ℓ(n − 1) ≥ ℓ(n)3.\nThen the harmonic loss of ℓ on the set Si admits a very simple form, as shown below.\nCorollary 3.9. L_ℓ(Si) = | ¯Si|/n, i = 1, . . . , n − 1.\nCorollary 3.10. ℓ is left-continuous.\n\nBy Corollary 3.9, L_ℓ(Si) < 1 and decreases very slowly in large graphs since L_ℓ(Si) − L_ℓ(Si+1) =\n1/n for any i. From the analysis in Sec. 2, we can immediately conclude that the variation of ℓ(i) is\ndominated by the cut cost on the superlevel set Si. Fig. 3(c) illustrates this argument.\n\n3.4 Hitting Times\n\nThe hitting time hij from vertex i to j is the expected number of steps it takes a random walk starting\nfrom i to reach j for the first time. 
While it was proven in [22] that hitting times are dominated by\nthe local structure of the target, we show below that the hitting times from other points to the same\ntarget admit a harmonic structure, and thus are still able to capture the global structure of graphs.\nOur result is complementary to the analysis in [22], and provides a justification for using hitting times\nin information retrieval, where the query is taken as the target to be hit by others [15].\n\n3ℓ(1) > ℓ(2) since one can show that any diagonal entry in L† is the only largest in the corresponding row.\n\nLet h : V → R be the hitting times from every vertex to a particular vertex. W.l.o.g., assume the\nvertices have been sorted such that h(1) ≥ h(2) ≥ · · · ≥ h(n − 1) > h(n) = 0, where vertex n is\nthe target vertex. Applying the first step analysis, we obtain the harmonic form of h:\n\nh(i) = 1 + Σ_{k∼i} (wik/di) h(k), for i = 1, . . . , n − 1.   (8)\n\nThe harmonic loss on the set Si turns out to be the volume of the set, as stated below.\nCorollary 3.11. L_h(Si) = Σ_{1≤k≤i} dk = d(Si), i = 1, . . . , n − 1.\nCorollary 3.12. h is right-continuous.\n\nNow let us examine the variation of h across any cut {Si, ¯Si}. Note that\n\nL_h(Si) / w(Si, ¯Si) = αi / Φ(Si), where αi = d(Si) / min(d(Si), d( ¯Si)).   (9)\n\nFirst, by Theorem 2.8, there could be a significant gap between the target and its neighbors, since\nα_{n−1} = d(V)/dn − 1 could be quite large. As i decreases from d(Si) > (1/2) d(V), the variation of αi\nbecomes slower and slower (αi = 1 when d(Si) ≤ (1/2) d(V)), so the variation of h will depend on the\nvariation of the conductance of Si, i.e., Φ(Si), according to Theorems 2.7 and 2.8. Fig. 3(e) shows\nthat h is flat within the clusters, but there is a large gap between them. 
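This behavior is easy to reproduce. A sketch (our illustration) computing h on the same style of two-cluster toy graph as Fig. 1, with target vertex t = 0 rather than vertex n:

```python
import numpy as np

# Two unit-weight triangles joined by a weight-0.1 edge (as in Fig. 1).
n = 6
W = np.zeros((n, n))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    W[i, j] = W[j, i] = 1.0
W[2, 3] = W[3, 2] = 0.1
d = W.sum(axis=1)
L = np.diag(d) - W

t = 0                                    # target vertex to be hit
others = [i for i in range(n) if i != t]

# First-step analysis, Eq. (8): h(i) = 1 + sum_k (w_ik / d_i) h(k), h(t) = 0,
# which in matrix form is L_uu h_u = d_u on the non-target vertices.
h = np.zeros(n)
h[others] = np.linalg.solve(L[np.ix_(others, others)], d[others])

# Corollary 3.11 mechanism: (L h)(i) = d_i for every non-target vertex,
# so the harmonic loss on a superlevel set is the volume of that set.
assert np.allclose((L @ h)[others], d[others])

# h is flat within each cluster but jumps across the weak cut.
assert h[3:].min() - h[1:3].max() > 10
print(np.round(h, 1))
```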
In contrast, there\nare no gaps exhibited in the hitting times from the target to other vertices (Fig. 3(d)).\n\n3.5 Eigenvectors of the Laplacian Matrices\n\nThe eigenvectors of the Laplacian matrices play a key role in graph partitioning [20]. In practice, the\neigenvectors with smaller (positive) eigenvalues are preferred to those with larger eigenvalues,\nand the ones from a normalized Laplacian are preferred to those from the un-normalized one.\nThese choices are usually justified via the relaxations of normalized cuts [18] and ratio cuts\n[11]. However, it has been known that these relaxations can be arbitrarily loose [20]. It seems more\ninteresting if one can draw conclusions by analyzing the eigenvectors directly. Here we address\nthese issues by examining the harmonic structures in these eigenvectors.\n\nWe follow the notation in [20] to denote the two normalized graph Laplacians: Lrw := D^{−1}L and\nLsym := D^{−1/2} L D^{−1/2}. Denote by u and v two eigenvectors of L and Lrw with eigenvalues λu > 0\nand λv > 0, respectively, i.e., Lu = λu u and Lrw v = λv v. Then we have\n\nu(i) = Σ_{k∼i} (wik/(di − λu)) u(k),   v(i) = Σ_{k∼i} (wik/(di(1 − λv))) v(k),   for i = 1, . . . , n.   (10)\n\nWe can see that the smaller λu and λv are, the stronger the harmonic structures of u and v. This explains\nwhy in practice the eigenvector with the second4 smallest eigenvalue gives superior performance.\nAs long as λu ≪ min_i{di}, we are safe to say that u will have a significant harmonic structure, and\nthus will be informative for clustering. However, if λu is close to min_i{di}, no matter how small λu\nis, the harmonic structure of u will be weaker, and thus u is less useful. In contrast, from Eq. (10),\nv will always enjoy a significant harmonic structure as long as λv is much smaller than 1. 
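This can be checked directly on a toy graph. A sketch (our illustration) computing the second eigenvector of Lrw via the symmetric form Lsym = D^{−1/2} L D^{−1/2} (same spectrum; eigenvectors related by a D^{−1/2} rescaling):

```python
import numpy as np

# Two-cluster toy graph as before: two unit triangles joined by a 0.1 edge.
n = 6
W = np.zeros((n, n))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    W[i, j] = W[j, i] = 1.0
W[2, 3] = W[3, 2] = 0.1
d = W.sum(axis=1)
L = np.diag(d) - W

# Eigen-decomposition of L_sym; if L_sym u = lam u, then v = D^{-1/2} u
# satisfies L_rw v = lam v.
Dhalf = np.diag(d ** -0.5)
lam, U = np.linalg.eigh(Dhalf @ L @ Dhalf)   # ascending eigenvalues
v = Dhalf @ U[:, 1]                          # second smallest eigenvalue

assert lam[1] < 1            # strong harmonic structure expected for L_rw
assert v[:3].mean() * v[3:].mean() < 0       # sign pattern separates the clusters
```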
This\nexplains why the eigenvectors of Lrw are preferred to those of L for clustering. These arguments are\nvalidated in Figs. 3(f–j), where we also include an eigenvector of Lsym for comparison.\n\n4 Experiments\n\nIn the first experiment5, we test absorbing random walks (ARW) for SSL, with the class mass normalization\nsuggested in [26] (ARW-CMN), our proposed normalization (ARW-N-1NN, Sec. 3.1),\nand without any normalization (ARW-1NN) – where each unlabeled instance is assigned the class of\nthe labeled instance at which it most likely gets absorbed. We also compare with the local and global\nconsistency (LGC) method [24] and the PARW with Λ = I in [23].\n\n4Note that the smallest one is zero in either L or Lrw.\n5Please see Supplement for parameter settings, data description, graph construction, and experimental setup.\n\nTable 1: Classification accuracy on 9 datasets.\n\n              USPS   YaleB  satimage  imageseg  ionosphere  iris   protein  spiral  soybean\nARW-N-1NN     .879   .892   .777      .673      .771        .918   .589     .830    .916\nARW-1NN       .445   .733   .650      .595      .699        .902   .440     .754    .889\nARW-CMN       .775   .847   .741      .624      .724        .894   .511     .726    .856\nLGC           .821   .884   .725      .638      .731        .903   .477     .729    .816\nPARW (Λ = I)  .880   .906   .781      .665      .752        .928   .572     .835    .905\n\nThe results are summarized in\nTable 1. We can see that ARW-N-1NN and PARW (Λ = I) consistently perform the best, which\nverifies our analysis in Sec. 3. The results of ARW-1NN are unsatisfactory due to its bias toward the\nlabeled instance with the largest degree [1]. Although ARW-CMN does improve over ARW-1NN in\nmany cases, it does not perform as well as ARW-N-1NN, mainly because of the artifacts induced by\nestimating the class proportion from limited labeled data. 
The results of LGC are not comparable to\nARW-N-1NN and PARW (Λ = I), which is probably due to the lack of a harmonic structure.\n\nTable 2: Ranking results (MAP) on USPS.\n\nDigits          0     1     2     3     4     5     6     7     8     9     All\nΛ = R (column)  .981  .988  .875  .892  .647  .780  .941  .918  .746  .731  .850\nΛ = R (row)     .169  .143  .114  .096  .092  .076  .093  .093  .075  .086  .103\nΛ = I           .981  .988  .876  .893  .646  .778  .940  .919  .746  .730  .850\n\nIn the second experiment, we test PARW on a retrieval task on USPS (see Supplement). We compare\nthe cases with Λ = I and Λ = R, where R is a random diagonal matrix with positive diagonal\nentries. For Λ = R, we also compare the uses of columns and rows for retrieval. The results are\nshown in Table 2. We observe that the columns in Λ = R give significantly better results compared\nwith the rows, implying that the harmonic structure is vital to the performance. Λ = R (column) and\nΛ = I perform very similarly. This suggests that it is not the special setting of absorption rates but\nthe harmonic structure that determines the overall performance.\n\nTable 3: Classification accuracy on USPS.\n\nk-NN unweighted graphs  10     20     50     100    200    500\nHT(L → U)               .8514  .8361  .7822  .7500  .7071  .6429\nHT(U → L)               .1518  .1454  .1372  .1209  .1131  .1113\nL†                      .8512  .8359  .7816  .7493  .7062  .6426\n\nIn the third experiment, we test hitting times and the pseudo-inverse of the graph Laplacian for SSL on\nUSPS. 
We compare two different uses of hitting times: starting from the labeled data L to hit the unlabeled data U (HT(L → U)), and the opposite direction (HT(U → L)). Each unlabeled instance j is assigned the class of labeled instance j*, where j* = arg min_{i∈L} h_ij in HT(L → U), j* = arg min_{i∈L} h_ji in HT(U → L), and j* = arg max_{i∈L} ℓ_ji in L† = (ℓ_ij). The results averaged over 100 trials are shown in Table 3, where we see that HT(L → U) performs much better than HT(U → L). This is expected, as the former admits the desired harmonic structure. Note that HT(L → U) does not degenerate as the number of neighbors increases (i.e., as the graph becomes more connected); the slight performance drop is due to the inclusion of more noisy edges. In contrast, HT(U → L) is completely lost [20]. We also observe that L† produces very competitive performance, which again supports our analysis.

5 Conclusion

In this paper, we explore the harmonic structure that widely exists in graph models. Different from previous research [3, 13] on harmonic analysis on graphs, where the selection of canonical bases on graphs and the asymptotic convergence on manifolds are studied, here we examine how functions on graphs deviate from being harmonic, and we develop bounds to analyze their theoretical behavior. The proposed harmonic loss quantifies the discrepancy of a function across cuts, allows a unified treatment of various models from different contexts, and makes them easy to analyze. Given its resemblance to standard mathematical concepts such as divergence and total variation, an interesting line of future work is to make these connections precise. Other future work includes deriving tighter bounds for particular functions and extending our analysis to more graph models.

References

[1] M. Alamgir and U. von Luxburg.
Phase transition in the family of p-resistances. In NIPS, 2011.

[2] R. Andersen, F. Chung, and K. Lang. Local graph partitioning using pagerank vectors. In FOCS, pages 475–486, 2006.

[3] M. Belkin. Problems of Learning on Manifolds. PhD thesis, The University of Chicago, 2003.

[4] M. Belkin, I. Matveeva, and P. Niyogi. Regularization and semi-supervised learning on large graphs. In COLT, pages 624–638, 2004.

[5] M. Belkin, Q. Que, Y. Wang, and X. Zhou. Toward understanding complex spaces: Graph Laplacians on manifolds with singularities and boundaries. In COLT, 2012.

[6] O. Bousquet, O. Chapelle, and M. Hein. Measure based regularization. In NIPS, 2003.

[7] F. Chung. Spectral Graph Theory. American Mathematical Society, 1997.

[8] R. Coifman and S. Lafon. Diffusion maps. Applied and Computational Harmonic Analysis, 21(1):5–30, 2006.

[9] P. G. Doyle and J. L. Snell. Random Walks and Electric Networks. Mathematical Association of America, 1984.

[10] F. Fouss, A. Pirotte, J.-M. Renders, and M. Saerens. Random-walk computation of similarities between nodes of a graph with application to collaborative recommendation. IEEE Transactions on Knowledge and Data Engineering, 19(3):355–369, 2007.

[11] L. Hagen and A. B. Kahng. New spectral methods for ratio cut partitioning and clustering. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 11(9):1074–1085, 1992.

[12] D. J. Klein and M. Randić. Resistance distance. Journal of Mathematical Chemistry, 12(1):81–95, 1993.

[13] S. S. Lafon. Diffusion Maps and Geometric Harmonics. PhD thesis, Yale University, 2004.

[14] M. Herbster and G. Lever. Predicting the labelling of a graph via minimum p-seminorm interpolation. In COLT, 2009.

[15] Q. Mei, D. Zhou, and K. Church. Query suggestion using hitting time. In CIKM, pages 469–478, 2008.

[16] B. Nadler, N. Srebro, and X. Zhou.
Statistical analysis of semi-supervised learning: The limit of infinite unlabelled data. In NIPS, pages 1330–1338, 2009.

[17] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web, 1999.

[18] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Trans. PAMI, 22(8):888–905, 2000.

[19] M. Szummer and T. Jaakkola. Partially labeled classification with Markov random walks. In NIPS, pages 945–952, 2002.

[20] U. von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, 2007.

[21] U. von Luxburg, M. Belkin, and O. Bousquet. Consistency of spectral clustering. The Annals of Statistics, pages 555–586, 2008.

[22] U. von Luxburg, A. Radl, and M. Hein. Hitting and commute times in large graphs are often misleading. arXiv preprint arXiv:1003.1266, 2010.

[23] X.-M. Wu, Z. Li, A. M.-C. So, J. Wright, and S.-F. Chang. Learning with partially absorbing random walks. In NIPS, 2012.

[24] D. Zhou, O. Bousquet, T. Lal, J. Weston, and B. Schölkopf. Learning with local and global consistency. In NIPS, 2004.

[25] X. Zhou and M. Belkin. Semi-supervised learning by higher order regularization. In AISTATS, 2011.

[26] X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In ICML, 2003.