{"title": "Getting lost in space: Large sample analysis of the resistance distance", "book": "Advances in Neural Information Processing Systems", "page_first": 2622, "page_last": 2630, "abstract": "The commute distance between two vertices in a graph is the expected time it takes a random walk to travel from the first to the second vertex and back. We study the behavior of the commute distance as the size of the underlying graph increases. We prove that the commute distance converges to an expression that does not take into account the structure of the graph at all and that is completely meaningless as a distance function on the graph. Consequently, the use of the raw commute distance for machine learning purposes is strongly discouraged for large graphs and in high dimensions. As an alternative we introduce the amplified commute distance that corrects for the undesired large sample effects.", "full_text": "Getting lost in space: Large sample analysis of the commute distance\n\nUlrike von Luxburg, Agnes Radl\nMax Planck Institute for Biological Cybernetics, Tübingen, Germany\n{ulrike.luxburg,agnes.radl}@tuebingen.mpg.de\n\nMatthias Hein\nSaarland University, Saarbrücken, Germany\nhein@cs.uni-sb.de\n\nAbstract\n\nThe commute distance between two vertices in a graph is the expected time it takes a random walk to travel from the first to the second vertex and back. We study the behavior of the commute distance as the size of the underlying graph increases. We prove that the commute distance converges to an expression that does not take into account the structure of the graph at all and that is completely meaningless as a distance function on the graph. Consequently, the use of the raw commute distance for machine learning purposes is strongly discouraged for large graphs and in high dimensions. 
As an alternative we introduce the amplified commute distance that corrects for the undesired large sample effects.\n\n1 Introduction\n\nGiven an undirected, weighted graph, the commute distance between two vertices u and v is defined as the expected time it takes a random walk starting in vertex u to travel to vertex v and back to u. As opposed to the shortest path distance, it takes into account all paths between u and v, not just the shortest one. As a rule of thumb, the more paths connect u with v, the smaller the commute distance becomes. As a consequence, it supposedly satisfies the following, highly desirable property:\n\nProperty (F): Vertices in the same cluster of the graph have a small commute distance, whereas two vertices in different clusters of the graph have a “large” commute distance.\n\nIt is because of this property that the commute distance has become a popular choice and is widely used, for example in clustering (Yen et al., 2005), semi-supervised learning (Zhou and Schölkopf, 2004), in social network analysis (Liben-Nowell and Kleinberg, 2003), for proximity search (Sarkar et al., 2008), in image processing (Qiu and Hancock, 2005), for dimensionality reduction (Ham et al., 2004), for graph embedding (Guattery, 1998, Saerens et al., 2004, Qiu and Hancock, 2006, Wittmann et al., 2009) and even for deriving learning theoretic bounds for graph labeling (Herbster and Pontil, 2006, Cesa-Bianchi et al., 2009). One of the main contributions of this paper is to establish that property (F) does not hold in many relevant situations.\n\nIn this paper we study how the commute distance (up to a constant factor equivalent to the resistance distance, see below for exact definitions) behaves when the size of the graph increases. We focus on the case of random geometric graphs as this is most relevant to machine learning, but similar results hold for very general classes of graphs under mild assumptions. 
Denoting by Hij the expected hitting time and by Cij the commute distance between two vertices vi and vj, and by di the degree of vertex vi, we prove that the hitting times and commute distances can be approximated (up to the constant vol(G) that denotes the volume of the graph) by\n\nHij/vol(G) ≈ 1/dj and Cij/vol(G) ≈ 1/di + 1/dj.\n\nThe intuitive reason for this behavior is that if the graph is large, the random walk “gets lost” in the sheer size of the graph. It takes so long to travel through a substantial part of the graph that by the time the random walk comes close to its goal it has already “forgotten” where it started from. For this reason, the hitting time Hij does not depend on the starting vertex vi any more. It only depends on the inverse degree of the target vertex vj, which intuitively represents the likelihood that the random walk exactly hits vj once it is in its neighborhood. In this respect it shows the same behavior as the mean return time at j (the mean time it takes a random walk that starts at j to return to its starting point), which is well known to be vol(G) · 1/dj as well.\n\nOur findings have very strong implications:\n\nThe raw commute distance is not a useful distance function on large graphs. On the negative side, our approximation result shows that, contrary to popular belief, the commute distance does not take into account any global properties of the data, at least if the graph is “large enough”. It just considers the local density (the degree of the vertex) at the two vertices, nothing else. The resulting large sample commute distance dist(vi, vj) = 1/di + 1/dj is completely meaningless as a distance on a graph. For example, all data points have the same nearest neighbor (namely, the vertex with the largest degree), the same second-nearest neighbor (the vertex with the second-largest degree), and so on. 
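As a quick sanity check of this approximation (our own sketch, not part of the paper), the complete graph Kn is a case where everything is known in closed form: the resistance distance is Rij = 2/n exactly, while 1/di + 1/dj = 2/(n−1), so the true commute distance 2(n−1) sits right next to the approximated 2n.

```python
import numpy as np

def commute_distances(W):
    """C_ij = vol(G) * R_ij, with R_ij from the pseudo-inverse of the Laplacian."""
    d = W.sum(axis=1)
    L = np.diag(d) - W                    # unnormalized graph Laplacian
    Lp = np.linalg.pinv(L)                # Moore-Penrose pseudo-inverse L+
    R = np.diag(Lp)[:, None] + np.diag(Lp)[None, :] - 2 * Lp  # resistance distances
    return d.sum() * R, d

n = 50
W = np.ones((n, n)) - np.eye(n)           # complete graph K_n, unit weights
C, d = commute_distances(W)
approx = d.sum() * (1 / d[:, None] + 1 / d[None, :])  # vol(G) * (1/di + 1/dj)
print(C[0, 1], approx[0, 1])              # 2(n-1) = 98 vs 2n = 100
```

Even for this tiny, perfectly mixing graph the relative gap is already only 1/(n−1); the paper's point is that on large geometric graphs the gap also vanishes.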
In particular, the main motivation to use the commute distance, Property (F), no longer holds when the graph becomes “large enough”. Even more disappointingly, computer simulations show that n does not even need to be very large before (F) breaks down. Often, n on the order of 1000 is already enough to make the commute distance very close to its approximation expression (see Section 5 for details). This effect is even stronger if the dimensionality of the underlying data space is large. Consequently, even on moderate-sized graphs, the use of the raw commute distance as a basis for machine learning algorithms should be discouraged.\n\nCorrecting the commute distance. It has been reported in the literature that hitting times and commute times can be observed to be quite small if the vertices under consideration have a high degree, and that the spread of the commute distance values can be quite large (Liben-Nowell and Kleinberg, 2003, Brand, 2005, Yen et al., 2009). Subsequently, the authors suggested several different methods to correct for this unpleasant behavior. In the light of our theoretical results we can see immediately why the undesired behavior of the commute distance occurs. Moreover, we are able to analyze the suggested corrections and prove which ones are meaningful and which ones are not (see Section 4). Based on our theory we suggest a new correction, the amplified commute distance. This is a new distance function that is derived from the commute distance, but avoids its artifacts. This distance function is Euclidean, making it well suited for machine learning purposes and kernel methods.\n\nEfficient computation of approximate commute distances. 
In some applications the commute distance is not used as a distance function, but for other reasons, for example in graph sparsification (Spielman and Srivastava, 2008) or when computing bounds on mixing or cover times (Aleliunas et al., 1979, Chandra et al., 1989, Avin and Ercal, 2007, Cooper and Frieze, 2009) or graph labeling (Herbster and Pontil, 2006, Cesa-Bianchi et al., 2009). To obtain the commute distance between all points in a graph one has to compute the pseudo-inverse of the graph Laplacian matrix, an operation of time complexity O(n^3). This is prohibitive in large graphs. To circumvent the matrix inversion, several approximations of the commute distance have been suggested in the literature (Spielman and Srivastava, 2008, Sarkar and Moore, 2007, Brand, 2005). Our results lead to a much simpler and well-justified way of approximating the commute distance on large random geometric graphs.\n\n2 General setup, definitions and notation\n\nWe consider undirected, weighted graphs G = (V, E) with n vertices. We always assume that G is connected and not bipartite. The non-negative weight matrix (adjacency matrix) is denoted by W := (wij)i,j=1,...,n. By di := Σ_{j=1}^n wij we denote the degree of vertex vi, and vol(G) := Σ_{j=1}^n dj is the volume of the graph. D denotes the diagonal matrix with diagonal entries d1, . . . , dn and is called the degree matrix.\n\nOur main focus in this paper is the class of random geometric graphs as it is most relevant to machine learning. Here we are given a sequence of points X1, . . . , Xn that has been drawn i.i.d. from some underlying density p on Rd. These points form the vertices v1, . . . , vn of the graph. The edges in the graph are defined such that “neighboring points” are connected: In the ε-graph we connect two points whenever their Euclidean distance is less than or equal to ε. 
In the undirected, symmetric k-nearest neighbor graph we connect vi to vj if Xi is among the k nearest neighbors of Xj or vice versa. In the mutual k-nearest neighbor graph we connect vi to vj if Xi is among the k nearest neighbors of Xj and vice versa. Due to space constraints we only discuss the case of unweighted graphs in this paper. Our results can be carried over to weighted graphs, in particular to weighted kNN-graphs and Gaussian similarity graphs.\n\nConsider the natural random walk on G, that is, the random walk with transition matrix P = D^{-1}W. The hitting time Hij is defined as the expected time it takes a random walk starting in vertex vi to travel to vertex vj (with Hii := 0 by definition). The commute distance (also called commute time) between vi and vj is defined as Cij := Hij + Hji. Some readers might also know the commute distance under the name resistance distance. Here one interprets the graph as an electrical network where the edges represent resistors. The conductance of a resistor is given by the corresponding edge weight. The resistance distance Rij between i and j is defined as the effective resistance between the vertices i and j in the network. It is well known that the resistance distance coincides with the commute distance up to a constant: Cij = vol(G)·Rij. For background reading see Doyle and Snell (1984), Klein and Randic (1993), Xiao and Gutman (2003), Fouss et al. (2006), Bollobás (1998), Lyons and Peres (2010).\n\nFor the rest of the paper we consider a probability distribution with density p on Rd. We want to study the behavior of the commute distance between two fixed points s and t. We will see that we only need to study the density in a reasonably small region X ⊂ Rd that contains s and t. For convenience, let us make the following definition.\n\nDefinition 1 (Valid region) Let p be any density on Rd, and s, t ∈ Rd be two points with p(s), p(t) > 0. 
We call a connected subset X ⊂ Rd a valid region with respect to s, t and p if the following properties are satisfied:\n\n1. s and t are interior points of X.\n\n2. The density on X is bounded away from 0, that is, for all x ∈ X we have that p(x) ≥ pmin > 0 for some constant pmin. Assume that pmax := max_{x∈X} p(x) < ∞.\n\n3. X has “bottleneck” larger than some value h > 0: the set {x ∈ X : dist(x, ∂X) > h/2} is connected (here ∂X denotes the topological boundary of X).\n\n4. The boundary of X is regular in the following sense. We assume that there exist positive constants α > 0 and ε0 > 0 such that if ε < ε0, then for all points x ∈ ∂X we have vol(Bε(x) ∩ X) ≥ α vol(Bε(x)) (where vol denotes the Lebesgue volume). Essentially this condition just excludes the situation where the boundary has arbitrarily thin spikes.\n\nFor readability reasons, we are going to state some of our main results using constants ci > 0. These constants are independent of n and the graph connectivity parameter (ε or k, respectively) but depend on the dimension, the geometry of X, and p. The values of all constants are determined explicitly in the proofs. They do not coincide across different propositions. For notational convenience, we will formulate all the following results in terms of the resistance distance. To obtain the results for the commute distance one just has to multiply by the factor vol(G).\n\n3 Convergence of the resistance distance on random geometric graphs\n\nIn this section we present our main theoretical results for random geometric graphs. We show that on this type of graph, the resistance distance Rij converges to the trivial limit 1/di + 1/dj. Due to space constraints we only formulate these results for unweighted kNN and ε-graphs. 
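The symmetric ("or") and mutual ("and") kNN constructions defined in Section 2 can be sketched in a few lines; the toy one-dimensional points below are our own, not the paper's:

```python
import numpy as np

def knn_graphs(X, k):
    """Adjacency of the symmetric ('or') and mutual ('and') kNN graph."""
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    nn = np.argsort(D2, axis=1)[:, 1:k + 1]       # k nearest neighbors, self excluded
    A = np.zeros(D2.shape, dtype=bool)            # directed "is-a-neighbor-of" relation
    A[np.repeat(np.arange(len(X)), k), nn.ravel()] = True
    return (A | A.T).astype(float), (A & A.T).astype(float)

X = np.array([[0.0], [0.1], [0.3], [1.0]])        # toy points on the line
W_sym, W_mut = knn_graphs(X, k=1)
print(int(W_sym.sum()) // 2, int(W_mut.sum()) // 2)   # 3 edges vs 1 edge
```

With k = 1, point 1.0 picks 0.3 as its neighbor but not vice versa, so that edge survives in the symmetric graph and is dropped in the mutual one.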
Similar results also hold for weighted variants of these graphs and for Gaussian similarity graphs.\n\nTheorem 2 (Resistance distance on kNN-graphs) Fix two points Xi and Xj. Consider a valid region X with respect to Xi and Xj with bottleneck h and density bounds pmin and pmax. Assume that Xi and Xj have distance at least h from the boundary of X and that (k/n)^{1/d}/(2pmax) ≤ h. Then there exist constants c1, . . . , c5 > 0 such that with probability at least 1 − c1 n exp(−c2 k) the resistance distance on both the symmetric and the mutual kNN-graph satisfies\n\n|kRij − (k/di + k/dj)| ≤ c4 (log(n/k) + (k/n)^{1/3} + 1)/k if d = 3, and |kRij − (k/di + k/dj)| ≤ c5/k if d > 3.\n\nThe probability converges to 1 if n → ∞ and k/log(n) → ∞. The rhs of the deviation bound converges to 0 as n → ∞, if k → ∞ and k/log(n/k) → ∞ in case d = 3, and if k → ∞ in case d > 3. Under these conditions, if the density p is continuous and if additionally k/n → 0, then kRij → 2 in probability.\n\nTheorem 3 (Resistance distance on ε-graphs) Fix two points Xi and Xj. Consider a valid region X with respect to Xi and Xj with bottleneck h and density bounds pmin and pmax. Assume that Xi and Xj have distance at least h from the boundary of X and that ε ≤ h. Then there exist constants c1, . . . , c6 > 0 such that with probability at least 1 − c1 n exp(−c2 nε^d) − c3 exp(−c4 nε^d)/ε^d the resistance distance on the ε-graph satisfies\n\n|nε^d Rij − (nε^d/di + nε^d/dj)| ≤ c5 (log(1/ε) + ε + 1)/(nε^3) if d = 3, and |nε^d Rij − (nε^d/di + nε^d/dj)| ≤ c6/(nε^d) if d > 3.\n\nThe probability converges to 1 if n → ∞ and nε^d/log(n) → ∞. The rhs of the deviation bound converges to 0 as n → ∞, if nε^3/log(1/ε) → ∞ in case d = 3, and if nε^d → ∞ in case d > 3. Under these conditions, if the density p is continuous and if additionally ε → 0, then\n\nnε^d Rij → 1/(ηd p(Xi)) + 1/(ηd p(Xj)) in probability\n\n(where ηd denotes the volume of the unit ball in Rd).\n\nLet us discuss the theorems en bloc. We start with a couple of technical remarks. Note that to achieve the convergence of the resistance distance we have to rescale it appropriately (for example, in the ε-graph we scale by a factor of nε^d). Our rescaling is exactly chosen such that the limit expressions are finite, positive values. Scaling by any other factor in terms of n, ε or k either leads to divergence or to convergence to zero.\n\nThe convergence conditions on n and ε (or k, respectively) are the ones to be expected for random geometric graphs. They are satisfied as soon as the degrees are of the order log(n) (for smaller degrees, the graphs are not connected anyway, see e.g. Penrose, 1999). Hence, our results hold for sparse as well as for dense connected random geometric graphs.\n\nThe valid region X has been introduced for technical reasons. We need to operate in such a region in order to be able to control the behavior of the graph, e.g. the average degrees. The assumptions on X are the standard assumptions used in the random geometric graph literature. 
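A small Monte-Carlo illustration of Theorem 3 (our own sketch, not the paper's experiments): for uniform samples on the unit cube, where p = 1, the rescaled resistance nε^d Rij should land near 2/ηd for interior points, with points near the boundary deviating upwards because their degrees are smaller.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, eps = 1000, 3, 0.25
X = rng.uniform(size=(n, d))                       # uniform density p = 1 on [0,1]^3
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = ((D2 <= eps**2) & ~np.eye(n, dtype=bool)).astype(float)   # unweighted eps-graph
L = np.diag(W.sum(1)) - W
Lp = np.linalg.pinv(L)
R = np.diag(Lp)[:, None] + np.diag(Lp)[None, :] - 2 * Lp      # resistance distances
eta_d = 4.0 / 3.0 * np.pi                          # volume of the unit ball in R^3
i, j = 0, 1
print(n * eps**d * R[i, j], 2 / eta_d)             # rescaled resistance vs. its limit
```

The two printed numbers should be of the same order; boundary effects and the remainder terms of the theorem account for the residual gap at this moderate n.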
In our setting, we have the freedom of choosing X ⊂ Rd as we want. In order to obtain the tightest bounds one should aim for a valid X that has a wide bottleneck and a high minimal density.\n\nMore generally, results about the convergence of the commute distance to 1/di + 1/dj can also be proved for other kinds of graphs such as graphs with given expected degrees and even for power law graphs, under the assumption that the minimal degree in the graph slowly increases with n. Details are beyond the scope of this paper.\n\nProof outline of Theorems 2 and 3 (full proofs are presented in the supplementary material). Consider two fixed vertices s and t in a connected graph and consider the graph as an electrical network where each edge has resistance 1. By the electrical laws, resistances in series add up, that is, for two resistances R1 and R2 in series we get the overall resistance R = R1 + R2. Resistances in parallel lines satisfy 1/R = 1/R1 + 1/R2. Now consult the situation in Figure 1. Consider the vertex s and all edges from s to its ds neighbors. The resistance “spanned” by these ds parallel edges satisfies 1/R = Σ_{i=1}^{ds} 1 = ds, that is, R = 1/ds. Similarly for t. Between the neighbors of s and the ones of t there are very many paths. It turns out that the contribution of these paths to the resistance is negligible (essentially, we have so many wires between the two neighborhoods that electricity can flow nearly freely). So the overall effective resistance between s and t is dominated by the edges adjacent to s and t with contributions 1/ds + 1/dt.\n\nFigure 1: Intuition for the proof of Theorems 2 and 3. See text for details.\n\nProviding a clean mathematical proof for this argument is quite technical. Our proof is based on Corollary 6 in Section IX.2 of Bollobás (1998) that states that the resistance distance between two fixed vertices s and t can be expressed as\n\nRst = inf { Σ_{e∈E} ue^2 | u = (ue)e∈E unit flow from s to t }.\n\nTo apply this theorem one has to construct a flow that spreads “as widely as possible” over the whole graph. Counting edges and adding up resistances then leads to the desired results. Details are fiddly and can be found in the supplementary material.\n\n4 Correcting the resistance distance\n\nObviously, the large sample resistance distance Rij ≈ 1/di + 1/dj is completely meaningless as a distance on a graph. The question we want to discuss in this section is whether there is a way to correct the commute distance such that this unpleasant large sample effect does not occur. Let us start with some references to the literature. It has been observed in several empirical studies that the commute distances are quite small if the vertices under consideration have a high degree, and that the spread of the commute distance values can be quite large. Our theoretical results immediately explain this behavior: if the degrees are large, then 1/di + 1/dj is very small. And compared to the “spread” of di, the spread of 1/di can be enormous.\n\nSeveral heuristics have been suggested to solve this problem. Liben-Nowell and Kleinberg (2003) suggest correcting the hitting times by simply multiplying by the degrees. For the commute distance, this leads to the suggested correction CLNK(i, j) := dj Hij + di Hji. Even though we did not prove it explicitly in our paper, the convergence results for the commute time also hold for the individual hitting times. Namely, the hitting time Hij can be approximated by vol(G)/dj. These theoretical results immediately show that the correction CLNK is not useful, at least if we consider the absolute values. 
For large graphs, it simply has the effect of normalizing all hitting times to ≈ 1, leading to CLNK ≈ 2. However, we believe that the ranking introduced by this distance function still contains useful information about the data. The reason is that while the first order terms dominate the absolute value and converge to two, the second order terms introduce some “variation around two”, and this variation might encode the cluster structure.\n\nYen et al. (2009) exploit the well-known fact that the commute distance is Euclidean and its kernel matrix coincides with the Moore-Penrose inverse L+ of the graph Laplacian matrix. The authors then apply a sigmoid transformation to L+ and consider KYen(i, j) = 1/(1 + exp(−l+ij/σ)) for some constant σ. The idea is that the sigmoid transformation reduces the spread of the distance (or similarity) values. However, this is an ad-hoc approach that has the disadvantage that the resulting “kernel” KYen is not positive definite.\n\nA third correction has been suggested in Brand (2005). Like Yen et al. (2009), he considers the kernel matrix that corresponds to the commute distance. But instead of applying a sigmoid transformation he centers and normalizes the kernel matrix in the feature space. This leads to the corrected kernel\n\nKBrand(i, j) = K̄ij/√(K̄ii K̄jj) with 2K̄ij = −Rij + (1/n) Σ_{k=1}^n (Rik + Rkj) − (1/n^2) Σ_{k,l=1}^n Rkl.\n\nAt first glance it is surprising that using the centered and normalized kernel instead of the commute distance should make any difference. However, whenever one takes a Euclidean distance function of the form dist(i, j) = sij + ui + uj − 2δij ui and computes the corresponding centered kernel matrix, one obtains\n\nKij = K^s_ij + 2δij ui − (2/n)(ui + uj) + (2/n^2) Σ_{r=1}^n ur,    (1)\n\nwhere K^s is the kernel matrix induced by s. Thus the off-diagonal terms are still influenced by ui, but with a factor decaying as 1/n compared to the diagonal. Even though this is no longer the case after normalization (because for the normalization the diagonal terms are important, and these terms still depend on the di), we believe that this is the key to why Brand’s kernel is useful.\n\nWhat would be a suitable correction based on our theoretical results? The proof of our main theorems shows that the edges adjacent to i and j completely dominate the behavior of the resistance distance: they are the “bottleneck” of the flow, and their contribution 1/di + 1/dj dominates all the other terms. The interesting information about the global topology of the graph is contained in the remainder terms Sij = Rij − 1/di − 1/dj, which summarize the flow contributions of all other edges in the graph. We believe that the key to obtaining a good distance function is to remove the influence of the 1/di terms and “amplify” the influence of the general graph term Sij. This can be achieved by either using the off-diagonal terms of the pseudo-inverse graph Laplacian L† while ignoring its diagonal, or by building a distance function based on the remainder terms Sij directly. We choose the second option and propose the following new distance function. 
We define the amplified commute distance as Camp(i, j) = Sij + uij with Sij = Rij − 1/di − 1/dj and uij = 2wij/(di dj) − wii/di^2 − wjj/dj^2. Of course we set Camp(i, i) = 0 for all i.\n\nProposition 4 (Amplified commute distance is Euclidean) The matrix D with entries dij = Camp(i, j)^{1/2} is a Euclidean distance matrix.\n\nProof outline. In preliminary work we show that the remainder terms can be written as Sij = ⟨(ei − ej), B(ei − ej)⟩ − uij where ei denotes the i-th unit vector and B is a positive definite matrix (see the proof of Proposition 2 in von Luxburg et al., 2010). This implies the desired statement.\n\nIn addition to being a Euclidean distance, the amplified commute distance has a nice limit behavior. When n → ∞ the terms uij are dominated by the terms Sij, hence all that is left are the “interesting terms” Sij. For all practical purposes, one should use the kernel induced by the amplified commute distance and center and normalize it. In formulas, the amplified commute kernel is\n\nKamp(i, j) := K̄ij/√(K̄ii K̄jj) with K̄ = −(1/2)(I − (1/n)11′)Camp(I − (1/n)11′)    (2)\n\n(where I is the identity matrix, 1 the vector of all ones, and Camp the amplified commute distance matrix). The next section shows that the kernel Kamp works very nicely in practice.\n\nNote that the correction by Brand and our amplified commute kernel are very similar, but not identical with each other. The off-diagonal terms of both kernels are very close to each other, see Equation (1); that is, if one is only interested in a ranking based on similarity values, both kernels behave similarly. 
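A direct implementation of Camp and the amplified commute kernel might look as follows; this is a sketch under the formulas above (the two-clique test graph is our own choice, and the double centering with factor −1/2 is the standard centering also used for Brand's kernel):

```python
import numpy as np

def amplified_kernel(W):
    """Amplified commute kernel: center and normalize C_amp."""
    d = W.sum(axis=1)
    n = len(d)
    Lp = np.linalg.pinv(np.diag(d) - W)
    R = np.diag(Lp)[:, None] + np.diag(Lp)[None, :] - 2 * Lp   # resistance distances
    S = R - 1 / d[:, None] - 1 / d[None, :]                    # remainder terms S_ij
    u = (2 * W / np.outer(d, d)
         - np.diag(W)[:, None] / d[:, None] ** 2
         - np.diag(W)[None, :] / d[None, :] ** 2)              # correction terms u_ij
    C = S + u
    np.fill_diagonal(C, 0.0)                                   # C_amp(i, i) = 0
    H = np.eye(n) - np.ones((n, n)) / n                        # centering I - (1/n)11'
    Kbar = -0.5 * H @ C @ H                                    # double centering
    s = np.sqrt(np.diag(Kbar))
    return Kbar / np.outer(s, s)                               # normalize the kernel

# two cliques of five vertices joined by a single bridge edge
W = np.zeros((10, 10))
W[:5, :5] = 1 - np.eye(5)
W[5:, 5:] = 1 - np.eye(5)
W[4, 5] = W[5, 4] = 1
K = amplified_kernel(W)
print(K[0, 1] > K[0, 9])   # within-cluster similarity exceeds between-cluster
```

On this toy graph the bridge edge is the bottleneck, so the remainder terms make between-cluster distances large and the kernel recovers the two-cluster structure.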
However, an important difference is that the diagonal terms in the Brand kernel are much bigger than the ones in the amplified kernel (using our convergence techniques one can show that the Brand kernel converges to an identity matrix, that is, the diagonal completely dominates the off-diagonal terms). This might lead to the effect that the Brand kernel behaves worse than our kernel with algorithms like the SVM that do not ignore the diagonal of the kernel.\n\n5 Experiments\n\nOur first set of experiments considers the question of how fast the convergence of the commute distance takes place in practice. We will see that already for relatively small data sets, a very good approximation takes place. This means that the problems of the raw commute distance already occur for small sample sizes. Consider the plots in Figure 2. They report the maximal relative error defined as max_{ij} |Rij − 1/di − 1/dj|/Rij and the corresponding mean relative error on a log10 scale. We show the results for ε-graphs, unweighted kNN graphs and Gaussian similarity graphs (fully connected weighted graphs with edge weights exp(−||xi − xj||^2/σ^2)). In order to be able to plot all results in the same figure, we need to match the parameters of the different graphs. Given some value k for the kNN-graph we thus set the values of ε for the ε-graph and σ for the Gaussian graph to be equal to the maximal k-nearest neighbor distance in the data set.\n\nFigure 2: Relative deviations between true and approximate commute distances. Solid lines show the maximal relative deviations, dashed lines the mean relative deviations. See text for details.\n\nSample size. Consider a set of points drawn from the uniform distribution on the unit cube in R10. As can be seen in Figure 2 (first plot), the maximal relative error decreases very fast with increasing sample size. Note that already for small sample sizes the maximal deviations get very small.\n\nDimension. A result that seems surprising at first glance is that the maximal deviation decreases as we increase the dimension, see Figure 2 (second plot). The intuitive explanation is that in higher dimensions, geometric graphs mix faster as there exist more “shortcuts” between the two sides of the point cloud. Thus, the random walk “forgets faster” where it started from.\n\nClusteredness. The deviation gets worse if the data has a more pronounced cluster structure. Consider a mixture of two Gaussians in R10 with unit variances and the same weight on both components. We call the distance between the centers of the two components the separation. In Figure 2 (third plot) we show both the maximum relative errors (solid lines) and mean relative errors (dashed lines). We can clearly see that with increasing separation, the deviation increases.\n\nSparsity. The last plot of Figure 2 shows the relative errors for increasingly dense graphs, namely for increasing parameter k. Here we used the well-known USPS data set of handwritten digits (9298 points in 256 dimensions). We plot both the maximum relative errors (solid lines) and mean relative errors (dashed lines). We can see that the errors decrease the denser the graph gets. Again this is due to the fact that the random walk mixes faster on denser graphs. Note that the deviations are extremely small on this real-world data set.\n\nIn a second set of experiments we compare the different corrections of the raw commute distance. To this end, we built a kNN graph of the whole USPS data set (all 9298 points, k = 10), computed the commute distance matrix and the various corrections. The resulting matrices are shown in Figure 3 (left part) as heat plots. In all cases, we only plot the off-diagonal terms. We can see that, as predicted by theory, the raw commute distance does not identify the cluster structure. 
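The deviation measure reported in these plots can be reproduced in a few lines; the sketch below is our own, with smaller parameters (n = 500, dim = 10, k = 50) than the paper's figures:

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim, k = 500, 10, 50
X = rng.uniform(size=(n, dim))                    # uniform on the unit cube in R^10
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
nn = np.argsort(D2, axis=1)[:, 1:k + 1]
W = np.zeros((n, n))
W[np.repeat(np.arange(n), k), nn.ravel()] = 1.0
W = np.maximum(W, W.T)                            # symmetric kNN graph
d = W.sum(1)
Lp = np.linalg.pinv(np.diag(d) - W)
R = np.diag(Lp)[:, None] + np.diag(Lp)[None, :] - 2 * Lp
off = ~np.eye(n, dtype=bool)
rel = np.abs(R - (1 / d[:, None] + 1 / d[None, :]))[off] / R[off]
print(np.log10(rel.max()), np.log10(rel.mean()))  # cf. the curves in Figure 2
```

Even at this moderate sample size the relative deviations come out small, which is exactly the point of the first experiment.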
However, the cluster structure is still visible in the kernel corresponding to the commute distance, the pseudo-inverse graph Laplacian L†. The reason is that the diagonal of this matrix can be approximated by (1/d1, . . . , 1/dn), whereas the off-diagonal terms encode the graph structure, but on a much smaller scale than the diagonal. In our heat plots, all four corrections of the graph Laplacian show the cluster structure to a certain extent (the correction by LNK to a small extent, the corrections by Brand, Yen and us to a bigger extent).\n\nA last experiment evaluates the performance of the different distances in a semi-supervised learning task. On the whole USPS data set, we first chose some random points to be labeled. Then we classified the unlabeled points by the k-nearest neighbor classifier based on the distances to the labeled data points. For each classifier, k was chosen by 10-fold cross-validation among k ∈ {1, ..., 10}. The experiment was repeated 10 times. The mean results can be seen in Figure 3 (right figure). As a baseline we also report results based on the standard Euclidean distance between the data points. As predicted by theory, we can see that the raw commute distance performs extremely poorly. The Euclidean distance behaves reasonably, but is outperformed by all corrections of the commute distance. This shows first of all that using the graph structure does help over the basic Euclidean distance. 
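The evaluation protocol, k-nearest neighbor classification from a precomputed distance matrix, can be sketched as follows; `knn_predict`, the toy line metric and the label placement are ours for illustration, not the paper's code:

```python
import numpy as np

def knn_predict(D, labeled_idx, labels, k=3):
    """Majority vote over the k closest labeled points, given a distance matrix D."""
    order = np.argsort(D[:, labeled_idx], axis=1)[:, :k]
    votes = labels[order]                       # labels of the k closest labeled points
    return np.array([np.bincount(v).argmax() for v in votes])

# toy metric: 10 points on a line, points 0 and 9 are labeled with classes 0 and 1
D = np.abs(np.subtract.outer(np.arange(10.0), np.arange(10.0)))
labeled_idx = np.array([0, 9])
labels = np.array([0, 1])
pred = knn_predict(D, labeled_idx, labels, k=1)
print(pred)   # points 0-4 get class 0, points 5-9 get class 1
```

Swapping D for the raw commute distance matrix versus one of the corrected distances reproduces the kind of comparison shown in Figure 3 (right).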
While the naive correction by LNK stays close to the Euclidean distance, the three corrections by Brand, Yen and us virtually lie on top of each other and outperform the other methods by a large margin.\n\nFigure 3: Figures on the left: Distances and kernels based on a kNN graph between all 9298 USPS points (heat plots, off-diagonal terms only): exact resistance distance, pseudo-inverse graph Laplacian L†; kernels corresponding to the corrections by LNK, Yen, Brand, and our amplified Kamp. Figure on the right: Semi-supervised learning results based on the different distances and kernels. The last three lines corresponding to the amplified, Brand and Yen kernel lie on top of each other.\n\nWe conclude with the following tentative statements. We believe that the correction by LNK is “a bit too naive”, whereas the corrections by Brand, Yen and us “tend to work” in a ranking-based setting. Based on our simple experiments it is impossible to judge which of these candidates is “the best one”. We are not too fond of Yen’s correction because it does not lead to a proper kernel. Both Brand’s and our kernel converge to (different) limit functions. So far we do not know the theoretical properties of these limit functions and thus cannot present any theoretical reason to prefer one over the other. 
However, we think that the diagonal dominance of the Brand kernel can be problematic.

6 Discussion

In this paper we have proved that the commute distance on random geometric graphs can be approximated by a very simple limit expression. Contrary to intuition, this limit expression no longer takes into account the cluster structure of the graph, nor any other global property (such as distances in the underlying Euclidean space). Both our theoretical bounds and our simulations tell the same story: the approximation gets better if the data is high-dimensional and not extremely clustered, both of which are standard situations in machine learning. This shows that the use of the raw commute distance for machine learning purposes can be problematic. However, the structure of the graph can be recovered by certain corrections of the commute distance. We suggest using either the correction by Brand (2005) or our own amplified commute kernel from Section 4. Both corrections have a well-defined, non-trivial limit and perform well in experiments.

The intuitive explanation for our result is that as the sample size increases, the random walk on the sample graph "gets lost" in the sheer size of the graph. It takes so long to travel through a substantial part of the graph that by the time the random walk comes close to its goal it has already "forgotten" where it started from. Stated differently: the random walk on the graph has mixed before it hits the desired target vertex. On a higher level, we expect that the problem of "getting lost" affects not only the commute distance, but many other methods where random walks are used in a naive way to explore global properties of a graph. For example, the results in Nadler et al.
(2009), where artifacts of semi-supervised learning in the context of many unlabeled points are studied, seem strongly related to our results. In general, we believe that one has to be particularly careful when using random walk based methods for extracting global properties of graphs in order to avoid getting lost and converging to meaningless results.

[Figure 3, right panel: classification error versus the number of labeled points in the semi-supervised learning task, for the raw commute distance, Euclidean distance, LNK distance, amplified kernel, Brand kernel and Yen kernel.]

References

R. Aleliunas, R. Karp, R. Lipton, L. Lovász, and C. Rackoff. Random walks, universal traversal sequences, and the complexity of maze problems. In FOCS, 1979.
C. Avin and G. Ercal. On the cover time and mixing time of random geometric graphs. Theor. Comput. Sci., 380(1-2):2–22, 2007.
B. Bollobás. Modern Graph Theory. Springer, 1998.
M. Brand. A random walks perspective on maximizing satisfaction and profit. In SDM, 2005.
N. Cesa-Bianchi, C. Gentile, and F. Vitale. Fast and optimal prediction on a labeled tree. In COLT, 2009.
A. Chandra, P. Raghavan, W. Ruzzo, R. Smolensky, and P. Tiwari. The electrical resistance of a graph captures its commute and cover times. In STOC, 1989.
C. Cooper and A. Frieze. The cover time of random geometric graphs. In SODA, 2009.
P. G. Doyle and J. L. Snell. Random walks and electric networks. Mathematical Association of America, Washington, DC, 1984.
F. Fouss, A. Pirotte, J.-M. Renders, and M. Saerens. A novel way of computing dissimilarities between nodes of a graph, with application to collaborative filtering and subspace projection of the graph nodes. Technical Report IAG WP 06/08, Université catholique de Louvain, 2006.
S. Guattery. Graph embeddings, symmetric real matrices, and generalized inverses. Technical report, Institute for Computer Applications in Science and Engineering, NASA Research Center, 1998.
J. Ham, D. D. Lee, S. Mika, and B. Schölkopf. A kernel view of the dimensionality reduction of manifolds. In ICML, 2004.
M. Herbster and M. Pontil. Prediction on a graph with a perceptron. In NIPS, 2006.
D. Klein and M. Randic. Resistance distance. Journal of Mathematical Chemistry, 12:81–95, 1993.
D. Liben-Nowell and J. Kleinberg. The link prediction problem for social networks. In CIKM, 2003.
R. Lyons and Y. Peres. Probability on trees and networks. Book in preparation, available online on the webpage of Yuval Peres, 2010.
B. Nadler, N. Srebro, and X. Zhou. Statistical analysis of semi-supervised learning: The limit of infinite unlabelled data. In NIPS, 2009.
M. Penrose. A strong law for the longest edge of the minimal spanning tree. Ann. of Prob., 27(1):246–260, 1999.
H. Qiu and E. R. Hancock. Image segmentation using commute times. In BMVC, 2005.
H. Qiu and E. R. Hancock. Graph embedding using commute time. In S+SSPR, pages 441–449, 2006.
M. Saerens, F. Fouss, L. Yen, and P. Dupont. The principal components analysis of a graph, and its relationships to spectral clustering. In ECML, 2004.
P. Sarkar and A. Moore. A tractable approach to finding closest truncated-commute-time neighbors in large graphs. In UAI, 2007.
P. Sarkar, A. Moore, and A. Prakash. Fast incremental proximity search in large graphs. In ICML, 2008.
D. Spielman and N. Srivastava. Graph sparsification by effective resistances. In STOC, 2008.
U. von Luxburg, A. Radl, and M. Hein. Hitting times, commute distances and the spectral gap in large random geometric graphs. Preprint available at arXiv, March 2010.
D. M. Wittmann, D. Schmidl, F. Blöchl, and F. J. Theis. Reconstruction of graphs based on random walks. Theoretical Computer Science, 2009.
W. Xiao and I. Gutman. Resistance distance and Laplacian spectrum. Theoretical Chemistry Accounts, 110:284–298, 2003.
L. Yen, D. Vanvyve, F. Wouters, F. Fouss, M. Verleysen, and M. Saerens. Clustering using a random walk based distance measure. In ESANN, 2005.
L. Yen, F. Fouss, C. Decaestecker, P. Francq, and M. Saerens. Graph nodes clustering based on the commute-time kernel. In Advances in Knowledge Discovery and Data Mining, pages 1037–1045, 2009.
D. Zhou and B. Schölkopf. Learning from labeled and unlabeled data using random walks. In DAGM, 2004.