{"title": "Diffusion Improves Graph Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 13354, "page_last": 13366, "abstract": "Graph convolution is the core of most Graph Neural Networks (GNNs) and usually approximated by message passing between direct (one-hop) neighbors. In this work, we remove the restriction of using only the direct neighbors by introducing a powerful, yet spatially localized graph convolution: Graph diffusion convolution (GDC). GDC leverages generalized graph diffusion, examples of which are the heat kernel and personalized PageRank. It alleviates the problem of noisy and often arbitrarily defined edges in real graphs. We show that GDC is closely related to spectral-based models and thus combines the strengths of both spatial (message passing) and spectral methods. We demonstrate that replacing message passing with graph diffusion convolution consistently leads to significant performance improvements across a wide range of models on both supervised and unsupervised tasks and a variety of datasets. Furthermore, GDC is not limited to GNNs but can trivially be combined with any graph-based model or algorithm (e.g. spectral clustering) without requiring any changes to the latter or affecting its computational complexity. Our implementation is available online.", "full_text": "Diffusion Improves Graph Learning\n\nJohannes Klicpera, Stefan Wei\u00dfenberger, Stephan G\u00fcnnemann\n\nTechnical University of Munich\n\n{klicpera,stefan.weissenberger,guennemann}@in.tum.de\n\nAbstract\n\nGraph convolution is the core of most Graph Neural Networks (GNNs) and usually\napproximated by message passing between direct (one-hop) neighbors. In this\nwork, we remove the restriction of using only the direct neighbors by introducing a\npowerful, yet spatially localized graph convolution: Graph diffusion convolution\n(GDC). GDC leverages generalized graph diffusion, examples of which are the\nheat kernel and personalized PageRank. 
It alleviates the problem of noisy and often arbitrarily defined edges in real graphs. We show that GDC is closely related to spectral-based models and thus combines the strengths of both spatial (message passing) and spectral methods. We demonstrate that replacing message passing with graph diffusion convolution consistently leads to significant performance improvements across a wide range of models on both supervised and unsupervised tasks and a variety of datasets. Furthermore, GDC is not limited to GNNs but can trivially be combined with any graph-based model or algorithm (e.g. spectral clustering) without requiring any changes to the latter or affecting its computational complexity. Our implementation is available online.1\n\n1 Introduction\n\nWhen people started using graphs for evaluating chess tournaments in the middle of the 19th century they only considered each player's direct opponents, i.e. their first-hop neighbors. Only later was the analysis extended to recursively consider higher-order relationships via A^2, A^3, etc. and finally generalized to consider all exponents at once, using the adjacency matrix's dominant eigenvector [38, 75]. The field of Graph Neural Networks (GNNs) is currently in a similar state. Graph Convolutional Networks (GCNs) [32], also referred to as Message Passing Neural Networks (MPNNs) [23], are the prevalent approach in this field, but they only pass messages between neighboring nodes in each layer. These messages are then aggregated at each node to form the embedding for the next layer. While MPNNs do leverage higher-order neighborhoods in deeper layers, limiting each layer's messages to one-hop neighbors seems arbitrary. Edges in real graphs are often noisy or defined using an arbitrary threshold [70], so we can clearly improve upon this approach.\n\nSince MPNNs only use the immediate neighborhood information, they are often referred to as spatial methods. 
On the other hand, spectral-based models do not just rely on first-hop neighbors and capture more complex graph properties [16]. However, while being theoretically more elegant, these methods are routinely outperformed by MPNNs on graph-related tasks [32, 74, 81] and do not generalize to previously unseen graphs. This shows that message passing is a powerful framework worth extending upon. To reconcile these two separate approaches and combine their strengths we propose a novel technique of performing message passing inspired by spectral methods: graph diffusion convolution (GDC). Instead of aggregating information only from the first-hop neighbors, GDC aggregates information from a larger neighborhood. This neighborhood is constructed via a new graph generated by sparsifying a generalized form of graph diffusion. We show how graph diffusion is expressed as an equivalent polynomial filter and how GDC is closely related to spectral-based models while addressing their shortcomings. GDC is spatially localized, scalable, can be combined with message passing, and generalizes to unseen graphs. Furthermore, since GDC generates a new sparse graph it is not limited to MPNNs and can trivially be combined with any existing graph-based model or algorithm in a plug-and-play manner, i.e. without requiring any change to the model or affecting its computational complexity. We show that GDC consistently improves performance across a wide range of models on both supervised and unsupervised tasks and various homophilic datasets.\n\n1 https://www.daml.in.tum.de/gdc\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nIn summary, this paper's core contributions are:\n1. Proposing graph diffusion convolution (GDC), a more powerful and general, yet spatially localized alternative to message passing that uses a sparsified generalized form of graph diffusion. 
GDC is not limited to GNNs and can be combined with any graph-based model or algorithm.\n\n2. Analyzing the spectral properties of GDC and graph diffusion. We show how graph diffusion is expressed as an equivalent polynomial filter and analyze GDC's effect on the graph spectrum.\n\n3. Comparing and evaluating several specific variants of GDC and demonstrating its wide applicability to supervised and unsupervised learning on graphs.\n\n2 Generalized graph diffusion\n\nWe consider an undirected graph G = (V, E) with node set V and edge set E. We denote with N = |V| the number of nodes and A ∈ R^{N×N} the adjacency matrix. We define generalized graph diffusion via the diffusion matrix\n\nS = Σ_{k=0}^∞ θ_k T^k,  (1)\n\nwith the weighting coefficients θ_k and the generalized transition matrix T. The choice of θ_k and T must at least ensure that Eq. 1 converges. In this work we will consider the somewhat stricter conditions Σ_{k=0}^∞ θ_k = 1, θ_k ∈ [0, 1], and that the eigenvalues of T are bounded by λ_i ∈ [0, 1], which together are sufficient to guarantee convergence. Note that regular graph diffusion commonly requires T to be column- or row-stochastic.\n\nTransition matrix. Examples for T in an undirected graph include the random walk transition matrix T_rw = A D^{-1} and the symmetric transition matrix T_sym = D^{-1/2} A D^{-1/2}, where the degree matrix D is the diagonal matrix of node degrees, i.e. D_ii = Σ_{j=1}^N A_ij. Note that in our definition T_rw is column-stochastic. We furthermore adjust the random walk by adding (weighted) self-loops to the original adjacency matrix, i.e. use ˜T_sym = (w_loop I_N + D)^{-1/2} (w_loop I_N + A) (w_loop I_N + D)^{-1/2}, with the self-loop weight w_loop ∈ R_+. This is equivalent to performing a lazy random walk with a probability of staying at node i of p_stay,i = w_loop / D_i.\n\nSpecial cases. 
Two popular examples of graph diffusion are personalized PageRank (PPR) [57] and the heat kernel [36]. PPR corresponds to choosing T = T_rw and θ_k^PPR = α(1 − α)^k, with teleport probability α ∈ (0, 1) [14]. The heat kernel uses T = T_rw and θ_k^HK = e^{−t} t^k / k!, with the diffusion time t [14]. Another special case of generalized graph diffusion is the approximated graph convolution introduced by Kipf & Welling [32], which translates to θ_1 = 1 and θ_k = 0 for k ≠ 1 and uses T = ˜T_sym with w_loop = 1.\n\nWeighting coefficients. We compute the series defined by Eq. 1 either in closed form, if possible, or by restricting the sum to a finite number K. Both the coefficients defined by PPR and the heat kernel give a closed-form solution for this series that we found to perform well for the tasks considered. Note that we are not restricted to using T_rw and can use any generalized transition matrix along with the coefficients θ_k^PPR or θ_k^HK, and the series still converges. We can furthermore choose θ_k by repurposing the graph-specific coefficients obtained by methods that optimize coefficients analogous to θ_k as part of their training process. We investigated this approach using label propagation [8, 13] and node embedding models [1]. However, we found that the simple coefficients defined by PPR or the heat kernel perform better than those learned by these models (see Fig. 7 in Sec. 6).\n\n[Figure 1 panels: graph diffusion → density defines edges → sparsify edges → new graph.]\n\nFigure 1: Illustration of graph diffusion convolution (GDC). 
We transform a graph A via graph diffusion and sparsification into a new graph ˜S and run the given model on this graph instead.\n\n3 Graph diffusion convolution\n\nEssentially, graph diffusion convolution (GDC) exchanges the normal adjacency matrix A with a sparsified version ˜S of the generalized graph diffusion matrix S, as illustrated by Fig. 1. This matrix defines a weighted and directed graph, and the model we aim to augment is applied to this graph instead. We found that the calculated edge weights are beneficial for the tasks considered. However, we found that GDC works even when ignoring the weights after sparsification. This enables us to use GDC with models that only support unweighted edges such as the degree-corrected stochastic block model (DCSBM). If required, we make the graph undirected by using (˜S + ˜S^T)/2, e.g. for spectral clustering. With these adjustments GDC is applicable to any graph-based model or algorithm.\n\nIntuition. The general intuition behind GDC is that graph diffusion smooths out the neighborhood over the graph, acting as a kind of denoising filter similar to Gaussian filters on images. This helps with graph learning since both features and edges in real graphs are often noisy. Previous works also highlighted the effectiveness of graph denoising. Berberidis & Giannakis [7] showed that PPR is able to reconstruct the underlying probability matrix of a sampled stochastic block model (SBM) graph. Kloumann et al. [35] and Ragain [64] showed that PPR is optimal in recovering the SBM and DCSBM clusters in the space of landing probabilities under the mean field assumption. Li et al. [40] generalized this result by analyzing the convergence of landing probabilities to their mean field values. These results confirm the intuition that graph diffusion-based smoothing indeed recovers meaningful neighborhoods from noisy graphs.\n\nSparsification. 
Most graph diffusions result in a dense matrix S. This happens even if we do not sum to k = ∞ in Eq. 1 due to the “four/six degrees of separation” in real-world graphs [5]. However, the values in S represent the influence between all pairs of nodes, which is typically highly localized [54]. This is a major advantage over spectral-based models since the spectral domain does not provide any notion of locality. Spatial localization allows us to simply truncate small values of S and recover sparsity, resulting in the matrix ˜S. In this work we consider two options for sparsification: 1. top-k: use the k entries with the highest mass per column; 2. threshold ε: set entries below ε to zero. Sparsification would still require calculating a dense matrix S during preprocessing. However, many popular graph diffusions can be approximated efficiently and accurately in linear time and space. Most importantly, there are fast approximations for both PPR [3, 77] and the heat kernel [34], with which GDC achieves a linear runtime O(N). Furthermore, top-k truncation generates a regular graph, which is amenable to batching methods and solves problems related to widely varying node degrees [15]. Empirically, we even found that sparsification slightly improves prediction accuracy (see Fig. 5 in Sec. 6). After sparsification we calculate the (symmetric or random walk) transition matrix on the resulting graph via T_sym^˜S = D_˜S^{-1/2} ˜S D_˜S^{-1/2}.\n\nLimitations. GDC is based on the assumption of homophily, i.e. “birds of a feather flock together” [49]. Many methods share this assumption and most common datasets adhere to this principle. However, this is an often overlooked limitation and it seems non-straightforward to overcome. One way of extending GDC to heterophily, i.e. “opposites attract”, might be negative edge weights [17, 44]. 
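Taken together, the steps above (compute T, diffuse, sparsify, renormalize) form a short preprocessing pipeline. The following is a minimal dense-matrix sketch: the function name, toy graph, and values of alpha and k are ours for illustration, and at scale one would use the fast sparse PPR/heat-kernel approximations cited above rather than a matrix inverse.

```python
import numpy as np

def gdc_preprocess(A, alpha=0.15, k=4, sym_transition=True):
    """Sketch of GDC preprocessing: diffuse A, sparsify, renormalize.

    Dense O(N^3) illustration only; not the paper's released code.
    """
    N = A.shape[0]
    # 1. Symmetric transition matrix with self-loops (w_loop = 1).
    D_loop = np.diag(1.0 / np.sqrt(A.sum(axis=1) + 1.0))
    T = D_loop @ (A + np.eye(N)) @ D_loop
    # 2. PPR diffusion in closed form: S = alpha (I - (1 - alpha) T)^{-1},
    #    the geometric-series limit of Eq. 1.
    S = alpha * np.linalg.inv(np.eye(N) - (1.0 - alpha) * T)
    # 3. Sparsify: keep the top-k entries per column.
    idx = np.argsort(S, axis=0)[-k:, :]
    S_tilde = np.zeros_like(S)
    np.put_along_axis(S_tilde, idx, np.take_along_axis(S, idx, axis=0), axis=0)
    # 4. Transition matrix on the sparsified graph: symmetric variant
    #    D^{-1/2} S~ D^{-1/2}, or the column-stochastic S~ D^{-1}.
    d = S_tilde.sum(axis=0)
    if sym_transition:
        D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
        return D_inv_sqrt @ S_tilde @ D_inv_sqrt
    return S_tilde @ np.diag(1.0 / d)

# Toy graph: two triangles joined by one edge.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0
S_new = gdc_preprocess(A, k=3)
print(S_new.shape, int((S_new > 0).sum(axis=0).max()))
```

The returned matrix simply replaces A as the input graph of whatever model follows; with top-k sparsification every column has exactly k nonzero entries, which is the regular-graph property noted above.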
Furthermore, we suspect that GDC does not perform well in settings with more complex edges (e.g. knowledge graphs) or graph reconstruction tasks such as link prediction. Preliminary experiments showed that GDC indeed does not improve link prediction performance.\n\n4 Spectral analysis of GDC\n\nEven though GDC is a spatial-based method it can also be interpreted as a graph convolution and analyzed in the graph spectral domain. In this section we show how generalized graph diffusion is expressed as an equivalent polynomial filter and vice versa. Additionally, we perform a spectral analysis of GDC, which highlights the tight connection between GDC and spectral-based models.\n\nSpectral graph theory. To employ the tools of spectral theory to graphs we exchange the regular Laplace operator with either the unnormalized Laplacian L_un = D − A, the random-walk normalized L_rw = I_N − T_rw, or the symmetric normalized graph Laplacian L_sym = I_N − T_sym [76]. The Laplacian's eigendecomposition is L = U Λ U^T, where both U and Λ are real-valued. The graph Fourier transform of a vector x is then defined via x̂ = U^T x and its inverse as x = U x̂. Using this we define a graph convolution on G as x ∗_G y = U((U^T x) ⊙ (U^T y)), where ⊙ denotes the Hadamard product. Hence, a filter g_ξ with parameters ξ acts on x as g_ξ(L) x = U Ĝ_ξ(Λ) U^T x, where Ĝ_ξ(Λ) = diag(ĝ_ξ,1(Λ), ..., ĝ_ξ,N(Λ)). A common choice for g_ξ in the literature is a polynomial filter of order J, since it is localized and has a limited number of parameters [16, 27]:\n\ng_ξ(L) = Σ_{j=0}^J ξ_j L^j = U (Σ_{j=0}^J ξ_j Λ^j) U^T.  (2)\n\nGraph diffusion as a polynomial filter. Comparing Eq. 1 with Eq. 2 shows the close relationship between polynomial filters and generalized graph diffusion since we only need to exchange L by T to go from one to the other. To make this relationship more specific and find a direct correspondence between GDC with θ_k and a polynomial filter with parameters ξ_j we need to find parameters that solve\n\nΣ_{j=0}^J ξ_j L^j = Σ_{k=0}^K θ_k T^k.  (3)\n\nTo find these parameters we choose the Laplacian corresponding to L = I_N − T, resulting in (see App. A)\n\nξ_j = Σ_{k=j}^K (k choose j) (−1)^j θ_k,   θ_k = Σ_{j=k}^J (j choose k) (−1)^k ξ_j,  (4)\n\nwhich shows the direct correspondence between graph diffusion and spectral methods. Note that we need to set J = K. Solving Eq. 4 for the coefficients corresponding to the heat kernel θ_k^HK and PPR θ_k^PPR leads to\n\nξ_j^HK = (−t)^j / j!,   ξ_j^PPR = (1 − 1/α)^j,  (5)\n\nshowing how the heat kernel and PPR are expressed as polynomial filters. Note that PPR's corresponding polynomial filter converges only if α > 0.5. This is caused by changing the order of summation when deriving ξ_j^PPR, which results in an alternating series. However, if the series does converge it gives the exact same transformation as the equivalent graph diffusion.\n\nSpectral properties of GDC. We will now extend the discussion to all parts of GDC and analyze how they transform the graph Laplacian's eigenvalues. GDC consists of four steps: 1. calculate the transition matrix T, 2. take the sum in Eq. 1 to obtain S, 3. sparsify the resulting matrix by truncating small values, resulting in ˜S, and 4. calculate the transition matrix T_˜S.\n\n1. Transition matrix. 
Calculating the transition matrix T only changes which Laplacian matrix we use for analyzing the graph's spectrum, i.e. we use L_sym or L_rw instead of L_un. Adding self-loops to obtain ˜T does not preserve the eigenvectors and its effect therefore cannot be calculated precisely. Wu et al. [78] empirically found that adding self-loops shrinks the graph's eigenvalues.\n\n[Figure 2 panels: (a) graph diffusion as a filter, PPR with α and heat kernel with t; both act as low-pass filters. (b) Sparsification with threshold ε of PPR (α = 0.1) on CORA; eigenvalues are almost unchanged. (c) Transition matrix on CORA's sparsified graph ˜S; this acts as a weak high-pass filter.]\n\nFigure 2: Influence of different parts of GDC on the Laplacian's eigenvalues λ.\n\n2. Sum over T^k. Summation does not affect the eigenvectors of the original matrix, since T^k v_i = λ_i T^{k−1} v_i = λ_i^k v_i for the eigenvector v_i of T with associated eigenvalue λ_i. This also shows that the eigenvalues are transformed as\n\n˜λ_i = Σ_{k=0}^∞ θ_k λ_i^k.  (6)\n\nSince the eigenvalues of T are bounded by 1 we can use the geometric series to derive a closed-form expression for PPR, i.e. ˜λ_i = α Σ_{k=0}^∞ (1 − α)^k λ_i^k = α / (1 − (1 − α) λ_i). For the heat kernel we use the exponential series, resulting in ˜λ_i = e^{−t} Σ_{k=0}^∞ (t^k / k!) λ_i^k = e^{t(λ_i − 1)}. 
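These two closed forms are easy to probe numerically. A small sketch (alpha = 0.1 and t = 5 are illustrative values, not prescriptions) checks the low-pass behavior: both transformations increase monotonically in the eigenvalue of T, i.e. decrease monotonically in the Laplacian eigenvalue 1 − λ, and preserve λ = 1 exactly.

```python
import numpy as np

# Eigenvalue transformations from the closed forms above:
#   PPR:  lam' = alpha / (1 - (1 - alpha) * lam)
#   Heat: lam' = exp(t * (lam - 1))
# where lam is an eigenvalue of T and 1 - lam the Laplacian eigenvalue.
alpha, t = 0.1, 5.0
lam_T = np.linspace(0.0, 1.0, 11)   # eigenvalues of T (assumed in [0, 1])
ppr = alpha / (1.0 - (1.0 - alpha) * lam_T)
heat = np.exp(t * (lam_T - 1.0))

# Monotonically increasing in lam_T, hence a low-pass filter on the
# Laplacian spectrum (large-scale structure amplified, fine detail damped).
print(bool(np.all(np.diff(ppr) > 0)), bool(np.all(np.diff(heat) > 0)))
# The eigenvalue 1 (the coarsest component) is preserved exactly.
print(bool(np.isclose(ppr[-1], 1.0) and np.isclose(heat[-1], 1.0)))
```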
How this transformation affects the corresponding Laplacian's eigenvalues is illustrated in Fig. 2a. Both PPR and the heat kernel act as low-pass filters. Low eigenvalues corresponding to large-scale structure in the graph (e.g. clusters [55]) are amplified, while high eigenvalues corresponding to fine details but also noise are suppressed.\n\n3. Sparsification. Sparsification changes both the eigenvalues and the eigenvectors, which means that there is no direct correspondence between the eigenvalues of S and ˜S and we cannot analyze its effect analytically. However, we can use eigenvalue perturbation theory (Stewart & Sun [69], Corollary 4.13) to derive the upper bound\n\nsqrt(Σ_{i=1}^N (˜λ_i − λ_i)^2) ≤ ‖E‖_F ≤ N ‖E‖_max ≤ N ε,  (7)\n\nwith the perturbation matrix E = ˜S − S and the threshold ε. This bound significantly overestimates the perturbation since PPR and the heat kernel both exhibit strong localization on real-world graphs and hence the change in eigenvalues empirically does not scale with N (or, rather, √N). By ordering the eigenvalues we see that, empirically, the typical thresholds for sparsification have almost no effect on the eigenvalues, as shown in Fig. 2b and in the close-up in Fig. 11 in App. B.2. We find that the small changes caused by sparsification mostly affect the highest and lowest eigenvalues. The former correspond to very large clusters and long-range interactions, which are undesired for local graph smoothing. The latter correspond to spurious oscillations, which are not helpful for graph learning either and are most likely affected because of the abrupt cutoff at ε.\n\n4. Transition matrix on ˜S. As a final step we calculate the transition matrix on the resulting graph ˜S. 
This step does not just change which Laplacian we consider, since we have already switched to using the transition matrix in step 1. It furthermore does not preserve the eigenvectors and is thus again best investigated empirically by ordering the eigenvalues. Fig. 2c shows that, empirically, this step slightly dampens low eigenvalues. This may seem counterproductive. However, the main purpose of using the transition matrix is ensuring that sparsification does not cause nodes to be treated differently by losing a different number of adjacent edges. The filtering is only a side effect.\n\nLimitations of spectral-based models. While there are tight connections between GDC and spectral-based models, GDC is actually spatial-based and therefore does not share their limitations. Similar to polynomial filters, GDC does not compute an expensive eigenvalue decomposition, preserves locality on the graph and is not limited to a single graph after training, i.e. typically the same coefficients θ_k can be used across graphs. The choice of coefficients θ_k depends on the type of graph at hand and does not change significantly between similar graphs. Moreover, the hyperparameters α of PPR and t of the heat kernel usually fall within a narrow range that is rather insensitive to both the graph and model (see Fig. 8 in Sec. 6).\n\n5 Related work\n\nGraph diffusion and random walks have been extensively studied in classical graph learning [13, 14, 36, 37], especially for clustering [34], semi-supervised classification [12, 22], and recommendation systems [44]. For an overview of existing methods see Masuda et al. [46] and Fouss et al. [22].\n\nThe first models similar in structure to current Graph Neural Networks (GNNs) were proposed by Sperduti & Starita [68] and Baskin et al. [6], and the name GNN first appeared in [24, 65]. 
However,\nthey only became widely adopted in recent years, when they started to outperform classical models in\nmany graph-related tasks [19, 33, 42, 82]. In general, GNNs are classi\ufb01ed into spectral-based models\n[11, 16, 28, 32, 41], which are based on the eigendecomposition of the graph Laplacian, and spatial-\nbased methods [23, 26, 43, 52, 56, 62, 74], which use the graph directly and form new representations\nby aggregating the representations of a node and its neighbors. However, this distinction is often rather\nblurry and many models can not be clearly attributed to one type or the other. Deep learning also\ninspired a variety of unsupervised node embedding methods. Most models use random walks to learn\nnode embeddings in a similar fashion as word2vec [51] [25, 61] and have been shown to implicitly\nperform a matrix factorization [63]. Other unsupervised models learn Gaussian distributions instead\nof vectors [10], use an auto-encoder [31], or train an encoder by maximizing the mutual information\nbetween local and global embeddings [73].\nThere have been some isolated efforts of using extended neighborhoods for aggregation in GNNs\nand graph diffusion for node embeddings. PPNP [33] propagates the node predictions generated by\na neural network using personalized PageRank, DCNN [4] extends node features by concatenating\nfeatures aggregated using the transition matrices of k-hop random walks, GraphHeat [79] uses the\nheat kernel and PAN [45] the transition matrix of maximal entropy random walks to aggregate over\nnodes in each layer, PinSage [82] uses random walks for neighborhood aggregation, and MixHop [2]\nconcatenates embeddings aggregated using the transition matrices of k-hop random walks before\neach layer. VERSE [71] learns node embeddings by minimizing KL-divergence from the PPR matrix\nto a low-rank approximation. Attention walk [1] uses a similar loss to jointly optimize the node\nembeddings and diffusion coef\ufb01cients \u03b8k. 
None of these works considered sparsification, generalized graph diffusion, spectral properties, or using preprocessing to generalize across models.\n\n6 Experimental results\n\nExperimental setup. We take extensive measures to prevent any kind of bias in our results. We optimize the hyperparameters of all models on all datasets with both the unmodified graph and all GDC variants separately using a combination of grid and random search on the validation set. Each result is averaged across 100 data splits and random initializations for supervised tasks and 20 random initializations for unsupervised tasks, as suggested by Klicpera et al. [33] and Shchur et al. [67]. We report performance on a test set that was used exactly once. We report all results as averages with 95 % confidence intervals calculated via bootstrapping.\n\nWe use the symmetric transition matrix with self-loops ˜T_sym = (I_N + D)^{-1/2} (I_N + A) (I_N + D)^{-1/2} for GDC and the column-stochastic transition matrix T_rw^˜S = ˜S D_˜S^{-1} on ˜S. We present two simple and effective choices for the coefficients θ_k: the heat kernel and PPR. The diffusion matrix S is sparsified using either an ε-threshold or top-k.\n\nDatasets and models. We evaluate GDC on six datasets: the citation graphs CITESEER [66], CORA [48], and PUBMED [53], the co-author graph COAUTHOR CS [67], and the co-purchase graphs AMAZON COMPUTERS and AMAZON PHOTO [47, 67]. We only use their largest connected components. We show how GDC impacts the performance of 9 models: Graph Convolutional Network (GCN) [32], Graph Attention Network (GAT) [74], jumping knowledge network (JK) [80], Graph Isomorphism Network (GIN) [81], and ARMA [9] are supervised models. The degree-corrected stochastic block model (DCSBM) [30], spectral clustering (using L_sym) [55], DeepWalk [61], and Deep Graph Infomax (DGI) [73] are unsupervised models. Note that DGI uses node features while other unsupervised models do not. We use k-means clustering to generate clusters from node embeddings. Dataset statistics and hyperparameters are reported in App. B.\n\nSemi-supervised node classification. In this task the goal is to label nodes based on the graph, node features X ∈ R^{N×F} and a subset of labeled nodes y. The main goal of GDC is improving the performance of MPNN models.\n\n[Figure 3 panels: node classification accuracy (%) of GCN, GAT, JK, GIN, and ARMA with and without GDC (heat, PPR) on CORA, CITESEER, PUBMED, COAUTHOR CS, AMZ COMP, and AMZ PHOTO; one GIN configuration ran out of memory (oom).]\n\nFigure 3: Node classification accuracy of GNNs with and without GDC. GDC consistently improves accuracy across models and datasets. It is able to fix models whose accuracy otherwise breaks down.\n\n[Figure 4 panels: clustering accuracy (%) of DCSBM, spectral clustering, DeepWalk, and DGI with and without GDC (heat, PPR) on the same six datasets.]\n\nFigure 4: Clustering accuracy with and without GDC. GDC consistently improves the accuracy across a diverse set of models and datasets.\n\nFig. 
3 shows that GDC consistently and significantly improves the accuracy of a wide variety of state-of-the-art models across multiple diverse datasets. Note how GDC is able to fix the performance of GNNs that otherwise break down on some datasets (e.g. GAT). We also surpass or match the previous state of the art on all datasets investigated (see App. B.2).\n\nClustering. We highlight GDC's ability to be combined with any graph-based model by reporting the performance of a diverse set of models that use a wide range of paradigms. Fig. 4 shows the unsupervised accuracy obtained by matching clusters to ground-truth classes using the Hungarian algorithm. Accuracy consistently and significantly improves for all models and datasets. Note that spectral clustering uses the graph's eigenvectors, which are not affected by the diffusion step itself. Still, its performance improves by up to 30 percentage points. Results in tabular form are presented in App. B.2.\n\n[Figure 5 panel: GCN+GDC accuracy (%) vs. average degree on CORA, CITESEER, and AMZ COMP.]\n\nFigure 5: GCN+GDC accuracy (using PPR and top-k). Lines indicate original accuracy and degree. GDC surpasses original accuracy at around the same degree independent of dataset. Sparsification often improves accuracy.\n\n[Figure 6 panel: difference in accuracy (percentage points) vs. self-loop weight for T_sym and T_rw on CORA, CITESEER, and AMZ COMP.]\n\nFigure 6: Difference in GCN+GDC accuracy (using PPR and top-k, percentage points) compared to the symmetric T_sym without self-loops. T_rw performs worse and self-loops have no significant effect.\n\n[Figure 7 panel: accuracy (%) of GCN, JK, and ARMA on CORA, CITESEER, and AMZ COMP with PPR vs. AdaDIF coefficients.]\n\nFigure 7: Accuracy of GDC with coefficients θ_k defined by PPR and learned by AdaDIF. Simple PPR coefficients consistently perform better than those obtained by AdaDIF, even with optimized regularization.\n\n[Figure 8 panels: accuracy (%) of GCN, JK, and ARMA on CORA, CITESEER, and AMZ COMP for varying α (PPR) and t (heat kernel).]\n\nFigure 8: Accuracy achieved by using GDC with varying hyperparameters of PPR (α) and the heat kernel (t). Optimal values fall within a narrow range that is consistent across datasets and models.\n\nIn this work we concentrate on node-level prediction tasks in a transductive setting. However, GDC can just as easily be applied to inductive problems or different tasks like graph classification. In our experiments we found promising, yet not as consistent results for graph classification (e.g. 2.5 percentage points with GCN on the DD dataset [18]). We found no improvement for the inductive setting on PPI [50], which is rather unsurprising since the underlying data used for graph construction already includes graph diffusion-like mechanisms (e.g. regulatory interactions, protein complexes, and metabolic enzyme-coupled interactions). We furthermore conducted experiments to answer five important questions:\n\nDoes GDC increase graph density? 
When sparsifying the generalized graph diffusion matrix S we are free to choose the resulting level of sparsity in S̃. Fig. 5 indicates that, surprisingly, GDC requires roughly the same average degree to surpass the performance of the original graph independent of the dataset and its average degree (ε-threshold in App. B.2, Fig. 12). This suggests that the sparsification hyperparameter can be obtained from a fixed average degree. Note that CORA and CITESEER are both small graphs with low average degree. Graphs become denser with size [39] and in practice we expect GDC to typically reduce the average degree at constant accuracy. Fig. 5 furthermore shows that there is an optimal degree of sparsity above which the accuracy decreases. This indicates that sparsification is not only computationally beneficial but also improves prediction performance.

How to choose the transition matrix T? We found Tsym to perform best across datasets. More specifically, Fig. 6 shows that the symmetric version on average outperforms the random walk transition matrix Trw. This figure also shows that GCN accuracy is largely insensitive to self-loops when using Tsym – all changes lie within the estimated uncertainty. However, we did find that other models, e.g. GAT, perform better with self-loops (not shown).

How to choose the coefficients θk? We found the coefficients defined by PPR and the heat kernel to be effective choices for θk. Fig. 8 shows that their optimal hyperparameters typically fall within a narrow range of α ∈ [0.05, 0.2] and t ∈ [1, 10]. We also tried obtaining θk from models that learn analogous coefficients [1, 8, 13]. However, we found that θk obtained by these models tend to converge to a minimal neighborhood, i.e.
they converge to θ0 ≈ 1 or θ1 ≈ 1 and all other θk ≈ 0.

Figure 10: Improvement (percentage points) in GCN accuracy by adding GDC depending on distance (number of hops) from the training set. Nodes further away tend to benefit more from GDC.

This is caused by their training losses almost always decreasing when the considered neighborhood shrinks. We were able to control this overfitting to some degree using strong regularization (specifically, we found L2 regularization on the difference of neighboring coefficients θk+1 − θk to perform best).
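For concreteness, the pipeline evaluated in these experiments — the symmetric transition matrix Tsym, the PPR coefficients θk = α(1−α)^k, and top-k sparsification — can be sketched as follows. This is a minimal dense NumPy sketch under our own assumptions (function name, default values, and the truncation depth K are illustrative), not the paper's tuned settings or reference implementation.

```python
import numpy as np

def gdc_ppr(A, alpha=0.15, k=64, K=16):
    """Sketch of graph diffusion convolution with PPR coefficients.

    A: dense symmetric adjacency matrix (numpy array).
    alpha: PPR teleport probability; k: top-k sparsification per node;
    K: truncation depth of the diffusion series (all illustrative defaults).
    """
    N = A.shape[0]
    # Symmetric transition matrix with self-loops: Tsym = D^{-1/2} (A + I) D^{-1/2}
    A_loop = A + np.eye(N)
    d = A_loop.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    T_sym = D_inv_sqrt @ A_loop @ D_inv_sqrt

    # Generalized graph diffusion S = sum_k theta_k T^k with PPR coefficients
    # theta_k = alpha * (1 - alpha)^k, truncated after K terms.
    S = np.zeros_like(T_sym)
    T_pow = np.eye(N)
    for i in range(K):
        S += alpha * (1 - alpha) ** i * T_pow
        T_pow = T_pow @ T_sym

    # Sparsify: keep only the k largest entries per column (per node).
    S_tilde = np.zeros_like(S)
    for col in range(N):
        top = np.argsort(S[:, col])[-k:]
        S_tilde[top, col] = S[top, col]

    # Column-normalize the sparsified matrix so it is again a transition matrix.
    S_tilde /= S_tilde.sum(axis=0, keepdims=True)
    return S_tilde
```

The resulting S̃ simply replaces the adjacency matrix fed to any downstream model (GCN, GAT, spectral clustering, etc.), which is why GDC composes with arbitrary graph-based algorithms.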
However, this requires hand-tuning the regularization for every dataset, which defeats the purpose of learning the coefficients from the graph. Moreover, we found that even with hand-tuned regularization the coefficients defined by PPR and the heat kernel perform better than trained θk, as shown in Fig. 7.

How does the label rate affect GDC? When varying the label rate from 5 up to 60 labels per class we found that the improvement achieved by GDC increases the sparser the labels are. Still, GDC improves performance even for 60 labels per class, i.e. a 17 % label rate (see Fig. 9). This trend is most likely due to the larger neighborhood leveraged by GDC.

Figure 9: Accuracy on Cora with different label rates. Improvement from GDC increases for sparser label rates.

Which nodes benefit from GDC? Our experiments showed no correlation of improvement with most common node properties, except for the distance from the training set. Nodes further away from the training set tend to benefit more from GDC, as demonstrated by Fig. 10. Besides smoothing out the neighborhood, GDC also has the effect of increasing the model's range, since it is no longer restricted to using only first-hop neighbors. Hence, nodes further away from the training set influence the training and later benefit from the improved model weights.

7 Conclusion

We propose graph diffusion convolution (GDC), a method based on sparsified generalized graph diffusion. GDC is a more powerful, yet spatially localized extension of message passing in GNNs, and is able to enhance any graph-based model. We show the tight connection between GDC and spectral-based models and analyze GDC's spectral properties. GDC shares many of the strengths of spectral methods and none of their weaknesses.
We conduct extensive and rigorous experiments that show that GDC consistently improves the accuracy of a wide range of models on both supervised and unsupervised tasks across various homophilic datasets, while requiring very little hyperparameter tuning. There are many extensions and applications of GDC that remain to be explored. We expect many graph-based models and tasks to benefit from GDC, e.g. graph classification and regression. Promising extensions include other diffusion coefficients θk, such as those given by the methods presented in Fouss et al. [22], and more advanced random walks and operators that are not defined by powers of a transition matrix.

Acknowledgments

This research was supported by the German Federal Ministry of Education and Research (BMBF), grant no. 01IS18036B, and by the Deutsche Forschungsgemeinschaft (DFG) through the Emmy Noether grant GU 1409/2-1 and the TUM International Graduate School of Science and Engineering (IGSSE), GSC 81. The authors of this work take full responsibility for its content.

References

[1] Sami Abu-El-Haija, Bryan Perozzi, Rami Al-Rfou, and Alex Alemi. Watch Your Step: Learning Node Embeddings via Graph Attention. In NeurIPS, 2018.

[2] Sami Abu-El-Haija, Bryan Perozzi, Amol Kapoor, Nazanin Alipourfard, Kristina Lerman, Hrayr Harutyunyan, Greg Ver Steeg, and Aram Galstyan. MixHop: Higher-Order Graph Convolutional Architectures via Sparsified Neighborhood Mixing. In ICML, 2019.

[3] R. Andersen, F. Chung, and K. Lang. Local Graph Partitioning using PageRank Vectors. In FOCS, 2006.

[4] James Atwood and Don Towsley. Diffusion-Convolutional Neural Networks. In NIPS, 2016.

[5] Lars Backstrom, Paolo Boldi, Marco Rosa, Johan Ugander, and Sebastiano Vigna. Four degrees of separation. In ACM Web Science Conference, 2012.

[6] Igor I. Baskin, Vladimir A. Palyulin, and Nikolai S. Zefirov.
A Neural Device for Searching Direct Correlations between Structures and Properties of Chemical Compounds. Journal of Chemical Information and Computer Sciences, 37(4):715–721, 1997.

[7] Dimitris Berberidis and Georgios B. Giannakis. Node Embedding with Adaptive Similarities for Scalable Learning over Graphs. CoRR, 1811.10797, 2018.

[8] Dimitris Berberidis, Athanasios N. Nikolakopoulos, and Georgios B. Giannakis. Adaptive diffusions for scalable learning over graphs. IEEE Transactions on Signal Processing, 67(5):1307–1321, 2019.

[9] Filippo Maria Bianchi, Daniele Grattarola, Lorenzo Livi, and Cesare Alippi. Graph Neural Networks with convolutional ARMA filters. CoRR, 1901.01343, 2019.

[10] Aleksandar Bojchevski and Stephan Günnemann. Deep Gaussian Embedding of Graphs: Unsupervised Inductive Learning via Ranking. In ICLR, 2018.

[11] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral Networks and Deep Locally Connected Networks on Graphs. In ICLR, 2014.

[12] Eliav Buchnik and Edith Cohen. Bootstrapped Graph Diffusions: Exposing the Power of Nonlinearity. Proceedings of the ACM on Measurement and Analysis of Computing Systems (POMACS), 2(1):1–19, 2018.

[13] Siheng Chen, Aliaksei Sandryhaila, Jose M. F. Moura, and Jelena Kovacevic. Adaptive graph filtering: Multiresolution classification on graphs. In IEEE Global Conference on Signal and Information Processing (GlobalSIP), 2013.

[14] F. Chung. The heat kernel as the pagerank of a graph. Proceedings of the National Academy of Sciences, 104(50):19735–19740, 2007.

[15] Aurelien Decelle, Florent Krzakala, Cristopher Moore, and Lenka Zdeborová. Inference and phase transitions in the detection of modules in sparse networks. Physical Review Letters, 107(6):065701, 2011.

[16] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst.
Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering. In NIPS, 2016.

[17] Tyler Derr, Yao Ma, and Jiliang Tang. Signed Graph Convolutional Networks. In ICDM, 2018.

[18] Paul D. Dobson and Andrew J. Doig. Distinguishing enzyme structures from non-enzymes without alignments. Journal of Molecular Biology, 330(4):771–783, 2003.

[19] David K. Duvenaud, Dougal Maclaurin, Jorge Aguilera-Iparraguirre, Rafael Gómez-Bombarelli, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P. Adams. Convolutional Networks on Graphs for Learning Molecular Fingerprints. In NIPS, 2015.

[20] Radim Řehůřek and Petr Sojka. Software Framework for Topic Modelling with Large Corpora. In LREC 2010 Workshop on New Challenges for NLP Frameworks, 2010.

[21] Matthias Fey and Jan E. Lenssen. Fast Graph Representation Learning with PyTorch Geometric. In ICLR workshop, 2019.

[22] François Fouss, Kevin Francoisse, Luh Yen, Alain Pirotte, and Marco Saerens. An experimental investigation of kernels on graphs for collaborative recommendation and semisupervised classification. Neural Networks, 31:53–72, 2012.

[23] Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. Neural Message Passing for Quantum Chemistry. In ICML, 2017.

[24] M. Gori, G. Monfardini, and F. Scarselli. A new model for learning in graph domains. In IEEE International Joint Conference on Neural Networks, 2005.

[25] Aditya Grover and Jure Leskovec. node2vec: Scalable Feature Learning for Networks. In KDD, 2016.

[26] William L. Hamilton, Zhitao Ying, and Jure Leskovec. Inductive Representation Learning on Large Graphs. In NIPS, 2017.

[27] David K. Hammond, Pierre Vandergheynst, and Rémi Gribonval. Wavelets on graphs via spectral graph theory. Applied and Computational Harmonic Analysis, 30(2):129–150, 2011.

[28] Mikael Henaff, Joan Bruna, and Yann LeCun.
Deep Convolutional Networks on Graph-Structured Data. CoRR, 1506.05163, 2015.

[29] Eric Jones, Travis Oliphant, Pearu Peterson, and others. SciPy: Open source scientific tools for Python. 2001.

[30] Brian Karrer and Mark EJ Newman. Stochastic blockmodels and community structure in networks. Physical Review E, 83(1):016107, 2011.

[31] Thomas N. Kipf and Max Welling. Variational Graph Auto-Encoders. In NIPS workshop, 2016.

[32] Thomas N. Kipf and Max Welling. Semi-Supervised Classification with Graph Convolutional Networks. In ICLR, 2017.

[33] Johannes Klicpera, Aleksandar Bojchevski, and Stephan Günnemann. Predict then Propagate: Graph Neural Networks Meet Personalized PageRank. In ICLR, 2019.

[34] Kyle Kloster and David F Gleich. Heat kernel based community detection. In KDD, 2014.

[35] Isabel M. Kloumann, Johan Ugander, and Jon Kleinberg. Block models and personalized PageRank. Proceedings of the National Academy of Sciences, 114(1):33–38, 2017.

[36] Risi Imre Kondor and John Lafferty. Diffusion kernels on graphs and other discrete structures. In ICML, 2002.

[37] Stéphane Lafon and Ann B. Lee. Diffusion Maps and Coarse-Graining: A Unified Framework for Dimensionality Reduction, Graph Partitioning, and Data Set Parameterization. IEEE Trans. Pattern Anal. Mach. Intell., 28(9):1393–1403, 2006.

[38] Edmund Landau. Zur relativen Wertbemessung der Turnierresultate. Deutsches Wochenschach, 11:366–369, 1895.

[39] Jure Leskovec, Jon Kleinberg, and Christos Faloutsos. Graphs over Time: Densification Laws, Shrinking Diameters and Possible Explanations. In KDD, 2005.

[40] Pan Li, Eli Chien, and Olgica Milenkovic. Optimizing generalized pagerank methods for seed-expansion community detection. In NeurIPS, 2019.

[41] Ruoyu Li, Sheng Wang, Feiyun Zhu, and Junzhou Huang.
Adaptive Graph Convolutional Neural Networks. In AAAI, 2018.

[42] Yaguang Li, Rose Yu, Cyrus Shahabi, and Yan Liu. Diffusion Convolutional Recurrent Neural Network: Data-Driven Traffic Forecasting. In ICLR, 2018.

[43] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard S. Zemel. Gated Graph Sequence Neural Networks. In ICLR, 2016.

[44] Jeremy Ma, Weiyu Huang, Santiago Segarra, and Alejandro Ribeiro. Diffusion filtering of graph signals and its use in recommendation systems. In ICASSP, 2016.

[45] Zheng Ma, Ming Li, and Yuguang Wang. PAN: Path Integral Based Convolution for Deep Graph Neural Networks. In ICML workshop, 2019.

[46] Naoki Masuda, Mason A Porter, and Renaud Lambiotte. Random walks and diffusion on networks. Physics Reports, 716:1–58, 2017.

[47] Julian J. McAuley, Christopher Targett, Qinfeng Shi, and Anton van den Hengel. Image-Based Recommendations on Styles and Substitutes. In SIGIR, 2015.

[48] Andrew Kachites McCallum, Kamal Nigam, Jason Rennie, and Kristie Seymore. Automating the construction of internet portals with machine learning. Information Retrieval, 3(2):127–163, 2000.

[49] Miller McPherson, Lynn Smith-Lovin, and James M Cook. Birds of a feather: Homophily in social networks. Annual Review of Sociology, 27(1):415–444, 2001.

[50] Jörg Menche, Amitabh Sharma, Maksim Kitsak, Susan Ghiassian, Marc Vidal, Joseph Loscalzo, and Albert-László Barabási. Uncovering disease-disease relationships through the incomplete human interactome. Science, 347(6224):1257601, 2015.

[51] Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. In NIPS, 2013.

[52] Federico Monti, Davide Boscaini, Jonathan Masci, Emanuele Rodola, Jan Svoboda, and Michael M. Bronstein. Geometric Deep Learning on Graphs and Manifolds Using Mixture Model CNNs.
In CVPR, 2017.

[53] Galileo Namata, Ben London, Lise Getoor, and Bert Huang. Query-driven Active Surveying for Collective Classification. In International Workshop on Mining and Learning with Graphs (MLG), KDD, 2012.

[54] Huda Nassar, Kyle Kloster, and David F. Gleich. Strong Localization in Personalized PageRank Vectors. In International Workshop on Algorithms and Models for the Web Graph (WAW), 2015.

[55] Andrew Y Ng, Michael I Jordan, and Yair Weiss. On Spectral Clustering: Analysis and an algorithm. In NIPS, 2002.

[56] Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. Learning Convolutional Neural Networks for Graphs. In ICML, 2016.

[57] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: Bringing order to the web. Report, Stanford InfoLab, 1998.

[58] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS workshop, 2017.

[59] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[60] Tiago P. Peixoto. The graph-tool python library. figshare, 2014.

[61] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. DeepWalk: online learning of social representations. In KDD, 2014.

[62] Trang Pham, Truyen Tran, Dinh Q. Phung, and Svetha Venkatesh. Column Networks for Collective Classification. In AAAI, 2017.

[63] Jiezhong Qiu, Yuxiao Dong, Hao Ma, Jian Li, Kuansan Wang, and Jie Tang. Network Embedding as Matrix Factorization: Unifying DeepWalk, LINE, PTE, and node2vec.
In WSDM, 2018.

[64] Stephen Ragain. Community Detection via Discriminant functions for Random Walks in the degree-corrected Stochastic Block Model. Report, Stanford University, 2017.

[65] F. Scarselli, M. Gori, Ah Chung Tsoi, M. Hagenbuchner, and G. Monfardini. The Graph Neural Network Model. IEEE Transactions on Neural Networks, 20(1):61–80, 2009.

[66] Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Gallagher, and Tina Eliassi-Rad. Collective Classification in Network Data. AI Magazine, 29(3):93–106, 2008.

[67] Oleksandr Shchur, Maximilian Mumme, Aleksandar Bojchevski, and Stephan Günnemann. Pitfalls of Graph Neural Network Evaluation. In NIPS workshop, 2018.

[68] A. Sperduti and A. Starita. Supervised neural networks for the classification of structures. IEEE Transactions on Neural Networks, 8(3):714–735, 1997.

[69] Gilbert Wright Stewart and Ji-guang Sun. Matrix Perturbation Theory. Computer Science and Scientific Computing. 1990.

[70] Yu-Hang Tang, Dongkun Zhang, and George Em Karniadakis. An atomistic fingerprint algorithm for learning ab initio molecular force fields. The Journal of Chemical Physics, 148(3):034101, 2018.

[71] Anton Tsitsulin, Davide Mottin, Panagiotis Karras, and Emmanuel Müller. VERSE: Versatile Graph Embeddings from Similarity Measures. In WWW, 2018.

[72] Stefan Van Der Walt, S Chris Colbert, and Gael Varoquaux. The NumPy array: a structure for efficient numerical computation. Computing in Science & Engineering, 13(2):22, 2011.

[73] Petar Velickovic, William Fedus, William L. Hamilton, Pietro Liò, Yoshua Bengio, and R. Devon Hjelm. Deep Graph Infomax. In ICLR, 2019.

[74] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph Attention Networks. In ICLR, 2018.

[75] Sebastiano Vigna. Spectral ranking.
Network Science, CoRR (updated, 0912.0238v15), 4(4):433–445, 2016.

[76] Ulrike von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, 2007.

[77] Zhewei Wei, Xiaodong He, Xiaokui Xiao, Sibo Wang, Shuo Shang, and Ji-Rong Wen. TopPPR: Top-k Personalized PageRank Queries with Precision Guarantees on Large Graphs. In SIGMOD, 2018.

[78] Felix Wu, Tianyi Zhang, Amauri Holanda de Souza Jr., Christopher Fifty, Tao Yu, and Kilian Q. Weinberger. Simplifying Graph Convolutional Networks. In ICML, 2019.

[79] Bingbing Xu, Huawei Shen, Qi Cao, Keting Cen, and Xueqi Cheng. Graph Convolutional Networks using Heat Kernel for Semi-supervised Learning. In IJCAI, 2019.

[80] Keyulu Xu, Chengtao Li, Yonglong Tian, Tomohiro Sonobe, Ken-ichi Kawarabayashi, and Stefanie Jegelka. Representation Learning on Graphs with Jumping Knowledge Networks. In ICML, 2018.

[81] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How Powerful are Graph Neural Networks? In ICLR, 2019.

[82] Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L. Hamilton, and Jure Leskovec. Graph Convolutional Neural Networks for Web-Scale Recommender Systems. In KDD, 2018.