{"title": "DFNets: Spectral CNNs for Graphs with Feedback-Looped Filters", "book": "Advances in Neural Information Processing Systems", "page_first": 6009, "page_last": 6020, "abstract": "We propose a novel spectral convolutional neural network (CNN) model on graph structured data, namely Distributed Feedback-Looped Networks (DFNets). This model is incorporated with a robust class of spectral graph filters, called feedback-looped filters, to provide better localization on vertices, while still attaining fast convergence and linear memory requirements. Theoretically, feedback-looped filters can guarantee convergence w.r.t. a specified error bound, and be applied universally to any graph without knowing its structure. Furthermore, the propagation rule of this model can diversify features from the preceding layers to produce strong gradient flows. We have evaluated our model using two benchmark tasks: semi-supervised document classification on citation networks and semi-supervised entity classification on a knowledge graph. The experimental results show that our model considerably outperforms the state-of-the-art methods in both benchmark tasks over all datasets.", "full_text": "DFNets: Spectral CNNs for Graphs with\n\nFeedback-Looped Filters\n\nAsiri Wijesinghe\n\nResearch School of Computer Science\n\nThe Australian National University\nasiri.wijesinghe@anu.edu.au\n\nQing Wang\n\nResearch School of Computer Science\n\nThe Australian National University\n\nqing.wang@anu.edu.au\n\nAbstract\n\nWe propose a novel spectral convolutional neural network (CNN) model on graph\nstructured data, namely Distributed Feedback-Looped Networks (DFNets). This\nmodel is incorporated with a robust class of spectral graph \ufb01lters, called feedback-\nlooped \ufb01lters, to provide better localization on vertices, while still attaining fast\nconvergence and linear memory requirements. Theoretically, feedback-looped\n\ufb01lters can guarantee convergence w.r.t. 
a speci\ufb01ed error bound, and be applied\nuniversally to any graph without knowing its structure. Furthermore, the propaga-\ntion rule of this model can diversify features from the preceding layers to produce\nstrong gradient \ufb02ows. We have evaluated our model using two benchmark tasks:\nsemi-supervised document classi\ufb01cation on citation networks and semi-supervised\nentity classi\ufb01cation on a knowledge graph. The experimental results show that our\nmodel considerably outperforms the state-of-the-art methods in both benchmark\ntasks over all datasets.\n\nIntroduction\n\n1\nConvolutional neural networks (CNNs) [20] are a powerful deep learning approach which has been\nwidely applied in various \ufb01elds, e.g., object recognition [29], image classi\ufb01cation [14], and semantic\nsegmentation [22]. Traditionally, CNNs only deal with data that has a regular Euclidean structure,\nsuch as images, videos and text. In recent years, due to the rising trends in network analysis and\nprediction, generalizing CNNs to graphs has attracted considerable interest [3, 7, 11, 26]. However,\nsince graphs are in irregular non-Euclidean domains, this brings up the challenge of how to enhance\nCNNs for effectively extracting useful features (e.g. topological structure) from arbitrary graphs.\nTo address this challenge, a number of studies have been devoted to enhancing CNNs by developing\n\ufb01lters over graphs. In general, there are two categories of graph \ufb01lters: (a) spatial graph \ufb01lters, and\n(b) spectral graph \ufb01lters. Spatial graph \ufb01lters are de\ufb01ned as convolutions directly on graphs, which\nconsider neighbors that are spatially close to a current vertex [1, 9, 11]. 
In contrast, spectral graph\n\ufb01lters are convolutions indirectly de\ufb01ned on graphs, through their spectral representations [3, 5, 7].\nIn this paper, we follow the line of previous studies in developing spectral graph \ufb01lters and tackle the\nproblem of designing an effective, yet ef\ufb01cient CNNs with spectral graph \ufb01lters.\nPreviously, Bruna et al. [3] proposed convolution operations on graphs via a spectral decomposition\nof the graph Laplacian. To reduce learning complexity in the setting where the graph structure is\nnot known a priori, Henaff et al. [13] developed a spectral \ufb01lter with smooth coef\ufb01cients. Then,\nDefferrard et al. [7] introduced Chebyshev \ufb01lters to stabilize convolution operations under coef\ufb01cient\nperturbation and these \ufb01lters can be exactly localized in k-hop neighborhood. Later, Kipf et al. [19]\nproposed a simple layer-wise propagation model using Chebyshev \ufb01lters on 1-hop neighborhood.\nVery recently, some works attempted to develop rational polynomial \ufb01lters, such as Cayley \ufb01lters [21]\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fFigure 1: A simpli\ufb01ed example of illustrating feedback-looped \ufb01lters, where v1 is the current vertex\nand the similarity of the colours indicates the correlation between vertices, e.g., v1 and v5 are highly\ncorrelated, but v2 and v6 are less correlated with v1: (a) an input graph, where \u03bbi is the original\nfrequency to vertex vi; (b) the feedforward \ufb01ltering, which attenuates some low order frequencies,\ne.g. \u03bb2, and amplify other frequencies, e.g. \u03bb5 and \u03bb6; (c) the feedback \ufb01ltering, which reduces the\nerror in the frequencies generated by (b), e.g. \u03bb6.\n\nand ARMA1 [2]. From a different perspective, Petar et al. 
[31] proposed a self-attention based CNN\narchitecture for graph \ufb01lters, which extracts features by considering the importance of neighbors.\nOne key idea behind existing works on designing spectral graph \ufb01lters is to approximate the frequency\nresponses of graph \ufb01lters using a polynomial function (e.g. Chebyshev \ufb01lters [7]) or a rational\npolynomial function (e.g. Cayley \ufb01lters [21] and ARMA1 [2]). Polynomial \ufb01lters are sensitive\nto changes in the underlying graph structure. They are also very smooth and can hardly model\nsharp changes, as illustrated in Figure 1. Rational polynomial \ufb01lters are more powerful to model\nlocalization, but they often have to trade off computational ef\ufb01ciency, resulting in higher learning and\ncomputational complexities, as well as instability.\nContributions. In this work, we aim to develop a new class of spectral graph \ufb01lters that can overcome\nthe above limitations. We also propose a spectral CNN architecture (i.e. DFNet) to incorporate these\ngraph \ufb01lters. In summary, our contributions are as follows:\n\n\u2022 Improved localization. A new class of spectral graph \ufb01lters, called feedback-looped \ufb01lters,\nis proposed to enable better localization, due to its rational polynomial form. Basically,\nfeedback-looped \ufb01lters consist of two parts: feedforward and feedback. The feedforward\n\ufb01ltering is k-localized as polynomial \ufb01lters, while the feedback \ufb01ltering is unique which\nre\ufb01nes k-localized features captured by the feedforward \ufb01ltering to improve approximation\naccuracy. We also propose two techniques: scaled-normalization and cut-off frequency to\navoid the issues of gradient vanishing/exploding and instabilities.\n\u2022 Ef\ufb01cient computation. For feedback-looped \ufb01lters, we avoid the matrix inversion implied\nby the denominator through approximating the matrix inversion with a recursion. 
Thus, benefited from this approximation, feedback-looped filters attain linear convergence time and linear memory requirements w.r.t. the number of edges in a graph.
\u2022 Theoretical properties. Feedback-looped filters enjoy several nice theoretical properties. Unlike other rational polynomial filters for graphs, they have theoretically guaranteed convergence w.r.t. a specified error bound. On the other hand, they still have the universal property as other spectral graph filters [17], i.e., they can be applied without knowing the underlying structure of a graph. The optimal coefficients of feedback-looped filters are learnable via an optimization condition for any given graph.
\u2022 Dense architecture. We propose a layer-wise propagation rule for our spectral CNN model with feedback-looped filters, which densely connects layers as in DenseNet [15]. This design enables our model to diversify features from all preceding layers, leading to a strong gradient flow. We also introduce a layer-wise regularization term to alleviate the overfitting issue. In doing so, we can prevent the generation of spurious features and thus improve accuracy of the prediction.

To empirically verify the effectiveness of our work, we have evaluated feedback-looped filters within three different CNN architectures over four benchmark datasets to compare against the state-of-the-art methods. The experimental results show that our models significantly outperform the state-of-the-art methods. 
We further demonstrate the effectiveness of our model DFNet through the node embeddings in a 2-D space of vertices from two datasets.

2 Spectral Convolution on Graphs

Let G = (V, E, A) be an undirected and weighted graph, where V is a set of vertices, E \u2286 V \u00d7 V is a set of edges, and A \u2208 R^{n\u00d7n} is an adjacency matrix which encodes the weights of edges. We let n = |V| and m = |E|. A graph signal is a function x : V \u2192 R and can be represented as a vector x \u2208 R^n whose i-th component x_i is the value of x at the i-th vertex in V. The graph Laplacian is defined as L = I \u2212 D^{\u22121/2} A D^{\u22121/2}, where D \u2208 R^{n\u00d7n} is a diagonal matrix with D_{ii} = \sum_j A_{ij} and I is an identity matrix. L has a set of orthogonal eigenvectors \{u_i\}_{i=0}^{n\u22121} \u2208 R^n, known as the graph Fourier basis, and non-negative eigenvalues \{\u03bb_i\}_{i=0}^{n\u22121}, known as the graph frequencies [5]. L is diagonalizable by the eigendecomposition such that L = U \u039b U^H, where \u039b = diag([\u03bb_0, \u2026, \u03bb_{n\u22121}]) \u2208 R^{n\u00d7n} and U^H is the Hermitian transpose of U. We use \u03bb_min and \u03bb_max to denote the smallest and largest eigenvalues of L, respectively.

Given a graph signal x, the graph Fourier transform of x is x\u0302 = U^H x \u2208 R^n and its inverse is x = U x\u0302 [27, 30]. The graph Fourier transform enables us to apply graph filters in the vertex domain. A graph filter h can filter x by altering (amplifying or attenuating) the graph frequencies as

h(L)x = h(U \u039b U^H)x = U h(\u039b) U^H x = U h(\u039b) x\u0302.    (1)

Here, h(\u039b) = diag([h(\u03bb_0), \u2026, h(\u03bb_{n\u22121})]), which controls how the frequency of each component in a graph signal x is modified. However, applying graph filtering as in Eq. 1 requires the eigendecomposition of L, which is computationally expensive. 
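To make Eq. 1 concrete, the following minimal numpy sketch (function and variable names are ours, not from the paper) filters a signal through the full eigendecomposition. It is illustrative only: the O(n^3) decomposition is exactly the cost that the approximations discussed next avoid.

```python
import numpy as np

def spectral_filter(A, x, h):
    """Apply a spectral graph filter h to signal x as in Eq. 1.

    Illustrative only: the full eigendecomposition costs O(n^3) and is
    impractical for large graphs."""
    n = A.shape[0]
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L = np.eye(n) - D_inv_sqrt @ A @ D_inv_sqrt   # normalized graph Laplacian
    lam, U = np.linalg.eigh(L)                    # L = U Lambda U^H
    x_hat = U.T @ x                               # graph Fourier transform
    return U @ (h(lam) * x_hat)                   # filter in frequency domain, invert

# Example: an all-pass response h(lambda) = 1 must return x unchanged.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], float)
x = np.array([1.0, 2.0, 3.0])
y = spectral_filter(A, x, lambda lam: np.ones_like(lam))
```

Because L is symmetric, `eigh` returns an orthonormal basis, so the all-pass filter reduces to U U^T x = x.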
To address this issue, several works [2, 7, 12, 19, 21, 23] have studied the approximation of h(\u039b) by a polynomial or rational polynomial function.

Chebyshev filters. Hammond et al. [12] first proposed to approximate h(\u03bb) by a polynomial function with k-th order polynomials and Chebyshev coefficients. Later, Defferrard et al. [7] developed Chebyshev filters for spectral CNNs on graphs. A Chebyshev filter is defined as

h_\u03b8(\u03bb\u0303) = \sum_{j=0}^{k\u22121} \u03b8_j T_j(\u03bb\u0303),    (2)

where \u03b8 \u2208 R^k is a vector of learnable Chebyshev coefficients, \u03bb\u0303 \u2208 [\u22121, 1] is rescaled from \u03bb, the Chebyshev polynomials T_j(\u03bb) = 2\u03bbT_{j\u22121}(\u03bb) \u2212 T_{j\u22122}(\u03bb) are recursively defined with T_0(\u03bb) = 1 and T_1(\u03bb) = \u03bb, and k controls the size of filters, i.e., localized in the k-hop neighborhood of a vertex [12]. Kipf and Welling [19] simplified Chebyshev filters by restricting them to the 1-hop neighborhood.

Lanczos filters. Recently, Liao et al. [23] used the Lanczos algorithm to generate a low-rank matrix approximation T for the graph Laplacian. They used the affinity matrix S = D^{\u22121/2} A D^{\u22121/2}. Since L = I \u2212 S holds, L and S share the same eigenvectors but have different eigenvalues. As a result, L and S correspond to the same x\u0302. To approximate the eigenvectors and eigenvalues of S, they diagonalize the tri-diagonal matrix T \u2208 R^{m\u00d7m} to compute Ritz vectors V \u2208 R^{n\u00d7m} and Ritz values R \u2208 R^{m\u00d7m}, and thus S \u2248 V R V^T. Accordingly, a k-hop Lanczos filter operation is

h_\u03b8(R) = \sum_{j=0}^{k\u22121} \u03b8_j R^j,    (3)

where \u03b8 \u2208 R^k is a vector of learnable Lanczos filter coefficients. Thus, the spectral convolutional operation is defined as h_\u03b8(S)x \u2248 V h_\u03b8(R) V^T x. 
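As a hedged illustration of polynomial filtering, the Chebyshev recursion of Eq. 2 can be sketched as follows (our own code, not the authors' implementation). Only matrix-vector products are used, so a sparse Laplacian would work equally well.

```python
import numpy as np

def chebyshev_filter(L, x, theta, lmax=2.0):
    """k-localized Chebyshev filtering (Eq. 2): y = sum_j theta_j T_j(L~) x,
    where L~ = 2L/lmax - I rescales the spectrum into [-1, 1]."""
    n = L.shape[0]
    L_tilde = (2.0 / lmax) * L - np.eye(n)
    T_prev, T_curr = x, L_tilde @ x               # T_0(L~)x = x, T_1(L~)x = L~x
    y = theta[0] * T_prev
    if len(theta) > 1:
        y = y + theta[1] * T_curr
    for j in range(2, len(theta)):
        T_next = 2.0 * (L_tilde @ T_curr) - T_prev  # T_j = 2 L~ T_{j-1} - T_{j-2}
        y = y + theta[j] * T_next
        T_prev, T_curr = T_curr, T_next
    return y
```

Each added order only costs one extra matrix-vector product, which is the source of the O(km) time in the complexity comparison later in the paper.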
Such Lanczos filter operations can significantly reduce computation overhead when approximating large powers of S, i.e., S^k \u2248 V R^k V^T. Thus, they can efficiently compute the spectral graph convolution with a very large localization range to easily capture the multi-scale information of the graph.

Cayley filters. Observing that Chebyshev filters have difficulty in detecting narrow frequency bands due to \u03bb\u0303 \u2208 [\u22121, 1], Levie et al. [21] proposed Cayley filters, based on Cayley polynomials:

h_{\u03b8,s}(\u03bb) = \u03b8_0 + 2Re(\sum_{j=1}^{k\u22121} \u03b8_j (s\u03bb \u2212 i)^j (s\u03bb + i)^{\u2212j}),    (4)

where \u03b8_0 \u2208 R is a real coefficient and (\u03b8_1, \u2026, \u03b8_{k\u22121}) \u2208 C^{k\u22121} is a vector of complex coefficients. Re(x) denotes the real part of a complex number x, and s > 0 is a parameter called spectral zoom, which controls the degree of \u201czooming\u201d into eigenvalues in \u039b. Both \u03b8 and s are learnable during training. To improve efficiency, the Jacobi method is used to approximately compute Cayley polynomials.

ARMA1 filters. Bianchi et al. [2] sought to address similar issues as identified in [21]. However, different from Cayley filters, they developed a first-order ARMA filter, which is approximated by a first-order recursion:

x\u0304^(t+1) = a L\u0303 x\u0304^(t) + b x,    (5)

where a and b are the filter coefficients, x\u0304^(0) = x, and L\u0303 = ((\u03bb_max \u2212 \u03bb_min)/2) I \u2212 L. Accordingly, the frequency response is defined as

h(\u03bb\u0303) = r / (\u03bb\u0303 \u2212 p),    (6)

where \u03bb\u0303 = (\u03bb_max \u2212 \u03bb_min)/2 \u2212 \u03bb, r = \u2212b/a, and p = 1/a [17]. Multiple ARMA1 filters can be applied in parallel to obtain an ARMAk filter. 
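A minimal sketch of the first-order recursion in Eq. 5 (our own illustrative code; in the actual model a and b are learned, and the iteration runs until the error is acceptable):

```python
import numpy as np

def arma1_filter(L, x, a, b, lmin, lmax, t_max=100):
    """First-order ARMA recursion (Eq. 5): x^(t+1) = a M x^(t) + b x,
    with M = ((lmax - lmin)/2) I - L. The iteration converges when the
    spectral radius of a*M is below 1."""
    n = L.shape[0]
    M = ((lmax - lmin) / 2.0) * np.eye(n) - L
    x_bar = x.copy()
    for _ in range(t_max):
        x_bar = a * (M @ x_bar) + b * x
    return x_bar
```

At convergence the output is a fixed point of the update, i.e. it satisfies x\u0304 = a M x\u0304 + b x, which is the rational response of Eq. 6 applied to x.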
However, the memory complexity of k parallel ARMA1 filters is k times higher than that of a single ARMA1 graph filter.

We make some remarks on how these existing spectral filters are related to each other. (i) As discussed in [2, 21, 23], polynomial filters (e.g. Chebyshev and Lanczos filters) can be approximately treated as a special kind of rational polynomial filters. (ii) Further, Chebyshev filters can be regarded as a special case of Lanczos filters. (iii) Although both Cayley and ARMAk filters are rational polynomial filters, they differ in how they approximate the matrix inverse implied by the denominator of a rational function. Cayley filters use a fixed number of Jacobi iterations, while ARMAk filters use a first-order recursion plus a parallel bank of k ARMA1 filters. (iv) ARMA1 by Bianchi et al. [2] is similar to GCN by Kipf et al. [19] because they both consider localization within the 1-hop neighborhood.

3 Proposed Method

We introduce a new class of spectral graph filters, called feedback-looped filters, and propose a spectral CNN for graphs with feedback-looped filters, namely Distributed Feedback-Looped Networks (DFNets). We also discuss optimization techniques and analyze theoretical properties.

3.1 Feedback-Looped Filters

Feedback-looped filters belong to the class of Auto Regressive Moving Average (ARMA) filters [16, 17]. Formally, an ARMA filter is defined as:

h_{\u03c8,\u03c6}(L)x = (I + \sum_{j=1}^{p} \u03c8_j L^j)^{\u22121} (\sum_{j=0}^{q} \u03c6_j L^j) x.    (7)

The parameters p and q refer to the feedback and feedforward degrees, respectively. \u03c8 \u2208 C^p and \u03c6 \u2208 C^{q+1} are two vectors of complex coefficients. Computing the denominator of Eq. 7, however, requires a matrix inversion, which is computationally inefficient for large graphs. 
To circumvent this issue, feedback-looped filters use the following approximation:

x\u0304^(0) = x  and  x\u0304^(t) = \u2212\sum_{j=1}^{p} \u03c8_j L\u0303^j x\u0304^(t\u22121) + \sum_{j=0}^{q} \u03c6_j L\u0303^j x,    (8)

where L\u0303 = L\u0302 \u2212 (\u03bb\u0302_max/2) I, L\u0302 = I \u2212 D\u0302^{\u22121/2} A\u0302 D\u0302^{\u22121/2}, A\u0302 = A + I, D\u0302_{ii} = \sum_j A\u0302_{ij}, and \u03bb\u0302_max is the largest eigenvalue of L\u0302. Accordingly, the frequency response of feedback-looped filters is defined as:

h(\u03bb_i) = (\sum_{j=0}^{q} \u03c6_j \u03bb_i^j) / (1 + \sum_{j=1}^{p} \u03c8_j \u03bb_i^j).    (9)

To alleviate the issues of gradient vanishing/exploding and numerical instabilities, we further introduce two techniques in the design of feedback-looped filters: scaled-normalization and cut-off frequency.

Scaled-normalization technique. To assure the stability of feedback-looped filters, we apply the scaled-normalization technique to increase the stability region, i.e., using the scaled-normalized Laplacian L\u0303 = L\u0302 \u2212 (\u03bb\u0302_max/2) I, rather than just L\u0302. This accordingly helps centralize the eigenvalues of the Laplacian L\u0302 and reduce its spectral radius bound. The scaled-normalized Laplacian L\u0303 consists of graph frequencies within [0, 2], in which eigenvalues are ordered in an increasing order.

Cut-off frequency technique. To map graph frequencies within [0, 2] to a uniform discrete distribution, we define a cut-off frequency \u03bb_cut = (\u03bb_max/2 \u2212 \u03b7), where \u03b7 \u2208 [0, 1] and \u03bb_max refers to the largest eigenvalue of L\u0303. The cut-off frequency is used as a threshold to control the amount of attenuation on graph frequencies. The eigenvalues {\u03bb_i}_{i=0}^{n\u22121} are converted to binary values {\u03bb\u0303_i}_{i=0}^{n\u22121} such that \u03bb\u0303_i = 1 if \u03bb_i \u2265 \u03bb_cut and \u03bb\u0303_i = 0 otherwise. This trick allows the generation of ideal high-pass filters so as to sharpen a signal by amplifying its graph Fourier coefficients. This technique also solves the issue of narrow frequency bands existing in previous spectral filters, both polynomial and rational polynomial [7, 21], because these previous spectral filters only accept a small band of frequencies. In contrast, our proposed feedback-looped filters resolve this issue using a cut-off frequency technique, i.e., amplifying frequencies higher than a certain low cut-off value while attenuating frequencies lower than that cut-off value. Thus, our proposed filters can accept a wider range of frequencies and capture better characteristic properties of a graph.

3.2 Coefficient Optimisation

Given a feedback-looped filter with a desired frequency response \u0125 : {\u03bb\u0303_i}_{i=0}^{n\u22121} \u2192 R, we aim to find the optimal coefficients \u03c8 and \u03c6 that make the frequency response as close as possible to the desired frequency response, i.e. to minimize the following error:

\u00b4e(\u03bb\u0303_i) = \u0125(\u03bb\u0303_i) \u2212 (\sum_{j=0}^{q} \u03c6_j \u03bb\u0303_i^j) / (1 + \sum_{j=1}^{p} \u03c8_j \u03bb\u0303_i^j).    (10)

However, the above equation is not linear w.r.t. the coefficients \u03c8 and \u03c6. Thus, we redefine the error as follows:

e(\u03bb\u0303_i) = \u0125(\u03bb\u0303_i) + \u0125(\u03bb\u0303_i) \sum_{j=1}^{p} \u03c8_j \u03bb\u0303_i^j \u2212 \sum_{j=0}^{q} \u03c6_j \u03bb\u0303_i^j.    (11)

Let e = [e(\u03bb\u0303_0), \u2026, e(\u03bb\u0303_{n\u22121})]^T, \u0125 = [\u0125(\u03bb\u0303_0), \u2026
, \u0125(\u03bb\u0303_{n\u22121})]^T, \u03b1 \u2208 R^{n\u00d7p} with \u03b1_{ij} = \u03bb\u0303_i^j, and \u03b2 \u2208 R^{n\u00d7(q+1)} with \u03b2_{ij} = \u03bb\u0303_i^{j\u22121} be two Vandermonde-like matrices. Then, we have e = \u0125 + diag(\u0125) \u03b1 \u03c8 \u2212 \u03b2 \u03c6. Thus, the stable coefficients \u03c8 and \u03c6 can be learned by minimizing e as a convex constrained least-squares optimization problem:

minimize_{\u03c8,\u03c6}  ||\u0125 + diag(\u0125) \u03b1 \u03c8 \u2212 \u03b2 \u03c6||_2
subject to  ||\u03b1 \u03c8||_\u221e \u2264 \u03b3 and \u03b3 < 1.    (12)

Here, the parameter \u03b3 controls the tradeoff between convergence efficiency and approximation accuracy. A higher value of \u03b3 can lead to slower convergence but better accuracy. It is not recommended to use very low \u03b3 values due to potentially unacceptable accuracy. ||\u03b1 \u03c8||_\u221e \u2264 \u03b3 < 1 is the stability condition, which will be further discussed in detail in Section 3.4.

3.3 Spectral Convolutional Layer

We propose a CNN-based architecture, called DFNets, which can stack multiple spectral convolutional layers with feedback-looped filters to extract features of increasing abstraction. Let P = \u2212\sum_{j=1}^{p} \u03c8_j L\u0303^j and Q = \sum_{j=0}^{q} \u03c6_j L\u0303^j. The propagation rule of a spectral convolutional layer is defined as:

X\u0304^(t+1) = \u03c3(P X\u0304^(t) \u03b8_1^(t) + Q X \u03b8_2^(t) + \u00b5(\u03b8_1^(t); \u03b8_2^(t)) + b),    (13)

where \u03c3 refers to a non-linear activation function such as ReLU. X\u0304^(0) = X \u2208 R^{n\u00d7f} is a graph signal matrix, where f refers to the number of features. X\u0304^(t) is a matrix of activations in the t-th layer. \u03b8_1^(t) \u2208 R^{c\u00d7h} and \u03b8_2^(t) \u2208 R^{f\u00d7h} are two trainable weight matrices in the t-th layer. 
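The filtering part of the propagation rule in Eq. 13 can be sketched as follows (our own illustrative code, not the released implementation; the kernel-regularization term \u00b5 and the dense cross-layer concatenation are omitted for brevity):

```python
import numpy as np

def dfnets_layer(L_tilde, X_bar, X, psi, phi, theta1, theta2, b):
    """One spectral convolutional layer in the spirit of Eq. 13:
    P = -sum_{j=1..p} psi_j L~^j, Q = sum_{j=0..q} phi_j L~^j,
    X_bar^(t+1) = ReLU(P X_bar theta1 + Q X theta2 + b)."""
    n = L_tilde.shape[0]
    P = np.zeros((n, n))
    Q = phi[0] * np.eye(n)
    Lj = np.eye(n)
    for j in range(1, max(len(psi) + 1, len(phi))):
        Lj = Lj @ L_tilde                  # running power L~^j
        if j <= len(psi):
            P -= psi[j - 1] * Lj           # feedback polynomial
        if j < len(phi):
            Q += phi[j] * Lj               # feedforward polynomial
    Z = P @ X_bar @ theta1 + Q @ X @ theta2 + b
    return np.maximum(Z, 0.0)              # ReLU activation
```

Dense powers of L\u0303 are materialized here only for clarity; in practice the products P X\u0304 and Q X would be accumulated with sparse matrix-vector operations.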
To compute X\u0304^(t+1), a vertex needs access to its p-hop neighbors with the output signal of the previous layer X\u0304^(t), and to its q-hop neighbors with the input signal from X. To attenuate the overfitting issue, we add \u00b5(\u03b8_1^(t); \u03b8_2^(t)), namely kernel regularization [6], and a bias term b. We use the Xavier normal initialization method [10] to initialise the kernel and bias weights, the unit-norm constraint technique [8] to normalise the kernel and bias weights by restricting the parameters of all layers to a small range, and the kernel regularization technique to penalize the parameters in each layer during training. In doing so, we can prevent the generation of spurious features and thus improve the accuracy of prediction.\u00b9

In this model, each layer is directly connected to all subsequent layers in a feed-forward manner, as in DenseNet [15]. Consequently, the t-th layer receives all preceding feature maps F_0, \u2026, F_{t\u22121} as input. We concatenate multiple preceding feature maps column-wise into a single tensor to obtain more diversified features for boosting the accuracy. This densely connected CNN architecture has several compelling benefits: (a) it reduces the vanishing-gradient issue, (b) it increases feature propagation and reuse, and (c) it refines the information flow between layers [15].

3.4 Theoretical Analysis

Feedback-looped filters have several nice properties, e.g., guaranteed convergence, linear convergence time, and universal design. We discuss these properties and analyze computational complexities.

Convergence. Theoretically, a feedback-looped filter can achieve a desired frequency response only when t \u2192 \u221e [17]. However, due to the property of linear convergence preserved by feedback-looped filters, stability can be guaranteed after a number of iterations w.r.t. 
a specified small error [16]. More specifically, since the pole of rational polynomial filters should lie in the unit circle of the z-plane to guarantee stability, we can derive the stability condition ||\u2212\sum_{j=1}^{p} \u03c8_j L^j|| < 1 from Eq. 7 in the vertex domain, and correspondingly obtain the stability condition ||\u03b1 \u03c8||_\u221e \u2264 \u03b3 \u2208 (0, 1) in the frequency domain as stipulated in Eq. 12 [16].

Universal design. The universal design is beneficial when the underlying structure of a graph is unknown or the topology of a graph changes over time. The corresponding filter coefficients can be learned independently of the underlying graph and are universally applicable. When designing feedback-looped filters, we define the desired frequency response function \u0125 over graph frequencies \u03bb\u0303_i in a binary format in the uniform discrete distribution, as discussed in Section 3.1. Then, we solve Eq. 12 in the least-squares sense for this finite set of graph frequencies to find optimal filter coefficients.

Spectral Graph Filter             Type                  Learning Complexity   Time Complexity   Memory Complexity
Chebyshev filters [7]             Polynomial            O(k)                  O(km)             O(m)
Lanczos filters [23]              Polynomial            O(k)                  O(km^2)           O(m^2)
Cayley filters [21]               Rational polynomial   O((r+1)k)             O((r+1)km)        O(m)
ARMA1 filters [2]                 Rational polynomial   O(t)                  O(tm)             O(m)
d parallel ARMA1 filters [2]      Rational polynomial   O(t)                  O(tm)             O(dm)
Feedback-looped filters (ours)    Rational polynomial   O(tp + q)             O((tp + q)m)      O(m)

Table 1: Learning, time and space complexities of spectral graph filters.

Complexity. When computing x\u0304^(t) as in Eq. 8, we need to calculate L\u0303^j x\u0304^(t\u22121) for j = 1, \u2026, p and L\u0303^j x for j = 1, \u2026, q. Nevertheless, L\u0303^j x is computed only once because L\u0303^j x = L\u0303(L\u0303^{j\u22121} x). Thus, we need p multiplications for each t for the first term in Eq. 8, and q multiplications for the second term in Eq. 8. 
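The iteration of Eq. 8 and its cost structure can be sketched as follows (our own illustrative code): the feedforward term is computed once and reused across iterations, which is what yields linear cost in the number of edges per iteration when L\u0303 is sparse.

```python
import numpy as np

def feedback_looped_filter(L_tilde, x, psi, phi, t_max=10):
    """Feedback-looped filtering via the inverse-free recursion of Eq. 8:
    x^(t) = -sum_{j=1..p} psi_j L~^j x^(t-1) + sum_{j=0..q} phi_j L~^j x.
    Powers are applied as repeated mat-vec products, never materialized."""
    # Feedforward part: computed once, reused by every iteration.
    ff = phi[0] * x
    v = x.copy()
    for j in range(1, len(phi)):
        v = L_tilde @ v                    # L~^j x
        ff = ff + phi[j] * v
    # Feedback part: refines the estimate at each iteration.
    x_bar = x.copy()
    for _ in range(t_max):
        fb = np.zeros_like(x)
        w = x_bar.copy()
        for j in range(1, len(psi) + 1):
            w = L_tilde @ w                # L~^j x_bar
            fb = fb + psi[j - 1] * w
        x_bar = -fb + ff
    return x_bar
```

Under the stability condition, the iterate converges to a fixed point satisfying x\u0304 = \u2212\u03a3_j \u03c8_j L\u0303^j x\u0304 + \u03a3_j \u03c6_j L\u0303^j x, i.e. the rational response of Eq. 9 applied to x.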
Table 1 summarizes the complexity results of existing spectral graph filters and ours, where r refers to the number of Jacobi iterations in [21]. Note that, when t = 1 (i.e., one spectral convolutional layer), feedback-looped filters have the same learning, time and memory complexities as Chebyshev filters, where p + q = k.

\u00b9DFNets implementation can be found at: https://github.com/wokas36/DFNets

4 Numerical Experiments

We evaluate our models on two benchmark tasks: (1) semi-supervised document classification in citation networks, and (2) semi-supervised entity classification in a knowledge graph.

4.1 Experimental Set-Up

Datasets. We use three citation network datasets, Cora, Citeseer, and Pubmed [28], for semi-supervised document classification, and one dataset, NELL [4], for semi-supervised entity classification. NELL is a bipartite graph extracted from a knowledge graph [4]. Table 2 contains the dataset statistics [33].

Dataset    Type               #Nodes   #Edges    #Classes   #Features   %Labeled Nodes
Cora       Citation network   2,708    5,429     7          1,433       0.052
Citeseer   Citation network   3,327    4,732     6          3,703       0.036
Pubmed     Citation network   19,717   44,338    3          500         0.003
NELL       Knowledge graph    65,755   266,144   210        5,414       0.001

Table 2: Dataset statistics.

Baseline methods. 
We compare against twelve baseline methods, including five methods using spatial graph filters, i.e., Semi-supervised Embedding (SemiEmb) [32], Label Propagation (LP) [34], the skip-gram graph embedding model (DeepWalk) [26], the Iterative Classification Algorithm (ICA) [24], and semi-supervised learning with graph embedding (Planetoid*) [33], and seven methods using spectral graph filters: Chebyshev [7], Graph Convolutional Networks (GCN) [19], Lanczos Networks (LNet) and Adaptive Lanczos Networks (AdaLNet) [23], CayleyNet [21], Graph Attention Networks (GAT) [31], and ARMA Convolutional Networks (ARMA1) [2].

We evaluate our feedback-looped filters using three different spectral CNN models: (i) DFNet: a densely connected spectral CNN with feedback-looped filters, (ii) DFNet-ATT: a self-attention based densely connected spectral CNN with feedback-looped filters, and (iii) DF-ATT: a self-attention based spectral CNN model with feedback-looped filters.

Model       L2 reg.   #Layers   #Units                 Dropout      [p, q]   \u03bbcut
DFNet       9e-2      5         [8, 16, 32, 64, 128]   0.9          [5, 3]   0.5
DFNet-ATT   9e-4      4         [8, 16, 32, 64]        0.9          [5, 3]   0.5
DF-ATT      9e-3      2         [32, 64]               [0.1, 0.9]   [5, 3]   0.5

Table 3: Hyperparameter settings for citation network datasets.

Hyperparameter settings. We use the same data splitting for each dataset as in Yang et al. [33]. The hyperparameters of our models are initially selected by applying the orthogonalization technique (a randomized search strategy). We also use layer-wise regularization (L2 regularization) and bias terms to attenuate the overfitting issue. All models are trained for 200 epochs using the Adam optimizer [18] with a learning rate of 0.002. Table 3 summarizes the hyperparameter settings for the citation network datasets. The same hyperparameters are applied to the NELL dataset except for L2 regularization (i.e., 9e-2 for DFNet and DFNet-ATT, and 9e-4 for DF-ATT). 
For \u03b3, we choose the best setting for each model. For self-attention, we use 8 multi-attention heads and 0.5 attention dropout for DFNet-ATT, and 6 multi-attention heads and 0.3 attention dropout for DF-ATT. The parameters p = 5, q = 3 and \u03bbcut = 0.5 are applied to all three models over all datasets.

4.2 Comparison with Baseline Methods

Table 4 summarizes the classification results in terms of accuracy. The results of the baseline methods are taken from the previous works [19, 23, 31, 33]. Our models DFNet and DFNet-ATT outperform all the baseline methods over the four datasets. In particular, we can see that: (1) Compared with polynomial filters, DFNet improves upon GCN (which performs best among the models using polynomial filters) by a margin of 3.7%, 3.9%, 5.3% and 2.3% on the datasets Cora, Citeseer, Pubmed and NELL, respectively. (2) Compared with rational polynomial filters, DFNet improves upon CayleyNet and ARMA1 by 3.3% and 1.8% on the Cora dataset, respectively. For the other datasets, CayleyNet does not have results available in [21]. (3) DFNet-ATT further improves the results of DFNet due to the addition of a self-attention layer. (4) Compared with GAT (Chebyshev filters with self-attention), DF-ATT also improves the results, achieving 0.4%, 0.6% and 3.3% higher accuracy on the datasets Cora, Citeseer and Pubmed, respectively.

Additionally, we compare DFNet (our feedback-looped filters + DenseBlock) with GCN + DenseBlock and GAT + DenseBlock. The results are also presented in Table 4. 
We can see that our feedback-looped filters perform best, no matter whether or not the dense architecture is used.

Model                    Cora         Citeseer     Pubmed       NELL
SemiEmb [32]             59.0         59.6         71.1         26.7
LP [34]                  68.0         45.3         63.0         26.5
DeepWalk [26]            67.2         43.2         65.3         58.1
ICA [24]                 75.1         69.1         73.9         23.1
Planetoid* [33]          64.7         75.7         77.2         61.9
Chebyshev [7]            81.2         69.8         74.4         -
GCN [19]                 81.5         70.3         79.0         66.0
LNet [23]                79.5         66.2         78.3         -
AdaLNet [23]             80.4         68.7         78.1         -
CayleyNet [21]           81.9*        -            -            -
ARMA1 [2]                83.4         72.5         78.9         -
GAT [31]                 83.0         72.5         79.0         -
GCN + DenseBlock         82.7 \u00b1 0.5   71.3 \u00b1 0.3   81.5 \u00b1 0.5   66.4 \u00b1 0.3
GAT + DenseBlock         83.8 \u00b1 0.3   73.1 \u00b1 0.3   81.8 \u00b1 0.3   -
DFNet (ours)             85.2 \u00b1 0.5   74.2 \u00b1 0.3   84.3 \u00b1 0.4   68.3 \u00b1 0.4
DFNet-ATT (ours)         86.0 \u00b1 0.4   74.7 \u00b1 0.4   85.2 \u00b1 0.3   68.8 \u00b1 0.3
DF-ATT (ours)            83.4 \u00b1 0.5   73.1 \u00b1 0.4   82.3 \u00b1 0.3   67.6 \u00b1 0.3

Table 4: Accuracy (%) averaged over 10 runs (* was obtained using a different data splitting in [21]).

4.3 Comparison under Different Polynomial Orders

In order to test how the polynomial orders p and q influence the performance of our model DFNet, we conduct experiments to evaluate DFNet on the three citation network datasets using different polynomial orders p = [1, 3, 5, 7, 9] and q = [1, 3, 5, 7, 9]. Figure 2 presents the experimental results. In our experiments, p = 5 and q = 3 turn out to be the best parameters for DFNet over these datasets. In other words, feedback-looped filters are more stable with p = 5 and q = 3 than with other values of p and q. This is because, when p = 5 and q = 3, Eq. 12 obtains better convergence for finding optimal coefficients than in the other cases. 
Furthermore, we observe that: (1) setting p too low or too high both lead to poor performance, as shown in Figure 2(a), and (2) when q is larger than p, the accuracy decreases rapidly, as shown in Figure 2(b). Thus, when choosing p and q, we require that p > q holds.

Figure 2: Accuracy (%) of DFNet under different polynomial orders p and q.

4.4 Evaluation of Scaled-Normalization and Cut-off Frequency

To understand how effectively the scaled-normalization and cut-off frequency techniques help learn graph representations, we compare our methods, which implement both techniques, with variants that implement only one of them. The results are presented in Figure 3. We can see that the models using both techniques outperform the models using only one of them over all citation network datasets. In particular, the improvement is significant on the Cora and Citeseer datasets.

Figure 3: Accuracy (%) of our models in three cases: (1) using both scaled-normalization and cut-off frequency, (2) using only cut-off frequency, and (3) using only scaled-normalization.

4.5 Node Embeddings

We analyze the node embeddings produced by DFNets over two datasets, Cora and Pubmed, in a 2-D space. Figures 4 and 5 display the learned 2-D embeddings of GCN, GAT, and DFNet (ours) on the Pubmed and Cora citation networks, respectively, obtained by applying t-SNE [25]. Colors denote different classes in these datasets, revealing the clustering quality of these models. These figures clearly show that our model DFNet produces better separated clusters (3 for Pubmed and 7 for Cora) in the embedding spaces of the two datasets.
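Visualizations of this kind can be produced by projecting the final-layer node embeddings to 2-D with t-SNE; a minimal sketch, where the embedding matrix `Z` and labels `y` are random placeholders standing in for the learned DFNet embeddings and dataset classes:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
Z = rng.normal(size=(200, 16))    # placeholder for learned node embeddings
y = rng.integers(0, 3, size=200)  # placeholder labels (e.g., 3 Pubmed classes)

# Project to 2-D; a fixed seed and PCA initialization make runs repeatable.
Z2d = TSNE(n_components=2, perplexity=30, init="pca",
           random_state=0).fit_transform(Z)
# Z2d can then be scatter-plotted, colored by y, as in Figures 4 and 5.
```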
The better cluster separation arises because the features extracted by DFNet yield better node representations than those of the GCN and GAT models.

(a) GCN (b) GAT (c) DFNet (ours)

Figure 4: The t-SNE visualization of the 2-D node embedding space for the Pubmed dataset.

(a) GCN (b) GAT (c) DFNet (ours)

Figure 5: The t-SNE visualization of the 2-D node embedding space for the Cora dataset.

5 Conclusions

In this paper, we have introduced a spectral CNN architecture (DFNets) with feedback-looped filters on graphs. To improve approximation accuracy, we have developed two techniques: scaled normalization and cut-off frequency. In addition, we have discussed several desirable properties of feedback-looped filters, such as guaranteed convergence, linear convergence time, and universal design. Our proposed model significantly outperforms the state-of-the-art approaches in two benchmark tasks. In the future, we plan to extend the current work to time-varying graph structures. As discussed in [17], feedback-looped graph filters are practically appealing for time-varying settings, and, similar to static graphs, some desirable properties would likely hold for graphs that are a function of time.

References

[1] J. Atwood and D. Towsley. Diffusion-convolutional neural networks. In Advances in Neural Information Processing Systems (NeurIPS), pages 1993–2001, 2016.

[2] F. M. Bianchi, D. Grattarola, L. Livi, and C. Alippi. Graph neural networks with convolutional ARMA filters. arXiv preprint arXiv:1901.01343, 2019.

[3] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun. Spectral networks and locally connected networks on graphs. In International Conference on Learning Representations (ICLR), 2013.

[4] A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. R.
Hruschka, and T. M. Mitchell. Toward an architecture for never-ending language learning. In Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI), 2010.

[5] F. R. Chung and F. C. Graham. Spectral Graph Theory. Number 92. American Mathematical Society, 1997.

[6] C. Cortes, M. Mohri, and A. Rostamizadeh. L2 regularization for learning kernels. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (UAI), pages 109–116, 2009.

[7] M. Defferrard, X. Bresson, and P. Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems (NeurIPS), pages 3844–3852, 2016.

[8] S. C. Douglas, S.-i. Amari, and S.-Y. Kung. On gradient adaptation with unit-norm constraints. IEEE Transactions on Signal Processing, 48(6):1843–1847, 2000.

[9] D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems (NeurIPS), pages 2224–2232, 2015.

[10] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS), pages 249–256, 2010.

[11] W. Hamilton, Z. Ying, and J. Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems (NeurIPS), pages 1024–1034, 2017.

[12] D. K. Hammond, P. Vandergheynst, and R. Gribonval. Wavelets on graphs via spectral graph theory. Applied and Computational Harmonic Analysis, 30(2):129–150, 2011.

[13] M. Henaff, J. Bruna, and Y. LeCun. Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163, 2015.

[14] W. Hu, Y. Huang, L. Wei, F. Zhang, and H.
Li. Deep convolutional neural networks for hyperspectral image classification. Journal of Sensors, 2015, 2015.

[15] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4700–4708, 2017.

[16] E. Isufi, A. Loukas, and G. Leus. Autoregressive moving average graph filters: a stable distributed implementation. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4119–4123, 2017.

[17] E. Isufi, A. Loukas, A. Simonetto, and G. Leus. Autoregressive moving average graph filtering. IEEE Transactions on Signal Processing, 65(2):274–288, 2017.

[18] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.

[19] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), 2017.

[20] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NeurIPS), pages 1097–1105, 2012.

[21] R. Levie, F. Monti, X. Bresson, and M. M. Bronstein. CayleyNets: Graph convolutional neural networks with complex rational spectral filters. IEEE Transactions on Signal Processing, 67(1):97–109, 2017.

[22] Y. Li, H. Qi, J. Dai, X. Ji, and Y. Wei. Fully convolutional instance-aware semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2359–2367, 2017.

[23] R. Liao, Z. Zhao, R. Urtasun, and R. S. Zemel. LanczosNet: Multi-scale deep graph convolutional networks.
In Proceedings of the Seventh International Conference on Learning Representations (ICLR), 2019.

[24] Q. Lu and L. Getoor. Link-based classification. In Proceedings of the 20th International Conference on Machine Learning (ICML), pages 496–503, 2003.

[25] L. van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.

[26] B. Perozzi, R. Al-Rfou, and S. Skiena. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD), pages 701–710, 2014.

[27] A. Sandryhaila and J. M. Moura. Discrete signal processing on graphs: Graph Fourier transform. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6167–6170, 2013.

[28] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Galligher, and T. Eliassi-Rad. Collective classification in network data. AI Magazine, 29(3):93–93, 2008.

[29] A. Sharif Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. CNN features off-the-shelf: An astounding baseline for recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 806–813, 2014.

[30] D. Shuman, S. Narang, P. Frossard, A. Ortega, and P. Vandergheynst. The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. IEEE Signal Processing Magazine, 30:83–98, 2013.

[31] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio. Graph attention networks. In International Conference on Learning Representations (ICLR), 2017.

[32] J. Weston, F. Ratle, H. Mobahi, and R. Collobert. Deep learning via semi-supervised embedding. In Neural Networks: Tricks of the Trade, pages 639–655. Springer, 2012.

[33] Z. Yang, W. W. Cohen, and R. Salakhutdinov.
Revisiting semi-supervised learning with graph embeddings. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pages 40–48, 2016.

[34] X. Zhu, Z. Ghahramani, and J. D. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In Proceedings of the 20th International Conference on Machine Learning (ICML), pages 912–919, 2003.

Appendices

In the following, we provide further experiments comparing our work with others.

Comparison with different spectral graph filters. We have conducted an ablation study of our proposed graph filters. Specifically, we compare our feedback-looped filters, i.e., the newly proposed spectral filters in this paper, against other spectral filters such as Chebyshev filters and Cayley filters. To conduct this ablation study, we remove the dense connections from our model DFNet. The experimental results are presented in Table 5. They show that feedback-looped filters improve localization upon Chebyshev filters by a margin of 1.4%, 1.7% and 7.3% on the datasets Cora, Citeseer and Pubmed, respectively. They also improve upon Cayley filters by a margin of 0.7% on the Cora dataset.

Model                              Cora          Citeseer      Pubmed
Chebyshev filters [7]              81.2          69.8          74.4
Cayley filters [21]                81.9          -             -
Feedback-looped filters (ours)     82.6 ± 0.3    71.5 ± 0.4    81.7 ± 0.6

Table 5: Accuracy (%) averaged over 10 runs.

Comparison with LNet and AdaLNet using different data splittings. We have benchmarked the performance of our DFNet model against the models LNet and AdaLNet proposed in [23], as well as Chebyshev, GCN and GAT, over the three citation network datasets Cora, Citeseer and Pubmed. We use the same data splittings as used in [23]. All the experiments are repeated 10 times.
For our model DFNet, we use the same hyperparameter settings as discussed in Section 4.2.

Training Split      Chebyshev     GCN           GAT           LNet          AdaLNet       DFNet
5.2% (standard)     78.0 ± 1.2    80.5 ± 0.8    82.6 ± 0.7    79.5 ± 1.8    80.4 ± 1.1    85.2 ± 0.5
3%                  62.1 ± 6.7    74.0 ± 2.8    56.8 ± 7.9    76.3 ± 2.3    77.7 ± 2.4    80.5 ± 0.4
1%                  44.2 ± 5.6    61.0 ± 7.2    48.6 ± 8.0    66.1 ± 8.2    67.5 ± 8.7    69.5 ± 2.3
0.5%                33.9 ± 5.0    52.9 ± 7.4    41.4 ± 6.9    58.1 ± 8.2    60.8 ± 9.0    61.3 ± 4.3

Table 6: Accuracy (%) averaged over 10 runs on the Cora dataset.

Training Split      Chebyshev     GCN           GAT           LNet          AdaLNet       DFNet
3.6% (standard)     70.1 ± 0.8    68.1 ± 1.3    72.2 ± 0.9    66.2 ± 1.9    68.7 ± 1.0    74.2 ± 0.3
1%                  59.4 ± 5.4    58.3 ± 4.0    46.5 ± 9.3    61.3 ± 3.9    63.3 ± 1.8    67.4 ± 2.3
0.5%                45.3 ± 6.6    47.7 ± 4.4    38.2 ± 7.1    53.2 ± 4.0    53.8 ± 4.7    55.1 ± 3.2
0.3%                39.3 ± 4.9    39.2 ± 6.3    30.9 ± 6.9    44.4 ± 4.5    46.7 ± 5.6    48.3 ± 3.5

Table 7: Accuracy (%) averaged over 10 runs on the Citeseer dataset.

Training Split      Chebyshev     GCN           GAT           LNet          AdaLNet       DFNet
0.3% (standard)     69.8 ± 1.1    77.8 ± 0.7    76.7 ± 0.5    78.3 ± 0.3    78.1 ± 0.4    84.3 ± 0.4
0.1%                55.2 ± 6.8    73.0 ± 5.5    59.6 ± 9.5    73.4 ± 5.1    72.8 ± 4.6    75.2 ± 3.6
0.05%               48.2 ± 7.4    64.6 ± 7.5    50.4 ± 9.7    68.8 ± 5.6    66.0 ± 4.5    67.2 ± 7.3
0.03%               45.3 ± 4.5    57.9 ± 8.1    50.9 ± 8.8    60.4 ± 8.6    61.0 ± 8.7    59.3 ± 6.6

Table 8: Accuracy (%) averaged over 10 runs on the Pubmed dataset.

Tables 6–8 present the experimental results.
Table 6 shows that DFNet performs significantly better than all the other models on the Cora dataset, including LNet and AdaLNet proposed in [23]. Similarly, Table 7 shows that DFNet performs significantly better than all the other models on the Citeseer dataset. For the Pubmed dataset, as shown in Table 8, DFNet performs significantly better than almost all the other models, except for a single case in which it performs slightly worse than AdaLNet under the 0.03% splitting. These results demonstrate the robustness of our model DFNet.
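The reduced-label protocol above can be replicated by drawing a random training subset at each split ratio and averaging accuracy over repeated runs; a minimal sketch, where `train_and_test` is a hypothetical placeholder for training DFNet on the subset and returning test accuracy (here it just reports the subset size), and the subsampling is plain random rather than the exact splittings of [23]:

```python
import numpy as np

def subsample_split(n_nodes, frac, rng):
    """Randomly pick a training subset covering `frac` of all nodes."""
    n_train = max(1, int(round(frac * n_nodes)))
    return rng.permutation(n_nodes)[:n_train]

def mean_std_accuracy(n_nodes, frac, train_and_test, runs=10, seed=0):
    """Average accuracy over `runs` random training subsets, mirroring
    the 10-run averages reported in Tables 6-8."""
    rng = np.random.default_rng(seed)
    accs = [train_and_test(subsample_split(n_nodes, frac, rng))
            for _ in range(runs)]
    return float(np.mean(accs)), float(np.std(accs))

# Example: Cora has 2708 nodes; a 5.2% split selects 141 training nodes.
mean_acc, std_acc = mean_std_accuracy(2708, 0.052, lambda idx: float(len(idx)))
```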