{"title": "Break the Ceiling: Stronger Multi-scale Deep Graph Convolutional Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 10945, "page_last": 10955, "abstract": "Recently, neural network based approaches have achieved significant progress for solving large, complex, graph-structured problems. Nevertheless, the advantages of multi-scale information and deep architectures have not been sufficiently exploited. In this paper, we first analyze key factors constraining the expressive power of existing Graph Convolutional Networks (GCNs), including the activation function and shallow learning mechanisms. Then, we generalize spectral graph convolution and deep GCN in block Krylov subspace forms, upon which we devise two architectures, both scalable in depth but making use of multi-scale information differently. On several node classification tasks, the proposed architectures achieve state-of-the-art performance.", "full_text": "Break the Ceiling: Stronger Multi-scale Deep Graph Convolutional Networks

Sitao Luan1,2,*, Mingde Zhao1,2,*, Xiao-Wen Chang1, Doina Precup1,2,3
{sitao.luan@mail, mingde.zhao@mail, chang@cs, dprecup@cs}.mcgill.ca
1McGill University; 2Mila; 3DeepMind
*Equal Contribution

Abstract

Recently, neural network based approaches have achieved significant progress for solving large, complex, graph-structured problems. Nevertheless, the advantages of multi-scale information and deep architectures have not been sufficiently exploited. In this paper, we first analyze key factors constraining the expressive power of existing Graph Convolutional Networks (GCNs), including the activation function and shallow learning mechanisms. Then, we generalize spectral graph convolution and deep GCN in block Krylov subspace forms, upon which we devise two architectures, both scalable in depth but making use of multi-scale information differently. 
On several node classification tasks, the proposed architectures achieve state-of-the-art performance.

1 Introduction & Motivation

Many real-world problems can be modeled as graphs [14, 18, 25, 12, 27, 7]. Inspired by the success of Convolutional Neural Networks (CNNs) [20] in computer vision [22], graph convolution defined on the graph Fourier domain stands out as the key operator and one of the most powerful tools for using machine learning to solve graph problems. In this paper, we focus on spectrum-free Graph Convolutional Networks (GCNs) [2, 29], which have demonstrated state-of-the-art performance on many transductive and inductive learning tasks [7, 18, 25, 3, 4].

One major problem of the existing GCNs is their low expressive power, limited by their shallow learning mechanisms [38, 36]. There are mainly two reasons why a depth-scalable architecture has not yet been achieved. First, the problem is difficult: considering graph convolution as a special form of Laplacian smoothing [21], networks with multiple convolutional layers will suffer from an over-smoothing problem that makes the representations of even distant nodes indistinguishable [38]. Second, some believe it is unnecessary: for example, [2] states that it is not necessary for the label information to totally traverse the entire graph, and that one can operate on a multi-scale coarsened input graph to obtain the same flow of information as GCNs with more layers. Acknowledging the difficulty, we hold on to the objective of deepening GCNs, since the desired compositionality1 will yield easy articulation and consistent performance for problems of different scales. In this paper, we break the performance ceiling of GCNs. 
First, we analyze the limits of the existing GCNs brought by the shallow learning mechanisms and the activation functions. Then, we show that any graph convolution with a well-defined analytic spectral filter can be written as a product of a block Krylov matrix and a learnable parameter matrix in a special form. Based on this, we propose two GCN architectures that leverage multi-scale information in different ways and are scalable in depth, with stronger expressive power and the ability to extract richer representations of graph-structured data. We also show that the equivalence of the two architectures can be established under certain conditions. For empirical validation, we test different instances of the proposed architectures on multiple node classification tasks. The results show that even the simplest instance of the architectures achieves state-of-the-art performance, and the complex ones achieve surprisingly higher performance, with or without validation sets.

1 The expressive power of a sound deep NN architecture should be expected to grow with the increment of network depth [19, 16].

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

2 Why Deep GCN Does Not Work Well?

2.1 Foundations

As in [11], we use bold font for vectors (e.g. v), block vectors (e.g. V) and matrix blocks (e.g. V_i). Suppose we have an undirected graph G = (V, E, A), where V is the node set with |V| = N, E is the edge set with |E| = E, A ∈ R^{N×N} is a symmetric adjacency matrix and D is a diagonal degree matrix, i.e. D_ii = Σ_j A_ij. A diffusion process [6, 5] on G can be defined by a diffusion operator L, which is a symmetric matrix, e.g. the graph Laplacian L = D − A, the normalized graph Laplacian L = I − D^{−1/2} A D^{−1/2}, or the affinity matrix L = A + I, etc. In this paper, we use L for a general diffusion operator, unless specified otherwise. 
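The three diffusion operators just listed can be made concrete with a small NumPy sketch (our illustration, not the authors' code; the function name is ours):

```python
import numpy as np

def diffusion_operators(A):
    """Build the three diffusion operators discussed above from a
    symmetric adjacency matrix A: the graph Laplacian D - A, the
    normalized Laplacian I - D^{-1/2} A D^{-1/2}, and the affinity
    matrix A + I."""
    N = A.shape[0]
    d = A.sum(axis=1)                      # degrees D_ii = sum_j A_ij
    D = np.diag(d)
    laplacian = D - A
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    norm_laplacian = np.eye(N) - d_inv_sqrt @ A @ d_inv_sqrt
    affinity = A + np.eye(N)
    return laplacian, norm_laplacian, affinity

# A triangle graph: all three operators are symmetric, and the rows of
# the graph Laplacian sum to zero.
A = np.array([[0., 1., 1.], [1., 0., 1.], [1., 1., 0.]])
L, L_norm, L_aff = diffusion_operators(A)
```

All three are symmetric, as required of a diffusion operator, and the normalized Laplacian has its spectrum in [0, 2].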
The eigendecomposition of L gives us L = UΛU^T, where Λ is a diagonal matrix whose diagonal elements are eigenvalues and the columns of U are the orthonormal eigenvectors, named the graph Fourier basis. We also have a feature matrix (graph signals) X ∈ R^{N×F} (which can be regarded as a block vector) defined on V, and each node i has a feature vector X_{i,:}, which is the i-th row of X.

Spectral graph convolution is defined in the graph Fourier domain s.t. x ∗_G y = U((U^T x) ⊙ (U^T y)), where x, y ∈ R^N and ⊙ is the Hadamard product [7]. Following this definition, a graph signal x filtered by g_θ can be written as

y = g_θ(L) x = g_θ(UΛU^T) x = U g_θ(Λ) U^T x    (1)

where g_θ is any function which is analytic inside a closed contour encircling the spectrum of L, e.g. a Chebyshev polynomial [7]. GCN generalizes this definition to signals with F input channels and O output channels, and its network structure can be described as

Y = softmax(L ReLU(L X W_0) W_1)    (2)

where

L ≜ D̃^{−1/2} Ã D̃^{−1/2},  Ã ≜ A + I,  D̃ ≜ diag(Σ_j Ã_{1j}, . . . , Σ_j Ã_{Nj})    (3)

This is called a spectrum-free method [2] since it requires no explicit computation of the eigendecomposition and no operations in the frequency domain [38].

2.2 Problems

Suppose we deepen GCN in the same way as [18, 21]; we have

Y = softmax(L ReLU(· · · L ReLU(L ReLU(L X W_0) W_1) W_2 · · · ) W_n) ≜ softmax(Y′)    (4)

For this architecture, [21] gives an analysis of the effect of L without considering the ReLU activation function. Our analyses of (4) can be summarized in the following theorems.

Theorem 1. Suppose that G has k connected components and the diffusion operator L is defined as in (3). Let X ∈ R^{N×F} be any block vector and let W_j be any non-negative parameter matrix with ||W_j||_2 ≤ 1 for j = 0, 1, . . .. If G has no bipartite components, then in (4), as n → ∞, rank(Y′) ≤ k.

Proof. See Appendix A.

Conjecture 1. Theorem 1 still holds without the non-negative constraint on the parameter matrices.

Theorem 2. Suppose the n-dimensional x and y are independently sampled from a continuous distribution and the activation function Tanh(z) = (e^z − e^{−z}) / (e^z + e^{−z}) is applied to [x, y] pointwise; then

P(rank(Tanh([x, y])) = rank([x, y])) = 1

Proof. See Appendix A.

Theorem 1 shows that if we simply deepen GCN, the extracted features will degrade, i.e. Y′ only contains the stationary information of the graph structure and loses all the local information in nodes as they are smoothed. In addition, from the proof we see that the pointwise ReLU transformation is a conspirator. Theorem 2 tells us that Tanh is better at keeping linear independence among column features. We design a numerical experiment on synthetic data (see Appendix) to test, under a 100-layer GCN architecture, how activation functions affect the rank of the output in each hidden layer during the feedforward process. As Figure 1(a) shows, the rank of the hidden features decreases rapidly with ReLU, while fluctuating little under Tanh; even the identity function performs better than ReLU (see Appendix for more comparisons). 
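The rank-collapse phenomenon can be replicated in miniature (a hedged sketch under our own toy setup, not the authors' experiment; the graph, depth, and widths are stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy graph with k = 2 connected components: two disjoint 10-node cycles.
N = 20
A = np.zeros((N, N))
for c in (0, 10):
    for i in range(10):
        a, b = c + i, c + (i + 1) % 10
        A[a, b] = A[b, a] = 1.0
A_tilde = A + np.eye(N)                  # renormalization trick of (3)
d = A_tilde.sum(axis=1)
L = A_tilde / np.sqrt(np.outer(d, d))    # D^{-1/2} (A + I) D^{-1/2}

def hidden_rank(depth, act):
    """Propagate H <- act(L H W) with non-negative W, ||W||_2 <= 1
    (the setting of Theorem 1) and return the numerical rank of H."""
    H = np.abs(rng.standard_normal((N, 8)))
    for _ in range(depth):
        W = np.abs(rng.standard_normal((8, 8)))
        H = act(L @ H @ (W / np.linalg.norm(W, 2)))
    return np.linalg.matrix_rank(H)

r_relu = hidden_rank(50, lambda z: np.maximum(z, 0.0))
r_tanh = hidden_rank(50, np.tanh)

# Theorem 2 in miniature: Tanh almost surely preserves the rank of a
# random block vector.
M = rng.standard_normal((N, 2))
```

Tracking `hidden_rank` over depth reproduces the qualitative picture of Figure 1; the Tanh rank-preservation check is exactly the statement of Theorem 2 applied to one random sample.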
So we propose to replace ReLU by Tanh.

Figure 1: Changes in the number of independent features with the increment of network depth. (a) Deep GCN; (b) Snowball; (c) Truncated Block Krylov. Each panel compares the Identity, Sigmoid, ReLU, LeakyReLU and Tanh activation functions over 100 layers.

3 Spectral Graph Convolution and Block Krylov Subspaces

3.1 Block Krylov Subspaces

Let S be a vector subspace of R^{F×F} containing the identity matrix I_F that is closed under matrix multiplication and transposition. We define an inner product ⟨·, ·⟩_S in the block vector space R^{N×F} as follows [11]:

Definition 1. A mapping ⟨·, ·⟩_S from R^{N×F} × R^{N×F} to S is called a block inner product onto S if ∀ X, Y, Z ∈ R^{N×F} and ∀ C ∈ S:
1. S-linearity: ⟨X, YC⟩_S = ⟨X, Y⟩_S C and ⟨X + Y, Z⟩_S = ⟨X, Z⟩_S + ⟨Y, Z⟩_S;
2. symmetry: ⟨X, Y⟩_S = ⟨Y, X⟩_S^T;
3. definiteness: ⟨X, X⟩_S is positive definite if X has full rank, and ⟨X, X⟩_S = 0_F iff X = 0.

There are mainly three ways to define ⟨·, ·⟩_S [11]: 1) (Classical.) S^{Cl} = R^{F×F} and ⟨X, Y⟩_S^{Cl} = X^T Y; 2) (Global.) S^{Gl} = {cI_F : c ∈ R} and ⟨X, Y⟩_S^{Gl} = trace(X^T Y) I_F; 3) (Loop-interchange.) S^{Li} is the set of diagonal matrices and ⟨X, Y⟩_S^{Li} = diag(X^T Y). The three definitions are all useful, yet we will use the classical one for our contribution.

For further explanation, we give the definition of the S-span of block vectors in R^{N×F}.

Definition 2. Given a set of block vectors {X_k}_{k=1}^m ⊂ R^{N×F}, the S-span of {X_k}_{k=1}^m is defined as span^S{X_1, . . . , X_m} := {Σ_{k=1}^m X_k C_k : C_k ∈ S}.

Given the above definition, the order-m block Krylov subspace with respect to the matrix A ∈ R^{N×N}, the block vector B ∈ R^{N×F} and the vector space S can be defined as K_m^S(A, B) := span^S{B, AB, . . . , A^{m−1}B}. The corresponding block Krylov matrix is defined as K_m(A, B) := [B, AB, . . . , A^{m−1}B].

3.2 Spectral Graph Convolution in Block Krylov Subspace Form

In this section, we show that any graph convolution with a well-defined analytic spectral filter defined on L ∈ R^{N×N} can be written as the product of a block Krylov matrix with a learnable parameter matrix in a specific form. We take S = S^{Cl} = R^{F×F}.

For any real analytic scalar function g, its power series expansion around center 0 is

g(x) = Σ_{n=0}^∞ a_n x^n = Σ_{n=0}^∞ (g^{(n)}(0) / n!) x^n,  |x| < R

where R is the radius of convergence. The function g can be used to define a filter. Let ρ(L) denote the spectral radius of L and suppose ρ(L) < R. The spectral filter g(L) ∈ R^{N×N} can be defined as

g(L) := Σ_{n=0}^∞ a_n L^n = Σ_{n=0}^∞ (g^{(n)}(0) / n!) L^n,  ρ(L) < R

According to the definition of spectral graph convolution in (1), the graph signal X is filtered by g(L) as follows:

g(L)X = Σ_{n=0}^∞ (g^{(n)}(0) / n!) L^n X = [X, LX, L^2 X, · · ·] [(g^{(0)}(0)/0!) I_F, (g^{(1)}(0)/1!) I_F, (g^{(2)}(0)/2!) I_F, · · ·]^T = A′ B′

where A′ ∈ R^{N×∞} and B′ ∈ R^{∞×F}. We can see that A′ is a block Krylov matrix and Range(A′B′) ⊆ Range(A′). It is shown in [13, 11] that for S = R^{F×F} there exists a smallest m such that

span^S{X, LX, L^2 X, · · ·} = span^S{X, LX, L^2 X, . . . , L^{m−1} X}    (5)

where m depends on L and X and will be written as m(L, X) later. This means that for any k ≥ m, L^k X ∈ K_m^S(L, X). From (5), the convolution can be written as

g(L)X = Σ_{n=0}^∞ (g^{(n)}(0) / n!) L^n X ≜ [X, LX, . . . , L^{m−1} X] [(Γ_0^S)^T, (Γ_1^S)^T, · · · , (Γ_{m−1}^S)^T]^T ≜ K_m(L, X) Γ^S    (6)

where the Γ_i^S ∈ R^{F×F} for i = 0, 1, . . . , m−1 are parameter matrix blocks. Then, a graph convolutional layer can generally be written as

g(L) X W′ = K_m(L, X) Γ^S W′ = K_m(L, X) W^S    (7)

where W^S ≜ Γ^S W′ ∈ R^{mF×O}. The essential number of learnable parameters is mF × O.

3.3 Deep GCN in the Block Krylov Subspace Form

Since spectral graph convolution can be simplified as in (6)(7), we can build a deep GCN in the following way. Suppose that we have a sequence of analytic spectral filters G = {g_0, g_1, . . . , g_n} and a sequence of pointwise nonlinear activation functions H = {h_0, h_1, . . . , h_n}. Then, a deep spectral graph convolution network can be written as

Y = softmax{g_n(L) h_{n−1}{· · · g_2(L) h_1{g_1(L) h_0{g_0(L) X W′_0} W′_1} W′_2 · · ·} W′_n}    (8)

Define

H_0 = X,  H_{i+1} = h_i{g_i(L) H_i W_i},  i = 0, . . . , n − 1

From (7) and (8), we see that we can write

H_{i+1} = h_i{K_{m_i}(L, H_i) W_i^{S_i}},  m_i ≜ m(L, H_i),  Y = softmax{K_{m_n}(L, H_n) W_n^{S_n}}

It is easy to see that, when g_i(L) = I, (8) is a fully connected network [21]; when n = 1 and g_0(L) = g_1(L) = L, where L is defined in (3), it is just GCN [18]; and when g_i(L) is defined by the Chebyshev polynomial [15] with W′_i = I, (8) is ChebNet [7].

3.4 Difficulties & Inspirations

In the last subsection, we gave a general form of deep GCN in the block Krylov form. Following this idea, we can leverage the existing block Lanczos algorithm [11, 10] to find m_i and compute an orthogonal basis of K_{m_i}^S(L, H_i), which makes the filter coefficients compact [25] and improves numerical stability. But there are some difficulties in practice:

1. During the training phase, H_i changes every time the parameters are updated. This makes m_i a variable and thus requires adaptive sizes for the parameter matrices W_i^{S_i}.

2. 
For the classical inner product, the QR factorization that is needed in the block Lanczos algorithm [11] is difficult to implement in a backpropagation framework.

Despite this implementation intractability, the block Krylov form is still meaningful for constructing GCNs that are scalable in depth, as we illustrate below. For each node v ∈ {1, . . . , N} in the graph, denote by N(v) the set of its neighbors and by N_k(v) the set of its k-hop neighbors. Then, LX(v, :) can be interpreted as a weighted mean of the feature vectors of v and N(v). If the network goes deep as in (4), Y′(v, :) becomes the “weighted mean” of the feature vectors of v and N_{n+1}(v) (not exactly a weighted mean because we have ReLU in each layer). As the scope grows, the nodes in the same connected component tend to have the same (global) features, while losing their individual (local) features, which makes them indistinguishable. This phenomenon is recognized as “oversmoothing” [21]. Though it is reasonable to assume that nodes in the same cluster share many similar properties, it is harmful to omit the individual differences between nodes.

Therefore, the inspiration from the block Krylov form is that, to get a richer representation of each node, we need to concatenate the multi-scale information (local and global) together instead of merely doing smoothing in each hidden layer. If we have a smart way to stack multi-scale information, the network will be scalable in depth. 
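The concatenation idea can be made concrete: the sketch below (our NumPy illustration, not the released code) builds the truncated block Krylov matrix K_m(L, X) = [X, LX, . . . , L^{m−1}X] of (6), and a parameter matrix of shape (mF, O) then turns K_m(L, X) W^S into one graph convolutional layer as in (7).

```python
import numpy as np

def block_krylov_matrix(L, X, m):
    """Stack the multi-scale features [X, LX, ..., L^{m-1}X]
    column-wise, as in the block Krylov matrix of (6)."""
    blocks, H = [], X
    for _ in range(m):
        blocks.append(H)
        H = L @ H          # one more diffusion step -> next scale
    return np.concatenate(blocks, axis=1)   # shape (N, m*F)

rng = np.random.default_rng(0)
N, F, O, m = 6, 3, 2, 4
L = rng.random((N, N)); L = (L + L.T) / 2   # any symmetric operator
X = rng.standard_normal((N, F))
K = block_krylov_matrix(L, X, m)            # (N, m*F)
W_S = rng.standard_normal((m * F, O))       # mF x O parameters, as in (7)
layer_out = K @ W_S                         # one layer: K_m(L, X) W^S
```

Each block of columns of K holds one scale; the first block is the untouched local features X, which is exactly what plain stacking of smoothing layers loses.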
To this end, we naturally come up with a densely connected architecture [17], which we call the snowball network, and a compact architecture, which we call the truncated Krylov network, in which the multi-scale information is used differently.

4 Deep GCN Architectures

4.1 Snowball

The block Krylov form inspires first an architecture that concatenates multi-scale features incrementally, resulting in a densely-connected graph network (Figure 2(a)) as follows:

H_0 = X,  H_{l+1} = f(L [H_0, H_1, . . . , H_l] W_l),  l = 0, 1, . . . , n − 1
C = g([H_0, H_1, . . . , H_n] W_n)
output = softmax(L^p C W_C)    (9)

where W_l ∈ R^{(Σ_{i=0}^l F_i)×F_{l+1}}, W_n ∈ R^{(Σ_{i=0}^n F_i)×F_C} and W_C ∈ R^{F_C×F_O} are learnable parameter matrices; F_{l+1} is the number of output channels in layer l; f and g are pointwise activation functions; the H_l are extracted features; C is the output of a classifier of any kind, e.g., a fully connected neural network or even an identity layer, in which case C = [H_0, H_1, . . . , H_n]; and p ∈ {0, 1}. When p = 0, L^p = I; when p = 1, L^p = L, which means that we project C back onto the graph Fourier basis, which is necessary when the graph structure encodes much information.

Following this construction, we can stack all learned features as the input of the subsequent hidden layer, which is an efficient way to concatenate multi-scale information. The size of the input will grow like a snowball, and this construction is similar to DenseNet [17], which is designed for regular grids (images). Thus, some advantages of DenseNet are naturally inherited: it alleviates the vanishing-gradient problem, encourages feature reuse, increases the variation of input for each hidden layer, reduces the number of parameters, strengthens feature propagation and improves model compactness.

4.2 Truncated Krylov

The block Krylov form inspires then an architecture that concatenates multi-scale features directly together in each layer. 
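Before detailing it, the incremental snowball update (9) can be sketched as follows (a NumPy illustration under simplified shapes and our own variable names, not the released code):

```python
import numpy as np

def snowball_forward(L, X, Ws, W_n, W_C, f=np.tanh, g=lambda z: z, p=1):
    """Sketch of the snowball forward pass (9): each layer sees the
    concatenation of all previously extracted features."""
    feats = [X]                               # [H_0, ..., H_l]
    for W_l in Ws:
        H_next = f(L @ np.concatenate(feats, axis=1) @ W_l)
        feats.append(H_next)
    C = g(np.concatenate(feats, axis=1) @ W_n)
    logits = np.linalg.matrix_power(L, p) @ C @ W_C
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)   # row-wise softmax

rng = np.random.default_rng(0)
N, F0, F_hid, F_C, n_classes = 5, 4, 3, 6, 2
L = rng.random((N, N)); L = (L + L.T) / 2
X = rng.standard_normal((N, F0))
# W_l has (sum of previous widths) x F_{l+1} rows, as required by (9)
Ws = [rng.standard_normal((F0, F_hid)),
      rng.standard_normal((F0 + F_hid, F_hid))]
W_n = rng.standard_normal((F0 + 2 * F_hid, F_C))
W_C = rng.standard_normal((F_C, n_classes))
Y = snowball_forward(L, X, Ws, W_n, W_C)
```

The growing row dimension of each W_l is the “snowball”: layer l consumes all of H_0, . . . , H_l at once.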
However, as stated in Section 3.4, the fact that m_i is a variable makes GCN difficult to merge into the block Krylov framework. Thus we compromise and set m_i as a hyperparameter, obtaining the truncated block Krylov network (Figure 2(b)) shown below:

H_0 = X,  H_{l+1} = f([H_l, L H_l, . . . , L^{m_l − 1} H_l] W_l),  l = 0, 1, . . . , n − 1
C = g(H_n W_n)
output = softmax(L^p C W_C)    (10)

where W_l ∈ R^{(m_l F_l)×F_{l+1}}, W_n ∈ R^{F_n×F_C} and W_C ∈ R^{F_C×F_O} are learnable parameter matrices; f and g are activation functions; C is the output of a classifier of any kind; and p ∈ {0, 1}. In the truncated Krylov network, the local information will not be diluted in each layer, because in each layer l we start the concatenation from L^0 H_l, so that the extracted local information is kept.

There are works on the analysis of error bounds for truncation in block Krylov methods [11]. But the results need many assumptions, either on X, e.g., that X is a standard Gaussian matrix [34], or on L, e.g., conditions on the smallest and largest eigenvalues of L [28]. Instead of truncating for a specific function or a fixed X, we are dealing with a variable X during training. So we cannot get a practical error bound, since we cannot put any restriction on X and its relation to L.

Figure 2: Proposed Architectures. (a) Snowball: features H_1, . . . , H_4 are built incrementally from X; (b) Truncated Block Krylov: each H_{l+1} is built from [H_l, LH_l, . . . , L^{m_l − 1}H_l].

The Krylov subspace methods are often associated with low-rank approximation methods for large sparse matrices. Here we would like to mention that [25] does low-rank approximation of L by the Lanczos algorithm. 
It suffers from the tradeoff between accuracy and efficiency: the information in L will be lost if L is not low-rank, while keeping more information by increasing the number of Lanczos steps will hurt efficiency. Since most of the graphs we are dealing with have sparse connectivity structures, they are actually not low-rank, e.g., the Erdős-Rényi graph G(n, p) with p = ω(1/n) [32] and the examples in Appendix IV. Thus, we do not propose to do low-rank approximation in our architecture.

4.3 Equivalence of Linear Snowball GCN and Truncated Block Krylov Network

In this part, we show that the two proposed architectures are inherently connected. In fact, their equivalence can be established when using identity functions as f, an identity layer as C, and constraining the parameter matrix of the truncated Krylov network to be in a special form. In the linear snowball GCN, we can split the parameter matrix W_i into i + 1 blocks and write it as W_i = [(W_i^{(1)})^T, · · · , (W_i^{(i+1)})^T]^T, and then following (9) we have

H_0 = X,  H_1 = L X W_0,  H_2 = L [X, H_1] W_1 = L X W_1^{(1)} + L^2 X W_0^{(1)} W_1^{(2)} = L [X, LX] [I, 0; 0, W_0^{(1)}] [(W_1^{(1)})^T, (W_1^{(2)})^T]^T,  . . .

As in (9), we have C W_C = L [H_0, H_1, . . . , H_n] W_C. 
Thus we can write

[H_0, H_1, · · · , H_n] W_C = [X, LX, · · · , L^n X] B_0 B_1 · · · B_{n−1} [(W_C^{(1)})^T, (W_C^{(2)})^T, . . . , (W_C^{(n+1)})^T]^T

where, following the pattern of the H_2 expansion above, each B_j is a block matrix whose entries consist of identity blocks, zero blocks and blocks from the parameter matrices W_j. This is in the form of (7), where the parameter matrix is the product of a sequence of block diagonal matrices whose entries consist of identity blocks and blocks from other parameter matrices. Though the two proposed architectures stack multi-scale information in different ways, i.e. incrementally and directly respectively, the equivalence reveals that the truncated block Krylov network can be constrained to leverage multi-scale information in a way similar to the snowball architecture. 
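The containment behind this equivalence can be checked numerically (our sanity check, not from the paper): with f the identity and C the identity layer, every snowball feature H_l is an exact linear combination of the columns of [X, LX, . . . , L^l X], so a least-squares fit against the Krylov basis leaves essentially zero residual.

```python
import numpy as np

rng = np.random.default_rng(1)
N, F, depth = 8, 2, 3
L = rng.random((N, N)); L = (L + L.T) / 2
L = L / np.linalg.norm(L, 2)                  # keep powers well scaled
X = rng.standard_normal((N, F))

# Linear snowball (f = identity): H_{l+1} = L [H_0, ..., H_l] W_l.
feats = [X]
for l in range(depth):
    W_l = rng.standard_normal((F * (l + 1), F))
    feats.append(L @ np.concatenate(feats, axis=1) @ W_l)
H = np.concatenate(feats, axis=1)             # all snowball features

# Block Krylov basis [X, LX, L^2 X, L^3 X] of matching order.
K = np.concatenate([np.linalg.matrix_power(L, i) @ X
                    for i in range(depth + 1)], axis=1)

# If the containment holds, the least-squares residual is at
# numerical-precision level.
coeffs, *_ = np.linalg.lstsq(K, H, rcond=None)
residual = np.linalg.norm(K @ coeffs - H)
```

The recovered `coeffs` plays the role of the structured parameter-matrix product above, without imposing its block-diagonal form.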
While it is worth noting that when there are no constraints, the truncated Krylov network is capable of achieving more than what the snowball network does.

4.4 Relation to Message Passing Framework

We denote the concatenation operator as ∥. If we consider L as a general aggregation operator which aggregates node features with neighborhood features, we see that the two proposed architectures both have close relationships with the message passing framework [12], as illustrated in Table 1, where N_0(v) = {v}, M_t is a message function, U_t is a vertex update function, m_v^{(t+1)} and h_v^{(t+1)} are the message and hidden state at each node respectively, m^{(t+1)} = [m_1^{(t+1)}, · · · , m_N^{(t+1)}]^T, h^{(t+1)} = [h_1^{(t+1)}, · · · , h_N^{(t+1)}]^T and σ is a nonlinear activation function.

Table 1: Algorithms in Matrix and Nodewise Forms

Message Passing — Matrix: m^{(t+1)} = M_t(A, h^{(t)}); h^{(t+1)} = U_t(h^{(t)}, m^{(t+1)}). Nodewise: m_v^{(t+1)} = Σ_{w ∈ N(v)} M_t(h_v^{(t)}, h_w^{(t)}, e_{vw}); h_v^{(t+1)} = U_t(h_v^{(t)}, m_v^{(t+1)}).
GraphSAGE-GCN — Matrix: m^{(t+1)} = L h^{(t)}; h^{(t+1)} = σ(m^{(t+1)} W_t). Nodewise: m_v^{(t+1)} = mean({h_v^{(t)}} ∪ {h_{N(v)}^{(t)}}); h_v^{(t+1)} = σ(W_t^T m_v^{(t+1)}).
Snowball — Matrix: m^{(t+1)} = L [h^{(0)} ∥ . . . ∥ h^{(t)}]; h^{(t+1)} = σ(m^{(t+1)} W_t). Nodewise: m_v^{(t+1)} = ∥_{i=0}^{t} mean({h_v^{(i)}} ∪ {h_{N(v)}^{(i)}}); h_v^{(t+1)} = σ(W_t^T m_v^{(t+1)}).
Truncated Krylov — Matrix: m^{(t+1)} = h^{(t)} ∥ . . . ∥ L^{m_t − 1} h^{(t)}; h^{(t+1)} = σ(m^{(t+1)} W_t). Nodewise: m_v^{(t+1)} = ∥_{i=0}^{m_t − 1} mean(∪_{k=0}^{i} {h_{N_k(v)}^{(t)}}); h_v^{(t+1)} = σ(W_t^T m_v^{(t+1)}).

Compared to our proposed architectures, we can see that the message passing paradigm cannot avoid the oversmoothing problem, because it does not leverage multi-scale information in each layer and will finally lose the local information. An alternative solution to address the oversmoothing problem could be to modify the readout function to ŷ = R({h_v^{(0)}, h_v^{(1)}, . . . , h_v^{(T)} | v ∈ V}).

5 Experiments

On node classification tasks, we test 2 instances of the snowball GCN and 1 instance of the truncated Krylov GCN: the linear snowball GCN (f = g = identity, p = 1), the snowball GCN (f = Tanh, g = identity, p = 1) and the truncated Krylov network (f = g = Tanh, p = 0). The test cases include the public splits [37, 25] of Cora, CiteSeer and PubMed2, as well as the crafted smaller splits that are more difficult [25, 21, 31].

2 Source code to be found at https://github.com/PwnerHarry/Stronger_GCN

We compare the instances against several methods under 2 experimental settings, with or without validation sets. The compared methods with validation sets include graph convolutional networks for fingerprint (GCN-FP) [8], gated graph neural networks (GGNN) [23], diffusion convolutional neural networks (DCNN) [1], Chebyshev networks (Cheby) [7], graph convolutional networks (GCN) [18], message passing neural networks (MPNN) [12], graph sample and aggregate (GraphSAGE) [14], graph partition neural networks (GPNN) [24], graph attention networks (GAT) [33], LanczosNet (LNet) [25] and AdaLanczosNet (AdaLNet) [25]. The compared methods without validation sets include label propagation using ParWalks (LP) [35], Co-training [21], Self-training [21], Union [21], Intersection [21], GCN without validation [21], Multi-stage training [31], Multi-stage self-supervised (M3S) training [31], GCN with sparse virtual adversarial training (GCN-SVAT) [30] and GCN with dense virtual adversarial training (GCN-DVAT) [30].

Figure 3: t-SNE for the extracted features trained on Cora (7 classes) public (5.2%). (a) Linear Snowball; (b) Snowball; (c) Truncated Krylov.

In Tables 2 and 3, for each test case, we report the accuracy averaged over 10 independent runs using the best searched hyperparameters. 
These hyperparameters are reported in the appendix. They include the learning rate and weight decay for the optimizers (RMSprop for cases with validation, Adam for cases without), taking values in the intervals [10^{−6}, 5 × 10^{−3}] and [10^{−5}, 10^{−2}] respectively; the width of hidden layers, taking values in the set {100, 200, · · · , 5000}; the number of hidden layers, in the set {1, 2, . . . , 50}; dropout in (0, 0.99]; and the number of Krylov blocks, taking values in {1, 2, . . . , 100}. An early stopping trick is also used to achieve better training: specifically, we terminate the training after 100 update steps without improvement of the training loss.

We see that the instances of the proposed architectures achieve overwhelming performance in all test cases. We visualize a representative case using t-SNE [26] in Figure 3. From these visualizations, we can see that the instances extract good features with small training data, especially the truncated block Krylov network. Particularly, when the training splits are small, they perform astonishingly better than the existing methods. This may be explained by the fact that when there is less labeled data, a larger scope of the vision field is needed for the recognition of each node, or to let the label signals propagate. We would also highlight that the linear snowball GCN can achieve state-of-the-art performance with much less computational cost. If G has no bipartite components, then in (4), as n → 
1,\nrank(Y0) \uf8ff k almost surely.\n\nTable 2: Accuracy without Validation\n\nCora\n\nCiteSeer\n\nPubMed\n\nAlgorithms\n\nLP\n\nUnion\n\nCheby\n\nIntersection\nMultiStage\n\nCo-training\nSelf-training\n\n0.5% 1% 2% 3% 4% 5% 0.5% 1% 2% 3% 4% 5% 0.03% 0.05% 0.1% 0.3%\n66.4 65.4 66.8\n56.4 62.3 65.4 67.5 69.0 70.2 34.8 40.2 43.6 45.3 46.4 47.3 61.4\n47.3 51.2 72.8\n38.0 52.0 62.4 70.8 74.1 77.6 31.7 42.8 59.9 66.2 68.3 69.3 40.4\n68.3 72.7 78.2\n56.6 66.4 73.5 75.9 78.9 80.8 47.3 55.7 62.1 62.5 64.5 65.5 62.2\n58.7 66.8 77.0\n53.7 66.1 73.8 77.2 79.4 80.0 43.3 58.1 68.2 69.8 70.4 71.0 51.9\n58.5 69.9 75.9 78.5 80.4 81.7 46.3 59.1 66.7 66.7 67.6 68.2 58.4\n64.0 70.7 79.2\n59.3 69.7 77.6\n49.7 65.0 72.9 77.1 79.4 80.2 42.9 59.1 68.6 70.1 70.8 71.2 52.0\n64.3 70.2\n57.4\n61.1 63.7 74.4 76.1 77.2\n64.4 70.6\n61.5 67.2 75.6 77.8 78.0\n59.2\n49.7 56.3 76.6\n42.6 56.9 67.8 74.9 77.6 79.3 33.4 46.5 62.6 66.9 68.7 69.6 46.4\n56.9 63.5 77.2\n43.6 53.9 71.4 75.6 78.3 78.5 47.0 52.4 65.8 68.6 69.5 70.7 52.1\n49 61.8 71.9 75.9 78.4 78.6 51.5 58.5 67.4 69.2 70.8 71.3 53.3\n58.6 66.3 77.3\n68.5 73.6 79.7\nlinear Snowball 67.6 74.6 78.9 80.9 82.3 82.9 56.0 63.4 69.3 70.6 72.5 72.6 65.5\n68.6 73.2 80.1\n68.4 73.2 78.4 80.8 82.3 83.0 56.4 63.9 68.7 70.5 71.8 72.8 66.5\ntruncated Krylov 71.8 76.5 80.0 82.0 83.0 84.1 59.9 66.1 69.8 71.3 72.3 73.7 68.7\n71.4 75.5 80.4\nFor each (column), the greener the cell, the better the performance. The redder, the worse. 
If our\nmethods achieve better performance than all others, the corresponding cell will be in bold.\n\n53.0 57.8 63.8 68.0 69.0\n56.1 62.1 66.4 70.3 70.5\n\nGCN-SVAT\nGCN-DVAT\n\nM3S\nGCN\n\nSnowball\n\nTable 3: Accuracy with Validation\n\nCora\n0.5% 1% 3%\n\nAlgorithms\n\nCheby\nGCN-FP\nGGNN\nDCNN\nMPNN\n\n5.2%\npublic\n33.9 44.2 62.1 78.0\n50.5 59.6 71.7 74.6\n48.2 60.5 73.1 77.6\n59.0 66.4 76.7 79.7\n46.5 56.7 72.0 78.0\n37.5 49.0 64.2 74.5\n41.4 48.6 56.8 83.0\n50.9 62.3 76.5 80.5\n58.1 66.1 77.3 79.5\n60.8 67.5 77.7 80.4\nlinear Snowball 72.5 76.3 82.2 83.3\n71.2 76.6 81.9 83.2\ntruncated Krylov 74.8 78.0 82.7 83.2\n\nGAT\nGCN\nLNet\n\nGraphSAGE\n\nSnowball\n\nAdaLNet\n\nCiteSeer\n\n0.5% 1%\n\n3.6%\npublic\n45.3 59.4 70.1\n43.9 54.3 61.5\n44.3 56.0 64.6\n53.1 62.2 69.4\n41.8 54.3 64.0\n33.8 51.0 67.2\n38.2 46.5 72.5\n43.6 55.3 68.7\n53.2 61.3 66.2\n53.8 63.3 68.7\n62.0 66.7 72.9\n61.0 66.4 73.3\n64.0 68.3 73.9\n\nPubMed\n\n0.03% 0.05% 0.1%\n\n0.3%\npublic\n55.2 69.8\n70.3 76.0\n70.4 75.8\n73.1 76.8\n67.3 75.6\n65.4 76.8\n59.6 79.0\n73.0 77.8\n73.4 78.3\n72.8 78.1\n75.6 79.1\n75.2 79.2\n78.0 80.1\n\n45.3\n56.2\n55.8\n60.9\n53.9\n45.4\n50.9\n57.9\n60.4\n61.0\n70.8\n69.9\n72.2\n\n48.2\n63.2\n63.3\n66.7\n59.6\n53.0\n50.4\n64.6\n68.8\n66.0\n72.1\n72.7\n74.9\n\n6 Future Works\n\nFuture research of this like includes: 1) Investigating how the pointwise nonlinear activation\nfunctions in\ufb02uence block vectors, e.g., the feature block vector X and hidden feature block\nvectors Hi, so that we can \ufb01nd possible activation functions better than Tanh; 2) Finding a\nbetter way to leverage the block Krylov algorithms instead of conducting simple truncation.\n\nAcknowledgements\n\nThe authors wish to express sincere gratitude for the computational resources of Compute\nCanada provided by Mila, as well as for the proofreading done by Sitao and Mingde\u2019s good\nfriend & coworker Ian P. Porada.\n\n9\n\n\fReferences\n[1] J. Atwood and D. Towsley. 
Diffusion-convolutional neural networks. arXiv, abs/1511.02136, 2015.

[2] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst. Geometric deep learning: going beyond Euclidean data. arXiv, abs/1611.08097, 2016.

[3] J. Chen, T. Ma, and C. Xiao. FastGCN: fast learning with graph convolutional networks via importance sampling. arXiv preprint arXiv:1801.10247, 2018.

[4] J. Chen, J. Zhu, and L. Song. Stochastic training of graph convolutional networks with variance reduction. arXiv preprint arXiv:1710.10568, 2017.

[5] R. R. Coifman and S. Lafon. Diffusion maps. Applied and Computational Harmonic Analysis, 21(1):5–30, 2006.

[6] R. R. Coifman and M. Maggioni. Diffusion wavelets. Applied and Computational Harmonic Analysis, 21(1):53–94, 2006.

[7] M. Defferrard, X. Bresson, and P. Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. arXiv, abs/1606.09375, 2016.

[8] D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, pages 2224–2232, 2015.

[9] X. Feng and Z. Zhang. The rank of a random matrix. Applied Mathematics and Computation, 185(1):689–694, 2007.

[10] A. Frommer, K. Lund, M. Schweitzer, and D. B. Szyld. The Radau–Lanczos method for matrix functions. SIAM Journal on Matrix Analysis and Applications, 38(3):710–732, 2017.

[11] A. Frommer, K. Lund, and D. B. Szyld. Block Krylov subspace methods for functions of matrices. Electronic Transactions on Numerical Analysis, 47:100–126, 2017.

[12] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1263–1272. JMLR.org, 2017.

[13] M. H. Gutknecht and T. Schmelzer. The block grade of a block Krylov space. Linear Algebra and its Applications, 430(1):174–185, 2009.

[14] W. L. Hamilton, R. Ying, and J. Leskovec. Inductive representation learning on large graphs. arXiv, abs/1706.02216, 2017.

[15] D. K. Hammond, P. Vandergheynst, and R. Gribonval. Wavelets on graphs via spectral graph theory. Applied and Computational Harmonic Analysis, 30(2):129–150, 2011.

[16] G. E. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.

[17] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4700–4708, 2017.

[18] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. arXiv, abs/1609.02907, 2016.

[19] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436, 2015.

[20] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[21] Q. Li, Z. Han, and X. Wu. Deeper insights into graph convolutional networks for semi-supervised learning. arXiv, abs/1801.07606, 2018.

[22] R. Li, S. Wang, F. Zhu, and J. Huang. Adaptive graph convolutional neural networks. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[23] Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel. Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493, 2015.

[24] R. Liao, M. Brockschmidt, D. Tarlow, A. L. Gaunt, R. Urtasun, and R. Zemel. Graph partition neural networks for semi-supervised classification. arXiv preprint arXiv:1803.06272, 2018.

[25] R. Liao, Z. Zhao, R. Urtasun, and R. S. Zemel.
LanczosNet: Multi-scale deep graph convolutional networks. arXiv, abs/1901.01484, 2019.

[26] L. van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.

[27] F. Monti, D. Boscaini, J. Masci, E. Rodola, J. Svoboda, and M. M. Bronstein. Geometric deep learning on graphs and manifolds using mixture model CNNs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5115–5124, 2017.

[28] C. Musco, C. Musco, and A. Sidford. Stability of the Lanczos method for matrix function approximation. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1605–1624. Society for Industrial and Applied Mathematics, 2018.

[29] D. I. Shuman, S. K. Narang, P. Frossard, A. Ortega, and P. Vandergheynst. The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. arXiv preprint arXiv:1211.0053, 2012.

[30] K. Sun, H. Guo, Z. Zhu, and Z. Lin. Virtual adversarial training on graph convolutional networks in node classification. arXiv preprint arXiv:1902.11045, 2019.

[31] K. Sun, Z. Zhu, and Z. Lin. Multi-stage self-supervised learning for graph convolutional networks. arXiv, abs/1902.11038, 2019.

[32] L. V. Tran, V. H. Vu, and K. Wang. Sparse random graphs: Eigenvalues and eigenvectors. Random Structures & Algorithms, 42(1):110–134, 2013.

[33] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio. Graph attention networks. arXiv, abs/1710.10903, 2017.

[34] S. Wang, Z. Zhang, and T. Zhang. Improved analyses of the randomized power method and block Lanczos method. arXiv preprint arXiv:1508.06429, 2015.

[35] X.-M. Wu, Z. Li, A. M. So, J. Wright, and S.-F. Chang. Learning with partially absorbing random walks.
In Advances in Neural Information Processing Systems, pages 3077–3085, 2012.

[36] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu. A comprehensive survey on graph neural networks. arXiv, abs/1901.00596, 2019.

[37] Z. Yang, W. W. Cohen, and R. Salakhutdinov. Revisiting semi-supervised learning with graph embeddings. arXiv preprint arXiv:1603.08861, 2016.

[38] S. Zhang, H. Tong, J. Xu, and R. Maciejewski. Graph convolutional networks: Algorithms, applications and open challenges. In International Conference on Computational Social Networks, pages 79–91. Springer, 2018.