{"title": "Layer-Dependent Importance Sampling for Training Deep and Large Graph Convolutional Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 11249, "page_last": 11259, "abstract": "Graph convolutional networks (GCNs) have recently received wide attentions, due to their successful applications in different graph tasks and different domains. Training GCNs for a large graph, however, is still a challenge. Original full-batch GCN training requires calculating the representation of all the nodes in the graph per GCN layer, which brings in high computation and memory costs. To alleviate this issue, several sampling-based methods are proposed to train GCNs on a subset of nodes. Among them, the node-wise neighbor-sampling method recursively samples a fixed number of neighbor nodes, and thus its computation cost suffers from exponential growing neighbor size across layers; while the layer-wise importance-sampling method discards the neighbor-dependent constraints, and thus the nodes sampled across layer suffer from sparse connection problem. To deal with the above two problems, we propose a new effective sampling algorithm called LAyer-Dependent ImportancE Sampling (LADIES). Based on the sampled nodes in the upper layer, LADIES selects nodes that are in the neighborhood of these nodes and uses the constructed bipartite graph to compute the importance probability. Then, it samples a fixed number of nodes according to the probability for the whole layer, and recursively conducts such procedure per layer to construct the whole computation graph. We prove theoretically and experimentally, that our proposed sampling algorithm outperforms the previous sampling methods regarding both time and memory. 
Furthermore, LADIES is shown to have better generalization accuracy than the original full-batch GCN, due to its stochastic nature.", "full_text": "Layer-Dependent Importance Sampling for Training\n\nDeep and Large Graph Convolutional Networks\n\nDifan Zou*, Ziniu Hu*, Yewen Wang, Song Jiang, Yizhou Sun, Quanquan Gu\n\nDepartment of Computer Science, UCLA, Los Angeles, CA 90095\n\n{knowzou,bull,wyw10804,songjiang,yzsun,qgu}@cs.ucla.edu\n\nAbstract\n\nGraph convolutional networks (GCNs) have recently received wide attention, due to their successful applications in different graph tasks and different domains. Training GCNs for a large graph, however, is still a challenge. Original full-batch GCN training requires calculating the representation of all the nodes in the graph per GCN layer, which brings in high computation and memory costs. To alleviate this issue, several sampling-based methods have been proposed to train GCNs on a subset of nodes. Among them, the node-wise neighbor-sampling method recursively samples a fixed number of neighbor nodes, and thus its computation cost suffers from an exponentially growing neighborhood size; the layer-wise importance-sampling method, in contrast, discards the neighbor-dependent constraints, and thus the nodes sampled across layers suffer from a sparse-connection problem. To deal with these two problems, we propose a new, effective sampling algorithm called LAyer-Dependent ImportancE Sampling (LADIES) 2. Based on the nodes sampled in the upper layer, LADIES selects their neighborhood nodes, constructs a bipartite subgraph and computes the importance probabilities accordingly. Then, it samples a fixed number of nodes by the calculated probabilities, and recursively conducts this procedure per layer to construct the whole computation graph. We show theoretically and experimentally that our proposed sampling algorithm outperforms the previous sampling methods in terms of both time and memory costs. 
Furthermore, LADIES is shown to have better generalization accuracy than the original full-batch GCN, due to its stochastic nature.\n\n1 Introduction\n\nGraph convolutional networks (GCNs), recently proposed by Kipf et al. [12], adapt the concept of convolutional filters to the graph domain [2, 6-8]. For a given node, a GCN layer aggregates the embeddings of its neighbors from the previous layer, followed by a non-linear transformation, to obtain an updated, contextualized node representation. Similar to convolutional neural networks (CNNs) [13] in the computer vision domain, stacking multiple GCN layers gives each node representation a wide receptive field over both its immediate and distant neighbors, which intuitively increases the model capacity.\nDespite the success of GCNs in many graph-related applications [12, 17, 15], training a deep GCN for large-scale graphs remains a big challenge. Unlike tokens in a paragraph or pixels in an image, which normally have limited length or size, graph data in practice can be extremely large. For example, the Facebook social network in 2019 contains 2.7 billion users3. Such a large-scale graph cannot be handled by full-batch GCN training, which takes all the nodes into one batch to update parameters. However, conducting mini-batch GCN training is non-trivial, as the nodes\n\n*equal contribution\n2Code is available at https://github.com/acbull/LADIES\n3https://zephoria.com/top-15-valuable-facebook-statistics/\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nFigure 1: An illustration of the sampling process of GraphSAGE, FastGCN, and our proposed LADIES. Black nodes denote the nodes in the upper layer, blue nodes in the dashed circle are their neighbors, and nodes with red frames are the sampled nodes. 
As shown in the figure, GraphSAGE may redundantly sample a neighboring node twice, denoted by the red triangle, while FastGCN may sample nodes outside of the neighborhood. Our proposed LADIES avoids these two problems.\n\nin a graph are closely coupled and correlated. In particular, in GCNs, the embedding of a given node depends recursively on all its neighbors' embeddings, and such computation dependency grows exponentially with respect to the number of layers. Therefore, training deep and large GCNs is still a challenge, which prevents their application to many large-scale graphs, such as social networks [12], recommender systems [17], and knowledge graphs [15].\nTo alleviate these issues, researchers have proposed sampling-based methods that train GCNs on mini-batches of nodes, aggregating the embeddings of only a sampled subset of neighbors for each node in the mini-batch. One direction is the node-wise neighbor-sampling method. For example, GraphSAGE [9] calculates each node embedding by leveraging only a fixed number of uniformly sampled neighbors. Although this kind of approach reduces the computation cost of each aggregation operation, the total cost can still be large. As pointed out in [11], the recursive nature of node-wise sampling introduces redundancy into the embedding calculation: even if two nodes share the same sampled neighbor, the embedding of this neighbor has to be calculated twice. Such redundant calculation is amplified exponentially as the number of layers increases.\nFollowing this line of research, a series of works was proposed to reduce the size of the sampled neighborhood and the resulting computational redundancy. VR-GCN [3] proposes to leverage variance reduction techniques to improve the sample complexity. 
Cluster-GCN [5] restricts the sampled neighbors to dense subgraphs, which are identified by a graph clustering algorithm before the training of the GCN. However, these methods still cannot fully address the issue of redundant computation, which may become worse when training very deep and large GCNs.\nAnother direction uses a layer-wise importance-sampling method. For example, FastGCN [4] calculates a sampling probability based on the degree of each node, and samples a fixed number of nodes for each layer accordingly. Then, it only uses the sampled nodes to build a much smaller sampled adjacency matrix, and thus the computation cost is reduced. Ideally, the sampling probability in FastGCN [4] is calculated to reduce the estimation variance and to guarantee fast and stable convergence. However, since the sampling probability is independent for each layer, the sampled nodes from two consecutive layers are not necessarily connected. Therefore, the sampled adjacency matrix can be extremely sparse, and may even have all-zero rows, meaning some nodes are disconnected. Such a sparse-connection problem yields an inaccurate computation graph and further deteriorates the training and generalization performance of FastGCN. In order to capture the inter-layer correlation and reduce the estimation variance, Huang et al. [11] proposed an adaptive and trainable sampling method that conducts layer-wise sampling conditioned on the layer above, which has been demonstrated to achieve higher accuracy than FastGCN. 
Yet the advantage of the importance sampling procedure used in [11] in terms of time and memory costs is not fully justified, theoretically or empirically.\nBased on the pros and cons of the aforementioned sampling approaches, we argue that an ideal sampling method should have the following features: 1) layer-wise, so that the neighbor nodes can be taken into account together when calculating the next layer's embeddings, without redundancy; 2) neighbor-dependent, so that the sampled adjacency matrix is dense and little information is lost for training; 3) importance sampling, so as to reduce the sampling variance and accelerate convergence. To this end, we propose a new efficient sampling algorithm called LAyer-Dependent ImportancE Sampling (LADIES) 4, which fulfills all the above features.\nThe procedure of LADIES is as follows: (1) For each current layer (l), based on the nodes sampled in the upper layer (l + 1), it picks all their neighbors in the graph and constructs a bipartite graph between the nodes of the two layers. (2) It then calculates the sampling probability according to the degree of the nodes in the current layer, in order to reduce the sampling variance. (3) Next, it samples a fixed number of nodes based on this probability. (4) Finally, it constructs the sampled adjacency matrix between layers and conducts training and inference, where row-wise normalization is applied to all sampled adjacency matrices to stabilize training. 
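Steps (1)-(4) can be sketched in a few lines of NumPy (a toy, dense illustration of the scheme; the 6-node graph and the helper name `sample_layer` are our own assumptions, not the authors' released code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy symmetric adjacency matrix of a 6-node graph.
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 1, 0, 0],
              [1, 1, 0, 0, 1, 0],
              [0, 1, 0, 0, 1, 1],
              [0, 0, 1, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
A_tilde = A + np.eye(6)
deg = A_tilde.sum(axis=1)
P = A_tilde / np.sqrt(np.outer(deg, deg))    # normalized Laplacian

def sample_layer(P, upper_nodes, n_samples, rng):
    """One top-down LADIES step for the layer below `upper_nodes`."""
    # (1) Bipartite block between the upper nodes and the whole graph;
    #     columns outside their neighborhood are identically zero.
    Q_P = P[upper_nodes, :]
    # (2) Degree-based importance probabilities on the candidate columns.
    col_sq = (Q_P ** 2).sum(axis=0)
    p = col_sq / col_sq.sum()
    # (3) Sample a fixed number of nodes for the whole layer at once.
    cand = np.flatnonzero(p > 0)
    take = min(n_samples, cand.size)
    lower = rng.choice(cand, size=take, replace=False,
                       p=p[cand] / p[cand].sum())
    # (4) Sampled inter-layer adjacency, reweighted by 1/(s * p_i),
    #     then row-normalized to stabilize training.
    P_tilde = Q_P[:, lower] / (take * p[lower])
    P_tilde /= P_tilde.sum(axis=1, keepdims=True)
    return lower, P_tilde

lower, P_tilde = sample_layer(P, np.array([0]), n_samples=3, rng=rng)
```

Because the candidates are restricted to the neighborhood of the upper layer, every sampled node is connected to it, so `P_tilde` contains no all-zero rows.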
As illustrated in Figure 1, our proposed sampling strategy avoids the two pitfalls faced by the existing sampling strategies: the layer-wise structure avoids exponential expansion of the receptive field, and layer-dependent importance sampling guarantees that the sampled adjacency matrix is dense, so that the connectivity between nodes in two adjacent layers is well maintained.\nWe highlight our contributions as follows:\n• We propose LAyer-Dependent ImportancE Sampling (LADIES) for training deep and large graph convolutional networks, which is built upon a novel layer-dependent sampling scheme to avoid exponential expansion of the receptive field as well as to guarantee the connectivity of the sampled adjacency matrix.\n• We prove theoretically that the proposed algorithm achieves significantly better memory and time complexities compared with node-wise sampling methods including GraphSAGE [9] and VR-GCN [3], and has a dramatically smaller variance compared with FastGCN [4].\n• We evaluate the performance of the proposed LADIES algorithm on benchmark datasets and demonstrate its superior performance in terms of both running time and test accuracy. Moreover, we show that LADIES achieves high accuracy with an extremely small sample size (e.g., 256 for a graph with 233k nodes), which enables the use of LADIES for training very deep and large GCNs.\n\n2 Background\nIn this section, we review GCNs and several state-of-the-art sampling-based training algorithms, including full-batch, node-wise neighbor-sampling, and layer-wise importance-sampling methods. We summarize the notation used in this paper in Table 1.\n2.1 Existing GCN Training Algorithms\nFull-batch GCN. 
When given an undirected graph G, with P defined in Table 1, the l-th GCN layer is defined as\n\nZ^(l) = P H^(l-1) W^(l-1), H^(l-1) = σ(Z^(l-1)). (1)\n\nConsidering a node-level downstream task with training dataset {(x_i, y_i)}_{v_i∈V_S}, the weight matrices W^(l) are learned by minimizing the loss function L = (1/|V_S|) Σ_{v_i∈V_S} ℓ(y_i, z_i^(L)), where ℓ(·,·) is a specified loss function, z_i^(L) denotes the output of the GCN corresponding to the vertex v_i, and V_S denotes the training node set. The gradient descent algorithm is utilized for full-batch optimization, where the gradient is computed over all the nodes as (1/|V_S|) Σ_{v_i∈V_S} ∇ℓ(y_i, z_i^(L)).\nWhen the graph is large and dense, computing this gradient with full-batch GCN is very expensive in both memory and time, since both the forward and backward propagation require storing and aggregating the embeddings of all nodes across all layers. Also, since each epoch updates the parameters only once, convergence is very slow.\nOne way to address the issues of full-batch training is to sample a mini-batch of labeled nodes V_B ⊆ V_S and compute the mini-batch stochastic gradient (1/|V_B|) Σ_{v_i∈V_B} ∇ℓ(y_i, z_i^(L)). This reduces the computation cost to some extent, but to calculate the representation of each output node we still need to consider a large receptive field, due to the dependency among the nodes in the graph.\nNode-wise Neighbor Sampling Algorithms. GraphSAGE [9] proposed to reduce the receptive field size by neighbor sampling (NS). 
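For reference, the full-batch forward pass in (1) is just two matrix products per layer; a minimal dense sketch with toy sizes of our own choosing (ReLU assumed as the activation σ):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d_in, d_hid = 5, 4, 3          # |V|, input dim, hidden dim (toy sizes)

# Toy normalized Laplacian P = D^{-1/2} (A + I) D^{-1/2}.
A = (rng.random((n, n)) < 0.4).astype(float)
A = np.triu(A, 1)
A = A + A.T                        # symmetric adjacency, no self-loops yet
A_tilde = A + np.eye(n)
deg = A_tilde.sum(axis=1)
P = A_tilde / np.sqrt(np.outer(deg, deg))

H = rng.standard_normal((n, d_in))            # H^(0) = X
weights = [rng.standard_normal((d_in, d_hid)),
           rng.standard_normal((d_hid, d_hid))]

# Eq. (1): Z^(l) = P H^(l-1) W^(l-1),  H^(l) = sigma(Z^(l)).
for W in weights:
    Z = P @ H @ W                  # every layer touches all |V| rows
    H = np.maximum(Z, 0.0)         # ReLU
```

Every layer recomputes embeddings for all |V| nodes, which is exactly the full-batch cost that the sampling methods below try to avoid.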
For each node in the l-th GCN layer, NS randomly samples s_node of its neighbors at the (l-1)-th GCN layer and formulates an unbiased estimator P̂^(l-1) H^(l-1) to approximate P H^(l-1) in the graph convolution layer:\n\nP̂^(l-1)_{i,j} = (|N(v_i)| / s_node) · P_{i,j} if v_j ∈ N̂^(l-1)(v_i); 0 otherwise, (2)\n\nwhere N(v_i) and N̂^(l-1)(v_i) are the full and sampled neighbor sets of node v_i for the (l-1)-th GCN layer, respectively.\n\n4We would like to clarify that the proposed layer-dependent importance sampling is different from the "layered importance sampling" proposed in [14].\n\nTable 1: Summary of Notations\n- G = (V, A), ‖A‖_0: G denotes a graph consisting of a set of nodes V with node number |V|, A is the adjacency matrix, and ‖A‖_0 denotes the number of non-zero entries in A.\n- M_{i,*}, M_{*,j}, M_{i,j}: M_{i,*} is the i-th row of matrix M, M_{*,j} is the j-th column of matrix M, and M_{i,j} is the entry at position (i, j) of matrix M.\n- Ã, D̃, P: Ã = A + I, D̃ is a diagonal matrix satisfying D̃_{i,i} = Σ_j Ã_{i,j}, and P = D̃^(-1/2) Ã D̃^(-1/2) is the normalized Laplacian matrix.\n- l, σ(·), H^(l), Z^(l), W^(l): l denotes the GCN layer index, σ(·) denotes the activation function, H^(l) = σ(Z^(l)) denotes the embedding matrix at layer l with H^(0) = X, Z^(l) = P H^(l-1) W^(l-1) is the intermediate embedding matrix, and W^(l) denotes the trainable weight matrix at layer l.\n- L, K: L is the total number of layers in the GCN, and K is the dimension of the embedding vectors (for simplicity, assumed the same across all layers).\n- b, s_node, s_layer: for batch-wise sampling, b denotes the batch size, s_node is the number of sampled neighbors per node for node-wise sampling, and s_layer is the number of sampled nodes per layer for layer-wise sampling.\n\nTable 2: Summary of Complexity and Variance 6. Here φ denotes the upper bound of the ℓ2 norm of the embedding vectors, Δφ denotes the bound of the norm of the difference between an embedding and its history, D denotes the average degree, b denotes the batch size, and V̄(b) denotes the average number of nodes that are connected to the nodes sampled in the training batch.\n\nMethods | Memory Complexity | Time Complexity | Variance\nFull-Batch [12] | O(L|V|K + LK^2) | O(L‖A‖_0 K + L|V|K^2) | -\nGraphSage [9] | O(bK s_node^(L-1) + LK^2) | O(bK s_node^L + bK^2 s_node^(L-1)) | O(Dφ‖P‖_F^2 / (|V| s_node))\nVR-GCN [3] | O(L|V|K + LK^2) | O(bDK s_node^(L-1) + bK^2 s_node^(L-1)) | O(DΔφ‖P‖_F^2 / (|V| s_node))\nFastGCN [4] | O(LK s_layer + LK^2) | O(LK s_layer^2 + LK^2 s_layer) | O(φ‖P‖_F^2 / s_layer)\nLADIES | O(LK s_layer + LK^2) | O(LK s_layer^2 + LK^2 s_layer) | O(φ‖P‖_F^2 V̄(b) / (|V| s_layer))\n\nVR-GCN [3] is another neighbor-sampling work. It proposed to utilize historical activations to reduce the variance of the estimator under the same sampling strategy as GraphSAGE. Though it successfully achieves a comparable convergence rate, its memory complexity is higher and its time complexity remains the same. Though the NS scheme alleviates the memory bottleneck of GCNs, redundant computation remains under NS, since the embedding calculation of each node in the same layer is independent; thus the complexity grows exponentially as the number of layers increases.\nLayer-wise Importance Sampling Algorithms. 
FastGCN [4] proposed a more advanced layer-wise importance sampling (IS) scheme, aiming to solve the scalability issue as well as to reduce the variance. IS conducts sampling for each layer with a degree-based sampling probability. The i-th row of P H^(l-1) is approximated with s_layer samples v_{j_1}, ..., v_{j_{s_layer}} per layer as\n\n(P H^(l-1))_{i,*} ≈ (1/s_layer) Σ_{k=1,...,s_layer; v_{j_k} ∼ q(v)} P_{i,j_k} H^(l-1)_{j_k,*} / q(v_{j_k}), (3)\n\nwhere q(v) is a distribution over V, and q(v_j) = ‖P_{*,j}‖_2^2 / ‖P‖_F^2 is the probability assigned to node v_j. Though IS outperforms uniform sampling, and layer-wise sampling successfully reduces both time and memory complexities, this sampling strategy has a major limitation: since the sampling operation is conducted independently at each layer, FastGCN cannot guarantee connectivity between the sampled nodes at different layers, which incurs a large variance of the approximate embeddings.\n\n6For simplicity, when evaluating the variance we only consider a two-layer GCN.\n\n2.2 Complexity and Variance Comparison\nWe compare each method's memory complexity, time complexity, and variance with those of our proposed LADIES algorithm in Table 2.\nComplexity. We now compare the complexity of the proposed LADIES algorithm with that of existing sampling-based GCN training algorithms. The complexities of all methods are summarized in Table 2; the detailed calculations can be found in Appendix A. Compared with full-batch GCN, the time and memory complexities of LADIES do not depend on the total number of nodes and edges; thus our algorithm has no scalability issue on large and dense graphs. 
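The unbiasedness of the estimator in (3) under the degree-based distribution q can be checked numerically; the following Monte-Carlo sketch uses a toy graph and variable names of our own:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 4

# Toy normalized Laplacian P and previous-layer embeddings H^(l-1).
A = (rng.random((n, n)) < 0.5).astype(float)
A = np.triu(A, 1)
A = A + A.T
A_tilde = A + np.eye(n)
deg = A_tilde.sum(axis=1)
P = A_tilde / np.sqrt(np.outer(deg, deg))
H = rng.standard_normal((n, d))

# FastGCN importance distribution: q(v_j) = ||P_{*,j}||_2^2 / ||P||_F^2.
q = (P ** 2).sum(axis=0) / (P ** 2).sum()

# Monte-Carlo estimate of (P H^(l-1))_{i,*} for node i = 0 via eq. (3).
i, s_layer, trials = 0, 4, 20000
js = rng.choice(n, size=(trials, s_layer), p=q)        # v_{j_k} ~ q(v)
terms = P[i, js][..., None] * H[js] / q[js][..., None]
est = terms.mean(axis=(0, 1))                          # avg of (1/s) sums

exact = P[i] @ H                                       # target row
```

`est` matches `exact` in expectation; however, whenever a sampled column j has P[i, j] = 0 its term vanishes, which is precisely the sparse-connection problem described above.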
Unlike NS-based methods such as GraphSAGE and VR-GCN, LADIES is not sensitive to the number of layers and does not suffer exponential growth in complexity; therefore it performs well when the neural network goes deeper. Compared to the layer-wise importance sampling proposed in FastGCN, it maintains the same complexity while obtaining a better convergence guarantee, as analyzed in the next paragraph. In fact, in order to guarantee good performance, our method requires a much smaller sample size than FastGCN, so the time and memory burden is much lighter. Therefore, our proposed LADIES algorithm achieves the best time and memory complexities and is applicable to training very deep and large graph convolutional networks.\nVariance. We compare the variance of our algorithm with that of existing sampling-based algorithms. To simplify the analysis, when evaluating the variance we only consider a two-layer GCN. The results are summarized in Table 2; we defer the detailed calculations to Appendix B. Compared with FastGCN, our variance result is strictly better since V̄(b) ≤ |V|, where V̄(b) denotes the average number of nodes that are connected to the nodes sampled in the training batch. Moreover, V̄(b) scales with b, which implies that our method can be even better when using a small batch size. For the comparison with node-wise sampling, consider the same sample size, i.e., s_layer = b · s_node. Ignoring common factors, the variance of LADIES is of order O(V̄(b)/b) and the variance of GraphSAGE is O(D), where D denotes the average degree. By the definition of V̄(b), we strictly have V̄(b) ≤ O(bD), since no node is counted redundantly in V̄(b). Therefore our method is also strictly better than GraphSAGE, especially when the graph is dense, i.e., when many neighbors are shared. 
The variance of VR-GCN resembles that of GraphSAGE but relies on the difference between the embeddings and their history, which is not directly comparable to our results.\n3 LADIES: LAyer-Dependent ImportancE Sampling\nWe present our method, LADIES, in this section. As illustrated in the previous sections, node-wise sampling methods [9, 3] have to sample a certain number of nodes in the neighbor set of every sampled node in the current layer, so the number of nodes selected to build the computation graph grows exponentially with the number of hidden layers, which further slows down the training process of GCNs. The existing layer-wise sampling scheme [4] is inefficient when the graph is sparse, since some nodes may have no neighbor sampled during the sampling process, which results in meaningless zero activations [3].\nTo address the aforementioned drawbacks of existing sampling-based methods for GCN training, we propose a training algorithm that achieves good convergence performance while reducing the sampling complexity. In the following, we illustrate our method in detail.\n3.1 Revisiting Independent Layer-wise Sampling\nWe first revisit the independent layer-wise sampling scheme for building the computation graph of GCN training. Recall that in the forward process of GCN (1), the matrix product P H^(l-1) can be regarded as the embedding aggregation process. Layer-wise sampling methods aim to approximate the intermediate embedding matrix Z^(l) by sampling only a subset of nodes at the (l-1)-th layer and aggregating their embeddings to approximately estimate the embeddings at the l-th layer. 
Mathematically, similar to (3), let S^(l-1) with |S^(l-1)| = s_{l-1} be the set of sampled nodes at the (l-1)-th layer; we can approximate P H^(l-1) as\n\nP H^(l-1) ≈ (1/s_{l-1}) Σ_{k∈S^(l-1)} (1/p_k^(l-1)) P_{*,k} H^(l-1)_{k,*},\n\nwhere we adopt a non-uniform sampling scheme by assigning probabilities p_1^(l-1), ..., p_{|V|}^(l-1) to all nodes in V; the corresponding discount weights are {1/(s_{l-1} p_i^(l-1))}_{i=1,...,|V|}. Then let {i_k^(l-1)}_{k∈S^(l-1)} be the indices of the sampled nodes at the (l-1)-th layer; the estimator of P H^(l-1) can be formulated as\n\nP H^(l-1) ≈ P S^(l-1) H^(l-1), (4)\n\nwhere S^(l-1) ∈ R^{|V|×|V|} is a diagonal matrix with nonzero entries only on the diagonal, defined by\n\nS^(l-1)_{s,s} = 1/(s_{l-1} p^(l-1)_{i_k^(l-1)}) if s = i_k^(l-1); 0 otherwise. (5)\n\nIt can be verified that ‖S^(l-1)‖_0 = s_{l-1} and E[S^(l-1)] = I. Assume that the sets of sampled nodes at the l-th and (l-1)-th layers are determined. 
Then let {i_k^(l)}_{k∈S^(l)} and {i_k^(l-1)}_{k∈S^(l-1)} denote the indices of the sampled nodes at these two layers, and define the row-selection matrix Q^(l) ∈ R^{s_l×|V|} as\n\nQ^(l)_{k,s} = 1 if s = i_k^(l); 0 otherwise. (6)\n\nThen the forward process of GCN with layer-wise sampling can be approximated by\n\nZ̃^(l) = P̃^(l-1) H̃^(l-1) W^(l-1), H̃^(l-1) = σ(Z̃^(l-1)), (7)\n\nwhere Z̃^(l) ∈ R^{s_l×d} denotes the approximated intermediate embeddings of the sampled nodes at the l-th layer, and P̃^(l-1) = Q^(l) P S^(l-1) Q^(l-1)ᵀ ∈ R^{s_l×s_{l-1}} serves as a modified Laplacian matrix, which can also be regarded as a sampled bipartite graph after certain rescaling. Since typically s_l, s_{l-1} ≪ |V|, the sampled graph is dramatically smaller than the entire one, so the computation cost can be significantly reduced.\n3.2 Layer-dependent Importance Sampling\nHowever, conducting layer-wise sampling independently at different layers is inefficient, since the sampled bipartite graph may still be sparse and may even have all-zero rows. This results in very poor performance and requires us to sample many nodes in order to guarantee convergence throughout GCN training. To alleviate this issue, we propose to apply neighbor-dependent sampling, which leverages the dependency between layers and leads to a dense computation graph. Specifically, our layer-dependent sampling mechanism is designed in a top-down manner, i.e., the sampled nodes at the l-th layer are generated depending on the sampled nodes that have already been generated in all upper layers. Note that for each node we only need to aggregate the embeddings of its neighbor nodes in the previous layer. 
Thus, at a particular layer, we only need to generate samples from the union of the neighborhoods of the nodes sampled in the upper layer, defined by\n\nV^(l-1) = ∪_{v_i∈S^(l)} N(v_i),\n\nwhere S^(l) is the set of nodes sampled at the l-th layer and N(v_i) denotes the neighbor set of node v_i. Therefore, during the sampling process we only assign probabilities to nodes in V^(l-1), denoted by {p_i^(l-1)}_{v_i∈V^(l-1)}. Similar to FastGCN [4], we apply importance sampling to reduce the variance. However, we have no information about the activation matrix H^(l-1) when characterizing the samples at the (l-1)-th layer. Therefore, we resort to an importance sampling scheme that relies only on the matrices Q^(l) and P. Specifically, we define the importance probabilities as\n\np_i^(l-1) = ‖Q^(l) P_{*,i}‖_2^2 / ‖Q^(l) P‖_F^2. (8)\n\nEvidently, if v_i ∉ V^(l-1), we have ‖Q^(l) P_{*,i}‖_2^2 = 0, which implies p_i^(l-1) = 0. Then, letting {i_k^(l-1)}_{k∈S^(l-1)} be the indices of the nodes sampled at the (l-1)-th layer according to the importance probabilities in (8), we can again define the random diagonal matrix S^(l-1) according to (5) and formulate the same forward process of GCN as in (7), but with a different modified Laplacian matrix P̃^(l-1) = Q^(l) P S^(l-1) Q^(l-1)ᵀ ∈ R^{s_l×s_{l-1}}. The computation of P̃^(l-1) can be very efficient since it involves only sparse matrix products. 
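The product P̃^(l-1) = Q^(l) P S^(l-1) Q^(l-1)ᵀ can be kept sparse end-to-end; a toy SciPy sketch (the helper `selection` and all sizes are our own assumptions, not the released implementation):

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
n = 50

# Toy sparse normalized Laplacian P (CSR).
A = sp.random(n, n, density=0.05, random_state=0, format="csr")
A = ((A + A.T) > 0).astype(float) + sp.eye(n)        # symmetric + self-loops
deg = np.asarray(A.sum(axis=1)).ravel()
d_inv_sqrt = sp.diags(1.0 / np.sqrt(deg))
P = (d_inv_sqrt @ A @ d_inv_sqrt).tocsr()

def selection(rows, n):
    """Row-selection matrix Q with Q[k, rows[k]] = 1, as in eq. (6)."""
    s = len(rows)
    return sp.csr_matrix((np.ones(s), (np.arange(s), rows)), shape=(s, n))

upper = rng.choice(n, size=8, replace=False)          # nodes of layer l
Q_up = selection(upper, n)

# Importance probabilities of eq. (8); only neighbors of `upper` get mass.
QP = (Q_up @ P).tocsr()
p = np.asarray(QP.multiply(QP).sum(axis=0)).ravel()
p /= p.sum()

lower = rng.choice(n, size=8, replace=False, p=p)     # nodes of layer l-1
Q_low = selection(lower, n)

# Diagonal reweighting S^(l-1) of eq. (5), then the sampled bipartite block.
w = np.zeros(n)
w[lower] = 1.0 / (len(lower) * p[lower])
S = sp.diags(w)
P_tilde = (Q_up @ P @ S @ Q_low.T).toarray()          # s_l x s_{l-1}
```

Since the self-loops give every upper-layer node a nonzero column norm, at least `len(upper)` candidates always have positive probability, so the draw without replacement is well defined.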
Here the major difference between our sampling method and independent layer-wise sampling is the construction of the matrix S^(l-1). In our sampling mechanism, we have E[S^(l-1)] = L^(l-1), where L^(l-1) is a diagonal matrix with\n\nL^(l-1)_{s,s} = 1 if s ∈ V^(l-1); 0 otherwise. (9)\n\nSince ‖L^(l-1)‖_0 = |V^(l-1)| ≪ |V|, independent sampling wastes probability mass on the many nodes in V \ V^(l-1), which contribute nothing to the construction of the computation graph. In contrast, we sample only from V^(l-1), which guarantees more connections between the sampled nodes at the l-th and (l-1)-th layers, and hence leads to a dense computation graph between these two layers.\n\nAlgorithm 1 Sampling Procedure of LADIES\nRequire: Normalized Laplacian matrix P; batch size b; sample number n\n1: Randomly sample a batch of b output nodes, forming Q^(L)\n2: for l = L to 1 do\n3: Get the layer-dependent Laplacian matrix Q^(l) P, and calculate the sampling probability of each node as p_i^(l-1) ← ‖Q^(l) P_{*,i}‖_2^2 / ‖Q^(l) P‖_F^2\n4: Sample n nodes of layer l-1 using p^(l-1); the sampled nodes form Q^(l-1), and the corresponding reweighting terms are organized into a random diagonal matrix S^(l-1)\n5: Reconstruct the sampled Laplacian matrix between the sampled nodes of layers l-1 and l by P̃^(l-1) ← Q^(l) P S^(l-1) Q^(l-1)ᵀ, then row-normalize it by P̃^(l-1) ← D^(-1)_{P̃^(l-1)} P̃^(l-1)\n6: end for\n7: return Modified Laplacian matrices {P̃^(l)}_{l=1,...,L} and the sampled nodes at the input layer, Q^(0)\n\n3.3 Normalization\nNote that for the original GCN, the Laplacian matrix P is obtained by normalizing the matrix I + A. This normalization is crucial, since it maintains the scale of the embeddings in the forward process and avoids exploding/vanishing gradients. 
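The row normalization used in the sampling procedure (and motivated in this subsection) simply divides each row of a sampled Laplacian by its sum; a minimal sketch with toy values of our own:

```python
import numpy as np

def row_normalize(P_tilde):
    """Rescale each row of a sampled Laplacian to sum to 1;
    all-zero rows (if any) are left untouched."""
    row_sum = P_tilde.sum(axis=1, keepdims=True)
    return P_tilde / np.where(row_sum == 0, 1.0, row_sum)

# Toy sampled Laplacian between 3 upper-layer and 4 lower-layer nodes.
P_tilde = np.array([[0.2, 0.0, 0.5, 0.1],
                    [0.0, 0.9, 0.3, 0.0],
                    [0.4, 0.4, 0.0, 0.2]])
P_norm = row_normalize(P_tilde)
```

With every row summing to one, the scale of the aggregated embeddings stays stable across layers no matter how few nodes were sampled.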
However, the modified Laplacian matrices {P̃^(l)}_{l=1,...,L} may fail to achieve this, especially when L is large, because their maximum singular values can be very large without sufficient samples. Therefore, motivated by [12], we propose to normalize P̃^(l) such that each row sums to 1, i.e.,\n\nP̃^(l) ← D^(-1)_{P̃^(l)} P̃^(l),\n\nwhere D_{P̃^(l)} ∈ R^{s_{l+1}×s_{l+1}} is a diagonal matrix whose diagonal entries are the sums of the corresponding rows of P̃^(l). Now we can leverage the modified Laplacian matrices {P̃^(l)}_{l=1,...,L} to build the whole computation graph. We formally summarize the proposed algorithm in Algorithm 1.\n4 Experiments\nIn this section, we conduct experiments to evaluate LADIES for training deep GCNs on different node classification datasets, including Cora, Citeseer, Pubmed [16] and Reddit [9].\n4.1 Experiment Settings\nWe compare LADIES with the original GCN (full-batch), GraphSAGE (neighbor sampling) and FastGCN (importance sampling). We modified the PyTorch implementation of GCN 7 to add our LADIES sampling mechanism. To make a fair comparison that isolates the sampling part, we also use the online PyTorch implementations of all these baselines released by their authors and the same training code for all the methods. By default, we train 5-layer GCNs with a hidden dimension of 256 using the four methods. We sample 5 neighbors per node for GraphSAGE, and 64 or 512 nodes per layer for both FastGCN and LADIES. We update the model with a batch size of 512 and the Adam optimizer with a learning rate of 0.001.\nFor all the methods and datasets, we conduct training 10 times and report the mean and variance of the evaluation results. Each time we stop training when the validation accuracy does not increase by a threshold (0.01) for 200 batches, and choose the model with the highest validation accuracy as the convergence point. 
We use the following metrics to evaluate the effectiveness of the sampling methods:\n\n7https://github.com/tkipf/pygcn\n\nTable 3: Comparison of LADIES with the original GCN (Full-Batch), GraphSage (Neighbor Sampling) and FastGCN (Importance Sampling), in terms of accuracy, time, memory and convergence. Training 5-layer GCNs on different node classification datasets (node number below the dataset name). Results show that LADIES achieves the best accuracy with lower time and memory cost.\n\nFull-Batch\n\nFull-Batch\n\nFull-Batch\n\nCora\n(2708)\n\nCiteseer\n(3327)\n\nPubmed\n(19717)\n\nReddit\n(232965)\n\n5.89\n13.97\n23.24\n5.89\n13.92\n\n137.93\n453.58\n\n1.92\n4.53\n49.41\n1.92\n4.39\n\nDataset\n\nSample Method\n\n30.72\n471.39\n3.13\n7.33\n3.13\n7.35\n\n68.13\n595.71\n\nFull-Batch\n\nGraphSage (5)\nFastGCN (64)\nFastGCN (512)\nLADIES (64)\nLADIES (512)\n\nGraphSage (5)\nFastGCN (64)\nFastGCN (512)\nFastGCN (1024)\n\nLADIES (64)\nLADIES (512)\n\nGraphSage (5)\nFastGCN (64)\nFastGCN (512)\nFastGCN (8192)\n\nLADIES (64)\nLADIES (512)\n\nGraphSage (5)\nFastGCN (64)\nFastGCN (512)\nFastGCN (8192)\n\nLADIES (64)\nLADIES (512)\n\nF1-Score(%)\n76.5 ± 1.4\n75.2 ± 1.5\n25.1 ± 8.4\n78.0 ± 2.1\n77.6 ± 1.4\n78.3 ± 1.6\n62.3 ± 3.1\n59.4 ± 0.9\n19.2 ± 2.7\n44.6 ± 10.8\n63.5 ± 1.8\n65.0 ± 1.4\n64.3 ± 2.4\n71.9 ± 1.9\n70.1 ± 1.4\n38.5 ± 6.9\n39.3 ± 9.2\n74.4 ± 0.8\n76.8 ± 0.8\n75.9 ± 1.1\n91.6 ± 1.6\n92.1 ± 1.1\n27.8 ± 12.6\n17.5 ± 16.7\n89.5 ± 1.2\n83.5 ± 0.9\n92.8 ± 1.6\n\nTotal Time(s) Mem(MB) Batch Time(ms)\n15.75 ± 0.52\n1.19 ± 0.82\n6.77 ± 4.94\n78.42 ± 0.87\n0.55 ± 0.65\n9.22 ± 0.20\n4.70 ± 1.35\n10.08 ± 0.29\n9.68 ± 0.48\n4.19 ± 1.16\n0.72 ± 0.39\n9.77 ± 0.28\n15.77 ± 0.58\n0.61 ± 0.70\n4.51 ± 3.68\n53.14 ± 1.90\n0.53 ± 0.48\n8.88 ± 
0.40\n10.41 \u00b1 0.51\n4.34 \u00b1 1.73\n10.54 \u00b1 0.27\n2.24 \u00b1 1.01\n2.17 \u00b1 0.65\n9.60 \u00b1 0.39\n0.41 \u00b1 0.22\n10.32 \u00b1 0.23\n44.69 \u00b1 0.57\n4.80 \u00b1 1.53\n44.73 \u00b1 0.30\n5.53 \u00b1 2.57\n0.40 \u00b1 0.69\n7.42 \u00b1 0.16\n0.44 \u00b1 0.61\n10.06 \u00b1 0.41\n3.47 \u00b1 1.16\n17.84 \u00b1 0.33\n2.57 \u00b1 0.72\n9.43 \u00b1 0.47\n10.43 \u00b1 0.36\n2.27 \u00b1 1.17\n1564 \u00b1 3.41\n474.3 \u00b1 84.4\n13.12 \u00b1 2.84\n121.47 \u00b1 0.72\n2.06 \u00b1 1.29\n7.85 \u00b1 0.72\n10.01 \u00b1 0.31\n0.31 \u00b1 0.41\n16.57 \u00b1 0.58\n5.63 \u00b1 2.12\n5.62 \u00b1 1.58\n9.42 \u00b1 0.48\n6.87 \u00b1 1.17\n10.87 \u00b1 0.63\n\nBatch Num\n80.8 \u00b1 51.7\n65.2 \u00b1 52.1\n63.2 \u00b1 71.2\n487 \u00b1 147\n436 \u00b1 118.4\n75.6 \u00b1 37.0\n40.6 \u00b1 22.8\n57.2 \u00b1 42.1\n64.0 \u00b1 57.0\n386 \u00b1 167\n223 \u00b1 98.6\n232 \u00b1 66.8\n37.6 \u00b1 11.9\n102 \u00b1 33.4\n74.8 \u00b1 31.7\n58.8 \u00b1 94.8\n44.8 \u00b1 55.0\n195 \u00b1 56.9\n277 \u00b1 82.2\n245 \u00b1 84.5\n179 \u00b1 75.5\n81.5 \u00b1 42.3\n57.4 \u00b1 43.7\n32.1 \u00b1 72.3\n278 \u00b1 51.2\n453 \u00b1 88.2\n393 \u00b1 74.4\n\u2022 Accuracy: The micro F1-score of the test data at the convergence point. We calculate it using the\n\u2022 Total Running Time (s): The total training time (exclude validation) before convergence point.\n\u2022 Memory (MB): Total memory costs of model parameters and all hidden representations of a batch.\n\u2022 Batch Time and Num: Time cost to run a batch and the total batch number before convergence.\n4.2 Experiment Results\nAs is shown in Table 4, our proposed LADIES can achieve the highest accuracy score among all the\nmethods, using a small sampling number. One surprising thing is that the sampling-based method\ncan achieve higher accuracy than the Full-Batch version, and in some cases using a smaller sampling\nnumber can lead to better accuracy (though it may take longer batches to converge). 
This is probably\nbecause the graph data is incomplete and noisy, and the stochastic nature of the sampling method\ncan bring in regularization for training a more robust GCN with better generalization accuracy [10].\nAnother observation is that no matter the size of the graph, LADIES with a small sample number (64)\ncan still converge well, and sometimes even better than a larger sample number (512). This indicates\nthat LADIES is scalable to training very large and deep GCN while maintaining high accuracy.\nAs a comparison, FastGCN with 64 and 512 sampled nodes can lead to similar accuracy for small\ngraphs (Cora). But for bigger graphs as Citeseer, Pubmed, and Reddit, it cannot converge to a good\npoint, partly because of the computation graph sparsity issue. For a fair comparison, we choose a\nhigher sampling number for FastGCN on these big graphs. For example, in Reddit, we choose 8192\nnodes to be sampled, and FastGCN in such cases can converge to a similar accurate result compared\nwith LADIES, but obviously taking more memory and time cost. GraphSage with 5 nodes to be\nsampled takes far more memory and time cost because of the redundancy problem we\u2019ve discussed,\nand its uniform sampling makes it fail to converge well and fast compared to importance sampling.\n\nfull-batch version to get the most accurate inference (only care about training).\n\n2370.48\n1234.63\n\n3.75\n6.91\n74.28\n3.75\n7.26\n\n8\n\n\fFigure 2: F1-score, total time and memory cost at convergence point for training PubMed, when we\nchoose different sampling numbers of our method. Results show that LADIES can achieve good\ngeneralization accuracy (F1-score = 77.6) even with a small sampling number as 16, while FastGCN\ncannot converge (only reach F1-score = 39.3) with a large sampling number as 512.\n\n(a) Training (60 data samples)\n\n(b) Validation (500 data samples)\n\n(c) Testing (1000 data samples)\n\nFigure 3: Experiments on the PubMed dataset. 
We plot the F1-score of both full-batch GCN and LADIES at every epoch on (a) the training set, (b) the validation set, and (c) the test set.

In addition to the above comparison, we show that our proposed LADIES converges well with a much smaller sampling number. As shown in Figure 2, when the sampling number is as small as 16, the algorithm already converges to the best result, with low time and memory cost. This implies that although we choose sampling numbers of 64 and 512 in Table 3 for a fair comparison, the performance of our method can be further enhanced with a smaller sampling number.

Furthermore, we show that the stochastic nature of LADIES helps achieve better generalization accuracy than the original full-batch GCN. We plot the F1-score of both full-batch GCN and LADIES on the PubMed dataset for 300 epochs without early stopping in Figure 3. From Figure 3(a), we can see that full-batch GCN achieves a higher F1-score than LADIES on the training set. Nevertheless, on the validation and test sets, Figures 3(b) and 3(c) show that LADIES achieves a significantly higher F1-score than full-batch GCN. This suggests that LADIES has better generalization performance than full-batch GCN. The reason is that real graphs are often noisy and incomplete. Full-batch GCN uses the entire graph in the training phase, which can cause overfitting to the noise. In sharp contrast, LADIES employs stochastic sampling to use partial information of the graph, and can therefore mitigate the noise of the graph and avoid overfitting to the training data. At a high level, the sampling scheme adopted in LADIES shares a similar spirit with bagging and the bootstrap [1], which are known to improve the generalization performance of machine learning predictors.

5 Conclusions

We propose a new algorithm, LADIES, for training deep and large GCNs.
The crucial ingredient of our algorithm is layer-dependent importance sampling, which both ensures a dense computation graph and avoids drastic expansion of the receptive field. Theoretically, we show that LADIES enjoys significantly lower memory cost, time complexity and estimation variance than existing GCN training methods, including GraphSAGE and FastGCN. Experimental results demonstrate that LADIES achieves the best test accuracy with much lower computation time and memory cost on benchmark datasets.

Acknowledgement

We would like to thank the anonymous reviewers for their helpful comments. D. Zou and Q. Gu were partially supported by NSF BIGDATA IIS-1855099, NSF CAREER Award IIS-1906169 and a Salesforce Deep Learning Research Award. Z. Hu, Y. Wang, S. Jiang and Y. Sun were partially supported by NSF III-1705169, NSF 1937599, NSF CAREER Award 1741634, and an Amazon Research Award. We also thank AWS for providing cloud computing credits associated with the NSF BIGDATA award. The views and conclusions contained in this paper are those of the authors and should not be interpreted as representing any funding agencies.

References

[1] Leo Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.

[2] Michael M. Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. Geometric deep learning: going beyond Euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.

[3] Jianfei Chen, Jun Zhu, and Le Song. Stochastic training of graph convolutional networks with variance reduction.
In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pages 941–949, 2018.

[4] Jie Chen, Tengfei Ma, and Cao Xiao. FastGCN: Fast learning with graph convolutional networks via importance sampling. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018.

[5] Wei-Lin Chiang, Xuanqing Liu, Si Si, Yang Li, Samy Bengio, and Cho-Jui Hsieh. Cluster-GCN: An efficient algorithm for training deep and large graph convolutional networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2019, Anchorage, AK, USA, August 4-8, 2019, pages 257–266, 2019.

[6] Hanjun Dai, Bo Dai, and Le Song. Discriminative embeddings of latent variable models for structured data. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pages 2702–2711, 2016.

[7] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 3837–3845, 2016.

[8] William L. Hamilton, Rex Ying, and Jure Leskovec. Representation learning on graphs: Methods and applications. IEEE Data Engineering Bulletin, 40(3):52–74, 2017.

[9] William L. Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 1025–1035, 2017.

[10] Moritz Hardt, Ben Recht, and Yoram Singer.
Train faster, generalize better: Stability of stochastic gradient descent. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pages 1225–1234, 2016.

[11] Wen-bing Huang, Tong Zhang, Yu Rong, and Junzhou Huang. Adaptive sampling towards fast graph representation learning. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pages 4563–4572, 2018.

[12] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In 5th International Conference on Learning Representations, ICLR 2017.

[13] Yann LeCun, Yoshua Bengio, et al. Convolutional networks for images, speech, and time series. The Handbook of Brain Theory and Neural Networks, 3361(10):1995, 1995.

[14] Luca Martino, Victor Elvira, David Luengo, and Jukka Corander. Layered adaptive importance sampling. Statistics and Computing, 27(3):599–623, 2017.

[15] Michael Sejr Schlichtkrull, Thomas N. Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. Modeling relational data with graph convolutional networks. In The Semantic Web - 15th International Conference, ESWC 2018, Heraklion, Crete, Greece, June 3-7, 2018, Proceedings, pages 593–607, 2018.

[16] Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi-Rad. Collective classification in network data. AI Magazine, 29(3):93, 2008.

[17] Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L. Hamilton, and Jure Leskovec.
Graph convolutional neural networks for web-scale recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2018, London, UK, August 19-23, 2018, pages 974–983, 2018.