{"title": "Inductive Representation Learning on Large Graphs", "book": "Advances in Neural Information Processing Systems", "page_first": 1024, "page_last": 1034, "abstract": "Low-dimensional embeddings of nodes in large graphs have proved extremely useful in a variety of prediction tasks, from content recommendation to identifying protein functions. However, most existing approaches require that all nodes in the graph are present during training of the embeddings; these previous approaches are inherently transductive and do not naturally generalize to unseen nodes. Here we present GraphSAGE, a general, inductive framework that leverages node feature information (e.g., text attributes) to efficiently generate node embeddings. Instead of training individual embeddings for each node, we learn a function that generates embeddings by sampling and aggregating features from a node's local neighborhood. Our algorithm outperforms strong baselines on three inductive node-classification benchmarks: we classify the category of unseen nodes in evolving information graphs based on citation and Reddit post data, and we show that our algorithm generalizes to completely unseen graphs using a multi-graph dataset of protein-protein interactions.", "full_text": "Inductive Representation Learning on Large Graphs\n\nWilliam L. Hamilton\u2217\nwleif@stanford.edu\n\nRex Ying\u2217\n\nrexying@stanford.edu\n\nJure Leskovec\n\njure@cs.stanford.edu\n\nDepartment of Computer Science\n\nStanford University\nStanford, CA, 94305\n\nAbstract\n\nLow-dimensional embeddings of nodes in large graphs have proved extremely\nuseful in a variety of prediction tasks, from content recommendation to identifying\nprotein functions. However, most existing approaches require that all nodes in the\ngraph are present during training of the embeddings; these previous approaches are\ninherently transductive and do not naturally generalize to unseen nodes. 
Here we\npresent GraphSAGE, a general inductive framework that leverages node feature\ninformation (e.g., text attributes) to ef\ufb01ciently generate node embeddings for\npreviously unseen data. Instead of training individual embeddings for each node,\nwe learn a function that generates embeddings by sampling and aggregating features\nfrom a node\u2019s local neighborhood. Our algorithm outperforms strong baselines\non three inductive node-classi\ufb01cation benchmarks: we classify the category of\nunseen nodes in evolving information graphs based on citation and Reddit post\ndata, and we show that our algorithm generalizes to completely unseen graphs\nusing a multi-graph dataset of protein-protein interactions.\n\n1\n\nIntroduction\n\nLow-dimensional vector embeddings of nodes in large graphs1 have proved extremely useful as\nfeature inputs for a wide variety of prediction and graph analysis tasks [5, 11, 28, 35, 36]. The basic\nidea behind node embedding approaches is to use dimensionality reduction techniques to distill the\nhigh-dimensional information about a node\u2019s neighborhood into a dense vector embedding. These\nnode embeddings can then be fed to downstream machine learning systems and aid in tasks such as\nnode classi\ufb01cation, clustering, and link prediction [11, 28, 35].\nHowever, previous works have focused on embedding nodes from a single \ufb01xed graph, and many\nreal-world applications require embeddings to be quickly generated for unseen nodes, or entirely new\n(sub)graphs. This inductive capability is essential for high-throughput, production machine learning\nsystems, which operate on evolving graphs and constantly encounter unseen nodes (e.g., posts on\nReddit, users and videos on Youtube). 
An inductive approach to generating node embeddings also\nfacilitates generalization across graphs with the same form of features: for example, one could train\nan embedding generator on protein-protein interaction graphs derived from a model organism, and\nthen easily produce node embeddings for data collected on new organisms using the trained model.\nThe inductive node embedding problem is especially dif\ufb01cult, compared to the transductive setting,\nbecause generalizing to unseen nodes requires \u201caligning\u201d newly observed subgraphs to the node\nembeddings that the algorithm has already optimized on. An inductive framework must learn to\n\n\u2217The two \ufb01rst authors made equal contributions.\n1While it is common to refer to these data structures as social or biological networks, we use the term graph\n\nto avoid ambiguity with neural network terminology.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fFigure 1: Visual illustration of the GraphSAGE sample and aggregate approach.\n\nrecognize structural properties of a node\u2019s neighborhood that reveal both the node\u2019s local role in the\ngraph, as well as its global position.\nMost existing approaches to generating node embeddings are inherently transductive. The majority\nof these approaches directly optimize the embeddings for each node using matrix-factorization-based\nobjectives, and do not naturally generalize to unseen data, since they make predictions on nodes in a\nsingle, \ufb01xed graph [5, 11, 23, 28, 35, 36, 37, 39]. These approaches can be modi\ufb01ed to operate in an\ninductive setting (e.g., [28]), but these modi\ufb01cations tend to be computationally expensive, requiring\nadditional rounds of gradient descent before new predictions can be made. There are also recent\napproaches to learning over graph structures using convolution operators that offer promise as an\nembedding methodology [17]. 
So far, graph convolutional networks (GCNs) have only been applied\nin the transductive setting with \ufb01xed graphs [17, 18]. In this work we both extend GCNs to the task\nof inductive unsupervised learning and propose a framework that generalizes the GCN approach to\nuse trainable aggregation functions (beyond simple convolutions).\nPresent work. We propose a general framework, called GraphSAGE (SAmple and aggreGatE), for\ninductive node embedding. Unlike embedding approaches that are based on matrix factorization,\nwe leverage node features (e.g., text attributes, node pro\ufb01le information, node degrees) in order to\nlearn an embedding function that generalizes to unseen nodes. By incorporating node features in the\nlearning algorithm, we simultaneously learn the topological structure of each node\u2019s neighborhood\nas well as the distribution of node features in the neighborhood. While we focus on feature-rich\ngraphs (e.g., citation data with text attributes, biological data with functional/molecular markers), our\napproach can also make use of structural features that are present in all graphs (e.g., node degrees).\nThus, our algorithm can also be applied to graphs without node features.\nInstead of training a distinct embedding vector for each node, we train a set of aggregator functions\nthat learn to aggregate feature information from a node\u2019s local neighborhood (Figure 1). Each\naggregator function aggregates information from a different number of hops, or search depth, away\nfrom a given node. At test, or inference time, we use our trained system to generate embeddings for\nentirely unseen nodes by applying the learned aggregation functions. Following previous work on\ngenerating node embeddings, we design an unsupervised loss function that allows GraphSAGE to be\ntrained without task-speci\ufb01c supervision. 
We also show that GraphSAGE can be trained in a fully\nsupervised manner.\nWe evaluate our algorithm on three node-classi\ufb01cation benchmarks, which test GraphSAGE\u2019s ability\nto generate useful embeddings on unseen data. We use two evolving document graphs based on\ncitation data and Reddit post data (predicting paper and post categories, respectively), and a multi-\ngraph generalization experiment based on a dataset of protein-protein interactions (predicting protein\nfunctions). Using these benchmarks, we show that our approach is able to effectively generate\nrepresentations for unseen nodes and outperform relevant baselines by a signi\ufb01cant margin: across\ndomains, our supervised approach improves classi\ufb01cation F1-scores by an average of 51% compared\nto using node features alone and GraphSAGE consistently outperforms a strong, transductive baseline\n[28], despite this baseline taking \u223c100\u00d7 longer to run on unseen nodes. We also show that the new\naggregator architectures we propose provide signi\ufb01cant gains (7.4% on average) compared to an\naggregator inspired by graph convolutional networks [17]. Lastly, we probe the expressive capability\nof our approach and show, through theoretical analysis, that GraphSAGE is capable of learning\nstructural information about a node\u2019s role in a graph, despite the fact that it is inherently based on\nfeatures (Section 5).\n\n2\n\n\f2 Related work\n\nOur algorithm is conceptually related to previous node embedding approaches, general supervised\napproaches to learning over graphs, and recent advancements in applying convolutional neural\nnetworks to graph-structured data.2\nFactorization-based embedding approaches. There are a number of recent node embedding\napproaches that learn low-dimensional embeddings using random walk statistics and matrix\nfactorization-based learning objectives [5, 11, 28, 35, 36]. 
These methods also bear close rela-\ntionships to more classic approaches to spectral clustering [23], multi-dimensional scaling [19],\nas well as the PageRank algorithm [25]. Since these embedding algorithms directly train node\nembeddings for individual nodes, they are inherently transductive and, at the very least, require\nexpensive additional training (e.g., via stochastic gradient descent) to make predictions on new nodes.\nIn addition, for many of these approaches (e.g., [11, 28, 35, 36]) the objective function is invariant\nto orthogonal transformations of the embeddings, which means that the embedding space does not\nnaturally generalize between graphs and can drift during re-training. One notable exception to this\ntrend is the Planetoid-I algorithm introduced by Yang et al. [40], which is an inductive, embedding-\nbased approach to semi-supervised learning. However, Planetoid-I does not use any graph structural\ninformation during inference; instead, it uses the graph structure as a form of regularization during\ntraining. Unlike these previous approaches, we leverage feature information in order to train a model\nto produce embeddings for unseen nodes.\nSupervised learning over graphs. Beyond node embedding approaches, there is a rich literature\non supervised learning over graph-structured data. This includes a wide variety of kernel-based\napproaches, where feature vectors for graphs are derived from various graph kernels (see [32] and\nreferences therein). There are also a number of recent neural network approaches to supervised\nlearning over graph structures [7, 10, 21, 31]. Our approach is conceptually inspired by a number of\nthese algorithms. However, whereas these previous approaches attempt to classify entire graphs (or\nsubgraphs), the focus of this work is generating useful representations for individual nodes.\nGraph convolutional networks. 
In recent years, several convolutional neural network architectures\nfor learning over graphs have been proposed (e.g., [4, 9, 8, 17, 24]). The majority of these methods\ndo not scale to large graphs or are designed for whole-graph classi\ufb01cation (or both) [4, 9, 8, 24].\nHowever, our approach is closely related to the graph convolutional network (GCN), introduced by\nKipf et al. [17, 18]. The original GCN algorithm [17] is designed for semi-supervised learning in a\ntransductive setting, and the exact algorithm requires that the full graph Laplacian is known during\ntraining. A simple variant of our algorithm can be viewed as an extension of the GCN framework to\nthe inductive setting, a point which we revisit in Section 3.3.\n\n3 Proposed method: GraphSAGE\n\nThe key idea behind our approach is that we learn how to aggregate feature information from a\nnode\u2019s local neighborhood (e.g., the degrees or text attributes of nearby nodes). We \ufb01rst describe\nthe GraphSAGE embedding generation (i.e., forward propagation) algorithm, which generates\nembeddings for nodes assuming that the GraphSAGE model parameters are already learned (Section\n3.1). We then describe how the GraphSAGE model parameters can be learned using standard\nstochastic gradient descent and backpropagation techniques (Section 3.2).\n\n3.1 Embedding generation (i.e., forward propagation) algorithm\n\nIn this section, we describe the embedding generation, or forward propagation algorithm (Algorithm\n1), which assumes that the model has already been trained and that the parameters are \ufb01xed. In\nparticular, we assume that we have learned the parameters of K aggregator functions (denoted\nAGGREGATEk,\u2200k \u2208 {1, ..., K}), which aggregate information from node neighbors, as well as a set\nof weight matrices Wk,\u2200k \u2208 {1, ..., K}, which are used to propagate information between different\nlayers of the model or \u201csearch depths\u201d. 
Section 3.2 describes how we train these parameters.

2In the time between this paper's original submission to NIPS 2017 and the submission of the final, accepted (i.e., "camera-ready") version, there have been a number of closely related (e.g., follow-up) works published on pre-print servers. For temporal clarity, we do not review or compare against these papers in detail.

Algorithm 1: GraphSAGE embedding generation (i.e., forward propagation) algorithm
Input: Graph G(V, E); input features {x_v, ∀v ∈ V}; depth K; weight matrices W^k, ∀k ∈ {1, ..., K}; non-linearity σ; differentiable aggregator functions AGGREGATE_k, ∀k ∈ {1, ..., K}; neighborhood function N : v → 2^V
Output: Vector representations z_v for all v ∈ V
1: h^0_v ← x_v, ∀v ∈ V
2: for k = 1...K do
3:     for v ∈ V do
4:         h^k_{N(v)} ← AGGREGATE_k({h^{k-1}_u, ∀u ∈ N(v)})
5:         h^k_v ← σ(W^k · CONCAT(h^{k-1}_v, h^k_{N(v)}))
6:     end
7:     h^k_v ← h^k_v / ||h^k_v||_2, ∀v ∈ V
8: end
9: z_v ← h^K_v, ∀v ∈ V

The intuition behind Algorithm 1 is that at each iteration, or search depth, nodes aggregate information from their local neighbors, and as this process iterates, nodes incrementally gain more and more information from further reaches of the graph.
Algorithm 1 describes the embedding generation process in the case where the entire graph, G = (V, E), and features for all nodes x_v, ∀v ∈ V, are provided as input. We describe how to generalize this to the minibatch setting below.
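As a concrete illustration of Algorithm 1, the following NumPy sketch runs the forward pass on a small dense graph, using an elementwise-mean aggregator and tanh standing in for the nonlinearity σ. All names, sizes, and the toy graph are illustrative choices, not the released implementation.

```python
import numpy as np

def graphsage_forward(features, neighbors, weights, aggregate):
    """Sketch of Algorithm 1: features maps node -> x_v, neighbors maps
    node -> list of neighbor ids, weights[k-1] plays the role of W^k."""
    h = dict(features)                                        # h^0_v = x_v
    for W in weights:                                         # k = 1..K
        h_next = {}
        for v in h:
            h_Nv = aggregate([h[u] for u in neighbors[v]])    # line 4: aggregate neighbors
            z = W @ np.concatenate([h[v], h_Nv])              # line 5: W^k · CONCAT(h_v, h_N(v))
            h_next[v] = np.tanh(z)                            # nonlinearity σ (tanh here)
        # line 7: normalize h^k_v to unit length
        h = {v: hv / np.linalg.norm(hv) for v, hv in h_next.items()}
    return h                                                  # z_v = h^K_v

# Toy run: a triangle graph, 4-d input features, K = 2, hidden size 8.
rng = np.random.default_rng(0)
feats = {v: rng.normal(size=4) for v in range(3)}
nbrs = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
Ws = [rng.normal(size=(8, 8)), rng.normal(size=(8, 16))]      # W^1: 8x(4+4), W^2: 8x(8+8)
mean_agg = lambda vecs: np.mean(vecs, axis=0)
z = graphsage_forward(feats, nbrs, Ws, mean_agg)
```

Each output vector z_v is unit-norm, reflecting the normalization in line 7 of the algorithm.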
Each step in the outer loop of Algorithm 1 proceeds as follows, where k denotes the current step in the outer loop (or the depth of the search) and h^k denotes a node's representation at this step: First, each node v ∈ V aggregates the representations of the nodes in its immediate neighborhood, {h^{k-1}_u, ∀u ∈ N(v)}, into a single vector h^{k-1}_{N(v)}. Note that this aggregation step depends on the representations generated at the previous iteration of the outer loop (i.e., k − 1), and the k = 0 ("base case") representations are defined as the input node features. After aggregating the neighboring feature vectors, GraphSAGE then concatenates the node's current representation, h^{k-1}_v, with the aggregated neighborhood vector, h^{k-1}_{N(v)}, and this concatenated vector is fed through a fully connected layer with nonlinear activation function σ, which transforms the representations to be used at the next step of the algorithm (i.e., h^k_v, ∀v ∈ V). For notational convenience, we denote the final representations output at depth K as z_v ≡ h^K_v, ∀v ∈ V. The aggregation of the neighbor representations can be done by a variety of aggregator architectures (denoted by the AGGREGATE placeholder in Algorithm 1), and we discuss different architecture choices in Section 3.3 below.
To extend Algorithm 1 to the minibatch setting, given a set of input nodes, we first forward sample the required neighborhood sets (up to depth K) and then we run the inner loop (line 3 in Algorithm 1), but instead of iterating over all nodes, we compute only the representations that are necessary to satisfy the recursion at each depth (Appendix A contains complete minibatch pseudocode).
Relation to the Weisfeiler-Lehman Isomorphism Test. The GraphSAGE algorithm is conceptually inspired by a classic algorithm for testing graph isomorphism.
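The forward-sampling step of the minibatch extension can be sketched as follows. This is a simplified stand-in for the pseudocode in Appendix A, and fixed-size uniform sampling with replacement is just one simple choice; the exact scheme in the paper's implementation may differ.

```python
import random

def sample_neighborhoods(batch, neighbors, sample_sizes):
    """Collect, depth by depth, the node sets needed to embed `batch`.
    sample_sizes[k] is the fixed fan-out drawn at that depth."""
    layers = [set(batch)]                  # the target nodes themselves
    for S in sample_sizes:
        frontier = set(layers[-1])         # keep earlier nodes: the recursion needs them
        for v in layers[-1]:
            # fixed-size uniform draw (with replacement) from N(v)
            frontier.update(random.choices(neighbors[v], k=S))
        layers.append(frontier)
    return layers

random.seed(0)
nbrs = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}
layers = sample_neighborhoods([0], nbrs, sample_sizes=[2, 2])
```

The returned layers grow monotonically: layers[k] contains every node whose representation is needed at depth k of the recursion.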
If, in Algorithm 1, we (i) set K = |V|, (ii) set the weight matrices as the identity, and (iii) use an appropriate hash function as an aggregator (with no non-linearity), then Algorithm 1 is an instance of the Weisfeiler-Lehman (WL) isomorphism test, also known as "naive vertex refinement" [32]. If the sets of representations {z_v, ∀v ∈ V} output by Algorithm 1 for two subgraphs are identical then the WL test declares the two subgraphs to be isomorphic. This test is known to fail in some cases, but is valid for a broad class of graphs [32]. GraphSAGE is a continuous approximation to the WL test, where we replace the hash function with trainable neural network aggregators. Of course, we use GraphSAGE to generate useful node representations—not to test graph isomorphism. Nevertheless, the connection between GraphSAGE and the classic WL test provides theoretical context for our algorithm design to learn the topological structure of node neighborhoods.
Neighborhood definition. In this work, we uniformly sample a fixed-size set of neighbors, instead of using full neighborhood sets in Algorithm 1, in order to keep the computational footprint of each batch fixed.3 That is, using overloaded notation, we define N(v) as a fixed-size, uniform draw from the set {u ∈ V : (u, v) ∈ E}, and we draw different uniform samples at each iteration, k, in Algorithm 1. Without this sampling, the memory and expected runtime of a single batch are unpredictable and in the worst case O(|V|). In contrast, the per-batch space and time complexity for GraphSAGE is fixed at O(∏_{i=1}^{K} S_i), where S_i, i ∈ {1, ..., K} and K are user-specified constants. Practically speaking, we found that our approach could achieve high performance with K = 2 and S_1 · S_2 ≤ 500 (see Section 4.4 for details).

3.2 Learning the parameters of GraphSAGE

In order to learn useful, predictive representations in a fully unsupervised setting, we apply a graph-based loss function to the output representations, z_u, ∀u ∈ V, and tune the weight matrices, W^k, ∀k ∈ {1, ..., K}, and parameters of the aggregator functions via stochastic gradient descent. The graph-based loss function encourages nearby nodes to have similar representations, while enforcing that the representations of disparate nodes are highly distinct:

J_G(z_u) = −log(σ(z_u^⊤ z_v)) − Q · E_{v_n ∼ P_n(v)} log(σ(−z_u^⊤ z_{v_n})),   (1)

where v is a node that co-occurs near u on a fixed-length random walk, σ is the sigmoid function, P_n is a negative sampling distribution, and Q defines the number of negative samples. Importantly, unlike previous embedding approaches, the representations z_u that we feed into this loss function are generated from the features contained within a node's local neighborhood, rather than training a unique embedding for each node (via an embedding look-up).
This unsupervised setting emulates situations where node features are provided to downstream machine learning applications, as a service or in a static repository. In cases where representations are to be used only on a specific downstream task, the unsupervised loss (Equation 1) can simply be replaced, or augmented, by a task-specific objective (e.g., cross-entropy loss).

3.3 Aggregator Architectures

Unlike machine learning over N-D lattices (e.g., sentences, images, or 3-D volumes), a node's neighbors have no natural ordering; thus, the aggregator functions in Algorithm 1 must operate over an unordered set of vectors.
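Equation (1) is straightforward to write down numerically. The sketch below estimates the expectation over P_n with the Q sampled negatives; the helper names are our own, not from the paper's released code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def unsupervised_loss(z_u, z_v, z_negs):
    """Eq. (1): -log σ(z_u·z_v) - Q · E[log σ(-z_u·z_vn)], with the
    expectation over P_n estimated by the Q sampled negatives z_negs."""
    Q = len(z_negs)
    positive = -np.log(sigmoid(z_u @ z_v))
    negative = -Q * np.mean([np.log(sigmoid(-z_u @ z_n)) for z_n in z_negs])
    return positive + negative

z = np.array([1.0, 0.0])
loss_aligned = unsupervised_loss(z, z, [-z])   # positive pair agrees, negative repels
loss_flipped = unsupervised_loss(z, -z, [z])   # the reverse: loss should be larger
```

As intended, the loss is lower when the random-walk co-occurring pair's embeddings align and the negative samples' embeddings point away.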
Ideally, an aggregator function would be symmetric (i.e., invariant to permutations of its inputs) while still being trainable and maintaining high representational capacity. The symmetry property of the aggregation function ensures that our neural network model can be trained and applied to arbitrarily ordered node neighborhood feature sets. We examined three candidate aggregator functions:
Mean aggregator. Our first candidate aggregator function is the mean operator, where we simply take the elementwise mean of the vectors in {h^{k-1}_u, ∀u ∈ N(v)}. The mean aggregator is nearly equivalent to the convolutional propagation rule used in the transductive GCN framework [17]. In particular, we can derive an inductive variant of the GCN approach by replacing lines 4 and 5 in Algorithm 1 with the following:4

h^k_v ← σ(W · MEAN({h^{k-1}_v} ∪ {h^{k-1}_u, ∀u ∈ N(v)})).   (2)

We call this modified mean-based aggregator convolutional since it is a rough, linear approximation of a localized spectral convolution [17]. An important distinction between this convolutional aggregator and our other proposed aggregators is that it does not perform the concatenation operation in line 5 of Algorithm 1—i.e., the convolutional aggregator does not concatenate the node's previous layer representation h^{k-1}_v with the aggregated neighborhood vector h^k_{N(v)}. This concatenation can be viewed as a simple form of a "skip connection" [13] between the different "search depths", or "layers", of the GraphSAGE algorithm, and it leads to significant gains in performance (Section 4).
LSTM aggregator. We also examined a more complex aggregator based on an LSTM architecture [14].
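A minimal sketch of the replacement rule in Equation (2), with tanh standing in for σ (an illustrative choice):

```python
import numpy as np

def gcn_mean_step(h_v, h_neighbors, W):
    """Eq. (2): h^k_v = σ(W · MEAN({h^{k-1}_v} ∪ {h^{k-1}_u, ∀u ∈ N(v)})).
    The node's own vector simply joins the mean; there is no CONCAT,
    so the skip connection of Algorithm 1, line 5, is absent."""
    pooled = np.vstack([h_v] + list(h_neighbors)).mean(axis=0)
    return np.tanh(W @ pooled)

# Toy check: a node with two all-zero neighbors and identity weights.
h_v = np.ones(3)
out = gcn_mean_step(h_v, [np.zeros(3), np.zeros(3)], np.eye(3))
```

Because the self vector is averaged in rather than concatenated, its influence shrinks as the (sampled) neighborhood grows, which is one intuition for why the concatenating variants perform better.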
Compared to the mean aggregator, LSTMs have the advantage of larger expressive capability. However, it is important to note that LSTMs are not inherently symmetric (i.e., they are not permutation invariant), since they process their inputs in a sequential manner. We adapt LSTMs to operate on an unordered set by simply applying the LSTMs to a random permutation of the node's neighbors.

3Exploring non-uniform samplers is an important direction for future work.
4Note that this differs from Kipf et al.'s exact equation by a minor normalization constant [17].

Pooling aggregator. The final aggregator we examine is both symmetric and trainable. In this pooling approach, each neighbor's vector is independently fed through a fully-connected neural network; following this transformation, an elementwise max-pooling operation is applied to aggregate information across the neighbor set:

AGGREGATE^pool_k = max({σ(W_pool h^k_{u_i} + b), ∀u_i ∈ N(v)}),   (3)

where max denotes the element-wise max operator and σ is a nonlinear activation function. In principle, the function applied before the max pooling can be an arbitrarily deep multi-layer perceptron, but we focus on simple single-layer architectures in this work. This approach is inspired by recent advancements in applying neural network architectures to learn over general point sets [29]. Intuitively, the multi-layer perceptron can be thought of as a set of functions that compute features for each of the node representations in the neighbor set. By applying the max-pooling operator to each of the computed features, the model effectively captures different aspects of the neighborhood set. Note also that, in principle, any symmetric vector function could be used in place of the max operator (e.g., an element-wise mean).
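Equation (3) in NumPy form; the single-layer transform and tanh activation are illustrative stand-ins, and the final assertion checks the permutation invariance discussed above:

```python
import numpy as np

def pool_aggregate(h_neighbors, W_pool, b):
    """Eq. (3): elementwise max over σ(W_pool · h_u + b), one term per
    neighbor. The max makes the output order-independent (symmetric)."""
    transformed = np.vstack([np.tanh(W_pool @ h + b) for h in h_neighbors])
    return transformed.max(axis=0)

rng = np.random.default_rng(1)
W_pool, b = rng.normal(size=(4, 3)), rng.normal(size=4)
neigh = [rng.normal(size=3) for _ in range(5)]
out = pool_aggregate(neigh, W_pool, b)
# Reversing the neighbor order leaves the aggregate unchanged.
assert np.allclose(out, pool_aggregate(neigh[::-1], W_pool, b))
```

By contrast, feeding the same two orderings to a sequence model such as an LSTM would generally give different outputs, which is why the LSTM aggregator relies on random permutations instead.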
We found no significant difference between max- and mean-pooling in development tests and thus focused on max-pooling for the rest of our experiments.

4 Experiments

We test the performance of GraphSAGE on three benchmark tasks: (i) classifying academic papers into different subjects using the Web of Science citation dataset, (ii) classifying Reddit posts as belonging to different communities, and (iii) classifying protein functions across various biological protein-protein interaction (PPI) graphs. Sections 4.1 and 4.2 summarize the datasets, and the supplementary material contains additional information. In all these experiments, we perform predictions on nodes that are not seen during training, and, in the case of the PPI dataset, we test on entirely unseen graphs.
Experimental set-up. To contextualize the empirical results on our inductive benchmarks, we compare against four baselines: a random classifier, a logistic regression feature-based classifier (that ignores graph structure), the DeepWalk algorithm [28] as a representative factorization-based approach, and a concatenation of the raw features and DeepWalk embeddings. We also compare four variants of GraphSAGE that use the different aggregator functions (Section 3.3). Since the "convolutional" variant of GraphSAGE is an extended, inductive version of Kipf et al.'s semi-supervised GCN [17], we term this variant GraphSAGE-GCN. We test unsupervised variants of GraphSAGE trained according to the loss in Equation (1), as well as supervised variants that are trained directly on classification cross-entropy loss. For all the GraphSAGE variants we used rectified linear units as the non-linearity and set K = 2 with neighborhood sample sizes S_1 = 25 and S_2 = 10 (see Section 4.4 for sensitivity analyses).
For the Reddit and citation datasets, we use "online" training for DeepWalk as described in Perozzi et al.
[28], where we run a new round of SGD optimization to embed the new test nodes before making predictions (see the Appendix for details). In the multi-graph setting, we cannot apply DeepWalk, since the embedding spaces generated by running the DeepWalk algorithm on different disjoint graphs can be arbitrarily rotated with respect to each other (Appendix D).
All models were implemented in TensorFlow [1] with the Adam optimizer [16] (except DeepWalk, which performed better with the vanilla gradient descent optimizer). We designed our experiments with the goals of (i) verifying the improvement of GraphSAGE over the baseline approaches (i.e., raw features and DeepWalk) and (ii) providing a rigorous comparison of the different GraphSAGE aggregator architectures. In order to provide a fair comparison, all models share an identical implementation of their minibatch iterators, loss function and neighborhood sampler (when applicable). Moreover, in order to guard against unintentional "hyperparameter hacking" in the comparisons between GraphSAGE aggregators, we sweep over the same set of hyperparameters for all GraphSAGE variants (choosing the best setting for each variant according to performance on a validation set). The set of possible hyperparameter values was determined on early validation tests using subsets of the citation and Reddit data that we then discarded from our analyses. The appendix contains further implementation details.5

5Code and links to the datasets: http://snap.stanford.edu/graphsage/

Table 1: Prediction results for the three datasets (micro-averaged F1 scores). Results for unsupervised and fully supervised GraphSAGE are shown. Analogous trends hold for macro-averaged scores.

                         Citation            Reddit              PPI
Name                 Unsup. F1  Sup. F1  Unsup. F1  Sup. F1  Unsup. F1  Sup. F1
Random                 0.206     0.206     0.043     0.042     0.396     0.396
Raw features           0.575     0.575     0.585     0.585     0.422     0.422
DeepWalk               0.565     0.565     0.324     0.324       —         —
DeepWalk + features    0.701     0.701     0.691     0.691       —         —
GraphSAGE-GCN          0.742     0.772     0.908     0.930     0.465     0.500
GraphSAGE-mean         0.778     0.820     0.897     0.950     0.486     0.598
GraphSAGE-LSTM         0.788     0.832     0.907     0.954     0.482     0.612
GraphSAGE-pool         0.798     0.839     0.892     0.948     0.502     0.600
% gain over feat.       39%       46%       55%       63%       19%       45%

Figure 2: A: Timing experiments on Reddit data, with training batches of size 512 and inference on the full test set (79,534 nodes). B: Model performance with respect to the size of the sampled neighborhood, where the "neighborhood sample size" refers to the number of neighbors sampled at each depth for K = 2 with S_1 = S_2 (on the citation data using GraphSAGE-mean).

4.1 Inductive learning on evolving graphs: Citation and Reddit data

Our first two experiments are on classifying nodes in evolving information graphs, a task that is especially relevant to high-throughput production systems, which constantly encounter unseen data.
Citation data. Our first task is predicting paper subject categories on a large citation dataset. We use an undirected citation graph dataset derived from the Thomson Reuters Web of Science Core Collection, corresponding to all papers in six biology-related fields for the years 2000-2005. The node labels for this dataset correspond to the six different field labels. In total, this dataset contains 302,424 nodes with an average degree of 9.15. We train all the algorithms on the 2000-2004 data and use the 2005 data for testing (with 30% used for validation). For features, we used node degrees and processed the paper abstracts according to Arora et al.'s [2] sentence embedding approach, with 300-dimensional word vectors trained using the GenSim word2vec implementation [30].
Reddit data. In our second task, we predict which community different Reddit posts belong to. Reddit is a large online discussion forum where users post and comment on content in different topical communities. We constructed a graph dataset from Reddit posts made in the month of September, 2014. The node label in this case is the community, or "subreddit", that a post belongs to. We sampled 50 large communities and built a post-to-post graph, connecting posts if the same user comments on both. In total this dataset contains 232,965 posts with an average degree of 492. We use the first 20 days for training and the remaining days for testing (with 30% used for validation). For features, we use off-the-shelf 300-dimensional GloVe CommonCrawl word vectors [27]; for each post, we concatenated (i) the average embedding of the post title, (ii) the average embedding of all the post's comments, (iii) the post's score, and (iv) the number of comments made on the post.
The first four columns of Table 1 summarize the performance of GraphSAGE as well as the baseline approaches on these two datasets. We find that GraphSAGE outperforms all the baselines by a significant margin, and the trainable, neural network aggregators provide significant gains compared to the GCN approach. For example, the unsupervised variant GraphSAGE-pool outperforms the concatenation of the DeepWalk embeddings and the raw features by 13.8% on the citation data and 29.1% on the Reddit data, while the supervised version provides a gain of 19.7% and 37.2%, respectively. Interestingly, the LSTM based aggregator shows strong performance, despite the fact that it is designed for sequential data and not unordered sets.
Lastly, we see that the performance of\nunsupervised GraphSAGE is reasonably competitive with the fully supervised version, indicating\nthat our framework can achieve strong performance without task-speci\ufb01c \ufb01ne-tuning.\n\n4.2 Generalizing across graphs: Protein-protein interactions\n\nWe now consider the task of generalizing across graphs, which requires learning about node roles\nrather than community structure. We classify protein roles\u2014in terms of their cellular functions from\ngene ontology\u2014in various protein-protein interaction (PPI) graphs, with each graph corresponding\nto a different human tissue [41]. We use positional gene sets, motif gene sets and immunological\nsignatures as features and gene ontology sets as labels (121 in total), collected from the Molecular\nSignatures Database [34]. The average graph contains 2373 nodes, with an average degree of 28.8.\nWe train all algorithms on 20 graphs and then average prediction F1 scores on two test graphs (with\ntwo other graphs used for validation).\nThe \ufb01nal two columns of Table 1 summarize the accuracies of the various approaches on this\ndata. Again we see that GraphSAGE signi\ufb01cantly outperforms the baseline approaches, with the\nLSTM- and pooling-based aggregators providing substantial gains over the mean- and GCN-based\naggregators.6\n\n4.3 Runtime and parameter sensitivity\n\nFigure 2.A summarizes the training and test runtimes for the different approaches. The training time\nfor the methods are comparable (with GraphSAGE-LSTM being the slowest). 
However, the need to sample new random walks and run new rounds of SGD to embed unseen nodes makes DeepWalk 100-500× slower at test time.

For the GraphSAGE variants, we found that setting K = 2 provided a consistent boost in accuracy of around 10-15%, on average, compared to K = 1; however, increasing K beyond 2 gave marginal returns in performance (0-5%) while increasing the runtime by a prohibitively large factor of 10-100×, depending on the neighborhood sample size. We also found diminishing returns for sampling large neighborhoods (Figure 2.B). Thus, despite the higher variance induced by sub-sampling neighborhoods, GraphSAGE is still able to maintain strong predictive accuracy, while significantly improving the runtime.

4.4 Summary comparison between the different aggregator architectures

Overall, we found that the LSTM- and pool-based aggregators performed the best, in terms of both average performance and the number of experimental settings in which they were the top-performing method (Table 1). To give more quantitative insight into these trends, we consider each of the six different experimental settings (i.e., (3 datasets) × (unsupervised vs. supervised)) as trials and consider which performance trends are likely to generalize. In particular, we use the non-parametric Wilcoxon Signed-Rank Test [33] to quantify the differences between the different aggregators across trials, reporting the T-statistic and p-value where applicable. Note that this method is rank-based and essentially tests whether we would expect one particular approach to outperform another in a new experimental setting.
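This pairwise comparison procedure can be sketched with scipy. The scores below are illustrative placeholders standing in for one aggregator's and another's results across the six settings, not the paper's Table 1 values.

```python
from scipy.stats import wilcoxon

# Hypothetical F1 scores for two aggregators across the six experimental
# settings, i.e., (3 datasets) x (unsupervised vs. supervised).
pool_scores = [0.600, 0.708, 0.736, 0.765, 0.612, 0.548]
gcn_scores  = [0.465, 0.624, 0.601, 0.708, 0.500, 0.465]

# Paired, rank-based test of whether one aggregator tends to outperform
# the other in a new experimental setting.
stat, p = wilcoxon(pool_scores, gcn_scores)
print(f"T = {stat}, p = {p:.3f}")
```

Because the test only uses the signs and ranks of the paired differences, it makes no normality assumption, which is appropriate for so few trials; the trade-off is low statistical power at n = 6, as noted below.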
Given our small sample size of only 6 different settings, this significance test is somewhat underpowered; nonetheless, the T-statistic and associated p-values are useful quantitative measures to assess the aggregators' relative performances.
We see that the LSTM-, pool-, and mean-based aggregators all provide statistically significant gains over the GCN-based approach (T = 1.0, p = 0.02 for all three). However, the gains of the LSTM and pool approaches over the mean-based aggregator are more marginal (T = 1.5, p = 0.03, comparing LSTM to mean; T = 4.5, p = 0.10, comparing pool to mean). There is no significant difference between the LSTM and pool approaches (T = 10.0, p = 0.46). However, GraphSAGE-LSTM is significantly slower than GraphSAGE-pool (by a factor of ≈2×), perhaps giving the pooling-based aggregator a slight edge overall.

6Note that in very recent follow-up work Chen and Zhu [6] achieve superior performance by optimizing the GraphSAGE hyperparameters specifically for the PPI task and implementing new training techniques (e.g., dropout, layer normalization, and a new sampling scheme). We refer the reader to their work for the current state-of-the-art numbers on the PPI dataset that are possible using a variant of the GraphSAGE approach.

5 Theoretical analysis

In this section, we probe the expressive capabilities of GraphSAGE in order to provide insight into how GraphSAGE can learn about graph structure, even though it is inherently based on features. As a case-study, we consider whether GraphSAGE can learn to predict the clustering coefficient of a node, i.e., the proportion of triangles that are closed within the node's 1-hop neighborhood [38]. The clustering coefficient is a popular measure of how clustered a node's local neighborhood is, and it serves as a building block for many more complicated structural motifs [3].
We can show that Algorithm 1 is capable of approximating clustering coefficients to an arbitrary degree of precision:

Theorem 1. Let x_v ∈ U, ∀v ∈ V denote the feature inputs for Algorithm 1 on graph G = (V, E), where U is any compact subset of R^d. Suppose that there exists a fixed positive constant C ∈ R+ such that ‖x_v − x_{v'}‖_2 > C for all pairs of nodes. Then we have that ∀ε > 0 there exists a parameter setting Θ* for Algorithm 1 such that after K = 4 iterations

|z_v − c_v| < ε, ∀v ∈ V,

where z_v ∈ R are the final output values generated by Algorithm 1 and c_v are the node clustering coefficients.

Theorem 1 states that for any graph there exists a parameter setting for Algorithm 1 such that it can approximate clustering coefficients in that graph to an arbitrary precision, if the features for every node are distinct (and if the model is sufficiently high-dimensional). The full proof of Theorem 1 is in the Appendix. Note that as a corollary of Theorem 1, GraphSAGE can learn about local graph structure, even when the node feature inputs are sampled from an absolutely continuous random distribution (see the Appendix for details). The basic idea behind the proof is that if each node has a unique feature representation, then we can learn to map nodes to indicator vectors and identify node neighborhoods. The proof of Theorem 1 relies on some properties of the pooling aggregator, which also provides insight into why GraphSAGE-pool outperforms the GCN and mean-based aggregators.

6 Conclusion

We introduced a novel approach that allows embeddings to be efficiently generated for unseen nodes. GraphSAGE consistently outperforms state-of-the-art baselines, effectively trades off performance and runtime by sampling node neighborhoods, and our theoretical analysis provides insight into how our approach can learn about local graph structures.
A number of extensions and potential improvements are possible, such as extending GraphSAGE to incorporate directed or multi-modal graphs. A particularly interesting direction for future work is exploring non-uniform neighborhood sampling functions, and perhaps even learning these functions as part of the GraphSAGE optimization.

Acknowledgments

The authors thank Austin Benson, Aditya Grover, Bryan He, Dan Jurafsky, Alex Ratner, Marinka Zitnik, and Daniel Selsam for their helpful discussions and comments on early drafts. The authors would also like to thank Ben Johnson for his many useful questions and comments on our code. This research has been supported in part by NSF IIS-1149837, DARPA SIMPLEX, Stanford Data Science Initiative, Huawei, and Chan Zuckerberg Biohub. W.L.H. was also supported by the SAP Stanford Graduate Fellowship and an NSERC PGS-D grant. The views and conclusions expressed in this material are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the above funding agencies, corporations, or the U.S. and Canadian governments.

References

[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint, 2016.

[2] S. Arora, Y. Liang, and T. Ma. A simple but tough-to-beat baseline for sentence embeddings. In ICLR, 2017.

[3] A. R. Benson, D. F. Gleich, and J. Leskovec. Higher-order organization of complex networks. Science, 353(6295):163–166, 2016.

[4] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun. Spectral networks and locally connected networks on graphs. In ICLR, 2014.

[5] S. Cao, W. Lu, and Q. Xu. GraRep: Learning graph representations with global structural information. In KDD, 2015.

[6] J. Chen and J. Zhu.
Stochastic training of graph convolutional networks. arXiv preprint arXiv:1710.10568, 2017.

[7] H. Dai, B. Dai, and L. Song. Discriminative embeddings of latent variable models for structured data. In ICML, 2016.

[8] M. Defferrard, X. Bresson, and P. Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In NIPS, 2016.

[9] D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams. Convolutional networks on graphs for learning molecular fingerprints. In NIPS, 2015.

[10] M. Gori, G. Monfardini, and F. Scarselli. A new model for learning in graph domains. In IEEE International Joint Conference on Neural Networks, volume 2, pages 729–734, 2005.

[11] A. Grover and J. Leskovec. node2vec: Scalable feature learning for networks. In KDD, 2016.

[12] W. L. Hamilton, J. Leskovec, and D. Jurafsky. Diachronic word embeddings reveal statistical laws of semantic change. In ACL, 2016.

[13] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In ECCV, 2016.

[14] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[15] K. Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251–257, 1991.

[16] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.

[17] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2016.

[18] T. N. Kipf and M. Welling. Variational graph auto-encoders. In NIPS Workshop on Bayesian Deep Learning, 2016.

[19] J. B. Kruskal. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29(1):1–27, 1964.

[20] O. Levy and Y. Goldberg. Neural word embedding as implicit matrix factorization. In NIPS, 2014.

[21] Y. Li, D. Tarlow, M.
Brockschmidt, and R. Zemel. Gated graph sequence neural networks. In ICLR, 2015.

[22] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In NIPS, 2013.

[23] A. Y. Ng, M. I. Jordan, Y. Weiss, et al. On spectral clustering: Analysis and an algorithm. In NIPS, 2001.

[24] M. Niepert, M. Ahmed, and K. Kutzkov. Learning convolutional neural networks for graphs. In ICML, 2016.

[25] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab, 1999.

[26] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[27] J. Pennington, R. Socher, and C. D. Manning. GloVe: Global vectors for word representation. In EMNLP, 2014.

[28] B. Perozzi, R. Al-Rfou, and S. Skiena. DeepWalk: Online learning of social representations. In KDD, 2014.

[29] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In CVPR, 2017.

[30] R. Řehůřek and P. Sojka. Software framework for topic modelling with large corpora. In LREC, 2010.

[31] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2009.

[32] N. Shervashidze, P. Schweitzer, E. J. v. Leeuwen, K. Mehlhorn, and K. M. Borgwardt. Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research, 12:2539–2561, 2011.

[33] S. Siegel. Nonparametric statistics for the behavioral sciences. McGraw-Hill, 1956.

[34] A. Subramanian, P. Tamayo, V. K. Mootha, S.
Mukherjee, B. L. Ebert, M. A. Gillette, A. Paulovich, S. L. Pomeroy, T. R. Golub, E. S. Lander, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences, 102(43):15545–15550, 2005.

[35] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei. LINE: Large-scale information network embedding. In WWW, 2015.

[36] D. Wang, P. Cui, and W. Zhu. Structural deep network embedding. In KDD, 2016.

[37] X. Wang, P. Cui, J. Wang, J. Pei, W. Zhu, and S. Yang. Community preserving network embedding. In AAAI, 2017.

[38] D. J. Watts and S. H. Strogatz. Collective dynamics of 'small-world' networks. Nature, 393(6684):440–442, 1998.

[39] L. Xu, X. Wei, J. Cao, and P. S. Yu. Embedding identity and interest for social networks. In WWW, 2017.

[40] Z. Yang, W. Cohen, and R. Salakhutdinov. Revisiting semi-supervised learning with graph embeddings. In ICML, 2016.

[41] M. Zitnik and J. Leskovec. Predicting multicellular function through multi-layer tissue networks. Bioinformatics, 33(14):190–198, 2017.