{"title": "Watch Your Step: Learning Node Embeddings via Graph Attention", "book": "Advances in Neural Information Processing Systems", "page_first": 9180, "page_last": 9190, "abstract": "Graph embedding methods represent nodes in a continuous vector space,\npreserving different types of relational information from the graph.\nThere are many hyper-parameters to these methods (e.g. the length of a random walk) which have to be manually tuned for every graph.\nIn this paper, we replace previously fixed hyper-parameters with trainable ones that we automatically learn via backpropagation. \nIn particular, we propose a novel attention model on the power series of the transition matrix, which guides the random walk to optimize an upstream objective.\nUnlike previous approaches to attention models, the method that we propose utilizes attention parameters exclusively on the data itself (e.g. on the random walk), and are not used by the model for inference.\nWe experiment on link prediction tasks, as we aim to produce embeddings that best-preserve the graph structure, generalizing to unseen information. 
\nWe improve state-of-the-art results on a comprehensive suite of real-world graph datasets including social, collaboration, and biological networks, where we observe that our graph attention model can reduce the error by up to 20\\%-40\\%.\nWe show that our automatically-learned attention parameters can vary significantly per graph, and correspond to the optimal choice of hyper-parameter if we manually tune existing methods.", "full_text": "Watch Your Step:\n\nLearning Node Embeddings via Graph Attention\n\nSami Abu-El-Haija\u2217\n\nInformation Sciences Institute,\nUniversity of Southern California\n\nhaija@isi.edu\n\nBryan Perozzi\n\nGoogle AI\n\nNew York City, NY\nbperozzi@acm.org\n\nRami Al-Rfou\n\nGoogle AI\n\nMountain View, CA\nrmyeid@google.com\n\nAlex Alemi\nGoogle AI\n\nMountain View, CA\nalemi@google.com\n\nAbstract\n\nGraph embedding methods represent nodes in a continuous vector space, preserving\ndifferent types of relational information from the graph. There are many hyper-\nparameters to these methods (e.g. the length of a random walk) which have to be\nmanually tuned for every graph. In this paper, we replace previously \ufb01xed hyper-\nparameters with trainable ones that we automatically learn via backpropagation. In\nparticular, we propose a novel attention model on the power series of the transition\nmatrix, which guides the random walk to optimize an upstream objective. Unlike\nprevious approaches to attention models, the method that we propose utilizes\nattention parameters exclusively on the data itself (e.g. on the random walk), and\nare not used by the model for inference. We experiment on link prediction tasks, as\nwe aim to produce embeddings that best-preserve the graph structure, generalizing\nto unseen information. 
We improve state-of-the-art results on a comprehensive suite of real-world graph datasets including social, collaboration, and biological networks, where we observe that our graph attention model can reduce the error by up to 20%-40%. We show that our automatically-learned attention parameters can vary significantly per graph, and correspond to the optimal choice of hyper-parameter if we manually tune existing methods.

1 Introduction

Unsupervised graph embedding methods seek to learn representations that encode the graph structure. These embeddings have demonstrated outstanding performance on a number of tasks including node classification [29, 15], knowledge-base completion [24], semi-supervised learning [37], and link prediction [2]. In general, as introduced by Perozzi et al. [29], these methods operate in two discrete steps: first, they sample pair-wise relationships from the graph through random walks, counting node co-occurrences; second, they train an embedding model, e.g. using the Skipgram objective of word2vec [25], to learn representations that encode pairwise node similarities.
While such methods have demonstrated positive results on a number of tasks, their performance can vary significantly based on the setting of their hyper-parameters. For example, [29] observed that the quality of learned representations depends on the length of the random walk (C). In practice, DeepWalk [29] and many of its extensions [e.g. 15] use word2vec implementations [25].

∗Work done while at Google AI.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Accordingly, it has been revealed by [21] that the hyper-parameter C, referred to as the training window length in word2vec [25], actually controls more than a fixed length of the random walk. 
Instead, it parameterizes a function, which we term the context distribution and denote Q, controlling the probability of sampling a node-pair when visited within a specific distance². Implicitly, the choices of C and Q create a weight mass on every node's neighborhood. In general, the weight is higher on nearby nodes, but the specific form of the mass function is determined by the aforementioned hyper-parameters. In this work, we aim to replace these hyper-parameters with trainable parameters, so that they can be automatically learned for each graph. To do so, we pose graph embedding as end-to-end learning, where the two (discrete) steps of random walk co-occurrence sampling, followed by representation learning, are joined using a closed-form expectation over the graph adjacency matrix.
Our inspiration comes from the successful application of attention models in domains such as Natural Language Processing (NLP) [e.g. 4, 38], image recognition [26], and detecting rare events in videos [31]. To the best of our knowledge, the approach we propose is significantly different from the standard application of attention models. Instead of using attention parameters to guide the model where to look when making a prediction, we use attention parameters to guide our learning algorithm to focus on the parts of the data that are most helpful for optimizing an upstream objective.
We show the mathematical equivalence between the context distribution and the coefficients of the power series of the transition matrix. This allows us to learn the context distribution by learning an attention model on the power series. The attention parameters "guide" the random walk, by allowing it to focus more on short- or long-term dependencies, as best suited for the graph, while optimizing an upstream objective. 
To the best of our knowledge, this work is the first application of attention methods to graph embedding.
Specifically, our contributions are the following:

1. We propose an extendible family of graph attention models that can learn arbitrary (e.g. non-monotonic) context distributions.
2. We show that the optimal choice of context distribution hyper-parameters for competing methods, found by manual tuning, agrees with our automatically-found attention parameters.
3. We evaluate on a number of challenging link prediction tasks comprising real-world datasets, including social, collaboration, and biological networks. Experiments show we substantially improve on our baselines, reducing link-prediction error by 20%-40%.

2 Preliminaries

2.1 Graph Embeddings

Given an unweighted graph G = (V, E), its (sparse) adjacency matrix A ∈ {0, 1}^{|V|×|V|} can be constructed according to A_vu = 1[(v, u) ∈ E], where the indicator function 1[.] evaluates to 1 iff its boolean argument is true. In general, graph embedding methods minimize an objective:

min_Y L(f(A), g(Y)),    (1)

where Y ∈ R^{|V|×d} is a d-dimensional node embedding dictionary; f : R^{|V|×|V|} → R^{|V|×|V|} is a transformation of the adjacency matrix; g : R^{|V|×d} → R^{|V|×|V|} is a pairwise edge function; and L : R^{|V|×|V|} × R^{|V|×|V|} → R is a loss function.
Many popular embedding methods can be viewed in this light. 
For instance, a stochastic³ version of Singular Value Decomposition (SVD) is an embedding method, and can be cast into our framework by setting f(A) = A; decomposing Y into two halves, the left and right representations⁴, as Y = [L|R] with L, R ∈ R^{|V|×d/2}; then setting g to their outer product g(Y) = g([L|R]) = L × R^⊤; and finally setting L to the Frobenius norm of the error, yielding:

min_{L,R} ||A − L × R^⊤||_F.

²To clarify, as noted by Levy et al. [21]: studying the implementation of word2vec reveals that rather than using C as a constant and assuming all nodes visited within distance C are related, a desired context distance c_i is sampled uniformly (c_i ∼ U{1, C}) for each node pair i in training. If the node pair i was visited more than c_i steps apart, it is not used for training. Many DeepWalk-style methods inherited this context distribution, as they internally utilize standard word2vec implementations.
³Orthonormality constraints are not shown.
⁴Also known in NLP [25] as the "input" and "output" embedding representations.

2.2 Learning Embeddings via Random Walks

Introduced by [29], this family of methods [incl. 15, 19, 30, 10] induces random walks along E by starting from a random node v_0 ∈ sample(V), and repeatedly sampling an edge to transition to the next node as v_{i+1} := sample(N[v_i]), where N[v_i] are the outgoing edges from v_i. The transition sequences v_0 → v_1 → v_2 → ... (i.e. random walks) can then be passed to the word2vec algorithm, which learns embeddings by stochastically taking every node v_i along the sequence as an anchor, and bringing the embedding representation of this anchor node v_i closer to the embeddings of its next neighbors {v_{i+1}, v_{i+2}, ..., v_{i+c}}, the context nodes. In practice, the context window size c is sampled from a distribution, e.g. uniform U{1, C}, as explained in [21]. 
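As a concrete illustration of this sampling scheme, the following short numpy sketch simulates walks and counts co-occurrences with a per-anchor context size c ∼ U{1, C}. The function name and structure are ours, for illustration only; this is not the authors' released implementation:

```python
import numpy as np

def simulate_cooccurrences(A, walk_len=40, walks_per_node=10, C=5, seed=0):
    """Build a co-occurrence matrix D by simulating DeepWalk-style random
    walks, sampling a context size c ~ U{1, C} per anchor node (as in the
    word2vec implementations discussed in [21]). Illustrative sketch only."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    neighbors = [np.flatnonzero(A[v]) for v in range(n)]  # outgoing edges
    D = np.zeros((n, n))
    for _ in range(walks_per_node):
        for v0 in range(n):
            walk = [v0]
            for _ in range(walk_len - 1):
                nbrs = neighbors[walk[-1]]
                if len(nbrs) == 0:
                    break                      # dead end: stop this walk
                walk.append(rng.choice(nbrs))  # uniform edge transition
            for i, anchor in enumerate(walk):
                c = rng.integers(1, C + 1)     # sampled context distance
                for ctx in walk[i + 1 : i + 1 + c]:
                    D[anchor, ctx] += 1        # co-visit within c steps
    return D
```

Note that, as in word2vec, pairs visited more than c steps apart are simply discarded, which is what induces the decaying weight mass over distances in expectation.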
For further information on graph embedding methods see [9].
Let D ∈ R^{|V|×|V|} be the co-occurrence matrix from random walks, with each entry D_vu containing the number of times nodes v and u are co-visited within context distance c ∼ U{1, C}, in all simulated random walks. Embedding methods utilizing random walks can also be viewed using the framework of Eq. (1). For example, to get node2vec [15], we can set f(A) = D, set the edge function to the embeddings' outer product g(Y) = Y × Y^⊤, and set the loss function to the negative log likelihood of the softmax, yielding:

min_Y [ log Z − ∑_{v,u∈V} D_vu (Y_v^⊤ Y_u) ],    (2)

where the partition function Z = ∑_{v,u} exp(Y_v^⊤ Y_u) can be estimated with negative sampling [25, 15].

2.2.1 Graph Likelihood

A recently-proposed objective for learning embeddings is the graph likelihood [2]:

∏_{v,u∈V} σ(g(Y)_{v,u})^{D_vu} (1 − σ(g(Y)_{v,u}))^{1[(v,u) ∉ E]},    (3)

where g(Y)_{v,u} is the output of the model evaluated at edge (v, u), given node embeddings Y, and the activation function σ(.) is the logistic. Maximizing the graph likelihood pushes the model score g(Y)_{v,u} towards 1 if the value D_vu is large, and pushes it towards 0 if (v, u) ∉ E.
In our work, we minimize the negative log of Equation 3, written in our matrix notation as:

min_Y || −D ◦ log(σ(g(Y))) − 1[A = 0] ◦ log(1 − σ(g(Y))) ||_1,    (4)

which we minimize w.r.t. node embeddings Y ∈ R^{|V|×d}, where ◦ is the Hadamard product, and the L1-norm ||.||_1 of a matrix is the sum of its entries. The entries of this matrix are positive because 0 < σ(.) < 1. 
Matrix D ∈ R^{|V|×|V|} can be created similarly to the one described in [2], by counting node co-occurrences in simulated random walks.

2.3 Attention Models

We mention attention models that are most similar to ours [e.g. 26, 31, 35], where an attention function is employed to suggest positions within the input example that the classification function should pay attention to when making inference. This function is used during the training phase in the forward pass and in the testing phase for prediction. The attention function and the classifier are jointly trained on an upstream objective, e.g. cross entropy. In our case, the attention mechanism only guides the learning procedure, and is not used by the model for inference. Our mechanism suggests parts of the data to focus on during training, as explained next.

3 Our Method

Following our general framework (Eq. 1), we set g(Y) = g([L | R]) = L × R^⊤ and f(A) = E[D], the expectation of the co-occurrence matrix produced from simulated random walks. Using this closed form, we extend the Negative Log Graph Likelihood (NLGL) loss (Eq. 4) to include attention parameters on the random walk sampling.

3.1 Expectation of the co-occurrence matrix: E[D]

Rather than obtaining D by simulating random walks and sampling co-occurrences, we formulate an expectation of this sampling, E[D]. In general, this allows us to tune sampling parameters living inside the random walk procedure, including the number of steps C.
Let T be the transition matrix for a graph, which can be calculated by normalizing the rows of A to sum to one. 
This can be written as:

T = diag(A × 1_n)^{−1} × A.    (5)

Given an initial probability distribution p^(0) ∈ R^{|V|} of a random surfer, it is possible to find the distribution of the surfer after one step conditioned on p^(0) as p^(1) = p^{(0)⊤} T, and after k steps as p^(k) = p^{(0)⊤} (T)^k, where (T)^k multiplies matrix T with itself k times. We are interested in an analytical expression for E[D], the expectation over the co-occurrence matrix produced by simulated random walks. A closed-form expression for this matrix will allow us to perform end-to-end learning.
In practice, random walk methods based on DeepWalk [29] do not use C as a hard limit; instead, given a walk sequence (v_1, v_2, ...), they sample c_i ∼ U{1, C} separately for each anchor node v_i and its potential context nodes, and only keep context nodes that are within c_i steps of v_i. In expectation then, nodes v_{i+1}, v_{i+2}, v_{i+3}, ... will appear as context for anchor node v_i with probabilities 1, 1 − 1/C, 1 − 2/C, ..., respectively. We can write an expectation on D ∈ R^{|V|×|V|}:

E[D_DEEPWALK; C] = ∑_{k=1}^{C} Pr(c ≥ k) P̃^(0) (T)^k,    (6)

which is parametrized by the (discrete) walk length C; where Pr(c ≥ k) indicates the probability of a node at distance k from the anchor being selected; and P̃^(0) ∈ R^{|V|×|V|} is a diagonal matrix (the initial positions matrix), with P̃^(0)_vv set to the number of walks starting at node v. 
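Eq. (5) and the expectation in Eq. (6) can be computed densely in a few lines of numpy, here with the DeepWalk coefficients Pr(c ≥ k) = 1 − (k − 1)/C made explicit. This is an illustration under our own naming, not an efficient implementation:

```python
import numpy as np

def transition_matrix(A):
    """Row-normalize the adjacency matrix (Eq. 5): T = diag(A 1)^{-1} A."""
    return A / A.sum(axis=1, keepdims=True)

def expected_D_deepwalk(A, C, walks_per_node=1):
    """Closed-form expectation of the DeepWalk co-occurrence matrix:
    E[D; C] = P0 @ sum_k (1 - (k-1)/C) T^k, where P0 is the diagonal
    initial-positions matrix (walks started per node). Dense sketch only."""
    T = transition_matrix(A)
    n = A.shape[0]
    P0 = walks_per_node * np.eye(n)
    Tk = np.eye(n)
    ED = np.zeros((n, n))
    for k in range(1, C + 1):
        Tk = Tk @ T                          # (T)^k by repeated multiplication
        ED += (1.0 - (k - 1) / C) * Tk       # coefficient decreases with k
    return P0 @ ED
```

For C = 1 the expectation reduces to T itself; larger C mixes in higher powers with linearly decaying weight.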
Since Pr(c = k) = 1/C for all k ∈ {1, 2, ..., C}, we can expand Pr(c ≥ k) = ∑_{j=k}^{C} Pr(c = j) = 1 − (k − 1)/C, and re-write the expectation as:

E[D_DEEPWALK; C] = P̃^(0) ∑_{k=1}^{C} [1 − (k − 1)/C] (T)^k.    (7)

Eq. (7) is derived, step-by-step, in the Appendix. We are not concerned by the exact definition of the scalar coefficient, [1 − (k − 1)/C], but we note that the coefficient decreases with k.
Instead of keeping C a hyper-parameter, we want to analytically optimize it on an upstream objective. Further, we are interested in learning the coefficients of (T)^k instead of hand-engineering a formula. As an aside, running the GloVe embedding algorithm [28] over the random walk sequences is, in expectation, equivalent to factorizing the co-occurrence matrix E[D_GloVe; C] = P̃^(0) ∑_{k=1}^{C} (1/k) (T)^k.

3.2 Learning the Context Distribution

We want to learn the coefficients of (T)^k. Let the context distribution Q be a C-dimensional vector Q = (Q_1, Q_2, ..., Q_C) with Q_k ≥ 0 and ∑_k Q_k = 1. We assign coefficient Q_k to (T)^k. Formally, our expectation on D is parameterized with, and is differentiable w.r.t., Q:

E[D; Q_1, Q_2, ..., Q_C] = P̃^(0) ∑_{k=1}^{C} Q_k (T)^k = P̃^(0) E_{k∼Q}[(T)^k].    (8)

Training embeddings over random walk sequences using word2vec or GloVe corresponds, in expectation, to special cases of Equation 8, with Q fixed a priori as Q_k = [1 − (k − 1)/C] or Q_k ∝ 1/k, respectively.

Dataset      | |V|    | |E|     | nodes       | edges
wiki-vote    | 7,066  | 103,663 | users       | votes
ego-Facebook | 4,039  | 88,234  | users       | friendship
ca-AstroPh   | 17,903 | 197,031 | researchers | co-authorship
ca-HepTh     | 8,638  | 24,827  | researchers | co-authorship
PPI [33]     | 3,852  | 20,881  | proteins    | chemical interaction

(a) Datasets used in our experiments: wiki-vote is directed but all others are undirected graphs.
Figure 1: In 1a we present statistics of our datasets. 
In 1b, we motivate our work by showing the necessity of setting the parameter C for node2vec (d = 128, each point is the average of 7 runs).

(b) Test ROC-AUC as a function of C using node2vec.

(a) Learned attention weights Q (log scale).
(b) Q with varying the regularization β (linear scale).
Figure 2: (a) shows learned attention weights Q, which agree with the grid search over node2vec (Figure 1b). (b) shows how varying β affects the learned Q. Note that distributions can quickly tail off to zero (ego-Facebook and PPI), while other graphs (wiki-vote) contain information across distant nodes.

3.3 Graph Attention Models

To learn Q automatically, we propose an attention model which guides the random surfer on "where to attend to" as a function of distance from the source node. Specifically, we define a Graph Attention Model as a process which models a node's context distribution Q as the output of a softmax:

(Q_1, Q_2, Q_3, ...) = softmax((q_1, q_2, q_3, ...)),    (9)

where the variables q_k are trained via backpropagation, jointly while learning node embeddings. Our hypothesis is as follows. If we do not impose a specific formula on Q = (Q_1, Q_2, ..., Q_C), other than a (regularized) softmax, then we can use very large values of C and allow every graph to learn its own form of Q, with its preferred sparsity and its own decay form. Should the graph structure require a small C, then the optimization would discover a left-skewed Q with all of the probability mass on {Q_1, Q_2} and ∑_{k>2} Q_k ≈ 0. However, if according to the objective a graph is more accurately encoded by making longer walks, then it can learn to use a large C (e.g. using a uniform or even right-skewed Q distribution), focusing more attention on longer-distance connections in the random walk.
To this end, we propose to train a softmax attention model on the infinite power series of the transition matrix. 
We define an expectation on our proposed random walk matrix D_softmax[∞] as⁵:

E[D_softmax[∞]; q_1, q_2, q_3, ...] = P̃^(0) lim_{C→∞} ∑_{k=1}^{C} softmax(q_1, q_2, q_3, ...)_k (T)^k,    (10)

where q_1, q_2, ... are jointly trained with the embeddings to minimize our objective.

3.4 Training Objective

The final training objective for the softmax attention mechanism, coming from the NLGL (Eq. 4), is:

min_{L,R,q} β ||q||_2^2 + || −E[D; q] ◦ log(σ(L × R^⊤)) − 1[A = 0] ◦ log(1 − σ(L × R^⊤)) ||_1,    (11)

which is minimized w.r.t. the attention parameter vector q = (q_1, q_2, ...) and node embeddings L, R ∈ R^{|V|×d/2}. Hyper-parameter β ∈ R applies L2 regularization on the attention parameters. We emphasize that our attention parameters q live within the expectation over the data D, and are not part of the model (L, R); they are therefore not required for inference. The constraint ∑_k Q_k = 1, enforced through the softmax activation, prevents E[D_softmax] from collapsing into a trivial solution (the zero matrix).

⁵We do not actually unroll the summation in Eq. (10) an infinite number of times. 
Our experiments show that unrolling it 10 or 20 times is sufficient to obtain state-of-the-art results.

Dataset      | dim | EigenMaps | SVD  | DNGR | n2v (C=2) | n2v (C=5) | AsymProj | Graph Attention (ours) | Error Reduction
wiki-vote    | 64  | 61.3 | 86.0 | 59.8 | 64.4 | 63.6 | 91.7 | 93.8 ± 0.13 | 25.2%
wiki-vote    | 128 | 62.2 | 80.8 | 55.4 | 63.7 | 64.6 | 91.7 | 93.8 ± 0.05 | 25.2%
ego-Facebook | 64  | 96.4 | 96.7 | 98.1 | 99.1 | 99.0 | 97.4 | 99.4 ± 0.10 | 33.3%
ego-Facebook | 128 | 95.4 | 94.5 | 98.4 | 99.3 | 99.2 | 97.3 | 99.5 ± 0.03 | 28.6%
ca-AstroPh   | 64  | 82.4 | 91.1 | 93.9 | 97.4 | 96.9 | 95.7 | 97.9 ± 0.21 | 19.2%
ca-AstroPh   | 128 | 82.9 | 92.4 | 96.8 | 97.7 | 97.5 | 95.7 | 98.1 ± 0.49 | 24.0%
ca-HepTh     | 64  | 80.2 | 79.3 | 86.8 | 90.6 | 91.8 | 90.3 | 93.6 ± 0.06 | 22.0%
ca-HepTh     | 128 | 81.2 | 78.0 | 89.7 | 90.1 | 92.0 | 90.3 | 93.9 ± 0.05 | 23.8%
PPI          | 64  | 70.7 | 75.4 | 76.7 | 79.7 | 70.6 | 82.4 | 89.8 ± 1.05 | 43.5%
PPI          | 128 | 73.7 | 71.2 | 76.9 | 81.8 | 74.4 | 83.9 | 91.0 ± 0.28 | 44.2%

Table 1: Results on link prediction datasets. Shown is the ROC-AUC; each row shows results on one dataset when training embeddings with the indicated dimension. Column groups indicate whether a method operates on A, on the sampled co-occurrence matrix D, or on our expectation E[D]. We bold the highest accuracy per dataset-dimension pair, including when the highest accuracy intersects with the mean ± standard deviation. We use the train:test splits of [2], hosted on http://sami.haija.org/graph/splits

3.5 Computational Complexity

The naive computation of (T)^k requires k matrix multiplications and so is O(|V|^3 k). 
However, as most real-world adjacency matrices have an inherent low-rank structure, a number of fast approximations for computing the random walk transition matrix raised to a power k have been proposed [e.g. 34]. Alternatively, SVD can decompose T as T = UΛV^⊤, and then the k-th power can be calculated by raising the diagonal matrix of singular values to k, as (T)^k = U(Λ)^k V^⊤, since V^⊤U = I. Furthermore, the SVD can be approximated in time linear in the number of non-zero entries [16]. Therefore, we can approximate (T)^k in O(|E|). In this work, we compute (T)^k without approximations. Our algorithm runs in seconds over the given datasets (at least 10X faster than node2vec [15], DVNE [?], DNGR [8]). We leave stochastic and approximate versions of our method as future work.

3.6 Extensions

As presented, our proposed method can learn the weights of the context distribution Q. However, we briefly note that such a model can be trivially extended to learn the weight of any other type of pair-wise node similarity (e.g. Personalized PageRank, Adamic-Adar, etc.). To do this, we can extend the definition of the context Q with an additional dimension Q_{C+1} for the new type of similarity, and an additional element in the softmax q_{C+1} to learn a joint importance function.

4 Experiments

4.1 Link Prediction Experiments

We evaluate the quality of embeddings produced when random walks are augmented with attention, through experiments on link prediction [23]. Link prediction is a challenging task with many real-world applications in information retrieval, recommendation systems, and social networks. As such, it has been used to study the properties of graph embeddings [29, 15]. Such an intrinsic evaluation emphasizes the structure-preserving properties of the embedding.
Our experimental setup is designed to determine how well the embeddings produced by a method capture the topology of the graph. 
We measure this in the manner of [15]: remove a fraction (50%) of the graph edges, learn embeddings from the remaining edges, and measure how well the embeddings can recover the edges that were removed. More formally, we split the graph edges E into two partitions of equal size, E_train and E_test, such that the training graph is connected. We also sample non-existent edges ((u, v) ∉ E) to make E⁻_train and E⁻_test. We use (E_train, E⁻_train) for training and model selection, and use (E_test, E⁻_test) to compute evaluation metrics.

Dataset  | n2v (C=5) | Graph Attention (ours)
Cora     | 63.1      | 67.9
Citeseer | 45.6      | 51.5

(a) node2vec, Cora    (b) Graph Attention (ours), Cora    (c) Classification accuracy
Figure 3: Node classification. Figs. (a)/(b): t-SNE visualization of node embeddings for the Cora dataset. We note that both methods are unsupervised, and we have colored the learned representations by node labels. Fig. (c): Quantitatively, our embeddings achieve better separation.

Training: We train our models using TensorFlow, with the PercentDelta optimizer [1]. For the results in Table 1, we use β = 0.5, C = 10, and P̃^(0) = diag(80), which corresponds to 80 walks per node. We analyze our model's sensitivity in Section 4.2. To ensure repeatability of results, we have released our model and instructions⁶.
Datasets: Figure 1a describes the datasets used in our experiments. Datasets are available from SNAP: https://snap.stanford.edu/data.
Baselines: We evaluate against many baselines. For all methods, we calculate g(Y) ∈ R^{|V|×|V|}, extract the entries of g(Y) corresponding to positive and negative test edges, and use them to compute the ROC-AUC. We compare against the following baselines, marking symmetric models with †. 
Their counterparts, asymmetric models including ours, can learn g(Y)_vu ≠ g(Y)_uv, which we expect to perform relatively better on the directed graph wiki-vote.
– †EigenMaps [5]. Minimizes the Euclidean distance of adjacent nodes of A.
– SVD. Singular value decomposition of A. Inference is through the function g(Y) = U_d Λ_d V_d^⊤, where (U_d, Λ_d, V_d) is the low-rank SVD decomposition corresponding to the d largest singular values.
– †DNGR [8]. Non-linear (i.e. deep) embedding of nodes, using an auto-encoder on A. We use the authors' code to learn the deep embeddings Y, and use g(Y) = YY^⊤ for inference.
– †n2v: node2vec [15] is a popular baseline. It simulates random walks and uses word2vec to learn node embeddings, minimizing the objective in Eq. (2). For Table 1, we use the authors' code to learn embeddings Y, then use g(Y) = YY^⊤. We run with C = 2 and C = 5.⁷
– AsymProj [2]. Learns edges as asymmetric projections in a deep embedding space, trained by maximizing the graph likelihood (Eq. 3).
Results: Our results, summarized in Table 1, show that our proposed methods substantially outperform all baseline methods. Specifically, we see that the error is reduced by up to 45% over baseline methods, which have fixed context definitions. This shows that by parameterizing the context distribution and allowing each graph to learn its own distribution, we can better preserve the graph structure (and thereby better predict missing edges).
Discussion: Figure 2a shows how the learned attention weights Q vary across datasets. Each dataset learns its own attention form, and the highest learned weights generally correspond to the best-performing settings found by a grid search over C for node2vec (as in Figure 1b).
The hyper-parameter C determines the highest power of the transition matrix, and hence the maximum context size available to the attention model. 
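To make the role of C concrete, the truncated attended expectation (Eqs. 8 and 10, unrolled C times) can be sketched for a fixed attention vector q. In the paper, q is of course trained by backpropagation jointly with the embeddings; the names below are ours, for illustration only:

```python
import numpy as np

def expected_D_softmax(A, q, walks_per_node=1):
    """Softmax graph-attention expectation, truncated at C = len(q) powers:
    E[D; q] = P0 * sum_k softmax(q)_k T^k. Here q is a fixed vector; in the
    paper it is a trainable parameter. Dense, illustrative sketch only."""
    T = A / A.sum(axis=1, keepdims=True)     # row-stochastic transition matrix
    Q = np.exp(q - np.max(q))
    Q = Q / Q.sum()                          # attention weights, sum to 1
    n = A.shape[0]
    ED = np.zeros((n, n))
    Tk = np.eye(n)
    for k in range(len(q)):
        Tk = Tk @ T                          # (T)^{k+1}
        ED += Q[k] * Tk                      # weight each power by Q_{k+1}
    return walks_per_node * ED
```

With a sharply left-peaked q the sum collapses toward T itself, while a flat q spreads mass over higher powers, mirroring the per-graph behaviors in Figure 2.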
We suggest using large values for C, since the attention weights can effectively use a subset of the transition matrix powers. For example, if a network needs only 2 hops to be accurately represented, then it is possible for the softmax attention model to learn Q_3, Q_4, ... ≈ 0. Figure 2b shows how varying the regularization term β allows the softmax attention model to "attend to" only what each dataset requires. We observe that for most graphs, the majority of the mass gets assigned to Q_1, Q_2. This shows that shorter walks are more beneficial for most graphs. However, on wiki-vote, better embeddings are produced by paying attention to longer walks, as its softmax Q is uniform-like, with a slight right-skew.

⁶Available at http://sami.haija.org/graph/context
⁷We sweep C in Figure 1b, showing that there is no default C that works best across all datasets.

Figure 4: Sensitivity analysis of the softmax attention model. Our method is robust to choices of both β and C. We note that it consistently outperforms even an optimally set node2vec.

4.2 Sensitivity Analysis

So far, we have removed two hyper-parameters: the maximum window size C, and the form of the context distribution U. In exchange, we have introduced other hyper-parameters, specifically the walk length (also C) and a regularization term β for the softmax attention model. Nonetheless, we show that our method is robust to various choices of these two. Figures 2a and 2b both show that the softmax attention weights drop to almost zero if the graph can be preserved using shorter walks, which is not possible with fixed-form distributions (e.g. U).
Figure 4 examines this relationship in more detail for d = 128 dimensional embeddings, sweeping our hyper-parameters C and β, and comparing results to the best and worst node2vec embeddings for C ∈ [1, 10]. 
(Note that the node2vec lines are horizontal, as they do not depend on β.) We observe that all the accuracy metrics are within 1% to 2% when varying these hyper-parameters, and are all still well above our baselines (which sample from a fixed-form context distribution).

4.3 Node Classification Experiments

We conduct node classification experiments on two citation datasets, Cora and Citeseer, with the following statistics: Cora contains 2,708 nodes, 5,429 edges, and K = 7 classes; Citeseer contains 3,327 nodes, 4,732 edges, and K = 6 classes. We learn embeddings from only the graph structure (nodes and edges), without observing node features or labels during training. Figure 3 shows a t-SNE visualization of the Cora dataset, comparing our method with node2vec [15]. For classification, we follow the data splits of [37]. We predict labels L̃ ∈ R^{|V|×K} as L̃ = exp(α g(Y)) × L_train, where L_train ∈ {0, 1}^{|V|×K} contains rows of ones corresponding to nodes in the training set and zeros elsewhere. The scalar α ∈ R is manually tuned on the validation set. The classification results, summarized in Figure 3c, show that our model learns a better unsupervised representation than previous methods, which can then be used for supervised tasks. We do not compare against other semi-supervised methods that utilize node features during training and inference [incl. 37, 20], as our method is unsupervised.
Our classification prediction function contains one scalar parameter α. It can be thought of as a "smooth" k-nearest-neighbors classifier, as it takes a weighted average of known labels, where the weights are exponentials of the dot-product similarity. Such a simple function should introduce no model bias.

5 Related Work

The field of learning on graphs has attracted much attention lately. 
Here we summarize two broad classes of algorithms, and point the reader to recent reviews [10, 6, 18, 14] for more context.
The first class of algorithms is semi-supervised and concerned with predicting labels over a graph, its edges, and/or its nodes. Typically, these algorithms process a graph (nodes and edges) as well as per-node features. These include recent graph convolution methods [e.g. 27, 7, 3, 17] with spectral variants [12, 7], diffusion methods [e.g. 11, 13], including ones trained until fixed-point convergence [32, 22], and semi-supervised node classification [37] with low-rank approximation of convolution [12, 20]. We differ from these methods in that (1) our algorithm is unsupervised (trained exclusively on the graph structure itself) without utilizing labels during training, and (2) we explicitly model the relationship between all node pairs.

The second class of algorithms consists of unsupervised graph embedding methods. Their primary goal is to preserve the graph structure, to create task-independent representations. They explicitly model the relationship of all node pairs (e.g. as a dot product of node embeddings). Some methods directly use the adjacency matrix [8, 36], and others incorporate higher-order structure (e.g. from simulated random walks) [29, 15, 2]. Our work falls under this class of algorithms, where inference is a scoring function V × V → R, trained to score positive edges higher than negative ones. 
Unlike existing methods, we do not specify a fixed context distribution a priori; instead, we push gradients through the random walk to those parameters, which we train jointly while learning the embeddings.

6 Conclusion

In this paper, we propose an attention mechanism for learning the context distribution used in graph embedding methods. We derive the closed-form expectation of the DeepWalk [29] co-occurrence statistics, showing an equivalence between the context distribution hyper-parameters and the coefficients of the power series of the graph transition matrix. Then, we propose to replace the context hyper-parameters with trainable models that we learn jointly with the embeddings, on an objective that preserves the graph structure (the Negative Log Graph Likelihood, NLGL). Specifically, we propose Graph Attention Models, using a softmax to learn a free-form context distribution with a parameter for each type of context similarity (e.g. distance in a random walk).
We show significant improvements on link prediction and node classification over state-of-the-art baselines (that use a fixed-form context distribution), reducing error on link prediction and classification by up to 40% and 10%, respectively. In addition to improved performance (by learning distributions of arbitrary forms), our method can obviate the manual grid search over hyper-parameters: walk length and form of context distribution, which can drastically affect the quality of the learned embeddings and differ for every graph. On the datasets we consider, we show that our method is robust to its hyper-parameters, as described in Section 4.2. Our visualizations of converged attention weights suggest that some graphs (e.g. voting graphs) are better preserved by longer walks, while other graphs (e.g.
protein-protein interaction graphs) contain more information in short dependencies and require shorter walks.
We believe that our contribution in replacing these sampling hyper-parameters with a learnable context distribution is general and can be applied to many domains and modeling techniques in graph representation learning.

References

[1] S. Abu-El-Haija. Proportionate gradient updates with PercentDelta. In arXiv, 2017.

[2] S. Abu-El-Haija, B. Perozzi, and R. Al-Rfou. Learning edge representations via low-rank asymmetric projections. In ACM International Conference on Information and Knowledge Management (CIKM), 2017.

[3] J. Atwood and D. Towsley. Diffusion-convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), 2016.

[4] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations (ICLR), 2015.

[5] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. In Neural Computation, 2003.

[6] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst. Geometric deep learning: going beyond euclidean data. In IEEE Signal Processing Magazine, 2017.

[7] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun. Spectral networks and deep locally connected networks on graphs. In International Conference on Learning Representations, 2013.

[8] S. Cao, W. Lu, and Q. Xu. Deep neural networks for learning graph representations. In Association for the Advancement of Artificial Intelligence, 2016.

[9] H. Chen, B. Perozzi, R. Al-Rfou, and S. Skiena. A tutorial on network embeddings. arXiv preprint arXiv:1808.02590, 2018.

[10] H. Chen, B. Perozzi, Y. Hu, and S. Skiena. HARP: Hierarchical representation learning for networks. In The 32nd AAAI Conference on Artificial Intelligence, 2018.

[11] H. Dai, B. Dai, and L. Song.
Discriminative embeddings of latent variable models for structured data. In International Conference on Machine Learning, 2016.

[12] M. Defferrard, X. Bresson, and P. Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems (NIPS), 2016.

[13] D. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems (NIPS), 2015.

[14] P. Goyal and E. Ferrara. Graph embedding techniques, applications, and performance: A survey. In Knowledge-Based Systems, 2018.

[15] A. Grover and J. Leskovec. node2vec: Scalable feature learning for networks. In International Conference on Knowledge Discovery and Data Mining, 2016.

[16] N. Halko, P. Martinsson, and J. A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. In SIAM Review, 2011.

[17] W. Hamilton, R. Ying, and J. Leskovec. Inductive representation learning on large graphs. In NIPS, 2017.

[18] W. L. Hamilton, R. Ying, and J. Leskovec. Representation learning on graphs: Methods and applications. In IEEE Data Engineering Bulletin, 2017.

[19] G. J, S. Ganguly, M. Gupta, V. Varma, and V. Pudi. Author2vec: Learning author representations by combining content and link information. In International Conference Companion on World Wide Web (WWW), WWW '16 Companion, 2016.

[20] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), 2017.

[21] O. Levy, Y. Goldberg, and I. Dagan. Improving distributional similarity with lessons learned from word embeddings. In TACL, 2015.

[22] Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel.
Gated graph sequence neural networks. In International Conference on Learning Representations, 2016.

[23] D. Liben-Nowell and J. Kleinberg. The link-prediction problem for social networks. In Journal of the American Society for Information Science and Technology, 2007.

[24] Y. Luo, Q. Wang, B. Wang, and L. Guo. Context-dependent knowledge graph embedding. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2015.

[25] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (NIPS), 2013.

[26] V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu. Recurrent models of visual attention. In Advances in Neural Information Processing Systems (NIPS), 2014.

[27] M. Niepert, M. Ahmed, and K. Kutzkov. Learning convolutional neural networks for graphs. In International Conference on Machine Learning (ICML), 2016.

[28] J. Pennington, R. Socher, and C. D. Manning. GloVe: Global vectors for word representation. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.

[29] B. Perozzi, R. Al-Rfou, and S. Skiena. DeepWalk: Online learning of social representations. In Knowledge Discovery and Data Mining (KDD), 2014.

[30] B. Perozzi, V. Kulkarni, H. Chen, and S. Skiena. Don't walk, skip!: Online learning of multi-scale network embeddings. In Advances in Social Networks Analysis and Mining (ASONAM), 2017.

[31] V. Ramanathan, J. Huang, S. Abu-El-Haija, A. Gorban, K. Murphy, and L. Fei-Fei. Detecting events and key actors in multi-person videos. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[32] F. Scarselli, M. Gori, A. Tsoi, M. Hagenbuchner, and G. Monfardini. The graph neural network model. In IEEE Trans. on Neural Networks, 2009.

[33] C. Stark, B. Breitkreutz, T. Reguly, L. Boucher, A. Breitkreutz, and M. Tyers.
BioGRID: A general repository for interaction datasets. In Nucleic Acids Research, 2006.

[34] H. Tong, C. Faloutsos, and J.-Y. Pan. Fast random walk with restart and its applications. In International Conference on Data Mining (ICDM), IEEE, 2006.

[35] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio. Graph attention networks. In International Conference on Learning Representations (ICLR), 2018.

[36] D. Wang, P. Cui, and W. Zhu. Structural deep network embedding. In International Conference on Knowledge Discovery and Data Mining, 2016.

[37] Z. Yang, W. Cohen, and R. Salakhutdinov. Revisiting semi-supervised learning with graph embeddings. In International Conference on Machine Learning (ICML), 2016.

[38] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy. Hierarchical attention networks for document classification. In Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2016.