{"title": "Learning Graph Representations with Embedding Propagation", "book": "Advances in Neural Information Processing Systems", "page_first": 5119, "page_last": 5130, "abstract": "We propose EP, Embedding Propagation, an unsupervised learning framework for graph-structured data. EP learns vector representations of graphs by passing two types of messages between neighboring nodes. Forward messages consist of label representations such as representations of words and other attributes associated with the nodes. Backward messages consist of gradients that result from aggregating the label representations and applying a reconstruction loss. Node representations are finally computed from the representation of their labels. With significantly fewer parameters and hyperparameters, an instance of EP is competitive with and often outperforms state of the art unsupervised and semi-supervised learning methods on a range of benchmark data sets.", "full_text": "Learning Graph Representations with Embedding\n\nPropagation\n\nAlberto Garc\u00eda-Dur\u00e1n\n\nNEC Labs Europe\n\nHeidelberg, Germany\n\nalberto.duran@neclab.eu\n\nMathias Niepert\nNEC Labs Europe\n\nHeidelberg, Germany\n\nmathias.niepert@neclab.eu\n\nAbstract\n\nWe propose Embedding Propagation (EP), an unsupervised learning framework for\ngraph-structured data. EP learns vector representations of graphs by passing two\ntypes of messages between neighboring nodes. Forward messages consist of label\nrepresentations such as representations of words and other attributes associated with\nthe nodes. Backward messages consist of gradients that result from aggregating the\nlabel representations and applying a reconstruction loss. Node representations are\n\ufb01nally computed from the representation of their labels. 
With signi\ufb01cantly fewer\nparameters and hyperparameters an instance of EP is competitive with and often\noutperforms state of the art unsupervised and semi-supervised learning methods on\na range of benchmark data sets.\n\n1\n\nIntroduction\n\nGraph-structured data occurs in numerous application domains such as social networks, bioinfor-\nmatics, natural language processing, and relational knowledge bases. The computational problems\ncommonly addressed in these domains are network classi\ufb01cation [40], statistical relational learn-\ning [12, 36], link prediction [22, 24], and anomaly detection [8, 1], to name but a few. In addition,\ngraph-based methods for unsupervised and semi-supervised learning are often applied to data sets\nwith few labeled examples. For instance, spectral decompositions [25] and locally linear embeddings\n(LLE) [38] are always computed for a data set\u2019s af\ufb01nity graph, that is, a graph that is \ufb01rst constructed\nusing domain knowledge or some measure of similarity between data points. Novel approaches to\nunsupervised representation learning for graph-structured data, therefore, are important contributions\nand are directly applicable to a wide range of problems.\nEP learns vector representations (embeddings) of graphs by passing messages between neighboring\nnodes. This is reminiscent of power iteration algorithms which are used for such problems as com-\nputing the PageRank for the web graph [33], running label propagation algorithms [47], performing\nisomorphism testing [16], and spectral clustering [25]. Whenever a computational process can be\nmapped to message exchanges between nodes, it is implementable in graph processing frameworks\nsuch as Pregel [29], GraphLab [23], and GraphX [44].\nGraph labels represent vertex attributes such as bag of words, movie genres, categorical features,\nand continuous features. They are not to be confused with class labels of a supervised classi\ufb01cation\nproblem. 
In the EP learning framework, each vertex v sends and receives two types of messages.\nLabel representations are sent from v\u2019s neighboring nodes to v and are combined so as to reconstruct\nthe representations of v\u2019s labels. The gradients resulting from the application of some reconstruction\nloss are sent back as messages to the neighboring vertices so as to update their labels\u2019 representations\nand the representations of v\u2019s labels. This process is repeated for a certain number of iterations or\nuntil a convergence threshold is reached. Finally, the label representations of v are used to compute a\nrepresentation of v itself.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fDespite its conceptual simplicity, we show that EP generalizes several existing machine learning\nmethods for graph-structured data. Since EP learns embeddings by incorporating different label types\n(representing, for instance, text and images) it is a framework for learning with multi-modal data [31].\n\n2 Previous Work\n\nThere are numerous methods for embedding learning such as multidimensional scaling (MDS) [20],\nLaplacian Eigenmap [3], Siamese networks [7], IsoMap [43], and LLE [38]. Most of these approaches\nconstruct an af\ufb01nity graph on the data points \ufb01rst and then embed the graph into a low dimensional\nspace. The corresponding optimization problems often have to be solved in closed form (for instance,\ndue to constraints on the objective that remove degenerate solutions) which is intractable for large\ngraphs. We discuss the relation to LLE [38] in more detail when we analyze our framework.\nGraph neural networks (GNN) [39] is a general class of recursive neural networks for graphs where\neach node is associated with one label. Learning is performed with the Almeida-Pineda algorithm\n[2, 35]. 
The computation of the node embeddings is performed by backpropagating gradients for a\nsupervised loss after running a recursive propagation model to convergence. In the EP framework,\ngradients are computed and backpropagated immediately for each node. Gated graph sequence\nneural networks (GG-SNN) [21] modify GNN to use gated recurrent units and modern optimization\ntechniques. Recent work on graph convolutional networks (GCNs) uses a supervised loss to inject\nclass label information into the learned representations [18]. GCNs, as well as GNNs and GG-SNNs,\ncan be seen as instances of the Message Passing Neural Network (MPNN) framework, recently\nintroduced in [13]. There are several significant differences between the EP and MPNN frameworks:\n(i) all instances of MPNN use a supervised loss but EP is unsupervised and, therefore, classifier-agnostic;\n(ii) EP learns label embeddings for each of the different label types independently and\ncombines them into a joint node representation, whereas none of the existing instances of MPNN provides\nan explicit method for combining heterogeneous feature types. Moreover, EP\u2019s learning principle,\nbased on reconstructing each node\u2019s representation from neighboring nodes\u2019 representations, is highly\nsuitable for the inductive setting where nodes are missing during training.\nMost closely related to our work is DEEPWALK [34], which applies a word embedding algorithm\nto random walks. The idea is that random walks (node sequences) are treated as sentences (word\nsequences). A SKIPGRAM [30] model is then used to learn node embeddings from the random\nwalks. NODE2VEC [15] is identical to DEEPWALK with the exception that it explores new methods\nto generate random walks (the input sentences to WORD2VEC), at the cost of introducing more\nhyperparameters. LINE [41] optimizes similarities between pairs of node embeddings so as to\npreserve their first- and second-order proximity. 
The main advantage of EP over these approaches is its\nability to incorporate graph attributes such as text and continuous features. PLANETOID [45] combines\na learning objective similar to that of DEEPWALK with supervised objectives. It also incorporates\nthe bags of words associated with nodes into these supervised objectives. We show experimentally that\nfor graphs without attributes, all of the above methods learn embeddings of similar quality and that EP\noutperforms all other methods significantly on graphs with word labels. We can also show that EP\ngeneralizes methods that learn embeddings for multi-relational graphs such as TRANSE [5].\n\n3 Embedding Propagation\n\nA graph G = (V, E) consists of a set of vertices V and a set of edges E \u2286 {(v, w) | v, w \u2208 V}.\nThe approach works with directed and undirected edges as well as with multiple edge types.\nN(v) is the set of neighbors of v if G is undirected and the set of in-neighbors if G is directed.\nThe graph G is associated with a set of k label classes L = {L1, ..., Lk} where each Li is a\nset of labels corresponding to label type i. A label is an identifier of some object and not to\nbe confused with a class label in classification problems. Labels allow us to represent a wide range\nof objects associated with the vertices such as words, movie genres, and continuous features.\nTo illustrate the concept of label types, Figure 1 depicts a fragment of a citation network. There are\ntwo label types: one representing the unique article identifiers and the other representing the\nidentifiers of natural language words occurring in the articles.\n\nFigure 1: A fragment of a citation network.\n\nThe functions $l_i : V \\to 2^{L_i}$ map every vertex in the graph to a subset of the labels $L_i$ of label type\ni. We write $l(v) = \\bigcup_i l_i(v)$ for the set of all labels associated with vertex v. Moreover, we write\n$l_i(N(v)) = \\{l_i(u) \\mid u \\in N(v)\\}$ for the multiset of labels of type i associated with the neighbors of\nvertex v.\nWe begin by describing the general learning framework of EP, which proceeds in two steps.\n\n\u2022 First, EP learns a vector representation for every label by passing messages along the edges\nof the input graph. We write $\\boldsymbol{\\ell}$ for the current vector representation of a label $\\ell$. For labels\nof label type i, we apply a learnable embedding function $\\boldsymbol{\\ell} = f_i(\\ell)$ that maps every label\n$\\ell$ of type i to its embedding $\\boldsymbol{\\ell}$. The embedding functions $f_i$ have to be differentiable so\nas to facilitate parameter updates during learning. For each label type one can choose an\nappropriate embedding function such as a linear embedding function for text input or a more\ncomplex convolutional network for image data.\n\u2022 Second, EP computes a vector representation for each vertex v from the vector representations\nof v\u2019s labels. We write $\\mathbf{v}$ for the current vector representation of a vertex v.\n\nLet $v \\in V$, let $i \\in \\{1, \\ldots, k\\}$ be a label type, and let $d_i \\in \\mathbb{N}$ be the size of the embedding for label type\ni. Moreover, let $h_i(v) = g_i(\\{\\boldsymbol{\\ell} \\mid \\ell \\in l_i(v)\\})$ and let $\\tilde{h}_i(v) = \\tilde{g}_i(\\{\\boldsymbol{\\ell} \\mid \\ell \\in l_i(N(v))\\})$, where $g_i$ and\n$\\tilde{g}_i$ are differentiable functions that map multisets of $d_i$-dimensional vectors to a single $d_i$-dimensional\nvector. We refer to the vector $h_i(v)$ as the embedding of label type i for vertex v and to $\\tilde{h}_i(v)$ as\nthe reconstruction of the embedding of label type i for vertex v since it is computed from the label\nembeddings of v\u2019s neighbors. While the $g_i$ and $\\tilde{g}_i$ can be parameterized (typically with a neural\nnetwork), in many cases they are simple parameter-free functions that compute, for instance, the\nelement-wise average or maximum of the input.\nThe first learning procedure is driven by the following objectives for each label type $i \\in \\{1, \\ldots, k\\}$:\n\n$$\\min \\mathcal{L}_i = \\min \\sum_{v \\in V} d_i\\left(\\tilde{h}_i(v), h_i(v)\\right), \\qquad (1)$$\n\nwhere $d_i$ is some measure of distance between $h_i(v)$, the current representation of label type i for\nvertex v, and its reconstruction $\\tilde{h}_i(v)$. Hence, the objective of the approach is to learn the parameters\nof the functions $g_i$ and $\\tilde{g}_i$ (if such parameters exist) and the vector representations of the labels such\nthat the output of $\\tilde{g}_i$ applied to the type i label embeddings of v\u2019s neighbors is close to the output\nof $g_i$ applied to the type i label embeddings of v. For each vertex v the messages passed to v from\nits neighbors are the representations of their labels. The messages passed back to v\u2019s neighbors are\nthe gradients which are used to update the label embeddings. The gradients also update v\u2019s label\nembeddings. Figure 2 illustrates the first part of the unsupervised learning framework for a part of a\ncitation network. A representation is learned both for the article identifiers and the words occurring\nin the articles. The gradients are computed based on a loss between the reconstruction of the label\ntype embeddings and their current values.\n\nFigure 2: Illustration of the messages passed between a vertex v and its neighbors for the citation\nnetwork of Figure 1. First, the label embeddings are sent from the neighboring vertices to the vertex\nv (black node). These embeddings are fed into differentiable functions $\\tilde{g}_i$. Here, there is one function\nfor the article identifier label type (yellow shades) and one for the natural language words label type\n(red shades). The gradients are derived from the distances $d_i$ between (i) the output of the functions\n$\\tilde{g}_i$ applied to the embeddings sent from v\u2019s neighbors and (ii) the output of the functions $g_i$ applied\nto v\u2019s label embeddings. The better the output of the functions $\\tilde{g}_i$ is able to reconstruct the output of\nthe functions $g_i$, the smaller the value of the distance measure. The gradients are the messages that\nare propagated back to the neighboring nodes so as to update the corresponding embedding vectors.\nThe figure is best seen in color.\n\nDue to the learning principle of EP, nodes that do not have any labels for label type i can be assigned\na new dummy label unique to the node and the label type. The representations learned for these\ndummy labels can then be used as part of the representation of the node itself. Hence, EP is also\napplicable in situations where data is missing and incomplete.\n\nFigure 3: For each vertex v, the function r computes a vector representation of the vertex based on\nthe vector representations of v\u2019s labels.\n\nThe embedding functions $f_i$ can be initialized randomly or with an existing model. 
For instance,\nembedding functions for words can be initialized using word embedding algorithms [30] and those\nfor images with pretrained CNNs [19, 11]. Initialized parameters are then refined by the application\nof EP. We can show empirically, however, that random initializations of the embedding functions $f_i$\nalso lead to effective vertex embeddings.\nThe second step of the learning framework applies a function r to compute the representation of the\nvertex v from the representations of v\u2019s labels: $\\mathbf{v} = r(\\{\\boldsymbol{\\ell} \\mid \\ell \\in l(v)\\})$. Here, the label embeddings\nand the parameters of the functions $g_i$ and $\\tilde{g}_i$ (if such parameters exist) remain unchanged. Figure 3\nillustrates the second step of EP.\nWe now introduce EP-B, an instance of the EP framework that we have found to be highly effective\nfor several of the typical graph-based learning problems. The instance results from setting\n$g_i(H) = \\tilde{g}_i(H) = \\frac{1}{|H|} \\sum_{h \\in H} h$ for all label types i and all sets of embedding vectors H. In this case we have,\nfor any vertex v and any label type i,\n\n$$h_i(v) = \\frac{1}{|l_i(v)|} \\sum_{\\ell \\in l_i(v)} \\boldsymbol{\\ell}, \\qquad \\tilde{h}_i(v) = \\frac{1}{|l_i(N(v))|} \\sum_{u \\in N(v)} \\sum_{\\ell \\in l_i(u)} \\boldsymbol{\\ell}. \\qquad (2)$$\n\nIn conjunction with the above functions $g_i$ and $\\tilde{g}_i$, we can use the margin-based ranking loss (see footnote 1)\n\n$$\\mathcal{L}_i = \\sum_{v \\in V} \\sum_{u \\in V \\setminus \\{v\\}} \\left[\\gamma + d_i\\left(\\tilde{h}_i(v), h_i(v)\\right) - d_i\\left(\\tilde{h}_i(v), h_i(u)\\right)\\right]_+, \\qquad (3)$$\n\nwhere $d_i$ is the Euclidean distance, $[x]_+$ is the positive part of x, and \u03b3 > 0 is a margin hyperparameter.\nHence, the objective is to make the distance between $\\tilde{h}_i(v)$, the reconstructed embedding of label\ntype i for vertex v, and $h_i(v)$, the current embedding of label type i for vertex v, smaller than the\ndistance between $\\tilde{h}_i(v)$ and $h_i(u)$, the embedding of label type i of a vertex u different from v. We\nsolve the minimization problem with gradient descent algorithms and use one node u for every v in\neach learning iteration. Despite using only first-order proximity information in the reconstruction\nof the label embeddings, this learning is effectively propagating embedding information across the\ngraph: an update of a label embedding affects neighboring label embeddings which, in other updates,\naffect their neighboring label embeddings, and so on; hence the name of this learning framework.\n\nFootnote 1: Directly minimizing Equation (1) could lead to degenerate solutions.\n\nTable 1: Number of parameters and hyperparameters for a graph without node attributes.\n\nMethod          #params      #hyperparams\nDEEPWALK [34]   2d|V|        4\nNODE2VEC [15]   2d|V|        6\nLINE [41]       2d|V|        2\nPLANETOID [45]  \u226b 2d|V|      \u2265 6\nEP-B            d|V|         2\n\nTable 2: Dataset statistics. k is the number of label types.\n\nDataset      |V|      |E|      #classes  k\nBlogCatalog  10,312   333,983  39        1\nPPI          3,890    76,584   50        1\nPOS          4,777    184,812  40        1\nCora         2,708    5,429    7         2\nCiteseer     3,327    4,732    6         2\nPubmed       19,717   44,338   3         2\n\nFinally, a simple instance of the function r is a function that concatenates all the embeddings $h_i(v)$\nfor $i \\in \\{1, \\ldots, k\\}$ to form one single vector representation $\\mathbf{v}$ for each node v:\n\n$$\\mathbf{v} = \\mathrm{concat}\\left[g_1(\\{\\boldsymbol{\\ell} \\mid \\ell \\in l_1(v)\\}), \\ldots, g_k(\\{\\boldsymbol{\\ell} \\mid \\ell \\in l_k(v)\\})\\right] = \\mathrm{concat}\\left[h_1(v), \\ldots, h_k(v)\\right]. \\qquad (4)$$\n\nFigure 3 illustrates the working of this particular function r. We refer to the instance of the learning\nframework based on the formulas (2), (3), and (4) as EP-B. 
The resulting vector representations of\nthe vertices can now be used for downstream learning problems such as vertex classification, link\nprediction, and so on.\n\n4 Formal Analysis\n\nWe now analyze the computational and model complexities of the EP framework and its connection to\nexisting models.\n\n4.1 Computational and Model Complexity\n\nLet G = (V, E) be a graph (either directed or undirected) with k label types L = {L1, ..., Lk}.\nMoreover, let $\\mathrm{lab}_{\\max} = \\max_{v \\in V,\\, i \\in \\{1, \\ldots, k\\}} |l_i(v)|$ be the maximum number of labels for any type\nand any vertex of the input graph, let $\\mathrm{deg}_{\\max} = \\max_{v \\in V} |N(v)|$ be the maximum degree of the input\ngraph, and let $\\tau(n)$ be the worst-case complexity of computing any of the functions $g_i$ and $\\tilde{g}_i$ on n\ninput vectors of size $d_i$. Now, the worst-case complexity of one learning iteration is\n\n$$O\\left(k\\,|V|\\,\\tau(\\mathrm{lab}_{\\max}\\,\\mathrm{deg}_{\\max})\\right).$$\n\nFor an input graph without attributes, that is, where the only label type represents node identities, the\nworst-case complexity of one learning iteration is $O(|V|\\,\\tau(\\mathrm{deg}_{\\max}))$. If, in addition, the complexity\nof the single reconstruction function is linear in the number of input vectors, the complexity is\n$O(|V|\\,\\mathrm{deg}_{\\max})$ and, hence, linear in both the number of nodes and the maximum degree of the input\ngraph. This is the case for most aggregation functions and, in particular, for the functions $\\tilde{g}_i$ and\n$g_i$ used in EP-B, the particular instance of the learning framework defined by the formulas (2), (3),\nand (4). Furthermore, the average complexity is linear in the average node degree of the input graph.\nThe worst-case complexity of EP can be limited by not exchanging messages from all neighbors but\nonly a sampled subset of size at most \u03ba. 
We explore different sampling scenarios in the experimental\nsection.\nIn general, the number of parameters and hyperparameters of the learning framework depends on\nthe parameters of the functions $g_i$ and $\\tilde{g}_i$, the loss functions, and the number of distinct labels of the\ninput graph. For graphs without attributes, the only parameters of EP-B are the embedding weights\nand the only hyperparameters are the size of the embedding d and the margin \u03b3. Hence, the number\nof parameters is d|V| and the number of hyperparameters is 2. Table 1 lists the parameter counts for\na set of state-of-the-art methods for learning embeddings for graphs without attributes.\n\n4.2 Comparison to Existing Models\n\nEP-B is related to locally linear embeddings (LLE) [38]. In LLE there is a single function $\\tilde{g}$ which\ncomputes a linear combination of the vertex embeddings. The weights of $\\tilde{g}$ are learned for each vertex in\na separate previous step. Hence, unlike in EP-B, $\\tilde{g}$ does not compute the unweighted average of the\ninput embeddings. Moreover, LLE does not learn embeddings for the labels (attribute values) but\ndirectly for vertices of the input graph. Finally, LLE is only feasible for graphs where each node\nhas at most a small constant number of neighbors. LLE imposes additional constraints to avoid\ndegenerate solutions to the objective and solves the resulting optimization problem in closed form.\nThis is not feasible for large graphs.\nIn several applications, the nodes of the graphs are associated with a set of words. For instance,\nin citation networks, the nodes which represent individual articles can be associated with a bag of\nwords. Every label corresponds to one of the words. Figure 1 illustrates a part of such a citation\nnetwork. 
In this context, EP-B\u2019s learning of word embeddings is related to the CBOW model [30].\nThe difference is that for EP-B the context of a word is determined by the neighborhood of the\nvertices it is associated with and it is the embedding of the word that is reconstructed and not its\none-hot encoding.\nFor graphs with several different edge types such as multi-relational graphs, the reconstruction\nfunctions $\\tilde{g}_i$ can be made dependent on the type of the edge. For instance, one could have, for any\nvertex v and label type i,\n\n$$\\tilde{h}_i(v) = \\frac{1}{|l_i(N(v))|} \\sum_{u \\in N(v)} \\sum_{\\ell \\in l_i(u)} \\left(\\boldsymbol{\\ell} + \\mathbf{r}_{(u,v)}\\right),$$\n\nwhere $\\mathbf{r}_{(u,v)}$ is the vector representation corresponding to the type of the edge (the relation) from\nvertex u to vertex v, and $h_i(v)$ could be the average embedding of v\u2019s node id labels. In combination\nwith the margin-based ranking loss (3), this is related to embedding models for multi-relational\ngraphs [32] such as TRANSE [5].\n\n5 Experiments\n\nThe objectives of the experiments are threefold. First, we compare EP-B to the state of the art on\nnode classification problems. Second, we visualize the learned representations. Third, we investigate\nthe impact of an upper bound on the number of neighbors that are sending messages.\nWe evaluate EP with the following six commonly used benchmark data sets. BlogCatalog [46] is a\ngraph representing the social relationships of the bloggers listed on the BlogCatalog website. The\nclass labels represent user interests. PPI [6] is a subgraph of the protein-protein interactions for\nHomo Sapiens. The class labels represent biological states. POS [28] is a co-occurrence network\nof words appearing in the first million bytes of the Wikipedia dump. The class labels represent\nthe Part-of-Speech (POS) tags. 
Cora, Citeseer and Pubmed [40] are citation networks where nodes\nrepresent documents with their corresponding bags of words and links represent citations. The class\nlabels represent the main topic of the document. Whereas BlogCatalog, PPI and POS are multi-label\nclassification problems, Cora, Citeseer and Pubmed have exactly one class label per node. Some\nstatistics of these data sets are summarized in Table 2.\n\n5.1 Set-up\n\nThe input to the node classification problem is a graph (with or without node attributes) where a\nfraction of the nodes is assigned a class label. The output is an assignment of class labels to the test\nnodes. Using the node classification data sets, we compare the performance of EP-B to the\nstate-of-the-art approaches DEEPWALK [34], LINE [41], NODE2VEC [15], PLANETOID [45], GCN [18],\nand also to the baselines WVRN [27] and MAJORITY. WVRN is a weighted relational classifier that\nestimates the class label of a node with a weighted mean of its neighbors\u2019 class labels. Since all the\ninput graphs are unweighted, WVRN assigns to a node v the class label that appears most frequently\nin v\u2019s neighborhood. MAJORITY always chooses the most frequent class labels in the training set.\nFor all data sets and all label types the functions fi are always linear embeddings equivalent to an\nembedding lookup table. The dimension of the embeddings is always fixed to 128. We used this\ndimension for all methods, which is in line with previous work such as DEEPWALK and NODE2VEC\nfor the data sets under consideration. For EP-B, we chose the margin \u03b3 in (3) from the set of values\n[1, 5, 10, 20] on validation data. For all approaches except LINE, we used the hyperparameter values\nreported in previous work since these values were tuned to the data sets. As LINE has not been applied\nto the data sets before, we set its number of samples to 20 million and negative samples to 5. 
This\nmeans that LINE is trained on (at least) an order of magnitude more examples than all other methods.\n\n6\n\n\fTable 3: Multi-label classi\ufb01cation results for BlogCatalog, POS and PPI in the transductive setting.\nThe upper and lower part list micro and macro F1 scores, respectively.\n\nTr [%]\nEP-B\nDEEPWALK\nNODE2VEC\nLINE\nWVRN\nMAJORITY\n\n10\n\n35.05 \u00b1 0.41\n34.48 \u00b1 0.40\n35.54 \u00b1 0.49\n34.83 \u00b1 0.39\n20.50 \u00b1 0.45\n16.51 \u00b1 0.53\n\nBlogCatalog\n\n50\n\n\u03b3 = 1\n\n39.44 \u00b1 0.29\n38.11 \u00b1 0.43\n39.31 \u00b1 0.25\n38.99 \u00b1 0.25\n30.24 \u00b1 0.96\n16.88 \u00b1 0.35\n\n90\n\n10\n\n40.41 \u00b1 1.59\n38.34 \u00b1 1.82\n40.03 \u00b1 1.22\n38.77 \u00b1 1.08\n33.47 \u00b1 1.50\n16.53 \u00b1 0.74\n\n46.97 \u00b1 0.36\n45.02 \u00b1 1.09\n44.66 \u00b1 0.92\n45.22 \u00b1 0.86\n26.07 \u00b1 4.35\n40.40 \u00b1 0.62\n\nPOS\n50\n\n\u03b3 = 10\n\n49.52 \u00b1 0.48\n49.10 \u00b1 0.52\n48.73 \u00b1 0.59\n51.64 \u00b1 0.65\n29.21 \u00b1 2.21\n40.47 \u00b1 0.51\n\n90\n\n10\n\n50.05 \u00b1 2.23\n49.33 \u00b1 2.39\n49.73 \u00b1 2.35\n52.28 \u00b1 1.87\n33.09 \u00b1 2.27\n40.10 \u00b1 2.57\n\n17.82 \u00b1 0.77\n17.14 \u00b1 0.89\n17.00 \u00b1 0.81\n16.55 \u00b1 1.50\n10.99 \u00b1 0.57\n6.15 \u00b1 0.40\n\nPPI\n50\n\n\u03b3 = 5\n\n23.30 \u00b1 0.37\n23.52 \u00b1 0.65\n23.31 \u00b1 0.62\n23.01 \u00b1 0.84\n18.14 \u00b1 0.60\n5.94 \u00b1 0.66\n\n90\n\n24.74 \u00b1 1.30\n25.02 \u00b1 1.38\n24.75 \u00b1 2.02\n25.28 \u00b1 1.68\n21.49 \u00b1 1.19\n5.66 \u00b1 0.92\n\nEP-B\nDEEPWALK\nNODE2VEC\nLINE\nWVRN\nMAJORITY\n\n19.08 +- 0.78\n18.16 \u00b1 0.44\n19.08 \u00b1 0.52\n18.13 \u00b1 0.33\n10.86 \u00b1 0.87\n2.51 \u00b1 0.09\n\n\u03b3 = 1\n\n25.11 \u00b1 0.43\n22.65 \u00b1 0.49\n23.97 \u00b1 0.58\n22.56 \u00b1 0.49\n17.46 \u00b1 0.74\n2.57 \u00b1 0.08\n\n20.36 \u00b1 1.42\n20.01 \u00b1 1.82\n19.66 \u00b1 2.34\n20.59 \u00b1 1.59\n17.50 \u00b1 1.42\n1.44 \u00b1 0.35\nTable 4: Multi-label classi\ufb01cation results for BlogCatalog, POS and PPI 
in the inductive setting for\nTr = 0.1. The upper and lower part of the table list micro and macro F1 scores, respectively.\n\n18.96 \u00b1 0.43\n18.73 \u00b1 0.59\n18.57 \u00b1 0.49\n18.06 \u00b1 0.81\n14.65 \u00b1 0.74\n1.51 \u00b1 0.27\n\n25.97 \u00b1 1.25\n22.86 \u00b1 1.03\n24.82 \u00b1 1.00\n23.00 \u00b1 0.92\n20.10 \u00b1 0.98\n2.53 \u00b1 0.31\n\n12.17 \u00b1 1.19\n12.23 \u00b1 1.38\n12.11 \u00b1 1.93\n12.40 \u00b1 1.18\n4.41 \u00b1 0.53\n3.36 \u00b1 0.44\n\n13.80 \u00b1 0.67\n13.01 \u00b1 0.90\n13,32 \u00b1 0.49\n12,79 \u00b1 0.48\n8.60 \u00b1 0.57\n1.58 \u00b1 0.25\n\n8.85 \u00b1 0.33\n8.20 \u00b1 0.27\n8.32 \u00b1 0.36\n8.49 \u00b1 0.41\n4.14 \u00b1 0.54\n3.38 \u00b1 0.13\n\n\u03b3 = 10\n\n10.45 \u00b1 0.69\n10.84 \u00b1 0.62\n11.07 \u00b1 0.60\n12.43 \u00b1 0.81\n4.42 \u00b1 0.35\n3.36 \u00b1 0.14\n\n\u03b3 = 5\n\nRemoved Nodes [%]\nEP-B\nDEEPWALK-I\nLINE-I\nWVRN\nMAJORITY\n\nEP-B\nDEEPWALK-I\nLINE-I\nWVRN\nMAJORITY\n\nBlogCatalog\n\n20\n\n40\n\n20\n\nPOS\n\n40\n\n20\n\nPPI\n\n40\n\n\u03b3 = 10\n\n29.22 \u00b1 0.95\n27.84 \u00b1 1.37\n19.15 \u00b1 1.30\n19.36 \u00b1 0.59\n16.84 \u00b1 0.68\n\n\u03b3 = 10\n\n12.12 \u00b1 0.75\n11.96 \u00b1 0.88\n6.64 \u00b1 0.49\n9.45 \u00b1 0.65\n2.50 \u00b1 0.18\n\n\u03b3 = 5\n\n27.30 \u00b1 1.33\n27.14 \u00b1 0.99\n19.96 \u00b1 2.44\n19.07 \u00b1 1.53\n16.81 \u00b1 0.55\n\n\u03b3 = 5\n\n11.24 \u00b1 0.89\n10.91 \u00b1 0.95\n6.54 \u00b1 1.87\n9.18 \u00b1 0.62\n2.59 \u00b1 0.19\n\n\u03b3 = 10\n\n43.23 \u00b1 1.44\n40.92 \u00b1 1.11\n40.34 \u00b1 1.72\n23.35 \u00b1 0.66\n40.43 \u00b1 0.86\n\n\u03b3 = 10\n\n42.12 \u00b1 0.78\n41.02 \u00b1 0.70\n40.08 \u00b1 1.64\n27.91 \u00b1 0.53\n40.59 \u00b1 0.55\n\n\u03b3 = 10\n5.47 \u00b1 0.80\n4.54 \u00b1 0.32\n4.67 \u00b1 0.46\n3.74 \u00b1 0.64\n3.35 \u00b1 0.24\n\n\u03b3 = 10\n5.16 \u00b1 0.49\n4.46 \u00b1 0.57\n4.24 \u00b1 0.52\n3.87 \u00b1 0.44\n3.27 \u00b1 0.15\n\n\u03b3 = 10\n\n16.63 \u00b1 0.98\n15.55 \u00b1 1.06\n14.89 \u00b1 1.16\n8.83 \u00b1 0.91\n6.09 \u00b1 
0.40\n\n\u03b3 = 10\n\n11.55 \u00b1 0.90\n10.52 \u00b1 0.56\n9.86 \u00b1 1.07\n6.90 \u00b1 1.02\n1.54 \u00b1 0.31\n\n\u03b3 = 10\n\n14.87 \u00b1 1.04\n13.99 \u00b1 1.18\n13.55 \u00b1 0.90\n9.41 \u00b1 0.94\n6.39 \u00b1 0.61\n\n\u03b3 = 10\n10.38 \u00b1 0.90\n9.69 \u00b1 1.14\n9.15 \u00b1 0.74\n6.81 \u00b1 0.89\n1.55 \u00b1 0.26\n\nWe did not simply copy results from previous work but used the authors\u2019 code to run all experiments\nagain. For DEEPWALK we used the implementation provided by the authors of NODE2VEC (setting\np = 1.0 and q = 1.0). We also used the other hyperparameter values for DEEPWALK reported in\nthe NODE2VEC paper to ensure a fair comparison. We did 10 runs for each method in each of the\nexperimental set-ups described in this section, and computed the mean and standard deviation of\nthe corresponding evaluation metrics. We use the same sets of training, validation and test data for\neach method. All methods were evaluated in the transductive and inductive settings. The transductive\nsetting is the setting where all nodes of the input graph are present during training. In the inductive\nsetting, a certain percentage of the nodes are not part of the graph during unsupervised learning.\nInstead, these removed nodes are added after the training has concluded. The results computed for\nthe nodes not present during unsupervised training reflect the methods\u2019 ability to incorporate newly\nadded nodes without retraining the model.\nFor the graphs without attributes (BlogCatalog, PPI and POS) we follow the exact same experimental\nprocedure as in previous work [42, 34, 15]. First, the node embeddings were computed in an\nunsupervised fashion. Second, we sampled a fraction Tr of nodes uniformly at random and used\ntheir embeddings and class labels as training data for a logistic regression classifier. The embeddings\nand class labels of the remaining nodes were used as test data. 
EP-B\u2019s margin hyperparameter \u03b3\nwas chosen by 3-fold cross validation for Tr = 0.1 once. The resulting margin \u03b3 was used for\nthe same data set and for all other values of Tr. For each method, we use 3-fold cross validation\nto determine the L2 regularization parameter for the logistic regression classifier from the values\n[0.01, 0.1, 0.5, 1, 5, 10]. We did this for each value of Tr and the F1 macro and F1 micro scores\nseparately. This proved to be important since the L2 regularization had a considerable impact on the\nperformance of the methods.\n\nTable 5: Classification accuracy for Cora, Citeseer, and Pubmed. (Left) The upper and lower parts of\nthe table list the results for the transductive and inductive settings, respectively. (Right) Results for the\ntransductive setting where the directionality of the edges is taken into account.\n\n(Left)\nMethod        Cora            Citeseer        Pubmed\nTransductive  \u03b3 = 20          \u03b3 = 10          \u03b3 = 1\nEP-B          78.05 \u00b1 1.49    71.01 \u00b1 1.35    79.56 \u00b1 2.10\nDW+BOW        76.15 \u00b1 2.06    61.87 \u00b1 2.30    77.82 \u00b1 2.19\nPLANETOID-T   71.90 \u00b1 5.33    58.58 \u00b1 6.35    74.49 \u00b1 4.95\nGCN           79.59 \u00b1 2.02    69.21 \u00b1 1.25    77.32 \u00b1 2.66\nDEEPWALK      71.11 \u00b1 2.70    47.60 \u00b1 2.34    73.49 \u00b1 3.00\nBOW FEAT      58.63 \u00b1 0.68    58.07 \u00b1 1.72    70.49 \u00b1 2.89\nInductive     \u03b3 = 5           \u03b3 = 5           \u03b3 = 1\nEP-B          73.09 \u00b1 1.75    68.61 \u00b1 1.69    79.94 \u00b1 2.30\nDW-I+BOW      68.35 \u00b1 1.70    59.47 \u00b1 2.48    74.87 \u00b1 1.23\nPLANETOID-I   64.80 \u00b1 3.70    61.97 \u00b1 3.82    75.73 \u00b1 4.21\nGCN-I         67.76 \u00b1 2.11    63.40 \u00b1 0.98    73.47 \u00b1 2.48\nBOW FEAT      58.63 \u00b1 0.68    58.07 \u00b1 1.72    70.49 \u00b1 2.89\n\n(Right)\nMethod        Cora            Citeseer        Pubmed\n              \u03b3 = 20          \u03b3 = 5           \u03b3 = 1\nEP-B          77.31 \u00b1 1.43    70.21 \u00b1 1.17    78.77 \u00b1 2.06\nDEEPWALK      14.82 \u00b1 2.15    15.79 \u00b1 3.58    32.82 \u00b1 2.12\n\nFor the graphs with attributes (Cora, Citeseer, Pubmed) we follow the same experimental procedure\nas in previous work [45]. We sample 20 nodes uniformly at random for each class as training data,\n1000 nodes as test data, and a different 1000 nodes as validation data. In the transductive setting,\nunsupervised training was performed on the entire graph. In the inductive setting, the 1000 test nodes\nwere removed from the graph before training. The hyperparameter values of GCN for these same\ndata sets in the transductive setting are reported in [18]; we used these values for both the transductive\nand inductive settings. For EP-B, LINE and DEEPWALK, the learned node embeddings for the 20\nnodes per class label were fed to a one-vs-rest logistic regression classifier with L2 regularization. We\nchose the best value for EP-B\u2019s margins and the L2 regularizer on the validation set from the values\n[0.01, 0.1, 0.5, 1, 5, 10]. The same was done for the baselines DW+BOW and BOW FEAT. Since\nPLANETOID jointly optimizes an unsupervised and supervised loss, we applied the learned models\ndirectly to classify the nodes. The authors of PLANETOID did not report the number of learning\niterations, so we ensured the training had converged. This was the case after 5000, 5000, and 20000\ntraining steps for Cora, Citeseer, and Pubmed, respectively. For EP-B we used ADAM [17] to learn\nthe parameters in a mini-batch setting with a learning rate of 0.001. A single learning epoch iterates\nthrough all nodes of the input graph and we fixed the number of epochs to 200 and the mini-batch size\nto 64. 
In all cases, the parameters were initialized following [14] and the learning always converged.\nEP was implemented with the Theano [4] wrapper Keras [9]. We used the logistic regression classi\ufb01er\nfrom LibLinear [10]. All experiments were run on commodity hardware with 128GB RAM, a single\n2.8 GHz CPU, and a TitanX GPU.\n\n5.2 Results\n\nThe results for BlogCatalog, POS and PPI in the transductive setting are listed in Table 3. The best\nresults are always indicated in bold. We observe that EP-B tends to have the best F1 scores, with\nthe additional aforementioned advantage of fewer parameters and hyperparameters to tune. Even\nthough we use the hyperparameter values reported in the NODE2VEC paper, we do not observe signi\ufb01cant\ndifferences between NODE2VEC and DEEPWALK. This is contrary to earlier \ufb01ndings [15]. We conjecture that validating the\nL2 regularization of the logistic regression classi\ufb01er is crucial and might not have been performed in\nsome earlier work. The F1 scores of EP-B, DEEPWALK, LINE, and NODE2VEC are signi\ufb01cantly\nhigher than those of the baselines WVRN and MAJORITY. The results for the same data sets in the\ninductive setting are listed in Table 4 for different percentages of nodes removed before unsupervised\ntraining. EP reconstructs label embeddings from the embeddings of labels of neighboring nodes.\nHence, with EP-B we can directly use the concatenation of the reconstructed embeddings $\\tilde{h}_i(v)$ as\nthe node embedding for each of the nodes v that were not part of the graph during training. For\nDEEPWALK and LINE we computed the embeddings of those nodes that were removed during\ntraining by averaging the embeddings of neighboring nodes; we indicate this by the suf\ufb01x I. EP-B\noutperforms all these methods in the inductive setting.\nThe results for the data sets Cora, Citeseer and Pubmed are listed in Table 5. Since these data sets\nhave bags of words associated with nodes, we include the baseline method DW+BOW. 
DW+BOW\nconcatenates the embedding of a node learned by DEEPWALK with a vector that encodes the bag of\nwords of the node. PLANETOID-T and PLANETOID-I are the transductive and inductive formulations\nof PLANETOID [45]. GCN-I is an inductive variant of GCN [18] where edges from training to test\nnodes are removed from the graph but those from test nodes to training nodes are not. Contrary\nto other methods, EP-B\u2019s F1 scores on the transductive and inductive setting are very similar,\ndemonstrating its suitability for the inductive setting. DEEPWALK cannot make use of the word\nlabels but we included it in the evaluation to investigate to what extent the word labels improve the\nperformance of the other methods. The baseline BOW FEAT trains a logistic regression classi\ufb01er\non the binary vectors encoding the bag of words of each node. EP-B signi\ufb01cantly outperforms all\nexisting approaches in both the transductive and inductive setting on all three data sets with one\nexception: for the transductive setting on Cora, GCN achieves a higher accuracy. Neither PLANETOID-T\nnor DW+BOW takes full advantage of the information given by the bag of words, since the\nencoding of the bag of words is only exposed to the respective models for nodes with class labels and,\ntherefore, only for a small fraction of nodes in the graph. This could also explain PLANETOID-T\u2019s\nhigh standard deviation since some nodes might be associated with words that occur in the test data\nbut which might not have been encountered during training. This would lead to misclassi\ufb01cations of\nthese nodes.\n\nFigure 4: (a) The plot visualizes embeddings for the Cora data set learned from node identity labels\nonly (left), word labels only (center), and from the combination of the two (right). The Silhouette\nscore is from left to right 0.008, 0.107 and 0.158. (b) Average batch loss vs. number of epochs for\ndifferent values of the parameter \u03ba for the BlogCatalog data set.\n\nFigure 4 (a) depicts a visualization of the learned embeddings for the Cora citation network, obtained by applying\nt-SNE [26] to the 128-dimensional embeddings generated by EP-B. Both qualitatively and quantitatively,\nas demonstrated by the Silhouette score [37], which measures clustering quality, it shows EP-B\u2019s\nability to learn and combine embeddings of several label types.\nUp until now, we did not take into account the direction of the edges, that is, we treated all graphs as\nundirected. Citation networks, however, are intrinsically directed. The right part of Table 5 shows\nthe performance of EP-B and DEEPWALK when the edge directions are considered. For EP this\nmeans label representations are only sent along the directed edges. For DEEPWALK this means\nthat the generated random walks are directed walks. While we observe a signi\ufb01cant performance\ndeterioration for DEEPWALK, the accuracy of EP-B does not change signi\ufb01cantly. This demonstrates\nthat EP is also applicable when edge directions are taken into account.\nFor densely connected graphs with a high average node degree, it is bene\ufb01cial to limit the number of\nneighbors that send label representations in each learning step. This can be accomplished by sampling\na subset of size at most \u03ba from the set of all neighbors and sending messages only from the sampled\nnodes. We evaluated the impact of this strategy by varying the parameter \u03ba in Figure 4 (b). The loss is\nsigni\ufb01cantly higher for smaller values of \u03ba. For \u03ba = 50, however, the average loss is almost identical\nto the case where all neighbors send messages, while the training time per epoch is reduced by an order\nof magnitude (from 20s per epoch to less than 1s per epoch).\n\n6 Conclusion and Future Work\n\nEmbedding Propagation (EP) is an unsupervised machine learning framework for graph-structured\ndata. 
It learns label and node representations by exchanging messages between nodes. It supports\narbitrary label types such as node identities, text, and movie genres, and generalizes several existing\napproaches to graph representation learning. We have shown that EP-B, a simple instance of EP,\nis competitive with and often outperforms state of the art methods while having fewer parameters\nand/or hyperparameters. We believe that EP\u2019s crucial advantage over existing methods is its ability to\nlearn label type representations and to combine these label type representations into a joint vertex\nembedding.\nDirections for future research include the combination of EP with multitask learning, that is, learning\nthe embeddings of labels and nodes guided by both an unsupervised loss and a supervised loss de\ufb01ned\nwith respect to different tasks; a variant of EP that incorporates image and sequence data; and the\nintegration of EP with an existing distributed graph processing framework. One might also want to\ninvestigate the application of the EP framework to multi-relational graphs.\n\nReferences\n[1] L. Akoglu, H. Tong, and D. Koutra. Graph based anomaly detection and description: a survey.\n\nData Mining and Knowledge Discovery, 29(3):626\u2013688, 2015.\n\n[2] L. B. Almeida. Arti\ufb01cial neural networks. chapter A Learning Rule for Asynchronous Percep-\n\ntrons with Feedback in a Combinatorial Environment, pages 102\u2013111. 1990.\n\n[3] M. Belkin and P. Niyogi. Laplacian eigenmaps and spectral techniques for embedding and\n\nclustering. In NIPS, volume 14, pages 585\u2013591, 2001.\n\n[4] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-\nFarley, and Y. Bengio. Theano: A cpu and gpu math compiler in python. In Proc. 9th Python in\nScience Conf, pages 1\u20137, 2010.\n\n[5] A. Bordes, N. Usunier, A. Garcia-Duran, J. 
Weston, and O. Yakhnenko. Translating embeddings\nfor modeling multi-relational data. In Advances in Neural Information Processing Systems,\npages 2787\u20132795, 2013.\n\n[6] B.-J. Breitkreutz, C. Stark, T. Reguly, L. Boucher, A. Breitkreutz, M. Livstone, R. Oughtred,\nD. H. Lackner, J. B\u00e4hler, V. Wood, et al. The biogrid interaction database: 2008 update. Nucleic\nacids research, 36(suppl 1):D637\u2013D640, 2008.\n\n[7] J. Bromley, I. Guyon, Y. Lecun, E. S\u00e4ckinger, and R. Shah. Signature veri\ufb01cation using a\n\n\"siamese\" time delay neural network. In Neural Information Processing Systems, 1994.\n\n[8] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection: A survey. ACM Comput. Surv.,\n\n41(3):15:1\u201315:58, 2009.\n\n[9] F. Chollet. Keras. URL http://keras.io, 2016.\n\n[10] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. Liblinear: A library for large\n\nlinear classi\ufb01cation. Journal of machine learning research, 9(Aug):1871\u20131874, 2008.\n\n[11] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, T. Mikolov, et al. Devise: A deep\nvisual-semantic embedding model. In Advances in neural information processing systems,\npages 2121\u20132129, 2013.\n\n[12] L. Getoor and B. Taskar. Introduction to Statistical Relational Learning (Adaptive Computation\n\nand Machine Learning). The MIT Press, 2007.\n\n[13] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl. Neural message passing for\n\nquantum chemistry. arXiv preprint arXiv:1704.01212, 2017.\n\n[14] X. Glorot and Y. Bengio. Understanding the dif\ufb01culty of training deep feedforward neural\n\nnetworks. In Aistats, volume 9, pages 249\u2013256, 2010.\n\n[15] A. Grover and J. Leskovec. Node2vec: Scalable feature learning for networks. In Proceedings of\nthe 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,\npages 855\u2013864, 2016.\n\n[16] K. Kersting, M. Mladenov, R. Garnett, and M. Grohe. 
Power iterated color re\ufb01nement. In\n\nTwenty-Eighth AAAI Conference on Arti\ufb01cial Intelligence, 2014.\n\n[17] D. Kingma and J. Ba. Adam: A method for stochastic optimization.\n\narXiv preprint arXiv:1412.6980, 2014.\n\n[18] T. N. Kipf and M. Welling. Semi-supervised classi\ufb01cation with graph convolutional networks.\n\narXiv preprint arXiv:1609.02907, 2016.\n\n[19] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classi\ufb01cation with deep convolutional\nneural networks. In Advances in neural information processing systems, pages 1097\u20131105,\n2012.\n\n[20] J. B. Kruskal and M. Wish. Multidimensional scaling. Sage Publications, Beverly Hills,\n\nCalifornia, 1978.\n\n[21] Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel. Gated graph sequence neural networks.\n\narXiv preprint arXiv:1511.05493, 2015.\n\n[22] D. Liben-Nowell and J. Kleinberg. The link-prediction problem for social networks. J. Am. Soc.\n\nInf. Sci. Technol., 58(7):1019\u20131031, 2007.\n\n[23] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. Graphlab:\nA new parallel framework for machine learning. In Conference on Uncertainty in Arti\ufb01cial\nIntelligence (UAI), July 2010.\n\n[24] L. L\u00fc and T. Zhou. Link prediction in complex networks: A survey. Physica A: Statistical\n\nMechanics and its Applications, 390(6):1150\u20131170, 2011.\n\n[25] U. Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395\u2013416, 2007.\n\n[26] L. v. d. Maaten and G. Hinton. Visualizing data using t-sne. Journal of Machine Learning\n\nResearch, 9(Nov):2579\u20132605, 2008.\n\n[27] S. A. Macskassy and F. Provost. A simple relational classi\ufb01er. Technical report, DTIC Document,\n\n2003.\n\n[28] M. Mahoney. Large text compression benchmark. URL: http://www.mattmahoney.net/text/text.html, 2009.\n\n[29] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. 
Czajkowski.\nPregel: A system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD\nInternational Conference on Management of Data, pages 135\u2013146, 2010.\n\n[30] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of\nwords and phrases and their compositionality. In Advances in neural information processing\nsystems, pages 3111\u20133119, 2013.\n\n[31] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng. Multimodal deep learning. In\nProceedings of the 28th international conference on machine learning, pages 689\u2013696, 2011.\n\n[32] M. Nickel, K. Murphy, V. Tresp, and E. Gabrilovich. A review of relational machine learning\n\nfor knowledge graphs. Proceedings of the IEEE, 104(1):11\u201333, 2016.\n\n[33] L. Page, S. Brin, R. Motwani, and T. Winograd. The pagerank citation ranking: bringing order\n\nto the web. 1999.\n\n[34] B. Perozzi, R. Al-Rfou, and S. Skiena. Deepwalk: Online learning of social representations. In\nProceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and\nData Mining, pages 701\u2013710, 2014.\n\n[35] F. J. Pineda. Generalization of back-propagation to recurrent neural networks. Phys. Rev. Lett.,\n\n59:2229\u20132232, 1987.\n\n[36] L. D. Raedt, K. Kersting, S. Natarajan, and D. Poole. Statistical relational arti\ufb01cial intelligence:\nLogic, probability, and computation. Synthesis Lectures on Arti\ufb01cial Intelligence and Machine\nLearning, 10(2):1\u2013189, 2016.\n\n[37] P. J. Rousseeuw. Silhouettes: a graphical aid to the interpretation and validation of cluster\n\nanalysis. Journal of computational and applied mathematics, 20:53\u201365, 1987.\n\n[38] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding.\n\nScience, 290(5500):2323\u20132326, 2000.\n\n[39] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. The graph neural\n\nnetwork model. 
IEEE Transactions on Neural Networks, 20(1):61\u201380, 2009.\n\n[40] P. Sen, G. M. Namata, M. Bilgic, L. Getoor, B. Gallagher, and T. Eliassi-Rad. Collective\n\nclassi\ufb01cation in network data. AI Magazine, 29(3):93\u2013106, 2008.\n\n[41] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei. Line: Large-scale information network\nembedding. In Proceedings of the 24th International Conference on World Wide Web, pages\n1067\u20131077. ACM, 2015.\n\n[42] L. Tang and H. Liu. Relational learning via latent social dimensions. In Proceedings of the\n15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages\n817\u2013826. ACM, 2009.\n\n[43] J. B. Tenenbaum, V. De Silva, and J. C. Langford. A global geometric framework for nonlinear\n\ndimensionality reduction. Science, 290(5500):2319\u20132323, 2000.\n\n[44] R. S. Xin, J. E. Gonzalez, M. J. Franklin, and I. Stoica. Graphx: A resilient distributed graph\nsystem on spark. In First International Workshop on Graph Data Management Experiences\nand Systems, pages 2:1\u20132:6, 2013.\n\n[45] Z. Yang, W. W. Cohen, and R. Salakhutdinov. Revisiting semi-supervised learning with graph\nembeddings. In Proceedings of the 33rd International Conference on Machine Learning, pages\n40\u201348, 2016.\n\n[46] R. Zafarani and H. Liu. Social computing data repository at asu. School of Computing,\n\nInformatics and Decision Systems Engineering, Arizona State University, 2009.\n\n[47] X. Zhu and Z. Ghahramani. Learning from labeled and unlabeled data with label propagation.\n\nTechnical Report CMU-CALD-02-107, 2002.\n", "award": [], "sourceid": 2655, "authors": [{"given_name": "Alberto", "family_name": "Garcia Duran", "institution": "NEC Europe"}, {"given_name": "Mathias", "family_name": "Niepert", "institution": "NEC Labs Europe"}]}