{"title": "Multi-relational Poincar\u00e9 Graph Embeddings", "book": "Advances in Neural Information Processing Systems", "page_first": 4463, "page_last": 4473, "abstract": "Hyperbolic embeddings have recently gained attention in machine learning due to their ability to represent hierarchical data more accurately and succinctly than their Euclidean analogues. However, multi-relational knowledge graphs often exhibit multiple simultaneous hierarchies, which current hyperbolic models do not capture. To address this, we propose a model that embeds multi-relational graph data in the Poincar\u00e9 ball model of hyperbolic space. Our Multi-Relational Poincar\u00e9 model (MuRP) learns relation-specific parameters to transform entity embeddings by M\u00f6bius matrix-vector multiplication and M\u00f6bius addition. Experiments on the hierarchical WN18RR knowledge graph show that our Poincar\u00e9 embeddings outperform their Euclidean counterpart and existing embedding methods on the link prediction task, particularly at lower dimensionality.", "full_text": "Multi-relational Poincar\u00e9 Graph Embeddings\n\nIvana Bala\u017eevi\u00b4c1\n\nCarl Allen1\n\nTimothy Hospedales1,2\n\n1 School of Informatics, University of Edinburgh, UK\n\n2 Samsung AI Centre, Cambridge, UK\n\n{ivana.balazevic, carl.allen, t.hospedales}@ed.ac.uk\n\nAbstract\n\nHyperbolic embeddings have recently gained attention in machine learning due\nto their ability to represent hierarchical data more accurately and succinctly than\ntheir Euclidean analogues. However, multi-relational knowledge graphs often\nexhibit multiple simultaneous hierarchies, which current hyperbolic models do not\ncapture. To address this, we propose a model that embeds multi-relational graph\ndata in the Poincar\u00e9 ball model of hyperbolic space. Our Multi-Relational Poincar\u00e9\nmodel (MuRP) learns relation-speci\ufb01c parameters to transform entity embeddings\nby M\u00f6bius matrix-vector multiplication and M\u00f6bius addition. 
Experiments on\nthe hierarchical WN18RR knowledge graph show that our Poincar\u00e9 embeddings\noutperform their Euclidean counterpart and existing embedding methods on the\nlink prediction task, particularly at lower dimensionality.\n\n1\n\nIntroduction\n\nHyperbolic space can be thought of as a continuous analogue of discrete trees, making it suitable for\nmodelling hierarchical data [28, 10]. Various types of hierarchical data have recently been embedded\nin hyperbolic space [25, 26, 16, 32], requiring relatively few dimensions and achieving promising\nresults on downstream tasks. This demonstrates the advantage of modelling tree-like structures in\nspaces with constant negative curvature (hyperbolic) over zero-curvature spaces (Euclidean).\nCertain data structures, such as knowledge graphs, often exhibit multiple hierarchies simultaneously.\nFor example, lion is near the top of the animal food chain but near the bottom in a tree of taxonomic\nmammal types [22]. Despite the widespread use of hyperbolic geometry in representation learning,\nthe only existing approach to embedding hierarchical multi-relational graph data in hyperbolic space\n[31] does not outperform Euclidean models. The dif\ufb01culty with representing multi-relational data in\nhyperbolic space lies in \ufb01nding a way to represent entities (nodes), shared across relations, such that\nthey form a different hierarchy under different relations, e.g. nodes near the root of the tree under one\nrelation may be leaf nodes under another. Further, many state-of-the-art approaches to modelling\nmulti-relational data, such as DistMult [37], ComplEx [34], and TuckER [2] (i.e. bilinear models),\nrely on inner product as a similarity measure and there is no clear correspondence to the Euclidean\ninner product in hyperbolic space [32] by which these models can be converted. 
Existing translational\napproaches that use Euclidean distance to measure similarity, such as TransE [6] and STransE [23],\ncan be converted to the hyperbolic domain, but do not currently compete with the bilinear models\nin terms of predictive performance. However, it has recently been shown in the closely related \ufb01eld\nof word embeddings [1] that the difference (i.e. relation) between word pairs that form analogies\nmanifests as a vector offset, suggesting a translational approach to modelling relations.\nIn this paper, we propose MuRP, a theoretically inspired method to embed hierarchical multi-relational\ndata in the Poincar\u00e9 ball model of hyperbolic space. By considering the surface area of a hypersphere\nof increasing radius centered at a particular point, Euclidean space can be seen to \u201cgrow\u201d polynomially,\nwhereas in hyperbolic space the equivalent growth is exponential [10]. Therefore, moving outwards\nfrom the root of a tree, there is more \u201croom\u201d to separate leaf nodes in hyperbolic space than in\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fEuclidean. MuRP learns relation-speci\ufb01c parameters that transform entity embeddings by M\u00f6bius\nmatrix-vector multiplication and M\u00f6bius addition [35]. The model outperforms not only its Euclidean\ncounterpart, but also current state-of-the-art models on the link prediction task on the hierarchical\nWN18RR dataset. We also show that our Poincar\u00e9 embeddings require far fewer dimensions than\nEuclidean embeddings to achieve comparable performance. 
We visualize the learned embeddings and analyze the properties of the Poincaré model compared to its Euclidean analogue, such as convergence rate, performance per relation, and influence of embedding dimensionality.

2 Background and preliminaries

Multi-relational link prediction A knowledge graph is a multi-relational graph representation of a collection F of facts in triple form (e_s, r, e_o) ∈ E × R × E, where E is the set of entities (nodes) and R is the set of binary relations (typed directed edges) between them. If (e_s, r, e_o) ∈ F, then subject entity e_s is related to object entity e_o by relation r. Knowledge graphs are often incomplete, so the aim of link prediction is to infer other true facts. Typically, a score function φ : E × R × E → R is learned, which assigns a score s = φ(e_s, r, e_o) to each triple, indicating the strength of prediction that a particular triple corresponds to a true fact. A non-linearity, such as the logistic sigmoid function, is often used to convert the score to a predicted probability p = σ(s) ∈ [0, 1] of the triple being true.

Knowledge graph relations exhibit multiple properties, such as symmetry, asymmetry, and transitivity. Certain knowledge graph relations, such as hypernym and has_part, induce a hierarchical structure over entities, suggesting that embedding them in hyperbolic rather than Euclidean space may lead to improved representations [28, 25, 26, 14, 32]. Based on this intuition, we focus on embedding multi-relational knowledge graph data in hyperbolic space.

(a) Poincaré disk geodesics. (b) Model decision boundary. (c) Spheres of influence.

Figure 1: (a) Geodesics in the Poincaré disk, indicating the shortest paths between pairs of points.
(b) The model predicts the triple (e_s, r, e_o) as true and (e_s, r, e′_o) as false. (c) Each entity embedding has a sphere of influence, whose radius is determined by the entity-specific bias.

Hyperbolic geometry of the Poincaré ball The Poincaré ball (B^d_c, g^B) of radius 1/√c, c > 0 is a d-dimensional manifold B^d_c = {x ∈ R^d : c‖x‖² < 1} equipped with the Riemannian metric g^B, which is conformal to the Euclidean metric g^E with conformal factor λ^c_x = 2/(1 − c‖x‖²), i.e. g^B = (λ^c_x)² g^E. The distance between two points x, y ∈ B^d_c is measured along a geodesic (i.e. the shortest path between the points, see Figure 1a) and is given by:

    d_B(x, y) = (2/√c) tanh⁻¹(√c ‖−x ⊕_c y‖),    (1)

where ‖·‖ denotes the Euclidean norm and ⊕_c represents Möbius addition [35]:

    x ⊕_c y = [(1 + 2c⟨x, y⟩ + c‖y‖²)x + (1 − c‖x‖²)y] / [1 + 2c⟨x, y⟩ + c²‖x‖²‖y‖²],    (2)

with ⟨·,·⟩ being the Euclidean inner product. Ganea et al. [13] show that Möbius matrix-vector multiplication can be obtained by projecting a point x ∈ B^d_c onto the tangent space at 0 ∈ B^d_c with the logarithmic map log^c_0(x), performing matrix multiplication by M ∈ R^{d×k} in the Euclidean tangent space, and projecting back to B^d_c via the exponential map at 0, i.e.:

    M ⊗_c x = exp^c_0(M log^c_0(x)).    (3)

For the definitions of the exponential and logarithmic maps, see Appendix A.

3 Related work

3.1 Hyperbolic geometry

Embedding hierarchical data in hyperbolic space has recently gained popularity in representation learning.
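For concreteness, the operations of Equations 1–3 can be sketched in NumPy. This is a minimal illustration with our own function names and curvature fixed at c = 1 by default; it is not the paper's released implementation:

```python
import numpy as np

def mobius_add(x, y, c=1.0):
    """Möbius addition x ⊕_c y in the Poincaré ball (Eq. 2)."""
    xy = np.dot(x, y)
    x2, y2 = np.dot(x, x), np.dot(y, y)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    den = 1 + 2 * c * xy + c ** 2 * x2 * y2
    return num / den

def poincare_dist(x, y, c=1.0):
    """Geodesic distance d_B(x, y) (Eq. 1)."""
    diff = mobius_add(-x, y, c)
    return (2 / np.sqrt(c)) * np.arctanh(np.sqrt(c) * np.linalg.norm(diff))

def exp0(v, c=1.0):
    """Exponential map at the origin of the Poincaré ball."""
    n = np.linalg.norm(v)
    if n == 0:
        return v
    return np.tanh(np.sqrt(c) * n) * v / (np.sqrt(c) * n)

def log0(x, c=1.0):
    """Logarithmic map at the origin (inverse of exp0)."""
    n = np.linalg.norm(x)
    if n == 0:
        return x
    return np.arctanh(np.sqrt(c) * n) * x / (np.sqrt(c) * n)

def mobius_matvec(M, x, c=1.0):
    """Möbius matrix-vector multiplication M ⊗_c x (Eq. 3)."""
    return exp0(M @ log0(x, c), c)
```

Here `exp0` and `log0` use the closed forms of the exponential and logarithmic maps at the origin given by Ganea et al. [13]; with the identity matrix, `mobius_matvec` returns its input, and `mobius_add(-x, x)` is the origin, so `poincare_dist(x, x) = 0`.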
Nickel and Kiela [25] first embedded the transitive closure¹ of the WordNet noun hierarchy in the Poincaré ball, showing that low-dimensional hyperbolic embeddings can significantly outperform higher-dimensional Euclidean embeddings in terms of both representation capacity and generalization ability. The same authors subsequently embedded hierarchical data in the Lorentz model of hyperbolic geometry [26].

Ganea et al. [13] introduced Hyperbolic Neural Networks, connecting hyperbolic geometry with deep learning. They build on the definitions for Möbius addition, Möbius scalar multiplication, and the exponential and logarithmic maps of Ungar [35] to derive expressions for linear layers, bias translation and application of non-linearity in the Poincaré ball. Hyperbolic analogues of several other algorithms have been developed since, such as Poincaré GloVe [32] and Hyperbolic Attention Networks [16]. More recently, Gu et al. [15] note that data can be non-uniformly hierarchical and learn embeddings on a product manifold with components of different curvature: spherical, hyperbolic and Euclidean. To our knowledge, only Riemannian TransE [31] seeks to embed multi-relational data in hyperbolic space, but the Riemannian translation method fails to outperform Euclidean baselines.

3.2 Link prediction for knowledge graphs

Bilinear models typically represent relations as linear transformations acting on entity vectors. An early model, RESCAL [24], optimizes a score function φ(e_s, r, e_o) = e_s^⊤ M_r e_o, containing the bilinear product between the subject entity embedding e_s, a full rank relation matrix M_r and the object entity embedding e_o. RESCAL is prone to overfitting due to the number of parameters per relation being quadratic relative to the number per entity.
DistMult [37] is a special case of\nRESCAL with diagonal relation matrices, reducing parameters per relation and controlling over\ufb01tting.\nHowever, due to its symmetry, DistMult cannot model asymmetric relations. ComplEx [34] extends\nDistMult to the complex domain, enabling asymmetry to be modelled. TuckER [2] performs a Tucker\ndecomposition of the tensor of triples, which enables multi-task learning between different relations\nvia the core tensor. The authors show each of the linear models above to be a special case of TuckER.\nTranslational models regard a relation as a translation (or vector offset) from the subject to the object\nentity embeddings. These models include TransE [6] and its many successors, e.g. FTransE [12],\nSTransE [23]. The score function for translational models typically considers Euclidean distance\nbetween the translated subject entity embedding and the object entity embedding.\n\n4 Multi-relational Poincar\u00e9 embeddings\n\nA set of entities can form different hierarchies under different relations. In the WordNet knowledge\ngraph [22], the hypernym, has_part and member_meronym relations each induce different hierarchies\nover the same set of entities. For example, the noun chair is a parent node to different chair types\n(e.g. folding_chair, armchair) under the relation hypernym and both chair and its types are parent\nnodes to parts of a typical chair (e.g. backrest, leg) under the relation has_part. An ideal embedding\nmodel should capture all hierarchies simultaneously.\nScore function As mentioned above, bilinear models measure similarity between the subject entity\nembedding (after relation-speci\ufb01c transformation) and an object entity embedding using the Euclidean\ninner product [24, 37, 34, 2]. However, a clear correspondence to the Euclidean inner product does\nnot exist in hyperbolic space [32]. The Euclidean inner product can be expressed as a function of\nEuclidean distance and norms, i.e. 
⟨x, y⟩ = ½(−d_E(x, y)² + ‖x‖² + ‖y‖²), where d_E(x, y) = ‖x − y‖. Noting this, in Poincaré GloVe, Tifrea et al. [32] absorb squared norms into biases b_x, b_y and replace the Euclidean distance with the Poincaré distance d_B(x, y) to obtain the hyperbolic version of GloVe [27].

Separately, it has recently been shown in the closely related field of word embeddings that statistics pertaining to analogies naturally contain linear structures [1], explaining why similar linear structure appears amongst word embeddings of word2vec [20, 21, 19].

¹Each node in a directed graph is connected not only to its children, but to every descendant, i.e. all nodes to which there exists a directed path from the starting node.

Analogies are word relationships of the form "w_a is to w*_a as w_b is to w*_b", such as "man is to woman as king is to queen", and are in principle not restricted to two pairs (e.g. "...as brother is to sister"). It can be seen that analogies have much in common with relations in multi-relational graphs, as a difference between pairs of words (or entities) common to all pairs, e.g. if (e_s, r, e_o) and (e′_s, r, e′_o) hold, then we could say "e_s is to e_o as e′_s is to e′_o". Of particular relevance is the demonstration that the common difference, i.e. relation, between the word pairs (e.g.
(man, woman) and (king, queen)) manifests as a common vector offset [1], justifying the previously heuristic translational approach to modelling relations.

Inspired by these two ideas, we define the basis score function for multi-relational graph embedding:

    φ(e_s, r, e_o) = −d(e_s^(r), e_o^(r))² + b_s + b_o = −d(Re_s, e_o + r)² + b_s + b_o,    (4)

where d(·, ·) is a distance function, e_s, e_o ∈ R^d are the embeddings and b_s, b_o ∈ R scalar biases of the subject and object entities e_s and e_o respectively. R ∈ R^{d×d} is a diagonal relation matrix and r ∈ R^d a translation vector (i.e. vector offset) of relation r. e_s^(r) = Re_s and e_o^(r) = e_o + r represent the subject and object entity embeddings after applying the respective relation-specific transformations, a stretch by R to e_s and a translation by r to e_o.

Hyperbolic model Taking the hyperbolic analogue of Equation 4, we define the score function for our Multi-Relational Poincaré (MuRP) model as:

    φ_MuRP(e_s, r, e_o) = −d_B(h_s^(r), h_o^(r))² + b_s + b_o = −d_B(exp^c_0(R log^c_0(h_s)), h_o ⊕_c r_h)² + b_s + b_o,    (5)

where h_s, h_o ∈ B^d_c are hyperbolic embeddings of the subject and object entities e_s and e_o respectively, and r_h ∈ B^d_c is a hyperbolic translation vector of relation r. The relation-adjusted subject entity embedding h_s^(r) ∈ B^d_c is obtained by Möbius matrix-vector multiplication: the original subject entity embedding h_s ∈ B^d_c is projected to the tangent space of the Poincaré ball at 0 with log^c_0, transformed by the diagonal relation matrix R ∈ R^{d×d}, and then projected back to the Poincaré ball by exp^c_0. The relation-adjusted object entity embedding h_o^(r) ∈ B^d_c is obtained by Möbius addition of the relation vector r_h ∈ B^d_c to the object entity embedding h_o ∈ B^d_c.
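To make the basis score function of Equation 4 concrete, here is a minimal NumPy sketch of its Euclidean instance (the form used by MuRE); variable names and values are illustrative, not from the paper's code:

```python
import numpy as np

def mure_score(e_s, e_o, b_s, b_o, R_diag, r):
    """Euclidean basis score (Equation 4):
    phi(e_s, r, e_o) = -d(R e_s, e_o + r)^2 + b_s + b_o,
    with a diagonal stretch R applied to the subject embedding
    and a translation r applied to the object embedding."""
    dist = np.linalg.norm(R_diag * e_s - (e_o + r))
    return -dist ** 2 + b_s + b_o

def predicted_probability(score):
    """Logistic sigmoid converts the score to a probability."""
    return 1.0 / (1.0 + np.exp(-score))
```

With the identity transformation (R = I, r = 0) and e_o = e_s, the distance term vanishes and the score reduces to b_s + b_o, so such a triple is predicted true whenever the biases sum to a positive value.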
Since the relation matrix R is diagonal, the number of parameters of MuRP increases linearly with the number of entities and relations, making it scalable to large knowledge graphs. To obtain the predicted probability of a fact being true, we apply the logistic sigmoid to the score, i.e. σ(φ_MuRP(e_s, r, e_o)).

To directly compare the properties of hyperbolic embeddings with the Euclidean, we implement the Euclidean version of Equation 4 with d(e_s^(r), e_o^(r)) = d_E(e_s^(r), e_o^(r)). We refer to this model as the Multi-Relational Euclidean (MuRE) model.

Geometric intuition We see from Equation 4 that the biases b_s, b_o determine the radius of a hypersphere decision boundary centered at e_s^(r). Entities e_s and e_o are predicted to be related by r if the relation-adjusted e_o^(r) falls within a hypersphere of radius √(b_s + b_o) (see Figure 1b). Since biases are subject and object entity-specific, each subject-object pair induces a different decision boundary. The relation-specific parameters R and r determine the position of the relation-adjusted embeddings, but the radius of the entity-specific decision boundary is independent of the relation. The score function in Equation 4 resembles the score functions of existing translational models [6, 12, 23], with the main difference being the entity-specific biases, which can be seen to change the geometry of the model. Rather than considering an entity as a point in space, each bias defines an entity-specific sphere of influence surrounding the center given by the embedding vector (see Figure 1c). The overlap between spheres measures relatedness between entities.
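The hypersphere reading can be verified directly from Equation 4: with distance d between the relation-adjusted embeddings, the predicted probability σ(φ) exceeds 0.5 exactly when d < √(b_s + b_o). A small check with made-up bias values:

```python
import math

b_s, b_o = 0.4, 0.6
radius = math.sqrt(b_s + b_o)            # radius of the decision hypersphere

for d in (0.9 * radius, 1.1 * radius):   # just inside / just outside
    phi = -d ** 2 + b_s + b_o            # score of Equation 4
    p = 1 / (1 + math.exp(-phi))         # predicted probability
    # the two conditions always agree: inside -> p > 0.5, outside -> p < 0.5
    print(d < radius, p > 0.5)
```

The agreement holds for any biases, since φ > 0 iff d² < b_s + b_o.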
We can thus think of each relation as moving the spheres of influence in space, so that only the spheres of subject and object entities that are connected under that relation overlap.

4.1 Training and Riemannian optimization

We use the standard data augmentation technique [11, 18, 2] of adding reciprocal relations for every triple, i.e. we add (e_o, r⁻¹, e_s) for every (e_s, r, e_o). To train both models, we generate k negative samples for each true triple (e_s, r, e_o), where we corrupt either the object (e_s, r, e′_o) or the subject (e_o, r⁻¹, e′_s) entity with a randomly chosen entity from the set of all entities E. Both models are trained to minimize the Bernoulli negative log-likelihood loss:

    L(y, p) = −(1/N) Σ_{i=1}^{N} (y^(i) log(p^(i)) + (1 − y^(i)) log(1 − p^(i))),    (6)

where p is the predicted probability, y is the binary label indicating whether a sample is positive or negative and N is the number of training samples.

For fairness of comparison, we optimize the Euclidean model using stochastic gradient descent (SGD) and the hyperbolic model using Riemannian stochastic gradient descent (RSGD) [5]. We note that the Riemannian equivalent of adaptive optimization methods has recently been developed [3], but leave replacing SGD and RSGD with their adaptive equivalents to future work. To compute the Riemannian gradient ∇_R L, the Euclidean gradient ∇_E L is multiplied by the inverse of the Poincaré metric tensor, i.e. ∇_R L = (1/(λ^c_θ)²) ∇_E L.
Instead of the Euclidean update step θ ← θ − η∇_E L, a first-order approximation of the true Riemannian update, we use exp^c_θ to project the gradient ∇_R L ∈ T_θ B^d_c onto its corresponding geodesic on the Poincaré ball and compute the Riemannian update θ ← exp^c_θ(−η∇_R L), where η denotes the learning rate.

5 Experiments

To evaluate both Poincaré and Euclidean models, we first test their performance on the knowledge graph link prediction task using the standard WN18RR and FB15k-237 datasets:

FB15k-237 [33] is a subset of Freebase [4], a collection of real world facts, created from FB15k [6] by removing the inverse of many relations from the validation and test sets to make the dataset more challenging. FB15k-237 contains 14,541 entities and 237 relations.

WN18RR [11] is a subset of WordNet [22], a hierarchical collection of relations between words, created in the same way as FB15k-237 from WN18 [6], containing 40,943 entities and 11 relations.

To demonstrate the usefulness of MuRP on hierarchical datasets (given WN18RR is hierarchical and FB15k-237 is not, see Section 5.3), we also perform experiments on NELL-995 [36], containing 75,492 entities and 200 relations, ∼22% of which are hierarchical. We create several subsets of the original dataset by varying the proportion of non-hierarchical relations, as described in Appendix B.

We evaluate each triple from the test set by generating n_e evaluation triples (where n_e denotes the number of entities in the dataset), created by combining the test entity-relation pair with all possible entities E. The scores obtained for each evaluation triple are ranked. All true triples are removed from the evaluation triples apart from the current test triple, i.e. the commonly used filtered setting [6].
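A minimal sketch of this filtered ranking protocol, together with the metrics reported in this section (mean reciprocal rank and hits@k); entity indices and scores are illustrative:

```python
import numpy as np

def filtered_rank(scores, test_idx, other_true):
    """Rank of the test entity among all candidates, with the other
    entities known to give true triples removed (the 'filtered'
    setting). Rank 1 means the test triple scored highest."""
    keep = np.ones(len(scores), dtype=bool)
    keep[list(other_true)] = False   # filter out other true answers
    keep[test_idx] = True            # always keep the test triple itself
    return int(np.sum(scores[keep] > scores[test_idx])) + 1

def mrr(ranks):
    """Mean reciprocal rank over all evaluation triples."""
    return float(np.mean([1.0 / r for r in ranks]))

def hits_at_k(ranks, k):
    """Fraction of test triples ranked in the top k."""
    return float(np.mean([r <= k for r in ranks]))
```

For example, with candidate scores [.1, .9, .3, .8, .2], test entity 3 and entity 1 a second known-true answer, the unfiltered rank would be 2, but the filtered rank is 1.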
We evaluate our models using the evaluation metrics standard across the link prediction literature: mean reciprocal rank (MRR) and hits@k, k ∈ {1, 3, 10}. Mean reciprocal rank is the average of the inverse of the rank assigned to the true triple over all n_e evaluation triples. Hits@k measures the percentage of times the true triple appears in the top k ranked evaluation triples.

5.1 Implementation details

We implement both models in PyTorch and make our code, as well as all the subsets of the NELL-995 dataset, publicly available.² We choose the learning rate from {1, 5, 10, 20, 50, 100} by MRR on the validation set and find that the best learning rate is 50 for WN18RR and 10 for FB15k-237 for both models. We initialize all embeddings near the origin where distances are small in hyperbolic space, similar to [25]. We set the batch size to 128 and the number of negative samples to 50. In all experiments, we set the curvature of MuRP to c = 1, since preliminary experiments showed that any material change reduced performance.

5.2 Link prediction results

Table 1 shows the results obtained for both datasets. As expected, MuRE performs slightly better on the non-hierarchical FB15k-237 dataset, whereas MuRP outperforms on WN18RR which contains

²https://github.com/ibalazevic/multirelational-poincare

Table 1: Link prediction results on WN18RR and FB15k-237. Best results in bold and underlined, second best in bold.
The RotatE [30] results are reported without their self-adversarial negative sampling (see Appendix H in the original paper) for fair comparison.

WN18RR:

    Model            MRR    Hits@10  Hits@3  Hits@1
    TransE [6]       .226   .501     −       −
    DistMult [37]    .430   .490     .440    .390
    ComplEx [34]     .440   .510     .460    .410
    Neural LP [38]   −      −        −       −
    MINERVA [9]      −      −        −       −
    ConvE [11]       .430   .520     .440    .400
    M-Walk [29]      .437   −        .445    .414
    TuckER [2]       .470   .526     .482    .443
    RotatE [30]      −      −        −       −
    MuRE d = 40      .459   .528     .474    .429
    MuRE d = 200     .475   .554     .487    .436
    MuRP d = 40      .477   .555     .489    .438
    MuRP d = 200     .481   .566     .495    .440

FB15k-237:

    Model            MRR    Hits@10  Hits@3  Hits@1
    TransE [6]       .294   .465     −       −
    DistMult [37]    .241   .419     .263    .155
    ComplEx [34]     .247   .428     .275    .158
    Neural LP [38]   .250   .408     −       −
    MINERVA [9]      −      .456     −       −
    ConvE [11]       .325   .501     .356    .237
    M-Walk [29]      −      −        −       −
    TuckER [2]       .358   .544     .394    .266
    RotatE [30]      .297   .480     .328    .205
    MuRE d = 40      .315   .493     .346    .227
    MuRE d = 200     .336   .521     .370    .245
    MuRP d = 40      .324   .506     .356    .235
    MuRP d = 200     .335   .518     .367    .243

hierarchical relations (as shown in Section 5.3). Both MuRE and MuRP outperform previous state-of-the-art models on WN18RR on all metrics apart from hits@1, where MuRP obtains the second best overall result. In fact, this is maintained even at relatively low embedding dimensionality (d = 40), demonstrating the ability of hyperbolic models to succinctly represent multiple hierarchies. On FB15k-237, MuRE is outperformed only by TuckER [2] (and similarly ComplEx-N3 [18], since Balažević et al. [2] note that the two models perform comparably), primarily due to multi-task learning across relations. This is highly advantageous on FB15k-237 due to its large number of relations compared to WN18RR, and thus relatively little data per relation in some cases.
As the first model to successfully represent multiple relations in hyperbolic space, MuRP does not yet incorporate multi-task learning, but we hope to address this in future work. Further experiments on NELL-995, which substantiate our claim on the advantage of embedding hierarchical multi-relational data in hyperbolic over Euclidean space, are presented in Appendix C.

5.3 MuRE vs MuRP

Effect of dimensionality We compare the MRR achieved by MuRE and MuRP on WN18RR for embeddings of different dimensionalities d ∈ {5, 10, 15, 20, 40, 100, 200}. As expected, the difference is greatest at lower embedding dimensionality (see Figure 2a).

Convergence rate Figure 2b shows the MRR per epoch for MuRE and MuRP on the WN18RR training and validation sets, showing that MuRP also converges faster.

(a) MRR per embedding dimensionality. (b) MRR convergence rate per epoch.

Figure 2: (a) MRR log-log graph for MuRE and MuRP for different embedding sizes on WN18RR. (b) Comparison of the MRR convergence rate for MuRE and MuRP on the WN18RR training (dashed line) and validation (solid line) sets with embeddings of size d = 40 and learning rate 50.

Model architecture ablation study Table 2 shows an ablation study of relation-specific transformations and bias choices. We note that any change to the current model architecture has a negative effect on performance of both MuRE and MuRP. Replacing biases by the (transformed) entity embedding norms leads to a significant reduction in performance of MuRP, in part because norms are constrained to [0, 1), whereas the biases they replace are unbounded.

Table 2: Ablation study of different model architecture choices on WN18RR: relational transformations (left) and biases (right).
Current model (top row) outperforms all others.

(a) Relational transformations.

    Distance function          MuRE MRR  MuRE H@1  MuRP MRR  MuRP H@1
    d(Re_s, e_o + r)           .459      .429      .477      .438
    d(e_s, e_o + r)            .340      .235      .307      .192
    d(Re_s, e_o)               .413      .381      .401      .363
    d(R_s e_s, R_o e_o + r)    .335      .299      .367      .341
    d(e_s + r, Re_o)           .442      .410      .454      .413

(b) Biases.

    Bias choice                MuRE MRR  MuRE H@1  MuRP MRR  MuRP H@1
    b_s & b_o                  .459      .429      .477      .438
    b_s only                   .455      .414      .463      .415
    b_o only                   .453      .412      .460      .409
    b_x = ‖e_x‖²               .414      .393      .414      .352
    b_x = ‖e_x^(r)‖²           .443      .404      .434      .372

Performance per relation Since not every relation in WN18RR induces a hierarchical structure over the entities, we report the Krackhardt hierarchy score (Khs) [17] of the entity graph formed by each relation to obtain a measure of the hierarchy induced. The score is defined only for directed networks and measures the proportion of node pairs (x, y) where there exists a directed path x → y, but not y → x (see Appendix D for further details). The score takes a value of one for all directed acyclic graphs, and zero for cycles and cliques. We also report the maximum and average shortest path between any two nodes in the graph for hierarchical relations. To gain insight as to which relations benefit most from embedding entities in hyperbolic space, we compare hits@10 per relation of MuRE and MuRP for entity embeddings of low dimensionality (d = 20). From Table 3 we see that both models achieve comparable performance on non-hierarchical, symmetric relations with the Krackhardt hierarchy score 0, such as verb_group, whereas MuRP generally outperforms MuRE on hierarchical relations.
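The Krackhardt hierarchy score used here can be sketched as follows; this is a minimal implementation under our reading of the definition above (reachability via a Floyd–Warshall transitive closure), not the paper's code:

```python
import numpy as np

def krackhardt_hierarchy_score(adj):
    """Proportion of ordered node pairs (x, y) with a directed path
    x -> y for which there is no return path y -> x."""
    n = len(adj)
    reach = np.array(adj, dtype=bool)
    for k in range(n):  # Floyd–Warshall style transitive closure
        reach |= np.outer(reach[:, k], reach[k, :])
    connected = asymmetric = 0
    for x in range(n):
        for y in range(n):
            if x != y and reach[x][y]:
                connected += 1
                if not reach[y][x]:
                    asymmetric += 1
    # convention: a graph with no directed paths scores 1 (vacuously a DAG)
    return asymmetric / connected if connected else 1.0

# A chain 0 -> 1 -> 2 is a perfect hierarchy; a 2-cycle is not.
chain = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
cycle = [[0, 1], [1, 0]]
```

As the definition implies, the chain scores 1.0 and the 2-cycle scores 0.0, matching the DAG/cycle extremes stated above.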
We also see that the difference between the performances of MuRE and MuRP is generally larger for relations that form deeper trees, fitting the hypothesis that hyperbolic space is of most benefit for modelling hierarchical relations.

Computing the Krackhardt hierarchy score for FB15k-237, we find that 80% of the relations have Khs = 1; however, the average of maximum path lengths over those relations is 1.14, with only 2.7% of relations having paths longer than 2, meaning that the vast majority of relational sub-graphs consist of directed edges between pairs of nodes, rather than trees.

Table 3: Comparison of hits@10 per relation for MuRE and MuRP on WN18RR for d = 20.

    Relation Name                MuRE  MuRP  Δ      Khs   Max Path  Avg Path
    also_see                     .634  .705  .071   0.24  44        15.2
    hypernym                     .161  .228  .067   0.99  18        4.5
    has_part                     .215  .282  .067   1     13        2.2
    member_meronym               .272  .346  .074   1     10        3.9
    synset_domain_topic_of       .316  .430  .114   0.99  3         1.1
    instance_hypernym            .488  .471  −.017  1     3         1.0
    member_of_domain_region      .308  .347  .039   1     2         1.0
    member_of_domain_usage       .396  .417  .021   1     2         1.0
    derivationally_related_form  .954  .967  .013   0.04  −         −
    similar_to                   1     1     0      0     −         −
    verb_group                   .974  .974  0      0     −         −

Biases vs embedding vector norms We plot the norms versus the biases b_s for MuRP and MuRE in Figure 3. This shows an overall correlation between embedding vector norm and bias (or radius of the sphere of influence) for both MuRE and MuRP. This makes sense intuitively, as the sphere of influence increases to "fill out the space" in regions that are less cluttered, i.e.
further from the origin.

Spatial layout Figure 4 shows a 40-dimensional subject embedding for the word asia and a random subset of 1500 object embeddings for the hierarchical WN18RR relation has_part, projected to 2 dimensions so that distances and angles of object entity embeddings relative to the subject entity embedding are preserved (see Appendix E for details on the projection method). We show subject and object entity embeddings before and after relation-specific transformation. For both MuRE and MuRP, we see that applying the relation-specific transformation separates true object entities from false ones. However, in the Poincaré model, where distances increase further from the origin, embeddings are moved further towards the boundary of the disk, where, loosely speaking, there is more space to separate and therefore distinguish them.

Figure 3: Scatter plot of norms vs biases for MuRP (left) and MuRE (right). Entities with larger embedding vector norms generally have larger biases for both MuRE and MuRP.

(a) MuRP (b) MuRE

Figure 4: Learned 40-dimensional MuRP and MuRE embeddings for WN18RR relation has_part, projected to 2 dimensions. Coloured markers indicate the subject entity embedding, true positive object entities predicted by the model, true negatives, false positives and false negatives. Lightly shaded blue and red points indicate object entity embeddings before applying the relation-specific transformation. The line in the left figure indicates the boundary of the Poincaré disk. The supposed false positives predicted by MuRP are actually true facts missing from the dataset (e.g. malaysia).

Analysis of wrong predictions Here we analyze the false positives and false negatives predicted by both models. MuRP predicts 15 false positives and 0 false negatives, whereas MuRE predicts only 2 false positives and 1 false negative, so seemingly performs better.
However, inspecting the alleged false positives predicted by MuRP, we find they are all countries on the Asian continent (e.g. sri_lanka, palestine, malaysia, sakartvelo, thailand), so are actually correct, but missing from the dataset. MuRE's predicted false positives (philippines and singapore) are both also correct but missing, whereas the false negative (bahrain) is indeed falsely predicted. We note that this suggests current evaluation methods may be unreliable.

6 Conclusion and future work

We introduce a novel, theoretically inspired, translational method for embedding multi-relational graph data in the Poincaré ball model of hyperbolic geometry. Our multi-relational Poincaré model MuRP learns relation-specific parameters to transform entity embeddings by Möbius matrix-vector multiplication and Möbius addition. We show that MuRP outperforms its Euclidean counterpart MuRE and existing models on the link prediction task on the hierarchical WN18RR knowledge graph dataset, and requires far lower dimensionality to achieve comparable performance to its Euclidean analogue. We analyze various properties of the Poincaré model compared to its Euclidean analogue and provide insight through a visualization of the learned embeddings.
Future work may include investigating the impact of recently introduced Riemannian adaptive optimization methods compared to Riemannian SGD.
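To make the optimization point concrete, the following is a minimal sketch of a single Riemannian SGD update on the Poincaré ball: the Euclidean gradient is rescaled by the inverse of the ball's conformal metric factor, and the updated point is retracted back inside the unit ball if the step overshoots the boundary. The function name, hyperparameters, and list-based representation are our own illustrative choices, not the authors' implementation.

```python
import math

def poincare_rsgd_step(theta, euc_grad, lr=0.01, eps=1e-5):
    """One (hypothetical) Riemannian SGD step on the Poincare ball.

    The Riemannian gradient rescales the Euclidean gradient by the
    inverse of the conformal metric factor, ((1 - ||theta||^2)^2) / 4.
    """
    sq_norm = sum(t * t for t in theta)
    scale = (1.0 - sq_norm) ** 2 / 4.0  # inverse metric factor
    new = [t - lr * scale * g for t, g in zip(theta, euc_grad)]
    # Retraction: if the step left the unit ball, pull the point back inside.
    norm = math.sqrt(sum(t * t for t in new))
    if norm >= 1.0:
        new = [t / norm * (1.0 - eps) for t in new]
    return new
```

Note how the metric factor shrinks updates near the boundary, where distances in the Poincaré ball grow large; adaptive Riemannian methods modify the step size on top of this rescaling.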
Also, given not all relations in a knowledge graph are hierarchical, we may look into combining the Euclidean and hyperbolic models to produce mixed-curvature embeddings that best fit the curvature of the data.

Acknowledgements

We thank Rik Sarkar, Ivan Titov, Jonathan Mallinson, Eryk Kopczyński and the anonymous reviewers for helpful comments. Ivana Balažević and Carl Allen were supported by the Centre for Doctoral Training in Data Science, funded by EPSRC (grant EP/L016427/1) and the University of Edinburgh.
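As a closing technical note, the Krackhardt hierarchy score (Khs) reported in Table 3 measures how tree-like a directed graph's reachability relation is: it is 1 when no two distinct nodes are mutually reachable (a pure hierarchy) and 0 when reachability is fully symmetric. A minimal pure-Python sketch under that standard definition (the function name and edge-list input format are our own illustrative choices):

```python
from collections import defaultdict, deque

def krackhardt_hierarchy_score(edges):
    """Khs = fraction of ordered reachable pairs (u, v), u != v,
    for which v cannot also reach u."""
    adj = defaultdict(set)
    nodes = set()
    for u, v in edges:
        adj[u].add(v)
        nodes.update((u, v))

    def reachable(src):
        # Breadth-first search for all nodes reachable from src.
        seen, queue = set(), deque([src])
        while queue:
            for y in adj[queue.popleft()]:
                if y not in seen:
                    seen.add(y)
                    queue.append(y)
        return seen

    reach = {u: reachable(u) for u in nodes}
    pairs = [(u, v) for u in nodes for v in reach[u] if v != u]
    if not pairs:
        return 0.0
    symmetric = sum(1 for u, v in pairs if u in reach[v])
    return 1.0 - symmetric / len(pairs)
```

For example, a chain a → b → c scores 1.0, while a two-node cycle a ↔ b scores 0.0, matching the intuition that high-Khs relations in Table 3 (e.g. hypernym) form hierarchies.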